Kyle Foo

How to host ArcticDB on S3 and connect with Lambda

ArcticDB is a modern database built for Pandas DataFrames. It can handle billions of rows at scale, which makes it efficient for quantitative analysis, so I decided to give it a spin in my data scraping project.

The Serverless Framework has been my top choice for developing Lambda functions and deploying them to AWS. In this project, I wrote a scraping function that gets triggered every minute to scrape news and store it in ArcticDB.

First, I created an S3 bucket named devto-arctic, then connected to it locally from a Jupyter Notebook to set up the library. I opted to use an AWS access key to connect to the storage bucket.

# Jupyter Notebook
import pandas as pd
import arcticdb as adb
import os
import dotenv

dotenv.load_dotenv()

# connect to the S3 bucket using the access key and secret loaded from .env
ac = adb.Arctic(f"s3://s3.us-east-2.amazonaws.com:devto-arctic?region=us-east-2&access={os.getenv('AWS_ACCESS_KEY_ID')}&secret={os.getenv('AWS_SECRET_ACCESS_KEY')}")

lib = ac.create_library('intro')
ac.list_libraries() # output the list of libraries in the db
df = pd.DataFrame()
lib.write('news_frame', df) # write an empty df to a symbol (table)

You will notice that new objects, prefixed with the library name, have been created inside your S3 bucket.

Next, let's set up the Lambda function with the Serverless Framework. After npm install serverless, we can initialize a Python project. Run serverless login to log in to your Serverless account before initializing, then execute serverless and choose the scheduled-task Python template as a starter.

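The schedule itself lives in serverless.yml. Here is a minimal sketch of what the relevant parts could look like for this project; the service name is a placeholder, and the custom environment variable names match the handler shown below:

# serverless.yml (sketch, not the exact file generated by the template)
service: devto-arctic-scraper

provider:
  name: aws
  runtime: python3.9
  region: us-east-2
  environment:
    AWS_ACCESS_KEY_ENV: ${env:AWS_ACCESS_KEY_ENV}
    AWS_SECRET_ACCESS_KEY_ENV: ${env:AWS_SECRET_ACCESS_KEY_ENV}

functions:
  cron:
    handler: handler.run           # the run() function in handler.py
    events:
      - schedule: rate(1 minute)   # trigger every minute

plugins:
  - serverless-python-requirements

custom:
  pythonRequirements:
    layer: true   # package requirements.txt as a Lambda Layer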

Once initialized, you'll get a Python project folder with all the necessary files. handler.py is where your function code goes; it connects to ArcticDB and performs the data reads and writes.

# handler.py
import datetime
import logging
import arcticdb as adb
import requests
import pandas as pd
import json
from dotenv import load_dotenv
import os

load_dotenv()

# Lambda reserves AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY, so custom env var names are used here
ac = adb.Arctic(f"s3://s3.us-east-2.amazonaws.com:devto-arctic?region=us-east-2&access={os.environ['AWS_ACCESS_KEY_ENV']}&secret={os.environ['AWS_SECRET_ACCESS_KEY_ENV']}")
lib = ac.get_library('intro', create_if_missing=True)
ac.list_libraries()
lib.list_symbols() # symbols are equivalent to tables in a library

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

def fetch_news():
    url = "https://news.endpoint.com/api?limit=500" # dummy endpoint
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise an exception for bad status codes
        return response.json()
    except requests.RequestException as e:
        logger.error(f"Error fetching news: {str(e)}")
        return None

def run(event, context):
    symbol = 'news_frame'
    current_time = datetime.datetime.now().timestamp() * 1000
    logger.info("Your cron function ran at " + str(datetime.datetime.now().time()))

    # Fetch news data
    news_data = fetch_news()
    if news_data is None:
        return {
            'statusCode': 500,
            'body': json.dumps('Failed to fetch news data')
        }
    df = pd.DataFrame([{
        'time': datetime.datetime.fromtimestamp(int(news['time'])/1000),  # Convert ms to datetime
        'title': str(news.get('title', '')),
        'source': str(news.get('source', '')),
        'news_id': str(news.get('news_id', '')),
        'url': str(news.get('url', '')),
        'icon': str(news.get('icon', '')),
        'image': str(news.get('image', ''))
    } for news in news_data])

    try:
        print(f"\nWriting DataFrame for {symbol}:")
        lib.append(symbol, df)  # use append so it doesn't overwrite old data
        print(f"Successfully wrote {symbol} to ArcticDB")
    except Exception as e:
        print(f"Error writing {symbol} to ArcticDB: {str(e)}")

    logger.info(f"Successfully processed news articles")
    return {
        'statusCode': 200,
        'body': json.dumps({
            'message': 'Successfully processed news data',
            'time': str(current_time)
        })
    }


Now we can deploy the Lambda function, but first make sure requirements.txt has all the dependencies:

# requirements.txt
arcticdb; sys_platform != "darwin"
requests
pandas
numpy
python-dotenv

Note that we exclude arcticdb from the pip install on macOS because prebuilt binaries for Mac were not yet available at the time of writing. Running pip install locally on a Mac could fail without the sys_platform != "darwin" marker, which is a workaround that skips installing arcticdb via pip on Mac. You don't need the marker on Windows or Linux.

If you are on a Mac and want to test the code locally, activate a virtual Python environment, install arcticdb with conda install -c conda-forge arcticdb, and run serverless invoke local to execute the function, as sketched below.
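For example, a local test run might look like this; the environment name and the function name are placeholders (the function name must match the one defined in serverless.yml):

# local testing on macOS (sketch)
conda create -n devto-arctic python=3.9
conda activate devto-arctic

# arcticdb comes from conda-forge since pip wheels for Mac were not available
conda install -c conda-forge arcticdb
pip install requests pandas numpy python-dotenv

# invoke the handler locally without deploying
serverless invoke local --function cron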

See the ArcticDB repository on GitHub (this post uses ArcticDB 4.3.1).

In the project's package.json, I made sure the serverless-python-requirements plugin is included so that during deployment the Python dependencies in requirements.txt are packaged as a Layer, which the Lambda function imports its modules from.
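For reference, the relevant parts of package.json might look like the following sketch; only the plugin and the install-plugin/deploy script names come from this post, while the package name, version range, and exact commands are assumptions:

{
  "name": "devto-arctic-scraper",
  "devDependencies": {
    "serverless-python-requirements": "^6.0.0"
  },
  "scripts": {
    "install-plugin": "serverless plugin install -n serverless-python-requirements",
    "deploy": "serverless deploy"
  }
}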


Next, if you are on Windows or Linux, you can deploy straight from your machine by running serverless deploy. Deploying from a Mac could fail because, as mentioned, arcticdb cannot find a binary distribution for it.
The workaround is to package and deploy the Lambda from a cloud CI/CD pipeline.

The install-plugin and deploy scripts in package.json will be used in CI/CD. In this case, let's use GitHub Actions as the deployment tool, with the workflow defined as follows:

# deploy.yml
name: deploy serverless
on:
  push:
    branches:
      - main
jobs:
  deploy:
    name: deploy
    runs-on: ubuntu-latest
    environment: ${{ inputs.environment }}
    permissions:
      contents: read
      deployments: write
    strategy:
      matrix:
        node-version: [18.x]
        python-version: [3.9]
    steps:
      - uses: actions/checkout@v3
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
      - name: Use Node.js ${{ matrix.node-version }}
        uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node-version }}
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
          architecture: x64
      - run: npm ci --include=dev
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-2
      - name: Install Plugin and Deploy
        run: npm run install-plugin && npm run deploy
        env:
          SERVERLESS_ACCESS_KEY: ${{ secrets.SERVERLESS_ACCESS_KEY }}


The step that configures your AWS credentials allows Serverless to deploy into your AWS environment. Make sure the IAM user behind the access key has administrative permissions for Lambda and S3.

The GitHub Action above is triggered on every push to the main branch. You can configure the trigger however you prefer, for example a manual trigger, as sketched below.
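A manual workflow_dispatch trigger is one alternative, and it is also what would populate the inputs.environment reference in the job above; the environment names here are placeholders:

# deploy.yml (alternative trigger, sketch)
on:
  workflow_dispatch:
    inputs:
      environment:
        description: "Deployment environment"
        required: true
        type: choice
        options:
          - staging
          - production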

After the deployment, you can see that an EventBridge rule is automatically set up as the scheduler, and a Layer is uploaded and attached to the Lambda.

Hooray, there we go with a serverless approach to scraping data and saving it into ArcticDB! You can then use a Jupyter Notebook to read and analyze the data locally while the Lambda does its thing in the background.
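Reading the data back in the notebook is straightforward; a minimal sketch, reusing the Arctic connection from earlier (the groupby is just an example analysis):

# Jupyter Notebook: read the scraped news back for analysis
lib = ac.get_library('intro')
news = lib.read('news_frame').data  # returns a pandas DataFrame

# e.g. count articles per source
print(news.groupby('source').size().sort_values(ascending=False))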
