ArcticDB is a modern database built for Pandas DataFrames. It can handle billions of rows at scale, which makes it a good fit for quantitative analysis, so I decided to give it a spin in my data scraping project.
The Serverless Framework has been my top choice for developing Lambda functions and deploying them to AWS. In this project, I wrote a scraping function that gets triggered every minute to scrape news and store it in ArcticDB.
First, I created an S3 bucket named devto-arctic, then connected to it locally from a Jupyter Notebook to set up the library. I opted for the AWS access key method to connect to the storage bucket.
# Jupyter Notebook
import pandas as pd
import arcticdb as adb
import os
import dotenv
dotenv.load_dotenv()
ac = adb.Arctic(f"s3://s3.us-east-2.amazonaws.com:devto-arctic?region=us-east-2&access={os.getenv('AWS_ACCESS_KEY_ID')}&secret={os.getenv('AWS_SECRET_ACCESS_KEY')}")
lib = ac.create_library('intro')  # create the library and keep a handle to it
ac.list_libraries()  # lists the libraries in the store
df = pd.DataFrame()
lib.write('news_frame', df)  # write an empty DataFrame to a new symbol
You will notice that new objects, prefixed with the library name, have been created inside your S3 bucket.
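To sanity-check the setup, you can list and read the symbol back from the same notebook session; a minimal sketch reusing the lib handle from above:
# Jupyter Notebook (verification)
print(ac.list_libraries())          # should include 'intro'
print(lib.list_symbols())           # should include 'news_frame'
print(lib.read('news_frame').data)  # returns the (currently empty) DataFrame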
Then we can set up the Lambda function with the Serverless Framework. After npm install serverless, we can initialize a Python project. Run serverless login to log in to your Serverless account before initialization, then execute serverless and choose a scheduled-task Python template as the starter.
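Once the project is generated, the every-minute trigger is configured in serverless.yml. Here is a minimal sketch of the relevant parts (the service and function names are illustrative, not necessarily the template's exact output):
# serverless.yml (excerpt)
service: devto-arctic-scraper     # illustrative name
provider:
  name: aws
  runtime: python3.9
  region: us-east-2
functions:
  rateHandler:
    handler: handler.run          # points at run() in handler.py
    events:
      - schedule: rate(1 minute)  # fire every minute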
The generated project folder contains all the necessary files. handler.py is where the function code lives; it connects to ArcticDB and performs the data reads and writes:
# handler.py
import datetime
import logging
import arcticdb as adb
import requests
import pandas as pd
import json
from dotenv import load_dotenv
import os
load_dotenv()
ac = adb.Arctic(f"s3://s3.us-east-2.amazonaws.com:devto-arctic?region=ap-southeast-1&access={os.environ['AWS_ACCESS_KEY_ENV']}&secret={os.environ['AWS_SECRET_ACCESS_KEY_ENV']}")
lib = ac.get_library('intro', create_if_missing=True)
ac.list_libraries()
ac.list_symbols() # symbols are equivalent to tables in a library
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
def fetch_news():
    url = "https://news.endpoint.com/api?limit=500"  # dummy endpoint
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise an exception for bad status codes
        return response.json()
    except requests.RequestException as e:
        logger.error(f"Error fetching news: {str(e)}")
        return None

def run(event, context):
    symbol = 'news_frame'
    current_time = datetime.datetime.now().timestamp() * 1000
    logger.info("Your cron function ran at " + str(datetime.datetime.now().time()))

    # Fetch news data
    news_data = fetch_news()
    if news_data is None:
        return {
            'statusCode': 500,
            'body': json.dumps('Failed to fetch news data')
        }

    df = pd.DataFrame([{
        'time': datetime.datetime.fromtimestamp(int(news['time'])/1000),  # Convert ms to datetime
        'title': str(news.get('title', '')),
        'source': str(news.get('source', '')),
        'news_id': str(news.get('news_id', '')),
        'url': str(news.get('url', '')),
        'icon': str(news.get('icon', '')),
        'image': str(news.get('image', ''))
    } for news in news_data])

    try:
        print(f"\nWriting DataFrame for {symbol}:")
        lib.append(symbol, df)  # use append so it doesn't overwrite old data
        print(f"Successfully wrote {symbol} to ArcticDB")
    except Exception as e:
        print(f"Error writing {symbol} to ArcticDB: {str(e)}")

    logger.info("Successfully processed news articles")
    return {
        'statusCode': 200,
        'body': json.dumps({
            'message': 'Successfully processed news data',
            'time': str(current_time)
        })
    }
Now we can deploy the Lambda function, but first make sure requirements.txt
has all the dependencies:
# requirements.txt
arcticdb; sys_platform != "darwin"
requests
pandas
numpy
python-dotenv
Note that we exclude arcticdb from the pip install on Mac because binary support for Mac machines was not yet ready at the time of writing. Running pip install locally could fail without the sys_platform != "darwin" marker. It is a workaround so that Macs skip installing arcticdb via pip; you don't need the marker on Windows or Linux.
If you are on a Mac and want to test the code locally, activate a Python virtual environment and use conda install -c conda-forge arcticdb to install arcticdb, then run serverless invoke local --function <your-function-name> to execute the function.
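Alternatively, since run() ignores its event and context arguments, you can exercise the handler directly from Python once arcticdb is installed (a quick illustrative sketch):
# local_test.py (illustrative)
from handler import run

# event and context are unused by run(), so None is fine for a local smoke test
print(run(None, None))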
In the project's package.json, I made sure the serverless-python-requirements plugin is included, so that during deployment the Python dependencies listed in requirements.txt are packaged as a Layer from which the Lambda function imports its modules.
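For reference, here is a sketch of the relevant package.json pieces; the exact serverless version is illustrative, and install-plugin / deploy are the scripts the CI workflow below calls:
{
  "scripts": {
    "install-plugin": "serverless plugin install -n serverless-python-requirements",
    "deploy": "serverless deploy"
  },
  "devDependencies": {
    "serverless": "^3"
  }
}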
Next, if you are on Windows or Linux, you can deploy straight from your local machine by running serverless deploy. Deploying from a Mac could fail because arcticdb would complain about not finding its binary distribution, as mentioned.
The workaround is to use cloud CI/CD to package and deploy the Lambda.
The install-plugin and deploy scripts in package.json will be used in CI/CD. In this case, let's use GitHub Actions as the deployment tool, with the deployment workflow as follows:
# deploy.yml
name: deploy serverless

on:
  push:
    branches:
      - main

jobs:
  deploy:
    name: deploy
    runs-on: ubuntu-latest
    environment: ${{ inputs.environment }}
    permissions:
      contents: read
      deployments: write
    strategy:
      matrix:
        node-version: [18.x]
        python-version: [3.9]
    steps:
      - uses: actions/checkout@v3
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
      - name: Use Node.js ${{ matrix.node-version }}
        uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node-version }}
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
          architecture: x64
      - run: npm ci --include=dev
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-2
      - name: Install Plugin and Deploy
        run: npm run install-plugin && npm run deploy
        env:
          SERVERLESS_ACCESS_KEY: ${{ secrets.SERVERLESS_ACCESS_KEY }}
The step that configures your AWS credentials allows Serverless to deploy into your AWS environment; make sure the IAM user behind that access key has admin-level permissions for Lambda and S3.
The above GitHub Action is triggered on every push to the main branch; you can configure the trigger however you prefer.
After the deployment, you can see that an EventBridge rule has been set up automatically as the scheduler, and a Layer has been uploaded and attached to the Lambda.
Hooray, there we have it: a serverless approach to scraping data and saving it into ArcticDB! You can then use a Jupyter Notebook to read and analyze the data locally while the Lambda does its thing in the background.
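For instance, here is a quick read-back from the notebook, reusing the connection set up earlier (the column names match what the handler writes):
# Jupyter Notebook (analysis)
news = lib.read('news_frame').data          # full history as a DataFrame
latest = news.sort_values('time').tail(20)  # 20 most recent headlines
print(latest[['time', 'title', 'source']])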