Introduction
Are you familiar with DuckDB? In short, it is a fast analytical database that, like SQLite, runs out of a single file. The following blog post should convey just how powerful it is.
https://www.linkedin.com/pulse/analysing-aws-application-load-balancer-logs-duckdb-unleashing/
With AWS CLI credentials configured, you can simply start DuckDB, load files on S3 directly, and run all sorts of analyses with very simple queries.
SELECT * FROM read_csv_auto('s3://your-bucket-name/your-file.csv');
In real-world use, however, there are plenty of situations where issuing AWS credentials to analysts is not ideal:
- You want to delegate log analysis to operators, but issuing CLI credentials for them involves a cumbersome approval process.
- You want to limit or update the target buckets for analysis, but adjusting IAM policies every time is a hassle.
With this in mind, I figured it would be convenient to run DuckDB on the backend and let users execute SQL freely from a frontend with an authentication layer. To that end, I published a CDK construct, cloud-duck.
https://constructs.dev/packages/cloud-duck
In this article, I would like to introduce its features, usage, and the trial-and-error process during development.
Features
cloud-duck offers the following features:
- Provides an environment for easily analyzing diverse data on S3.
- Access is restricted to authenticated users only.
- Allows each user to save analysis results independently.
- Extremely easy to deploy using CDK.
Architecture
- The frontend is served via S3 + CloudFront.
- Users are managed using Cognito User Pool.
- API calls are restricted to authenticated users only.
- DuckDB runs on Lambda, with read permissions granted to access S3 within the same account.
- DuckDB files are persistently saved to S3 whenever a CREATE TABLE operation is performed.
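As a rough illustration of the S3 read permission (this is not the construct's actual code; duckdbHandler stands in for the DuckDB Lambda function):

import * as iam from 'aws-cdk-lib/aws-iam';

// Illustrative only: allow the DuckDB Lambda to read any S3 object in the account.
// The real cloud-duck policy may be scoped differently.
duckdbHandler.addToRolePolicy(new iam.PolicyStatement({
  actions: ['s3:GetObject', 's3:ListBucket'],
  resources: ['*'],
}));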
Example Usage
Upon logging in, you will see an SQL input form and a result output form.
Let’s immediately load s3://target-bucket/testdata.csv and extract some data.
You can also save the loaded results as a table using CREATE TABLE {table_name} AS SELECT ....
Let’s run a query on the test_data table we just created.
The query successfully runs without directly accessing the original data stored in S3.
Data Storage Per User
Query results are stored independently for each logged-in user. This means you can create tables freely without worrying about other users.
Costs
The primary cost comes from the compute usage of the Lambda function running DuckDB. By default, a 1GB memory x86 Lambda function is used. Assuming an average query takes 10 seconds and 100 queries are executed daily, the cost is approximately $0.5 per month.
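As a rough sanity check, assuming the standard x86 Lambda price of about $0.0000167 per GB-second: 1 GB × 10 s × 100 queries × 30 days = 30,000 GB-seconds, i.e. roughly $0.50 per month (request charges at this volume are negligible).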
Additional costs include S3 storage for large tables generated by DuckDB. However, unless used heavily, the monthly cost should remain within a few dollars.
The serverless approach truly shines here.
Setup Instructions
Now, let’s walk through how to deploy cloud-duck using AWS CDK.
Deployment
Since cloud-duck is provided as a CDK construct, it must be deployed using AWS CDK.
For setting up CDK itself, refer to the official documentation. Following the steps up to STEP 5 is sufficient.
Once your CDK application is ready, install cloud-duck and update the Stack definition as follows.
npm install cloud-duck
import * as cdk from 'aws-cdk-lib';
import { CloudDuck } from 'cloud-duck';
import { Construct } from 'constructs';

export class CloudDuckStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Create a CloudDuck instance
    new CloudDuck(this, 'CloudDuck');
  }
}
Let’s proceed with the deployment.
$ npx cdk deploy
✨ Synthesis time: 5.39s
CloudDuckStack: start: Building 6b82c08c411ad583faa859a28107837b81c3dc67035e3a7bebd7f45fc243e8f0:current_account-current_region
...
IAM Statement Changes
// continue by entering "y"
Do you wish to deploy these changes (y/n)? y
CloudDuckStack: deploying... [1/1]
CloudDuckStack: creating CloudFormation changeset...
✅ CloudDuckStack
✨ Deployment time: 646.04s
Outputs:
CloudDuckStack.CloudDuckApiEndpoint38BFF0BB = https://nm5qodr9jl.execute-api.us-east-1.amazonaws.com/api/
// Cognito User Pool ID
CloudDuckStack.CloudDuckCognitoUserPoolId2D258434 = us-east-1_D8Wk6DLME
// Frontend URL
CloudDuckStack.CloudDuckDistributionUrl84FC8296 = https://d28c2qhdxa1so0.cloudfront.net
Stack ARN:
arn:aws:cloudformation:us-east-1:123456789012:stack/CloudDuckStack/75019580-c39b-11ef-b8b8-12856c43826d
✨ Total time: 651.44s
The deployment takes roughly 10 minutes in total.
Add Users to Cognito User Pool
Add users to the Cognito User Pool who should be granted access. Use the User Pool ID that was output during the deployment process and pass it as the --user-pool-id option.
aws cognito-idp admin-create-user \
--user-pool-id {user-pool-id} \
--username "naonao@example.com" \
--user-attributes Name=email,Value="naonao@example.com" Name=email_verified,Value=true \
--message-action SUPPRESS \
--temporary-password Password1!
Access the Frontend
Access the CloudFront URL displayed during deployment in your browser.
You will be directed to the sign-in screen. Log in using the temporary password that was set when adding the user.
Once successful, you will be redirected to a password change screen. Enter a new password to proceed.
You will then be redirected to the query execution screen.
Note
As of version 0.0.13, the Submit Query button may not respond immediately after logging in. A simple page reload resolves the issue, so please try that. I plan to fix this in a future update.
Development Trials and Errors
Now, I'd like to share some of the trial and error I encountered during development.
Running DuckDB in the Browser
One of DuckDB's key attractions is its fast query execution once the initial data load is done. Initially, I thought running DuckDB in the browser would be ideal for making the most of this. However, regular DuckDB is built as a binary tied to the CPU architecture, so running it in the browser requires the WebAssembly build, duckdb-wasm.
Of course, there were pioneers before me, and this wonderful Japanese blog was incredibly helpful in guiding me.
https://zenn.dev/shiguredo/articles/duckdb-wasm-s3-parquet-opfs
I also tried creating a Cognito Identity Pool to issue temporary AWS credentials to the frontend, saving the S3 data to OPFS, and querying it with duckdb-wasm.
However, duckdb-wasm turned out to be more limited than mainline DuckDB; in particular, the AWS extension cannot be used, which makes querying S3 directly difficult. What captivated me about DuckDB in the first place was exactly the kind of ALB log analysis shown below, where the AWS extension shines. Writing convoluted data-retrieval queries would undermine DuckDB's elegance, so I decided to switch to running DuckDB on the backend.
CREATE TABLE alb_log_202411 AS
SELECT *
-- Just with this, it automatically consolidates a bunch of gzip-compressed log files into a DuckDB table.
FROM read_csv(
  's3://[YOUR_S3_BUCKET_NAME]/AWSLogs/[YOUR_ACCOUNT_ID]/elasticloadbalancing/[YOUR_REGION]/2024/11/**/*.log.gz',
  columns={
    'type': 'VARCHAR',
    ...
Passing deploy-time values to the frontend after deployment
The frontend operates as a Single Page Application (SPA), but to integrate authentication and make API calls, the following information needs to be configured:
- Cognito User Pool ID
- Cognito Application Client ID
- API Endpoint
However, these values are finalized only after the deployment to each environment, so they cannot be hardcoded into the frontend code in advance. There are two ways to resolve this:
- Output the deploy-time values as a JSON file via S3 Bucket Deployment, and dynamically import them in the frontend.
- Build the frontend after deployment is complete.
Either method works with a single deployment, but this time I chose the latter so that the frontend code itself wouldn't need to change.
Both approaches are trivial to implement thanks to CDK; without it, the process would be quite cumbersome, which makes me appreciate CDK all the more.
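For reference, the first approach can be sketched with aws-s3-deployment's Source.jsonData. This is only an illustration: frontendBucket, userPool, userPoolClient, and api are placeholder variables assumed to exist in the stack.

import * as s3deploy from 'aws-cdk-lib/aws-s3-deployment';

// Sketch of option 1: write the deploy-time values as config.json into the
// frontend bucket so the SPA can fetch them at runtime.
new s3deploy.BucketDeployment(this, 'DeployConfig', {
  destinationBucket: frontendBucket,
  sources: [
    s3deploy.Source.jsonData('config.json', {
      userPoolId: userPool.userPoolId,
      userPoolClientId: userPoolClient.userPoolClientId,
      apiEndpoint: api.url,
    }),
  ],
  prune: false, // don't wipe the already-deployed frontend assets
});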
Persistence of DuckDB Files
By default, DuckDB stores data in memory, allowing for fast performance. However, it is also possible to persist the data to a file.
const duckdb = require('duckdb');
// in memory
const db = new duckdb.Database(':memory:');
// persist data to a file
const persistedDb = new duckdb.Database('/tmp/db.duckdb');
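Queries then run through a connection. A minimal sketch with the Node.js duckdb client:

const conn = persistedDb.connect();
// all() executes the SQL and passes every result row to the callback
conn.all('SELECT 42 AS answer', (err, rows) => {
  if (err) throw err;
  console.log(rows); // [ { answer: 42 } ]
});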
Since the application runs on Lambda, persisting the data to a file is essential. However, files written to Lambda's ephemeral storage (/tmp) are lost once execution ends, so the database file needs to be saved to S3 every time.
So, I initially created Lambda code with the following steps:
- Download the DuckDB file from S3 (if it exists).
- Open a connection, execute the query, and close the connection.
- Upload the DuckDB file back to S3.
However, when I actually ran it, for some reason, the result of the CREATE TABLE was not saved.
$ CREATE TABLE test_table AS SELECT ...
// Successfully completed, showing the row count of the created table
{
count: 1234
}
$ SELECT * FROM test_table;
// The test_table I created does not exist.
test_table does not exist.
The cause was that the intermediate .wal file, which DuckDB generates automatically, also needed to be saved. The .wal file is often not deleted when connection.close() is called, and in those cases a new connection has to be opened while both the DuckDB file and the .wal file are present at the same time.
In the end, the Lambda function runs through the following steps (a sketch follows the list):
- (If present) Download the DuckDB file and the intermediate file.
- Generate connection, execute query, and close connection.
- Sync the DuckDB file and the intermediate file between Lambda and S3:
- If an intermediate file existed after the previous query but not after the current one, delete it from S3 as well.
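Here is a minimal sketch of these steps, assuming the AWS SDK v3 S3 client and placeholder names (BUCKET, userId, db.duckdb); the actual cloud-duck handler differs in its details.

import * as fs from 'fs';
import * as duckdb from 'duckdb';
import { S3Client, GetObjectCommand, PutObjectCommand, DeleteObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({});
const BUCKET = process.env.BUCKET!; // hypothetical bucket holding per-user DuckDB files

// Download one S3 object to a local path; returns false if it doesn't exist.
async function download(key: string, path: string): Promise<boolean> {
  try {
    const res = await s3.send(new GetObjectCommand({ Bucket: BUCKET, Key: key }));
    fs.writeFileSync(path, Buffer.from(await res.Body!.transformToByteArray()));
    return true;
  } catch {
    return false;
  }
}

export async function runQuery(userId: string, sql: string): Promise<unknown[]> {
  const dbPath = '/tmp/db.duckdb';
  const walPath = '/tmp/db.duckdb.wal';
  // 1. (If present) download the DuckDB file and the intermediate .wal file
  await download(`${userId}/db.duckdb`, dbPath);
  const hadWal = await download(`${userId}/db.duckdb.wal`, walPath);

  // 2. Open a connection, run the query, close the connection
  const db = new duckdb.Database(dbPath);
  const conn = db.connect();
  const rows = await new Promise<unknown[]>((resolve, reject) =>
    conn.all(sql, (err, res) => (err ? reject(err) : resolve(res)))
  );
  conn.close();

  // 3. Sync the DuckDB file and the .wal file back to S3
  await s3.send(new PutObjectCommand({ Bucket: BUCKET, Key: `${userId}/db.duckdb`, Body: fs.readFileSync(dbPath) }));
  if (fs.existsSync(walPath)) {
    await s3.send(new PutObjectCommand({ Bucket: BUCKET, Key: `${userId}/db.duckdb.wal`, Body: fs.readFileSync(walPath) }));
  } else if (hadWal) {
    // A .wal existed after the previous query but not this one: remove it from S3 too.
    await s3.send(new DeleteObjectCommand({ Bucket: BUCKET, Key: `${userId}/db.duckdb.wal` }));
  }
  return rows;
}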
This part is a bit cumbersome, and I’m considering whether mounting EFS to store the files might have been a better approach.
Including Lambda and Frontend Code in CDK Constructs
Deploying Lambda Code
When NodejsFunction is used inside a construct, bundling runs on the consumer's machine at synth time. To spare users that unnecessary processing, I decided to build the Lambda code during the release process and point the lambda.Function class at the pre-built JS files.
As detailed in the presentation mentioned earlier, the Lambda code is built with prependExec in .projenrc.ts. Once built, I simply specify the pre-built .js files in the lambda.Function constructor.
// generate .js files at `lambda/duckdb/build`
project.projectBuild.compileTask.prependExec('npm ci && npm run build', {
cwd: 'lambda/duckdb',
});
const duckdbHandler = new lambda.Function(this, 'DuckDbHandler', {
runtime: lambda.Runtime.NODEJS_20_X,
// specify lambda/duckdb/build/index.js
code: lambda.Code.fromAsset(path.join(__dirname, '../../../lambda/duckdb/build')),
handler: 'index.handler',
});
Folder Structure
The CDK construct itself is created using projen's AwsCdkConstruct. It provides a convenient template, including GitHub Actions for publishing the npm package, which has been very helpful.
Initially, I placed the frontend and lambda directories under the src directory (for some reason), but the published npm package ended up being an empty shell, containing none of the frontend or lambda code. It seems that projen has some filtering logic that excludes these files.
Ultimately, organizing the project with /frontend, /lambda, and /src directories at the root made everything work properly.
Bad Example
.
└── src
├── bin
├── frontend
├── lambda
└── lib
Good Example
.
├── frontend
├── lambda
└── src
├── bin
└── lib
Finally
This concludes the introduction of a CDK construct that allows you to quickly leverage DuckDB. I hadn’t seen a CDK construct that includes frontend code before, so I believe this could be a valuable contribution.
I would be very happy if this becomes a starting point for those who haven't used CDK yet! Bug reports, feature requests, and pointers to rough spots in the implementation are all welcome.