Daniel Bayerlein

Building robust and scalable serverless event tracking and analytics with AWS

Sometimes, it becomes necessary to develop a custom analytics tracking solution instead of relying on services like Google Analytics. There are several reasons for this. Firstly, a customized solution allows for precise alignment with data privacy laws and regulations, as well as specific compliance requirements. Additionally, it grants full control over the collected data, which is crucial when dealing with sensitive information.


My code examples are written in TypeScript. I use the AWS Cloud Development Kit (CDK), which allows you to define your cloud infrastructure as code in any of the supported programming languages.


Challenge

There are a few important points to consider when designing a user tracking solution. One of them is the large volume of data generated by user interactions. The endpoint that receives this event data must be able to process thousands of data points within seconds.

The volume of data is also a challenge for the analysis. The solution therefore needs to be able to process thousands of user events quickly and efficiently.

Let's take a look at the architecture and the services that are used.

Architecture


The client application uses the Amplify Analytics library. It provides a ready-to-use implementation that also takes care of batching and authentication via Amazon Cognito.

Amplify Analytics uses Amazon Cognito Identity Pools to identify users, which makes it possible to receive data from both authenticated and unauthenticated users in the app.

The Amazon Data Firehose analytics provider allows you to send analytics data to an Amazon Data Firehose stream for reliable storage.

Amazon Data Firehose sends the streamed data to an Amazon S3 bucket. By default, the data is batched and stored there as S3 objects.

The Amazon Athena service makes it easy to analyze data in Amazon S3 using standard SQL. Athena uses the AWS Glue Data Catalog to store and retrieve table metadata for the Amazon S3 data. The table metadata lets the Athena query engine know how to find, read, and process the data that you want to query.

Infrastructure

Every user can receive temporary permissions that allow them to write data directly to Data Firehose. This can be done with a Cognito Identity Pool. For more details about authenticated and unauthenticated users, please have a look at the Cognito documentation.

This example uses an existing Cognito UserPool and UserPoolClient as a provider for authenticated users.

import {
  IdentityPool,
  UserPoolAuthenticationProvider,
} from '@aws-cdk/aws-cognito-identitypool-alpha';

const identityPool = new IdentityPool(this, 'AnalyticsIdentityPool', {
  authenticationProviders: {
    userPools: [
      new UserPoolAuthenticationProvider({
        userPool,
        userPoolClient,
      }),
    ],
  },
});

Data Firehose requires an Amazon S3 bucket to store the data.

import { aws_s3 } from 'aws-cdk-lib';

const bucket = new aws_s3.Bucket(this, 'AnalyticsBucket', {
  blockPublicAccess: aws_s3.BlockPublicAccess.BLOCK_ALL,
});

A Firehose Delivery Stream is provided as the recipient for the user events. The S3 bucket is used as the destination for all streaming data.

As the data analysis is to be carried out on a daily basis, a very common structure is used within the S3 bucket:

/{year}/{month}/{day}/

This structure can be configured with the dataOutputPrefix property.

import { aws_kinesisfirehose } from 'aws-cdk-lib';

const firehoseDeliveryStream = new aws_kinesisfirehose.DeliveryStream(
  this,
  'AnalyticsDeliveryStream',
  {
    deliveryStreamName: 'analyticsStream',
    destination: new aws_kinesisfirehose.S3Bucket(bucket, {
      dataOutputPrefix:
        'data/!{partitionKeyFromQuery:year}/!{partitionKeyFromQuery:month}/!{partitionKeyFromQuery:day}/',
      errorOutputPrefix:
        'failures/!{firehose:error-output-type}/!{timestamp:yyyy/MM/dd}/',
    }),
  },
);

firehoseDeliveryStream.grant(
  identityPool.authenticatedRole,
  'firehose:PutRecord',
  'firehose:PutRecordBatch',
);

As not all service properties are implemented in the L2 construct yet, some properties must be overridden directly on the delivery stream's CloudFormation node.

Enable dynamic partitioning with the ExtendedS3DestinationConfiguration.DynamicPartitioningConfiguration property, which creates targeted datasets from the streaming data by partitioning it based on partition keys.

The ExtendedS3DestinationConfiguration.ProcessingConfiguration property configures the data processing for the delivery stream. In this case, the incoming records are de-aggregated, a delimiter is appended to each record, and the year, month, and day partition keys are extracted from event_timestamp with a JQ expression.

import { aws_kinesisfirehose } from 'aws-cdk-lib';

const firehose = firehoseDeliveryStream.node
  .defaultChild as aws_kinesisfirehose.CfnDeliveryStream;

firehose.addPropertyOverride(
  'ExtendedS3DestinationConfiguration.DynamicPartitioningConfiguration',
  { Enabled: true },
);

firehose.addPropertyOverride(
  'ExtendedS3DestinationConfiguration.ProcessingConfiguration',
  {
    Enabled: true,
    Processors: [
      {
        Type: 'RecordDeAggregation',
        Parameters: [
          {
            ParameterName: 'SubRecordType',
            ParameterValue: 'JSON',
          },
        ],
      },
      {
        Type: 'AppendDelimiterToRecord',
      },
      {
        Type: 'MetadataExtraction',
        Parameters: [
          {
            ParameterName: 'MetadataExtractionQuery',
            ParameterValue:
              '{year:.event_timestamp| strftime("%Y"),month:.event_timestamp| strftime("%m"),day:.event_timestamp| strftime("%d")}',
          },
          {
            ParameterName: 'JsonParsingEngine',
            ParameterValue: 'JQ-1.6',
          },
        ],
      },
    ],
  },
);

⚠️ At the time of writing, the AWS Glue CDK Construct (@aws-cdk/aws-glue-alpha) was still experimental.


Create an AWS Glue database and table. Configure the table with a partition based on event_date and columns for event and event_timestamp.

Customize the table properties to enable and configure partition projection.

import * as aws_glue from '@aws-cdk/aws-glue-alpha';
import type { CfnTable } from 'aws-cdk-lib/aws-glue';

const database = new aws_glue.Database(this, 'AnalyticsDatabase');
const table = new aws_glue.Table(this, 'AnalyticsTable', {
  bucket,
  s3Prefix: 'data/',
  dataFormat: aws_glue.DataFormat.JSON,
  database,
  partitionKeys: [
    {
      name: 'event_date',
      type: aws_glue.Schema.STRING,
    },
  ],
  columns: [
    {
      name: 'event',
      type: aws_glue.Schema.STRING,
    },
    {
      name: 'event_timestamp',
      type: aws_glue.Schema.TIMESTAMP,
    },
  ],
});

const cfnTable = table.node.defaultChild as CfnTable;
cfnTable.addPropertyOverride('TableInput.Parameters', {
  'projection.enabled': 'true',
  'projection.event_date.type': 'date',
  'projection.event_date.format': 'yyyy/MM/dd',
  'projection.event_date.range': '2024/01/01,NOW',
  'projection.event_date.interval.unit': 'DAYS',
  'projection.event_date.interval': '1',
  'storage.location.template':
    bucket.s3UrlForObject() + '/data/${event_date}/',
});

If you want to store additional data, add a column for each data point.
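For example, a hypothetical page column for the visited route could be added to the columns array of the table above (the column name is purely illustrative):

columns: [
  {
    name: 'event',
    type: aws_glue.Schema.STRING,
  },
  {
    name: 'event_timestamp',
    type: aws_glue.Schema.TIMESTAMP,
  },
  {
    // Hypothetical additional column for the visited route.
    name: 'page',
    type: aws_glue.Schema.STRING,
  },
],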

The infrastructure for the serverless event tracking solution is now in place!

Client

Now let's take a look at the client application. As described above, aws-amplify/analytics provides a ready-to-use implementation for Amazon Kinesis Firehose.

npm install aws-amplify

Ensure you have set up IAM permissions for firehose:PutRecordBatch.

Example IAM policy for Amazon Kinesis Firehose:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "firehose:PutRecordBatch",
    // replace the template fields
    "Resource": "arn:aws:firehose:<Region>:<AccountId>:deliverystream/analyticsStream"
  }]
}

Configure Kinesis Firehose:

import { Amplify } from 'aws-amplify';

Amplify.configure({
  Analytics: {
    KinesisFirehose: {
      // REQUIRED - Amazon Kinesis Firehose service region
      region: 'eu-central-1',

      // OPTIONAL - The buffer size for events in number of items.
      bufferSize: 1000,

      // OPTIONAL - The number of events to be deleted from the buffer when flushed.
      flushSize: 100,

      // OPTIONAL - The interval in milliseconds to perform a buffer check and flush if necessary.
      flushInterval: 5000, // 5s

      // OPTIONAL - The limit for failed recording retries.
      resendLimit: 5
    }
  }
});
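Note that the KinesisFirehose provider obtains its AWS credentials from the Amplify Auth category, so the Cognito User Pool, User Pool Client, and Identity Pool from the infrastructure section also need to be configured. A minimal sketch with placeholder IDs (combine it with the Analytics configuration shown above in a single Amplify.configure call):

import { Amplify } from 'aws-amplify';

Amplify.configure({
  Auth: {
    Cognito: {
      // Placeholder IDs - use the values of your own User Pool,
      // User Pool Client, and Identity Pool.
      userPoolId: 'eu-central-1_xxxxxxxxx',
      userPoolClientId: 'xxxxxxxxxxxxxxxxxxxxxxxxxx',
      identityPoolId: 'eu-central-1:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx',

      // Allow events from users that are not signed in (unauthenticated role).
      allowGuestAccess: true,
    },
  },
  Analytics: {
    KinesisFirehose: {
      region: 'eu-central-1',
    },
  },
});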

You can send data to a Kinesis Firehose stream with the standard record method. Any data is acceptable, and streamName is required:

import { record } from 'aws-amplify/analytics/kinesis-firehose';

record({
  data: {
    event_timestamp: Date.now() / 1000,
    event: 'LOGIN',
    // The data blob to put into the record
  },
  streamName: 'analyticsStream'
});

If you want to store additional data, add it to the data object and remember to add a specific column to the AWS Glue table.
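For example, a hypothetical page field for the visited route could be sent with each event. It only becomes queryable if a matching page column exists in the Glue table (see the sketch in the infrastructure section):

import { record } from 'aws-amplify/analytics/kinesis-firehose';

record({
  data: {
    event_timestamp: Date.now() / 1000,
    event: 'PAGE_VIEW',
    // Hypothetical additional field - needs a matching column in the Glue table.
    page: '/dashboard',
  },
  streamName: 'analyticsStream'
});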

That's it. Now you can track events in your application however you like.

Analysis

Now that the infrastructure and client application are in place, it is time to analyze the data. This can be done using the Amazon Athena service. The analysis can be performed directly in the AWS Console or via the AWS SDK if you want to visualize the results in an application.

AWS Glue

When you look at the AWS Glue service, you can see the metadata of the table.


These columns can now be queried using Amazon Athena.

Amazon Athena

Amazon Athena makes it easy to analyze data in Amazon S3 using standard SQL. For more information, see the 📖 SQL reference for Athena.
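For example, the following sketch counts the events per type for a single day from application code, using the AWS SDK for JavaScript v3. The database and table names, the results bucket, and the function name are placeholders, not values from the stack above:

import {
  AthenaClient,
  StartQueryExecutionCommand,
  GetQueryExecutionCommand,
  GetQueryResultsCommand,
} from '@aws-sdk/client-athena';

const athena = new AthenaClient({ region: 'eu-central-1' });

// eventDate uses the partition format from above, e.g. '2024/01/01'.
export async function countEventsPerType(eventDate: string) {
  // Start the query. Database, table, and output bucket are placeholders -
  // use the names generated by your own deployment.
  const { QueryExecutionId } = await athena.send(
    new StartQueryExecutionCommand({
      QueryString: `
        SELECT event, count(*) AS total
        FROM analytics_table
        WHERE event_date = '${eventDate}'
        GROUP BY event
        ORDER BY total DESC
      `,
      QueryExecutionContext: { Database: 'analytics_database' },
      ResultConfiguration: { OutputLocation: 's3://my-athena-results/' },
    }),
  );

  // Poll until the query has finished.
  let state: string | undefined = 'QUEUED';
  while (state === 'QUEUED' || state === 'RUNNING') {
    await new Promise((resolve) => setTimeout(resolve, 1000));
    const { QueryExecution } = await athena.send(
      new GetQueryExecutionCommand({ QueryExecutionId }),
    );
    state = QueryExecution?.Status?.State;
  }

  // Fetch the result rows (the first row contains the column headers).
  const { ResultSet } = await athena.send(
    new GetQueryResultsCommand({ QueryExecutionId }),
  );
  return ResultSet?.Rows ?? [];
}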



If you have any kind of feedback, suggestions, or ideas, feel free to comment on this post!
