Davide de Paolis

How to deal with Aurora Serverless coldstarts

In the project I am currently working on, we are migrating a web application built with PHP and a Vertica database to a React SPA + Serverless backend (AWS API Gateway + Lambda + Aurora Serverless).

The switch from Vertica to Aurora was the last step, and judging from our integration tests everything looked fine. The only issue was that those tests were "sometimes" failing.
We were pretty sure from the beginning that it had to do with Aurora Serverless DB "cold starts". We had already applied a workaround to keep the Lambda from "falling asleep", but now the problem was the VPC + DB cluster taking ages to start up.

Checking the logs we found out that the Lambda was timing out after 5 seconds. Easy. Just increase the timeout in serverless.yml:


    timeout: 30 # VPC + paused Aurora cluster can take a while

We set a deliberately high value to rule out any timeout from the Lambda itself.
Now the logs were telling us that the MySQL driver was timing out. We checked the mysql2 documentation and found out that its connection timeout (connectTimeout) defaults to 10 seconds.
We increased that as well, but since a sleeping Aurora DB cluster can take up to 25 seconds to wake up... we hit the hard limit on the API Gateway endpoint.

Why HARD limit? Because AWS tells us that an API Gateway integration times out after 29 seconds and, at the time of writing, that timeout can NOT be increased (and this makes perfect sense: you definitely don't want your REST APIs to hang for so long, and if they really need that much time, it is probably a sign that you should change the architecture and move to something more asynchronous).

So.. what could we do?

We already had a warm-up, but that was only used to spawn the container: the handler returned immediately after checking the context (as I described here).
We could have simply modified the logic so that the warm-up would also ping the DB and wake it up.
But a better solution is to disable the "pausing" feature on Aurora Serverless and set the minimum capacity to 1 ACU, so that your DB cluster never goes to sleep, you always have at least one ACU available, and AWS scales it up automatically when needed.
Of course, this seems to somehow defeat the purpose of a serverless DB, where you configure your DB to autoscale when needed and pay only for what you use:

if I have to keep an instance always on because the startup time is way too long, then what's the gain compared to having Aurora on EC2?

Well, the benefit is exactly the autoscaling functionality (over multiple Availability Zones).

If you don't need multiple AZs and you are very cost sensitive, you can definitely just use a provisioned Aurora instance instead of a serverless one.

Here you can find an awesome and detailed article about the costs of Aurora Serverless compared to Aurora on EC2.

In our case, spending a few euros more for a more stable service, and not going crazy keeping up with cron jobs and warm-ups, was not a big deal. We therefore decided to keep 1 ACU always active on production and just bear with the cold starts on staging and dev. Any QA tester just has to refresh the React app page to get the connection running, and the same can be done for integration tests: either retry, or ping the DB, wait, and then execute them.

How do you set this configuration in serverless?

In the AWS console UI, it's easy. Just click the configuration tab, change the ACU and autopause fields, and it's done.

With the Serverless Framework, the hardest part is always finding the right properties to describe the stack in code while navigating the massive AWS documentation.
You can read about the scaling configuration of your DB cluster capacity in the API Reference pages or in the AWS SDK documentation, but to find the right configuration for your yml file you have to go to the CloudFormation documentation.
Once you are there, you will realize it is indeed super simple.

Under Resources just put:



    RDSCluster:
      Type: AWS::RDS::DBCluster
      Properties:
        MasterUsername: YOUR_DB_USERNAME
        MasterUserPassword: YOUR_DB_PSW
        DatabaseName: YOUR_DB_NAME
        Engine: aurora
        EngineMode: serverless
        ScalingConfiguration:
          AutoPause: false
          MaxCapacity: 64
          MinCapacity: 1
        DBSubnetGroupName: YOUR_SUBNET_NAME
        BackupRetentionPeriod: 1
        DeletionProtection: true



This will create an Aurora Serverless DB cluster that can't be deleted, never goes to sleep, and has a minimum of 1 ACU.

In our case, though, we wanted a different configuration for each environment. We didn't want to waste money on an always-available instance for QA and DEV, nor did we need snapshots and backups for those environments.
So, since conditionals do not really exist in yml, we created a bunch of custom properties and refer to them based on the stage:

Under Custom just declare them:



autopause:
    production: false
    default: true



and in the DBCluster config just refer to them like this:



AutoPause: ${self:custom.autopause.${self:provider.stage}, self:custom.autopause.default}



This is how, within the Serverless Framework, you can grab a property based on the stage and, if the stage name does not exist among the properties, fall back to the default.
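The same lookup works for any other per-stage property. A minimal sketch, assuming hypothetical custom keys (backupRetention, deletionProtection) and illustrative values, not necessarily the ones we used:

```yaml
custom:
  autopause:
    production: false
    default: true
  # Hypothetical keys and values, for illustration only
  backupRetention:
    production: 7 # days; Aurora requires a value between 1 and 35
    default: 1
  deletionProtection:
    production: true
    default: false

resources:
  Resources:
    RDSCluster:
      Type: AWS::RDS::DBCluster
      Properties:
        # ...remaining properties as in the cluster definition shown earlier...
        ScalingConfiguration:
          AutoPause: ${self:custom.autopause.${self:provider.stage}, self:custom.autopause.default}
          MaxCapacity: 64
          MinCapacity: 1
        BackupRetentionPeriod: ${self:custom.backupRetention.${self:provider.stage}, self:custom.backupRetention.default}
        DeletionProtection: ${self:custom.deletionProtection.${self:provider.stage}, self:custom.deletionProtection.default}
```

Any stage without its own entry, for example a short-lived feature-branch stage, picks up the default values.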


A nice tip when you play around with the configuration is to run



sls print -s YOUR_STAGENAME



to see what the final yml will look like with all variables resolved.
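One thing worth checking in that output: as far as I know, ${self:provider.stage} reads the value declared under provider in the yml file, so the -s option only reaches the lookup if provider.stage is itself derived from it. A minimal sketch, assuming dev as the fallback stage name (an arbitrary choice for illustration):

```yaml
provider:
  name: aws
  # Take the stage passed with -s / --stage; fall back to 'dev' (assumed default)
  stage: ${opt:stage, 'dev'}
```

Newer versions of the Framework also expose the resolved stage as ${sls:stage}, which could replace ${self:provider.stage} in the lookups.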

Did you have any issues with Aurora Serverless, or do you have any interesting suggestions on the topic?

Top comments (1)

Brian L Dombrowski

I appreciate this post! I'd argue though that API Gateway should at least allow long enough timeouts for Aurora Serverless to cold start. I've found that startup on Serverless v2 is usually just shy of 28 seconds, but now I'm paranoid that it may occasionally creep above the timeout.