Stephen Sennett for AWS Heroes

Does Scale-to-Zero Redefine Aurora’s ‘Serverlessness’?

Amazon Aurora Serverless v2 now supports scale-to-zero, allowing you to stop paying for your database when it isn't in use.

Sounds great - but what does it mean, and is it really serverless?


🖥️ What is Aurora Serverless v2?

First, some basics: Aurora Serverless v2 abstracts away the management of individual underlying database instances and replaces it with "Aurora Capacity Units" (ACUs).

Each ACU is equivalent to about 2 GiB of RAM (plus corresponding CPU, networking, etc.), and Aurora can seamlessly scale up or down to meet demand according to your configuration.

You incur a cost (about $0.12 per ACU-hour) for as long as that capacity is provisioned. So at the previously cheapest setting (0.5 ACUs), you're running up a $43.80 monthly bill, even with zero requests.
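As a quick back-of-the-envelope check (using that roughly $0.12 per ACU-hour figure, which varies by region):

```python
# Rough monthly cost of keeping the old 0.5 ACU minimum provisioned 24/7,
# assuming ~$0.12 per ACU-hour (check the pricing page for your region).
acu_hour_price = 0.12
min_acus = 0.5
hours_per_month = 730

print(f"${acu_hour_price * min_acus * hours_per_month:.2f}/month")  # $43.80/month
```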

👻 Defining "Serverless"

Serverless is a technology with a problem: There's no universally accepted definition of what 'serverless' is. Ben Kehoe captures this perfectly in The Meaning(lessness) of Serverless.

For this discussion though, we need a framework to assess it against:

Momento's Litmus Test for Serverless

The team at Momento came up with their Litmus Test for Serverless, which I'll use as our gold standard of 'pure serverless'.

⚙️ How it Works

The official launch blog by AWS does its usual good job of covering the key information. Here's how scale-to-zero works in a nutshell:

With 0 ACUs, Aurora Serverless v2 DB instances can automatically pause after a period of inactivity. Aurora pauses an Aurora Serverless v2 instance if it doesn’t have connections initiated by user activity within the specified time period. For the paused duration, there are no charges for compute as the database is at 0 ACUs.

This is a change from previous releases, where you needed at least 0.5 ACUs - incurring a permanent minimum cost. If the database isn't used*, you won't be paying for it. When new connections come in, it'll start up again.

Example of how the scaling works

The documentation also refers to this as auto-pause, which describes the underlying mechanism well: the database instance is effectively paused while it's not in use.
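If you want to try it yourself, scale-to-zero is just part of the cluster's scaling configuration. Here's a rough sketch using boto3 - the cluster identifier is made up, and the exact parameter names (especially SecondsUntilAutoPause) are worth double-checking against the current RDS API documentation:

```python
import boto3

rds = boto3.client("rds")

# Allow the cluster to scale down to 0 ACUs, and pause after an hour of
# inactivity. MinCapacity=0 is what enables scale-to-zero; the auto-pause
# delay can range from 5 minutes up to 24 hours.
rds.modify_db_cluster(
    DBClusterIdentifier="my-aurora-cluster",  # hypothetical cluster name
    ServerlessV2ScalingConfiguration={
        "MinCapacity": 0,
        "MaxCapacity": 4,
        "SecondsUntilAutoPause": 3600,
    },
    ApplyImmediately=True,
)
```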

👀 Caveats

There's a lot to know about what counts as being "used". Again to their credit, AWS lays it out both in the blog post and in much more depth in the Aurora docs, but to capture a few very interesting points:

  • Any open connection prevents pausing
  • Use of replication and/or RDS Proxy also prevents pausing
  • Any connection attempt will cause the instance to resume
  • Resuming an instance takes up to 15 seconds

These make sense given the underlying mechanisms, but they might come as a surprise and rule out a few use cases. For example, using Lambda functions with RDS Proxy effectively rules out this feature.

There are also plenty of specific recommendations depending on your cluster design, so as always, the documentation is worth digging into before taking this to production.
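If a cluster stubbornly refuses to pause, the first caveat is the usual culprit: a lingering connection. One quick way to check is to look at the open sessions - a rough sketch, assuming the MySQL-compatible edition and a made-up endpoint:

```python
import os
import pymysql

# List the sessions currently open against the cluster; anything beyond
# your own session here will keep the instance from auto-pausing.
conn = pymysql.connect(
    host="my-aurora-cluster.cluster-abc123.us-east-1.rds.amazonaws.com",  # hypothetical
    user="admin",
    password=os.environ["DB_PASSWORD"],
    database="information_schema",
)
with conn.cursor() as cur:
    cur.execute(
        "SELECT id, user, host, command, time FROM information_schema.processlist"
    )
    for session in cur.fetchall():
        print(session)
conn.close()
```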

🥶 (Very) Cold Starts

Cold starts are a persistent challenge of serverless, though usually one that's very solvable within the service. In this case, though, things get positively arctic.

Instances take up to 15 seconds to resume, so ensuring a long enough connection timeout will be necessary to prevent unhappy consumers. This has been discussed plenty already.
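In practice, that means setting the driver's connection timeout comfortably above the resume time, and retrying rather than failing fast. A minimal sketch, again assuming the MySQL-compatible edition and PyMySQL:

```python
import time
import pymysql

def connect_with_resume_tolerance(host, user, password, database):
    """Connect to an auto-paused Aurora cluster, waiting out the resume."""
    last_error = None
    for attempt in range(3):
        try:
            return pymysql.connect(
                host=host, user=user, password=password, database=database,
                connect_timeout=35,  # comfortably above the 15-second resume time
            )
        except pymysql.err.OperationalError as exc:
            last_error = exc
            time.sleep(2 ** attempt)  # brief backoff before retrying
    raise last_error
```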

One thing nobody seems to have talked about, though, is the Deep Sleep.

If an Aurora Serverless v2 instance remains paused more than 24 hours, Aurora can put the instance into a deeper sleep that takes longer to resume. In that case, the resume time can be 30 seconds or longer, roughly equivalent to doing a reboot of the instance.

Yikes...

Aurora Cold Start Time Diagram

This might not be the death blow you'd expect - there are plenty of systems using SQL that aren't sensitive to long connection times, primarily system-to-system communication. But inflicting it on human users would not be pleasant.

So let's try exactly that!

🧪 Experimenting with WordPress

First thought: how viable would this be for running a WordPress instance? Spoiler: not very.

The cold start times make it a painful experience. Page load times aren't just a fluff metric either; poor loading times do impact conversion rates, so while you're saving money on infrastructure, you're probably losing far more in revenue.

Any time you're not paying for ACUs, you will pay with a guaranteed future cold start.

One thing I noticed is that while you can configure the database to pause after five minutes for maximum savings, you might choose to set that delay far longer just to avoid those cold starts.

Setting up Aurora Serverless V2 for Scale-to-Zero

A better fit would be periodic reporting databases, where many internal users might slam an application for a few days a month, but the database otherwise sits virtually unused. With the delay set to the full 24 hours, most users would never see the difference.

For a typical solo WordPress site though, it'd be cheaper and more consistent to run both the web server and the database on a single EC2 instance.

🤔 Conclusion

tl;dr: It's definitely more serverless, but not "pure serverless". And I don't think that's a problem.

Databases are complicated, and "just making it serverless" is a hard problem. Some solutions like CockroachDB have made huge strides - check out this video where they explain how they scale.

"The right tool for the right job" is the motto of the wise architect. In similar cases, I'd also consider a single-table DynamoDB or even SQLite.

Even if this isn't 'pure serverless', this is a huge improvement that might make it just the right tool for your use case.

Top comments (1)

Darryl Ruggles

This part is a big limitation!!

Use of Replication and/or RDS Proxy also prevents pausing