The value of serverless: a real-life example

I often hear that managed services, especially serverless services, are expensive. So I wanted to tell this story.

Before I do, let me remind you that cloud providers usually offer several ways to run the same workload. AWS' motto is "we prefer AND rather than OR". Let's look at queues, for instance:

  • You may set up EC2 instances and run RabbitMQ on top of those.
  • Or you could do the same with containers on ECS/EKS. No more OS to manage.
  • Or you can use Amazon MQ for RabbitMQ. No more software/runtime to manage.
  • Or you can use Amazon SQS. No more hourly fee, and also no capacity planning!

Now back to my story. A customer of mine needed a cache system to persist customer session data, so that all containers might share customer context.

Because we don't want to manage servers, AWS ElastiCache for Redis was the obvious go-to solution. But since my customer needed very little cache memory (~50 MB), we used cache.t4g.micro instances: a single instance in dev, a small cluster in staging and production.
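For reference, here is a minimal boto3 sketch of provisioning such a node-based setup (the identifiers are hypothetical, not my customer's actual configuration):

```python
import boto3

elasticache = boto3.client("elasticache", region_name="eu-west-1")

# Dev: a single cache.t4g.micro node running Redis (hypothetical identifier).
elasticache.create_cache_cluster(
    CacheClusterId="sessions-dev",
    Engine="redis",
    CacheNodeType="cache.t4g.micro",
    NumCacheNodes=1,
)

# Staging/production: a small replication group with one replica for failover.
elasticache.create_replication_group(
    ReplicationGroupId="sessions-prod",
    ReplicationGroupDescription="Customer session cache",
    Engine="redis",
    CacheNodeType="cache.t4g.micro",
    NumCacheClusters=2,
    AutomaticFailoverEnabled=True,
)
```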

For a few months, everything went fine. Our system operated smoothly. However, one day, six hours after releasing a new version to production, the system crashed. Let's say an e-commerce website doesn't do well without its session cache.

What had happened was that the release had added just a tiny bit of load on the cache, leading to higher network throughput. T4g instances have a baseline throughput plus a bucket of burst capacity. If your workload stays above the baseline for a long period of time, the bucket empties and the instance becomes nearly unresponsive.

[Figure: variation of throughput due to the release]
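If you suspect a T4g-based cache node is hitting its network allowance, a CloudWatch query along these lines can help confirm it (this assumes the ElastiCache host-level metric NetworkBandwidthOutAllowanceExceeded is available; the cluster and node IDs are hypothetical):

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

# Count how often traffic was queued or dropped because the node exceeded
# its outbound bandwidth allowance over the last 24 hours.
now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ElastiCache",
    MetricName="NetworkBandwidthOutAllowanceExceeded",
    Dimensions=[
        {"Name": "CacheClusterId", "Value": "sessions-prod-001"},  # hypothetical
        {"Name": "CacheNodeId", "Value": "0001"},
    ],
    StartTime=now - timedelta(hours=24),
    EndTime=now,
    Period=3600,
    Statistics=["Sum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```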

It took a little while to diagnose the issue and restore the system to full operational capacity. That time meant:

  • lost business for my client, plus reputation damage.
  • employee and consulting time to diagnose, restore, and then evolve the system.

All of this happened because the serverless version had been considered too costly: its 1 GB minimum storage meant the cost would be at least ~$106/month. By then, AWS had released ElastiCache for Valkey, which has a lower minimum storage (100 MB), so we ended up moving all environments to that solution.

With the serverless version of ElastiCache, we only have to worry about how much cache memory we use and how many requests we make. And that's only for FinOps reasons, not reliability considerations. AWS removed all the heavy lifting of monitoring server metrics, such as CPU, throughput, and IOPS.
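For illustration, here is a minimal sketch of creating a serverless Valkey cache with boto3, including an optional usage limit so the FinOps side stays predictable (the cache name and limits are hypothetical):

```python
import boto3

elasticache = boto3.client("elasticache", region_name="eu-west-1")

# Serverless cache: no node type and no capacity planning; AWS scales storage
# and ECPUs automatically. The usage limits below only cap cost; they are
# not needed for reliability.
elasticache.create_serverless_cache(
    ServerlessCacheName="sessions",  # hypothetical name
    Engine="valkey",
    CacheUsageLimits={
        "DataStorage": {"Maximum": 1, "Unit": "GB"},
        "ECPUPerSecond": {"Maximum": 5000},
    },
)
```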

Key takeaways:

  1. Unless you have an Ops team whose job it is to monitor servers, use serverless.
  2. Consider TCO, not just the service cost, when making those decisions. (Storage is cheap, compute less so, but human time is by far the most expensive resource you have.)
  3. The trade-off is almost always in favor of serverless unless your service has very high throughput/load; only then may it be worth deploying and monitoring servers.
  4. Continuously monitor the AWS What's New feed, as AWS often releases new options that should lead you to reconsider past architecture decisions (and log the reasons for those decisions, so you can revisit them with less effort).
  5. Any time you use T4g instances, monitor not only CPU credits but also network allowance metrics (see the alarm sketch after this list)!
  6. Monitor your systems around releases (I'll get back to that soon).
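Regarding takeaway 5, here is a minimal sketch of a CloudWatch alarm on the network allowance metric (the SNS topic and identifiers are hypothetical; the same pattern works for the other *AllowanceExceeded metrics):

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

# Alert as soon as the node starts exceeding its outbound bandwidth allowance,
# i.e. before the burst bucket runs dry and the cache becomes unresponsive.
cloudwatch.put_metric_alarm(
    AlarmName="sessions-prod-network-allowance-exceeded",  # hypothetical
    Namespace="AWS/ElastiCache",
    MetricName="NetworkBandwidthOutAllowanceExceeded",
    Dimensions=[
        {"Name": "CacheClusterId", "Value": "sessions-prod-001"},
        {"Name": "CacheNodeId", "Value": "0001"},
    ],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=3,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],  # hypothetical
)
```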

I hope this saves you from many production incidents!

Top comments (2)

Ray Bates

Thanks for that!

Ron Bo

Good example to use serverless!
Thanks for sharing.