Why?
Last summer we started refactoring one of our services to get rid of blocking I/O operations. It turned into an almost complete rewrite of the service over about 2 months.
After running the updated service for some time, we learned the hard way about Elasticache Redis’ maximum bandwidth allowance during weeks of intense traffic.
Traffic doubled in a very short period of time and exceeded the maximum bandwidth allowance for one of our Redis nodes for a sustained period. When this happens, the node’s queue grows and AWS starts to drop packets. A couple of issues turned this into a cascading failure, effectively taking our service down for almost an hour while we identified the problem, scaled the nodes accordingly and waited for the cluster to rebalance.
Notice the “sustained” wording in the paragraph above.
Although we had load-tested this service, we never caught the issue beforehand because the tests only ran for 10 to 30 minutes at most, and Elasticache allows you to exceed the network baseline for an undetermined period of time (up to an hour, if I am not mistaken) before it starts to drop packets.
We had a couple of problems that led to the downtime:
Problem #1: during the refactoring, we forgot to configure a timeout on the Redis client, so it waited for the 60-second default before timing out, which caused our own clients to time out first.
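For illustration, here is a minimal sketch of making those timeouts explicit with redis-py (the endpoint and values are just examples, not our actual configuration): keeping the cache client’s timeouts well below your own clients’ timeout means a slow or unreachable cache fails fast instead of holding requests open.

```python
import redis

cache = redis.Redis(
    host="my-cache.example.cache.amazonaws.com",  # hypothetical endpoint
    port=6379,
    socket_connect_timeout=0.2,  # fail fast if the node is unreachable
    socket_timeout=0.5,          # fail fast on slow reads/writes
)
```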
Problem #2: during the refactoring, we also missed migrating our circuit breaker implementation to our custom cache handling, so we never skipped the cluster and went directly to the DB to serve requests normally, as we should have. (Note that the cache here is used to answer clients fast; the DB should always be sized to handle normal load in case the cache is unavailable for any reason.)
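As a rough sketch of the idea (names and thresholds are illustrative, not our actual implementation): after a few consecutive cache failures the breaker opens, the cache is skipped for a cool-down period, and reads go straight to the DB, which is sized for normal load.

```python
import time

class CacheCircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_seconds=30):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at > self.reset_after_seconds:
            # Cool-down elapsed: close the breaker and try the cache again.
            self.opened_at, self.failures = None, 0
            return False
        return True

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

    def record_success(self):
        self.failures = 0


breaker = CacheCircuitBreaker()

def get_item(key, cache, db):
    if not breaker.is_open():
        try:
            value = cache.get(key)
            breaker.record_success()
            if value is not None:
                return value
        except Exception:
            breaker.record_failure()
    # Breaker open, cache miss or cache error: fall back to the DB
    # (db.fetch is a hypothetical lookup in your persistence layer).
    return db.fetch(key)
```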
Problem #3: during the rewrite, we optimised our startup time down to a few seconds by removing an in-memory local cache for one of our data structures. What we didn’t realise is that by doing so we created a hot shard in our cluster: the key used for those lookups was not hashed and was basically always the same for all requests, so the load was not distributed across our shards. This concentrated the bandwidth consumption on one of the nodes and made the workload impossible to scale horizontally.
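One common way to spread such a hot key (a hedged sketch, not our exact fix; key names and bucket count are illustrative) is to split it into N buckets so that reads and writes hash to different slots across the cluster:

```python
import random

NUM_BUCKETS = 16  # illustrative; pick based on your shard count and traffic

def write_hot_value(cache, base_key, value, ttl_seconds=300):
    # Write the same value under every bucketed key so any bucket can serve a read.
    for bucket in range(NUM_BUCKETS):
        cache.set(f"{base_key}:{bucket}", value, ex=ttl_seconds)

def read_hot_value(cache, base_key):
    # Each bucketed key hashes to a different slot, spreading load across shards.
    return cache.get(f"{base_key}:{random.randrange(NUM_BUCKETS)}")
```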
Problem #4: because we could not predict this sudden load, our Elasticache cluster was not sized accordingly (e.g. by increasing the number of shards) and therefore hit the network limitation, even though the cluster seemed healthy at first glance (CPU and memory).
The immediate mitigation was simply to scale the cluster vertically and increase the number of shards; the final solution was to identify and fix each of the points mentioned above over the following days.
After that: we knew AWS offered a Serverless version of Elasticache, which we had been planning to look at for some time, and we had just heard about Valkey, announced about a month before this incident.
Elasticache Valkey
Valkey is an open source project forked from the open source Redis project right before its transition to the new source-available licenses. Because of that transition, AWS and other tech giants started contributing to the project, aiming to keep Redis compatibility while enhancing overall functionality and performance.
The highlights from my point of view and experience so far are:
lower price than other engines (up to 33% lower)
it provides microsecond read and write latency and can scale to 500 million requests per second (RPS) on a single self-designed (node-based) cluster
it is compatible with Redis OSS APIs and data formats
zero downtime migration
continuous updates (in exchanges with some of the people involved in the project, they opened up about ideas and plans to improve the service further, which will make it even more attractive in the future)
Valkey is offered in both cluster and serverless variants.
You can read about all its bells and whistles in the official AWS documentation, and also see how it works in this video presented by one of the maintainers last summer:
Serverless Elasticache Valkey
In addition to the highlights mentioned above, the serverless variant abstracts away cluster management (minor updates) and sizing, which is especially interesting if your traffic can change suddenly, as pictured at the beginning of this article.
Of course, not everything is rosy. The serverless variant may become very expensive depending on your workload; if you have a predictable and sustained load, you would probably pay much more for the serverless variant than for the self-designed (node-based) cluster.
In the serverless variant, you are billed for Storage and ECPUs (ElastiCache Processing Units). Storage is straightforward and you can already estimate it based on your current values. ECPUs are a bit trickier, as they are basically processing time, which is affected by the payload size and the type of commands you execute. In general, 1 ECPU relates to approximately 1 KB of payload data (read this for more information).
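As a back-of-the-envelope sketch (every number below is an illustrative assumption, not a real AWS rate; check the ECPU documentation and your region’s pricing), you can roughly estimate a monthly ECPU bill from your average traffic and payload size:

```python
# All inputs are illustrative assumptions.
avg_rps = 2_000            # average requests per second under normal load
avg_payload_kb = 0.5       # average payload per command, in KB
seconds_per_month = 30 * 24 * 3600

# Simple GET/SET commands consume roughly 1 ECPU per KB transferred,
# with a minimum of 1 ECPU per command.
ecpus_per_request = max(1.0, avg_payload_kb)
ecpus_per_month = avg_rps * ecpus_per_request * seconds_per_month

price_per_million_ecpus = 0.0034  # hypothetical USD rate; use your region's price

estimated_cost = ecpus_per_month / 1_000_000 * price_per_million_ecpus
print(f"~{ecpus_per_month:,.0f} ECPUs/month, ~${estimated_cost:,.2f}/month")
```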
However, the very cool things about the serverless variant are:
If your workload is periodical or irregular, you might save on costs during low usage periods
ElastiCache Serverless for Valkey can double the supported requests per second (RPS) every 2–3 minutes, reaching 5M RPS per cache from zero in under 13 minutes, with consistent sub-millisecond p50 read latency
Take the following metrics as an example:
The red and yellow areas in some of those panels are set only for cost-control purposes; the cluster is able to scale far beyond them. There we can see traffic doubling in a very short amount of time with no throttling observed. Elasticache Valkey Serverless scales seamlessly to support such a traffic increase, and scales back down to the minimum configured ECPU value when traffic decreases.
Because our service traffic has this sinusoidal shape and the amount of data we transfer per second under normal load is not very high, we end up saving on costs, with the added benefit of autoscaling as needed.
Careful considerations
Before deciding to switch to Elasticache Serverless, carefully analyse your workload and the AWS documentation in order to identify, for example:
is your cache traffic predictable and relatively constant, or is it periodic or unpredictable?
how many ECPUs would your application consume under normal load?
how much storage does your application require?
do you need/want to set a minimum and/or maximum ECPU/s?
do you need/want to set a minimum and/or maximum storage?
Be aware, for example, that if you set maximum constraints, your application may receive errors from Elasticache when it surpasses those values, instead of scaling further. However, setting limits can be a good idea if your application tolerates errors and you want to avoid excessive costs (e.g. by bypassing the cache with circuit breakers and low client timeouts).
Setting minimum values, on the other hand, can be a good idea to guarantee that your Elasticache will serve at least that amount of capacity at any given time.
Use AWS’ pricing calculator to estimate how much it would cost you and make the best decision for your own use case.
Also make sure to double-check your security group rules: the serverless variant requires port 6380 for the reader nodes in addition to the standard 6379. Otherwise, your application might start but you may experience latency. Read more here.
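If you manage your security groups programmatically, a hedged sketch with boto3 could look like this (the security group IDs are placeholders; adapt it to however you manage your infrastructure):

```python
import boto3

ec2 = boto3.client("ec2")

# Allow the application's security group to reach the cache on both
# 6379 (primary endpoint) and 6380 (reader endpoint).
ec2.authorize_security_group_ingress(
    GroupId="sg-cache-placeholder",  # security group attached to the ElastiCache cache
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 6379,
            "ToPort": 6380,  # the range covers both ports
            "UserIdGroupPairs": [
                {"GroupId": "sg-app-placeholder"}  # your application's security group
            ],
        }
    ],
)
```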
Good luck!
Thanks for reading! Got any cool ideas or feedback you want to share? Drop a comment, send me a message or follow me and let’s keep moving things forward!