Amazon Elastic Kubernetes Service (Amazon EKS) is a fully managed Kubernetes service on AWS. With the rise of stateful applications running on Kubernetes, it’s more important than ever to understand the role storage plays for these critical workloads.
In this post I’ll try to simplify the process of selecting storage for EKS on AWS, and explain in which scenarios it’s best to use each of Amazon EBS, Amazon EFS, Amazon S3, and Amazon FSx for NetApp ONTAP (FSx for ONTAP).
In this analysis I’ll touch upon six significant storage metrics/capabilities to make it easier to choose the right storage for a few different workloads running on EKS: general file storage for a SaaS application, data-intensive AI/ML and analytics workloads, NoSQL databases, web applications, and queuing systems.
The metrics and capabilities we’re going to cover are:
- Performance, divided into two aspects: Throughput / IOPS and latency
- Durability and availability
- Scalability
- ReadWriteMany
- Supported protocols: Block, NFS, and SMB/CIFS
- Cost
Read on as we cover:
- The Metrics
- How the different AWS storage services stack up
- Mapping optimal service options per workload
- Conclusion
The Metrics
Here’s a short description for each metric/capability that we’ll be looking for in the different storage services:
Performance
Performance describes how quickly a storage service can respond to user requests and changes. It can be measured in two ways:
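Throughput is the amount of data (measured in bits or bytes) that a storage service can transfer every second. IOPS (input/output operations per second) counts how many individual read and write operations it completes each second. The two are linked by the I/O size: throughput equals IOPS multiplied by the size of each operation, so both are needed to describe performance.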
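To make that relationship concrete, here’s a small Python sketch of the arithmetic. The IOPS figure and I/O sizes are illustrative values, not limits of any particular AWS service.

```python
def throughput_mib_per_s(iops: float, io_size_kib: float) -> float:
    """Throughput = IOPS x I/O size, converted from KiB/s to MiB/s."""
    return iops * io_size_kib / 1024


# The same IOPS figure yields very different throughput depending on I/O size.
small_io = throughput_mib_per_s(iops=3000, io_size_kib=16)
large_io = throughput_mib_per_s(iops=3000, io_size_kib=256)
print(f"16 KiB I/Os: {small_io:.1f} MiB/s, 256 KiB I/Os: {large_io:.1f} MiB/s")
```

An IOPS number on its own doesn’t tell you whether storage can feed a large sequential scan; you also need to know the I/O size.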
Latency is the time it takes a storage service to serve a read request or acknowledge a write operation.
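Latency and IOPS are tied together by Little’s law: the number of I/Os in flight equals the completion rate multiplied by the latency. The sketch below uses illustrative latencies (not measurements of any service) to show the ceiling this puts on IOPS for a given queue depth.

```python
def max_iops(queue_depth: int, latency_ms: float) -> float:
    """Upper bound on IOPS from Little's law: rate = concurrency / latency.

    Real storage also hits its own IOPS and throughput caps, so treat this
    as a ceiling, not a prediction.
    """
    return queue_depth * 1000 / latency_ms


# A single-threaded client (queue depth 1) is limited by latency alone.
print(max_iops(queue_depth=1, latency_ms=0.5))
print(max_iops(queue_depth=1, latency_ms=2.0))
# Keeping more I/Os in flight raises the ceiling.
print(max_iops(queue_depth=32, latency_ms=2.0))
```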
Durability and availability
Durability is a measure of how safe data is from being lost. Availability is the proportion of time a service is up and able to serve requests, usually expressed as a percentage of uptime.
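Availability percentages are often easier to reason about as allowed downtime. Here’s a quick conversion sketch; the percentages are examples, not the SLA of any specific service.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Maximum yearly downtime implied by an availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


for pct in (99.9, 99.99):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% available -> up to {minutes:.1f} minutes down per year")
```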
Scalability
By scalability, we mean two things. Scaling up increases the storage capacity and power of the servers in use, for example by adding drives or memory. Scaling out adds instances in order to handle the most demanding workloads.
ReadWriteMany
ReadWriteMany (RWX) is a Kubernetes volume access mode that allows a volume to be mounted read-write by multiple nodes at the same time.
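In Kubernetes, RWX is requested through the accessModes field of a PersistentVolumeClaim. The sketch below builds such a manifest in Python and prints it as JSON, which kubectl accepts just like YAML. The claim name "shared-data" and the "efs-sc" StorageClass are hypothetical; the StorageClass must exist in your cluster and be backed by a CSI driver that supports RWX, such as the Amazon EFS CSI driver or NetApp Trident with an FSx for ONTAP NAS backend.

```python
import json


def rwx_pvc(name: str, storage_class: str, size: str) -> dict:
    """Build a PersistentVolumeClaim manifest that requests ReadWriteMany."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # RWX: the volume can be mounted read-write by pods on many nodes.
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical names; adjust to your cluster.
manifest = rwx_pvc("shared-data", "efs-sc", "100Gi")
print(json.dumps(manifest, indent=2))
```

Saved to a file, the output can be applied with `kubectl apply -f <file>.json`.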
Protocols
Different operating systems favor specific file protocols: Linux clients typically use NFS, while Windows clients use SMB/CIFS. The iSCSI block protocol is also widely used in many important workloads.
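When shortlisting services, it can help to filter by the protocols a workload needs. Here’s a hypothetical helper that does that for the four services in this post. The protocol map is my simplified, non-exhaustive summary (not an AWS API): EBS attaches to the node as a block device, EFS is NFS-only, FSx for ONTAP serves NFS, SMB, and iSCSI, and S3 is an object store reached over its HTTPS API.

```python
# Simplified, non-exhaustive summary of client-facing protocols.
SERVICE_PROTOCOLS = {
    "Amazon EBS": {"block"},
    "Amazon EFS": {"nfs"},
    "Amazon FSx for NetApp ONTAP": {"nfs", "smb", "iscsi"},
    "Amazon S3": {"object"},
}


def services_supporting(*required: str) -> list:
    """Return the services that expose every one of the required protocols."""
    needed = {p.lower() for p in required}
    return sorted(
        service for service, protocols in SERVICE_PROTOCOLS.items()
        if needed <= protocols
    )


print(services_supporting("NFS"))
print(services_supporting("NFS", "SMB"))
```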
Cost
Cost is a basic metric that is always a consideration.
How the different AWS storage services stack up
The table below compares the AWS storage options we’ll be looking at (Amazon EBS, Amazon EFS, Amazon FSx for NetApp ONTAP, and Amazon S3) against these metrics, making it easier to see when each is the best fit.
In the table below, green represents options that support that feature for demanding workloads (i.e., high performance, full support, low cost, etc.), yellow represents some support, and red denotes limited to no support for that feature at the enterprise level.
Mapping optimal service options per workload
Each workload has its own characteristics and considerations. In this section, I’ll try to map the most important metrics for some of today’s most popular workloads—namely, general file storage for a SaaS application, data-intensive AI/ML and analytics workloads, NoSQL databases, web applications, and queuing systems—to pinpoint the best storage option for each workload.
General files for SaaS applications
Top considerations: Durability and availability, scalability, cost
For SaaS-type applications, durability and availability are in many cases the most critical factors when selecting a storage service. EFS, FSx for ONTAP, and S3 all provide good options here. An EBS volume lives in a single Availability Zone (AZ), which makes it less durable and more susceptible to downtime.
S3 is the most cost-effective option. However, there are some important considerations to take into account:
- To mount S3 as a volume in EKS, you’ll need the Mountpoint for Amazon S3 CSI driver, a relatively new option from AWS. Because it doesn’t implement full POSIX file system semantics, it’s less reliable when used as a general-purpose file system.
- Latency is relatively high.
- If you need to read and write many small files, per-request charges (PUT/GET requests) can quickly become overwhelming.
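To get a feel for how request charges add up, here’s a back-of-the-envelope calculation. The default per-request prices are assumptions modeled on S3 Standard list prices in one region (PUT at $0.005 and GET at $0.0004 per 1,000 requests); check current S3 pricing for your region and storage class. Storage and data transfer charges are left out.

```python
def s3_request_cost(puts: int, gets: int,
                    put_price_per_1k: float = 0.005,
                    get_price_per_1k: float = 0.0004) -> float:
    """Request charges only; prices are assumed defaults, not quoted pricing."""
    return puts / 1000 * put_price_per_1k + gets / 1000 * get_price_per_1k


# Hypothetical month: 100 million small-file writes and 1 billion reads.
monthly = s3_request_cost(puts=100_000_000, gets=1_000_000_000)
print(f"${monthly:,.2f} per month in request charges alone")
```

The charge scales with the number of requests, not the bytes moved, so the same data packed into fewer, larger objects costs far less in requests.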
EFS and FSx for ONTAP address durability and availability quite well, and both scale well, both up and out.
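In terms of cost, both EFS and FSx for ONTAP offer multi-AZ deployments. The difference is that EFS bills for the capacity you actually use, while FSx for ONTAP bills for the SSD capacity you provision. Because FSx for ONTAP applies storage efficiencies such as compression and deduplication, compressible data needs less provisioned capacity, which can make FSx for ONTAP substantially less expensive. Both services can also tier cold data to a lower-cost storage tier (EFS Infrequent Access, or the FSx for ONTAP capacity pool) to reduce costs.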
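The billing difference can be sketched as follows. All prices, the 2:1 efficiency ratio, and the 20% headroom are hypothetical placeholders; substitute current regional pricing and the savings you actually measure. The model also ignores FSx for ONTAP throughput capacity charges, backups, and cold-data tiering on either service.

```python
def efs_monthly_cost(used_gib: float, price_per_gib: float) -> float:
    # EFS: you pay for the data stored, so cost tracks used capacity.
    return used_gib * price_per_gib


def fsx_ontap_ssd_monthly_cost(used_gib: float, efficiency_ratio: float,
                               headroom: float, price_per_gib: float) -> float:
    # FSx for ONTAP: you pay for provisioned SSD capacity. Compression and
    # deduplication shrink the physical footprint (used / efficiency_ratio),
    # and `headroom` is the extra capacity provisioned beyond that footprint.
    physical_gib = used_gib / efficiency_ratio
    provisioned_gib = physical_gib * (1 + headroom)
    return provisioned_gib * price_per_gib


# Hypothetical inputs: 10 TiB of compressible data at placeholder prices.
efs = efs_monthly_cost(used_gib=10_240, price_per_gib=0.30)
fsx = fsx_ontap_ssd_monthly_cost(used_gib=10_240, efficiency_ratio=2.0,
                                 headroom=0.2, price_per_gib=0.25)
print(f"EFS: ${efs:,.0f}/month, FSx for ONTAP SSD: ${fsx:,.0f}/month")
```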
When comparing the performance of EFS and FSx for ONTAP, keep in mind that EFS has relatively high latency, which can be a dealbreaker for many applications. High latency weighs heavily on data ingestion time and data processing tasks. Below is a customer benchmark for processing 1B records using FSx for ONTAP and EFS.
Find the full details of this benchmark test in How MYCOM OSI Optimized SaaS Storage with Amazon FSx for NetApp ONTAP.
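Latency hurts most when I/Os are dependent, meaning each request waits for the previous one to complete. The arithmetic below illustrates that effect with hypothetical latencies; it does not reproduce the benchmark, and the numbers are not measurements of EFS or FSx for ONTAP.

```python
def serial_io_seconds(num_ops: int, latency_ms: float) -> float:
    """Time spent waiting on storage when I/Os run one after another."""
    return num_ops * latency_ms / 1000


# Hypothetical: one million dependent small reads at two latency levels.
for latency_ms in (0.5, 2.0):
    seconds = serial_io_seconds(1_000_000, latency_ms)
    print(f"{latency_ms} ms per I/O -> {seconds:,.0f} s waiting on storage")
```

Quadrupling the latency quadruples the wait, which is why ingestion pipelines that can’t parallelize their I/O are so sensitive to it.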
Data-intensive AI/ML and analytics workloads
Top considerations: Performance, latency, scalability, and cost
Example workloads: Analytics (BI), SageMaker, Kubeflow, Airflow
When running these types of workloads on EKS, job runtime is a major concern. You’ll need to make sure that data is read from and written to disk as efficiently as possible.
For performance and latency, EBS, EFS, and FSx for ONTAP all provide single-digit millisecond latency, which makes them better suited to these data-intensive workloads. In benchmarks, FSx for ONTAP proved more favorable from a latency perspective than EFS and EBS. For a full discussion of these latency benchmarks, check out Benchmarking AWS CSI Drivers, which broke down the results in the following graph:
For scalability, EBS might be less ideal, since each volume has a fixed maximum size and is bound to a single AZ. EFS and FSx for ONTAP both support scale-out and scale-up capabilities. EFS can scale up to dozens of PBs of capacity and scale out to provide 3-30 GB/s of throughput (source). FSx for ONTAP can scale up to 36 GB/s of throughput and dozens of PBs of capacity (source).
From a cost perspective, FSx for ONTAP provides a single-AZ deployment that can be accessed from pods in different AZs. That differs from EFS, where single-AZ deployments can only be accessed by pods within the same AZ. That might be a major limitation and force you to adopt a multi-AZ deployment.
NoSQL DBs
Top considerations: Durability and availability, latency, performance
Example workloads: Cassandra, Elasticsearch, Redis, MongoDB
When running business-critical applications, you don’t want your database to go down, so we’ll first concentrate on the metrics that relate to reliably running a database in Kubernetes. Note that I’m not covering data protection methods in this article, only the availability and durability properties of the storage backend.
Since most of the databases mentioned above don’t officially support S3 (that’s only possible using unsupported plugins), I’m leaving it out of consideration for this workload. I’ll also exclude EFS from the comparison, since it only supports NFS, and the best practice for deploying these databases is to give them block storage that they see as a local disk (for example, over iSCSI).
For the most part, EBS can be a good option, offering the best latency of the bunch. These databases typically replicate data themselves, so you can usually set the number of replicas to match the data protection level you require. However, if your application requires near real-time consistency across AZs at the storage layer, that can only be achieved with a multi-AZ storage deployment, which would make FSx for ONTAP the only viable option.
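As an illustration of sizing replicas, the sketch below uses the majority-quorum rule from Cassandra’s QUORUM consistency level (a quorum is floor(RF / 2) + 1 replicas) to show how many replica failures a replication factor (RF) tolerates. If each replica sits in a different AZ, that number is also how many AZ outages the database can ride out without storage-level replication.

```python
def quorum(replication_factor: int) -> int:
    """Majority quorum size, as in Cassandra's QUORUM consistency level."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be lost while QUORUM reads and writes still succeed."""
    return replication_factor - quorum(replication_factor)


for rf in (1, 2, 3, 5):
    print(f"RF={rf}: quorum={quorum(rf)}, tolerates {tolerated_failures(rf)} failure(s)")
```

Note that RF=2 tolerates no failures at QUORUM, which is why three replicas across three AZs is a common starting point.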
When it comes to costs, keep in mind that running EBS at scale can become a significant expense. For large-scale deployments, FSx for ONTAP might be a better option. That’s because FSx for ONTAP volumes are thinly provisioned and supported by storage efficiency features, including auto-tiering of cold data to a capacity pool, data deduplication, compression, and compaction, all of which combine to significantly drive down storage costs.
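Here’s a rough sketch of why thin provisioning matters at scale. Thick-provisioned volumes (as with EBS) consume their full allocated size whether or not it’s written, while thinly provisioned volumes draw only what’s written from a shared pool, further reduced by storage efficiency. The replica count, sizes, and 1.5:1 efficiency ratio are hypothetical, and the model counts capacity only, not price.

```python
def thick_provisioned_gib(allocated_gib: list) -> float:
    # Every allocated GiB is consumed (and billed) up front.
    return float(sum(allocated_gib))


def thin_provisioned_gib(written_gib: list, efficiency_ratio: float) -> float:
    # Only written data consumes pool capacity, after dedup/compression/compaction.
    return sum(written_gib) / efficiency_ratio


# Hypothetical: three database replicas, each allocated 1,000 GiB but holding 400 GiB.
allocated = [1_000, 1_000, 1_000]
written = [400, 400, 400]
print(thick_provisioned_gib(allocated), "GiB consumed with thick provisioning")
print(thin_provisioned_gib(written, efficiency_ratio=1.5), "GiB consumed with thin provisioning")
```

In practice you still provision the FSx for ONTAP file system’s SSD capacity with some headroom, but that pool is shared by all volumes rather than sized per volume.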
Web applications
Top considerations: Durability and availability, scalability, cost
This workload describes scenarios such as file storage for a web server (such as nginx) or for a web content management system such as WordPress. This workload is very similar to the considerations for running file storage for SaaS applications, and everything covered above also applies here.
Queuing systems
Top considerations: Latency, durability and availability, cost
Example workloads: RabbitMQ, Kafka
Latency is key to a successful deployment of RabbitMQ or any other queuing system, and in that regard, EBS can be a good option. However, for durability and cost, you might want to consider FSx for ONTAP.
RabbitMQ supports cluster deployments across AZs for high availability. However, this incurs significant cross-AZ traffic costs and won’t provide near real-time consistency.
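To gauge that cross-AZ traffic cost, the sketch below estimates the monthly charge for replicating published messages to other AZs. It assumes one copy stays in the AZ where a message arrives and each remaining replica crosses an AZ boundary once; consumer traffic and cluster chatter are ignored. The $0.01/GB per-direction price is an assumption based on EC2 inter-AZ data transfer pricing in many regions (both the sending and receiving sides are charged); verify it for your region.

```python
def cross_az_replication_cost(ingest_gb: float, replicas: int,
                              price_per_gb_each_way: float = 0.01) -> float:
    """Monthly inter-AZ transfer cost for replicating ingested messages.

    Assumes (replicas - 1) copies cross an AZ boundary and that both the
    sending and receiving side are billed at price_per_gb_each_way.
    """
    cross_az_gb = ingest_gb * (replicas - 1)
    return cross_az_gb * price_per_gb_each_way * 2


# Hypothetical: 10,000 GB of messages per month, three replicas.
print(f"${cross_az_replication_cost(10_000, replicas=3):,.2f} per month")
```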
FSx for ONTAP carries a latency penalty compared to EBS, but it offers better cost efficiency and the option of a multi-AZ deployment.
Conclusion
The purpose of this article is not to replace proper evaluation and testing of the different storage options for your EKS application, but rather to help you narrow down your choices when picking a storage platform for it.
Below are some useful links that can help you in the deployment of the storage solutions for EKS: