Introduction
Amazon EMR on EKS (Elastic Kubernetes Service) is a service offering from Amazon Web Services (AWS) that allows users to run Apache Spark and other big data frameworks on Kubernetes clusters managed by Amazon EKS. This offering combines the capabilities of Amazon EMR (Elastic MapReduce), a managed big data processing service, with the flexibility and scalability of Kubernetes. With EMR on EKS, you can consolidate analytical workloads with your other Kubernetes-based applications on the same Amazon EKS cluster to improve resource utilization and simplify infrastructure management.
Here are some reasons why someone might choose Amazon EMR on EKS:
Flexibility: By leveraging Kubernetes, users can take advantage of its flexibility in managing containerized workloads. They can deploy, scale, and manage their big data applications using Kubernetes primitives.
Integration: Amazon EMR on EKS integrates seamlessly with other AWS services and tools. Users can easily integrate with AWS Identity and Access Management (IAM), Amazon S3 for data storage, and other AWS services.
Scalability: Kubernetes and Amazon EKS provide scalability features that allow users to dynamically scale their big data workloads based on demand. This ensures that resources are allocated efficiently and cost-effectively.
Cost-effectiveness: With Amazon EMR on EKS, users only pay for the resources they use. They can optimize resource allocation and scale resources up or down as needed, helping to manage costs effectively.
Containerization Benefits: Running big data workloads in containers provides several benefits such as improved resource utilization, easier management of dependencies, and consistent deployment across environments.
Open Standards: Kubernetes is an open-source platform with a large and active community. By using Kubernetes, users can take advantage of the ecosystem of tools and solutions built around it.
Security: Amazon EKS provides robust security features such as network isolation, IAM integration, and encryption to help secure big data workloads running on the platform.
Overall, Amazon EMR on EKS offers a powerful and flexible platform for running big data workloads, combining the strengths of Amazon EMR and Kubernetes to provide a scalable, cost-effective, and easy-to-manage solution.
Why Amazon EMR?
Amazon EMR (Elastic MapReduce) is a cloud-based big data processing service provided by Amazon Web Services (AWS). It simplifies the processing of large amounts of data using popular open-source frameworks such as Apache Hadoop, Apache Spark, Apache Hive, Apache HBase, Apache Flink, and Presto.
Here's a breakdown of what Amazon EMR is and its primary uses:
Big Data Processing: Amazon EMR enables you to process vast amounts of data quickly and cost-effectively. It allows you to run various distributed computing frameworks, such as Hadoop and Spark, on resizable clusters of Amazon EC2 instances.
Managed Service: Amazon EMR is fully managed, meaning AWS takes care of provisioning, configuring, and managing the underlying infrastructure. This allows users to focus on analyzing and deriving insights from their data rather than managing infrastructure.
Flexible and Scalable: EMR clusters can be easily scaled up or down based on workload requirements. You can start with a small cluster and scale it up as your data processing needs grow, and scale it down when the workload decreases, optimizing costs.
Integration with AWS Services: Amazon EMR integrates seamlessly with other AWS services like Amazon S3 (Simple Storage Service), Amazon DynamoDB, Amazon Redshift, and AWS Glue. This allows users to ingest data from various sources, store it in S3, process it using EMR, and analyze it with services like Redshift or visualize it with Amazon QuickSight.
Batch Processing and ETL: EMR is commonly used for batch processing tasks such as data transformation (ETL - Extract, Transform, Load), log analysis, data warehousing, and machine learning model training. It can handle diverse workloads from simple batch jobs to complex analytics pipelines.
Data Lake and Data Lake Analytics: With its integration with S3, Amazon EMR is often used as a foundational component of data lakes. It allows organizations to store vast amounts of structured and unstructured data in their S3 buckets and analyze it at scale using EMR and other analytics services.
Data Processing Workloads: Amazon EMR supports a wide range of data processing workloads including data preparation, data warehousing, machine learning, real-time analytics, and large-scale data processing for various industries such as finance, healthcare, retail, and media & entertainment.
Amazon EMR provides a powerful, flexible, and cost-effective solution for processing and analyzing large datasets, enabling organizations to derive valuable insights and make data-driven decisions.
Why Amazon EKS?
Amazon EKS (Elastic Kubernetes Service) is a managed Kubernetes service provided by Amazon Web Services (AWS). Kubernetes is an open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications. EKS simplifies the process of deploying, managing, and scaling Kubernetes clusters on AWS infrastructure.
Key features of Amazon EKS include:
Managed Kubernetes Control Plane: AWS manages the Kubernetes control plane, including the API server, scheduler, and etcd storage, ensuring high availability and scalability without requiring manual intervention from users.
Easy Cluster Deployment: With Amazon EKS, users can create Kubernetes clusters with a few clicks using the AWS Management Console, AWS CLI, or AWS SDKs. It abstracts the complexities of setting up and configuring Kubernetes, allowing users to focus on deploying and managing their applications.
Security and Compliance: Amazon EKS integrates with AWS Identity and Access Management (IAM) for authentication and authorization, allowing users to control access to Kubernetes resources using IAM policies. It also supports integration with AWS Key Management Service (KMS) for encryption of sensitive data.
Scalability and High Availability: EKS automatically scales the Kubernetes control plane to handle changes in workload and runs it across multiple Availability Zones for increased fault tolerance. Users can also scale worker nodes horizontally to accommodate changes in application demand.
Integration with AWS Services: EKS seamlessly integrates with other AWS services, such as Amazon Elastic Container Registry (ECR) for storing container images, Amazon VPC for networking, and Amazon CloudWatch for monitoring and logging.
Compatibility with Kubernetes Ecosystem: Amazon EKS is compatible with standard Kubernetes APIs and tools, allowing users to leverage the rich ecosystem of Kubernetes-compatible applications, tools, and libraries.
Cost-Effective Pricing Model: There are no upfront costs or long-term commitments. You pay an hourly fee for each EKS cluster, plus the cost of the compute that runs your worker nodes, which depends on the number and type of EC2 instances used.
Amazon EKS provides a reliable, scalable, and cost-effective platform for deploying and managing containerized applications using Kubernetes on AWS infrastructure. It is suitable for a wide range of use cases, from small development projects to large-scale production deployments.
How does it work?
Setting up Amazon EMR on EKS
Below are the steps one needs to follow:
1. Install the AWS CLI
2. Install eksctl
3. Set up an Amazon EKS cluster
4. Enable cluster access for Amazon EMR on EKS
5. Enable IAM Roles for Service Accounts (IRSA) on the EKS cluster
6. Create a job execution role
7. Update the trust policy of the job execution role
8. Grant users access to Amazon EMR on EKS
9. Register the Amazon EKS cluster with Amazon EMR
Note: I already have an EC2 instance running Amazon Linux with eksctl, kubectl, and the AWS CLI installed and configured, so I will skip steps 1 and 2 and start with step 3.
Set up an Amazon EKS cluster
eksctl create cluster \
  --name my-demo-cluster \
  --region ap-south-1 \
  --with-oidc \
  --instance-types=t3.medium \
  --managed
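The same cluster can also be described declaratively. The following is a minimal sketch, assuming you prefer an eksctl config file over command-line flags; the file name cluster.yaml and the node group name demo-nodes are illustrative assumptions, not part of the original setup.

```shell
# Write an eksctl ClusterConfig that mirrors the flags used above.
# CONFIG_FILE and the node group name "demo-nodes" are assumptions for illustration.
CONFIG_FILE="${CONFIG_FILE:-cluster.yaml}"

write_cluster_config() {
  # $1: path of the config file to write
  cat > "$1" <<'EOF'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-demo-cluster
  region: ap-south-1
iam:
  withOIDC: true
managedNodeGroups:
  - name: demo-nodes
    instanceType: t3.medium
EOF
}

write_cluster_config "$CONFIG_FILE"
echo "wrote $CONFIG_FILE"

# To create the cluster from the file (not executed here):
#   eksctl create cluster -f "$CONFIG_FILE"
```

Keeping the definition in a file makes it easier to review and to recreate the same cluster later.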
View and validate the cluster nodes:
kubectl get nodes -o wide
View the workloads running on your cluster:
kubectl get pods --all-namespaces -o wide
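Before moving on, it can help to confirm that every node has reached the Ready state rather than reading the table by eye. This is a small sketch, assuming kubectl is already pointed at the new cluster; the function names and the default timeout are my own choices.

```shell
# Block until all nodes report Ready, or fail after the timeout.
# $1: timeout understood by kubectl, e.g. 300s (assumed default)
wait_for_nodes() {
  kubectl wait --for=condition=Ready nodes --all --timeout="${1:-300s}"
}

# Print how many nodes have a STATUS column other than exactly "Ready".
count_not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" { n++ } END { print n + 0 }'
}

# Example usage (not executed here):
#   wait_for_nodes 300s && echo "not ready: $(count_not_ready_nodes)"
```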
Enable cluster access for Amazon EMR on EKS
You must allow Amazon EMR on EKS access to a specific namespace in your cluster. This involves three actions: creating a Kubernetes role, binding the role to a Kubernetes user, and mapping that Kubernetes user to the service-linked role AWSServiceRoleForAmazonEMRContainers. eksctl automates all three when the IAM identity mapping command is run with emr-containers as the service name:
eksctl create iamidentitymapping \
  --cluster my-demo-cluster \
  --namespace emrnamespace \
  --service-name "emr-containers"
Note: the namespace "emrnamespace" must exist before you run this command. I had already created it.
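If the namespace does not exist yet, it can be created up front. The sketch below assumes kubectl is configured for the cluster; ensure_namespace is a hypothetical helper name, and it only creates the namespace when it is missing.

```shell
# Create the namespace only if it is not already present.
# $1: namespace name
ensure_namespace() {
  ns="$1"
  if kubectl get namespace "$ns" >/dev/null 2>&1; then
    echo "namespace $ns already exists"
  else
    kubectl create namespace "$ns"
  fi
}

# Example usage (not executed here):
#   ensure_namespace emrnamespace
#   eksctl get iamidentitymapping --cluster my-demo-cluster   # list the resulting mappings
```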
Enable IAM Roles for Service Accounts (IRSA) on the EKS cluster
If your cluster supports IAM roles for service accounts, it has an OpenID Connect issuer URL associated with it. You can view this URL in the Amazon EKS console, or you can use the following AWS CLI command to retrieve it.
aws eks describe-cluster --name my-demo-cluster --query "cluster.identity.oidc.issuer" --output text
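Create an IAM OIDC identity provider for your cluster with eksctl. If the cluster was created with --with-oidc, as above, the provider is already associated and this command simply confirms it.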
eksctl utils associate-iam-oidc-provider --cluster my-demo-cluster --approve
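To check whether an IAM OIDC provider already exists for the cluster, you can compare the ID at the end of the issuer URL with the providers registered in IAM. This is a sketch under the assumption that the issuer URL has the usual form https://oidc.eks.REGION.amazonaws.com/id/ID; the function names are hypothetical.

```shell
# Extract the trailing ID from an OIDC issuer URL.
# $1: issuer URL, e.g. https://oidc.eks.ap-south-1.amazonaws.com/id/<ID>
oidc_id_from_issuer() {
  printf '%s\n' "$1" | cut -d '/' -f 5
}

# Succeed if IAM already lists an OIDC provider for the given cluster.
# $1: EKS cluster name
oidc_provider_exists() {
  oidc_issuer=$(aws eks describe-cluster --name "$1" \
    --query "cluster.identity.oidc.issuer" --output text)
  oidc_id=$(oidc_id_from_issuer "$oidc_issuer")
  # An empty ID would match every line, so treat it as "not found".
  [ -n "$oidc_id" ] || return 1
  aws iam list-open-id-connect-providers --output text | grep -q "$oidc_id"
}

# Example usage (not executed here):
#   oidc_provider_exists my-demo-cluster && echo "OIDC provider already associated"
```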
Create an IAM role for job execution
To run workloads on Amazon EMR on EKS, you need to create an IAM role. This role is referred to as the job execution role.
cat emr-trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "elasticmapreduce.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
aws iam create-role --role-name EMRContainers-JobExecutionRole --assume-role-policy-document file://emr-trust-policy.json
Next, we need to attach the required IAM policy to the role so that jobs can write logs to Amazon S3 and Amazon CloudWatch Logs.
cat EMRContainers-JobExecutionRole.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:PutLogEvents",
        "logs:CreateLogStream",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ],
      "Resource": [
        "arn:aws:logs:*:*:*"
      ]
    }
  ]
}
aws iam put-role-policy --role-name EMRContainers-JobExecutionRole --policy-name EMR-Containers-Job-Execution --policy-document file://EMRContainers-JobExecutionRole.json
Update trust relationship for job execution role
aws emr-containers update-role-trust-policy --cluster-name my-demo-cluster --namespace emrnamespace --role-name EMRContainers-JobExecutionRole
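After this command, the role's trust policy should contain an additional statement that allows sts:AssumeRoleWithWebIdentity for the cluster's OIDC provider. A quick way to confirm that, sketched here with a hypothetical helper name, is to read the trust policy back and look for that action.

```shell
# Succeed if the role's trust policy mentions sts:AssumeRoleWithWebIdentity.
# $1: IAM role name
trust_policy_has_web_identity() {
  aws iam get-role --role-name "$1" \
    --query "Role.AssumeRolePolicyDocument" --output json \
    | grep -q 'sts:AssumeRoleWithWebIdentity'
}

# Example usage (not executed here):
#   trust_policy_has_web_identity EMRContainers-JobExecutionRole \
#     && echo "trust policy updated" \
#     || echo "trust policy not updated yet"
```

This is only a text match on the policy document, so it confirms that the statement is present but does not validate its conditions.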
Register EKS cluster with EMR
Now create a virtual cluster, with a name of your choice, for the Amazon EKS cluster and namespace that you set up in the earlier steps.
aws emr-containers create-virtual-cluster --name my-virt-cluster --container-provider '{"id": "my-demo-cluster","type": "EKS","info": {"eksInfo": {"namespace": "emrnamespace"}}}'
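The job submission that follows relies on two shell variables, VIRTUAL_CLUSTER_ID and EMR_ROLE_ARN. One way to populate them, assuming the names my-virt-cluster and EMRContainers-JobExecutionRole used above, is to look both values up with the AWS CLI. The helper function names are my own.

```shell
# Print the ID of the RUNNING virtual cluster with the given name.
# $1: virtual cluster name
get_virtual_cluster_id() {
  aws emr-containers list-virtual-clusters \
    --query "virtualClusters[?name=='$1' && state=='RUNNING'].id" \
    --output text
}

# Print the ARN of the given IAM role.
# $1: IAM role name
get_role_arn() {
  aws iam get-role --role-name "$1" --query "Role.Arn" --output text
}

# Only attempt the lookups when the AWS CLI is available.
if command -v aws >/dev/null 2>&1; then
  export VIRTUAL_CLUSTER_ID="$(get_virtual_cluster_id my-virt-cluster)"
  export EMR_ROLE_ARN="$(get_role_arn EMRContainers-JobExecutionRole)"
  echo "VIRTUAL_CLUSTER_ID=$VIRTUAL_CLUSTER_ID"
  echo "EMR_ROLE_ARN=$EMR_ROLE_ARN"
fi
```

Filtering on the RUNNING state avoids picking up a terminated virtual cluster that happens to share the same name.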
Run Sample Workload
aws emr-containers start-job-run \
  --virtual-cluster-id=$VIRTUAL_CLUSTER_ID \
  --name=pi-2 \
  --execution-role-arn=$EMR_ROLE_ARN \
  --release-label=emr-6.2.0-latest \
  --job-driver='{
    "sparkSubmitJobDriver": {
      "entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py",
      "sparkSubmitParameters": "--conf spark.executor.instances=1 --conf spark.executor.memory=2G --conf spark.executor.cores=1 --conf spark.driver.cores=1"
    }
  }'
You should now see the running job in the Amazon EMR console, under your virtual cluster.
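The job state can also be followed from the command line. The sketch below polls describe-job-run until the job reaches a final state; it assumes you have the job run ID that start-job-run returned in its "id" field, and the function names and the 30-second default interval are my own choices.

```shell
# Succeed if the given job state is final.
is_terminal_state() {
  case "$1" in
    COMPLETED|FAILED|CANCELLED) return 0 ;;
    *) return 1 ;;
  esac
}

# Poll a job run until it finishes; succeed only if it COMPLETED.
# $1: virtual cluster ID, $2: job run ID, $3: poll interval in seconds (default 30)
wait_for_job() {
  while :; do
    job_state=$(aws emr-containers describe-job-run \
      --virtual-cluster-id "$1" --id "$2" \
      --query "jobRun.state" --output text) || return 2
    echo "job $2 state: $job_state"
    if is_terminal_state "$job_state"; then
      break
    fi
    sleep "${3:-30}"
  done
  [ "$job_state" = "COMPLETED" ]
}

# Example usage (not executed here), with JOB_RUN_ID set to the returned job run ID:
#   wait_for_job "$VIRTUAL_CLUSTER_ID" "$JOB_RUN_ID" 30
```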
Bingo, the demo is complete. Do not forget to delete the resources afterwards, or you will end up with a huge bill :)
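As a rough sketch of that cleanup, the commands below remove the resources created in this walkthrough, in reverse order. The run wrapper and the DRY_RUN variable are my own additions: by default the script only prints what it would do, and it assumes VIRTUAL_CLUSTER_ID still holds the ID of the virtual cluster.

```shell
# Print the command in dry-run mode (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Delete the virtual cluster, the job execution role, and the EKS cluster.
cleanup_demo() {
  : "${VIRTUAL_CLUSTER_ID:?set VIRTUAL_CLUSTER_ID first}"
  run aws emr-containers delete-virtual-cluster --id "$VIRTUAL_CLUSTER_ID"
  run aws iam delete-role-policy --role-name EMRContainers-JobExecutionRole \
    --policy-name EMR-Containers-Job-Execution
  run aws iam delete-role --role-name EMRContainers-JobExecutionRole
  run eksctl delete cluster --name my-demo-cluster --region ap-south-1
}

# Example usage (not executed here):
#   cleanup_demo              # dry run, prints the commands
#   DRY_RUN=0; cleanup_demo   # actually deletes the resources
```

Review the dry-run output first, since deleting the EKS cluster also removes anything else running on it.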