Gorakhnath Yadav for Hyperswitch

Transitioning from Kubernetes to EC2 for Enhanced Kafka Performance

Scaling distributed systems is never just about performance; it's also about cost and operational efficiency. While Kubernetes provided us with a solid foundation for container orchestration, we hit unexpected roadblocks when running Kafka clusters at scale.

  • Costs were rising due to inefficiencies in resource allocation.
  • Auto-scaling wasn’t handling our stateful workload well.
  • Kafka node management with Strimzi led to operational complexity.

After months of firefighting, we decided to move from Kubernetes to EC2, a transition that improved performance, simplified operations, and cut costs by 28%.

Here's the story of that journey: what worked, and the lessons we learned.

Why Kubernetes Wasn't Working for Us

1. Resource Allocation Inefficiencies

Kubernetes dynamically manages resources, but in our case, it led to hidden inefficiencies.

For example, when we requested 2 CPU cores and 8 GB of RAM for a broker, the resources actually available to it were often slightly lower (around 1.8 cores and 7.5 GB), partly because Kubernetes reserves a slice of every node for system components. The discrepancy seems trivial, but at scale it added up to significant wasted resources and unexpected cost overruns.

Imagine paying for a full tank of fuel, but your car only gets 90% of it. Over time, those missing liters add up.
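To see this gap on your own cluster, here is a minimal sketch using the official `kubernetes` Python client (not part of our tooling, purely illustrative) that compares each node's raw capacity with what Kubernetes actually leaves allocatable to pods:

```python
# Illustrative check: how much of each node's capacity is actually
# allocatable to pods. Assumes a working kubeconfig; this is not our
# production tooling, just a way to observe the gap described above.
from kubernetes import client, config

config.load_kube_config()  # uses the current kubeconfig context
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    capacity = node.status.capacity        # what the VM physically has
    allocatable = node.status.allocatable  # what pods can actually use
    print(
        f"{node.metadata.name}: "
        f"cpu {allocatable['cpu']}/{capacity['cpu']}, "
        f"memory {allocatable['memory']}/{capacity['memory']}"
    )
```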

2. Auto-Scaling Challenges for Stateful Workloads

Kubernetes' auto-scaling mechanism works well for stateless applications, but Kafka isn't stateless. When resources ran out, Kubernetes would restart our Kafka pods instead of scaling them efficiently.

This resulted in:

  • 15-second delays in message processing.
  • Increased latency during scaling events.
  • Operational headaches managing stateful workloads.
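One way to put numbers on those delays is to track produce-to-consume latency around scaling events. A rough sketch of such a probe using `kafka-python` follows; the broker address and topic name are placeholders, not our actual setup:

```python
# Rough end-to-end latency probe: send timestamped messages once a second
# and measure how long they take to come back on a consumer. Run it while
# a scaling event (or pod restart) is in progress. The broker address and
# topic are hypothetical placeholders.
import json
import threading
import time

from kafka import KafkaConsumer, KafkaProducer

BOOTSTRAP = "kafka:9092"    # placeholder bootstrap server
TOPIC = "latency-probe"     # placeholder probe topic

producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP,
    value_serializer=lambda v: json.dumps(v).encode(),
)
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BOOTSTRAP,
    value_deserializer=lambda v: json.loads(v.decode()),
    auto_offset_reset="latest",
)

def send_probes():
    while True:
        producer.send(TOPIC, {"sent_at": time.time()})
        producer.flush()
        time.sleep(1.0)

threading.Thread(target=send_probes, daemon=True).start()

for message in consumer:
    delay = time.time() - message.value["sent_at"]
    print(f"end-to-end delay: {delay:.2f}s")
```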

3. Kafka Node Management with Strimzi

Initially, we relied on Strimzi for managing Kafka clusters. However, it had major drawbacks:

  • New Kafka nodes often failed to integrate properly.
  • Manual intervention was required for every scaling event.
  • Overall Kafka performance was unpredictable.

Managing our Kafka clusters felt like playing whack-a-mole: every time we solved one issue, another would pop up.

The Decision to Move to EC2

After evaluating various alternatives, we decided to move Kafka from Kubernetes to EC2. This gave us more control over resource allocation, auto-scaling, and cluster management.

Here’s what changed:

1. Replacing Strimzi with a Custom Kafka Controller

Instead of relying on third-party tools, we built an in-house Kafka Controller tailored to our needs.

  • Seamless integration of new Kafka nodes
  • Automated scaling based on real-time workload analysis
  • Better cluster management with minimal manual intervention

Result? New Kafka nodes were now recognized and integrated automatically, the moment they came online.
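The controller itself is internal, but its core reconciliation loop is easy to sketch. Assuming broker hosts are EC2 instances tagged `role=kafka-broker` (the tag, region, and bootstrap address below are illustrative assumptions, not our actual configuration), the idea is to compare the instances that exist with the brokers the cluster actually knows about, using `boto3` and `confluent-kafka`:

```python
# Illustrative controller loop: compare running EC2 instances tagged as
# Kafka brokers with the brokers registered in the cluster metadata.
# Tag names, region, and the bootstrap address are assumptions, not the
# actual controller code.
import boto3
from confluent_kafka.admin import AdminClient

ec2 = boto3.client("ec2", region_name="us-east-1")
admin = AdminClient({"bootstrap.servers": "kafka:9092"})

def running_broker_instances():
    """Private IPs of running instances tagged role=kafka-broker."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:role", "Values": ["kafka-broker"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return {
        inst["PrivateIpAddress"]
        for res in resp["Reservations"]
        for inst in res["Instances"]
    }

def registered_brokers():
    """Hosts Kafka itself reports in its cluster metadata."""
    metadata = admin.list_topics(timeout=10)
    return {broker.host for broker in metadata.brokers.values()}

missing = running_broker_instances() - registered_brokers()
if missing:
    # In a real controller this would trigger bootstrap steps and, once
    # the broker has joined, a partition reassignment.
    print(f"brokers not yet registered: {missing}")
else:
    print("all broker instances are registered with the cluster")
```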

2. Precise Resource Allocation

Unlike Kubernetes, where we had limited control over resource provisioning, EC2 allowed us to:

  • Allocate exactly the CPU and memory we needed.
  • Avoid wasted resources and over-provisioning costs.

Example: Previously, we paid $180/month per instance on Kubernetes. After transitioning to EC2, this dropped to $130/month, saving 28% on infrastructure costs.
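On EC2, the resources you pay for are simply the instance type's spec, so right-sizing comes down to picking the right type. As a hedged illustration (the AMI, subnet, and instance type below are placeholders, not our actual setup), launching a broker host with `boto3` looks like this:

```python
# Launching a Kafka broker host with an exact instance type. The AMI ID,
# subnet, and tags are placeholders for illustration.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # placeholder broker AMI
    InstanceType="r5.xlarge",             # 4 vCPU / 32 GiB, exactly as specified
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",  # placeholder
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "role", "Value": "kafka-broker"}],
    }],
)
```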

3. Streamlined Kafka Node Support

With EC2, we could now:

  • Scale up Kafka nodes seamlessly without restarts.
  • Perform vertical scaling (switching to more powerful machines) with zero downtime.
  • Ensure predictable performance under peak loads.

Last month, we moved from a T-family (burstable) instance to a C-family (compute-optimized) instance on EC2 without downtime. If we had been on Kubernetes, this would have required:

  • Creating a new node group.
  • Rebalancing partitions manually.
  • Managing potential downtime.

Instead, on EC2, it was a simple instance upgrade with zero complexity, zero downtime.
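For reference, that upgrade boils down to a stop, one API call, and a start per broker. Here is a minimal `boto3` sketch; the instance ID and target type are placeholders, and done one broker at a time, Kafka replication keeps partitions available while each instance is briefly stopped:

```python
# Resizing one broker instance in place: stop it, change the instance type,
# start it again. Instance ID and target type are placeholders. Run this
# one broker at a time so replicas keep serving traffic.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
INSTANCE_ID = "i-0123456789abcdef0"   # placeholder broker instance

ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

ec2.modify_instance_attribute(
    InstanceId=INSTANCE_ID,
    InstanceType={"Value": "c5.2xlarge"},  # placeholder target type
)

ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])
```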

The Impact: Cost Savings & Efficiency Gains

Key Lessons & Takeaways

  • Not all workloads are ideal for Kubernetes – It’s great for general-purpose container orchestration but not always the best for stateful applications like Kafka.
  • Custom solutions can be worth it – Building an in-house Kafka Controller gave us better control and reliability.
  • Cost inefficiencies add up – Even small inefficiencies in resource allocation can result in thousands of dollars lost at scale.
  • EC2 provides better flexibility – We gained granular control over scaling and performance with EC2.

Conclusion

If you’re running Kafka on Kubernetes and experiencing similar issues, EC2 might be a better fit.

  • Who should consider moving? Teams struggling with stateful workloads on Kubernetes.
  • Who should stick with Kubernetes? Those managing stateless, highly dynamic applications.

Every infrastructure decision should be guided by workload needs, not just industry trends. Kubernetes is powerful, but for Kafka, EC2 provided the right balance of cost, performance, and operational efficiency for us.
