Scaling distributed systems is never just about performance; it's also about cost and operational efficiency. While Kubernetes gave us a solid foundation for container orchestration, we hit unexpected roadblocks when running Kafka clusters at scale:
- Costs were rising due to inefficiencies in resource allocation.
- Auto-scaling wasn’t handling our stateful workload well.
- Kafka node management with Strimzi led to operational complexity.
After months of firefighting, we decided to move from Kubernetes to EC2, a transition that improved performance, simplified operations, and cut costs by 28%.
Here’s the story of that journey, what worked, and the lessons we learned.
Why Kubernetes wasn’t working for us
1. Resource Allocation Inefficiencies
Kubernetes dynamically manages resources, but in our case, it led to hidden inefficiencies.
For example, when allocating 2 CPU cores and 8GB RAM, we observed that the resources actually provisioned were often slightly lower (1.8 CPU cores, 7.5GB RAM), a shortfall of about 10% on CPU and 6% on memory. This discrepancy may seem trivial, but at scale it resulted in significant wasted resources and unexpected cost overruns.
Imagine paying for a full tank of fuel, but your car only gets 90% of it. Over time, those missing liters add up.
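As a rough illustration of how that gap accumulates, the arithmetic behind the 2-core / 8GB example can be sketched in a few lines of Python. The per-pod figures come from the example above; the fleet size of 100 pods is a made-up number used only to show the scaling effect, not a figure from our cluster.

```python
def shortfall(requested: float, usable: float) -> float:
    """Return the fraction of a requested resource that was not usable."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    return (requested - usable) / requested


# Figures from the example: 2 cores / 8 GB allocated, 1.8 cores / 7.5 GB observed.
cpu_gap = shortfall(2.0, 1.8)
mem_gap = shortfall(8.0, 7.5)

# Hypothetical fleet size, chosen only to show how the gap adds up.
pods = 100
lost_cores = (2.0 - 1.8) * pods
lost_gb = (8.0 - 7.5) * pods

print(f"CPU shortfall per pod: {cpu_gap:.2%}")
print(f"Memory shortfall per pod: {mem_gap:.2%}")
print(f"Across {pods} pods: {lost_cores:.0f} cores and {lost_gb:.0f} GB paid for but unused")
```

Under that assumption, 100 such pods would leave roughly 20 cores and 50 GB of memory paid for but unavailable.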
2. Auto-Scaling Challenges for Stateful Applications
Kubernetes’ auto-scaling mechanism works well for stateless applications, but Kafka isn’t stateless. When resources ran out, Kubernetes would restart our Kafka application instead of efficiently scaling it.
This resulted in:
- 15-second delays in message processing.
- Increased latency during scaling events.
- Operational headaches managing stateful workloads.
3. Kafka Node Management with Strimzi
Initially, we relied on Strimzi for managing Kafka clusters. However, it had major drawbacks:
- New Kafka nodes often failed to integrate properly.
- Manual intervention was required for every scaling event.
- Overall Kafka performance was unpredictable.
Managing our Kafka clusters felt like playing whack-a-mole: every time we solved one issue, another would pop up.
The decision to move to EC2
After evaluating various alternatives, we decided to move Kafka from Kubernetes to EC2. This gave us more control over resource allocation, auto-scaling, and cluster management.
Here’s what changed:
1. Replacing Strimzi with a Custom Kafka Controller
Instead of relying on third-party tools, we built an in-house Kafka Controller tailored to our needs.
- Seamless integration of new Kafka nodes
- Automated scaling based on real-time workload analysis
- Better cluster management with minimal manual intervention
Result? Kafka nodes were now automatically recognized and integrated instantly.
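To give a feel for what "scaling based on workload analysis" can mean in practice, here is a minimal sketch of one possible scaling decision. It is not the code of our controller. The function name, the throughput-based sizing rule, the 30% headroom, and the minimum of three brokers are all assumptions chosen for illustration.

```python
import math


def desired_brokers(ingress_mb_s: float,
                    per_broker_mb_s: float,
                    headroom: float = 0.3,
                    min_brokers: int = 3) -> int:
    """Smallest broker count that carries the load with `headroom` spare capacity.

    All parameters are hypothetical: a real controller would also look at
    disk usage, partition counts, consumer lag, and so on.
    """
    if per_broker_mb_s <= 0:
        raise ValueError("per_broker_mb_s must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    # Capacity each broker may use once the safety margin is reserved.
    usable = per_broker_mb_s * (1 - headroom)
    return max(min_brokers, math.ceil(ingress_mb_s / usable))


# Example: 400 MB/s of ingress, brokers assumed to handle 50 MB/s each.
print(desired_brokers(400, 50))
```

Keeping the decision a pure function of measured load makes it easy to test separately from the code that actually launches or retires instances.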
2. Precise Resource Allocation
Unlike Kubernetes, where we had limited control over resource provisioning, EC2 allowed us to:
- Allocate exactly the CPU and memory we needed.
- Avoid wasted resources and over-provisioning costs.
Example: Previously, we paid $180/month per instance on Kubernetes. After transitioning to EC2, this dropped to $130/month, a saving of about 28% on infrastructure costs.
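The 28% figure follows directly from the two per-instance prices. A quick check in Python, using only the $180 and $130 figures quoted above:

```python
def percent_saved(before: float, after: float) -> float:
    """Return the saving as a percentage of the original cost."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


saving = percent_saved(180.0, 130.0)
monthly_delta = 180.0 - 130.0
yearly_delta = monthly_delta * 12

print(f"{saving:.1f}% saved")                              # 27.8% saved
print(f"${monthly_delta:.0f} per instance per month")      # $50 per instance per month
print(f"${yearly_delta:.0f} per instance per year")        # $600 per instance per year
```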
3. Streamlined Kafka Node Support
With EC2, we could now:
- Scale up Kafka nodes seamlessly without restarts.
- Perform vertical scaling (switching to more powerful machines) with zero downtime.
- Ensure predictable performance under peak loads.
Last month, we moved from a T-class instance to a C-class instance in EC2 without downtime. If we had been on Kubernetes, this would have required:
- Creating a new node group.
- Rebalancing partitions manually.
- Managing potential downtime.
Instead, on EC2, it was a simple instance upgrade with no added complexity and no downtime.
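To make "rebalancing partitions manually" concrete: Kafka's stock kafka-reassign-partitions.sh tool takes a JSON plan that lists the replica brokers for each partition. The sketch below generates such a plan with a simple round-robin placement. The topic name, partition count, and broker IDs are invented for illustration, and this is not the procedure we ran; it only shows the kind of artifact a manual rebalance involves.

```python
import json


def build_reassignment(topic: str, partitions: int,
                       brokers: list[int], replication_factor: int) -> dict:
    """Build a reassignment plan in the JSON shape kafka-reassign-partitions.sh reads.

    Replicas are placed round-robin; real plans usually also account for
    racks, disk usage, and leader balance.
    """
    if replication_factor > len(brokers):
        raise ValueError("replication_factor cannot exceed the number of brokers")
    entries = []
    for p in range(partitions):
        # Consecutive brokers starting at offset p, wrapping around the list.
        replicas = [brokers[(p + i) % len(brokers)]
                    for i in range(replication_factor)]
        entries.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": entries}


# Hypothetical topic and broker IDs.
plan = build_reassignment("events", partitions=6, brokers=[1, 2, 3, 4],
                          replication_factor=3)
print(json.dumps(plan, indent=2))
```

The resulting file would then be passed to the tool with --reassignment-json-file and --execute, and the data movement monitored until it completes.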
The Impact: Cost Savings & Efficiency Gains
Key Lessons & Takeaways
- Not all workloads are ideal for Kubernetes – It’s great for general-purpose container orchestration but not always the best for stateful applications like Kafka.
- Custom solutions can be worth it – Building an in-house Kafka Controller gave us better control and reliability.
- Cost inefficiencies add up – Even small inefficiencies in resource allocation can result in thousands of dollars lost at scale.
- EC2 provides better flexibility – We gained granular control over scaling and performance with EC2.
Conclusion
If you’re running Kafka on Kubernetes and experiencing similar issues, EC2 might be a better fit.
- Who should consider moving? Teams struggling with stateful workloads on Kubernetes.
- Who should stick with Kubernetes? Those managing stateless, highly dynamic applications.
Every infrastructure decision should be guided by workload needs, not just industry trends. Kubernetes is powerful, but for Kafka, EC2 provided the right balance of cost, performance, and operational efficiency for us.