Amazon MSK (Managed Streaming for Apache Kafka) Cheat Sheet for the AWS Certified Data Engineer - Associate (DEA-C01)
Core Concepts and Building Blocks
Amazon MSK (Managed Streaming for Apache Kafka) is a fully managed service that makes it easy to build and run applications that use Apache Kafka to process streaming data. MSK handles the control-plane operations (creating, updating, and deleting clusters), while the standard Apache Kafka data-plane operations (producing and consuming data) work exactly as they do with open-source Kafka.
Key components:
- Brokers: Kafka server instances that store and serve data
- Topics: Categories/feeds where records are stored and published
- Partitions: Divisions of topics for parallel processing
- Consumer Groups: Groups of consumers that work together to consume data
- ZooKeeper: Coordination service for Kafka (in traditional deployments)
- MSK Connect: Managed service for Kafka Connect to integrate with other data sources/sinks
- MSK Serverless: Serverless option with automatic scaling
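For orientation, here is a minimal control-plane sketch using boto3's `kafka` client, assuming configured AWS credentials and region; it lists clusters and fetches the bootstrap broker string that data-plane clients connect to. The auth-specific response keys depend on what your cluster has enabled.

```python
# A minimal sketch, assuming default AWS credentials/region are configured.
import boto3

kafka = boto3.client("kafka")

# Control plane: enumerate clusters (provisioned and serverless).
for cluster in kafka.list_clusters_v2()["ClusterInfoList"]:
    print(cluster["ClusterName"], cluster["State"], cluster["ClusterType"])

    # Data-plane clients connect via the bootstrap broker string
    # (the cluster must be ACTIVE for this call to succeed).
    brokers = kafka.get_bootstrap_brokers(ClusterArn=cluster["ClusterArn"])
    # The key depends on the auth method enabled, e.g. TLS or IAM:
    print(brokers.get("BootstrapBrokerStringTls")
          or brokers.get("BootstrapBrokerStringSaslIam"))
```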
MSK Cheat Sheet Points
MSK Deployment Options:
- Provisioned: You select instance types and configure capacity
- Serverless: Automatically provisions and scales compute and storage
MSK Provisioned Broker Types:
- kafka.t3.small (2 vCPU, 2 GiB RAM)
- kafka.m5.large/xlarge/2xlarge/4xlarge/8xlarge/12xlarge/16xlarge/24xlarge
- kafka.m7g.large/xlarge/2xlarge/4xlarge/8xlarge/12xlarge/16xlarge
- kafka.c5.large/xlarge/2xlarge/4xlarge/9xlarge/18xlarge
- kafka.c6g.large/xlarge/2xlarge/4xlarge/8xlarge/12xlarge/16xlarge
MSK Storage Options:
- EBS storage volumes attached to each broker
- Default: 1000 GiB per broker
- Maximum: 16384 GiB per broker
MSK Supported Kafka Versions: 1.1.1, 2.1.0, 2.2.1, 2.3.1, 2.4.1, 2.5.1, 2.6.0, 2.6.1, 2.6.2, 2.6.3, 2.7.0, 2.7.1, 2.7.2, 2.8.0, 2.8.1, 3.1.1, 3.2.0, 3.3.1, 3.3.2, 3.4.0, 3.5.1 (AWS adds and deprecates versions regularly, so check the documentation for the current list)
MSK Networking:
- Runs within your VPC
- Requires at least 2 subnets in different AZs
- Recommended: 3 subnets in different AZs for high availability
MSK Security Features:
- TLS encryption for in-transit data
- KMS encryption for data at rest
- IAM authentication for control plane
- SASL/SCRAM, IAM, or mTLS for client authentication
- Network isolation with VPC
MSK High Availability:
- Multi-AZ deployment with brokers distributed across AZs
- Automatic recovery from broker failures
- Replication factor determines data durability
MSK Monitoring:
- Integration with CloudWatch, Prometheus, and open-source monitoring tools
- Broker logs can be sent to CloudWatch Logs, S3, or Firehose
MSK Connect allows you to:
- Run Kafka Connect connectors without managing infrastructure
- Use built-in connectors or bring your own
- Scale workers automatically
MSK Serverless Features:
- No broker management
- Automatic scaling based on throughput
- Pay for cluster hours, partition hours, storage, and data in/out
MSK Limits:
- Provisioned: 1-30 brokers per cluster (soft limit)
- Serverless: Up to 6 Kafka units (KUs) per partition
- Maximum of 100 clusters per account (soft limit)
MSK Replication Factor:
- Default: 3 (recommended for production)
- Minimum: 2 for multi-AZ clusters
- Maximum: Equal to the number of brokers
MSK Partition Limits:
- Default: 100 partitions per broker (soft limit)
- Can be increased via a support ticket
MSK Storage Throughput:
- General Purpose (gp2): 3 IOPS/GiB
- Provisioned IOPS (io1): Up to 50 IOPS/GiB
MSK Network Throughput:
- Depends on instance type
- t3.small: Up to 5 Gbps
- m5.24xlarge: Up to 25 Gbps
MSK Pricing Components:
- Broker hours (for provisioned)
- Storage (per GB-month)
- Data transfer (per GB)
- MSK Connect worker hours (if used)
- Serverless: cluster hours, partition hours, storage, and data in/out
MSK Backup and Restore:
- Automated backups not provided
- Use MirrorMaker 2.0 for cross-cluster replication
- Use S3 connector for data backup
MSK Encryption Options:
- In-transit: TLS (enabled by default; plaintext client communication can optionally be allowed)
- At-rest: always encrypted with KMS (AWS managed key by default, or your own customer managed key)
- Client-broker: plaintext, TLS, SASL/SCRAM, IAM, or mTLS
MSK Authentication Methods:
- IAM access control
- SASL/SCRAM with AWS Secrets Manager
- TLS mutual authentication (mTLS)
- Unauthenticated access (not recommended)
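For IAM access control from a Python client, the pattern below follows the aws-msk-iam-sasl-signer-python package's documented usage with kafka-python; treat it as a sketch — the broker address is a placeholder, and you should verify the package's current API against its README.

```python
# Sketch: IAM client auth with kafka-python + aws-msk-iam-sasl-signer-python
# (pip install kafka-python aws-msk-iam-sasl-signer-python). Assumes valid
# AWS credentials and the IAM listener on port 9098.
from kafka import KafkaProducer
from aws_msk_iam_sasl_signer import MSKAuthTokenProvider


class IamTokenProvider:
    """Supplies a short-lived SigV4-signed OAUTHBEARER token."""

    def token(self):
        token, _expiry_ms = MSKAuthTokenProvider.generate_auth_token("us-east-1")
        return token


producer = KafkaProducer(
    bootstrap_servers="b-1.example.kafka.us-east-1.amazonaws.com:9098",  # placeholder
    security_protocol="SASL_SSL",
    sasl_mechanism="OAUTHBEARER",
    sasl_oauth_token_provider=IamTokenProvider(),
)
```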
MSK Configuration Types:
- Default configuration (applied when you don't specify one)
- Custom MSK configuration: a named, versioned set of Kafka server properties, reusable across clusters
MSK Cluster States:
- CREATING, ACTIVE, FAILED, DELETING, UPDATING, HEALING
MSK Maintenance Windows:
- Default: Random 2-hour window per week
- Customizable to specific day and time
MSK Patching:
- Security patches applied automatically
- Version upgrades must be initiated manually
MSK Monitoring Levels:
- DEFAULT: Basic metrics to CloudWatch
- PER_BROKER: Detailed broker-level metrics
- PER_TOPIC_PER_BROKER: Most granular metrics
- PER_TOPIC_PER_PARTITION: Partition-level metrics
MSK Open Monitoring:
- Prometheus and Grafana integration
- JMX Exporter and Node Exporter metrics
MSK Logging Options:
- Broker logs to CloudWatch Logs
- Broker logs to S3
- Broker logs to Firehose
MSK Connect Worker Configurations:
- Auto-scaled capacity: 1-10 workers
- Worker size in MSK Connect Units (MCUs): 1, 2, 4, or 8 MCUs per worker
- Each MCU provides 1 vCPU and 4 GiB of memory
MSK Connect Connector Types:
- Source connectors (data into Kafka)
- Sink connectors (data from Kafka)
- Custom connectors (uploaded as ZIP)
MSK Serverless Scaling:
- Scales from 1 to 6 Kafka Units (KUs) per partition
- 1 KU = 1 MBps ingress, 2 MBps egress, 5 TPS
MSK Serverless Limits:
- Maximum 120 partitions per topic
- Maximum 100 consumer groups per cluster
- Maximum 500 connections per broker
MSK Cluster Operations:
- Update broker count
- Update broker type
- Update configuration
- Update security settings
- Update monitoring settings
MSK Topic Operations:
- Create, delete, describe topics
- Update topic configurations
- List topics
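These topic operations work against MSK with any standard Kafka admin client; here is a sketch with kafka-python's admin API (broker address and topic name are placeholders):

```python
# Sketch of common topic operations with kafka-python's admin client.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="b-1.example:9094",
                         security_protocol="SSL")

# Create a topic with 10 partitions and replication factor 3.
admin.create_topics([NewTopic(name="orders", num_partitions=10,
                              replication_factor=3)])

print(admin.list_topics())       # list topics
admin.delete_topics(["orders"])  # delete a topic
```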
MSK Consumer Group Operations:
- List consumer groups
- Describe consumer groups
- Reset consumer group offsets
MSK Networking Requirements:
- Inbound rules for ports 9092 (plaintext), 9094 (TLS), 9096 (SASL), 9098 (IAM)
- Outbound rules for ZooKeeper (2181, 2182, 2183)
- VPC endpoints for AWS services
MSK Rebalancing:
- Adding brokers does not automatically move existing partitions onto them
- Use Kafka tools (kafka-reassign-partitions.sh or Cruise Control) to rebalance partitions after scaling out
MSK Storage Auto Scaling:
- Can be enabled to automatically increase storage
- Threshold: 70% utilization
- Maximum: 16384 GiB per broker
MSK Data Retention:
- Configurable per topic
- Default: 7 days (168 hours)
- Maximum: Limited by storage capacity
MSK Schema Registry Options:
- AWS Glue Schema Registry
- Self-managed Schema Registry
- Confluent Schema Registry
MSK Client Compatibility:
- Any Kafka client compatible with the cluster version
- AWS-specific clients for IAM authentication
MSK Cross-Region Replication:
- MirrorMaker 2.0 for cross-cluster replication
- Can be deployed on EC2 or using MSK Connect
MSK Disaster Recovery Options:
- Multi-AZ deployment for high availability
- Cross-region replication for disaster recovery
- S3 backups for long-term storage
MSK Performance Factors:
- Broker instance type
- Number of partitions
- Replication factor
- Batch size
- Compression settings
MSK Throughput Calculation:
- Producer throughput = Number of brokers × Instance throughput / Replication factor
- Consumer throughput = Number of brokers × Instance throughput
MSK Partition Calculation:
- Recommended partitions = max(Throughput in MB/s ÷ 1 MB/s, Number of consumers)
- Example: for 10 MB/s throughput and 5 consumers, use at least 10 partitions (see the sketch below)
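The two sizing heuristics above are plain arithmetic, so here is a worked version in Python; the 1 MB/s per-partition rate is the rule of thumb this cheat sheet uses, not a hard limit:

```python
# Worked versions of the throughput and partition sizing heuristics above.
def producer_throughput(brokers: int, per_broker_mbps: float,
                        replication_factor: int) -> float:
    return brokers * per_broker_mbps / replication_factor

def consumer_throughput(brokers: int, per_broker_mbps: float) -> float:
    return brokers * per_broker_mbps

def recommended_partitions(target_mbps: float, consumers: int,
                           per_partition_mbps: float = 1.0) -> int:
    return max(int(target_mbps / per_partition_mbps), consumers)

print(producer_throughput(3, 200, 3))   # 200.0 MB/s (3 x m5.xlarge, RF 3)
print(consumer_throughput(3, 200))      # 600 MB/s
print(recommended_partitions(10, 5))    # 10 partitions
```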
MSK vs. Self-Managed Kafka:
- MSK: Managed infrastructure, automatic scaling, integrated security
- Self-managed: Complete control, potentially lower cost, more configuration options
MSK vs. Kinesis Data Streams:
- MSK: Open-source compatibility, longer retention, more partitions
- Kinesis: Simpler setup, automatic scaling, native AWS integration
MSK Throttling:
- API rate limits: 100 TPS for control plane operations
- Data plane throttling based on broker capacity
- Exponential backoff recommended for retries
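For the control plane, the AWS SDKs already implement exponential backoff; a minimal boto3 sketch of opting into more aggressive retry behavior:

```python
# Sketch: let the AWS SDK retry throttled control-plane calls with
# exponential backoff ("adaptive" also adds client-side rate limiting).
import boto3
from botocore.config import Config

kafka = boto3.client("kafka", config=Config(
    retries={"mode": "adaptive", "max_attempts": 10}
))
clusters = kafka.list_clusters_v2()  # retried automatically on throttling
```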
MSK Connect Error Handling:
- Dead letter queues for failed records
- Configurable retry policies
- Error tolerance settings
MSK Replay Capabilities:
- Consumer groups track offsets
- Reset offsets to replay data
- Retention period limits how far back you can replay
MSK Data Ingestion Patterns:
- Direct producer applications
- MSK Connect source connectors
- AWS services integration (Lambda, Kinesis, etc.)
MSK Data Processing Patterns:
- Consumer applications
- Stream processing (Flink, Kafka Streams)
- Lambda integration
MSK Data Delivery Patterns:
- MSK Connect sink connectors
- Consumer applications writing to destinations
- Integration with AWS analytics services
MSK Latency Characteristics:
- End-to-end latency: typically 10-100 ms
- Affected by batch size, compression, and instance type
- Provisioned clusters generally deliver lower latency than Serverless
MSK Serverless Throughput Characteristics:
- Base capacity: 1 MBps ingress, 2 MBps egress per partition
- Scales up to 6x base capacity automatically
- Maximum: 6 MBps ingress, 12 MBps egress per partition
MSK Provisioned Throughput Characteristics:
- Depends on instance type
- t3.small: ~50 MBps
- m5.4xlarge: ~400 MBps
- m5.24xlarge: ~2 GBps
MSK Partition Strategies:
- Key-based: Same key to same partition
- Round-robin: Even distribution
- Custom: Implement custom partitioner
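A simple way to implement a custom strategy with kafka-python is to compute the partition yourself and pass it explicitly to `send()` (if you pass only a key, the default partitioner hashes it for you). Broker address, topic, and partition count below are placeholders:

```python
# Sketch: explicit key-based partition selection.
import hashlib
from kafka import KafkaProducer

NUM_PARTITIONS = 10  # must match the topic's actual partition count

def partition_for(key: bytes) -> int:
    # Stable hash so the same key always lands on the same partition.
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % NUM_PARTITIONS

producer = KafkaProducer(bootstrap_servers="b-1.example:9094")
key = b"customer-42"
producer.send("orders", key=key, value=b"payload",
              partition=partition_for(key))
producer.flush()
```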
MSK Consumer Group Strategies:
- Static membership: Reduces rebalancing
- Cooperative rebalancing: Minimizes disruption
- Eager rebalancing: Traditional approach
MSK Compression Options:
- gzip: Highest compression, higher CPU
- snappy: Balanced compression/CPU
- lz4: Fast compression, lower ratio
- zstd: Good compression, moderate CPU
MSK Security Best Practices:
- Enable encryption in-transit and at-rest
- Use IAM or SASL/SCRAM authentication
- Implement VPC security groups
- Use private subnets with VPC endpoints
MSK Monitoring Best Practices:
- Monitor broker CPU, memory, disk usage
- Track under-replicated partitions
- Monitor consumer lag
- Set up alerts for critical metrics
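As a sketch of the alerting point, here is a boto3 alarm on `UnderReplicatedPartitions`; the `Cluster Name`/`Broker ID` dimension names follow the AWS/Kafka namespace convention, but verify them against your cluster's metrics, and the SNS ARN is a placeholder:

```python
# Sketch: alarm when any partition on broker 1 is under-replicated.
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="msk-under-replicated-partitions",
    Namespace="AWS/Kafka",
    MetricName="UnderReplicatedPartitions",
    Dimensions=[{"Name": "Cluster Name", "Value": "my-cluster"},
                {"Name": "Broker ID", "Value": "1"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
)
```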
MSK Cost Optimization:
- Right-size broker instances
- Use Serverless for variable workloads
- Enable storage auto-scaling
- Compress data to reduce storage costs
MSK Troubleshooting Common Issues:
- Under-replicated partitions
- Consumer lag
- Connection issues
- Performance degradation
MSK Integration with AWS Services:
- Lambda: Trigger functions from Kafka topics
- Glue: Schema Registry and ETL jobs
- S3: Data sink for long-term storage
- CloudWatch: Monitoring and logging
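The Lambda integration works through an event source mapping that polls the topic on your behalf; a minimal boto3 sketch (ARNs, function name, and topic are placeholders):

```python
# Sketch: trigger a Lambda function from an MSK topic.
import boto3

lambda_client = boto3.client("lambda")
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kafka:us-east-1:123456789012:cluster/my-cluster/abc",
    FunctionName="process-orders",
    Topics=["orders"],
    StartingPosition="LATEST",
    BatchSize=100,
)
```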
MSK Quotas and Limits Management:
- Monitor quota usage in Service Quotas console
- Request quota increases for production workloads
- Implement client-side throttling
MSK Replayability Implementation:
- Set appropriate retention period
- Use consumer group offset management
- Consider compacted topics for state
MSK Data Ingestion Pipeline Design:
- Source → MSK → Processing → Storage/Analytics
- Consider exactly-once semantics if needed
- Implement proper error handling and DLQs
MSK ZooKeeper vs. KRaft Mode:
- ZooKeeper: Traditional metadata management
- KRaft: Kafka's built-in consensus protocol (newer)
- MSK supports both depending on version
MSK Partition Assignment Strategies:
- Range: Assigns consecutive partitions
- RoundRobin: Distributes evenly
- Sticky: Minimizes partition movement
- CooperativeSticky: Incremental rebalancing
MSK Producer Acknowledgment Modes (see the producer sketch below):
- acks=0: No acknowledgment (highest throughput, no durability)
- acks=1: Leader acknowledgment (balanced)
- acks=all: All in-sync replicas acknowledge (highest durability)
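A durability-oriented producer sketch with kafka-python that also applies one of the compression options listed earlier (snappy requires the python-snappy package; broker address and topic are placeholders):

```python
# Sketch: durable producer settings plus compression.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="b-1.example:9094",
    acks="all",                 # wait for all in-sync replicas
    retries=5,                  # retry transient broker errors
    compression_type="snappy",  # balanced compression vs. CPU
    linger_ms=10,               # small batching window for throughput
)
producer.send("orders", b"payload")
producer.flush()
```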
MSK Consumer Offset Commit Strategies:
- Auto-commit: periodic commits
- Manual commit: explicit control (see the consumer sketch below)
- Exactly-once: transactional commits
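A manual-commit sketch for at-least-once processing: commit only after a record is durably handled, so a crash replays unacknowledged records instead of losing them. Names below are placeholders:

```python
# Sketch: manual offset commits for at-least-once processing.
from kafka import KafkaConsumer

def handle(payload: bytes) -> None:
    print(payload)  # stand-in for real processing

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="b-1.example:9094",
    group_id="order-processor",
    enable_auto_commit=False,  # take explicit control of commits
    max_poll_records=200,      # also a consumer-throttling knob
)
for message in consumer:
    handle(message.value)
    consumer.commit()          # commit only after successful processing
```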
MSK Service Comparison Tables
MSK Deployment Options Comparison
Feature | MSK Provisioned | MSK Serverless |
---|---|---|
Scaling | Manual scaling | Automatic scaling |
Broker Management | Customer managed | Fully managed |
Capacity Planning | Required | Not required |
Cost Model | Pay for provisioned capacity | Pay per usage |
Max Brokers | 30 brokers (soft limit) | N/A (serverless) |
Max Partitions per Broker | 100 (soft limit) | N/A |
Max Partitions per Topic | No specific limit | 120 |
Kafka Version Control | Customer controlled | AWS managed |
Custom Configurations | Extensive | Limited |
Use Cases | Predictable workloads, specific requirements | Variable workloads, simplicity |
MSK vs. Self-Managed Kafka vs. Kinesis Data Streams
Feature | MSK | Self-Managed Kafka | Kinesis Data Streams |
---|---|---|---|
Management Overhead | Low | High | Low |
Scaling | Manual (Provisioned), Auto (Serverless) | Manual | Automatic |
Max Retention | Limited by storage | Limited by storage | Up to 365 days |
Throughput Limits | Based on instance type | Based on instance type | 1MB/s per shard (default) |
Partitioning | Topics with partitions | Topics with partitions | Shards |
Client Compatibility | Any Kafka client | Any Kafka client | Kinesis Client Library |
Open Source | Yes (Apache Kafka) | Yes (Apache Kafka) | No (proprietary) |
Pricing Model | Broker hours + storage | Infrastructure costs | Per shard hour + data transfer |
Integration | AWS services + any Kafka compatible | Any Kafka compatible | Native AWS integration |
Schema Management | Glue Schema Registry | Self-managed registry | No built-in schema registry |
MSK Authentication Methods Comparison
Authentication Method | Security Level | Setup Complexity | Use Case |
---|---|---|---|
Unauthenticated | Low | Low | Development only |
TLS | Medium | Medium | Basic security |
SASL/SCRAM | High | Medium | Username/password auth |
IAM | High | Low | AWS-integrated environments |
mTLS | Very High | High | Strict security requirements |
MSK Instance Types and Performance
Instance Type | vCPU | Memory (GiB) | Network Bandwidth | EBS Bandwidth | Typical Throughput |
---|---|---|---|---|---|
kafka.t3.small | 2 | 2 | Up to 5 Gbps | Up to 2.085 Gbps | ~50 MBps |
kafka.m5.large | 2 | 8 | Up to 10 Gbps | Up to 4.75 Gbps | ~100 MBps |
kafka.m5.xlarge | 4 | 16 | Up to 10 Gbps | Up to 4.75 Gbps | ~200 MBps |
kafka.m5.2xlarge | 8 | 32 | Up to 10 Gbps | Up to 4.75 Gbps | ~300 MBps |
kafka.m5.4xlarge | 16 | 64 | Up to 10 Gbps | 4.75 Gbps | ~400 MBps |
kafka.m5.12xlarge | 48 | 192 | 12 Gbps | 9.5 Gbps | ~1 GBps |
kafka.m5.24xlarge | 96 | 384 | 25 Gbps | 19 Gbps | ~2 GBps |
MSK Monitoring Metrics
Important CloudWatch Metrics for MSK
Metric | Description | Threshold | Impact |
---|---|---|---|
CPUUtilization | CPU utilization percentage | >80% | Performance degradation |
MemoryUsed | Memory used by broker | >80% | Potential OOM errors |
KafkaDataLogsDiskUsed | Disk space used for data logs | >85% | Risk of disk full errors |
NetworkProcessorAvgIdlePercent | Network thread idle time | <30% | Network bottleneck |
RequestTime | Time to process requests | >100ms | Increased latency |
UnderReplicatedPartitions | Partitions not fully replicated | >0 | Data durability risk |
OfflinePartitionsCount | Partitions with no leader | >0 | Data unavailability |
LeaderCount | Number of partitions led by broker | Imbalance | Uneven load |
ActiveControllerCount | Broker is controller (0 or 1) | 0 across cluster | Controller failure |
BytesInPerSec | Incoming bytes rate | Varies by instance | Ingress capacity planning |
BytesOutPerSec | Outgoing bytes rate | Varies by instance | Egress capacity planning |
MessagesInPerSec | Message ingestion rate | Varies by instance | Throughput monitoring |
ProduceTotalTimeMs | Producer request latency | >100ms | Producer performance issues |
ConsumeTotalTimeMs | Consumer request latency | >100ms | Consumer performance issues |
ReplicationBytesInPerSec | Replication ingress rate | Varies by instance | Replication load |
ReplicationBytesOutPerSec | Replication egress rate | Varies by instance | Replication load |
MSK Mind Map
Amazon MSK
├── Deployment Options
│ ├── Provisioned
│ └── Serverless
├── Core Components
│ ├── Brokers
│ ├── Topics
│ ├── Partitions
│ ├── Consumer Groups
│ └── ZooKeeper/KRaft
├── Security
│ ├── Encryption
│ │ ├── In-transit (TLS)
│ │ └── At-rest (KMS)
│ ├── Authentication
│ │ ├── IAM
│ │ ├── SASL/SCRAM
│ │ ├── mTLS
│ │ └── Unauthenticated
│ └── Network Security
│ ├── VPC
│ ├── Security Groups
│       └── AWS PrivateLink
├── Monitoring
│ ├── CloudWatch
│ ├── Prometheus
│ ├── Open Monitoring
│ └── Logging Options
│ ├── CloudWatch Logs
│ ├── S3
│ └── Firehose
├── MSK Connect
│ ├── Source Connectors
│ ├── Sink Connectors
│ ├── Custom Connectors
│ └── Worker Configuration
├── Integration
│ ├── AWS Services
│ │ ├── Lambda
│ │ ├── Glue
│ │ ├── S3
│ │ └── CloudWatch
│ └── External Systems
│ ├── Databases
│ ├── Data Lakes
│ └── Applications
├── Operations
│ ├── Cluster Management
│ ├── Topic Management
│ ├── Consumer Group Management
│ └── Maintenance
├── Performance
│ ├── Throughput
│ ├── Latency
│ ├── Scaling
│ └── Optimization
└── Data Management
├── Retention
├── Replication
├── Compaction
└── Schema Management
Apache Kafka Open Source Components in MSK
Amazon MSK is built on Apache Kafka, an open-source distributed event streaming platform. Here's how MSK implements and extends the open-source components:
Apache Kafka Core: MSK uses unmodified Apache Kafka for its core functionality, ensuring compatibility with standard Kafka clients and tools.
ZooKeeper: Traditional MSK clusters use Apache ZooKeeper for metadata management and broker coordination.
KRaft (Kafka Raft): Newer MSK versions support KRaft mode, which eliminates the ZooKeeper dependency.
Kafka Connect: MSK Connect is based on the open-source Kafka Connect framework for data integration.
MirrorMaker 2.0: MSK supports MirrorMaker 2.0 for cross-cluster replication.
Kafka Streams: Fully compatible with Kafka Streams for stream processing applications.
KSQL/ksqlDB: Not directly provided but can be deployed separately and used with MSK.
Schema Registry: MSK integrates with AWS Glue Schema Registry instead of the Confluent Schema Registry.
Kafka REST Proxy: Not directly provided but can be deployed separately.
Kafka UI Tools: Not directly provided but compatible with open-source tools like Kafka UI, AKHQ, etc.
MSK vs. Self-Managed Apache Kafka
Component | MSK Implementation | Self-Managed Implementation |
---|---|---|
Kafka Brokers | Fully managed | Self-managed on EC2/on-premises |
ZooKeeper | Fully managed | Self-managed on EC2/on-premises |
Monitoring | CloudWatch + Prometheus | Custom monitoring stack |
Security | AWS-integrated (IAM, KMS) | Custom security implementation |
Scaling | Console/API/CloudFormation | Manual or custom automation |
Upgrades | One-click version upgrades | Manual upgrade process |
Connect | MSK Connect (managed) | Self-managed Kafka Connect |
Schema Registry | AWS Glue Schema Registry | Confluent Schema Registry |
Cost | Pay for provisioned resources | Pay for underlying infrastructure |
Control | Limited to supported configurations | Complete control |
MSK Data Ingestion and Processing
Throughput and Latency Characteristics
Producer Throughput: Determined by broker instance type, network bandwidth, and replication factor.
Consumer Throughput: Determined by broker instance type, network bandwidth, and consumer parallelism.
End-to-End Latency: Typically 10-100ms, affected by:
- Network latency
- Broker load
- Batch size
- Acknowledgment settings
- Consumer poll frequency
Throughput Calculation Example:
- Cluster: 3 brokers of m5.xlarge (200 MBps each)
- Replication factor: 3
- Producer throughput = (3 × 200 MBps) ÷ 3 = 200 MBps
- Consumer throughput = 3 × 200 MBps = 600 MBps
Partition Calculation Example:
- Target throughput: 50 MBps
- Consumer count: 10
- Recommended partitions = max(50 ÷ 1, 10) = 50 partitions
Implementing Throttling and Overcoming Rate Limits
Producer Throttling:
- Use `max.block.ms` to control how long `send()` will block
- Set `buffer.memory` to control memory used for buffering
- Implement backoff retry logic for throttled requests
Consumer Throttling:
- Control `max.poll.records` to limit batch size
- Adjust `fetch.max.bytes` to control data volume
- Use `max.poll.interval.ms` to prevent rebalancing during processing
API Rate Limit Handling:
- Implement exponential backoff for control plane operations
- Cache results of frequent API calls
- Use AWS SDK retry mechanisms
Overcoming MSK Limits:
- Request quota increases for soft limits
- Distribute load across multiple clusters
- Optimize message size and batching
Replayability of Data Ingestion Pipelines
Offset Management:
- Store consumer group offsets in Kafka (`__consumer_offsets` topic)
- Use `auto.offset.reset` to control behavior for new consumers
- Implement manual offset management for precise control
Replay Strategies:
- Reset consumer group offsets to a specific timestamp
- Create a new consumer group to start from the beginning
- Use Kafka's `seek()` API for programmatic control (see the sketch below)
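A timestamp-based replay sketch with kafka-python: `offsets_for_times()` finds the first offset at or after a point in time, then `seek()` rewinds to it. Topic, group, and brokers are placeholders:

```python
# Sketch: rewind one partition to "one hour ago" and re-consume.
import time
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="b-1.example:9094",
                         group_id="replay-job",
                         enable_auto_commit=False)
tp = TopicPartition("orders", 0)
consumer.assign([tp])

ts_ms = int((time.time() - 3600) * 1000)   # one hour ago, in ms
offsets = consumer.offsets_for_times({tp: ts_ms})
if offsets[tp] is not None:                # None if no records after that time
    consumer.seek(tp, offsets[tp].offset)

for message in consumer:
    print(message.offset, message.value)
```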
Retention Configuration:
- Set `retention.ms` or `retention.bytes` at the topic level (see the sketch below)
- Use log compaction for key-based datasets
- Consider S3 archiving for long-term storage
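Setting retention programmatically is a one-call admin operation; a kafka-python sketch (broker and topic are placeholders, and the value is 3 days in milliseconds):

```python
# Sketch: set a topic's retention via the admin API.
from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType

admin = KafkaAdminClient(bootstrap_servers="b-1.example:9094")
admin.alter_configs([ConfigResource(
    ConfigResourceType.TOPIC, "orders",
    configs={"retention.ms": str(3 * 24 * 60 * 60 * 1000)},  # 3 days
)])
```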
Exactly-Once Processing:
- Use transactional producers
- Implement idempotent consumers
- Store offsets and results atomically
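A transactional-producer sketch for the exactly-once pattern; it uses the confluent-kafka package since kafka-python lacks full transaction support, and the broker address and `transactional.id` are placeholders:

```python
# Sketch: transactional produce so consumers see records atomically.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "b-1.example:9094",
    "transactional.id": "order-pipeline-1",  # must be stable per producer
    "enable.idempotence": True,
})
producer.init_transactions()
producer.begin_transaction()
try:
    producer.produce("orders", value=b"payload")
    producer.commit_transaction()   # records become visible atomically
except Exception:
    producer.abort_transaction()    # consumers never see aborted records
```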
Additional MSK Features and Best Practices
MSK Rebalance Detector: Helps identify and troubleshoot consumer group rebalancing issues.
MSK Tiered Storage: Separates storage from compute, allowing for cost-effective storage of large amounts of data.
MSK Multi-VPC Connectivity: Connect to MSK clusters from other VPCs and AWS accounts via managed multi-VPC private connectivity (AWS PrivateLink), or roll your own with VPC peering or a transit gateway.
MSK Private CA: Use AWS Private Certificate Authority for mTLS authentication.
MSK Cluster Policy: IAM resource policy to control access to MSK clusters.
MSK Best Practices:
- Deploy across 3 AZs for high availability
- Monitor and alert on under-replicated partitions
- Use appropriate replication factor (3 for production)
- Implement proper topic partitioning strategy
- Regularly update to latest Kafka versions
MSK Exam Tips:
- Understand differences between Provisioned and Serverless
- Know authentication and encryption options
- Be familiar with monitoring metrics and troubleshooting
- Understand integration patterns with other AWS services
- Know how to calculate throughput and partition requirements