Amazon MSK (Managed Streaming for Apache Kafka) Cheat Sheet for the AWS Certified Data Engineer - Associate (DEA-C01)
Core Concepts and Building Blocks
Amazon MSK (Managed Streaming for Apache Kafka) is a fully managed service that makes it easy to build and run applications that use Apache Kafka to process streaming data. MSK handles the control-plane operations (creating, updating, and deleting clusters), while the standard Apache Kafka data-plane operations (producing and consuming data) work exactly as they do with open-source Kafka.
Key components:
- Brokers: Kafka server instances that store and serve data
- Topics: Categories/feeds where records are stored and published
- Partitions: Divisions of topics for parallel processing
- Consumer Groups: Groups of consumers that work together to consume data
- ZooKeeper: Coordination service for Kafka (in traditional deployments)
- MSK Connect: Managed service for Kafka Connect to integrate with other data sources/sinks
- MSK Serverless: Serverless option with automatic scaling
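For orientation, here is a minimal control-plane sketch using boto3's `kafka` client, assuming configured AWS credentials and region; it lists clusters and fetches the bootstrap broker string that data-plane clients connect to. The auth-specific response keys depend on what your cluster has enabled.

```python
# A minimal sketch, assuming default AWS credentials/region are configured.
import boto3

kafka = boto3.client("kafka")

# Control plane: enumerate clusters (provisioned and serverless).
for cluster in kafka.list_clusters_v2()["ClusterInfoList"]:
    print(cluster["ClusterName"], cluster["State"], cluster["ClusterType"])

    # Data-plane clients connect via the bootstrap broker string
    # (the cluster must be ACTIVE for this call to succeed).
    brokers = kafka.get_bootstrap_brokers(ClusterArn=cluster["ClusterArn"])
    # The key depends on the auth method enabled, e.g. TLS or IAM:
    print(brokers.get("BootstrapBrokerStringTls")
          or brokers.get("BootstrapBrokerStringSaslIam"))
```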
MSK Cheat Sheet Points
MSK Deployment Options:
- Provisioned: You select instance types and configure capacity
- Serverless: Automatically provisions and scales compute and storage
MSK Provisioned Broker Types:
- kafka.t3.small (2 vCPU, 2 GiB RAM)
- kafka.m5.large/xlarge/2xlarge/4xlarge/8xlarge/12xlarge/16xlarge/24xlarge
- kafka.m7g.large/xlarge/2xlarge/4xlarge/8xlarge/12xlarge/16xlarge
- kafka.c5.large/xlarge/2xlarge/4xlarge/9xlarge/18xlarge
- kafka.c6g.large/xlarge/2xlarge/4xlarge/8xlarge/12xlarge/16xlarge
MSK Storage Options:
- EBS storage volumes attached to each broker
- Default: 1000 GiB per broker
- Maximum: 16384 GiB per broker
MSK Supported Kafka Versions: 1.1.1, 2.1.0, 2.2.1, 2.3.1, 2.4.1, 2.5.1, 2.6.0, 2.6.1, 2.6.2, 2.6.3, 2.7.0, 2.7.1, 2.7.2, 2.8.0, 2.8.1, 3.1.1, 3.2.0, 3.3.1, 3.3.2, 3.4.0, 3.5.1 (AWS adds and deprecates versions regularly, so check the documentation for the current list)
MSK Networking:
- Runs within your VPC
- Requires at least 2 subnets in different AZs
- Recommended: 3 subnets in different AZs for high availability
MSK Security Features:
- TLS encryption for in-transit data
- KMS encryption for data at rest
- IAM authentication for control plane
- SASL/SCRAM, IAM, or mTLS for client authentication
- Network isolation with VPC
MSK High Availability:
- Multi-AZ deployment with brokers distributed across AZs
- Automatic recovery from broker failures
- Replication factor determines data durability
MSK Monitoring:
- Integration with CloudWatch, Prometheus, and open-source monitoring tools
- Broker logs can be sent to CloudWatch Logs, S3, or Firehose
MSK Connect allows you to:
- Run Kafka Connect connectors without managing infrastructure
- Use built-in connectors or bring your own
- Scale workers automatically
MSK Serverless Features:
- No broker management
- Automatic scaling based on throughput
- Pay for cluster hours, partition hours, storage, and data in/out
MSK Limits:
- Provisioned: 1-30 brokers per cluster (soft limit)
- Serverless: Up to 6 Kafka units (KUs) per partition
- Maximum of 100 clusters per account (soft limit)
MSK Replication Factor:
- Default: 3 (recommended for production)
- Minimum: 2 for multi-AZ clusters
- Maximum: Equal to the number of brokers
MSK Partition Limits:
- Default: 100 partitions per broker (soft limit)
- Can be increased via a support ticket
MSK Storage Throughput:
- General Purpose (gp2): 3 IOPS/GiB
- Provisioned IOPS (io1): Up to 50 IOPS/GiB
MSK Network Throughput:
- Depends on instance type
- t3.small: Up to 5 Gbps
- m5.24xlarge: Up to 25 Gbps
MSK Pricing Components:
- Broker hours (for provisioned)
- Storage (per GB-month)
- Data transfer (per GB)
- MSK Connect worker hours (if used)
- Serverless: cluster hours, partition hours, storage, and data in/out
MSK Backup and Restore:
- Automated backups not provided
- Use MirrorMaker 2.0 for cross-cluster replication
- Use S3 connector for data backup
MSK Encryption Options:
- In-transit: TLS (enabled by default; plaintext client communication can optionally be allowed)
- At-rest: always encrypted with KMS (AWS managed key by default, or your own customer managed key)
- Client-broker: plaintext, TLS, SASL/SCRAM, IAM, or mTLS
MSK Authentication Methods:
- IAM access control
- SASL/SCRAM with AWS Secrets Manager
- TLS mutual authentication (mTLS)
- Unauthenticated access (not recommended)
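For IAM access control from a Python client, the pattern below follows the aws-msk-iam-sasl-signer-python package's documented usage with kafka-python; treat it as a sketch — the broker address is a placeholder, and you should verify the package's current API against its README.

```python
# Sketch: IAM client auth with kafka-python + aws-msk-iam-sasl-signer-python
# (pip install kafka-python aws-msk-iam-sasl-signer-python). Assumes valid
# AWS credentials and the IAM listener on port 9098.
from kafka import KafkaProducer
from aws_msk_iam_sasl_signer import MSKAuthTokenProvider


class IamTokenProvider:
    """Supplies a short-lived SigV4-signed OAUTHBEARER token."""

    def token(self):
        token, _expiry_ms = MSKAuthTokenProvider.generate_auth_token("us-east-1")
        return token


producer = KafkaProducer(
    bootstrap_servers="b-1.example.kafka.us-east-1.amazonaws.com:9098",  # placeholder
    security_protocol="SASL_SSL",
    sasl_mechanism="OAUTHBEARER",
    sasl_oauth_token_provider=IamTokenProvider(),
)
```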
MSK Configuration Types:
- Default configuration (applied when you don't specify one)
- Custom MSK configuration: a named, versioned set of Kafka server properties, reusable across clusters
MSK Cluster States:
- CREATING, ACTIVE, FAILED, DELETING, UPDATING, HEALING
MSK Maintenance Windows:
- Default: Random 2-hour window per week
- Customizable to specific day and time
MSK Patching:
- Security patches applied automatically
- Version upgrades must be initiated manually
MSK Monitoring Levels:
- DEFAULT: Basic metrics to CloudWatch
- PER_BROKER: Detailed broker-level metrics
- PER_TOPIC_PER_BROKER: Most granular metrics
- PER_TOPIC_PER_PARTITION: Partition-level metrics
MSK Open Monitoring:
- Prometheus and Grafana integration
- JMX Exporter and Node Exporter metrics
MSK Logging Options:
- Broker logs to CloudWatch Logs
- Broker logs to S3
- Broker logs to Firehose
MSK Connect Worker Configurations:
- Auto-scaled capacity: 1-10 workers
- Worker size in MSK Connect Units (MCUs): 1, 2, 4, or 8 MCUs per worker
- Each MCU provides 1 vCPU and 4 GiB of memory
MSK Connect Connector Types:
- Source connectors (data into Kafka)
- Sink connectors (data from Kafka)
- Custom connectors (uploaded as ZIP)
MSK Serverless Scaling:
- Scales from 1 to 6 Kafka Units (KUs) per partition
- 1 KU = 1 MBps ingress, 2 MBps egress, 5 TPS
MSK Serverless Limits:
- Maximum 120 partitions per topic
- Maximum 100 consumer groups per cluster
- Maximum 500 connections per broker
MSK Cluster Operations:
- Update broker count
- Update broker type
- Update configuration
- Update security settings
- Update monitoring settings
MSK Topic Operations:
- Create, delete, describe topics
- Update topic configurations
- List topics
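These topic operations work against MSK with any standard Kafka admin client; here is a sketch with kafka-python's admin API (broker address and topic name are placeholders):

```python
# Sketch of common topic operations with kafka-python's admin client.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="b-1.example:9094",
                         security_protocol="SSL")

# Create a topic with 10 partitions and replication factor 3.
admin.create_topics([NewTopic(name="orders", num_partitions=10,
                              replication_factor=3)])

print(admin.list_topics())       # list topics
admin.delete_topics(["orders"])  # delete a topic
```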
MSK Consumer Group Operations:
- List consumer groups
- Describe consumer groups
- Reset consumer group offsets
MSK Networking Requirements:
- Inbound rules for ports 9092 (plaintext), 9094 (TLS), 9096 (SASL), 9098 (IAM)
- Outbound rules for ZooKeeper (2181, 2182, 2183)
- VPC endpoints for AWS services
MSK Rebalancing:
- Adding brokers does not automatically move existing partitions onto them
- Use Kafka tools (kafka-reassign-partitions.sh or Cruise Control) to rebalance partitions after scaling out
MSK Storage Auto Scaling:
- Can be enabled to automatically increase storage
- Threshold: 70% utilization
- Maximum: 16384 GiB per broker
MSK Data Retention:
- Configurable per topic
- Default: 7 days (168 hours)
- Maximum: Limited by storage capacity
MSK Schema Registry Options:
- AWS Glue Schema Registry
- Self-managed Schema Registry
- Confluent Schema Registry
MSK Client Compatibility:
- Any Kafka client compatible with the cluster version
- AWS-specific clients for IAM authentication
MSK Cross-Region Replication:
- MirrorMaker 2.0 for cross-cluster replication
- Can be deployed on EC2 or using MSK Connect
MSK Disaster Recovery Options:
- Multi-AZ deployment for high availability
- Cross-region replication for disaster recovery
- S3 backups for long-term storage
MSK Performance Factors:
- Broker instance type
- Number of partitions
- Replication factor
- Batch size
- Compression settings
MSK Throughput Calculation:
- Producer throughput = Number of brokers × Instance throughput / Replication factor
- Consumer throughput = Number of brokers × Instance throughput
MSK Partition Calculation:
- Recommended partitions = max(Throughput in MB/s ÷ 1 MB/s, Number of consumers)
- Example: for 10 MB/s throughput and 5 consumers, use at least 10 partitions (see the sketch below)
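The two sizing heuristics above are plain arithmetic, so here is a worked version in Python; the 1 MB/s per-partition rate is the rule of thumb this cheat sheet uses, not a hard limit:

```python
# Worked versions of the throughput and partition sizing heuristics above.
def producer_throughput(brokers: int, per_broker_mbps: float,
                        replication_factor: int) -> float:
    return brokers * per_broker_mbps / replication_factor

def consumer_throughput(brokers: int, per_broker_mbps: float) -> float:
    return brokers * per_broker_mbps

def recommended_partitions(target_mbps: float, consumers: int,
                           per_partition_mbps: float = 1.0) -> int:
    return max(int(target_mbps / per_partition_mbps), consumers)

print(producer_throughput(3, 200, 3))   # 200.0 MB/s (3 x m5.xlarge, RF 3)
print(consumer_throughput(3, 200))      # 600 MB/s
print(recommended_partitions(10, 5))    # 10 partitions
```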
MSK vs. Self-Managed Kafka:
- MSK: Managed infrastructure, automatic scaling, integrated security
- Self-managed: Complete control, potentially lower cost, more configuration options
MSK vs. Kinesis Data Streams:
- MSK: Open-source compatibility, longer retention, more partitions
- Kinesis: Simpler setup, automatic scaling, native AWS integration
MSK Throttling:
- API rate limits: 100 TPS for control plane operations
- Data plane throttling based on broker capacity
- Exponential backoff recommended for retries
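For the control plane, the AWS SDKs already implement exponential backoff; a minimal boto3 sketch of opting into more aggressive retry behavior:

```python
# Sketch: let the AWS SDK retry throttled control-plane calls with
# exponential backoff ("adaptive" also adds client-side rate limiting).
import boto3
from botocore.config import Config

kafka = boto3.client("kafka", config=Config(
    retries={"mode": "adaptive", "max_attempts": 10}
))
clusters = kafka.list_clusters_v2()  # retried automatically on throttling
```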
MSK Connect Error Handling:
- Dead letter queues for failed records
- Configurable retry policies
- Error tolerance settings
MSK Replay Capabilities:
- Consumer groups track offsets
- Reset offsets to replay data
- Retention period limits how far back you can replay
MSK Data Ingestion Patterns:
- Direct producer applications
- MSK Connect source connectors
- AWS services integration (Lambda, Kinesis, etc.)
MSK Data Processing Patterns:
- Consumer applications
- Stream processing (Flink, Kafka Streams)
- Lambda integration
MSK Data Delivery Patterns:
- MSK Connect sink connectors
- Consumer applications writing to destinations
- Integration with AWS analytics services
MSK Latency Characteristics:
- End-to-end latency: typically 10-100 ms
- Affected by batch size, compression, and instance type
- Provisioned clusters generally deliver lower latency than Serverless
MSK Serverless Throughput Characteristics:
- Base capacity: 1 MBps ingress, 2 MBps egress per partition
- Scales up to 6x base capacity automatically
- Maximum: 6 MBps ingress, 12 MBps egress per partition
MSK Provisioned Throughput Characteristics:
- Depends on instance type
- t3.small: ~50 MBps
- m5.4xlarge: ~400 MBps
- m5.24xlarge: ~2 GBps
MSK Partition Strategies:
- Key-based: Same key to same partition
- Round-robin: Even distribution
- Custom: Implement custom partitioner
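A simple way to implement a custom strategy with kafka-python is to compute the partition yourself and pass it explicitly to `send()` (if you pass only a key, the default partitioner hashes it for you). Broker address, topic, and partition count below are placeholders:

```python
# Sketch: explicit key-based partition selection.
import hashlib
from kafka import KafkaProducer

NUM_PARTITIONS = 10  # must match the topic's actual partition count

def partition_for(key: bytes) -> int:
    # Stable hash so the same key always lands on the same partition.
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % NUM_PARTITIONS

producer = KafkaProducer(bootstrap_servers="b-1.example:9094")
key = b"customer-42"
producer.send("orders", key=key, value=b"payload",
              partition=partition_for(key))
producer.flush()
```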
MSK Consumer Group Strategies:
- Static membership: Reduces rebalancing
- Cooperative rebalancing: Minimizes disruption
- Eager rebalancing: Traditional approach
MSK Compression Options:
- gzip: Highest compression, higher CPU
- snappy: Balanced compression/CPU
- lz4: Fast compression, lower ratio
- zstd: Good compression, moderate CPU
MSK Security Best Practices:
- Enable encryption in-transit and at-rest
- Use IAM or SASL/SCRAM authentication
- Implement VPC security groups
- Use private subnets with VPC endpoints
MSK Monitoring Best Practices:
- Monitor broker CPU, memory, disk usage
- Track under-replicated partitions
- Monitor consumer lag
- Set up alerts for critical metrics
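As a sketch of the alerting point, here is a boto3 alarm on `UnderReplicatedPartitions`; the `Cluster Name`/`Broker ID` dimension names follow the AWS/Kafka namespace convention, but verify them against your cluster's metrics, and the SNS ARN is a placeholder:

```python
# Sketch: alarm when any partition on broker 1 is under-replicated.
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="msk-under-replicated-partitions",
    Namespace="AWS/Kafka",
    MetricName="UnderReplicatedPartitions",
    Dimensions=[{"Name": "Cluster Name", "Value": "my-cluster"},
                {"Name": "Broker ID", "Value": "1"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
)
```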
MSK Cost Optimization:
- Right-size broker instances
- Use Serverless for variable workloads
- Enable storage auto-scaling
- Compress data to reduce storage costs
MSK Troubleshooting Common Issues:
- Under-replicated partitions
- Consumer lag
- Connection issues
- Performance degradation
MSK Integration with AWS Services:
- Lambda: Trigger functions from Kafka topics
- Glue: Schema Registry and ETL jobs
- S3: Data sink for long-term storage
- CloudWatch: Monitoring and logging
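The Lambda integration works through an event source mapping that polls the topic on your behalf; a minimal boto3 sketch (ARNs, function name, and topic are placeholders):

```python
# Sketch: trigger a Lambda function from an MSK topic.
import boto3

lambda_client = boto3.client("lambda")
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kafka:us-east-1:123456789012:cluster/my-cluster/abc",
    FunctionName="process-orders",
    Topics=["orders"],
    StartingPosition="LATEST",
    BatchSize=100,
)
```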
MSK Quotas and Limits Management:
- Monitor quota usage in Service Quotas console
- Request quota increases for production workloads
- Implement client-side throttling
MSK Replayability Implementation:
- Set appropriate retention period
- Use consumer group offset management
- Consider compacted topics for state
MSK Data Ingestion Pipeline Design:
- Source → MSK → Processing → Storage/Analytics
- Consider exactly-once semantics if needed
- Implement proper error handling and DLQs
MSK ZooKeeper vs. KRaft Mode:
- ZooKeeper: Traditional metadata management
- KRaft: Kafka's built-in consensus protocol (newer)
- MSK supports both depending on version
MSK Partition Assignment Strategies:
- Range: Assigns consecutive partitions
- RoundRobin: Distributes evenly
- Sticky: Minimizes partition movement
- CooperativeSticky: Incremental rebalancing
MSK Producer Acknowledgment Modes (see the producer sketch below):
- acks=0: No acknowledgment (highest throughput, no durability)
- acks=1: Leader acknowledgment (balanced)
- acks=all: All in-sync replicas acknowledge (highest durability)
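A durability-oriented producer sketch with kafka-python that also applies one of the compression options listed earlier (snappy requires the python-snappy package; broker address and topic are placeholders):

```python
# Sketch: durable producer settings plus compression.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="b-1.example:9094",
    acks="all",                 # wait for all in-sync replicas
    retries=5,                  # retry transient broker errors
    compression_type="snappy",  # balanced compression vs. CPU
    linger_ms=10,               # small batching window for throughput
)
producer.send("orders", b"payload")
producer.flush()
```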
MSK Consumer Offset Commit Strategies:
- Auto-commit: periodic commits
- Manual commit: explicit control (see the consumer sketch below)
- Exactly-once: transactional commits
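A manual-commit sketch for at-least-once processing: commit only after a record is durably handled, so a crash replays unacknowledged records instead of losing them. Names below are placeholders:

```python
# Sketch: manual offset commits for at-least-once processing.
from kafka import KafkaConsumer

def handle(payload: bytes) -> None:
    print(payload)  # stand-in for real processing

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="b-1.example:9094",
    group_id="order-processor",
    enable_auto_commit=False,  # take explicit control of commits
    max_poll_records=200,      # also a consumer-throttling knob
)
for message in consumer:
    handle(message.value)
    consumer.commit()          # commit only after successful processing
```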
MSK Service Comparison Tables
MSK Deployment Options Comparison
Feature | MSK Provisioned | MSK Serverless |
---|---|---|
Scaling | Manual scaling | Automatic scaling |
Broker Management | Customer managed | Fully managed |
Capacity Planning | Required | Not required |
Cost Model | Pay for provisioned capacity | Pay per usage |
Max Brokers | 30 brokers (soft limit) | N/A (serverless) |
Max Partitions per Broker | 100 (soft limit) | N/A |
Max Partitions per Topic | No specific limit | 120 |
Kafka Version Control | Customer controlled | AWS managed |
Custom Configurations | Extensive | Limited |
Use Cases | Predictable workloads, specific requirements | Variable workloads, simplicity |
MSK vs. Self-Managed Kafka vs. Kinesis Data Streams
Feature | MSK | Self-Managed Kafka | Kinesis Data Streams |
---|---|---|---|
Management Overhead | Low | High | Low |
Scaling | Manual (Provisioned), Auto (Serverless) | Manual | Automatic |
Max Retention | Limited by storage | Limited by storage | Up to 365 days |
Throughput Limits | Based on instance type | Based on instance type | 1MB/s per shard (default) |
Partitioning | Topics with partitions | Topics with partitions | Shards |
Client Compatibility | Any Kafka client | Any Kafka client | Kinesis Client Library |
Open Source | Yes (Apache Kafka) | Yes (Apache Kafka) | No (proprietary) |
Pricing Model | Broker hours + storage | Infrastructure costs | Per shard hour + data transfer |
Integration | AWS services + any Kafka compatible | Any Kafka compatible | Native AWS integration |
Schema Management | Glue Schema Registry | Self-managed registry | No built-in schema registry |
MSK Authentication Methods Comparison
Authentication Method | Security Level | Setup Complexity | Use Case |
---|---|---|---|
Unauthenticated | Low | Low | Development only |
TLS | Medium | Medium | Basic security |
SASL/SCRAM | High | Medium | Username/password auth |
IAM | High | Low | AWS-integrated environments |
mTLS | Very High | High | Strict security requirements |
MSK Instance Types and Performance
Instance Type | vCPU | Memory (GiB) | Network Bandwidth | EBS Bandwidth | Typical Throughput |
---|---|---|---|---|---|
kafka.t3.small | 2 | 2 | Up to 5 Gbps | Up to 2.085 Gbps | ~50 MBps |
kafka.m5.large | 2 | 8 | Up to 10 Gbps | Up to 4.75 Gbps | ~100 MBps |
kafka.m5.xlarge | 4 | 16 | Up to 10 Gbps | Up to 4.75 Gbps | ~200 MBps |
kafka.m5.2xlarge | 8 | 32 | Up to 10 Gbps | Up to 4.75 Gbps | ~300 MBps |
kafka.m5.4xlarge | 16 | 64 | Up to 10 Gbps | 4.75 Gbps | ~400 MBps |
kafka.m5.12xlarge | 48 | 192 | 12 Gbps | 9.5 Gbps | ~1 GBps |
kafka.m5.24xlarge | 96 | 384 | 25 Gbps | 19 Gbps | ~2 GBps |
MSK Monitoring Metrics
Important CloudWatch Metrics for MSK
Metric | Description | Threshold | Impact |
---|---|---|---|
CPUUtilization | CPU utilization percentage | >80% | Performance degradation |
MemoryUsed | Memory used by broker | >80% | Potential OOM errors |
KafkaDataLogsDiskUsed | Disk space used for data logs | >85% | Risk of disk full errors |
NetworkProcessorAvgIdlePercent | Network thread idle time | <30% | Network bottleneck |
RequestTime | Time to process requests | >100ms | Increased latency |
UnderReplicatedPartitions | Partitions not fully replicated | >0 | Data durability risk |
OfflinePartitionsCount | Partitions with no leader | >0 | Data unavailability |
LeaderCount | Number of partitions led by broker | Imbalance | Uneven load |
ActiveControllerCount | Broker is controller (0 or 1) | 0 across cluster | Controller failure |
BytesInPerSec | Incoming bytes rate | Varies by instance | Ingress capacity planning |
BytesOutPerSec | Outgoing bytes rate | Varies by instance | Egress capacity planning |
MessagesInPerSec | Message ingestion rate | Varies by instance | Throughput monitoring |
ProduceTotalTimeMs | Producer request latency | >100ms | Producer performance issues |
ConsumeTotalTimeMs | Consumer request latency | >100ms | Consumer performance issues |
ReplicationBytesInPerSec | Replication ingress rate | Varies by instance | Replication load |
ReplicationBytesOutPerSec | Replication egress rate | Varies by instance | Replication load |
MSK Mind Map
Amazon MSK
├── Deployment Options
│ ├── Provisioned
│ └── Serverless
├── Core Components
│ ├── Brokers
│ ├── Topics
│ ├── Partitions
│ ├── Consumer Groups
│ └── ZooKeeper/KRaft
├── Security
│ ├── Encryption
│ │ ├── In-transit (TLS)
│ │ └── At-rest (KMS)
│ ├── Authentication
│ │ ├── IAM
│ │ ├── SASL/SCRAM
│ │ ├── mTLS
│ │ └── Unauthenticated
│ └── Network Security
│ ├── VPC
│ ├── Security Groups
│       └── AWS PrivateLink
├── Monitoring
│ ├── CloudWatch
│ ├── Prometheus
│ ├── Open Monitoring
│ └── Logging Options
│ ├── CloudWatch Logs
│ ├── S3
│ └── Firehose
├── MSK Connect
│ ├── Source Connectors
│ ├── Sink Connectors
│ ├── Custom Connectors
│ └── Worker Configuration
├── Integration
│ ├── AWS Services
│ │ ├── Lambda
│ │ ├── Glue
│ │ ├── S3
│ │ └── CloudWatch
│ └── External Systems
│ ├── Databases
│ ├── Data Lakes
│ └── Applications
├── Operations
│ ├── Cluster Management
│ ├── Topic Management
│ ├── Consumer Group Management
│ └── Maintenance
├── Performance
│ ├── Throughput
│ ├── Latency
│ ├── Scaling
│ └── Optimization
└── Data Management
├── Retention
├── Replication
├── Compaction
└── Schema Management
Apache Kafka Open Source Components in MSK
Amazon MSK is built on Apache Kafka, an open-source distributed event streaming platform. Here's how MSK implements and extends the open-source components:
Apache Kafka Core: MSK uses unmodified Apache Kafka for its core functionality, ensuring compatibility with standard Kafka clients and tools.
ZooKeeper: Traditional MSK clusters use Apache ZooKeeper for metadata management and broker coordination.
KRaft (Kafka Raft): Newer MSK versions support KRaft mode, which eliminates the ZooKeeper dependency.
Kafka Connect: MSK Connect is based on the open-source Kafka Connect framework for data integration.
MirrorMaker 2.0: MSK supports MirrorMaker 2.0 for cross-cluster replication.
Kafka Streams: Fully compatible with Kafka Streams for stream processing applications.
KSQL/ksqlDB: Not directly provided but can be deployed separately and used with MSK.
Schema Registry: MSK integrates with AWS Glue Schema Registry instead of the Confluent Schema Registry.
Kafka REST Proxy: Not directly provided but can be deployed separately.
Kafka UI Tools: Not directly provided but compatible with open-source tools like Kafka UI, AKHQ, etc.
MSK vs. Self-Managed Apache Kafka
Component | MSK Implementation | Self-Managed Implementation |
---|---|---|
Kafka Brokers | Fully managed | Self-managed on EC2/on-premises |
ZooKeeper | Fully managed | Self-managed on EC2/on-premises |
Monitoring | CloudWatch + Prometheus | Custom monitoring stack |
Security | AWS-integrated (IAM, KMS) | Custom security implementation |
Scaling | Console/API/CloudFormation | Manual or custom automation |
Upgrades | One-click version upgrades | Manual upgrade process |
Connect | MSK Connect (managed) | Self-managed Kafka Connect |
Schema Registry | AWS Glue Schema Registry | Confluent Schema Registry |
Cost | Pay for provisioned resources | Pay for underlying infrastructure |
Control | Limited to supported configurations | Complete control |
MSK Data Ingestion and Processing
Throughput and Latency Characteristics
Producer Throughput: Determined by broker instance type, network bandwidth, and replication factor.
Consumer Throughput: Determined by broker instance type, network bandwidth, and consumer parallelism.
End-to-End Latency: Typically 10-100ms, affected by:
- Network latency
- Broker load
- Batch size
- Acknowledgment settings
- Consumer poll frequency
Throughput Calculation Example:
- Cluster: 3 brokers of m5.xlarge (200 MBps each)
- Replication factor: 3
- Producer throughput = (3 × 200 MBps) ÷ 3 = 200 MBps
- Consumer throughput = 3 × 200 MBps = 600 MBps
Partition Calculation Example:
- Target throughput: 50 MBps
- Consumer count: 10
- Recommended partitions = max(50 ÷ 1, 10) = 50 partitions
Implementing Throttling and Overcoming Rate Limits
Producer Throttling:
- Use `max.block.ms` to control how long `send()` will block
- Set `buffer.memory` to control memory used for buffering
- Implement backoff retry logic for throttled requests
Consumer Throttling:
- Control `max.poll.records` to limit batch size
- Adjust `fetch.max.bytes` to control data volume
- Use `max.poll.interval.ms` to prevent rebalancing during processing
API Rate Limit Handling:
- Implement exponential backoff for control plane operations
- Cache results of frequent API calls
- Use AWS SDK retry mechanisms
Overcoming MSK Limits:
- Request quota increases for soft limits
- Distribute load across multiple clusters
- Optimize message size and batching
Replayability of Data Ingestion Pipelines
Offset Management:
- Store consumer group offsets in Kafka (`__consumer_offsets` topic)
- Use `auto.offset.reset` to control behavior for new consumers
- Implement manual offset management for precise control
Replay Strategies:
- Reset consumer group offsets to a specific timestamp
- Create a new consumer group to start from the beginning
- Use Kafka's `seek()` API for programmatic control (see the sketch below)
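A timestamp-based replay sketch with kafka-python: `offsets_for_times()` finds the first offset at or after a point in time, then `seek()` rewinds to it. Topic, group, and brokers are placeholders:

```python
# Sketch: rewind one partition to "one hour ago" and re-consume.
import time
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="b-1.example:9094",
                         group_id="replay-job",
                         enable_auto_commit=False)
tp = TopicPartition("orders", 0)
consumer.assign([tp])

ts_ms = int((time.time() - 3600) * 1000)   # one hour ago, in ms
offsets = consumer.offsets_for_times({tp: ts_ms})
if offsets[tp] is not None:                # None if no records after that time
    consumer.seek(tp, offsets[tp].offset)

for message in consumer:
    print(message.offset, message.value)
```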
Retention Configuration:
- Set `retention.ms` or `retention.bytes` at the topic level (see the sketch below)
- Use log compaction for key-based datasets
- Consider S3 archiving for long-term storage
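Setting retention programmatically is a one-call admin operation; a kafka-python sketch (broker and topic are placeholders, and the value is 3 days in milliseconds):

```python
# Sketch: set a topic's retention via the admin API.
from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType

admin = KafkaAdminClient(bootstrap_servers="b-1.example:9094")
admin.alter_configs([ConfigResource(
    ConfigResourceType.TOPIC, "orders",
    configs={"retention.ms": str(3 * 24 * 60 * 60 * 1000)},  # 3 days
)])
```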
Exactly-Once Processing:
- Use transactional producers
- Implement idempotent consumers
- Store offsets and results atomically
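A transactional-producer sketch for the exactly-once pattern; it uses the confluent-kafka package since kafka-python lacks full transaction support, and the broker address and `transactional.id` are placeholders:

```python
# Sketch: transactional produce so consumers see records atomically.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "b-1.example:9094",
    "transactional.id": "order-pipeline-1",  # must be stable per producer
    "enable.idempotence": True,
})
producer.init_transactions()
producer.begin_transaction()
try:
    producer.produce("orders", value=b"payload")
    producer.commit_transaction()   # records become visible atomically
except Exception:
    producer.abort_transaction()    # consumers never see aborted records
```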
Additional MSK Features and Best Practices
MSK Rebalance Detector: Helps identify and troubleshoot consumer group rebalancing issues.
MSK Tiered Storage: Separates storage from compute, allowing for cost-effective storage of large amounts of data.
MSK Multi-VPC Connectivity: Connect to MSK clusters from other VPCs and AWS accounts via managed multi-VPC private connectivity (AWS PrivateLink), or roll your own with VPC peering or a transit gateway.
MSK Private CA: Use AWS Private Certificate Authority for mTLS authentication.
MSK Cluster Policy: IAM resource policy to control access to MSK clusters.
MSK Best Practices:
- Deploy across 3 AZs for high availability
- Monitor and alert on under-replicated partitions
- Use appropriate replication factor (3 for production)
- Implement proper topic partitioning strategy
- Regularly update to latest Kafka versions
MSK Exam Tips:
- Understand differences between Provisioned and Serverless
- Know authentication and encryption options
- Be familiar with monitoring metrics and troubleshooting
- Understand integration patterns with other AWS services
- Know how to calculate throughput and partition requirements