Data Tech Bridge

Amazon MSK (Managed Streaming for Apache Kafka) - Cheat Sheet

Amazon MSK (Managed Streaming for Apache Kafka) Cheat Sheet for AWS Certified Data Engineer - Associate (DEA-C01)

Core Concepts and Building Blocks

Amazon MSK (Managed Streaming for Apache Kafka) is a fully managed service that makes it easy to build and run applications that use Apache Kafka to process streaming data. MSK provides the control-plane operations, such as those for creating, updating, and deleting clusters.

Key components:

  • Brokers: Kafka server instances that store and serve data
  • Topics: Categories/feeds where records are stored and published
  • Partitions: Divisions of topics for parallel processing
  • Consumer Groups: Groups of consumers that work together to consume data
  • ZooKeeper: Coordination service for Kafka (in traditional deployments)
  • MSK Connect: Managed service for Kafka Connect to integrate with other data sources/sinks
  • MSK Serverless: Serverless option with automatic scaling

MSK Cheat Sheet Points

  1. MSK Deployment Options:

    • Provisioned: You select instance types and configure capacity
    • Serverless: Automatically provisions and scales compute and storage
  2. MSK Provisioned Broker Types:

    • kafka.t3.small (2 vCPU, 2 GiB RAM)
    • kafka.m5.large/xlarge/2xlarge/4xlarge/8xlarge/12xlarge/16xlarge/24xlarge
    • kafka.m7g.large/xlarge/2xlarge/4xlarge/8xlarge/12xlarge/16xlarge
    • Compute-optimized families (c5, c6g) are not among the broker types MSK documents; size on the t3, m5, and m7g families above
  3. MSK Storage Options:

    • EBS storage volumes attached to each broker
    • Default: 1000 GiB per broker
    • Maximum: 16384 GiB per broker
  4. MSK Supported Kafka Versions: 1.1.1, 2.1.0, 2.2.1, 2.3.1, 2.4.1, 2.5.1, 2.6.0, 2.6.1, 2.6.2, 2.6.3, 2.7.0, 2.7.1, 2.7.2, 2.8.0, 2.8.1, 3.1.1, 3.2.0, 3.3.1, 3.3.2, 3.4.0, 3.5.1

  5. MSK Networking:

    • Runs within your VPC
    • Requires at least 2 subnets in different AZs
    • Recommended: 3 subnets in different AZs for high availability
  6. MSK Security Features:

    • TLS encryption for in-transit data
    • KMS encryption for data at rest
    • IAM authentication for control plane
    • SASL/SCRAM, IAM, or mTLS for client authentication
    • Network isolation with VPC
  7. MSK High Availability:

    • Multi-AZ deployment with brokers distributed across AZs
    • Automatic recovery from broker failures
    • Replication factor determines data durability
  8. MSK Monitoring:

    • Integration with CloudWatch, Prometheus, and open-source monitoring tools
    • Broker logs can be sent to CloudWatch Logs, S3, or Firehose
  9. MSK Connect allows you to:

    • Run Kafka Connect connectors without managing infrastructure
    • Use built-in connectors or bring your own
    • Scale workers automatically
  10. MSK Serverless Features:

    • No broker management
    • Automatic scaling based on throughput
    • Pay for cluster hours, partition hours, storage used, and data in/out
  11. MSK Limits:

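    • Provisioned: Up to 30 brokers per ZooKeeper-based cluster (soft limit); the broker count must be a multiple of the number of AZs
    • Serverless: Throughput is capped per cluster (200 MBps ingress, 400 MBps egress) and per partition (5 MBps ingress, 10 MBps egress)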
    • Maximum of 100 clusters per account (soft limit)
  12. MSK Replication Factor:

    • Default: 3 (recommended for production)
    • Minimum: 2 for multi-AZ clusters
    • Maximum: Equal to the number of brokers
  13. MSK Partition Limits:

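    • Recommended maximum per broker depends on broker size: 300 (kafka.t3.small), 1000 (m5.large and m5.xlarge), 2000 (m5.2xlarge), 4000 (m5.4xlarge and larger), counting leader and follower replicas
    • Exceeding the recommendation can block cluster updates such as configuration changes, version upgrades, and broker type changes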
  14. MSK Storage Throughput:

    • General Purpose (gp2): 3 IOPS/GiB
    • Provisioned IOPS (io1): Up to 50 IOPS/GiB
  15. MSK Network Throughput:

    • Depends on instance type
    • t3.small: Up to 5 Gbps
    • m5.24xlarge: Up to 25 Gbps
  16. MSK Pricing Components:

    • Broker hours (for provisioned)
    • Storage (per GB-month)
    • Data transfer (per GB)
    • MSK Connect worker hours (if used)
    • Serverless: Cluster hours, partition hours, storage, and data in/out
  17. MSK Backup and Restore:

    • Automated backups not provided
    • Use MirrorMaker 2.0 for cross-cluster replication
    • Use S3 connector for data backup
  18. MSK Encryption Options:

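    • In-transit: TLS between brokers (enabled by default) and between clients and brokers
    • At-rest: Always encrypted with KMS; you choose an AWS managed key or a customer managed key
    • Client-broker encryption setting: TLS, TLS_PLAINTEXT, or PLAINTEXT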
  19. MSK Authentication Methods:

    • IAM access control
    • SASL/SCRAM with AWS Secrets Manager
    • TLS mutual authentication (mTLS)
    • Unauthenticated access (not recommended)
  20. MSK Configuration Types:

    • Default configuration
    • Custom configuration
    • MSK configuration (reusable across clusters)
  21. MSK Cluster States:

    • CREATING, ACTIVE, UPDATING, MAINTENANCE, REBOOTING_BROKER, HEALING, DELETING, FAILED
  22. MSK Maintenance Windows:

    • Default: Random 2-hour window per week
    • Customizable to specific day and time
  23. MSK Patching:

    • Security patches applied automatically
    • Version upgrades must be initiated manually
  24. MSK Monitoring Levels:

    • DEFAULT: Basic metrics to CloudWatch
    • PER_BROKER: Detailed broker-level metrics
    • PER_TOPIC_PER_BROKER: Topic-level metrics for each broker
    • PER_TOPIC_PER_PARTITION: Partition-level metrics (most granular)
  25. MSK Open Monitoring:

    • Prometheus and Grafana integration
    • JMX Exporter and Node Exporter metrics
  26. MSK Logging Options:

    • Broker logs to CloudWatch Logs
    • Broker logs to S3
    • Broker logs to Firehose
  27. MSK Connect Worker Configurations:

    • Auto-scaling: 1-10 workers
    • Worker size: 1, 2, 4, or 8 MSK Connect Units (MCUs)
    • 1 MCU = 1 vCPU and 4 GiB of memory
  28. MSK Connect Connector Types:

    • Source connectors (data into Kafka)
    • Sink connectors (data from Kafka)
    • Custom connectors (uploaded as ZIP)
  29. MSK Serverless Scaling:

    • Capacity scales automatically with traffic; there are no capacity units to provision
    • Per-partition throughput is capped at 5 MBps ingress and 10 MBps egress
  30. MSK Serverless Limits:

    • Maximum 2400 partitions per cluster for non-compacted topics (120 for compacted topics)
    • Maximum 500 consumer groups per cluster
    • Maximum 3000 client connections per cluster
  31. MSK Cluster Operations:

    • Update broker count
    • Update broker type
    • Update configuration
    • Update security settings
    • Update monitoring settings
  32. MSK Topic Operations:

    • Create, delete, describe topics
    • Update topic configurations
    • List topics
  33. MSK Consumer Group Operations:

    • List consumer groups
    • Describe consumer groups
    • Reset consumer group offsets
  34. MSK Networking Requirements:

    • Inbound rules for ports 9092 (plaintext), 9094 (TLS), 9096 (SASL/SCRAM), 9098 (IAM)
    • Outbound rules for ZooKeeper (2181, 2182, 2183)
    • VPC endpoints for AWS services
  35. MSK Rebalancing:

    • Not automatic: newly added brokers hold no existing partitions until you reassign them
    • Reassign partitions with kafka-reassign-partitions or Cruise Control
  36. MSK Storage Auto Scaling:

    • Can be enabled to automatically increase storage
    • Target utilization is configurable (10% to 80%); storage only scales up, never down
    • Maximum: 16384 GiB per broker
  37. MSK Data Retention:

    • Configurable per topic
    • Default: 7 days (168 hours)
    • Maximum: Limited by storage capacity
  38. MSK Schema Registry Options:

    • AWS Glue Schema Registry
    • Self-managed Schema Registry
    • Confluent Schema Registry
  39. MSK Client Compatibility:

    • Any Kafka client compatible with the cluster version
    • AWS-specific clients for IAM authentication
  40. MSK Cross-Region Replication:

    • MirrorMaker 2.0 for cross-cluster replication
    • Can be deployed on EC2 or using MSK Connect
  41. MSK Disaster Recovery Options:

    • Multi-AZ deployment for high availability
    • Cross-region replication for disaster recovery
    • S3 backups for long-term storage
  42. MSK Performance Factors:

    • Broker instance type
    • Number of partitions
    • Replication factor
    • Batch size
    • Compression settings
  43. MSK Throughput Calculation:

    • Producer throughput = Number of brokers × Instance throughput / Replication factor
    • Consumer throughput = Number of brokers × Instance throughput
  44. MSK Partition Calculation:

    • Recommended partitions = max(Throughput in MB/s ÷ 1 MB/s, Number of consumers)
    • Example: For 10 MB/s throughput and 5 consumers, use at least 10 partitions
  45. MSK vs. Self-Managed Kafka:

    • MSK: Managed infrastructure, automatic scaling, integrated security
    • Self-managed: Complete control, potentially lower cost, more configuration options
  46. MSK vs. Kinesis Data Streams:

    • MSK: Open-source compatibility, longer retention, more partitions
    • Kinesis: Simpler setup, automatic scaling, native AWS integration
  47. MSK Throttling:

    • API rate limits: 100 TPS for control plane operations
    • Data plane throttling based on broker capacity
    • Exponential backoff recommended for retries
  48. MSK Connect Error Handling:

    • Dead letter queues for failed records
    • Configurable retry policies
    • Error tolerance settings
  49. MSK Replay Capabilities:

    • Consumer groups track offsets
    • Reset offsets to replay data
    • Retention period limits how far back you can replay
  50. MSK Data Ingestion Patterns:

    • Direct producer applications
    • MSK Connect source connectors
    • AWS services integration (Lambda, Kinesis, etc.)
  51. MSK Data Processing Patterns:

    • Consumer applications
    • Stream processing (Flink, Kafka Streams)
    • Lambda integration
  52. MSK Data Delivery Patterns:

    • MSK Connect sink connectors
    • Consumer applications writing to destinations
    • Integration with AWS analytics services
  53. MSK Latency Characteristics:

    • End-to-end latency: Typically 10-100ms
    • Affected by batch size, compression, and instance type
    • Provisioned generally lower latency than Serverless
  54. MSK Serverless Throughput Characteristics:

    • No base capacity to configure; throughput scales with traffic
    • Per-partition maximum: 5 MBps ingress, 10 MBps egress
    • Per-cluster maximum: 200 MBps ingress, 400 MBps egress
  55. MSK Provisioned Throughput Characteristics:

    • Depends on instance type
    • t3.small: ~50 MBps
    • m5.4xlarge: ~400 MBps
    • m5.24xlarge: ~2 GBps
  56. MSK Partition Strategies:

    • Key-based: Same key to same partition
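    • Round-robin: Even distribution of keyless records (since Kafka 2.4 the default partitioner is sticky and fills one batch per partition at a time)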
    • Custom: Implement custom partitioner
  57. MSK Consumer Group Strategies:

    • Static membership: Reduces rebalancing
    • Cooperative rebalancing: Minimizes disruption
    • Eager rebalancing: Traditional approach
  58. MSK Compression Options:

    • gzip: Highest compression, higher CPU
    • snappy: Balanced compression/CPU
    • lz4: Fast compression, lower ratio
    • zstd: Good compression, moderate CPU
  59. MSK Security Best Practices:

    • Enable encryption in-transit and at-rest
    • Use IAM or SASL/SCRAM authentication
    • Implement VPC security groups
    • Use private subnets with VPC endpoints
  60. MSK Monitoring Best Practices:

    • Monitor broker CPU, memory, disk usage
    • Track under-replicated partitions
    • Monitor consumer lag
    • Set up alerts for critical metrics
  61. MSK Cost Optimization:

    • Right-size broker instances
    • Use Serverless for variable workloads
    • Enable storage auto-scaling
    • Compress data to reduce storage costs
  62. MSK Troubleshooting Common Issues:

    • Under-replicated partitions
    • Consumer lag
    • Connection issues
    • Performance degradation
  63. MSK Integration with AWS Services:

    • Lambda: Trigger functions from Kafka topics
    • Glue: Schema Registry and ETL jobs
    • S3: Data sink for long-term storage
    • CloudWatch: Monitoring and logging
  64. MSK Quotas and Limits Management:

    • Monitor quota usage in Service Quotas console
    • Request quota increases for production workloads
    • Implement client-side throttling
  65. MSK Replayability Implementation:

    • Set appropriate retention period
    • Use consumer group offset management
    • Consider compacted topics for state
  66. MSK Data Ingestion Pipeline Design:

    • Source → MSK → Processing → Storage/Analytics
    • Consider exactly-once semantics if needed
    • Implement proper error handling and DLQs
  67. MSK ZooKeeper vs. KRaft Mode:

    • ZooKeeper: Traditional metadata management
    • KRaft: Kafka's built-in consensus protocol (newer)
    • MSK supports both depending on version
  68. MSK Partition Assignment Strategies:

    • Range: Assigns consecutive partitions
    • RoundRobin: Distributes evenly
    • Sticky: Minimizes partition movement
    • CooperativeSticky: Incremental rebalancing
  69. MSK Producer Acknowledgment Modes:

    • acks=0: No acknowledgment (highest throughput, no durability)
    • acks=1: Leader acknowledgment (balanced)
    • acks=all: All replicas acknowledgment (highest durability)
  70. MSK Consumer Offset Commit Strategies:

    • Auto-commit: Periodic commits
    • Manual commit: Explicit control
    • Exactly-once: Transactional commits
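
The acknowledgment and compression items above can be combined into producer profiles. The sketch below is illustrative only: the keys are standard Kafka producer configuration names, while the profile names, the numeric values, and the `validate_producer_config` helper are assumptions made for this example.

```python
# Two hypothetical producer profiles. The keys are standard Kafka producer
# settings; the values are starting points to tune, not recommendations.
DURABLE_PRODUCER = {
    "acks": "all",               # wait for all in-sync replicas
    "enable.idempotence": True,  # avoid duplicates when retries happen
    "compression.type": "zstd",
    "linger.ms": 20,
    "batch.size": 131072,
}

THROUGHPUT_PRODUCER = {
    "acks": "1",                 # leader-only acknowledgment
    "enable.idempotence": False,
    "compression.type": "lz4",
    "linger.ms": 50,
    "batch.size": 262144,
}

VALID_ACKS = {"0", "1", "all", "-1"}  # "-1" is equivalent to "all"
VALID_COMPRESSION = {"none", "gzip", "snappy", "lz4", "zstd"}


def validate_producer_config(config):
    """Return a list of human-readable problems found in a producer config."""
    problems = []
    acks = str(config.get("acks"))
    if acks not in VALID_ACKS:
        problems.append("acks must be one of 0, 1, all")
    if config.get("enable.idempotence") and acks not in {"all", "-1"}:
        problems.append("enable.idempotence requires acks=all")
    if config.get("compression.type", "none") not in VALID_COMPRESSION:
        problems.append("unknown compression.type")
    return problems
```

Running the validator before handing a profile to a client library catches the most common mismatch, which is requesting idempotence without `acks=all`.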

MSK Service Comparison Tables

MSK Deployment Options Comparison

| Feature | MSK Provisioned | MSK Serverless |
|---|---|---|
| Scaling | Manual scaling | Automatic scaling |
| Broker Management | Customer managed | Fully managed |
| Capacity Planning | Required | Not required |
| Cost Model | Pay for provisioned capacity | Pay per usage |
| Max Brokers | 30 brokers (soft limit) | N/A (serverless) |
| Max Partitions per Broker | Recommended maximum by broker size (300 to 4000) | N/A |
| Max Partitions | No fixed per-topic limit | 2400 per cluster (120 for compacted topics) |
| Kafka Version Control | Customer controlled | AWS managed |
| Custom Configurations | Extensive | Limited |
| Use Cases | Predictable workloads, specific requirements | Variable workloads, simplicity |

MSK vs. Self-Managed Kafka vs. Kinesis Data Streams

| Feature | MSK | Self-Managed Kafka | Kinesis Data Streams |
|---|---|---|---|
| Management Overhead | Low | High | Low |
| Scaling | Manual (Provisioned), Auto (Serverless) | Manual | Automatic |
| Max Retention | Limited by storage | Limited by storage | Up to 365 days |
| Throughput Limits | Based on instance type | Based on instance type | 1 MB/s per shard (default) |
| Partitioning | Topics with partitions | Topics with partitions | Shards |
| Client Compatibility | Any Kafka client | Any Kafka client | Kinesis Client Library |
| Open Source | Yes (Apache Kafka) | Yes (Apache Kafka) | No (proprietary) |
| Pricing Model | Broker hours + storage | Infrastructure costs | Per shard hour + data transfer |
| Integration | AWS services + any Kafka compatible | Any Kafka compatible | Native AWS integration |
| Schema Management | Glue Schema Registry | Self-managed registry | No built-in schema registry |

MSK Authentication Methods Comparison

| Authentication Method | Security Level | Setup Complexity | Use Case |
|---|---|---|---|
| Unauthenticated | Low | Low | Development only |
| TLS | Medium | Medium | Basic security |
| SASL/SCRAM | High | Medium | Username/password auth |
| IAM | High | Low | AWS-integrated environments |
| mTLS | Very High | High | Strict security requirements |
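
Each authentication method maps to a listener port and a small set of client properties. The following sketch ties the two together. The ports are the ones listed under the networking requirements, the property names are standard Kafka client settings (the IAM entries come from the aws-msk-iam-auth library), and the `AUTH_PROFILES` table, the `client_properties` helper, and the broker hostnames are hypothetical names used only for illustration.

```python
# Hypothetical lookup table: authentication method -> listener port and the
# client properties that method needs. Keystore and credential settings for
# mTLS and SASL/SCRAM are omitted and must be supplied separately.
AUTH_PROFILES = {
    "plaintext": {"port": 9092, "properties": {"security.protocol": "PLAINTEXT"}},
    "tls": {"port": 9094, "properties": {"security.protocol": "SSL"}},
    "sasl_scram": {
        "port": 9096,
        "properties": {
            "security.protocol": "SASL_SSL",
            "sasl.mechanism": "SCRAM-SHA-512",
        },
    },
    "iam": {
        "port": 9098,
        "properties": {
            "security.protocol": "SASL_SSL",
            "sasl.mechanism": "AWS_MSK_IAM",
            "sasl.jaas.config": "software.amazon.msk.auth.iam.IAMLoginModule required;",
            "sasl.client.callback.handler.class": (
                "software.amazon.msk.auth.iam.IAMClientCallbackHandler"
            ),
        },
    },
}


def client_properties(method, broker_hosts):
    """Build Kafka client properties for one authentication method."""
    profile = AUTH_PROFILES[method]
    servers = ",".join(f"{host}:{profile['port']}" for host in broker_hosts)
    return {"bootstrap.servers": servers, **profile["properties"]}


# Example with made-up broker hostnames.
props = client_properties("iam", ["b-1.example.internal", "b-2.example.internal"])
```

Keeping the port next to the properties avoids a common mistake: pointing an IAM-configured client at the TLS listener on 9094, which fails during the handshake rather than with a clear authentication error.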

MSK Instance Types and Performance

| Instance Type | vCPU | Memory (GiB) | Network Bandwidth | EBS Bandwidth | Typical Throughput |
|---|---|---|---|---|---|
| kafka.t3.small | 2 | 2 | Up to 5 Gbps | Up to 2.085 Gbps | ~50 MBps |
| kafka.m5.large | 2 | 8 | Up to 10 Gbps | Up to 4.75 Gbps | ~100 MBps |
| kafka.m5.xlarge | 4 | 16 | Up to 10 Gbps | Up to 4.75 Gbps | ~200 MBps |
| kafka.m5.2xlarge | 8 | 32 | Up to 10 Gbps | Up to 4.75 Gbps | ~300 MBps |
| kafka.m5.4xlarge | 16 | 64 | Up to 10 Gbps | 4.75 Gbps | ~400 MBps |
| kafka.m5.12xlarge | 48 | 192 | 12 Gbps | 9.5 Gbps | ~1 GBps |
| kafka.m5.24xlarge | 96 | 384 | 25 Gbps | 19 Gbps | ~2 GBps |

MSK Monitoring Metrics

Important CloudWatch Metrics for MSK

| Metric | Description | Threshold | Impact |
|---|---|---|---|
| CpuUser + CpuSystem | Broker CPU utilization percentage | >60% | Performance degradation |
| MemoryUsed | Memory used by broker | >80% | Potential OOM errors |
| KafkaDataLogsDiskUsed | Disk space used for data logs | >85% | Risk of disk full errors |
| NetworkProcessorAvgIdlePercent | Network thread idle time | <30% | Network bottleneck |
| RequestTime | Time to process requests | >100ms | Increased latency |
| UnderReplicatedPartitions | Partitions not fully replicated | >0 | Data durability risk |
| OfflinePartitionsCount | Partitions with no leader | >0 | Data unavailability |
| LeaderCount | Number of partitions led by broker | Imbalance | Uneven load |
| ActiveControllerCount | Broker is controller (0 or 1) | Sum across cluster is not 1 | Controller failure |
| BytesInPerSec | Incoming bytes rate | Varies by instance | Ingress capacity planning |
| BytesOutPerSec | Outgoing bytes rate | Varies by instance | Egress capacity planning |
| MessagesInPerSec | Message ingestion rate | Varies by instance | Throughput monitoring |
| ProduceTotalTimeMsMean | Producer request latency | >100ms | Producer performance issues |
| FetchConsumerTotalTimeMsMean | Consumer request latency | >100ms | Consumer performance issues |
| ReplicationBytesInPerSec | Replication ingress rate | Varies by instance | Replication load |
| ReplicationBytesOutPerSec | Replication egress rate | Varies by instance | Replication load |
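
A few of these thresholds translate directly into alert rules. The sketch below evaluates a snapshot of per-broker metric values that is assumed to have been fetched from CloudWatch already. The `evaluate_broker_metrics` function and the shape of its input are hypothetical; only the metric names and thresholds come from the table.

```python
def evaluate_broker_metrics(brokers, disk_threshold=85.0):
    """Return alert strings for a snapshot of per-broker metrics.

    `brokers` maps a broker id to a dict of the latest metric values.
    """
    alerts = []
    for broker_id, metrics in sorted(brokers.items()):
        if metrics.get("UnderReplicatedPartitions", 0) > 0:
            alerts.append(f"broker {broker_id}: under-replicated partitions")
        if metrics.get("OfflinePartitionsCount", 0) > 0:
            alerts.append(f"broker {broker_id}: offline partitions")
        if metrics.get("KafkaDataLogsDiskUsed", 0.0) > disk_threshold:
            alerts.append(f"broker {broker_id}: data log disk above {disk_threshold}%")

    # Exactly one broker in the cluster should report itself as controller.
    controllers = sum(m.get("ActiveControllerCount", 0) for m in brokers.values())
    if controllers != 1:
        alerts.append(f"cluster: expected 1 active controller, found {controllers}")
    return alerts


# Example snapshot with made-up values.
snapshot = {
    1: {"UnderReplicatedPartitions": 0, "KafkaDataLogsDiskUsed": 91.0, "ActiveControllerCount": 1},
    2: {"UnderReplicatedPartitions": 3, "KafkaDataLogsDiskUsed": 40.0, "ActiveControllerCount": 0},
}
alerts = evaluate_broker_metrics(snapshot)
```

The controller check is done across the whole cluster rather than per broker, because a value of 0 on any single broker is normal.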

MSK Mind Map

Amazon MSK
├── Deployment Options
│   ├── Provisioned
│   └── Serverless
├── Core Components
│   ├── Brokers
│   ├── Topics
│   ├── Partitions
│   ├── Consumer Groups
│   └── ZooKeeper/KRaft
├── Security
│   ├── Encryption
│   │   ├── In-transit (TLS)
│   │   └── At-rest (KMS)
│   ├── Authentication
│   │   ├── IAM
│   │   ├── SASL/SCRAM
│   │   ├── mTLS
│   │   └── Unauthenticated
│   └── Network Security
│       ├── VPC
│       ├── Security Groups
│       └── PrivateLink
├── Monitoring
│   ├── CloudWatch
│   ├── Prometheus
│   ├── Open Monitoring
│   └── Logging Options
│       ├── CloudWatch Logs
│       ├── S3
│       └── Firehose
├── MSK Connect
│   ├── Source Connectors
│   ├── Sink Connectors
│   ├── Custom Connectors
│   └── Worker Configuration
├── Integration
│   ├── AWS Services
│   │   ├── Lambda
│   │   ├── Glue
│   │   ├── S3
│   │   └── CloudWatch
│   └── External Systems
│       ├── Databases
│       ├── Data Lakes
│       └── Applications
├── Operations
│   ├── Cluster Management
│   ├── Topic Management
│   ├── Consumer Group Management
│   └── Maintenance
├── Performance
│   ├── Throughput
│   ├── Latency
│   ├── Scaling
│   └── Optimization
└── Data Management
    ├── Retention
    ├── Replication
    ├── Compaction
    └── Schema Management

Apache Kafka Open Source Components in MSK

Amazon MSK is built on Apache Kafka, an open-source distributed event streaming platform. Here's how MSK implements and extends the open-source components:

  1. Apache Kafka Core: MSK uses unmodified Apache Kafka for its core functionality, ensuring compatibility with standard Kafka clients and tools.

  2. ZooKeeper: Traditional MSK clusters use Apache ZooKeeper for metadata management and broker coordination.

  3. KRaft (Kafka Raft): MSK supports KRaft mode for new clusters starting with Apache Kafka 3.7.x, which eliminates the ZooKeeper dependency.

  4. Kafka Connect: MSK Connect is based on the open-source Kafka Connect framework for data integration.

  5. MirrorMaker 2.0: MSK supports MirrorMaker 2.0 for cross-cluster replication.

  6. Kafka Streams: Fully compatible with Kafka Streams for stream processing applications.

  7. KSQL/ksqlDB: Not directly provided but can be deployed separately and used with MSK.

  8. Schema Registry: MSK integrates natively with AWS Glue Schema Registry. Confluent Schema Registry is not bundled but can be self-hosted and pointed at an MSK cluster.

  9. Kafka REST Proxy: Not directly provided but can be deployed separately.

  10. Kafka UI Tools: Not directly provided but compatible with open-source tools like Kafka UI, AKHQ, etc.

MSK vs. Self-Managed Apache Kafka

| Component | MSK Implementation | Self-Managed Implementation |
|---|---|---|
| Kafka Brokers | Fully managed | Self-managed on EC2/on-premises |
| ZooKeeper | Fully managed | Self-managed on EC2/on-premises |
| Monitoring | CloudWatch + Prometheus | Custom monitoring stack |
| Security | AWS-integrated (IAM, KMS) | Custom security implementation |
| Scaling | Console/API/CloudFormation | Manual or custom automation |
| Upgrades | One-click version upgrades | Manual upgrade process |
| Connect | MSK Connect (managed) | Self-managed Kafka Connect |
| Schema Registry | AWS Glue Schema Registry | Confluent Schema Registry |
| Cost | Pay for provisioned resources | Pay for underlying infrastructure |
| Control | Limited to supported configurations | Complete control |

MSK Data Ingestion and Processing

Throughput and Latency Characteristics

  1. Producer Throughput: Determined by broker instance type, network bandwidth, and replication factor.

  2. Consumer Throughput: Determined by broker instance type, network bandwidth, and consumer parallelism.

  3. End-to-End Latency: Typically 10-100ms, affected by:

    • Network latency
    • Broker load
    • Batch size
    • Acknowledgment settings
    • Consumer poll frequency
  4. Throughput Calculation Example:

    • Cluster: 3 brokers of m5.xlarge (200 MBps each)
    • Replication factor: 3
    • Producer throughput = (3 × 200 MBps) ÷ 3 = 200 MBps
    • Consumer throughput = 3 × 200 MBps = 600 MBps
  5. Partition Calculation Example:

    • Target throughput: 50 MBps
    • Consumer count: 10
    • Recommended partitions = max(50 ÷ 1, 10) = 50 partitions
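
The two worked examples above can be reproduced with a few lines of Python. This is a sketch of the same rule-of-thumb formulas, not a sizing tool. The function names are made up for illustration, and the 1 MBps per-partition figure is the assumption the formula in this cheat sheet already uses.

```python
import math


def producer_throughput_mbps(brokers, per_broker_mbps, replication_factor):
    """Aggregate ingest capacity once replication traffic is accounted for."""
    return brokers * per_broker_mbps / replication_factor


def consumer_throughput_mbps(brokers, per_broker_mbps):
    """Aggregate read capacity across all brokers."""
    return brokers * per_broker_mbps


def recommended_partitions(target_mbps, consumers, per_partition_mbps=1.0):
    """Enough partitions for the target throughput and for every consumer."""
    by_throughput = math.ceil(target_mbps / per_partition_mbps)
    return max(by_throughput, consumers)


# Values from the examples: 3 x m5.xlarge at ~200 MBps, replication factor 3.
produce = producer_throughput_mbps(3, 200, 3)
consume = consumer_throughput_mbps(3, 200)
partitions = recommended_partitions(50, 10)
```

Rounding up with `math.ceil` matters when the target is not a whole multiple of the per-partition rate; rounding down would leave the last fraction of throughput without a partition to carry it.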

Implementing Throttling and Overcoming Rate Limits

  1. Producer Throttling:

    • Use max.block.ms to control how long send() will block
    • Set buffer.memory to control memory used for buffering
    • Implement backoff retry logic for throttled requests
  2. Consumer Throttling:

    • Control max.poll.records to limit batch size
    • Adjust fetch.max.bytes to control data volume
    • Use max.poll.interval.ms to prevent rebalancing during processing
  3. API Rate Limit Handling:

    • Implement exponential backoff for control plane operations
    • Cache results of frequent API calls
    • Use AWS SDK retry mechanisms
  4. Overcoming MSK Limits:

    • Request quota increases for soft limits
    • Distribute load across multiple clusters
    • Optimize message size and batching
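
For the control-plane retries mentioned above, a common pattern is exponential backoff with full jitter. The sketch below is a minimal, library-independent version. `ThrottledError` is a hypothetical stand-in for whatever throttling exception your SDK raises, and the default delays are assumptions to adjust.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for an SDK throttling exception."""


def call_with_backoff(operation, is_throttled, max_attempts=5,
                      base=0.5, cap=30.0, sleep=time.sleep):
    """Call `operation`, retrying throttled failures with jittered backoff.

    The delay before retry n is drawn uniformly from
    [0, min(cap, base * 2**n)] seconds. Non-throttling errors and the
    final failed attempt are re-raised to the caller.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as err:
            last_attempt = attempt == max_attempts - 1
            if not is_throttled(err) or last_attempt:
                raise
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Passing `sleep` as a parameter keeps the function easy to test without real delays. The jitter spreads retries from many clients over time, which avoids synchronized retry bursts against the same API.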

Replayability of Data Ingestion Pipelines

  1. Offset Management:

    • Store consumer group offsets in Kafka (__consumer_offsets topic)
    • Use auto.offset.reset to control where a consumer group starts when it has no committed offset (earliest or latest)
    • Implement manual offset management for precise control
  2. Replay Strategies:

    • Reset consumer group offsets to specific timestamp
    • Create new consumer group to start from beginning
    • Use Kafka's seek() API for programmatic control
  3. Retention Configuration:

    • Set retention.ms or retention.bytes at topic level
    • Use log compaction for key-based datasets
    • Consider S3 archiving for long-term storage
  4. Exactly-Once Processing:

    • Use transactional producers
    • Implement idempotent consumers
    • Store offsets and results atomically
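
Resetting a consumer group to a timestamp is usually done with the `kafka-consumer-groups.sh` tool that ships with Apache Kafka. The sketch below only builds the argument list so it can be reviewed or logged before running. The `reset_offsets_command` helper and the example values are hypothetical, and the timestamp is assumed to be interpreted as UTC by the tool.

```python
from datetime import datetime, timezone


def reset_offsets_command(bootstrap, group, topic, replay_from, execute=False):
    """Build a kafka-consumer-groups.sh command that rewinds a group.

    `replay_from` must be a timezone-aware datetime. The command is a
    dry run unless `execute` is True.
    """
    if replay_from.tzinfo is None:
        raise ValueError("replay_from must be timezone-aware")
    stamp = replay_from.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000")
    command = [
        "kafka-consumer-groups.sh",
        "--bootstrap-server", bootstrap,
        "--group", group,
        "--topic", topic,
        "--reset-offsets",
        "--to-datetime", stamp,
    ]
    command.append("--execute" if execute else "--dry-run")
    return command


# Example with made-up names; prints nothing and changes nothing.
cmd = reset_offsets_command(
    "b-1.example.internal:9094",
    "orders-processor",
    "orders",
    datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc),
)
```

The group must have no active members when the reset is executed, and the chosen timestamp has to fall within the topic's retention period. Clusters that require TLS or IAM authentication also need a `--command-config` file with the matching client properties.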

Additional MSK Features and Best Practices

  1. MSK Rebalance Detector: Helps identify and troubleshoot consumer group rebalancing issues.

  2. MSK Tiered Storage: Separates storage from compute, allowing for cost-effective storage of large amounts of data.

  3. MSK Multi-VPC Connectivity: Connect to MSK clusters from other VPCs and accounts through managed AWS PrivateLink connections. Transit gateway and VPC peering remain self-managed alternatives.

  4. MSK Private CA: Use AWS Private Certificate Authority for mTLS authentication.

  5. MSK Cluster Policy: IAM resource policy to control access to MSK clusters.

  6. MSK Best Practices:

    • Deploy across 3 AZs for high availability
    • Monitor and alert on under-replicated partitions
    • Use appropriate replication factor (3 for production)
    • Implement proper topic partitioning strategy
    • Regularly update to latest Kafka versions
  7. MSK Exam Tips:

    • Understand differences between Provisioned and Serverless
    • Know authentication and encryption options
    • Be familiar with monitoring metrics and troubleshooting
    • Understand integration patterns with other AWS services
    • Know how to calculate throughput and partition requirements
