Below are the key pointers for Amazon Kinesis (cheat sheet) for the AWS Certified Data Engineer - Associate exam.
Core Components of AWS Kinesis
AWS Kinesis is a platform for streaming data on AWS that makes it easy to collect, process, and analyze real-time streaming data.
Component | Description |
---|---|
Kinesis Data Streams | Real-time data streaming service for ingesting and storing data streams |
Kinesis Data Firehose | Loads streaming data into AWS data stores with near-zero management |
Kinesis Data Analytics | Process and analyze streaming data in real-time with SQL or Apache Flink |
Kinesis Video Streams | Stream video from connected devices to AWS for analytics and ML |
Mind Map: AWS Kinesis Ecosystem
```
AWS Kinesis
├── Kinesis Data Streams
│   ├── Producers
│   │   ├── AWS SDK
│   │   ├── Kinesis Producer Library (KPL)
│   │   └── Kinesis Agent
│   ├── Consumers
│   │   ├── AWS SDK
│   │   ├── Kinesis Client Library (KCL)
│   │   ├── Lambda
│   │   └── Kinesis Data Firehose
│   └── Shards (throughput units)
├── Kinesis Data Firehose
│   ├── Sources
│   │   ├── Direct PUT
│   │   ├── Kinesis Data Streams
│   │   └── CloudWatch Logs/Events
│   └── Destinations
│       ├── S3
│       ├── Redshift
│       ├── OpenSearch Service (formerly Elasticsearch)
│       └── Splunk
├── Kinesis Data Analytics
│   ├── SQL Applications
│   └── Apache Flink Applications
└── Kinesis Video Streams
    ├── Producers (cameras, etc.)
    └── Consumers (ML services, etc.)
```
Detailed Features and Specifications
Feature | Kinesis Data Streams | Kinesis Data Firehose | Kinesis Data Analytics | Kinesis Video Streams |
---|---|---|---|---|
Purpose | Real-time streaming data ingestion | Load streaming data into data stores | Process streaming data in real-time | Stream video data for analytics |
Data Retention | 24 hours (default) up to 365 days | No storage (immediate delivery) | No storage (immediate processing) | Configurable (hours to years) |
Scaling Unit | Shards | Automatic scaling | KPUs (Kinesis Processing Units) | Automatic scaling |
Throughput | 1MB/sec or 1,000 records/sec per shard (ingestion); 2MB/sec per shard (consumption) | Dynamic, up to service quotas | Based on KPUs | MB/sec per stream |
Latency | ~200ms | At least 60 seconds | Real-time (seconds) | Real-time |
Reprocessing | Yes (by position) | No | No | Yes (by timestamp) |
Pricing Model | Per shard-hour + PUT payload units | Volume of data + optional data transformation | KPU-hours | GB ingested + stored |
Kinesis Data Streams Deep Dive
1. Shards: Basic throughput unit for Kinesis Data Streams. Each shard provides 1MB/sec data input and 2MB/sec data output capacity.
2. Partition Key: Determines which shard a data record goes to. Good partition key design ensures even distribution across shards.
3. Sequence Number: Unique identifier for each record within a shard, assigned when a record is ingested.
4. Data Record: The unit of data stored in Kinesis Data Streams (up to 1MB).
5. Retention Period: Data can be stored from 24 hours (default) up to 365 days.
6. Enhanced Fan-Out: Provides dedicated throughput of 2MB/sec per consumer per shard (vs. shared 2MB/sec without it).
7. Capacity Modes:
- Provisioned: You specify the number of shards
- On-demand: Automatically scales based on observed throughput
8. Resharding Operations:
- Shard split: Increases stream capacity
- Shard merge: Decreases stream capacity
9. Producer Options:
- AWS SDK (simple, low throughput)
- Kinesis Producer Library (KPL) (high performance, batching, retry)
- Kinesis Agent (log file collection)
10. Consumer Options:
- AWS SDK (simple, manual)
- Kinesis Client Library (KCL) (distributed, coordinated consumption)
- Lambda (serverless processing)
- Firehose (delivery to destinations)
11. Throughput Calculation Example:
- Required ingestion: 10MB/sec
- Required shards = 10MB ÷ 1MB = 10 shards
12. KCL vs KPL: KPL (Producer Library) handles data production with batching and retries; KCL (Consumer Library) manages distributed consumption with checkpointing.
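The following minimal sketch (Python with boto3) ties the points above together: it writes one record with a partition key and reads it back with a shard iterator. The stream name `my-stream` and the region are assumptions; any existing stream works.

```python
import json

import boto3

# Assumptions: a stream named "my-stream" exists in us-east-1 and
# credentials come from the default boto3 credential chain.
kinesis = boto3.client("kinesis", region_name="us-east-1")

# Produce: the partition key decides the target shard, so a
# high-cardinality key (here a device ID) spreads load evenly.
response = kinesis.put_record(
    StreamName="my-stream",
    Data=json.dumps({"device_id": "sensor-42", "temp": 21.5}).encode("utf-8"),
    PartitionKey="sensor-42",
)
print("Sequence number:", response["SequenceNumber"])  # assigned at ingestion

# Consume: read from the oldest available record on the first shard.
shard_id = kinesis.describe_stream(StreamName="my-stream")[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]:
    print(record["SequenceNumber"], record["Data"])
```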
Kinesis Data Firehose Details
13. Delivery Frequency:
- S3: Buffer size (1-128 MB) or buffer interval (60-900 seconds)
- Other destinations: Buffer size (1-100 MB) or buffer interval (60-900 seconds)
14. Data Transformation: A Lambda function can transform records in flight before delivery (see the sketch after this list).
15. Format Conversion: Convert data to Parquet or ORC before S3 delivery.
16. Compression Options: GZIP, ZIP, Snappy for S3 delivery.
17. Error Handling: Failed records go to an S3 error bucket.
18. Dynamic Partitioning: Partition data in S3 based on record content.
19. No Data Loss: Retries until successful delivery.
20. Serverless: No capacity planning needed.
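As a sketch of the Lambda transformation from item 14, here is the shape such a function takes in Python. Firehose invokes it with a batch of base64-encoded records; each returned record must echo its `recordId` and report a result of `Ok`, `Dropped`, or `ProcessingFailed`. The enrichment field is an assumption for illustration.

```python
import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True  # illustrative enrichment (assumption)
        output.append({
            "recordId": record["recordId"],  # must match the input record
            "result": "Ok",                  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```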
Kinesis Data Analytics Features
21. SQL Applications: Process streams using SQL queries.
22. Apache Flink Applications: Use Java, Scala, or Python with Apache Flink.
23. Input Sources: Kinesis Data Streams, Firehose, or reference data from S3.
24. Processing Features (a tumbling-window sketch follows this list):
- Windowed aggregations (tumbling, sliding, session)
- Anomaly detection
- Stream-to-stream joins
25. Output Options: Kinesis Data Streams, Firehose, or Lambda.
26. Scaling: Based on Kinesis Processing Units (KPUs).
27. Checkpointing: Automatic state persistence for fault tolerance.
28. Parallelism: Automatically parallelizes processing across multiple instances.
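To make the windowed aggregations from item 24 concrete, here is a plain-Python sketch of what a 60-second tumbling window computes; in Kinesis Data Analytics the same logic would be expressed in SQL (a windowed `GROUP BY`) or with Flink's window operators, not in application code like this.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Count events per key in fixed, non-overlapping windows.

    events: iterable of (epoch_seconds, key) pairs.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # bucket into its window
        windows[window_start][key] += 1
    return windows

events = [(0, "a"), (30, "b"), (61, "a"), (90, "a")]
for start, counts in sorted(tumbling_window_counts(events).items()):
    print(f"window [{start}, {start + 60}):", dict(counts))
```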
Kinesis Video Streams
29. Use Cases: Security cameras, CCTV, body cameras, audio feeds.
30. Integration: Works with AWS ML services like Rekognition and SageMaker.
31. Producer SDK: Available for C++, Android, and iOS.
32. Fragments: Video is divided into fragments for processing.
33. Metadata: Can attach time-indexed metadata to video streams.
Performance Optimization and Limits
34. Kinesis Data Streams Limits:
- Per shard: 1MB/sec ingestion, 2MB/sec consumption
- Maximum record size: 1MB
- API limits: writes of up to 1,000 records/sec per shard (PutRecord/PutRecords); GetRecords limited to 5 transactions/sec per shard
- Maximum data retention: 365 days
35. Handling Hot Shards: Use a good partition key design to distribute data evenly.
36. Resharding Best Practices: Perform during low traffic periods; complete one operation before starting another.
37. Kinesis Data Firehose Limits:
- Maximum record size: 1MB
- Service quota: 5000 records/sec or 5MB/sec per delivery stream (can be increased)
38. Kinesis Data Analytics Limits:
- SQL applications: Up to 8 KPUs per application
- Flink applications: Based on parallelism configuration
39. Implementing Throttling (see the backoff sketch after this list):
- Use exponential backoff for retries
- Implement client-side rate limiting
- Monitor and alert on throttling metrics
40. Overcoming Rate Limits:
- Request quota increases
- Implement batching with KPL
- Add more shards (Data Streams)
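The retry pattern from items 39-40 might look like the following sketch, which retries `put_records` with exponential backoff and jitter and resubmits only the failed subset of each batch. The stream name, region, and record shape are assumptions.

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis", region_name="us-east-1")  # assumed region

def put_with_backoff(stream_name, records, max_attempts=5):
    """records: list of {"Data": bytes, "PartitionKey": str} dicts."""
    for attempt in range(max_attempts):
        try:
            response = kinesis.put_records(StreamName=stream_name, Records=records)
            if response["FailedRecordCount"] == 0:
                return
            # The response parallels the request: keep only the failed records.
            records = [rec for rec, res in zip(records, response["Records"])
                       if "ErrorCode" in res]
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
        time.sleep(min(2 ** attempt + random.random(), 30))  # backoff + jitter
    raise RuntimeError(f"{len(records)} records still failing after {max_attempts} attempts")
```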
Replayability and Data Recovery
41. Replay Options in Kinesis Data Streams (illustrated after this list):
- Start from specific sequence number
- Start from timestamp
- Start from TRIM_HORIZON (oldest available record)
- Start from LATEST (most recent record)
42. Checkpointing: KCL stores consumption progress in DynamoDB for fault tolerance.
43. Reprocessing Strategies:
- Create new consumer with different application name
- Reset existing consumer's checkpoints
- Use enhanced fan-out for parallel processing
44. Data Persistence: Consider archiving to S3 via Firehose for long-term storage.
45. Disaster Recovery: Implement cross-region replication using Lambda.
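A minimal sketch of the four replay starting positions from item 41, expressed as `GetShardIterator` calls; the stream name, shard ID, timestamp, and sequence number are placeholders.

```python
from datetime import datetime, timezone

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # assumed region
common = {"StreamName": "my-stream", "ShardId": "shardId-000000000000"}

oldest = kinesis.get_shard_iterator(**common, ShardIteratorType="TRIM_HORIZON")
newest = kinesis.get_shard_iterator(**common, ShardIteratorType="LATEST")
by_time = kinesis.get_shard_iterator(
    **common,
    ShardIteratorType="AT_TIMESTAMP",
    Timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),  # placeholder date
)
by_seq = kinesis.get_shard_iterator(
    **common,
    ShardIteratorType="AT_SEQUENCE_NUMBER",
    StartingSequenceNumber="495903...",  # placeholder sequence number
)
```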
Open Source Integration
46. Kinesis Client Library (KCL) vs. Apache Kafka Consumer:
Feature | KCL | Kafka Consumer |
---|---|---|
Language Support | Java, Node.js, Python, .NET, Ruby | Java, multiple languages via clients |
Coordination | DynamoDB | ZooKeeper/Broker |
Scaling | Per shard consumption | Consumer groups |
Checkpointing | Built-in | Manual offset management |
Fault Tolerance | Automatic worker rebalancing | Rebalance protocol |
47. Apache Flink Integration: Kinesis Data Analytics for Apache Flink provides managed Flink environment.
48. Spark Streaming Integration: Can consume from Kinesis using Spark Kinesis connector.
CloudWatch Monitoring for Kinesis
49. Key Metrics for Kinesis Data Streams:
- `GetRecords.IteratorAgeMilliseconds`: Age of the oldest record (high values indicate consumer lag)
- `IncomingBytes` / `IncomingRecords`: Volume of data/records being written
- `ReadProvisionedThroughputExceeded` / `WriteProvisionedThroughputExceeded`: Throttling events
- `PutRecord.Success` / `GetRecords.Success`: Success rates for operations
50. Key Metrics for Kinesis Data Firehose:
- `DeliveryToS3.Success`: Success rate of S3 deliveries
- `IncomingBytes` / `IncomingRecords`: Volume of data/records being received
- `ThrottledRecords`: Number of records throttled
- `DeliveryToS3.DataFreshness`: Age of the oldest record not yet delivered
51. Key Metrics for Kinesis Data Analytics:
- `KPUs`: Number of KPUs being used
- `fullRestarts`: Number of application restarts
- `downtime`: Application downtime
- `InputProcessing.OkBytes` / `InputProcessing.OkRecords`: Successfully processed data
52. Recommended Alarms:
- Iterator age > 30 seconds (consumer lag)
- High throttling rates (>10%)
- Delivery freshness > buffer time + 60 seconds (Firehose)
- Error rates > 1%
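The first recommended alarm above (iterator age) could be created as in this sketch; the alarm name, stream name, and region are assumptions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # assumed

# Alarm when the oldest unread record is older than 30 seconds,
# i.e. consumers of the (assumed) stream "my-stream" are falling behind.
cloudwatch.put_metric_alarm(
    AlarmName="my-stream-consumer-lag",
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[{"Name": "StreamName", "Value": "my-stream"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=30000,  # 30 seconds, in milliseconds
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```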
Security and Compliance
53. Encryption:
- Server-side encryption with KMS
- HTTPS endpoints for in-transit encryption
- Client-side encryption options
54. Authentication and Authorization:
- IAM roles and policies
- Fine-grained access control with IAM
55. VPC Integration: Private access via VPC endpoints.
56. Compliance: Supports HIPAA, PCI DSS, SOC, and ISO compliance.
57. Audit Logging: All API calls logged to CloudTrail.
Cost Optimization
58. Kinesis Data Streams Cost Factors:
- Shard hours (provisioned mode)
- Data ingested (on-demand mode)
- Extended data retention (beyond 24 hours)
- Enhanced fan-out consumers
59. Kinesis Data Firehose Cost Factors:
- Data ingested
- Format conversion
- VPC delivery
60. Kinesis Data Analytics Cost Factors:
- KPU hours
- Durable application backups
61. Cost Optimization Strategies:
- Right-size shard count
- Use on-demand mode for variable workloads
- Batch records when possible
- Monitor and adjust resources based on usage patterns
Common Architectures and Patterns
62. Real-time Analytics Pipeline:
- Data Streams → Data Analytics → Firehose → S3/Redshift
63. Log Processing Pipeline:
- CloudWatch Logs → Firehose → S3 → Athena
64. IoT Data Pipeline:
- IoT Core → Data Streams → Lambda → DynamoDB
65. Click-stream Analysis:
- Web/Mobile → Kinesis Data Streams → Data Analytics → ElastiCache
66. Machine Learning Pipeline:
- Data Streams → Firehose → S3 → SageMaker
Exam Tips and Common Scenarios
67. Scenario: High-throughput Ingestion
- Solution: Use KPL with batching and appropriate shard count
68. Scenario: Consumer Lag
- Solution: Add more consumers, use enhanced fan-out, or increase processing efficiency
69. Scenario: Data Transformation Before Storage
- Solution: Use Firehose with Lambda transformation
70. Scenario: Real-time Anomaly Detection
- Solution: Kinesis Data Analytics with RANDOM_CUT_FOREST function
71. Scenario: Video Analysis
- Solution: Kinesis Video Streams with Rekognition integration
72. Scenario: Exactly-once Processing
- Solution: Use KCL with careful checkpoint management
73. Scenario: Cross-region Replication
- Solution: Consumer application that reads from one region and produces to another
74. Scenario: Handling Spiky Traffic
- Solution: On-demand capacity mode for Data Streams
75. Scenario: Long-term Analytics
- Solution: Firehose to S3 with Athena or Redshift Spectrum
76. Scenario: Stream Enrichment
- Solution: Kinesis Data Analytics with reference data from S3
Troubleshooting Common Issues
77. ProvisionedThroughputExceededException:
- Cause: Exceeding shard limits
- Solution: Add more shards, implement backoff, use KPL batching
78. Iterator Expiration:
- Cause: A shard iterator expires if it is not used within 5 minutes of being issued
- Solution: Process faster or request less data per GetRecords call
79. Consumer Lag:
- Cause: Slow processing, insufficient consumers
- Solution: Add consumers, optimize processing, use enhanced fan-out
80. Duplicate Records:
- Cause: Producer retries, consumer restarts
- Solution: Implement idempotent processing and track processed records (see the sketch after this list)
81. Data Loss:
- Cause: Exceeding retention period, not handling failures
- Solution: Increase retention, implement proper error handling
82. Uneven Shard Distribution:
- Cause: Poor partition key choice
- Solution: Use high-cardinality partition keys, avoid hot keys
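One way to implement the idempotent processing from item 80 is a conditional write to DynamoDB keyed on the record's sequence number, as in this sketch; the table name is an assumption, and production code may prefer to process first and commit the marker transactionally.

```python
import boto3

# Assumed table "processed-records" with partition key "sequence_number".
table = boto3.resource("dynamodb", region_name="us-east-1").Table("processed-records")

def process_once(record, handler):
    try:
        # The conditional write fails if this sequence number was seen before.
        table.put_item(
            Item={"sequence_number": record["SequenceNumber"]},
            ConditionExpression="attribute_not_exists(sequence_number)",
        )
    except table.meta.client.exceptions.ConditionalCheckFailedException:
        return  # duplicate from a producer retry or consumer restart: skip
    handler(record)
```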
Advanced Features and Integrations
83. Kinesis Data Streams Enhanced Fan-Out:
- Dedicated throughput of 2MB/sec per consumer per shard
- Uses HTTP/2 for push-based delivery
- Lower latency (70ms vs. 200ms+)
84. Kinesis Data Firehose Dynamic Partitioning:
- Partition data in S3 based on record content
- Creates logical folders based on partition keys
- Optimizes for query performance
85. Kinesis Data Analytics Connectors:
- Apache Kafka
- Amazon MSK
- Amazon S3
- Amazon DynamoDB
86. Kinesis Video Streams Edge Agent:
- Run on IoT devices
- Buffer and upload video when connectivity is restored
- Supports intermittent connectivity scenarios
87. AWS Lambda Integration (handler sketch at the end of this section):
- Event source mapping for Data Streams
- Transformation for Firehose
- Processing for Video Streams
88. AWS Glue Integration:
- Streaming ETL jobs can use Kinesis as source
- Process and transform streaming data
- Load into data lake or data warehouse
89. Amazon SageMaker Integration:
- Real-time ML inference on streaming data
- Model training on historical stream data
- Anomaly detection on streams
90. Amazon EventBridge Integration:
- Can route events to Kinesis
- Enables serverless event-driven architectures
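For the Lambda integration in item 87, the Data Streams event source mapping delivers batches to a handler shaped like this sketch; record payloads arrive base64-encoded.

```python
import base64
import json

def lambda_handler(event, context):
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        print(record["kinesis"]["sequenceNumber"], payload)
    # Returning normally checkpoints the batch; raising an error retries it.
```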
Comparison with Other AWS Streaming Services
Feature | Kinesis Data Streams | MSK (Managed Kafka) | EventBridge | SQS |
---|---|---|---|---|
Purpose | Real-time streaming | Streaming & messaging | Event routing | Messaging |
Throughput | 1MB/sec per shard | Scales with broker count/size | Thousands of events/sec | Nearly unlimited |
Retention | Up to 365 days | Configurable | None by default (optional archive) | Up to 14 days |
Ordering | Per shard | Per partition | Not guaranteed | FIFO available |
Consumers | Multiple | Multiple | Multiple | Single (unless using fan-out) |
Replay | Yes | Yes | Yes (from archive) | No |
Scaling | Manual/Auto | Manual | Automatic | Automatic |
Latency | ~200ms | ~ms | ~seconds | ~ms |
Data Ingestion Patterns and Throughput Characteristics
91. Batch vs. Real-time Ingestion:
- Batch: Lower cost, higher latency
- Real-time: Higher cost, lower latency
92. Throughput Patterns:
- Steady-state: Predictable, consistent load
- Spiky: Unpredictable bursts of traffic
- Cyclical: Predictable peaks and valleys
93. Handling Backpressure:
- Buffer in SQS before Kinesis
- Implement client-side throttling
- Use adaptive batching
94. Kinesis Agent Features:
- Pre-processing capabilities
- Automatic retry with backoff
- CloudWatch monitoring integration
95. Multi-Region Considerations:
- Independent streams per region
- Cross-region replication via Lambda
- Global producer routing strategies
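A sketch of the cross-region replication mentioned in item 95: a Lambda consumer in the source region forwards each batch to a stream in another region. The stream name and regions are assumptions.

```python
import base64

import boto3

# Source: a Lambda event source mapping in the primary region.
# Target: an assumed stream "replica-stream" in eu-west-1.
target = boto3.client("kinesis", region_name="eu-west-1")

def lambda_handler(event, context):
    records = [
        {
            "Data": base64.b64decode(r["kinesis"]["data"]),
            "PartitionKey": r["kinesis"]["partitionKey"],  # preserve ordering key
        }
        for r in event["Records"]
    ]
    response = target.put_records(StreamName="replica-stream", Records=records)
    if response["FailedRecordCount"]:
        raise RuntimeError("partial failure; let Lambda retry the batch")
```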
Replayability of Data Ingestion Pipelines
96. Replay Strategies:
- Store raw data in S3 for full reprocessing
- Use Kinesis retention period for short-term replay
- Implement event sourcing patterns
97. Replay Scenarios:
- Bug fixes in processing logic
- Recovery from downstream failures
- Historical analysis with new algorithms
98. Implementing Replayability:
- Maintain immutable event store
- Version processing applications
- Use consumer group IDs for isolation
99. Replay Challenges:
- Handling side effects (idempotency)
- Managing processing order
- Balancing storage costs vs. replay needs
100. Replay Best Practices:
- Test replay capabilities regularly
- Document replay procedures
- Monitor replay performance and correctness
Exam-Specific Tips
101. Remember the throughput limits: 1MB/sec in, 2MB/sec out per shard.
102. Know the retention period options: 24 hours (default) to 365 days.
103. Understand the difference between KPL and KCL: Producer vs. Consumer libraries.
104. Know when to use each Kinesis service:
- Data Streams: Raw ingestion and processing
- Firehose: Easy loading to destinations
- Data Analytics: Real-time processing
- Video Streams: Video ingestion and processing
105. Understand resharding operations: Split increases capacity, merge decreases it.
106. Know the consumer options: SDK, KCL, Lambda, Firehose.
107. Remember Firehose buffer settings: Size (1-128 MB) and interval (60-900 seconds).
108. Know the encryption options: Server-side with KMS, client-side, TLS in transit.
109. Understand monitoring metrics: Iterator age, throughput exceeded, success rates.
110. Know the common error scenarios and solutions: Provisioned throughput exceeded, iterator expiration.
111. Understand the cost model: Shard-hours, data ingestion, retention, enhanced fan-out.
112. Know the integration points: Lambda, Glue, SageMaker, S3, Redshift.
113. Understand the differences between Kinesis and other services: MSK, SQS, EventBridge.
114. Know the replay options: TRIM_HORIZON, LATEST, AT_TIMESTAMP, AT_SEQUENCE_NUMBER.
115. Understand the capacity modes: Provisioned vs. On-demand.
116. Know the enhanced fan-out benefits: Dedicated throughput, lower latency.
117. Understand the Firehose delivery frequency: Buffer size or interval, whichever comes first.
118. Know the Data Analytics processing options: SQL vs. Flink.
119. Understand the Video Streams use cases: Security cameras, CCTV, ML integration.
120. Know the common architectures: Real-time analytics, log processing, IoT data pipelines.