Amazon Kinesis Cheat Sheet

Below are important pointers for the AWS Kinesis service, organized as a cheat sheet for the AWS Certified Data Engineer - Associate exam.

Core Components of AWS Kinesis

AWS Kinesis is a platform for streaming data on AWS that makes it easy to collect, process, and analyze real-time streaming data.

| Component | Description |
| --- | --- |
| Kinesis Data Streams | Real-time data streaming service for ingesting and storing data streams |
| Kinesis Data Firehose | Loads streaming data into AWS data stores with near-zero management |
| Kinesis Data Analytics | Process and analyze streaming data in real time with SQL or Apache Flink |
| Kinesis Video Streams | Stream video from connected devices to AWS for analytics and ML |

Mind Map: AWS Kinesis Ecosystem

```
AWS Kinesis
├── Kinesis Data Streams
│   ├── Producers
│   │   ├── AWS SDK
│   │   ├── Kinesis Producer Library (KPL)
│   │   └── Kinesis Agent
│   ├── Consumers
│   │   ├── AWS SDK
│   │   ├── Kinesis Client Library (KCL)
│   │   ├── Lambda
│   │   └── Kinesis Data Firehose
│   └── Shards (throughput units)
├── Kinesis Data Firehose
│   ├── Sources
│   │   ├── Direct PUT
│   │   ├── Kinesis Data Streams
│   │   └── CloudWatch Logs/Events
│   └── Destinations
│       ├── S3
│       ├── Redshift
│       ├── Elasticsearch
│       └── Splunk
├── Kinesis Data Analytics
│   ├── SQL Applications
│   └── Apache Flink Applications
└── Kinesis Video Streams
    ├── Producers (cameras, etc.)
    └── Consumers (ML services, etc.)
```

Detailed Features and Specifications

| Feature | Kinesis Data Streams | Kinesis Data Firehose | Kinesis Data Analytics | Kinesis Video Streams |
| --- | --- | --- | --- | --- |
| Purpose | Real-time streaming data ingestion | Load streaming data into data stores | Process streaming data in real time | Stream video data for analytics |
| Data Retention | 24 hours (default) up to 365 days | No storage (buffered, then delivered) | No storage (immediate processing) | Up to 7 days |
| Scaling Unit | Shards | Automatic scaling | KPUs (Kinesis Processing Units) | Automatic scaling |
| Throughput | 1 MB/sec or 1,000 records/sec per shard (ingestion); 2 MB/sec per shard (consumption) | Dynamic, up to service quotas | Based on KPUs | MB/sec per stream |
| Latency | ~200 ms | At least 60 seconds (buffering) | Real time (seconds) | Real time |
| Reprocessing | Yes (by position) | No | No | Yes (by timestamp) |
| Pricing Model | Per shard-hour + PUT payload units | Volume of data + optional data transformation | KPU-hours | GB ingested + stored |

Kinesis Data Streams Deep Dive

1. Shards: The basic throughput unit of Kinesis Data Streams. Each shard provides 1MB/sec (or 1,000 records/sec) of data input and 2MB/sec of data output capacity.

2. Partition Key: Determines which shard a data record goes to. Good partition key design ensures even distribution across shards.

3. Sequence Number: Unique identifier for each record within a shard, assigned when a record is ingested.

4. Data Record: The unit of data stored in Kinesis Data Streams (up to 1MB).
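
A minimal producer sketch with boto3 (the stream name and payload are placeholders): the partition key is hashed to choose a shard, and the response carries the shard ID and the record's sequence number.

```python
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Records sharing a partition key land on the same shard,
# which preserves their relative order within that shard.
response = kinesis.put_record(
    StreamName="my-stream",  # placeholder stream name
    Data=json.dumps({"sensor_id": "s-42", "temp": 21.5}).encode(),
    PartitionKey="s-42",
)
print(response["ShardId"], response["SequenceNumber"])
```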

5. Retention Period: Data can be stored from 24 hours (default) up to 365 days.

6. Enhanced Fan-Out: Provides dedicated throughput of 2MB/sec per consumer per shard (vs. shared 2MB/sec without it).

7. Capacity Modes (see the sketch below):

  - Provisioned: You specify the number of shards
  - On-demand: Automatically scales based on observed throughput
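
A minimal sketch of creating a stream in each capacity mode with boto3 (stream names are placeholders):

```python
import boto3

kinesis = boto3.client("kinesis")

# Provisioned mode: you pick the shard count up front.
kinesis.create_stream(StreamName="orders-provisioned", ShardCount=4)

# On-demand mode: capacity scales automatically with observed traffic.
kinesis.create_stream(
    StreamName="orders-on-demand",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)
```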

8. Resharding Operations (see the sketch below):

  - Shard split: Increases stream capacity
  - Shard merge: Decreases stream capacity
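
For uniform scaling, UpdateShardCount performs the splits or merges for you; SplitShard and MergeShards remain available for targeted, manual resharding. A hedged boto3 sketch (stream name is a placeholder):

```python
import boto3

kinesis = boto3.client("kinesis")

# Scale to 8 shards; the service issues the underlying
# shard splits or merges. Per-call scaling limits apply.
kinesis.update_shard_count(
    StreamName="my-stream",
    TargetShardCount=8,
    ScalingType="UNIFORM_SCALING",
)
```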

9. Producer Options:

  - AWS SDK (simple, low throughput)
  - Kinesis Producer Library (KPL) (high performance, batching, retry)
  - Kinesis Agent (log file collection)

10. Consumer Options (a plain-SDK consumer loop is sketched below):
- AWS SDK (simple, manual)
- Kinesis Client Library (KCL) (distributed, coordinated consumption)
- Lambda (serverless processing)
- Firehose (delivery to destinations)
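
A plain-SDK consumer loop for a single shard, as a minimal sketch (stream name and shard ID are placeholders); KCL layers lease coordination and checkpointing on top of exactly this pattern:

```python
import time

import boto3

kinesis = boto3.client("kinesis")

# Obtain an iterator at the oldest available record, then poll.
iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        print(record["PartitionKey"], record["Data"])
    iterator = batch.get("NextShardIterator")
    time.sleep(1)  # stay under the 5 GetRecords calls/sec/shard limit
```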

11. Throughput Calculation Example (generalized in the sketch below):
- Required ingestion: 10MB/sec
- Required shards = 10MB ÷ 1MB = 10 shards
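
The same arithmetic generalized as a small helper (a sketch; the limits encoded are the per-shard figures above):

```python
import math

def required_shards(ingest_mb_per_sec, records_per_sec, consumers=1):
    # Each shard ingests 1 MB/sec or 1,000 records/sec; shared-throughput
    # consumers split 2 MB/sec of read capacity per shard.
    by_bytes = math.ceil(ingest_mb_per_sec / 1.0)
    by_records = math.ceil(records_per_sec / 1000.0)
    by_reads = math.ceil(ingest_mb_per_sec * consumers / 2.0)
    return max(by_bytes, by_records, by_reads)

print(required_shards(10, 5000))  # -> 10, matching the example above
```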

12. KCL vs KPL: KPL (Producer Library) handles data production with batching and retries; KCL (Consumer Library) manages distributed consumption with checkpointing.

Kinesis Data Firehose Details

13. Delivery Frequency:
- S3: Buffer size (1-128 MB) or buffer interval (60-900 seconds)
- Other destinations: Buffer size (1-100 MB) or buffer interval (60-900 seconds)

14. Data Transformation: Lambda can transform records before delivery.
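
A minimal sketch of the transformation Lambda contract Firehose expects: each incoming record carries base64-encoded data, and each returned record must echo the recordId with a result of Ok, Dropped, or ProcessingFailed. The enrichment field here is purely illustrative.

```python
import base64
import json

def lambda_handler(event, context):
    # Firehose invokes the function with a batch of records.
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True  # illustrative enrichment
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}
```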

15. Format Conversion: Convert data to Parquet or ORC before S3 delivery.

16. Compression Options: GZIP, ZIP, Snappy for S3 delivery.

17. Error Handling: Failed records are written to a configured S3 error prefix/bucket for inspection and reprocessing.

18. Dynamic Partitioning: Partition data in S3 based on record content.

19. No Data Loss: Retries delivery until it succeeds or the retry period lapses, then routes records that still fail to S3 backup.

20. Serverless: No capacity planning needed.

Kinesis Data Analytics Features

21. SQL Applications: Process streams using SQL queries.

22. Apache Flink Applications: Use Java, Scala, or Python with Apache Flink.

23. Input Sources: Kinesis Data Streams, Firehose, or reference data from S3.

24. Processing Features (a tumbling-window illustration follows):
- Windowed aggregations (tumbling, sliding, session)
- Anomaly detection
- Stream-to-stream joins
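
To make the windowing idea concrete, here is a plain-Python illustration of a tumbling window (conceptual only, not the Kinesis Data Analytics API): each event is assigned to exactly one fixed-size, non-overlapping window by truncating its timestamp.

```python
from collections import defaultdict

def tumbling_counts(events, window_seconds=60):
    # events: iterable of (epoch_seconds, payload) pairs.
    windows = defaultdict(int)
    for ts, _payload in events:
        window_start = ts - (ts % window_seconds)  # truncate to window
        windows[window_start] += 1
    return dict(windows)

print(tumbling_counts([(0, "a"), (30, "b"), (61, "c")]))
# {0: 2, 60: 1} -- two events in the first minute, one in the second
```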

25. Output Options: Kinesis Data Streams, Firehose, or Lambda.

26. Scaling: Based on Kinesis Processing Units (KPUs).

27. Checkpointing: Automatic state persistence for fault tolerance.

28. Parallelism: Automatically parallelizes processing across multiple instances.

Kinesis Video Streams

29. Use Cases: Security cameras, CCTV, body cameras, audio feeds.

30. Integration: Works with AWS ML services like Rekognition and SageMaker.

31. Producer SDK: Available for C++, Android, and iOS.

32. Fragments: Video is divided into fragments for processing.

33. Metadata: Can attach time-indexed metadata to video streams.

Performance Optimization and Limits

34. Kinesis Data Streams Limits:
- Per shard: 1MB/sec ingestion, 2MB/sec consumption
- Maximum record size: 1MB
- API limits (per shard): PutRecord (1,000 records/sec), GetRecords (5 transactions/sec)
- Maximum data retention: 365 days

35. Handling Hot Shards: Use a good partition key design to distribute data evenly.

36. Resharding Best Practices: Perform during low traffic periods; complete one operation before starting another.

37. Kinesis Data Firehose Limits:
- Maximum record size: 1MB
- Service quota: 5000 records/sec or 5MB/sec per delivery stream (can be increased)

38. Kinesis Data Analytics Limits:
- SQL applications: Up to 8 KPUs per application
- Flink applications: Based on parallelism configuration

39. Implementing Throttling (a backoff sketch follows):
- Use exponential backoff for retries
- Implement client-side rate limiting
- Monitor and alert on throttling metrics
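
A minimal sketch of exponential backoff with jitter around a throttled write (the stream name is a placeholder; note the AWS SDK also retries internally, so tune the two together):

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

def put_with_backoff(stream, data, key, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return kinesis.put_record(
                StreamName=stream, Data=data, PartitionKey=key
            )
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            # Exponential backoff with full jitter, capped at 20 seconds.
            time.sleep(min(2 ** attempt, 20) * random.random())
    raise RuntimeError("gave up after repeated throttling")
```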

40. Overcoming Rate Limits:
- Request quota increases
- Implement batching with KPL
- Add more shards (Data Streams)

Replayability and Data Recovery

41. Replay Options in Kinesis Data Streams (illustrated below):
- Start from a specific sequence number (AT_SEQUENCE_NUMBER)
- Start from a timestamp (AT_TIMESTAMP)
- Start from TRIM_HORIZON (oldest available record)
- Start from LATEST (most recent record)
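
A replay sketch using AT_TIMESTAMP (stream name, shard ID, and timestamp are placeholders); TRIM_HORIZON and LATEST use the same call without the Timestamp argument:

```python
from datetime import datetime, timezone

import boto3

kinesis = boto3.client("kinesis")

iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="AT_TIMESTAMP",
    Timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),
)["ShardIterator"]

# Read one batch starting at that point in the stream.
batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in batch["Records"]:
    print(record["SequenceNumber"], record["Data"])
```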

42. Checkpointing: KCL stores consumption progress in DynamoDB for fault tolerance.

43. Reprocessing Strategies:
- Create new consumer with different application name
- Reset existing consumer's checkpoints
- Use enhanced fan-out for parallel processing

44. Data Persistence: Consider archiving to S3 via Firehose for long-term storage.

45. Disaster Recovery: Implement cross-region replication using Lambda.

Open Source Integration

46. Kinesis Client Library (KCL) vs. Apache Kafka Consumer:

| Feature | KCL | Kafka Consumer |
| --- | --- | --- |
| Language Support | Java, Node.js, Python, .NET, Ruby | Java, multiple languages via clients |
| Coordination | DynamoDB | ZooKeeper/Broker |
| Scaling | Per-shard consumption | Consumer groups |
| Checkpointing | Built-in | Manual offset management |
| Fault Tolerance | Automatic worker rebalancing | Rebalance protocol |

47. Apache Flink Integration: Kinesis Data Analytics for Apache Flink provides managed Flink environment.

48. Spark Streaming Integration: Can consume from Kinesis using Spark Kinesis connector.

CloudWatch Monitoring for Kinesis

49. Key Metrics for Kinesis Data Streams:
- GetRecords.IteratorAgeMilliseconds: Age of the oldest record (high values indicate consumer lag)
- IncomingBytes/IncomingRecords: Volume of data/records being written
- ReadProvisionedThroughputExceeded/WriteProvisionedThroughputExceeded: Throttling events
- PutRecord.Success/GetRecords.Success: Success rates for operations

50. Key Metrics for Kinesis Data Firehose:
- DeliveryToS3.Success: Success rate of S3 deliveries
- IncomingBytes/IncomingRecords: Volume of data/records being received
- ThrottledRecords: Number of records throttled
- DeliveryToS3.DataFreshness: Age of the oldest record not yet delivered

51. Key Metrics for Kinesis Data Analytics:
- KPUs: Number of KPUs being used
- fullRestarts: Number of application restarts
- downtime: Application downtime
- InputProcessing.OkBytes/InputProcessing.OkRecords: Successfully processed data

52. Recommended Alarms (the first is sketched below):
- Iterator age > 30 seconds (consumer lag)
- High throttling rates (>10%)
- Delivery freshness > buffer time + 60 seconds (Firehose)
- Error rates > 1%
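
A sketch of the iterator-age alarm with boto3 (alarm name, stream name, and SNS topic ARN are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Fire when the oldest unread record is older than 30 seconds,
# an early signal that consumers are falling behind.
cloudwatch.put_metric_alarm(
    AlarmName="kinesis-iterator-age",
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[{"Name": "StreamName", "Value": "my-stream"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=30000,  # milliseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],  # placeholder
)
```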

Security and Compliance

53. Encryption (enabling SSE is sketched below):
- Server-side encryption with KMS
- HTTPS endpoints for in-transit encryption
- Client-side encryption options
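
Enabling server-side encryption on an existing stream, as a minimal sketch (the stream name is a placeholder; swap the AWS-managed key alias for your own CMK if you need tighter control):

```python
import boto3

kinesis = boto3.client("kinesis")

kinesis.start_stream_encryption(
    StreamName="my-stream",
    EncryptionType="KMS",
    KeyId="alias/aws/kinesis",  # AWS-managed key; or your CMK ARN/alias
)
```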

54. Authentication and Authorization:
- IAM roles and policies
- Fine-grained access control with IAM

55. VPC Integration: Private access via VPC endpoints.

56. Compliance: Supports HIPAA, PCI DSS, SOC, and ISO compliance.

57. Audit Logging: All API calls logged to CloudTrail.

Cost Optimization

58. Kinesis Data Streams Cost Factors:
- Shard hours (provisioned mode)
- Data ingested (on-demand mode)
- Extended data retention (beyond 24 hours)
- Enhanced fan-out consumers

59. Kinesis Data Firehose Cost Factors:
- Data ingested
- Format conversion
- VPC delivery

60. Kinesis Data Analytics Cost Factors:
- KPU hours
- Durable application backups

61. Cost Optimization Strategies:
- Right-size shard count
- Use on-demand mode for variable workloads
- Batch records when possible
- Monitor and adjust resources based on usage patterns

Common Architectures and Patterns

62. Real-time Analytics Pipeline:
- Data Streams → Data Analytics → Firehose → S3/Redshift

63. Log Processing Pipeline:
- CloudWatch Logs → Firehose → S3 → Athena

64. IoT Data Pipeline:
- IoT Core → Data Streams → Lambda → DynamoDB

65. Click-stream Analysis:
- Web/Mobile → Kinesis Data Streams → Data Analytics → ElastiCache

66. Machine Learning Pipeline:
- Data Streams → Firehose → S3 → SageMaker

Exam Tips and Common Scenarios

67. Scenario: High-throughput Ingestion
- Solution: Use KPL with batching and appropriate shard count

68. Scenario: Consumer Lag
- Solution: Add more consumers, use enhanced fan-out, or increase processing efficiency

69. Scenario: Data Transformation Before Storage
- Solution: Use Firehose with Lambda transformation

70. Scenario: Real-time Anomaly Detection
- Solution: Kinesis Data Analytics with RANDOM_CUT_FOREST function

71. Scenario: Video Analysis
- Solution: Kinesis Video Streams with Rekognition integration

72. Scenario: Exactly-once Processing
- Solution: Use KCL with careful checkpoint management

73. Scenario: Cross-region Replication
- Solution: Consumer application that reads from one region and produces to another

74. Scenario: Handling Spiky Traffic
- Solution: On-demand capacity mode for Data Streams

75. Scenario: Long-term Analytics
- Solution: Firehose to S3 with Athena or Redshift Spectrum

76. Scenario: Stream Enrichment
- Solution: Kinesis Data Analytics with reference data from S3

Troubleshooting Common Issues

77. ProvisionedThroughputExceededException:
- Cause: Exceeding shard limits
- Solution: Add more shards, implement backoff, use KPL batching

78. Iterator Expiration:
- Cause: Not processing data within 5 minutes of retrieval
- Solution: Process faster or request less data per GetRecords call

79. Consumer Lag:
- Cause: Slow processing, insufficient consumers
- Solution: Add consumers, optimize processing, use enhanced fan-out

80. Duplicate Records:
- Cause: Producer retries, consumer restarts
- Solution: Implement idempotent processing and track processed records (see the sketch below)
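
One way to make processing idempotent, sketched with a DynamoDB conditional write (the table name is a placeholder and assumes a string partition key named seq): the per-shard sequence number serves as a deduplication key for consumer-restart reprocessing. Producer-retry duplicates receive fresh sequence numbers, so for those you would key on an application-level unique ID inside the payload instead.

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("processed-records")  # placeholder

def process_once(record, handler):
    try:
        # The conditional put fails if this sequence number was seen before.
        table.put_item(
            Item={"seq": record["SequenceNumber"]},
            ConditionExpression="attribute_not_exists(seq)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # duplicate delivery; skip side effects
        raise
    handler(record)
```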

81. Data Loss:
- Cause: Exceeding retention period, not handling failures
- Solution: Increase retention, implement proper error handling

82. Uneven Shard Distribution:
- Cause: Poor partition key choice
- Solution: Use high-cardinality partition keys, avoid hot keys

Advanced Features and Integrations

83. Kinesis Data Streams Enhanced Fan-Out:
- Dedicated throughput of 2MB/sec per consumer per shard
- Uses HTTP/2 for push-based delivery
- Lower latency (70ms vs. 200ms+)

84. Kinesis Data Firehose Dynamic Partitioning:
- Partition data in S3 based on record content
- Creates logical folders based on partition keys
- Optimizes for query performance

85. Kinesis Data Analytics Connectors:
- Apache Kafka
- Amazon MSK
- Amazon S3
- Amazon DynamoDB

86. Kinesis Video Streams Edge Agent:
- Run on IoT devices
- Buffer and upload video when connectivity is restored
- Supports intermittent connectivity scenarios

87. AWS Lambda Integration:
- Event source mapping for Data Streams
- Transformation for Firehose
- Processing for Video Streams

88. AWS Glue Integration:
- Streaming ETL jobs can use Kinesis as source
- Process and transform streaming data
- Load into data lake or data warehouse

89. Amazon SageMaker Integration:
- Real-time ML inference on streaming data
- Model training on historical stream data
- Anomaly detection on streams

90. Amazon EventBridge Integration:
- Can route events to Kinesis
- Enables serverless event-driven architectures

Comparison with Other AWS Streaming Services

| Feature | Kinesis Data Streams | MSK (Managed Kafka) | EventBridge | SQS |
| --- | --- | --- | --- | --- |
| Purpose | Real-time streaming | Streaming & messaging | Event routing | Messaging |
| Throughput | 1 MB/sec per shard | MB/sec per broker | Thousands of events/sec | Nearly unlimited |
| Retention | Up to 365 days | Configurable | No retention | Up to 14 days |
| Ordering | Per shard | Per partition | Not guaranteed | FIFO available |
| Consumers | Multiple | Multiple | Multiple | Single (unless using fan-out) |
| Replay | Yes | Yes | No | No |
| Scaling | Manual/Auto | Manual | Automatic | Automatic |
| Latency | ~200 ms | ~ms | ~seconds | ~ms |

Data Ingestion Patterns and Throughput Characteristics

91. Batch vs. Real-time Ingestion:
- Batch: Lower cost, higher latency
- Real-time: Higher cost, lower latency

92. Throughput Patterns:
- Steady-state: Predictable, consistent load
- Spiky: Unpredictable bursts of traffic
- Cyclical: Predictable peaks and valleys

93. Handling Backpressure:
- Buffer in SQS before Kinesis
- Implement client-side throttling
- Use adaptive batching

94. Kinesis Agent Features:
- Pre-processing capabilities
- Automatic retry with backoff
- CloudWatch monitoring integration

95. Multi-Region Considerations:
- Independent streams per region
- Cross-region replication via Lambda
- Global producer routing strategies

Replayability of Data Ingestion Pipelines

96. Replay Strategies:
- Store raw data in S3 for full reprocessing
- Use Kinesis retention period for short-term replay
- Implement event sourcing patterns

97. Replay Scenarios:
- Bug fixes in processing logic
- Recovery from downstream failures
- Historical analysis with new algorithms

98. Implementing Replayability:
- Maintain immutable event store
- Version processing applications
- Use consumer group IDs for isolation

99. Replay Challenges:
- Handling side effects (idempotency)
- Managing processing order
- Balancing storage costs vs. replay needs

100. Replay Best Practices:
- Test replay capabilities regularly
- Document replay procedures
- Monitor replay performance and correctness

Exam-Specific Tips

101. Remember the throughput limits: 1MB/sec in, 2MB/sec out per shard.

102. Know the retention period options: 24 hours (default) to 365 days.

103. Understand the difference between KPL and KCL: Producer vs. Consumer libraries.

104. Know when to use each Kinesis service:
- Data Streams: Raw ingestion and processing
- Firehose: Easy loading to destinations
- Data Analytics: Real-time processing
- Video Streams: Video ingestion and processing

105. Understand resharding operations: Split increases capacity, merge decreases it.

106. Know the consumer options: SDK, KCL, Lambda, Firehose.

107. Remember Firehose buffer settings: Size (1-128 MB) and interval (60-900 seconds).

108. Know the encryption options: Server-side with KMS, client-side, TLS in transit.

109. Understand monitoring metrics: Iterator age, throughput exceeded, success rates.

110. Know the common error scenarios and solutions: Provisioned throughput exceeded, iterator expiration.

111. Understand the cost model: Shard-hours, data ingestion, retention, enhanced fan-out.

112. Know the integration points: Lambda, Glue, SageMaker, S3, Redshift.

113. Understand the differences between Kinesis and other services: MSK, SQS, EventBridge.

114. Know the replay options: TRIM_HORIZON, LATEST, AT_TIMESTAMP, AT_SEQUENCE_NUMBER.

115. Understand the capacity modes: Provisioned vs. On-demand.

116. Know the enhanced fan-out benefits: Dedicated throughput, lower latency.

117. Understand the Firehose delivery frequency: Buffer size or interval, whichever comes first.

118. Know the Data Analytics processing options: SQL vs. Flink.

119. Understand the Video Streams use cases: Security cameras, CCTV, ML integration.

120. Know the common architectures: Real-time analytics, log processing, IoT data pipelines.
