Amazon OpenSearch Service - Cheat Sheet

Amazon OpenSearch Service Cheat Sheet for AWS Certified Data Engineer - Associate (DEA-C01)

Core Concepts and Building Blocks

Amazon OpenSearch Service is a managed service that makes it easy to deploy, operate, and scale OpenSearch clusters in the AWS Cloud. OpenSearch is a distributed, open-source search and analytics engine used for workloads such as log analytics, real-time application monitoring, and clickstream analysis.

Key Components:

  1. Domain - A logical container for an OpenSearch cluster
  2. Cluster - Collection of one or more data nodes, optional dedicated master nodes, and optional UltraWarm nodes
  3. Node - An instance in the cluster (data, master, or UltraWarm)
  4. Index - Collection of documents with similar characteristics
  5. Shard - Horizontal partition of an index (primary or replica)
  6. Document - Basic unit of information that can be indexed
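
To make these building blocks concrete, here is a minimal sketch using the opensearch-py client (the endpoint, credentials, and index name are placeholders, not values from this article) that creates an index with explicit primary and replica shard counts and stores a single document in it:

```python
# pip install opensearch-py
from opensearchpy import OpenSearch

# Hypothetical domain endpoint and credentials -- substitute your own.
client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("admin", "my-master-password"),
    use_ssl=True,
)

# An index is a collection of documents; shards partition it across data nodes.
client.indices.create(
    index="app-logs",
    body={"settings": {"number_of_shards": 3, "number_of_replicas": 1}},
)

# A document is the basic unit of information that gets indexed.
client.index(
    index="app-logs",
    id="1",
    body={"timestamp": "2024-01-01T00:00:00Z", "level": "ERROR", "message": "timeout"},
)
```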

OpenSearch Service Features and Details

| Feature | Description |
| --- | --- |
| Deployment Options | Single-AZ or Multi-AZ with standby |
| Instance Types | T2, T3, M5, C5, R5, I3, etc. |
| Storage Types | EBS (gp2, gp3, io1) or Instance Store |
| Data Tiers | Hot (active data), UltraWarm (less frequently accessed), Cold (archive) |
| Security | Fine-grained access control, encryption at rest and in transit, VPC support |
| Backup | Automated snapshots to S3 |
| Scaling | Horizontal (add nodes) and vertical (change instance types) |
| Monitoring | CloudWatch metrics, logs, and dashboards |
| Integrations | Kinesis Data Firehose, CloudWatch Logs, Lambda, etc. |
| API Compatibility | Compatible with OpenSearch and legacy Elasticsearch APIs |
| Visualization | OpenSearch Dashboards (formerly Kibana) |

Important Pointers for Amazon OpenSearch Service

  1. OpenSearch Service is the successor to Amazon Elasticsearch Service, offering compatibility with both OpenSearch and Elasticsearch APIs.

  2. OpenSearch is an open-source, distributed search and analytics engine derived from Elasticsearch 7.10.2.

  3. A domain is the fundamental unit in OpenSearch Service, representing a cluster with its configuration, instance types, instance count, and storage options.

  4. OpenSearch Service supports both development and production instance types, with production types offering dedicated resources.

  5. For production workloads, it's recommended to use at least three dedicated master nodes for cluster stability.

  6. Multi-AZ deployment with standby is recommended for production environments to ensure high availability.

  7. OpenSearch Service supports three data tiers: Hot (active data), UltraWarm (less frequently accessed), and Cold (archive).

  8. UltraWarm nodes use Amazon S3 for storage, offering cost-effective storage for less frequently accessed data at about 1/10th the cost of hot storage.

  9. Cold storage is even more cost-effective than UltraWarm, designed for data that is accessed infrequently and can tolerate slightly higher access latency.

  10. The maximum domain storage limit is 3 PB for hot tier (EBS), 20 PB for UltraWarm tier, and unlimited for cold tier.

  11. OpenSearch Service supports encryption at rest using AWS KMS and encryption in transit using HTTPS.

  12. Fine-grained access control allows you to control who can access specific data within your OpenSearch cluster.

  13. OpenSearch Service integrates with Amazon Cognito for OpenSearch Dashboards authentication.

  14. VPC support allows you to isolate your OpenSearch domain within your VPC and use security groups for access control.

  15. OpenSearch Service supports both automated and manual snapshots for backup and recovery.

  16. Automated snapshots are stored in S3 at no additional charge and retained for 14 days by default.

  17. Manual snapshots can be stored indefinitely and used for cluster migration or backup.

  18. OpenSearch Service supports cross-region replication for disaster recovery and global deployments.

  19. The maximum number of nodes per cluster is 80 for most instance types.

  20. The maximum number of shards per node is 1,000, including both primary and replica shards.

  21. The recommended maximum shard size is 50 GB for optimal performance.

  22. The maximum JVM heap size is 32 GB, regardless of instance size, due to JVM limitations.

  23. OpenSearch Service supports both index-level and document-level security.

  24. OpenSearch Service integrates with AWS CloudTrail for auditing API calls.

  25. OpenSearch Service supports SQL queries in addition to the Query DSL.

  26. OpenSearch Service supports anomaly detection for identifying unusual patterns in your data.

  27. OpenSearch Service supports alerting to notify you when data meets certain conditions.

  28. OpenSearch Service supports asynchronous search for long-running queries.

  29. OpenSearch Service supports cross-cluster search to query multiple domains.

  30. OpenSearch Service supports index state management for automating index lifecycle tasks.

  31. OpenSearch Service supports k-NN (k-nearest neighbor) search for vector-based similarity searches.

  32. OpenSearch Service supports learning to rank (LTR) for improving search relevance.

  33. OpenSearch Service supports custom packages for adding plugins and dictionaries.

  34. OpenSearch Service supports auto-tune to optimize domain performance automatically.

  35. OpenSearch Service supports index rollover to manage index size and age.

  36. OpenSearch Service supports index transforms to create a new index with transformed data.

  37. OpenSearch Service supports index aliases for abstracting indices from clients.

  38. OpenSearch Service supports index templates for defining settings and mappings for new indices.

  39. OpenSearch Service supports snapshot lifecycle management for automating snapshot creation and deletion.

  40. OpenSearch Service supports data streams for time-series data.

  41. OpenSearch Service supports ingest pipelines for preprocessing documents before indexing.

  42. OpenSearch Service supports field mappings to define how fields are indexed and stored.

  43. OpenSearch Service supports dynamic mappings to automatically detect field types.

  44. OpenSearch Service supports explicit mappings to define field types explicitly.

  45. OpenSearch Service supports nested fields for indexing arrays of objects.

  46. OpenSearch Service supports parent-child relationships for related documents.

  47. OpenSearch Service supports aggregations for data analysis and visualization.

  48. OpenSearch Service supports scripting for custom logic in queries and aggregations.

  49. OpenSearch Service supports percolator queries for matching documents against stored queries.

  50. OpenSearch Service supports highlighting to emphasize matching search terms in results.

  51. OpenSearch Service supports suggesters for search-as-you-type functionality.

  52. OpenSearch Service supports completion suggesters for auto-complete functionality.

  53. OpenSearch Service supports phrase suggesters for correcting misspelled phrases.

  54. OpenSearch Service supports term suggesters for correcting misspelled terms.

  55. OpenSearch Service supports fuzzy queries for approximate matching.

  56. OpenSearch Service supports wildcard queries for pattern matching.

  57. OpenSearch Service supports regular expression queries for pattern matching.

  58. OpenSearch Service supports range queries for numeric and date ranges.

  59. OpenSearch Service supports geo-spatial queries for location-based searches.

  60. OpenSearch Service supports boolean queries for combining multiple queries.

  61. OpenSearch Service supports function score queries for custom scoring.

  62. OpenSearch Service supports script score queries for custom scoring with scripts.

  63. OpenSearch Service supports decay functions for boosting by distance, date, or numeric value.

  64. OpenSearch Service supports field collapsing for grouping results by field.

  65. OpenSearch Service supports search templates for parameterized queries.

  66. OpenSearch Service supports rank evaluation for evaluating search quality.

  67. OpenSearch Service supports profile API for analyzing query performance.

  68. OpenSearch Service supports explain API for explaining query scoring.

  69. OpenSearch Service supports validate API for validating queries.

  70. OpenSearch Service supports count API for counting documents without returning them.

  71. OpenSearch Service supports the bulk API for batch operations (see the sketch after this list).

  72. OpenSearch Service supports multi-get API for retrieving multiple documents.

  73. OpenSearch Service supports multi-search API for executing multiple searches.

  74. OpenSearch Service supports update API for updating documents.

  75. OpenSearch Service supports update-by-query API for updating documents matching a query.

  76. OpenSearch Service supports delete-by-query API for deleting documents matching a query.

  77. OpenSearch Service supports reindex API for copying documents from one index to another.

  78. OpenSearch Service supports scroll API for retrieving large result sets.

  79. OpenSearch Service supports search_after for deep pagination.

  80. OpenSearch Service supports point-in-time search for consistent results across multiple searches.

  81. OpenSearch Service supports field capabilities API for retrieving field information.

  82. OpenSearch Service supports cat APIs for compact and aligned text output.

  83. OpenSearch Service supports cluster APIs for managing cluster settings.

  84. OpenSearch Service supports index APIs for managing indices.

  85. OpenSearch Service supports snapshot APIs for managing snapshots.

  86. OpenSearch Service supports task management APIs for managing long-running tasks.

  87. OpenSearch Service supports node stats API for retrieving node statistics.

  88. OpenSearch Service supports cluster stats API for retrieving cluster statistics.

  89. OpenSearch Service supports index stats API for retrieving index statistics.

  90. OpenSearch Service supports shard stats API for retrieving shard statistics.

  91. OpenSearch Service supports hot threads API for identifying busy threads.

  92. OpenSearch Service supports thread pool API for monitoring thread pools.

  93. OpenSearch Service supports circuit breaker API for monitoring memory usage.

  94. OpenSearch Service supports cluster health API for monitoring cluster health.

  95. OpenSearch Service supports cluster allocation explain API for explaining shard allocation.

  96. OpenSearch Service supports cluster reroute API for manually allocating shards.

  97. OpenSearch Service supports cluster update settings API for updating cluster settings.

  98. OpenSearch Service supports index recovery API for monitoring index recovery.

  99. OpenSearch Service supports index segments API for retrieving segment information.

  100. OpenSearch Service supports index shard stores API for retrieving shard store information.

  101. OpenSearch Service supports index flush API for flushing indices to disk.

  102. OpenSearch Service supports index refresh API for refreshing indices.

  103. OpenSearch Service supports index force merge API for merging segments.

  104. OpenSearch Service supports index clear cache API for clearing caches.

  105. OpenSearch Service supports index analyze API for analyzing text.
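
Several of the APIs above come up constantly in practice, the bulk API in particular. The sketch below (using the opensearch-py client; the endpoint, index name, and documents are made up for illustration) batches many documents into one _bulk request, then runs a boolean query against them:

```python
from opensearchpy import OpenSearch, helpers

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("admin", "my-master-password"),
    use_ssl=True,
)

# Bulk indexing (pointer 71): one HTTP request carries many index operations.
actions = (
    {"_index": "app-logs", "_id": str(i), "_source": {"level": "ERROR", "code": i}}
    for i in range(1000)
)
helpers.bulk(client, actions)

# Boolean query (pointer 60): combine a full-text clause with a range filter.
response = client.search(
    index="app-logs",
    body={
        "query": {
            "bool": {
                "must": [{"match": {"level": "ERROR"}}],
                "filter": [{"range": {"code": {"lt": 100}}}],
            }
        }
    },
)
print(response["hits"]["total"])
```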

Comparison of OpenSearch Service Instance Types

| Instance Type | vCPU | Memory (GiB) | Use Case |
| --- | --- | --- | --- |
| t3.small | 2 | 2 | Development and testing |
| t3.medium | 2 | 4 | Development and testing |
| m5.large | 2 | 8 | Production - balanced workloads |
| m5.xlarge | 4 | 16 | Production - balanced workloads |
| c5.large | 2 | 4 | Production - compute-intensive workloads |
| c5.xlarge | 4 | 8 | Production - compute-intensive workloads |
| r5.large | 2 | 16 | Production - memory-intensive workloads |
| r5.xlarge | 4 | 32 | Production - memory-intensive workloads |
| i3.large | 2 | 15.25 | Production - storage-intensive workloads |
| i3.xlarge | 4 | 30.5 | Production - storage-intensive workloads |

Comparison of OpenSearch Service Storage Types

| Storage Type | Performance | Use Case | Cost |
| --- | --- | --- | --- |
| EBS gp2 | Medium | General purpose | Medium |
| EBS gp3 | Medium-High | General purpose with customizable IOPS | Medium |
| EBS io1 | High | I/O-intensive workloads | High |
| Instance Store | Very High | High-performance workloads | Included with instance |
| UltraWarm | Medium | Less frequently accessed data | Low |
| Cold Storage | Low | Rarely accessed data | Very Low |

Data Ingestion Methods and Throughput Characteristics

| Ingestion Method | Throughput | Latency | Replayability | Rate Limiting |
| --- | --- | --- | --- | --- |
| Direct API | High | Low | Manual | Per-domain limits |
| Kinesis Data Firehose | Medium-High | Medium | Automatic with S3 backup | Configurable |
| Logstash | Medium | Medium | Configurable | Configurable |
| Fluentd | Medium | Medium | Configurable | Configurable |
| Lambda | Medium | Medium-High | Depends on source | Lambda concurrency limits |
| CloudWatch Logs | Medium | Medium-High | Limited | CloudWatch Logs limits |
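
As an example of buffered ingestion, the snippet below uses boto3 to put a record onto a Kinesis Data Firehose delivery stream that has been configured (outside this snippet) with an OpenSearch destination and S3 backup; the stream name and region are placeholders:

```python
import json

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Hypothetical delivery stream already pointed at an OpenSearch domain;
# Firehose buffers and retries, and can back up raw records to S3 for replay.
record = {"timestamp": "2024-01-01T00:00:00Z", "level": "ERROR", "message": "timeout"}
firehose.put_record(
    DeliveryStreamName="app-logs-to-opensearch",
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```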

OpenSearch vs Elasticsearch Comparison

| Feature | OpenSearch | Elasticsearch |
| --- | --- | --- |
| License | Apache 2.0 | Elastic License (not fully open-source) |
| Development | Community-driven | Elastic N.V. |
| AWS Support | Full support | Limited to older versions |
| Security Features | Included | Requires paid subscription in newer versions |
| Visualization | OpenSearch Dashboards | Kibana |
| Machine Learning | Basic capabilities | Advanced capabilities in paid tiers |
| Alerting | Included | Requires paid subscription |
| SQL Support | Included | Requires paid subscription |
| Anomaly Detection | Included | Requires paid subscription |

Important CloudWatch Metrics for Monitoring

| Metric | Description | Threshold Recommendation |
| --- | --- | --- |
| ClusterStatus.red | Indicates one or more primary shards are missing | Should be 0 |
| ClusterStatus.yellow | Indicates one or more replica shards are missing | Should be 0 in steady state |
| CPUUtilization | CPU usage percentage | <80% |
| JVMMemoryPressure | JVM heap usage percentage | <80% |
| FreeStorageSpace | Available storage space | >25% of total storage |
| SearchLatency | Time to complete search requests | Depends on application requirements |
| IndexingLatency | Time to complete indexing requests | Depends on application requirements |
| KibanaHealthyNodes | Number of healthy OpenSearch Dashboards nodes | Equal to number of nodes |
| MasterCPUUtilization | CPU usage of master nodes | <50% |
| MasterJVMMemoryPressure | JVM heap usage of master nodes | <80% |
| AutomatedSnapshotFailure | Indicates failed automated snapshots | Should be 0 |
| ThreadpoolSearchQueue | Number of queued search threads | Should be near 0 in steady state |
| ThreadpoolWriteQueue | Number of queued write threads | Should be near 0 in steady state |
| Shards.active | Number of active shards | Should be stable |
| Nodes | Number of nodes in the cluster | Should match expected count |
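
A few of these metrics are worth alarming on from day one. Below is a sketch using boto3 that creates an alarm on ClusterStatus.red (the domain name, account ID, and SNS topic ARN are placeholders; OpenSearch Service publishes metrics under the legacy AWS/ES namespace with DomainName and ClientId dimensions):

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alarm when any primary shard is missing (ClusterStatus.red >= 1).
cloudwatch.put_metric_alarm(
    AlarmName="opensearch-cluster-status-red",
    Namespace="AWS/ES",  # OpenSearch Service keeps the legacy namespace
    MetricName="ClusterStatus.red",
    Dimensions=[
        {"Name": "DomainName", "Value": "my-domain"},   # placeholder
        {"Name": "ClientId", "Value": "123456789012"},  # AWS account ID
    ],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
)
```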

Mind Map: Amazon OpenSearch Service Components

Amazon OpenSearch Service
├── Domain Management
│   ├── Creation and Configuration
│   ├── Scaling (Vertical and Horizontal)
│   ├── Version Management
│   └── Deletion and Backup
├── Node Types
│   ├── Data Nodes
│   ├── Dedicated Master Nodes
│   ├── UltraWarm Nodes
│   └── Cold Storage
├── Storage Options
│   ├── EBS Volumes (gp2, gp3, io1)
│   ├── Instance Store
│   ├── UltraWarm (S3)
│   └── Cold Storage (S3)
├── Security
│   ├── Fine-grained Access Control
│   ├── Encryption at Rest
│   ├── Encryption in Transit
│   ├── VPC Access
│   ├── IAM Authentication
│   └── Cognito Integration
├── Data Management
│   ├── Indexing
│   ├── Sharding
│   ├── Replication
│   ├── Snapshots
│   └── Index Lifecycle Management
├── Search and Analytics
│   ├── Full-text Search
│   ├── Aggregations
│   ├── SQL Support
│   ├── PPL Support
│   └── Visualization with OpenSearch Dashboards
├── Advanced Features
│   ├── Anomaly Detection
│   ├── Alerting
│   ├── k-NN Search
│   ├── Cross-cluster Search
│   └── Asynchronous Search
└── Monitoring and Management
    ├── CloudWatch Integration
    ├── Auto-Tune
    ├── Index State Management
    ├── Performance Analyzer
    └── CloudTrail Integration

Example Calculations

Shard Calculation

For a 200 GB index with expected growth to 300 GB:

Target shard size = 50 GB
Number of primary shards = 300 GB / 50 GB = 6 primary shards
With 1 replica: Total shards = 6 primary * (1 + 1) = 12 shards
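
The same arithmetic generalizes to any projected index size. A small helper (the function name and defaults are mine, not from any SDK) makes the rounding explicit:

```python
import math

def shard_plan(projected_size_gb: float, target_shard_gb: float = 50, replicas: int = 1):
    """Primary shard count sized for projected data, plus replica overhead."""
    primaries = math.ceil(projected_size_gb / target_shard_gb)
    total = primaries * (1 + replicas)
    return primaries, total

print(shard_plan(300))  # -> (6, 12), matching the calculation above
```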

Node Calculation

For a cluster with 500 GB of data, using r5.large.search instances (16 GB RAM):

JVM heap size = 8 GB (50% of instance memory)
Storage per node = ~2 TB (EBS)
Nodes needed for storage = 500 GB / 2 TB = 0.25, rounded up to 1 node (minimum)
For high availability: at least 2 data nodes + 3 dedicated master nodes

Throughput Calculation

For a cluster with 5 data nodes, each capable of handling 10,000 requests/second:

Theoretical max throughput = 5 nodes * 10,000 requests/second = 50,000 requests/second
With 30% headroom: Recommended max throughput = 50,000 * 0.7 = 35,000 requests/second

Implementing Throttling and Overcoming Rate Limits

  1. OpenSearch Service has service quotas that limit the number of domains, instances, and storage per account and region.

  2. To overcome API rate limits, implement exponential backoff and jitter in your client applications (a sketch follows this list).

  3. Use bulk API operations instead of single document operations to reduce the number of API calls.

  4. For high-throughput ingestion, consider using a buffer like Kinesis Data Firehose or SQS to smooth out traffic spikes.

  5. Monitor the 4xx and 5xx error rates in CloudWatch to detect rate limiting issues.

  6. Implement client-side throttling to prevent overwhelming the OpenSearch cluster.

  7. Use connection pooling in your client applications to reduce connection overhead.

  8. Consider using the _bulk API with optimal batch sizes (typically 5-15 MB per batch) for efficient indexing.

  9. Distribute indexing workloads evenly across all data nodes by using a consistent routing strategy.

  10. Use the refresh_interval setting to control how frequently OpenSearch makes new documents available for search.
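
A minimal sketch of point 2 above: capped exponential backoff with full jitter around any callable. The exception type is a placeholder for whatever your client raises on HTTP 429/503 responses:

```python
import random
import time

class ThrottledError(Exception):
    """Placeholder: substitute the throttling exception your client raises."""

def with_backoff(call, max_retries=5, base_delay=0.2, max_delay=10.0):
    # Full jitter: sleep a random duration in [0, capped exponential delay],
    # which spreads retries from many clients instead of synchronizing them.
    for attempt in range(max_retries):
        try:
            return call()
        except ThrottledError:
            if attempt == max_retries - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

# Usage (hypothetical): with_backoff(lambda: client.bulk(body=payload))
```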

Data Ingestion Pipeline Replayability

  1. When using Kinesis Data Firehose for ingestion, enable S3 backup to store raw data for replay capability.

  2. For direct API ingestion, maintain source data in S3 or another durable store for potential replay.

  3. Use DynamoDB or another database to track ingestion state and progress for resumability.

  4. Implement idempotent processing so data can be replayed safely without creating duplicates, for example by deriving document IDs from record content (see the sketch after this list).

  5. Consider using SQS dead-letter queues to capture and retry failed ingestion attempts.

  6. For Logstash pipelines, use persistent queues to prevent data loss during service disruptions.

  7. Implement checkpointing in your ingestion pipelines to track progress and enable resumption.

  8. Use Lambda destinations to capture failed events for later reprocessing.

  9. Consider using Apache Kafka as a buffer before OpenSearch for enhanced replay capabilities.

  10. Implement a circuit breaker pattern in your ingestion pipeline to handle temporary OpenSearch unavailability.
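
To illustrate point 4 above, deriving the document ID deterministically from the record's content means that replaying the same record overwrites the existing document instead of creating a duplicate. A sketch, assuming records are JSON-serializable dicts:

```python
import hashlib
import json

def doc_id(record: dict) -> str:
    # Canonicalize the record (sorted keys) so identical content always
    # hashes to the same ID, making re-ingestion idempotent.
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

event = {"order_id": "o-123", "status": "shipped"}
# Replaying this event writes to the same _id, e.g. (client is hypothetical):
# client.index(index="orders", id=doc_id(event), body=event)
print(doc_id(event))
```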
