Efficiently querying vast amounts of data in real time is essential for gaining actionable insights and making informed decisions. Amazon OpenSearch Service combined with Amazon S3 provides a powerful solution for organizations looking to leverage their data effectively. This guide walks through every step of integrating these AWS services so you can implement and optimize direct queries with confidence.
Table of Contents
- High-Level Architecture
- Prerequisites
- Step-by-Step Implementation
- Best Practices
- Testing and Validation
- Troubleshooting
- Conclusion
- References
- Glossary
- Appendices
High-Level Architecture
The integration involves the following key components:
- Amazon S3: Stores your datasets in formats like JSON, CSV, or Parquet.
- AWS Glue Data Catalog: Manages metadata and schema definitions for the data stored in S3.
- Amazon OpenSearch Service: Executes direct queries on the data in S3, leveraging its powerful search and analytics capabilities.
- IAM Roles and Policies: Ensure secure and controlled access between OpenSearch, S3, and Glue.
- OpenSearch Dashboards (Query Workbench): Interface for executing queries, managing accelerations, and visualizing data.
Real-World Use Case:
Consider an e-commerce platform that stores transactional data in Amazon S3. By integrating OpenSearch Service, the platform can perform real-time analytics on sales data, generate dashboards for monitoring sales performance, and enable advanced search functionalities for business intelligence.
Prerequisites
Before diving into the implementation, ensure you have the following:
- AWS Account: Access to AWS Management Console with permissions to create and manage S3, OpenSearch Service, Glue, and IAM resources.
- Data in S3: Your dataset (e.g., `transactions.json`) stored in Amazon S3 in JSON Lines format.
- OpenSearch Version: Ensure you're using Amazon OpenSearch Service version 2.13 or later.
- IAM Knowledge: Understanding of IAM roles and policies to grant necessary permissions.
- Basic SQL Knowledge: Familiarity with SQL for creating databases, tables, and writing queries.
- AWS CLI (Optional): For advanced configurations and troubleshooting.
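If you plan to use the AWS CLI, a quick sanity check confirms your credentials and default Region before you create any resources (the output will reflect your own account and profile):

# Confirm which AWS account and principal the CLI is acting as
aws sts get-caller-identity

# Confirm the default Region matches where you will create the bucket and domain
aws configure get region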
Naming Conventions
Consistent and clear naming conventions enhance maintainability and scalability. Below are recommended standards:
| Component | Naming Convention | Example |
|---|---|---|
| S3 Buckets | `opensearch-<environment>-<data-type>-bucket` | `opensearch-prod-json-data-bucket` |
| AWS Glue Databases | `<project>-<data-type>-db` | `videodbtransaction-db` |
| AWS Glue Tables | `<table-name>` | `transactions` |
| IAM Roles | `<Service><Purpose>Role` | `OpenSearchS3AccessRole` |
| OpenSearch Domains | `<project>-<environment>-domain` | `opensearch-poc-domain` |
| OpenSearch Indices | `<data-type>-<purpose>-index` | `transactions-covering-index` |
| Checkpoints in S3 | `checkpoint/<data-type>/<index-name>/` | `checkpoint/transactions/transactions_covering_index/` |
| Accelerations | `<data-type>_<acceleration-type>` | `transactions_skipping_index` |
| Materialized Views | `<table-name>__<frequency>_mview` | `transactions__week_live_mview` |
| Covering Indexes | `<table-name>_covering_index` | `transactions_covering_index` |
| PPL Commands | N/A | N/A |
Best Practice: Use lowercase letters and hyphens to separate words for readability and consistency.
Step-by-Step Implementation
Follow these detailed steps to set up and configure direct queries on Amazon OpenSearch Service with Amazon S3, leveraging AWS Glue and implementing performance optimizations through accelerations.
1. Creating an Amazon S3 Bucket
Step 1: Navigate to the Amazon S3 Console
- URL: Amazon S3 Console
Step 2: Create a New Bucket
- Click on "Create bucket".
- Configure Bucket Settings:
  - Bucket name: `opensearch-s3-poc-json-bucket` (must be unique across all AWS accounts).
  - Region: Select the same AWS Region as your OpenSearch domain (e.g., `us-east-1`).
- Set Object Ownership:
  - Recommended: Set to Bucket owner preferred.
- Block Public Access:
  - Enable: Ensure all options are checked to block public access.
- Additional Settings:
  - Configure Versioning, Tags, and Default encryption as per requirements.
- Review and Create:
  - Click Create bucket after reviewing all settings.
Step 3: Upload Sample Data to S3
- Open the newly created bucket `opensearch-s3-poc-json-bucket`.
- Create a Folder (Optional):
  - Example: `transactions/` for better organization.
- Upload Files:
  - Click Upload.
  - Add Files: Select and upload your `transactions.json`.
  - Click Upload.
Best Practice: Organize data using prefixes (folders) like `transactions/` and ensure data consistency.
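If you'd rather script this step, the AWS CLI (the optional prerequisite above) can create and populate the bucket. A minimal sketch using the example bucket name and Region from this guide:

# Create the bucket (us-east-1 needs no LocationConstraint)
aws s3api create-bucket --bucket opensearch-s3-poc-json-bucket --region us-east-1

# Block all public access, matching the console settings above
aws s3api put-public-access-block \
  --bucket opensearch-s3-poc-json-bucket \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

# Upload the sample dataset under the transactions/ prefix
aws s3 cp transactions.json s3://opensearch-s3-poc-json-bucket/transactions/transactions.json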
2. Setting Up AWS Glue Data Catalog
Step 1: Navigate to the AWS Glue Console
- URL: AWS Glue Console
Step 2: Create a Database
- In the Glue console, click on "Databases" under the Data Catalog section.
- Click "Add database".
- Name: `videodbtransaction`.
- Click "Create".
Step 3: Create a Table
- Click on "Tables" under the Data Catalog section.
- Click "Add tables" > "Add tables manually".
- Table Name: `transactions`.
- Location: Specify the S3 path where your data resides (e.g., `s3://opensearch-s3-poc-json-bucket/transactions/`).
- Define Schema:
  - Define columns corresponding to your JSON data fields (e.g., `transaction_id`, `customer_id`, `amount`, `currency`, `transaction_date`, `items`).
  - Specify data types appropriately (e.g., `INT`, `DOUBLE`, `STRING`, `TIMESTAMP`, `ARRAY<STRING>`).
- Click "Create".
Note: Ensure the schema accurately reflects your data to prevent query issues.
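The same database and table can be created with the AWS CLI. The sketch below mirrors the console steps; depending on how the table is queried, you may also need SerDe and input/output format settings in the table input:

aws glue create-database --database-input '{"Name": "videodbtransaction"}'

# Minimal table definition; extend StorageDescriptor with SerDe settings if needed
aws glue create-table --database-name videodbtransaction --table-input '{
  "Name": "transactions",
  "TableType": "EXTERNAL_TABLE",
  "StorageDescriptor": {
    "Location": "s3://opensearch-s3-poc-json-bucket/transactions/",
    "Columns": [
      {"Name": "transaction_id", "Type": "int"},
      {"Name": "customer_id", "Type": "int"},
      {"Name": "amount", "Type": "double"},
      {"Name": "currency", "Type": "string"},
      {"Name": "transaction_date", "Type": "timestamp"},
      {"Name": "items", "Type": "array<string>"}
    ]
  }
}'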
3. Configuring IAM Roles and Policies
Purpose: Grant Amazon OpenSearch Service the necessary permissions to access your S3 bucket and interact with AWS Glue.
Step 1: Navigate to the IAM Console
- URL: AWS IAM Console
Step 2: Create an IAM Policy
- In the IAM console, click on "Policies" in the left navigation pane.
- Click "Create policy".
- Select the JSON tab.
- Paste the following policy, replacing `<account-id>` and `<opensearch-domain-name>` with your actual AWS Account ID and desired OpenSearch domain name.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "OpenSearchDomainPermissions",
      "Effect": "Allow",
      "Action": "es:ESHttp*",
      "Resource": "arn:aws:es:us-east-1:<account-id>:domain/<opensearch-domain-name>/*"
    },
    {
      "Sid": "S3ReadAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::opensearch-s3-poc-json-bucket",
        "arn:aws:s3:::opensearch-s3-poc-json-bucket/*"
      ]
    },
    {
      "Sid": "GluePermissions",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetPartition",
        "glue:GetPartitions"
      ],
      "Resource": "*"
    },
    {
      "Sid": "S3CheckpointAccess",
      "Effect": "Allow",
      "Action": [
        "s3:*"
      ],
      "Resource": [
        "arn:aws:s3:::opensearch-s3-poc-json-bucket/checkpoint/transactions/",
        "arn:aws:s3:::opensearch-s3-poc-json-bucket/checkpoint/transactions/*"
      ]
    }
  ]
}
Click "Next: Tags" (optional).
Click "Next: Review".
Name:
OpenSearchS3AccessPolicy
.Description:
Policy granting OpenSearch access to S3 and Glue
.Click "Create policy".
Step 3: Create an IAM Role
- In the IAM console, click on "Roles" in the left navigation pane.
- Click "Create role".
- Trusted Entity: Select "Custom trust policy".
- Paste the following trust policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "opensearchservice.amazonaws.com",
          "directquery.opensearchservice.amazonaws.com",
          "glue.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Click "Next: Permissions".
Search for
OpenSearchS3AccessPolicy
.Select the policy and click "Next: Tags".
Click "Next: Review".
Role name:
OpenSearchS3AccessRole
.Description:
Role for OpenSearch to access S3 and Glue
.Click "Create role".
Best Practice: Use clear and descriptive names for roles and policies for easier management.
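Scripted alternative: assuming you saved the permissions policy and trust policy above to local files named `opensearch-s3-access-policy.json` and `opensearch-trust-policy.json` (file names are illustrative), the CLI equivalent is:

# Create the permissions policy from the JSON shown earlier
aws iam create-policy \
  --policy-name OpenSearchS3AccessPolicy \
  --policy-document file://opensearch-s3-access-policy.json

# Create the role with the trust policy shown earlier
aws iam create-role \
  --role-name OpenSearchS3AccessRole \
  --assume-role-policy-document file://opensearch-trust-policy.json

# Attach the policy to the role
aws iam attach-role-policy \
  --role-name OpenSearchS3AccessRole \
  --policy-arn arn:aws:iam::<account-id>:policy/OpenSearchS3AccessPolicy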
4. Creating an Amazon OpenSearch Service Domain
Step 1: Navigate to the Amazon OpenSearch Service Console
Step 2: Create a New Domain
- Click "Create domain".
- Deployment Type: Select Production for a production environment or Development and testing for a proof-of-concept (POC).
- Engine Version: Choose OpenSearch 2.13 or later.
- Domain Name: `opensearch-poc-domain` (ensure uniqueness).
Step 3: Configure Instance and Storage
- Instance Type: Select an instance type such as `m7g.large.search`, chosen based on your performance requirements.
- Number of Nodes: `1` (for testing; scale as needed for production).
Step 4: Set Up Access
- Network Configuration:
  - VPC Access: For production, select VPC access for enhanced security. For testing, you may opt for Public access.
- Fine-Grained Access Control:
  - Enable: Toggle to enable.
  - Master User: Set a username (e.g., `admin`) and a strong password.
- Access Policy:
  - Choose Allow open access to the domain (not recommended for production).
  - Alternative: Define a more restrictive access policy based on your security requirements.
Example Access Policy for Restricted Access:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAccessFromSpecificIP",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "es:*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": "203.0.113.0/24"
        }
      },
      "Resource": "arn:aws:es:us-east-1:<account-id>:domain/opensearch-poc-domain/*"
    }
  ]
}
Step 5: Advanced Options
- IAM Role: Select `OpenSearchS3AccessRole`.
- Encryption:
  - At Rest: Enable encryption if required.
  - In Transit: Enable TLS for secure data transfer.
- Additional Configurations: Adjust based on specific needs (e.g., node-to-node encryption).
Step 6: Review and Create
- Review all settings.
- Click "Create" to initiate the domain setup.
Note: Domain creation may take several minutes. Monitor the progress in the console.
Best Practice: For production environments, prefer VPC access and restrict permissions to enhance security.
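For reference, a comparable domain can be created from the CLI. This is a hedged sketch; storage size and the fine-grained access control options in particular should be adapted to your own requirements:

aws opensearch create-domain \
  --domain-name opensearch-poc-domain \
  --engine-version OpenSearch_2.13 \
  --cluster-config InstanceType=m7g.large.search,InstanceCount=1 \
  --ebs-options EBSEnabled=true,VolumeType=gp3,VolumeSize=20 \
  --node-to-node-encryption-options Enabled=true \
  --encryption-at-rest-options Enabled=true \
  --domain-endpoint-options EnforceHTTPS=true \
  --advanced-security-options 'Enabled=true,InternalUserDatabaseEnabled=true,MasterUserOptions={MasterUserName=admin,MasterUserPassword=<strong-password>}'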
5. Creating Spark Tables Using Query Workbench
Purpose: Create Spark databases and tables directly within OpenSearch Dashboards using Query Workbench to ensure compatibility with OpenSearch's acceleration features.
Efficiently managing your data schemas and tables is crucial for optimizing query performance. By utilizing the Query Workbench within OpenSearch Dashboards, you can seamlessly create both databases and tables using Spark SQL. This approach ensures that your data is structured correctly and leverages OpenSearch's acceleration capabilities effectively.
Step 1: Access OpenSearch Dashboards
- Navigate to OpenSearch Dashboards:
  - Open your web browser and go to the OpenSearch Dashboards URL provided during your domain creation.
  - Example: `https://your-opensearch-domain/_dashboards/`
- Log In:
  - Enter your admin credentials (username and password) to access the dashboard.
Step 2: Create a Data Source
- Open the Menu:
  - Click the Menu icon (three horizontal lines) in the top-left corner of the dashboard.
- Navigate to Data Sources:
  - Under the Management section, select Data sources.
- Create a New Data Source:
  - Click on "Create data source".
- Configure the Data Source:
  - Name: `transaction_db_datasource`. This name becomes the catalog prefix in all of the SQL commands that follow, so keep it consistent.
  - Data Source Type: Select Amazon S3 from the dropdown menu.
  - IAM Role: Choose `OpenSearchS3AccessRole` from the available roles. This role must have the necessary permissions to access your S3 bucket and AWS Glue Data Catalog.
  - Additional Settings: Configure any additional settings as required, such as data format specifications or connection properties.
- Finalize Creation:
  - Click "Create" to establish the new data source.
Step 3: Use Query Workbench to Create a Spark Database and Table
Note: While AWS Glue can manage databases and tables, for compatibility with OpenSearch's acceleration features, it is recommended to create Spark databases and tables directly using Query Workbench.
a. Creating a Spark Database
- Open Query Workbench:
  - Click the Menu icon again.
  - Under OpenSearch Plugins, select Query Workbench.
- Select the Spark SQL Interface:
  - Ensure that you're using the Spark SQL interface within Query Workbench.
- Create the Spark Database:
  - Paste the following SQL command to create a new Spark database named `videodbtransaction`:

CREATE DATABASE IF NOT EXISTS videodbtransaction
LOCATION 's3://opensearch-s3-poc-json-bucket/databases/videodbtransaction/';

Explanation:
- `CREATE DATABASE IF NOT EXISTS`: Ensures that the database is created only if it does not already exist.
- `videodbtransaction`: The name of the database, adhering to the recommended naming conventions.
- `LOCATION`: Specifies the S3 path where the database metadata and related files will be stored.
- Execute the Command:
  - Click "Run" to execute the database creation command.
- Verify Database Creation:
  - To confirm the database was created successfully, execute the following command:

SHOW DATABASES;

Expected Result: The `videodbtransaction` database should appear in the list of available databases.
b. Creating a Spark Table
- Ensure the Correct Database Context:
  - Before creating the table, set the context to the newly created `videodbtransaction` database:

USE videodbtransaction;

  - Click "Run" to execute.
- Create the Spark Table:
  - Paste the following SQL command to create the `transactions` table within the `videodbtransaction` database:

CREATE EXTERNAL TABLE transactions (
  transaction_id INT,
  customer_id INT,
  amount DOUBLE,
  currency STRING,
  transaction_date TIMESTAMP,
  items ARRAY<STRING>
)
USING json
LOCATION 's3://opensearch-s3-poc-json-bucket/transactions/';

Explanation:
- `CREATE EXTERNAL TABLE`: Creates a table that references data stored externally in Amazon S3.
- `transactions`: The name of the table, following the naming conventions.
- Column Definitions:
  - `transaction_id INT`: Unique identifier for each transaction.
  - `customer_id INT`: Identifier for the customer involved in the transaction.
  - `amount DOUBLE`: Monetary value of the transaction.
  - `currency STRING`: Currency code (e.g., USD, EUR, GBP).
  - `transaction_date TIMESTAMP`: Date and time of the transaction.
  - `items ARRAY<STRING>`: List of items purchased in the transaction.
- `USING json`: Specifies that the data format is JSON.
- `LOCATION`: Points to the S3 bucket path where the `transactions.json` file is stored.
- Execute the Command:
  - Click "Run" to create the `transactions` table.
- Verify Table Creation:
  - Execute a sample query to ensure the table references data correctly:

SELECT * FROM transactions LIMIT 5;

Expected Result: The first five records from `transactions.json` should be displayed, confirming successful table creation and data linkage.
Best Practice: Validate the table schema against your JSON data to ensure accuracy.
Example of Verifying Schema and Data Integrity
- Check Column Types:

DESCRIBE transactions;

  - Expected Output: A list of columns with their respective data types matching the defined schema.
- Sample Data Validation:

SELECT transaction_id, customer_id, amount, currency, transaction_date, items FROM transactions LIMIT 5;

  - Expected Output: Displays the first five transactions with all fields correctly populated.
6. Implementing Accelerations
Implementing accelerations such as Skipping Indexes, Covering Indexes, and Materialized Views enhances query performance by pre-processing and indexing critical data attributes.
a. Skipping Indexes
Purpose: Index metadata to efficiently locate data without scanning entire partitions and files.
Step-by-Step Guide:
- Open Query Workbench:
  - Navigate to OpenSearch Dashboards.
  - Select Query Workbench from the menu.
- Select Data Source:
  - Ensure `transaction_db_datasource` is selected.
- Execute Skipping Index Creation:

CREATE SKIPPING INDEX
ON transaction_db_datasource.videodbtransaction.transactions (
  transaction_date PARTITION,
  currency VALUE_SET,
  amount MIN_MAX
)
WITH (
  index_settings = '{"number_of_shards":5,"number_of_replicas":2}',
  auto_refresh = true,
  refresh_interval = '60 minutes',
  checkpoint_location = 's3://opensearch-s3-poc-json-bucket/checkpoint/transactions/'
);
Explanation:
- `transaction_date` PARTITION: Optimizes queries based on date ranges.
- `currency` VALUE_SET: Enhances exact match filtering on currency.
- `amount` MIN_MAX: Improves range queries on transaction amounts.
- `number_of_shards`: Set to `5` for parallel processing.
- `number_of_replicas`: Set to `2` so total copies (1 primary + 2 replicas = 3) align with the cluster's shard allocation awareness attributes (`3` zones).
- Verify Acceleration:
  - Navigate to Data sources > `transaction_db_datasource` > Accelerations tab.
  - Ensure the skipping index for `transactions` is listed with status `Initializing`, `Running`, or `Completed`.
Note: The `checkpoint_location` must be accessible by OpenSearch with appropriate IAM permissions.
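Accelerations are materialized as regular OpenSearch indexes whose names conventionally begin with `flint_`, so listing them is a quick way to confirm the skipping index was created. A sketch, assuming the master-user credentials from earlier:

# Each acceleration should appear as a flint_* index with 5 shards and 2 replicas
curl -s -u admin:<password> \
  "https://your-opensearch-domain/_cat/indices/flint_*?v"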
b. Covering Indexes
Purpose: Ingest specific columns to create a high-performance index tailored for advanced analytics and visualization.
Covering Index Creation Command:
CREATE INDEX transactions_covering_index
ON transaction_db_datasource.videodbtransaction.transactions (
  transaction_id,
  customer_id,
  amount,
  currency,
  transaction_date,
  items
)
WITH (
  index_settings = '{"number_of_shards":5,"number_of_replicas":2}',
  auto_refresh = true,
  refresh_interval = '10 minutes',
  checkpoint_location = 's3://opensearch-s3-poc-json-bucket/checkpoint/transactions_covering_index/'
);
Execution Steps:
- Open Query Workbench:
  - Navigate to OpenSearch Dashboards.
  - Select Query Workbench.
- Select Data Source:
  - Choose `transaction_db_datasource`.
- Run Covering Index Command:
  - Paste the above SQL command into the editor.
  - Click "Run" to execute.
- Verify Acceleration:
  - Navigate to Data sources > `transaction_db_datasource` > Accelerations tab.
  - Ensure `transactions_covering_index` is listed with the appropriate status.
Note: Adjust `refresh_interval` based on your data update frequency and query requirements.
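Because a covering index is a regular OpenSearch index, you can query it directly with the search API once it has refreshed. A sketch; the `flint_*covering_index*` wildcard avoids guessing the exact generated index name:

# Pull a few documents straight from the covering index to confirm ingestion
curl -s -u admin:<password> \
  "https://your-opensearch-domain/flint_*covering_index*/_search?size=3&pretty"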
c. Materialized Views
Purpose: Precompute and store results of complex queries (e.g., aggregations) to power dashboard visualizations.
Materialized View Creation Command:
CREATE MATERIALIZED VIEW transactions__week_live_mview AS
SELECT
  customer_id,
  currency,
  COUNT(*) AS transaction_count,
  SUM(amount) AS total_amount,
  AVG(amount) AS average_amount,
  window.start AS `@timestamp`
FROM
  transaction_db_datasource.videodbtransaction.transactions
GROUP BY
  customer_id,
  currency,
  TUMBLE(transaction_date, '1 Week')
WITH (
  auto_refresh = true,
  refresh_interval = '15 minutes',
  checkpoint_location = 's3://opensearch-s3-poc-json-bucket/checkpoint/transactions__week_live_mview/',
  watermark_delay = '1 Minute'
);
Execution Steps:
- Open Query Workbench:
  - Navigate to OpenSearch Dashboards.
  - Select Query Workbench.
- Select Data Source:
  - Choose `transaction_db_datasource`.
- Run Materialized View Command:
  - Paste the above SQL command into the editor.
  - Click "Run" to execute.
- Verify Acceleration:
  - Navigate to Data sources > `transaction_db_datasource` > Accelerations tab.
  - Ensure `transactions__week_live_mview` is listed with the appropriate status.
Note: Adjust `refresh_interval` and `watermark_delay` based on your data freshness requirements.
Best Practices
Implementing best practices ensures that your OpenSearch and S3 integration is secure, efficient, and scalable.
1. Security
- Least Privilege Principle: Grant only necessary permissions to IAM roles to minimize security risks.
- VPC Access: For production environments, prefer VPC-based access to restrict OpenSearch domain access to trusted networks.
- Encryption: Enable encryption both at rest and in transit to protect sensitive data.
- Access Policies: Restrict access to trusted entities and IP ranges to prevent unauthorized access.
2. Data Organization
- Consistent Naming Conventions: Use clear and consistent names for S3 buckets, folders, and OpenSearch indices to enhance manageability.
- Data Partitioning: Organize data using prefixes (folders) based on logical partitions such as date, region, or data type to improve query performance and manageability.
3. Performance Optimization
- Accelerations: Regularly review and optimize accelerations (Skipping Indexes, Covering Indexes, Materialized Views) based on evolving query patterns and data usage.
- Resource Allocation: Adjust OpenSearch instance types and counts based on data volume and query complexity to maintain optimal performance.
- Monitor Metrics: Utilize OpenSearch Dashboards and AWS CloudWatch to monitor cluster health, query performance, and resource utilization, allowing for proactive optimizations.
4. Cost Management
- Resource Scaling: Dynamically scale OpenSearch resources based on usage to optimize costs without compromising performance.
- Data Lifecycle Management: Implement data lifecycle policies to manage data retention, archiving, and deletion in S3, controlling storage costs effectively.
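As a concrete illustration of lifecycle management (the rule name and thresholds here are hypothetical; tune them to your retention requirements), a policy that archives transaction data after 90 days and expires it after a year can be applied from the CLI:

aws s3api put-bucket-lifecycle-configuration \
  --bucket opensearch-s3-poc-json-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "archive-old-transactions",
      "Filter": {"Prefix": "transactions/"},
      "Status": "Enabled",
      "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
      "Expiration": {"Days": 365}
    }]
  }'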
5. Schema Management
- Consistency: Ensure data schemas remain consistent across all data sources to prevent query failures and data mismatches.
- Versioning: Track schema versions to manage and document changes effectively, facilitating easier maintenance and updates.
6. Automation and Documentation
- Infrastructure as Code (IaC): Utilize tools like AWS CloudFormation or Terraform to automate deployments and configurations, ensuring reproducibility and reducing manual errors.
- Comprehensive Documentation: Maintain thorough documentation of configurations, policies, and procedures to facilitate onboarding and ongoing maintenance.
7. Regular Audits and Reviews
- Security Audits: Periodically review IAM roles and access policies to ensure they adhere to security best practices and compliance requirements.
- Performance Audits: Regularly assess query performance and acceleration effectiveness to identify and address potential bottlenecks or inefficiencies.
8. Backup and Recovery
- Data Backups: Regularly back up your data in S3 and OpenSearch indices to prevent data loss and ensure business continuity.
- Disaster Recovery Planning: Develop and test disaster recovery plans to handle potential outages or data breaches effectively.
Testing and Validation
After setting up the integration and accelerations, it's crucial to validate that everything works as expected and that performance improvements are realized.
1. Execute Sample Queries
Run the following queries in Query Workbench to test data retrieval and performance enhancements.
a. Retrieve Recent Transactions
SELECT * FROM transaction_db_datasource.videodbtransaction.transactions
WHERE transaction_date >= '2023-10-01'
LIMIT 10;
Utilizes the `transaction_date` PARTITION skipping index to efficiently fetch recent transactions.
b. Filter Transactions by Currency
SELECT * FROM transaction_db_datasource.videodbtransaction.transactions
WHERE currency = 'USD'
LIMIT 10;
Leverages the `currency` VALUE_SET skipping index for quick filtering.
c. Retrieve Transactions with Amount Greater Than 50
SELECT * FROM transaction_db_datasource.videodbtransaction.transactions
WHERE amount > 50
LIMIT 10;
Utilizes the `amount` MIN_MAX skipping index to efficiently filter transactions exceeding $50.
d. Combined Filters: Date Range, Currency, and Amount
SELECT * FROM transaction_db_datasource.videodbtransaction.transactions
WHERE transaction_date BETWEEN '2023-10-01' AND '2023-10-31'
AND currency = 'USD'
AND amount > 50
LIMIT 10;
Combines the `transaction_date`, `currency`, and `amount` skipping indexes for optimized multi-condition filtering.
e. Aggregate Total and Average Transaction Amount per Customer
SELECT
  customer_id,
  COUNT(*) AS total_transactions,
  SUM(amount) AS total_amount,
  AVG(amount) AS average_amount
FROM transaction_db_datasource.videodbtransaction.transactions
GROUP BY customer_id
ORDER BY total_amount DESC
LIMIT 10;
Aggregates transaction data per customer, leveraging the indexed `customer_id` and `amount` columns for swift computations.
f. Find Top 5 Highest Transactions
SELECT * FROM transaction_db_datasource.videodbtransaction.transactions
ORDER BY amount DESC
LIMIT 5;
Uses the `amount` MIN_MAX skipping index to quickly retrieve the highest transactions.
g. Count Transactions per Currency
SELECT
  currency,
  COUNT(*) AS transaction_count
FROM transaction_db_datasource.videodbtransaction.transactions
GROUP BY currency
ORDER BY transaction_count DESC;
Aggregates the number of transactions per currency, utilizing the `currency` VALUE_SET skipping index.
h. Retrieve Transactions Containing Specific Items
SELECT * FROM transaction_db_datasource.videodbtransaction.transactions
WHERE array_contains(items, 'Item_A')
LIMIT 10;
Queries the `items` array field with Spark's `array_contains` function to find transactions that include `Item_A`.
i. Handle Missing Amounts with COALESCE
SELECT
  transaction_id,
  customer_id,
  COALESCE(amount, 0) AS amount,
  currency,
  transaction_date,
  items
FROM transaction_db_datasource.videodbtransaction.transactions
LIMIT 10;
Uses the `COALESCE` function to substitute missing `amount` values with `0`.
j. Aggregate Number of Items Sold per Day
SELECT
  DATE(transaction_date) AS transaction_day,
  SUM(size(items)) AS total_items_sold
FROM transaction_db_datasource.videodbtransaction.transactions
GROUP BY transaction_day
ORDER BY transaction_day DESC
LIMIT 10;
Calculates the total number of items sold each day by summing the lengths of the `items` arrays (Spark's `size` function returns an array's length).
k. Retrieve Transactions for Multiple Currencies
SELECT * FROM transaction_db_datasource.videodbtransaction.transactions
WHERE currency IN ('USD', 'EUR', 'GBP')
LIMIT 10;
Filters transactions that use any of the specified currencies, leveraging the `currency` VALUE_SET skipping index.
2. Compare Query Performance
- Before Accelerations: If possible, execute similar queries without any indexes or accelerations and note the execution time.
- After Accelerations: Run the same queries with the skipping and covering indexes in place to observe reduced execution times.
Tip: Use OpenSearch Dashboards' built-in query profiler to analyze query execution plans and confirm that accelerations are being utilized.
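For rough wall-clock comparisons outside the console, direct queries can also be submitted over HTTP. The sketch below assumes the async query API used by Query Workbench (the endpoint and response shape may differ across versions, so treat it as illustrative) and wraps the call in `time`:

# Submit a direct query and time the round trip; the response returns a query ID
# whose status and results can then be polled
time curl -s -u admin:<password> \
  -H 'Content-Type: application/json' \
  -X POST "https://your-opensearch-domain/_plugins/_async_query" \
  -d '{
    "datasource": "transaction_db_datasource",
    "lang": "sql",
    "query": "SELECT * FROM transaction_db_datasource.videodbtransaction.transactions WHERE amount > 50 LIMIT 10"
  }'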
3. Monitor Shard Allocation and Cluster Health
Use the following APIs to ensure shards are correctly allocated and the cluster is healthy.
a. Check Shard Allocation
Accelerations are stored as regular OpenSearch indexes whose names begin with `flint_`, so target those rather than the SQL-qualified table name:
GET /_cat/shards/flint_*?v
Expected Output:
- Number of Shards: `5`
- Replicas per Shard: `2`
- Total Copies per Shard: `3` (1 primary + 2 replicas)
- Shard Distribution: Evenly across the three `zone` awareness attributes.
b. Check Cluster Health
GET /_cluster/health/flint_*
Expected Output:
- Status: `green`
- Unassigned Shards: `0`
4. Use OpenSearch Dashboards Monitoring Tools
- Query Profiler: Analyze how queries utilize indexes and accelerations.
- Resource Metrics: Monitor CPU, memory, and I/O to ensure optimal resource utilization.
Tip: Regularly review dashboards to detect any anomalies or performance bottlenecks early.
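CloudWatch metrics can also be pulled from the CLI for ad-hoc checks; OpenSearch Service domains publish under the `AWS/ES` namespace with `DomainName` and `ClientId` (account ID) dimensions. For example:

# Average CPU utilization for the domain over one day, in 5-minute buckets
aws cloudwatch get-metric-statistics \
  --namespace AWS/ES \
  --metric-name CPUUtilization \
  --dimensions Name=DomainName,Value=opensearch-poc-domain Name=ClientId,Value=<account-id> \
  --start-time 2023-10-01T00:00:00Z \
  --end-time 2023-10-02T00:00:00Z \
  --period 300 \
  --statistics Average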
Troubleshooting
Despite careful setup, you might encounter issues. Here are common problems and their solutions:
1. Validation Failed: Expected Total Copies Needs to Be a Multiple of Total Awareness Attributes
Error Message:
{
"Message":"Fail to run query. Cause: OpenSearch exception [type=illegal_argument_exception, reason=Validation Failed: 1: expected total copies needs to be a multiple of total awareness attributes [3];]"
}
Cause: The total number of shard copies (primary shards + replicas) is not a multiple of the number of shard allocation awareness attributes (`3`).
Solution:
- Adjust Number of Replicas:
  - Ensure `number_of_replicas` is set to `2` (1 primary + 2 replicas = 3 copies).
Updated Skipping Index Command:
CREATE SKIPPING INDEX
ON transaction_db_datasource.videodbtransaction.transactions (
transaction_date PARTITION,
currency VALUE_SET,
amount MIN_MAX
)
WITH (
index_settings = '{"number_of_shards":5,"number_of_replicas":2}',
auto_refresh = true,
refresh_interval = '60 minutes',
checkpoint_location = 's3://opensearch-s3-poc-json-bucket/checkpoint/transactions/'
);
Action Steps:
- Open Query Workbench.
- Select `transaction_db_datasource`.
- Run the updated skipping index creation command.
2. Access Denied Errors
Error Message:
{
"Message":"User: arn:aws:iam::<account-id>:role/OpenSearchS3AccessRole is not authorized to perform: s3:GetObject on resource: arn:aws:s3:::opensearch-s3-poc-json-bucket/transactions/*"
}
Cause: IAM role lacks necessary permissions to access the S3 bucket or specific objects.
Solution:
- Review and Update IAM Policy:
  - Ensure `s3:GetObject`, `s3:ListBucket`, and `s3:PutObject` permissions are correctly set for the bucket and checkpoint locations.
Policy Example:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "S3ReadAccess",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:GetObjectVersion",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::opensearch-s3-poc-json-bucket",
"arn:aws:s3:::opensearch-s3-poc-json-bucket/*"
]
},
{
"Sid": "S3CheckpointAccess",
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::opensearch-s3-poc-json-bucket/checkpoint/transactions/",
"arn:aws:s3:::opensearch-s3-poc-json-bucket/checkpoint/transactions/*"
]
}
]
}
Action Steps:
- Navigate to IAM Console.
- Select OpenSearchS3AccessPolicy.
- Verify and update permissions as needed.
- Save changes and retry the query.
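You can also test the role's effective permissions without re-running the query, using the IAM policy simulator from the CLI:

# Simulate the exact S3 actions the error message complained about
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::<account-id>:role/OpenSearchS3AccessRole \
  --action-names s3:GetObject s3:ListBucket \
  --resource-arns \
    arn:aws:s3:::opensearch-s3-poc-json-bucket \
    arn:aws:s3:::opensearch-s3-poc-json-bucket/transactions/transactions.json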
3. Data Not Appearing in OpenSearch
Cause: Incorrect schema definitions or issues with the Spark table setup.
Solution:
- Verify AWS Glue Schema:
  - Ensure that the Glue table schema matches the structure of your JSON data.
- Check Spark Table Creation:
  - Re-run the `CREATE EXTERNAL TABLE` command in Query Workbench to ensure it was successful.
- Validate Data in S3:
  - Confirm that data is correctly uploaded and accessible in the specified S3 path.
Action Steps:
- Review the Glue table schema in AWS Glue Console.
- Open Query Workbench and verify the Spark table.
- Execute a sample query to confirm data visibility.
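Two quick CLI checks cover the S3 and Glue sides of this problem:

# Confirm the data files actually exist under the expected prefix
aws s3 ls s3://opensearch-s3-poc-json-bucket/transactions/

# Compare the Glue table's location and columns against your JSON fields
aws glue get-table \
  --database-name videodbtransaction \
  --name transactions \
  --query 'Table.StorageDescriptor.{Location: Location, Columns: Columns}'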
4. Slow Query Performance Despite Accelerations
Cause: Queries may not be effectively utilizing the skipping or covering indexes.
Solution:
- Use Query Profiler:
  - Analyze query execution plans to ensure indexes are being leveraged.
- Optimize Query Structure:
  - Ensure that filters and selections align with the indexed fields.
- Adjust Refresh Intervals:
  - Tweak `refresh_interval` settings based on data update frequency and query requirements.
Action Steps:
- Open Query Workbench and execute the query profiler.
- Review the execution plan for index utilization.
- Modify queries to better align with indexed fields.
- Adjust `refresh_interval` if necessary.
Conclusion
Integrating Amazon OpenSearch Service with Amazon S3 for direct queries offers a robust solution for high-performance data analytics and real-time search capabilities. This integration leverages the scalability and durability of Amazon S3 for data storage, the schema management prowess of AWS Glue, and the advanced search and analytics features of OpenSearch Service.
Key Takeaways:
- Streamlined Data Access: Direct queries enable efficient data retrieval from S3 without the need for data ingestion into OpenSearch indices.
- Enhanced Performance: Implementing accelerations such as Skipping Indexes, Covering Indexes, and Materialized Views significantly improves query performance and reduces latency.
- Scalability and Flexibility: The architecture supports scalability, allowing you to handle large volumes of data and complex queries seamlessly.
- Security and Compliance: Adhering to best practices in IAM configurations and data encryption ensures your data remains secure and compliant with industry standards.
- Cost Efficiency: Optimizing resource allocation and implementing data lifecycle policies help manage costs effectively without sacrificing performance.
By following this guide and adhering to the outlined best practices, you can establish an efficient, scalable, and secure data querying environment using Amazon OpenSearch Service and Amazon S3. This setup empowers your organization to perform real-time analytics, derive actionable insights, and drive informed decision-making with confidence.
References
- Amazon S3 Documentation: Amazon Simple Storage Service
- AWS IAM Documentation: AWS Identity and Access Management
- Amazon OpenSearch Service Documentation: Amazon OpenSearch Service
- AWS Glue Documentation: AWS Glue
- OpenSearch Query Workbench: Query Workbench in OpenSearch
- Shard Allocation Awareness: OpenSearch Shard Allocation Awareness
- Monitoring and Observability: Monitoring OpenSearch
- Index State Management: Index State Management in OpenSearch
- Apache Spark SQL Documentation: Apache Spark SQL
Glossary
Term | Definition |
---|---|
Amazon S3 | Amazon Simple Storage Service (S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. |
Amazon OpenSearch Service | A managed service that makes it easy to deploy, operate, and scale OpenSearch for log analytics, full-text search, application monitoring, and more. |
AWS Glue | A fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. |
IAM Role | An AWS Identity and Access Management (IAM) entity that defines a set of permissions for making AWS service requests. |
Query Workbench | A feature within OpenSearch Dashboards that allows users to run SQL-like queries against data sources to create databases and tables. |
Skipping Index | An acceleration technique that indexes metadata to efficiently locate data without scanning entire partitions and files. |
Covering Index | An acceleration technique that indexes specific columns to create a high-performance index tailored for advanced analytics and visualization. |
Materialized View | A database object that contains the results of a query and can be refreshed periodically, providing faster query performance for complex aggregations and joins. |
Piped Processing Language (PPL) | A query language used in OpenSearch for processing and analyzing data streams. |
JSON Lines Format | A convenient format for storing structured data that may be processed one record at a time. Each line is a valid JSON value. |
Checkpoint Location | An S3 path where OpenSearch stores metadata and state information for accelerations to maintain and update indexes. |
Appendices
Appendix A: Sample `transactions.json` File
Below is a sample `transactions.json` file in JSON Lines format (`.jsonl`). Each line represents a single transaction record.
{"transaction_id": 1001, "customer_id": 501, "amount": 250.75, "currency": "USD", "transaction_date": "2023-10-15T10:15:30Z", "items": ["Item_A", "Item_B"]}
{"transaction_id": 1002, "customer_id": 502, "amount": 89.99, "currency": "EUR", "transaction_date": "2023-10-16T11:20:45Z", "items": ["Item_C"]}
{"transaction_id": 1003, "customer_id": 501, "amount": 120.00, "currency": "USD", "transaction_date": "2023-10-17T09:05:10Z", "items": ["Item_A", "Item_D", "Item_E"]}
{"transaction_id": 1004, "customer_id": 503, "amount": 45.50, "currency": "GBP", "transaction_date": "2023-10-18T14:30:00Z", "items": ["Item_F"]}
{"transaction_id": 1005, "customer_id": 504, "amount": 300.00, "currency": "USD", "transaction_date": "2023-10-19T16:45:25Z", "items": ["Item_G", "Item_H"]}
Explanation of Fields:
- `transaction_id` (INT): Unique identifier for the transaction.
- `customer_id` (INT): Identifier for the customer making the transaction.
- `amount` (DOUBLE): Monetary value of the transaction.
- `currency` (STRING): Currency code (e.g., USD, EUR, GBP).
- `transaction_date` (TIMESTAMP): Date and time when the transaction occurred.
- `items` (ARRAY<STRING>): List of items purchased in the transaction.
Best Practice: Ensure that all records follow the same schema and that data types are consistent to prevent issues during ingestion and querying.
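A lightweight way to enforce that best practice before uploading is to validate the file with `jq`, which reads JSON Lines natively. This sketch checks that every line parses and that required fields are present in every record:

# Fails with a parse error if any line is not standalone JSON
jq -c 'empty' transactions.json && echo "All lines parse as JSON"

# Slurp all records and verify required fields exist in each one
jq -e -s 'all(.[]; has("transaction_id") and has("customer_id") and has("amount"))' \
  transactions.json > /dev/null && echo "Required fields present in every record"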
Appendix B: Sample SQL Commands for Accelerations
Implementing accelerations is crucial for optimizing query performance. Below are sample SQL commands tailored for your setup.
1. Skipping Index Creation
CREATE SKIPPING INDEX
ON transaction_db_datasource.videodbtransaction.transactions (
transaction_date PARTITION,
currency VALUE_SET,
amount MIN_MAX
)
WITH (
index_settings = '{"number_of_shards":5,"number_of_replicas":2}',
auto_refresh = true,
refresh_interval = '60 minutes',
checkpoint_location = 's3://opensearch-s3-poc-json-bucket/checkpoint/transactions/'
);
2. Materialized View Creation
CREATE MATERIALIZED VIEW transactions__week_live_mview AS
SELECT
  customer_id,
  currency,
  COUNT(*) AS transaction_count,
  SUM(amount) AS total_amount,
  AVG(amount) AS average_amount,
  window.start AS `@timestamp`
FROM
  transaction_db_datasource.videodbtransaction.transactions
GROUP BY
  customer_id,
  currency,
  TUMBLE(transaction_date, '1 Week')
WITH (
  auto_refresh = true,
  refresh_interval = '15 minutes',
  checkpoint_location = 's3://opensearch-s3-poc-json-bucket/checkpoint/transactions__week_live_mview/',
  watermark_delay = '1 Minute'
);
3. Covering Index Creation
CREATE INDEX transactions_covering_index
ON transaction_db_datasource.videodbtransaction.transactions (
transaction_id,
customer_id,
amount,
currency,
transaction_date,
items
)
WITH (
index_settings = '{"number_of_shards":5,"number_of_replicas":2}',
auto_refresh = true,
refresh_interval = '10 minutes',
checkpoint_location = 's3://opensearch-s3-poc-json-bucket/checkpoint/transactions_covering_index/'
);
Tags: aws, amazon-opensearch, amazon-s3, data-analytics, cloud-computing, big-data, tech-guide, data-engineering, aws-cloud, real-time-analytics, opensearch-integration, data-optimization, tech-innovation
Happy implementing!
If you found this guide helpful, please consider clapping, sharing it with your network, or leaving a comment below. Your feedback helps improve the content and supports the community!