Efficiently querying vast amounts of data in real time is essential for gaining actionable insights and making informed decisions. Amazon OpenSearch Service combined with Amazon S3 provides a powerful solution for organizations looking to leverage their data effectively. This guide walks through every step of integrating these AWS services so you can implement and optimize direct queries with confidence.
Table of Contents
- High-Level Architecture
- Prerequisites
- Step-by-Step Implementation
- Best Practices
- Testing and Validation
- Troubleshooting
- Conclusion
- References
- Glossary
- Appendices
High-Level Architecture
The integration involves the following key components:
- Amazon S3: Stores your datasets in formats like JSON, CSV, or Parquet.
- AWS Glue Data Catalog: Manages metadata and schema definitions for the data stored in S3.
- Amazon OpenSearch Service: Executes direct queries on the data in S3, leveraging its powerful search and analytics capabilities.
- IAM Roles and Policies: Ensure secure and controlled access between OpenSearch, S3, and Glue.
- OpenSearch Dashboards (Query Workbench): Interface for executing queries, managing accelerations, and visualizing data.
Real-World Use Case:
Consider an e-commerce platform that stores transactional data in Amazon S3. By integrating OpenSearch Service, the platform can perform real-time analytics on sales data, generate dashboards for monitoring sales performance, and enable advanced search functionalities for business intelligence.
Prerequisites
Before diving into the implementation, ensure you have the following:
- AWS Account: Access to AWS Management Console with permissions to create and manage S3, OpenSearch Service, Glue, and IAM resources.
- Data in S3: Your dataset (e.g., `transactions.json`) stored in Amazon S3 in JSON Lines format.
- OpenSearch Version: Ensure you're using Amazon OpenSearch Service version 2.13 or later.
- IAM Knowledge: Understanding of IAM roles and policies to grant necessary permissions.
- Basic SQL Knowledge: Familiarity with SQL for creating databases, tables, and writing queries.
- AWS CLI (Optional): For advanced configurations and troubleshooting.
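If you plan to use the AWS CLI, a quick sanity check confirms your credentials and default Region before you create any resources (the output will reflect your own account and profile):

# Confirm which AWS account and principal the CLI is acting as
aws sts get-caller-identity

# Confirm the default Region matches where you will create the bucket and domain
aws configure get region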
Naming Conventions
Consistent and clear naming conventions enhance maintainability and scalability. Below are recommended standards:
| Component | Naming Convention | Example |
|---|---|---|
| S3 Buckets | `opensearch-<environment>-<data-type>-bucket` | `opensearch-prod-json-data-bucket` |
| AWS Glue Databases | `<project>-<data-type>-db` | `videodbtransaction-db` |
| AWS Glue Tables | `<table-name>` | `transactions` |
| IAM Roles | `<Service><Purpose>Role` | `OpenSearchS3AccessRole` |
| OpenSearch Domains | `<project>-<environment>-domain` | `opensearch-poc-domain` |
| OpenSearch Indices | `<data-type>-<purpose>-index` | `transactions-covering-index` |
| Checkpoints in S3 | `checkpoint/<data-type>/<index-name>/` | `checkpoint/transactions/transactions_covering_index/` |
| Accelerations | `<data-type>_<acceleration-type>` | `transactions_skipping_index` |
| Materialized Views | `<table-name>__<frequency>_mview` | `transactions__week_live_mview` |
| Covering Indexes | `<table-name>_covering_index` | `transactions_covering_index` |
| PPL Commands | N/A | N/A |
Best Practice: Use lowercase letters and hyphens to separate words for readability and consistency.
Step-by-Step Implementation
Follow these detailed steps to set up and configure direct queries on Amazon OpenSearch Service with Amazon S3, leveraging AWS Glue and implementing performance optimizations through accelerations.
1. Creating an Amazon S3 Bucket
Step 1: Navigate to the Amazon S3 Console
- URL: Amazon S3 Console
Step 2: Create a New Bucket
- Click on "Create bucket".
- Configure Bucket Settings:
  - Bucket name: `opensearch-s3-poc-json-bucket` (must be unique across all AWS accounts).
  - Region: Select the same AWS Region as your OpenSearch domain (e.g., `us-east-1`).
- Set Object Ownership:
  - Recommended: Set to Bucket owner preferred.
- Block Public Access:
  - Enable: Ensure all options are checked to block public access.
- Additional Settings:
  - Configure Versioning, Tags, and Default encryption as per requirements.
- Review and Create:
  - Click Create bucket after reviewing all settings.
Step 3: Upload Sample Data to S3
- Open the newly created bucket `opensearch-s3-poc-json-bucket`.
- Create a Folder (Optional):
  - Example: `transactions/` for better organization.
- Upload Files:
  - Click Upload.
  - Add Files: Select and upload your `transactions.json`.
  - Click Upload.
Best Practice: Organize data using prefixes (folders) like `transactions/` and ensure data consistency.
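If you'd rather script this step, the AWS CLI (the optional prerequisite above) can create and populate the bucket. A minimal sketch using the example bucket name and Region from this guide:

# Create the bucket (us-east-1 needs no LocationConstraint)
aws s3api create-bucket --bucket opensearch-s3-poc-json-bucket --region us-east-1

# Block all public access, matching the console settings above
aws s3api put-public-access-block \
  --bucket opensearch-s3-poc-json-bucket \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

# Upload the sample dataset under the transactions/ prefix
aws s3 cp transactions.json s3://opensearch-s3-poc-json-bucket/transactions/transactions.json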
2. Setting Up AWS Glue Data Catalog
Step 1: Navigate to the AWS Glue Console
- URL: AWS Glue Console
Step 2: Create a Database
- In the Glue console, click on "Databases" under the Data Catalog section.
- Click "Add database".
- Name: `videodbtransaction`.
- Click "Create".
Step 3: Create a Table
- Click on "Tables" under the Data Catalog section.
- Click "Add tables" > "Add tables manually".
- Table Name: `transactions`.
- Location: Specify the S3 path where your data resides (e.g., `s3://opensearch-s3-poc-json-bucket/transactions/`).
- Define Schema:
  - Define columns corresponding to your JSON data fields (e.g., `transaction_id`, `customer_id`, `amount`, `currency`, `transaction_date`, `items`).
  - Specify data types appropriately (e.g., `INT`, `DOUBLE`, `STRING`, `TIMESTAMP`, `ARRAY<STRING>`).
- Click "Create".
Note: Ensure the schema accurately reflects your data to prevent query issues.
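The same database and table can be created with the AWS CLI. The sketch below mirrors the console steps; depending on how the table is queried, you may also need SerDe and input/output format settings in the table input:

aws glue create-database --database-input '{"Name": "videodbtransaction"}'

# Minimal table definition; extend StorageDescriptor with SerDe settings if needed
aws glue create-table --database-name videodbtransaction --table-input '{
  "Name": "transactions",
  "TableType": "EXTERNAL_TABLE",
  "StorageDescriptor": {
    "Location": "s3://opensearch-s3-poc-json-bucket/transactions/",
    "Columns": [
      {"Name": "transaction_id", "Type": "int"},
      {"Name": "customer_id", "Type": "int"},
      {"Name": "amount", "Type": "double"},
      {"Name": "currency", "Type": "string"},
      {"Name": "transaction_date", "Type": "timestamp"},
      {"Name": "items", "Type": "array<string>"}
    ]
  }
}'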
3. Configuring IAM Roles and Policies
Purpose: Grant Amazon OpenSearch Service the necessary permissions to access your S3 bucket and interact with AWS Glue.
Step 1: Navigate to the IAM Console
- URL: AWS IAM Console
Step 2: Create an IAM Policy
- In the IAM console, click on "Policies" in the left navigation pane.
- Click "Create policy".
- Select the JSON tab.
- Paste the following policy, replacing `<account-id>` and `<opensearch-domain-name>` with your actual AWS Account ID and desired OpenSearch domain name.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "OpenSearchDomainPermissions",
      "Effect": "Allow",
      "Action": "es:ESHttp*",
      "Resource": "arn:aws:es:us-east-1:<account-id>:domain/<opensearch-domain-name>/*"
    },
    {
      "Sid": "S3ReadAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::opensearch-s3-poc-json-bucket",
        "arn:aws:s3:::opensearch-s3-poc-json-bucket/*"
      ]
    },
    {
      "Sid": "GluePermissions",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetPartition",
        "glue:GetPartitions"
      ],
      "Resource": "*"
    },
    {
      "Sid": "S3CheckpointAccess",
      "Effect": "Allow",
      "Action": [
        "s3:*"
      ],
      "Resource": [
        "arn:aws:s3:::opensearch-s3-poc-json-bucket/checkpoint/transactions/",
        "arn:aws:s3:::opensearch-s3-poc-json-bucket/checkpoint/transactions/*"
      ]
    }
  ]
}
Click "Next: Tags" (optional).
Click "Next: Review".
Name:
OpenSearchS3AccessPolicy
.Description:
Policy granting OpenSearch access to S3 and Glue
.Click "Create policy".
Step 3: Create an IAM Role
- In the IAM console, click on "Roles" in the left navigation pane.
- Click "Create role".
- Trusted Entity: Select "Custom trust policy".
- Paste the following trust policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "opensearchservice.amazonaws.com",
          "directquery.opensearchservice.amazonaws.com",
          "glue.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Click "Next: Permissions".
Search for
OpenSearchS3AccessPolicy
.Select the policy and click "Next: Tags".
Click "Next: Review".
Role name:
OpenSearchS3AccessRole
.Description:
Role for OpenSearch to access S3 and Glue
.Click "Create role".
Best Practice: Use clear and descriptive names for roles and policies for easier management.
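Scripted alternative: assuming you saved the permissions policy and trust policy above to local files named `opensearch-s3-access-policy.json` and `opensearch-trust-policy.json` (file names are illustrative), the CLI equivalent is:

# Create the permissions policy from the JSON shown earlier
aws iam create-policy \
  --policy-name OpenSearchS3AccessPolicy \
  --policy-document file://opensearch-s3-access-policy.json

# Create the role with the trust policy shown earlier
aws iam create-role \
  --role-name OpenSearchS3AccessRole \
  --assume-role-policy-document file://opensearch-trust-policy.json

# Attach the policy to the role
aws iam attach-role-policy \
  --role-name OpenSearchS3AccessRole \
  --policy-arn arn:aws:iam::<account-id>:policy/OpenSearchS3AccessPolicy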
4. Creating an Amazon OpenSearch Service Domain
Step 1: Navigate to the Amazon OpenSearch Service Console
Step 2: Create a New Domain
- Click "Create domain".
- Deployment Type: Select Production for a production environment or Development and testing for a proof-of-concept (POC).
- Engine Version: Choose OpenSearch 2.13 or later.
- Domain Name: `opensearch-poc-domain` (ensure uniqueness).
Step 3: Configure Instance and Storage
- Instance Type: Select an instance type such as `m7g.large.search`, chosen based on your performance requirements.
- Number of Nodes: `1` (for testing; scale as needed for production).
Step 4: Set Up Access
- Network Configuration:
  - VPC Access: For production, select VPC access for enhanced security. For testing, you may opt for Public access.
- Fine-Grained Access Control:
  - Enable: Toggle to enable.
  - Master User: Set a username (e.g., `admin`) and a strong password.
- Access Policy:
  - Choose Allow open access to the domain (not recommended for production).
  - Alternative: Define a more restrictive access policy based on your security requirements.
Example Access Policy for Restricted Access:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAccessFromSpecificIP",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "es:*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": "203.0.113.0/24"
        }
      },
      "Resource": "arn:aws:es:us-east-1:<account-id>:domain/opensearch-poc-domain/*"
    }
  ]
}
Step 5: Advanced Options
- IAM Role: Select `OpenSearchS3AccessRole`.
- Encryption:
  - At Rest: Enable encryption if required.
  - In Transit: Enable TLS for secure data transfer.
- Additional Configurations: Adjust based on specific needs (e.g., node-to-node encryption).
Step 6: Review and Create
- Review all settings.
- Click "Create" to initiate the domain setup.
Note: Domain creation may take several minutes. Monitor the progress in the console.
Best Practice: For production environments, prefer VPC access and restrict permissions to enhance security.
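For reference, a comparable domain can be created from the CLI. This is a hedged sketch; storage size and the fine-grained access control options in particular should be adapted to your own requirements:

aws opensearch create-domain \
  --domain-name opensearch-poc-domain \
  --engine-version OpenSearch_2.13 \
  --cluster-config InstanceType=m7g.large.search,InstanceCount=1 \
  --ebs-options EBSEnabled=true,VolumeType=gp3,VolumeSize=20 \
  --node-to-node-encryption-options Enabled=true \
  --encryption-at-rest-options Enabled=true \
  --domain-endpoint-options EnforceHTTPS=true \
  --advanced-security-options 'Enabled=true,InternalUserDatabaseEnabled=true,MasterUserOptions={MasterUserName=admin,MasterUserPassword=<strong-password>}'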
5. Creating Spark Tables Using Query Workbench
Purpose: Create Spark databases and tables directly within OpenSearch Dashboards using Query Workbench to ensure compatibility with OpenSearch's acceleration features.
Efficiently managing your data schemas and tables is crucial for optimizing query performance. By utilizing the Query Workbench within OpenSearch Dashboards, you can seamlessly create both databases and tables using Spark SQL. This approach ensures that your data is structured correctly and leverages OpenSearch's acceleration capabilities effectively.
Step 1: Access OpenSearch Dashboards
- Navigate to OpenSearch Dashboards:
  - Open your web browser and go to the OpenSearch Dashboards URL provided during your domain creation.
  - Example: `https://your-opensearch-domain/_dashboards/`
- Log In:
  - Enter your admin credentials (username and password) to access the dashboard.
Step 2: Create a Data Source
- Open the Menu:
  - Click the Menu icon (three horizontal lines) in the top-left corner of the dashboard.
- Navigate to Data Sources:
  - Under the Management section, select Data sources.
- Create a New Data Source:
  - Click on "Create data source".
- Configure the Data Source:
  - Name: `transaction_db_datasource`. This name becomes the catalog prefix in all of the SQL commands that follow, so keep it consistent.
  - Data Source Type: Select Amazon S3 from the dropdown menu.
  - IAM Role: Choose `OpenSearchS3AccessRole` from the available roles. This role must have the necessary permissions to access your S3 bucket and AWS Glue Data Catalog.
  - Additional Settings: Configure any additional settings as required, such as data format specifications or connection properties.
- Finalize Creation:
  - Click "Create" to establish the new data source.
Step 3: Use Query Workbench to Create a Spark Database and Table
Note: While AWS Glue can manage databases and tables, for compatibility with OpenSearch's acceleration features, it is recommended to create Spark databases and tables directly using Query Workbench.
a. Creating a Spark Database
- Open Query Workbench:
  - Click the Menu icon again.
  - Under OpenSearch Plugins, select Query Workbench.
- Select the Spark SQL Interface:
  - Ensure that you're using the Spark SQL interface within Query Workbench.
- Create the Spark Database:
  - Paste the following SQL command to create a new Spark database named `videodbtransaction`:

CREATE DATABASE IF NOT EXISTS videodbtransaction
LOCATION 's3://opensearch-s3-poc-json-bucket/databases/videodbtransaction/';

Explanation:
- `CREATE DATABASE IF NOT EXISTS`: Ensures that the database is created only if it does not already exist.
- `videodbtransaction`: The name of the database, adhering to the recommended naming conventions.
- `LOCATION`: Specifies the S3 path where the database metadata and related files will be stored.
- Execute the Command:
  - Click "Run" to execute the database creation command.
- Verify Database Creation:
  - To confirm the database was created successfully, execute the following command:

SHOW DATABASES;

Expected Result: The `videodbtransaction` database should appear in the list of available databases.
b. Creating a Spark Table
- Ensure the Correct Database Context:
  - Before creating the table, set the context to the newly created `videodbtransaction` database:

USE videodbtransaction;

  - Click "Run" to execute.
- Create the Spark Table:
  - Paste the following SQL command to create the `transactions` table within the `videodbtransaction` database:

CREATE EXTERNAL TABLE transactions (
  transaction_id INT,
  customer_id INT,
  amount DOUBLE,
  currency STRING,
  transaction_date TIMESTAMP,
  items ARRAY<STRING>
)
USING json
LOCATION 's3://opensearch-s3-poc-json-bucket/transactions/';

Explanation:
- `CREATE EXTERNAL TABLE`: Creates a table that references data stored externally in Amazon S3.
- `transactions`: The name of the table, following the naming conventions.
- Column Definitions:
  - `transaction_id INT`: Unique identifier for each transaction.
  - `customer_id INT`: Identifier for the customer involved in the transaction.
  - `amount DOUBLE`: Monetary value of the transaction.
  - `currency STRING`: Currency code (e.g., USD, EUR, GBP).
  - `transaction_date TIMESTAMP`: Date and time of the transaction.
  - `items ARRAY<STRING>`: List of items purchased in the transaction.
- `USING json`: Specifies that the data format is JSON.
- `LOCATION`: Points to the S3 bucket path where the `transactions.json` file is stored.
- Execute the Command:
  - Click "Run" to create the `transactions` table.
- Verify Table Creation:
  - Execute a sample query to ensure the table references data correctly:

SELECT * FROM transactions LIMIT 5;

Expected Result: The first five records from `transactions.json` should be displayed, confirming successful table creation and data linkage.
Best Practice: Validate the table schema against your JSON data to ensure accuracy.
Example of Verifying Schema and Data Integrity
- Check Column Types:

DESCRIBE transactions;

  - Expected Output: A list of columns with their respective data types matching the defined schema.
- Sample Data Validation:

SELECT transaction_id, customer_id, amount, currency, transaction_date, items FROM transactions LIMIT 5;

  - Expected Output: Displays the first five transactions with all fields correctly populated.
6. Implementing Accelerations
Implementing accelerations such as Skipping Indexes, Covering Indexes, and Materialized Views enhances query performance by pre-processing and indexing critical data attributes.
a. Skipping Indexes
Purpose: Index metadata to efficiently locate data without scanning entire partitions and files.
Step-by-Step Guide:
- Open Query Workbench:
  - Navigate to OpenSearch Dashboards.
  - Select Query Workbench from the menu.
- Select Data Source:
  - Ensure `transaction_db_datasource` is selected.
- Execute Skipping Index Creation:

CREATE SKIPPING INDEX
ON transaction_db_datasource.videodbtransaction.transactions (
  transaction_date PARTITION,
  currency VALUE_SET,
  amount MIN_MAX
)
WITH (
  index_settings = '{"number_of_shards":5,"number_of_replicas":2}',
  auto_refresh = true,
  refresh_interval = '60 minutes',
  checkpoint_location = 's3://opensearch-s3-poc-json-bucket/checkpoint/transactions/'
);
Explanation:
- `transaction_date` PARTITION: Optimizes queries based on date ranges.
- `currency` VALUE_SET: Enhances exact match filtering on currency.
- `amount` MIN_MAX: Improves range queries on transaction amounts.
- `number_of_shards`: Set to `5` for parallel processing.
- `number_of_replicas`: Set to `2` so total copies (1 primary + 2 replicas = 3) align with the cluster's shard allocation awareness attributes (`3` zones).
- Verify Acceleration:
  - Navigate to Data sources > `transaction_db_datasource` > Accelerations tab.
  - Ensure the skipping index for `transactions` is listed with status `Initializing`, `Running`, or `Completed`.
Note: The `checkpoint_location` must be accessible by OpenSearch with appropriate IAM permissions.
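Accelerations are materialized as regular OpenSearch indexes whose names conventionally begin with `flint_`, so listing them is a quick way to confirm the skipping index was created. A sketch, assuming the master-user credentials from earlier:

# Each acceleration should appear as a flint_* index with 5 shards and 2 replicas
curl -s -u admin:<password> \
  "https://your-opensearch-domain/_cat/indices/flint_*?v"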
b. Covering Indexes
Purpose: Ingest specific columns to create a high-performance index tailored for advanced analytics and visualization.
Covering Index Creation Command:
CREATE INDEX transactions_covering_index
ON transaction_db_datasource.videodbtransaction.transactions (
  transaction_id,
  customer_id,
  amount,
  currency,
  transaction_date,
  items
)
WITH (
  index_settings = '{"number_of_shards":5,"number_of_replicas":2}',
  auto_refresh = true,
  refresh_interval = '10 minutes',
  checkpoint_location = 's3://opensearch-s3-poc-json-bucket/checkpoint/transactions_covering_index/'
);
Execution Steps:
- Open Query Workbench:
  - Navigate to OpenSearch Dashboards.
  - Select Query Workbench.
- Select Data Source:
  - Choose `transaction_db_datasource`.
- Run Covering Index Command:
  - Paste the above SQL command into the editor.
  - Click "Run" to execute.
- Verify Acceleration:
  - Navigate to Data sources > `transaction_db_datasource` > Accelerations tab.
  - Ensure `transactions_covering_index` is listed with the appropriate status.
Note: Adjust `refresh_interval` based on your data update frequency and query requirements.
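Because a covering index is a regular OpenSearch index, you can query it directly with the search API once it has refreshed. A sketch; the `flint_*covering_index*` wildcard avoids guessing the exact generated index name:

# Pull a few documents straight from the covering index to confirm ingestion
curl -s -u admin:<password> \
  "https://your-opensearch-domain/flint_*covering_index*/_search?size=3&pretty"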
c. Materialized Views
Purpose: Precompute and store results of complex queries (e.g., aggregations) to power dashboard visualizations.
Materialized View Creation Command:
CREATE MATERIALIZED VIEW transactions__week_live_mview AS
SELECT
  customer_id,
  currency,
  COUNT(*) AS transaction_count,
  SUM(amount) AS total_amount,
  AVG(amount) AS average_amount,
  window.start AS `@timestamp`
FROM
  transaction_db_datasource.videodbtransaction.transactions
GROUP BY
  customer_id,
  currency,
  TUMBLE(transaction_date, '1 Week')
WITH (
  auto_refresh = true,
  refresh_interval = '15 minutes',
  checkpoint_location = 's3://opensearch-s3-poc-json-bucket/checkpoint/transactions__week_live_mview/',
  watermark_delay = '1 Minute'
);
Execution Steps:
- Open Query Workbench:
  - Navigate to OpenSearch Dashboards.
  - Select Query Workbench.
- Select Data Source:
  - Choose `transaction_db_datasource`.
- Run Materialized View Command:
  - Paste the above SQL command into the editor.
  - Click "Run" to execute.
- Verify Acceleration:
  - Navigate to Data sources > `transaction_db_datasource` > Accelerations tab.
  - Ensure `transactions__week_live_mview` is listed with the appropriate status.
Note: Adjust `refresh_interval` and `watermark_delay` based on your data freshness requirements.
Best Practices
Implementing best practices ensures that your OpenSearch and S3 integration is secure, efficient, and scalable.
1. Security
- Least Privilege Principle: Grant only necessary permissions to IAM roles to minimize security risks.
- VPC Access: For production environments, prefer VPC-based access to restrict OpenSearch domain access to trusted networks.
- Encryption: Enable encryption both at rest and in transit to protect sensitive data.
- Access Policies: Restrict access to trusted entities and IP ranges to prevent unauthorized access.
2. Data Organization
- Consistent Naming Conventions: Use clear and consistent names for S3 buckets, folders, and OpenSearch indices to enhance manageability.
- Data Partitioning: Organize data using prefixes (folders) based on logical partitions such as date, region, or data type to improve query performance and manageability.
3. Performance Optimization
- Accelerations: Regularly review and optimize accelerations (Skipping Indexes, Covering Indexes, Materialized Views) based on evolving query patterns and data usage.
- Resource Allocation: Adjust OpenSearch instance types and counts based on data volume and query complexity to maintain optimal performance.
- Monitor Metrics: Utilize OpenSearch Dashboards and AWS CloudWatch to monitor cluster health, query performance, and resource utilization, allowing for proactive optimizations.
4. Cost Management
- Resource Scaling: Dynamically scale OpenSearch resources based on usage to optimize costs without compromising performance.
- Data Lifecycle Management: Implement data lifecycle policies to manage data retention, archiving, and deletion in S3, controlling storage costs effectively.
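As a concrete illustration of lifecycle management (the rule name and thresholds here are hypothetical; tune them to your retention requirements), a policy that archives transaction data after 90 days and expires it after a year can be applied from the CLI:

aws s3api put-bucket-lifecycle-configuration \
  --bucket opensearch-s3-poc-json-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "archive-old-transactions",
      "Filter": {"Prefix": "transactions/"},
      "Status": "Enabled",
      "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
      "Expiration": {"Days": 365}
    }]
  }'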
5. Schema Management
- Consistency: Ensure data schemas remain consistent across all data sources to prevent query failures and data mismatches.
- Versioning: Track schema versions to manage and document changes effectively, facilitating easier maintenance and updates.
6. Automation and Documentation
- Infrastructure as Code (IaC): Utilize tools like AWS CloudFormation or Terraform to automate deployments and configurations, ensuring reproducibility and reducing manual errors.
- Comprehensive Documentation: Maintain thorough documentation of configurations, policies, and procedures to facilitate onboarding and ongoing maintenance.
7. Regular Audits and Reviews
- Security Audits: Periodically review IAM roles and access policies to ensure they adhere to security best practices and compliance requirements.
- Performance Audits: Regularly assess query performance and acceleration effectiveness to identify and address potential bottlenecks or inefficiencies.
8. Backup and Recovery
- Data Backups: Regularly back up your data in S3 and OpenSearch indices to prevent data loss and ensure business continuity.
- Disaster Recovery Planning: Develop and test disaster recovery plans to handle potential outages or data breaches effectively.
Testing and Validation
After setting up the integration and accelerations, it's crucial to validate that everything works as expected and that performance improvements are realized.
1. Execute Sample Queries
Run the following queries in Query Workbench to test data retrieval and performance enhancements.
a. Retrieve Recent Transactions
SELECT * FROM transaction_db_datasource.videodbtransaction.transactions
WHERE transaction_date >= '2023-10-01'
LIMIT 10;
Utilizes the `transaction_date` PARTITION skipping index to efficiently fetch recent transactions.
b. Filter Transactions by Currency
SELECT * FROM transaction_db_datasource.videodbtransaction.transactions
WHERE currency = 'USD'
LIMIT 10;
Leverages the `currency` VALUE_SET skipping index for quick filtering.
c. Retrieve Transactions with Amount Greater Than 50
SELECT * FROM transaction_db_datasource.videodbtransaction.transactions
WHERE amount > 50
LIMIT 10;
Utilizes the `amount` MIN_MAX skipping index to efficiently filter transactions exceeding $50.
d. Combined Filters: Date Range, Currency, and Amount
SELECT * FROM transaction_db_datasource.videodbtransaction.transactions
WHERE transaction_date BETWEEN '2023-10-01' AND '2023-10-31'
AND currency = 'USD'
AND amount > 50
LIMIT 10;
Combines the `transaction_date`, `currency`, and `amount` skipping indexes for optimized multi-condition filtering.
e. Aggregate Total and Average Transaction Amount per Customer
SELECT
  customer_id,
  COUNT(*) AS total_transactions,
  SUM(amount) AS total_amount,
  AVG(amount) AS average_amount
FROM transaction_db_datasource.videodbtransaction.transactions
GROUP BY customer_id
ORDER BY total_amount DESC
LIMIT 10;
Aggregates transaction data per customer, leveraging the indexed `customer_id` and `amount` columns for swift computations.
f. Find Top 5 Highest Transactions
SELECT * FROM transaction_db_datasource.videodbtransaction.transactions
ORDER BY amount DESC
LIMIT 5;
Uses the `amount` MIN_MAX skipping index to quickly retrieve the highest transactions.
g. Count Transactions per Currency
SELECT
  currency,
  COUNT(*) AS transaction_count
FROM transaction_db_datasource.videodbtransaction.transactions
GROUP BY currency
ORDER BY transaction_count DESC;
Aggregates the number of transactions per currency, utilizing the `currency` VALUE_SET skipping index.
h. Retrieve Transactions Containing Specific Items
SELECT * FROM transaction_db_datasource.videodbtransaction.transactions
WHERE array_contains(items, 'Item_A')
LIMIT 10;
Queries the `items` array field with Spark's `array_contains` function to find transactions that include `Item_A`.
i. Handle Missing Amounts with COALESCE
SELECT
  transaction_id,
  customer_id,
  COALESCE(amount, 0) AS amount,
  currency,
  transaction_date,
  items
FROM transaction_db_datasource.videodbtransaction.transactions
LIMIT 10;
Uses the `COALESCE` function to substitute missing `amount` values with `0`.
j. Aggregate Number of Items Sold per Day
SELECT
  DATE(transaction_date) AS transaction_day,
  SUM(size(items)) AS total_items_sold
FROM transaction_db_datasource.videodbtransaction.transactions
GROUP BY transaction_day
ORDER BY transaction_day DESC
LIMIT 10;
Calculates the total number of items sold each day by summing the lengths of the `items` arrays (Spark's `size` function returns an array's length).
k. Retrieve Transactions for Multiple Currencies
SELECT * FROM transaction_db_datasource.videodbtransaction.transactions
WHERE currency IN ('USD', 'EUR', 'GBP')
LIMIT 10;
Filters transactions that use any of the specified currencies, leveraging the `currency` VALUE_SET skipping index.
2. Compare Query Performance
- Before Accelerations: If possible, execute similar queries without any indexes or accelerations and note the execution time.
- After Accelerations: Run the same queries with the skipping and covering indexes in place to observe reduced execution times.
Tip: Use OpenSearch Dashboards' built-in query profiler to analyze query execution plans and confirm that accelerations are being utilized.
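For rough wall-clock comparisons outside the console, direct queries can also be submitted over HTTP. The sketch below assumes the async query API used by Query Workbench (the endpoint and response shape may differ across versions, so treat it as illustrative) and wraps the call in `time`:

# Submit a direct query and time the round trip; the response returns a query ID
# whose status and results can then be polled
time curl -s -u admin:<password> \
  -H 'Content-Type: application/json' \
  -X POST "https://your-opensearch-domain/_plugins/_async_query" \
  -d '{
    "datasource": "transaction_db_datasource",
    "lang": "sql",
    "query": "SELECT * FROM transaction_db_datasource.videodbtransaction.transactions WHERE amount > 50 LIMIT 10"
  }'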
3. Monitor Shard Allocation and Cluster Health
Use the following APIs to ensure shards are correctly allocated and the cluster is healthy.
a. Check Shard Allocation
Accelerations are stored as regular OpenSearch indexes whose names begin with `flint_`, so target those rather than the SQL-qualified table name:
GET /_cat/shards/flint_*?v
Expected Output:
- Number of Shards: `5`
- Replicas per Shard: `2`
- Total Copies per Shard: `3` (1 primary + 2 replicas)
- Shard Distribution: Evenly across the three `zone` awareness attributes.
b. Check Cluster Health
GET /_cluster/health/flint_*
Expected Output:
- Status: `green`
- Unassigned Shards: `0`
4. Use OpenSearch Dashboards Monitoring Tools
- Query Profiler: Analyze how queries utilize indexes and accelerations.
- Resource Metrics: Monitor CPU, memory, and I/O to ensure optimal resource utilization.
Tip: Regularly review dashboards to detect any anomalies or performance bottlenecks early.
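CloudWatch metrics can also be pulled from the CLI for ad-hoc checks; OpenSearch Service domains publish under the `AWS/ES` namespace with `DomainName` and `ClientId` (account ID) dimensions. For example:

# Average CPU utilization for the domain over one day, in 5-minute buckets
aws cloudwatch get-metric-statistics \
  --namespace AWS/ES \
  --metric-name CPUUtilization \
  --dimensions Name=DomainName,Value=opensearch-poc-domain Name=ClientId,Value=<account-id> \
  --start-time 2023-10-01T00:00:00Z \
  --end-time 2023-10-02T00:00:00Z \
  --period 300 \
  --statistics Average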
Troubleshooting
Despite careful setup, you might encounter issues. Here are common problems and their solutions:
1. Validation Failed: Expected Total Copies Needs to Be a Multiple of Total Awareness Attributes
Error Message:
{
"Message":"Fail to run query. Cause: OpenSearch exception [type=illegal_argument_exception, reason=Validation Failed: 1: expected total copies needs to be a multiple of total awareness attributes [3];]"
}
Cause: The total number of shard copies (primary shards + replicas) is not a multiple of the number of shard allocation awareness attributes (`3`).
Solution:
- Adjust Number of Replicas:
  - Ensure `number_of_replicas` is set to `2` (1 primary + 2 replicas = 3 copies).
Updated Skipping Index Command:
CREATE SKIPPING INDEX
ON transaction_db_datasource.videodbtransaction.transactions (
transaction_date PARTITION,
currency VALUE_SET,
amount MIN_MAX
)
WITH (
index_settings = '{"number_of_shards":5,"number_of_replicas":2}',
auto_refresh = true,
refresh_interval = '60 minutes',
checkpoint_location = 's3://opensearch-s3-poc-json-bucket/checkpoint/transactions/'
);
Action Steps:
- Open Query Workbench.
- Select `transaction_db_datasource`.
- Run the updated skipping index creation command.
2. Access Denied Errors
Error Message:
{
"Message":"User: arn:aws:iam::<account-id>:role/OpenSearchS3AccessRole is not authorized to perform: s3:GetObject on resource: arn:aws:s3:::opensearch-s3-poc-json-bucket/transactions/*"
}
Cause: IAM role lacks necessary permissions to access the S3 bucket or specific objects.
Solution:
- Review and Update IAM Policy:
  - Ensure `s3:GetObject`, `s3:ListBucket`, and `s3:PutObject` permissions are correctly set for the bucket and checkpoint locations.
Policy Example:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "S3ReadAccess",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:GetObjectVersion",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::opensearch-s3-poc-json-bucket",
"arn:aws:s3:::opensearch-s3-poc-json-bucket/*"
]
},
{
"Sid": "S3CheckpointAccess",
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::opensearch-s3-poc-json-bucket/checkpoint/transactions/",
"arn:aws:s3:::opensearch-s3-poc-json-bucket/checkpoint/transactions/*"
]
}
]
}
Action Steps:
- Navigate to IAM Console.
- Select OpenSearchS3AccessPolicy.
- Verify and update permissions as needed.
- Save changes and retry the query.
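You can also test the role's effective permissions without re-running the query, using the IAM policy simulator from the CLI:

# Simulate the exact S3 actions the error message complained about
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::<account-id>:role/OpenSearchS3AccessRole \
  --action-names s3:GetObject s3:ListBucket \
  --resource-arns \
    arn:aws:s3:::opensearch-s3-poc-json-bucket \
    arn:aws:s3:::opensearch-s3-poc-json-bucket/transactions/transactions.json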
3. Data Not Appearing in OpenSearch
Cause: Incorrect schema definitions or issues with the Spark table setup.
Solution:
- Verify AWS Glue Schema:
  - Ensure that the Glue table schema matches the structure of your JSON data.
- Check Spark Table Creation:
  - Re-run the `CREATE EXTERNAL TABLE` command in Query Workbench to ensure it was successful.
- Validate Data in S3:
  - Confirm that data is correctly uploaded and accessible in the specified S3 path.
Action Steps:
- Review the Glue table schema in AWS Glue Console.
- Open Query Workbench and verify the Spark table.
- Execute a sample query to confirm data visibility.
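Two quick CLI checks cover the S3 and Glue sides of this problem:

# Confirm the data files actually exist under the expected prefix
aws s3 ls s3://opensearch-s3-poc-json-bucket/transactions/

# Compare the Glue table's location and columns against your JSON fields
aws glue get-table \
  --database-name videodbtransaction \
  --name transactions \
  --query 'Table.StorageDescriptor.{Location: Location, Columns: Columns}'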
4. Slow Query Performance Despite Accelerations
Cause: Queries may not be effectively utilizing the skipping or covering indexes.
Solution:
- Use Query Profiler:
  - Analyze query execution plans to ensure indexes are being leveraged.
- Optimize Query Structure:
  - Ensure that filters and selections align with the indexed fields.
- Adjust Refresh Intervals:
  - Tweak `refresh_interval` settings based on data update frequency and query requirements.
Action Steps:
- Open Query Workbench and execute the query profiler.
- Review the execution plan for index utilization.
- Modify queries to better align with indexed fields.
- Adjust `refresh_interval` if necessary.
Conclusion
Integrating Amazon OpenSearch Service with Amazon S3 for direct queries offers a robust solution for high-performance data analytics and real-time search capabilities. This integration leverages the scalability and durability of Amazon S3 for data storage, the schema management prowess of AWS Glue, and the advanced search and analytics features of OpenSearch Service.
Key Takeaways:
- Streamlined Data Access: Direct queries enable efficient data retrieval from S3 without the need for data ingestion into OpenSearch indices.
- Enhanced Performance: Implementing accelerations such as Skipping Indexes, Covering Indexes, and Materialized Views significantly improves query performance and reduces latency.
- Scalability and Flexibility: The architecture supports scalability, allowing you to handle large volumes of data and complex queries seamlessly.
- Security and Compliance: Adhering to best practices in IAM configurations and data encryption ensures your data remains secure and compliant with industry standards.
- Cost Efficiency: Optimizing resource allocation and implementing data lifecycle policies help manage costs effectively without sacrificing performance.
By following this guide and adhering to the outlined best practices, you can establish an efficient, scalable, and secure data querying environment using Amazon OpenSearch Service and Amazon S3. This setup empowers your organization to perform real-time analytics, derive actionable insights, and drive informed decision-making with confidence.
References
- Amazon S3 Documentation: Amazon Simple Storage Service
- AWS IAM Documentation: AWS Identity and Access Management
- Amazon OpenSearch Service Documentation: Amazon OpenSearch Service
- AWS Glue Documentation: AWS Glue
- OpenSearch Query Workbench: Query Workbench in OpenSearch
- Shard Allocation Awareness: OpenSearch Shard Allocation Awareness
- Monitoring and Observability: Monitoring OpenSearch
- Index State Management: Index State Management in OpenSearch
- Apache Spark SQL Documentation: Apache Spark SQL
Glossary
Term | Definition |
---|---|
Amazon S3 | Amazon Simple Storage Service (S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. |
Amazon OpenSearch Service | A managed service that makes it easy to deploy, operate, and scale OpenSearch for log analytics, full-text search, application monitoring, and more. |
AWS Glue | A fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. |
IAM Role | An AWS Identity and Access Management (IAM) entity that defines a set of permissions for making AWS service requests. |
Query Workbench | A feature within OpenSearch Dashboards that allows users to run SQL-like queries against data sources to create databases and tables. |
Skipping Index | An acceleration technique that indexes metadata to efficiently locate data without scanning entire partitions and files. |
Covering Index | An acceleration technique that indexes specific columns to create a high-performance index tailored for advanced analytics and visualization. |
Materialized View | A database object that contains the results of a query and can be refreshed periodically, providing faster query performance for complex aggregations and joins. |
Piped Processing Language (PPL) | A query language used in OpenSearch for processing and analyzing data streams. |
JSON Lines Format | A convenient format for storing structured data that may be processed one record at a time. Each line is a valid JSON value. |
Checkpoint Location | An S3 path where OpenSearch stores metadata and state information for accelerations to maintain and update indexes. |
Appendices
Appendix A: Sample `transactions.json` File
Below is a sample `transactions.json` file in JSON Lines format (`.jsonl`). Each line represents a single transaction record.
{"transaction_id": 1001, "customer_id": 501, "amount": 250.75, "currency": "USD", "transaction_date": "2023-10-15T10:15:30Z", "items": ["Item_A", "Item_B"]}
{"transaction_id": 1002, "customer_id": 502, "amount": 89.99, "currency": "EUR", "transaction_date": "2023-10-16T11:20:45Z", "items": ["Item_C"]}
{"transaction_id": 1003, "customer_id": 501, "amount": 120.00, "currency": "USD", "transaction_date": "2023-10-17T09:05:10Z", "items": ["Item_A", "Item_D", "Item_E"]}
{"transaction_id": 1004, "customer_id": 503, "amount": 45.50, "currency": "GBP", "transaction_date": "2023-10-18T14:30:00Z", "items": ["Item_F"]}
{"transaction_id": 1005, "customer_id": 504, "amount": 300.00, "currency": "USD", "transaction_date": "2023-10-19T16:45:25Z", "items": ["Item_G", "Item_H"]}
Explanation of Fields:
- `transaction_id` (INT): Unique identifier for the transaction.
- `customer_id` (INT): Identifier for the customer making the transaction.
- `amount` (DOUBLE): Monetary value of the transaction.
- `currency` (STRING): Currency code (e.g., USD, EUR, GBP).
- `transaction_date` (TIMESTAMP): Date and time when the transaction occurred.
- `items` (ARRAY<STRING>): List of items purchased in the transaction.
Best Practice: Ensure that all records follow the same schema and that data types are consistent to prevent issues during ingestion and querying.
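A lightweight way to enforce that best practice before uploading is to validate the file with `jq`, which reads JSON Lines natively. This sketch checks that every line parses and that required fields are present in every record:

# Fails with a parse error if any line is not standalone JSON
jq -c 'empty' transactions.json && echo "All lines parse as JSON"

# Slurp all records and verify required fields exist in each one
jq -e -s 'all(.[]; has("transaction_id") and has("customer_id") and has("amount"))' \
  transactions.json > /dev/null && echo "Required fields present in every record"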
Appendix B: Sample SQL Commands for Accelerations
Implementing accelerations is crucial for optimizing query performance. Below are sample SQL commands tailored for your setup.
1. Skipping Index Creation
CREATE SKIPPING INDEX
ON transaction_db_datasource.videodbtransaction.transactions (
transaction_date PARTITION,
currency VALUE_SET,
amount MIN_MAX
)
WITH (
index_settings = '{"number_of_shards":5,"number_of_replicas":2}',
auto_refresh = true,
refresh_interval = '60 minutes',
checkpoint_location = 's3://opensearch-s3-poc-json-bucket/checkpoint/transactions/'
);
2. Materialized View Creation
CREATE MATERIALIZED VIEW transactions__week_live_mview AS
SELECT
  customer_id,
  currency,
  COUNT(*) AS transaction_count,
  SUM(amount) AS total_amount,
  AVG(amount) AS average_amount,
  window.start AS `@timestamp`
FROM
  transaction_db_datasource.videodbtransaction.transactions
GROUP BY
  customer_id,
  currency,
  TUMBLE(transaction_date, '1 Week')
WITH (
  auto_refresh = true,
  refresh_interval = '15 minutes',
  checkpoint_location = 's3://opensearch-s3-poc-json-bucket/checkpoint/transactions__week_live_mview/',
  watermark_delay = '1 Minute'
);
3. Covering Index Creation
CREATE INDEX transactions_covering_index
ON transaction_db_datasource.videodbtransaction.transactions (
transaction_id,
customer_id,
amount,
currency,
transaction_date,
items
)
WITH (
index_settings = '{"number_of_shards":5,"number_of_replicas":2}',
auto_refresh = true,
refresh_interval = '10 minutes',
checkpoint_location = 's3://opensearch-s3-poc-json-bucket/checkpoint/transactions_covering_index/'
);
Tags: aws, amazon-opensearch, amazon-s3, data-analytics, cloud-computing, big-data, tech-guide, data-engineering, aws-cloud, real-time-analytics, opensearch-integration, data-optimization, tech-innovation
Happy implementing!
If you found this guide helpful, please consider clapping, sharing it with your network, or leaving a comment below. Your feedback helps improve the content and supports the community!