Wadee Sami

Posted on Feb 11

Elasticsearch Architecture: A Comprehensive Guide

#elasticsearch #architecture #distributedsystems


Elasticsearch is a powerful, distributed search and analytics engine designed to handle a variety of data types, from structured to unstructured, and from text to numerical data. Built on top of Apache Lucene, Elasticsearch provides a scalable, high-performance solution for full-text search, real-time analytics, and more. In this comprehensive guide, we’ll dive deep into the architecture of Elasticsearch, exploring its core components, data storage model, indexing process, and scaling strategies. Whether you're a beginner or an experienced engineer, this guide will help you understand how Elasticsearch works under the hood and how to use it effectively.

Introduction to Elasticsearch Architecture
High-Level Overview
Storage Model
Data Types and Mapping
Text Analysis and Inverted Indexes
Building Blocks: Documents, Indexes, and Shards
Replicas and Redundancy
Nodes and Clusters
Inverted Indexes and Search
Relevancy Scoring
Routing Algorithms
Scaling Strategies: Vertical vs. Horizontal
Best Practices for Shard Sizing
Conclusion

1. Introduction to Elasticsearch Architecture

Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. It is built on top of Apache Lucene, a high-performance, full-text search library. While Lucene provides the core search capabilities, Elasticsearch adds distributed features, scalability, and ease of use, making it a popular choice for modern applications.

As engineers, we often need to fine-tune queries, handle clustering, and debug memory issues. Understanding Elasticsearch's architecture is crucial for optimizing performance, ensuring high availability, and troubleshooting problems.

2. High-Level Overview

Elasticsearch is developed in Java and leverages Apache Lucene as its core search library. Lucene is a powerful library for full-text search, but it is not a complete application. Elasticsearch wraps Lucene and extends it with distributed features, making it suitable for large-scale applications.

Key Features:

Distributed Nature: Elasticsearch is designed to scale horizontally, allowing you to add more nodes to the cluster as your data grows.
Near Real-Time Search: Newly indexed documents become searchable after the next index refresh, which runs every second by default, so search is near real-time rather than instantaneous.
Schema-less by Default: Dynamic mapping lets you start indexing documents without defining a schema, because Elasticsearch infers a type for each new field. You can still define explicit mappings to control how data is indexed and searched.
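To make the schema-less default more concrete, here is a small Python sketch of the kind of type inference that dynamic mapping performs when it sees a field for the first time. `infer_field_type` is a hypothetical helper, not an Elasticsearch API, and the rules are deliberately simplified (date detection, for instance, is reduced to an ISO 8601 check).

```python
from datetime import datetime


def infer_field_type(value):
    """Guess a field type for a JSON value, loosely mimicking dynamic mapping.

    Hypothetical helper for illustration; real dynamic mapping has more rules.
    """
    # bool must be checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)  # simplified stand-in for date detection
            return "date"
        except ValueError:
            # Other strings become full-text fields (with a keyword sub-field
            # in Elasticsearch's default dynamic mapping).
            return "text"
    raise TypeError(f"unsupported value: {value!r}")


doc = {"title": "Hello, WORLD", "views": 42, "rating": 4.5,
       "published": True, "created_at": "2024-02-11"}
inferred = {field: infer_field_type(value) for field, value in doc.items()}
print(inferred)
```

Inference like this is convenient while prototyping, but a wrong guess on the first document sticks to the field, which is one reason explicit mappings are usually preferred in production.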

3. Storage Model

Elasticsearch stores data on disk in compressed, search-optimized Lucene data structures. Data is organized into indexes, which are loosely comparable to tables in a relational database. Unlike tables, indexes do not require a schema up front: with dynamic mapping, documents with different sets of fields can share an index, although a given field must have the same type in every document.

Data Organization:

Nodes: Each data node in an Elasticsearch cluster stores a portion of the data, in the form of the primary and replica shards allocated to it.
Indexes: An index is a logical collection of documents. Each index is divided into one or more shards, and each shard is a self-contained Apache Lucene index.
Shards: Shards are the basic units of storage in Elasticsearch. They hold the actual data and are distributed across nodes for scalability and fault tolerance.

4. Data Types and Mapping

Elasticsearch receives documents as JSON. A mapping defines the data type of each field and how that field is indexed and searched. Fields without an explicit mapping are typed automatically by dynamic mapping.

Common Data Types:

Text: Used for full-text search. Text fields undergo text analysis, which includes tokenization and normalization.
Keyword: Used for exact matches, such as IDs or tags.
Date: Used for date and time values.
Numeric: Used for numerical data, such as integers, floats, and doubles.
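As a rough illustration of how these types appear in an explicit mapping, the sketch below builds the JSON body for creating an index with one field of each kind. The index and field names (`title`, `tags`, `published_at`, `views`) are invented for this example. The body is built as a plain Python dictionary so the snippet runs without a cluster.

```python
import json

# Hypothetical index body: the field names are invented for this example.
articles_index_body = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},          # analyzed for full-text search
            "tags": {"type": "keyword"},        # stored as-is for exact matches
            "published_at": {"type": "date"},   # date and time values
            "views": {"type": "integer"},       # one of several numeric types
        }
    }
}

# This JSON is what would be sent as the body of an index-creation request.
print(json.dumps(articles_index_body, indent=2))
```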

Text Analysis:

When a text field is indexed, Elasticsearch performs text analysis, which involves:

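Tokenization: Breaking text into individual tokens (usually words).
Normalization: Transforming tokens into a standard form, for example by lowercasing, stemming, removing stop words, or adding synonyms, so that a query matches text that is similar but not identical.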

Figure 1: Text Analysis Process in Elasticsearch
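The two analysis steps can be sketched in a few lines of Python. This is a toy analyzer written for this guide, not Elasticsearch's implementation: the `analyze` function and its tiny stop-word list are assumptions, and real analyzers handle Unicode, stemming, and much more.

```python
import re

# A tiny stop-word list chosen for the example; real analyzers ship larger ones.
STOP_WORDS = {"a", "an", "the", "and", "of"}


def analyze(text):
    """Toy analyzer: split on non-alphanumerics, lowercase, drop stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)        # tokenization
    lowered = [token.lower() for token in tokens]     # normalization: lowercasing
    return [token for token in lowered if token not in STOP_WORDS]


print(analyze("Hello, WORLD"))           # ['hello', 'world']
print(analyze("The Art of the Search"))  # ['art', 'search']
```

Because the same steps are applied to both documents and queries, a search for "world" can match a document that contains "WORLD".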

5. Text Analysis and Inverted Indexes

Text analysis is a critical part of Elasticsearch's search capabilities. When a document is indexed, the tokens produced by analysis are added to an inverted index, which maps each token to the documents that contain it. This allows for fast and efficient full-text search.

Inverted Index Example:

Consider two documents:

{"greeting": "Hello, WORLD"}
{"greeting": "Hello, Mate"}

After text analysis, the inverted index might look like this:

Token	Document IDs
hello	1, 2
world	1
mate	2

When you search for "hello," Elasticsearch consults the inverted index and retrieves documents 1 and 2.
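The table above can be reproduced with a short Python sketch. It is only a model of the idea: Lucene's on-disk structures are far more compact, and the `tokenize` helper is a simplified stand-in for a real analyzer.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


documents = {
    1: {"greeting": "Hello, WORLD"},
    2: {"greeting": "Hello, Mate"},
}

# Collect, for each token, the set of document IDs that contain it.
postings = defaultdict(set)
for doc_id, doc in documents.items():
    for token in tokenize(doc["greeting"]):
        postings[token].add(doc_id)

inverted_index = {token: sorted(ids) for token, ids in postings.items()}
print(inverted_index)  # {'hello': [1, 2], 'world': [1], 'mate': [2]}
```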

6. Building Blocks: Documents, Indexes, and Shards

Documents:

Basic Unit: A document is the basic unit of data in Elasticsearch. It is a JSON object that contains the data you want to index and search.
Fields: Each document consists of fields, which are key-value pairs. Fields can be of different data types, such as text, keyword, date, or numeric.

Indexes:

Logical Collection: An index is a logical collection of documents. It is similar to a table in a relational database.
Shards: Each index is divided into one or more shards, and each shard is a self-contained Apache Lucene index. Shards are distributed across nodes for scalability and fault tolerance.

Shards:

Primary Shards: Each document belongs to a primary shard. The number of primary shards is fixed when the index is created.
Replicas: Replicas are copies of primary shards. They provide redundancy and improve search performance by serving read requests.

7. Replicas and Redundancy

Replicas are essential for ensuring high availability and fault tolerance in Elasticsearch. Each primary shard can have zero or more replicas, and a replica is always allocated to a different node than its primary. If the node holding a primary fails, one of its replicas on another node is promoted to primary, so the data stays available.

Key Points:

Redundancy: Replicas provide redundancy, ensuring that data is available even if a node fails.
Read Requests: Replicas can serve read requests, which helps distribute the load and improve search performance.
Distribution: Elasticsearch never allocates a replica to the same node as its primary shard, so losing a single node cannot take out every copy of a shard.

Figure 2: Shards and Replicas in Elasticsearch
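To illustrate the placement rule, here is a toy allocator in Python. It is a sketch under simple assumptions (round-robin placement, hypothetical node names) and is not how Elasticsearch's allocator actually decides. It only demonstrates the invariant that no node holds two copies of the same shard.

```python
def allocate(num_primaries, num_replicas, nodes):
    """Toy round-robin placement of shard copies; not Elasticsearch's real allocator.

    Returns {node: [(shard_number, "primary" | "replica"), ...]}.
    """
    if num_replicas >= len(nodes):
        raise ValueError("need more nodes than replicas to keep copies apart")
    placement = {node: [] for node in nodes}
    for shard in range(num_primaries):
        # Copy 0 is the primary; copies 1..num_replicas are its replicas.
        for copy in range(num_replicas + 1):
            node = nodes[(shard + copy) % len(nodes)]
            role = "primary" if copy == 0 else "replica"
            placement[node].append((shard, role))
    return placement


placement = allocate(num_primaries=3, num_replicas=1,
                     nodes=["node-1", "node-2", "node-3"])
for node, shards in placement.items():
    print(node, shards)
```

With three nodes and one replica per primary, any single node can fail and every shard still has a copy on one of the remaining nodes.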

8. Nodes and Clusters

Nodes:

Instance: A node is a single instance of Elasticsearch. When you start Elasticsearch on a machine, you are creating a node.
Shards and Replicas: Each data node hosts a set of primary and replica shards. An index is a logical collection of documents whose shards are spread across these nodes.

Clusters:

Collection of Nodes: A cluster is a collection of nodes that work together to store and manage data. A node that is started on its own, with no other nodes to join, forms a single-node cluster.
Multi-Node Clusters: In a production environment, you typically have multiple nodes forming a cluster. This provides scalability and fault tolerance.

Figure 3: Single-Node Cluster

Figure 4: Multi-Node Cluster

9. Inverted Indexes and Search

Inverted indexes are at the heart of Elasticsearch's search capabilities. When you perform a search, Elasticsearch consults the inverted index to find the documents that match your query.

Search Process:

Query Analysis: For full-text queries, the query text is analyzed into individual terms, by default with the same analyzer that was used when the field was indexed.
Inverted Index Lookup: Elasticsearch looks up each term in the inverted index to find the documents that contain it.
Relevancy Scoring: Elasticsearch calculates a relevancy score for each document based on factors like term frequency and inverse document frequency.
Result Retrieval: The documents with the highest relevancy scores are returned as search results.
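The four steps can be sketched end to end in Python. The scoring here simply counts matched query terms, which is a placeholder for a real relevancy function such as BM25, and the small `inverted_index` is hard-coded for the example.

```python
import re

# A small prebuilt inverted index: token -> IDs of documents containing it.
inverted_index = {"hello": [1, 2], "world": [1], "mate": [2]}


def search(query, index):
    """Return document IDs ranked by how many query terms they match.

    Counting matched terms is a placeholder for a real relevancy score.
    """
    terms = re.findall(r"[a-z0-9]+", query.lower())     # 1. analyze the query
    scores = {}
    for term in terms:
        for doc_id in index.get(term, []):              # 2. inverted index lookup
            scores[doc_id] = scores.get(doc_id, 0) + 1  # 3. (toy) scoring
    # 4. return the best-scoring documents first; ties break on document ID
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


print(search("hello", inverted_index))       # [1, 2]
print(search("Hello mate", inverted_index))  # [2, 1]
```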

10. Relevancy Scoring

Elasticsearch uses a relevancy score to rank search results. By default the score is calculated with BM25 (Best Matching 25), which takes into account term frequency, inverse document frequency, and field length.

Key Factors:

Term Frequency (TF): The number of times a term appears in a document. More occurrences raise the score, with diminishing returns.
Inverse Document Frequency (IDF): A measure of how rare a term is across the entire set of documents. Rare terms carry more weight than common ones.
Field Length: The length of the field in which the term appears. A match in a shorter field scores higher than the same match in a longer one.
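The way these three factors combine can be seen in the textbook form of BM25, sketched below in Python. The defaults `k1=1.2` and `b=0.75` match Elasticsearch's default BM25 parameters. The function name and the sample numbers are assumptions made for this example, and Lucene's actual implementation differs in some details.

```python
import math


def bm25_term_score(term_freq, doc_len, avg_doc_len, num_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """Textbook BM25 score of one term for one document."""
    # Rarer terms (small docs_with_term) get a larger IDF weight.
    idf = math.log(1 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Term frequency saturates via k1 and is normalized by field length via b.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    tf = (term_freq * (k1 + 1)) / (term_freq + k1 * length_norm)
    return idf * tf


# Same term frequency, different field lengths: the shorter field scores higher.
short_field = bm25_term_score(term_freq=1, doc_len=5, avg_doc_len=10,
                              num_docs=100, docs_with_term=10)
long_field = bm25_term_score(term_freq=1, doc_len=50, avg_doc_len=10,
                             num_docs=100, docs_with_term=10)
print(short_field > long_field)  # True
```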

11. Routing Algorithms

Every document in Elasticsearch is assigned to a specific primary shard using a routing algorithm. The algorithm hashes the document's routing value, which defaults to the document ID, and takes the result modulo the number of primary shards.

Routing Formula:

shard_number = hash(document_id) % number_of_primary_shards

Key Points:

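Fixed Shards: The number of primary shards cannot be changed in place after the index is created, because the routing formula depends on it: with a different shard count, the same ID would map to a different shard. To change it, you reindex your data into a new index, or use the split or shrink APIs, which also produce a new index and only support certain target shard counts.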
Reindexing: Reindexing involves creating a new index with the desired number of shards and copying data from the old index to the new one.
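The following Python sketch shows why the primary shard count is baked into an index. It applies the routing formula with two different shard counts and counts how many documents would end up on a different shard. CRC32 is used only as a convenient stand-in hash (Elasticsearch uses Murmur3), and the document IDs are made up.

```python
import zlib


def route(document_id, number_of_primary_shards):
    """Pick a shard for a document ID (CRC32 is a stand-in hash for illustration)."""
    return zlib.crc32(document_id.encode("utf-8")) % number_of_primary_shards


doc_ids = [f"doc-{n}" for n in range(1000)]
with_5_shards = {doc_id: route(doc_id, 5) for doc_id in doc_ids}
with_6_shards = {doc_id: route(doc_id, 6) for doc_id in doc_ids}

# Documents whose shard would change if the shard count went from 5 to 6.
moved = sum(1 for doc_id in doc_ids
            if with_5_shards[doc_id] != with_6_shards[doc_id])
print(f"{moved} of {len(doc_ids)} documents would land on a different shard")
```

Since most documents would no longer be found where the formula points, changing the shard count means writing the data into a new index.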

12. Scaling Strategies: Vertical vs. Horizontal

Elasticsearch offers two main scaling strategies: vertical scaling and horizontal scaling.

Vertical Scaling (Scaling Up):

Definition: Adding more resources (CPU, memory, disk) to existing nodes.
Pros: Easier to manage, as you are working with fewer machines.
Cons: Limited by the hardware capabilities of a single machine. Upgrading a node's hardware typically means restarting it, which takes that node offline during the upgrade.

Horizontal Scaling (Scaling Out):

Definition: Adding more nodes to the cluster.
Pros: Provides better scalability and fault tolerance. Nodes can be added to a running cluster without downtime, and shards are rebalanced onto them automatically.
Cons: More complex to manage, as you are dealing with multiple machines.

13. Best Practices for Shard Sizing

Shard sizing is critical for optimizing Elasticsearch performance. Here are some best practices:

Shard Size: With modern hardware, aim for shard sizes between 20GB and 50GB. In some cases, with proper hardware and optimization, shards can grow up to 100GB. However, monitor performance carefully when exceeding 50GB.
Heap Memory: Ensure that each node has enough heap memory to handle its shards. A common rule of thumb is to keep a node below 20 shards per gigabyte of heap. Keep the heap itself under about 31GB per node so the JVM can keep using compressed object pointers.
Replicas: Use replicas to improve redundancy and search performance. Elasticsearch will not place a replica on the same node as its primary shard, so you need at least two data nodes for a replica to be allocated.
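These rules of thumb translate into simple arithmetic. The Python sketch below is a back-of-the-envelope calculator. The function names and the 40GB default target are assumptions picked from within the 20GB to 50GB range above, not official formulas.

```python
import math


def primary_shards_needed(index_size_gb, target_shard_size_gb=40):
    """Number of primary shards so that each stays near the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_size_gb))


def max_shards_for_heap(heap_gb, shards_per_gb_heap=20):
    """Upper bound on shards per node under the 20-shards-per-GB rule of thumb."""
    return heap_gb * shards_per_gb_heap


print(primary_shards_needed(600))  # 15
print(max_shards_for_heap(30))     # 600
```

Treat the results as starting points and adjust them based on what you observe when monitoring the cluster.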

14. Conclusion

Elasticsearch is a powerful and flexible search engine that can handle a wide range of use cases. Its distributed architecture, combined with features like inverted indexes and relevancy scoring, makes it an ideal choice for applications that require fast and accurate search capabilities. By understanding the core components of Elasticsearch's architecture, you can optimize your cluster for performance, scalability, and reliability.

Whether you're building a small application or a large-scale enterprise system, Elasticsearch provides the tools you need to deliver a seamless search experience. With the right configuration and best practices, you can ensure that your Elasticsearch cluster is ready to handle the demands of modern applications.

Key Takeaways:

Elasticsearch is built on Apache Lucene, providing distributed search and analytics capabilities.
Shards and Replicas ensure data is distributed and redundant, improving performance and fault tolerance.
Inverted Indexes enable fast full-text search by mapping tokens to documents.
Scaling can be done vertically (adding resources) or horizontally (adding nodes), depending on your needs.
Schema Design: While mappings are optional, explicit mappings are recommended for production environments.
