Nishanthan K

Posted on Feb 19

How MongoDB’s ObjectID and B-Tree Search Make Your Queries Super Fast

#mongodb #database #optimization #performance

MongoDB uses a unique identifier, called ObjectId, for each document stored in its database. An ObjectId is a 12-byte identifier that is generated in a specific format to ensure it’s unique across distributed systems.

In this article, we will break down the structure of MongoDB's ObjectId, how it is generated, and how MongoDB uses a binary tree-like structure for indexing and searching. We will also discuss how this structure helps MongoDB perform searches efficiently, even with millions of entries.

Structure of MongoDB's ObjectId

MongoDB’s ObjectId is a 12-byte identifier, and its structure is designed to ensure uniqueness while keeping it compact. The 12 bytes are divided into four main parts:

1. Timestamp (4 bytes)

The first 4 bytes represent the timestamp when the ObjectId was generated.
This timestamp is recorded in seconds since the Unix epoch (January 1, 1970).
This ensures that ObjectIds are roughly ordered by creation time.

2. Machine Id (3 bytes)

The next 3 bytes represent a unique identifier for the machine where the ObjectId was generated.
Typically, it’s derived from the machine's IP address or a hash of the machine's network interfaces.
This will be constant for its lifetime untill the hostname and other system details got changed
This ensures that even if two MongoDB instances are running on different machines, they can generate unique ObjectIds.

3. Process Id (2 bytes)

The following 2 bytes represent the process ID of the MongoDB server process.
This helps to distinguish ObjectIds generated by different processes running on the same machine.
This is only constant for that run and it will change when the server was restarted / on re-run.

4. Counter (3 bytes)

The last 3 bytes represent an incrementing counter.
The counter starts at a random value. If multiple documents / objectIds are generated at the same timestamp (same second), the counter ensures each generated ObjectId is unique by incrementing.
If only one Document / ObjectId is generated per second, the counter will reset to another random value for the next second.
Key point: This counter ensures uniqueness when multiple ObjectId values are generated in the same second by the same machine and process.

Example of an ObjectId:

507f191e810c19729de860ea

The first 4 bytes: 507f191e represent the timestamp.
The next 3 bytes: 81 represent the machine ID.
The next 2 bytes: 0c represent the process ID.
The last 3 bytes: 19729de represent the incrementing counter.

MongoDB Index Structure and Page Layout:

MongoDB uses a B-tree-like index structure, not an real tree we are using in data structures but with certain modifications for optimization. Here's a breakdown:

1. Root Node:

The root node of the index tree stores keys (usually indexed fields like ObjectId, or other indexed fields).
This root node acts as the starting point for the search process, guiding MongoDB towards the right child node.
Key Features:
- Contains sorted keys.
- Points to child nodes which continue the search deeper into the tree.

2. Child Pages:

Child nodes (or child pages) store sorted keys and pointers to their leaf nodes or further child nodes.
These nodes are like intermediates, helping narrow down the range of values that need to be searched.
Key Features:
- Contains sorted key-value pairs.
- Contains pointers to the next level (further child pages or leaf pages).
- May be created or split as the index grows.

3. Leaf Pages:

The leaf pages are the actual locations where document references (pointers to documents) are stored.
Each entry in the leaf page contains a key (indexed field value) and a pointer to the actual document (i.e., location in the data store).
These pages are where the final results are stored, which MongoDB returns in a query.
Key Features:
- Contains the actual document references.
- Sorted by the indexed field for fast lookups.

Page Structure:

In MongoDB, index pages are typically designed to be 4 KB in size. If the page contains entries like ObjectId (a 12-byte identifier), each page can store around 317 entries. This size may vary slightly based on the size of indexed fields.

Pages are fetched from the disk into memory during searches, ensuring that queries are handled efficiently.
MongoDB uses a binary search within a page to locate the exact entry, which ensures faster retrieval of documents.

How MongoDB Manages Page Growth:

As MongoDB adds more entries to an index, it needs to split nodes to ensure that the index remains balanced.

When a page reaches its capacity, MongoDB splits the page into two and promotes the middle key to the parent node.
The middle key is the key that divides the entries in half and is used to maintain a balanced index.
As the index continues to grow, this process of splitting and promoting ensures the B-tree remains balanced.
Balancing the tree through splits and promotions ensures that search operations remain efficient and that the index is optimal for query performance.

Example of How the Index Tree Works:

Consider an index on the ObjectId field in MongoDB:

Root Node: Contains keys pointing to child nodes. For example, if you have a collection of documents with ObjectId, the root node will store keys like ObjectId:123, ObjectId:124, etc., with pointers to child nodes that contain more detailed keys.
Child Pages: These child nodes will store ranges of ObjectIds and pointers to further nodes or leaf nodes, helping narrow down the search.
Leaf Pages: Finally, the leaf pages contain the actual document pointers for specific ObjectIds. For example, a leaf page might contain a key like ObjectId:123 and a pointer to the exact location of the document in the storage.

When a query is executed, MongoDB will traverse this tree structure by following the key pointers from the root node, through child nodes, and ultimately to the leaf nodes where it finds the document references. This ensures efficient searching, even with large datasets.

Key Takeaways:

MongoDB uses a modified B-tree index structure for fast lookups.
Pages are crucial for holding index entries and documents. Pages are typically 4 KB in size and contain sorted entries.
Leaf pages contain the actual document references and are where the search ultimately lands.
As the tree grows, node splits and middle key promotions ensure that the index remains balanced and efficient for query processing.

Example: From Scratch to 17 Entries

Let’s consider the process of adding entries to a MongoDB index and how the keys move up in the tree.

Step-by-Step Tree Construction:

Insert 1, 2, 3: Initially, we add three entries, and they all fit in the root node.

[1, 2, 3]

Insert 4: Adding the fourth entry causes the root to split. The middle key (2) gets promoted to a new root.

Root: [2]
Left Child: [1]
Right Child: [3, 4]

Insert 5, 6, 7: The right child now overflows, so the middle key (6) moves up.

Root: [2, 6]
Left Child: [1]
Middle Child: [3, 4, 5]
Right Child: [7]

Insert 8, 9, 10: The right child overflows again, so the middle key (9) moves up.

Root: [2, 6, 9]
Left Child: [1]
Middle Left Child: [3, 4, 5]
Middle Right Child: [7, 8]
Right Child: [10]

Insert 11, 12, 13, 14, 15, 16, 17: We continue inserting entries, and they cause further splits, with the middle keys getting promoted up.

Root: [2, 6, 9, 13]

Left Child: [1]

Middle Left Child: [3, 4, 5]

Middle Child: [7, 8]

Right Child: [10, 11]

Final Child: [12, 14, 15, 16, 17]

Final tree strucutre

                         [6, 9, 13]
                      /    |    |    \
         [1, 2, 3, 4, 5]   [7, 8]  [10, 11]  [12, 14, 15, 16, 17]

Simplified MongoDB Search Process

Here’s a simplified summary of how MongoDB searches for documents using the index:

Start with the Index:
- MongoDB first looks at the index (similar to a table of contents in a book) to figure out where to find documents.
- The index is stored in a B-tree structure, which helps MongoDB quickly find the right spot.
Navigate the Tree:
- The root node of the tree holds pointers to the next pages (child nodes).
- The tree is split into pages, each containing key-value pairs. These keys are the indexed fields (like ObjectId, or any other field you're indexing).
Find the Relevant Range:
- When a query is made, MongoDB starts at the root node and uses binary search to find the right range of keys.
- It keeps following the pointers to the next pages until it narrows down to the correct range of keys for the search.
Determine the Page to Search:
- As MongoDB navigates the tree, it looks for the page number (stored in the index) that contains the matching _id or document key range.
- Once it finds the correct page, it jumps to that page.
Search on the Page:
- The page contains a binary search of documents stored in it. MongoDB uses the pointers in the page to go straight to the document's storage location (like the exact spot where the document is stored on disk).
Fetch the Document:
- Finally, MongoDB uses the storage location found in the index (the position pointer) to go directly to the document on the disk and fetch it.

In short:

MongoDB first uses the index (a tree structure) to find the right range of documents.
It uses binary search to find the correct page.
Then, it jumps to that page and finds the document by looking up the position pointer, directly fetching it from the disk.

This process ensures that MongoDB can quickly locate and fetch documents even if the database grows large.

Real strucuture with 64 level split

                           ┌────────────────── Root Node ───────────────────┐
                           │ Keys: [512, 1024, 1536]                         │
                           │ Child Pointers: [P1, P2, P3, P4]                │
                           └────────────────────────────────────────────────┘
                                         ↓ Choose correct child based on _id

 ┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
 │                         Internal Nodes (Level 2)                                                              │
 └───────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
 │ P1 → Keys: [128, 256, 384]  │ P2 → Keys: [640, 768, 896]  │ P3 → Keys: [1152, 1280, 1408]  │ P4 → Keys: [1664, 1792, 1920] │
 │ Child Pointers: [P5-P8]     │ Child Pointers: [P9-P12]    │ Child Pointers: [P13-P16]     │ Child Pointers: [P17-P20]    │

                                         ↓ Choose correct child based on _id

 ┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
 │                         Internal Nodes (Level 3)                                                              │
 └───────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
 │ P5 → Keys: [32, 64, 96]  │ P6 → Keys: [160, 192, 224]  │ P7 → Keys: [288, 320, 352]  │ P8 → Keys: [416, 448, 480]  │
 │ P9 → Keys: [544, 576, 608]  │ P10 → Keys: [672, 704, 736]  │ P11 → Keys: [800, 832, 864]  │ P12 → Keys: [928, 960, 992]  │
 │ P13 → Keys: [1056, 1088, 1120]  │ P14 → Keys: [1184, 1216, 1248]  │ P15 → Keys: [1312, 1344, 1376]  │ P16 → Keys: [1440, 1472, 1504]  │
 │ P17 → Keys: [1568, 1600, 1632]  │ P18 → Keys: [1728, 1760, 1792]  │ P19 → Keys: [1856, 1888, 1920]  │ P20 → Keys: [1984, 2000]  │

                                         ↓ Choose correct child based on _id

 ┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
 │                         Leaf Nodes (Level 4) - Actual Document Pointers                                       │
 └───────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
 │ P21 → Keys: [1-64] (D1-D64)  │ P22 → Keys: [65-128] (D65-D128)  │ P23 → Keys: [129-192] (D129-D192)  │ P24 → Keys: [193-256] (D193-D256)  │
 │ P25 → Keys: [257-320] (D257-D320)  │ P26 → Keys: [321-384] (D321-D384)  │ P27 → Keys: [385-448] (D385-D448)  │ P28 → Keys: [449-512] (D449-D512)  │
 │ P29 → Keys: [513-576] (D513-D576)  │ P30 → Keys: [577-640] (D577-D640)  │ P31 → Keys: [641-704] (D641-D704)  │ P32 → Keys: [705-768] (D705-D768)  │
 │ P33 → Keys: [769-832] (D769-D832)  │ P34 → Keys: [833-896] (D833-D896)  │ P35 → Keys: [897-960] (D897-D960)  │ P36 → Keys: [961-1024] (D961-D1024)  │
 │ P37 → Keys: [1025-1088] (D1025-D1088)  │ P38 → Keys: [1089-1152] (D1089-D1152)  │ P39 → Keys: [1153-1216] (D1153-D1216)  │ P40 → Keys: [1217-1280] (D1217-D1280)  │
 │ P41 → Keys: [1281-1344] (D1281-D1344)  │ P42 → Keys: [1345-1408] (D1345-D1408)  │ P43 → Keys: [1409-1472] (D1409-D1472)  │ P44 → Keys: [1473-1536] (D1473-D1536)  │
 │ P45 → Keys: [1537-1600] (D1537-D1600)  │ P46 → Keys: [1601-1664] (D1601-D1664)  │ P47 → Keys: [1665-1728] (D1665-D1728)  │ P48 → Keys: [1729-1792] (D1729-D1792)  │
 │ P49 → Keys: [1793-1856] (D1793-D1856)  │ P50 → Keys: [1857-1920] (D1857-D1920)  │ P51 → Keys: [1921-1984] (D1921-D1984)  │ P52 → Keys: [1985-2000] (D1985-D2000)  │

Search flow for _id 1754

Root Node: _id = 1754 falls in range [1536, 2000], follow P4
Internal Node (L2): _id = 1754 falls in range [1664, 1920], follow P18
Internal Node (L3): _id = 1754 falls in range [1728, 1792], follow P48
Leaf Node (L4): _id = 1754 is found, get Document Pointer D1754
Retrieve document from storage

Searching for an ID in MongoDB (Example with 3.6 Million Entries)

Let’s say we have a collection of 3.6 million entries and you need to find an entry with a specific ObjectId. Here’s how MongoDB would perform the search.
At each level, you have 64 choices (64 child pointers).
So for large values (like the 3.7 millionth record), this structure is efficient.

import math

total_entries = 3600000

branching_factor = 64

num_searches = math.ceil(math.log(total_entries, branching_factor))
print(num_searches)

For searching the 3.6 millionth entry, it would take approximately 4 searches (or 4 levels of traversal in the tree). This is because the B-tree structure narrows down the search space significantly at each level.

Conclusion

MongoDB’s ObjectId provides a compact, efficient way to uniquely identify documents, and its indexing system uses a B-tree-like structure to allow for fast searches. As documents are inserted, keys are promoted to parent nodes, ensuring that MongoDB maintains a balanced and efficient index structure. When searching for a specific ObjectId, MongoDB can quickly narrow down the search space using binary search on the root and child pages, resulting in fast lookups even for large collections.

By combining timestamp-based generation for ObjectIds and a binary search-based indexing structure, MongoDB is able to offer high performance for both data storage and retrieval.

DEV Community