Franck Pachot

Document Expiration with TTL Indexes in MongoDB

All databases should implement an information lifecycle management policy to archive, purge, or delete expired data. This may be necessary for legal compliance, to enhance performance, or to reduce costs.
Allowing the database to grow indefinitely can slow operational workloads, increase maintenance activities, and affect storage and backup costs.

NoSQL offers more flexibility than SQL, as NoSQL documents are self-contained and can be deleted directly at the storage level. In contrast, SQL databases store each business document across multiple tables with enforced relational integrity through foreign keys. Deletions must occur at the SQL layer rather than the storage level to avoid breaking referential integrity. Certain SQL features can assist in this process. For example, cascading deletes can preserve integrity when removing a parent row and its children, but this method is slow. Conversely, partitioning by date can expedite purging by truncating or dropping older partitions, although it introduces additional operational complexity.

With MongoDB, related data resides within a single document, without the complexity of referential integrity constraints and cascading actions. Each document can be assigned a time to live (TTL) based on a date or timestamp attribute present in the document. An index must be created on this attribute to quickly identify the expired documents to delete, and the TTL policy is declared on this index.

In this post, I will first demonstrate the process without the TTL feature, performing deletions with a job that removes expired documents. Then, I will run the same workload with a TTL policy. The aim is to show that both methods behave similarly and how they are exposed in server metrics.


Without TTL (manual delete job)

I created the following collection:

db.runCommand( {
 create: "demo",
 clusteredIndex: { "key": { _id: 1 }, "unique": true }
} )

I created an index on the created_at attribute:

db.demo.createIndex(
 { created_at: 1 }
);

I ran bulk inserts for 15 minutes (900 seconds):

function bulkInsert(collection, start, end) {
    // insert a batch of documents with a creation timestamp and some filler data
    const bulkData = [];
    for (let i = start; i <= end; i++) {
        bulkData.push({
            value: i,
            created_at: new Date(),
            filler: Array.from({ length: 5 }, () => new UUID())
        });
    }
    collection.insertMany(bulkData);
}

// call func(...args) in a loop for the requested duration, printing the rate
function run(durationInSeconds, func, ...args) {
    const startTime = Date.now();
    let executionCount = 0;
    while ((Date.now() - startTime) / 1000 < durationInSeconds) {
        executionCount++;
        const timestamp = new Date().toISOString();
        func(...args);
        console.log(`[${timestamp}] Executing: ${func.name}(${args.join(", ")}) ${(executionCount/((Date.now()-startTime)/1000)).toFixed(1)}/s`);
    }
}

run(900, bulkInsert, db.demo, 1, 1000);


In another session, I ran the deletion of documents created more than five minutes (300 seconds) ago, in a loop that wakes up every minute:

for (let i = 1; i <= 15; i++) {
  sleep(60 * 1000);
  db.demo.deleteMany({
    created_at: { $lt: new Date(Date.now() - 300 * 1000) }
  });
}

I observed some statistics using my lab for MongoDB Learning, Analyzing, and Benchmarking.

The size increases for the first five minutes; then, every minute, the deletion runs and brings the size back down. The collection contains between five and six minutes of data:
[Image: collection document count over time, oscillating between five and six minutes of data]
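
To verify at any time how much history the collection retains, a quick check from mongosh (my own snippet, independent of the dashboards) prints the age of the oldest document, which should stay between 300 and 360 seconds:

// age, in seconds, of the oldest document: five minutes of retention
// plus up to one minute until the next deletion run
const oldest = db.demo.find({}, { created_at: 1 })
                      .sort({ created_at: 1 }).limit(1).next();
print((Date.now() - oldest.created_at.getTime()) / 1000);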

I'll show more details later, but first, let's do the same with the automated TTL.

With TTL (expireAfterSeconds)

To start from the same state, I removed all documents and dropped the index:

db.demo.deleteMany({});
db.demo.dropIndex({ created_at: 1 });

I created the same index, but with an additional option, { expireAfterSeconds: 300 }, to set the TTL to 5 minutes:

db.demo.createIndex(
 { created_at: 1 }, { expireAfterSeconds: 300 }
);
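
Note that the expiration delay can be changed later without rebuilding the index, using the collMod command. For example, to extend it to 10 minutes:

db.runCommand({
  collMod: "demo",
  index: {
    keyPattern: { created_at: 1 },
    expireAfterSeconds: 600
  }
});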

I ran the same bulk inserts:

run(900, bulkInsert, db.demo, 1, 1000);


Without the need for a deletion job, I observed the same result:
[Image: the same sawtooth pattern of the collection size, now maintained by the TTL monitor]

The metrics for the collection come from collStats(), exposed by mongodb_exporter:

- storageStats.count → mongodb_collstats_storageStats_count

The metrics for command and document counters come from serverStatus(), exposed by mongodb_exporter:

- opcounters → mongodb_ss_opcounters {{legacy_op_type}}

The metrics for WiredTiger cursor calls come from serverStatus(), exposed by mongodb_exporter:

- cursor insert calls → mongodb_ss_wt_cursor_cursor_insert_calls
- cursor search calls → mongodb_ss_wt_cursor_cursor_search_calls
- cursor remove calls → mongodb_ss_wt_cursor_cursor_remove_calls
- cursor reset calls → mongodb_ss_wt_cursor_cursor_reset_calls
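
These counters can also be read directly from mongosh. Here is a minimal sketch, assuming the serverStatus() paths that the exporter names above are derived from:

const ss = db.serverStatus();
// logical operation counters
print("inserts:", ss.opcounters.insert, "deletes:", ss.opcounters["delete"]);
// WiredTiger cursor calls, the storage-level counterpart
print("wt inserts:", ss.wiredTiger.cursor["cursor insert calls"],
      "wt removes:", ss.wiredTiger.cursor["cursor remove calls"]);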

Let's go into more detail.


Comparison (internals)

The TTL Monitor thread of the MongoDB server is enabled by default (ttlMonitorEnabled=true) for all indexes with an expireAfterSeconds option, and it wakes up to run deletions every minute by default (ttlMonitorSleepSecs=60). I also chose one minute for the manual solution to make the comparison easy.
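
Both are regular server parameters, so the TTL Monitor can be inspected, or paused for testing, at runtime:

// check the current TTL monitor settings
db.adminCommand({ getParameter: 1, ttlMonitorEnabled: 1, ttlMonitorSleepSecs: 1 });
// pause the monitor, for example to observe the backlog of expired documents
db.adminCommand({ setParameter: 1, ttlMonitorEnabled: false });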

The observations on the collection count, operation counters, and cursor calls in WiredTiger were the same. To confirm that the deletions come from the TTL Monitor, I can look at other statistics that show whether they come from the query executor or the TTL thread:
[Image: deletion metrics split between the query executor and the TTL thread]

The documents deleted by my manual job increased these metrics (serverStatus() name → mongodb_exporter name):

- metrics.document.deleted → mongodb_ss_metrics_document {{doc_op_type}}
- metrics.queryExecutor.scanned → mongodb_ss_metrics_queryExecutor_scanned
- metrics.queryExecutor.scannedObjects → mongodb_ss_metrics_queryExecutor_scannedObjects

The documents deleted by the TTL Monitor increased this metric:

- metrics.ttl.deletedDocuments → mongodb_ss_metrics_ttl_deletedDocuments
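
Both sets of counters are visible directly in serverStatus(); metrics.ttl also reports the number of passes of the monitor:

const m = db.serverStatus().metrics;
print("deleted by queries:", m.document.deleted);
print("deleted by TTL:", m.ttl.deletedDocuments, "in", m.ttl.passes, "passes");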

Given the throughput in this lab, the performance of both methods is the same: inserting at 8,500 documents per second and deleting at 5,500 documents per second.

As I love to understand how things work, I captured a flame graph with perf record (see how in FlameGraphs on Steroids), and here is the call stack of the TTL Monitor during the deletion:
[Image: flame graph of the TTL Monitor during deletion]

The TTL Monitor's deleteExpiredWithIndex executes a batched deletion down to a wiredTigerCursorRemove call.

The flame graph from the manual deletion job shows the same stack under performDeletes:
[Image: flame graph of the manual deletion job]

Given the throughput and stack traces, I conclude that the TTL Monitor performs the same job as a manual deletion but is easier to operate because it is automated and requires only a simple index declaration. This differs from some databases that implement TTL at the storage level, for example in LSM-tree compaction (YugabyteDB does that for YCQL, its Cassandra-like API, where there are no secondary indexes). As MongoDB maintains secondary indexes with strong consistency, deletion must be performed at a higher level.


You might wonder if going into those details is useful or just geek fun. In my case, it helped me realize that I had missed one important metric, the WiredTiger cursor remove calls, which I later added to my dashboard:
[Image: dashboard with the WiredTiger cursor remove calls added]
There are two remove calls in the storage for one delete call because there is one entry in the primary index (I used a clustered collection) and one in the secondary index (the one used for TTL).

Another reason for going into those details is to consider all the consequences of this behavior. When TTL is implemented at the storage level, as with LSM trees, queries must skip many tombstones until the SST files are compacted. MongoDB stores its indexes in B-trees, and a scan may similarly have to read through many emptied pages after old data is deleted. Let's investigate how this works in MongoDB.

Queries after TTL deletion (skip deleted pages)

I re-created my collection, inserted rows for 15 minutes, and created the TTL index to expire rows after 5 minutes:

db.demo.drop();
db.runCommand( {
 create: "demo",
 clusteredIndex: { "key": { _id: 1 }, "unique": true }
} )

run(900, bulkInsert, db.demo, 1, 1000);
db.demo.createIndex(
 { created_at: 1 } , { expireAfterSeconds: 300 }
);


The collection reached 8 million documents, and the TTL monitor will delete all those older than five minutes. I created the index after the bulk load to show what happens when lots of rows are deleted, which is the case for the first run.

I ran the following to query for the earliest document, in a loop for 15 minutes:

function query(){
 print(
  db.demo.find()
         .sort({ created_at: 1 })
         .limit(1)
         .explain("executionStats")
         .executionStats.executionStages.inputStage.inputStage
 );
}

run(900, query);


This kind of query must scan through all the deleted entries before finding the first non-deleted one.

I've run the query with explain() to verify the execution statistics for the index scan:

[2025-01-10T22:56:01.605Z] Executing: query() 297.5/s
{
  stage: 'IXSCAN',
  nReturned: 1,
  executionTimeMillisEstimate: 0,
  works: 1,
  advanced: 1,
  needTime: 0,
  needYield: 0,
  saveState: 0,
  restoreState: 0,
  isEOF: 0,
  keyPattern: { created_at: 1 },
  indexName: 'created_at_1',
  isMultiKey: false,
  multiKeyPaths: { created_at: [] },
  isUnique: false,
  isSparse: false,
  isPartial: false,
  indexVersion: 2,
  direction: 'forward',
  indexBounds: { created_at: [ '[MinKey, MaxKey]' ] },
  keysExamined: 1,
  seeks: 1,
  dupsTested: 0,
  dupsDropped: 0
}

Here, the index scan looks optimized, with only one seek to the minimum index bound (seeks: 1) and one index entry read (keysExamined: 1) to return the first row (nReturned: 1).

Here is the number of rows in the collection and the insert throughput:

[Image: collection document count and insert throughput]

I zoomed in on the query part:
[Image: zoom on the query throughput]

The read throughput, visible as command rather than query because I used explain(), was lower after the first deletion, which removed 6 million documents. Then, it ran at 400 queries per second, with a slight drop the next minute.

Here are other statistics that are correlated:
[Image: TTL deletions correlated with in-memory deleted pages skipped during tree walk]

The read throughput drop correlates with the TTL deletion. Another metric, from the storage statistics index details, explains the consequence of deleted entries: "Total number of in-memory deleted pages skipped during tree walk" increases at the end of the first massive deletion, when the query throughput was at its lowest. The name is explicit: the index scan walks the B-tree from the oldest leaf to the one holding the first visible row to be returned, and it has to skip through the deleted entries. Skipping pages that are known to contain only deleted entries makes this faster than examining all deleted keys.
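
This counter is also available without the exporter: db.demo.stats({ indexDetails: true }) exposes the WiredTiger statistics of each index. As a sketch, I search for the statistic by name rather than assuming the exact section it sits in:

// print the "deleted pages skipped" statistics of the TTL index
const details = db.demo.stats({ indexDetails: true }).indexDetails["created_at_1"];
for (const [section, stats] of Object.entries(details)) {
  if (typeof stats !== "object" || stats === null) continue;
  for (const [name, value] of Object.entries(stats)) {
    if (name.includes("deleted pages skipped")) {
      print(`${section}: ${name} = ${value}`);
    }
  }
}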

I may look deeper into it in a future blog post. For the moment, here is the WiredTiger issue about this optimization that allows skipping pages where all entries are deleted:

WT-10807 Skip in-memory deleted pages as part of the tree walk #9877

Traversing an in-memory page that contains all the deleted values that are visible to the current transaction led to an increase in latency due to the time spent skipping these deleted values.

By saving the aggregated timestamp information in the ref when the page has all deleted values, this aggregated information can be validated against the transaction snapshot to skip traversing the page completely and improve the latency when there are many deleted pages.

The downside of this approach is that the in-memory size of each ref is increased by 8 more bytes, but this increase shouldn't cause any problem.


This long post explored the TTL mechanism of MongoDB, including a small test case, statistics from serverStatus() and collStats(), and some insights from the stack trace and source code. A fully reproducible test helps to learn and understand, and it can be used further to investigate different variations. The documentation tutorial for this feature is Expire Data from Collections by Setting TTL. There is also a similar feature for time series collections (Automatic Removal for Time Series Collections), which works differently and may be a topic for a future blog post.

I don't know of any other database that provides consistent secondary indexes and offers such an easy-to-set-up Time-To-Live (TTL) feature.
