
Rachelle Palmer for MongoDB


Supercharging Time Series Collections: Key Enhancements in MongoDB 8.0 with Block Processing

The landscape for time-series data has evolved significantly in recent years. Businesses are capturing more granular data, recognizing that the value of data rises with its precision. Storing time-stamped data and performing temporal analytics is becoming essential, and in some industries mandated. As data volumes inevitably grow and precision-based analytics become more important, tools that provide efficient ways to work with time-series data will become even more critical.

To address the rising demand for time-series analytics, we introduced Time Series Collections in MongoDB 5.0, designed to meet the expanding needs of time-stamped data. Unlike point solutions that require separate complex setups, MongoDB’s Time Series Collection enables users to simply stand up a collection and instantly leverage time-series capabilities. Over time, we’ve expanded its capabilities with features like columnar compression, enhanced temporal analytics, enriched indexing, geo-support, and seamless integration with the broader MongoDB portfolio—creating a streamlined, developer-friendly experience.

In our upcoming 8.0 release, we’re excited to introduce new features that significantly enhance scalability and query performance for managing large time-series workloads, delivering even better price-performance for our users.

Enhanced Time Series Scalability

As time-series data volumes grow, the challenge isn’t just scaling; it’s doing so efficiently, balancing resources, cost, and performance. With MongoDB 8.0, we’re introducing key optimizations in Time Series Collections to help users maximize resource value while managing increasingly complex workloads.

In previous versions, Time Series Collections inserted data in an uncompressed format, causing a larger working set, increased cache usage, and write amplification, which led to high I/O as uncompressed data was written to the WiredTiger storage engine. This was especially problematic for high-cardinality workloads with millions of devices, sensors, and so on. With MongoDB 8.0, Time Series Collections now write data directly in a column-compressed format, reducing cache usage, lowering write I/O, and improving insert performance and storage efficiency.
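One way to see the effect of columnar compression yourself is to compare a collection’s logical data size with its on-disk footprint. Below is a quick mongosh check (using the market_data collection created later in this post; the exact output fields can vary by version):

// Compare logical data size vs. on-disk (compressed) storage size.
// A wide gap between "size" and "storageSize" reflects compression at work.
const stats = db.market_data.aggregate([
  { $collStats: { storageStats: {} } }
]).next();
print("logical size (bytes):", stats.storageStats.size);
print("storage size (bytes):", stats.storageStats.storageSize);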

For users, the benefit of lower cache usage translates directly into cost savings. Users will be able to extract far more value from their existing cluster resources, resulting in better price-performance. We’ve observed throughput improvements of 2-3x, with cache usage reduced by 10-20x compared to workloads on version 7.0.
For example, a workload on MongoDB 7.0 can experience performance fluctuations, leading to inconsistent write performance and a sawtooth pattern caused by I/O overload: writing large amounts of uncompressed data to disk strained checkpoints. As shown below, a test on a 7.0 Atlas M50 cluster (32 GB RAM, 160 GB storage, 8 vCPUs) reached peaks of 500K inserts/s while displaying this pattern:

[Chart: MongoDB 7.0 insert throughput on an Atlas M50, peaking at 500K inserts/s with a sawtooth pattern]

With 8.0, write performance is now steady, eliminating the I/O strain. As shown below, the same workload on an Atlas M50 achieves a stable 600K inserts/s, eliminating the previous sawtooth pattern.

[Chart: MongoDB 8.0 insert throughput on the same Atlas M50, sustaining a stable 600K inserts/s]

Block Processing with Time Series Collections

User feedback highlighted that query performance is crucial as time-series workloads scale. Traditionally, the MongoDB query engine processes data one document at a time, which can be inefficient for large-scale analytics. For Time Series Collections, this inefficiency stems from the need to unpack and reshape a large volume of compressed data. To address this, we introduced Block Processing for Time Series Collections: a new automatic query execution model that processes "blocks of data" at once, leveraging column-level summaries while avoiding the overhead of unpacking and reshaping documents. This approach significantly improves performance, particularly for aggregations that leverage stages such as $match, $sort, and $group, and other analytical stages like $setWindowFields.

Because each stage becomes more efficient, the gains compound across the pipeline, reducing overhead and exploiting patterns in time-series data that were previously out of reach for the engine. Each aggregation stage processes larger chunks of data, leading to a more efficient query execution model with significantly faster performance. Use cases like financial analysis and IoT, which involve intensive filtering, grouping, and sorting (i.e., $match, $group, $sort), will see major performance improvements. With Block Processing, we’ve observed improvements ranging from 10-40x, with some large-scale aggregations reaching up to 100x.

Let’s explore how Block Processing can optimize common financial aggregations, such as generating OHLC (Open, High, Low, Close) values and calculating an exponential moving average over a specified time period. In this example, we analyze an aggregation that saw a 20x improvement on an Atlas M50 replica set using MongoDB’s fork of TSBS (the Time Series Benchmark Suite). We use the TSBS finance use case, which generates a workload containing 10 stock symbols, each producing one event per second over 7 days, resulting in approximately 6 million events loaded into a Time Series Collection.

Here's a sample document so you can see what this looks like:



{
  "time" : ISODate("2022-01-01T00:00:00Z"),
  "tags" : {
    "symbol" : "MDB"
  },
  "_id" : ObjectId("64c4092a9451cd8064c69be1"),
  "measurement" : "price", 
  "price" : 200.13171
}


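Documents of this shape are inserted like any other MongoDB document. For example, once the collection exists (creation shown next), a single event can be written with a standard insert (illustrative only):

// Insert one price event; the server buckets and compresses it internally.
db.market_data.insertOne({
  time: ISODate("2022-01-01T00:00:00Z"),
  tags: { symbol: "MDB" },
  measurement: "price",
  price: 200.13171
});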

We start by creating a time-series collection market_data in MongoDB, where the timeField is set to time and the metaField to tags. The granularity is set to "seconds" because events arrive at one-second intervals.



db.createCollection("market_data", {
  timeseries: {
    timeField: "time",      // required: the field holding each event's timestamp
    metaField: "tags",      // metadata identifying each series (here, the stock symbol)
    granularity: "seconds"  // events arrive roughly once per second
  }
});


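You can verify that the collection was created with the expected time-series options using a standard mongosh helper:

// Confirm the time-series options on the new collection.
db.getCollectionInfos({ name: "market_data" })[0].options.timeseries;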

Indexes:
We create two compound indexes to support queries within the workload:



db.market_data.createIndex({ "tags": 1, "time": 1 }); 
db.market_data.createIndex({ "tags.symbol": 1, "time": -1 });


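If you would rather not set up TSBS, the sketch below loads a smaller but similarly shaped workload directly from mongosh. It is a rough stand-in, not the TSBS generator: the symbol list and random-walk pricing are our own assumptions.

// Generate a synthetic tick workload: 10 symbols, one event per second.
// Covers one hour here; raise the loop bound for longer windows.
const symbols = ["MDB", "AAPL", "GOOG", "AMZN", "MSFT",
                 "NFLX", "NVDA", "TSLA", "META", "ORCL"];
const start = new Date("2022-01-01T00:00:00Z");
const prices = Object.fromEntries(symbols.map(s => [s, 100 + Math.random() * 100]));

for (let sec = 0; sec < 3600; sec++) {
  const t = new Date(start.getTime() + sec * 1000);
  const batch = [];
  for (const s of symbols) {
    prices[s] += Math.random() - 0.5;  // simple random walk
    batch.push({ time: t, tags: { symbol: s }, measurement: "price", price: prices[s] });
  }
  db.market_data.insertMany(batch);    // one batch per second of data
}

Batching the inserts keeps round trips down; for the full 7-day, ~6-million-event dataset, use the TSBS loader itself.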

Query:
Next, we construct a query that generates the OHLC values and exponential moving average for a group of stock symbols, computed in 4-hour buckets over a 1-day window.



db.market_data.aggregate([
  // Keep only events from the 24 hours preceding the reference time.
  {"$match": {"$expr": {"$gte": ["$time", {"$dateSubtract": {"startDate": new Date("2022-01-01T03:00:00Z"), "unit": "hour", "amount": 24}}]}}},
  // Order events chronologically so $first/$last yield open/close prices.
  {"$sort": {"time": 1}},
  // Bucket each symbol into 4-hour (240-minute) windows and compute OHLC.
  {"$group": {
    "_id": {"symbol": "$tags.symbol", "time": {"$dateTrunc": {"date": "$time", "unit": "minute", "binSize": 240}}},
    "high": {"$max": "$price"},
    "low": {"$min": "$price"},
    "open": {"$first": "$price"},
    "close": {"$last": "$price"}
  }},
  // Compute an exponential moving average of the close price per symbol.
  {"$setWindowFields": {
    "partitionBy": "$_id.symbol",
    "sortBy": {"_id.time": 1},
    "output": {"expMovingAverage": {"$expMovingAvg": {"input": "$close", "N": 100}}}
  }},
  // Return the most recent buckets first.
  {"$sort": {"_id.time": -1}}
]);



Sample Output:



{
  "_id": { "symbol": "MDB", "time": ISODate("2022-01-01T02:00:00Z") },
  "high": 148.27729312597552,
  "low": 51.01901106672195,
  "open": 126.83590008130241,
  "close": 99.44471233463418,
  "expMovingAverage": 99.44471233463418
}


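To check whether your own deployment is taking the block-processing path, inspect the query plan with explain(). The exact stage names in the output are version-dependent, so treat this as a starting point rather than a contract:

// Inspect the plan for a simplified version of the query above.
// On 8.0, block-processing operators appear among the plan stages.
db.market_data.explain("executionStats").aggregate([
  { $match: { time: { $gte: ISODate("2021-12-31T03:00:00Z") } } },
  { $group: { _id: "$tags.symbol", high: { $max: "$price" } } }
]);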

Compared to 7.0, this query’s operations per second improved by an incredible 20x. This dramatic boost reflects our ongoing commitment to helping MongoDB users handle complex time-series data with ease, and we’re excited to see what can be achieved with these new enhancements.

Please try it out, and let us know your feedback!

Contributed by Nishith Atreya and Michael Gargiulo.
