DEV Community

Franck Pachot for MongoDB

Posted on

Skip Scan in MongoDB with Range, Equality (RE) index as an alternative to Equality, Sort, Range (ESR)

To achieve the ideal index for each query while adhering to the Equality, Sort, and Range (ESR) rule, you may end up with too many indexes. Instead of creating the best index for every query, the most effective indexing strategy is using a small set of indexes that addresses performance and scalability requirements.
Indexes that start with a field not used as a filter can still be beneficial if the column has a limited range of values. The index scans the entries based on the index key's following field and skips the first field's values to seek the following range. We will illustrate this using a similar collection as in the previous post:

mdb> db.demoesr.drop();

mdb> db.demoesr.insertMany([
 { e:42, s:"a", r:10 },
 { e:42, s:"b", r:20 },
 { e:42, s:"b", r:10 },
 { e:42, s:"d", r:30 },
 { e:99,        r:11 }  // add one to skip
]);
Enter fullscreen mode Exit fullscreen mode

My query finds the document with e = 42 and r > 10:

test> db.demoesr.find(
      { r:{$gt:10} , e:{$eq: 42} } , 
      {e:1,s:1,r:1,"_id":0 }
      )
;
[
  { e: 42, s: 'b', r: 20 },
  { e: 42, s: 'd', r: 30 }
]
Enter fullscreen mode Exit fullscreen mode

Without an index, this reads all documents with a COLLSCAN:

db.demoesr.find(
      { r:{$gt:10} , e:{$eq: 42} } , 
      {e:1,s:1,r:1,"_id":0 }
      ).explain("executionStats").executionStats
;

{
  executionSuccess: true,
  nReturned: 2,
  totalKeysExamined: 0,
  totalDocsExamined: 5,
  executionStages: {
    stage: 'PROJECTION_SIMPLE',
    nReturned: 2,
    transformBy: { e: 1, s: 1, r: 1, _id: 0 },
    inputStage: {
      stage: 'COLLSCAN',
      filter: {
        '$and': [ { e: { '$eq': 42 } }, { r: { '$gt': 10 } } ]
      },
      nReturned: 2,
      direction: 'forward',
      docsExamined: 5
    }
  }
}
Enter fullscreen mode Exit fullscreen mode

Although my collection for the demo is limited, the execution plan highlights its scalability challenge. It examines all documents but only returns a subset, which lacks efficiency.

ESR index (Equality, Sort, Range)

With the index created in the previous post, my query uses an index scan:

test> db.demoesr.createIndex(
      { e: 1 , s: 1, r : 1 }
);
e_1_s_1_r_1

test> db.demoesr.find(
      { r:{$gt:10} , e:{$eq: 42} } , 
      {e:1,s:1,r:1,"_id":0 }
      ).explain("executionStats").executionStats
;

{
  executionSuccess: true,
  nReturned: 2,
  totalKeysExamined: 5,
  totalDocsExamined: 0,
  executionStages: {
    stage: 'PROJECTION_COVERED',
    nReturned: 2,
    transformBy: { e: 1, s: 1, r: 1, _id: 0 },
    inputStage: {
      stage: 'IXSCAN',
      nReturned: 2,
      keyPattern: { e: 1, s: 1, r: 1 },
      indexName: 'e_1_s_1_r_1',
      isMultiKey: false,
      multiKeyPaths: { e: [], s: [], r: [] },
      direction: 'forward',
      indexBounds: {
        e: [ '[42, 42]' ],
        s: [ '[MinKey, MaxKey]' ],
        r: [ '(10, inf.0]' ]
      },
      keysExamined: 5,
      seeks: 3
    }
  }
}


Enter fullscreen mode Exit fullscreen mode

This plan is acceptable if the equality part (e: [ '[42, 42]' ]) is highly selective. Since the subsequent range is not filtering, all values were read. However, a skip scan was used to seek to the next value at the end of each range of the next bound (r: [ '(10, inf.0]' ]).

ER index (Equality, Range)

Since I don't do any sort and read all values for "s", having an index on the two fields used for filtering is preferable:

test> db.demoesr.createIndex(
      { e: 1 , r : 1 }
);
e_1_r_1

test> db.demoesr.find(
      { r:{$gt:10} , e:{$eq: 42} } , 
      {e:1,s:1,r:1,"_id":0 }
      ).explain("executionStats").executionStats
;
{
  executionSuccess: true,
  nReturned: 2,
  executionTimeMillis: 3,
  totalKeysExamined: 2,
  totalDocsExamined: 2,
  executionStages: {
    stage: 'PROJECTION_SIMPLE',
    nReturned: 2,
    executionTimeMillisEstimate: 0,
    works: 4,
    advanced: 2,
    needTime: 0,
    needYield: 0,
    saveState: 0,
    restoreState: 0,
    isEOF: 1,
    transformBy: { e: 1, s: 1, r: 1, _id: 0 },
    inputStage: {
      stage: 'FETCH',
      nReturned: 2,
      inputStage: {
        stage: 'IXSCAN',
        nReturned: 2,
        keyPattern: { e: 1, r: 1 },
        indexName: 'e_1_r_1',
        isMultiKey: false,
        multiKeyPaths: { e: [], r: [] },
        direction: 'forward',
        indexBounds: { e: [ '[42, 42]' ], r: [ '(10, inf.0]' ] },
        keysExamined: 2,
        seeks: 1
      }
    }
  }
}

Enter fullscreen mode Exit fullscreen mode

This decreased the number of seeks since no values were skipped. It reads a single range (e: [ '[42, 42]' ], r: [ '(10, inf.0]' ]) encompassing all the keys for the returned documents. You achieve the optimal index for your filter when seeks: 1 and keysExamined = nReturned.

RE index (Range, Equality)

Now, imagine that a similar index was created for another critical query with the fields arranged in a different order, as this use case had an equality predicate on "r" and reads many values of "e". I don't want to create another index for my query if I don't need to, so let's remove the indexes created above and check the execution plan with an index starting with "r":

test> db.demoesr.dropIndexes("*");
{
  nIndexesWas: 3,
  msg: 'non-_id indexes dropped for collection',
  ok: 1
}

test> db.demoesr.createIndex(
      { r: 1, e : 1 }
);

test> db.demoesr.find(
      { r:{$gt:10} , e:{$eq: 42} } , 
      {e:1,s:1,r:1,"_id":0 }
      ).explain("executionStats").executionStats
;

{
  executionSuccess: true,
  nReturned: 2,
  totalKeysExamined: 3,
  totalDocsExamined: 2,
  executionStages: {
    stage: 'PROJECTION_SIMPLE',
    nReturned: 2,
    transformBy: { e: 1, s: 1, r: 1, _id: 0 },
    inputStage: {
      stage: 'FETCH',
      nReturned: 2,
      docsExamined: 2,
      alreadyHasObj: 0,
      inputStage: {
        stage: 'IXSCAN',
        nReturned: 2,
        keyPattern: { r: 1, e: 1 },
        indexName: 'r_1_e_1',
        isMultiKey: false,
        multiKeyPaths: { r: [], e: [] },
        direction: 'forward',
        indexBounds: { r: [ '(10, inf.0]' ], e: [ '[42, 42]' ] },
        keysExamined: 3,
        seeks: 2
      }
    }
  }
}
Enter fullscreen mode Exit fullscreen mode

In my case, I have only one value for "e" to skip, as { e:99, r:11 } is in the first range r: [ '(10, inf.0]' ] but not in the following one e: [ '[42, 42]' ], which is evident from one additional seek. If there aren't too many of those, this index remains efficient, and there's no need to create an additional one with a better key field ordering for this query.

Keep in mind that if there's no range predicate in your query, the IXCAN cannot be utilized, as we discussed in the previous post. Without an index that begins with "e", this will result in a COLLSCAN:

db.demoesr.find( 
 { e:{$eq: 42} } , {e:1,s:1,r:1,"_id":0 } 
).explain("executionStats").executionStats;
Enter fullscreen mode Exit fullscreen mode

The workaround remains unchanged from the previous post, which involves adding an infinite range filter to the first field of the index:

db.demoesr.find( 
 { r:{$gte:MinKey}, e:{$eq: 42} } , {e:1,s:1,r:1,"_id":0 } 
).sort({s:1}).explain("executionStats").executionStats;
Enter fullscreen mode Exit fullscreen mode

We have seen equality predicates, where the index scan can seek directly to the desired point, and range predicates, where multiple values can be examined but skipped according to the following ranges. You may wonder if a filter on a list of values ($in) is processed like multiple equalities or a single range with skips. That will be the next topic.

Top comments (0)