DEV Community

Cover image for Why Aggregation in MongoDB
Shaktijeet Sahoo
Shaktijeet Sahoo

Posted on

Why Aggregation in MongoDB

I have seen a lot of developers start with .find() when working with MongoDB and stick to that. It's not inherently bad, as it's simple nature gives a simple and fluid head start, but sticking to that as you move is what I am not a fan of.

Before jumping in, let's get things out of the way. find() is faster than aggregate() in general. If you are running a simple query, on a collection with limited documents, go with find(). For example (with mongoose, a JavaScript library that works as a ODM driver for MongoDB in node)):

// .find()
const findResults = await BookModel.find({author: "John Doe"});

// .aggregate()
const aggrResults = await BookModel.aggregate([
    {
        $match: {
            author: "John Doe"
        }
    }
]);
Enter fullscreen mode Exit fullscreen mode

If you run the above app, you'll have a noticeable time difference (depends on the collection size), where aggregation takes more time than find.

But that is not the reason I am writing this article, obviously. The problem with find comes in when we want to perform post-processing on the data. For example, after finding results you want to group them based on some parameter. Let's first weave an example:

Data
[
    {"name": "The Silent Patient", "author": "Alex Michaelides", "rating": 4.3},
    {"name": "The Maidens", "author": "Alex Michaelides", "rating": 4.0},
    {"name": "Ares", "author": "Alex Michaelides", "rating": 3.9},
    {"name": "1984", "author": "George Orwell", "rating": 4.7},
    {"name": "Animal Farm", "author": "George Orwell", "rating": 4.5},
    {"name": "Homage to Catalonia", "author": "George Orwell", "rating": 4.2},
    {"name": "Norwegian Wood", "author": "Haruki Murakami", "rating": 4.4},
    {"name": "Kafka on the Shore", "author": "Haruki Murakami", "rating": 4.6},
    {"name": "1Q84", "author": "Haruki Murakami", "rating": 4.1},
    {"name": "Killing Commendatore", "author": "Haruki Murakami", "rating": 4.0},
    {"name": "Down and Out in Paris and London", "author": "George Orwell", "rating": 4.3},
    {"name": "Hard-Boiled Wonderland and the End of the World", "author": "Haruki Murakami", "rating": 4.2}
]
Enter fullscreen mode Exit fullscreen mode
With find()
// .find()
const findResults = await BookModel.find();
const byAuthor = findResults.reduce((acc, r) => ({
    ...acc,
    [r.author]: [...(acc.author ?? []), r],
}));
const byRating = findResults.reduce(
  (acc, r) =>
    r.rating > 4
      ? {
          ...acc,
          highRated: [...acc.highRated, r],
        }
      : r.rating < 2
        ? {
            ...acc,
            lowRated: [...acc.lowRated, r],
          }
        : {
            midRated: [...acc.midRated, r],
          },
  {
    highRated: [],
    lowRated: [],
    midRated: [],
  },
);

console.log({ byAuthor, byRating });
Enter fullscreen mode Exit fullscreen mode
With .aggregate()
// .aggregate()
const aggrResults = await BookModel.aggregate([
    {
        $facet: {
            byAuthors: [
                {
                    $group: {
                        _id: "$author",
                        books: { $push: "$$ROOT" },
                    }
                },
                {
                    $group: {
                        _id: null,
                        byAuthor: { $push: { k: "$_id", v: "$books" } },
                    }
                },
                {
                    $project: {
                        _id: 0,
                        byAuthor: { $arrayToObject: "$byAuthor" },
                    }
                }
            ],
            byRating: [
                {
                    $group: {
                        _id: {
                            $switch: {
                                branches: [
                                    {
                                        case: {
                                            $gte: ["$rating", 4],
                                        },
                                        then: "highRated"
                                    },
                                    {
                                        case: {
                                            $lte: ["$rating", 2],
                                        },
                                        then: "lowRated"
                                    },
                                ],
                                default: "midRated",
                            }
                        },
                        books: { $push: "$$ROOT" },
                    }
                },
                {
                    $group: {
                        _id: null,
                        byRating: { $push: { k: "$_id", v: "$books" } },
                    },
                },
                {
                    $project: {
                        _id: 0,
                        byRating: { $arrayToObject: "$byRating" },
                    }
                }
            ]
        }
    }
]);

const [{ byAuthor, byRating }] = aggrResult; // assuming the everything went right

console.log({ byAuthor: byAuthor[0], byRating: byRating[0] });
Enter fullscreen mode Exit fullscreen mode

📝 I am not including the results as they can be guessed.

So, I gave an example that makes aggregation looks complicated. Why go to such length when you can just find it and then go about your day? Well, as I said, in the intro for simple use cases, like this one, with 10 documents in your collection, always find() it.

Let's complicate the example a bit, where you have a author collection and you want to bring in data from there as well. Now, if you are used to mongoose, you can recall a nifty feature called .populate().

So our example becomes:

const findResults = await BookModel.find().limit(1000).populate("author");
// ... all the post-processing stuff
Enter fullscreen mode Exit fullscreen mode

Day 1: All good. Never better.

Day 2: Now you have a million documents.

What changes is that you make an API call, binge watch One Piece and then come back to a meet a loading screen.

Let me explain why: The populate() function in mongoose makes a query for every book in your document to get it's author by id. So your one .find().populate("author") actually runs 1001 queries. Without the post-processing you are going to apply, i.e. 1000*n operations. And God forbid you have more relational fields that you want to populate.

What shines here is aggregate().

We run:

const aggrResults = await BookModel.aggregate([
    {
        $limit: 1000
    },
    {
        $lookup: {
            from: "authors",
            localField: "author",
            foreignField: "$_id",
            as: "author"
        }
    },
    // ... rest of your pipelines
])
Enter fullscreen mode Exit fullscreen mode

Now, a single query runs. And it is faster than .find().populate("author"). And we don't have to post process the data. We are served on a silver plate and then fed with a silver spoon.

And this is just me touching the surface. Quoting Sir Isaac Newton, "What we know is a drop, what we don't know is an ocean." (from Dark, ofc)

But before you jump into aggregation()-ing everything, let me let you know of the caveats:

  1. You have to be extremely careful with the limits of your MongoDB server. Aggregation, while handy, requires more resources to run than find.
  2. In case you are using corporate managed MongoDB instances, like Atlas, you have to be careful about pricing, as prices spike up when you run resource expensive queries.
  3. This extends point 1: If you have a locally setup server, you must keep in mind the memory allocated to the process/server, or your deployment will crash processing while large queries.

With that out of the way, get to replacing your complex find()s with aggregate().

I will cover more of aggregation in details in my following writeups.

📝 Thank you for reading till the end. I am open to criticisms, constructive or otherwise, as this is my first blog. Please help me to get better.

Top comments (0)