I have seen a lot of developers start with .find() when working with MongoDB and stick to it. That's not inherently bad, as its simple nature gives you a fluid head start, but sticking to it as your needs grow is what I am not a fan of.
Before jumping in, let's get one thing out of the way: find() is generally faster than aggregate(). If you are running a simple query on a collection with a limited number of documents, go with find(). For example (with Mongoose, a JavaScript library that works as an ODM for MongoDB in Node):
// .find()
const findResults = await BookModel.find({ author: "John Doe" });

// .aggregate()
const aggrResults = await BookModel.aggregate([
  {
    $match: {
      author: "John Doe"
    }
  }
]);
If you run the above, you'll notice a time difference (depending on the collection size): the aggregation takes longer than the find.
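If you want to see it for yourself, a quick-and-dirty comparison looks something like this (a rough sketch using console.time; reach for a proper benchmarking tool for real numbers):

console.time("find");
await BookModel.find({ author: "John Doe" });
console.timeEnd("find");

console.time("aggregate");
await BookModel.aggregate([{ $match: { author: "John Doe" } }]);
console.timeEnd("aggregate");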
But that is not why I am writing this article, obviously. The problem with find() shows up when we want to post-process the data. For example, after fetching results you want to group them by some field. Let's weave an example first:
Data
[
{"name": "The Silent Patient", "author": "Alex Michaelides", "rating": 4.3},
{"name": "The Maidens", "author": "Alex Michaelides", "rating": 4.0},
{"name": "Ares", "author": "Alex Michaelides", "rating": 3.9},
{"name": "1984", "author": "George Orwell", "rating": 4.7},
{"name": "Animal Farm", "author": "George Orwell", "rating": 4.5},
{"name": "Homage to Catalonia", "author": "George Orwell", "rating": 4.2},
{"name": "Norwegian Wood", "author": "Haruki Murakami", "rating": 4.4},
{"name": "Kafka on the Shore", "author": "Haruki Murakami", "rating": 4.6},
{"name": "1Q84", "author": "Haruki Murakami", "rating": 4.1},
{"name": "Killing Commendatore", "author": "Haruki Murakami", "rating": 4.0},
{"name": "Down and Out in Paris and London", "author": "George Orwell", "rating": 4.3},
{"name": "Hard-Boiled Wonderland and the End of the World", "author": "Haruki Murakami", "rating": 4.2}
]
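For reference, the snippets below assume a Mongoose model along these lines (a minimal sketch based on the fields above; the author field becomes a reference to a separate collection later in the article):

import mongoose from "mongoose";

const bookSchema = new mongoose.Schema({
  name: String,
  author: String, // swapped for a ref to an Author model in the populate example later
  rating: Number,
});

const BookModel = mongoose.model("Book", bookSchema);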
With find()
// .find()
const findResults = await BookModel.find();

// group the books by author
const byAuthor = findResults.reduce(
  (acc, r) => ({
    ...acc,
    [r.author]: [...(acc[r.author] ?? []), r], // look the author key up dynamically
  }),
  {}, // start from an empty object, or the first document becomes the accumulator
);

// bucket the books by rating (same thresholds as the aggregation below)
const byRating = findResults.reduce(
  (acc, r) =>
    r.rating >= 4
      ? { ...acc, highRated: [...acc.highRated, r] }
      : r.rating <= 2
        ? { ...acc, lowRated: [...acc.lowRated, r] }
        : { ...acc, midRated: [...acc.midRated, r] }, // spread acc so the other buckets survive
  {
    highRated: [],
    lowRated: [],
    midRated: [],
  },
);
console.log({ byAuthor, byRating });
With .aggregate()
// .aggregate()
const aggrResults = await BookModel.aggregate([
  {
    $facet: {
      byAuthors: [
        {
          $group: {
            _id: "$author",
            books: { $push: "$$ROOT" },
          },
        },
        {
          $group: {
            _id: null,
            byAuthor: { $push: { k: "$_id", v: "$books" } },
          },
        },
        {
          $project: {
            _id: 0,
            byAuthor: { $arrayToObject: "$byAuthor" },
          },
        },
      ],
      byRating: [
        {
          $group: {
            _id: {
              $switch: {
                branches: [
                  { case: { $gte: ["$rating", 4] }, then: "highRated" },
                  { case: { $lte: ["$rating", 2] }, then: "lowRated" },
                ],
                default: "midRated",
              },
            },
            books: { $push: "$$ROOT" },
          },
        },
        {
          $group: {
            _id: null,
            byRating: { $push: { k: "$_id", v: "$books" } },
          },
        },
        {
          $project: {
            _id: 0,
            byRating: { $arrayToObject: "$byRating" },
          },
        },
      ],
    },
  },
]);
const [{ byAuthors, byRating }] = aggrResults; // assuming everything went right
console.log({ byAuthor: byAuthors[0].byAuthor, byRating: byRating[0].byRating });
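If the k/v shuffle looks cryptic: $arrayToObject folds an array of { k, v } documents into a single object, which is why each group gets pushed as { k: "$_id", v: "$books" } first. Roughly:

// [{ k: "George Orwell", v: [/* books */] }, { k: "Haruki Murakami", v: [/* books */] }]
// becomes
// { "George Orwell": [/* books */], "Haruki Murakami": [/* books */] }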
📝 I am not including the results as they can be guessed.
So, I gave an example that makes aggregation look complicated. Why go to such lengths when you can just find() it and go about your day? Well, as I said in the intro, for simple use cases like this one, with a dozen documents in your collection, always find() it.
Let's complicate the example a bit: now you have an author collection and you want to bring in data from there as well. If you are used to Mongoose, you may recall a nifty feature called .populate().
So our example becomes:
const findResults = await BookModel.find().limit(1000).populate("author");
// ... all the post-processing stuff
Day 1: All good. Never better.
Day 2: Now you have a million documents.
What changes is that you make an API call, binge-watch One Piece, and come back to meet a loading screen.
Let me explain why: populate() in Mongoose issues additional queries after the initial find to fetch the referenced authors by id, at least one extra query per populated path. So your one .find().populate("author") is never a single round trip, and with nested or multiple populated fields the query count keeps climbing. That's before the post-processing you still have to run on top. And God forbid you have more relational fields you want to populate.
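To make that concrete, here is roughly what .populate("author") amounts to if you did it by hand (a simplified sketch, not Mongoose's actual implementation; AuthorModel is an assumed model for the authors collection):

// query #1: the books themselves
const books = await BookModel.find().limit(1000);

// query #2: an extra round trip to fetch the referenced authors
const authorIds = books.map((b) => b.author);
const authors = await AuthorModel.find({ _id: { $in: authorIds } });

// stitch the results together in application memory
const authorsById = new Map(authors.map((a) => [String(a._id), a]));
const populated = books.map((b) => ({
  ...b.toObject(),
  author: authorsById.get(String(b.author)),
}));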
What shines here is aggregate(). We run:
const aggrResults = await BookModel.aggregate([
  {
    $limit: 1000,
  },
  {
    $lookup: {
      from: "authors",
      localField: "author",
      foreignField: "_id", // plain field names here, no "$" prefix
      as: "author",
    },
  },
  // $lookup writes an array; unwind it to mirror populate()'s single document
  {
    $unwind: "$author",
  },
  // ... rest of your pipeline
]);
Now, a single query runs. And it is faster than .find().populate("author"). And we don't have to post-process the data. We are served on a silver platter and then fed with a silver spoon.
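And because it's all one pipeline, the post-processing from earlier can ride along in the same round trip. A sketch, reusing the grouping idea from above but keyed on the joined author's name (assuming the authors collection has a name field):

const aggrResults = await BookModel.aggregate([
  { $limit: 1000 },
  {
    $lookup: {
      from: "authors",
      localField: "author",
      foreignField: "_id",
      as: "author",
    },
  },
  { $unwind: "$author" },
  // group by the joined author's name, all in the same query
  { $group: { _id: "$author.name", books: { $push: "$$ROOT" } } },
]);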
And this is just me scratching the surface. Quoting Sir Isaac Newton, "What we know is a drop, what we don't know is an ocean." (from Dark, ofc)
But before you jump into aggregate()-ing everything, let me warn you about the caveats:
- You have to be extremely careful with the limits of your MongoDB server. Aggregation, while handy, requires more resources to run than find.
- In case you are using managed MongoDB instances, like Atlas, you have to be careful about pricing, as costs spike when you run resource-expensive queries.
- This extends point 1: if you have a locally set up server, you must keep in mind the memory allocated to the process/server, or your deployment will crash while processing large queries (see the sketch below).
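One practical knob for that last point: MongoDB caps the memory each pipeline stage may use (100 MB per stage by default) and errors out beyond it, but you can let heavy stages spill to temporary files on disk. With Mongoose, that looks like:

// allow memory-hungry stages ($group, $sort, ...) to spill to disk
const results = await BookModel.aggregate([
  // ... your pipeline
]).allowDiskUse(true);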
With that out of the way, get to replacing your complex find()s with aggregate()s.
I will cover aggregation in more detail in my upcoming write-ups.
📝 Thank you for reading till the end. I am open to criticism, constructive or otherwise, as this is my first blog. Please help me get better.