The original article is published here.
Looking to find duplicate documents in a MongoDB collection? This article shows how to find duplicate records in a MongoDB collection by a specific field.
MongoDB is a document-oriented database that stores JSON-like documents and is built to handle large volumes of data.
Find Duplicate Documents In MongoDB Collection
Sometimes you want to remove all the duplicate documents from a collection but you don’t know how to find them. Here we will see how to find duplicate records in MongoDB. Once you have found the duplicates, you can easily remove them from the collection.
For this example, I am using the GitHub public issues API. I fetched the issues from the GitHub API and inserted them into the database, making the same requests multiple times to create duplicates.
I will use the aggregate method along with the $group and $match pipeline stages to find duplicates. Let’s build the query step by step.
MongoDB Group Records by Field
The first step towards finding duplicates is grouping the records.
db.issues.aggregate([
    {
        $group: {
            _id: {IssueNumber: "$issue_number"}
        }
    }
]);
The above code groups the documents in the issues collection by the issue_number field and returns one result per distinct value.
/* 1 */
{
"_id" : {
"IssueNumber" : 2778
}
}
/* 2 */
{
"_id" : {
"IssueNumber" : 80
}
}
/* 3 */
{
"_id" : {
"IssueNumber" : 2592
}
}
/* 4 */
{
"_id" : {
"IssueNumber" : 2596
}
}
/* 5 */
{
"_id" : {
"IssueNumber" : 2599
}
}
/* 6 */
{
"_id" : {
"IssueNumber" : 98
}
}
/* 7 */
{
"_id" : {
"IssueNumber" : 2602
}
}
/* 8 */
{
"_id" : {
"IssueNumber" : 99
}
}
/* 9 */
{
"_id" : {
"IssueNumber" : 2607
}
}
/* 10 */
{
"_id" : {
"IssueNumber" : 100
}
}
MongoDB Get Duplicate Documents
It’s time to implement the complete query and get all the groups that contain more than one document in the MongoDB collection.
db.issues.aggregate([
    {
        $group: {
            _id: {IssueNumber: "$issue_number"},
            uniqueIds: {$addToSet: "$_id"},
            count: {$sum: 1}
        }
    },
    {
        $match: {
            count: {"$gt": 1}
        }
    }
]);
The $addToSet operator adds a value to an array if the value is not already present. If the value is already present, it does nothing. Here it collects the _id of every document in the group.
$sum calculates and returns the sum of numeric values. Summing 1 for each document gives the number of documents in the group.
$match filters the result set by a condition. $gt checks whether count is greater than the provided number (i.e. 1).
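To make the behaviour of these operators concrete, here is a minimal in-memory sketch in plain JavaScript that mimics what the $group and $match stages do. It is an illustration only, not how MongoDB implements them. The function name findDuplicateGroups and the sample documents are made up for this example.

```javascript
// In-memory illustration of $group + $addToSet + $sum + $match.
// findDuplicateGroups and sampleDocs are hypothetical, not part of MongoDB.
function findDuplicateGroups(docs, field) {
  const groups = new Map();
  for (const doc of docs) {
    const key = doc[field];
    if (!groups.has(key)) {
      // One group per distinct value of the field, like $group with _id
      groups.set(key, { _id: { [field]: key }, uniqueIds: [], count: 0 });
    }
    const group = groups.get(key);
    // Like $addToSet: add the _id only if it is not already in the array
    if (!group.uniqueIds.includes(doc._id)) {
      group.uniqueIds.push(doc._id);
    }
    // Like {$sum: 1}: add one for every document in the group
    group.count += 1;
  }
  // Like $match with {count: {$gt: 1}}: keep groups with more than one document
  return [...groups.values()].filter((group) => group.count > 1);
}

const sampleDocs = [
  { _id: "a1", issue_number: 34 },
  { _id: "b2", issue_number: 80 },
  { _id: "c3", issue_number: 34 },
];

const duplicateGroups = findDuplicateGroups(sampleDocs, "issue_number");
console.log(JSON.stringify(duplicateGroups, null, 2));
```

Only issue number 34 appears twice in the sample, so it is the only group that survives the final filter.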
The aggregation query produces the following output.
/* 1 */
{
"_id" : {
"IssueNumber" : 34
},
"uniqueIds" : [
ObjectId("61b993c2f8fba0f17accd7fc"),
ObjectId("61b9937b12b680e75f5a7fac"),
ObjectId("61b88043dcb3fc4c9c03ca8e")
],
"count" : 3.0
}
/* 2 */
{
"_id" : {
"IssueNumber" : 37
},
"uniqueIds" : [
ObjectId("61b993c1f8fba0f17accd7f6"),
ObjectId("61b9937b12b680e75f5a7f97"),
ObjectId("61b88042dcb3fc4c9c03ca82")
],
"count" : 3.0
}
/* 3 */
{
"_id" : {
"IssueNumber" : 677
},
"uniqueIds" : [
ObjectId("61b993baf8fba0f17accd6bd"),
ObjectId("61b88042dcb3fc4c9c03ca76")
],
"count" : 2.0
}
/* 4 */
{
"_id" : {
"IssueNumber" : 695
},
"uniqueIds" : [
ObjectId("61b993b9f8fba0f17accd667"),
ObjectId("61b88040dcb3fc4c9c03ca4a")
],
"count" : 2.0
}
/* 5 */
{
"_id" : {
"IssueNumber" : 700
},
"uniqueIds" : [
ObjectId("61b993b8f8fba0f17accd653"),
ObjectId("61b88040dcb3fc4c9c03ca42")
],
"count" : 2.0
}
/* 6 */
{
"_id" : {
"IssueNumber" : 1871
},
"uniqueIds" : [
ObjectId("61b993b8f8fba0f17accd60c"),
ObjectId("61b8803fdcb3fc4c9c03ca25")
],
"count" : 2.0
}
Now you have the ObjectId of every duplicate record in the MongoDB collection. Each uniqueIds array also includes the copy you will want to keep, so leave one id per group out when you run a separate query to remove the rest.
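One possible way to prepare that removal is sketched below. It assumes you want to keep one document per group and delete the others. The helper idsToRemove and the sample data are hypothetical, and plain strings stand in for ObjectIds. MongoDB does not guarantee the order of the array built by $addToSet, so which copy is kept is arbitrary.

```javascript
// Hypothetical helper: keep the first id of every group, collect the rest.
function idsToRemove(duplicateGroups) {
  return duplicateGroups.flatMap((group) => group.uniqueIds.slice(1));
}

// Sample data shaped like the aggregation output (strings stand in for ObjectIds).
const sampleGroups = [
  { _id: { IssueNumber: 34 }, uniqueIds: ["id1", "id2", "id3"], count: 3 },
  { _id: { IssueNumber: 677 }, uniqueIds: ["id4", "id5"], count: 2 },
];

const removableIds = idsToRemove(sampleGroups);
console.log(removableIds);

// In the mongo shell, the collected ids could then be passed to deleteMany:
// db.issues.deleteMany({ _id: { $in: removableIds } });
```

It is worth running the aggregation again after the delete to confirm that no group has a count above 1.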
Conclusion
This article explains how to find duplicate documents in a MongoDB collection by a specific field.
TL;DR
- Group records by the field on which you want to find duplicates
- Add all the ObjectIds to an array
- Count the documents
- Apply a condition to fetch only the groups that contain more than one document
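The steps above can be wrapped in a small helper that builds the pipeline for any field. This is only a sketch: buildDuplicatePipeline is a hypothetical name, and the generic key "value" replaces the IssueNumber key used earlier in the article.

```javascript
// Hypothetical helper that builds the duplicate-finding pipeline for a given field.
function buildDuplicatePipeline(field) {
  return [
    {
      $group: {
        _id: { value: "$" + field },     // group by the chosen field
        uniqueIds: { $addToSet: "$_id" }, // collect the ObjectIds of the group
        count: { $sum: 1 },               // count the documents in the group
      },
    },
    // keep only the groups that contain more than one document
    { $match: { count: { $gt: 1 } } },
  ];
}

const pipeline = buildDuplicatePipeline("issue_number");
console.log(JSON.stringify(pipeline, null, 2));

// In the mongo shell:
// db.issues.aggregate(buildDuplicatePipeline("issue_number"));
```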
I hope you found some value in this article. If you liked it, share it with other developers. See you in the next article 🙂