I am working on an online store in MongoDB and accidentally duplicated the brands listed in the "wholesalers" collection. I have been looking for a query to remove or prune these duplicate entries, but so far nothing seems to work.
db.wholesalers.find().pretty()

{
    "_id" : ObjectId("..."),
    "brands" : [
        "Seiko",
        "Breil",
        "Lorus",
        "Seiko",
        "Breil",
        "Lorus"
    ],
    "name" : "Seiko Nederlands B.V.",
    "address" : "Daniel Pichotstraat",
    "housenr" : "17-31",
    "postalcode" : "3115JB",
    "city" : "Schiedam",
    "phone" : "+31 (0)10 - 400 98 66",
    "email" : "info@seiko.nl",
    "web" : "http://www.seiko.nl/default",
    "kind" : "Horloges",
    "country" : "",
    "brandsNetherlands" : [ ]
}
This is an example of a single document in the database. As you can see, the brands listed in the "brands" array have been duplicated, and I need a way to get rid of them. What is the best way to do this?
For relatively small data, you can achieve the above using the aggregation framework to create a new array of distinct brand values with the $setUnion operator in a $project stage. If an array contains duplicate entries, $setUnion ignores the duplicate entries, as well as the order of the elements.
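To see the idea in isolation, here is a minimal sketch (the field name "distinctBrands" is just illustrative): unioning the array with the empty array [] yields its distinct elements:

db.wholesalers.aggregate([
    // $setUnion treats "$brands" as a set, so duplicates collapse
    { "$project": { "distinctBrands": { "$setUnion": [ "$brands", [] ] } } }
])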
Once you have the new distinctBrands field, you need an additional field for filtering the documents in the entire collection, i.e. one that checks whether a given array has unique elements, using the same $setUnion concept. You can then use the results from the aggregated documents to update the collection, iterating the results cursor with the forEach() method and updating each document as follows:
db.wholesalers.aggregate([
    { "$project": {
        "distinctBrands": { "$setUnion": [ "$brands", [] ] },
        "hasUniqueBrands": {
            "$eq": [
                { "$size": "$brands" },
                { "$size": { "$setUnion": [ "$brands", [] ] } }
            ]
        }
    } },
    { "$match": { "hasUniqueBrands": false } }
]).forEach(function(doc) {
    db.wholesalers.update(
        { "_id": doc._id },
        { "$set": { "brands": doc.distinctBrands } }
    )
})
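As a quick sanity check, one way to verify the result is to re-run the duplicate filter after the loop finishes; it should match no documents once every array holds only unique brands:

db.wholesalers.aggregate([
    { "$project": {
        "hasUniqueBrands": {
            "$eq": [
                { "$size": "$brands" },
                { "$size": { "$setUnion": [ "$brands", [] ] } }
            ]
        }
    } },
    { "$match": { "hasUniqueBrands": false } }
]).itcount() // expect 0 when no duplicates remain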
Whilst this is optimal for small collections, performance with large collections is reduced since looping through a large dataset and sending each update operation per request to the server incurs a computational penalty.
The Bulk() API comes to the rescue and greatly improves performance, since write operations are sent to the server only once, in bulk. Efficiency is achieved since the method does not send every write request to the server (as with the current update statement within the forEach() loop) but just once in every 1000 requests, making updates more efficient and quicker than they currently are.
Using the same concept as above with the forEach() loop to create the batches, we can update the collection in bulk as follows.
In this demonstration the Bulk() API, available in MongoDB versions >= 2.6 and < 3.2, uses the initializeUnorderedBulkOp() method to execute, in parallel and in a nondeterministic order, the write operations in the batches:
var bulk = db.wholesalers.initializeUnorderedBulkOp(),
    counter = 0; // counter to keep track of the batch update size

db.wholesalers.aggregate([
    { "$project": {
        "distinctBrands": { "$setUnion": [ "$brands", [] ] },
        "hasUniqueBrands": {
            "$eq": [
                { "$size": "$brands" },
                { "$size": { "$setUnion": [ "$brands", [] ] } }
            ]
        }
    } },
    { "$match": { "hasUniqueBrands": false } }
]).forEach(function(doc) {
    bulk.find({ "_id": doc._id }).updateOne({
        "$set": { "brands": doc.distinctBrands }
    });

    counter++; // increment counter
    if (counter % 1000 == 0) {
        bulk.execute(); // execute per 1000 operations
        // and re-initialize every 1000 update statements
        bulk = db.wholesalers.initializeUnorderedBulkOp();
    }
});

// flush any remaining operations that did not fill a complete batch
if (counter % 1000 != 0) { bulk.execute(); }
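If the order of the updates mattered, you could use the initializeOrderedBulkOp() method instead, which executes the operations serially and stops at the first error; for independent per-document updates like these, the unordered variant is generally the faster choice.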
The next example applies to the new MongoDB version 3.2, which has since deprecated the Bulk() API and provided a newer set of APIs using bulkWrite().
It uses the same cursor as above, but creates the array with the bulk operations using the same forEach() cursor method to push each bulk-write document to the array. Because write commands can accept no more than 1000 operations, there is a need to group the operations to have at most 1000 operations and re-initialise the array when the loop hits the 1000th iteration:
var bulkUpdateOps = [];

db.wholesalers.aggregate([
    { "$project": {
        "distinctBrands": { "$setUnion": [ "$brands", [] ] },
        "hasUniqueBrands": {
            "$eq": [
                { "$size": "$brands" },
                { "$size": { "$setUnion": [ "$brands", [] ] } }
            ]
        }
    } },
    { "$match": { "hasUniqueBrands": false } }
]).forEach(function(doc) {
    bulkUpdateOps.push({
        "updateOne": {
            "filter": { "_id": doc._id },
            "update": { "$set": { "brands": doc.distinctBrands } }
        }
    });

    if (bulkUpdateOps.length === 1000) {
        db.wholesalers.bulkWrite(bulkUpdateOps);
        bulkUpdateOps = [];
    }
});

if (bulkUpdateOps.length > 0) {
    db.wholesalers.bulkWrite(bulkUpdateOps);
}
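If you want to confirm what each batch did, bulkWrite() returns a result document in the shell; as a small sketch, you could capture it at each of the flush points above (the variable res is introduced here purely for illustration):

var res = db.wholesalers.bulkWrite(bulkUpdateOps);
printjson({ "matched": res.matchedCount, "modified": res.modifiedCount });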