MongoDB Bulk Save Equivalent? - javascript

I am a mongodb noob and am running into some difficulty trying to create an equivalent to bulk save (as I can't find a bulk save operation) using the MongoDB bulk operations. Briefly, given an array of documents:
[{ _id:1, name:"a" ... }, { _id:2, name:"b" ... } ... ]
I want to bulk upsert the documents in the array, using the _id attribute as the comparison field to determine which incoming records are equivalent to records already in mongodb. In pseudo-code I want mongodb to bulk upsert as follows:
if (incomingDocument._id == existingDocument._id) {
    update(incoming) // overwrite existing document with entire incoming document
} else {
    insert(incoming)
}
Ideally, I would like to pass mongo an array and a comparator rather than queuing up an individual bulk operation for each document.
How/can I do this with Bulk.find().upsert().update(<update>), or similar?
(Alternately, is there an undocumented bulk save() operation?)
Thank you!

Bulk.find.upsert
With the upsert option set to true, if no matching documents exist for the Bulk.find() condition, then the update or the replacement operation performs an insert. If a matching document does exist, then the update or replacement operation performs the specified update or replacement.
But you will need to loop over your collection:
var bulk = db.items.initializeUnorderedBulkOp();
myDocuments.forEach(function (doc) {
    // Match on _id: upsert() inserts when nothing matches,
    // replaceOne() overwrites the whole existing document otherwise.
    bulk.find({_id: doc._id}).upsert().replaceOne(doc);
});
bulk.execute({w: 1, j: true}, function (err, result) {
    if (result.isOk()) {
        // ...
    }
});
More or less; I am sorry I am not able to test it at the moment. I am also not able to say how it will behave on large numbers of documents.
UPDATE
I modified code, as suggested by Colin.
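For what it's worth, newer drivers also expose bulkWrite, which expresses the same pattern in a single call. A minimal sketch, assuming the Node.js driver and the same myDocuments array (collection name is illustrative):

// One replaceOne-with-upsert operation per incoming document.
var ops = myDocuments.map(function (doc) {
    return {
        replaceOne: {
            filter: { _id: doc._id },  // compare on _id
            replacement: doc,          // overwrite with the entire incoming document
            upsert: true               // insert when no document matches
        }
    };
});

db.collection('items').bulkWrite(ops, { ordered: false }, function (err, result) {
    if (err) throw err;
    console.log(result.upsertedCount, result.modifiedCount);
});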

Related

couchdb views: return all fields in doc as map

I have a doc in couchDB:
{
    "id": "avc",
    "type": "Property",
    "username": "user1",
    "password": "password1",
    "server": "localhost"
}
I want to write a view that returns a map of all these fields.
The map should look like this: [{"username","user1"},{"password","password1"},{"server","localhost"}]
Here's pseudocode of what I want -
HashMap<String,String> getProperties()
{
    HashMap<String,String> propMap;
    if (doc.type == 'Property')
    {
        // read all fields in doc one by one
        // get value and add field/value to the map
    }
    return propMap;
}
I am not sure how to do the portion that I have commented above. Please help.
Note: right now, I want to add username, password and server fields and their values in the map. However, I might keep adding more later on. I want to make sure what I do is extensible.
I considered writing a separate view function for each field. Ex: emit("username",doc.username).
But this may not be the best way to do it, and it would need updating every time I add a new field.
First of all, you have to know:
In CouchDB, you'll index documents inside a view with a key-value pair. So if you index the username and server properties, you'll have the following view:
[
    {"key": "user1", "value": null},
    {"key": "localhost", "value": null}
]
Whenever you edit a view, it invalidates the index so Couch has to rebuild the index. If you were to add new fields to that view, that's something you have to take into account.
If you want to query multiple fields in the same query, all those fields must be in the same view. If it's not a requirement, then you could easily build an index for every field you want.
If you want to index multiple fields in the same view, you could do something like this:
// We define a map function as a function which takes a single parameter: the document to index.
(doc) => {
    // We iterate over a list of fields to index
    ["username", "password", "server"].forEach((key) => {
        // If the document has the field to index, we index it.
        if (doc.hasOwnProperty(key)) {
            // emit(key, value) is the function you call to index your document.
            // You don't need to pass a value as you'll be able to get the matching
            // document by using include_docs=true.
            emit(doc[key], null);
        }
    });
};
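Once saved, you can query the view by key. A hypothetical request, assuming the view is saved as by_field in a design document named props on a local CouchDB (these names are illustrative):

// The key must be JSON-encoded in the query string, hence the embedded quotes.
// include_docs=true returns the full matching document with each row.
fetch('http://localhost:5984/mydb/_design/props/_view/by_field?key="user1"&include_docs=true')
    .then((res) => res.json())
    .then((body) => {
        body.rows.forEach((row) => console.log(row.doc)); // the full Property document
    });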
Also, note that Apache Lucene allows full-text search and might fit your needs better.

In sails/waterline get maximum value of a column in a database agnostic way

While using Sails as an ORM (version 1.0), I notice that there is a function called Model.avg (as well as sum). However, there is no maximum or minimum function to get the maximum or minimum from a column in a model; is this considered unnecessary because it is covered by other functions already?
Now in my database I need to get the "maximum id" in a list, and I have it working for PostgreSQL by using a native query:
const maxnum = await Order.getDatastore().sendNativeQuery('SELECT MAX(\"orderNr\") FROM \"order\"')
While this isn't the most difficult thing, it is not what I truly want: it is limited to SQL-based datastores (so we wouldn't be able to move easily to MongoDB), and the syntax might differ for another SQL database type.
So I wonder - can this be transformed in such a way it doesn't rely on sendNativeQuery?
You can try .query() to execute a raw SQL query using the specified model's datastore, and if you want you can try pg, an NPM package used for communicating with PostgreSQL databases:
Pet.query('SELECT pet.name FROM pet WHERE pet.name = $1', ['dog'],
    function (err, rawResult) {
        if (err) { return res.serverError(err); }
        sails.log(rawResult);
        // (The result format depends on the SQL query that was passed in,
        // and the adapter you're using.)
        // Then parse the raw result and do whatever you like with it.
        return res.ok();
    });
You can use the limit and sort options Waterline provides to get a single record with the maximal value, then just extract that value.
const orderModels = await Order.find({
    where: {},
    select: ['orderNr'],
    limit: 1,
    sort: 'orderNr DESC'
});

// .find() always resolves to an array, so guard against no results.
if (orderModels.length > 0) {
    console.log(orderModels[0].orderNr);
}
Like most things in Waterline, it's probably not as efficient as an SQL SELECT MAX query (or some equivalent in mongo, etc.), but it should allow swapping out the database with no maintenance. As a last note, don't forget to handle the case of no models found, as in the length check above.

Mongoose get sum of fields

I'm trying to track the bandwidth usage of a user based upon two mongoose schemas. I have a user and image schema, where a user has many images. My image schema looks like this:
image = {
    creator: 'ObjectId of user',
    size: '12345', // kb
    uploadedTo: [{}]
}
Essentially I want to create a query that will get all images that belong to a user via the image.creator property. I would then multiply the image.size property by image.uploadedTo.length value to get the total bandwidth used.
For example: If a user has 5 images, each image is 5,000kb and is uploaded to 3 services each, the total bandwidth for the user would be 75,000kb (5*5,000*3).
Is this query possible strictly through mongoose, or would I have to just get the user's images and then use regular javascript to get the total bandwidth?
You'll want to use the aggregation pipeline. The basic projection might look like this:
{
    $project: {
        size: 1,
        number_of_uploads: { $size: "$uploadedTo" },
        total_bandwidth: {
            // A field computed earlier in the same $project stage can't be
            // referenced, so the array length is computed again here. Note
            // that $multiply needs numeric operands, so size must be stored
            // as a number.
            $multiply: [ "$size", { $size: "$uploadedTo" } ]
        }
    }
}
You'd get a new document that looks like:
{
    size: 1234,
    number_of_uploads: 2,
    total_bandwidth: 2468
}
You'll need to integrate that with Mongoose's aggregate helper.
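A rough sketch of that integration, assuming an Image model, a numeric size field, and the user's ObjectId in userId (all illustrative names):

Image.aggregate([
    // Only this user's images.
    { $match: { creator: userId } },
    // Per-image bandwidth.
    { $project: {
        size: 1,
        number_of_uploads: { $size: "$uploadedTo" },
        total_bandwidth: { $multiply: [ "$size", { $size: "$uploadedTo" } ] }
    } },
    // Sum across all of the user's images.
    { $group: { _id: null, total_bandwidth: { $sum: "$total_bandwidth" } } }
], function (err, result) {
    if (err) throw err;
    console.log(result); // [{ _id: null, total_bandwidth: ... }]
});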
If you're using MongoDB 3.2, you can also use $lookup (which is basically a join operation) as part of your pipeline to look up the creator._id, and then run a $sum operation on all of the images (you'll probably $group by that creator ID). The benefit of this is that your server doesn't do any work; the lookups and operations happen inside MongoDB itself.
If you're not using v3.2, you can leverage Mongoose's population to look up (on your own server) the creator ID for you, and then use JavaScript on your own server to calculate the sum.
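If you take that route, the plain-JavaScript sum might look something like this (a sketch, again assuming an Image model, and converting size in case it is stored as a string):

Image.find({ creator: userId }, function (err, images) {
    if (err) throw err;
    // size * number of uploads, summed across all of the user's images.
    var totalBandwidth = images.reduce(function (sum, img) {
        return sum + Number(img.size) * img.uploadedTo.length;
    }, 0);
    console.log('Total bandwidth (kb):', totalBandwidth);
});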
It's a bit difficult for me to come up with what exactly your pipeline will look like since I don't have a sample dataset to play with, but the above tools should be all that you need.
Additional operation resources
$size
$multiply
(P.S. you're probably looking at this like "WTF?". Sometimes it's easier to just do the calculations yourself and "use regular javascript to get the total bandwidth", as you mentioned. Both solutions will work, it just depends on where you want to put the load - whether on the MongoDB server or on your server - and how many round-trips you want to make.)

Using Meteor publish-with-relations package where each join cannot use the _id field

I am working to solve a problem not dissimilar to the one discussed at the following blog post: publishing two related data sets in Meteor, with a 'reactive join' on the server side.
https://www.discovermeteor.com/blog/reactive-joins-in-meteor/
Unfortunately for me, however, the related collection I wish to join to, will not be joined using the "_id" field, but using another field. Normally in mongo and meteor I would create a 'filter' block where I could specify this query. However, as far as I can tell in the PWR package, there is an implicit assumption to join on '_id'.
If you review the example given on the 'publish-with-relations' github page (see below) you can see that both posts and comments are being joined to the Meteor.users '_id' field. But what if we needed to join to the Meteor.users 'address' field ?
https://github.com/svasva/meteor-publish-with-relations
In the short term I have specified my query 'upside down' (as luckily I am able to use the _id field when doing a reverse join), but I suspect this will result in an inefficient query as the datasets grow, so I would rather be able to do the join in the direction planned.
The two collections being joined can be thought of as a conversation topic/header collection and a conversation message collection (i.e. one entry in the collection for each message in the conversation).
The conversation topic in my solution is using the _id field to join, the conversation messages have a "conversationKey" field to join with.
The following call works, but this is querying from the messages to the conversation, instead of vice versa, which would be more natural.
Meteor.publishWithRelations({
    handle: this,
    collection: conversationMessages,
    filter: { "conversationKey": requestedKey },
    options: { sort: { msgTime: -1 } },
    mappings: [{
        //reverse: true,
        key: 'conversationKey',
        collection: conversationTopics,
        filter: { startTime: { $gt: (new Date().getTime() - aLongTimeAgo) } },
        options: {
            sort: { createdAt: -1 }
        },
    }]
});
Can you do a join without an _id?
No, not with PWR. Joining with a foreign key which is the id in another table/collection is nearly always how relational data is queried. PWR is making that assumption to reduce the complexity of an already tricky implementation.
How can this publish be improved?
You don't actually need a reactive join here because one query does not depend on the result of another. It would if each conversation topic held an array of conversation message ids. Because both collections can be queried independently, you can return an array of cursors instead:
Meteor.publish('conversations', function (requestedKey) {
    check(requestedKey, String);
    var aLongTimeAgo = 864000000;
    return [
        conversationMessages.find({ conversationKey: requestedKey }),
        // find() takes (selector, options) and has no `filter` option,
        // so the time constraint goes into the selector itself.
        conversationTopics.find({
            _id: requestedKey,
            startTime: { $gt: new Date().getTime() - aLongTimeAgo }
        })
    ];
});
Notes
Sorting in your publish function isn't useful unless you are using a limit.
Be sure to use a forked version of PWR like this one which includes Tom's memory leak fix.
Instead of conversationKey I would call it conversationTopicId to be more clear.
I think this could now be solved much more easily with the reactive-publish package (I am one of the authors). You can now make any query inside an autorun, and then use the results of that to publish the query you want to push to the client. I would write you some example code, but I do not really understand what exactly you need. For example, you mention you would like to limit topics, but you do not explain why they would be limited if you are providing requestedKey, which is the ID of a document anyway, so only one result is available?

Mongo check if a document already exists

In the MEAN app I'm currently building, the client-side makes a $http POST request to my API with a JSON array of soundcloud track data specific to that user. What I now want to achieve is for those tracks to be saved to my app database under a 'tracks' table. That way I'm then able to load tracks for that user from the database and also have the ability to create unique client URLs (/tracks/:track)
Some example data:
{
    artist: "Nicole Moudaber",
    artwork: "https://i1.sndcdn.com/artworks-000087731284-gevxfm-large.jpg?e76cf77",
    source: "soundcloud",
    stream: "https://api.soundcloud.com/tracks/162626499/stream.mp3?client_id=7d7e31b7e9ae5dc73586fcd143574550",
    title: "In The MOOD - Episode 14"
}
This data is then passed to the API like so:
app.post('/tracks/add/new', function (req, res) {
var newTrack;
for (var i = 0; i < req.body.length; i++) {
newTrack = new tracksTable({
for_user: req.user._id,
title: req.body[i].title,
artist: req.body[i].artist,
artwork: req.body[i].artwork,
source: req.body[i].source,
stream: req.body[i].stream
});
tracksTable.find({'for_user': req.user._id, stream: req.body[i].stream}, function (err, trackTableData) {
if (err)
console.log('MongoDB Error: ' + err);
// stuck here - read below
});
}
});
The point at which I'm stuck, as marked above, is this: I need to check if that track already exists in the database for that user, and if it doesn't, save it. Then, once the loop has finished and all tracks have either been saved or ignored, a 200 response needs to be sent back to my client.
I've tried several methods so far and nothing seems to work, I've really hit a wall and so help/advice on this would be greatly appreciated.
Create a compound index and make it unique.
Using the index mentioned above will ensure that there are no documents which have the same for_user and stream.
trackSchema.index({ for_user: 1, stream: 1 }, { unique: true });
Now use MongoDB's batch insert to insert multiple documents:
// docs is the array of tracks you are going to insert.
// ordered: false keeps inserting past duplicate-key errors, so existing
// tracks are skipped instead of aborting the whole batch.
trackTable.collection.insert(docs, { ordered: false }, function (err, savedDocs) {
    // savedDocs is the array of docs saved.
    // By checking savedDocs you can see how many tracks were actually inserted.
});
Make sure to validate your objects, as by using .collection we are bypassing Mongoose.
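One way to do that validation manually, sketched under the assumption that tracksTable is the Mongoose model from the question:

// Build plain objects first, then run Mongoose validation on each one
// before handing the batch to .collection.insert().
var docs = req.body.map(function (t) {
    return {
        for_user: req.user._id,
        title: t.title,
        artist: t.artist,
        artwork: t.artwork,
        source: t.source,
        stream: t.stream
    };
});

docs.forEach(function (d) {
    new tracksTable(d).validate(function (err) {
        if (err) { /* drop or fix the invalid track */ }
    });
});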
Make a unique _id based on user and track. In mongo you can pass in the _id that you want to use.
Example:
{
    _id: "NicoleMoudaber InTheMOODEpisode14",
    artist: "Nicole Moudaber",
    artwork: "https://i1.sndcdn.com/artworks-000087731284-gevxfm-large.jpg?e76cf77",
    source: "soundcloud",
    stream: "https://api.soundcloud.com/tracks/162626499/stream.mp3?client_id=7d7e31b7e9ae5dc73586fcd143574550",
    title: "In The MOOD - Episode 14"
}
_id must be unique and won't let you insert another document with the same _id. You could also use this to find the record later: db.collection.find({_id: "NicoleMoudaber InTheMOODEpisode14"}).
Or you could find all tracks for that artist with db.collection.find({_id: /^NicoleMoudaber/}), and it will still use the index.
There is another method for this that I can explain if you don't like this one.
Both options will work in a sharded environment as well as a single replica set; "unique" indexes do not work in a sharded environment.
The Soundcloud API provides a track id; just use it.
Then, before inserting data, you do a:
tracks.find({ id_soundcloud: 25645456 }).exec(function (err, track) {
    if (track.length) {
        console.log("do nothing");
    } else {
        // insert
    }
});
