SQS with MongoDB for handling duplicates

I have a couple of ideas for stopping duplicate handling of messages from Amazon's SQS queues. The app will also have a MongoDB server, which I think can be an effective part of either strategy:
1. Store queue items in Mongo, with a 'status' field - default to Pending. Then use SQS to queue the ID of the new message. One of the worker processes will get the ID, then do a findAndModify on the actual item in Mongo to set the status to Processing, unless it's already being processed, when it will flag that up.
2. Store queue items in the queue. Workers pick up items from the queue, then attempt to do an insert into Mongo with the item ID and some other info. If the item already existed, don't do the insert or continue, since it's a dupe.
The problems and questions I have:
Solution 1 seems counter-intuitive: why use SQS at all? I think it's because polling SQS is more correct than a whole load of worker processes polling Mongo for work.
Solution 2 I don't know how to implement. Is there an atomic find-and-insert-if-doesn't-exist? A simple get-or-insert-but-tell-me-which-occurred operation would do the trick.
Will any of these work in a large scale scenario, and/or is there a proven method that I haven't grasped?
...Hmm, I just wrote the question above, then had a thought for a get-or-insert-but-tell-me-which-occurred operation (in JS pseudocode):
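var thingy = getrandomnumber();
findAndModify({
  new: false,                                    // return the document as it was before the update
  upsert: true,                                  // insert if nothing matches
  query: { id: item_id },                        // plain equality match; $eq cannot wrap a whole document
  update: { $setOnInsert: { thingy: thingy } },  // only written on insert, so an existing document is left untouched
  fields: { thingy: 1 }
});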
If the item exists (and this is a conflict), then since new is false, the old document will be returned.
If the item didn't exist, new is false, so an empty document {} would be returned.
So either we got {}, indicating it resulted in an insert, or an actual document, indicating it was a get, and that ID already exists... all atomic. The thingy is in there because I don't know if MongoDB actually needs data there, I guess it would? If I used $inc on a duplicates field instead, would that work with an upsert? Then we could get stats on dupes later.
Is that right, maybe that would work?
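A minimal sketch of the get-or-insert-but-tell-me-which-occurred idea, assuming `collection` is a MongoDB Node.js driver collection with a unique index on `id`. The helper name `claimMessage` and the `firstSeenAt` / `deliveries` fields are made up for illustration. It uses `updateOne` with `upsert: true` instead of `findAndModify`, because the result's `upsertedCount` says directly whether an insert happened. As far as I know, current MongoDB versions return null rather than {} from findAndModify when `new` is false and the upsert inserts, so checking the count avoids relying on that. `$inc` does work together with an upsert, so duplicate statistics come for free.

```javascript
// Sketch: atomically record a message ID and report whether it was new.
// Assumes a unique index on `id`; `claimMessage` is a hypothetical helper.
const DUPLICATE_KEY_ERROR = 11000; // MongoDB duplicate key error code

async function claimMessage(collection, itemId) {
  try {
    const result = await collection.updateOne(
      { id: itemId },
      {
        $setOnInsert: { firstSeenAt: new Date() }, // written only when inserting
        $inc: { deliveries: 1 },                   // counts every delivery, dupes included
      },
      { upsert: true }
    );
    // upsertedCount is 1 when a new document was inserted, 0 when one matched.
    return result.upsertedCount === 1;
  } catch (err) {
    // Two workers racing on the same new ID: the loser hits the unique index.
    if (err.code === DUPLICATE_KEY_ERROR) return false;
    throw err;
  }
}
```

A worker would call `claimMessage(collection, message.id)` after receiving a message and only process it when the result is true.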

Related

Would giving a response to the client while letting an asynchronous operation continue to run be a good idea?

So I need to implement an "expensive" API endpoint. Basically, the user/client would need to be able to create a "group" of existing users.
This "create group" API would need to check that each user fulfills the criteria, i.e. all users in the same group would need to be from the same region, of the same gender, within the same age group, etc. This operation can be quite expensive, especially since there is no limit on how many users can be in one group, so it's possible that the client requests a group of 1000 users, for example.
My idea is that the endpoint will just create an entry in the database and mark the "group" as pending while the checking process is still running. After the check completes, it will update the group status to "completed" or "error" (with an error message). The client would need to periodically fetch the status while it is still pending.
My implementation idea is something along these lines:
const createGroup = async (req, res) => {
  const { ownerUserId, userIds } = req.body;
  // This will create database entry of group with "pending" status and return the primary key
  const groupId = await insertGroup(ownerUserId, 'pending');
  // This is an expensive function which will do checking over the network, and would take 0.5s per user id for example
  // I would like this to keep running after this API endpoint send the response to client
  checkUser(userIds)
    .then((isUserIdsValid) => {
      if (isUserIdsValid) {
        updateGroup(groupId, 'success');
      } else {
        updateGroup(groupId, 'error');
      }
    })
    .catch((err) => {
      console.error(err);
      updateGroup(groupId, 'error');
    });
  // The client will receive a groupId to check periodically whether its ready via separate API
  res.status(200).json({ groupId });
};
My question is: is it a good idea to do this? Am I missing something important that I should consider?

Yes, this is the standard approach to long-running operations. Instead of offering a createGroup API that creates and returns a group, think of it as having an addGroupCreationJob API that creates and returns a job.
Instead of polling (periodically fetching the status to check whether it's still pending), you can use a notification API (events via websocket, SSE, webhooks etc) and even subscribe to the progress of processing. But sure, a check-status API (via GET request on the job identifier) is the lowest common denominator that all kinds of clients will be able to use.
Did I not consider something important?
Failure handling is getting much more complicated. Since you no longer create the group in a single transaction, you might find your application left in some intermediate state, e.g. when the service crashed (due to unrelated things) during the checkUser() call. You'll need something to ensure that there are no pending groups in your database for which no actual creation process is running. You'll need to give users the ability to retry a job - will insertGroup work if there already is a group with the same identifier in the error state? If you separate the group and the jobs into independent entities, do you need to ensure that no two pending jobs are trying to create the same group? Last but not least you might want to allow users to cancel a currently running job.
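To make the "no pending groups without a running process" point concrete, here is a rough sketch of one way to detect abandoned jobs. It assumes each job record carries a `heartbeatAt` timestamp that the worker refreshes while checkUser() runs. That field, the timeout value and both function names are hypothetical, not part of the code in the question.

```javascript
// Sketch: find pending jobs whose worker has apparently died.
// `heartbeatAt` (ms since epoch) is an assumed field the worker keeps updating.
const STALE_AFTER_MS = 60 * 1000;

function findStaleJobs(jobs, now = Date.now(), staleAfterMs = STALE_AFTER_MS) {
  return jobs.filter(
    (job) => job.status === 'pending' && now - job.heartbeatAt > staleAfterMs
  );
}

// Flip abandoned jobs to "error" so clients polling the status stop waiting.
function markStaleJobsFailed(jobs, now = Date.now()) {
  for (const job of findStaleJobs(jobs, now)) {
    job.status = 'error';
    job.error = 'worker stopped before the check finished';
  }
  return jobs;
}
```

In a real service the same filter would run as a database query on a timer or at startup; the in-memory array here only keeps the sketch self-contained.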

MongoDB updateOne with upsert:true keeps inserting new document

I have a lastTracks collection which I intend to keep updated with new track data.
I try to do that using this piece of code with the MongoDB driver for Node.js:
db.collection('lastTracks').updateOne(
  {
    'track.deviceID': lastPacket.deviceID,
    timestampZero: {$lte: lastPacket.timestamp}
  },
  {
    $set: {
      timestampZero: lastPacket.timestamp,
    },
    $setOnInsert: {'track.deviceID': lastPacket.deviceID}
  },
  { upsert: true}
);
What I want to do is update timestampZero if there is a document for deviceID in the collection and its timestampZero is lower than or equal to the packet's timestamp; and if there is no such document, insert it.
The problem is that whenever this piece of code runs, a new document gets inserted and no update happens on the existing document with the given deviceID. And if I create a unique index on 'track.deviceID', I get a duplicate key error instead.
UPDATE:
I found that a new document gets inserted every time because there is a document with the same deviceID but a greater timestamp in the collection, so my update condition evaluates to false and Mongo then tries to insert a new document (as I asked for with upsert:true).
But what I want is that if the stored timestamp is greater than lastPacket.timestamp, neither an update nor an insert happens.
After a little searching I found that my solution could be simply creating a unique index on 'track.deviceID' and then ignoring the duplicate key error: https://github.com/winstonjs/winston#logging-levels
or I can do a two-phase commit: https://docs.mongodb.com/manual/tutorial/perform-two-phase-commits/
The first solution is simple and works well but just doesn't feel right, and the second solution is a little complicated both to understand and to implement, so my question is: is there any better way to do this?
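For reference, the first option (unique index plus ignoring the duplicate key error) could look roughly like the sketch below. It assumes a unique index on 'track.deviceID' already exists and that `collection` is a MongoDB Node.js driver collection. The function name `updateIfNewer` and its return strings are invented for the example. The $setOnInsert is left out because, on an upsert, the equality condition in the filter already supplies 'track.deviceID' for the new document.

```javascript
// Sketch: upsert the newest timestamp, treating "stored timestamp is newer" as a no-op.
// Assumes a unique index on 'track.deviceID'; `updateIfNewer` is a hypothetical name.
const DUPLICATE_KEY = 11000; // MongoDB duplicate key error code

async function updateIfNewer(collection, lastPacket) {
  try {
    const result = await collection.updateOne(
      {
        'track.deviceID': lastPacket.deviceID,
        timestampZero: { $lte: lastPacket.timestamp },
      },
      { $set: { timestampZero: lastPacket.timestamp } },
      { upsert: true }
    );
    return result.upsertedCount === 1 ? 'inserted' : 'updated';
  } catch (err) {
    // The filter matched nothing because the stored timestamp is greater,
    // so the upsert tried to insert a second document for the same device.
    if (err.code === DUPLICATE_KEY) return 'skipped';
    throw err;
  }
}
```

Only the duplicate key error is swallowed; any other error is rethrown so real failures still surface.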

nedb update and delete methods create a new entry instead of updating the existing one

I'm using nedb and I'm trying to update an existing record by matching its ID, and changing a title property.
What happens is that a new record gets created, and the old one is still there.
I've tried several combinations, and tried googling for it, but the search results are scarce.
var Datastore = require('nedb');
var db = {
  files: new Datastore({ filename: './db/files.db', autoload: true })
};
db.files.update(
  {_id: id},
  {$set: {title: title}},
  {},
  callback
);
What's even crazier: when performing a delete, a new record gets added again, but this time the record has a weird property:
{"$$deleted":true,"_id":"WFZaMYRx51UzxBs7"}
This is the code that I'm using:
db.files.remove({_id: id}, callback);

The nedb docs say the following:
localStorage has size constraints, so it's probably a good idea to set
recurring compaction every 2-5 minutes to save on space if your client
app needs a lot of updates and deletes. See database compaction for
more details on the append-only format used by NeDB.
 
Compacting the database
Under the hood, NeDB's persistence uses an append-only format, meaning
that all updates and deletes actually result in lines added at the end
of the datafile. The reason for this is that disk space is very cheap
and appends are much faster than rewrites since they don't do a seek.
The database is automatically compacted (i.e. put back in the
one-line-per-document format) everytime your application restarts.
You can manually call the compaction function with
yourDatabase.persistence.compactDatafile which takes no argument. It
queues a compaction of the datafile in the executor, to be executed
sequentially after all pending operations.
You can also set automatic compaction at regular intervals with
yourDatabase.persistence.setAutocompactionInterval(interval), interval
in milliseconds (a minimum of 5s is enforced), and stop automatic
compaction with yourDatabase.persistence.stopAutocompaction().
Keep in mind that compaction takes a bit of time (not too much: 130ms
for 50k records on my slow machine) and no other operation can happen
when it does, so most projects actually don't need to use it.
I haven't used nedb myself, but it seems that it uses localStorage and an append-only format for the update and delete methods.
Looking through its source code, the tests in persistence.tests check for the $$deleted key, and they also mention: `If a doc contains $$deleted: true, that means we need to remove it from the data`.
So in my opinion you can try compacting the database manually; the second approach in your question may also be useful.
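A small sketch of what that could look like, using only the two persistence methods named in the docs quoted above. The helper name `enableCompaction` and the five-minute interval are my own choices. The extra lines in the data file are the append-only log rather than real duplicate records, so compaction should collapse them back to one line per document.

```javascript
// Sketch: compact an NeDB datastore now and then at a regular interval.
// `db` is assumed to be an NeDB Datastore such as db.files from the question.
function enableCompaction(db, intervalMs) {
  // Queue a one-off compaction after all pending operations.
  db.persistence.compactDatafile();
  // Repeat automatically; the docs say a minimum of 5s is enforced.
  db.persistence.setAutocompactionInterval(intervalMs);
}
```

With the datastore from the question this would be called as `enableCompaction(db.files, 5 * 60 * 1000)`.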

How to synchronise multiple RESTFul requests when using NodeJS and saving to MongoDB?

I have been trying to implement a RESTFul API with NodeJS and I use Mongoose (MongoDB) as the database backend.
The following example code registers multiple users with the same username when requests are sent at the same time, which is not what I desire. Although I tried to add a check!
I know this happens because of the asynchronous nature of NodeJS, but I could not find a method to do this properly. It looks like "findOne" method immediately returns, causing registerUser to return and then another request is processed.
By the way, I don't want to check for existing users with a separate API function, I need to check at the registration stage. Is there any way to do this?
Controller.prototype.registerUser = function (req, res) {
  Users.findOne({'user_name': req.body.user_name}, function(err, user) {
    if(!user) {
      new User({user_name: req.body.user_name}).save(function(err) {
        if(!err) {
          res.send("User saved");
        } else {
          res.send("DB Error: Could not save user!");
        }
      });
    } else {
      res.send("User exists");
    }
  });
}

You should consider setting the user_name to be unique in the Schema. That would ensure that the user_name stays unique even if simultaneous requests are made to set an identical user name.
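A rough sketch of how that could be wired up, assuming the schema declares user_name with Mongoose's `unique: true` option (which builds a unique index) and that the model is passed in. The function name `registerUser` and the returned strings simply mirror the question's code. The database then rejects the second of two simultaneous inserts with duplicate key error code 11000, so no separate findOne check is needed.

```javascript
// Schema side (assumed, shown as a comment so the sketch runs without a database):
//   const userSchema = new mongoose.Schema({
//     user_name: { type: String, unique: true },
//   });
const DUPLICATE_KEY_CODE = 11000; // MongoDB duplicate key error code

// `User` is assumed to be the Mongoose model built from that schema.
async function registerUser(User, userName) {
  try {
    await User.create({ user_name: userName });
    return 'User saved';
  } catch (err) {
    // The unique index rejected a second document with the same user_name.
    if (err.code === DUPLICATE_KEY_CODE) return 'User exists';
    throw err;
  }
}
```

The controller would send the returned string with res.send and report any rethrown error as a database failure.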

Yes, the reason this is happening is as you suspected: multiple requests can execute the code simultaneously, and therefore Users.findOne can find no user multiple times. Incidentally, this can happen with other stacks as well, even ones that use one thread per request.
To solve this, you need a way to ensure that only one user is being worked on at a time. You can accomplish this by adding all registerUser requests to a queue, pulling them off the queue one by one, and calling res.send only after a request has been processed from the queue.
Alternatively, maybe you can keep a local array of user names and, each time a new request comes in, check whether the name is already in the array. If it isn't, add it to the array and work on it. If it is in the array, send the response "User exists". Then, once the user has been successfully created, you can remove it from that array. (I haven't thought this one through 100% but I think it should work as well.)
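The local-array idea might be sketched as below, using a Set instead of an array. The names `inFlight`, `registerOnce` and `createUser` are hypothetical, and `createUser` stands for the existing findOne-then-save logic. Note the assumption that there is a single Node.js process: with several processes or servers, each would have its own Set, so this guard alone would not prevent duplicates.

```javascript
// Sketch: reject a registration while another one for the same name is in progress.
// Only guards within one Node.js process.
const inFlight = new Set();

async function registerOnce(userName, createUser) {
  if (inFlight.has(userName)) return 'User exists';
  inFlight.add(userName);
  try {
    // createUser is assumed to do the usual lookup and save.
    await createUser(userName);
    return 'User saved';
  } finally {
    // Always release the name, even if saving failed.
    inFlight.delete(userName);
  }
}
```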

meteor Autosubscribe maintain certain documents

I'm using autosubscribe to get a list of 50 latest chat documents in minimongo. As more messages are posted the older messages are removed from minimongo by autosubscribe. How can I get autosubscribe to not remove certain messages that I mark as active?
I know that I can just manually separately subscribe to a list of "active" messages but that seems unnecessarily laborious. Thanks.
Edit: the active marking is client side only; each user gets to choose the messages that he cares about, and it's something ephemeral. The user marks a message as the one he's replying to, so it shouldn't be suddenly removed.

You need to sort on the time (_id captures the order it was inserted hence time) as well as with status, both in descending order.
Server code:
Meteor.publish("messages", function () {
return Messages.find({}, {sort: {active: -1, _id:-1}, limit: 50});
});

In the publish function, sort on status.
Meteor.publish("messages", function () {
return Messages.find({}, {sort: {status: 1}, limit: 50});
});

Unless your implementation is limited to a single user being able to mark a line active, then the marking of the chat-line document needs to use the active users id.
This sadly leads to the need for separate subscription even if it 'seems unnecessarily laborious'
Another 'laborious way' would be to make a local client-only collection copy of the selected active messages.

Per client, maintain a session variable containing an array of marked doc IDs: Session.set('markedMessages', matchedDocs)
Session exists only on the client, so pass that array as an argument when subscribing, e.g. Meteor.subscribe('markedMessages', Session.get('markedMessages')). Within your publish function, use an $in clause that matches the doc IDs in that array, and combine it with an $or clause to keep your existing query and limit/slice.
Meteor.publish("markedMessages", function (markedIds) {
  // A publish function returns a cursor, not the fetched documents.
  return Messages.find({$or: [{ your_existing_query_goes_here },
    {_id: { $in: markedIds }} ] });
});
Note, within your handlebars template, compare the message id against your markedMessages Session to identify if the message was marked by the user.
