I have a class in JavaScript with the following structure:
class TableManager {
    /** an array containing Table objects **/
    protected Tables = [];

    protected getTable(tableId) {
        // iterates over this.Tables, and searches for a table with a specific id:
        // if found, it returns the table object, otherwise it returns null
    }

    protected async createTable(tableId) {
        /** performs an asynchronous operation, that creates a Table object by performing a select operation on the database **/
        const Table = await fetchTable(tableId);
        this.Tables.push(Table);
        return Table;
    }

    protected async joinTable(user, tableId) {
        const Table = this.getTable(tableId) ?? await this.createTable(tableId);
        Table.addUser(user);
    }
}
The idea behind this class is that it will receive commands via a socket. For example, it may receive the joinTable command, in which case it should first check whether the table being joined already exists in memory: if it does, it adds the user to that table; otherwise, it creates the table, stores it in memory, and adds the user to it.
I am a bit concerned that this could result in a race condition if two joinTable() calls are made in a short amount of time, in which case the table would be created twice and stored in memory as two separate table instances. Am I right to be worried about this? If yes, would checking whether the table exists before adding it to the array in the createTable function solve this race condition?
Your concern is valid. The idea is to treat the check-and-create step as a single transaction and make sure that only one such transaction runs at a given time. In Node.js, you can use a mutex to implement that. Read more: https://www.nodejsdesignpatterns.com/blog/node-js-race-conditions/.
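For illustration, here is a minimal sketch of that idea using the third-party async-mutex package (the package choice and the table.id lookup are assumptions, not part of your code):

const { Mutex } = require('async-mutex');

class TableManager {
    Tables = [];
    tableLock = new Mutex();

    getTable(tableId) {
        // assumes Table objects expose an id property
        return this.Tables.find((table) => table.id === tableId) ?? null;
    }

    async createTable(tableId) {
        const Table = await fetchTable(tableId); // as in your original code
        this.Tables.push(Table);
        return Table;
    }

    async joinTable(user, tableId) {
        // runExclusive serialises the check-then-create section, so two overlapping
        // joinTable calls can never both miss the lookup and create the table twice.
        const Table = await this.tableLock.runExclusive(async () => {
            return this.getTable(tableId) ?? await this.createTable(tableId);
        });
        Table.addUser(user);
    }
}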
I am a bit concerned, that this could result in a race condition, if two joinTable() calls are made in a short amount of time, in which case the tables will be created twice, and stored in memory as two separate table instances. Am I right to be afraid about this?
This shouldn't be a problem as long as you await each call (or chain it properly). That is to say, it won't be a problem as long as the operations are sequential. If you allow the promises to resolve at the same time (like with Promise.all) then yes, as it is right now, there would be a race condition.
If yes, would checking if the table exists before adding it to the array in the createTable function, solve this race condition?
As I understand it, no, it would still leave a race condition. The first call does the check, sees that the table does not exist, and proceeds to send the query to your server to create the new entry. The second call also does the check, but since it is not waiting for the previous request, the check can happen before the first request finishes (this is your race condition). That means another request can be sent to create another table.
What you can do is store your entry as a promise. I would use a Map for this:
protected Tables = new Map();

protected getTable(tableId) {
    let Table = this.Tables.get(tableId);
    if (!Table) {
        Table = fetchTable(tableId);
        this.Tables.set(tableId, Table);
    }
    return Table;
}
This way, joinTable can simply call getTable, which also creates the Table if it doesn't exist. If the Table is still being created, the caller picks up the pending promise, so no duplicates are ever made.
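joinTable would then just await whatever getTable hands back — a minimal sketch, assuming the rest of your class stays as it is:

protected async joinTable(user, tableId) {
    // getTable now returns the stored promise, so concurrent callers share the
    // same in-flight fetchTable call instead of starting a second one.
    const Table = await this.getTable(tableId);
    Table.addUser(user);
}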
Ultimately, the creation (or not) of any entity on a server needs to be managed there, on the server. Otherwise you risk multiple clients (or even a client restart) creating these duplicates.
We are troubled by occasionally occurring "cursor not found" exceptions for some Morphia queries' asList calls, and I've found a hint on SO that this might be quite memory-consumptive.
Now I'd like to know a bit more about the background: can somebody explain (in English) what a cursor (in MongoDB) actually is? Why can it be kept open or not be found?
The documentation defines a cursor as:
A pointer to the result set of a query. Clients can iterate through a cursor to retrieve results. By default, cursors timeout after 10 minutes of inactivity
But this is not very telling. Maybe it would be helpful to also define what a batch of query results is, because the documentation also states:
The MongoDB server returns the query results in batches. Batch size will not exceed the maximum BSON document size. For most queries, the first batch returns 101 documents or just enough documents to exceed 1 megabyte. Subsequent batch size is 4 megabytes. [...] For queries that include a sort operation without an index, the server must load all the documents in memory to perform the sort before returning any results.
Note: the queries in question don't use sort statements at all, nor limit or offset.
Here's a comparison between toArray() and cursors after a find() in the Node.js MongoDB driver. Common code:
var MongoClient = require('mongodb').MongoClient,
    assert = require('assert');

MongoClient.connect('mongodb://localhost:27017/crunchbase', function (err, db) {
    assert.equal(err, null);
    console.log('Successfully connected to MongoDB.');

    const query = { category_code: "biotech" };

    // toArray() vs. cursor code goes here
});
Here's the toArray() code that goes in the section above.
db.collection('companies').find(query).toArray(function (err, docs) {
    assert.equal(err, null);
    assert.notEqual(docs.length, 0);

    docs.forEach(doc => {
        console.log(`${doc.name} is a ${doc.category_code} company.`);
    });

    db.close();
});
Per the documentation:
The caller is responsible for making sure that there is enough memory to store the results.
Here's the cursor-based approach, using the cursor.forEach() method:
const cursor = db.collection('companies').find(query);

cursor.forEach(
    function (doc) {
        console.log(`${doc.name} is a ${doc.category_code} company.`);
    },
    function (err) {
        assert.equal(err, null);
        return db.close();
    }
);
With the forEach() approach, instead of fetching all the data into memory, we stream it to our application. find() returns a cursor immediately, because it doesn't actually make a request to the database until we try to use some of the documents it will provide; the point of the cursor is to describe our query. The second parameter to cursor.forEach() shows what to do when an error occurs.
In the toArray() version of the code above, it was the toArray() call that forced the database request: it meant we needed ALL the documents and wanted them in an array.
Note that MongoDB returns data in batches: as the application consumes documents from the cursor, the driver issues further requests to MongoDB for the next batch.
forEach() scales better than toArray() because we can process documents as they come in, until we reach the end. Contrast that with toArray(), where we wait for ALL the documents to be retrieved and the entire array to be built, which means we get no advantage from the fact that the driver and the database work together to batch results to the application. Batching is meant to provide efficiency in terms of memory overhead and execution time, so take advantage of it in your application if you can.
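As a side note, newer versions of the Node.js driver expose the cursor as an async iterable, so the same streaming behaviour can be written with for await...of. A sketch, assuming a driver version that supports it and an enclosing async function:

// Streams documents batch by batch, just like cursor.forEach() above.
const cursor = db.collection('companies').find(query);
for await (const doc of cursor) {
    console.log(`${doc.name} is a ${doc.category_code} company.`);
}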
I am by no means a MongoDB expert, but I just want to add some observations from working on a medium-sized Mongo system for the last year. Also thanks to @xameeramir for the excellent walkthrough of how cursors work in general.
The causes of a "cursor not found" exception can be several. One that I have noticed is explained in this answer.
The cursor lives server-side. It is not distributed over a replica set but exists on the instance that is primary at the time of creation. This means that if another instance takes over as primary, the cursor will be lost to the client. If the old primary is still up it may still be there, but of no use; I guess it is garbage-collected after a while. So if your Mongo replica set is unstable, or you have a shaky network in front of it, you are out of luck when doing any long-running queries.
If the full content of what the cursor is meant to return does not fit in memory on the server, the query may be very slow. The RAM on your servers needs to be larger than the largest query you run.
All this can partly be avoided by better design. For a use case with large, long-running queries you may be better off with several smaller collections instead of one big one.
The collection's find() method returns a cursor: this points to the set of documents (called the result set) matched by the query filter. The result set is the actual documents returned by the query, but it lives on the database server.
To the client program, for example the mongo shell, you get a cursor. You can think of the cursor as an API or a program for working with the result set. The cursor has many methods which can be run to perform actions on the result set. Some of the methods affect the result set's data and some provide status or info about it.
As the cursor maintains information about the result set, some of that information can change as you consume the result set's data by applying other cursor methods. You use these methods and this information to suit your application, i.e., how and what you want to do with the queried data.
Here is how you work with the result set using the cursor, with some of its commonly used methods and features, from the mongo shell:
The count() method returns the number of documents in the result set, as initially matched by the query. It is constant at any point in the life of the cursor, and remains the same even after the cursor is closed or exhausted.
As you read documents from the result set, it gets exhausted; once completely exhausted, you cannot read any more. hasNext() tells you whether there are any documents left to read, returning a boolean true or false. next() returns a document if one is available (you first check with hasNext(), then call next()). These two methods are commonly used to iterate over the result set. Another iteration method is forEach().
The data is retrieved from the server in batches, which have a default size. You read the documents of the first batch, and once all of its documents are read, the following next() call retrieves the next batch, and so on, until all documents are read from the result set. The batch size can be configured, and you can also get its status.
If you apply the toArray() method to the cursor, all the remaining documents in the result set are loaded into your client's memory and are available as a JavaScript array, and the result set is exhausted: a subsequent hasNext() will return false, and next() will throw an error (once you exhaust the cursor, you cannot read from it). Loading the entire result set into the client's memory this way can be memory-intensive for large result sets.
The itcount() returns the count of remaining documents in the result set and exhausts the cursor.
There are cursor methods like isClosed(), isExhausted(), size() which give status information about the cursor and its underlying result set as you work with your data.
Those are the basic features of cursor and result set. There are many cursor methods, and you can try and see how they work and get a better understanding.
Reference:
mongo shell's cursor methods
Cursor behavior with the aggregate method (the collection's aggregate method also returns a cursor)
Example usage in mongo shell:
Assume the test collection has 200 documents (run the commands in the same sequence).
var cur = db.test.find( { } ).limit(25) creates a result set with 25 documents only.
But cur.count() will show 200, which is the actual count of documents matched by the query's filter.
hasNext() will return true.
next() will return a document.
itcount() will return 24 (and exhausts the cursor).
itcount() again will return 0.
cur.count() will still show 200.
This error also occurs when you have a large data set and are doing batch processing on it, where each batch takes long enough that the total time exceeds the default cursor lifetime.
In that case you need to change that default timeout, to tell MongoDB not to expire the cursor until the processing is done.
Do check the noCursorTimeout documentation.
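For illustration, a minimal sketch in the mongo shell (the collection name is a placeholder; the same option is exposed by the drivers):

// Keep the cursor alive for the duration of a long batch job, then close it
// explicitly, since the server will no longer reap it on its own.
var cur = db.myCollection.find({}).noCursorTimeout();
cur.forEach(function (doc) {
    // ... long-running processing per document ...
});
cur.close();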
A cursor is an object returned by calling db.collection.find() that enables iterating through the documents (the NoSQL equivalent of SQL "rows") of a MongoDB collection (the NoSQL equivalent of a "table").
In case your cluster is stable and no members were down or changing state, the most probable reason for not finding the cursor is this:
The default idle cursor timeout is 10 minutes, but from version 3.6 onwards the cursor is also associated with a session, which has a default session timeout of 30 minutes. So even if you set the cursor not to expire with the noCursorTimeout() option, you are still limited by the 30-minute session timeout. To avoid your cursor being killed by the session timeout, you need to periodically execute the refreshSessions command from your code:
db.adminCommand({"refreshSessions" : [sessionId]})
to extend the session by another 30 minutes, so that your cursor is not killed while you are working with the data before fetching the next batch.
Check the docs here for details on how to do it:
https://docs.mongodb.com/manual/reference/method/cursor.noCursorTimeout/
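A sketch of what that periodic refresh could look like in the mongo shell (the database/collection names and processDocument are placeholders):

// Start an explicit session so its id is known, open the cursor inside it,
// and refresh the session every so often while processing.
var session = db.getMongo().startSession();
var sessionId = session.getSessionId().id;
var cur = session.getDatabase("test").myCollection.find({}).noCursorTimeout();

var i = 0;
while (cur.hasNext()) {
    processDocument(cur.next()); // placeholder for your per-document work
    if (++i % 1000 === 0) {
        // keep the session alive well within its 30-minute timeout
        db.adminCommand({ refreshSessions: [sessionId] });
    }
}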
I am trying to check and insert 1000 vertices in chunks using Promise.all(). The code is as follows:
public async createManyByKey(label: string, key: string, properties: object[]): Promise<T[]> {
    const promises = [];
    const allVertices = __.addV(label);
    const propKeys: Array<string> = Object.keys(properties[0]);

    for (const propKey of propKeys) {
        allVertices.property(propKey, __.select(propKey));
    }

    const chunkedProperties = chunk(properties, 5); // [["demo-1", "demo-2", "demo-3", "demo-4", "demo-5"], [...], ...]

    for (const property of chunkedProperties) {
        const singleQuery = this.g.withSideEffect('User', property)
            .inject(property)
            .unfold().as('data')
            .coalesce(__.V().hasLabel(label).where(eq('data')).by(key).by(__.select(key)), allVertices)
            .iterate();
        promises.push(singleQuery);
    }

    const result = await Promise.all(promises);
    return result;
}
This code throws a ConcurrentModificationException. I need help fixing or improving this.
I'm not quite sure about the data and parameters you are using, but I needed to modify your query a bit to get it to work with a data set I have handy (air routes), as shown below. I did this to help me think through what your query is doing. I had to change the second by() step; I'm not sure how that was working otherwise.
gremlin> g.inject(['AUS','ATL','XXX']).unfold().as('d').
......1> coalesce(__.V().hasLabel('airport').limit(10).
......2> where(eq('d')).
......3> by('code').
......4> by(),
......5> constant('X'))
==>v['3']
==>v['1']
==>X
While a query like this runs fine in isolation, once you start running several asynchronous promises (that contain mutating steps, as in your query), what can happen is that one promise tries to access a part of the graph that is locked by another. Even though the execution is, I believe, more "concurrent" than truly "parallel", if one promise yields due to an I/O wait and allows another to run, the next one may fail if the prior promise already holds locks in the database that it also needs. In your case, as you have a coalesce that references all vertices with a given label and properties, that can potentially cause conflicting locks to be taken. Perhaps it will work better if you await after each for-loop iteration rather than doing it all at the end in one big Promise.all.
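For example, the loop in your createManyByKey could await each chunk before submitting the next one — a sketch based on the code in the question:

// Submit one chunk at a time; the next mutating traversal only starts after
// the previous one has finished, so they cannot contend for the same locks.
for (const property of chunkedProperties) {
    await this.g.withSideEffect('User', property)
        .inject(property)
        .unfold().as('data')
        .coalesce(__.V().hasLabel(label).where(eq('data')).by(key).by(__.select(key)), allVertices)
        .iterate();
}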
Something else to keep in mind is that this query is going to be somewhat expensive regardless, as the mid-traversal V() is going to happen five times (in the case of your example) for each for-loop iteration. This is because the unfold of the injected data is taken from chunks of size 5 and therefore spawns five traversers, each of which starts by looking at V().
EDITED 2021-11-17
As discussed a little in the comments, I suspect the most optimal path is actually to use multiple queries. The first query simply does a g.V(id1,id2,...) on all the IDs you are potentially going to add. Have it return a list of IDs found. Remove those from the set to add. Next break the adding part up into batches and do it without coalesce as you now know that those elements do not exist. This is most likely the best way to reduce locking and avoid the CMEs (exceptions). Unless someone else may be also trying to add them in parallel, this is the approach I think I would take.
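A hypothetical sketch of that two-query approach, reusing the names from the original code (the exact id/key handling and the chunk helper are assumptions, not the answerer's actual code):

// Phase 1: one read-only query to find which keys already exist.
const ids = properties.map((p) => p[key]);
const existing = await g.V(...ids).id().toList();
const missing = properties.filter((p) => !existing.includes(p[key]));

// Phase 2: add only the missing vertices, in small batches, with no coalesce step.
for (const batch of chunk(missing, 5)) {
    let t = g;
    for (const props of batch) {
        t = t.addV(label);
        for (const [k, v] of Object.entries(props)) {
            t = t.property(k, v);
        }
    }
    await t.iterate();
}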
I'm trying to iterate through an array and create a record for each item. This is what I am doing, as mentioned in another question here:
async.each(data, (datum, callback) => {
    console.log('Iterated')
    Datum.create({
        row: datum,
    }).exec((error) => {
        if (error) return res.serverError(error)
        console.log('Created')
        callback()
    })
})
Unfortunately, it results in this:
Iterated
Iterated
Iterated
Created
Created
Created
Not this as wanted:
Iterated
Created
Iterated
Created
Iterated
Created
What am I doing wrong?
async.eachSeries() will run one iteration at a time and wait for each iteration to finish before moving on to the next one.
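A minimal sketch of the same loop with async.eachSeries, with the error handling moved into the final callback:

async.eachSeries(data, (datum, callback) => {
    console.log('Iterated')
    Datum.create({
        row: datum,
    }).exec((error) => {
        if (error) return callback(error)
        console.log('Created')
        callback()
    })
}, (error) => {
    // runs once, after all iterations have finished or one of them has failed
    if (error) return res.serverError(error)
})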
I create a unique, user-friendly identifier before each creation (like 1, 2, 3 and so on). For that, I have to query the database to find the latest identifier and increment it, which isn't possible here because the records are created at nearly the same time.
This sounds like the real bottleneck. I don't like running async code in series because it usually slows processes down. How about this approach:
From data you already know how many identifiers you'll need.
Implement a function in the backend that creates not one but n such identifiers at a time (including the necessary incrementing, etc.) and returns that array to the frontend. Now you can run your regular requests in parallel, mapping that array of precomputed IDs onto the data array.
This should reduce the runtime from (createAnId + request) * data.length pretty much down to the runtime of a single iteration, because all these requests can run in parallel and therefore mostly overlap.
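A hypothetical sketch of that idea; fetchIdentifiers() and the identifier field are assumptions, and it presumes Datum.create returns something awaitable:

// One backend call returns as many identifiers as there are records to create,
// after which all creations can safely run in parallel.
const ids = await fetchIdentifiers(data.length) // hypothetical backend helper
await Promise.all(data.map((datum, i) =>
    Datum.create({ row: datum, identifier: ids[i] })
))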
It looks like Datum.create is an asynchronous function.
The loop whips through each of the three elements of the array, logging 'Iterated' in turn, and since JavaScript doesn't block while waiting for the asynchronous operations to return, you get each of those console.logs first.
Then, after some amount of time, the results come in and 'Created' is logged to the console.
You seem to be using an asynchronous data processing library. For the result you intend to get, you need to process the data synchronously. Here's how you could do it:
data.forEach(function (datum) {
    console.log('Iterated')
    Datum.create({
        row: datum,
    }).exec((error) => {
        if (error) return res.serverError(error)
        console.log('Created')
        callback()
    })
})
You may also want to remove the callback function entirely now since the data is processed synchronously.
I have a JavaScript application that calls an API, and the API returns JSON. From the JSON, I select a specific object and loop through it.
My code flow is something like this:
Service call -> GetResults
Loop through Results and build Page
The problem, though, is that sometimes the API returns only one result, so it returns an object instead of an array, and I can't loop through the results. What would be the best way to handle this?
Should I convert my object (the single result) to an array, i.e. put/push it inside an array? Or should I check whether the element is an array and only then do the looping?
Thanks for the help.
// this is what is returned when there is more than one result
var results = {
    pages: [
        {"pageNumber": 204},
        {"pageNumber": 1024},
        {"pageNumber": 3012}
    ]
}

// this is what is returned when there is only one result
var results = {
    pages: {"pageNumber": 105}
}
My code loops through results using a for loop, but it throws errors, since sometimes it is not an array. So again: do I check whether it's an array, or push the result into a new array? Which would be better? Thanks.
If you have no control over the server side, you could do a simple check to make sure it's an array:
if (!(results.pages instanceof Array)) {
    results.pages = [results.pages];
}
// Do your loop here.
Otherwise, this should ideally happen on the server; it should be part of the contract that the results can always be accessed in a similar fashion.
Arrange whatever you do to your objects inside the loop into a separate procedure, and if you discover that the object is not an array, apply the procedure to it directly; otherwise, apply it to each element of that object:
function processPage(page) { /* do something to your page */ }
if (pages instanceof Array) pages.forEach(processPage);
else processPage(pages);
The obvious benefit of this approach, compared to creating a redundant array, is that, well, you don't create a redundant array and you don't modify the data you received. While at this stage it may not be important that the data stays intact, in general modifying it might cause you more trouble when running integration and regression tests.
Below is a snippet of code that I am having trouble with. Its purpose is to check for duplicate entries in the database and return "h" as a boolean, true or false. For testing purposes I am returning a true boolean for "h", but by the time the alert(duplicate_count); line gets executed, duplicate_count is still 0, even though the "count +1" alert gets executed.
To me it seems like the callback updateUserFields takes longer to execute, so it hasn't finished before the final alert runs.
Any ideas or suggestions? Thanks!
var duplicate_count = 0;

for (var i = 0; i < skill_id.length; i++) {
    function updateUserFields(h) {
        if (h) {
            duplicate_count++;
            alert("count +1");
        } else {
            alert("none found");
        }
    }

    var g = new cfc_mentoring_find_mentor();
    g.setCallbackHandler(updateUserFields);
    g.is_relationship_duplicate(resource_id, mentee_id, section_id[i], skill_id[i], active_ind, table);
}
alert(duplicate_count);
There is no reason whatsoever to use client-side JavaScript/jQuery to remove duplicates from your database. Security concerns aside (and there are a lot of those), there is a much easier way to make sure the entries in your database are unique: use SQL.
SQL is capable of expressing the requirement that there be no duplicates in a table column, and the database engine will enforce that for you, never letting you insert a duplicate entry in the first place. The syntax varies very slightly by database engine, but whenever you create the table you can specify that a column must be unique.
Let's use SQLite as our example database engine. The relevant part of your problem is right now probably expressed with tables something like this:
CREATE TABLE Person(
    id INTEGER PRIMARY KEY ASC,
    -- Other fields here
);

CREATE TABLE MentorRelationship(
    id INTEGER PRIMARY KEY ASC,
    mentorID INTEGER,
    menteeID INTEGER,
    FOREIGN KEY (mentorID) REFERENCES Person(id),
    FOREIGN KEY (menteeID) REFERENCES Person(id)
);
However, you can enforce uniqueness, i.e. require that any (mentorID, menteeID) pair is unique, by making the pair (mentorID, menteeID) the primary key. This works because you are only allowed one copy of each primary key. The MentorRelationship table then becomes
CREATE TABLE MentorRelationship(
    mentorID INTEGER,
    menteeID INTEGER,
    PRIMARY KEY (mentorID, menteeID),
    FOREIGN KEY (mentorID) REFERENCES Person(id),
    FOREIGN KEY (menteeID) REFERENCES Person(id)
);
EDIT: As per the comment, alerting the user to duplicates but not actually removing them
This is still much better with SQL than with JavaScript. When you do this in JavaScript, you read one database row at a time, send it over the network, wait for it to come to your page, process it, throw it away, and then request the next one. With SQL, all the hard work is done by the database engine, and you don't lose time by transferring unnecessary data over the network. Using the first set of table definitions above, you could write
SELECT mentorID, menteeID
FROM MentorRelationship
GROUP BY mentorID, menteeID
HAVING COUNT(*) > 1;
which will return all the (mentorID, menteeID) pairs that occur more than once.
Once you have a query like this working on the server (and are also pulling out all the information you want to show to the user, which is presumably more than just a pair of IDs), you need to send this over the network to the user's web browser. Essentially, on the server side you map a URL to return this information in some convenient form (JSON, XML, etc.), and on the client side you read this information by contacting that URL with an AJAX call (see jQuery's website for some code examples), and then display that information to the user. No need to write in JavaScript what a database engine will execute orders of magnitude faster.
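For illustration only, the client side could be as simple as this jQuery sketch; the URL and the response shape are assumptions, not part of the original setup:

// Ask the server for the duplicate pairs found by the GROUP BY query above,
// then show them to the user.
$.getJSON('/api/duplicate-relationships', function (duplicates) {
    duplicates.forEach(function (dup) {
        alert('Duplicate: mentor ' + dup.mentorID + ', mentee ' + dup.menteeID);
    });
});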
EDIT 2: As per the second comment, checking whether an item is already in the database
Almost everything I said in the first edit applies, except for two changes: the schema and the query. The schema should become the second of the two schemas I posted, since you don't want the database engine to allow duplicates. Also, the query should be simply
SELECT COUNT(*) > 0
FROM MentorRelationship
WHERE mentorID = #mentorID AND menteeID = #menteeID;
where #mentorID and #menteeID are the items that the user selected, and are inserted into the query by a query builder library and not by string concatenation. Then, the server will get a true value if the item is already in the database, and a false value otherwise. The server can send that back to the client via AJAX as before, and the client (that's your JavaScript page) can alert the user if the item is already in the database.