IndexedDB and large amount of inserts on Angular app - javascript

I'm struggling with server responses of 20-50k JSON objects that I need to insert into our IndexedDB datastore.
The response is iterated with forEach and every single row is added individually. Calls with responses of fewer than 10k rows work fine and are inserted within a minute or so. But when the amounts get larger, the database becomes unresponsive after a while and returns this error message:
"db Error err=transaction aborted for unknown reason"
I'm using Dexie as a wrapper for the database, and an Angular wrapper for Dexie called ngDexie.
var deferred = $q.defer();
var progress = 0;

// make the call
$http({
    method: 'GET',
    headers: headers,
    url: '/Program.API/api/items/getitems/' + user
}).success(function (response) {
    // parse response
    var items = angular.fromJson(response);
    // loop each item
    angular.forEach(items, function (item) {
        // insert into db
        ngDexie.put('stuff', item).then(function () {
            progress++;
            $ionicLoading.show({
                content: 'Loading',
                animation: 'fade-in',
                template: 'Inserting items to db: ' + progress
                    + '/' + items.length,
                showBackdrop: true,
                maxWidth: 200,
                showDelay: 0
            });
            if (progress == items.length) {
                setTimeout(function () {
                    $ionicLoading.hide();
                }, 500);
                deferred.resolve(items);
            }
        });
    });
}).error(function (error) {
    $log('something went wrong');
    $ionicLoading.hide();
});

return deferred.promise;
Is handling all the data in one chunk the wrong approach? Are there better alternatives? This whole procedure is only done once, when the user opens up the site. All help is greatly appreciated. The target devices are tablets running Android with Chrome.

Since you are getting an unknown error, something is going wrong with I/O. My guess is that the database underneath has trouble handling this amount of data. You may try splitting the inserts into batches with a maximum of 10k each (see the sketch below).
A transaction can fail for reasons not tied to a particular IDBRequest. For example due to IO errors when committing the transaction, or due to running into a quota limit where the implementation can't tie exceeding the quota to a particular request. In this case the implementation MUST run the steps for aborting a transaction using the transaction as transaction and the appropriate error type as error. For example if quota was exceeded then QuotaExceededError should be used as error, and if an IO error happened, UnknownError should be used as error.
You can find this in the spec.
Another possibility: do you have any indexes defined on the objectstore? Every index you define has to be maintained with every insert.
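A minimal sketch of that batching idea; the chunk helper, the 10k size and the use of plain Promises are assumptions, and insertBatch stands for whatever you use to persist one batch:
// Split a large array into batches of at most `size` items.
function chunk(array, size) {
    var batches = [];
    for (var i = 0; i < array.length; i += size) {
        batches.push(array.slice(i, i + size));
    }
    return batches;
}

// Process the batches sequentially, so the database is never asked to queue
// tens of thousands of requests at once. insertBatch must return a promise
// (e.g. a loop of ngDexie.put calls, or a bulk method as suggested further down).
function insertInBatches(items, insertBatch) {
    return chunk(items, 10000).reduce(function (previous, batch) {
        return previous.then(function () {
            return insertBatch(batch);
        });
    }, Promise.resolve());
}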

If you insert many new records, I would suggest using add. It was added for performance reasons. See the documentation here:
https://github.com/FlussoBV/NgDexie/wiki/ngDexie.add

I had problems with massive bulk inserts (100,000 - 200,000 records). I've solved all my IndexedDB performance problems using bulkPut() from the Dexie library. It has this important feature:
Dexie has a kick-ass performance. Its bulk methods take advantage of
a not well known feature in indexedDB that makes it possible to store
stuff without listening to every onsuccess event. This speeds up the
performance to a maximum.
Dexie: https://github.com/dfahlander/Dexie.js
BulkPut() -> http://dexie.org/docs/Table/Table.bulkPut()
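A minimal usage sketch, assuming a plain Dexie instance with a 'stuff' table; the database name and key are illustrative:
// Hypothetical schema: adjust the database name, store name and key to your own.
var db = new Dexie('MyAppDb');
db.version(1).stores({ stuff: 'id' });

function insertAll(items) {
    // bulkPut writes the whole array without waiting on a success event per row
    // and resolves once the underlying transaction has committed.
    return db.stuff.bulkPut(items)
        .then(function () {
            console.log('Inserted ' + items.length + ' items');
        })
        .catch(Dexie.BulkError, function (e) {
            // Catch BulkError explicitly if you want the successful puts to be
            // kept when some individual items fail.
            console.warn(e.failures.length + ' items failed to insert');
        });
}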

Related

Bulk Upsert Javascript stored procedure always exceeds execution cap of 5 seconds and results in a timeout

I'm currently running a script with the Python SDK which programmatically bulk upserts 1.5 million documents into a collection in Azure Cosmos DB. I've been using the bulk import sproc from the samples provided in the GitHub repo: https://github.com/Azure/azure-cosmosdb-js-server/tree/master/samples/stored-procedures, the only change being that I've swapped collection.createDocument for collection.upsertDocument. I'll include my sproc in full below.
The stored procedure does run successfully - it upserts documents consistently and relatively quickly. However, this is only the case up until around 30% progress, when this error is thrown:
CosmosHttpResponseError: (RequestTimeout) Message: {"Errors":["The requested operation exceeded maximum alloted time. Learn more: https://aka.ms/cosmosdb-tsg-service-request-timeout"]}
ActivityId: 9f2357c6-918c-4b67-ba20-569034bfde6f, Request URI: /apps/4a997bdb-7123-485a-9808-f952db2b7e52/services/a7c137c6-96b8-4b53-a20c-b9577981b353/partitions/305a8287-11d1-43f8-be1f-983bd4c4a63d/replicas/132488328092882514p/, RequestStats:
RequestStartTime: 2020-11-03T23:43:59.9158203Z, RequestEndTime: 2020-11-03T23:44:05.3858559Z, Number of regions attempted:1
ResponseTime: 2020-11-03T23:44:05.3858559Z, StoreResult: StorePhysicalAddress: rntbd://cdb-ms-prod-centralus1-fd22.documents.azure.com:14354/apps/4a997bdb-7123-485a-9808-f952db2b7e52/services/a7c137c6-96b8-4b53-a20c-b9577981b353/partitions/305a8287-11d1-43f8-be1f-983bd4c4a63d/replicas/132488328092882514p/, LSN: -1, GlobalCommittedLsn: -1, PartitionKeyRangeId: , IsValid: False, StatusCode: 408, SubStatusCode: 0, RequestCharge: 0, ItemLSN: -1, SessionToken: , UsingLocalLSN: False, TransportException: null, ResourceType: StoredProcedure, OperationType: ExecuteJavaScript, SDK: Microsoft.Azure.Documents.Common/2.11.0
Is there a way to add some retry logic or to extend the timeout period for bulk upserts? I believe the section of code in the sproc below, if (!isAccepted) getContext().getResponse().setBody(count);, is supposed to help with this scenario, but it doesn't seem to work in my case.
Bulk upsert stored procedure in Javascript:
function bulkUpsert(docs) {
    var collection = getContext().getCollection();
    var collectionLink = collection.getSelfLink();

    // The count of imported docs, also used as current doc index.
    var count = 0;

    // Validate input.
    if (!docs) throw new Error("The array is undefined or null.");

    var docsLength = docs.length;
    if (docsLength == 0) {
        getContext().getResponse().setBody(0);
        return;
    }

    // Call the CRUD API to upsert a document.
    tryCreate(docs[count], callback);

    // Note that there are 2 exit conditions:
    // 1) The upsertDocument request was not accepted.
    //    In this case the callback will not be called, we just call setBody and we are done.
    // 2) The callback was called docs.length times.
    //    In this case all documents were upserted and we don't need to call tryCreate anymore. Just call setBody and we are done.
    function tryCreate(doc, callback) {
        var isAccepted = collection.upsertDocument(collectionLink, doc, callback);

        // If the request was accepted, callback will be called.
        // Otherwise report the current count back to the client,
        // which will call the script again with the remaining set of docs.
        // This condition will happen when this stored procedure has been running too long
        // and is about to get cancelled by the server. It allows the calling client
        // to resume this batch from the point we got to before isAccepted was set to false.
        if (!isAccepted) {
            getContext().getResponse().setBody(count);
        }
    }

    // This is called when collection.upsertDocument is done and the document has been persisted.
    function callback(err, doc, options) {
        if (err) throw err;

        // One more document has been upserted, increment the count.
        count++;

        if (count >= docsLength) {
            // If we have upserted all documents, we are done. Just set the response.
            getContext().getResponse().setBody(count);
        } else {
            // Upsert the next document.
            tryCreate(docs[count], callback);
        }
    }
}
I think the problem may lie in the stored procedure rather than the Python script; if that isn't the case, though, I can provide my Python script as well. Any help on this would be massively appreciated - it's been a head scratcher for me for days now!
Extra Info:
Throughput = 10,000 RU/s; partition upsert size is consistently ~1.9 MB.
If anyone else has this problem, the workaround I've used is to temporarily increase the throughput from 10,000 to 100,000 while the bulk upsert operation is underway. The error doesn't occur if you use that bulk upsert stored procedure in conjunction with a sufficiently high throughput. The timeout was happening frequently once the operation had upserted around 30% of the 1.5 million records, likely because the throughput wasn't divided sufficiently between partitions and was causing a bottleneck. I may have to assign a greater throughput to my container once it is used in practice, or I may be able to reduce it again to save costs. Either way, the code to do this is quite simple, just the call below:
new_throughput = 10000; container.replace_throughput(new_throughput)
Stored procedures have a bounded execution time of 5 seconds. However, you can write your stored procedure to handle bounded execution by checking the boolean return value of each write, and then use the count of items inserted in each invocation of the stored procedure to track and resume progress across batches. There is an example here.
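For completeness, a rough client-side resume loop around that return value, sketched with the @azure/cosmos JavaScript SDK rather than the Python SDK used in the question; the sproc id, connection details and per-partition batching are assumptions:
const { CosmosClient } = require("@azure/cosmos");

// Hypothetical connection details and names - replace with your own.
const client = new CosmosClient({ endpoint: "https://<account>.documents.azure.com", key: "<key>" });
const container = client.database("<database>").container("<collection>");

// Re-invoke the bulkUpsert sproc until every document in this partition's
// batch has been accepted. The sproc's response body is the number of
// documents it managed to upsert before running out of time.
async function bulkUpsertAll(docs, partitionKeyValue) {
    let remaining = docs;
    while (remaining.length > 0) {
        const { resource: upsertedCount } = await container.scripts
            .storedProcedure("bulkUpsert")
            .execute(partitionKeyValue, [remaining]);
        // Drop the documents that were already upserted and resume.
        remaining = remaining.slice(upsertedCount);
    }
}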

Efficient DB design with PouchDB/CouchDB

So I have been reading a lot about how to store and fetch data in an efficient way. Basically, my application is about time management/capturing for projects. I would be very happy about any opinion on which strategy I should use, or even suggestions for other strategies. The main concern is the limited local storage available in the different browsers.
This is the main data I have to store:
db_projects: This is a database where the projects itself are stored.
db_timestamps: Here go the timestamps per project whenever a project is running.
I came up with the following strategies:
1: Storing the status of the project in the timestamps
When a project is started, a timestamp is added to db_timestamps like so:
db_timestamps.put({
    _id: String(Date.now()),
    title: projectID,
    status: status // could be: 1=active / 2=inactive / 3=paused
})...
This follows the strategy of only adding data to the db and never modifying entries. The problem I see here is that if I want to get all active projects, for example, I would need to query the whole db_timestamps database, which can contain thousands of entries. Since I cannot use the ID to search for all active projects, this could result in quite a heavy DB query.
2: Storing the status of the project in db_projects
Each time a project changes its status, the project itself is updated. So the "get all active projects" query would be much more resource-friendly, since there are a lot fewer projects than timestamps. But this would also mean that each time a status change happens, the project entry is revisioned and therefore produces "a lot" of overhead. I'm also not sure whether the compaction feature would do a good job, since not all revision data is deleted (the documents are, but the leaf revisions are not). This means that for a state change we keep at least the _rev information, which is still a string of 34 chars, for changing only the status (1 char). Or can I delete the leaf revisions after conflict resolution?
3: Storing the status in a separate DB like db_status
This leads to the same problem as in #2, since status changes lead to revisions in this DB. Or, if the states were added in "only add data" mode (like in #1), it would just quickly fill up with entries.
The general problem is that you have a limited amount of space that you can put into IndexedDB. On the other hand, the principle of CouchDB is that storage space is cheap (which is indeed true when you store on the server side only). Here is an interesting discussion about that.
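If you want to see how close you are to that limit, you can ask the browser for a quota estimate; a small sketch using the standard StorageManager API (Promise-based, not available in every browser):
// Logs roughly how much of the origin's storage quota is in use.
if (navigator.storage && navigator.storage.estimate) {
    navigator.storage.estimate().then(function (estimate) {
        var usedMB = (estimate.usage / (1024 * 1024)).toFixed(1);
        var quotaMB = (estimate.quota / (1024 * 1024)).toFixed(1);
        console.log('Storage used: ' + usedMB + ' MB of ~' + quotaMB + ' MB');
    });
}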
So this is the solution I use for now. I am using a mix between solution 1 and solution 2 from above, with the following additions:
Storing only the timestamps in a synced database (db_timestamps) with the "only add data" principle.
Storing the projects and their states in a separate local (not synced) database (db_projects). Therefore I still use PouchDB, since it has a much simpler API than IndexedDB.
Storing the new/changed project status in each timestamp as well (so you could rebuild db_projects out of db_timestamps if needed).
Deleting db_projects every so often and repopulating it, so the revision data (the overhead for this db in my case) is eliminated and the size stays acceptable.
I use the following code to rebuild my DB:
//--------------------------------------------------------------------
function rebuild_db_project() {
    db_project.allDocs({
        include_docs: true,
        //attachments: true
    }).then(function (result) {
        // do stuff
        console.log('I have read the DB and delete it now...');
        deleteDB('db_project', '_pouch_DB_Projekte');
        return result;
    }).then(function (result) {
        console.log('Creating the new DB...' + result);
        db_project = new PouchDB('DB_Projekte');
        var dbContentArray = [];
        for (var row in result.rows) {
            // delete the revision of the doc, else bulkDocs() would raise a conflict error
            delete result.rows[row].doc._rev;
            dbContentArray.push(result.rows[row].doc);
        }
        return db_project.bulkDocs(dbContentArray);
    }).then(function (response) {
        console.log('I have successfully populated the DB with: ' + JSON.stringify(response));
    }).catch(function (err) {
        console.log(err);
    });
}
//--------------------------------------------------------------------
function deleteDB(PouchDB_Name, IndexedDB_Name) {
    console.log('DELETE');
    new PouchDB(PouchDB_Name).destroy().then(function () {
        // database destroyed
        console.log("pouchDB destroyed.");
    }).catch(function (err) {
        // error occurred
        console.log(err);
    });
    var DBDeleteRequest = window.indexedDB.deleteDatabase(IndexedDB_Name);
    DBDeleteRequest.onerror = function (event) {
        console.log("Error deleting database.");
    };
    DBDeleteRequest.onsuccess = function (event) {
        console.log("IndexedDB deleted successfully");
        console.log(DBDeleteRequest.result); // should be undefined
    };
}
So I use not only the PouchDB destroy() command but also the indexedDB.deleteDatabase() command to get the storage freed nearly completely (there are still some 4 kB that are not freed, but this is insignificant to me).
The timings are not really proper, but it works for me. I'd be happy if someone has an idea to make the timing work properly (the problem for me is that indexedDB does not support promises).
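One way to get the timing under control is to wrap the deleteDatabase request in a promise yourself and chain it after destroy(); a rough sketch, reusing the names from the code above:
// Promise wrapper around indexedDB.deleteDatabase so it can be chained.
function deleteIndexedDB(name) {
    return new Promise(function (resolve, reject) {
        var request = window.indexedDB.deleteDatabase(name);
        request.onsuccess = function () { resolve(); };
        request.onerror = function (event) { reject(event.target.error); };
        // Fires if another open connection blocks the delete.
        request.onblocked = function () { console.warn('Delete blocked; close other connections.'); };
    });
}

// Usage: destroy the PouchDB first, then the underlying IndexedDB,
// and only recreate db_project once both have finished.
new PouchDB('db_project').destroy()
    .then(function () { return deleteIndexedDB('_pouch_DB_Projekte'); })
    .then(function () {
        db_project = new PouchDB('DB_Projekte');
    });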

What would cause "Request timed out" in parse.com cloud code count?

One of my cloud functions is timing out occasionally. It seems to have trouble with counting, although there are only around 700 objects in the class. I would appreciate any tips on how to debug this issue.
The cloud function works correctly most of the time.
Example error logged:
E2015-02-03T02:21:41.410Z] v199: Ran cloud function GetPlayerWorldLevelRank for user xl8YjQElLO with:
Input: {"levelID":60}
Failed with: PlayerWorldLevelRank first count error: Request timed out
Is there anything that looks odd in the code below? The timeout error is usually thrown in the second count (query3), although sometimes it times out in the first count (query2).
Parse.Cloud.define("GetPlayerWorldLevelRank", function(request, response) {
    var query = new Parse.Query("LevelRecords");
    query.equalTo("owner", request.user);
    query.equalTo("levelID", request.params.levelID);

    query.first().then(function(levelRecord) {
        if (levelRecord === undefined) {
            response.success(null);
        }
        // if player has a record, work out his ranking
        else {
            var query2 = new Parse.Query("LevelRecords");
            query2.equalTo("levelID", request.params.levelID);
            query2.lessThan("timeSeconds", levelRecord.get("timeSeconds"));
            query2.count({
                success: function(countOne) {
                    var numPlayersRankedHigher = countOne;
                    var query3 = new Parse.Query("LevelRecords");
                    query3.equalTo("levelID", request.params.levelID);
                    query3.equalTo("timeSeconds", levelRecord.get("timeSeconds"));
                    query3.lessThan("bestTimeUpdatedAt", levelRecord.get("bestTimeUpdatedAt"));
                    query3.count({
                        success: function(countTwo) {
                            numPlayersRankedHigher += countTwo;
                            var playerRanking = numPlayersRankedHigher + 1;
                            levelRecord.set("rank", playerRanking);
                            // The SDK doesn't allow an object that has been changed to be serialized into a response.
                            // This would disable the check and allow you to return the modified object.
                            levelRecord.dirty = function() { return false; };
                            response.success(levelRecord);
                        },
                        error: function(error) {
                            response.error("PlayerWorldLevelRank second count error: " + error.message);
                        }
                    });
                },
                error: function(error) {
                    response.error("PlayerWorldLevelRank first count error: " + error.message);
                }
            });
        }
    });
});
I don't think the issue is in your code. As the error message states, the request times out: either the Parse API doesn't respond within the timeout period, or the network causes the request to time out. As soon as you call .count, an API call is made, and that call can't connect or times out.
Apparently more people have this issue: https://www.parse.com/questions/ios-test-connectivity-to-parse-and-timeout-question. It doesn't seem possible to increase the timeout, so the suggestion in that post is:
For that reason, I suggest setting a NSTimer prior to executing the
query, and invalidating it when the query returns. If the NSTimer
fires before being invalidated, ask the user if they want to keep
waiting for the results to come back, or show them a message
indicating that the request is taking a long time to complete. This
gives the user the chance to wait more if they know their current
network conditions are not ideal.
When you are dealing with networks, and especially on mobile platforms, you need to be prepared for network hiccups. So, as the post suggests: offer the user the option to try again.
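If you want to retry on the server side as well, a small retry helper around the promise-returning query.count() could look like this; a sketch only, assuming the old Parse SDK's Parse.Promise API, with an arbitrary retry count:
// Retry a promise-returning count a few times before giving up.
function countWithRetry(query, retries) {
    return query.count().then(null, function (error) {
        if (retries <= 0) {
            return Parse.Promise.error(error);
        }
        return countWithRetry(query, retries - 1);
    });
}

// Usage, e.g. for the first count in the cloud function above:
countWithRetry(query2, 2).then(function (countOne) {
    // ... continue with query3 exactly as before
}, function (error) {
    response.error("PlayerWorldLevelRank first count error: " + error.message);
});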

Strange issue with socket.on method

I am facing a strange issue with calling socket.on methods from the Javascript client. Consider the code below:
for (var i = 0; i < 2; i++) {
    var socket = io.connect('http://localhost:5000/');
    socket.emit('getLoad');
    socket.on('cpuUsage', function(data) {
        document.write(data);
    });
}
Here I am basically listening for the cpuUsage event emitted by the socket server, but for each iteration I get the same value. This is the output:
0.03549148310035006
0.03549148310035006
0.03549148310035006
0.03549148310035006
Edit: Server-side code; basically I am using the node-usage library to calculate CPU usage:
socket.on('getLoad', function (data) {
    usage.lookup(pid, function(err, result) {
        cpuUsage = result.cpu;
        memUsage = result.memory;
        console.log("Cpu Usage1: " + cpuUsage);
        console.log("Cpu Usage2: " + memUsage);
        /*socket.emit('cpuUsage', result.cpu);
        socket.emit('memUsage', result.memory);*/
        socket.emit('cpuUsage', cpuUsage);
        socket.emit('memUsage', memUsage);
    });
});
On the server side, however, I am getting different values for each emit and socket.on. I find it very strange that this is happening. I tried setting data = null after each socket.on call, but it still prints the same value. I don't know what phrase to search for, so I'm posting here. Can anyone please guide me?
Please note: I am basically Java developer and have a less experience in Javascript side.
You are making the assumption that when you use .emit(), a subsequent .on() will wait for a reply, but that's not how socket.io works.
Your code basically does this:
it emits two getLoad messages directly after each other (which is probably why the returned value is the same);
it installs two handlers for the returning cpuUsage message sent by the server.
This also means that each time you run your loop, you're installing more and more handlers for the same message.
Now I'm not sure what exactly it is you want. If you want to periodically request the CPU load, use setInterval or setTimeout. If you want to send a message to the server and 'wait' for a response, you may want to use acknowledgement functions (not very well documented, but see this blog post).
But you should assume that for each type of message, you should only call socket.on('MESSAGETYPE', handler) once during the runtime of your code.
EDIT: here's an example client-side setup for a periodic poll of the data:
var socket = io.connect(...);

socket.on('connect', function() {
    // Handle the server response:
    socket.on('cpuUsage', function(data) {
        document.write(data);
    });

    // Start an interval to query the server for the load every 30 seconds:
    setInterval(function() {
        socket.emit('getLoad');
    }, 30 * 1000); // milliseconds
});
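And for the request/response style mentioned above, socket.io acknowledgements would look roughly like this (a sketch only, reusing the getLoad event and usage.lookup() call from the question):
// Client: pass a callback as the last argument to emit; the server's
// acknowledgement lands in that callback instead of a separate event.
socket.emit('getLoad', function (cpuUsage) {
    document.write(cpuUsage);
});

// Server: the acknowledgement callback arrives as the last argument of the handler.
socket.on('getLoad', function (ack) {
    usage.lookup(pid, function (err, result) {
        ack(result.cpu);
    });
});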
Use this line instead:
var socket = io.connect('iptoserver', {'force new connection': true});
Replace iptoserver with the actual IP of the server, of course - in this case localhost.
Edit: That is, if you want to create multiple clients.
Otherwise you have to place the initialization of the socket variable before the for loop.
I suspected the call was returning the average CPU usage since the process started, and that seems to be the case here. Checking the node-usage documentation page (average-cpu-usage-vs-current-cpu-usage), I found:
By default CPU Percentage provided is an average from the starting
time of the process. It does not correctly reflect the current CPU
usage. (this is also a problem with linux ps utility)
But If you call usage.lookup() continuously for a given pid, you can
turn on keepHistory flag and you'll get the CPU usage since last time
you track the usage. This reflects the current CPU usage.
The documentation also gives an example of how to use it:
var pid = process.pid;
var options = { keepHistory: true };

usage.lookup(pid, options, function(err, result) {
    // result.cpu now reflects the usage since the previous lookup
});
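Applied to the server code from the question, the handler could then look like this (a sketch, assuming keepHistory behaves as described in the quoted documentation):
var options = { keepHistory: true };

socket.on('getLoad', function () {
    // With keepHistory, each lookup reports usage since the previous lookup
    // for this pid, so repeated getLoad emits return fresh values.
    usage.lookup(pid, options, function (err, result) {
        if (err) { return; }
        socket.emit('cpuUsage', result.cpu);
        socket.emit('memUsage', result.memory);
    });
});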

Progress event for large Firebase queries?

I have the following query:
fire = new Firebase 'ME.firebaseio.com'
users = fire.child 'venues/ID/users'
users.once 'value', (snapshot) ->
  # do things with snapshot.val()
  ...
I am loading 10+ mb of data, and the request takes around 1sec/mb. Is it possible to give the user a progress indicator as content streams in? Ideally I'd like to process the data as it comes in as well (not just notify).
I tried using the on "child_added" event instead, but it doesn't work as expected - instead of children streaming in at a consistent rate, they all come at once after the entire dataset is loaded (which takes 10-15 sec), so in practice it seems to be a less performant version of on "value".
You should be able to optimize your download time from 10-20 seconds to a few milliseconds by starting with some denormalization.
For example, we could move the images and any other peripherals comprising the majority of the payload to their own path, keep only the meta data (name, email, etc) in the user records, and grab the extras separately:
/users/user_id/name, email, etc...
/images/user_id/...
The number of event listeners you attach, or paths you connect to, does not carry any significant overhead locally or in networking bandwidth (only the payload does), so you can do something like this to "normalize" after grabbing the meta data:
var firebaseRef = new Firebase(URL);
firebaseRef.child('users').on('child_added', function(snap) {
    console.log('got user ', snap.name());

    // I chose once() here to snag the image, assuming they don't change much,
    // but on() would work just as well
    firebaseRef.child('images/' + snap.name()).once('value', function(imageSnap) {
        console.log('got image for user ', imageSnap.name());
    });
});
You'll notice right away that when you move the bulk of the data out and keep only the meta info for users locally, they will be lightning-fast to grab (all of the "got user" logs will appear right away). Then the images will trickle in one at a time after this, allowing you to create progress bars or process them as they show up.
If you aren't willing to denormalize the data, there are a couple ways you could break up the loading process. Here's a simple pagination approach to grab the users in segments:
var firebaseRef = new Firebase(URL);
grabNextTen(firebaseRef, null);

function grabNextTen(ref, startAt) {
    ref.limit(startAt ? 11 : 10).startAt(startAt).once('value', function(snap) {
        var lastEntry;
        snap.forEach(function(userSnap) {
            // skip the startAt() entry, which we've already processed
            if (userSnap.name() === startAt) { return; }
            processUser(userSnap);
            lastEntry = userSnap.name();
        });
        // no new entries means we've paged through all users
        if (lastEntry === undefined) { return didTenUsers(startAt); }
        // setTimeout closes the call stack, allowing us to recurse
        // infinitely without a maximum call stack error
        setTimeout(grabNextTen.bind(null, ref, lastEntry));
    });
}

function processUser(snap) {
    console.log('got user', snap.name());
}

function didTenUsers(lastEntry) {
    console.log('finished up to ', lastEntry);
}
A third popular approach would be to store the images in a static cloud asset store such as Amazon S3 and simply store the URLs in Firebase. For large data sets in the hundreds of thousands, this is very economical, since those solutions are a bit cheaper than Firebase storage.
But I'd highly suggest you both read the article on denormalization and invest in that approach first.
