NodeJS - Memory/CPU management with MongoJS Stream

I'm parsing a fairly large dataset from MongoDB (of about 40,000 documents, each with a decent amount of data inside).
The stream is being accessed like so:
var cursor = db.domains.find({ html: { $exists: true } });

cursor.on('data', function(rec) {
    i++;
    var url = rec.domain;
    var $ = cheerio.load(rec.html);
    checkList($, rec, url, i);
    // This "checkList" function parses the HTML data with Cheerio to find
    // different elements on the page. Lots of if/else statements.
});

cursor.on('end', function(){
    console.log("Streamed all objects!");
})
Each record gets parsed with Cheerio (the record contains HTML data from a page scraped earlier), then I process the Cheerio data to look for various selectors, and the results are saved back to MongoDB.
For the first ~2,000 objects the data is parsed quite quickly (in ~30 seconds). After that it becomes far slower, with around 50 records being parsed per second.
Looking in my MacBook Air's Activity Monitor I see that it's not using a crazy amount of memory (226.5 MB / 8 GB RAM), but it is using a whole lot of CPU (io.js is taking up 99% of my CPU).
Is this a possible memory leak? The checkList function isn't particularly intensive (or at least, as far as I can tell - there are quite a few nested if/else statements but not much else).
Am I meant to be clearing my variables after they're used, like setting $ = '' or similar? Any other reason why Node would be using so much CPU?

You basically need to "pause" the stream or otherwise "throttle" it so it does not execute on every data item received straight away. The code in the "data" event handler does not wait for completion before the next event is fired, unless you stop the events from emitting.
var cursor = db.domains.find({ html: { $exists: true } });

cursor.on('data', function(rec) {
    cursor.pause(); // stop processing new events
    i++;
    var url = rec.domain;
    var $ = cheerio.load(rec.html);
    checkList($, rec, url, i);
    // if checkList() is synchronous, then resume here
    cursor.resume(); // start events again
});

cursor.on('end', function(){
    console.log("Streamed all objects!");
})
If checkList() contains async methods, then pass in the cursor:
checkList($, rec, url, i, cursor);
And process the "resume" inside:
function checkList(data, rec, url, i, cursor) {
    somethingAsync(args, function(err, result) {
        // We're done
        cursor.resume(); // start events again
    })
}
The "pause" stops the events emitting from the stream until the "resume" is called. This means your operations don't "stack up" in memory and wait for each to complete.
You probably want more advanced flow control for some parallel processing, but this is basically how you do it with streams.
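For example, here is a minimal sketch of limited parallelism using a simple in-flight counter. The CONCURRENCY limit and the callback-taking checkListAsync are illustrative assumptions, not part of the original code:

var CONCURRENCY = 5; // assumed limit; tune to your workload
var inFlight = 0;
var i = 0;

var cursor = db.domains.find({ html: { $exists: true } });

cursor.on('data', function(rec) {
    i++;
    var $ = cheerio.load(rec.html);
    inFlight++;
    if (inFlight >= CONCURRENCY) cursor.pause(); // cap the pending work

    // checkListAsync is a hypothetical async form of checkList that
    // invokes its callback when the record has been fully processed.
    checkListAsync($, rec, rec.domain, i, function() {
        inFlight--;
        cursor.resume(); // a no-op if the stream was never paused
    });
});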

Related

How to efficiently stream a real-time chart from a local data file

Complete noob picking up NodeJS over the last few days here, and I've gotten myself in big trouble, it looks like. I've currently got a working Node.js + Express server instance, running on a Raspberry Pi, acting as a web interface for a local data acquisition script ("the DAQ"). When executed, the script writes data to a local file on the Pi in .csv format, writing in real time every second.
My Node app is a simple web interface to start (on-click) the data acquisition script, as well as to plot previously acquired data logs and visualize the data being actively collected in real time. Plotting old logs was simple, and I wrote a JS function (using Plotly + d3) to read a local csv file via an AJAX call and plot it - using this script as a starting point, but using the logs served by Express rather than an external file.
When I went to translate this into a real-time plot, I started out using the setInterval() method to update the graph periodically, based on other examples. After dealing with a few unwanted recursion issues and adjusting the interval to a more reasonable setting, I eliminated the memory/traffic issues that were crashing the browser after a minute or two, and things are mostly stable.
However, I need help with two things:
1. Improving the efficiency of my first-attempt approach: the acquisition script absolutely needs to write to file every second, but considering that a typical run might last 1-2 weeks, the file being requested on every interval loop will quickly start to balloon. I'm completely new to Node/Express, so I'm sure there's a much better way of doing the real-time rendering aspect of this - that's the real issue here. Any pointers on a better way to go about this would be massively helpful!
2. Right now, the killDAQ() call issued by the "Stop" button kills the underlying Python process writing the data to disk. Is there a way to hook into that same button click to also terminate the setInterval() loop updating the graph? There's no need for it to keep updating after the data acquisition has been stopped, so having the single click do double duty would be ideal. I think that setting up a listener or a req/res approach would be an option, but pointers in the right direction would be massively helpful.
(Edit: I solved #2, using global window. variables. It's a hack, but it seems to work:
window.refreshIntervalId = setInterval(foo);
...
clearInterval(window.refreshIntervalId);
)
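A minimal sketch of folding that hack into killDAQ() itself, so the one click genuinely does double duty (reusing the window.refreshIntervalId global from the edit above):

function killDAQ() {
    fetch('/stop_daq')
        .then(function(response) {
            alert('DAQ has stopped!');
        });
    // Also stop the chart-refresh loop started in makeCallToDAQ()
    clearInterval(window.refreshIntervalId);
}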
Thanks so much for the help!
MWE:
html (using Pug as a template engine):
doctype html
html
  body.default
    .container-fluid
      .row
        .col-md-5
          .row.text-center
            .col-md-6
              button#start_button(type="button", onclick="makeCallToDAQ()") Start Acquisition
            .col-md-6
              button#stop_button(type="button", onclick="killDAQ()") Stop Acquisition
        .col-md-7
          #myDAQDiv(style='width: 980px; height: 500px;')
javascript (start/stop acquisition):
function makeCallToDAQ() {
    fetch('/start_daq', {
        // call to app to start the acquisition script
    })
    .then(function() { console.log(dateTime); }) // wrapped in a function so it runs when the fetch resolves
    .then(function(response) {
        console.log(response)
        setInterval(function(){ callPlotly(dateTime.concat('.csv')); }, 5000);
    });
}

function killDAQ() {
    fetch('/stop_daq')
    // kills the process
    .then(function(response) {
        // Use the response sent here
        alert('DAQ has stopped!')
    })
}
javascript (call to Plotly for plotting):
function callPlotly(filename) {
    var csv_filename = filename;
    console.log(csv_filename)

    function makeplot(csv_filename) {
        // Read data via AJAX call and grab header names
        var headerNames = [];
        d3.csv(csv_filename, function(error, data) {
            headerNames = d3.keys(data[0]);
            processData(data, headerNames)
        });
    };

    function processData(allRows, headerNames) {
        // Plot data from relevant columns
        var plotDiv = document.getElementById("plot");
        var traces = [{
            x: x,
            y: y
        }];
        Plotly.newPlot('myDAQDiv', traces, plotting_options);
    };

    makeplot(filename);
}
node.js (the actual Node app):
// Start the DAQ
app.use(express.json());
var isDaqRunning = true;
var pythonPID = 0;
const { spawn } = require('child_process')
var process;

app.post('/start_daq', function(req, res) {
    isDaqRunning = true;
    // Call the python script here; assign to the outer variable so /stop_daq can reach it
    process = spawn('python', ['../private/BIC_script.py', arg1, arg2])
    pythonPID = process.pid;
    process.stdout.on('data', (myData) => {
        res.send("Done!")
    })
    process.stderr.on('data', (myErr) => {
        // If anything gets written to stderr, it'll be in the myErr variable
    })
    res.status(200).send(); //.json(result);
})

// Stop the DAQ
app.get('/stop_daq', function(req, res) {
    isDaqRunning = false;
    process.on('close', (code, signal) => {
        console.log(`child process terminated due to receipt of signal ${signal}`);
    });
    // Send SIGTERM to process
    process.kill('SIGTERM');
    res.status(200).send();
})

What's the correct way to handle removing a potentially busy file in NodeJS?

I have a NodeJS server managing some files. It's going to watch for a known filename from an external process and, once the file appears, read it and then delete it. However, sometimes the read/delete is attempted before the file has been "unlocked" from its previous use, so it will likely fail occasionally. What I'd like to do is retry this file ASAP, either as soon as it's freed or continuously at a fast pace.
I'd rather avoid a long sleep where possible, because this needs to be handled ASAP and every second counts.
fs.watchFile(input_json_file, {interval: 10}, function(current_stats, previous_stats) {
    var json_data = "";
    try {
        var file_cont = fs.readFileSync(input_json_file); // < TODO: async this
        json_data = JSON.parse(file_cont.toString());
        fs.unlink(input_json_file);
    } catch (error) {
        console.log("The JSON in the file could not be parsed. File will continue to be watched.");
        console.log(error);
        return;
    }
    // Else, this has loaded properly.
    fs.unwatchFile(input_json_file);
    // ... do other things with the file's data.
});

// set a timeout for the file watching, just in case
setTimeout(fs.unwatchFile, CLEANUP_TIMEOUT, input_json_file);
I expect "EBUSY: resource busy or locked" to turn up occasionally, but fs.watchFile isn't always called when the file is unlocked.
I thought of creating a function and then calling it with a delay of 1-10ms, where it could call itself if that fails too, but that feels like a fast route to a... cough stack overflow.
I'd also like to steer clear of synchronous methods so that this scales nicely, but being relatively new to NodeJS all the callbacks are starting to turn into a maze.
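Worth noting: retrying via setTimeout does not grow the call stack, since each attempt starts from a fresh event-loop tick. A minimal sketch, with the 10 ms delay and the attempt cap picked arbitrarily for illustration:

function readAndRemove(path, attemptsLeft) {
    fs.readFile(path, function(err, file_cont) {
        if (err) {
            // EBUSY etc.: schedule another attempt; nothing accumulates on the stack
            if (attemptsLeft > 0) setTimeout(readAndRemove, 10, path, attemptsLeft - 1);
            return;
        }
        var json_data;
        try {
            json_data = JSON.parse(file_cont.toString());
        } catch (parseError) {
            return; // not valid JSON yet; the watcher will fire again
        }
        fs.unlink(path, function(unlinkErr) { /* log if needed */ });
        // ... do other things with json_data.
    });
}

readAndRemove(input_json_file, 100);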
This may be overkill for this situation, but you can create your own filesystem with full control; other programs would then write data directly to your program. Just search for FUSE and fuse-bindings.

Running JS in a killable 'thread'; detecting and canceling long-running processes

Summary: How can I execute a JavaScript function, but then "execute" (kill) it if it does not finish within a timeframe (e.g. 2 seconds)?
Details
I'm writing a web application for interactively writing and testing PEG grammars. Unfortunately, the JavaScript library I'm using for PEG parsing has a 'bug' where certain poorly-written or unfinished grammars cause infinite execution (not even detected by some browsers). You can be happily typing along, working on your grammar, when suddenly the browser locks up and you lose all your hard work.
Right now my code is (very simplified):
grammarTextarea.onchange = generateParserAndThenParseInputAndThenUpdateThePage;
I'd like to change it to something like:
grammarTextarea.onchange = function(){
    var result = runTimeLimited( generateParserAndThenParseInput, 2000 );
    if (result) updateThePage();
};
I've considered using an iframe or other tab/window to execute the content, but even this messy solution is not guaranteed to work in the latest versions of major browsers. However, I'm happy to accept a solution that works only in latest versions of Safari, Chrome, and Firefox.
Web workers provide this capability—as long as the long-running function does not require access to the window or document or closures—albeit in a somewhat-cumbersome manner. Here's the solution I ended up with:
main.js
var worker, activeMsgs, userTypingTimeout, deathRowTimer;
killWorker(); // Also creates the first one

grammarTextarea.onchange = grammarTextarea.oninput = function(){
    // Wait until the user has not typed for 500ms before parsing
    clearTimeout(userTypingTimeout);
    userTypingTimeout = setTimeout(askWorkerToParse, 500);
}

function askWorkerToParse(){
    worker.postMessage({action:'parseInput', input:grammarTextarea.value});
    activeMsgs++;                                 // Another message is in flight
    clearTimeout(deathRowTimer);                  // Restart the timer
    deathRowTimer = setTimeout(killWorker, 2000); // It must finish quickly
};

function killWorker(){
    if (worker) worker.terminate();  // This kills the thread
    worker = new Worker('worker.js') // Create a new worker thread
    activeMsgs = 0;                  // No messages are pending on this new one
    worker.addEventListener('message', handleWorkerResponse, false);
}

function handleWorkerResponse(evt){
    // If this is the last message, it responded in time: it gets to live.
    if (--activeMsgs==0) clearTimeout(deathRowTimer);
    // **Process the evt.data.results from the worker**
}
worker.js
importScripts('utils.js') // Each worker is a blank slate; must load libs

self.addEventListener('message', function(evt){
    var data = evt.data;
    switch(data.action){
        case 'parseInput':
            // Actually do the work (which sometimes goes bad and locks up)
            var parseResults = parse(data.input);
            // Send the results back to the main thread.
            self.postMessage({kind:'parse-results', results:parseResults});
            break;
    }
}, false);

How to structure my code to return a callback?

So I've been stuck on this for quite a while. I asked a similar question here: How exactly does done() work and how can I loop executions inside done()?
but I guess my problem has changed a bit.
So the thing is, I'm loading a lot of streams and it's taking a while to process them all. To make up for that, I want to at least load the streams that have already been processed onto my webpage, and continue processing the stream of tweets at the same time.
loadTweets: function(username) {
    $.ajax({
        url: '/api/1.0/tweetsForUsername.php?username=' + username
    }).done(function (data) {
        var json = jQuery.parseJSON(data);
        var jsonTweets = json['tweets'];
        $.Mustache.load('/mustaches.php', function() {
            for (var i = 0; i < jsonTweets.length; i++) {
                var tweet = jsonTweets[i];
                var optional_id = '_user_tweets';
                $('#all-user-tweets').mustache('tweets_tweet', { tweet: tweet, optional_id: optional_id });
                configureTweetSentiment(tweet);
                configureTweetView(tweet);
            }
        });
    });
}
This is pretty much the structure of my code right now. I guess the problem is the for loop, because nothing will display until the for loop is done. So I have two questions.
How can I get the stream of tweets to display on my website as they're processed?
How can I make sure the Mustache.load() is only executed once while doing this?
The problem is that UI manipulation and JS operations all run in the same thread, so to solve this you should use a setTimeout/setInterval timer so that the JS operations are queued behind the pending UI operations. You can also pass a parameter for the time interval (around 4 ms) so that browsers with a slower JS engine can also perform smoothly.
...
var i = 0;
var timer = setInterval(function() {
    var tweet = jsonTweets[i++];
    var optional_id = '_user_tweets';
    $('#all-user-tweets').mustache('tweets_tweet', {
        tweet: tweet,
        optional_id: optional_id
    });
    configureTweetSentiment(tweet);
    configureTweetView(tweet);
    if(i === jsonTweets.length){
        clearInterval(timer);
    }
}, 4); // Interval between loading tweets
...
NOTE
The solution is based on the following assumptions:
- You are manipulating the DOM with the configureTweetSentiment and configureTweetView methods.
- Ideally, the solution provided above would not be the best one. Instead you should build all the HTML in JavaScript first and append the final HTML string to a div at the end; you would see a drastic change in performance (seriously!) - see the sketch after this list.
- You don't want to use web workers because they are not supported in old browsers. If that's not the case, and you are not manipulating the DOM in the configure methods, then web workers are the way to go for data-intensive operations.
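A minimal sketch of that build-first, append-once approach (renderTweetHtml is a hypothetical helper standing in for the per-tweet Mustache render):

var html = '';
for (var i = 0; i < jsonTweets.length; i++) {
    html += renderTweetHtml(jsonTweets[i]); // hypothetical: returns one tweet's HTML string
}
$('#all-user-tweets').append(html); // one DOM write instead of one per tweet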

Background tasks in JavaScript that do not interfere with UI interaction

I am working on a mobile App implemented in JavaScript that has to do lots of background stuff -- mainly fetching data from a server (via JSONP) and writing it into a database using local storage. In the foreground the user may navigate on the locally stored data and thus wants a smoothly reacting App.
This interaction (fetching data from the server via JSONP and, after doing some minor work with it, storing it in the local database) is done asynchronously, but because of the DOM manipulation necessary for JSONP and the database interaction, I am not able to do this work with Web Workers. Thus I run into the problem of blocking the JavaScript event queue with lots of "background processing requests", and the App reacts very slowly (or even locks up completely).
Here is a small sketch of the way I am doing the background stuff:
var pendingTasksOnNewObjects = new Object();

// add callbacks to queue
function addTaskToQueue(id, callback) {
    if (!pendingTasksOnNewObjects[id]) {
        pendingTasksOnNewObjects[id] = new Array();
    }
    pendingTasksOnNewObjects[id].push(callback);
}

// example of such a pending task:
var myExampleTask = function () {
    this.run = function(jsonObject) {
        // do interesting stuff here as soon as a specific new object arrives,
        // i.e. update some elements in the DOM to respect the newly arrived object
    };
};
addTaskToQueue("id12345", myExampleTask);

// method to fetch documents from the server via JSONP
function getDocuments(idsAsCommaSeparatedString) {
    var elem;
    elem = document.createElement("script");
    elem.setAttribute("type", "text/javascript");
    elem.setAttribute("src", "http://serverurl/servlet/myServlet/documents/" + idsAsCommaSeparatedString);
    document.head.appendChild(elem);
}

// "callback" method that is used in the JSONP result to process the data
function locallyStoreDocuments(jsonArray) {
    var i;
    for (i = 0; i < jsonArray.length; i++) {
        var obj = jsonArray[i];
        storeObjectInDatabase(obj);
        runTasks(obj.docID, obj);
    }
    return true;
}

// process tasks when new object arrives
function runTasks(id, jsonObject) {
    if (pendingTasksOnNewObjects[id]) {
        while (pendingTasksOnNewObjects[id][0]) {
            pendingTasksOnNewObjects[id][0].run(jsonObject);
            pendingTasksOnNewObjects[id].shift();
        }
    }
    return true;
}
I already looked around and read some things about the event processing mechanisms in JavaScript (e.g. John Resig's description of working with timers), but what I really want to do is the following: check if there is something to do in the event queue; if so, wait 100ms and check again; if nothing is currently waiting in the queue, process a small amount of data and check again.
As far as I read, this is not possible, since JavaScript does not have direct access to the event queue.
So are there any design ideas on how to optimize my code to get closer to the goal of not interfering with UI events, giving the user a smooth App interaction?
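A common workaround, in the spirit of the timer techniques referenced above, is to process incoming data in small chunks and yield back to the event loop between chunks. A minimal sketch reusing the question's own helpers, with the chunk size chosen arbitrarily:

// Process jsonArray in small slices so queued UI events can run in between.
function locallyStoreDocumentsChunked(jsonArray) {
    var CHUNK_SIZE = 10; // assumed; tune for your workload
    var index = 0;

    function processChunk() {
        var end = Math.min(index + CHUNK_SIZE, jsonArray.length);
        for (; index < end; index++) {
            var obj = jsonArray[index];
            storeObjectInDatabase(obj);
            runTasks(obj.docID, obj);
        }
        // Yield to the event queue, then continue with the next slice
        if (index < jsonArray.length) setTimeout(processChunk, 0);
    }

    processChunk();
}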
Web Workers seem to be available in iOS 5, http://caniuse.com/webworkers
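If support is in doubt, a quick runtime feature check is cheap; a minimal sketch, with background.js as a hypothetical worker script:

if (typeof window.Worker !== 'undefined') {
    var worker = new Worker('background.js'); // offload the heavy processing here
} else {
    // fall back to chunked processing on the main thread (see the sketch above)
}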
