NodeJS: Memory usage grows during recursive scrape until crash - javascript

I am scraping a bunch of stuff from a GET URL API in Node.js. I'm looping over the months of the year times a number of cities. I have a scrapeChunk() function that I call once for each combination of parameters, i.e. {startDate: ..., endDate: ..., location: ...}. Inside it I parse a table with jsdom, convert it to CSV, and append the CSV to a file. At the bottom of all the nested asynchronous callbacks, I finally call scrapeChunk() again with the next parameter combination.
It all works, but the node instance grows and grows in RAM until I get a "FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory" error.
My Question: Am I doing something wrong or is this a limitation of JavaScript and/or the libraries I'm using? Can I somehow get each task to complete, FREE its memory, and then start the next task? I tried a sequence from FuturesJS and it seems to suffer from the same leak.

What is probably happening is that you're building a very deep call tree, and the upper levels of that tree keep references to their data around, so it never gets reclaimed by the garbage collector.
One thing to do is, in your own code, when you invoke the next callback at the end, do it via process.nextTick(). That way the calling function can return and release its variables. Also, make sure you're not piling all your data into a global structure that keeps those references alive forever.
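For illustration, a minimal sketch of that idea (fetchAndParse and the output file name are made up, standing in for the GET + jsdom + CSV work in the question); the next chunk is deferred with process.nextTick() so the current call can return first:

var fs = require('fs');

function scrapeChunk(params, remaining) {
  fetchAndParse(params, function (err, csv) {          // placeholder for the GET + jsdom + CSV step
    if (err) throw err;
    fs.appendFile('out.csv', csv, function (appendErr) {
      if (appendErr) throw appendErr;
      if (remaining.length === 0) return;              // all parameter combinations processed
      process.nextTick(function () {                   // let the current frames unwind before recursing
        scrapeChunk(remaining[0], remaining.slice(1));
      });
    });
  });
}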
Without seeing the code, it's a bit tricky to come up with good responses. But this is not a limitation of node.js or its approach (There are lots of long-running and complex applications out there that use it), but how you make use of it.

You may want to try cheerio instead of jsdom. The author claims it is leaner and 8× faster.

Assuming your description is correct, I think the cause of the problem is obvious - the recursive call to scrapeChunk(). Dispatch the tasks using a loop (or look into node's stream facilities), and ensure that they actually return.
What's going on here sounds something like this:
var list = [1, 2, 3, 4, ... ];

function scrapeCheck(index) {
  // allocate variables, do work, etc, etc
  scrapeCheck(index + 1);
}
With a long enough list, you are guaranteed to exhaust memory, or stack depth, or the heap, or any number of things, depending on what you do during the function body. What I'd suggest is something like this:
var list = [1, 2, 3, 4, ... ];

list.forEach(function scrapeCheck(item, index) {
  // allocate variables, do work, etc, etc
  return;
});
Deeply nested callbacks are an orthogonal problem, but I would suggest you take a look at the async library (in particular async/waterfall), which is both popular and useful for this class of task; see the sketch below.
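As a rough, hedged example (using async.eachSeries rather than waterfall, since the work here is a list of parameter sets; it assumes scrapeChunk takes a completion callback):

var async = require('async'); // npm install async

// paramsList holds the {startDate, endDate, location} objects from the question.
async.eachSeries(paramsList, function (params, done) {
  scrapeChunk(params, done);  // scrapeChunk must call done() or done(err) when its I/O finishes
}, function (err) {
  if (err) console.error('scrape failed:', err);
  else console.log('all chunks written');
});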

This has to do with the recursive call to your function. Put the recursive call inside a
setTimeout(() => {
  recursiveScrapFunHere();
}, 2000);
This way the call is asynchronous and is queued by the event loop's timer mechanism instead of growing the call stack (which is what happens with synchronous calls).
Your parent function (the same function) then runs to completion, and recursiveScrapFunHere() executes outside that call stack.
Here the call will be made after a delay of 2 seconds.
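In Node you can get the same effect without the artificial 2-second delay by using setImmediate; a rough sketch (processOne is a placeholder for one chunk of scraping work):

function scrapeNext(queue) {
  if (queue.length === 0) return;          // nothing left to do
  processOne(queue[0], function () {
    setImmediate(function () {             // schedule the next chunk on the event loop
      scrapeNext(queue.slice(1));
    });
  });
}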

Related

Avoid garbage collection in high performance JavaScript applications [duplicate]

I have a fairly complex Javascript app, which has a main loop that is called 60 times per second. There seems to be a lot of garbage collection going on (based on the 'sawtooth' output from the Memory timeline in the Chrome dev tools) - and this often impacts the performance of the application.
So, I'm trying to research best practices for reducing the amount of work that the garbage collector has to do. (Most of the information I've been able to find on the web regards avoiding memory leaks, which is a slightly different question - my memory is getting freed up, it's just that there's too much garbage collection going on.) I'm assuming that this mostly comes down to reusing objects as much as possible, but of course the devil is in the details.
The app is structured in 'classes' along the lines of John Resig's Simple JavaScript Inheritance.
I think one issue is that some functions can be called thousands of times per second (as they are used hundreds of times during each iteration of the main loop), and perhaps the local working variables in these functions (strings, arrays, etc.) might be the issue.
I'm aware of object pooling for larger/heavier objects (and we use this to a degree), but I'm looking for techniques that can be applied across the board, especially relating to functions that are called very many times in tight loops.
What techniques can I use to reduce the amount of work that the garbage collector must do?
And, perhaps also - what techniques can be employed to identify which objects are being garbage collected the most? (It's a fairly large codebase, so comparing snapshots of the heap has not been very fruitful.)
A lot of the things you need to do to minimize GC churn go against what is considered idiomatic JS in most other scenarios, so please keep in mind the context when judging the advice I give.
Allocation happens in modern interpreters in several places:
When you create an object via new or via literal syntax [...], or {}.
When you concatenate strings.
When you enter a scope that contains function declarations.
When you perform an action that triggers an exception.
When you evaluate a function expression: (function (...) { ... }).
When you perform an operation that coerces to Object like Object(myNumber) or Number.prototype.toString.call(42)
When you call a builtin that does any of these under the hood, like Array.prototype.slice.
When you use arguments to reflect over the parameter list.
When you split a string or match with a regular expression.
Avoid doing those, and pool and reuse objects where possible.
Specifically, look out for opportunities to:
Pull inner functions that have no or few dependencies on closed-over state out into a higher, longer-lived scope. (Some code minifiers like Closure compiler can inline inner functions and might improve your GC performance.)
Avoid using strings to represent structured data or for dynamic addressing. Especially avoid repeatedly parsing using split or regular expression matches, since each requires multiple object allocations. This frequently happens with keys into lookup tables and dynamic DOM node IDs. For example, lookupTable['foo-' + x] and document.getElementById('foo-' + x) both involve an allocation since there is a string concatenation. Often you can attach keys to long-lived objects instead of re-concatenating. Depending on the browsers you need to support, you might be able to use Map to use objects as keys directly (see the sketch after this list).
Avoid catching exceptions on normal code-paths. Instead of try { op(x) } catch (e) { ... }, do if (!opCouldFailOn(x)) { op(x); } else { ... }.
When you can't avoid creating strings, e.g. to pass a message to a server, use a builtin like JSON.stringify which uses an internal native buffer to accumulate content instead of allocating multiple objects.
Avoid using callbacks for high-frequency events, and where you can, pass as a callback a long-lived function (see 1) that recreates state from the message content.
Avoid using arguments since functions that use that have to create an array-like object when called.
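As a rough illustration of points 1 and 2 above (the names here are invented): hoist the per-frame callback out of the hot loop and key a Map by the object itself rather than building a 'foo-' + x string each time.

var handlers = new Map();                  // long-lived table keyed by the objects themselves

function onTick(entity) {                  // hoisted once; no closure is allocated per frame
  var handler = handlers.get(entity);      // no string concatenation to build a key
  if (handler) handler(entity);
}

function mainLoop(entities) {
  for (var i = 0; i < entities.length; i++) {
    onTick(entities[i]);                   // reuses the same function object every iteration
  }
}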
I suggested using JSON.stringify to create outgoing network messages. Parsing input messages using JSON.parse obviously involves allocation, and lots of it for large messages. If you can represent your incoming messages as arrays of primitives, then you can save a lot of allocations. The only other builtin around which you can build a parser that does not allocate is String.prototype.charCodeAt. A parser for a complex format that only uses that is going to be hellish to read though.
The Chrome developer tools have a very nice feature for tracing memory allocation. It's called the Memory Timeline. This article describes some details. I suppose this is what you're talking about re the "sawtooth"? This is normal behavior for most GC'ed runtimes. Allocation proceeds until a usage threshold is reached triggering a collection. Normally there are different kinds of collections at different thresholds.
Garbage collections are included in the event list associated with the trace along with their duration. On my rather old notebook, ephemeral collections are occurring at about 4Mb and take 30ms. This is 2 of your 60Hz loop iterations. If this is an animation, 30ms collections are probably causing stutter. You should start here to see what's going on in your environment: where the collection threshold is and how long your collections are taking. This gives you a reference point to assess optimizations. But you probably won't do better than to decrease the frequency of the stutter by slowing the allocation rate, lengthening the interval between collections.
The next step is to use the Profiles | Record Heap Allocations feature to generate a catalog of allocations by record type. This will quickly show which object types are consuming the most memory during the trace period, which is equivalent to allocation rate. Focus on these in descending order of rate.
The techniques are not rocket science. Avoid boxed objects when you can do with an unboxed one. Use global variables to hold and reuse single boxed objects rather than allocating fresh ones in each iteration. Pool common object types in free lists rather than abandoning them. Cache string concatenation results that are likely reusable in future iterations. Avoid allocation just to return function results by setting variables in an enclosing scope instead. You will have to consider each object type in its own context to find the best strategy. If you need help with specifics, post an edit describing details of the challenge you're looking at.
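For instance, a minimal free-list pool along those lines (the Vec2-like shape is just for illustration):

var vecPool = [];                          // free list of reusable {x, y} objects

function acquireVec(x, y) {
  var v = vecPool.length ? vecPool.pop() : { x: 0, y: 0 };
  v.x = x;
  v.y = y;
  return v;
}

function releaseVec(v) {
  vecPool.push(v);                         // hand the object back instead of letting it become garbage
}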
I advise against perverting your normal coding style throughout an application in a shotgun attempt to produce less garbage. This is for the same reason you should not optimize for speed prematurely. Most of your effort plus much of the added complexity and obscurity of code will be meaningless.
As a general principle you'd want to cache as much as possible and do as little creating and destroying for each run of your loop.
The first thing that pops in my head is to reduce the use of anonymous functions (if you have any) inside your main loop. Also it'd be easy to fall into the trap of creating and destroying objects that are passed into other functions. I'm by no means a javascript expert, but I would imagine that this:
var options = {var1: value1, var2: value2, ChangingVariable: value3};

function loopfunc() {
  // do something
}

while (true) {
  $.each(listofthings, loopfunc);

  options.ChangingVariable = newvalue;
  someOtherFunction(options);
}
would run much faster than this:
while (true) {
  $.each(listofthings, function() {
    // do something on the list
  });

  someOtherFunction({
    var1: value1,
    var2: value2,
    ChangingVariable: newvalue
  });
}
Is there ever any downtime for your program? Maybe you need it to run smoothly for a second or two (e.g. for an animation) and then it has more time to process? If this is the case, I could see taking objects that would normally be garbage collected throughout the animation and keeping references to them in some global object. Then, when the animation ends, you can clear all the references and let the garbage collector do its work.
Sorry if this is all a bit trivial compared to what you've already tried and thought of.
I'd create one or a few objects in the global scope (where I'm sure the garbage collector is not allowed to touch them), then refactor my solution to use those objects to get the job done instead of using local variables.
Of course it can't be done everywhere in the code, but generally that's my way of avoiding the garbage collector.
P.S. It might make that specific part of the code a little less maintainable.

Is this recursion or not

function x() {
  window.setTimeout(function() {
    foo();
    if (notDone()) {
      x();
    }
  }, 1000);
}
My concern being unbounded stack growth. I think this is not recursion since the x() call in the timer results in a brand new set of stack frames based on a new dispatch in the JS engine.
But reading the code as an old-fashioned, non-JS guy, it makes me feel uneasy.
One extra side question: what happens if I schedule something (based on a calculation rather than a literal) that results in no delay? Would it execute in place, would it be immediately executed asynchronously, or is that implementation defined?
It's not - I call it "pseudo-recursion".
The rationale is that it kind of looks like recursion, except that the function always correctly terminates immediately, hence unwinding the stack. It's then the JS event loop that triggers the next invocation.
It is recursive in the sense that it is a function that calls itself, but I believe you are right about the stack frames being gone. Under normal execution the stack will just show that the function was invoked by setTimeout. The Chrome debugger, for example, will let you keep stack traces across async execution; I am not sure how they do it, but the engine can keep track of the stack somehow.
No matter how the delay value is calculated, the execution will still be async.
setTimeout(function() { console.log('timeout'); }, 0);
console.log('executing');
will output:
executing
undefined
timeout
One extra side question: what happens if I schedule something (based on a calculation rather than a literal) that results in no delay? Would it execute in place, would it be immediately executed asynchronously, or is that implementation defined?
Still asynchronous. It's just that the timer will be processed immediately once the function returns and the JavaScript engine can process events on the event loop.
Recursion has many different definitions, but if we define it as the willful (as opposed to bug-induced) use of a function that calls itself repeatedly in order to solve a programming problem (which seems to be a common one in a Javascript context), it absolutely is.
The real question is whether or not this could crash the browser. The answer is no in my experience...at least not on Firefox or Chrome. Whether good practice or not, what you've got there is a pretty common Javascript pattern, used in a lot of semi-real-time web applications. Notably, Twitter used to do something very similar to provide users with semi-real-time feed updates (I doubt they still do it now that they're using a Node server).
Also, out of curiosity, I ran your script with the schedule reset to run every 50 ms, and experienced no slowdowns.

Tasks in javascript?

Essence of the question
The real reason I ask this question is not to get my particular problem solved; I want to know how to work with tasks in JavaScript. I don't need thread parallelism or anything like that. There are two parts to computing something: IO and CPU. I want the CPU work to run in the time between an AJAX request being sent and its answer coming back from the server. There is an obstacle: from one function I start many tasks, and that function must produce a task that waits for all of the started tasks, processes their results, and returns some value. That's all I want. Of course, if you post another way to solve my problem, I will vote for your answer and may accept it as the solution if there are no other answers about tasks.
Why do I describe my problem instead of just asking about tasks? Ask the guys who downvoted and closed this question a while ago.
Problem
My problem: I want to traverse a tree in JavaScript to find the shortest possible parsing. I have a dictionary of words stored in the form of a trie. When a user gives an input string, I need the number of words in the shortest combination of dictionary words that makes up the input string.
Example:
My dictionary contains these words: my, code, js, myj, scode
A user types myjscode
I traverse my tree of words and find that the input matches myj + scode and my + js + code
Since the first parsing is the shortest, my function returns 2 (the number of words in the shortest parsing)
My Problem
My dictionary tree is huge, so I can't load it fully. To fix this, I want to do some lazy-loading. Each node of the tree is either loaded and points to child nodes or is not loaded yet and contains a link to the data to be loaded.
So, I need to make node lookup calls while I'm traversing the tree. Since these calls are asynchronous, I want to be able to explore other traversals while tree nodes are loading. This will improve the response time for the user.
How I want to solve this problem:
My lookup function will return a task. I can call that task and get its results. Once I traverse to the loaded node, I can then make multiple calls to load child nodes and each call returns a task. Since these "tasks" are individual bits of functionality, I can queue them up and execute them while I'm waiting for ajax calls to return.
So, I want to know which library I can use, or how I can emulate tasks in javascript (I'm thinking of tasks as they exist in C#).
There is a restriction: no server-side code, only AJAX requests to precompiled dictionaries in JavaScript. Why? It has to be used as a password complexity checker.
You say in your question:
Of course, if you post another way to solve my problem, I will vote for your answer and can set it as solution if there are no other answers about tasks.
Good; sorry, but I don't think that C#-style tasks are the right solution here.
I'll accept (although I don't think it's correct) your assertion that for security reasons you have to do everything client-side. As an aside, might I point out that if you are scared of somebody snooping (because you have a security weakness) then passing lots of requests for part of the password is just as insecure as passing one request? Sorry, I appear to have done so without consent!
Nonetheless, I will attempt to answer with a broad outline how I would approach your problem if, indeed, you had to do it in JavaScript; I would use promises. Probably jQuery's Deferred implementation, to be specific. I'll give a very rough pseudo-code outline here.
Overview
You start with a nicely structured Trie. Using recursion I would build up a nicely structured "solution tree", which would be a nested array of arrays; this would give the flexibility of being able to respond to the user with a specific message... however, since you seem prepared to lose that bonus and only want a single digit as a solution, I will outline a slightly simpler approach that you could, if needed, adapt to return arrays of the form (from your example):
[["myj"],["scode"],["my"],["js"],["code"]]
I mention this structure here also, partly, as it helps explain the approach I am adopting.
Notes
I will refer to "nodes" and "valueNodes" in your Trie. I consider "nodes" to be anything and "valueNodes" to be nodes with values.
The recursive promiseToResolveRemainder will resolve 0 for "couldn't do it"; it will only reject the promise if something went wrong (say, the webservice wasn't available).
Dodgy, hacky, untested Pseudo-code
var minDepth = 0; // Zero value represents failure to match (Impossible? Not if you are accepting unicode passwords!)

function promiseToResolveRemainder(remainder, fragmentSoFar) {
  var deferred = new jQuery.Deferred();
  var nextChar = remainder.substring(0, 1);
  if (remainder.length == 1) {
    // Insert code here to:
    // Test for fragmentSoFar+nextChar being a valueNode.
    // If so, resolve(1)... otherwise resolve(0)
    // !!Note that, subtly, this catches the case where fragmentSoFar is an empty string :)
    return deferred.promise();
  }
  remainder = remainder.substring(1);
  promiseToFindValueNode(fragmentSoFar + nextChar).then(
    function (success) {
      // We know that we *could* terminate the growing fragment here and proceed
      // But we could also proceed from here by adding to the fragment
      var firstPathResolvedIn = 0;
      var secondPathResolvedIn = 0;
      promiseToResolveRemainder(remainder, '').then(
        function (resolvedIn) {
          firstPathResolvedIn = resolvedIn + 1;
        }
      ).then(
        promiseToResolveRemainder(remainder, fragmentSoFar + nextChar).then(
          function (resolvedIn) {
            secondPathResolvedIn = resolvedIn;
            if (firstPathResolvedIn !== 0 && secondPathResolvedIn !== 0) {
              deferred.resolve(Math.min(firstPathResolvedIn, secondPathResolvedIn));
            }
            deferred.resolve(Math.max(firstPathResolvedIn, secondPathResolvedIn)); // Sloppy, but promises cannot be resolved twice, so no sweat (I know, that's a *dirty* trick!)
          }
        )
      );
    },
    function (failure) {
      // We know that we *need* at least a node or this call to
      // promiseToResolveRemainder at this iteration has been a failure.
      promiseToFindNode(fragmentSoFar + nextChar).then(
        function (resolvedIn) {
          // ok, so we *could* proceed from here by adding to the fragment
          promiseToResolveRemainder(remainder, fragmentSoFar + nextChar).then(
            function (resolvedIn) {
              deferred.resolve(resolvedIn);
            }
          );
        },
        function (failedBecause) {
          // ooops! We've hit a dead end, we can't complete from here.
          deferred.resolve(0);
        }
      );
    }
  );
  return deferred.promise();
}
I am not particularly proud of this untested kludgy attempt at code (and I'm not about to write your solution for you!), but I am proud of the approach and am sure that it will yield a robust, reliable and efficient solution to your problem. Unfortunately, you seem to be dependent on a lot of webService calls... I would be very tempted, therefore, to abstract away any calls to the webService and check them through a local cache first.
Not sure this is what you are looking for, but you might try WebWorkers.
A simple example is at: https://github.com/afshinm/50k but you can google for more.
Note: Web Worker support is highly browser dependent, and a worker is not guaranteed to run on a separate core of the machine.
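A minimal sketch of the idea (the file names and findShortestParsing are made up; the heavy trie traversal runs off the main thread and posts its result back):

// main.js
var worker = new Worker('trie-worker.js');
worker.onmessage = function (e) {
  console.log('shortest parsing length:', e.data);
};
worker.postMessage('myjscode');

// trie-worker.js
onmessage = function (e) {
  var result = findShortestParsing(e.data); // your CPU-bound traversal code goes here
  postMessage(result);
};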

JavaScript execution hangs page momentarily

I have a web application which uses jQuery/JavaScript heavily. It holds a large array in memory, and the user filters it by typing in a textbox.
Problem: When the filtering algorithm runs, the application becomes non-responsive and the browser may even ask the user whether to let the script continue.
Optimally, I would like the filter function to run in a separate thread, to avoid non-responsiveness. Is this possible in any way? Alternatively, I would like to show a rotating hourglass or similar, but browsers seem unable to display animated GIFs when running heavy scripts.
What is the best way of attacking the problem?
Browsers execute scripts in the main event-processing thread. This means any long-running script can hold up the browser's event queue.
You should split your filter logic into chunks and run them in timeout callbacks. You can use a gap of 0 ms between executions; 0 ms is just a suggestion to the browser, but the browser will use the gap between subsequent callbacks to clear its event queue. Timeouts are generally how long-running scripts ought to be executed in the browser environment to prevent the page from "hanging"; see the sketch below.
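For the filtering case in the question, that could look roughly like this (the function and parameter names are assumptions, not an existing API):

function filterInChunks(items, predicate, chunkSize, onDone) {
  var results = [];
  var i = 0;
  (function next() {
    var end = Math.min(i + chunkSize, items.length);
    for (; i < end; i++) {
      if (predicate(items[i])) results.push(items[i]);
    }
    if (i < items.length) {
      setTimeout(next, 0);                 // yield so the browser can repaint and handle input
    } else {
      onDone(results);
    }
  })();
}

// Usage: filterInChunks(bigArray, matchesQuery, 500, renderResults);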
This type of job is what Web Workers were designed for, support is patchy, but improving.
Expanding on my comment from earlier, given that you are processing an array you are probably using a for loop. You can easily refactor a simple for loop to use setTimeout() so that the work is broken up into chunks and the browser gets a chance to handle screen paints and user interaction between each chunk. Simple example:
// Generic function to execute a callback a given number
// of times with a given delay between each execution
function timeoutLoop(fn, startIndex, endIndex, delay) {
  function doIteration() {
    if (startIndex < endIndex) {
      fn(startIndex++);
      setTimeout(doIteration, delay);
    }
  }
  doIteration();
}

// pass your function as callback
timeoutLoop(function(i) {
  // current iteration processing here, use i if needed
}, 0, 100, 0);
Demo: http://jsfiddle.net/nnnnnn/LeZxM/1/
That's just something I cobbled together to show the general idea, but obviously it can be expanded in various ways, e.g., you might like to add a chunkSize parameter to timeoutLoop() to say how many loop iterations to do in each timeout (adding a conventional loop around the call to fn()), etc.
