I'm running a C tool compiled to wasm using emscripten. The tool works on very large files. When running the tool normally on the CLI, operations often stream their results, and the program is terminated early once enough data has been returned. For example you might run:
./tool <input-file> | head -n 100
The tool would terminate after it detects stdout has been closed by head, effectively only reading a small portion of the input.
The problem is that stdout with emscripten (implemented by overriding Module.print) appears to be asynchronous, so the tool runs to completion every time. Is there a way to make it block on stdout so that I can read only as much as I need and then terminate the tool?
You can redirect the output to a file and then put the task in the background. Meanwhile monitor the log file. When it reaches 100 lines kill the child pid.
Something like this should work:
rm -f /tmp/log
touch /tmp/log
./tool input_file > /tmp/log 2>&1 &
pid=$!
while sleep 1
do
    ret=$(wc -l < /tmp/log)
    if [ "$ret" -ge 100 ]
    then
        kill $pid
        exit 0
    fi
done
I put in the touch to create an empty log file; this avoids a race condition where we read the log before it is created by the child process. wc -l returns the number of lines. Change the sleep value to whatever is appropriate for your test time.
You need to implement a way to tell your tool to stop. There are many ways to do this. Two that come to mind:
Have it take an extra argument indicating the number of lines of output after which it should stop, then call it with this argument. This is the simplest approach and the easiest to implement. The main drawback is that you need to know the maximum number of lines ahead of time, so that you can include it in your call arguments, and the tool must be able to translate that accurately into a stopping point. But if that is the case, which it sounds like it is, then just do this and you're done.
But suppose your tool does not know how to count lines: perhaps it just outputs blobs, or perhaps some downstream filter only counts certain lines towards your maximum. If your tool needs some external signal to tell it when to stop, this approach will not work, in which case, read on...
Use a callback. Create and export another function, e.g. tool_stop(). In your Module.print override, at the appropriate time, call tool_stop(). In your C code, create a flag, let's call it stop_processing, that is visible both to your tool command and to the function that processes the input. In your processing loop (e.g. before each fread call), check this flag, and if it is set, stop processing. When I say "visible", that could mean you make it a global variable, if you'll never have more than one concurrent invocation running, or make it part of some context data that is allocated via an init call, passed in whenever a process() or stop() call is made, and then deallocated via a destroy() call. The latter approach is generally cleaner, more scalable and more maintainable, though it is a bit more work, since you have to add init and destroy and thread a context pointer through each of your function definitions and calls.
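A sketch of what the JavaScript side might look like, assuming the C side defines tool_stop() to set the flag and you export it at link time (e.g. -s EXPORTED_FUNCTIONS='["_main","_tool_stop"]'); the line cutoff here is hypothetical:

var linesSeen = 0;
var maxLines = 100; // hypothetical cutoff for this run

Module.print = function (text) {
    console.log(text);
    linesSeen++;
    if (linesSeen === maxLines) {
        // ask the C processing loop to stop the next time it checks the flag
        Module.ccall('tool_stop', null, [], []);
    }
};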
tl;dr:
Calling the asynchronous fs.writeFile from asynchronous events (and perhaps even from just a plain old loop) and then calling process.exit() successfully opens the files but fails to flush the data into the files. The callbacks given to writeFile do not get a chance to run before the process exits. Is this expected behavior?
Regardless of whether process.exit() is failing to perform this cleanup, I raise the question of whether it shouldn't be node's duty to at least attempt to work the file writes into the schedule, because it may very well be the case that deallocating the huge buffers depends on writing them out to disk.
details
I have a conceptually basic piece of node.js code which performs a transformation on a large data file. This happens to be a LiDAR sensor's data file, which should not be relevant. It is simply a dataset that is quite large owing to the nature of its existence. It is structurally simple. The sensor sends its data over the network. My task for this script is to produce a separate file for each rotating scan. The details of this logic are irrelevant as well.
The basic idea is I use node_pcap to read a huge .pcap file using the method given to do this task by node_pcap, which is "offline mode".
What this means is that, instead of asynchronously catching the network packets as they appear, what appears to be a rather dense stream of asynchronous events representing the packets is "generated".
So, the main structure of the program consists of a few global state variables, and a single callback to the pcap session. I initialize globals, then assign the callback function to the pcap session. This callback to the packet event does all the work.
Part of this work is writing out a large array of data files. Once in a while a packet will indicate some condition that means I should move on to writing into the next data file. I increment the data filename index, and call fs.writeFile() again to begin writing the new file. Since I am writing only, it seems natural to let node decide when a good time is to begin writing.
Basically, both fs.writeFileSync and fs.writeFile should end up calling the OS's write() system call on their respective files in an asynchronous fashion. This does not bother me because I am only writing, so the asynchronous nature of the write, which can affect certain access patterns, does not matter to me since I do not do any access. The only difference is that writeFileSync forces the node event loop to block until the write() syscall completes.
As the program progresses, when I use writeFile (the js-asynchronous version), hundreds of my output files are created, but no data is written to them. Not one. The very first data file is still open when the hundredth data file is created.
This is conceptually fine. The reason is that node is busy crunching new data, and is happily holding on to the increasing number of file descriptors that it will eventually get to in order to write the files' data. Meanwhile it also has to keep in memory all the eventual contents of the files. Memory will eventually run out, but let's ignore the RAM size limitation for a moment. Obviously a bad thing to happen here would be running out of RAM and crashing the program. Hopefully node will be smart and realize it just needs to schedule some file writes, after which it can free a bunch of buffers...
If I stick a statement in the middle of all this to call process.exit(), I would expect that node will clean up and flush the pending writeFile writes before exiting.
But node does not do this.
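A minimal sketch of the pattern that shows the problem (illustrative, not my actual code; bigBuffer and the filenames are stand-ins):

var fs = require('fs');
var bigBuffer = new Buffer(1024 * 1024); // stand-in for real scan data

for (var i = 0; i < 100; i++) {
    // writeFile only queues the write; the callback fires later
    fs.writeFile('scan-' + i + '.dat', bigBuffer, function (err) {
        console.log('flushed', err); // never runs if we exit first
    });
}

process.exit(0); // exits before any of the queued writes are flushed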
Changing to writeFileSync fixes the problem obviously.
Changing and truncating my input data such that process.exit() is not explicitly called also results in the files eventually getting written (and the completion callbacks given to writeFile running) at the very end, when the input events are done pumping.
This seems to indicate to me that process.exit() is performing the cleanup improperly.
Question: Is there some alternative that exits the event loop cleanly in the middle of processing? Note that I had to manually truncate my large input file, because terminating with process.exit() caused all the file writes to be incomplete.
This is node v0.10.26 installed a while ago on OS X with Homebrew.
Continuing with my thought process, the behavior that I am seeing here calls into question the fundamental purpose of using writeFile. It's supposed to be an improvement to be able to flexibly write my files whenever node deems it fit. However, apparently if node's event loop is pumped hard enough, it will basically "get behind" on its workload.
It is like the event loop has an inbox and an outbox. In this analogy, the outbox represents the temp variables containing the data I am writing to the files. The assumption that a lazy productive programmer like me wants to make is that the inbox and outbox are interfaces that I can use, that they are flexible, and that the system will manage them for me.

However if I feed the inbox at too high a rate, then node actually can't keep up, and it will just start piling the data into the outbox without having any time to flush it, because for one reason or another the scheduling is such that all the incoming events have to get processed first. This in turn defers all garbage collection of the outbox's contents, and quite quickly we deplete the system's RAM. This is easily a hard-to-find bug when this pattern is used in a complex system. I am glad I took a modular approach to this project.
I mean, yes, clearly, obviously, beyond all doubt the answer is to use writeFileSync as I do almost every single time that I write files with node.
What, then, is the value in even having writeFile? At this point I am trading a potentially small increase in parallel processing for the increased possibility that, if for some reason the machine's processing capability drops (whether through thermal throttling, OS-level scheduling, not paying my IaaS bills on time, or any other reason), it can snowball into a memory explosion.
Perhaps this is getting at the core of solving the truly rather complex problems inherent in streaming data processing systems, and I cannot realistically expect this event-based processing model to step up and elegantly solve these problems automatically. Maybe I should be satisfied that it only gets me about half of the way to something robust. Maybe I am just projecting my wishes onto it, and it is unreasonable for me to assume that node ought to "improve" the scheduling of its event loop on my behalf.
I'm not a node expert, but it seems like your problem can be simplified using streams. Streams let you pause and resume and also provide other neat functionality. I suggest you take a look at Chapter 9 of Professional NodeJS by Pedro Teixeira. You can find an online copy easily for reading purposes. It provides very detailed and well-explained examples of how to use streams to read and write data, and also of how to prevent potential memory leaks and loss of data.
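For example, something along these lines (a rough sketch; the filenames are made up):

var fs = require('fs');

var input = fs.createReadStream('huge-input.pcap'); // made-up name
var output = fs.createWriteStream('scan-000.dat');

// pipe() handles backpressure: the read stream is paused whenever the
// write stream's internal buffer is full, so data never piles up in memory
input.pipe(output);

output.on('finish', function () {
    console.log('all data flushed to disk');
});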
I am playing with the Node.JS library provided by Amazon to manage EC2 and Autoscaling groups. I have written a program to delete an instance so that the autoscaling group will create a new instance. But before I delete any additional instances, I need to make sure that the new one has been created by Amazon and is running. For that I need to pause my program and keep checking until I get a positive response. So this is how it works:
(I usually have 3-4 instances and have to replace them with 3-4 instances of a new type.)
So my program is:
updateServer() {
    // retrieve the instances from the server
    foreach (instance in the list of instances) {
        replace(); // with a new one
    }
}

replace() {
    // delete the old one
    while (new instance is not generated) {
        check();
    }
}

check() {
    // return the status
}
Now the question is that I need this to happen in sequence. How can I pause the program? In my case, the foreach loop executes, the program ends, and it never enters the check loop. Can you please give me ideas?
Thanks
Links -- http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/frames.html
The problem you are describing here is one of 'polling', and while talk of 'pauses' is normally an anti-pattern in node's asynchronous programming model, polling is possibly the one case where it has a valid application.
However for reasons of code maintainability - in this case the ability to add other tasks later (such as other checks), you should also handle the polling asynchronously.
Here are some different approaches that should solve your problem.
1. Don't handle the problem in node.js at all - invoke the node application from the host's crontab and design the program to exit. This isn't very node-ish, but it's certainly a robust solution.
2. Use npm to install the node timer module [https://github.com/markussieber/timer]. Using timer you would pass the check() function as an argument so that it is called back periodically. This is more slippery but scales, in that you can have lots of checks running, which is probably what a scalable EC2 deployment calls for (see also the setTimeout variant after this list).
var timer = require('timer'); // import the timer module

function check() { // the check function now polls
    // add an if (something == true) statement here to allow you to exit if necessary
    console.log("Checking stuff");
    timer.timer(200, check); // calls this function repeatedly, every 200ms
}

check(); // starts the polling
3. Having said that, the functionality you are trying to implement sounds to me like it is the same as that provided by Amazon Autoscaling [http://aws.amazon.com/autoscaling/] and Amazon Elastic Beanstalk [http://aws.amazon.com/elasticbeanstalk/]. You might also have some luck with those.
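As promised above, here is a variant of option 2 that uses node's built-in setTimeout instead of an external module (isDone() is a hypothetical completion check you would supply):

function check() {
    console.log("Checking stuff");
    if (isDone()) return; // hypothetical completion check
    setTimeout(check, 200); // re-arm the poll in 200ms
}

check(); // starts the polling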
Happy node hacking :)
nodejs isn't made to pause, really. you might be looking for something like an eventEmitter, so you can call a function when you emit an event. http://nodejs.org/api/events.html
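For example (a minimal sketch; the event name is made up):

var EventEmitter = require('events').EventEmitter;
var poller = new EventEmitter();

poller.on('instanceRunning', function (id) {
    console.log('instance ' + id + ' is up, continue with the next one');
});

// somewhere in your status-checking code:
poller.emit('instanceRunning', 'i-12345');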
You should never pause in node. Take a look at async (https://github.com/caolan/async). What you need is probably async.forEachSeries(), which allows you to do things in series, but asynchronously.
In your case it would start up a server, and when that has started, a callback is called that makes the series continue with the next server, etc.
Here's some sample code from node-levelup https://github.com/rvagg/node-levelup/blob/master/test/benchmarks/index.js#L60-L84
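Applied to your case, it might look roughly like this (replaceInstance is a stand-in for your own delete-and-wait-until-running logic, not an AWS SDK call):

var async = require('async');

async.forEachSeries(instances, function (instance, done) {
    replaceInstance(instance, function (err) {
        done(err); // the series only moves on once this fires
    });
}, function (err) {
    if (err) return console.error('replacement failed:', err);
    console.log('all instances replaced');
});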
I am scraping a bunch of stuff from a GET URL API in NodeJS. I'm looping through the months of the year times a number of cities. I have a scrapeChunk() function that I call once for each instance of the parameters, i.e. {startDate: ..., endDate: ..., location: ...}. Inside, I parse a table with jsdom, convert it to CSV, and append the CSV to a file. Inside all of the nested asynchronous callbacks, I finally call the scrapeChunk function again with the next parameter instance.
It all works, but the node instance grows and grows in RAM until I get a "FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory" error.
My Question: Am I doing something wrong or is this a limitation of JavaScript and/or the libraries I'm using? Can I somehow get each task to complete, FREE its memory, and then start the next task? I tried a sequence from FuturesJS and it seems to suffer from the same leak.
What is probably happening is that you're building a very deep call tree, and the upper levels of that tree keep references to their data around, so it never gets reclaimed by the garbage collector.
One thing to do is, in your own code, when you invoke a callback at the end, do so via process.nextTick(callback). That way, the calling function can return and release its variables. Also, make sure you're not piling all your data into a global structure that keeps those references around forever.
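For instance, a sketch of what that might look like (assuming scrapeChunk receives its remaining work as an argument):

function scrapeChunk(params, remaining) {
    // ... parse the table, convert to CSV, append to the file ...
    if (remaining.length === 0) return;
    process.nextTick(function () {
        // the current call returns first, so its locals can be collected
        scrapeChunk(remaining[0], remaining.slice(1));
    });
}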
Without seeing the code, it's a bit tricky to come up with good responses. But this is not a limitation of node.js or its approach (There are lots of long-running and complex applications out there that use it), but how you make use of it.
You may want to try cheerio instead of JSDom. The author claims it is leaner and 8x faster.
Assuming your description is correct, I think the cause of the problem is obvious - the recursive call to scrapeChunk(). Dispatch the tasks using a loop (or look into node's stream facilities), and ensure that they actually return.
What's going on here sounds something like this:
var list = [1, 2, 3, 4, ... ];

function scrapeCheck(index) {
    // allocate variables, do work, etc, etc
    scrapeCheck(index + 1);
}
With a long enough list, you are guaranteed to exhaust memory, or stack depth, or the heap, or any number of things, depending on what you do during the function body. What I'd suggest is something like this:
var list = [1, 2, 3, 4, ... ];

list.forEach(function scrapeCheck(item) {
    // allocate variables, do work, etc, etc
    return;
});
Deeply nested callbacks are an orthogonal problem, but I would suggest you take a look at the async library (in particular async.waterfall), which is both popular and useful for this class of task.
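A tiny waterfall sketch (both steps are stubbed out):

var async = require('async');

async.waterfall([
    function (next) { next(null, 'page.html'); },      // fetch step, stubbed
    function (html, next) { next(null, html.length); } // parse step, stubbed
], function (err, result) {
    if (err) throw err;
    console.log('done:', result); // each step passed its result to the next
});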
This is to do with the recursive call to your function. Put the recursive call inside a
setTimeout(() => {
    recursiveScrapFunHere();
}, 2000);
This way the call is asynchronous and gets put into a priority heap (the timer queue) instead of onto the usual recursion stack, as is the case for synchronous calls.
Your parent function (the same function) then finishes running to the end, and recursiveScrapFunHere() runs outside the recursion stack.
Here the call will be made after a delay of 2 seconds.
Quite often I see code like this in JavaScript libraries:
setTimeout(function() {
...
}, 0);
I would like to know why use such a wrapper code.
Very simplified:
Browsers are single threaded and this single thread (The UI thread) is shared between the rendering engine and the js engine.
If the thing you want to do takes a lot of time (we're talking cycles here, but still) it could halt (pause) the rendering (flow and paint).
In browsers there also exists "the bucket", where all events are first put to wait for the UI thread to be done with whatever it's doing. As soon as the thread is done, it looks in the bucket and picks the task first in line.
Using setTimeout you create a new task in the bucket after the delay, and let the thread deal with it as soon as it's available for more work.
A story:

After 0 ms delay, a new task is created from the function and put in the bucket. At that exact moment the UI thread is busy doing something else, and there is another task in the bucket already. After 6 ms the thread is available and gets the task in front of yours. Good, you're next. But what? That was one huge thing! It has been like foreeeeeever (30 ms)!! At last, the thread is done with that and comes and gets your task.
Most browsers have a minimum delay that is greater than 0, so putting 0 as the delay means: put this task in the bucket ASAP. But telling the UA to put it in the bucket ASAP is no guarantee it will execute at that moment. The bucket is like the post office: it could be that there is a long queue of other tasks. Post offices are also single threaded, with only one person helping all the tasks... sorry, customers with their tasks. Your task has to get in line like everyone else.
If the browser doesn't implement its own ticker, it uses the tick cycles of the OS. Older browsers had minimum delays between 10 and 15 ms. HTML5 specifies that if the delay is less than 4 ms the UA should increase it to 4 ms. This is said to be consistent across browsers released in 2010 and onward.
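A minimal illustration of that queueing:

console.log('start');
setTimeout(function () {
    console.log('from the bucket'); // runs only after the current work is done
}, 0);
console.log('end'); // prints: start, end, from the bucket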
See How JavaScript Timers Work by John Resig for more detail.
Edit: Also see What the heck is the event loop anyway? by Philip Roberts from JSConf EU 2014. This is mandatory viewing for all people touching front-end code.
There are a couple of reasons why you would do this:
There is an action you don't want to run immediately, but do want to run at some point in the near future.
You want to allow other previously registered handlers from a setTimeout or setInterval to run.
When you want the rest of your code to execute without waiting for a previous operation to finish, you can put that operation in a function passed to setTimeout. Otherwise your code will wait until the previous operation is done.
Example:
function callMe() {
    for (var i = 0; i < 100000; i++) {
        document.title = i;
    }
}

var x = 10;

setTimeout(callMe, 0);

var el = document.getElementById('test-id');
el.innerHTML = 'Im done before callMe method';
That is the reason I use it.
Apart from the previous answers I'd like to add another useful scenario: to "escape" from a try-catch block. A setTimeout delay from within a try-catch block will be executed outside the block, and any exception will propagate to the global scope instead.
Perhaps the best example scenario: in today's JavaScript, with the more common use of so-called Deferreds/Promises for asynchronous callbacks, you are (often) actually running inside a try-catch.
Deferreds/Promises wrap the callback in a try-catch to be able to detect and propagate an exception as an error in the async chain. This is all good for functions that need to be in the chain, but sooner or later you're "done" (i.e. you have fetched all your ajax) and want to run plain non-async code where you don't want exceptions to be "hidden" anymore.
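For instance (a sketch; somePromise stands for any library-wrapped async chain):

somePromise.then(function (result) {
    setTimeout(function () {
        // runs outside the library's try-catch wrapper, so this error
        // reaches window.onerror instead of being swallowed by the chain
        throw new Error('visible globally');
    }, 0);
});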
AFAIK Dojo, Kris Kowal's Q, MochiKit and Google Closure lib use try-catch wrapping (Not jQuery though).
(On a couple of odd occasions I've also used the technique to restart singleton-style code without causing recursion, i.e. doing a teardown and restart in the same loop.)
To allow any previously set timeouts to execute.
In my specific case, at least. Not trying to make general statements here.
I've got this web crawler that I wrote in Node.js. I'd love to use Ruby instead, so I re-wrote it in EventMachine. Since the original was in CoffeeScript, it was actually surprisingly easy, and the code is very much the same, except that in EventMachine I can actually trap and recover from exceptions (since I'm using fibers).
The problem is that tests that run in under 20 seconds on the Node.js code take up to and over 5 minutes on EventMachine. When I watch the connection count it almost looks like they are not even running in parallel (they queue up into the hundreds, then very slowly work their way down), though logging shows that the code points are hit in parallel.
I realize that without code you can't really know what exactly is going on, but I was just wondering if there is some kind of underlying difference and I should give up, or if they really should be able to run about as fast (a small slowdown is fine) and I should keep trying to figure out what the issue is.
I did the following, but it didn't really seem to have any effect:
puts "Running with ulimit: " + EM.set_descriptor_table_size(60000).to_s
EM.set_effective_user('nobody')
EM.kqueue
Oh, and I'm very sure that I don't have any blocking calls in EventMachine. I've combed through every line about 10 times looking for anything that could be blocking. All my network calls are EM::HttpRequest.
The problem is that tests that run in under 20 seconds on the Node.js code take up to and over 5 minutes on EventMachine. When I watch the connection count it almost looks like they are not even running in parallel (they queue up into the hundreds, then very slowly work their way down), though logging shows that the code points are hit in parallel.
If they're not running in parallel then it's not asynchronous. So you're blocking.
Basically you need to figure out what blocking IO call you've made in the standard Ruby library, remove it, and replace it with an EventMachine non-blocking IO call.
Your code may not have any blocking calls, but are you using third-party code that is not your own or not from EM? It may block. Even something as simple as a debug print or log can block.
All my network calls are EM::HttpRequest.
What about file IO? What about TCP? What about anything else that can block? What about third-party libraries?
We really need to see some code here, either to identify a bottleneck in your code or a blocking call.
node.js should not be more than an order of magnitude faster than EM.