Pass large array to node child process

Pass large array to node child process - javascript

I have complex CPU intensive work I want to do on a large array. Ideally, I'd like to pass this to the child process.
var spawn = require('child_process').spawn;
// dataAsNumbers is a large 2D array
var child = spawn(process.execPath, ['/child_process_scripts/getStatistics', dataAsNumbers]);
child.stdout.on('data', function(data){
console.log('from child: ', data.toString());
});
But when I do, node gives the error:
spawn E2BIG
I came across this article
So piping the data to the child process seems to be the way to go. My code is now:
var spawn = require('child_process').spawn;
console.log('creating child........................');
var options = { stdio: [null, null, null, 'pipe'] };
var args = [ '/getStatistics' ];
var child = spawn(process.execPath, args, options);
var pipe = child.stdio[3];
pipe.write(Buffer('awesome'));
child.stdout.on('data', function(data){
console.log('from child: ', data.toString());
});
And then in getStatistics.js:
console.log('im inside child');
process.stdin.on('data', function(data) {
console.log('data is ', data);
process.exit(0);
});
However the callback in process.stdin.on isn't reached. How can I receive a stream in my child script?
EDIT
I had to abandon the buffer approach. Now I'm sending the array as a message:
var cp = require('child_process');
var child = cp.fork('/getStatistics.js');
child.send({
dataAsNumbers: dataAsNumbers
});
But this only works when the length of dataAsNumbers is below about 20,000, otherwise it times out.

With such a massive amount of data, I would look into using shared memory rather than copying the data into the child process (which is what is happening when you use a pipe or pass messages). This will save memory, take less CPU time for the parent process, and be unlikely to bump into some limit.
shm-typed-array is a very simple module that seems suited to your application. Example:
parent.js
"use strict";
const shm = require('shm-typed-array');
const fork = require('child_process').fork;
// Create shared memory
const SIZE = 20000000;
const data = shm.create(SIZE, 'Float64Array');
// Fill with dummy data
Array.prototype.fill.call(data, 1);
// Spawn child, set up communication, and give shared memory
const child = fork("child.js");
child.on('message', sum => {
console.log(`Got answer: ${sum}`);
// Demo only; ideally you'd re-use the same child
child.kill();
});
child.send(data.key);
child.js
"use strict";
const shm = require('shm-typed-array');
process.on('message', key => {
// Get access to shared memory
const data = shm.get(key, 'Float64Array');
// Perform processing
const sum = Array.prototype.reduce.call(data, (a, b) => a + b, 0);
// Return processed data
process.send(sum);
});
Note that we are only sending a small "key" from the parent to the child process through IPC, not the whole data. Thus, we save a ton of memory and time.
Of course, you can change 'Float64Array' (e.g. a double) to whatever typed array your application requires. Note that this library in particular only handles single-dimensional typed arrays; but that should only be a minor obstacle.

I too was able to reproduce the delay your were experiencing, but maybe not as bad as you. I used the following
// main.js
const fork = require('child_process').fork
const child = fork('./getStats.js')
const dataAsNumbers = Array(100000).fill(0).map(() =>
Array(100).fill(0).map(() => Math.round(Math.random() * 100)))
child.send({
dataAsNumbers: dataAsNumbers,
})
And
// getStats.js
process.on('message', function (data) {
console.log('data is ', data)
process.exit(0)
})
node main.js 2.72s user 0.45s system 103% cpu 3.045 total
I'm generating 100k elements composed of 100 numbers to mock your data, make sure you are using the message event on process. But maybe your children are more complex and might be the reason of the failure, also depends on the timeout you set on your query.
If you want to get better results, what you could do is chunk your data into multiple pieces that will be sent to the child process and reconstructed to form the initial array.
Also one possibility would be to use a third-party library or protocol, even if it's a bit more work. You could have a look to messenger.js or even something like an AMQP queue that could allow you to communicate between the two process with a pool and a guaranty of the message been acknowledged by the sub process. There is a few node implementations of it, like amqp.node, but it would still require a bit of setup and configuration work.

Use an in memory cache like https://github.com/ptarjan/node-cache, and let the parent process store the array contents with some key, the child process would retreive the contents through that key.

You could consider using OS pipes you'll find a gist here as an input to your node child application.
I know this is not exactly what you're asking for, but you could use the cluster module (included in node). This way you can get as many instances as cores you machine has to speed up processing. Moreover consider using streams if you don't need to have all the data available before you start processing. If the data to be processed is too large i would store it in a file so you can reinilize if there is any error during the process.
Here is an example of clustering.
var cluster = require('cluster');
var numCPUs = 4;
if (cluster.isMaster) {
for (var i = 0; i < numCPUs; i++) {
var worker = cluster.fork();
console.log('id', worker.id)
}
} else {
doSomeWork()
}
function doSomeWork(){
for (var i=1; i<10; i++){
console.log(i)
}
}
More info sending messages across workers question 8534462.

Why do you want to make a subprocess? The sending of the data across subprocesses is likely to cost more in terms of cpu and realtime than you will save in making the processing happen within the same process.
Instead, I would suggest that for super efficient coding you consider to do your statistics calculations in a worker thread that runs within the same memory as the nodejs main process.
You can use the NAN to write C++ code that you can post to a worker thread, and then have that worker thread to post the result and an event back to your nodejs event loop when done.
The benefit of this is that you don't need extra time to send the data across to a different process, but the downside is that you will write a bit of C++ code for the threaded action, but the NAN extension should take care of most of the difficult task for you.

To address the performance issue while passing large data to the child process, save the data to the .json or .txt file and pass only the filename to the childprocess. I've achieved 70% performance improvement with this approach.

For long process tasks you could use something like gearman You could do the heavy work process on workers, in this way you can setup how many workers you need, for example I do some file processing in this way, if I need scale you create more worker instance, also I have different workers for different tasks, process zip files, generate thumbnails, etc, the good of this is the workers can be written on any language node.js, Java, python and can be integrated on your project with ease
// worker-unzip.js
const debug = require('debug')('worker:unzip');
const {series, apply} = require('async');
const gearman = require('gearmanode');
const {mkdirpSync} = require('fs-extra');
const extract = require('extract-zip');
module.exports.unzip = unzip;
module.exports.worker = worker;
function unzip(inputPath, outputDirPath, done) {
debug('unzipping', inputPath, 'to', outputDirPath);
mkdirpSync(outputDirPath);
extract(inputPath, {dir: outputDirPath}, done);
}
/**
*
* #param {Job} job
*/
function workerUnzip(job) {
const {inputPath, outputDirPath} = JSON.parse(job.payload);
series([
apply(unzip, inputPath, outputDirPath),
(done) => job.workComplete(outputDirPath)
], (err) => {
if (err) {
console.error(err);
job.reportError();
}
});
}
function worker(config) {
const worker = gearman.worker(config);
if (config.id) {
worker.setWorkerId(config.id);
}
worker.addFunction('unzip', workerUnzip, {timeout: 10, toStringEncoding: 'ascii'});
worker.on('error', (err) => console.error(err));
return worker;
}
a simple index.js
const unzip = require('./worker-unzip').worker;
unzip(config); // pass host and port of the Gearman server
I normally run workers with PM2
the integration with your code it's very easy. something like
//initialize
const gearman = require('gearmanode');
gearman.Client.logger.transports.console.level = 'error';
const client = gearman.client(configGearman); // same host and port
the just add work to the queue passing the name of the functions
const taskpayload = {inputPath: '/tmp/sample-file.zip', outputDirPath: '/tmp/unzip/sample-file/'}
const job client.submitJob('unzip', JSON.stringify(taskpayload));
job.on('complete', jobCompleteCallback);
job.on('error', jobErrorCallback);

Related

How can I update or add new objects in the middle of a stream process in Nodejs?

I'm just new in programming and now self-studying how to use createStream. I'm kind of lost how to use stream in nodejs using JS. Basically, what I wanted to do is to read a JSON file (more than 1GB) which have a massive array of object. Update the existing value of a certain object or add another set of object. I able to do it using the normal read, update or add function then write. Problem is I'm getting a large spike in RAM usage.
My code is like this:
const fs = require(`fs-extra`);
async function updateOrAdd () {
var datafile = await fs.readJson(`./bigJSONfile.json`);
var tofind = {user:alexa, age:21,country: japan, pending: 1, paid: 0};
foundData = datafile.filter(x => x.user === tofind.user && x.country === tofind.country);
if (foundData === null){
datafile = datafile.concat(tofind)
} else {
foundData.pending += 1
foundData.paid += 1
}
await fs.writeJson(`./bigJSONfile.json`, datafile)
}
I saw some codes for reference in createStream and they say pipe is the most efficient way for memory usage. Though, mostly what I saw is like making a copy only from the original one.
I really appreciate it if anyone can teach me how to do this using stream or if you can provide me the code for it :).

Running out of memory writing to a file in NodeJS

I'm processing a very large amount of data that I'm manipulating and storing it in a file. I iterate over the dataset, then I want to store it all in a JSON file.
My initial method using fs, storing it all in an object then dumping it didn't work as I was running out of memory and it became extremely slow.
I'm now using fs.createWriteStream but as far as I can tell it's still storing it all in memory.
I want the data to be written object by object to the file, unless someone can recommend a better way of doing it.
Part of my code:
// Top of the file
var wstream = fs.createWriteStream('mydata.json');
...
// In a loop
let JSONtoWrite = {}
JSONtoWrite[entry.word] = wordData
wstream.write(JSON.stringify(JSONtoWrite))
...
// Outside my loop (when memory is probably maxed out)
wstream.end()
I think I'm using Streams wrong, can someone tell me how to write all this data to a file without running out of memory? Every example I find online relates to reading a stream in but because of the calculations I'm doing on the data, I can't use a readable stream. I need to add to this file sequentially.

The problem is that you're not waiting for the data to be flushed to the filesystem, but instead keep throwing new and new data to the stream synchronously in a tight loop.
Here's an piece of pseudocode that should work for you:
// Top of the file
const wstream = fs.createWriteStream('mydata.json');
// I'm no sure how're you getting the data, let's say you have it all in an object
const entry = {};
const words = Object.keys(entry);
function writeCB(index) {
if (index >= words.length) {
wstream.end()
return;
}
const JSONtoWrite = {};
JSONtoWrite[words[index]] = entry[words[index]];
wstream.write(JSON.stringify(JSONtoWrite), writeCB.bind(index + 1));
}
wstream.write(JSON.stringify(JSONtoWrite), writeCB.bind(0));

You should wrap your data source in a readable stream too. I don't know what is your source, but you have to make sure, it does not load all your data in memory.
For example, assuming your data set come from another file where JSON objects are splitted with end of line character, you could create a Read stream as follow:
const Readable = require('stream').Readable;
class JSONReader extends Readable {
constructor(options={}){
super(options);
this._source=options.source: // the source stream
this._buffer='';
source.on('readable', function() {
this.read();
}.bind(this));//read whenever the source is ready
}
_read(size){
var chunk;
var line;
var lineIndex;
var result;
if (this._buffer.length === 0) {
chunk = this._source.read(); // read more from source when buffer is empty
this._buffer += chunk;
}
lineIndex = this._buffer.indexOf('\n'); // find end of line
if (lineIndex !== -1) { //we have a end of line and therefore a new object
line = this._buffer.slice(0, lineIndex); // get the character related to the object
if (line) {
result = JSON.parse(line);
this._buffer = this._buffer.slice(lineIndex + 1);
this.push(JSON.stringify(line) // push to the internal read queue
} else {
this._buffer.slice(1)
}
}
}}
now you can use
const source = fs.createReadStream('mySourceFile');
const reader = new JSONReader({source});
const target = fs.createWriteStream('myTargetFile');
reader.pipe(target);
then you'll have a better memory flow:
Please note that the picture and the above example are taken from the excellent nodejs in practice book

Fast rising memory when starting webworker in node.js with "big" data

I have the typical code to start a webworker in node:
var Threads = require('webworker-threads');
var worker = new Threads.Worker(__dirname + '/workers/myworker.js');
worker.onmessage = function (event) {
// 1.
// ... create and execute cypher query ...
};
// Start the worker.
worker.postMessage({
'data' : data
});
At 1. I send small pieces of processed data to a Neo4J db.
For small data this works perfectly fine, but when the data gets slightly bigger node/the worker starts to struggle.
The actual data I want to process is a csv I parsed with BabyParse resulting in an object with 149000 properties where each has another 17 properties. (149000 rows by 17 columns = 2533000 properties). The file is 17MB.
When doing this node will allocate a lot of memory and eventually crash around 53% memory allocation. The machine has 4GB.
The worker looks roughly like this:
self.onmessage = function (event) {
process(event.data.data);
};
function process(data) {
for (var i = 0; i < data.length; i++) {
self.postMessage({
'properties' : data[i]
});
}
}
I tried to chunk the data and process it chunkwise within the worker which also works okay. But I want to generate a graph and to process the edges I need the complete data, because I need to check every row (vertex) against all others.
Is there a way to stream the data into the worker? Or does anyone have an idea why node allocates so much memory with 17MB of data being send?

Instead of parsing the data in the main thread you can also pass the filename as a message to the worker and have the worker load it from disk. Otherwise you will have all the data in memory twice, once in the host and once in the worker.
A different option would be to use the csv npm package with the streaming parser. postMessage the lines as they come in and buffer them up till the final result in the worker.
Why your solution tries to allocate those enormous amounts of memory I don't know. I do know postMessage is intended to pass small messages.

Error when introducing huge data in Neo4j from Node.js

I am trying to introduce a huge amount of data in neo4j from a file. I am using node.js code, simple javascript with no much complexity.
The thing is that I have 386213 lines or 'nodes' to introduce, but when executed (and wait 3 hours) I only see the half moreless. I think some of the queries are lost in the way, but I do not know why...
I am using npm node-neo4j package for the connection and that.
Here my node.js code:
var neo4j = require('neo4j');
var readline = require("readline");
var fs = require("fs")
var db = new neo4j.GraphDatabase('http://neo4j:Gemitis26#localhost:7474');
var rl = readline.createInterface({
input: fs.createReadStream('C:/Users/RRamos/Documents/Projects/test-neo4j/Files/kaggle_songs.txt')
});
var i=1;
rl.on('line', function (line) {
var str = line.split(" ");
db.cypher({
query: "CREATE (:Song {id: '{line1}', num_id: {line2}})",
params: {
line1: str[0],
line2: str[1],
},
}, callback);
console.log(i + " " + "CREATE (:Song {id: '"+str[0]+"', num_id: "+str[1]+"})");
i = i+1;
});
function callback(err, results){
if(err) throw err;
}

Making 386213 separate Cypher REST queries (in separate transactions) is probably the slowest possible way to create such a large number of nodes.
There are at least 3 better ways (in order of increasing performance):
Create multiple nodes at a time by sending as a parameter an array containing the data for multiple nodes. For example, you can create 8 nodes by sending this array parameter: [['a', 1],['b', 2],['c', 3],['d', 4],['e', 5],['f', 6],['g', 7],['h', 8]], and using this query:
UNWIND {data} AS d
CREATE (:Song {id: d[0], num_id: d[0]})
You can use the LOAD CSV clause to create the nodes. Since your input file seems to use a space to separate node property values, this might work for you:
LOAD CSV FROM 'file:///C:/Users/RRamos/Documents/Projects/test-neo4j/Files/kaggle_songs.txt' AS line
FIELDTERMINATOR ' '
CREATE (:Song {id: line[0], num_id: line[1]})
For even better performance, you could use the Import tool, which is a command line tool for initializing a new DB.

Create exportable object or module to wrap third-party library with CommonJS/NodeJS javascript

I'm new to JavaScript and creating classes/objects. I'm trying to wrap an open source library's code with some simple methods for me to use in my routes.
I have the below code that is straight from the source (sjwalter's Github repo; thanks Stephen for the library!).
I'm trying to export a file/module to my main app/server.js file with something like this:
var twilio = require('nameOfMyTwilioLibraryModule');
or whatever it is I need to do.
I'm looking to create methods like twilio.send(number, message)that I can easily use in my routes to keep my code modular. I've tried a handful of different ways but couldn't get anything to work. This might not be a great question because you need to know how the library works (and Twilio too). The var phone = client.getPhoneNumber(creds.outgoing); line makes sure that my outgoing number is a registered/paid for number.
Here's the full example that I'm trying to wrap with my own methods:
var TwilioClient = require('twilio').Client,
Twiml = require('twilio').Twiml,
creds = require('./twilio_creds').Credentials,
client = new TwilioClient(creds.sid, creds.authToken, creds.hostname),
// Our numbers list. Add more numbers here and they'll get the message
numbers = ['+numbersToSendTo'],
message = '',
numSent = 0;
var phone = client.getPhoneNumber(creds.outgoing);
phone.setup(function() {
for(var i = 0; i < numbers.length; i++) {
phone.sendSms(numbers[i], message, null, function(sms) {
sms.on('processed', function(reqParams, response) {
console.log('Message processed, request params follow');
console.log(reqParams);
numSent += 1;
if(numSent == numToSend) {
process.exit(0);
}
});
});
}
});`

Simply add the function(s) you wish to expose as properties on the exports object. Assuming your file was named mytwilio.js and stored under app/ and looks like,
app/mytwilio.js
var twilio = require('twilio');
var TwilioClient = twilio.Client;
var Twiml = twilio.Twiml;
var creds = require('./twilio_creds').Credentials;
var client = new TwilioClient(creds.sid, creds.authToken, creds.hostname);
// keeps track of whether the phone object
// has been populated or not.
var initialized = false;
var phone = client.getPhoneNumber(creds.outgoing);
phone.setup(function() {
// phone object has been populated
initialized = true;
});
exports.send = function(number, message, callback) {
// ignore request and throw if not initialized
if (!initialized) {
throw new Error("Patience! We are init'ing");
}
// otherwise process request and send SMS
phone.sendSms(number, message, null, function(sms) {
sms.on('processed', callback);
});
};
This file is mostly identical to what you already have with one crucial difference. It remembers whether the phone object has been initialized or not. If it hasn't been initialized, it simply throws an error if send is called. Otherwise it proceeds with sending the SMS. You could get fancier and create a queue that stores all messages to be sent until the object is initialized, and then sends em' all out later.
This is just a lazy approach to get you started. To use the function(s) exported by the above wrapper, simply include it the other js file(s). The send function captures everything it needs (initialized and phone variables) in a closure, so you don't have to worry about exporting every single dependency. Here's an example of a file that makes use of the above.
app/mytwilio-test.js
var twilio = require("./mytwilio");
twilio.send("+123456789", "Hello there!", function(reqParams, response) {
// do something absolutely crazy with the arguments
});
If you don't like to include with the full/relative path of mytwilio.js, then add it to the paths list. Read up more about the module system, and how module resolution works in Node.JS.

Develop Reference

JavaScript is the programming language of the Web.

Pass large array to node child process - javascript

Use an in memory cache like https://github.com/ptarjan/node-cache, and let the parent process store the array contents with some key, the child process would retreive the contents through that key.

To address the performance issue while passing large data to the child process, save the data to the .json or .txt file and pass only the filename to the childprocess. I've achieved 70% performance improvement with this approach.

Related

How can I update or add new objects in the middle of a stream process in Nodejs?

Running out of memory writing to a file in NodeJS

Fast rising memory when starting webworker in node.js with "big" data

Error when introducing huge data in Neo4j from Node.js

Create exportable object or module to wrap third-party library with CommonJS/NodeJS javascript

Categories

Resources