I am currently trying to build a data pipeline using Node.js.
Of course, it's not the best way to build one, but I want to try implementing it anyway before I improve on it.
This is the situation:
I have multiple gzip-compressed CSV files on AWS S3. I get these "objects" using the AWS SDK
as follows and turn them into a read stream:
const unzip = createGunzip()
const input = s3.getObject(parameterWithBucketandKey)
.createReadStream()
.pipe(unzip)
Using the stream above, I create a readline interface:
const targetFile = createWriteStream('path to target file');
const rl = createInterface({
input: input
})
let first = true;
rl.on('line', (line) => {
if(first) {
first = false;
return;
}
targetFile.write(line);
await getstats_and_fetch_filesize();
if(filesize > allowed_size){
changed_file_name = change_the_name_of_file()
compress(changed_file_name)
}
});
This is wrapped in a promise.
I have an array of filenames to be retrieved from AWS S3, and I map over that array like this:
const arrayOfFileNames = [name1, name2, name3 ... and 5000 more]
const arrayOfPromiseFileProcesses = arrayOfFileNames.map((filename) => promiseFileProcess(filename))
await Promise.all(arrayOfPromiseFileProcesses);
// the result should be multiple gzip files that are compressed again.
Sorry, I wrote this in pseudocode. If more context is needed I will write more, but I thought this would give a general idea of my problem.
My problem is that it writes to a file fine, but after I change the file name it doesn't create a new one. I am lost in this synchronous and asynchronous world...
Please give me a hint or a reference to read up on. Thank you.
The line event handler must be an async function because it uses await:
rl.on('line', async(line) => {
if(first) {
first = false;
return;
}
targetFile.write(line);
await getstats_and_fetch_filesize();
if(filesize > allowed_size){
changed_file_name = change_the_name_of_file()
compress(changed_file_name)
}
});
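One caveat: readline does not wait for the promise returned by an async 'line' listener, so later lines can arrive while an earlier await is still pending. A minimal sketch of one alternative, assuming a Node version where the readline interface is async-iterable (12 or later), is to loop with for await so each line is fully handled before the next one is read. The names processLines and handleLine are hypothetical, and an in-memory stream stands in for the gunzipped S3 stream.

```javascript
const { createInterface } = require('readline');
const { Readable } = require('stream');

// Hypothetical helper: reads `input` line by line, skips the CSV header,
// and awaits `handleLine` for each remaining line before reading the next.
async function processLines(input, handleLine) {
  const rl = createInterface({ input, crlfDelay: Infinity });
  let first = true;
  for await (const line of rl) {
    if (first) {
      first = false; // skip the header row
      continue;
    }
    await handleLine(line);
  }
}

// Demo: an in-memory stream stands in for the gunzipped S3 stream.
async function demo() {
  const seen = [];
  await processLines(Readable.from(['id,name\n1,a\n2,b\n']), async (line) => {
    seen.push(line); // a real handler would write the line and check the file size
  });
  return seen;
}
```

With this shape, the size check and file rotation can sit inside handleLine and will finish before the next line is processed.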
I have complex CPU intensive work I want to do on a large array. Ideally, I'd like to pass this to the child process.
var spawn = require('child_process').spawn;
// dataAsNumbers is a large 2D array
var child = spawn(process.execPath, ['/child_process_scripts/getStatistics', dataAsNumbers]);
child.stdout.on('data', function(data){
console.log('from child: ', data.toString());
});
But when I do, node gives the error:
spawn E2BIG
I came across this article
So piping the data to the child process seems to be the way to go. My code is now:
var spawn = require('child_process').spawn;
console.log('creating child........................');
var options = { stdio: [null, null, null, 'pipe'] };
var args = [ '/getStatistics' ];
var child = spawn(process.execPath, args, options);
var pipe = child.stdio[3];
pipe.write(Buffer('awesome'));
child.stdout.on('data', function(data){
console.log('from child: ', data.toString());
});
And then in getStatistics.js:
console.log('im inside child');
process.stdin.on('data', function(data) {
console.log('data is ', data);
process.exit(0);
});
However the callback in process.stdin.on isn't reached. How can I receive a stream in my child script?
EDIT
I had to abandon the buffer approach. Now I'm sending the array as a message:
var cp = require('child_process');
var child = cp.fork('/getStatistics.js');
child.send({
dataAsNumbers: dataAsNumbers
});
But this only works when the length of dataAsNumbers is below about 20,000, otherwise it times out.
With such a massive amount of data, I would look into using shared memory rather than copying the data into the child process (which is what is happening when you use a pipe or pass messages). This will save memory, take less CPU time for the parent process, and be unlikely to bump into some limit.
shm-typed-array is a very simple module that seems suited to your application. Example:
parent.js
"use strict";
const shm = require('shm-typed-array');
const fork = require('child_process').fork;
// Create shared memory
const SIZE = 20000000;
const data = shm.create(SIZE, 'Float64Array');
// Fill with dummy data
Array.prototype.fill.call(data, 1);
// Spawn child, set up communication, and give shared memory
const child = fork("child.js");
child.on('message', sum => {
console.log(`Got answer: ${sum}`);
// Demo only; ideally you'd re-use the same child
child.kill();
});
child.send(data.key);
child.js
"use strict";
const shm = require('shm-typed-array');
process.on('message', key => {
// Get access to shared memory
const data = shm.get(key, 'Float64Array');
// Perform processing
const sum = Array.prototype.reduce.call(data, (a, b) => a + b, 0);
// Return processed data
process.send(sum);
});
Note that we are only sending a small "key" from the parent to the child process through IPC, not the whole data. Thus, we save a ton of memory and time.
Of course, you can change 'Float64Array' (i.e. doubles) to whatever typed array your application requires. Note that this library in particular only handles one-dimensional typed arrays, but that should only be a minor obstacle.
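To illustrate why the one-dimensional limit is a minor obstacle, here is a rough sketch of storing a rectangular 2D array in a single typed array in row-major order. The helper names flatten2D and at are made up for this example. It uses a plain Float64Array; with shm-typed-array you would presumably copy into the array returned by shm.create and send the row and column counts to the child along with the key.

```javascript
// Flatten a rectangular 2D array of numbers into one row-major Float64Array.
function flatten2D(rows) {
  const rowCount = rows.length;
  const colCount = rowCount > 0 ? rows[0].length : 0;
  const flat = new Float64Array(rowCount * colCount);
  // Copy each row to its offset in the flat array.
  rows.forEach((row, r) => flat.set(row, r * colCount));
  return { flat, rowCount, colCount };
}

// Read element (r, c) back out of the flat array.
function at(matrix, r, c) {
  return matrix.flat[r * matrix.colCount + c];
}

const matrix = flatten2D([[1, 2, 3], [4, 5, 6]]);
console.log(at(matrix, 1, 2)); // element in row 1, column 2
```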
I too was able to reproduce the delay you were experiencing, though maybe not as badly. I used the following:
// main.js
const fork = require('child_process').fork
const child = fork('./getStats.js')
const dataAsNumbers = Array(100000).fill(0).map(() =>
Array(100).fill(0).map(() => Math.round(Math.random() * 100)))
child.send({
dataAsNumbers: dataAsNumbers,
})
And
// getStats.js
process.on('message', function (data) {
console.log('data is ', data)
process.exit(0)
})
node main.js 2.72s user 0.45s system 103% cpu 3.045 total
I'm generating 100k rows of 100 numbers each to mock your data. Make sure you are listening for the message event on process. Your child processes may be more complex, which might be the reason for the failure; it also depends on the timeout you set on your query.
If you want better results, you could chunk your data into multiple pieces, send them to the child process one by one, and reassemble them there into the original array.
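A minimal sketch of the chunking idea, under the assumption that each message carries its own index and the total number of chunks. That message shape and the helper names are invented for this example, not part of any API.

```javascript
// Split an array into consecutive slices of at most `size` items.
function toChunks(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Hypothetical message shape: chunk index, total chunk count, and the rows.
function toMessages(items, size) {
  const chunks = toChunks(items, size);
  return chunks.map((rows, index) => ({ index, total: chunks.length, rows }));
}

// Receiver side: collect messages and rebuild the original array
// once every chunk has arrived.
function createAssembler(onComplete) {
  const parts = [];
  let received = 0;
  return function onMessage(message) {
    parts[message.index] = message.rows;
    received += 1;
    if (received === message.total) {
      onComplete(parts.flat()); // flatten one level: chunks back into rows
    }
  };
}

// Parent: toMessages(dataAsNumbers, 1000).forEach((m) => child.send(m));
// Child:  process.on('message', createAssembler((rows) => { /* work */ }));
```

The chunk size of 1000 in the usage comment is arbitrary; you would have to tune it against your own data.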
Another possibility would be to use a third-party library or protocol, even if it's a bit more work. You could have a look at messenger.js, or even something like an AMQP queue, which would let the two processes communicate through a pool with a guarantee that each message has been acknowledged by the subprocess. There are a few Node implementations of it, such as amqp.node, but it would still require a bit of setup and configuration work.
Use an in-memory cache like https://github.com/ptarjan/node-cache: let the parent process store the array contents under some key, and have the child process retrieve the contents through that key.
You could consider using OS pipes; you'll find a gist here that you can use as input to your Node child application.
I know this is not exactly what you're asking for, but you could use the cluster module (included in Node). This way you can run as many instances as your machine has cores to speed up processing. Also consider using streams if you don't need all the data to be available before you start processing. If the data to be processed is too large, I would store it in a file so you can reinitialize if there is any error during the process.
Here is an example of clustering.
var cluster = require('cluster');
var numCPUs = 4;
if (cluster.isMaster) {
for (var i = 0; i < numCPUs; i++) {
var worker = cluster.fork();
console.log('id', worker.id)
}
} else {
doSomeWork()
}
function doSomeWork(){
for (var i=1; i<10; i++){
console.log(i)
}
}
More info on sending messages across workers: question 8534462.
Why do you want to make a subprocess? The sending of the data across subprocesses is likely to cost more in terms of cpu and realtime than you will save in making the processing happen within the same process.
Instead, I would suggest that for super efficient coding you consider doing your statistics calculations in a worker thread that runs within the same memory as the Node.js main process.
You can use NAN to write C++ code that you post to a worker thread, and then have that worker thread post the result and an event back to your Node.js event loop when done.
The benefit of this is that you don't need extra time to send the data across to a different process. The downside is that you will have to write a bit of C++ code for the threaded action, but the NAN extension should take care of most of the difficult parts for you.
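For comparison, newer Node versions ship a built-in worker_threads module that gives a similar "same memory" setup without C++. This is a different technique from the NAN approach described above, so treat the following only as a sketch. It assumes a Node version with worker_threads and SharedArrayBuffer, and sumInWorker is a made-up name.

```javascript
const { Worker } = require('worker_threads');

// Copy the numbers into shared memory once, then let a worker thread
// read them in place and post back only the result.
function sumInWorker(values) {
  const shared = new SharedArrayBuffer(values.length * Float64Array.BYTES_PER_ELEMENT);
  new Float64Array(shared).set(values);

  // Worker code kept inline so the example is self-contained.
  const workerSource = `
    const { parentPort, workerData } = require('worker_threads');
    const view = new Float64Array(workerData);
    let sum = 0;
    for (let i = 0; i < view.length; i++) sum += view[i];
    parentPort.postMessage(sum);
  `;

  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: shared });
    worker.once('message', resolve);
    worker.once('error', reject);
  });
}

sumInWorker([1, 2, 3]).then((sum) => console.log('sum from worker:', sum));
```

The SharedArrayBuffer is shared with the worker rather than copied, so only the final number crosses the thread boundary as a message.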
To address the performance issue when passing large data to the child process, save the data to a .json or .txt file and pass only the filename to the child process. I've achieved a 70% performance improvement with this approach.
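A rough sketch of the file hand-off, assuming the data can be serialized as JSON. The helper name sumViaFile and the inline stand-in child script are invented for this example, and it uses execFileSync only to keep the demo short. A real setup would more likely fork a long-lived child and send it the path.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const { execFileSync } = require('child_process');

// Write the data to a temporary JSON file and hand only its path to the child.
function sumViaFile(rows) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'handoff-'));
  const dataPath = path.join(dir, 'data.json');
  const childPath = path.join(dir, 'child.js');
  fs.writeFileSync(dataPath, JSON.stringify(rows));

  // Stand-in child script: reads the file named in argv[2] and prints a sum.
  fs.writeFileSync(childPath, `
    const rows = JSON.parse(require('fs').readFileSync(process.argv[2], 'utf8'));
    const sum = rows.reduce((t, row) => t + row.reduce((a, b) => a + b, 0), 0);
    process.stdout.write(String(sum));
  `);

  try {
    const output = execFileSync(process.execPath, [childPath, dataPath], { encoding: 'utf8' });
    return Number(output);
  } finally {
    fs.rmSync(dir, { recursive: true, force: true }); // clean up temp files
  }
}

console.log(sumViaFile([[1, 2], [3, 4]]));
```

Only the short path string goes through the process arguments, so the size of the data no longer matters for the spawn call itself.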
For long-running tasks you could use something like Gearman. You do the heavy work in workers, and you can set up as many workers as you need. For example, I do some file processing this way: if I need to scale, I create more worker instances. I also have different workers for different tasks (processing zip files, generating thumbnails, etc.). The good thing about this is that the workers can be written in any language (Node.js, Java, Python) and can be integrated into your project with ease.
// worker-unzip.js
const debug = require('debug')('worker:unzip');
const {series, apply} = require('async');
const gearman = require('gearmanode');
const {mkdirpSync} = require('fs-extra');
const extract = require('extract-zip');
module.exports.unzip = unzip;
module.exports.worker = worker;
function unzip(inputPath, outputDirPath, done) {
debug('unzipping', inputPath, 'to', outputDirPath);
mkdirpSync(outputDirPath);
extract(inputPath, {dir: outputDirPath}, done);
}
/**
*
* @param {Job} job
*/
function workerUnzip(job) {
const {inputPath, outputDirPath} = JSON.parse(job.payload);
series([
apply(unzip, inputPath, outputDirPath),
(done) => { job.workComplete(outputDirPath); done(); } // report success, then finish the series
], (err) => {
if (err) {
console.error(err);
job.reportError();
}
});
}
function worker(config) {
const worker = gearman.worker(config);
if (config.id) {
worker.setWorkerId(config.id);
}
worker.addFunction('unzip', workerUnzip, {timeout: 10, toStringEncoding: 'ascii'});
worker.on('error', (err) => console.error(err));
return worker;
}
A simple index.js:
const unzip = require('./worker-unzip').worker;
unzip(config); // pass host and port of the Gearman server
I normally run the workers with PM2.
Integrating this with your code is very easy. Something like:
//initialize
const gearman = require('gearmanode');
gearman.Client.logger.transports.console.level = 'error';
const client = gearman.client(configGearman); // same host and port
Then just add work to the queue, passing the name of the function:
const taskpayload = {inputPath: '/tmp/sample-file.zip', outputDirPath: '/tmp/unzip/sample-file/'}
const job = client.submitJob('unzip', JSON.stringify(taskpayload));
job.on('complete', jobCompleteCallback);
job.on('error', jobErrorCallback);
I'm processing a very large amount of data that I'm manipulating and storing it in a file. I iterate over the dataset, then I want to store it all in a JSON file.
My initial method using fs, storing it all in an object then dumping it didn't work as I was running out of memory and it became extremely slow.
I'm now using fs.createWriteStream but as far as I can tell it's still storing it all in memory.
I want the data to be written object by object to the file, unless someone can recommend a better way of doing it.
Part of my code:
// Top of the file
var wstream = fs.createWriteStream('mydata.json');
...
// In a loop
let JSONtoWrite = {}
JSONtoWrite[entry.word] = wordData
wstream.write(JSON.stringify(JSONtoWrite))
...
// Outside my loop (when memory is probably maxed out)
wstream.end()
I think I'm using Streams wrong, can someone tell me how to write all this data to a file without running out of memory? Every example I find online relates to reading a stream in but because of the calculations I'm doing on the data, I can't use a readable stream. I need to add to this file sequentially.
The problem is that you're not waiting for the data to be flushed to the filesystem; instead you keep pushing more and more data into the stream synchronously in a tight loop.
Here's a piece of pseudocode that should work for you:
// Top of the file
const wstream = fs.createWriteStream('mydata.json');
// I'm not sure how you're getting the data; let's say you have it all in an object
const entry = {};
const words = Object.keys(entry);
function writeCB(index) {
if (index >= words.length) {
wstream.end()
return;
}
const JSONtoWrite = {};
JSONtoWrite[words[index]] = entry[words[index]];
wstream.write(JSON.stringify(JSONtoWrite), () => writeCB(index + 1)); // write the next entry once this one is flushed
}
writeCB(0); // start with the first entry
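Another way to wait for the stream, sketched here under some assumptions, is to check the return value of write() and pause until the 'drain' event whenever it returns false. The function name writeEntries and the [word, wordData] pair format are made up for this example. It also writes one JSON object per line, which is my choice and not something the question specified.

```javascript
const fs = require('fs');
const { once } = require('events');

// Write each entry as one JSON line, pausing whenever the stream's
// internal buffer is full instead of queueing everything in memory.
async function writeEntries(filePath, entries) {
  const wstream = fs.createWriteStream(filePath);
  for (const [word, wordData] of entries) {
    const line = JSON.stringify({ [word]: wordData }) + '\n';
    if (!wstream.write(line)) {
      await once(wstream, 'drain'); // wait until buffered data is flushed
    }
  }
  wstream.end();
  await once(wstream, 'finish'); // all data handed to the file system
}
```

Because the loop is inside an async function, the calculations for the next entry only run after the stream has room again, which is what keeps memory use flat.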
You should wrap your data source in a readable stream too. I don't know what your source is, but you have to make sure it does not load all your data into memory.
For example, assuming your data set comes from another file where JSON objects are separated by an end-of-line character, you could create a read stream as follows:
const Readable = require('stream').Readable;
class JSONReader extends Readable {
constructor(options={}){
super(options);
this._source=options.source; // the source stream
this._buffer='';
this._source.on('readable', function() {
this.read();
}.bind(this));//read whenever the source is ready
}
_read(size){
var chunk;
var line;
var lineIndex;
var result;
if (this._buffer.length === 0) {
chunk = this._source.read(); // read more from source when buffer is empty
if (chunk !== null) this._buffer += chunk; // read() returns null when no data is available
}
lineIndex = this._buffer.indexOf('\n'); // find end of line
if (lineIndex !== -1) { //we have a end of line and therefore a new object
line = this._buffer.slice(0, lineIndex); // get the character related to the object
if (line) {
result = JSON.parse(line);
this._buffer = this._buffer.slice(lineIndex + 1);
this.push(JSON.stringify(result)); // push to the internal read queue
} else {
this._buffer = this._buffer.slice(1); // drop the empty line
}
}
}}
now you can use
const source = fs.createReadStream('mySourceFile');
const reader = new JSONReader({source});
const target = fs.createWriteStream('myTargetFile');
reader.pipe(target);
Then you'll have a better memory flow.
Please note that the picture and the above example are taken from the excellent Node.js in Practice book.
I'm using Express.js and have a route to upload images that I then need to resize. Currently I just let Express write the file to disk (which I think uses node-formidable under the covers) and then resize using gm (http://aheckmann.github.com/gm/) which writes a second version to disk.
gm(path)
.resize(540,404)
.write(dest, function (err) { ... });
I've read that you can get a hold of the node-formidable file stream before it writes it to disk, and since gm can accept a stream instead of just a path, I should be able to pass this right through eliminating the double write to disk.
I think I need to override form.onPart but I'm not sure where (should it be done as Express middleware?) and I'm not sure how to get a hold of form or what exactly to do with the part. This is the code skeleton that I've seen in a few places:
form.onPart = function(part) {
if (!part.filename) { form.handlePart(part); return; }
part.on('data', function(buffer) {
});
part.on('end', function() {
});
};
Can somebody help me put these two pieces together? Thanks!
You're on the right track by rewriting form.onPart. Formidable writes to disk by default, so you want to act before it does.
Parts themselves are Streams, so you can pipe them to whatever you want, including gm. I haven't tested it, but this makes sense based on the documentation:
var form = new formidable.IncomingForm;
form.onPart = function (part) {
if (!part.filename) return this.handlePart(part);
gm(part).resize(200, 200).stream(function (err, stdout, stderr) {
stdout.pipe(fs.createWriteStream('my/new/path/to/img.png'));
});
};
As for the middleware, I'd copypaste the multipart middleware from Connect/Express and add the onPart function to it: http://www.senchalabs.org/connect/multipart.html
It'd be a lot nicer if formidable didn't write to disk by default or if it took a flag, wouldn't it? You could send them an issue.
I'm trying to write a function that would use native openssl to do some RSA heavy lifting for me, rather than using a JS RSA library. The goal is to:
1. Read binary data from a file
2. Do some processing in the node process, using JS, resulting in a Buffer containing binary data
3. Write the buffer to the stdin stream of the exec command
4. RSA encrypt/decrypt the data and write it to the stdout stream
5. Get the input data back to a Buffer in the JS-process for further processing
The child process module in Node has an exec command, but I fail to see how I can pipe the input to the process and pipe it back to my process. Basically I'd like to execute the following type of command, but without having to rely on writing things to files (didn't check the exact syntax of openssl)
cat the_binary_file.data | openssl -encrypt -inkey key_file.pem -certin > the_output_stream
I could do this by writing a temp file, but I'd like to avoid it, if possible. Spawning a child process allows me access to stdin/out but haven't found this functionality for exec.
Is there a clean way to do this in the way I drafted here? Is there some alternative way of using openssl for this, e.g. some native bindings for openssl lib, that would allow me to do this without relying on the command line?
You've mentioned spawn but seem to think you can't use it. Possibly showing my ignorance here, but it seems like it should be just what you're looking for: Launch openssl via spawn, then write to child.stdin and read from child.stdout. Something very roughly like this completely untested code:
var util = require('util'),
spawn = require('child_process').spawn;
function sslencrypt(buffer_to_encrypt, callback) {
var ssl = spawn('openssl', ['-encrypt', '-inkey', 'key_file.pem', '-certin']),
result = Buffer.alloc(SOME_APPROPRIATE_SIZE),
resultSize = 0;
ssl.stdout.on('data', function (data) {
// Save up the result (or perhaps just call the callback repeatedly
// with it as it comes, whatever)
if (data.length + resultSize > result.length) {
// Too much data, our SOME_APPROPRIATE_SIZE above wasn't big enough
}
else {
// Append to our buffer at the current offset
data.copy(result, resultSize);
resultSize += data.length;
}
});
ssl.stderr.on('data', function (data) {
// Handle error output
});
ssl.on('exit', function (code) {
// Done, trigger your callback (perhaps check `code` here)
callback(result, resultSize);
});
// Write the buffer and close stdin so openssl sees the end of the input
ssl.stdin.end(buffer_to_encrypt);
}
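If you don't know the output size up front, a variation on the same idea is to collect the stdout chunks in an array and join them with Buffer.concat at the end. The sketch below is only that: runWithInput is a made-up helper name, and a small Node child that echoes stdin stands in for openssl so the example does not depend on any particular openssl arguments.

```javascript
const { spawn } = require('child_process');

// Run a command, feed `input` to its stdin, and resolve with everything
// it wrote to stdout as a single Buffer.
function runWithInput(command, args, input) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args);
    const out = [];
    const err = [];
    child.stdout.on('data', (chunk) => out.push(chunk));
    child.stderr.on('data', (chunk) => err.push(chunk));
    child.on('error', reject); // e.g. command not found
    child.stdin.on('error', reject); // e.g. child exited before reading
    child.on('close', (code) => {
      if (code === 0) {
        resolve(Buffer.concat(out));
      } else {
        reject(new Error(`exit code ${code}: ${Buffer.concat(err).toString()}`));
      }
    });
    child.stdin.end(input); // write the buffer and close stdin
  });
}

// Stand-in child that echoes stdin back, used here instead of openssl.
const echoArgs = ['-e', 'process.stdin.pipe(process.stdout)'];
runWithInput(process.execPath, echoArgs, Buffer.from([0, 1, 2, 255]))
  .then((result) => console.log('got', result.length, 'bytes back'));
```

Swapping in openssl would just mean passing 'openssl' and its argument list in place of process.execPath and echoArgs.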
You should be able to set the encoding to binary when you make a call to exec, like this:
exec("openssl output_something_in_binary", {encoding: 'binary'}, function(err, out, stderr) {
//do something with out - which is in the binary format
});
If you want to write out the content of "out" in binary, make sure to set the encoding to binary again, like this:
fs.writeFile("out.bin", out, {encoding: 'binary'}, function (err) { if (err) throw err; });
I hope this helps!