Running out of memory writing to a file in NodeJS - javascript

I'm processing a very large amount of data, manipulating it, and storing it in a file. I iterate over the dataset, then I want to store it all in a JSON file.
My initial method, using fs and storing it all in an object before dumping it, didn't work: I was running out of memory and it became extremely slow.
I'm now using fs.createWriteStream but as far as I can tell it's still storing it all in memory.
I want the data to be written object by object to the file, unless someone can recommend a better way of doing it.
Part of my code:
// Top of the file
var wstream = fs.createWriteStream('mydata.json');
...
// In a loop
let JSONtoWrite = {}
JSONtoWrite[entry.word] = wordData
wstream.write(JSON.stringify(JSONtoWrite))
...
// Outside my loop (when memory is probably maxed out)
wstream.end()
I think I'm using Streams wrong, can someone tell me how to write all this data to a file without running out of memory? Every example I find online relates to reading a stream in but because of the calculations I'm doing on the data, I can't use a readable stream. I need to add to this file sequentially.

The problem is that you're not waiting for the data to be flushed to the filesystem; instead you keep pushing more and more data into the stream synchronously in a tight loop.
Here's a piece of pseudocode that should work for you:
// Top of the file
const fs = require('fs');
const wstream = fs.createWriteStream('mydata.json');

// I'm not sure how you're getting the data; let's say you have it all in an object
const entry = {};
const words = Object.keys(entry);

function writeCB(index) {
    if (index >= words.length) {
        wstream.end();
        return;
    }
    const JSONtoWrite = {};
    JSONtoWrite[words[index]] = entry[words[index]];
    // only move on to the next entry once this one has been flushed
    wstream.write(JSON.stringify(JSONtoWrite), () => writeCB(index + 1));
}

writeCB(0);
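Another way to respect the stream's backpressure, as a rough sketch rather than a drop-in solution, is to check the return value of write() and wait for the 'drain' event before writing more; entries below is just a placeholder for however you iterate over your data:
const fs = require('fs');
const wstream = fs.createWriteStream('mydata.json');

async function writeAll(entries) {
    // entries is assumed to yield [word, wordData] pairs
    for (const [word, wordData] of entries) {
        const ok = wstream.write(JSON.stringify({ [word]: wordData }));
        if (!ok) {
            // the internal buffer is full: wait until it has been flushed
            await new Promise(resolve => wstream.once('drain', resolve));
        }
    }
    wstream.end();
}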

You should wrap your data source in a readable stream too. I don't know what your source is, but you have to make sure it does not load all your data into memory.
For example, assuming your data set comes from another file where JSON objects are separated by end-of-line characters, you could create a Read stream as follows:
const Readable = require('stream').Readable;

class JSONReader extends Readable {
    constructor(options = {}) {
        super(options);
        this._source = options.source; // the source stream
        this._buffer = '';
        // read whenever the source is ready
        this._source.on('readable', () => this.read());
    }

    _read(size) {
        if (this._buffer.length === 0) {
            const chunk = this._source.read(); // read more from source when buffer is empty
            if (chunk !== null) {
                this._buffer += chunk;
            }
        }
        const lineIndex = this._buffer.indexOf('\n'); // find end of line
        if (lineIndex !== -1) { // we have an end of line and therefore a complete object
            const line = this._buffer.slice(0, lineIndex); // get the characters belonging to the object
            if (line) {
                const result = JSON.parse(line); // validate the object
                this._buffer = this._buffer.slice(lineIndex + 1);
                this.push(JSON.stringify(result)); // push to the internal read queue
            } else {
                this._buffer = this._buffer.slice(1); // skip the empty line
            }
        }
    }
}
Now you can use it like this:
const source = fs.createReadStream('mySourceFile');
const reader = new JSONReader({source});
const target = fs.createWriteStream('myTargetFile');
reader.pipe(target);
Then you'll have a much better memory profile.
Please note that the above example is adapted from the excellent Node.js in Practice book.
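For what it's worth, a similar result can be achieved with the built-in readline module instead of a hand-rolled Readable; here is a rough sketch under the same assumption of a line-delimited source file (backpressure handling on the target omitted for brevity):
const fs = require('fs');
const readline = require('readline');

const rl = readline.createInterface({
    input: fs.createReadStream('mySourceFile'),
    crlfDelay: Infinity
});
const target = fs.createWriteStream('myTargetFile');

rl.on('line', line => {
    if (!line) return;            // skip empty lines
    const obj = JSON.parse(line); // one JSON object per line
    target.write(JSON.stringify(obj) + '\n');
});
rl.on('close', () => target.end());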

Related

How can I update or add new objects in the middle of a stream process in Nodejs?

I'm new to programming and am self-studying how to use createStream. I'm kind of lost on how to use streams in Node.js. Basically, what I want to do is read a JSON file (more than 1 GB) which holds a massive array of objects, then update the existing values of a certain object or add another object. I am able to do it using the normal read, update or add, then write approach. The problem is I'm getting a large spike in RAM usage.
My code is like this:
const fs = require(`fs-extra`);

async function updateOrAdd() {
    var datafile = await fs.readJson(`./bigJSONfile.json`);
    var tofind = { user: 'alexa', age: 21, country: 'japan', pending: 1, paid: 0 };
    var foundData = datafile.find(x => x.user === tofind.user && x.country === tofind.country);
    if (!foundData) {
        datafile = datafile.concat(tofind);
    } else {
        foundData.pending += 1;
        foundData.paid += 1;
    }
    await fs.writeJson(`./bigJSONfile.json`, datafile);
}
I saw some code for reference using createStream, and they say pipe is the most efficient way in terms of memory usage. Though mostly what I saw just makes a copy of the original file.
I would really appreciate it if anyone could teach me how to do this using streams, or provide the code for it :).
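A minimal sketch of one possible streaming approach, under the assumption that the data can be stored as newline-delimited JSON (one object per line, e.g. big.ndjson) instead of one giant array, so each record can be processed independently with roughly constant memory:
const fs = require('fs');
const readline = require('readline');

async function updateOrAddStreaming() {
    const tofind = { user: 'alexa', age: 21, country: 'japan', pending: 1, paid: 0 };
    let found = false;

    const input = fs.createReadStream('./big.ndjson');
    const output = fs.createWriteStream('./big.ndjson.tmp');
    const rl = readline.createInterface({ input, crlfDelay: Infinity });

    for await (const line of rl) {
        if (!line.trim()) continue;
        const record = JSON.parse(line);
        if (record.user === tofind.user && record.country === tofind.country) {
            record.pending += 1;
            record.paid += 1;
            found = true;
        }
        // honour backpressure: write() returns false when the buffer is full
        if (!output.write(JSON.stringify(record) + '\n')) {
            await new Promise(resolve => output.once('drain', resolve));
        }
    }
    if (!found) output.write(JSON.stringify(tofind) + '\n');
    output.end();

    // swap the temp file in once it has been fully written
    await new Promise(resolve => output.on('finish', resolve));
    fs.renameSync('./big.ndjson.tmp', './big.ndjson');
}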

Is it possible to write text in the middle of a file with fs.createWriteStream? (or in Node.js in general)

I'm trying to write into a text file, but not at the end like appendFile() does, and not by replacing the entire content...
I saw it was possible to choose where you want to start writing with the start parameter of fs.createWriteStream() -> https://nodejs.org/api/fs.html#fs_fs_createwritestream_path_options
But there is no parameter to say where to stop writing, right? So it wipes out everything after the point where I write with this function.
const fs = require('fs');

var logger = fs.createWriteStream('result.csv', {
    flags: 'r+',
    start: 20 // start to write at the 20th character
});

logger.write('5258,525,98951,0,1\n'); // example of a new line to write
Is there a way to specify where to stop writing in the file, to have something like:
....
data from begining
....
5258,525,98951,0,1
...
data till the end
...
I suspect you mean, "Is it possible to insert in the middle of the file?" The answer to that is: No, it isn't.
Instead, to insert, you have to:
Determine how big what you're inserting is
Copy the data at your insertion point to that many bytes later in the file
Write your data
Obviously when doing #2 you need to be sure that you're not overwriting data you haven't copied yet (either by reading it all into memory first or by working in blocks, from the end of the file toward the insertion point).
(I've never looked for one, but there may be an npm module out there that does this for you...)
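For illustration, here is a rough sketch of that block-wise shift; the helper name, block size and offsets are arbitrary, and error handling is omitted:
const fs = require('fs');

// Shift the tail of the file toward the end, block by block, then write the
// new data into the gap. Works from the end of the file toward the insertion
// point so no unread bytes are overwritten.
function insertIntoFile(path, offset, insertBuf, blockSize = 64 * 1024) {
    const fd = fs.openSync(path, 'r+');
    const { size } = fs.fstatSync(fd);
    const block = Buffer.alloc(blockSize);
    let remaining = size - offset; // number of bytes that have to move

    while (remaining > 0) {
        const chunk = Math.min(blockSize, remaining);
        const readPos = offset + remaining - chunk;
        const bytesRead = fs.readSync(fd, block, 0, chunk, readPos);
        fs.writeSync(fd, block, 0, bytesRead, readPos + insertBuf.length);
        remaining -= bytesRead;
    }

    fs.writeSync(fd, insertBuf, 0, insertBuf.length, offset);
    fs.closeSync(fd);
}

// Example: insert a CSV row at byte 20 of result.csv
insertIntoFile('result.csv', 20, Buffer.from('5258,525,98951,0,1\n'));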
You could read and parse your file first, then apply the modifications and save the new file.
Something like:
const fs = require("fs");

const fileData = fs.readFileSync("result.csv", { encoding: "utf8" });
const fileDataArray = fileData.split("\n");

const newData = "5258,525,98951,0,1";
const index = 2; // zero-based row index at which to insert your data

fileDataArray.splice(index, 0, newData); // insert data into the array
const newFileData = fileDataArray.join("\n"); // create the new file content
fs.writeFileSync("result.csv", newFileData, { encoding: "utf8" }); // save it

Pass large array to node child process

I have complex CPU intensive work I want to do on a large array. Ideally, I'd like to pass this to the child process.
var spawn = require('child_process').spawn;
// dataAsNumbers is a large 2D array
var child = spawn(process.execPath, ['/child_process_scripts/getStatistics', dataAsNumbers]);
child.stdout.on('data', function(data){
console.log('from child: ', data.toString());
});
But when I do, node gives the error:
spawn E2BIG
I came across this article
So piping the data to the child process seems to be the way to go. My code is now:
var spawn = require('child_process').spawn;
console.log('creating child........................');
var options = { stdio: [null, null, null, 'pipe'] };
var args = [ '/getStatistics' ];
var child = spawn(process.execPath, args, options);
var pipe = child.stdio[3];
pipe.write(Buffer('awesome'));
child.stdout.on('data', function(data){
console.log('from child: ', data.toString());
});
And then in getStatistics.js:
console.log('im inside child');
process.stdin.on('data', function(data) {
console.log('data is ', data);
process.exit(0);
});
However the callback in process.stdin.on isn't reached. How can I receive a stream in my child script?
EDIT
I had to abandon the buffer approach. Now I'm sending the array as a message:
var cp = require('child_process');
var child = cp.fork('/getStatistics.js');
child.send({
dataAsNumbers: dataAsNumbers
});
But this only works when the length of dataAsNumbers is below about 20,000, otherwise it times out.
With such a massive amount of data, I would look into using shared memory rather than copying the data into the child process (which is what is happening when you use a pipe or pass messages). This will save memory, take less CPU time for the parent process, and be unlikely to bump into some limit.
shm-typed-array is a very simple module that seems suited to your application. Example:
parent.js
"use strict";
const shm = require('shm-typed-array');
const fork = require('child_process').fork;

// Create shared memory
const SIZE = 20000000;
const data = shm.create(SIZE, 'Float64Array');

// Fill with dummy data
Array.prototype.fill.call(data, 1);

// Spawn child, set up communication, and give shared memory
const child = fork("child.js");
child.on('message', sum => {
    console.log(`Got answer: ${sum}`);
    // Demo only; ideally you'd re-use the same child
    child.kill();
});
child.send(data.key);
child.js
"use strict";
const shm = require('shm-typed-array');

process.on('message', key => {
    // Get access to shared memory
    const data = shm.get(key, 'Float64Array');
    // Perform processing
    const sum = Array.prototype.reduce.call(data, (a, b) => a + b, 0);
    // Return processed data
    process.send(sum);
});
Note that we are only sending a small "key" from the parent to the child process through IPC, not the whole data. Thus, we save a ton of memory and time.
Of course, you can change 'Float64Array' (e.g. a double) to whatever typed array your application requires. Note that this library in particular only handles single-dimensional typed arrays; but that should only be a minor obstacle.
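For the two-dimensional case, a minimal sketch (assuming a rectangular rows × cols array) is to flatten it into the shared typed array and index it as row * cols + col; the helper below is purely illustrative:
// Copy a rectangular 2D array into a flat (shared) typed array.
function fillShared(flat, twoD) {
    const cols = twoD[0].length;
    for (let row = 0; row < twoD.length; row++) {
        for (let col = 0; col < cols; col++) {
            flat[row * cols + col] = twoD[row][col];
        }
    }
    return cols; // the child only needs cols (plus the key) to index the data
}

// In the child, element (row, col) is then simply data[row * cols + col].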
I too was able to reproduce the delay you were experiencing, but maybe not as badly as you. I used the following:
// main.js
const fork = require('child_process').fork
const child = fork('./getStats.js')
const dataAsNumbers = Array(100000).fill(0).map(() =>
Array(100).fill(0).map(() => Math.round(Math.random() * 100)))
child.send({
dataAsNumbers: dataAsNumbers,
})
And
// getStats.js
process.on('message', function (data) {
console.log('data is ', data)
process.exit(0)
})
node main.js 2.72s user 0.45s system 103% cpu 3.045 total
I'm generating 100k elements composed of 100 numbers each to mock your data. Make sure you are listening for the message event on process. But maybe your child processing is more complex, which might be the reason for the failure; it also depends on the timeout you set on your query.
If you want to get better results, what you could do is chunk your data into multiple pieces that will be sent to the child process and reconstructed to form the initial array.
Also, one possibility would be to use a third-party library or protocol, even if it's a bit more work. You could have a look at messenger.js or even something like an AMQP queue, which would allow you to communicate between the two processes with a pool and a guarantee that messages are acknowledged by the sub-process. There are a few Node implementations of it, like amqp.node, but it would still require a bit of setup and configuration work.
Use an in-memory cache like https://github.com/ptarjan/node-cache, and let the parent process store the array contents under some key; the child process would then retrieve the contents through that key.
You could consider using OS pipes (you'll find a gist here) as an input to your Node child application.
I know this is not exactly what you're asking for, but you could use the cluster module (included in Node). This way you can get as many instances as your machine has cores to speed up processing. Moreover, consider using streams if you don't need to have all the data available before you start processing. If the data to be processed is too large, I would store it in a file so you can reinitialize if there is any error during the process.
Here is an example of clustering.
var cluster = require('cluster');
var numCPUs = 4;

if (cluster.isMaster) {
    for (var i = 0; i < numCPUs; i++) {
        var worker = cluster.fork();
        console.log('id', worker.id);
    }
} else {
    doSomeWork();
}

function doSomeWork() {
    for (var i = 1; i < 10; i++) {
        console.log(i);
    }
}
More info on sending messages across workers: question 8534462.
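A minimal sketch of that master/worker messaging (the chunking here is only illustrative) could look like this:
var cluster = require('cluster');
var numCPUs = 4;

if (cluster.isMaster) {
    for (var i = 0; i < numCPUs; i++) {
        var worker = cluster.fork();
        // hypothetical slice of your data for this worker
        worker.send({ chunk: [i, i + 1, i + 2] });
        worker.on('message', function (msg) {
            console.log('partial result', msg.sum);
        });
    }
} else {
    // in the worker: receive the chunk, process it, send the result back
    process.on('message', function (msg) {
        var sum = msg.chunk.reduce(function (a, b) { return a + b; }, 0);
        process.send({ sum: sum });
    });
}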
Why do you want to make a subprocess? Sending the data across subprocesses is likely to cost more in terms of CPU and real time than you will save by keeping the processing within the same process.
Instead, I would suggest that for super efficient coding you consider doing your statistics calculations in a worker thread that runs within the same memory as the Node.js main process.
You can use NAN to write C++ code that you can post to a worker thread, and then have that worker thread post the result and an event back to your Node.js event loop when done.
The benefit of this is that you don't need extra time to send the data across to a different process; the downside is that you will have to write a bit of C++ code for the threaded action, but the NAN extension should take care of most of the difficult work for you.
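As a rough sketch of the same idea without writing any C++: newer Node.js versions ship a built-in worker_threads module. Note this is a different mechanism than the NAN approach described above, and workerData is still structured-cloned unless you use a SharedArrayBuffer:
// main.js
const { Worker } = require('worker_threads');

const dataAsNumbers = [[1, 2, 3], [4, 5, 6]]; // stands in for your large 2D array

const worker = new Worker('./stats-worker.js', { workerData: dataAsNumbers });
worker.on('message', stats => console.log('statistics:', stats));
worker.on('error', err => console.error(err));

// stats-worker.js
const { parentPort, workerData } = require('worker_threads');

// the CPU-intensive work runs off the main event loop
let sum = 0;
for (const row of workerData) {
    for (const value of row) sum += value;
}
parentPort.postMessage({ sum });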
To address the performance issue while passing large data to the child process, save the data to a .json or .txt file and pass only the filename to the child process. I've achieved a 70% performance improvement with this approach.
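A minimal sketch of that approach (file names are only placeholders):
// parent.js
const fs = require('fs');
const fork = require('child_process').fork;

const dataAsNumbers = [[1, 2, 3], [4, 5, 6]]; // your large array
const tmpPath = './dataAsNumbers.json';       // hypothetical temp file

fs.writeFileSync(tmpPath, JSON.stringify(dataAsNumbers));
const child = fork('./getStatistics.js', [tmpPath]); // pass only the path
child.on('message', result => console.log('from child:', result));

// getStatistics.js
const fs = require('fs');
const data = JSON.parse(fs.readFileSync(process.argv[2], 'utf8'));
process.send({ count: data.length }); // replace with the real statistics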
For long-running tasks you could use something like Gearman. You can do the heavy work on workers, and set up as many workers as you need. For example, I do some file processing this way: if I need to scale I create more worker instances, and I have different workers for different tasks (processing zip files, generating thumbnails, etc.). The nice thing is that the workers can be written in any language (Node.js, Java, Python) and can be integrated into your project with ease.
// worker-unzip.js
const debug = require('debug')('worker:unzip');
const {series, apply} = require('async');
const gearman = require('gearmanode');
const {mkdirpSync} = require('fs-extra');
const extract = require('extract-zip');

module.exports.unzip = unzip;
module.exports.worker = worker;

function unzip(inputPath, outputDirPath, done) {
    debug('unzipping', inputPath, 'to', outputDirPath);
    mkdirpSync(outputDirPath);
    extract(inputPath, {dir: outputDirPath}, done);
}

/**
 *
 * @param {Job} job
 */
function workerUnzip(job) {
    const {inputPath, outputDirPath} = JSON.parse(job.payload);
    series([
        apply(unzip, inputPath, outputDirPath),
        (done) => job.workComplete(outputDirPath)
    ], (err) => {
        if (err) {
            console.error(err);
            job.reportError();
        }
    });
}

function worker(config) {
    const worker = gearman.worker(config);
    if (config.id) {
        worker.setWorkerId(config.id);
    }
    worker.addFunction('unzip', workerUnzip, {timeout: 10, toStringEncoding: 'ascii'});
    worker.on('error', (err) => console.error(err));
    return worker;
}
a simple index.js
const worker = require('./worker-unzip').worker;
worker(config); // pass host and port of the Gearman server
I normally run workers with PM2
The integration with your code is very easy, something like:
//initialize
const gearman = require('gearmanode');
gearman.Client.logger.transports.console.level = 'error';
const client = gearman.client(configGearman); // same host and port
Then just add work to the queue, passing the name of the function:
const taskpayload = {inputPath: '/tmp/sample-file.zip', outputDirPath: '/tmp/unzip/sample-file/'}
const job = client.submitJob('unzip', JSON.stringify(taskpayload));
job.on('complete', jobCompleteCallback);
job.on('error', jobErrorCallback);

Pass object by reference from/to webworker

Is it possible to pass an object from/to a web worker from/to the main thread by reference? I have read here information about transferable objects.
Chrome 13 introduced sending ArrayBuffers to/from a Web Worker using
an algorithm called structured cloning. This allowed the postMessage()
API to accept messages that were not just strings, but complex types
like File, Blob, ArrayBuffer, and JSON objects. Structured cloning is
also supported in later versions of Firefox.
I just want to pass information, not an object with methods. Just something like this (but with a lot of information, a few MB, so that the main thread does not have to receive a copy of the object):
var test = {
some: "data"
}
Once you have some data in an object (like this: {bla:666, color:"red"}), you will have to copy it and there is no way to avoid it. The reason is that you don't have control over the memory the object is stored in, so you can't transfer it. The only memory that can be transferred is memory allocated for transferable objects - typed arrays.
Therefore if you need some data transferred, you must think in advance and use the transferable interface. Also keep in mind that even when an object is copied, the transfer speed is very fast.
I wrote a library that converts an object to binary data (and is therefore transferable), but it isn't faster than native transfer; it's actually way slower. The only advantage is that it allows me to transfer unsupported data types (e.g. Function).
There Is An Array 2nd Argument To postMessage
Actually yes, it is possible (surprise, surprise!) in Chrome 17+ and Firefox 18+ for certain objects (see here).
// Create a 32MB "file" and fill it.
var uInt8Array = new Uint8Array(1024 * 1024 * 32); // 32MB
for (var i = 0; i < uInt8Array.length; ++i) {
uInt8Array[i] = i;
}
worker.postMessage(uInt8Array.buffer, [uInt8Array.buffer]);
You can also apply this to strings by converting the string to and from an array buffer using FastestSmallestTextEncoderDecoder as shown below.
//inside the worker
var encoderInst = new TextEncoder;
function post_string(the_string){
var my_array_buffer = encoderInst.encode(the_string).buffer;
postMessage( my_array_buffer, [my_array_buffer] );
}
Then, to read the arraybuffer as a string:
// var workerInstance = new Worker("/path/to/file.js");
var decoderInst = new TextDecoder;
workerInstance.onmessage = function decode_buffer(evt){
var buffer = evt.data;
var str = decoderInst.decode(buffer);
console.log("From worker: " + str);
return str;
}
Here is a small interactive example of using a Worker to increment each letter of a string.
var incrementWorker = new Worker("data:text/javascript;base64,"+btoa(function(){
// inside the worker
importScripts("https://dl.dropboxusercontent.com/s/r55397ld512etib/Encode" +
"rDecoderTogether.min.js?dl=0");
const decoderInst = new TextDecoder;
self.onmessage = function(evt){
const u8Array = new Uint8Array(evt.data);
for (var i=0, len=u8Array.length|0; i<len; i=i+1|0) {
++u8Array[i];
}
postMessage(decoderInst.decode(u8Array));
};
} .toString().slice("function(){".length, -"}".length)));
const inputElement = document.getElementById("input");
const encoderInst = new TextEncoder;
(inputElement.oninput = function() {
const buffer = encoderInst.encode(inputElement.value).buffer;
incrementWorker.postMessage(buffer, [buffer]); // pass happens HERE
})();
incrementWorker.onmessage = function(evt){
document.getElementById("output").value = evt.data;
};
<script src="https://dl.dropboxusercontent.com/s/r55397ld512etib/EncoderDecoderTogether.min.js?dl=0" type="text/javascript"></script>
Before: <input id="input" type="text" value="abc123 foobar" /><br />
After: <input id="output" type="text" readonly="" />
Sources: Google Developers and MDN
It is not possible. You have to send the object, update it in the worker and then return the updated version to the main thread.
If you want to pass an object just with information, you only need to pass your object as a string
myWorker.postMessage(JSON.stringify(myObject));
parse the object inside your worker
JSON.parse(myObject)
and finally return your updated object to the main thread.
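A minimal round-trip sketch of that approach (the worker file name is just an example):
// main.js
const myWorker = new Worker('worker.js');
const myObject = { some: 'data', count: 1 };

myWorker.onmessage = function (evt) {
    const updated = JSON.parse(evt.data); // back to an object on the main thread
    console.log(updated);
};
myWorker.postMessage(JSON.stringify(myObject));

// worker.js
onmessage = function (evt) {
    const obj = JSON.parse(evt.data);  // parse the object inside the worker
    obj.count += 1;                    // update it
    postMessage(JSON.stringify(obj));  // send the updated version back
};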
Take a look also at Parallel.js, a library that makes it easier to work with web workers.

Create exportable object or module to wrap third-party library with CommonJS/NodeJS javascript

I'm new to JavaScript and creating classes/objects. I'm trying to wrap an open source library's code with some simple methods for me to use in my routes.
I have the below code that is straight from the source (sjwalter's Github repo; thanks Stephen for the library!).
I'm trying to export a file/module to my main app/server.js file with something like this:
var twilio = require('nameOfMyTwilioLibraryModule');
or whatever it is I need to do.
I'm looking to create methods like twilio.send(number, message) that I can easily use in my routes to keep my code modular. I've tried a handful of different ways but couldn't get anything to work. This might not be a great question because you need to know how the library works (and Twilio too). The var phone = client.getPhoneNumber(creds.outgoing); line makes sure that my outgoing number is a registered/paid-for number.
Here's the full example that I'm trying to wrap with my own methods:
var TwilioClient = require('twilio').Client,
Twiml = require('twilio').Twiml,
creds = require('./twilio_creds').Credentials,
client = new TwilioClient(creds.sid, creds.authToken, creds.hostname),
// Our numbers list. Add more numbers here and they'll get the message
numbers = ['+numbersToSendTo'],
message = '',
numSent = 0;
var phone = client.getPhoneNumber(creds.outgoing);
phone.setup(function() {
for(var i = 0; i < numbers.length; i++) {
phone.sendSms(numbers[i], message, null, function(sms) {
sms.on('processed', function(reqParams, response) {
console.log('Message processed, request params follow');
console.log(reqParams);
numSent += 1;
if(numSent == numToSend) {
process.exit(0);
}
});
});
}
});`
Simply add the function(s) you wish to expose as properties on the exports object. Assuming your file was named mytwilio.js and stored under app/ and looks like,
app/mytwilio.js
var twilio = require('twilio');
var TwilioClient = twilio.Client;
var Twiml = twilio.Twiml;
var creds = require('./twilio_creds').Credentials;
var client = new TwilioClient(creds.sid, creds.authToken, creds.hostname);
// keeps track of whether the phone object
// has been populated or not.
var initialized = false;

var phone = client.getPhoneNumber(creds.outgoing);
phone.setup(function() {
    // phone object has been populated
    initialized = true;
});

exports.send = function(number, message, callback) {
    // ignore request and throw if not initialized
    if (!initialized) {
        throw new Error("Patience! We are init'ing");
    }
    // otherwise process request and send SMS
    phone.sendSms(number, message, null, function(sms) {
        sms.on('processed', callback);
    });
};
This file is mostly identical to what you already have, with one crucial difference: it remembers whether the phone object has been initialized or not. If it hasn't been initialized, it simply throws an error if send is called. Otherwise it proceeds with sending the SMS. You could get fancier and create a queue that stores all messages to be sent until the object is initialized, and then sends them all out later, as sketched below.
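A rough sketch of that queued variant, reusing the initialized and phone variables from the wrapper above (it replaces the setup callback and exports.send shown there):
var pending = []; // messages queued while the phone object initializes

phone.setup(function() {
    initialized = true;
    // flush everything that was queued before setup finished
    pending.forEach(function(args) {
        phone.sendSms(args.number, args.message, null, function(sms) {
            sms.on('processed', args.callback);
        });
    });
    pending = [];
});

exports.send = function(number, message, callback) {
    if (!initialized) {
        pending.push({ number: number, message: message, callback: callback });
        return;
    }
    phone.sendSms(number, message, null, function(sms) {
        sms.on('processed', callback);
    });
};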
This is just a lazy approach to get you started. To use the function(s) exported by the above wrapper, simply require it in the other js file(s). The send function captures everything it needs (the initialized and phone variables) in a closure, so you don't have to worry about exporting every single dependency. Here's an example of a file that makes use of the above.
app/mytwilio-test.js
var twilio = require("./mytwilio");
twilio.send("+123456789", "Hello there!", function(reqParams, response) {
// do something absolutely crazy with the arguments
});
If you don't want to require mytwilio.js by its full/relative path, then add it to the paths list. Read up more about the module system and how module resolution works in Node.js.
