fetch and parse a large file, only then fetch the next - javascript

I have to fetch about 30 files with ES6, each of them consisting of roughly 100 MB of text.
I parse the text, line by line, counting some data points. The result is a small array like
[{"2014":34,"2015":34,"2016":34,"2017":34,"2018":12}]
I'm running into memory problems while parsing the files (Chrome simply crashes the debugger), probably because I am parsing them all with map:
return Promise.all(filenamesArray.map( /* fetch each file in filenamesArray */ ))
    .then(() => { /* parse them all */ })
I'm not posting the full code because I know it's wrong anyway. What I would like to do is:
1. Load a single file with fetch.
2. Parse its text into a result array like the one above.
3. Return the result array and store it somewhere until every file has been parsed.
4. Give the JS engine / GC enough time to clear the text from step 1 from memory.
5. Load the next file (continue with step 1, but only after steps 1-4 are finished!).
but I can't seem to find a solution for that. Could anyone show me an example?
I don't care if it's promises, callback functions, or async/await, as long as each file is parsed completely before the next one is started.
EDIT 2020-08-25
Sorry for my late update, I only got around to fixing my problem now.
While I appreciate Josh Lind's answer, I realized that I still had a problem with the async nature of fetch, which I apparently did not describe well enough:
How do I deal with the promises to make sure one file is finished and its memory may be released? I implemented Josh's solution with Promise.all, only to discover that this would still load all the files first and then start processing them.
Luckily, I found another SO question with almost the same problem:
Resolve promises one after another (i.e. in sequence)?
and so I learned about async functions. In order to use them with fetch, this question helped me:
How to use fetch with async/await?
So my final code looks like this:
//returns a promise resolving with an array of all processed files
loadAndCountFiles(filenamesArray) {
    async function readFiles(filenamesArray) {
        let resultArray = [];
        for (const filename of filenamesArray) {
            const response = await fetch(filename);
            const text = await response.text();
            //process the text and return a much smaller extract
            const yearCountObject = processText(text);
            resultArray.push({
                filename: filename,
                yearCountObject: yearCountObject
            });
            console.log("processed file " + filename);
        }
        return resultArray;
    }
    return new Promise(
        (resolve, reject) => {
            console.log("starting filecount...");
            readFiles(filenamesArray)
                .then(resultArray => {
                    console.log("done: " + resultArray);
                    resolve(resultArray);
                })
                .catch((error) => {
                    reject(error);
                })
        }
    );
}
Now every file is fetched and processed before the next.
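Side note: since an async function already returns a promise, the explicit new Promise wrapper above is arguably redundant. A minimal equivalent sketch, using the same processText helper:
async function loadAndCountFiles(filenamesArray) {
    console.log("starting filecount...");
    const resultArray = [];
    for (const filename of filenamesArray) {
        const response = await fetch(filename);
        const text = await response.text();
        // keep only the small extract so the large text can be garbage collected
        resultArray.push({ filename, yearCountObject: processText(text) });
        console.log("processed file " + filename);
    }
    return resultArray;
}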

Global variable:
dictionary = {};
In main:
fileNamesArray.forEach(fname => readFile(fname));
Functions:
const readFile = (fname) => {
    /* get file */.then(file => {
        /* parse file */
        addToDict(year); // year is a string. Call this when you find a year
    })
}
const addToDict = (key) => {
    if (dictionary[key]) dictionary[key]++;
    else dictionary[key] = 1;
}
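As the asker's edit above notes, forEach starts every readFile() call at once, so nothing guarantees one file finishes before the next begins. A minimal sketch of the same idea run strictly one file at a time, assuming readFile is changed to return its fetch/parse promise:
const readFilesInSequence = async (fnames) => {
    for (const fname of fnames) {
        await readFile(fname); // the next file starts only after this one is fetched and parsed
    }
};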

Related

How can I update or add new objects in the middle of a stream process in Nodejs?

I'm new to programming and currently self-studying how to use createStream. I'm kind of lost on how to use streams in Node.js with JS. Basically, what I want to do is read a JSON file (more than 1 GB) that contains a massive array of objects, then either update the existing value of a certain object or add another object. I was able to do it using the normal read, update or add, then write approach. The problem is that I get a large spike in RAM usage.
My code is like this:
const fs = require(`fs-extra`);

async function updateOrAdd () {
    var datafile = await fs.readJson(`./bigJSONfile.json`);
    var tofind = { user: `alexa`, age: 21, country: `japan`, pending: 1, paid: 0 };
    // find (rather than filter) returns the matching object itself, or undefined
    var foundData = datafile.find(x => x.user === tofind.user && x.country === tofind.country);
    if (!foundData) {
        datafile = datafile.concat(tofind);
    } else {
        foundData.pending += 1;
        foundData.paid += 1;
    }
    await fs.writeJson(`./bigJSONfile.json`, datafile);
}
I saw some code for reference on createStream, and they say pipe is the most memory-efficient way. Though mostly what I saw just makes a copy of the original file.
I'd really appreciate it if anyone could teach me how to do this using streams, or provide the code for it :).
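One common way to keep memory flat is to stream the records instead of loading the whole array. A minimal sketch, assuming the data is first converted to newline-delimited JSON (one object per line in a hypothetical ./bigfile.ndjson) so each record can be handled independently:
const fs = require('fs');
const readline = require('readline');

async function updateOrAddStreaming(tofind) {
    const rl = readline.createInterface({ input: fs.createReadStream('./bigfile.ndjson') });
    const out = fs.createWriteStream('./bigfile.updated.ndjson');
    let found = false;
    for await (const line of rl) {
        if (!line.trim()) continue;
        const record = JSON.parse(line); // only one record is in memory at a time
        if (record.user === tofind.user && record.country === tofind.country) {
            record.pending += 1;
            record.paid += 1;
            found = true;
        }
        out.write(JSON.stringify(record) + '\n');
    }
    if (!found) out.write(JSON.stringify(tofind) + '\n'); // append as a new record
    out.end();
}
For a single huge JSON array that cannot be converted, a streaming JSON parser (for example the stream-json package) serves the same purpose.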

What are ways to reduce memory usage by a NodeJS program that is using worker threads?

I have a program that processes a file of 5 million address lines. It reads them in one at a time, places them in chunks, extracts some data and then some more data, and then returns the updated chunk to a file writer to print out, one line at a time, to a different file. I use read streams and file streams, and I just started experimenting with worker threads for one of the data extractions. What are the reasons it could be hitting a JavaScript heap out of memory error, and what are possible ways to improve it?
I'm not posting all the code because the program is way too big, but here's how I'm managing the worker threads.
Function that gets called on each chunk to extract extra data (inData contains the chunk's array, addProper is reference data that gets compared against and is a key:value object, and DataOut is the function that writes the chunk to a file; it is not async):
//Check if other details of the address are correct by the post code
DataValid: async function DataValid(inData, final) {
    const promise = new Promise((resolve, reject) => {
        //Check if city is correct
        let worker = new Worker('./cityprocess.js', {workerData: [inData, addProper]});
        worker.on('message', (message) => resolve(message));
        worker.on('error', reject);
        //Pass updated list with validated cities to DataOut
    });
    promise.then((val) => {
        this.DataOut(val, final);
    });
},
This is how I start up the script that is the worker thread (the cityprocess.js script)
if (workerData != null) {
    ChunkPull(workerData[0]);
}
This is the function that is called at the end of the chunk processing, when it's time to return the updated chunk to the main script (addChunk is that updated chunk):
function ReturnParentMessage(code) {
    parentPort.postMessage(addChunk);
    // 'worker' is not defined inside the worker script itself, so worker.terminate() would
    // throw here; the worker can simply exit once the message has been posted
    process.exit(0);
}
The code also seems to buffer the threads: it processes a few chunks with the threads and then writes out the outputs of about 5 threads at once before processing more. I was planning to fix this by having the writer get called by the worker threads directly.
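A sketch of one way to cap that buffering (not the asker's actual code): run the chunks through the worker one at a time and write each result out before starting the next. It assumes the same ./cityprocess.js worker and a dataOut callback for writing:
const { Worker } = require('worker_threads');

function runWorker(chunk, addProper) {
    return new Promise((resolve, reject) => {
        const worker = new Worker('./cityprocess.js', { workerData: [chunk, addProper] });
        worker.on('message', resolve);
        worker.on('error', reject);
    });
}

async function processChunks(chunks, addProper, dataOut) {
    for (const chunk of chunks) {
        const updated = await runWorker(chunk, addProper); // only one worker result in memory at a time
        dataOut(updated);                                  // write it out before starting the next worker
    }
}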

Reading Multiple files and writing to one file Node.JS

I am currently trying to make a data pipeline using Node.js.
Of course, it's not the best way to make it, but I want to try implementing it anyway before I make improvements upon it.
This is the situation:
I have multiple gzip-compressed CSV files on AWS S3. I get these "objects" using the AWS SDK like the following and make them into a read stream:
const unzip = createGunzip()
const input = s3.getObject(parameterWithBucketandKey)
    .createReadStream()
    .pipe(unzip)
and using the stream above I create a readline interface:
const targetFile = createWriteStream('path to target file');
const rl = createInterface({
    input: input
})
let first = true;
rl.on('line', (line) => {
    if (first) {
        first = false;
        return;
    }
    targetFile.write(line);
    await getstats_and_fetch_filesize();
    if (filesize > allowed_size) {
        changed_file_name = change_the_name_of_file()
        compress(change_file_name)
    }
});
This is wrapped as a promise, and I have an array of filenames to be retrieved from AWS S3. I map over that array of filenames like this:
const arrayOfFileNames = [name1, name2, name3 ... and 5000 more]
const arrayOfPromiseFileProcesses = arrayOfFileNames.map((filename) => promiseFileProcess(filename))
await Promise.all(arrayOfPromiseFileProcesses);
// the result should be multiple gzip files that are compressed again.
Sorry, I wrote this in pseudocode; if more is needed to provide context, I will write more, but I thought this would give the general context of my problem.
My problem is that it writes to a file fine, but when I change the file name it doesn't create a new one afterwards. I am lost in this synchronous and asynchronous world...
Please give me a hint/reference to read up on. Thank you.
The line event handler must be an async function, as it invokes await:
rl.on('line', async (line) => {
    if (first) {
        first = false;
        return;
    }
    targetFile.write(line);
    await getstats_and_fetch_filesize();
    if (filesize > allowed_size) {
        changed_file_name = change_the_name_of_file()
        compress(change_file_name)
    }
});
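A side note beyond the original answer: with 5000 files, Promise.all also starts every download at once. A minimal sketch of processing the names one after another instead, assuming promiseFileProcess(filename) resolves only once a file has been fully read, transformed, and written:
async function processAllSequentially(arrayOfFileNames) {
    for (const filename of arrayOfFileNames) {
        await promiseFileProcess(filename); // the next file starts only after this one finishes
    }
}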

How to read from one stream and write to several at once?

Suppose I have a readable stream, e.g. request(URL). I want to write its response to disk via fs.createWriteStream() by piping from the request. But at the same time I want to calculate a checksum of the downloaded data via a crypto.createHash() stream.
readable -+-> calc checksum
          |
          +-> write to disk
And I want to do it on the fly, without buffering an entire response in memory.
It seems that I can implement it using the old-school on('data') hook. Pseudocode below:
const hashStream = crypto.createHash('sha256');
hashStream.on('error', cleanup);
const dst = fs.createWriteStream('...');
dst.on('error', cleanup);

request(...).on('data', (chunk) => {
    hashStream.write(chunk);
    dst.write(chunk);
}).on('end', () => {
    hashStream.end();
    const checksum = hashStream.read();
    if (checksum != '...') {
        cleanup();
    } else {
        dst.end();
    }
}).on('error', cleanup);

function cleanup() { /* cancel streams, erase file */ };
But such an approach looks pretty awkward. I tried to use stream.Transform or stream.Writable to implement something like read | calc + echo | write, but I'm stuck with the implementation.
Node.js readable streams have a .pipe method which works pretty much like the Unix pipe operator, except that you can stream JS objects as well as just strings of some type.
Here's a link to the doc on pipe
An example of the use in your case could be something like:
const req = request(...);
req.pipe(dst);
req.pipe(hash);
Note that you still have to handle errors per stream, as they're not propagated and the destinations are not closed if the readable errors.
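A short sketch of that per-stream error handling, reusing the cleanup() helper from the question; the URL and output file name are placeholders:
const fs = require('fs');
const crypto = require('crypto');
const request = require('request');

const src = request('https://example.com/file'); // placeholder URL
const dst = fs.createWriteStream('out.bin');     // placeholder file name
const hash = crypto.createHash('sha256');

// pipe() does not forward errors, so each stream gets its own handler
src.on('error', cleanup);
dst.on('error', cleanup);
hash.on('error', cleanup);

src.pipe(dst);
src.pipe(hash);

hash.on('readable', () => {
    const digest = hash.read(); // the digest becomes readable once the source has ended
    if (digest) console.log(digest.toString('hex'));
});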

Running out of memory writing to a file in NodeJS

I'm processing a very large amount of data, manipulating it, and storing it in a file. I iterate over the dataset, and then I want to store it all in a JSON file.
My initial method using fs, storing it all in an object and then dumping it, didn't work, as I was running out of memory and it became extremely slow.
I'm now using fs.createWriteStream but as far as I can tell it's still storing it all in memory.
I want the data to be written object by object to the file, unless someone can recommend a better way of doing it.
Part of my code:
// Top of the file
var wstream = fs.createWriteStream('mydata.json');
...
// In a loop
let JSONtoWrite = {}
JSONtoWrite[entry.word] = wordData
wstream.write(JSON.stringify(JSONtoWrite))
...
// Outside my loop (when memory is probably maxed out)
wstream.end()
I think I'm using streams wrong; can someone tell me how to write all this data to a file without running out of memory? Every example I find online relates to reading a stream in, but because of the calculations I'm doing on the data, I can't use a readable stream. I need to add to this file sequentially.
The problem is that you're not waiting for the data to be flushed to the filesystem, but instead keep throwing new data at the stream synchronously in a tight loop.
Here's a piece of pseudocode that should work for you:
// Top of the file
const wstream = fs.createWriteStream('mydata.json');
// I'm not sure how you're getting the data; let's say you have it all in an object
const entry = {};
const words = Object.keys(entry);

function writeCB(index) {
    if (index >= words.length) {
        wstream.end();
        return;
    }
    const JSONtoWrite = {};
    JSONtoWrite[words[index]] = entry[words[index]];
    // write one chunk and queue the next only after this one has been handed to the stream
    wstream.write(JSON.stringify(JSONtoWrite), () => writeCB(index + 1));
}

writeCB(0);
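An alternative sketch of the same idea using write()'s return value and the 'drain' event (core stream backpressure), assuming the same wstream, words, and entry as above:
async function writeAll() {
    for (const word of words) {
        const ok = wstream.write(JSON.stringify({ [word]: entry[word] }));
        if (!ok) {
            // the internal buffer is full; wait until it drains before writing more
            await new Promise(resolve => wstream.once('drain', resolve));
        }
    }
    wstream.end();
}

writeAll();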
You should wrap your data source in a readable stream too. I don't know what your source is, but you have to make sure it does not load all your data into memory.
For example, assuming your data set comes from another file where JSON objects are separated by end-of-line characters, you could create a read stream as follows:
const Readable = require('stream').Readable;

class JSONReader extends Readable {
    constructor(options = {}) {
        super(options);
        this._source = options.source; // the source stream
        this._buffer = '';
        this._source.on('readable', function () {
            this.read();
        }.bind(this)); // read whenever the source is ready
    }
    _read(size) {
        var chunk;
        var line;
        var lineIndex;
        var result;
        if (this._buffer.length === 0) {
            chunk = this._source.read(); // read more from source when buffer is empty
            if (chunk !== null) this._buffer += chunk;
        }
        lineIndex = this._buffer.indexOf('\n'); // find end of line
        if (lineIndex !== -1) { // we have an end of line and therefore a new object
            line = this._buffer.slice(0, lineIndex); // get the characters belonging to the object
            if (line) {
                result = JSON.parse(line);
                this._buffer = this._buffer.slice(lineIndex + 1);
                this.push(JSON.stringify(result)); // push to the internal read queue
            } else {
                this._buffer = this._buffer.slice(1);
            }
        }
    }
}
now you can use
const source = fs.createReadStream('mySourceFile');
const reader = new JSONReader({source});
const target = fs.createWriteStream('myTargetFile');
reader.pipe(target);
then you'll have a better memory flow.
Please note that the above example is taken from the excellent Node.js in Practice book.
