NodeJS: Using Pipe To Write A File From A Readable Stream Gives Heap Memory Error - javascript

I am trying to create 150 million lines of data and write the data into a csv file so that I can insert the data into different databases with little modification.
I am using a few functions to generate seemingly random data and pushing the data into the readable stream.
The code that I have right now fails with a heap memory error.
After a few hours of research, I am starting to think that I should not be pushing each piece of data at the end of the for loop, because it seems that the pipe method simply cannot handle garbage collection this way.
Also, I found a few Stack Overflow answers and Node.js docs that recommend against using push at all.
However, I am very new to NodeJS and I feel like I am blocked and do not know how to proceed from here.
If someone can provide me any guidance on how to proceed and give me an example, I would really appreciate it.
Below is a part of my code to give you a better understanding of what I am trying to achieve.
P.S. -
I have found a way to successfully handle the memory issue without using the pipe method at all (I used the drain event), but I had to start from scratch, and now I am curious to know if there is a simple way to handle this memory issue without completely changing this bit of code.
Also, I have been trying to avoid using any library because I feel like there should be a relatively easy tweak to make this work without one, but please tell me if I am wrong. Thank you in advance.
// This is my target number of data
const targetDataNum = 150000000;
// Create readable stream
const readableStream = new Stream.Readable({
read() {}
});
// Create writable stream
const writableStream = fs.createWriteStream('./database/RDBMS/test.csv');
// Write columns first
writableStream.write('id, body, date, dp\n', 'utf8');
// Then, push a number of data to the readable stream (150M in this case)
for (var i = 1; i <= targetDataNum; i += 1) {
const id = i;
const body = lorem.paragraph(1);
const date = randomDate(new Date(2014, 0, 1), new Date());
const dp = randomNumber(1, 1000);
const data = `${id},${body},${date},${dp}\n`;
readableStream.push(data, 'utf8');
};
// Pipe readable stream to writeable stream
readableStream.pipe(writableStream);
// End the stream
readableStream.push(null);

Since you're new to streams, maybe start with an easier abstraction: generators. Generators generate data only when it is consumed (just like Streams should), but they don't have buffering and complicated constructors and methods.
This is just your for loop, moved into a generator function:
function * generateData(targetDataNum) {
for (var i = 1; i <= targetDataNum; i += 1) {
const id = i;
const body = lorem.paragraph(1);
const date = randomDate(new Date(2014, 0, 1), new Date());
const dp = randomNumber(1, 1000);
yield `${id},${body},${date},${dp}\n`;
}
}
In Node 12+, you can create a Readable stream directly from any iterable, including generators and async generators:
const { Readable } = require('stream');
const stream = Readable.from(generateData(targetDataNum), { encoding: 'utf8' });
stream.pipe(writableStream);
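For completeness, here is a minimal end-to-end sketch of this approach, assuming the generateData generator above and the question's lorem, randomDate and randomNumber helpers are in scope (stream.pipeline is used instead of pipe so that errors and completion are reported):
const fs = require('fs');
const { Readable, pipeline } = require('stream');

const targetDataNum = 150000000;
const writableStream = fs.createWriteStream('./database/RDBMS/test.csv');

// write the header row first, as in the original code
writableStream.write('id, body, date, dp\n', 'utf8');

pipeline(
  Readable.from(generateData(targetDataNum), { encoding: 'utf8' }),
  writableStream,
  (err) => console.log(err ? `failed: ${err.message}` : 'done')
);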

I suggest trying a solution like the following:
const { Readable } = require('readable-stream');
class CustomReadable extends Readable {
  constructor(max, options = {}) {
    super(options);
    this.targetDataNum = max;
    this.i = 1;
  }
  _read(size) {
    if (this.i <= this.targetDataNum) {
      // your code to build the csv content for row this.i goes here
      this.push(data, 'utf8');
      this.i += 1;
      return;
    }
    this.push(null);
  }
}
const rs = new CustomReadable(150000000);
rs.pipe(ws);
Just complete it with your portion of code that builds the csv content and creates the writable stream (ws).
With this solution you leave the calls to rs.push to the internal _read method, which the stream keeps invoking until this.push(null) is called. Before, you were probably filling the internal stream buffer too fast by calling push manually in a loop, which caused the out-of-memory error. An example of the completed _read follows.
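For example, _read could be completed with the question's helpers like this (a sketch; lorem, randomDate and randomNumber are assumed to be in scope, as in the original code):
_read(size) {
  if (this.i <= this.targetDataNum) {
    const body = lorem.paragraph(1);
    const date = randomDate(new Date(2014, 0, 1), new Date());
    const dp = randomNumber(1, 1000);
    // push one csv row per _read call; the stream calls _read again when it wants more
    this.push(`${this.i},${body},${date},${dp}\n`, 'utf8');
    this.i += 1;
    return;
  }
  this.push(null);
}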

Try piping to the WritableStream before you start pumping data into the ReadableStream, and yield before you write the next chunk.
...
// Write columns first
writableStream.write('id, body, date, dp\n', 'utf8');
// Pipe readable stream to writeable stream
readableStream.pipe(writableStream);
// Then, push a number of data to the readable stream (150M in this case)
for (var i = 1; i <= targetDataNum; i += 1) {
const id = i;
const body = lorem.paragraph(1);
const date = randomDate(new Date(2014, 0, 1), new Date());
const dp = randomNumber(1, 1000);
const data = `${id},${body},${date},${dp}\n`;
readableStream.push(data, 'utf8');
// somehow YIELD for the STREAM to drain out.
};
...
The entire Stream implementation of Node.js relies on the fact that the wire is slow and that the CPU can actually have a downtime before the next chunk of data comes in from the stream source or till the next chunk of data has been written to the stream destination.
In the current implementation, since the for loop has booked up the CPU, there is no downtime for the actual piping of the data to the write stream. You will be able to catch this if you watch cat test.csv, which will not change while the loop is running.
As (I am sure) you know, pipe helps in guaranteeing that the data you are working with is buffered in memory only in chunks and not as a whole. But that guarantee only holds true if the CPU gets enough downtime to actually drain the data.
Having said all that, I wrapped your entire code into an async IIFE and ran it with an await for a setImmediate, which ensures that I yield for the stream to drain the data.
let fs = require('fs');
let Stream = require('stream');
(async function () {
// This is my target number of data
const targetDataNum = 150000000;
// Create readable stream
const readableStream = new Stream.Readable({
read() { }
});
// Create writable stream
const writableStream = fs.createWriteStream('./test.csv');
// Write columns first
writableStream.write('id, body, date, dp\n', 'utf8');
// Pipe readable stream to writeable stream
readableStream.pipe(writableStream);
// Then, push a number of data to the readable stream (150M in this case)
for (var i = 1; i <= targetDataNum; i += 1) {
console.log(`Pushing ${i}`);
const id = i;
const body = `body${i}`;
const date = `date${i}`;
const dp = `dp${i}`;
const data = `${id},${body},${date},${dp}\n`;
readableStream.push(data, 'utf8');
await new Promise(resolve => setImmediate(resolve));
};
// End the stream
readableStream.push(null);
})();
This is what top looks like pretty much the whole time I am running this.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15213 binaek ** ** ****** ***** ***** * ***.* 0.5 *:**.** node
Notice the %MEM which stays more-or-less static.
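As a side note (my own variation, not part of the answer above): if awaiting setImmediate on every single row turns out to be slow, you can yield only every N iterations and still give the stream regular chances to drain, for example:
// inside the for loop, replacing the unconditional await
if (i % 1000 === 0) {
  await new Promise(resolve => setImmediate(resolve));
}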

You were running out of memory because you were pre-generating all the data in memory before you wrote any of it to disk. Instead, you need a strategy to write it as you generate it so you don't have to hold large amounts of data in memory.
It does not seem like you need .pipe() here because you control the generation of the data (it's not coming from some random readStream).
So, you can just generate the data and immediately write it and handle the drain event when needed. Here's a runnable example (this creates a very large file):
const {once} = require('events');
const fs = require('fs');
// This is my target number of data
const targetDataNum = 150000000;
async function run() {
// Create writable stream
const writableStream = fs.createWriteStream('./test.csv');
// Write columns first
writableStream.write('id, body, date, dp\n', 'utf8');
// Then, push a number of data to the readable stream (150M in this case)
for (let i = 1; i <= targetDataNum; i += 1) {
const id = i;
const body = lorem.paragraph(1);
const date = randomDate(new Date(2014, 0, 1), new Date());
const dp = randomNumber(1, 1000);
const data = `${id},${body},${date},${dp}\n`;
const canWriteMore = writableStream.write(data);
if (!canWriteMore) {
// wait for stream to be ready for more writing
await once(writableStream, "drain");
}
}
writableStream.end();
}
run().then(() => {
console.log("done");
}).catch(err => {
console.log("got rejection: ", err);
});
// placeholders for the functions that were being used
function randomDate(low, high) {
let rand = randomNumber(low.getTime(), high.getTime());
return new Date(rand);
}
function randomNumber(low, high) {
return Math.floor(Math.random() * (high - low)) + low;
}
const lorem = {
paragraph: function() {
return "random paragraph";
}
}
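One detail worth noting: writableStream.end() returns before the data has actually been flushed to disk. If you need to wait for that, a small addition (a sketch, reusing the once helper already imported above) is to await the stream's finish event at the end of run():
writableStream.end();
// resolves once everything buffered by the stream has been written out
await once(writableStream, "finish");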

Related

transform a byte array to a readablestream

Hi, I am stuck on this issue.
I reconstructed a few protobuf objects in order to seed data to my miragejs instance:
const unknownObject = new core.UnknownObject()
const unknownEntity = new message.UnknownObjectEntity()
const unknownEntityRepo = new message.UnknownObjectEntityRepository()
const unknownObjectNotification = new message.UnknownObjectNotification()
const date = new google_protobuf_timestamp_pb.Timestamp()
unknownObject.setImage(Base64())
unknownObject.setTimestamp(date)
unknownObject.setWaypoint(new core.Waypoint())
console.log('unknownObject:', unknownObject.getTimestamp())
unknownEntity.setId('1')
unknownEntity.setUnknownobject(unknownObject)
console.log('unknownEntity:', unknownEntity)
const endBuffer = unknownEntityRepo.addEntity(unknownEntity).serializeBinary()
The end result (endBuffer) is a byte array. I want to turn this byte array into a ReadableStream in order to seed the data. This is the final result needed:
ReadableStream {locked: false}
I can only find resources about reading a stream, but never about transforming a byte array into a ReadableStream.
Does anyone have experience with this? Thanks.
You need to implement ReadableStream and add your content to the controller. Mozilla's MDN Web Docs provide some guidance in their simple random string example.
Here is a simplified example:
function buildStream(data)
{
//depending on what data is and how it will be used, you may need to convert it
//data = Uint8Array.from(data);
return new ReadableStream({
start(controller) {
// Add the data to the stream
controller.enqueue(data);
},
pull(controller) {
// nothing left to pull, so close
controller.close();
},
cancel() {
//nothing to do
}
});
}
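For instance, to turn the serialized protobuf bytes from the question into a stream (a sketch; endBuffer is the Uint8Array produced by serializeBinary() above):
const stream = buildStream(endBuffer);
console.log(stream); // ReadableStream {locked: false}

// reading it back, e.g. to verify the seeded data:
const reader = stream.getReader();
reader.read().then(({ value, done }) => {
  console.log(value, done); // the Uint8Array, then done === true on the next read
});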

How to use web worker inside a for loop in javascript?

Following is the code to create a 2d matrix in javascript:
function Create2DArray(rows) {
var arr = [];
for (var i=0;i<rows;i++) {
arr[i] = [];
}
return arr;
}
now I have a couple of 2d matrices inside an array:
const matrices = []
for(let i=1; i<10000; i++){
matrices.push(new Create2DArray(i*100))
}
// I'm just mocking it here. In reality we have data available in matrix form.
I want to do operations on each matrix like this:
for(let i=0; i<matrices.length; i++){
...doAnythingWithEachMatrix()
}
& since it will be a computationally expensive process, I would like to do it via a web worker so that the main thread is not blocked.
I'm using paralleljs for this purpose since it provides a nice API for multithreading. (Or should I use a native Web Worker? Please suggest.)
update() {
for(let i=0; i<matrices.length; i++){
var p = new Parallel(matrices[i]);
p.spawn(function (matrix) {
return doanythingOnMatrix(matrix)
// can be anything like transpose, scaling, translate etc...
}).then(function (matrix) {
// return the result back so that I can use those values to update the DOM, or directly update the DOM here
// suggest a best way so that I can prevent crashes and improve performance.
});
}
requestAnimationFrame(update)
}
So my question is what is the best way of doing this?
Is it ok to use a new Webworker or Parallel instance inside a for loop?
Would it cause memory issues?
Or is it ok to create a global instance of Parallel or Webworker and use it for manipulating each matrix?
Or suggest a better approach.
I'm using Parallel.js as an alternative to a Web Worker.
Is it ok to use Parallel.js for multithreading? (Or do I need to use a native Web Worker?)
In reality, the matrices would contain position data & this data is processed by the Webworker or parallel.js instance behind the scenes and returns the processed result back to the main app, which is then used to draw items / update canvas
UPDATE NOTE
Actually, this is an animation. So it will have to be updated for each matrix during each tick.
Currently, I'm creating a new instance of Parallel inside the for loop. I fear that this is an unconventional approach, or that it could cause memory leaks. I need the best way of doing this. Please suggest.
UPDATE
This is my example:
Following our discussion in the comments, here is an attempt at using chunks. The data is processed by groups of 10 (a chunk), so that you can receive their results regularly, and we only start the animation after receiving 200 of them (buffer) to get a head start (think of it like a video stream). But these values may need to be adjusted depending on how long each matrix takes to process.
That being said, you added details afterwards about the lag you get. I'm not sure if this will solve it, or if the problem lies in your canvas update function. That's just a path to explore:
/*
* A helper function to process data in chunks
*/
async function processInChunks({ items, processingFunc, chunkSize, bufferSize, onData, onComplete }) {
const results = [];
// For each group of {chunkSize} items
for (let i = 0; i < items.length; i += chunkSize) {
// Process this group in parallel
const p = new Parallel( items.slice(i, i + chunkSize) );
// p.map is not a real Promise, so we create one
// to be able to await it
const chunkResults = await new Promise(resolve => {
return p.map(processingFunc).then(resolve);
});
// Add to the results
results.push(...chunkResults);
// Pass the results to a callback if we're above the {bufferSize}
if (i >= bufferSize && typeof onData === 'function') {
// Flush the results
onData(results.splice(0, results.length));
}
}
// In case there was less data than the wanted {bufferSize},
// pass the results anyway
if (results.length) {
onData(results.splice(0, results.length));
}
if (typeof onComplete === 'function') {
onComplete();
}
}
/*
* Usage
*/
// For the demo, a fake matrix Array
const matrices = new Array(3000).fill(null).map((_, i) => i + 1);
const results = [];
let animationRunning = false;
// For the demo, a function which takes time to complete
function doAnythingWithMatrix(matrix) {
const start = new Date().getTime();
while (new Date().getTime() - start < 30) { /* sleep */ }
return matrix;
}
processInChunks({
items: matrices,
processingFunc: doAnythingWithMatrix,
chunkSize: 10, // Receive results after each group of 10
bufferSize: 200, // But wait for at least 200 before starting to receive them
onData: (chunkResults) => {
results.push(...chunkResults);
if (!animationRunning) { runAnimation(); }
},
onComplete: () => {
console.log('All the matrices were processed');
}
});
function runAnimation() {
animationRunning = results.length > 0;
if (animationRunning) {
updateCanvas(results.shift());
requestAnimationFrame(runAnimation);
}
}
function updateCanvas(currentMatrixResult) {
// Just for the demo, we're not really using a canvas
canvas.innerHTML = `Frame ${currentMatrixResult} out of ${matrices.length}`;
info.innerHTML = results.length;
}
<script src="https://unpkg.com/paralleljs@1.0/lib/parallel.js"></script>
<h1 id="canvas">Buffering...</h1>
<h3>(we've got a headstart of <span id="info">0</span> matrix results)</h3>

Nodejs - removing substring from a huge file

I need to remove a substring (that appears only in specific known lines of the file) from a file.
There are simple solutions that read all the file data into a string, remove the substring, and then write the fixed data back to the file.
Here is some code I found here:
Node js - Remove string from text file
var data = fs.readFileSync('banlist.txt', 'utf-8');
var newValue = data.replace(new RegExp("STRING_TO_REMOVE"), '');
fs.writeFileSync('banlist.txt', newValue, 'utf-8');
My problem is that the file is huge, up to a billion lines of logs, so I can't read all of the content into memory.
Why not a simple transform stream and replace()? replace() can take a callback as its second parameter, e.g. .replace(/bad1|bad2|bad3/g, filterWords), in case you need to replace words rather than remove them completely (a sketch of such a callback follows the example below).
const fs = require("fs")
const { pipeline, Transform } = require("stream")
const { join } = require("path")
const readFile = fs.createReadStream("./words.txt")
const writeFile = fs.createWriteStream(
join(__dirname, "words-filtered.txt"),
"utf8"
)
const transformFile = new Transform({
transform(chunk, enc, next) {
let c = chunk.toString().replace(/bad/g, "replaced")
this.push(c)
next()
},
})
pipeline(readFile, transformFile, writeFile, (err) => {
if (err) {
console.log(err.message)
}
})
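If the callback form is what you need, here is a minimal sketch of such a filterWords callback (the name and the mapping are made up for illustration):
// maps each matched word to a replacement; anything unmapped is simply removed
const replacements = { bad1: "good1", bad2: "good2", bad3: "" };

function filterWords(match) {
  return replacements[match] || "";
}

// usage inside the transform above:
// let c = chunk.toString().replace(/bad1|bad2|bad3/g, filterWords)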
https://nodejs.org/api/fs.html#fs_fs_read_fd_buffer_offset_length_position_callback
Don't read the whole file at once. Read a small buffered piece of it, look for your string within that piece, then advance your buffer's starting position and do it again. I would recommend having each buffer start not exactly at the end of the previous buffer, but overlap it by at least the expected size of the data being sought, so that you don't run into half of your data being at the end of one buffer and the other half at the beginning of the next. A sketch of this idea is below.
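Here is a minimal sketch of that idea using a read stream, where a small carry-over of the last few characters plays the role of the overlap (STRING_TO_REMOVE and the file names are placeholders):
const fs = require("fs");

const NEEDLE = "STRING_TO_REMOVE"; // placeholder substring
const input = fs.createReadStream("./banlist.txt", "utf8");
const output = fs.createWriteStream("./banlist.fixed.txt");

// keep the last NEEDLE.length - 1 characters of every chunk around,
// in case an occurrence straddles a chunk boundary
const keep = NEEDLE.length - 1;
let carry = "";

input.on("data", (chunk) => {
  const text = (carry + chunk).split(NEEDLE).join("");
  carry = keep > 0 ? text.slice(-keep) : "";
  const out = keep > 0 ? text.slice(0, text.length - carry.length) : text;
  // respect back-pressure, as discussed elsewhere on this page
  if (!output.write(out)) {
    input.pause();
    output.once("drain", () => input.resume());
  }
});

input.on("end", () => {
  output.end(carry); // flush whatever is left over
});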
You could use a file read stream. However, you would have to find a way to detect if the read data only contains part of the result.
What you probably want to do is use streams so that you are writing after partial reads. This example could work for you; you need to copy the ".tmp" output file over the original to get the behavior described in your question. It works by reading a chunk, looking for a newline, processing that line, writing it, and then removing it from the buffer. This should help with your memory problem.
var fs = require("fs");
var readStream = fs.createReadStream("./BFFile.txt", { encoding: "utf-8" });
var writeStream = fs.createWriteStream("./BFFile.txt.tmp");
const STRING_TO_REMOVE = "badword";
var buffer = ""
readStream.on("data", (chunk) => {
buffer += chunk;
var indexOfNewLine = buffer.search("\n");
while (indexOfNewLine !== -1) {
var line = buffer.substring(0, indexOfNewLine + 1);
buffer = buffer.substring(indexOfNewLine + 1, buffer.length);
line = line.replace(new RegExp(STRING_TO_REMOVE), "");
writeStream.write(line);
indexOfNewLine = buffer.search("\n");
}
})
readStream.on("end", () => {
buffer = buffer.replace(new RegExp(STRING_TO_REMOVE), "");
writeStream.write(buffer);
writeStream.close();
})
There are a few assumptions with this solution, such as the data being UTF-8, there potentially being only 1 bad word per line, every line having some text (I didn't test for that), and every line ending with a newline and not some other line ending.
Here are the docs for streams in Node.
Another thought I had was to use pipe and a transform stream, but that seems like overkill.
You can use this code to do it. I'm using fs streams; they are made for reading huge files in chunks so memory usage stays low. docs
const fs = require('fs');
const readStream = fs.createReadStream('./XXXXX');
const writeStream = fs.createWriteStream('./XXXXXXX');
readStream.on('data', (chunk) => {
const data = chunk.toString().replace('STRING_TO_REMOVE', 'XXXXXX');
writeStream.write(data);
});
readStream.on('end', () => {
writeStream.close();
});

fs.createWriteStream doesn't use back-pressure when writing data to a file, causing high memory usage

Problem
I'm trying to scan a drive directory (recursively walk all the paths) and write all the paths to a file (as it's finding them) using fs.createWriteStream in order to keep the memory usage low, but it doesn't work: the memory usage reaches 2 GB during the scan.
Expected
I was expecting fs.createWriteStream to automatically handle memory/disk usage at all times, keeping memory usage at a minimum with back-pressure.
Code
const fs = require('fs')
const walkdir = require('walkdir')
let dir = 'C:/'
let options = {
"max_depth": 0,
"track_inodes": true,
"return_object": false,
"no_return": true,
}
const wstream = fs.createWriteStream("C:/Users/USERNAME/Desktop/paths.txt")
let walker = walkdir(dir, options)
walker.on('path', (path) => {
wstream.write(path + '\n')
})
walker.on('end', (path) => {
wstream.end()
})
Is it because I'm not using .pipe()? I tried creating a new Stream.Readable({ read() {} }) and then, inside the .on('path') handler, pushing paths into it with readable.push(path), but that didn't really work.
UPDATE:
Method 2:
I tried the drain method proposed in the answers, but it doesn't help much; it does reduce memory usage to 500 MB (which is still too much for a stream), but it slows down the code significantly (from seconds to minutes).
Method 3:
I also tried using readdirp; it uses even less memory (~400 MB) and is faster, but I don't know how to pause it and use the drain method there to reduce the memory usage further:
const readdirp = require('readdirp')
let dir = 'C:/'
const wstream = fs.createWriteStream("C:/Users/USERNAME/Desktop/paths.txt")
readdirp(dir, {alwaysStat: false, type: 'files_directories'})
.on('data', (entry) => {
wstream.write(`${entry.fullPath}\n`)
})
Method 4:
I also tried doing this operation with a custom recursive walker, and even though it uses only 30 MB of memory, which is what I wanted, it is about 10 times slower than the readdirp method, and it is synchronous, which is undesirable:
const fs = require('fs')
const path = require('path')
let dir = 'C:/'
function customRecursiveWalker(dir) {
fs.readdirSync(dir).forEach(file => {
let fullPath = path.join(dir, file)
// Folders
if (fs.lstatSync(fullPath).isDirectory()) {
fs.appendFileSync("C:/Users/USERNAME/Desktop/paths.txt", `${fullPath}\n`)
customRecursiveWalker(fullPath)
}
// Files
else {
fs.appendFileSync("C:/Users/USERNAME/Desktop/paths.txt", `${fullPath}\n`)
}
})
}
customRecursiveWalker(dir)
Preliminary observation: you've attempted to get the results you want using multiple approaches. One complication when comparing the approaches you used is that they do not all do the same work. If you run tests on a file tree that contains only regular files and no mount points, you can probably compare the approaches fairly, but once you start adding mount points, symbolic links, etc., you may get different memory and time statistics merely because one approach excludes files that another approach includes.
I initially attempted a solution using readdirp, but unfortunately that library appears buggy to me. Running it on my system here, I got inconsistent results: one run would output 10 MB of data, another run with the same input parameters would output 22 MB, then I'd get another number, and so on. I looked at the code and found that it does not respect the return value of push:
_push(entry) {
if (this.readable) {
this.push(entry);
}
}
As per the documentation the push method may return a false value, in which case the Readable stream should stop producing data and wait until _read is called again. readdirp entirely ignores that part of the specification. It is crucial to pay attention to the return value of push to get proper handling of back-pressure. There are also other things that seemed questionable in that code.
So I abandoned that and worked on a proof of concept showing how it could be done. The crucial parts are:
When the push method returns false it is imperative to stop adding data to the stream. Instead, we record where we were, and stop.
We start again only when _read is called.
If you uncomment the console.log statements that print START and STOP, you'll see them printed out in succession on the console. We start, produce data until Node tells us to stop, and then we stop, until Node tells us to start again, and so on.
const stream = require("stream");
const fs = require("fs");
const { readdir, lstat } = fs.promises;
const path = require("path");
class Walk extends stream.Readable {
constructor(root, maxDepth = Infinity) {
super();
this._maxDepth = maxDepth;
// These fields allow us to remember where we were when we have to pause our
// work.
// The path of the directory to process when we resume processing, and the
// depth of this directory.
this._curdir = [root, 1];
// The directories still to process.
this._dirs = [this._curdir];
// The list of files to process when we resume processing.
this._files = [];
// The location in `this._files` were to continue processing when we resume.
this._ix = 0;
// A flag recording whether or not the fetching of files is currently going
// on.
this._started = false;
}
async _fetch() {
// Recall where we were by loading the state in local variables.
let files = this._files;
let dirs = this._dirs;
let [dir, depth] = this._curdir;
let ix = this._ix;
while (true) {
// If we've gone past the end of the files we were processing, then
// just forget about them. This simplifies the code that follows a bit.
if (ix >= files.length) {
ix = 0;
files = [];
}
// Read directories until we have files to process.
while (!files.length) {
// We've read everything, end the stream.
if (dirs.length === 0) {
// This is how the stream API requires us to indicate the stream has
// ended.
this.push(null);
// We're no longer running.
this._started = false;
return;
}
// Here, we get the next directory to process and get the list of
// files in it.
[dir, depth] = dirs.pop();
try {
files = await readdir(dir, { withFileTypes: true });
}
catch (ex) {
// This is a proof-of-concept. In a real application, you should
// determine what exceptions you want to ignore (e.g. EPERM).
}
}
// Process each file.
for (; ix < files.length; ++ix) {
const dirent = files[ix];
// Don't include in the results those files that are not directories,
// files or symbolic links.
if (!(dirent.isFile() || dirent.isDirectory() || dirent.isSymbolicLink())) {
continue;
}
const fullPath = path.join(dir, dirent.name);
if (dirent.isDirectory() && depth < this._maxDepth) {
// Keep track that we need to walk this directory.
dirs.push([fullPath, depth + 1]);
}
// Finally, we can put the data into the stream!
if (!this.push(`${fullPath}\n`)) {
// If the push returned false, we have to stop pushing results to the
// stream until _read is called again, so we have to stop.
// Uncomment this if you want to see when the stream stops.
// console.log("STOP");
// Record where we were in our processing.
this._files = files;
// The element at ix *has* been processed, so ix + 1.
this._ix = ix + 1;
this._curdir = [dir, depth];
// We're stopping, so indicate that!
this._started = false;
return;
}
}
}
}
async _read() {
// Do not start the process that puts data on the stream over and over
// again.
if (this._started) {
return;
}
this._started = true; // Yep, we've started.
// Uncomment this if you want to see when the stream starts.
// console.log("START");
await this._fetch();
}
}
// Change the paths to something that makes sense for you.
stream.pipeline(new Walk("/home/", 5),
fs.createWriteStream("/tmp/paths3.txt"),
(err) => console.log("ended with", err));
When I run the first attempt you made with walkdir here, I get the following statistics:
Elapsed time (wall clock): 59 sec
Maximum resident set size: 2.90 GB
When I use the code I've shown above:
Elapsed time (wall clock): 35 sec
Maximum resident set size: 0.1 GB
The file tree I use for the tests produces a file listing of 792 MB
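If you want to gather a similar peak-memory figure from inside the script itself (the numbers above were collected externally), a rough sketch is to sample process.memoryUsage().rss periodically:
let peakRss = 0;
const probe = setInterval(() => {
  peakRss = Math.max(peakRss, process.memoryUsage().rss);
}, 100);

stream.pipeline(new Walk("/home/", 5),
  fs.createWriteStream("/tmp/paths3.txt"),
  (err) => {
    clearInterval(probe);
    console.log("ended with", err, "peak RSS:", (peakRss / 1024 / 1024).toFixed(1), "MB");
  });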
You could exploit the return value of WritableStream.write(): it essentially tells you whether you should continue to read or not. A WritableStream has an internal property that stores the threshold after which the buffer should be processed by the OS. The drain event will be emitted when the buffer has been flushed, i.e. you can safely call WritableStream.write() again without risking filling the buffer (which means the RAM) excessively. Luckily for you, walkdir lets you control the process: you can call pause() (pause the walk; no more events will be emitted until resume) and resume() (resume the walk) on the walkdir emitter, pausing and resuming the writing process on your stream accordingly. Try this:
let is_emitter_paused = false;

// walker is the emitter returned by walkdir(dir, options) in the question
wstream.on('drain', (evt) => {
  if (is_emitter_paused) {
    walker.resume();
  }
});

walker.on('path', function(path, stat) {
  is_emitter_paused = !wstream.write(path + '\n');
  if (is_emitter_paused) {
    walker.pause();
  }
});
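You will presumably still want to close the write stream once the walk finishes, as in your original code:
walker.on('end', (path) => {
  wstream.end();
});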
Here's an implementation inspired by @Louis's answer. I think it's a bit easier to follow, and in my minimal testing it performs about the same.
const fs = require('fs');
const path = require('path');
const stream = require('stream');
class Walker extends stream.Readable {
constructor(root = process.cwd(), maxDepth = Infinity) {
super();
// Dirs to process
this._dirs = [{ path: root, depth: 0 }];
// Max traversal depth
this._maxDepth = maxDepth;
// Files to flush
this._files = [];
}
_drain() {
while (this._files.length > 0) {
const file = this._files.pop();
if (file.isFile() || file.isDirectory() || file.isSymbolicLink()) {
const filePath = path.join(this._dir.path, file.name);
if (file.isDirectory() && this._maxDepth > this._dir.depth) {
// Add directory to be walked at a later time
this._dirs.push({ path: filePath, depth: this._dir.depth + 1 });
}
if (!this.push(`${filePath}\n`)) {
// Halt walking
return false;
}
}
}
if (this._dirs.length === 0) {
// Walking complete
this.push(null);
return false;
}
// Continue walking
return true;
}
async _step() {
try {
this._dir = this._dirs.pop();
this._files = await fs.promises.readdir(this._dir.path, { withFileTypes: true });
} catch (e) {
this.emit('error', e); // Uh oh...
}
}
async _walk() {
this.walking = true;
while (this._drain()) {
await this._step();
}
this.walking = false;
}
_read() {
if (!this.walking) {
this._walk();
}
}
}
stream.pipeline(new Walker('some/dir/path', 5),
fs.createWriteStream('output.txt'),
(err) => console.log('ended with', err));

Node Streams - Pushing to Read unintentionally splits into 3 Write streams

Goal: Objects will be pushed to a readable stream and then saved in a separate .csv depending on what channel (Email, Push, In-App) they come from.
Problem: I am unable to separate the streams into different .pipe() "lines" so that each .csv log receives only its channel-specific event objects. In the current iteration, all of the .csv files created by the write streams are receiving the event objects from all channels.
Questions:
Can I dynamically create the multiple channel "pipe() lines" in the setup() function programmatically or is the current way I am approaching this correct?
Is this manual creation of the "pipe() lines" the reason all of the .csv's are being populated with events? Can this be solved with one "pipe() line" and dynamic routing?
A brief explanation of the code below:
setup() calls makeStreams(), which creates an object holding a Readable and a Writable (a rotating-file-stream writable) for each channel. (setup() is an unnecessary function right now but will hold more setup tasks later.)
pushStream() is called when an inbound event occurs and pushes an object like {Email: {queryParam: 1, queryParam: 2, etc.}}. The event is keyed by its top-level object name (in this case "Email") and is then pushed to the matching readable stream, which in theory should pipe it to the correct writable stream.
Unfortunately this isn't the case; it sends the event object to all of the writable streams. How can I send it only to the correct stream?
CODE:
const Readable = require('stream').Readable
const Json2csvTransform = require('json2csv').Transform;
var rfs = require("rotating-file-stream");
const channelTypes = ['Push Notification', 'Email', 'In-app Message']
var streamArr = setup(channelTypes);
const opts = {};
const transformOpts = {
objectMode: true
};
const json2csv = new Json2csvTransform(opts, transformOpts);
function setup(list) {
console.log("Setting up streams...")
streamArr = makeStreams(list) //makes streams out of each endpoint
return streamArr
}
//Stream Builder for Logging Based Upon Channel Name
function makeStreams(listArray) {
listArray = ['Push Notification', 'Email', 'In-app Message']
var length = listArray.length
var streamObjs = {}
for (var name = 0; name < length; name++) {
var fileName = listArray[name] + '.csv'
const readStream = new Readable({
objectMode: true,
read() {}
})
const writeStream = rfs(fileName, {
size: "50M", // rotate every 50 MegaBytes written
interval: "1d" // rotate daily
//compress: "gzip" // compress rotated files
});
var objName = listArray[name]
var obj = {
instream: readStream,
outstream: writeStream
}
streamObjs[objName] = obj
}
return streamObjs
}
function pushStream(obj) {
var keys = Object.keys(obj)
if (streamArr[keys]) {
streamArr[keys].instream.push(obj[keys])
} else {
console.log("event without a matching channel error")
}
}
//Had to make each pipe line here manually. Can this be improved? Is it the reason all of the files are receiving all events?
streamArr['Email'].instream.pipe(json2csv).pipe(streamArr['Email'].outstream)
streamArr['In-app Message'].instream.pipe(json2csv).pipe(streamArr['In-app Message'].outstream)
streamArr['Push Notification'].instream.pipe(json2csv).pipe(streamArr['Push Notification'].outstream)
module.exports = {
makeStreams,
pushStream,
setup
}
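No answer was recorded here, but one likely culprit (an assumption, not confirmed in the original thread) is that a single json2csv Transform instance is shared by all three pipe lines; once a readable is piped into several writables, every chunk that flows through it is delivered to each of them. A sketch of giving every channel its own transform inside makeStreams (this assumes opts and transformOpts are declared before setup(channelTypes) is called):
// inside the for loop of makeStreams(), after creating readStream and writeStream
const json2csv = new Json2csvTransform(opts, transformOpts); // one transform per channel
readStream.pipe(json2csv).pipe(writeStream);

// the manually written pipe lines at the bottom of the file would then no longer be needed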
