I am trying to track the progress of a pipe from a read stream to a write stream so I can display the progress to the user.
My original idea was to track progress when the data event is emitted as shown here:
const fs = require('fs');
let final = fs.createWriteStream('output');
fs.createReadStream('file')
.on('close', () => {
console.log('done');
})
.on('error', (err) => {
console.error(err);
})
.on('data', (data) => {
console.log("data");
/* Calculate progress */
})
.pipe(final);
However, I realized that just because the data was read doesn't mean it was actually written. This can be seen if the pipe is removed, as the data event still emits.
How can I track write progress when piping with Node.js?
You can use a dummy Transform stream like this:
const fs = require('fs');
const stream = require('stream');

let totalBytes = 0;
stream.pipeline(
  fs.createReadStream(from_file),
  new stream.Transform({
    transform(chunk, encoding, callback) {
      totalBytes += chunk.length;
      console.log(totalBytes);
      this.push(chunk);
      callback();
    }
  }),
  fs.createWriteStream(to_file),
  err => {
    if (err)
      console.error(err); // handle the error
  }
);
You can do the piping manually, and make use of the callback from writable.write()
callback <Function> Callback for when this chunk of data is flushed
const fs = require('fs');
let from_file = '<from_file>';
let to_file = '<to_file>';
let from_stream = fs.createReadStream(from_file);
let to_stream = fs.createWriteStream(to_file);
// get total size of the file
let { size } = fs.statSync(from_file);
let written = 0;
from_stream.on('data', data => {
// do the piping manually here.
to_stream.write(data, () => {
written += data.length;
console.log(`written ${written} of ${size} bytes (${(written/size*100).toFixed(2)}%)`);
});
});
Somehow I remember this thread being about memory efficiency. Anyway, I've rigged up a small script that's very memory efficient and tracks progress very well. I tested it on a 230 MB file and the result speaks for itself: https://gist.github.com/J-Cake/78ce059972595823243526e022e327e4
The sample file I used was a bit weird, as the content-length header it reported was in fact off, but the program uses no more than 4.5 MiB of memory.
I'm trying to make a transform-stream flow that takes data from socket.io, converts it to JSON, and then sends it to stdout. I am totally perplexed as to why the data just seems to go right through without any transformation. I'm using the through2 library. Here is my code:
getStreamNames().then(streamNames => {
const socket = io(SOCKETIO_URL);
socket.on('connect', () => {
socket.emit('Subscribe', {subs: streamNames});
});
const stream = through2.obj(function (chunk, enc, callback) {
callback(null, parseString(chunk))
}).pipe(through2.obj(function (chunk, enc, callback) {
callback(null, JSON.stringify(chunk));
})).pipe(process.stdout);
socket.on('m', data => stream.write(data));
},
);
getStreamNames returns a promise which resolves to an array of stream names (I'm calling an external socket.io API), and parseString takes a string returned from the API and converts it to JSON so it's manageable.
What I'm looking for is for my console to print out the stringified JSON after I parse it using parseString and make it stdout-able with JSON.stringify. What is actually happening is that the data goes right through the stream with no transformation.
For reference, the data coming from the API is in a weird format, something like
field1~field2~0x23~fieldn
and so that's why I need the parseString method.
I must be missing something. Any ideas?
EDIT:
parseString:
function(value) {
  var valuesArray = value.split("~");
  var valuesArrayLength = valuesArray.length;
  var mask = valuesArray[valuesArrayLength - 1];
  var maskInt = parseInt(mask, 16);
  var unpackedCurrent = {};
  var currentField = 0;
  for (var property in this.FIELDS) {
    if (this.FIELDS[property] === 0) {
      unpackedCurrent[property] = valuesArray[currentField];
      currentField++;
    }
    else if (maskInt & this.FIELDS[property]) {
      if (property === 'LASTMARKET') {
        unpackedCurrent[property] = valuesArray[currentField];
      }
      else {
        unpackedCurrent[property] = parseFloat(valuesArray[currentField]);
      }
      currentField++;
    }
  }
  return unpackedCurrent;
};
Thanks
The issue is that the stream you're writing to is actually process.stdout, because .pipe returns the destination stream so you can keep chaining; in your case, that's process.stdout.
const x = stream.pipe(stream2).pipe(stream3).pipe(process.stdout);
x === process.stdout // true
So all you were doing was: process.stdout.write(data) without going through the pipeline.
What you need to do, is assign your first through2 stream to the stream variable, and then .pipe on that stream.
const stream = through2.obj((chunk, enc, callback) => {
callback(null, parseString(chunk))
});
stream
.pipe(through2.obj((chunk, enc, callback) => {
callback(null, JSON.stringify(chunk));
}))
.pipe(process.stdout);
socket.on('m', data => stream.write(data));
I periodically have to download and parse a bunch of JSON data, about 1,000 to 1,000,000 lines.
Each request has a chunk limit of 5000, so I would like to fire off a bunch of requests at a time, stream each output through its own Transform stream for filtering out the key/values, and then write to a combined stream that writes its output to the database.
But every attempt either doesn't work or gives errors because too many event listeners are set, which seems correct, if I understand it right, since the 'last pipe' is always the reference for the next one in the chain.
Here is some code (I've changed it a lot of times, so it may make little sense).
The question is: is it bad practice to join multiple streams into one? Google also doesn't show a whole lot about it.
Thanks!
brokerApi/getCandles.js
// The 'combined output' stream
let passStream = new Stream.PassThrough();
countChunks.forEach(chunk => {
let arr = [];
let leftOver = '';
let startFound = false;
let lastPiece = false;
let firstByte = false;
let now = Date.now();
let transformStream = this._client
// Returns PassThrough stream
.getCandles(instrument, chunk.from, chunk.until, timeFrame, chunk.count)
.on('error', err => console.error(err) || passStream.emit('error', err))
.on('end', () => {
if (++finished === countChunks.length)
passStream.end();
})
.pipe(passStream);
transformStream._transform = function(data, type, done) {
  /** Transform to typed array **/
  this.push(/** transformed value **/);
}
});
Extra - Other file that 'consumes' the stream (writes to DB)
DataLayer.js
brokerApi.getCandles(instrument, timeFrame, from, until, count)
.on('data', async (buf: NodeBuffer) => {
this._dataLayer.write(instrument, timeFrame, buf);
if (from && until) {
await this._mapper.update(instrument, timeFrame, from, until, buf.length / (10 * Float64Array.BYTES_PER_ELEMENT));
} else {
if (buf.length) {
if (!from)
from = buf.readDoubleLE(0);
if (!until) {
until = buf.readDoubleLE(buf.length - (10 * Float64Array.BYTES_PER_ELEMENT));
console.log('UNTIL TUNIL', until);
}
if (from && until)
await this._mapper.update(instrument, timeFrame, from, until, buf.length / (10 * Float64Array.BYTES_PER_ELEMENT));
}
}
})
.on('end', () => {
winston.info(`Cache: Fetching ${instrument} took ${Date.now() - now} ms`);
resolve()
})
.on('error', reject)
Check out the stream helpers from highlandjs, e.g. (untested, pseudo code):
function getCandle(candle) {...}
_(chunks).map(getCandle).parallel(5000).pipe(...)
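If you prefer to stay with core streams, a common pattern for the 'combined output' stream in the question is to pipe every source into a single PassThrough with { end: false } and end it manually once all sources have finished. A minimal sketch, assuming getCandles(chunk) gives you one readable stream per chunk:
const { PassThrough } = require('stream');

// sources is assumed to be an array of readable streams, e.g. one per chunk.
function mergeStreams(sources) {
  const combined = new PassThrough();
  let running = sources.length;
  for (const source of sources) {
    source.on('error', err => combined.destroy(err));
    source.on('end', () => {
      // Only end the combined stream once every source is done.
      if (--running === 0) combined.end();
    });
    // end: false keeps the PassThrough open when an individual source ends.
    source.pipe(combined, { end: false });
  }
  return combined;
}
Note that every pipe into the same destination registers extra listeners on it, which is where the 'too many event listeners' warning comes from when there are many chunks; raising the limit with combined.setMaxListeners(0) or merging in batches avoids it.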
I am trying to do the following:
Stream a csv file in line by line.
Modify the data contained in each line.
Once all lines are streamed and processed, finish and move on to next task.
The problem is .on("end") fires before .on("data") finishes processing each line. How can I get .on("end") to fire after .on("data") has finished processing all the lines?
Below is a simple example of what I am talking about:
import parse from 'csv-parse';
var parser = parse({});
fs.createReadStream(this.upload.location)
.pipe(parser)
.on("data", line => {
var num = Math.floor((Math.random() * 100) + 1);
num = num % 3;
num = num * 1000;
setTimeout( () => {
console.log('data process complete');
}, num);
})
.on("end", () => {
console.log('Done: parseFile');
next(null);
});
Thanks in advance.
I think the issue is the setTimeout (or any other async task) within the data event listener. end is firing after data but the async task is causing it to log messages even after the stream fires end.
If you take out the setTimeout then you'll see that it logs all the messages in data before end. You can still perform async tasks but there will be a potential batch of them that run after the stream has ended.
This code helps explain what is going on:
const fs = require('fs')
const testFileName = 'testfile.txt'
fs.writeFileSync(testFileName, '123456789')
let count = 0
const readStream = fs.createReadStream(testFileName, {
encoding: 'utf8',
highWaterMark: 1 // low highWaterMark so we can have more chunks to observe
})
readStream.on('data', (data) => {
console.log('+++++++++++processing sync+++++++++++++')
console.log(data)
console.log('+++++++++++end processing sync+++++++++++++')
setTimeout(() => {
console.log('-----------processing async-------------')
console.log(data)
console.log('-----------end processing async-------------')
}, ++count * 1000)
})
readStream.on('end', () => {
console.log('stream ended but still have async tasks doing their thing')
fs.unlinkSync(testFileName)
})
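If the requirement is that nothing runs in end before every line's async work has finished, one simple option (a sketch; processLine is a hypothetical function that returns a promise for your per-line work) is to collect a promise per line and wait for them all before moving on:
const pending = [];

fs.createReadStream(this.upload.location)
  .pipe(parser)
  .on("data", line => {
    // Collect a promise per line instead of firing and forgetting.
    pending.push(processLine(line));
  })
  .on("end", async () => {
    // Wait for every line's async work before continuing.
    await Promise.all(pending);
    console.log('Done: parseFile');
    next(null);
  });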
I need to run two commands in series that need to read data from the same stream.
After piping a stream into another, the buffer is emptied, so I can't read data from that stream again. So this doesn't work:
var spawn = require('child_process').spawn;
var fs = require('fs');
var request = require('request');
var inputStream = request('http://placehold.it/640x360');
var identify = spawn('identify',['-']);
inputStream.pipe(identify.stdin);
var chunks = [];
identify.stdout.on('data',function(chunk) {
chunks.push(chunk);
});
identify.stdout.on('end',function() {
var size = getSize(Buffer.concat(chunks)); //width
var convert = spawn('convert',['-','-scale',size * 0.5,'png:-']);
inputStream.pipe(convert.stdin);
convert.stdout.pipe(fs.createWriteStream('half.png'));
});
function getSize(buffer){
return parseInt(buffer.toString().split(' ')[2].split('x')[0]);
}
Request complains about this
Error: You cannot pipe after data has been emitted from the response.
and changing the inputStream to fs.createReadStream yields the same issue, of course.
I don't want to write to a file; I want to somehow reuse the stream that request produces (or any other, for that matter).
Is there a way to reuse a readable stream once it finishes piping?
What would be the best way to accomplish something like the above example?
You have to create a duplicate of the stream by piping it to two streams. You can create a simple duplicate with a PassThrough stream, which simply passes the input through to the output.
const spawn = require('child_process').spawn;
const PassThrough = require('stream').PassThrough;
const a = spawn('echo', ['hi user']);
const b = new PassThrough();
const c = new PassThrough();
a.stdout.pipe(b);
a.stdout.pipe(c);
let count = 0;
b.on('data', function (chunk) {
count += chunk.length;
});
b.on('end', function () {
console.log(count);
c.pipe(process.stdout);
});
Output:
8
hi user
The first answer only works if streams take roughly the same amount of time to process data. If one takes significantly longer, the faster one will request new data, consequently overwriting the data still being used by the slower one (I had this problem after trying to solve it using a duplicate stream).
The following pattern worked very well for me. It uses a library based on Stream2 streams, Streamz, and Promises to synchronize async streams via a callback. Using the familiar example from the first answer:
const spawn = require('child_process').spawn;
const pass = require('stream').PassThrough;
const streamz = require('streamz').PassThrough;
const Promise = require('bluebird');

const a = spawn('echo', ['hi user']);
const b = new pass();
const c = new pass();

a.stdout.pipe(streamz(combineStreamOperations));

function combineStreamOperations(data, next) {
  Promise.join(b, c, function(b, c) { // perform n operations on the same data
    next(); // request more
  });
}

let count = 0;
b.on('data', function(chunk) { count += chunk.length; });
b.on('end', function() { console.log(count); c.pipe(process.stdout); });
You can use this small npm package I created:
readable-stream-clone
With it, you can reuse readable streams as many times as you need.
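For illustration, usage might look roughly like this; the constructor call below is an assumption about the package's API, so check its README for the exact usage:
const fs = require('fs');
// Assumption: readable-stream-clone exports a constructor that wraps the source stream.
const ReadableStreamClone = require('readable-stream-clone');

const source = fs.createReadStream('input.txt');
const copy1 = new ReadableStreamClone(source);
const copy2 = new ReadableStreamClone(source);

copy1.pipe(fs.createWriteStream('out1.txt'));
copy2.pipe(fs.createWriteStream('out2.txt'));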
For the general problem, the following code works fine:
var PassThrough = require('stream').PassThrough;

var a = new PassThrough();
var b1 = new PassThrough();
var b2 = new PassThrough();

a.pipe(b1);
a.pipe(b2);

b1.on('data', function(data) {
  console.log('b1:', data.toString());
});
b2.on('data', function(data) {
  console.log('b2:', data.toString());
});

a.write('text');
I have a different solution for writing to two streams simultaneously. Naturally, the time to write will be the sum of the two times, but I use it to respond to a download request where I want to keep a copy of the downloaded file on my server (actually, I use an S3 backup, so I cache the most-used files locally to avoid multiple file transfers).
/**
* A utility class made to write to a file while answering a file download request
*/
class TwoOutputStreams {
constructor(streamOne, streamTwo) {
this.streamOne = streamOne
this.streamTwo = streamTwo
}
setHeader(header, value) {
if (this.streamOne.setHeader)
this.streamOne.setHeader(header, value)
if (this.streamTwo.setHeader)
this.streamTwo.setHeader(header, value)
}
write(chunk) {
this.streamOne.write(chunk)
this.streamTwo.write(chunk)
}
end() {
this.streamOne.end()
this.streamTwo.end()
}
}
You can then use this as a regular OutputStream
const twoStreamsOut = new TwoOutputStreams(fileOut, responseStream)
and pass it to your method as if it were a response or a file output stream.
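For example, in a hypothetical Express-style download route (app, express, and getFileFromS3 below are illustrative placeholders, not part of the original answer):
const fs = require('fs');
const express = require('express'); // assumed here purely for the example
const app = express();

// Stream a download to the client while keeping a local copy on disk.
app.get('/download/:name', (req, res) => {
  const fileOut = fs.createWriteStream(`./cache/${req.params.name}`); // don't use raw user input for paths in real code
  const out = new TwoOutputStreams(fileOut, res);

  out.setHeader('Content-Type', 'application/octet-stream');
  // getFileFromS3 is a placeholder: it should call out.write(chunk) for each
  // chunk and out.end() once the transfer is finished.
  getFileFromS3(req.params.name, out);
});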
If you have async operations on the PassThrough streams, the answers posted here won't work.
A solution that works for async operations includes buffering the stream content and then creating streams from the buffered result.
To buffer the result you can use concat-stream
const Promise = require('bluebird');
const concat = require('concat-stream');
const getBuffer = function(stream){
return new Promise(function(resolve, reject){
var gotBuffer = function(buffer){
resolve(buffer);
}
var concatStream = concat(gotBuffer);
stream.on('error', reject);
stream.pipe(concatStream);
});
}
To create streams from the buffer you can use:
const { Readable } = require('stream');
const getBufferStream = function(buffer){
const stream = new Readable();
stream.push(buffer);
stream.push(null);
return Promise.resolve(stream);
}
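Putting the two helpers together might look like this (a sketch; source, consumerA, and consumerB are placeholders for your actual stream and async operations):
// Buffer the source once, then hand an independent readable stream to each consumer.
getBuffer(source)
  .then(buffer => Promise.all([
    getBufferStream(buffer).then(streamA => consumerA(streamA)),
    getBufferStream(buffer).then(streamB => consumerB(streamB))
  ]))
  .catch(err => console.error(err));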
What about piping into two or more streams not at the same time?
For example :
var PassThrough = require('stream').PassThrough;
var myBinaryStream = stream.start(); // never-ending audio stream
var file1 = fs.createWriteStream('file1.wav', {encoding: 'binary'});
var file2 = fs.createWriteStream('file2.wav', {encoding: 'binary'});
var myPass = new PassThrough();

myBinaryStream.pipe(myPass);
myPass.pipe(file1);

setTimeout(function() {
  myPass.pipe(file2);
}, 2000);
The above code does not produce any errors, but file2 is empty.
In Node.js I'm using the fs.createWriteStream method to append data to a local file. In the Node documentation they mention the drain event when using fs.createWriteStream, but I don't understand it.
var stream = fs.createWriteStream('fileName.txt');
var result = stream.write(data);
In the code above, how can I use the drain event? Is the event used properly below?
var data = 'this is my data';
if (!streamExists) {
var stream = fs.createWriteStream('fileName.txt');
}
var result = stream.write(data);
if (!result) {
stream.once('drain', function() {
stream.write(data);
});
}
The drain event is for when a writable stream's internal buffer has been emptied.
This can only happen once the size of the internal buffer has exceeded its highWaterMark property, which is the maximum number of bytes of data that can be stored inside a writable stream's internal buffer before it signals the source to stop writing.
A typical cause is a setup where data is read from one stream faster than it can be written to another resource. For example, take two streams:
var fs = require('fs');
var read = fs.createReadStream('./read');
var write = fs.createWriteStream('./write');
Now imagine that the file read is on a SSD and can read at 500MB/s and write is on a HDD that can only write at 150MB/s. The write stream will not be able to keep up, and will start storing data in the internal buffer. Once the buffer has reached the highWaterMark, which is by default 16KB, the writes will start returning false, and the stream will internally queue a drain. Once the internal buffer's length is 0, then the drain event is fired.
This is how a drain works:
if (state.length === 0 && state.needDrain) {
state.needDrain = false;
stream.emit('drain');
}
And these are the prerequisites for a drain which are part of the writeOrBuffer function:
var ret = state.length < state.highWaterMark;
state.needDrain = !ret;
To see how the drain event is used, take the example from the Node.js documentation.
function writeOneMillionTimes(writer, data, encoding, callback) {
var i = 1000000;
write();
function write() {
var ok = true;
do {
i -= 1;
if (i === 0) {
// last time!
writer.write(data, encoding, callback);
} else {
// see if we should continue, or wait
// don't pass the callback, because we're not done yet.
ok = writer.write(data, encoding);
}
} while (i > 0 && ok);
if (i > 0) {
// had to stop early!
// write some more once it drains
writer.once('drain', write);
}
}
}
The function's objective is to write 1,000,000 times to a writable stream. A variable ok is set to true, and the loop only executes while ok is true. On each iteration, ok is set to the return value of stream.write(), which returns false if a drain is required. If ok becomes false, the function registers a one-time drain handler and, when it fires, resumes writing.
Regarding your code specifically, you don't need to use the drain event because you are writing only once right after opening your stream. Since you have not yet written anything to the stream, the internal buffer is empty, and you would have to be writing at least 16KB in chunks in order for the drain event to fire. The drain event is for writing many times with more data than the highWaterMark setting of your writable stream.
Imagine you're connecting 2 streams with very different bandwidths, say, uploading a local file to a slow server. The (fast) file stream will emit data faster than the (slow) socket stream can consume it.
In this situation, node.js will keep data in memory until the slow stream gets a chance to process it. This can get problematic if the file is very large.
To avoid this, Stream.write returns false when the underlying system buffer is full. If you stop writing, the stream will later emit a drain event to indicate that the system buffer has emptied and it is appropriate to write again.
You can use pause/resume the readable stream and control the bandwidth of the readable stream.
Better: you can use readable.pipe(writable) which will do this for you.
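A minimal sketch of the manual pause/resume approach, for illustration (in practice, pipe does exactly this for you):
const fs = require('fs');

const readable = fs.createReadStream('./big-input');    // hypothetical fast source
const writable = fs.createWriteStream('./slow-output'); // hypothetical slow destination

readable.on('data', chunk => {
  // write() returns false once the writable's internal buffer passes its highWaterMark.
  if (!writable.write(chunk)) {
    readable.pause(); // stop reading while the buffer drains
    writable.once('drain', () => readable.resume());
  }
});

readable.on('end', () => writable.end());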
EDIT: There's a bug in your code: regardless of what write returns, your data has been written. You don't need to retry it. In your case, you're writing data twice.
Something like this would work:
var packets = […],
current = -1;
function niceWrite() {
current += 1;
if (current === packets.length)
return stream.end();
var nextPacket = packets[current],
canContinue = stream.write(nextPacket);
// wait until stream drains to continue
if (!canContinue)
stream.once('drain', niceWrite);
else
niceWrite();
}
Here is a version with async/await
const write = (writer, data) => {
return new Promise((resolve) => {
if (!writer.write(data)) {
writer.once('drain', resolve)
}
else {
resolve()
}
})
}
// usage
const run = async () => {
const write_stream = fs.createWriteStream('...')
const max = 1000000
let current = 0
while (current <= max) {
await write(write_stream, current++)
}
}
https://gist.github.com/stevenkaspar/509f792cbf1194f9fb05e7d60a1fbc73
This is a speed-optimized version using Promises (async/await). The caller has to check whether it gets a promise back, and only in that case does await need to be called. Awaiting every call can slow the program down by a factor of about 3.
const write = (writer, data) => {
// return a promise only when we get a drain
if (!writer.write(data)) {
return new Promise((resolve) => {
writer.once('drain', resolve)
})
}
}
// usage
const run = async () => {
const write_stream = fs.createWriteStream('...')
const max = 1000000
let current = 0
while (current <= max) {
const promise = write(write_stream, current++)
// since drain happens rarely, awaiting each write call is really slow.
if (promise) {
// we got a drain event, therefore we wait
await promise
}
}
}