How to use drain event of stream.Writable in Node.js - javascript

In Node.js I'm using the fs.createWriteStream method to append data to a local file. In the Node documentation they mention the drain event when using fs.createWriteStream, but I don't understand it.
var stream = fs.createWriteStream('fileName.txt');
var result = stream.write(data);
In the code above, how can I use the drain event? Is the event used properly below?
var data = 'this is my data';

if (!streamExists) {
  var stream = fs.createWriteStream('fileName.txt');
}

var result = stream.write(data);
if (!result) {
  stream.once('drain', function() {
    stream.write(data);
  });
}

The drain event is for when a writable stream's internal buffer has been emptied.
It can only fire after the size of the internal buffer has exceeded the stream's highWaterMark property, which is the maximum number of bytes the writable stream will buffer before it tells the producer to stop sending data.
A typical cause is a setup where data is read from one stream faster than it can be written to another resource. For example, take two streams:
var fs = require('fs');
var read = fs.createReadStream('./read');
var write = fs.createWriteStream('./write');
Now imagine that the file read is on an SSD and can be read at 500MB/s, while write is on an HDD that can only write at 150MB/s. The write stream will not be able to keep up and will start storing data in its internal buffer. Once the buffer has reached the highWaterMark, which defaults to 16KB, the writes will start returning false, and the stream will internally queue a drain. Once the internal buffer's length reaches 0, the drain event is fired.
This is how a drain works:
if (state.length === 0 && state.needDrain) {
  state.needDrain = false;
  stream.emit('drain');
}
And these are the prerequisites for a drain which are part of the writeOrBuffer function:
var ret = state.length < state.highWaterMark;
state.needDrain = !ret;
To see how the drain event is used, take the example from the Node.js documentation.
function writeOneMillionTimes(writer, data, encoding, callback) {
  var i = 1000000;
  write();
  function write() {
    var ok = true;
    do {
      i -= 1;
      if (i === 0) {
        // last time!
        writer.write(data, encoding, callback);
      } else {
        // see if we should continue, or wait
        // don't pass the callback, because we're not done yet.
        ok = writer.write(data, encoding);
      }
    } while (i > 0 && ok);
    if (i > 0) {
      // had to stop early!
      // write some more once it drains
      writer.once('drain', write);
    }
  }
}
The function's objective is to write 1,000,000 times to a writable stream. A variable ok starts out true, and the do...while loop keeps writing only while ok remains true. On each iteration ok is set to the return value of writer.write(), which becomes false once a drain is needed. When that happens the loop stops, a one-time drain handler is registered, and the handler resumes writing as soon as the event fires.
Regarding your code specifically, you don't need to use the drain event, because you write only once, right after opening your stream. Since you have not yet written anything, the internal buffer is empty; you would have to buffer more than 16KB of pending data before drain could ever fire. The drain event is for writing many times with more data than the highWaterMark setting of your writable stream.
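If you want to actually see drain fire with a setup like yours, here is a minimal sketch (the 64KB chunk size is made up for illustration; any single write larger than the 16KB highWaterMark will do):
const fs = require('fs');

const stream = fs.createWriteStream('fileName.txt');
const bigChunk = Buffer.alloc(64 * 1024); // well above the default 16KB highWaterMark

if (!stream.write(bigChunk)) {
  // write() returned false: the internal buffer is now over its highWaterMark
  stream.once('drain', () => {
    console.log('buffer emptied, safe to write again');
    stream.end();
  });
} else {
  stream.end();
}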

Imagine you're connecting 2 streams with very different bandwidths, say, uploading a local file to a slow server. The (fast) file stream will emit data faster than the (slow) socket stream can consume it.
In this situation, node.js will keep data in memory until the slow stream gets a chance to process it. This can get problematic if the file is very large.
To avoid this, Stream.write returns false when the underlying system buffer is full. If you stop writing, the stream will later emit a drain event to indicate that the system buffer has emptied and it is appropriate to write again.
You can pause/resume the readable stream to control how fast it delivers data.
Better: you can use readable.pipe(writable) which will do this for you.
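A minimal sketch of the pipe approach (the host, path, and file name are made up for illustration): pipe() pauses the readable side whenever write() returns false and resumes it when the writable side emits drain, so you never have to handle the event yourself.
const fs = require('fs');
const http = require('http');

// hypothetical upload target, purely for illustration
const req = http.request({ host: 'example.com', method: 'POST', path: '/upload' });

// pipe() handles the pause-on-full / resume-on-drain logic internally
fs.createReadStream('./bigfile.bin').pipe(req);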
EDIT: There's a bug in your code: regardless of what write returns, your data has been accepted and will be written; you don't need to retry it. In your case, you're writing data twice.
Something like this would work:
var packets = […],
    current = -1;

function niceWrite() {
  current += 1;

  if (current === packets.length)
    return stream.end();

  var nextPacket = packets[current],
      canContinue = stream.write(nextPacket);

  // wait until stream drains to continue
  if (!canContinue)
    stream.once('drain', niceWrite);
  else
    niceWrite();
}

Here is a version with async/await:
const fs = require('fs')

const write = (writer, data) => {
  return new Promise((resolve) => {
    if (!writer.write(data)) {
      writer.once('drain', resolve)
    } else {
      resolve()
    }
  })
}

// usage
const run = async () => {
  const write_stream = fs.createWriteStream('...')
  const max = 1000000
  let current = 0
  while (current <= max) {
    await write(write_stream, String(current++)) // write() needs a string or Buffer
  }
}
https://gist.github.com/stevenkaspar/509f792cbf1194f9fb05e7d60a1fbc73
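As a small variant (assuming Node 11.13 or later, where events.once is available), the hand-rolled promise can be replaced with events.once:
const { once } = require('events')

const write = async (writer, data) => {
  if (!writer.write(data)) {
    // wait until the internal buffer has emptied
    await once(writer, 'drain')
  }
}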

This is a speed-optimized version using Promises (async/await). The caller has to check whether it gets a promise back, and await only in that case; awaiting every call can slow the program down by a factor of about 3.
const fs = require('fs')

const write = (writer, data) => {
  // return a promise only when a drain is needed
  if (!writer.write(data)) {
    return new Promise((resolve) => {
      writer.once('drain', resolve)
    })
  }
}

// usage
const run = async () => {
  const write_stream = fs.createWriteStream('...')
  const max = 1000000
  let current = 0
  while (current <= max) {
    const promise = write(write_stream, String(current++)) // write() needs a string or Buffer
    // since drain happens rarely, awaiting each write call is really slow.
    if (promise) {
      // the buffer is full, so wait for the drain event
      await promise
    }
  }
}

Related

How to batch process an async read stream?

I'm trying to batch-process reading a file and posting the records to a database. Currently, I am trying to batch 20 records at a time, as seen below.
Despite the documentBatch.length check I have put in, it still does not seem to be working (the database call inside persistToDB should be made 5 times, but for some reason it's only made once), and when I console.log documentBatch.length it climbs past that limit. I suspect this is due to concurrency issues; however, persistToDB is from an external lib that needs to be called within an async function.
The way I am trying to batch is to pause the stream and resume it once the db work is done; however, this seems to have the same issue.
let documentBatch = [];
const processedMetrics = {
  succesfullyProcessed: 0,
  unsuccesfullyProcessed: 0,
};

rl.on('line', async (line) => {
  try {
    const document = JSON.parse(line);
    documentBatch.push(document);
    console.log(documentBatch.length);
    if (documentBatch.length === 20) {
      rl.pause();
      const batchMetrics = await persistToDB(documentBatch);
      documentBatch = [];
      processedMetrics.succesfullyProcessed += batchMetrics.succesfullyProcessed;
      processedMetrics.unsuccesfullyProcessed += batchMetrics.unsuccesfullyProcessed;
      rl.resume();
    }
  } catch (e) {
    logger.error(`Failed to save document ${line}`);
    throw e;
  }
});
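One hedged sketch of an alternative (assuming Node 11.14+ where the readline interface is async-iterable, and keeping the persistToDB call from the question): a for await loop only pulls the next line once the previous iteration has finished, so the batch boundary is respected without pause()/resume().
async function processFile(rl) {
  let documentBatch = [];
  for await (const line of rl) {
    documentBatch.push(JSON.parse(line));
    if (documentBatch.length === 20) {
      // no new lines are consumed while the database call is in flight
      await persistToDB(documentBatch);
      documentBatch = [];
    }
  }
  if (documentBatch.length > 0) {
    await persistToDB(documentBatch); // flush the final partial batch
  }
}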

Is there a way to limit the execution time of regex evaluation in javascript? [duplicate]

Is it possible to cancel a regex.match operation if it takes more than 10 seconds to complete?
I'm using a huge regex to match a specific text; sometimes it works, and sometimes it fails...
regex: MINISTÉRIO(?:[^P]*(?:P(?!ÁG\s:\s\d+\/\d+)[^P]*)(?:[\s\S]*?))PÁG\s:\s+\d+\/(\d+)\b(?:\D*(?:(?!\1\/\1)\d\D*)*)\1\/\1(?:[^Z]*(?:Z(?!6:\s\d+)[^Z]*)(?:[\s\S]*?))Z6:\s+\d+
Working example: https://regex101.com/r/kU6rS5/1
So, I want to cancel the operation if it takes more than 10 seconds. Is it possible? I'm not finding anything related on SO.
Thanks.
You could spawn a child process that does the regex matching and kill it off if it hasn't completed in 10 seconds. Might be a bit overkill, but it should work.
fork is probably what you should use, if you go down this road.
If you'll forgive my non-pure functions, this code would demonstrate the gist of how you could communicate back and forth between the forked child process and your main process:
index.js
const { fork } = require('child_process');
const processPath = __dirname + '/regex-process.js';
const regexProcess = fork(processPath);
let received = null;
regexProcess.on('message', function(data) {
  console.log('received message from child:', data);
  clearTimeout(timeout);
  received = data;
  regexProcess.kill(); // or however you want to end it. just as an example.
  // you have access to the regex data here.
  // send to a callback, or resolve a promise with the value,
  // so the original calling code can access it as well.
});

const timeoutInMs = 10000;
let timeout = setTimeout(() => {
  if (!received) {
    console.error('regexProcess is still running!');
    regexProcess.kill(); // or however you want to shut it down.
  }
}, timeoutInMs);
regexProcess.send('message to match against');
regex-process.js
function respond(data) {
  process.send(data);
}

function handleMessage(data) {
  console.log('handling message:', data);
  // run your regex calculations in here
  // then respond with the data when it's done.
  // the following is just to emulate
  // a synchronous computational delay
  for (let i = 0; i < 500000000; i++) {
    // spin!
  }
  respond('return regex process data in here');
}
process.on('message', handleMessage);
This might just end up masking the real problem, though. You may want to consider reworking your regex like other posters have suggested.
Another solution I found here:
https://www.josephkirwin.com/2016/03/12/nodejs_redos_mitigation/
It is based on the vm module, with no process fork. That's pretty neat.
const util = require('util');
const vm = require('vm');

var sandbox = {
  regex: /^(A+)*B/,
  string: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC",
  result: null
};

var context = vm.createContext(sandbox);
console.log('Sandbox initialized: ' + vm.isContext(sandbox));
var script = new vm.Script('result = regex.test(string);');

try {
  // One could argue that if a RegExp hasn't finished in a given time,
  // then it's likely it will take exponential time.
  script.runInContext(context, { timeout: 1000 }); // milliseconds
} catch (e) {
  console.log('ReDoS occurred', e); // Take some remedial action here...
}

console.log(util.inspect(sandbox)); // Check the results
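A small hedged wrapper around the same idea (safeRegexTest is a hypothetical name): it returns null instead of hanging when the script hits the timeout.
const vm = require('vm');

// Hypothetical helper built on the vm approach above
function safeRegexTest(regex, string, timeoutMs) {
  const sandbox = { regex, string, result: null };
  const context = vm.createContext(sandbox);
  const script = new vm.Script('result = regex.test(string);');
  try {
    script.runInContext(context, { timeout: timeoutMs });
  } catch (e) {
    return null; // timed out, likely catastrophic backtracking
  }
  return sandbox.result;
}

console.log(safeRegexTest(/^(A+)*B/, 'AAAAAAAAAAAAAAAAAAAAAAAAAC', 1000));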

How to read large files with fs.read and a buffer in javascript?

I'm just learning javascript, and a common task I perform when picking up a new language is to write a hex-dump program. The requirements are: 1. read a file supplied on the command line, 2. be able to read huge files (a buffer at a time), 3. output the hex digits and printable ASCII characters.
Try as I might, I can't get the fs.read(...) function to actually execute. Here's the code I've started with:
const fs = require('fs');

console.log(process.argv);
if (process.argv.length < 3) {
  console.log("usage: node hd <filename>");
  process.exit(1);
}

fs.open(process.argv[2], 'r', (err, fd) => {
  if (err) {
    console.log("Error: ", err);
    process.exit(2);
  } else {
    fs.fstat(fd, (err, stats) => {
      if (err) {
        process.exit(4);
      } else {
        var size = stats.size;
        console.log("size = " + size);
        going = true;
        var buffer = new Buffer(8192);
        var offset = 0;
        //while( going ){
        while (going) {
          console.log("Reading...");
          fs.read(fd, buffer, 0, Math.min(size - offset, 8192), offset, (error_reading_file, bytesRead, buffer) => {
            console.log("READ");
            if (error_reading_file) {
              console.log(error_reading_file.message);
              going = false;
            } else {
              offset += bytesRead;
              for (a = 0; a < bytesRead; a++) {
                var z = buffer[a];
                console.log(z);
              }
              if (offset >= size) {
                going = false;
              }
            }
          });
        }
        //}
        fs.close(fd, (err) => {
          if (err) {
            console.log("Error closing file!");
            process.exit(3);
          }
        });
      }
    });
  }
});
If I comment-out the while() loop, the read() function executes, but only once of course (which works for files under 8K). Right now, I'm just not seeing the purpose of a read() function that takes a buffer and an offset like this... what's the trick?
Node v8.11.1, OSX 10.13.6
First of all, if this is just a one-off script that you run now and then and it is not code in a server, then there's no need to use the harder asynchronous I/O. You can use synchronous, blocking I/O with calls such as fs.openSync(), fs.statSync(), fs.readSync(), etc., and then things will work inside your while loop because those calls are blocking (they don't return until the result is ready). You can write normal looping and sequential code with them. One should never use synchronous, blocking I/O in a server environment because it ruins the scalability of a server process (its ability to handle requests from multiple clients), but if this is a one-off local script with only one job to do, then synchronous I/O is perfectly appropriate.
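A minimal sketch of that synchronous approach (assuming the same 8K buffer size as your code):
const fs = require('fs');

const fd = fs.openSync(process.argv[2], 'r');
const size = fs.fstatSync(fd).size;
const buffer = Buffer.alloc(8192);
let offset = 0;

while (offset < size) {
  // readSync blocks until the chunk is in the buffer, so a plain while loop works
  const bytesRead = fs.readSync(fd, buffer, 0, Math.min(size - offset, 8192), offset);
  for (let i = 0; i < bytesRead; i++) {
    console.log(buffer[i].toString(16));
  }
  offset += bytesRead;
}
fs.closeSync(fd);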
Second, here's why your code doesn't work properly. Javascript in node.js is single-threaded and event-driven. That means that the interpreter pulls an event out of the event queue, runs the code associated with that event and does nothing else until that code returns control back to the interpreter. At that point, it then pulls the next event out of the event queue and runs it.
When you do this:
while (going) {
  fs.read(..., (err, data) => {
    // some logic here that may change the value of the going variable
  });
}
You've just created yourself an infinite loop. This is because the while(going) loop just runs forever. It never stops looping and never returns control back to the interpreter so that it can fetch the next event from the event queue. It just keeps looping. But, the completion of the asynchronous, non-blocking fs.read() comes through the event queue. So, you're waiting for the going flag to change, but you never allow the system to process the events that can actually change the going flag. In your actual case, you will probably eventually run out of some sort of resource from calling fs.read() too many times in a tight loop or the interpreter will just hang in an infinite loop.
Understanding how to program repetitive, looping tasks that involve asynchronous operations requires learning some new programming techniques. Since much of the I/O in node.js is asynchronous and non-blocking, this is an essential skill to develop for node.js programming.
There are a number of different ways to solve this:
Use fs.createReadStream() and read the file by listening for the data event. This is probably the cleanest scheme. If your objective here is to build a hex outputter, you might even want to learn a stream feature called a transform, where you transform the binary stream into a hex stream (a sketch of that appears after the read-stream example below).
Use promise versions of all the relevant fs functions and use async/await to let your loop wait for each async operation to finish before starting the next iteration. This lets you write synchronous-looking code while still using async I/O.
Write a different type of looping construct (not a while loop) that manually starts the next iteration from the fs.read() completion callback (a sketch of this appears at the end of this answer).
Here's a simple example using fs.createReadStream():
const fs = require('fs');

function convertToHex(val) {
  let str = val.toString(16);
  if (str.length < 2) {
    str = "0" + str;
  }
  return str.toUpperCase();
}

let stream = fs.createReadStream(process.argv[2]);
let outputBuffer = "";

stream.on('data', (data) => {
  // you get an unknown length chunk of data from the file here in a Buffer object
  for (const val of data) {
    outputBuffer += convertToHex(val) + " ";
    if (outputBuffer.length > 100) {
      console.log(outputBuffer);
      outputBuffer = "";
    }
  }
}).on('error', err => {
  // some sort of error reading the file
  console.log(err);
}).on('end', () => {
  // output any remaining buffer
  console.log(outputBuffer);
});
Hopefully you will notice that, because the stream handles opening, closing and reading from the file for you, this is a much simpler way to code it. All you have to do is supply event handlers for the data that is read, for read errors, and for the end of the operation.
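For the transform idea mentioned in the first option, here is a minimal sketch (HexTransform is a hypothetical name; it assumes the same command-line argument as your code and pipes the hex text to stdout):
const fs = require('fs');
const { Transform } = require('stream');

// turns each incoming byte into two hex digits plus a space
class HexTransform extends Transform {
  _transform(chunk, encoding, callback) {
    let out = "";
    for (const byte of chunk) {
      out += byte.toString(16).padStart(2, "0").toUpperCase() + " ";
    }
    callback(null, out);
  }
}

fs.createReadStream(process.argv[2]).pipe(new HexTransform()).pipe(process.stdout);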
Here's a version using async/await and the new file interface (where the file descriptor is an object that you call methods on) with promises in node v10.
const fs = require('fs').promises;

function convertToHex(val) {
  let str = val.toString(16);
  if (str.length < 2) {
    str = "0" + str;
  }
  return str.toUpperCase();
}

async function run() {
  const readSize = 8192;
  let cntr = 0;
  const buffer = Buffer.alloc(readSize);
  const fd = await fs.open(process.argv[2], 'r');
  try {
    let outputBuffer = "";
    while (true) {
      let data = await fd.read(buffer, 0, readSize, null);
      for (let i = 0; i < data.bytesRead; i++) {
        cntr++;
        outputBuffer += convertToHex(buffer.readUInt8(i)) + " ";
        if (outputBuffer.length > 100) {
          console.log(outputBuffer);
          outputBuffer = "";
        }
      }
      // see if all data has been read
      if (data.bytesRead !== readSize) {
        console.log(outputBuffer);
        break;
      }
    }
  } finally {
    await fd.close();
  }
  return cntr;
}

run().then(cntr => {
  console.log(`done - ${cntr} bytes read`);
}).catch(err => {
  console.log(err);
});
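And for completeness, a minimal sketch of the third option (dumpFile and readChunk are made-up names): instead of a while loop, each fs.read() schedules the next one from its completion callback.
const fs = require('fs');

function dumpFile(path) {
  fs.open(path, 'r', (err, fd) => {
    if (err) return console.log("Error: ", err);
    const buffer = Buffer.alloc(8192);
    function readChunk() {
      fs.read(fd, buffer, 0, buffer.length, null, (err, bytesRead) => {
        if (err || bytesRead === 0) {
          // error or end of file: stop looping and close the descriptor
          if (err) console.log(err);
          return fs.close(fd, () => {});
        }
        for (let i = 0; i < bytesRead; i++) {
          console.log(buffer[i].toString(16));
        }
        readChunk(); // start the next read only after this one has finished
      });
    }
    readChunk();
  });
}

dumpFile(process.argv[2]);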


DataReader.loadAsync is being completed even when unconsumedBufferLength is 0

I'm consuming a JSON stream on UWP WinRT with this code:
async function connect() {
  let stream: MSStream;
  return new CancellableContext<void>(
    async (context) => {
      // this will be called immediately
      stream = await context.queue(() => getStreamByXHR()); // returns ms-stream object
      await consumeStream(stream);
    },
    {
      revert: () => {
        // this will be called when user cancels the task
        stream.msClose();
      }
    }
  ).feed();
}

async function consumeStream(stream: MSStream) {
  return new CancellableContext<void>(async (context) => {
    const input = stream.msDetachStream() as Windows.Storage.Streams.IInputStream;
    const reader = new Windows.Storage.Streams.DataReader(input);
    reader.inputStreamOptions = Windows.Storage.Streams.InputStreamOptions.partial;

    while (!context.canceled) {
      const content = await consumeString(1000);
      // ... some more code
    }

    async function consumeString(count: number) {
      await reader.loadAsync(count); // will throw when the stream gets closed
      return reader.readString(reader.unconsumedBufferLength);
    }
  }).feed();
}
Here, the documentation for InputStreamOptions.partial says:
The asynchronous read operation completes when one or more bytes is available.
However, reader.loadAsync completes even when reader.unconsumedBufferLength is 0, and this drives up CPU load. Is this an API bug, or can I prevent this behavior so that loadAsync completes only when unconsumedBufferLength is greater than 0?
PS: Here is a repro with pure JS: https://github.com/SaschaNaz/InputStreamOptionsBugRepro
Is this an API bug or can I prevent this behavior so that loadAsync can complete only when unconsumedBufferLength is greater than 0
Most likely it also completes at the end of the stream. In that case unconsumedBufferLength will be zero, and that needs to be catered for.
In fact the example at https://msdn.microsoft.com/en-us/library/windows/apps/windows.storage.streams.datareader.aspx shows something similar (admittedly not using that option):
// Once we have written the contents successfully we load the stream.
await dataReader.LoadAsync((uint)stream.Size);
var receivedStrings = "";
// Keep reading until we consume the complete stream.
while (dataReader.UnconsumedBufferLength > 0)
🌹
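Applied to your consumeString helper, a hedged sketch of that check (keeping your names; the caller would stop its read loop when it gets null back):
async function consumeString(count: number) {
  const bytesLoaded = await reader.loadAsync(count);
  if (bytesLoaded === 0 || reader.unconsumedBufferLength === 0) {
    // end of stream: nothing was loaded, so stop reading instead of spinning
    return null;
  }
  return reader.readString(reader.unconsumedBufferLength);
}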
