How to batch process an async read stream? - javascript

I'm trying to batch process the reading of a file and posting to a database. Currently, I am trying to batch it 20 records at a time, as seen below.
Despite the documentBatch.length check I have put in, it does not seem to be working: the database call inside persistToDB should be called 5 times, but for some reason it is only called once, and when I console.log documentBatch.length it goes higher than that limit. I suspect this is due to concurrency issues; however, persistToDB is from an external library and needs to be called within an async function.
The way I am trying to batch is to pause the stream and resume it once the database work is done, but this seems to have the same issue.
let documentBatch = [];
const processedMetrics = {
succesfullyProcessed: 0,
unsuccesfullyProcessed: 0,
};
rl.on('line', async (line) => {
try {
const document = JSON.parse(line);
documentBatch.push(document);
console.log(documentBatch.length);
if (documentBatch.length === 20) {
rl.pause();
const batchMetrics = await persistToDB(documentBatch);
documentBatch = [];
processedMetrics.succesfullyProcessed +=
batchMetrics.succesfullyProcessed;
processedMetrics.unsuccesfullyProcessed +=
batchMetrics.unsuccesfullyProcessed;
rl.resume();
}
} catch (e) {
logger.error(`Failed to save document ${line}`);
throw e;
}
});
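One possible explanation for the behaviour described above: the event emitter does not wait for an async 'line' listener, and rl.pause() does not immediately stop 'line' events that are already queued, so further lines are pushed while persistToDB is still pending and the === 20 check is skipped once the length has passed 20. A way around this is to iterate the readline interface with for await, which only asks for the next line after the loop body has finished. The sketch below is a minimal illustration under assumptions: processInBatches is a hypothetical helper, persist stands in for persistToDB and is assumed to resolve to an object with the same two counters, and the file name in the usage comment is a placeholder.

```javascript
// Hypothetical helper: consumes any async iterable of JSON lines and
// persists the parsed documents in batches of `batchSize`.
async function processInBatches(lines, batchSize, persist) {
  const totals = { succesfullyProcessed: 0, unsuccesfullyProcessed: 0 };
  let batch = [];

  const flush = async () => {
    if (batch.length === 0) return;
    const current = batch;
    batch = []; // start a fresh batch before waiting on the database
    const metrics = await persist(current);
    totals.succesfullyProcessed += metrics.succesfullyProcessed;
    totals.unsuccesfullyProcessed += metrics.unsuccesfullyProcessed;
  };

  for await (const line of lines) {
    batch.push(JSON.parse(line));
    if (batch.length >= batchSize) {
      await flush(); // the loop does not pull another line until this resolves
    }
  }
  await flush(); // persist the final, possibly smaller, batch
  return totals;
}

// Usage sketch (assumes `persistToDB` from the question is in scope):
// const fs = require('fs');
// const readline = require('readline');
// const rl = readline.createInterface({
//   input: fs.createReadStream('documents.ndjson'),
//   crlfDelay: Infinity,
// });
// processInBatches(rl, 20, persistToDB).then(console.log);
```

Using >= instead of === for the size check is a small safety margin: the batch is still flushed if it ever grows past the limit.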

Related

Problems with MongoDB + async await

I am currently working on a little JS game (node.js, socket.io) and I need to retrieve data from my MongoDB (currently using Mongoose). Getting the data itself does not seem to be the problem, since I can console.log it.
The problem is the timing. I need the data before the program continues running. I've tried using async/await, but I think I might still be using it wrong, since it doesn't work.
card.controller.js:
exports.CRUD = {
getCardById : async (id) => {
try {
let cards = await Card.find({number:id}).exec();
return cards ;
} catch (err) {
return 'error occured';
}
}
}
game.js:
/* Code is within a normal function (not async) */
(async () => {
game.card = await CRUD.getCardById(cardId - 1);
console.log("Game-Card: ",game.card); // Data gets logged here, but only after it already returned game
})();
/*return game*/
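The ordering problem described above can be reproduced without a database. In the sketch below, getCardById is a stand-in for the Mongoose query (it just resolves after a timer), and buildGameWrong and buildGame are hypothetical names. An async IIFE starts the lookup, but the surrounding synchronous function returns right away, so one option is to make the surrounding function async and have its callers await it.

```javascript
// Stand-in for CRUD.getCardById: resolves after a short delay.
const getCardById = (id) =>
  new Promise((resolve) => setTimeout(() => resolve([{ number: id }]), 10));

// Mirrors the question: the IIFE is started but not awaited,
// so `game` is returned before `game.card` has been assigned.
function buildGameWrong(cardId) {
  const game = {};
  (async () => {
    game.card = await getCardById(cardId - 1);
  })();
  return game;
}

// One possible fix: make the function async and await the lookup.
async function buildGame(cardId) {
  const game = {};
  game.card = await getCardById(cardId - 1);
  return game;
}
```

Callers then write `const game = await buildGame(cardId);` (or use `.then()`), which means they have to be asynchronous as well.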

Restart Async Function After Aborting API Fetch

I am creating a Node.js module that takes a list of movie titles and fetches their respective metadata from omdbapi.com.
These lists are often very large and sometimes (with my current slow internet connection) the connection stalls due to too many concurrent connections. So I set up a timeout/abort method that restarts the process after 30 seconds.
The problem I'm having is that whenever I lose internet connection or the connection stalls, it just bails out of the process, and doesn't restart the connection.
Example:
async function getMetadata () {
const remainingMovies = await getRemainingMovies();
for (let i = 0; i < remainingMovies.length;i++) {
const { data, didTimeout } = await fetchMetadata(remainingMovies[i]);
// Update "remainingMovies" Array...
if (didTimeout) {
getMetadata();
break;
}
}
if (!didTimeout) {
return data;
}
}
This is obviously a simplified version, but essentially:
1. The getMetadata function gets the remainingMovies array from the global scope.
2. It fetches the metadata from the server with the fetchMetadata function.
3. It checks whether the connection timed out.
4. If it did, it should restart the function and attempt to connect again.
5. If it didn't time out, it finishes the for loop and continues.
I guess you want something similar to the script below. Error handling with try/catch around the awaited call is probably the missing piece of the puzzle.
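async function getMetadata() {
    const remainingMovies = await getRemainingMovies();
    // The callback must be async for await to be legal inside it, and
    // Promise.all waits for every request before the function resolves.
    return Promise.all(remainingMovies.map(async (movie) => {
        try {
            return await fetchMetadata(movie);
        } catch (err) {
            // On failure, start over, as described in the question.
            return getMetadata();
        }
    }));
}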
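Restarting the whole function from inside the error handler may recurse for as long as the connection stays down, so a bounded retry around each individual request can be easier to reason about. This is only a sketch: withRetry, maxAttempts and getAllMetadata are hypothetical names, and fetchOne stands in for fetchMetadata, assumed here to reject when a request times out.

```javascript
// Hypothetical helper: retries an async operation up to maxAttempts times.
async function withRetry(operation, maxAttempts = 3) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError; // every attempt failed
}

// Fetches the movies one at a time, which also avoids opening many
// concurrent connections. `fetchOne` stands in for fetchMetadata.
async function getAllMetadata(movies, fetchOne) {
  const results = [];
  for (const movie of movies) {
    results.push(await withRetry(() => fetchOne(movie)));
  }
  return results;
}
```

Because each movie is retried on its own, results that were already fetched are kept when a later request fails.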

How to read large files with fs.read and a buffer in javascript?

I'm just learning javascript, and a common task I perform when picking up a new language is to write a hex-dump program. The requirements are 1. read file supplied on command line, 2. be able to read huge files (reading a buffer-at-a-time), 3. output the hex digits and printable ascii characters.
Try as I might, I can't get the fs.read(...) function to actually execute. Here's the code I've started with:
console.log(process.argv);
if (process.argv.length < 3) {
console.log("usage: node hd <filename>");
process.exit(1);
}
fs.open(process.argv[2], 'r', (err,fd) => {
if (err) {
console.log("Error: ", err);
process.exit(2);
} else {
fs.fstat(fd, (err,stats) => {
if (err) {
process.exit(4);
} else {
var size = stats.size;
console.log("size = " + size);
going = true;
var buffer = new Buffer(8192);
var offset = 0;
//while( going ){
while( going ){
console.log("Reading...");
fs.read(fd, buffer, 0, Math.min(size-offset, 8192), offset, (error_reading_file, bytesRead, buffer) => {
console.log("READ");
if (error_reading_file)
{
console.log(error_reading_file.message);
going = false;
}else{
offset += bytesRead;
for (a=0; a< bytesRead; a++) {
var z = buffer[a];
console.log(z);
}
if (offset >= size) {
going = false;
}
}
});
}
//}
fs.close(fd, (err) => {
if (err) {
console.log("Error closing file!");
process.exit(3);
}
});
}
});
}
});
If I comment-out the while() loop, the read() function executes, but only once of course (which works for files under 8K). Right now, I'm just not seeing the purpose of a read() function that takes a buffer and an offset like this... what's the trick?
Node v8.11.1, OSX 10.13.6
First of all, if this is just a one-off script that you run now and then, and not code in a server, there's no need to use the harder asynchronous I/O. You can use synchronous, blocking I/O calls such as fs.openSync(), fs.statSync(), fs.readSync(), etc., and then things will work inside your while loop because those calls are blocking (they don't return until the results are ready). You can write normal looping and sequential code with them. One should never use synchronous, blocking I/O in a server environment because it ruins the scalability of a server process (its ability to handle requests from multiple clients), but if this is a one-off local script with only one job to do, then synchronous I/O is perfectly appropriate.
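As a rough illustration of the synchronous route, the sketch below reads a file in fixed-size chunks with fs.openSync() and fs.readSync(). hexDumpSync is a hypothetical name, and it returns the hex text instead of printing it so that the result is easy to check; for very large files you would print each chunk as you go rather than collecting everything.

```javascript
const fs = require('fs');

// Hypothetical helper: returns the file contents as space-separated hex bytes.
function hexDumpSync(path, chunkSize = 8192) {
  const fd = fs.openSync(path, 'r');
  const buffer = Buffer.alloc(chunkSize);
  const parts = [];
  try {
    let bytesRead;
    // readSync blocks until the read completes and returns 0 at end of file.
    // A null position means "continue from the current file position".
    while ((bytesRead = fs.readSync(fd, buffer, 0, chunkSize, null)) > 0) {
      for (let i = 0; i < bytesRead; i++) {
        parts.push(buffer[i].toString(16).padStart(2, '0').toUpperCase());
      }
    }
  } finally {
    fs.closeSync(fd);
  }
  return parts.join(' ');
}

// Usage: console.log(hexDumpSync(process.argv[2]));
```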
Second, here's why your code doesn't work properly. Javascript in node.js is single-threaded and event-driven. That means that the interpreter pulls an event out of the event queue, runs the code associated with that event and does nothing else until that code returns control back to the interpreter. At that point, it then pulls the next event out of the event queue and runs it.
When you do this:
while (going) {
    fs.read(..., (err, data) => {
        // some logic here that may change the value of the going variable
    });
}
You've just created yourself an infinite loop. This is because the while(going) loop just runs forever. It never stops looping and never returns control back to the interpreter so that it can fetch the next event from the event queue. It just keeps looping. But, the completion of the asynchronous, non-blocking fs.read() comes through the event queue. So, you're waiting for the going flag to change, but you never allow the system to process the events that can actually change the going flag. In your actual case, you will probably eventually run out of some sort of resource from calling fs.read() too many times in a tight loop or the interpreter will just hang in an infinite loop.
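The effect is easy to reproduce on a small scale. In the sketch below (busyWait is a hypothetical helper and the 50 ms figure is arbitrary), a zero-delay timer is scheduled and then the thread is kept busy in a synchronous loop. The timer callback cannot run until the loop finishes and control returns to the event loop.

```javascript
// Hypothetical demo: returns whether the timer callback managed to run
// while the synchronous loop was still spinning.
function busyWait(milliseconds) {
  let fired = false;
  setTimeout(() => { fired = true; }, 0);

  const start = Date.now();
  while (Date.now() - start < milliseconds) {
    // Spin without returning to the event loop.
  }
  return fired;
}

console.log(busyWait(50)); // false: the callback is still waiting in the queue
```

A while(going) loop is the same situation without the time limit: the callback that would set going to false never gets a chance to run.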
Understanding how to program repetitive, looping tasks that involve asynchronous operations requires learning some new programming techniques. Since much I/O in node.js is asynchronous and non-blocking, this is an essential skill to develop for node.js programming.
There are a number of different ways to solve this:
1. Use fs.createReadStream() and read the file by listening for the data event. This is probably the cleanest scheme. If your objective here is to write a hex outputter, you might even want to learn a stream feature called a transform, where you transform the binary stream into a hex stream.
2. Use promise versions of all the relevant fs functions here and use async/await to allow your loop to wait for an async operation to finish before going to the next iteration. This allows you to write synchronous-looking code, but use async I/O.
3. Write a different type of looping construct (not using a while loop) that manually repeats the loop after fs.read() completes.
Here's a simple example using fs.createReadStream():
const fs = require('fs');
function convertToHex(val) {
let str = val.toString(16);
if (str.length < 2) {
str = "0" + str;
}
return str.toUpperCase();
}
let stream = fs.createReadStream(process.argv[2]);
let outputBuffer = "";
stream.on('data', (data) => {
// you get an unknown length chunk of data from the file here in a Buffer object
for (const val of data) {
outputBuffer += convertToHex(val) + " ";
if (outputBuffer.length > 100) {
console.log(outputBuffer);
outputBuffer = "";
}
}
}).on('error', err => {
// some sort of error reading the file
console.log(err);
}).on('end', () => {
// output any remaining buffer
console.log(outputBuffer);
});
Hopefully you will notice that, because the stream handles opening, closing and reading from the file for you, this is a much simpler way to code. All you have to do is supply event handlers for the data that is read, a read error and the end of the operation.
Here's a version using async/await and the new file interface (where the file descriptor is an object that you call methods on) with promises in node v10.
const fs = require('fs').promises;
function convertToHex(val) {
let str = val.toString(16);
if (str.length < 2) {
str = "0" + str;
}
return str.toUpperCase();
}
async function run() {
const readSize = 8192;
let cntr = 0;
const buffer = Buffer.alloc(readSize);
const fd = await fs.open(process.argv[2], 'r');
try {
let outputBuffer = "";
while (true) {
let data = await fd.read(buffer, 0, readSize, null);
for (let i = 0; i < data.bytesRead; i++) {
cntr++;
outputBuffer += convertToHex(buffer.readUInt8(i)) + " ";
if (outputBuffer.length > 100) {
console.log(outputBuffer);
outputBuffer = "";
}
}
// see if all data has been read
if (data.bytesRead !== readSize) {
console.log(outputBuffer);
break;
}
}
} finally {
await fd.close();
}
return cntr;
}
run().then(cntr => {
console.log(`done - ${cntr} bytes read`);
}).catch(err => {
console.log(err);
});
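The third option in the list above, a loop that repeats itself from the completion callback, is not shown in the answer. A minimal sketch of it could look like the following; readAll, onChunk and done are hypothetical names. Each fs.read() is issued only after the previous one has completed, so control returns to the event loop between reads.

```javascript
const fs = require('fs');

// Hypothetical helper: reads a file chunk by chunk using callbacks only.
// onChunk(buffer, bytesRead) is called for every chunk, done(err) at the end.
function readAll(path, onChunk, done, chunkSize = 8192) {
  fs.open(path, 'r', (openErr, fd) => {
    if (openErr) return done(openErr);
    const buffer = Buffer.alloc(chunkSize);

    // Close the descriptor, then report the first error (if any).
    const finish = (err) => fs.close(fd, (closeErr) => done(err || closeErr));

    const readNext = () => {
      // A null position continues from the current file position.
      fs.read(fd, buffer, 0, chunkSize, null, (readErr, bytesRead) => {
        if (readErr) return finish(readErr);
        if (bytesRead === 0) return finish(null); // end of file
        onChunk(buffer, bytesRead);
        readNext(); // "loop" by issuing the next read from the callback
      });
    };
    readNext();
  });
}
```

The buffer is reused for every chunk, so onChunk should consume or copy the bytes before it returns.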

inserting max num of records in mongodb using nodejs in less time

I'm working on a small task: I need to read a large file (about 1.3 GB) in Node, treat every line as one record, and insert every record into a MongoDB collection in as little time as possible. Please suggest some code, and thanks in advance.
You probably want to read that amount of data without buffering all of it in memory.
Assuming you are dealing with JSON data, I think this might be a feasible approach:
var LineByLineReader = require('line-by-line');
var fileHandler = new LineByLineReader('path/to/file', { encoding:'utf8', skipEmptyLines: true });
var entries = [];
var bulkSize = 100000; // tweak as needed
fileHandler.on('error', function (err) {
// process errors here
});
fileHandler.on('line', function (line) {
entries.push(JSON.parse(line));
if (entries.length === bulkSize) {
// pause handler and write data
fileHandler.pause();
YourCollection.insertMany(entries)
.then(() => {
entries = [];
fileHandler.resume();
})
}
});
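fileHandler.on('end', function () {
    // insert whatever is left over; skip the call if the last batch is empty
    var finalInsert = entries.length ? YourCollection.insertMany(entries) : Promise.resolve();
    finalInsert.then(() => {
        // everything's done, do your stuff here
    });
});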
The line-by-line module seems to be a bit buggy and could be deprecated in the future, so you may want to use linebyline instead.

How to use drain event of stream.Writable in Node.js

In Node.js I'm using the fs.createWriteStream method to append data to a local file. In the Node documentation they mention the drain event when using fs.createWriteStream, but I don't understand it.
var stream = fs.createWriteStream('fileName.txt');
var result = stream.write(data);
In the code above, how can I use the drain event? Is the event used properly below?
var data = 'this is my data';
if (!streamExists) {
var stream = fs.createWriteStream('fileName.txt');
}
var result = stream.write(data);
if (!result) {
stream.once('drain', function() {
stream.write(data);
});
}
The drain event is emitted when a writable stream's internal buffer has been emptied.
This can only happen after the size of the internal buffer has reached or exceeded its highWaterMark property, the threshold (in bytes) at which write() starts returning false to signal that the data source should stop sending data until the buffer drains.
This typically happens when data is read from one stream faster than it can be written to another resource. For example, take two streams:
var fs = require('fs');
var read = fs.createReadStream('./read');
var write = fs.createWriteStream('./write');
Now imagine that the file read is on a SSD and can read at 500MB/s and write is on a HDD that can only write at 150MB/s. The write stream will not be able to keep up, and will start storing data in the internal buffer. Once the buffer has reached the highWaterMark, which is by default 16KB, the writes will start returning false, and the stream will internally queue a drain. Once the internal buffer's length is 0, then the drain event is fired.
This is how a drain works:
if (state.length === 0 && state.needDrain) {
state.needDrain = false;
stream.emit('drain');
}
And these are the prerequisites for a drain which are part of the writeOrBuffer function:
var ret = state.length < state.highWaterMark;
state.needDrain = !ret;
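To see these two conditions in action without real disks, the sketch below builds a deliberately slow Writable with a 4-byte highWaterMark (both are assumptions chosen for the demo, and demoBackpressure is a hypothetical name). The second write pushes the buffered length past the mark, so write() returns false, and 'drain' fires once everything has been flushed.

```javascript
const { Writable } = require('stream');

function demoBackpressure() {
  const sink = new Writable({
    highWaterMark: 4, // tiny threshold so the effect shows up immediately
    write(chunk, encoding, callback) {
      setImmediate(callback); // pretend the write takes a little while
    },
  });

  const first = sink.write('ab');    // 2 bytes pending, below the mark
  const second = sink.write('cdef'); // 6 bytes pending, mark reached

  return new Promise((resolve) => {
    // 'drain' is emitted once the pending length is back to 0.
    sink.once('drain', () => resolve({ first, second }));
  });
}

demoBackpressure().then(console.log); // { first: true, second: false }
```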
To see how the drain event is used, take the example from the Node.js documentation.
function writeOneMillionTimes(writer, data, encoding, callback) {
var i = 1000000;
write();
function write() {
var ok = true;
do {
i -= 1;
if (i === 0) {
// last time!
writer.write(data, encoding, callback);
} else {
// see if we should continue, or wait
// don't pass the callback, because we're not done yet.
ok = writer.write(data, encoding);
}
} while (i > 0 && ok);
if (i > 0) {
// had to stop early!
// write some more once it drains
writer.once('drain', write);
}
}
}
The function's objective is to write 1,000,000 times to a writable stream. A variable ok is set to true, and the loop only keeps running while ok is true. On each iteration, ok is set to the return value of writer.write(), which is false if a drain is required. If ok becomes false, a one-time handler for the drain event is registered, and when the event fires, writing resumes.
Regarding your code specifically, you don't need to use the drain event because you are writing only once right after opening your stream. Since you have not yet written anything to the stream, the internal buffer is empty, and you would have to be writing at least 16KB in chunks in order for the drain event to fire. The drain event is for writing many times with more data than the highWaterMark setting of your writable stream.
Imagine you're connecting 2 streams with very different bandwidths, say, uploading a local file to a slow server. The (fast) file stream will emit data faster than the (slow) socket stream can consume it.
In this situation, node.js will keep data in memory until the slow stream gets a chance to process it. This can get problematic if the file is very large.
To avoid this, Stream.write returns false when the underlying system buffer is full. If you stop writing, the stream will later emit a drain event to indicate that the system buffer has emptied and it is appropriate to write again.
You can pause and resume the readable stream to control how fast it delivers data.
Better: you can use readable.pipe(writable) which will do this for you.
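A minimal sketch of the piping approach, assuming a plain file-to-file copy (copyFile is a hypothetical name). It uses stream.pipeline(), which connects the streams the way pipe() does and additionally forwards errors to a single callback; readable.pipe(writable) on its own handles the backpressure in the same way.

```javascript
const fs = require('fs');
const { pipeline } = require('stream');

// Hypothetical helper: copies src to dest. The reader is paused whenever
// the writer's buffer is full and resumed when the writer drains.
function copyFile(src, dest) {
  return new Promise((resolve, reject) => {
    pipeline(
      fs.createReadStream(src),
      fs.createWriteStream(dest),
      (err) => (err ? reject(err) : resolve())
    );
  });
}

// Usage: copyFile('./read', './write').then(() => console.log('done'));
```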
EDIT: There's a bug in your code: regardless of what write returns, your data has been written. You don't need to retry it. In your case, you're writing data twice.
Something like this would work:
var packets = […],
current = -1;
function niceWrite() {
current += 1;
if (current === packets.length)
return stream.end();
var nextPacket = packets[current],
canContinue = stream.write(nextPacket);
// wait until stream drains to continue
if (!canContinue)
stream.once('drain', niceWrite);
else
niceWrite();
}
Here is a version with async/await
const write = (writer, data) => {
return new Promise((resolve) => {
if (!writer.write(data)) {
writer.once('drain', resolve)
}
else {
resolve()
}
})
}
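// usage
const fs = require('fs')
const run = async () => {
    const write_stream = fs.createWriteStream('...')
    const max = 1000000
    let current = 0
    while (current <= max) {
        // write() accepts strings or Buffers, so convert the number first
        await write(write_stream, String(current++))
    }
    write_stream.end()
}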
https://gist.github.com/stevenkaspar/509f792cbf1194f9fb05e7d60a1fbc73
This is a speed-optimized version using Promises (async/await). The caller has to check if it gets a promise back and only in that case await has to be called. Doing await on each call can slow down the program by a factor of 3...
const write = (writer, data) => {
// return a promise only when we get a drain
if (!writer.write(data)) {
return new Promise((resolve) => {
writer.once('drain', resolve)
})
}
}
// usage
const fs = require('fs')
const run = async () => {
    const write_stream = fs.createWriteStream('...')
    const max = 1000000
    let current = 0
    while (current <= max) {
        // write() accepts strings or Buffers, so convert the number first
        const promise = write(write_stream, String(current++))
        // since drain happens rarely, awaiting each write call is really slow.
        if (promise) {
            // the buffer is full, therefore we wait for the drain event
            await promise
        }
    }
    write_stream.end()
}
