Best way to read many files with nodejs? - javascript

I have a large glob of file paths. I'm getting this path list from a streaming glob module: https://github.com/wearefractal/glob-stream
I was piping this stream to another stream that was creating fileReadStreams for each path, and I quickly hit some limits. I was getting:
warning: possible EventEmitter memory leak detected. 11 listeners added. Use emitter.setMaxListeners() to increase limit
and also Error: EMFILE, open
I've tried bumping the maxListeners, but I have ~9000 files that would be creating streams and I'm concerned that will eat memory; that number also isn't constant and will grow. Am I safe to remove the limit here?
Should I be doing this synchronously? Or should I iterate over the paths and read the files sequentially? Won't a for loop still kick off all the reads at once?

The max listeners thing is purely a warning. setMaxListeners only controls when that message is printed to the console, nothing else. You can disable it or just ignore it.
The EMFILE is your OS enforcing a limit on the number of open files (file descriptors) your process can have at a single time. You could avoid this by increasing the limit with ulimit.
Because saturating the disk by running many thousands of concurrent filesystem operations won't get you any added performance—in fact, it will hurt, especially on traditional non-SSD drives—it is a good idea to only run a controlled number of operations at once.
I'd probably use an async queue, which allows you to push the name of every file to the queue in one loop, and then only runs n operations at once. When an operation finishes, the next one in the queue starts.
For example:
var fs = require('fs');
var async = require('async'); // npm install async

var q = async.queue(function (file, cb) {
  var stream = fs.createReadStream(file.path);
  // ...
  stream.on('end', function() {
    // finish up, then
    cb();
  });
}, 2);

globStream.on('data', function(file) {
  q.push(file);
});

globStream.on('end', function() {
  // We don't want to add the `drain` handler until *after* the globstream
  // finishes. Otherwise, we could end up in a situation where the globber
  // is still running but all pending file read operations have finished.
  q.drain = function() {
    // All done with everything.
  };
  // ...and if the queue is empty when the globber finishes, make sure the done
  // callback gets called.
  if (q.idle()) q.drain();
});
You may have to experiment a little to find the right concurrency number for your application.

Related

Is there a way to minimise CPU usage by reducing number of write operations to chrome.storage?

I am making a chrome extension that keeps track of the time I spend on each site.
In background.js I am using a map (stored as an array) that saves the list of sites, as shown:
let observedTabs = [['chrome://extensions', [time, time, 'icons/sad.png']]];
Every time I update my current site, the starting and ending time of my time on that particular site is stored in the map corresponding to the site's key.
To achieve this, I am performing the chrome.storage.sync.get and chrome.storage.sync.set inside the tabs.onActivated, tabs.onUpdated, windows.onFocusChanged and idle.onStateChanged.
This however results in a very high CPU usage for chrome(around 25%) due to multiple read and write processes from(and to) storage.
I tried to solve the problem by using global variables in background.js and initialising them to undefined. Using the function shown below, I read from storage only when the current variable is undefined (the first time background.js tries to get the data); at all other times it just uses the already-set global variable.
let observedTabs = undefined;

function getObservedTabs(callback) {
  if (observedTabs === undefined) {
    chrome.storage.sync.get("observedTabs", (observedTabs_obj) => {
      callback(observedTabs_obj.observedTabs);
    });
  } else {
    callback(observedTabs);
  }
}
This solves the problem of the costly repeated read operations.
As for the write operations, I considered using runtime.onSuspend to write to storage once my background script stops executing, as shown:
chrome.runtime.onSuspend.addListener(() => {
  getObservedTabs((_observedTabs) => {
    observedTabs = _observedTabs;
    chrome.storage.sync.set({"observedTabs": _observedTabs});
  });
});
This, however, doesn't work, and the documentation also warns about it:
Note that since the page is unloading, any asynchronous operations started while handling this event are not guaranteed to complete.
Is there a workaround that would allow me to minimise my writing operations to storage and hence reduce my CPU usage?
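One possible workaround, as a rough sketch only (it assumes the extension can request the "alarms" permission and that losing up to a minute of data on shutdown is acceptable): keep observedTabs up to date in memory, mark it dirty whenever it changes, and flush it to storage on a periodic alarm instead of inside every event handler.
// Sketch only: the event handlers update `observedTabs` in memory and call markDirty().
let dirty = false;

function markDirty() {
  dirty = true;
}

// Flush at most once per minute instead of on every tabs/windows/idle event.
chrome.alarms.create('flushObservedTabs', { periodInMinutes: 1 });
chrome.alarms.onAlarm.addListener((alarm) => {
  if (alarm.name !== 'flushObservedTabs' || !dirty) return;
  dirty = false;
  chrome.storage.sync.set({ observedTabs });
});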

fs.createReadStream loop not completing

I am iterating through an object containing local files, all of which definitely exist, reading them into a buffer and incrementing a counter when each one completes. The problem is that despite there being 319 files to read, printing the counter to the console rarely, if ever, shows it getting through all of them. It mysteriously stops somewhere in the 200s... different every time and without throwing any errors.
I have this running in an Electron project, and the built app works seamlessly on a Mac but won't get through this loop on Windows! I've recently updated all the packages, been through other areas and made the necessary adjustments, and the whole app is working perfectly... except this, and it's driving me mad!
Here's the code:
$.each(compare_object, function(key, item) {
  console.log(item.local_path); // this correctly prints out every single file path
  var f = fs.createReadStream(item.local_path);
  f.on('data', function(buf) {
    // normally some other code goes in here but I've simplified it right down
    // for the purposes of getting it working!
  });
  f.on('end', function(err) {
    num++;
    console.log(num); // this rarely reached past 280 out of 319 files. Always different though.
  });
  f.on('error', function(error) {
    console.log(error); // this never fires.
    num++;
  });
});
I'm wondering if there's a cache that's maxing out or if I should be destroying the buffer after 'end' every time but nothing I've read suggest this and even when I tried it made no difference. A lot of the example expect you to be piping it somewhere, which I'm not. In the full code it creates a hash of the complete file and adds it to the object for each of the local files.
I believe the loop itself is completing here; the problem is that the handlers you attach are asynchronous. The easiest solution is to rewrite your code without streams.
const fs = require('fs')
const util = require('util')
const asyncReadFile = util.promisify(fs.readFile)

// ...this loop needs to live inside an async function so that `await` is allowed
for (let [key, item] of Object.entries(compare_object)) {
  const data = await asyncReadFile(item.local_path)
  // ...data handling goes here
}
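Since the full code apparently hashes each file, here is a rough sketch of how that part could look on top of the promise-based rewrite (the async wrapper and the use of Node's crypto module are assumptions, not part of the original code):
const crypto = require('crypto')

async function hashAll() {
  for (let [key, item] of Object.entries(compare_object)) {
    const data = await asyncReadFile(item.local_path)
    // hash the whole file and store it back on the object
    item.hash = crypto.createHash('sha256').update(data).digest('hex')
    num++
    console.log(num) // with sequential awaits this should reliably reach 319
  }
}

hashAll()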

Issue writing files within a loop and reading from a temporary directory

I'm trying to download a report from an API, but the data comes back as a zipped folder containing another folder, which in turn contains a dozen or so gzipped JSON files. For clarity, it looks like this:
report.zip/
├── reportID/ <- this is a regular folder
│ ├── reportID_123.json.gz
│ ├── reportID_456.json.gz
│ ├── reportID_789.json.gz
│ └── reportID_159.json.gz
I'm trying to unzip the first folder, then unzip each individual file, and finally loop through and read the contents of each JSON file and add them to a single object. But I'm having two issues.
The first is that while the first part of this code correctly unzips the outer folder and extracts the name of each JSON file, and the second part does unzip the individual files, every unzipped file ends up exactly the same as the one before it (which isn't the case when doing the whole thing manually).
var zip = new AdmZip(tempFilePath);
var zipEntries = zip.getEntries(); // an array of ZipEntry records
const allZips = [];
const tempOutputPath = path.join(os.tmpdir(), 'output');

zip.extractAllTo(/*target path*/ tempOutputPath, /*overwrite*/ true);

zipEntries.forEach(function(zipEntry) {
  allZips.push(zipEntry.entryName);
});
console.log(allZips);

const allData = [];
for (var i = 0; i <= allZips.length; i++) {
  const zippedFileName = path.join(tempOutputPath, allZips[i]);
  const finalOutputName = path.join(tempOutputPath, allZips[i].replace('.gz', ''));
  console.log(zippedFileName);
  const inp = fs.createReadStream(zippedFileName);
  const out = fs.createWriteStream(finalOutputName);
  inp.pipe(unzip).pipe(out);
  console.log('File piped successfully');
  console.log(finalOutputName);
  let rawData = fs.readFileSync(finalOutputName);
  let data = rawData.toString();
  console.log(data);
  allData.push(data);
}
The second problem is that even while looping through, it only manages to actually extract the data from some of the files, seemingly at random considering the code and the files are all identical apart from name. It might have something to do with the fact that as soon as the loop finishes, I get the following errors, despite the end of the loop also being the end of my code:
(node:15772) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 end listeners added. Use emitter.setMaxListeners() to increase limit
(node:15772) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 unpipe listeners added. Use emitter.setMaxListeners() to increase limit
(node:15772) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 drain listeners added. Use emitter.setMaxListeners() to increase limit
(node:15772) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 error listeners added. Use emitter.setMaxListeners() to increase limit
(node:15772) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 close listeners added. Use emitter.setMaxListeners() to increase limit
(node:15772) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 finish listeners added. Use emitter.setMaxListeners() to increase limit
(node:15772) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 data listeners added. Use emitter.setMaxListeners() to increase limit
(node:15772) UnhandledPromiseRejectionWarning: TypeError [ERR_INVALID_ARG_TYPE]: The "path" argument must be of type string. Received type undefined
One last thing in case it's relevant, this all seems to work better (although still not perfectly) when hardcoding the file names instead of using temporary file paths. Unfortunately I have to use temp file paths as this is for a Cloud Function that doesn't allow regular paths.
.pipe() is asynchronous. So, this line of code:
inp.pipe(unzip).pipe(out);
finishes at some unknown time in the future. So, you're trying to read the output file with this:
fs.readFileSync(finalOutputName);
before you know that the output has finished. If you're going to use streams for this, then you need to watch for the close event on the writestream so you can THEN know that the .pipe() is completely done. You also should be watching for the error event on your streams to implement proper error handling.
And after implementing a listener for the close event to read the output, with the way your code is written, you will have to wait for all the streams to get their close event before you can use allData because only then will it have all the data in it.
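As a rough sketch of that idea (assuming unzip is meant to be a fresh gunzip stream per file, e.g. zlib.createGunzip(), which is a guess since it isn't defined in the posted code), you can wrap each pipe in a promise that resolves on the write stream's close event:
const fs = require('fs');
const zlib = require('zlib');

function gunzipToFile(srcPath, destPath) {
  return new Promise((resolve, reject) => {
    const inp = fs.createReadStream(srcPath);
    const out = fs.createWriteStream(destPath);
    const gunzip = zlib.createGunzip();
    inp.on('error', reject);
    gunzip.on('error', reject);
    out.on('error', reject);
    out.on('close', resolve); // only now is the output file fully on disk
    inp.pipe(gunzip).pipe(out);
  });
}

// usage inside an async function:
// await gunzipToFile(zippedFileName, finalOutputName);
// fs.readFileSync(finalOutputName) is now safe.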
In trying to understand the flow of your code to perhaps suggest an alternative, I see this line of code:
inp.pipe(unzip).pipe(out);
But, there is no definition of the variable unzip.
In addition, besides creating all the temporary files, it would be helpful to know what the end goal of this code is so we could perhaps suggest a better approach.
As a minor simplification, you can replace this:
const allZips = [];
zipEntries.forEach(function(zipEntry) {
  allZips.push(zipEntry.entryName);
});
with this:
const allZips = zipEntries.map(zipEntry => zipEntry.entryName);
As I try to understand this code better, it appears that all you're trying to do with your two streams and your .pipe() is to copy the extracted files to a new location. You could do that a lot more simply with fs.copyFile() or fs.copyFileSync().
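For instance, a minimal sketch of that simplification (assuming no decompression is actually needed in this step):
for (const entryName of allZips) {
  const src = path.join(tempOutputPath, entryName);
  const dest = path.join(tempOutputPath, entryName.replace('.gz', ''));
  fs.copyFileSync(src, dest); // synchronous, so there are no stream events to wait for
  allData.push(fs.readFileSync(dest).toString());
}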

Is it possible to fork child processes and wait for them to return in Node/JS?

I have to do a certain calculation many many times. I want to split it over multiple cores, so I want to use child_process.fork(), and then pass each child a chunk of the work.
This works fine, except that my main process (the one that does the fork()) just keeps going after fork()ing and has already terminated by the time the children complete their work.
What I really want to do is wait for the children to finish before continuing execution.
I can't find any good way to do this in Node. Any advice?
If you spawn a new V8 process via .fork(), it returns a child object which implements a communication layer. For instance:
var cp = require('child_process'),
    proc = cp.fork('/myfile.js');

proc.on('message', function(msg) {
  // continue whatever you want here
});
and within /myfile.js you just dispatch a message when you're done with the work:
process.send({ custom: 'message' });
Be aware that this method really does spawn a new V8 instance, which eats a good chunk of memory, so use it thoughtfully. Maybe you don't even need to do it this way; there might be a more "Node-like" solution (for instance, using process.nextTick to break the heavy calculation into smaller pieces).
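To actually wait for several workers before continuing, a rough sketch (the ./worker.js file name and the pre-chunked work are assumptions) would be to fork one child per chunk and resolve once every child has reported back:
const cp = require('child_process');

// `chunks` is assumed to be the work already split up by the caller.
function runChunks(chunks) {
  return Promise.all(chunks.map((chunk) => new Promise((resolve, reject) => {
    const child = cp.fork('./worker.js'); // hypothetical worker script
    child.on('message', (result) => {
      resolve(result);
      child.kill();
    });
    child.on('error', reject);
    child.send(chunk); // the worker does its share and calls process.send(result)
  })));
}

// usage:
// runChunks(chunks).then((results) => { /* all children are done here */ });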

Performance heavy algorithms on Node.js

I'm creating some algorithms that are very performance heavy, e.g. evolutionary and artificial-intelligence algorithms. What matters to me is that my update function gets called often (precision), and I just can't get setInterval to fire more than once per millisecond.
Initially I wanted to just use a while loop, but I'm not sure that those kinds of blocking loops are a viable solution in the Node.js environment. Will Socket.io's socket.on("id", cb) work if I run into an "infinite" loop? Does my code somehow need to return to Node.js to let it check for all the events, or is that done automatically?
And last (but not least), if while loops will indeed block my code, what is another solution for getting really low delta times between my update calls? I think threads could help, but I doubt they're possible here: my Socket.io server and other classes need to communicate somehow, and by "other classes" I mean the main World class, which has an update method that needs to get called and does the heavy lifting, and a getInfo method that is used by my server. I feel like most of the time the program is just sitting there waiting for the interval to fire, wasting time instead of doing calculations...
Also, I'd like to know if Node.js is even suited for these sorts of tasks.
You can execute heavy algorithms in a separate process using child_process.fork and receive the results in the main process via child.on('message', function (message) { });
app.js
var child_process = require('child_process');
var child = child_process.fork('./heavy.js', ['some', 'argv', 'params']);

child.on('message', function(message) {
  // heavy results here
});
heavy.js
while (true) {
  if (Math.random() < 0.001) {
    process.send({ result: 'wow!' });
  }
}
