I'm new to Node.js and recently learned about the fs module. I'm a little confused about asynchronous vs. synchronous file I/O.
Consider the following test:
var fs = require('fs');

var txtfile = 'async.txt';
var buffer1 = Buffer(1024);
var buffer2 = '1234567890';

fs.appendFile(txtfile, buffer1, function(err) {
    if (err) { throw err; }
    console.log('appended buffer1');
});

fs.appendFile(txtfile, buffer2, function(err) {
    if (err) { throw err; }
    console.log('appended buffer2');
});
About half the time when I run this, it prints appended buffer2 before appended buffer1. But when I open the text file, the data always appears to be in the right order - a bunch of garbage from Buffer(1024) followed by 1234567890. I would have expected the reverse or a jumbled mess.
What's going on here? Am I doing something wrong? Is there some kind of lower-level i/o queue that maintains order?
I've seen some talk about filesystem I/O differences with Node; I'm on a Mac if that makes any difference.
From my understanding, although the code is asynchronous, at the OS level the file I/O operations on the SAME file are not: only one I/O operation on a single file is processed at a time.
While the 1st append is being written, the file is locked. Although the 2nd append has already been processed by Node, its file I/O portion is queued by the OS and completes with no error status. My guess is the OS performs some checks to make sure the write will succeed (the file exists, is writable, there is enough disk space, and so on). If all those conditions are met, the OS returns to the application with no error status and finishes the write later, when possible. Since the buffer of the 2nd append is much smaller, its processing (not the write-to-file part) can finish before the 1st append has finished writing to the file, which is why you saw the 2nd console.log() first.
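If the callback order matters (and not just the on-disk order), one way to guarantee it is to chain the appends so the second one only starts after the first has completed. A rough sketch using the promise-based API (fs.promises, available in Node 10+); Buffer.alloc(1024) is used here as the non-deprecated way to get a 1 KiB buffer (zero-filled rather than uninitialized):

const fs = require('fs').promises;

async function appendInOrder() {
    const txtfile = 'async.txt';
    const buffer1 = Buffer.alloc(1024); // zero-filled, unlike the deprecated Buffer(1024)
    const buffer2 = '1234567890';

    await fs.appendFile(txtfile, buffer1); // completes before the next append starts
    console.log('appended buffer1');

    await fs.appendFile(txtfile, buffer2);
    console.log('appended buffer2');
}

appendInOrder().catch(console.error);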
Related
I wrote a very simple TypeScript program, which does the following:
Transform users.csv into an array
For each element/user issue an API call to create that user on a 3rd party platform
Print any errors
The CSV file has more than 160,000 rows and there is no way to create them all in one API call, so I wrote this program to run in the background of my computer for 20+ hours.
The first time I ran this, the code stopped mid-loop without an exception or any other output. So I deleted the rows from the CSV file that had already been uploaded and re-ran the code. Unfortunately, this kept happening.
Interestingly, the code has stopped at non-deterministic iterations, one time it was at i=812, another at i=27650, and so on.
This is the code:
import axios from "axios";
import { promises as fsPromises } from "fs";
// makeArray() and sleep() are small helpers defined elsewhere in the file.

const main = async () => {
    const usersFile = await fsPromises.readFile("./users.csv", { encoding: "utf-8" });
    const usersArr = makeArray(usersFile);
    for (let i = 0; i < usersArr.length; i++) {
        const [userId, email] = usersArr[i];
        console.log(`uploading ${userId}. ${i}/${usersArr.length}`);
        try {
            await axios.post(/* create user */);
            await sleep(150);
        } catch (err) {
            console.error(`Error uploading ${userId} -`, err.message);
        }
    }
};

main();
I should mention that exceptions are caught inside the for-loop because many rows will fail to upload with a 400 error code. As such, I prefer to have the code run non-stop and print any errors to a file, so that I can later re-run it for the users that failed to upload. Otherwise I would have to check every 10 minutes whether it had halted because of an error.
Why does this happen, and what can I do?
I run after compiling as: node build/index.js 2>>errors.txt
EDIT:
There is no code after main() and no code outside the try ... catch block within the loop. errors.txt only contains 400 errors. Even if it contained another run-time exception, it seems to me that this wouldn't/shouldn't halt execution, because it would execute catch and move on to the next iteration.
I think this may have been related to this post. The file I was reading was extremely large, as noted, and it was read entirely into a variable at runtime. Non-deterministically, the OS could have decided that the memory demand was too high and killed the process. This is probably a situation where a readable stream should be used instead of readFile.
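A rough sketch of that streaming approach (plain Node JavaScript for brevity; parseLine() is a hypothetical helper that splits one CSV line into [userId, email]):

const fs = require('fs');
const readline = require('readline');
const axios = require('axios');

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function main() {
    // Stream the CSV line by line instead of loading the whole file into memory.
    const rl = readline.createInterface({
        input: fs.createReadStream('./users.csv', { encoding: 'utf-8' }),
        crlfDelay: Infinity,
    });

    let i = 0;
    for await (const line of rl) {
        const [userId, email] = parseLine(line); // hypothetical CSV-line parser
        console.log(`uploading ${userId}. ${i++}`);
        try {
            await axios.post(/* create user */);
            await sleep(150);
        } catch (err) {
            console.error(`Error uploading ${userId} -`, err.message);
        }
    }
}

main().catch(console.error);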
I have a Node.js server managing some files. It watches for a known filename from an external process and, once the file appears, reads it and then deletes it. However, the read/delete is sometimes attempted before the file has been "unlocked" from its previous use, so it will occasionally fail. What I'd like to do is retry this file ASAP, either as soon as it's released or by polling at a fast pace.
I'd rather avoid a long sleep where possible, because this needs to be handled ASAP and every second counts.
fs.watchFile(input_json_file, { interval: 10 }, function(current_stats, previous_stats) {
    var json_data = "";
    try {
        var file_cont = fs.readFileSync(input_json_file); // < TODO: async this
        json_data = JSON.parse(file_cont.toString());
        fs.unlink(input_json_file, function(err) {
            if (err) console.log(err);
        });
    } catch (error) {
        console.log("The JSON in the file could not be parsed. File will continue to be watched.");
        console.log(error);
        return;
    }
    // Else, this has loaded properly.
    fs.unwatchFile(input_json_file);
    // ... do other things with the file's data.
});

// set a timeout for the file watching, just in case
setTimeout(fs.unwatchFile, CLEANUP_TIMEOUT, input_json_file);
I expect "EBUSY: resource busy or locked" to turn up occasionally, but fs.watchFile isn't always called when the file is unlocked.
I thought of creating a function and calling it with a delay of 1-10 ms, where it could call itself if it fails too, but that feels like a fast route to a... cough stack overflow.
I'd also like to steer clear of synchronous methods so that this scales nicely, but being relatively new to Node.js, all the callbacks are starting to turn into a maze.
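For clarity, here is roughly the retry loop I have in mind (a rough, untested sketch; MAX_ATTEMPTS and RETRY_DELAY_MS are made-up constants, and it assumes the promise-based fs.promises API from Node 10+). Since each retry is scheduled through a timer rather than a direct recursive call, the call stack shouldn't actually grow:

const fsp = require('fs').promises;

const MAX_ATTEMPTS = 500;    // made-up retry limit
const RETRY_DELAY_MS = 10;   // made-up delay between attempts

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function readAndDelete(file) {
    for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
        try {
            const contents = await fsp.readFile(file);
            const json_data = JSON.parse(contents.toString());
            await fsp.unlink(file);
            return json_data;
        } catch (error) {
            // EBUSY, or a parse error because the file is mid-write: wait briefly and retry.
            await sleep(RETRY_DELAY_MS);
        }
    }
    throw new Error('File never became readable: ' + file);
}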
It may be overkill for this case, but you can create your own filesystem with full control. Other programs would then write their data directly to your program. Search for FUSE and fuse-binding.
So I am trying to read a PDF file and send its buffer as an attachment to an email. The strange thing is that I have never had a problem with fs.readFile before; here, the callback just never fires. I have checked the PDF to see whether I can open it and whether anything is corrupt, but it seems fine.
const fs = require('fs');
const destination = './temp/somthing.pdf';

function encodeToBase64(destination, callback) {
    return fs.readFile(destination, function (err, data) {
        if (err) {
            return callback(err);
        }
        return callback(null, new Buffer(data).toString('base64'));
    });
}
I use VSCode and have added breakpoints on all the returns. The Node debugger reaches the first return (the readFile call), but once I step to the next line my CPU usage spikes and the VSCode Node debugger reports that node is unresponsive.
I am at a total loss as to what is going on. I have tried multiple PDF files as well, but to no avail.
EDIT:
I do not know if this will help but I am on Node v6.9.3
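As an aside, fs.readFile already hands the callback a Buffer, so the extra new Buffer(data) copy is not strictly needed; a minimal equivalent (not a fix for the hang, just a simplification) would be:

function encodeToBase64(destination, callback) {
    fs.readFile(destination, function (err, data) {
        if (err) {
            return callback(err);
        }
        // data is already a Buffer, so it can be base64-encoded directly.
        callback(null, data.toString('base64'));
    });
}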
I'm parsing a fairly large dataset from MongoDB (of about 40,000 documents, each with a decent amount of data inside).
The stream is being accessed like so:
var cursor = db.domains.find({ html: { $exists: true } });

cursor.on('data', function(rec) {
    i++;
    var url = rec.domain;
    var $ = cheerio.load(rec.html);
    checkList($, rec, url, i);
    // This "checkList" function parses HTML data with Cheerio to find different elements on the page. Lots of if/else statements.
});

cursor.on('end', function() {
    console.log("Streamed all objects!");
});
Each record gets parsed with Cheerio (the record contains HTML data from a page scraped earlier), then I process the Cheerio data to look for various selectors, and the results are saved back to MongoDB.
For the first ~2,000 objects the data is parsed quite quickly (in ~30 seconds). After that it becomes far slower, around 50 records being parsed per second.
Looking at my MacBook Air's Activity Monitor, I can see it's not using a crazy amount of memory (226.5 MB of 8 GB RAM), but it is using a whole lot of CPU (io.js is taking up 99% of my CPU).
Is this a possible memory leak? The checkList function isn't particularly intensive (or at least, as far as I can tell - there are quite a few nested if/else statements, but not much else).
Am I meant to be clearing my variables after they're used, like setting $ = '' or similar? Is there any other reason Node would be using so much CPU?
You basically need to "pause" the stream, or otherwise "throttle" it, so that it does not execute on every data item received straight away. The code in the "data" event does not wait for completion before the next event is fired, unless you stop the events from emitting.
var cursor = db.domains.find({ html: { $exists: true } });

cursor.on('data', function(rec) {
    cursor.pause(); // stop processing new events

    i++;
    var url = rec.domain;
    var $ = cheerio.load(rec.html);
    checkList($, rec, url, i);

    // if checkList() is synchronous then resume here
    cursor.resume(); // start events again
});

cursor.on('end', function() {
    console.log("Streamed all objects!");
});
If checkList() contains async methods, then pass in the cursor
checkList($, rec, url, i,cursor);
And process the "resume" inside:
function checkList(data, rec, url, i, cursor) {
    somethingAsync(args, function(err, result) {
        // We're done
        cursor.resume(); // start events again
    });
}
The "pause" stops the events emitting from the stream until the "resume" is called. This means your operations don't "stack up" in memory and wait for each to complete.
You probably want more advanced flow control for some parallel processing, but this is basically how you do it with streams.
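As an illustration of that, a rough sketch of limited parallelism using the same pause/resume calls; CONCURRENCY is a made-up constant and checkListAsync() is a hypothetical version of checkList() that calls back when a record is fully processed:

var CONCURRENCY = 5;   // made-up limit on how many records are processed at once
var inFlight = 0;
var ended = false;

var cursor = db.domains.find({ html: { $exists: true } });

cursor.on('data', function(rec) {
    inFlight++;
    if (inFlight >= CONCURRENCY) {
        cursor.pause(); // enough records in progress; stop new events for now
    }

    checkListAsync(rec, function(err) {
        // (error handling omitted for brevity)
        inFlight--;
        if (inFlight < CONCURRENCY) {
            cursor.resume(); // room again; let more events through
        }
        if (ended && inFlight === 0) {
            console.log("Processed all objects!");
        }
    });
});

cursor.on('end', function() {
    ended = true;
});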
I have a bit of JavaScript I want to run in a web worker, and I am having a hard time understanding the correct approach to getting the two scripts to work in lock-step. I invoke the worker from the main script as in the following simplified script:
// main.js
var werker = new Worker("webWorkerScaffold.js");

// #1
werker.onmessage = function(msgObj){
    console.log("Worker Reply");
    console.log(msgObj);
    doSomethingWithMsg(msgObj);
};

werker.onerror = function(err){
    console.log("Worker Error:");
    console.log(err);
};

// Send a JSON message to match what the worker expects.
werker.postMessage({ msg: "begin" });
Then the complementary worker script looks like the following:
// webWorkerScaffold.js
var doWorkerStuffs = function(msg){}; // Omitted

// #2
onmessage = function (msgObj){
    // Messages in will always be JSON
    if (msgObj.data.msg === "begin")
        doWorkerStuffs();
};
This code (the actual version) works as expected, but I am having a difficult time confirming it will always perform correctly. Consider the following:
The "new Worker()" call is made, spawning a new thread.
The spawned thread is slow to load (let's say it hangs at "// #2")
The parent thread does "werker.postMessage..." with no recipient
... ?
The same applies in the reverse direction: I might change the worker script to send a message outward once it is set up internally. In that scenario the main thread could still be at "// #1" and miss the incoming message because it doesn't have its comms up yet.
Is there some way to guarantee that these scripts move forward in a lock-step way?
What I am really looking for is a zmq-like REP/REQ semantic, where one or the other blocks (or calls back) when 1:1 transactions can take place.
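For reference, a rough sketch of the kind of handshake I mean (just an illustration, not necessarily the right approach): the worker registers its handler and then announces that it is ready, and the main thread only sends "begin" once it has seen that announcement.

// main.js -- only post "begin" once the worker has said it is ready
var werker = new Worker("webWorkerScaffold.js");

werker.onmessage = function(msgObj){
    if (msgObj.data.msg === "ready") {
        // The worker has its handlers registered; safe to start the real work.
        werker.postMessage({ msg: "begin" });
        return;
    }
    doSomethingWithMsg(msgObj);
};

// webWorkerScaffold.js -- register the handler first, then announce readiness
onmessage = function (msgObj){
    if (msgObj.data.msg === "begin")
        doWorkerStuffs();
};

postMessage({ msg: "ready" });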