Using promises with streams in node.js - javascript

I've refactored a simple utility to use promises. It fetches a pdf from the web and saves it to disk. It should then open the file in a pdf viewer once saved to disk. The file appears on disk and is valid, the shell command opens the OSX Preview application, but a dialog pops up complaining that the file is empty.
What's the best way to execute the shell function once the filestream has been written to disk?
// download a pdf and save to disk
// open pdf in osx preview for example
download_pdf()
  .then(function(path) {
    shell.exec('open ' + path);
  });
function download_pdf() {
  const path = '/local/some.pdf';
  const url = 'http://somewebsite/some.pdf';
  const stream = request(url);
  const write = stream.pipe(fs.createWriteStream(path));
  return streamToPromise(stream);
}
function streamToPromise(stream) {
  return new Promise(function(resolve, reject) {
    // resolve with location of saved file
    stream.on("end", resolve(stream.dests[0].path));
    stream.on("error", reject);
  });
}

In this line
stream.on("end", resolve(stream.dests[0].path));
you are executing resolve immediately, and the result of calling resolve (which will be undefined, because that's what resolve returns) is used as the argument to stream.on - not what you want at all, right.
.on's second argument needs to be a function, rather than the result of calling a function
Therefore, the code needs to be
stream.on("end", () => resolve(stream.dests[0].path));
or, if you're old school:
stream.on("end", function () { resolve(stream.dests[0].path); });
another old school way would be something like
stream.on("end", resolve.bind(null, stream.dests[0].path));
No, don't do that :p see comments
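With that fix in place the promise resolves, but it still fires on the read stream's 'end', which can race the final flush to disk. A minimal sketch of an alternative (my assumption on top of this answer, reusing the question's stream.dests[0] from the request library to reach the piped write stream):
function streamToPromise(stream) {
  return new Promise(function (resolve, reject) {
    const write = stream.dests[0]; // the fs write stream that download_pdf piped into
    write.on('finish', function () { resolve(write.path); }); // all bytes flushed to disk
    write.on('error', reject);
    stream.on('error', reject);
  });
}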

After a bunch of tries I found a solution which works fine all the time. See JSDoc comments for more info.
/**
 * Streams input to output and resolves only after stream has successfully ended.
 * Closes the output stream in success and error cases.
 * @param input {stream.Readable} Read from
 * @param output {stream.Writable} Write to
 * @return Promise Resolves only after the output stream is "end"ed or "finish"ed.
 */
function promisifiedPipe(input, output) {
  let ended = false;
  function end() {
    if (!ended) {
      ended = true;
      output.close && output.close();
      input.close && input.close();
      return true;
    }
  }
  return new Promise((resolve, reject) => {
    input.pipe(output);
    input.on('error', errorEnding);
    function niceEnding() {
      if (end()) resolve();
    }
    function errorEnding(error) {
      if (end()) reject(error);
    }
    output.on('finish', niceEnding);
    output.on('end', niceEnding);
    output.on('error', errorEnding);
  });
}
Usage example:
function downloadFile(req, res, next) {
  promisifiedPipe(fs.createReadStream(req.params.file), res).catch(next);
}
Update. I've published the above function as a Node module: http://npm.im/promisified-pipe
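Applied to the question's download, a rough sketch (assuming the same request, fs and shell objects as in the question):
function download_pdf() {
  const path = '/local/some.pdf';
  const url = 'http://somewebsite/some.pdf';
  return promisifiedPipe(request(url), fs.createWriteStream(path))
    .then(() => path); // resolve with the location of the saved file
}

download_pdf().then(function (path) {
  shell.exec('open ' + path);
});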

In the latest nodejs, specifically, stream v3, you could do this:
const fs = require('fs');
const util = require('util');
const stream = require('stream');

const finished = util.promisify(stream.finished);

const rs = fs.createReadStream('archive.tar');

async function run() {
  await finished(rs);
  console.log('Stream is done reading.');
}

run().catch(console.error);
rs.resume(); // Drain the stream.
https://nodejs.org/api/stream.html#stream_event_finish
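The same promisified finished works on writable streams too, so applied to the question's download (request and fs as in the question) a minimal sketch would be:
async function download_pdf() {
  const path = '/local/some.pdf';
  const write = request('http://somewebsite/some.pdf').pipe(fs.createWriteStream(path));
  await finished(write); // resolves once the write stream has emitted 'finish'
  return path;
}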

The other solution can look like this:
const { Writable } = require('stream')

const streamAsPromise = (readable) => {
  const result = []
  const w = new Writable({
    write(chunk, encoding, callback) {
      result.push(chunk)
      callback()
    }
  })
  readable.pipe(w)
  return new Promise((resolve, reject) => {
    w.on('finish', resolve)
    w.on('error', reject)
  }).then(() => result.join(''))
}
and you can use it like:
streamAsPromise(fs.createReadStream('secrets')).then((res) => console.log(res))
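Note that result.join('') turns the chunks into a string, which is fine for text. For binary data such as the question's PDF, joining as strings can corrupt the bytes; a small variation (my assumption, not part of the original answer) collects Buffers instead:
// assumes const { Writable } = require('stream') as above
const streamAsBufferPromise = (readable) => {
  const chunks = []
  const w = new Writable({
    write(chunk, encoding, callback) {
      chunks.push(chunk) // keep the raw Buffer chunks
      callback()
    }
  })
  readable.pipe(w)
  return new Promise((resolve, reject) => {
    w.on('finish', () => resolve(Buffer.concat(chunks)))
    w.on('error', reject)
  })
}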

This can be done very nicely using the promisified pipeline function. Pipeline also provides extra functionality, such as cleaning up the streams.
const pipeline = require('util').promisify(require('stream').pipeline)

pipeline(
  request('http://somewebsite/some.pdf'),
  fs.createWriteStream('/local/some.pdf')
).then(() =>
  shell.exec('open /local/some.pdf')
);
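On Node 15+ the promisified version is built in via stream/promises, so the util.promisify step isn't needed. A minimal sketch under that assumption (request and shell as in the question):
const { pipeline } = require('stream/promises');
const fs = require('fs');

async function downloadAndOpen() {
  await pipeline(
    request('http://somewebsite/some.pdf'),
    fs.createWriteStream('/local/some.pdf')
  );
  shell.exec('open /local/some.pdf');
}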

Related

Node.js htmlparser2 writableStream still emit events after end() call

Sorry for the probably trivial question, but I still fail to get how streams work in node.js.
I want to parse an HTML file and get the path of the first script I encounter. I'd like to interrupt the parsing after the first match, but the onopentag() listener is still invoked until the effective end of the HTML file. Why?
const { WritableStream } = require("htmlparser2/lib/WritableStream");
const scriptPath = await new Promise(function(resolve, reject) {
  try {
    const parser = new WritableStream({
      onopentag: (name, attrib) => {
        if (name === "script" && attrib.src) {
          console.log(`script : ${attrib.src}`);
          resolve(attrib.src); // return the first script, effectively called for each script tag
          // none of below calls seem to work
          indexStream.unpipe(parser);
          parser.emit("close");
          parser.end();
          parser.destroy();
        }
      },
      onend() {
        resolve();
      }
    });
    const indexStream = got.stream("/index.html", {
      responseType: 'text',
      resolveBodyOnly: true
    });
    indexStream.pipe(parser); // and parse it
  } catch (e) {
    reject(e);
  }
});
Is it possible to close the parser stream before the effective end of indexStream, and if so, how?
If not, why not?
Note that the code works and my promise is effectively resolved with the first match.
There's a little confusion about how the WritableStream works. First off, when you do this:
const parser = new WritableStream(...)
that's misleading. It really should be this:
const writeStream = new WritableStream(...)
The actual HTML parser is an instance variable in the WritableStream object named ._parser (see code), and it's that parser that emits the onopentag() callbacks. Because it works off a buffer that may still hold accumulated text, disconnecting from the readstream may not immediately stop events that are still coming from the buffered data.
The parser itself has a public reset() method, and it appears that if you disconnect from the readstream and then call that reset method, it should stop emitting events.
You can try this (I'm not a TypeScript person so you may have to massage some things to make the TypeScript compiler happy, but hopefully you can see the concept here):
const { WritableStream } = require("htmlparser2/lib/WritableStream");
const scriptPath = await new Promise(function(resolve, reject) {
  try {
    const writeStream = new WritableStream({
      onopentag: (name, attrib) => {
        if (name === "script" && attrib.src) {
          console.log(`script : ${attrib.src}`);
          resolve(attrib.src); // return the first script, effectively called for each script tag
          // disconnect the readstream
          indexStream.unpipe(writeStream);
          // reset the internal parser so it clears any buffers it
          // may still be processing
          writeStream._parser.reset();
        }
      },
      onend() {
        resolve();
      }
    });
    const indexStream = got.stream("/index.html", {
      responseType: 'text',
      resolveBodyOnly: true
    });
    indexStream.pipe(writeStream); // and parse it
  } catch (e) {
    reject(e);
  }
});

What are ways to run a script only after another script has finished?

Let's say this is my code (just a sample I wrote up to show the idea):
var extract = require("./postextract.js");
var rescore = require("./standardaddress.js");

RunFunc();

function RunFunc() {
  extract.Start();
  console.log("Extraction complete");
  rescore.Start();
  console.log("Scoring complete");
}
I don't want rescore.Start() to run until the entire extract.Start() has finished. Both scripts contain a spiderweb of functions, so putting a callback directly into the Start() function doesn't appear viable, as the final function won't return it, and I'm having a lot of trouble understanding how to use promises. What are ways I can make this work?
This is what extract.Start() begins and ends with. OpenWriter() is reached through multiple other functions and streams, with the actual fileWrite.write() living in another script attached to this one (although that isn't needed to detect the end of the run). Currently, fileWrite.on('finish') is where I want the script to be considered done:
module.exports = {
  Start: function CodeFileRead() {
    //this.country = countryIn;
    //Read stream of the address components
    fs.createReadStream("Reference\\" + postValid.country + " ADDRESS REF DATA.csv")
      //Change separator based on file
      .pipe(csv({escape: null, headers: false, separator: delim}))
      //Indicate start of reading
      .on('resume', (data) => console.log("Reading complete postal code file..."))
      //Processes lines of data into storage array for comparison
      .on('data', (data) => {
        postValid.addProper[data[1]] = JSON.stringify(Object.values(data)).replace(/"/g, '').split(',').join('*');
      })
      //End of reading file
      .on('end', () => {
        postValid.complete = true;
        console.log("Done reading");
        //Launch main script, delayed to here in order to not read ahead of this stream
        ThisFunc();
      });
  },
  extractDone
}

function OpenWriter() {
  //File stream for writing the processed chunks into a new file
  fileWrite = fs.createWriteStream("Processed\\" + fileName.split('.')[0] + "_processed." + fileName.split('.')[1]);
  fileWrite.on('open', () => console.log("File write is open"));
  fileWrite.on('finish', () => {
    console.log("File write is closed");
  });
}
EDIT: I do not want to simply add the next script onto the end of the previous one and forgo the master file, as I don't know how long it will be, and it's supposed to be designed to take additional scripts past our development period. I cannot just use a package as it stands, because approval time in the company takes up to two weeks and I need this sooner.
DOUBLE EDIT: This is all my code; every script and function is written by me, so I can make the scripts being called do what's needed.
You can just wrap your function in a Promise and return that.
module.exports = {
  Start: function CodeFileRead() {
    return new Promise((resolve, reject) => {
      fs.createReadStream(
        'Reference\\' + postValid.country + ' ADDRESS REF DATA.csv'
      )
        // .......some code...
        .on('end', () => {
          postValid.complete = true;
          console.log('Done reading');
          resolve('success');
        });
    });
  }
};
And run RunFunc like this:
async function RunFunc() {
await extract.Start();
console.log("Extraction complete");
await rescore.Start();
console.log("Scoring complete");
}
//or IIFE
RunFunc().then(()=>{
console.log("All Complete");
})
Note: you can/should also handle errors by calling reject("some error") when an error occurs.
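For example, one way to wire that up (a sketch extending the answer's code, nothing more):
Start: function CodeFileRead() {
  return new Promise((resolve, reject) => {
    fs.createReadStream(
      'Reference\\' + postValid.country + ' ADDRESS REF DATA.csv'
    )
      .on('error', (err) => reject(err)) // e.g. missing file or read failure
      // .......some code...
      .on('end', () => {
        postValid.complete = true;
        resolve('success');
      });
  });
}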
EDIT, after learning about ThisFunc():
Making a new event emitter will probably be the easiest solution:
eventEmitter.js
const EventEmitter = require('events').EventEmitter
module.exports = new EventEmitter()
const eventEmitter = require('./eventEmitter');

module.exports = {
  Start: function CodeFileRead() {
    return new Promise((resolve, reject) => {
      //after all of your code
      eventEmitter.once('WORK_DONE', () => {
        resolve("Done");
      })
    });
  }
};

function OpenWriter() {
  ...
  fileWrite.on('finish', () => {
    console.log("File write is closed");
    eventEmitter.emit("WORK_DONE");
  });
}
And run RunFunc as before.
There's no generic way to determine when everything a function call does has finished.
It might accept a callback. It might return a promise. It might not provide any kind of method to determine when it is done. It might have side effects that you could monitor by polling.
You need to read the documentation and/or source code for that particular function.
Use async/await (promises), example:
var extract = require("./postextract.js");
var rescore = require("./standardaddress.js");
RunFunc();
async function extract_start() {
  try {
    extract.Start()
  } catch(e) {
    console.log(e)
  }
}

async function rescore_start() {
  try {
    rescore.Start()
  } catch(e) {
    console.log(e)
  }
}

async function RunFunc() {
  await extract_start();
  console.log("Extraction complete");
  await rescore_start();
  console.log("Scoring complete");
}
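One caveat worth adding (my note, not part of the original answer): await only pauses if the awaited call returns a promise, so this version still relies on extract.Start() and rescore.Start() returning promises, e.g. wrapped as shown in the earlier answer. A quick illustrative check:
const maybePromise = extract.Start();
// if this logs false, `await extract_start()` will not actually wait for the work to finish
console.log(maybePromise instanceof Promise);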

Node.js - Read and Write thousands/millions of JSON files in a loop

I need to process a huge number of files in the most efficient and fast way possible.
Read -> Process -> Write (save to same place).
My problem is that my implementation is slow, at least I think so. It took me half a night or so to process 600000 files.
I did it synchronously on purpose. If this can be done better asynchronously, I'm open to solutions; I just didn't think that processing a lot of files that weigh only 1-3 kB would take that long.
The files contain simple JSON data, and each file is about 1-3 kB in size, like I already said.
Those files lie in separate folders each containing 300 files. I split them up in order to make it more efficient and usable.
So we have ~ 2000 folders each having 300 files (1-3kb size).
Take a look at my code and gimme your thoughts. Thanks!
function test() {
  /**
   * Get list of folders and loop through
   */
  const folderList = fs.readdirSync(`../db`)
  for (const folder of folderList) {
    /**
     * Get list of files for each folder and loop through
     */
    const fileList = fs.readdirSync(`../db/${ folder }`)
    for (const filePath of fileList) {
      /**
       * try/catch block to handle JSON.parse errors
       */
      try {
        /**
         * Read file
         */
        const file = JSON.parse(fs.readFileSync(`../db/${ folder }/${ filePath }`))
        /**
         * Process file
         */
        processFile(file)
        /**
         * Write file
         */
        fs.writeFileSync(`../db/${ folder }/${ filePath }`, JSON.stringify(file), 'utf8')
      } catch (err) {
        console.log(err)
      }
    }
  }
}
I expect this to run rather quickly; in reality it takes a while.
So I've come up with this solution as a test; can you check it out and let me know if it's a good implementation? This took about 10-15 minutes to process 600k files instead of hours. Each folder has 300 files in it, so we always wait for 300 promises to complete. I do it this way because the files are small (1-3 kB, one object, nothing fancy). Could this be done better; could it be done in a minute, for example? :)
async function test() {
  const folderList = fs.readdirSync(`../db`)
  for (const folder of folderList) {
    console.log(folder)
    const fileList = fs.readdirSync(`../db/${ folder }`)
    let promises = []
    for (const fileName of fileList) {
      promises.push(processFile(folder, fileName))
    }
    await Promise.all(promises)
  }
}
async function processFile(folder, fileName) {
  const path = `../db/${ folder }/${ fileName }`
  const file = await readFile(path)
  if (file) {
    //do something and write
    await writeFile(path, file)
  }
}
function readFile(path) {
  return new Promise(function (resolve) {
    fs.readFile(path, function (err, raw) {
      if (err) {
        console.log(err)
        resolve()
        return
      }
      try {
        const file = JSON.parse(raw)
        resolve(file)
      } catch (err) {
        console.log(err)
        resolve()
      }
    })
  })
}

function writeFile(path, object) {
  return new Promise(function (resolve) {
    fs.writeFile(path, JSON.stringify(object), function (err) {
      if (err)
        console.log(err)
      resolve()
    })
  })
}
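As a side note, these two wrappers can also be replaced with the built-in promise API; a minimal sketch, assuming Node 10+ and the same folder layout as above:
const fsp = require('fs').promises

async function processFile(folder, fileName) {
  const path = `../db/${ folder }/${ fileName }`
  try {
    const file = JSON.parse(await fsp.readFile(path))
    // do something with file, then write it back
    await fsp.writeFile(path, JSON.stringify(file))
  } catch (err) {
    console.log(err)
  }
}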
So, after playing around with some things, I've come to something like this:
const PromisePool = require('es6-promise-pool')
const list = require('./list.json')

let n = 0
let pool = new PromisePool(promiseProducer, 11)

pool.start()
  .then(function () {
    console.log('Complete')
  })

function promiseProducer() {
  console.log(n)
  if (n < list.length)
    return processFile(list[n++])
  else
    return null
}
This ran rather fast, though I still have some questions:
Can anyone show their own implementation of a concurrency limit, without libraries?
As before, if I run the script and wait for, say, 20k files to be processed, then stop the script and rerun it, it gets back to 20k (where we stopped) really quickly and then slows down. What is the reason?
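For the first question, a minimal hand-rolled concurrency limit (a sketch only; processFile and list are the names from the code above) could look like this:
async function runWithLimit(items, limit, worker) {
  let index = 0
  // start `limit` workers that each keep pulling the next item until the list is empty
  const runners = Array.from({ length: limit }, async function run() {
    while (index < items.length) {
      const item = items[index++]
      await worker(item)
    }
  })
  await Promise.all(runners)
}

// usage, mirroring the promise-pool example above
runWithLimit(list, 11, processFile).then(() => console.log('Complete'))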

Recursively call promises

I've been scouring the web over this one for quite some time now.
I'm prototyping an Angular service for an Ionic app. The purpose of this service is to download an image. Now this is a problem that, in standard JS, I'd like to solve with some recursive calls to avoid duplicate code.
I've tried writing it using promises to get my feet wet with the concept of Promises and it's giving me a hard time.
Consider the following code:
public getBgForName = (name: string) => {
  name = name.toLowerCase();
  var instance = this;
  var dir = this.file.dataDirectory;
  return new Promise(function (fulfill, reject) {
    instance.file.checkDir(dir, name).then(() => {
      // directory exists. Is there a bg file?
      dir = dir + '/' + name + '/';
      instance.file.checkFile(dir, 'bg.jpg').then(() => {
        console.log('read file');
        fulfill(dir + '/' + 'bg.jpg')
      }, (err) => {
        // dl file and re-call
        console.log('needs to download file!')
        instance.transfer.create().download(encodeURI('https://host.tld/'+name+'/bg.jpg'), dir + 'bg.jpg', true, {})
          .then((data) => {
            return instance.getBgForName(name).then((url) => {return url});
          }, (err) => {
            console.log(err)
          })
      })
    }, (err) => {
      // create dir and re-call
      instance.file.createDir(dir, name, true).then(() => {
        instance.getBgForName(name).then((url) => {fulfill(url)});
      })
    })
  });
}
The promise, when called, never quite fully resolves. After reading this article, I think the problem lies in the promise resolution not being passed correctly to the "original" promise chain, so that it resolves at some level, but not all the way to the top. This is supported by the promise resolving correctly when the following is assured:
the directory has already been created
the file has already been downloaded
so I reckon the return statements somehow break up the link here, leading to the promise not being resolved after its first recursive call.
What is the correct way to call a promise recursively, ensuring that the original caller receives the result when it is ready?
Edit: Outlining the desired result, as suggested by David B.
The code is supposed to be a function that is called on a list of items. For each item, there is a background image available, which is stored on a server. This background image will be cached locally. The goal of using recursive calls here is that no matter the state (downloaded, not downloaded), the function call will always return a URL to the image on the local filesystem. The steps for this are as follows:
create a directory for the current item
download the file to this directory
return a local URL to the downloaded file
subsequent calls thereafter will only return the image straight from disk (after checking that it exists), with no more downloading.
After reading about the benefits of async / await over promises (and falling in love with the cleaner syntax) I rewrote it using async / await. The refactored (but not perfect!) code looks like this:
public getBgForName = async (name: string) => {
  name = name.toLowerCase();
  let instance = this;
  let dir = this.file.dataDirectory;
  try {
    await instance.file.checkDir(dir, name)
    dir = dir + name + '/';
    try {
      await instance.file.checkFile(dir, 'bg.jpg')
      return dir + 'bg.jpg';
    } catch(err) {
      // download file
      await instance.transfer.create().download(encodeURI('https://host.tld/'+name+'/bg.jpg'), dir + 'bg.jpg', true, {})
      return this.getBgForName(name);
    }
  } catch(err) {
    // not catching the error here since if we can't write to the app's local storage something is very off anyway.
    await instance.file.createDir(dir, name, true)
    return this.getBgForName(name);
  }
}
and works as intended.
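Usage then stays straightforward, for example (an illustrative call with a made-up item name):
// somewhere in an async method of the same service
const url = await this.getBgForName('someItem');
console.log('background image available at', url);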

await for function with callback

I'm playing with streams and async/await functionality. What I have so far is:
let logRecord = ((record, callback) => {
  console.log(record);
  return callback();
});

let importCSVfromPath = async((csv_path) => {
  return new Promise(function(resolve, reject) {
    var parser = parse();
    var input = fs.createReadStream(csv_path);
    var transformer = transform(logRecord, {parallel: 1});
    input.on('error', (err) => {
      reject(err);
    });
    input.on('finish', () => {
      resolve();
    });
    input.pipe(parser).pipe(transformer);
  });
});
Now I want to replace logRecord with importRecord. The problem is that this function has to use functions that are already part of the async stack.
let importRecord = async( (record) => {
.......
await(insertRow(row));
});
What's the right way to do this?
It's slightly more complicated than that - node.js streams are not (at least not yet) adapted to the ES7 async/await methods.
If you'd like to develop this on your own, consider writing a class derived from Readable stream. Implementing a promise-based interface is quite a task, but it is possible.
If, however, you're fine with using a permissively licensed framework, take a look at Scramjet. With it your code will look like this (most of the example is parsing the CSV - I'll add a helper in the next version):
fs.createReadStream("file.csv")         // open your file
  .pipe(new StringStream())             // pass to scramjet
  .split("\n")                          // split by line
  .parse((line) => line.split(","))     // convert lines to arrays
  .map(async (line) => {                // run asynchronous mapping
    await importRecord(line);           // import log to DB
    return logRecord(line);             // return some log for the output
  })
  .pipe(process.stdout);                // pipe the output wherever you like
I believe it's exactly what you're looking for and it will run your record imports in parallel, while keeping the output order.
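If you'd rather stay with plain Node, newer versions (10+) also let you consume a stream with for await...of, which gives you async/await per record without extra libraries. A rough sketch, assuming the parse() and importRecord functions from the question:
const fs = require('fs');

async function importCSVfromPath(csv_path) {
  const parser = fs.createReadStream(csv_path).pipe(parse()); // parse() as in the question (csv-parse)
  for await (const record of parser) { // readable streams are async iterable
    await importRecord(record);        // processes one record at a time, in order
  }
}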
