Node.js htmlparser2 WritableStream still emits events after end() call - javascript

Sorry for the probably trivial question, but I still fail to get how streams work in Node.js.
I want to parse an HTML file and get the path of the first script I encounter. I'd like to interrupt the parsing after the first match, but the onopentag() listener is still invoked until the effective end of the HTML file. Why?
const { WritableStream } = require("htmlparser2/lib/WritableStream");

const scriptPath = await new Promise(function (resolve, reject) {
  try {
    const parser = new WritableStream({
      onopentag: (name, attrib) => {
        if (name === "script" && attrib.src) {
          console.log(`script : ${attrib.src}`);
          resolve(attrib.src); // return the first script, effectively called for each script tag
          // none of the calls below seem to work
          indexStream.unpipe(parser);
          parser.emit("close");
          parser.end();
          parser.destroy();
        }
      },
      onend() {
        resolve();
      }
    });
    const indexStream = got.stream("/index.html", {
      responseType: 'text',
      resolveBodyOnly: true
    });
    indexStream.pipe(parser); // and parse it
  } catch (e) {
    reject(e);
  }
});
Is it possible to close the parser stream before the effective end of indexStream, and if so, how?
If not, why?
Note that the code works and my promise is effectively resolved using the first match.

There's a little confusion on how the WritableStream works. First off, when you do this:
const parser = new WritableStream(...)
that's misleading. It really should be this:
const writeStream = new WritableStream(...)
The actual HTML parser is an instance variable in the WritableStream object named ._parser (see code). It's that parser that is emitting the onopentag() callbacks, and because it's working off a buffer that may have some accumulated text, disconnecting from the readstream may not immediately stop events that are still coming from the buffered data.
The parser itself has a public reset() method, and it appears that if you disconnect from the readstream and then call that reset() method, it should stop emitting events.
You can try this (I'm not a TypeScript person so you may have to massage some things to make the TypeScript compiler happy, but hopefully you can see the concept here):
const { WritableStream } = require("htmlparser2/lib/WritableStream");

const scriptPath = await new Promise(function (resolve, reject) {
  try {
    const writeStream = new WritableStream({
      onopentag: (name, attrib) => {
        if (name === "script" && attrib.src) {
          console.log(`script : ${attrib.src}`);
          resolve(attrib.src); // return the first script, effectively called for each script tag
          // disconnect the readstream
          indexStream.unpipe(writeStream);
          // reset the internal parser so it clears any buffers it
          // may still be processing
          writeStream._parser.reset();
        }
      },
      onend() {
        resolve();
      }
    });
    const indexStream = got.stream("/index.html", {
      responseType: 'text',
      resolveBodyOnly: true
    });
    indexStream.pipe(writeStream); // and parse it
  } catch (e) {
    reject(e);
  }
});

Related

ANTLR4 how to wait for processing all rules?

I have created my own Xtext-based DSL and a vscode-based editor with the language server protocol. I parse the model from the current TextDocument with antlr4ts. Below is the code snippet for the listener:
class TreeShapeListener implements DebugInternalModelListener {
  public async enterRuleElem1(ctx: RuleElem1Context): Promise<void> {
    ...
    // by the time the response is received, 'walk' function returns
    var resp = await this.client.sendRequest(vscode_languageserver_protocol_1.DefinitionRequest.type,
        this.client.code2ProtocolConverter.asTextDocumentPositionParams(this.document, position))
      .then(this.client.protocol2CodeConverter.asDefinitionResult, (error) => {
        return this.client.handleFailedRequest(vscode_languageserver_protocol_1.DefinitionRequest.type, error, null);
      });
    ...
    this.model.addElem1(elem1);
  }
  public async enterRuleElem2(ctx: RuleElem2Context): Promise<void> {
    ...
    this.model.addElem2(elem2);
  }
}
and here I create the parser and the tree walker.
// Create the lexer and parser
let inputStream = antlr4ts.CharStreams.fromString(document.getText());
let lexer = new DebugInternaModelLexer(inputStream);
let tokenStream = new antlr4ts.CommonTokenStream(lexer);
let parser = new DebugInternalModelParser(tokenStream);
parser.buildParseTree = true;
let tree = parser.ruleModel();
let model = new Model();
ParseTreeWalker.DEFAULT.walk(new TreeShapeListener(model, client, document) as ParseTreeListener, tree);
console.log(model);
The problem is that while processing one of the rules (enterRuleElem1), I have an async call (client.sendRequest) which only returns after ParseTreeWalker.DEFAULT.walk has already returned. How can I make walk wait till all the rules are completed?
Edit 1: Not sure if this is how the walk function works, but I tried to recreate the above scenario with the minimal code below:
function setTimeoutPromise(delay) {
  return new Promise((resolve, reject) => {
    if (delay < 0) return reject("Delay must be greater than 0")
    setTimeout(() => {
      resolve(`You waited ${delay} milliseconds`)
    }, delay)
  })
}

async function enterRuleBlah() {
  let resp = await setTimeoutPromise(2500);
  console.log(resp);
}

function enterRuleBlub() {
  console.log('entered blub');
}

function walk() {
  enterRuleBlah();
  enterRuleBlub();
}

walk();
console.log('finished parsing');
and the output is
entered blub
finished parsing
You waited 2500 milliseconds
Edit 2: I tried the suggestion from the answer and now it works! My solution looks like this:
public async doStuff() {
  ...
  return new Promise((resolve) => {
    resolve(0);
  })
}

let listener = new TreeShapeListener(model, client, document);
ParseTreeWalker.DEFAULT.walk(listener as ParseTreeListener, tree);
await listener.doStuff();
The tree walk is entirely synchronous, regardless of whether you make your listener/visitor rules async or not. Better to separate the requests from the walk: the walk should only collect the information needed to know what to send, and afterwards you process this collection and actually send the requests, which you can then await.
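For example, a minimal sketch of that collect-then-send pattern, reusing names from the question (client.sendRequest, ParseTreeWalker, tree); treat the details as assumptions:

// Sketch only: collect during the synchronous walk, send afterwards.
class CollectingListener {
  constructor() {
    this.pending = []; // everything needed to build the requests later
  }
  enterRuleElem1(ctx) {
    // no await here: just record what the request will need
    this.pending.push(ctx);
  }
}

const collector = new CollectingListener();
ParseTreeWalker.DEFAULT.walk(collector, tree); // fully synchronous
// now build and send all requests, and wait for every one of them
const results = await Promise.all(
  collector.pending.map((ctx) => client.sendRequest(/* params built from ctx */))
);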

What are ways to run a script only after another script has finished?

Let's say this is my code (just a sample I wrote up to show the idea):
var extract = require("./postextract.js");
var rescore = require("./standardaddress.js");

RunFunc();

function RunFunc() {
  extract.Start();
  console.log("Extraction complete");
  rescore.Start();
  console.log("Scoring complete");
}
And I want to not let rescore.Start() run until the entire extract.Start() has finished. Both scripts contain a spiderweb of functions inside of them, so putting a callback directly into the Start() function doesn't appear viable, as the final function won't return it, and I am having a lot of trouble understanding how to use Promises. What are ways I can make this work?
These are the scripts that extract.Start() begins and ends with. OpenWriter() is reached through multiple other functions and streams, with the actual fileWrite.write() being in another script that's attached to this (although not needed to detect the end of the run). Currently, fileWrite.on('finish') is where I want the script to be determined as done:
module.exports = {
  Start: function CodeFileRead() {
    //this.country = countryIn;
    //Read stream of the address components
    fs.createReadStream("Reference\\" + postValid.country + " ADDRESS REF DATA.csv")
      //Change separator based on file
      .pipe(csv({ escape: null, headers: false, separator: delim }))
      //Indicate start of reading
      .on('resume', (data) => console.log("Reading complete postal code file..."))
      //Processes lines of data into storage array for comparison
      .on('data', (data) => {
        postValid.addProper[data[1]] = JSON.stringify(Object.values(data)).replace(/"/g, '').split(',').join('*');
      })
      //End of reading file
      .on('end', () => {
        postValid.complete = true;
        console.log("Done reading");
        //Launch main script, delayed to here in order to not read ahead of this stream
        ThisFunc();
      });
  },
  extractDone
}

function OpenWriter() {
  //File stream for writing the processed chunks into a new file
  fileWrite = fs.createWriteStream("Processed\\" + fileName.split('.')[0] + "_processed." + fileName.split('.')[1]);
  fileWrite.on('open', () => console.log("File write is open"));
  fileWrite.on('finish', () => {
    console.log("File write is closed");
  });
}
EDIT: I do not want to simply add the next script onto the end of the previous one and forego the master file, as I don't know how long it will be and it's supposed to be designed to be capable of taking additional scripts past our development period. I cannot just use a package as it stands, because approval time in the company takes up to two weeks and I need this more immediately.
DOUBLE EDIT: This is all my code; every script and function is written by me, so I can make the scripts being called do what's needed.
You can just wrap your function in a Promise and return that.
module.exports = {
  Start: function CodeFileRead() {
    return new Promise((resolve, reject) => {
      fs.createReadStream(
        'Reference\\' + postValid.country + ' ADDRESS REF DATA.csv'
      )
        // .......some code...
        .on('end', () => {
          postValid.complete = true;
          console.log('Done reading');
          resolve('success');
        });
    });
  }
};
And run RunFunc like this:
async function RunFunc() {
  await extract.Start();
  console.log("Extraction complete");
  await rescore.Start();
  console.log("Scoring complete");
}
//or IIFE
RunFunc().then(() => {
  console.log("All Complete");
})
Note: You can/should also handle errors by calling reject("some error") when an error occurs.
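For instance, a sketch of where that reject could live in the wrapper above, forwarding the read stream's 'error' event to the caller:

module.exports = {
  Start: function CodeFileRead() {
    return new Promise((resolve, reject) => {
      fs.createReadStream('Reference\\' + postValid.country + ' ADDRESS REF DATA.csv')
        .on('error', (err) => reject(err)) // a stream failure rejects the promise
        // .......some code...
        .on('end', () => resolve('success'));
    });
  }
};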
EDIT: After knowing about ThisFunc():
Making a new event emitter will probably be the easiest solution:
eventEmitter.js

const EventEmitter = require('events').EventEmitter
module.exports = new EventEmitter()

const eventEmitter = require('./eventEmitter');

module.exports = {
  Start: function CodeFileRead() {
    return new Promise((resolve, reject) => {
      //after all of your code
      eventEmitter.once('WORK_DONE', () => {
        resolve("Done");
      })
    });
  }
};

function OpenWriter() {
  ...
  fileWrite.on('finish', () => {
    console.log("File write is closed");
    eventEmitter.emit("WORK_DONE");
  });
}
And run RunFunc as before.
There's no generic way to determine when everything a function call does has finished.
It might accept a callback. It might return a promise. It might not provide any kind of method to determine when it is done. It might have side effects that you could monitor by polling.
You need to read the documentation and/or source code for that particular function.
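As a generic illustration of the first two cases (someTask, input, and handleError are hypothetical names; util.promisify is Node's standard helper for Node-style callback functions):

const util = require('util');

// Case 1: the function signals completion through a Node-style callback.
someTask(input, (err, result) => {
  if (err) return handleError(err);
  // ...only here is the work actually finished
});

// Case 2: the same function converted to a promise, so it can be awaited
// inside an async function.
const someTaskAsync = util.promisify(someTask);
const result = await someTaskAsync(input);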
Use async/await (promises), for example:
var extract = require("./postextract.js");
var rescore = require("./standardaddress.js");

RunFunc();

async function extract_start() {
  try {
    extract.Start()
  }
  catch (e) {
    console.log(e)
  }
}

async function rescore_start() {
  try {
    rescore.Start()
  }
  catch (e) {
    console.log(e)
  }
}

async function RunFunc() {
  await extract_start();
  console.log("Extraction complete");
  await rescore_start();
  console.log("Scoring complete");
}

How to avoid an infinite loop in JavaScript

I have a Selenium webdriverIO v5 framework. The issue I am facing is that the code below works fine on Mac OS, but it does not work correctly on Windows, where it gets stuck in an infinite loop.
The code's functionality is: merge YAML files (which contain locators) and return the value of a locator by passing its key:
const glob = require('glob');
const yamlMerge = require('yaml-merge');
const sleep = require('system-sleep');

let xpath;

class Page {
  getElements(elementId) {
    function objectCollector() {
      glob('tests/wdio/locators/*.yml', function (er, files) {
        if (er) throw er;
        xpath = yamlMerge.mergeFiles(files);
      });
      do {
        sleep(10);
      } while (xpath === undefined);
      return xpath;
    }
    objectCollector();
    return xpath[elementId];
  }
}

module.exports = new Page();
Since you are waiting on the results of a callback, I would recommend returning a new Promise from your getElements function and resolve() the value you receive inside the callback. Then when you call getElements, you will need to resolve that Promise or use the await notation. The function will stop at that point and wait until the Promise resolves, but the event loop will still continue. See some documentation for more information.
I'll write an example below of what your code might look like using a Promise, but when you call getElements, you will need to put the keyword await before it. If you want to avoid that, you could resolve the Promise from objectCollector while you're in getElements and remove the async keyword from its definition, but you really should not get in the way of asynchronous JavaScript. Also, you can probably shorten the code a bit because objectCollector looks like an unnecessary function in this example:
const glob = require('glob')
const yamlMerge = require('yaml-merge')

class Page {
  async getElements(elementId) {
    function objectCollector() {
      return new Promise((resolve, reject) => {
        glob('tests/wdio/locators/*.yml', function (er, files) {
          if (er) return reject(er)
          resolve(yamlMerge.mergeFiles(files))
        })
      })
    }
    let xpath = await objectCollector()
    return xpath[elementId]
  }
}

module.exports = new Page();
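Since objectCollector is an unnecessary wrapper here, a shortened sketch along the lines the answer suggests (same assumptions as above; the 'loginButton' key in the usage line is hypothetical) could be:

const glob = require('glob')
const yamlMerge = require('yaml-merge')

class Page {
  async getElements(elementId) {
    const xpath = await new Promise((resolve, reject) => {
      glob('tests/wdio/locators/*.yml', (er, files) => er ? reject(er) : resolve(yamlMerge.mergeFiles(files)))
    })
    return xpath[elementId]
  }
}

module.exports = new Page();

// usage, inside an async function:
// const locator = await page.getElements('loginButton');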

Using promise to work with web worker inside a JavaScript closure

I was executing an image processing operation in JavaScript which was working as expected, except for one thing: sometimes it was freezing the UI, which led me to use a Web Worker to execute the image processing functions.
I have a scenario where I need to process multiple images. Below is a summary of the workflow I am using to achieve the above feat.
//closure
var filter = (function () {
  function process(args) {
    var promise = new Promise(function (resolve, reject) {
      if (typeof (Worker) !== "undefined") {
        if (typeof (imgWorker) == "undefined") {
          imgWorker = new Worker("/processWorker.js");
        }
        imgWorker.postMessage(args);
        imgWorker.onmessage = function (event) {
          resolve(event.data);
        };
      } else {
        reject("Sorry, your browser does not support Web Workers...");
      }
    });
    return promise;
  }
  return {
    process: function (args) {
      return process(args);
    }
  }
})();

function manipulate(args, callback) {
  filter.process(args).then(function (res) {
    callback(res);
  });
}
Here, I am loading multiple images and passing them to the manipulate function.
The issue I am facing in this scenario is that sometimes, for a few images, the Promise is never resolved.
After debugging my code I figured out that it is because I am creating a Promise for an image while a previous Promise has not yet been resolved.
I need suggestions on how I can fix this issue. I also have another query: should I use the same closure (filter in the above scenario) multiple times, or create a new closure each time it is required, as below:
var filter = function () {
  ....
  return function () {}
  ....
}

function manipulate(args, callback) {
  var abc = filter();
  abc.process(args).then(function (res) {
    callback(res);
  });
}
I hope my problem is clear; if not, please comment.
A better approach would be to load your image processing Worker only once, during the start of your application or when it is needed.
After that, you can create a Promise only for the function call you wish to make to the worker. In your case, filter can return a new Promise object every time you post to the Worker. This promise object should only be resolved when a reply is received from the worker for that specific call.
What is happening with your code is that your promises resolve against the wrong messages from the Worker: if you post 2 times, the second onmessage assignment overwrites the first, so an incoming reply resolves only the most recently created promise, and an earlier promise may never be resolved at all.
I created a worker encapsulation here: Orc.js. It may not work out of the box, as I haven't cleaned it of some dependencies I built into it, but feel free to use the methods I applied.
Additional:
You will need to map your post and onmessage to your promises. This will require you to modify your Worker code as well.
let generateID = function (args) {
  //generate an ID from your args, or find a unique way to distinguish your promises.
  return id;
}

let promises = {}
// you can add this object to your filter object if you like, but I placed it here temporarily

//closure
var filter = (function () {
  function process(args) {
    let id = generateID(args)
    promises[id] = {}
    promises[id].promise = new Promise(function (resolve, reject) {
      if (typeof (Worker) !== "undefined") {
        if (typeof (imgWorker) == "undefined") {
          imgWorker = new Worker("/processWorker.js");
          imgWorker.onmessage = function (event) {
            let id = generateID(event.data.args) //let your worker return the args so you can check the id of the promise you created.
            // resolve only the promise that you need to resolve
            promises[id].resolve(event.data);
          }
          // you don't need to keep assigning a function to the onmessage.
        }
        imgWorker.postMessage(args);
        // you can save all relevant things in your object.
        promises[id].resolve = resolve
        promises[id].reject = reject
        promises[id].args = args
      } else {
        reject("Sorry, your browser does not support Web Workers...");
      }
    });
    //return the relevant promise
    return promises[id].promise;
  }
  return {
    process: function (args) {
      return process(args);
    }
  }
})();

function manipulate(args, callback) {
  filter.process(args).then(function (res) {
    callback(res);
  });
}
TypeScript equivalent on gist:
Combining answers from "Webworker without external files", you can add functions to the worker scope with a line like `(${sanitizeThis.toString()})(this);` inside the Blob-constructing array.
There are some problems with resolving a promise outside of the promise enclosure, mainly around error catching and stack traces; I didn't bother, because it works perfectly fine for me right now.
// https://stackoverflow.com/a/37154736/3142238
function sanitizeThis(self) {
  // @ts-ignore
  // console.assert(this === self, "this is not self", this, self);
  // 'this' is undefined
  "use strict";
  var current = self;
  var keepProperties = [
    // Required
    'Object', 'Function', 'Infinity', 'NaN',
    'undefined', 'caches', 'TEMPORARY', 'PERSISTENT',
    "addEventListener", "onmessage",
    // Optional, but trivial to get back
    'Array', 'Boolean', 'Number', 'String', 'Symbol',
    // Optional
    'Map', 'Math', 'Set',
    "console",
  ];
  do {
    Object.getOwnPropertyNames(current).forEach(function (name) {
      if (keepProperties.indexOf(name) === -1) {
        delete current[name];
      }
    });
    current = Object.getPrototypeOf(current);
  } while (current !== Object.prototype);
}
/*
https://hacks.mozilla.org/2015/07/how-fast-are-web-workers/
https://developers.google.com/protocol-buffers/docs/overview
*/
class WorkerWrapper {
  worker;
  stored_resolves = new Map();
  count = 0;

  constructor(func) {
    let blob = new Blob([
      `"use strict";`,
      "const _postMessage = postMessage;",
      `(${sanitizeThis.toString()})(this);`,
      `const func = ${func.toString()};`,
      "(", function () {
        // self.onmessage = (e) => {
        addEventListener("message", (e) => {
          _postMessage({
            id: e.data.id,
            data: func(e.data.data)
          });
        })
      }.toString(), ")()"
    ], {
      type: "application/javascript"
    });
    let url = URL.createObjectURL(blob);
    this.worker = new Worker(url);
    URL.revokeObjectURL(url);
    this.worker.onmessage = (e) => {
      let { id, data } = e.data;
      let resolve = this.stored_resolves.get(id);
      this.stored_resolves.delete(id);
      if (resolve) {
        resolve(data);
      } else {
        console.error("invalid id in message returned by worker")
      }
    }
  }

  terminate() {
    this.worker.terminate();
  }

  postMessage(arg) {
    let id = ++this.count;
    return new Promise((res, rej) => {
      this.stored_resolves.set(id, res);
      this.worker.postMessage({
        id,
        data: arg
      });
    })
  }
}
// usage
let worker = new WorkerWrapper(
  (d) => { return d + d; }
);
worker.postMessage("HEY").then((e) => {
  console.log(e); // HEYHEY
})
worker.postMessage("HELLO WORLD").then((f) => {
  console.log(f); // HELLO WORLDHELLO WORLD
})

let worker2 = new WorkerWrapper(
  (abc) => {
    // you can insert anything here,
    // just be aware of whether variables/functions are in scope or not
    return (
      {
        "HEY": abc,
        [abc]: "HELLO WORLD" // this particular line will fail with babel
                             // with "ReferenceError: _defineProperty is not defined"
      }
    );
  }
);
worker2.postMessage("HELLO WORLD").then((f) => {
  console.log(f);
  /*
  {
    "HEY": "HELLO WORLD",
    "HELLO WORLD": "HELLO WORLD"
  }
  */
})
/*
observe how the output may be out of order because
the web worker is truly async
*/

Using promises with streams in node.js

I've refactored a simple utility to use promises. It fetches a pdf from the web and saves it to disk. It should then open the file in a pdf viewer once saved to disk. The file appears on disk and is valid, the shell command opens the OSX Preview application, but a dialog pops up complaining that the file is empty.
What's the best way to execute the shell function once the filestream has been written to disk?
// download a pdf and save to disk
// open pdf in osx preview for example
download_pdf()
  .then(function (path) {
    shell.exec('open ' + path);
  });

function download_pdf() {
  const path = '/local/some.pdf';
  const url = 'http://somewebsite/some.pdf';
  const stream = request(url);
  const write = stream.pipe(fs.createWriteStream(path))
  return streamToPromise(stream);
}

function streamToPromise(stream) {
  return new Promise(function (resolve, reject) {
    // resolve with location of saved file
    stream.on("end", resolve(stream.dests[0].path));
    stream.on("error", reject);
  })
}
In this line
stream.on("end", resolve(stream.dests[0].path));
you are executing resolve immediately, and the result of calling resolve (which will be undefined, because that's what resolve returns) is used as the argument to stream.on - not what you want at all, right.
.on's second argument needs to be a function, rather than the result of calling a function.
Therefore, the code needs to be:
stream.on("end", () => resolve(stream.dests[0].path));
or, if you're old school:
stream.on("end", function () { resolve(stream.dests[0].path); });
another old school way would be something like
stream.on("end", resolve.bind(null, stream.dests[0].path));
No, don't do that :p see comments
After a bunch of tries I found a solution which works fine all the time. See the JSDoc comments for more info.
/**
 * Streams input to output and resolves only after the stream has successfully ended.
 * Closes the output stream in success and error cases.
 * @param input {stream.Readable} Read from
 * @param output {stream.Writable} Write to
 * @return Promise Resolves only after the output stream is "end"ed or "finish"ed.
 */
function promisifiedPipe(input, output) {
  let ended = false;
  function end() {
    if (!ended) {
      ended = true;
      output.close && output.close();
      input.close && input.close();
      return true;
    }
  }
  return new Promise((resolve, reject) => {
    input.pipe(output);
    input.on('error', errorEnding);
    function niceEnding() {
      if (end()) resolve();
    }
    function errorEnding(error) {
      if (end()) reject(error);
    }
    output.on('finish', niceEnding);
    output.on('end', niceEnding);
    output.on('error', errorEnding);
  });
}
Usage example:
function downloadFile(req, res, next) {
  promisifiedPipe(fs.createReadStream(req.params.file), res).catch(next);
}
Update. I've published the above function as a Node module: http://npm.im/promisified-pipe
In the latest Node.js, specifically stream v3, you could do this:
const fs = require('fs');
const stream = require('stream');
const util = require('util');

const finished = util.promisify(stream.finished);

const rs = fs.createReadStream('archive.tar');

async function run() {
  await finished(rs);
  console.log('Stream is done reading.');
}

run().catch(console.error);
rs.resume(); // Drain the stream.
https://nodejs.org/api/stream.html#stream_event_finish
Another solution can look like this:
const { Writable } = require('stream')

const streamAsPromise = (readable) => {
  const result = []
  const w = new Writable({
    write(chunk, encoding, callback) {
      result.push(chunk)
      callback()
    }
  })
  readable.pipe(w)
  return new Promise((resolve, reject) => {
    w.on('finish', resolve)
    w.on('error', reject)
  }).then(() => result.join(''))
}
and you can use it like:
streamAsPromise(fs.createReadStream('secrets')).then((res) => console.log(res))
This can be done very nicely using the promisified pipeline function. Pipeline also provides extra functionality, such as cleaning up the streams.
const pipeline = require('util').promisify(require('stream').pipeline)

pipeline(
  request('http://somewebsite/some.pdf'),
  fs.createWriteStream('/local/some.pdf')
).then(() =>
  shell.exec('open /local/some.pdf')
);
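In newer Node versions (15 and later, an assumption about your runtime) the promisified form is built in via stream/promises, so the util.promisify step can be skipped. A sketch reusing request and shell from the question:

const { pipeline } = require('stream/promises');
const fs = require('fs');

async function downloadAndOpen() {
  // pipeline resolves once the write stream has finished
  await pipeline(
    request('http://somewebsite/some.pdf'),
    fs.createWriteStream('/local/some.pdf')
  );
  shell.exec('open /local/some.pdf');
}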
