Selecting an html node's text content with htmlparser2 in Node.js


I want to parse some HTML with the htmlparser2 module for Node.js. My task is to find a specific element by its ID and extract its text content.
I have read the (quite limited) documentation and I know how to set up my parser with the onopentag function, but it only gives access to the tag name and its attributes (I cannot see the text). The ontext function receives every text node from the given HTML string, but it has no information about the surrounding markup.
Here is my code:
const htmlparser = require("htmlparser2");
const file = '<h1 id="heading1">Some heading</h1><p>Foobar</p>';
const parser = new htmlparser.Parser({
onopentag: function(name, attribs){
if (attribs.id === "heading1"){
console.log(/*how to extract text so I can get "Some heading" here*/);
}
},
ontext: function(text){
console.log(text); // Some heading \n Foobar
}
});
parser.parseComplete(file);
I expect the output to be 'Some heading'. I believe there is some obvious solution, but somehow it escapes me.
Thank you.

You can do it like this using the library you asked about:
const htmlparser = require('htmlparser2');
const domUtils = require('domutils');
const file = '<h1 id="heading1">Some heading</h1><p>Foobar</p>';
var handler = new htmlparser.DomHandler(function(error, dom) {
if (error) {
console.log('Parsing had an error');
return;
} else {
const item = domUtils.findOne(element => {
const matches = element.attribs.id === 'heading1';
return matches;
}, dom);
if (item) {
console.log(item.children[0].data);
}
}
});
var parser = new htmlparser.Parser(handler);
parser.write(file);
parser.end();
The output you will get is "Some heading". However, in my opinion you will find it easier to use a querying library that is meant for this. You don't need to, of course, but note how much simpler the code becomes; see also: How do I get an element name in cheerio with node.js
Cheerio, or a querySelector-style API such as https://www.npmjs.com/package/node-html-parser if you prefer native-style query selectors, is much leaner.
For comparison, here is the same task with node-html-parser, which supports querying directly:
const { parse } = require('node-html-parser');
const file = '<h1 id="heading1">Some heading</h1><p>Foobar</p>';
const root = parse(file);
const text = root.querySelector('#heading1').text;
console.log(text);
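The streaming callbacks from the question can also do this without building a DOM, by remembering whether the parser is currently inside the target element. The following is only a sketch under assumptions: createTextCollector is a hypothetical helper name, and the events are replayed by hand so the block runs without htmlparser2 installed.

```javascript
// Hypothetical helper: returns htmlparser2-style callbacks that collect the
// text inside the first element whose id equals targetId.
function createTextCollector(targetId, onDone) {
  let depth = 0;        // 0 means "outside the target element"
  let finished = false; // set once the target element has been closed
  const chunks = [];
  return {
    onopentag(name, attribs) {
      if (finished) return;
      if (depth > 0) depth += 1;                   // nested tag inside the target
      else if (attribs.id === targetId) depth = 1; // entering the target
    },
    ontext(text) {
      if (depth > 0) chunks.push(text);            // keep only text inside the target
    },
    onclosetag() {
      if (depth === 0) return;
      depth -= 1;
      if (depth === 0) {                           // the target itself was closed
        finished = true;
        onDone(chunks.join(""));
      }
    },
  };
}

// Replay the events a parser would emit for
// '<h1 id="heading1">Some heading</h1><p>Foobar</p>'.
let result = null;
const handler = createTextCollector("heading1", (text) => { result = text; });
handler.onopentag("h1", { id: "heading1" });
handler.ontext("Some heading");
handler.onclosetag("h1");
handler.onopentag("p", {});
handler.ontext("Foobar");
handler.onclosetag("p");
console.log(result); // Some heading
```

With htmlparser2 installed, the same handler object can be passed to the parser exactly as in the question: new htmlparser.Parser(handler).parseComplete(file).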

Related

Insert Line breaks before text in Google Apps Script

I need to insert some line breaks before certain text in a Google Document.
Tried this approach but get errors:
var body = DocumentApp.getActiveDocument().getBody();
var pattern = "WORD 1";
var found = body.findText(pattern);
var parent = found.getElement().getParent();
var index = body.getChildIndex(parent);
// or parent.getChildIndex(parent);
body.insertParagraph(index, "");
Any idea on how to do this?
Appreciate the help!

For example, as a simple modification, how about modifying the script of https://stackoverflow.com/a/65745933 in your previous question?
In this case, InsertTextRequest is used instead of InsertPageBreakRequest.
Modified script:
Please copy and paste the following script into the script editor of the Google Document and set searchText. Also, please enable the Google Docs API at Advanced Google services.
function myFunction() {
const searchText = "WORD 1"; // Please set the text. This script inserts a line break before this text.
// 1. Retrieve all contents from Google Document using the method of "documents.get" in Docs API.
const docId = DocumentApp.getActiveDocument().getId();
const res = Docs.Documents.get(docId);
// 2. Create the request body for using the method of "documents.batchUpdate" in Docs API.
let offset = 0;
const requests = res.body.content.reduce((ar, e) => {
if (e.paragraph) {
e.paragraph.elements.forEach(f => {
if (f.textRun) {
const re = new RegExp(searchText, "g");
let p = null;
while (p = re.exec(f.textRun.content)) {
ar.push({insertText: {location: {index: p.index + offset},text: "\n"}});
}
}
})
}
offset = e.endIndex;
return ar;
}, []).reverse();
// 3. Request the request body to the method of "documents.batchUpdate" in Docs API.
Docs.Documents.batchUpdate({requests: requests}, docId);
}
Result:
When the above script is run, the following result is obtained.
[Images: the document before ("From") and after ("To") running the script]
Note:
If, as in your previous question, you don't want to use Advanced Google services directly, please modify the 2nd script of https://stackoverflow.com/a/65745933 as follows.
From
ar.push({insertPageBreak: {location: {index: p.index + offset}}});
To
ar.push({insertText: {location: {index: p.index + offset},text: "\n"}});
References:
Method: documents.get
Method: documents.batchUpdate
InsertTextRequest
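The request-building step can be tried outside Apps Script on a plain object shaped like a documents.get response. This is a hedged sketch rather than the answer's exact logic: buildLineBreakRequests is a hypothetical name, the sample content array is made up, and the index is taken from each text run's own startIndex (assumed to be populated on paragraph elements) instead of a running offset.

```javascript
// Hypothetical helper: builds one insertText request per match of searchText.
// searchText is used as a regular expression, as in the script above.
function buildLineBreakRequests(content, searchText) {
  const requests = [];
  for (const element of content) {
    if (!element.paragraph) continue;
    for (const run of element.paragraph.elements) {
      if (!run.textRun) continue;
      const re = new RegExp(searchText, "g");
      let match;
      while ((match = re.exec(run.textRun.content)) !== null) {
        requests.push({
          insertText: {
            location: { index: run.startIndex + match.index },
            text: "\n",
          },
        });
      }
    }
  }
  // Apply from the end of the document backwards so that earlier insertions
  // do not shift the indexes of later ones.
  return requests.reverse();
}

// Made-up sample mimicking the body.content array of a small document.
const content = [
  { endIndex: 1, sectionBreak: {} },
  { startIndex: 1, endIndex: 14, paragraph: { elements: [
    { startIndex: 1, endIndex: 14, textRun: { content: "Intro WORD 1\n" } },
  ] } },
  { startIndex: 14, endIndex: 27, paragraph: { elements: [
    { startIndex: 14, endIndex: 27, textRun: { content: "WORD 1 again\n" } },
  ] } },
];

const requests = buildLineBreakRequests(content, "WORD 1");
console.log(requests.map((r) => r.insertText.location.index)); // [ 14, 7 ]
```

The reversed order is the important part: the request for index 14 is sent before the one for index 7, so neither insertion invalidates the other's position.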

Read environment variable value with cheerio

I am parsing a webpage and trying to get the text values from it.
I am using cheerio to be able to do this with node.js.
Currently, whenever I parse a tag it returns {{status}}. This is because the value is an environment variable, but I want to be able to read the actual value (in this case it is "2").
This is what I have currently got:
const rp = require('request-promise');
const url = 'my url';
const $ = require('cheerio');
rp(url)
.then(function(html){
//success!
console.log($('.class-name div', html).text());
})
.catch(function(err){
//handle error
});
I have also tried using .html() and .contents(), but still no success.
Do I have to change the second parameter in $('.class-name DIV', <PARAMETER>) to achieve what I am after?

You don't provide the URL or the HTML you're trying to parse.
With cheerio, you can call the selector function in this format:
$( selector, [context], [root] )
This searches for selector inside context, within the root element (usually the HTML document string). selector and context can each be a string expression, a DOM element, an array of DOM elements, or a cheerio object.
Meanwhile:
$(selector).text() => returns the combined text content of the matched elements
$(selector).html() => returns the inner HTML of the first matched element
$(selector).attr('class') => returns the value of the class attribute
But cheerio's parsing can be difficult to debug.
I've used cheerio for a while and sometimes it can be a headache.
So I found jsonframe-cheerio, a package that extracts data from the HTML tags for you.
In the working example below, you can see it working together with cheerio.
It takes a format called a frame and uses it to extract inner text and attribute values, and it can even filter or match with regular expressions.
HTML source (simplified for readability)
https://www.example.com
<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p>More information...</p>
</div>
</body>
</html>
Cheerio
const request = require ('request-promise')
const cheerio = require ('cheerio')
const jsonfrm = require ('jsonframe-cheerio')
const url = 'https://www.example.com'
let $ = null
;(async () => {
try {
const response = await request(url)
$ = cheerio.load (response)
jsonfrm($)
let frame = {
articles : { // This format will loop every matching selectors occurs
_s : "body > div", // The root of every repeating item
_d : [{
"titling" : "h1", // The innerText of article h1
"excerpt" : "p", // The innerText of article content p
"linkhref": "a[href] # href" // The value of href attribute within a link
}]
}
}
const displayResult = $('body').scrape(frame, { string: true } )
console.log ( displayResult )
} catch ( error ) {
console.log ('ERROR: ', error)
}
})()

Not Populating list in HTML with Javascript

I am learning JavaScript and working on reading RSS feeds for a personal project. I am using the 'rss-parser' npm library to avoid CORS errors.
I am also using the Browserify bundler to make it work in the browser.
When I run this code in the terminal, it produces output without any issue. But when I try it in the browser, it prints nothing.
My knowledge of asynchronous JS is limited, but I am fairly sure the error is not there, as I only added code without changing the existing code.
let Parser = require('rss-parser');
let parser = new Parser();
let feed;
async () => {
feed = await parser.parseURL('https://www.reddit.com/.rss');
feedTheList();
};
// setTimeout(function() {
// //your code to be executed after 1 second
// feedTheList();
// }, 5000);
function feedTheList()
{
document.body.innerHTML = "<h1>Total Feeds: " + feed.items.length + "</h1>";
let u_list = document.getElementById("list")[0];
feed.items.forEach(item => {
var listItem = document.createElement("li");
//Add the item text
var newText = document.createTextNode(item.title);
listItem.appendChild(newText);
listItem.innerHTML =item.title;
//Add listItem to the listElement
u_list.appendChild(listItem);
});
}
Here is my HTML code.
<body>
<ul id="list"></ul>
<script src="bundle.js"></script>
</body>
Any guidance is much appreciated.

document.getElementById() returns a single element, not a collection, so you don't need to index it. So this:
let u_list = document.getElementById("list")[0];
sets u_list to undefined, and you should be getting errors later in the code. It should just be:
let u_list = document.getElementById("list");
Also, when you do:
listItem.innerHTML =item.title;
it replaces the text node that you appended on the previous line with this HTML. Either append the text node or assign to innerHTML (or, more correctly, innerText); you don't need to do both.

It looks like the async function is never executed: it is defined but not called. You need to wrap it in parentheses and invoke it immediately.
See the example here:
https://www.npmjs.com/package/rss-parser
Essentially,
var feed; // declared outside so feedTheList can read it (let works here too)
// wrap the function below in parentheses and call it
(async () => {
feed = await parser.parseURL('https://www.reddit.com/.rss');
feedTheList();
})(); // the () at the end invokes the async function
Now it will execute and feed should have items.
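The difference between defining an async arrow function and invoking it can be seen in isolation. In this small sketch, fetchFeed is a made-up stand-in for parser.parseURL so that it runs without any network access or packages.

```javascript
// Hypothetical stand-in for parser.parseURL: resolves with a tiny feed object.
const fetchFeed = () =>
  Promise.resolve({ items: [{ title: "First" }, { title: "Second" }] });

const calls = [];

// Defined but never invoked: this statement creates a function and discards it.
async () => {
  calls.push("never runs");
};

// Invoked immediately: the trailing () calls the function, which returns a promise.
const done = (async () => {
  const feed = await fetchFeed();
  calls.push("loaded " + feed.items.length + " items");
  return feed;
})();

done.then(() => console.log(calls)); // [ 'loaded 2 items' ]
```

Only the second function contributes to calls, which mirrors why the original code printed nothing in the browser.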
CORS errors when making request
As noted in the documentation at https://www.npmjs.com/package/rss-parser, if you get CORS error on a resource, use a CORS proxy. I've updated their example to fit your code:
// Note: some RSS feeds can't be loaded in the browser due to CORS security.
// To get around this, you can use a proxy.
const CORS_PROXY = "https://cors-anywhere.herokuapp.com/"
let parser = new RSSParser();
(async () => {
await parser.parseURL(CORS_PROXY + 'https://www.reddit.com/.rss', function(err, feed) {
feedTheList(feed);
});
})();
function feedTheList(feed)
{
// unchanged
}
One last thing:
The line
document.body.innerHTML = "<h1>Total Feeds: " + feed.items.length + "</h1>";
will replace all of the content of <body>, including the <ul id="list"> element that the rest of the function relies on.
I suggest looking into how element.appendChild works, or just placing the <h1> tag in your HTML and modifying its innerHTML property instead.

jsdom get text without image

I am trying to use jsdom to get a description from an article.
The html code of the article is
<p><img src="http://localhost/bibi_cms/cms/app/images/upload_photo/1506653694941.png"
style="width: 599.783px; height: 1066px;"></p>
<p>testestestestestestestest<br></p>
Here is my Node.js code for getting the description from the content. It seems to take the text from the first p tag, which gives an empty string. I want to get the content of the p tag that contains no image. Can anyone help me with this issue?
const dom = new JSDOM(results[i].content.toString());
if (dom.window.document.querySelector("p") !== null)
results[i].description = dom.window.document.querySelector("p").textContent;

Ideally you could test against Node.TEXT_NODE, but that is erroring for me on Node.js for some reason, so (using gulp just for testing purposes):
const gulp = require("gulp");
const fs = require('fs');
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const html = 'yourHTML.html'; // path to the HTML file to read
gulp.task('default', ['getText']);
gulp.task('getText', function () {
var dirty;
dirty = fs.readFileSync(html, 'utf8');
const dom = new JSDOM(dirty);
const pList = dom.window.document.querySelectorAll("p");
pList.forEach(function (el) {
// firstElementChild is null when the paragraph contains only text
const first = el.firstElementChild;
console.log("p.firstElementChild.nodeName : " + (first ? first.nodeName : "(none)"));
if (!first || first.nodeName !== "IMG") {
console.log(el.textContent);
}
});
return;
})
So the key is the test
el.firstElementChild.nodeName !== "IMG"
if you know that either an img tag or text follows the opening p tag. In your case the firstElementChild of the text paragraph is actually the br tag, but I assume that isn't necessarily always there at the end of the text.
You could also test against an empty string, à la:
if (el.textContent.trim() !== "") {} // trim() removes surrounding whitespace
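The empty-string test can be wrapped in a small function that returns the first paragraph with real text. This is a sketch under assumptions: firstTextParagraph is a hypothetical helper, and a stub object stands in for the parsed document so the block runs without jsdom installed.

```javascript
// Hypothetical helper: returns the trimmed text of the first <p> that has any,
// or null when every paragraph is empty (for example, image-only paragraphs).
function firstTextParagraph(document) {
  for (const p of document.querySelectorAll("p")) {
    const text = p.textContent.trim();
    if (text !== "") return text;
  }
  return null;
}

// Stand-in for a parsed document; the two entries mirror the question's HTML.
const stubDocument = {
  querySelectorAll: () => [
    { textContent: "" },                          // <p><img ...></p>
    { textContent: "testestestestestestestest" }, // <p>text<br></p>
  ],
};

console.log(firstTextParagraph(stubDocument)); // testestestestestestestest
```

With jsdom, the question's dom.window.document would be passed in place of stubDocument.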

Is there any way to create a document fragment from a generic piece of HTML?

I'm working on an application that uses some client-side templates to render data and most of the javascript template engines return a simple string with the markup and it's up to the developer to insert that string into the DOM.
I googled around and saw a bunch of people suggesting the use of an empty div, setting its innerHTML to the new string and then iterating through the child nodes of that div like so
var parsedTemplate = 'something returned by template engine';
var tempDiv = document.createElement('div'), childNode;
var documentFragment = document.createDocumentFragment();
tempDiv.innerHTML = parsedTemplate;
while ( childNode = tempDiv.firstChild ) {
documentFragment.appendChild(childNode);
}
And TADA, documentFragment now contains the parsed template. However, if my template happens to be a tr, wrapping it in a div doesn't achieve the expected behaviour: the tr and td tags are dropped and only the contents of the td's end up inside the div.
Does anybody know of a good way to solve this? Right now I'm checking the node where the parsed template will be inserted and creating an element from its tag name. I'm not even sure there's another way of doing this.
While searching I came across this discussion on the w3 mailing lists, but there was no useful solution, unfortunately.

You can use a DOMParser to parse the string as XHTML/XML, which avoids the HTML "auto-correction" that DOMs like to perform:
var parser = new DOMParser(),
doc = parser.parseFromString('<tr><td>something returned </td><td>by template engine</td></tr>', "text/xml"),
documentFragment = document.createDocumentFragment() ;
documentFragment.appendChild( doc.documentElement );
//fragment populated, now view as HTML to verify fragment contains what's expected:
var temp=document.createElement('div');
temp.appendChild(documentFragment);
console.log(temp.outerHTML);
// shows: "<div><tr><td>something returned </td><td>by template engine</td></tr></div>"
This is in contrast to using naive innerHTML with a temp div:
var temp=document.createElement('div');
temp.innerHTML='<tr><td>something returned </td><td>by template engine</td></tr>';
console.log(temp.outerHTML);
// shows: '<div>something returned by template engine</div>' (bad)
By treating the template as XHTML/XML (and making sure it's well-formed), we can bend the normal rules of HTML.
Support for DOMParser should correlate with support for documentFragment, but on some old (single-digit) versions of Firefox you might need to use importNode().
As a reusable function:
function strToFrag(strHTML){
var temp=document.createElement('template');
if( temp.content ){
temp.innerHTML=strHTML;
return temp.content;
}
var parser = new DOMParser(),
doc = parser.parseFromString(strHTML, "text/xml"),
documentFragment = document.createDocumentFragment() ;
documentFragment.appendChild( doc.documentElement );
return documentFragment;
}

This is one for the future, rather than now, but HTML5 defines a <template> element that will create the fragment for you. You will be able to do:
var parsedTemplate = '<tr><td>xxx</td></tr>';
var tempEl = document.createElement('template');
tempEl.innerHTML = parsedTemplate;
var documentFragment = tempEl.content;
It currently works in Firefox.

The ideal approach is to use the <template> tag from HTML5. You can create a template element programmatically, assign the .innerHTML to it and all the parsed elements (even fragments of a table) will be present in the template.content property. This does all the work for you. But, this only exists right now in the latest versions of Firefox and Chrome.
If template support exists, it is as simple as this:
function makeDocFragment(htmlString) {
var container = document.createElement("template");
container.innerHTML = htmlString;
return container.content;
}
The return result from this works just like a documentFragment. You can just append it directly and it solves the problem just like a documentFragment would except it has the advantage of supporting .innerHTML assignment and it lets you use partially formed pieces of HTML (solving both problems we need).
But template support doesn't exist everywhere yet, so you need a fallback. The brute-force way to handle the fallback is to peek at the beginning of the HTML string, see what type of tag it starts with, create the appropriate container for that type of tag, and assign the HTML to that container. It's crude, but it works. This special handling is needed for any HTML element that can only legally exist in a particular type of container. I've included a number of those elements in my code below (though I've not attempted to make the list exhaustive). The code follows, with a working jsFiddle link after it. In a recent version of Chrome or Firefox, the code takes the path that uses the template object; in other browsers, it creates the appropriate type of container object.
var makeDocFragment = (function() {
// static data in closure so it only has to be parsed once
var specials = {
td: {
parentElement: "table",
starterHTML: "<tbody><tr class='xx_Root_'></tr></tbody>"
},
tr: {
parentElement: "table",
starterHTML: "<tbody class='xx_Root_'></tbody>"
},
thead: {
parentElement: "table",
starterHTML: "<tbody class='xx_Root_'></tbody>"
},
caption: {
parentElement: "table",
starterHTML: "<tbody class='xx_Root_'></tbody>"
},
li: {
parentElement: "ul",
},
dd: {
parentElement: "dl",
},
dt: {
parentElement: "dl",
},
optgroup: {
parentElement: "select",
},
option: {
parentElement: "select",
}
};
// feature detect template tag support and use simpler path if so
// testing for the content property is suggested by MDN
var testTemplate = document.createElement("template");
if ("content" in testTemplate) {
return function(htmlString) {
var container = document.createElement("template");
container.innerHTML = htmlString;
return container.content;
}
} else {
return function(htmlString) {
var specialInfo, container, root, tagMatch,
documentFragment;
// can't use template tag, so lets mini-parse the first HTML tag
// to discern if it needs a special container
tagMatch = htmlString.match(/^\s*<([^>\s]+)/);
if (tagMatch) {
specialInfo = specials[tagMatch[1].toLowerCase()];
if (specialInfo) {
container = document.createElement(specialInfo.parentElement);
if (specialInfo.starterHTML) {
container.innerHTML = specialInfo.starterHTML;
}
root = container.querySelector(".xx_Root_");
if (!root) {
root = container;
}
root.innerHTML = htmlString;
}
}
if (!container) {
container = document.createElement("div");
container.innerHTML = htmlString;
root = container;
}
documentFragment = document.createDocumentFragment();
// start at the actual root we want
while (root.firstChild) {
documentFragment.appendChild(root.firstChild);
}
return documentFragment;
}
}
// don't let the feature test template object hang around in closure
testTemplate = null;
})();
// test cases
var frag = makeDocFragment("<tr><td>Three</td><td>Four</td></tr>");
document.getElementById("myTableBody").appendChild(frag);
frag = makeDocFragment("<td>Zero</td><td>Zero</td>");
document.getElementById("emptyRow").appendChild(frag);
frag = makeDocFragment("<li>Two</li><li>Three</li>");
document.getElementById("myUL").appendChild(frag);
frag = makeDocFragment("<option>Second Option</option><option>Third Option</option>");
document.getElementById("mySelect").appendChild(frag);
Working demo with several test cases: http://jsfiddle.net/jfriend00/SycL6/
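The first step of the fallback, sniffing the leading tag to choose a container, can be tried on its own without a DOM. This is a simplified sketch, not the answer's code: containerTagFor and the REQUIRED_PARENT table are illustrative names, the mapping is not exhaustive, and the regular expression is a variant of the one above.

```javascript
// Required parent for tags that cannot live directly inside a <div>.
// Illustrative only; extend as needed.
const REQUIRED_PARENT = new Map([
  ["td", "tr"], ["th", "tr"], ["tr", "tbody"],
  ["thead", "table"], ["tbody", "table"], ["tfoot", "table"], ["caption", "table"],
  ["li", "ul"], ["dd", "dl"], ["dt", "dl"],
  ["optgroup", "select"], ["option", "select"],
]);

// Returns the tag name of the element that should wrap htmlString
// before assigning it via innerHTML.
function containerTagFor(htmlString) {
  const match = htmlString.match(/^\s*<([a-z][a-z0-9]*)/i);
  if (!match) return "div"; // plain text or no leading tag
  return REQUIRED_PARENT.get(match[1].toLowerCase()) || "div";
}

console.log(containerTagFor("<tr><td>Three</td></tr>")); // tbody
console.log(containerTagFor("  <LI>Two</LI>"));          // ul
console.log(containerTagFor("<p>plain</p>"));            // div
console.log(containerTagFor("just text"));               // div
```

A Map is used instead of a plain object so that tag names such as "constructor" cannot accidentally match inherited properties.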

Use this function. It:
supports IE11
does not require the input to be XML-conformant, e.g. '<td hidden>test'
function createFragment(html){
var tmpl = document.createElement('template');
tmpl.innerHTML = html;
if (tmpl.content == void 0){ // ie11
var fragment = document.createDocumentFragment();
var isTableEl = /^[^\S]*?<(t(?:head|body|foot|r|d|h))/i.test(html);
tmpl.innerHTML = isTableEl ? '<table>'+html : html;
var els = isTableEl ? tmpl.querySelector(RegExp.$1).parentNode.childNodes : tmpl.childNodes;
while(els[0]) fragment.appendChild(els[0]);
return fragment;
}
return tmpl.content;
}
The solution from @dandavis accepts only XML-conformant content in IE11.
I don't know whether there are other tags that must be taken into account.
