I am parsing a webpage and trying to get the text values from it.
I am using cheerio to be able to do this with node.js.
Currently, whenever I parse a tag it returns {{status}}. This is because the value is an environment variable, but I want to read the actual value (in this case, "2").
This is what I have currently got:
const rp = require('request-promise');
const url = 'my url';
const $ = require('cheerio');
rp(url)
.then(function(html){
//success!
console.log($('.class-name div', html).text());
})
.catch(function(err){
//handle error
});
I have also tried .html() and .contents(), but still no success.
Do I have to change the second parameter in $('.class-name DIV', <PARAMETER>) to achieve what I am after?
You haven't provided the URL or the HTML you're trying to parse, so here is the general approach.
With cheerio, a selector call has this format:
$( selector, [context], [root] )
This searches for selector inside context, within the root element (usually the HTML document string). Both selector and context can be a string expression, a DOM element, an array of DOM elements, or a cheerio object.
Meanwhile:
$(selector).text() => Return innerText of the selector
$(selector).html() => Return innerHTML of the selector
$(selector).attr('class') => Return value of class attribute
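One thing to keep in mind: cheerio only parses the HTML string it is given and does not run the page's scripts, so a placeholder such as {{status}} that the browser would normally fill in comes back literally from .text(). A minimal sketch for spotting that case, in plain JavaScript (the helper name findPlaceholders is hypothetical, not part of cheerio):

```javascript
// Hypothetical helper: lists the names of unrendered {{placeholder}} tokens in a string.
function findPlaceholders(text) {
  const names = [];
  // Matches {{ name }} with optional whitespace inside the braces.
  for (const match of text.matchAll(/\{\{\s*([\w.]+)\s*\}\}/g)) {
    names.push(match[1]);
  }
  return names;
}

// The text that cheerio returned in the question:
console.log(findPlaceholders('{{status}}')); // [ 'status' ]
```

If this returns anything, the value you want is probably filled in by client-side code and is not present in the HTML that was downloaded.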
But cheerio can be difficult to debug.
I've used cheerio for a while, and sometimes it can be a headache.
So I found jsonframe-cheerio, a package that parses the HTML tags for you.
As the working example below shows, it works well on top of cheerio.
It takes a format called a frame and uses it to extract innerText and attribute values, and even to filter results or match regular expressions.
HTML source (simplified for readability)
https://www.example.com
<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p>More information...</p>
</div>
</body>
</html>
Cheerio
const request = require ('request-promise')
const cheerio = require ('cheerio')
const jsonfrm = require ('jsonframe-cheerio')
const url = 'https://www.example.com'
let $ = null
;(async () => {
try {
const response = await request(url)
$ = cheerio.load (response)
jsonfrm($)
let frame = {
articles : { // Repeats for every element that matches the selector below
_s : "body > div", // The root of every repeating item
_d : [{
"titling" : "h1", // The innerText of article h1
"excerpt" : "p", // The innerText of article content p
"linkhref": "a[href] @ href" // The value of the href attribute of a link
}]
}
}
const displayResult = $('body').scrape(frame, { string: true } )
console.log ( displayResult )
} catch ( error ) {
console.log ('ERROR: ', error)
}
})()
Related
How can I get the content of a meta tag and a script tag using the cy.request() command? An example is given below:
<meta data-n-head="ssr" data-hid="og-title" property="og:title" content="Blue zone booking area">
<script data-n-head="ssr" data-hid="nuxt-jsonld-6378ffa8" type="application/ld+json">{"@context":"https://schema.org","@type":"WebPage","headline":"Parcel area for Blue Zone one","url":"https://staging.booking.com/au/booking-area/zone/blue/"}</script>
I have tried cy.wrap($meta) and iterating, but it doesn't work. Can anyone suggest how to grab the content="Blue zone booking area" attribute from the meta tag and the headline value from the script tag?
Note: this is not a front-end test, which is why I am using cy.request(). I want to make sure that the SEO/SSR output of our website looks good. Google's crawler sends a request to the above URL, so we should make sure that the rendered response looks good. The cy.visit() and cy.get() commands enable browser JavaScript, and that is not what I want.
cy.request(apiHelper.makeGeneralRequestObject("au/booking-area/zone/blue/")).then(
(response) => {
const htmlString = response.body;
const parser = new DOMParser();
const parseHtml = parser.parseFromString(htmlString, 'text/html');
const $meta = parseHtml.getElementsByTagName('meta');
$meta.each(($elem)=>{
// how to get at here
})
});
Looks like you are mixing up Cypress/jQuery-style queries with native DOM queries.
This should do it using the DOM parser:
cy.request({
url: 'https://www.booking.com/au',
failOnStatusCode: false
}).then(response => {
const parser = new DOMParser()
const doc = parser.parseFromString(response.body, 'text/html')
const metaTags = doc.head.querySelectorAll('meta') // pull <meta> from document head
metaTags.forEach(metaTag => { // it's a NodeList, use forEach()
const key = metaTag.name // not all have a "name" (Open Graph tags use "property")
const content = metaTag.content // empty string when there is no content attribute
console.log(key, content)
})
})
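Since the tag in the question uses property="og:title" rather than name, here is a minimal sketch that collects both kinds into one object. The helper name collectMetaTags is hypothetical, and doc is assumed to be the Document returned by parseFromString():

```javascript
// Hypothetical helper: maps each <meta> tag's name or property to its content.
// `doc` is any Document, e.g. the result of DOMParser.parseFromString().
function collectMetaTags(doc) {
  const result = {};
  doc.querySelectorAll('meta').forEach((metaTag) => {
    // Open Graph tags use "property"; classic tags use "name".
    const key = metaTag.getAttribute('name') || metaTag.getAttribute('property');
    const content = metaTag.getAttribute('content');
    // Skip tags such as <meta charset="utf-8"> that have no key or no content.
    if (key && content !== null) {
      result[key] = content;
    }
  });
  return result;
}
```

Inside the .then() callback you could then read collectMetaTags(doc)['og:title'] and assert on it.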
Or with Cypress (arguably better if performing SEO)
cy.visit('https://www.booking.com/au')
cy.document().then(doc => {
cy.wrap(doc.head).find('meta').each($meta => {
const key = $meta.attr('name')
const content = $meta.attr('content')
console.log(key, content)
})
})
Also consider Bahmutov - Cypress Lighthouse Example. There is an SEO section in Lighthouse, and the results for https://www.booking.com/au currently show
No <meta name="viewport"> tag found
jsonLD
There is an example of jsonLD test here cypress-automated-test-for-seo-data
it("Verify jsonLD structured data - simple", () => {
// Query the script tag with type application/ld+json
cy.get("script[type='application/ld+json']").then((scriptTag) => {
// we need to parse the JSON LD from text to a JSON to easily test it
const jsonLD = JSON.parse(scriptTag.text());
// once parsed we can easily test for different data points
expect(jsonLD["@context"]).equal("https://schema.org");
expect(jsonLD.author).length(2);
// Cross referencing SEO data between the page title and the headline
// in the jsonLD data, great for dynamic data
cy.title().then((currentPageTitle) =>
expect(jsonLD["headline"]).equal(currentPageTitle)
)
})
})
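If you want to stay with cy.request() and avoid loading the page in the browser, the same idea can be applied to a document parsed with DOMParser. A minimal sketch, assuming doc is that parsed Document (the helper name readJsonLd is hypothetical):

```javascript
// Hypothetical helper: parses every JSON-LD <script> block in a Document.
function readJsonLd(doc) {
  const scripts = doc.querySelectorAll('script[type="application/ld+json"]');
  // Array.from works on a NodeList as well as on a plain array.
  return Array.from(scripts, (script) => JSON.parse(script.textContent));
}
```

For the markup in the question, readJsonLd(doc)[0].headline would be the value to compare against the expected headline.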
I need to load some data and insert it into an li in two ways. First, it is inserted as a <p>. Second, it is inserted as a hidden input with the value specified. How can I do that? If I use the innerHTML property it works, but I need to add two elements, not only one. And when I try to use appendChild, it gives this error:
infoPersonal.js:22 Uncaught (in promise) TypeError: Failed to execute
'appendChild' on 'Node': parameter 1 is not of type 'Node'.
What can I do?
EDIT: in the code below the input is only created if the condition is met, but it is supposed to add the input with every el
const datos= () => {
const token = localStorage.getItem('token')
if (token) {
return fetch('https://janfa.gharsnull.now.sh/api/auth/me', {
method: 'GET',
headers: {
'Content-Type': 'application/json',
authorization : token,
},
})
.then( x => x.json())
.then( user =>{
const datalist = document.querySelectorAll(".data")
datalist.forEach( function(el){
var input
const template = '<p>' + user[el.getAttribute("data-target")] + '</p>'
if (el.getAttribute("data-target")==="birth") {
input = `<input class="input-date" type ="date" value="${user[date]}" hidden>`
}
el.innerHTML=template //this works
el.appendChild(input) //this doesn't
})
})
}
}
window.onload=() =>{
datos()
}
appendChild expects a Node or element. You have two options:
Create the element:
input=document.createElement("input");
input.type="date";
input.className="input-date";
input.value=user.date;
input.hidden=true;
Or use innerHTML. Of course it will replace the contents of el, but you could use a placeholder or dummy element.
var div=document.createElement("div");
div.innerHTML="<input ...";
input=div.children[0];
I'd do the first thing. Or use a framework if you want to write less, but it's a little overkill just for this.
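Putting the first option together, a minimal sketch of a helper that adds both elements as real nodes might look like this. The name fillField and the doc parameter are assumptions for illustration; note that it appends to el instead of replacing its contents the way innerHTML does:

```javascript
// Hypothetical helper: adds a <p> and, optionally, a hidden date input to `el`.
// `doc` defaults to the global document when running in a browser.
function fillField(el, text, dateValue, doc = document) {
  const p = doc.createElement('p');
  p.textContent = text;
  el.appendChild(p);

  // Only create the input when a date value was supplied.
  if (dateValue !== undefined) {
    const input = doc.createElement('input');
    input.type = 'date';
    input.className = 'input-date';
    input.value = dateValue;
    input.hidden = true;
    el.appendChild(input);
  }
}
```

Inside the forEach you would call it once per el, passing the date only for the elements that need the input.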
You can use the insertAdjacentHTML() method on an element. The first parameter is a string denoting the position, and the second is the HTML string. Browser support is very good at this point, too.
Element.insertAdjacentHTML('beforeend', '<p class="someClass">Hello, World</p>');
I want to parse some HTML with the htmlparser2 module for Node.js. My task is to find a specific element by its ID and extract its text content.
I have read the documentation (which is quite limited) and I know how to set up my parser with the onopentag function, but it only gives access to the tag name and its attributes (I cannot see the text). The ontext function extracts all text nodes from the given HTML string, but ignores all markup.
So here's my code.
const htmlparser = require("htmlparser2");
const file = '<h1 id="heading1">Some heading</h1><p>Foobar</p>';
const parser = new htmlparser.Parser({
onopentag: function(name, attribs){
if (attribs.id === "heading1"){
console.log(/*how to extract text so I can get "Some heading" here*/);
}
},
ontext: function(text){
console.log(text); // Some heading \n Foobar
}
});
parser.parseComplete(file);
I expect the output of the function call to be 'Some heading'. I believe there is some obvious solution, but somehow it escapes me.
Thank you.
You can do it like this using the library you asked about:
const htmlparser = require('htmlparser2');
const domUtils = require('domutils');
const file = '<h1 id="heading1">Some heading</h1><p>Foobar</p>';
var handler = new htmlparser.DomHandler(function(error, dom) {
if (error) {
console.log('Parsing had an error');
return;
} else {
const item = domUtils.findOne(element => {
const matches = element.attribs.id === 'heading1';
return matches;
}, dom);
if (item) {
console.log(item.children[0].data);
}
}
});
var parser = new htmlparser.Parser(handler);
parser.write(file);
parser.end();
The output you will get is "Some heading". However, in my opinion you will find it easier to use a querying library that is meant for this. You don't need to, of course, but note how much simpler the code in this answer is: How do I get an element name in cheerio with node.js
Cheerio, or a querySelector-style API such as https://www.npmjs.com/package/node-html-parser if you prefer native query selectors, is much leaner.
Compare the code above with node-html-parser, which supports querying directly:
const { parse } = require('node-html-parser');
const file = '<h1 id="heading1">Some heading</h1><p>Foobar</p>';
const root = parse(file);
const text = root.querySelector('#heading1').text;
console.log(text);
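For completeness, the callback-based Parser from the question can also do this if you keep a little state between onopentag, ontext, and onclosetag. The following is only a sketch: createIdTextCollector is a hypothetical helper, and it assumes the parser reports a matching onclosetag for every onopentag.

```javascript
// Hypothetical factory: builds Parser callbacks that collect the text of the
// element whose id matches `targetId`.
function createIdTextCollector(targetId) {
  let depth = 0; // greater than 0 while the parser is inside the target element
  let text = '';
  return {
    handlers: {
      onopentag(name, attribs) {
        if (depth > 0) {
          depth += 1; // a nested tag inside the target
        } else if (attribs.id === targetId) {
          depth = 1; // entering the target element
        }
      },
      ontext(chunk) {
        // Only keep text that appears inside the target element.
        if (depth > 0) text += chunk;
      },
      onclosetag() {
        if (depth > 0) depth -= 1;
      },
    },
    getText: () => text,
  };
}
```

You would pass collector.handlers to new htmlparser.Parser(...), call parser.parseComplete(file), and then read collector.getText(), which for the HTML in the question should give the heading text only.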
I am learning JavaScript and working on reading RSS feeds for a personal project. I am using the rss-parser npm library to avoid CORS errors.
I am also using the Browserify bundler to make it work in the browser.
When I run this code in the terminal, it gives me output without any issue. But when I try it in the browser, it prints nothing.
My knowledge of asynchronous JS is limited, but I am pretty sure there are no errors here, as I only added code without changing the existing code.
let Parser = require('rss-parser');
let parser = new Parser();
let feed;
async () => {
feed = await parser.parseURL('https://www.reddit.com/.rss');
feedTheList();
};
// setTimeout(function() {
// //your code to be executed after 1 second
// feedTheList();
// }, 5000);
function feedTheList()
{
document.body.innerHTML = "<h1>Total Feeds: " + feed.items.length + "</h1>";
let u_list = document.getElementById("list")[0];
feed.items.forEach(item => {
var listItem = document.createElement("li");
//Add the item text
var newText = document.createTextNode(item.title);
listItem.appendChild(newText);
listItem.innerHTML =item.title;
//Add listItem to the listElement
u_list.appendChild(listItem);
});
}
Here is my HTML code.
<body>
<ul id="list"></ul>
<script src="bundle.js"></script>
</body>
Any guidance is much appreciated.
document.getElementById() returns a single element, not a collection, so you don't need to index it. So this:
let u_list = document.getElementById("list")[0];
sets u_list to undefined, and you should be getting errors later in the code. It should just be:
let u_list = document.getElementById("list");
Also, when you do:
listItem.innerHTML =item.title;
it will replace the text node that you appended on the previous line with this HTML. Either append the text node or assign to innerHTML (or more correctly, innerText), you don't need to do both.
It looks like the async function is never executed. You need to wrap it in parentheses and call it (an immediately invoked function expression).
See the example here:
https://www.npmjs.com/package/rss-parser
Essentially,
let feed; // declared in the outer scope, so feedTheList() can read it
// wrap the async function in parentheses and call it
(async () => {
feed = await parser.parseURL('https://www.reddit.com/.rss');
feedTheList();
})(); // the () at the end invokes the function immediately
Now it will execute and feed should have items.
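To see the difference in isolation, here is a small sketch with no RSS code involved (the calls array exists only for this illustration):

```javascript
const calls = [];

// Defined but never called: the function body does not run.
async () => { calls.push('not invoked'); };

// Wrapped in parentheses and called with (): the body starts running immediately.
(async () => { calls.push('invoked'); })();

console.log(calls); // [ 'invoked' ]
```

The first arrow function is just an expression that gets thrown away, which is what happened in the original code.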
CORS errors when making request
As noted in the documentation at https://www.npmjs.com/package/rss-parser, if you get a CORS error on a resource, use a CORS proxy. I've adapted their example to fit your code:
// Note: some RSS feeds can't be loaded in the browser due to CORS security.
// To get around this, you can use a proxy.
const CORS_PROXY = "https://cors-anywhere.herokuapp.com/"
let parser = new RSSParser();
(async () => {
await parser.parseURL(CORS_PROXY + 'https://www.reddit.com/.rss', function(err, feed) {
feedTheList(feed);
});
})();
function feedTheList(feed)
{
// unchanged
}
One last thing:
The line
document.body.innerHTML = "<h1>Total Feeds: " + feed.items.length + "</h1>";
replaces all of the content of <body>, including the <ul id="list"> element that the following line tries to look up.
I suggest looking into how element.appendChild works, or just placing the <h1> tag in your HTML and modifying its innerHTML property instead.
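A minimal sketch of that second suggestion, assuming you add an <h1 id="total"></h1> above the list in your HTML. The "total" id, the renderFeed name, and the doc parameter are assumptions for illustration:

```javascript
// Hypothetical helper: renders the feed without replacing <body>.
// Assumes the page contains <h1 id="total"></h1> and <ul id="list"></ul>.
function renderFeed(feed, doc = document) {
  doc.getElementById('total').textContent = 'Total Feeds: ' + feed.items.length;
  const list = doc.getElementById('list'); // a single element, no [0] needed
  feed.items.forEach((item) => {
    const listItem = doc.createElement('li');
    listItem.textContent = item.title; // plain text, so no separate text node is needed
    list.appendChild(listItem);
  });
}
```

Because the existing elements are only updated, the list is still in the document when the items are appended.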
I am using Node.js to sanitize some HTML elements with the cheerio module. I am trying to use the module to parse tags into DOM elements. What I would like to do is enter text into a textarea field of a form and, when I add HTML elements as a string inside the textarea, have that HTML string rendered into an actual DOM element.
exports.createStore = async (req, res) => {
req.body.author = req.user._id;
const store = await new Store(req.body).save();
const $ = cheerio.load(store.description);
$(store.description).text();
console.log($(store.description).text());
await store.save();
req.flash(
"success",
`Successfully Created ${store.name}. Care to leave a review?`
);
res.redirect(`/store/${store.slug}`);
};
Where store.description = 'Hello < b>World< /b>'
I would like store.description to equal Hello World