Parse HTML string to HTML DOM element using Node.js - javascript

I am using Node.js with the cheerio module to sanitize some HTML. I am trying to use the module to parse tags into DOM elements. What I would like to do is enter text into a textarea field of a form, and when that text contains HTML elements as a string, have the HTML string rendered as an actual DOM element.
exports.createStore = async (req, res) => {
req.body.author = req.user._id;
const store = await new Store(req.body).save();
const $ = cheerio.load(store.description);
$(store.description).text();
console.log($(store.description).text());
await store.save();
req.flash(
"success",
`Successfully Created ${store.name}. Care to leave a review?`
);
res.redirect(`/store/${store.slug}`);
};
Where store.description = 'Hello <b>World</b>'
I would like store.description to equal Hello World

Related

Using `cy.request()` command how to get `meta` tag and `script` tag contents

How can I get the contents of a meta tag and a script tag using the cy.request() command? An example is given below:
<meta data-n-head="ssr" data-hid="og-title" property="og:title" content="Blue zone booking area">
<script data-n-head="ssr" data-hid="nuxt-jsonld-6378ffa8" type="application/ld+json">{"@context":"https://schema.org","@type":"WebPage","headline":"Parcel area for Blue Zone one","url":"https://staging.booking.com/au/booking-area/zone/blue/"}</script>
I have tried cy.wrap($meta) and iterating over the result, but it doesn't work. Can anyone suggest how to grab the content="Blue zone booking area" attribute from the meta tag and the headline value from the script tag?
Note: this is not a front-end test, which is why I am using cy.request(): I want to make sure that the SEO/SSR output of our website looks good. Google's crawler sends a request to the URL above, so we should verify that the server-rendered response is correct. Using cy.visit() or cy.get() runs the page's JavaScript in the browser, and that is not what I want.
cy.request(apiHelper.makeGeneralRequestObject("au/booking-area/zone/blue/")).then(
(response) => {
const htmlString = response.body;
const parser = new DOMParser();
const parseHtml = parser.parseFromString(htmlString, 'text/html');
const $meta = parseHtml.getElementsByTagName('meta');
$meta.each(($elem)=>{
// how to get at here
})
});
Looks like you are mixing up Cypress/jQuery style with DOM (native) style queries.
This should do it using the DOM parser
cy.request({
url: 'https://www.booking.com/au',
failOnStatusCode: false
}).then(response => {
const parser = new DOMParser()
const doc = parser.parseFromString(response.body, 'text/html')
const metaTags = doc.head.querySelectorAll('meta') // pull <meta> from document head
metaTags.forEach(metaTag => { // it's a NodeList, use forEach()
const key = metaTag.name || metaTag.getAttribute('property') // Open Graph tags use "property" instead of "name"
const content = metaTag.content // empty string when there is no content attribute (e.g. <meta charset>)
console.log(key, content)
})
})
Or with Cypress (arguably better if performing SEO)
cy.visit('https://www.booking.com/au')
cy.document().then(doc => {
cy.wrap(doc.head).find('meta').each($meta => {
const key = $meta.attr('name')
const content = $meta.attr('content')
console.log(key, content)
})
})
Also consider Bahmutov - Cypress Lighthouse Example. There is an SEO section in Lighthouse, and the results for https://www.booking.com/au currently show
No <meta name="viewport"> tag found
jsonLD
There is an example of a jsonLD test here: cypress-automated-test-for-seo-data
it("Verify jsonLD structured data - simple", () => {
// Query the script tag with type application/ld+json
cy.get("script[type='application/ld+json']").then((scriptTag) => {
// we need to parse the JSON LD from text to a JSON to easily test it
const jsonLD = JSON.parse(scriptTag.text());
// once parsed we can easily test for different data points
expect(jsonLD["@context"]).equal("https://schema.org");
expect(jsonLD.author).length(2);
// Cross referencing SEO data between the page title and the headline
// in the jsonLD data, great for dynamic data
cy.title().then((currentPageTitle) =>
expect(jsonLD["headline"]).equal(currentPageTitle)
)
})
})
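For the cy.request() route specifically, the two values the question asks about can be read from the parsed document without loading the page in a browser. The following is a minimal sketch, not a definitive implementation: extractSeoFields is a hypothetical helper name, and it assumes doc is the Document returned by new DOMParser().parseFromString(response.body, 'text/html').

```javascript
// Sketch: read the og:title content and the JSON-LD headline from a parsed document.
// `extractSeoFields` is a hypothetical helper name, not a Cypress or DOM API.
function extractSeoFields(doc) {
  // <meta property="og:title" content="..."> is matched by its property attribute
  const meta = doc.querySelector('meta[property="og:title"]');
  const ogTitle = meta ? meta.getAttribute('content') : null;

  // The JSON-LD payload is just the text of the script tag, so JSON.parse is enough
  const script = doc.querySelector('script[type="application/ld+json"]');
  let headline = null;
  if (script) {
    try {
      headline = JSON.parse(script.textContent).headline ?? null;
    } catch (error) {
      headline = null; // malformed JSON-LD is reported as missing
    }
  }
  return { ogTitle, headline };
}
```

Inside the .then(response => { ... }) callback of cy.request(), the result can then be asserted with expect(extractSeoFields(doc).ogTitle).to.equal('Blue zone booking area').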

Parse html from JSON file

I'm trying to get HTML tags to work in a JSON file that I fetch via JS.
I want the returned <strong> tag to take effect when I render the value on the page. How would I do that?
Sample of the json:
{
"header_title": "<strong>test</strong>"
}
JS:
const newTranslations = await fetchTranslationsFor(
newLocale,
);
async function fetchTranslationsFor(newLocale) {
const response = await fetch('/lang/en.json');
return await response.json();
}
To render it I do this (pseudocode):
element.innerText = json.myprop;
Change innerText to innerHTML. Assigning to innerText treats the string as plain text, so the markup characters are displayed literally. Assigning to innerHTML parses the string as HTML and renders it.
element.innerHTML = json.myprop;
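To make the difference concrete, here is a small sketch of what "treated as plain text" amounts to in markup terms. escapeHtml is a hypothetical helper written for illustration, not a browser API. It produces the markup a browser would serialize for a text node holding the same string.

```javascript
// Sketch: the markup equivalent of assigning a string as text.
// `escapeHtml` is a hypothetical helper, not part of the DOM.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;') // must run first so later entities are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

const value = '<strong>test</strong>';
// What the page effectively contains after element.innerText = value
console.log(escapeHtml(value)); // prints "&lt;strong&gt;test&lt;/strong&gt;"
// What the page contains after element.innerHTML = value
console.log(value);
```

Because innerHTML parses whatever it is given, it should only be used with strings you control, such as your own translation files.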

Translate all innerText in document html with google translate api

I am making a request to this URL to translate text from English to Spanish:
URL: https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=es&dt=t&q=Hello
This does return the text translated to Spanish. Now I want to dynamically collect all the innerText in the document body and then put the translated text back. How can I do this?
In simple words, I want to dynamically translate the website with a button click.
This is my example code to start:
let textToBeTranslate =["hello","thanks","for","help me"]
var url = "https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=es&dt=t&q="+textToBeTranslate;
fetch(url)
.then(data => data.json()).then(data => {
//Text translated to spanish
var textTranslated = data[0][0][0].split(", ");
console.log(textTranslated)
//output: ["hola gracias por ayudarme"]
//Now I want to dynamically put the translated text back into the body tag
}).catch(error => {
console.error(error)
});
Try this:
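const translateElement = async element => {
  // Only the first child node is read, so nested child elements are left intact
  const
    elementNode = element.childNodes[0],
    sourceText = elementNode && elementNode.nodeValue;
  // nodeValue is null for element nodes; also skip whitespace-only text
  if (sourceText && sourceText.trim()) {
    try {
      const
        // Encode the text so spaces, '&' and '#' cannot break the query string
        url = 'https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=es&dt=t&q=' + encodeURIComponent(sourceText),
        response = await fetch(url),
        result = await response.json(),
        translatedText = result[0][0][0];
      elementNode.nodeValue = translatedText;
    } catch (error) {
      console.error(error);
    }
  }
}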
For a single element - Just call it, like this:
(async () => await translateElement(document.body))();
For all elements in the DOM - You will need to recursively go over all elements starting from the desired parent tag (body, in your case), and call the above function for each element, like this:
(async () => {
const
parent = 'body',
selector = `${parent}, ${parent} *`,
elements = [...document.querySelectorAll(selector)],
promises = elements.map(translateElement);
await Promise.all(promises);
})();
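A note on the request URL: the text passed in q has to be URL-encoded, otherwise characters such as spaces, & or # can truncate or corrupt the query. The following is a minimal sketch using the standard URL API. buildTranslateUrl is a hypothetical helper name, and the endpoint and its parameters are simply copied from the question; it is not a documented API, so how it behaves is an assumption.

```javascript
// Sketch: build the translate request URL with proper query encoding.
// `buildTranslateUrl` is a hypothetical helper; endpoint and parameters come from the question.
function buildTranslateUrl(text, source = 'en', target = 'es') {
  const url = new URL('https://translate.googleapis.com/translate_a/single');
  url.searchParams.set('client', 'gtx');
  url.searchParams.set('sl', source); // source language
  url.searchParams.set('tl', target); // target language
  url.searchParams.set('dt', 't');
  url.searchParams.set('q', text);    // searchParams encodes the text for us
  return url.toString();
}

console.log(buildTranslateUrl('thanks & help me'));
```

Using searchParams instead of string concatenation also avoids the accidental array-to-string conversion in the question's "...&q=" + textToBeTranslate.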
Remarks:
I used childNodes[0].nodeValue instead of innerHTML or innerText to keep the child elements.
Note that going over the entire DOM is not recommended and can lead to problems, such as changing the contents of script and style tags.
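One way to reduce that risk is to collect only the text nodes and skip script and style subtrees entirely. This is a sketch under assumptions, not part of the original answer: collectTextNodes is a hypothetical helper name, and it relies only on the standard nodeType, nodeName, nodeValue and childNodes properties.

```javascript
// Sketch: gather translatable text nodes, skipping <script> and <style> subtrees.
// `collectTextNodes` is a hypothetical helper name.
const SKIPPED_TAGS = new Set(['SCRIPT', 'STYLE']);
const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE
const TEXT_NODE = 3;    // same value as Node.TEXT_NODE

function collectTextNodes(root, found = []) {
  for (const child of root.childNodes) {
    if (child.nodeType === TEXT_NODE) {
      // Ignore whitespace-only nodes such as indentation between tags
      if (child.nodeValue.trim() !== '') found.push(child);
    } else if (child.nodeType === ELEMENT_NODE && !SKIPPED_TAGS.has(child.nodeName)) {
      collectTextNodes(child, found); // descend into ordinary elements only
    }
  }
  return found;
}
```

In a page, collectTextNodes(document.body) returns the nodes whose nodeValue can be translated and written back one by one, which also handles elements that contain several text nodes.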

Read environment variable value with cheerio

I am parsing a webpage and trying to get the text values from it.
I am using cheerio to be able to do this with node.js.
Currently, whenever I parse a tag it returns {{status}}. This is because the value is an environment variable, but I want to be able to read the actual value (in this case it is "2").
This is what I have currently got:
const rp = require('request-promise');
const url = 'my url';
const $ = require('cheerio');
rp(url)
.then(function(html){
//success!
console.log($('.class-name div', html).text());
})
.catch(function(err){
//handle error
});
I have also tried using .html() and .contents(), but still no success.
Do I have to change the second parameter in $('.class-name DIV', <PARAMETER>) to achieve what I am after?
You haven't provided the URL or the HTML you're trying to parse, so this answer is general.
With cheerio, a selector call has this format:
$( selector, [context], [root] )
This searches for selector inside context, within the root element (usually the HTML document string). selector and context can each be a string expression, a DOM element, an array of DOM elements, or a cheerio object.
Meanwhile:
$(selector).text() => Return innerText of the selector
$(selector).html() => Return innerHTML of the selector
$(selector).attr('class') => Return value of class attribute
But the cheerio parser can be difficult to debug.
I've used cheerio for a while, and sometimes it can be a headache.
So I found jsonframe-cheerio, a package that parses the HTML tags for you.
As the working example below shows, it works well on top of cheerio.
It takes a format called a frame and uses it to extract innerText and attribute values, and can even filter or match regular expressions.
HTML source (simplified for readability)
https://www.example.com
<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p>More information...</p>
</div>
</body>
</html>
Cheerio
const request = require ('request-promise')
const cheerio = require ('cheerio')
const jsonfrm = require ('jsonframe-cheerio')
const url = 'https://www.example.com'
let $ = null
;(async () => {
try {
const response = await request(url)
$ = cheerio.load (response)
jsonfrm($)
let frame = {
articles : { // This key collects every element that matches the selector below
_s : "body > div", // The root of every repeating item
_d : [{
"titling" : "h1", // The innerText of article h1
"excerpt" : "p", // The innerText of article content p
"linkhref": "a[href] @ href" // The value of href attribute within a link
}]
}
}
const displayResult = $('body').scrape(frame, { string: true } )
console.log ( displayResult )
} catch ( error ) {
console.log ('ERROR: ', error)
}
})()
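On the original symptom: the page is not shown, so this is an assumption, but a literal {{status}} in the fetched HTML usually means the value is a template placeholder that the page's own JavaScript fills in after loading. cheerio only parses the HTML text it is given and does not run scripts, so no selector or context argument will turn the placeholder into "2". The sketch below checks whether that is what is happening; findUnrenderedPlaceholders is a hypothetical helper name.

```javascript
// Sketch: list template placeholders such as {{status}} left in raw HTML.
// `findUnrenderedPlaceholders` is a hypothetical helper name.
function findUnrenderedPlaceholders(html) {
  // Matches {{ name }} and captures the expression between the braces
  const pattern = /\{\{\s*([^{}]+?)\s*\}\}/g;
  return Array.from(html.matchAll(pattern), (match) => match[1]);
}

const html = '<div class="class-name"><div>{{status}}</div></div>';
console.log(findUnrenderedPlaceholders(html)); // prints [ 'status' ]
```

If the list is not empty, the real value has to come from somewhere else, for example the request the page makes to load its data, or a tool that executes the page's JavaScript before the HTML is read.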

Selecting an html node's text content with htmlparser2 in Node.js

I want to parse some html with htmlparser2 module for Node.js. My task is to find a precise element by its ID and extract its text content.
I have read the documentation (quite limited) and I know how to set up my parser with the onopentag function, but it only gives access to the tag name and its attributes (I cannot see the text). The ontext function extracts all text nodes from the given HTML string, but ignores all markup.
So here's my code.
const htmlparser = require("htmlparser2");
const file = '<h1 id="heading1">Some heading</h1><p>Foobar</p>';
const parser = new htmlparser.Parser({
onopentag: function(name, attribs){
if (attribs.id === "heading1"){
console.log(/*how to extract text so I can get "Some heading" here*/);
}
},
ontext: function(text){
console.log(text); // Some heading \n Foobar
}
});
parser.parseComplete(file);
I expect the output to be 'Some heading'. I believe there is an obvious solution, but somehow it escapes me.
Thank you.
You can do it like this using the library you asked about:
const htmlparser = require('htmlparser2');
const domUtils = require('domutils');
const file = '<h1 id="heading1">Some heading</h1><p>Foobar</p>';
var handler = new htmlparser.DomHandler(function(error, dom) {
if (error) {
console.log('Parsing had an error');
return;
} else {
const item = domUtils.findOne(element => {
const matches = element.attribs.id === 'heading1';
return matches;
}, dom);
if (item) {
console.log(item.children[0].data);
}
}
});
var parser = new htmlparser.Parser(handler);
parser.write(file);
parser.end();
The output you will get is "Some heading". However, in my opinion you will find it easier to use a querying library that is meant for this. You don't need to, of course, but note how much simpler the code becomes; see: How do I get an element name in cheerio with node.js
Cheerio, or a querySelector API such as https://www.npmjs.com/package/node-html-parser if you prefer native-style query selectors, is much leaner.
For comparison, here is the same task with node-html-parser, which supports simple querying:
const { parse } = require('node-html-parser');
const file = '<h1 id="heading1">Some heading</h1><p>Foobar</p>';
const root = parse(file);
const text = root.querySelector('#heading1').text;
console.log(text);
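For completeness, the callback style from the question can also be made to work by remembering whether the parser is currently inside the target element. This is a minimal sketch, not the library's recommended method: createTextCollector is a hypothetical helper name, and it assumes the parser reports a matching onclosetag for every onopentag.

```javascript
// Sketch: callbacks that keep only the text found inside the element with a given id.
// `createTextCollector` is a hypothetical helper name, not part of htmlparser2.
function createTextCollector(targetId) {
  let depth = 0; // 0 = outside the target element, > 0 = inside it
  let text = '';
  return {
    handlers: {
      onopentag(name, attribs) {
        if (depth > 0) depth += 1;                   // nested tag inside the target
        else if (attribs.id === targetId) depth = 1; // entering the target
      },
      ontext(chunk) {
        if (depth > 0) text += chunk;                // keep text only while inside
      },
      onclosetag() {
        if (depth > 0) depth -= 1;                   // leaving a nested tag or the target
      },
    },
    getText: () => text,
  };
}

// Intended use with htmlparser2 (requires the package to be installed):
//   const collector = createTextCollector('heading1');
//   const parser = new htmlparser.Parser(collector.handlers);
//   parser.write(file);
//   parser.end();
//   console.log(collector.getText());
```

Tracking depth rather than a simple flag keeps the text of nested tags such as <em> inside the heading, and stops collecting as soon as the heading itself closes.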
