jsdom get text without image - javascript

I am trying to use jsdom to get a description from an article.
The html code of the article is
<p><img src="http://localhost/bibi_cms/cms/app/images/upload_photo/1506653694941.png"
style="width: 599.783px; height: 1066px;"></p>
<p>testestestestestestestest<br></p>
Here is my Node.js code for getting the description from the content. It picks up the first p tag, which contains only the image, and prints an empty string. I just want to get the content of the first p tag that contains no image. Can anyone help me with this?
const dom = new JSDOM(results[i].content.toString());
if (dom.window.document.querySelector("p") !== null) {
    results[i].description = dom.window.document.querySelector("p").textContent;
}

Ideally you could test each child node against Node.TEXT_NODE, but that was erroring for me on Node.js for some reason, so here is an alternative (using gulp just for testing purposes):
const gulp = require("gulp");
const fs = require('fs');
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const html = 'yourHTML.html';

gulp.task('default', ['getText']);

gulp.task('getText', function () {
    const dirty = fs.readFileSync(html, 'utf8');
    const dom = new JSDOM(dirty);
    const pList = dom.window.document.querySelectorAll("p");
    pList.forEach(function (el, index, list) {
        // firstElementChild is null for a <p> holding only a text node,
        // so guard before reading nodeName
        if (el.firstElementChild === null || el.firstElementChild.nodeName !== "IMG") {
            console.log(el.textContent);
        }
    });
    return;
});
So the key is the test
el.firstElementChild.nodeName !== "IMG"
which works if you know that each p tag starts with either an img tag or text. In your case the firstElementChild.nodeName that remains is actually a br tag, but I assume that isn't necessarily always there at the end of the text.
You could also test against an empty string, like so:
if (el.textContent.trim() !== "") {} // trim() so whitespace-only paragraphs don't pass
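
Putting both ideas back into the original snippet, a minimal sketch (assuming results[i].content holds HTML like the example above) that grabs the first paragraph without an image would be:

const { JSDOM } = require("jsdom");

const dom = new JSDOM(results[i].content.toString());
const paragraphs = dom.window.document.querySelectorAll("p");
for (const p of paragraphs) {
    // a <p> that only wraps an <img> has no text, so skip it
    if (p.textContent.trim() !== "") {
        results[i].description = p.textContent.trim();
        break;
    }
}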

Related

How to use Cheerio in NodeJS to scrape img srcs

This is the code:
const axios = require("axios");
const { load } = require("cheerio");

const happyUrl = "https://www.imdb.com/list/ls008985796/";

const getHappyMovies = async () => {
    try {
        const movieData = [];
        let title;
        let description;
        let imageUrl;
        const response = await axios.get(happyUrl);
        const $ = load(response.data);
        const movies = $(".lister-item");
        movies.each(function () {
            title = $(this).find("h3 a").text();
            description = $(this).find("p").eq(1).text();
            imageUrl = $(this).find("a img").attr("src");
            movieData.push({ title, description, imageUrl });
        });
        console.log(movieData);
    } catch (e) {
        console.error(e);
    }
};
Here's the output I'm receiving (shown in a screenshot in the original post): every imageUrl comes back as the same placeholder URL instead of the movie poster. The website I'm scraping is the IMDb list linked above.
I need to get the src of each image, but it's returning something else, as shown in the output.
The golden rule of Cheerio is "it doesn't run JS". As a result, devtools is often inaccurate since it shows the state of the page after JS runs.
Instead, either look at view-source:, disable JS or look at the HTML response printed from your terminal to get a more accurate sense of what's actually on the page (or not).
Looking at the source:
<img alt="Tokyo Story"
class="loadlate"
loadlate="https://m.media-amazon.com/images/M/MV5BYWQ4ZTRiODktNjAzZC00Nzg1LTk1YWQtNDFmNDI0NmZiNGIwXkEyXkFqcGdeQXVyNzkwMjQ5NzM#._V1_UY209_CR2,0,140,209_AL_.jpg"
data-tconst="tt0046438"
height="209"
src="https://m.media-amazon.com/images/S/sash/4FyxwxECzL-U1J8.png"
width="140" />
You can see src= is a placeholder image but loadlate is the actual URL. When the image is scrolled into view, JS kicks in and lazily loads the loadlate URL into the src attribute, leading to your observed devtools state.
The solution is to use .attr("loadlate"):
const axios = require("axios");
const cheerio = require("cheerio"); // 1.0.0-rc.12

const url = "<Your URL>";

const getHappyMovies = () =>
    axios.get(url).then(({ data: html }) => {
        const $ = cheerio.load(html);
        return [...$(".lister-item")].map(e => ({
            title: $(e).find(".lister-item-header a").text(),
            description: $(e).find(".lister-item-content p").eq(1).text().trim(),
            imageUrl: $(e).find(".lister-item-image img").attr("loadlate"),
        }));
    });

getHappyMovies().then(movies => console.log(movies));
Note that I'm using class selectors which are more specific than plain tags.
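If a thumbnail is ever served eagerly rather than lazily (an assumption on my part; in the raw responses shown here, the list images carry loadlate), you could fall back to src:

imageUrl: $(e).find(".lister-item-image img").attr("loadlate") || $(e).find(".lister-item-image img").attr("src"),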

Pdfjs can't extract languages other than English properly

I'm trying to extract text and images from some Bengali-language PDF files with the same structure, using pdf.js. The problem is that pdf.js can't output Bengali properly.
Most of the characters are shown as a black square with a question mark inside.
Also, the signs used as vowels in Bengali aren't positioned properly.
For example, [মোঃ রাজু মিয়া] is output as [ �মাঃ রাজু িময়া]. I'm also extracting images, and that part works properly. It would be nice if I could also do the text part in pdf.js, with a custom font or something.
My sample code:
<script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.9.359/pdf.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.9.359/pdf.worker.min.js"></script>
<input type="file" id="file-pdf" accept=".pdf">
<script>
const onLoadFile = async event => {
    try {
        // turn the array buffer into a typed array
        const typedArray = new Uint8Array(event.target.result);
        const loadingPdfDocument = pdfjsLib.getDocument(typedArray);
        const pdfDocumentInstance = await loadingPdfDocument.promise;
        const totalNumPages = pdfDocumentInstance.numPages;
        let pn = 1;
        let page = await pdfDocumentInstance.getPage(pn);
        let textContent = await page.getTextContent();
        var textItems = textContent.items;
        console.log(textItems);
    } catch (error) {
        console.log(error);
    }
};

document.getElementById('file-pdf').addEventListener('change', event => {
    const file = event.target.files[0];
    if (file.type !== 'application/pdf') {
        alert(`File ${file.name} is not a PDF file type`);
        return;
    }
    const fileReader = new FileReader();
    fileReader.onload = onLoadFile;
    fileReader.readAsArrayBuffer(file);
});
</script>
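
For reference, each entry in textContent.items exposes its extracted string on a str property, so the page text can be joined into one string. getDocument also accepts cMapUrl/cMapPacked options that point pdf.js at its bundled character maps (a sketch; the cmaps CDN path is an assumption, and cMaps mainly target CJK encodings, so they may not fix Bengali glyph shaping):

const loadingPdfDocument = pdfjsLib.getDocument({
    data: typedArray,
    // assumed CDN path for pdf.js's bundled character maps
    cMapUrl: 'https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.9.359/cmaps/',
    cMapPacked: true,
});
// ... after getTextContent() as above:
const pageText = textContent.items.map(item => item.str).join(' ');
console.log(pageText);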

Xpath doesn't recognize anchor tag?

I'm running some Node.js code to scrape a website and return some text from the part of the HTML shown in a screenshot in the original post: a div with the class pdp-product-brand whose child anchor holds the brand name.
And here's the code I'm using to get it:
const fs = require('mz/fs');
const xpath = require('xpath');
const parse5 = require('parse5');
const xmlser = require('xmlserializer');
const dom = require('xmldom').DOMParser;
const axios = require('axios');

(async () => {
    const response = await axios.get('https://www.aritzia.com/en/product/sculpt-knit-tank-%28arjun-knit-top%29/66139.html?dwvar_66139_color=17388');
    const html = response.data;
    const document = parse5.parse(html.toString());
    const xhtml = xmlser.serializeToString(document);
    const doc = new dom().parseFromString(xhtml);
    const select = xpath.useNamespaces({ "x": "http://www.w3.org/1999/xhtml" });
    const nodes = select("//x:div[contains(@class, 'pdp-product-brand')]/*/text()", doc);
    console.log(nodes.length ? nodes[0].nodeValue : nodes.length);
})();
The code above works as expected -- it prints Babaton.
But when I swap out the XPath above for one that includes a instead of * (i.e. //x:div[contains(@class, 'pdp-product-brand')]/a/text()), it instead tells me that nodes.length === 0.
I would expect it to give the same result because the div that it's pointing to does in fact have a child anchor tag (see screenshot above). I'm just confused why it doesn't work with a and was wondering if anybody else knew the answer. Thanks!
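
One way to narrow this down (a debugging sketch, not a confirmed diagnosis): serialize what the wildcard step actually matches, since the server's raw HTML often differs from the JS-rendered DOM that devtools shows, and the element inside the div may not be an a in the raw response.

// Reusing select and doc from above: log what the wildcard step really matches
const children = select("//x:div[contains(@class, 'pdp-product-brand')]/*", doc);
children.forEach(node => console.log(node.nodeName));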

Can't download or copy SVG without body element

How do I copy the SVG content of this page to the clipboard?
https://cdn.dribbble.com/assets/dribbble-ball-icon-e94956d5f010d19607348176b0ae90def55d61871a43cb4bcb6d771d8d235471.svg
I get an error at the select() method that looks like this:
Uncaught TypeError: el.select is not a function
at <anonymous>:1:4
This is my code at the moment; it can be run in the console.
function copyClip() {
    const docEl = document.documentElement
    const string = new XMLSerializer().serializeToString(docEl)
    const el = document.createElement('textarea')
    docEl.insertAdjacentElement('beforeend', el)
    el.value = string
    el.select()
    document.execCommand('copy')
}
copyClip()
One answer is to use a different way of copying to the clipboard, although it doesn't explain why the select method in the original code isn't working. This function works:
const docEl = document.documentElement
const string = new XMLSerializer().serializeToString(docEl)
navigator.clipboard.writeText(string).then(
    function () {
        alert('The SVG was copied to your clipboard.')
    },
    function (err) {
        alert('Could not copy:', err)
    }
)
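As for why select() throws: that URL serves a standalone SVG, i.e. an XML document, and in an XML document document.createElement('textarea') creates an element with no namespace rather than an HTML <textarea>, so it never gets a select() method. A sketch of the original approach with the element created in the XHTML namespace (untested on that exact page) would be:

function copyClip() {
    const docEl = document.documentElement
    const string = new XMLSerializer().serializeToString(docEl)
    // createElementNS with the XHTML namespace yields a real HTMLTextAreaElement,
    // even inside an SVG/XML document
    const el = document.createElementNS('http://www.w3.org/1999/xhtml', 'textarea')
    docEl.insertAdjacentElement('beforeend', el)
    el.value = string
    el.select()
    document.execCommand('copy')
}
copyClip()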

Selecting an html node's text content with htmlparser2 in Node.js

I want to parse some HTML with the htmlparser2 module for Node.js. My task is to find a specific element by its ID and extract its text content.
I have read the documentation (which is quite limited), and I know how to set up my parser with the onopentag function, but it only gives access to the tag name and its attributes (I cannot see the text). The ontext function extracts all text nodes from the given HTML string but ignores all markup.
So here's my code.
const htmlparser = require("htmlparser2");
const file = '<h1 id="heading1">Some heading</h1><p>Foobar</p>';
const parser = new htmlparser.Parser({
onopentag: function(name, attribs){
if (attribs.id === "heading1"){
console.log(/*how to extract text so I can get "Some heading" here*/);
}
},
ontext: function(text){
console.log(text); // Some heading \n Foobar
}
});
parser.parseComplete(file);
I expect the output of the function call to be 'Some heading'. I believe there is some obvious solution, but it escapes me.
Thank you.
You can do it like this using the library you asked about:
const htmlparser = require('htmlparser2');
const domUtils = require('domutils');

const file = '<h1 id="heading1">Some heading</h1><p>Foobar</p>';

var handler = new htmlparser.DomHandler(function (error, dom) {
    if (error) {
        console.log('Parsing had an error');
        return;
    } else {
        const item = domUtils.findOne(element => {
            const matches = element.attribs.id === 'heading1';
            return matches;
        }, dom);
        if (item) {
            console.log(item.children[0].data);
        }
    }
});

var parser = new htmlparser.Parser(handler);
parser.write(file);
parser.end();
The output you will get is "Some heading". However, you will, in my opinion, find it easier to use a library that is built for querying: Cheerio, or, if you prefer the native query selectors, a querySelector API such as https://www.npmjs.com/package/node-html-parser. You don't need to switch, of course, but note how much simpler the equivalent is with node-html-parser, which supports querying directly (a Cheerio version is sketched after it):
const { parse } = require('node-html-parser');
const file = '<h1 id="heading1">Some heading</h1><p>Foobar</p>';
const root = parse(file);
const text = root.querySelector('#heading1').text;
console.log(text);
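
For completeness, a minimal Cheerio equivalent (a sketch along the same lines):

const cheerio = require('cheerio');

const file = '<h1 id="heading1">Some heading</h1><p>Foobar</p>';
const $ = cheerio.load(file);
// query by id and read the text content
console.log($('#heading1').text()); // Some heading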
