How to use Cheerio in NodeJS to scrape img srcs - javascript

This is the code:
const axios = require("axios");
const { load } = require("cheerio");

const happyUrl = "https://www.imdb.com/list/ls008985796/";

const getHappyMovies = async () => {
  try {
    const movieData = [];
    let title;
    let description;
    let imageUrl;
    const response = await axios.get(happyUrl);
    const $ = load(response.data);
    const movies = $(".lister-item");
    movies.each(function () {
      title = $(this).find("h3 a").text();
      description = $(this).find("p").eq(1).text();
      imageUrl = $(this).find("a img").attr("src");
      movieData.push({ title, description, imageUrl });
    });
    console.log(movieData);
  } catch (e) {
    console.error(e);
  }
};
Here's the output I'm receiving:
And this is the website I'm scraping from
Now I need to get the src of that image, but it's returning something else, as shown in the output image.

The golden rule of Cheerio is that it doesn't run JavaScript. As a result, devtools is often misleading, because it shows the state of the page after JS has run.
Instead, look at view-source:, disable JS, or log the raw HTML response from your terminal to get a more accurate sense of what's actually on the page (or not).
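For example, a quick way to see exactly what the server sends back (a minimal sketch using the same list URL as the question):

const axios = require("axios");

// Dump the raw server response; this is the markup Cheerio will actually see,
// before any client-side JS has had a chance to run.
axios
  .get("https://www.imdb.com/list/ls008985796/")
  .then(({ data }) => console.log(data));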
Looking at the source:
<img alt="Tokyo Story"
class="loadlate"
loadlate="https://m.media-amazon.com/images/M/MV5BYWQ4ZTRiODktNjAzZC00Nzg1LTk1YWQtNDFmNDI0NmZiNGIwXkEyXkFqcGdeQXVyNzkwMjQ5NzM@._V1_UY209_CR2,0,140,209_AL_.jpg"
data-tconst="tt0046438"
height="209"
src="https://m.media-amazon.com/images/S/sash/4FyxwxECzL-U1J8.png"
width="140" />
You can see src= is a placeholder image but loadlate is the actual URL. When the image is scrolled into view, JS kicks in and lazily loads the loadlate URL into the src attribute, leading to your observed devtools state.
The solution is to use .attr("loadlate"):
const axios = require("axios");
const cheerio = require("cheerio"); // 1.0.0-rc.12

const url = "<Your URL>";

const getHappyMovies = () =>
  axios.get(url).then(({ data: html }) => {
    const $ = cheerio.load(html);
    return [...$(".lister-item")].map(e => ({
      title: $(e).find(".lister-item-header a").text(),
      description: $(e).find(".lister-item-content p").eq(1).text().trim(),
      imageUrl: $(e).find(".lister-item-image img").attr("loadlate"),
    }));
  });

getHappyMovies().then(movies => console.log(movies));
Note that I'm using class selectors which are more specific than plain tags.
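If a page mixes lazy-loaded and eagerly loaded images, loadlate may not always be present. A hedged fallback, inside the .map callback above, is to prefer loadlate and fall back to src:

// Prefer the lazy-load URL; fall back to src for images that aren't lazy-loaded
const imgEl = $(e).find(".lister-item-image img");
const imageUrl = imgEl.attr("loadlate") || imgEl.attr("src");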

Related

Is there a way I can get links from pages other than the DOM in a firefox extension?

I'm trying to get links from other pages, but since they aren't loaded I can't do something like this:
var urlsFound = [];
var url = "https://www.youtube.com";

for (var i = url.links.length; i --> 0;) {
  urlsFound.push(url.links[i].href);
}

urlsFound.forEach((item, index) => {
  console.log(`${index} : ${item}`);
});
Now of course the code above won't work, but it's there just to better illustrate what I'm trying to accomplish. How can I get links from other pages without loading them into the DOM, while having it work in a Firefox extension? I'm willing to use any technology, provided it works within a Firefox extension.
You need to make an HTTP request using something like fetch or axios, and then parse the HTML and extract the links using cheerio. Here's an example.
You can see how to use axios and cheerio here:
https://www.npmjs.com/package/axios
https://cheerio.js.org/
const axios = require("axios").default; // can use fetch etc.
const cheerio = require("cheerio");

function getLinks(url) {
  // Fetch the raw HTML, load it into cheerio, and return every anchor's href
  return axios
    .get(url)
    .then(res => res.data)
    .then(HTML => {
      const $ = cheerio.load(HTML);
      const links = $("a").map((i, el) => $(el).attr("href")).get();
      return links;
    });
}
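Note that cheerio is a Node library, so inside a Firefox extension you would either bundle it or rely on the browser's built-in fetch and DOMParser instead. A minimal sketch of that approach (the URL is just an example):

async function getLinks(url) {
  const response = await fetch(url);
  const html = await response.text();
  // DOMParser builds a detached document; nothing is rendered or executed
  const doc = new DOMParser().parseFromString(html, "text/html");
  // getAttribute keeps the raw href values instead of resolving them
  return [...doc.querySelectorAll("a")].map(a => a.getAttribute("href"));
}

getLinks("https://www.youtube.com").then(links =>
  links.forEach((item, index) => console.log(`${index} : ${item}`))
);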

Insert new key value pair inside an array of objects, but value is created by axios.get

So I've been working on a scraper. Everything was going well until I tried scraping data for each individual link.
To explain: I've got a scraper which collects data about apartments. The first URL is the page where the articles are located (approx. 29-30 should be fetched). That page doesn't contain the square meters, so I need to run another scraper for each scraped link and get the square meters from there.
Here is the code that I have:
const axios = require('axios');
const cheerio = require('cheerio');

const url = `https://www.olx.ba/pretraga?vrsta=samoprodaja&kategorija=23&sort_order=desc&kanton=9&sacijenom=sacijenom&stranica=2`;

axios.get(url).then((response) => {
  const articles = [];
  const $ = cheerio.load(response.data);
  $('div[id="rezultatipretrage"] > div')
    .not('div[class="listitem artikal obicniArtikal i index"]')
    .not('div[class="obicniArtikal"]')
    .each((index, element) => {
      $('span[class="prekrizenacijena"]').remove();
      const getLink = $(element).find('div[class="naslov"] > a').attr('href');
      const getDescription = $(element)
        .find('div[class="naslov"] > a > p')
        .text();
      const getPrice = $(element)
        .find('div[class="datum"] > span')
        .text()
        .replace(/\.| ?KM$/g, '')
        .replace(' ', '');
      const getPicture = $(element)
        .find('div[class="slika"] > img')
        .attr('src');
      articles[index] = {
        id: getLink.substring(27, 35),
        link: getLink,
        description: getDescription,
        price: getPrice,
        picture: getPicture,
      };
    });
  articles.map((item, index) => {
    axios.get(item.link).then((response) => {
      const $ = cheerio.load(response.data);
      const sqa = $('div[class="df2 "]').first().text();
    });
  });
  console.log(articles);
});
The first part of the code works as it should; it's the second part I've been struggling with.
I'm mapping over articles because, for each link, I need to pass it to an axios call and get the data about square meters.
My desired output would be the updated articles: the same objects and key/values as before, but with an added key sqm whose value is the scraped square meters.
Any ideas on how to achieve this?
Thanks!
You could simply add the information about the square meters to the current article/item, something like:
const articlePromises = Promise.all(articles.map((item) => {
  return axios.get(item.link).then((response) => {
    const $ = cheerio.load(response.data);
    const sqa = $('div[class="df2 "]').first().text();
    item.sqm = sqa;
  });
}));

articlePromises.then(() => {
  console.log(articles);
});
Note that you need to wait for all mapped promises to resolve, before you log the resulting articles.
Also note that using async/await you could rewrite your code to be a bit cleaner, see https://javascript.info/async-await.
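For illustration, a hedged sketch of what that second half could look like with async/await (same selectors as above, inside an async function once articles has been built):

await Promise.all(articles.map(async (item) => {
  const response = await axios.get(item.link);
  const $ = cheerio.load(response.data);
  // Attach the scraped square meters to the existing article object
  item.sqm = $('div[class="df2 "]').first().text();
}));

console.log(articles);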

Xpath doesn't recognize anchor tag?

I'm running some Node.js code to scrape a website and return some text from this part of the html:
And here's the code I'm using to get it
const fs = require('mz/fs');
const xpath = require('xpath');
const parse5 = require('parse5');
const xmlser = require('xmlserializer');
const dom = require('xmldom').DOMParser;
const axios = require('axios');

(async () => {
  const response = await axios.get('https://www.aritzia.com/en/product/sculpt-knit-tank-%28arjun-knit-top%29/66139.html?dwvar_66139_color=17388');
  const html = response.data;
  const document = parse5.parse(html.toString());
  const xhtml = xmlser.serializeToString(document);
  const doc = new dom().parseFromString(xhtml);
  const select = xpath.useNamespaces({ "x": "http://www.w3.org/1999/xhtml" });
  const nodes = select("//x:div[contains(@class, 'pdp-product-brand')]/*/text()", doc);
  console.log(nodes.length ? nodes[0].nodeValue : nodes.length);
})();
The code above works as expected -- it prints Babaton.
But when I swap out the XPath above for one that uses a instead of * (i.e. //x:div[contains(@class, 'pdp-product-brand')]/a/text()), it instead tells me that nodes.length === 0.
I would expect it to give the same result, because the div it points to does in fact have a child anchor tag (see screenshot above). I'm just confused why it doesn't work with a and was wondering if anybody else knew the answer. Thanks!

Web Scrape with Puppeteer within a table

I am trying to scrape this page.
https://www.psacard.com/Pop/GetItemTable?headingID=172510&categoryID=20019&isPSADNA=false&pf=0&_=1583525404214
I want to be able to find the grade count for PSA 9 and 10. If we look at the HTML of the page, you will notice that PSA does a very bad job (IMO) at displaying the data. Every TR is a player. And the first TD is a card number. Let's just say I want to get Card Number 1 which in this case is Kevin Garnett.
There are a total of four cards, so those are the only four cards I want to display.
Here is the code I have.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://www.psacard.com/Pop/GetItemTable?headingID=172510&categoryID=20019&isPSADNA=false&pf=0&_=1583525404214");
  const tr = await page.evaluate(() => {
    const tds = Array.from(document.querySelectorAll('table tr'));
    return tds.map(td => td.innerHTML);
  });
  const getName = tr.map(name => {
    //const thename = Array.from(name.querySelectorAll('td.card-num'))
    console.log("\n\n" + name + "\n\n");
  });
  await browser.close();
})();
I get each TR printed, but I can't seem to dive into those TRs. You can see I have a line commented out; I tried that but got an error. Right now I'm not getting the player dynamically. The easiest approach, I think, would be a function that selects the row where TD.card-num == 1, which in this case is Kevin Garnett.
Any help with this would be amazing.
Thanks
Short answer: you can just copy and paste that table into Excel and it pastes perfectly.
Long answer: if I'm understanding this correctly, you'll need to map over all of the tr elements and then, within each tr, map over its td elements. I use cheerio as a helper. To complete it with puppeteer, just do html = await page.content() and then pass html into the cleaner I've written below:
const cheerio = require("cheerio");
const fs = require("fs");

const test = (html) => {
  const $ = cheerio.load(html);
  const array = $("tr").map((index, element) => {
    const card_num = $(element).find(".card-num").text().trim();
    const player = $(element).find("strong").text();
    const mini_array = $(element).find("td").map((ind, elem) => {
      const hello = $(elem).find("span").text().trim();
      return hello;
    });
    return {
      card_num,
      player,
      column_nine: mini_array[13],
      column_ten: mini_array[14],
      total: mini_array[15]
    };
  });
  console.log(array[2]);
};

// e.g. run it against a saved copy of the page:
const data = fs.readFileSync("./test.html");
test(data.toString());
The code above will output the following:
{
  card_num: '1',
  player: 'Kevin Garnett',
  column_nine: '1-0',
  column_ten: '0--',
  total: '100'
}
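For completeness, a hedged sketch of the puppeteer glue mentioned above (it assumes the test helper from the previous snippet is defined in the same file):

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://www.psacard.com/Pop/GetItemTable?headingID=172510&categoryID=20019&isPSADNA=false&pf=0&_=1583525404214");
  // Grab the rendered HTML and hand it to the cheerio-based cleaner above
  const html = await page.content();
  test(html);
  await browser.close();
})();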

StaleElementReferenceError on iterations

My application gets a list of IDs from the db. I iterate over these with a cursor and, for every ID, I plug it into a URL and use Selenium to get specific items on the page. This does a search on a keyword and gets the most relevant item for that search. There are around 1000 results from the db. At random iterations, one of the driver actions will throw a StaleElementReferenceError with the full message of:
stale element reference: element is not attached to the page document\n (Session info: chrome=77.0.3865.75)
Looking at the official docs I can see that the 2 common causes for this are:
The element has been deleted entirely.
The element is no longer attached to the DOM.
With the former being the most frequent cause.
index.js
const { MongoClient, ObjectID } = require('mongodb')
const fs = require('fs')
const path = require('path')
const { Builder, Capabilities, until, By } = require('selenium-webdriver')
const chrome = require('selenium-webdriver/chrome')
require('dotenv').config()

async function init() {
  try {
    const chromeOpts = new chrome.Options()
    const ids = fs.readFileSync(path.resolve(__dirname, '..', 'data', 'primary_ids.json'), 'utf8')
    const client = await MongoClient.connect(process.env.DB_URL || 'mongodb://localhost:27017/test', {
      useNewUrlParser: true
    })
    const db = client.db(process.env.DB_NAME || 'test')
    const productCursor = db.collection('product').find(
      {
        accountId: ObjectID(process.env.ACCOUNT_ID),
        primaryId: {
          $in: JSON.parse(ids)
        }
      },
      {
        _id: 1,
        primaryId: 1
      }
    )
    const resultsSelector = 'body #wrapper div.src-routes-search-style__container--2g429 div.src-routes-search-style__products--3rsz9'
    const mostRelevantSelector = `${resultsSelector}
      > div:nth-child(2)
      > div.src-routes-search-product-item-raw-style__product--3vH_O:nth-child(1)`
    const titleContainerSelector = `${mostRelevantSelector}
      > div.src-routes-search-product-item-raw-style__mainPart--1HEWx
      > div.src-routes-search-product-item-raw-style__containerText--3NefD
      > div.src-routes-search-product-item-raw-style__description--3swql
      > div.src-routes-search-product-item-raw-style__titleContainer--tazkH`
    const productImageSelector = `${mostRelevantSelector}
      > div.src-routes-search-product-item-raw-style__mainPart--1HEWx
      > div.src-routes-search-product-item-raw-style__containerImages--1PfdF
      > a.src-routes-search-product-item-raw-style__productImage--1Y42Y
      > img`
    const linkSelector = `${titleContainerSelector} > a`
    const primaryIdSelector = `${titleContainerSelector} > p`
    chromeOpts.setChromeBinaryPath('/usr/local/bin')
    const driver = await new Builder()
      .withCapabilities(Capabilities.chrome())
      .forBrowser('chrome')
      .build()
    let newProds = {}
    let product
    let i = 0
    while (await productCursor.hasNext()) {
      i += 1
      product = await productCursor.next()
      let searchablePrimaryId = product.primaryId
      let link
      let primaryId
      let pId
      let href
      let img
      let imgSrc
      if (product.primaryId.includes('#')) {
        searchablePrimaryId = product.primaryId.substr(0, product.primaryId.indexOf('#'))
      }
      if (searchablePrimaryId.includes('-')) {
        searchablePrimaryId = searchablePrimaryId.substr(0, searchablePrimaryId.indexOf('-'))
      }
      await driver.get(`https://icecat.biz/en/search?keyword=${encodeURIComponent(searchablePrimaryId.toLowerCase())}`)
      link = await driver.wait(until.elementLocated(By.css(linkSelector)), 10000) // wait 10 seconds
      img = await driver.wait(until.elementLocated(By.css(productImageSelector)), 10000)
      imgSrc = await img.getAttribute('src')
      primaryId = await driver.wait(until.elementLocated(By.css(primaryIdSelector)), 10000)
      pId = await primaryId.getText()
      href = await link.getAttribute('href')
      const iceCatId = href.substr(href.lastIndexOf('-') + 1, href.length)
      const _iceCatId = iceCatId.substr(0, iceCatId.indexOf('.html'))
      const idFound = (searchablePrimaryId.toUpperCase() === pId.toUpperCase()) && !imgSrc.includes('logo-fullicecat')
      newProds[product._id.toString()] = {
        primaryId: product.primaryId,
        iceCatId: idFound ? _iceCatId : 'N/A'
      }
    }
    const foundProducts = Object.values(newProds).filter(prod => prod.iceCatId !== 'N/A')
    console.log(`\nFound ${foundProducts.length}/${JSON.parse(ids).length}`)
    fs.writeFileSync(path.resolve(__dirname, '..', 'data', 'new_products.json'), JSON.stringify(newProds, null, 4), 'utf8')
    driver.quit()
  } catch (err) {
    throw err
  }
}

init()
  .then(res => {
    console.log(res)
  })
  .catch(err => {
    console.error(err)
  })
To debug, I put a try...catch around each of the driver actions to see which specific action is failing, but that didn't help, as it was never a consistent action that failed. For example, sometimes it would be one of the elementLocated lines, other times the getAttribute action.
If it's the latter, that is why I am confused: surely Selenium has found the element on the page (i.e. link) but is unable to do getAttribute('src') on that element? I imagine I must be doing something wrong with how I am setting up Selenium to handle the iterations. The iterations never get higher than 110.
In your case the second cause applies: the element is no longer attached to the DOM. If a WebElement is located and the DOM is refreshed afterwards, that element becomes stale even if the DOM hasn't changed; the same locator will return a new WebElement.
Normally, driver.get() blocks until the page is fully loaded; however, this site runs JavaScript to load the search results. You can test this by running document.readyState in the developer tools console: you will see "complete" while the search results are still loading.
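You can also check this from the script itself rather than devtools. A small sketch, assuming the same driver instance as above:

// readyState reports "complete" as soon as the initial document has loaded,
// even though the JS-rendered search results may still be on their way.
const readyState = await driver.executeScript('return document.readyState');
console.log(readyState); // typically "complete" while results are still loading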
The page shows a spinner before the results are rendered; hopefully it is enough to wait for the spinner to appear and then become stale before scraping the page:
await driver.get(`https://icecat.biz/en/search?keyword=${encodeURIComponent(searchablePrimaryId.toLowerCase())}`)
// Wait for the loading spinner to be present, then wait for it to go stale (i.e. be removed)
let spinner = await driver.wait(until.elementLocated(By.className('src-routes-search-style__loader---acti')), 10000)
await driver.wait(until.stalenessOf(spinner), 10000)
link = await driver.wait(until.elementLocated(By.css(linkSelector)), 10000)
You don't have a wait for the Ajax requests to finish. The website retrieves and refreshes the DOM once you scroll to the end, and it also keeps calling the index every few seconds, so the DOM probably keeps updating. You could probably hold the AJAX requests, get your results, process them, and enable AJAX again.
Could you try removing await from imgSrc = await img.getAttribute('src')? The wait for img is already handled in the previous line.
