Unable to scrape certain elements with Cheerio

I'm trying to scrape a product page using Puppeteer and Cheerio. (this page)
I'm using a data-testid attribute to scrape the title and image. The problem is that the title never gets scraped, while the image does every time.
I've tried scraping the title by class name, but that doesn't work either. Does this have something to do with the specific website I'm trying to scrape? Thank you.
My code:
// Load cheerio
const $ = cheerio.load(data);

/* Scrape Product Page */
const product = [];

// Title
$('[data-testid="product-name"]').each(() => {
  product.push({
    title: $(this).text(),
  });
});

// Image
$('[data-testid="product-detail-image"]').each((index, value) => {
  const imgSrc = $(value).attr('src');
  product.push({
    image: imgSrc,
  });
});

As I mentioned in the comments, there's almost no use case I can think of that makes sense with both Puppeteer and Cheerio at once. If the data is static, use Cheerio alongside a simple request library like Axios; otherwise, use Puppeteer and skip Cheerio entirely in favor of native Puppeteer selectors.
One other potential reason to use Puppeteer is if your requests library is being blocked by the server's robot detector, as appears to be the case here.
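For illustration, if the markup were served statically, a minimal Axios + Cheerio version might look like the sketch below (hypothetical; this particular site blocks plain HTTP clients, so the Puppeteer script that follows is the workable route):

const axios = require("axios");
const cheerio = require("cheerio");

(async () => {
  const {data} = await axios.get("https://stockx.com/nike-air-force-1-low-white-07");
  const $ = cheerio.load(data);
  // Use a regular function (or the (i, el) params) so the element is accessible;
  // arrow functions don't get `this` bound by Cheerio's .each()
  const product = [...$('[data-testid="product-name"]')].map(el => ({
    title: $(el).text(),
  }));
  console.log(product);
})();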
This script worked for me:
const puppeteer = require("puppeteer");

let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  const url = "https://stockx.com/nike-air-force-1-low-white-07";
  await page.goto(url);
  const nameSel = '[data-testid="product-name"]';
  await page.waitForSelector(nameSel, {timeout: 60000});
  const name = await page.$eval(nameSel, el => el.textContent);
  const imgSel = '[data-testid="product-detail-image"]';
  await page.waitForSelector(imgSel);
  const src = await page.$eval(imgSel, el => el.src);
  console.log(name, src);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());


How to select specific button in puppeteer

So I'm building a program that scrapes Poshmark webpages and extracts the usernames of each seller on the page!
I want it to go through every page using the 'next' button, but there are 6 buttons, all with the same class name...
Here's the link: https://poshmark.com/category/Men-Jackets_&_Coats?sort_by=like_count&all_size=true&my_size=false
(In my Google Chrome this page has an infinite scroll (hence the scrollToBottom async function I started writing), but I realized that inside Puppeteer's Chrome it has 'next page' buttons.)
The window displays pages 1-5 and then the 'next page' button.
The problem is that all of the buttons share the same HTML class name, so I'm confused about how to differentiate them.
const e = require('express');
const puppeteer = require('puppeteer');

const url = "https://poshmark.com/category/Men-Jackets_&_Coats?sort_by=like_count&all_size=true&my_size=false";
let usernames = [];

const initItemArea = async (page) => {
  const itemArea = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('.tc--g.m--l--1.ellipses')).map(x => x.textContent);
  });
}

const pushToArray = async (itemArea, page) => {
  itemArea.forEach(function (element) {
    //console.log('username: ', $(element).text());
    usernames.push(element);
  });
};

const scrollToBottom = async (itemArea, page) => {
  while (true) {
    previousHeight = await page.evaluate('document.body.scrollHeight');
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
    await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`);
    await new Promise((resolve) => setTimeout(resolve, 1000));
    await page.screenshot({path: "ss.png"})
  }
};

const gotoNextPage = async (page) => {
  await page.waitForSelector(".button.btn.btn--pagination");
  const nextButton = await page.evaluate((page) => {
    document.querySelector(".button.btn.btn--pagination")
  });
  await page.click(nextButton);
  console.log('Next Page Loading')
};

async function main() {
  const client = await puppeteer.launch({
    headless: false,
    executablePath: "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
  });
  const page = await client.newPage();
  await page.goto(url);
  await page.waitForSelector(".tc--g.m--l--1.ellipses");
  const itemArea = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('.tc--g.m--l--1.ellipses')).map(x => x.textContent);
  });
  gotoNextPage(page)
};

main();
Currently, my gotoNextPage function doesn't even find the button, so I thought I'd entered the selector wrong...
Then when I went to find the selector, I realized all the buttons have the same one anyway...
My HTML knowledge is basically nonexistent, but I want to finish this project out. All help is very appreciated.
Bonus: my initItemArea function doesn't work when I call it as a function like that, so I hardcoded it into main()...
I'll be diving deep into this problem later on, as I've seen it before, but any quick answers / direction would be awesome.
Thanks a lot.
You can try selecting the buttons using their position on the page.
For example, you can select the first button using the following CSS selector:
.button.btn.btn--pagination:nth-child(1)
To select the second button:
.button.btn.btn--pagination:nth-child(2)
Got the idea? :)
You can refactor your gotoNextPage function to use this approach; consider this example:
const gotoNextPage = async (page, buttonIndex) => {
  // Build the selector from the button's position on the page
  const sel = `.button.btn.btn--pagination:nth-child(${buttonIndex})`;
  await page.waitForSelector(sel);
  // Click on the button (page.click expects a selector string;
  // page.evaluate can't return a DOM element back to Node anyway)
  await page.click(sel);
  console.log("Next Page Loading");
};
Whenever you're messing with buttons and scroll, it's a good idea to think about where the data is coming from. It's usually being delivered to the front-end via a JSON API, so you might as well try to hit that API directly rather than mess with the DOM.
const url = maxId => `https://poshmark.com/vm-rest/channel_groups/category/channels/category/collections/post?request={%22filters%22:{%22department%22:%22Men%22,%22category_v2%22:%22Jackets_%26_Coats%22,%22inventory_status%22:[%22available%22]},%22sort_by%22:%22like_count%22,%22facets%22:[%22color%22,%22brand%22,%22size%22],%22experience%22:%22all%22,%22sizeSystem%22:%22us%22,%22max_id%22:%22${maxId}%22,%22count%22:%2248%22}&summarize=true&pm_version=226.1.0`;
(async () => {
  const usernames = [];
  for (let maxId = 1; maxId < 5 /* for testing */; maxId++) {
    const response = await fetch(url(maxId)); // Node 18 or install node-fetch
    if (!response.ok) {
      throw Error(response.statusText);
    }
    const payload = await response.json();
    if (payload.error) {
      break;
    }
    usernames.push(...payload.data.map(e => e.creator_username));
  }
  console.log(usernames.slice(0, 10));
  console.log("usernames.length", usernames.length);
})()
  .catch(err => console.error(err));
The response blob has a ton of additional data.
I would add a significant delay between requests if I were to use code like this to avoid rate limiting/blocking.
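For example, a minimal delay helper (the 2-second figure is an arbitrary assumption; tune it to the site's tolerance):

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// ...then, at the bottom of the pagination loop:
await sleep(2000); // pause between requests to avoid hammering the API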
If you're set on Puppeteer, something like this should work as well, although it's slower and I didn't have time to run to the end of the 5k (or more?) users:
const puppeteer = require("puppeteer"); // ^19.1.0
const url = "Your URL";
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.goto(url, {waitUntil: "domcontentloaded"});
const usernames = [];
const sel = ".tc--g.m--l--1.ellipses";
for (;;) {
try {
await page.waitForSelector(sel);
const users = await page.$$eval(sel, els => {
const text = els.map(e => e.textContent);
els.forEach(el => el.remove());
return text;
});
console.log(users); // optional for debugging
usernames.push(...users);
await page.$$eval(
".btn--pagination",
els => els.find(el => el.textContent.includes("Next")).click()
);
}
catch (err) {
break;
}
}
console.log(usernames);
console.log(usernames.length);
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
I don't think navigations are triggered by the "Next" button, so my strategy for detecting when a page transition has occurred involves destroying the current set of elements after scraping the usernames, then waiting until the next batch shows up. This may seem inelegant, but it's easy to implement and seems reliable, not making assumptions about the usernames themselves.
It's also possible to use Puppeteer and make or intercept API requests, armed with a fresh cookie. This is sort of halfway between the two extremes shown above. For example:
const puppeteer = require("puppeteer");
const url = "Your URL";
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.goto(url, {waitUntil: "domcontentloaded"});
const usernames = await page.evaluate(async () => {
const url = maxId => `https://poshmark.com/vm-rest/channel_groups/category/channels/category/collections/post?request={%22filters%22:{%22department%22:%22Men%22,%22category_v2%22:%22Jackets_%26_Coats%22,%22inventory_status%22:[%22available%22]},%22sort_by%22:%22like_count%22,%22facets%22:[%22color%22,%22brand%22,%22size%22],%22experience%22:%22all%22,%22sizeSystem%22:%22us%22,%22max_id%22:%22${maxId}%22,%22count%22:%2248%22}&summarize=true&pm_version=226.1.0`;
const usernames = [];
try {
for (let maxId = 1; maxId < 5 /* for testing */; maxId++) {
const response = await fetch(url(maxId)); // node 18 or install node-fetch
if (!response.ok) {
throw Error(response.statusText);
break;
}
const json = await response.json();
if (json.error) {
break;
}
usernames.push(...json.data.map(e => e.creator_username));
}
}
catch (err) {
console.error(err);
}
return usernames;
});
console.log(usernames);
console.log("usernames.length", usernames.length);
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
The above code limits to 4 requests to keep it simple and easy to validate.
Blocking images and other unnecessary resources can help speed the Puppeteer versions up, left as an exercise (or just use the direct fetch version shown at top).

I want to get the URLs of each home from the attribute content

const puppeteer = require("puppeteer");
const cheerio = require("cheerio");
const url = "https://www.airbnb.co.in/s/Haridwar--Uttarakhand/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_lengths%5B%5D=one_week&price_filter_input_type=0&price_filter_num_nights=5&l2_property_type_ids%5B%5D=1&search_type=autocomplete_click&query=Haridwar%2C%20Uttarakhand&place_id=ChIJyVfuuA5HCTkR8_VApnaRRE4&date_picker_type=calendar&source=structured_search_input_header";
async function scrapHomesPage(url)
{
try
{
const browser = await puppeteer.launch({headless:false});
const page = await browser.newPage();
await page.goto(url);
const html = await page.evaluate(()=> document.body.innerHTML);
const $ = cheerio.load(html);
const homes = $('[itemprop="url"]').map((i, element) => $(element).attr("content")).get();
console.log(homes);
}
catch(err)
{
console.error(err);
}
}
scrapHomesPage("https://www.airbnb.co.in/s/Haridwar--Uttarakhand/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_lengths%5B%5D=one_week&price_filter_input_type=0&price_filter_num_nights=5&l2_property_type_ids%5B%5D=1&search_type=autocomplete_click&query=Haridwar%2C%20Uttarakhand&place_id=ChIJyVfuuA5HCTkR8_VApnaRRE4&date_picker_type=calendar&source=structured_search_input_header");
I tried everything I could think of to wait for the page to load all the contents. I tried waiting for selectors, etc. I always get an empty array, when I should get an array with all the links of each home listed on the Airbnb site for that particular location.
I don't see any reason to use Cheerio here. It's just another layer of indirection to get the data you want, involving an extra dependency, a whole second parse of the page and the potential for bugs when the page goes out of sync with the HTML snapshot you've created. If you do need to use it, you can use page.content() instead of page.evaluate(() => document.body.innerHTML).
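If you do keep Cheerio, a minimal sketch of that substitution (the rest of your function can stay as it is):

// page.content() returns the full serialized page, including <head>,
// unlike document.body.innerHTML
const html = await page.content();
const $ = cheerio.load(html);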
As for the main problem, you appear to be missing a call to page.waitForSelector:
const puppeteer = require("puppeteer"); // ^19.0.0
const url = "your url";
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.goto(url, {waitUntil: "domcontentloaded"});
await page.waitForSelector('[itemprop="url"]');
const content = await page.$$eval(
'[itemprop="url"]',
els => els.map(el => el.getAttribute("content"))
);
console.log(content);
})()
.catch(err => console.error(err))
.finally(() => browser?.close());

How to get the first link under a ul tag using Puppeteer?

I am trying to get the link of the latest house posting on a real estate website.
This is the code I have written so far:
const puppeteer = require("puppeteer");
const link =
"https://www.daft.ie/property-for-rent/dublin-4-dublin?radius=5000&numBeds_from=2&numBeds_to=3&sort=publishDateDesc";
(async () => {
const browser = await puppeteer.launch({
headless: false,
defaultViewport: null,
});
const page = await browser.newPage();
await page.goto(link);
const elements = await page.$x("//button[normalize-space()='Accept All']");
await elements[0].click();
// const handle = await page.waitForXPath("//ul[#data-testid='results']");
// const yourHref = await page.evaluate(
// (anchor) => anchor.getAttribute("href"),
// handle
// );
const hrefs1 = await page.evaluate(() =>
Array.from(document.querySelectorAll("a[href]"), (a) =>
a.getAttribute("href")
)
);
console.log(hrefs1);
await browser.close();
})();
However, this code gets all the href links on the target page.
HTML code of the page (attached as a screenshot in the original post):
As you can see, under the ul tag with data-testid="results" there are many li tags, each containing an a href. I wish to extract the link from only the topmost li, as it will be the newest house posting.
How can I do this?
Expected output: I just want the first link under an li tag. For the page in the screenshot, the output would be
/for-rent/house-glencloy-road-whitehall-dublin-9/4072150
Following up on the comment chain, the selector '[data-testid="results"] a[href]' should give the first result href.
const puppeteer = require("puppeteer"); // ^16.2.0
let browser;
(async () => {
browser = await puppeteer.launch({headless: false});
const [page] = await browser.pages();
const url =
"https://www.daft.ie/property-for-rent/dublin-4-dublin?radius=5000&numBeds_from=2&numBeds_to=3&sort=publishDateDesc";
await page.goto(url, {waitUntil: "domcontentloaded"});
const xp = "//button[normalize-space()='Accept All']";
const cookiesBtn = await page.waitForXPath(xp);
await cookiesBtn.click();
const el = await page.waitForSelector('[data-testid="results"] a[href]');
console.log(await el.evaluate(el => el.getAttribute("href")));
})()
.catch(err => console.error(err))
.finally(() => browser?.close())
;
If you want all of the result hrefs, try:
const allHrefs = await page.$$eval(
  '[data-testid="results"] a[href]',
  els => els.map(e => e.getAttribute("href"))
);
Note that the data is available statically, so you could just use fetch (native on Node 18+) and Cheerio which is faster and probably more reliable, assuming there's no detection issues (and you could add a user-agent and take other counter-measures if there are):
const cheerio = require("cheerio"); // 1.0.0-rc.12

const url = "https://www.daft.ie/property-for-rent/dublin-4-dublin?radius=5000&numBeds_from=2&numBeds_to=3&sort=publishDateDesc";

fetch(url).then(res => res.text()).then(html => {
  const $ = cheerio.load(html);
  const sel = '[data-testid="results"] a[href]';
  console.log($(sel).attr("href"));
  // or all:
  console.log([...$(sel)].map(e => e.attribs.href));
});
On my slow machine this took 3.5 seconds versus 30 seconds for headful Puppeteer and 15-20 seconds for headless Puppeteer depending on cache warmth.
Or, if you are using Puppeteer for whatever reason, you could block all the requests, JS, and images to speed things up dramatically. The default await page.goto(link); waits for the load event, which includes content you may not need.
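A rough sketch of that blocking approach (the resource types to abort are an assumption; keep whatever the page actually needs to render its results):

await page.setRequestInterception(true);
page.on("request", req => {
  // Abort anything not needed to extract the hrefs
  const blocked = ["image", "stylesheet", "font", "media"];
  blocked.includes(req.resourceType()) ? req.abort() : req.continue();
});
await page.goto(url, {waitUntil: "domcontentloaded"}); // skip waiting for the full load event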

Why will puppeteer not click on the video

I am currently writing a simple program that grabs the name of a song from my Discord bot, finds the video, and passes it to a function that converts it to mp3. My problem is that Puppeteer doesn't click on the video and instead just returns the search page link.
Here is my code to grab the link and pass it to the download step:
async function findSongName(stringWithName) {
  let stringName = stringWithName.replace(commands.play, '');
  const link = 'https://www.youtube.com';
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(link);
  await page.type('ytd-searchbox#search.style-scope.ytd-masthead', stringName);
  page.keyboard.press('Enter');
  await page.click('yt-interaction#extended');
  console.log(page.url());
  await browser.close();
}
It sounds like you want to get the title and URL of the top result for a YT search. For starters, you don't need to start at the YT homepage. Just navigate to https://www.youtube.com/results?search_query=${yourQuery} to speed things up and reduce complexity.
Next, if you view the page source of /results, there's a large (~1 MB) global data structure called ytInitialData that has all of the relevant results in it (along with a lot of other irrelevant stuff, admittedly). Theoretically, you could grab the page with Axios, parse out ytInitialData with Cheerio, grab your data using plain array/object JS and skip Puppeteer entirely.
Of course, using the YT search API is the most reliable and proper way.
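A hedged sketch of the Axios/Cheerio route (the regex and the object paths into ytInitialData are undocumented internals and change without notice, so treat them as assumptions):

const axios = require("axios");
const cheerio = require("cheerio");

(async () => {
  const {data: html} = await axios.get(
    "https://www.youtube.com/results?search_query=stack+overflow"
  );
  const $ = cheerio.load(html);
  // Find the inline <script> that assigns ytInitialData, then parse the JSON out of it
  const script = [...$("script")]
    .map(el => $(el).html())
    .find(src => src && src.includes("ytInitialData"));
  const data = JSON.parse(script.match(/var ytInitialData = (\{.*\});/s)[1]);
  // This path reflects the structure at the time of writing and can change at any time
  const items = data.contents.twoColumnSearchResultsRenderer.primaryContents
    .sectionListRenderer.contents[0].itemSectionRenderer.contents;
  const videos = items
    .filter(e => e.videoRenderer)
    .map(e => ({
      title: e.videoRenderer.title.runs[0].text,
      href: `https://www.youtube.com/watch?v=${e.videoRenderer.videoId}`,
    }));
  console.log(videos);
})();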
Since you're using Puppeteer, though, the data can be pulled out of the "#items a#video-title" elements as follows:
const puppeteer = require("puppeteer");

const searchYT = async (page, searchQuery) => {
  const encodedQuery = encodeURIComponent(searchQuery);
  const url = `https://www.youtube.com/results?search_query=${encodedQuery}`;
  await page.goto(url);
  const sel = "a#video-title";
  await page.waitForSelector(sel);
  return page.$$eval(sel, els =>
    els.map(e => ({
      title: e.textContent.trim(),
      href: e.href,
    }))
  );
};

let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  await page.setRequestInterception(true);
  page.on("request", req => {
    req.resourceType() === "image" ? req.abort() : req.continue();
  });
  const results = await searchYT(page, "stack overflow");
  console.log(results);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Output (for the search term "stack overflow"):
[
  {
    title: 'Stack Overflow is full of idiots.',
    href: 'https://www.youtube.com/watch?v=I_ZK0t9-llo'
  },
  {
    title: "How To Use Stack Overflow (no, ForrestKnight, it's not full of idiots)",
    href: 'https://www.youtube.com/watch?v=sMIslcynm0Q'
  },
  {
    title: 'How to use Stack Overflow as a Beginner ?',
    href: 'https://www.youtube.com/watch?v=Vt-Wf7d0CFo'
  },
  {
    title: 'How Microsoft Uses Stack Overflow for Teams',
    href: 'https://www.youtube.com/watch?v=mhh0aK6yJgA'
  },
  // ...
]
Since you only want the first result, it's the first element here, but if you want more than the initial batch, either work through ytInitialData as described above or scroll the page down with Puppeteer.
Now that you have a video URL that you want to turn into an mp3, I'd recommend youtube-dl. There are Node wrappers you can install to access its API easily, such as node-youtube-dl, which was the first result when I searched; I haven't used it before.
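For example, a hedged sketch of shelling out to youtube-dl directly (flags per the youtube-dl docs; videoUrl is whichever href you picked from the results):

const {execFile} = require("child_process");

const videoUrl = "https://www.youtube.com/watch?v=I_ZK0t9-llo"; // e.g. first result from above

// -x extracts audio; assumes the youtube-dl binary is on your PATH
execFile("youtube-dl", ["-x", "--audio-format", "mp3", videoUrl], (err, stdout) => {
  if (err) throw err;
  console.log(stdout);
});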

How do I combine puppeteer plugins with puppeteer clusters?

I have a list of URLs that need to be scraped from a website that uses React; for this reason I am using Puppeteer.
I do not want to be blocked by anti-bot servers, so I have added puppeteer-extra-plugin-stealth.
I want to prevent ads from loading on the pages, so I am blocking ads by using puppeteer-extra-plugin-adblocker.
I also want to prevent my IP address from being blacklisted, so I have used TOR nodes to have different IP addresses.
Below is a simplified version of my code and the setup works (TOR_port and webUrl are assigned dynamically, but to simplify my question I have assigned them as variables).
There is a problem though:
const puppeteer = require('puppeteer-extra');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');

puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());

var TOR_port = 13931;
var webUrl = 'https://www.zillow.com/homedetails/2861-Bass-Haven-Ln-Saint-Augustine-FL-32092/47739703_zpid/';

const browser = await puppeteer.launch({
  dumpio: false,
  headless: false,
  args: [
    `--proxy-server=socks5://127.0.0.1:${TOR_port}`,
    `--no-sandbox`,
  ],
  ignoreHTTPSErrors: true,
});

try {
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 720 });
  await page.goto(webUrl, {
    waitUntil: 'load',
    timeout: 30000,
  });
  page
    .waitForSelector('.price')
    .then(async () => {
      console.log('The price is available');
      await browser.close();
    })
    .catch(() => {
      // close this since it is clearly not a zillow website
      throw new Error('This is not the zillow website');
    });
} catch (e) {
  await browser.close();
}
The above setup works but is very unreliable, and I recently learnt about Puppeteer-Cluster. I need it to help me manage crawling multiple pages and to track my scraping tasks.
So, my question is how do I implement Puppeteer-Cluster with the above set-up. I am aware of an example (https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/different-puppeteer-library.js) offered by the library to show how you can implement plugins, but it is so bare that I didn't quite understand it.
How do I implement Puppeteer-Cluster with the above TOR, AdBlocker, and Stealth configurations?
You can just hand over your puppeteer instance as follows:
const { Cluster } = require('puppeteer-cluster');
const puppeteer = require('puppeteer-extra');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');

puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());

// Hand the patched puppeteer-extra instance to the cluster
const cluster = await Cluster.launch({
  puppeteer,
});
Src: https://github.com/thomasdondorf/puppeteer-cluster#clusterlaunchoptions
You can just add the plugins with puppeteer.use()
You have to use puppeteer-extra.
const { addExtra } = require("puppeteer-extra");
const vanillaPuppeteer = require("puppeteer");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
const RecaptchaPlugin = require("puppeteer-extra-plugin-recaptcha");
const { Cluster } = require("puppeteer-cluster");

(async () => {
  const puppeteer = addExtra(vanillaPuppeteer);
  puppeteer.use(StealthPlugin());
  puppeteer.use(RecaptchaPlugin());
  // Do stuff
})();
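To tie this back to the question, here is a hedged sketch of what could replace // Do stuff above: launching the cluster with the wrapped puppeteer and reusing the TOR proxy and launch args from the question (option names follow the puppeteer-cluster README; the task body is illustrative):

// Continuing inside the async IIFE above; TOR_port and webUrl come from the question's setup
const cluster = await Cluster.launch({
  puppeteer, // the addExtra-wrapped instance with the plugins applied
  concurrency: Cluster.CONCURRENCY_CONTEXT,
  maxConcurrency: 2,
  puppeteerOptions: {
    headless: false,
    args: [`--proxy-server=socks5://127.0.0.1:${TOR_port}`, "--no-sandbox"],
    ignoreHTTPSErrors: true,
  },
});

// Each queued URL runs through this task
await cluster.task(async ({ page, data: url }) => {
  await page.goto(url, { waitUntil: "load", timeout: 30000 });
  await page.waitForSelector(".price");
  console.log("The price is available at", url);
});

cluster.queue(webUrl); // queue as many URLs as needed
await cluster.idle();
await cluster.close();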
