Puppeteer Page.$$eval() method returning empty arrays

I'm building a web scraping application with puppeteer. I'm trying to get an array of links to scrape from, but it returns an empty array.
const scraperObject = {
    url: 'https://www.walgreens.com/',
    async scraper(browser){
        let page = await browser.newPage();
        console.log(`Navigating to ${this.url}...`);
        await page.goto(this.url);
        // Wait for the required DOM to be rendered
        await page.waitForSelector('.CA__Featured-categories__full-width');
        // Get all the required links in the featured categories
        let urls = await page.$$eval('.list__contain > ul#at-hp-rp-featured-ul > li', links => {
            // Extract the links from the data
            links = links.map(el => el.querySelector('li > a').href)
            return links;
        });
Whenever I ran this, the console would give me the needed array of links (example below).
Navigating to https://www.walgreens.com/...
[
  'https://www.walgreens.com/seasonal/holiday-gift-shop?ban=dl_dl_FeatCategory_HolidayShop_TEST',
  'https://www.walgreens.com/store/c/cough-cold-and-flu/ID=20002873-tier1?ban=dl_dl_FeatCategory_CoughColdFlu_TEST',
  'https://www.walgreens.com/store/c/contact-lenses/ID=359432-tier2clense?ban=dl_dl_FeatCategory_ContactLenses_TEST'
]
So, from here I navigated to one of those URLs with the code block below and reran the same kind of code, intending to work through an array of categories and eventually reach the product listings page.
// Navigate to Household Essentials
let hEssentials = await browser.newPage();
await hEssentials.goto(urls[11]);

// Wait for the required DOM to be rendered
await page.waitForSelector('.content');
// Get all the required links in the featured categories
let shopByNeedUrls = await page.$$eval('div.wag-row > div.wag-col-3 wag-col-md-6 wag-col-sm-6 CA__MT30', links1 => {
    // Extract the links from the data
    links1 = links1.map(el => el.querySelector('div > a').href)
    return links1;
});
console.log(shopByNeedUrls);
}
However, whenever I run this code, the console prints the same navigating message but then returns an empty array (as shown below):
Navigating to https://www.walgreens.com/...
[]
If anyone is able to explain why I'm outputting an empty array, that'd be great. Thank you.
I've attempted to change the arguments to the page.waitForSelector method and the page.$$eval method, but none of the attempts appeared to work; the output was the same. In fact, I sometimes receive a timeout error from the page.waitForSelector method.
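For what it's worth, two things in the second block stand out: the navigation happens on hEssentials while the waiting and querying still happen on page (which never left the homepage), and the class names in the $$eval selector are missing their leading dots. A minimal sketch of the corrected block, assuming all four classes sit on the same div and nothing else is wrong:

// Navigate to Household Essentials on the new tab
let hEssentials = await browser.newPage();
await hEssentials.goto(urls[11]);

// Wait and query on the page that was actually navigated
await hEssentials.waitForSelector('.content');
let shopByNeedUrls = await hEssentials.$$eval(
    'div.wag-row > div.wag-col-3.wag-col-md-6.wag-col-sm-6.CA__MT30',
    links1 => links1.map(el => el.querySelector('div > a').href)
);
console.log(shopByNeedUrls);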

Related

show long API call progress with progress bar in NextJS

When the user clicks a button on my NextJS website, a NextJS API is called. There, a Puppeteer client is started, an external API is called, and the code loops through this response and crawls through some data.
This takes a long time, and I wanted to give the user some kind of information on how the progress is going.
For instance: I get several pages and items on each page from the external API — let's say, 3 pages with 100 items each. Then I'd show the user "processing item 1 of 300". As the items go by, this number would be updated.
The problem is that right now, I'm using res.send, and it closes the connection with a 200 status. I wanted to send back this data without closing.
Some people told me to research HTTP Streaming, but I couldn't find any practical explanation on how to do it — especially using NextJS.
Pseudocode:
// api/index.ts
export default async function handler(
    req: NextApiRequest,
    res: NextApiResponse<Data>,
) {
    // Start crawler instance
    const { page } = await crawler.up()
    const items = await getItems()
    // Close crawler before ending
    await crawler.down(page)
    res.status(200).json(items)
}

// getItems.ts
export const getItems = async () => {
    const items = await fetch('external-url').then((r) => r.json())
    const result = []
    for (let index = 0; index < items.length; index++) {
        // instead of this console.log, I wanted to send this as a message to the website, so it could update a progress bar
        console.log(`Processing ${index + 1} of ${items.length}`)
        const processed = await processResult(items[index]) // this will take a while
        result.push(processed)
    }
    return result
}
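One practical way to do this is plain HTTP streaming from the API route: write progress lines to the response as they happen and only end the response when the work is done. Below is a minimal sketch, assuming a pages-router API route (where res is a plain Node.js http.ServerResponse); the route path and the per-item work are stand-ins from the pseudocode above, not a definitive implementation:

// pages/api/crawl.js -- hypothetical route name
export default async function handler(req, res) {
    // No Content-Length is set, so Node sends the body chunked as it is written
    res.writeHead(200, { 'Content-Type': 'text/plain; charset=utf-8' })

    const items = await fetch('external-url').then((r) => r.json())
    for (let i = 0; i < items.length; i++) {
        await processResult(items[i]) // the slow per-item work
        // Each write reaches the client while the connection stays open
        res.write(`Processing ${i + 1} of ${items.length}\n`)
    }
    res.end('done\n')
}

On the client, the fetch response body can be read incrementally with response.body.getReader(), updating the progress bar once per chunk.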

How to use Promise.all with multiple Firestore queries

I know there are similar questions to this on stack overflow but thus far none have been able to help me get my code working.
I have a function that takes an id and makes a call to Firebase Firestore to get all the documents in a "feedItems" collection. Each document contains two fields, a timestamp and a post ID. The function returns an array with each post object. This part of the code (getFeedItems below) works as expected.
The problem occurs in the next step. Once I have the array of post IDs, I loop over the array and make a Firestore query for each one to get the actual post information. I know these queries are asynchronous, so I use Promise.all to wait for each promise to resolve before using the final array of post information.
However, I continue to receive "undefined" as a result of these looped queries. Why?
const useUpdateFeed = (uid) => {
    const [feed, setFeed] = useState([]);

    useEffect(() => {
        // getFeedItems returns an array of postIDs, and works as expected
        async function getFeedItems(uid) {
            const docRef = firestore
                .collection("feeds")
                .doc(uid)
                .collection("feedItems");
            const doc = await docRef.get();
            const feedItems = [];
            doc.forEach((item) => {
                feedItems.push({
                    ...item.data(),
                    id: item.id,
                });
            });
            return feedItems;
        }

        // getPosts is meant to take the array of post IDs, and return an array of the post objects
        async function getPosts(items) {
            console.log(items)
            const promises = [];
            items.forEach((item) => {
                const promise = firestore.collection("posts").doc(item.id).get();
                promises.push(promise);
            });
            const posts = [];
            await Promise.all(promises).then((results) => {
                results.forEach((result) => {
                    const post = result.data();
                    console.log(post); // this continues to log as "undefined". Why?
                    posts.push(post);
                });
            });
            return posts;
        }

        (async () => {
            if (uid) {
                const feedItems = await getFeedItems(uid);
                const posts = await getPosts(feedItems);
                setFeed(posts);
            }
        })();
    }, []);

    return feed; // The final result is an array with a single "undefined" element
};
There are a few things I have already verified on my own:
My firestore queries work as expected when done one at a time (so there are not any bugs with the query structures themselves).
This is a custom hook for React. I don't think my use of useState/useEffect is having any issue here, and I have tested the implementation of this hook with mock data.
EDIT: A console.log() of items was requested and has been added to the code snippet. I can confirm that the firestore documents that I am trying to access do exist, and have been successfully retrieved when called in individual queries (not in a loop).
Also, for simplicity, the collection on Firestore currently only includes one post (with an ID of "ANkRFz2L7WQzA3ehcpDz", which can be seen in the console log output below).
EDIT TWO: To make the output clearer I have pasted it as an image below.
Turns out, this was human error. Looking at the console log output I realised there is a space in front of the document ID. Removing that on the backend made my code work.
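For anyone hitting the same symptom, a defensive trim when building the queries sidesteps stray whitespace in stored IDs. This is a hypothetical variation on the getPosts loop above, using the same namespaced Firestore API:

const promises = items.map((item) =>
    // .trim() guards against stray whitespace in stored document IDs,
    // which would make .doc() point at a non-existent document
    firestore.collection("posts").doc(item.id.trim()).get()
);
const results = await Promise.all(promises);
// .exists filters out lookups that found no document, so no "undefined" entries
const posts = results.filter((snap) => snap.exists).map((snap) => snap.data());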

How to get all clickable elements on the webpage using puppeteer?

For web scraping purposes, I want to find all URLs present on the website that I can access using the 'a' tag. Refer to the script below:
// Get all urls in the page
let urls = await page.evaluate(() => {
    let results = [];
    let items = document.querySelectorAll('a');
    items.forEach((item) => {
        results.push({
            url: item.href,
        });
    });
    return results;
});
Now some URLs are hidden and can only be reached by clicking elements on the page. How can I get the list of all clickable elements on a page using puppeteer or nodejs?
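There is no definitive "clickable" query, but a common heuristic is to collect the elements that are usually interactive. A sketch of that assumption:

const clickables = await page.evaluate(() => {
    // Rough heuristic: elements that typically respond to clicks
    const selector = 'a, button, [onclick], [role="button"], input[type="submit"]';
    return Array.from(document.querySelectorAll(selector)).map((el) => ({
        tag: el.tagName.toLowerCase(),
        text: (el.innerText || '').trim(),
        href: el.href || null,
    }));
});

Elements wired up purely through addEventListener will not show up this way; finding those generally requires inspecting event listeners through the DevTools protocol rather than a CSS selector.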

Trying to scrape websites using puppeteer and getting back empty objects

I began learning puppeteer today and ran into a problem. I was trying to create a covid tracker and wanted to scrape from worldometers. But when I try to get information back, it returns an array of empty objects. The number of objects matches the number of tags with the same class, but it doesn't show any information. Here is my code:
const puppeteer = require("puppeteer")

async function getCovidCases(){
    const browser = await puppeteer.launch({
        defaultViewport: null,
        headless: false,
        slowMo: 250
    })
    const page = await browser.newPage()
    const url = "https://www.worldometers.info/coronavirus/#countries"
    await page.goto(url, {waitUntil: 'networkidle0'})
    await page.waitForSelector(".navbar-nav", {visible: true})
    const results = await page.$$eval(".navbar-nav", rows => {
        return rows
    })
    console.log(results)
}

getCovidCases()
Does Anyone Know What To Do?
Based on the selector, I assume you are interested in the navbar items at this step.
const results = await page.$$eval(".navbar-nav", navBars => {
    return navBars.map(navBar => {
        const anchors = Array.from(navBar.getElementsByTagName('a'));
        return anchors.map(anchor => anchor.innerText);
    });
})
This yields [ [ 'Coronavirus', 'Population' ] ] and might be useful for you.
Use $eval if you expect only one element and $$eval if you expect multiple elements. In the callback you have a reference to the DOM element, but you cannot return it directly. If you console.log anything there, it won't show up in the Node.js terminal but in the browser's console. What you return is sent back to Node.js, and it needs to be serializable (I think). Each navBar element you return gets converted to an empty object, which is not what you want. That's why I map over it and convert it to a string (innerText).
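A minimal illustration of that serialization point, using the selectors from the question:

// Returning the DOM nodes themselves: each one serializes to an empty object
const nodes = await page.$$eval(".navbar-nav", els => els); // [ {} ]
// Returning plain strings survives the trip back to Node.js
const texts = await page.$$eval(".navbar-nav a", els => els.map(el => el.innerText));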
If you want to scrape other data, you should use a different selector than .navbar-nav.

slow looping over pages and extracting data using puppeteer

I have a table that looks like this. All the names in the name column are links that navigate to the next page.
|---------------------|------------------|
| NAME | ID |
|---------------------|------------------|
| Name 1 | 1 |
|---------------------|------------------|
| Name 2 | 2 |
|---------------------|------------------|
| Name 3 | 3 |
|---------------------|------------------|
I am trying to grab each link, extract data from it, and then return to the table. However, there are over 4000 records in the table and everything is processed very slowly (around 1000 ms per record).
Here is my code:
// Grabs all table rows.
const items = await page.$$(domObjects.itemPageLink);

for (let i = 0; i < items.length; i++) {
    await page.goto(url);
    await page.waitForSelector(domObjects.itemPageLink);

    let items = await page.$$(domObjects.itemPageLink);
    const item = items[i];

    let id = await item.$eval("td:last-of-type", node => node.innerText.split(",").map(item => item.trim()));
    let link = await item.$eval("td:first-of-type a", node => node.click());

    await page.waitForSelector(domObjects.itemPageWrapper);
    let itemDetailsPage = await page.$(domObjects.itemPageWrapper);
    let title = await itemDetailsPage.$eval(".page-header__title", title => title.innerText);

    console.log(title);
    console.log(id);
}
Is there a way to speed up this so I can get all the results at once much quicker? I would like to use this for my API.
There are some minor code improvements and one major improvement which can be applied here.
Minor improvements: Use fewer puppeteer functions
The minor improvements boil down to using as few of the puppeteer functions as possible. Most of the puppeteer functions you use are sending data from the Node.js environment to the browser environment via a WebSocket. While this only takes a few milliseconds, these milliseconds obviously add up in the long run. For more information on this, you can check out this question asking about the difference between using page.evaluate and using more puppeteer functions.
This means, to optimize your code, you can for example use querySelector inside the page instead of running item.$eval multiple times. Another optimization is to use the result of page.waitForSelector directly: the function returns the node when it appears, so you do not need to query it again via page.$ afterwards.
These are only minor improvements, which might slightly improve the crawling speed.
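Still, here is a rough sketch of both applied to the code from the question (the selectors and domObjects names come from there; treating innerText as the comma-separated ID list is an assumption):

// One $$eval call crosses the WebSocket once, instead of two $eval calls per row
const rowData = await page.$$eval(domObjects.itemPageLink, items =>
    items.map(item => ({
        id: item.querySelector('td:last-of-type').innerText.split(',').map(s => s.trim()),
        url: item.querySelector('td:first-of-type a').href,
    }))
);

// waitForSelector already returns the element handle, so no extra page.$ call is needed
const itemDetailsPage = await page.waitForSelector(domObjects.itemPageWrapper);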
Major improvement: Use a puppeteer pool
Right now, you are using one browser with a single page to crawl multiple URLs. You can improve the speed of your script by using a pool of puppeteer resources to crawl multiple URLs in parallel. puppeteer-cluster allows you to do exactly that (disclaimer: I'm the author). The library takes a task and applies it to a number of URLs in parallel.
The number of parallel instances you can use depends on your CPU, memory, and throughput. The more you can use, the better your crawling speed will be.
Code Sample
Below is a minimal example, adapting your code to extract the same data. The code first sets up a cluster with one browser and four pages. After that, a task function is defined which will be executed for each of the queued objects.
After this, one page instance of the cluster is used to extract the IDs and URLs from the initial page. The function given to the cluster.queue extracts the IDs and URLs from the page and calls cluster.queue with the objects being { id: ..., url: ... }. For each of the queued objects, the cluster.task function is executed, which then extracts the title and prints it out next to the passed ID.
const { Cluster } = require('puppeteer-cluster');

// Setup your cluster with 4 pages
const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_PAGE,
    maxConcurrency: 4,
});

// Define the task for the pages (go to the URL, and extract the title)
await cluster.task(async ({ page, data: { id, url } }) => {
    await page.goto(url);
    const itemDetailsPage = await page.waitForSelector(domObjects.itemPageWrapper);
    const title = await itemDetailsPage.$eval('.page-header__title', title => title.innerText);
    console.log(id, title);
});

// Use one page of the cluster to extract the links and IDs (ignoring the task function above)
cluster.queue(async ({ page }) => {
    await page.goto(url); // url is given from outside
    // Extract the links and ids from the initial page
    const itemData = await page.$$eval(domObjects.itemPageLink, items => items.map(item => ({
        id: item.querySelector('td:last-of-type').innerText.split(',').map(item => item.trim()),
        url: item.querySelector('td:first-of-type a').href,
    })));
    // Queue the data: { id: ..., url: ... } to start the process
    itemData.forEach(data => cluster.queue(data));
});

// Wait for the queue to drain, then shut the cluster down
await cluster.idle();
await cluster.close();
