Crawling multiple URLs in a loop using Puppeteer - javascript

I have an array of URLs to scrape data from:
urls = ['url','url','url'...]
This is what I'm doing:
urls.map(async (url) => {
  await page.goto(url);
  await page.waitForNavigation({ waitUntil: 'networkidle' });
});
This doesn't seem to wait for the page to load and visits all the URLs quite rapidly (I even tried using page.waitFor).
I wanted to know if I am doing something fundamentally wrong, or if this type of functionality is not advised/supported.

map, forEach, reduce, etc. do not wait for the asynchronous operation inside them before moving on to the next element of the iterable they are iterating over.
There are multiple ways of going through each item of an iterable serially while performing an asynchronous operation, but the easiest in this case, I think, is to simply use a normal for loop, which does wait for the operation to finish.
const urls = [...]

for (let i = 0; i < urls.length; i++) {
  const url = urls[i];
  await page.goto(`${url}`);
  await page.waitForNavigation({ waitUntil: 'networkidle2' });
}
This will visit one URL after another, as you are expecting. If you are curious about iterating serially using await/async, you can have a peek at this answer: https://stackoverflow.com/a/24586168/791691
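For completeness, here is roughly what that loop looks like inside a full script. This is only a minimal sketch: the URLs are placeholders, and the waitUntil option is passed to goto directly instead of using a separate waitForNavigation call.

const puppeteer = require('puppeteer');

(async () => {
  const urls = ['https://example.com/a', 'https://example.com/b']; // placeholder URLs
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  for (const url of urls) {
    // goto resolves once the navigation reaches the requested lifecycle event
    await page.goto(url, { waitUntil: 'networkidle2' });
    // ... scrape the page here ...
  }

  await browser.close();
})();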

The accepted answer shows how to serially visit each page one at a time. However, you may want to visit multiple pages simultaneously when the task is embarrassingly parallel, that is, scraping a particular page isn't dependent on data extracted from other pages.
A tool that can help achieve this is Promise.allSettled, which lets us fire off a bunch of promises at once, determine which were successful, and harvest the results.
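As a quick illustration of the result shape allSettled gives back (a minimal, self-contained sketch):

(async () => {
  const results = await Promise.allSettled([
    Promise.resolve("ok"),
    Promise.reject(new Error("boom")),
  ]);
  // results[0] -> { status: "fulfilled", value: "ok" }
  // results[1] -> { status: "rejected", reason: Error: boom }
  const values = results
    .filter(r => r.status === "fulfilled")
    .map(r => r.value);
  console.log(values); // ["ok"]
})();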
For a basic example, let's say we want to scrape usernames for Stack Overflow users given a series of ids.
Serial code:
const puppeteer = require("puppeteer"); // ^19.6.3

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const baseURL = "https://stackoverflow.com/users";
  const startId = 6243352;
  const qty = 5;
  const usernames = [];

  for (let i = startId; i < startId + qty; i++) {
    await page.goto(`${baseURL}/${i}`, {
      waitUntil: "domcontentloaded"
    });
    const sel = ".flex--item.mb12.fs-headline2.lh-xs";
    const el = await page.waitForSelector(sel);
    usernames.push(await el.evaluate(el => el.textContent.trim()));
  }

  console.log(usernames);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Parallel code:
let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const baseURL = "https://stackoverflow.com/users";
  const startId = 6243352;
  const qty = 5;
  const usernames = (await Promise.allSettled(
    [...Array(qty)].map(async (_, i) => {
      const page = await browser.newPage();
      await page.goto(`${baseURL}/${i + startId}`, {
        waitUntil: "domcontentloaded"
      });
      const sel = ".flex--item.mb12.fs-headline2.lh-xs";
      const el = await page.waitForSelector(sel);
      const text = await el.evaluate(el => el.textContent.trim());
      await page.close();
      return text;
    })
  ))
    .filter(e => e.status === "fulfilled")
    .map(e => e.value);
  console.log(usernames);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Remember that this is a technique, not a silver bullet that guarantees a speed increase on all workloads. It will take some experimentation to find the optimal balance between the cost of creating more pages and the parallelization of network requests for a particular task and system.
The example here is contrived since it's not interacting with the page dynamically, so there's not as much room for gain as in a typical Puppeteer use case that involves network requests and blocking waits per page.
Of course, beware of rate limiting and any other restrictions imposed by sites (running the code above may anger Stack Overflow's rate limiter).
For tasks where creating a page per task is prohibitively expensive or you'd like to set a cap on parallel request dispatches, consider using a task queue or combining serial and parallel code shown above to send requests in chunks. This answer shows a generic pattern for this agnostic of Puppeteer.
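A rough sketch of the chunked approach, assuming a hypothetical scrapeOne(browser, url) helper that opens a page, extracts whatever you need, and closes the page:

const puppeteer = require("puppeteer");

// hypothetical per-URL worker; replace the body with your own scraping logic
const scrapeOne = async (browser, url) => {
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: "domcontentloaded" });
    return await page.title(); // placeholder for real extraction
  } finally {
    await page.close();
  }
};

(async () => {
  const urls = [/* ... */];
  const chunkSize = 5; // cap on concurrent pages
  const browser = await puppeteer.launch();
  const results = [];

  for (let i = 0; i < urls.length; i += chunkSize) {
    const chunk = urls.slice(i, i + chunkSize);
    const settled = await Promise.allSettled(
      chunk.map(url => scrapeOne(browser, url))
    );
    results.push(...settled.filter(r => r.status === "fulfilled").map(r => r.value));
  }

  console.log(results);
  await browser.close();
})();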
These patterns can be extended to handle the case when certain pages depend on data from other pages, forming a dependency graph.
See also Using async/await with a forEach loop which explains why the original attempt in this thread using map fails to wait for each promise.

If you find that you are waiting on your promise indefinitely (goto itself already waits for the navigation, so a waitForNavigation call placed after it ends up waiting for a second navigation that never happens), the proposed solution is to start listening for the navigation before triggering it:
const urls = [...]

for (let i = 0; i < urls.length; i++) {
  const url = urls[i];
  const promise = page.waitForNavigation({ waitUntil: 'networkidle' });
  await page.goto(`${url}`);
  await promise;
}
As referenced from this github issue

This is the best way I found to achieve this:
const puppeteer = require('puppeteer');

(async () => {
  const urls = ['https://www.google.com/', 'https://www.google.com/'];

  for (let i = 0; i < urls.length; i++) {
    const url = urls[i];
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto(`${url}`, { waitUntil: 'networkidle2' });
    await browser.close();
  }
})();

Something no one else mentions is that if you are fetching multiple pages using the same page object, it is crucial to set its navigation timeout to 0 (or at least raise it). Otherwise, any navigation that takes longer than the default 30 seconds will throw a timeout error.
const browser = await puppeteer.launch();
const page = await browser.newPage();
page.setDefaultNavigationTimeout(0);

Related

How to open list of pages synchronously

let downloadPageLinks = [];

fetchStreamingLinks.forEach(async (item) => {
  page = await browser.newPage();
  await page.goto(item, { waitUntil: "networkidle0" });
  const fetchDownloadPageLinks = await page.evaluate(() => {
    return loc4;
  });
  console.log(fetchDownloadPageLinks);
});
I have an array of links (fetchStreamingLinks). The above function opens all of the links in fetchStreamingLinks simultaneously: if the array contains 100 links, it opens all 100 at once.
What I want to do instead is open the links in fetchStreamingLinks one by one, perform some logic in the page context, close the page, and then open the next link.
.forEach() is not promise-aware, so when you pass it an async callback, it doesn't pay any attention to the promise that callback returns. Thus, it runs all your operations in parallel. .forEach() should essentially be considered obsolete these days, especially for asynchronous operations, because a plain for loop gives you so much more control and is promise-aware (e.g. the loop will pause for an await).
let downloadPageLinks = [];

for (let item of fetchStreamingLinks) {
  let page = await browser.newPage();
  await page.goto(item, { waitUntil: "networkidle0" });
  const fetchDownloadPageLinks = await page.evaluate(() => {
    return loc4;
  });
  await page.close();
  console.log(fetchDownloadPageLinks);
}
FYI, I don't know the puppeteer API really well, but you probably should close the page (as I show) when you're done with it to avoid pages stacking up as you process.

Discord Bot JS interaction reply does not wait until the code finishes executing

I just started to code and am trying to build a discord.js bot. I am scraping data from a website. My problem is that await interaction.reply(randomFact); executes before my code has finished scraping and returned the result. I have tried async/await but it still does not work.
module.exports = {
  data: new SlashCommandBuilder()
    .setName("tips")
    .setDescription("Scrap from website"),
  async execute(interaction) {
    let randomFact;
    let healthArray = [];
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto("https://www.example.com/health-facts/");
    const pageData = await page.evaluate(() => {
      return {
        html: document.documentElement.innerHTML,
      };
    });
    const $ = cheerio.load(pageData.html);
    $(".round-number")
      .find("h3")
      .each(function (i, el) {
        let row = $(el).text().replace(/(\s+)/g, " ");
        row = $(el)
          .text()
          .replace(/[0-9]+. /g, "")
          .trim();
        healthArray.push(row);
      });
    await browser.close();
    randomFact =
      healthArray[Math.floor(Math.random() * healthArray.length)];
    await interaction.reply(randomFact);
  },
};
(screenshot of the output error omitted)
Sorry if there is anything lacking in my post. I just joined Stack Overflow.
The Discord API has a time limit of 3 seconds to reply to an interaction before it expires. To increase this time limit, you will need to use the deferReply method, found here. The attached link is for ButtonInteraction, but SelectMenuInteraction and the other interaction types have the same method, so it shouldn't be an issue.
https://discord.js.org/#/docs/main/stable/search?query=Interaction.defer
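A minimal sketch of how that could look in the execute handler above. The scrapeHealthFacts helper is hypothetical, standing in for the Puppeteer/cheerio scraping code from the question; the key change is deferring first and editing the reply once the data is ready.

// inside the command module from the question
async execute(interaction) {
  // Acknowledge the interaction right away so it doesn't expire after 3 seconds
  await interaction.deferReply();

  // ... launch Puppeteer, scrape, and build healthArray exactly as in the question ...
  const healthArray = await scrapeHealthFacts(); // hypothetical helper wrapping the scraping code

  const randomFact =
    healthArray[Math.floor(Math.random() * healthArray.length)];

  // Replace the "thinking..." placeholder with the final result
  await interaction.editReply(randomFact);
},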

function in page.evaluate() is not working/being executed - Puppeteer

My code below tries to collect a bunch of hyperlinks that come under the class name ".jss2". However, I do not think the function within my page.evaluate() is working. When I run the code, the link_list const doesn't get displayed.
I ran document.querySelectorAll in the Chrome console and that was perfectly fine, so I am really having a hard time with this.
async function testing() {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.setViewport({ width: 1200, height: 800 });
  await page.goto(url);
  const link_list = await this.page.evaluate(() => {
    let elements = Array.from(document.querySelectorAll(".jss2"));
    let links = elements.map(element => {
      return element.href;
    });
    return (links);
  });
  console.log(link_list);
}
const link_list = await page.$$eval('.classname', links => links.map(link => link.href));
Found the answer here: PUPPETEER - unable to extract elements on certain websites using page.evaluate(() => document.querySelectorAll())
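In context, the testing() function above could look roughly like this (a sketch; note it also drops the stray this. in front of page from the original snippet):

async function testing() {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.setViewport({ width: 1200, height: 800 });
  await page.goto(url);

  // $$eval runs the callback in the page context against every element matching the selector
  const link_list = await page.$$eval(".jss2", links => links.map(link => link.href));

  console.log(link_list);
  await browser.close();
}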

How can I click on all links matching a selector with Playwright?

I'm using Playwright to scrape some data. How do I click on all links on the page matching a selector?
const { firefox } = require('playwright');

(async () => {
  const browser = await firefox.launch({ headless: false, slowMo: 50 });
  const page = await browser.newPage();
  await page.goto('https://www.google.com');
  page.pause(); // allow user to manually search for something
  const wut = await page.$$eval('a', links => {
    links.forEach(async (link) => {
      link.click(); // maybe works?
      console.log('whoopee'); // doesn't print anything
      page.goBack(); // crashes
    });
    return links;
  });
  console.log(`wut? ${wut}`); // prints 'wut? undefined'
  await browser.close();
})();
Some issues:
console.log inside the $$eval doesn't do anything.
page.goBack() and page.pause() inside the eval cause a crash.
The return value of $$eval is undefined (if I comment out page.goBack() so I get a return value at all). If I return links.length instead of links, it's correct (i.e. it's a positive integer). Huh?
I get similar results with:
const links = await page.locator('a');
await links.evaluateAll(...)
Clearly I don't know what I'm doing. What's the correct code to achieve something like this?
(X-Y problem alert: I don't actually care if I do this with $$eval, Playwright, or frankly even Javascript; all I really want to do is make this work in any language or tool).
const { context } = await launch({ slowMo: 250 });
const page = await context.newPage();
await page.goto('https://stackoverflow.com/questions/70702820/how-can-i-click-on-all-links-matching-a-selector-with-playwright');

const links = page.locator('a:visible');
const linksCount = await links.count();

for (let i = 0; i < linksCount; i++) {
  await page.bringToFront();
  try {
    const [newPage] = await Promise.all([
      context.waitForEvent('page', { timeout: 5000 }),
      links.nth(i).click({ modifiers: ['Control', 'Shift'] })
    ]);
    await newPage.waitForLoadState();
    console.log('Title:', await newPage.title());
    console.log('URL: ', page.url());
    await newPage.close();
  } catch {
    continue;
  }
}
There are a number of ways you could do this, but I like this approach the most. Clicking a link, waiting for the page to load, and then going back to the previous page has a lot of problems with it - most importantly, for many pages the links might change every time the page loads. Ctrl+Shift+clicking opens the link in a new tab, which you can access using the Promise.all pattern and catching the 'page' event.
I only tried this on this page, so I'm sure there are tons of other problems that may arise. But for this page in particular, using 'a:visible' was necessary to prevent getting stuck on hidden links. The whole clicking operation is wrapped in a try/catch because some of the links aren't real links and don't open a new page.
Depending on your use case, it may be easiest just to grab all the hrefs from each link:
const links = page.locator('a:visible');
const linksCount = await links.count();
const hrefs = [];

for (let i = 0; i < linksCount; i++) {
  hrefs.push(await links.nth(i).getAttribute('href'));
}

console.log(hrefs);
Try this approach. I will use TypeScript.
await page.waitForSelector(selector, { timeout: 10000 });
const links = await page.$$(selector);

for (const link of links) {
  await link.click({ timeout: 8000 });
  // your additional code
}
See more on https://youtu.be/54OwsiRa_eE?t=488

Puppeteer Async Await Loop in NodeJS

I am trying to make a script that :
Grabs all urls from a sitemap
Takes a screenshot of it with puppeteer
I am currently trying to understand how to code asynchronously, but I still have trouble finding the right coding pattern for this problem.
Here is the code I currently have :
// const spider = require('./spider');
const Promise = require('bluebird');
const puppeteer = require('puppeteer');
const SpiderConstructor = require('sitemapper');

async function crawl(url, timeout) {
  const results = await spider(url, timeout);
  await Promise.each(results, async (result, index) => {
    await screen(result, index);
  });
}

async function screen(result, index) {
  const browser = await puppeteer.launch();
  console.log('doing', index);
  const page = await browser.newPage();
  await page.goto(result);
  const path = await 'screenshots/' + index + page.title() + '.png';
  await page.screenshot({path});
  browser.close();
}

async function spider(url, timeout) {
  const spider = await new SpiderConstructor({
    url: url,
    timeout: timeout
  });
  const data = await spider.fetch();
  console.log(data.sites.length);
  return data.sites;
};

crawl('https://www.google.com/sitemap.xml', 15000)
  .catch(err => {
    console.error(err);
  });
I am having the following problems:
The length of the results array is not constant; it varies every time I launch the script. I guess it is not fully resolved when I display it, but I thought the whole point of await was that we are guaranteed that on the next line the promise is resolved.
The actual screenshotting part of the script doesn't work half the time, and I am pretty sure I have unresolved promises, but I have no idea of the right pattern for looping over an async function. Right now it seems like it takes one screenshot after the other (linear and incremental), but I get a lot of duplicates.
Any help is appreciated. Thank you for your time
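Not a full answer, but two things in the screen helper above stand out: page.title() returns a promise, so concatenating it unawaited puts "[object Promise]" into the file name, and browser.close() is asynchronous as well. A sketch of that function with both calls awaited:

async function screen(result, index) {
  const browser = await puppeteer.launch();
  console.log('doing', index);
  const page = await browser.newPage();
  await page.goto(result);
  // page.title() must be awaited before it can be used in the file name
  const title = await page.title();
  const path = 'screenshots/' + index + title + '.png';
  await page.screenshot({ path });
  // browser.close() returns a promise too
  await browser.close();
}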
