How to open list of pages synchronously - javascript

let downloadPageLinks = [];
fetchStreamingLinks.forEach(async (item) => {
  page = await browser.newPage();
  await page.goto(item, { waitUntil: "networkidle0" });
  const fetchDownloadPageLinks = await page.evaluate(() => {
    return loc4;
  });
  console.log(fetchDownloadPageLinks);
});
I have an array of links (fetchStreamingLinks). The function above opens all the links in fetchStreamingLinks simultaneously; if the array contains 100 links, it opens all 100 at once.
What I want instead is to open the links in fetchStreamingLinks one by one: open a link, perform some logic in the page context, close the page, and then open the next link.

.forEach() is not promise-aware so when you pass it an async callback, it doesn't pay any attention to the promise that it returns. Thus, it runs all your operations in parallel. .forEach() should be essentially considered obsolete these days, especially for asynchronous operations because a plain for loop gives you so much more control and is promise-aware (e.g. the loop will pause for an await).
let downloadPageLinks = [];
for (let item of fetchStreamingLinks) {
  let page = await browser.newPage();
  await page.goto(item, { waitUntil: "networkidle0" });
  const fetchDownloadPageLinks = await page.evaluate(() => {
    return loc4;
  });
  await page.close();
  console.log(fetchDownloadPageLinks);
}
FYI, I don't know the puppeteer API really well, but you probably should close the page (as I show) when you're done with it to avoid pages stacking up as you process.
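For reference, here is a rough sketch of how that loop could sit inside a complete script. The launch options, the async IIFE wrapper, and pushing results into downloadPageLinks are my own scaffolding, not something from the question; loc4 and fetchStreamingLinks are assumed to exist as in the original code.
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const downloadPageLinks = [];

  for (const item of fetchStreamingLinks) {
    const page = await browser.newPage();
    await page.goto(item, { waitUntil: "networkidle0" });
    // loc4 is a global defined by the pages being scraped (as in the question)
    const link = await page.evaluate(() => loc4);
    downloadPageLinks.push(link);
    await page.close();
  }

  await browser.close();
  console.log(downloadPageLinks);
})();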

Related

Function does not pass object to the constant

I'm new to javascript so maybe it's a dumb mistake. I'm trying to pass the values of the object that I get in this web scraping function to the constant, but I'm not succeeding. Every time I try to print the menu it prints as "undefined".
const puppeteer = require("puppeteer");
async function getMenu() {
console.log("Opening the browser...");
const browser = await puppeteer.launch({
headless: true
});
const page = await browser.newPage();
await page.goto('https://pra.ufpr.br/ru/ru-centro-politecnico/', {waitUntil: 'domcontentloaded'});
console.log("Content loaded...");
// Get the viewport of the page
const fullMenu = await page.evaluate(() => {
return {
day: document.querySelector('#conteudo div:nth-child(3) p strong').innerText,
breakfastFood: document.querySelector('tbody tr:nth-child(2)').innerText,
lunchFood: document.querySelector('tbody tr:nth-child(4)').innerText,
dinnerFood: document.querySelector('tbody tr:nth-child(6)').innerText
};
});
await browser.close();
return {
breakfast: fullMenu.day + "\nCafé da Manhã:\n" + fullMenu.breakfastFood,
lunch: fullMenu.day + "\nAlmoço:\n" + fullMenu.lunchFood,
dinner: fullMenu.day + "\nJantar:\n" + fullMenu.dinnerFood
};
};
const menu = getMenu();
console.log(menu.breakfast);
I've tried passing these values to a variable in several ways, but I'm not succeeding. I also accept other methods of getting these strings out; I'm doing it this way because it's the simplest I could think of.
Your getMenu() is an async function.
In your last bit of code, can you change it to,
(async () => {
  let menu = await getMenu();
  console.log(menu.breakfast);
})();
credit to this post.
I have no access to the package that you imported. You may try changing the last part of your code to:
// note: top-level await requires an ES module (otherwise wrap this in an async function)
const menu = await getMenu();
if (menu) {
  console.log(menu.breakfast);
}
Explanation
getMenu() and await getMenu() are different things in JS. getMenu() returns a Promise object, which is not itself the string / number / object you want. await getMenu() tells JS to pause here (letting other code run) until the result of getMenu() is available.
Without await, console.log(menu.breakfast) runs immediately. Your code accesses menu while it is still a pending Promise, and since a Promise has no breakfast property, you get undefined.
With await, menu holds the resolved value rather than a Promise, so menu.breakfast is defined. The if (menu) {...} check is only a guard in case getMenu() resolves without a value; the waiting itself is done by await. This is useful when you want to console.log() an async/await return value.
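To make the contrast concrete, a tiny illustrative snippet (the property name is just the one from the question):
const pending = getMenu();        // a Promise; pending.breakfast is undefined
const resolved = await getMenu(); // the resolved object; resolved.breakfast is a string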

How can I click on all links matching a selector with Playwright?

I'm using Playwright to scrape some data. How do I click on all links on the page matching a selector?
const { firefox } = require('playwright');

(async () => {
  const browser = await firefox.launch({ headless: false, slowMo: 50 });
  const page = await browser.newPage();
  await page.goto('https://www.google.com');
  page.pause(); // allow user to manually search for something
  const wut = await page.$$eval('a', links => {
    links.forEach(async (link) => {
      link.click(); // maybe works?
      console.log('whoopee'); // doesn't print anything
      page.goBack(); // crashes
    });
    return links;
  });
  console.log(`wut? ${wut}`); // prints 'wut? undefined'
  await browser.close();
})();
Some issues:
console.log inside the $$eval doesn't do anything.
page.goBack() and page.pause() inside the eval cause a crash.
The return value of $$eval is undefined (if I comment out page.goBack() so I get a return value at all). If I return links.length instead of links, it's correct (i.e. it's a positive integer). Huh?
I get similar results with:
const links = await page.locator('a');
await links.evaluateAll(...)
Clearly I don't know what I'm doing. What's the correct code to achieve something like this?
(X-Y problem alert: I don't actually care if I do this with $$eval, Playwright, or frankly even Javascript; all I really want to do is make this work in any language or tool).
const { context } = await launch({ slowMo: 250 });
const page = await context.newPage();
await page.goto('https://stackoverflow.com/questions/70702820/how-can-i-click-on-all-links-matching-a-selector-with-playwright');

const links = page.locator('a:visible');
const linksCount = await links.count();

for (let i = 0; i < linksCount; i++) {
  await page.bringToFront();
  try {
    const [newPage] = await Promise.all([
      context.waitForEvent('page', { timeout: 5000 }),
      links.nth(i).click({ modifiers: ['Control', 'Shift'] })
    ]);
    await newPage.waitForLoadState();
    console.log('Title:', await newPage.title());
    console.log('URL: ', page.url());
    await newPage.close();
  } catch {
    continue;
  }
}
There are a number of ways you could do this, but I like this approach the most. Clicking a link, waiting for the page to load, and then going back to the previous page has a lot of problems with it - most importantly, for many pages the links might change every time the page loads. Ctrl+Shift+clicking opens the link in a new tab, which you can access using the Promise.all pattern and catching the 'page' event.
I only tried this on this page, so I'm sure there are other problems that may arise. But for this page in particular, using 'a:visible' was necessary to prevent getting stuck on hidden links. The whole clicking operation is wrapped in a try/catch because some of the links aren't real links and don't open a new page.
Depending on your use case, it may be easiest just to grab all the hrefs from each link:
const links = page.locator('a:visible');
const linksCount = await links.count();
const hrefs = [];
for (let i = 0; i < linksCount; i++) {
  hrefs.push(await links.nth(i).getAttribute('href'));
}
console.log(hrefs);
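If you go that route, a sketch (not from the original answer) of then visiting each collected href in turn on the same page might look like this; note that relative hrefs would first need to be resolved against the page URL:
for (const href of hrefs) {
  if (!href) continue; // some anchors have no href at all
  await page.goto(href, { waitUntil: 'domcontentloaded' });
  // ...extract whatever you need from the page here...
  console.log(await page.title());
}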
Try this approach. I will use TypeScript.
await page.waitForSelector(selector, { timeout: 10000 });
const links = await page.$$(selector);
for (const link of links) {
  await link.click({ timeout: 8000 });
  // your additional code
}
See more on https://youtu.be/54OwsiRa_eE?t=488

Run puppeteer function again until completion

I've got a Puppeteer function that runs in a Node JS script. Upon launching, my initial function runs; however, after navigating to the next page of a website (in my example using btnToClick) I need it to re-evaluate the page and collect more data. Right now I'm using a setInterval that assumes the total time per page scrape is 12 seconds. I'd like to run my extract function again after the previous run has completed, and keep running it until nextBtn returns 0.
Below is my current set up:
function extractFromArea() {
  puppeteer.launch({
    headless: true
  }).then(async browser => {
    // go to our page of choice, and wait for the body to load
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 720 });
    await page.goto('mypage');

    const extract = async function () {
      await page.evaluate(() => {
        // next button
        const nextBtn = document.querySelectorAll('a.nav.next.rndBtn.ui_button.primary.taLnk').length
        if (nextBtn < 1) {
          // if no more pages are found
        }
        // wait, then proceed to next page
        setTimeout(() => {
          const btnToClick = document.querySelector('a.nav.next.rndBtn.ui_button.primary.taLnk')
          btnToClick.click()
        }, 2000)
      });
    };

    // TODO: somehow need to make this run again based on when the current extract function is finished.
    setInterval(() => {
      extract()
    }, 12000)

    // kick off the extraction
    extract()
  });
}
Here's what a while loop might look like:
while (await page.$('a.next')) {
  await page.click('a.next')
  // do something
}
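For context, here is a minimal sketch of how such a loop could replace the setInterval in the original script. The selector is the one from the question; the waitForNavigation step and the "collect data" placeholder are assumptions rather than a tested implementation:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 720 });
  await page.goto('mypage');

  const nextSel = 'a.nav.next.rndBtn.ui_button.primary.taLnk';

  while (true) {
    // ...collect data from the current page here...

    // stop once no "next" button is left
    if (!(await page.$(nextSel))) break;

    // click "next" and wait for the resulting navigation before looping again
    await Promise.all([
      page.waitForNavigation({ waitUntil: 'networkidle2' }),
      page.click(nextSel),
    ]);
  }

  await browser.close();
})();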

Scraping dynamic pages with node.js and headless browser

I'm trying to scrape data from a page that loads dynamically. For this I'm using the headless browser Puppeteer.
Puppeteer can be seen as the headlessBrowserClient in the code.
The main challenge is to gracefully close the browser as soon as the needed data has been received. But if you close it before evaluateCustomCode has finished executing, its progress is lost.
evaluateCustomCode is a function that can be called as if we ran it in the Chrome Dev tools.
To control the network requests and the async flow of the Puppeteer API, I use an async generator that encapsulates all the logic described above.
The problem is that I feel that the code smells, but I can't see any better solution.
Ideas ?
module.exports = function buildClient (headlessBrowserClient) {
  const getPageContent = async (pageUrl, evaluateCustomCode) => {
    const request = sendRequest(pageUrl)
    const { value: page } = await request.next()
    if (page) {
      const pageContent = await page.evaluate(evaluateCustomCode)
      request.next()
      return pageContent
    }
  }

  async function * sendRequest (url) {
    const browser = await headlessBrowserClient.launch()
    const page = await browser.newPage()
    const state = {
      req: { url },
    }
    try {
      await page.goto(url)
      yield page
    } catch (error) {
      throw new APIError(error, state)
    } finally {
      yield browser.close()
    }
  }

  return {
    getPageContent,
  }
}
You can use waitForFunction or waitFor and evaluate with Promise.all. No matter how dynamic the website is, you are waiting for something to be true at end and close the browser when that happens.
Since I do not have access to your dynamic URL, I am going to use some random variables and delays as an example. It will resolve once the variable becomes truthy.
await page.waitForFunction(() => !!someVariableThatShouldBeTrue);
What if your dynamic page actually creates a selector at some point after you evaluate the code? In that case:
await page.waitFor('someSelector')
Now back to your customCode, let me rename that for you a bit,
await page.evaluate(customCode)
Where customCode is something that will set a variable someVariableThatShouldBeTrue to true somewhere. Honestly it can be anything, a request, a string or anything. The possibilities are endless.
You can put a promise inside page.evaluate; recent Chromium supports them very well. So the following will work too, resolving once the function/data has loaded. Make sure the customCode is an async function or returns a promise.
const pageContent = await page.evaluate(CustomCode);
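For illustration, here is a hypothetical promise-returning CustomCode; the selector and the polling interval are assumptions, not something from the question:
// Evaluated in the page context: resolves with the element's text once it appears
const CustomCode = async () => {
  while (!document.querySelector('#dynamic-content')) {
    await new Promise(resolve => setTimeout(resolve, 100)); // poll every 100 ms
  }
  return document.querySelector('#dynamic-content').textContent;
};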
Alright, now we have all required pieces. I modified the code a bit so it doesn't smell to me :D ,
module.exports = function buildClient(headlessBrowserClient) {
  return {
    getPageContent: async (url, CustomCode) => {
      const state = {
        req: { url },
      };
      // declared here so we can reach them in the "finally" block
      let browser, page;
      try {
        // launch browser
        browser = await headlessBrowserClient.launch()
        page = await browser.newPage()
        await page.goto(url)

        // evaluate and wait for something to happen
        // the first element is the pageContent; the whole thing resolves once both are done
        const [pageContent] = await Promise.all([
          page.evaluate(CustomCode),
          page.waitForFunction(() => !!someVariableThatShouldBeTrue)
        ])

        // Or, since you can put a promise inside page.evaluate (recent Chromium supports them very well):
        // const pageContent = await page.evaluate(CustomCode)

        return pageContent;
      } catch (error) {
        throw new APIError(error, state)
      } finally {
        // NOTE: maybe we can move these into a different function
        await page.close()
        await browser.close()
      }
    }
  }
}
You can change and tweak it more as you wish. I did not test the final code (since I don't have APIError, evaluateCustomCode etc) but it should work.
It doesn't have all those generators and stuff like that. Promises are how you deal with dynamic pages :D.
PS: IMO, such questions are a better fit for Code Review.

Crawling multiple URLs in a loop using Puppeteer

I have an array of URLs to scrape data from:
urls = ['url','url','url'...]
This is what I'm doing:
urls.map(async (url) => {
  await page.goto(url);
  await page.waitForNavigation({ waitUntil: 'networkidle' });
})
This seems to not wait for page load and visits all the URLs quite rapidly (I even tried using page.waitFor).
I wanted to know if am I doing something fundamentally wrong or this type of functionality is not advised/supported.
map, forEach, reduce, etc., do not wait for the asynchronous operation within them before they proceed to the next element of the iterable they are iterating over.
There are multiple ways of going through each item of an iterable sequentially while performing an asynchronous operation, but the easiest in this case, I think, is to simply use a normal for loop, which does wait for the operation to finish.
const urls = [...]

for (let i = 0; i < urls.length; i++) {
  const url = urls[i];
  await page.goto(`${url}`);
  await page.waitForNavigation({ waitUntil: 'networkidle2' });
}
This would visit one url after another, as you are expecting. If you are curious about iterating serially using await/async, you can have a peek at this answer: https://stackoverflow.com/a/24586168/791691
The accepted answer shows how to serially visit each page one at a time. However, you may want to visit multiple pages simultaneously when the task is embarrassingly parallel, that is, scraping a particular page isn't dependent on data extracted from other pages.
A tool that can help achieve this is Promise.allSettled which lets us fire off a bunch of promises at once, determine which were successful and harvest results.
For a basic example, let's say we want to scrape usernames for Stack Overflow users given a series of ids.
Serial code:
const puppeteer = require("puppeteer"); // ^19.6.3
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
const baseURL = "https://stackoverflow.com/users";
const startId = 6243352;
const qty = 5;
const usernames = [];
for (let i = startId; i < startId + qty; i++) {
await page.goto(`${baseURL}/${i}`, {
waitUntil: "domcontentloaded"
});
const sel = ".flex--item.mb12.fs-headline2.lh-xs";
const el = await page.waitForSelector(sel);
usernames.push(await el.evaluate(el => el.textContent.trim()));
}
console.log(usernames);
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
Parallel code:
let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const baseURL = "https://stackoverflow.com/users";
  const startId = 6243352;
  const qty = 5;

  const usernames = (await Promise.allSettled(
    [...Array(qty)].map(async (_, i) => {
      const page = await browser.newPage();
      await page.goto(`${baseURL}/${i + startId}`, {
        waitUntil: "domcontentloaded"
      });
      const sel = ".flex--item.mb12.fs-headline2.lh-xs";
      const el = await page.waitForSelector(sel);
      const text = await el.evaluate(el => el.textContent.trim());
      await page.close();
      return text;
    })
  ))
    .filter(e => e.status === "fulfilled")
    .map(e => e.value);

  console.log(usernames);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Remember that this is a technique, not a silver bullet that guarantees a speed increase on all workloads. It will take some experimentation to find the optimal balance between the cost of creating more pages versus the parallelization of network requests on a given particular task and system.
The example here is contrived since it's not interacting with the page dynamically, so there's not as much room for gain as in a typical Puppeteer use case that involves network requests and blocking waits per page.
Of course, beware of rate limiting and any other restrictions imposed by sites (running the code above may anger Stack Overflow's rate limiter).
For tasks where creating a page per task is prohibitively expensive or you'd like to set a cap on parallel request dispatches, consider using a task queue or combining serial and parallel code shown above to send requests in chunks. This answer shows a generic pattern for this agnostic of Puppeteer.
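As a rough illustration of the chunking idea (the chunk helper, the chunk size, and using page.title() as the stand-in extraction step are my own assumptions, not part of the linked answer):
// Split an array into groups of `size`
const chunk = (arr, size) =>
  arr.reduce((acc, _, i) => (i % size ? acc : [...acc, arr.slice(i, i + size)]), []);

// Scrape URLs a chunk at a time: each chunk runs in parallel,
// but the next chunk only starts after the previous one has settled.
const scrapeInChunks = async (browser, urls, size = 3) => {
  const results = [];
  for (const group of chunk(urls, size)) {
    const settled = await Promise.allSettled(group.map(async url => {
      const page = await browser.newPage();
      try {
        await page.goto(url, { waitUntil: "domcontentloaded" });
        return await page.title(); // stand-in for real extraction logic
      } finally {
        await page.close();
      }
    }));
    results.push(...settled.filter(s => s.status === "fulfilled").map(s => s.value));
  }
  return results;
};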
These patterns can be extended to handle the case when certain pages depend on data from other pages, forming a dependency graph.
See also Using async/await with a forEach loop which explains why the original attempt in this thread using map fails to wait for each promise.
If you find that you are waiting on your promise indefinitely, the proposed solution is to use the following:
const urls = [...]

for (let i = 0; i < urls.length; i++) {
  const url = urls[i];
  const promise = page.waitForNavigation({ waitUntil: 'networkidle' });
  await page.goto(`${url}`);
  await promise;
}
As referenced from this github issue
Best way I found to achieve this.
const puppeteer = require('puppeteer');

(async () => {
  const urls = ['https://www.google.com/', 'https://www.google.com/']
  for (let i = 0; i < urls.length; i++) {
    const url = urls[i];
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto(`${url}`, { waitUntil: 'networkidle2' });
    await browser.close();
  }
})();
Something no one else mentions is that if you are fetching multiple pages using the same page object, it is worth setting its navigation timeout to 0 (or to a higher value). Otherwise, any navigation that takes longer than the default 30 seconds will throw a timeout error.
const browser = await puppeteer.launch();
const page = await browser.newPage();
page.setDefaultNavigationTimeout(0);
