Puppeteer | Wait until all JavaScript is executed - javascript

I'm trying to take screenshots of multiple pages, which should be fully loaded (including lazy-loaded images) for later comparison.
I found the lazyimages_without_scroll_events.js example, which helps a lot.
With the following code the screenshots look fine, but there is one major issue.
async function takeScreenshot(browser, viewport, route) {
  return browser.newPage().then(async (page) => {
    const fileName = `${viewport.directory}/${getFilename(route)}`;
    await page.setViewport({
      width: viewport.width,
      height: 500,
    });
    await page.goto(
      `${config.server.master}${route}.html`,
      {
        waitUntil: 'networkidle0',
      }
    );
    await page.evaluate(() => {
      /* global document,requestAnimationFrame */
      let lastScrollTop = document.scrollingElement.scrollTop;
      // Scroll to bottom of page until we can't scroll anymore.
      const scroll = () => {
        document.scrollingElement.scrollTop += 100;
        if (document.scrollingElement.scrollTop !== lastScrollTop) {
          lastScrollTop = document.scrollingElement.scrollTop;
          requestAnimationFrame(scroll);
        }
      };
      scroll();
    });
    await page.waitFor(5000);
    await page.screenshot({
      path: `screenshots/master/${fileName}.png`,
      fullPage: true,
    });
    await page.close();
    console.log(`Viewport "${viewport.name}", Route "${route}"`);
  });
}
Issue: Even with higher values for page.waitFor() (timeout), sometimes not all of the frontend-related JavaScript on the pages has fully executed. This concerns some older pages where JavaScript changes the frontend, e.g. in one legacy case jQuery.matchHeight.
Best case: In an ideal world Puppeteer would wait until all JavaScript is evaluated and executed. Is something like this possible?
EDIT
I was able to improve the script slightly with the help of cody-g's answer.
function jQueryMatchHeightIsProcessed() {
  return Array.from($('.match-height')).every((element) => {
    return element.style.height !== '';
  });
}

// Within takeScreenshot() after page.waitFor()
await page.waitForFunction(jQueryMatchHeightIsProcessed, {timeout: 0});
... but it is far from perfect. It seems I have to find similar solutions for the various frontend scripts to really cover everything that happens on the target page.
The main problem with jQuery.matchHeight in my case is that it computes different heights in different runs, possibly caused by image lazy loading. It seems I will have to wait until I can replace it with Flexbox. (^_^)°
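One partial workaround that helped in a similar situation is to wait until every image currently in the DOM has finished loading before trusting the computed heights (a sketch, not part of the original script; it assumes the lazy loader swaps real src values in as you scroll):
// Sketch: after the scroll step, wait until every <img> in the DOM has finished loading,
// so height calculations (e.g. jQuery.matchHeight) see the final image sizes.
await page.waitForFunction(() => {
  return Array.from(document.querySelectorAll('img')).every((img) => img.complete);
}, { timeout: 0 });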
Other issues to fix:
Disable animations:
await page.addStyleTag({
  content: `
    * {
      transition: none !important;
      animation: none !important;
    }
  `,
});
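If the pages (or the libraries they use) respect the prefers-reduced-motion media feature, newer Puppeteer versions can also emulate it; whether the target pages honor this is an assumption, so treat it as an optional extra:
// Sketch: only has an effect if the page's CSS/JS checks prefers-reduced-motion.
await page.emulateMediaFeatures([
  { name: 'prefers-reduced-motion', value: 'reduce' },
]);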
Handle slideshows:
function handleSwiperSlideshows() {
  Array.from($('.swiper-container')).forEach((element) => {
    if (typeof element.swiper !== 'undefined') {
      if (element.swiper.autoplaying) {
        element.swiper.stopAutoplay();
        element.swiper.slideTo(0);
      }
    }
  });
}

// Within takeScreenshot() after page.waitFor()
await page.evaluate(handleSwiperSlideshows);
But it is still not enough. I think it's impossible to visually test these legacy pages.

The following waitForFunction might be useful for you: you can use it to wait for any arbitrary function to evaluate to true. If you have access to the page's code, you can set the window status and use that to notify Puppeteer that it is safe to continue, or just rely on some other ready state.
Note: this function polls, re-evaluating at an interval which can be specified.
const watchDog = page.waitForFunction('<your function to evaluate to true>');
E.g.,
const watchDog = page.waitForFunction('window.status === "ready"');
await watchDog;
In your page's code you simply need to set window.status to "ready".
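For example (a sketch; where exactly the flag is set depends on your page's code, and using the load event as the "ready" signal is an assumption), the page-side and Puppeteer-side code could look like this, with the polling interval tuned via the options argument:
// In the page's own code, once initialization is considered finished:
window.addEventListener('load', () => {
  window.status = 'ready';
});

// In the Puppeteer script: re-check every 500 ms and never time out.
await page.waitForFunction('window.status === "ready"', { polling: 500, timeout: 0 });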
To utilize multiple watchdogs in multiple asynchronous files you could do something like:
index.js:
...import/require file1.js;
...import/require file2.js;
...code...
file1.js:
var file1Flag=false; // global
...code...
file1Flag=true;
file2.js:
var file2Flag=false; // global
...code...
file2Flag=true;
main.js:
const watchDog = page.waitForFunction('file1Flag && file2Flag');
await watchDog;
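A more concrete version of that idea might look like this (a sketch; the flag names are hypothetical, the files are assumed to be classic scripts, and the flags are written to window explicitly so the expression passed to waitForFunction can see them):
// file1.js (hypothetical, loaded as a classic <script>)
window.file1Flag = false;
// ... file1's work ...
window.file1Flag = true;

// file2.js (hypothetical)
window.file2Flag = false;
// ... file2's work ...
window.file2Flag = true;

// Puppeteer script: continue once both files have finished.
await page.waitForFunction('window.file1Flag && window.file2Flag', { polling: 200 });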

async function takeScreenshot(browser, viewport, route) {
  return browser.newPage().then(async (page) => {
    const fileName = `${viewport.directory}/${getFilename(route)}`;
    await page.setViewport({
      width: viewport.width,
      height: 500,
    });
    await page.goto(
      `${config.server.master}${route}.html`,
      {
        waitUntil: 'networkidle0',
      }
    );
    await page.evaluate(() => {
      scroll(0, 99999)
    });
    await page.waitFor(5000);
    await page.screenshot({
      path: `screenshots/master/${fileName}.png`,
      fullPage: true,
    });
    await page.close();
    console.log(`Viewport "${viewport.name}", Route "${route}"`);
  });
}
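Note that page.waitFor() has been deprecated in newer Puppeteer releases. If you are on a recent version, one alternative to the fixed 5-second delay is to wait for the network to go idle after scrolling, which also catches the requests the lazy loader fires (a sketch; page.waitForNetworkIdle only exists in newer Puppeteer versions and assumes the lazy loader issues normal network requests):
await page.evaluate(() => {
  // Jump to the bottom so any scroll-based lazy loading kicks in.
  window.scrollTo(0, document.body.scrollHeight);
});
// Resolve once no network requests have been in flight for 500 ms.
await page.waitForNetworkIdle({ idleTime: 500, timeout: 30000 });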

Related

Function does not pass object to the constant

I'm new to JavaScript so maybe it's a dumb mistake. I'm trying to pass the values of the object that I get in this web-scraping function to the constant, but I'm not succeeding. Every time I try to print the menu it prints "undefined".
const puppeteer = require("puppeteer");

async function getMenu() {
  console.log("Opening the browser...");
  const browser = await puppeteer.launch({
    headless: true
  });
  const page = await browser.newPage();
  await page.goto('https://pra.ufpr.br/ru/ru-centro-politecnico/', {waitUntil: 'domcontentloaded'});
  console.log("Content loaded...");
  // Get the viewport of the page
  const fullMenu = await page.evaluate(() => {
    return {
      day: document.querySelector('#conteudo div:nth-child(3) p strong').innerText,
      breakfastFood: document.querySelector('tbody tr:nth-child(2)').innerText,
      lunchFood: document.querySelector('tbody tr:nth-child(4)').innerText,
      dinnerFood: document.querySelector('tbody tr:nth-child(6)').innerText
    };
  });
  await browser.close();
  return {
    breakfast: fullMenu.day + "\nCafé da Manhã:\n" + fullMenu.breakfastFood,
    lunch: fullMenu.day + "\nAlmoço:\n" + fullMenu.lunchFood,
    dinner: fullMenu.day + "\nJantar:\n" + fullMenu.dinnerFood
  };
};

const menu = getMenu();
console.log(menu.breakfast);
I've tried to pass these values to a variable in several ways but I'm not succeeding. I'm also open to other methods of passing these strings; I'm doing it this way because it's the simplest I could think of.
Your getMenu() is an async function.
In your last bit of code, change it to:
(async () => {
  let menu = await getMenu();
  console.log(menu.breakfast);
})();
credit to this post.
I have no access to the package that you imported. You may try changing the last part of your code to:
const menu = await getMenu();
if (menu) {
  console.log(menu.breakfast);
}
Explanation
getMenu() and await getMenu() are different things in JS. getMenu() returns a Promise object, which does not itself represent any string / number / return value. await getMenu() tells JS to suspend the surrounding code until the result of getMenu() is available.
Without await, nothing stops console.log(menu.breakfast) from running immediately. Your code accesses menu while it is still a pending Promise object; a breakfast property doesn't exist on the Promise, so you get undefined.
With await getMenu(), menu holds the resolved return value by the time the if (menu) {...} check and the console.log() run. Note that await is only valid inside an async function (or at the top level of an ES module).
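If you'd rather not wrap the call in an async function, the same thing can be written with .then() on the promise that getMenu() already returns:
getMenu().then((menu) => {
  console.log(menu.breakfast);
});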

How can I click on all links matching a selector with Playwright?

I'm using Playwright to scrape some data. How do I click on all links on the page matching a selector?
const { firefox } = require('playwright');

(async () => {
  const browser = await firefox.launch({headless: false, slowMo: 50});
  const page = await browser.newPage();
  await page.goto('https://www.google.com');
  page.pause(); // allow user to manually search for something
  const wut = await page.$$eval('a', links => {
    links.forEach(async (link) => {
      link.click(); // maybe works?
      console.log('whoopee'); // doesn't print anything
      page.goBack(); // crashes
    });
    return links;
  });
  console.log(`wut? ${wut}`); // prints 'wut? undefined'
  await browser.close();
})();
Some issues:
console.log inside the $$eval doesn't do anything.
page.goBack() and page.pause() inside the eval cause a crash.
The return value of $$eval is undefined (if I comment out page.goBack() so I get a return value at all). If I return links.length instead of links, it's correct (i.e. it's a positive integer). Huh?
I get similar results with:
const links = await page.locator('a');
await links.evaluateAll(...)
Clearly I don't know what I'm doing. What's the correct code to achieve something like this?
(X-Y problem alert: I don't actually care if I do this with $$eval, Playwright, or frankly even Javascript; all I really want to do is make this work in any language or tool).
const { firefox } = require('playwright');

const browser = await firefox.launch({ slowMo: 250 });
const context = await browser.newContext();
const page = await context.newPage();
await page.goto('https://stackoverflow.com/questions/70702820/how-can-i-click-on-all-links-matching-a-selector-with-playwright');
const links = page.locator('a:visible');
const linksCount = await links.count();
for (let i = 0; i < linksCount; i++) {
  await page.bringToFront();
  try {
    const [newPage] = await Promise.all([
      context.waitForEvent('page', { timeout: 5000 }),
      links.nth(i).click({ modifiers: ['Control', 'Shift'] })
    ]);
    await newPage.waitForLoadState();
    console.log('Title:', await newPage.title());
    console.log('URL:  ', newPage.url());
    await newPage.close();
  }
  catch {
    continue;
  }
}
There are a number of ways you could do this, but I like this approach the most. Clicking a link, waiting for the page to load, and then going back to the previous page has a lot of problems; most importantly, for many pages the links might change every time the page loads. Ctrl+Shift+clicking opens the link in a new tab, which you can access using the Promise.all pattern and catching the 'page' event.
I only tried this on this page, so I'm sure there are tons of other problems that may arise. But for this page in particular, using 'a:visible' was necessary to prevent getting stuck on hidden links. The whole clicking operation is wrapped in a try/catch because some of the links aren't real links and don't open a new page.
Depending on your use case, it may be easiest just to grab all the hrefs from each link:
const links = page.locator('a:visible');
const linksCount = await links.count();
const hrefs = [];
for (let i = 0; i < linksCount; i++) {
  hrefs.push(await links.nth(i).getAttribute('href'));
}
console.log(hrefs);
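If the goal is to visit each target page, you can then navigate to the collected hrefs directly instead of clicking the links (a sketch; it assumes the hrefs can be resolved against the current page URL):
const base = page.url();
for (const href of hrefs) {
  if (!href) continue; // skip anchors without an href
  const url = new URL(href, base).toString(); // resolve relative URLs
  await page.goto(url);
  console.log('Visited:', await page.title());
}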
Try this approach. I will use TypeScript.
await page.waitForSelector(selector, {timeout: 10000});
const links = await page.$$(selector);

for (const link of links) {
  await link.click({timeout: 8000});
  // your additional code
}
See more on https://youtu.be/54OwsiRa_eE?t=488

Run puppeteer function again until completion

I've got a Puppeteer function that runs in a Node.js script. Upon launching, my initial function runs; however, after navigating to the next page of a website (in my example using btnToClick), I need it to re-evaluate the page and collect more data. Right now I'm using a setInterval that assumes the total time per page scrape is 12 seconds. I'd like to be able to run my extract function again after it has completed, and keep running it until nextBtn returns 0.
Below is my current setup:
function extractFromArea() {
  puppeteer.launch({
    headless: true
  }).then(async browser => {
    // go to our page of choice, and wait for the body to load
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 720 });
    await page.goto('mypage');

    const extract = async function() {
      // wait before evaluating the page
      await page.evaluate(() => {
        // next button
        const nextBtn = document.querySelectorAll('a.nav.next.rndBtn.ui_button.primary.taLnk').length
        if (nextBtn < 1) {
          // if no more pages are found
        }
        // wait, then proceed to next page
        setTimeout(() => {
          const btnToClick = document.querySelector('a.nav.next.rndBtn.ui_button.primary.taLnk')
          btnToClick.click()
        }, 2000)
      });
    };

    // TODO: somehow need to make this run again based on when the current extract function is finished.
    setInterval(() => {
      extract()
    }, 12000)

    // kick off the extraction
    extract()
  });
}
Here's what a while loop might look like:
while (await page.$('a.next')) {
  await page.click('a.next')
  // do something
}
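Applied to the question's setup, the whole setInterval can be replaced by a loop that extracts, clicks next, and waits for the new content before the next pass (a sketch; the selector comes from the question, but waiting for a full navigation with waitForNavigation is an assumption about how the site paginates):
const puppeteer = require('puppeteer');

const nextSelector = 'a.nav.next.rndBtn.ui_button.primary.taLnk';

async function extractFromArea() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 720 });
  await page.goto('mypage');

  // Keep going as long as a "next" button is present.
  while (await page.$(nextSelector)) {
    const data = await page.evaluate(() => {
      // ... read whatever you need from the DOM ...
      return document.title;
    });
    console.log(data);

    // Click "next" and wait for the new page before the next iteration.
    await Promise.all([
      page.waitForNavigation({ waitUntil: 'networkidle0' }),
      page.click(nextSelector),
    ]);
  }

  await browser.close();
}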

Puppeteer throws error with "Node not visible..."

When I open this page with Puppeteer and try to click the element, it throws an error even though it should be possible to click it.
const puppeteer = require('puppeteer');
const url = "https://www.zapimoveis.com.br/venda/apartamentos/rj+rio-de-janeiro/";

async function run() {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(url, {waitUntil: 'load'});
  const nextPageSelector = "#list > div.pagination > #proximaPagina";
  await page.waitForSelector(nextPageSelector, {visible: true})
  console.log("Found the button");
  await page.click(nextPageSelector)
  console.log('clicked');
}

run();
Here's a working version of your code.
const puppeteer = require('puppeteer');
const url = "https://www.zapimoveis.com.br/venda/apartamentos/rj+rio-de-janeiro/";

async function run() {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(url);
  const nextPageSelector = "#list > div.pagination > #proximaPagina";
  console.log("Found the button");
  await page.evaluate(selector => {
    return document.querySelector(selector).click();
  }, nextPageSelector)
  console.log('clicked');
}

run();
I personally prefer to use page.evaluate, as page.click doesn't work for me either in some cases, and you can execute whatever JS on the page directly.
The only thing to know is the syntax:
- first param: the function to execute
- second param: the variable(s) to be passed to this function
Found the problem.
When you call page.click(selector) or ElementHandle.click(), Puppeteer scrolls to the element, finds its center coordinates, and finally clicks. It uses the function _clickablePoint at node_modules/puppeteer/lib/JSHandle.js to find the coordinates.
The problem with this website (zapimoveis) is that the scroll into the element's viewport is too slow, so Puppeteer can't find its (x, y) center.
One nice way to click this element is to use page.evaluate to click it with the page's own JavaScript. But there is a gambi (quick hack) way that I prefer. :) Replace the line await page.click(nextPageSelector) with these lines:
try { await page.click(nextPageSelector) } catch (error) {} // sacrifice click :)
await page.waitFor(3000); // time for scrolling
await page.click(nextPageSelector); // this will work
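Another option along the same lines (a sketch, not from the answers above; it assumes the click only fails because the element is outside the viewport) is to scroll the button into view yourself before clicking:
// Scroll the button into view so Puppeteer can compute its clickable point.
await page.evaluate((selector) => {
  document.querySelector(selector).scrollIntoView({ block: 'center' });
}, nextPageSelector);
await page.waitFor(1000); // give the page a moment to settle
await page.click(nextPageSelector);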

Scraping dynamic pages with node.js and headless browser

I'm trying to scrape data from a page that loads dynamically. For this I'm using the headless browser Puppeteer.
Puppeteer can be seen as the headlessBrowserClient in the code.
The main challenge is to gracefully close the browser as soon as the needed data is received. But if you close it before evaluateCustomCode has finished executing, its progress is lost.
evaluateCustomCode is a function that can be called as if we ran it in the Chrome DevTools.
To have control over the network requests and the async flow of the Puppeteer API, I use an async generator that encapsulates all the logic described above.
The problem is that I feel the code smells, but I can't see any better solution.
Ideas?
module.exports = function buildClient (headlessBrowserClient) {
  const getPageContent = async (pageUrl, evaluateCustomCode) => {
    const request = sendRequest(pageUrl)
    const { value: page } = await request.next()
    if (page) {
      const pageContent = await page.evaluate(evaluateCustomCode)
      request.next()
      return pageContent
    }
  }

  async function * sendRequest (url) {
    const browser = await headlessBrowserClient.launch()
    const page = await browser.newPage()
    const state = {
      req: { url },
    }
    try {
      await page.goto(url)
      yield page
    } catch (error) {
      throw new APIError(error, state)
    } finally {
      yield browser.close()
    }
  }

  return {
    getPageContent,
  }
}
You can use waitForFunction or waitFor together with evaluate and Promise.all. No matter how dynamic the website is, you are waiting for something to become true in the end, and you close the browser when that happens.
Since I do not have access to your dynamic URL, I am going to use some random variables and delays as an example. It will resolve once the variable becomes truthy.
await page.waitForFunction(() => !!someVariableThatShouldBeTrue);
What if your dynamic page actually creates a selector somewhere after you evaluate the code? In that case:
await page.waitFor('someSelector')
Now back to your customCode, let me rename that for you a bit,
await page.evaluate(customCode)
Where customCode is something that will set a variable someVariableThatShouldBeTrue to true somewhere. Honestly it can be anything, a request, a string or anything. The possibilities are endless.
You can put a promise inside page.evaluate; recent Chromium supports them very well. So the following will work too, resolving once you have loaded the function/data. Make sure the customCode is an async function or returns a promise.
const pageContent = await page.evaluate(CustomCode);
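For instance, the custom code could itself wait inside the page for whatever it needs before returning (a sketch; the .dynamic-content selector and the returned value are hypothetical):
// Hypothetical CustomCode: resolves only once the dynamic content exists.
const CustomCode = async () => {
  while (!document.querySelector('.dynamic-content')) {
    await new Promise((resolve) => setTimeout(resolve, 100)); // poll every 100 ms
  }
  return document.querySelector('.dynamic-content').textContent;
};

const pageContent = await page.evaluate(CustomCode);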
Alright, now we have all the required pieces. I modified the code a bit so it doesn't smell to me. :D
module.exports = function buildClient(headlessBrowserClient) {
  return {
    getPageContent: async (url, CustomCode) => {
      const state = {
        req: { url },
      };
      // so that we can close them in the "finally" block
      let browser, page;
      try {
        // launch browser
        browser = await headlessBrowserClient.launch()
        page = await browser.newPage()
        await page.goto(url)
        // evaluate and wait for something to happen
        // the first element returns the pageContent, but the whole call only resolves once both promises do
        const [pageContent] = await Promise.all([
          page.evaluate(CustomCode),
          page.waitForFunction(() => !!someVariableThatShouldBeTrue)
        ])
        // Or, since you can put a promise inside page.evaluate (recent Chromium supports them very well):
        // const pageContent = await page.evaluate(CustomCode)
        return pageContent;
      } catch (error) {
        throw new APIError(error, state)
      } finally {
        // NOTE: maybe we can move these into a different function
        if (page) await page.close()
        if (browser) await browser.close()
      }
    }
  }
}
You can change and tweak it more as you wish. I did not test the final code (since I don't have APIError, evaluateCustomCode etc) but it should work.
It doesn't have all those generators and such; promises are how you can deal with dynamic pages. :D
PS: IMO, such questions are better suited to Code Review.
