Run Puppeteer function again until completion - JavaScript

I've got a Puppeteer function that runs in a Node.js script. On launch, my initial function runs, but after navigating to the next page of a website (in my example via btnToClick) I need it to re-evaluate the page and collect more data. Right now I'm using a setInterval that assumes each page scrape takes 12 seconds. Instead, I'd like to run my extract function again as soon as the previous run completes, and keep running it until nextBtn returns 0.
Below is my current set up:
function extractFromArea() {
  puppeteer.launch({
    headless: true
  }).then(async browser => {
    // go to our page of choice, and wait for the body to load
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 720 });
    await page.goto('mypage');
    const extract = async function () {
      // wait before evaluating the page
      await page.evaluate(() => {
        // next button
        const nextBtn = document.querySelectorAll('a.nav.next.rndBtn.ui_button.primary.taLnk').length;
        if (nextBtn < 1) {
          // if no more pages are found
        }
        // wait, then proceed to the next page
        setTimeout(() => {
          const btnToClick = document.querySelector('a.nav.next.rndBtn.ui_button.primary.taLnk');
          btnToClick.click();
        }, 2000);
      });
    };
    // TODO: somehow need to make this run again based on when the current extract function has finished.
    setInterval(() => {
      extract();
    }, 12000);
    // kick off the extraction
    extract();
  });
}

Here's what a while loop might look like:
while (await page.$('a.next')) {
  await page.click('a.next');
  // do something
}
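Putting the pieces together, a minimal sketch of the loop-based version might look like this (assuming the same selector as above, and that each click triggers a full navigation; for Ajax-style pagination you would wait for a selector or response instead):

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.setViewport({ width: 1280, height: 720 });
await page.goto('mypage');
const nextSelector = 'a.nav.next.rndBtn.ui_button.primary.taLnk';
while (await page.$(nextSelector)) {
  // ...extract data from the current page here...
  await Promise.all([
    page.waitForNavigation(), // resolves once the next page has loaded
    page.click(nextSelector),
  ]);
}
await browser.close();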

Related

How to open list of pages synchronously

let downloadPageLinks = [];
fetchStreamingLinks.forEach(async (item) => {
  page = await browser.newPage();
  await page.goto(item, { waitUntil: "networkidle0" });
  const fetchDownloadPageLinks = await page.evaluate(() => {
    return loc4;
  });
  console.log(fetchDownloadPageLinks);
});
I have an array of links (fetchStreamingLinks). The function above opens all of the links in fetchStreamingLinks simultaneously; if the array contains 100 links, it opens all 100 at once.
What I want instead is to open the links one by one: open a link, perform some logic in the page context, close the page, then open the next one.
.forEach() is not promise-aware, so when you pass it an async callback it pays no attention to the promise that callback returns; thus, it runs all your operations in parallel. .forEach() should essentially be considered obsolete these days, especially for asynchronous operations, because a plain for loop gives you much more control and is promise-aware (e.g. the loop will pause for an await).
let downloadPageLinks = [];
for (let item of fetchStreamingLinks) {
  let page = await browser.newPage();
  await page.goto(item, { waitUntil: "networkidle0" });
  const fetchDownloadPageLinks = await page.evaluate(() => {
    return loc4;
  });
  await page.close();
  console.log(fetchDownloadPageLinks);
}
FYI, I don't know the Puppeteer API really well, but you should probably close each page (as I show) when you're done with it, to avoid pages stacking up as you process.
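To make the cleanup robust, you could wrap the per-page work in try/finally so the tab is closed even when goto or evaluate rejects; a sketch under the same assumptions as the loop above:

for (const item of fetchStreamingLinks) {
  const page = await browser.newPage();
  try {
    await page.goto(item, { waitUntil: "networkidle0" });
    // ...evaluate and collect data here...
  } finally {
    await page.close(); // runs even if goto or evaluate throws
  }
}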

How can I click on all links matching a selector with Playwright?

I'm using Playwright to scrape some data. How do I click on all links on the page matching a selector?
const { firefox } = require('playwright');
(async () => {
  const browser = await firefox.launch({ headless: false, slowMo: 50 });
  const page = await browser.newPage();
  await page.goto('https://www.google.com');
  page.pause(); // allow user to manually search for something
  const wut = await page.$$eval('a', links => {
    links.forEach(async (link) => {
      link.click(); // maybe works?
      console.log('whoopee'); // doesn't print anything
      page.goBack(); // crashes
    });
    return links;
  });
  console.log(`wut? ${wut}`); // prints 'wut? undefined'
  await browser.close();
})();
Some issues:
console.log inside the $$eval doesn't do anything.
page.goBack() and page.pause() inside the eval cause a crash.
The return value of $$eval is undefined (if I comment out page.goBack() so I get a return value at all). If I return links.length instead of links, it's correct (i.e. it's a positive integer). Huh?
I get similar results with:
const links = await page.locator('a');
await links.evaluateAll(...)
Clearly I don't know what I'm doing. What's the correct code to achieve something like this?
(X-Y problem alert: I don't actually care if I do this with $$eval, Playwright, or frankly even Javascript; all I really want to do is make this work in any language or tool).
const browser = await firefox.launch({ slowMo: 250 });
const context = await browser.newContext();
const page = await context.newPage();
await page.goto('https://stackoverflow.com/questions/70702820/how-can-i-click-on-all-links-matching-a-selector-with-playwright');
const links = page.locator('a:visible');
const linksCount = await links.count();
for (let i = 0; i < linksCount; i++) {
  await page.bringToFront();
  try {
    const [newPage] = await Promise.all([
      context.waitForEvent('page', { timeout: 5000 }),
      links.nth(i).click({ modifiers: ['Control', 'Shift'] }),
    ]);
    await newPage.waitForLoadState();
    console.log('Title:', await newPage.title());
    console.log('URL: ', newPage.url());
    await newPage.close();
  } catch {
    continue;
  }
}
There are a number of ways you could do this, but I like this approach the most. Clicking a link, waiting for the page to load, and then going back to the previous page has a lot of problems, most importantly that for many pages the links might change every time the page loads. Ctrl+Shift+clicking opens the link in a new tab, which you can access by using the Promise.all pattern and catching the 'page' event.
I only tried this on this page, so I'm sure there are other problems that may arise. But for this page in particular, using 'a:visible' was necessary to avoid getting stuck on hidden links. The whole clicking operation is wrapped in a try/catch because some of the links aren't real links and don't open a new page.
Depending on your use case, it may be easiest just to grab all the hrefs from each link:
const links = page.locator('a:visible');
const linksCount = await links.count();
const hrefs = [];
for (let i = 0; i < linksCount; i++) {
  hrefs.push(await links.nth(i).getAttribute('href'));
}
console.log(hrefs);
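If the URLs are all you need, the same idea fits in a single $$eval call (a sketch; Playwright's :visible pseudo-class works here too, and note this returns resolved link.href values rather than the raw attribute):

const hrefs = await page.$$eval('a:visible', links => links.map(link => link.href));
console.log(hrefs);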
Try this approach. I'll use TypeScript.
await page.waitForSelector(selector, { timeout: 10000 });
const links = await page.$$(selector);
for (const link of links) {
  await link.click({ timeout: 8000 });
  // your additional code
}
See more on https://youtu.be/54OwsiRa_eE?t=488

Puppeteer waitForNavigation reliability in determining page URL

I've got a Puppeteer Node.js app that, given a starting URL, follows the URL and scrapes the window's URL of each page it identifies. Originally I was using a setInterval and getting the current URL every 250ms, but I have since stumbled upon the waitForNavigation option and need to know whether what I've got is going to be reliable.
Given the starting URL, I need to identify all of the pages (and just the pages) that Puppeteer goes through, and then with a setTimeout make the assumption that if Puppeteer hasn't redirected to a new page within a given period of time, there are no more redirections.
Will page.waitForNavigation work for this intended behaviour?
My current JS is:
let evalTimeout;
// initiate a Puppeteer instance with options and launch
const browser = await puppeteer.launch({
  args: argOptions,
  headless: (config.puppeteer.run_in_headless === 'true') ? true : false
});
// launch a new page
const page = await browser.newPage();
// go to a URL
await page.goto(body.url);
// create a function to inject into the page to scrape data
const currentUrl = () => {
  return window.location.href;
};
// recursively log the page URL on each navigation
async function scrapePageUrl (runOnce = false) {
  try {
    console.log('running timeout...');
    if (!runOnce) {
      evalTimeout = setTimeout(() => {
        console.log('6s reached, running once');
        scrapePageUrl(true); // assumes no more redirections after 6s, get final URL
      }, 6000);
    }
    const url = await page.evaluate(currentUrl);
    if (!runOnce) await page.waitForNavigation();
    console.log(`url: ${url}`);
    if (!runOnce) {
      clearTimeout(evalTimeout);
      scrapePageUrl();
    }
  } catch (err) { }
}
scrapePageUrl();
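For what it's worth, a simpler shape for this pattern is to lean on waitForNavigation's own timeout option rather than a hand-rolled setTimeout; a minimal sketch (the 6-second window is an assumption carried over from the code above):

const urls = [page.url()];
try {
  // keep collecting URLs until no navigation happens within 6s
  while (true) {
    await page.waitForNavigation({ timeout: 6000 });
    urls.push(page.url());
  }
} catch (err) {
  // a TimeoutError here means no navigation within 6s, so assume the redirect chain is done
}
console.log(urls);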

PhantomJS to capture next page content after button click event

I am trying to capture the second page's content after a click, but it is returning the front page's content.
const status = await page.open('https://www.dubailand.gov.ae/English/services/Eservices/Pages/Brokers.aspx');
console.log(status);
await page.evaluate(function() {
  document.querySelector('#ctl00_ctl42_g_26779dcd_6f3a_42ae_903c_59dea61690e9_dpPager > a.NextPageLink').click();
});
const content = await page.property('content');
console.log(content);
I have done a similar task using Puppeteer, but am shifting to PhantomJS due to deployment issues with Puppeteer.
Any help is appreciated.
You get the front page because you request the page's content immediately after clicking the "next" button, but you need to wait for the Ajax request to finish. That can be done by observing the site's palm-tree Ajax loader: when it's no longer visible, the results are in.
// Utility function to pass time: await timeout(ms)
const timeout = ms => new Promise(resolve => setTimeout(resolve, ms));

// emulate a realistic client's screen size
await page.property('viewportSize', { width: 1280, height: 720 });

const status = await page.open('https://www.dubailand.gov.ae/English/services/Eservices/Pages/Brokers.aspx');

await page.evaluate(function() {
  document.querySelector('#ctl00_ctl42_g_26779dcd_6f3a_42ae_903c_59dea61690e9_dpPager > a.NextPageLink').click();
});

// Give it time to start the request
await timeout(1000);

// Wait until the loader is gone
while (1 == await page.evaluate(function() {
  return jQuery(".Loader_large:visible").length;
})) {
  await timeout(1000);
  console.log(".");
}

// Now for scraping
let contacts = await page.evaluate(function() {
  var contacts = [];
  jQuery("#tbBrokers tr").each(function(i, row) {
    contacts.push({
      "title": jQuery(row).find("td:nth-child(2)").text().trim(),
      "phone": jQuery(row).find("td:nth-child(4)").text().trim()
    });
  });
  return contacts;
});
console.log(contacts);
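The ad-hoc polling loop above can be factored into a reusable helper that waits for any in-page predicate; a sketch (waitForPageCondition is a hypothetical name, and it reuses the timeout helper defined above):

// poll a predicate inside the page until it returns true, or give up after maxTries
async function waitForPageCondition(page, predicate, interval = 1000, maxTries = 30) {
  for (let i = 0; i < maxTries; i++) {
    if (await page.evaluate(predicate)) return;
    await timeout(interval);
  }
  throw new Error('condition not met in time');
}

// usage: wait until the Ajax loader is gone
await waitForPageCondition(page, function() {
  return jQuery(".Loader_large:visible").length === 0;
});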

Puppeteer | Wait until all JavaScript is executed

I'm trying to take screenshots of multiple pages, which should be fully loaded (including lazy-loaded images) for later comparison.
I found the lazyimages_without_scroll_events.js example, which helps a lot.
With the following code the screenshots look fine, but there is a major issue.
async function takeScreenshot(browser, viewport, route) {
  return browser.newPage().then(async (page) => {
    const fileName = `${viewport.directory}/${getFilename(route)}`;
    await page.setViewport({
      width: viewport.width,
      height: 500,
    });
    await page.goto(`${config.server.master}${route}.html`, {
      waitUntil: 'networkidle0',
    });
    await page.evaluate(() => {
      /* global document, requestAnimationFrame */
      let lastScrollTop = document.scrollingElement.scrollTop;
      // Scroll to bottom of page until we can't scroll anymore.
      const scroll = () => {
        document.scrollingElement.scrollTop += 100;
        if (document.scrollingElement.scrollTop !== lastScrollTop) {
          lastScrollTop = document.scrollingElement.scrollTop;
          requestAnimationFrame(scroll);
        }
      };
      scroll();
    });
    await page.waitFor(5000);
    await page.screenshot({
      path: `screenshots/master/${fileName}.png`,
      fullPage: true,
    });
    await page.close();
    console.log(`Viewport "${viewport.name}", Route "${route}"`);
  });
}
Issue: Even with higher values for the page.waitFor() timeout, sometimes not all of the frontend-related JavaScript on the pages has fully executed.
This affects some older pages where JavaScript can still change the frontend, e.g. in one legacy case jQuery.matchHeight.
Best case: In an ideal world, Puppeteer would wait until all JavaScript is evaluated and executed. Is something like this possible?
EDIT
I could improve the script slightly with help from cody-g.
function jQueryMatchHeightIsProcessed() {
  return Array.from($('.match-height')).every((element) => {
    return element.style.height !== '';
  });
}

// Within takeScreenshot() after page.waitFor()
await page.waitForFunction(jQueryMatchHeightIsProcessed, { timeout: 0 });
... but it is far from perfect. It seems I have to find similar solutions for different frontend scripts to really cover everything that happens on the target page.
The main problem with jQuery.matchHeight in my case is that it produces different heights in different runs, maybe caused by image lazy-loading. It seems I have to wait until I can replace it with Flexbox. (^_^)°
Other issues to fix:
Disable animations:
await page.addStyleTag({
  content: `
    * {
      transition: none !important;
      animation: none !important;
    }
  `,
});
Handle slideshows:
function handleSwiperSlideshows() {
  Array.from($('.swiper-container')).forEach((element) => {
    if (typeof element.swiper !== 'undefined') {
      if (element.swiper.autoplaying) {
        element.swiper.stopAutoplay();
        element.swiper.slideTo(0);
      }
    }
  });
}

// Within takeScreenshot() after page.waitFor()
await page.evaluate(handleSwiperSlideshows);
But it's still not enough. I think it's impossible to visually test these legacy pages.
The following waitForFunction might be useful for you; you can use it to wait for any arbitrary function to evaluate to true. If you have access to the page's code, you can set the window status and use that to notify Puppeteer that it is safe to continue, or just rely on some other ready state.
Note: this function polls, re-evaluating at an interval that can be specified.
const watchDog = page.waitForFunction('<your function to evaluate to true>');
E.g.,
const watchDog = page.waitForFunction('window.status === "ready"');
await watchDog;
In your page's code you simply need to set window.status to 'ready'.
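On the page side, that might look like the following (a sketch; imagesReady and slidersReady are hypothetical promises standing in for whatever initialisation your page actually performs):

// page code: flip the status flag once all initialisation has finished
window.status = 'loading';
Promise.all([imagesReady, slidersReady]) // hypothetical init promises
  .then(() => { window.status = 'ready'; });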
To utilize multiple watchdogs in multiple asynchronous files, you could do something like this:
index.js:
...import/require file1.js;
...import/require file2.js;
...code...
file1.js:
var file1Flag = false; // global
...code...
file1Flag = true;
file2.js:
var file2Flag = false; // global
...code...
file2Flag = true;
main.js:
const watchDog = page.waitForFunction('file1Flag && file2Flag');
await watchDog;
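A more concrete sketch of the same idea, assuming classic (non-module) scripts and putting the flags on window so the polled expression can always see them:

// file1.js (page code)
window.file1Flag = false;
// ...init work...
window.file1Flag = true;

// in the Puppeteer script
await page.waitForFunction('window.file1Flag && window.file2Flag');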
async function takeScreenshot(browser, viewport, route) {
  return browser.newPage().then(async (page) => {
    const fileName = `${viewport.directory}/${getFilename(route)}`;
    await page.setViewport({
      width: viewport.width,
      height: 500,
    });
    await page.goto(`${config.server.master}${route}.html`, {
      waitUntil: 'networkidle0',
    });
    await page.evaluate(() => {
      scroll(0, 99999);
    });
    await page.waitFor(5000);
    await page.screenshot({
      path: `screenshots/master/${fileName}.png`,
      fullPage: true,
    });
    await page.close();
    console.log(`Viewport "${viewport.name}", Route "${route}"`);
  });
}
