I've got a Puppeteer Node.js app that, given a starting URL, follows the redirects from that URL and records the window's URL for each page it lands on. Originally I was using a setInterval and reading the current URL every 250ms, but I've since stumbled upon the waitForNavigation option and need to know whether what I've got is going to be reliable.
Given the starting URL, I need to identify all of the pages, and just the pages, that Puppeteer goes through, and then use a setTimeout to assume that if Puppeteer hasn't redirected to a new page within a given period of time, there are no more redirections.
Will page.waitForNavigation work for this intended behaviour?
My current JS is:
let evalTimeout;

// initiate a Puppeteer instance with options and launch
const browser = await puppeteer.launch({
  args: argOptions,
  headless: (config.puppeteer.run_in_headless === 'true') ? true : false
});

// launch a new page
const page = await browser.newPage();

// go to a URL
await page.goto(body.url);

// create a function to inject into the page to scrape data
const currentUrl = () => {
  return window.location.href;
}

// log each page URL as Puppeteer navigates
async function scrapePageUrl (runOnce = false) {
  try {
    console.log('running timeout...')
    if (!runOnce) {
      evalTimeout = setTimeout(() => {
        console.log('6s reached, running once')
        scrapePageUrl(true) // assumes no more redirections after 6s, get final URL
      }, 6000)
    }
    const url = await page.evaluate(currentUrl);
    if (!runOnce) await page.waitForNavigation();
    console.log(`url: ${url}`)
    if (!runOnce) {
      clearTimeout(evalTimeout)
      scrapePageUrl()
    }
  } catch (err) { } // errors swallowed so a failed waitForNavigation doesn't kill the loop
}

scrapePageUrl()
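For reference, a minimal sketch of the waitForNavigation-based flow described above (the 6-second cutoff mirrors the code; navigationUrls is an illustrative name, and note a navigation can still slip through between the evaluate call and waitForNavigation, which is exactly the reliability concern):

// Sketch: follow redirects until waitForNavigation times out,
// then treat the last recorded URL as the final destination.
const navigationUrls = [];
while (true) {
  navigationUrls.push(await page.evaluate(() => window.location.href));
  try {
    // resolves on the next navigation, throws a TimeoutError after 6s of quiet
    await page.waitForNavigation({ timeout: 6000 });
  } catch (err) {
    break; // no navigation within 6s: assume the redirect chain has ended
  }
}
console.log(navigationUrls);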
I'm using Playwright to scrape some data. How do I click on all links on the page matching a selector?
const { firefox } = require('playwright');

(async () => {
  const browser = await firefox.launch({ headless: false, slowMo: 50 });
  const page = await browser.newPage();
  await page.goto('https://www.google.com');
  page.pause(); // allow user to manually search for something
  const wut = await page.$$eval('a', links => {
    links.forEach(async (link) => {
      link.click(); // maybe works?
      console.log('whoopee'); // doesn't print anything
      page.goBack(); // crashes
    });
    return links;
  });
  console.log(`wut? ${wut}`); // prints 'wut? undefined'
  await browser.close();
})();
Some issues:
console.log inside the $$eval doesn't do anything.
page.goBack() and page.pause() inside the eval cause a crash.
The return value of $$eval is undefined (if I comment out page.goBack() so I get a return value at all). If I return links.length instead of links, it's correct (i.e. it's a positive integer). Huh?
I get similar results with:
const links = await page.locator('a');
await links.evaluateAll(...)
Clearly I don't know what I'm doing. What's the correct code to achieve something like this?
(X-Y problem alert: I don't actually care if I do this with $$eval, Playwright, or frankly even JavaScript; all I really want is to make this work in any language or tool.)
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ slowMo: 250 });
  const context = await browser.newContext();
  const page = await context.newPage();
  await page.goto('https://stackoverflow.com/questions/70702820/how-can-i-click-on-all-links-matching-a-selector-with-playwright');

  const links = page.locator('a:visible');
  const linksCount = await links.count();

  for (let i = 0; i < linksCount; i++) {
    await page.bringToFront();
    try {
      // ctrl+shift+click opens the link in a new tab, which arrives as a 'page' event on the context
      const [newPage] = await Promise.all([
        context.waitForEvent('page', { timeout: 5000 }),
        links.nth(i).click({ modifiers: ['Control', 'Shift'] })
      ]);
      await newPage.waitForLoadState();
      console.log('Title:', await newPage.title());
      console.log('URL: ', newPage.url());
      await newPage.close();
    } catch {
      continue; // not a real link, or it didn't open a new page within 5s
    }
  }

  await browser.close();
})();
There are a number of ways you could do this, but I like this approach the most. Clicking a link, waiting for the page to load, and then going back to the previous page has a lot of problems, most importantly that on many pages the links can change every time the page loads. Ctrl+shift+clicking opens the link in a new tab, which you can access with the Promise.all pattern, catching the context's 'page' event.
I only tried this on this page, so I'm sure there are tons of other problems that may arise. But for this page in particular, using 'a:visible' was necessary to avoid getting stuck on hidden links. The whole clicking operation is wrapped in a try/catch because some of the links aren't real links and don't open a new page.
Depending on your use case, it may be easiest just to grab all the hrefs from each link:
const links = page.locator('a:visible');
const linksCount = await links.count();
const hrefs = [];
for (let i = 0; i < linksCount; i++) {
  hrefs.push(await links.nth(i).getAttribute('href'));
}
console.log(hrefs);
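If the visibility filter isn't needed, the same idea collapses into a single $$eval round trip. A sketch (note this reads the resolved href property, not the raw attribute value):

// One evaluate call: map every anchor to its fully resolved URL
const hrefs = await page.$$eval('a', anchors => anchors.map(a => a.href));
console.log(hrefs);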
Try this approach. I will use TypeScript.

await page.waitForSelector(selector, { timeout: 10000 });
const links = await page.$$(selector);
for (const link of links) {
  await link.click({ timeout: 8000 });
  // your additional code
}
See more on https://youtu.be/54OwsiRa_eE?t=488
I've got a Puppeteer function that runs in a Node JS script. Upon launching, my initial function runs; however, after navigating to the next page of a website (in my example using btnToClick) I need it to re-evaluate the page and collect more data. Right now I'm using a setInterval that assumes the total time per page scrape is 12 seconds. I'd like to be able to run my extract function again after the previous run has completed, and keep running it until nextBtn returns 0.
Below is my current set up:
function extractFromArea() {
  puppeteer.launch({
    headless: true
  }).then(async browser => {
    // go to our page of choice, and wait for the body to load
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 720 });
    await page.goto('mypage');

    const extract = async function() {
      // evaluate the page in the browser context
      await page.evaluate(() => {
        // next button
        const nextBtn = document.querySelectorAll('a.nav.next.rndBtn.ui_button.primary.taLnk').length
        if (nextBtn < 1) {
          // if no more pages are found
        }
        // wait, then proceed to next page
        setTimeout(() => {
          const btnToClick = document.querySelector('a.nav.next.rndBtn.ui_button.primary.taLnk')
          btnToClick.click()
        }, 2000)
      });
    };

    // TODO: somehow need to make this run again based on when the current extract function is finished.
    setInterval(() => {
      extract()
    }, 12000)

    // kick off the extraction
    extract()
  });
}
Here's what a while loop might look like:
while (await page.$('a.next')) {
  await page.click('a.next')
  // do something
}
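Fleshed out against the question's markup, a sketch of that loop (the selector comes from the question; waiting for a full navigation after each click is an assumption and won't fire if the site paginates via Ajax):

// Keep extracting until no "next" button remains on the page
while (await page.$('a.nav.next.rndBtn.ui_button.primary.taLnk')) {
  // run the per-page extraction here, e.g. await page.evaluate(...)
  await Promise.all([
    page.waitForNavigation(), // resolves once the next page has loaded
    page.click('a.nav.next.rndBtn.ui_button.primary.taLnk') // triggers it
  ]);
}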
I am trying to capture the second page's content after a click. But it is returning the front page's content.
const status = await page.open('https://www.dubailand.gov.ae/English/services/Eservices/Pages/Brokers.aspx');
console.log(status);

await page.evaluate(function() {
  document.querySelector('#ctl00_ctl42_g_26779dcd_6f3a_42ae_903c_59dea61690e9_dpPager > a.NextPageLink').click();
})

const content = await page.property('content');
console.log(content);
I have done a similar task using Puppeteer, but I'm shifting to PhantomJS due to deployment issues with Puppeteer.
Any help is appreciated.
You get the front page because you request the page's content immediately after clicking the "next" button, but you need to wait for the Ajax request to finish. That can be done by observing the "palm tree" Ajax loader: when it's no longer visible, the results are in.
// Utility function to pass time: await timeout(ms)
const timeout = ms => new Promise(resolve => setTimeout(resolve, ms));
// emulate a realistic client's screen size
await page.property('viewportSize', { width: 1280, height: 720 });
const status = await page.open('https://www.dubailand.gov.ae/English/services/Eservices/Pages/Brokers.aspx');
await page.evaluate(function() {
  document.querySelector('#ctl00_ctl42_g_26779dcd_6f3a_42ae_903c_59dea61690e9_dpPager > a.NextPageLink').click();
});

// Give it time to start request
await timeout(1000);

// Wait until the loader is gone
while (1 == await page.evaluate(function() {
  return jQuery(".Loader_large:visible").length
})) {
  await timeout(1000);
  console.log(".");
}

// Now for scraping
let contacts = await page.evaluate(function() {
  var contacts = [];
  jQuery("#tbBrokers tr").each(function(i, row) {
    contacts.push({
      "title": jQuery(row).find("td:nth-child(2)").text().trim(),
      "phone": jQuery(row).find("td:nth-child(4)").text().trim()
    })
  })
  return contacts;
});
console.log(contacts);
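The polling loop above generalizes into a small helper if you need it in more than one place. A sketch, where waitForPage is a hypothetical name and timeout is the utility defined above:

// Hypothetical helper: repeatedly evaluate an in-page predicate until it
// returns true, giving up after `tries` attempts.
async function waitForPage(page, predicate, interval = 1000, tries = 30) {
  for (let i = 0; i < tries; i++) {
    if (await page.evaluate(predicate)) return;
    await timeout(interval);
  }
  throw new Error('waitForPage: condition not met in time');
}

// Usage: wait until the Ajax loader is gone
await waitForPage(page, function() {
  return jQuery(".Loader_large:visible").length === 0;
});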
I'm trying to scrape data from a page that loads dynamically. For this I'm using the headless browser Puppeteer.
Puppeteer can be seen as the headlessBrowserClient in the code.
The main challenge is to gracefully close the browser as soon as the needed data has been received. But if you close it before evaluateCustomCode has finished executing, its progress is lost.
evaluateCustomCode is a function that can be called as if we ran it in the Chrome DevTools console.
To have control over the network requests and the async flow of the Puppeteer API, I use an async generator that encapsulates all the logic described above.
The problem is that I feel the code smells, but I can't see a better solution.
Ideas?
module.exports = function buildClient (headlessBrowserClient) {
  const getPageContent = async (pageUrl, evaluateCustomCode) => {
    const request = sendRequest(pageUrl)
    const { value: page } = await request.next()
    if (page) {
      const pageContent = await page.evaluate(evaluateCustomCode)
      request.next()
      return pageContent
    }
  }

  async function * sendRequest (url) {
    const browser = await headlessBrowserClient.launch()
    const page = await browser.newPage()
    const state = {
      req: { url },
    }
    try {
      await page.goto(url)
      yield page
    } catch (error) {
      throw new APIError(error, state)
    } finally {
      yield browser.close()
    }
  }

  return {
    getPageContent,
  }
}
You can use waitForFunction or waitFor and evaluate with Promise.all. No matter how dynamic the website is, you are waiting for something to become true in the end, and you close the browser when that happens.
Since I do not have access to your dynamic URL, I am going to use some random variables and delays as an example. It will resolve once the variable becomes truthy.
await page.waitForFunction(() => !!someVariableThatShouldBeTrue);
What if your dynamic page actually creates a selector somewhere after you evaluate the code? In that case,
await page.waitFor('someSelector')
Now back to your customCode, let me rename that for you a bit,
await page.evaluate(customCode)
Where customCode is something that will set the variable someVariableThatShouldBeTrue to true somewhere. Honestly it can be anything: a request, a string, whatever. The possibilities are endless.
You can put a promise inside page.evaluate; recent Chromium supports them very well. So the following will work too, resolving once the function/data has loaded. Make sure the customCode is an async function or returns a promise.
const pageContent = await page.evaluate(CustomCode);
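For instance, a sketch of what such a CustomCode might look like (the polling loop, the 250 ms interval, and the #results selector are illustrative assumptions):

// Hypothetical CustomCode: polls inside the page until the dynamic data
// shows up, then returns it; the evaluate above resolves when this settles
const CustomCode = async () => {
  while (!window.someVariableThatShouldBeTrue) {
    await new Promise(resolve => setTimeout(resolve, 250));
  }
  return document.querySelector('#results').innerText;
};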
Alright, now we have all the required pieces. I modified the code a bit so it doesn't smell to me :D
module.exports = function buildClient(headlessBrowserClient) {
  return {
    getPageContent: async (url, CustomCode) => {
      const state = {
        req: { url },
      };
      // declared here so we can reach them in the "finally" block
      let browser, page;
      try {
        // launch browser
        browser = await headlessBrowserClient.launch()
        page = await browser.newPage()
        await page.goto(url)
        // evaluate and wait for something to happen;
        // the first element yields the pageContent, and the whole thing resolves once both promises resolve
        const [pageContent] = await Promise.all([
          page.evaluate(CustomCode),
          page.waitForFunction(() => !!someVariableThatShouldBeTrue)
        ])
        // Or, you can put a promise inside page.evaluate; recent Chromium supports them very well:
        // const pageContent = await page.evaluate(CustomCode)
        return pageContent;
      } catch (error) {
        throw new APIError(error, state)
      } finally {
        // NOTE: maybe we can move these to a different function
        if (page) await page.close()
        if (browser) await browser.close()
      }
    }
  }
}
You can change and tweak it more as you wish. I did not test the final code (since I don't have APIError, evaluateCustomCode etc) but it should work.
It doesn't have all those generators and stuff like that. Promises are how you can deal with dynamic pages :D
PS: IMO, such questions are a better fit for Code Review.
The frameNavigated event in the DevTools protocol doesn't seem to work the same way as the framenavigated event in Puppeteer. If I use the parentFrame() method in the Puppeteer example below, I only get the client-side navigation redirects, which is what I want. If I use the DevTools protocol example, replacing the parentFrame() method with parentId, I don't get anything because the code never gets past the parentId check.
How can I use the frameNavigated event in the DevTools protocol to behave the same way as Puppeteer's framenavigated and only get the navigation redirect chain?
// var url = "http://wechat.com";
var url = "http://yahoo.com";

const puppeteer = require('puppeteer');

puppeteer.launch().then(async browser => {
  const page = await browser.newPage();

  // Puppeteer framenavigated example:
  page.on('framenavigated', frame => {
    if (frame.parentFrame() === null) {
      console.log("Puppeteer nav frame: ");
      console.log(frame._url);
    }
  });
  // Result returns:
  // Puppeteer nav frame:
  // https://uk.yahoo.com/?p=us

  // Devtools protocol frameNavigated example:
  const client = await page.target().createCDPSession();
  await client.send('Page.enable');
  await client.on('Page.frameNavigated', (e) => {
    // console.log("frameNavigated event fired");
    if (e.frame.parentId === null) {
      console.log("Devtools protocol frame: ");
      console.log(e.frame.url);
    }
  });
  // Result returns:
  // Either hangs up and times out or finishes but returns nothing

  await page.goto(url);
  await browser.close();
});
Just took another look at this and figured out the problem, so I'll answer my own question.
The parentFrame() method in Puppeteer returns null for the main frame of a page, but parentId in the DevTools protocol is undefined for a main frame, not null. So I just had to check for a falsy parentId value instead. Obvious, I know.
await client.on('Page.frameNavigated', (e) => {
  if (!e.frame.parentId) {
  // if (typeof e.frame.parentId === "undefined") { // works also
    console.log(e.frame.url);
  }
});