I'm new to puppeteer, trying to understand how it works by writing a simple scraping job.
What I plan to do
The plan is simple:
go to a page,
then extract all <li> links under a <ul> tag
click each <li> link and take a screenshot of the target page.
How I implement it
The code is as follows:
await page.goto('http://some.url.com'); // step-1
const a_elems = await page.$$('li.some_css_class a'); // step-2
for (var i=0; i<a_elems.length; i++) { // step-3
const elem = a_elems[i];
await Promise.all([
elem.click(),
page.waitForNavigation({waitUntil: 'networkidle0'}) // click each link and wait page loading
]);
await page.screenshot({path: `${IMG_FOLDER}/${txt}.png`});
await page.goBack({waitUntil: 'networkidle0'}); // go back to previous page so that we could click next link
console.log(`clicked link = ${txt}`);
}
What is wrong & Need help
However, the code above only works for the first link in a_elems. When the for-loop reaches the 2nd link, it breaks with the following error:
(node:40606) UnhandledPromiseRejectionWarning: Error: Node is detached from document
at ElementHandle._scrollIntoViewIfNeeded (.../.npm-packages/lib/node_modules/puppeteer/lib/JSHandle.js:203:13)
at processTicksAndRejections (internal/process/task_queues.js:93:5)
at async ElementHandle.click (.../.npm-packages/lib/node_modules/puppeteer/lib/JSHandle.js:282:5)
at async Promise.all (index 0)
at async main (.../test.js:34:5)
-- ASYNC --
at ElementHandle.<anonymous> (.../.npm-packages/lib/node_modules/puppeteer/lib/helper.js:111:15)
at main (.../test.js:35:12)
at processTicksAndRejections (internal/process/task_queues.js:93:5)
I suspect that the execution context of the page changes after the first link is clicked, and that calling page.goBack returns to the previous page without restoring the previous execution context.
I'm not sure whether my speculation is right, and I couldn't find a similar issue anywhere, so I hope I can get some help here. Thanks!
If there is a better way to implement my plan, please let me know.
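You are right: the element handles lose their context once you navigate away, and goBack does not restore them, because the handles belong to the document that was loaded before the navigation. That's not going to work.
But, as you commented, you can grab the href from each element and start from there: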
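for (let i = 0; i < a_elems.length; i++) { // step-3
  const elem = a_elems[i];
  const href = await page.evaluate(e => e.href, elem); // Chrome returns the absolute URL
  // Assumption: txt is meant to be the link text; the question never defines it.
  const txt = await page.evaluate(e => e.textContent.trim(), elem);
  const newPage = await browser.newPage(); // a separate tab leaves the original page and its handles intact
  await newPage.goto(href);
  await newPage.screenshot({path: `${IMG_FOLDER}/${txt}.png`});
  await newPage.close();
  console.log(`clicked link = ${txt}`);
}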
You could even do this in parallel, although there is an internal queue for screenshots.
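To make the parallel idea concrete, here is a minimal sketch that screenshots a list of URLs with a bounded number of tabs open at once. The helper name screenshotAll, the limit parameter, and the link-N.png file names are my own assumptions, not part of the question; it only relies on browser.newPage, page.goto, page.screenshot and page.close.

```javascript
// Hypothetical helper: screenshot every URL in `hrefs`, opening at most
// `limit` tabs at the same time. File names (link-0.png, ...) are an assumption.
async function screenshotAll(browser, hrefs, folder, limit = 3) {
  const saved = [];
  let next = 0; // index of the next URL that has not been claimed yet

  // Each worker keeps claiming the next URL until none are left.
  async function worker() {
    while (next < hrefs.length) {
      const index = next++;
      const page = await browser.newPage();
      try {
        await page.goto(hrefs[index], { waitUntil: 'networkidle0' });
        const path = `${folder}/link-${index}.png`;
        await page.screenshot({ path });
        saved[index] = path;
      } finally {
        await page.close(); // always release the tab, even if goto fails
      }
    }
  }

  const workerCount = Math.min(limit, hrefs.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return saved;
}

// Possible usage inside an async function, collecting the hrefs up front
// so that no element handle has to survive a navigation:
//   const hrefs = await page.$$eval('li.some_css_class a', as => as.map(a => a.href));
//   await screenshotAll(browser, hrefs, IMG_FOLDER, 3);
```

Collecting plain href strings first avoids the detached-node problem entirely, since strings stay valid no matter where the page navigates.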
Related
I have used puppeteer in one of my projects to open webpages in headless Chrome, perform some actions and then close the page. These actions, however, depend on the user. I want to attach a lifetime to the page, so that it closes automatically a fixed time after opening (say, 30 minutes), regardless of whether any action has been performed.
I have tried Node.js's setTimeout(), but it didn't work (or I just couldn't figure out how to make it work).
I have tried the following:
const puppeteer = require('puppeteer-core');
const browser = await puppeteer.connect({browserURL: browser_url});
const page = await browser.newPage();
// timer starts ticking here upon creation of new page (maybe in a subroutine and not block the main thread)
/**
..
Do something
..
*/
// timer ends and closePage() is triggered.
const closePage = (page) => {
if (!page.isClosed()) {
page.close();
}
}
But this gives me the following error:
Error: Protocol error: Connection closed. Most likely the page has been closed.
The code you provided should work as expected. Are you sure the page is still open after the timeout and that it is indeed the same page?
You can try this wrapper for opening pages and closing them correctly.
// Since it is async, it won't block the event loop.
// Using `await` allows other functions to execute in the meantime.
async function openNewPage(browser, timeoutMs) {
  const page = await browser.newPage()
  setTimeout(async () => {
    // Use try/catch to avoid unhandled promise rejections.
    try {
      if (!page.isClosed()) {
        await page.close()
      }
    } catch (err) {
      console.error('unexpected error occurred when closing page.', err)
    }
  }, timeoutMs)
  // Return the page so the caller can actually use it.
  return page
}
// use it like so.
const browser = await puppeteer.connect({browserURL: browser_url});
const min30Ms = 30 * 60 * 1000
const page = await openNewPage(browser, min30Ms);
// ...
The above only closes the tabs in your browser. To close the puppeteer instance itself you would have to call browser.close(), which may be what you want.
page.close returns a promise, so you need to define closePage as an async function and use await page.close(). I believe @silvan's answer should address the issue; just make sure to replace the if condition
if(page.isClosed())
with
if(!page.isClosed())
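One refinement that may be worth considering: if the page is closed early by the user's actions, the 30-minute timer is still pending. The sketch below cancels it by listening for the page's 'close' event. The helper name closeAfter is hypothetical; it assumes `page` is a Puppeteer Page (an event emitter with isClosed() and close()).

```javascript
// Hypothetical helper: close `page` after `timeoutMs` unless it is closed
// earlier, in which case the pending timer is cancelled.
function closeAfter(page, timeoutMs) {
  const timer = setTimeout(async () => {
    try {
      if (!page.isClosed()) {
        await page.close();
      }
    } catch (err) {
      // The connection may already be gone; log instead of crashing.
      console.error('could not close page:', err);
    }
  }, timeoutMs);

  // If the page goes away before the deadline, drop the timer.
  page.once('close', () => clearTimeout(timer));
  return timer;
}

// Possible usage inside an async function:
//   const page = await browser.newPage();
//   closeAfter(page, 30 * 60 * 1000);
```

Clearing the timer is not strictly required for correctness, since the isClosed() check already guards the call, but it avoids keeping the Node process alive for 30 minutes just for a stale timer.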
I want to make a scraper with puppeteer that opens a site, uses its search bar and opens the first link.
This is the code:
const puppeteer = require('puppeteer');
(async () => {
let browser = await puppeteer.launch();
let page = await browser.newPage();
await page.goto('https://example.com', {waitUntil: 'networkidle2'});
await page.click('[name=query]');
await page.keyboard.type("(Weapon)");
await page.keyboard.press('Enter');
await page.waitForSelector('div[class="search-results"]', {timeout: 100000});
})(); // the trailing () invokes the async function; without it nothing runs
The problem is that I can't make it open the first link from the search results. I tried to use page.click(), but all of the search results are identical except for the URL.
How can I make it open the first link from the search results?
There are several ways to solve this. I recommend experimenting a bit, so you learn the different approaches.
await page.click('.search-results a');
It turns out that page.click() always clicks the first element matching the selector, so if you want the first result, this is enough.
Or you can select all the links and then click on the first one:
const resultLinks = await page.$$('.search-results a');
await resultLinks[0].click();
It would be better to include a condition here as well, so you don't end up with an error when no element was found:
const resultLinks = await page.$$('.search-results a');
if (resultLinks.length) await resultLinks[0].click();
There are more ways, so if you want to learn more, please refer to the API documentation.
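Since clicking a search result normally loads a new page, it can help to wait for that navigation together with the click. A minimal sketch, assuming the link opens in the same tab; the helper name openFirstResult and its default selector are only illustrative:

```javascript
// Hypothetical helper: click the first element matching `selector` and wait
// for the navigation it triggers. Returns false if nothing matched.
async function openFirstResult(page, selector = '.search-results a') {
  const link = await page.$(selector); // first match, or null
  if (!link) {
    return false;
  }
  // Start waiting before the click so the navigation cannot be missed.
  await Promise.all([
    page.waitForNavigation({ waitUntil: 'networkidle2' }),
    link.click(),
  ]);
  return true;
}

// Possible usage inside an async function:
//   const opened = await openFirstResult(page);
//   if (!opened) console.log('no search results found');
```

If the link opens a new tab instead (target="_blank"), waitForNavigation on the original page will time out, so this sketch would not apply as written.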
When I click the Next button to continue my test, the page plays a transition before the password can be entered. This transition prevents me from clicking the password input, so to work around it I used the wait method to wait up to 1s until the element is located. The error is shown after the code.
const {
Builder,
By,
until,
Capabilities
} = require('selenium-webdriver');
// requiring needed modules
(async function login() {
const pageLoad = new Capabilities().setPageLoadStrategy('normal')
//configuring the way the page loads
let driver = await new Builder().withCapabilities(pageLoad).forBrowser('firefox').build();
try {
await driver.get('https://login.live.com/login.srf?wa=wsignin1.0&rpsnv=13&rver=7.3.6963.0&wp=MBI_SSL&wreply=https://www.microsoft.com/en-us/&lc=1033&id=74335&aadredir=1');
//going to link
var userName = (await driver.findElement(By.css('#i0116'))).sendKeys((USERNAME));
//finding element and typing
(await driver.findElement(By.css('#idSIButton9'))).click();
// clicking element
await driver.wait(until.elementLocated(By.css('#i0116')),1000);
//This is where I think the error is happening
var passWd = (await driver.findElement(By.css('#i0116'))).click().sendKeys(PASSWORD);
(await driver.findElement(By.css('#idSIButton9'))).click();
} catch (error) {
console.log(error)
} finally {
console.log('finished')
}
}())
{ NoSuchElementError: Unable to locate element: #i0116
at Object.throwDecodedError (/home/name/Desktop/projects/Test/Chrome/pecPrea/node_modules/selenium-webdriver/lib/error.js:550:15)
at parseHttpResponse (/home/name/Desktop/projects/Test/Chrome/pecPrea/node_modules/selenium-webdriver/lib/http.js:565:13)
at Executor.execute (/home/name/Desktop/projects/Test/Chrome/pecPrea/node_modules/selenium-webdriver/lib/http.js:491:26)
at process._tickCallback (internal/process/next_tick.js:68:7)
name: 'NoSuchElementError',
remoteStacktrace:
'WebDriverError#chrome://marionette/content/error.js:175:5\nNoSuchElementError#chrome://marionette/content/error.js:387:5\nelement.find/</<#chrome://marionette/content/element.js:330:16\n' }
finished
If you just want to wait for an action to complete, you can try sleep from
java.util.concurrent.TimeUnit:
If you want it in seconds, you can use
TimeUnit.SECONDS.sleep(int);
and for minutes you can use
TimeUnit.MINUTES.sleep(int);
The raw way to sleep is Thread.sleep(), but its argument is in milliseconds. If your program contains multiple sleep statements, I would prefer TimeUnit.
I chose the incorrect element. Instead of selecting #i0118, the correct element, I stupidly selected #i0116.
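For reference, the wait-then-type step can be written as a small helper in JavaScript. This is only a sketch: the name typeWhenReady and the 10 second default are my own choices, and By and until are passed in as parameters (in a real script they come from require('selenium-webdriver')). It waits for the element to exist and then to be visible, which matters when a transition is still running.

```javascript
// Hypothetical helper: wait until the element exists and is visible,
// then type `text` into it. `By` and `until` come from selenium-webdriver.
async function typeWhenReady(driver, By, until, cssSelector, text, timeoutMs = 10000) {
  // driver.wait resolves with the located element once the condition holds.
  const element = await driver.wait(until.elementLocated(By.css(cssSelector)), timeoutMs);
  // The element may exist but still be hidden mid-transition.
  await driver.wait(until.elementIsVisible(element), timeoutMs);
  await element.sendKeys(text);
  return element;
}

// Possible usage inside the login() function from the question:
//   await typeWhenReady(driver, By, until, '#i0118', PASSWORD);
```

Note that each call is awaited separately; chaining .click().sendKeys(...) as in the question cannot work, because click() returns a promise rather than the element.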
On a page I'm scraping with Puppeteer, I have a list with the same id for every li. I am trying to find and click on an element with specific text within this list. I have the following code:
await page.waitFor(5000)
const linkEx = await page.$x("//a[contains(text(), 'Shop')]")
if (linkEx.length > 0) {
await linkEx[0].click()
}
Do you have any idea how I could replace the first line with waiting for the actual text 'Shop'?
I tried await page.waitFor(linkEx), waitForSelector(linkEx) but it's not working.
Also, I would like to replace that a in the second line of code with the actual id (#activities) or something like that but I couldn't find a proper example.
Could you please help me with this issue?
page.waitForXPath is what you need here.
Example:
const puppeteer = require('puppeteer')
async function fn() {
const browser = await puppeteer.launch()
const page = await browser.newPage()
await page.goto('https://example.com')
// await page.waitForSelector('//a[contains(text(), "More information...")]') // ❌
await page.waitForXPath('//a[contains(text(), "More information...")]') // ✅
const linkEx = await page.$x('//a[contains(text(), "More information...")]')
if (linkEx.length > 0) {
await linkEx[0].click()
}
await browser.close()
}
fn()
Try this for an id-based XPath:
"//*[@id='activities' and contains(text(), 'Shop')]"
Did you know? If you right-click an element in the Chrome DevTools "Elements" tab and select "Copy", you can copy the exact selector or XPath of that element. You can then switch to the "Console" tab and test the selector with the Chrome console API before putting it in your puppeteer script. E.g.: $x("//*[@id='activities' and contains(text(), 'Shop')]")[0].href should show the link you expect to click on. If it doesn't, you need to adjust the expression, or check whether more than one element matches it. This may help you find more appropriate selectors.
For puppeteer 19 and newer, waitForXPath() is deprecated. Use the xpath/ prefix with waitForSelector instead:
await page.waitForSelector('xpath/' + xpathExpression)
In your case:
const linkEx = await page.waitForSelector('xpath///a[contains(text(), "Shop")]');
await linkEx.click();
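If the id or the text comes from a variable, building the XPath by string concatenation breaks as soon as the value contains a quote. Here is a small sketch of a builder that handles that; both function names are hypothetical, and the expression it produces is the same id-plus-text pattern used in the answers above.

```javascript
// Hypothetical helper: turn an arbitrary string into an XPath 1.0 string literal.
// XPath 1.0 has no escape character, so mixed quotes need concat().
function xpathLiteral(value) {
  if (!value.includes("'")) return `'${value}'`;
  if (!value.includes('"')) return `"${value}"`;
  // Both quote types present: split on ' and glue the pieces back with "'".
  const parts = value.split("'").map((part) => `'${part}'`);
  return `concat(${parts.join(`, "'", `)})`;
}

// Hypothetical helper: any element with the given id whose text contains `text`.
function xpathByIdAndText(id, text) {
  return `//*[@id=${xpathLiteral(id)} and contains(text(), ${xpathLiteral(text)})]`;
}

// Possible usage inside an async function:
//   const link = await page.waitForSelector('xpath/' + xpathByIdAndText('activities', 'Shop'));
//   await link.click();
```

For the values in the question this yields //*[@id='activities' and contains(text(), 'Shop')].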
I'm trying to test a web page of my project with Jest and Puppeteer. When I right-click an element on the page, a menu pops up and some style attributes are set on the element. I'm trying to test this flow with Jest and have written the following code:
describe('Test for Rest Data', () => {
jest.setTimeout(100000);
beforeEach(async () => {
await page.goto("url", { waitUntil: 'networkidle2' })
await page.waitForSelector('table');
});
});
test("Assert for delete row !",async () => {
await page.click('tr','right');
const tbl = await page.evaluate(()=>{
return document.querySelector('tr').getAttribute('style');
});
expect(tbl).not.toBeNull();
});
When I right-click a row of the table, a style attribute gets added to it, but with the code above tbl does not get any value.
Am I doing something wrong? How should I do this correctly?
You should probably also wait for some time after the click. Maybe the style changes, but not instantly, or maybe the element is not there yet.
Try:
await page.waitFor(1000); // wait for some time
// or this below
await page.waitFor('tr'); // wait for the element
This waits either for a fixed time or for the element to appear. Check whether that fixes it.
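Two more things may be worth checking. As far as I know, the second argument of page.click is an options object, so page.click('tr', 'right') would most likely perform a normal left click; the button is chosen with { button: 'right' }. And instead of a fixed delay, you can wait until the style attribute is actually present. A minimal sketch, with a hypothetical helper name and an arbitrary 5 second timeout:

```javascript
// Hypothetical helper: right-click the first element matching `selector`,
// wait until it carries a style attribute, then return that attribute.
async function rightClickAndGetStyle(page, selector = 'tr') {
  await page.click(selector, { button: 'right' });

  // Runs in the browser and polls until the attribute shows up (or times out).
  await page.waitForFunction(
    (sel) => {
      const el = document.querySelector(sel);
      return el !== null && el.hasAttribute('style');
    },
    { timeout: 5000 },
    selector
  );

  return page.$eval(selector, (el) => el.getAttribute('style'));
}

// Possible usage inside the Jest test:
//   const style = await rightClickAndGetStyle(page, 'tr');
//   expect(style).not.toBeNull();
```

Also note that in the posted code the test() call sits outside the describe() block, so the beforeEach that loads the page does not apply to it.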