I want to scrape a site, but the problem is this: the content loads in about 1 second, yet a loader in the navbar keeps spinning for 30 seconds to 1 minute, so my Puppeteer scraper keeps waiting for that navbar loader to finish.
Is there any way to run window.stop() after a certain timeout?
const checkBook = async () => {
  await page.goto(`https://wattpad.com/story/${bookid}`, {
    waitUntil: 'domcontentloaded',
  });
  const is404 = await page.$('#story-404-wrapper');
  if (is404) {
    socket.emit('error', {
      message: 'Story not found',
    });
    await browser.close();
    return {
      error: true,
    };
  }
  storyLastUpdated = await page
    .$eval(
      '.table-of-contents__last-updated strong',
      (ele: any) => ele.textContent,
    )
    .then((date: string) => getDate(date));
};
Similar approach to Marcel's answer. The following will do the job:
page.goto(url)
await page.waitForTimeout(1000)
await page.evaluate(() => window.stop())
// your scraper script goes here
await browser.close()
Notes:
page.goto() is NOT awaited, so you save time compared to waiting for the DOMContentLoaded or load event...
...but because goto is not awaited, you need to make sure your script can start working with the DOM. You can use either page.waitForTimeout() or page.waitForSelector() for that.
window.stop() has to be executed within page.evaluate(); that way you avoid this kind of error: Error: Navigation failed because browser has disconnected!
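If a fixed one-second wait turns out to be too short or too long, the same idea can be combined with page.waitForSelector(), reusing a selector from the question's own code. A rough sketch (the 5-second timeout is an arbitrary example):
page.goto(`https://wattpad.com/story/${bookid}`).catch(() => {}); // intentionally not awaited; ignore a possible rejection when the load is stopped
// wait for the part of the DOM we actually need instead of a fixed delay
await page.waitForSelector('.table-of-contents__last-updated strong', { timeout: 5000 });
await page.evaluate(() => window.stop()); // abort the never-finishing navbar request
// scraping code goes here
await browser.close();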
You could drop the
waitUntil: 'domcontentloaded',
option in favor of a custom timeout, as documented here: https://github.com/puppeteer/puppeteer/blob/v14.1.0/docs/api.md#pagegotourl-options
or set the timeout to zero and instead use one of the page.waitFor... methods, like this:
await page.waitForTimeout(30000);
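Put together, the second variant could look roughly like this (my reading of the suggestion; the URL comes from the question and the 30-second wait is just an example value):
// timeout: 0 means goto itself never throws a navigation TimeoutError;
// the explicit wait below decides how long the script actually gives the page
await page.goto(`https://wattpad.com/story/${bookid}`, { timeout: 0, waitUntil: 'domcontentloaded' });
await page.waitForTimeout(30000);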
Related
I am playing around with Puppeteer for Twitter automation. I have discovered that the navigation will often time out. For example:
puppeteer.launch({ headless: false }).then(async browser => {
  try {
    const page = await browser.newPage();
    await page.goto('https://twitter.com/home');
    // This sometimes fails with a timeout error
    await page.waitForNavigation({ waitUntil: 'networkidle2' });
    if (page.url() === 'https://twitter.com/login') {
      await login(page);
    }
  } catch (error) {
    console.log(error);
  }
});
The page will just hang and I'll get a TimeoutError.
I have tried changing the waitUntil parameter and it doesn't seem to make a difference. I have also set await page.setDefaultNavigationTimeout(0); this of course stops the error from appearing, but the page just never responds.
Has anyone else faced this problem? Is Twitter detecting me? Or have I missed something? I am using puppeteer-extra with puppeteer-extra-plugin-stealth using the default settings.
I am using puppeteer to load a web page.
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

await page.setRequestInterception(true);
page.on('request', (request) => {
  console.log(request.url());
  request.continue();
  // ...
});

await page.goto(
  'https://www.onthehouse.com.au/property-for-rent/vic/aspendale-gardens-3195',
  { waitUntil: 'networkidle2' },
);
I set request interception to true and log all request URLs. The requests I log are far fewer than the requests I see when I load the URL in the Chrome browser.
For instance, there is at least one request, https://www.onthehouse.com.au/odin/api/compositeSearch, which shows up in the Chrome DevTools but is not logged by the code above.
I wonder how I can log all requests?
I did some benchmarking across 4 variants of this script, and for me the results were the same. Note: I ran the tests multiple times; sometimes, due to local network speed, there were fewer calls, but after 2-3 tries Puppeteer was able to catch all requests.
On the https://www.onthehouse.com.au/property-for-rent/vic/aspendale-gardens-3195 page there are some async and defer scripts, so my hypothesis was that they may load differently when we use different Puppeteer settings, or async vs. sync functions inside page.on.
Note 2: I tested another page, not the one in the original question, as I already needed a VPN to visit this Australian website (easy from Chrome, more effort with Puppeteer), but trust me, the page I tested similarly has tons of analytics and tracking requests.
Baseline from the Chrome Network tab: 28 calls
First I visited the xy webpage; the result was 28 calls on the Network tab.
Case 1: Original (sync, networkidle2)
await page.setRequestInterception(true);
page.on('request', (request) => {
  console.log(request.url());
  request.continue();
  // ...
});

await page.goto(
  'https://www.onthehouse.com.au/property-for-rent/vic/aspendale-gardens-3195',
  { waitUntil: 'networkidle2' },
);
Result: 28 calls
Case 2: Async, networkidle2
The page.on handler is an async function, so we can await request.url().
await page.setRequestInterception(true);
page.on('request', async (request) => {
  console.log(await request.url());
  request.continue();
  // ...
});

await page.goto(
  'https://www.onthehouse.com.au/property-for-rent/vic/aspendale-gardens-3195',
  { waitUntil: 'networkidle2' },
);
Result: 28 calls
Case 3: Sync, networkidle0
The same as the original, but with networkidle0.
await page.setRequestInterception(true);
page.on('request', (request) => {
  console.log(request.url());
  request.continue();
  // ...
});

await page.goto(
  'https://www.onthehouse.com.au/property-for-rent/vic/aspendale-gardens-3195',
  { waitUntil: 'networkidle0' },
);
Result: 28 calls
Case 4: Async, networkidle0
The page.on handler is an async function, so we can await request.url(). Plus networkidle0.
await page.setRequestInterception(true);
page.on('request', async (request) => {
  console.log(await request.url());
  request.continue();
  // ...
});

await page.goto(
  'https://www.onthehouse.com.au/property-for-rent/vic/aspendale-gardens-3195',
  { waitUntil: 'networkidle0' },
);
Result: 28 calls
As there was no difference in the number of requests between the Network tab and Puppeteer, regardless of how we launch Puppeteer or how we collect the requests, my ideas are:
Either you have already accepted the cookie consent in your Chrome, so the Network tab shows more requests (these requests only happen after cookies are accepted). You can accept their cookie policy with a simple navigation, so after you have navigated around their site there will immediately be more requests on the Network tab:
[...] By continuing to use our website, you consent to cookies being used.
Solution: Do not visit the desired page directly, but navigate there through clicks, so your Puppeteer Chromium accepts the cookie consent; then you will see all the analytics requests as well.
Or a Chrome extension is affecting the number of requests on the page.
Advice: Check your Puppeteer requests against an incognito Chrome Network tab, and make sure all extensions/add-ons are disabled.
Additionally, if you are only interested in XHR requests, you may want to use request.resourceType() to differentiate them from other request types (see the docs).
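For example, a minimal sketch of that filtering inside the same request handler ('xhr' and 'fetch' are standard resourceType() values):
await page.setRequestInterception(true);
page.on('request', (request) => {
  // log only XHR/fetch calls, let everything else through silently
  if (request.resourceType() === 'xhr' || request.resourceType() === 'fetch') {
    console.log(request.resourceType(), request.url());
  }
  request.continue();
});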
I am trying to create custom error messages for when Puppeteer fails to do a task; in my case, it cannot find the field it has to click.
let page;

before(async () => { /* before hook for mocha testing */
  page = await browser.newPage();
  await page.goto("https://www.linkedin.com/login");
  await page.setViewport({ width: 1920, height: 1040 });
});

after(async function () { /* after hook for mocha testing */
  await page.close();
});

it('should login to home page', async () => { /* simple test case */
  const emailInput = "#username";
  const passwordInput = "#assword";
  const submitSelector = ".login__form_action_container ";
  linkEmail = await page.$(emailInput);
  linkPassword = await page.$(passwordInput);
  linkSubmit = await page.$(submitSelector);
  await linkEmail.click({ clickCount: 3 });
  await linkEmail.type('testemail#example.com'); // add the email address for linkedin
  await linkPassword.click({ clickCount: 3 }).catch(error => {
    console.log('The following error occurred: ' + error);
  });
  await linkPassword.type('testpassword'); // add password for linkedin account
  await linkSubmit.click();
  await page.waitFor(3000);
});
});
I have deliberately put a wrong passwordInput name in order to force puppeteer to fail. However, the console.log message is never printed.
This is my error output which is the default mocha error:
simple test for Linkedin Login functionality
1) should login to home page
0 passing (4s)
1 failing
1) simple test for Linkedin Login functionality
should login to home page:
TypeError: Cannot read property 'click' of null
at Context.<anonymous> (test/sample.spec.js:29:28)
Line 29 is the await linkPassword.click({ clickCount: 3 })
Does anyone have an idea how I can make it print a custom error message when an error like this occurs?
The problem is that the exception is not thrown as a result of executing await linkPassword.click(); it is thrown when attempting to call the function in the first place. With .catch() you handle an exception produced during the execution of the returned promise. page.$() works like this: it returns null if the selector isn't found. So in your case you are effectively calling null.click({ clickCount: 3 }).catch(), which throws a TypeError synchronously and never reaches the .catch().
To quickly solve your problem, check whether linkPassword is null before using it. However, I think using page.$() to grab elements to interact with is a mistake here: page.$() does not wait for the element to appear, so you lose the built-in waiting you get with Puppeteer's page.waitForSelector() and page.click().
Instead, you should make sure that the element exists and is visible, and then use Puppeteer's API to interact with it, like this:
const emailInput = "#username";
await page.waitForSelector(emailInput);
await page.click(emailInput, { clickCount: 3 });
await page.type(emailInput, 'testemail#example.com');
This way your script makes sure the element is present and clickable; if it is, Puppeteer scrolls to it, performs the clicks, and types the text.
Then you can handle a case when the element isn't found this way:
page.waitForSelector(emailInput).catch(() => {})
or just by using try/catch.
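For the asker's original goal of printing a custom message, a try/catch sketch could look like this (the selector and message text are only examples; rethrowing keeps the mocha test failing):
const passwordInput = "#password"; // the correct selector, for illustration
try {
  await page.waitForSelector(passwordInput, { timeout: 5000 });
  await page.click(passwordInput, { clickCount: 3 });
  await page.type(passwordInput, 'testpassword');
} catch (error) {
  console.log('The following error occurred: ' + error);
  throw new Error(`Could not find or interact with "${passwordInput}"`); // custom message, test still fails
}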
I am trying to run Puppeteer in a Bamboo build, but it seems there is a problem executing it properly. The detailed error is below.
I wonder if there is something I have to install to get it to run in Bamboo, or whether I have to try some alternative. There are no articles available online regarding this issue.
A bit more background: I am trying to add jest-image-snapshot to my test process, and I generate the snapshot like this:
const puppeteer = require('puppeteer');

let browser;

beforeAll(async () => {
  browser = await puppeteer.launch();
});

it('show correct page: variant', async () => {
  const page = await browser.newPage();
  await page.goto(
    'http://localhost:8080/app/register?experimentName=2018_12_STREAMLINED_ACCOUNT&experimentVariation=STREAMLINED#/'
  );
  const image = await page.screenshot();
  expect(image).toMatchImageSnapshot();
});

afterAll(async () => {
  await browser.close();
});
The reason for the TypeError: Cannot read property 'newPage' of undefined in the log is the line const page = await browser.newPage();
The important part is in your screenshot:
Failed to launch chrome! ... No usable sandbox!
Try to launch puppeteer without a sandbox like this:
await puppeteer.launch({
args: ['--no-sandbox']
});
Depending on the platform, you might also want to try the following arguments (individually or in addition):
--disable-setuid-sandbox
--disable-dev-shm-usage
If all three do not work, the Troubleshooting guide might have additional information.
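For reference, a launch call combining all three flags could look like this (a sketch; keep only the flags your environment actually needs):
browser = await puppeteer.launch({
  args: [
    '--no-sandbox',             // required when Chrome's sandbox cannot be set up on the CI agent
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',  // avoids /dev/shm size issues in containers
  ],
});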
So I am trying to use Puppeteer to automate some data-entry functions in Oracle Cloud applications.
As of now I am able to launch the cloud app login page, enter the username and password, and click the login button. Once login succeeds, Oracle opens a homepage for the user. After that happens, if I take a screenshot or call page.content(), the screenshot and the HTML content are from the login page, not the homepage.
How do I always have a reference to the current page that the user is on?
Here is the basic code so far.
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  let page = await browser.newPage();

  await page.goto('oraclecloudloginurl', { waitUntil: 'networkidle2' });
  await page.type('#userid', 'USERNAME', { delay: 10 });
  await page.type('#password', 'PASSWORD', { delay: 10 });
  await page.waitForSelector('#btnActive', { enabled: true });
  page.click('#btnActive', { delay: 1000 }).then(() => console.log('Login Button Clicked'));
  await page.waitForNavigation();
  await page.screenshot({ path: 'home.png' });

  const html = await page.content();
  await fs.writeFileSync('home.html', html);

  await page.waitFor(10000);
  await browser.close();
})();
With this, the user logs in fine and the home page is displayed, but I then get an error when I try to screenshot the homepage and save its HTML content. It seems the page has changed and I am still referring to the old page. How can I refer to the context of the current page?
Below is the error:
(node:14393) UnhandledPromiseRejectionWarning: Error: Protocol error (Runtime.callFunctionOn): Cannot find context with specified id undefined
This code looks problematic for two reasons:
page.click('#btnActive', {delay : 1000}).then(() => console.log('Login Button Clicked'));
await page.waitForNavigation();
The first problem is that the page.click().then() spins off a totally separate promise chain:
page.click() --> .then(...)
  |
  v
page.waitForNavigation()
  |
  v
page.screenshot(...)
  |
  v
 ...
This means the click that triggers the navigation and the navigation are running in parallel and can never be rejoined into the same promise chain. The usual solution here is to tie them into the same promise chain:
// Note: this code is still broken; keep reading!
await page.click('#btnActive', {delay : 1000});
console.log('Login Button Clicked');
await page.waitForNavigation();
This adheres to the principle of not mixing then and await unless you have good reason to.
But the above code is still broken because Puppeteer requires the waitForNavigation() promise to be set before the event that triggers navigation is fired. The fix is:
await Promise.all([
page.waitForNavigation(),
page.click('#btnActive', {delay : 1000}),
]);
or
const navPromise = page.waitForNavigation(); // no await
await page.click('#btnActive', {delay : 1000});
await navPromise;
Following this pattern, Puppeteer should no longer be confused about its context.
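Applied to the asker's snippet, the login section could look roughly like this (a sketch that reuses the original selectors and options):
await page.waitForSelector('#btnActive');
await Promise.all([
  page.waitForNavigation(),                  // start listening for the navigation first...
  page.click('#btnActive', { delay: 1000 }), // ...then trigger it
]);
console.log('Login Button Clicked');
await page.screenshot({ path: 'home.png' }); // now reflects the homepage rather than the login page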
Minor notes:
'networkidle2' is slow and probably unnecessary, especially for a page you're soon going to be navigating away from. I'd default to 'domcontentloaded'.
await page.waitFor(10000); is deprecated, as is page.waitForTimeout(), although I realize this is an older post.