Puppeteer often fails waitForNavigation with Timeout error on Twitter

I am playing around with Puppeteer for Twitter automation. I have discovered that often the navigation will timeout. For example:
puppeteer.launch({ headless: false }).then(async browser => {
  try {
    const page = await browser.newPage();
    await page.goto('https://twitter.com/home');
    // This sometimes fails with timeout error
    await page.waitForNavigation({ waitUntil: 'networkidle2' });
    if (page.url() === 'https://twitter.com/login') {
      await login(page);
    }
  } catch (error) {
    console.log(error);
  }
});
The page just hangs and I get a TimeoutError.
I have tried changing the waitUntil parameter, and it doesn't seem to make a difference. I have also set await page.setDefaultNavigationTimeout(0); that of course stops the error from appearing, but the page simply never responds.
Has anyone else faced this problem? Is Twitter detecting me, or have I missed something? I am using puppeteer-extra with puppeteer-extra-plugin-stealth on the default settings.
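One way to avoid the flaky waitForNavigation here is to race two selector waits, one unique to the login page and one unique to the logged-in timeline. A sketch only; both selectors are assumptions and may need updating as Twitter's markup changes:

```javascript
// Sketch: race a selector unique to the login page against one unique to the
// logged-in timeline, instead of relying on waitForNavigation. Both selectors
// are assumptions, not verified against current Twitter markup.
async function detectLanding(page) {
  return Promise.race([
    page.waitForSelector('input[autocomplete="username"]').then(() => 'login'),
    page.waitForSelector('[data-testid="primaryColumn"]').then(() => 'home'),
  ]);
}

// Usage sketch:
// await page.goto('https://twitter.com/home');
// if ((await detectLanding(page)) === 'login') await login(page);
```

Whichever selector appears first tells you where you landed, without caring when (or whether) a navigation event fires.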

Related

How to run window.stop after certain time in puppeteer

I want to scrape a site, but the problem is this: the content loads in about a second, yet the loader in the navbar keeps spinning for 30 seconds to a minute, so my Puppeteer scraper keeps waiting for the loader in the navbar to stop.
Is there any way to run window.stop() after a certain timeout?
const checkBook = async () => {
  await page.goto(`https://wattpad.com/story/${bookid}`, {
    waitUntil: 'domcontentloaded',
  });
  const is404 = await page.$('#story-404-wrapper');
  if (is404) {
    socket.emit('error', {
      message: 'Story not found',
    });
    await browser.close();
    return {
      error: true,
    };
  }
  storyLastUpdated = await page
    .$eval(
      '.table-of-contents__last-updated strong',
      (ele: any) => ele.textContent,
    )
    .then((date: string) => getDate(date));
};
Similar approach to Marcel's answer. The following will do the job:
page.goto(url)
await page.waitForTimeout(1000)
await page.evaluate(() => window.stop())
// your scraper script goes here
await browser.close()
Notes:
page.goto() is NOT awaited, so you save time compared to waiting for the DOMContentLoaded or load events...
...but because goto was not awaited, you need to make sure your script can start working with the DOM. You can use either page.waitForTimeout() or page.waitForSelector().
you have to execute window.stop() inside page.evaluate(); otherwise you can run into this kind of error: Error: Navigation failed because browser has disconnected!
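A variant of the same idea, as a sketch: if the page has a selector you know appears once the useful content is in, waiting for that selector is usually more reliable than a fixed waitForTimeout(). The function name and arguments here are made up for illustration:

```javascript
// Sketch: goto() is deliberately not awaited; we wait for a known selector
// instead of a fixed timeout, then stop the never-ending loader.
async function loadAndStop(page, url, readySelector) {
  page.goto(url).catch(() => {}); // not awaited; swallow the abort error
  await page.waitForSelector(readySelector);
  await page.evaluate(() => window.stop());
}
```

The .catch(() => {}) matters: calling window.stop() aborts the still-pending navigation, which would otherwise surface as an unhandled rejection from the un-awaited goto().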
You could drop the
waitUntil: 'domcontentloaded',
option in favor of a timeout, as documented here: https://github.com/puppeteer/puppeteer/blob/v14.1.0/docs/api.md#pagegotourl-options
Or set the timeout to zero and instead use one of the page.waitFor... methods, like this:
await page.waitForTimeout(30000);

How can I get all xhr calls in puppeteer?

I am using puppeteer to load a web page.
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.setRequestInterception(true);
page.on('request', (request) => {
  console.log(request.url());
  request.continue();
  // ...
});
await page.goto(
  'https://www.onthehouse.com.au/property-for-rent/vic/aspendale-gardens-3195',
  { waitUntil: 'networkidle2' },
);
I set request interception to true and log all request URLs. The requests I log are far fewer than the requests made when I load the URL in the Chrome browser.
There is at least one request, https://www.onthehouse.com.au/odin/api/compositeSearch, which can be found in the Chrome dev tools but does not show up in the code above.
I wonder how I can log all requests?
I did some benchmarking between 4 variants of this script, and for me the results were the same. Note: I ran multiple tests; sometimes, due to local network speed, fewer calls were captured, but after 2-3 tries Puppeteer was able to catch all requests.
On the https://www.onthehouse.com.au/property-for-rent/vic/aspendale-gardens-3195 page there are some async and defer scripts; my hypothesis was that these may load differently when we use different Puppeteer settings, or with async vs. sync functions inside page.on.
Note 2: I tested another page, not the one in the original question, as I already needed a VPN to visit this Australian website (easy from Chrome, more work with Puppeteer); trust me, the page I tested has a similar ton of analytics and tracking requests.
Baseline from Chrome network: 28 calls
First I visited the test page; the result was 28 calls on the Network tab.
Case 1: Original (sync, networkidle2)
await page.setRequestInterception(true);
page.on('request', (request) => {
  console.log(request.url());
  request.continue();
  // ...
});
await page.goto(
  'https://www.onthehouse.com.au/property-for-rent/vic/aspendale-gardens-3195',
  { waitUntil: 'networkidle2' },
);
Result: 28 calls
Case 2: Async, networkidle2
The page.on has an async function inside, so we can await the request.url().
await page.setRequestInterception(true);
page.on('request', async request => {
  console.log(await request.url());
  request.continue();
  // ...
});
await page.goto(
  'https://www.onthehouse.com.au/property-for-rent/vic/aspendale-gardens-3195',
  { waitUntil: 'networkidle2' },
);
Result: 28 calls
Case 3: Sync, networkidle0
Similar to the original, but with networkidle0.
await page.setRequestInterception(true);
page.on('request', (request) => {
  console.log(request.url());
  request.continue();
  // ...
});
await page.goto(
  'https://www.onthehouse.com.au/property-for-rent/vic/aspendale-gardens-3195',
  { waitUntil: 'networkidle0' },
);
Result: 28 calls
Case 4: Async, networkidle0
The page.on has an async function inside, so we can await the request.url(). Plus networkidle0.
await page.setRequestInterception(true);
page.on('request', async request => {
  console.log(await request.url());
  request.continue();
  // ...
});
await page.goto(
  'https://www.onthehouse.com.au/property-for-rent/vic/aspendale-gardens-3195',
  { waitUntil: 'networkidle0' },
);
Result: 28 calls
As there was no difference in the number of requests between the Network tab and Puppeteer, regardless of how we launch Puppeteer or how we collect the requests, my ideas are:
Either you have accepted the cookie consent in your Chrome, so the Network tab shows more requests (these requests only happen after the cookies are accepted). You can accept their cookie policy with a simple navigation, so after you've navigated inside their page there will be more requests on the Network tab immediately.
[...] By continuing to use our website, you consent to cookies being used.
Solution: Do not visit the desired page directly, but navigate there through clicks, so your Puppeteer's Chromium accepts the cookie consent; then you will see all the analytics requests as well.
Or some Chrome addon affects the number of requests on the page.
Advice: Check your Puppeteer requests against an incognito Chrome's Network tab, and make sure all extensions/addons are disabled.
+ If you are only interested in XHR requests, you may want to check request.resourceType() to differentiate them from the others (docs).
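To illustrate that last point, a small sketch: resourceType() (a method on the request object) lets you keep only XHR-like traffic. The helper name is made up:

```javascript
// Sketch: predicate for XHR-like traffic. In Puppeteer, request.resourceType()
// returns strings such as 'document', 'script', 'xhr', 'fetch', 'image', etc.
// Fetch-based calls report 'fetch', so check for both.
function isXhrLike(type) {
  return type === 'xhr' || type === 'fetch';
}

// Usage inside the interception handler from the question:
// page.on('request', (request) => {
//   if (isXhrLike(request.resourceType())) console.log(request.url());
//   request.continue();
// });
```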

login into gmail fails for unknown reason

I am trying to log in to my Gmail with Puppeteer to lower the risk of reCAPTCHA.
Here is my code:
await page.goto('https://accounts.google.com/AccountChooser?service=mail&continue=https://mail.google.com/mail/', {timeout: 60000})
  .catch(function (error) {
    throw new Error('TimeoutBrows');
  });

await page.waitForSelector('#identifierId', { visible: true });
await page.type('#identifierId', 'myemail');

await Promise.all([
  page.click('#identifierNext'),
  page.waitForSelector('.whsOnd', { visible: true })
]);

await page.type('#password .whsOnd', "mypassword");
await page.click('#passwordNext');
await page.waitFor(5000);
but I always end up with this message.
I even tried just opening the login window with Puppeteer and filling in the login form manually myself, but even that failed.
Am I missing something?
When I look into the console, there is a failed AJAX call just after login.
Request URL: https://accounts.google.com/_/signin/challenge?hl=en&TL=APDPHBCG5lPol53JDSKUY2mO1RzSwOE3ZgC39xH0VCaq_WHrJXHS6LHyTJklSkxd&_reqid=464883&rt=j
Request Method: POST
Status Code: 401
Remote Address: 216.58.213.13:443
Referrer Policy: no-referrer-when-downgrade
)]}'
[[["er",null,null,null,null,401,null,null,null,16]
,["e",2,null,null,81]
]]
I've inspected your code and it seems to be correct apart from some selectors. Also, I had to add a couple of timeouts in order to make it work. However, I failed to reproduce your issue, so I'll just post the code that worked for me.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://accounts.google.com/AccountChooser?service=mail&continue=https://mail.google.com/mail/', {timeout: 60000})
    .catch(function (error) {
      throw new Error('TimeoutBrows');
    });
  await page.screenshot({path: './1.png'});
  // ...
})();
Please note that I run the browser in normal, not headless, mode. If you take a look at a screenshot at this point, you will see that it is the correct Google login form.
The rest of the code is responsible for entering the password:
const puppeteer = require('puppeteer');

(async () => {
  // ...
  await page.waitForSelector('#identifierId', {visible: true});
  await page.type('#identifierId', 'my#email');
  await Promise.all([
    page.click('#identifierNext'),
    page.waitForSelector('.whsOnd', {visible: true})
  ]);
  await page.waitForSelector('input[name=password]', {visible: true});
  await page.type('input[name=password]', "my.password");
  await page.waitForSelector('#passwordNext', {visible: true});
  await page.waitFor(1000);
  await page.click('#passwordNext');
  await page.waitFor(5000);
})();
Please also note a few differences from your code: the selector for the password field is different. I had to add await page.waitForSelector('#passwordNext', {visible: true}); and a small timeout after that so the button could be clicked successfully.
I've tested all the code above and it worked. Please let me know if you still need help or run into trouble with my example.
The purpose of the question is to log in to Gmail. I will share another method that does not involve filling the email and password fields in the Puppeteer script, and that works in headless: true mode.
Method
Log in to your Gmail using a normal browser (Google Chrome preferably)
Export all cookies for the Gmail tab
Use page.setCookie to import the cookies into your Puppeteer instance
Open Gmail, already logged in
The first step should be a no-brainer.
Export all cookies
I will use an extension called Edit This Cookie; however, you can use other extensions or manual methods to extract the cookies.
Click the browser icon and then click the Export button.
Import cookies to puppeteer instance
We will save the cookies in a cookies.json file and then import them using the page.setCookie function before navigation. That way, when the Gmail page loads, it will have the login information right away.
The code might look like this.
const puppeteer = require("puppeteer");
const cookies = require('./cookies.json');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Set cookies here, right after creating the instance
  await page.setCookie(...cookies);

  // do the navigation
  await page.goto("https://mail.google.com/mail/u/0/#search/stackoverflow+survey", {
    waitUntil: "networkidle2",
    timeout: 60000
  });

  await page.screenshot({ path: "example.png" });
  await browser.close();
})();
Result:
Notes:
It was not asked, but I should mention following for future readers.
Cookie expiration: Cookies might be short-lived, expire shortly afterwards, or behave differently on a different device. Logging out on your original device will log you out of the Puppeteer session as well, since it shares the cookies.
Two-factor: I am not yet sure about 2FA authentication. It did not ask me about 2FA, probably because I logged in from the same device.

Failed to launch chrome!, failed to launch chrome puppeteer in bamboo for jest image snapshot test

I am trying to run Puppeteer in a Bamboo build, but there seems to be a problem executing it properly. The detailed error is below.
I wonder if there is something I have to install to get it to run in Bamboo, or whether I have to find an alternative. There are no articles available online about this issue.
For a bit more background, I am trying to add jest-image-snapshot to my test process, generating snapshots like this:
const puppeteer = require('puppeteer');

let browser;

beforeAll(async () => {
  browser = await puppeteer.launch();
});

it('show correct page: variant', async () => {
  const page = await browser.newPage();
  await page.goto(
    'http://localhost:8080/app/register?experimentName=2018_12_STREAMLINED_ACCOUNT&experimentVariation=STREAMLINED#/'
  );
  const image = await page.screenshot();
  expect(image).toMatchImageSnapshot();
});

afterAll(async () => {
  await browser.close();
});
The reason for the TypeError: Cannot read property 'newPage' of undefined in the log is the line const page = await browser.newPage(); — the launch failed, so browser is undefined.
The important part is in your screenshot:
Failed to launch chrome! ... No usable sandbox!
Try to launch puppeteer without a sandbox like this:
await puppeteer.launch({
args: ['--no-sandbox']
});
Depending on the platform, you might also want to try the following arguments (individually or in addition):
--disable-setuid-sandbox
--disable-dev-shm-usage
If all three do not work, the Troubleshooting guide might have additional information.
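Combining the three flags, a launch helper might look like this (a sketch; the helper name is made up). Note that dropping the sandbox is a security trade-off, acceptable mainly for trusted pages on throwaway build agents:

```javascript
// Sketch: build launch options for CI environments where Chrome's sandbox
// is unavailable (e.g. certain Docker images or Bamboo agents).
function ciLaunchOptions(extraArgs = []) {
  return {
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      ...extraArgs,
    ],
  };
}

// Usage sketch:
// const puppeteer = require('puppeteer');
// const browser = await puppeteer.launch(ciLaunchOptions());
```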

How do I reference the current page object in puppeteer once user moves from login to homepage?

So I am trying to use puppeteer to automate some data entry functions in Oracle Cloud applications.
As of now I am able to launch the cloud app login page, enter the username and password credentials, and click the login button. Once login is successful, Oracle opens a homepage for the user. When this happens, if I take a screenshot or call page.content(), both the screenshot and the HTML content are from the login page, not the homepage.
How do I always have a reference to the current page that the user is on?
Here is the basic code so far.
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch({headless: false});
  let page = await browser.newPage();
  await page.goto('oraclecloudloginurl', {waitUntil: 'networkidle2'});
  await page.type('#userid', 'USERNAME', {delay : 10});
  await page.type('#password', 'PASSWORD', {delay : 10});
  await page.waitForSelector('#btnActive', {enabled : true});
  page.click('#btnActive', {delay : 1000}).then(() => console.log('Login Button Clicked'));
  await page.waitForNavigation();
  await page.screenshot({path: 'home.png'});
  const html = await page.content();
  await fs.writeFileSync('home.html', html);
  await page.waitFor(10000);
  await browser.close();
})();
With this, the user logs in fine and the home page is displayed. But I get an error after that when I try to screenshot the homepage and render the HTML content. It seems the page has changed and I am referring to the old page. How can I refer to the context of the current page?
Below is the error:
(node:14393) UnhandledPromiseRejectionWarning: Error: Protocol error (Runtime.callFunctionOn): Cannot find context with specified id undefined
This code looks problematic for two reasons:
page.click('#btnActive', {delay : 1000}).then(() => console.log('Login Button Clicked'));
await page.waitForNavigation();
The first problem is that the page.click().then() spins off a totally separate promise chain:
page.click() --> .then(...)
|
v
page.waitForNavigation()
|
v
page.screenshot(...)
|
v
...
This means the click that triggers the navigation and the navigation are running in parallel and can never be rejoined into the same promise chain. The usual solution here is to tie them into the same promise chain:
// Note: this code is still broken; keep reading!
await page.click('#btnActive', {delay : 1000});
console.log('Login Button Clicked');
await page.waitForNavigation();
This adheres to the principle of not mixing then and await unless you have good reason to.
But the above code is still broken because Puppeteer requires the waitForNavigation() promise to be set before the event that triggers navigation is fired. The fix is:
await Promise.all([
page.waitForNavigation(),
page.click('#btnActive', {delay : 1000}),
]);
or
const navPromise = page.waitForNavigation(); // no await
await page.click('#btnActive', {delay : 1000});
await navPromise;
Following this pattern, Puppeteer should no longer be confused about its context.
Minor notes:
'networkidle2' is slow and probably unnecessary, especially for a page you're soon going to be navigating away from. I'd default to 'domcontentloaded'.
await page.waitFor(10000); is deprecated, as is its replacement page.waitForTimeout() in newer Puppeteer versions, although I realize this is an older post.
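Putting the fixes together, the click-plus-navigation step from the question might become (a sketch using the question's selector; the function name is made up):

```javascript
// Sketch: tie the click and the navigation wait into one promise chain,
// registering waitForNavigation *before* the click that triggers it.
async function clickLoginAndWait(page) {
  const [response] = await Promise.all([
    page.waitForNavigation({ waitUntil: 'domcontentloaded' }),
    page.click('#btnActive', { delay: 1000 }),
  ]);
  return response; // the navigation response, now in the same chain
}

// Usage sketch:
// await clickLoginAndWait(page);
// await page.screenshot({ path: 'home.png' }); // now sees the homepage
```

Because both promises are created before either settles, the navigation listener is guaranteed to be in place when the click fires, and subsequent calls like page.screenshot() run against the new page context.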
