I am using puppeteer to load a web page.
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.setRequestInterception(true);
page.on('request', (request) => {
  console.log(request.url());
  request.continue();
  // ...
});
await page.goto(
'https://www.onthehouse.com.au/property-for-rent/vic/aspendale-gardens-3195',
{ waitUntil: 'networkidle2' },
);
I set request interception to true and log all request URLs. The requests I log are far fewer than the requests I see when I load the URL in the Chrome browser.
For example, there is at least one request, https://www.onthehouse.com.au/odin/api/compositeSearch, that can be found in the Chrome DevTools Network tab but never shows up with the code above.
How can I log all requests?
I did some benchmarking between 4 variants of this script, and for me the results were the same. Note: I ran multiple tests; sometimes, due to local network speed, there were fewer calls, but after 2-3 tries Puppeteer was able to catch all requests.
On the https://www.onthehouse.com.au/property-for-rent/vic/aspendale-gardens-3195 page there are some async and defer scripts; my hypothesis was that these may load differently depending on the Puppeteer settings, or on whether the function passed to page.on is async or sync.
Note 2: I tested another page, not the one in the original question, as I already needed a VPN to visit this Australian website (easy from Chrome, more effort with Puppeteer); the page I tested has a similarly large number of analytics and tracking requests.
Baseline from Chrome's Network tab: 28 calls
First I visited the test page in Chrome; the result was 28 calls on the Network tab.
Case 1: Original (sync, networkidle2)
await page.setRequestInterception(true);
page.on('request', (request) => {
  console.log(request.url());
  request.continue();
  // ...
});
await page.goto(
'https://www.onthehouse.com.au/property-for-rent/vic/aspendale-gardens-3195',
{ waitUntil: 'networkidle2' },
);
Result: 28 calls
Case 2: Async, networkidle2
The page.on callback is an async function, so we can await request.url() inside it.
await page.setRequestInterception(true);
page.on('request', async request => {
  console.log(await request.url());
  request.continue();
  // ...
});
await page.goto(
'https://www.onthehouse.com.au/property-for-rent/vic/aspendale-gardens-3195',
{ waitUntil: 'networkidle2' },
);
Result: 28 calls
Case 3: Sync, networkidle0
Similar to the original, but with networkidle0.
await page.setRequestInterception(true);
page.on('request', (request) => {
  console.log(request.url());
  request.continue();
  // ...
});
await page.goto(
'https://www.onthehouse.com.au/property-for-rent/vic/aspendale-gardens-3195',
{ waitUntil: 'networkidle0' },
);
Result: 28 calls
Case 4: Async, networkidle0
The page.on callback is an async function, so we can await request.url(). Plus networkidle0.
await page.setRequestInterception(true);
page.on('request', async request => {
  console.log(await request.url());
  request.continue();
  // ...
});
await page.goto(
'https://www.onthehouse.com.au/property-for-rent/vic/aspendale-gardens-3195',
{ waitUntil: 'networkidle0' },
);
Result: 28 calls
As there was no difference in the number of requests between the Network tab and Puppeteer, regardless of how we launch Puppeteer or how we collect the requests, my ideas are:
Either you have accepted the cookie consent in your Chrome, so the Network tab shows more requests (these requests only happen after the cookies are accepted). You can accept their cookie policy with a simple navigation, so after you've navigated inside their page there will be more requests on the Network tab immediately.
[...] By continuing to use our website, you consent to cookies being used.
Solution: Do not directly visit the desired page, but navigate there through clicks, so your Puppeteer's Chromium will accept the cookie consent, hence you will have all analytics requests as well.
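A rough sketch of that idea (the selectors here are placeholders I made up, not taken from the actual site):
// Land on the home page first and accept the consent there, then go on to the
// listing page, so any consent-gated analytics fire the same way as for a real visitor.
await page.goto('https://www.onthehouse.com.au/', { waitUntil: 'networkidle2' });
// Hypothetical consent button selector; inspect the real banner to find yours:
const consentButton = await page.$('#accept-cookies');
if (consentButton) {
  await consentButton.click();
}
await page.goto(
  'https://www.onthehouse.com.au/property-for-rent/vic/aspendale-gardens-3195',
  { waitUntil: 'networkidle2' },
);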
A Chrome extension/addon affects the number of requests on the page.
Advice: Check your Puppeteer requests against an incognito Chrome's Network tab and make sure all extensions/addons are disabled.
Additionally, if you are only interested in XHR requests, you may want to use request.resourceType() in your code to differentiate them from the other request types (see the docs).
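For example, a minimal sketch that only logs XHR/fetch requests and lets everything else pass through untouched:
await page.setRequestInterception(true);
page.on('request', (request) => {
  const type = request.resourceType(); // 'document', 'script', 'xhr', 'fetch', ...
  if (type === 'xhr' || type === 'fetch') {
    console.log(type, request.url());
  }
  request.continue();
});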
Related
I want to scrape a site, but the problem with that site is this: the content loads in about 1 second, yet a loader in the navbar keeps spinning for 30 seconds to 1 minute, so my Puppeteer scraper keeps waiting for the navbar loader to stop.
Is there any way to run window.stop() after a certain timeout?
Here is my code (from GitHub):
const checkBook = async () => {
  await page.goto(`https://wattpad.com/story/${bookid}`, {
    waitUntil: 'domcontentloaded',
  });
  const is404 = await page.$('#story-404-wrapper');
  if (is404) {
    socket.emit('error', {
      message: 'Story not found',
    });
    await browser.close();
    return {
      error: true,
    };
  }
  storyLastUpdated = await page
    .$eval(
      '.table-of-contents__last-updated strong',
      (ele: any) => ele.textContent,
    )
    .then((date: string) => getDate(date));
};
Similar approach to Marcel's answer. The following will do the job:
page.goto(url)
await page.waitForTimeout(1000)
await page.evaluate(() => window.stop())
// your scraper script goes here
await browser.close()
Notes:
page.goto() is NOT awaited, so you save time compared to waiting for the DOMContentLoaded or load events...
...but since goto is not awaited, you need to make sure your script can start working with the DOM. You can use either page.waitForTimeout() or page.waitForSelector() (see the sketch after these notes).
you have to execute window.stop() within page.evaluate(), so you can avoid this kind of error: Error: Navigation failed because browser has disconnected!
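A small sketch of the waitForSelector variant ('#content' is a placeholder for an element you know appears once the part you care about has loaded):
page.goto(url); // intentionally not awaited
await page.waitForSelector('#content', { timeout: 10000 });
await page.evaluate(() => window.stop()); // stop the never-ending navbar loader
// your scraper script goes here
await browser.close();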
You could drop the
waitUntil: 'domcontentloaded',
option in favor of a timeout, as documented here: https://github.com/puppeteer/puppeteer/blob/v14.1.0/docs/api.md#pagegotourl-options
or set the timeout to zero and instead use one of the page.waitFor... methods, like this:
await page.waitForTimeout(30000);
I am playing around with Puppeteer for Twitter automation. I have discovered that the navigation will often time out. For example:
puppeteer.launch({ headless: false }).then(async browser => {
  try {
    const page = await browser.newPage();
    await page.goto('https://twitter.com/home');
    // This sometimes fails with a timeout error
    await page.waitForNavigation({ waitUntil: 'networkidle2' });
    if (page.url() === 'https://twitter.com/login') {
      await login(page);
    }
  } catch (error) {
    console.log(error);
  }
});
The page just hangs and I get a TimeoutError.
I have tried changing the waitUntil parameter and it doesn't seem to make a difference. I have also set await page.setDefaultNavigationTimeout(0); this of course stops the error from appearing, but the page just never responds.
Has anyone else faced this problem? Is Twitter detecting me? Or have I missed something? I am using puppeteer-extra with puppeteer-extra-plugin-stealth using the default settings.
Is there any way to make page.evaluate() wait until there are no more network requests for at least 500ms (like page.goto() waits for networkidle0)?
For example:
await page.evaluate('window.location = "https://example.com"');
// listen to the network requests until there are no requests fired for at least 500 ms
waitUntil only makes sense in a navigation context. If you mean having a waitUntil option for evaluate() itself, the answer is no. However, if you trigger a navigation using evaluate(), you can use waitForNavigation():
await page.evaluate(() => window.location = "https://example.com")
await page.waitForNavigation({waitUntil: 'networkidle0'});
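One caveat (my addition, not from the answer above): if the navigation can complete before the second line runs, the pattern above may miss it, so it can be safer to start waiting before triggering the navigation:
await Promise.all([
  page.waitForNavigation({ waitUntil: 'networkidle0' }),
  page.evaluate(() => { window.location = 'https://example.com'; }),
]);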
I am trying to log into my Gmail with Puppeteer to lower the risk of reCAPTCHA.
Here is my code:
await page.goto('https://accounts.google.com/AccountChooser?service=mail&continue=https://mail.google.com/mail/', {timeout: 60000})
  .catch(function (error) {
    throw new Error('TimeoutBrows');
  });

await page.waitForSelector('#identifierId', { visible: true });
await page.type('#identifierId', 'myemail');

await Promise.all([
  page.click('#identifierNext'),
  page.waitForSelector('.whsOnd', { visible: true })
]);

await page.type('#password .whsOnd', "mypassword");
await page.click('#passwordNext');
await page.waitFor(5000);
but I always end up with this message.
I even tried to just open the login window with Puppeteer and fill in the login form manually myself, but even that failed.
Am I missing something?
When I look into the console, there is a failed AJAX call just after login:
Request URL: https://accounts.google.com/_/signin/challenge?hl=en&TL=APDPHBCG5lPol53JDSKUY2mO1RzSwOE3ZgC39xH0VCaq_WHrJXHS6LHyTJklSkxd&_reqid=464883&rt=j
Request Method: POST
Status Code: 401
Remote Address: 216.58.213.13:443
Referrer Policy: no-referrer-when-downgrade
)]}'
[[["er",null,null,null,null,401,null,null,null,16]
,["e",2,null,null,81]
]]
I've inspected your code and it seems to be correct, apart from some selectors. Also, I had to add a couple of timeouts in order to make it work. However, I failed to reproduce your issue, so I'll just post the code that worked for me.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://accounts.google.com/AccountChooser?service=mail&continue=https://mail.google.com/mail/', {timeout: 60000})
    .catch(function (error) {
      throw new Error('TimeoutBrows');
    });
  await page.screenshot({ path: './1.png' });
  // ...
})();
Please note that I run the browser in normal, not headless, mode. If you take a look at the screenshot at this point, you will see that it is the correct Google login form.
The rest of the code is responsible for entering the password:
const puppeteer = require('puppeteer');

(async () => {
  // ...
  await page.waitForSelector('#identifierId', { visible: true });
  await page.type('#identifierId', 'my#email');
  await Promise.all([
    page.click('#identifierNext'),
    page.waitForSelector('.whsOnd', { visible: true })
  ]);
  await page.waitForSelector('input[name=password]', { visible: true });
  await page.type('input[name=password]', "my.password");
  await page.waitForSelector('#passwordNext', { visible: true });
  await page.waitFor(1000);
  await page.click('#passwordNext');
  await page.waitFor(5000);
})();
Please also note a few differences from your code: the selector for the password field is different, and I had to add await page.waitForSelector('#passwordNext', {visible: true}); and a small timeout after that so the button could be clicked successfully.
I've tested all the code above and it worked successfully. Please let me know if you still need some help or are facing trouble with my example.
The purpose of the question is to log in to Gmail. I will share another method that does not involve filling in the email and password fields in the Puppeteer script,
and it works in headless: true mode.
Method
Log in to your Gmail using a normal browser (preferably Google Chrome).
Export all cookies for the Gmail tab.
Use page.setCookie to import the cookies into your Puppeteer instance.
Log in to Gmail
This should be a no-brainer.
Export all cookies
I will use an extension called Edit This Cookie; however, you can use other extensions or manual methods to extract the cookies.
Click the browser icon and then click the Export button.
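If you prefer not to use an extension, one possible alternative (my assumption, not part of the original method) is to log in once in a headful Puppeteer run and dump the cookies yourself:
const fs = require('fs');
// ... after logging in manually in the headful browser window ...
const cookies = await page.cookies();
fs.writeFileSync('./cookies.json', JSON.stringify(cookies, null, 2));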
Import cookies to puppeteer instance
We will save the cookies in a cookies.json file and then import them using the page.setCookie function before navigation. That way, when the Gmail page loads, it will have the login information right away.
The code might look like this.
const puppeteer = require("puppeteer");
const cookies = require('./cookies.json');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Set cookies here, right after creating the instance
  await page.setCookie(...cookies);

  // do the navigation
  await page.goto("https://mail.google.com/mail/u/0/#search/stackoverflow+survey", {
    waitUntil: "networkidle2",
    timeout: 60000
  });
  await page.screenshot({ path: "example.png" });
  await browser.close();
})();
Result:
Notes:
It was not asked, but I should mention the following for future readers.
Cookie expiration: Cookies might be short-lived, expire shortly afterwards, or behave differently on a different device. Logging out on your original device will log you out of the Puppeteer session as well, since it shares the cookies.
Two-factor: I am not yet sure about 2FA authentication. It did not ask me about 2FA, probably because I logged in from the same device.
Please tell me how to properly use a proxy with Puppeteer and headless Chrome. My approach does not work.
const puppeteer = require('puppeteer');

(async () => {
  const argv = require('minimist')(process.argv.slice(2));
  const browser = await puppeteer.launch({ args: ["--proxy-server =${argv.proxy}", "--no-sandbox", "--disable-setuid-sandbox"] });
  const page = await browser.newPage();
  await page.setJavaScriptEnabled(false);
  await page.setUserAgent(argv.agent);
  await page.setDefaultNavigationTimeout(20000);
  try {
    await page.goto(argv.page);
    const bodyHTML = await page.evaluate(() => new XMLSerializer().serializeToString(document));
    body = bodyHTML.replace(/\r|\n/g, '');
    console.log(body);
  } catch (e) {
    console.log(e);
  }
  await browser.close();
})();
You can find an example of using a proxy here:
'use strict';

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    // Launch chromium using a proxy server on port 9876.
    // More on proxying:
    // https://www.chromium.org/developers/design-documents/network-settings
    args: [ '--proxy-server=127.0.0.1:9876' ]
  });
  const page = await browser.newPage();
  await page.goto('https://google.com');
  await browser.close();
})();
It's possible with puppeteer-page-proxy.
It supports setting a proxy for an entire page, or if you like, it can set a different proxy for each request. And yes, it works both in headless and headful Chrome.
First install it:
npm i puppeteer-page-proxy
Then require it:
const useProxy = require('puppeteer-page-proxy');
Using it is easy.
Set proxy for an entire page:
await useProxy(page, 'http://127.0.0.1:8000');
If you want a different proxy for each request, then you can simply do this:
await page.setRequestInterception(true);
page.on('request', req => {
  useProxy(req, 'socks5://127.0.0.1:9000');
});
Then, if you want to be sure that your page's IP has changed, you can look it up:
const data = await useProxy.lookup(page);
console.log(data.ip);
It supports http, https, socks4 and socks5 proxies, and it also supports authentication if that is needed:
const proxy = 'http://login:pass@127.0.0.1:8000'
Repository:
https://github.com/Cuadrix/puppeteer-page-proxy
Do not use
"--proxy-server =${argv.proxy}"
This is a normal string instead of a template literal;
use ` (backticks) instead of ":
`--proxy-server=${argv.proxy}`
otherwise argv.proxy will not be substituted (note also that there should be no space before the =).
Check this string before you pass it to the launch function to make sure it's correct,
and you may want to visit http://api.ipify.org/ in that browser to make sure the proxy works normally.
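A quick sketch of that check, combined with the corrected flag (argv.proxy comes from the question's minimist setup):
const browser = await puppeteer.launch({
  args: [`--proxy-server=${argv.proxy}`, '--no-sandbox', '--disable-setuid-sandbox'],
});
const page = await browser.newPage();
await page.goto('http://api.ipify.org/');
// Should print the proxy's IP, not your own:
console.log(await page.evaluate(() => document.body.innerText));
await browser.close();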
If you want to use a different proxy per page, try using https-proxy-agent or http-proxy-agent to proxy the requests for each page.
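A rough sketch of that per-page idea (my interpretation, simplified to GET requests over HTTPS with no header forwarding; the proxy URL is a placeholder and the https-proxy-agent import style depends on the package version):
const https = require('https');
const { HttpsProxyAgent } = require('https-proxy-agent');

const agent = new HttpsProxyAgent('http://127.0.0.1:8000');

await page.setRequestInterception(true);
page.on('request', (request) => {
  // Let anything we can't handle here go out directly, without the proxy.
  if (request.method() !== 'GET' || !request.url().startsWith('https://')) {
    return request.continue();
  }
  // Re-issue the request through the proxy agent and hand the result back to the page.
  https
    .get(request.url(), { agent }, (res) => {
      const chunks = [];
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('end', () =>
        request.respond({
          status: res.statusCode,
          contentType: res.headers['content-type'],
          body: Buffer.concat(chunks),
        }),
      );
    })
    .on('error', () => request.abort());
});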
You can use https://github.com/gajus/puppeteer-proxy to set a proxy either for the entire page or for specific requests only, e.g.:
import puppeteer from 'puppeteer';
import {
  createPageProxy,
} from 'puppeteer-proxy';

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const pageProxy = createPageProxy({
    page,
    proxyUrl: 'http://127.0.0.1:3000',
  });

  await page.setRequestInterception(true);

  page.once('request', async (request) => {
    await pageProxy.proxyRequest(request);
  });

  await page.goto('https://example.com');
})();
To skip the proxy, simply call request.continue() conditionally.
Using puppeteer-proxy, a page can have multiple proxies.
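For example, a small sketch that proxies only requests to a particular host and lets everything else through directly (the host check is a placeholder):
page.on('request', async (request) => {
  if (request.url().includes('example.com')) {
    await pageProxy.proxyRequest(request); // goes through the proxy
  } else {
    await request.continue(); // bypasses the proxy
  }
});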
You can find a proxy list on Private Proxy and use it with the code below:
const puppeteer = require('puppeteer');
const proxyChain = require('proxy-chain');

(async () => {
  // Proxies list from Private Proxy
  const proxiesList = [
    'http://skrll:au4....',
    'http://skrll:au4....',
    'http://skrll:au4....',
    'http://skrll:au4....',
    'http://skrll:au4....',
  ];
  // Pick a random proxy and wrap it in a local anonymized proxy
  const oldProxyUrl = proxiesList[Math.floor(Math.random() * proxiesList.length)];
  const newProxyUrl = await proxyChain.anonymizeProxy(oldProxyUrl);

  const browser = await puppeteer.launch({
    headless: true,
    ignoreHTTPSErrors: true,
    args: [
      `--proxy-server=${newProxyUrl}`,
      `--ignore-certificate-errors`,
      `--no-sandbox`,
      `--disable-setuid-sandbox`
    ]
  });
  const page = await browser.newPage();
  await page.authenticate();

  //
  // your code here
  //

  // close proxy chain
  await proxyChain.closeAnonymizedProxy(newProxyUrl, true);
})();
You can find the full post here
I see https://github.com/Cuadrix/puppeteer-page-proxy and https://github.com/gajus/puppeteer-proxy recommended above, and I want to emphasize that these two packages are technically not using the Chrome instance to perform the actual network request; here is what they do instead:
when the user code initiates a network request in Puppeteer, e.g. calls page.goto(), the proxy package intercepts this outgoing HTTP request and pauses it
the proxy package passes the request to another network library (Got)
Got performs the actual network request, through the specified proxy
Got now needs to pass all the network response data back to Puppeteer. This means a bunch of interesting things the proxy package now has to manage, like copying cookie headers from the raw HTTP set-cookie format to the Puppeteer format
While this might be a viable approach for a lot of cases, you need to understand that this changes your HTTP request TLS fingerprint so your HTTP request might get blocked by some websites, particularly the ones which are using Cloudflare bot detection (because the website now sees that your request originates from Node.js, not from Chrome).
Alternative method of setting a proxy in Puppeteer.
Chrome launch args are good if you want to use one proxy for all websites. What if you still want to have one Chrome instance use multiple proxies, but you don't want to use the two packages mentioned above?
The createIncognitoBrowserContext Puppeteer function comes to the rescue:
// Create a new incognito browser context
const context = await browser.createIncognitoBrowserContext({ proxyServer: 'http://localhost:2022' });
// Create a new page inside context.
const page = await context.newPage();
// authenticate in proxy using basic browser auth
await page.authenticate({username:user, password:password});
// ... do stuff with page ...
await page.goto('https://example.com');
// Dispose context once it's no longer needed.
await context.close();
proxy-chain package
If your proxy requires auth, and you don't like the page.authenticate call, the proxy might be set up using the proxy-chain npm package.
proxy-chain launches an intermediate proxy on your localhost, which allows you to do some nice things. Read more on the technical details of the proxy-chain implementation: https://pixeljets.com/blog/how-to-set-proxy-in-puppeteer
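A minimal sketch of that approach (the upstream proxy URL is a placeholder):
const puppeteer = require('puppeteer');
const proxyChain = require('proxy-chain');

(async () => {
  // Wrap the authenticated upstream proxy in a local, anonymous one.
  const upstream = 'http://user:password@proxy.example.com:8000';
  const localProxyUrl = await proxyChain.anonymizeProxy(upstream);

  const browser = await puppeteer.launch({
    args: [`--proxy-server=${localProxyUrl}`],
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');

  await browser.close();
  await proxyChain.closeAnonymizedProxy(localProxyUrl, true);
})();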
In my experience, all of the above fail for different reasons.
I find that applying the proxy to the entire OS works every time; I get no proxy failures. This strategy works on both Windows and Linux.
This way, I get zero Puppeteer bot failures. Bear in mind that I am spinning up 7000 bots per server, and I am running this on 7 servers.