Use different IP addresses in Puppeteer requests - javascript

I have multiple IP interfaces on my server and I can't find a way to force Puppeteer to use them for its requests.
I am using Node v10.15.0 and Puppeteer 1.11.0.

You can use the flag --netifs-to-ignore when launching the browser to specify which interfaces should be ignored by Chrome. Quote from the List of Chromium Command Line Switches:
--netifs-to-ignore: List of network interfaces to ignore. Ignored interfaces will not be used for network connectivity
You can use the argument like this when launching the browser:
const browser = await puppeteer.launch({
    args: ['--netifs-to-ignore=INTERFACE_TO_IGNORE']
});

Maybe this will help. You can see the full code here
'use strict';

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        // Launch chromium using a proxy server on port 9876.
        // More on proxying:
        // https://www.chromium.org/developers/design-documents/network-settings
        args: ['--proxy-server=127.0.0.1:9876']
    });
    const page = await browser.newPage();
    await page.goto('https://google.com');
    await browser.close();
})();

Related

External resources in Puppeteer with Chrome executable fail to load (net::ERR_EMPTY_RESPONSE)

I'm having issues using external resources in a Puppeteer job that I'm running with a full Chrome executable (not the default Chromium). Any help would be massively appreciated!
So for example, if I load a video from a public URL, it fails even though it works fine when I open it manually in the browser.
const videoElement = document.createElement('video');
videoElement.src = src;
videoElement.onloadedmetadata = function () {
    console.log(videoElement.duration);
};
Here's my Puppeteer call:
(async () => {
    const browser = await puppeteer.launch({
        args: [
            '--remote-debugging-port=9222',
            '--autoplay-policy=no-user-gesture-required',
            '--allow-insecure-localhost',
            '--proxy-server=http://localhost:9000',
            '--proxy-bypass-list=""',
            '--no-sandbox',
            '--disable-setuid-sandbox',
        ],
        executablePath:
            '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
    });
    const page = await browser.newPage();
    logConsole(page);
    await page.goto(`http://${hostname}/${path}`, {
        waitUntil: 'networkidle2',
    });
    await page.waitForSelector('#job-complete');
    console.log('Job complete!');
    await browser.close();
})();
Unlike many Puppeteer examples, the issue here isn't that my test doesn't wait long enough. The resources fail to load / return empty responses almost instantly.
It also doesn't appear to be an authentication issue - I reach my own server just fine.
Although I'm not running over HTTPS here, the URL works without SSL when I try it directly in the browser.
I should also mention that this is a React (CRA) website and I'm calling Puppeteer with Node.
I can see that at least 3 other external resources (non-video) also fail. Is there a flag or something I should be using that I'm missing? Thanks so much for any help!
In my case I had to use puppeteer-extra and puppeteer-extra-plugin-stealth:
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
I also found the following flags useful:
const browser = await puppeteer.launch({
    args: [
        '--disable-web-security',
        '--autoplay-policy=no-user-gesture-required',
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--remote-debugging-port=9222',
        '--allow-insecure-localhost',
    ],
    executablePath:
        '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
});
Finally, I found it necessary in a few cases to bypass CSP:
await page.setBypassCSP(true);
Please be careful using these rather insecure settings 😬
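A minimal sketch tying the pieces above together might look like this (the executablePath and target URL are placeholders, not values from the original question):
// Sketch: puppeteer-extra + stealth plugin, the launch flags above, and CSP bypass.
// Paths and URLs are placeholders; adjust to your environment.
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

(async () => {
    const browser = await puppeteer.launch({
        args: [
            '--disable-web-security',
            '--autoplay-policy=no-user-gesture-required',
            '--no-sandbox',
            '--disable-setuid-sandbox',
        ],
        // Adjust to wherever your Chrome executable lives.
        executablePath: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
    });
    const page = await browser.newPage();
    await page.setBypassCSP(true); // must be called before page.goto()
    await page.goto('https://example.com', { waitUntil: 'networkidle2' });
    await browser.close();
})();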

Can't use Brave Browser with Puppeteer

About a month ago I wrote a question asking whether it was possible to use Brave Browser with Puppeteer; the answer was yes, I tested it, and everything worked perfectly.
Today I tried to run the same code but I got the error ERROR: process "xxxxx" not found.
Any ideas about this issue?
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        executablePath: "C:/Program Files (x86)/BraveSoftware/Brave-Browser/Application/brave.exe",
        headless: false,
        devtools: false,
    });
    const page = await browser.newPage();
})();
You need to do at least two things to get Puppeteer working with Brave.
First, enable remote debugging in Brave: go to chrome://settings/privacy and enable Remote debugging.
Second, Brave doesn't like many of the default command-line arguments that Puppeteer passes, so you might want to ignore the default arguments.
(async () => {
    const browser = await puppeteer.launch({
        executablePath: "/Applications/Brave Browser.app/Contents/MacOS/Brave Browser",
        headless: false,
        ignoreDefaultArgs: true
    });
    const page = await browser.newPage();
    await page.goto("https://www.google.com");
})();

Is there any way in Puppeteer to get the exact data from the Chrome Network tab?

I'm attempting to use Puppeteer to navigate to a URL and extract the metrics from the Network tab in the Chrome developer tools. For example, navigating to this page in Chrome captures a total of 47 requests in the Network tab.
However, I'm trying to get these metrics using the following code:
import { Browser, Page } from "puppeteer";
const puppeteer = require('puppeteer');

async function run() {
    const browser: Browser = await puppeteer.launch({
        headless: false,
        defaultViewport: null
    });
    const page: Page = await browser.newPage();
    await page.goto("https://stackoverflow.com/questions/30455964/what-does-window-performance-getentries-mean");

    let performanceTiming = JSON.parse(
        await page.evaluate(() => JSON.stringify(window.performance.getEntries()))
    );
    console.log(performanceTiming);
}

run();
However, when I inspect the performanceTiming object, it only has 34 items in it.
Therefore, my questions are:
Why is there a difference between the number of requests shown by the Network tab and by performance.getEntries()?
Is it possible to get performance.getEntries() to show all of the requests instead of only a portion of them?
Is it possible for Puppeteer to get all of the data from the Network tab?
You can use the page.tracing feature.
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

await page.tracing.start({ categories: ['devtools.timeline'], path: "./tracing.json" });
await page.goto("https://stackoverflow.com/questions/30455964/what-does-window-performance-getentries-mean");
const tracing = JSON.parse(await page.tracing.stop());
The devtools.timeline category has many things to explore.
You can get all the requests by filtering on ResourceSendRequest:
tracing.traceEvents.filter(te => te.name === "ResourceSendRequest")
And all the responses by filtering on ResourceReceiveResponse:
tracing.traceEvents.filter(te => te.name === "ResourceReceiveResponse")
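As a rough sketch, here is how you might pull the request URLs out of that trace. The exact shape of args.data is not formally documented, so treat the field names as an assumption to verify against your own trace file:
// Hypothetical helper: list the URLs of all requests recorded in the trace.
// Assumes each ResourceSendRequest event carries its URL in args.data.url.
const requestUrls = tracing.traceEvents
    .filter(te => te.name === 'ResourceSendRequest')
    .map(te => (te.args && te.args.data) ? te.args.data.url : undefined)
    .filter(Boolean);

console.log(`${requestUrls.length} requests traced`);
console.log(requestUrls);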

How to use puppeteer to automate Amazon Connect CCP login?

I'm trying to use Puppeteer to automate the login process for our agents in Amazon Connect, however I can't get Puppeteer to finish loading the CCP login page. See code below:
const browser = await puppeteer.launch();
const page = await browser.newPage();
const url = 'https://ccalderon-reinvent.awsapps.com/connect/ccp#/';
await page.goto(url, {waitUntil: 'domcontentloaded'});
console.log(await page.content());
// console.log('waiting for username input');
// await page.waitForSelector('#wdc_username');
await browser.close();
I can never see the content of the page; it times out. Am I doing something wrong? If I launch the browser with { headless: false }, I can see the page never finishes loading.
Please note the same code works fine with https://www.github.com/login so it must be something specific to the source code of Connect's CCP.
In case you are from the future and having problems with Puppeteer for no apparent reason, try downgrading the Puppeteer version first and see if the issue persists.
This seems to be a bug in Chromium development version 73.0.3679.0. The error log said it could not load a specific script, even though the script could still be loaded manually.
The solution:
Using Puppeteer version 1.11.0 solved this issue. If you want to stay on Puppeteer 1.12.2 but use a different Chromium revision, you can use the executablePath argument.
Here are the Chromium versions bundled with each Puppeteer release (at the time of this answer):
Chromium 73.0.3679.0 - Puppeteer v1.12.2
Chromium 72.0.3582.0 - Puppeteer v1.11.0
Chromium 71.0.3563.0 - Puppeteer v1.9.0
Chromium 70.0.3508.0 - Puppeteer v1.7.0
Chromium 69.0.3494.0 - Puppeteer v1.6.2
I checked my locally installed Chrome, which was loading the page correctly:
$(which google-chrome) --version
Google Chrome 72.0.3626.119
Note: the Puppeteer team suggests in their docs specifically using the Chromium bundled with the package (most likely the latest development version) instead of other revisions.
I also edited the code a little so it finishes loading when all network requests are done and the username input is visible.
const puppeteer = require("puppeteer");

(async () => {
    const browser = await puppeteer.launch({
        headless: false,
        executablePath: "/usr/bin/google-chrome"
    });
    const page = await browser.newPage();
    const url = "https://ccalderon-reinvent.awsapps.com/connect/ccp#/";
    await page.goto(url, { waitUntil: "networkidle0" });

    console.log("waiting for username input");
    await page.waitForSelector("#wdc_username", { visible: true });
    await page.screenshot({ path: "example.png" });
    await browser.close();
})();
The specific revision number can be obtained in many ways; one is to check the package.json of the puppeteer package. The URL for 1.11.0 is:
https://github.com/GoogleChrome/puppeteer/blob/v1.11.0/package.json
If you'd like to automate downloading a Chromium revision, you can use browserFetcher to fetch a specific revision.
const browserFetcher = puppeteer.createBrowserFetcher();
const revisionInfo = await browserFetcher.download('609904'); // chrome 72 is 609904
const browser = await puppeteer.launch({executablePath: revisionInfo.executablePath})

How to use proxy in puppeteer and headless Chrome?

Please tell me how to properly use a proxy with Puppeteer and headless Chrome. My attempt below does not work.
const puppeteer = require('puppeteer');

(async () => {
    const argv = require('minimist')(process.argv.slice(2));

    const browser = await puppeteer.launch({
        args: ["--proxy-server =${argv.proxy}", "--no-sandbox", "--disable-setuid-sandbox"]
    });
    const page = await browser.newPage();
    await page.setJavaScriptEnabled(false);
    await page.setUserAgent(argv.agent);
    await page.setDefaultNavigationTimeout(20000);

    try {
        await page.goto(argv.page);
        const bodyHTML = await page.evaluate(() => new XMLSerializer().serializeToString(document));
        const body = bodyHTML.replace(/\r|\n/g, '');
        console.log(body);
    } catch (e) {
        console.log(e);
    }

    await browser.close();
})();
You can find an example of using a proxy here:
'use strict';

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        // Launch chromium using a proxy server on port 9876.
        // More on proxying:
        // https://www.chromium.org/developers/design-documents/network-settings
        args: ['--proxy-server=127.0.0.1:9876']
    });
    const page = await browser.newPage();
    await page.goto('https://google.com');
    await browser.close();
})();
It's possible with puppeteer-page-proxy.
It supports setting a proxy for an entire page, or if you like, it can set a different proxy for each request. And yes, it works both in headless and headful Chrome.
First install it:
npm i puppeteer-page-proxy
Then require it:
const useProxy = require('puppeteer-page-proxy');
Using it is easy.
Set proxy for an entire page:
await useProxy(page, 'http://127.0.0.1:8000');
If you want a different proxy for each request, you can simply do this:
await page.setRequestInterception(true);
page.on('request', req => {
    useProxy(req, 'socks5://127.0.0.1:9000');
});
Then, if you want to be sure that your page's IP has changed, you can look it up:
const data = await useProxy.lookup(page);
console.log(data.ip);
It supports http, https, socks4 and socks5 proxies, and it also supports authentication if that is needed:
const proxy = 'http://login:pass@127.0.0.1:8000';
Repository:
https://github.com/Cuadrix/puppeteer-page-proxy
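Putting those calls together, a minimal sketch might look like this (the proxy address and target URL are placeholders):
// Minimal sketch of puppeteer-page-proxy usage, based on the calls above.
// The proxy URL and target URL are placeholders; substitute your own.
const puppeteer = require('puppeteer');
const useProxy = require('puppeteer-page-proxy');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Route all of this page's requests through the proxy.
    await useProxy(page, 'http://127.0.0.1:8000');

    await page.goto('https://example.com');

    // Verify that the page's outgoing IP actually changed.
    const data = await useProxy.lookup(page);
    console.log(data.ip);

    await browser.close();
})();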
Do not use
"--proxy-server =${argv.proxy}"
This is a normal string, not a template literal, so argv.proxy will not be interpolated. Use backticks (`) instead of double quotes, and drop the stray space after --proxy-server:
`--proxy-server=${argv.proxy}`
Check this string before you pass it to the launch function to make sure it is correct, and you may want to visit http://api.ipify.org/ in that browser to confirm the proxy works.
If you want to use a different proxy per page, you can use https-proxy-agent or http-proxy-agent to proxy requests on a per-page basis, as sketched below.
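A rough sketch of that per-page approach, assuming request interception plus node-fetch and https-proxy-agent (the package choices, proxy address, and interception logic are assumptions, not from the original answer):
// Sketch: per-page proxying by intercepting requests and re-issuing them
// through a proxy agent. Assumes node-fetch v2 and https-proxy-agent v5;
// http:// targets would need http-proxy-agent instead.
const puppeteer = require('puppeteer');
const fetch = require('node-fetch');
const HttpsProxyAgent = require('https-proxy-agent');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const agent = new HttpsProxyAgent('http://127.0.0.1:8000'); // placeholder proxy

    await page.setRequestInterception(true);
    page.on('request', async request => {
        try {
            // Re-issue the request through the proxy instead of letting Chrome send it.
            const response = await fetch(request.url(), {
                method: request.method(),
                headers: request.headers(),
                body: request.postData(),
                agent,
            });
            await request.respond({
                status: response.status,
                headers: Object.fromEntries(response.headers.entries()),
                body: await response.buffer(),
            });
        } catch (err) {
            await request.abort();
        }
    });

    await page.goto('http://api.ipify.org/'); // check which IP the page reports
    await browser.close();
})();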
You can use https://github.com/gajus/puppeteer-proxy to set a proxy either for an entire page or for specific requests only, e.g.:
import puppeteer from 'puppeteer';
import {
    createPageProxy,
} from 'puppeteer-proxy';

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    const pageProxy = createPageProxy({
        page,
        proxyUrl: 'http://127.0.0.1:3000',
    });

    await page.setRequestInterception(true);

    page.once('request', async (request) => {
        await pageProxy.proxyRequest(request);
    });

    await page.goto('https://example.com');
})();
To skip the proxy, simply call request.continue() conditionally, as sketched below.
Using puppeteer-proxy, a Page can have multiple proxies.
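For example, a small sketch of conditional proxying, building on the pageProxy from the snippet above (the hostname check is just an illustrative assumption):
// Sketch: only proxy requests to a particular host, let everything else go out directly.
await page.setRequestInterception(true);

page.on('request', async (request) => {
    if (new URL(request.url()).hostname === 'example.com') {
        await pageProxy.proxyRequest(request); // route through the proxy
    } else {
        await request.continue();              // bypass the proxy
    }
});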
You can find a proxy list on Private Proxy and use it with the code below:
const puppeteer = require('puppeteer');
const proxyChain = require('proxy-chain');

(async () => {
    // Proxies list from Private Proxy
    const proxiesList = [
        'http://skrll:au4....',
        'http://skrll:au4....',
        'http://skrll:au4....',
        'http://skrll:au4....',
        'http://skrll:au4....',
    ];
    const oldProxyUrl = proxiesList[Math.floor(Math.random() * proxiesList.length)];
    const newProxyUrl = await proxyChain.anonymizeProxy(oldProxyUrl);

    const browser = await puppeteer.launch({
        headless: true,
        ignoreHTTPSErrors: true,
        args: [
            `--proxy-server=${newProxyUrl}`,
            `--ignore-certificate-errors`,
            `--no-sandbox`,
            `--disable-setuid-sandbox`
        ]
    });
    const page = await browser.newPage();
    await page.authenticate();

    //
    // your code here
    //

    // close proxy chain
    await proxyChain.closeAnonymizedProxy(newProxyUrl, true);
})();
You can find the full post here
I see https://github.com/Cuadrix/puppeteer-page-proxy and https://github.com/gajus/puppeteer-proxy recommended above, and I want to emphasize that these two packages are technically not using the Chrome instance to perform the actual network request. Here is what they do instead:
when the user code initiates a network request in Puppeteer, e.g. calls page.goto(), the proxy package intercepts this outgoing HTTP request and pauses it
the proxy package passes the request to another network library (Got)
Got performs the actual network request, through the specified proxy
Got now needs to pass all the network response data back to Puppeteer. This means the proxy package has to manage a bunch of interesting things, like copying cookie headers from the raw HTTP set-cookie format to Puppeteer's format
While this might be a viable approach in a lot of cases, you need to understand that it changes your HTTP request's TLS fingerprint, so your request might get blocked by some websites, particularly those using Cloudflare bot detection (because the website now sees that your request originates from Node.js, not from Chrome).
An alternative method of setting a proxy in Puppeteer.
Chrome launch args are good if you want to use one proxy for all websites. What if you still want a single Chrome instance to use multiple proxies, but you don't want to use the two packages mentioned above?
The createIncognitoBrowserContext Puppeteer function comes to the rescue:
// Create a new incognito browser context with its own proxy
// (the option is named proxyServer in Puppeteer v9+).
const context = await browser.createIncognitoBrowserContext({ proxyServer: 'http://localhost:2022' });

// Create a new page inside the context.
const page = await context.newPage();

// Authenticate against the proxy using basic browser auth.
await page.authenticate({ username: user, password: password });

// ... do stuff with page ...
await page.goto('https://example.com');

// Dispose of the context once it's no longer needed.
await context.close();
proxy-chain package
If your proxy requires auth and you don't like the page.authenticate call, the proxy can be set up using the proxy-chain npm package.
proxy-chain launches an intermediate proxy on your localhost, which allows you to do some nice things. Read more on the technical details of the proxy-chain implementation: https://pixeljets.com/blog/how-to-set-proxy-in-puppeteer
In my experience, all of the above fail for different reasons.
I find that applying the proxy to the entire OS works every time; I get no proxy failures. This strategy works on both Windows and Linux.
This way, I get zero Puppeteer bot failures. Bear in mind, I am spinning up 7000 bots per server, and I am running this on 7 servers.
