How to use proxy in puppeteer and headless Chrome? - javascript

Please tell me how to properly use a proxy with Puppeteer and headless Chrome. My attempt below does not work.
const puppeteer = require('puppeteer');

(async () => {
  const argv = require('minimist')(process.argv.slice(2));
  const browser = await puppeteer.launch({args: ["--proxy-server =${argv.proxy}", "--no-sandbox", "--disable-setuid-sandbox"]});
  const page = await browser.newPage();
  await page.setJavaScriptEnabled(false);
  await page.setUserAgent(argv.agent);
  await page.setDefaultNavigationTimeout(20000);
  try {
    await page.goto(argv.page);
    const bodyHTML = await page.evaluate(() => new XMLSerializer().serializeToString(document));
    const body = bodyHTML.replace(/\r|\n/g, '');
    console.log(body);
  } catch (e) {
    console.log(e);
  }
  await browser.close();
})();

You can find an example of using a proxy here:
'use strict';
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    // Launch Chromium using a proxy server on port 9876.
    // More on proxying:
    // https://www.chromium.org/developers/design-documents/network-settings
    args: ['--proxy-server=127.0.0.1:9876']
  });
  const page = await browser.newPage();
  await page.goto('https://google.com');
  await browser.close();
})();

It's possible with puppeteer-page-proxy.
It supports setting a proxy for an entire page, or if you like, it can set a different proxy for each request. And yes, it works both in headless and headful Chrome.
First install it:
npm i puppeteer-page-proxy
Then require it:
const useProxy = require('puppeteer-page-proxy');
Using it is easy. To set a proxy for an entire page:
await useProxy(page, 'http://127.0.0.1:8000');
If you want a different proxy for each request, you can simply do this:
await page.setRequestInterception(true);
page.on('request', req => {
  useProxy(req, 'socks5://127.0.0.1:9000');
});
Then, if you want to be sure that your page's IP has changed, you can look it up:
const data = await useProxy.lookup(page);
console.log(data.ip);
It supports http, https, socks4, and socks5 proxies, and it also supports authentication if that is needed:
const proxy = 'http://login:pass@127.0.0.1:8000';
Repository:
https://github.com/Cuadrix/puppeteer-page-proxy
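Putting the snippets above together, a complete script might look like this sketch. The proxy addresses and target URL are placeholders, and the demo run is guarded behind an environment variable so the helper can be inspected without launching a browser:

```javascript
// Placeholder addresses; substitute your own proxy and target URL.
const PAGE_PROXY = 'http://127.0.0.1:8000';
const TARGET_URL = 'https://example.com';

async function main() {
  // required lazily so the file loads even before the packages are installed
  const puppeteer = require('puppeteer');
  const useProxy = require('puppeteer-page-proxy');

  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Route all of the page's traffic through one proxy.
  await useProxy(page, PAGE_PROXY);
  await page.goto(TARGET_URL);

  // Confirm that the page's external IP actually changed.
  const data = await useProxy.lookup(page);
  console.log(data.ip);

  await browser.close();
}

// Run with: RUN_DEMO=1 node proxy-page.js
if (process.env.RUN_DEMO) main().catch(console.error);
```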

Do not use
"--proxy-server =${argv.proxy}"
This is a normal string, not a template literal, so ${argv.proxy} will not be replaced. Use backticks (`) instead of double quotes, and also drop the stray space before the equals sign, which breaks the flag:
`--proxy-server=${argv.proxy}`
Check this string before you pass it to the launch function to make sure it's correct, and you may want to visit http://api.ipify.org/ in that browser to make sure the proxy works normally.
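To see the difference concretely (the proxy value here is a made-up stand-in for whatever minimist parsed):

```javascript
const argv = { proxy: '127.0.0.1:9876' }; // stand-in for minimist output

// Double quotes produce a literal string: ${...} is NOT interpolated.
const wrong = "--proxy-server =${argv.proxy}";

// Backticks interpolate, and there must be no space around "=".
const right = `--proxy-server=${argv.proxy}`;

console.log(wrong); // --proxy-server =${argv.proxy}
console.log(right); // --proxy-server=127.0.0.1:9876
```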

If you want to use a different proxy per page, try this: use https-proxy-agent or http-proxy-agent to proxy the requests for each page.

You can use https://github.com/gajus/puppeteer-proxy to set a proxy either for an entire page or for specific requests only, e.g.
import puppeteer from 'puppeteer';
import {
  createPageProxy,
} from 'puppeteer-proxy';

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const pageProxy = createPageProxy({
    page,
    proxyUrl: 'http://127.0.0.1:3000',
  });
  await page.setRequestInterception(true);
  // Use page.on (not page.once) so every request is proxied;
  // with interception enabled, unhandled requests would hang.
  page.on('request', async (request) => {
    await pageProxy.proxyRequest(request);
  });
  await page.goto('https://example.com');
})();
To skip the proxy for some requests, simply call request.continue() conditionally.
Using puppeteer-proxy, a page can have multiple proxies.
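A conditional handler might look like this sketch (the hostname rule and the names shouldProxy and pageProxy are illustrative, not from the library):

```javascript
// Illustrative predicate: proxy only third-party requests and let
// first-party ones go direct.
const shouldProxy = (url, firstPartyHost) =>
  new URL(url).hostname !== firstPartyHost;

// Inside your interception handler (pageProxy from createPageProxy above):
// page.on('request', async (request) => {
//   if (shouldProxy(request.url(), 'example.com')) {
//     await pageProxy.proxyRequest(request);
//   } else {
//     request.continue();
//   }
// });

console.log(shouldProxy('https://tracker.net/p.js', 'example.com'));  // true
console.log(shouldProxy('https://example.com/app.js', 'example.com')); // false
```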

You can find a list of proxies on Private Proxy and use it with the code below:
const puppeteer = require('puppeteer');
const proxyChain = require('proxy-chain');

(async () => {
  // Proxies list from Private Proxies
  const proxiesList = [
    'http://skrll:au4....',
    'http://skrll:au4....',
    'http://skrll:au4....',
    'http://skrll:au4....',
    'http://skrll:au4....',
  ];
  const oldProxyUrl = proxiesList[Math.floor(Math.random() * proxiesList.length)];
  // anonymizeProxy starts a local forwarding proxy that handles the
  // authentication, so no page.authenticate() call is needed
  const newProxyUrl = await proxyChain.anonymizeProxy(oldProxyUrl);
  const browser = await puppeteer.launch({
    headless: true,
    ignoreHTTPSErrors: true,
    args: [
      `--proxy-server=${newProxyUrl}`,
      '--ignore-certificate-errors',
      '--no-sandbox',
      '--disable-setuid-sandbox'
    ]
  });
  const page = await browser.newPage();
  //
  // your code here
  //
  await browser.close();
  // close the proxy chain
  await proxyChain.closeAnonymizedProxy(newProxyUrl, true);
})();
You can find the full post here

I see https://github.com/Cuadrix/puppeteer-page-proxy and https://github.com/gajus/puppeteer-proxy recommended above, and I want to emphasize that these two packages technically do not use the Chrome instance to perform the actual network request. Here is what they do instead:
when your code initiates a network request in Puppeteer, e.g. calls page.goto(), the proxy package intercepts this outgoing HTTP request and pauses it
the proxy package passes the request to another network library (Got)
Got performs the actual network request, through the specified proxy
Got now needs to pass all the network response data back to Puppeteer. This means the proxy package has to manage a bunch of interesting things, like copying cookie headers from the raw HTTP set-cookie format to the Puppeteer format
While this might be a viable approach for a lot of cases, you need to understand that it changes your HTTP request's TLS fingerprint, so your request might get blocked by some websites, particularly ones using Cloudflare bot detection (because the website now sees that your request originates from Node.js, not from Chrome).
Alternative method of setting a proxy in Puppeteer.
Chrome launch args are good if you want to use one proxy for all websites. What if you still want one Chrome instance to use multiple proxies, but without the two packages mentioned above?
The createIncognitoBrowserContext Puppeteer function comes to the rescue:
// Create a new incognito browser context with its own proxy
// (the option is named proxyServer in recent Puppeteer versions)
const context = await browser.createIncognitoBrowserContext({ proxyServer: 'http://localhost:2022' });
// Create a new page inside the context
const page = await context.newPage();
// Authenticate to the proxy using basic browser auth
await page.authenticate({ username: user, password: password });
// ... do stuff with page ...
await page.goto('https://example.com');
// Dispose of the context once it's no longer needed
await context.close();
proxy-chain package
If your proxy requires auth and you don't like the page.authenticate call, the proxy can be set using the proxy-chain npm package.
proxy-chain launches an intermediate proxy on your localhost, which allows it to do some nice things. Read more on the technical details of the proxy-chain implementation: https://pixeljets.com/blog/how-to-set-proxy-in-puppeteer

In my experience, all of the above fail for various reasons.
I find that applying the proxy to the entire OS works every time; I get no proxy failures. This strategy works on both Windows and Linux.
This way, I get zero Puppeteer bot failures. Bear in mind, I am spinning up 7,000 bots per server, running on 7 servers.

Related

In Puppeteer, is it possible to launch a headed browser instance, get the user authenticated, and then continue the session in a headless state? [duplicate]

I want to start a Chromium browser instance headless, do some automated operations, and then turn it visible before doing the rest of the stuff.
Is this possible to do using Puppeteer, and if it is, can you tell me how? And if it is not, is there any other framework or library for browser automation that can do this?
So far I've tried the following but it didn't work.
const browser = await puppeteer.launch({'headless': false});
browser.headless = true;
const page = await browser.newPage();
await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
await page.pdf({path: 'hn.pdf', format: 'A4'});
Short answer: it's not possible.
Chrome can only be started in either headless or non-headless mode. You have to specify the mode when you launch the browser, and it is not possible to switch at runtime.
What is possible, is to launch a second browser and reuse cookies (and any other data) from the first browser.
Long answer
You would assume that you could just reuse the data directory when calling puppeteer.launch, but this is currently not possible due to multiple bugs (#1268, #1270 in the puppeteer repo).
So the best approach is to save any cookies or local storage data that you need to share between the browser instances, and restore the data when you launch the browser. You then visit the website a second time. Be aware that any state the website holds in JavaScript variables will be lost when you recrawl the page.
Process
Summing up, the whole process should look like this (or vice versa for headless to headful):
Crawl in non-headless mode until you want to switch mode
Serialize cookies
Launch or reuse second browser (in headless mode)
Restore cookies
Revisit page
Continue crawling
As mentioned, this isn't currently possible since the headless switch occurs via Chromium launch flags.
I usually do this with userDataDir, which the Chromium docs describe as follows:
The user data directory contains profile data such as history, bookmarks, and cookies, as well as other per-installation local state.
Here's a simple example. This launches a browser headlessly, sets a local storage value on an arbitrary page, closes the browser, re-opens it headfully, retrieves the local storage value and prints it.
const puppeteer = require("puppeteer"); // ^18.0.4

const url = "https://www.example.com";
const opts = {userDataDir: "./data"};
let browser;

(async () => {
  {
    browser = await puppeteer.launch({...opts, headless: true});
    const [page] = await browser.pages();
    await page.goto(url, {waitUntil: "domcontentloaded"});
    await page.evaluate(() => localStorage.setItem("hello", "world"));
    await browser.close();
  }
  {
    browser = await puppeteer.launch({...opts, headless: false});
    const [page] = await browser.pages();
    await page.goto(url, {waitUntil: "domcontentloaded"});
    const result = await page.evaluate(() => localStorage.getItem("hello"));
    console.log(result); // => world
  }
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Change const opts = {userDataDir: "./data"}; to const opts = {}; and you'll see null print instead of world; the user data doesn't persist.
The answer from a few years ago mentions issues with userDataDir and suggests a cookies solution. That's fine, but I haven't had any issues with userDataDir, so either they've been resolved on the Puppeteer end or my use cases haven't triggered them.
There's a useful-looking answer from a reputable source in How to turn headless on after launch? but I haven't had a chance to try it yet.

Launch Tor browser using Puppeteer instead of Chrome on Windows 10

I'm on a Windows 10 machine. I've downloaded the Tor browser, and using it normally works fine, but I'd like to make Puppeteer use Tor and launch in headless mode. I'm seeing a lot regarding the SOCKS5 proxy but can't figure out how to set this up or why it's not working. Presumably, when running the launch method, it launches Tor in the background?
Here's my JS code in node so far...
// puppeteer-extra is a drop-in replacement for puppeteer,
// it augments the installed puppeteer with plugin functionality
const puppeteer = require('puppeteer-extra')

// add stealth plugin and use defaults (all evasion techniques)
const StealthPlugin = require('puppeteer-extra-plugin-stealth')
puppeteer.use(StealthPlugin())

// artificial sleep function
const sleep = async (ms) => {
  return new Promise((res, rej) => {
    setTimeout(() => {
      res()
    }, ms)
  })
}

// login function
const emulate = async () => {
  // initiate a Puppeteer instance with options and launch
  const browser = await puppeteer.launch({
    headless: false,
    args: [
      '--proxy-server=socks5://127.0.0.1:1337'
    ]
  });
  // launch Facebook and wait until idle
  const page = await browser.newPage()
  // go to Tor
  await page.goto('https://check.torproject.org/');
  const isUsingTor = await page.$eval('body', el =>
    el.innerHTML.includes('Congratulations. This browser is configured to use Tor')
  );
  if (!isUsingTor) {
    console.log('Not using Tor. Closing...')
    return await browser.close()
  }
  // do something...
}

// kick it off
emulate()
This gives me a ERR_PROXY_CONNECTION_FAILED error in chromium, why isn't it launching using Tor?
There are a lot more steps you need to take.
You need to install Tor on your system (on macOS you might use Homebrew):
brew install tor
Start Tor with:
brew services start tor
Tor uses port 9050 by default, so your proxy should look like this:
--proxy-server=socks5://127.0.0.1:9050
If you must use another port, it must be set in the torrc file.
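For example, to have Tor listen on port 1337 (matching the question's flag), the torrc would need a SocksPort line; 1337 here is just an arbitrary example port:

```
SocksPort 1337
```

With that in place, --proxy-server=socks5://127.0.0.1:1337 should connect.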
Also, you might need to move your // go to Tor step before your // launch Facebook step.

Keeping same Puppeteer page open nodeJS

I'm working on an API developed around running some JS on a page opened in Puppeteer, but I don't want to keep opening/closing and waiting for the page to load, since it's a content-heavy page.
Is it possible to run a forever start on a Node script that initiates the page and keeps it open forever, and then call a separate Node script whenever it's needed to run some JavaScript on that page?
I've attempted the following but appears the page doesn't remain open:
keepopen.js
'use strict';
const puppeteer = require('puppeteer');

(async () => {
  const start = +new Date();
  const browser = await puppeteer.launch({args: ['--no-sandbox']});
  const page = await browser.newPage();
  await page.goto('https://www.bigwebsite.com/', {"waitUntil": "networkidle0"});
  const end = +new Date();
  console.log(end - start);
  //await browser.close();
})();
runjs.js
'use strict';
const puppeteer = require('puppeteer');

(async () => {
  const start = +new Date();
  const browser = await puppeteer.launch({args: ['--no-sandbox']});
  const page = await browser.targets()[browser.targets().length - 1].page();
  const hash = await page.evaluate(() => {
    return runFunction();
  });
  const end = +new Date();
  console.log(hash);
  console.log(end - start);
  //await browser.close();
})();
I run the following: forever start keepopen.js and then runjs.js but I'm getting the error:
(node:1642) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'evaluate' of null
It is not possible to share a resource between two Node.js scripts like that. You need a server that keeps the browser open.
Code Sample
Below is an example using the library express to start a server. Calling /start-browser launches the browser and stores the browser and page objects outside of the current function. That way, a second function (called when /run is accessed) can use the page object to run code inside of it.
const express = require('express');
const puppeteer = require('puppeteer');

const app = express();
let browser, page;

app.get('/start-browser', async function (req, res) {
  browser = await puppeteer.launch({args: ['--no-sandbox']});
  page = await browser.newPage();
  res.end('Browser started');
});

app.get('/run', async function (req, res) {
  await page.evaluate(() => {
    // ....
  });
  res.end('Done.'); // You could also return results here
});

app.listen(3000);
Keep in mind that this is a minimal example to get you started. In a real-world scenario, you would need to catch errors and maybe also restart the browser from time to time.
You could run an HTTP server in Node where the Puppeteer page object is created once on startup, and then place your current script inside a so-called "routing" function (which is just a function that serves a web request) of that server.
As long as the page object is created just outside the scope of the routing function that contains your code, the routing function will retain access to that same page object across numerous web requests.
You'll be able to reuse that same page object over and over again instead of having to reload it for each call, as you're currently doing. However, you do need a service to persist the page object between requests/calls.
You can either create your own HTTP server (using Node's built-in http package) or use express (and there are many other HTTP-based packages besides express that you could use).

Use different ip addresses in puppeteer requests

I have multiple IP interfaces on my server and I can't find out how to force Puppeteer to use them in its requests.
I am using Node v10.15.0 and Puppeteer 1.11.0.
You can use the flag --netifs-to-ignore when launching the browser to specify which interfaces should be ignored by Chrome. Quote from the List of Chromium Command Line Switches:
--netifs-to-ignore: List of network interfaces to ignore. Ignored interfaces will not be used for network connectivity
You can use the argument like this when launching the browser:
const browser = await puppeteer.launch({
  args: ['--netifs-to-ignore=INTERFACE_TO_IGNORE']
});
Maybe this will help. You can see the full code here:
'use strict';
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    // Launch Chromium using a proxy server on port 9876.
    // More on proxying:
    // https://www.chromium.org/developers/design-documents/network-settings
    args: ['--proxy-server=127.0.0.1:9876']
  });
  const page = await browser.newPage();
  await page.goto('https://google.com');
  await browser.close();
})();

Puppeteer - using '--allow-file-access-from-files' to load a local file through XMLHttpRequest is not working

I am trying to use a local file in Headless Chromium started through Puppeteer.
I always run into the following error:
'Cross origin requests are only supported for protocol schemes: http, data, chrome, https'
I did attempt to set --allow-file-access-from-files.
It can be reproduced as follows:
const puppeteer = require('puppeteer');

puppeteer.launch({headless: true, args: ['--allow-file-access-from-files']}).then(
  async browser => {
    const page = await browser.newPage();
    await page.setContent('<html><head><meta charset="UTF-8"></head><body><div>A page</div></body></html>');
    await page.addScriptTag({url: "https://ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquery.min.js"});
    page.on('console', msg => console.log('PAGE LOG:', msg.text()));
    await page.evaluate(() => {
      $.get('file:///..../cors.js')
        .done(_ => console.log('done'))
        .fail(e => console.log('fail:' + JSON.stringify(e)));
    });
    await browser.close();
  }
);
Looking at the running processes, it does look like Chromium was started with the option.
All tips warmly welcomed!
You are attempting to load your local file from the Page DOM Environment using $.get(), which will not work because it is a violation of the same-origin policy.
The Chromium flag --allow-file-access-from-files is described below:
By default, file:// URIs cannot read other file:// URIs. This is an override for developers who need the old behavior for testing.
This flag does not apply to your scenario.
You can use the Node.js function fs.readFileSync() instead to obtain the content of your file and pass it to page.evaluate():
const fs = require('fs');

const file_path = '/..../cors.js';
const file_content = fs.existsSync(file_path) ? fs.readFileSync(file_path, 'utf8') : '';

await page.evaluate(file_content => {
  console.log(file_content);
}, file_content);
