Keeping the same Puppeteer page open in Node.js

I'm working on an API built around running some JS on a page that's opened in Puppeteer, but I don't want to keep opening/closing and waiting for the page to load, since it's a heavy-content page.
Is it possible to run forever start on a node script that initiates the page and keeps it open indefinitely, and then call a separate node script whenever it's needed to run some JavaScript on that page?
I've attempted the following, but it appears the page doesn't remain open:
keepopen.js
'use strict';
const puppeteer = require('puppeteer');

(async () => {
  const start = +new Date();
  const browser = await puppeteer.launch({args: ['--no-sandbox']});
  const page = await browser.newPage();
  await page.goto('https://www.bigwebsite.com/', {waitUntil: 'networkidle0'});
  const end = +new Date();
  console.log(end - start);
  //await browser.close();
})();
runjs.js
'use strict';
const puppeteer = require('puppeteer');

(async () => {
  const start = +new Date();
  const browser = await puppeteer.launch({args: ['--no-sandbox']});
  const page = await browser.targets()[browser.targets().length - 1].page();
  const hash = await page.evaluate(() => {
    return runFunction();
  });
  const end = +new Date();
  console.log(hash);
  console.log(end - start);
  //await browser.close();
})();
I run forever start keepopen.js and then node runjs.js, but I'm getting the error:
(node:1642) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'evaluate' of null

It is not possible to share a resource between two Node.js scripts like that. You need a server that keeps the browser open.
Code Sample
Below is an example using the library express to start a server. Calling /start-browser launches the browser and stores the browser and page objects outside of the current function. That way a second function (called when /run is accessed) can use the page object to run code inside of it.
const express = require('express');
const puppeteer = require('puppeteer');

const app = express();
let browser, page;

app.get('/start-browser', async function (req, res) {
  browser = await puppeteer.launch({args: ['--no-sandbox']});
  page = await browser.newPage();
  res.end('Browser started');
});

app.get('/run', async function (req, res) {
  await page.evaluate(() => {
    // ....
  });
  res.end('Done.'); // You could also return results here
});

app.listen(3000);
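With the server running (e.g. node server.js, assuming that's what you name the file), the two endpoints can be hit in order, for example with curl:
curl http://localhost:3000/start-browser
curl http://localhost:3000/run
The first call launches the browser once; every subsequent /run call reuses the same page.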
Keep in mind that this is a minimal example to get you started. In a real-world scenario, you would need to catch errors and perhaps also restart the browser from time to time.

You could run an HTTP server using Node, where the Puppeteer page object is created once on startup, and then initiate your current script by placing that code inside a so-called "routing" function (which is just a function that serves a web request) of the HTTP server you've created.
As long as the page object is created right outside the scope of the routing function that contains your code, the routing function will maintain access to that same page object across numerous web requests.
You'll be able to reuse the same page object over and over again instead of having to reload it for each call, like you're currently doing. However, you do need a service to persist the page object between requests/calls.
You can either create your own HTTP server (using Node's built-in http package) or use express (and there are many other HTTP packages besides express you could use).
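Here's a minimal sketch of that idea using only Node's built-in http package (the target URL and the evaluated expression are placeholders for your own):

const http = require('http');
const puppeteer = require('puppeteer');

let page; // created once on startup, shared by every request

(async () => {
  const browser = await puppeteer.launch({args: ['--no-sandbox']});
  page = await browser.newPage();
  await page.goto('https://www.bigwebsite.com/', {waitUntil: 'networkidle0'});

  // Routing function: it closes over `page`, so the heavy page is
  // loaded once and reused across requests instead of being reloaded.
  http.createServer(async (req, res) => {
    const result = await page.evaluate(() => document.title);
    res.end(String(result));
  }).listen(3000);
})();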

Related

In Puppeteer, is it possible to launch a headed browser instance, get the user authenticated, and then continue the session in a headless state? [duplicate]

I want to start a Chromium browser instance headless, do some automated operations, and then turn it visible before doing the rest of the stuff.
Is this possible to do using Puppeteer, and if it is, can you tell me how? And if it is not, is there any other framework or library for browser automation that can do this?
So far I've tried the following but it didn't work.
const browser = await puppeteer.launch({'headless': false});
browser.headless = true;
const page = await browser.newPage();
await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
await page.pdf({path: 'hn.pdf', format: 'A4'});
Short answer: It's not possible
Chrome can only be started in either headless or non-headless mode. You have to specify this when you launch the browser, and it is not possible to switch during runtime.
What is possible is to launch a second browser and reuse cookies (and any other data) from the first browser.
Long answer
You would assume that you could just reuse the data directory when calling puppeteer.launch, but this is currently not possible due to multiple bugs (#1268, #1270 in the puppeteer repo).
So the best approach is to save any cookies or local storage data that you need to share between the browser instances, and restore the data when you launch the browser. You then visit the website a second time. Be aware that any state the website holds in JavaScript variables will be lost when you recrawl the page.
Process
Summing up, the whole process should look like this (or vice versa for headless to headful):
1. Crawl in non-headless mode until you want to switch mode
2. Serialize cookies
3. Launch or reuse second browser (in headless mode)
4. Restore cookies
5. Revisit page
6. Continue crawling
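A minimal sketch of those steps (non-headless to headless; the URL is a placeholder and error handling is omitted):

const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  // 1. Crawl in non-headless mode
  const headful = await puppeteer.launch({headless: false});
  const page1 = await headful.newPage();
  await page1.goto('https://example.com');

  // 2. Serialize cookies
  const cookies = await page1.cookies();
  fs.writeFileSync('./cookies.json', JSON.stringify(cookies));
  await headful.close();

  // 3. Launch second browser in headless mode
  const headless = await puppeteer.launch({headless: true});
  const page2 = await headless.newPage();

  // 4.-6. Restore cookies, revisit the page, continue crawling
  await page2.setCookie(...JSON.parse(fs.readFileSync('./cookies.json', 'utf8')));
  await page2.goto('https://example.com');
})();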
As mentioned, this isn't currently possible since the headless switch occurs via Chromium launch flags.
I usually do this with userDataDir, which the Chromium docs describe as follows:
The user data directory contains profile data such as history, bookmarks, and cookies, as well as other per-installation local state.
Here's a simple example. This launches a browser headlessly, sets a local storage value on an arbitrary page, closes the browser, re-opens it headfully, retrieves the local storage value and prints it.
const puppeteer = require("puppeteer"); // ^18.0.4

const url = "https://www.example.com";
const opts = {userDataDir: "./data"};
let browser;

(async () => {
  {
    browser = await puppeteer.launch({...opts, headless: true});
    const [page] = await browser.pages();
    await page.goto(url, {waitUntil: "domcontentloaded"});
    await page.evaluate(() => localStorage.setItem("hello", "world"));
    await browser.close();
  }

  {
    browser = await puppeteer.launch({...opts, headless: false});
    const [page] = await browser.pages();
    await page.goto(url, {waitUntil: "domcontentloaded"});
    const result = await page.evaluate(() => localStorage.getItem("hello"));
    console.log(result); // => world
  }
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Change const opts = {userDataDir: "./data"}; to const opts = {}; and you'll see null print instead of world; the user data doesn't persist.
The answer from a few years ago mentions issues with userDataDir and suggests a cookies solution. That's fine, but I haven't had any issues with userDataDir so either they've been resolved on the Puppeteer end or my use cases haven't triggered the issues.
There's a useful-looking answer from a reputable source in How to turn headless on after launch? but I haven't had a chance to try it yet.

Waiting for download to complete on Puppeteer

I have a script made using Node.js and Puppeteer which downloads a file from a button (which doesn't redirect to a URL), so right now I'm using await page.waitForTimeout(1000); to wait for the download to complete, but it has a few flaws, such as:
Depending on the connection, the download might take more than 1000 ms to finish, and it might also take less, in which case there's no sense in waiting longer than the download actually took.
My question is: is there a way to wait for a download to complete using Node + Puppeteer? I have tried using waitUntil: 'networkidle0' and 'networkidle2' but both seem to wait forever.
Code below:
const path = require('path');
const puppeteer = require('puppeteer');

(async () => {
  /* Initialize some variables */
  const browser = await puppeteer.launch();
  // Instantiates a new page
  const page = await browser.newPage();
  // Gets current path
  const downloadPath = path.resolve('./');
  // Specifies whether it allows downloading multiple files or not
  await page._client.send('Page.setDownloadBehavior',
    {behavior: 'allow', downloadPath: downloadPath});
  // Goes to My Website
  await page.goto('http://localhost:8080/mywebsite');
  // Exports to CSV
  await page.waitForSelector("#W0009EXPORTAXLS > a > i", {visible: true});
  await page.tap("#W0009EXPORTAXLS > a > i");
  await page.waitForTimeout(1000);
  // Log
  console.log('File exported.');
  // Closes the browser
  await browser.close();
})();
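Puppeteer doesn't expose a download-finished event for this setup, but Chromium writes an in-progress download as a .crdownload file in the download directory, so one workaround is to poll that directory until no partial file remains. A sketch (waitForDownload is a made-up helper, and it assumes the download directory is otherwise empty):

const fs = require('fs');

// Hypothetical helper: resolves once `downloadPath` contains files and
// none of them is a partial .crdownload download; rejects on timeout.
function waitForDownload(downloadPath, timeout = 60000) {
  return new Promise((resolve, reject) => {
    const started = Date.now();
    const timer = setInterval(() => {
      const files = fs.readdirSync(downloadPath);
      if (files.length && !files.some(f => f.endsWith('.crdownload'))) {
        clearInterval(timer);
        resolve(files);
      } else if (Date.now() - started > timeout) {
        clearInterval(timer);
        reject(new Error('Download timed out'));
      }
    }, 100);
  });
}

// Usage, in place of the fixed waitForTimeout(1000):
// await page.tap("#W0009EXPORTAXLS > a > i");
// await waitForDownload(downloadPath);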

Have array from Puppeteer/Cheerio program. Have Ionic phone app design. How to move the array to the Angular code?

This is my first phone app. I am using Ionic for the cross-platform work, which uses Angular as you know, I'm sure. I have a separate program which scrapes a webpage using Puppeteer and Cheerio and creates an array of values from the web page. This works.
I'm not sure how I get the array in my web scraping program read by my Ionic/Angular program.
I have a basic Ionic setup and am just trying the most basic activity of being able to see the array from the Ionic/Angular side, but after trying to put it in several places I realized I really didn't know where to import the code that returns the array into Ionic/Angular, or whether to put the webscraper code directly in one of the .ts files, or something else entirely.
This is my web scraping program:
const puppeteer = require('puppeteer'); // live webscraping

let scrape = async () => {
  const browser = await puppeteer.launch({
    headless: true
  });
  const page = await browser.newPage();
  await page.goto('--page url here --'); // link to page

  const result = await page.evaluate(() => {
    let data = []; // Create an empty array that will store our data
    let elements = document.querySelectorAll('.list-myinfo-block'); // Select all products
    let photo_elements = document.getElementsByTagName('img');
    var photo_count = 0;
    for (var element of elements) { // Loop through each product getting photos
      let picture_link = photo_elements[photo_count].src;
      let name = element.childNodes[1].innerText;
      let itype = element.childNodes[9].innerText;
      data.push({
        picture_link,
        name,
        itype
      }); // Push an object with the data onto our array
      photo_count = photo_count + 1;
    }
    return data;
  });

  await browser.close();
  return result; // Return the data
};

scrape().then((value) => {
  console.log(value); // Success!
});
When I run the webscraping program I see the array with the correct values in it. Getting it into the Ionic part is the problem. Sometimes the Ionic phone page shows up with nothing in it; sometimes it says it cannot find "/"... I've tried so many different places and looked all over the web, and I have quite a combination of errors. I know I'm putting it in the wrong places, or maybe not everywhere I should. Thank you!
You need a server which will run the scraper on demand.
Any scraper that uses a real browser (i.e. Chromium) has to run on an OS that supports it. There is no other way.
Think about this:
Does your mobile support Chromium and Node.js? It does not. There is no Chromium build for mobile that supports automation with Node.js (yet).
Can you run a browser inside another browser? You cannot.
Way 1: Remote wsEndpoint
There are some services which offer a wsEndpoint, but I will not mention them here. I will describe how you can create your own wsEndpoint and use it.
Run browser and Get wsEndpoint
The following code will launch a puppeteer instance whenever you connect to it. You have to run it inside a server.
const http = require('http');
const httpProxy = require('http-proxy');
const puppeteer = require('puppeteer');

const proxy = httpProxy.createProxyServer();

http
  .createServer()
  .on('upgrade', async (req, socket, head) => {
    const browser = await puppeteer.launch();
    const target = browser.wsEndpoint();
    proxy.ws(req, socket, head, { target });
  })
  .listen(8080);
When you run this on the server/terminal, you can use the ip of the server to connect. In my case it's ws://127.0.0.1:8080.
Use puppeteer-web
Now you will need puppeteer-web in your mobile/web app. To bundle Puppeteer using Browserify, follow the instructions below.
Clone Puppeteer repository:
git clone https://github.com/GoogleChrome/puppeteer && cd puppeteer
npm install
npm run bundle
This will create the ./utils/browser/puppeteer-web.js file that contains the Puppeteer bundle.
You can use it later on in your web page to drive another browser instance through its WS Endpoint:
<script src='./puppeteer-web.js'></script>
<script>
  const puppeteer = require('puppeteer');

  (async () => {
    const browser = await puppeteer.connect({
      browserWSEndpoint: '<another-browser-ws-endpoint>'
    });
    // ... drive automation ...
  })();
</script>
Way 2: Use an API
I will use express for a minimal setup. Assume your scrape function is exported from a file called scrape.js and you have the following index.js file.
const express = require('express');
const scrape = require('./scrape');

const app = express();

app.get('/', function (req, res) {
  scrape().then(data => res.send({ data }));
});

app.listen(8080);
This will launch an Express API on port 8080.
Now if you run it with node index.js on a server, you can call it from any mobile/web app.
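From the Ionic/Angular side, the app then just makes an HTTP request to that endpoint. A minimal sketch (the server address is a placeholder; in a real Ionic app you would typically wrap this in an Angular service using HttpClient):

// anywhere in the Ionic app, e.g. inside a page's ngOnInit
fetch('http://your-server-address:8080/')
  .then(res => res.json())
  .then(({ data }) => {
    console.log(data); // the array produced by the scraper
  });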
Helpful Resources
I had some fun with puppeteer and webpack,
playground-react-puppeteer
playground-electron-react-puppeteer-example
To keep the API running, you will need to learn a bit about backend work and how to keep the server alive, etc. See these links for a full understanding of creating the server and more:
Official link to puppeteer-web
Puppeteer with docker
Docker with XVFB and Puppeteer
Puppeteer with chrome extension
Puppeteer with local wsEndpoint
Avoid memory leak on server

How can I get Chrome's remote debug URL when using the "remote-debugging-port" in Electron?

I've set the remote-debugging-port option for Chrome in my Electron main process:
app.commandLine.appendSwitch('remote-debugging-port', '8315')
Now, how can I get the ws:// URL that I can use to connect to Chrome?
I see that the output while I'm running Electron shows
DevTools listening on ws://127.0.0.1:8315/devtools/browser/52ba17be-0c0d-4db6-b6f9-a30dc10df13c
but I would like to get this URL from inside the main process. The URL is different every time. How can I get it from inside the Electron main process?
Can I somehow read my Electron's main process output, from within my main process JavaScript code?
Here's how to connect Puppeteer to your Electron window from your Electron main process code:
app.commandLine.appendSwitch('remote-debugging-port', '8315')

async function test() {
  // fetch() is available in recent Electron/Node versions; in older
  // ones, substitute a library such as node-fetch
  const response = await fetch(`http://localhost:8315/json/list?t=${Math.random()}`)
  const debugEndpoints = await response.json()

  let webSocketDebuggerUrl = ''

  for (const debugEndpoint of debugEndpoints) {
    if (debugEndpoint.title === 'Saffron') { // match your window's title here
      webSocketDebuggerUrl = debugEndpoint.webSocketDebuggerUrl
      break
    }
  }

  const browser = await puppeteer.connect({
    browserWSEndpoint: webSocketDebuggerUrl
  })

  // use puppeteer APIs now!
}

// ... make your window, etc, the usual, and then: ...
// wait for the window to open/load, then connect Puppeteer to it:
mainWindow.webContents.on("did-finish-load", () => {
  test()
})
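Since the question specifically asks for the browser-level ws:// URL, note that the DevTools HTTP endpoint /json/version exposes it directly as webSocketDebuggerUrl, so a sketch of an alternative lookup (same port as above) is:

async function getBrowserWsUrl() {
  const response = await fetch('http://localhost:8315/json/version')
  const info = await response.json()
  // e.g. ws://127.0.0.1:8315/devtools/browser/52ba17be-...
  return info.webSocketDebuggerUrl
}

The returned URL can then be passed to puppeteer.connect as browserWSEndpoint.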

How to use proxy in puppeteer and headless Chrome?

Please tell me how to properly use a proxy with Puppeteer and headless Chrome. My attempt below does not work.
const puppeteer = require('puppeteer');

(async () => {
  const argv = require('minimist')(process.argv.slice(2));
  const browser = await puppeteer.launch({args: ["--proxy-server =${argv.proxy}", "--no-sandbox", "--disable-setuid-sandbox"]});
  const page = await browser.newPage();
  await page.setJavaScriptEnabled(false);
  await page.setUserAgent(argv.agent);
  await page.setDefaultNavigationTimeout(20000);

  try {
    await page.goto(argv.page);
    const bodyHTML = await page.evaluate(() => new XMLSerializer().serializeToString(document));
    body = bodyHTML.replace(/\r|\n/g, '');
    console.log(body);
  } catch (e) {
    console.log(e);
  }

  await browser.close();
})();
You can find an example of proxy usage here:
'use strict';
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    // Launch chromium using a proxy server on port 9876.
    // More on proxying:
    // https://www.chromium.org/developers/design-documents/network-settings
    args: [ '--proxy-server=127.0.0.1:9876' ]
  });
  const page = await browser.newPage();
  await page.goto('https://google.com');
  await browser.close();
})();
It's possible with puppeteer-page-proxy.
It supports setting a proxy for an entire page, or if you like, it can set a different proxy for each request. And yes, it works both in headless and headful Chrome.
First install it:
npm i puppeteer-page-proxy
Then require it:
const useProxy = require('puppeteer-page-proxy');
Using it is easy.
Set proxy for an entire page:
await useProxy(page, 'http://127.0.0.1:8000');
If you want a different proxy for each request, then you can simply do this:
await page.setRequestInterception(true);
page.on('request', req => {
  useProxy(req, 'socks5://127.0.0.1:9000');
});
Then, if you want to be sure that your page's IP has changed, you can look it up:
const data = await useProxy.lookup(page);
console.log(data.ip);
It supports http, https, socks4 and socks5 proxies, and it also supports authentication if that is needed:
const proxy = 'http://login:pass@127.0.0.1:8000'
Repository:
https://github.com/Cuadrix/puppeteer-page-proxy
Do not use
"--proxy-server =${argv.proxy}"
This is a normal string instead of a template literal, so argv.proxy will not be substituted.
Use backticks instead of double quotes (and drop the stray space before the =):
`--proxy-server=${argv.proxy}`
Check this string before you pass it to the launch function to make sure it's correct, and you may want to visit http://api.ipify.org/ in that browser to make sure the proxy works normally.
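Applied to the launch call from the question, that becomes:

const browser = await puppeteer.launch({
  args: [`--proxy-server=${argv.proxy}`, '--no-sandbox', '--disable-setuid-sandbox']
});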
If you want to use a different proxy per page, try this: use https-proxy-agent or http-proxy-agent to proxy the requests for each page.
You can use https://github.com/gajus/puppeteer-proxy to set a proxy either for the entire page or for specific requests only, e.g.:
import puppeteer from 'puppeteer';
import {
  createPageProxy,
} from 'puppeteer-proxy';

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const pageProxy = createPageProxy({
    page,
    proxyUrl: 'http://127.0.0.1:3000',
  });

  await page.setRequestInterception(true);

  page.once('request', async (request) => {
    await pageProxy.proxyRequest(request);
  });

  await page.goto('https://example.com');
})();
To skip the proxy, simply call request.continue() conditionally, as in the sketch below.
Using puppeteer-proxy, a page can have multiple proxies.
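For example, a sketch that proxies only requests to one host and lets everything else go out directly (the hostname is a placeholder, and page.on is used so that every request is handled):

page.on('request', async (request) => {
  if (request.url().startsWith('https://example.com')) {
    await pageProxy.proxyRequest(request); // routed through the proxy
  } else {
    await request.continue(); // sent directly
  }
});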
You can find a proxies list on Private Proxy and use it with the code below:
const puppeteer = require('puppeteer');
const proxyChain = require('proxy-chain');

(async () => {
  // Proxies list from Private proxies
  const proxiesList = [
    'http://skrll:au4....',
    'http://skrll:au4....',
    'http://skrll:au4....',
    'http://skrll:au4....',
    'http://skrll:au4....',
  ];
  const oldProxyUrl = proxiesList[Math.floor(Math.random() * proxiesList.length)];
  // anonymizeProxy starts a local proxy that handles the upstream
  // authentication, so no page.authenticate() call is needed
  const newProxyUrl = await proxyChain.anonymizeProxy(oldProxyUrl);

  const browser = await puppeteer.launch({
    headless: true,
    ignoreHTTPSErrors: true,
    args: [
      `--proxy-server=${newProxyUrl}`,
      `--ignore-certificate-errors`,
      `--no-sandbox`,
      `--disable-setuid-sandbox`
    ]
  });
  const page = await browser.newPage();

  //
  // your code here
  //

  // close proxy chain
  await proxyChain.closeAnonymizedProxy(newProxyUrl, true);
})();
You can find the full post here
I see https://github.com/Cuadrix/puppeteer-page-proxy and https://github.com/gajus/puppeteer-proxy recommended above, and I want to emphasize that these two packages technically do not use the Chrome instance to perform the actual network request; here is what they do instead:
when the user code initiates a network request in Puppeteer, e.g. calls page.goto(), the proxy package intercepts this outgoing HTTP request and pauses it
the proxy package passes the request to another network library (Got)
Got performs the actual network request, through the proxy specified
Got then needs to pass all the network response data back to Puppeteer! This means a bunch of interesting things the proxy package has to manage, like copying cookie headers from the raw HTTP set-cookie format to the Puppeteer format
While this might be a viable approach for a lot of cases, you need to understand that this changes your HTTP request TLS fingerprint so your HTTP request might get blocked by some websites, particularly the ones which are using Cloudflare bot detection (because the website now sees that your request originates from Node.js, not from Chrome).
Alternative method of setting a proxy in Puppeteer.
Launch args of Chrome are good if you want to use one proxy for all websites. But what if you still want one Chrome instance to use multiple proxies, and you don't want to use the two packages mentioned above?
The createIncognitoBrowserContext Puppeteer function comes to the rescue:
// Create a new incognito browser context with its own proxy
// (the option is named proxyServer in Puppeteer's BrowserContextOptions)
const context = await browser.createIncognitoBrowserContext({ proxyServer: 'http://localhost:2022' });

// Create a new page inside context.
const page = await context.newPage();

// Authenticate in the proxy using basic browser auth
await page.authenticate({ username: user, password: password });

// ... do stuff with page ...
await page.goto('https://example.com');

// Dispose of the context once it's no longer needed.
await context.close();
proxy-chain package
If your proxy requires auth and you don't like the page.authenticate call, the proxy can instead be set up using the proxy-chain npm package.
proxy-chain launches an intermediate proxy on your localhost, which allows it to do some nice things. Read more on the technical details of the proxy-chain implementation: https://pixeljets.com/blog/how-to-set-proxy-in-puppeteer
In my experience, all of the above fail for various reasons. I find that applying the proxy to the entire OS works every time; I get no proxy failures. This strategy works on both Windows and Linux.
This way I get zero Puppeteer bot failures. Bear in mind, I am spinning up 7000 bots per server, and I am running this on 7 servers.
