I am trying to:
Visit a page that initialises a session
Store the session in a JSON object
Visit the same page, which now should recognise the existing session
The implementation I have attempted is as follows:
import puppeteer from 'puppeteer';
const createSession = async (browser, startUrl) => {
const page = await browser.newPage();
await page.goto(startUrl);
await page.waitForSelector('#submit');
const cookies = await page.cookies();
const url = await page.url();
return {
cookies,
url
};
};
const useSession = async (browser, session) => {
const page = await browser.newPage();
for (const cookie of session.cookies) {
await page.setCookie(cookie);
}
await page.goto(session.url);
};
const run = async () => {
const browser = await puppeteer.launch({
headless: false
});
const session = await createSession(browser, 'http://foo.com/');
// The session has been established
await useSession(browser, session);
await useSession(browser, session);
};
run();
createSession is used to capture the cookies of the loaded page.
useSession is expected to load the page using the existing cookies.
However, this does not work – the session.url page does not recognise the session. It appears that not all cookies are being captured this way.
It appears that page#cookies returns some cookies with the session=true,expires=0 configuration. setCookie ignores these values.
I worked around this by constructing a new cookies array overriding the expires and session properties.
const cookies = await page.cookies();
const sessionFreeCookies = cookies.map((cookie) => {
return {
...cookie,
expires: Date.now() / 1000 + 10 * 60,
session: false
};
});
At the time of writing this answer, the session property is not documented. Refer to the following issue: https://github.com/GoogleChrome/puppeteer/issues/980.
Puppeteer's page.cookies() method only fetches cookies for the current page's URL. However, a page can also end up with cookies from other domains.
You can send the DevTools Protocol command Network.getAllCookies through the internal page._client to fetch cookies from all domains.
(async() => {
const browser = await puppeteer.launch({});
const page = await browser.newPage();
await page.goto('https://stackoverflow.com', {waitUntil : 'networkidle2' });
// Here we can get all of the cookies
console.log(await page._client.send('Network.getAllCookies'));
})();
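Since page._client is internal, it may change or disappear between Puppeteer releases. A minimal sketch of the same call through a dedicated CDP session, assuming page.target().createCDPSession() is available in your version (the helper name getAllCookies is made up for illustration):

```javascript
// Sketch: fetch cookies for every domain via a CDP session
// instead of the private page._client property.
async function getAllCookies(page) {
  const client = await page.target().createCDPSession();
  try {
    // Network.getAllCookies resolves to an object with a `cookies` array.
    const { cookies } = await client.send('Network.getAllCookies');
    return cookies;
  } finally {
    // Release the session whether or not the command succeeded.
    await client.detach();
  }
}
```

Call it after navigation, e.g. `console.log(await getAllCookies(page));`.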
More on this thread here - Puppeteer get 3rd party cookies
I am trying to open a Puppeteer window with my Google profile, and I want to do it multiple times: 2-4 windows, all with the same profile. Is that possible? I get this error when I try:
(node:17460) UnhandledPromiseRejectionWarning: Error: Failed to launch the browser process!
[45844:13176:0410/181437.893:ERROR:cache_util_win.cc(20)] Unable to move the cache: Access is denied. (0x5)
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({
headless:false,
'--user-data-dir=C:\\Users\\USER\\AppData\\Local\\Google\\Chrome\\User Data',
);
const page = await browser.newPage();
await page.goto('https://example.com');
await page.screenshot({ path: 'example.png' });
await browser.close();
})();
Note: As already pointed out in the comments, there is a syntax error in the example. The launch should look like this:
const browser = await puppeteer.launch({
headless: false,
args: ['--user-data-dir=C:\\Users\\USER\\AppData\\Local\\Google\\Chrome\\User Data']
});
The error comes from launching multiple browser instances at the very same time: the profile directory is locked by the first instance, so Puppeteer cannot move it for reuse.
You should avoid starting Chromium instances with the very same user data dir at the same time.
Possible solutions
Make the opened windows sequential, can be useful if you have only a few. E.g.:
const firstFn = async () => await puppeteer.launch() ...
const secondFn = async () => await puppeteer.launch() ...
(async () => {
await firstFn()
await secondFn()
})();
Create copies of the user-data-dir as User Data1, User Data2, User Data3, etc. to avoid conflicts while Puppeteer copies them. This could be done on the fly with Node's fs module or even manually (if you don't need a lot of instances).
Consider reusing Chromium instances (if your use case allows it), with browser.wsEndpoint and puppeteer.connect, this can be a solution if you would need to open thousands of pages with the same user data dir.
Note: this one is the best for performance as only one browser will be launched, then you can open as many pages in a for..of or regular for loop as you want (using forEach by itself can cause side effects), E.g.:
const puppeteer = require('puppeteer')
const urlArray = ['https://example.com', 'https://google.com']
async function fn() {
const browser = await puppeteer.launch({
headless: false,
args: ['--user-data-dir=C:\\Users\\USER\\AppData\\Local\\Google\\Chrome\\User Data']
})
const browserWSEndpoint = await browser.wsEndpoint()
for (const url of urlArray) {
try {
const browser2 = await puppeteer.connect({ browserWSEndpoint })
const page = await browser2.newPage()
await page.goto(url) // it can be wrapped in a retry function to handle flakiness
// doing cool things with the DOM
await page.screenshot({ path: `${url.replace('https://', '')}.png` })
await page.goto('about:blank') // because of you: https://github.com/puppeteer/puppeteer/issues/1490
await page.close()
await browser2.disconnect()
} catch (e) {
console.error(e)
}
}
await browser.close()
}
fn()
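For the second option above (copies of the user data dir), a rough sketch with Node's fs module could look like this. The helper name cloneUserDataDir and its parameters are made up for illustration, and fs.cpSync needs Node 16.7 or newer. It is safest to run this while Chrome is closed, since files in a profile that is in use may be locked.

```javascript
const fs = require('fs');
const path = require('path');

// Copies sourceDir to "<destRoot>/<dir name><n>" for n = 1..count
// (e.g. "User Data1", "User Data2") and returns the new paths.
function cloneUserDataDir(sourceDir, destRoot, count) {
  const baseName = path.basename(sourceDir);
  const copies = [];
  for (let n = 1; n <= count; n++) {
    const target = path.join(destRoot, `${baseName}${n}`);
    fs.cpSync(sourceDir, target, { recursive: true });
    copies.push(target);
  }
  return copies;
}
```

Each browser can then be launched with its own copy, e.g. `args: [`--user-data-dir=${copies[0]}`]`.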
I currently have a problem in a Puppeteer project: I think cookies are not enabled, and I don't know how to enable them if they are not enabled from the start.
Since nearly every website nowadays has a captcha, we can skip the auto-login part.
I'm new too; I got the idea for this from here.
First, check whether there is a saved cookies.json file. If not, log in manually: click the submit button yourself and solve the captcha puzzle (in non-headless mode). The page should then redirect to the destination page.
Once the destination page is loaded, save the cookies to a JSON file for next time.
Example:
const puppeteer = require('puppeteer');
const fs = require('fs');
(async () => {
const browser = await puppeteer.launch({
headless: false, //launch in non-headless mode so you can see graphics
defaultViewport: null
});
let [page] = await browser.pages();
await page.setRequestInterception(true);
const getCookies = async (page) => {
// Get page cookies
const cookies = await page.cookies()
// Save cookies to file
fs.writeFile('./cookies.json', JSON.stringify(cookies, null, 4), (err) => {
if (err) console.log(err);
return
});
}
const setCookies = async (page) => {
// Get cookies from file as a string
let cookiesString = fs.readFileSync('./cookies.json', 'utf8');
// Parse string
let cookies = JSON.parse(cookiesString)
// Set page cookies
await page.setCookie.apply(page, cookies);
return
}
page.on('request', async (req) => {
// If URL is already loaded in to system
if (req.url() === 'https://example.com/LoggedUserCP') {
console.log('logged in, get the cookies');
await getCookies(page);
// if it's being in login page, try to set existed cookie
} else if (req.url() === 'https://example.com/Login?next=LoggedUserCP') {
await setCookies(page);
console.log('set the saved cookies');
}
// otherwise go to login page and login yourself
req.continue();
});
await page.goto('https://example.com/Login?next=LoggedUserCP');
})();
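The "check whether cookies.json exists" step described above is not shown in the example, which reads the file unconditionally. A minimal sketch of that check, with hypothetical helper names loadCookies and saveCookies:

```javascript
const fs = require('fs');

// Returns the saved cookie array, or null when the file does not exist yet
// (first run: log in manually, then call saveCookies).
function loadCookies(file) {
  if (!fs.existsSync(file)) return null;
  return JSON.parse(fs.readFileSync(file, 'utf8'));
}

// Writes synchronously so the file is complete before the process can exit.
function saveCookies(file, cookies) {
  fs.writeFileSync(file, JSON.stringify(cookies, null, 4));
}
```

Usage would be along the lines of `const saved = loadCookies('./cookies.json'); if (saved) await page.setCookie(...saved);`, and `saveCookies('./cookies.json', await page.cookies());` once logged in.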
EDIT for Mission Clarity: In the end I am pulling inventory data and customer data from Postgres to render and send a bunch of PDFs to customers, once per month.
These PDFs are dynamic in that the cover page will have varying customer name/address. The next page(s) are also dynamic as they are lists of a particular customer's expiring inventory with item/expiry date/serial number.
I had made a client-side React page with print CSS to render some print-layout letters that could be printed off/saved as a pretty PDF.
Then, the waterfall spec came in that this was to be an automated process on the server. Basically, the PDF needs attached to an email alerting customers of expiring product (in med industry where everything needs audited).
I thought using Puppeteer would be a nice and easy switch. Just add a route that processes all customers, looking up whatever may be expiring, and then passing that into the dynamic react page to be rendered headless to a PDF file (and eventually finish the whole rest of the plan, sending email, etc.). Right now I just grab 10 customers and their expiring stock for PoC, then I have basically: { customer: {}, expiring: [] }.
I've attempted using POST to page with interrupt, but I guess that makes sense that I cannot get post data in the browser.
So, I switched my approach to using cookies. This I would expect to work, but I can never read the cookie(s) into the page.
Here is a simple route, a simple Puppeteer script that writes the cookies out to a JSON file and takes a screenshot just for proof, and the simple HTML with the script I'm using to try to prove I can pass data along.
server/index.js:
app.get('/testing', async (req, res) => {
console.log('GET /testing');
res.sendFile(path.join(__dirname, 'scratch.html'));
});
scratch.js (run at the command line with node ./scratch.js):
const puppeteer = require('puppeteer')
const fs = require('fs');
const myCookies = [{name: 'customer', value: 'Frank'}, {name: 'expiring', value: JSON.stringify([{a: 1, b: 'three'}])}];
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('http://localhost:1234/testing', { waitUntil: 'networkidle2' });
await page.setCookie(...myCookies);
const cookies = await page.cookies();
const cookieJson = JSON.stringify(cookies);
// Writes expected cookies to file for sanity check.
fs.writeFileSync('scratch_cookies.json', cookieJson);
// FIXME: Cookies never get appended to page.
await page.screenshot({path: 'scratch_shot.png'});
await browser.close();
})();
server/scratch.html:
<html>
<body>
</body>
<script type='text/javascript'>
document.write('Cookie: ' + document.cookie);
</script>
</html>
The result is just a PNG with the word "Cookie:" on it. Any insight appreciated!
This is the actual route I'm using where makeExpiryLetter is utilizing puppeteer, but I can't seem to get it to actually read the customer and rows data.
app.get('/create-expiry-letter', async (req, res) => {
// Create PDF file using puppeteer to render React page w/ data.
// Store in Db.
// Email file.
// Send final count of letters sent back for notification in GUI.
const cc = await dbo.getConsignmentCustomers();
const result = await Promise.all(cc.rows.map(async x => {
// Get 0-60 day consignments by customer_id;
const { rows } = await dbo.getExpiry0to60(x.customer_id);
if (rows && rows.length > 0) {
const expiryLetter = await makeExpiryLetter(x, rows); // Uses puppeteer.
// TODO: Store in Db / Email file.
return true;
} else {
return false;
}
}));
res.json({ emails_sent: result.filter(x => x === true).length });
});
Thanks to the samples from @ggorlen I've made huge headway in using cookies. In the inline script of expiry.html I'm grabbing the values by wrapping my render function in function main () and adding onload to the body tag: <body onload='main()'>.
Inside the main function we can grab the values I needed:
const customer = JSON.parse(document.cookie.split('; ').find(row => row.startsWith('customer')).split('=')[1]);
const expiring = JSON.parse(document.cookie.split('; ').find(row => row.startsWith('expiring')).split('=')[1]);
FINALLY (and yes, of course this will all be used in an automated worker in the end) I can get my beautifully rendered PDF like so:
(async () => {
const browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.setCookie(...myCookies);
await page.goto('http://localhost:1234/testing');
await page.pdf({ path: `scratch-expiry-letter.pdf`, format: 'letter' });
await browser.close();
})();
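One caveat with the cookie parsing in main above: startsWith('customer') would also match a cookie named customer_id, and split('=')[1] cuts off any value that itself contains an "=". A slightly more defensive sketch, with a made-up helper name readCookie that takes the document.cookie string as an argument:

```javascript
// Returns the raw value of the cookie called `name`, or undefined.
// Matches the name exactly and keeps everything after the first "=".
function readCookie(cookieString, name) {
  for (const row of cookieString.split('; ')) {
    const eq = row.indexOf('=');
    if (eq === -1) continue;
    if (row.slice(0, eq) === name) return row.slice(eq + 1);
  }
  return undefined;
}
```

In the page this would be used as `JSON.parse(readCookie(document.cookie, 'expiring'))`.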
The problem is here:
await page.goto('http://localhost:1234/testing', { waitUntil: 'networkidle2' });
await page.setCookie(...myCookies);
The first line says, go to the page. Going to a page involves parsing the HTML and executing scripts, including your document.write('Cookie: ' + document.cookie); line in scratch.html, at which time there are no cookies on the page (assuming a clear browser cache).
After the page is loaded, await page.goto... returns and the line await page.setCookie(...myCookies); runs. This correctly sets your cookies and the remaining lines execute. const cookies = await page.cookies(); runs and pulls the newly-set cookies out and you write them to disk. await page.screenshot({path: 'scratch_shot.png'}); runs, taking a shot of the page without the DOM updated with the new cookies that were set after the initial document.write call.
You can fix this problem by turning your JS on the scratch.html page into a function that can be called after page load and cookies are set, or injecting such a function dynamically with Puppeteer using evaluate:
const puppeteer = require('puppeteer');
const myCookies = [
{name: 'customer', value: 'Frank'},
{name: 'expiring', value: JSON.stringify([{a: 1, b: 'three'}])}
];
(async () => {
const browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.goto('http://localhost:1234/testing');
await page.setCookie(...myCookies);
// now that the cookies are ready, we can write to the document
await page.evaluate(() => document.write('Cookie: ' + document.cookie));
await page.screenshot({path: 'scratch_shot.png'});
await browser.close();
})();
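A more general approach is to set the cookies before navigation. This way, the cookies will already exist when any scripts that might use them run. Because the blank page has no URL yet, each cookie needs an explicit domain (or url) so the browser knows where it belongs.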
const puppeteer = require('puppeteer');
const myCookies = [
{
name: 'expiring',
value: '[{"a":1,"b":"three"}]',
domain: 'localhost',
path: '/',
expires: -1,
size: 29,
httpOnly: false,
secure: false,
session: true,
sameParty: false,
sourceScheme: 'NonSecure',
sourcePort: 80
},
{
name: 'customer',
value: 'Frank',
domain: 'localhost',
path: '/',
expires: -1,
size: 13,
httpOnly: false,
secure: false,
session: true,
sameParty: false,
sourceScheme: 'NonSecure',
sourcePort: 80
}
];
(async () => {
const browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.setCookie(...myCookies);
await page.goto('http://localhost:1234/testing');
await page.screenshot({path: 'scratch_shot.png'});
await browser.close();
})();
That said, I'm not sure if cookies are the easiest or best way to do what you're trying to do. Since you're serving HTML, you could pass the data along with it statically, expose a separate API route to collect a customer's data which the front end can use, or pass GET parameters, depending on the nature of the data and what you're ultimately trying to accomplish.
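As a rough illustration of the GET-parameter option, assuming the payload is small enough to fit in a URL: the function names buildLetterUrl and readLetterParams below are made up, and the parameter names simply mirror the cookie names used earlier.

```javascript
// Puppeteer side: encode the data into the query string.
function buildLetterUrl(baseUrl, customer, expiring) {
  const url = new URL(baseUrl);
  url.searchParams.set('customer', JSON.stringify(customer));
  url.searchParams.set('expiring', JSON.stringify(expiring));
  return url.toString();
}

// Page side: decode it again, e.g. readLetterParams(window.location.href).
function readLetterParams(href) {
  const params = new URL(href).searchParams;
  return {
    customer: JSON.parse(params.get('customer')),
    expiring: JSON.parse(params.get('expiring')),
  };
}
```

Puppeteer would then navigate with `await page.goto(buildLetterUrl('http://localhost:1234/testing', customer, expiring));`, and no cookie ordering is involved.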
You could even have a file upload form on the React app, then have Puppeteer upload the JSON data into the app programmatically through that form.
In fact, if your final goal is to dynamically generate a PDF, using React and Puppeteer might be overkill, but I'm not sure I have a better solution to offer without some research and additional context about your use case.
Currently I have my Puppeteer script running with a proxy on Heroku. Locally the proxy relay works totally fine. On Heroku, however, I get the error Error: net::ERR_TUNNEL_CONNECTION_FAILED. I've set all the .env info in the Heroku config vars, so the values are all available.
Any idea how I can fix this error and resolve the issue?
I currently have
const browser = await puppeteer.launch({
args: [
"--proxy-server=https=myproxy:myproxyport",
"--no-sandbox",
'--disable-gpu',
"--disable-setuid-sandbox",
],
timeout: 0,
headless: true,
});
page.authenticate
The correct format for proxy-server argument is,
--proxy-server=HOSTNAME:PORT
If it's an HTTPS proxy, you can pass the username and password using page.authenticate before doing any navigation,
page.authenticate({username:'user', password:'password'});
Complete code would look like this,
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({
headless:false,
ignoreHTTPSErrors:true,
args: ['--no-sandbox','--proxy-server=HOSTNAME:PORT']
});
const page = await browser.newPage();
// Authenticate Here
await page.authenticate({username:'user', password:'password'});
await page.goto('https://www.example.com/');
})();
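If the proxy is configured as a single URL with embedded credentials (for example in a Heroku config var, which is an assumption here), it has to be split into the HOSTNAME:PORT launch argument and the page.authenticate credentials. A small sketch with a hypothetical helper splitProxyUrl:

```javascript
// Splits "http://user:pass@host:port" into the --proxy-server argument
// and the object expected by page.authenticate.
function splitProxyUrl(proxyUrl) {
  const url = new URL(proxyUrl);
  return {
    // url.host is "hostname:port"; the port is omitted by URL when it is
    // the default for the scheme (e.g. 80 for http).
    launchArg: `--proxy-server=${url.host}`,
    credentials: {
      // URL keeps credentials percent-encoded, so decode them.
      username: decodeURIComponent(url.username),
      password: decodeURIComponent(url.password),
    },
  };
}
```

The result would be used as `args: ['--no-sandbox', launchArg]` at launch and `await page.authenticate(credentials)` before the first navigation.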
Proxy Chain
If authentication does not work using the method above, you might want to handle it somewhere else.
There are multiple packages for that. One is proxy-chain, which takes an existing proxy and exposes it as a new proxy server.
proxyChain.anonymizeProxy(proxyUrl) takes a proxy URL with a username and password and creates a new proxy which you can use in your script.
const puppeteer = require('puppeteer');
const proxyChain = require('proxy-chain');
(async() => {
const oldProxyUrl = 'http://username:password@hostname:8000';
const newProxyUrl = await proxyChain.anonymizeProxy(oldProxyUrl);
// Prints something like "http://127.0.0.1:12345"
console.log(newProxyUrl);
const browser = await puppeteer.launch({
args: [`--proxy-server=${newProxyUrl}`],
});
// Do your magic here...
const page = await browser.newPage();
await page.goto('https://www.example.com');
})();
We have some routing logic that kicks you to the homepage if you don't have a JWT_TOKEN set. I want to set this before the page loads, before the JS is invoked.
How do I do this?
You have to register localStorage item like this:
await page.evaluate(() => {
localStorage.setItem('token', 'example-token');
});
You should do it after page.goto: the browser must have a URL to register the local storage item on. After this, load the same page once again; this time the token should be there before the page is loaded.
Here is a fully working example:
const puppeteer = require('puppeteer');
const http = require('http');
const html = `
<html>
<body>
<div id="element"></div>
<script>
document.getElementById('element').innerHTML =
localStorage.getItem('token') ? 'signed' : 'not signed';
</script>
</body>
</html>`;
http
.createServer((req, res) => {
res.writeHead(200, { 'Content-Type': 'text/html' });
res.write(html);
res.end();
})
.listen(8080);
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('http://localhost:8080/');
await page.evaluate(() => {
localStorage.setItem('token', 'example-token');
});
await page.goto('http://localhost:8080/');
const text = await page.evaluate(
() => document.querySelector('#element').textContent
);
console.log(text);
await browser.close();
process.exit(0);
})();
There's some discussion about this in Puppeteer's GitHub issues.
You can load a page on the domain, set your localStorage, then go to the actual page you want to load with localStorage ready. You can also intercept the first URL load to return instantly instead of actually loading the page, potentially saving a lot of time.
const doSomePuppeteerThings = async () => {
const url = 'http://example.com/';
const browser = await puppeteer.launch();
const localStorage = { storageKey: 'storageValue' };
await setDomainLocalStorage(browser, url, localStorage);
const page = await browser.newPage();
// do your actual puppeteer things now
};
const setDomainLocalStorage = async (browser, url, values) => {
const page = await browser.newPage();
await page.setRequestInterception(true);
page.on('request', r => {
r.respond({
status: 200,
contentType: 'text/plain',
body: 'tweak me.',
});
});
await page.goto(url);
await page.evaluate(values => {
for (const key in values) {
localStorage.setItem(key, values[key]);
}
}, values);
await page.close();
};
In 2021, it works with the following code:
// store in localstorage the token
await page.evaluateOnNewDocument (
token => {
localStorage.clear();
localStorage.setItem('token', token);
}, 'eyJh...9_8cw');
// open the url
await page.goto('http://localhost:3000/Admin', { waitUntil: 'load' });
The next line from the first comment does not work unfortunately
await page.evaluate(() => {
localStorage.setItem('token', 'example-token'); // not work, produce errors :(
});
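The evaluateOnNewDocument approach above can be wrapped to seed several keys at once. This is only a sketch: seedLocalStorage is a made-up name, values are assumed to be strings, and because the injected script runs on every new document it re-applies the values on each navigation.

```javascript
// Registers a script that writes the given key/value pairs to localStorage
// before any of the page's own scripts run.
async function seedLocalStorage(page, values) {
  await page.evaluateOnNewDocument((entries) => {
    // Runs inside the browser, not in Node.
    for (const [key, value] of entries) {
      localStorage.setItem(key, value);
    }
  }, Object.entries(values));
}
```

Call it before the first navigation, e.g. `await seedLocalStorage(page, { token: 'example-token' });` followed by `await page.goto(...)`.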
This would work without requiring a double goto:
const browser = await puppeteer.launch();
browser.on('targetchanged', async (target) => {
const targetPage = await target.page();
const client = await targetPage.target().createCDPSession();
await client.send('Runtime.evaluate', {
expression: `localStorage.setItem('hello', 'world')`,
});
});
// newPage, goTo, etc...
Adapted from the Lighthouse doc for Puppeteer, which does something similar: https://github.com/GoogleChrome/lighthouse/blob/master/docs/puppeteer.md
Try an additional script tag. Example:
Say you have a main.js script that houses your routing logic.
Then a setJWT.js script that houses your token logic.
Then within your html that is loading these scripts order them in this way:
<script src='setJWT.js'></script>
<script src='main.js'></script>
This would only be good for initial start of the page.
Most routing libraries, however, usually have an event hook system that you can hook into before a route renders. I would store the setJWT logic somewhere in that callback.