Page doesn't see cookies in Puppeteer - javascript

EDIT for Mission Clarity: In the end I am pulling inventory data and customer data from Postgres to render and send a bunch of PDFs to customers, once per month.
These PDFs are dynamic in that the cover page will have varying customer name/address. The next page(s) are also dynamic, as they are lists of a particular customer's expiring inventory with item/expiry date/serial number.
I had made a client-side React page with print CSS to render some print-layout letters that could be printed off/saved as a pretty PDF.
Then, the waterfall spec came in that this was to be an automated process on the server. Basically, the PDF needs to be attached to an email alerting customers of expiring product (this is the med industry, where everything needs to be audited).
I thought using Puppeteer would be a nice and easy switch. Just add a route that processes all customers, looking up whatever may be expiring, and then passing that into the dynamic react page to be rendered headless to a PDF file (and eventually finish the whole rest of the plan, sending email, etc.). Right now I just grab 10 customers and their expiring stock for PoC, then I have basically: { customer: {}, expiring: [] }.
I've attempted POSTing to the page with request interception, but I suppose it makes sense that I cannot read POST data in the browser.
So, I switched my approach to using cookies. This I would expect to work, but I can never read the cookie(s) into the page.
Here is a simple route, a simple Puppeteer script (which writes the cookies out to a JSON file and takes a screenshot just for proof), and simple HTML with a script, all of which I'm using just to try to prove I can pass data along.
server/index.js:
app.get('/testing', async (req, res) => {
  console.log('GET /testing');
  res.sendFile(path.join(__dirname, 'scratch.html'));
});
scratch.js (run at the command line with node ./scratch.js):
const puppeteer = require('puppeteer');
const fs = require('fs');

const myCookies = [
  {name: 'customer', value: 'Frank'},
  {name: 'expiring', value: JSON.stringify([{a: 1, b: 'three'}])}
];

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://localhost:1234/testing', { waitUntil: 'networkidle2' });
  await page.setCookie(...myCookies);
  const cookies = await page.cookies();
  const cookieJson = JSON.stringify(cookies);
  // Writes expected cookies to file for sanity check.
  fs.writeFileSync('scratch_cookies.json', cookieJson);
  // FIXME: Cookies never get appended to page.
  await page.screenshot({path: 'scratch_shot.png'});
  await browser.close();
})();
server/scratch.html:
<html>
  <body>
    <script type='text/javascript'>
      document.write('Cookie: ' + document.cookie);
    </script>
  </body>
</html>
The result is just a PNG with the word "Cookie:" on it. Any insight appreciated!
This is the actual route I'm using where makeExpiryLetter is utilizing puppeteer, but I can't seem to get it to actually read the customer and rows data.
app.get('/create-expiry-letter', async (req, res) => {
  // Create PDF file using puppeteer to render React page w/ data.
  // Store in Db.
  // Email file.
  // Send final count of letters sent back for notification in GUI.
  const cc = await dbo.getConsignmentCustomers();
  const result = await Promise.all(cc.rows.map(async x => {
    // Get 0-60 day consignments by customer_id.
    const { rows } = await dbo.getExpiry0to60(x.customer_id);
    if (rows && rows.length > 0) {
      const expiryLetter = await makeExpiryLetter(x, rows); // Uses puppeteer.
      // TODO: Store in Db / Email file.
      return true;
    } else {
      return false;
    }
  }));
  res.json({ emails_sent: result.filter(x => x === true).length });
});
Thanks to the samples from @ggorlen, I've made huge headway using cookies. In the inline script of expiry.html I'm grabbing the values by wrapping my render function in function main() and adding an onload to the body tag: <body onload='main()'>.
Inside the main function we can grab the values I needed:
const customer = JSON.parse(document.cookie.split('; ').find(row => row.startsWith('customer')).split('=')[1]);
const expiring = JSON.parse(document.cookie.split('; ').find(row => row.startsWith('expiring')).split('=')[1]);
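One caveat with this split-on-"; " approach: JSON values can contain characters (semicolons, commas, equals signs) that break naive cookie parsing, so it is safer to URI-encode the value when setting it and decode when reading. The helper names below are mine, not from the post; this is a sketch of that idea using a plain string standing in for document.cookie:

```javascript
// Hypothetical helpers (names are mine): encode a JS value into a
// cookie-safe string, and read it back out of a document.cookie string.
const encodeCookieValue = (value) => encodeURIComponent(JSON.stringify(value));

const readCookieValue = (cookieString, name) => {
  // document.cookie looks like "a=1; b=2"; find the entry for `name`.
  const entry = cookieString
    .split('; ')
    .find((row) => row.startsWith(name + '='));
  if (!entry) return undefined;
  // Slice past "name=" so any "=" inside the value stays intact.
  return JSON.parse(decodeURIComponent(entry.slice(name.length + 1)));
};

// Example round trip, standing in for document.cookie:
const cookieString = 'customer=Frank; expiring=' +
  encodeCookieValue([{ a: 1, b: 'three' }]);
```

On the Puppeteer side you would then set the cookie with value: encodeCookieValue(rows) instead of raw JSON.stringify(rows).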
FINALLY (and yes, of course this will all be used in an automated worker in the end) I can get my beautifully rendered PDF like so:
(async () => {
  const browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.setCookie(...myCookies);
  await page.goto('http://localhost:1234/testing');
  await page.pdf({ path: 'scratch-expiry-letter.pdf', format: 'letter' });
  await browser.close();
})();

The problem is here:
  await page.goto('http://localhost:1234/testing', { waitUntil: 'networkidle2' });
  await page.setCookie(...myCookies);
The first line says: go to the page. Navigating to a page involves parsing the HTML and executing its scripts, including your document.write('Cookie: ' + document.cookie); line in scratch.html, and at that moment there are no cookies on the page (assuming a clear browser cache).
After the page is loaded, await page.goto... returns and the line await page.setCookie(...myCookies); runs. This correctly sets your cookies, and the remaining lines execute: const cookies = await page.cookies(); pulls the newly set cookies out and you write them to disk, then await page.screenshot({path: 'scratch_shot.png'}); takes a shot of a page whose DOM was written before the cookies existed.
You can fix this problem by turning your JS on the scratch.html page into a function that can be called after page load and cookies are set, or injecting such a function dynamically with Puppeteer using evaluate:
const puppeteer = require('puppeteer');

const myCookies = [
  {name: 'customer', value: 'Frank'},
  {name: 'expiring', value: JSON.stringify([{a: 1, b: 'three'}])}
];

(async () => {
  const browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.goto('http://localhost:1234/testing');
  await page.setCookie(...myCookies);
  // Now that the cookies are ready, we can write to the document.
  await page.evaluate(() => document.write('Cookie: ' + document.cookie));
  await page.screenshot({path: 'scratch_shot.png'});
  await browser.close();
})();
A more general approach is to set the cookies before navigation. This way, the cookies will already exist when any scripts that might use them run.
const puppeteer = require('puppeteer');

const myCookies = [
  {
    name: 'expiring',
    value: '[{"a":1,"b":"three"}]',
    domain: 'localhost',
    path: '/',
    expires: -1,
    size: 29,
    httpOnly: false,
    secure: false,
    session: true,
    sameParty: false,
    sourceScheme: 'NonSecure',
    sourcePort: 80
  },
  {
    name: 'customer',
    value: 'Frank',
    domain: 'localhost',
    path: '/',
    expires: -1,
    size: 13,
    httpOnly: false,
    secure: false,
    session: true,
    sameParty: false,
    sourceScheme: 'NonSecure',
    sourcePort: 80
  }
];

(async () => {
  const browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.setCookie(...myCookies);
  await page.goto('http://localhost:1234/testing');
  await page.screenshot({path: 'scratch_shot.png'});
  await browser.close();
})();
That said, I'm not sure if cookies are the easiest or best way to do what you're trying to do. Since you're serving HTML, you could pass the data along with it statically, expose a separate API route to collect a customer's data which the front end can use, or pass GET parameters, depending on the nature of the data and what you're ultimately trying to accomplish.
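Of those alternatives, the GET-parameter route is easy to sketch. The helper below is illustrative only (the function name is mine, and the /testing URL reuses the post's example): the Puppeteer side packs the data into the query string, and the page's inline script would read it back with URLSearchParams.

```javascript
// Sketch of the GET-parameter alternative (illustrative names).
// Puppeteer/server side: pack the customer data into a query string.
const buildLetterUrl = (base, customer, expiring) => {
  const params = new URLSearchParams({
    customer: JSON.stringify(customer),
    expiring: JSON.stringify(expiring),
  });
  return `${base}?${params}`;
};

// Page side, in the inline script, you would read it back with e.g.:
//   const params = new URLSearchParams(location.search);
//   const customer = JSON.parse(params.get('customer'));

const url = buildLetterUrl('http://localhost:1234/testing',
  { name: 'Frank' }, [{ a: 1, b: 'three' }]);
// Puppeteer then simply does: await page.goto(url);
```

No cookies and no navigation-ordering pitfalls: the data is available the moment the page's scripts run.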
You could even have a file upload form on the React app, then have Puppeteer upload the JSON data into the app programmatically through that form.
In fact, if your final goal is to dynamically generate a PDF, using React and Puppeteer might be overkill, but I'm not sure I have a better solution to offer without some research and additional context about your use case.

Related

Is there a way to open multiple tabs simultaneously on Playwright or Puppeteer to complete the same tasks?

I just started coding, and I was wondering if there was a way to open multiple tabs concurrently with one another. Currently, my code goes something like this:
const puppeteer = require("puppeteer");
const rand_url = "https://www.google.com";

async function initBrowser() {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(rand_url);
  await page.setViewport({
    width: 1200,
    height: 800,
  });
  return page;
}

async function login(page) {
  await page.goto("https://www.google.com");
  await page.waitFor(100);
  await page.type("input[id='user_login']", "xxx");
  await page.waitFor(100);
  await page.type("input[id='user_password']", "xxx");
}
this is not my exact code, replaced with different aliases, but you get the idea. I was wondering if there was anyone out there that knows the code that allows this same exact browser to be opened on multiple instances, replacing the respective login info only. Of course, it would be great to prevent my IP from getting banned too, so if there was a way to apply proxies to each respective "browser"/ instance, that would be perfect.
Lastly, I would like to know whether or not playwright or puppeteer is superior in the way they can handle these multiple instances. I don't even know if this is a possibility, but please enlighten me. I want to learn more.
You can use multiple browser windows with different logins/cookies.
For simplicity, you can use the puppeteer-cluster module by Thomas Dondorf.
This module launches and queues your Puppeteer tasks (one by one, or several in parallel), so you can use it to automate your logins and even save the login cookies for later launches.
Feel free to check out the GitHub repo: https://github.com/thomasdondorf/puppeteer-cluster
const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2, // <= number of tasks running in parallel
  });
  // You could also base maxConcurrency on the number of CPUs, for example:
  // const cpuNumber = require('os').cpus().length;

  await cluster.task(async ({ page, data: [username, password] }) => {
    await page.goto('https://www.example.com');
    await page.waitForTimeout(100);
    await page.type('input[id="user_login"]', username);
    await page.waitForTimeout(100);
    await page.type('input[id="user_password"]', password);
    const screen = await page.screenshot();
    // Store screenshot, save cookies, do something else...
  });

  cluster.queue(['myFirstUsername', 'PassW0Rd1']);
  cluster.queue(['anotherUsername', 'Secr3tAgent!']);
  // cluster.queue([username, password])
  // Each username/password array is passed into the cluster task function.
  // ...many more pages/accounts

  await cluster.idle();
  await cluster.close();
})();
For Playwright, which sadly is still unsupported by the module above, you can use a browser pool (cluster) module to automate the Playwright launcher.
And for proxy usage, I recommend the Puppeteer library as the legendary one.
Don't forget to choose my answer as the right one if this helps you.
There are profiling and proxy options; you could combine them to achieve your goal:
Profile, https://playwright.dev/docs/api/class-browsertype#browser-type-launch-persistent-context
import { chromium } from 'playwright'
const userDataDir = '/tmp/' + process.argv[2]
const browserContext = await chromium.launchPersistentContext(userDataDir)
// ...
Proxy, https://playwright.dev/docs/api/class-browsertype#browser-type-launch
import { chromium } from 'playwright'

const proxy = { /* secret */ }
const browser = await chromium.launch({
  proxy: { server: 'pre-context' }
})
const browserContext = await browser.newContext({
  proxy: {
    server: `http://${proxy.ip}:${proxy.port}`,
    username: proxy.username,
    password: proxy.password,
  }
})
// ...

Why will puppeteer not click on the video

I am currently writing a simple program that grabs the name of a song from my discord bot, finds the video, and passes it to a function to convert to mp3. My problem is that puppeteer doesn't click on the video and instead just returns the search page link.
Here is my code to grab the link and pass it through download:
async function findSongName(stringWithName) {
  let stringName = stringWithName.replace(commands.play, '');
  const link = 'https://www.youtube.com';
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(link);
  await page.type('ytd-searchbox#search.style-scope.ytd-masthead', stringName);
  await page.keyboard.press('Enter');
  await page.click('yt-interaction#extended');
  console.log(page.url());
  await browser.close();
}
It sounds like you want to get the title and URL of the top result for a YT search. For starters, you don't need to start at the YT homepage. Just navigate to https://www.youtube.com/results?search_query=${yourQuery} to speed things up and reduce complexity.
Next, if you view the page source of /results, there's a large (~1 MB) global data structure called ytInitialData that has all of the relevant results in it (along with a lot of other irrelevant stuff, admittedly). Theoretically, you could grab the page with Axios, parse out ytInitialData with Cheerio, grab your data using plain array/object JS and skip Puppeteer entirely.
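A rough sketch of that idea follows. Note it uses a plain regex rather than Cheerio to pull the blob out, and the extraction pattern and function name are my assumptions; YT's markup can change at any time:

```javascript
// Hypothetical extractor for the ytInitialData blob. The
// "ytInitialData = {...};" pattern is an assumption about YT's page
// source and may break without notice.
const extractYtInitialData = (html) => {
  const match = html.match(/ytInitialData\s*=\s*(\{[\s\S]*?\})\s*;/);
  return match ? JSON.parse(match[1]) : null;
};

// With Axios it might look like this (not run here; requires network):
// const { data } = await axios.get(
//   'https://www.youtube.com/results?search_query=stack+overflow');
// const ytData = extractYtInitialData(data);
// ...then walk the object for the video list.
```

The lazy match plus the trailing semicolon anchor lets the regex swallow nested braces, though a "};" inside a JSON string would still defeat it, which is part of why this is a sketch and not production code.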
Of course, using the YT search API is the most reliable and proper way.
Since you're using Puppeteer, though, the data can be pulled out of the "#items a#video-title" elements as follows:
const puppeteer = require("puppeteer");

const searchYT = async (page, searchQuery) => {
  const encodedQuery = encodeURIComponent(searchQuery);
  const url = `https://www.youtube.com/results?search_query=${encodedQuery}`;
  await page.goto(url);
  const sel = "a#video-title";
  await page.waitForSelector(sel);
  return page.$$eval(sel, els =>
    els.map(e => ({
      title: e.textContent.trim(),
      href: e.href,
    }))
  );
};

let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  await page.setRequestInterception(true);
  page.on("request", req => {
    req.resourceType() === "image" ? req.abort() : req.continue();
  });
  const results = await searchYT(page, "stack overflow");
  console.log(results);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Output (for the search term "stack overflow"):
[
  {
    title: 'Stack Overflow is full of idiots.',
    href: 'https://www.youtube.com/watch?v=I_ZK0t9-llo'
  },
  {
    title: "How To Use Stack Overflow (no, ForrestKnight, it's not full of idiots)",
    href: 'https://www.youtube.com/watch?v=sMIslcynm0Q'
  },
  {
    title: 'How to use Stack Overflow as a Beginner ?',
    href: 'https://www.youtube.com/watch?v=Vt-Wf7d0CFo'
  },
  {
    title: 'How Microsoft Uses Stack Overflow for Teams',
    href: 'https://www.youtube.com/watch?v=mhh0aK6yJgA'
  },
  // ...
]
Since you only want the first result, it's the first element of this array. If you want more than the initial batch, either work through ytInitialData as described above or scroll the page down with Puppeteer.
Now that you have the URL of the video you want to turn into an mp3, I'd recommend youtube-dl. There are Node wrappers you can install to access its API easily, such as node-youtube-dl (the first result when I searched; I haven't used it myself).

How to set Cookies enabled in puppeteer

I am currently running into a problem in a Puppeteer project that makes me think cookies are not activated. But I also don't know how to activate them if they are not already activated from the beginning.
Since every website nowadays has a captcha, we can skip the auto-login part.
I'm new too; I got an idea from here for this.
First, check if there is a saved cookies.json file. If not, log in manually: click the submit button yourself and solve the captcha puzzle (in non-headless mode), and the page should be redirected to the destination page.
Once the destination page is loaded, save the cookies to a JSON file for next time.
Example:
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch({
    headless: false, // launch in non-headless mode so you can see graphics
    defaultViewport: null
  });
  let [page] = await browser.pages();
  await page.setRequestInterception(true);

  const getCookies = async (page) => {
    // Get page cookies
    const cookies = await page.cookies();
    // Save cookies to file
    fs.writeFile('./cookies.json', JSON.stringify(cookies, null, 4), (err) => {
      if (err) console.log(err);
    });
  };

  const setCookies = async (page) => {
    // Get cookies from file as a string
    let cookiesString = fs.readFileSync('./cookies.json', 'utf8');
    // Parse string
    let cookies = JSON.parse(cookiesString);
    // Set page cookies
    await page.setCookie.apply(page, cookies);
  };

  page.on('request', async (req) => {
    // If the logged-in URL is already loaded
    if (req.url() === 'https://example.com/LoggedUserCP') {
      console.log('logged in, get the cookies');
      await getCookies(page);
    // If it's on the login page, try to set the existing cookies
    } else if (req.url() === 'https://example.com/Login?next=LoggedUserCP') {
      await setCookies(page);
      console.log('set the saved cookies');
    }
    // Otherwise go to the login page and log in yourself
    req.continue();
  });

  await page.goto('https://example.com/Login?next=LoggedUserCP');
})();

Puppeteer Login to Instagram

I'm trying to login into Instagram with Puppeteer, but somehow I'm unable to do it.
Can you help me?
Here is the link I'm using:
https://www.instagram.com/accounts/login/
I tried different stuff. The last code I tried was this:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.instagram.com/accounts/login/');
  await page.evaluate();
  await afterJS.type('#f29d14ae75303cc', 'username');
  await afterJS.type('#f13459e80cdd114', 'password');
  await page.pdf({path: 'page.pdf', format: 'A4'});
  await browser.close();
})();
Thanks in advance!
OK, you're on the right track, but you just need to change a few things.
Firstly, I have no idea where your afterJS variable comes from. Either way, you won't need it.
You're typing data into the username and password input fields but aren't asking Puppeteer to actually click the log in button to complete the log in process.
page.evaluate() is used to execute JavaScript code inside the page context (i.e. on the web page loaded in the remote browser), so you don't need it here.
I would refactor your code to look like the following:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.instagram.com/accounts/login/');
  await page.waitForSelector('input[name="username"]');
  await page.type('input[name="username"]', 'username');
  await page.type('input[name="password"]', 'password');
  await page.click('button[type="submit"]');
  // Add a wait for some selector on the home page to load to ensure the next step works correctly
  await page.pdf({path: 'page.pdf', format: 'A4'});
  await browser.close();
})();
Hopefully this sets you down the right path to getting past the login page!
Update 1:
You've enquired about parsing the text of an element on Instagram... unfortunately I don't have an account there myself, so I can't give you an exact solution, but hopefully this still proves of some value.
So you're trying to evaluate an element's text, right? You can do that as follows:
const text = await page.$eval(cssSelector, (element) => {
  return element.textContent;
});
All you have to do is replace cssSelector with the selector of the element you wish to retrieve the text from.
Update 2:
OK lastly, you've enquired about scrolling down to an element within a parent element. I'm not going to steal the credit from someone else so here's the answer to that:
How to scroll to an element inside a div?
What you'll have to do is basically follow the instructions in there and get that to work with puppeteer similar to as follows:
await page.evaluate(() => {
  const lastLink = document.querySelectorAll('h3 > a')[2];
  const topPos = lastLink.offsetTop;
  const parentDiv = document.querySelector('div[class*="eo2As"]');
  parentDiv.scrollTop = topPos;
});
Bear in mind that I haven't tested that code - I've just directly followed the answer in the URL I've provided. It should work!
You can log in to Instagram using the following example code:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Wait until the page has loaded
  await page.goto('https://www.instagram.com/accounts/login/', {
    waitUntil: 'networkidle0',
  });

  // Wait for the log in form
  await Promise.all([
    page.waitForSelector('[name="username"]'),
    page.waitForSelector('[name="password"]'),
    page.waitForSelector('[type="submit"]'),
  ]);

  // Enter username and password
  await page.type('[name="username"]', 'username');
  await page.type('[name="password"]', 'password');

  // Submit log in credentials and wait for navigation
  await Promise.all([
    page.click('[type="submit"]'),
    page.waitForNavigation({
      waitUntil: 'networkidle0',
    }),
  ]);

  // Download PDF
  await page.pdf({
    path: 'page.pdf',
    format: 'A4',
  });
  await browser.close();
})();

How to recreate a page with all of the cookies?

I am trying to:
Visit a page that initialises a session
Store the session in a JSON object
Visit the same page, which now should recognise the existing session
The implementation I have attempted is as follows:
import puppeteer from 'puppeteer';
const createSession = async (browser, startUrl) => {
  const page = await browser.newPage();
  await page.goto(startUrl);
  await page.waitForSelector('#submit');
  const cookies = await page.cookies();
  const url = await page.url();
  return {
    cookies,
    url
  };
};

const useSession = async (browser, session) => {
  const page = await browser.newPage();
  for (const cookie of session.cookies) {
    await page.setCookie(cookie);
  }
  await page.goto(session.url);
};

const run = async () => {
  const browser = await puppeteer.launch({
    headless: false
  });
  const session = await createSession(browser, 'http://foo.com/');
  // The session has been established
  await useSession(browser, session);
  await useSession(browser, session);
};

run();
createSession is used to capture the cookies of the loaded page.
useSession is expected to load the page using the existing cookies.
However, this does not work: the session.url page does not recognise the session. It appears that not all cookies are being captured this way.
It turns out that page#cookies() returns some cookies with the session=true, expires=0 configuration, and setCookie ignores these values.
I worked around this by constructing a new cookies array overriding the expires and session properties.
const cookies = await page.cookies();
const sessionFreeCookies = cookies.map((cookie) => {
  return {
    ...cookie,
    expires: Date.now() / 1000 + 10 * 60,
    session: false
  };
});
At the time of writing this answer, the session property is not documented. Refer to the following issue: https://github.com/GoogleChrome/puppeteer/issues/980.
Puppeteer page.cookies() method only fetches cookies for the current page domain. However, there might be cases where it can have cookies from different domains as well.
You can call the internal method Network.getAllCookies to fetch cookies from all the domains.
(async () => {
  const browser = await puppeteer.launch({});
  const page = await browser.newPage();
  await page.goto('https://stackoverflow.com', { waitUntil: 'networkidle2' });
  // Here we can get all of the cookies
  console.log(await page._client.send('Network.getAllCookies'));
})();
More on this thread here - Puppeteer get 3rd party cookies
