Waiting for Google Captcha to be displayed using puppeteer - javascript

I can't verify if a div exists or not on the page using Puppeteer, and I don't know why...
I would like to scope the captcha on the page using this code:
puppeteer
.launch(options)
.then(async (browser) => {
const page = await browser.newPage();
await page.setViewport({ width: 1280, height: 720 });
console.log("[+] Connecting...");
await page.goto("https://www.google.com/recaptcha/api2/demo");
console.log("[+] Connected");
page.waitForNavigation()
if (await page.$('div.recaptcha-checkbox-border') !== null)
console.log('[+] Resolving captcha');
})
.catch((err) => {
console.log(err);
});
But my if is alaways false and I don't know why.
Here is a screenshot of the element scoped manually:

This script should work
'use strict';
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({ headless : false});
const page = await browser.newPage();
console.log("Opening page");
await page.goto('https://www.google.com/recaptcha/api2/demo');
console.log("Opened page");
const frame = await page.frames().find(f => f.name().startsWith("a-"));
await frame.waitForSelector('div.recaptcha-checkbox-border');
console.log("Captcha exists!");
await browser.close();
})();
It appears that the captcha is inside an iframe that always starts with name=a- (however I can't confirm that with my limited testing)
You first need to get the iframe then await for the selector after the iframe has loaded. This is the output, try changing the iframe name or the selector name to see it fail.

Related

Popup form visible, but html code missing in Puppeteer

I'm currently trying to get some informations from a website (https://www.bauhaus.info/) and fail at the cookie popup form.
This is my code till now:
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.bauhaus.info');
await sleep(5000);
const html = await page.content();
fs.writeFileSync("./page.html", html, "UTF-8");
page.pdf({
path: './bauhaus.pdf',
format: 'a4'
});
});
function sleep(ms) {
return new Promise((resolve) => {
setTimeout(resolve, ms);
});
}
Till this everything works fine. But I can't accept the cookie banner, because I don't see the html from this banner in puppeteer. But in the pdf I can see the form.
My browser
Puppeteer
Why can I not see this popup in the html code?
Bonus quest: Is there any way to replace the sleep method with any page.await without knowing which js method triggers the cookie form to appear?
This element is in a shadow root. Please visit my answer in Puppeteer not giving accurate HTML code for page with shadow roots for additional information about the shadow DOM.
This code dips into the shadow root, waits for the button to appear, then clicks it:
const puppeteer = require("puppeteer"); // ^13.5.1
let browser;
(async () => {
browser = await puppeteer.launch({headless: false});
const [page] = await browser.pages();
const url = "https://www.bauhaus.info/";
await page.goto(url, {waitUntil: "domcontentloaded"});
const el = await page.waitForSelector("#usercentrics-root");
await page.waitForFunction(el =>
el.shadowRoot.querySelector(".sc-gsDKAQ.dejeIh"), {}, el
);
await el.evaluate(el =>
el.shadowRoot.querySelector(".sc-gsDKAQ.dejeIh").click()
);
await page.waitForTimeout(100000); // pause to show that it worked
})()
.catch(err => console.error(err))
.finally(() => browser?.close())
;

puppeteer can't get page source after using evaluate()

I'm using puppeteer to interact with a website using the evaluate() function to maniupulate page front (i.e to click on certain items etc...), click through works fine but I can't return the page source after clicking using evaluate.
I have recreated the error in this simplified script below it loads google.com, clicks on 'I feel lucky' and should then return the page source of the loaded page:
const puppeteer = require('puppeteer');
async function main() {
const browser = await puppeteer.launch({
headless: false,
args: ['--no-sandbox']
});
const page = await browser.newPage();
await page.goto('https://www.google.com/', {waitUntil: 'networkidle2'});
response = await page.evaluate(() => {
document.getElementsByClassName('RNmpXc')[1].click()
});
await page.waitForNavigation({waitUntil: 'load'});
console.log(response.text());
}
main();
I get the following error:
TypeError: Cannot read property 'text' of undefined
UPDATE New code following suggestion to use page.content()
const puppeteer = require('puppeteer');
async function main() {
const browser = await puppeteer.launch({
headless: false,
args: ['--no-sandbox']
});
const page = await browser.newPage();
await page.goto('https://www.google.com/', {waitUntil: 'networkidle2'});
await page.evaluate(() => {
document.getElementsByClassName('RNmpXc')[1].click()
});
const source = await page.content()
console.log(source);
}
main();
I am now getting the following error:
Error: Execution context was destroyed, most likely because of a navigation.
My question is: How can I return page source using the .text() method after manipulating the webpage using the evaluate() method?
All suggestions / insight / proposals would be very much appreciated thanks.
Since you're asking for page source after javascript modification, I'd assume you want DOM and not the original HTML content. your evaluate function doesn't return anything which results in undefined response. You can use
const source = await page.evaluate(() => new XMLSerializer().serializeToString(document.doctype) + document.documentElement.outerHTML);
or
const source = await page.content();

Puppeteer: Cannot find element loaded by Javascript

I've been using Puppeteer to scrape some websites, and it works well when the element I need is in the DOM; however I can't get it to work when the element is loaded via Javascript. E.g. please see my code below. More specifically, the page.waitForSelector always triggers a timeout error. I've tried a page.screenshot and the resulting image does show a fully loaded page, which contains this .evTextFont element.
How can I modify this code to successfully retrieve the .evTextFont element?
I've tried both Puppeteer versions 1.11 and 1.17, but am getting the same problem for both
Thanks a lot
Adapted from here
const puppeteer = require('puppeteer');
const URL = 'https://www.paintbar.com.au/events-1/moments-in-moonlight';
puppeteer.launch({ headless: true, args: ['--no-sandbox', '--disable-setuid-sandbox'] }).then(async browser => {
const page = await browser.newPage();
await page.setViewport({width: 1200, height: 600})
await page.goto(URL, {waitUntil: 'networkidle0'});
await page.waitForSelector('.evTextFont');
await page.addScriptTag({url: 'https://code.jquery.com/jquery-3.2.1.min.js'});
// await page.screenshot({ path: './image.jpg', type: 'jpeg' });
const result = await page.evaluate(() => {
try {
var data = [];
$('.evTextFont').each(function() {
const title = $(this).text();
data.push({
'title' : title,
});
});
return data;
} catch(err) {
console.log(err.toString());
}
});
await browser.close();
for(var i = 0; i < result.length; i++) {
console.log('Data: ' + result[i].title);
}
process.exit();
}).catch(function(error) {
console.error(error);
process.exit();
});
It happens because the event you're looking for is shown inside of an iframe element, from another site, so you need to find that iframe first and then do operation on it.
await page.goto(URL, {waitUntil: 'networkidle0'});
// Looking for the iframe with the event
const frame = (await page.frames()).find(f => f.url().includes("events.wix.com"));
// Then do work as before, but on that frame
await frame.waitForSelector('.evTextFont');
await frame.addScriptTag({url: 'https://code.jquery.com/jquery-3.2.1.min.js'});
const result = await frame.evaluate(() => {...})

Set localstorage items before page loads in puppeteer?

We have some routing logic that kicks you to the homepage if you dont have a JWT_TOKEN set... I want to set this before the page loads/before the js is invoked.
how do i do this ?
You have to register localStorage item like this:
await page.evaluate(() => {
localStorage.setItem('token', 'example-token');
});
You should do it after page page.goto - browser must have an url to register local storage item on it. After this, enter the same page once again, this time token should be here before the page is loaded.
Here is a fully working example:
const puppeteer = require('puppeteer');
const http = require('http');
const html = `
<html>
<body>
<div id="element"></div>
<script>
document.getElementById('element').innerHTML =
localStorage.getItem('token') ? 'signed' : 'not signed';
</script>
</body>
</html>`;
http
.createServer((req, res) => {
res.writeHead(200, { 'Content-Type': 'text/html' });
res.write(html);
res.end();
})
.listen(8080);
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('http://localhost:8080/');
await page.evaluate(() => {
localStorage.setItem('token', 'example-token');
});
await page.goto('http://localhost:8080/');
const text = await page.evaluate(
() => document.querySelector('#element').textContent
);
console.log(text);
await browser.close();
process.exit(0);
})();
There's some discussion about this in Puppeteer's GitHub issues.
You can load a page on the domain, set your localStorage, then go to the actual page you want to load with localStorage ready. You can also intercept the first url load to return instantly instead of actually load the page, potentially saving a lot of time.
const doSomePuppeteerThings = async () => {
const url = 'http://example.com/';
const browser = await puppeteer.launch();
const localStorage = { storageKey: 'storageValue' };
await setDomainLocalStorage(browser, url, localStorage);
const page = await browser.newPage();
// do your actual puppeteer things now
};
const setDomainLocalStorage = async (browser, url, values) => {
const page = await browser.newPage();
await page.setRequestInterception(true);
page.on('request', r => {
r.respond({
status: 200,
contentType: 'text/plain',
body: 'tweak me.',
});
});
await page.goto(url);
await page.evaluate(values => {
for (const key in values) {
localStorage.setItem(key, values[key]);
}
}, values);
await page.close();
};
in 2021 it work with following code:
// store in localstorage the token
await page.evaluateOnNewDocument (
token => {
localStorage.clear();
localStorage.setItem('token', token);
}, 'eyJh...9_8cw');
// open the url
await page.goto('http://localhost:3000/Admin', { waitUntil: 'load' });
The next line from the first comment does not work unfortunately
await page.evaluate(() => {
localStorage.setItem('token', 'example-token'); // not work, produce errors :(
});
Without requiring to double goTo this would work:
const browser = await puppeteer.launch();
browser.on('targetchanged', async (target) => {
const targetPage = await target.page();
const client = await targetPage.target().createCDPSession();
await client.send('Runtime.evaluate', {
expression: `localStorage.setItem('hello', 'world')`,
});
});
// newPage, goTo, etc...
Adapted from the lighthouse doc for puppeteer that do something similar: https://github.com/GoogleChrome/lighthouse/blob/master/docs/puppeteer.md
Try and additional script tag. Example:
Say you have a main.js script that houses your routing logic.
Then a setJWT.js script that houses your token logic.
Then within your html that is loading these scripts order them in this way:
<script src='setJWT.js'></script>
<script src='main.js'></script>
This would only be good for initial start of the page.
Most routing libraries, however, usually have an event hook system that you can hook into before a route renders. I would store the setJWT logic somewhere in that callback.

Open multiple links in new tab and switch focus with a loop with puppeteer?

I have multiple links in a single page whom I would like to access either sequentially or all at once. What I want to do is open all the links in their respective new tabs and get the page as pdf for all the pages. How do I achieve the same with puppeteer?
I can get all the links with a DOM and href property but I don't know how to open them in new tab access them and then close them.
You can open a new page in a loop:
const puppeteer = require('puppeteer');
(async () => {
try {
const browser = await puppeteer.launch();
const urls = [
'https://www.google.com',
'https://www.duckduckgo.com',
'https://www.bing.com',
];
const pdfs = urls.map(async (url, i) => {
const page = await browser.newPage();
console.log(`loading page: ${url}`);
await page.goto(url, {
waitUntil: 'networkidle0',
timeout: 120000,
});
console.log(`saving as pdf: ${url}`);
await page.pdf({
path: `${i}.pdf`,
format: 'Letter',
printBackground: true,
});
console.log(`closing page: ${url}`);
await page.close();
});
Promise.all(pdfs).then(() => {
browser.close();
});
} catch (error) {
console.log(error);
}
})();
To open a new tab (activate) it you just need to make a call to page.bringToFront()
const page1 = await browser.newPage();
await page1.goto('https://www.google.com');
const page2 = await browser.newPage();
await page2.goto('https://www.bing.com');
const pageList = await browser.pages();
console.log("NUMBER TABS:", pageList.length);
//switch tabs here
await page1.bringToFront();
//Do something... save as pdf
await page2.bringToFront();
//Do something... save as pdf
I suspect you have an array of pages so you might need to tweak the above code to cater for that.
As for generating a single pdf from multiple tabs I am pretty certain this is not possible. I suspect there will be a node library that can take multiple pdf files and merge into one.
pdf-merge might be what you are looking for.
You can also use a for loop.
(async ()=>{
const movieURL= ["https://www.imdb.com/title/tt0234215", "https://www.imdb.com/title/tt0411008"];
for (var i = 0; i < movieURL.length; i++) {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(movieURL[i], {waitUntil: "networkidle2"});
const movieData = await page.evaluate(() => {
let movieTitle = document.querySelector('div[class="TitleBlock"] > h1').innerText;
return{movieTitle}
});
await browser.close();
await console.log(movieData);
}
})()

Categories

Resources