Popup form visible, but html code missing in Puppeteer - javascript

I'm trying to extract some information from a website (https://www.bauhaus.info/) but I'm stuck at the cookie popup form.
This is my code so far:
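const fs = require("fs");
const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.bauhaus.info');
  await sleep(5000);
  const html = await page.content();
  fs.writeFileSync("./page.html", html, "UTF-8");
  // Await the PDF so it is fully written before the browser closes
  await page.pdf({
    path: './bauhaus.pdf',
    format: 'a4'
  });
  await browser.close();
})();
// Resolves after the given number of milliseconds
function sleep(ms) {
  return new Promise((resolve) => {
    setTimeout(resolve, ms);
  });
}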
Up to this point everything works. But I can't accept the cookie banner, because its HTML is missing from what page.content() returns, even though the form is visible in the PDF.
[Screenshot: my browser]
[Screenshot: Puppeteer]
Why can't I see this popup in the HTML code?
Bonus question: is there a way to replace the sleep call with one of the page.waitFor* methods, without knowing which JS function triggers the cookie form to appear?

This element is in a shadow root, and page.content() does not serialize the contents of shadow roots. See my answer to "Puppeteer not giving accurate HTML code for page with shadow roots" for additional information about the shadow DOM.
This code dips into the shadow root, waits for the button to appear, then clicks it:
const puppeteer = require("puppeteer"); // ^13.5.1
let browser;
(async () => {
  browser = await puppeteer.launch({headless: false});
  const [page] = await browser.pages();
  const url = "https://www.bauhaus.info/";
  await page.goto(url, {waitUntil: "domcontentloaded"});
  // The cookie banner is rendered inside this element's shadow root
  const el = await page.waitForSelector("#usercentrics-root");
  await page.waitForFunction(el =>
    el.shadowRoot.querySelector(".sc-gsDKAQ.dejeIh"), {}, el
  );
  await el.evaluate(el =>
    el.shadowRoot.querySelector(".sc-gsDKAQ.dejeIh").click()
  );
  await page.waitForTimeout(100000); // pause to show that it worked
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
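On the bonus question: the waitForFunction call above already replaces the sleep, and the pattern can be pulled into a reusable helper. This is only a sketch under a few assumptions: findInShadow and clickInShadow are hypothetical names of mine (not Puppeteer APIs), the shadow root is open, and the 30-second timeout is an arbitrary default.

```javascript
// Hypothetical helper, not part of Puppeteer. Puppeteer serializes it and
// runs it inside the page, where `doc` defaults to the page's `document`.
// Passing `doc` explicitly lets it be exercised against a stub in Node.
function findInShadow(hostSelector, innerSelector, doc = document) {
  const host = doc.querySelector(hostSelector);
  // Yields null while the host, its shadow root, or the target is missing,
  // so page.waitForFunction keeps polling until the element exists.
  return host?.shadowRoot?.querySelector(innerSelector) ?? null;
}

// Waits for the element to exist inside the (open) shadow root, then clicks it.
async function clickInShadow(page, hostSelector, innerSelector, timeout = 30000) {
  const handle = await page.waitForFunction(
    findInShadow,
    {timeout},
    hostSelector,
    innerSelector
  );
  await handle.evaluate(el => el.click());
}

// Usage inside an async function, with the selectors from the answer above:
// await clickInShadow(page, "#usercentrics-root", ".sc-gsDKAQ.dejeIh");
```

Hashed class names like .sc-gsDKAQ.dejeIh tend to change between site builds, so a more stable inner selector is worth looking for if the page offers one.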

Related

How to get the first link under a ul tag using Puppeteer?

I am trying to get the link of the latest house posting on a real estate website.
This is the code I have written so far:
const puppeteer = require("puppeteer");
const link =
  "https://www.daft.ie/property-for-rent/dublin-4-dublin?radius=5000&numBeds_from=2&numBeds_to=3&sort=publishDateDesc";
(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
  });
  const page = await browser.newPage();
  await page.goto(link);
  const elements = await page.$x("//button[normalize-space()='Accept All']");
  await elements[0].click();
  // const handle = await page.waitForXPath("//ul[@data-testid='results']");
  // const yourHref = await page.evaluate(
  //   (anchor) => anchor.getAttribute("href"),
  //   handle
  // );
  const hrefs1 = await page.evaluate(() =>
    Array.from(document.querySelectorAll("a[href]"), (a) =>
      a.getAttribute("href")
    )
  );
  console.log(hrefs1);
  await browser.close();
})();
However, this code gets all of the href links on the target page.
HTML code of the page:
The code is easier to read from a picture than pasted as text, which is why I attached an image.
As you can see, under the ul tag with data-testid=results there are many li tags, each containing an a tag with an href. I want to extract only the link from the topmost li, since it will be the newest house posting.
How can I do this?
Expected output: I just want the first link under the li tag. In the picture above, the output would be
/for-rent/house-glencloy-road-whitehall-dublin-9/4072150
Following up on the comment chain, the selector '[data-testid="results"] a[href]' should give the first result href.
const puppeteer = require("puppeteer"); // ^16.2.0
let browser;
(async () => {
  browser = await puppeteer.launch({headless: false});
  const [page] = await browser.pages();
  const url =
    "https://www.daft.ie/property-for-rent/dublin-4-dublin?radius=5000&numBeds_from=2&numBeds_to=3&sort=publishDateDesc";
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const xp = "//button[normalize-space()='Accept All']";
  const cookiesBtn = await page.waitForXPath(xp);
  await cookiesBtn.click();
  const el = await page.waitForSelector('[data-testid="results"] a[href]');
  console.log(await el.evaluate(el => el.getAttribute("href")));
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
If you want all of the result hrefs, try:
const allHrefs = await page.$$eval(
  '[data-testid="results"] a[href]',
  els => els.map(e => e.getAttribute("href"))
);
Note that the data is available statically, so you could just use fetch (native in Node 18+) and Cheerio, which is faster and probably more reliable, assuming there are no detection issues (if there are, you could add a user-agent header and take other countermeasures):
const cheerio = require("cheerio"); // 1.0.0-rc.12
const url = "https://www.daft.ie/property-for-rent/dublin-4-dublin?radius=5000&numBeds_from=2&numBeds_to=3&sort=publishDateDesc";
fetch(url).then(res => res.text()).then(html => {
  const $ = cheerio.load(html);
  const sel = '[data-testid="results"] a[href]';
  console.log($(sel).attr("href"));
  // or all:
  console.log([...$(sel)].map(e => e.attribs.href));
});
On my slow machine this took 3.5 seconds, versus 30 seconds for headful Puppeteer and 15-20 seconds for headless Puppeteer, depending on cache warmth.
Or, if you need Puppeteer for whatever reason, you could block the requests you don't need, such as JS and images, to speed things up dramatically. Your default await page.goto(link); waits for the load event, which means waiting for content you may not need.
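As a rough illustration of request blocking, here is a minimal sketch using Puppeteer's request interception. The helper names and the list of blocked resource types are my own assumptions; adjust the list to whatever the scrape actually needs.

```javascript
// Resource types to skip: an assumption about what this scrape doesn't need.
const BLOCKED_TYPES = new Set(["image", "stylesheet", "font", "media", "script"]);

function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// Hypothetical helper: call once per page, before page.goto().
async function blockUnneededRequests(page) {
  await page.setRequestInterception(true);
  page.on("request", req => {
    if (shouldBlock(req.resourceType())) {
      req.abort();
    } else {
      req.continue();
    }
  });
}

// Usage inside an async function:
// await blockUnneededRequests(page);
// await page.goto(url, {waitUntil: "domcontentloaded"});
```

With scripts blocked, the cookie banner will most likely never render, so the "Accept All" click would have to be dropped. Since the result links are in the static HTML, that should not matter here, but I haven't verified it against the live site.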

Waiting for Google Captcha to be displayed using puppeteer

I can't verify whether a div exists on the page using Puppeteer, and I don't know why.
I would like to locate the captcha on the page using this code:
puppeteer
  .launch(options)
  .then(async (browser) => {
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 720 });
    console.log("[+] Connecting...");
    await page.goto("https://www.google.com/recaptcha/api2/demo");
    console.log("[+] Connected");
    page.waitForNavigation()
    if (await page.$('div.recaptcha-checkbox-border') !== null)
      console.log('[+] Resolving captcha');
  })
  .catch((err) => {
    console.log(err);
  });
But my if condition is always false and I don't know why.
Here is a screenshot of the element, located manually:
This script should work:
'use strict';
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch({ headless : false});
  const page = await browser.newPage();
  console.log("Opening page");
  await page.goto('https://www.google.com/recaptcha/api2/demo');
  console.log("Opened page");
  const frame = await page.frames().find(f => f.name().startsWith("a-"));
  await frame.waitForSelector('div.recaptcha-checkbox-border');
  console.log("Captcha exists!");
  await browser.close();
})();
It appears that the captcha is inside an iframe whose name always starts with a- (although I can't confirm that with my limited testing).
You first need to get the iframe, then wait for the selector after the iframe has loaded. This is the output; try changing the iframe name or the selector to see it fail.
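If the iframe has not been attached yet when page.frames() is read, the lookup can come back empty. One possible way around that, sketched here with a hypothetical helper name, is to wait for the iframe element itself and resolve its frame with contentFrame() before waiting for the inner selector. The iframe[name^="a-"] selector simply mirrors the name prefix observed above and is not a documented guarantee.

```javascript
// Hypothetical helper: waits for an <iframe> element, resolves its frame,
// then waits for a selector inside that frame.
async function waitForSelectorInFrame(page, iframeSelector, innerSelector) {
  const iframeHandle = await page.waitForSelector(iframeSelector);
  const frame = await iframeHandle.contentFrame();
  if (!frame) {
    throw new Error(`No frame behind ${iframeSelector}`);
  }
  return frame.waitForSelector(innerSelector);
}

// Usage inside an async function. The "a-" name prefix is an observation
// from limited testing, so treat the iframe selector as an assumption:
// await waitForSelectorInFrame(
//   page, 'iframe[name^="a-"]', "div.recaptcha-checkbox-border"
// );
```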

How to click a specific div with a specific class?

I'm new to coding with Puppeteer, and I want to know how to make it click this: (image)
The code I have right now is this:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('page link is here');
  await page.screenshot({ path: 'game.png' });
  const [button] = await page.$x("//button[contains(., 'Accept')]");
  if (button) {
    await button.click();
  }
  // I want to click it here.
  await page.screenshot({ path: 'test.png' });
  await browser.close();
})();
Sorry for my bad English 😔👌
If the element highlighted in the screenshot is the one to be clicked, you can simply:
await page.click('.shipyard-item');
I'd also suggest consulting the excellent Puppeteer documentation, which covers most use cases.
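If the element is rendered only after some script has run, a bare page.click() can fail because nothing matches the selector yet. A small sketch of waiting first, where clickWhenVisible is a hypothetical helper name and the 10-second timeout is an arbitrary choice:

```javascript
// Hypothetical helper: waits until the element is in the DOM and visible,
// then clicks the handle that waitForSelector returned.
async function clickWhenVisible(page, selector, timeout = 10000) {
  const handle = await page.waitForSelector(selector, {visible: true, timeout});
  await handle.click();
}

// Usage inside an async function:
// await clickWhenVisible(page, ".shipyard-item");
```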

Puppeteer Login to Instagram

I'm trying to login into Instagram with Puppeteer, but somehow I'm unable to do it.
Can you help me?
Here is the link I'm using:
https://www.instagram.com/accounts/login/
I tried different stuff. The last code I tried was this:
const puppeteer = require('puppeteer');
(async() => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.instagram.com/accounts/login/');
await page.evaluate();
await afterJS.type('#f29d14ae75303cc', 'username');
await afterJS.type('#f13459e80cdd114', 'password');
await page.pdf({path: 'page.pdf', format: 'A4'});
await browser.close();
})();
Thanks in advance!
OK, you're on the right track but just need to change a few things.
Firstly, I have no idea where your afterJS variable comes from. Either way, you won't need it.
You're asking for data to be typed into the username and password input fields, but you aren't asking Puppeteer to actually click the log in button to complete the login process.
page.evaluate() is used to execute JavaScript code inside the page context (i.e. on the web page loaded in the remote browser), so you don't need it here.
I would refactor your code to look like the following:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.instagram.com/accounts/login/');
  await page.waitForSelector('input[name="username"]');
  await page.type('input[name="username"]', 'username');
  await page.type('input[name="password"]', 'password');
  await page.click('button[type="submit"]');
  // Add a wait for some selector on the home page to load to ensure the next step works correctly
  await page.pdf({path: 'page.pdf', format: 'A4'});
  await browser.close();
})();
Hopefully this sets you down the right path to getting past the login page!
Update 1:
You've enquired about parsing the text of an element on Instagram. Unfortunately, I don't have an account there myself, so I can't give you an exact solution, but hopefully this still proves of some value.
So you're trying to evaluate an element's text, right? You can do this as follows:
const text = await page.$eval(cssSelector, (element) => {
  return element.textContent;
});
All you have to do is replace cssSelector with the selector of the element you wish to retrieve the text from.
Update 2:
OK, lastly, you've enquired about scrolling down to an element within a parent element. I'm not going to steal the credit from someone else, so here's the answer to that:
How to scroll to an element inside a div?
What you'll have to do is follow the instructions there and get them to work with Puppeteer, along these lines:
await page.evaluate(() => {
  const lastLink = document.querySelectorAll('h3 > a')[2];
  const topPos = lastLink.offsetTop;
  const parentDiv = document.querySelector('div[class*="eo2As"]');
  parentDiv.scrollTop = topPos;
});
Bear in mind that I haven't tested that code - I've just directly followed the answer in the URL I've provided. It should work!
You can log in to Instagram using the following example code:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Wait until page has loaded
  await page.goto('https://www.instagram.com/accounts/login/', {
    waitUntil: 'networkidle0',
  });
  // Wait for log in form (same submit selector as the click below)
  await Promise.all([
    page.waitForSelector('[name="username"]'),
    page.waitForSelector('[name="password"]'),
    page.waitForSelector('[type="submit"]'),
  ]);
  // Enter username and password
  await page.type('[name="username"]', 'username');
  await page.type('[name="password"]', 'password');
  // Submit log in credentials and wait for navigation
  await Promise.all([
    page.click('[type="submit"]'),
    page.waitForNavigation({
      waitUntil: 'networkidle0',
    }),
  ]);
  // Download PDF
  await page.pdf({
    path: 'page.pdf',
    format: 'A4',
  });
  await browser.close();
})();

Open multiple links in new tab and switch focus with a loop with puppeteer?

I have multiple links on a single page that I would like to access either sequentially or all at once. I want to open each link in its own new tab and save each page as a PDF. How do I achieve this with Puppeteer?
I can get all the links from the DOM via their href properties, but I don't know how to open them in new tabs, access them, and then close them.
You can open a new page in a loop:
const puppeteer = require('puppeteer');
(async () => {
  let browser;
  try {
    browser = await puppeteer.launch();
    const urls = [
      'https://www.google.com',
      'https://www.duckduckgo.com',
      'https://www.bing.com',
    ];
    // Each URL gets its own tab; all tabs load concurrently
    const pdfs = urls.map(async (url, i) => {
      const page = await browser.newPage();
      console.log(`loading page: ${url}`);
      await page.goto(url, {
        waitUntil: 'networkidle0',
        timeout: 120000,
      });
      console.log(`saving as pdf: ${url}`);
      await page.pdf({
        path: `${i}.pdf`,
        format: 'Letter',
        printBackground: true,
      });
      console.log(`closing page: ${url}`);
      await page.close();
    });
    // Await here so that a failure in any tab reaches the catch block
    await Promise.all(pdfs);
  } catch (error) {
    console.log(error);
  } finally {
    await browser?.close();
  }
})();
To switch to (activate) a tab, you just need to call page.bringToFront():
const page1 = await browser.newPage();
await page1.goto('https://www.google.com');
const page2 = await browser.newPage();
await page2.goto('https://www.bing.com');
const pageList = await browser.pages();
console.log("NUMBER TABS:", pageList.length);
//switch tabs here
await page1.bringToFront();
//Do something... save as pdf
await page2.bringToFront();
//Do something... save as pdf
I suspect you have an array of pages, so you might need to tweak the above code to cater for that.
As for generating a single PDF from multiple tabs, I am pretty certain this is not possible. I suspect there is a Node library that can take multiple PDF files and merge them into one.
pdf-merge might be what you are looking for.
You can also use a for loop.
const puppeteer = require("puppeteer");
(async () => {
  const movieURL = ["https://www.imdb.com/title/tt0234215", "https://www.imdb.com/title/tt0411008"];
  // Launch one browser and reuse it for every URL
  const browser = await puppeteer.launch();
  for (let i = 0; i < movieURL.length; i++) {
    const page = await browser.newPage();
    await page.goto(movieURL[i], {waitUntil: "networkidle2"});
    const movieData = await page.evaluate(() => {
      const movieTitle = document.querySelector('div[class="TitleBlock"] > h1').innerText;
      return {movieTitle};
    });
    console.log(movieData);
    await page.close();
  }
  await browser.close();
})();
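Between opening every tab at once and going strictly one by one, there is a middle ground: process the links in small batches so only a few tabs are open at a time. The sketch below is plain JavaScript with no Puppeteer calls; chunk and mapInBatches are hypothetical helper names, and the worker is whatever function you write to open a tab, save the PDF, and close the tab.

```javascript
// Splits an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Runs `worker` on every item, at most `size` at a time, and returns the
// results in input order. A batch starts only after the previous one finishes.
async function mapInBatches(items, size, worker) {
  const results = [];
  for (const batch of chunk(items, size)) {
    const batchResults = await Promise.all(batch.map(item => worker(item)));
    results.push(...batchResults);
  }
  return results;
}

// Usage inside an async function, assuming `urls` is your array of links and
// savePdf(url) is your own function that opens a tab, saves the PDF and
// closes the tab:
// await mapInBatches(urls, 3, savePdf);
```

The batch size of 3 is arbitrary; a reasonable value depends on how much memory each tab uses on your machine.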
