I have some trouble using the newest version of puppeteer.
I'm using puppeteer version 0.13.0.
I have a site with this element:
<div class="header">hey there</div>
I'm trying to run this code:
const headerHandle = await page.evaluateHandle(() => {
const element = document.getElementsByClassName('header');
return element;
});
Now the headerHandle is a JSHandle with a description: 'HTMLCollection(0)'.
If I try to run
headerHandle.getProperties() and try to console.log I get Promise { <pending> }.
If I just try to get the element like this:
const result = await page.evaluate(() => {
const element = document.getElementsByClassName('header');
return Promise.resolve(element);
});
I get an empty object.
How do I get the actual element or the value of the element?
Puppeteer has changed the way evaluate works. The safest way to retrieve DOM elements is to create a JSHandle and pass that handle to the evaluate function:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com', { waitUntil: 'networkidle2' });
const jsHandle = await page.evaluateHandle(() => {
const elements = document.getElementsByTagName('h1');
return elements;
});
console.log(jsHandle); // JSHandle
const result = await page.evaluate(els => els[0].innerHTML, jsHandle);
console.log(result); // it will log the string 'Example Domain'
await browser.close();
})();
For reference: evaluate docs, issue #1590, issue #1003 and PR #1098
Fabio's approach is good to have for working with arrays, but in many cases you don't need the nodes themselves, just their serializable contents or properties. In OP's case, there's only one element being selected, so the following works more directly (with less straightforward approaches shown for comparison):
const puppeteer = require("puppeteer"); // ^19.1.0
const html = `<!DOCTYPE html><html><body>
<div class="header">hey there</div>
</body></html>`;
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.setContent(html);
const text = await page.$eval(".header", el => el.textContent);
console.log(text); // => hey there
// or, less directly:
const text2 = await page.evaluate(() => {
// const el = document.getElementsByClassName("header")[0] // take the 0th element (class name only, no leading dot)
const el = document.querySelector(".header"); // ... better still
return el.textContent;
});
console.log(text2); // => hey there
// even less directly, similar to OP:
const handle = await page.evaluateHandle(() =>
document.querySelector(".header")
);
const text3 = await handle.evaluate(el => el.textContent);
console.log(text3); // => hey there
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
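As a rough illustration of why returning the serializable contents works while returning the node itself comes back empty, here's a plain Node sketch with no browser involved. FakeElement is a hypothetical stand-in for a DOM node (its data lives in prototype getters rather than own properties), and JSON.stringify is only an approximation of by-value serialization, not necessarily what Puppeteer does internally:

```javascript
// Hypothetical stand-in for a DOM node: the interesting data is exposed
// through a prototype getter, not through own enumerable properties.
class FakeElement {
  #text;
  constructor(text) {
    this.#text = text;
  }
  get textContent() {
    return this.#text;
  }
}

const el = new FakeElement("hey there");

// Approximation of "return the node by value": nothing enumerable survives.
const wholeNode = JSON.parse(JSON.stringify(el));

// Approximation of "return a property of the node": a string survives intact.
const justText = JSON.parse(JSON.stringify(el.textContent));

console.log(wholeNode); // => {}
console.log(justText); // => hey there
```

This is the same shape as the empty object OP saw: pick out the string, number or plain object you need inside the browser callback and return that.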
Getting the text from multiple elements is also straightforward, not requiring handles:
const puppeteer = require("puppeteer");
const html = `<!DOCTYPE html><html><body>
<div class="header">foo</div>
<div class="header">bar</div>
<div class="header">baz</div>
</body></html>`;
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.setContent(html);
const text = await page.$$eval(
".header",
els => els.map(el => el.textContent)
);
console.log(text);
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
As Fabio's approach attests, things get trickier when you're working with multiple elements and want to use the handles in Puppeteer. Unlike the ElementHandle[] return of page.$$, page.evaluateHandle's JSHandle return isn't iterable, even if the handle points to an array. It can only be expanded into an array back in the browser.
One workaround is to return the length, optionally attach the selector array to the window (or re-query it multiple times), then run a loop and call evaluateHandle to return each ElementHandle:
// ...
await page.setContent(html);
const length = await page.$$eval(".header", els => {
window.els = els;
return els.length;
});
const nodes = [];
for (let i = 0; i < length; i++) {
nodes.push(await page.evaluateHandle(i => window.els[i], i));
}
// now you can loop:
for (const el of nodes) {
console.log(await el.evaluate(el => el.textContent));
}
// ...
See also Puppeteer find list of shadowed elements and get list of ElementHandles which, in spite of the shadow DOM in the title, is mostly about working with arrays of handles.
Consider this really simple example:
class MyClass {
public add(num: number): number {
return num + 2;
}
}
const result = await page.evaluate((NewInstance) => {
console.log("typeof instance", typeof NewInstance); // undefined
const d = new NewInstance();
console.log("result", d.add(10));
return d.add(10);
}, MyClass);
I've tried everything I could think of. The main reason I want to use a class here, is because there's a LOT of code I don't want to just include inside the evaluate method directly. It's messy and hard to keep track of it, so I wanted to move all logic to a class so it's easier to understand what's going on.
Is this possible?
It's possible, but not necessarily great design, depending on what you're trying to do. It's hard to suggest the best solution without knowing the actual use case, so I'll just provide options and let you make the decision.
One approach is to stringify the class (either by hand or with .toString()) or put it in a separate file, then addScriptTag:
const puppeteer = require("puppeteer"); // ^19.6.3
class MyClass {
add(num) {
return num + 2;
}
}
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.goto(
"https://www.example.com",
{waitUntil: "domcontentloaded"}
);
await page.addScriptTag({content: MyClass.toString()});
const result = await page.evaluate(() => new MyClass().add(10));
console.log(result); // => 12
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
See this answer for more examples.
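One caveat with stringifying worth keeping in mind: only the class's source text travels, not any variables it closes over. Here's a small Node-only sketch of that, assuming new Function is a fair stand-in for "some other environment" such as the page (revive, SelfContained and UsesClosure are hypothetical names for illustration):

```javascript
function demo() {
  const offset = 2; // local to demo; not part of any class's source text

  class SelfContained {
    add(num) {
      return num + 2;
    }
  }

  class UsesClosure {
    add(num) {
      return num + offset; // depends on a variable outside the class
    }
  }

  // Stand-in for another environment: functions built with new Function
  // only see the global scope, so demo's locals are invisible to them.
  const revive = new Function("source", "return eval('(' + source + ')');");

  const A = revive(SelfContained.toString());
  const B = revive(UsesClosure.toString());

  const results = { selfContained: new A().add(10) };
  try {
    results.usesClosure = new B().add(10);
  } catch (err) {
    results.usesClosure = err.name; // the closed-over variable is gone
  }
  return results;
}

const demoResults = demo();
console.log(demoResults);
// => { selfContained: 12, usesClosure: 'ReferenceError' }
```

So if the class relies on imports or outer variables, those need to be sent along too, or passed in as arguments.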
Something like eval is also feasible. If it looks scary, consider that anything you put into a page.evaluate() or page.addScriptTag() is effectively the same thing as far as security goes.
const result = await page.evaluate(MyClassStringified => {
const MyClass = eval(`(${MyClassStringified})`);
return new MyClass().add(10);
}, MyClass.toString());
Many other patterns are also possible, like exposing your library via exposeFunction if the logic is Node-based rather than browser-based.
That said, defining the class inside an evaluate may not be as bad as you think:
const addTonsOfCode = () => {
MyClass = class {
add(num) {
return num + 2;
}
}
// ... tons of code ...
};
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.goto(
"https://www.example.com",
{waitUntil: "domcontentloaded"}
);
await page.evaluate(addTonsOfCode);
const result = await page.evaluate(() => new MyClass().add(10));
console.log(result); // => 12
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
I'd prefer to namespace this all into a library:
const addTonsOfCode = () => {
class MyClass {
add(num) {
return num + 2;
}
}
// ... tons of code ...
window.MyLib = {
MyClass,
// ...
};
};
Then use with:
await page.evaluate(addTonsOfCode);
await page.evaluate(() => new MyLib.MyClass().add(10));
I'm trying to get Puppeteer to log in with my Gmail address on Zalando. I'm using the id of the field so it can type my Gmail address into it, but it just doesn't want to. Can you help me?
This is the element, with its id, class, etc.:
<input type="email" class="cDRR43 WOeOAB _0Qm8W1 _7Cm1F9 FxZV-M bsVOrE
mo6ZnF dUMFv9 K82if3 LyRfpJ pVrzNP NN8L-8 QGmTh2 Vn-7c-"
id="login.email" data-testid="email_input" name="login.email" value=""
placeholder="E-postadress" autocomplete="email">
This is my code:
const puppeteer = require('puppeteer');
const product_url = "https://www.zalando.se/nike-sportswear-air-flight-lite-mid-hoega-sneakers- whiteblack-ni112n02z-a11.html"
const cart = "https://www.zalando.se/cart"
async function givePage(){
const browser = await puppeteer.launch({headless: false})
const page = await browser.newPage();
return page;
}
async function addToCart(page){
// going to website
await page.goto(product_url)
// clicking "handla"
await page.waitForSelector("button[class='DJxzzA u9KIT8 uEg2FS U_OhzR ZkIJC- Vn-7c- FCIprz heWLCX JIgPn9 LyRfpJ pxpHHp Md_Vex NN8L-8 GTG2H9 MfX1a0 WCjo-q EKabf7 aX2-iv r9BRio mo6ZnF PLvOOB']");
await page.click("button[class='DJxzzA u9KIT8 uEg2FS U_OhzR ZkIJC- Vn-7c- FCIprz heWLCX JIgPn9 LyRfpJ pxpHHp Md_Vex NN8L-8 GTG2H9 MfX1a0 WCjo-q EKabf7 aX2-iv r9BRio mo6ZnF PLvOOB']", elem => elem.click());
// clicking "OK" to cookies
await page.waitForSelector("button[class='uc-btn uc-btn-primary']");
await page.click("button[class='uc-btn uc-btn-primary']", elem => elem.click());
// clicking "size EU 41"
await page.evaluate(() => document.getElementsByClassName('_6G4BGa _0Qm8W1 _7Cm1F9 FxZV-M IvnZ13 Pb4Ja8 ibou8b JT3_zV ZkIJC- Md_Vex JCuRr_ na6fBM _0xLoFW FCIprz pVrzNP KRmOLG NuVH8Q')[4].click());
console.log("körs")
await page.evaluate(async() => { setTimeout(function(){ console.log('waiting'); }, 1000);});
// going to "cart"
await page.goto(cart)
// clicking "gå till checkout"
await page.waitForSelector("button[class='z-1-button z-coast-base-primary-accessible z-coast-base__sticky-sumary__cart__button-checkout z-1-button--primary z-1-button--button']");
await page.click("button[class='z-1-button z-coast-base-primary-accessible z-coast-base__sticky-sumary__cart__button-checkout z-1-button--primary z-1-button--button']", elem => elem.click());
}
async function Login(page){
await page.evaluate(async() => { setTimeout(function(){ console.log('waiting'); }, 1000);});
await page.type("input[id='login.email']", 'david.exartor#gmail.com');
}
async function checkout(){
var page = await givePage();
await addToCart(page);
await Login(page);
}
checkout();
I've tried using the other things such as the name, class and testid but still no success. I was expecting that something would work but nothing did.
You're not waiting for that input selector:
const uname = await page.waitForSelector("[id='login.email']");
await uname.type('david.exartor#gmail.com');
Suggestions/notes:
This code:
await page.click("button[class='uc-btn uc-btn-primary']", elem => elem.click());
can just be:
await page.click("button[class='uc-btn uc-btn-primary']");
The second argument is supposed to be an options object, not a callback. If you want to trigger a native click, use:
await page.$eval("button[class='uc-btn uc-btn-primary']", el => el.click());
When I run into trouble automating a login, I often add a userDataDir and pop open a browser session so I can log in to the site manually.
Try to avoid sleeping. It slows down your script and can lead to random failures. Pick tighter predicates like waitForSelector or waitForFunction and encode the exact condition you're waiting on.
Luckily, your attempts at sleeping don't actually do much of anything:
await page.evaluate(async() => { setTimeout(function(){ console.log('waiting'); }, 1000);});
This just logs to the browser console after a second but doesn't block in Puppeteer. The async keyword isn't necessary. To actually sleep in the browser, you could do:
await page.evaluate(() => new Promise(r => setTimeout(r, 1000)));
or just sleep in Node:
await new Promise(r => setTimeout(r, 1000));
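If it helps to see the difference, here's a Node-only sketch that times both patterns. fakeSleep mirrors the shape of the original attempt and realSleep is the promise-wrapped version; the names, the timeIt helper and the 200 ms delay are just assumptions for the demo:

```javascript
// Schedules a timer but never waits for it: the async function's promise
// resolves as soon as the body finishes, which is immediately.
const fakeSleep = async () => {
  setTimeout(() => {}, 200);
};

// Resolves only when the timer fires.
const realSleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Measures how long an awaited call actually blocks the caller.
const timeIt = async fn => {
  const start = Date.now();
  await fn();
  return Date.now() - start;
};

const timings = (async () => ({
  fake: await timeIt(fakeSleep),
  real: await timeIt(() => realSleep(200)),
}))();

timings.then(t => console.log(t)); // fake is close to 0, real is close to 200
```

The same reasoning applies inside page.evaluate: Puppeteer waits for the promise the callback returns, so the callback has to return a promise that is tied to the timer.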
If you run console.log(await page.content()) headlessly, you'll see the site is detecting you as a bot and not returning the login page. The canonical thread is Why does headless need to be false for Puppeteer to work? if you plan to run headlessly in the future.
The givePage function leaks a browser handle, hanging the process. Better to write your script without abstractions until you have everything working, then factor out abstractions. My usual boilerplate is something like:
const puppeteer = require("puppeteer");
const scrape = async page => {
// write your code here
const url = "https://www.example.com";
await page.goto(url, {waitUntil: "domcontentloaded"});
console.log(await page.title());
};
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await scrape(page);
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
Be extremely careful with your [class="foo bar baz"] selectors. These are rigid and overly-precise relative to the preferred .foo.bar.baz version. The former is an exact match, so if another class shows up or the order of the classes changes, your script will break. Here's an example of the problem:
const puppeteer = require("puppeteer"); // ^19.0.0
const html = `<p class="foo bar">OK</p>`;
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.setContent(html);
const p = (...args) => console.log(...args);
const text = sel => page
.$eval(sel, el => el.textContent)
.catch(err => "FAIL");
// Good:
p(await text(".foo.bar")); // => OK
p(await text(".bar.foo")); // => OK
p(await text(".foo")); // => OK
p(await text(".bar")); // => OK
// Works but verbose:
p(await text('[class~="foo"][class~="bar"]')); // => OK
// Works but brittle:
p(await text('[class="foo bar"]')); // => OK
// Special cases that are sometimes necessary:
p(await text('[class^="foo "]')); // => OK
p(await text('[class$=" bar"]')); // => OK
p(await text('[class*="fo"]')); // => OK
// Fails:
p(await text('[class="foo"]')); // => FAIL
p(await text('[class="bar"]')); // => FAIL
p(await text('[class="bar foo"]')); // => FAIL
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
The [attr=""] selector is suitable in uncommon situations when you need to test semantics like "begins with", "ends with", "substring" or in a very rare case where you actually need to distinguish between class="foo bar" and class="bar foo", which I've never had to do before.
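If you have a lot of these exact-match selectors to convert, a tiny helper can do the rewrite mechanically. This is only a sketch: toClassSelector is a hypothetical name, and it assumes the class names are plain CSS identifiers that don't need escaping (for example, none starts with a digit):

```javascript
// Hypothetical helper: turn a class attribute value into a compound class
// selector, optionally prefixed with a tag name.
// Assumes every class name is a plain CSS identifier (no escaping needed).
const toClassSelector = (classAttr, tag = "") =>
  tag +
  classAttr
    .trim()
    .split(/\s+/) // tolerate repeated or stray whitespace
    .map(name => `.${name}`)
    .join("");

console.log(toClassSelector("uc-btn uc-btn-primary", "button"));
// => button.uc-btn.uc-btn-primary

console.log(toClassSelector("  foo   bar "));
// => .foo.bar
```

From there you can usually trim the result down to the one or two classes that actually identify the element.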
Be careful with overly-specific selectors like .foo.bar.baz.quux.garply.corge. If you can distinguish that element with a simple .foo or a #baz .foo, just use that in most circumstances. Related: overusing browser-generated selectors and Convenient way to get input for puppeteer page.click().
Block images and extra resources to speed up your script once you get the basic functionality working.
In the following code snippet, I try to click a button (after some Timeout) within the page.evaluate function. It does not work. Yet, when I open the console in the launched browser and manually type const btn = document.querySelectorAll("form button")[1]; btn.click() it does.
Can anyone explain to me the cause of this difference in behavior and how to fix it?
Here's a minimal reproducible example:
import { resolve } from 'path';
import puppeteer from 'puppeteer'
//go to page and handle cookie requests
const browser = await puppeteer.launch({defaultViewport: {width: 1920, height: 1080},
headless:false, args: ['--start-maximized']});
const page = await browser.newPage();
const url = "https://de.finance.yahoo.com/";
await page.goto(url);
await page.waitForSelector("div.actions");
await page.evaluate( () => {
let z= document.querySelector("div.actions"); z.children[4].click()
})
await page.waitForSelector("input[id=yfin-usr-qry]");
await page.evaluate( () => {
  let z = document.querySelector("input[id=yfin-usr-qry]");
  z.value = "AAPL";
  const btn = document.querySelectorAll("form button")[1];
  return new Promise((resolve) => setTimeout(() => {btn.click(); resolve()}, 1000));
})
The form button selector appears to be incorrect, selecting a non-visible element with class .modules_clearBtn__uUU5h.modules_noDisplay__Qnbur. I'd suggest selecting by .finsrch-btn or #UH-0-UH-0-Header .finsrch-btn if you have to select this, but it's not really necessary, so I won't use it in my suggested solution below.
Beyond that, I'd tighten up some of the selectors, skip the timeout and prefer using trusted Puppeteer events when possible.
I'm not sure what data you want on the final page but this should give you a screenshot of it, showing all of the content:
const puppeteer = require("puppeteer"); // ^18.0.4
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
const $ = (...args) => page.waitForSelector(...args);
const url = "https://de.finance.yahoo.com/";
await page.goto(url, {waitUntil: "domcontentloaded"});
await (await $('button[name="agree"]')).click();
const input = await $("#yfin-usr-qry");
await input.type("AAPL");
await page.keyboard.press("Enter");
await $("#AAPL-interactive-2col-qsp-m");
await page.evaluate("scrollTo(0, document.body.scrollHeight)");
await $("#recommendations-by-symbol");
await page.screenshot({path: "aapl.png", fullPage: true});
})()
.catch(err => console.error(err))
.finally(() => browser?.close())
;
That said, rather than navigating to the homepage, typing in a search, then pressing a button, you could consider building the URL directly, e.g. https://de.finance.yahoo.com/quote/${symbol} and navigating right to it. This is generally faster, more reliable, and easier to code.
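A minimal sketch of building that URL, assuming the quote/${symbol} pattern above holds for the symbols you care about (quoteUrl is a hypothetical helper name):

```javascript
// Hypothetical helper: build the quote page URL for a ticker symbol.
// encodeURIComponent guards against symbols with characters that aren't
// URL-safe; plain tickers like AAPL pass through unchanged.
const quoteUrl = symbol =>
  `https://de.finance.yahoo.com/quote/${encodeURIComponent(symbol)}`;

console.log(quoteUrl("AAPL"));
// => https://de.finance.yahoo.com/quote/AAPL
```

You'd then pass the result straight to page.goto and wait for the element you want, skipping the search form entirely.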
I'm using Puppeteer and jsDOM to scrape this site: https://www.lcfc.com/matches/results.
I want the names of the teams of every match, so on the console I use this:
document.querySelectorAll('.match-item__team-container span')
.forEach(element => console.log(element.textContent));
In the console, the names print OK, but when I use this in my code it returns nothing.
This is my code:
const puppeteer = require('puppeteer');
const jsdom = require('jsdom');
(async () => {
try {
const browser = await puppeteer.launch() ;
const page = await browser.newPage();
const response = await page.goto('https://www.lcfc.com/matches/results');
const body = await response.text();
const { window: { document } } = new jsdom.JSDOM(body);
document.querySelectorAll('.match-item__team-container span')
.forEach(element => console.log(element.textContent));
await browser.close();
} catch (error) {
console.error(error);
}
})();
I don't get any error. Any suggestions? Thank you.
I tried with this code now, but still not working. I show the code and a picture of the console:
const puppeteer = require('puppeteer');
(async () => {
try {
const browser = await puppeteer.launch() ;
const page = await browser.newPage();
await page.waitForSelector('.match-item__team-container span');
const data = await page.evaluate(() => {
document.querySelectorAll('.match-item__team-container span')
.forEach(element => console.log(element.textContent));
});
//listen to console events in the chrome tab and log it in nodejs process
page.on('console', consoleObj => console.log(consoleObj.text()));
await browser.close();
} catch (error) {
console.log(error);
}
})();
Do it the Puppeteer way: use evaluate to run your code after waiting for the selector to appear via waitForSelector:
// listen to console events in the Chrome tab and log them in the Node.js process
// (register the listener before evaluate runs, or the messages are missed)
page.on('console', consoleObj => console.log(consoleObj.text()));
await page.waitForSelector('.match-item__team-container span');
const data = await page.evaluate(() => {
  const spans = [...document.querySelectorAll('.match-item__team-container span')];
  spans.forEach(element => console.log(element.textContent));
  // or return the values of the selected items
  return spans.map(element => element.textContent);
});
evaluate runs your code inside the active Chrome tab, so you don't need jsDOM to parse the response.
UPDATE
The new timeout issue is because the page is taking too long to load. Note that page.evaluate doesn't take a timeout option (extra arguments are passed to the page function), so raise the timeout on waitForSelector instead, or use {timeout: 0} to disable it:
await page.waitForSelector('.match-item__team-container span', {timeout: 60000});
const data = await page.evaluate(() =>
  [...document.querySelectorAll('.match-item__team-container span')]
    .map(element => element.textContent)
);
I tested iterations with Puppeteer in a small case. I've already read that the common reason for Puppeteer disconnections is that the Node script doesn't wait for the Puppeteer actions to finish. So I converted all the functions in my snippet into async functions, but it didn't help.
If the small case with six iterations works, I will implement it in my current project with around 50 iterations.
'use strict';
const puppeteer = require('puppeteer');
const arrIDs = [8322072, 1016816, 9312604, 1727088, 9312599, 8477729];
const call = async () => {
await puppeteer.launch().then(async (browser) => {
arrIDs.forEach(async (id, index, arr) => {
await browser.newPage().then(async (page) => {
await page.goto(`http://somelink.com/${id}`).then(async () => {
await page.$eval('div.info > table > tbody', async (heading) => {
return heading.innerText;
}).then(async (result) => {
await browser.close();
console.log(result);
});
});
});
});
});
};
call();
forEach executes synchronously and doesn't await the promises its callback returns. Replace forEach with a simple for loop:
const arrIDs = [8322072, 1016816, 9312604, 1727088, 9312599, 8477729];
const page = await browser.newPage();
for (let id of arrIDs){
await page.goto(`http://somelink.com/${id}`);
let result = await page.$eval('div.info > table > tbody', heading => heading.innerText).catch(e => void e);
console.log(result);
}
await browser.close()
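To see the ordering problem without a browser, here's a Node-only sketch. fakeScrape is a hypothetical stand-in for "open a page and scrape it", and the "closed browser" entry marks where browser.close() would run:

```javascript
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Hypothetical stand-in for an async scrape of one ID.
const fakeScrape = async (id, events) => {
  await sleep(10);
  events.push(`scraped ${id}`);
};

const withForEach = async ids => {
  const events = [];
  // forEach ignores the promises returned by the async callback
  ids.forEach(async id => await fakeScrape(id, events));
  events.push("closed browser"); // runs before any scrape has finished
  await sleep(50); // only here so the late results can be observed
  return events;
};

const withForOf = async ids => {
  const events = [];
  for (const id of ids) {
    await fakeScrape(id, events); // each iteration waits for the previous one
  }
  events.push("closed browser");
  return events;
};

const comparison = Promise.all([withForEach([1, 2]), withForOf([1, 2])]);
comparison.then(([a, b]) => {
  console.log(a); // => [ 'closed browser', 'scraped 1', 'scraped 2' ]
  console.log(b); // => [ 'scraped 1', 'scraped 2', 'closed browser' ]
});
```

In the forEach version the "browser" is closed while the scrapes are still in flight, which matches the disconnection errors in the question.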
The way you've formatted and nested everything seems like some incarnation of callback hell.
Here's my suggestion. It's a sketch rather than tested code, but the structure is going to work better for async/await:
const puppeteer = require("puppeteer");
const chromium_path_706915 = "706915/chrome.exe";
const arrIDs = [8322072, 1016816, 9312604, 1727088, 9312599, 8477729];
// Launch a browser, scrape one URL, then close the browser
async function Navigate(url) {
  const browser = await puppeteer.launch({
    executablePath: chromium_path_706915,
    args: ["--auto-open-devtools-for-tabs"],
    headless: false
  });
  const page = await browser.newPage();
  await page.goto(url);
  const result = await page.$$eval("div.info > table > tbody", result =>
    result.map(ele2 => ({
      etd: ele2.innerText.trim()
    }))
  );
  await browser.close();
  console.log(result);
}
// Visit the IDs one at a time; forEach would not wait for each Navigate call
async function Run() {
  for (const id of arrIDs) {
    await Navigate(`http://somelink.com/${id}`);
  }
}
Run();
On top of the other answers, I want to point out that async callbacks and forEach loops don't play together as you might expect. One possible solution is a custom implementation that supports this:
Utility function:
async function asyncForEach(array: Array<any>, callback: any) {
for (let index = 0; index < array.length; index++) {
await callback(array[index], index, array);
}
}
Example usage:
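// waitFor isn't defined in the snippet above; this is an assumed promise-based delay helper
const waitFor = (ms) => new Promise(r => setTimeout(r, ms));
const start = async () => {
  await asyncForEach([1, 2, 3], async (num) => {
    await waitFor(50);
    console.log(num);
  });
  console.log('Done');
}
start();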
Going through this article by Sebastien Chopin can help make it a bit more clear as to why async/await and forEach act unexpectedly. Here it is as a gist.