How to get multiple HTML elements with Puppeteer? - javascript

I want to get multiple HTML elements with Puppeteer from a dynamic website.
But I only get the first element.
How do I get all the elements?
const puppeteer = require("puppeteer-core");
const browser = await puppeteer.launch({
  executablePath:
    "./node_modules/chromium/lib/chromium/chrome-mac/Chromium.app/Contents/MacOS/Chromium",
});
// Navigation to the target site (page.goto) is omitted.
const page = await browser.newPage();
// waitForSelector() resolves to the first matching element only.
const element = await page.waitForSelector(
  ".MuiTableRow-root.MuiTableRow-hover.css-1tq71ky"
);
const value = await element.evaluate((el) => el.textContent);
console.log(value);
await browser.close();

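I had a similar issue and I solved it with a loop. Note that waitForSelector() resolves to a single element handle, so fetch every match with page.$$() before iterating, something like this:
const elements = await page.$$(".MuiTableRow-root.MuiTableRow-hover.css-1tq71ky");
for (const single of elements) {
  const value = await single.evaluate((el) => el.textContent);
  console.log(value);
}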
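If only the text is needed, the wait and the loop can be folded into one small helper. This is a minimal sketch, not the only way to do it: getAllTexts is a hypothetical name, and page is assumed to be a Puppeteer Page that has already navigated to the site.

```javascript
// Hypothetical helper: collect the text of every element matching `selector`.
// Assumes `page` is a Puppeteer Page that has already loaded the target site.
async function getAllTexts(page, selector) {
  // Wait until at least one match exists (the rows are rendered dynamically).
  await page.waitForSelector(selector);
  // $$eval runs the callback in the browser with an array of ALL matches
  // and returns the (serializable) result to Node.
  return page.$$eval(selector, (els) =>
    els.map((el) => el.textContent.trim())
  );
}
```

It would be called as `const rows = await getAllTexts(page, ".MuiTableRow-root.MuiTableRow-hover.css-1tq71ky");`, which yields an array with one string per row.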

Related

function in page.evaluate() is not working/being executed - Puppeteer

My code below tries to collect a bunch of hyperlinks from elements with the class name ".jss2". However, I do not think the function within my page.evaluate() is working. When I run the code, link_list never gets printed.
I ran the document.querySelectorAll call in the Chrome console and that was perfectly fine, so I'm really having a hard time with this.
async function testing() {
  const browser = await puppeteer.launch({headless:false});
  const page = await browser.newPage();
  await page.setViewport({width: 1200, height: 800});
  await page.goto(url);
  const link_list = await this.page.evaluate(() => {
    let elements = Array.from(document.querySelectorAll(".jss2"));
    let links = elements.map(element => {
      return element.href;
    });
    return (links);
  });
  console.log(link_list);
}
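In your function, page is a local variable, not a property of this, so this.page is undefined and the call throws before console.log() runs. Call the method on page directly; page.$$eval() also shortens the query:
const link_list = await page.$$eval('.jss2', links => links.map(link => link.href));
Found the answer here: PUPPETEER - unable to extract elements on certain websites using page.evaluate(() => document.querySelectorAll())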

Returning a node from puppeteer page.evaluate() [duplicate]

This question already has answers here:
Puppeteer page.evaluate querySelectorAll return empty objects
I'm working with Node.js and Puppeteer for the first time and can't find a way to output values from page.evaluate to the outer scope.
My algorithm:
Login
Open URL
Get ul
Loop over each li and click on it
Wait for innerHTML to be set and add its src content to an array.
How can I return data from page.evaluate()?
const puppeteer = require('puppeteer');
const CREDENTIALS = require(`./env.js`).credentials;
const SELECTORS = require(`./env.js`).selectors;
const URLS = require(`./env.js`).urls;
async function run() {
  try {
    const urls = [];
    const browser = await puppeteer.launch({headless: false});
    const page = await browser.newPage();
    await page.goto(URLS.login, {waitUntil: 'networkidle0'});
    await page.type(SELECTORS.username, CREDENTIALS.username);
    await page.type(SELECTORS.password, CREDENTIALS.password);
    await page.click(SELECTORS.submit);
    await page.waitForNavigation({waitUntil: 'networkidle0'});
    await page.goto(URLS.course, {waitUntil: 'networkidle0'});
    const nodes = await page.evaluate(selector => {
      let elements = document.querySelector(selector).childNodes;
      console.log('elements', elements);
      return Promise.resolve(elements ? elements : null);
    }, SELECTORS.list);
    const links = await page.evaluate((urls, nodes, VIDEO) => {
      return Array.from(nodes).forEach((node) => {
        node.click();
        return Promise.resolve(urls.push(document.querySelector(VIDEO).getAttribute('src')));
      })
    }, urls, nodes, SELECTORS.video);
    const output = await links;
  } catch (err) {
    console.error('err:', err);
  }
}
run();
The function page.evaluate() can only return a serializable value, so it is not possible to return an element or NodeList back from the page environment using this method.
You can use page.$$() instead to obtain an ElementHandle array:
const nodes = await page.$$(`${selector} > *`); // selector children
If the length of the constant nodes is 0, then make sure you are waiting for the element specified by the selector to be added to the DOM with page.waitForSelector():
await page.waitForSelector(selector);
Alternatively, use page.evaluateHandle() to return DOM nodes as Puppeteer handles that you can manipulate in Node:
let elementsHandle = await page.evaluateHandle(() => document.querySelectorAll('a'));
let elements = await elementsHandle.getProperties();
let elements_arr = Array.from(elements.values());
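Applied to the steps in the question (click each list item, then read the video's src), the ElementHandle approach could look roughly like the sketch below. collectVideoSources is a hypothetical name, and it assumes that the list items are li children of the list selector and that clicking one makes the video element appear. If the video element stays in the DOM between clicks, waitForSelector() returns immediately, so a stricter wait (for example on a changed src) may be needed.

```javascript
// Hypothetical helper: click every list item and collect the video src
// shown after each click. Assumes the items are `li` children of
// `listSelector` and that `videoSelector` matches the player element.
async function collectVideoSources(page, listSelector, videoSelector) {
  await page.waitForSelector(listSelector);
  // ElementHandles stay in Node, so they can be clicked one by one.
  const items = await page.$$(`${listSelector} > li`);
  const sources = [];
  for (const item of items) {
    await item.click();
    await page.waitForSelector(videoSelector);
    // Only the src string (a serializable value) crosses back to Node.
    const src = await page.$eval(videoSelector, (video) =>
      video.getAttribute("src")
    );
    sources.push(src);
  }
  return sources;
}
```

With the question's own configuration this would be called as `const urls = await collectVideoSources(page, SELECTORS.list, SELECTORS.video);`.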

Failed to scrape the link to the next page using xpath in puppeteer

I'm trying to scrape the link to the next page from this webpage. I know how to scrape that using css selector. However, things go wrong when I attempt to parse the same using xpath. This is what I get instead of the next page link.
const puppeteer = require("puppeteer");
let url = "https://stackoverflow.com/questions/tagged/web-scraping";
(async () => {
  const browser = await puppeteer.launch({headless:false});
  const [page] = await browser.pages();
  await page.goto(url,{waitUntil: 'networkidle2'});
  let nextPageLink = await page.$x("//a[@rel='next']", item => item.getAttribute("href"));
  // let nextPageLink = await page.$eval("a[rel='next']", elm => elm.href);
  console.log("next page:",nextPageLink);
  await browser.close();
})();
How can I scrape the link to the next page using xpath?
page.$x(expression) returns an array of element handles. You need either destructuring or index access to get the first element from the array.
To read a DOM property from this element handle, either evaluate a function that receives the element, or use the element handle API (getProperty() followed by jsonValue()).
const [nextPageLink] = await page.$x("//a[@rel='next']");
const nextPageURL = await nextPageLink.evaluate(link => link.href);
Or:
const [nextPageLink] = await page.$x("//a[@rel='next']");
const nextPageURL = await (await nextPageLink.getProperty('href')).jsonValue();
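One caveat, as far as I know: newer Puppeteer releases deprecated and later dropped page.$x() in favour of passing an "xpath/" prefixed selector to page.$$(). The sketch below handles both cases by checking which method exists. getFirstHrefByXPath is a hypothetical name, and the version behaviour is an assumption worth checking against the docs for your installed version.

```javascript
// Hypothetical helper: resolve the href of the first element matching an
// XPath expression, or null when nothing matches.
// Assumption: Puppeteer versions without page.$x() accept the "xpath/" prefix.
async function getFirstHrefByXPath(page, xpath) {
  const handles =
    typeof page.$x === "function"
      ? await page.$x(xpath) // older Puppeteer
      : await page.$$(`xpath/${xpath}`); // newer Puppeteer
  if (handles.length === 0) return null;
  // Evaluate in the browser; link.href is the fully resolved URL.
  return handles[0].evaluate((link) => link.href);
}
```

For the question's page this would be `const nextPageURL = await getFirstHrefByXPath(page, "//a[@rel='next']");`.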

Puppeteer - How to remove script tags

I've been looking into Puppeteer, and am able to get the innerHTML, however, this can also contain <script> content which I would like removed.
How do I achieve this?
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.example.com');
console.log(await page.evaluate(() => document.body.innerHTML));
Something like this?
const innerHTML = await page.evaluate(() => {
  for (const script of document.body.querySelectorAll('script')) script.remove();
  return document.body.innerHTML;
});
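Note that the snippet above removes the script elements from the live page. If the page should stay untouched (say, you keep interacting with it afterwards), one option is to strip the scripts from a clone instead. A minimal sketch, with getBodyHtmlWithoutScripts as a hypothetical name:

```javascript
// Hypothetical helper: return the body's inner HTML with <script> elements
// removed, without modifying the live page.
async function getBodyHtmlWithoutScripts(page) {
  return page.evaluate(() => {
    // Deep-clone the body so removals only affect the detached copy.
    const clone = document.body.cloneNode(true);
    for (const script of clone.querySelectorAll("script")) script.remove();
    return clone.innerHTML;
  });
}
```

Usage would be `console.log(await getBodyHtmlWithoutScripts(page));`.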

Puppeteer: Get DOM element which isn't in the initial DOM

I'm trying to figure out how to get the elements in e.g. a JS gallery that loads its images after they have been clicked on.
I'm using a demo of Viewer.js as an example. The element with the classes .viewer-move.viewer-transition isn't in the initial DOM. After clicking on an image the element is available but if I use $eval the string is empty. If I open the console in the Puppeteer browser and execute document.querySelector('.viewer-move.viewer-transition') I get the element but in the code the element isn't available.
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://fengyuanchen.github.io/viewerjs/');
  await page.click('[data-original="images/tibet-1.jpg"]');
  let viewer = await page.$eval('.viewer-move.viewer-transition', el => el.innerHTML);
  console.log(viewer);
})();
You get the empty string because the element has no content, so its innerHTML is empty. outerHTML seems to work:
const puppeteer = require('puppeteer');
(async function main() {
  try {
    const browser = await puppeteer.launch({ headless: false });
    const [page] = await browser.pages();
    await page.goto('https://fengyuanchen.github.io/viewerjs/');
    await page.click('[data-original="images/tibet-1.jpg"]');
    await page.waitForSelector('.viewer-move.viewer-transition');
    const viewer = await page.$eval('.viewer-move.viewer-transition', el => el.outerHTML);
    console.log(viewer);
    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();
Since you have to wait until the element is available, the most convenient approach is await page.waitForSelector(".viewer-move.viewer-transition"), which waits until the element is added to the DOM. The caveat is that execution continues the moment the element is added, even if it is still empty or hidden.
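If the hidden case matters, waitForSelector() accepts a visible option that makes it wait until the element is actually displayed, not merely present. A small sketch, where getOuterHtmlWhenVisible is a hypothetical name and the 5000 ms timeout is an arbitrary choice:

```javascript
// Hypothetical helper: wait until the element is present AND visible,
// then return its outerHTML. The default timeout is an arbitrary value.
async function getOuterHtmlWhenVisible(page, selector, timeout = 5000) {
  // With { visible: true } the wait also requires the element to be displayed.
  const handle = await page.waitForSelector(selector, {
    visible: true,
    timeout,
  });
  return handle.evaluate((el) => el.outerHTML);
}
```

For the gallery example this would be `const viewer = await getOuterHtmlWhenVisible(page, ".viewer-move.viewer-transition");`. It rejects with a timeout error if the element never becomes visible.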
