function in page.evaluate() is not working/being executed - Puppeteer - javascript

The code below tries to collect the hyperlinks that come under the class name ".jss2". However, I do not think the function inside my page.evaluate() is being executed: when I run the code, the link_list const is never displayed.
Running document.querySelectorAll in the Chrome console works perfectly fine, so I'm really having a hard time with this.
async function testing() {
const browser = await puppeteer.launch({headless:false});
const page = await browser.newPage();
await page.setViewport({width: 1200, height: 800});
await page.goto(url);
const link_list = await this.page.evaluate(() => {
let elements = Array.from(document.querySelectorAll(".jss2"));
let links = elements.map(element => {
return element.href;
});
return (links);
});
console.log(link_list);
}

const link_list = await page.$$eval('.classname', links => links.map(link => link.href));
Found the answer here: PUPPETEER - unable to extract elements on certain websites using page.evaluate(() => document.querySelectorAll())
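A minimal sketch of that approach wrapped in a helper. The name collectHrefs and the page parameter (a Puppeteer Page created elsewhere) are assumptions for illustration, and the waitForSelector call is added on the assumption that the links may be rendered after navigation finishes. Note also that the question's snippet calls this.page.evaluate() although page is a local variable, which is a likely reason nothing is printed.

```javascript
// Hypothetical helper: `page` is assumed to be a Puppeteer Page that has
// already navigated to the target URL.
async function collectHrefs(page, selector) {
  // Wait until at least one matching element is in the DOM.
  await page.waitForSelector(selector);
  // $$eval runs the callback inside the page and returns its serializable result.
  return page.$$eval(selector, (elements) => elements.map((el) => el.href));
}
```

Inside the question's testing() function this could be called as: const link_list = await collectHrefs(page, '.jss2');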

Related

Returning a node from puppeteer page.evaluate() [duplicate]

This question already has answers here:
Puppeteer page.evaluate querySelectorAll return empty objects
I'm working with Node.js and Puppeteer for the first time and can't find a way to output values from page.evaluate to the outer scope.
My algorithm:
Login
Open URL
Get ul
Loop over each li and click on it
Wait for innerHTML to be set and add its src content to an array.
How can I return data from page.evaluate()?
const puppeteer = require('puppeteer');
const CREDENTIALS = require(`./env.js`).credentials;
const SELECTORS = require(`./env.js`).selectors;
const URLS = require(`./env.js`).urls;
async function run() {
try {
const urls = [];
const browser = await puppeteer.launch({headless: false});
const page = await browser.newPage();
await page.goto(URLS.login, {waitUntil: 'networkidle0'});
await page.type(SELECTORS.username, CREDENTIALS.username);
await page.type(SELECTORS.password, CREDENTIALS.password);
await page.click(SELECTORS.submit);
await page.waitForNavigation({waitUntil: 'networkidle0'});
await page.goto(URLS.course, {waitUntil: 'networkidle0'});
const nodes = await page.evaluate(selector => {
let elements = document.querySelector(selector).childNodes;
console.log('elements', elements);
return Promise.resolve(elements ? elements : null);
}, SELECTORS.list);
const links = await page.evaluate((urls, nodes, VIDEO) => {
return Array.from(nodes).forEach((node) => {
node.click();
return Promise.resolve(urls.push(document.querySelector(VIDEO).getAttribute('src')));
})
}, urls, nodes, SELECTORS.video);
const output = await links;
} catch (err) {
console.error('err:', err);
}
}
run();
The function page.evaluate() can only return a serializable value, so it is not possible to return an element or NodeList back from the page environment using this method.
You can use page.$$() instead to obtain an ElementHandle array:
const nodes = await page.$$(`${selector} > *`); // selector children
If the length of the constant nodes is 0, then make sure you are waiting for the element specified by the selector to be added to the DOM with page.waitForSelector():
await page.waitForSelector(selector);
let elementHandles = await page.evaluateHandle(() => document.querySelectorAll('a'));
let elements = await elementHandles.getProperties();
let elements_arr = Array.from(elements.values());
Use page.evaluateHandle() to return a DOM node as a Puppeteer ElementHandle that you can manipulate in Node.
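Putting these answers together, the algorithm from the question could be sketched as follows. This is an illustration under assumptions, not the asker's final code: collectVideoSources is a hypothetical name, page is assumed to be a logged-in Puppeteer Page already on the course URL, and the two selectors stand in for SELECTORS.list and SELECTORS.video.

```javascript
// Hypothetical helper: clicks each child of `listSelector` and records the
// `src` attribute of the element matching `videoSelector` after each click.
async function collectVideoSources(page, listSelector, videoSelector) {
  await page.waitForSelector(listSelector);
  // ElementHandles live on the Node side; the DOM nodes themselves stay in the page.
  const items = await page.$$(`${listSelector} > *`);
  const urls = [];
  for (const item of items) {
    await item.click();
    await page.waitForSelector(videoSelector);
    // Only a string (a serializable value) crosses back from the page.
    urls.push(await page.$eval(videoSelector, (el) => el.getAttribute('src')));
  }
  return urls;
}
```

The urls array is built in Node rather than passed into page.evaluate(), because arguments are copied into the page and changes made there are not reflected back. Whether waitForSelector is enough to know that src has been updated depends on the page; a stricter condition may be needed.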

How can I click on all links matching a selector with Playwright?

I'm using Playwright to scrape some data. How do I click on all links on the page matching a selector?
const { firefox } = require('playwright');
(async () => {
const browser = await firefox.launch({headless: false, slowMo: 50});
const page = await browser.newPage();
await page.goto('https://www.google.com');
page.pause(); // allow user to manually search for something
const wut = await page.$$eval('a', links => {
links.forEach(async (link) => {
link.click(); // maybe works?
console.log('whoopee'); // doesn't print anything
page.goBack(); // crashes
});
return links;
});
console.log(`wut? ${wut}`); // prints 'wut? undefined'
await browser.close();
})();
Some issues:
console.log inside the $$eval doesn't do anything.
page.goBack() and page.pause() inside the eval cause a crash.
The return value of $$eval is undefined (if I comment out page.goBack() so I get a return value at all). If I return links.length instead of links, it's correct (i.e. it's a positive integer). Huh?
I get similar results with:
const links = await page.locator('a');
await links.evaluateAll(...)
Clearly I don't know what I'm doing. What's the correct code to achieve something like this?
(X-Y problem alert: I don't actually care if I do this with $$eval, Playwright, or frankly even Javascript; all I really want to do is make this work in any language or tool).
const { context } = await launch({ slowMo: 250 });
const page = await context.newPage();
await page.goto('https://stackoverflow.com/questions/70702820/how-can-i-click-on-all-links-matching-a-selector-with-playwright');
const links = page.locator('a:visible');
const linksCount = await links.count();
for (let i = 0; i < linksCount; i++) {
await page.bringToFront();
try {
const [newPage] = await Promise.all([
context.waitForEvent('page', { timeout: 5000 }),
links.nth(i).click({ modifiers: ['Control', 'Shift'] })
]);
await newPage.waitForLoadState();
console.log('Title:', await newPage.title());
console.log('URL: ', newPage.url());
await newPage.close();
}
catch {
continue;
}
}
There are a number of ways you could do this, but I like this approach the most. Clicking a link, waiting for the page to load, and then going back to the previous page has a lot of problems; most importantly, on many pages the links might change every time the page loads. Ctrl+Shift+clicking opens the link in a new tab, which you can access using the Promise.all pattern and catching the 'page' event.
I only tried this on this page, so I'm sure there are plenty of other problems that may arise. But for this page in particular, using 'a:visible' was necessary to prevent getting stuck on hidden links. The whole clicking operation is wrapped in a try/catch because some of the links aren't real links and don't open a new page.
Depending on your use case, it may be easiest just to grab all the hrefs from each link:
const links = page.locator('a:visible');
const linksCount = await links.count();
const hrefs = [];
for (let i = 0; i < linksCount; i++) {
hrefs.push(await links.nth(i).getAttribute('href'));
}
console.log(hrefs);
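The undefined result in the question is consistent with DOM elements not being serializable, so returning links from $$eval cannot hand real elements back to Node. A hedged sketch that returns plain strings in a single round trip instead, using locator.evaluateAll; collectVisibleHrefs is a hypothetical name and page is assumed to be a Playwright Page.

```javascript
// Hypothetical helper: `page` is assumed to be a Playwright Page.
async function collectVisibleHrefs(page) {
  const links = page.locator('a:visible');
  // evaluateAll passes every matched element to the callback in one call.
  // `a.href` is the resolved absolute URL, unlike getAttribute('href'),
  // which returns the attribute text exactly as written in the markup.
  return links.evaluateAll((anchors) => anchors.map((a) => a.href));
}
```

The callback runs in the browser, which is also why page.goBack() is not available inside it and why its console.log output goes to the page's console rather than to the Node terminal.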
Try this approach. I will use TypeScript.
await page.waitForSelector(selector,{timeout:10000});
const links = await page.$$(selector);
for(const link of links)
{
await link.click({timeout:8000});
//your additional code
}
See more on https://youtu.be/54OwsiRa_eE?t=488

Failed to scrape the link to the next page using xpath in puppeteer

I'm trying to scrape the link to the next page from this webpage. I know how to scrape that using css selector. However, things go wrong when I attempt to parse the same using xpath. This is what I get instead of the next page link.
const puppeteer = require("puppeteer");
let url = "https://stackoverflow.com/questions/tagged/web-scraping";
(async () => {
const browser = await puppeteer.launch({headless:false});
const [page] = await browser.pages();
await page.goto(url,{waitUntil: 'networkidle2'});
let nextPageLink = await page.$x("//a[@rel='next']", item => item.getAttribute("href"));
// let nextPageLink = await page.$eval("a[rel='next']", elm => elm.href);
console.log("next page:",nextPageLink);
await browser.close();
})();
How can I scrape the link to the next page using xpath?
page.$x(expression) returns an array of element handles. You need either destructuring or index access to get the first element from the array.
To get a DOM element property from this element handle, you can either evaluate a function with the element handle as its argument or use the element handle API.
const [nextPageLink] = await page.$x("//a[@rel='next']");
const nextPageURL = await nextPageLink.evaluate(link => link.href);
Or:
const [nextPageLink] = await page.$x("//a[@rel='next']");
const nextPageURL = await (await nextPageLink.getProperty('href')).jsonValue();
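Destructuring an empty array yields undefined, so both variants throw when no element matches. A small sketch that guards against that case; firstHrefByXPath is a hypothetical name and page is assumed to be a Puppeteer Page. It relies on page.$x as used above; as far as I know, recent Puppeteer releases have moved away from page.$x in favour of an xpath/ selector prefix, so check the version in use.

```javascript
// Hypothetical helper: returns the href of the first element matching the
// XPath expression, or null when nothing matches.
async function firstHrefByXPath(page, xpath) {
  const handles = await page.$x(xpath);
  if (handles.length === 0) {
    return null; // e.g. the last page has no "next" link
  }
  // The callback runs in the page with the matched element as its argument.
  return handles[0].evaluate((el) => el.href);
}
```

For the question's page this could be called as: const nextPageURL = await firstHrefByXPath(page, "//a[@rel='next']");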

web scraping with puppeteer does not find the CSS tag

I'm starting to learn web scraping in JavaScript with Puppeteer. I found a video I liked that showcases Puppeteer, and I'm trying to scrape the same information as the video. The page has changed a little since the video was made, so I used what I think are the correct tags.
The problem comes when I try to find the "h3" tag. The tag exists in the DOM, but my code refuses to acknowledge its existence, while it works "fine" when looking for the "h2" tag.
What I want to know is why my code does not retrieve it.
web page: https://marketingplatform.google.com/about/partners/find-a-partner?utm_source=marketingplatform.google.com&utm_medium=et&utm_campaign=marketingplatform.google.com%2Fabout%2Fpartners%2F
// normal things to launch it
const puppeteer = require("puppeteer");
(async() => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
const url = "https://marketingplatform.google.com/about/partners/find-a-partner?utm_source=marketingplatform.google.com&utm_medium=et&utm_campaign=marketingplatform.google.com%2Fabout%2Fpartners%2F";
await page.goto(url);
// here comes the problem
// this doesn't work v
const h3 = await page.evaluate(() => document.querySelector("h3").textContent);
console.log(h3); // never reached: the evaluate above throws because querySelector("h3") returned null, meaning it didn't find "h3"
// this DOES work v
const h2 = await page.evaluate(() => document.querySelector("h2").textContent);
console.log(h2);
//await browser.close();
})();
I know that the "h3" exists. I would appreciate a short explanation of what is happening so I can learn more.
Thanks.
The h3 header is not yet on the page when page.evaluate() runs, so we need to wait for it with waitForSelector:
// normal things to launch it
const puppeteer = require("puppeteer");
(async() => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
const url = "https://marketingplatform.google.com/about/partners/find-a-partner?utm_source=marketingplatform.google.com&utm_medium=et&utm_campaign=marketingplatform.google.com%2Fabout%2Fpartners%2F";
await page.goto(url);
await page.waitForSelector('h3')
const h3 = await page.evaluate(() => document.querySelector("h3").textContent);
console.log(h3);
const h2 = await page.evaluate(() => document.querySelector("h2").textContent);
console.log(h2);
await browser.close(); // don't forget to close it.
})();
output is:
Viden
Find your perfect match.
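Since waitForSelector resolves to a handle for the element it found, the wait and the read can be combined. A minimal sketch, assuming page is a Puppeteer Page; textWhenPresent and the 10-second default timeout are hypothetical choices for illustration.

```javascript
// Hypothetical helper: waits for `selector` to appear, then returns its
// trimmed text. Rejects if the element does not appear within `timeoutMs`.
async function textWhenPresent(page, selector, timeoutMs = 10000) {
  const handle = await page.waitForSelector(selector, { timeout: timeoutMs });
  // Evaluate against the handle that was just found instead of querying again.
  return handle.evaluate((el) => el.textContent.trim());
}
```

In the script above this would replace the separate waitForSelector and evaluate calls: const h3 = await textWhenPresent(page, 'h3');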

How to get input element with puppeteer, when the page load all elements inside frameset tag

I am trying to get all input element in this website:
http://rwis.mdt.mt.gov/scanweb/swframe.asp?Pageid=SfHistoryTable&Units=English&Groupid=269000&Siteid=269003&Senid=0&DisplayClass=NonJava&SenType=All&CD=7%2F1%2F2020+10%3A41%3A50+AM
Here is what the page's element source looks like.
Here is my code:
const puppeteer = require("puppeteer");
function run() {
return new Promise(async (resolve, reject) => {
try {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(
"http://rwis.mdt.mt.gov/scanweb/swframe.asp?Pageid=SfHistoryTable&Units=English&Groupid=269000&Siteid=269003&Senid=0&DisplayClass=NonJava&SenType=All&CD=7%2F1%2F2020+10%3A41%3A50+AM"
);
let urls = await page.evaluate(() => {
let results = [];
let items = document.querySelectorAll("input").length;
return items;
});
browser.close();
return resolve(urls);
} catch (e) {
return reject(e);
}
});
}
run().then(console.log).catch(console.error);
Right now my output is 0, but when I run document.querySelectorAll("input").length in the console, it gives me 8.
It seems like everything is loaded inside the frameset tag, which might be the issue. Does anyone have any idea how to solve this?
You have to get the frame element, from there you can get the frame itself so you can call evaluate inside that frame:
const elementHandle = await page.$('frame[name=SWContent]');
const frame = await elementHandle.contentFrame();
let urls = await frame.evaluate(() => {
let results = [];
let items = document.querySelectorAll("input").length;
return items;
});
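Both page.$() and contentFrame() can resolve to null, so a slightly more defensive version of the same idea might look like the sketch below. countInFrame is a hypothetical name, page is assumed to be a Puppeteer Page, and the added frame.waitForSelector call assumes the frame's content may still be loading.

```javascript
// Hypothetical helper: counts elements matching `innerSelector` inside the
// frame whose <frame>/<iframe> element matches `frameSelector`.
async function countInFrame(page, frameSelector, innerSelector) {
  const frameElement = await page.$(frameSelector);
  if (!frameElement) {
    throw new Error(`No element matches ${frameSelector}`);
  }
  const frame = await frameElement.contentFrame();
  if (!frame) {
    throw new Error(`${frameSelector} has no content frame`);
  }
  // Queries on `frame` see the frame's own document, not the top-level one.
  await frame.waitForSelector(innerSelector);
  return frame.$$eval(innerSelector, (elements) => elements.length);
}
```

For the page in the question this could be called as: const count = await countInFrame(page, 'frame[name=SWContent]', 'input');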
