I am creating a program to scrape forum responses for the online university I work for. I managed to navigate to the appropriate pages, but when I added scraping of the names of the learners who have responded, I received an 'Execution context was destroyed' error.
So far I have tried moving the page.waitFor() calls around and varying their timeouts.
const nameLinkList = await page.$$eval(
  '.coursename',
  (courseLinks => courseLinks.map(link => {
    const a = link.querySelector('.coursename > a');
    return {
      name: a.innerText,
      link: a.href
    };
  }))
);
for (const { name, link } of nameLinkList) {
  await Promise.all([
    page.waitForNavigation(),
    page.goto(link),
    page.waitFor(2000),
  ]);
  let [button] = await page.$x("//a[contains(., 'Self')]");
  if (button) {
    await button.click();
  } else {
    console.log(name);
    console.log('Didnt find link');
  }
  fs.appendFile('out.csv', name + '\n');
  await page.waitFor(1000);
  var elementExists = await page.$$('.author .media-body');
  if (elementExists) {
    await console.log(name);
    await page.waitFor(500);
    for (let z of elementExists) {
      const studentName = await z.$eval('a', a => a.innerText);
      await page.waitFor(2000)
      await console.log(studentName);
    }
  }
  await page.goto('www.urlwiththelistofcourses.com');
}
I expected it to iterate through each page, logging first the name of the course, followed by the names of any students who posted on that course's forum. What confuses me is that, unlike previous errors, which got stuck on a particular iteration, this one is variable: it usually occurs around the 12th to 17th iteration, sometimes even earlier.
It seems that a combination of adjusting the waitFor here:
fs.appendFile('out.csv', name + '\n');
await page.waitFor(1000);
var elementExists = await page.$$('.author .media-body');
to 2000 and disabling the rendering of CSS and images solved the problem. The program must have navigated away before entering the loop whenever the page loaded too slowly.
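As an alternative to tuning fixed timeouts, the click and the wait for the navigation it triggers can be started together. The sketch below assumes that clicking the 'Self' link causes a full page navigation, which the post does not confirm; clickAndWaitForNavigation is a hypothetical helper name, and page and handle are assumed to be a Puppeteer Page and ElementHandle.

```javascript
// Hypothetical helper: click an element and wait for the navigation it triggers.
// `page` is assumed to be a Puppeteer Page, `handle` a Puppeteer ElementHandle.
async function clickAndWaitForNavigation(page, handle) {
  // The navigation listener is registered before the click is issued, so a
  // fast page load cannot complete before anything is waiting for it.
  const [response] = await Promise.all([
    page.waitForNavigation({ waitUntil: 'domcontentloaded' }),
    handle.click(),
  ]);
  return response;
}
```

In the loop above this would stand in for the bare `await button.click();`, so that page.$$ only runs once the new document is available.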
I'm using Playwright to scrape some data. How do I click on all links on the page matching a selector?
const { firefox } = require('playwright');
(async () => {
  const browser = await firefox.launch({headless: false, slowMo: 50});
  const page = await browser.newPage();
  await page.goto('https://www.google.com');
  page.pause(); // allow user to manually search for something
  const wut = await page.$$eval('a', links => {
    links.forEach(async (link) => {
      link.click(); // maybe works?
      console.log('whoopee'); // doesn't print anything
      page.goBack(); // crashes
    });
    return links;
  });
  console.log(`wut? ${wut}`); // prints 'wut? undefined'
  await browser.close();
})();
Some issues:
console.log inside the $$eval doesn't do anything.
page.goBack() and page.pause() inside the eval cause a crash.
The return value of $$eval is undefined (if I comment out page.goBack() so I get a return value at all). If I return links.length instead of links, it's correct (i.e. it's a positive integer). Huh?
I get similar results with:
const links = await page.locator('a');
await links.evaluateAll(...)
Clearly I don't know what I'm doing. What's the correct code to achieve something like this?
(X-Y problem alert: I don't actually care if I do this with $$eval, Playwright, or frankly even JavaScript; all I really want is to make this work in any language or tool.)
// launch() is the answerer's own helper and is not shown here; it is assumed
// to resolve to an object that exposes a Playwright BrowserContext as `context`.
const { context } = await launch({ slowMo: 250 });
const page = await context.newPage();
await page.goto('https://stackoverflow.com/questions/70702820/how-can-i-click-on-all-links-matching-a-selector-with-playwright');
const links = page.locator('a:visible');
const linksCount = await links.count();
for (let i = 0; i < linksCount; i++) {
  await page.bringToFront();
  try {
    // Start waiting for the new tab before clicking so the event is not missed.
    const [newPage] = await Promise.all([
      context.waitForEvent('page', { timeout: 5000 }),
      links.nth(i).click({ modifiers: ['Control', 'Shift'] })
    ]);
    await newPage.waitForLoadState();
    console.log('Title:', await newPage.title());
    console.log('URL: ', newPage.url());
    await newPage.close();
  }
  catch {
    // Not every link opens a new tab; skip those that time out.
    continue;
  }
}
There are a number of ways you could do this, but I like this approach the most. Clicking a link, waiting for the page to load, and then going back to the previous page has a lot of problems; most importantly, on many pages the links might change every time the page loads. Ctrl+Shift+clicking opens the link in a new tab, which you can access by using the Promise.all pattern and catching the 'page' event.
I only tried this on this page, so I'm sure there are plenty of other problems that may arise. But for this page in particular, using 'a:visible' was necessary to avoid getting stuck on hidden links. The whole clicking operation is wrapped in a try/catch because some of the links aren't real links and don't open a new page.
Depending on your use case, it may be easiest just to grab all the hrefs from each link:
const links = page.locator('a:visible');
const linksCount = await links.count();
const hrefs = [];
for (let i = 0; i < linksCount; i++) {
  hrefs.push(await links.nth(i).getAttribute('href'));
}
console.log(hrefs);
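Note that getAttribute('href') returns the raw attribute value, which may be relative, missing, or a non-HTTP scheme such as javascript:. If the collected hrefs are going to be passed to page.goto later, a small clean-up step along these lines may help. This is only a sketch: toAbsoluteUrls is a hypothetical helper name, and baseUrl is assumed to be the URL of the page the links were read from.

```javascript
// Hypothetical helper: turn raw href attribute values into unique absolute
// http(s) URLs, resolved against the page they were collected from.
function toAbsoluteUrls(hrefs, baseUrl) {
  const unique = new Set();
  for (const href of hrefs) {
    if (!href) continue; // anchor without an href attribute
    let url;
    try {
      url = new URL(href, baseUrl); // resolves relative paths against baseUrl
    } catch {
      continue; // not parseable as a URL
    }
    // Drop javascript:, mailto: and similar schemes that cannot be navigated to.
    if (url.protocol !== 'http:' && url.protocol !== 'https:') continue;
    unique.add(url.href);
  }
  return [...unique];
}
```

With the snippet above, this could be called as `toAbsoluteUrls(hrefs, page.url())`.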
Try this approach. I will use TypeScript.
await page.waitForSelector(selector, { timeout: 10000 });
const links = await page.$$(selector);
for (const link of links) {
  await link.click({ timeout: 8000 });
  // your additional code
}
See more on https://youtu.be/54OwsiRa_eE?t=488
I've got a pretty simple class that I'm trying to use Puppeteer within, but no matter what I do the async code just doesn't seem to execute when I put a breakpoint on it.
The let data = await page.$$eval will execute and then literally nothing happens after that. The code doesn't even step into the inner function block.
Surely the await on that line should force the inner async block to execute before it moves onto the console log at the bottom?
let url = "https://www.ikea.com/gb/en/p/godmorgon-high-cabinet-brown-stained-ash-effect-40457851/";
let scraper = new Scraper();
scraper.launch(url);
export class Scraper{
  constructor(){}
  async launch(url: string){
    let browser = await puppeteer.launch({});
    let page = await browser.newPage();
    await page.goto(url);
    let data = await page.$$eval(' body *', elements => {
      console.log("Elements: ", elements);
      elements.forEach(element => {
        console.log("Element: ", element.className);
      })
      return "done";
    })
    console.log("Data: ", data);
  }
}
I'm trying to follow this tutorial.
I even copied this block of code directly from the tutorial but still it doesn't work.
await page.goto(this.url);
// Wait for the required DOM to be rendered
await page.waitForSelector('.page_inner');
// Get the link to all the required books
let urls = await page.$$eval('section ol > li', links => {
  // Make sure the book to be scraped is in stock
  links = links.filter(link => link.querySelector('.instock.availability > i').textContent !== "In stock")
  // Extract the links from the data
  links = links.map(el => el.querySelector('h3 > a').href)
  return links;
});
console.log(urls);
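One thing worth checking in both snippets: the callback passed to $$eval runs inside the browser page, so its console.log calls go to the page's console and not to the Node terminal, and a breakpoint set in the Node debugger will not stop inside it. A minimal sketch for mirroring that output into Node follows; forwardPageConsole is a hypothetical helper name, and page is assumed to be a Puppeteer Page.

```javascript
// Hypothetical helper: mirror the page's console output into the Node terminal.
// `page` is assumed to be a Puppeteer Page; `log` defaults to console.log.
function forwardPageConsole(page, log = console.log) {
  // Puppeteer emits a 'console' event for each console call made in the page;
  // msg.text() gives the message as a string.
  page.on('console', (msg) => log(`[page] ${msg.text()}`));
}
```

It would need to be called right after browser.newPage() and before page.goto(), so that messages logged during $$eval are captured.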
I'm trying to practice some web scraping with prices from a supermarket, using Node.js and Puppeteer. I can navigate through the website at the beginning, accepting cookies and clicking a "load more" button. But when I then try to read the divs containing the products with querySelectorAll, I get stuck. It returns undefined even though I wait for a specific div to be present. What am I missing?
Problem is at the end of the code block.
const { product } = require("puppeteer");
const scraperObjectAll = {
  url: 'https://www.bilkatogo.dk/s/?query=',
  async scraper(browser) {
    let page = await browser.newPage();
    console.log(`Navigating to ${this.url}`);
    await page.goto(this.url);
    // accept cookies
    await page.evaluate(_ => {
      CookieInformation.submitAllCategories();
    });
    var productsRead = 0;
    var productsTotal = Number.MAX_VALUE;
    while (productsRead < 100) {
      // Wait for the required DOM to be rendered
      await page.waitForSelector('button.btn.btn-dark.border-radius.my-3');
      // Click button to read more products
      await page.evaluate(_ => {
        document.querySelector("button.btn.btn-dark.border-radius.my-3").click()
      });
      // Wait for it to load the new products
      await page.waitForSelector('div.col-10.col-sm-4.col-lg-2.text-center.mt-4.text-secondary');
      // Get number of products read and total
      const loadProducts = await page.evaluate(_ => {
        let p = document.querySelector("div.col-10.col-sm-4.col-lg-2").innerText.replace("INDLÆS FLERE", "").replace("Du har set ","").replace(" ", "").replace(/(\r\n|\n|\r)/gm,"").split("af ");
        return p;
      });
      console.log("Products (read/total): " + loadProducts);
      productsRead = loadProducts[0];
      productsTotal = loadProducts[1];
      // Now waiting for a div element
      await page.waitForSelector('div[data-productid]');
      const getProducts = await page.evaluate(_ => {
        return document.querySelectorAll('div');
      });
      // PROBLEM HERE!
      // Cannot convert undefined or null to object
      console.log("LENGTH: " + Array.from(getProducts).length);
    }
  }
};
The callback passed to page.evaluate runs in the browser page's context, not in the scope of the Node script. Values can't be passed freely between the page and the Node script: most importantly, anything that isn't serializable (convertible to plain JSON) can't be transferred.
querySelectorAll returns a NodeList, and NodeLists only exist in the browser, not in Node. Similarly, NodeLists contain HTMLElements, which also only exist in the browser.
Put all the logic that requires using the data that exists only on the front-end inside the .evaluate callback, for example:
const numberOfDivs = await page.evaluate(_ => {
  return document.querySelectorAll('div').length;
});
or
const firstDivText = await page.evaluate(_ => {
  return document.querySelector('div').textContent;
});
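The same idea extends to lists of elements: map each element to a plain string or object inside the page and return only that. Below is a minimal sketch using page.$$eval, which passes the matched elements to the callback as an array. readProductIds is a hypothetical helper name, and the div[data-productid] selector is taken from the question without having been checked against the live site.

```javascript
// Hypothetical helper: return the product ids as plain strings.
// `page` is assumed to be a Puppeteer Page.
async function readProductIds(page) {
  // The callback runs in the browser; only the array of strings it returns
  // is serialized and sent back to the Node script.
  return page.$$eval('div[data-productid]', (divs) =>
    divs.map((div) => div.getAttribute('data-productid'))
  );
}
```

The length check from the question then becomes `(await readProductIds(page)).length`, computed on an ordinary array in Node.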
I'm learning to use Puppeteer but I'm running into trouble. I'm trying to create a program which takes in a date and finds a famous person whose birthday is on that date. I have this code:
const puppeteer = require('puppeteer');
try {
  (async () => {
    console.log('here');
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.famousbirthdays.com/');
    console.log('me');
    await page.type(document.querySelector('input'), '11-16-1952');
    console.log('you clicked');
    await page.click(document.querySelector('button'));
    console.log('Here');
    await page.waitForSelector(
      document.querySelector('div[class="list_page"]')
    );
    let data = await page.evaluate(() => {
      let name = document.querySelector('div[class="name"').textContent;
      return { name };
    });
    console.log(data);
    browser.close();
  })();
} catch (err) {
  console.error(err);
}
I don't understand why I'm getting errors at the page.type line. I get an error and can't reach the "you clicked" log. If I read the documentation correctly, .type can take in a selector and the text to type into it, so I'm pretty sure I'm using it correctly. I checked in the browser console, and document.querySelector('input') does pull up the element I want (the search bar). Any advice is appreciated. Thanks for looking.
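For reference, page.type, page.click and page.waitForSelector each take a CSS selector string as their first argument. The document object exists only inside the page, so document.querySelector(...) cannot be evaluated in the Node script. A minimal sketch of the same steps with string selectors follows; findNameByDate is a hypothetical helper name, page is assumed to be a Puppeteer Page that has already navigated to the site, and the selectors are copied from the code above without having been verified against the live page.

```javascript
// Sketch only: `page` is assumed to be a Puppeteer Page that has already
// navigated to the site. The selectors come from the question as posted.
async function findNameByDate(page, date) {
  // Selectors are passed as strings; Puppeteer resolves them inside the page.
  await page.type('input', date);
  await page.click('button');
  await page.waitForSelector('div[class="list_page"]');
  // $eval runs the callback in the page and returns its serializable result.
  return page.$eval('div[class="name"]', (el) => el.textContent.trim());
}
```

Whether 'input' and 'button' match the intended elements depends on the page's markup, so more specific selectors may be needed.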
How to push elements into an array inside an async function (Puppeteer)?
The page is structured into several levels. Levels only appear by clicking on a link inside a cell element with a specific ID.
I have already managed to select the currently shown IDs, push them into an array to loop through, and click the links inside the elements with those IDs to open the next hierarchy level.
After this I repeat the process by looping through the difference between the newly selected array of IDs (old IDs + new IDs) and the array of IDs from the previous loop (old IDs).
The problem appears when executing the second loop.
It seems that I made a mistake when pushing elements into the array inside the async function... Some links don't get clicked, with no apparent pattern, so I assume the problem is caused by the async code.
Thank you! Sorry if that's not a proper description; this is my first question and I am new to this async world.
(async function main() {
  try {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto('---url---');
    await page.waitForSelector('.cotable');
    const arrIDMaster = [];
    await levelLoop(page, arrIDMaster);
    await levelLoop(page, arrIDMaster);
  } catch (e) {
  }
})();
async function levelLoop(page, arrIDMaster) {
  const rows = await page.$$('.coRow.hi.coTableR');
  const arrIDLocalGet = [];
  //Get all ID's
  for (const row of rows) {
    const rowID = await page.evaluate(el => el.id, row);
    await arrIDLocalGet.push(rowID);
  }
  // First one needs to be removed
  arrIDLocalGet.shift();
  //Create Local ID-Array - difference
  const arrIDLocal = await differenceOf2Arrays(arrIDMaster,arrIDLocalGet);
  // Loop through local ID-Array and Click
  for (const id of arrIDLocal) {
    const rows = await page.$('#' + id);
    const link = await rows.$('a.KnotenLink');
    await link.click();
  }
  //push only new ID's to global array
  for (const id of arrIDLocal) {
    await arrIDMaster.push(id);
  }
};
function differenceOf2Arrays (array1, array2) {
  return new Promise(resolve => {
    let arrdiff = array2.filter(x => !array1.includes(x));
    resolve(arrdiff);
  });
};
Hopefully someone can help me; perhaps it's just a general mistake, because I am pretty new to this stuff. Besides that, I am sure this is not the most beautiful solution, so I am also happy to receive suggestions!
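On the array handling specifically: Array.prototype.push and filter are synchronous, so neither the Promise wrapper in differenceOf2Arrays nor the await in front of push changes the order in which IDs are stored. The sketch below is a plain synchronous version of the difference step, offered as a simplification and not as a confirmed fix for the missed clicks. It uses a Set for the lookup, which is my own substitution for the includes call.

```javascript
// Synchronous variant of the difference step: the elements of array2 that
// are not present in array1, in their original order.
function differenceOf2Arrays(array1, array2) {
  const known = new Set(array1); // constant-time membership checks
  return array2.filter((x) => !known.has(x));
}
```

Separately, the empty catch block in main() discards any error thrown while clicking, so logging e there may show whether the missed clicks are failing silently.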