I'm new to JavaScript, so maybe it's a simple mistake. I'm trying to pass the values of the object that I get from this web-scraping function to a constant, but I'm not succeeding. Every time I try to print the menu it prints as "undefined".
```javascript
const puppeteer = require("puppeteer");

async function getMenu() {
  console.log("Opening the browser...");
  const browser = await puppeteer.launch({
    headless: true
  });
  const page = await browser.newPage();
  await page.goto('https://pra.ufpr.br/ru/ru-centro-politecnico/', { waitUntil: 'domcontentloaded' });
  console.log("Content loaded...");

  // Get the viewport of the page
  const fullMenu = await page.evaluate(() => {
    return {
      day: document.querySelector('#conteudo div:nth-child(3) p strong').innerText,
      breakfastFood: document.querySelector('tbody tr:nth-child(2)').innerText,
      lunchFood: document.querySelector('tbody tr:nth-child(4)').innerText,
      dinnerFood: document.querySelector('tbody tr:nth-child(6)').innerText
    };
  });

  await browser.close();

  return {
    breakfast: fullMenu.day + "\nCafé da Manhã:\n" + fullMenu.breakfastFood,
    lunch: fullMenu.day + "\nAlmoço:\n" + fullMenu.lunchFood,
    dinner: fullMenu.day + "\nJantar:\n" + fullMenu.dinnerFood
  };
}

const menu = getMenu();
console.log(menu.breakfast);
```
I've tried passing these values to a variable in several ways, but without success. I'm also open to other methods of passing these strings; I'm doing it this way because it's the simplest I could think of.
Your getMenu() is an async function, so it returns a promise rather than the object itself.
In your last bit of code, can you change it to:

```javascript
(async () => {
  let menu = await getMenu();
  console.log(menu.breakfast);
})();
```
credit to this post.
I don't have access to the package that you imported. You may try changing the last part of your code to the following (note that top-level await only works in an ES module or a modern Node REPL; otherwise wrap it in an async function):

```javascript
const menu = await getMenu();
if (menu) {
  console.log(menu.breakfast);
}
```
Explanation
getMenu() and await getMenu() are different things in JS. getMenu() returns a Promise object, which is not itself the string / number / object you want. await getMenu() tells JS to suspend the surrounding function until that promise resolves, and then gives you the resolved value.
Without await, console.log(menu.breakfast) runs immediately: at that moment menu is still a Promise object, and a Promise has no breakfast property, so you get undefined.
The if (menu) {...} statement is just a guard in case getMenu() resolves to nothing (for example, if scraping failed); it is the await that makes JS wait for the result before going on. This is useful when you want to console.log() an async/await return value.
Related
I'm using Playwright to scrape some data. How do I click on all links on the page matching a selector?
```javascript
const { firefox } = require('playwright');

(async () => {
  const browser = await firefox.launch({ headless: false, slowMo: 50 });
  const page = await browser.newPage();
  await page.goto('https://www.google.com');
  page.pause(); // allow user to manually search for something
  const wut = await page.$$eval('a', links => {
    links.forEach(async (link) => {
      link.click(); // maybe works?
      console.log('whoopee'); // doesn't print anything
      page.goBack(); // crashes
    });
    return links;
  });
  console.log(`wut? ${wut}`); // prints 'wut? undefined'
  await browser.close();
})();
```
Some issues:
console.log inside the $$eval doesn't do anything.
page.goBack() and page.pause() inside the eval cause a crash.
The return value of $$eval is undefined (if I comment out page.goBack() so I get a return value at all). If I return links.length instead of links, it's correct (i.e. it's a positive integer). Huh?
I get similar results with:

```javascript
const links = await page.locator('a');
await links.evaluateAll(...)
```
Clearly I don't know what I'm doing. What's the correct code to achieve something like this?
(X-Y problem alert: I don't actually care if I do this with $$eval, Playwright, or frankly even Javascript; all I really want to do is make this work in any language or tool).
```javascript
const { context } = await launch({ slowMo: 250 });
const page = await context.newPage();
await page.goto('https://stackoverflow.com/questions/70702820/how-can-i-click-on-all-links-matching-a-selector-with-playwright');

const links = page.locator('a:visible');
const linksCount = await links.count();

for (let i = 0; i < linksCount; i++) {
  await page.bringToFront();
  try {
    const [newPage] = await Promise.all([
      context.waitForEvent('page', { timeout: 5000 }),
      links.nth(i).click({ modifiers: ['Control', 'Shift'] })
    ]);
    await newPage.waitForLoadState();
    console.log('Title:', await newPage.title());
    console.log('URL:  ', newPage.url());
    await newPage.close();
  } catch {
    continue;
  }
}
```
There are a number of ways you could do this, but I like this approach the most. Clicking a link, waiting for the page to load, and then going back to the previous page has a lot of problems; most importantly, for many pages the links might change every time the page loads. Ctrl+Shift+clicking opens the link in a new tab, which you can access using the Promise.all pattern and catching the 'page' event.
I only tried this on this page, so I'm sure there are other problems that may arise. But for this page in particular, using 'a:visible' was necessary to prevent getting stuck on hidden links. The whole clicking operation is wrapped in a try/catch because some of the links aren't real links and don't open a new page.
Depending on your use case, it may be easiest just to grab all the hrefs from each link:
```javascript
const links = page.locator('a:visible');
const linksCount = await links.count();
const hrefs = [];
for (let i = 0; i < linksCount; i++) {
  hrefs.push(await links.nth(i).getAttribute('href'));
}
console.log(hrefs);
```
Try this approach. I will use TypeScript.

```javascript
await page.waitForSelector(selector, { timeout: 10000 });
const links = await page.$$(selector);
for (const link of links) {
  await link.click({ timeout: 8000 });
  // your additional code
}
```
See more on https://youtu.be/54OwsiRa_eE?t=488
I'm trying to practice some web scraping with prices from a supermarket, using Node.js and Puppeteer. I can navigate through the website at the beginning, accepting cookies and clicking a "load more" button. But when I try to read the divs containing the products with querySelectorAll, I get stuck: it returns undefined even though I wait for a specific div to be present. What am I missing?
The problem is at the end of the code block.
```javascript
const { product } = require("puppeteer");

const scraperObjectAll = {
  url: 'https://www.bilkatogo.dk/s/?query=',
  async scraper(browser) {
    let page = await browser.newPage();
    console.log(`Navigating to ${this.url}`);
    await page.goto(this.url);

    // accept cookies
    await page.evaluate(_ => {
      CookieInformation.submitAllCategories();
    });

    var productsRead = 0;
    var productsTotal = Number.MAX_VALUE;

    while (productsRead < 100) {
      // Wait for the required DOM to be rendered
      await page.waitForSelector('button.btn.btn-dark.border-radius.my-3');
      // Click button to read more products
      await page.evaluate(_ => {
        document.querySelector("button.btn.btn-dark.border-radius.my-3").click()
      });
      // Wait for it to load the new products
      await page.waitForSelector('div.col-10.col-sm-4.col-lg-2.text-center.mt-4.text-secondary');

      // Get number of products read and total
      const loadProducts = await page.evaluate(_ => {
        let p = document.querySelector("div.col-10.col-sm-4.col-lg-2").innerText.replace("INDLÆS FLERE", "").replace("Du har set ", "").replace(" ", "").replace(/(\r\n|\n|\r)/gm, "").split("af ");
        return p;
      });
      console.log("Products (read/total): " + loadProducts);
      productsRead = loadProducts[0];
      productsTotal = loadProducts[1];

      // Now waiting for a div element
      await page.waitForSelector('div[data-productid]');
      const getProducts = await page.evaluate(_ => {
        return document.querySelectorAll('div');
      });

      // PROBLEM HERE!
      // Cannot convert undefined or null to object
      console.log("LENGTH: " + Array.from(getProducts).length);
    }
  }
};
```
The callback passed to page.evaluate runs in the emulated page context, not in the standard scope of the Node script. Expressions can't be passed between the page and the Node script without careful considerations: most importantly, if something isn't serializable (converted into plain JSON), it can't be transferred.
querySelectorAll returns a NodeList, and NodeLists only exist on the front-end, not the backend. Similarly, NodeLists contain HTMLElements, which also only exist on the front-end.
Put all the logic that requires using the data that exists only on the front-end inside the .evaluate callback, for example:

```javascript
const numberOfDivs = await page.evaluate(_ => {
  return document.querySelectorAll('div').length;
});
```

or

```javascript
const firstDivText = await page.evaluate(_ => {
  return document.querySelector('div').textContent;
});
```
I have been having some problems understanding when/how await is used and required in TestCafé.
Why does this function require an await on the first line?
```javascript
async getGroupCount(groupName) {
  const groupFilterLinkText = await this.groupFilterLink.withText(groupName).textContent;
  const openInd = groupFilterLinkText.indexOf('(');
  const closeInd = groupFilterLinkText.indexOf(')');
  const count = parseInt(groupFilterLinkText.slice(openInd + 1, closeInd), 10);
  return count;
}
```
I later found in the doc that it says "Note that these methods and property getters are asynchronous, so use await to obtain an element's property." However, I'm using the function I wrote in a pretty synchronous way so this took me completely by surprise. I have ensured that the page is in the state that I want it to be in, and I just want to parse a string, but the test errors on line 2 of this function without the async/await.
Related to this, this is an example test case that calls that function:
```javascript
test('Verify an account owner can create a single user', async (t) => {
  const userName = 'Single User';
  const userEmail = 'singleuser#wistia.com';

  const origUserCount = await usersPage.getGroupCount('All Users');

  await t
    .click(usersPage.addUserButton);

  await waitForReact();

  await t
    .typeText(usersPage.addUserModal.nameInput, userName)
    .typeText(usersPage.addUserModal.emailInput, userEmail)
    .expect(usersPage.addUserModal.saveButton.hasAttribute('disabled')).notOk()
    .click(usersPage.addUserModal.saveButton)
    .expect(usersPage.userCard.withText(userName).exists).ok({ timeout: 10000 });

  const newUserCount = await usersPage.getGroupCount('All Users');

  await t
    .expect(newUserCount).eql(origUserCount + 1);
});
```
Originally, I had the last few lines of the test looking like this:
```javascript
await t
  ...
  .click(usersPage.addUserModal.saveButton)
  .expect(usersPage.userCard.withText(userName).exists).ok({ timeout: 10000 })
  .expect(await usersPage.getGroupCount('All Users')).eql(origUserCount + 1);
```
That is, it was included in one function chain for the entire test. This failed because the value of await usersPage.getGroupCount('All Users') was still returning the original value instead of getting updated. I don't understand why I need to separate that out into its own call to the function. Why is that getGroupCount being evaluated seemingly at the beginning of the test rather than when the test reaches that line of code? It doesn't seem to be behaving asynchronously in the way that I expected.
Looking at the documentation (https://devexpress.github.io/testcafe/documentation/test-api/built-in-waiting-mechanisms.html), it seems like you don't have to separate it out into its own call, but you will have to shift the await to the beginning of the chain.
So for example, instead of:

```javascript
<something>
  .click(usersPage.addUserModal.saveButton)
  .expect(usersPage.userCard.withText(userName).exists).ok({ timeout: 10000 })
  .expect(await usersPage.getGroupCount('All Users')).eql(origUserCount + 1);
```

You can do:

```javascript
await <something>
  .click(usersPage.addUserModal.saveButton)
  .expect(usersPage.userCard.withText(userName).exists).ok({ timeout: 10000 })
  .expect(usersPage.getGroupCount('All Users')).eql(origUserCount + 1);
```
I'm trying to scrape data from a page that loads dynamically, using the headless browser Puppeteer.
Puppeteer can be seen as the headlessBrowserClient in the code.
The main challenge is to gracefully close the browser as soon as the needed data is received. But if you close it before evaluateCustomCode has finished executing, its progress would be lost.
evaluateCustomCode is a function that can be called as if we ran it in the Chrome Dev Tools.
To have control over the network requests and the async flow of the Puppeteer API, I use an async generator that encapsulates all the logic described above.
The problem is that I feel the code smells, but I can't see a better solution.
Ideas?
```javascript
module.exports = function buildClient (headlessBrowserClient) {
  const getPageContent = async (pageUrl, evaluateCustomCode) => {
    const request = sendRequest(pageUrl)
    const { value: page } = await request.next()
    if (page) {
      const pageContent = await page.evaluate(evaluateCustomCode)
      request.next()
      return pageContent
    }
  }

  async function * sendRequest (url) {
    const browser = await headlessBrowserClient.launch()
    const page = await browser.newPage()
    const state = {
      req: { url },
    }
    try {
      await page.goto(url)
      yield page
    } catch (error) {
      throw new APIError(error, state)
    } finally {
      yield browser.close()
    }
  }

  return {
    getPageContent,
  }
}
```
You can use waitForFunction or waitFor together with evaluate and Promise.all. No matter how dynamic the website is, you are waiting for something to become true at the end, and you close the browser when that happens.
Since I do not have access to your dynamic url, I am going to use some random variables and delays as an example. It will resolve once the variable returns truthy.
```javascript
await page.waitForFunction(() => !!someVariableThatShouldBeTrue);
```
What if your dynamic page actually creates a selector somewhere after you evaluate the code? In that case:

```javascript
await page.waitFor('someSelector')
```
Now back to your customCode; let me rename that for you a bit:

```javascript
await page.evaluate(customCode)
```
Where customCode is something that will set a variable someVariableThatShouldBeTrue to true somewhere. Honestly it can be anything: a request, a string, whatever. The possibilities are endless.
You can also put a promise inside page.evaluate; recent Chromium supports them very well. So the following will work too, resolving once the function/data is loaded. Make sure customCode is an async function or returns a promise.

```javascript
const pageContent = await page.evaluate(CustomCode);
```
Alright, now we have all the required pieces. I modified the code a bit so it doesn't smell to me :D

```javascript
module.exports = function buildClient(headlessBrowserClient) {
  return {
    getPageContent: async (url, CustomCode) => {
      const state = {
        req: { url },
      };

      // declared here so we can close them in the "finally" block
      let browser, page;

      try {
        // launch browser
        browser = await headlessBrowserClient.launch()
        page = await browser.newPage()
        await page.goto(url)

        // evaluate and wait for something to happen;
        // pass the promises directly so both run concurrently,
        // the first element of the result is the pageContent
        const [pageContent] = await Promise.all([
          page.evaluate(CustomCode),
          page.waitForFunction(() => !!someVariableThatShouldBeTrue)
        ])

        // Or: you can put a promise inside page.evaluate, recent Chromium supports them very well
        // const pageContent = await page.evaluate(CustomCode)

        return pageContent;
      } catch (error) {
        throw new APIError(error, state)
      } finally {
        // NOTE: maybe we can move these into a different function;
        // guard against launch failures leaving them undefined
        if (page) await page.close()
        if (browser) await browser.close()
      }
    }
  }
}
```
You can change and tweak it more as you wish. I did not test the final code (since I don't have APIError, evaluateCustomCode, etc.), but it should work.
It doesn't need all those generators and stuff like that. Promises: that's how you can deal with dynamic pages :D
PS: IMO, such questions are a better fit for Code Review.
I am trying to make a script that:
Grabs all urls from a sitemap
Takes a screenshot of it with puppeteer
I am currently trying to understand how to code asynchronously, but I still have trouble finding the right coding pattern for this problem.
Here is the code I currently have :
```javascript
// const spider = require('./spider');
const Promise = require('bluebird');
const puppeteer = require('puppeteer');
const SpiderConstructor = require('sitemapper');

async function crawl(url, timeout) {
  const results = await spider(url, timeout);
  await Promise.each(results, async (result, index) => {
    await screen(result, index);
  });
}

async function screen(result, index) {
  const browser = await puppeteer.launch();
  console.log('doing', index);
  const page = await browser.newPage();
  await page.goto(result);
  const path = await 'screenshots/' + index + page.title() + '.png';
  await page.screenshot({path});
  browser.close();
}

async function spider(url, timeout) {
  const spider = await new SpiderConstructor({
    url: url,
    timeout: timeout
  });
  const data = await spider.fetch();
  console.log(data.sites.length);
  return data.sites;
}

crawl('https://www.google.com/sitemap.xml', 15000)
  .catch(err => {
    console.error(err);
  });
```
I am having the following problems:
The length of the results array is not constant; it varies every time I launch the script. I guess it is not fully resolved when I display it, but I thought the whole point of await was that we are guaranteed that on the next line the promise is resolved.
The actual screenshotting part of the script doesn't work half the time, and I am pretty sure I have unresolved promises, but I don't know the right pattern for looping over an async function. Right now it seems like it does one screenshot after the other (linear and incremental), but I get a lot of duplicates.
Any help is appreciated. Thank you for your time.