How to get all clickable elements on the webpage using puppeteer? - javascript

For web-scraping purposes, I want to find all URLs on a website that I can access via the 'a' tag. Refer to the script below:
// Get all URLs on the page
let urls = await page.evaluate(() => {
  let results = [];
  let items = document.querySelectorAll('a');
  items.forEach((item) => {
    results.push({
      url: item.href,
    });
  });
  return results;
});
Now some URLs are hidden and can only be reached by clicking elements on the page. How can I get the list of all clickable elements on a page using Puppeteer or Node.js?
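One approach worth trying (a sketch, not a complete answer): collect every element matching a list of commonly clickable selectors inside page.evaluate(). The selector list and the describeClickable helper below are assumptions to adapt to your site.

```javascript
// Selector for commonly clickable elements -- extend as needed (assumption).
const CLICKABLE_SELECTOR =
  'a[href], button, input[type="button"], input[type="submit"], [onclick], [role="button"]';

// Pure helper describing an element-like object; written DOM-free so it can
// also run outside a browser.
function describeClickable(el) {
  return {
    tag: el.tagName.toLowerCase(),
    text: (el.innerText || '').trim(),
    href: el.href || null,
  };
}

// Inside Puppeteer the mapping has to be inlined, because functions defined
// in Node.js are not available inside page.evaluate() (sketch):
// const clickables = await page.evaluate((sel) =>
//   Array.from(document.querySelectorAll(sel)).map((el) => ({
//     tag: el.tagName.toLowerCase(),
//     text: (el.innerText || '').trim(),
//     href: el.href || null,
//   })), CLICKABLE_SELECTOR);
```

Note that this only finds elements that are declaratively clickable; elements wired up via addEventListener cannot be detected from a selector alone.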

Related

How to display all players names in the Ball Don't Lie API array

I am doing one of my first projects using the Ball Don't Lie API, trying to build my own version of an ESPN landing page. I am using https://www.balldontlie.io/api/v1/players with plain JavaScript. I have been stuck for days trying to understand how to display the first and last names of all the players on the landing page in HTML. I only know how to display one name, by using data.data[0]. I've tried .map and loops, but it's just not clicking. I also want to be able to display other stats from the array. Can anyone help?
This my Javascript code:
async function getPlayers() {
  const response = await fetch('https://www.balldontlie.io/api/v1/players');
  const data = await response.json();
  const players = data.data;
  console.log(players);
  displayPlayer(players);
}
function displayPlayer(players) {
  const scores = document.getElementById('scores');
  scores.innerHTML = `
  ${players.first_name} ${players.last_name}`;
}
getPlayers()
I had tried .map and loops; I am just not understanding what function is going to show the players. Maybe my original code doesn't make sense. I've tried watching YouTube and can't find anyone doing it in simple JavaScript.
You can try this in your script, then edit steps 2 and 4 to better display what you need to show:
// 1. GET request using fetch()
fetch("https://www.balldontlie.io/api/v1/players")
  // Converting received data to JSON
  .then((response) => response.json())
  .then((json) => {
    // 2. Create a variable to store HTML table headers
    let li = `<tr><th>ID</th><th>first_name</th><th>height_feet</th><th>height_inches</th><th>last_name</th><th>position</th><th>im lazy...</th></tr>`;
    // 3. Loop through each record and add a table row
    console.log(json.data);
    json.data.forEach((user) => {
      li += `<tr>
        <td>${user.id}</td>
        <td>${user.first_name}</td>
        <td>${user.height_feet}</td>
        <td>${user.height_inches}</td>
        <td>${user.last_name}</td>
        <td>${user.position}</td>
        <td>${user.team.id}</td>
        <td>${user.team.abbreviation}</td>
        <td>${user.team.city}</td>
        <td>${user.team.conference}</td>
        <td>${user.team.division}</td>
        <td>${user.team.full_name}</td>
        <td>${user.team.name}</td>
      </tr>`;
    });
    // 4. Display the result in the DOM
    document.getElementById("users").innerHTML = li;
  });
And your HTML body part would look like this:
<div>
  <!-- Table to display fetched user data -->
  <table id="users"></table>
</div>
Your constant players is an array. In order to access a player's information within that array, you would need to index each player to then access their object of key:value pairs.
That is why you can get the first player's name to show when you save players as data.data[0]. This is indicating that you want to access the object in position 0 in the array. If you wanted the second player's information you would reference data.data[1], and so forth.
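A minimal illustration of the difference (the two player objects are made up; the field names come from the question):

```javascript
// Shape returned by the API: an object whose 'data' property is an array
// of player objects.
const data = {
  data: [
    { first_name: 'First', last_name: 'Player' },
    { first_name: 'Second', last_name: 'Player' },
  ],
};

// data.data[0] picks one player; .map visits every player.
const oneName = `${data.data[0].first_name} ${data.data[0].last_name}`;
const allNames = data.data.map((p) => `${p.first_name} ${p.last_name}`);
// oneName  -> 'First Player'
// allNames -> ['First Player', 'Second Player']
```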
Trying to keep as much of your original code as possible (and adding some comments), I believe this is what you were trying to achieve:
async function getPlayers() {
  // Fetch the API and convert the response to JSON.
  const response = await fetch('https://www.balldontlie.io/api/v1/players');
  const data = await response.json();
  // Save the returned data as an array.
  const players = data.data;
  console.log(players);
  // Create an element to display each individual player's information.
  players.forEach(player => {
    displayPlayer(player);
  });
}
function displayPlayer(player) {
  // Grab the element encasing all players.
  const scores = document.getElementById('scores');
  // Create a new element for the individual player.
  const playerContent = document.createElement('div');
  // Add the player's name.
  playerContent.innerHTML = `${player.first_name} ${player.last_name}`;
  // Add the player content into the encasing div.
  scores.appendChild(playerContent);
}
getPlayers()
We use the forEach() function to iterate over each player's object in the data array, grab the "scores" element you created on your HTML page, and then "append" (add to the end) each player's information into that "scores" element.
The website link below has some useful information to read that can help you build on your existing code when you want to start adding styling.
https://www.thesitewizard.com/javascripts/insert-div-block-javascript.shtml
This site has some useful information on using "promises" when dealing with async functions that will come in handy as you progress in coding.
https://www.geeksforgeeks.org/why-we-use-then-method-in-javascript/
These website links were added as of 02/04/2023 (just to add as a disclaimer to the links because who knows what they will do in 2030 O.o).
Hope this helps!

Puppeteer Page.$$eval() method returning empty arrays

I'm building a web scraping application with puppeteer. I'm trying to get an array of links to scrape from but it returns an empty array.
const scraperObject = {
  url: 'https://www.walgreens.com/',
  async scraper(browser) {
    let page = await browser.newPage();
    console.log(`Navigating to ${this.url}...`);
    await page.goto(this.url);
    // Wait for the required DOM to be rendered
    await page.waitForSelector('.CA__Featured-categories__full-width');
    // Get all the required links in the featured categories
    let urls = await page.$$eval('.list__contain > ul#at-hp-rp-featured-ul > li', links => {
      // Extract the links from the data
      links = links.map(el => el.querySelector('li > a').href);
      return links;
    });
When I ran this, the console gave me the needed array of links (example below):
Navigating to https://www.walgreens.com/...
[
  'https://www.walgreens.com/seasonal/holiday-gift-shop?ban=dl_dl_FeatCategory_HolidayShop_TEST',
  'https://www.walgreens.com/store/c/cough-cold-and-flu/ID=20002873-tier1?ban=dl_dl_FeatCategory_CoughColdFlu_TEST',
  'https://www.walgreens.com/store/c/contact-lenses/ID=359432-tier2clense?ban=dl_dl_FeatCategory_ContactLenses_TEST'
]
So from here, I had to navigate to one of those URLs through the code block below and rerun the same logic, going through an array of categories to eventually reach the product listings page.
// Navigate to Household Essentials
let hEssentials = await browser.newPage();
await hEssentials.goto(urls[11]);
// Wait for the required DOM to be rendered
await page.waitForSelector('.content');
// Get all the required links in the featured categories
let shopByNeedUrls = await page.$$eval('div.wag-row > div.wag-col-3 wag-col-md-6 wag-col-sm-6 CA__MT30', links1 => {
  // Extract the links from the data
  links1 = links1.map(el => el.querySelector('div > a').href);
  return links1;
});
console.log(shopByNeedUrls);
}
However, whenever I run this code, the console shows the same navigating message but then returns an empty array (as shown in the example below):
Navigating to https://www.walgreens.com/...
[]
If anyone is able to explain why I'm outputting an empty array, that'd be great. Thank you.
I've attempted to change the arguments of the page.waitForSelector and page.$$eval methods, but none of the changes appeared to work; the output was the same. In fact, I sometimes receive a timeout error from the page.waitForSelector method.
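One thing worth checking: in the second page.$$eval call, the selector 'div.wag-col-3 wag-col-md-6 wag-col-sm-6 CA__MT30' treats the later class names as descendant tag names, because multiple classes on the same element must each be prefixed with a dot. A sketch of the corrected compound selector (the class names come from the question; that this is the actual cause is an assumption):

```javascript
// Build a compound class selector: every class on the same element gets a dot.
function compoundClassSelector(tag, classes) {
  return tag + classes.map((c) => '.' + c).join('');
}

const selector =
  'div.wag-row > ' +
  compoundClassSelector('div', ['wag-col-3', 'wag-col-md-6', 'wag-col-sm-6', 'CA__MT30']);
// -> 'div.wag-row > div.wag-col-3.wag-col-md-6.wag-col-sm-6.CA__MT30'
```

Also note that the snippet navigates hEssentials to the new URL but then calls page.waitForSelector and page.$$eval on the original page object, so the new page's content is never queried.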

How do I get a web page element using a .querySelector()? Webdriver-io

Within Webdriver-io tests, I need to get a web page element using .querySelector()
This code:
a = browser.executeScript('window.document.querySelector("div.my_diagram_div canvas")', [])
console.log('a = ', await a)
outputs the following output:
a = { 'element-6066-11e4-a52e-4f735466cecf': 'ELEMENT-40' }
It's not an element object, I can't work with it any further. How do I get the web page element object?
P.S. In the browser console, the code returns the correct result
Whatever you get in the console is just a representation of the element; it's not the actual output.
If you want that HTML tag, use:
a = browser.executeScript('window.document.querySelector("div.my_diagram_div canvas").outerHTML', [])
console.log('a = ', await a)
or
a = await $('div.my_diagram_div canvas')
console.log('a = ', await a.getProperty("outerHTML"))
UPDATE
If you need the element object, just use:
a = await browser.executeScript('window.document.querySelector("div.my_diagram_div canvas")', [])
elm = await $(a)
await elm.getWindowRect()
Also, in this case you don't need executeScript:
elm = await $('div.my_diagram_div canvas')
await elm.getWindowRect()

How to make doc.id clickable item?

How can I turn doc.id into a clickable item?
Right now I can get data from my database and list all the doc.ids, but I want to make those into hyperlinks: when one is pressed, it gets the data from the database and shows it as a drawing on the canvas. If that makes sense?
doc.id is the unique ID for items saved in my Firestore DB.
const allDrawings = document.querySelector('#allDrawings');
function renderDrawings(doc) {
  let li = document.createElement('li');
  let key = document.createElement('doc.id');
  li.setAttribute('data-id', doc.id);
  key.textContent = doc.id;
  li.appendChild(key);
  allDrawings.appendChild(li);
}
db.collection('joonistused').get().then((snapshot) => {
  snapshot.docs.forEach(doc => {
    renderDrawings(doc);
    console.log(doc.id);
  });
});
If you're just trying to create an anchor in the DOM then try something like this:
const anchor = document.createElement('a');
anchor.href = `/some/path/to/${doc.id}`;
anchor.innerText = `Document ID ${doc.id}`;
// Document ID 123
I believe what you are looking for is the design/logic to list the documents in a Firestore collection in your browser as list items, then click on an item and have that document's content displayed to the user. This requires programming logic on your end: write client-side JavaScript that is called when a link is clicked (onclick), and in that handler make a web client call to the Firestore database to retrieve the corresponding document.
See: https://firebase.google.com/docs/reference/js/firebase.firestore.DocumentReference.html#get
This looks very useful too:
Firebase Firestore Tutorial #3 - Getting Documents
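The click-to-fetch flow described above can be sketched as a small function. fetchDrawing and renderOnCanvas are hypothetical names, and the collection name comes from the question:

```javascript
// Fetch one document by id and hand its data to a render callback.
// 'db' is the Firestore handle already used in the question;
// 'renderOnCanvas' is a hypothetical function that draws the data.
function fetchDrawing(db, id, renderOnCanvas) {
  return db
    .collection('joonistused')
    .doc(id)
    .get()
    .then((snap) => renderOnCanvas(snap.data()));
}

// Wiring it to the list items built in renderDrawings (sketch):
// li.addEventListener('click', () => fetchDrawing(db, li.dataset.id, renderOnCanvas));
```

Because the list items already carry data-id attributes, the handler can read li.dataset.id instead of keeping a separate lookup table.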

slow looping over pages and extracting data using puppeteer

I have a table looking like that. All the names in the name columns are links that navigate to the next page.
|--------|----|
| NAME   | ID |
|--------|----|
| Name 1 | 1  |
|--------|----|
| Name 2 | 2  |
|--------|----|
| Name 3 | 3  |
|--------|----|
I am trying to grab each link, extract data from the page behind it, and then return to the table. However, there are over 4000 records in the table and everything is processed very slowly (around 1000 ms per record).
Here is my code:
// Grab all table rows.
const items = await page.$$(domObjects.itemPageLink);
for (let i = 0; i < items.length; i++) {
  await page.goto(url);
  await page.waitForSelector(domObjects.itemPageLink);
  let items = await page.$$(domObjects.itemPageLink);
  const item = items[i];
  let id = await item.$eval("td:last-of-type", node => node.innerText.split(",").map(item => item.trim()));
  let link = await item.$eval("td:first-of-type a", node => node.click());
  await page.waitForSelector(domObjects.itemPageWrapper);
  let itemDetailsPage = await page.$(domObjects.itemPageWrapper);
  let title = await itemDetailsPage.$eval(".page-header__title", title => title.innerText);
  console.log(title);
  console.log(id);
}
Is there a way to speed up this so I can get all the results at once much quicker? I would like to use this for my API.
There are some minor code improvements and one major improvement which can be applied here.
Minor improvements: Use fewer puppeteer functions
The minor improvements boil down to using as few puppeteer functions as possible. Most of the puppeteer functions you use send data from the Node.js environment to the browser environment via a WebSocket. While this only takes a few milliseconds, those milliseconds add up in the long run. For more information, you can check out this question asking about the difference between using page.evaluate and using more puppeteer functions.
This means that to optimize your code you can, for example, use querySelector inside the page instead of running item.$eval multiple times. Another optimization is to use the result of page.waitForSelector directly: the function returns the node when it appears, so you do not need to query it again via page.$ afterwards.
These are only minor improvements, which might slightly improve the crawling speed.
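As a sketch of those minor improvements, the two per-row $eval calls can collapse into a single page.$$eval round-trip (the domObjects.* selectors come from the question). The id-cell parsing is factored into a plain function here so it can be shown on its own; inside $$eval it has to be inlined, since Node.js functions are not available in the browser context:

```javascript
// Parse the comma-separated id cell into trimmed parts.
function parseIdCell(text) {
  return text.split(',').map((s) => s.trim());
}

// One WebSocket round-trip instead of two per row (sketch, needs a live page):
// const rows = await page.$$eval(domObjects.itemPageLink, (items) =>
//   items.map((item) => ({
//     id: item.querySelector('td:last-of-type').innerText
//       .split(',').map((s) => s.trim()),
//     url: item.querySelector('td:first-of-type a').href,
//   })));
// // waitForSelector already returns the node -- no extra page.$ needed:
// const wrapper = await page.waitForSelector(domObjects.itemPageWrapper);
```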
Major improvement: Use a puppeteer pool
Right now, you are using one browser with a single page to crawl multiple URLs. You can improve the speed of your script by using a pool of puppeteer resources to crawl multiple URLs in parallel. puppeteer-cluster allows you to do exactly that (disclaimer: I'm the author). The library takes a task and applies it to a number of URLs in parallel.
The number of parallel instances you can use depends on your CPU, memory, and network throughput. The more resources you can use, the better your crawling speed will be.
Code Sample
Below is a minimal example, adapting your code to extract the same data. The code first sets up a cluster with one browser and four pages. After that, a task function is defined which will be executed for each of the queued objects.
After this, one page instance of the cluster is used to extract the IDs and URLs from the initial page. The function given to the cluster.queue extracts the IDs and URLs from the page and calls cluster.queue with the objects being { id: ..., url: ... }. For each of the queued objects, the cluster.task function is executed, which then extracts the title and prints it out next to the passed ID.
const { Cluster } = require('puppeteer-cluster');

// Setup your cluster with 4 pages
const cluster = await Cluster.launch({
  concurrency: Cluster.CONCURRENCY_PAGE,
  maxConcurrency: 4,
});

// Define the task for the pages (go to the URL, and extract the title)
await cluster.task(async ({ page, data: { id, url } }) => {
  await page.goto(url);
  const itemDetailsPage = await page.waitForSelector(domObjects.itemPageWrapper);
  const title = await itemDetailsPage.$eval('.page-header__title', title => title.innerText);
  console.log(id, title);
});

// Use one page of the cluster to extract the links (ignoring the task function above)
cluster.queue(async ({ page }) => {
  await page.goto(url); // url is given from outside
  // Extract the links and ids from the initial page
  const itemData = await page.$$eval(domObjects.itemPageLink, items => items.map(item => ({
    id: item.querySelector('td:last-of-type').innerText.split(',').map(s => s.trim()),
    url: item.querySelector('td:first-of-type a').href,
  })));
  // Queue the data: { id: ..., url: ... } to start the process
  itemData.forEach(data => cluster.queue(data));
});

// Wait for the queue to drain, then shut the cluster down
await cluster.idle();
await cluster.close();
