Trying to scrape websites using puppeteer and getting back empty objects - javascript

I began learning puppeteer today and ran into a problem. I was trying to create a covid tracker and wanted to scrape data from worldometers, but when I try to get information back it returns an array of empty objects. The number of objects matches the number of tags with the same class, but it doesn't show any information. Here is my code:
const puppeteer = require("puppeteer")

async function getCovidCases() {
  const browser = await puppeteer.launch({
    defaultViewport: null,
    headless: false,
    slowMo: 250
  })
  const page = await browser.newPage()
  const url = "https://www.worldometers.info/coronavirus/#countries"
  await page.goto(url, { waitUntil: 'networkidle0' })
  await page.waitForSelector(".navbar-nav", { visible: true })
  const results = await page.$$eval(".navbar-nav", rows => {
    return rows
  })
  await console.log(results)
}

getCovidCases()
Does anyone know what to do?

Based on the selector, I assume you are interested in the navbar items at this step.
const results = await page.$$eval(".navbar-nav", navBars => {
  return navBars.map(navBar => {
    const anchors = Array.from(navBar.getElementsByTagName('a'));
    return anchors.map(anchor => anchor.innerText);
  });
})
This yields [ [ 'Coronavirus', 'Population' ] ] and might be useful for you.
Use $eval if you expect only one element and $$eval if you expect multiple elements. In the callback you have a reference to the DOM element, but you cannot return it directly. If you console.log anything there, it won't show up in the Node.js terminal but in the browser's console. Whatever you return is sent back to Node.js and needs to be serializable. A raw DOM node like navBar is not, so it gets converted to an empty object, which is not what you want. That's why I map over it and convert it to a string (innerText).
If you want to scrape other data, you should use another selector (not .navbar-nav).
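For instance, to pull rows from the main countries table rather than the navbar, a sketch like the following should work. The table id #main_table_countries_today and the header names passed to rowToRecord are assumptions about the page's markup, so verify them in devtools before relying on them:

```javascript
// Pure helper: zip assumed header names with one row of cell strings.
function rowToRecord(headers, cells) {
  const record = {};
  headers.forEach((name, i) => { record[name] = cells[i]; });
  return record;
}

async function getCovidCases() {
  // puppeteer is required lazily so rowToRecord stays usable without it
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto("https://www.worldometers.info/coronavirus/#countries", {
    waitUntil: "networkidle0",
  });
  // Return plain strings -- DOM nodes don't survive serialization back to Node.
  const rows = await page.$$eval("#main_table_countries_today tbody tr", trs =>
    trs.map(tr =>
      Array.from(tr.querySelectorAll("td")).map(td => td.innerText.trim())
    )
  );
  await browser.close();
  // Header names here are invented for illustration; match them to the real table.
  return rows.map(cells => rowToRecord(["rank", "country", "totalCases"], cells));
}
```

The key point is the same as above: the $$eval callback returns arrays of strings, which serialize cleanly, instead of element handles.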

Related

Puppeteer Page.$$eval() method returning empty arrays

I'm building a web scraping application with puppeteer. I'm trying to get an array of links to scrape from but it returns an empty array.
const scraperObject = {
  url: 'https://www.walgreens.com/',
  async scraper(browser) {
    let page = await browser.newPage();
    console.log(`Navigating to ${this.url}...`);
    await page.goto(this.url);
    // Wait for the required DOM to be rendered
    await page.waitForSelector('.CA__Featured-categories__full-width');
    // Get all the required links in the featured categories
    let urls = await page.$$eval('.list__contain > ul#at-hp-rp-featured-ul > li', links => {
      // Extract the links from the data
      links = links.map(el => el.querySelector('li > a').href)
      return links;
    });
Whenever I ran this, the console would give me the needed array of links (example below).
Navigating to https://www.walgreens.com/...
[
  'https://www.walgreens.com/seasonal/holiday-gift-shop?ban=dl_dl_FeatCategory_HolidayShop_TEST',
  'https://www.walgreens.com/store/c/cough-cold-and-flu/ID=20002873-tier1?ban=dl_dl_FeatCategory_CoughColdFlu_TEST',
  'https://www.walgreens.com/store/c/contact-lenses/ID=359432-tier2clense?ban=dl_dl_FeatCategory_ContactLenses_TEST'
]
So, from here I had to navigate to one of those URLs through the code block below and rerun the same code, going through an array of categories to eventually reach the product listings page.
    // Navigate to Household Essentials
    let hEssentials = await browser.newPage();
    await hEssentials.goto(urls[11]);
    // Wait for the required DOM to be rendered
    await page.waitForSelector('.content');
    // Get all the required links in the featured categories
    let shopByNeedUrls = await page.$$eval('div.wag-row > div.wag-col-3 wag-col-md-6 wag-col-sm-6 CA__MT30', links1 => {
      // Extract the links from the data
      links1 = links1.map(el => el.querySelector('div > a').href)
      return links1;
    });
    console.log(shopByNeedUrls);
  }
However, whenever I run this code, the console gives me the same navigating message but then returns an empty array (as shown in the example below):
Navigating to https://www.walgreens.com/...
[]
If anyone is able to explain why I'm outputting an empty array, that'd be great. Thank you.
I've attempted to change the parameters of the page.waitForSelector and page.$$eval methods, but none of the changes appeared to work; the output stayed the same. In fact, I sometimes receive a timeout error from page.waitForSelector.
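One likely culprit stands out in the snippet above: the navigation happens on the new hEssentials tab, but waitForSelector and $$eval still run against the original page, which never left the walgreens.com home page. That alone would explain both the empty array and the intermittent waitForSelector timeout. A sketch of the corrected step (the dotted class chain in the selector is an assumption about the intended markup; the original space-separated form looks for descendant tags literally named wag-col-md-6, which match nothing):

```javascript
// Corrected navigation step: query the tab that was actually navigated.
async function scrapeShopByNeed(browser, urls) {
  // Navigate to Household Essentials on a fresh tab
  const hEssentials = await browser.newPage();
  await hEssentials.goto(urls[11]);
  // Wait on *this* tab, not the original `page`
  await hEssentials.waitForSelector(".content");
  // Chained classes need a leading dot on each class name
  const shopByNeedUrls = await hEssentials.$$eval(
    "div.wag-row > div.wag-col-3.wag-col-md-6.wag-col-sm-6.CA__MT30",
    links => links.map(el => el.querySelector("div > a").href)
  );
  return shopByNeedUrls;
}
```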

Unable to perform a Firebase query on a timestamp and receiving an empty body in response

Can someone explain to me why I am not able to perform a simple Firebase query checking for a specific timestamp in a subcollection?
The code below works if I try to retrieve the whole document, but if I add the where query it just returns a 200 response with an empty body.
I have also tried to replace db.collection with db.collectionGroup, and in this case I get a 500 response with the following message: "Collection IDs must not contain '/'".
Here you can see how I have structured my data and my code:
try {
  const reference = db.collection(`/data/10546781/history`).where("timestamp", "==", 1659559179735)
  const document = await reference.get()
  res.status(200).json(document.forEach(doc => {
    doc.data()
    console.log(doc.data())
  }))
} catch (error) {
  res.status(500).json(error)
  console.log(error)
}
It seems you are looking for map(), which creates a new array, rather than a forEach() loop, which returns nothing. Try:
const reference = db.collection(`/data/10546781/history`).where("realtimeData.timestamp", "==", 1659559179735)
const snapshot = await reference.get()
const data = snapshot.docs.map((d) => ({
  id: d.id,
  ...d.data()
}))
res.status(200).json(data)
Additionally, you need to use the dot notation if you want to query based on a nested field.
@Dharmaraj Thanks for the help. Your answer was part of the solution. The other part concerned how my data was structured: the timestamp needed to be at the top level of the subcollection document, so either the timestamp moves outside the realtimeData object or the whole object is flattened to the top level.
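To illustrate the restructuring described in that comment, assume a hypothetical history document (the field names other than timestamp are invented). Nested, the query needs dot notation; flattened, a plain where("timestamp", "==", ...) works:

```javascript
// Hypothetical shape of a /data/10546781/history document.
// Nested form: queryable only via where("realtimeData.timestamp", "==", ...)
const nestedDoc = {
  realtimeData: { timestamp: 1659559179735, reading: 42 }, // "reading" is invented
};

// Flatten so timestamp sits at the top level of the document,
// queryable via where("timestamp", "==", ...)
function flattenDoc(doc) {
  const { realtimeData, ...rest } = doc;
  return { ...rest, ...realtimeData };
}
```

Here flattenDoc(nestedDoc) yields { timestamp: 1659559179735, reading: 42 }, i.e. the shape the plain equality query expects.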

How do I get a web page element using a .querySelector()? Webdriver-io

Within Webdriver-io tests, I need to get a web page element using .querySelector()
This code:
a = browser.executeScript('window.document.querySelector("div.my_diagram_div canvas")', [])
console.log('a = ', await a)
outputs the following:
a = { 'element-6066-11e4-a52e-4f735466cecf': 'ELEMENT-40' }
It's not an element object, so I can't work with it any further. How do I get the web page element object?
P.S. In the browser console, the code returns the correct result.
Whatever you see in the console is just a representation of the element, not the actual output. If you want the HTML of that tag, use:
a = browser.executeScript('window.document.querySelector("div.my_diagram_div canvas").outerHTML', [])
console.log('a = ', await a)
or
a = await $('div.my_diagram_div canvas')
console.log('a = ', await a.getProperty("outerHTML"))
UPDATE
If you need the element object, just use:
a = await browser.executeScript('window.document.querySelector("div.my_diagram_div canvas")', [])
elm = await $(a)
await elm.getWindowRect()
Also, in this case you don't need executeScript:
elm = await $('div.my_diagram_div canvas')
await elm.getWindowRect()

How to use Promise.all with multiple Firestore queries

I know there are similar questions to this on stack overflow but thus far none have been able to help me get my code working.
I have a function that takes an id, and makes a call to firebase firestore to get all the documents in a "feedItems" collection. Each document contains two fields, a timestamp and a post ID. The function returns an array with each post object. This part of the code (getFeedItems below) works as expected.
The problem occurs in the next step. Once I have the array of post ID's, I then loop over the array and make a firestore query for each one, to get the actual post information. I know these queries are asynchronous, so I use Promise.all to wait for each promise to resolve before using the final array of post information.
However, I continue to receive "undefined" as a result of these looped queries. Why?
const useUpdateFeed = (uid) => {
  const [feed, setFeed] = useState([]);

  useEffect(() => {
    // getFeedItems returns an array of postIDs, and works as expected
    async function getFeedItems(uid) {
      const docRef = firestore
        .collection("feeds")
        .doc(uid)
        .collection("feedItems");
      const doc = await docRef.get();
      const feedItems = [];
      doc.forEach((item) => {
        feedItems.push({
          ...item.data(),
          id: item.id,
        });
      });
      return feedItems;
    }

    // getPosts is meant to take the array of post IDs, and return an array of the post objects
    async function getPosts(items) {
      console.log(items)
      const promises = [];
      items.forEach((item) => {
        const promise = firestore.collection("posts").doc(item.id).get();
        promises.push(promise);
      });
      const posts = [];
      await Promise.all(promises).then((results) => {
        results.forEach((result) => {
          const post = result.data();
          console.log(post); // this continues to log as "undefined". Why?
          posts.push(post);
        });
      });
      return posts;
    }

    (async () => {
      if (uid) {
        const feedItems = await getFeedItems(uid);
        const posts = await getPosts(feedItems);
        setFeed(posts);
      }
    })();
  }, []);

  return feed; // The final result is an array with a single "undefined" element
};
There are a few things I have already verified on my own:
My firestore queries work as expected when done one at a time (so there are no bugs in the query structures themselves).
This is a custom hook for React. I don't think my use of useState/useEffect is causing any issue here, and I have tested the implementation of this hook with mock data.
EDIT: A console.log() of items was requested and has been added to the code snippet. I can confirm that the firestore documents that I am trying to access do exist, and have been successfully retrieved when called in individual queries (not in a loop).
Also, for simplicity the collection on Firestore currently only includes one post (with an ID of "ANkRFz2L7WQzA3ehcpDz", which can be seen in the console log output below.
EDIT TWO: To make the output clearer I have pasted it as an image below.
Turns out, this was human error. Looking at the console log output I realised there is a space in front of the document ID. Removing that on the backend made my code work.
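A defensive guard against this class of bug is to trim IDs before handing them to .doc(): an ID with a stray space points at a different, nonexistent document, whose snapshot's .data() then returns undefined. A small sketch (cleanId is a made-up helper name):

```javascript
// Trim stray whitespace from a Firestore document ID before lookup.
function cleanId(id) {
  return id.trim();
}

// In getPosts(), the lookup would become:
// const promise = firestore.collection("posts").doc(cleanId(item.id)).get();

cleanId(" ANkRFz2L7WQzA3ehcpDz"); // → "ANkRFz2L7WQzA3ehcpDz"
```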

How can I make a request to multiple URLs and parse the results from each page?

I'm using the popular npm package cheerio with request to retrieve some table data.
Whilst I can retrieve and parse the table from a single page easily, I'd like to loop over / process multiple pages.
I have tried wrapping it inside loops and various utilities offered by the async package but can't figure this one out. In most cases, node runs out of memory.
current code:
const cheerio = require('cheerio');
const axios = require("axios");

var url = someUrl;

const getData = async url => {
  try {
    const response = await axios.get(url);
    const data = response.data;
    const $ = cheerio.load(data);
    const announcement = $(`#someId`).each(function (i, elm) {
      console.log($(this).text())
    })
  } catch (error) {
    console.log(error);
  }
};
getData(url); //<--- Would like to give an array here to fetch from multiple urls / pages
My current approach, after trying loops, is to wrap this inside another function with a callback param, but I've had no success yet and it's getting quite messy.
What is the best way to feed an array to this function?
Assuming you want to do them one at a time:
;(async () => {
  for (let url of urls) {
    await getData(url)
  }
})()
Have you tried using Promise.all (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise/all)?
For loops are usually a bad idea when dealing with asynchronous calls. It depends how many calls you want to make but I believe this could be enough. I would use an array of promises that fetch the data and map over the results to do the parsing.
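That pattern -- mapping the URL array to an array of promises and awaiting Promise.all -- can be sketched as follows. fetchPage is a stub standing in for the axios.get + cheerio.load work inside getData, so the control flow is visible without network access:

```javascript
// Stub for the real request/parse step; swap in axios.get + cheerio.load here.
async function fetchPage(url) {
  return `parsed:${url}`;
}

// Parallel: start every request at once and wait for all of them.
function getAll(urls) {
  return Promise.all(urls.map(url => fetchPage(url)));
}

// Sequential alternative (one at a time), gentler on the target site.
async function getAllSequential(urls) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchPage(url));
  }
  return results;
}
```

With either version, the resolved array is in the same order as the input URLs, regardless of which request finishes first.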
