Web Scrape with Puppeteer within a table - javascript

I am trying to scrape this page.
https://www.psacard.com/Pop/GetItemTable?headingID=172510&categoryID=20019&isPSADNA=false&pf=0&_=1583525404214
I want to be able to find the grade count for PSA 9 and 10. If we look at the HTML of the page, you will notice that PSA does a very bad job (IMO) at displaying the data. Every TR is a player. And the first TD is a card number. Let's just say I want to get Card Number 1 which in this case is Kevin Garnett.
There are a total of four cards, so those are the only four cards I want to display.
Here is the code I have.
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://www.psacard.com/Pop/GetItemTable?headingID=172510&categoryID=20019&isPSADNA=false&pf=0&_=1583525404214");
const tr = await page.evaluate(() => {
const tds = Array.from(document.querySelectorAll('table tr'))
return tds.map(td => td.innerHTML)
});
const getName = tr.map(name => {
//const thename = Array.from(name.querySelectorAll('td.card-num'))
console.log("\n\n"+name+"\n\n");
})
await browser.close();
})();
Each TR gets printed, but I can't seem to dive into those TRs. You can see I have a line commented out; I tried that but got an error. As of right now, I am not selecting the player dynamically. The easiest way, I would think, is to write a function that gets a specific card by selecting the TR whose TD.card-num == 1 (for Kevin).
Any help with this would be amazing.
Thanks

Short answer: You can just copy and paste that into Excel and it pastes perfectly.
Long answer: If I'm understanding this correctly, you'll need to map over all of the tr elements and then, within each tr, map over its td elements. I use cheerio as a helper. To combine it with puppeteer, just do html = await page.content() and then pass html into the cleaner I've written below:
const cheerio = require("cheerio")
const fs = require("fs");
const test = (html) => {
// const data = fs.readFileSync("./test.html");
// const html = data.toString();
const $ = cheerio.load(html);
const array = $("tr").map((index, element)=> {
const card_num = $(element).find(".card-num").text().trim()
const player = $(element).find("strong").text()
const mini_array = $(element).find("td").map((ind, elem)=> {
const hello = $(elem).find("span").text().trim()
return hello
})
return {
card_num,
player,
column_nine: mini_array[13],
column_ten: mini_array[14],
total:mini_array[15]
}
})
console.log(array[2])
}
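// test() needs the page HTML: read it from a saved file, or pass the result of await page.content()
test(fs.readFileSync("./test.html").toString())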
The code above will output the following:
{
card_num: '1',
player: 'Kevin Garnett',
column_nine: '1-0',
column_ten: '0--',
total: '100'
}
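The commented-out line in the question fails because page.evaluate returns serialized data: each item in tr is an innerHTML string, and strings have no querySelectorAll. A Puppeteer-only alternative is to run the DOM queries inside the page and return plain objects. The sketch below assumes the selectors td.card-num and strong from the posts above; extractRows and findCard are hypothetical helper names.

```javascript
// Runs the DOM queries inside the page and returns one plain object per <tr>.
// Assumes the card number sits in td.card-num and the player name in <strong>.
async function extractRows(page) {
  return page.$$eval('table tr', (trs) =>
    trs.map((tr) => {
      // Header rows may lack these elements, so fall back to an empty string.
      const text = (el) => (el ? el.textContent.trim() : '');
      return {
        cardNum: text(tr.querySelector('td.card-num')),
        player: text(tr.querySelector('strong')),
        cells: Array.from(tr.querySelectorAll('td'), (td) => text(td)),
      };
    })
  );
}

// Picks the row whose card number matches; returns undefined if there is none.
function findCard(rows, cardNumber) {
  return rows.find((row) => row.cardNum === String(cardNumber));
}
```

With the page from the question, findCard(await extractRows(page), 1) would return the row for card number 1. The grade columns can then be read from cells at whatever indexes match the table (the cheerio answer uses 13, 14 and 15).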

Related

Unable to get isDisabled() to work in Playwright

I need to check that a button is disabled (checking for a last page of a table). There are two with the same id (top and bottom of the table).
const nextPageButtons = await this.page.$$('button#_btnNext'); // nextPageButtons.length is 2, checked via console.log
const nextPageButtonState = await nextPageButtons[0].isDisabled();
But when I do the above I get: elementHandle.isDisabled: Unable to adopt element handle from a different document.
Why doesn't this work?
So, this works:
const nextPageButtons = await this.page.$$('button#_btnNext');
const nextPageButton1 = nextPageButtons[0];
const nextPageButton1State = await nextPageButton1.isDisabled();
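An alternative worth trying, assuming a Playwright version that provides locators (page.locator): a locator is resolved again each time it is used, so it does not hold on to an element handle from an earlier document. This is a sketch only; isNextButtonDisabled is a hypothetical helper name and the selector is the one from the question.

```javascript
// Checks the first of the two "next" buttons (the one at the top of the table).
// The locator looks the element up at call time instead of reusing a stored handle.
async function isNextButtonDisabled(page) {
  return page.locator('button#_btnNext').first().isDisabled();
}
```

Inside the class from the question this would be called as await isNextButtonDisabled(this.page).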

Insert new key value pair inside and array of objects, but value is created by axios.get

So I've been working on a scraper. Everything went well until I tried scraping data for each individual link.
To explain: I've got a scraper that fetches data about apartments. The first URL is the page where the articles are listed (approx. 29-30 should be fetched). That page has no information about square meters, so I need to run another scrape for each link collected and get the square meters from there.
Here is the code that I have:
const axios = require('axios');
const cheerio = require('cheerio');
const url = `https://www.olx.ba/pretraga?vrsta=samoprodaja&kategorija=23&sort_order=desc&kanton=9&sacijenom=sacijenom&stranica=2`;
axios.get(url).then((response) => {
const articles = [];
const $ = cheerio.load(response.data);
$('div[id="rezultatipretrage"] > div')
.not('div[class="listitem artikal obicniArtikal i index"]')
.not('div[class="obicniArtikal"]')
.each((index, element) => {
$('span[class="prekrizenacijena"]').remove();
const getLink = $(element).find('div[class="naslov"] > a').attr('href');
const getDescription = $(element)
.find('div[class="naslov"] > a > p')
.text();
const getPrice = $(element)
.find('div[class="datum"] > span')
.text()
.replace(/\.| ?KM$/g, '')
.replace(' ', '');
const getPicture = $(element)
.find('div[class="slika"] > img')
.attr('src');
articles[index] = {
id: getLink.substring(27, 35),
link: getLink,
description: getDescription,
price: getPrice,
picture: getPicture,
};
});
articles.map((item, index) => {
axios.get(item.link).then((response) => {
const $ = cheerio.load(response.data);
const sqa = $('div[class="df2 "]').first().text();
});
});
console.log(articles);
});
The first part of the code works as it should; I've been struggling with the second part.
I'm mapping over articles because, for each link, I need to pass it to axios and get the data about square meters.
So my desired output would be the updated articles: the existing objects with their keys and values, plus a key sqm holding the scraped square meters.
Any ideas on how to achieve this?
Thanks!
You could simply add the information about the square meters to the current article/item, something like:
const articlePromises = Promise.all(articles.map((item) => {
return axios.get(item.link).then((response) => {
const $ = cheerio.load(response.data);
const sqa = $('div[class="df2 "]').first().text();
item.sqm = sqa;
});
}));
articlePromises.then(() => {
console.log(articles);
});
Note that you need to wait for all mapped promises to resolve, before you log the resulting articles.
Also note that using async/await you could rewrite your code to be a bit cleaner, see https://javascript.info/async-await.
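The async/await version could look roughly like this. It is a sketch, not a drop-in replacement: addSquareMeters and fetchSquareMeters are hypothetical names, and the fetching step is passed in as a function so that the waiting logic stays separate from axios and cheerio.

```javascript
// Adds an `sqm` key to every article and resolves once all lookups are done.
// `fetchSquareMeters` is a hypothetical async helper: link in, square-meter text out.
async function addSquareMeters(articles, fetchSquareMeters) {
  await Promise.all(
    articles.map(async (item) => {
      item.sqm = await fetchSquareMeters(item.link);
    })
  );
  return articles;
}

// In the scraper above, the helper could wrap the existing axios + cheerio calls:
// const fetchSquareMeters = async (link) => {
//   const response = await axios.get(link);
//   const $ = cheerio.load(response.data);
//   return $('div[class="df2 "]').first().text();
// };
// console.log(await addSquareMeters(articles, fetchSquareMeters));
```

The requests still run in parallel, as in the Promise.all answer; only the syntax changes.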

How do I continuously listen for a new item while scraping a website

I am using puppeteer to scrape a website that is being live updated, to report the latest item elsewhere.
Currently, the way I was thinking of accomplishing this is to run my async scrape in a setInterval call and compare whether the last item has changed, checking every 30 seconds. I assume there has to be a better way of doing this than that.
Here is my current code:
const puppeteer = require('puppeteer');
playtracker = async () => {
console.log('loading');
const browser = await puppeteer.launch({});
const page = await browser.newPage();
await page.goto('URL-Being-Scraped');
await page.waitForSelector('.playlist-tracklist-view');
let html = await page.$$eval('.playlist-tracklist-view > .playlist-track', tracks => {
tracks = tracks.filter(track => track.querySelector('.playlist-trackname').textContent);
tracks = tracks.map(el => el.querySelector('.playlist-trackname').textContent);
return tracks;
});
console.log('logging', html[html.length-1]);
};
setInterval(playtracker, 30000)
There is an api called "MutationObserver". You can check that out on MDN. Here's the link https://developer.mozilla.org/en-US/docs/Web/API/MutationObserver
It basically runs a callback of your choice whenever the specific element changes. Let's say you have a list you want to listen to. What you would do is
const listElement = document.querySelector( [list element] );
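const callbackFunc = function foo (mutations) {
//do something with the list of MutationRecord objects
}
const yourMutationObserver = new MutationObserver(callbackFunc)
// observe() needs an options object; childList reports added or removed child nodes
yourMutationObserver.observe(listElement, { childList: true })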
You can disconnect your mutation observer with yourMutationObserver.disconnect() method whenever you want.
This could help too if you are confused about how to implement it: https://stackoverflow.com/a/48145840/14138428
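In Puppeteer the observer has to run inside the page, while the reporting happens in Node.js. One way to connect the two is page.exposeFunction. This is a minimal sketch that assumes new tracks are appended as direct children of .playlist-tracklist-view (the selectors come from the question); watchTracks and reportNewTrack are hypothetical names.

```javascript
// Calls `onNewTrack(name)` in Node.js whenever a track element is added to the list.
async function watchTracks(page, onNewTrack) {
  // Make the Node.js callback callable from inside the page as window.reportNewTrack.
  await page.exposeFunction('reportNewTrack', onNewTrack);

  // Everything inside this function runs in the browser, not in Node.js.
  await page.evaluate(() => {
    const list = document.querySelector('.playlist-tracklist-view');
    const observer = new MutationObserver((mutations) => {
      for (const mutation of mutations) {
        for (const node of mutation.addedNodes) {
          if (node.nodeType !== Node.ELEMENT_NODE) continue; // skip text nodes
          const name = node.querySelector('.playlist-trackname');
          if (name && name.textContent) window.reportNewTrack(name.textContent);
        }
      }
    });
    // Assumption: tracks are added as direct children of the list element.
    observer.observe(list, { childList: true });
  });
}
```

After page.goto and page.waitForSelector, calling await watchTracks(page, (name) => console.log('new track', name)) once would replace the setInterval polling, as long as the browser stays open.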

Puppeteer returning undefined (JS) using xPath

I'm trying to scrape this element: on this website.
My JS code:
const puppeteer = require("puppeteer");
const url = 'https://magicseaweed.com/Bore-Surf-Report/1886/'
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
const title = await page.$x('/html/body/div[1]/div[2]/div[2]/div/div[2]/div[2]/div[2]/div/div/div[1]/div/header/h3/div[1]/span[1]')
let text = await page.evaluate(res => res.textContext, title[0])
console.log(text) // UNDEFINED
text is undefined. What is the problem here? Thanks.
I think there are two issues in your code:
1. textContent vs. textContext: the property is textContent, so res.textContext is undefined.
2. The XPath.
For the content you want the xpath should be:
const title = await page.$x('/html/body/div[1]/div[2]/div[2]/div/div[2]/div[2]/div[2]/div/div/div[1]/div/div[1]/div[1]/div/div[2]/ul[1]/li[1]/text()')
And to get the content of this:
const text = await page.evaluate(el => {
return el.textContent.trim()
}, title[0])
Note that you need to pass title[0] as an argument to the page function.
OR
if you don't need to use xpath, it seems you could get directly using class name to find the element:
const rating = await page.evaluate(() => {
// document.querySelector works even if the page does not define $ (jQuery)
return document.querySelector('.rating.rating-large.clearfix > li.rating-text').textContent.trim()
})
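When a lookup like this prints undefined, it can help to tell "the selector matched nothing" apart from "the property name is wrong". A small sketch using Puppeteer's page.$ and ElementHandle.evaluate; getText is a hypothetical helper name.

```javascript
// Returns the trimmed text of the first element matching `selector`,
// or null when nothing matches (page.$ resolves to null in that case).
async function getText(page, selector) {
  const handle = await page.$(selector);
  if (!handle) return null;
  // The callback runs in the page and receives the matched element.
  return handle.evaluate((el) => el.textContent.trim());
}
```

A null result points at the selector, while a thrown error or undefined from inside the callback points at the property access.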

Xpath doesn't recognize anchor tag?

I'm running some Node.js code to scrape a website and return some text from this part of the html:
And here's the code I'm using to get it
const fs = require('mz/fs');
const xpath = require('xpath');
const parse5 = require('parse5');
const xmlser = require('xmlserializer');
const dom = require('xmldom').DOMParser;
const axios = require('axios');
(async () => {
const response = await axios.get('https://www.aritzia.com/en/product/sculpt-knit-tank-%28arjun-knit-top%29/66139.html?dwvar_66139_color=17388');
const html = response.data;
const document = parse5.parse(html.toString());
const xhtml = xmlser.serializeToString(document);
const doc = new dom().parseFromString(xhtml);
const select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
const nodes = select("//x:div[contains(@class, 'pdp-product-brand')]/*/text()", doc);
console.log(nodes.length ? nodes[0].nodeValue : nodes.length)
})();
The code above works as expected -- it prints Babaton.
But when I swap out the xpath above for one that includes a instead of * (i.e. //x:div[contains(@class, 'pdp-product-brand')]/a/text()) it instead tells me that nodes.length === 0.
I would expect it to give the same result because the div that it's pointing to does in fact have a child anchor tag (see screenshot above). I'm just confused why it doesn't work with a and was wondering if anybody else knew the answer. Thanks!
