I don't understand how to use cheerio class selectors - javascript

I want to get the children of this element on the website, and the elements inside those children.
Code:
const axios = require("axios");
const cheerio = require("cheerio");

(async () => {
  const response = await axios.get(`...`);
  const $ = cheerio.load(response.data);
  let ChatBody = $('div[class="chatbody overflow-y-auto flex-column"]').children();
  console.log(ChatBody);
  /*
  ChatBody.each((index, element) => {
    console.log(index, element);
  });
  */
})();
Code and Output screenshot
Elements screenshot
I'm using Node.js v12.22.10 with the axios and cheerio packages.

Maybe you want something like:
$('.chatbody > *').get().map(el => $(el).text())
This will give the text of all the children.

Related

How do i check for multiple keywords using cheerio?

Currently, I am trying to build a scraper that searches all <a> tags for specific keywords like "climate" or "environment." Using cheerio, is it possible to search for several keywords at once, so that I get the results for all of them?
Here is my code-
const PORT = 8000;
const express = require('express');
const axios = require('axios');
const cheerio = require('cheerio');
const { response } = require('express');
const app = express();

const articles = [];

app.get('/', (req, res) => {
  res.json('Hello World');
});

app.get('/news', (req, res) => {
  axios.get('https://www.tbsnews.net/bangladesh/environment/climate-change')
    .then((response) => {
      const html = response.data;
      const $ = cheerio.load(html);
      $('a:contains("climate")', html).each(function () {
        const title = $(this).text();
        const url = $(this).attr('href');
        articles.push({ title, url });
      });
      res.json(articles);
    })
    .catch((err) => console.log(err));
});

app.listen(PORT, () => { console.log(`server running on Port ${PORT}`); });
From the cheerio documentation: you can combine multiple :contains selectors just as you would in jQuery.
Logical OR
If you need to match any of the words, just separate the contains selectors with a comma.
$('a:contains("climate"), a:contains("environment")', html)
Logical AND
If you need to match both words, chain the second :contains selector directly after the first.
$('a:contains("climate"):contains("environment")', html)

Xpath doesn't recognize anchor tag?

I'm running some Node.js code to scrape a website and return some text from this part of the html:
And here's the code I'm using to get it
const fs = require('mz/fs');
const xpath = require('xpath');
const parse5 = require('parse5');
const xmlser = require('xmlserializer');
const dom = require('xmldom').DOMParser;
const axios = require('axios');

(async () => {
  const response = await axios.get('https://www.aritzia.com/en/product/sculpt-knit-tank-%28arjun-knit-top%29/66139.html?dwvar_66139_color=17388');
  const html = response.data;
  const document = parse5.parse(html.toString());
  const xhtml = xmlser.serializeToString(document);
  const doc = new dom().parseFromString(xhtml);
  const select = xpath.useNamespaces({ "x": "http://www.w3.org/1999/xhtml" });
  const nodes = select("//x:div[contains(@class, 'pdp-product-brand')]/*/text()", doc);
  console.log(nodes.length ? nodes[0].nodeValue : nodes.length);
})();
The code above works as expected -- it prints Babaton.
But when I swap out the xpath above for one that includes a instead of * (i.e. //x:div[contains(@class, 'pdp-product-brand')]/a/text()), it instead tells me that nodes.length === 0.
I would expect it to give the same result because the div that it's pointing to does in fact have a child anchor tag (see screenshot above). I'm just confused why it doesn't work with a and was wondering if anybody else knew the answer. Thanks!

Web Scrape with Puppeteer within a table

I am trying to scrape this page.
https://www.psacard.com/Pop/GetItemTable?headingID=172510&categoryID=20019&isPSADNA=false&pf=0&_=1583525404214
I want to be able to find the grade count for PSA 9 and 10. If we look at the HTML of the page, you will notice that PSA does a very bad job (IMO) of displaying the data. Every TR is a player, and the first TD is a card number. Let's just say I want to get Card Number 1, which in this case is Kevin Garnett.
There are a total of four cards, so those are the only four cards I want to display.
Here is the code I have.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://www.psacard.com/Pop/GetItemTable?headingID=172510&categoryID=20019&isPSADNA=false&pf=0&_=1583525404214");
  const tr = await page.evaluate(() => {
    const tds = Array.from(document.querySelectorAll('table tr'));
    return tds.map(td => td.innerHTML);
  });
  const getName = tr.map(name => {
    //const thename = Array.from(name.querySelectorAll('td.card-num'))
    console.log("\n\n" + name + "\n\n");
  });
  await browser.close();
})();
I get each TR printed, but I can't seem to dive into those TRs. You can see I have a line commented out; I tried that, but I get an error. Right now I'm not selecting a player dynamically. The easiest way, I think, would be a function that selects the TR whose TD.card-num == 1 to get Kevin's row.
Any help with this would be amazing.
Thanks
Short answer: You can just copy and paste that into Excel and it pastes perfectly.
Long answer: if I'm understanding this correctly, you'll need to map over all of the tr elements and then, within each tr, map over its td elements. I use cheerio as a helper. To complete it with puppeteer just do html = await page.content() and then pass html into the cleaner I've written below:
const cheerio = require("cheerio");
const fs = require("fs");

const test = (html) => {
  // const data = fs.readFileSync("./test.html");
  // const html = data.toString();
  const $ = cheerio.load(html);
  const array = $("tr").map((index, element) => {
    const card_num = $(element).find(".card-num").text().trim();
    const player = $(element).find("strong").text();
    const mini_array = $(element).find("td").map((ind, elem) => {
      return $(elem).find("span").text().trim();
    });
    return {
      card_num,
      player,
      column_nine: mini_array[13],
      column_ten: mini_array[14],
      total: mini_array[15]
    };
  });
  console.log(array[2]);
};

// Call with the page markup, e.g. test(await page.content())
The code above will output the following:
{
  card_num: '1',
  player: 'Kevin Garnett',
  column_nine: '1-0',
  column_ten: '0--',
  total: '100'
}

Problem scraping and bypassing Sucuri protection

I'm trying to scrape the https://twist.moe/ page, but cheerio doesn't show me the content of the page. Apparently it uses some Sucuri protection.
When I use cheerio, it shows me:
<html><head><title>You are being redirected...</title>
<noscript>Javascript is required. Please enable javascript before you are allowed to see this page.</noscript>
<script>var s={},u,c,U,r,i,l=0,a,e=eval,w=String.fromCharCode,sucuri_cloudproxy_js='',S='dj0iNXN1Y3VyIi5jaGFyQXQoMCkrJzkwJy5zbGljZSgxLDIpKyI2c3VjdXIiLmNoYXJBdCgwKSsnYlMzJy5jaGFyQXQoMikrJ3BKMCcuY2hhckF0KDIpKydUZCcuc2xpY2UoMSwyKSsiNiIuc2xpY2UoMCwxKSArICAnJyArIAoiZiIgKyAgJycgKycnKyI5aiIuY2hhckF0KDApICsgImQiICsgIiIgK1N0cmluZy5mcm9tQ2hhckNvZGUoMHg2MykgKyAnYicgKyAgJzgyJy5zbGljZSgxLDIpKyJmIiArICJmc2VjIi5zdWJzdHIoMCwxKSArICdpPDEnLmNoYXJBdCgyKSsnSGxJMycuc3Vic3RyKDMsIDEpICsnQ2kyZicuc3Vic3RyKDMsIDEpICsnOGEnLnNsaWNlKDEsMikrJzknICsgICJhIi5zbGljZSgwLDEpICsgICcnICsgCiI2Ii5zbGljZSgwLDEpICsgU3RyaW5nLmZyb21DaGFyQ29kZSgxMDIpICsgImZzdSIuc2xpY2UoMCwxKSArICIiICsiOGgiLmNoYXJBdCgwKSArICdsQjQnLmNoYXJBdCgyKSsiOXN1Ii5zbGljZSgwLDEpICsgImUiLnNsaWNlKDAsMSkgKyAiNCIuc2xpY2UoMCwxKSArIFN0cmluZy5mcm9tQ2hhckNvZGUoOTgpICsgIiIgKyc3JyArICAnMCcgKyAgJyc7ZG9jdW1lbnQuY29va2llPSdzc3VjdXInLmNoYXJBdCgwKSsgJ3UnLmNoYXJBdCgwKSsnYycuY2hhckF0KDApKyd1JysncnN1YycuY2hhckF0KDApKyAnaScrJ18nLmNoYXJBdCgwKSsnY3N1Y3UnLmNoYXJBdCgwKSAgKydsc3VjdScuY2hhckF0KDApICArJ29zdScuY2hhckF0KDApICsndXMnLmNoYXJBdCgwKSsnZHN1Y3UnLmNoYXJBdCgwKSAgKydwJysnc3VjdXInLmNoYXJBdCg0KSsgJ3N1Y3VvJy5jaGFyQXQoNCkrICd4c3VjJy5jaGFyQXQoMCkrICd5c3VjdXJpJy5jaGFyQXQoMCkgKyAnc3VfJy5jaGFyQXQoMikrJ3UnKycnKyd1JysnaScrJ2RzdWN1Jy5jaGFyQXQoMCkgICsnX3N1Y3VyaScuY2hhckF0KDApICsgJ3N1Y3VyaWInLmNoYXJBdCg2KSsnc3VjdWQnLmNoYXJBdCg0KSsgJzQnKydzMycuY2hhckF0KDEpKyc5cycuY2hhckF0KDApKycwc3VjdScuY2hhckF0KDApICArJzQnKydzdWN1ZScuY2hhckF0KDQpKyAnNycuY2hhckF0KDApKyI9IiArIHYgKyAnO3BhdGg9LzttYXgtYWdlPTg2NDAwJzsgbG9jYXRpb24ucmVsb2FkKCk7';L=S.length;U=0;r='';var A='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';for(u=0;u<64;u++){s[A.charAt(u)]=u;}for(i=0;i<L;i++){c=s[S.charAt(i)];U=(U<<6)+c;l+=6;while(l>=8){((a=(U>>>(l-=8))&0xff)||(i<(L-2)))&&(r+=w(a));}}e(r);</script></head><body>
</body></html>
From chrome I accessed dev-tools and found two cookie values
"sucuri_cloudproxy_uuid_645833be2=0fa8e64535001a7393d98096a1bf40a5"
"sucuri_cloudproxy_uuid_f735b3372=77a9f80992e5d2cc9ffda8e165f8dcfb"
I pass both values in the request header, but it still shows me the same output.
const axios = require('axios');
const cheerio = require('cheerio');
const url = require('./urls');
const util = require('../utils');

const animeList = async () => {
  const headers = {
    cookie: "sucuri_cloudproxy_uuid_645833be2=0fa8e64535001a7393d98096a1bf40a5; sucuri_cloudproxy_uuid_f735b3372=77a9f80992e5d2cc9ffda8e165f8dcfb;"
  };
  const { data } = await axios.get('https://twist.moe/', { headers });
  const body = await data;
  const $ = cheerio.load(body);
  const promises = [];
  console.log($.html());
};

animeList();
Can someone show me how to get all the HTML content of the page https://twist.moe/?
Problem solved using the npm package cloudscraper, a Node.js library for bypassing Cloudflare's anti-DDoS page. Cloudscraper can also identify and automatically bypass Sucuri.
const cloudscraper = require('cloudscraper');
const cheerio = require('cheerio');

const animeList = async () => {
  // url.BASE_URL comes from the same './urls' module as in the question
  const body = await cloudscraper.get(`${url.BASE_URL}`);
  const $ = cheerio.load(body);
  console.log($.html());
};

TypeError: $ .find is not a function

I'm extracting data from a page, but I get this error:
TypeError: $.find is not a function
I already installed cheerio. The error appears when I use trm = $.find(".item-row[data-item='TRM']").find(".item-value > span"); I still get the data, but this error comes up.
Code:
const express = require("express");
const app = express();
const https = require('https');
const cheerio = require('cheerio');

app.get('/', function (req, res) {
  res.send('express test');
});

https.get('https://widgetsdataifx.blob.core.windows.net/semana/semanaindicators', (resp) => {
  let data = '';
  resp.on('data', (chunk) => {
    data += chunk;
  });
  resp.on('end', () => {
    var $ = cheerio.load(data);
    trm = $.find(".item-row[data-item='TRM']").find(".item-value > span");
  });
}).on("error", (err) => {
  console.log("Error: " + err.message);
});
There is no $.find() function, just as there isn't one in jQuery. There is a .find() method on jQuery objects, but that's not what $ represents.
trm = $(".item-row[data-item='TRM']").find(".item-value > span");
searches the markup loaded for "item-row" elements, and then from each of those it searches for <span> elements inside "item-value" elements.
As in "real" jQuery, the $ object is a function. You make functions calls to it and pass in selectors that you want Cheerio to find in the HTML markup you've loaded.
Edit: here is a working test. If you npm install cheerio you can try it yourself with Node:
var cheerio = require("cheerio");
var $ = cheerio.load(`<body>
<div class=item-row data-item=TRM>
<div class=item-value>
<span>THIS IS THE CONTENT</span>
</div>
</div>
</body>`);
var span = $(".item-row[data-item='TRM']").find(".item-value > span");
console.log(span.text());
Playing with the code, it looks like the var $ = cheerio.load(data) expression assigns the $ variable to an instance of cheerio with the HTML document loaded. You can then traverse the DOM like you would with jQuery.
Changing the trm assignment to
$("body").find(".item-row[data-item='TRM']").find(".item-value > span");
will also work, because we select the body first and then call the find method on the return value of that query instead of on the cheerio instance itself.
