I've just been playing around with axios and cheerio, attempting to scrape table data from the World Rugby rankings website. I would like to return the top 10 rows of the table.
https://www.world.rugby/tournaments/rankings/mru
Presently I can only retrieve the first row of the table and I can't figure out why.
const axios = require('axios')
const cheerio = require('cheerio')
async function getWorldRankings() {
try {
const siteUrl = 'https://www.world.rugby/tournaments/rankings/mru'
const { data } = await axios({
method: "GET",
url: siteUrl,
})
const $ = cheerio.load(data)
const elemSelector = 'body > section > div.pageContent.flex-content > div:nth-child(2) > div.column.large-8 > div > section > section > div.fullRankingsContainer.large-7.columns > div > div > table > tbody > tr'
$(elemSelector).each((parentIndex, parentElem) => {
$(parentElem).children().each((childIndex, childElem) => {
console.log($(childElem).text());
})
})
} catch (err) {
console.error(err);
}
}
getWorldRankings()
Result:
>node index.jsx
Position
Teams
Points
For full context and credit I was playing around with this guide:
https://www.youtube.com/watch?v=5YCuUCRS_Ks (I'm using the same code with different URLs and CSS selectors, and I can retrieve table rows as intended with his example, coinmarketcap.com, and many other websites).
For the World Rugby rankings site: even though the HTML is visible in dev tools, is the data being injected in some way that makes it unselectable? (I have no idea what I am talking about, just throwing out a guess.)
Thanks for any help.
node v16.4.2
"axios": "^0.22.0",
"cheerio": "^1.0.0-rc.10",
The data for that table is loaded later via AJAX rather than with the initial page, so it is not in the HTML that axios downloads and you cannot select it with Cheerio. The good news is you don't even need Cheerio for this. If you look at the network tab in your browser's developer tools, you'll see that the AJAX request uses the following URL to load JSON-formatted data (the data you want) into the page:
https://cmsapi.pulselive.com/rugby/rankings/mru?language=en&client=pulse
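As a rough sketch of what that could look like, the code below requests the URL above and keeps the first ten entries. It assumes Node 18+ (where `fetch` is global; with axios, `(await axios.get(url)).data` would play the same role). The field names `entries`, `pos`, `team.name` and `pts` are assumptions about the JSON shape and should be checked against the real response in the network tab.

```javascript
const RANKINGS_URL =
  'https://cmsapi.pulselive.com/rugby/rankings/mru?language=en&client=pulse';

// Pure helper: pick the first `limit` rows out of an already-parsed payload.
// ASSUMPTION: the payload looks like { entries: [{ pos, pts, team: { name } }, ...] }.
function topRankings(payload, limit = 10) {
  const entries = Array.isArray(payload?.entries) ? payload.entries : [];
  return entries.slice(0, limit).map((entry) => ({
    position: entry.pos,
    team: entry.team?.name,
    points: entry.pts,
  }));
}

// Network wrapper: fetch the JSON and hand it to the helper above.
async function getWorldRankings(limit = 10) {
  const response = await fetch(RANKINGS_URL);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return topRankings(await response.json(), limit);
}
```

Calling `getWorldRankings().then(console.table).catch(console.error)` would then print the rows. Keeping the row-picking logic separate from the request makes it easy to adjust the field names once you have seen the real payload.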
The app that I am testing has SSO login. So first I cy.visit(xxx) one URL, and then the system redirects me to another URL, which I cy.visit(yyy). I can't work out how to handle the issue I am facing with this redirect. Can you please help me figure it out?
Thank you
it.only("logs in using env variables", () => {
const username = Cypress.env("username");
const password = Cypress.env("password");
let href;
cy.visit("/");
cy.get(".chakra-stack > .css-1n94901").click()
cy.contains("Login with Carmarket").click();
cy.get(":nth-child(1) > .form-control").type(username);
cy.get(":nth-child(2) > .form-control").type(password);
cy.get(".login > .submit > .button").click({multiple: true,force: true});
cy.url().then((url) => {
href = url;
cy.log("href ", href);
cy.visit(href);
cy.url().should("include", "blabla[][1]][1]"); //assertion of that we are in this url
});
});
});
After clicking the login button, you can directly assert that the URL contains some text. You can also add a timeout to give the redirection time to complete.
cy.url({timeout: 6000}).should("include", "blabla[][1]][1]")
Assuming your URL is https://example.com/, you can work with other URLs within the same test as long as they look like https://example.com/something, https://example.com/something/123, or https://superdomain.example.com/. In other words, the URLs must share the same origin. If the URL is not from the same origin, which is your case, you have to move to a new test to get around this. This is a known trade-off of Cypress, and you can read more about it in the Cypress documentation. So you can do something like this:
it('logs in using env variables', () => {
const username = Cypress.env('username')
const password = Cypress.env('password')
let href
cy.visit('/')
cy.get('.chakra-stack > .css-1n94901').click()
cy.contains('Login with Carmarket').click()
cy.get(':nth-child(1) > .form-control').type(username)
cy.get(':nth-child(2) > .form-control').type(password)
cy.get('.login > .submit > .button').click({multiple: true, force: true})
})
it('validate content of url', () => {
cy.url({timeout: 6000}).should('include', 'blabla[][1]][1]')
})
I have successfully developed a cascading dropdown list using javascript thanks to some code I found online. The html code it generates looks as expected when I view the code inside my Firefox web developer tools. The problem I have is that my php backend cannot read this from the $_POST buffer. The error I get is "Undefined index". It's almost as if the php does not see the second DDL that is dynamically added to my html page. Is there a trick I'm missing?
<script type="text/javascript">
var created = 0;
function displayAccordingly() {
if (created == 1) {
removeDrop();
}
//Call mainMenu the main dropdown menu
var mainMenu = document.getElementById('mainMenu');
//Create the new dropdown menu
var whereToPut = document.getElementById('myDiv');
var newDropdown = document.createElement('select');
newDropdown.setAttribute('id',"newDropdownMenu");
newDropdown.setAttribute('name',"AccountNumber");
whereToPut.appendChild(newDropdown);
if (mainMenu.value == "Office Expense") { //The person chose Office Expense
var option000000000=document.createElement("option");
option000000000.text="---";
option000000000.value="000000000";
newDropdown.add(option000000000,newDropdown.options[null]);
var option160006235=document.createElement("option");
option160006235.text="COPY PAPER AND SUPPLIES";
option160006235.value="160006235";
newDropdown.add(option160006235,newDropdown.options[null]);
var option160006237=document.createElement("option");
option160006237.text="COPIER RENTAL AGREEMENT";
option160006237.value="160006237";
newDropdown.add(option160006237,newDropdown.options[null]);
} else if (mainMenu.value == "Custodial") { //The person chose Custodial
var option000000000=document.createElement("option");
option000000000.text="---";
newDropdown.add(option000000000,newDropdown.options[null]);
var option164006410=document.createElement("option");
option164006410.value="164006410";
option164006410.text="CONTRACTED SERVICES-FACILITIES";
newDropdown.add(option164006410,newDropdown.options[null]);
var option164006415=document.createElement("option");
option164006415.value="164006415";
option164006415.text="MAINTENANCE-GROUNDS";
newDropdown.add(option164006415,newDropdown.options[null]);
var option164006420=document.createElement("option");
option164006420.value="164006420";
option164006420.text="MATERIALS AND SUPPLIES";
newDropdown.add(option164006420,newDropdown.options[null]);
}
created = 1
}
function removeDrop() {
var d = document.getElementById('myDiv');
var oldmenu = document.getElementById('newDropdownMenu');
d.removeChild(oldmenu);
}
</script>
What the developer tools show as my HTML code:
My PHP Code (simplified)
$AccountNumber = $_POST['AcountNumber'];
I can read Category from the $_POST buffer, but not AccountNumber.
So I think the JavaScript works fine, but I don't understand why the value for AccountNumber is not placed in the $_POST buffer.
The result of print_r($_POST) is as follows (right after [Category] I would expect [AccountNumber] =>):
Array ( [action] => POStepTwo [logged_in_user] => 1625605397 [who] => requester [UserID] => 1625605397 [Vendor] => 2080MED [Department] => Plant [Category] => Office Expense [ShippingInstructions] => 1 [RequesterNote] => test )
Thanks for all the help.
I am trying to scrape a website to get the scores of each team. I am running into an issue where my script returns null content. I cannot see where I am going wrong and am looking for some help.
JS:
const Nightmare = require('nightmare')
const cheerio = require('cheerio');
const fs = require('fs');
const nightmare = Nightmare({ show: true })
const url = 'https://www.mscl.org/live/scorecard/ed7941919f69b0e11e800fef/mHcehsPR9S86T3zQv';
nightmare
.goto(url)
.wait('body')
.wait('div#summaryTab.tab-pane.fade.in.table-responsive.borderless.active')
.evaluate(() => document.querySelector('div.col-md-6').innerHTML)
.end()
.then(response => {
console.log(getData(response));
}).catch(err => {
console.log(err);
});
let getData = html => {
data = [];
const $ = cheerio.load(html);
$('div').each((i, elem) => {
if(i === 0 ){
console.log($(elem).find('nth-child(1)').html());
}
});
return data;
}
The html I am scraping is here.
https://pastebin.com/R6syWDwD
The line where the scores are: 30 and 32
<div class="col-md-6">
<b>40 Overs Match</b><br>
<b>MVCC Combined</b> won the toss and chose Batting<br>
<b>Umpires: </b>No umpires were selected<br>
<b>Date: </b> 3/24/2021, 5:00:00 PM<br>
<b>Ground: </b>Acton Field 1<br>
<b>Result: TBD</b><br>
<b>MoM: </b> <br>
<hr>
<p><b>MVCC COMBINED XI - 147/10</b> (<b>O:</b> 12.5 | <b>RR:</b> 11.45)</p>
<p><b>MVCC United XI - 23/1</b> (<b>O:</b> 2.0 | <b>RR:</b> 11.50)</p>
<hr>
</div>
When I run this it returns nothing. No errors are displayed either. What am I missing?
The jQuery docs for :nth-child say:
jQuery's implementation of :nth- selectors is strictly derived from the CSS specification
So you probably have to put an element in front of your nth-child(1) pseudo-selector, to tell jQuery which kind of element the nth child should be. Try this:
console.log($(elem).find('b:nth-child(1)').html());
Alternatively, just prefix nth-child(1) with a colon, since a pseudo-class always needs one: :nth-child(1)
Edit:
I just realized you are using innerHTML on your selected div, which returns the contents of the div without the wrapping div itself. But in getData you try to select that div with $('div'), which therefore finds nothing.
Objective:
I have a button that runs a function to load more items from my Mongoose database and add them to a table row. I use a GET request to fetch the rendered data from the server. I am following pointers from this post, but I am still unable to render what I need.
Client side code:
<main role="main">
<div class="container-fluid ">
<div class="card-columns" id="test" value="<%=deals%>">
<tbody>
<tr id="content"><% include partials/row.html %></tr>
</tbody>
</div>
</div>
<div class="container" style="max-width: 25rem; text-align: center;">
<a id="8-reload" class="btn more" onclick="loadMore(8)"></a>
</div>
<script >
const loadMore = async function(x){
const response = await fetch(`/${x}`);
if (!response.ok)
throw oops;
const data =await response.text();
console.log(data);
console.log($("#content"));
await document.getElementById('content').insertAdjacentHTML('beforeend',data);
}
</script>
Server JS request:
app.get(`/:count`, (req, res) => {
const count = parseInt(req.params.count);
database.find({}, (err, found) => {
if (!err){
res.render("more",{items:found},(err,html)=>{
if(!err){
res.send(html);
}else{
console.log(err);
}
});
} else {
console.log(err);
}
}).skip(count).limit(25);
});
When I run the function, nothing happens on the page. The browser console logs the long string of HTML, and this for the element I want to append to:
jQuery.fn.init {}
__proto__: Object(0)
No errors on server console log. What am I missing?
UPDATE: I tried appending elsewhere instead of to content as a test and, lo and behold, I AM appending HTML, just not all of my content. It only inserts the opening and closing tags of the HTML template, with none of the content between them. Weird.
Okay, it looks like the problem came down to two things. First, I was appending to an element that couldn't accept the content. Second, my template started with table rows wrapping the rest of the content, which stopped the other content from rendering. Once I pointed my jQuery at a parent id and removed the table rows, everything came in fine!
I'm trying to figure out the best way to replace the current list of results from an API call with a new list of results from another API call.
I'm making API request to news API and loading them into the index page when it first loads:
app.get("/", function (req, res) {
request("https://newsapi.org/v2/top-headlines?q=" + initialQ + "&category=sports&pageSize=10&page=" + page + "&sortBy=relevance&apiKey=" + apiKey, function (error, response, body) {
if (!error && response.statusCode == 200) {
let data = JSON.parse(body);
totalResults = data.totalResults;
console.log(totalResults)
let articles = scripts.articlesArr(data);
let filteredArticles = scripts.filteredArr(articles);
res.render("index", { filtered: filteredArticles });
} else {
res.redirect("/");
console.log(response.body);
}
});
});
Then the user will toggle two buttons to get more results, or go back a page:
app.post("/", function (req, res) {
let inputValue = req.body.page;
let pages = Math.ceil(totalResults / 10)
page = scripts.iteratePages(inputValue, page, pages);
request("https://newsapi.org/v2/top-headlines?q=" + initialQ + "&category=sports&pageSize=10&page=" + page + "&sortBy=relevance&apiKey=" + apiKey, function (error, response, body) {
if (!error && response.statusCode == 200) {
let data = JSON.parse(body);
let articles = scripts.articlesArr(data);
let filteredArticles = scripts.filteredArr(articles);
res.render("index", { filtered: filteredArticles });
} else {
res.redirect("/");
console.log(response.body);
}
});
});
I'm aware of Socket io, but I was wondering if there are other means or methods of achieving this? From what I understand, I can update frontend content via the front end - but with my current set up I'd much prefer to update from the back end
EJS code:
<div id="container">
<% for(var i=0; i < filtered.length; i++) { %>
<ul>
<li><%= filtered[i].title %></li>
<li><%= filtered[i].date %></li>
<li><img src="<%= filtered[i].image%>" /></li>
<li><%=filtered[i].description%></li>
<li><%= filtered[i].link %></li>
</ul>
<% } %>
</div>
<form action="/" method="POST">
<ul>
<li>
<button type="submit" name="page" value="next">Get more results</button>
<button type="submit" name="page" value="prev">Go back a page</button>
</li>
</ul>
</form>
For bi-directional communication we can use WebSockets (with a library like Socket.IO), for uni-directional server-to-client we can use EventSource, and for uni-directional client-to-server we use good ol' HTTP, through fetch or XMLHttpRequest in the browser API (this is referred to as AJAX, though I think most devs just say "client calls the server" these days). For 99% of use cases what we want is client-to-server over HTTP. If I understand correctly, you want stuff to happen when the user pushes a button. That's a case of client-to-server.
User pushes button
Client calls our new API endpoint /articles with fetch to get more articles: const response = await fetch('http://localhost:8080/articles'); const articles = await response.json(). A simplified version of the code for /articles looks something like app.get('/articles', (req, res) => request("https://newsapi.org").then(articles => /* do stuff with articles here */ res.send(result))), assuming a promise-returning HTTP client. This endpoint returns JSON instead of HTML (which our / endpoint returns)
Our server calls newsapi. Newsapi answers our server. Our server answers the client.
Then we need some data binding/templating that ensures that the DOM is updated with the new articles. This is functionality that libs like React and Angular supply. But for learning purposes and to keep things simple you can do something like articles.forEach(a => {const el = document.createElement('li'); el.textContent = a.title; document.getElementById('articles').appendChild(el)}), assuming a tag <ul id="articles">... exists where the articles are supposed to show up (you probably want to do something more complex with your articles, but you get the idea)
Page hasn't reloaded 🙌
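The server half of these steps can be sketched roughly as below. To keep it runnable on its own, the News API call is passed in as a function: `articlesHandler` and `fetchPage` are hypothetical names, and `title`, `description` and `url` are assumed News API article fields rather than whatever `scripts.articlesArr` produces.

```javascript
// Build the JSON body for a hypothetical GET /articles?page=N endpoint.
// ASSUMPTION: `fetchPage(page)` is an async function that calls News API and
// resolves to its parsed response, shaped like { totalResults, articles: [...] }.
async function articlesHandler(query, fetchPage) {
  // Fall back to page 1 when the query string has no usable page number.
  const requested = Number.parseInt(query.page, 10);
  const page = Number.isInteger(requested) && requested > 0 ? requested : 1;

  const data = await fetchPage(page);

  // Send the client only the fields it needs to render.
  const articles = (data.articles || []).map(({ title, description, url }) => ({
    title,
    description,
    url,
  }));

  return { page, totalResults: data.totalResults, articles };
}
```

In Express this could be wired up as `app.get('/articles', async (req, res) => res.json(await articlesHandler(req.query, fetchPage)))`, and the client would read it with `await (await fetch('/articles?page=2')).json()`. Passing the page number in the request also avoids keeping the current page in a server-side variable shared by all users.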
Update: some code review :)
Use template literals: "https://newsapi.org/v2/top-headlines?q=" + initialQ + "&category=sports&pageSize=10&page=" + page + "&sortBy=relevance&apiKey=" + apiKey becomes `https://newsapi.org/v2/top-headlines?q=${initialQ}&category=sports&pageSize=10&page=${page}&sortBy=relevance&apiKey=${apiKey}`
Prefer const over let
Use new lines when your lines get very long (many treat 80 columns as the preferred maximum width)
It looks like you do one ul for each article and one li for each property on the article. ul is a list (unordered list) and li is a list item. So one ul should contain many li, and each li should contain one item (in this case an article). You can read more about semantics in web development here
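Putting a few of these review points together, a sketch might look like this. `buildHeadlinesUrl`, `escapeHtml` and `renderArticleList` are hypothetical helpers, and the article shape (`title`, `link`) follows the EJS template in the question. It uses `const`, template literals with URL-encoded values, lines kept short, and one ul holding one li per article.

```javascript
// Template literals instead of one long concatenation; values are URL-encoded.
const buildHeadlinesUrl = ({ q, page, apiKey }) =>
  `https://newsapi.org/v2/top-headlines?q=${encodeURIComponent(q)}` +
  `&category=sports&pageSize=10&page=${page}` +
  `&sortBy=relevance&apiKey=${encodeURIComponent(apiKey)}`;

// Escape text before placing it into HTML.
const escapeHtml = (text) =>
  String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');

// One <ul> for the whole list, one <li> per article.
const renderArticleList = (articles) => {
  const items = articles
    .map((a) => `  <li><a href="${escapeHtml(a.link)}">${escapeHtml(a.title)}</a></li>`)
    .join('\n');
  return `<ul id="articles">\n${items}\n</ul>`;
};
```

The same structure carries over to the EJS template: open the ul before the loop, emit one li per article inside it, and close the ul after the loop.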