Webscraping paginated website with different page layout using Puppeteer - javascript

I'm trying to paginate through 200+ pages on this website, and not all of them have the same layout. For example, the GPA breakdown and the SAT/ACT Super Scores (in the testing policy row) differ across these schools, and on the Harvard College page the SAT/ACT Super Scores don't show up at all. I'm having trouble formatting this for the CSV because the data shows up on some pages but not on others.
Links:
https://www.princetonreview.com/college/georgia-institute-technology-1022905
https://www.princetonreview.com/college/princeton-university-1024041
https://www.princetonreview.com/college/harvard-college-1022984
CSV file I currently have: https://ibb.co/Tc3DyFR This sample only shows the difference in Super Scores because I have not scraped GPA breakdown yet. However, both layouts are different across different pages.
Code:
const puppeteer = require('puppeteer');
const fs = require('fs-extra');
(async function main() {
try{
const names = fs.readFileSync('names.csv', 'utf8').split('\n'); // readFileSync is synchronous, no await needed
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36');
await page.goto('https://www.princetonreview.com/college/harvard-college-1022984#!admissions');
//await fs.writeFile('out.csv', 'School Name,Applicants,Acceptance Rate,Average HS GPA,GPA: Over 3.75,GPA: 3.50-3.74,GPA: 3.25 - 3.49,GPA: 3.00 - 3.24,GPA: 2.50 - 2.99,GPA: 2.00 - 2.49,SAT Reading and Writing,SAT Math,ACT\n');
await fs.appendFile('out.csv', `"${names[1]}",`);
const numbers = await page.evaluate(() => {
let nums = document.querySelectorAll('.number-callout');
let arr = Array.prototype.slice.call(nums);
let text_arr = [];
for(let i = 0; i < arr.length; i++){
if(arr[i].innerText == "")
continue;
text_arr.push(arr[i].innerText.trim());
}
return text_arr;
});
for(var e of numbers){
await fs.appendFile('out.csv', `"${e}",`);
}
await fs.appendFile('out.csv', `\n`);
//console.log(numbers);
await browser.close();
}catch(e){
console.log('our error', e);
}
})();

Short Answer:
Whenever you are paginating through pages with different layouts, stop looking for one general solution at first.
Treat each block of data separately and extract the values one by one. That way you can format and split them however you want.
Long Answer:
This looks like a pretty big task to resolve in one question. However, here are some leads toward solving it.
Our problem is:
- The format is different across different pages.
- Some pages have the data, some do not.
- We need to extract 8-10 specific values.
Let's say we want to extract the Superscore SAT score, which is available on the Princeton and Georgia Tech pages but not on the Harvard page.
We need to search for each value specifically, or optimize the code to extract all the data. There is no generalized way of magically knowing what's what.
// Let's grab all elements
[...document.querySelectorAll('div.number-callout')]
// And search for specific term
.find(e=>{
// We can go upper level and find the link element
// since it's the only one identifying this data
const parent = e.parentNode.querySelector('a')
// if an element is found, we search for the text there
if(parent) return parent.innerText.includes('Superscore SAT')
})
This will only return a result on the first two pages.
It also works with "Superscore ACT".
You can loop over the elements and merge the data:
const data = {};
[...document.querySelectorAll('div.number-callout')].map(e=>{
const parent = e.parentNode.querySelector('a');
if(parent){
if(parent.innerText.includes('Superscore ACT')) data["Superscore ACT"] = true;
if(parent.innerText.includes('Superscore SAT')) data["Superscore SAT"] = true;
}
});
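Once each page yields an object like data, the CSV problem comes down to writing every row against the same fixed list of columns and leaving a cell empty when a page lacks that field. The following is only a sketch under that assumption; the column names and the toCsvRow helper are hypothetical and not part of the original code.

```javascript
// Hypothetical column list; extend it with every field you scrape.
const columns = ['School Name', 'Superscore SAT', 'Superscore ACT'];

// Quote a value for CSV, doubling any embedded double quotes.
const quote = (value) => `"${String(value).replace(/"/g, '""')}"`;

// Build one CSV line; fields missing from `data` become empty cells,
// so every row has the same number of columns.
function toCsvRow(data) {
  return columns.map((name) => quote(data[name] ?? '')).join(',');
}

console.log(toCsvRow({ 'School Name': 'Harvard College' }));
// "Harvard College","",""
```

Because the row is built from the column list rather than from whatever the page happened to contain, a missing Superscore entry no longer shifts the remaining values to the left.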

How to click on a year in calendar with Selenium using JavaScript

I am trying to write a Selenium test with JavaScript but was unable to fill calendar data. It is using a dropdown menu:
const {Builder, By, Key} = require('selenium-webdriver')
const test2 = async () => {
let driver = await new Builder().forBrowser('chrome').build()
await driver.manage().window().maximize()
await driver.get('https://demoqa.com/automation-practice-form')
let calendar = driver.findElement(By.xpath("//input[@id='dateOfBirthInput']"))
await calendar.click()
let month = driver.findElement(By.xpath("//select[@class='react-datepicker__month-select']"))
await month.click()
await month.sendKeys(Key.DOWN, Key.DOWN, Key.RETURN)
let year = driver.findElement(By.xpath("//select[contains(@class,'react-datepicker__year-select')]")).value = "1988"
}
How can I click on the value that I need? Console logging year gives me 1988, but I don't know how to select it in the browser. Is there any other way besides pressing Key.DOWN 35 times? I cannot use the Select class in JavaScript.
You can target the option tag to set the value directly in the dropdown without opening it. If the option is not found in the dropdown, an error is thrown, which you can handle in a try block:
// Separate names for the wanted values, so they don't clash with the element variables below
const monthName = "May"
const yearValue = "1988"
let driver = await new Builder().forBrowser('chrome').build()
await driver.manage().window().maximize()
await driver.get('https://demoqa.com/automation-practice-form')
let calendar = await driver.findElement(By.xpath("//input[@id='dateOfBirthInput']"))
await calendar.click()
// Click the <option> whose text matches instead of stepping through the list
let month = await driver.findElement(By.xpath("//select[@class='react-datepicker__month-select']/option[text()='" + monthName + "']"))
await month.click()
let year = await driver.findElement(By.xpath("//select[contains(@class,'react-datepicker__year-select')]/option[text()='" + yearValue + "']"))
await year.click()
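The two option lookups differ only in the class name of the select and the text of the option, so the XPath can be built by a small helper. This is a sketch only: optionXPath is a hypothetical name, and it assumes the option text contains no single quote.

```javascript
// Hypothetical helper: XPath for the <option> with the given visible text
// inside a <select> whose class attribute contains `selectClass`.
// Assumes optionText contains no single quote.
function optionXPath(selectClass, optionText) {
  return `//select[contains(@class,'${selectClass}')]/option[text()='${optionText}']`;
}

console.log(optionXPath('react-datepicker__year-select', '1988'));
// //select[contains(@class,'react-datepicker__year-select')]/option[text()='1988']
```

The returned string can be passed to By.xpath in the same way as the hand-written expressions.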
I don't have a JavaScript environment at hand, but executing JavaScript in the page to set the value directly does the job.
Using the Python client:
element = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input#dateOfBirthInput")))
driver.execute_script("arguments[0].value = arguments[1]", element, "24 Jan 2023")
Execution in the browser console:
document.getElementById('dateOfBirthInput').value='24 Jan 2023'

Puppeteer node.js : if class doesn't exist pass blank value in array

I'm trying to scrape a page with business listings and get the title, location and website. The issue is that some of these businesses don't have a website. I'm currently using an array of arrays to store the data:
[
[websites],
[titles],
[locations]
]
When exporting the output in excel, I want to pass a blank value to the array of websites when there is no website listed and the URL for those that do have a website. In other words, I want to have something like this:
Websites | Titles | Locations
Website A | Title A | Location A
(blank because it doesn't have a website) | Title B | Location B
Website C | Title C | Location C
... | ... | ...
The code I've written so far is the following:
async function main(){
try{
const browser = await puppeteer.launch({"headless":false});
const page = await browser.newPage();
await page.goto(url, { waitUntil: 'networkidle0' });
const businessesPosts = await page.$$eval("[class^='AdvItemBox']", allPosts => allPosts.map(post => [
post.querySelector(".siteLink.urlClickLoggingClass").href != null
? post.querySelector(".siteLink.urlClickLoggingClass").href
: " ", //throws error "Cannot read property 'href' of null"
post.querySelector("[class^='CompanyName']").innerText, // get the title
post.querySelector("[class^='AdvAddress']").innerText] // get the location
));
const wb = xlsx.utils.book_new();
const ws = xlsx.utils.aoa_to_sheet(businessesPosts);
xlsx.utils.book_append_sheet(wb,ws);
xlsx.writeFile(wb, "posts.xlsx");
await browser.close()
} catch(e){
console.log('error',e);
}
};
main();
Here's the HTML code of the website's class
<a class="siteLink urlClickLoggingClass" target="_blank" product="AdvListing" productid="2419662++1926511++1" href="http://www.test.com">
Apparently there's something wrong when trying to insert a condition inside the array.
Any help would be much appreciated!
Instead of:
post.querySelector(".siteLink.urlClickLoggingClass").href != null
? post.querySelector(".siteLink.urlClickLoggingClass").href
: " ",
try:
post.querySelector(".siteLink.urlClickLoggingClass")?.href ?? " ",
See:
Optional chaining (?.)
Nullish coalescing operator (??)
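To see how the two operators combine, here is a minimal sketch that uses plain objects as stand-ins for what querySelector returns (an element-like object, or null when nothing matches). The names withSite, withoutSite and hrefOrBlank are made up for the illustration.

```javascript
// Hypothetical stand-ins for the result of post.querySelector(...):
// an object when the link exists, null when it does not.
const withSite = { href: 'http://www.test.com' };
const withoutSite = null;

// ?. stops at null/undefined and yields undefined instead of throwing;
// ?? then substitutes the fallback only for null/undefined.
const hrefOrBlank = (el) => el?.href ?? ' ';

console.log(hrefOrBlank(withSite));    // http://www.test.com
console.log(hrefOrBlank(withoutSite)); // a single space
```

The original ternary failed because .href was read before the null check could happen; with ?. the property access itself is skipped when the element is missing.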

How to scrape a table with changing data using Cheerio in Node.js?

I am trying to scrape data from a table in a website which has constantly changing values. So each row can vary day to day but I want to be able to scrape the correct data. I am using the Cheerio library at the moment and I am not familiar with it but here's what I have:
const rp = require("request-promise");
const cheerio = require("cheerio");
let Italy = "";
async function main() {
const result = await rp.get("https://www.worldometers.info/coronavirus/");
const $ = cheerio.load(result);
$("#main_table_countries > tbody:nth-child(2) > tr:nth-child(2)").each((i,el) => {
const item = $(el).text();
Italy = item;
});
}
So, as you can see this scrapes data from the worldometer website for the coronavirus cases in Italy. Italy's position however has been changing between 2 and 3 over the past few days. This has resulted in my program fetching the wrong information. This is what I would like to fix.
Here's the link to the worldometer website:
https://www.worldometers.info/coronavirus/
Thanks,
Karthik
What I implemented gets all the tr elements, loops over them to collect the country names into an array, and then uses the array index to find any country you want:
async function main() {
let NamesArr=[]
let CountryToFind= 'Italy'
const result = await rp.get("https://www.worldometers.info/coronavirus/");
const $ = cheerio.load(result);
$('#main_table_countries').find('tbody').eq(0).find('tr').each((i,el)=>{
NamesArr.push($(el).find('td').eq(0).text().trim())
})
let Index= NamesArr.indexOf(CountryToFind) + 1
$(`#main_table_countries > tbody:nth-child(2) > tr:nth-child(${Index})`).each((i,el) => {
const item = $(el).text();
console.log(item);
});
}
main()
This logs the text of the matching row.
You can definitely refactor it, but this way your parser is dynamic, since you can now search for any country.
Use the :contains pseudo-selector for this:
$('tr:contains(Italy)').text()
//" Italy 9,172 +1,797 463 +97 724 7,985 733 151.7 "
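Note that :contains matches a substring anywhere in the row text. If an exact match on the country cell is wanted, the same idea of looking up by name rather than by position can be sketched in plain JavaScript. The rows array below is a hypothetical stand-in for cell texts already extracted from each tr, and the values other than Italy's are placeholders.

```javascript
// Hypothetical rows: the cell texts of each <tr>, first cell = country name.
// Only the Italy figure comes from the output above; the rest are placeholders.
const rows = [
  ['Country A', '(cases)'],
  [' Italy ', '9,172'],
  ['Country B', '(cases)'],
];

// Find the row by the exact text of its first cell, not by its position.
function findRowByCountry(tableRows, country) {
  return tableRows.find((cells) => cells[0].trim() === country) ?? null;
}

console.log(findRowByCountry(rows, 'Italy'));
```

The lookup keeps working when the country moves up or down the table, and it returns null when no row matches.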

How do I loop through and add key values to values

I want to loop through this object and add keys to the values so it can be easily used in a project. I'm scraping a website to get the data. I have been able to get the data, but I need to format it. Here is my code; I'm stuck on the next step.
I'm trying to get it to look like this
EXAMPLE
server: "USSouth2"
cost: 100
howmuch: 56
time: 28
Code
let options = {
url: 'https://www.realmeye.com/items/misc/',
headers: {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}
}
var result = [];
request(options, function(err, resp, html) {
if (!err) {
const $ = cheerio.load(html);
let servers = $('.cheapest-server').attr('data-alternatives');
servers.forEach(function(nexus) {
result.push({
servername: nexus[0],
cost: nexus[1],
howmuch: nexus[2],
time: nexus[3]
})
})
}
})
JSON object (this can change depending on the Website data)
[["USSouth2 Nexus",100,56,28],["EUSouth Nexus",100,62,25],["AsiaSouthEast Nexus",100,58,25],["EUNorth2 Nexus",100,64,24],["EUEast Nexus",100,55,24],["USWest2 Nexus",100,73,23],["USMidWest2 Nexus",100,53,21],["USEast2 Nexus",100,98,17],["EUWest Nexus",100,66,11],["EUSouthWest Nexus",100,86,10],["USNorthWest Nexus",100,87,9],["USSouthWest Nexus",100,67,9],["EUWest2 Nexus",100,89,8],["USWest Nexus",100,66,8],["USSouth Nexus",100,54,7],["USMidWest Nexus",100,90,6],["USSouth3 Nexus",100,82,6],["USEast Nexus",100,65,1]]
I'm getting this error
TypeError: servers.forEach is not a function
I've fixed it. I forgot to parse the JSON. Here is the fix, using a new variable:
let jsonData = JSON.parse(servers);
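The reason forEach failed is that attr() returns the attribute as a string, not as an array. A minimal sketch of the whole step, with a shortened string standing in for the attribute value:

```javascript
// Stand-in for the string returned by $('.cheapest-server').attr('data-alternatives').
const servers = '[["USSouth2 Nexus",100,56,28],["EUSouth Nexus",100,62,25]]';

// The attribute value is text, so parse it before using array methods.
const jsonData = JSON.parse(servers);

// Destructure each inner array into named fields.
const result = jsonData.map(([servername, cost, howmuch, time]) => ({
  servername,
  cost,
  howmuch,
  time,
}));

console.log(result);
```

Each entry of result is then an object with the keys servername, cost, howmuch and time, matching the shape asked for in the question.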
Okay, so if you know for sure the four keys will not change, then do something like this.
Create an array with the four keys in the order they appear in a single JSON array value. Use the array's map method to loop over the servers and the reduce method to turn each inner array into an object.
Check out the example below to see what it does.
// Your JSON.
const json = [["USSouth2 Nexus",100,56,28], ["EUSouth Nexus",100,62,25]];
// The four keys of the object. In order of the JSON array.
const keys = ['servername', 'cost', 'howmuch', 'time'];
// Loop over the servers and reduce each inner array to an object.
const servers = json.map(server =>
server.reduce((acc, cur, i) => {
let key = keys[i];
acc[key]= cur;
return acc;
}, {}));
console.log(servers);
I have updated the code snippet. Please check now.
let servers = [["USSouth2 Nexus",100,56,28],["EUSouth Nexus",100,62,25],["AsiaSouthEast Nexus",100,58,25],["EUNorth2 Nexus",100,64,24],["EUEast Nexus",100,55,24],["USWest2 Nexus",100,73,23],["USMidWest2 Nexus",100,53,21],["USEast2 Nexus",100,98,17],["EUWest Nexus",100,66,11],["EUSouthWest Nexus",100,86,10],["USNorthWest Nexus",100,87,9],["USSouthWest Nexus",100,67,9],["EUWest2 Nexus",100,89,8],["USWest Nexus",100,66,8],["USSouth Nexus",100,54,7],["USMidWest Nexus",100,90,6],["USSouth3 Nexus",100,82,6],["USEast Nexus",100,65,1]];
var result = [];
servers.forEach(function(nexus) {
result.push({
servername: nexus[0],
cost: nexus[1],
howmuch: nexus[2],
time: nexus[3]
});
});
console.log(result);
First of all, you need to check that servers is present and is an array; if so, loop through the servers and get your data:
const raw = $('.cheapest-server').attr('data-alternatives');
// attr() returns a string (or undefined), so parse it before checking for an array
const servers = raw ? JSON.parse(raw) : null;
const result = [];
if (Array.isArray(servers) && servers.length) {
for (let i = 0; i < servers.length; i += 1) {
const entity = servers[i];
const objBuilder = {
servername: entity[0] || null,
cost: entity[1] || null,
howmuch: entity[2] || null,
time: entity[3] || null,
};
result.push(objBuilder);
}
} else {
// Error handling or pass an empty array
}

How can I parse a .frd file with Javascript?

A .frd file is a type of multi-column numeric data table used for storing information about the frequency response of speakers. A .frd file looks something like this when opened in a text editor:
2210.4492 89.1 -157.7
2216.3086 88.99 -157.7
2222.168 88.88 -157.6
2228.0273 88.77 -157.4
Using javascript, is there a way that I can parse this data in order to return each column separately?
For example, from the .frd file above, I would need to return the values like so:
var column1 = [2210.4492, 2216.3086, 2222.168, 2228.0273];
var column2 = [89.1, 88.99, 88.88, 88.77];
var column3 = [-157.7, -157.7, -157.6, -157.4];
I'm not exactly sure where to begin in trying to achieve this, so any step in the right direction would be helpful!
I found the following description of the FRD file format and I will follow it.
Let's assume that the content of your .frd file is in the variable called content (the following example is for Node.js):
const fs = require('fs');
const content = fs.readFileSync('./input.frd').toString();
Now if content has your FRD data, it means it's a set of lines, each line contains exactly three numbers: a frequency (Hz), a level (dB), and a phase (degrees). To split your content into lines, we can just literally split it:
const lines = content.split(/\r?\n/);
(normally, splitting just by '\n' would've worked, but let's explicitly support Windows-style line breaks \r\n just in case. The /\r?\n/ is a regular expression that says "maybe \r, then \n")
To parse each line into three numbers, we can do this:
const values = line.split(/\s+/);
If the file can contain empty lines, it may make sense to double check that the line has exactly three values:
if (values.length !== 3) {
// skip this line
}
Given that we have three values in values, as strings, we can assign the corresponding variables:
const [frequency, level, phase] = values.map(value => Number(value));
(.map converts all the values in values from strings to Number - let's do this to make sure we store the correct type).
Now putting all those pieces together:
const fs = require('fs');
const content = fs.readFileSync('./input.frd').toString();
const frequencies = [];
const levels = [];
const phases = [];
const lines = content.split(/\r?\n/);
for (const line of lines) {
const values = line.split(/\s+/);
if (values.length !== 3) {
continue;
}
const [frequency, level, phase] = values.map(value => Number(value));
frequencies.push(frequency);
levels.push(level);
phases.push(phase);
}
console.log(frequencies);
console.log(levels);
console.log(phases);
The main code (the part that works with content) will also work in the browser, not just in Node.js, if you need that.
This code can be written in tons of different ways, but I tried to make it easy to explain, so I did something very straightforward.
To use it in Node.js (if your JavaScript file is called index.js):
$ cat input.frd
2210.4492 89.1 -157.7
2216.3086 88.99 -157.7
2222.168 88.88 -157.6
2228.0273 88.77 -157.4
$ node index.js
[ 2210.4492, 2216.3086, 2222.168, 2228.0273 ]
[ 89.1, 88.99, 88.88, 88.77 ]
[ -157.7, -157.7, -157.6, -157.4 ]
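If the parsing needs to be reused, for example in the browser where the content comes from somewhere other than fs, the same steps can be wrapped in a function that takes the text as a string. This is a sketch; parseFrd is a hypothetical name, and trimming each line and skipping non-numeric lines are added assumptions rather than part of the format description.

```javascript
// Hypothetical helper: parse FRD text into three column arrays.
function parseFrd(content) {
  const columns = { frequencies: [], levels: [], phases: [] };
  for (const line of content.split(/\r?\n/)) {
    // trim() so leading/trailing spaces don't produce empty fields
    const values = line.trim().split(/\s+/).map(Number);
    // skip blank, malformed or non-numeric lines
    if (values.length !== 3 || values.some(Number.isNaN)) continue;
    const [frequency, level, phase] = values;
    columns.frequencies.push(frequency);
    columns.levels.push(level);
    columns.phases.push(phase);
  }
  return columns;
}

const sample = '2210.4492 89.1 -157.7\r\n  2216.3086 88.99 -157.7\n\n';
console.log(parseFrd(sample));
```

In Node.js the function can be fed with fs.readFileSync('./input.frd').toString(), as in the code above.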
