How to click on a year in a calendar with Selenium using JavaScript

I am trying to write a Selenium test in JavaScript, but I am unable to fill in the calendar data. The calendar uses dropdown menus:
const {Builder, By, Key} = require('selenium-webdriver')
const test2 = async () => {
let driver = await new Builder().forBrowser('chrome').build()
await driver.manage().window().maximize()
await driver.get('https://demoqa.com/automation-practice-form')
let calendar = driver.findElement(By.xpath("//input[@id='dateOfBirthInput']"))
await calendar.click()
let month = driver.findElement(By.xpath("//select[@class='react-datepicker__month-select']"))
await month.click()
await month.sendKeys(Key.DOWN, Key.DOWN, Key.RETURN)
let year = driver.findElement(By.xpath("//select[contains(@class,'react-datepicker__year-select')]")).value = "1988"
How can I click on the value that I need? Logging year to the console gives me 1988, but I don't know how to select it in the browser. Is there any other way besides pressing Key.DOWN 35 times? I cannot use the Select class in JavaScript.

You can target the option tag to select a value directly in the dropdown without opening it. If the option is not found in the dropdown, an error is thrown, which you can handle in a try block:
// Option texts to pick; named differently from the element variables below
const monthName = "May"
const yearValue = "1988"
let driver = await new Builder().forBrowser('chrome').build()
await driver.manage().window().maximize()
await driver.get('https://demoqa.com/automation-practice-form')
let calendar = await driver.findElement(By.xpath("//input[@id='dateOfBirthInput']"))
await calendar.click()
// Click the <option> itself instead of stepping through the list with arrow keys
let month = await driver.findElement(By.xpath("//select[@class='react-datepicker__month-select']/option[text()='" + monthName + "']"))
await month.click()
let year = await driver.findElement(By.xpath("//select[contains(@class,'react-datepicker__year-select')]/option[text()='" + yearValue + "']"))
await year.click()
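The option lookup and the try block mentioned above can be combined into a small helper. This is only a sketch, not the answerer's code: optionXPath and selectOptionByText are hypothetical names, and driver and By are assumed to be the objects from selenium-webdriver, passed in by the caller.

```javascript
// Hypothetical helper: build an XPath that targets one <option> inside a
// <select> whose class attribute contains `selectClass`.
// Assumes neither argument contains a single quote.
function optionXPath(selectClass, optionText) {
  return `//select[contains(@class,'${selectClass}')]/option[text()='${optionText}']`;
}

// Hypothetical helper: click the matching option and return true, or return
// false when the option does not exist. `driver` and `By` are assumed to come
// from require('selenium-webdriver').
async function selectOptionByText(driver, By, selectClass, optionText) {
  try {
    const option = await driver.findElement(By.xpath(optionXPath(selectClass, optionText)));
    await option.click();
    return true;
  } catch (err) {
    if (err.name === 'NoSuchElementError') return false;
    throw err; // anything else is unexpected, so let it propagate
  }
}
```

Inside an async test it could be called as `await selectOptionByText(driver, By, 'react-datepicker__year-select', '1988')`.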

I don't have a JavaScript environment at hand, but executing JavaScript through the driver does the job.
Using the Python client:
element = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input#dateOfBirthInput")))
driver.execute_script("arguments[0].value = arguments[1]", element, "24 Jan 2023")
Execution in the Console
Command
document.getElementById('dateOfBirthInput').value='24 Jan 2023'
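Since the question is about the JavaScript bindings, here is a rough equivalent of the Python call. It is a sketch under assumptions: setInputValue is a hypothetical name, driver is assumed to be a selenium-webdriver WebDriver, and element a WebElement returned by findElement. Whether the form actually registers a value written this way depends on how the page listens for changes, which has not been checked here.

```javascript
// Hypothetical helper: write a value straight into an input through
// executeScript, bypassing the date picker UI. Returns the promise from
// executeScript so the caller can await it.
function setInputValue(driver, element, value) {
  return driver.executeScript('arguments[0].value = arguments[1];', element, value);
}
```

For example: `const input = await driver.findElement(By.css('input#dateOfBirthInput'))` followed by `await setInputValue(driver, input, '24 Jan 2023')`.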

Related

Puppeteer node.js : if class doesn't exist pass blank value in array

I'm trying to scrape a page with business listings and get the title, location and website. The issue is that some of these businesses don't have a website. I'm currently using an array of arrays to store the data:
[
[websites],
[titles],
[locations]
]
When exporting the output in excel, I want to pass a blank value to the array of websites when there is no website listed and the URL for those that do have a website. In other words, I want to have something like this:
Websites | Titles | Locations
Website A | Title A | Location A
(blank because it doesn't have a website) | Title B | Location B
Website C | Title C | Location C
... | ... | ...
The code I've written so far is the following:
async function main(){
try{
const browser = await puppeteer.launch({"headless":false});
const page = await browser.newPage();
await page.goto(url, { waitUntil: 'networkidle0' });
const businessesPosts = await page.$$eval("[class^='AdvItemBox']", allPosts => allPosts.map(post => [
post.querySelector(".siteLink.urlClickLoggingClass").href != null
? post.querySelector(".siteLink.urlClickLoggingClass").href
: " ", //throws error "Cannot read property 'href' of null"
post.querySelector("[class^='CompanyName']").innerText, // get the title
post.querySelector("[class^='AdvAddress']").innerText] // get the location
));
const wb = xlsx.utils.book_new();
const ws = xlsx.utils.aoa_to_sheet(businessesPosts);
xlsx.utils.book_append_sheet(wb,ws);
xlsx.writeFile(wb, "posts.xlsx");
await browser.close()
} catch(e){
console.log('error',e);
}
};
main();
Here's the HTML code of the website's class
<a class="siteLink urlClickLoggingClass" target="_blank" product="AdvListing" productid="2419662++1926511++1" href="http://www.test.com">
Apparently there's something wrong when trying to insert a condition inside the array.
Any help would be much appreciated!
Instead of:
post.querySelector(".siteLink.urlClickLoggingClass").href != null
? post.querySelector(".siteLink.urlClickLoggingClass").href
: " ",
try:
post.querySelector(".siteLink.urlClickLoggingClass")?.href ?? " ",
See:
Optional chaining (?.)
Nullish coalescing operator (??)
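To see the two operators at work outside the browser, here is a small sketch. The withSite and withoutSite objects are made-up stand-ins for DOM nodes, and websiteOrBlank is a hypothetical helper name.

```javascript
// Stand-ins for DOM nodes: querySelector returns an element or null.
const withSite = { querySelector: () => ({ href: 'http://www.test.com' }) };
const withoutSite = { querySelector: () => null };

// Returns the link's href, or a single space when the link is absent.
// `?.` stops at null instead of throwing; `??` supplies the fallback.
function websiteOrBlank(post) {
  return post.querySelector('.siteLink.urlClickLoggingClass')?.href ?? ' ';
}

console.log(websiteOrBlank(withSite));    // http://www.test.com
console.log(websiteOrBlank(withoutSite)); // a single space
```

The original ternary failed because `.href` was read before the null check could happen; with `?.` the property access is skipped when querySelector finds nothing.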

How to scrape a table with changing data using Cheerio in Node.js?

I am trying to scrape data from a table in a website which has constantly changing values. So each row can vary day to day but I want to be able to scrape the correct data. I am using the Cheerio library at the moment and I am not familiar with it but here's what I have:
const rp = require("request-promise");
const cheerio = require("cheerio");
let Italy = "";
async function main() {
const result = await rp.get("https://www.worldometers.info/coronavirus/");
const $ = cheerio.load(result);
$("#main_table_countries > tbody:nth-child(2) > tr:nth-child(2)").each((i,el) => {
const item = $(el).text();
Italy = item;
});
}
So, as you can see this scrapes data from the worldometer website for the coronavirus cases in Italy. Italy's position however has been changing between 2 and 3 over the past few days. This has resulted in my program fetching the wrong information. This is what I would like to fix.
Here's the link to the worldometer website:
https://www.worldometers.info/coronavirus/
Thanks,
Karthik
My approach is to get all the tr elements, loop over them to collect the country names into an array, and then use the array index to find any country you want:
async function main() {
let NamesArr=[]
let CountryToFind= 'Italy'
const result = await rp.get("https://www.worldometers.info/coronavirus/");
const $ = cheerio.load(result);
$('#main_table_countries').find('tbody').eq(0).find('tr').each((i,el)=>{
NamesArr.push($(el).find('td').eq(0).text().trim())
})
let Index= NamesArr.indexOf(CountryToFind) + 1
$(`#main_table_countries > tbody:nth-child(2) > tr:nth-child(${Index})`).each((i,el) => {
const item = $(el).text();
console.log(item);
});
}
main()
You can definitely refactor it but this way makes your parser dynamic as you can now search for any country.
Use the :contains pseudo-class selector for this:
$('tr:contains(Italy)').text()
//" Italy 9,172 +1,797 463 +97 724 7,985 733 151.7 "
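Both answers come down to finding the row by country name instead of by position. Note that :contains matches any row whose text includes the string, so an exact comparison on the first cell is a bit stricter. Below is a minimal sketch of that idea on plain arrays; the rows and their numbers are made up, and findRowByCountry is a hypothetical name. With Cheerio, the inner arrays would be built from the td texts of each tr.

```javascript
// Made-up sample rows; each inner array stands for the cell texts of one <tr>.
const rows = [
  ['China', '80,000'],
  ['Italy', '9,000'],
  ['Iran', '7,000'],
];

// Hypothetical helper: return the row whose first cell equals `country`,
// or undefined when no row matches.
function findRowByCountry(tableRows, country) {
  return tableRows.find((cells) => cells[0].trim() === country);
}

console.log(findRowByCountry(rows, 'Italy')); // [ 'Italy', '9,000' ]
```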

Cheerio Not Parsing HTML Correctly

I've got an array of rows that I've parsed out of a table from html, stored in a list. Each of the rows in the list is a string that looks (something) like this:
["<td headers="DOCUMENT" class="t14data"><a target="6690-Exhibit-C-20190611-1" href="http://www.fara.gov/docs/6690-Exhibit-C-20190611-1.pdf" class="doj-analytics-processed"><span style="color:blue">Click Here </span></a></td><td headers="REGISTRATIONNUMBER" class="t14data">6690</td><td headers="REGISTRANTNAME" class="t14data">SKDKnickerbocker LLC</td><td headers="DOCUMENTTYPE" class="t14data">Exhibit C</td><td headers="STAMPED/RECEIVEDDATE" class="t14data">06/11/2019</td>","<td headers="DOCUMENT" class="t14data"><a target="5334-Supplemental-Statement-20190611-30" href="http://www.fara.gov/docs/5334-Supplemental-Statement-20190611-30.pdf" class="doj-analytics-processed"><span style="color:blue">Click Here </span></a></td><td headers="REGISTRATIONNUMBER" class="t14data">5334</td><td headers="REGISTRANTNAME" class="t14data">Commonwealth of Dominica Maritime Registry, Inc.</td><td headers="DOCUMENTTYPE" class="t14data">Supplemental Statement</td><td headers="STAMPED/RECEIVEDDATE" class="t14data">06/11/2019</td>"]
The code is pulled from the page with the following page.evaluate function using puppeteer.
I'd like to then parse this code with cheerio, which I find to be simpler and more understandable. However, when I pass each of the strings of html into cheerio, it fails to parse them correctly. Here's the current function I'm using:
let data = res.map((tr) => {
let $ = cheerio.load(tr);
const link = $("a").attr("href");
const number = $("td[headers='REGISTRATIONNUMBER']").text();
const name = $("td[headers='REGISTRANTNAME']").text();
const type = $("td[headers='DOCUMENTTYPE']").text();
const date = $("td[headers='STAMPED/RECEIVEDDATE']").text();
return { link, number, name, type, date };
});
For some reason, only the "a" tag is working correctly for each row. Meaning, the "link" variable is correctly defined, but none of the other ones are. When I use $("*") to return a list of what should be all of the td's, it returns an unusual node list:
What am I doing wrong, and how can I gain access to the td's with the various headers, and their text content? Thanks!
It usually looks more like this:
let data = res.map((i, tr) => {
const link = $(tr).find("a").attr("href");
const number = $(tr).find("td[headers='REGISTRATIONNUMBER']").text();
const name = $(tr).find("td[headers='REGISTRANTNAME']").text();
const type = $(tr).find("td[headers='DOCUMENTTYPE']").text();
const date = $(tr).find("td[headers='STAMPED/RECEIVEDDATE']").text();
return { link, number, name, type, date };
}).get();
Keep in mind that cheerio's map passes its callback arguments in the reverse order from JavaScript's Array map: (index, element) rather than (element, index).
I found the solution. I'm simply returning the full html through puppeteer instead of trying to get individual rows, and then using the above suggestion (from @pguardiario) to parse the text:
const res = await page.evaluate(() => {
return document.body.innerHTML;
});
let $ = cheerio.load(res);
let trs = $(".t14Standard tbody tr.highlight-row");
let data = trs.map((i, tr) => {
const link = $(tr).find("a").attr("href");
const number = $(tr).find("td[headers='REGISTRATIONNUMBER']").text();
const registrant = $(tr).find("td[headers='REGISTRANTNAME']").text();
const type = $(tr).find("td[headers='DOCUMENTTYPE']").text();
const date = moment($(tr).find("td[headers='STAMPED/RECEIVEDDATE']").text()).valueOf().toString();
return { link, number, registrant, type, date };
}).get();

Webscraping paginated website with different page layout using Puppeteer

I'm trying to paginate through 200+ pages on this website, and not all of them have the same layout. For example, the GPA breakdown and the SAT/ACT Super Scores (in the testing policy row) differ across these schools. On the Harvard College page, the SAT/ACT Super Scores don't show up at all. I'm having problems formatting this for the CSV because these data show up on some pages but not on others.
Links:
https://www.princetonreview.com/college/georgia-institute-technology-1022905
https://www.princetonreview.com/college/princeton-university-1024041
https://www.princetonreview.com/college/harvard-college-1022984
CSV file I currently have: https://ibb.co/Tc3DyFR This sample only shows the difference in Super Scores because I have not scraped GPA breakdown yet. However, both layouts are different across different pages.
Code:
const puppeteer = require('puppeteer');
const fs = require('fs-extra');
(async function main() {
try{
var names = fs.readFileSync('names.csv', 'utf8').split('\n');
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36');
await page.goto('https://www.princetonreview.com/college/harvard-college-1022984#!admissions');
//await fs.writeFile('out.csv', 'School Name,Applicants,Acceptance Rate,Average HS GPA,GPA: Over 3.75,GPA: 3.50-3.74,GPA: 3.25 - 3.49,GPA: 3.00 - 3.24,GPA: 2.50 - 2.99,GPA: 2.00 - 2.49,SAT Reading and Writing,SAT Math,ACT\n');
await fs.appendFile('out.csv', `"${names[1]}",`);
const numbers = await page.evaluate(() => {
let nums = document.querySelectorAll('.number-callout');
let arr = Array.prototype.slice.call(nums);
let text_arr = [];
for(let i = 0; i < arr.length; i++){
if(arr[i].innerText == "")
continue;
text_arr.push(arr[i].innerText.trim());
}
return text_arr;
});
for(var e of numbers){
await fs.appendFile('out.csv', `"${e}",`);
}
await fs.appendFile('out.csv', `\n`);
//console.log(numbers);
await browser.close();
}catch(e){
console.log('our error', e);
}
})();
Short Answer:
When you are paginating through pages with different layouts, stop looking for a general solution at first.
Think of each block separately, and try to get the data one by one. This way you can format and break them up as you want.
Long Answer:
This looks like a pretty big task to resolve in one question. However, here are some leads for resolving this problem.
Our problem is:
- The format is different across different pages.
- Some pages have the data, some do not.
- We need to extract 8-10 specific data points.
Let's say we want to extract the Superscore SAT score, which is available on the Princeton and Georgia pages but not on the Harvard page.
We need to search for each of them specifically, or optimize the code to extract all the data. There is no generalized way of magically knowing what's what.
// Let's grab all elements
[...document.querySelectorAll('div.number-callout')]
// And search for specific term
.find(e=>{
// We can go upper level and find the link element
// since it's the only one identifying this data
const parent = e.parentNode.querySelector('a')
// if an element is found, we search for the text there
if(parent) return parent.innerText.includes('Superscore SAT')
})
This will only return a result on the first two pages.
This also works with "Superscore ACT".
You can map through the elements and merge the data:
const data = {};
[...document.querySelectorAll('div.number-callout')].map(e=>{
const parent = e.parentNode.querySelector('a');
if(parent){
if(parent.innerText.includes('Superscore ACT')) data["Superscore ACT"] = true;
if(parent.innerText.includes('Superscore SAT')) data["Superscore SAT"] = true;
}
});
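For the CSV side of the problem, one option is to fix the column list up front and leave a cell empty whenever a page lacks that field, so every row has the same shape. The following is only a sketch: the column names, the sample records, and the helper names csvCell and toCsvRow are all made up for illustration.

```javascript
// Hypothetical column list; every school gets exactly these columns.
const COLUMNS = ['School Name', 'Applicants', 'Superscore SAT', 'Superscore ACT'];

// Quote a value for CSV, doubling any embedded double quotes.
function csvCell(value) {
  return `"${String(value).replace(/"/g, '""')}"`;
}

// Build one CSV line; fields missing from `record` become empty cells.
function toCsvRow(record) {
  return COLUMNS.map((col) => csvCell(record[col] ?? '')).join(',');
}

// Made-up records: the second one lacks both Superscore fields.
console.log(toCsvRow({ 'School Name': 'School A', Applicants: '100', 'Superscore SAT': true, 'Superscore ACT': true }));
// "School A","100","true","true"
console.log(toCsvRow({ 'School Name': 'School B', Applicants: '200' }));
// "School B","200","",""
```

Each page's scraper can then fill in only the keys it finds, and the CSV columns still line up.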

How to get information from a tag using Node.js and the Puppeteer library? [duplicate]

This question already has an answer here:
How with use node.js get information from this tag that in page source - {{= flyingStatus(it.m_status) }}?
(1 answer)
Closed 5 years ago.
The page source is different from the page in the browser. Therefore I need to use the puppeteer library or the jsdom library.
The page has a "div" tag with many classes, "bma-fly flying flying-won-team2 flying-past":
How do I get information from this tag?
I use this code:
const puppeteer = require('puppeteer');
var fs = require('fs');
var link = "www. la la la . com";
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(link);
const text2 = await page.evaluate(() => document.querySelector('.bma-fly.flying.flying-won-team2.flying-past').innerHTML);
console.log(text2);
fs.writeFileSync("a08.txt" , text2);
browser.close();
})();
If I use:
const text1 = await page.evaluate(() => document.querySelector("div.bma-fly.flying.flying-won-team2.flying-past").innerHTML);
I get information only from the first place this element is found.
How do I get the information from the other elements with this tag and class?
If I don't use innerHTML, I get {} in the console. (I use Linux.)
If I save it with fs.writeFileSync("a07.txt" , text1); , I get [object Object].
If I use .childNodes, I get
{ '0': {}, '1': {}, '2': {}, '3': {}, '4': {}, '5': {}, '6': {} }
in the console.
If I save this, I get [object Object].
Please help me.
To select all of the nodes using that selector, you need document.querySelectorAll().
// get references to all matching HTML elements
var els = document.querySelectorAll('div.bma-fly.flying.flying-won-team2.flying-past')
// convert nodelist to array, then map the innerHTML property
var htmls = Array.prototype.slice.apply(els).map(el => el.innerHTML)
This will give you an array of all values for your selector.
["asdf", "asdfdsf", ...]
If you are writing a json object to a text file, you need to convert it into a string. This can be done with JSON.stringify().
In your case, it would look like this:
fs.writeFileSync("a07.txt" , JSON.stringify(text1))
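A short sketch of why the file ended up containing [object Object], and what JSON.stringify gives instead. The htmls array is a made-up sample standing in for the array of innerHTML strings collected in the page.

```javascript
// Made-up sample standing in for the array of innerHTML strings.
const htmls = ['<span>asdf</span>', '<span>asdfdsf</span>'];

// An object coerced to a string loses its contents.
console.log(String({ a: 1 })); // [object Object]

// JSON.stringify keeps the structure; the third argument indents the output.
const serialized = JSON.stringify(htmls, null, 2);
console.log(serialized);

// Joining with newlines is another option for a plain text file.
const plainText = htmls.join('\n');
console.log(plainText);
```

Either serialized or plainText can be passed to fs.writeFileSync as the file contents.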
