Cheerio Not Parsing HTML Correctly - javascript

I've got an array of rows that I've parsed out of a table from html, stored in a list. Each of the rows in the list is a string that looks (something) like this:
["<td headers="DOCUMENT" class="t14data"><a target="6690-Exhibit-C-20190611-1" href="http://www.fara.gov/docs/6690-Exhibit-C-20190611-1.pdf" class="doj-analytics-processed"><span style="color:blue">Click Here </span></a></td><td headers="REGISTRATIONNUMBER" class="t14data">6690</td><td headers="REGISTRANTNAME" class="t14data">SKDKnickerbocker LLC</td><td headers="DOCUMENTTYPE" class="t14data">Exhibit C</td><td headers="STAMPED/RECEIVEDDATE" class="t14data">06/11/2019</td>","<td headers="DOCUMENT" class="t14data"><a target="5334-Supplemental-Statement-20190611-30" href="http://www.fara.gov/docs/5334-Supplemental-Statement-20190611-30.pdf" class="doj-analytics-processed"><span style="color:blue">Click Here </span></a></td><td headers="REGISTRATIONNUMBER" class="t14data">5334</td><td headers="REGISTRANTNAME" class="t14data">Commonwealth of Dominica Maritime Registry, Inc.</td><td headers="DOCUMENTTYPE" class="t14data">Supplemental Statement</td><td headers="STAMPED/RECEIVEDDATE" class="t14data">06/11/2019</td>"]
The rows are pulled from the page with a page.evaluate call in puppeteer.
I'd like to then parse this HTML with cheerio, which I find simpler and more understandable. However, when I pass each of the HTML strings into cheerio, it fails to parse them correctly. Here's the function I'm currently using:
let data = res.map((tr) => {
let $ = cheerio.load(tr);
const link = $("a").attr("href");
const number = $("td[headers='REGISTRATIONNUMBER']").text();
const name = $("td[headers='REGISTRANTNAME']").text();
const type = $("td[headers='DOCUMENTTYPE']").text();
const date = $("td[headers='STAMPED/RECEIVEDDATE']").text();
return { link, number, name, type, date };
});
For some reason, only the "a" tag works correctly for each row: the "link" variable is correctly defined, but none of the others are. When I use $("*") to list what should be all of the td's, it returns an unusual node list.
What am I doing wrong, and how can I gain access to the td's with the various headers, and their text content? Thanks!

It usually looks more like this:
let data = res.map((i, tr) => {
const link = $(tr).find("a").attr("href");
const number = $(tr).find("td[headers='REGISTRATIONNUMBER']").text();
const name = $(tr).find("td[headers='REGISTRANTNAME']").text();
const type = $(tr).find("td[headers='DOCUMENTTYPE']").text();
const date = $(tr).find("td[headers='STAMPED/RECEIVEDDATE']").text();
return { link, number, name, type, date };
}).get();
Keep in mind that cheerio's map passes (index, element) to the callback, the reverse of the (element, index) order used by JavaScript's Array.prototype.map.
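A likely reason the per-row approach in the question fails: under the HTML5 parsing rules, a <td> start tag that appears outside a table context is ignored, so when a string of bare cells is loaded on its own the <td> elements are dropped while the <a> inside survives. Assuming cheerio is parsing with its default HTML5 parser, wrapping each row string in a table before loading it should keep the cells. A minimal sketch (wrapCells is a hypothetical helper name, not part of cheerio):

```javascript
// Wrap a string of bare <td> cells so an HTML5 parser sees them in a valid table context.
// wrapCells is a hypothetical helper, not a cheerio API.
function wrapCells(cellsHtml) {
  return `<table><tbody><tr>${cellsHtml}</tr></tbody></table>`;
}

// Usage with cheerio (not executed here; assumes cheerio is installed and `res` is the array from the question):
// const data = res.map((tr) => {
//   const $ = cheerio.load(wrapCells(tr));
//   return { number: $("td[headers='REGISTRATIONNUMBER']").text() };
// });

console.log(wrapCells("<td>6690</td>"));
// <table><tbody><tr><td>6690</td></tr></tbody></table>
```

Note that here res is a plain array, so the callback argument order is the normal JavaScript one (element first).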

I found the solution. I'm simply returning the full HTML through puppeteer instead of trying to get individual rows, and then using the above suggestion (from @pguardiario) to parse the text:
const res = await page.evaluate(() => {
  return document.body.innerHTML;
});
// Load the whole page so the rows keep their <table> context.
let $ = cheerio.load(res);
let trs = $(".t14Standard tbody tr.highlight-row");
// cheerio's map passes (index, element); .get() turns the result into a plain array.
let data = trs.map((i, tr) => {
  const link = $(tr).find("a").attr("href");
  const number = $(tr).find("td[headers='REGISTRATIONNUMBER']").text();
  const registrant = $(tr).find("td[headers='REGISTRANTNAME']").text();
  const type = $(tr).find("td[headers='DOCUMENTTYPE']").text();
  const date = moment($(tr).find("td[headers='STAMPED/RECEIVEDDATE']").text()).valueOf().toString();
  return { link, number, registrant, type, date };
}).get();

Related

filter and copy data from one spreadsheet to another while criteria is in different spreadsheet

I have two Google Sheets. The first one is https://docs.google.com/spreadsheets/d/1PJtjlkxCDFOIhJxpMJl4PcpJKL4nvGVtk2LDgVhbuNQ/edit#gid=0
and the second is https://docs.google.com/spreadsheets/d/1ZGw6dHpYE4ABvsE8S6dIPe7kLQ1eZW95Xp2F1oPHX5c/edit#gid=0
I want to filter the data in the first sheet based on the criteria in the second sheet at "Automate!D3", and then copy the filtered data to "Filtered_Data" in the second sheet. I also want to automate this process, so that when I add more data to sheet one in the future, it is copied below the existing data in sheet two.
I have not been able to do the filtering through Apps Script, so I would like help with this.
Something like this should work:
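const dsid = "1PJtjlkxCDFOIhJxpMJl4PcpJKL4nvGVtk2LDgVhbuNQ";
const tssid = "1ZGw6dHpYE4ABvsE8S6dIPe7kLQ1eZW95Xp2F1oPHX5c";
function filterByDate() {
  const values = SpreadsheetApp.openById(dsid).getSheetByName('DT').getDataRange().getValues();
  const tss = SpreadsheetApp.openById(tssid);
  // Assumes D3 is formatted as a date, so getValue() returns a Date object.
  const criteria = tss.getSheetByName('Automate').getRange('D3').getValue();
  const ts = tss.getSheetByName('Filtered_Data');
  const isDate = v => Object.prototype.toString.call(v) === '[object Date]';
  const results = values
    // Comparing the bare getDate references is always true for two Dates, so compare
    // the calendar dates instead; the isDate check skips the header row.
    .filter(row => isDate(row[1]) && row[1].toDateString() === criteria.toDateString())
    .map(row => [row[1],row[2],row[3],row[4]]);
  // Avoid reading results[0] when nothing matched.
  if (results.length === 0) return;
  ts.getRange(2,1,results.length,results[0].length).setValues(results);
}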
I would suggest placing the criteria value in the daily sheet and installing the function in the daily sheet as an edit trigger.
Make sure that the 'criteria' cell and the 'date' column are both formatted as 'Date'.
The following code changes the way the values are read.
Instead of .getValues(), it now uses .getDisplayValues() and converts the strings back into Date objects afterwards.
const dsid = "1PJtjlkxCDFOIhJxpMJl4PcpJKL4nvGVtk2LDgVhbuNQ";
const tssid = "1ZGw6dHpYE4ABvsE8S6dIPe7kLQ1eZW95Xp2F1oPHX5c";
function filterByDate() {
const values = SpreadsheetApp.openById(dsid).getSheetByName('DT').getDataRange().getDisplayValues();
const tss = SpreadsheetApp.openById(tssid);
const criteria = tss.getSheetByName('Automate').getRange('D3').getDisplayValue(); // single cell, so read one string rather than a 2D array
const ts = tss.getSheetByName('Filtered_Data');
const results = values
.filter(row => new Date(row[1]).setHours(0,0,0,0) == new Date(criteria).setHours(0,0,0,0))
.map(row => [row[1],row[2],row[3],row[4]]);
ts.getRange(2,1,results.length,results[0].length).setValues(results);
}
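The date comparison is the part that tends to go wrong, so here is a minimal sketch of it in plain JavaScript, independent of the Sheets API. The names sameCalendarDay and filterRowsByDay are hypothetical, and the sketch assumes the date sits in column index 1 as in the code above:

```javascript
// Return true when a and b fall on the same calendar day (local time).
// Accepts Date objects or date strings; anything that does not parse is treated as "no match".
function sameCalendarDay(a, b) {
  if (a == null || b == null) return false; // new Date(null) would silently become 1970-01-01
  const da = new Date(a);
  const db = new Date(b);
  if (isNaN(da) || isNaN(db)) return false;
  return da.getFullYear() === db.getFullYear()
    && da.getMonth() === db.getMonth()
    && da.getDate() === db.getDate();
}

// Keep only the rows whose date column matches the criteria day.
function filterRowsByDay(rows, criteria, dateCol = 1) {
  return rows.filter(row => sameCalendarDay(row[dateCol], criteria));
}

// Example with made-up rows: a header row, one matching row, one non-matching row.
const sampleRows = [
  ["id", "Date"],
  [1, new Date(2019, 5, 11, 9, 30)],
  [2, new Date(2019, 5, 12, 0, 0)],
];
console.log(filterRowsByDay(sampleRows, new Date(2019, 5, 11, 23, 59)).length); // 1
```

Comparing year, month, and day explicitly avoids both the time-of-day problem and the pitfall of comparing only the day of the month.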

Google Apps Script - fill document template with sheet data

Complete noob here, and this is my first ever post, so sorry in advance for any poor choice of words.
I am working on a mail merge script that will fill a GDoc template with data from a GSheet, creating a separate GDoc for each row in the GSheet.
The script is working well. I'm using the .replaceText method on the template's body, like below:
function createNewGoogleDocs() {
const documentLink_Col = ("Document Link");
const template = DriveApp.getFileById('1gZG-NR8CcOpnBTZfTy8gEsGDOLXa9Ba9Ks5zXJbujY4');
const destinationFolder = DriveApp.getFolderById('1DcpZGeyoCJxAQu1vMbSj31amzpwfr_JB');
const sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('data');
const data = sheet.getDataRange().getDisplayValues();
const heads = data[0]; // Assumes row 1 contains the column headings
const documentLink_ColIndex = heads.indexOf(documentLink_Col);
data.forEach(function(row, index){
if(index === 0) return;
const templateCopy = template.makeCopy(`${row[0]} ${row[1]} Report`, destinationFolder); //create a copy of the template document
const templateCopyId = DocumentApp.openById(templateCopy.getId());
const templateCopyBody = templateCopyId.getBody();
templateCopyBody.replaceText('{{Name}}', row[0]);
templateCopyBody.replaceText('{{Address}}', row[1]);
templateCopyBody.replaceText('{{City}}', row[2]);
templateCopyId.saveAndClose();
const url = templateCopyId.getUrl();
sheet.getRange(index +1 , documentLink_ColIndex + 1).setValue(url);
})
}
What I want to change:
Have the freedom to add/remove columns in the sheet without having to hard-code every column header with a .replaceText call.
I have found a similar script that achieves this for sending emails with GmailApp, and I extracted two functions that do token replacement, but I don't know how to call the function fillInTemplateFromObject_ from my function createNewGoogleDocs.
Here is the code for the functions I found in the other script:
function fillInTemplateFromObject_(template, data) {
// We have two templates one for plain text and the html body
// Stringifing the object means we can do a global replace
let template_string = JSON.stringify(template);
// Token replacement
template_string = template_string.replace(/{{[^{}]+}}/g, key => {
return escapeData_(data[key.replace(/[{}]+/g, "")] || "");
});
return JSON.parse(template_string);
}
/**
* Escape cell data to make JSON safe
* @see https://stackoverflow.com/a/9204218/1027723
* @param {string} str to escape JSON special characters from
* @return {string} escaped string
*/
function escapeData_(str) {
return str
.replace(/[\\]/g, '\\\\')
.replace(/[\"]/g, '\\\"')
.replace(/[\/]/g, '\\/')
.replace(/[\b]/g, '\\b')
.replace(/[\f]/g, '\\f')
.replace(/[\n]/g, '\\n')
.replace(/[\r]/g, '\\r')
.replace(/[\t]/g, '\\t');
};
Thanks everyone in advance for your support.
Using column headers to make programmatic assignments
function myfunction() {
const ss = SpreadsheetApp.getActive();
const sh = ss.getSheetByName("Sheet0");
const [hA, ...vs] = sh.getDataRange().getValues(); // getValues() is needed to get a 2D array; hA is the header row: Name, Address, City
const idx = {};
const body = DocumentApp.getActiveDocument().getBody();
hA.forEach((h, i) => { idx[h] = i; })
vs.forEach(row => {
body.replaceText("{{Name}}", row[idx["Name"]]);
body.replaceText("{{Address}}", row[idx["Address"]]);
body.replaceText("{{City}}", row[idx["City"]]);
});
}
As long as you keep the column titles the same, you can move the columns around anywhere you wish.
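To avoid hard-coding each header at all, one option is to derive the {{token}} names from the header row itself. The sketch below is plain JavaScript and only builds the token/value pairs; buildReplacements is a hypothetical helper name, and it assumes every placeholder in the template is spelled exactly like its column header:

```javascript
// Build [token, value] pairs such as ["{{Name}}", "Ada"] from a header row and a data row.
// Columns with an empty header are skipped. buildReplacements is a hypothetical helper.
function buildReplacements(heads, row) {
  return heads
    .map((head, i) => [String(head).trim(), row[i]])
    .filter(([head]) => head !== "")
    .map(([head, value]) => [`{{${head}}}`, String(value ?? "")]);
}

// Inside the forEach of createNewGoogleDocs, the three replaceText lines could then become
// (Apps Script only, not executed here):
// buildReplacements(heads, row).forEach(([token, value]) => {
//   templateCopyBody.replaceText(token, value);
// });

console.log(buildReplacements(["Name", "City"], ["Ada", "London"]));
```

Since replaceText treats its first argument as a regular expression pattern, headers containing special characters such as parentheses or "+" would need escaping first.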

Get cypress database query output objects in to variables

I have a Cypress test that has been set up with the mysql node module. When I run the test below, it gives the following output.
const executeQuery = (query) => {
cy.task('DBQuery', query).then(function (recordset) {
var rec = recordset
cy.log(rec)
})
}
Query:
select *
from Users
where email = 'sheeranlymited@lymitedtest.com'
OUTPUT: log [Object{23}]
Query:
select firstname
from Users
where email = 'sheeranlymited@lymitedtest.com'
OUTPUT: log [{firstname: Edward}]
Instead of cy.log(rec), I want to assign the output of the 23 columns to different variables based on the column name.
I'd appreciate it if someone could help me resolve this.
You can use Object.values in JS to retrieve the values from your object.
Let's say you need to extract the value of the 3rd column; your code will look like this:
cy.task('DBQuery', query).then(function (recordset) {
  var rec = recordset
  const results = Object.values(rec[0])
  // results[index of the column] holds that column's value (indexes start at 0)
  cy.log(results[2]) // 3rd column
})
We can make a small modification to make your task easier:
cy.task('DBQuery', query).then(function (recordset) {
  var rec = recordset
  const values = Object.values(rec[0]);
  const keys = Object.keys(rec[0]);
  let result = {};
  // forEach supplies the index, so no separate counter is needed
  keys.forEach((key, index) => {
    result[key] = values[index];
  })
  // result.firstname will give you the value of the firstname column
  cy.log(result.firstname);
})
In this way, we are generating key-value pairs having the key as the column name. So you can use the column name to find the value.
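Since each row in the recordset is already an object keyed by column name, destructuring may be the shortest way to get named variables. A minimal sketch in plain JavaScript follows; the sample row is hypothetical apart from firstname: Edward, which appears in the question's output, and pickColumns is a made-up helper name:

```javascript
// Hypothetical row shaped like the query output in the question;
// only "firstname: Edward" comes from the post, the other values are placeholders.
const recordset = [{ firstname: "Edward", lastname: "(placeholder)", email: "(placeholder)" }];

// Return an object holding just the requested columns of the first row.
// pickColumns is a hypothetical helper, not part of Cypress or mysql.
function pickColumns(rows, columns) {
  const [firstRow = {}] = rows; // an empty recordset yields undefined values instead of throwing
  return Object.fromEntries(columns.map(name => [name, firstRow[name]]));
}

// Destructure the columns you need into variables named after them.
const { firstname, email } = pickColumns(recordset, ["firstname", "email"]);
console.log(firstname); // Edward
```

Inside the .then callback of cy.task, the same destructuring could be applied directly to recordset[0].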
Hope this helps.
cheers.

How to scrape a table with changing data using Cheerio in Node.js?

I am trying to scrape data from a table in a website which has constantly changing values. So each row can vary day to day but I want to be able to scrape the correct data. I am using the Cheerio library at the moment and I am not familiar with it but here's what I have:
const rp = require("request-promise");
const cheerio = require("cheerio");
let Italy = "";
async function main() {
const result = await rp.get("https://www.worldometers.info/coronavirus/");
const $ = cheerio.load(result);
$("#main_table_countries > tbody:nth-child(2) > tr:nth-child(2)").each((i,el) => {
const item = $(el).text();
Italy = item;
});
}
So, as you can see this scrapes data from the worldometer website for the coronavirus cases in Italy. Italy's position however has been changing between 2 and 3 over the past few days. This has resulted in my program fetching the wrong information. This is what I would like to fix.
Here's the link to the worldometer website:
https://www.worldometers.info/coronavirus/
Thanks,
Karthik
What I implemented is this: get all the tr's, loop over them to collect the country names into an array, and then use the array index to find any country you want.
async function main() {
let NamesArr=[]
let CountryToFind= 'Italy'
const result = await rp.get("https://www.worldometers.info/coronavirus/");
const $ = cheerio.load(result);
$('#main_table_countries').find('tbody').eq(0).find('tr').each((i,el)=>{
NamesArr.push($(el).find('td').eq(0).text().trim())
})
let Index= NamesArr.indexOf(CountryToFind) + 1
$(`#main_table_countries > tbody:nth-child(2) > tr:nth-child(${Index})`).each((i,el) => {
const item = $(el).text();
console.log(item);
});
}
main()
You can definitely refactor it, but this approach makes your parser dynamic, as you can now search for any country.
Use the :contains pseudo-selector for this:
$('tr:contains(Italy)').text()
//" Italy 9,172 +1,797 463 +97 724 7,985 733 151.7 "
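One caveat: :contains matches a substring anywhere in the row's text, so a name that is part of another name (for example "Niger" inside "Nigeria") can match more than one row. If that matters, an exact match on the first cell is safer. Here is a minimal sketch in plain JavaScript, assuming the rows have already been extracted into arrays of cell text (a hypothetical data shape; findRowByName is a made-up helper and the cell values are placeholders):

```javascript
// Find the row whose first cell equals the given name exactly (ignoring case and surrounding spaces).
// Returns undefined when no row matches.
function findRowByName(rows, name) {
  const wanted = name.trim().toLowerCase();
  return rows.find(cells => String(cells[0] ?? "").trim().toLowerCase() === wanted);
}

// Placeholder rows: the first cell is the country name, the rest stand in for the figures.
const tableRows = [
  [" Nigeria ", "row A"],
  [" Niger ", "row B"],
];
console.log(findRowByName(tableRows, "Niger")); // [ ' Niger ', 'row B' ]
```

With cheerio, arrays like these could be built by mapping over each tr and collecting the text of its td's, along the lines of the loop in the previous answer.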

Writing a query parser in javascript

I'm trying to write a parser that supports the following type of query clauses
from: A person
at: a specific company
location: The person's location
So a sample query would be like -
from:Alpha at:Procter And Gamble location:US
How do I write this generic parser in JavaScript? Also, I was considering including AND operators inside queries, like:
from:Alpha AND at:Procter And Gamble AND location:US
However, this would conflict with the criteria value in any of the fields (Procter And Gamble)
Use a character like ";" instead of AND and then call these functions:
var query = 'from:Alpha;at:Procter And Gamble;location:US';
var result = query.split(';').map(v => v.split(':'));
console.log(result);
You'll then have an array of pairs, where pair[0] is the property name and pair[1] is the property value.
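If an object is more convenient than an array of pairs, the pairs can be passed to Object.fromEntries. The sketch below also splits each clause on its first ":" only, so values that contain a colon themselves stay intact; parseQuery is a hypothetical name:

```javascript
// Parse "key:value;key:value" into an object.
// Each clause is split on its FIRST colon only, so values may contain colons.
function parseQuery(query) {
  const entries = query
    .split(";")
    .map(part => part.trim())
    .filter(part => part.includes(":")) // skip empty or malformed clauses
    .map(part => {
      const i = part.indexOf(":");
      return [part.slice(0, i).trim(), part.slice(i + 1).trim()];
    });
  return Object.fromEntries(entries);
}

console.log(parseQuery("from:Alpha;at:Procter And Gamble;location:US"));
// { from: 'Alpha', at: 'Procter And Gamble', location: 'US' }
```

A trailing ";" or stray whitespace around clauses is tolerated because empty parts are filtered out.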
Assuming your query will always look like this: from: at: location:
You can do this:
const regex = /from:\s*(.*?)\s*at:\s*(.*?)\s*location:\s*(.*)\s*/
const queryToObj = query => {
const [,from,at,location] = regex.exec(query)
return {from,at,location}
}
console.log(queryToObj("from:Alpha at Betaat: Procter And Gamble location: US"))
However, adding a terminator allows you to mix the order and leave out some keywords:
const regex = /(\w+):\s*(.*?)\s*;/g
const queryToObj = query => {
const obj = {}
let temp
while(temp = regex.exec(query)){
let [,key,value] = temp
obj[key] = value
}
return obj
}
console.log(queryToObj("from:Alpha at Beta;at:Procter And Gamble;location:US;"))
console.log(queryToObj("at:Procter And Gamble;location:US;from:Alpha at Beta;"))
console.log(queryToObj("from:Alpha at Beta;"))
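If you would rather keep the AND separator from the question, one possible approach is to split only on an upper-case AND that is immediately followed by a keyword and a colon. That leaves the "And" in "Procter And Gamble" alone. This is a sketch under that assumption (parseAndQuery is a hypothetical name), and it would still break on a value that itself contains " AND word:":

```javascript
// Parse "key:value AND key:value" into an object.
// The lookahead (?=\w+:) makes sure the AND is followed by a "key:" before splitting on it.
function parseAndQuery(query) {
  const obj = {};
  for (const clause of query.split(/\s+AND\s+(?=\w+:)/)) {
    const i = clause.indexOf(":");
    if (i === -1) continue; // ignore anything that is not key:value
    obj[clause.slice(0, i).trim()] = clause.slice(i + 1).trim();
  }
  return obj;
}

console.log(parseAndQuery("from:Alpha AND at:Procter And Gamble AND location:US"));
// { from: 'Alpha', at: 'Procter And Gamble', location: 'US' }
```

The match is case-sensitive on purpose, so only a fully upper-case AND acts as a separator.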
