How to handle empty <dt> values in Web-Scraping using JavaScript


I have JavaScript code that scrapes data from a dl (description list).
There are 7 possible dt values, each with its respective dd value.
A page only shows the dt values that have at least one dd value, so one page may yield a single dt/dd pair while another yields all 7.
I have working JavaScript code that does the job:
const columns = [
  { text: 'Website', name: 'Website' },
  { text: 'Phone', name: 'Phone' },
  { text: 'Industry', name: 'Industry' },
  { text: 'Company size', name: 'Companysize' },
  { text: 'Headquarters', name: 'Headquarters' },
  { text: 'Founded', name: 'Founded' },
  { text: 'Specialties', name: 'Specialties' },
  { text: 'Employees', name: 'Employees' }
]
const result = [];
const elements = document.querySelectorAll('dl dt');
elements.forEach((element) => {
  const regex = new RegExp(element.innerText, 'i');
  const findColumn = columns.find((column) => regex.test(column.text));
  if (!findColumn) return;
  const columnValue = element.nextElementSibling.innerText;
  result.push({ [findColumn.name]: columnValue });
});
Problem
I want to save the scraping results in an MS Excel table that has 7 columns.
However, the scraping result can have anywhere from 1 to 7 columns.
Because of that, I can't simply append the results row by row; I have to copy each value into the right column manually.
I need code that does the following:
The values of the 7 dt elements are the headers of the 7 columns.
The code always returns 7 values.
If the dd holds a real value and is scraped by the code above, it is put in the correct column.
If there is no dt element, the string "n/a" is put under the respective column.
This way, the results are stored in a consistent Excel table, always with the correct values in the correct columns.
I cannot find any specific information or sample code for solving this task in JavaScript. I think a JavaScript expert is needed to write that code.
Thank you for helping me understand JavaScript better and learn how to write JS code.
P.S.: The website being scraped: LinkedIn company pages
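One way to get a fixed-width row is to look up every expected column and fall back to "n/a" when the page did not provide it. The following is a minimal sketch, not a definitive implementation: the helper names toRow and scrapePairs are hypothetical, the column list is copied from the columns array in the question (which has eight entries, so trim it to the seven you need), and matching is done by exact, case-insensitive comparison of the dt text rather than by regex.

```javascript
// Column headers copied from the question's `columns` array, in output order.
const COLUMNS = [
  'Website', 'Phone', 'Industry', 'Company size',
  'Headquarters', 'Founded', 'Specialties', 'Employees',
];

// Turn scraped [dtText, ddText] pairs into one row with a cell per column.
// Columns that were not found on the page (or were empty) get the placeholder.
function toRow(pairs, columns = COLUMNS, placeholder = 'n/a') {
  const found = new Map();
  for (const [term, value] of pairs) {
    found.set(term.trim().toLowerCase(), (value || '').trim());
  }
  return columns.map((name) => found.get(name.toLowerCase()) || placeholder);
}

// Browser-only helper (hypothetical): collect [dtText, ddText] pairs.
// A dt that is not directly followed by a dd yields an empty value.
function scrapePairs(root) {
  return Array.from(root.querySelectorAll('dl dt')).map((dt) => {
    const dd = dt.nextElementSibling;
    return [dt.innerText, dd && dd.tagName === 'DD' ? dd.innerText : ''];
  });
}
```

In the browser console, toRow(scrapePairs(document)).join('\t') should give one tab-separated line that can be pasted into a spreadsheet row, with every value under its own header.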

Related

How to Insert Text into table cells using Google Docs API?

I'm encountering an implementation problem I need some help with:
I am building a Google Docs integration that involves programmatically creating a table and then mapping through an array of records to add the data of those records to the table cells. I'm able to create a blank table and locate the start index of each cell to insert the data into, but when I try to use the insertText request from the docs and place the text in a specific cell I get the following error:
"Invalid request[1].insertText: The insertion index must be inside the
bounds of an existing paragraph. You can still create new paragraph by
inserting new lines."
I've tried the following:
Simply inserting my text and hoping that each insertText request places text in a new cell
Adding '\n' at the start index of the cells
Creating the table with tableRows already defined (this errors out as an invalid request format)
Creating cells one by one and inserting text after each appended 1x1 column or row (eventually this creates a row that duplicates the row above it with multiple columns, and all the text ends up in the first cell)
Inserting column breaks after the first cell's text in the hope that it would shift the paragraph over to the new cell
Here's the last implementation I tried:
(For context I aim to filter through an items array, create a table for each item, and then for every subitem in that item object to have a table row where relevant data will go under Header 1, 2 and 3, but for now I'd be happy just to get the tables and headers in the right places)
var requests = []
await Promise.all(
items.map(async (item) => {
let subitems = []
subitems = item.subitems.filter((subitem) => subitem.selected)
//Create table api call
await docs.documents
.batchUpdate({
documentId: newDoc.data.documentId,
resource: {
requests: [
{
insertTable: {
columns: 3,
rows: item.subitems.length + 1,
endOfSegmentLocation: { segmentId: '' },
},
},
],
},
})
.then((res) => res.data)
//Api call for cell data
var docData = await docs.documents
.get({
documentId: newDoc.data.documentId,
})
.then((res) => res.data)
requests.push(
{
insertText: {
text: 'Header 1',
location: {
index:
docData.body.content[2].table.tableRows[0].tableCells[0]
.startIndex,
},
},
},
{
insertText: {
text: 'Header 2',
location: {
index:
docData.body.content[2].table.tableRows[0].tableCells[1]
.startIndex,
},
},
},
{
insertText: {
text: 'Header 3',
location: {
index:
docData.body.content[2].table.tableRows[0].tableCells[2]
.startIndex,
},
},
}
)
})
)
//Insert Text API call
var myDoc = await docs.documents
.batchUpdate({
documentId: newDoc.data.documentId,
resource: {
requests,
},
})
.then((res) => res.data)
return myDoc
}
If you need any more info let me know. Thanks in advance.
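A possible cause of the error, offered as an assumption rather than a confirmed diagnosis: a cell's startIndex points at the cell itself, while insertText needs an index inside the paragraph that the cell contains. In addition, each insertion shifts every index after it, so requests built from a single documents.get snapshot are safest when applied from the highest index to the lowest. The sketch below builds the requests from the document JSON only; buildHeaderRequests and lastTable are hypothetical helper names, and lastTable replaces the hard-coded content[2] so that the most recently appended table is used.

```javascript
// Find the last table in body.content (the one just appended at the end).
function lastTable(content) {
  const tables = content.filter((element) => element.table);
  return tables.length ? tables[tables.length - 1].table : undefined;
}

// Build insertText requests for the first row of a table.
// Uses the start of the paragraph inside each cell, not the cell's own index,
// and orders requests by descending index so earlier inserts do not shift
// the positions of later ones.
function buildHeaderRequests(table, headers) {
  const cells = table.tableRows[0].tableCells;
  return headers
    .map((text, i) => ({
      insertText: {
        text,
        location: { index: cells[i].content[0].startIndex },
      },
    }))
    .sort((a, b) => b.insertText.location.index - a.insertText.location.index);
}
```

With the question's variables this would be used as requests.push(...buildHeaderRequests(lastTable(docData.body.content), ['Header 1', 'Header 2', 'Header 3'])), sent in one batchUpdate per table before the next table is created.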

Access Table Header Row Key React

I was working with tables and I came across this issue: I want to access the data-row-key attribute (shown in the image below) in the table header row at a child row and I'm stuck. Code:
class App extends React.Component {
  render() {
    const columns = [
      // sample of how the JSON API is read
      {
        title: "Title", dataIndex: "title", key: "title",
      },
      // the one that actually matters. becomes the actions column eventually
      {
        title: "Action", dataIndex: "", key: "x", width: "12%",
        render: () => (
          <Popconfirm
            placement="topRight"
            title="Are you sure to delete this task?"
            // retrieve the data here as a parameter into the confirm(n) call
            onConfirm={() => confirm(43)} okText="Yes" cancelText="No"
          >
            <a>delete</a>
          </Popconfirm>
        )
      }
    ];
    return (
      <Table columns={columns} dataSource={this.state.data}/>
    );
  }
}
Right now the actual number (43) is hard-coded, but I want it to be dynamic, retrieved from the <tr data-row-key=...> tag.
As a note, there is no leading id column at the start of the table. The keys are provided through Django's REST framework as JSON.
[Images: rendered results; JSON format]
Can anyone please help me? Thanks in advance.

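You can use querySelector with an attribute selector for it. Note that this reads the key of the first matching row only.
let value = document.querySelector('[data-row-key]').getAttribute('data-row-key')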
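An alternative that avoids reading the DOM: if the Table here is Ant Design's (an assumption based on the Popconfirm, columns and dataSource props), the column's render callback is called as render(text, record, index), so the row's data is available directly. The sketch below assumes the key lives in a field named id on each record; deleteProps is a hypothetical helper that builds the Popconfirm props for one row.

```javascript
// Build the <Popconfirm> props for one table row.
// `record` is the row object from dataSource; `keyField` is an assumed name.
function deleteProps(record, onDelete, keyField = 'id') {
  return {
    placement: 'topRight',
    title: 'Are you sure to delete this task?',
    okText: 'Yes',
    cancelText: 'No',
    // Pass the row's own key instead of a hard-coded number.
    onConfirm: () => onDelete(record[keyField]),
  };
}

// In the column definition (JSX, shown as a comment to keep this runnable):
// render: (text, record) => (
//   <Popconfirm {...deleteProps(record, confirm)}><a>delete</a></Popconfirm>
// )
```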

Remove unwanted columns from CSV file using Papaparse

I have a situation where a user can upload a csv file. This CSV file contains a lot of data, but I am only interested in 2 columns (ID and Date). At the moment, I am parsing the CSV using Papaparse
Papa.parse(ev.data, {
delimiter: "",
newline: "",
quoteChar: '"',
header: true,
error: function(err, file, inputElem, reason) { },
complete: function (results) {
this.parsed_csv = results.data;
}
});
When this is run this.parsed_csv represents objects of data keyed by the field name. So if I JSON.stringify the output is something like this
[
{
"ID": 123456,
"Date": "2012-01-01",
"Irrelevant_Column_1": 123,
"Irrelevant_Column_2": 234,
"Irrelevant_Column_3": 345,
"Irrelevant_Column_4": 456
},
...
]
So my main question is: how can I get rid of the columns I don't need and produce a new CSV containing only the ID and Date columns?
Thanks
One more thing I realised: is there a way to use dynamic variables? For instance, I am letting users select the columns to map. Now I need to do something like this
let ID = this.selectedIdCol;
this.parsed_csv = results.data.map(element => ({ID: element.ID, Date: element.Date}));
However, it says that ID is unused. Thanks

let data = [
{
"ID": 123456,
"Date": "2012-01-01",
"Irrelevant_Column_1": 123,
"Irrelevant_Column_2": 234,
"Irrelevant_Column_3": 345,
"Irrelevant_Column_4": 456
},
...
]
Just produce the results using the following code:
data = data.map(element => ({ID: element.ID, Date: element.Date}))
Now you have only the desired columns and can generate a new CSV from them.
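For the follow-up about user-selected columns, the variable is reported as unused because {ID: element.ID} uses the literal name ID, not the variable. Bracket access (element[name]) reads a property whose name is held in a variable. A minimal sketch, where remapColumns and the source column names user_id and created are made up for illustration:

```javascript
// Keep only the mapped columns. `mapping` is { outputName: sourceColumnName },
// e.g. built from whatever columns the user selected.
function remapColumns(rows, mapping) {
  return rows.map((row) =>
    Object.fromEntries(
      Object.entries(mapping).map(([outName, sourceName]) => [
        outName,
        row[sourceName], // dynamic lookup by column name
      ])
    )
  );
}

const slim = remapColumns(
  [{ user_id: 123456, created: '2012-01-01', Irrelevant_Column_1: 123 }],
  { ID: 'user_id', Date: 'created' }
);
```

The resulting array of objects can then be passed to Papa.unparse to produce the new CSV text.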

As Serrurier pointed out, you should use the step/chunk function to trim the data while parsing, rather than mapping afterwards when the whole file is already in memory.
const rows = [];
PapaParse.parse(file, { skipEmptyLines: true, header: true,
  step: (results) => {
    // keep only the wanted columns of each parsed row
    rows.push(_.pick(results.data, ['column1', 'column2']));
  },
  complete: () => { /* rows now holds only the picked columns */ }
});

Note that if you load a huge file, the whole file is held in memory right after parsing. Moreover, the heavy workload may freeze the browser. You can avoid that by reading the data and discarding unneeded columns:
row by row
chunk by chunk.
You should read Papaparse's FAQ before implementing that. To sum up, you will store required columns by extracting them from the step or chunk callbacks.

Creating dynamic number of columns or header in excel in nodejs/javascript

I have used exceljs module in nodejs for exporting json data to excel. It's working fine, but the names of headers/columns have to be predefined before adding rows i.e., columns are fixed. After addition of rows, I can't add columns dynamically.
I have tried a number of modules available through npm but all of them have the same features.
So, is there any way or module that can create a new column and add the required row while the JSON data is being processed?

If someone is still looking into this problem then I have a decent solution.
Instead of creating columns, you can create a table as follows in the worksheet.
worksheet.addTable({
name: "MyTable",
ref: "A1",
headerRow: true,
totalsRow: false,
style: {
theme: null,
showRowStripes: true,
showColumnStripes: true,
},
columns: [
{ name: "EmployeeID" },
{ name: "First Name" },
],
rows: [/* Enter initial rows if you want to add*/],
});
After adding a table at cell A1 of your worksheet, you can add new columns dynamically:
const table = worksheet.getTable("MyTable");
table.addColumn({
name: "Column name",
});
table.commit();

I tried pushing the new columns directly to worksheet.columns, but that did not work. I found a workaround that works well for me.
Note: You need to keep track of the columns already added to the worksheet so that you can get the next empty column by its index.
Here is an example:
let workbook = new excel.Workbook(); //creating workbook
let worksheet = workbook.addWorksheet('Records'); //creating worksheet
const columns = [];
columns.push({header: 'Id', key: '_id', width: 30});
columns.push({header: 'Name', key: 'name', width: 30});
//Set Headers to WorkSheet Header
worksheet.columns = columns;
//Now insert some records if you want
worksheet.addRow({_id: "1", name: "Mitchell Starc"});
worksheet.addRow({_id: "2", name: "Ab de Villiers"});
//Update or add dynamic columns
//Get the empty columns from worksheet. You can get the empty columns number by using `columns` array length
//For this you have to track all inserted columns in worksheet
//This will return the next empty columns
let newColumn = worksheet.getColumn(columns.length + 1);
//Set new key header and all other required properties
newColumn.key = "profession";
newColumn.header = "Profession";
newColumn.width = 30;
//Add the new column to `columns` to track the added headers
columns.push(newColumn);
//Now you can insert rows with new columns
worksheet.addRow({_id: "3", name: "MS Dhoni", profession: "Cricket"});
workbook.xlsx.writeFile("records.xlsx")
.then(function () {
console.log("file saved!");
});

Now not sure if this worked 2 years ago but this worked for me
const columns = []
const x = "data1"
const y = "data2"
columns.push({ header: x, key: x })
columns.push({ header: y, key: y })
worksheet.columns = columns
You must build the array of column objects in a separate variable for this to work. If you use worksheet.columns = [] and worksheet.columns.push(..), it will fail.
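If all the JSON records are available before the sheet is written, another option is to derive the column list from the data first and assign it once. This is a sketch under that assumption, not a feature of exceljs itself; deriveColumns is a hypothetical helper, and the header, key and width fields follow the column objects used in the examples above.

```javascript
// Collect every key that appears in any record, in first-seen order,
// and turn each one into a column definition.
function deriveColumns(records, width = 30) {
  const keys = new Set();
  for (const record of records) {
    Object.keys(record).forEach((key) => keys.add(key));
  }
  return [...keys].map((key) => ({ header: key, key, width }));
}

const derived = deriveColumns([
  { _id: '1', name: 'Mitchell Starc' },
  { _id: '3', name: 'MS Dhoni', profession: 'Cricket' },
]);
```

Assigning worksheet.columns = derived before calling worksheet.addRow for each record should then leave the cells of missing keys empty instead of requiring a new column later.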

Insert multiple records in single call

I'm trying to insert an array of records into an Azure table. I'm on the new Azure portal, and all the help I found was for the old one.
[Screenshot: the new Azure portal's Script page]
I've tried to override the insert method as follows:
var table = require('azure-mobile-apps').table();
table.insert(function (context) {
  var all_answers = context.item.answers;
  console.log(all_answers[0]);
  return context.execute();
});
Log shows the following object:
{ id: '',
userid: '0029C048-B8A0-42AE-B8F8-2B9402D69EEF',
createdate: '2016-01-30 00:40:18',
questionid: 1,
choiceid: 0
}
How can I insert all the records of the array into the table?
Thanks in advance
