JS Web Scraper Hangs on Evaluation - javascript

I'm building a scraper that needs to get data from multiple adjacent elements. Essentially there are headers (h3) that display categories, next to which there are tables of ranks. I'm trying to crawl the page for a String to find out if that String was ranked in a category and, if so, what rank it achieved (A, B, or C), then fill an array with objects that describe the categories and ranks that string achieved (phew).
Initially, I got an error in the while loop ("cannot define property 'tagName' of null"), as sib kept evaluating to null for some reason. I added a null test, in case it was happening at the end of arr, but now the code just hangs indefinitely. I have a feeling that sib isn't being defined immediately, but I can't put my finger on whether that is the case or why.
I am testing everything in Chrome DevTools, if that helps. I am not using the arrow function form when I test in DevTools, so it is not contributing to this problem per se, but if it will cause trouble in the long run please let me know!
(str) => {
  let h = document.getElementsByTagName('h3');
  let arr = [];
  let res = [];
  for ( let i = 0; i < h.length; i++){ //Filter out any h3's that are not category headers.
    h[i].id.includes('category-') && arr.push(h[i]);
  };
  for ( let i = 0; i < arr.length; i++){
    let head = arr[i].innerText; //Save the current category header.
    let sib = arr[i].nextElementSibling; //Should be a table containing rank A.
    while ((sib != null) && (sib.tagName == 'TABLE')){
      if(sib.innerText.includes(str)){ //Create an object if the rankings table contains our target.
        let hit = {};
        hit.category = head;
        hit.rank = sib.children[0].innerText;
        res.push(hit);
      }
      else{ //Go to the next table which contains the next rank.
        sib = sib.nextElementSibling;
      };
    };
  };
  return res;
}
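One likely cause of the hang, offered as a sketch and not a confirmed diagnosis: when a table contains the target string, the if branch records the hit but never advances sib, so the while condition stays true forever. The version below moves to the next sibling on every pass. The function name findRanks and the headers parameter are hypothetical, introduced so the logic can be exercised without a live page; in the browser you would pass document.getElementsByTagName('h3').

```javascript
// Sketch: collect { category, rank } objects for every rank table that mentions str.
// `headers` is assumed to be an array-like of h3 elements (hypothetical parameter).
function findRanks(headers, str) {
  const res = [];
  for (const h of Array.from(headers)) {
    // Skip h3 elements that are not category headers.
    if (!h.id.includes('category-')) continue;
    let sib = h.nextElementSibling;
    while (sib !== null && sib.tagName === 'TABLE') {
      if (sib.innerText.includes(str)) {
        res.push({ category: h.innerText, rank: sib.children[0].innerText });
      }
      // Always advance, whether or not this table matched; otherwise the loop never ends.
      sib = sib.nextElementSibling;
    }
  }
  return res;
}
```

Because the advance happens outside the if/else, a string that appears in more than one rank table of the same category produces one hit per table.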

Related

How to perform fast search on JSON file?

I have a json file that contains many objects and options.
Each of these kinds:
{"item": "name", "itemId": 78, "data": "Some data", ..., "option": number or string}
There are about 10,000 objects in the file.
When part of an item value ("ame", "nam", "na", etc.) is entered, it should display all the objects, and their options, that match this part.
RegExp is the only thing that comes to mind, but with a 200 MB+ file the search takes a long time (2+ seconds).
That's how I'm getting the object right now:
// No 'g' flag: a global regex keeps lastIndex between test() calls and can skip matches
let reg = new RegExp(enteredName, 'i'), //enteredName for example "nam"
  data = await fetch("myFile.json"),
  jsonData = await data.json();
let results = jsonData.filter(jsonObj => {
  let item = jsonObj.item,
    itemId = String(jsonObj.itemId);
  return reg.test(item) || reg.test(itemId);
});
But that option is too slow for me.
What method is faster to perform such search using js?
Looking up items by item number should be easy enough by creating a hash table, which others have already suggested. The big problem here is searching for items by name. You could burn a ton of RAM by creating a tree, but I'm going to go out on a limb and guess that you're not necessarily looking for raw lookup speed. Instead, I'm assuming that you just want something that'll update a list on-the-fly as you type, without actually interrupting your typing, is that correct?
To that end, what you need is a search function that won't lock-up the main thread, allowing the DOM to be updated between returned results. Interval timers are one way to tackle this, as they can be set up to iterate through large, time-consuming volumes of data while allowing for other functions (such as DOM updates) to be executed between each iteration.
I've created a Fiddle that does just that:
// Create a big array containing items with names generated randomly for testing purposes
let jsonData = [];
for (let i = 0; i < 10000; i++) {
  jsonData.push({ item: Math.random().toString(36).substring(2, 15) + Math.random().toString(36).substring(2, 15) });
}
// Now on to the actual search part
let returnLimit = 1000; // Maximum number of results to return
let intervalItr = null; // A handle used for iterating through the array with an interval timer
function nameInput (e) {
document.getElementById('output').innerHTML = '';
if (intervalItr) clearInterval(intervalItr); // If we were iterating through a previous search, stop it.
if (e.value.length > 0) search(e.value);
}
let reg, idx, found
function search (enteredName) {
  reg = new RegExp(enteredName, 'i');
  idx = 0;
  found = 0; // Number of matches displayed so far
  // Kick off the search by creating an interval that'll call searchNext() with a 0ms delay.
  // This will prevent the search function from locking the main thread while it's working,
  // allowing the DOM to be updated as you type
  intervalItr = setInterval(searchNext, 0);
}
function searchNext() {
  // Stop at the end of the data, or once returnLimit matches have been displayed
  if (idx >= jsonData.length || found >= returnLimit) {
    clearInterval(intervalItr);
    return;
  }
  let item = jsonData[idx].item;
  if (reg.test(item)) {
    document.getElementById('output').innerHTML += '<br>' + item;
    found++;
  }
  idx++;
}
https://jsfiddle.net/FlimFlamboyant/we4r36tp/26/
Note that this could also be handled with a WebWorker, but I'm not sure it's strictly necessary.
Additionally, this could be further optimized by utilizing a secondary array that is filled as the search takes place. When you enter an additional character and a new search is started, the new search could begin with this secondary array, switching to the original if it runs out of data.
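A simplified sketch of that secondary-array idea, under stated assumptions: it uses plain case-insensitive substring matching instead of a RegExp (with arbitrary patterns, a longer query does not always match a subset of the previous results), and it runs synchronously instead of on an interval. The function name createNarrowingSearch is hypothetical.

```javascript
// Sketch: reuse the previous result set when the new query extends the old one.
// Assumes plain case-insensitive substring matching on the `item` field.
function createNarrowingSearch(items) {
  let lastQuery = '';
  let lastResults = items;
  return function search(query) {
    const q = query.toLowerCase();
    // If the new query contains the old one, every new match is already in lastResults.
    const pool = lastQuery !== '' && q.includes(lastQuery) ? lastResults : items;
    lastResults = pool.filter(obj => obj.item.toLowerCase().includes(q));
    lastQuery = q;
    return lastResults;
  };
}
```

Typing "na" and then "nam" scans the full array only once; deleting a character or starting a different query falls back to the original array.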

How to access the currently iterated array value in a loop?

Current attempt using an array of objects with properties:
The objective:
I want to automatically fill out emails on behalf of ~30 different people. The form fields are always consistent, but the values I'm filling in will change on an email-to-email basis. I'm using TagUI to do this.
My old code (last code box below) successfully filled out each form by assigning each line in the .csv to a separate array BUT failed to iterate through the values of a specific column within the .csv. Please see the text above the last code box below for further explanation.
Now I'm starting again, this time aiming to create an array of objects (representing each email being sent) with properties (representing each field to be filled within each email).
Here's what I've got so far:
// Using TagUI for browser automation
// https://github.com/kelaberetiv/TagUI
website-to-automate-URL-here.com
// Set up the arrays to be used later
emails = []
// Load in the 'db.csv' file
// Link to .csv: https://docs.google.com/spreadsheets/d/16iF7F-8eh2eE6kDiye0GVlmOCjADQjlVE9W1KH0Y8MM/edit?usp=sharing
csv_file = 'db.csv'
load '+csv_file+' to csv_lines
// Split the string variable "lines" into an array of individual lines
lines = csv_lines.split('\n')
// Split the individual lines up into individual properties
for (i=0; i < lines.length; i++)
{
emails[i].name = properties[1].trim()
emails[i].recipients = properties[2].trim()
properties = lines[i].split(',')
}
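Note that the loop above reads properties before it is assigned, and writes to emails[i] before that element exists. A minimal sketch in plain JavaScript (not TagUI-specific, and not tested inside TagUI) of building one object per CSV line is shown here. The function name parseEmails is hypothetical, column positions 1 and 2 mirror the snippet above, and the naive split(',') assumes no field contains a quoted comma.

```javascript
// Sketch: turn CSV text into an array of email objects.
// Column indexes 1 and 2 mirror the snippet above; adjust them to the real file.
function parseEmails(csvText) {
  const emails = [];
  const lines = csvText.split('\n');
  for (let i = 0; i < lines.length; i++) {
    if (lines[i].trim() === '') continue; // skip blank lines, e.g. a trailing newline
    const properties = lines[i].split(','); // split first, then read the fields
    emails.push({
      name: (properties[1] || '').trim(),
      recipients: (properties[2] || '').trim()
    });
  }
  return emails;
}
```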
EDIT: The below code has been put on the back burner as I attempt to solve this another way. Solutions are still welcome.
I'm having trouble triggering my for loop (the last one in the code below).
My goal for the for loop in question, in plain English, is as follows: Repeat the below code X times, where X is determined by the current iteration of the total_images array.
So if the total_images array looks like this:
[Total Images, 2, 3, 4, 5]
And the parent for loop is on its third iteration, then this for loop should dictate that the following code is executed 4 times.
I'm using TagUI (https://github.com/kelaberetiv/TagUI), so there may be some non-JavaScript code here.
https://www.website.com
wait 3s
// Setting up all the arrays that the .csv will load
array_campaign = []
array_subject = []
array_teaser = []
array_recipients = []
array_exclude = []
array_img1src = []
array_img1alt = []
array_img1url = []
array_img2src = []
array_img2alt = []
array_img2url = []
array_img3src = []
array_img3alt = []
array_img3url = []
array_img4src = []
array_img4alt = []
array_img4url = []
total_images = []
// Load in the 'db.csv' file
csv_file = 'db.csv'
load '+csv_file+' to lines
// Chop up the .csv data into individual pieces
// NOTE: Make sure the [#] corresponds to .csv column
// Reminder: Numbers start at 0
array_lines = lines.split('\n')
for (n=0; n<array_lines.length; n++)
{
items = array_lines[n].split(',')
array_campaign[n] = items[1].trim()
array_recipients[n] = items[2].trim()
array_exclude[n] = items[3].trim()
array_subject[n] = items[4].trim()
array_teaser[n] = items[5].trim()
array_img1src[n] = items[6].trim()
array_img1alt[n] = items[7].trim()
array_img1url[n] = items[8].trim()
array_img2src[n] = items[9].trim()
array_img2alt[n] = items[10].trim()
array_img2url[n] = items[11].trim()
array_img3src[n] = items[12].trim()
array_img3alt[n] = items[13].trim()
array_img3url[n] = items[14].trim()
array_img4src[n] = items[15].trim()
array_img4alt[n] = items[16].trim()
array_img4url[n] = items[17].trim()
total_images[n] = items[18].trim()
}
for (i=1; i < array_campaign.length; i++)
{
echo "This is a campaign entry."
wait 2s
}
// This is the problem loop that's being skipped
blocks = total_images[i]
for (image_blocks=0; image_blocks < blocks; image_blocks++)
{
hover vis1_3.png
click visClone.png
}
This is the most coding I've ever done, so if you could point me in the right direction and explain like I'm a beginner it would be much appreciated.
It looks like the only reason your last loop is skipped is that total_images[i], which is used in the loop condition, is undefined. I believe the value of i at that point equals array_campaign.length from the previous loop, which is outside the array's range.
Here is some example code:
const arr = [0, 1, 2];
const length = arr.length; // the length is 3, but the last index of this array is 2 (count from 0)
for (i = 0; i < length; i++) {
console.log(i);
}
// output:
// 0
// 1
// 2
console.log(i); // i is now 3, which equals arr.length; that is what made the loop above exit
console.log(arr[i]); // => undefined, because the last valid index is 2; reading an index that does not exist returns undefined
"run the following code X times, where X is determined by the value of total_images[i]" - so, if I understand your question correctly, you can use nested loops to do this:
for (i=1; i < array_campaign.length; i++)
{
echo "This is a campaign entry."
wait 2s
// nested loop, the number of iteration is based on the value i of outside loop
for (j=0; j < total_images[i]; j++) {
// do something here
}
}
My old code should have worked. I opened up the .csv file in notepad and noticed there were SEVERAL extra commas interfering with the last column of data, throwing everything for a loop.
Did some searching and apparently this is a common thing. Beware!
I created TagUI but I don't check Stack Overflow for user queries and issues. Try raising an issue directly on GitHub next time - https://github.com/kelaberetiv/TagUI/issues
Looks like you found the solution! Yes, if the CSV file contains an inconsistent number of columns (some rows with more columns than others), it will lead to errors when you work on it from your automation script. It looks like the extra commas created extra columns and broke your code.
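A quick way to catch this before running the automation is to compare each row's column count against the first row. This is a sketch in plain JavaScript; the function name findRaggedRows is hypothetical, and the naive split(',') does not handle quoted fields that contain commas.

```javascript
// Sketch: report rows whose column count differs from the first (header) row.
// Uses a naive split(','), so quoted fields containing commas are not handled.
function findRaggedRows(csvText) {
  const rows = csvText.split('\n').filter(line => line.trim() !== '');
  if (rows.length === 0) return [];
  const expected = rows[0].split(',').length;
  const bad = [];
  rows.forEach((line, index) => {
    const count = line.split(',').length;
    // Row numbers are 1-based and counted after blank lines are dropped.
    if (count !== expected) bad.push({ row: index + 1, count, expected });
  });
  return bad;
}
```

An empty result means every non-blank row has the same number of columns as the first one.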

Where/how to set increment on loop and update array only when condition found?

I'm writing a function to iterate through folders on Google Drive and match files (Google Sheets) with a variable string (a date specified on a table cell). When a matching file is found, the containing folder name string is assigned to folderItems[0] and the file URL to folderItems[1]. Once all matching files within a folder have been found, the next folder is iterated through in the same way. These "folderItems" arrays are stored in a parent array "folderItemsContainer" to create a 2 dimensional array which can then be output to a spreadsheet using .setValues().
I'm having trouble figuring out how or where to put the increment variable so that it will increment only when a filename match is made but not stop a loop when a match isn't found.
I've tried various structures including interchanging for and while loops and inserting if statements where seemingly useful. I've looked at a few different answers on Stackoverflow that come close to making sense but none seem to be applicable here. I'm fairly new to programming. I've got different variations of code I've tried, but this is where I'm up to so far:
function GetFolderData() {
var currentSheet = SpreadsheetApp.getActiveSpreadsheet();
var currentYearPeriod = currentSheet.getRange("C1!A4").getValue();
// Logger.log(currentYearPeriod);
//Get folder objects from parent folder
var parentFolderId = "17F0fcBH0jmxsk2sUq723AuIY0E2G_u0m";
var parentFolder = DriveApp.getFolderById(parentFolderId);
//Get folders from specified parent folder
var StaffFolders = parentFolder.getFolders();
//Create container array
var folderItemsContainer = [];
//Create Item Array
var folderItems = [];
var i = 0;
//For every staff folder, regardless of content, do:
while (StaffFolders.hasNext()) {
//Get current folder object
currentFolder = StaffFolders.next();
//Get files in current folder object as FileIterator
FolderFiles = currentFolder.getFiles();
//If folder empty, outer while loop will iterate
if (FolderFiles !== null) {
//Iterate through existing files
while (FolderFiles.hasNext()) {
//Get file object sequentially
file = FolderFiles.next();
//When filename matches currentYearPeriod, store URL next to name in folderItems
for (i = 0; file.getName().indexOf(currentYearPeriod) !== -1; i++) {
folderItems[i] = [];
folderItems[i][0] = currentFolder.getName();
// Logger.log(currentFolder.getName());
folderItems[i][1] = file.getUrl();
folderItemsContainer[i] = folderItems[i];
}
}
}
}
return folderItemsContainer;
}
function InsertFolderData() {
var sheet = SpreadsheetApp.getActiveSheet();
sheet.getRange("B4:Z1000").clearContent();
FolderData = GetFolderData();
Logger.log(FolderData);
sheet
.getRange(4, 2, FolderData.length, FolderData[0].length)
.setValues(FolderData);
Logger.log(FolderData);
/* var str = "";
for (var i = 0; i < FolderData.length; i++) {
str += FolderData[i] + "\r\n";
}
str = str.substr(0);
var ui = SpreadsheetApp.getUi();
ui.alert("DATA IMPORTED: " + "\r\n" + str);
*/
}
With the above code, I'm not entirely sure why but I seem to be getting stuck in an endless loop and the script doesn't finish. What I'm hoping to achieve is the folderItemsContainer array being populated with arrays containing file information (parent folder name[0] and file URL[1]) for files that match the currentYearPeriod variable. I've been refactoring the code and I've learned a lot but unfortunately not how to solve the problem.
You should check the differences between the loop types; you are not fully understanding them. If you want to execute the instructions inside a loop until a certain condition is met, in this case file.getName().indexOf(currentYearPeriod) !== -1, you should use a while loop. The bug is that the condition never changes, because file never changes while the for loop is running. That's why you have an infinite loop. My solution:
// new variable: index of the next free row in the result
var cnt = 0;
while (StaffFolders.hasNext()) {
  currentFolder = StaffFolders.next();
  FolderFiles = currentFolder.getFiles();
  if (FolderFiles !== null) {
    while (FolderFiles.hasNext()) {
      file = FolderFiles.next();
      // Your for loop started here; a plain if is enough to test each file once
      if (file.getName().indexOf(currentYearPeriod) !== -1) {
        folderItems[cnt] = [];
        folderItems[cnt][0] = currentFolder.getName();
        folderItems[cnt][1] = file.getUrl();
        folderItemsContainer[cnt] = folderItems[cnt];
        // increment only when a file name matched
        cnt++;
      }
    }
  }
  // cnt is not reset between folders, so matches from earlier folders are not overwritten
}
Differences between loops:
for loops
They are used when you know how many iterations will be needed. For example, if you want to print all the characters of a string in the console:
const str = "hello";
for(let i = 0; i < str.length; i++) {
console.log(str.charAt(i));
}
let i = 0 is the starting point.
i < str.length is the stop condition. If you have to use a symbol that is not one of <, <=, >, >=, you shouldn't be using a for loop.
i++ is how you move toward the stop condition.
while loops
If you don't know when your loop is going to end, whether it will have 5 iterations, 100 iterations or 0 iterations, you should use a while loop.
function getFirstL(str) {
  let i = 0;
  while (i < str.length && str.charAt(i) !== "l") {
    i++;
  }
  return i; // index of the first "l", or str.length if there is none
}
About your for loop: here is the syntax of a for loop.
for (statement 1; statement 2; statement 3) {
// code block to be executed
}
Statement 1 is executed (one time) before the execution of the code block.
Statement 2 defines the condition for executing the code block.
Statement 3 is executed (every time) after the code block has been executed.
Your for loop doesn't define a condition that lets it exit, such as a minimum or maximum value. Something like
i<file.getName().indexOf(currentYearPeriod);
So it will check from 0 to that value.

Array logic match for list

I have a quick links widget with different types of links/menus that the user can choose from. Exactly four menu options are shown at a time, no more and no fewer.
In code I first extract all the menu options, which come in the form [1,2,3...], where each number corresponds to the row in a list where the menu option is stored.
The menu options the user chose are returned in the same way, as an array like [2,3,8,9], with each number indicating which row to get from the list.
Example:
All menu/widgets
Travel
Hotel
Car
Bus
Airplane
Holiday
This will return an array [1,2,3,4,5,6]
And if I choose to save hotel, bus, airplane and holiday then my user settings will return [2,4,5,6].
Problem:
It works until a widget that the user has saved is deleted from the list; then the widget only shows three menus/links. I want the widget to always show four links, so if one is missing I need to fill the gap in the array with another link. It would be good, but is not required, to use a link that is marked as default (always the first four in the list) when one is missing. I have set up logic for that but it's not working.
Code:
public async getUserWidgets(): Promise<Widget[]> {
return new Promise<Widget[]>(async(resolve, error) => {
let allWidgets = await this.getAllWidgets(); // Returns an array of all links [1,2,4...]
let userRepository = new UserProfileRepository(this.absoluteWebUrl);
let userSettings = await userRepository.getUserExtensionValues(this.context); //contains the user saved widgets ex [2,3,6,7]
var result:Widget[] = [];
// if the user has no settings, or less than 4 saved links
if (userSettings == null || userSettings.QuickLinksWidgets == null || userSettings.QuickLinksWidgets.length < 4) {
result = allWidgets.filter((w) => {return w.defaultWidget;}).slice(0,4); //default widget but not really needed.
}
else {
var ids = userSettings.QuickLinksWidgets;
for (let i = 0; i < 4; i++) {
let id = '' + ids[i];
let w = allWidgets.filter((e) => { return e.id == id;});
if (w.length == 0) {
continue;
}
result.push(w[0]);
}
};
resolve(result);
}); }
From what you described, it sounds like maybe you're not updating properly (is getUserWidgets called when userSettings.QuickLinksWidgets changes?). First check to make sure it's called as you expect.
If getUserWidgets is being called properly, try adding defaults to their settings until you have 4 links total. Right now you only use default links when they have fewer than 4 in their settings.
For example:
// take up to 4 user links and make sure we don't exceed the length of the array
for (let i = 0; i < 4 && i < userSettings.QuickLinksWidgets.length; i++) {
  // get the id stored in the user's settings (the array holds ids, not widget objects)
  let widgetId = '' + userSettings.QuickLinksWidgets[i]
  // find the widget with a matching id; skip it if it was deleted from the list
  let widget = allWidgets.find(w => '' + w.id === widgetId)
  if (widget) result.push(widget)
}
// if there's not already 4 links, add more to our list
let j = 0
while (result.length < 4 && j < allWidgets.length) {
  // make sure we didn't include this widget already
  if (!result.includes(allWidgets[j])) {
    // add the new widget to the results
    result.push(allWidgets[j])
  }
  j++
}
resolve(result)
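If default links should be preferred when filling the gaps, as the question suggests, the same idea can be written as a standalone function. This is a sketch under assumptions: the function name pickWidgets is hypothetical, the id and defaultWidget properties follow the question's code, and ids are compared as strings because the question builds the id with '' + ids[i].

```javascript
// Sketch: always return `count` widgets, filling gaps left by deleted ones.
// `id` and `defaultWidget` follow the question's code; the function name is hypothetical.
function pickWidgets(allWidgets, savedIds, count = 4) {
  const result = [];
  const add = (w) => {
    if (w && result.length < count && !result.includes(w)) result.push(w);
  };
  // 1. Saved widgets that still exist (ids compared as strings).
  for (const id of savedIds || []) {
    add(allWidgets.find(w => String(w.id) === String(id)));
  }
  // 2. Default widgets, then 3. any remaining widget, until the list is full.
  allWidgets.filter(w => w.defaultWidget).forEach(add);
  allWidgets.forEach(add);
  return result;
}
```

Inside getUserWidgets this would be called with allWidgets and userSettings.QuickLinksWidgets; a missing settings array simply yields the defaults.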

Return the list of tags that every selected (e.g. where selected is true) candidate shares

Given an array of candidates, who follow this type:
type Candidate = {
name: string,
tags: string[],
selected: boolean,
};
How do I return the list of tags that every selected candidate shares? Tags are only included in the result if every single selected candidate has that tag. The order of tags in the output doesn't matter.
function sharedTags(candidates) {
// code goes here
return [];
}
module.exports = sharedTags;
Let's break down the steps of what you're looking to accomplish.
First, you'll need to identify which candidates are selected. You can do this with the standard Array.prototype.filter(predicateFn). Your predicate will look something along the lines of function(c) { return c.selected === true; }.
It's worth saying that if you can structure your code such that the candidates parameter is always supplied with an array of selected candidates, that first step will be unnecessary. As with most things in software, it depends on the assumptions you're willing to make.
Next, you'll compute the intersection of tags across the collection of candidates. This involves writing a helper function that takes two candidates and returns the tags they have in common:
var sharedTags = function(c1, c2) {
return c1.tags.filter(function(t) {
return c2.tags.indexOf(t) >= 0;
});
};
EX:
var c1 = { name: 'Jo', tags: ["red", "blue", "green"]};
var c2 = { name: 'Bill', tags: ["yellow", "blue", "purple"]};
var shared = sharedTags(c1, c2); // ["blue"]
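To extend the pairwise helper to every selected candidate, one option is to fold the intersection across the filtered list. This is a sketch, not the answer author's code: the name sharedTagsOfSelected is hypothetical (chosen to avoid clashing with the two-argument helper above), and returning an empty array when nothing is selected is an assumption.

```javascript
// Sketch: tags shared by every selected candidate.
function sharedTagsOfSelected(candidates) {
  const selected = candidates.filter(c => c.selected === true);
  if (selected.length === 0) return []; // assumption: nothing selected means nothing shared
  // Start from the first candidate's tags and keep only tags present in each later one.
  return selected.slice(1).reduce(
    (common, c) => common.filter(t => c.tags.indexOf(t) >= 0),
    Array.from(new Set(selected[0].tags)) // de-duplicate the starting list
  );
}
```

With Jo and Bill from the example both marked as selected, the only tag left after the fold is "blue".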
A fast approach is to have a Map of tags which has key as tag name and value as tag frequency.
Loop through each candidate's tags and build a set of that candidate's distinct tags. Distinct tags matter because the same tag may be repeated several times for a single candidate, which would otherwise inflate its count and break the check of whether a tag appears for all candidates.
Have a global map variable which keeps track of tags and their frequencies.
Now, in the end, iterate over all tags in the map and check whether each tag's frequency
equals the number of selected candidates. If yes, it occurred in every selected candidate; otherwise it didn't.
This way, you visit each candidate's tags only once.
Below is an implementation to demonstrate the same.
CODE:
function sharedTags(candidates) {
  var results = [];
  var map = {};
  var selected_count = 0; // only selected candidates take part in the comparison
  // collect each tag's frequency in a map
  for(let i=0;i<candidates.length;++i){
    let each_candidate = candidates[i];
    if(each_candidate['selected'] === true){
      selected_count++;
      // collect all unique tags for this candidate in iteration
      let unique_tags = {};
      for(let j=0;j<each_candidate['tags'].length;++j){
        let tag = each_candidate['tags'][j];
        unique_tags[tag] = unique_tags[tag] === undefined ? 1 : unique_tags[tag];
      }
      // merge its presence with the global "map" variable
      let this_candidate_tags = Object.keys(unique_tags);
      for(let k=0;k<this_candidate_tags.length;++k){
        if(map[this_candidate_tags[k]] === undefined){
          map[this_candidate_tags[k]] = 1;
        }else{
          map[this_candidate_tags[k]] += 1;
        }
      }
    }
  }
  // now check the frequency of each tag. If it equals the number of selected candidates, it appeared in every one of them.
  var tags = Object.keys(map);
  for(let each_tag in tags){
    if(map[tags[each_tag]] === selected_count){
      results.push(tags[each_tag]);
    }
  }
  return results;
}
