Scraping/cleaning an HTM page into a collection using JavaScript - javascript

I have a file stored as a BLOB in an Oracle database. It's a .htm file that I want to scrape/clean to get several collections out of it.
The problem is that there are no id or class attributes. What is more, there is a lot of junk markup, and a single file contains several different tables. I want to make a collection from each table.
I thought about manipulating it manually as a string, but the file is quite big. Could you give me some ideas on how to do this?
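
One possible direction, as a rough sketch only: once the BLOB has been read into a string, each table can be pulled out positionally, without relying on id or class attributes. The function names (extractTables, cleanCell, decodeEntities) are made up for illustration, and the regular expressions assume non-nested tables whose tr/td/th tags are explicitly closed, which old .htm exports do not always guarantee.

```javascript
// Naive sketch: turn every <table> in an HTML string into an array of rows,
// where each row is an array of cleaned cell strings.
// Assumptions: tables are not nested and row/cell tags are explicitly closed.

function decodeEntities(text) {
  return text
    .replace(/&nbsp;/g, ' ')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, '&'); // decoded last to avoid double decoding
}

function cleanCell(cellHtml) {
  // Drop inline markup (font, b, span, ...) and collapse whitespace.
  const withoutTags = cellHtml.replace(/<[^>]*>/g, ' ');
  return decodeEntities(withoutTags).replace(/\s+/g, ' ').trim();
}

function extractTables(html) {
  const tables = [];
  const tableRe = /<table\b[^>]*>([\s\S]*?)<\/table>/gi;
  for (const tableMatch of html.matchAll(tableRe)) {
    const rows = [];
    const rowRe = /<tr\b[^>]*>([\s\S]*?)<\/tr>/gi;
    for (const rowMatch of tableMatch[1].matchAll(rowRe)) {
      const cellRe = /<t[dh]\b[^>]*>([\s\S]*?)<\/t[dh]>/gi;
      const cells = [...rowMatch[1].matchAll(cellRe)].map((m) => cleanCell(m[1]));
      if (cells.length > 0) rows.push(cells);
    }
    tables.push(rows);
  }
  return tables;
}

// Hypothetical input; the real string would come from the Oracle BLOB.
const sampleHtml =
  '<font><table border=1><tr><th>Name</th><th>Qty</th></tr>' +
  '<tr><td><b>Apple</b></td><td>&nbsp;3 </td></tr></table></font>';
console.log(extractTables(sampleHtml));
```

Each table is then addressable by its position in the returned array. If the code runs in a browser, parsing with DOMParser and querySelectorAll('table') is likely to cope better with messy markup than regular expressions.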

Related

Can I give a file a unique id?

I'm working with Node.js.
I have a webpage that accepts an Excel upload, reads the file, and creates products based on the information it contains. My reading algorithm only works with a certain format; the Excel file can't be written in just any way. So I offer an Excel file in that format for the user to download, fill in (it only allows certain cells to be modified), and then upload.
The problem is distinguishing the right file (the one I offer) from any other random file without wasting time reading it. Is there any way to give a file an id or something that certifies that it is actually the file that is supposed to be uploaded? At first I thought about the name, but anyone can upload a random file with the same name, so I don't see how that would work. Can you help me? Thank you

How to add tags to files on disk?

I'm creating an app for cataloging images on disk, and I'm looking for a way to store the tags I add to these images. So far I have found only two solutions:
storing the information in a DB. Since I use Electron (https://electronjs.org), I think it would be better to keep an array with this information
using tags in EXIF.
My questions are:
Which solution will be faster (there can be thousands of images on disk)?
Are there other ways to add tags to images and store them?

How to scrape a JavaScript table in R?

I want to scrape a table from the Citi Bike site: https://s3.amazonaws.com/tripdata/index.html
My goal is to get the URLs of the zip files all at once, instead of manually typing all the dates and downloading one file at a time. Since the webpage is updated monthly, I want to be able to get all the up-to-date data files every time I run the function.
I first tried the rvest and XML packages and then realized that the webpage contains both the HTML and a table that's generated by a JavaScript function. That's where the problem was.
I'd really appreciate any help; please let me know if I can provide further information.
If I go to https://s3.amazonaws.com/tripdata/ (just the root, no index.html), I get a simple XML file. The relevant element is Key (uppercase K, lowercase e and y) if you want to parse the XML, but I would just search the plain text. That is: ignore the XML structure, treat it like a simple text file, take every string between <Key> and </Key> as a filename, and prefix https://s3.amazonaws.com/tripdata/ to it to get the URL.
The first entry seems to contain everything together (170 MB), so you might be OK with that alone.
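
The question asks for R, but the plain-text approach described above is language-independent. A minimal sketch in JavaScript follows; the function name extractKeyUrls and the sample listing are made up for illustration, and the optional .zip filter is an addition that is not part of the answer.

```javascript
// Sketch: treat the bucket listing as plain text, take every string between
// <Key> and </Key>, and prefix the base URL to build a download link.

function extractKeyUrls(listingText, baseUrl, extension = '') {
  const urls = [];
  const keyRe = /<Key>([^<]*)<\/Key>/g; // element name is case-sensitive
  for (const match of listingText.matchAll(keyRe)) {
    const key = match[1];
    // Optional filter (an assumption, not in the answer): keep e.g. ".zip" only.
    if (extension && !key.toLowerCase().endsWith(extension)) continue;
    urls.push(baseUrl + key);
  }
  return urls;
}

// Hypothetical listing; the real text would be downloaded from the root URL.
const sampleListing =
  '<ListBucketResult><Contents><Key>example-2020-01.zip</Key></Contents>' +
  '<Contents><Key>index.html</Key></Contents></ListBucketResult>';
console.log(extractKeyUrls(sampleListing, 'https://s3.amazonaws.com/tripdata/', '.zip'));
```

In Node.js 18 or later the listing text could be obtained with the built-in fetch before being passed to the function.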

Is it faster to load data as json or from a html file?

The title phrases it badly, so here's a longer description:
I have an application that exports data in HTML format (500 rows, 20 columns).
It looks terrible with lots of useless columns.
I want to use something like DataTables to make a more usable table, i.e. paging/sorting/filtering/hiding columns.
The option I'm trying first is to insert the table from the exported html file using the .load() function from jquery. Then I loop through the table deleting/modifying columns.
This seems very slow (I suspect my looping and searching) so I'm looking for improvements.
One idea is to pre-convert my exported HTML file to JSON (using Notepad++ macros or something like that) and then build the table that I want from that JSON file.
Any opinions on whether I can expect a large performance boost, or potential problems to look out for?
Many thanks / Colm
JSON should be faster: once it's loaded, it's ready to go without all of the text parsing you would need to do with a text file. There are lots of other jQuery add-ons available to make it easy for you once the data is in JSON.
I think this is not about which loads data faster but which solution is better for your problem. DataTables is very flexible and you can load from different sources. Take a look at "Data Sources" and "Server side processing" in the examples: http://datatables.net/examples/
DataTables mostly uses the JSON format. To process your data, you need to find the best approach: convert your exported HTML file beforehand, process the file with JavaScript to convert the data (jQuery can help you here), etc.
This page gives some real-world examples of loading data as JSON vs. data in an HTML table. Fairly conclusive: see the post from sd_zuo in July 2010, which reports a fourfold increase in speed from loading JSON and then just building the table that you want to display.
Granted, the page deals specifically with the slowness of innerHTML in IE8, but I think I'll give JSON a go and see how it compares across a couple of browsers.
P.S. This page gives good advice on fast creation of HTML using raw JavaScript and then only using jQuery to insert one full row at a time.
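
To illustrate the idea of building markup with raw JavaScript from JSON, here is a minimal sketch. The function names (buildTableHtml, escapeHtml) and the sample rows and column names are assumptions for illustration; the only technique taken from the discussion is building strings first and handing the result to the DOM afterwards. Unlike the row-at-a-time advice, this variant builds the whole table as one string.

```javascript
// Sketch: build table markup from already-parsed JSON rows using plain string
// concatenation, keeping only the columns you want to show.

function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

function buildTableHtml(rows, columns) {
  const parts = ['<table><thead><tr>'];
  for (const column of columns) {
    parts.push('<th>', escapeHtml(column), '</th>');
  }
  parts.push('</tr></thead><tbody>');
  for (const row of rows) {
    parts.push('<tr>');
    for (const column of columns) {
      // Missing values become empty cells; unlisted columns are skipped.
      parts.push('<td>', escapeHtml(row[column] ?? ''), '</td>');
    }
    parts.push('</tr>');
  }
  parts.push('</tbody></table>');
  return parts.join('');
}

// Hypothetical data standing in for the converted export.
const sampleRows = [
  { name: 'Widget', qty: 2, internalCode: 'zz9' },
  { name: 'Gadget', qty: 5, internalCode: 'yy8' },
];
console.log(buildTableHtml(sampleRows, ['name', 'qty']));
```

In the page itself the resulting string could be inserted in one call, for example with jQuery's .html() on a container element, before initialising a table plugin on it.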

How can JavaScript access a local database in txt format?

I am a newbie working on a 100% JS prototype. It consists of 3 documents: an HTML page full of XML tags, a small dictionary in text file format, and a JS file with jQuery.
The JS needs to parse the XML tags (no problem here) and look in the mini-dictionary list for available translations.
What is the best way to implement the mini-dictionary list (no more than 50,000 records)? Is there a way to load the list into an in-memory database and access it from JS? What is the usual path to take in this case? What is the simplest and most machine-independent way to do this?
Any pointers on where I should research are greatly appreciated.
I would suggest encoding the mini-dictionary in the JSON data format, and then using AJAX to get that file and parse it. But then you risk that someone will just copy the whole dictionary and steal your work.
That is, if you are not using a server-side language like PHP. If you are, then just store everything in a database and request only specific words with AJAX.
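
A minimal sketch of the client-side JSON variant, assuming the dictionary file is a flat JSON object that maps each word to its translation. That layout, the function names (loadDictionary, translate), and the sample entries are all assumptions for illustration.

```javascript
// Sketch: parse the dictionary JSON once and keep it in a Map, which serves
// as a simple in-memory lookup table.

function loadDictionary(jsonText) {
  const entries = JSON.parse(jsonText);
  const dictionary = new Map();
  for (const [word, translation] of Object.entries(entries)) {
    // Normalise keys so that lookups are case-insensitive.
    dictionary.set(word.trim().toLowerCase(), translation);
  }
  return dictionary;
}

function translate(dictionary, word) {
  // Returns null when no translation is available.
  return dictionary.get(word.trim().toLowerCase()) ?? null;
}

// Hypothetical content; the real text would arrive through an AJAX request.
const sampleJson = '{"house": "casa", "Dog": "perro"}';
const sampleDictionary = loadDictionary(sampleJson);
console.log(translate(sampleDictionary, 'dog'));
```

With a list of this size, a Map built once at page load should be enough; a separate in-memory database is probably not needed.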
