How to scrape javascript table in R? - javascript

I want to scrape a table from the Citibike trip data page: https://s3.amazonaws.com/tripdata/index.html
My goal is to get the URLs of all the zip files at once, instead of manually typing every date and downloading the files one at a time. Since the webpage is updated monthly, I want to be able to get all the up-to-date data files every time I run the function.
I first tried the rvest and XML packages and then realized that the webpage contains both plain HTML and a table that is generated by a JavaScript function. That is where the problem lies.
I would really appreciate any help; please let me know if I can provide further information.

If I go to https://s3.amazonaws.com/tripdata/ (just the root, no index.html), I get a simple XML file. The relevant element is Key (uppercase K, lowercase e and y) if you want to parse the XML, but I would just search the plain text. That is: ignore the XML structure, treat the response as a simple text file, take every string between <Key> and </Key> as a filename, and prefix it with https://s3.amazonaws.com/tripdata/ to get the download URL.
The first entry seems to be everything combined (170 MB), so you might be OK with that file alone.
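The plain-text search described above can be sketched as follows. It is written in JavaScript to match the rest of this page; the same regular expression idea should carry over to R. The function name and the sample listing are made up for illustration, and the sketch assumes the keys contain no XML entities such as &amp;.

```javascript
// Base URL taken from the answer above.
const BASE_URL = "https://s3.amazonaws.com/tripdata/";

// Treat the XML listing as plain text and collect every string
// found between <Key> and </Key>.
function extractZipUrls(listingText) {
  const urls = [];
  const pattern = /<Key>([^<]*)<\/Key>/g;
  let match;
  while ((match = pattern.exec(listingText)) !== null) {
    const key = match[1];
    // Keep only zip archives; the listing may contain other files too.
    if (key.toLowerCase().endsWith(".zip")) {
      // encodeURI leaves "/" intact but escapes spaces and similar characters.
      urls.push(BASE_URL + encodeURI(key));
    }
  }
  return urls;
}

// Usage with a hypothetical two-entry listing (file names are invented):
const sampleListing =
  "<ListBucketResult>" +
  "<Contents><Key>example-2015-01.zip</Key></Contents>" +
  "<Contents><Key>index.html</Key></Contents>" +
  "</ListBucketResult>";
console.log(extractZipUrls(sampleListing));
```

Because the listing is fetched fresh on every run, newly published monthly files would be picked up without changing the code.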

Related

Convert a javascript object to CSV file

I have a script which reads a file line by line and generates an object with some fields from certain lines. Now I want to write that generated object to a CSV file.
How can I do the following:
1. From the script itself generate a CSV file
2. Give initial fields (headers) to the file
3. Update that file line by line (add to the file one line at a time)
A clarification: I don't know the size of the CSV in advance, so the file must be able to grow dynamically.
Thanks in advance.
Looking at what you have said:
1. From the script itself generate a CSV file
Have a look at node-csv-generate, which lets you generate CSV strings easily.
2. Give initial fields (headers) to the file & 3. Update that file line by line (add to the file one line at a time)
Check out the node-csv-generate stream functionality to write line by line (i.e. the initial headers first).
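For comparison, here is a minimal sketch of the header-then-append pattern using only Node's built-in fs module rather than the library mentioned above. The function names, file name and field names are assumptions made for illustration.

```javascript
const fs = require("fs");

// Quote a value only when it contains a comma, a double quote, or a line break.
function toCsvField(value) {
  const text = value === null || value === undefined ? "" : String(value);
  return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// Join the values into one CSV line, terminated by a newline.
function toCsvLine(values) {
  return values.map(toCsvField).join(",") + "\n";
}

// Create (or truncate) the file and write the header row.
function startCsv(path, headers) {
  fs.writeFileSync(path, toCsvLine(headers));
}

// Append one record; fields are written in header order.
function appendRecord(path, headers, record) {
  fs.appendFileSync(path, toCsvLine(headers.map((h) => record[h])));
}

// Usage (hypothetical file and field names):
//   const headers = ["id", "name"];
//   startCsv("out.csv", headers);
//   appendRecord("out.csv", headers, { id: 1, name: "Smith, John" });
```

Since each record is appended as it is produced, the final size of the file does not need to be known in advance.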
Now since you said you need to run it locally, I would recommend Rhino if just using JS but if node.js is required then check out Rhinodo. These will let you run the program locally on the JVM basically (you could call the JS from within Java if you wanted to).
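To export the CSV file there are plenty of examples online, this SO thread being one, e.g.:
// Prefix the CSV text with a data URI header so the browser treats it as CSV content
var encodedUri = encodeURI("data:text/csv;charset=utf-8," + csvContent);
window.open(encodedUri);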
Where csvContent is the complete string of your csv. I am not sure how supported this is on Rhinodo, but I'm pretty sure it'll all work on Rhino.
If this is intended to be a purely desktop-based application, and everything needs to be local and it is meant to be widely used, I would look at using Java rather than JS (or your preferred language; Python or C# might be nicer depending on what you are used to :-) ). That way you have a much cleaner interaction with the OS and a lot more control.
I hope this helps!

Is it possible to write protect old data of JSON Files and only enable appending?

I need to store some date-stamped data in a JSON file. It is sensor output. Each day the same JSON file is updated with additional data. Is it possible to put some write protection on the data that is already there, so that new lines can only be added and the existing content cannot be tampered with manually?
I suspect that creating checksums after every update may help, but I am not sure how to implement it. If part of the JSON file is editable, then the checksum is probably editable too.
Is there any other way to protect the history?
Write protection normally only exists for complete files. So you could revoke write permissions for the file, but then also appending isn't possible anymore.
To ensure that no tampering has taken place, the standard way is to cryptographically sign the data. In principle, you can do it like this:
1) Take the contents of the file.
2) Add a secret key (any arbitrary string or random characters will do; the longer the better) to this string.
3) Create a cryptographic checksum (SHA-256 hash or similar).
4) Append this hash to the file (with newlines before and after).
You can do this again every time you append something to the file. Because nobody except you knows your secret key, nobody except you will be able to produce the correct hash for the part of the file above it.
This will not prevent tampering but it will be detectable.
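A minimal sketch of that scheme for a line-based log, using Node's built-in crypto module. Note that it swaps the plain "contents plus secret" hash for HMAC-SHA256, which is the usual construction for a keyed hash. The "#sig " line prefix and the function names are assumptions made for this example.

```javascript
const crypto = require("crypto");

// Marker for signature lines (an assumption of this sketch; data lines
// must not start with it).
const SIG_PREFIX = "#sig ";

// Keyed hash of the given text, as a hex string.
function hmacHex(text, secret) {
  return crypto.createHmac("sha256", secret).update(text, "utf8").digest("hex");
}

// Append one data line, followed by a signature line that covers
// everything above it (including earlier signature lines).
function appendSigned(log, entry, secret) {
  const body = log + entry + "\n";
  return body + SIG_PREFIX + hmacHex(body, secret) + "\n";
}

// Recompute every signature; also reject logs whose last line is not a
// signature, so unsigned trailing data is detected.
function verifyLog(log, secret) {
  let prefix = "";
  let endsWithSignature = log === "";
  for (const line of log.split("\n").slice(0, -1)) {
    if (line.startsWith(SIG_PREFIX)) {
      if (line.slice(SIG_PREFIX.length) !== hmacHex(prefix, secret)) return false;
      endsWithSignature = true;
    } else {
      endsWithSignature = false;
    }
    prefix += line + "\n";
  }
  return endsWithSignature;
}
```

One limitation of this sketch: cutting the log off right after an older signature line still verifies, so truncation would have to be caught some other way, for example by keeping a copy of the latest signature elsewhere.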
For plain text files this is relatively easy to do with shell utilities like sha256sum. But you have a JSON structure in your file. That is a more complex case, because the position in the file no longer correlates with the age of the data (unlike a text file that is only ever appended to).
To still achieve what you want, you need age information for the data. Do you have this? If you provide the JSON structure as @Rohit asked, we might be able to give more detailed advice.

Convert HTML to markdown using pagedown?

I have successfully set up pagedown on a site I am working on, but I have run into an issue when trying to edit HTML that has already been created. I would like to take an HTML chunk that was created using pagedown, convert it back to Markdown, and place it in the editor.
I looked around but didn't see this covered in the documentation. I looked in the Markdown.Converter.js file to see if there was a makeMarkdown function to match the makeHtml function, but I didn't find anything.
How do I go about converting HTML back to Markdown for editing?
As far as I know, no, there is no existing solution that will convert HTML to Markdown. A few problems would need to be solved before that can be done, for example how to represent floats, text alignment, font sizes, etc. in Markdown. That leaves you with two options:
1) Store the Markdown in the database, then convert it to HTML on the fly. This makes the text easy to edit and reduces the amount of data you store in the database.
2) Store both the Markdown and the HTML in the database. This uses more disk space, but fewer resources are needed to retrieve the HTML because you no longer have to convert Markdown to HTML on the fly.
Both options are viable, each with its own advantages. I usually use the first option so that I don't have duplicate data in the database, but the second option is likely easier to use because the system that displays the content doesn't need a Markdown processor; it just pulls the generated HTML directly from the database.
I'll likely move to the second option in future projects because it makes the data more portable. If you were to access the database from a different server-side language, you wouldn't need a Markdown processor written in that language to get the HTML.

How can JavaScript access a local database in txt format?

I am a newbie working on a 100% JS prototype. It consists of three files: an HTML page full of XML tags, a small dictionary in text file format, and a JS file with jQuery.
The JS needs to parse the XML tags (no problem here) and look up available translations in the mini-dictionary list.
What is the best way to implement the mini-dictionary list (no more than 50,000 records)? Is there a way to load the list into an in-memory database and access it from JS? What is the usual path to take in this case? What is the simplest and most machine-independent way to do this?
Any pointers on where I should research are greatly appreciated.
I would suggest encoding the mini-dictionary in the JSON data format, then using AJAX to fetch that file and parse it. The risk is that someone can simply copy the whole dictionary and steal your work.
That applies if you are not using a server-side language such as PHP. If you are using one, just store everything in a database and request only specific words with AJAX.
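A minimal sketch of the first suggestion: the whole dictionary is loaded once and kept in memory as a Map. The file name dictionary.json and the file shape (one object mapping each source word to its translation) are assumptions, since the question does not show the dictionary format.

```javascript
// Parse the JSON text of the dictionary file into a Map for fast lookups.
// Assumed file shape: {"source word": "translation", ...}
function buildDictionary(jsonText) {
  return new Map(Object.entries(JSON.parse(jsonText)));
}

// Return the translation, or null when the word is not in the dictionary.
function translate(dictionary, word) {
  const hit = dictionary.get(word);
  return hit === undefined ? null : hit;
}

// In the page, jQuery could fetch and parse the file in one step, e.g.:
//   $.getJSON("dictionary.json", function (data) {   // hypothetical file name
//     var dictionary = new Map(Object.entries(data));
//     // ... look up the words found in the XML tags ...
//   });
```

With a list of this size, a single in-memory Map is probably enough, and no separate database layer is needed on the client.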

Including hidden data in an HTML page for javascript to process

I produce a complex HTML summary report from data in a database which could be a summary of maybe 200,000 rows in the database. The user can click a link to request an Excel version.
When they do, a JS script extracts the key components of the report and stuffs them into a form in a hidden iframe. This form submits to a server-side script which generates the Excel version of the report (without the graphics etc.).
As the calculations for the report are complex and "costly" it makes sense not to run them again to create the Excel version as all the data is on the page already. Also the user may have customised the report once it is loaded and I can use JS to pass those preferences to the form as well so the Excel doc reflects them too.
The way I am doing this is to include the following for each component of the report that transfers to a row in the Excel version. I've hijacked an HTML tag that isn't otherwise used.
<code id="xl_row_211865_2_x" class="rowlabel">Musicals}{40%}{28.6%}{6</code>
The code element above is a summary of the row below in the HTML report which becomes one row in the Excel doc and includes the label and various data elements. There may be a thousand or more such elements in one report.
As the data contains text I've had to use something like }{ as a field separator as this is unlikely to occur in any real text in the report. I have code set to display:none in the CSS.
When the user wants an Excel version of their report the JS code searches the HTML for any <code> elements and puts their className and innerHTML in the form. The className indicates how to format the row in Excel and the data is then put into adjacent cells on the Excel row.
The HTML report shows one percentage base (they can toggle between them) but the user preference when requesting an Excel version was to include both.
Is there a better way of doing this?
(As this is a part of a complex web app no user is going to turn CSS off or lack javascript or they wouldn't get this far)
ADDED: I can't use HTML5 as the users are corporates often on older browsers like IE6
Use the new data- attributes:
http://www.javascriptkit.com/dhtmltutors/customattributes.shtml
<div data-row="[[&quot;Musicals&quot;,40,28.6,6], ...]">
The div could be the TD tag, the TR tag, or any other relevant tag already related to the row, and &quot; is the escaped ".
That hides the data from view and also means that standard solutions for processing the data will become available.
For encoding the data I would suggest JSON, as it is also a standard that is easy to use.
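As a rough sketch of how such an attribute could be produced, the helper below JSON-encodes the row data and escapes it for use inside a double-quoted attribute. The function names are made up for this example, and a server-side script in another language would follow the same two steps.

```javascript
// Escape text for use inside a double-quoted HTML attribute value.
function escapeAttribute(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Build the opening tag with the row data JSON-encoded in data-row.
function rowDivTag(rows) {
  return '<div data-row="' + escapeAttribute(JSON.stringify(rows)) + '">';
}

// In the browser the data can be read back with getAttribute, which
// returns the unescaped text:
//   var rows = JSON.parse(element.getAttribute("data-row"));
```

Reading the value through getAttribute rather than a newer dataset API keeps this usable in older browsers, although JSON.parse itself would need a fallback script there.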
Standard solutions:
1) Use a JavaScript data block:
<script>
var mydata = {
    'Musicals': ['6', '40%', '28.6%'],
    "That's Life": ['2', '13.2%', '0.5%']
    // ...etc....
};
</script>
2) Use element attributes:
(see http://ejohn.org/blog/html-5-data-attributes/ for more info)
<div class='my_data_row' data-name='Musicals' data-col1='6' data-col2='40%' data-col3='28.6%'>
...and then use Javascript to load the attributes as required.
This second option would be used when the data is related to the element in question. You wouldn't normally want to use this for data that's going to be used elsewhere; I would say that in your case, the simple Javascript data block would be a far better solution.
If you do go with the data attributes as per the second option, note the use of the 'data-' prefix on the attributes. This is an HTML5 specification that keeps user-defined attributes separate from normal HTML ones. See the linked page for more info on that.
You could try the new HTML5 feature localStorage instead of using hidden HTML fields, provided you are sure that your users only use the latest modern browsers.
Either way, an improvement on your code would be to store the data in JSON format.
Instead of
Musicals}{40%}{28.6%}{6
you would use something like
{
    "label": "Musicals",
    "percentage1": 40,
    "percentage2": 28.6,
    "otherLabel": 6
}
This way you can build JavaScript objects just by evaluating (eval) or parsing (JSON.parse) the innerHTML of the hidden element, which is faster than interpreting your own curly-bracket protocol. Prefer JSON.parse where it is available, since eval will execute whatever code the string contains.
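To make the comparison concrete, here is a small sketch that turns one row in the existing "}{" format into the JSON shape shown above. The field names are the ones from the example and are only placeholders; the function name is invented. Older browsers such as IE6 have no native JSON object, so a script such as json2.js would be needed there.

```javascript
// Split a "}{"-separated row into the fields used in the JSON example.
// parseFloat drops the trailing "%" from values such as "40%".
function legacyRowToObject(text) {
  const parts = text.split("}{");
  return {
    label: parts[0],
    percentage1: parseFloat(parts[1]),
    percentage2: parseFloat(parts[2]),
    otherLabel: Number(parts[3]),
  };
}

// The JSON text that would go into the hidden element:
const rowJson = JSON.stringify(legacyRowToObject("Musicals}{40%}{28.6%}{6"));

// Reading it back is a single call, with no custom separator handling:
const row = JSON.parse(rowJson);
console.log(row.label, row.percentage1, row.percentage2, row.otherLabel);
```

A side benefit of JSON is that the label text no longer has to avoid a special separator sequence.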
A better approach, in my view, would be to save these results to a temporary XML file on the server and show the contents in the browser. When the user requests the Excel version, you only need to read that temporary XML file.
Take a look at LINQ to XML, because its fluent programming style would help you read the XML file in a few lines and then create the Excel file.
Another solution would be to serialize your object collection to JSON and deserialize it with the DataContractJsonSerializer. That would make the temporary file smaller than with the XML approach.
