Scraping javascript-generated tables in R with -relenium-

Recently, starting from this very useful question (Scraping html tables into R data frames using the XML package) I successfully used the XML package for scraping HTML tables.
Now I am trying to extract Javascript-generated tables from here:
Tables 2013 (then click on "Sortare alfabetică").
I am interested in exporting the first 9 columns of the data on, say, pages 1 to 10.
I went through different related questions on the forum, including some where it was suggested not to use R to perform such a task and a similar question that, however, did not prove directly useful for my problem. As suggested, I have been reading the information about the relenium package (see the developers' toy example here).
According to the structure of the website where the tables of interest are located, I have to click a first button to access the tables sorted by name and then click a second button to navigate through all the next tables I want to export. In practice I have to:
1. click the Sortare alfabetică button
2. copy the first 9 columns of the 10-row table
3. click the right button (called Pagina urmatoare)
and repeat steps 2-3 ten times.
By using the Chrome inspector (Tools > Developer tools) I found the following paths for the two buttons:
/html/body/table/tbody/tr[1]/td/table[2]/tbody/tr[2]/td/table/tbody/tr/td[2]/a
/html/body/table/tbody/tr[1]/td/table[3]/tbody/tr/td[4]/table
I started with this code in order to accomplish step 1:
library(relenium)
firefox <- firefoxClass$new()
firefox$get("http://bacalaureat.edu.ro/2013/rapoarte/rezultate/index.html")
buttonElement <- firefox$findElementByXPath("/html/body/table/tbody/tr[1]/td/table[2]/tbody/tr[2]/td/table/tbody/tr/td[2]/a")
buttonElement$click()
But I get the following error:
[1] "Error: NoSuchElementException"
[1] "Thrown by Firefox$findElement(By by) and webElement$findElement(By by)."
I don't know whether this is an easier way to proceed, but an alternative to point 3 for navigating through pag.1-pag.10 would be to work with the dropdown menu of the webpage.
The paths for pag.1 and pag.2 are:
//*[@id="PageNavigator"]/option[1]
//*[@id="PageNavigator"]/option[2]
Focusing on scraping data from a single table
Clearly, even before being able to navigate through the 10 tables with the buttons or the dropdown menu, the crucial problem is to extract the data contained in each table.
With this code I tried to focus on extracting the first 9 columns of the first table only (then the code could be iterated through "http://bacalaureat.edu.ro/.../page_2.html", "http://bacalaureat.edu.ro/.../page_3.html", etc.):
library(XML)
library(relenium)
firefox <- firefoxClass$new()
firefox$get("http://bacalaureat.edu.ro/2013/rapoarte/rezultate/alfabetic/page_1.html")
doc <- htmlParse(firefox$getPageSource())
tables <- readHTMLTable(doc, stringsAsFactors=FALSE)
But the output is extremely messy. I don't know if this makes sense, and I am only guessing, but it might be necessary to go deeper into the JavaScript code and extract the information in the table cell by cell.
For instance, for the first individual, the 9 variable values of interest are characterized by the following XPaths:
//*[#id="mainTable"]/tbody/tr[3]/td[1]
//*[#id="mainTable"]/tbody/tr[3]/td[2]
//*[#id="mainTable"]/tbody/tr[3]/td[3]/a
//*[#id="mainTable"]/tbody/tr[3]/td[4]/a
//*[#id="mainTable"]/tbody/tr[3]/td[5]/a
//*[#id="mainTable"]/tbody/tr[3]/td[6]/a
//*[#id="mainTable"]/tbody/tr[3]/td[7]
//*[#id="mainTable"]/tbody/tr[3]/td[8]
//*[#id="mainTable"]/tbody/tr[3]/td[9]
Using these paths, the entries of each cell could be saved into an R vector and the procedure could be repeated for all the other individual-specific rows of data. Is it sensible to proceed like this? If so, how would you proceed with -relenium-?
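The nine paths above follow a regular pattern, so they could be generated rather than typed out. Below is a minimal sketch in JavaScript, only to illustrate the pattern (the same strings could be built in R with paste0). The helper name cellXPaths is hypothetical, and it assumes that data row i sits in tr[i + 2], as it does for the first individual, and that columns 3 to 6 always wrap their text in an a element; neither assumption has been checked against the other rows of the page.

```javascript
// Hypothetical helper: builds the XPaths for the first 9 cells of one data row.
// Assumptions (not verified against the page): data row i (1-based) is tr[i + 2],
// and columns 3-6 hold their text inside an <a> element.
function cellXPaths(dataRow) {
  const linkedColumns = new Set([3, 4, 5, 6]);
  const paths = [];
  for (let col = 1; col <= 9; col++) {
    const suffix = linkedColumns.has(col) ? "/a" : "";
    paths.push(`//*[@id="mainTable"]/tbody/tr[${dataRow + 2}]/td[${col}]${suffix}`);
  }
  return paths;
}

// Print the nine paths for the first individual.
console.log(cellXPaths(1).join("\n"));
```

Each generated string could then be passed to an XPath lookup such as the findElementByXPath call used earlier, one cell at a time.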

Related

Tableau Server: How to select/include specific sheets from a workbook (for exporting to PDF) via script / console / API?

In Tableau Server I'm trying to find a way to select specific sheets from a workbook (for exporting to PDF) via script / browser dev console / tabcmd / URL / API. Here's a real / live example, which can be played around with: https://help.tableau.com/current/api/js_api/en-us/JavaScriptAPI/js_api_sample_export_to_pdf.htm
So far I have tried, inter alia, various JavaScript commands to get the right element and trigger a "click", but to no avail. The underlying *.js files reveal a lot, but I have had no luck getting the proper handle.
(I managed to give the impression that the sheets were selected, i.e., the checkboxes were shown and ticked via my JavaScript, but that didn't budge the actual selection, e.g. the "2 of 6" in the screenshot above.)
Related links:
https://help.tableau.com/current/server/en-us/tabcmd_cmd.htm#id7cb8d032-a4ff-43da-9990-15bdfe64bcd0
Export a specific sheet from Tableau using tabcmd
TabPy (Tableau) how to automate producing pdfs from a workbook
https://community.tableau.com/s/question/0D54T00001HuPQSSA3/how-to-set-default-options-for-showexportpdfdialog-?_ga=2.20070778.390974142.1655121966-1741765833.1651754073&_gac=1.204654884.1654237999.EAIaIQobChMI9oaVytSQ-AIVtRSLCh0gNw-xEAEYASAAEgLApvD_BwE (Q: Ideally, I would also like to pre-select certain sheets for the user as well. -- A: I don't think it is possible using the javascript api.)
https://help.tableau.com/current/api/js_api/en-us/JavaScriptAPI/js_api_ref.htm
https://www.tableau.com/developer/tools

One approach (there might be much better ones...), which works here:
const bla = document.getElementsByClassName('thumbnail-wrapper_f1gupj42')
bla[1].click()
That toggles the checkbox on the "College" sheet above :-)

A solution which worked (not only worked in the API but also in my use case) is recording a puppeteer script and using the keyboard only (e.g. Tab, Enter, etc) to select the individual pages; recording mouse clicks did not work that well, but who needs a mouse anyway?

Using Sheets to search through the entire sheet and pull up results in a column

I have a bunch of sheets I use for personal work. They have a bunch of different car parts under different tabs in each sheet.
I created a master sheet that uses IMPORTRANGE to pull from all of them and shows links to them in a master tab to jump to each tab separately. (Doors, hoods, lightbulbs, door trims, roof racks, box of crown vic parts; it's all over the place.)
Is there a way for a user to search some text in a cell and have the column next to it populate it with results with matching words and ultimately, link to the tab and row that the item exists on?
ex: I have a sheet called "Search" and I type in A2 "crown vic". Then it will populate B2:B100 with any items found in the entire sheet with the words "crown vic" in it, and C2:C100 will have a link to the tabbed sheet that it is in.
Link to a test page to get my idea across:
https://docs.google.com/spreadsheets/d/1WrImPYHhhMOOZbf-AE2sNs82xL-u8wWOW4IFel6RGcY/edit#gid=999756632
I believe it would be better for me to use JavaScript and HTML to create a web database for all this info instead of using Sheets, since it's limited in some of the ways I want to use it. Ultimately I want it to be easier to find all the data by bringing things up with a search.

I think I have a basic answer for you. However, your sample sheet is not very much like your final sheet, with all of the tabs you've mentioned, so I can only demonstrate the working concept here. With a really representative sample sheet, I could flesh out more details on how the links to multiple possible tabs would need to be built. See my sample tab, GK-Help Search, added to your sample sheet.
First, we do a query, in column B, to return the list of matching car parts.
=QUERY('Car Parts'!A2:A,"select A where upper(A) contains '"&UPPER(A2)&"' ",0)
For your production sheet, this would require all of the data tabs to be concatenated, in a vertically stacked array. Eg.
=QUERY({ 'doors'!A2:A;
'hoods'!A2:A;
'lights'!A2:A },"select...")
Then the main formula is this, in C2:
=HYPERLINK("https://docs.google.com/spreadsheets/d/1WrImPYHhhMOOZbf-AE2sNs82xL-u8wWOW4IFel6RGcY/edit#gid=" &
"0" & "&range=" &
SUBSTITUTE(REGEXEXTRACT(CELL("address",INDIRECT("'Car Parts'!A" & MATCH(B2,'Car Parts'!A$2:A,0)+1)),"(\$.*)"),"$",""),
CELL("address",INDIRECT("'Car Parts'!A" & MATCH(B2,'Car Parts'!A$2:A,0)+1)))
This does a lookup of each car part, to get the address of the cell. Then a dynamic HYPERLINK is built, using the URL of the spreadsheet and the address of the cell. The element that is not fleshed out in my demo is how to build the "gid" part of the URL, since you did not provide multiple sample tabs. But this is very possible.
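If you do end up moving this to JavaScript, as you mention, the link-building part of the formula could look roughly like the sketch below. The function name buildCellLink is made up for illustration. It mirrors what the formula does: take the address returned for the matching cell (something like 'Car Parts'!$A$5), keep the part from the first "$" on, drop the "$" signs, and append the result to the spreadsheet URL together with the gid of the tab.

```javascript
// Hypothetical helper mirroring the HYPERLINK formula above.
// cellAddress is expected in the form returned by CELL("address", ...),
// e.g. "'Car Parts'!$A$5"; gid identifies the tab (0 in the demo sheet).
function buildCellLink(spreadsheetId, gid, cellAddress) {
  // Same idea as REGEXEXTRACT(..., "(\$.*)"): keep everything from the first "$".
  const match = cellAddress.match(/\$.*/);
  // Same idea as SUBSTITUTE(..., "$", ""): turn "$A$5" into "A5".
  const range = (match ? match[0] : cellAddress).replace(/\$/g, "");
  return `https://docs.google.com/spreadsheets/d/${spreadsheetId}/edit#gid=${gid}&range=${range}`;
}

console.log(
  buildCellLink("1WrImPYHhhMOOZbf-AE2sNs82xL-u8wWOW4IFel6RGcY", 0, "'Car Parts'!$A$5")
);
```

With several tabs, the only extra piece would be looking up the right gid for the tab each match came from.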
Here is a previous answer on doing that last part.
how-to-insert-hyperlink-to-a-cell-in-google-sheet-using-formula
My sample sheet looks like the following:

Is there a way to change every occurrence of a character that exists on a page of code?

I have been working on some code that consists of the same block repeated about ten times.
Essentially, there are 10 occurrences of one main block of code.
for example,
myTable1, myTable2, myTable3 .......
Instead of copying and pasting my code 10 times over and changing every single instance of "1" to "2", and then every instance of "2" to "3", and so forth
I would like to build a small program that I can load my code into and set the code to switch the numbers for me
So the first instance of loading the code into the new program would change all "1's" to "2's", and then all "2's" to "3's" and so forth, ultimately making my job tremendously easier.
Hope this makes sense.
EDIT
So here is a brief description of my code. I'll try my best to explain it. It is to be used for my everyday job. I'm a project manager for a roofing company, so I am expected to log daily job notes after every day of work is complete. I've created an HTML form which collects all of the information necessary for my daily job notes. The form's function then transfers all of the information to a textarea input box, where I can copy the text and ultimately paste it into my employer's daily job notes thread. The first occurrence of transferring a job as described above is quite simple for me. The hard part is repeating that for several more jobs in one day. I average about 5 to 6 jobs per day. Therefore (as far as I know) every input and checkbox in my HTML form needs a different id, so that I can associate the values of the inputs and checkboxes with the corresponding job note within the overall text that I will be adding to my employer's daily notes thread. I know about arrays and loops, but I will humbly say I definitely don't understand either of them completely. I know what I'm doing now is bad practice and very time consuming, but I haven't been able to find a workaround for this particular issue. So I'm at the point where I'm thinking that, if I need to copy and edit this form and all of its code 7 or so times, I would rather make a program that I can feed the code to and have it spit the code back out after changing every instance of 1 to 2, 2 to 3, and so forth.

There are many ways to fix this; one of them is as follows.
Step 1 : Create common.js file
Step 2 : Create one function which will do this logic for you. eg. CheckTheSequence(arrayObject){}
Step 3 : Update the below code as per your requirement...
// OBJECTS
var obj = { one: 1,two: 2,three: 3,four: 4,five: 5};
$.each(obj, function (index, value) {
console.log(value);
});
// Outputs: 1 2 3 4 5
Step 4: Call this common.js file into your main root page (index.html)
This way you can access this method through out the application.

Actually, after thinking about this more, I realized Word has a 'find and replace' feature (Ctrl+H).
I can paste my code into Word, press Ctrl+H, enter what to replace and what to replace it with, and then click 'Replace All'.
Thanks for everybody's help
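For anyone who would rather script this than go through Word, here is a rough sketch of the small program described in the question. The names renumber and makeCopies are made up for the example, and it assumes the number to change always sits at the end of an id or variable name (as in myTable1). It deliberately leaves standalone numbers such as 10 alone, which a blind "replace every 1 with 2" would also change.

```javascript
// Hypothetical helper: renames identifiers that end in `from` (e.g. myTable1)
// so that they end in `to` (e.g. myTable2). Standalone numbers are not touched.
function renumber(code, from, to) {
  // Letters/underscores, then the number, then a word boundary.
  const pattern = new RegExp(`\\b([A-Za-z_]+)${from}\\b`, "g");
  return code.replace(pattern, (match, name) => name + to);
}

// Produces copies 1..count of a block that was written for suffix 1.
function makeCopies(template, count) {
  const copies = [];
  for (let n = 1; n <= count; n++) {
    copies.push(renumber(template, 1, n));
  }
  return copies;
}

const template = 'var myTable1 = document.getElementById("myTable1"); var rows = 10;';
console.log(makeCopies(template, 3).join("\n"));
```

Names with the number in the middle (for example job1Notes) are not matched by this pattern, so the regular expression would need adjusting for those.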

Eliminating a certain part of a logfile before loading to table in Pentaho using Javascript

I have several logfiles that need to be loaded into a table. However, the first three lines have to be omitted automatically without me having to remove them.
I have used a text file input where all the data from the log file has been put into a single column under the name Field 1.
These are the first four lines (the first four rows):
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-10-25 22:30:02
#Fields: date time s-computername s-ip....
As you can see above, the first four lines have to be omitted and I have to load data after '#Fields:'. Is there a way to do it using Javascript in Pentaho?

Yes, you can use the Modified Java Script Value step and write code there to get the result you need.
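The check itself is short. Here is a sketch in plain JavaScript, runnable outside Pentaho, with made-up function names and a placeholder standing in for a real data row. Inside the step you would apply the same test to the incoming field and, as far as I know, skip the row by setting trans_Status to SKIP_TRANSFORMATION; treat that part as an assumption to verify in your Pentaho version.

```javascript
// True for header lines such as "#Software: ..." or "#Fields: ...".
function isCommentLine(line) {
  return line.charAt(0) === "#";
}

// Keeps only the lines that are not "#" header lines.
function dropCommentLines(lines) {
  return lines.filter((line) => !isCommentLine(line));
}

const sample = [
  "#Software: Microsoft Internet Information Services 7.5",
  "#Version: 1.0",
  "#Date: 2013-10-25 22:30:02",
  "#Fields: date time s-computername s-ip....",
  "data row 1 (placeholder)",
];

console.log(dropCommentLines(sample));
```

Because it tests the first character rather than counting lines, this also works if a header block shows up again further down a log file.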

It seems you need to filter out all lines that start with #, as being comments.
You can achieve much faster results by using a Filter Rows step, with a condition "not starts with". The Javascript step is too slow, I recommend avoiding it if there's any other option available.

Why don't you simply start reading the text file from the 5th line? In the Content tab of the Text file input step you can set the number of header lines. Enter 4 there.

Getting ImpactStory JavaScript widgets rendered to a table cell in an R Shiny app

[The question was yesterday also posted to https://groups.google.com/d/msg/shiny-discuss/1UmzvZJwM54/gdMmX7QQ-eIJ with no answers so far]
I've been working on a Shiny app that shows both as a table and as a rCharts (NVD3) chart some raw altmetric data of a few journal articles http://spark.rstudio.com/ttso/alt/
My code with sample data https://gist.github.com/tts/6990101
So far so good, but now I've run into difficulties when trying to include in the table the JavaScript widgets explained in http://impactstory.org/api-docs The widgets as such work just fine, like in this HTML example http://users.tkk.fi/sonkkila/alt/arts.html - but in my Shiny table, the last (IS) column where they should emerge, is empty.
When I leave the table unsanitized, I can see that the HTML code is there - but when sanitized, it vanishes. The renderTable code in server.R: https://gist.github.com/tts/6990101#file-server-r-L69-L85
From the HTML source I can also see that the ImpactStory script is at the top of the page as it should.
All pointers are welcome!
Disclaimer: although I've been playing with R and Shiny for some time now, I consider myself a JavaScript/CSS newbie really so I may be missing something obvious here.
EDIT: Just to clarify: of course the sanitized HTML code vanishes at that point, because there is no textual element value. I wonder if a) there are clashes between the different JS scripts, maybe they'd need to come in different order (have to ask ImpactStory about this) or b) there are some problems in how xtable() generates the output or c) Shiny does not know how to communicate with the impactstory.js script or d) Shiny does not see the script at all. Should I build a custom output component?
EDIT2: AFAIK the problem lies in the fact that the reactively outputted table does not see the JavaScript. Tested: when I manually add non-dynamic HTML code to ui.R with all the attributes that the ImpactStory JS needs to know about an article, the widgets are rendered OK. Also, if I add the script element in server.R to the data frame that is outputted, the widgets are rendered, but the rendering never stops, resulting in a loop :) I suppose what I need is similar to what is asked here, in question nr 2

Well, I've found one solution that isn't pretty but will do for the moment.
renderTable uses xtable to render the HTML table, and you can define your own sanitization function while at it, see http://cran.r-project.org/web/packages/xtable/vignettes/xtableGallery.pdf (p. 7)
Here, I simply replace the column header 'Widget' with a string that defines the script:
output$table <- renderTable({
...
}, include.rownames = FALSE, sanitize.text.function = function(s) sub("Widget", "<script type=\"text/javascript\" src=\"http://impactstory.org/embed/v1/impactstory.js\"></script>", s))
For a reason I don't yet understand, there is a set of double widgets at first.
Thanks for your patience.
