I am currently working on a project to find empty classrooms in our school in real time. For that purpose, I need to extract the substitutions published on our school page (https://ssnovohradska.edupage.org/substitution/?), since there might be additional changes.
But when I try to extract the HTML source code and parse it with bs4, it cannot find the divs (class: "section print-nobreak") that contain the substitution text. When I took a look at the page source code (Ctrl+U), I found that there is only JavaScript that prints it all directly.
Is there any way to extract the html after the javascript output has been already rendered?
Thanks for help!
Parsing HTML is unfortunately necessary to solve your problem. But I will explain how to find ways to avoid it in your future projects (though not on this website).
You've correctly noticed that the text is created by JavaScript code running on the page. This usually indicates that the data is either loaded from another resource (an XHR/fetch call getting a response from an API) or stored as JSON/JS inside the website's code. (Or it is generated by an algorithm, but that is unlikely to be the case on such websites.)
The website actually uses both methods (the initial render gets data stored inside the website's code, but when you switch dates on the calendar it makes AJAX requests). You can see this by searching for ReactDOM.render(React.createElement( in the code. They're providing an HTML string to the createElement call, so I would suggest looking into the AJAX way of doing things.
Now, to check where the resource is located, all you need to do is open Developer Tools in your favorite browser (usually Ctrl+Shift+I) and navigate to the Network tab. With the Network tab open, you need to cause the website to load external data, for example by pressing a date on the "calendar bar".
Here you will notice many external requests, but we're actually only looking for XHR calls. Click on the XHR button next to the "Filter" text field. That should result in only one request being shown:
Unfortunately for us, the response only contains HTML. Also, the API calls are protected: they require a PHP session ID and some sort of token (__gsh), or they fail. So, going back to step 1, it seems our only solution is to use regular expressions to find the text between "report_html":"<div class and </div></div></div> in the source code, if you're interested in today's date only. If you want the contents for tomorrow or any other date, you will need to either fetch the page, save the cookies, find the token to supply to the request, and then make that request; or use something like puppeteer or pyppeteer (since you've mentioned BS4) and load the webpage in that. If you aren't fetching the data that often, you should be fine overall.
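A minimal sketch of the regex approach in Python; the URL and the "report_html" boundaries come from the discussion above, but the exact escaping of the embedded HTML string is an assumption, so adjust after inspecting the real page source:

import re
import requests

# Fetch the raw page source (the same thing Ctrl+U shows).
url = "https://ssnovohradska.edupage.org/substitution/?"
html = requests.get(url).text

# Grab everything between "report_html":"<div class ... </div></div></div>.
match = re.search(r'"report_html":"(<div class.*?</div></div></div>)', html, re.DOTALL)
if match:
    # The payload is a JSON-escaped string; undo the most common escapes.
    # (Assumption: \" and \/ are the only escapes that matter here.)
    report_html = match.group(1).replace('\\"', '"').replace("\\/", "/")
    print(report_html[:200])
else:
    print("report_html not found - the page structure may have changed")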
Related
For example, there is a table like the one at this link: https://leetcode.com/contest/weekly-contest-309/ranking
How can I access the database it is coming from? Let's say I want to get the whole ranking table in one place.
I tried reading the HTML file, but that didn't get me the data.
One browser extension I tried scrapes only the visible table.
How can we achieve this?
There is no simple answer here - it really depends on how it is implemented on the server side. As rv.kvetch pointed out, you can get part of the result from this URL:
https://leetcode.com/contest/api/ranking/weekly-contest-309/?pagination=1&region=global
You can notice the pagination query parameter here; indeed, you can access the second page, third page, and so on. Sometimes a parameter like page_size is implemented on the server, but that doesn't look like the case here.
So to access the full table you probably need to iterate over those pages and glue the results together.
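A minimal sketch of that loop in Python, assuming the pagination parameter behaves as described; the "total_rank" key is an assumption about the JSON layout, so check an actual response first:

import requests

base = "https://leetcode.com/contest/api/ranking/weekly-contest-309/"
rows = []
page = 1
while True:
    resp = requests.get(base, params={"pagination": page, "region": "global"})
    resp.raise_for_status()
    chunk = resp.json().get("total_rank", [])  # assumed key, verify in a real response
    if not chunk:
        break  # an empty page means we've run out of results
    rows.extend(chunk)
    page += 1
print(f"Fetched {len(rows)} rows in total")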
EDIT: How do you find such a URL for a given page?
Open your favorite browser, run the web inspector (usually right click - Inspect) and go to the Network tab, where you can find all requests sent during page rendering.
What I need
I need to retrieve data from this source. Let's assume I must use only PowerBI for this.
What I did so far
If I use the basic Web source option, then the query is basically just HTML parsing, with which I can easily get the data found in the HTML scope of the page. For example:
Source: (screenshot omitted)
The steps I'm following through the Web source option: (screenshot omitted)
Query: (screenshot omitted)
(to simplify the example, assume we don't need the dates)
You can download that example .pbix file here.
The problem
The problem is that I need more data, which can't be accessed through the HTML preview. For example, let's imagine I need to retrieve the data from January 2010 to April 2020. Those kinds of queries can only be done via this button located on the webpage (which exports the requested data to an Excel workbook):
The idea is to get this process automated, so going to the source and exporting the Excel file manually every time is not an option.
Inspecting the element, I realized that what it does is execute a JavaScript function (screenshot omitted).
The question
As a PowerBI/PowerQuery noob, I wonder: is there any way I can get that data directly with PowerBI (maybe by calling the JS function somehow)? If so, how?
Thank you in advance.
The solution in my case was to use URL parameters to retrieve the data without parsing the HTML table.
❌ Original URL I was using:
https://gee.bccr.fi.cr/indicadoreseconomicos/Cuadros/frmVerCatCuadro.aspx?idioma=1&CodCuadro=%20400
✔️ New URL for the query, adding some parameters:
https://gee.bccr.fi.cr/indicadoreseconomicos/Cuadros/frmVerCatCuadro.aspx?idioma=1&CodCuadro=%20400&Idioma=1&FecInicial=2010/01/01&FecFinal=2040/01/01&Filtro=0&Exportar=True
This procedure only works in this case, because obviously the parameters will not be the same on other web pages.
However, I post this answer to keep the main idea for those in a similar situation: first try the appropriate URL parameters to get the data in a different format. Of course, you must first know which parameters are available, which is a limitation.
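For readers outside PowerBI, the same trick is easy to replay from a script. A minimal sketch in Python, assuming the parameters behave as shown above (Exportar=True makes the server return the data as a file rather than the HTML page); the parameter meanings are inferred from the URL:

import requests

params = {
    "idioma": 1,
    "CodCuadro": " 400",        # %20400 in the original URL decodes to " 400"
    "FecInicial": "2010/01/01",
    "FecFinal": "2040/01/01",
    "Filtro": 0,
    "Exportar": "True",
}
resp = requests.get(
    "https://gee.bccr.fi.cr/indicadoreseconomicos/Cuadros/frmVerCatCuadro.aspx",
    params=params,
)
resp.raise_for_status()
with open("export.xls", "wb") as f:  # assumed Excel payload, per the export button
    f.write(resp.content)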
I'm a relative newbie when it comes to coding, especially JavaScript. I am currently trying to populate a table from a Google spreadsheet, which will update whenever the spreadsheet does.
I followed this tutorial word for word (basically all you need to do is replace the key with your own to specify your spreadsheet, and make sure it's both published and public, which I've done):
http://dataforradicals.com/the-absurdly-illustrated-guide-to-sortable-searchable-online-data-tables/
I just get a Bad Request 400 error referring to my spreadsheet. If I visit the generated spreadsheet URL directly, I just get the words...
"Invalid query parameter value for sq."
https://spreadsheets.google.com/feeds/list/1UcfO9GHePQrcixZB_R9uVXr1vHVqVTDg7DdsOjpm-K0/od6/public/values?alt=json-in-script&sq=&callback=Tabletop.callbacks.tt140241226993949106
I can visit my spreadsheet with the link I was given when I published it here..
[maximum links reached but the structure is different]
As you can see, the domain structure is different. I fear that "Tabletop to Datatables" is prepending an outdated URL to the start of that link, but I can't find where it actually applies it.
The only reason I would think that's not happening is that the example in the tutorial still works! And the link it refers to is the old-style URL too.
I'm baffled, please help if you can. All suggestions appreciated.
The query string includes a parameter without a value, &sq=.
https://spreadsheets.google.com/feeds/list/1UcfO9GHePQrcixZB_R9uVXr1vHVqVTDg7DdsOjpm-K0/od6/public/values?alt=json-in-script&sq=&callback=Tabletop.callbacks.tt140241226993949106
Try this, with that parameter completely removed...
https://spreadsheets.google.com/feeds/list/1UcfO9GHePQrcixZB_R9uVXr1vHVqVTDg7DdsOjpm-K0/od6/public/values?alt=json-in-script&callback=Tabletop.callbacks.tt140241226993949106
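If you want to sanity-check the corrected URL outside the browser, here is a quick check in Python (the callback value is the one from the question and is otherwise arbitrary):

import requests

url = ("https://spreadsheets.google.com/feeds/list/"
       "1UcfO9GHePQrcixZB_R9uVXr1vHVqVTDg7DdsOjpm-K0/od6/public/values"
       "?alt=json-in-script&callback=Tabletop.callbacks.tt140241226993949106")
resp = requests.get(url)
print(resp.status_code)  # the claim above: the 400 goes away once sq= is removed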
There is an updated version of this project. Any necessary updates are included here:
https://github.com/scottpham/tabletop-to-datatables
Try it with the updated versions of all the JS libraries.
Check the link and remove the extra string after "pub". That part of the link is not necessary and may cause issues.
According to Google:
The 400 Bad Request error is an HTTP status code that means that the request you sent to the website server, often something simple like a request to load a web page, was somehow incorrect or corrupted and the server couldn't understand it.
Good luck
I am trying to extract data from a website using Perl. Below is a description of the site:
- the site displays data that depends on a date
- a calendar is displayed that is used to change the date
- upon clicking a date in the calendar, it calls a JavaScript function that passes in the date and refreshes the part of the page that displays the data
My question is, how do I execute that JS function so that I can loop through the dates I need data for?
Thanks in Advance
It's much easier to make the same HTTP requests from your script and get all the data you need directly.
You can record all HTTP requests/responses of your browser by using the HttpFox extension (for Firefox).
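Once you've recorded the request the calendar click makes, you can replay it with different dates. A minimal sketch in Python (the same idea works in Perl with LWP::UserAgent); the URL, parameter name, and date format here are hypothetical stand-ins for whatever HttpFox shows you:

import requests

for date in ("2024-01-01", "2024-01-02"):  # hypothetical date format
    # "https://example.com/data" and "date" are placeholders - copy the
    # real URL and parameter names from the recorded request.
    resp = requests.get("https://example.com/data", params={"date": date})
    resp.raise_for_status()
    print(date, len(resp.text))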
Edit
There is a CPAN module:
JavaScript - Perl extension for executing embedded JavaScript
But I haven't tested it yet.
Original post:
Take a look at smjs; it's SpiderMonkey's JS shell.
You could pass JavaScript to it like this:
# Pipe the JavaScript source through the SpiderMonkey shell and read its output.
open(my $jsout, '-|', "echo '$javascript' | smjs") or die "cannot run smjs: $!";
print <$jsout>;
close $jsout;
...
But be careful! This throws security considerations away: $javascript is interpolated straight into a shell command.
I'm writing a program in Python that needs to use a site's advanced search options. Specifically, the search page is the NVD advanced search page. I know the names of the projects and versions I need to search for, so ideally the program would select the project names and version numbers from the dropdown lists, then return the results page(s).
I'm totally unfamiliar with HTML and JavaScript, and I'm fairly new to Python, so I don't know if there's a way to 'click' these dropdown menus via Python, then return the results. The fact that the JavaScript makes an Ajax call further complicates the situation, since I can't just load the page's source and parse out the list of project names and versions.
Can anyone with some Python/JavaScript/Ajax experience point me in the right direction?
An example use of this program would be that I start out with the project 'glibc' and its version number '2.3.6'. The program would make sure that this combination is listed at all (which isn't guaranteed), then return the results page (which has about 13 results).
The Mechanize Python library is perfect for form automation. There is an example of how to edit and submit forms on the examples page.
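A minimal sketch with mechanize, assuming the search form is the first form on the page and that the select controls carry the cpe_vendor/cpe_product/cpe_version names visible in the query strings below; the URL and the example value are illustrative, not verified:

import mechanize

br = mechanize.Browser()
# Assumed URL for the advanced search page.
br.open("http://web.nvd.nist.gov/view/vuln/search-advanced")
br.select_form(nr=0)  # assumption: the search form is the first on the page
# Select controls take a list of values; this value is illustrative only.
br["cpe_vendor"] = ["cpe:/:gnu"]
response = br.submit()
print(response.read()[:500])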
If a human user is using that search page, they click on one of the product links, which then loads the list of products from another page, e.g.:
http://web.nvd.nist.gov/view/vuln/cpe/cpe-chooser?index=0&component=Vendor
This page is unfortunately not using JSON, so they have some custom JavaScript parsing for the response. The data from this response is then displayed as a drop-down for the user. When the user selects a product, the browser selects the correct value, so that when the form is submitted, it will be part of the query, e.g.:
http://web.nvd.nist.gov/view/vuln/search-results?adv_search=true&cves=on&cpe_vendor=cpe%3A%2F%3Aa-a-s_application_access_server
In this, cpe_vendor=cpe%3A%2F%3Aa-a-s_application_access_server is the important part. The part before the = sign is the field name; the part after is the selected value (which originally came from the Ajax request). The funny %3A and %2F bits are URL-encoding for ':' and '/', so the value decodes to cpe:/:a-a-s_application_access_server.
So you don't actually need to interact with the page, since you already know the names of the vendors and products you want to search for; you just need to look up the field name (cpe_vendor for vendors) and the value for the specific products/vendors (cpe:/:a-a-s_application_access_server in my example above), then make a request to the normal search URL.
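A minimal sketch of that direct request in Python; the field names come from the URLs above, and requests takes care of the URL-encoding of ':' and '/':

import requests

params = {
    "adv_search": "true",
    "cves": "on",
    # Value taken from the example above; swap in the vendor you need.
    "cpe_vendor": "cpe:/:a-a-s_application_access_server",
}
resp = requests.get("http://web.nvd.nist.gov/view/vuln/search-results", params=params)
resp.raise_for_status()
print(resp.text[:500])  # the results page HTML, ready for parsing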
The advanced search options page sends the options via GET to the results page, giving you the URL (linebreaks mine to make it clearer):
http://web.nvd.nist.gov/view/vuln/search-results?
adv_search=true&
cves=on&
cve_id=&
query=&
cwe_id=&
cpe_vendor=cpe%3A%2F%3Aian_bezanson&
cpe_product=cpe%3A%2Fa%3Aian_bezanson%3Adropbox&
cpe_version=cpe%3A%2Fa%3Aian_bezanson%3Adropbox%3A0.0.3_beta&
pub_date_start_month=0&
pub_date_start_year=2005&
pub_date_end_month=2&
pub_date_end_year=2009&
mod_date_start_month=2&
mod_date_start_year=2007&
mod_date_end_month=9&
mod_date_end_year=2009&
cvss_sev_base=&
cvss_av=&
cvss_ac=&
cvss_au=&
cvss_c=&
cvss_i=&
cvss_a=
It would then take a bit of sleuthing to figure out which bit of the URL carries which piece of information from the form, but that should let you simply scrape the results page.
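To close the loop, a minimal sketch that rebuilds that URL from only the fields you care about and scrapes the result with BeautifulSoup; the selector is a placeholder to be narrowed down after inspecting the results page:

import requests
from bs4 import BeautifulSoup

params = {
    "adv_search": "true",
    "cves": "on",
    "cpe_vendor": "cpe:/:ian_bezanson",
    "cpe_product": "cpe:/a:ian_bezanson:dropbox",
    "cpe_version": "cpe:/a:ian_bezanson:dropbox:0.0.3_beta",
}
resp = requests.get("http://web.nvd.nist.gov/view/vuln/search-results", params=params)
soup = BeautifulSoup(resp.text, "html.parser")
# Placeholder selector: inspect the page to find the element holding each result.
for link in soup.select("a"):
    print(link.get_text(strip=True))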