Python novice here.
I am trying to scrape company information from the Dutch Transparency Benchmark website for a number of different companies, but I'm at a loss as to how to make it work. I've tried
pd.read_html("https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793")
and
requests.get("https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793")
and then working from there. However, it seems like the data is dynamically generated/queried, and thus not actually contained in the HTML source code these methods retrieve.
If I go to my browser's developer tools and copy the "final" html as shown there in the "Elements" tab, the whole information is in there. But as I'd like to repeat the process for several of the companies, is there any way to automate it?
Alternatively, if there's no direct way to obtain the info from the HTML, there might be a second possibility. The site allows you to download the information as an Excel file for each individual company. Is it possible to somehow automatically "click" the download button and save the file somewhere? Then I might be able to loop over all the companies I need.
Please excuse me if this question is poorly worded, and thank you very much in advance.
Many thanks!
Edit: I have also tried it using BeautifulSoup, as @pmkroeker suggested. But I'm not really sure how to make it first run all the JavaScript so the site actually contains the data.
I think you will either want to use a library to render the page, or find the site's underlying API calls (more on that at the end of this answer). This answer seems to apply to Python; I will copy the code from that answer here for completeness.
You can pip install selenium from a command line, and then run something like:
from selenium import webdriver
from urllib.request import urlopen  # urllib2 is Python 2 only; use urllib.request on Python 3

url = 'http://www.google.com'
file_name = 'C:/Users/Desktop/test.html'

# Download the raw HTML and save it to disk
conn = urlopen(url)
data = conn.read()
conn.close()

file = open(file_name, 'wb')  # open in binary mode, since urlopen returns bytes
file.write(data)
file.close()

# Open the saved file in a real browser so its JavaScript runs,
# then grab the rendered DOM
browser = webdriver.Firefox()
browser.get('file:///' + file_name)
html = browser.page_source
browser.quit()
I think you could probably skip the file write and just pass the URL to that browser.get call, but I'll leave that to you to find out.
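For example, a minimal sketch of that simpler variant (untested, but using the URL from your question):
from selenium import webdriver

# Let the browser fetch and render the live page directly,
# so the site's JavaScript runs before we read the DOM
browser = webdriver.Firefox()
browser.get('https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793')
html = browser.page_source
browser.quit()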
The other thing you can do is look for the AJAX calls in a browser developer tool, i.e. when using Chrome: the 3 dots -> More tools -> Developer tools, or press F12. Then look at the Network tab. There will be various requests; click one, click the Preview tab, and go through each until you find a response that looks like JSON data. You are effectively looking for the API calls the site uses to get the data it renders. Once you find one, click the Headers tab and you will see a Request URL.
For example, this one has lots of data: https://sa-tb.nl/api/widget/chart/survey/4/sector/38
The problem here is that it may or may not be repeatable (the API may change, IDs may change). You may have a similar problem with plain HTML scraping, as the HTML could change just as easily.
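Once you have such a Request URL, a minimal sketch with the requests module (assuming the endpoint returns JSON and needs no special headers) could look like this:
import requests

# Fetch the JSON behind the page directly
url = 'https://sa-tb.nl/api/widget/chart/survey/4/sector/38'
resp = requests.get(url)
resp.raise_for_status()
data = resp.json()  # the structured data the page renders from
print(data)
From there you could loop over the IDs in the URL to cover all the companies you need.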
Related
How can I provide the user of a web-app with a download-link to programmatically created data in AngularDart?
I thought this would be an easy task, since the download of data could be handled via data links. But it turns out that AngularDart doesn't let me use data links, since they are considered insecure. In a pure JavaScript environment I would use FileSaver.js, but this is not possible with AngularDart either (at least I didn't find a way to use it there).
What I really want to do: I create data in the app with code. At the end I have a JSON structure that needs to be downloaded to the user's client computer. The user should be presented with a file-select dialog where they can enter a filename, and the data should then be saved there. All of this should be initiated by a click on a button.
Up to now I haven't found a working way to make this happen in AngularDart. I tried BrowserClient, a-tags with a download attribute, and forms with a data URL, but nothing works.
If anybody could give me a hint on how to make this work, I would be very happy. A hint on how to use JavaScript libraries (like FileSaver.js) in AngularDart would also be welcome.
I don't use Flutter, and I need this to work in the browser, so File from dart:io is no solution for me (it will be one of the first things you find when searching for a solution). Saving the file to the server and then downloading it to the client is no solution either.
On basketball-reference.com, there is an injury page that shows all of the current injuries in the NBA. I'd like to begin archiving this data to keep a daily record of who's injured in the NBA. Apart from my simply being a basketball stat nut, this will be an input to a Bayesian model that predicts a player's playing time from his teammates' injuries.
Now, I could simply go to the page once a day, click the "Get Table as a CSV" button, and copy and paste the result into a file, but this seems like a job for a cron script.
I could grab the raw HTML and parse it, but the web page already has a get_csv_output(e) function readily available in its sr-min.js file. In fact, if I open up the developer console and type in
get_csv_output("injuries")
I get all of the CSV dumped out as a string. It feels an awful lot like reinventing the wheel when I could simply use this function.
Somehow there is a disconnect in my mind, though. I don't grok how I can visit a page, run a JS function, and save the output without spinning up a full Chrome driver instance through Selenium or something. This feels like a simple problem with a simple solution that I just don't know.
I don't particularly care what language the solution is in, although I'd prefer Python, bash, or some other lightweight solution.
Please let me know if I'm being naive.
Edit: The page is https://www.basketball-reference.com/friv/injuries.cgi
Edit 2: The accepted answer is an excellent solution for future reference.
I ended up doing
curl https://www.basketball-reference.com/friv/injuries.cgi | python3 convert_injury_html_to_csv.py > "$(date +'%Y%m%d')".tsv
Where the Python script is...
import sys
from bs4 import BeautifulSoup

def parse_injury_html(html_doc):
    soup = BeautifulSoup(html_doc, "html.parser")
    injuries_table = soup.find(id="injuries")
    for row in injuries_table.tbody.find_all("tr"):
        # BeautifulSoup returns the class attribute as a list,
        # so test membership to skip the repeated header rows
        if "thead" in row.get("class", []):
            continue
        name = row.th
        team, update, description = row.find_all("td")
        yield (name.string, team.string, update.string, description.string)

def main():
    for (name, team, update, description) in parse_injury_html(sys.stdin.read()):
        print(f"{name}\t{team}\t{update}\t{description}")

if __name__ == '__main__':
    main()
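To run this daily, a crontab entry along these lines might work (the paths are placeholders, and note that % must be escaped in crontab):
0 8 * * * curl -s https://www.basketball-reference.com/friv/injuries.cgi | python3 /path/to/convert_injury_html_to_csv.py > /path/to/archive/$(date +\%Y\%m\%d).tsv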
Just executing this function won't do any good, because it must be executed in the context of that injuries page. If you look at its code, it effectively parses HTML data already in the page. A weird way of doing things, but I've seen worse.
The easiest solution will be to use something that opens the page and calls the function just like you do in devtools. Barmar suggested Selenium, but I personally prefer puppeteer. It runs via Node.js, opens Chrome in headless mode, and can execute any function the page exposes - in our case, the get_csv_output function.
After that you can do whatever you want with the resulting string: dump it to a DB or save it to a file.
An example of puppeteer code.
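For reference, a comparable sketch in Python using Playwright (a similar headless-browser tool; this assumes the page still defines get_csv_output):
from playwright.sync_api import sync_playwright

# Open the injuries page in a headless browser and call the site's
# own CSV helper in the page context
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.basketball-reference.com/friv/injuries.cgi")
    csv_text = page.evaluate("get_csv_output('injuries')")
    browser.close()

print(csv_text)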
You could, more directly, just run the code in that JS function yourself. Node.js is a standalone JS engine, so you may be able to use it to run the exact same function.
That function is most likely just making HTTP requests to download the data from a server, perhaps with some mild data manipulation. The networking layers in Node and browser JS are not the same, but there are polyfills available. If the JS function uses the fetch API, you can use node-fetch; if it uses XHR-style requests, xmlhttprequest.
Since the code is probably a simple data fetch, it might be simple enough to reverse-engineer what's going on and write your own script in whatever language you prefer to make the same type of HTTP request. Watching the Network tab of your developer tools should tell you where it's getting its data.
I'm trying to write a Python script which parses one element from a website and simply prints it.
I couldn't figure out how to achieve this without selenium's webdriver, which opens a browser that handles the scripts needed to properly display the website.
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509')
content = browser.page_source
print(content[42000:43000])
browser.close()
This is just a rough draft which will print the contents, including the element of interest <span class="prod-price-inner">£13.00</span>.
How could I get the element of interest without the browser opening, or even without a browser at all?
edit: I've previously tried urllib in Python and wget in bash, but both lack the required JavaScript interpretation.
As other answers mentioned, this webpage requires JavaScript to render its content, so you can't simply get and process the page with lxml, Beautiful Soup, or a similar library. But there's a much simpler way to get the information you want.
I noticed that the link you provided fetches data from an internal API in a structured fashion. Based on the URL, it appears that the product number is 910000800509. If you look at the Network tab in Chrome dev tools (or your browser's equivalent), you'll see that a GET request is being made to the following URL: http://groceries.asda.com/api/items/view?itemid=910000800509.
You can make the request with just the requests module:
import requests

url = 'http://groceries.asda.com/api/items/view?itemid=910000800509'
r = requests.get(url)
price = r.json()['items'][0]['price']  # parse the JSON body and pull out the price
print(price)  # £13.00
This also gives you access to lots of other information about the product, since the request returns some JSON with product details.
How could I get the element of interest without the browser opening,
or even without a browser at all?
After inspecting the page you're trying to parse:
http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509
I realized that it only displays the content if JavaScript is enabled. Based on that, you need to use a real browser.
Conclusion:
The way to go, if you need to automate this, is:
selenium
I am trying to download all the files from this website for backup and mirroring; however, I don't know how to go about parsing the JavaScript links correctly.
I need to organize all the downloads in the same way, in named folders. For example, for the first one I would have a folder named "DAP-1150", inside that a folder named "DAP-1150 A1 FW v1.10" with the file "DAP1150A1_FW110b04_FOSS.zip" in it, and so on for each file. I tried using BeautifulSoup in Python, but it didn't seem to be able to handle the ASP links properly.
When you struggle with JavaScript links, you can give Selenium a try: http://selenium-python.readthedocs.org/en/latest/getting-started.html
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get("http://www.python.org")
time.sleep(3)  # give Selenium some time to load the page
link_elements = driver.find_elements_by_tag_name('a')
links = [link.get_attribute('href') for link in link_elements]
You can then pass the links to urllib.request (or requests) to download them accordingly.
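As a rough sketch of that download step (assuming some hrefs point directly at files; for the JS/POST links see the answer below):
import os
import urllib.request

# Illustrative only: save each .zip link into a per-model folder
os.makedirs("DAP-1150", exist_ok=True)
for url in links:
    if url and url.endswith(".zip"):
        filename = os.path.join("DAP-1150", url.split("/")[-1])
        urllib.request.urlretrieve(url, filename)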
If you need more than a script, I can recommend a combination of Scrapy and Selenium:
selenium with scrapy for dynamic page
Here's what it is doing. I just used the standard Network inspector in Firefox to snapshot the POST operation. Bear in mind, as in the other answer I pointed you to, this is not a particularly well-written website - JS/POST should not have been used at all.
First of all, here's the JS - it's very simple:
function oMd(pModel_, sModel_) {
    obj = document.form1;
    obj.ModelCategory_.value = pModel_;
    obj.ModelSno_.value = sModel_;
    obj.Model_Sno.value = '';
    obj.ModelVer.value = '';
    obj.action = 'downloads2008detail.asp';
    obj.submit();
}
That writes to these fields:
<input type=hidden name=ModelCategory_ value=''>
<input type=hidden name=ModelSno_ value=''>
So, you just need a POST form targeting this URL:
http://tsd.dlink.com.tw/downloads2008detail.asp
And here's an example set of data from FF's network analyser. There are only two items you need to change - grabbed from the JS link - and you can grab those with an ordinary scrape:
Enter=OK
ModelCategory=0
ModelSno=0
ModelCategory_=DAP
ModelSno_=1150
Model_Sno=
ModelVer=
sel_PageNo=1
OS=GPL
You'll probably find by experimentation that not all of them are necessary. I did try using GET for this in the browser, but it looks like the target page insists upon POST.
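For instance, a sketch of that POST in Python with requests (field values taken from the capture above; ModelCategory_ and ModelSno_ come from the scraped JS link):
import requests

url = "http://tsd.dlink.com.tw/downloads2008detail.asp"
payload = {
    "Enter": "OK",
    "ModelCategory": "0",
    "ModelSno": "0",
    "ModelCategory_": "DAP",  # scraped from the oMd(...) link
    "ModelSno_": "1150",      # scraped from the oMd(...) link
    "Model_Sno": "",
    "ModelVer": "",
    "sel_PageNo": "1",
    "OS": "GPL",
}
resp = requests.post(url, data=payload)
print(resp.status_code, len(resp.text))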
Don't forget to leave a decent amount of time between clicks and submits in your scraper, as each one represents a hit on the remote server; I suggest 5 seconds, emulating a human delay. If you do this too quickly - all too possible on a good connection - the remote side may assume you are DoSing them and block your IP. Remember the motto of scraping: be a good robot!
I inject JavaScript code into a page the user is currently viewing; on the user's command this script makes DOM changes. At the end of this interaction the user might want to save the page so that they can view/edit it later. I could remember the DOM changes the user made, but if the original page (at its source) changes, I will not be able to restore the page for the user. That is why I want to send the changed page to my server. I should be able to restore it completely, and the page should behave exactly the way it did (including scripts and media).
Additionally, I cannot store the media of the user's page on my end (resource limitations), so I guess I have to parse and modify all media addresses/references/links into global URLs/URIs across the various sources (HTML/CSS/JavaScript).
Now the question is: is there a library/framework/jQuery extension that can help me achieve this objective?
Otherwise, what is the right/professional way to do it?
Since you are using jQuery, you could try $("html").html(); just make sure to add the appropriate <html> tags when you output it again.
$('body').html()
$('head').html()
$('html').html()
Download Firebug and try it in the console window on this page. I am getting what looks like the correct data back.
Have I got it right that you are building some kind of CMS that lets the user edit entire pages (not just separate content blocks) in contenteditable mode?
I would definitely advise looking at a solution like CKEditor or TinyMCE, because doing it all yourself will be a terrible pain.
The answer from #Sydenam should work fine to save the whole HTML page.
Meanwhile, and this is IMPORTANT, I would recommend you to consider a potential SECURITY ISSUE here. Indeed, the user can inject whatever they want into the DOM and have you save it: nasty JavaScript functions sending confidential information to a remote server, for example.
So, in my view, a professional way of doing this would be to dedicate only a PART of the DOM to that usage, say a <div id='editable_div'>, which you can load using $('#editable_div').load('your_url', parameters, etc.) and save afterwards using another AJAX call.
When saving it you can parse this chunk of HTML and make sure nothing nasty is inside with some regexps (checking for <script> tags, for example).