Get element from website with python without opening a browser - javascript

I'm trying to write a Python script which parses one element from a website and simply prints it.
I couldn't figure out how to achieve this without Selenium's webdriver, which I use to open a browser so the scripts run and the website displays properly.
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509')
content = browser.page_source
print(content[42000:43000])
browser.close()
This is just a rough draft which will print the contents, including the element of interest <span class="prod-price-inner">£13.00</span>.
How could I get the element of interest without the browser opening, or even without a browser at all?
Edit: I've previously tried urllib and, in bash, wget, but both lack the required JavaScript interpretation.

As other answers mentioned, this webpage requires JavaScript to render its content, so you can't simply fetch and process the page with lxml, Beautiful Soup, or a similar library. But there's a much simpler way to get the information you want.
I noticed that the link you provided fetches data from an internal API in a structured fashion. Based on the URL, the product number appears to be 910000800509. If you look at the Network tab in Chrome dev tools (or your browser's equivalent), you'll see that a GET request is being made to the following URL: http://groceries.asda.com/api/items/view?itemid=910000800509.
You can make the request with just the requests module (it decodes the JSON for you):
import requests

url = 'http://groceries.asda.com/api/items/view?itemid=910000800509'
r = requests.get(url)
price = r.json()['items'][0]['price']
print(price)
which prints:
£13.00
This also gives you access to lots of other information about the product, since the request returns some JSON with product details.
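For example, to see what other fields the API exposes, a quick follow-on to the snippet above (field names beyond price are not verified here):
item = r.json()['items'][0]
print(sorted(item.keys()))  # list the available product fields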

How could I get the element of interest without the browser opening,
or even without a browser at all?
After inspecting the page you're trying to parse:
http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509
I realized that it only displays the content if JavaScript is enabled. Based on that, you need to use a real browser.
Conclusion:
The way to go, if you need to automate this, is:
selenium
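If the only concern is the browser window popping up, Selenium can also drive Firefox headless. A minimal sketch, assuming a Selenium version recent enough to accept the options argument:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("-headless")  # run Firefox without opening a window

browser = webdriver.Firefox(options=options)
browser.get('http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509')
content = browser.page_source
browser.quit()
The page is still rendered by a real browser engine; it just never draws a window.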

Related

Is there a way to programmatically run a javascript function from the source of a website

On basketball-reference.com, there is an injury page that shows all of the current injuries in the NBA. I'd like to begin archiving this data to keep a daily record of who's injured in the NBA. Apart from simply being a basketball stat nut, this will be an input to a Bayesian model that predicts a player's playing time from his teammates' injuries.
Now, I could simply go to the page once a day, click the "Get Table as a CSV" button, and copy and paste that into a file, but this seems like a job for a cron job.
I could grab the raw html and parse it but the web page already has a get_csv_output(e) function in its sr-min.js file readily available. In fact, if I open up the developer console and type in
get_csv_output("injuries")
I get all of the csv dumped out as a string. It feels an awful lot like reinventing the wheel when I could simply use this function.
Somehow there is a disconnect in my mind though. I don't grok how I can visit a page, run a js function, and save the output without spinning up a full chrome driver instance through selenium or something. This feels like a simple problem with a simple solution that I just don't know.
I don't particularly care what language the solution is in, although I'd prefer Python, bash, or some other lightweight solution.
Please let me know if I'm being naive.
Edit: The page is https://www.basketball-reference.com/friv/injuries.cgi
Edit 2: The accepted answer is an excellent solution for future reference.
I ended up doing
curl https://www.basketball-reference.com/friv/injuries.cgi | python3 convert_injury_html_to_csv.py > "$(date +'%Y%m%d')".tsv
Where the python script is...
import sys
from bs4 import BeautifulSoup
def parse_injury_html(html_doc):
    soup = BeautifulSoup(html_doc, "html.parser")
    injuries_table = soup.find(id="injuries")
    for row in injuries_table.tbody.find_all("tr"):
        # bs4 returns the class attribute as a list, so check membership
        if "thead" in (row.get("class") or []):
            continue
        name = row.th
        team, update, description = row.find_all("td")
        yield (name.string, team.string, update.string, description.string)

def main():
    for (name, team, update, description) in parse_injury_html(sys.stdin.read()):
        print(f"{name}\t{team}\t{update}\t{description}")

if __name__ == '__main__':
    main()
Just executing this function won't do any good, because it must be executed in the context of that injuries page. If you look at its code, it effectively parses HTML data. A weird way of doing things, but I've seen worse. Never mind.
The easiest solution will be using something that opens the page and calls the function just like you do in the devtools. Barmar suggested Selenium, but I personally prefer Puppeteer. It runs via Node.js, opens Chrome in headless mode, and can execute any function exposed on any site. In our case, the get_csv_output function.
After that you may do whatever you want with the resulting string: dump it to a DB or save it to a file.
An example of puppeteer code.
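If you'd rather stay in Python, here is a rough sketch of the same idea using Playwright instead of Puppeteer (my substitution, not the original answer's code; it assumes get_csv_output is defined globally once the page has loaded):
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()  # headless by default
    page = browser.new_page()
    page.goto("https://www.basketball-reference.com/friv/injuries.cgi")
    # Call the page's own function, just as you would in the devtools console.
    csv_text = page.evaluate('() => get_csv_output("injuries")')
    browser.close()

print(csv_text)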
You could more directly just run the code in that JS function. Node.js is a standalone JS runtime, so you may be able to use it to run the exact same function.
That function is most likely just making HTTP requests to download the data from a server, perhaps with some mild data manipulation. The networking layer in Node and browser JS is not the same, but there are polyfills available. If the JS function is using the fetch API, you can use node-fetch; if it's using XHR-style requests, xmlhttprequest.
Since the code is probably a simple data fetch, it might be simple enough to reverse-engineer what's going on and write your own script, in whatever language you prefer, to make the same type of HTTP request. Watching what's going on in the Network tab of your developer tools should tell you where it's getting its data.

python javascript scrape automatically

Python novice here.
I am trying to scrape company information from the Dutch Transparency Benchmark website for a number of different companies, but I'm at a loss as to how to make it work. I've tried
pd.read_html("https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793")
and
requests.get("https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793")
and then working from there. However, it seems like the data is dynamically generated/queried, and thus not actually contained in the html source code these methods retrieve.
If I go to my browser's developer tools and copy the "final" HTML as shown in the "Elements" tab, all of the information is there. But as I'd like to repeat the process for several of the companies, is there any way to automate it?
Alternatively, if there's no direct way to obtain the info from the HTML, there might be a second possibility. The site allows you to download the information as an Excel file for each individual company. Is it possible to somehow automatically "click" the download button and save the file somewhere? Then I might be able to loop over all the companies I need.
Please excuse me if this question is poorly worded, and thank you very much in advance.
Thanks a million!
Edit: I have also tried it using BeautifulSoup, as @pmkroeker suggested. But I'm not really sure how to make it work so that it first runs all the JavaScript and the site actually contains the data.
I think you will either want to use a library to render the page, or find the underlying API calls (more on that below). This answer seems to apply to Python; I will also copy the code from that answer for completeness.
You can pip install selenium from a command line and then run something like:
from selenium import webdriver
from urllib.request import urlopen

url = 'http://www.google.com'
file_name = 'C:/Users/Desktop/test.html'  # .html so the browser treats it as a page

# Download the raw page and save a local copy
conn = urlopen(url)
data = conn.read()
conn.close()

with open(file_name, 'wb') as f:
    f.write(data)

# Open the local copy in a real browser so its JavaScript can run
browser = webdriver.Firefox()
browser.get('file:///' + file_name)
html = browser.page_source
browser.quit()
I think you could probably skip the file write and just pass the URL straight to that browser.get call, but I'll leave that to you to find out.
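For what it's worth, a minimal sketch of that shortcut, using the URL from the question (untested against the live site):
from selenium import webdriver

browser = webdriver.Firefox()
# Load the live page so its JavaScript can run, then grab the rendered HTML.
browser.get("https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793")
html = browser.page_source
browser.quit()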
The other thing you can do is look for the AJAX calls in your browser's developer tools, e.g. in Chrome: the three dots -> More tools -> Developer tools, or press F12. Then look at the Network tab. There will be various requests. Click one, click the Preview tab, and go through each until you find a response that looks like JSON data. You are effectively looking for the API calls the site uses to fetch the data that generates the page. Once you find one, click the Headers tab and you will see a Request URL.
e.g. this https://sa-tb.nl/api/widget/chart/survey/4/sector/38 returns lots of data
The problem here is it may or may not be repeatable (API may change, id's may change). You may have a similar problem with just HTML scraping as the HTML could change just as easily.
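With that caveat in mind, a minimal sketch of fetching that endpoint directly (I'm assuming it returns JSON; inspect the response before relying on specific fields):
import requests

url = "https://sa-tb.nl/api/widget/chart/survey/4/sector/38"
resp = requests.get(url)
data = resp.json()  # assumption: the endpoint returns JSON
print(type(data), str(data)[:200])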

Get Html page after jQuery and Javascript execution with Python

I've imported the content of a webpage into a variable in python, but I'm not getting the final structure (the one that's modified by Ajax and jQuery in general).
How could I solve this?
I would like to get the html as the one I see if I save the page from the browser.
That's my code:
import urllib.request

urlAddress = "http:// ... /"
getPage = urllib.request.urlopen(urlAddress)
outputPage = getPage.read().decode("utf-8")
print(outputPage)
You can't by just getting the page source from the server. You need to do one of the following:
Use a headless browser or similar solution (Selenium, Splash, PhantomJS, ...) to run the JS code in the page itself and see the results.
Figure out what the JS code actually does and recreate the same thing in Python. If it's making another call to the server, you can see that in the XHR tab of Chrome's Developer Tools.

Why is requests.get() retrieving different HTML using Python than browser?

I am attempting to extract data from an HTML table, but it appears that the HTML isn't loading correctly when using requests.get(). Instead, a line in the source reads:
"JavaScript is not enabled and therefore this page may not function correctly."
When I navigate to the page in Google Chrome, the HTML appears as it should.
How do I get a Python script to load the proper HTML?
Welcome to the wonderful world of web crawling. The problem you are experiencing is that requests.get() just gets you the initial page that the browser receives at the beginning of a page load. But this is not the page you see in the browser, since so much can be involved in forming the final web page: JavaScript function calls, AJAX calls, etc.
If you want to programmatically get the HTML you see when you click "Show source" in a web browser after the page has loaded, you need a real browser. This is where Selenium could be a good option:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get(url)
print(browser.page_source)
Note that selenium itself is very powerful in terms of locating elements - you don't need a separate HTML parser for extracting the data out of the page.
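For example, a rough sketch of pulling table cells straight out of Selenium, using the same find_elements_by_* style as the other snippets here (the locator is a placeholder):
from selenium import webdriver

browser = webdriver.Firefox()
browser.get(url)  # the page that contains the table
# Placeholder locator: grab every table cell, then narrow it down as needed.
cells = browser.find_elements_by_tag_name("td")
print([cell.text for cell in cells[:10]])
browser.quit()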
Hope that helps.
If you are sure that you have to deal with JavaScript, the webdriver will handle it better and save you a lot of trouble.
from selenium.common.exceptions import NoSuchElementException
from selenium import webdriver
from time import sleep
browser = webdriver.Firefox()
browser.get("http://yourwebsite.com/html-table")
browser.find_element_by_id("some-js-triggering-elem").click()
while True:
    try:
        # Wait until the element that signals the table has loaded is present
        browser.find_element_by_id("elem-that-makes-you-know-that-table-is-loaded")
        break
    except NoSuchElementException:
        sleep(1)
html = browser.find_element_by_xpath("//*").get_attribute("outerHTML")
# Use PyQuery or something else to parse the html and get data from table
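For that last step, a small sketch of parsing the captured html with BeautifulSoup (the table id is hypothetical; use your table's real id or class):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="data-table")  # hypothetical id
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(rows[:3])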

Scrape JavaScript download links from ASP website

I am trying to download all the files from this website for backup and mirroring; however, I don't know how to go about parsing the JavaScript links correctly.
I need to organize all the downloads in the same way, in named folders. For example, for the first one I would have a folder named "DAP-1150", and inside that a folder named "DAP-1150 A1 FW v1.10" with the file "DAP1150A1_FW110b04_FOSS.zip" in it, and so on for each file. I tried using BeautifulSoup in Python, but it didn't seem to be able to handle ASP links properly.
When you struggle with Javascript links you can give Selenium a try: http://selenium-python.readthedocs.org/en/latest/getting-started.html
from selenium import webdriver
import time
driver = webdriver.Firefox()
driver.get("http://www.python.org")
time.sleep(3) # Give your Selenium some time to load the page
link_elements = driver.find_elements_by_tag_name('a')
links = [link.get_attribute('href') for link in link_elements]
You can use the links and pass them to urllib2 to download them accordingly.
If you need more than a script, I can recommend you a combination of Scrapy and Selenium:
selenium with scrapy for dynamic page
Here's what it's doing. I just used the standard Network inspector in Firefox to snapshot the POST operation. Bear in mind, as with the other answer I pointed you to, this is not a particularly well-written website: JS/POST should not have been used at all.
First of all, here's the JS - it's very simple:
function oMd(pModel_, sModel_) {
    obj = document.form1;
    obj.ModelCategory_.value = pModel_;
    obj.ModelSno_.value = sModel_;
    obj.Model_Sno.value = '';
    obj.ModelVer.value = '';
    obj.action = 'downloads2008detail.asp';
    obj.submit();
}
That writes to these fields:
<input type=hidden name=ModelCategory_ value=''>
<input type=hidden name=ModelSno_ value=''>
So, you just need a POST form targeting this URL:
http://tsd.dlink.com.tw/downloads2008detail.asp
And here's an example set of data from Firefox's network analyser. There are only two items you need to change, grabbed from the JS link, and you can grab those with an ordinary scrape:
Enter=OK
ModelCategory=0
ModelSno=0
ModelCategory_=DAP
ModelSno_=1150
Model_Sno=
ModelVer=
sel_PageNo=1
OS=GPL
You'll probably find by experimentation that not all of them are necessary. I did try using GET for this, in the browser, but it looks like the target page insists upon POST.
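For reference, here's a sketch of replaying that POST with requests, with the field values copied from the capture above (the site may also expect certain headers or cookies, so treat this as a starting point):
import requests

url = "http://tsd.dlink.com.tw/downloads2008detail.asp"
payload = {
    "Enter": "OK",
    "ModelCategory": "0",
    "ModelSno": "0",
    "ModelCategory_": "DAP",   # grabbed from the JS link
    "ModelSno_": "1150",       # grabbed from the JS link
    "Model_Sno": "",
    "ModelVer": "",
    "sel_PageNo": "1",
    "OS": "GPL",
}
resp = requests.post(url, data=payload)
print(resp.status_code, len(resp.text))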
Don't forget to leave a decent amount of time inside your scraper between clicks and submits, as each one represents a hit on the remote server; I suggest 5 seconds, emulating a human delay. If you do this too quickly - all too possible if you are on a good connection - the remote side may assume you are DoSing them, and might block your IP. Remember the motto of scraping: be a good robot!
