Why is requests.get() retrieving different HTML using Python than browser?

I am attempting to extract data from an HTML table, but it appears that the HTML isn't loading correctly when using requests.get(). Instead, a line in the source reads:
"JavaScript is not enabled and therefore this page may not function correctly."
When I navigate to the page in Google Chrome, the HTML appears as it should.
How do I get a Python script to load the proper HTML?

Welcome to the wonderful world of web crawling. The problem you are experiencing is that requests.get() only fetches the initial HTML document that the server sends at the start of a page load. That is not the page you see in the browser, because a lot can happen afterwards to build the final page: JavaScript function calls, AJAX requests, and so on.
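One way to confirm this is to inspect the HTML that the script actually received. The sketch below uses only the standard library and assumes the warning is wrapped in a <noscript> element, which is common but not guaranteed for any particular site; the class and function names are illustrative.

```python
from html.parser import HTMLParser


class NoScriptCollector(HTMLParser):
    """Collect the text found inside <noscript> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # greater than 0 while inside a <noscript> element
        self.messages = []   # text fragments found inside <noscript>

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "noscript" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._depth > 0 and text:
            self.messages.append(text)


def noscript_messages(html):
    """Return the fallback messages a page shows when JavaScript is off."""
    parser = NoScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.messages
```

If noscript_messages(response.text) returns something for HTML fetched with requests.get(), that is a hint that the data you want is filled in by JavaScript after the initial load.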
If you want to programmatically get the HTML you see in the browser's element inspector after the page has loaded, you need a real browser. This is where Selenium could be a good option:
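from selenium import webdriver
url = "http://example.com/page-with-table"  # placeholder: replace with the page you want
browser = webdriver.Firefox()
browser.get(url)
print(browser.page_source)  # HTML after the browser has run the page's scripts
browser.quit()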
Note that selenium itself is very powerful in terms of locating elements - you don't need a separate HTML parser for extracting the data out of the page.
Hope that helps.

If you are sure that you have to deal with JavaScript, WebDriver handles it better and will save you a lot of trouble.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from time import sleep
browser = webdriver.Firefox()
browser.get("http://yourwebsite.com/html-table")
browser.find_element(By.ID, "some-js-triggering-elem").click()
# Poll once per second until the marker element shows up
while True:
    try:
        browser.find_element(By.ID, "elem-that-makes-you-know-that-table-is-loaded")
        break  # element found, so the table has loaded
    except NoSuchElementException:
        sleep(1)
html = browser.find_element(By.XPATH, "//*").get_attribute("outerHTML")
# Use PyQuery or something else to parse the html and get data from table
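A polling loop like this is safer with a deadline, so the script does not hang forever if the element never appears. Below is a minimal sketch of such a helper using only the standard library; wait_until and its parameters are illustrative names, not part of Selenium. Selenium also ships its own WebDriverWait class for the same purpose.

```python
import time


def wait_until(condition, timeout=30.0, interval=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call condition() until it returns a truthy value or the timeout elapses.

    Returns the truthy value, or raises TimeoutError once the deadline passes.
    The sleep and clock arguments exist only so the helper is easy to test.
    """
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(interval)
```

With a browser object this could be called as wait_until(lambda: browser.find_elements(By.ID, "elem-that-makes-you-know-that-table-is-loaded")), since find_elements returns an empty list while nothing matches.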

Related

Reading HTML created by JavaScript with Python

I have a small problem with linking Python with HTML, CSS, and JavaScript: Currently, I have a website that takes in the user's input, and uses JavaScript to modify some of the tags, so that Python can read it.
However, when I tried the following code:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://infiniteembarrasseddesign.lucatu1.repl.co/')
ex = r.html.find("#ex", first=True)
print(ex.text)
It doesn't output anything. Well, it does, just an empty element. However, my JavaScript should have replaced the textContent of the div with the user's entry. Is there a way to make Python read HTML created by JavaScript?
My operating system is windows 10.
Here's the HTML code if you want it:
https://repl.it/#LUCATU1/InfiniteEmbarrassedDesign#index.html. The URL of the webpage is in the code.
My apologies that I cannot provide the main code for Python as it involves opening private .mdb files.

The problem is that requests.get() does not execute JavaScript. To see what your web scraper (requests) gets, open DevTools in Chrome (F12), open the Command Menu (Ctrl+Shift+P), type "Disable JavaScript", select that command, and refresh the page. You can solve this problem by using Selenium instead.

requests.get() doesn't execute JavaScript; you can try Selenium:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("URL")
Then get the source using driver.page_source, or navigate the page using Selenium's methods.

How to load javascript values into python from web page?

When I inspect the code on a webpage, I can see the html and the javascript. I've used Beautiful Soup to import and parse the html, but there is a large section written in javascript, which pulls variables from a programmable logic controller (PLC). I can't find the data in python after I load and parse with Beautiful Soup - it's only the html code.
The PLC is being read directly by the webpage and I see the live values updating in front of me, but I can't import them directly. The screen shot is what the code looks like from the inspect window. Let's say I want to import that variable id="aout7" with attribute class="on", how can I do that?

Webpages are best run in a browser. There are APIs for remote-controlling a browser or browser engine; a popular one is Selenium, and it has Python bindings: see https://pypi.org/project/selenium/. The page contains instructions for installing:
pip install -U selenium
and some introductory examples, like this snippet issuing a Yahoo search:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
browser = webdriver.Firefox()
browser.get('http://www.yahoo.com')
assert 'Yahoo' in browser.title
elem = browser.find_element(By.NAME, 'p')  # Find the search box
elem.send_keys('seleniumhq' + Keys.RETURN)
browser.quit()
You will need something similar, just with By.ID as the locator (https://selenium-python.readthedocs.io/locating-elements.html). Use the text attribute of an element to read its content, or get_attribute('class') to read its class.

Get Html page after jQuery and Javascript execution with Python

I've imported the content of a webpage into a variable in python, but I'm not getting the final structure (the one that's modified by Ajax and jQuery in general).
How could I solve this?
I would like to get the html as the one I see if I save the page from the browser.
That's my code:
import urllib.request
urlAddress = "http:// ... /"
getPage = urllib.request.urlopen(urlAddress)
outputPage = str(getPage.read())
print(outputPage)

You can't do that just by getting the page source from the server. You need to do one of the following:
Use a headless browser or a similar solution (Selenium, Splash, PhantomJS, ...) to run the JS code in the page itself and see the results.
Figure out what the JS code actually does, and recreate the same in Python. If it makes another call to the server, you can see that under the XHR filter of the Network tab in Chrome's Developer Tools.
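The second option can be sketched with urllib.request, the same module used in the question. This is only an outline under assumptions: the endpoint URL is whatever you find in the Network tab, the function names are illustrative, and whether the server expects the extra headers depends on the site.

```python
import json
import urllib.request


def build_xhr_request(url):
    """Build a GET request that resembles a page's own background call."""
    headers = {
        "Accept": "application/json",
        # Some JavaScript libraries send this header and some servers check it
        "X-Requested-With": "XMLHttpRequest",
    }
    return urllib.request.Request(url, headers=headers)


def decode_json_body(raw_bytes, encoding="utf-8"):
    """Decode a response body and parse it as JSON."""
    return json.loads(raw_bytes.decode(encoding))


def fetch_json(url, timeout=10):
    """Fetch the endpoint and return the parsed JSON payload."""
    request = build_xhr_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return decode_json_body(response.read())
```

Calling fetch_json with the URL copied from the Network tab would then return the same data the page's JavaScript receives, provided the server does not require cookies or tokens beyond these headers.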

Get element from website with python without opening a browser

I'm trying to write a python script which parses one element from a website and simply prints it.
I couldn't figure out how to achieve this without Selenium's webdriver, which opens a browser that runs the scripts needed to display the website properly.
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509')
content = browser.page_source
print(content[42000:43000])
browser.close()
This is just a rough draft which will print the contents, including the element of interest <span class="prod-price-inner">£13.00</span>.
How could I get the element of interest without the browser opening, or even without a browser at all?
Edit: I've previously tried urllib and, in bash, wget, both of which lack the required JavaScript interpretation.

As other answers mentioned, this webpage requires javascript to render content, so you can't simply get and process the page with lxml, Beautiful Soup, or similar library. But there's a much simpler way to get the information you want.
I noticed that the link you provided fetches data from an internal API in a structured fashion. It appears that the product number is 910000800509 based on the URL. If you look at the Network tab in Chrome dev tools (or your browser's equivalent dev tools), you'll see that a GET request is being made to the following URL: http://groceries.asda.com/api/items/view?itemid=910000800509.
You can make the request like this with just the requests module, which parses the JSON for you:
import requests
url = 'http://groceries.asda.com/api/items/view?itemid=910000800509'
r = requests.get(url)
price = r.json()['items'][0]['price']  # first item in the response
print(price)
£13.00
This also gives you access to lots of other information about the product, since the request returns some JSON with product details.
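Because this is an internal API, its response shape may change without notice. A small defensive helper, sketched below, avoids a KeyError or IndexError when that happens. It assumes only the 'items' and 'price' keys used above; extract_price is an illustrative name.

```python
def extract_price(payload):
    """Return the first item's price, or None if the payload lacks one."""
    if not isinstance(payload, dict):
        return None
    items = payload.get("items")
    if not isinstance(items, list) or not items:
        return None
    first = items[0]
    if not isinstance(first, dict):
        return None
    return first.get("price")
```

It would be used as extract_price(r.json()), with a None result signalling that the response no longer looks the way the script expects.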

How could I get the element of interest without the browser opening,
or even without a browser at all?
After inspecting the page you're trying to parse:
http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509
I realized that it only displays the content if JavaScript is enabled. Based on that, you need to use a real browser.
Conclusion:
If you need to automate this, the way to go is:
selenium

Scrape JavaScript download links from ASP website

I am trying to download all the files from this website for backup and mirroring, however I don't know how to go about parsing the JavaScript links correctly.
I need to organize all the downloads in the same way in named folders. For example on the first one I would have a folder named "DAP-1150" and inside that would be a folder named "DAP-1150 A1 FW v1.10" with the file "DAP1150A1_FW110b04_FOSS.zip" in it and so on for each file. I tried using beautifulsoup in Python but it didn't seem to be able to handle ASP links properly.

When you struggle with JavaScript links, you can give Selenium a try: http://selenium-python.readthedocs.org/en/latest/getting-started.html
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
driver = webdriver.Firefox()
driver.get("http://www.python.org")
time.sleep(3)  # Give Selenium some time to load the page
link_elements = driver.find_elements(By.TAG_NAME, 'a')
links = [link.get_attribute('href') for link in link_elements]
You can then pass the links to urllib2 (urllib.request in Python 3) to download them accordingly.
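To keep the downloads organized the way the question describes, each file's destination can be computed from its URL before downloading. This is a rough sketch using the standard library; the function names are illustrative, and the model and release folder names are assumed to come from your own scrape of the page.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the decoded last path segment of a URL, or None if unusable."""
    name = PurePosixPath(unquote(urlsplit(url).path)).name
    return name if name not in ("", "..") else None


def destination(root, model, release, url):
    """Build root/model/release/filename without touching the disk."""
    name = filename_from_url(url)
    if name is None:
        raise ValueError(f"no file name in URL: {url!r}")
    return Path(root) / model / release / name
```

Before writing a file, the folders can be created with destination(...).parent.mkdir(parents=True, exist_ok=True).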
If you need more than a script, I can recommend you a combination of Scrapy and Selenium:
selenium with scrapy for dynamic page

Here's what it is doing. I just used the standard Network inspector in Firefox to snapshot the POST operation. Bear in mind, like my other answer I pointed you to, this is not a particularly well-written website - JS/POST should not have been used at all.
First of all, here's the JS - it's very simple:
function oMd(pModel_,sModel_){
    obj=document.form1;
    obj.ModelCategory_.value=pModel_;
    obj.ModelSno_.value=sModel_;
    obj.Model_Sno.value='';
    obj.ModelVer.value='';
    obj.action='downloads2008detail.asp';
    obj.submit();
}
That writes to these fields:
<input type=hidden name=ModelCategory_ value=''>
<input type=hidden name=ModelSno_ value=''>
So, you just need a POST form, targeting this URL:
http://tsd.dlink.com.tw/downloads2008detail.asp
And here's an example set of data from FF's network analyser. There are only two items you need to change - grabbed from the JS link - and you can grab those with an ordinary scrape:
Enter=OK
ModelCategory=0
ModelSno=0
ModelCategory_=DAP
ModelSno_=1150
Model_Sno=
ModelVer=
sel_PageNo=1
OS=GPL
You'll probably find by experimentation that not all of them are necessary. I did try using GET for this, in the browser, but it looks like the target page insists upon POST.
Don't forget to leave a decent amount of time inside your scraper between clicks and submits, as each one represents a hit on the remote server; I suggest 5 seconds, emulating a human delay. If you do this too quickly - all too possible if you are on a good connection - the remote side may assume you are DoSing them, and might block your IP. Remember the motto of scraping: be a good robot!
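Put together, the steps above could look roughly like this in Python. It is a sketch under assumptions: the links are taken to look like javascript:oMd('DAP','1150'), which should be checked against the real markup, and the fixed field values are copied from the network capture above. The helper names are illustrative.

```python
import re
import urllib.parse
import urllib.request

DETAIL_URL = "http://tsd.dlink.com.tw/downloads2008detail.asp"

# Matches calls such as oMd('DAP','1150'); the exact link markup is an assumption
OMD_CALL = re.compile(r"oMd\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)")


def parse_omd_link(href):
    """Return (category, serial) from a JS link, or None if it does not match."""
    match = OMD_CALL.search(href)
    return match.groups() if match else None


def build_form_body(category, serial):
    """URL-encode the form fields seen in the network capture."""
    fields = {
        "Enter": "OK",
        "ModelCategory": "0",
        "ModelSno": "0",
        "ModelCategory_": category,   # first argument of oMd()
        "ModelSno_": serial,          # second argument of oMd()
        "Model_Sno": "",
        "ModelVer": "",
        "sel_PageNo": "1",
        "OS": "GPL",
    }
    return urllib.parse.urlencode(fields).encode("ascii")


def build_detail_request(category, serial):
    """Build the POST request; passing data makes urllib use POST."""
    return urllib.request.Request(DETAIL_URL, data=build_form_body(category, serial))
```

Each request would then be sent with urllib.request.urlopen(build_detail_request(category, serial)), with time.sleep(5) between calls to keep the delay suggested above.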
