When I inspect the code on a webpage, I can see the HTML and the JavaScript. I've used Beautiful Soup to import and parse the HTML, but there is a large section written in JavaScript which pulls variables from a programmable logic controller (PLC). I can't find that data in Python after I load and parse with Beautiful Soup - I only get the HTML code.
The PLC is being read directly by the webpage, and I can see the live values updating in front of me, but I can't import them directly. The screenshot shows what the code looks like in the inspect window. Say I want to import the variable with id="aout7" and attribute class="on" - how can I do that?
Webpages are best run in a browser. There are APIs for remote-controlling a browser or browser engine; a popular one is Selenium, which has Python bindings: see https://pypi.org/project/selenium/ - the page contains installation instructions:
pip install -U selenium
and some introductory examples, like this snippet that issues a Yahoo search:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
browser.get('http://www.yahoo.com')
assert 'Yahoo' in browser.title

elem = browser.find_element(By.NAME, 'p')  # Find the search box
elem.send_keys('seleniumhq' + Keys.RETURN)

browser.quit()
You will need something similar, just locating by id instead of name (https://selenium-python.readthedocs.io/locating-elements.html); use the text attribute of an element to read its content, and get_attribute('class') to read its class.
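Putting that together for the original question: once Selenium has rendered the page, you can hand browser.page_source back to Beautiful Soup, which the asker already knows. A minimal sketch of just the parsing step, using a hard-coded string in place of the rendered source - the id "aout7" and class "on" come from the question, while the markup shape and the value 23.5 are assumptions:

```python
from bs4 import BeautifulSoup

# Stand-in for browser.page_source after the page's JavaScript has run;
# the tag name and the value 23.5 are hypothetical.
rendered = '<span id="aout7" class="on">23.5</span>'

soup = BeautifulSoup(rendered, 'html.parser')
elem = soup.find(id='aout7')
print(elem.get('class'))  # ['on']
print(elem.get_text())    # 23.5
```

With a live page you would replace the string with browser.page_source from Selenium.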
Related
I have a small problem linking Python with HTML, CSS, and JavaScript. Currently, I have a website that takes the user's input and uses JavaScript to modify some of the tags so that Python can read it.
However, when I tried the following code:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://infiniteembarrasseddesign.lucatu1.repl.co/')
ex = r.html.find("#ex", first=True)
print(ex.text)
It doesn't output anything. Well, it does - just an empty element. However, my JavaScript should have replaced the textContent of the div with the user's entry. Is there a way to make Python read HTML created by JavaScript?
My operating system is Windows 10.
Here's the HTML code if you want it:
https://repl.it/#LUCATU1/InfiniteEmbarrassedDesign#index.html. The URL of the webpage is in the code.
My apologies that I cannot provide the main code for Python as it involves opening private .mdb files.
The problem is that requests.get does not render JavaScript. To test this, open Dev Tools (F12 in Chrome), type "Disable JavaScript" into the command menu, and refresh the page; you will then see exactly what your web scraper (requests) gets. You can solve this problem by using Selenium instead.
requests.get doesn't see JavaScript; you can try Selenium:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("URL")
Then get the source with driver.page_source, or crawl through the page using Selenium's own methods.
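To make the failure mode concrete: the server-sent HTML that requests receives contains the #ex div empty, because only a browser would run the JavaScript that fills it in. A small illustration with a stand-in string for the raw response (the real page's markup is assumed to look roughly like this):

```python
from bs4 import BeautifulSoup

# What requests.get sees: the HTML before any JavaScript runs.
static_html = '<div id="ex"></div>'

soup = BeautifulSoup(static_html, 'html.parser')
print(repr(soup.find(id='ex').get_text()))  # '' - hence the "empty element"
```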
I wanted to create a robot to scrape a website with this address :
https://1xxpers100.mobi/en/line/
But the problem is that when I tried to get data from this website, I realized it is using django, because they use phrases like {{if group_name}} and others. A loop created with this kind of method generates the table rows, and the information that I want is in there.
When I work with Python and download the HTML code, I can't find any content - only "{{code}}" - but when I use Chrome developer tools (inspect) and the console, I can see the content inside the table that I want.
How can I get the HTML code that holds the content of that table, like Chrome's tools show, so I can get the information I want from this website?
Here is how I currently get the code, using Python:
import urllib.request
fp = urllib.request.urlopen("https://1xxpers100.mobi/en/line/")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
This should work for what you want:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://1xxpers100.mobi/en/line/')
soup = BeautifulSoup(r.content, 'lxml')
print(soup.encode("utf-8"))
Here 'lxml' is the parser I use because it worked for the site I tested it on. If you have trouble with it, just try another parser.
Another problem is that there is a character that isn't recognized by default, so encode the contents of soup as UTF-8.
Extra Info
This has nothing to do with django. HTML has what is described as a "tree"-like structure, where each set of tags is the parent of all the children tags immediately inside it. You just weren't reading deep enough into the tree.
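For completeness, reading deeper into a table's tree with Beautiful Soup looks like this - the markup below is illustrative, not the real page's:

```python
from bs4 import BeautifulSoup

# Hypothetical table fragment; the row and cell contents are made up.
html = """
<table>
  <tr><td>Team A</td><td>1.85</td></tr>
  <tr><td>Team B</td><td>2.10</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = [[td.get_text() for td in tr.find_all('td')]
        for tr in soup.find_all('tr')]
print(rows)  # [['Team A', '1.85'], ['Team B', '2.10']]
```

Each find_all call descends one level: table rows first, then the cells inside each row.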
I'm trying to write a Python script which parses one element from a website and simply prints it.
I couldn't figure out how to achieve this without Selenium's webdriver, which opens a browser that runs the scripts needed to properly display the website.
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509')
content = browser.page_source
print(content[42000:43000])
browser.close()
This is just a rough draft which will print the contents, including the element of interest <span class="prod-price-inner">£13.00</span>.
How could I get the element of interest without the browser opening, or even without a browser at all?
edit: I've previously tried urllib and, in bash, wget, both of which lack the required JavaScript interpretation.
As other answers mentioned, this webpage requires javascript to render content, so you can't simply get and process the page with lxml, Beautiful Soup, or similar library. But there's a much simpler way to get the information you want.
I noticed that the link you provided fetches data from an internal API in a structured fashion. It appears that the product number is 910000800509, based on the URL. If you look at the Network tab in Chrome dev tools (or your browser's equivalent), you'll see that a GET request is being made to the following URL: http://groceries.asda.com/api/items/view?itemid=910000800509.
You can make the request with just the requests module:
import requests

url = 'http://groceries.asda.com/api/items/view?itemid=910000800509'
r = requests.get(url)
price = r.json()['items'][0]['price']
print(price)  # £13.00
This also gives you access to lots of other information about the product, since the request returns some JSON with product details.
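The shape of that JSON can be illustrated with a trimmed, hypothetical payload - only the fields the answer actually uses are assumed, and the real response contains many more:

```python
import json

# Hypothetical, trimmed version of what the items endpoint returns.
payload = '{"items": [{"itemid": "910000800509", "price": "£13.00"}]}'

data = json.loads(payload)
print(data['items'][0]['price'])  # £13.00
```

This is why scraping an internal API beats parsing HTML when one exists: the data arrives already structured.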
How could I get the element of interest without the browser opening,
or even without a browser at all?
After inspecting the page you're trying to parse:
http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509
I realized that it only displays the content if JavaScript is enabled; based on that, you need to use a real browser.
Conclusion:
The way to go, if you need to automate this, is:
selenium
Good afternoon all.
I am currently parsing this website: http://uk.easyroommate.com/results-room/loc/981238/pag/1
I want to get the URL of every advert in the listing. However, this listing is generated by JavaScript. I can see the URLs perfectly well via Firefox's Firebug, but I have not found any way to get them via Python. I think it is doable, but I don't know how.
EDIT: Obviously I have tried with modules like BeautifulSoup, but as it is a JavaScript-generated page, they are totally useless here.
Thank you in advance for your help.
The ad listing is generated by JavaScript. BeautifulSoup gives you this, for example:
<ul class="search-results" data-bind="template: { name: 'room-template', foreach: $root.resultsViewModel.Results, as: 'resultItem' }"></ul>
I would suggest looking at: Getting html source when some html is generated by javascript and Python Scraping JavaScript using Selenium and Beautiful Soup.
Thanks to your lead, here is the solution; I hope it will help someone one day:
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get('http://uk.easyroommate.com/results-room/loc/981238/pag/1')
html_source = browser.page_source
browser.quit()

soup = BeautifulSoup(html_source, 'html.parser')
print(soup.prettify())

## You are now able to see the HTML generated by the JavaScript code, and you
## can extract it as usual using BeautifulSoup
for el in soup.find_all('div', class_="listing-meta listing-meta--small"):
    print(el.find('a').get('href'))
Again, in my case I just wanted to extract those links, but once you have the page source via Selenium, it is a piece of cake to use BeautifulSoup and get every item you want.
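One small follow-up: hrefs scraped this way are often relative, and urllib.parse.urljoin from the standard library resolves them against the page URL. A sketch with made-up room paths:

```python
from urllib.parse import urljoin

base = 'http://uk.easyroommate.com/results-room/loc/981238/pag/1'
# Hypothetical relative links as they might come out of the soup.
hrefs = ['/room/12345', '/room/67890']

full_urls = [urljoin(base, h) for h in hrefs]
print(full_urls)
# ['http://uk.easyroommate.com/room/12345', 'http://uk.easyroommate.com/room/67890']
```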
I am attempting to extract data from an HTML table, but it appears that the HTML isn't loading correctly when using requests.get(). Instead, a line in the source reads:
"JavaScript is not enabled and therefore this page may not function correctly."
When I navigate to the page in Google Chrome, the HTML appears as it should.
How do I get a Python script to load the proper HTML?
Welcome to the wonderful world of web crawling. The problem you are experiencing is that requests.get() only gets you the initial page that the browser receives at the beginning of a page load. But this is not the page you see in the browser, since so much can be involved in forming the web page: JavaScript function calls, AJAX calls, etc.
If you want to programmatically get the HTML you see when you click "Show source" in a web browser after the page has loaded, you need a real browser. This is where Selenium can be a good option:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get(url)
print(browser.page_source)
Note that selenium itself is very powerful in terms of locating elements - you don't need a separate HTML parser for extracting the data out of the page.
Hope that helps.
If you are sure that you have to deal with JavaScript, webdriver will handle it better and save your life.
from selenium.common.exceptions import NoSuchElementException
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep

browser = webdriver.Firefox()
browser.get("http://yourwebsite.com/html-table")
browser.find_element(By.ID, "some-js-triggering-elem").click()

# Poll until the element that signals the table has loaded appears
while True:
    try:
        browser.find_element(By.ID, "elem-that-makes-you-know-that-table-is-loaded")
        break
    except NoSuchElementException:
        sleep(1)

html = browser.find_element(By.XPATH, "//*").get_attribute("outerHTML")
# Use PyQuery or something else to parse the html and get data from table
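The sleep loop above can be factored into a small generic polling helper using only the standard library - Selenium's own WebDriverWait implements the same idea with more options, so treat this as a sketch of the pattern rather than a replacement for it:

```python
import time

def wait_for(condition, timeout=10, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError('condition not met within %s seconds' % timeout)

# With Selenium this might be used as (hypothetical locator):
# table = wait_for(lambda: driver.find_elements(By.ID, "results-table"))
print(wait_for(lambda: 'table loaded'))  # table loaded
```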