Extract/decode CSS from HTML into Python - javascript

Good afternoon all.
I am currently parsing this website: http://uk.easyroommate.com/results-room/loc/981238/pag/1 .
I want to get the URL of every advert in the listing. However, the listing is generated with JavaScript. I can see the links perfectly well in Firebug in Firefox, but I have not found any way to get them via Python. I think it is doable, but I don't know how.
EDIT: Obviously I have tried modules like BeautifulSoup, but as it is a JavaScript-generated page, that is totally useless.
Thank you in advance for your help.

The ads listing is generated by JavaScript. BeautifulSoup gives you this, for example:
<ul class="search-results" data-bind="template: { name: 'room-template', foreach: $root.resultsViewModel.Results, as: 'resultItem' }"></ul>
I would suggest looking at these two questions: "Getting html source when some html is generated by javascript" and "Python Scraping JavaScript using Selenium and Beautiful Soup".
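To confirm that the listing really is absent from the raw HTML, one can count the tags nested inside the search-results container. This is a minimal sketch using only the standard library. The names ChildCounter and count_children, and the rendered_html sample, are made up for illustration; raw_html is the snippet quoted above.

```python
from html.parser import HTMLParser


class ChildCounter(HTMLParser):
    """Count the tags nested inside the first <ul class="search-results">."""

    def __init__(self):
        super().__init__()
        self.ul_depth = 0    # > 0 while we are inside the target <ul>
        self.children = 0    # start tags seen inside the target <ul>
        self.found = False   # True once the target <ul> has been seen

    def handle_starttag(self, tag, attrs):
        if self.ul_depth > 0:
            self.children += 1
            if tag == "ul":
                self.ul_depth += 1
        elif not self.found and tag == "ul":
            classes = (dict(attrs).get("class") or "").split()
            if "search-results" in classes:
                self.found = True
                self.ul_depth = 1

    def handle_endtag(self, tag):
        if tag == "ul" and self.ul_depth > 0:
            self.ul_depth -= 1


def count_children(html):
    """Return (container found?, number of tags inside it)."""
    counter = ChildCounter()
    counter.feed(html)
    counter.close()
    return counter.found, counter.children


# What a plain HTTP download contains (from the answer above).
raw_html = (
    '<ul class="search-results" data-bind="template: { name: \'room-template\', '
    'foreach: $root.resultsViewModel.Results, as: \'resultItem\' }"></ul>'
)
# Hypothetical example of the same container after JavaScript has filled it in.
rendered_html = '<ul class="search-results"><li><a href="/advert/1">Room</a></li></ul>'

print(count_children(raw_html))       # (True, 0)
print(count_children(rendered_html))  # (True, 2)
```

A container that is found but has zero children is a strong hint that the content is filled in by JavaScript after the page loads.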

Thanks to your lead here is the solution and I hope it will help someone one day :
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.get('http://uk.easyroommate.com/results-room/loc/981238/pag/1')
html_source = browser.page_source  # HTML after the JavaScript has run
browser.quit()
soup = BeautifulSoup(html_source, 'html.parser')
print(soup.prettify())
# You are now able to see the HTML generated by the JavaScript code and
# you can extract it as usual using BeautifulSoup
for el in soup.find_all('div', class_="listing-meta listing-meta--small"):
    print(el.find('a').get('href'))
Again, in my case I just wanted to extract those links, but once you have the web page source via Selenium, it is a piece of cake to use BeautifulSoup to get every item you want.
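One possible caveat: page_source is read right after get() returns, and on some pages the JavaScript may not have finished filling in the listing by then. Whether that happens on this particular site is an assumption, not something tested here. A minimal sketch of a polling helper follows; the name wait_for_source and its parameters are invented for illustration, and it only relies on the driver having a page_source attribute.

```python
import time


def wait_for_source(driver, marker, timeout=10.0, interval=0.5):
    """Return driver.page_source once it contains `marker`.

    Raises TimeoutError if the marker has not appeared after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not found after {timeout} s")
        time.sleep(interval)


# Usage with a real browser (needs Selenium and a Firefox driver installed):
#   from selenium import webdriver
#   browser = webdriver.Firefox()
#   browser.get('http://uk.easyroommate.com/results-room/loc/981238/pag/1')
#   html_source = wait_for_source(browser, 'listing-meta')
#   browser.quit()
```

Selenium also ships its own WebDriverWait class together with expected_conditions, which serves the same purpose and is usually the better choice when you can wait for a specific element.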

Related

Reading HTML created by JavaScript with Python

I have a small problem with linking Python with HTML, CSS, and JavaScript: Currently, I have a website that takes in the user's input, and uses JavaScript to modify some of the tags, so that Python can read it.
However, when I tried the following code:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://infiniteembarrasseddesign.lucatu1.repl.co/')
ex = r.html.find("#ex", first=True)
print(ex.text)
It doesn't output anything. Well, it does, just an empty element. However, my JavaScript should have replaced the textContent of the div with the user's entry. Is there a way to make Python read HTML created by JavaScript?
My operating system is windows 10.
Here's the HTML code if you want it:
https://repl.it/@LUCATU1/InfiniteEmbarrassedDesign#index.html. The URL of the webpage is in the code.
My apologies that I cannot provide the main code for Python as it involves opening private .mdb files.
The problem is that requests.get does not execute JavaScript. To see what your web scraper (requests) receives, open Dev Tools in Chrome (F12), open the Command Menu (Ctrl+Shift+P), run "Disable JavaScript", and refresh the page. You can solve this problem by using Selenium instead.
requests.get doesn't see JavaScript; you can try Selenium:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("URL")
Then get the source using driver.page_source, or crawl through the page using Selenium methods.

Can't run JavaScript using requests-html library on python

I need to pull out some information from some links that contain some javascript code. I know how to do it with Selenium, but it takes a lot of time and I need more efficient way to pull this off.
I came across the requests-html library and it looks like quite a robust option for my purposes, but unfortunately it doesn't look like I'm able to run the JavaScript with it.
I read the documentation from the following link https://requests-html.readthedocs.io/en/latest/
And tried the following code:
from requests_html import HTMLSession,HTML
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://drive.google.com/file/d/1rZ-DhTFPCen6DvJXlNl3Bxuwj4-ULwoa/view")
resp.html.render()
soup = BeautifulSoup(resp.html.html, 'lxml')
email = soup.find_all('img', {'class':'ndfHFb-c4YZDc-MZArnb-BA389-YLEF4c'})
print(email)
I get no results after running this code, even though the class exists if I open the link from my browser.
I've also tried using headers with my requests with no help. I tried the same code (with different html tag, of course) for another link (https://web.archive.org/web/*/stackoverflow.com) but I get some html text including a response that says that my browser must support javascript.
My code for this part:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://web.archive.org/web/*/stackoverflow.com")
resp.html.render()
soup = BeautifulSoup(resp.html.html, 'lxml')
print(soup)
The response I get:
<div class="no-script-message">
The Wayback Machine requires your browser to support JavaScript, please email info@archive.org<br/>if you have any questions about this.
</div>
Any help would be appreciated.
Thanks!
In render, add the sleep parameter:
resp.html.render(sleep=2)
This should work on the site. As you mentioned, the code worked for Stack Overflow but did not work for the other URL. That may be because the server did not respond, or because the tag you are looking for was not available at that time. Either way, requests-html should have given you an error.
I was about to check your problem and add it to my blog post "How to use Requests-HTML", but unfortunately the link you provided is not working.

How to load javascript values into python from web page?

When I inspect the code on a webpage, I can see the html and the javascript. I've used Beautiful Soup to import and parse the html, but there is a large section written in javascript, which pulls variables from a programmable logic controller (PLC). I can't find the data in python after I load and parse with Beautiful Soup - it's only the html code.
The PLC is being read directly by the webpage and I see the live values updating in front of me, but I can't import them directly. The screen shot is what the code looks like from the inspect window. Let's say I want to import that variable id="aout7" with attribute class="on", how can I do that?
Webpages are best run in a browser. There are APIs for remote-controlling a browser or browser engine; a popular one is Selenium, which has Python bindings. See https://pypi.org/project/selenium/; the page contains instructions for installing:
pip install -U selenium
and some introductory examples, like this snippet issuing a Yahoo search:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
browser = webdriver.Firefox()
browser.get('http://www.yahoo.com')
assert 'Yahoo' in browser.title
# Selenium 4 removed find_element_by_name; use find_element with a By locator
elem = browser.find_element(By.NAME, 'p')  # Find the search box
elem.send_keys('seleniumhq' + Keys.RETURN)
browser.quit()
You will need something similar, just with find_element(By.ID, ...) (find_element_by_id in Selenium versions before 4; see https://selenium-python.readthedocs.io/locating-elements.html), and use the text attribute of elements to read their content.
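Applied to the element from the question, reading the class attribute and the text of id="aout7" could look roughly like the sketch below. The function name read_state is made up for illustration. The driver argument is expected to be a Selenium WebDriver, and the string "id" is the value of Selenium's By.ID locator constant, used here so the helper itself has no imports.

```python
def read_state(driver, element_id):
    """Return (class attribute, visible text) of the element with this id.

    `driver` is expected to be a Selenium WebDriver; "id" is the string
    value of Selenium's By.ID locator strategy.
    """
    element = driver.find_element("id", element_id)
    return element.get_attribute("class"), element.text


# Usage with a real browser (needs Selenium and a Firefox driver installed);
# the address below is a placeholder for the page served by the PLC:
#   from selenium import webdriver
#   browser = webdriver.Firefox()
#   browser.get("http://<address of the PLC page>")
#   print(read_state(browser, "aout7"))
#   browser.quit()
```

Because the page keeps updating, calling read_state again on the same open browser should return the current value each time, without reloading the page.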

how to scrape websites that using django

I wanted to create a robot to scrape a website with this address:
https://1xxpers100.mobi/en/line/
The problem is that when I tried to get data from this website, I realized that it is using Django, because it contains phrases like {{if group_name}} and others. There is a loop created with this kind of method that creates the table rows, and the information I want is there.
When I download the HTML code with Python, I can't find any content there except "{{code}}", but when I use Chrome developer tools (Inspect) and the console, I can see the content inside the table that I want.
How can I get the HTML that holds the content of that table, as shown in Chrome tools, so that I can get the information I want from this website?
This is how I get the HTML using Python:
import urllib.request
fp = urllib.request.urlopen("https://1xxpers100.mobi/en/line/")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
This should work for what you want:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://1xxpers100.mobi/en/line/')
soup = BeautifulSoup(r.content, 'lxml')
print(soup.encode("utf-8"))
Here 'lxml' is what I use because it worked for the site I tested it on. If you have trouble with it, just try another parser.
Another problem is that there is a character that isn't recognized by default, so read the contents of soup using UTF-8.
Extra Info
This has nothing to do with Django. HTML has what is described as a "tree"-like structure, where each set of tags is the parent of all the child tags immediately inside it. You just weren't reading deep enough into the tree.
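If the downloaded HTML still shows only "{{code}}"-style text, as the question describes, a quick check like the following can tell whether the rows are really in the response or whether only unfilled template placeholders came back. This is a minimal sketch under that assumption; the name unrendered_placeholders and the sample string are made up for illustration.

```python
import re

# Matches {{ ... }} placeholders that have not been replaced with real values.
PLACEHOLDER = re.compile(r"\{\{.*?\}\}", re.DOTALL)


def unrendered_placeholders(html):
    """Return every {{ ... }} placeholder still present in the HTML text."""
    return PLACEHOLDER.findall(html)


# Hypothetical fragment resembling what the question describes.
sample = "<tr><td>{{if group_name}}</td><td>{{code}}</td></tr>"
print(unrendered_placeholders(sample))  # ['{{if group_name}}', '{{code}}']
```

If placeholders remain in the response, the table is presumably filled in by JavaScript in the browser, and a browser-driven tool such as Selenium would be needed to see the final rows.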

Scraping elements generated by javascript queries using python

I am trying to access the text in an element whose content is generated by javascript. For example getting the number of twitter shares from this site.
I've tried using urllib and pyqt to obtain the html of the page, however since the content requires javascript to be generated, its HTML is not present in the response of urllib/pyqt. I am currently using selenium for this task, however it is taking longer than I would like it to.
Is it possible to get access to this data without opening the page in a browser?
This question has already been asked in the past, but the results I found are either C#-specific or provide a link to a solution that has since gone dead.
Working example :
import json
import urllib.parse
import requests
url = "https://daphnecaruanagalizia.com/2017/10/crook-schembri-court-today-pleading-not-crook/"
encoded = urllib.parse.quote_plus(url)
# On Python 2, use "import urllib" and urllib.quote_plus(url) instead
j = requests.get('https://count-server.sharethis.com/v2.0/get_counts?url=%s' % encoded).text
obj = json.loads(j)
print(obj['clicks']['twitter'] + obj['shares']['twitter'])
# => 5008
Explanation :
Inspecting the webpage, you can see that it makes a request to this URL:
https://count-server.sharethis.com/v2.0/get_counts?url=https%3A%2F%2Fdaphnecaruanagalizia.com%2F2017%2F10%2Fcrook-schembri-court-today-pleading-not-crook%2F&cb=stButtons.processCB&wd=true
If you paste it into your browser, you'll have all your answers. Then, playing a bit with the URL, you can see that removing the extra parameters gives you clean JSON.
So, as you can see, you just have to replace the url parameter of the request with the URL of the page whose Twitter counts you want.
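The two steps (building the request URL and summing the Twitter numbers) can be separated into small helpers so they can be checked without a network call. This is only a sketch: the names counts_url and twitter_total are invented here, and the JSON passed in the example is made-up data that merely follows the clicks/shares layout used in the working example above.

```python
import json
import urllib.parse

COUNTS_ENDPOINT = "https://count-server.sharethis.com/v2.0/get_counts"


def counts_url(page_url):
    """Build the count-server request URL for one page."""
    return COUNTS_ENDPOINT + "?url=" + urllib.parse.quote_plus(page_url)


def twitter_total(payload):
    """Sum the Twitter clicks and shares from the JSON text of a response."""
    data = json.loads(payload)
    return data["clicks"]["twitter"] + data["shares"]["twitter"]


page = "https://daphnecaruanagalizia.com/2017/10/crook-schembri-court-today-pleading-not-crook/"
print(counts_url(page))

# Made-up payload, only to show the expected layout.
print(twitter_total('{"clicks": {"twitter": 2}, "shares": {"twitter": 3}}'))  # 5
```

With requests, the live call would then be twitter_total(requests.get(counts_url(page)).text), assuming the endpoint still answers in the same format.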
You could do something similar to the following after launching a Selenium web browser, then passing driver.page_source to the BeautifulSoup library (unfortunately cannot test this at work with firewalls in place):
soup = BeautifulSoup(driver.page_source, 'html.parser')
shares = soup.find('span', {'class': 'st_twitter_hcount'}).find('span', {'class': 'stBubble_hcount'})
