I need to extract information from some pages whose content is generated by JavaScript. I know how to do it with Selenium, but that takes a long time and I need a more efficient way to do it.
I came across the requests-html library and it looks like a robust fit for my purposes, but unfortunately I don't seem to be able to run the JavaScript with it.
I read the documentation at https://requests-html.readthedocs.io/en/latest/
and tried the following code:
from requests_html import HTMLSession,HTML
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://drive.google.com/file/d/1rZ-DhTFPCen6DvJXlNl3Bxuwj4-ULwoa/view")
resp.html.render()
soup = BeautifulSoup(resp.html.html, 'lxml')
email = soup.find_all('img', {'class':'ndfHFb-c4YZDc-MZArnb-BA389-YLEF4c'})
print(email)
I get no results after running this code, even though the class exists when I open the link in my browser.
I've also tried sending headers with my requests, with no luck. I tried the same code (with a different HTML tag, of course) on another link (https://web.archive.org/web/*/stackoverflow.com), but I only get some HTML that includes a message saying my browser must support JavaScript.
My code for this part:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://web.archive.org/web/*/stackoverflow.com")
resp.html.render()
soup = BeautifulSoup(resp.html.html, 'lxml')
print(soup)
The response I get:
<div class="no-script-message">
The Wayback Machine requires your browser to support JavaScript, please email info@archive.org<br/>if you have any questions about this.
</div>
Any help would be appreciated.
Thanks!
Add the sleep parameter to render():
resp.html.render(sleep=2)
This should work on the site, since it gives the page's scripts time to run before the HTML is read. You mention that the code returned a response for the StackOverflow (archive.org) URL but not for the other one. Is that because the server did not respond, or because the tag you are looking for was not available at that moment? Either way, requests-html should have given you an error.
I was going to check your problem and add it to my blog post "How to use Requests-HTML", but unfortunately the link you provided is not working.
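As a rough sketch of the suggestion above, the render step and the lookup can be wrapped in one function. The function name, the 20 second timeout and the choice to return the src attributes are assumptions made for illustration; render(sleep=..., timeout=...) and html.find() are requests-html calls, and the first render() downloads a Chromium build.

```python
def find_rendered_images(url, css_class, sleep_seconds=2):
    """Render the page's JavaScript, then return the src of each matching <img>."""
    # Imported here so this module still loads where requests-html is missing.
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        resp = session.get(url)
        # sleep= pauses after the initial render so late-running scripts can finish.
        resp.html.render(sleep=sleep_seconds, timeout=20)
        # After render(), resp.html reflects the JavaScript-modified DOM,
        # so it can be searched directly with a CSS selector.
        images = resp.html.find(f"img.{css_class}")
        return [img.attrs.get("src") for img in images]
    finally:
        # Closes the HTTP session and the headless browser, if one was started.
        session.close()
```

Calling find_rendered_images() with the Drive URL and class name from the question would show whether a longer sleep is enough; an empty list means the element was still missing after the pause.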
Related
When I inspect the code on a webpage, I can see the html and the javascript. I've used Beautiful Soup to import and parse the html, but there is a large section written in javascript, which pulls variables from a programmable logic controller (PLC). I can't find the data in python after I load and parse with Beautiful Soup - it's only the html code.
The PLC is being read directly by the webpage and I see the live values updating in front of me, but I can't import them directly. The screen shot is what the code looks like from the inspect window. Let's say I want to import that variable id="aout7" with attribute class="on", how can I do that?
Webpages are best run in a browser. There are APIs for remote-controlling a browser or browser engine; a popular one is Selenium, and it has Python bindings. See https://pypi.org/project/selenium/ - the page contains instructions for installing:
pip install -U selenium
and some introductory examples, like this snippet issuing a Yahoo search:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
browser = webdriver.Firefox()
browser.get('http://www.yahoo.com')
assert 'Yahoo' in browser.title
# Selenium 4 removed find_element_by_name; use find_element(By.NAME, ...) instead
elem = browser.find_element(By.NAME, 'p')  # Find the search box
elem.send_keys('seleniumhq' + Keys.RETURN)
browser.quit()
You will need something similar, just locating the element by id (find_element(By.ID, ...) in Selenium 4, find_element_by_id in older releases; see https://selenium-python.readthedocs.io/locating-elements.html), and use the text attribute of elements to read their content.
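For the aout7 element from the question, a minimal sketch might look like the following. The helper names read_class and is_on and the 10 second timeout are assumptions; the wait is there because the PLC values are filled in by JavaScript after the page loads.

```python
def is_on(class_value):
    """Return True if the space-separated class attribute contains 'on'."""
    return "on" in (class_value or "").split()


def read_class(driver, element_id, timeout=10):
    """Wait until the element exists in the rendered page, then return its class."""
    # Imported lazily so is_on() stays usable without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Poll the live DOM until an element with this id is present,
    # raising TimeoutException if it never appears.
    element = WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.ID, element_id))
    )
    return element.get_attribute("class")
```

With a driver that has already loaded the PLC page, is_on(read_class(driver, "aout7")) reports whether the output currently carries class="on". Since the value updates live, the call has to be repeated to see changes.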
I want to write a bot that scrapes this website:
https://1xxpers100.mobi/en/line/
The problem is that when I download the page, I see template expressions such as {{if group_name}}, which made me think the site uses Django. A loop written in this template syntax creates the table rows, and the information I want is in those rows.
When I download the HTML with Python, I can't find any content there, only placeholders like "{{code}}". But in the Chrome developer tools (Inspect) and in the console, I can see the content of the table I want.
How can I get the HTML that holds the content of that table, as the Chrome tools show it, so that I can extract the information I want from this website?
This is how I download the HTML with Python:
import urllib.request
fp = urllib.request.urlopen("https://1xxpers100.mobi/en/line/")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
This should work for what you want:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://1xxpers100.mobi/en/line/')
soup = BeautifulSoup(r.content, 'lxml')
print(soup.encode("utf-8"))
Here 'lxml' is the parser I use because it worked for the site I tested it on. If you have trouble with it, just try another parser.
Another problem is that the page contains a character that the default encoding doesn't handle, so encode the contents of soup as UTF-8 when printing.
Extra Info
This has nothing to do with Django. HTML has a tree-like structure, in which each element is the parent of the elements nested immediately inside it. You just weren't reading deep enough into the tree.
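One quick way to tell whether the rows are in the downloaded HTML at all is to look for leftover {{ ... }} placeholders. This is only a diagnostic sketch: the helper name and the sample string are made up, and it assumes that placeholders still present in the response mean the table is filled in later by JavaScript in the browser.

```python
import re

# Matches {{ ... }} style placeholders left in the page source.
PLACEHOLDER = re.compile(r"\{\{.*?\}\}")


def unrendered_placeholders(html):
    """Return every {{ ... }} placeholder still present in the HTML text."""
    return PLACEHOLDER.findall(html)


# Made-up fragment imitating what the question describes.
sample = "<table><tr><td>{{if group_name}}</td><td>{{code}}</td></tr></table>"
print(unrendered_placeholders(sample))
# prints ['{{if group_name}}', '{{code}}']
```

If the list is non-empty for the real page, reading deeper into the parsed tree will not reveal the values, and a tool that runs the page's JavaScript is needed.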
Python novice here.
I am trying to scrape company information from the Dutch Transparency Benchmark website for a number of different companies, but I'm at a loss as to how to make it work. I've tried
pd.read_html("https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793")
and
requests.get("https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793")
and then working from there. However, it seems like the data is dynamically generated/queried, and thus not actually contained in the html source code these methods retrieve.
If I go to my browser's developer tools and copy the "final" html as shown there in the "Elements" tab, the whole information is in there. But as I'd like to repeat the process for several of the companies, is there any way to automate it?
Alternatively, if there's no direct way to obtain the info from the HTML, there might be a second possibility. The site lets you download the information for each individual company as an Excel file. Is it possible to somehow automatically "click" the download button and save the file somewhere? Then I might be able to loop over all the companies I need.
Please excuse me if this question is poorly worded, and thank you very much in advance.
Many thanks!
Edit: I have also tried it using BeautifulSoup, as @pmkroeker suggested. But I'm not really sure how to make it work so that it first runs all the JavaScript and the page actually contains the data.
I think you will want to either use a library that renders the page or look for the API calls behind it. For the first option, this answer seems to apply to Python. I will also copy the code from that answer for completeness.
You can pip install selenium from a command line, and then run something like:
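from urllib.request import urlopen  # Python 3; urllib2 exists only in Python 2
from selenium import webdriver
url = 'http://www.google.com'
file_name = 'C:/Users/Desktop/test.txt'
# Download the raw page and save it to a local file
with urlopen(url) as conn:
    data = conn.read()  # bytes
with open(file_name, 'wb') as f:
    f.write(data)
# Open the saved file in a real browser so that its JavaScript runs
browser = webdriver.Firefox()
browser.get('file:///' + file_name)
html = browser.page_source  # HTML after the browser has processed the page
browser.quit()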
I think you could probably skip the file write and just pass it to that browser.get call, but I'll leave that to you to find out.
The other thing you can do is look for the AJAX calls in your browser's developer tools. For example, in Chrome open the three-dot menu -> More tools -> Developer tools, or press F12, then look at the Network tab. There will be various requests. Click one, open its Preview tab, and go through the requests until you find a response that looks like JSON data. You are effectively looking for the API calls the site uses to get the data it renders. Once you find one, click the Headers tab and you will see a Request URL.
For example, https://sa-tb.nl/api/widget/chart/survey/4/sector/38 has lots of data.
The problem here is that it may or may not be repeatable (the API may change, the ids may change). You may have a similar problem with plain HTML scraping, as the HTML could change just as easily.
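Once such a Request URL is found, it can be called directly from Python without a browser. The sketch below uses only the standard library; the function names are hypothetical, and the survey/sector URL layout is inferred from the single example URL above, so it may not hold for other ids or for the per-company data.

```python
import json
from urllib.request import urlopen

# Base path copied from the Request URL above.
API_BASE = "https://sa-tb.nl/api/widget/chart"


def chart_url(survey_id, sector_id):
    """Build a chart URL following the pattern seen in the Network tab."""
    return f"{API_BASE}/survey/{survey_id}/sector/{sector_id}"


def fetch_json(url, timeout=10):
    """Download a URL and parse the response body as JSON."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


print(chart_url(4, 38))
# prints https://sa-tb.nl/api/widget/chart/survey/4/sector/38
```

Looping over several ids and passing each chart_url(...) to fetch_json() would automate the collection, as long as the endpoint keeps answering in the same form.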
I am trying to access the text in an element whose content is generated by javascript. For example getting the number of twitter shares from this site.
I've tried using urllib and pyqt to obtain the html of the page, however since the content requires javascript to be generated, its HTML is not present in the response of urllib/pyqt. I am currently using selenium for this task, however it is taking longer than I would like it to.
Is it possible to get access to this data without opening the page in a browser?
This question has already been asked in the past, but the answers I found are either C#-specific or link to a solution that has since gone dead.
Working example :
import urllib.parse  # the submodule must be imported explicitly in Python 3
import requests
import json
url = "https://daphnecaruanagalizia.com/2017/10/crook-schembri-court-today-pleading-not-crook/"
encoded = urllib.parse.quote_plus(url)
# Python 2: use `import urllib` and urllib.quote_plus(url) instead
j = requests.get('https://count-server.sharethis.com/v2.0/get_counts?url=%s' % encoded).text
obj = json.loads(j)
print(obj['clicks']['twitter'] + obj['shares']['twitter'])
# => 5008
Explanation :
Inspecting the webpage, you can see that it does a request to this :
https://count-server.sharethis.com/v2.0/get_counts?url=https%3A%2F%2Fdaphnecaruanagalizia.com%2F2017%2F10%2Fcrook-schembri-court-today-pleading-not-crook%2F&cb=stButtons.processCB&wd=true
If you paste it into your browser, you will see the raw response. Then, playing a bit with the URL, you can see that removing the extra parameters gives you plain JSON.
So, as you can see, you just have to replace the url parameter of the request with the URL of the page whose Twitter counts you want.
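To reuse this for several pages, the URL building and the sum can be pulled into two small helpers. The helper names and the sample numbers are made up; the endpoint and the clicks/shares keys are the ones used in the working example above.

```python
from urllib.parse import quote_plus

COUNT_ENDPOINT = "https://count-server.sharethis.com/v2.0/get_counts?url="


def counts_url(page_url):
    """Return the count-server URL for a page, with the page URL percent-encoded."""
    return COUNT_ENDPOINT + quote_plus(page_url)


def twitter_total(counts):
    """Add Twitter clicks and shares from a parsed get_counts response."""
    return counts["clicks"]["twitter"] + counts["shares"]["twitter"]


# Made-up response fragment, only to show the expected shape.
sample = {"clicks": {"twitter": 3}, "shares": {"twitter": 4}}
print(twitter_total(sample))
# prints 7
```

Passing counts_url(page) to requests.get() and the decoded JSON to twitter_total() reproduces the working example for any page.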
You could do something similar to the following after launching a Selenium web browser, then passing driver.page_source to the BeautifulSoup library (unfortunately cannot test this at work with firewalls in place):
soup = BeautifulSoup(driver.page_source, 'html.parser')
shares = soup.find('span', {'class': 'st_twitter_hcount'}).find('span', {'class': 'stBubble_hcount'})
Good afternoon all.
I am currently parsing this website : http://uk.easyroommate.com/results-room/loc/981238/pag/1 .
I want to get the URL of every advert in the listing. However, the listing is generated with JavaScript. I can see the adverts perfectly well in Firebug in Firefox, but I have not found any way to get them via Python. I think it is doable, but I don't know how.
EDIT : Obviously I have tried modules like BeautifulSoup, but as this is a JavaScript-generated page, they are of no use on their own.
Thank you in advance for your help.
Ads listing is generated by JavaScript. BeautifulSoup gives you this for example:
<ul class="search-results" data-bind="template: { name: 'room-template', foreach: $root.resultsViewModel.Results, as: 'resultItem' }"></ul>
I would suggest looking at: Getting html source when some html is generated by javascript and Python Scraping JavaScript using Selenium and Beautiful Soup.
Thanks to your lead here is the solution and I hope it will help someone one day :
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.get('http://uk.easyroommate.com/results-room/loc/981238/pag/1')
html_source = browser.page_source
browser.quit()
soup = BeautifulSoup(html_source,'html.parser')
print(soup.prettify())
## You are now able to see the HTML generated by the JavaScript code and you
## can extract it as usual using BeautifulSoup
for el in soup.find_all('div', class_="listing-meta listing-meta--small"):
    print(el.find('a').get('href'))
Again, in my case I just wanted to extract those links, but once you have the page source from Selenium, it is a piece of cake to use BeautifulSoup to get every item you want.
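The extraction step can also be written as a small function that takes the page source as a string, which makes it easy to try out without launching a browser. The function name and the sample markup are made up; only the class names come from the code above. Selecting a[href] directly skips any listing block that has no link, where el.find('a') would return None.

```python
from bs4 import BeautifulSoup


def listing_links(html_source):
    """Return the href of every link inside a small listing-meta block."""
    soup = BeautifulSoup(html_source, "html.parser")
    # Both classes must be present on the div; only anchors with an href match.
    anchors = soup.select("div.listing-meta.listing-meta--small a[href]")
    return [a["href"] for a in anchors]


# Made-up markup imitating the rendered listing.
sample = """
<ul class="search-results">
  <li><div class="listing-meta listing-meta--small"><a href="/room/1">Room 1</a></div></li>
  <li><div class="listing-meta listing-meta--small"><span>no link</span></div></li>
  <li><div class="listing-meta"><a href="/room/3">Other</a></div></li>
</ul>
"""
print(listing_links(sample))
# prints ['/room/1']
```

With Selenium, listing_links(browser.page_source) gives the same links as the loop above.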