How to scrape websites that use Django-style templates / JavaScript

I wanted to create a robot to scrape a website at this address:
https://1xxpers100.mobi/en/line/
The problem is that when I tried to get data from this website, I realized it seems to be using Django, because the source contains phrases like {{if group_name}} and others.
A loop built with this kind of template creates the table rows, and the information I want is in there.
When I download the HTML with Python I can't find any content, only "{{code}}" placeholders. But when I use Chrome developer tools (Inspect) and look in the console, I can see the content inside the table I want.
How can I get the HTML that holds the content of that table, like Chrome's tools show, so I can extract the information I want from this website?
The way I fetch the page with Python:
import urllib.request
fp = urllib.request.urlopen("https://1xxpers100.mobi/en/line/")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()

This should work for what you want:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://1xxpers100.mobi/en/line/')
soup = BeautifulSoup(r.content, 'lxml')
print(soup.encode("utf-8"))
Here 'lxml' is the parser I use because it worked for the site I tested it on. If you have trouble with it, just try another parser (e.g. 'html.parser').
Another problem is that the page contains a character that isn't recognized by default, so read the contents of soup encoded as UTF-8.
Extra Info
This has nothing to do with Django. HTML has what is described as a "tree"-like structure, where each set of tags is the parent of all child tags immediately inside it. You just weren't reading deep enough into the tree.

Related

How to load javascript values into python from web page?

When I inspect the code on a webpage, I can see the html and the javascript. I've used Beautiful Soup to import and parse the html, but there is a large section written in javascript, which pulls variables from a programmable logic controller (PLC). I can't find the data in python after I load and parse with Beautiful Soup - it's only the html code.
The PLC is being read directly by the webpage and I see the live values updating in front of me, but I can't import them directly. The screen shot is what the code looks like from the inspect window. Let's say I want to import that variable id="aout7" with attribute class="on", how can I do that?
Web pages are best run in a browser. There are APIs for remote-controlling a browser/browser engine; a popular one is Selenium, which has Python bindings: see https://pypi.org/project/selenium/ - the page contains installation instructions:
pip install -U selenium
and some introductory examples, like this snippet issuing a Yahoo search:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

browser = webdriver.Firefox()
browser.get('http://www.yahoo.com')
assert 'Yahoo' in browser.title
elem = browser.find_element(By.NAME, 'p')  # Find the search box
elem.send_keys('seleniumhq' + Keys.RETURN)
browser.quit()
You will need something similar, just locating the element by its id instead (https://selenium-python.readthedocs.io/locating-elements.html), and using the text attribute of elements to read their content.

python javascript scrape automatically

Python novice here.
I am trying to scrape company information from the Dutch Transparency Benchmark website for a number of different companies, but I'm at a loss as to how to make it work. I've tried
pd.read_html("https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793")
and
requests.get("https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793")
and then working from there. However, it seems like the data is dynamically generated/queried, and thus not actually contained in the html source code these methods retrieve.
If I go to my browser's developer tools and copy the "final" html as shown there in the "Elements" tab, the whole information is in there. But as I'd like to repeat the process for several of the companies, is there any way to automate it?
Alternatively, if there's no direct way to obtain the info from the html, there might be a second possibility. The site allows to download the information as an Excel-file for each individual company. Is it possible to somehow automatically "click" the download button and save the file somewhere? Then I might be able to loop over all the companies I need.
Please excuse if this question is poorly worded, and thank you very much in advance
Tusen takk! (Thanks a lot!)
Edit: I have also tried it using BeautifulSoup, as @pmkroeker suggested. But I'm not really sure how to make it work so that it first runs all the JavaScript and the page actually contains the data.
I think you will want to use a library to render the page. This answer seems to apply to Python; I will copy the code from that answer for completeness.
You can pip install selenium from a command line, and then run something like:
from selenium import webdriver
from urllib.request import urlopen

url = 'http://www.google.com'
file_name = 'C:/Users/Desktop/test.html'

# Download the raw HTML and save it locally
with urlopen(url) as conn:
    data = conn.read()
with open(file_name, 'wb') as f:
    f.write(data)

# Open the saved file in a real browser so any JavaScript runs
browser = webdriver.Firefox()
browser.get('file:///' + file_name)
html = browser.page_source
browser.quit()
I think you could probably skip the file write and just pass the URL straight to that browser.get call, but I'll leave that to you to find out.
The other thing you can do is look for the AJAX calls in the browser developer tools, i.e. when using Chrome: the three dots -> More tools -> Developer tools, or press F12. Then look at the Network tab. There will be various requests. Click one, click the Preview tab, and go through each until you find a response that looks like JSON data. You are effectively looking for the API calls the site uses to get the data it renders. Once you find one, click the Headers tab and you will see a Request URL.
For example, https://sa-tb.nl/api/widget/chart/survey/4/sector/38 returns lots of data.
The problem here is it may or may not be repeatable (the API may change, IDs may change). But you have a similar problem with plain HTML scraping, as the HTML could change just as easily.

Scraping elements generated by javascript queries using python

I am trying to access the text in an element whose content is generated by JavaScript, for example getting the number of Twitter shares from this site.
I've tried using urllib and PyQt to obtain the HTML of the page; however, since the content requires JavaScript to be generated, it is not present in the response from urllib/PyQt. I am currently using Selenium for this task, but it is taking longer than I would like.
Is it possible to get access to this data without opening the page in a browser?
This question has been asked in the past, but the results I found are either C#-specific or link to a solution that has since gone dead.
Working example:
import urllib.parse
import requests
import json
url = "https://daphnecaruanagalizia.com/2017/10/crook-schembri-court-today-pleading-not-crook/"
encoded = urllib.parse.quote_plus(url)
# encoded = urllib.quote_plus(url) # for python 2 replace previous line by this
j = requests.get('https://count-server.sharethis.com/v2.0/get_counts?url=%s' % encoded).text
obj = json.loads(j)
print(obj['clicks']['twitter'] + obj['shares']['twitter'])
# => 5008
Explanation :
Inspecting the web page, you can see that it makes a request to this:
https://count-server.sharethis.com/v2.0/get_counts?url=https%3A%2F%2Fdaphnecaruanagalizia.com%2F2017%2F10%2Fcrook-schembri-court-today-pleading-not-crook%2F&cb=stButtons.processCB&wd=true
If you paste it into your browser, you'll have all your answers. Playing a bit with the URL, you can see that removing the extra parameters gives you nice JSON.
So, as you can see, you just have to replace the url parameter of the request with the URL of the page whose Twitter counts you want.
You could do something similar to the following after launching a Selenium web browser, then passing driver.page_source to the BeautifulSoup library (unfortunately cannot test this at work with firewalls in place):
soup = BeautifulSoup(driver.page_source, 'html.parser')
shares = soup.find('span', {'class': 'st_twitter_hcount'}).find('span', {'class': 'stBubble_hcount'})

Simple login function for XBMC (Python) issue

I'm trying to scrape sections of a JavaScript calendar page through Python (XBMC/Kodi).
So far I've been able to scrape static HTML parts but not the JavaScript-generated sections.
The variables I'm trying to retrieve are <strong class="item-title">**this**</strong>, <span class="item-daterange">**this**</span> and <div class="item-location">**this**</div>; note that they are in separate sections of the HTML source and rendered through JavaScript. All of the scraped variables should be appended into one string and displayed.
response = net.http_GET('my URL')
link = response.content
match = re.compile('<strong class="gcf-item-title">(.+?)</strong>').findall(link)
for name in match:
    print name
With the regex above I can scrape just one of those variables; since I need a string list with all the variables displayed together, how can that be done?
I get that the page has to be pre-rendered for the JavaScript variables to be scraped, but since I'm using XBMC I'm not sure how I can import additional Python libraries such as dryscrape to get this done. Downloading dryscrape gives me a setup.py and __init__.py file along with some others, but how can I use them all together?
Thanks.
Is your question about the steps to scrape the JavaScript, how to use Python on XBMC/Kodi, or how to install packages that come with a setup.py file?
Just based on your regex above: if your entries are always like <strong class="item-title">**this**</strong>, you won't get a match, since your re pattern is for elements with class="gcf-item-title".
Are you using or able to use BeautifulSoup? If you're not using it, but can, you should--it's life changing in terms of scraping websites.

Extract/decode CSS from HTML into Python

Good afternoon all.
I am currently parsing this website: http://uk.easyroommate.com/results-room/loc/981238/pag/1 .
I want to get the listing of every URL of each advert. However, this listing is generated by JavaScript. I can see them perfectly via Firefox's Firebug, but I have not found any way to get them via Python. I think it is doable but I don't know how.
EDIT: Obviously I have tried modules like BeautifulSoup, but as it is a JavaScript-generated page, it is totally useless.
Thank you in advance for your help.
Ads listing is generated by JavaScript. BeautifulSoup gives you this for example:
<ul class="search-results" data-bind="template: { name: 'room-template', foreach: $root.resultsViewModel.Results, as: 'resultItem' }"></ul>
I would suggest looking at: Getting html source when some html is generated by javascript and Python Scraping JavaScript using Selenium and Beautiful Soup.
Thanks to your lead, here is the solution; I hope it will help someone one day:
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get('http://uk.easyroommate.com/results-room/loc/981238/pag/1')
html_source = browser.page_source
browser.quit()

soup = BeautifulSoup(html_source, 'html.parser')
print(soup.prettify())

# You are now able to see the HTML generated by the JavaScript code and
# can extract it as usual using BeautifulSoup
for el in soup.findAll('div', class_="listing-meta listing-meta--small"):
    print(el.find('a').get('href'))
Again, in my case I just wanted to extract those links, but once you have the page source via Selenium, it's a piece of cake to use BeautifulSoup and get every item you want.
