BeautifulSoup Scraping: loading div instead of the content - javascript

Noob here.
I'm trying to scrape search results from this website: http://www.mastersportal.eu/search/?q=di-4|lv-master&order=relevance
I'm using Python's BeautifulSoup:
import csv
import requests
from bs4 import BeautifulSoup
for numb in ('0', '69'):
    url = ('http://www.mastersportal.eu/search/?q=ci-30,11,10,3,4,8,9,14,15,16,17,34,1,19|di-4|lv-master|rv-1&start=' + numb + '0&order=tuition_eea&direction=asc')
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('div', attrs={'id': 'StudySearchResults'})
    lista = []
    # collect the text of every <h3> inside the results container
    for h3 in table.findAll('h3'):
        lista.append(h3.string)
    print(table.prettify())
I want to get clean data with the basic information about the Master (for now just the name).
The URL I'm using here is for a filtered search on the website, and the loop over the result pages should be fine.
However, the results are:
<div id="StudySearchResults">
<div style="display:none" id="TrackingSearchValue" class="TrackingSearchValue" data-search=""></div>
<div style="display:none" id="SearchViewEvent" class="TrackingEvent TrackingNoLocation" data-type="srch" data-action="view" data-id=""></div>
<div id="StudySearchResultsStudies" class="TrackingLinkedList" data-start="" data-list-type="study" data-type="rslts">
<!-- Wait pane, just here to make sure there is no white page -->
<div id="WaitPane" class="WaitPane">
<img src="http://www.mastersportal.eu/Modules/Results/Resources/Throbber.gif" />
<span>Loading search results...</span>
</div>
</div>
</div>
Why isn't the content displaying but only the loading div? Reading around I feel it has something to do with the way the website handles data with JavaScript, does something like an AJAX request exist for Python? (or any other way to tell the scraper to wait for the page to load?)

If you want only the text, you should do this
lista.append(h3.get_text())
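Regarding your second question, jsfan's answer is right. You should try Selenium and use its wait feature to wait for your search results, which appear in divs with the class names Result master premium:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//div[contains(@class, 'Result master premium')]"))
)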

You have basically answered your own question. Beautiful Soup only parses the HTML that the server returns for a specific URL (here, whatever requests downloads); it does not execute any JavaScript.
If you want to render the page as it is shown in a browser, you will need to use something like Selenium WebDriver, which starts up an actual browser and remote-controls it.
WebDriver is very powerful, but it has a much steeper learning curve than plain scraping.
If you want to get into using Webdriver with Python, the official documentation is a good place to start.

Related

Scrapy: extracting data from script tag

I am new to Scrapy. I am trying to scrape contents from 'https://www.tysonprop.co.za/agents/' for work purposes.
In particular, the information I am looking for seems to be generated by a script tag.
The line: <%= branch.branch_name %>
resolves to: Tyson Properties Head Office
at run time.
I am trying to access the text generated inside the h2 element at run time.
However, the Scrapy response object seems to grab the raw source code. I.e. the data I want appears as <%= branch.branch_name %> and not "Tyson Properties Head Office".
Any help would be appreciated.
HTML response object extract:
<script type="text/html" id="id_branch_template">
<div id="branch-<%= branch.id %>" class="clearfix margin-top30 branch-container" style="display: none;">
<h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>
<div class="branch-agents container_12 first last clearfix">
<div id="agents-list-left" class="agents-list left grid_6">
</div>
<div id="agents-list-right" class="agents-list right grid_6">
</div>
</div>
</div>
</script>
<script type="text/javascript">
Current Scrapy spider code:
import scrapy
from scrapy.crawler import CrawlerProcess
class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'
    def start_requests(self):
        url = 'https://www.tysonprop.co.za/agents/'
        yield scrapy.Request(url=url, callback=self.parse)
    def parse(self, response):
        script = response.xpath('//script[@id="id_branch_template"]/text()').get()
        div = scrapy.Selector(text=script).xpath('//div[contains(@class,"branch-container")]')
        h2 = div.xpath('./h2[contains(@class,"branch-name")]')
Related to this question:
Scrapy xpath not extracting div containing special characters <%=
As the accepted answer on the related question suggests, consider using the AJAX endpoint.
If that doesn't work for you, consider using Splash.
The data seems to be downloaded with AJAX and added to the page with JavaScript. Scrapy can use Splash to execute JS on the page.
For example, this should work just fine after that.
h2 = div.xpath('./h2[contains(@class,"branch-name")]')
The docs have instructions for installing Splash but after getting it up and running, code changes to the actual crawler are pretty minimal.
Install scrapy-splash with
pip install scrapy-splash
Add some configurations to settings.py (listed on the Github page) and finally use SplashRequest instead of scrapy.Request.
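The settings mentioned above look roughly like this. It is a sketch based on the scrapy-splash README, and the SPLASH_URL value assumes a Splash instance listening locally on port 8050; check the README for the current values before relying on it.

```python
# settings.py (fragment) -- assumes Splash is reachable at localhost:8050
SPLASH_URL = "http://localhost:8050"

# Middlewares that route requests through Splash and keep cookies in sync.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoids sending duplicate Splash arguments with every request.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Duplicate filter that takes Splash arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

In the spider, the request would then be something like yield SplashRequest(url, self.parse, args={'wait': 1}), with SplashRequest imported from scrapy_splash; the wait value is only a guess at how long the page needs.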
If that doesn't work for you, maybe check out Selenium or Pyppeteer.
Also, the HTML response doesn't have "Tyson Properties Head Office" in it before executing JS (i.e. inside a script) except as a dropdown menu item, which probably isn't that useful, so it can't be extracted from the response.

Dynamic Data Web Scraping with Python, BeautifulSoup

I am trying to extract a number from the HTML of many pages. The number is different for each page. When I use soup.select('span[class="pull-right"]') it should give me the number, but I only get other tags back. I believe this is because JavaScript is used on the web page. 180,476 is the value at this position in the HTML that I want, for many pages:
<div class="legend-block--body">
<div class="linear-legend--counts">
Pageviews:
<span class="pull-right">
180,476
</span>
</div>
<div class="linear-legend--counts">
Daily average:
<span class="pull-right">
8,594
</span>
</div></div>
My code(this is in a loop to work for many pages):
res = requests.get(wiki_page, timeout =None)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
ab=soup.select('span[class="pull-right"]')
print(ab)
output:
[<span class="pull-right">\n<label class="logarithmic-scale">\n<input
class="logarithmic-scale-option" type="checkbox"/>\n Logarithmic scale
</label>\n</span>, <span class="pull-right">\n<label class="begin-at-
zero">\n<input class="begin-at-zero-option" type="checkbox"/>\n Begin at
zero </label>\n</span>, <span class="pull-right">\n<label class="show-
labels">\n<input class="show-labels-option" type="checkbox"/>\n Show
values </label>\n</span>]
Example URL:https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Star_Wars:_The_Last_Jedi
I want the Pageviews
The JavaScript code won't be executed if you retrieve the page with requests.get, so Selenium should be used instead. It mimics user behaviour by opening the page in a browser, so the JS code will be executed.
To start with Selenium, install it with pip install selenium. Then use the code below to retrieve your item:
from selenium import webdriver
from selenium.webdriver.common.by import By
browser = webdriver.Firefox()
# List of (page URL, CSS selector of the element to retrieve) pairs.
wiki_pages = [("https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Star_Wars:_The_Last_Jedi",
               ".summary-column--container .legend-block--pageviews .linear-legend--counts:first-child span.pull-right"),]
for wiki_page in wiki_pages:
    url = wiki_page[0]
    selector = wiki_page[1]
    browser.get(url)
    page_views_count = browser.find_element(By.CSS_SELECTOR, selector)
    print(page_views_count.text)
browser.quit()
NOTE: If you need to run headless browser, consider using PyVirtualDisplay (a wrapper for Xvfb) to run headless WebDriver tests, see 'How do I run Selenium in Xvfb?' for more information.
You should try the Python package Selenium.
It requires you to download a driver for whatever browser you are using.
You will then be able to use Selenium to pull values out of the rendered HTML:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get("https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Star_Wars:_The_Last_Jedi")
element = driver.find_element(By.CLASS_NAME, "pull-right")
# or one of the following
# element = driver.find_element(By.NAME, "q")
# element = driver.find_element(By.ID, "html ID name")
# element = driver.find_element(By.NAME, "html element name")
# element = driver.find_element(By.XPATH, "//input[@id='passwd-id']")
print(element.text)
driver.close()
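As an alternative to driving a browser, the same counts can probably be read from the Wikimedia Pageviews REST API. The sketch below is not taken from either answer above: the function names are made up for illustration, the User-Agent string is a placeholder, and the sample payload is invented to mirror the shape of the API's response (a list of items, each with a views field).

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def build_url(article, start, end, project="en.wikipedia.org"):
    """Build a per-article pageviews URL; start and end are YYYYMMDD strings."""
    title = urllib.parse.quote(article, safe="")
    return f"{API_ROOT}/{project}/all-access/user/{title}/daily/{start}/{end}"


def total_views(payload):
    """Sum the daily 'views' values in a decoded API response."""
    return sum(item["views"] for item in payload.get("items", []))


def fetch_total(article, start, end):
    """Download the data and return the total (needs network access)."""
    request = urllib.request.Request(
        build_url(article, start, end),
        # Placeholder contact string; the API asks clients to identify themselves.
        headers={"User-Agent": "pageviews-example/0.1 (you@example.com)"},
    )
    with urllib.request.urlopen(request) as response:
        return total_views(json.load(response))


# Offline check with an invented payload in the API's response shape.
sample = {"items": [{"views": 100}, {"views": 250}]}
print(total_views(sample))  # 350
```

For many pages, calling fetch_total in a loop over the article titles avoids starting a browser at all; the date range has to be supplied explicitly, because the "latest-20" shortcut belongs to the web tool and not to the API.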

Copy phonenumber with python

I'm struggling with Python 3. I taught myself the basics, and now I'm learning WebDriver and bs4. FUN!
I want to scrape the phone number from a page. In other cases I got a working script, but now I'm on a page that gives me a headache!
I think the problem is that the content is loaded dynamically. (It's not in the page source.)
This is the page: https://www.dnls.nl/locatie/diergaarde-blijdorp-rotterdam
There is an element with this text: Toon telefoonnummer. I can click it with:
driver.find_element_by_link_text("Toon telefoonnummer").click()
After the click the phone number is visible in the browser, but I have been trying for hours to grab it with XPath and CSS selectors and I can't get it!
In the page source I see this:
<div class="show_onclick">
<div class="text-center telephone-field"><a class="set-align" :href="'tel:'+project.contact_phone">{{project.contact_phone}}</a></div>
<div class="text-center">{{project.contact_textline}}</div>
</div>
This is my last code:
from selenium import webdriver
try:
    # telefoonnummer
    driver.find_element_by_link_text("Toon telefoonnummer").click()  # This works
    driver.implicitly_wait(5)
    telefoonnummer = driver.find_element_by_xpath(".//*[@id='main-inner']/div[1]/div[1]/div/div[2]/ul[1]/li[3]/div/div[1]/a").text
    print(telefoonnummer)
except:
    print("")
Is there a way to scrape this kind of content?
UPDATE: I found the data in a JavaScript block in the head of the page source. It's a massive script and it contains:
"contact_name":"Blijdorp Happenings","contact_phone":"010 4431415","contact_email":"sales@diergaardeblijdorp.nl"
My goal: I want to find the phone number, copy it, and save it in a variable 'telefoonnummer'.
Kind Regards!
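Since the UPDATE shows that the number is already present in a script in the page source, one option is to read it from there instead of from the rendered element. The following is a minimal sketch, assuming the data appears as a "key":"value" pair exactly as quoted; extract_field is a hypothetical helper name, and in the real script the input would be driver.page_source rather than the short sample string.

```python
import json
import re


def extract_field(page_source, field):
    """Return the string value of "field" from JSON embedded in page_source, or None."""
    # Match "field":"value", allowing escaped characters inside the value.
    pattern = r'"%s"\s*:\s*("(?:[^"\\]|\\.)*")' % re.escape(field)
    match = re.search(pattern, page_source)
    if match is None:
        return None
    # Decode the quoted JSON string so escape sequences are handled properly.
    return json.loads(match.group(1))


# Fragment quoted in the question, standing in for driver.page_source.
source = '"contact_name":"Blijdorp Happenings","contact_phone":"010 4431415"'
telefoonnummer = extract_field(source, "contact_phone")
print(telefoonnummer)  # 010 4431415
```

This avoids the click and the wait entirely, but it depends on the site continuing to embed the data in that form.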

Python Requests getting all html data from site

I am trying to get product data from Metal Mulisha. I have a list of product IDs that I need to find data on, so I use Python with the requests package and the search URL "http://www.metalmulisha.com/shop/search/?q=20M35518334Z%20M45518403Z%20M45518415Z"
I then use BeautifulSoup to find the class and data I need, but I get an error that says there was nothing there.
So I first went to the URL in Chrome and inspected the elements; all the information I needed was in the HTML that Chrome showed.
Here is a snippet of what Chrome showed.
<div class="col-md-10 col-md-push-2">
<div data-rfkid="rfkid_7" data-keyphrase="20M35518334Z M45518403Z M45518415Z" class="rfk_sp rfk-sp">
<div class="rfk_sp_container" data-nrp="2" data-ntp="2" data-pg="1" data-status="2" rfk_track_appear_once="f=sp,rfkid=rfkid_7,a=1,c=1">
<div class="rfk_header">
</div>
<div class="rfk_message">
<div class="rfk_msg_noresult">
</div>
<div class="rfk_msg_results">Top Results for "20m35518334z m45518403z m45518415z"</div>
The markup continues below that first div; there is a lot more information after <div data-rfkid=.
When I run my Python script to find the first div, this is what I get:
<div class="col-md-10 col-md-push-2">
<div data-keyphrase="20M35518334Z M45518403Z M45518415Z" data-rfkid="rfkid_7"></div>
</div>
It looks as if none of the product information I need is there.
Here is my python code, so you can see what I did. I am using python 3.5.
import requests
from bs4 import BeautifulSoup
url = "http://www.metalmulisha.com/shop/search/?q=20M35518334Z%20M45518403Z%20M45518415Z"
html = requests.get(url).text
bs = BeautifulSoup(html, 'lxml')
possible_links = bs.find('div', attrs={'class': 'col-md-10 col-md-push-2'})
print(possible_links)
My question is why can't python find the html I need? If I inspect the site in Chrome I see it just fine, but when I use Python and request the site, it's not there. Is this to do with JavaScript? And if so how do I fix this?

How to automate selecting certain codes in an html?

Hi, I have a question about automatically extracting certain content from an HTML file. If we save a web page as HTML only, we get the HTML along with stylesheets and JavaScript. However, I only want to extract the HTML between <div class='post-content' itemprop='articleBody'> and </div>, and then create a new HTML file containing just that extracted HTML. Is there a way to do it? Example code is below:
<html>
<script src='.....'>
</script>
<style>
...
</style>
<div class='header-outer'>
<div class='header-title'>
<div class='post-content' itemprop='articleBody'>
<p>content we want</p>
</div>
</div></div>
<div class='footer'>
</div>
</html>
While I'm typing, I'm thinking about JavaScript, which seems to be able to manipulate HTML DOM elements. Is Ruby able to do that? Can I generate a new clean HTML file that only contains the content between <div class='post-content' itemprop='articleBody'> and </div> by using JavaScript or Ruby? As for how to write the actual code, I don't have a clue.
Does anybody have any idea about it? Thank you so much!
I'm not quite sure what you're asking, but I'll take a crack at it.
Can Ruby modify the DOM on a webpage?
Short answer, no. Browsers don't know how to run Ruby. They do know how to run JavaScript, so that's what is usually used for real-time DOM manipulation.
Can I generate a new clean html
Yes? At the end of the day, HTML is just a specifically formatted string. If you want to download the source from that page and find everything in the <div class='post-content' itemprop='articleBody'> tag, there are a couple of ways to go about that. The best is probably the nokogiri gem, which is a ruby HTML parser. You'll be able to feed it a string (from a file or otherwise) that represents the old page and strip out what you want. Doing that would look something like this:
require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(URI.open("https://googleblog.blogspot.com"))
# text of the first element with the class "post-content"
text = page.css('.post-content')[0].text
I believe that gives you the text you're looking for. More detailed nokogiri instructions can be found here.
You want to use a regular expression. For example:
// [\s\S]*? lazily matches any characters, including newlines, up to the first </div>
var regEx = /<div class='post-content' itemprop='articleBody'>([\s\S]*?)<\/div>/m;
// The page content (put this script at the bottom of the page)
var bodyCode = document.body.innerHTML;
var match = bodyCode.match( regEx );
// Prints to the console
console.dir( match );
You can see this in action here: https://regex101.com/r/kJ5kW6/1
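One caveat with the regular expression above is that the lazy match stops at the first </div>, so a post that contains nested divs would be cut short. A parser-based variant avoids that. The sketch below uses only Python's standard library html.parser; it is an alternative to the Ruby and JavaScript approaches in the answers, and the names DivExtractor and extract_div are made up for illustration.

```python
from html.parser import HTMLParser


class DivExtractor(HTMLParser):
    """Collect the inner HTML of the first <div> carrying a given class."""

    def __init__(self, css_class):
        super().__init__(convert_charrefs=False)
        self.css_class = css_class
        self.depth = 0      # open <div> count inside the target; 0 = not capturing
        self.done = False   # set once the target div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.parts.append(self.get_starttag_text())
            if tag == "div":
                self.depth += 1
        elif not self.done and tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied as-is.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")


def extract_div(html_text, css_class="post-content"):
    """Return the inner HTML of the first matching div, or '' if there is none."""
    parser = DivExtractor(css_class)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts).strip()


page = ("<html><div class='header-outer'><div class='post-content' "
        "itemprop='articleBody'><p>content we want</p><div>nested</div></div>"
        "</div><div class='footer'></div></html>")
print(extract_div(page))  # <p>content we want</p><div>nested</div>
```

The returned string can then be written to a new file, for example with pathlib.Path("clean.html").write_text(extract_div(source)), where clean.html is just an example filename. Comments inside the div are dropped by this sketch.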
