I'm struggling with Python 3. I taught myself the basics, and now I'm learning webdriver and bs4. FUN!
I want to scrape the phone number from a page. In other cases I have made working scripts, but now I'm on a page that gives me a headache!
I think the problem is that the content is loaded dynamically (it's not in the page source).
This is the page: https://www.dnls.nl/locatie/diergaarde-blijdorp-rotterdam
There is an element with the text "Toon telefoonnummer". I can click it with:
driver.find_element_by_link_text("Toon telefoonnummer").click()
The phone number becomes visible, but for my eyes only! I have been trying for hours to grab the number with XPath and CSS selectors, but I can't get it.
In the pagesource i see this:
<div class="show_onclick">
<div class="text-center telephone-field"><a class="set-align" :href="'tel:'+project.contact_phone">{{project.contact_phone}}</a></div>
<div class="text-center">{{project.contact_textline}}</div>
</div>
This is my latest code:
from selenium import webdriver

try:
    # telefoonnummer
    driver.find_element_by_link_text("Toon telefoonnummer").click()  # This works
    driver.implicitly_wait(5)
    telefoonnummer = driver.find_element_by_xpath(".//*[@id='main-inner']/div[1]/div[1]/div/div[2]/ul[1]/li[3]/div/div[1]/a").text
    print(telefoonnummer)
except:
    print("")
Is there a way to scrape this kind of content?
UPDATE: I found the data in a JavaScript block in the head of the page source. It's a massive script, and it contains:
"contact_name":"Blijdorp Happenings","contact_phone":"010 4431415","contact_email":"sales@diergaardeblijdorp.nl"
My goal: I want to find the phone number and save it in a variable 'telefoonnummer'.
Kind Regards!
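A minimal sketch of one way this could work, assuming the "contact_phone" key always appears in that script block exactly as shown above (the regular expression and the use of page_source are assumptions, not tested against the live page):

import re
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.dnls.nl/locatie/diergaarde-blijdorp-rotterdam")

# The number is embedded as JSON inside a <script> in the <head>,
# so search the raw page source instead of the rendered DOM.
match = re.search(r'"contact_phone"\s*:\s*"([^"]+)"', driver.page_source)
if match:
    telefoonnummer = match.group(1)
    print(telefoonnummer)

Alternatively, after clicking "Toon telefoonnummer", an explicit WebDriverWait on the rendered tel: link might let the visible text be read directly, instead of relying on the long positional XPath above.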
I am kind of a starter in HTML, CSS, and JavaScript, and I am trying to remove a plain inline script (one with no source file) from HTML via a WebExtension, probably with JavaScript, but I can't find the solution to my problem. I have looked everywhere on Stack Exchange and other similar blogs and forums, but nothing worked.
HTML Code:
<script>
const SUPPORT_BASE = "https://support.aternos.org/hc/";
const SUPPORT_ARTICLES = {"countdown":360026950972,"uploadworld":360027235751,"connect":360026805072,"size":360035144691,"adblock":360034748092,"email":360039498492,"pending":360041686352,"domains":360044623491,"deprecated":360033339752,"backups":360044837012};
</script>
I'm lost. Please help me. And the full HTML element is:
<header class="header" style="">
<script>
const SUPPORT_BASE = "https://support.aternos.org/hc/";
const SUPPORT_ARTICLES = {"countdown":360026950972,"uploadworld":360027235751,"connect":360026805072,"size":360035144691,"adblock":360034748092,"email":360039498492,"pending":360041686352,"domains":360044623491,"deprecated":360033339752,"backups":360044837012};
</script>
</header>
Please help me.
The website is https://aternos.org/server/
Please define what you mean by "remove". If you just want to remove the script from the HTML file, simply delete the script text.
If you are trying to write code that removes it at runtime, please provide the JS as well. It won't be hard with some DOM manipulation.
I am new to Scrapy. I am trying to scrape contents from 'https://www.tysonprop.co.za/agents/' for work purposes.
In particular, the information I am looking for seems to be generated by a script tag.
The line: <%= branch.branch_name %>
resolves to: Tyson Properties Head Office
at run time.
I am trying to access the text generated inside the h2 element at run time.
However, the Scrapy response object seems to grab the raw source code. I.e. the data I want appears as <%= branch.branch_name %> and not "Tyson Properties Head Office".
Any help would be appreciated.
HTML response object extract:
<script type="text/html" id="id_branch_template">
<div id="branch-<%= branch.id %>" class="clearfix margin-top30 branch-container" style="display: none;">
<h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>
<div class="branch-agents container_12 first last clearfix">
<div id="agents-list-left" class="agents-list left grid_6">
</div>
<div id="agents-list-right" class="agents-list right grid_6">
</div>
</div>
</div>
</script>
<script type="text/javascript">
Current Scrapy spider code:
import scrapy
from scrapy.crawler import CrawlerProcess


class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        url = 'https://www.tysonprop.co.za/agents/'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        script = response.xpath('//script[@id="id_branch_template"]/text()').get()
        div = scrapy.Selector(text=script).xpath('//div[contains(@class,"branch-container")]')
        h2 = div.xpath('./h2[contains(@class,"branch-name")]')
Related to this question:
Scrapy xpath not extracting div containing special characters <%=
As the accepted answer on the related question suggests, consider using the AJAX endpoint.
If that doesn't work for you, consider using Splash.
The data seems to be downloaded with AJAX and added to the page with JavaScript. Scrapy can use Splash to execute JS on the page.
For example, this should work just fine after that.
h2 = div.xpath('./h2[contains(@class,"branch-name")]')
The docs have instructions for installing Splash but after getting it up and running, code changes to the actual crawler are pretty minimal.
Install scrapy-splash with
pip install scrapy-splash
Add some configurations to settings.py (listed on the Github page) and finally use SplashRequest instead of scrapy.Request.
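For illustration, a rough sketch of how the spider above might look with SplashRequest (assuming the scrapy-splash settings from its README are already in settings.py; the wait value and the rendered markup are guesses, not checked against the live site):

import scrapy
from scrapy_splash import SplashRequest


class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        url = 'https://www.tysonprop.co.za/agents/'
        # Let Splash render the page so the template placeholders are filled in by JS.
        yield SplashRequest(url=url, callback=self.parse, args={'wait': 2})

    def parse(self, response):
        # After rendering, the branch names should appear in real <h2 class="branch-name"> elements.
        for name in response.xpath('//h2[contains(@class, "branch-name")]/text()').getall():
            yield {'branch_name': name.strip()}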
If that doesn't work for you, maybe check out Selenium or Pyppeteer.
Also, the HTML response doesn't contain "Tyson Properties Head Office" before the JS is executed, except as a dropdown menu item (which probably isn't that useful), so it can't be extracted straight from the raw response.
Noob here.
I'm trying to scrape search results from this website: http://www.mastersportal.eu/search/?q=di-4|lv-master&order=relevance
I'm using python's BeautifulSoup
import csv
import requests
from BeautifulSoup import BeautifulSoup
for numb in ('0', '69'):
    url = ('http://www.mastersportal.eu/search/?q=ci-30,11,10,3,4,8,9,14,15,16,17,34,1,19|di-4|lv-master|rv-1&start=' + numb + '0&order=tuition_eea&direction=asc')
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)
    table = soup.find('div', attrs={'id': 'StudySearchResults'})
    lista = []
    for h3 in table.findAll('h3'):
        lista.append(h3.string)
    print(table.prettify())
I want to get clean data with the basic information about the Master (for now just the name).
The URL I'm using here is for a filtered search on the website, and the loop to step through the result pages should be fine.
However, the results are:
<div id="StudySearchResults">
<div style="display:none" id="TrackingSearchValue" class="TrackingSearchValue" data-search=""></div>
<div style="display:none" id="SearchViewEvent" class="TrackingEvent TrackingNoLocation" data-type="srch" data-action="view" data-id=""></div>
<div id="StudySearchResultsStudies" class="TrackingLinkedList" data-start="" data-list-type="study" data-type="rslts">
<!-- Wait pane, just here to make sure there is no white page -->
<div id="WaitPane" class="WaitPane">
<img src="http://www.mastersportal.eu/Modules/Results/Resources/Throbber.gif" />
<span>Loading search results...</span>
</div>
</div>
</div>
Why is only the loading div displayed instead of the content? Reading around, I feel it has something to do with the way the website handles data with JavaScript. Does something like an AJAX request exist for Python? (Or any other way to tell the scraper to wait for the page to load?)
If you want only the text, you should do this
lista.append(h3.get_text())
Regarding your second question, jsfan's answer is right. You should try Selenium and use its wait feature to wait for your search results, which appear in divs with the class names Result master premium:
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//div[contains(@class, 'Result master premium')]"))
)
You have basically answered your own question. Beautiful Soup only parses whatever the server returns for a specific URL; it never executes JavaScript.
If you want to render the page as it is shown in a browser, you will need to use something like Selenium Webdriver, which starts up an actual browser and remote-controls it.
While Webdriver is very powerful, it also has a much steeper learning curve than plain web scraping.
If you want to get into using Webdriver with Python, the official documentation is a good place to start.
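As a rough sketch of that combination, the existing BeautifulSoup code can be kept and only the page fetching swapped for Webdriver (using the bs4 package and Firefox here; the CSS selector in the wait is an assumption based on the markup shown in the question):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get('http://www.mastersportal.eu/search/?q=di-4|lv-master&order=relevance')

# Wait until real results have replaced the "Loading search results..." pane.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#StudySearchResults h3'))
)

# Hand the rendered HTML over to BeautifulSoup and parse it as before.
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.find('div', attrs={'id': 'StudySearchResults'})
lista = [h3.get_text(strip=True) for h3 in table.findAll('h3')]
print(lista)
driver.quit()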
Hi, I have a question about automatically selecting certain content in an HTML file. If we save a webpage as "HTML only", we get the HTML code along with stylesheets and JavaScript. However, I only want to extract the HTML between <div class='post-content' itemprop='articleBody'> and </div>, and then create a new HTML file that contains only the extracted HTML. Is there a possible way to do it? Example code is below:
<html>
<script src='.....'>
</script>
<style>
...
</style>
<div class='header-outer'>
<div class='header-title'>
<div class='post-content' itemprop='articleBody'>
<p>content we want</p>
</div>
</div></div>
<div class='footer'>
</div>
</html>
While I'm typing, I'm thinking about JavaScript, which seems to be able to manipulate HTML DOM elements. Is Ruby able to do that? Can I generate a new, clean HTML file that only contains the content between <div class='post-content' itemprop='articleBody'> and </div> using JavaScript or Ruby? As for how to write the actual code, I don't have a clue.
So, does anybody have any idea about this? Thank you so much!
I'm not quite sure what you're asking, but I'll take a crack at it.
Can Ruby modify the DOM on a webpage?
Short answer, no. Browsers don't know how to run Ruby. They do know how to run JavaScript, so that's what's usually used for real-time DOM manipulation.
Can I generate a new clean html
Yes? At the end of the day, HTML is just a specifically formatted string. If you want to download the source from that page and find everything in the <div class='post-content' itemprop='articleBody'> tag, there are a couple of ways to go about that. The best is probably the nokogiri gem, which is a ruby HTML parser. You'll be able to feed it a string (from a file or otherwise) that represents the old page and strip out what you want. Doing that would look something like this:
require 'nokogiri'
require 'open-uri'

page = Nokogiri::HTML(open("https://googleblog.blogspot.com"))
# gets the text of the first element with class "post-content"
text = page.css('.post-content')[0].text
I believe that gives you the text you're looking for. More detailed nokogiri instructions can be found here.
You want to use a regular expression. For example:
//The "m" means multi-line
var regEx = /<div class='post-content' itemprop='articleBody'>([\s\S]*?)<\/div>/m;
//The content (you'll put the javascript at the bottom of the page)
var bodyCode = document.body.innerHTML;
var match = bodyCode.match( regEx );
//Prints to the console
console.dir( match );
You can see this in action here: https://regex101.com/r/kJ5kW6/1
I am scraping a job board (eluta.ca) using Python and Selenium.
I'm able to return everything I want except for the link to the extended job description. In the bit of HTML below, the job "Claims Adjuster" links to "http://www.eluta.ca./direct/p?i=f1a7daa360e9468d5837d821c9d328ec"
<h2 class="title">
<span class="lk-job-title" title="Claims Adjuster" onclick="enav2('./direct/p?i=f1a7daa360e9468d5837d821c9d328ec')"></span>
</h2>
I can find the element using the code
webdriver.find_element_by_css_selector("#organic-jobs .organic-job:nth-child("+str(Line)+") .title .lk-job-title")
I can retrieve the job title via .text and navigate to the extended description via .click().
Am I missing something obvious that would return all or part of "enav2('./direct/p?i=f1a7daa360e9468d5837d821c9d328ec')"?
Thanks in advance
Use .get_attribute():
element = webdriver.find_element_by_css_selector("#organic-jobs .organic-job:nth-child("+str(Line)+") .title .lk-job-title")
print(element.get_attribute('onclick'))
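If the goal is the actual URL rather than the whole onclick string, here is a small follow-up sketch (the regular expression and the base URL passed to urljoin are assumptions based on the snippet in the question):

import re
from urllib.parse import urljoin

onclick = element.get_attribute('onclick')
# onclick looks like: enav2('./direct/p?i=f1a7daa360e9468d5837d821c9d328ec')
match = re.search(r"enav2\('([^']+)'\)", onclick)
if match:
    job_url = urljoin('http://www.eluta.ca/', match.group(1))
    print(job_url)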