I am using python with linux. I believe I have the correct packages installed. I tried getting the content from this page, but the output is too convoluted for me to understand.
When I inspect the html in the browser, I can see the actual page information if I drill down far enough, as shown in the image below, I can see 'Afghanistan' and 'Chishti Sufis' nested in appropriate tags. When I try to get the contents of the webpage using the code below, I've tried 2 methods, I get what looks like a script calling functions referring to the information stored elsewhere. Can someone please give me pointers on how to understand this structure to get the information I need. I want to be able to extract the regions, titles and the year range from this page(https://religiondatabase.org/browse/regions)
How can I tell if this page allows me to extract their data or if it is wrong to do so.
Thanks in advance for your help.
I tried the two approaches below. In the second approach I tried to extract the information in the script tag, but I can't understand it. I was expecting to see the actual page content nested under various html tags, but I don't
import requests
from bs4 import BeautifulSoup
import html5lib
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246'}
URL = 'https://religiondatabase.org/browse/regions'
r = requests.get(url =URL, headers=headers)
print(r.content)
**Output: **
b'<!doctype html>You need to enable JavaScript to run this app.!function(e){function r(r){for(var n,u,c=r[0],i=r1,f=r[2],p=0,s=[];p<c.length;p++)u=c[p],Object.prototype.hasOwnProperty.call(o,u)&&o[u]&&s.push(o[u][0]),o[u]=0;for(n in i)Object.prototype.hasOwnProperty.call(i,n)&&(e[n]=i[n]);for(l&&l(r);s.length;)s.shift()();return a.push.apply(a,f||[]),t()}function t(){for(var e,r=0;r<a.length;r++){for(var t=a[r],n=!0,c=1;c<t.length;c++){var i=t[c];0!==o[i]&&(n=!1)}n&&(a.splice(r--,1),e=u(u.s=t[0]))}return e}var n={},o={3:0},a=[];function u(r){if(n[r])return n[r].exports;var t=n[r]={i:r,l:!1,exports:{}};return e[r].call(t.exports,t,t.exports,u),t.l=!0,t.exports}u.e=function(e){var r=[],t=o[e];if(0!==t)if(t)r.push(t[2]);else{var n=new Promise((function(r,n){t=o[e]=[r,n]}));r.push(t[2]=n);var a,c=document.createElement("script");c.charset="utf-8",c.timeout=120,u.nc&&c.setAttribute("nonce",u.nc),c.src=function(e){return u.p+"static/js/"+({}[e]||e)+"."+{0:"e5e2acc6",1:"e2cf61a4",5:"145ce2fe",6:"c5a670f3",7:"33c0f0b5",8:"e18577cc",9:"2af95b97",10:"66591cf5",11:"ebaf6d39",12:"2c9c3ea5",13:"1f5b00d2"}[e]+".chunk.js"}(e);var i=new Error;a=function(r){c.onerror=c.onload=null,clearTimeout(f);var t=o[e];if(0!==t){if(t){var n=r&&("load"===r.type?"missing":r.type),a=r&&r.target&&r.target.src;i.message="Loading chunk "+e+" failed.\n("+n+": "+a+")",i.name="ChunkLoadError",i.type=n,i.request=a,t1}o[e]=void 0}};var f=setTimeout((function(){a({type:"timeout",target:c})}),12e4);c.onerror=c.onload=a,document.head.appendChild(c)}return Promise.all(r)},u.m=e,u.c=n,u.d=function(e,r,t){u.o(e,r)||Object.defineProperty(e,r,{enumerable:!0,get:t})},u.r=function(e){"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(e,Symbol.toStringTag,{value:"Module"}),Object.defineProperty(e,"__esModule",{value:!0})},u.t=function(e,r){if(1&r&&(e=u(e)),8&r)return e;if(4&r&&"object"==typeof e&&e&&e.__esModule)return e;var t=Object.create(null);if(u.r(t),Object.defineProperty(t,"default",{enumerable:!0,value:e}),2&r&&"string"!=typeof e)for(var n in e)u.d(t,n,function(r){return e[r]}.bind(null,n));return t},u.n=function(e){var r=e&&e.__esModule?function(){return e.default}:function(){return e};return u.d(r,"a",r),r},u.o=function(e,r){return Object.prototype.hasOwnProperty.call(e,r)},u.p="/browse/",u.oe=function(e){throw console.error(e),e};var c=this["webpackJsonpbrowse-app"]=this["webpackJsonpbrowse-app"]||[],i=c.push.bind(c);c.push=r,c=c.slice();for(var f=0;f<c.length;f++)r(c[f]);var l=i;t()}([])'
and
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://religiondatabase.org/browse/regions')
print(r.content)
script = r.html.find('script', first = True)
print(script.text)
**Output: **
!function(e){function r(r){for(var n,u,c=r[0],i=r1,f=r[2],p=0,s=[];p<c.length;p++)u=c[p],Object.prototype.hasOwnProperty.call(o,u)&&o[u]&&s.push(o[u][0]),o[u]=0;for(n in i)Object.prototype.hasOwnProperty.call(i,n)&&(e[n]=i[n]);for(l&&l(r);s.length;)s.shift()();return a.push.apply(a,f||[]),t()}function t(){for(var e,r=0;r<a.length;r++){for(var t=a[r],n=!0,c=1;c<t.length;c++){var i=t[c];0!==o[i]&&(n=!1)}n&&(a.splice(r--,1),e=u(u.s=t[0]))}return e}var n={},o={3:0},a=[];function u(r){if(n[r])return n[r].exports;var t=n[r]={i:r,l:!1,exports:{}};return e[r].call(t.exports,t,t.exports,u),t.l=!0,t.exports}u.e=function(e){var r=[],t=o[e];if(0!==t)if(t)r.push(t[2]);else{var n=new Promise((function(r,n){t=o[e]=[r,n]}));r.push(t[2]=n);var a,c=document.createElement("script");c.charset="utf-8",c.timeout=120,u.nc&&c.setAttribute("nonce",u.nc),c.src=function(e){return u.p+"static/js/"+({}[e]||e)+"."+{0:"e5e2acc6",1:"e2cf61a4",5:"145ce2fe",6:"c5a670f3",7:"33c0f0b5",8:"e18577cc",9:"2af95b97",10:"66591cf5",11:"ebaf6d39",12:"2c9c3ea5",13:"1f5b00d2"}[e]+".chunk.js"}(e);var i=new Error;a=function(r){c.onerror=c.onload=null,clearTimeout(f);var t=o[e];if(0!==t){if(t){var n=r&&("load"===r.type?"missing":r.type),a=r&&r.target&&r.target.src;i.message="Loading chunk "+e+" failed.\n("+n+": "+a+")",i.name="ChunkLoadError",i.type=n,i.request=a,t1}o[e]=void 0}};var f=setTimeout((function(){a({type:"timeout",target:c})}),12e4);c.onerror=c.onload=a,document.head.appendChild(c)}return Promise.all(r)},u.m=e,u.c=n,u.d=function(e,r,t){u.o(e,r)||Object.defineProperty(e,r,{enumerable:!0,get:t})},u.r=function(e){"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(e,Symbol.toStringTag,{value:"Module"}),Object.defineProperty(e,"__esModule",{value:!0})},u.t=function(e,r){if(1&r&&(e=u(e)),8&r)return e;if(4&r&&"object"==typeof e&&e&&e.__esModule)return e;var t=Object.create(null);if(u.r(t),Object.defineProperty(t,"default",{enumerable:!0,value:e}),2&r&&"string"!=typeof e)for(var n in e)u.d(t,n,function(r){return e[r]}.bind(null,n));return t},u.n=function(e){var r=e&&e.__esModule?function(){return e.default}:function(){return e};return u.d(r,"a",r),r},u.o=function(e,r){return Object.prototype.hasOwnProperty.call(e,r)},u.p="/browse/",u.oe=function(e){throw console.error(e),e};var c=this["webpackJsonpbrowse-app"]=this["webpackJsonpbrowse-app"]||[],i=c.push.bind(c);c.push=r,c=c.slice();for(var f=0;f<c.length;f++)r(c[f]);var l=i;t()}([])
from my poor knowledge about webscraping I've come about to find a very complex issue for me, that I will try to explain the best I can (hence I'm opened to suggestions or edits in my post).
I started using the web crawling framework 'Scrapy' long ago to make my webscraping, and it's still the one that I use nowadays. Lately, I came across this website, and found that my framework (Scrapy) was not able to iterate over the pages since this website uses Fragment URLs (#) to load the data (the next pages). Then I made a post about that problem (having no idea of the main problem yet): my post
After that, I realized that my framework can't make it without a JavaScript interpreter or a browser imitation, so they mentioned the Selenium library. I read as much as I could about that library (i.e. example1, example2, example3 and example4). I also found this StackOverflow's post that gives some tracks about my issue.
So Finally, my biggest questions are:
1 - Is there any way to iterate/yield over the pages from the website shown above, using Selenium along with scrapy?
So far, this is the code I'm using, but doesn't work...
EDIT:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# The require imports...
def getBrowser():
path_to_phantomjs = "/some_path/phantomjs-2.1.1-macosx/bin/phantomjs"
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
"(KHTML, like Gecko) Chrome/15.0.87")
browser = webdriver.PhantomJS(executable_path=path_to_phantomjs, desired_capabilities=dcap)
return browser
class MySpider(Spider):
name = "myspider"
browser = getBrowser()
def start_requests(self):
the_url = "http://www.atraveo.com/es_es/islas_canarias#eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ=="
yield scrapy.Request(url=the_url, callback=self.parse, dont_filter=True)
def parse(self, response):
self.get_page_links()
def get_page_links(self):
""" This first part, goes through all available pages """
for i in xrange(1, 3): # 210
new_data = {"data": {"countryId": "ES", "regionId": "920", "duration": 7, "minPersons": 1},
"config": {"page": str(i)}}
json_data = json.dumps(new_data)
new_url = "http://www.atraveo.com/es_es/islas_canarias#" + base64.b64encode(json_data)
self.browser.get(new_url)
print "\nThe new URL is -> ", new_url, "\n"
content = self.browser.page_source
self.get_item_links(content)
def get_item_links(self, body=""):
if body:
""" This second part, goes through all available items """
raw_links = re.findall(r'listclickable.+?>', body)
links = []
if raw_links:
for raw_link in raw_links:
new_link = re.findall(r'data-link=\".+?\"', raw_link)[0].replace("data-link=\"", "").replace("\"",
"")
links.append(str(new_link))
if links:
ids = self.get_ids(links)
for link in links:
current_id = self.get_single_id(link)
print "\nThe Link -> ", link
# If commented the line below, code works, doesn't otherwise
yield scrapy.Request(url=link, callback=self.parse_room, dont_filter=True)
def get_ids(self, list1=[]):
if list1:
ids = []
for elem in list1:
raw_id = re.findall(r'/[0-9]+', elem)[0].replace("/", "")
ids.append(raw_id)
return ids
else:
return []
def get_single_id(self, text=""):
if text:
raw_id = re.findall(r'/[0-9]+', text)[0].replace("/", "")
return raw_id
else:
return ""
def parse_room(self, response):
# More scraping code...
So this is mainly my problem. I'm almost sure that what I'm doing isn't the best way, so for that I did my second question. And to avoid having to do these kind of issues in the future, I did my third question.
2 - If the answer to the first question is negative, how could I tackle this issue? I'm opened to another means, otherwise
3 - Can anyone tell me or show me pages where I can learn how to solve/combine webscraping along javaScript and Ajax? Nowadays are more the websites that use JavaScript and Ajax scripts to load content
Many thanks in advance!
Selenium is one of the best tools to scrape dynamic data.you can use selenium with any web browser to fetch the data that is loading from scripts.That works exactly like the browser click operations.But I am not prefering it.
For getting dynamic data you can use scrapy + splash combo. From scrapy you wil get all the static data and splash for other dynamic contents.
Have you looked into BeautifulSoup? It's a very popular web scraping library for python. As for JavaScript, I would recommend something like Cheerio (If you're asking for a scraping library in JavaScript)
If you are meaning that the website uses HTTP requests to load content, you could always try to manipulate that manually with something like the requests library.
Hope this helps
You can definitely use Selenium as a standalone to scrap webpages with dynamic content (like AJAX loading).
Selenium will just rely on a WebDriver (basically a web browser) to seek content over the Internet.
Here are a few of them (but the most often used) :
ChromeDriver
PhantomJS (my favorite)
Firefox
Once your started, you can start your bot and parse the html content of the webpage.
I included a minimal working example below using Python and ChromeDriver :
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(executable_path='chromedriver')
driver.get('https://www.google.com')
# Then you can search for any element you want on the webpage
search_bar = driver.find_element(By.CLASS_NAME, 'tsf-p')
search_bar.click()
driver.close()
See the documentation for more details !
I'm trying to scrape real estate data off of this website: example
As you can see the relevant content is placed into article tags.
I'm running selenium with phantomjs:
driver = webdriver.PhantomJS(executable_path=PJSpath)
Then I generate the URL in python, because all search results are part of the link, so I can search what I'm looking for in the program without needing to fill out the form.
Before calling
driver.get(engine_link)
I copy engine_link to the clipboard and it opens fine in chrome.
Next I wait for all possible redirects to happen:
def wait_for_redirect(wdriver):
elem = wdriver.find_element_by_tag_name("html")
count = 0
while True:
count += 1
if count > 5:
print("Waited for redirect for 5 seconds!")
return
time.sleep(1)
try:
elem = wdriver.find_element_by_tag_name("html")
except StaleElementReferenceException:
return
Now at last I want to iterate over all <article> tags on the current page:
for article in driver.find_elements_by_tag_name("article"):
But this loop never returns anything. The program doesn't find any article tags, I've tried it with xpath and css selectors. Moreover, the articles are enclosed in a section tag, that can't be found either.
Is there a problem with this specific type of tags in Selenium or am I missing something JS related here? At the bottom of the page there are JavaScript templates whose naming suggests that they generate the search results.
Any help appreciated!
Pretend not to be PhantomJS and add an Explicit Wait (worked for me):
from selenium import webdriver
from selenium.webdriver import DesiredCapabilities
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# set a custom user-agent
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36"
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = user_agent
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get("http://www.seloger.com/list.htm?cp=40250&org=advanced_search&idtt=2&pxmin=50000&pxmax=200000&surfacemin=20&surfacemax=100&idtypebien=2&idtypebien=1&idtypebien=11")
# wait for arcitles to be present
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.TAG_NAME, "article")))
# get articles
for article in driver.find_elements_by_tag_name("article"):
print(article.text)
I'm crawling through some directories with ASP.NET programming via Scrapy.
The pages to crawl through are encoded as such:
javascript:__doPostBack('ctl00$MainContent$List','Page$X')
where X is an int between 1 and 180. The MainContent argument is always the same. I have no idea how to crawl into these. I would love to add something to the SLE rules as simple as allow=('Page$') or attrs='__doPostBack', but my guess is that I have to be trickier in order to pull the info from the javascript "link."
If it's easier to "unmask" each of the absolute links from the javascript code and save those to a csv, then use that csv to load requests into a new scraper, that's okay, too.
This kind of pagination is not that trivial as it may seem. It was an interesting challenge to solve it. There are several important notes about the solution provided below:
the idea here is to follow the pagination page by page passing around the current page in the Request.meta dictionary
using a regular BaseSpider since there is some logic involved in the pagination
it is important to provide headers pretending to be a real browser
it is important to yield FormRequests withdont_filter=True since we are basically making a POST request to the same URL but with different parameters
The code:
import re
from scrapy.http import FormRequest
from scrapy.spider import BaseSpider
HEADERS = {
'X-MicrosoftAjax': 'Delta=true',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36'
}
URL = 'http://exitrealty.com/agent_list.aspx?firstName=&lastName=&country=USA&state=NY'
class ExitRealtySpider(BaseSpider):
name = "exit_realty"
allowed_domains = ["exitrealty.com"]
start_urls = [URL]
def parse(self, response):
# submit a form (first page)
self.data = {}
for form_input in response.css('form#aspnetForm input'):
name = form_input.xpath('#name').extract()[0]
try:
value = form_input.xpath('#value').extract()[0]
except IndexError:
value = ""
self.data[name] = value
self.data['ctl00$MainContent$ScriptManager1'] = 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList'
self.data['__EVENTTARGET'] = 'ctl00$MainContent$List'
self.data['__EVENTARGUMENT'] = 'Page$1'
return FormRequest(url=URL,
method='POST',
callback=self.parse_page,
formdata=self.data,
meta={'page': 1},
dont_filter=True,
headers=HEADERS)
def parse_page(self, response):
current_page = response.meta['page'] + 1
# parse agents (TODO: yield items instead of printing)
for agent in response.xpath('//a[#class="regtext"]/text()'):
print agent.extract()
print "------"
# request the next page
data = {
'__EVENTARGUMENT': 'Page$%d' % current_page,
'__EVENTVALIDATION': re.search(r"__EVENTVALIDATION\|(.*?)\|", response.body, re.MULTILINE).group(1),
'__VIEWSTATE': re.search(r"__VIEWSTATE\|(.*?)\|", response.body, re.MULTILINE).group(1),
'__ASYNCPOST': 'true',
'__EVENTTARGET': 'ctl00$MainContent$agentList',
'ctl00$MainContent$ScriptManager1': 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList',
'': ''
}
return FormRequest(url=URL,
method='POST',
formdata=data,
callback=self.parse_page,
meta={'page': current_page},
dont_filter=True,
headers=HEADERS)
This question already has answers here:
Scraping Javascript generated data
(2 answers)
Closed 9 years ago.
For a while now, I have been using R and the package RCurl to automatically download information from a webpage; I normally use simple functions like getURL(), getForm() and postForm(). I usually just find the HTML parameters optional values and fill them. However, I came across a webpage which I think cannot be downloaded using the traditional functions because I cannot find any parameters in the url address. I believe this is happening because the webpage is written in javascript and I don't know how to deal with it. I am a mathematician with vast experience using R but with a very basic knowledge of HTML and no knowledge at all of javascript.
I don't necessarily need to use R directly, I could use other software and then import it from R. I have found a Mozilla application called mozrepl but I was unable to make it work. I would appreciate if someone with more experience could help me with a solution, whether using different software or putting the appropriate commands in R or mozrepl. If it is not possible to download the info directly to an R variable it would be ok to save it to a text file.
The information I want to download is produced after selecting a date value in the following url and then hitting the button called "Consultar TIIE". A table is produced with the variables "Posturas", "Montos" and "Participantes".
http://www.banxico.org.mx/tiieban/leeArgumentos.faces?BMXC_plazo=28&BMXC_semanas=4
I am doing this because my final objective is to put the information together into a dataframe.
There is no issue with javascript here. The javascript simple creates the calender so you can pick your date to submit to the form. There is however an issue with alot else.
On a server side it seems like they are trying to detect none browser attempts to pull the data.
Also they have a redirect once the form is correctly submitted which is causing an issue.
require(RCurl)
require(XML)
appDate <- "20130502"
rURL <- "http://www.banxico.org.mx/tiieban/leeArgumentos.faces?BMXC_plazo=28&BMXC_semanas=4"
usera <- "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:21.0) Gecko/20100101 Firefox/21.0"
curl <- getCurlHandle(cookiefile = "", verbose = TRUE, useragent = usera
, followlocation = TRUE, autoreferer = TRUE, postredir = 2
, httpheader = c(Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding" = "gzip, deflate"
, "Accept-Language" = "en-US,en;q=0.5"
, Connection = "keep-alive"), referer = "http://www.banxico.org.mx/tiieban/leeArgumentos.faces")
txt <- getURLContent(rURL, curl = curl, verbose = TRUE)
fParams <- structure(c(appDate, "Consultar+TIIE", "leeArgumentos")
,.Names = c( "leeArgumentos%3Afecha", "leeArgumentos%3Asubmit", "leeArgumentos"))
res <- postForm(rURL, .params = fParams, style = "post", curl = curl, binary = TRUE)
xRes <- htmlParse(rawToChar(res))
readHTMLTable(getNodeSet(xRes, "//*/table")[[3]])
Posturas Montos Participantes
1 4.3100 350 Banco Credit Suisse (México), S.A.
2 4.3245 350 Banco Inbursa S.A.
3 4.3200 350 Banco Invex S.A.
4 4.3375 350 Banco Mercantil del Norte S.A.
5 4.3350 350 Banco Nacional de México S.A.
6 4.3250 350 HSBC México S.A.
7 4.3300 350 ScotiaBank Inverlat, S.A.
There many things going on. The parameters for the form need encoding. leeArgumentos:fecha needs to be leeArgumentos%3Afecha for example. A user agent is probably being detected as are referrer strings and various other headers.
This does look like a javascript problem, rather than something directly related to web-scraping in R.
There are a variety of approaches to this issue, you might take a look at Scraping Javascript generated data and also the suggestions at Language for web scraping JAVASCRIPT content
The example you point to appears to run a custom script, show_calendar2, defined here http://www.banxico.org.mx/tiieban/scripts/ts_picker2.js