I am using Python on Linux and believe I have the correct packages installed. I tried getting the content from this page, but the output is too convoluted for me to understand.
When I inspect the HTML in the browser, I can see the actual page information if I drill down far enough, as shown in the image below: 'Afghanistan' and 'Chishti Sufis' are nested in the appropriate tags. When I try to get the contents of the webpage with the code below (I've tried two methods), what I get instead looks like a script calling functions that refer to information stored elsewhere. Can someone please give me pointers on how to understand this structure so I can get the information I need? I want to extract the regions, titles and year ranges from this page (https://religiondatabase.org/browse/regions).
How can I tell whether this page allows me to extract their data, or whether it would be wrong to do so?
Thanks in advance for your help.
I tried the two approaches below. In the second approach I tried to extract the information in the script tag, but I can't understand it. I was expecting to see the actual page content nested under various HTML tags, but I don't.
import requests
from bs4 import BeautifulSoup
import html5lib  # parser backend for BeautifulSoup

# Pretend to be a regular desktop browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246'}
URL = 'https://religiondatabase.org/browse/regions'
r = requests.get(url=URL, headers=headers)
print(r.content)
**Output:**
b'<!doctype html>You need to enable JavaScript to run this app.!function(e){function r(r){for(var n,u,c=r[0],i=r1,f=r[2],p=0,s=[];p<c.length;p++)u=c[p],Object.prototype.hasOwnProperty.call(o,u)&&o[u]&&s.push(o[u][0]),o[u]=0;for(n in i)Object.prototype.hasOwnProperty.call(i,n)&&(e[n]=i[n]);for(l&&l(r);s.length;)s.shift()();return a.push.apply(a,f||[]),t()}function t(){for(var e,r=0;r<a.length;r++){for(var t=a[r],n=!0,c=1;c<t.length;c++){var i=t[c];0!==o[i]&&(n=!1)}n&&(a.splice(r--,1),e=u(u.s=t[0]))}return e}var n={},o={3:0},a=[];function u(r){if(n[r])return n[r].exports;var t=n[r]={i:r,l:!1,exports:{}};return e[r].call(t.exports,t,t.exports,u),t.l=!0,t.exports}u.e=function(e){var r=[],t=o[e];if(0!==t)if(t)r.push(t[2]);else{var n=new Promise((function(r,n){t=o[e]=[r,n]}));r.push(t[2]=n);var a,c=document.createElement("script");c.charset="utf-8",c.timeout=120,u.nc&&c.setAttribute("nonce",u.nc),c.src=function(e){return u.p+"static/js/"+({}[e]||e)+"."+{0:"e5e2acc6",1:"e2cf61a4",5:"145ce2fe",6:"c5a670f3",7:"33c0f0b5",8:"e18577cc",9:"2af95b97",10:"66591cf5",11:"ebaf6d39",12:"2c9c3ea5",13:"1f5b00d2"}[e]+".chunk.js"}(e);var i=new Error;a=function(r){c.onerror=c.onload=null,clearTimeout(f);var t=o[e];if(0!==t){if(t){var n=r&&("load"===r.type?"missing":r.type),a=r&&r.target&&r.target.src;i.message="Loading chunk "+e+" failed.\n("+n+": "+a+")",i.name="ChunkLoadError",i.type=n,i.request=a,t1}o[e]=void 0}};var f=setTimeout((function(){a({type:"timeout",target:c})}),12e4);c.onerror=c.onload=a,document.head.appendChild(c)}return Promise.all(r)},u.m=e,u.c=n,u.d=function(e,r,t){u.o(e,r)||Object.defineProperty(e,r,{enumerable:!0,get:t})},u.r=function(e){"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(e,Symbol.toStringTag,{value:"Module"}),Object.defineProperty(e,"__esModule",{value:!0})},u.t=function(e,r){if(1&r&&(e=u(e)),8&r)return e;if(4&r&&"object"==typeof e&&e&&e.__esModule)return e;var t=Object.create(null);if(u.r(t),Object.defineProperty(t,"default",{enumerable:!0,value:e}),2&r&&"string"!=typeof e)for(var n in e)u.d(t,n,function(r){return e[r]}.bind(null,n));return t},u.n=function(e){var r=e&&e.__esModule?function(){return e.default}:function(){return e};return u.d(r,"a",r),r},u.o=function(e,r){return Object.prototype.hasOwnProperty.call(e,r)},u.p="/browse/",u.oe=function(e){throw console.error(e),e};var c=this["webpackJsonpbrowse-app"]=this["webpackJsonpbrowse-app"]||[],i=c.push.bind(c);c.push=r,c=c.slice();for(var f=0;f<c.length;f++)r(c[f]);var l=i;t()}([])'
and
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://religiondatabase.org/browse/regions')
print(r.content)
script = r.html.find('script', first=True)  # first <script> tag in the page
print(script.text)
**Output:**
!function(e){function r(r){for(var n,u,c=r[0],i=r1,f=r[2],p=0,s=[];p<c.length;p++)u=c[p],Object.prototype.hasOwnProperty.call(o,u)&&o[u]&&s.push(o[u][0]),o[u]=0;for(n in i)Object.prototype.hasOwnProperty.call(i,n)&&(e[n]=i[n]);for(l&&l(r);s.length;)s.shift()();return a.push.apply(a,f||[]),t()}function t(){for(var e,r=0;r<a.length;r++){for(var t=a[r],n=!0,c=1;c<t.length;c++){var i=t[c];0!==o[i]&&(n=!1)}n&&(a.splice(r--,1),e=u(u.s=t[0]))}return e}var n={},o={3:0},a=[];function u(r){if(n[r])return n[r].exports;var t=n[r]={i:r,l:!1,exports:{}};return e[r].call(t.exports,t,t.exports,u),t.l=!0,t.exports}u.e=function(e){var r=[],t=o[e];if(0!==t)if(t)r.push(t[2]);else{var n=new Promise((function(r,n){t=o[e]=[r,n]}));r.push(t[2]=n);var a,c=document.createElement("script");c.charset="utf-8",c.timeout=120,u.nc&&c.setAttribute("nonce",u.nc),c.src=function(e){return u.p+"static/js/"+({}[e]||e)+"."+{0:"e5e2acc6",1:"e2cf61a4",5:"145ce2fe",6:"c5a670f3",7:"33c0f0b5",8:"e18577cc",9:"2af95b97",10:"66591cf5",11:"ebaf6d39",12:"2c9c3ea5",13:"1f5b00d2"}[e]+".chunk.js"}(e);var i=new Error;a=function(r){c.onerror=c.onload=null,clearTimeout(f);var t=o[e];if(0!==t){if(t){var n=r&&("load"===r.type?"missing":r.type),a=r&&r.target&&r.target.src;i.message="Loading chunk "+e+" failed.\n("+n+": "+a+")",i.name="ChunkLoadError",i.type=n,i.request=a,t1}o[e]=void 0}};var f=setTimeout((function(){a({type:"timeout",target:c})}),12e4);c.onerror=c.onload=a,document.head.appendChild(c)}return Promise.all(r)},u.m=e,u.c=n,u.d=function(e,r,t){u.o(e,r)||Object.defineProperty(e,r,{enumerable:!0,get:t})},u.r=function(e){"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(e,Symbol.toStringTag,{value:"Module"}),Object.defineProperty(e,"__esModule",{value:!0})},u.t=function(e,r){if(1&r&&(e=u(e)),8&r)return e;if(4&r&&"object"==typeof e&&e&&e.__esModule)return e;var t=Object.create(null);if(u.r(t),Object.defineProperty(t,"default",{enumerable:!0,value:e}),2&r&&"string"!=typeof e)for(var n in e)u.d(t,n,function(r){return e[r]}.bind(null,n));return t},u.n=function(e){var r=e&&e.__esModule?function(){return e.default}:function(){return e};return u.d(r,"a",r),r},u.o=function(e,r){return Object.prototype.hasOwnProperty.call(e,r)},u.p="/browse/",u.oe=function(e){throw console.error(e),e};var c=this["webpackJsonpbrowse-app"]=this["webpackJsonpbrowse-app"]||[],i=c.push.bind(c);c.push=r,c=c.slice();for(var f=0;f<c.length;f++)r(c[f]);var l=i;t()}([])
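For context: the output above is just the webpack bootstrap for a JavaScript app, so the HTML shell is empty until the scripts run, which is why neither attempt shows the regions. One hedged option, sticking with the second approach, is to let requests_html execute the JavaScript itself via r.html.render() (it downloads a headless Chromium through pyppeteer the first time it runs). This is only a sketch; the 'li' selector is a placeholder and should be replaced with whatever tags/classes the browser inspector shows around 'Afghanistan' and 'Chishti Sufis':

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://religiondatabase.org/browse/regions')

# Run the JavaScript bundle; pyppeteer downloads Chromium on first use.
# sleep gives the app a moment to fetch and render its data.
r.html.render(sleep=2, timeout=30)

# Placeholder selector: swap in the real tags/classes from the inspector.
for node in r.html.find('li'):
    print(node.text)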
The code performs a Google search using the __init__.py below (from the googlesearch package):
def search(term, num_results=10, lang="en", lr="lang_en"):
    usr_agent = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/61.0.3163.100 Safari/537.36'}

    def fetch_results(search_term, number_results, language_code):
        escaped_search_term = search_term.replace(' ', '+')
        google_url = 'https://www.google.com/search?q={}&num={}&hl={}&lr={}'.format(
            escaped_search_term, number_results + 1, language_code, lr)
        ...
Some of the returned links use JavaScript to translate the website:
<script type="text/javascript">
    var home = '/de/', root = '/', country = 'ch', language = 'de', w = {
        "download_image": "Bild download (Niedrige Qualität)",
        ...
Another example:
<script>
    dataLayer.push({
        'brand' : 'Renault',
        'countryCode' : 'BE',
        'googleAccount' : 'UA-23041452-1',
        'adobeAccount' : 'renaultbeprod',
        'languageCode' : 'nl',
        ...
Is there a way to filter out the results translated through JavaScript and get search results in only one language?
My apologies; please disregard this question. I made a trivial mistake of putting a triple-quoted comment inside a function call in another part of the code, which effectively prevented the parameters from being passed.
After removing the triple-quoted comment from the call below, I get results only in English:
for google_url in search(query, # The query you want to run
lang='en', # User interface language (host language)
num_results = 10, # Number of results per page
lr="lang_en" # Langauge of the documents received
'''
lr - parameter is implemented in __init__.py of googlesearch
It should be handled only here.
Other useful search parameters not used yet are:
cr - restricts search results to documents originating in a particular country.
(ex. cr=countryCA)
gl - boosts search results whose country of origin matches the parameter value.
(ex. gl=uk)
'''
):
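For reference, here is the corrected call with the triple-quoted comment moved out of the argument list (a sketch; query is assumed to be defined earlier in the script):

# lr is implemented in __init__.py of googlesearch and should be handled only there.
# Other useful parameters not used yet: cr (restricts results to documents originating
# in a particular country, e.g. cr=countryCA) and gl (boosts results whose country of
# origin matches the value, e.g. gl=uk).
for google_url in search(query,           # The query you want to run
                         lang='en',       # User interface language (host language)
                         num_results=10,  # Number of results per page
                         lr="lang_en"):   # Language of the documents received
    print(google_url)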
I have read many posts on using Scrapy to scrape JSON data, but haven't found one with dates in the URL.
I am using Scrapy 2.1.0 and am trying to scrape this site, which populates based on date ranges in the URL. Here is the rest of my code, which includes headers I copied from the site I am trying to scrape; I am using the following while loop to generate the URLs:
while start_date <= end_date:
    start_date += delta
    dates_url = (str(start_date) + "&end=" + str(start_date))
    ideas_url = base_url + dates_url
    request = scrapy.Request(
        ideas_url,
        callback=self.parse_ideas,
        headers=self.headers
    )
    print(ideas_url)
    yield request
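For context, a minimal sketch of the variables the loop relies on; the base URL and date bounds below are placeholders, not the real site's values:

from datetime import date, timedelta

base_url = 'https://example.com/api?start='  # placeholder endpoint
start_date = date(2020, 1, 1)                # placeholder range
end_date = date(2020, 3, 31)
delta = timedelta(days=1)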
Then I am trying to scrape using the following:
def parse_ideas(self, response):
    raw_data = response.body
    data = json.loads(raw_data)
    yield {
        'Idea': data['dates']['data']['idea']
    }
Here is a more complete error output from when I run the spider with runspider and export to a CSV; I keep getting the error:
File "/usr/lib/python3.7/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Is this the best approach for scraping a site that uses dates in its URL to populate? And, if so, what am I doing wrong with my JSON request such that I am not getting any results?
Note in case it matters, in settings.py I enabled and edited the following:
USER_AGENT = 'Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/4537.36'
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False
And I added the following at the end of settings.py
HTTPERROR_ALLOWED_CODES = [400]
DOWNLOAD_DELAY = 2
The problem is that you are trying to scrape a JavaScript app. In the HTML it says:
<noscript>You need to enable JavaScript to run this app.</noscript>
Another problem is that the app pulls its data from this API, which requires authorization. So I guess your best bet is to either use Splash or Selenium to wait for the page to load and work with the HTML they generate.
Personally, I mostly use something very similar to scrapy-selenium. There is also a package available for it here.
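A minimal sketch of how scrapy-selenium could be wired in; the driver name, driver path, wait condition, and placeholder URL are assumptions to adapt to your setup:

# settings.py
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/path/to/chromedriver'  # assumed location
SELENIUM_DRIVER_ARGUMENTS = ['--headless']
DOWNLOADER_MIDDLEWARES = {'scrapy_selenium.SeleniumMiddleware': 800}

# spider
import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

class IdeasSpider(scrapy.Spider):
    name = 'ideas'

    def start_requests(self):
        # Build ideas_url the same way as in the question's while loop;
        # a single placeholder URL keeps this sketch self-contained.
        ideas_url = 'https://example.com/?start=2020-01-01&end=2020-01-01'
        yield SeleniumRequest(
            url=ideas_url,
            callback=self.parse_ideas,
            wait_time=10,
            # Placeholder condition: wait until some element rendered by the
            # JavaScript app is present before handing the HTML to the callback.
            wait_until=EC.presence_of_element_located((By.TAG_NAME, 'body')),
        )

    def parse_ideas(self, response):
        # response.text is now the rendered HTML, not the raw JSON
        self.logger.info(response.text[:200])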
Goal: Get the inner text of a JavaScript-rendered element from a Yahoo Finance page.
I can get the innerHTML using the code below:
document.getElementsByClassName('D(ib) Va(t)')[15].childNodes[2].innerHTML
But I can't find a way to run this against the Yahoo Finance page from Java.
I've briefly tried the following APIs:
JSoup
HTMLUnit
Nashorn
I think Nashorn can get the text I'm looking for, but I haven't been able to do it yet.
If anyone has done something similar or can point me in the right direction, that would be much appreciated.
Let me know if more details are needed.
HtmlUnit seems to have problems with this site, since the response is incomplete as well. You could use PhantomJS. Just download the binary for your OS and create a script file (see API).
Script (yahoo.js):
var page = require('webpage').create();
var fs = require('fs');
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';
page.settings.resourceTimeout = '5000';
page.open('http://finance.yahoo.com/quote/AAPL/profile?p=AAPL', function(status) {
    console.log("Status: " + status);
    if (status === "success") {
        var path = 'yahoo.html';
        fs.write(path, page.content, 'w');
    }
    phantom.exit();
});
Java code:
try {
    // change path to phantomjs binary and your script file
    String phantomJSPath = "bin" + File.separator + "phantomjs";
    String scriptFile = "yahoo.js";
    Process process = Runtime.getRuntime().exec(phantomJSPath + " " + scriptFile);
    process.waitFor();

    // Jsoup: yahoo.html was created by the script file in the same path
    Elements elements = Jsoup.parse(new File("yahoo.html"), "UTF-8")
            .select("div.asset-profile-container p strong");
    for (Element element : elements) {
        if (element.attr("data-reactid").contains("asset-profile.1.1.1.2")) {
            System.out.println(element.text());
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
Output:
Consumer Goods
Note:
The following link returns a JSON object containing the company information; I'm not sure, though, whether the crumb parameter changes or is constant for a company:
https://query2.finance.yahoo.com/v10/finance/quoteSummary/AAPL?formatted=true&crumb=hm4%2FV0JtzlL&lang=en-US&region=US&modules=assetProfile%2CsecFilings%2CcalendarEvents&corsDomain=finance.yahoo.com
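A quick sketch (in Python, since the endpoint is plain HTTP) of fetching that URL and pulling the same sector field; the crumb value is copied from the link above and may well have expired, and the response layout used here is an assumption:

import requests

url = ('https://query2.finance.yahoo.com/v10/finance/quoteSummary/AAPL'
       '?formatted=true&crumb=hm4%2FV0JtzlL&lang=en-US&region=US'
       '&modules=assetProfile%2CsecFilings%2CcalendarEvents'
       '&corsDomain=finance.yahoo.com')

data = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).json()

# Assumed layout: quoteSummary -> result[0] -> assetProfile -> sector
profile = data['quoteSummary']['result'][0]['assetProfile']
print(profile.get('sector'))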