I am using Python on Linux, and I believe I have the correct packages installed. I tried getting the content from this page, but the output is too convoluted for me to understand.
When I inspect the HTML in the browser, I can see the actual page information if I drill down far enough: 'Afghanistan' and 'Chishti Sufis' appear nested in the appropriate tags. But when I try to get the contents of the webpage with the code below (I've tried two methods), all I get is what looks like a script calling functions that refer to information stored elsewhere. Can someone please give me pointers on how to understand this structure so I can get the information I need? I want to extract the regions, titles, and year ranges from this page (https://religiondatabase.org/browse/regions).
How can I tell whether this page allows me to extract its data, or whether doing so would be wrong?
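As a quick technical check, the site's robots.txt shows what it asks crawlers not to fetch; whether extraction is actually permitted is ultimately a question for the site's terms of service. A minimal sketch using only Python's standard library (the robots.txt location is the conventional one, assumed here):

from urllib.robotparser import RobotFileParser

# robots.txt is advisory; it lists the paths the site asks crawlers to avoid
rp = RobotFileParser()
rp.set_url('https://religiondatabase.org/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://religiondatabase.org/browse/regions'))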
Thanks in advance for your help.
I tried the two approaches below. In the second approach I tried to extract the information in the script tag, but I can't understand it. I was expecting to see the actual page content nested under various HTML tags, but I don't see it.
import requests
from bs4 import BeautifulSoup
import html5lib

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246'}
URL = 'https://religiondatabase.org/browse/regions'
r = requests.get(url=URL, headers=headers)
print(r.content)
**Output:**
b'<!doctype html>You need to enable JavaScript to run this app.!function(e){function r(r){for(var n,u,c=r[0],i=r1,f=r[2],p=0,s=[];p<c.length;p++)u=c[p],Object.prototype.hasOwnProperty.call(o,u)&&o[u]&&s.push(o[u][0]),o[u]=0;for(n in i)Object.prototype.hasOwnProperty.call(i,n)&&(e[n]=i[n]);for(l&&l(r);s.length;)s.shift()();return a.push.apply(a,f||[]),t()}function t(){for(var e,r=0;r<a.length;r++){for(var t=a[r],n=!0,c=1;c<t.length;c++){var i=t[c];0!==o[i]&&(n=!1)}n&&(a.splice(r--,1),e=u(u.s=t[0]))}return e}var n={},o={3:0},a=[];function u(r){if(n[r])return n[r].exports;var t=n[r]={i:r,l:!1,exports:{}};return e[r].call(t.exports,t,t.exports,u),t.l=!0,t.exports}u.e=function(e){var r=[],t=o[e];if(0!==t)if(t)r.push(t[2]);else{var n=new Promise((function(r,n){t=o[e]=[r,n]}));r.push(t[2]=n);var a,c=document.createElement("script");c.charset="utf-8",c.timeout=120,u.nc&&c.setAttribute("nonce",u.nc),c.src=function(e){return u.p+"static/js/"+({}[e]||e)+"."+{0:"e5e2acc6",1:"e2cf61a4",5:"145ce2fe",6:"c5a670f3",7:"33c0f0b5",8:"e18577cc",9:"2af95b97",10:"66591cf5",11:"ebaf6d39",12:"2c9c3ea5",13:"1f5b00d2"}[e]+".chunk.js"}(e);var i=new Error;a=function(r){c.onerror=c.onload=null,clearTimeout(f);var t=o[e];if(0!==t){if(t){var n=r&&("load"===r.type?"missing":r.type),a=r&&r.target&&r.target.src;i.message="Loading chunk "+e+" failed.\n("+n+": "+a+")",i.name="ChunkLoadError",i.type=n,i.request=a,t1}o[e]=void 0}};var f=setTimeout((function(){a({type:"timeout",target:c})}),12e4);c.onerror=c.onload=a,document.head.appendChild(c)}return Promise.all(r)},u.m=e,u.c=n,u.d=function(e,r,t){u.o(e,r)||Object.defineProperty(e,r,{enumerable:!0,get:t})},u.r=function(e){"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(e,Symbol.toStringTag,{value:"Module"}),Object.defineProperty(e,"__esModule",{value:!0})},u.t=function(e,r){if(1&r&&(e=u(e)),8&r)return e;if(4&r&&"object"==typeof e&&e&&e.__esModule)return e;var t=Object.create(null);if(u.r(t),Object.defineProperty(t,"default",{enumerable:!0,value:e}),2&r&&"string"!=typeof e)for(var n in e)u.d(t,n,function(r){return e[r]}.bind(null,n));return t},u.n=function(e){var r=e&&e.__esModule?function(){return e.default}:function(){return e};return u.d(r,"a",r),r},u.o=function(e,r){return Object.prototype.hasOwnProperty.call(e,r)},u.p="/browse/",u.oe=function(e){throw console.error(e),e};var c=this["webpackJsonpbrowse-app"]=this["webpackJsonpbrowse-app"]||[],i=c.push.bind(c);c.push=r,c=c.slice();for(var f=0;f<c.length;f++)r(c[f]);var l=i;t()}([])'
and
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://religiondatabase.org/browse/regions')
print(r.content)
script = r.html.find('script', first=True)
print(script.text)
**Output:** the same webpack bootstrap script as in the first output above.
I have read many posts on using Scrapy to scrape JSON data, but haven't found one that deals with dates in the URL.
I am using Scrapy 2.1.0 and am trying to scrape a site that populates based on date ranges in the URL. The rest of my code includes headers I copied from the site I am trying to scrape; here I am using the following while loop to generate the URLs:
while start_date <= end_date:
    start_date += delta
    dates_url = str(start_date) + "&end=" + str(start_date)
    ideas_url = base_url + dates_url
    request = scrapy.Request(
        ideas_url,
        callback=self.parse_ideas,
        headers=self.headers
    )
    print(ideas_url)
    yield request
Then I am trying to scrape using the following:
def parse_ideas(self, response):
    raw_data = response.body
    data = json.loads(raw_data)
    yield {
        'Idea': data['dates']['data']['idea']
    }
Here is the relevant part of the error output from when I run scrapy runspider and export to a CSV. I keep getting the error:
File "/usr/lib/python3.7/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Is this the best approach for scraping a site that uses dates in its URL to populate? And, if so, what am I doing wrong with my JSON request such that I am not getting any results?
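One way to see what is actually going wrong is to log the start of the response body before handing it to json.loads. This is a sketch reusing the parse_ideas callback above; the 200-character cutoff is arbitrary:

def parse_ideas(self, response):
    # A JSON body starts with '{' or '['; an HTML shell page starts with '<',
    # which would produce exactly "Expecting value: line 1 column 1 (char 0)".
    self.logger.info('Body of %s starts with: %r', response.url, response.text[:200])
    if not response.text.lstrip().startswith(('{', '[')):
        self.logger.warning('Response is not JSON; got HTML or an empty body')
        return
    data = json.loads(response.text)
    yield {'Idea': data['dates']['data']['idea']}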
Note in case it matters, in settings.py I enabled and edited the following:
USER_AGENT = 'Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36'
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False
And I added the following at the end of settings.py:
HTTPERROR_ALLOWED_CODES = [400]
DOWNLOAD_DELAY = 2
The problem here is that you are trying to scrape a JavaScript app. In the HTML it says:
<noscript>You need to enable JavaScript to run this app.</noscript>
Another problem is that the app pulls its data from this API, which requires authorization. So I guess your best bet is to use either Splash or Selenium to wait for the page to load, and then work with the HTML they generate.
Personally, I mostly use something very similar to scrapy-selenium. There is also a package available for it here.
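For illustration, here is a minimal sketch of the scrapy-selenium route. The settings follow the package's README; the chromedriver path and the CSS selector are assumptions you would adapt to your setup:

import scrapy
from scrapy_selenium import SeleniumRequest

class RegionsSpider(scrapy.Spider):
    name = 'regions'
    custom_settings = {
        'SELENIUM_DRIVER_NAME': 'chrome',
        'SELENIUM_DRIVER_EXECUTABLE_PATH': '/usr/bin/chromedriver',  # assumed path
        'SELENIUM_DRIVER_ARGUMENTS': ['--headless'],
        'DOWNLOADER_MIDDLEWARES': {'scrapy_selenium.SeleniumMiddleware': 800},
    }

    def start_requests(self):
        # wait_time gives the JavaScript app time to render before parsing
        yield SeleniumRequest(
            url='https://religiondatabase.org/browse/regions',
            callback=self.parse,
            wait_time=10,
        )

    def parse(self, response):
        # The response now holds the rendered DOM, so normal selectors apply.
        # 'h3::text' is illustrative; inspect the rendered page for the real tags.
        for title in response.css('h3::text').getall():
            yield {'title': title}

Alternatively, once you find the API request in the browser's network tab, you can sometimes call it directly with the same headers, but that depends on how the site's authorization works.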
I have this code (InDesign CS6), and it's not working as expected. I'm using macOS, and I need the code to be compatible with both Windows and Mac.
I'm trying to get text/JSON from my localhost, but the socket returns an empty string:
function getData(host, path) {
    var reply = '';
    var conn = new Socket();
    var request = "GET " + path + " HTTP/1.1\r\n" +
                  "Host: " + host + "\r\n" +
                  "\n";
    if (conn.open(host)) {
        conn.write(request);
        reply = conn.read(999999);
        conn.close();
    }
    return reply;
}
var host = 'localhost:80';
var path = '/test/test/json.php';
var test = getData(host, path);
alert(typeof(test) + ' Length:' + test.length);
Edit: I finally managed to find out what was causing the problem. I created a VMware VM, ran the script there, and it worked; I'm not sure why it doesn't work on my machine. I downloaded Wireshark and saw InDesign send the request, but something blocks the request from reaching the server. I will update if I'm able to determine what is causing the block.
When it comes to Socket, I guess the simplest approach is to take advantage of the script written by Rorohiko:
https://rorohiko.blogspot.fr/2013/01/geturlsjsx.html
Or have a try with the IdExtenso library:
https://github.com/indiscripts/IdExtenso
I find those convenient as they deal with the inner socket mechanisms for you.
You do not need to use a socket just to get JSON from your server.
Instead, refer to the XMLHttpRequest documentation, or use a library such as jQuery, which greatly simplifies making Ajax calls for JSON.
I am trying to create a web application that can read certain files (logs) provided by users, then use Microsoft's LogParser 2.2 exe to parse the logs and produce the requested output.
The idea is to run the local LogParser.exe present on the user's system and then use its generated output to build my own output.
I do not know if this approach is correct. However, I am attempting it, and somewhere my code is not being executed correctly: I am not able to find any output or error.
My code segment is as follows:
protected void Button2_Click(object sender, EventArgs e)
{
    try
    {
        string fileName = @"C:\Program Files (x86)\Log Parser 2.2\LOGPARSER.exe";
        string filename = "LogParser";
        string input = " -i:IISW3C ";
        string query = " Select top 10 cs-uri-stem, count(cs-uri-stem) from " + TextBox1.Text + " group by cs-uri-stem order by count(cs-uri-stem)";
        string output = " -o:DATAGRID ";
        string argument = filename + input + query + output;
        ProcessStartInfo PSI = new ProcessStartInfo(fileName)
        {
            UseShellExecute = false,
            Arguments = argument,
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            CreateNoWindow = false
        };
        Process LogParser = Process.Start(PSI);
        LogParser.Start();
    }
    catch (Exception Prc)
    {
        MessageBox.Show(Prc.Message);
    }
}
I might be doing something wrong, but can someone point me in the correct direction? Could a JavaScript ActiveX control be the way forward?
All help is appreciated.
(I am making this as an internal application for my organisation, and it is assumed that LogParser will be present on the computer where this web application is used.)
Thanks
Ravi
Add a reference to Interop.MSUtil in your project and then use the COM API as described in the help file. The following using statements should allow you to interact with LogParser through your code:
using LogQuery = Interop.MSUtil.LogQueryClass;
using FileLogInputFormat = Interop.MSUtil.COMTextLineInputContextClass;
Then you can do something like:
// Input format for plain text log files (one record per line)
var inputFormat = new FileLogInputFormat();
// Instantiate the LogQuery object
LogQuery oLogQuery = new LogQuery();
// Execute the query; results is a record set you can iterate over
var results = oLogQuery.Execute(yourQuery, inputFormat);
You have access to a bunch of predefined input and output formats (like IIS and W3C), so you can pick the one that best suits your needs. Also, you will need to run regsvr32 on LogParser.dll on the machine you are executing on if you have not installed LogParser. The documentation is actually pretty good for getting started.