Get innerHTML of JS Element from Website using Java [closed] - javascript

Closed. This question needs debugging details. It is not currently accepting answers. Closed 6 years ago.
Goal: Get the inner text of a JavaScript-rendered element from a Yahoo Finance page.
I can get the innerHTML using the code below:
document.getElementsByClassName('D(ib) Va(t)')[15].childNodes[2].innerHTML
But I can't find a way to run this against the Yahoo Finance page from Java.
I've briefly tried the following APIs:
JSoup
HTMLUnit
Nashorn
I think Nashorn can get the text I'm looking for, but I haven't been able to do it yet.
If anyone has done something similar or can point me in the right direction, that would be much appreciated.
Let me know if more details are needed.

HtmlUnit seems to have problems with this site, since its response is incomplete as well. You could use PhantomJS instead: download the binary for your OS and create a script file (see the PhantomJS API docs).
Script (yahoo.js):
var page = require('webpage').create();
var fs = require('fs');

page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';
page.settings.resourceTimeout = 5000; // milliseconds

page.open('http://finance.yahoo.com/quote/AAPL/profile?p=AAPL', function(status) {
    console.log("Status: " + status);
    if (status === "success") {
        var path = 'yahoo.html';
        fs.write(path, page.content, 'w');
    }
    phantom.exit();
});
Java code:
try {
    // change path to the phantomjs binary and your script file
    String phantomJSPath = "bin" + File.separator + "phantomjs";
    String scriptFile = "yahoo.js";
    Process process = Runtime.getRuntime().exec(phantomJSPath + " " + scriptFile);
    process.waitFor();

    // Jsoup: yahoo.html was created by the script file in the same path
    Elements elements = Jsoup.parse(new File("yahoo.html"), "UTF-8")
            .select("div.asset-profile-container p strong");
    for (Element element : elements) {
        if (element.attr("data-reactid").contains("asset-profile.1.1.1.2")) {
            System.out.println(element.text());
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
Output:
Consumer Goods
Note:
The following link returns a JSON object containing the company information; I'm not sure, though, whether the crumb parameter changes or is constant for a company:
https://query2.finance.yahoo.com/v10/finance/quoteSummary/AAPL?formatted=true&crumb=hm4%2FV0JtzlL&lang=en-US&region=US&modules=assetProfile%2CsecFilings%2CcalendarEvents&corsDomain=finance.yahoo.com
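If that endpoint works for you, you can skip the browser entirely. Here is a minimal sketch using Jsoup's connection to fetch the raw JSON; it assumes the endpoint answers without a fresh crumb, and if it returns an error instead, the crumb would first have to be scraped from a prior page load:

// ignoreContentType is needed because the response is JSON, not HTML
String url = "https://query2.finance.yahoo.com/v10/finance/quoteSummary/AAPL"
        + "?formatted=true&lang=en-US&region=US"
        + "&modules=assetProfile"
        + "&corsDomain=finance.yahoo.com";
String json = Jsoup.connect(url)
        .ignoreContentType(true)
        .userAgent("Mozilla/5.0")
        .execute()
        .body();
System.out.println(json); // parse with your preferred JSON library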

Related

Scraping a webpage - beginner

I am using Python on Linux, and I believe I have the correct packages installed. I tried getting the content from this page, but the output is too convoluted for me to understand.
When I inspect the HTML in the browser, I can see the actual page information if I drill down far enough: 'Afghanistan' and 'Chishti Sufis' are nested in the appropriate tags. But when I fetch the page with the code below (I tried two approaches), I get what looks like a script calling functions that refer to information stored elsewhere. Can someone please give me pointers on how to understand this structure and get the information I need? I want to be able to extract the regions, titles, and year ranges from this page (https://religiondatabase.org/browse/regions).
How can I tell whether this page allows me to extract its data, or whether it is wrong to do so?
Thanks in advance for your help.
I tried the two approaches below. In the second approach I tried to extract the information in the script tag, but I can't understand it. I was expecting to see the actual page content nested under various HTML tags, but I don't.
import requests
from bs4 import BeautifulSoup
import html5lib
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246'}
URL = 'https://religiondatabase.org/browse/regions'
r = requests.get(url=URL, headers=headers)
print(r.content)
Output:
b'<!doctype html>You need to enable JavaScript to run this app.!function(e){function r(r){for(var n,u,c=r[0],i=r1,f=r[2],p=0,s=[];p<c.length;p++)u=c[p],Object.prototype.hasOwnProperty.call(o,u)&&o[u]&&s.push(o[u][0]),o[u]=0;for(n in i)Object.prototype.hasOwnProperty.call(i,n)&&(e[n]=i[n]);for(l&&l(r);s.length;)s.shift()();return a.push.apply(a,f||[]),t()}function t(){for(var e,r=0;r<a.length;r++){for(var t=a[r],n=!0,c=1;c<t.length;c++){var i=t[c];0!==o[i]&&(n=!1)}n&&(a.splice(r--,1),e=u(u.s=t[0]))}return e}var n={},o={3:0},a=[];function u(r){if(n[r])return n[r].exports;var t=n[r]={i:r,l:!1,exports:{}};return e[r].call(t.exports,t,t.exports,u),t.l=!0,t.exports}u.e=function(e){var r=[],t=o[e];if(0!==t)if(t)r.push(t[2]);else{var n=new Promise((function(r,n){t=o[e]=[r,n]}));r.push(t[2]=n);var a,c=document.createElement("script");c.charset="utf-8",c.timeout=120,u.nc&&c.setAttribute("nonce",u.nc),c.src=function(e){return u.p+"static/js/"+({}[e]||e)+"."+{0:"e5e2acc6",1:"e2cf61a4",5:"145ce2fe",6:"c5a670f3",7:"33c0f0b5",8:"e18577cc",9:"2af95b97",10:"66591cf5",11:"ebaf6d39",12:"2c9c3ea5",13:"1f5b00d2"}[e]+".chunk.js"}(e);var i=new Error;a=function(r){c.onerror=c.onload=null,clearTimeout(f);var t=o[e];if(0!==t){if(t){var n=r&&("load"===r.type?"missing":r.type),a=r&&r.target&&r.target.src;i.message="Loading chunk "+e+" failed.\n("+n+": "+a+")",i.name="ChunkLoadError",i.type=n,i.request=a,t1}o[e]=void 0}};var f=setTimeout((function(){a({type:"timeout",target:c})}),12e4);c.onerror=c.onload=a,document.head.appendChild(c)}return Promise.all(r)},u.m=e,u.c=n,u.d=function(e,r,t){u.o(e,r)||Object.defineProperty(e,r,{enumerable:!0,get:t})},u.r=function(e){"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(e,Symbol.toStringTag,{value:"Module"}),Object.defineProperty(e,"__esModule",{value:!0})},u.t=function(e,r){if(1&r&&(e=u(e)),8&r)return e;if(4&r&&"object"==typeof e&&e&&e.__esModule)return e;var t=Object.create(null);if(u.r(t),Object.defineProperty(t,"default",{enumerable:!0,value:e}),2&r&&"string"!=typeof e)for(var n in e)u.d(t,n,function(r){return e[r]}.bind(null,n));return t},u.n=function(e){var r=e&&e.__esModule?function(){return e.default}:function(){return e};return u.d(r,"a",r),r},u.o=function(e,r){return Object.prototype.hasOwnProperty.call(e,r)},u.p="/browse/",u.oe=function(e){throw console.error(e),e};var c=this["webpackJsonpbrowse-app"]=this["webpackJsonpbrowse-app"]||[],i=c.push.bind(c);c.push=r,c=c.slice();for(var f=0;f<c.length;f++)r(c[f]);var l=i;t()}([])'
and
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://religiondatabase.org/browse/regions')
print(r.content)
script = r.html.find('script', first=True)
print(script.text)
Output:
!function(e){function r(r){for(var n,u,c=r[0],i=r1,f=r[2],p=0,s=[];p<c.length;p++)u=c[p],Object.prototype.hasOwnProperty.call(o,u)&&o[u]&&s.push(o[u][0]),o[u]=0;for(n in i)Object.prototype.hasOwnProperty.call(i,n)&&(e[n]=i[n]);for(l&&l(r);s.length;)s.shift()();return a.push.apply(a,f||[]),t()}function t(){for(var e,r=0;r<a.length;r++){for(var t=a[r],n=!0,c=1;c<t.length;c++){var i=t[c];0!==o[i]&&(n=!1)}n&&(a.splice(r--,1),e=u(u.s=t[0]))}return e}var n={},o={3:0},a=[];function u(r){if(n[r])return n[r].exports;var t=n[r]={i:r,l:!1,exports:{}};return e[r].call(t.exports,t,t.exports,u),t.l=!0,t.exports}u.e=function(e){var r=[],t=o[e];if(0!==t)if(t)r.push(t[2]);else{var n=new Promise((function(r,n){t=o[e]=[r,n]}));r.push(t[2]=n);var a,c=document.createElement("script");c.charset="utf-8",c.timeout=120,u.nc&&c.setAttribute("nonce",u.nc),c.src=function(e){return u.p+"static/js/"+({}[e]||e)+"."+{0:"e5e2acc6",1:"e2cf61a4",5:"145ce2fe",6:"c5a670f3",7:"33c0f0b5",8:"e18577cc",9:"2af95b97",10:"66591cf5",11:"ebaf6d39",12:"2c9c3ea5",13:"1f5b00d2"}[e]+".chunk.js"}(e);var i=new Error;a=function(r){c.onerror=c.onload=null,clearTimeout(f);var t=o[e];if(0!==t){if(t){var n=r&&("load"===r.type?"missing":r.type),a=r&&r.target&&r.target.src;i.message="Loading chunk "+e+" failed.\n("+n+": "+a+")",i.name="ChunkLoadError",i.type=n,i.request=a,t1}o[e]=void 0}};var f=setTimeout((function(){a({type:"timeout",target:c})}),12e4);c.onerror=c.onload=a,document.head.appendChild(c)}return Promise.all(r)},u.m=e,u.c=n,u.d=function(e,r,t){u.o(e,r)||Object.defineProperty(e,r,{enumerable:!0,get:t})},u.r=function(e){"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(e,Symbol.toStringTag,{value:"Module"}),Object.defineProperty(e,"__esModule",{value:!0})},u.t=function(e,r){if(1&r&&(e=u(e)),8&r)return e;if(4&r&&"object"==typeof e&&e&&e.__esModule)return e;var t=Object.create(null);if(u.r(t),Object.defineProperty(t,"default",{enumerable:!0,value:e}),2&r&&"string"!=typeof e)for(var n in e)u.d(t,n,function(r){return e[r]}.bind(null,n));return t},u.n=function(e){var r=e&&e.__esModule?function(){return e.default}:function(){return e};return u.d(r,"a",r),r},u.o=function(e,r){return Object.prototype.hasOwnProperty.call(e,r)},u.p="/browse/",u.oe=function(e){throw console.error(e),e};var c=this["webpackJsonpbrowse-app"]=this["webpackJsonpbrowse-app"]||[],i=c.push.bind(c);c.push=r,c=c.slice();for(var f=0;f<c.length;f++)r(c[f]);var l=i;t()}([])

Scrapy: scraping JSON data from URL that is constructed with dates

I have read many posts on using Scrapy to scrape JSON data, but haven't found one with dates in the URL.
I am using Scrapy version 2.1.0 and I am trying to scrape a site which populates based on date ranges in the URL. My code includes headers I copied from the site I am trying to scrape, and I am using the following while loop to generate the URLs:
while start_date <= end_date:
    start_date += delta
    dates_url = str(start_date) + "&end=" + str(start_date)
    ideas_url = base_url + dates_url
    request = scrapy.Request(
        ideas_url,
        callback=self.parse_ideas,
        headers=self.headers
    )
    print(ideas_url)
    yield request
Then I am trying to scrape using the following:
def parse_ideas(self, response):
    raw_data = response.body
    data = json.loads(raw_data)
    yield {
        'Idea': data['dates']['data']['idea']
    }
When I run the spider (runspider) and export to a CSV, I keep getting the error:
File "/usr/lib/python3.7/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Is this the best approach for scraping a site that uses dates in its URL to populate? And, if so, what am I doing wrong with my JSON request that I am not getting any results?
Note in case it matters, in settings.py I enabled and edited the following:
USER_AGENT = 'Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/4537.36'
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False
And I added the following at the end of settings.py
HTTPERROR_ALLOWED_CODES = [400]
DOWNLOAD_DELAY = 2
The problem is that you are trying to scrape a JavaScript app. In the HTML it says:
<noscript>You need to enable JavaScript to run this app.</noscript>
Another problem is that the app pulls its data from an API which requires authorization. So I guess your best bet is to use either Splash or Selenium to wait for the page to load and then use the HTML they generate.
Personally, I mostly use something very similar to scrapy-selenium; there is also a package available for it.
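For illustration, here is a minimal Selenium sketch of that wait-then-read flow (Java bindings shown; the Python selenium package mirrors the same API almost call for call). The URL and CSS selector are placeholders for whatever your target page actually renders:

import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class RenderedHtml {
    public static void main(String[] args) {
        // requires a matching chromedriver binary on the PATH
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://example.com/js-app"); // placeholder URL
            // block until the JS app has actually rendered the element we need
            new WebDriverWait(driver, Duration.ofSeconds(10))
                    .until(ExpectedConditions.presenceOfElementLocated(
                            By.cssSelector("div.content"))); // placeholder selector
            // the page source now contains the JS-generated HTML
            System.out.println(driver.getPageSource());
        } finally {
            driver.quit();
        }
    }
}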

InDesign CS6 Socket return Empty

I have this code (InDesign CS6), and it's not working as expected. I'm using macOS, and I need to make the code compatible with both Windows and Mac.
I'm trying to get text/JSON over my localhost, and the socket returns an empty string:
function getData(host, path) {
    var reply = '';
    var conn = new Socket;
    var request = "GET " + path + " HTTP/1.1\r\n" + "Host: " + host + "\r\n" + "\n";
    if (conn.open(host)) {
        conn.write(request);
        reply = conn.read(999999);
        var close = conn.close();
    }
    return reply;
}
var host = 'localhost:80';
var path = '/test/test/json.php';
var test = getData(host, path);
alert(typeof(test) + ' Length:' + test.length);
Edit: I finally managed to find out what was causing the problem. I created a VM with VMware, ran the script there, and it worked; I am not sure why it doesn't work on my machine. I downloaded Wireshark and saw InDesign send the request, but something blocks the request from reaching the server. I will update if I am able to detect what is causing the block.
When it comes to Socket, I guess the simplest is to take advantage of the script written by Rorohiko:
https://rorohiko.blogspot.fr/2013/01/geturlsjsx.html
Or have a try with the IdExtenso library:
https://github.com/indiscripts/IdExtenso
I find those convenient as they deal with the inner socket mechanisms for you.
You do not need to use a socket just to get JSON from your server.
Instead, refer to the XMLHttpRequest documentation, or use a library such as jQuery, which greatly simplifies making Ajax calls for JSON.

extract information from a javascript file to a remote site

I need to extract the information contained in the HTML and JavaScript of a site. For the HTML I have succeeded by using the Java library called jsoup, but now I would like to extract the content of a variable within the JS files from the same site.
How can I do it? Thanks in advance.
I would like to extract the content of a variable within the js files from the same site
Try this:
// ** Exception handling removed ** //
Document doc = Jsoup.connect(websiteUrl).get();
String jsFilesCssQuery = "script[src]";
for (Element script : doc.select(jsFilesCssQuery)) {
// You may add further checks on the script element found here...
// ...
// Download JS code
Connection.Response response = Jsoup //
.connect(script.absUrl("src")) //
.userAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36") //
.ignoreContentType(true) // To force Jsoup download the JS code
.referrer(doc.location()) //
.execute(); //
String jsCode = new String( //
response.bodyAsBytes(), //
Charset.forName(response.charset()) //
);
// Do extraction on jsCode here...
// ...
}
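Once you have jsCode, how you pull the variable out depends on how it is declared. As a rough sketch, assuming the variable is assigned a plain string literal (myVar is a hypothetical name here; this will not survive minified or computed assignments):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches: var myVar = "some value";  (single or double quotes)
// "myVar" is hypothetical; substitute the variable you are after.
Pattern p = Pattern.compile("var\\s+myVar\\s*=\\s*([\"'])(.*?)\\1\\s*;");
Matcher m = p.matcher(jsCode);
if (m.find()) {
    String value = m.group(2); // the literal between the quotes
    System.out.println("myVar = " + value);
}

For anything more dynamic than a literal, evaluating jsCode in a JS engine such as Nashorn (if the file can run standalone) and reading the variable back is the more robust route.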

Running an Executable file from an ASP.NET web application

I am trying to create a web application that can read certain files (logs) provided by the users and then use the LogParser 2.2 exe by Microsoft to parse the logs and provide the requested output.
The idea I have is to run the local LogParser.exe present on the user's system and then use the generated output to build my output.
I do not know if this approach is correct. However, I am trying to do it, and somewhere my code is not being followed correctly, and I am not able to find any output or error.
My code segment is as follows:
protected void Button2_Click(object sender, EventArgs e)
{
    try
    {
        string fileName = @"C:\Program Files (x86)\Log Parser 2.2\LOGPARSER.exe";
        string filename = "LogParser";
        string input = " -i:IISW3C ";
        string query = " Select top 10 cs-uri-stem, count(cs-uri-stem) from " + TextBox1.Text + " group by cs-uri-stem order by count(cs-uri-stem)";
        string output = " -o:DATAGRID ";
        string argument = filename + input + query + output;
        ProcessStartInfo PSI = new ProcessStartInfo(fileName)
        {
            UseShellExecute = false,
            Arguments = argument,
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            CreateNoWindow = false
        };
        Process LogParser = Process.Start(PSI);
        LogParser.Start();
    }
    catch (Exception Prc)
    {
        MessageBox.Show(Prc.Message);
    }
}
I might be doing something wrong, but can someone point me in the right direction? Could a JavaScript ActiveX control be the way forward?
All the help is appreciated.
(I am making this as an internal application for my organisation, and it is assumed that LogParser will be present on the computer where this web application is used.)
Thanks
Ravi
Add a reference to Interop.MSUtil in your project and then use the COM API as described in the help file. The following using statements should allow you to interact with LogParser through your code:
using LogQuery = Interop.MSUtil.LogQueryClass;
using FileLogInputFormat = Interop.MSUtil.COMTextLineInputContextClass;
Then you can do something like:
var inputFormat = new FileLogInputFormat();
// Instantiate the LogQuery object
LogQuery oLogQuery = new LogQuery();
var results = oLogQuery.Execute(yourQuery, inputFormat);
You have access to a bunch of predefined input formats and output formats (like IIS and W3C), so you can pick the one that best suits your needs. Also, you will need to run regsvr32 on LogParser.dll on the machine you are executing on if you have not installed LogParser. The doc is actually pretty good for getting you started.
