How to scrape embedded JSON using PhantomJS


I need to get a particular piece of data from a JSON string encoded within a script tag within a returned HTML document using phantomjs. The HTML looks basically like this:
... [preamble html tags etc.]
....
<script id="ine-data" type="application/json">
{"userData": {"account_owner": "Grib"},
"skey":"b207ff1f8d5a394c2f7af1681ad3470c",
"location": "EU"
</script>
<script id="notification-data" type="application/json">
... [other stuff including html body]
What I need to get to is the value for skey within the JSON. I am unable to use the selectors to even get to the script. For instance,
page.open('https://www.site1.com/dash', function(status) {
var ine_data = document.querySelectorAll('script').item(0);
console.log(ine_data); phantom.exit();
});
This returns null. Can anyone point me in the right direction please?

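The PhantomJS function you're looking for is page.evaluate. The callback you pass to page.open runs in the PhantomJS context, which has no access to the page's DOM, whereas the function you pass to page.evaluate runs sandboxed inside the loaded page itself, where document refers to that page. Only simple, JSON-serializable values can be returned from it.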
So following your example:
page.open('https://www.site1.com/dash', function(status) {
var ske = page.evaluate(function() {
var json_text = document.querySelector("#ine-data").innerHTML,
json_values = JSON.parse(json_text);
return json_values.skey;
});
console.log(ske);
phantom.exit();
});
Though I'd note that the JSON in your example is invalid (missing a trailing }), so my example won't work without fixing that first!
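If the HTML is already available as a string (for example from page.content), the same extraction can be sketched without DOM selectors. This is only a minimal sketch under stated assumptions: the helper name extractEmbeddedJson is made up for illustration, the regular expression assumes the id attribute is quoted and contains no regex metacharacters, and invalid JSON (like the sample in the question) yields null rather than an exception.

```javascript
// Hypothetical helper (not part of PhantomJS): return the parsed JSON found in
// <script id="scriptId" ...>...</script>, or null if it is missing or invalid.
function extractEmbeddedJson(html, scriptId) {
  var pattern = new RegExp(
    '<script[^>]*\\bid=["\']' + scriptId + '["\'][^>]*>([\\s\\S]*?)</script>',
    'i'
  );
  var match = pattern.exec(html);
  if (!match) {
    return null; // no script tag with that id
  }
  try {
    return JSON.parse(match[1]);
  } catch (e) {
    return null; // the tag exists but its body is not valid JSON
  }
}

// Sample input modelled on the question, with the missing closing brace added.
var sampleHtml =
  '<html><head>' +
  '<script id="ine-data" type="application/json">' +
  '{"userData": {"account_owner": "Grib"},' +
  '"skey":"b207ff1f8d5a394c2f7af1681ad3470c",' +
  '"location": "EU"}' +
  '</script></head><body></body></html>';

var data = extractEmbeddedJson(sampleHtml, 'ine-data');
console.log(data ? data.skey : 'not found');
```

Returning null for both "tag not found" and "invalid JSON" keeps the sketch short; real code would probably want to tell those two cases apart.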

Related

How to solve error while parsing HTML

I'm trying to get elements from a web page into a Google spreadsheet using:
function pegarAsCoisas() {
var html = UrlFetchApp.fetch("http://www.saosilvestre.com.br").getContentText();
var elements = XmlService.parse(html);
}
However, I keep getting the error:
Error on line 2: Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character. (line 4, file "")
How do I solve this? I want to get the H1 text from this site, but for other sites I'll have to select other elements.
I know the method XmlService.parse(html) works for other sites, like Wikipedia.

The HTML isn't XML, and you don't need to parse it. You can use string methods instead:
function pegarAsCoisas() {
var urlFetchReturn = UrlFetchApp.fetch("http://www.saosilvestre.com.br");
var html = urlFetchReturn.getContentText();
Logger.log('html.length: ' + html.length);
var index_OfH1 = html.indexOf('<h1');
var endingH1 = html.indexOf('</h1>');
Logger.log('index_OfH1: ' + index_OfH1);
Logger.log('endingH1: ' + endingH1);
var h1Element = html.slice(index_OfH1, endingH1);
var h1Content = h1Element.slice(h1Element.indexOf(">")+1);
Logger.log('h1Content: ' + h1Content);
};
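The slicing above assumes that both the opening and the closing tag are present; when indexOf returns -1 the result is silently wrong. A slightly more defensive version might look like the sketch below. The function name extractFirstTagText is hypothetical, it only looks at the first occurrence of the tag, it does not strip nested markup, and a prefix match such as '<h1' would also match a longer tag name that starts the same way.

```javascript
// Hypothetical helper: return the text between the first <tagName ...> and its
// closing </tagName>, or null if either tag cannot be found.
function extractFirstTagText(html, tagName) {
  var openStart = html.indexOf('<' + tagName);
  if (openStart === -1) {
    return null; // no opening tag
  }
  var openEnd = html.indexOf('>', openStart);
  if (openEnd === -1) {
    return null; // opening tag is never closed with ">"
  }
  var closeStart = html.indexOf('</' + tagName + '>', openEnd);
  if (closeStart === -1) {
    return null; // no closing tag after the opening one
  }
  return html.slice(openEnd + 1, closeStart).trim();
}

var headline = extractFirstTagText('<body><h1 class="title"> Hello </h1></body>', 'h1');
console.log(headline);
```

In Apps Script the first argument would be the string returned by getContentText().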

XmlService works only with well-formed XML content; it is not error tolerant. Google Apps Script used to have a more tolerant service, the Xml service, but it was deprecated. However, it still works and you can use it instead, as explained here: GAS-XML

Technically HTML and XHTML are not the same. See What are the main differences between XHTML and HTML?
Regarding the OP code, the following works just fine
function pegarAsCoisas() {
var html = UrlFetchApp
.fetch('http://www.saosilvestre.com.br')
.getContentText();
Logger.log(html);
}
As was said in previous answers, other methods should be used instead of calling XmlService directly on the content returned by UrlFetchApp. You could first convert the web page source from HTML to XHTML so that it can be parsed with the Xml Service (XmlService), use the older Xml service, which could work directly with HTML pages, or handle the web page source directly as text.
Related questions:
How to parse an HTML string in Google Apps Script without using XmlService?
What is the best way to parse html in google apps script

Try replacing itemscope with itemscope = '':
function pegarAsCoisas() {
var html = UrlFetchApp.fetch("http://www.saosilvestre.com.br").getContentText();
// replace is a String method, so it has to be called on html
html = html.replace("itemscope", "itemscope = ''");
var elements = XmlService.parse(html);
}
For more information, look here.
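Note that String.prototype.replace with a string pattern only replaces the first occurrence. If the page contains several bare itemscope attributes, a global regular expression could be used instead. The sketch below rests on assumptions: the function name is hypothetical, the pattern would also touch the word "itemscope" if it appeared in ordinary page text, and other non-XML constructs in the HTML may still make XmlService.parse fail.

```javascript
// Hypothetical helper: give every bare `itemscope` attribute an empty value
// so that a strict XML parser no longer rejects it.
function addEmptyValueToItemscope(html) {
  // \bitemscope\b  matches the attribute name as a whole word
  // (?!\s*=)       skips occurrences that already have a value
  return html.replace(/\bitemscope\b(?!\s*=)/g, 'itemscope=""');
}

var sample =
  '<html itemscope itemtype="http://schema.org/WebPage">' +
  '<div itemscope></div></html>';
var patched = addEmptyValueToItemscope(sample);
console.log(patched);
```

Because attributes that already have a value are skipped, running the function twice gives the same result as running it once.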

External JavaScript file is not defined

For a web project, I've included a JavaScript file as a script src, as shown here.
<script src="xml2json.js"> //same directory as the web project
Next, I tried to invoke a method within xml2json, called xml_str2json.
downloadUrl("ship_track_ajax.php", function(data) {
var xml_string = data.responseText; //an XML string
//A parser to transform XML string into a JSON object is required.
//Use convert XML to JSON with xml2json.js
var markers = xml2json.xml_str2json(xml_string);
});
However, console log indicates "Uncaught ReferenceError: xml2json is not defined", even though xml2json is included as a script src. Can anyone tell me as to what is wrong?

You have to call the function directly in JavaScript, without referring to the file name:
xml_str2json(xml_string);
If the function is defined in any of the included files, it will be invoked.
I hope this solves your problem.
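A ReferenceError like the one in the question means that no global variable with that name exists; including a file called xml2json.js does not by itself create an object named xml2json. Before calling into a library, it can help to check which globals it actually defines. The sketch below is generic: findGlobalFunction and the candidate names are illustrative assumptions, not the documented API of any particular xml2json library.

```javascript
// Hypothetical helper: return the first function found under one of the
// candidate names on the given scope object, or null if none is defined.
function findGlobalFunction(scope, candidateNames) {
  for (var i = 0; i < candidateNames.length; i++) {
    var candidate = scope[candidateNames[i]];
    if (typeof candidate === 'function') {
      return candidate;
    }
  }
  return null;
}

// Stand-in scope object for illustration; in a browser you would pass `window`.
var fakeScope = {
  xml_str2json: function (xml) {
    return { length: xml.length };
  }
};
var convert = findGlobalFunction(fakeScope, ['xml2json', 'xml_str2json']);
console.log(convert ? 'found a converter' : 'no converter defined');
```

If nothing is found, the library's own documentation is the place to check how it expects to be instantiated and called.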

Maybe you should try this:
var json = xml2json(parseXml(xml), " ");
See Demo from https://github.com/henrikingo/xml2json

executing javascript selectors on page source in string format

I am trying to develop an application with jsoup and Java to scrape some web pages. What I am hoping to do is let jsoup get the page source first, and then execute the JavaScript below on that page source and return a result.
$("body, body *").each(function(i, val) {
// do something and something more
});
I am planning to use ScriptEngineManager to execute javascript code from Java.
ScriptEngineManager manager = new ScriptEngineManager();
ScriptEngine engine = manager.getEngineByName("JavaScript");
My question is: is there a way to make the JS script accept the Document (or string) that jsoup returns and run select operations on it, just as on a regular page?
So what I am hoping to have is something like this:
// document is the Document object returned from the Jsoup or Document converted to string
var myfunction = function(document){
document.$("body, body *").each(function(i, val) {
// do something and something more
});
}

The jQuery selector function accepts a string of HTML as its context, so you can select only from within that variable. For example (using the variable as context):
//a string of elements/html
var doc = '<html><body><ul><li>1</li><li>2</li><li>3</li></ul></body></html>';
$('li', doc).each(function() {
console.log(this); //iterates each 'li' within 'doc'
});
Or, in your case:
var myfunction = function(doc){
$("*", doc).each(function(i, val) {
// do something and something more
});
}

unterminated string literal when using resources in razor

I have Razor code that uses resources from resources.resx. When I use it in a JavaScript function, it shows the error "unterminated string literal". How do I use resources in my JavaScript code? In the HTML part of my code, however, it is able to get the actual value of @mynamespace.name
function check(arg)
{
...
var name = "@mynamespace.name";
...
}

You can't use Razor in a JavaScript file, because JavaScript files are static. All you can do is use a script section in your .cshtml file. You can use a workaround, following jcreamer898's post https://stackoverflow.com/a/9406739/4563955
// someFile.js
var myFunction = function(options){
// do stuff with options
};
// razorFile.cshtml
<script>
window.myFunction = new myFunction(@model.Stuff);
// If you need a whole model serialized then use...
window.myFunction = new myFunction(@Html.Raw(Json.Encode(model)));
</script>
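One common cause of "unterminated string literal" in this situation is that the server-side value contains a quote or a line break, which then ends up verbatim inside the JavaScript string. That is an assumption about the question's data, not something it confirms. Serializing the value as JSON before writing it into the script avoids the problem. The effect can be sketched in plain JavaScript, with JSON.stringify standing in for the server-side encoder; toScriptLiteral and isValidScript are made-up names for illustration.

```javascript
// Hypothetical helper: turn any string into a JavaScript literal that is safe
// to place inside an inline <script> block.
function toScriptLiteral(value) {
  // JSON.stringify escapes quotes and line breaks; "<" is additionally
  // escaped so the value can never close the surrounding <script> tag.
  return JSON.stringify(value).replace(/</g, '\\u003c');
}

// Returns true if the source text parses as JavaScript.
function isValidScript(source) {
  try {
    new Function(source);
    return true;
  } catch (e) {
    return false;
  }
}

var resourceText = 'First line\nSecond "quoted" line </script>';
var naive = 'var name = "' + resourceText + '";';           // raw value pasted in
var safe = 'var name = ' + toScriptLiteral(resourceText) + ';';

console.log(isValidScript(naive));
console.log(isValidScript(safe));
```

The naive version fails to parse because the line break and the inner quotes end the string early, while the serialized version round-trips the original text.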

Get script content [duplicate]

If I have a script tag like this:
<script
id = "myscript"
src = "http://www.example.com/script.js"
type = "text/javascript">
</script>
I would like to get the content of the "script.js" file. I'm thinking about something like document.getElementById("myscript").text but it doesn't work in this case.

tl;dr: script tags are not subject to CORS and the same-origin policy, so JavaScript/the DOM cannot offer access to the text content of a resource loaded via a <script> tag; doing so would break the same-origin policy.
long version:
Most of the other answers (and the accepted answer) correctly indicate that the "correct" way to get the text content of a JavaScript file loaded into the page via a <script> tag is to use an XMLHttpRequest to perform a separate, additional request for the resource named in the script's src property, which the short JavaScript code below demonstrates. What the other answers did not address is why the file's text content has to be fetched that way: allowing access to the content of a file included via <script src=[url]></script> would break the CORS policies. Modern browsers prevent cross-origin XHR requests for resources that do not provide the Access-Control-Allow-Origin header, so browsers offer no way to get the content other than those subject to CORS.
With the following code (as mentioned in the other answers: "use XHR/AJAX") it is possible to make another request for every non-inline script tag in the document.
function printScriptTextContent(script)
{
var xhr = new XMLHttpRequest();
xhr.open("GET", script.src);
xhr.onreadystatechange = function () {
if(xhr.readyState === XMLHttpRequest.DONE && xhr.status === 200) {
console.log("the script text content is",xhr.responseText);
}
};
xhr.send();
}
Array.prototype.slice.call(document.querySelectorAll("script[src]")).forEach(printScriptTextContent);
The other answers already cover that approach, so I will not repeat it further; what this answer adds is the reason why it is needed.

Do you want to get the contents of the file http://www.example.com/script.js? If so, you could turn to AJAX methods to fetch its content, assuming it resides on the same server as the page itself.

Update: HTML Imports are now deprecated (alternatives).
---
I know it's a little late, but some browsers support the rel="import" attribute on the <link> tag.
http://www.html5rocks.com/en/tutorials/webcomponents/imports/
<link rel="import" href="/path/to/imports/stuff.html">
For the rest, ajax is still the preferred way.

I don't think the contents will be available via the DOM. You could get the value of the src attribute and use AJAX to request the file from the server.

Yes, Ajax is the way to do it, as in the accepted answer. Once you get down to the details, there are many pitfalls. If you use jQuery.load(...), the wrong content type is assumed (html instead of application/javascript), which can mess things up by putting unwanted <br> into your (scriptNode).innerText, and things like that. And if you use jQuery.getScript(...), the downloaded script is immediately executed, which might not be what you want (it might screw up the order in which you want to load the files, in case you have several of them).
I found it best to use jQuery.ajax with dataType: "text".
I used this Ajax technique in a project with a frameset, where the frameset and/or several frames need the same JavaScript, in order to avoid having the server send that JavaScript multiple times.
Here is code, tested and working:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">
<html>
<head>
<script id="scriptData">
var scriptData = [
{ name: "foo" , url: "path/to/foo" },
{ name: "bar" , url: "path/to/bar" }
];
</script>
<script id="scriptLoader">
var LOADER = {
loadedCount: 0,
toBeLoadedCount: 0,
load_jQuery: function (){
var jqNode = document.createElement("script");
jqNode.setAttribute("src", "/path/to/jquery");
jqNode.setAttribute("onload", "LOADER.loadScripts();");
jqNode.setAttribute("id", "jquery");
document.head.appendChild(jqNode);
},
loadScripts: function (){
var scriptDataLookup = this.scriptDataLookup = {};
var scriptNodes = this.scriptNodes = {};
var scriptNodesArr = this.scriptNodesArr = [];
for (var j=0; j<scriptData.length; j++){
var theEntry = scriptData[j];
scriptDataLookup[theEntry.name] = theEntry;
}
//console.log(JSON.stringify(scriptDataLookup, null, 4));
for (var i=0; i<scriptData.length; i++){
var entry = scriptData[i];
var name = entry.name;
var theURL = entry.url;
this.toBeLoadedCount++;
var node = document.createElement("script");
node.setAttribute("id", name);
scriptNodes[name] = node;
scriptNodesArr.push(node);
jQuery.ajax({
method : "GET",
url : theURL,
dataType : "text"
}).done(this.makeHandler(name, node)).fail(this.makeFailHandler(name, node));
}
},
makeFailHandler: function(name, node){
var THIS = this;
return function(xhr, errorName, errorMessage){
console.log(name, "FAIL");
console.log(xhr);
console.log(errorName);
console.log(errorMessage);
debugger;
}
},
makeHandler: function(name, node){
var THIS = this;
return function (fileContents, status, xhr){
THIS.loadedCount++;
//console.log("loaded", name, "content length", fileContents.length, "status", status);
//console.log("loaded:", THIS.loadedCount, "/", THIS.toBeLoadedCount);
THIS.scriptDataLookup[name].fileContents = fileContents;
if (THIS.loadedCount >= THIS.toBeLoadedCount){
THIS.allScriptsLoaded();
}
}
},
allScriptsLoaded: function(){
for (var i=0; i<this.scriptNodesArr.length; i++){
var scriptNode = this.scriptNodesArr[i];
var name = scriptNode.id;
var data = this.scriptDataLookup[name];
var fileContents = data.fileContents;
var textNode = document.createTextNode(fileContents);
scriptNode.appendChild(textNode);
document.head.appendChild(scriptNode); // execution is here
//console.log(scriptNode);
}
// call code to make the frames here
}
};
</script>
</head>
<frameset rows="200pixels,*" onload="LOADER.load_jQuery();">
<frame src="about:blank"></frame>
<frame src="about:blank"></frame>
</frameset>
</html>

.text did get you the contents of the tag; it's just that you have nothing between your opening tag and your closing tag. You can get the src attribute of the element using .src, and then, if you want the JavaScript file itself, follow that link and make an Ajax request for it.

In a comment to my previous answer:
I want to store the content of the script so that I can cache it and use it directly some time later without having to fetch it from the external web server (not on the same server as the page)
In that case you're better off using a server-side script to fetch and cache the script file. Depending on your server setup, you could just wget the file (periodically via cron if you expect it to change) or do something similar with a small script in the language of your choice.

If you want the contents of the file that the src attribute points to, you would have to do an Ajax request and look at the responseText. If you were to have the JS between <script> and </script>, you could access it through innerHTML.
This might be of interest: http://ejohn.org/blog/degrading-script-tags/

I had the same issue, so I solved it this way:
The js file contains something like
window.someVarForReturn = `content for return`
On html
<script src="file.js"></script>
<script>console.log(someVarForReturn)</script>
In my case the content was an HTML template, so I did something like this:
On js file
window.someVarForReturn = `<div>My template</div>`
On html
<script src="file.js"></script>
<script>
new DOMParser().parseFromString(someVarForReturn, 'text/html').body.children[0]
</script>

You cannot directly get what the browser loaded as the content of your specific script tag (security hazard), but you can request the same resource (src) again, which will typically be served from the cache, and read its text:
const scriptSrc = document.querySelector('script#yours').src;
// re-request the same location
const scriptContent = await fetch(scriptSrc).then((res) => res.text());

If you're looking to access the attributes of the <script> tag rather than the contents of script.js, then XPath may well be what you're after.
It will allow you to get each of the script attributes.
If it's the example.js file contents you're after, then you can fire off an AJAX request to fetch it.

It's funny, but we can't; we have to fetch them again over the network.
The browser will likely read from its cache, but a request may still be sent to revalidate the cached copy.
[...document.scripts].forEach((script) => {
fetch(script.src)
.then((response) => response.text() )
.then((source) => console.log(source) )
})
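One caveat with the snippet above: document.scripts also contains inline scripts, whose src is an empty string, and fetch('') resolves relative to the current page, so it requests the page itself. Filtering those out first avoids the surprise. In this sketch externalScriptUrls is a made-up helper name, and it accepts any array-like of objects with a src property so that it can be tried outside a browser.

```javascript
// Hypothetical helper: list the URLs of all scripts that have a non-empty src.
function externalScriptUrls(scripts) {
  return Array.from(scripts)
    .map((script) => script.src)
    .filter((src) => typeof src === 'string' && src.length > 0);
}

// Stand-in for document.scripts: one inline script, one external script.
const fakeScripts = [
  { src: '' },
  { src: 'http://www.example.com/script.js' }
];
const urls = externalScriptUrls(fakeScripts);
console.log(urls);
```

In a browser, the result of externalScriptUrls(document.scripts) could then be passed to fetch one URL at a time, exactly as in the loop above.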

Using 2008-style DOM-binding it would rather be:
document.getElementById('myscript').getAttribute("src");
document.getElementById('myscript').getAttribute("type");

You want to use the innerHTML property to get the contents of the script tag:
document.getElementById("myscript").innerHTML
But as @olle said in another answer you probably want to have a read of:
http://ejohn.org/blog/degrading-script-tags/

If a src attribute is provided, user agents are required to ignore the content of the element, if you need to access it from the external script, then you are probably doing something wrong.
Update: I see you've added a comment to the effect that you want to cache the script and use it later. To what end? Assuming your HTTP is cache friendly, then your caching needs are likely taken care of by the browser already.

I'd suggest the answer to this question is using the "innerHTML" property of the DOM element. Certainly, if the script has loaded, you do not need to make an Ajax call to get it.
So Sugendran should be correct (not sure why he was voted down without explanation).
var scriptContent = document.getElementById("myscript").innerHTML;
The innerHTML property of the script element should give you the script's content as a string, provided the script element is:
an inline script, or
that the script has loaded (if using the src attribute)
olle also gives the answer, but I think it got 'muddled' by his suggesting it needs to be loaded through Ajax first, and I think he meant "inline" instead of "between":
If you were to have the JS between <script> and </script>, you could access it through innerHTML.
Regarding the usefulness of this technique:
I've looked to use this technique for client side error logging (of javascript exceptions) after getting "undefined variables" which aren't contained within my own scripts (such as badly injected scripts from toolbars or extensions) - so I don't think it's such a way out idea.

Not sure why you would need to do this?
Another way round would be to hold the script in a hidden element somewhere and use eval to run it. You could then query the element's innerHTML property.
