How can I load a DOM from a string in PhantomJS?


Most of the examples I have found on the web involve loading a URL.
However, if I simply have a string that contains SVG or HTML and want to load it into a DOM for manipulation, I cannot figure out how to do it.
var fs=require('fs')
var content = fs.read("EarlierSavedPage.svg")
// How do I load content into a DOM?
I realize that in this example, where a local file is being read, there is a workaround of opening the local file directly, but I am more generally interested in whether a page can be loaded from a string.
I have already looked at the documentation but did not see anything obvious.

The default page in PhantomJS is comparable to about:blank and is essentially
<html>
<body>
</body>
</html>
This means that you can add your SVG directly to the DOM and render it. It seems that you have to render it asynchronously to give the browser time to actually compute the SVG. Here is a complete script:
var page = require('webpage').create(),
fs = require('fs')
var content = fs.read("EarlierSavedPage.svg")
page.evaluate(function(content){
document.body.innerHTML = content;
}, content);
setTimeout(function(){
page.render("EarlierSavedPage.png"); // render or do whatever
phantom.exit();
}, 0); // phantomjs is single threaded so you need to do this asynchronously, but immediately
When you load an HTML file into content, you can assign it directly to the current DOM (via page.content):
page.content = content;
This would likely also need some asynchronous decoupling like above.
The other way would be to actually load the HTML file with page.open:
page.open(filePathToHtmlFile, function(success){
// do something like render
phantom.exit();
});
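As an alternative to the zero-delay timer, it may be possible to wait for the page's load callback instead. The following is only a sketch: it assumes that assigning to page.content triggers page.onLoadFinished in your PhantomJS version, which is worth verifying. The helper name loadFromString and the file names EarlierSavedPage.html and EarlierSavedPage.png are made up for illustration.

```javascript
// Hypothetical helper: put markup into a page and call back when loading ends.
// Assumption: assigning page.content fires page.onLoadFinished.
function loadFromString(page, markup, onReady) {
  page.onLoadFinished = function (status) {
    onReady(status); // PhantomJS passes "success" or "fail"
  };
  page.content = markup;
}

// This part only runs inside PhantomJS, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
  var fileSystem = require('fs');
  var webPage = require('webpage').create();
  var markup = fileSystem.read('EarlierSavedPage.html'); // hypothetical file
  loadFromString(webPage, markup, function (status) {
    if (status === 'success') {
      webPage.render('EarlierSavedPage.png'); // render or do whatever
    }
    phantom.exit();
  });
}
```

Passing the page object in as a parameter keeps the helper independent of how the page was created.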

Related

How to read the generated source (html with DOM changes) of a webpage within javascript?

I want to read a webpage programmatically (with JavaScript/Angular) and search for some elements inside it. What I have so far is:
$http.get('http://.....').success(function(data) {
var doc = new DOMParser().parseFromString(data, 'text/html');
var result = doc.evaluate('//div[@class = \'xx\']/a', doc, null, XPathResult.STRING_TYPE, null);
$scope.all = result.stringValue;
});
So in this example I can read the value of any HTML element.
Unfortunately, the page I want to read uses some JavaScript, and the HTML source is only part of the final HTML (including DOM changes) that the browser ends up showing. So the HTML returned from the HTTP GET does not necessarily contain the elements I need.
Is there a way of getting the entire html after the javascript run?
Edit: Yes the page is from another domain + The provided API does not give me the info i need.

Get script content [duplicate]

If I have a script tag like this:
<script
id = "myscript"
src = "http://www.example.com/script.js"
type = "text/javascript">
</script>
I would like to get the content of the "script.js" file. I'm thinking about something like document.getElementById("myscript").text but it doesn't work in this case.

tl;dr script tags are not subject to CORS and same-origin-policy and therefore javascript/DOM cannot offer access to the text content of the resource loaded via a <script> tag, or it would break same-origin-policy.
long version:
Most of the other answers (including the accepted one) correctly say that the "correct" way to get the text content of a JavaScript file loaded via a <script> tag is to use an XMLHttpRequest to make a separate, additional request for the resource named in the script's src property, which the short JavaScript code below demonstrates. However, I found that the other answers did not address why the file's text content has to be fetched this way: allowing access to the content of a file included via <script src=[url]></script> would break the CORS policies. Modern browsers prevent XHR access to resources that do not provide the Access-Control-Allow-Origin header, so browsers offer no way to get the content other than those subject to CORS.
With the following code (the "use XHR/AJAX" approach mentioned in the other answers), it is possible to make another request for every non-inline script tag in the document.
function printScriptTextContent(script)
{
var xhr = new XMLHttpRequest();
xhr.open("GET",script.src)
xhr.onreadystatechange = function () {
if(xhr.readyState === XMLHttpRequest.DONE && xhr.status === 200) {
console.log("the script text content is",xhr.responseText);
}
};
xhr.send();
}
Array.prototype.slice.call(document.querySelectorAll("script[src]")).forEach(printScriptTextContent);
The other answers cover this approach in more detail, so I will not repeat it; the point of this answer is to explain why it is needed.

Do you want to get the contents of the file http://www.example.com/script.js? If so, you could turn to AJAX methods to fetch its content, assuming it resides on the same server as the page itself.

Update: HTML Imports are now deprecated (alternatives).
---
I know it's a little late, but some browsers support the <link rel="import"> element.
http://www.html5rocks.com/en/tutorials/webcomponents/imports/
<link rel="import" href="/path/to/imports/stuff.html">
For the rest, ajax is still the preferred way.

I don't think the contents will be available via the DOM. You could get the value of the src attribute and use AJAX to request the file from the server.

yes, Ajax is the way to do it, as in accepted answer. If you get down to the details, there are many pitfalls. If you use jQuery.load(...), the wrong content type is assumed (html instead of application/javascript), which can mess things up by putting unwanted <br> into your (scriptNode).innerText, and things like that. Then, if you use jQuery.getScript(...), the downloaded script is immediately executed, which might not be what you want (might screw up the order in which you want to load the files, in case you have several of those.)
I found it best to use jQuery.ajax with dataType: "text"
I used this Ajax technique in a project with a frameset, where the frameset and/or several frames need the same JavaScript, in order to avoid having the server send that JavaScript multiple times.
Here is code, tested and working:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">
<html>
<head>
<script id="scriptData">
var scriptData = [
{ name: "foo" , url: "path/to/foo" },
{ name: "bar" , url: "path/to/bar" }
];
</script>
<script id="scriptLoader">
var LOADER = {
loadedCount: 0,
toBeLoadedCount: 0,
load_jQuery: function (){
var jqNode = document.createElement("script");
jqNode.setAttribute("src", "/path/to/jquery");
jqNode.setAttribute("onload", "LOADER.loadScripts();");
jqNode.setAttribute("id", "jquery");
document.head.appendChild(jqNode);
},
loadScripts: function (){
var scriptDataLookup = this.scriptDataLookup = {};
var scriptNodes = this.scriptNodes = {};
var scriptNodesArr = this.scriptNodesArr = [];
for (var j=0; j<scriptData.length; j++){
var theEntry = scriptData[j];
scriptDataLookup[theEntry.name] = theEntry;
}
//console.log(JSON.stringify(scriptDataLookup, null, 4));
for (var i=0; i<scriptData.length; i++){
var entry = scriptData[i];
var name = entry.name;
var theURL = entry.url;
this.toBeLoadedCount++;
var node = document.createElement("script");
node.setAttribute("id", name);
scriptNodes[name] = node;
scriptNodesArr.push(node);
jQuery.ajax({
method : "GET",
url : theURL,
dataType : "text"
}).done(this.makeHandler(name, node)).fail(this.makeFailHandler(name, node));
}
},
makeFailHandler: function(name, node){
var THIS = this;
return function(xhr, errorName, errorMessage){
console.log(name, "FAIL");
console.log(xhr);
console.log(errorName);
console.log(errorMessage);
debugger;
}
},
makeHandler: function(name, node){
var THIS = this;
return function (fileContents, status, xhr){
THIS.loadedCount++;
//console.log("loaded", name, "content length", fileContents.length, "status", status);
//console.log("loaded:", THIS.loadedCount, "/", THIS.toBeLoadedCount);
THIS.scriptDataLookup[name].fileContents = fileContents;
if (THIS.loadedCount >= THIS.toBeLoadedCount){
THIS.allScriptsLoaded();
}
}
},
allScriptsLoaded: function(){
for (var i=0; i<this.scriptNodesArr.length; i++){
var scriptNode = this.scriptNodesArr[i];
var name = scriptNode.id;
var data = this.scriptDataLookup[name];
var fileContents = data.fileContents;
var textNode = document.createTextNode(fileContents);
scriptNode.appendChild(textNode);
document.head.appendChild(scriptNode); // execution is here
//console.log(scriptNode);
}
// call code to make the frames here
}
};
</script>
</head>
<frameset rows="200pixels,*" onload="LOADER.load_jQuery();">
<frame src="about:blank"></frame>
<frame src="about:blank"></frame>
</frameset>
</html>

.text did get you contents of the tag, it's just that you have nothing between your open tag and your end tag. You can get the src attribute of the element using .src, and then if you want to get the javascript file you would follow the link and make an ajax request for it.

In a comment to my previous answer:
I want to store the content of the script so that I can cache it and use it directly some time later without having to fetch it from the external web server (not on the same server as the page)
In that case you're better off using a server-side script to fetch and cache the script file. Depending on your server setup you could just wget the file (periodically via cron if you expect it to change) or do something similar with a small script in the language of your choice.

If you want the contents of the file referenced by the src attribute, you would have to do an Ajax request and look at the responseText. If you were to have the JS between <script> and </script>, you could access it through innerHTML.
This might be of interest: http://ejohn.org/blog/degrading-script-tags/

I had the same issue, so I solved it this way:
The js file contains something like
window.someVarForReturn = `content for return`
On html
<script src="file.js"></script>
<script>console.log(someVarForReturn)</script>
In my case the content was an HTML template, so I did something like this:
On js file
window.someVarForReturn = `<div>My template</div>`
On html
<script src="file.js"></script>
<script>
new DOMParser().parseFromString(someVarForReturn, 'text/html').body.children[0]
</script>

You cannot directly get what the browser loaded as the content of your specific script tag (it would be a security hazard).
But you can request the same resource (src) again (which will usually succeed immediately thanks to the cache) and read its text:
const scriptSrc = document.querySelector('script#yours').src;
// re-request the same location
const scriptContent = await fetch(scriptSrc).then((res) => res.text());

If you're looking to access the attributes of the <script> tag rather than the contents of script.js, then XPath may well be what you're after.
It will allow you to get each of the script attributes.
If it's the example.js file contents you're after, then you can fire off an AJAX request to fetch it.

It's funny, but we can't; we have to fetch them again over the network.
The browser will likely read from its cache, but a request may still be sent to check that the cached copy is current.
[...document.scripts].forEach((script) => {
fetch(script.src)
.then((response) => response.text() )
.then((source) => console.log(source) )
})
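One detail the loop above does not handle: inline scripts have an empty src, and fetching an empty URL would presumably request the current page instead of a script. A minimal sketch that filters those out first; the helper name externalScriptUrls is made up for illustration, and the fetch calls remain subject to CORS as described in the other answers.

```javascript
// Hypothetical helper: collect the URLs of external scripts only,
// skipping inline scripts, whose src is an empty string.
function externalScriptUrls(scripts) {
  return Array.prototype.slice.call(scripts)
    .map(function (script) { return script.src; })
    .filter(function (src) { return Boolean(src); });
}

// Browser-only part: re-request each external script and log its source.
if (typeof document !== 'undefined') {
  externalScriptUrls(document.scripts).forEach(function (src) {
    fetch(src)
      .then(function (response) { return response.text(); })
      .then(function (source) { console.log(source); })
      .catch(function (err) { console.log('could not fetch', src, err); });
  });
}
```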

Using 2008-style DOM-binding it would rather be:
document.getElementById('myscript').getAttribute("src");
document.getElementById('myscript').getAttribute("type");

You want to use the innerHTML property to get the contents of the script tag:
document.getElementById("myscript").innerHTML
But as @olle said in another answer you probably want to have a read of:
http://ejohn.org/blog/degrading-script-tags/

If a src attribute is provided, user agents are required to ignore the content of the element, if you need to access it from the external script, then you are probably doing something wrong.
Update: I see you've added a comment to the effect that you want to cache the script and use it later. To what end? Assuming your HTTP is cache friendly, then your caching needs are likely taken care of by the browser already.

I'd suggest the answer to this question is using the "innerHTML" property of the DOM element. Certainly, if the script has loaded, you do not need to make an Ajax call to get it.
So Sugendran should be correct (not sure why he was voted down without explanation).
var scriptContent = document.getElementById("myscript").innerHTML;
The innerHTML property of the script element should give you the scripts content as a string provided the script element is:
an inline script, or
that the script has loaded (if using the src attribute)
olle also gives the answer, but I think it got 'muddled' by his suggesting it needs to be loaded through Ajax first, and I think he meant "inline" instead of "between".
if you where to have the js between and you could access it through innerHTML.
Regarding the usefulness of this technique:
I've looked to use this technique for client side error logging (of javascript exceptions) after getting "undefined variables" which aren't contained within my own scripts (such as badly injected scripts from toolbars or extensions) - so I don't think it's such a way out idea.

Not sure why you would need to do this?
Another way round would be to hold the script in a hidden element somewhere and use eval to run it. You could then query the element's innerHTML property.

Can JavaScript access its own source url?

Suppose I'm embedding a javascript in HTML page:
<script type="text/javascript" src="www.mydomain.com/script.js?var1=abc&var2=def"></script>
Is there a way I can get the src url inside the script and extract the params?

Given that you are using a regular script element in the HTML source, you can just get the last script element in the document. Since script elements are (in the absence of attributes that you aren't using in your example) blocking, no more will be added to the document until this one has been executed.
var scripts = document.getElementsByTagName('script');
var last_script = scripts[scripts.length - 1];
var url = last_script.src;
This won't work if you dynamically add a script element before the last script using DOM.
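Once the URL is known, the parameters could be extracted with the standard URL API. This is a sketch that assumes a runtime supporting URL and URLSearchParams; the helper name getScriptParams is made up for illustration, and document.currentScript is used here as an alternative lookup where the browser provides it.

```javascript
// Hypothetical helper: turn the query string of a script URL into a plain object.
// `base` is only needed when `src` is a relative URL.
function getScriptParams(src, base) {
  var url = base ? new URL(src, base) : new URL(src);
  var params = {};
  url.searchParams.forEach(function (value, key) {
    params[key] = value;
  });
  return params;
}

// Browser-only part: script.src is already an absolute URL there.
// Inline scripts have an empty src, so they are skipped.
if (typeof document !== 'undefined' &&
    document.currentScript && document.currentScript.src) {
  console.log(getScriptParams(document.currentScript.src));
}
```

Note that document.currentScript is only set while the script is executing synchronously, so it would need to be read at the top level of the script rather than inside a later callback.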

this little hack uses error handling to find the location of external scripts from within:
(function(){ // script filename setter, leaves window.__filename set with active script URL.
if(self.attachEvent){
function fn(e,u){self.__filename=u;}
attachEvent("onerror",fn);
setTimeout(function(){detachEvent("onerror", fn)},20);
eval("gehjkrgh3489c()");
}else{
Object.defineProperty( window, "__filename", { configurable: true, get:function __filename(){
try{document.s0m3741ng()}catch(y){
return "http://" +
String(y.fileName || y.file || y.stack || y + '')
.split(/:\d+:\d+/)[0].split("http://")[1];
}
}})//end __filename
}//end if old IE?
}());
it sets a global "__filename" property when run, so atop an external script, the __filename is in effect for the execution of the whole script.
I strongly prefer to sniff URL parts from src attributes, but this works in most browsers and without knowing the URL ahead of time.

I don't think there is a property already inside the script that points to this url.
From the script, you can read the DOM. So you can look up the script tag and inspect its src attribute, but if you have multiple scripts (or the DOM was modified), you cannot really know for sure which one it is.
I assume it is for checking input. So to solve this, you can either:
Render the script through a server side script (PHP), and let it output variables. Disadvantage: eats more server resources and makes caching a bitch.
Just get parameter from all the scripts loading from your domain. Maybe it doesn't matter much, or you have only one script anyway. Disadvantage: In this case this is possible, but not very reliable and resistant to changes.
My preferred: Add the variables to the script tag (actually, to another script tag) to make them available directly in Javascript, rather than parsing the script url.
Like this:
<script type="text/javascript">
var1 = 'abc';
var2 = 'def';
</script>
<script type="text/javascript" src="www.mydomain.com/script.js"></script>

Here are two other solutions that will work no matter how the script is loaded (even if they are loaded dynamically or with async or defer attributes):
Put an id on the script tag.
<script id="myscript" type="text/javascript" src="www.mydomain.com/script.js?var1=abc&var2=def"></script>
Then, you can find it with the id:
$("#myscript").attr("src")
Or second, if you know the filename, you can search for any script tag that contains that filename:
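function findScriptTagByFilename(fname) {
var found = null;
$("script").each(function() {
if (this.src.indexOf(fname) !== -1) {
found = this.src;
return false; // stop iterating at the first match
}
});
return found; // a return inside .each() would not leave the outer function
}
var url = findScriptTagByFilename("/script.js");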

How to prevent resource loading of unattached elements in Chrome

I'm working on Chrome extension and I have following problem:
var myDiv = document.createElement('div');
myDiv.innerHTML = '<img src="a.png">';
What happens now is that Chrome tries to load the "a.png" resource, even If I don't attach the "div" element to document. Is there a way to prevent it?
In the extension I need to get data from a site that doesn't provide any API, so I have to parse the whole HTML to get the necessary data. Writing my own simple HTML parser could be tricky, so I would rather use the native HTML parser. However, in Chrome, when I put the whole source code into some temporary non-attached element (so that it gets parsed and I can filter the necessary data), all the images (and possibly other resources) start to load as well, causing higher traffic or (in the case of relative paths) lots of errors in the console.

To prevent the resources from being loaded, you'll need to create your Node in an entirely new #document. You can use document.implementation.createHTMLDocument for this.
var dom = document.implementation.createHTMLDocument(); // make new #document
// now use this to..
var myDiv = dom.createElement('div'); // ..create a <div>
myDiv.innerHTML = '<img src="a.png">'; // ..parse HTML

You can delay parsing/loading HTML by storing it in a non-standard attribute, then assigning it to innerHTML "when the time comes":
myDiv.setAttribute('deferredHtml', '<img src="http://upload.wikimedia.org/wikipedia/commons/4/4e/Single_apple.png">');
global.loadDeferredImage = function() {
if(myDiv.hasAttribute('deferredHtml')) {
myDiv.innerHTML = myDiv.getAttribute('deferredHtml');
myDiv.removeAttribute('deferredHtml');
}
};
... onclick="loadDeferredImage()"
I created jsfiddle illustrating this idea:
http://jsfiddle.net/akhikhl/CbCst/3/

Deferring JavaScript loading

I have heard and read a few articles about deferring JavaScript loading and am very interested. It seems to be very promising for web apps that may be useful on Mobile platforms where the amount of JavaScript that can be loaded and executed is limited.
Unfortunately, most of the articles talk about this at an extremely high level. How would one approach this?
EDIT
Normally, all JavaScript is loaded on page load, however, there may be functions that are not necessary until a certain action occurs, at which time, the JavaScript should be loaded. This helps ease the burden of the browser on page load.
Specifically, I have a page that very heavily uses JavaScript. When I load the page on my phone, it won't load properly. As I debugged the page, I eliminated some of the JS functions. Once enough was eliminated, the page suddenly worked.
I want to be able to load the JS as needed. And possibly even eliminate the functions simply used for start up.

The basics are simple - breaking up your JavaScript code into logically separate components and loading only what you need. Depending on what you are building you can use:
Loaders:
Modernizr.load (or yepnope.js by itself)
LABjs
Many, many, many other deferred loading libraries.
Dependency managers (which are also loaders):
Require.js
dojo.require
JavaScript MVC's steal.js
Several other dependency management libraries.
These tools make use of a wide variety of techniques to defer the loading of scripts, the execution of scripts, manage dependencies, etc. What you need depends on what you are building.
You may also want to read through this discussion to learn something more about the pros and cons of using such techniques.
Response to edit:
There isn't really a good way to unload JavaScript that you have already loaded - the closest approximation you can get is to keep all of your loading code namespaced inside your application's namespace and then "clean up" by setting that namespace, and all references to it to null.
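The namespace clean-up described above might look roughly like this. It is a sketch only: the names MyApp, modules, charts, and unloadModule are made up for illustration, and the memory is reclaimed only if no other reference to the module survives.

```javascript
// Hypothetical application namespace; all lazily loaded code hangs off it.
var MyApp = { modules: {} };

// A module registers itself under the namespace when its script runs.
MyApp.modules.charts = {
  draw: function () { return 'chart drawn'; }
};

// "Unloading": drop the namespace's reference so the module can be garbage
// collected, provided nothing else still holds a reference to it.
function unloadModule(app, name) {
  app.modules[name] = null;
}

unloadModule(MyApp, 'charts');
```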

I have used a simple script published online, with some modifications of my own.
Assume that your COMPRESSED Javascript file is in the cache directory in your webserver and you want to defer the loading of this compressed js file.
Your compressed js file:
80aaad2a95e397a9f6f64ac79c4b452f.js
This is the HTML code:
<script type="text/javascript" src="/resources/js/defer.js?cache=80aaad2a95e397a9f6f64ac79c4b452f.js"></script>
This is the defer.js file content:
(function() {
/*
* http://gtmetrix.com/
* In order to load a page, the browser must parse the contents of all <script> tags,
* which adds additional time to the page load. By minimizing the amount of JavaScript needed to render the page,
* and deferring parsing of unneeded JavaScript until it needs to be executed,
* you can reduce the initial load time of your page.
*/
// http://feather.elektrum.org/book/src.html
// Get the script tag from the html
var scripts = document.getElementsByTagName('script');
var myScript = scripts[ scripts.length - 1 ];
// Get the querystring
var queryString = myScript.src.replace(/^[^\?]+\??/,'');
// Parse the parameters
var params = parseQuery( queryString );
var s = document.createElement('script');
s.type = 'text/javascript';
s.async = true;
s.src = '/cache/' + params.cache; // Add the name of the js file
var x = document.getElementsByTagName('script')[0];
x.parentNode.insertBefore(s, x);
function parseQuery ( query ) {
var Params = new Object ();
if ( ! query ) return Params; // return empty object
var Pairs = query.split(/[;&]/);
for ( var i = 0; i < Pairs.length; i++ ) {
var KeyVal = Pairs[i].split('=');
if ( ! KeyVal || KeyVal.length != 2 ) continue;
var key = unescape( KeyVal[0] );
var val = unescape( KeyVal[1] );
val = val.replace(/\+/g, ' ');
Params[key] = val;
}
return Params;
}
})();
I would like to say thanks to http://feather.elektrum.org/book/src.html that helped me to understand how to get the parameters from the script tag.
bye

Deferring loading til when?
The typical reason JS is loaded last is so that the entire DOM has been loaded first.
An easy way is to just use
<body onload="doSomething();">
So you could easily have doSomething() function to load all your JS.
You can also add a function to window.onload, like
window.onload = function(){ };
Also, if you are using JS libraries such as jQuery and Dojo, they each have their own onReady and addOnLoad methods for running some JS only after the document has loaded.

Here's a useful article on the script element's defer and async attributes. Specifying these attributes will get the browser to defer loading in different ways. You can also load in an external script using JavaScript after page load.
It should also be noted that the position of your script elements within your HTML document will determine load and execution order if neither defer nor async have been specified.
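Loading an external script from JavaScript after page load could be sketched as follows. The helper name loadScript and the URL in the usage note are made up for illustration; the document is passed in as a parameter so that the helper does not depend on a global.

```javascript
// Hypothetical helper: inject a <script> element on demand and report
// the outcome through a Promise.
function loadScript(doc, url) {
  return new Promise(function (resolve, reject) {
    var script = doc.createElement('script');
    script.src = url;
    script.async = true; // do not block parsing while the file downloads
    script.onload = function () { resolve(script); };
    script.onerror = function () {
      reject(new Error('Failed to load ' + url));
    };
    doc.head.appendChild(script); // the download starts once attached
  });
}
```

In a browser this could be called from an event handler, for example loadScript(document, '/js/feature.js').then(function () { /* use the feature */ }), so the file is requested only when the user actually needs it.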
