PhantomJS create page from string

PhantomJS create page from string - javascript

Is it possible to create a page from a string?
example:
html = '<html><body>blah blah blah</body></html>'
page.open(html, function(status) {
// do something
});
I have already tried the above with no luck....
Also, I think it's worth mentioning that I'm using nodejs with phantomjs-node(https://github.com/sgentle/phantomjs-node)
Thanks!

It's very simple, take a look at the colorwheel.js example.
var page = require('webpage').create();
page.content = '<html><body><p>Hello world</p></body></html>';
That's all! Then you can manipulate the page, e.g. render it as an image.

To do this you need to set the page content to your string.
phantom.create(function (ph) {
ph.createPage(function (page) {
page.set('viewportSize', {width:1440,height:900})
//like this
page.set('content', html);
page.render(path_to_pdf, function() {
//now pdf is written to disk.
ph.exit();
});
});
});
You need to use page.set() to set the HTML content, as per https://github.com/sgentle/phantomjs-node#functionality-details:
Properties can't be get/set directly.
Instead use page.get('version', callback) or page.set('viewportSize', {width:640,height:480}), etc.
Nested objects can be accessed by including dots in keys, such as
page.set('settings.loadImages', false)
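The dotted-key convention quoted above can be pictured with a small standalone sketch. `setByPath` and `fakePage` are hypothetical names made up for illustration. They are not part of phantomjs-node and say nothing about how that library implements `page.set` internally.

```javascript
// Hypothetical helper (NOT part of phantomjs-node): walks a dotted key such
// as 'settings.loadImages' and assigns the value on the nested object.
function setByPath(target, dottedKey, value) {
  var parts = dottedKey.split('.');
  var obj = target;
  for (var i = 0; i < parts.length - 1; i++) {
    // Create intermediate objects when they do not exist yet.
    if (typeof obj[parts[i]] !== 'object' || obj[parts[i]] === null) {
      obj[parts[i]] = {};
    }
    obj = obj[parts[i]];
  }
  obj[parts[parts.length - 1]] = value;
  return target;
}

// A plain object standing in for a page; purely illustrative.
var fakePage = { settings: { loadImages: true } };
setByPath(fakePage, 'settings.loadImages', false);
setByPath(fakePage, 'viewportSize', { width: 640, height: 480 });
```

A key without dots simply sets a top-level property, which matches the `viewportSize` usage shown earlier.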

Looking at the PhantomJS API, page.open requires a URL as the first argument, not an HTML string. This is why what you tried does not work.
However, one way that you might be able to achieve the effect of creating a page from a string is to host an empty "skeleton page," somewhere with a URL (could be localhost), and then include Javascript (using includeJs) into the empty page. The Javascript that you include into the blank page can use document.write("<p>blah blah blah</p>") to dynamically add content to the webpage.
I've never done this, but AFAIK it should work.
Sample skeleton page:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head></head>
<body></body>
</html>

Just wanted to mention I recently had a similar need and discovered that I could pass file:// style references as an URL param, so I dumped my HTML string into a local file then passed the full path to my capture script (django_screamshot) which basically uses casperjs and phantomjs + a capture.js script.
Anyway, it just works and it's reasonably fast.

I got the following to work in PhantomJS version 2.0.0. Whereas before, I was using page.open() to open a page from the filesystem and set a callback:
page.open("bench.html", pageLoadCallback);
Now, I accomplish the same thing from a string variable with the HTML page. The page.setContent() method requires a URL as the second argument, and this uses fs.absolute() to construct a file:// URL.
page.onLoadFinished = pageLoadCallback;
page.setContent(bench_str, "file://" + fs.absolute(".") + "/bench.html");

Related

How to solve error while parsing HTML

I´m trying to get the elements from a web page in Google spreadsheet using:
function pegarAsCoisas() {
var html = UrlFetchApp.fetch("http://www.saosilvestre.com.br").getContentText();
var elements = XmlService.parse(html);
}
However, I keep getting the error:
Error on line 2: Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character. (line 4, file "")
How do I solve this? I want to get the H1 text from this site, but for other sites I´ll have to select other elements.
I know the method XmlService.parse(html) works for other sites, like Wikipedia. As you can see here.

The html isn't xml. And you don't need to try to parse it. You need to use string methods:
function pegarAsCoisas() {
var urlFetchReturn = UrlFetchApp.fetch("http://www.saosilvestre.com.br");
var html = urlFetchReturn.getContentText();
Logger.log('html.length: ' + html.length);
var index_OfH1 = html.indexOf('<h1');
var endingH1 = html.indexOf('</h1>');
Logger.log('index_OfH1: ' + index_OfH1);
Logger.log('endingH1: ' + endingH1);
var h1Content = html.slice(index_OfH1, endingH1);
var h1Content = h1Content.slice(h1Content.indexOf(">")+1);
Logger.log('h1Content: ' + h1Content);
};
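The string-method approach above can be wrapped in a reusable function. The following is only a sketch: `extractFirstTagText` is a name introduced here, the sample HTML is made up, and it runs as plain JavaScript outside Apps Script. It also searches for the closing tag only after the opening tag, which the code above does not do.

```javascript
// Returns the text between the first <tagName ...> and its closing tag,
// or null when either tag is missing. Plain string search, no XML parser.
// Note: the match is a naive, case-sensitive prefix match on '<tagName'.
function extractFirstTagText(html, tagName) {
  var openStart = html.indexOf('<' + tagName);
  if (openStart === -1) return null;
  var openEnd = html.indexOf('>', openStart);
  if (openEnd === -1) return null;
  // Look for the closing tag only after the end of the opening tag.
  var closeStart = html.indexOf('</' + tagName + '>', openEnd);
  if (closeStart === -1) return null;
  return html.slice(openEnd + 1, closeStart).trim();
}

var sample = '<html itemscope><body><h1 class="t"> Title </h1></body></html>';
var h1Text = extractFirstTagText(sample, 'h1');
```

Any markup nested inside the element is returned as-is, so this is only suitable for simple cases such as a heading containing plain text.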

The XMLService service works only with 100% correct XML content. It's not error tolerant. Google apps script used to have a tolerant service called XML service but it was deprecated. However, it still works and you can use that instead as explained here: GAS-XML

Technically HTML and XHTML are not the same. See What are the main differences between XHTML and HTML?
Regarding the OP code, the following works just fine
function pegarAsCoisas() {
var html = UrlFetchApp
.fetch('http://www.saosilvestre.com.br')
.getContentText();
Logger.log(html);
}
As was said on previous answers, other methods should be used instead of using the XmlService directly on the object returned by UrlFetchApp. You could try first to convert the web page source code from HTML to XHTML in order to be able to use the Xml Service Service (XmlService), use the Xml Service as it could work directly with HTML pages, or to handle the web page source code directly as a text file.
Related questions:
How to parse an HTML string in Google Apps Script without using XmlService?
What is the best way to parse html in google apps script

Try replace itemscope by itemscope = '':
function pegarAsCoisas() {
var html = UrlFetchApp.fetch("http://www.saosilvestre.com.br").getContentText();
html = html.replace("itemscope", "itemscope = ''");
var elements = XmlService.parse(html);
}
For more information, look here.
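Note that `String.prototype.replace` with a string pattern only replaces the first occurrence. A minimal sketch of a variant that handles every bare `itemscope` attribute is shown below. `quoteBareItemscope` is a name chosen for this example, and the approach assumes that `itemscope` is the only thing the strict parser objects to, which may not hold for real pages.

```javascript
// Adds an empty value to every bare `itemscope` attribute so that a strict
// XML parser no longer rejects it. Attributes that already have a value are
// left alone thanks to the negative lookahead.
// Caveat: the word "itemscope" in visible page text would be rewritten too.
function quoteBareItemscope(html) {
  return html.replace(/\bitemscope\b(?!\s*=)/g, 'itemscope=""');
}

var fixed = quoteBareItemscope(
  '<html itemscope itemtype="x"><div itemscope=""></div></html>'
);
```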

Get script content [duplicate]

If I have a script tag like this:
<script
id = "myscript"
src = "http://www.example.com/script.js"
type = "text/javascript">
</script>
I would like to get the content of the "script.js" file. I'm thinking about something like document.getElementById("myscript").text but it doesn't work in this case.

tl;dr script tags are not subject to CORS and same-origin-policy and therefore javascript/DOM cannot offer access to the text content of the resource loaded via a <script> tag, or it would break same-origin-policy.
long version:
Most of the other answers (and the accepted answer) correctly indicate that the "correct" way to get the text content of a JavaScript file loaded into the page via a <script> tag is to use an XMLHttpRequest to perform a separate, additional request for the resource named in the script's src property, which the short JavaScript code below demonstrates. However, I found that the other answers did not address why the file's text content has to be fetched this way: allowing access to the content of a file included via <script src=[url]></script> would break the CORS policies. Modern browsers prevent XHR of resources that do not provide the Access-Control-Allow-Origin header, so browsers do not allow any way of getting the content other than those subject to CORS.
With the following code (as mentioned in the other answers: "use XHR/AJAX") it is possible to make another request for every non-inline script tag in the document.
function printScriptTextContent(script)
{
var xhr = new XMLHttpRequest();
xhr.open("GET",script.src)
xhr.onreadystatechange = function () {
if(xhr.readyState === XMLHttpRequest.DONE && xhr.status === 200) {
console.log("the script text content is",xhr.responseText);
}
};
xhr.send();
}
Array.prototype.slice.call(document.querySelectorAll("script[src]")).forEach(printScriptTextContent);

Do you want to get the contents of the file http://www.example.com/script.js? If so, you could turn to AJAX methods to fetch its content, assuming it resides on the same server as the page itself.
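The same-server assumption above can be checked before making the request. The sketch below uses the standard `URL` class. The function name `isSameOrigin` and the sample URLs are illustrative only.

```javascript
// Returns true when scriptSrc resolves to the same origin (scheme, host,
// port) as pageUrl. Relative URLs are resolved against pageUrl.
function isSameOrigin(scriptSrc, pageUrl) {
  var page = new URL(pageUrl);
  var script = new URL(scriptSrc, page);
  return script.origin === page.origin;
}

var sameOrigin = isSameOrigin('/js/app.js', 'http://www.example.com/index.html');
var crossOrigin = isSameOrigin(
  'http://cdn.example.net/script.js',
  'http://www.example.com/index.html'
);
```

In a browser this would typically be called as `isSameOrigin(scriptElement.src, location.href)`. A cross-origin script can still be fetched when its server sends suitable CORS headers, so a `false` result here only means the request is not guaranteed to succeed.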

Update: HTML Imports are now deprecated (alternatives).
---
I know it's a little late but some browsers support the tag LINK rel="import" property.
http://www.html5rocks.com/en/tutorials/webcomponents/imports/
<link rel="import" href="/path/to/imports/stuff.html">
For the rest, ajax is still the preferred way.

I don't think the contents will be available via the DOM. You could get the value of the src attribute and use AJAX to request the file from the server.

Yes, Ajax is the way to do it, as in the accepted answer. If you get down to the details, there are many pitfalls. If you use jQuery.load(...), the wrong content type is assumed (html instead of application/javascript), which can mess things up by putting unwanted <br> into your (scriptNode).innerText, and things like that. Then, if you use jQuery.getScript(...), the downloaded script is immediately executed, which might not be what you want (it might screw up the order in which you want to load the files, in case you have several of those).
I found it best to use jQuery.ajax with dataType: "text"
I used this Ajax technique in a project with a frameset, where the frameset and/or several frames need the same JavaScript, in order to avoid having the server send that JavaScript multiple times.
Here is code, tested and working:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">
<html>
<head>
<script id="scriptData">
var scriptData = [
{ name: "foo" , url: "path/to/foo" },
{ name: "bar" , url: "path/to/bar" }
];
</script>
<script id="scriptLoader">
var LOADER = {
loadedCount: 0,
toBeLoadedCount: 0,
load_jQuery: function (){
var jqNode = document.createElement("script");
jqNode.setAttribute("src", "/path/to/jquery");
jqNode.setAttribute("onload", "LOADER.loadScripts();");
jqNode.setAttribute("id", "jquery");
document.head.appendChild(jqNode);
},
loadScripts: function (){
var scriptDataLookup = this.scriptDataLookup = {};
var scriptNodes = this.scriptNodes = {};
var scriptNodesArr = this.scriptNodesArr = [];
for (var j=0; j<scriptData.length; j++){
var theEntry = scriptData[j];
scriptDataLookup[theEntry.name] = theEntry;
}
//console.log(JSON.stringify(scriptDataLookup, null, 4));
for (var i=0; i<scriptData.length; i++){
var entry = scriptData[i];
var name = entry.name;
var theURL = entry.url;
this.toBeLoadedCount++;
var node = document.createElement("script");
node.setAttribute("id", name);
scriptNodes[name] = node;
scriptNodesArr.push(node);
jQuery.ajax({
method : "GET",
url : theURL,
dataType : "text"
}).done(this.makeHandler(name, node)).fail(this.makeFailHandler(name, node));
}
},
makeFailHandler: function(name, node){
var THIS = this;
return function(xhr, errorName, errorMessage){
console.log(name, "FAIL");
console.log(xhr);
console.log(errorName);
console.log(errorMessage);
debugger;
}
},
makeHandler: function(name, node){
var THIS = this;
return function (fileContents, status, xhr){
THIS.loadedCount++;
//console.log("loaded", name, "content length", fileContents.length, "status", status);
//console.log("loaded:", THIS.loadedCount, "/", THIS.toBeLoadedCount);
THIS.scriptDataLookup[name].fileContents = fileContents;
if (THIS.loadedCount >= THIS.toBeLoadedCount){
THIS.allScriptsLoaded();
}
}
},
allScriptsLoaded: function(){
for (var i=0; i<this.scriptNodesArr.length; i++){
var scriptNode = this.scriptNodesArr[i];
var name = scriptNode.id;
var data = this.scriptDataLookup[name];
var fileContents = data.fileContents;
var textNode = document.createTextNode(fileContents);
scriptNode.appendChild(textNode);
document.head.appendChild(scriptNode); // execution is here
//console.log(scriptNode);
}
// call code to make the frames here
}
};
</script>
</head>
<frameset rows="200pixels,*" onload="LOADER.load_jQuery();">
<frame src="about:blank"></frame>
<frame src="about:blank"></frame>
</frameset>
</html>
related question

.text did get you contents of the tag, it's just that you have nothing between your open tag and your end tag. You can get the src attribute of the element using .src, and then if you want to get the javascript file you would follow the link and make an ajax request for it.

In a comment to my previous answer:
I want to store the content of the script so that I can cache it and use it directly some time later without having to fetch it from the external web server (not on the same server as the page)
In that case you're better off using a server-side script to fetch and cache the script file. Depending on your server setup, you could just wget the file (periodically via cron if you expect it to change) or do something similar with a small script in the language of your choice.

If you want the contents of the file the src attribute points to, you would have to do an Ajax request and look at the responseText. If you were to have the JS between <script> and </script>, you could access it through innerHTML.
This might be of interest: http://ejohn.org/blog/degrading-script-tags/

I had the same issue, so I solved it this way:
The js file contains something like
window.someVarForReturn = `content for return`
On html
<script src="file.js"></script>
<script>console.log(someVarForReturn)</script>
In my case the content was an HTML template, so I did something like this:
On js file
window.someVarForReturn = `<div>My template</div>`
On html
<script src="file.js"></script>
<script>
new DOMParser().parseFromString(someVarForReturn, 'text/html').body.children[0]
</script>

You cannot directly get what the browser loaded as the content of your specific script tag (security hazard).
But you can request the same resource (src) again (which will often be served from the cache) and read its text:
const scriptSrc = document.querySelector('script#yours').src;
// re-request the same location
const scriptContent = await fetch(scriptSrc).then((res) => res.text());

If you're looking to access the attributes of the <script> tag rather than the contents of script.js, then XPath may well be what you're after.
It will allow you to get each of the script attributes.
If it's the example.js file contents you're after, then you can fire off an AJAX request to fetch it.

It's funny, but we can't; we have to fetch them again over the network.
The browser will likely read from its cache, but a request may still be sent to revalidate the content.
[...document.scripts]
.filter((script) => script.src) // skip inline scripts, which have no src
.forEach((script) => {
fetch(script.src)
.then((response) => response.text() )
.then((source) => console.log(source) )
})

Using 2008-style DOM-binding it would rather be:
document.getElementById('myscript').getAttribute("src");
document.getElementById('myscript').getAttribute("type");

You want to use the innerHTML property to get the contents of the script tag:
document.getElementById("myscript").innerHTML
But as #olle said in another answer you probably want to have a read of:
http://ejohn.org/blog/degrading-script-tags/

If a src attribute is provided, user agents are required to ignore the content of the element, if you need to access it from the external script, then you are probably doing something wrong.
Update: I see you've added a comment to the effect that you want to cache the script and use it later. To what end? Assuming your HTTP is cache friendly, then your caching needs are likely taken care of by the browser already.

I'd suggest the answer to this question is using the "innerHTML" property of the DOM element. Certainly, if the script has loaded, you do not need to make an Ajax call to get it.
So Sugendran should be correct (not sure why he was voted down without explanation).
var scriptContent = document.getElementById("myscript").innerHTML;
The innerHTML property of the script element should give you the scripts content as a string provided the script element is:
an inline script, or
that the script has loaded (if using the src attribute)
olle also gives the answer, but I think it got 'muddled' by his suggesting it needs to be loaded through Ajax first, and I think he meant "inline" instead of "between".
If you were to have the JS between <script> and </script>, you could access it through innerHTML.
Regarding the usefulness of this technique:
I've looked to use this technique for client side error logging (of javascript exceptions) after getting "undefined variables" which aren't contained within my own scripts (such as badly injected scripts from toolbars or extensions) - so I don't think it's such a way out idea.

Not sure why you would need to do this?
Another way round would be to hold the script in a hidden element somewhere and use eval to run it. You could then query the object's innerHTML property.

How can I load a DOM from a string in PhantomJS?

Most of the examples I have found on the web involve loading a URL.
However, if I simply have a string that contains an svg or html and I want to load it into a dom for manipulation, I cannot figure out how to manipulate it.
var fs=require('fs')
var content = fs.read("EarlierSavedPage.svg")
// How do I load content into a DOM?
I realize that, in this example where a local file is being read, there is a workaround for reading the local file directly, but I am interested more generally in whether a page can be loaded from a string.
I have already looked at the documentation but did not see anything obvious.

The default page in PhantomJS is comparable to about:blank and is essentially
<html>
<body>
</body>
</html>
It means that you can directly add your svg to the DOM and render it. It seems that you have to render it asynchronously to give the browser time to actually compute the svg. Here is a complete script:
var page = require('webpage').create(),
fs = require('fs')
var content = fs.read("EarlierSavedPage.svg")
page.evaluate(function(content){
document.body.innerHTML = content;
}, content);
setTimeout(function(){
page.render("EarlierSavedPage.png"); // render or do whatever
phantom.exit();
}, 0); // phantomjs is single threaded so you need to do this asynchronously, but immediately
When you load an HTML file into content, then you can directly assign it to the current DOM (to page.content):
page.content = content;
This would likely also need some asynchronous decoupling like above.
The other way would be to actually load the HTML file with page.open:
page.open(filePathToHtmlFile, function(success){
// do something like render
phantom.exit();
});

Cache static HTML pages with get variables

I have a website with a lot of iframes like this:
<iframe src="example.com/page.html?var=blabla&id=42" scrolling="no"></iframe>
I have to change var=blabla&id=42 for each iFrame. These parameters are used in the javascript of the iframe. Is there any way to cache(give hints to the browser) page.html (static) once for all variables ?
I have to use an iframe since I want to be able to update this code ( from another server) & to run it in another scope.

No - anything changing the query string represents a separate resource for the browser.
However, you may be able to achieve that effect if you can make some slight changes to page.html. If you write it this way:
<iframe src="example.com/page.html#var=blabla&id=42" scrolling="no"></iframe>
Note the use of the # character - that's the key there.
The URL requested from the server becomes simply "page.html", so it will be cached that way. However, the JavaScript of that page will have access to document.location.hash, which will contain "#var=blabla&id=42". It'll be a single string, but it shouldn't be difficult to parse. Some libraries even use the fragment to pass parameters in semi-real-time to iframes for IE6 compatibility.
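As a rough sketch of the parsing step, the fragment can be handed to the standard `URLSearchParams` class. The function name `parseHashParams` is made up for this example, and `URLSearchParams` is not available in old browsers such as IE, where a manual split on `&` and `=` would be needed instead.

```javascript
// Turns a fragment such as '#var=blabla&id=42' into a plain object.
function parseHashParams(hash) {
  // Drop the leading '#', if present.
  var query = hash.charAt(0) === '#' ? hash.slice(1) : hash;
  var params = {};
  new URLSearchParams(query).forEach(function (value, key) {
    params[key] = value;
  });
  return params;
}

var parsed = parseHashParams('#var=blabla&id=42');
```

Inside the iframe this would be called as `parseHashParams(location.hash)`. All values come back as strings, so `parsed.id` is `'42'`, not the number 42.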

If it's only used in the JavaScript and there is really only one page on the server side, don't use ? but use # instead. The browser will consider it the same page at different anchor points. So if test.com/#foo is cached, then test.com/#bar is too (same page, different anchor points).

You can update the frame URLs from code:
var fr = document.getElementsByTagName('iframe');
var sites = "1.com,2.com".split(",");
for(var x=0;x<fr.length;x++) {
document.getElementsByTagName('iframe')[x].src="http://"+sites[x];
}

How do I send a document name to a javascript file?

I want to, let's say on index.htm have this:
<html>
<head>
<script type="text/javascript" src="javascriptfile.js"></script>
</head>
</html>
and then have that script return <title>index</title>, and the index being dynamic according to the file name. How can I accomplish this?

You should really be doing this server-side. But if you insist on client-side processing, this should work:
document.write('<title>' +
window.location.pathname.replace(/^(.*\/)?([^\/.]*).*$/, "$2") +
'</title>');
If you do go with this approach, $DEITY will kill a kitten each time someone visits your site.

There are two possible answers :
if you know exactly what pages you have and all pages are static html, don't do that kind of thing. Just put the title in the page, you'll avoid useless delay
in other cases, this is not a job for JS. Use a server-side technology to dynamically create your page.
In short : dynamic ? use server-side language : put title directly in html

// get your filename
var uri = location.href.split("?")[0].split("/");
var filename = uri[uri.length-1];
// get title tag DOMnode
var title = document.getElementsByTagName('title')[0];
title.innerText = filename;

You should note that the browser doesn't really know what the file name is. All it knows is its public location (URL). With this in mind, you can read the URI from document.location:
https://developer.mozilla.org/en/DOM/document.location
This property returns a Location object:
https://developer.mozilla.org/en/DOM/window.location
The Location object has a pathname property you can parse.
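Parsing the pathname could look roughly like the sketch below. `baseNameFromPath` is a hypothetical helper name and the sample path is invented.

```javascript
// Extracts the file name without its extension from a URL path,
// e.g. '/docs/index.htm' -> 'index'. Returns '' for a path ending in '/'.
function baseNameFromPath(pathname) {
  var lastSegment = pathname.split('/').pop();
  var dot = lastSegment.lastIndexOf('.');
  // dot > 0 keeps names that start with a dot (such as '.htaccess') intact.
  return dot > 0 ? lastSegment.slice(0, dot) : lastSegment;
}

var titleText = baseNameFromPath('/docs/index.htm');
```

In a page this would be used as `document.title = baseNameFromPath(location.pathname);`. A directory URL such as `/docs/` yields an empty string, because the browser never sees which index file the server actually served.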
