My app (under development) uses Safari 4.0.3 and JavaScript to present its front end to the user. The backend is PHP and SQLite. This is under OS X 10.5.8.
The app will from time to time receive chunks of HTML to present to the user. Each chunk is the body of an email received, and as such one has no control over the quality of the HTML received. What I do is use innerHTML to shove the chunk into an iframe and let Safari render it.
To do that I do this:
window.frames["mainwindow"].window.frames["Frame1"].document.body.innerHTML = myvar;
where myvar contains the received HTML. Now, for the most part this works as desired, and the HTML is rendered as expected. The exception appears to be when the opening tag of the chunk looks like:
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" ...
and so on for more than 2800 chars. The effect is as if my JavaScript statement above had not been executed - I can see that using Safari's Error Console in the Develop menu to look into the iFrame. If I extract the HTML from the SQLite backend db and save it as a .html file, then Safari will render that with no trouble.
Any comments on why this might be happening, or on such use of innerHTML, or pointers to discussion of same, would be appreciated.
innerHTML is not the same as writing a complete document. Even if you write to outerHTML as suggested by Gumbo, there are things outside the root element that can confuse it, such as doctypes. To write a whole document at once, you have to use old-school cross-frame document.write:
var d= window.frames["mainwindow"].window.frames["Frame1"].document;
d.open();
d.write(htmldoc);
d.close();
Each chunk is the body of an email received, and as such one has no control over the quality of the HTML received.
OK, you may have a security problem then.
If you let an untrusted source like an e-mail inject HTML into your security context (and an iframe you're writing to is in your security context), it can run JavaScript of its own, including scripts that reach up and control your entire enclosing application and anything else on the same hostname. Unless your application is so trivial you don't care about that, this is really bad news.
If you need to allow untrusted HTML, the way many webmail services do it is to have it served on a different hostname (e.g. a subdomain) that does not have access to any other part of the application. To do this, your iframe src would have to point to the different hostname; you couldn't script between the two security contexts.
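For illustration, a minimal sketch of that separation, reusing the nested-frame path from the question and assuming a hypothetical sandbox hostname (untrusted.example.com) that does nothing but serve stored message bodies; messageId is a placeholder for your own lookup key:
// Hypothetical sketch: point the inner frame at a separate hostname that only
// serves the raw message body. URL and messageId are placeholders.
var frameWindow = window.frames["mainwindow"].window.frames["Frame1"];
frameWindow.location.href =
    'https://untrusted.example.com/showmessage.php?id=' + encodeURIComponent(messageId);
// The message's own scripts now run in untrusted.example.com's security context:
// they cannot reach up into this application, and this page cannot script into the frame.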
Related
I have an HTML file with JavaScript that is supposed to load and process an XML file. The main obstacle is that the script may also be run locally, without an HTTP server, and it also has to support Internet Explorer (11).
In the normal case I would use XMLHttpRequest, but as far as I know it cannot be used with locally stored files (or at least it doesn't work in my test cases in Chrome and IE).
I tried using <script> blocks with the src and type="text/xml" attributes set, and it successfully loads the content of the XML "somewhere" (the content is fetched and visible in the network trace), but I cannot find a way to extract the content of the XML from the <script> node.
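For reference, a minimal sketch of that attempt (the element id is made up); the catch is that a src-loaded script element only exposes its inline text, so textContent stays empty even though the network trace shows the file being fetched:
// The XML is fetched (visible in the network trace) because of the src attribute...
// <script id="xmlCarrier" type="text/xml" src="data.xml"></script>
var carrier = document.getElementById('xmlCarrier');
// ...but textContent/innerHTML only reflect inline content, so this logs an empty string.
console.log(carrier.textContent);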
Most sources (e.g. Getting content of <script> tag) suggest using XHR, but AFAIK that cannot be done in this case.
Is there a sensible way to implement this without even a minimal HTTP server?
I am looking for a clean solution without jQuery.
I am creating a firefox extension that lets the operator perform various actions that modify the content of the HTML document. The operator does not edit HTML, they take other actions and my extension modifies the document by inserting elements, adding attributes, and so forth.
When the operator is finished, they need to be able to save the HTML document as a file (or have my extension send it to an internet destination, but this is not required since they can email the saved file).
I thought maybe the changes made by the javascript code in my extension would be reflected in the HTML document, but when I ask the firefox browser to "view source" after making modifications, it displays the original HTML text.
My questions are:
#1: What is the easiest way for the operator to save the HTML document with all the changes my extension has made?
#2: What is the easiest way for the javascript code in my extension to process the HTML document contents and write to an HTML file on the local disk?
#3: Is any valid HTML content incapable of accurate representation in the saved file?
#4: Is the TreeWalker part of the solution (see below)?
A couple observations from my research so far:
I've read about the TreeWalker object, which seems to provide a fairly painless way for an extension to walk through everything (or almost everything?) in the HTML document. But does it expose enough that the original document (plus my modifications) can be saved without losing anything of importance?
Does the TreeWalker walk through the HTML document in the "correct order" --- the order necessary for my extension to generate the original and/or modified HTML document?
Anything obscure or tricky about these problems?
OK, so I am assuming here that you have access to the page DOM. What you need to do is basically make your changes to the DOM and then grab all of the DOM's markup and save it as a file. Here is how you can download the page's HTML code. This will create an a tag which the user needs to click for the file to download.
var a = document.createElement('a'),
    code = document.querySelectorAll('html')[0].innerHTML;
a.setAttribute('download', 'filename.html');
// Encode the markup so characters like '#' and '&' don't break the data: URI.
a.setAttribute('href', 'data:text/html;charset=utf-8,' + encodeURIComponent(code));
Now you can insert this a tag anywhere in the DOM and the file will download when the user clicks it.
Note: this is sort of a hack; it stuffs the entire HTML of the page into the a tag's href. It should in theory work in any up-to-date browser (except, surprise, IE). There are more stable and less hacky ways of doing it, like storing the markup in a File System API file and then downloading that file instead.
Edit: the document.querySelectorAll line accesses the page DOM, so for it to work the document must be accessible. You say you are already modifying the DOM, so that should be the case. Make sure you add this code in the page's context, in the same place as your DOM-modification code, and not in extension pages that can't access the DOM.
As for the a tag, it gets inserted into the page. I skipped those steps since I assumed you already know how to manipulate the DOM, and also because I don't know where you would like to add the link. You can also skip the user's click: insert the a tag somewhere in the original page where the user won't see it and then call a.click() to simulate a click event on the link. But it's a hack that only works in modern browsers, and I personally only use it on my practice projects to trigger click event listeners.
I can only test this on Chrome, not on FF, but try the following code; it doesn't even require adding the a link to the DOM. Add it next to your DOM-manipulation code. It will work if luck is on your side :)
var a = document.createElement('a'),
    code = document.querySelectorAll('html')[0].innerHTML;
a.setAttribute('download', 'filename.html');
a.setAttribute('href', 'data:text/html;charset=utf-8,' + encodeURIComponent(code));
a.click();
There is no easy way to do this with the web API alone, at least if you want a result that does not omit things like the doctype or comments. You could still write a serializer yourself that walks document.childNodes and serializes each node according to its type (Element.outerHTML, Comment.data and so on).
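A rough sketch of such a hand-rolled serializer, covering only the node types mentioned above (a real one would need to handle a few more):
// Minimal sketch: serialize the top-level children of the document,
// handling only the node types a typical page actually contains.
function serializeWholeDocument(doc) {
  var out = [];
  for (var i = 0; i < doc.childNodes.length; i++) {
    var node = doc.childNodes[i];
    switch (node.nodeType) {
      case Node.DOCUMENT_TYPE_NODE:   // e.g. <!DOCTYPE html>
        out.push('<!DOCTYPE ' + node.name + '>');
        break;
      case Node.COMMENT_NODE:
        out.push('<!--' + node.data + '-->');
        break;
      case Node.ELEMENT_NODE:         // the root <html> element
        out.push(node.outerHTML);
        break;
    }
  }
  return out.join('\n');
}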
Luckily, you're writing a Firefox add-on, so you have access to a lot more (powerful) stuff.
While still not 100% perfect, the nsIDocumentEncoder implementations will produce pretty decent results that should differ at most in some whitespace and an explicit charset declaration (anything else is a bug).
Here is an example on how one might use this component:
function serializeDocument(document) {
  const {
    classes: Cc,
    interfaces: Ci,
    utils: Cu
  } = Components;
  // XPCOM contract IDs start with "@mozilla.org/"; the ?type= part selects the HTML encoder.
  let encoder = Cc['@mozilla.org/layout/documentEncoder;1?type=text/html'].createInstance(Ci.nsIDocumentEncoder);
  encoder.init(document, 'text/html', Ci.nsIDocumentEncoder.OutputLFLineBreak | Ci.nsIDocumentEncoder.OutputRaw);
  encoder.setCharset("utf-8");
  return encoder.encodeToString();
}
If you're writing an SDK add-on, things get more complicated, as the SDK abstracts some important stuff away. You'll need to go through the chrome module, and also figure out the active window and tab yourself. Something like Services.wm.getMostRecentWindow("navigator:browser").content.document (Services.jsm) should do the trick.
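For example, a sketch of what that might look like from an SDK add-on, reusing the serializeDocument function above:
// SDK add-on sketch: borrow chrome privileges, then locate the active tab's document.
const { Cu } = require("chrome");
Cu.import("resource://gre/modules/Services.jsm");

let browserWindow = Services.wm.getMostRecentWindow("navigator:browser");
let contentDocument = browserWindow.content.document;  // document of the active tab
let html = serializeDocument(contentDocument);         // the function defined above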
In XUL overlay add-ons, content.document should suffice to get the document of the currently active tab, and you have Components access already.
Still, you need to let the user choose a file destination, usually through nsIFilePicker, and then actually write the file, using something like a file stream or the fully async OS.File API.
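A rough sketch of that last step, assuming the Cc/Ci/Cu shorthands from above, a window to parent the picker to, and html holding the serialized markup:
// Sketch: ask the user for a destination, then write the serialized HTML asynchronously.
Cu.import("resource://gre/modules/osfile.jsm");

let picker = Cc["@mozilla.org/filepicker;1"].createInstance(Ci.nsIFilePicker);
picker.init(window, "Save page as", Ci.nsIFilePicker.modeSave);
picker.appendFilter("HTML files", "*.html");
picker.open({
  done: function(result) {
    if (result == Ci.nsIFilePicker.returnCancel) {
      return;
    }
    // Write the string as UTF-8, going through a temporary file for safety.
    OS.File.writeAtomically(picker.file.path, html,
                            { encoding: "utf-8", tmpPath: picker.file.path + ".tmp" });
  }
});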
Looks like I get to answer my own question, thanks to someone in mozilla #extdev IRC.
I got totally faked out by "view source". When I didn't see my modifications in the window displayed by "view source", I assumed the browser would not provide the information.
However, guess what? When I do File => "Save Page As..." and then examine the saved page with a plain-text editor... sure enough, it contains the modifications made by my Firefox extension! Surprise!
A browser has no direct write access to the local filesystem. The only read access it has is when you explicitly provide a file:// URL (see note 1 below).
In your case, we are explicitly talking about javascript - which can read and write cookies and local storage. It can also send stuff back to the server and retrieve it, e.g. using AJAX.
Stuff you put in local storage/cookies is effectively not accessible to other programs (such as email clients).
It is possible to create very long mailto: URLs (see note 2), but they only handle inline content in the email, and you're going to run into all sorts of encoding issues that you're not ready to deal with.
Hence I'd recommend pursuing server-side storage via AJAX, and looking at local storage once you've got that sorted/working.
Note 1: this is not strictly true. A trusted, signed JavaScript has access to additional functions, which may include direct file access.
Note 2: the limit depends on the browser and the email client (Lotus Notes truncates the content rather a lot).
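As a rough sketch of the server-side-via-AJAX route recommended above (the endpoint URL and payload shape are made up):
// Hypothetical sketch: push the data to the server via AJAX, and mirror it in
// localStorage so the page can restore its state on the next visit.
var payload = JSON.stringify({ notes: "whatever the page needs to remember" });
localStorage.setItem("appState", payload);

var xhr = new XMLHttpRequest();
xhr.open("POST", "/save-state");            // placeholder server endpoint
xhr.setRequestHeader("Content-Type", "application/json");
xhr.send(payload);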
Some background:
I'm developing a web-based mobile application using JavaScript. HTML rendering is Safari-based. The cross-domain policy is disabled, so I can make calls to other domains using XmlHttpRequests. The idea is to parse external HTML and get the text content of a specific element.
In the past I parsed the text line by line, finding the line I needed and then taking the tag's content as a substring of that line. This is very troublesome and requires a lot of maintenance each time the target HTML changes.
So now I want to parse the html text into DOM and run css or xpath queries on it.
It works well:
$('<div></div>').append(htmlBody).find('#theElementToFind').text()
The only problem is that when I use the browser to load HTML text into a DOM element, it tries to load all the external resources (images, JS files, etc.). Although this isn't causing any serious problem, I would like to avoid it.
Now the question:
How can I parse HTML text into a DOM without the browser loading external resources or running JS scripts?
Some ideas I've been thinking about:
creating a new document object using a createDocument call (document.implementation.createDocument()), but I'm not sure it will skip the loading of external resources
using a third-party DOM parser in JS - the only one I've tried was very bad at handling errors
using an iframe to create a new document, so that external resources with relative paths will not throw errors in the console
It seems that the following piece of code works great:
var doc = document.implementation.createHTMLDocument("");
doc.documentElement.innerHTML = htmlBody;
var text = $(doc).find('#theElementToFind').text();
External resources aren't loaded and scripts aren't evaluated.
Found it here:
https://stackoverflow.com/a/9251106/95624
Origin:
https://developer.mozilla.org/en/DOMParser#DOMParser_HTML_extension_for_other_browsers
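That MDN page describes essentially the same trick as a polyfill for DOMParser; where DOMParser supports "text/html" natively, the equivalent is simply:
// Parse the string into a detached document, then query it as before.
var doc = new DOMParser().parseFromString(htmlBody, "text/html");
var text = $(doc).find('#theElementToFind').text();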
You can construct a jQuery object from any HTML string without appending it to the DOM:
$(htmlBody).find('#theElementToFind').text();
If you do <script src="/path/to/nonexistent/file.js"></script> in an HTML file and call that in a browser, and there are no dependencies or resources anywhere else in the HTML file that expect the file or code therein to actually exist, is there anything inherently bad-practice about doing this?
Yes, it is an odd question. The rationale is the developer is dealing with a CMS that allows custom (self-contained) javascript files to be provided in certain circumstances. The problem is the CMS is not very flexible when it comes to creating conditional includes for javascript. Therefore it is easier to just make references to the self-contained js files regardless of whether they are actually at the specified path.
Since no errors are displayed to the user, should this practice be considered a viable option?
Well, the major drawback is performance: the browser will try (hard) to download the file and your server will have to look for it. In the end the browser may download the 404 page instead, thus slowing down the page load.
If the script is referenced in the <head> tag (not recommended in the first place), it will slow down the initial page-render time somewhat too.
If instead of quickly returning a 404, your site just accepts the connection and then never responds, this can cause the page to take an indefinite amount of time to load, and in some cases, lock up the entire user interface.
(At least that was the case with one revision of Firefox; I hope they've fixed it since I saw that happen ~2 years ago.*)
You should at least put the tags as low in the page order as you can afford to, to remedy this problem.
Your best bet by far is to have one consistent no-op URL that is used as a fill-in for all the "doesn't exist" JavaScript files, and that returns a 0-byte response with HTTP headers telling the UA to cache it till the cows come home. That should negate most of your server<->client load penalties beyond the first hit (and it should hardly hurt people even on ye-olde dialup).
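For illustration only (not implying any particular stack), a minimal sketch of such a no-op endpoint in Node; the whole point is the empty body plus long-lived caching headers:
// Sketch: every "missing" script URL is routed to this no-op endpoint.
var http = require("http");

http.createServer(function (req, res) {
  res.writeHead(200, {
    "Content-Type": "application/javascript",
    "Content-Length": 0,
    // Tell the UA to cache the empty response for a year.
    "Cache-Control": "public, max-age=31536000, immutable"
  });
  res.end();
}).listen(8080);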
*Lesson learned: don't put script src references in the head, especially for 3rd-party scripts hosted outside your machine. Otherwise you get the joy of clients being able to reach your website but finding the page inoperable because a bit of advertising JS was inaccessible due to some internet weirdness, even if it comes from a reputable-ish 3rd party.
If your web server is configured to do work on a 404 error ("you might be looking for this", etc) then you're also causing unnecessary load on the server.
You should ask yourself why you were too lazy to test this yourself :)
I tested 1000 randomized JavaScript filenames and it took several nanoseconds to load, so no, it doesn't make a difference. Example:
<script src="/7701992spolsky.js"></script>
This was on my local machine, however, so for remote servers it should take the browser roughly N * roundTripTime to figure out, where N is the number of bad scripts.
If however, you have random domain names that don't exist, like
script src="http://www.randomsite7701992.com/spolsky.js"
then it will take a long f-in time.
If you choose to implement it this way, you could tune the web server so that, if the referenced JS file is not found, it returns a redirect (301) to an empty/default JS file instead of a 404.
If you are using asp.net you can look into using custom handlers (ASHX files).
Here's an example:
public class JavascriptHandler : IHttpHandler {

    public void ProcessRequest(HttpContext context)
    {
        context.Response.ContentType = "text/javascript";
        // Some code to check whether the JavaScript code exists
        string js = "";
        if (JavascriptExists())
        {
            js = GetJavascript();
        }
        context.Response.Write(js);
    }

    // Required by IHttpHandler
    public bool IsReusable
    {
        get { return false; }
    }
}
Then in your HTML head you could declare a script tag pointing to the custom handler:
<script src="/js/javascripthandler.ashx"></script>
So I need to pull some JavaScript out of a remote page that has (worthless) HTML combined with (useful) JavaScript. The page, call it http://remote.com/data.html, looks something like this (crazy, I know):
<html>
<body>
<img src="/images/a.gif" />
<div>blah blah blah</div><br/><br/>
<script type="text/javascript">
var data = { date: "2009-03-15", data: "Some Data Here" };
</script>
</body>
</html>
So, I need to load this data variable in my local page and use it.
I'd prefer to do so with completely client-side code. I figured that if I could get the HTML of this page into a local JavaScript variable, I could parse out the JavaScript code, run eval on it, and be good to use the data. So I thought I'd load the remote page in an iframe, but I can't seem to find the iframe in the DOM. Why not? Here's what I tried:
<script>
alert(window.parent.frames.length);
alert(document.getElementById('my_frame'));
</script>
<iframe name="my_frame" id='my_frame' style='height:1px; width:1px;' frameBorder=0 src='http://remote.com/data.html'></iframe>
The first alert shows 0, the second null, which makes no sense. How can I get around this problem?
Have you tried switching the order - i.e. iframe first, script next? The script runs before the iframe is inserted into the DOM.
Also, this worked for me in a similar situation: give the iframe an onload handler:
<iframe src="http://example.com/blah" onload="do_some_stuff_with_the_iframe()"></iframe>
Last but not least, pay attention to the cross-site scripting issues - the iframe may be loaded, but your JS may not be allowed to access it.
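Assuming the iframe keeps the name my_frame from the question and the content ends up being same-origin (which http://remote.com would not be), the onload handler could look something like this sketch:
// Sketch: runs once the iframe has finished loading; only works same-origin.
function do_some_stuff_with_the_iframe() {
  var frameDoc = window.frames["my_frame"].document;
  // Pull the "var data = ...;" statement out of the page text and evaluate it.
  var match = frameDoc.body.textContent.match(/var data = (\{[\s\S]*?\});/);
  if (match) {
    var data = eval('(' + match[1] + ')');  // yields the object literal
    alert(data.date + ": " + data.data);
  }
}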
One option is to use XMLHttpRequest to retrieve the page, although cross-site support for XMLHttpRequest is apparently only just now being implemented in browsers.
I understand that you might want to make a tool that uses the client's internet connection to retrieve the HTML page (for security or legal reasons), so it is a legitimate hope.
If you do end up needing to do it server-side, then perhaps a simple PHP page that takes a URL as a query parameter and returns a JSON chunk containing the script as a string would do. That way, if you find you need to filter out certain websites, you need only do it in one place.
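On the client, consuming such a proxy might look like this sketch (the proxy URL and the JSON shape are made up):
// Sketch: fetch the proxy's JSON, then evaluate the extracted script text.
var xhr = new XMLHttpRequest();
xhr.onreadystatechange = function () {
  if (xhr.readyState === 4 && xhr.status === 200) {
    var script = JSON.parse(xhr.responseText).script; // e.g. 'var data = {...};'
    eval(script);                                     // defines `data` in this scope
    alert(data.date);
  }
};
xhr.open("GET", "/proxy.php?url=" + encodeURIComponent("http://remote.com/data.html"));
xhr.send();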
The inevitable problem is that some users will be hostile, and they then have a license to abuse what is effectively a JavaScript proxy. As a result, the safest option may be to do all the processing on the server and disallow certain JavaScript function calls (eval, HTTP requests, etc.).