can firefox extension modify DOM of HTML document then save as HTML? - javascript

I am creating a firefox extension that lets the operator perform various actions that modify the content of the HTML document. The operator does not edit HTML, they take other actions and my extension modifies the document by inserting elements, adding attributes, and so forth.
When the operator is finished, they need to be able to save the HTML document as a file (or have my extension send it to an internet destination, but this is not required since they can email the saved file).
I thought maybe the changes made by the javascript code in my extension would be reflected in the HTML document, but when I ask the firefox browser to "view source" after making modifications, it displays the original HTML text.
My questions are:
#1: What is the easiest way for the operator to save the HTML document with all the changes my extension has made?
#2: What is the easiest way for the javascript code in my extension to process the HTML document contents and write to an HTML file on the local disk?
#3: Is any valid HTML content incapable of accurate representation in the saved file?
#4: Is the TreeWalker part of the solution (see below)?
A couple of observations from my research so far:
I've read about the TreeWalker object, which seems to provide a fairly painless way for an extension to walk through everything (?or almost everything?) in the HTML document. But does it expose enough that everything in the original (and my modifications) can be saved without losing anything of importance?
Does the TreeWalker walk through the HTML document in the "correct order" --- the order necessary for my extension to generate the original and/or modified HTML document?
Anything obscure or tricky about these problems?
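For reference, this is roughly what a minimal TreeWalker looks like (an illustrative sketch only); it visits every node in document order:
var walker = document.createTreeWalker(document, NodeFilter.SHOW_ALL, null, false);
var node;
while ((node = walker.nextNode())) {
    console.log(node.nodeType, node.nodeName); // elements, text nodes and comments all show up here
}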

Ok so I am assuming here you have access to the page DOM. What you need to do is basically make your changes to the DOM and then get all the DOM markup and save it as a file. Here is how you can download the page's HTML code. This will create an a tag which the user needs to click for the file to download.
var a = document.createElement('a'), code = document.querySelectorAll('html')[0].innerHTML;
a.setAttribute('download', 'filename.html');
a.setAttribute('href', 'data:text/html,' + encodeURIComponent(code)); // encode so characters like '#' don't truncate the URI
Now you can insert this a tag anywhere in the DOM and the file will download when the user clicks it.
Note: this is sort of a hack; it injects the entire HTML of the page into the a tag's href. It should in theory work in any up-to-date browser (except, surprise, IE). There are more stable and less hacky ways of doing it, like storing it in a File System API file and then downloading that file instead.
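For what it's worth, a somewhat less hacky variant (my own sketch, using a Blob object URL rather than the filesystem API) avoids the length and encoding limits of data: URIs:
var blob = new Blob([code], { type: 'text/html' }); // 'code' as in the snippet above
var a = document.createElement('a');
a.setAttribute('download', 'filename.html');
a.setAttribute('href', URL.createObjectURL(blob)); // object URL instead of a data: URI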
Edit: The document.querySelectorAll line accesses the page DOM. For it to work, the document must be accessible. You say you are modifying the DOM, so that access should already be there. Make sure you are adding this code on the page and not in your extension code: it belongs in the same place as your DOM modification code, not in your extension pages, which can't access the DOM.
And as for the a tag, it will be inserted into the page. I skipped those steps since I assumed you already know how to manipulate the DOM, and also because I don't know where you would like to add the link. You can skip the user's click on the link too, but it's a hack and only works in modern browsers: insert the a tag somewhere in the page where the user won't see it, and then call a.click() to simulate a click event on the link. This is not a legit way to do it, though, and I personally only use it in practice projects to trigger click event listeners.
I can only test this on Chrome, not on FF, but try this code; it doesn't even require adding the a link to the DOM. Add it next to your DOM manipulation code. It will work if luck is on your side :)
var a = document.createElement('a'), code = document.querySelectorAll('html')[0].innerHTML;
a.setAttribute('download', 'filename.html');
a.setAttribute('href', 'data:text/html,' + encodeURIComponent(code));
a.click();

There is no easy way to do this with the web API only, at least when you want a result that does not omit stuff like the doctype or comments. You could still write a serializer yourself that goes through document.childNodes and serializes according to the node type (Element.outerHTML, Comment.data and so on).
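A rough sketch of such a hand-rolled serializer (illustrative only and deliberately incomplete; it ignores things like CDATA sections and processing instructions):
function serializeChildNodes(root) {
    var out = '';
    for (var i = 0; i < root.childNodes.length; i++) {
        var node = root.childNodes[i];
        if (node.nodeType === Node.ELEMENT_NODE) {
            out += node.outerHTML;                    // Element.outerHTML
        } else if (node.nodeType === Node.COMMENT_NODE) {
            out += '<!--' + node.data + '-->';        // Comment.data
        } else if (node.nodeType === Node.DOCUMENT_TYPE_NODE) {
            out += '<!DOCTYPE ' + node.name + '>';    // the doctype that innerHTML omits
        } else if (node.nodeType === Node.TEXT_NODE) {
            out += node.data;
        }
    }
    return out;
}
// serializeChildNodes(document) includes the doctype, top-level comments and the root element.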
Luckily, you're writing a Firefox add-on, so you have access to a lot more (powerful) stuff.
While still not 100% perfect, the nsIDocumentEncoder implementations will produce pretty decent results that should differ at most in some whitespace and an explicit charset declaration (anything else is a bug).
Here is an example on how one might use this component:
function serializeDocument(document) {
    const {
        classes: Cc,
        interfaces: Ci,
        utils: Cu
    } = Components;
    let encoder = Cc['@mozilla.org/layout/documentEncoder;1?type=text/html'].createInstance(Ci.nsIDocumentEncoder);
    encoder.init(document, 'text/html', Ci.nsIDocumentEncoder.OutputLFLineBreak | Ci.nsIDocumentEncoder.OutputRaw);
    encoder.setCharset("utf-8");
    return encoder.encodeToString();
}
If you're writing an SDK add-on, things get more complicated, as the SDK abstracts some important functionality away. You'll need to go through the chrome module, and also figure out the active window and tab yourself. Something like Services.wm.getMostRecentWindow("navigator:browser").content.document (Services.jsm) should do the trick.
In XUL overlay add-ons, content.document should suffice to get the document of the currently active tab, and you have Components access already.
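A rough sketch of the SDK variant (assuming the chrome module and Services.jsm mentioned above):
const { Cu } = require('chrome');                 // SDK add-on: get Components.utils
const { Services } = Cu.import('resource://gre/modules/Services.jsm', {});
let doc = Services.wm.getMostRecentWindow('navigator:browser').content.document;
let html = serializeDocument(doc);                // serializeDocument from the snippet above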
Still, you need to let the user choose a file destination, usually through nsIFilePicker, and then actually write the file, using something like a file stream or the fully async OS.File API.
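A minimal sketch of the writing part (assuming OS.File; 'path' stands for whatever the user picked in the nsIFilePicker, and 'doc' for the document being saved):
Components.utils.import('resource://gre/modules/osfile.jsm');
let data = new TextEncoder().encode(serializeDocument(doc)); // string -> UTF-8 bytes
OS.File.writeAtomic(path, data, { tmpPath: path + '.tmp' }).then(
    function () { console.log('page saved'); },
    function (err) { console.error(err); }
);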

Looks like I get to answer my own question, thanks to someone in mozilla #extdev IRC.
I got totally faked out by "view source". When I didn't see my modifications in the window displayed by "view source", I assumed the browser would not provide the information.
However, guess what? When I do "File" → "Save Page As...", then examine the saved file with a plain text editor... sure enough, it contains the modifications made by my firefox extension! Surprise!

A browser has no direct write access to the local filesystem. The only read access it has is when you explicitly provide a file:// URL (see note 1 below).
In your case, we are explicitly talking about javascript - which can read and write cookies and local storage. It can also send stuff back to the server and retrieve it, e.g. using AJAX.
Stuff you put in local storage/cookies is effectively not accessible to other programs (such as email clients).
It is possible to create very long mailto: URLs (see note 2), but those only handle inline content in the email, and you're going to run into all sorts of encoding issues that you're not ready to deal with.
Hence I'd recommend pursuing server-side storage via AJAX (a rough sketch follows the notes below), and looking at local storage once you've got this sorted/working.
Note 1: this is not strictly true. A trusted, signed javascript has access to additional functions, which may include direct file access.
Note 2: the limit depends on the browser and the email client (Lotus Notes truncates the content rather a lot).
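As a rough sketch of the AJAX route (the '/save-page' endpoint is purely hypothetical, for illustration):
var xhr = new XMLHttpRequest();
xhr.open('POST', '/save-page');                        // hypothetical endpoint on your own server
xhr.setRequestHeader('Content-Type', 'text/html');
xhr.onload = function () { console.log('saved on the server'); };
xhr.send(document.documentElement.outerHTML);          // note: outerHTML omits the doctype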

Related

Where should I put urls in an AJAX based web project?

I am working on some web development at the moment. I have a set of views (different versions of the site the user will be able to see), and many of those allow some interaction that is JS/Ajax based. This is just the context of this question:
Where can I put the request URLs of the various ajax requests?
I know this question seems a little stupid, so let me explain a bit. I assume jQuery, but this question is basically not strictly tied to it. I will try to give very minimalistic snippets to illustrate the idea; these are of course not 1:1 correct/finished/good.
Typically such a site has not only one single type of request but a whole bunch of these. Think of a site where the user sees his personal data like name, mail, address, phone etc. On clicking on one such entry, a minimal form should be displayed to allow modification of the entry. Of course you need minor changes in the replacements (e.g. distinguish between change name and change phone).
My first approach was to write ajax code for each and every possible entry separately in a JS file. I mean that each entry gets its own html id, and I just replace the content of the element with that id with the new content. I write code for each id explicitly in JS, causing quite some redundancy in the code (although a well-designed set of functions will help here):
$("#name").click(function(){ /* replace #name, hardcode url */});
$("#phone").click(function(){ /* replace #phone, hardcode url */});
One other way was to put an <a> tag with the href set to the url of the AJAX request. Then the developer can define some classes that follow a defined and fixed scheme. The JS code gets smaller in size, as only a single event handler must be registered, and I just need to follow the convention throughout the site.
<div class='foo'>... <a href="ajax.php?first" class="ajax"></div>
<div class='foo'>... <a href="ajax.php?second" class="ajax"></div>
and the simplified JS:
$(".foo a.ajax").click(function(ev){ /* do something and use source of ev to fetch the url */ });
This second method could be done even worse by putting the url in any html tag and hiding it from the user (scary).
Ideally one should write the page such that all AJAX-enabled interaction is doable with JS disabled as well. Thus I think the way of putting the urls in <a> tags is not good. However, I think hardcoding them is also not ideal.
So did I miss a useful/typical way of doing this? Is there even some consensus on where such data is best located?
If your website is big enough, you should separate your urls based on modules such as banking, finance, user etc. But if you do not have that many urls, you can store all of them in a single javascript file.
You should store the BASE url in a single javascript file that everything else imports (in case your domain changes or you switch between development and production mode).
//base_url.js
var BASE_URL_PROD = "www......com"; // production server url
var BASE_URL_DEV = "localhost:3000"; // local server url
var BASE_URL = BASE_URL_DEV; // change this when switching between dev and prod mode

// urls.js
var FETCH_USER = BASE_URL + "/user/fetch";
var SAVE_USER = BASE_URL + "/user/save";

// in some javascript file
$("#clickMe").click(function () {
    $.ajax({ url: FETCH_USER /* , ... */ }); // $.ajax is the global jQuery ajax helper
});
The question here is: do you want to offer a way to access the information, if javascript is turned off or not loaded yet?
You already answered yourself: If javascript is disabled or not loaded yet, the user will directly go to the given url.
If you want to offer a non-javascript way, change your controller and check for an ajax request; otherwise just use the javascript way, like Abdullah already described.
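To illustrate the "check for an ajax request in the controller" idea, here is a sketch of my own (Express used only for brevity; the original stack isn't specified, and jQuery's $.ajax sets the X-Requested-With header that req.xhr checks):
const express = require('express');
const app = express();

app.get('/user/fetch', function (req, res) {
    if (req.xhr) {
        res.json({ name: 'Alice', phone: '555-0100' }); // ajax request: return just the data
    } else {
        res.render('user');                             // javascript disabled: render the full page
    }
});
app.listen(3000);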

Force browser to reload all cache after site update

Is there a way to force the clients of a webpage to reload the cache (e.g. images, javascript, etc.) after an update to the code base has been pushed to the server? We get a lot of help desk calls asking why certain functionality no longer works. A simple hard refresh fixes the problems, as it downloads the newly updated javascript file.
For specifics, we are using Glassfish 3.x and JSF 2.1.x. This would apply to more than just JSF, of course.
To describe what behavior I hope is possible:
Website A has two images and two javascript files. A user visits the site and the 4 files get cached. As far as I'm concerned, there is no need to re-download those files unless the user specifically forces a hard refresh or clears their cache. Once an update to one of the files is pushed to the site, the server could have some sort of metadata in the header informing the client of said update. If the client chooses, the new files would be downloaded.
What I don't want to do is put a meta tag in the header of the page that prevents anything from ever being cached... I just want something that tells the client an update has occurred and that it should fetch the latest version. I suppose this would just be some sort of versioning on the client side.
Thanks for your time!
The correct way to handle this is by changing the URL convention for your resources. For example, we have it as:
/resources/js/fileName.js
The way to get the browser to still cache the file, but do it the proper way with versioning, is by adding something to the URL. Adding a value to the querystring doesn't allow caching, so the place to put it is after /resources/.
A reference for querystring caching: http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.9
So for example, your URLs would look like:
/resources/1234/js/fileName.js
So what you could do is use the project's version number (or some value in a properties/config file that you manually change when you want cached files to be reloaded) since this number should change only when the project is modified. So your URL could look like:
/resources/cacheholder${project.version}/js/fileName.js
That should be easy enough.
The problem now is with mapping the URL, since that value in the middle is dynamic. The way we overcame that is with a URL rewriting module that allowed us to filter URLs before they got to our application. The rewrite watched for URLs that looked like:
/resources/cacheholder______/whatever
And removed the cacheholder_______/ part. After the rewrite, it looked like a normal request, and the server responded with the correct file, without any other specific mapping/logic... the point is that the browser thought it was a new file (even though it really wasn't), so it requested it, and the server figured it out and served the correct file (even though it's a "weird" URL).
Of course, another option is to add this dynamic string to the filename itself, and then use the rewrite tool to remove it. Either way, the same thing is done - targeting a string of text during rewrite, and removing it. This allows you to fool the browser, but not the server :)
UPDATE:
An alternative that I really like is to set the filename based on the contents, and cache that. For example, that could be done with a hash. Of course, this type of thing isn't something you'd manually do and save to your project (hopefully); it's something your application/framework should handle. For example, in Grails, there's a plugin that "hashes and caches" resources, so that the following occurs:
Every resource is checked
A new file (or mapping to this file) is created, with a name that is the hash of its contents
When adding <script>/<link> tags to your page, the hashed name is used
When the hash-named file is requested, it serves the original resource
The hash-named file is cached "forever"
What's cool about this setup is that you don't have to worry about caching correctly - just set the files to cache forever, and the hashing should take care of files/mappings being available based on content. It also provides the ability for rollbacks/undos to already be cached and loaded quickly.
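The same idea outside of Grails, as a rough Node sketch of my own (just the "name the file after its content hash" step):
const fs = require('fs');
const crypto = require('crypto');

function hashedCopy(path) {
    const contents = fs.readFileSync(path);                                // read the resource
    const hash = crypto.createHash('md5').update(contents).digest('hex');  // hash of its contents
    const hashedPath = path.replace(/\.js$/, '.' + hash + '.js');          // e.g. app.js -> app.d41d8cd9....js
    fs.copyFileSync(path, hashedPath);                                     // this copy can be cached "forever"
    return hashedPath;                                                     // reference this name in the <script> tag
}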
I use a no-cache parameter for these situations...
I have a string constant value like this (from a config file):
$no_cache = "v11";
and in pages, i use assets like
<img src="a.jpg?nc=$no_cache">
and when I update my code, I just change the $no_cache value, and it works like a charm.

How to capture a complete webpage using javascript

I inject javascript code into the page the user is currently viewing; on the user's command this script makes DOM changes. At the end of this interaction the user might want to save the page so that s/he can view/edit it later. I could remember the DOM changes the user made, but if the original page (at its source) is changed, I will not be able to restore the page for the user. That is why I want to send the changed page to my server. I should be able to restore it completely, and the page should behave exactly the way it did (including scripts and media).
Additionally, I cannot store the media of the user's page at my end (resource limitation), so I guess I have to parse and rewrite all media addresses/references/links to absolute URLs/URIs in the various sources (HTML/CSS/JavaScript).
Now the question is: is there a library/framework/jQuery extension that can help me achieve this objective?
If not, what is the right/professional way to do it?
Since you are using jQuery you could try $("html").html(); just make sure to add the appropriate <html> tags when you output it again.
$('body').html()
$('head').html()
$('html').html()
Download firebug, and try it in the console window on this page. I am getting what looks like the correct data back.
Have I got it right that you are building some kind of CMS that lets the user edit entire pages (not just separate content blocks) in contenteditable mode?
I would definitely advise looking at a solution like CKEditor/TinyMCE etc., because doing it all yourself will be a terrible pain.
The answer from @Sydenam should work fine to save the whole HTML page.
Meanwhile, and this is IMPORTANT, I would recommend that you consider a potential SECURITY ISSUE here. Indeed, the user can inject whatever he wants into the DOM and have you save it, like nasty Javascript functions sending confidential information to a remote server, for example.
So, from my perspective, a professional way of doing this would be to dedicate only a PART of the DOM to that usage, let's say a <div id='editable_div'> that you can load using $('#editable_div').load('your_url', parameters, etc...), and save afterwards using another AJAX call.
When saving it you can parse this chunk of HTML and make sure nothing nasty is inside with some regexp (like <script> tags).
Hope it helps,
Regards,

What is a "?" for in the src attribute of a html script tag?

If this has been asked before, I apologize, but this is kind of a hard question to search for. This is the first time I have come across this in all my years of web development, so I'm pretty curious.
I am editing some HTML files for a website, and I have noticed that in the src attribute of the script tags that the previous author appended a question mark followed by data.
Ex: <script src="./js/somefile.js?version=3.2"></script>
I know that this is used in some languages, such as PHP, for passing values in a GET request, but as far as I ever knew, this wasn't done in javascript - at least not when including a javascript file. Does anyone know what this does, if anything?
EDIT: Wow, a lot of responses. Thanks, one and all. Since a lot of people are saying similar things, I will post a global update instead of commenting on each answer.
In this case the javascript files are static, hence my curiosity. I have also opened them up and did not see anything attempting to access variables on file load. I've never thought about caching or plain version control, both of which seem more likely in this circumstance.
I believe what the author was doing was ensuring that if he creates version 3.3 of his script he can change the version= in the url of the script to ensure that users download the new file instead of running off of the old script cached in their browser.
So in this case it is part of the caching strategy.
My guess is it's so if he publishes a new version of the JavaScript file, he can bump the version in the HTML documents. This will not do anything server-side when requested, but it causes the browser to treat it as a different file, effectively forcing the browser to re-fetch the script and bypass the local cache of the file.
This way, you can set a really high cache time (like a week or a month!) but not sacrifice the ability to update scripts frequently if necessary.
What you have to remember is that this ./js/somefile.js?version=3.2 doesn't have to be a physical file. It can be a page which creates the file on the fly. So you could have it where the request says, "Hey give me version 3 of this js file," and the server side code creates it and writes it to the output stream.
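For example (a contrived sketch of my own, with Express used only for brevity), "somefile.js" could be generated per request based on the version parameter:
const express = require('express');
const app = express();

app.get('/js/somefile.js', function (req, res) {
    const version = req.query.version || '1.0';  // e.g. ?version=3.2
    res.type('application/javascript');
    res.send("console.log('script version " + version + "');"); // generated on the fly, not read from disk
});
app.listen(3000);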
The other option is to force the browser to not cache the file and pull down the new one when it makes the request. Since the URI changed, it will think the file is completely new.
A (well-configured) web server will send static files like JavaScript source code once and tell the web browser to cache that file locally for a certain period of time (could be a day, a week, a month, or longer). When the browser sees another request for that same file, it will just use that version instead of getting new code from the server.
If the URL changes -- for example by adding a query string -- then the browser suspects that its cached version is no good and gets a new one. As such, the ? helps developers say "Oops, I changed this file, make sure the browser gets a new copy."
In this case it's probably being used to ensure the source file isn't cached between versions.
Of course, it could also be used server side to generate the javascript file, without knowing what you have on the other end of the request, it's difficult to be definitive.
BTW, the ?... portion of the url is called the query string.
This is used to guarantee that the browser downloads a new version of the script when one is available. The version number in the url is incremented each time a new version is deployed, so that the browser sees it as a different file.
Just because the file extension is .js doesn't mean that the target is an actual .js file. They could set up their web server to pass the requested URL to a script (or literally have a script named somefile.js) and have that interpret the filename and version.
The query string has nothing to do with the javascript itself. It appears some server-side code is serving up a different version depending on that querystring.
You should never assume anything about paths in a URL. The extension on a path in a URL doesn't really tell you anything. URLs can be completely dynamic and served by some server-side code, or can be rewritten by web servers dynamically.
Now it is common to add a querystring to urls when loading javascript files to prevent client-side caching. If the page updates and references a new version of the script, then the page can bust through and cause the client to refresh its script.

Any issues with using innerHTML to load up an iframe?

My app (under development) uses Safari 4.0.3 and JavaScript to present its front end to the user. The backend is PHP and SQLite. This is under OS X 10.5.8.
The app will from time to time receive chunks of HTML to present to the user. Each chunk is the body of an email received, and as such one has no control over the quality of the HTML received. What I do is use innerHTML to shove the chunk into an iFrame and let Safari render it.
To do that I do this:
window.frames["mainwindow"].window.frames["Frame1"].document.body.innerHTML = myvar;
where myvar contains the received HTML. Now, for the most part this works as desired, and the HTML is rendered as expected. The exception appears to be when the <html> tag for the chunk looks like:
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" ...
and so on for more than 2800 chars. The effect is as if my JavaScript statement above had not been executed - I can see that using Safari's Error Console in the Develop menu to look into the iFrame. If I extract the HTML from the SQLite backend db and save it as a .html file, then Safari will render that with no trouble.
Any comments on why this might be happening, or on such use of innerHTML, or pointers to discussion of same, would be appreciated.
innerHTML is not the same as writing a complete document. Even if you write to outerHTML as suggested by Gumbo, there are things outside the root element that can confuse it, such as doctypes. To write a whole document at once, you have to use old-school cross-frame document.write:
var d= window.frames["mainwindow"].window.frames["Frame1"].document;
d.open();
d.write(htmldoc);
d.close();
Each chunk is the body of an email received, and as such one has no control over the quality of the HTML received.
OK, you may have a security problem then.
If you let an untrusted source like an e-mail inject HTML into your security context (and an iframe you're writing to is in your security context), it can run JavaScript of its own, including scripts that reach up and control your entire enclosing application and anything else on the same hostname. Unless your application is so trivial you don't care about that, this is really bad news.
If you need to allow untrusted HTML, the way many webmail services do it is to have it served on a different hostname (eg. a subdomain) that does not have access to any other part of the application. To do this, your iframe src would have to point to the different hostname; you couldn't script between the two security contexts.
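A minimal sketch of that separation (the hostnames and the endpoint are hypothetical, just to show the shape of it):
// The app itself runs on app.example.com; the stored mail body is only ever served
// from usercontent.example.net, which cannot script the enclosing application.
var frame = document.createElement('iframe');
frame.src = 'https://usercontent.example.net/mail/12345'; // hypothetical endpoint returning the raw HTML
document.getElementById('mailview').appendChild(frame);   // 'mailview' is a made-up container id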
