How to capture a complete webpage using JavaScript

I inject JavaScript code into the page a user is currently viewing, and on the user's command this script makes DOM changes. At the end of this interaction the user might want to save the page so that they can view or edit it later. I could remember the DOM changes the user made, but if the original page (at its source) changes, I will not be able to restore the page for the user. That is why I want to send the changed page to my server. I should be able to restore it completely, and the page should behave exactly the way it did (including scripts and media).
Additionally, I cannot store the media of the user's page at my end (resource limitation), so I guess I have to parse and rewrite all addresses/references/links to media into absolute URLs/URIs wherever they appear (HTML/CSS/JavaScript).
Now the question is: is there a library/framework/jQuery extension that can help me achieve this objective?
If not, what is the right/professional way to do it?

Since you are using jQuery you could try $("html").html(); just make sure to add the appropriate <html> tags when you output it again.
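As a rough sketch of reassembling the full document (an assumption on my part that you also want the doctype), something like this works without jQuery:
// Rebuild a standalone HTML string from the live DOM
// (simplified: the doctype's publicId/systemId are ignored here).
var doctype = document.doctype ? '<!DOCTYPE ' + document.doctype.name + '>\n' : '';
var page = doctype + document.documentElement.outerHTML;
// or, sticking with jQuery: '<html>' + $('html').html() + '</html>'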

$('body').html()
$('head').html()
$('html').html()
Download firebug, and try it in the console window on this page. I am getting what looks like the correct data back.

Have I got it right that you are building some kind of CMS that lets the user edit entire pages (not just separate content blocks) in contenteditable mode?
I would definitely advise looking at a solution like CKEditor/TinyMCE etc., because doing it all yourself will be a terrible pain.

The answer from #Sydenam should work fine to save the whole HTML page.
Meanwhile, and this is IMPORTANT, I would recommend you consider a potential SECURITY ISSUE here. The user can inject whatever they want into the DOM and have you save it, such as nasty JavaScript functions that send confidential information to a remote server, for example.
So, in my perspective, a professional way of doing this would be to dedicate only a PART of the DOM to that usage, say a <div id='editable_div'> that you can load using $('#editable_div').load('your_url', parameters, etc.), and save afterwards using another AJAX call.
When saving it you can parse this chunk of HTML and make sure nothing nasty is inside with some regexp (for example <script> tags).
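For illustration, a minimal sketch of that load/save round trip (the endpoint names and the page_id parameter are made up, and the server must still re-validate whatever it receives):
// Load the editable fragment from the server into the dedicated div.
$('#editable_div').load('load_fragment.php', { page_id: 42 });

// Later, strip <script> tags client-side and post the fragment back.
function saveFragment() {
  var html = $('#editable_div').html();
  var cleaned = html.replace(/<script[\s\S]*?<\/script>/gi, '');
  $.post('save_fragment.php', { page_id: 42, content: cleaned });
}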
Hope it helps,
Regards,

Related

How to handle the content part in AJAX page switching in PWA?

I have zero experience in native apps, which might help with this question.
Since the service worker caches everything so nicely, I don't see any reason why I should render the entire webpage again when the page gets switched (a link gets clicked). So I will switch only the content, use history pushState to change the URL, and change the title. I have that part figured out.
Problem is, I cannot find any resources that would support either of the two content load ideas I have:
Load center content via AJAX with HTML.
Load center content as data only and render the HTML on-the-fly in JS.
The first method would be fairly straightforward, but it would mean that the payload is bigger.
The second seems much more advanced, but would mean that HTML templates have to be in the JS somehow already? I also have a feeling that there is a method somewhere in here that would allow opening the heavily cached page (let's say the article page) and replacing just the (text) contents. But as I said, I cannot find any resources that weigh the pros and cons or give any reliable information on PWA AJAX page switching.
Any credible information on this matter would be much appreciated.
EDIT
I have kept reading and researching this matter, but sadly there is no clear indication of how to handle dynamic content over AJAX: whether I should parse the JSON data from AJAX into HTML in JS, or send it already as HTML from the backend.
To add in favour of the second option: I have figured out that my theory had some weight to it, if I use pure.js to pull an HTML template from a hidden template tag and generate the HTML on the fly from JSON over AJAX.
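For what it's worth, here is a minimal sketch of what I mean by the second option, using a plain <template> element instead of pure.js (the endpoint, element ids and JSON fields are made up):
function switchPage(url) {
  fetch(url, { headers: { 'Accept': 'application/json' } })
    .then(function (r) { return r.json(); })
    .then(function (article) {
      // fill a hidden <template id="article-template"> with the JSON data
      var node = document.getElementById('article-template').content.cloneNode(true);
      node.querySelector('.title').textContent = article.title;
      node.querySelector('.body').innerHTML = article.bodyHtml;
      var main = document.getElementById('content');
      main.innerHTML = '';
      main.appendChild(node);
      // update the title and URL without a full page load
      document.title = article.title;
      history.pushState({ url: url }, article.title, url);
    });
}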
You make it so complicated. Can we take a look at your code, please?
If you mean retrieving data from a database via AJAX, then all you need is jQuery:
$(document).ready(function() {
  // send the field's value (posting the raw DOM element won't serialize)
  var contentData1 = document.getElementById('contentData1').value;
  $.post("pathToPHP.php", { contentData1: contentData1 }, function(data) {
    $("#container").html(data);
  });
});
and the pathToPHP.php file should retrieve the data you want
echo "";

PHP HttpRequest to create a web page - how to handle long response times?

I am currently using javascript and XMLHttpRequest on a static html page to create a view of a record in Zotero. This works nicely except for one thing: The page html title.
I can of course also change the <title>...</title> tag, but if someone wants to post the view to, for example, Facebook, the static title of the web page will be shown there.
I can't think of any way to fix this with just a static page with javascript. I believe I need a dynamically created page from a server that does something similar to XMLHttpRequest.
For PHP there is HTTPRequest. Now to the problem. In the javascript version I can use asynchronous calls. With PHP I think I need synchronous calls. Is that something to worry about?
Is there perhaps some other way to handle this that I am not aware of?
UPDATE: It looks like those trying to answer are not at all familiar with Zotero. I should have been more clear. Zotero is a reference db located at http://zotero.org/. It has an API that can be used through XMLHttpRequest (which is what I said above).
Now I can not use that in my scenario which I described above. So I want to call the Zotero server from my server instead. (Through PHP or something else.)
(If you are not familiar with the concepts it might be hard to understand and answer the question. Of course.)
UPDATE 2: For those interested in how Facebook scrapes a URL you post there, please test it here: https://developers.facebook.com/tools/debug
As you can see by testing there no javascript is run.
Sorry, I'm not sure if I understand what you are trying to ask; are you just wanting to change the page's title?
Why not use javascript?
document.title = newTitle
Facebook expects the title (or OpenGraph og:title tags) to be present when it fetches the page. It won't execute any JavaScript for you to fill in the blanks.
A cool workaround would be to detect the Facebook scraper with PHP by parsing the User Agent string, and serving a version of the page with the information already filled in by PHP instead of JavaScript.
As far as I know, the Facebook scraper uses this header for User Agent: "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
You can check to see if part of that string is present in the header and load the page accordingly.
if (strpos($_SERVER['HTTP_USER_AGENT'], 'facebookexternalhit') !== false)
{
//synchronously load the title and opengraph tags here.
}
else
{
//load the page normally
}

can firefox extension modify DOM of HTML document then save as HTML?

I am creating a firefox extension that lets the operator perform various actions that modify the content of the HTML document. The operator does not edit HTML, they take other actions and my extension modifies the document by inserting elements, adding attributes, and so forth.
When the operator is finished, they need to be able to save the HTML document as a file (or have my extension send it to an internet destination, but this is not required since they can email the saved file).
I thought maybe the changes made by the javascript code in my extension would be reflected in the HTML document, but when I ask the firefox browser to "view source" after making modifications, it displays the original HTML text.
My questions are:
#1: What is the easiest way for the operator to save the HTML document with all the changes my extension has made?
#2: What is the easiest way for the javascript code in my extension to process the HTML document contents and write to an HTML file on the local disk?
#3: Is any valid HTML content incapable of accurate representation in the saved file?
#4: Is the TreeWalker part of the solution (see below)?
A couple observations from my research so far:
I've read about the TreeWalker object, which seems to provide a fairly painless way for an extension to walk through everything (or almost everything?) in the HTML document. But does it expose enough that everything in the original (and my modifications) can be saved without losing anything of importance?
Does the TreeWalker walk through the HTML document in the "correct order" --- the order necessary for my extension to generate the original and/or modified HTML document?
Anything obscure or tricky about these problems?
OK, so I am assuming here that you have access to the page DOM. What you need to do is basically make changes to the DOM and then get all the DOM code and save it as a file. Here is how you can download the page's HTML code. This will create an a tag which the user needs to click for the file to download.
var a = document.createElement('a'), code = document.querySelectorAll('html')[0].innerHTML;
a.setAttribute('download', 'filename.html');
// encode the markup so characters like '#' or '&' don't break the data: URI
a.setAttribute('href', 'data:text/html,' + encodeURIComponent(code));
Now you can insert this a tag anywhere in the DOM and the file will download when the user clicks it.
Note: This is sort of a hack, this injects entire html of the file in the a tag, it should in theory work in any up to date browser (except, surprise, IE). There are more stable and less hacky ways of doing it like storing it in a file system API file and then downloading that file instead.
Edit: The document.querySelectorAll line accesses the page DOM. For it to work the document must be accessible. You say you are modifying DOM so that should already be there. Make sure you are adding the code on the page and not your extension code. This code will be at the same place as your DOM modification code, not your extension pages that can't access the DOM.
And as for the a tag, it will be inserted in the page. I skipped the steps since I assumed you already know how to manipulate DOM and also because I don't know where you would like to add the link. And you can skip the user action of clicking the link too, but it's a hack and only works in modern browsers. You can insert the a tag somewhere in the original page where user won't see it and then call the a.click() function to simulate a click event on the link. But this is not a legit way and I personally only use it on my practice projects to call click event listeners.
I can only test this on chrome not on FF but try this code, this will not require you to even add the a link to DOM. You need to add this next to the DOM manipulation code. This will work if luck is on your side :)
var a = document.createElement('a'), code = document.querySelectorAll('html')[0].innerHTML;
a.setAttribute('download', 'filename.html');
a.setAttribute('href', 'data:text/html,' + encodeURIComponent(code));
a.click();
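A somewhat less hacky variant, assuming Blob and object URLs are available, avoids the data: URI size limits altogether:
// Serialize the live DOM into a Blob and trigger a download of it.
var blob = new Blob([document.documentElement.outerHTML], { type: 'text/html' });
var a = document.createElement('a');
a.setAttribute('download', 'filename.html');
a.href = URL.createObjectURL(blob);
document.body.appendChild(a);
a.click();
URL.revokeObjectURL(a.href); // release the object URL once the download has started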
There is no easy way to do this with the web API only, at least when you want a result that does not omit stuff like the doctype or comments. You could still write a serializer yourself that goes through document.childNodes and serializes according to the node type (Element.outerHTML, Comment.data and so on).
Luckily, you're writing a Firefox add-on, so you have access to a lot more (powerful) stuff.
While still not 100% perfect, the nsIDocumentEncoder implementations will produce pretty decent results, that should only differ in some whitespace and explicit charset declaration at most (everything else is a bug).
Here is an example on how one might use this component:
function serializeDocument(document) {
  const {
    classes: Cc,
    interfaces: Ci,
    utils: Cu
  } = Components;
  // nsIDocumentEncoder serializes the live DOM, including doctype and comments
  let encoder = Cc['@mozilla.org/layout/documentEncoder;1?type=text/html'].createInstance(Ci.nsIDocumentEncoder);
  encoder.init(document, 'text/html', Ci.nsIDocumentEncoder.OutputLFLineBreak | Ci.nsIDocumentEncoder.OutputRaw);
  encoder.setCharset("utf-8");
  return encoder.encodeToString();
}
If you're writing an SDK add-on, stuff gets more complicated as the SDK abstracts some important stuff away. You'll need to go through the chrome module, and also figure out the active window and tab yourself. Something like Services.wm.getMostRecentWindow("navigator:browser").content.document (Services.jsm) should do the trick.
In XUL overlay add-ons, content.document should suffice to get the document of the currently active tab, and you have Components access already.
Still, you need to let the user choose a file destination, usually through nsIFilePicker and then actually write the file, by using something like a file stream or the fully async OS.File API.
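Roughly, that last step could look like the sketch below; this is legacy (pre-WebExtensions) chrome code, and the exact signatures vary between Firefox versions, so treat it as an assumption rather than a recipe:
const { classes: Cc, interfaces: Ci, utils: Cu } = Components;
Cu.import("resource://gre/modules/osfile.jsm"); // provides OS.File

function saveHtml(window, html) {
  let fp = Cc["@mozilla.org/filepicker;1"].createInstance(Ci.nsIFilePicker);
  fp.init(window, "Save page as", Ci.nsIFilePicker.modeSave);
  fp.appendFilter("HTML files", "*.html");
  fp.open(result => {
    if (result === Ci.nsIFilePicker.returnCancel) return;
    // encode to UTF-8 bytes and write the file asynchronously
    let bytes = new TextEncoder().encode(html);
    OS.File.writeAtomic(fp.file.path, bytes).catch(Cu.reportError);
  });
}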
Looks like I get to answer my own question, thanks to someone in mozilla #extdev IRC.
I got totally faked out by "view source". When I didn't see my modifications in the window displayed by "view source", I assumed the browser would not provide the information.
However, guess what? When I "file" ===>> "save page as...", then examine the page contents with a plain text editor... sure enough, that contained the modifications made by my firefox extension! Surprise!
A browser has no direct write access to the local filesystem. The only read access it has is when you explicitly provide a file:// URL (see note 1 below).
In your case, we are explicitly talking about javascript - which can read and write cookies and local storage. It can also send stuff back to the server and retrieve it, e.g. using AJAX.
Stuff you put in local storage/cookies is effectively not accessible to other programs (such as email clients).
It is possible to create very long mailto: URLs (see note 2), but they only handle inline content in the email, and you're going to run into all sorts of encoding issues that you're not ready to deal with.
Hence I'd recommend pursuing storage serverside via AJAX - and look at local storage once you've got this sorted/working.
Note 1: this is not strictly true. A trusted, signed JavaScript has access to additional functions which may include direct file access.
Note 2: the limit depends on the browser and the email client - Lotus Notes truncates the content rather a lot.

Save html page using javascript

How can I save a rendered HTML page using JavaScript?
You can't. Javascript in the browser has no file IO capabilites.
If it had, going to any website could write anything to your hard drive.
var everything = document.all
will give you everything, but then you still need to get it off the browser and into a local file. You will need a server-side language for that.
At least I do know of a Windows/IE-specific way to save the current HTML file:
http://p2p.wrox.com/javascript-how/3193-how-do-you-save-html-page-your-local-hd.html#post78192
However, I wonder if other browsers (i.e. Chrome) have some similar file I/O API. Obviously, according to previous answers, there's no universal standard.
For what purpose? If you just want to print the page, use window.print() instead.
you can get all of the page contents and then you can
send an ajax request to a script that handles the html content (be aware of cross domain restrictions)
save the contents into a cookie
save the contents into localStorage or some local db
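For example, the localStorage option could be as small as this (a sketch; it assumes the serialized markup fits within the storage quota):
// Capture the current markup and stash it locally.
var snapshot = '<!DOCTYPE html>\n' + document.documentElement.outerHTML;
try {
  localStorage.setItem('savedPage', snapshot);
} catch (e) {
  // QuotaExceededError etc. -- fall back to sending it to a server instead
  console.error('Could not store the snapshot locally', e);
}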
There is no reliable way to do this in js. :(

How do I protect JavaScript files?

I know it's impossible to hide source code, but, for example, if I have to link a JavaScript file from my CDN to a web page and I don't want people to know the location and/or content of this script, is this possible?
For example, to link a script from a website, we use:
<script type="text/javascript" src="http://somedomain.example/scriptxyz.js">
</script>
Now, is it possible to hide from the user where the script comes from, or to hide the script content and still use it on a web page?
For example, by saving it in my private CDN that needs password to access files, would that work? If not, what would work to get what I want?
Good question with a simple answer: you can't!
JavaScript is a client-side programming language, therefore it works on the client's machine, so you can't actually hide anything from the client.
Obfuscating your code is a good solution, but it's not enough, because, although it is hard, someone could decipher your code and "steal" your script.
There are a few ways of making your code hard to be stolen, but as I said nothing is bullet-proof.
Off the top of my head, one idea is to restrict access to your external js files from outside the page you embed your code in. In that case, if you have
<script type="text/javascript" src="myJs.js"></script>
and someone tries to access the myJs.js file in browser, he shouldn't be granted any access to the script source.
For example, if your page is written in PHP, you can include the script via the include function and let the script decide if it's "safe" to return its source.
In this example, you'll need the external "js" (written in PHP) file myJs.php:
<?php
$URL = $_SERVER['SERVER_NAME'].$_SERVER['REQUEST_URI'];
if ($URL != "my-domain.example/my-page.php")
die("/\*sry, no acces rights\*/");
?>
// your obfuscated script goes here
that would be included in your main page my-page.php:
<script type="text/javascript">
<?php include "myJs.php"; ?>;
</script>
This way, only the browser could see the js file contents.
Another interesting idea is that at the end of your script, you delete the contents of your dom script element, so that after the browser evaluates your code, the code disappears:
<script id="erasable" type="text/javascript">
//your code goes here
document.getElementById('erasable').innerHTML = "";
</script>
These are all just simple hacks that cannot, and I can't stress this enough: cannot, fully protect your js code, but they can sure piss off someone who is trying to "steal" your code.
Update:
I recently came across a very interesting article written by Patrick Weid on how to hide your js code, and he reveals a different approach: you can encode your source code into an image! Sure, that's not bulletproof either, but it's another fence that you could build around your code.
The idea behind this approach is that most browsers can use the canvas element to do pixel manipulation on images. And since a canvas pixel is represented by 4 values (rgba), each of those values can be in the range 0-255. That means that you can store a character (actually its ASCII code) in every pixel. The rest of the encoding/decoding is trivial.
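For illustration, the decoding side could look roughly like this (an assumption: the generator stored one character code per pixel, in the red channel, terminated by a zero byte):
function decodeScriptFromImage(img) {
  var canvas = document.createElement('canvas');
  canvas.width = img.width;
  canvas.height = img.height;
  var ctx = canvas.getContext('2d');
  ctx.drawImage(img, 0, 0);
  var pixels = ctx.getImageData(0, 0, img.width, img.height).data;
  var code = '';
  for (var i = 0; i < pixels.length; i += 4) {
    if (pixels[i] === 0) break;          // stop at the first zero byte
    code += String.fromCharCode(pixels[i]);
  }
  return code;                            // the caller can eval() or new Function() it
}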
The only thing you can do is obfuscate your code to make it more difficult to read. No matter what you do, if you want the javascript to execute in their browser they'll have to have the code.
Just off the top of my head, you could do something like this (if you can create server-side scripts, which it sounds like you can):
Instead of loading the script like normal, send an AJAX request to a PHP page (it could be anything; I just use it myself). Have the PHP locate the file (maybe on a non-public part of the server), open it with file_get_contents, and return (read: echo) the contents as a string.
When this string returns to the JavaScript, have it create a new script tag, populate its innerHTML with the code you just received, and attach the tag to the page. (You might have trouble with this; innerHTML may not be what you need, but you can experiment.)
If you do this a lot, you might even want to set up a PHP page that accepts a GET variable with the script's name, so that you can dynamically grab different scripts using the same PHP. (Maybe you could use POST instead, to make it just a little harder for other people to see what you're doing. I don't know.)
EDIT: I thought you were only trying to hide the location of the script. This obviously wouldn't help much if you're trying to hide the script itself.
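A minimal sketch of that flow (the endpoint name is made up; note that script.text is more reliable than innerHTML for this):
// Fetch the script body from a server-side endpoint and inject it into a new <script> element.
var xhr = new XMLHttpRequest();
xhr.open('GET', 'getScript.php?name=scriptxyz', true); // hypothetical PHP endpoint
xhr.onload = function () {
  if (xhr.status === 200) {
    var tag = document.createElement('script');
    tag.text = xhr.responseText;  // populate the script body with what the server sent
    document.head.appendChild(tag);
  }
};
xhr.send();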
Google Closure Compiler, YUI Compressor, Minify, /Packer/, etc., are options for compressing/obfuscating your JS code. But none of them can keep your code hidden from the users.
Anyone with decent knowledge can easily decode/de-obfuscate your code using tools like JS Beautifier. You name it.
So the answer is, you can always make your code harder to read/decode, but for sure there is no way to hide.
Forget it, this is not doable.
No matter what you try it will not work. All a user needs to do to discover your code and its location is to look in the Net tab in Firebug, or use Fiddler to see what requests are being made.
From my knowledge, this is not possible.
Your browser has to have access to JS files to be able to execute them. If the browser has access, then browser's user also has access.
If you password protect your JS files, then the browser won't be able to access them, defeating the purpose of having JS in the first place.
I think the only way is to put the required data on the server and allow only logged-in users to access the data as required (you can also do some calculations server-side). This won't protect your JavaScript code, but it makes it inoperable without the server-side code.
I agree with everyone else here: With JS on the client, the cat is out of the bag and there is nothing completely foolproof that can be done.
Having said that, in some cases I do this to put some hurdles in the way of those who want to take a look at the code. This is how the algorithm works (roughly):
The server creates 3 hashed and salted values: one for the current timestamp, and the other two for each of the next 2 seconds. These values are sent to the client via Ajax as a comma-delimited string from my PHP module. In some cases, I think you can hard-bake these values into a script section of the HTML when the page is formed, and delete that script tag once the hashes have been used. The server is CORS protected and does all the usual SERVER_NAME etc. checks (which is not much protection, but at least provides some modicum of resistance to script kiddies).
Also, it would be nice if the server checked that it was indeed an authenticated user's client doing this.
The client then sends the same 3 hashed values back to the server through an Ajax call to fetch the actual JS that I need. The server checks the hashes against the current timestamp there... The three values ensure that the data is being sent within the 3-second window, to account for latency between the browser and the server.
The server needs to be convinced that one of the hashes is matched correctly; and if so it would send the crucial JS back to the client. This is a simple, crude "one-time-use password" without the need for any database at the back end.
This means that any hacker has only the 3-second window since the generation of the first set of hashes to get to the actual JS code.
The entire client code can be inside an IIFE, so some of the variables inside the client are even harder to read from the Inspector console.
This is not a deep solution: a determined hacker can register, get an account and then ask the server to generate the first three hashes, by doing tricks to go around Ajax and CORS, and then make the client perform the second call to get to the actual code -- but it is a reasonable amount of work.
Moreover, if the salt used by the server is based on the login credentials, the server may be able to detect which user tried to retrieve the sensitive JS. (The server needs to do some additional work regarding the behaviour of the user AFTER the sensitive JS was retrieved, and block the person if, say, they did not do some other activity which was expected.)
An old, crude version of this was done for a hackathon here: http://planwithin.com/demo/tadr.html That will not work if the server detects too much latency and the request falls outside the 3-second window.
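The client side of that handshake could be sketched like this (the endpoint names are made up; the real check happens on the server):
fetch('/get-hashes', { credentials: 'include' })
  .then(function (r) { return r.text(); })           // e.g. "hash1,hash2,hash3"
  .then(function (hashes) {
    // send the hashes back within the 3-second window to fetch the real JS
    return fetch('/get-sensitive-js', {
      method: 'POST',
      credentials: 'include',
      headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
      body: 'hashes=' + encodeURIComponent(hashes)
    });
  })
  .then(function (r) { return r.text(); })
  .then(function (code) {
    var tag = document.createElement('script');
    tag.text = code;                                   // evaluate the fetched JS
    document.head.appendChild(tag);
  });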
As I said in the comment I left on gion_13 answer before (please read), you really can't. Not with javascript.
If you don't want the code to be available client-side (= stealable without great efforts),
my suggestion would be to make use of PHP (ASP, Python, Perl, Ruby, JSP + Java Servlets), which is processed server-side so that only the results of the computation/code execution are served to the user. Or, if you prefer, even Flash or a Java applet that allows client-side computation/code execution but is compiled, and thus harder to reverse-engineer (though not impossible).
Just my 2 cents.
You can also set up a mime type for application/JavaScript to run as PHP, .NET, Java, or whatever language you're using. I've done this for dynamic CSS files in the past.
I know that this is the wrong time to be answering this question, but I just thought of something.
I know it might be stressful, but at least it might still work.
Now the trick is to create a lot of server-side encoding scripts. They have to be decodable (for example a script that replaces all vowels with numbers and adds the letter 'a' to every consonant, so that the word 'bat' becomes 'ba1ta'). Then create a script that will randomize between the encoding scripts and create a cookie with the name of the encoding script being used (quick tip: try not to use the actual name of the encoding script as the cookie value; for example, if our cookie is named 'encoding_script_being_used' and the randomizing script chooses an encoding script named MD10, try not to use MD10 as the value of the cookie but something like 'encoding_script4567656', just to prevent guessing). Then, after the cookie has been created, another script will check for the cookie named 'encoding_script_being_used', get the value, and determine which encoding script is being used.
The reason for randomizing between the encoding scripts is that the server-side language will randomize which script to use to encode your javascript.js, and then create a session or cookie to record which encoding script was used.
Then the server-side language will also encode your javascript.js and put it in a cookie.
So now let me summarize with an example:
PHP randomizes between a list of encoding scripts and encodes javascript.js, then it creates a cookie telling the client-side code which encoding script was used; then the client-side code decodes the javascript.js cookie (which is obviously encoded).
So people can't steal your code.
But I would not advise this, because:
it is a long process
it is too stressful
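As a toy sketch of the example scheme above (assuming vowels map to their 1-based position in "aeiou" and every consonant gets an 'a' appended, so 'bat' becomes 'ba1ta'):
function toyEncode(text) {
  var vowels = 'aeiou';
  return text.split('').map(function (ch) {
    var i = vowels.indexOf(ch.toLowerCase());
    // vowels become digits, consonants get an 'a' appended
    return i >= 0 ? String(i + 1) : ch + 'a';
  }).join('');
}
toyEncode('bat'); // "ba1ta"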
Use NW.js; I think it is helpful. It can compile your code to a binary, which you can then use to build Windows, Mac, and Linux applications.
This method partially works if you do not want to expose the most sensitive part of your algorithm.
Create WebAssembly modules (.wasm), import them, and expose only your JS workflow. In this way the algorithm is protected, since it is extremely difficult to revert assembly code into a more human-readable format.
After having produced the wasm module and imported it correctly, you can use your code as you normally do:
<body id="wasm-example">
<script type="module">
import init from "./pkg/glue_code.js";
init().then(() => {
console.log("WASM Loaded");
});
</script>
</body>
