Creating a document from url javascript


This is what I've tried:
function createDocumentz() {
var doc = document.implementation.createHTMLDocument('http://www.moviemeter.nl/film/270',null,'html');
return doc;
}
Even though a document gets created, if I run this with Firebug it says that the body node has no childnodes, any idea why?

It looks like you assume that createHTMLDocument() downloads and parses an HTML file from the URL you pass as the first parameter. That is not the case: createHTMLDocument() always creates an empty document.
Also, the parameters you've passed to the function are those of createDocument(). createHTMLDocument() takes only one parameter, the document title. But even if you used createDocument(), the first parameter is the namespace URI, not the source document.
Unfortunately there's no way to download and manipulate an external web site's HTML using JavaScript alone, unless that site explicitly allows cross-origin requests (CORS). The closest you can get is displaying the document in an iframe.
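To illustrate what createHTMLDocument() actually does, here is a minimal sketch, assuming a browser environment. The helper name createDocumentFromString is made up for this example; it only fills the empty document from markup you already have and does not download anything.

```javascript
// Hypothetical helper: builds a detached HTML document from a string.
// The markup must already be in hand; no network request is made.
function createDocumentFromString(html, title = 'Untitled') {
  // The only argument is the document title, not a URL.
  const doc = document.implementation.createHTMLDocument(title);
  // The new document starts with an empty body, so fill it explicitly.
  doc.body.innerHTML = html;
  return doc;
}

// Browser usage:
// const doc = createDocumentFromString('<p>Hello</p>', 'Scratch');
// console.log(doc.body.childNodes.length);
```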

No, you cannot get the content from another website this way.
If you could, it would open the door to cross-site scripting.
All you get is an empty document, which of course has an empty body.
You can use an iframe and set its src to the same URL instead.

Related

How does Debugger.getScriptSource work?

I'm trying to run the Chrome debugger to gather deobfuscated JavaScript, but it returns a large number of chunks for a script. I want to know how Chrome divides one JavaScript file into multiple chunks. What exactly is one chunk of a script?
I am aware that separate chunks are created for each script file, script tag, and eval() call. I just want to pinpoint all possible cases in which chunks are created. For example, does lazy parsing also create chunks for some functions?
It would be great if somebody can point me to some documentation about how the process works.
Chrome debugging protocol API used

The Debugger.scriptParsed event is what generates the scriptId for each script found. Each script file parsed will fire the event and should receive a unique id. As well as individual script files, each instance of a <script> tag within a page source will also get its own id.
There are a number of parameters the event callback gets passed. You can inspect all the arguments by logging out the arguments object. For example, you'll get a url back for a script file and startLine gives you the offset for <script> tags that appear in an HTML resource.
on('Debugger.scriptParsed', function() {
  // Log every argument the event callback receives
  console.log(JSON.stringify(arguments, null, 2));
});
Investigation
In response to your updated question, my understanding is that in debugger mode it will attempt a full parse and will not try to optimise with a pre-parse. I wasn't able to really understand the V8 code base, but I did a little testing.
I created an HTML page with a button, and a separate script file. I executed one function immediately after page load. Another function is executed on click of button.
document.addEventListener("DOMContentLoaded", () => {
  // Call clicked() when the button is pressed
  document.getElementById('clickMe').addEventListener('click', () => clicked());
});
I inspected the output from Debugger.scriptParsed. It parsed the script file straight away. Upon clicking the button, there were no additional output entries. If I changed the handler to invoke clicked dynamically using eval, it did output a new script chunk.
eval('clicked()');
This makes sense as there's no way for the parser to know about it until it's executed. The same applies to an inline handler, e.g. onclick="doSomething()"
Other chunks I noticed were Chrome Extension URIs and internals URIs from chrome.
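To sort the chunks while experimenting, something like the following could help. It is a rough sketch based only on the url, startLine and startColumn fields of the Debugger.scriptParsed payload; the function name, the category labels and the heuristics themselves are my own assumptions, not anything defined by the protocol.

```javascript
// Hypothetical helper: guesses where a parsed script came from,
// using fields of the Debugger.scriptParsed event payload.
function classifyScript(params) {
  const url = params.url || '';
  const startLine = params.startLine || 0;
  const startColumn = params.startColumn || 0;

  // Assumption: dynamically created code such as eval() has no URL.
  if (url === '') return 'no url (e.g. eval)';
  if (url.startsWith('chrome-extension://')) return 'extension';
  if (url.startsWith('chrome://')) return 'chrome internal';
  // Assumption: a script that starts part-way through its resource
  // is an inline <script> inside an HTML page.
  if (startLine > 0 || startColumn > 0) return 'inline script';
  return 'script file';
}

// Usage inside the event callback:
// on('Debugger.scriptParsed', (params) => console.log(classifyScript(params), params.scriptId));
```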

How can I get the text inside an object element?

In the object element with the ID x a text file is loaded and displayed correctly. How can I get this text with JavaScript?
I set
y.data = "prova.txt"
then tried
y.innerHTML;
y.text;
y.value;
None of these work.
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<object id="x" data="foo.txt"></object>
<script>
var y = document.getElementById("x")
</script>
</body>
</html>

I'm afraid this isn't going to be as easy as you'd like it to be.
According to your comments, you tried AJAX first, but came across CORS problems. This happens when you try to include data from files on a different domain name.
Since that didn't work, you tried to include the file inside an object tag. This works a bit like an iframe - the data will be displayed on the webpage, but for the same reasons as above, you cannot access the data through JavaScript if the file is under a different domain name. This is a security feature. That explains the error you were getting most recently:
Uncaught SecurityError: Failed to read the 'contentDocument' property from 'HTMLObjectElement'
Now, there are a few ways you might be able to get around this.
Firstly, if this is a program exclusively for your own use, you can start your browser with web security disabled (though this is dangerous for browsing the web generally). In Chrome, for example, you can do this by launching Chrome with the --disable-web-security flag.
Secondly, you can try to arrange that your document and the file do belong under the same domain. You will probably only be able to do this if you have control of the file.
Your error message (specifically a frame with origin "null") makes me think that you are running the files directly in the web-browser rather than through a server. It might make things work better if you go through an actual server.
If you've got Python installed (it's included on Linux and Mac), the easiest way to do that is to open up the terminal and browse to your code's directory. Then launch a simple Python server:
cd /home/your_user_name/your_directory
python -m SimpleHTTPServer    # Python 2; on Python 3 use: python3 -m http.server
That will start up a web server which you can access in your browser by navigating to http://localhost:8000/your_file.html.
If you are on Windows and haven't got Python installed, you could also use the built-in IIS server, or WAMP (or just install Python).
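Once the page and the text file are served from the same origin, you may not need the object element at all. Here is a minimal sketch using fetch; the function name loadText and the optional fetchImpl parameter (there only so the function can be tried outside a browser) are my own additions.

```javascript
// Hypothetical helper: reads a same-origin text file as a string.
// fetchImpl defaults to the global fetch and exists only for testing.
function loadText(url, fetchImpl = fetch) {
  return fetchImpl(url).then((response) => {
    if (!response.ok) {
      throw new Error('Request failed with status ' + response.status);
    }
    // response.text() resolves to the file contents
    return response.text();
  });
}

// Browser usage, with the page served from http://localhost:8000/:
// loadText('foo.txt').then((text) => console.log(text));
```

A cross-origin URL will still be blocked unless that server sends the right CORS headers.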

y.innerHTML = 'Hello World';
will replace everything in the 'x' element with the text 'Hello World', but it looks like you've already loaded another HTML document into the 'x' element. So the question is...
Where exactly in the 'x' element do you want to insert the text? for example 'x' -> html -> body?

The object element is loading the text file asynchronously, so if you try to get its data by querying the element, you'll get undefined.
However, you can use the onload attribute in <object> elements.
In your HTML, add an onload that calls a function in your script to catch when the text file has fully loaded.
<object id="x" onload="getData()" data="readme.txt"></object>
In the script, you can get the object's data with contentDocument.
function getData() {
var textFile = document.getElementById('x').contentDocument;
/* The <object> element renders a whole
HTML structure in which the data is loaded.
The plain text representation in the DOM is surrounded by <pre>
so we need to target <pre> in the <object>'s DOM tree */
// getElementsByTagName returns an array-like HTMLCollection of matches.
var textObject = textFile.getElementsByTagName('pre')[0];
// I'm sure there are far better ways to select the containing element!.
/*We retrieve the inner HTML from the object*/
var text = textObject.innerHTML;
alert(text); //use the content!
}

can firefox extension modify DOM of HTML document then save as HTML?

I am creating a firefox extension that lets the operator perform various actions that modify the content of the HTML document. The operator does not edit HTML, they take other actions and my extension modifies the document by inserting elements, adding attributes, and so forth.
When the operator is finished, they need to be able to save the HTML document as a file (or have my extension send it to an internet destination, but this is not required since they can email the saved file).
I thought maybe the changes made by the javascript code in my extension would be reflected in the HTML document, but when I ask the firefox browser to "view source" after making modifications, it displays the original HTML text.
My questions are:
#1: What is the easiest way for the operator to save the HTML document with all the changes my extension has made?
#2: What is the easiest way for the javascript code in my extension to process the HTML document contents and write to an HTML file on the local disk?
#3: Is any valid HTML content incapable of accurate representation in the saved file?
#4: Is the TreeWalker part of the solution (see below)?
A couple observations from my research so far:
I've read about the TreeWalker object, which seems to provide a fairly painless way for an extension to walk through everything (?or almost everything?) in the HTML document. But does it expose everything so everything in the original (and my modifications) can be saved without losing anything of importance?
Does the TreeWalker walk through the HTML document in the "correct order" --- the order necessary for my extension to generate the original and/or modified HTML document?
Anything obscure or tricky about these problems?

Ok, so I am assuming here that you have access to the page DOM. What you need to do is basically make changes to the DOM, then get all the DOM code and save it as a file. Here is how you can download the page's HTML code. This creates an a tag which the user needs to click for the file to download.
var a = document.createElement('a'), code = document.querySelectorAll('html')[0].innerHTML;
a.setAttribute('download', 'filename.html');
// Encode the markup so characters such as '#' and '%' do not break the data URI
a.setAttribute('href', 'data:text/html;charset=utf-8,' + encodeURIComponent(code));
Now you can insert this a tag anywhere in the DOM and the file will download when the user clicks it.
Note: This is sort of a hack, this injects entire html of the file in the a tag, it should in theory work in any up to date browser (except, surprise, IE). There are more stable and less hacky ways of doing it like storing it in a file system API file and then downloading that file instead.
Edit: The document.querySelectorAll line accesses the page DOM. For it to work the document must be accessible. You say you are modifying DOM so that should already be there. Make sure you are adding the code on the page and not your extension code. This code will be at the same place as your DOM modification code, not your extension pages that can't access the DOM.
And as for the a tag, it will be inserted in the page. I skipped the steps since I assumed you already know how to manipulate DOM and also because I don't know where you would like to add the link. And you can skip the user action of clicking the link too, but it's a hack and only works in modern browsers. You can insert the a tag somewhere in the original page where user won't see it and then call the a.click() function to simulate a click event on the link. But this is not a legit way and I personally only use it on my practice projects to call click event listeners.
I can only test this on chrome not on FF but try this code, this will not require you to even add the a link to DOM. You need to add this next to the DOM manipulation code. This will work if luck is on your side :)
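var a = document.createElement('a'), code = document.querySelectorAll('html')[0].innerHTML;
a.setAttribute('download', 'filename.html');
// Encode the markup so characters such as '#' and '%' do not break the data URI
a.setAttribute('href', 'data:text/html;charset=utf-8,' + encodeURIComponent(code));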
a.click();
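One thing to watch with the data: approach is that the markup has to be percent-encoded before it goes into the href, because a raw '#' in the page would be read as the start of a fragment and cut the content off. A small sketch of that step; the function name buildHtmlDataUri is just for illustration.

```javascript
// Hypothetical helper: wraps an HTML string in a data: URI.
function buildHtmlDataUri(html) {
  // encodeURIComponent escapes '#', '%', spaces, quotes and so on.
  return 'data:text/html;charset=utf-8,' + encodeURIComponent(html);
}

// Browser usage:
// a.setAttribute('href', buildHtmlDataUri(document.documentElement.outerHTML));
```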

There is no easy way to do this with the web API only, at least when you want a result that does not omit stuff like the doctype or comments. You could still write a serializer yourself that goes through document.childNodes and serializes each node according to its type (Element.outerHTML, Comment.data and so on).
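A hand-written serializer along those lines could look roughly like this. It is only a sketch: the function names are mine, it handles just the doctype, comment and element nodes found at the top level of a document, and I have not checked it against unusual documents.

```javascript
// Numeric nodeType values from the DOM specification.
const ELEMENT_NODE = 1;
const COMMENT_NODE = 8;
const DOCUMENT_TYPE_NODE = 10;

// Hypothetical helper: rebuilds the <!DOCTYPE ...> declaration.
function serializeDoctype(doctype) {
  let result = '<!DOCTYPE ' + doctype.name;
  if (doctype.publicId) result += ' PUBLIC "' + doctype.publicId + '"';
  if (doctype.systemId) {
    result += (doctype.publicId ? '' : ' SYSTEM') + ' "' + doctype.systemId + '"';
  }
  return result + '>';
}

// Hypothetical helper: serializes the top-level nodes of a document.
function serializeChildNodes(doc) {
  const parts = [];
  for (const node of Array.from(doc.childNodes)) {
    if (node.nodeType === DOCUMENT_TYPE_NODE) {
      parts.push(serializeDoctype(node));
    } else if (node.nodeType === COMMENT_NODE) {
      parts.push('<!--' + node.data + '-->');
    } else if (node.nodeType === ELEMENT_NODE) {
      parts.push(node.outerHTML);
    }
  }
  return parts.join('\n');
}

// Browser usage:
// const html = serializeChildNodes(document);
```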
Luckily, you're writing a Firefox add-on, so you have access to a lot more (powerful) stuff.
While still not 100% perfect, the nsIDocumentEncoder implementations will produce pretty decent results, that should only differ in some whitespace and explicit charset declaration at most (everything else is a bug).
Here is an example on how one might use this component:
function serializeDocument(document) {
const {
classes: Cc,
interfaces: Ci,
utils: Cu
} = Components;
let encoder = Cc['@mozilla.org/layout/documentEncoder;1?type=text/html'].createInstance(Ci.nsIDocumentEncoder);
encoder.init(document, 'text/html', Ci.nsIDocumentEncoder.OutputLFLineBreak | Ci.nsIDocumentEncoder.OutputRaw);
encoder.setCharset("utf-8");
return encoder.encodeToString();
}
If you're writing an SDK add-on, stuff gets more complicated as the SDK abstracts some important stuff away. You'll need to go through the chrome module, and also figure out the active window and tab yourself. Something like Services.wm.getMostRecentWindow("navigator:browser").content.document (Services.jsm) should do the trick.
In XUL overlay add-ons, content.document should suffice to get the document of the currently active tab, and you have Components access already.
Still, you need to let the user choose a file destination, usually through nsIFilePicker and then actually write the file, by using something like a file stream or the fully async OS.File API.

Looks like I get to answer my own question, thanks to someone in mozilla #extdev IRC.
I got totally faked out by "view source". When I didn't see my modifications in the window displayed by "view source", I assumed the browser would not provide the information.
However, guess what? When I "file" ===>> "save page as...", then examine the page contents with a plain text editor... sure enough, that contained the modifications made by my firefox extension! Surprise!

A browser has no direct write access to the local filesystem. The only read access it has is when you explicitly provide a file:// URL (see note 1 below).
In your case, we are explicitly talking about javascript - which can read and write cookies and local storage. It can also send stuff back to the server and retrieve it, e.g. using AJAX.
Stuff you put in local storage/cookies is effectively not accessible to other programs (such as email clients).
It is possible to create very long mailto: URLs (see note 2), but this only handles inline content in the email, and you're going to run into all sorts of encoding issues that you're not ready to deal with.
Hence I'd recommend pursuing storage serverside via AJAX - and look at local storage once you've got this sorted/working.
Note 1: This is not strictly true. A trusted, signed script has access to additional functions, which may include direct file access.
Note 2: The limit depends on the browser and the email client (Lotus Notes truncates the content rather a lot).

How can I customise all assignments to window.location.href?

In my JavaScript application, we have multiple places where we have used window.location.href="any string";. Now I want to write JS code in only one place (probably using window.location.prototype) to override assignments to href, so that I can append a parameter to all instances.
I want to append a parameter (e.g. "?abc=1234") to all urls which are assigned to window.location.href.
I want to write code that means when e.g.
window.location.href = "abc.html";
is written, it should actually result in the href being set to abc.html?abc=1234.

window.location.href = window.location.href + "?abc=1234"
I just tested this in WebKit DevTools.

You can't actually do this.
Navigation is handled by the JavaScript engine that runs the page. The only record most browsers keep of your browsing is the history, and hardly anything else. So to the browser there is basically no difference between a meta redirect, a header redirect and a JavaScript redirect.
Unless I'm wrong.
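If the assignments themselves can't be hooked, one workaround is to route every navigation through a single helper and append the parameter there. A minimal sketch, assuming you are able to replace the direct assignments with a function call; appendParam and navigate are made-up names, and the parameter abc=1234 is taken from the question.

```javascript
// Hypothetical helper: returns the absolute URL with one query parameter set.
// base is needed to resolve relative URLs such as "abc.html".
function appendParam(url, name, value, base) {
  const parsed = new URL(url, base);
  // set() adds the parameter, or overwrites it if already present
  parsed.searchParams.set(name, value);
  return parsed.href;
}

// Hypothetical single entry point for all navigation (browser only).
function navigate(url) {
  window.location.href = appendParam(url, 'abc', '1234', window.location.href);
}

// Usage, instead of window.location.href = "abc.html":
// navigate('abc.html');
```

Using the URL API rather than string concatenation keeps existing query strings and fragments intact.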

Get objects from javascript using URL without loading the document

I have a URL which links to an HTML document, and I want to get objects from the document without loading the URL in my browser. For instance, I have a URL named:
http://www.example.com/,
how can I get one object (e.g., by getElementsByTagName) from this document?

You can't. You can omit, at best, extraneous files being linked to from within the document like javascript or css, but you can't just grab one part of the document.
Once you have the document, you can grab out of it a section, but you can't just grab a section without getting the whole thing first.
It's the equivalent of saying that you want the 2nd paragraph of an essay. Without the essay, you don't know what the 2nd paragraph is, where it starts or ends.

Is this document in the same domain as the one your JavaScript is running in, or in a different domain?
If it's in the same domain, you have a couple options to explore.
You could load the page using an XMLHttpRequest, or JQuery.get, and parse the data you're looking for out of the HTML with an ugly regular expression.
Or, if you're feeling really clever, you can load the target document into a jsdom object, jQuerify it, and then use the resulting jQuery object to access the data you're looking for with a simple selector.
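As an alternative to the regular expression, the browser's own parser can be used. This is a rough sketch that swaps in fetch and DOMParser, and it assumes a same-origin URL and a browser environment; the function name getElementsFromUrl is made up.

```javascript
// Hypothetical helper: downloads a page and returns the elements
// with the given tag name, without rendering the page.
function getElementsFromUrl(url, tagName) {
  return fetch(url)
    .then((response) => response.text())
    .then((html) => {
      // Parse the markup into a detached document
      const doc = new DOMParser().parseFromString(html, 'text/html');
      // Copy the live HTMLCollection into a plain array
      return Array.from(doc.getElementsByTagName(tagName));
    });
}

// Browser usage:
// getElementsFromUrl('/some/page.html', 'h1').then((headings) => console.log(headings.length));
```

The whole document is still downloaded; only the rendering is skipped.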

If the url is on the same domain you can use .load() for example:
$("some_element").load("url element_to_get")
See my example - http://jsfiddle.net/ajthomascouk/4BtLv/
On this example it gets the H1 from this page - http://jsfiddle.net/ajthomascouk/xJdFe
It's hard to show using jsfiddle, but I hope you get the gist of it.
Read more about .load() here - http://api.jquery.com/load/

Using Ajax calls, I guess.
This is long to explain if you have never used XHR, so here's a link: https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest/Using_XMLHttpRequest
Another option is to construct an iframe using
var iframe = document.createElement('iframe');
iframe.src = 'http://...';
