I have an URL which links to a HTML docment, and i want to get objects of the document without load the URL in my browser. for instance, i have an URL named:
http://www.example.com/,
how can i get one object (i.e., by getElementsbyTagName) of this document?
You can't. You can omit, at best, extraneous files being linked to from within the document like javascript or css, but you can't just grab one part of the document.
Once you have the document, you can grab out of it a section, but you can't just grab a section without getting the whole thing first.
It's the equivalent of saying that you want the 2nd paragraph of an essay. Without the essay, you don't know what the 2nd paragraph is, where it starts or ends.
Is this document in the same domain, or a different domain as the security domain your javascript is running in.
If it's in the same domain, you have a couple options to explore.
You could load the page using an XMLHttpRequest, or JQuery.get, and parse the data you're looking for out of the HTML with an ugly regular expression.
Or, if you're feeling really clever, you can load the target document into a jsdom object, jQuerify it, and then use the resulting jquery object to access the date you're looking for with a simple selector.
If the url is on the same domain you can use .load() for example:
$("some_element").load("url element_to_get")
See my example - http://jsfiddle.net/ajthomascouk/4BtLv/
On this example it gets the H1 from this page - http://jsfiddle.net/ajthomascouk/xJdFe
Its hard to show using jsfiddle, but I hope you get the gist of it?
Read more about .load() here - http://api.jquery.com/load/
Using Ajax calls, I guess.
This is long to explain if you have never used XHR, so here's a link: https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest/Using_XMLHttpRequest
Another option is to construct an iframe using
var iframe = document.create('iframe');
iframe.src = 'http://...';
Related
I am creating a firefox extension that lets the operator perform various actions that modify the content of the HTML document. The operator does not edit HTML, they take other actions and my extension modifies the document by inserting elements, adding attributes, and so forth.
When the operator is finished, they need to be able to save the HTML document as a file (or have my extension send it to an internet destination, but this is not required since they can email the saved file).
I thought maybe the changes made by the javascript code in my extension would be reflected in the HTML document, but when I ask the firefox browser to "view source" after making modifications, it displays the original HTML text.
My questions are:
#1: What is the easiest way for the operator to save the HTML document with all the changes my extension has made?
#2: What is the easiest way for the javascript code in my extension to process the HTML document contents and write to an HTML file on the local disk?
#3: Is any valid HTML content incapable of accurate representation in the saved file?
#4: Is the TreeWalker part of the solution (see below)?
A couple observations from my research so far:
I've read about the TreeWalker object, which seems to provide a fairly painless way for an extension to walk through everything (?or almost everything?) in the HTML document. But does it expose everything so everything in the original (and my modifications) can be saved without losing anything of importance?
Does the TreeWalker walk through the HTML document in the "correct order" --- the order necessary for my extension to generate the original and/or modified HTML document?
Anything obscure or tricky about these problems?
Ok so I am assuming here you have access to page DOM. What you need to do it basically make changes to the dom and then get all the dom code and save it as a file. Here is how you can download the page's html code. This will create an a tag which the user needs to click for the file to download.
var a = document.createElement('a'), code = document.querySelectorAll('html')[0].innerHTML;
a.setAttribute('download', 'filename.html');
a.setAttribute('href', 'data:text/html,' + code);
Now you can insert this a tag anywhere in the DOM and the file will download when the user clicks it.
Note: This is sort of a hack, this injects entire html of the file in the a tag, it should in theory work in any up to date browser (except, surprise, IE). There are more stable and less hacky ways of doing it like storing it in a file system API file and then downloading that file instead.
Edit: The document.querySelectorAll line accesses the page DOM. For it to work the document must be accessible. You say you are modifying DOM so that should already be there. Make sure you are adding the code on the page and not your extension code. This code will be at the same place as your DOM modification code, not your extension pages that can't access the DOM.
And as for the a tag, it will be inserted in the page. I skipped the steps since I assumed you already know how to manipulate DOM and also because I don't know where you would like to add the link. And you can skip the user action of clicking the link too, but it's a hack and only works in modern browsers. You can insert the a tag somewhere in the original page where user won't see it and then call the a.click() function to simulate a click event on the link. But this is not a legit way and I personally only use it on my practice projects to call click event listeners.
I can only test this on chrome not on FF but try this code, this will not require you to even add the a link to DOM. You need to add this next to the DOM manipulation code. This will work if luck is on your side :)
var a = document.createElement('a'), code = document.querySelectorAll('html')[0].innerHTML;
a.setAttribute('download', 'filename.html');
a.setAttribute('href', 'data:text/html,' + code);
a.click();
There is no easy way to do this with the web API only, at least when you want a result that does not omit stuff like the doctype or comments. You could still write a serializer yourself that goes through document.childNodes and serialized according to the node type (Element.outerHTML, Comment.data and so on).
Luckily, you're writing a Firefox add-on, so you have access to a lot more (powerful) stuff.
While still not 100% perfect, the nsIDocumentEncoder implementations will produce pretty decent results, that should only differ in some whitespace and explicit charset declaration at most (everything else is a bug).
Here is an example on how one might use this component:
function serializeDocument(document) {
const {
classes: Cc,
interfaces: Ci,
utils: Cu
} = Components;
let encoder = Cc['#mozilla.org/layout/documentEncoder;1?type=text/html'].createInstance(Ci.nsIDocumentEncoder);
encoder.init(document, 'text/html', Ci.nsIDocumentEncoder.OutputLFLineBreak | Ci.nsIDocumentEncoder.OutputRaw);
encoder.setCharset("utf-8");
return encoder.encodeToString();
}
If you're writing an SDK add-on, stuff gets more complicated as the SDK abstracts some important stuff away. You'll need to go through the chrome module, and also figure out the active window and tab yourself. Something like Services.wm.getMostRecentWindow("navigator:browser").content.document (Services.jsm) should do the trick.
In XUL overlay add-ons, content.document should suffice to get the document of the currently active tab, and you have Components access already.
Still, you need to let the user choose a file destination, usually through nsIFilePicker and then actually write the file, by using something like a file stream or the fully async OS.File API.
Looks like I get to answer my own question, thanks to someone in mozilla #extdev IRC.
I got totally faked out by "view source". When I didn't see my modifications in the window displayed by "view source", I assumed the browser would not provide the information.
However, guess what? When I "file" ===>> "save page as...", then examine the page contents with a plain text editor... sure enough, that contained the modifications made by my firefox extension! Surprise!
A browser has no direct write access to the local filesystem. The only read access it has is when explicitly provide a file:// URL (see note 1 below)
In your case, we are explicitly talking about javascript - which can read and write cookies and local storage. It can also send stuff back to the server and retrieve it, e.g. using AJAX.
Stuff you put in local storage/cookies is effectively not accessible to other programs (such as email clients).
It is possible to create very long mailto: URLs (see note 2) but only handles inline content in the email and you're going to run into all sorts of encoding issues that you're not ready to deal with.
Hence I'd recommend pursuing storage serverside via AJAX - and look at local storage once you've got this sorted/working.
Note 1: this is not strictly true. a trusted, signed javascript has access to additional functions which may include direct file access.
Note 2: (the limit depends on the browser and the email client - Lotus Notes truncaets the content rather a lot)
This is what I've tried:
function createDocumentz() {
var doc = document.implementation.createHTMLDocument('http://www.moviemeter.nl/film/270',null,'html');
return doc;
}
Even though a document gets created, if I run this with Firebug it says that the body node has no childnodes, any idea why?
Looks like you assume that you can use createHTMLDocument() to download and parse a HTML file from the URL you've passed as the first parameter. That is not the case, createHTMLDocument() always creates an empty document.
Also, the parameters you've passed to the function are those of createDocument(). createHTMLDocument() takes only one parameter, the document title. But even if you'd use createDocument(), the first parameter is the URI of the namespace, not the source document.
Unfortunately there's no way to download and manipulate external web site's HTML using JavaScript alone. The closest you can get is displaying the document in an iframe.
No, you cannot get the content from another website, this way.
If it did, it would have lead to cross site scripting.
All you would get is an empty document, due to the browser's policy, which of course has an empty body.
You can use an Iframe & set the source to the same...
Some background:
I'm developing a web based mobile application using JavaScript. HTML rendering is Safari based. Cross domain policy is disabled, so I can make calls to other domains using XmlHttpRequests. The idea is to parse external HTML and get text content of specific element.
In the past I was parsing the text line by line, finding the line I need. Then get the content of the tag which is a substring of that line. This is very troublesome and requires a lot of maintenance each time the target html changes.
So now I want to parse the html text into DOM and run css or xpath queries on it.
It works well:
$('<div></div>').append(htmlBody).find('#theElementToFind').text()
The only problem is that when I use the browser to load html text into DOM element, it will try to load all external resources (images, js files, etc.). Although it isn't causing any serious problem, I would like to avoid that.
Now the question:
How can I parse html text to DOM without the browser loading external resources, or run js scripts ?
Some ideas I've been thinking about:
creating new document object using createDocument call (document.implementation.createDocument()), but I'm not sure it will skip the loading of external resources.
use third party DOM parser in JS - the only one I've tried was very bad with handling errors
use iframe to create new document, so that external resources with relative path will not throw an error in console
It seems that the following piece of code works great:
var doc = document.implementation.createHTMLDocument("");
doc.documentElement.innerHTML = htmlBody;
var text = $(doc).find('#theElementToFind').text();
external resources aren't loaded, scripts aren't being evaluated.
Found it here:
https://stackoverflow.com/a/9251106/95624
Origin:
https://developer.mozilla.org/en/DOMParser#DOMParser_HTML_extension_for_other_browsers
You can construct jQuery object of any html string, without appending it to the DOM:
$(htmlBody).find('#theElementToFind').text();
I read (somewhere else on this site) you can't reload (or inject javascript) onto a page that is already rendered.
Is there any other way of doing this. For instance an iFrame?
I have a recent comment widget.js and I need to constantly get it to reload without reloading the whole page.
Any ideas?
edit: The site has recent comments on it and they are displayed via a recentcomment.js
Once the page is loaded it doesn't update itself unless you reload the page. I want it to update itself, a way to do this is to just reload the js file on the page, correct?
Why do you need to do this? It seems to me that there's probably a more appropriate solution to your problem.
But to answer it:
var elm = document.createElement("script");
elm.src = "Widget.js";
document.getElementsByTagName("head")[0].appendChild(elm);
Hope I didn't write any mistakes...
Rather than reloading the file, you can have all the implementation of the file contained in a function and then call the function every minute using the setTimeout() function.
Alternatively, if you want to reload it because the content of the file might have changed, it would probably be better to move that part of the code out to some external file and then use a function (running every minute with setTimeout()) to load the new content you need.
You can make cross-domain AJAX requests using certain methods, so you could make an AJAX request for the script file and parse it yourself. Parsing it yourself probably isn't an optimal solution, but it looks like you're dealing with a brain-dead service provider anyways.
Look at this guy's jQuery mod for an example:
http://james.padolsey.com/javascript/cross-domain-requests-with-jquery/
Once you get the data from the 3rd party, you could probably use some combination of regex and JSON parser to extract the comments.
So I need to pull some JavaScript out of a remote page that has (worthless) HTML combined with (useful) JavaScript. The page, call it, http://remote.com/data.html, looks something like this (crazy I know):
<html>
<body>
<img src="/images/a.gif" />
<div>blah blah blah</div><br/><br/>
var data = { date: "2009-03-15", data: "Some Data Here" };
</body>
</html>
so, I need to load this data variable in my local page and use it.
I'd prefer to do so with completely client-side code. I figured, if I could get the HTML of this page into a local JavaScript variable, I could parse out the JavaScript code, run eval on it and be good to use the data. So I thought load the remote page in an iframe, but I can't seem to find the iframe in the DOM. Why not?:
<script>
alert(window.parent.frames.length);
alert(document.getElementById('my_frame'));
</script>
<iframe name="my_frame" id='my_frame' style='height:1px; width:1px;' frameBorder=0 src='http://remote.com/data.html'></iframe>
The first alert shows 0, the second null, which makes no sense. How can I get around this problem?
Have you tried switching the order - i.e. iframe first, script next? The script runs before the iframe is inserted into the DOM.
Also, this worked for me in a similar situation: give the iframe an onload handler:
<iframe src="http://example.com/blah" onload="do_some_stuff_with_the_iframe()"></iframe>
Last but not least, pay attention to the cross-site scripting issues - the iframe may be loaded, but your JS may not be allowed to access it.
One option is to use XMLHttpRequest to retrieve the page, although it is apparently only currently being implemented for cross-site requests.
I understand that you might want to make a tool that used the client's internet connection to retrieve the html page (for security or legal reasons), so it is a legitimate hope.
If you do end up needing to do it server-side, then perhaps a simple php page that takes a url as a query and returns a json chunk containing the script in a string. That way if you do find you need to filter out certain websites, you need only do this in one place.
The inevitable problem is that some of the users will be hostile, and they then have a license to abuse what is effectively a javascript proxy. As a result, the safest option may be to do all the processing on the server, and not allow certain javascript function calls (eval, http requests, etc).