DOM parsing in JavaScript

DOM parsing in JavaScript - javascript

Some background:
I'm developing a web based mobile application using JavaScript. HTML rendering is Safari based. Cross domain policy is disabled, so I can make calls to other domains using XmlHttpRequests. The idea is to parse external HTML and get text content of specific element.
In the past I was parsing the text line by line, finding the line I need. Then get the content of the tag which is a substring of that line. This is very troublesome and requires a lot of maintenance each time the target html changes.
So now I want to parse the html text into DOM and run css or xpath queries on it.
It works well:
$('<div></div>').append(htmlBody).find('#theElementToFind').text()
The only problem is that when I use the browser to load html text into DOM element, it will try to load all external resources (images, js files, etc.). Although it isn't causing any serious problem, I would like to avoid that.
Now the question:
How can I parse html text to DOM without the browser loading external resources, or run js scripts ?
Some ideas I've been thinking about:
creating new document object using createDocument call (document.implementation.createDocument()), but I'm not sure it will skip the loading of external resources.
use third party DOM parser in JS - the only one I've tried was very bad with handling errors
use iframe to create new document, so that external resources with relative path will not throw an error in console

It seems that the following piece of code works great:
var doc = document.implementation.createHTMLDocument("");
doc.documentElement.innerHTML = htmlBody;
var text = $(doc).find('#theElementToFind').text();
external resources aren't loaded, scripts aren't being evaluated.
Found it here:
https://stackoverflow.com/a/9251106/95624
Origin:
https://developer.mozilla.org/en/DOMParser#DOMParser_HTML_extension_for_other_browsers

You can construct jQuery object of any html string, without appending it to the DOM:
$(htmlBody).find('#theElementToFind').text();

Related

Obtain content of <script src="..."> tag in HTML

I have an HTML file with Javascript, that is supposed to load and process an XML file. The main obstacle is that the script may also be run locally, without an HTTP server, and also has to support Internet Explorer (11).
In normal case, I would use XMLHttpRequest, but as far as I know, it cannot be used with locally stored files (or at least doesn't work in my test cases for Chrome and IE).
I tried using <script> blocks with set src and type="text/xml" attributes and it successfully loads the content of the xml "somewhere" (the content is loaded and is visible in the network trace), but I cannot find a way, to extract the content of the xml from the <script> node.
Most sources (e.g. Getting content of <script> tag) suggests using XHR, but AFAIK it cannot be done in this case.
Is there a sensible option to implement this, without minimal http server?
I am looking for a clean solution without jQuery.

collect all the js css and img resources used in a html file

I want to write a npm package to localize an html url.
1. using the html url download the html page
2. parse the html file, extract all the js, css and img files used in the html and local these resources.
3. If these js, css and img files using some external resources, localize these resources. For example, extract background image in the css.
The first and second requirements are easy to meet. But I have no idea about the last one.
I can parse the all the css files and localize the resources used in it. But how can I parse the js files?
For example:
If the js adds a 'script src = XXX' tag into the html dom, how can I extract the src?

I think I would try to use a headless browser to catch every network calls instead of trying to parse the code.
I didn't used it personally but PhantomJS seems to fit the bill.
It can be used to load a webpage then execute any script / css that would normally happen on the request and execute stuff once the page is loaded.
The network monitoring features are probably what you'll want to use.

Best practices managing JavaScript on a single-page app

With a single page app, where I change the hash and load and change only the content of the page, I'm trying to decide on how to manage the JavaScript that each "page" might need.
I've already got a History module monitoring the location hash which could look like domain.com/#/company/about, and a Page class that will use XHR to get the content and insert it into the content area.
function onHashChange(hash) {
var skipCache = false;
if(hash in noCacheList) {
skipCache = true;
}
new Page(hash, skipCache).insert();
}
// Page.js
var _pageCache = {};
function Page(url, skipCache) {
if(!skipCache && (url in _pageCache)) {
return _pageCache[url];
}
this.url = url;
this.load();
}
The cache should let pages that have already been loaded skip the XHR. I also am storing the content into a documentFragment, and then pulling the current content out of the document when I insert the new Page, so I the browser will only have to build the DOM for the fragment once.
Skipping the cache could be desired if the page has time sensitive data.
Here's what I need help deciding on: It's very likely that any of the pages that get loaded will have some of their own JavaScript to control the page. Like if the page will use Tabs, needs a slide show, has some sort of animation, has an ajax form, or what-have-you.
What exactly is the best way to go around loading that JavaScript into the page? Include the script tags in the documentFragment I get back from the XHR? What if I need to skip the cache, and re-download the fragment. I feel the exact same JavaScript being called a second time might cause conflicts, like redeclaring the same variables.
Would the better way be to attach the scripts to the head when grabbing the new Page? That would require the original page know all the assets that every other page might need.
And besides knowing the best way to include everything, won't I need to worry about memory management, and possible leaks of loading so many different JavaScript bits into a single page instance?

If I understand the case correctly, you are trying to take a site that currently has pages already made for normal navigation, and you want to pull them down via ajax, to save yourself the page-reload?
Then, when this happens, you need to not reload the script tags for those pages, unless they're not loaded onto the page already?
If that is the case, you could try to grab all the tags from the page before inserting the new html into the dom:
//first set up a cache of urls you already have loaded.
var loadedScripts = [];
//after user has triggered the ajax call, and you've received the text-response
function clearLoadedScripts(response){
var womb = document.createElement('div');
womb.innerHTML = response;
var scripts = womb.getElementsByTagName('script');
var script, i = scripts.length;
while (i--) {
script = scripts[i];
if (loadedScripts.indexOf(script.src) !== -1) {
script.parentNode.removeChild(script);
}
else {
loadedScripts.push(script.src);
}
}
//then do whatever you want with the contents.. something like:
document.body.innerHTML = womb.getElementsByTagName('body')[0].innerHTML);
}

Oh boy are you in luck. I just did all of this research for my own project.
1: The hash event / manager you should be using is Ben Alman's BBQ:
http://benalman.com/projects/jquery-bbq-plugin/
2: To make search engines love you, you need to follow this very clear set of rules:
http://code.google.com/web/ajaxcrawling/docs/specification.html
I found this late and the game and had to scrap a lot of my code. It sounds like you're going to have to scrap some too, but you'll get a lot more out of it as a consequence.
Good luck!

I have never built such a site so I don't know if that is nbest practice, but I would put some sort of control information (like a comment or a HTTP header) in the response, and let the loader script handle redundancy/dependency cheching and adding the script tags to the header.

Do you have control over those pages being loaded? If not, I would recommend inserting the loaded page in an IFrame.
Taking the page scripts out of their context and inserting them in the head or adding them to another HTML element may cause problems unless you know exactly how the page is build.
If you have full control of the pages being loaded, I would recommend that you convert all your HTML to JS. It may sound strange but actually, a HTML->JS converter is not that far away. You could start of with Pure JavaScript HTML Parser and then let the parser output JS code, that builds the DOM using JQuery for example.
I was actually about to go down that road for a while ago on a webapp that I started working on, but now I handed it over to a contractor who converted all my pure JS pages into HTML+JQuery, whatever makes his daily work productive, I dont care, but I was really into that pure JS webapp approach and will definitely try it.

To me it sounds like you are creating a single-page app from the start (i.e. not re-factoring an existing site).
Several options I can think of:
Let the server control which script tags are included. pass a list of already-loaded script tags with the XHR request and have the server sort out which additional scripts need to be loaded.
Load all scripts before-hand (perhaps add them to the DOM after the page has loaded to save time) and then forget about it. For scripts that need to initialize UI, just have each requested page call include a script tag that calls a global init function with the page name.
Have each requested page call a JS function that deals with loading/caching scripts. This function would be accessible from the global scope and would look like this: require_scripts('page_1_init', 'form_code', 'login_code') Then just have the function keep a list of loaded scripts and only append DOM script tags for scripts that haven't been loaded yet.

You could use a script loader like YUI Loader, LAB.js or other like jaf
Jaf provides you with mechanism to load views (HTML snippets) and their respective js, css files to create single page apps. Check out the sample todo list app. Although its not complete, there's still a lot of useful libraries you can use.

Personally, I would transmit JSON instead of raw HTML:
{
"title": "About",
"requires": ["navigation", "maps"],
"content": "<div id=…"
}
This lets you send metadata, like an array of required scripts, along with the content. You'd then use a script loader, like one of the ones mentioned above, or your own, to check which ones are already loaded and pull down the ones that aren't (inserting them into the <head>) before rendering the page.
Instead of including scripts inline for page-specific logic, I'd use pre-determined classes, ids, and attributes on elements that need special handling. You can fire an "onrender" event or let each piece of logic register an on-render callback that your page loader will call after a page is rendered or loaded for the first time.

Any issues with using innerHTML to load up an iframe?

My app (under development) uses Safari 4.0.3 and JavaScript to present its front end to the user. The backend is PHP and SQLite. This is under OS X 10.5.8.
The app will from time to time receive chunks of HTML to present to the user. Each chunk is the body of an email received, and as such one has no control over the quality of the HTML received. What I do it use innerHTML to shove the chunk into an iFrame and let Safari render it.
To do that I do this:
window.frames["mainwindow"].window.frames["Frame1"].document.body.innerHTML = myvar;
where myvar contains the received HTML. Now, for the most part this works as desired, and the HTML is rendered as expected. The exception appears to be when the tag for the chunk looks like:
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" ...
and so on for more than 2800 chars. The effect is as if my JavaScript statement above had not been executed - I can see that using Safari's Error Console in the Develop menu to look into the iFrame. If I extract the HTML from the SQLite backend db and save it as a .html file, then Safari will render that with no trouble.
Any comments on why this might be happening, or on such use of innerHTML, or pointers to discussion of same, would be appreciated.

innerHTML is not the same as writing a complete document. Even if you write to outerHTML as suggested by Gumbo, there are things outside the root element that can confuse it, such as doctypes. To write a whole document at once, you have to use old-school cross-frame document.write:
var d= window.frames["mainwindow"].window.frames["Frame1"].document;
d.open();
d.write(htmldoc);
d.close();
Each chunk is the body of an email received, and as such one has no control over the quality of the HTML received.
OK, you may have a security problem then.
If you let an untrusted source like an e-mail inject HTML into your security context (and an iframe you're writing to is in your security context), it can run JavaScript of its own, including scripts that reach up and control your entire enclosing application and anything else on the same hostname. Unless your application is so trivial you don't care about that, this is really bad news.
If you need to allow untrusted HTML, the way many webmail services do it is to have it served on a different hostname (eg. a subdomain) that does not have access to any other part of the application. To do this, your iframe src would have to point to the different hostname; you couldn't script between the two security contexts.

Appending a block of javascript to the DOM fails in production - but not in dev

I have an ASP.NET website. Very huge!
The latest addition I've made, everything is JavaScript/AJAX enabled.
I send HTML and JavaScript code back from the server to the client, and the client will inject the HTML into a DIV - and inject the JavaScript into the DOM using this:
$('<script type="text/javascript">' + script + '</sc' + 'ript>').appendTo(document);
or this:
var js = document.createElement("script");
js.setAttribute("type", "text/javascript");
js.text = script;
document.appendChild(js);
On my own dev machine, the injected javascript is accessible and I'm able to execute injected JavaScript functions.
When I deploy to my test environment, )we have an internal domain name such as www.testenv.com) I get JavaScript errors.
I've tried to isolate the problem into a small page, where I inject alert("sfdfdf"); at the bottom of the page, and that works fine.
Is there any policy settings that prohibits this?

Don't appendChild to 'document'; <script> can't go there and you should get a HIERARCHY_REQUEST_ERR DOMException according to the DOM Level 1 Core spec. The only element that can go in the Document object's child list is the single root documentElement, <html>.
Instead append the script to an element inside the document, such as the <body>.

Dynamically creating elements should work fine, the script will be executed upon insertion into the DOM. To answer your question specifically, there are no direct policy settings that prohibit script injection, however, if you're using ajax calls within the dynamically inserted script you could run into Cross-Domain restrictions.
If you could post the error, and maybe the source of the 'script' element you're inserting it would help to debug the problem :)

Develop Reference

JavaScript is the programming language of the Web.

DOM parsing in JavaScript - javascript

You can construct jQuery object of any html string, without appending it to the DOM: $(htmlBody).find('#theElementToFind').text();

Related

Obtain content of <script src="..."> tag in HTML

collect all the js css and img resources used in a html file

Best practices managing JavaScript on a single-page app

Any issues with using innerHTML to load up an iframe?

Appending a block of javascript to the DOM fails in production - but not in dev

Categories

Resources