Parsing HTML Response in jQuery - javascript

My page needs to grab a specific div from another page to display within a div on the current page. It's a perfect job for $.load, except that the HTML source I'm pulling from is not necessarily well-formed, and seems to have occasional tag errors, and IE just plain won't use it in that case. So I've changed it to a $.get to grab the HTML source of the page as a string. Passing it to $ to parse it as a DOM has the same problem in IE as $.load, so I can't do that. I need a way to parse the HTML string to find the contents of my div#information, but not the rest of the page after its </div>. (PS My div#information contains various other div's and elements.)
EDIT: Also if anyone has a fix for jQuery's $.load not being able to parse response HTML in IE, I'd love to hear that too.

If the resource you are trying to load is under your control, your implementation spec is poorly optimized. You don't want to ask your server for an entire page of content when you only really need a small piece of that content.
What you'll want to do is isolate the content you want, and have the server return only what you need.
As a side note, since you are aware that you have malformed HTML, you should probably bite the bullet and validate your markup. That will save you some trouble (like this) in the future.
Finally, if you truly cannot optimize this process, my guess is that the inconsistency arises because some element in the parsed HTML has the same ID as an element on your current page. Duplicate IDs are invalid and lead to many cross-browser JavaScript problems.

Working strictly with strings, you could use a regular expression to find the contents of the id="information" tag, so the response is never parsed as HTML.
I'd also try the form of .load() that accepts a selector after the URL:
$('#results').load('/MySite/Url #information');
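Because div#information contains other divs, a single regular expression has trouble finding the right closing tag. A minimal sketch of a string-only approach is to find the opening tag and then count nested div tags until the depth returns to zero. The helper name extractDivContents is hypothetical, and the sketch assumes the id attribute is quoted, the id contains no regex metacharacters, and the div tags inside the target are balanced.

```javascript
// Hypothetical helper: returns the inner HTML of <div id="..."> by counting
// nested <div> and </div> tags, without ever building a DOM.
function extractDivContents(html, id) {
  // Find the opening tag of the target div (assumes a quoted id attribute).
  var open = new RegExp('<div\\b[^>]*\\sid\\s*=\\s*(["\'])' + id + '\\1[^>]*>', 'i');
  var start = open.exec(html);
  if (!start) return null;

  var contentStart = start.index + start[0].length;
  var tag = /<(\/?)div\b[^>]*>/gi; // any opening or closing div tag
  tag.lastIndex = contentStart;    // begin scanning after the opening tag
  var depth = 1;
  var m;
  while ((m = tag.exec(html)) !== null) {
    depth += m[1] ? -1 : 1;        // closing tag lowers depth, opening raises it
    if (depth === 0) return html.slice(contentStart, m.index);
  }
  return null; // the div tags were not balanced
}
```

With $.get, the callback could pass the response string to this function and insert the result with .html(). If the malformed markup affects the div tags themselves, the depth count will be off, so treat this as a fallback rather than a robust parser.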

Related

How to load and parse HTML without modifying its contents

There are many ways to parse and traverse HTML4 files using many technologies. But I can not find a suitable one to save that DOM to file again.
I want to be able to load an HTML file into a DOM, change one small thing (e.g. an attribute's value), save the DOM to file again and when diffing the source file and the created file, I want them to be completely identical, except that small change.
This kind of task is absolutely no problem when working with XML and suitable XML libraries, but when it comes to HTML there are several issues: Whitespace such as indentations or linebreaks get lost or are inserted, self-closing start tags (such as <link...>) emerge as <link.../> and/or the content of CDATA sections (e.g. between <script> and </script>) is wrapped into <![CDATA[ ]]>. These things are critical in my case.
Which way can I go to load, traverse, manipulate and save HTML without the drawbacks described above, most importantly without whitespace text nodes to be altered?
comparison
If you want to get really serious leave out the GUI and go headless, SO example with Phantom
I am going with the HTML Agility Pack.
Loading and saving does not manipulate anything other than invalid parts.

When using CasperJS, is it possible to interact with the DOM of a loaded page before any inline or external Javascript is executed?

The situation I have is that I'm opening a page using CasperJS.
The page in question has some Javascript (a combination of both inline and external) that removes several HTML elements from the document.
However, I want to be able to retrieve those elements using something like getElementsByXPath() within CasperJS before they are removed. Is this possible?
When I dump out the value of getPageContent(), the elements are not in there. However, if I set casper.page.settings.javascriptEnabled = false; before calling the page, getPageContent() now shows the raw HTML before any Javascript is executed, and the missing HTML tags are there. The problem now, though, is that disabling Javascript prevents any usage of evaluate(), so I still can't retrieve the elements. I could probably do it using a regex of some sort on the raw content, but I was hoping there could be a cleaner method of doing it.
Any suggestions welcome!
I've never heard of anyone doing this, but I wouldn't say using a regex is a bad idea. I usually scrape with a combination of CasperJS XPath and Python regex. It works extremely well, and I personally don't think it's any messier than trying to intercept JavaScript before the page is loaded.
That being said, CasperJS allows you to inject JavaScript, and you could use jQuery in it if it's available on the page you're requesting. The code below fires before anything is loaded. You actually have to go out of your way to add code to prevent this from firing before the page loads.
<script type='text/javascript'>
alert("Stop that parsing!");
</script>

jQuery HTML parser is removing some tags without a warning, why and how to prevent it?

Here is the thing,
I have a textarea (with ID "input_container") full of HTML code, the simple example is:
<!doctype html>
<html>
<head></head>
<body>
the other place
</body>
</html>
I parsed it using jQuery, here is my code:
I put the whole HTML string into a variable named domString like this:
domString = $('#input_container').val();
To get parsed HTML for everything inside the variable domString, I had to wrap it in another tag, so I did:
dom = "<allhtml>" + domString + "</allhtml>";
And got everything inside a jQuery selector to be parsed:
dDom = $(dom);
After that I checked what's in dDom, so I did
alert(dDom.html());
That should give me anything inside the tags, right?
But unfortunately, all I get is:
the other place
And all the other tags are mysteriously gone. Can anyone explain this phenomenon and tell me how to really parse all the DOM?
Thank you
From the jQuery documentation:
When passing in complex HTML, some browsers may not generate a DOM
that exactly replicates the HTML source provided. As mentioned, we use
the browser's .innerHTML property to parse the passed HTML and insert
it into the current document. During this process, some browsers
filter out certain elements such as <html>, <title>, or <head>
elements. As a result, the elements inserted may not be representative
of the original string passed.
This should work instead:
$('<html />').append($('<head />')).append($('<body />').text('the other place'));
This is kind of a weird thing to do, though. You might want to think about other ways to do what you're trying to accomplish; I worry that you might be suffering from the XY Problem.
I suspect that you're using jQuery load or an AJAX call.
This will attempt to load the document into your current DOM. It will get the contents of the HEAD and BODY tags via innerHTML, but not the tags themselves (including the HTML tag, naturally).
From the jQuery Load documentation
jQuery uses the browser's .innerHTML property to parse the retrieved
document and insert it into the current document. During this process,
browsers often filter elements from the document such as <html>,
<title>, or <head> elements. As a result, the elements retrieved by
.load() may not be exactly the same as if the document were retrieved
directly by the browser.
EDIT:
If you are trying to get the full HTML for your page, the same thing applies. It's going to use the browser's innerHTML property, which will behave as described above. The HTML source doesn't really exist once the DOM is loaded, so going in the opposite direction won't necessarily be 100% faithful.
When you load that HTML into the DOM, it'll ignore the tags as they don't actually get loaded at all. Then when you're retrieving, all that's left is the link (as well as whatever you have in the HEAD, but you don't have anything in there...).

Disable JavaScript in iframe/div

I am making a small HTML page editor. The editor loads a file into an iframe. From there, it could add, modify, or delete the elements on the page with new attributes, styles, etc. The problem with this, is that JavaScript (and/or other programming languages) can completely modify the page when it loads, before you start editing the elements. So when you save, it won't save the original markup, but the modified page + your changes.
So, I need some way to disable the JavaScript in the iframe, or somehow remove all the JavaScript before it starts modifying the page. (I figure I'll end up having to parse the file for PHP, but that shouldn't be too hard.) I considered writing a script to loop through all the elements, removing any <script> tags and any onclick, onfocus, onmouseover, etc. attributes. But that would be a real pain.
Does anyone know of an easier way to get rid of JavaScript from running inside an iframe?
UPDATE: unless I've missed something, I believe there is no way to simply 'disable JavaScript.' Please correct me if I'm wrong. But, I guess the only way to do it would be to parse out any script tags and JavaScript events (click, mouseover, etc) from a requested page string.
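The attribute half of the approach described in the update (removing onclick, onmouseover and similar handlers from the requested page string) could be sketched as below. The name stripInlineHandlers is hypothetical, and the sketch assumes no attribute value contains a literal ">" character. It is a string-level filter for an editor preview, not a security sanitizer.

```javascript
// Hypothetical helper: removes inline on* handler attributes from every tag.
// Assumes attribute values do not contain a literal ">" character.
function stripInlineHandlers(html) {
  // Visit each tag, so that text outside of tags is left untouched.
  return html.replace(/<[^>]+>/g, function (tag) {
    // Matches onclick="...", onfocus='...' or an unquoted onload=foo.
    return tag.replace(/\s+on\w+\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)/gi, '');
  });
}
```

Script elements would still need to be handled separately, since this only touches attributes.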
HTML5 introduces the sandbox attribute on the iframe that, if empty, disables JavaScript loading and execution.
Yes, your update is correct. You must assume that any input you receive from the user may contain malicious elements. Thus, you must validate on the server before accepting their input.
You can try the same solution adopted by CKEditor, of which you have a demo here.
By switching from RTE mode to view source mode, you can enter some JavaScript and see the result, which is a replacement of the JS node in a safely encoded string.
If you are in view source mode, by entering some JS line like:
<script type="text/javascript">
// comment
alert('Ciao');
</script>
you will see it rendered this way when going back to rich text editor mode:
<!--{cke_protected}%3Cscript%20type%3D%22text%2Fjavascript%22%3E%0D%0A%2F%2F%20comment%0D%0Aalert('Ciao')%3B%0D%0A%3C%2Fscript%3E-->
I think it is one of the easiest and most effective ways, since the RegExp to parse JS nodes is not complex.
An example:
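// [\s\S] also matches line breaks, so multi-line scripts are found; "g" collects every occurrence
var pattern = /<script(\s+(\w+\s*=\s*("|').*?\3)\s*)*\s*(\/>|>[\s\S]*?<\/script\s*>)/gi;
var match = HTMLString.match(pattern); // array of all script nodes found, or null if there are none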
(Of course, to replace the script node you should use the replace() method).
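As a rough sketch of that replace() step, the function below turns each script node into an encoded comment in the same shape as the cke_protected output shown above. This is not CKEditor's actual implementation. The name protectScripts and the simplified pattern are assumptions made for illustration.

```javascript
// Sketch only: imitates the comment format shown above, not CKEditor's real code.
function protectScripts(html) {
  // [\s\S] matches across line breaks; "g" replaces every script node, "i" ignores case.
  var scriptPattern = /<script\b[^>]*>[\s\S]*?<\/script\s*>/gi;
  return html.replace(scriptPattern, function (node) {
    // encodeURIComponent escapes <, >, quotes and line breaks, so nothing executes.
    return '<!--{cke_protected}' + encodeURIComponent(node) + '-->';
  });
}
```

The original script can be recovered later by passing the comment body to decodeURIComponent, which is what makes this approach attractive for an editor that has to save the untouched markup.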
Regards.
You can set a Content Security Policy of script-src 'self' in a response header. CSP will then block any scripts that do not come from your own website. This way you can prevent scripts from the iframe from changing the elements in your page.
You can get more information from here http://www.html5rocks.com/en/tutorials/security/content-security-policy/

Can you insert a form onto a page using jQuery's $.load() function?

I have a page where there's a drag and drop table where the order of the rows determines the value of a subtotal. However, it's more complicated than just addition and I would rather not duplicate the logic in JavaScript to update the values.
A simple solution would be to reload the whole page using Ajax and then replace the table from the page fetched via Ajax. Perhaps it's not the most elegant solution but I thought it'd be a quick way to get the job done that would be acceptable for now.
You can do that with jQuery like this:
$('#element-around-table').load(document.location.href + ' #table-id');
However, my "simple" solution turned out to not be so simple because the table also contains a <form> tag which is not being displayed in Firefox (Safari works).
When I inspect the page using Firebug, I see the form, but it and its elements are grayed out.
Searching on the web, I found a rather confused post by a guy who says FF3 and IE strip <form> tags from innerHTML calls.
I'm probably going to move on to do this some other way, but for my future reference, I'd like to know: is this the case?
That post is rather confused. I just tested your code and it worked fine. The form tag was shown in Firefox 3.0.8 just fine.
Looking at your code example, though, I wonder if you just gave an incomplete example... make sure that the page you call returns only the HTML that goes inside that wrapper element.
I've run into this type of thing before. FORM tags need to be added to the DOM. If they're added using a method that writes to innerHTML, the tag will appear, but it won't be there as far as JavaScript is concerned.
