Diff between two HTML chunks: structural instead of lines/chars?

Diff between two HTML chunks: structural instead of lines/chars? - javascript

I'm looking for a JavaScript diff engine that will return the difference in the structure of two chunks of HTML. That is, instead of, "at this line, at such and such character, something happened", it'd be, "this element was inserted after this element", or "this element was removed", or "this text node was altered", etc.
Cursory research suggests this is hard.
The specific scenario is that I have a live preview of Markdown text editor. It works well with just text, but as soon as a user posts in, say, a YouTube <iframe> embed, then it renders/reloads on every keystroke, which is absurdly expensive. Large images are difficult, too, because they cause a nauseous jittering effect as they load from the cache (at least in WebKit).
What would be beautiful is a replacement for jQuery.html() that instead of just replacing the HTML contents actually compared the old with the new, and selectively updated/inserted/appended so that unchanged elements are left alone.

Deep clone (via node.cloneNode(true)) both nodes if they're currently in use (i.e. if any child nodes are referenced in your JavaScript).
Normalize both nodes via node.normalize().
Iterate over every child node of both nodes and compare with node.isEqualNode(other_node).
For every non-equal node, iterate deeper to see if you can find any equal child nodes.
To be honest, you're much better off using a text diff lib instead of making your own DOM-based diff lib.

Related

What kind of performance optimization is done when assigning a new value to the innerHTML attribute

I have a DOM element (let's call it #mywriting) which contains a bigger HTML subtree (a long sequence of paragraph elements). I have to update the content of #mywriting regularly (but only small things will change, the majority of the content remains unchanged).
I wonder what is the smartest way to do this. I see two options:
In my application code I find out which child elements of #mywriting has been changed and I only update the changed child elements.
I just update the innerHTML attribute of #mywriting with the new content.
Is it worth to develop the logic of approach one to find out the changed child nodes or will the browser perform this kind of optimization when I apply approach two?

No, the browser doesn't do such optimisation. When you reassign innerHTML, it will throw away the old contents, parse the HTML, and place the new elements in the DOM.
Doing a diff to only replace (or rather, update) the parts that need an update can be worth a lot, and is done with great success in rendering libraries that employ a so-called virtual DOM.
However, they're doing that diff on an element data structure, not an HTML string. Parsing that to find out which elements changed is going to be horribly inefficient. Don't use HTML strings. (If you're already sold on them, you might as well just use innerHTML).

Without concdering the overhead of calculating which child elements has to be updated option 1 seems to be much faster (at least in chrome), according to this simple benchmark:
https://jsbench.github.io/#6d174b84a69b037c059b6a234bb5bcd0

How does Firefox reader view operate

Summary
I am looking for the criteria by which I can create a webpage and be [fairly] sure it will appear in the Firefox Reader
View, if user desired.
Some sites have this option, some do not. Some with more text do not have this option than others with much less text. Stack Overflow for
instance displays only the question rather than any answers in Reader
View.
Question
I have had my Firefox upgraded from 38.0.1 to 38.0.5 and have found a new feature called ReaderView - which is a sort of overlay which removes "page clutter" and makes text easier to read.
Readerview is found in the right hand side of the address bar as a clickable icon on certain pages.
This is fine, but from the programming point of view I want to know how "reader view" works, which criteria of which pages it applies to. I have done some exploration of the Mozilla Firefox website with no clear answers (sod all programming answers of any sort I found), I have of course Googled / Binged this and this only came back with references to Firefox addons - this is not an addon but a staple part of the new Firefox version.
I made an assumption that readerview used HTML5 and would extract <article> contents but this is not the case as it works on Wikipedia which does not appear to use <article> or similar HTML5 tags, instead the readview extracts certain <div>s and displays them alone. This feature works on some HTML5 pages - such as wikipedia - but then not others.
If anyone has any ideas how Firefox ReaderView actually operates and how this operation can be used by website developers, can you share? Or if you can find where this information can be located, can you point me in the right direction - as I have not been able to find this.

You need at least one <p> tag around the text, that you want to see in Reader View, and at least 516 characters in 7 words inside the text.
for example this will trigger the ReaderView:
<body>
<p>
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789 123456
</p>
</body>
See my example at https://stackoverflow.com/a/30750212/1069083

Reading through the gitHub code, this morning, the process is that page elements are listed in a likelyhood order - with <section>,<p>,<div>,<article> at the top of the list (ie most likely).
Then each of these "nodes" is given a score based on things such as comma counts and class names that apply to the node. This is a somewhat multi-faceted process where scores are added for text chunks but also scores are seemingly reduced for invalid parts or syntax. Scores in sub-parts of "node" are reflected in the score of the node as a whole. ie the parent element contains the scores of all lower elements, I think.
This score value decides if the HTML page can be "page viewed" in Firefox.
I am not absolutely clear if the score value is set by Firefox or by the readability function.
Javascript is really not my strong point,and I think someone else should check over the link provided by Richard ( https://github.com/mozilla/readability ) and see if they can provide a more thorough answer.
What I did not see but expected to see was score based on amount of text content in a <p> or a <div> (or other) relevant tags.
Any improvements on this question or answer, please share!!
EDIT:
Images in <div> or <figure> tags (HTML5) within the <p> element appear to be retained in the Reader View when the page text content is valid.

I followed Martin's link to the Readability.js GitHub repository, and had a look at the source code. Here's what I make of it.
The algorithm works with paragraph tags. First of all, it tries to identify parts of the page which are definitely not content - like forms and so on - and removes them. Then it goes through the paragraph nodes on the page and assigns a score based on content-richness: it gives them points for things like number of commas, length of content, etc. Notice that a paragraph with fewer than 25 characters is immediately discarded.
Scores then "bubble up" the DOM tree: each paragraph will add part of it's score to all of it's parent nodes - a direct parent gets the full score added to its total, a grandparent only half, a great-grandparent a third and so on. This allows the algorithm to identify higher-level elements which are likely to be the main content section.
Though this is just Firefox's algorithm, my guess is if it works well for Firefox, it'll work well for other browsers too.
In order for these Reader View algorithms to work for your website, you want them to correctly identify the content-heavy sections of your page. This means you want the more content-heavy nodes on your page to get high scores in the algorithm.
So here are some rules of thumb to improve the quality of the page in the eyes of these algorithms:
Use paragraph tags in your content! Many people tend to overlook
them in favor of <br /> tags. While it may look similar, many
content-related algorithms (not only Reader View ones) rely heavily
on them.
Use HTML5 semantic elements in your markup, like <article>, <nav>,
<section>, <aside>. Even though they're not the only criterion (as you noted in the question), these are very useful to computers reading your
page (not just Reader View) to distinguish different sections of
your content. Readability.js uses them to guess which nodes are likely or unlikely to contain important content.
Wrap your main content in one container, like an <article> or <div>
element. This will receive score points from all the paragraph tags
inside it, and be identified as the main content section.
Keep your DOM tree shallow in content-dense areas. If you have a lot
of elements breaking your content up, you're only making life harder
for the algorithm: there won't be a single element that stands out
as being parent of a lot of content-heavy paragraphs, but many
separate ones with low scores.

How do I get the visible text of a range? (createRange)

The string returned by .toString() on a range created by document.createRange(...) will contain things like the inner part of script and style tags. (At least using current version of Chrome.)
Is there a way to get just the visible text?

I found a solution that seems reasonable and at least tentative standard compliant. (My guess, without checking, is that the standards perhaps does not handle all details in a case like this yet, but that the current implementation in Chrome seems useful and might become standard.)
The solution is simply to first create a document fragment from the range:
var fragment = r.cloneContents();
Then just walk the fragment the way you would walk a sub tree in the DOM. Do not enter nodes like "SCRIPT" and "STYLE". Collect the "#text" nodes.

What's a good way to figure out which code is causing runaway DOM node creation?

The Chrome Dev Tools have unearthed some problems similar to those posted here, more DOM nodes being created than I feel should be given my design choices.
What's a good way to figure out what area of code is causing runaway DOM node creation? The information is really useful but figuring out what to do with it seems much less straightforward than, for example, dealing with a CPU profile.

Try taking two heap snapshots (the Profiles panel), one with few DOM nodes and one with lots of them, then compare and see if many nodes are retained. If yes, you will be able to detect the primary retainers.

I would suggest creating code that walks the DOM and collects some statistics about what nodes are in the DOM (tag type, class name, id value, parent, number of children, textContent, etc...). If you know what is supposed to be in your page, you should be able to look at this data dump and determine what's in there that you aren't expecting. You could even run the code at page load time, then run it again after your page has been exercised a bit and compare the two.

Methods to Move DOM Element

I am programming a Javascript game as an exercise in objects and AJAX. It involves movement of wessels around a nautical-themed grid. Although the wessels are in an array of objects, I need to manipulate their graphical representation, their sprites. At the moment I have chosen, from a DOM perspective, to use 'img' elements within 'td' elements.
From a UI continuity perspective, which method of programmatically moving the elements with Javascript would be recommended:
(a) deleting inner html of 'from' cell (td element) and rewriting inner html of 'to' cell,
(b) clone the img node (sprite), delete the original node from its parent, and append it to the 'to' cell, or
(c) using positioning relative to the table element for the sprite, ignoring the td's alltogether (although their background [color] represents the ocean depth).

I would definitely stick with moving the sprite from cell to cell rather than using relative positioning to the table. My reasoning is that the table cell size might vary from browser to browser (given variances in the way padding, margins etc. are rendered - especially with annoying IE) and calculating the exact location to position the sprite in order for it to line up within a given cell might get complicated.
That narrows it down to (a) or (b) for your options. Here let's eliminate option (a), as deleting the HTML from inside is not a clean way of manipulating the DOM. I like the idea of storing the node in an object, and then appending it to the 'to' cell, and then deleting the original node, which your option (b) suggests. This way, you are still dealing with the high-level 'objects' and not the low-level 'text' needlessly. You don't need to mess with the text - for such an application, that would be the dirty 'hackish' way of doing it if you didn't know about the DOM manipulation functions JavaScript already offers.
My answer is (b). However, if you absolutely require speed - though for your game I don't know if you'll really need the extra boost - you may consider option (a). A few sources, such as http://www.quirksmode.org/dom/innerhtml.html contend that the DOM manipulation methods are generally slower than using innerHTML. But that's the general rule with everything. The lower the level you go, the faster you can make your code. The higher the level, the easier to understand and conceptualize the code is, and in my opinion, since speed will not make a huge difference in this case, keep it neat and go with (b).

There's no need to clone the img node, delete the old one and append the clone. Just append the img node to the receiving td. It will automatically be removed from the td it was previously in. Simple, effective and fast.

Develop Reference

JavaScript is the programming language of the Web.