Consolidate stacked DOM formatting elements - contenteditable DIV

Consolidate stacked DOM formatting elements - contenteditable DIV - javascript

I have a contenteditable DIV which is linked/synced-back to a textarea.
The contenteditable DIV is a free-for-all sandbox which will create formatting elements etc as they are being invoked. However this does result often in messy stacked elements.
I would like to be able to clean up the code before the textarea form is sent to the server.
It is possible to end up with something like the following:
<div>
<b>
<i>
Hel
</i>
<i>
l
</i>
</b>
<i>
<b>
o World!
</b>
</i>
</div>
Which would ideally be converted to:
<div>
<b>
<i>
Hello World!
</i>
</b>
</div>
If I walked (recursively) through the childNodes of the div I could presumably keep track of the formats (tagName.toUpperCase() == {'B','I' ....} ) // or do a document.queryCommandState during which I could do a document.execCommand('removeFormat',false,null) on the selectNode(thenode).
However, I'm a bit lost on how I might keep track across neighbouring nodes of the formats.
As reference here is what I recently did for DOM parsing to remove formatting from IMG tags: http://jsfiddle.net/tjzGg/
NB: This is a similar question> jquery - consolidate stacked DOM elements but it is about consolidating useCSS style lines into one main style. The reason this is a different question is that I am looking to consolidate text with a common style but artificially split over multiple elements because of how the text was formatted. If you take a contenteditable div and individually bold one character at a time you will end up with a single character per element.

I have a couple solutions which have their pros and cons.
First, I discovered while playing around in gmail that a contenteditable DIV will 'absorb' a neighbouring node provided the formatting style is in the same current selected mode. This 'freebie' allowed me to merely attempt to reorganize the ordering in which formatting occurs to clean up the majority of the html soup. This is not a complete solution. The ideal solution would be to have the largest formatting mode as the parent with the subset texts would have in decreasing magnitudes be the further nested modes.
In my above artificial example the result would inherently be converted into:
<div>
<b>
<i>
Hel
l
</i>
</b>
....
caveat: This was tested with text only, no images. I would imagine there is a bug or two when dealing with due to the node parsing in solution 1 and use of textContent.length in solution 2.
Solution 1:
The first works in Chrome, but in Firefox invoking execCommand will cause the node selection to lose focus and become unselected. This is a fatal flaw that I cannot seem to understand or program around. This has been abandoned unless I can figure out how to re-highlight/select the newly formatted node.
http://jsfiddle.net/tjzGg/3/
I would love to be able to get this one working with Firefox. Any suggestions on where I'm going wrong here.
Solution 2:
The second approach is to try come up with a solution for Firefox losing focus. The only way I could handle that is to ignore selecting whole nodes but instead select ONE character at a time, look at its formatting, nuke and reapply in a certain order. This works in both browsers but the DOM is then split into a childNode for each character. I'm not sure of the best way to combine them (textContent?).
http://jsfiddle.net/LDVpD/3/
[background: looked at jsbeautifier, htmlsoup, html tidy, nokogiri, hpricot, jtidy ..... I'm really surprised that there is no solution on this already. GMail will generate 'ugly' formatting as well!]
I know there are better solutions - I'd love to hear some suggestions.
Update
After testing, it is obvious that solution 2 is ridiculously slow (it would not be complicated to optimize it by keeping track of the head as it is a progressive 'flood' but still it is quite slow) and even one could easily modify it to process entire textNodes but solution 1 seems like a better approach if it only worked in Firefox.
Solution 1+2=3:
I found out that if I applied the formatting as a means to toggle it would work however as predicted the text nodes would grow/shrink based on natural consolidation of neighbouring matched formatting. So it dawned on me while sleeping that if I created a list of text nodes, and went from back to front I could care less if internally the DOM (for Firefox!!!) was growing/shrinking while formatting was being applied. Combining Solution 2's textNode list (and then popping off the tail nodes) this works great. In fact iterating instead of recursing the text nodes (original Solution 1 method) is even faster.
http://jsfiddle.net/tjzGg/4/
NB: The selectNodeContents vs selectNode

Related

Setting (ARIA) role for HTML custom elements without explicit attribute?

I have a web app that displays and passes around user-editable semantic markup. For a variety of reasons, including security, the markup consists entirely of custom elements (plus the i, b, and u tags). For regular rendering, I simply have styles for all the tags and stick them straight in the DOM. This looks and works great.
Now I'm doing screen-reader testing, and things aren't great. I have a bunch of graphical symbols I want to add labels for. I've got divs that I want to make into landmarks. I've got custom input fields and buttons.
It would be easy enough to just add role= to all the different tag instances. But part of the reason for the custom markup is to eliminate all the redundant information from the strings that get passed around (note: they're also compressed). My <ns-math> tag should always have role="math", and adding 11 bytes to what might be tags around a single character is an actual problem when multiplied out over a whole article. I know this because the app started with a bunch of <span class="... type elements instead of custom.
For the fields and buttons, I've used a shadow DOM approach. This was necessary anyway to get focus/keyboard semantics correct without polluting the semantic markup with a bunch of redundant attributes, so it's easy to also slap the ARIA stuff on the shadow elements. Especially since the inputs are all leaf nodes. But most of my custom tags amount to fancy spans, and are mostly not leaf nodes, so I don't really want to shadow them all when they're working so well in the light DOM.
After a bunch of searching, it seems like the internals.role property from "4.13.7.4 Accessibility semantics" of the HTML standard is maybe what I want. I may well be using it incorrectly (I'm a novice at front-end), but I can't seem to get this to work in recent versions of Firefox or Chrome. I don't get an error, but it seems to have no effect on the accessibility tree. My elements are not form-associated, but my reading is that the ARIAMixin should be functional anyway. This is maybe a working draft? If this is supposed to work in current browsers, does anybody have a code snippet or example?
Is there some other straight-forward way to achieve my goal of accessibility-annotating my custom elements without adding a bunch of explicit attributes to the element instances?

So you want the benefit of adding a role or an aria-attribute without actually adding those attributes? The concept of an "accessibility object model" (AOM) has been bantering around a bit that would let you access and modify the accessibility tree directly but it's still in the works. Here's an article from a couple years ago that talks about it. Nothing official. Just one person's thoughts.

Further research shows that, as of this time, the abstracted accessibility options I'm asking for are not yet implemented.
For the time being: eliminating a number of page-owned enclosing divs from the accessibility hierarchy via role="presentation" significantly improved my overall tree. With those out of the way, the majority of my custom tags seem to be simply semantically ignored. This is mostly fine as the majority of my content is plain text.
Since I already mark up the vast majority of even single-character symbols, I've simply added all my symbols to the markup generator. Since everything is already in custom tags, I then use a shadow DOM span with role="img" and a character-specific aria-label to present the symbolic character.
My solution is still incomplete. I wish that I could convey the full richness of the semantic content I have available.

Javascript retrieve linebreaks from dom [duplicate]

I need to add line breaks in the positions that the browser naturally adds a newline in a paragraph of text.
For example:
<p>This is some very long text \n that spans a number of lines in the paragraph.</p>
This is a paragraph that the browser chose to break at the position of the \n
I need to find this position and insert a <br />
Does anyone know of any JS libraries or functions that are able to do this?
The only solutuion that I have found so far is to remove tokens from the paragraph and observe the clientHeight property to detect a change in element height. I don't have time to finish this and would like to find something that's already tested.
Edit:
The reason I need to do this is that I need to accurately convert HTML to PDF. Acrobat renders text narrower than the browser does. This results in text that breaks in different positions. I need an identical ragged edge and the same number of lines in the converted PDF.
Edit:
#dtsazza: Thanks for your considered answer. It's not impossible to produce a layout editor that almost exactly replciates HTML I've written 99% of one ;)
The app I'm working on allows a user to create a product catalogue by dragging on 'tiles' The tiles are fixed width, absolutely positioned divs that contain images and text. All elemets are styled so font size is fixed. My solution for finding \n in paragraph is ok 80% of the time and when it works with a given paragrah the resulting PDF is so close to the on-screen version that the differences do not matter. Paragraphs are the same height (to the pixel), images are replaced with high res versions and all bitmap artwork is replaced with SVGs generated server side.
The only slight difference between my HTML and PDF is that Acrobat renderes text slightly more narrowly which results in line slightly shorter line length.
Diodeus's solution of adding span's and finding their coords is a very good one and should give me the location of the BRs. Please remember that the user will never see the HTML with the inserted BRs - these are added so that the PDF conversion produces a paragraph that is exactly the same size.
There are lots of people that seem to think this is impossible. I already have a working app that created extremely accurate HTML->PDF conversion of our docs - I just need a better solution of adding BRs because my solution sometimes misses a BR. BTW when it does work my paragraphs are the same height as the HTML equivalents which is the result we are after.
If anyone is interested in the type of doc i'm converting then you can check ou this screen cast:
http://www.localsa.com.au/brochure/brochure.html
Edit: Many thanks to Diodeus - your suggestion was spot on.
Solution:
for my situation it made more sense to wrap the words in spans instead of the spaces.
var text = paragraphElement.innerHTML.replace(/ /g, '</span> <span>');
text = "<span>"+text+"</span>"; //wrap first and last words.
This wraps each word in a span. I can now query the document to get all the words, iterate and compare y position. When y pos changes add a br.
This works flawlessly and gives me the results I need - Thank you!

I would suggest wrapping all spaces in a span tag and finding the coordinates of each tag. When the Y-value changes, you're on a new line.

I don't think there's going to be a very clean solution to this one, if any at all. The browser will flow a paragraph to fit the available space, linebreaking where needed. Consider that if a user resizes the browser window, all the paragraphs will be rerendered and almost certainly will change their break positions. If the user changes the size of the text on the page, the paragraphs will be rerendered with different line break points. If you (or some script on your page) changes the size of another element on the page, this will change the amount of space available to a floating paragraph and again - different line break points.
Besides, changing the actual markup of your page to mimic something that the browser does for you (and does very well) seems like the wrong approach to whatever you're doing. What's the actual problem you're trying to solve here? There's probably a better way to achieve it.
Edit: OK, so you want to render to PDF the same as "the screen version". Do you have a specific definitive screen version nominated - in terms of browser window dimensions, user stylesheets, font preferences and adjusted font size? The critical thing about HTML is that it deliberately does not specify a specific layout. It simply describes what is on the page, what they are and where they are in relation to one another.
I've seen several misguided attempts before to produce some HTML that will exactly replicate a printed creative, designed in something like a DTP application where a definitive absolute layout is essential. Those efforts were doomed to failure because of the nature of HTML, and doing it the other way round (as you're trying to) will be even worse because you don't even have a definitive starting point to work from.
On the assumption that this is all out of your hands and you'll have to do it anyway, my suggestion would be to give up on the idea of mangling the HTML. Look at the PDF conversion software - if it's any good it should give you some options for font kerning and similar settings. Playing around with the details here should get you something that approximates the font rendering in the browser and thus breaks lines at the same places.
Failing that, all I can suggest is taking screenshots of the browser and parsing these with OCR to work out where the lines break (it shouldn't require a very accurate OCR since you know what the raw text is anyway, it essentially just has to count spaces). Or perhaps just embed the screenshot in the PDF if text search/selection isn't a big deal.
Finally doing it by hand is likely the only way to make this work definitively and reliably.
But really, this is still just wrong and any attempts to revise the requirements would be better. Keep going up one step in the chain - why does the PDF have to have the exact same ragged edge as some arbitrary browser rendering? Can you achieve that purpose in another (better) way?

Sounds like a bad idea when you account for user set font sizes, MS Windows accessibility mode, and the hundreds of different mobile devices. Let the browser do it's thing - trying to have exact control over the rendering will only cause you hours of frustration.

I don't think you'll be able to do this with any kind of accuracy without embedding Gecko/WebKit/Trident or essentially recreating them.

Maybe an alternative: do all line-breaks yourself, instead of relying on the browser. Place all text in pre tags, and add your own linebreaks. Now at least you don't have to figure out where the browser put them.

Implementing an as-you-type spell checker UI for type.js

I am working on an nw.js app that uses a rich text editor. I would like to implement as-you-type spell checking using the typo.js api[1]. Unfortunately this api ships with no UI and it is up to the developer to implement one.
I will mention that while the Chromium built-in spell checker is now available in nw.js, I would prefer not to use that for reasons I won't go into now.
The editor is a contenteditable div element and currently what I have worked up for spell checking is iterating through the text nodes of the contenteditable element using treeWalker, parsing out word strings, and spell checking them. It is convienent once I have identified the nodes and the offsets of the misspelled words to get the geometric position of the words using range.getBoundingClientRect, in order to know where to place decoration (eg "squiggly underlines" or what have you.)
The challenge in this is how to make the UI responsive something of the caliber of spell checkers which are probably written in lower lever languages. I have tried:
1) Creating fixed position divs appearing as underlines and appending them to document.body, and using left/top to position them correctly.
2) Splitting out text nodes of the misspelled words and making them a child of a styled span which is inserted in their place.
1 has the problem of finding a natural (and not too kludgy) way of getting the decoration to follow changes in the positions of the misspelled words, such as when inserting text before a misspelled word, scrolling, or resizing.
2 takes care of this problem, however I would rather not mess with the DOM of the editor. For one thing, this can often be picked up when copying and pasting into another app.
I am hoping there is some other means of which I am unfamiliar in how to go about this task. Any and all suggestions would be appreciated.
[1]https://github.com/cfinke/Typo.js/

How does Firefox reader view operate

Summary
I am looking for the criteria by which I can create a webpage and be [fairly] sure it will appear in the Firefox Reader
View, if user desired.
Some sites have this option, some do not. Some with more text do not have this option than others with much less text. Stack Overflow for
instance displays only the question rather than any answers in Reader
View.
Question
I have had my Firefox upgraded from 38.0.1 to 38.0.5 and have found a new feature called ReaderView - which is a sort of overlay which removes "page clutter" and makes text easier to read.
Readerview is found in the right hand side of the address bar as a clickable icon on certain pages.
This is fine, but from the programming point of view I want to know how "reader view" works, which criteria of which pages it applies to. I have done some exploration of the Mozilla Firefox website with no clear answers (sod all programming answers of any sort I found), I have of course Googled / Binged this and this only came back with references to Firefox addons - this is not an addon but a staple part of the new Firefox version.
I made an assumption that readerview used HTML5 and would extract <article> contents but this is not the case as it works on Wikipedia which does not appear to use <article> or similar HTML5 tags, instead the readview extracts certain <div>s and displays them alone. This feature works on some HTML5 pages - such as wikipedia - but then not others.
If anyone has any ideas how Firefox ReaderView actually operates and how this operation can be used by website developers, can you share? Or if you can find where this information can be located, can you point me in the right direction - as I have not been able to find this.

You need at least one <p> tag around the text, that you want to see in Reader View, and at least 516 characters in 7 words inside the text.
for example this will trigger the ReaderView:
<body>
<p>
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789 123456
</p>
</body>
See my example at https://stackoverflow.com/a/30750212/1069083

Reading through the gitHub code, this morning, the process is that page elements are listed in a likelyhood order - with <section>,<p>,<div>,<article> at the top of the list (ie most likely).
Then each of these "nodes" is given a score based on things such as comma counts and class names that apply to the node. This is a somewhat multi-faceted process where scores are added for text chunks but also scores are seemingly reduced for invalid parts or syntax. Scores in sub-parts of "node" are reflected in the score of the node as a whole. ie the parent element contains the scores of all lower elements, I think.
This score value decides if the HTML page can be "page viewed" in Firefox.
I am not absolutely clear if the score value is set by Firefox or by the readability function.
Javascript is really not my strong point,and I think someone else should check over the link provided by Richard ( https://github.com/mozilla/readability ) and see if they can provide a more thorough answer.
What I did not see but expected to see was score based on amount of text content in a <p> or a <div> (or other) relevant tags.
Any improvements on this question or answer, please share!!
EDIT:
Images in <div> or <figure> tags (HTML5) within the <p> element appear to be retained in the Reader View when the page text content is valid.

I followed Martin's link to the Readability.js GitHub repository, and had a look at the source code. Here's what I make of it.
The algorithm works with paragraph tags. First of all, it tries to identify parts of the page which are definitely not content - like forms and so on - and removes them. Then it goes through the paragraph nodes on the page and assigns a score based on content-richness: it gives them points for things like number of commas, length of content, etc. Notice that a paragraph with fewer than 25 characters is immediately discarded.
Scores then "bubble up" the DOM tree: each paragraph will add part of it's score to all of it's parent nodes - a direct parent gets the full score added to its total, a grandparent only half, a great-grandparent a third and so on. This allows the algorithm to identify higher-level elements which are likely to be the main content section.
Though this is just Firefox's algorithm, my guess is if it works well for Firefox, it'll work well for other browsers too.
In order for these Reader View algorithms to work for your website, you want them to correctly identify the content-heavy sections of your page. This means you want the more content-heavy nodes on your page to get high scores in the algorithm.
So here are some rules of thumb to improve the quality of the page in the eyes of these algorithms:
Use paragraph tags in your content! Many people tend to overlook
them in favor of <br /> tags. While it may look similar, many
content-related algorithms (not only Reader View ones) rely heavily
on them.
Use HTML5 semantic elements in your markup, like <article>, <nav>,
<section>, <aside>. Even though they're not the only criterion (as you noted in the question), these are very useful to computers reading your
page (not just Reader View) to distinguish different sections of
your content. Readability.js uses them to guess which nodes are likely or unlikely to contain important content.
Wrap your main content in one container, like an <article> or <div>
element. This will receive score points from all the paragraph tags
inside it, and be identified as the main content section.
Keep your DOM tree shallow in content-dense areas. If you have a lot
of elements breaking your content up, you're only making life harder
for the algorithm: there won't be a single element that stands out
as being parent of a lot of content-heavy paragraphs, but many
separate ones with low scores.

Javascript - Comparing string environment

I am working on a WYSIWYG editor. As it has to include just some basic functions I want to do it myself and avoid problems. Now it is working perfectly but I want to add a functionality in order to unbold, unitalic...
I know that with execCommand it is an automatic thing, but it does not work in the same way in all browsers so... my idea was the next: When pressing BOLD button, check the environment of the string, and...
If the selection is Between the open and close <b> tags, like <b>ab||selected||cd</b> replace selected with </b>selected<b>.
If the selection starts or finishes with the <b> tag, like <b>ab||selected||</b> replace it by </b>selected<b> (and then strip out all <b></b> groups.)
If the selection starts and finishes with the <b> tag, like <b>||selected||</b> replace it by </b>selected<b> (and then strip out all <b></b> groups.)
But... how can I get into a var the <b>content</b> string when just having the caret/selection IN content? It might be possible...
UPDATE
It is curious that the replacement is always the same. So, should I really get what I am asking for, or just replace it in this way, always?

I am working on a WYSIWYG editor. As it has to include just some basic
functions I want to do it myself and avoid problems. Now it is working
perfectly but I want to add a functionality in order to unbold,
unitalic...
Do not write your own WYSIWYG editor.
Do you really want to "avoid problems"? Then use one of existing good editors (there're only 2... maybe 3 in fact). Creating editor is extremely hard task for which you need a lot of time (I mean... few years), a lot of knowledge and patience (a lot of too :P).
I can myself write that "I am working on a WYSIWYG editor". For more than half of the year I'm a core developer of one of these "good editors". And during this period I implemented only one feature - very important and very complex, but one of tens/hundreds of them.
That problem you have... I don't even want to start answering. It sounds like a piece of cake, but it isn't. It's a piece of brick that can kill you when fall on your head :). I'll only start enumerating important parts of the impl: Selection + range implementations, because native differ and are buggy (~5k LOC + min Nk LOC for tests). Then you need the proper styles handling (applying and removing) impl (min 1k LOC + tests), because you have to take care about styles spanning on many blocks (like entire table bolded) and different selections containing parts or entire styles etc. And you have to avoid native execCommand, because they will break your content. Then you should also think about updating toolbar buttons states and, to make your impl bullet proof, handling different style tags (e.g. pasted). And that's only the tip of an iceberg - you'll have styles handling, but hundreds of other things broken. Things that big editors have fixed.
Anyway - learn config options for one of main editors and customize it as you want. This will take you a few hours, not a few years.

Develop Reference

JavaScript is the programming language of the Web.