How does Firefox reader view operate

Summary
I am looking for the criteria by which I can create a webpage and be [fairly] sure it will appear in the Firefox Reader View, if the user wants it.
Some sites have this option and some do not. Some pages with a lot of text lack the option, while others with much less text have it. Stack Overflow, for instance, displays only the question rather than any answers in Reader View.
Question
I have had my Firefox upgraded from 38.0.1 to 38.0.5 and have found a new feature called Reader View, a sort of overlay which removes "page clutter" and makes text easier to read.
Reader View is found on the right-hand side of the address bar as a clickable icon on certain pages.
This is fine, but from a programming point of view I want to know how Reader View works and which criteria determine the pages it applies to. I have explored the Mozilla Firefox website with no clear answers (sod all programming answers of any sort that I found). I have of course Googled / Binged this, and that only came back with references to Firefox add-ons; this is not an add-on but a staple part of the new Firefox version.
I assumed that Reader View used HTML5 and would extract <article> contents, but this is not the case: it works on Wikipedia, which does not appear to use <article> or similar HTML5 tags. Instead, Reader View extracts certain <div>s and displays them alone. This feature works on some HTML5 pages, such as Wikipedia, but not on others.
If anyone has any ideas how Firefox Reader View actually operates and how website developers can make use of it, can you share? Or if you know where this information can be found, can you point me in the right direction? I have not been able to find it.

You need at least one <p> tag around the text that you want to see in Reader View, and at least 516 characters in 7 words inside the text.
For example, this will trigger Reader View:
<body>
<p>
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789 123456
</p>
</body>
See my example at https://stackoverflow.com/a/30750212/1069083

Reading through the GitHub code this morning, the process is that page elements are listed in order of likelihood, with <section>, <p>, <div> and <article> at the top of the list (i.e. most likely).
Then each of these "nodes" is given a score based on things such as comma counts and the class names that apply to the node. This is a somewhat multi-faceted process where scores are added for text chunks but are also seemingly reduced for invalid parts or syntax. Scores in sub-parts of a "node" are reflected in the score of the node as a whole, i.e. the parent element contains the scores of all lower elements, I think.
This score value decides if the HTML page can be shown in Reader View in Firefox.
I am not absolutely clear if the score value is set by Firefox or by the readability function.
JavaScript is really not my strong point, and I think someone else should check over the link provided by Richard ( https://github.com/mozilla/readability ) and see if they can provide a more thorough answer.
What I did not see but expected to see was a score based on the amount of text content in a <p>, <div> or other relevant tag.
Any improvements on this question or answer, please share!!
EDIT:
Images in <div> or <figure> tags (HTML5) within the <p> element appear to be retained in the Reader View when the page text content is valid.

I followed Martin's link to the Readability.js GitHub repository, and had a look at the source code. Here's what I make of it.
The algorithm works with paragraph tags. First of all, it tries to identify parts of the page which are definitely not content - like forms and so on - and removes them. Then it goes through the paragraph nodes on the page and assigns a score based on content-richness: it gives them points for things like number of commas, length of content, etc. Notice that a paragraph with fewer than 25 characters is immediately discarded.
Scores then "bubble up" the DOM tree: each paragraph adds part of its score to all of its ancestor nodes. A direct parent gets the full score added to its total, a grandparent only half, and more distant ancestors progressively smaller fractions. This allows the algorithm to identify higher-level elements which are likely to be the main content section.
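The scoring and bubbling described above can be sketched roughly as follows. This is a simplified illustration and not the actual Readability.js code: the function names, the plain-object input (each paragraph lists its ancestors, nearest first) and the exact weights and divisors are all assumptions made for the example.

```javascript
// Simplified sketch of paragraph scoring; NOT the real Readability.js code.
// Weights are illustrative assumptions.
function scoreParagraph(text) {
  const trimmed = text.trim();
  if (trimmed.length < 25) return 0;               // very short paragraphs are ignored
  let score = 1;                                   // base point for being a paragraph
  score += trimmed.split(",").length - 1;          // one point per comma
  score += Math.min(Math.floor(trimmed.length / 100), 3); // capped bonus for length
  return score;
}

// paragraphs: [{ text, ancestors }] where ancestors[0] is the direct parent,
// ancestors[1] the grandparent, and so on (hypothetical input shape).
function bubbleScores(paragraphs) {
  const totals = new Map();
  for (const { text, ancestors } of paragraphs) {
    const score = scoreParagraph(text);
    if (score === 0) continue;
    ancestors.forEach((name, level) => {
      // Assumed divisors: parent 1, grandparent 2, then 3, 4, ...
      const share = score / (level + 1);
      totals.set(name, (totals.get(name) || 0) + share);
    });
  }
  return totals;
}

const totals = bubbleScores([
  { text: "First paragraph, with a comma, and enough text to count.", ancestors: ["article", "main", "body"] },
  { text: "Second paragraph, also long enough to be scored here.", ancestors: ["article", "main", "body"] },
  { text: "Too short.", ancestors: ["nav", "body"] },
]);
console.log([...totals.entries()]);
```

In this toy input the container that directly wraps both long paragraphs ends up with the highest total, which is the effect the rules of thumb below rely on.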
Though this is just Firefox's algorithm, my guess is if it works well for Firefox, it'll work well for other browsers too.
In order for these Reader View algorithms to work for your website, you want them to correctly identify the content-heavy sections of your page. This means you want the more content-heavy nodes on your page to get high scores in the algorithm.
So here are some rules of thumb to improve the quality of the page in the eyes of these algorithms:
- Use paragraph tags in your content! Many people tend to overlook them in favor of <br /> tags. While it may look similar, many content-related algorithms (not only Reader View ones) rely heavily on them.
- Use HTML5 semantic elements in your markup, like <article>, <nav>, <section>, <aside>. Even though they're not the only criterion (as you noted in the question), these are very useful to computers reading your page (not just Reader View) to distinguish different sections of your content. Readability.js uses them to guess which nodes are likely or unlikely to contain important content.
- Wrap your main content in one container, like an <article> or <div> element. This will receive score points from all the paragraph tags inside it, and be identified as the main content section.
- Keep your DOM tree shallow in content-dense areas. If you have a lot of elements breaking your content up, you're only making life harder for the algorithm: there won't be a single element that stands out as being parent of a lot of content-heavy paragraphs, but many separate ones with low scores.

Related

Javascript retrieve linebreaks from dom [duplicate]

I need to add line breaks in the positions that the browser naturally adds a newline in a paragraph of text.
For example:
<p>This is some very long text \n that spans a number of lines in the paragraph.</p>
This is a paragraph that the browser chose to break at the position of the \n
I need to find this position and insert a <br />
Does anyone know of any JS libraries or functions that are able to do this?
The only solution that I have found so far is to remove tokens from the paragraph and observe the clientHeight property to detect a change in element height. I don't have time to finish this and would like to find something that's already tested.
Edit:
The reason I need to do this is that I need to accurately convert HTML to PDF. Acrobat renders text narrower than the browser does. This results in text that breaks in different positions. I need an identical ragged edge and the same number of lines in the converted PDF.
Edit:
@dtsazza: Thanks for your considered answer. It's not impossible to produce a layout editor that almost exactly replicates HTML; I've written 99% of one ;)
The app I'm working on allows a user to create a product catalogue by dragging on 'tiles'. The tiles are fixed-width, absolutely positioned divs that contain images and text. All elements are styled so font size is fixed. My solution for finding \n in a paragraph is OK 80% of the time, and when it works with a given paragraph the resulting PDF is so close to the on-screen version that the differences do not matter. Paragraphs are the same height (to the pixel), images are replaced with high-res versions and all bitmap artwork is replaced with SVGs generated server side.
The only slight difference between my HTML and PDF is that Acrobat renders text slightly more narrowly, which results in a slightly shorter line length.
Diodeus's solution of adding spans and finding their coords is a very good one and should give me the location of the BRs. Please remember that the user will never see the HTML with the inserted BRs - these are added so that the PDF conversion produces a paragraph that is exactly the same size.
There are lots of people that seem to think this is impossible. I already have a working app that creates extremely accurate HTML->PDF conversions of our docs - I just need a better way of adding BRs because my solution sometimes misses a BR. BTW when it does work my paragraphs are the same height as the HTML equivalents, which is the result we are after.
If anyone is interested in the type of doc I'm converting then you can check out this screencast:
http://www.localsa.com.au/brochure/brochure.html
Edit: Many thanks to Diodeus - your suggestion was spot on.
Solution:
For my situation it made more sense to wrap the words in spans instead of the spaces.
var text = paragraphElement.innerHTML.replace(/ /g, '</span> <span>');
text = "<span>"+text+"</span>"; //wrap first and last words.
This wraps each word in a span. I can now query the document to get all the words, iterate and compare y position. When y pos changes add a br.
This works flawlessly and gives me the results I need - Thank you!
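The "iterate and compare y position" step could look something like the sketch below. It is written as a pure function so it can run outside a browser: the input array of { text, top } objects and the name insertLineBreaks are assumptions for illustration, with top standing in for whatever vertical position was measured for each word span.

```javascript
// words: [{ text, top }] in document order. `top` is the measured vertical
// position of each word's <span> (hypothetical input shape).
function insertLineBreaks(words) {
  let html = "";
  let previousTop = null;
  for (const { text, top } of words) {
    if (previousTop !== null) {
      // A change in vertical position means the browser wrapped here.
      html += top !== previousTop ? "<br />" : " ";
    }
    html += text;
    previousTop = top;
  }
  return html;
}

// Example with made-up measurements: three visual lines.
const measured = [
  { text: "This", top: 10 }, { text: "is", top: 10 }, { text: "some", top: 10 },
  { text: "very", top: 28 }, { text: "long", top: 28 },
  { text: "text", top: 46 },
];
const withBreaks = insertLineBreaks(measured);
console.log(withBreaks);
```

In a browser, the input could be built from the wrapped paragraph with something like Array.from(paragraphElement.querySelectorAll("span"), s => ({ text: s.textContent, top: s.getBoundingClientRect().top })). The exact comparison assumes every word on a line reports the same top value; with mixed font sizes a small tolerance would probably be needed.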

I would suggest wrapping all spaces in a span tag and finding the coordinates of each tag. When the Y-value changes, you're on a new line.

I don't think there's going to be a very clean solution to this one, if any at all. The browser will flow a paragraph to fit the available space, linebreaking where needed. Consider that if a user resizes the browser window, all the paragraphs will be rerendered and almost certainly will change their break positions. If the user changes the size of the text on the page, the paragraphs will be rerendered with different line break points. If you (or some script on your page) change the size of another element on the page, this will change the amount of space available to a floating paragraph and again - different line break points.
Besides, changing the actual markup of your page to mimic something that the browser does for you (and does very well) seems like the wrong approach to whatever you're doing. What's the actual problem you're trying to solve here? There's probably a better way to achieve it.
Edit: OK, so you want to render to PDF the same as "the screen version". Do you have a specific definitive screen version nominated - in terms of browser window dimensions, user stylesheets, font preferences and adjusted font size? The critical thing about HTML is that it deliberately does not specify a specific layout. It simply describes what is on the page, what they are and where they are in relation to one another.
I've seen several misguided attempts before to produce some HTML that will exactly replicate a printed creative, designed in something like a DTP application where a definitive absolute layout is essential. Those efforts were doomed to failure because of the nature of HTML, and doing it the other way round (as you're trying to) will be even worse because you don't even have a definitive starting point to work from.
On the assumption that this is all out of your hands and you'll have to do it anyway, my suggestion would be to give up on the idea of mangling the HTML. Look at the PDF conversion software - if it's any good it should give you some options for font kerning and similar settings. Playing around with the details here should get you something that approximates the font rendering in the browser and thus breaks lines at the same places.
Failing that, all I can suggest is taking screenshots of the browser and parsing these with OCR to work out where the lines break (it shouldn't require a very accurate OCR since you know what the raw text is anyway, it essentially just has to count spaces). Or perhaps just embed the screenshot in the PDF if text search/selection isn't a big deal.
Finally doing it by hand is likely the only way to make this work definitively and reliably.
But really, this is still just wrong and any attempts to revise the requirements would be better. Keep going up one step in the chain - why does the PDF have to have the exact same ragged edge as some arbitrary browser rendering? Can you achieve that purpose in another (better) way?

Sounds like a bad idea when you account for user-set font sizes, MS Windows accessibility mode, and the hundreds of different mobile devices. Let the browser do its thing - trying to have exact control over the rendering will only cause you hours of frustration.

I don't think you'll be able to do this with any kind of accuracy without embedding Gecko/WebKit/Trident or essentially recreating them.

Maybe an alternative: do all line-breaks yourself, instead of relying on the browser. Place all text in pre tags, and add your own linebreaks. Now at least you don't have to figure out where the browser put them.

Is it always necessary for overlay elements to be located at the end of the HTML body?

I have noticed that in JavaScript frameworks, elements such as dialogs, tooltips and alerts mostly appear at the end of the body.
I'm making my own implementation of these elements and trying to make it failproof. I'm repeating some techniques like using a transparent iframe to overlay embedded objects in old browsers, and so on.
What restrictions could I face if I place my dialog/tooltip somewhere deep inside the DOM tree with {position: fixed}? I'm worried there may be some dangers to this approach, because big frameworks never use it.
I want to support IE8+.

Aside from z-ordering, which is a very valid point made by Teemu, another major consideration in JS frameworks is speed of execution / speed of lookup.
The DOM in JS terms is one large object. The deeper into an object JavaScript needs to go to get what it's being asked for, the less performant the script gets; take a look at this answer.
Therefore it makes sense to keep everything that is probably going to be cloned or deep copied at a sensible nesting level and in the correct z-order. That happens to be toward the end of the body and usually wrapped by at most one containing element.
There may be other reasons, but the depth / nesting sprang to mind as a consideration I'd take into account.

Short answer - very few techniques like this are "always necessary". JavaScript can easily remove items from their natural position in the DOM and relocate them at will.
Long answer - I don't think approaching this from a JavaScript first angle is correct. Look at it in terms of where the content belongs naturally within the hierarchy of the rest of the DOM.
For example, if you are talking about a modal dialog, then the chrome (the container elements) usually does not belong within the rest of the DOM - it exists only to contain and provide modal overlay functionality for the content within. This chrome does not participate in the outline of the DOM and the rest of the content. In that case, unless you are able to load it separately via ajax or embed the chrome HTML within the JavaScript, the closest you will come to removing it from the main DOM is to append it to the bottom of the main DOM content. Note that this disregards the upcoming TEMPLATE element (http://www.html5rocks.com/en/tutorials/webcomponents/template/) which is designed for just this purpose.
However, the content of your dialog might very well belong within the main content of the DOM - either as an element, or as an attribute (i.e. title or data-) to an associated element. This would especially be true for tooltip text.

Consolidate stacked DOM formatting elements - contenteditable DIV

I have a contenteditable DIV which is linked/synced-back to a textarea.
The contenteditable DIV is a free-for-all sandbox which will create formatting elements etc as they are being invoked. However this does result often in messy stacked elements.
I would like to be able to clean up the code before the textarea form is sent to the server.
It is possible to end up with something like the following:
<div>
<b>
<i>
Hel
</i>
<i>
l
</i>
</b>
<i>
<b>
o World!
</b>
</i>
</div>
Which would ideally be converted to:
<div>
<b>
<i>
Hello World!
</i>
</b>
</div>
If I walked (recursively) through the childNodes of the div I could presumably keep track of the formats (tagName.toUpperCase() == {'B','I' ....} ) // or do a document.queryCommandState during which I could do a document.execCommand('removeFormat',false,null) on the selectNode(thenode).
However, I'm a bit lost on how I might keep track across neighbouring nodes of the formats.
As reference here is what I recently did for DOM parsing to remove formatting from IMG tags: http://jsfiddle.net/tjzGg/
NB: There is a similar question, "jquery - consolidate stacked DOM elements", but it is about consolidating useCSS style lines into one main style. This question is different because I am looking to consolidate text that has a common style but is artificially split over multiple elements because of how the text was formatted. If you take a contenteditable div and individually bold one character at a time you will end up with a single character per element.

I have a couple of solutions which have their pros and cons.
First, I discovered while playing around in Gmail that a contenteditable DIV will 'absorb' a neighbouring node provided its formatting style matches the currently selected mode. This 'freebie' allowed me to merely attempt to reorganize the order in which formatting occurs to clean up the majority of the HTML soup. This is not a complete solution. The ideal solution would have the formatting mode that covers the most text as the parent, with modes covering progressively smaller subsets of the text nested inside it.
In my above artificial example the result would inherently be converted into:
<div>
<b>
<i>
Hel
l
</i>
</b>
....
Caveat: This was tested with text only, no images. I would imagine there is a bug or two when dealing with images, due to the node parsing in solution 1 and the use of textContent.length in solution 2.
Solution 1:
The first works in Chrome, but in Firefox invoking execCommand will cause the node selection to lose focus and become unselected. This is a fatal flaw that I cannot seem to understand or program around. This has been abandoned unless I can figure out how to re-highlight/select the newly formatted node.
http://jsfiddle.net/tjzGg/3/
I would love to be able to get this one working with Firefox. Any suggestions on where I'm going wrong here?
Solution 2:
The second approach is to try to come up with a solution for Firefox losing focus. The only way I could handle that is to avoid selecting whole nodes and instead select ONE character at a time, look at its formatting, then nuke and reapply it in a certain order. This works in both browsers but the DOM is then split into a childNode for each character. I'm not sure of the best way to combine them (textContent?).
http://jsfiddle.net/LDVpD/3/
[background: looked at jsbeautifier, htmlsoup, html tidy, nokogiri, hpricot, jtidy ..... I'm really surprised that there is no solution on this already. GMail will generate 'ugly' formatting as well!]
I know there are better solutions - I'd love to hear some suggestions.
Update
After testing, it is obvious that solution 2 is ridiculously slow. It would not be complicated to optimize it by keeping track of the head, as it is a progressive 'flood', but it is still quite slow. One could also easily modify it to process entire textNodes, but solution 1 seems like the better approach, if only it worked in Firefox.
Solution 1+2=3:
I found out that if I applied the formatting as a toggle it would work; however, as predicted, the text nodes would grow/shrink based on natural consolidation of neighbouring matched formatting. So it dawned on me while sleeping that if I created a list of text nodes and went from back to front, I couldn't care less whether the DOM was growing/shrinking internally (for Firefox!!!) while formatting was being applied. Combined with Solution 2's textNode list (and then popping off the tail nodes), this works great. In fact, iterating over the text nodes instead of recursing (the original Solution 1 method) is even faster.
http://jsfiddle.net/tjzGg/4/
NB: The selectNodeContents vs selectNode
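For comparison, here is a different technique from the execCommand-based solutions above: flatten the markup into text runs, each tagged with its set of active formats, merge neighbouring runs whose sets match, and re-nest the tags in a fixed order. This is only a sketch under stated assumptions. The node shape (a string for text, or { tag, children } for an element), the function names and the list of formatting tags are all hypothetical; it ignores block boundaries and attributes and does not escape text.

```javascript
// Hypothetical node shape: a string (text) or { tag, children } (element).
const FORMAT_TAGS = new Set(["b", "i", "u"]); // assumed set of inline formats

// Flatten the tree into runs of text, each keyed by its sorted active formats.
function toRuns(node, active = [], runs = []) {
  if (typeof node === "string") {
    if (node.length > 0) {
      runs.push({ text: node, formats: [...active].sort().join(",") });
    }
    return runs;
  }
  const next = FORMAT_TAGS.has(node.tag) && !active.includes(node.tag)
    ? [...active, node.tag]
    : active;
  for (const child of node.children) toRuns(child, next, runs);
  return runs;
}

// Merge neighbouring runs that carry exactly the same formats.
function mergeRuns(runs) {
  const merged = [];
  for (const run of runs) {
    const last = merged[merged.length - 1];
    if (last && last.formats === run.formats) last.text += run.text;
    else merged.push({ ...run });
  }
  return merged;
}

// Re-emit markup with tags nested in one consistent (alphabetical) order.
function toHtml(runs) {
  return runs.map(({ text, formats }) => {
    const tags = formats ? formats.split(",") : [];
    const open = tags.map((t) => `<${t}>`).join("");
    const close = tags.slice().reverse().map((t) => `</${t}>`).join("");
    return open + text + close;
  }).join("");
}

// The messy example from the question, expressed in the assumed node shape.
const messy = { tag: "div", children: [
  { tag: "b", children: [{ tag: "i", children: ["Hel"] }, { tag: "i", children: ["l"] }] },
  { tag: "i", children: [{ tag: "b", children: ["o World!"] }] },
] };
const cleaned = toHtml(mergeRuns(toRuns(messy)));
console.log(cleaned);
```

Sorting the format names is what makes <b><i> and <i><b> compare as equal, so the order in which the user applied the formatting no longer matters.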

Maximum number of element IDs you'd want on a page?

Having trouble searching on this. The only thing I've found so far was related to a mapping API where (for map markers) JavaScript bogs down badly around 500 IDs. I'm pretty sure that was related to the particular scripts in use though and not a general rule.
I have a page with a complicated list. Each item in the list has about 10 distinct IDs on it and there's several JavaScript scripts in play. The list is paginating and the user has a choice of how many items to show per page. Considering each item in the list has ~10 IDs, I'm wondering what I should set the maximum number of items per page to (total number of records the user can choose to view at one time on a page).
I mean - aside from increased load times based on the raw number of records, is there a known number of IDs where a limit is hit or where CSS/JavaScript performance starts to degrade sharply?
Edit: Pardon, each 'item' is a complex business card (to paint a picture) with three divs in play and a small form. That's why each item (record) has about 10 IDs.

I think you will find that it will ultimately depend on the user's browser and personal system.
What would probably make the most sense is to do some research on your intended user base: which browsers they use and how much RAM they have available. You can profile your application in your local environment (and there was already a discussion about doing just that). That would give you an idea of how many system resources you're using versus how much the user has available.
But given the amount of highly graphical JavaScript stuff going on with HTML5, etc., I would wager you're going to need to have a ridiculous amount of data displaying for the JavaScript performance to be your biggest sticking point (at least in the most modern browsers; if you have to support something like IE6, all bets are off).

You didn't answer my comment and you've already accepted an answer, but I'll give an answer anyway, because I've got a feeling you may not need all the IDs you have.
Let's take following example code:
<div id="card-1" onclick="doSomethingWithCard('card-1')">
<h2 id="card-1-name">John Doe</h2>
</div>
Both IDs here can be unnecessary. Many people try to get a reference to the element that triggered an event by passing its ID to the event handler and using getElementById. Instead, the keyword this already contains a reference:
<div onclick="doSomethingWithCard(this)">
...
</div>
And once you have that reference, you can use getElementsByTagName, a CSS selector (usually with a JavaScript framework/library like jQuery) or XPath to access the elements inside instead of their separate IDs. This works best if all items have the same inner structure using the appropriate elements (h2, h3, ul, form, etc.) and classes.
jQuery example:
function doSomethingWithCard(card) {
alert($(card).find("h2").text());
}

Automatic multi-page multi-column flowing text with QtWebkit (HTML/CSS/JS -> PDF)

I have some HTML documents that are converted to PDF, using software that renders using QtWebkit (not sure which version).
Currently, the documents have specific tags to split into columns and pages - so whenever the wording changes, it is a manual time-consuming process to move these tags so that the columns and pages fit.
Can anyone provide a way to have text auto-wrapped into the next column/page (as appropriate) when it reaches the bottom of the current container?
Any HTML, CSS or JS supported by QtWebkit is ok (assuming it works in the PDF converter).
(I have tested the webkit-column-* in CSS3 and it appears QtWebkit does not support this.)
To make things more exciting, it also needs to:
- put a header at the top of each page, with page X of Y numbering;
- if an odd number of pages, add a blank page at the end (with no header);
- have the ability to say "don't break inside this block" or "don't break after this header"
I have put some quick example initial markup and target markup to help explain what I'm trying to do.
(The actual documents are far more complicated than that, but I need a simple proof-of-concept before I attack the real ones.)
Any suggestions?
Update:
I've got a partially working solution using Aaron's "filling up" suggestion - I'll post more details in a bit.

Create a document with a single page and all the text in a single column. Use JavaScript to cut the text into parts.
Use pixel coordinates to locate the paragraph/element that doesn't fit anymore. Move it and everything below to the next col. If a "page" already has two "col" divs, start a new page.
After all pages have been created, count and number the pages. Fix even/odd stuff, etc.
Will take some time but it's automatic.
Another approach would be to add all the content to a "source" div and move items to the col div until it's full and repeat with the next col.
Have a look at Prototype or jQuery; they should give you lots of tools to move stuff around in the document.
[EDIT] Instead of only relying on jQuery functions, I suggest creating one or two objects which keep track of the current page and the current column, etc. These give you stable foundations to stand on from which you can fire the helper methods.
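The "filling up" approach can be sketched without any DOM code, which may help when working out the bookkeeping first. Everything named here is an assumption for illustration: paginate, the { id, height } item shape (heights would be measured in the rendered document), and the { columns, blank } page shape. Header placement and the "don't break inside this block" rules are left out.

```javascript
// items: [{ id, height }] in reading order; heights are assumed to have been
// measured already (hypothetical input shape).
function paginate(items, columnHeight, columnsPerPage = 2) {
  const pages = [];
  let column = null; // ids placed in the column currently being filled
  let used = 0;      // height consumed in the current column

  for (const item of items) {
    const overflows = column !== null && column.length > 0 &&
      used + item.height > columnHeight;
    if (column === null || overflows) {
      // Start a new column, and a new page if the current one is full.
      column = [];
      used = 0;
      let page = pages[pages.length - 1];
      if (!page || page.columns.length === columnsPerPage) {
        page = { columns: [], blank: false };
        pages.push(page);
      }
      page.columns.push(column);
    }
    // An item taller than a whole column still gets a column to itself.
    column.push(item.id);
    used += item.height;
  }

  // Odd number of pages: append a blank page (which would carry no header).
  if (pages.length % 2 === 1) pages.push({ columns: [], blank: true });
  return pages;
}

const pages = paginate(
  [
    { id: "a", height: 60 }, { id: "b", height: 50 }, { id: "c", height: 40 },
    { id: "d", height: 70 }, { id: "e", height: 30 }, { id: "f", height: 20 },
    { id: "g", height: 90 },
  ],
  100
);
console.log(JSON.stringify(pages));
```

Once the pages array exists, the "page X of Y" header text can be derived from each non-blank page's index and the count of non-blank pages, and the items can be moved into real column elements in a second pass.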
