Javascript retrieve linebreaks from dom [duplicate]


I need to add line breaks in the positions that the browser naturally adds a newline in a paragraph of text.
For example:
<p>This is some very long text \n that spans a number of lines in the paragraph.</p>
This is a paragraph that the browser chose to break at the position of the \n
I need to find this position and insert a <br />
Does anyone know of any JS libraries or functions that are able to do this?
The only solution that I have found so far is to remove tokens from the paragraph and observe the clientHeight property to detect a change in element height. I don't have time to finish this and would like to find something that's already tested.
Edit:
The reason I need to do this is that I need to accurately convert HTML to PDF. Acrobat renders text narrower than the browser does. This results in text that breaks in different positions. I need an identical ragged edge and the same number of lines in the converted PDF.
Edit:
@dtsazza: Thanks for your considered answer. It's not impossible to produce a layout editor that almost exactly replicates HTML; I've written 99% of one ;)
The app I'm working on allows a user to create a product catalogue by dragging on 'tiles'. The tiles are fixed-width, absolutely positioned divs that contain images and text. All elements are styled so font size is fixed. My solution for finding \n in a paragraph is OK 80% of the time, and when it works with a given paragraph the resulting PDF is so close to the on-screen version that the differences do not matter. Paragraphs are the same height (to the pixel), images are replaced with high-res versions and all bitmap artwork is replaced with SVGs generated server side.
The only slight difference between my HTML and PDF is that Acrobat renders text slightly more narrowly, which results in a slightly shorter line length.
Diodeus's solution of adding spans and finding their coords is a very good one and should give me the location of the BRs. Please remember that the user will never see the HTML with the inserted BRs - these are added so that the PDF conversion produces a paragraph that is exactly the same size.
There are lots of people that seem to think this is impossible. I already have a working app that creates extremely accurate HTML->PDF conversions of our docs - I just need a better solution for adding BRs because my solution sometimes misses a BR. BTW when it does work my paragraphs are the same height as the HTML equivalents, which is the result we are after.
If anyone is interested in the type of doc I'm converting then you can check out this screencast:
http://www.localsa.com.au/brochure/brochure.html
Edit: Many thanks to Diodeus - your suggestion was spot on.
Solution:
For my situation it made more sense to wrap the words in spans instead of the spaces.
// Close and reopen a span at every space so that each word ends up in its own span.
var text = paragraphElement.innerHTML.replace(/ /g, '</span> <span>');
text = "<span>" + text + "</span>"; // wrap the first and last words
This wraps each word in a span. I can now query the document to get all the words, iterate and compare the y position of each one. When the y position changes, add a <br />.
This works flawlessly and gives me the results I need - Thank you!
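For anyone wanting the comparison step spelled out, here is a minimal sketch, not the asker's actual code. It assumes the word positions have already been collected into plain { text, top } records; the function name joinWithLineBreaks and the tolerance parameter are made up for illustration.

```javascript
// Rebuild a paragraph's HTML, inserting <br /> wherever the vertical
// position of a word differs from that of the previous word.
// `words` is an array of { text, top } records (an assumed shape).
function joinWithLineBreaks(words, tolerance = 1) {
  let html = "";
  let previousTop = null;
  for (const word of words) {
    if (previousTop === null) {
      html += word.text; // first word, nothing to compare against
    } else if (Math.abs(word.top - previousTop) > tolerance) {
      // The y position changed, so the browser wrapped before this word.
      html += "<br />" + word.text;
    } else {
      html += " " + word.text;
    }
    previousTop = word.top;
  }
  return html;
}

// Hypothetical measurements for a paragraph that wrapped after "long".
const measured = [
  { text: "This", top: 10 },
  { text: "is", top: 10 },
  { text: "long", top: 10 },
  { text: "text", top: 28 },
  { text: "here", top: 28 },
];
const withBreaks = joinWithLineBreaks(measured);
console.log(withBreaks);
```

In a browser the records could come from the wrapped spans, for example Array.from(paragraphElement.querySelectorAll("span"), s => ({ text: s.textContent, top: s.getBoundingClientRect().top })). The small tolerance is there because inline elements on the same line can report slightly different tops.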

I would suggest wrapping all spaces in a span tag and finding the coordinates of each tag. When the Y-value changes, you're on a new line.

I don't think there's going to be a very clean solution to this one, if any at all. The browser will flow a paragraph to fit the available space, linebreaking where needed. Consider that if a user resizes the browser window, all the paragraphs will be rerendered and almost certainly will change their break positions. If the user changes the size of the text on the page, the paragraphs will be rerendered with different line break points. If you (or some script on your page) change the size of another element on the page, this will change the amount of space available to a floating paragraph and again - different line break points.
Besides, changing the actual markup of your page to mimic something that the browser does for you (and does very well) seems like the wrong approach to whatever you're doing. What's the actual problem you're trying to solve here? There's probably a better way to achieve it.
Edit: OK, so you want to render to PDF the same as "the screen version". Do you have a specific definitive screen version nominated - in terms of browser window dimensions, user stylesheets, font preferences and adjusted font size? The critical thing about HTML is that it deliberately does not specify a specific layout. It simply describes what is on the page, what they are and where they are in relation to one another.
I've seen several misguided attempts before to produce some HTML that will exactly replicate a printed creative, designed in something like a DTP application where a definitive absolute layout is essential. Those efforts were doomed to failure because of the nature of HTML, and doing it the other way round (as you're trying to) will be even worse because you don't even have a definitive starting point to work from.
On the assumption that this is all out of your hands and you'll have to do it anyway, my suggestion would be to give up on the idea of mangling the HTML. Look at the PDF conversion software - if it's any good it should give you some options for font kerning and similar settings. Playing around with the details here should get you something that approximates the font rendering in the browser and thus breaks lines at the same places.
Failing that, all I can suggest is taking screenshots of the browser and parsing these with OCR to work out where the lines break (it shouldn't require a very accurate OCR since you know what the raw text is anyway, it essentially just has to count spaces). Or perhaps just embed the screenshot in the PDF if text search/selection isn't a big deal.
Finally doing it by hand is likely the only way to make this work definitively and reliably.
But really, this is still just wrong and any attempts to revise the requirements would be better. Keep going up one step in the chain - why does the PDF have to have the exact same ragged edge as some arbitrary browser rendering? Can you achieve that purpose in another (better) way?

Sounds like a bad idea when you account for user-set font sizes, MS Windows accessibility mode, and the hundreds of different mobile devices. Let the browser do its thing - trying to have exact control over the rendering will only cause you hours of frustration.

I don't think you'll be able to do this with any kind of accuracy without embedding Gecko/WebKit/Trident or essentially recreating them.

Maybe an alternative: do all line-breaks yourself, instead of relying on the browser. Place all text in pre tags, and add your own linebreaks. Now at least you don't have to figure out where the browser put them.

Related

Word Cloud for Other Languages

I am using Jason Davies's Word Cloud for my project, but there is a problem: I am using Persian [Farsi] strings, and the words overlap in the SVG.
This is my project's output:
What happened to the Farsi words?

As explained on the About page for the project, the generator needs to retrieve the shape of a glyph to be able to compute where it is "safe" to put other words. The about page explains the process in much more detail, but here's what we care for:
Glyphs are rendered individually to a hidden <canvas> element.
Pixel data is retrieved
Bounding boxes are derived
The word cloud is generated.
Now, the critical insight is that in Western (and many other) scripts, glyphs don't change shape based on context often. Yes, there are such things as ligatures, but they are generally rare, and definitely not necessary for the script.
In Persian, however, the glyph shape will change based on context. For non-Persian readers, look at ی and س which, when combined, become یس. Yes, that last one is two glyphs!
The algorithm actually has no problem dealing with Persian characters, as you can see by hacking the demo on the about page, putting a breakpoint just after the d.code is generated, to be able to modify it:
Replacing it with 1740, which is the charCode for the first Persian glyph above, and letting the algorithm run, shows beautiful and perfectly correct bounding boxes around the glyph:
The issue is that when the word cloud is actually rendered, the glyph is placed in context and... changes shape. The generator doesn't know this, though, and continues to use the old bounding data to place other words, thus creating the overlapping you witnessed. In addition, there is probably also an issue around right-to-left handling of text, which certainly would not help.
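To make the mismatch concrete, here is a small sketch that only looks at the string side of things. The code 1740 is the one quoted above; 1587 (U+0633) is the code for the second letter of the example pair. The variable names are made up for illustration.

```javascript
// Build the two-letter example from its character codes.
const yeh = String.fromCharCode(1740);  // first letter, code quoted in the answer
const seen = String.fromCharCode(1587); // second letter, U+0633
const joined = yeh + seen;

// The string still holds two separate code units...
const codes = Array.from(joined, (ch) => ch.charCodeAt(0));
console.log(joined.length, codes);

// ...but a text renderer draws them as one connected shape, so a bounding
// box measured for each character in isolation no longer matches what is
// actually painted once the word is laid out.
```

Measuring whole words rather than single characters would sidestep this in principle, though whether the generator can be configured that way is a question for its author.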
I would encourage you to take this up with the author of the generator directly. The project has a GitHub page: https://github.com/jasondavies/d3-cloud so opening an issue there (and maybe referring back to this answer) would help!

How does Firefox reader view operate

Summary
I am looking for the criteria by which I can create a webpage and be [fairly] sure it will appear in the Firefox Reader View, if the user desires.
Some sites have this option, some do not. Some pages with more text lack the option while others with much less text have it. Stack Overflow, for instance, displays only the question rather than any answers in Reader View.
Question
I have had my Firefox upgraded from 38.0.1 to 38.0.5 and have found a new feature called Reader View - which is a sort of overlay which removes "page clutter" and makes text easier to read.
Reader View is found in the right hand side of the address bar as a clickable icon on certain pages.
This is fine, but from the programming point of view I want to know how Reader View works, and by which criteria it applies to which pages. I have done some exploration of the Mozilla Firefox website with no clear answers (sod all programming answers of any sort I found). I have of course Googled / Binged this, and this only came back with references to Firefox addons - this is not an addon but a staple part of the new Firefox version.
I made an assumption that Reader View used HTML5 and would extract <article> contents, but this is not the case as it works on Wikipedia, which does not appear to use <article> or similar HTML5 tags; instead Reader View extracts certain <div>s and displays them alone. This feature works on some HTML5 pages - such as Wikipedia - but then not others.
If anyone has any ideas how Firefox ReaderView actually operates and how this operation can be used by website developers, can you share? Or if you can find where this information can be located, can you point me in the right direction - as I have not been able to find this.

You need at least one <p> tag around the text that you want to see in Reader View, and at least 516 characters in 7 words inside the text.
For example, this will trigger Reader View:
<body>
<p>
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789 123456
</p>
</body>
See my example at https://stackoverflow.com/a/30750212/1069083

Reading through the GitHub code this morning, the process is that page elements are listed in a likelihood order - with <section>, <p>, <div>, <article> at the top of the list (i.e. most likely).
Then each of these "nodes" is given a score based on things such as comma counts and class names that apply to the node. This is a somewhat multi-faceted process where scores are added for text chunks but also scores are seemingly reduced for invalid parts or syntax. Scores in sub-parts of "node" are reflected in the score of the node as a whole. ie the parent element contains the scores of all lower elements, I think.
This score value decides if the HTML page can be "page viewed" in Firefox.
I am not absolutely clear if the score value is set by Firefox or by the readability function.
JavaScript is really not my strong point, and I think someone else should check over the link provided by Richard ( https://github.com/mozilla/readability ) and see if they can provide a more thorough answer.
What I did not see but expected to see was score based on amount of text content in a <p> or a <div> (or other) relevant tags.
Any improvements on this question or answer, please share!!
EDIT:
Images in <div> or <figure> tags (HTML5) within the <p> element appear to be retained in the Reader View when the page text content is valid.

I followed Martin's link to the Readability.js GitHub repository, and had a look at the source code. Here's what I make of it.
The algorithm works with paragraph tags. First of all, it tries to identify parts of the page which are definitely not content - like forms and so on - and removes them. Then it goes through the paragraph nodes on the page and assigns a score based on content-richness: it gives them points for things like number of commas, length of content, etc. Notice that a paragraph with fewer than 25 characters is immediately discarded.
Scores then "bubble up" the DOM tree: each paragraph will add part of its score to all of its parent nodes - a direct parent gets the full score added to its total, a grandparent only half, a great-grandparent a third and so on. This allows the algorithm to identify higher-level elements which are likely to be the main content section.
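As a rough illustration of the scoring and bubbling described above, here is a simplified sketch. It is not Readability.js's actual code: the weights in scoreParagraph, the plain-object tree and both function names are assumptions made for the example.

```javascript
// Score one paragraph's text. The exact weights are illustrative assumptions.
function scoreParagraph(text) {
  const trimmed = text.trim();
  if (trimmed.length < 25) return 0; // too short to count as content
  const commas = trimmed.split(",").length - 1;
  const lengthBonus = Math.min(Math.floor(trimmed.length / 100), 3);
  return 1 + commas + lengthBonus;
}

// Add a paragraph's score to its ancestors: the parent gets all of it,
// the grandparent half, the great-grandparent a third, and so on.
function bubbleScore(node, score) {
  let ancestor = node.parent;
  let level = 1;
  while (ancestor) {
    ancestor.score = (ancestor.score || 0) + score / level;
    ancestor = ancestor.parent;
    level += 1;
  }
}

// Hypothetical miniature tree: body > article > p
const body = { name: "body", parent: null };
const article = { name: "article", parent: body };
const para = {
  name: "p",
  parent: article,
  text: "Reader View scores paragraphs, counts commas, and rewards longer text.",
};
bubbleScore(para, scoreParagraph(para.text));
console.log(article.score, body.score);
```

With many paragraphs under one container, that container accumulates the largest total, which is how a single "main content" element ends up standing out.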
Though this is just Firefox's algorithm, my guess is if it works well for Firefox, it'll work well for other browsers too.
In order for these Reader View algorithms to work for your website, you want them to correctly identify the content-heavy sections of your page. This means you want the more content-heavy nodes on your page to get high scores in the algorithm.
So here are some rules of thumb to improve the quality of the page in the eyes of these algorithms:
Use paragraph tags in your content! Many people tend to overlook them in favor of <br /> tags. While it may look similar, many content-related algorithms (not only Reader View ones) rely heavily on them.
Use HTML5 semantic elements in your markup, like <article>, <nav>, <section>, <aside>. Even though they're not the only criterion (as you noted in the question), these are very useful to computers reading your page (not just Reader View) to distinguish different sections of your content. Readability.js uses them to guess which nodes are likely or unlikely to contain important content.
Wrap your main content in one container, like an <article> or <div> element. This will receive score points from all the paragraph tags inside it, and be identified as the main content section.
Keep your DOM tree shallow in content-dense areas. If you have a lot of elements breaking your content up, you're only making life harder for the algorithm: there won't be a single element that stands out as being parent of a lot of content-heavy paragraphs, but many separate ones with low scores.

Programmatically "Fit" Variable Length HTML Content to a Single Printed Page?

I have a bit of a strange question. My apologies if this has already been asked and answered...I think part of my problem is that I don't really know what exactly I should be searching for since I don't even remotely know what the right approach is!
I have a website that has HTML pages containing product reviews. Each review has about 15 standard text fields, such as Strengths, Weaknesses, Summary, etc. Each of these text fields is generally approximately the same number of words from one review to the next, but they do vary in length by +/- 20% or so. Right now, when I print them, some of them take one page and some of them take two pages.
I'm trying to come up with a decent way to print each of these product reviews such that each one always fits on one sheet of paper. I'm OK with making some assumptions, such as assuming a certain paper size and orientation. What I'm imagining is that each of my review's text fields (Strengths, for example) will have a certain "box" on the printed page that it can occupy and I'll have some code that programmatically resizes the font or adjusts the vertical line spacing (or perhaps just truncates the text and adds "..." at the end) until it fits into the "box".
I'm just looking for some pointers on what the most sensible approaches might be for this sort of thing. For example, here are some of the random thoughts that come to my mind:
1) Is there anything that can be done with CSS in a print style sheet to do this kind of dynamic resizing and/or truncating automatically?
2) I'm up for having a button on each page that says "Print" that when clicked generates a new page with completely different markup that is optimized for what I'm trying to do. All of the data in these pages is stored in a database, so this would be an acceptable solution. If I do end up opting for this option, would it be most sensible to try to lay this out using HTML tables, divs, or something else?
4) I'm wondering if I can do the programmatic resizing using JavaScript. Is there some kind of function or library that is used for this sort of thing (calculating how much space a block of text needs)? If so, is this a fairly reliable way of achieving what I'm shooting for?
5) Is it better to do what I'm trying to do on the server side somehow? I'm using PHP, if that helps.
6) If all else fails, is there a way to programmatically generate PDF pages server side that I can layout per my one-page requirement? Is there a good PHP library out there for that sort of thing?
Thanks in advance for any suggestions and pointers! As you can tell, I'm pretty lost on what path I should start down!

Without using CSS3 you could:
1) Store the original font sizes in variables using JavaScript
2) Get the document height through JavaScript
var body = document.body,
    html = document.documentElement;
// Take the largest of the reported heights, since browsers disagree on which one covers the full document.
var height = Math.max(body.scrollHeight, body.offsetHeight,
    html.clientHeight, html.scrollHeight, html.offsetHeight);
3) Check to make sure your height is small enough to be printed on one page
4) If not start loop where you:
a) decrease each font-size by 1
b) Check height again
c) if good, then do window.print()
d) If still too much height continue in loop
5) Set all your font-sizes back to the original values.
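The loop in step 4 could be sketched roughly as follows. This is only an outline under stated assumptions: fitFontSize, its parameters and the measureHeight callback are hypothetical names, and the callback stands in for "apply this font size, then read the document height as in step 2".

```javascript
// Shrink a font size one step at a time until the measured height fits.
// `measureHeight(fontSize)` is a hypothetical callback that applies the size
// and returns the resulting height; `minSize` stops the text from becoming
// unreadably small if it never fits.
function fitFontSize(startSize, minSize, maxHeight, measureHeight) {
  let size = startSize;
  while (size > minSize && measureHeight(size) > maxHeight) {
    size -= 1; // step 4a: decrease by 1, then the loop checks the height again
  }
  return size;
}

// Stand-in for real layout: pretend the height grows linearly with font size.
const fakeHeightForSize = (size) => size * 80;
const fitted = fitFontSize(16, 8, 1000, fakeHeightForSize);
console.log(fitted);
```

In a page, the caller would print once the returned size has been applied and then restore the stored original sizes, as in step 5. What maxHeight should be for a given paper size depends on the printer and margins, so it would need to be found by experiment.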
Hope this helps.
:) David

CSS3 introduces the vw and vh units, which should suit your needs.
The downside is that support is IE9+ only.

How to determine font-weight bold without using getComputedStyle

I'm in the process of making an HTML text parser and I would like to be able to determine when a text node appears as a header (visually, not HTML headers).
One thing that can usually be said about headers is that they are emphasized - usually in one of two ways: bold font or larger font size.
I could get both corresponding CSS values using getComputedStyle(), but I want to avoid this because the parser needs high performance (has to run smoothly on, for example, Chromebooks) and getComputedStyle() is not particularly fast when looking through hundreds or thousands of nodes.
Figuring out a font size isn't too hard - I can just select the node with a range and check its client rects from range.getClientRects(). I haven't figured out a smart way to check font weight though, which is why I'm here.
Can anyone think of a higher-performance way of doing this than by using getComputedStyle()?
I'm aware this might not be possible - just looking to see if someone can think of an ingenious way to solve this problem.
Edit
This is for a Google Chrome extension only.

What you're aiming to do here is really messy. Since you want to determine if text is bold visually, on some devices, depending on how they render text, the whole system may just break!
A suggestion I have is to use the HTML5 data attributes - find out more here - and use it like so:
<div class="header" data-bold="yes">This will appear bold?</div>
Then, using JavaScript you can just go over all div elements with the data-bold attribute.
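A minimal sketch of that lookup, assuming you control the markup and have added the attribute yourself. The function names are made up; querySelectorAll and dataset are standard DOM features.

```javascript
// True when an element carries data-bold="yes".
function isMarkedBold(element) {
  return element.dataset !== undefined && element.dataset.bold === "yes";
}

// Collect the divs flagged as bold under `root`.
// `root` is anything with querySelectorAll, e.g. `document` in a browser.
function findMarkedBold(root) {
  return Array.from(root.querySelectorAll("div[data-bold]")).filter(isMarkedBold);
}
```

In a page this would be called as findMarkedBold(document). It only helps for markup that already carries the attribute, so it will not detect bold text on arbitrary third-party pages.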
I hope this helped ;)

contentEditable insert br when new line occurs

A contentEditable has automatic word wrapping, creating a new line when you reach the width of the editable area. This is great but I am parsing the contents of this afterwards and I need it to add a <br> when it does this. I have tried everything I can think of and I can't achieve this. Any help greatly received.

This is not possible, the word wrapping point is 'browser discretion' and as such susceptible to font size differences, fonts not being installed, font render engines, anti-aliasing settings etc. etc. The line-wrap point is, so to speak, 'not your problem' from the browser's perspective, and as such it doesn't give this info away.
Theoretically you could rebuild the content word-for-word in JS in a dynamically sized and similarly styled div, and monitor for when the height changes - that's where the newlines occur. It'd be a crap load of crappy code to achieve a dodgy result though.
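For what it's worth, the word-by-word rebuild could look roughly like this. It is a sketch under assumptions, not tested against real layout: findWrapIndexes and the measureHeight callback are hypothetical, and in a browser the callback would set the text on a hidden, identically styled div and return its offsetHeight.

```javascript
// Find the word indexes at which a new visual line starts, by rebuilding the
// text one word at a time and watching the measured height grow.
function findWrapIndexes(words, measureHeight) {
  const wrapIndexes = [];
  let current = "";
  let lastHeight = 0;
  words.forEach((word, index) => {
    current = current === "" ? word : current + " " + word;
    const height = measureHeight(current);
    if (index > 0 && height > lastHeight) {
      wrapIndexes.push(index); // this word pushed the text onto a new line
    }
    lastHeight = height;
  });
  return wrapIndexes;
}

// Stand-in for real layout: greedy wrapping at 12 characters per line,
// with each line 20 units tall.
function fakeHeightForText(text) {
  let lines = 1;
  let used = 0;
  for (const word of text.split(" ")) {
    const needed = used === 0 ? word.length : used + 1 + word.length;
    if (needed > 12 && used > 0) {
      lines += 1;
      used = word.length;
    } else {
      used = needed;
    }
  }
  return lines * 20;
}

const wraps = findWrapIndexes("the quick brown fox jumps over".split(" "), fakeHeightForText);
console.log(wraps);
```

Each returned index is the position of the first word on a new line, so a <br> would go immediately before that word. Measuring after every word means one layout pass per word, which is part of why this approach gets slow on long content.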
I can't help but feel like you're asking for an XY-solution here - if you need newlines at the given point, let the end user give them when he wants to. Simply adding overflow:auto;white-space:nowrap to the editable element forces them to. Example here.
