How to parse visually coherent text in rendered HTML?

The assumption is that we have access to a rendered DOM via JavaScript (for example, from the developer console once the page has loaded).
I want to extract text from a node in a way similar to how we humans interpret the content visually.
Example:
<div>
<span>This</span>
<span>Text</span>
<div>
<span>belongs together</span>
</div>
</div>
My algorithm should be able to recognize this text as one cluster if it is rendered as visually coherent.
So it should output "This Text belongs together" instead of ["This", "Text", "belongs together"].
Any ideas how to proceed?
I thought about computing the bounding rectangle for each text node and applying some clustering algorithm with the viewport dimensions as a reference point.

Your idea of using bounding rectangles and relating them is a good one.
This file from Chrome, spatial_navigation.cc, might interest you. "Spatial navigation" is a feature in some browsers where the focus doesn't move in tab order but in up-down-left-right space. It is analogous to your problem because it works over the DOM but cares about how the links appear, not the structure of the DOM.
If you examine the primitives spatial navigation is built from, they are:
Bounding rectangles.
Intersecting the viewport.
Whether a rectangle is to the right or below another one.
Whether something is obscured.
From those primitives higher level things are built up.
Some more details on intersecting the viewport: The viewport is the area that's presenting content. You can use window.innerWidth and window.innerHeight for the viewport dimensions in pixels and compute whether something is visible by accumulating the layout and scroll offsets of it and its parents; or use Intersection Observers to find out whether an element is in the viewport.
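A minimal sketch of the viewport test, assuming you already have the element's rectangle from getBoundingClientRect(). That rectangle is relative to the viewport, so no offset accumulation is needed in this variant. The name intersectsViewport is made up for illustration.

```javascript
// Returns true if a viewport-relative rectangle (shaped like the result of
// element.getBoundingClientRect()) overlaps a viewport of the given size.
// `intersectsViewport` is an illustrative name, not a browser API.
function intersectsViewport(rect, viewportWidth, viewportHeight) {
  var hasArea = rect.width > 0 && rect.height > 0;
  return hasArea &&
    rect.right > 0 && rect.bottom > 0 &&
    rect.left < viewportWidth && rect.top < viewportHeight;
}

// In a page you would call it roughly like this:
//   intersectsViewport(el.getBoundingClientRect(),
//                      window.innerWidth, window.innerHeight);
```

Note that this only says the rectangle overlaps the viewport; it says nothing about whether something else is drawn on top of it.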
Some more details on obscured nodes: In general, detecting obscured nodes is hard. display: none; is an easy case: those nodes will have an offsetWidth and offsetHeight of 0. Overlapped content is harder: detect how content collides and determine the z-index of what is on top. Hardest is near-transparent content, low-contrast content, and heavily filtered or transformed content.
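For the overlapped case, one rough probe is to ask the browser which element is topmost at the centre of the node's rectangle with document.elementFromPoint. The sketch below assumes a single sample point is representative (it is not for partially covered elements) and it ignores the near-transparent and filtered cases entirely. isHiddenAtCenter is a hypothetical helper name, and the document is passed in as a parameter.

```javascript
// Returns true if `el` is not the topmost thing at the centre of its own
// rectangle. `isHiddenAtCenter` is an illustrative name, not a browser API.
function isHiddenAtCenter(el, doc) {
  var rect = el.getBoundingClientRect();
  if (rect.width === 0 || rect.height === 0) {
    return true; // nothing is rendered for this element
  }
  var x = rect.left + rect.width / 2;
  var y = rect.top + rect.height / 2;
  // elementFromPoint returns null when the point lies outside the viewport.
  var topmost = doc.elementFromPoint(x, y);
  if (topmost === null) {
    return true;
  }
  // If the element itself or one of its descendants is on top, it is not covered.
  return !el.contains(topmost);
}

// In a page: isHiddenAtCenter(someElement, document)
```

Keep in mind that elementFromPoint skips elements with pointer-events: none, so an overlay styled that way will not be reported by this check.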
If you encounter a lot of tricky cases like this it might be simpler to capture the screen and perform OCR on it. This takes advantage of the browser's rendering pipeline to do all of the transforms and layering; you can find text in images; etc. The downside is the getDisplayMedia API doesn't work in all browsers yet and it interrupts the user with a prompt.
You can still look to OCR algorithms for inspiration. OCR has to solve a similar problem: once localized characters have been recognized, they have to be put into lines of text.
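The line-building step can be sketched over plain rectangles. This is only one possible heuristic, not what Chrome or any OCR engine does: fragments whose vertical midpoint falls inside the current line's band share a line, and consecutive lines are merged into one cluster when the vertical gap between them is at most maxGap pixels. The item shape ({text, left, top, right, bottom}), the function names and the maxGap parameter are assumptions for illustration, and horizontal distance (columns) is ignored.

```javascript
// Group text fragments into visual lines based on their rectangles.
function groupIntoLines(items) {
  var sorted = items.slice().sort(function (a, b) {
    return a.top - b.top || a.left - b.left;
  });
  var lines = [];
  sorted.forEach(function (item) {
    var line = lines[lines.length - 1];
    var mid = (item.top + item.bottom) / 2;
    if (line && mid >= line.top && mid <= line.bottom) {
      // Same vertical band as the current line: extend it.
      line.items.push(item);
      line.top = Math.min(line.top, item.top);
      line.bottom = Math.max(line.bottom, item.bottom);
    } else {
      lines.push({ top: item.top, bottom: item.bottom, items: [item] });
    }
  });
  lines.forEach(function (line) {
    line.items.sort(function (a, b) { return a.left - b.left; });
  });
  return lines;
}

// Merge consecutive lines into clusters when the gap between them is small.
function clusterText(items, maxGap) {
  var clusters = [];
  var previous = null;
  groupIntoLines(items).forEach(function (line) {
    var text = line.items.map(function (i) { return i.text; }).join(' ');
    if (previous && line.top - previous.bottom <= maxGap) {
      clusters[clusters.length - 1] += ' ' + text;
    } else {
      clusters.push(text);
    }
    previous = line;
  });
  return clusters;
}
```

In a page, the rectangles for text nodes could come from a Range: call range.selectNodeContents(textNode) and then range.getBoundingClientRect(). A reasonable starting value for maxGap is something below the line height, tuned by experiment.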

You can get your elements with getElementsByTagName or getElementsByClassName. Both return an array-like collection of elements, so you need to loop over it. In JavaScript, use the innerText property to get the text of each element.
var msg = "";
var els = document.getElementsByTagName("span");
for (var i = 0; i < els.length; i++) { // declare i so it does not leak as a global
  msg += els[i].innerText;
}
console.log(msg);

Related

Write a completely fluid HTML page (using '%' instead of 'px' for EVERY element height/width)

I am designing my HTML pages to be completely fluid:
For every element in the mark-up (HTML), I am using 'style="height:%;width:%"' (instead of 'style="height:*px;width:*px"').
This approach seems to work pretty well, except for when changing the window measurements, in which case, the web page elements change their position and end up "on top of each other".
I have come up with a pretty good run-time (JavaScript) solution to that:
var elements = document.getElementsByTagName("*");
for (var i=0; i < elements.length; i++)
{
if (elements[i].style.height) elements[i].style.height = elements[i].offsetHeight+"px";
if (elements[i].style.width ) elements[i].style.width = elements[i].offsetWidth +"px";
}
The only remaining problem is that if the user opens the website by entering the URL into a non-maximized window, the page fits that portion of the window.
Then, when maximizing the window, the page remains in its previous measurements.
So in essence, I have solved the initial problem (when changing the window measurements), but only when the window is initially in its maximum size.
Any ideas on how to tackle this problem? (given that I would like to keep my "% page-design" as is).

I think the real answer is that "completely fluid design" isn't synonymous with "just use percentage measurements for everything". You will need to consider:
Exactly what each specific element should do when the window changes size
Some elements may need to appear/disappear when the screen is resized
Elements should likely have min- and max-widths specified
What happens on a small (e.g. 480x800 mobile) display?
What happens on a large (e.g. 2560x1600 monitor) display?
...amongst other things. There is no generic solution that you can just apply to every element to make fluid design work.

Finding the first word that browsers will classify as overflow

I'm looking to build a page that has no scrolling, and will recognize where the main div's contents overflow. The code will remember that point and construct a separate page that starts at that word or element.
I've spent a few hours fiddling, and here are the approaches that past questions employ:
1. Clone the div, incrementally strip words out until the clone's height/width becomes less than the original's.
Too slow. I suppose I could speed it up by exponentially stripping words and then slowly filling it back up--running past the target then backtracking slowly till I hit it exactly--but the approach itself seems kind of brute force.
2. Do the math on the div's dimensions, calculate out how many ems will fit horizontally and vertically.
Would be good if all contents were uniform text, à la a book, but I'm expecting to deal with headlines and images and whatnot, which throws a monkey wrench into this one. Also complicated by browsers' different default font preferences (100%? 144%?).
3. Render items as tokens, stop when the element in question (i.e. one character) is no longer visible to the user onscreen.
This would be my preferred approach, since it'd just involve some sort of isVisible() check on rendered elements. I don't know if it's consistent with how browsers opt to render, though.
Any recommendations on how this might get done? Or are browsers designed to render the whole page length before deciding whether a scrollbar is needed?
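The "strip exponentially, then fill back up" variant of approach 1 is essentially a binary search over the word count. A minimal sketch, assuming fitting is monotone (if the first n words fit, any shorter prefix fits too). The fits predicate is supplied by the caller and is hypothetical here; in a browser it would put the first n words into the clone and compare its scrollHeight with its clientHeight.

```javascript
// Returns the largest n in [0, wordCount] for which fits(n) is true,
// i.e. the number of leading words that fit. The word at index n
// (zero-based) is then the first one that overflows.
// Assumes fits(0) is true (an empty container never overflows).
function countFittingWords(wordCount, fits) {
  var lo = 0;
  var hi = wordCount;
  while (lo < hi) {
    var mid = Math.ceil((lo + hi) / 2); // always > lo, so the loop terminates
    if (fits(mid)) {
      lo = mid;      // the first `mid` words fit; try more
    } else {
      hi = mid - 1;  // too many words; try fewer
    }
  }
  return lo;
}
```

This needs roughly log2(wordCount) layout measurements instead of one per word, which addresses the "too slow" objection without changing the basic idea.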

Instead of cloning the div, you could just have an overflow:hidden div and set div.scrollTop += div.clientHeight each time you need to advance a 'page'. (Even though the browser will show no scrollbar, you can still programmatically cause the div to scroll.)
This way, you let the browser handle what it's designed to do (flow of content).
Here's a snippet that will automatically advance through the pages:
var div = $('#pages'), h = div.height(), len = div[0].scrollHeight, p = $('#p');
setInterval(function() {
var top = div[0].scrollTop += h;
if (top >= len) top = div[0].scrollTop = 0;
p.text(Math.floor(top/h)+1 + '/' + Math.ceil(len/h)); // Show 'page' number
}, 1000);
You could also do some fiddling to make sure that a 'page' does not start in the middle of a block-level element if you don't want (for example) headlines sliced in half. Unfortunately, it will be much harder (perhaps impossible) to ensure that a line of text isn't sliced in half.
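The block-level fiddling could look roughly like this: instead of always advancing by one page height, compute the scrollTop values up front so that a page starts at the top of any block that would otherwise be cut. This is a sketch under assumptions: blocks is a list of {top, height} measured relative to the scrolling div (for example from offsetTop and offsetHeight), sorted top to bottom, and pageStarts is a made-up name. Blocks taller than one page are still split.

```javascript
// Returns the scrollTop value for each page so that no block that fits on
// a page is cut by a page boundary.
function pageStarts(blocks, pageHeight) {
  var starts = [0];
  var pageTop = 0;
  blocks.forEach(function (block) {
    var bottom = block.top + block.height;
    // The block would be cut by the end of the current page:
    // begin a new page at the top of the block.
    if (bottom > pageTop + pageHeight && block.top > pageTop) {
      pageTop = block.top;
      starts.push(pageTop);
    }
    // A block taller than a page still has to be split across pages.
    while (bottom > pageTop + pageHeight) {
      pageTop += pageHeight;
      starts.push(pageTop);
    }
  });
  return starts;
}
```

When a page starts early, the part of the next block that would peek in at the bottom of the previous page still has to be hidden somehow, for instance by shrinking the visible height of the div for that page.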

JavaScript: Given the DOM, find the largest piece of continuous text (content part)

The goal is to find the largest piece of contiguous text in a document. The problem is that the largest piece does not lie under a single element (e.g. a blog post that has <p> tags in it), so iterating over nodes and comparing their innerHTML is not going to work. And if I compare the innerText of elements, the root node always contains the most text. So how should one accomplish that?
Thanks

Your problem can be complicated because if there is a div that contains 2 words, plus another <p> inside the div with 200 words in it, then do you count the div having 202 words, or do you count the p having 200 words and therefore is the biggest?
If there are 4 borders for p, then it can make sense to say it is p with 200 words. If there is no border, then it makes sense to say it is div with 202 words.
You can try writing a function to traverse down a node, and if there is any block element with 4 borders, then don't include its word count in the parent's total.
Things can be more complicated if there are floated divs, which are set to display:inline to work around an IE 6 bug. Or if there are borders, but the color is the same as the background color of the containing div.
If you don't care about the inside elements having borders, then one attempt can be just to look at the immediate children of body, and find out how many characters there are inside of it (sum of text under all descendants, probably using innerText or innerHTML and strip all the tags).
You might also look into finding the element with the biggest area (width x height) if you are looking for the content section, unless there is a long and narrow sidebar or ad section to the left and right, with the content area wide but really short.

The most time effective tactic in screen scraping is always to define templates for each instance of what you are scraping. Considering that most pages these days have a "content" container, all you have to do is add the name of the "content" div for each of your sources. If you are scraping blogs it also becomes much easier as you can create rules for most popular blogging systems as they usually have the same content container across implementations. So you can try defaults first and if they come up empty log the url and manually identify the container.
If you really want to automate this, you will probably (and I am guessing here) need to compare the size and type of sibling nodes at each level of the DOM tree and only follow the largest branch. When you hit a level where all the siblings are text nodes, the container for these is most likely your "main content" container. You can accomplish this using jQuery for node iteration or just "normal" JavaScript DOM functions.
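A sketch of the "follow the largest branch" idea using plain DOM properties (nodeType, childNodes, data). It includes one variation that is not in the description above: it stops descending when no single child element holds at least minShare of the current node's text, so that a post made of several <p> elements resolves to their container rather than to the longest paragraph. The function names and the minShare default of 0.5 are assumptions to tune.

```javascript
// Total length of all text below a node (whitespace-only ends trimmed).
function textLength(node) {
  if (node.nodeType === 3) { // Node.TEXT_NODE
    return node.data.trim().length;
  }
  var total = 0;
  for (var i = 0; i < node.childNodes.length; i++) {
    total += textLength(node.childNodes[i]);
  }
  return total;
}

// Walk down from root, always into the child element with the most text,
// until no child element dominates its parent any more.
function findContentContainer(root, minShare) {
  var share = minShare === undefined ? 0.5 : minShare;
  var current = root;
  for (;;) {
    var total = textLength(current);
    var largest = null;
    var largestSize = 0;
    for (var i = 0; i < current.childNodes.length; i++) {
      var child = current.childNodes[i];
      if (child.nodeType !== 1) continue; // only element nodes are branches
      var size = textLength(child);
      if (size > largestSize) {
        largest = child;
        largestSize = size;
      }
    }
    if (largest === null || largestSize < total * share) {
      return current;
    }
    current = largest;
  }
}
```

On a real page you would call findContentContainer(document.body). Script and style elements contain text nodes too, so you may want to skip them when measuring.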

When I started out typing this answer, I was going to write that it is pretty simple.
I was thinking about cloneNode(false). Then I thought about text nodes, then the normalize function, and then the case when text nodes aren't adjacent.
Apart from recursing through the entire DOM, you will have to do the following for each element node (nodeType == 1):
var ElLength = 0; // an element's own nodeValue is null, so start counting from zero
if (thisEl.hasChildNodes()) {
  for (var i = 0; i < thisEl.childNodes.length; i++) {
    var node = thisEl.childNodes[i];
    if (node.nodeType == 3) { // text node
      ElLength += node.data.length;
    }
  }
}
Then you'll have to remember the largest ElLength and the corresponding element.
It's gonna be slow if your DOM is huge.
Code hasn't been tested... I wrote it just to give an example
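For completeness, here is how the recursion and the "remember the largest" part might fit around that per-element count. It is a sketch that only relies on nodeType, childNodes and data; the function names are made up. It counts only the text nodes that are direct children of each element, as described above.

```javascript
// Length of the text nodes that are direct children of el.
function ownTextLength(el) {
  var length = 0;
  for (var i = 0; i < el.childNodes.length; i++) {
    var node = el.childNodes[i];
    if (node.nodeType === 3) { // Node.TEXT_NODE
      length += node.data.length;
    }
  }
  return length;
}

// Visit every element below root and keep the one with the most own text.
function findLargestOwnText(root) {
  var best = { element: null, length: -1 };
  (function visit(el) {
    var length = ownTextLength(el);
    if (length > best.length) {
      best = { element: el, length: length };
    }
    for (var i = 0; i < el.childNodes.length; i++) {
      if (el.childNodes[i].nodeType === 1) { // Node.ELEMENT_NODE
        visit(el.childNodes[i]);
      }
    }
  })(root);
  return best;
}
```

Usage in a page would be findLargestOwnText(document.body).element. Because adjacent text is often split across inline elements like <a> or <em>, the result can be smaller than what a reader perceives as one block.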

How to distinguish between blank areas and non-blank areas in a webpage with JavaScript?

How to distinguish between blank areas and non-blank areas in a webpage with JavaScript? Blank areas including:
areas that are not occupied by DOM elements.
margins, borders and paddings of DOM elements.
EDIT:
As response to the first comment: I am working on a web-based ebook reader. Cursor is set to {cursor:move} for blank areas so that the user can drag and scroll the webpage.

You could recursively go through each element and attach onmouseover and onmouseout events (where the former enables the text cursor and the latter enables the move cursor) on each which has text in it, e.g:
function attachEvents(e) {
  if (e.nodeType == 3) { // Node.TEXT_NODE
    // A text node - add the parent as an element to select text in
    e.parentNode.onmouseover = elmMouseOver; /* define your own event fns */
    e.parentNode.onmouseout = elmMouseOut;
  }
  else if (e.nodeType == 1) { // Node.ELEMENT_NODE
    for (var m = e.firstChild; m != null; m = m.nextSibling) {
      attachEvents(m);
    }
  }
}
The best way I can think of to make sure it's actually "text" which is moused over and not a blank area is to use e.g. <div><span>content</span></div> and put the mouseover/mouseout events in the <span> so that blank areas don't trigger events. This is what I'd recommend doing if you can, as things can get very complicated if you use block elements with padding from my experience. For example:
| The quick brown fox jumps |
| over the lazy dog | <- onmouseover/out of SPANs will ignore the space
after "dog" while DIVs won't and you won't need
to calculate padding/margins/positions which
makes it faster and more simple to implement
If you have to use block DIVs: you could use something like jQuery's jSizes plugin to get margins/padding in pixels, or read the inherited CSS values and parse them yourself (removing the px part from the end, etc.).
After that, you could figure out the position using position() in jQuery. I personally don't use jQuery for my stuff, but I do use those specific "find position" functions and have found them to be among the best, I think in large part because of the number of users testing them.
Good luck!

My advice would be to go for a simple scrollbar. That's far more foolproof. By trying to implement the cool drag-and-scroll feature you risk a lot of buggy behavior in dozens of edge cases none of us can even imagine.
If you really want to detect clicks in whitespace, you could try attaching to the onmousedown/onmouseup/onmousemove events for the page itself. JavaScript events bubble nicely, so you'll handle the whole page at once (unless it has some JavaScript in itself, in which case you're screwed anyway). These events supply both the mouse X/Y coordinates and the element that was clicked. Then you can check for padding of that element (careful with inline elements) and figure out if it's in the padding or not. You do not need to check the margin because clicking there will instead originate the click in the parent element.
With this approach, though, you end up with a lot of scattered "drag zones" that the user will have to hunt for in order to scroll the page. I doubt this will sit well with your users. It would be better to make the whole page "draggable", but then you will lose the ability to select text. Or do it like Acrobat, which only allows grabbing in the considerable padding area of the page itself (then you should make sure that there is a considerable padding area). Which in my eyes is not much better than a scrollbar. :P
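The padding check mentioned above can be reduced to a small geometric test. A sketch, assuming a block-level element: rect is its border-box rectangle in viewport coordinates and insets holds the combined border and padding width per side in pixels. Both the isInContentBox name and the insets shape are made up for illustration.

```javascript
// Returns true if the point (x, y) lies inside the content box of an
// element, i.e. inside its rectangle but not in its border or padding.
function isInContentBox(x, y, rect, insets) {
  return x >= rect.left + insets.left &&
         x < rect.right - insets.right &&
         y >= rect.top + insets.top &&
         y < rect.bottom - insets.bottom;
}

// In a mousedown handler this could be fed roughly as follows:
//   var rect = event.target.getBoundingClientRect();
//   var cs = window.getComputedStyle(event.target);
//   var insets = {
//     top: parseFloat(cs.paddingTop) + parseFloat(cs.borderTopWidth),
//     right: parseFloat(cs.paddingRight) + parseFloat(cs.borderRightWidth),
//     bottom: parseFloat(cs.paddingBottom) + parseFloat(cs.borderBottomWidth),
//     left: parseFloat(cs.paddingLeft) + parseFloat(cs.borderLeftWidth)
//   };
//   var onBlank = !isInContentBox(event.clientX, event.clientY, rect, insets);
```

This does not handle inline elements that wrap across lines, since those are made up of several rectangles, and it treats empty space inside the content box (such as the rest of a short last line) as non-blank.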

JavaScript: Check width of a <div> object before placing it

Consider:
$("#PlotPlace").append('<div style="position:absolute;left:200px;top:40px;font-size:smaller">Hello world!</div>');
I need to execute that line only if the width of the resultant text would be less than 60px. How can I check the width before placing the object?

Unfortunately, the div will only have a width value once it is rendered into the DOM.
I would append that content to an inconspicuous area of the document, perhaps even absolutely positioned so that no flow disruption occurs, and make sure that it is set to "visibility:hidden". That way it will be inserted into the DOM and be rendered, but be invisible to the viewer.
You can then check the width on it, and move it into position and set it to "visibility:visible" at that point. Otherwise, you can remove it from the document.

Maybe you can append it invisibly, then check its width, and then decide whether to show or remove it.

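// Append far off-screen first so the browser lays the div out and it has a measurable width
var div = $('<div style="position:absolute;left:9001px;top:40px;font-size:smaller">Hello world!</div>').appendTo("#PlotPlace");
if (div.width() < 60) {
  div.css({left: 200}); // narrow enough: move it into place
} else {
  div.remove(); // too wide: take it back out
}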

Sounds like something you'd have to hack. I don't believe the JavaScript runtime in any browser has an event you can hook into in between calculating the layout and displaying the element, so you can add it in a way that it can't be seen and doesn't affect the height (doesn't cause additional scrolling), and then show/hide it based on the width at this point. It's hacky and ugly, but because you don't have many event hooks it might be the only way to do it.

You can't. At least not that easily. The text you insert is written in a specific font, which must be rendered by the browser; only then do you know the width of the element. By the way, what exactly do you want to insert with such a restriction? Wouldn't it be simpler to cut the text to fit within the output parameters?
