Preserve DOM elements position while removing texts - javascript

I am looking for a solution where I can remove text (or replace it with placeholder characters) in the DOM while the position of all DOM elements remains the same.
Background
My project captures the full source code of sensitive web pages. The sensitive data itself does not matter and needs to be removed before the page is transmitted to the server. The captured source code is later used to recreate what the administrator was seeing (without the text).
Example
Assume this is a page:
<div>Some text here
<input type="button" value="some other text" />
some more text
</div>
So it will be rendered like this by browser:
some text here [some other text]some more text
I need it to be like this:
------ ------ ------ [------- ------ ------]------- ------- ------
Current buggy approach
Currently, I get the text in the DOM, count the characters between each space, and replace those characters with dashes. Unfortunately, it renders like this:
---- ---- --- [---- ----- ----]---- ---- ----
As you can see, the position of the button and link is completely different from the original.
Purpose
The main purpose is to recreate the DOM later for UX purposes, without transmitting any text that might contain sensitive information to the server. The text can be removed completely, replaced with any character (I used - in this example), or replaced with other text such as "Lorem ipsum", as long as the original text is completely removed from the source code and the exact location of each DOM element is preserved.
It is used to record mouse click and mouse move positions (X, Y) and show them as a click/move heat map.
Restrictions
I am not able to change fonts or code on the target web pages, and each page and each element might use a different font.
Ideas?
Can anyone come up with an idea for this?
The issue is that - has a different width than the characters used in the real text.
I have thought of scrambling the words in every sentence, which would preserve the final total width of the text. However, someone might be able to reshuffle them back to the original, which is a security/privacy risk.
I have thought of replacing each word with a number of dashes based on the word's size (this is what I currently do), but how do I get the rendered size of each word in its DOM element? Each DOM element might use a different font, and therefore a different width for each character, and creating a hidden div next to each element just to measure its text width could cause big performance issues.
On each parent element that has text in it, get the computed style for font-size, font-family and letter-spacing and apply it to a new div to measure the width of a space in that font. Then put the original text in that div and measure its width. Divide the original text width by the space width to find how many spaces are needed to produce the same width, and generate those spaces. The issue here is that on pages with a lot of text this will be overkill for browser performance.
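The measuring idea above could be sketched without a hidden div by using a canvas 2D context, which measures text without touching page layout. This is only a sketch under assumptions: the names maskTextToWidth and createCanvasMeasure are hypothetical, the canvas measurement does not account for letter-spacing, and rounding means the masked width only approximates the original.

```javascript
// Replace every word with filler characters whose total measured width is
// as close as possible to the width of the original word. Whitespace is
// kept as-is so that line-wrapping opportunities stay in the same places.
// `measure` is any function that returns the rendered width of a string.
function maskTextToWidth(text, measure, filler = '-') {
  const fillerWidth = measure(filler);
  if (!(fillerWidth > 0)) {
    throw new Error('filler has no measurable width');
  }
  return text.replace(/\S+/g, (word) => {
    // Round to the nearest whole filler character; keep at least one.
    const count = Math.max(1, Math.round(measure(word) / fillerWidth));
    return filler.repeat(count);
  });
}

// Browser-only helper (assumption: a canvas 2D context is available).
// It measures strings in the element's computed font.
function createCanvasMeasure(element) {
  const style = window.getComputedStyle(element);
  const ctx = document.createElement('canvas').getContext('2d');
  ctx.font = `${style.fontStyle} ${style.fontWeight} ${style.fontSize} ${style.fontFamily}`;
  return (s) => ctx.measureText(s).width;
}
```

In a browser, one measure function could be created per element (or cached per distinct font) and passed to maskTextToWidth for each of that element's text nodes.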
Your idea?

Try with this:
// Select 'div', 'a' and 'input' elements.
// You can add more elements or even select everything with '*'.
$('div,a,input').each(function () {
  // Mask each direct child text node in place, so the order of text
  // and child elements is preserved and no text node is skipped.
  $(this).contents().each(function () {
    if (this.nodeType === Node.TEXT_NODE) {
      this.nodeValue = this.nodeValue.replace(/[a-zA-Z0-9]/g, '-');
    }
  });
  // For input tags, replace the value instead.
  if ($(this).is('input')) {
    $(this).val($(this).val().replace(/[a-zA-Z0-9]/g, '-'));
  }
});
Here is a JSFiddle Demo
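One caveat with the regex above is that it only masks ASCII letters and digits, so punctuation and text in other scripts would still be transmitted. A minimal sketch of a stricter variant follows; maskAllText and maskTextNodes are hypothetical names, and skipping SCRIPT and STYLE contents is my own assumption about what should stay untouched.

```javascript
// Replace every non-whitespace code point with a filler character.
// The `u` flag makes the regex match whole code points, so a character
// outside the Basic Multilingual Plane (such as an emoji) becomes a
// single filler instead of two.
function maskAllText(text, filler = '-') {
  return text.replace(/\S/gu, filler);
}

// Browser-only usage: mask every text node under `root` in place.
function maskTextNodes(root) {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  let node;
  while ((node = walker.nextNode())) {
    const tag = node.parentNode && node.parentNode.nodeName;
    if (tag === 'SCRIPT' || tag === 'STYLE') continue; // not visible text
    node.nodeValue = maskAllText(node.nodeValue);
  }
}
```

Like the jQuery version, this keeps the character count but not the rendered width, so it addresses the privacy part of the question rather than the positioning part.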

Related

After Effects Script: Inherit multiple character styles from parent text to child

I have an After Effects script that copies the style and text of a text layer to another text layer in a separate comp. This works great if the parent text layer only contains one style, but I need one that is half bold and half normal. Is there a way for the script to loop through each character in the parent and apply that style to the corresponding character in the child?
This is what I have now - I found it in this tutorial: https://blog.adobe.com/en/publish/2020/01/24/after-effects-2020-express-yourself-and-your-text.html#gs.zzpqgg
var parentText = comp("Precomp - People").layer("Single Title").text.sourceText;
var parentStyle = comp("Precomp - People").layer("Single Title").text.sourceText.style;
parentStyle.setText( parentText );
I did some research into this a while back and came to the conclusion that this is not possible in either a script or extension.

Contenteditable Div - Cursor position in terms of innerHTML position

I've done my research and come across questions on Stack Overflow where people asked something similar, but they wanted the position either as x and y coordinates or as a column from the left. I want to know the position of the cursor with respect to the div's innerHTML.
For example:
innerHTML = "This is the innerHTML of the <b>div</b> and bla bla bla..."
^
Cursor is here
So the result I want for this case is 44. How can I do this?
var target = document.createTextNode("\u0001");
document.getSelection().getRangeAt(0).insertNode(target);
var position = contentEditableDiv.innerHTML.indexOf("\u0001");
target.parentNode.removeChild(target);
This temporarily inserts a dummy text node containing a non-printable character (\u0001), and then finds the index of that character within the div's innerHTML.
For the most part this leaves the DOM and the current selection unchanged, with one minor possible side effect: if the cursor is in the middle of text from a single text node, that node will be broken up into two consecutive text nodes. Usually that should be harmless, but keep it in mind in the context of your specific application.
UPDATE: Turns out you can merge the consecutive text nodes using Node.normalize().
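Putting the marker trick and the normalize() clean-up together, a minimal sketch could look like this. The function name caretIndexInInnerHTML is hypothetical, and the selection and doc parameters exist only so the function can be exercised outside a browser; in a page they default to the real globals.

```javascript
// Returns the caret index within container.innerHTML, or -1 if there is
// no selection (or the caret is not inside the container).
function caretIndexInInnerHTML(container, selection = window.getSelection(), doc = document) {
  if (!selection || selection.rangeCount === 0) return -1;
  const marker = doc.createTextNode('\u0001');
  // Insert the invisible marker at the start of the current range.
  selection.getRangeAt(0).insertNode(marker);
  const index = container.innerHTML.indexOf('\u0001');
  const parent = marker.parentNode;
  parent.removeChild(marker);
  // Merge the text nodes that the insertion may have split apart.
  parent.normalize();
  return index;
}
```

Note that the index refers to the serialized HTML, so characters such as & count with the length of their escaped form (&amp;).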

Find location of line on screen

So I have HTML text being rendered in a browser (in this case an Android WebView). I want to find out what the (x,y) location in pixels of any given line of text is AFTER it is rendered. The working definition of line I am using is not just all the text contained in a <p> tag or that appears before a <br> tag. I mean a line as it would appear to the user.
I am open to any suggested method.
Is there any CSS property that lets you find the number of lines in a div and their respective heights? That would provide a workable solution.
Thanks!!
You can't access a line individually. But, with some JavaScript, you can find the position of a line with a known index; here's a basic outline:
var p = document.getElementById("ptag"); // the text container that contains your line
var nthline = 3; // the line whose position you'd like to find (1-based)
var lnheight = parseFloat(window.getComputedStyle(p).lineHeight); // the height of each line in pixels
var linepos = [p.offsetLeft, p.offsetTop + lnheight * (nthline - 1)]; // a [left, top] pair that represents the line's position
Note: This assumes the container doesn't contain anything but text and that its line-height is set explicitly; if the computed line-height is the keyword "normal", parsing it yields NaN.
There is no standard way of doing that; you will have to use your imagination and invent some hack. Right now I can think of two ideas for this:
Enclose each word within a span, like <span class="word">word</span>, which can easily be done with a regex or string functions. Later, loop over each <span> reading its position; with some calculation you can find out how many lines there are, where a line starts (a word whose top position increased from the previous one) and where a line ends (the last word of the line plus the width of that word).
Apply some style to the first line using the :first-line pseudo-element, like
p:first-line { background-color: white; /* same as the existing color, so the display is not affected */ }
and later find in the DOM which text that style was applied to. This idea is not as good as the first one, but maybe it can make you think of other ways.
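The first idea (one span per word) could be sketched roughly as follows. The names groupWordsIntoLines and measureLines are hypothetical, and the sketch assumes uniformly sized text, so that every word on a line reports the same top value.

```javascript
// Group word boxes into visual lines. Each box needs `top`, `left` and
// `width` in pixels, and the boxes must be in document order.
function groupWordsIntoLines(wordBoxes) {
  const lines = [];
  let current = null;
  for (const box of wordBoxes) {
    // A word whose top is lower than the current line's top starts a new line.
    if (current === null || box.top > current.top) {
      current = { top: box.top, left: box.left, right: box.left + box.width };
      lines.push(current);
    } else {
      // Same line: extend its right edge to the end of this word.
      current.right = Math.max(current.right, box.left + box.width);
    }
  }
  return lines;
}

// Browser-only usage (assumption: words are already wrapped in span.word).
function measureLines(container) {
  const boxes = Array.from(container.querySelectorAll('span.word'), (span) => {
    const r = span.getBoundingClientRect();
    return { top: r.top, left: r.left, width: r.width };
  });
  return groupWordsIntoLines(boxes);
}
```

The returned array has one entry per rendered line, so its length is the line count and each entry's top and left give that line's position relative to the viewport.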

JavaScript: Given the DOM, find the largest piece of continuous text (content part)

The goal is to find the largest piece of contiguous text in a document. The problem is that the largest piece does not lie under a single element, e.g. a blog post which has <p> tags in it so iterating nodes and comparing innerHTMLs is not going to work. And by getting innerText of an element, the root node always contains biggest text. So how should one accomplish that?
Thanks
Your problem can be complicated because if there is a div that contains 2 words, plus another <p> inside the div with 200 words in it, then do you count the div having 202 words, or do you count the p having 200 words and therefore is the biggest?
If there are 4 borders for p, then it can make sense to say it is p with 200 words. If there is no border, then it makes sense to say it is div with 202 words.
You can try writing a function to traverse down a node, and if there is any block element with 4 borders, then don't include the word counts.
Things can be more complicated if there are floated divs, which are set to display:inline to work around an IE 6 bug. Or if there are borders, but the color is the same as the background color of the containing div.
If you don't care about the inside elements having borders, then one attempt can be just to look at the immediate children of body, and find out how many characters there are inside of it (sum of text under all descendants, probably using innerText or innerHTML and strip all the tags).
You might also look into finding the biggest element with the biggest area (width x height), if you are looking for the content section, unless there is a long and narrow sidebar or ad section to the left and right, with the content area wide but really short.
The most time effective tactic in screen scraping is always to define templates for each instance of what you are scraping. Considering that most pages these days have a "content" container, all you have to do is add the name of the "content" div for each of your sources. If you are scraping blogs it also becomes much easier as you can create rules for most popular blogging systems as they usually have the same content container across implementations. So you can try defaults first and if they come up empty log the url and manually identify the container.
If you really want to automate this you will probably (and I am guessing here) need to compare the size and type of sibling nodes at each level of the DOM tree and only follow the largest branch. When you hit a level where all the siblings are text nodes, the container for these is most likely your "main content" container. You can accomplish this using jQuery for node iteration or just "normal" JavaScript DOM functions.
When I started out typing this answer, I was going to write that it is pretty simple.
I was thinking about cloneNode(false). Then I thought about text nodes, then the normalize function, and then the case when text nodes aren't adjacent.
Apart from recursing through the entire DOM, you will have to do the following for each element node (nodeType == 1):
var elLength = 0; // an element's own nodeValue is null, so start from zero
for (var i = 0; i < thisEl.childNodes.length; i++) {
  var node = thisEl.childNodes[i];
  if (node.nodeType == 3) { // text node
    elLength += node.data.length;
  }
}
Then you'll have to remember the largest elLength and the corresponding element.
It's going to be slow if your DOM is huge.
The code hasn't been tested; I wrote it just to give an example.
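A fuller sketch of that idea, including the recursion and the bookkeeping for the largest element, might look like this. The name findLargestTextElement is hypothetical, and trimming whitespace before counting is my own assumption, made so that indentation between tags does not count as text.

```javascript
// Find the element whose direct text-node children hold the most characters.
// Works on any tree whose nodes expose nodeType, childNodes and, for text
// nodes, data; that includes real DOM nodes.
function findLargestTextElement(root) {
  let best = { element: null, length: 0 };
  (function visit(el) {
    let length = 0;
    for (const node of Array.from(el.childNodes || [])) {
      if (node.nodeType === 3) {        // text node: count its characters
        length += node.data.trim().length;
      } else if (node.nodeType === 1) { // element node: recurse into it
        visit(node);
      }
    }
    if (length > best.length) best = { element: el, length };
  })(root);
  return best;
}
```

In a browser it could be called as findLargestTextElement(document.body). Because only direct text children are counted, a post split into many <p> tags will report its longest paragraph rather than the whole post.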

How to distinguish between blank areas and non-blank areas in a webpage with JavaScript?

How to distinguish between blank areas and non-blank areas in a webpage with JavaScript? Blank areas including:
areas that are not occupied by DOM elements.
margins, borders and paddings of DOM elements.
EDIT:
As response to the first comment: I am working on a web-based ebook reader. Cursor is set to {cursor:move} for blank areas so that the user can drag and scroll the webpage.
You could recursively go through each element and attach onmouseover and onmouseout events (where the former enables the text cursor and the latter enables the move cursor) on each which has text in it, e.g:
function attachEvents(e) {
  if (e.nodeType == 3) { // Node.TEXT_NODE
    // A text node - add the parent as an element to select text in
    e.parentNode.onmouseover = elmMouseOver; /* define your own event fns */
    e.parentNode.onmouseout = elmMouseOut;
  }
  else if (e.nodeType == 1) { // Node.ELEMENT_NODE
    for (var m = e.firstChild; m != null; m = m.nextSibling) {
      attachEvents(m);
    }
  }
}
The best way I can think of to make sure it's actually "text" which is moused over and not a blank area is to use e.g. <div><span>content</span></div> and put the mouseover/mouseout events in the <span> so that blank areas don't trigger events. This is what I'd recommend doing if you can, as things can get very complicated if you use block elements with padding from my experience. For example:
| The quick brown fox jumps |
| over the lazy dog | <- onmouseover/out of SPANs will ignore the space
after "dog" while DIVs won't and you won't need
to calculate padding/margins/positions which
makes it faster and more simple to implement
If you have to use block DIVs: You could use something like jQuery's jSizes plugin to get margins/padding in pixels or this (for a way to get the inherited CSS values and parse yourself by removing the px part from the end etc)
After that, you could figure out the position using position() in jQuery. I personally don't use jQuery for my stuff, but I use those specific "find positions" functions and found them to be one of the best I think in large part because of number of users testing them.
Good luck!
My advice would be to go for a simple scrollbar. That's far more foolproof. By trying to implement the cool drag-and-scroll feature you risk with a lot of buggy behavior in dozens of edge-cases none of us can even imagine.
If you really want to detect clicks in whitespace, you could try attaching to the onmousedown/onmouseup/onmousemove events for the page itself. JavaScript events bubble nicely, so you'll handle the whole page at once (unless it has some JavaScript in itself, in which case you're screwed anyway). These events supply both the mouse X/Y coordinates and the element that was clicked. Then you can check for padding of that element (careful with inline elements) and figure out if it's in the padding or not. You do not need to check the margin because clicking there will instead originate the click in the parent element.
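The padding check described above could be sketched as below. The names isInsideContentBox and isClickOnContent are hypothetical, and the sketch assumes a block-level target; as noted, inline elements that wrap across lines need more care because their bounding rectangle covers several line boxes.

```javascript
// Returns true when the point lies inside the element's content box.
// `rect` is the border box (as from getBoundingClientRect); `border` and
// `padding` hold pixel widths for top/right/bottom/left.
function isInsideContentBox(x, y, rect, border, padding) {
  const left = rect.left + border.left + padding.left;
  const top = rect.top + border.top + padding.top;
  const right = rect.right - border.right - padding.right;
  const bottom = rect.bottom - border.bottom - padding.bottom;
  return x >= left && x <= right && y >= top && y <= bottom;
}

// Browser-only usage (assumption: called from a mouse event handler).
function isClickOnContent(event) {
  const el = event.target;
  const cs = window.getComputedStyle(el);
  const px = (v) => parseFloat(v) || 0;
  const sides = (prefix, suffix) => ({
    top: px(cs[prefix + 'Top' + suffix]),
    right: px(cs[prefix + 'Right' + suffix]),
    bottom: px(cs[prefix + 'Bottom' + suffix]),
    left: px(cs[prefix + 'Left' + suffix]),
  });
  // clientX/clientY and getBoundingClientRect both use viewport coordinates.
  return isInsideContentBox(event.clientX, event.clientY,
    el.getBoundingClientRect(), sides('border', 'Width'), sides('padding', ''));
}
```

A single mousedown listener on the document could call isClickOnContent(event) and start a drag only when it returns false.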
The effect you get this way, though, is a lot of scattered "drag zones" that the user will have to hunt for in order to scroll the page. I doubt this will sit well with your users. It would be better to make the whole page "draggable", but then you will lose the ability to select text. Or do it like Acrobat, which only allows grabbing in the considerable padding area of the page itself (then you should make sure that there is a considerable padding area). Which in my eyes is not much better than a scrollbar. :P
