I wrote a word-counting function and found a discrepancy: it returns different counts for the text in an HTML string depending on whether the element containing that HTML is part of document.body or not. For example:
html = "<div>Line1</div><div>Line2<br></div>";
document.body.insertAdjacentHTML("afterend", '<div id="node1"></div>');
node1 = document.getElementById("node1");
node1.style.whiteSpace = 'pre-wrap';
node1.innerHTML = html;
node2 = document.createElement('div');
node2.style.whiteSpace = 'pre-wrap';
node2.innerHTML = html;
The white-space: pre-wrap style is applied so that the code in the html variable is rendered, in terms of line-breaks, consistently across browsers. In the above:
node1.innerText // is "Line1\nLine2\n" which counts as two words.
node2.innerText // is "Line1Line2" which counts as only one word.
My word count function is:
function countWords(s) {
s = (s+' ').replace(/^\s+/g, ''); // remove leading whitespace only
s = s.replace(/\s/g, ' '); // change all whitespace to spaces
s = s.replace(/[ ]{2,}/gi,' ')+' '; // change 2 or more spaces to 1
return s.split(' ').filter(String).length;
}
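For comparison, the same count can be sketched more compactly by matching runs of non-whitespace characters. This is only an illustration: the name countWordsCompact is made up here, and it assumes a "word" is any maximal run matched by \S+, which is what the function above effectively counts.

```javascript
// Count words as runs of non-whitespace characters.
// Assumption: a "word" is any maximal run matched by \S+.
function countWordsCompact(s) {
  var matches = String(s).match(/\S+/g); // null when there are no words
  return matches ? matches.length : 0;
}

console.log(countWordsCompact("Line1\nLine2\n")); // 2
console.log(countWordsCompact("Line1Line2"));     // 1
```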
If I then did something like this in the Web Console:
node1.after(node2);
node2.innerText // is changed to "Line1\nLine2\n" which counts as two words.
My questions are:
Why is the white-space: pre-wrap style not applied to node2.innerText before node2 is inserted into the document.body?
If node2 has to be part of document.body in order to get a white-space: pre-wrap innerText value, how do I do that without making node2 visible?
I'm curious: when I create an element with createElement, where does that element reside? It isn't viewable in the Web Console Inspector, inside or outside of the <html> tag, and I can't find it in the document object.
This is what tipped me off that the discrepancy has something to do with whether the element is in document.body or not: javascript createElement(), style problem.
Indeed, when the element is attached to the DOM, HTMLElement.innerText takes the rendered result into account: you could say it reflects the visible output. Elements that are not attached are not rendered, so their CSS properties exist but are not applied.
If you want consistent results between attached and non-attached elements, use Node.textContent.
For more information, see https://developer.mozilla.org/en-US/docs/Web/API/HTMLElement/innerText
As a follow-up to my question above: I needed to count the words in HTML strings like <div>Line1</div><div>Line2<br></div> so that the word count matches what it would be if that HTML were rendered in the displayed DOM.
To summarize what others have said: an element created with createElement isn't inserted into the DOM yet, so it can't be found when inspecting the DOM. Until it is inserted, its CSS properties exist but are not applied, so nothing is rendered. Once it is inserted, the CSS is applied and the element is rendered accordingly.
Here's the html-string-to-rendered-html-text function I ended up using. This function strips the html tags but retains the "white space" so that the words can then be counted (with consistency across browsers, including IE 11).
var html = "<div>Line1</div><div>Line2<br></div>";
// Display the html string
var htmlts = document.getElementById("htmlts");
htmlts.innerText = html;
// Display a DOM render of the html string
var node1 = document.getElementById("node1");
node1.style.whiteSpace = 'pre-wrap';
node1.innerHTML = html;
// Display the innerText of the above DOM render
var node1ts = document.getElementById("node1ts");
node1ts.innerText = node1.innerText;
// Display the results of the htmlToText function
var node2ts = document.getElementById("node2ts");
node2ts.innerText = htmlToText(html);
// Adapted from https://stackoverflow.com/a/39157530
function htmlToText(html) {
var temp = document.createElement('div');
temp.style.whiteSpace = 'pre-wrap';
temp.style.position = "fixed"; // Overlays the normal flow
temp.style.left = "0"; // Placed flush left
temp.style.top = "0"; // Placed at the top
temp.style.zIndex = "-999"; // Placed under other elements
// opacity = "0" works for the entire temp element, even in IE 11.
temp.style.opacity = "0"; // Everything transparent
temp.innerHTML = html; // Render the html string
document.body.parentNode.appendChild(temp); // Places just before </html>
var out = temp.innerText;
// temp.remove(); // Throws an error in IE 11
// Solution from https://stackoverflow.com/a/27710003
temp.parentNode.removeChild(temp); // Removes the temp element
return out;
}
<html lang="en-US">
<body>
HTML String: <code id="htmlts"></code><br><br>
Visible Render of HTML String (for comparison): <div id="node1"></div><br>
Visible Render Text String: <code id="node1ts"></code><br>
Function Returned Text String: <code id="node2ts"></code><br>
</body>
</html>
If you prefer to have the temporary element insert inside the body element, change document.body.parentNode.appendChild to document.body.appendChild.
As Noam had suggested, you can also use temp.style.top = "-1000px";.
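To answer my curiosity question: before the element is inserted into the DOM, it exists only in memory as a detached node. It still belongs to the document (its ownerDocument is set), but it has no parent, so it is not part of the tree that the inspector shows. This is not the same thing as a Shadow DOM.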
I searched through a bunch of related questions about replacing a site's innerHTML using JavaScript, but most rely on targeting the ID or class of the element containing the text. However, my text can be inside a span or td tag, or possibly elsewhere. I was finally able to gather a few resources to make the following code work:
$("body").children().each(function() {
$(this).html($(this).html().replace(/\$/g,"%"));
});
The problem with the above code is that I randomly see code artifacts or other issues on the loaded page. I think it is because "$" also appears in the website's own code, and the script converts those occurrences to "%" as well, which breaks things.
Is there any way to modify the code (JavaScript/jQuery) so that it does not affect code elements and only replaces the visible text (i.e. >Here<)?
Thanks!
---Edit---
It looks like the reason I'm getting a conflict with some other code is this error: "Uncaught TypeError: Cannot read property 'innerText' of undefined". So I'm guessing there are some elements that don't have innerText (even though they don't match the regex) and that this breaks other inline script code.
Is there anything I can add or change so that the code skips the .replace when the text doesn't match the regex, or when the value is undefined?
Wholesale regex modifications to the DOM are a little dangerous; it's best to limit your work to only the DOM nodes you're certain you need to check. In this case, you want text nodes only (the visible parts of the document.)
This answer gives a convenient way to select all text nodes contained within a given element. Then you can iterate through that list and replace nodes based on your regex, without having to worry about accidentally modifying the surrounding HTML tags or attributes:
var getTextNodesIn = function(el) {
  return $(el)
    .find(":not(iframe, script)") // skip <script> and <iframe> tags
    .andSelf() // renamed to .addBack() in jQuery 1.8 and removed in 3.0
    .contents()
    .filter(function() {
      return this.nodeType == 3; // text nodes only
    });
};
getTextNodesIn($('#foo')).each(function() {
var txt = $(this).text().trim(); // trimming surrounding whitespace
txt = txt.replace(/^\$\d$/g,"%"); // your regex
$(this).replaceWith(txt);
})
console.log($('#foo').html()); // tags and attributes were not changed
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="foo"> Some sample data, including bits that a naive regex would trip up on:
foo<span data-attr="$1">bar<i>$1</i>$12</span><div>baz</div>
<p>$2</p>
$3
<div>bat</div>$0
<!-- $1 -->
<script>
// embedded script tag:
console.log("<b>$1</b>"); // won't be replaced
</script>
</div>
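The same text-nodes-only idea can also be sketched without jQuery, as a plain recursive walk. This is an illustration rather than part of the answer above: replaceInTextNodes and SKIPPED_TAGS are hypothetical names, and the walk relies only on the standard nodeType, nodeName, nodeValue and childNodes properties.

```javascript
// Walk a node tree and rewrite only text nodes (nodeType 3).
// Attributes and tags are never touched because they are not child nodes.
// The names below are illustrative; they are not from the answer above.
var SKIPPED_TAGS = { SCRIPT: true, STYLE: true, IFRAME: true };

function replaceInTextNodes(node, pattern, replacement) {
  if (node.nodeType === 3) {
    node.nodeValue = node.nodeValue.replace(pattern, replacement);
    return;
  }
  // Only descend into elements (nodeType 1); this skips comments,
  // and the tag check skips script, style and iframe contents.
  if (node.nodeType !== 1 || SKIPPED_TAGS[String(node.nodeName).toUpperCase()]) {
    return;
  }
  for (var i = 0; i < node.childNodes.length; i++) {
    replaceInTextNodes(node.childNodes[i], pattern, replacement);
  }
}

// In a browser: replaceInTextNodes(document.body, /\$/g, "%");
```

Because the function assigns to nodeValue instead of replacing nodes, the child list it is iterating over does not change while it runs.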
I solved it slightly differently, testing each value against the regex before attempting to replace it:
var regEx = new RegExp(/^\$\d$/);
var allElements = document.querySelectorAll("*");
for (var i = 0; i < allElements.length; i++) {
  var allElementsText = allElements[i].innerText;
  var regExTest = regEx.test(allElementsText);
  if (regExTest === true) {
    console.log(allElements[i]); // was el[i], but el is never defined
    var newText = allElementsText.replace(regEx, '%');
    allElements[i].innerText = newText;
  }
}
Does anyone see any potential issues with this?
One issue I found is that it does not work if part of the page refreshes after the page has loaded. Is there any way to re-run the script when new content is generated on the page?
I'm trying to figure out what the difference is between these two:
// first one
var h1 = document.createElement('h1');
var t = document.createTextNode('hey');
h1.appendChild(t);
document.body.appendChild(h1);
// second one
document.body.appendChild(document.createElement('h1').appendChild(document.createTextNode('hey')));
The first (separate Document.createElement() and Document.createTextNode() calls) works perfectly, but the second (the chained one-liner) does not.
The return value of appendChild is the appended child.
So if we add variables to:
document.body.appendChild(document.createElement('h1').appendChild(document.createTextNode('hey')));
it gets broken down into:
var text = document.createTextNode('hey');
var h1 = document.createElement('h1');
h1.appendChild(text);
document.body.appendChild(text);
Appending the text to the body removes the text from the h1.
The h1 is discarded because it is never appended anywhere.
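One way to keep the nested style is a small wrapper that returns the parent instead of the child. This is only a sketch: appendTo is a hypothetical helper name, not a DOM method, and it assumes nothing beyond the parent having a standard appendChild.

```javascript
// appendChild returns the appended child, so return the parent explicitly
// when calls need to be nested. appendTo is a hypothetical helper name.
function appendTo(parent, child) {
  parent.appendChild(child);
  return parent;
}

// In a browser:
// document.body.appendChild(
//   appendTo(document.createElement('h1'), document.createTextNode('hey'))
// );
```

With this helper, the outer appendChild receives the h1 (which already contains the text node) rather than the text node itself.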
I found a way to do it (just add .parentNode at the end):
document.body.appendChild(document.createElement('h1').appendChild(document.createTextNode('hey')).parentNode);
I am attempting to write my own piece of Javascript that converts html to ascii code (for learning purposes) so that the browser will render the code as you would see it in a text editor.
After looking around on Stack I have gotten as far as below. I am trying to turn an html element into a string; at this stage I am just trying to .replace() the angular brackets into ascii. If anyone could tell me where I am going wrong as far as having my test <body> tag showing up in the console that would be much appreciated.
<code class="lang-html">
<body></body>
</code>
(function() {
var html = $('.lang-html').innerHTML;
html.replace('<', '&lt;');
html.replace('>', '&gt;');
console.log(html);
});
Just to clarify, I am expecting that the console would spit out &lt;body&gt;&lt;/body&gt;.
Any help would be much appreciated.
A few things:
$('.lang-html').innerHTML
Assuming this is jQuery, this won't work. .innerHTML only works on raw DOM elements, like what's returned from document.getElementById(...). Instead, $('.lang-html') returns a jQuery collection, which has its own accessor methods. You should do:
$('.lang-html').html() // get the HTML as text from this element
Moving on, .replace() won't modify the original string. It returns a new copy. In the simplest case you can do:
var html = $('.lang-html')
.html()
.replace('<', '&lt;')
.replace('>', '&gt;');
But you still have to re-assign it to the HTML source. Again, jQuery provides a simple API for this.
$('.lang-html').html(html);
However, there's one more problem. .replace() only replaces the first match in a string. To replace all of them, you need to construct a regex and use the /g (global) flag. Here's the complete code:
var $element = $('.lang-html');
var html = $element.html()
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;');
$element.html(html)
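The global-replace approach can also be sketched as a plain string function that needs neither jQuery nor the DOM. The name escapeHtml is made up for this illustration, and escaping the ampersand as well is an addition that the code above does not make.

```javascript
// Escape the characters that would otherwise be parsed as markup.
// escapeHtml is an illustrative name; escaping "&" is an addition
// to the replacements shown above.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;") // must run first so later entities are not double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

console.log(escapeHtml("<body></body>")); // &lt;body&gt;&lt;/body&gt;
```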
If you want the HTML code representation of a DOM element in your browser, you don't need replace to escape the HTML special characters. You can let the browser take care of all the edge cases.
You could just use innerHTML/outerHTML and textContent.
For example, this replaces the content of the body with its HTML code representation:
var elm = document.getElementsByTagName('body')[0];
elm.textContent = elm.outerHTML;
Or if you just want to have the result as string but not displayed in the browsers then you could wrap that into a function:
function escapeHTML(html) {
var div = document.createElement('div');
div.textContent = html;
return div.innerHTML;
}
console.log( escapeHTML('<div>test</div>') );
You can also do a
$('.lang-html').prop("innerText")
which will hand you back the contents of that div, as real text.
No further translation should be needed.
Actually <body> tags will not be returned in the innerHTML of the posted code because the HTML is invalid. To explain:
To cater for changes to the DOM made in Javascript, browsers dynamically create innerHTML strings from the DOM by inspecting child elements of a specified node and generating HTML code from them.
Since <body> tags are only valid immediately following the head section, browsers silently respond to the <code> tag in your post by first creating a body element in which to place it. The <body> tags which follow are then ignored because they are invalid in this position. Hence there is no body element child of the code node, and no body tags in its innerHTML
Update (2): To pretty print the HTML without viewing page source you could try.
(function() {
var body = document.body;
var html = body.parentNode.outerHTML;
html = html.replace(/</g, '&lt;');
html = html.replace(/>/g, '&gt;');
html = html.replace(/\ /g, "&nbsp;");
html = html.replace(/\n/g, '<br>\n');
// console.log(html);
body.innerHTML = html;
body.style.fontFamily = "monospace";
});
I have a contenteditable div as follow (| = cursor position):
<div id="mydiv" contenteditable="true">lorem ipsum <spanclass="highlight">indol|or sit</span> amet consectetur <span class='tag'>adipiscing</span> elit</div>
I would like to get the current cursor position including html tags. My code :
var offset = document.getSelection().focusOffset;
The offset returned is 5 (the text offset from the last tag), but I need it to take the HTML tags into account. The expected return value is 40. The code has to work in all recent browsers.
(I also checked this: window.getSelection() offset with HTML tags? but it doesn't answer my question.)
Any ideas?
Another way to do it is by adding a temporary marker in the DOM and calculating the offset from this marker. The algorithm looks for the HTML serialization of the marker (its outerHTML) within the inner serialization (the innerHTML) of the div of interest. Repeated text is not a problem with this solution.
For this to work, the marker's serialization must be unique within its div. You cannot control what users type into a field but you can control what you put into the DOM so this should not be difficult to achieve. In my example, the marker is made unique statically: by choosing a class name unlikely to cause a clash ahead of time. It would also be possible to do it dynamically, by checking the DOM and changing the class until it is unique.
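The dynamic variant mentioned above could look roughly like the sketch below. It is not taken from the fiddle: makeUniqueClass is a hypothetical name, and for simplicity it checks the serialized HTML string for the candidate name instead of querying the DOM, which is stricter than necessary but keeps the marker's outerHTML unique.

```javascript
// Pick a class name that does not occur anywhere in the given HTML string.
// makeUniqueClass is a hypothetical helper, not part of the fiddle.
function makeUniqueClass(html, base) {
  var name = base;
  var counter = 0;
  // Keep appending a counter until the candidate no longer appears.
  while (html.indexOf(name) !== -1) {
    counter += 1;
    name = base + "-" + counter;
  }
  return name;
}

// In a browser, with $mydiv as in the code below:
// var unique = makeUniqueClass($mydiv.html(), "caret-marker");
```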
I have a fiddle for it (derived from Alvaro Montoro's own fiddle). The main part is:
function getOffset() {
if ($("." + unique).length)
throw new Error("marker present in document; or the unique class is not unique");
// We could also use rangy.getSelection() but there's no reason here to do this.
var sel = document.getSelection();
if (!sel.rangeCount)
return; // No ranges.
if (!sel.isCollapsed)
return; // We work only with collapsed selections.
if (sel.rangeCount > 1)
throw new Error("can't handle multiple ranges");
var range = sel.getRangeAt(0);
var saved = rangy.serializeSelection();
// See comment below.
$mydiv[0].normalize();
range.insertNode($marker[0]);
var offset = $mydiv.html().indexOf($marker[0].outerHTML);
$marker.remove();
// Normalizing before and after ensures that the DOM is in the same shape before
// and after the insertion and removal of the marker.
$mydiv[0].normalize();
rangy.deserializeSelection(saved);
return offset;
}
As you can see, the code has to compensate for the addition and removal of the marker into the DOM because this causes the current selection to get lost:
Rangy is used to save the selection and restore it afterwards. Note that the save and restore could be done with something lighter than Rangy but I did not want to load the answer with minutia. If you decide to use Rangy for this task, please read the documentation because it is possible to optimize the serialization and deserialization.
For Rangy to work, the DOM must be in exactly the same state before and after the save. This is why normalize() is called before we add the marker and after we remove it. What this does is merge immediately adjacent text nodes into a single text node. The issue is that adding a marker to the DOM can cause a text node to be broken into two new text nodes. This causes the selection to be lost and, if not undone with a normalization, would cause Rangy to be unable to restore the selection. Again, something lighter than calling normalize could do the trick but I did not want to load the answer with minutia.
EDIT: This is an old answer that doesn't work for OP's requirement of having nodes with the same text. But it's cleaner and lighter if you don't have that requirement.
Here is one option that you can use and that works in all major browsers:
1. Get the offset of the caret within its node (document.getSelection().anchorOffset)
2. Get the text of the node in which the caret is located (document.getSelection().anchorNode.data)
3. Get the offset of that text within #mydiv by using indexOf()
4. Add the values obtained in 1 and 3, to get the offset of the caret within the div.
The code would look like this for your particular case:
var offset = document.getSelection().anchorOffset;
var text = document.getSelection().anchorNode.data;
var textOffset = $("#mydiv").html().indexOf( text );
var offsetCaret = textOffset + offset;
You can see a working demo on this JSFiddle (view the console to see the results).
And a more generic version of the function (which takes the div as a parameter, so it can be used with different contenteditable elements) is on this other JSFiddle:
function getCaretHTMLOffset(obj) {
var offset = document.getSelection().anchorOffset;
var text = document.getSelection().anchorNode.data;
var textOffset = obj.innerHTML.indexOf( text );
return textOffset + offset;
}
About this answer
It will work in all recent browsers as requested (tested on Chrome 42, Firefox 37, and Explorer 11).
It is short and light, and doesn't require any external library (not even jQuery)
Issue: If you have different nodes with the same text, it may return the offset of the first occurrence instead of the real position of the caret.
NOTE: This solution works even in nodes with repeated text, but it detects html entities (e.g. &nbsp;) as only one character.
I came up with a completely different solution based on processing the nodes. It is not as clean as the old answer (see other answer), but it works fine even when there are nodes with the same text (OP's requirement).
This is a description of how it works:
1. Create a stack with all the parent elements of the node in which the caret is located.
2. While the stack is not empty, traverse the nodes of the containing element (initially the content editable div).
3. If the node is not the same one at the top of the stack, add its size to the offset.
4. If the node is the same as the one at the top of the stack: pop it from the stack, go to step 2.
The code is like this:
function getCaretOffset(contentEditableDiv) {
// read the node in which the caret is and store it in a stack
var aux = document.getSelection().anchorNode;
var stack = [ aux ];
// add the parents to the stack until we get to the content editable div
while ($(aux).parent()[0] != contentEditableDiv) { aux = $(aux).parent()[0]; stack.push(aux); }
// traverse the contents of the editable div until we reach the one with the caret
var offset = 0;
var currObj = contentEditableDiv;
var children = $(currObj).contents();
while (stack.length) {
// add the lengths of the previous "siblings" to the offset
for (var x = 0; x < children.length; x++) {
if (children[x] == stack[stack.length-1]) {
// if the node is not a text node, then add the size of the opening tag
if (children[x].nodeType != 3) { offset += $(children[x])[0].outerHTML.indexOf(">") + 1; }
break;
} else {
if (children[x].nodeType == 3) {
// if it's a text node, add its size to the offset
offset += children[x].length;
} else {
// if it's a tag node, add its size + the size of the tags
offset += $(children[x])[0].outerHTML.length;
}
}
}
// move to a more inner container
currObj = stack.pop();
children = $(currObj).contents();
}
// finally add the offset within the last node
offset += document.getSelection().anchorOffset;
return offset;
}
You can see a working demo on this JSFiddle.
About this answer:
It works in all major browsers.
It is light and doesn't require external libraries (apart from jQuery)
It has an issue: html entities like &nbsp; are counted as one character only.
Here is an example. Check the console for the result. The first two divs (not appended; above the <script> in the console) have the proper spacing and indentation. However, the second two divs, which are identical except that they were appended, do not show the same formatting or white space as the original.
For example the input
var newElem = document.createElement('div');
document.body.appendChild(newElem);
var another = document.createElement('div');
newElem.appendChild(another);
console.log(document.body.innerHTML);
Gives the output
<div><div></div></div>
When I want it to look like
<div>
  <div></div>
</div>
Is there any way to generate the proper white space between appended elements and retain that spacing when obtaining it using innerHTML (or a possible similar means)? I need to be able to visually display the hierarchy and structure of the page I'm working on.
I have tried appending it within an element that is in the actual HTML but it has the same behavior
I'd be okay with doing it using text nodes and line breaks as lincolnk suggested, but it needs to affect dynamic results, meaning I cannot use the same .createTextNode(' </br>') because different elements are in different levels of the hierarchy
No jQuery please
I think you're asking to be able to append elements to the DOM, such that the string returned from document.body.innerHTML will be formatted with indentation etc. as if you'd typed it into a text editor, right?
If so, something like this might work:
function indentedAppend(parent,child) {
  var indent = "",
      elem = parent;
  while (elem && elem !== document.body) {
    indent += "  "; // two spaces per nesting level
    elem = elem.parentNode;
  }
  if (parent.hasChildNodes() && parent.lastChild.nodeType === 3 && /^\s*[\r\n]\s*$/.test(parent.lastChild.textContent)) {
    parent.insertBefore(document.createTextNode("\n" + indent), parent.lastChild);
    parent.insertBefore(child, parent.lastChild);
  } else {
    parent.appendChild(document.createTextNode("\n" + indent));
    parent.appendChild(child);
    parent.appendChild(document.createTextNode("\n" + indent.slice(0,-2)));
  }
}
demo: http://jsbin.com/ilAsAki/28/edit
I've not put too much thought into it, so you might need to play with it, but it's a starting point at least.
Also, I've assumed an indentation of 2 spaces as that's what you seemed to be using.
Oh, and you'll obviously need to be careful when using this with a <pre> tag or anywhere the CSS is set to maintain the whitespace of the HTML.
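If the indented markup is only needed for display, a different approach is to leave the DOM untouched and build the indented string separately. The sketch below is an assumption-laden illustration, not part of the answer above: outline is a made-up name, it uses two spaces per level to match the question, and it deliberately ignores attributes and text nodes, reading only tagName and children.

```javascript
// Build an indented outline of an element tree without modifying it.
// outline is an illustrative name; attributes and text are ignored.
function outline(element, depth) {
  depth = depth || 0;
  var pad = new Array(depth + 1).join("  "); // two spaces per level
  var tag = element.tagName.toLowerCase();
  if (element.children.length === 0) {
    return pad + "<" + tag + "></" + tag + ">";
  }
  var lines = [pad + "<" + tag + ">"];
  for (var i = 0; i < element.children.length; i++) {
    lines.push(outline(element.children[i], depth + 1));
  }
  lines.push(pad + "</" + tag + ">");
  return lines.join("\n");
}

// In a browser: console.log(outline(document.body));
```

Since nothing is inserted into the page, this avoids the whitespace concerns with <pre> or white-space CSS, at the cost of not reflecting the real innerHTML.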
You can use document.createTextNode() to add a string directly.
var ft = document.createElement('div');
document.body.appendChild(ft);
document.body.appendChild(document.createTextNode(' '));
var another = document.createElement('div');
document.body.appendChild(another);
console.log(document.body.innerHTML);