Parsing through DOM get all children and values - javascript

container is a div to which I've added some basic HTML.
The debug_log function is printing the following:
I'm in a span!
I'm in a div!
I'm in a
p
What happened to the rest of the text in the p tag ("aragraph tag!!")? I don't think I understand how to walk through the document tree. I need a function that will traverse the entire document tree and return all of the elements and their values. The code below is a first attempt at just getting all of the values displayed.
container.innerHTML = '<span>I\'m in a span! </span><div> I\'m in a div! </div><p>I\'m in a <span>p</span>aragraph tag!!</p>';
DEMO.parse_dom(container);
DEMO.parse_dom = function(ele)
{
var child_arr = ele.childNodes;
for(var i = 0; i < child_arr.length; i++)
{
debug_log(child_arr[i].firstChild.nodeValue);
DEMO.parse_dom(child_arr[i]);
}
}

Generally when traversing the DOM, you want to specify a start point. From there, check if the start point has childNodes. If it does, loop through them and recurse the function if they too have childNodes.
Here's some code that outputs to the console using the DOM form of these nodes (I used the document/HTML element as a start point). You'll need to run an if against window.console if you're allowing non-developers to load this page/code and using console:
recurseDomChildren(document.documentElement, true);
function recurseDomChildren(start, output)
{
var nodes;
if(start.childNodes)
{
nodes = start.childNodes;
loopNodeChildren(nodes, output);
}
}
function loopNodeChildren(nodes, output)
{
var node;
for(var i=0;i<nodes.length;i++)
{
node = nodes[i];
if(output)
{
outputNode(node);
}
if(node.childNodes)
{
recurseDomChildren(node, output);
}
}
}
function outputNode(node)
{
var whitespace = /^\s+$/g;
if(node.nodeType === 1)
{
console.log("element: " + node.tagName);
}else if(node.nodeType === 3)
{
//clear whitespace text nodes
node.data = node.data.replace(whitespace, "");
if(node.data)
{
console.log("text: " + node.data);
}
}
}
Example: http://jsfiddle.net/ee5X6/

In
<p>I'm in a <span>p</span>aragraph tag!!</p>
you request the first child of the p element, which is the text node containing "I'm in a ".
The text "aragraph tag!!" is the third child of the p element, so it is never logged.
The last line, "p", comes from the recursive call: when parse_dom runs on the p element, the nested span is one of its children, and the span's first child is the text node "p".
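To make the child-node layout concrete, here is a minimal sketch that walks a tree and collects every text node in document order. It runs against plain objects that mimic the DOM's nodeType, nodeValue and childNodes properties; the el and txt helpers and the collectText name are hypothetical and exist only so the example runs outside a browser. The same function should also work on real nodes such as container.

```javascript
// Hypothetical helpers that build DOM-like plain objects for this demo.
function txt(value) {
  return { nodeType: 3, nodeValue: value, childNodes: [] };
}
function el(tagName, children) {
  return { nodeType: 1, tagName: tagName, nodeValue: null, childNodes: children };
}

// Collect the value of every text node under `node`, in document order.
function collectText(node, out) {
  out = out || [];
  if (node.nodeType === 3) {
    out.push(node.nodeValue);
  }
  for (var i = 0; i < node.childNodes.length; i++) {
    collectText(node.childNodes[i], out);
  }
  return out;
}

// Mirrors <p>I'm in a <span>p</span>aragraph tag!!</p>
var p = el("P", [txt("I'm in a "), el("SPAN", [txt("p")]), txt("aragraph tag!!")]);
console.log(collectText(p)); // three entries: "I'm in a ", "p", "aragraph tag!!"
```

Checking nodeType on the node itself, rather than reading firstChild.nodeValue, is what lets the second and third children of the p element show up.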

I'm not sure it is what you need or if it is possible in your environment but jQuery can accomplish something similar quite easily. Here is a quick jQuery example that might work.
<html>
<head>
<script src="INCLUDE JQUERY HERE">
</script>
</head>
<body>
<span>
<span>I\'m in a span! </span><div> I\'m in a div! </div><p>I\'m in a <span>p</span>aragraph tag!!</p>
</span>
<script>
function traverse(elem){
$(elem).children().each(function(i,e){
console.log($(e).text());
traverse($(e));
});
}
traverse($("body").children().first());
</script>
</body>
</html>
Which gives the following console output:
I\'m in a span!
I\'m in a div!
I\'m in a paragraph tag!!
p

Related

"False" text node causing trouble

I'm working on a DOM traversal type of script and I'm almost finished with it. However, there is one problem that I've encountered and for the life of me, I can't figure out what to do to fix it. Pardon my ineptitude, as I'm brand new to JS/JQuery and I'm still learning the ropes.
Basically, I'm using Javascript/JQuery to create an "outline", representing the structure of an HTML page, and appending the "outline" to the bottom of the webpage. For example, if the HTML is this...
<html>
<head>
</head>
<body>
<h1>Hello World</h1>
<script src="http://code.jquery.com/jquery-2.1.0.min.js" type="text/javascript">
</script>
<script src="outline.js" type="text/javascript"></script>
</body>
</html>
Then the output should be an unordered list like this:
html
head
body
h1
text(Hello World)
script src("http://code.jquery.com/jquery-2.1.0.min.js") type("text/javascript")
script src("outline.js") type("text/javascript")
Here's what I've got so far:
var items=[];
$(document).ready(function(){
$("<ul id = 'list'></ul>").appendTo("body");
traverse(document, function (node) {
if(node.nodeName.indexOf("#") <= -1){
items.push("<ul>"+"<li>"+node.nodeName.toLowerCase());
}
else {
var x = "text("+node.nodeValue+")";
if(node.nodeValue == null) {
items.push("<li> document");
}
else if(/[a-z0-9]/i.test(node.nodeValue) && node.nodeValue != null) {
items.push("<ul><li>"+ x +"</ul>");
}
else {
items.push("</ul>");
}
}
});
$('#list').append(items.join(''));
});
function traverse(node, func) {
func(node);
node = node.firstChild;
while (node) {
traverse(node, func);
node = node.nextSibling;
}
}
It works almost perfectly, except it seems to read a carriage return as a text node. For example, if there's
<head><title>
it reads that properly, adding head as an unordered list element and then creating a new unordered list for title, nested inside head. However, if it's
<head>
<title>
It makes the new unordered list and its element, "head", but then jumps to the else branch that does items.push("</ul>"). How do I get it to ignore the carriage return? I tried testing whether the nodeValue was equal to the carriage return, \r, but that didn't seem to do the trick.
I'm having a bit of a hard time understanding exactly which text nodes you want to skip. If you just want to skip a text node that is only whitespace, you can do that like this:
var onlyWhitespaceRegex = /^\s*$/;
traverse(document, function (node) {
if (node.nodeType === 3 && onlyWhitespaceRegex.test(node.nodeValue)) {
// skip text nodes that contain only whitespace
return;
}
else if (node.nodeName.indexOf("#") <= -1){
items.push("<ul>"+"<li>"+node.nodeName.toLowerCase());
} else ...
Or maybe you just want to collapse any run of leading or trailing whitespace in a text node to a single space before displaying it, since the extra whitespace would not be rendered in HTML anyway.
var trimWhitespaceRegex = /^\s+|\s+$/g;
traverse(document, function (node) {
if(node.nodeName.indexOf("#") <= -1){
items.push("<ul>"+"<li>"+node.nodeName.toLowerCase());
} else {
var text = node.nodeValue;
if (node.nodeType === 3) {
text = text.replace(trimWhitespaceRegex, " ");
}
var x = "text("+text+")";
if(node.nodeValue == null) {
items.push("<li> document");
} ....
A further description of exactly what you're trying to achieve in the output for various forms of different text nodes would help us better understand your requirements.
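As a rough illustration of the first suggestion, the sketch below builds a nested list by returning a string from each recursive call, so every opening tag is closed by the same call that opened it and a skipped whitespace node cannot unbalance the markup. The text and elem helpers and the outline function are hypothetical names, and the plain objects only imitate the nodeType, nodeName, nodeValue and childNodes properties of real nodes. Attribute output and HTML escaping of the text are left out.

```javascript
// Hypothetical stand-ins for DOM nodes so the sketch runs outside a browser.
function text(value) {
  return { nodeType: 3, nodeName: "#text", nodeValue: value, childNodes: [] };
}
function elem(name, children) {
  return { nodeType: 1, nodeName: name, nodeValue: null, childNodes: children || [] };
}

// Return an <li> for `node`, with a nested <ul> for its children if it has any.
function outline(node) {
  if (node.nodeType === 3) {
    // Skip text nodes that contain only whitespace (spaces, tabs, line breaks).
    if (/^\s*$/.test(node.nodeValue)) {
      return "";
    }
    return "<li>text(" + node.nodeValue + ")</li>";
  }
  var inner = "";
  for (var i = 0; i < node.childNodes.length; i++) {
    inner += outline(node.childNodes[i]);
  }
  return "<li>" + node.nodeName.toLowerCase() +
    (inner ? "<ul>" + inner + "</ul>" : "") + "</li>";
}

// Mirrors <head>, a line break, <title>Hi</title>, a line break, </head>
var head = elem("HEAD", [text("\n"), elem("TITLE", [text("Hi")]), text("\n")]);
console.log("<ul>" + outline(head) + "</ul>");
```

With real nodes, the call would presumably be something like outline(document.documentElement), with the result appended to the list element.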

innerHTML not working on node

In my script node1.data & node1.textContent are working perfectly but node1.innerHTML is not displaying anything, and I'm not getting why.
var text = [];
function getN(node1) {
textNodesUnder(node1, text);
return text.join("");
}
function textNodesUnder(node1, text) {
if (node1.nodeType == 3) {
text.push(node1.innerHTML);
document.write(node1.textContent + "<br>");
}
var children = node1.childNodes;
for (var i = 0; i < children.length; i++) {
textNodesUnder(children[i], text);
}
}
and
<body onload="alert('The document content is ' + getN(document))" >
At that point node1 is a reference to the document object, which indeed does not have an innerHTML property. Text nodes (nodeType 3) do not have one either, so node1.innerHTML is undefined inside the if block.
Try document.body.innerHTML instead: body is an element, so it has the property.
The Element.innerHTML property sets or gets the HTML syntax describing the element's descendants.
Example
document.getElementById("some_element").innerHTML="JavaScript"; //innerHTML is property of element not document
MDN - innerHTML
Useful information regarding JavaScript-innerHTML
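A small sketch of the idea, assuming the goal is simply to read text from whatever node is passed in: pick a property that exists for that node type. The textOf name and the fake node objects are hypothetical; they only carry the properties the real node types are known to expose (text nodes have nodeValue and data, elements have innerHTML and textContent, and the document's textContent is null).

```javascript
// Read the text of a node using a property that exists for its node type.
function textOf(node) {
  switch (node.nodeType) {
    case 3: // Text node: nodeValue/data are defined, innerHTML is not
      return node.nodeValue;
    case 1: // Element: textContent (and innerHTML) are defined
      return node.textContent;
    case 9: // Document: textContent is null, so go through documentElement
      return node.documentElement ? node.documentElement.textContent : "";
    default:
      return "";
  }
}

// Hypothetical plain-object stand-ins so the sketch runs outside a browser.
var fakeText = { nodeType: 3, nodeValue: "hello" };
var fakeElement = { nodeType: 1, textContent: "hello world", innerHTML: "hello <b>world</b>" };
var fakeDocument = { nodeType: 9, textContent: null, documentElement: fakeElement };

console.log(textOf(fakeText));
console.log(textOf(fakeElement));
console.log(textOf(fakeDocument));
```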

jQuery .text('') on multiple nested elements

I want to remove all text from the HTML and print only the tags. I ended up writing this:
var html = $('html');
var elements = html.find('*');
elements.text('');
alert(html.html());
It only prints <head></head><body></body>. Wasn't it supposed to print all the tags? I have nearly 2000 tags in the HTML.
var elements = html.find('*');
elements.text('');
That says "find all elements below html, then empty them". That includes body and head. When they are emptied, there are no other elements on the page, so they are the only ones that appear in html's content.
If you really want to remove all text from the page and leave the elements, you'll have to do it with DOM methods:
html.find('*').each(function() { // loop over all elements
$(this).contents().each(function() { // loop through each element's child nodes
if (this.nodeType === 3) { // if the node is a text node
this.parentNode.removeChild(this); // remove it from the document
}
});
})
You just deleted everything from your dom:
$('html').find('*').text('');
This will set the text of all nodes inside the <html> to the empty string, deleting descendant elements - the only two nodes that are left are the two children of the root node, <head></head> and <body></body> with their empty text node children - exactly the result you got.
If you want to remove all text nodes, you should use this:
var html = document.documentElement;
(function recurse(el) {
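// Iterate backwards: childNodes is live, so removing a child while counting up would skip the next sibling
for (var i = el.childNodes.length - 1; i >= 0; i--) {
var child = el.childNodes[i];
if (child.nodeType == 3)
el.removeChild(child);
else
recurse(child);
}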
})(html);
alert(html.outerHTML);
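One detail worth care when removing nodes inside a loop: childNodes is a live list, so each removal shifts the later siblings down by one index. The sketch below walks each child list from the end, which sidesteps that. FakeNode is a hypothetical minimal stand-in (an array-backed childNodes plus removeChild) so the example runs outside a browser; removeTextNodes is likewise just an illustrative name.

```javascript
// Hypothetical minimal node type with a live childNodes array, for the demo.
function FakeNode(nodeType, name, children) {
  this.nodeType = nodeType;
  this.name = name;
  this.childNodes = children || [];
}
FakeNode.prototype.removeChild = function (child) {
  this.childNodes.splice(this.childNodes.indexOf(child), 1);
  return child;
};

// Remove every text node below `el`, walking each child list from the end
// so that a removal never shifts a node that has not been visited yet.
function removeTextNodes(el) {
  for (var i = el.childNodes.length - 1; i >= 0; i--) {
    var child = el.childNodes[i];
    if (child.nodeType === 3) {
      el.removeChild(child);
    } else {
      removeTextNodes(child);
    }
  }
}

// Mirrors <div>t1<b>t3</b>t2</div>
var inner = new FakeNode(1, "b", [new FakeNode(3, "t3")]);
var root = new FakeNode(1, "div", [new FakeNode(3, "t1"), inner, new FakeNode(3, "t2")]);
removeTextNodes(root);
console.log(root.childNodes.length, inner.childNodes.length); // 1 0
```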
Try this instead
$(function(){
var elements = $(document).find("*");
elements.each(function(index, data){
console.log(data);
});
});
This will return all the html elements of page.
lonesomeday seems to have the right path, but you could also do some string rebuilding like this:
var htmlString=$('html').html();
var emptyHtmlString="";
var isTag=false;
for (var i=0;i<htmlString.length;i++)
{
if(htmlString[i]=='<')
isTag=true;
if(isTag)
{
emptyHtmlString+=htmlString[i];
}
if(htmlString[i]=='>')
isTag=false;
}
alert(emptyHtmlString);

Highlighting text in document (JavaScript) Efficiently

How can I (efficiently - not slowing the computer [cpu]) highlight a specific part of a page?
Let's say that my page looks like this:
<html>
<head>
</head>
<body>
"My generic words would be selected here" !.
<script>
//highlight code here
var textToHighlight = 'selected here" !';
//what should I write here?
</script>
</body>
</html>
My idea is to "clone" all the body into a variable and find via indexOf the specified text, change(insert a span with a background-color) the "cloned" string and replace the "real" body with the "cloned" one.
I just think that it isn't efficient.
Do you have any other ideas? (be creative :) )
I've adapted the following from my answers to several similar questions on SO (example). It's designed to be reusable and has proved to be so. It traverses the DOM within a container node you specify, searching each text node for the specified text and using DOM methods to split the text node and surround the relevant chunk of text in a styled <span> element.
Demo: http://jsfiddle.net/HqjZa/
Code:
// Reusable generic function
function surroundInElement(el, regex, surrounderCreateFunc) {
// script and style elements are left alone (tagName is upper case in HTML documents, hence the "i" flag)
if (!/^(script|style)$/i.test(el.tagName)) {
var child = el.lastChild;
while (child) {
if (child.nodeType == 1) {
surroundInElement(child, regex, surrounderCreateFunc);
} else if (child.nodeType == 3) {
surroundMatchingText(child, regex, surrounderCreateFunc);
}
child = child.previousSibling;
}
}
}
// Reusable generic function
function surroundMatchingText(textNode, regex, surrounderCreateFunc) {
var parent = textNode.parentNode;
var result, surroundingNode, matchedTextNode, matchLength, matchedText;
while ( textNode && (result = regex.exec(textNode.data)) ) {
matchedTextNode = textNode.splitText(result.index);
matchedText = result[0];
matchLength = matchedText.length;
textNode = (matchedTextNode.length > matchLength) ?
matchedTextNode.splitText(matchLength) : null;
surroundingNode = surrounderCreateFunc(matchedTextNode.cloneNode(true));
parent.insertBefore(surroundingNode, matchedTextNode);
parent.removeChild(matchedTextNode);
}
}
// This function does the surrounding for every matched piece of text
// and can be customized to do what you like
function createSpan(matchedTextNode) {
var el = document.createElement("span");
el.style.backgroundColor = "yellow";
el.appendChild(matchedTextNode);
return el;
}
// The main function
function wrapText(container, text) {
// No "g" flag: exec() must search each new text node from its start, not from a stale lastIndex
surroundInElement(container, new RegExp(text), createSpan);
}
wrapText(document.body, "selected here");
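Since wrapText passes its text argument straight to the RegExp constructor, characters such as ( or ? in the search text would be read as regular expression syntax. A minimal sketch of one way to handle that, assuming the text is meant to be matched literally, is to escape it first. The escapeRegExp helper is not part of the code above; it is a commonly used pattern, and the sample strings are made up for the demo.

```javascript
// Escape characters that have a special meaning in a regular expression,
// so the text is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

var textToHighlight = "cost (in $) is 5.00?";
var regex = new RegExp(escapeRegExp(textToHighlight));

console.log(regex.test("The cost (in $) is 5.00? Yes.")); // true
console.log(regex.test("The cost (in $) is 5X00? Yes.")); // false: "." no longer matches any character
```

The escaped expression could then be handed to surroundInElement in place of new RegExp(text).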
<html>
<head>
</head>
<body>
<p id="myText">"My generic words would be selected here" !.</p>
<script>
//highlight code here
var textToHighlight = 'selected here" !';
var text = document.getElementById("myText").innerHTML
document.getElementById("myText").innerHTML = text.replace(textToHighlight, '<span style="color:red">'+textToHighlight+'</span>');
//what should I write here?
</script>
</body>
</html>
Use this in combination with this and you should be pretty ok. (It is almost better than trying to implement selection / selection-highlighting logic yourself.)

How to select a part of string?

How to select a part of string?
My code (or example):
<div>some text</div>
$(function(){
$('div').each(function(){
$(this).text($(this).html().replace(/text/, '<span style="color: none">$1<\/span>'));
});
});
I also tried this method, but it colors the entire content of the div, not just the matching word:
$(function(){
$('div:contains("text")').css('color','red');
});
This is the result I'm trying to get:
<div><span style="color: red">text</span></div>
$('div').each(function () {
$(this).html(function (i, v) {
return v.replace(/text/g, '<span style="color: red">$&<\/span>');
});
});
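The replacement patterns are easy to mix up, so here is a short plain-string sketch of how they behave in String.prototype.replace: "$&" stands for the whole match, while "$1" only means something when the regular expression has a capture group. The variable names are just for illustration.

```javascript
var source = "some text";

// No capture group in the regex: "$1" refers to nothing and is inserted literally.
var literal = source.replace(/text/, "<span>$1</span>");

// "$&" always refers to the whole matched substring.
var whole = source.replace(/text/, "<span>$&</span>");

// With a capture group, "$1" refers to what the first group matched.
var group = source.replace(/(text)/, "<span>$1</span>");

console.log(literal); // some <span>$1</span>
console.log(whole);   // some <span>text</span>
console.log(group);   // some <span>text</span>
```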
What are you actually trying to do? What you're doing at the moment is taking the HTML of each matching DIV, wrapping a span around the word "text" if it appears (literally the word "text") and then setting that as the text of the element (and so you'll see the HTML markup on the page).
If you really want to do something with the actual word "text", you probably meant to use html rather than text in your first function call:
$('div').each(function(){
$(this).html($(this).html().replace(/text/, '<span style="color: none">$&<\/span>'));
// ^-- here
});
But if you're trying to wrap a span around the text of the div, you can use wrap to do that:
$('div').wrap('<span style="color: none"/>');
Like this: http://jsbin.com/ucopo3 (in that example, I've used "color: blue" rather than "color: none", but you get the idea).
$(function(){
$('div:contains("text")').each(function() {
$(this).html($(this).html().replace(/(text)/g, '<span style="color:red;">\$1</span>'));
});
});
I've updated your fiddle: http://jsfiddle.net/nMzTw/15/
The general practice of interacting with the DOM as strings of HTML using innerHTML has many serious drawbacks:
Event handlers are removed or replaced
Opens the possibility of script injection attacks
Doesn't work in XHTML
It also encourages lazy thinking. In this particular instance, you're matching against the string "text" within the HTML with the assumption that any occurrence of the string must be within a text node. This is patently not a valid assumption: the string could appear in a title or alt attribute, for example.
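The attribute problem is easy to reproduce with plain strings. In this sketch the markup value is a made-up example; the replacement is the same kind of call the innerHTML-based approaches rely on.

```javascript
// A div whose title attribute happens to contain the search word.
var markup = '<div title="text">some text</div>';

// Replacing on the HTML string cannot tell attribute values from text nodes.
var replaced = markup.replace(/text/g, '<span style="color: red">$&</span>');

console.log(replaced);
// The title attribute now starts with '<span style=', which breaks the markup.
```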
Use DOM methods instead. This will get round all the problems. The following will use only DOM methods to surround every match for regex in every text node that is a descendant of a <div> element:
$(function() {
var regex = /text/;
function getTextNodes(node) {
if (node.nodeType == 3) {
return [node];
} else {
var textNodes = [];
for (var n = node.firstChild; n; n = n.nextSibling) {
textNodes = textNodes.concat(getTextNodes(n));
}
return textNodes;
}
}
$('div').each(function() {
$.each(getTextNodes(this), function() {
var textNode = this, parent = this.parentNode;
var result, span, matchedTextNode, matchLength;
while ( textNode && (result = regex.exec(textNode.data)) ) {
matchedTextNode = textNode.splitText(result.index);
matchLength = result[0].length;
textNode = (matchedTextNode.length > matchLength) ?
matchedTextNode.splitText(matchLength) : null;
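span = document.createElement("span");
span.style.color = "red";
// Put the span where the match currently sits, then move the match into it,
// so the order is preserved even when element siblings follow the text node
parent.insertBefore(span, matchedTextNode);
span.appendChild(matchedTextNode);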
}
});
});
});
