Can the JavaScript command .replace replace text in any webpage? I want to create a Chrome extension that replaces specific words in any webpage with something else (for example, cake instead of pie).
The .replace method is a string operation, so it's not immediately simple to run the operation on HTML documents, which are composed of DOM Node objects.
Use TreeWalker API
The best way to go through every node in a DOM and replace text in it is to use the document.createTreeWalker method to create a TreeWalker object. This is a practice that is used in a number of Chrome extensions!
// create a TreeWalker of all text nodes
var allTextNodes = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT),
    // some temp references for performance
    tmptxt,
    tmpnode,
    // compile the RE and cache the replace string, for performance
    cakeRE = /cake/g,
    replaceValue = "pie";
// iterate through all text nodes
while (allTextNodes.nextNode()) {
    tmpnode = allTextNodes.currentNode;
    tmptxt = tmpnode.nodeValue;
    tmpnode.nodeValue = tmptxt.replace(cakeRE, replaceValue);
}
To replace parts of text with another element, or to add an element in the middle of text, use the DOM splitText, createElement, and insertBefore methods.
See also how to replace multiple strings with multiple other strings.
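As a rough sketch of how several words could be swapped in one pass over the text nodes: the word map, `escapeRegExp`, `buildReplacer`, and `replaceInTextNodes` below are hypothetical names made up for illustration, and the matching is case-sensitive and limited to whole words.

```javascript
// Hypothetical word map: keys are the words to find, values their replacements.
var replacements = { pie: 'cake', cookie: 'biscuit' };

// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build one function that replaces every word of the map in a string.
function buildReplacer(map) {
  var words = Object.keys(map).map(escapeRegExp);
  if (words.length === 0) {
    return function (text) { return text; };
  }
  var pattern = new RegExp('\\b(' + words.join('|') + ')\\b', 'g');
  return function (text) {
    return text.replace(pattern, function (match) {
      return map[match];
    });
  };
}

// Walk all text nodes under root and rewrite only those that change.
// Browser-only: relies on document.createTreeWalker and NodeFilter.
function replaceInTextNodes(root, replaceText) {
  var walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  while (walker.nextNode()) {
    var node = walker.currentNode;
    var updated = replaceText(node.nodeValue);
    if (updated !== node.nodeValue) {
      node.nodeValue = updated;
    }
  }
}

// Usage in a page or content script:
// replaceInTextNodes(document.body, buildReplacer(replacements));
```

Compiling a single alternation means each text node is scanned once, however many words are in the map.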
Don't use innerHTML or innerText or jQuery .html()
// the innerHTML property of any DOM node is a string
document.body.innerHTML = document.body.innerHTML.replace(/cake/g,'pie')
It's generally slower (especially on mobile devices).
It effectively removes and replaces the entire DOM, which is not awesome and could have some side effects: it destroys all event listeners attached in JavaScript code (via addEventListener or .onxxxx properties) thus breaking the functionality partially/completely.
This is, however, a common, quick, and very dirty way to do it.
Ok, so the createTreeWalker method is the RIGHT way of doing this and it's a good way. I unfortunately needed to do this to support IE8 which does not support document.createTreeWalker. Sad Ian is sad.
If you want to do this with a .replace on the page text using a non-standard innerHTML call like a naughty child, you need to be careful because it WILL replace text inside a tag, leading to XSS vulnerabilities and general destruction of your page.
What you need to do is only replace text OUTSIDE of tags, which I matched with:
var search_re = new RegExp("(?:>[^<]*)(" + stringToReplace + ")(?:[^>]*<)", "gi");
Gross, isn't it? You may want to mitigate any slowness by replacing some results and then sticking the rest in a setTimeout call like so:
// replace some chunk of stuff, the first section of your page works nicely
// if you happen to have that organization
//
setTimeout(function() { /* replace the rest */ }, 10);
which will return immediately after replacing the first chunk, letting your page continue with its happy life. for your replace calls, you're also going to want to replace large chunks in a temp string
var tmp = element.innerHTML.replace(search_re, whatever);
/* more replace calls, maybe this is in a for loop, i don't know what you're doing */
element.innerHTML = tmp;
so as to minimize reflows (when the page recalculates positioning and re-renders everything). For large pages, this can be slow unless you're careful, hence the optimization pointers. Again, don't do this unless you absolutely need to. Use the createTreeWalker method zetlen has kindly posted above.
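If you really are stuck with the string approach, here is a minimal sketch of one way to keep replacements out of the tags themselves. It is not the regex above but a different technique: the markup is split into tag and text runs, and only the text runs are touched. `replaceOutsideTags` and `escapeRegExp` are hypothetical helpers, and the sketch assumes simple markup (no `>` inside attribute values, no script, style, or comment content to protect).

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Replace `search` (case-insensitively) in the text between tags only.
function replaceOutsideTags(html, search, replacement) {
  var searchRe = new RegExp(escapeRegExp(search), 'gi');
  // Alternates between whole tags (group 1) and runs of text (group 2).
  return html.replace(/(<[^>]*>)|([^<]+)/g, function (match, tag, text) {
    if (tag) {
      return tag; // leave tags, attributes included, untouched
    }
    return text.replace(searchRe, function () {
      return replacement;
    });
  });
}

// Usage, building the result in a temp string and assigning once:
// var tmp = replaceOutsideTags(element.innerHTML, 'cake', 'pie');
// element.innerHTML = tmp;
```

Escaping the search string matters here because it is spliced into a RegExp; an unescaped `.` or `(` would change what gets matched.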
Have you tried something like this?
$('body').html($('body').html().replace(/pie/g, 'cake'));
Related
Firstly, I apologize if my terminology here isn't the most accurate; I'm very much a novice when it comes to programming. A forum I frequent has added a bunch of unnecessary, "glitchy" images and text to the page as a part of some promotion, but the result is that the forum is now difficult to use and read. I was able to script out most of it using adblock, but there's one last bit that shows up inside the forum elements themselves, and adblock wants to remove the whole element (which breaks the forum). This is part of the code in question, with the URLs changed:
<td class="windowbg" valign="middle" width="42%">▓▓▓▓▓
Thread title <span class="smalltext"></span><img src="example.com/forumicon.gif"></td>
As you can see, the ▓ character shows up a bunch of times for no reason. Is there a way to make my browser ignore this character when it's inside of an element? If there's a way to do this using AdBlock, I am not smart enough to see it.
Here's one way to do it, using a NodeIterator:
var iter = document.createNodeIterator( document.body, NodeFilter.SHOW_TEXT );
var node;
while (node = iter.nextNode()) {
node.textContent = node.textContent.replace( /[\u2580-\u259f]+/g, '' );
}
This is just plain JavaScript code; you can paste it into the Firefox / Chrome JS console to test it. The regexp /[\u2580-\u259f]+/ matches any sequence of characters in the "Block Elements" Unicode block, including U+2593 Dark Shade (▓). You may want to tweak the regexp to match the characters you want to remove. (Tip: If you don't know what the codes for those characters are, copy and paste them into the "UTF8 String" box on this page.)
Ps. If these characters that you want to remove occur only in a certain part of the document, you can make this code a bit more efficient by replacing the root node (document.body above) with the specific DOM node that you want to remove the characters from. To find the nodes you want, you can use e.g. document.getElementById() or, more generally, document.querySelector() (or even document.querySelectorAll() and loop over the results).
Let's say that we have a DIV x on the page and we want to duplicate ("copy-paste") the contents of that DIV into another DIV y. We could do this like so:
y.innerHTML = x.innerHTML;
or with jQuery:
$(y).html( $(x).html() );
However, it appears that this method is not a good idea, and that it should be avoided.
(1) Why should this method be avoided?
(2) How should this be done instead?
Update:
For the sake of this question let's assume that there are no elements with ID's inside the DIV x.
(Sorry I forgot to cover this case in my original question.)
Conclusion:
I have posted my own answer to this question below (as I originally intended). Now, I also planned to accept my own answer :P, but lonesomeday's answer is so amazing that I have to accept it instead.
This method of "copying" HTML elements from one place to another is the result of a misapprehension of what a browser does. Browsers don't keep an HTML document in memory somewhere and repeatedly modify the HTML based on commands from JavaScript.
When a browser first loads a page, it parses the HTML document and turns it into a DOM structure. This is a relationship of objects following a W3C standard (well, mostly...). The original HTML is from then on completely redundant. The browser doesn't care what the original HTML structure was; its understanding of the web page is the DOM structure that was created from it. If your HTML markup was incorrect/invalid, it will be corrected in some way by the web browser; the DOM structure will not contain the invalid code in any way.
Basically, HTML should be treated as a way of serialising a DOM structure to be passed over the internet or stored in a file locally.
It should not, therefore, be used for modifying an existing web page. The DOM (Document Object Model) has a system for changing the content of a page. This is based on the relationship of nodes, not on the HTML serialisation. So when you add an li to a ul, you have these two options (assuming ul is the list element):
// option 1: innerHTML
ul.innerHTML += '<li>foobar</li>';
// option 2: DOM manipulation
var li = document.createElement('li');
li.appendChild(document.createTextNode('foobar'));
ul.appendChild(li);
Now, the first option looks a lot simpler, but this is only because the browser has abstracted a lot away for you: internally, the browser has to convert the element's children to a string, then append some content, then convert the string back to a DOM structure. The second option corresponds to the browser's native understanding of what's going on.
The second major consideration is to think about the limitations of HTML. When you think about a webpage, not everything relevant to the element can be serialised to HTML. For instance, event handlers bound with x.onclick = function () {...}; or x.addEventListener(...) won't be replicated in innerHTML, so they won't be copied across. So the new elements in y won't have the event listeners. This probably isn't what you want.
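So the way around this is to work with the native DOM methods (bearing in mind that cloneNode copies attributes, including inline on* handler attributes, but not listeners added with addEventListener or assigned as properties such as x.onclick):
for (var i = 0; i < x.childNodes.length; i++) {
    y.appendChild(x.childNodes[i].cloneNode(true));
}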
Reading the MDN documentation will probably help to understand this way of doing things:
appendChild
cloneNode
childNodes
Now the problem with this (as with option 2 in the code example above) is that it is very verbose, far longer than the innerHTML option would be. This is when you appreciate having a JavaScript library that does this kind of thing for you. For example, in jQuery:
$('#y').html($('#x').clone(true, true).contents());
This is a lot more explicit about what you want to happen. As well as having various performance benefits and preserving event handlers, for example, it also helps you to understand what your code is doing. This is good for your soul as a JavaScript programmer and makes bizarre errors significantly less likely!
You can end up duplicating IDs, which need to be unique.
A jQuery clone call like $(element).clone(true); will clone data and event listeners, but IDs will also be cloned. So to avoid duplicate IDs, don't use IDs for items that need to be cloned.
It should be avoided because then you lose any handlers that may have been on that DOM element.
You can try to get around that by appending clones of the DOM elements instead of completely overwriting them.
First, let's define the task that has to be accomplished here:
All child nodes of DIV x have to be "copied" (together with all its descendants = deep copy) and "pasted" into the DIV y. If any of the descendants of x has one or more event handlers bound to it, we would presumably want those handlers to continue working on the copies (once they have been placed inside y).
Now, this is not a trivial task. Luckily, the jQuery library (and all the other popular libraries as well I assume) offers a convenient method to accomplish this task: .clone(). Using this method, the solution could be written like so:
$( x ).contents().clone( true ).appendTo( y );
The above solution is the answer to question (2). Now, let's tackle question (1):
This
y.innerHTML = x.innerHTML;
is not just a bad idea - it's an awful one. Let me explain...
The above statement can be broken down into two steps.
The expression x.innerHTML is evaluated,
The return value of that expression (which is a string) is assigned to y.innerHTML.
The nodes that we want to copy (the child nodes of x) are DOM nodes. They are objects that exist in the browser's memory. When evaluating x.innerHTML, the browser serializes (stringifies) those DOM nodes into a string (HTML source code string).
Now, if we needed such a string (to store it in a database, for instance), then this serialization would be understandable. However, we do not need such a string (at least not as an end-product).
In step 2, we are assigning this string to y.innerHTML. The browser evaluates this by parsing the string which results in a set of DOM nodes which are then inserted into DIV y (as child nodes).
So, to sum up:
Child nodes of x --> stringifying --> HTML source code string --> parsing --> Nodes (copies)
So, what's the problem with this approach? Well, DOM nodes may contain properties and functionality which cannot and therefore won't be serialized. The most important such functionality are event handlers that are bound to descendants of x - the copies of those elements won't have any event handlers bound to them. The handlers got lost in the process.
An interesting analogy can be made here:
Digital signal --> D/A conversion --> Analog signal --> A/D conversion --> Digital signal
As you probably know, the resulting digital signal is not an exact copy of the original digital signal - some information got lost in the process.
I hope you understand now why y.innerHTML = x.innerHTML should be avoided.
I wouldn't do it simply because you're asking the browser to re-parse HTML markup that has already been parsed.
I'd be more inclined to use the native cloneNode(true) to duplicate the existing DOM elements.
var node, i=0;
while( node = x.childNodes[ i++ ] ) {
y.appendChild( node.cloneNode( true ) );
}
Well it really depends. There is a possibility of creating duplicate elements with the same ID, which is never a good thing.
jQuery also has methods that can do this for you.
My problem
I want to clean HTML pasted in a rich text editor (FCK 1.6 at the moment). The cleaning should be based on a whitelist of tags (and perhaps another with attributes). This is not primarily in order to prevent XSS, but to remove ugly HTML.
Currently I see no way to do it on the server, so I guess it must be done in JavaScript.
Current ideas
I found the jquery-clean plugin, but as far as I can see, it is using regexes to do the work, and we know that is not safe.
As I've not found any other JS-based solution I've started to implement one myself using jQuery. It would work by creating a jQuery version of the pasted html ($(pastedHtml)) and then traversing the resulting tree, removing each element not matching the whitelist by looking at the attribute tagName.
My questions
Is this any better?
Can I trust jQuery to represent the pasted content well (there may be unmatched ending tags and what-have-you)?
Is there a better solution already that I couldn't find?
Update
This is my current, jQuery-based, solution (verbose and not extensively tested):
function clean(element, whitelist, replacerTagName) {
    // Use div if no replace tag was specified
    replacerTagName = replacerTagName || "div";
    // Accept anything that jQuery accepts
    var jq = $(element);
    // Create a copy of the current element, but without its children
    var clone = jq.clone();
    clone.children().remove();
    // Wrap the copy in a dummy parent to be able to search with jQuery selectors
    // 1)
    var wrapper = $('<div/>').append(clone);
    // Check if the element is not on the whitelist by searching with the 'not' selector
    var invalidElement = wrapper.find(':not(' + whitelist + ')');
    // If the element wasn't on the whitelist, replace it.
    if (invalidElement.length > 0) {
        // 2)
        var el = $('<' + replacerTagName + '/>');
        el.text(invalidElement.text());
        invalidElement.replaceWith(el);
    }
    // Extract the (maybe replaced) element
    var cleanElement = $(wrapper.children().first());
    // Recursively clean the children of the original element and
    // append them to the cleaned element
    var children = jq.children();
    if (children.length > 0) {
        children.each(function(_index, thechild) {
            var cleaned = clean(thechild, whitelist, replacerTagName);
            cleanElement.append(cleaned);
        });
    }
    return cleanElement;
}
I am wondering about some points (see comments in the code);
Do I really need to wrap my element in a dummy parent to be able to match it with jQuery's ":not"?
Is this the recommended way to create a new node?
If you leverage the browser's HTML correcting abilities (e.g. you copy the rich text to the innerHTML of an empty div and take the resulting DOM tree), the HTML will be guaranteed to be valid (the way it will be corrected is somewhat browser-dependent). Although this is probably done by the rich editor anyway.
jQuery's own text-to-DOM transform is probably also safe, but definitely slower, so I would avoid it.
Using a whitelist based on the jQuery selector engine might be somewhat tricky because removing an element while preserving its children might make the document invalid, so the browser would correct it by changing the DOM tree, which might confuse a script trying to iterate through invalid elements. (E.g. you allow ul and li but not ol; the script removes the list root element, naked li elements are invalid so the browser wraps them in ul again, that ul will be missed by the cleaning script.) If you throw away unwanted elements together with all their children, I don't see any problems with that.
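The "throw away unwanted elements together with all their children" variant can be sketched with plain DOM calls. This is only an illustration under assumptions: `removeDisallowed` and `allowedTags` are hypothetical names, `root` is assumed to be an element that already holds the parsed paste, and attributes are not filtered at all.

```javascript
// Remove every descendant element of `root` whose tag name is not in
// `allowedTags`, together with all of its children.
function removeDisallowed(root, allowedTags) {
  var allowed = Object.create(null);
  for (var i = 0; i < allowedTags.length; i++) {
    allowed[allowedTags[i].toUpperCase()] = true;
  }
  // querySelectorAll returns a static list, so removing nodes while
  // looping does not disturb the iteration.
  var all = root.querySelectorAll('*');
  for (var j = 0; j < all.length; j++) {
    var el = all[j];
    if (!allowed[el.tagName.toUpperCase()] && el.parentNode) {
      el.parentNode.removeChild(el);
    }
  }
  return root;
}

// Usage in a browser, with `pastedRoot` being an element holding the paste:
// removeDisallowed(pastedRoot, ['p', 'b', 'i', 'ul', 'li']);
```

Because whole subtrees are dropped, the browser never has to repair orphaned children such as naked li elements.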
I sometimes need to add elements (such as a new link and image) to an existing HTML page, but I only have access to a small portion of the page far from where I need to insert elements. I want to use DOM based JavaScript techniques, and I must avoid using document.write().
Thus far, I've been using something like this:
// Create new image element
var newImg = document.createElement("img");
newImg.src = "images/button.jpg";
newImg.height = "50";
newImg.width = "150";
newImg.alt = "Click Me";
// Create new link element
var newLink = document.createElement("a");
newLink.href = "/dir/signup.html";
// Append new image into new link
newLink.appendChild(newImg);
// Append new link (with image) into its destination on the page
document.getElementById("newLinkDestination").appendChild(newLink);
Is there a more efficient way that I could use to accomplish the same thing? It all seems necessary, but I'd like to know if there's a better way I could be doing this.
There is a more efficient way, and it seems to be using DocumentFragments where possible.
Check it out: http://ejohn.org/blog/dom-documentfragments/ . This way should also be less error-prone and more maintainable than mixing up huge string literals and setting them as the innerHTML of some other DOM objects.
Just beware that innerHTML has historically been both non-standard and notoriously buggy.
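A minimal sketch of the DocumentFragment idea, assuming several image links need to be added at once: `appendImageLinks`, its `doc` parameter, and the shape of the `items` objects are hypothetical, chosen only for illustration.

```javascript
// Build all the links off-document in a fragment, then attach the fragment
// to the destination with a single appendChild call.
function appendImageLinks(doc, destination, items) {
  var fragment = doc.createDocumentFragment();
  for (var i = 0; i < items.length; i++) {
    var img = doc.createElement('img');
    img.src = items[i].src;
    img.alt = items[i].alt;
    img.width = items[i].width;   // numbers, not strings
    img.height = items[i].height;

    var link = doc.createElement('a');
    link.href = items[i].href;
    link.appendChild(img);
    fragment.appendChild(link);
  }
  // Appending a fragment moves its children into the destination.
  destination.appendChild(fragment);
  return items.length;
}

// Usage in a browser:
// appendImageLinks(document, document.getElementById('newLinkDestination'), [
//   { src: 'images/button.jpg', alt: 'Click Me', width: 150, height: 50,
//     href: '/dir/signup.html' }
// ]);
```

For a single link and image the gain is negligible; the fragment pays off when many nodes are inserted together.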
Nothing wrong with that. Using innerHTML would be marginally faster and probably fewer characters but not noticeable for something of this scale, and my personal preference is for the more standard, uniformly supported and safer DOM methods and properties.
One minor point: the height and width properties of <img> elements should be numbers rather than strings.
If you're not adding many things, the way you've been doing it is ideal vs innerHTML. If you're doing it frequently, though, you might just create a generic function/object that takes the pertinent information as parameters and does the dirty work, i.e.:
function addImage(src,width,height,alt,appendElem,href) {...}
I do this often in my own projects using prototyping to save time.
Which technique below is better from a user experience?
In both examples, assume that thumbContainer is set to the return value of document.getElementById('thumbContainer'), and new Thumbnail(thumb).toXMLString() returns a string with XHTML markup in it.
A) Simply += with thumbContainer.innerHTML:
for (thumb in thumbnails) {
thumbContainer.innerHTML += new Thumbnail(thumb).toXMLString();
}
B) Or converting new Thumbnail(thumb).toXMLString() to DOM elements and using appendChild?
for (thumb in thumbnails) {
var shell = document.createElement('div');
shell.innerHTML = new Thumbnail(thumb).toXMLString();
for (i = 0; i < shell.childElementCount; i++) {
thumbContainer.appendChild(shell.children[i]);
}
}
I've used both and get complaints from Firefox that I've got a long-running script, and do I want to stop it?
Which loop is less-tight, that will allow the UI to update as new elements are added to the DOM?
A) Simply += with thumbContainer.innerHTML
Never do this. This has to serialise all the content to HTML, add a string to it, and parse it all back in. It's slow (doing it in a loop is particularly bad news) and it'll lose all JavaScript references, event handlers and so on.
IE's insertAdjacentHTML will do this more efficiently. It's also part of HTML5, but not widely implemented elsewhere yet.
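As a small sketch of the insertAdjacentHTML route, assuming a browser that implements it: `appendThumbnails` and `markupStrings` are hypothetical names. The 'beforeend' position inserts the parsed markup inside the element, after its last child, without re-serialising what is already there.

```javascript
// Append each markup string to the end of the container without touching
// the container's existing children.
function appendThumbnails(container, markupStrings) {
  for (var i = 0; i < markupStrings.length; i++) {
    container.insertAdjacentHTML('beforeend', markupStrings[i]);
  }
}

// Usage in a browser:
// appendThumbnails(document.getElementById('thumbContainer'),
//                  ['<img src="a.jpg" alt="">', '<img src="b.jpg" alt="">']);
```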
using appendChild?
Yes, that's fine. If you've got a lot of thumbs it will start to get slow (but not as bad as reparsing innerHTML each time). Making a DocumentFragment to insert multiple elements into the container child node list at once can help in some cases. Combining DocumentFragment with Range can do more still, but gets harder to write compatibly since IE's Range implementation is so StRange.
Either way, since you don't seem to be doing anything with the individual shell nodes, why not simply join all the thumbnail HTML strings together before parsing?
for (var child in shell.childNodes) {
Don't use for..in on sequence types, it won't do what you think. It's only meant for use against Object used as a mapping. The correct way to iterate over an Array or NodeList is plain old for (var i= 0; i<sequence.length; i++).
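To make the difference concrete, here is a small sketch using a plain array-like object as a stand-in for a NodeList (the `sequence` object and both helper names are made up for illustration). for..in walks enumerable property names, so it yields keys rather than items and picks up non-index properties too; a real NodeList in a browser typically exposes similar extra keys.

```javascript
// A NodeList-like stand-in: indexed items plus a length and a method.
var sequence = {
  0: 'first',
  1: 'second',
  length: 2,
  item: function (i) { return this[i]; }
};

// for..in yields property NAMES, including 'length' and 'item'.
function keysSeenByForIn(seq) {
  var keys = [];
  for (var key in seq) {
    keys.push(key);
  }
  return keys;
}

// An index loop yields exactly the items, in order.
function valuesByIndex(seq) {
  var values = [];
  for (var i = 0; i < seq.length; i++) {
    values.push(seq[i]);
  }
  return values;
}

console.log(keysSeenByForIn(sequence)); // index keys plus 'length' and 'item'
console.log(valuesByIndex(sequence));   // only the two items
```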
Personally I like the second approach (using appendChild) better, because you are adding a new element to the document tree without affecting the old elements.
When using innerHTML += new content, you affect the old content, because all the HTML has to be reassigned (replaced with new HTML that consists of the old code plus some new code).
But, if you want to increase the performance, I'd suggest using DocumentFragments. Using them you can append a whole set of nodes with just one appendChild call on the document tree. Worth reading: "DOM DocumentFragments" by John Resig, but it's not the case in your problem, because you get the content as a string, that you have to first convert to DOM nodes.
My second suggestion is to create a string of HTML code and then use innerHTML on a temporary container to convert it to DOM nodes:
var htmlCode = "";
for (var thumb in thumbnails) {
    htmlCode += new Thumbnail(thumb).toXMLString();
}
// createElement takes a tag name, not markup, so use a plain div as the temporary container
var element = document.createElement('div');
element.innerHTML = htmlCode;
// ... and append all the elements to the document, or:
// thumbContainer.innerHTML += htmlCode;
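Putting the two suggestions together, a sketch of "parse once, insert once" might look like the following. `appendMarkup` and its parameters are hypothetical names; the markup strings are assumed to come from something like toXMLString().

```javascript
// Join all markup, parse it once in a detached div, then move the resulting
// nodes into the container through a DocumentFragment.
function appendMarkup(doc, container, markupList) {
  var shell = doc.createElement('div');
  shell.innerHTML = markupList.join(''); // a single parse for everything

  var fragment = doc.createDocumentFragment();
  // appendChild MOVES the node out of shell, so always take firstChild;
  // indexing into a live child list while moving nodes would skip items.
  while (shell.firstChild) {
    fragment.appendChild(shell.firstChild);
  }
  container.appendChild(fragment); // a single insertion into the page
}

// Usage in a browser:
// appendMarkup(document, document.getElementById('thumbContainer'),
//              ['<img src="a.jpg" alt="">', '<img src="b.jpg" alt="">']);
```

Existing children of the container, and any handlers bound to them, are left alone because nothing is re-serialised.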