I'm trying to develop a Firefox add-on that transliterates the text on any page into a specific language. It is essentially a set of 2D arrays that I iterate over, using this code:
function escapeRegExp(str) {
return str.replace(/([.*+?^=!:${}()|\[\]\/\\])/g, "\\$1");
}
function replaceAll(find, replace) {
return document.body.innerHTML.replace(new RegExp(escapeRegExp(find), 'g'), replace);
}
function convert2latin() {
for (var i = 0; i < Table.length; i++) {
document.body.innerHTML = replaceAll(Table[i][1], Table[i][0]);
}
}
It works, and I can ignore HTML tags, since they can only be in English, but the problem is performance, which is of course very poor. As I have no experience in JS, I tried to google it and found that maybe documentFragment can help.
Or maybe I should use a different approach altogether?
Based on your comments, you appear to have already been told that the most expensive thing is the DOM rebuild that happens when you completely replace the entire contents of the page (i.e. when you assign to document.body.innerHTML). You are currently doing that for each substitution, which results in Firefox re-rendering the entire page for each substitution you make. You only need to assign to document.body.innerHTML once, after you have made all of the substitutions.
The following should provide a first pass at making it faster:
function escapeRegExp(str) {
return str.replace(/([.*+?^=!:${}()|\[\]\/\\])/g, "\\$1");
}
function convert2latin() {
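// Declare the variable locally instead of creating an implicit global.
let newInnerHTML = document.body.innerHTML;
for (let i = 0; i < Table.length; i++) {
newInnerHTML = newInnerHTML.replace(new RegExp(escapeRegExp(Table[i][1]), 'g'), Table[i][0]);
}
// Assign once, after all substitutions, so the DOM is rebuilt only one time.
document.body.innerHTML = newInnerHTML;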
}
You mention in comments that there is no real need to use a RegExp for the match, so the following would be even faster:
function convert2latin() {
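let newInnerHTML = document.body.innerHTML;
for (let i = 0; i < Table.length; i++) {
// split/join replaces every occurrence; replace() with a string pattern replaces only the first.
newInnerHTML = newInnerHTML.split(Table[i][1]).join(Table[i][0]);
}
document.body.innerHTML = newInnerHTML;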
}
If you really need to use a RegExp for the match, and you are going to perform these exact substitutions multiple times, you are better off creating all of the RegExp prior to the first use (e.g. when Table is created/changed) and storing them (e.g. in Table[i][2]).
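A minimal sketch of that pre-compilation idea, assuming Table holds [latin, source] pairs as in the question. The sample rows and the compileTable helper are hypothetical and only there for illustration; the string-taking convert2latin mirrors the text-node version used further down.

```javascript
// Hypothetical sample table: [replacement, source] pairs, like the question's Table.
const Table = [
  ["sh", "ш"],
  ["zh", "ж"],
  [" plus ", "+"], // "+" is a regex metacharacter, so it must be escaped
];

// Escape characters that are special inside a regular expression.
function escapeRegExp(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Compile each pattern once and cache it in row[2] (i.e. Table[i][2]).
function compileTable(table) {
  for (const row of table) {
    row[2] = new RegExp(escapeRegExp(row[1]), "g");
  }
}

// Apply every cached pattern to a string; no RegExp is constructed here.
function convert2latin(text) {
  for (const row of Table) {
    text = text.replace(row[2], row[0]);
  }
  return text;
}

compileTable(Table); // run again whenever Table changes
const sample = convert2latin("жш+ш");
console.log(sample);
```

The cost of building the RegExp objects is then paid once per table change rather than once per conversion.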
However, assigning to document.body.innerHTML is a bad way to do this:
As the8472 mentioned, replacing the entire content of document.body.innerHTML is a very heavy-handed way to perform this task. It has some significant disadvantages, including probably breaking the functionality of other JavaScript in the page, and potential security issues. A better solution would be to change only the textContent of the text nodes.
One method of doing this is to use a TreeWalker. The code to do so could be something like:
function convert2latin(text) {
for (let i = 0; i < Table.length; i++) {
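// split/join replaces every occurrence; replace() with a string pattern replaces only the first.
text = text.split(Table[i][1]).join(Table[i][0]);
}
return text;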
}
//Create the TreeWalker
let treeWalker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT,{
acceptNode: function(node) {
if(node.textContent.length === 0
|| node.parentNode.nodeName === 'SCRIPT'
|| node.parentNode.nodeName === 'STYLE'
) {
//Don't include 0 length, <script>, or <style> text nodes.
return NodeFilter.FILTER_SKIP;
} //else
return NodeFilter.FILTER_ACCEPT;
}
}, false );
//Make a list of nodes prior to modifying the DOM. Once the DOM is modified the TreeWalker
// can become invalid (i.e. stop after the first modification). Doing so is not needed
// in this case, but is a good habit for when it is needed.
let nodeList=[];
while(treeWalker.nextNode()) {
nodeList.push(treeWalker.currentNode);
}
//Iterate over all text nodes, changing the textContent of the text nodes
nodeList.forEach(function(el){
el.textContent = convert2latin(el.textContent);
});
Don't use innerHTML. It would destroy any JavaScript event handlers registered on the DOM nodes, and it would make references to DOM nodes held by the page's JavaScript obsolete. In other words, you could easily break a page with that. And of course it's inefficient.
You can use a TreeWalker instead and filter for text nodes only. The walking can be made incremental by deferring the next step with window.setTimeout every 1000th text node or something like that.
If you register your add-on script early enough, you could also use a MutationObserver to get notified about text nodes as soon as they are inserted, and replace them incrementally, which should make things less janky.
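The setTimeout-based batching mentioned above could look roughly like the sketch below. processInChunks and its parameters are hypothetical names, and the fake nodes are plain objects standing in for text nodes collected from a TreeWalker. In a page you would pass setTimeout as the scheduler; a synchronous one is used here only so the sketch can run outside a browser.

```javascript
// Process `items` in batches, handing control back via `schedule` between batches.
// `schedule` has the same shape as setTimeout: schedule(callback, delayMs).
function processInChunks(items, handle, chunkSize, schedule, done) {
  let index = 0;
  function runChunk() {
    const end = Math.min(index + chunkSize, items.length);
    for (; index < end; index++) {
      handle(items[index]);
    }
    if (index < items.length) {
      schedule(runChunk, 0); // more work left: defer the next batch
    } else if (done) {
      done();
    }
  }
  runChunk();
}

// Stand-ins for text nodes: plain objects with a textContent property.
const fakeNodes = ["a", "b", "c", "d", "e"].map((t) => ({ textContent: t }));
let deferrals = 0;
let finished = false;

processInChunks(
  fakeNodes,
  (node) => { node.textContent = node.textContent.toUpperCase(); },
  2, // batch size; the answer suggests something like 1000 for real pages
  (fn) => { deferrals++; fn(); }, // synchronous scheduler, for demonstration only
  () => { finished = true; }
);
```

With setTimeout as the scheduler, the browser gets a chance to paint and handle input between batches, at the price of the page being partially converted for a short time.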
Related
I'm writing a Firefox extension. I want to go through the entire plaintext, so not Javascript or image sources, and replace certain strings. I currently have this:
var text = document.documentElement.innerHTML;
var anyRemaining = true;
do {
var index = text.indexOf("search");
if (index != -1) {
// This does not just replace the string with something else,
// there's complicated processing going on here. I can't use
// string.replace().
} else {
anyRemaining = false;
}
} while (anyRemaining);
This works, but it will also go through non-text elements and HTML such as Javascript, and I only want it to do the visible text. How can I do this?
I'm currently thinking of detecting an open bracket and continuing at the next closing bracket, but there might be better ways to do this.
You can use xpath to get all the text nodes on the page and then do your search/replace on those nodes:
function replace(search,replacement){
var xpathResult = document.evaluate(
"//*/text()",
document,
null,
XPathResult.ORDERED_NODE_ITERATOR_TYPE,
null
);
var results = [];
// We store the result in an array because if the DOM mutates
// during iteration, the iteration becomes invalid.
var res; // declared locally to avoid creating an implicit global
while ((res = xpathResult.iterateNext())) {
results.push(res);
}
results.forEach(function(res){
res.textContent = res.textContent.replace(search,replacement);
})
}
replace(/Hello/g,'Goodbye');
<div class="Hello">Hello world!</div>
You can either use a regex to strip the HTML tags or, which might be easier, use a JavaScript function that returns the text without the HTML. See this for more details:
How can get the text of a div tag using only javascript (no jQuery)
Just to clarify what I'm trying to do, I'm trying to make a Chrome extension that can loop through the HTML of the current page and remove html tags containing certain text. But I'm having trouble looping through every html tag.
I've done a bunch of searching for the answer and pretty much every answer says to use:
var items = document.getElementsByTagName("*");
for (var i = 0; i < items.length; i++) {
//do stuff
}
However, I've noticed that if I rebuild the HTML from the page using the elements in "items," I get something different than the page's actual HTML.
For example, the code below returns false:
var html = "";
var elems = document.getElementsByTagName("*");
for (var i = 0; i < elems.length; i++) {
html += elems[i].outerHTML;
}
alert(document.body.outerHTML == html)
I also noticed that the code above wasn't giving ALL the html tags, it grouped them into one tag, for example:
var html = "";
var elems = document.getElementsByTagName("*");
alert(elems[0].outerHTML);
I tried fixing the above by recursively looking for an element's children, but I couldn't seem to get that to work.
Ideally, I would like to be able to get every individual tag, rather than ones wrapped in other tags. I'm kind of new to Javascript so any advice/explanations or example code (In pure javascript if possible) as to what I'm doing wrong would be really helpful. I also realize my approach might be completely wrong, so any better ideas are welcome.
What you need is the famous Douglas Crockford's WalkTheDOM:
function walkTheDOM(node, func)
{
func(node);
node = node.firstChild;
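while (node)
{
// Capture the next sibling first: func may remove node from the DOM,
// after which node.nextSibling would be null and the walk would stop early.
var next = node.nextSibling;
walkTheDOM(node, func);
node = next;
}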
}
For each node the func will be executed. You can filter, transform or whatever by injecting the proper function.
To remove nodes containing a specific text you would do:
function removeAll(node)
{
// protect against "node === undefined"
if (node && node.nodeType === 3) // TEXT_NODE
{
if (node.textContent.indexOf(filter) !== -1) // contains offending text
{
node.parentNode.removeChild(node);
}
}
}
You can use it like this:
filter = "the offending text";
walkTheDOM(document.getElementsByTagName("BODY")[0], removeAll);
If you want to parameterize by the offending text, you can do that too, by turning removeAll into a closure that is created for each filter string.
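One way the closure version might look is sketched below. makeRemover is a hypothetical name, and the fake node objects are only there so the function can be tried outside a browser.

```javascript
// Returns a visitor for walkTheDOM that removes text nodes containing `filterText`.
function makeRemover(filterText) {
  return function (node) {
    if (node && node.nodeType === 3 && node.parentNode && // 3 = TEXT_NODE
        node.textContent.indexOf(filterText) !== -1) {
      node.parentNode.removeChild(node);
    }
  };
}

// Minimal stand-ins for DOM text nodes and their parent.
const removed = [];
const fakeParent = { removeChild: (child) => removed.push(child) };
const badText = { nodeType: 3, textContent: "some offending text", parentNode: fakeParent };
const goodText = { nodeType: 3, textContent: "harmless", parentNode: fakeParent };

const removeOffending = makeRemover("offending");
removeOffending(badText);  // contains the filter text, so it is removed
removeOffending(goodText); // left alone
```

In a page this would be used as walkTheDOM(document.body, makeRemover("the offending text")), which avoids the global filter variable.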
References to DOM elements in JavaScript are references to memory addresses of the actual nodes, so you can do something like this (see the working jsfiddle):
Array.prototype.slice.call(document.getElementsByTagName('*')).forEach(function(node) {
if(node.innerHTML === 'Hello') {
node.parentNode.removeChild(node);
}
});
Obviously node.innerHTML === 'Hello' is just an example, so you'll probably want to figure out how you want to match the text content (perhaps with a RegEx?)
I'm trying to dynamically replace specific words with a link within a certain HTML element using JS. I figured I'd use a simple RegEx:
var regEx = new RegExp('\\b'+text+'\\b', 'gi');
The quick'n'nasty way is to apply the RegEx replace on the context div's innerHTML property:
context.innerHTML = context.innerHTML.replace(regEx, ''+text+"");
The problem with this is that it also applies to, say image titles, thus breaking the layout of the page. I want it to apply only to the text of the page, if possible also excluding things like header tags and of course HTML comment and such.
So I tried something like this instead, but it doesn't seem to work at all:
function replaceText(context, regEx, replace) {
var childNodes = context.childNodes;
for (n in childNodes) {
console.log(childNodes[n].nodeName);
if (childNodes[n] instanceof Text) {
childNodes[n].textContent = childNodes[n].textContent.replace(regEx, replace);
} else if (childNodes[n] instanceof HTMLElement) {
replaceText(childNodes[n], regEx, replace);
console.log('Entering '+childNodes[n].nodeName);
} else {
console.log('Skipping '+childNodes[n].nodeName);
}
}
}
Can anyone see what I'm doing wrong, or maybe come up with a better solution? Thanks!
UPDATE:
Here's a snippet of what the contents of context may look like:
<h4>Newton's Laws of Motion</h4>
<p><span class="inline_title">Law No.1</span>: <span class="caption">An object at rest will remain at rest, and an object in motion will continue to move at constant velocity, unless a net force is applied.</span></p>
<ul>Consequences: <li>Conservation of Momentum in both elastic and inelastic collisions</li>
<li>Conservation of kinetic energy in elastic collisions but not inelastic.</li>
<li>Conservation of angular momentum.</li>
</ul>
<h5>Equations</h5>
<p class="equation">ρ = mv</p>
<p>where ρ is the momentum, and m is the mass of an object moving at constant velocity v.</p>
You can use this:
function replaceText(context, regEx, replace)
{
var childNodes = context.childNodes;
for (var i = 0; i<childNodes.length; i++) {
var childNode = childNodes[i];
if (childNode.nodeType === 3) // 3 is for text node
childNode.nodeValue = childNode.nodeValue.replace(regEx, replace);
else if (childNode.nodeType === 1 && childNode.nodeName != "HEAD")
replaceText(childNode, regEx, replace);
}
}
replaceText(context, /cons/ig, 'GROUIK!');
The idea is to find all text nodes in the "context" DOM tree. That is why I use a recursive function: it searches for text nodes inside the child nodes.
Note: I test childNode.nodeName != "HEAD" in the function. It's only an example of how to skip a particular tag. In real life it is simpler to pass the body node as the parameter to the function.
As I understand it, you're trying to replace text in innerHTML, but not within tags.
First I tried to use innerText instead of innerHTML, but it did not give the expected result. Later I found @Alan Moore's answer with a negative lookahead regex like
(?![^<>]*>)
which can be used to ignore the text within tags <>. Here is my approach:
var regEx = new RegExp("(?![^<>]*>)" + title, 'gi');
context.innerHTML = context.innerHTML.replace(regEx, ''+text+"");
Here is a sample JSFiddle
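To see what the lookahead does, here is a small sketch on a plain string. The sample markup, the search word, and the bracket-wrapping replacement are made up for illustration.

```javascript
// Hypothetical sample markup and search word.
const html = '<p title="cons">Consequences of cons</p>';
const word = "cons";

// (?![^<>]*>) rejects a match if a ">" follows it with no "<" or ">" in between,
// which is the case when the match sits inside a tag (e.g. in an attribute value).
const regEx = new RegExp("(?![^<>]*>)" + word, "gi");

// Wrap each accepted match in brackets so the effect is visible.
const result = html.replace(regEx, (match) => "[" + match + "]");
console.log(result); // <p title="cons">[Cons]equences of [cons]</p>
```

Note that this relies on ">" only ever appearing as a tag delimiter. A literal ">" in the text (or inside a script block) would cause text before it to be skipped, so walking the text nodes is the more robust option.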
I want to find all spans in a document that have the "email" attribute and then, for each email address, check with my server whether the email is approved and inject into the span's content an img with a "yes" or a "no". I don't need the implementation of the PHP side, only JavaScript.
So say "newsletter@zend.com" is approved in my db and the HTML code is:
<span dir="ltr"><span class="yP" email="newsletter@zend.com">Zend Technologies</span></span>
Then the JavaScript will change the HTML to:
<span dir="ltr"><span class="yP" email="newsletter@zend.com"><img src="yes.gif" />Zend Technologies</span></span>
I need someone to guide me to the right direction on how to approach this.
Note: i don't want to use jQuery.
If you don't want to use a library, and you don't want to limit yourself to browsers that support querySelectorAll, you're probably best off with a simple recursive-descent function or getElementsByTagName. Here's an RD example:
The function (off-the-cuff, untested):
function findEmailSpans(container, callback) {
var node, emailAttr;
for (node = container.firstChild; node; node = node.nextSibling) {
if (node.nodeType === 1 && node.tagName === "SPAN") { // 1 = Element
emailAttr = node.getAttribute("email");
if (emailAttr) {
callback(node, emailAttr);
}
}
// Recurse into anything that can have children (still inside the for loop).
switch (node.nodeType) {
case 1: // Element
case 9: // Document
case 11: // DocumentFragment
findEmailSpans(node, callback);
break;
}
}
}
Calling it:
findEmailSpans(document.documentElement, function(span, emailAttr) {
// Do something with `span` and `emailAttr`
});
Alternately, if you want to rely on getElementsByTagName (which is quite widely supported) and don't mind building such a large NodeList in memory, that would be simpler and might be faster: It would let you get one flat NodeList of all span elements, so then you'd just have a simple loop rather than a recursive-descent function (not that the RD function is either difficult or slow, but still). Something like this:
var spans = document.getElementsByTagName("span"),
index, node, emailAttr;
for (index = 0; index < spans.length; ++index) {
node = spans.item(index);
emailAttr = node.getAttribute("email");
if (emailAttr) {
// Do something with `node` and `emailAttr`
}
}
You'll want to compare and decide which method suits you best; each probably has pros and cons.
References:
DOM3 Core Spec
However, for this sort of thing I really would recommend getting and using a good JavaScript library like jQuery, Prototype, YUI, Closure, or any of several others. With any good library, it can look something like this (jQuery):
$("span[email]").each(function() {
// Here, `this` refers to the span that has an email attribute
});
...or this (Prototype):
$$("span[email]").each(function(span) {
// Here, `span` refers to the span that has an email attribute
// (Prototype passes the element as an argument rather than as `this`)
});
...and the others won't be massively more complex. Using a library to factor out common ops like searching for things in the DOM lets you concentrate on the actual problem you're trying to solve. Both jQuery and (recent versions of) Prototype will defer to querySelectorAll on browsers that support it (and I imagine most others will, too), and fall back to their own search functions on browsers that don't.
You would use document.getElementsByTagName() to get a list of all spans. Then, check each span for the email attribute using Element.hasAttribute. Then you would use the Node interface to create and insert new tags accordingly.
EDIT:
window.addEventListener('load', callback, true);
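// A function declaration is hoisted, so `callback` is already defined when
// addEventListener runs above; with `var callback = function...` it would be undefined.
function callback() {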
var spanTags = document.getElementsByTagName('span');
for (var i = 0; i < spanTags.length; i += 1) {
if (spanTags[i].hasAttribute('email')) {
var imgElement = document.createElement('img');
imgElement.setAttribute('src', 'yes.gif');
spanTags[i].insertBefore(imgElement, spanTags[i].firstChild);
}
}
}
I would like to find all occurrences of the $ character in the DOM. How is this done?
You can't do something semantic like wrap $4.00 in a span element?
<span class="money">$4.00</span>
Then you would find elements belonging to class 'money' and manipulate them very easily. You could take it a step further...
<span class="money">$<span class="number">4.00</span></span>
I don't like being a jQuery plugger... but if you did that, jQuery would probably be the way to go.
One way to do it, though probably not the best, is to walk the DOM to find all the text nodes. Something like this might suffice:
var elements = document.getElementsByTagName("*");
var i, j, nodes;
for (i = 0; i < elements.length; i++) {
nodes = elements[i].childNodes;
for (j = 0; j < nodes.length; j++) {
if (nodes[j].nodeType !== 3) { // Node.TEXT_NODE
continue;
}
// regexp search or similar here
}
}
although, this would only work if the $ character was always in the same text node as the amount following it.
You could just use a Regular Expression search on the innerHTML of the body tag:
For instance - on this page:
var body = document.getElementsByTagName('body')[0];
var dollars = body.innerHTML.match(/\$[0-9]+\.?[0-9]*/g)
Results (at the time of my posting):
["$4.00", "$4.00", "$4.00"]
The easiest way to do this if you just need a bunch of strings and don't need a reference to the nodes containing $ would be to use a regular expression on the body's text content. Be aware that innerText and textContent aren't exactly the same. The main difference that could affect things here is that textContent contains the contents of <script> elements whereas innerText does not. If this matters, I'd suggest traversing the DOM instead.
var b = document.body, bodyText = b.textContent || b.innerText || "";
var matches = bodyText.match(/\$[\d.]*/g);
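The two patterns suggested in these answers do not match quite the same things. The sketch below runs both on a made-up sentence, which stands in for the page text, to show the difference.

```javascript
// Hypothetical sample text standing in for the page's text content.
const text = "Lunch was $4.00. Tip: $1, paid in $ only.";

// Requires at least one digit after "$", then an optional decimal part.
const strict = text.match(/\$[0-9]+\.?[0-9]*/g);

// Accepts "$" followed by any run of digits and dots, including an empty run.
const loose = text.match(/\$[\d.]*/g);

console.log(strict); // [ '$4.00', '$1' ]
console.log(loose);  // [ '$4.00.', '$1', '$' ]
```

The looser pattern also picks up a bare "$" and swallows a sentence-ending period, so which one fits depends on whether you want every "$" or only well-formed amounts.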
I'd like to add my 2 cents for Prototype. Prototype has some very simple DOM traversal functions that might do exactly what you are looking for.
Edit: here's a better answer.
The descendants() function collects all of the children, and their children, and allows them to be enumerated using the each() function:
// Prototype's $() looks elements up by id, so pass the body element itself.
$(document.body).descendants().each(function(item){
if(item.innerHTML.match(/\$/))
{
// Do Fun scripts
}
});
or if you want to start from document
Element.descendants(document).each(function(item){
if(item.innerHTML.match(/\$/))
{
// Do Fun scripts
}
});