JavaScript + Regex:: Replace "foo" with "bar" across entire document, excluding URLs

I'm trying to replace all instances of "foo" on a page with "bar", but to exclude instances occurring within image or URL links.
The current code I have is a simple replace:
document.documentElement.innerHTML = document.documentElement.innerHTML.replace(/foo/g, "bar");
But it breaks images and links containing "foo" in the address.
I'm looking for a regular expression replacement that will take the following:
foo
barfoo
foo
<img src="foo.jpg">
And give me:
bar
barbar
bar
<img src="foo.jpg">
If this can't be accomplished with regex in JavaScript, would there be a more elegant way to only run the replacement against non-URL strings?

Yeah, you're not going to want to use regex to do this. What you want to do is replace the text of every text node in your DOM tree. Try something like this.
var allElements = document.getElementsByTagName("*"); // Get every element.
for (var i = 0; i < allElements.length; i++) {
var children = allElements.item(i).childNodes;
for (var j = 0; j < children.length; j++) {
if (children[j].nodeType === 3 /* is this node a text node? */) {
children[j].nodeValue = children[j].nodeValue.replace(/foo/g, "bar"); // run your replacement regex here
}
}
}

There are 2 problems to solve.
Firstly, you need to get all the text nodes. This is a problem in and of itself.
This thread on stackoverflow discusses some techniques.
getElementsByTagName() equivalent for textNodes
Once you have your text nodes, you can run your regex on each node, and be fairly certain that you got everything.
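The two steps above can be sketched roughly as follows. This is a minimal sketch rather than a definitive implementation: the helper names collectTextNodes and replaceInTextNodes are made up for illustration, and it assumes that text inside script and style elements should be left alone. It relies only on the standard nodeType, nodeName, childNodes and nodeValue properties.

```javascript
// Hypothetical helper: gather every text node under `root`, depth first.
function collectTextNodes(root, out) {
  out = out || [];
  for (var i = 0; i < root.childNodes.length; i++) {
    var child = root.childNodes[i];
    if (child.nodeType === 3) {                 // 3 === Node.TEXT_NODE
      out.push(child);
    } else if (child.nodeType === 1 &&          // 1 === Node.ELEMENT_NODE
               !/^(script|style)$/i.test(child.nodeName)) {
      collectTextNodes(child, out);             // assumption: skip script/style
    }
  }
  return out;
}

// Hypothetical helper: run the replacement on text nodes only, so attribute
// values such as src="foo.jpg" or an href are never touched.
function replaceInTextNodes(root, pattern, replacement) {
  var nodes = collectTextNodes(root);
  for (var i = 0; i < nodes.length; i++) {
    nodes[i].nodeValue = nodes[i].nodeValue.replace(pattern, replacement);
  }
  return nodes.length;                          // number of text nodes visited
}

// Browser usage (guarded so the code also loads outside a browser):
if (typeof document !== "undefined") {
  replaceInTextNodes(document.body, /foo/g, "bar");
}
```

Because only nodeValue is rewritten, the elements themselves are never re-parsed, which is what made the innerHTML approach break images and links.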

Related

JavaScript regex that matches the .innerHTML attribute of any element

I am currently building a Chrome extension which has to find specific pages in a website, namely the Log In / Sign In page, the Sign Up / Register page, the About page and the Contact Us page.
I am trying to achieve this by first getting the list of elements in the page (which I have already done). Now I need to check that each element is a leaf node in the DOM and that its innerHTML contains part of the keyword, and I am trying to do this with a regex. I managed to build a regex which successfully returns what's between a start or end tag of an element (i.e. the tag name along with its attributes), but not the innerHTML. Below is what I have done so far (with the example for the About page):
var list = document.body.getElementsByTagName("*");
var aboutElement = /^[^<.+>].*About.*[^(<.+>]$/i;
for (var i = 0; i <= list.length; i++) {
if ((aboutElement.test(list[i].innerHTML)) || (aboutElement.test(list[i].alt))) {
list[i].click();
}
}
Any idea what I should add to it such that it only matches leaf nodes (nodes which do not contain other nodes) and not what's in a start or end tag? I also think that with what I've done it's going to match everything in the innerHTML because of the .* part so I may need to change that as well. Any help would be greatly appreciated!

Thanks to two of the answers in the comments I managed to solve the problem. I used .textContent and changed the regex as shown below and it worked.
var list = document.body.getElementsByTagName("*");
var aboutElement = /^(.*?\s*(\bAbout\b)[^$]*)$/i;
for (var i = 0; i < list.length; i++) {
if ((aboutElement.test(list[i].textContent)) || (aboutElement.test(list[i].alt))) {
list[i].click();
}
}
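One way to restrict the match to leaf elements, which the question asked about, is to skip any element that has child elements. The following is a minimal sketch, assuming that "leaf" means an element with no element children (children.length === 0). findLeafMatches is a hypothetical name, and the word-boundary pattern is a simplification of the regex above.

```javascript
// Hypothetical helper: return the elements that have no child elements and
// whose text content or alt attribute matches `pattern`.
// `pattern` should not carry the g flag, since test() is stateful with it.
function findLeafMatches(elements, pattern) {
  var matches = [];
  for (var i = 0; i < elements.length; i++) {   // note: <, not <=
    var el = elements[i];
    if (el.children.length > 0) {
      continue;                                  // has child elements: not a leaf
    }
    if (pattern.test(el.textContent || "") || pattern.test(el.alt || "")) {
      matches.push(el);
    }
  }
  return matches;
}

// Browser usage (guarded so the code also loads outside a browser):
if (typeof document !== "undefined") {
  var found = findLeafMatches(document.body.getElementsByTagName("*"), /\bAbout\b/i);
  // e.g. if (found.length) { found[0].click(); }
}
```

Returning the matches instead of clicking inside the loop leaves it to the caller to decide which element, if any, to act on.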

Iterate through all html tags, including children in Javascript

Just to clarify what I'm trying to do, I'm trying to make a Chrome extension that can loop through the HTML of the current page and remove html tags containing certain text. But I'm having trouble looping through every html tag.
I've done a bunch of searching for the answer and pretty much every answer says to use:
var items = document.getElementsByTagName("*");
for (var i = 0; i < items.length; i++) {
//do stuff
}
However, I've noticed that if I rebuild the HTML from the page using the elements in "items," I get something different than the page's actual HTML.
For example, the code below returns false:
var html = "";
var elems = document.getElementsByTagName("*");
for (var i = 0; i < elems.length; i++) {
html += elems[i].outerHTML;
}
alert(document.body.outerHTML == html)
I also noticed that the code above wasn't giving ALL the html tags, it grouped them into one tag, for example:
var html = "";
var elems = document.getElementsByTagName("*");
alert(elems[0].outerHTML);
I tried fixing the above by recursively looking for an element's children, but I couldn't seem to get that to work.
Ideally, I would like to be able to get every individual tag, rather than ones wrapped in other tags. I'm kind of new to Javascript so any advice/explanations or example code (In pure javascript if possible) as to what I'm doing wrong would be really helpful. I also realize my approach might be completely wrong, so any better ideas are welcome.

What you need is the famous Douglas Crockford's WalkTheDOM:
function walkTheDOM(node, func)
{
func(node);
node = node.firstChild;
while (node)
{
walkTheDOM(node, func);
node = node.nextSibling;
}
}
For each node the func will be executed. You can filter, transform or whatever by injecting the proper function.
To remove nodes containing a specific text you would do:
function removeAll(node)
{
// protect against "node === undefined"
if (node && node.nodeType === 3) // TEXT_NODE
{
if (node.textContent.indexOf(filter) !== -1) // contains offending text
{
node.parentNode.removeChild(node);
}
}
}
You can use it like this:
filter = "the offending text";
walkTheDOM(document.getElementsByTagName("BODY")[0], removeAll);
If you want to parameterize by offending text you can do that, too, by transforming removeAll into a closure that is instantiated.
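A minimal sketch of such a closure, assuming a plain substring match is what's wanted. makeTextRemover and walkTheDOMSafely are hypothetical names. The walker here differs from the one above in one respect: it reads nextSibling before visiting a child, because a node that has been removed from the document has a null nextSibling, and a loop that reads it afterwards would stop before reaching the remaining siblings.

```javascript
// Variant of the walker that survives removals: the next sibling is read
// before the child is visited, since a removed node no longer has one.
function walkTheDOMSafely(node, func) {
  func(node);
  var child = node.firstChild;
  while (child) {
    var next = child.nextSibling;
    walkTheDOMSafely(child, func);
    child = next;
  }
}

// Hypothetical factory: returns a callback that removes any text node
// whose content contains `text`.
function makeTextRemover(text) {
  return function (node) {
    if (node && node.nodeType === 3 &&          // 3 === Node.TEXT_NODE
        node.nodeValue.indexOf(text) !== -1) {
      node.parentNode.removeChild(node);
    }
  };
}

// Browser usage (guarded so the code also loads outside a browser):
if (typeof document !== "undefined") {
  walkTheDOMSafely(document.body, makeTextRemover("the offending text"));
}
```

Each call to makeTextRemover captures its own text, so no shared global filter variable is needed.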

References to DOM elements in JavaScript are references to memory addresses of the actual nodes, so you can do something like this:
Array.prototype.slice.call(document.getElementsByTagName('*')).forEach(function(node) {
if(node.innerHTML === 'Hello') {
node.parentNode.removeChild(node);
}
});
Obviously node.innerHTML === 'Hello' is just an example, so you'll probably want to figure out how you want to match the text content (perhaps with a RegEx?)

javascript regex for links and links class

I need to collect all links out of text in javascript with regex, separating the actual content of href and the text of the link. So if the link is
John Dow
I want to collect the content of href and "John Dow".
The links have class="r_lapi" in them that would identify the links I'm looking for.
What I have right now is:
var link_regex = new RegExp("/<a[^]*</a>/");
var match = content.match(link_regex, 'i');
console.log("match =", match );
Which does absolutely nothing. Any help is very much appreciated.

If you can use the DOM (you've said you want regex, but...)
var i;
var links = document.querySelectorAll("a.r_lapi");
for (i = 0; i < links.length; ++i) {
// use `links[i].innerHTML` here
}
You've said in a comment that you're trying to do this with regex because you're receiving the link HTML (presumably mixed with a bunch of other stuff) via ajax. You can use the browser to parse it and then look for the links in the parsed result, without adding the HTML to your document, using a disconnected element:
var div, links, i;
// Create an element; note we don't append it anywhere
div = document.createElement('div');
// Fill it in with the HTML
div.innerHTML = text;
// Find relevant links (same as the earlier example)
links = div.querySelectorAll("a.r_lapi");
for (i = 0; i < links.length; ++i) {
// use `links[i].innerHTML` here
}
Live Example, using this text returned via ajax:
John Dow
Don't pick me
Jane Bloggs
The only real "gotcha" here is that if the HTML contains image tags, the browser will start downloading those images (even though they won't be shown anywhere). This is true even if you use a document fragment, which is part of why I didn't bother above. (script tags in the text aren't a problem, they aren't executed when you use innerHTML but beware they are executed by things like jQuery's html function.)
Or if the data is coming back to you in some other form (like JSON), with the HTML in it, parse the JSON (or whatever) and then run each HTML fragment through the div one at a time:
function handleLinks(data) {
var div, links, htmlIndex, linkIndex;
div = document.createElement('div');
for (htmlIndex = 0; htmlIndex < data.htmlList.length; ++htmlIndex) {
div.innerHTML = data.htmlList[htmlIndex];
links = div.querySelectorAll("a.r_lapi");
for (linkIndex = 0; linkIndex < links.length; ++linkIndex) {
// Use `links[linkIndex].innerHTML` here
}
}
}
Live Example, using this JSON returned via ajax:
{
"htmlList": [
"blah blah John Dow blah blah",
"Don't pick me",
"Two in this one Jane Bloggs and Trevor Bloggs"
]
}
If you really need to use regex:
Beware that you cannot do this reliably with regular expressions in JavaScript; you need a parser.
You can get close with a couple of assumptions.
var link_regex = /<a(?:>|\s[^>]*>)(.*?)<\/a>/i;
var match = content.match(link_regex);
if (match) {
// Use match[1], which contains it
}
That looks for this:
The literal text <a
Either a > immediately following, or at least one whitespace character followed by any number of characters that aren't a >, followed by a >
Any number of characters, minimal-match
The literal text </a>
The "minimal match" in Step 3 is so we don't get more than we want if we have <a>first</a><a>second</a>.
I haven't tried to limit the regex by the class, I'll leave that as an exercise for the reader. :-)
Again, though, this is a bad idea. Instead, use the DOM (if you're doing this outside a browser, there are plenty of DOM implementations you can use).
One of the primary assumptions made above is that there is never a > character within an attribute value in the anchor (e.g., John Dow>). It's perfectly valid to have a > inside an attribute value, so that assumption is invalid.

If you're in a browser, you really should be using the native DOM.
If you're not, assuming the href does not contain weird characters like > or ", you could use the following regex:
var matches = link.match(/^<a\s+[^>]*href="([^"]+)"[^>]*>([^<]*)<\/a>$/);
matches[1] == "someplace/topics/us/john.htm";
matches[2] == "John Dow";
Please note that this will fail on certain links like
test
John <b>Dow</b>
For a complete solution, use an HTML parser.
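To collect every matching link rather than a single one, the same kind of pattern can be applied globally. This is a sketch only, under the same fragile assumptions as the regexes above (double-quoted attribute values, no > inside an attribute value, no nested tags in the link text), plus the assumption that the r_lapi class appears in a double-quoted class attribute on the opening tag. extractLinks is a hypothetical name.

```javascript
// Hypothetical helper: return { href, text } for each <a> whose opening tag
// carries class r_lapi. Not a parser; it fails on markup outside the
// assumptions listed above.
function extractLinks(html) {
  var linkRegex = /<a\s+([^>]*)>([^<]*)<\/a>/gi;   // attributes, then link text
  var hrefRegex = /\bhref="([^"]*)"/i;
  var classRegex = /\bclass="[^"]*\br_lapi\b[^"]*"/i;
  var results = [];
  var match;
  while ((match = linkRegex.exec(html)) !== null) {
    var attrs = match[1];
    var href = hrefRegex.exec(attrs);
    if (href && classRegex.test(attrs)) {
      results.push({ href: href[1], text: match[2] });
    }
  }
  return results;
}
```

A link such as John <b>Dow</b> is still skipped by this sketch, because the link text is required to contain no tags.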

Javascript .replace() with RegEx causes browser to hang/crash

My goal is to loop through a set of given elements and replace their inner HTML using a linkifying regex, so I can convert text of the form http://*.*/* into a clickable link to http://*.*/*
So I'm running a bit of vanilla javascript:
for (var i = 0; i < document.getElementsByClassName('title').length; i++) {
var title = document.getElementsByClassName('title')[i]
title.innerHTML = title.innerHTML.replace(/(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig,"<a target='_blank' href='$1'>$1</a>")
}
Here's just the RegExp I'm using:
/(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig
So, why on earth would this loop cause the browser to hang? The loop is over text no longer than 256 characters and there are usually between 5 and 30 .title elements, definitely not the levels of data that would crash/hang a browser. I've only experienced it in Chrome/Safari, unsure if it happens in Firefox/Opera or not.

Try storing results.
var titles = document.querySelectorAll(".title");
// querySelectorAll is supported in slightly more browsers than getElementsByClassName
var l = titles.length, i, title;
for( i=0; i<l; i++) {
title = titles[i];
title.innerHTML = title.innerHTML.replace(/..../,'....');
}
If the hanging continues, it's probably because the regex is matching URLs you've already replaced. Try adding a negative lookahead, (?!'), to ensure there is no single quote immediately after the URL you are matching.

Javascript search for all occurences of a character in the dom?

I would like to find all occurrences of the $ character in the DOM. How is this done?

You can't do something semantic like wrap $4.00 in a span element?
<span class="money">$4.00</span>
Then you would find elements belonging to class 'money' and manipulate them very easily. You could take it a step further...
<span class="money">$<span class="number">4.00</span></span>
I don't like being a jQuery plugger... but if you did that, jQuery would probably be the way to go.

One way to do it, though probably not the best, is to walk the DOM to find all the text nodes. Something like this might suffice:
var elements = document.getElementsByTagName("*");
var i, j, nodes;
for (i = 0; i < elements.length; i++) {
nodes = elements[i].childNodes;
for (j = 0; j < nodes.length; j++) {
if (nodes[j].nodeType !== 3) { // Node.TEXT_NODE
continue;
}
// regexp search or similar here
}
}
Note, though, that this only works if the $ character is always in the same text node as the amount following it.

You could just use a Regular Expression search on the innerHTML of the body tag:
For instance - on this page:
var body = document.getElementsByTagName('body')[0];
var dollars = body.innerHTML.match(/\$[0-9]+\.?[0-9]*/g)
Results (at the time of my posting):
["$4.00", "$4.00", "$4.00"]

The easiest way to do this if you just need a bunch of strings and don't need a reference to the nodes containing $ would be to use a regular expression on the body's text content. Be aware that innerText and textContent aren't exactly the same. The main difference that could affect things here is that textContent contains the contents of <script> elements whereas innerText does not. If this matters, I'd suggest traversing the DOM instead.
var b = document.body, bodyText = b.textContent || b.innerText || "";
var matches = bodyText.match(/\$[\d.]*/g);
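Note that /\$[\d.]*/ also matches a bare $ with no digits after it, and keeps a trailing period when an amount ends a sentence. A slightly stricter pattern is sketched below, assuming amounts look like $4 or $4.00 with no thousands separators. findDollarAmounts is a hypothetical helper name.

```javascript
// Hypothetical helper: return every dollar amount found in `text`.
function findDollarAmounts(text) {
  // "$", one or more digits, then optionally a decimal point and more digits.
  return text.match(/\$\d+(?:\.\d+)?/g) || [];   // match() returns null if nothing matches
}

// Browser usage (guarded so the code also loads outside a browser):
if (typeof document !== "undefined") {
  var amounts = findDollarAmounts(document.body.textContent || "");
}
```

The `|| []` fallback means callers can always read the length of the result without a null check.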

I'd like to add my 2 cents for Prototype. Prototype has some very simple DOM traversal functions that might get exactly what you are looking for.
The descendants() function collects all of the children, and their children, and allows them to be enumerated using the each() function. Note that $() with a string looks an element up by ID, so pass document.body itself:
$(document.body).descendants().each(function(item){
if(item.innerHTML.match(/\$/))
{
// Do Fun scripts
}
});
or if you want to start from document
Element.descendants(document).each(function(item){
if(item.innerHTML.match(/\$/))
{
// Do Fun scripts
}
});
