How can I remove all instances of a specific text-phrase? - javascript

When only the body of a webpage is accessible, is there a way to remove all instances of a particular text phrase (written in HTML) using inline JavaScript or another language that can be used inline?
This could be useful in many situations, such as people using a Tiny.cc custom URL (tiny.cc/customurl) and wanting to remove the "tiny.cc/" portion.
Specifically, we're modifying a calendar plugin that uses Tiny.cc to create custom URLs (tiny.cc/customurl). The plugin shows the full URL by default, so we'd like to strip the text "tiny.cc/" and keep the "customurl" portion in our code:
<div class="ews_cal_grid_custom_item_3">
<div class="ews_cal_grid_select_checkbox_clear" id="wGridTagChk" onclick="__doPostBack('wGridTagChk', 'tiny.cc/Baseball-JV');" > </div>
tiny.cc/Baseball-JV
</div>
The part we'd like to remove is the http://tiny.cc/ on the 3rd line by itself.

To do this without replacing all the HTML (which wrecks all event handlers) and to do it without recursion (which is generally faster), you can do this:
function removeText(top, txt) {
var node = top.firstChild, index;
while(node && node != top) {
// if text node, check for our text
if (node.nodeType == 3) {
// without using regular expressions (to avoid escaping regex chars),
// replace all copies of this text in this text node
while ((index = node.nodeValue.indexOf(txt)) != -1) {
node.nodeValue = node.nodeValue.substr(0, index) + node.nodeValue.substr(index + txt.length);
}
}
if (node.firstChild) {
// if it has a child node, traverse down into children
node = node.firstChild;
} else if (node.nextSibling) {
// if it has a sibling, go to the next sibling
node = node.nextSibling;
} else {
// go up the parent chain until we find a parent that has a nextSibling
// so we can keep going
while ((node = node.parentNode) != top) {
if (node.nextSibling) {
node = node.nextSibling;
break;
}
}
}
}
}
Working demo here: http://jsfiddle.net/jfriend00/2y9eH/
To do this on the entire document, you would just call:
removeText(document.body, "tiny.cc/");
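The inner loop above removes one copy at a time with indexOf and substr. As a rough sketch of the same idea on a plain string (the helper name removeAllLiteral is made up for illustration and is not part of the answer), split and join strip every literal copy without any regex escaping:

```javascript
// Hypothetical helper: remove every literal copy of `phrase` from `text`.
// No regular expression is involved, so "." and "/" need no escaping.
function removeAllLiteral(text, phrase) {
  if (phrase === "") {
    return text; // an empty phrase would otherwise split between every character
  }
  return text.split(phrase).join("");
}

const cleaned = removeAllLiteral("tiny.cc/Baseball-JV and tiny.cc/Softball", "tiny.cc/");
console.log(cleaned); // "Baseball-JV and Softball"
```

Unlike the loop in removeText, this makes a single pass, so a copy of the phrase that only forms after a removal is left in place.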

As long as you can supply the data in string format, you can use Regular Expressions to do this for you.
You could parse the whole innerHTML of the body tag, if that is all that you can access. This is a slow and kinda-bad-practice method, but for explanation's sake:
document.body.innerHTML = document.body.innerHTML.replace(
/http:\/\/tiny\.cc\//gi, // The regular expression to search for (the g flag replaces every match)
""); // What to replace with (nothing).
The whole expression is contained within forward slashes, so any forward slashes inside the regexp need to be escaped with a backslash.
This goes for other characters that have special meaning in regexp, such as the period. A single period (.) denotes matching 'any' character. To match a period, it must be escaped (\.)
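To make the escaping point concrete, here is a small sketch comparing an unescaped and an escaped period. The test strings are invented for illustration:

```javascript
// An unescaped "." matches any character, so this pattern is too permissive.
const loosePattern = /tiny.cc\//;
// Escaping the period (and the slash) matches only the literal text "tiny.cc/".
const strictPattern = /tiny\.cc\//;

console.log(loosePattern.test("tinyXcc/Baseball-JV"));  // true, the "." matched "X"
console.log(strictPattern.test("tinyXcc/Baseball-JV")); // false
console.log(strictPattern.test("tiny.cc/Baseball-JV")); // true
```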
EDIT:
If you wish to keep the reference to the URL in the onclick, you can modify the regexp to not match when inside single quotes (as your example):
/([^']http:\/\/tiny\.cc\/[^'])/i

If you don't want to replace all the instances of that string in the HTML, then you'll have to recursively iterate over the node structure, for instance:
function textFilter(element, search, replacement) {
for (var i = 0; i < element.childNodes.length; i++) {
var child = element.childNodes[i];
var nodeType = child.nodeType;
if (nodeType == 1) { // element
textFilter(child, search, replacement);
} else if (nodeType == 3) { // text node
child.nodeValue = child.nodeValue.replace(search, replacement);
}
}
}
Then you just grab hold of the appropriate element, and call this function on it:
var el = document.getElementById('target');
textFilter(el, /http:\/\/tiny\.cc\//g, ""); // You could use a regex
textFilter(el, "Baseball", "Basketball"); // or just a simple string
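One caveat with the simple-string call: String.prototype.replace with a string pattern replaces only the first occurrence in each text node, while a regex with the g flag replaces all of them. A quick sketch on a plain string (the sample sentence is made up):

```javascript
const sentence = "Baseball today, Baseball tomorrow";

// A string pattern replaces only the first occurrence.
const firstOnly = sentence.replace("Baseball", "Basketball");
console.log(firstOnly); // "Basketball today, Baseball tomorrow"

// A regex with the g flag replaces every occurrence.
const everyOne = sentence.replace(/Baseball/g, "Basketball");
console.log(everyOne); // "Basketball today, Basketball tomorrow"
```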

Related

Is there an equivalent for string.find() in window objects?

So I know window.find(), a non-standard method that finds a string on the page; it returns true if found and false if not.
Now, is there something similar to string.replace(), but on the window object (e.g. window.replace()), that would replace all matching occurrences with something else (e.g. replace every "Hi" with "Hello")?
I don't think there is, but it's easier to write than you might suspect. You just walk the DOM looking for Text nodes and use replace on their nodeValue:
function replaceAll(element, regex, replacement) {
for (var child = element.firstChild;
child;
child = child.nextSibling) {
if (child.nodeType === 3) { // Text
child.nodeValue = child.nodeValue.replace(regex, replacement);
} else if (child.nodeType === 1) { // Element
replaceAll(child, regex, replacement);
}
}
}
There I used a regular expression (which needs to have the g flag) to get the "global" behavior when doing the replace, and for flexibility.
Live Example:
setTimeout(function() {
replaceAll(document.body, /one/g, "two");
}, 800);
<div>
Here's one.
<p>And here's one.</p>
<p>And here's <strong>one</strong></p>
</div>
If you want to use a simple string instead of a regular expression, just use a regular expression escape function (such as the ones in the answers to this question) and build your regex like this:
var regex = new RegExp(yourEscapeFunction(simpleString), "g");
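The answer leaves yourEscapeFunction unspecified. One commonly used form looks like the sketch below; the name escapeRegExp and the sample strings are assumptions for illustration, not something the answer prescribes:

```javascript
// Hypothetical escape helper: prefix every character that has special
// meaning in a regular expression with a backslash.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a global regex from a plain string, as described above.
const literalRegex = new RegExp(escapeRegExp("tiny.cc/"), "g");

const stripped = "tiny.cc/a tinyXcc/b".replace(literalRegex, "");
console.log(stripped); // "a tinyXcc/b"
```

Because the period is escaped, "tinyXcc/" is left alone; only the literal phrase is removed.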
The case this doesn't handle is where the target string crosses text nodes, like this:
<span>ex<span>ample</span></span>
Using the function above looking for "example", you wouldn't find it. I leave it as an exercise for the reader to handle that case if desired... :-)
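For anyone attempting that exercise, one possible starting point is to join the text of all text nodes and map a match back to the nodes it overlaps. The sketch below only locates the match and does not replace it. The textNode and elementNode helpers are made-up stand-ins that model just nodeType, nodeValue and childNodes so the code runs outside a browser:

```javascript
// Minimal stand-ins for DOM nodes (assumption: only these three properties are needed).
function textNode(value) {
  return { nodeType: 3, nodeValue: value, childNodes: [] };
}
function elementNode(children) {
  return { nodeType: 1, nodeValue: null, childNodes: children };
}

// Collect text nodes in document order.
function collectTextNodes(node, out) {
  for (const child of node.childNodes) {
    if (child.nodeType === 3) {
      out.push(child);
    } else if (child.nodeType === 1) {
      collectTextNodes(child, out);
    }
  }
  return out;
}

// Find `target` in the concatenated text and report which text nodes it spans.
function findAcrossNodes(root, target) {
  const nodes = collectTextNodes(root, []);
  const joined = nodes.map(n => n.nodeValue).join("");
  const start = joined.indexOf(target);
  if (start === -1) return null;
  const end = start + target.length;
  const spanned = [];
  let offset = 0;
  for (const n of nodes) {
    const nodeEnd = offset + n.nodeValue.length;
    // Keep the node if its character range overlaps the match range.
    if (nodeEnd > start && offset < end) spanned.push(n);
    offset = nodeEnd;
  }
  return { start, end, nodes: spanned };
}

// Models <span>ex<span>ample</span></span>
const tree = elementNode([
  elementNode([textNode("ex"), elementNode([textNode("ample")])])
]);
const hit = findAcrossNodes(tree, "example");
console.log(hit.nodes.map(n => n.nodeValue)); // [ 'ex', 'ample' ]
```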

Search body for document for {~ contents ~}

Alright, so basically I would like to search the Body tags for {~ , then get whatever follows that until ~} and turn that into a string (not including the {~ or ~} ).
const match = document.body.innerHTML.match(/\{~(.+)~\}/);
if (match) console.log(match[1]);
else console.log('No match found');
<body>text {~inner~} text </body>
$(function(){
var bodyText = document.getElementsByTagName("body")[0].innerHTML;
found=bodyText.match(/{~(.*?)~}/gi);
$.each(found, function( index, value ) {
var ret = value.replace(/{~/g,'').replace(/~}/g,'');
console.log(ret);
});
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js"></script>
<body> {~Content 1~}
{~Content 2~}
</body>
There you go, put gi at the end of the regex.
This is a harder problem to solve than it would first appear; things like script tags and comments can throw a wrench into things if you just grab the innerHTML of the body. The following function takes a base element to search, in your case you'll want to pass in document.body, and returns an array containing any of the strings found.
function getMyTags (baseElement) {
const rxFindTags = /{~(.*?)~}/g;
// .childNodes contains not only elements, but any text that
// is not inside of an element, comments as their own node, etc.
// We will need to filter out everything that isn't a text node
// or a non-script tag.
let nodes = baseElement.childNodes;
let matches = [];
nodes.forEach(node => {
let nodeType = node.nodeType;
// if this is a text node or an element, and it is not a script tag
if (nodeType === 3 || nodeType === 1 && node.nodeName !== 'SCRIPT') {
let html;
if (node.nodeType === 3) { // text node
html = node.nodeValue;
} else { // element
html = node.innerHTML; // or .innerText if you don't want the HTML
}
let match;
// search the html for matches until it can't find any more
while ((match = rxFindTags.exec(html)) !== null) {
// the [1] is to get the first capture group, which contains
// the text we want
matches.push(match[1]);
}
}
});
return matches;
}
console.log('All the matches in the body:', getMyTags(document.body));
console.log('Just in header:', getMyTags(document.getElementById('title')));
<h1 id="title"><b>{~Foo~}</b>{~bar~}</h1>
Some text that is {~not inside of an element~}
<!-- This {~comment~} should not be captured -->
<script>
// this {~script~} should not be captured
</script>
<p>Something {~after~} the stuff that shouldn't be captured</p>
The regular expression /{~(.*?)~}/g works like this:
{~ start our match at {~
(.*?) capture anything after it; the ? makes it "non-greedy" (also known as "lazy") so, if you have two instances of {~something~} in any of the strings we are searching it captures each individually instead of capturing from the first {~ to the last ~} in the string.
~} says there has to be a ~} after our match.
The g option makes it a 'global' search, meaning it will look for all matches in the string, not just the first one.
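The difference between the greedy and lazy forms is easiest to see side by side. A small sketch on a made-up string with two tags:

```javascript
const tagged = "text {~one~} middle {~two~} end";

// Greedy: .* runs to the last "~}" in the string, giving one long capture.
const greedyCaptures = Array.from(tagged.matchAll(/{~(.*)~}/g), m => m[1]);
console.log(greedyCaptures); // [ 'one~} middle {~two' ]

// Lazy: .*? stops at the first "~}" after each "{~".
const lazyCaptures = Array.from(tagged.matchAll(/{~(.*?)~}/g), m => m[1]);
console.log(lazyCaptures); // [ 'one', 'two' ]
```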
Further reading
childNodes
nodeType
Regular-Expressions.info has a great regular expression tutorial.
MDN RegExp documentation
Tools
There are lots of different tools out there to help you develop regular expressions. Here are a couple I've used:
RegExr has a great tool that explains how a particular regular expression works.
RegExPal

How to get numbers in elements' inner text by javascript's regex

I want to get the numbers in the inner text of an HTML document with a JavaScript regex so that I can replace them.
For example, in the code below I want to get 1,2,3,4,5,6,1,2,3,1,2,3, but not the 444 inside the div tag.
<body>
aaaa123aaa456
<div style="background: #444">aaaa123aaaa</div>
aaaa123aaa
</body>
What could be the regular expression?
Your best bet is to use innerText or textContent to get at the text without the tags and then just use the regex /\d/g to get the numbers.
function digitsInText(rootDomNode) {
var text = rootDomNode.textContent || rootDomNode.innerText;
return text.match(/\d/g) || [];
}
For example,
alert(digitsInText(document.body));
If your HTML is not in the DOM, you can try to strip the tags yourself : JavaScript: How to strip HTML tags from string?
Since you need to do a replacement, I would still try to walk the DOM and operate on text nodes individually, but if that is out of the question, try
var HTML_TOKEN = /(?:[^<\d]|<(?!\/?[a-z]|!--))+|<!--[\s\S]*?-->|<\/?[a-z](?:[^">']|"[^"]*"|'[^']*')*>|(\d+)/gi;
function incrementAllNumbersInHtmlTextNodes(html) {
return html.replace(HTML_TOKEN, function (all, digits) {
if ("string" === typeof digits) {
return "" + (+digits + 1);
}
return all;
});
}
then
incrementAllNumbersInHtmlTextNodes(
'<b>123</b>Hello, World!<p>I <3 Ponies</p><div id=123>245</div>')
produces
'<b>124</b>Hello, World!<p>I <4 Ponies</p><div id=123>246</div>'
It will get confused around where special elements like <script> end and won't recognize digits that are entity encoded, but should work otherwise.
You don't necessarily need RegExp to get the text contents of an element excluding its descendant elements' — in fact I'd advise against it as RegExp matching for HTML is notoriously difficult — there are DOM solutions:
function getImmediateText(element){
var text = '';
// Text and elements are all DOM nodes. We can grab the lot of immediate descendants and cycle through them.
for(var i = 0, l = element.childNodes.length, node; i < l && (node = element.childNodes[i]); ++i){
// nodeType 3 is text
if(node.nodeType === 3){
text += node.nodeValue;
}
}
return text;
}
var bodyText = getImmediateText(document.getElementsByTagName('body')[0]);
So here there's a function that will return only the immediate text content as a string. Of course, you could then strip that for numbers with the RegExp using something like this:
var numberString = (bodyText.match(/\d+/g) || []).join(''); // match() returns null when there are no digits
Just to answer my old question:
It is possible to achieve it by lookahead.
/\d(?=[^<>]*(<|$))/g
to replace the numbers
html.replace(/\d(?=[^<>]*(<|$))/g, function($0) {
return map[$0]
});
the source of the answer https://www.drupal.org/node/619198#comment-5710052
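Here is a rough check of that lookahead against the HTML from the question. The map in the answer is not shown, so the digitMap below is a made-up stand-in that turns each digit into a letter:

```javascript
const pageHtml = 'aaaa123aaa456<div style="background: #444">aaaa123aaaa</div>aaaa123aaa';

// A digit counts only if the next angle bracket after it is "<" (or the string ends),
// which means the digit sits in text rather than inside a tag.
const digitInText = /\d(?=[^<>]*(<|$))/g;

const foundDigits = pageHtml.match(digitInText).join("");
console.log(foundDigits); // "123456123123"

// Hypothetical map: index 0 -> "A", 1 -> "B", and so on.
const digitMap = "ABCDEFGHIJ";
const replacedHtml = pageHtml.replace(digitInText, function ($0) {
  return digitMap[$0];
});
console.log(replacedHtml);
// 'aaaaBCDaaaEFG<div style="background: #444">aaaaBCDaaaa</div>aaaaBCDaaa'
```

The 444 in the style attribute is untouched because the next angle bracket after it is ">".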

Matching a string only if it is not in <script> or <a> tags

I'm working on a browser plugin that replaces all instances of "someString" (as defined by a complicated regex) with $1. This generally works ok just doing a global replace on the body's innerHTML. However it breaks the page when it finds (and replaces) the "someString" inside <script> tags (i.e. as a JS variable or other JS reference). It also breaks if "someString" is already part of an anchor.
So basically I want to do a global replace on all instances of "someString" unless it falls inside a <script></script> or <a></a> tag set.
Essentially what I have now is:
var body = document.getElementsByTagName('body')[0].innerHTML;
body = body.replace(/(someString)/gi, '$1');
document.getElementsByTagName('body')[0].innerHTML = body;
But obviously that's not good enough. I've been struggling for a couple hours now and reading all of the answers here (including the many adamant ones that insist regex should not be used with HTML), so I'm open to suggestions on how to do this. I'd prefer using straight JS, but can use jQuery if necessary.
Edit - Sample HTML:
<body>
someString
<script type="text/javascript">
var someString = 'blah';
console.log(someString);
</script>
someString
</body>
In that case, only the very first instance of "someString" should be replaced.
Try this and see if it meets your needs (tested in IE 8 and Chrome).
<script src="jquery-1.4.4.js" type="text/javascript"></script>
<script>
var pattern = /(someString)/gi;
var replacement = "$1";
$(function() {
$("body :not(a,script)")
.contents()
.filter(function() {
return this.nodeType == 3 && this.nodeValue.search(pattern) != -1;
})
.each(function() {
var span = document.createElement("span");
span.innerHTML = "&nbsp;" + $.trim(this.nodeValue.replace(pattern, replacement));
this.parentNode.insertBefore(span, this);
this.parentNode.removeChild(this);
});
});
</script>
The code uses jQuery to find all the text nodes within the document's <body> that are not in <a> or <script> blocks and that contain the search pattern. Once those are found, a span is injected containing the target node's modified content, and the old text node is removed.
The only issue I saw was that IE 8 handles text nodes containing only whitespace differently than Chrome, so sometimes a replacement would lose a leading space, hence the insertion of the non-breaking space before the text containing the regex replacements.
Well, you can use XPath with Mozilla (assuming you're writing the plugin for Firefox). The call is document.evaluate. Or you can use an XPath library to do it (there are a few out there)...
var matches = document.evaluate(
'//*[not(name() = "a") and not(name() = "script") and contains(., "string")]',
document,
null,
XPathResult.UNORDERED_NODE_ITERATOR_TYPE,
null
);
Then replace using a callback function:
var callback = function(node) {
var text = node.nodeValue;
text = text.replace(/(someString)/gi, '$1');
var div = document.createElement('div');
div.innerHTML = text;
// div.childNodes is live: inserting a child elsewhere removes it from div,
// so keep moving the first child until none are left
while (div.firstChild) {
node.parentNode.insertBefore(div.firstChild, node);
}
node.parentNode.removeChild(node);
};
var nodes = [];
//cache the tree since we want to modify it as we iterate
var node = matches.iterateNext();
while (node) {
nodes.push(node);
node = matches.iterateNext();
}
for (var key = 0, length = nodes.length; key < length; key++) {
node = nodes[key];
// Check for a Text node
if (node.nodeType == Node.TEXT_NODE) {
callback(node);
} else {
for (var i = 0, l = node.childNodes.length; i < l; i++) {
var child = node.childNodes[i];
if (child.nodeType == Node.TEXT_NODE) {
callback(child);
}
}
}
}
I know you don't want to hear this, but this doesn't sound like a job for a regex. Regular expressions don't do negative matches very well before becoming complicated and unreadable.
Perhaps this regex might be close enough though:
/>[^<]*(someString)[^<]*</
It captures any instance of someString that is in between a > and a <.
Another idea is if you do use jQuery, you can use the :contains pseudo-selector.
$('*:contains(someString)').each(function(i)
{
var markup = $(this).html();
// modify markup to insert anchor tag
$(this).html(markup)
});
This will grab any DOM element that contains 'someString' in its text. I don't think it will traverse <script> tags, so you should be good.
You could try the following:
/(someString)(?![^<]*?(<\/a>|<\/script>))/
I didn't test every scenario, but it basically uses a negative lookahead to look for the next opening bracket following someString, and if that bracket is part of an anchor or script closing tag, it does not match.
Your example seems to work in this fiddle, although it certainly doesn't cover all possibilities. In cases where the innerHTML in your <a></a> contains tags (like <b> or <span>), or the code in your script tags generates html (contains strings with tags in it), you would need something more complex.
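To illustrate both the working case and the limitation described above, here is a small sketch that runs the lookahead regex (with the g flag added) over two made-up HTML strings. The "[$1]" replacement is only a visible marker for this demo:

```javascript
const skipLinksAndScripts = /(someString)(?![^<]*?(<\/a>|<\/script>))/g;

// Simple case: plain text, a script, and a link containing only text.
const simpleHtml = 'someString <script>var someString = 1;</script> <a href="#">someString</a> someString';
const simpleResult = simpleHtml.replace(skipLinksAndScripts, "[$1]");
console.log(simpleResult);
// '[someString] <script>var someString = 1;</script> <a href="#">someString</a> [someString]'

// Limitation: a tag nested inside the link hides the closing </a> from the lookahead.
const nestedHtml = '<a href="#"><b>someString</b></a>';
const nestedResult = nestedHtml.replace(skipLinksAndScripts, "[$1]");
console.log(nestedResult);
// '<a href="#"><b>[someString]</b></a>' (replaced even though it is inside a link)
```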

Help write regex that will surround certain text with <strong> tags, only if the <strong> tag isn't present

I have several posts on a website; all these posts are chat conversations of this type:
AD: Hey!
BC: What's up?
AD: Nothing
BC: Okay
They're marked up as simple paragraphs surrounded by <p> tags.
Using the javascript replace function, I want all instances of "AD" in the beginning of a conversation (ie, all instances of "AD" at the starting of a line followed by a ":") to be surrounded by <strong> tags, but only if the instance isn't already surrounded by a <strong> tag.
What regex should I use to accomplish this? Am I trying to do what this advises against?
The code I'm using is like this:
var posts = document.getElementsByClassName('entry-content');
for (var i = 0; i < posts.length; i++) {
posts[i].innerHTML = posts[i].innerHTML.replace(/some regex here/,
'replaced content here');
}
If AD: is always at the start of a line then the following regex should work, using the m switch:
.replace(/^AD:/gm, "<strong>AD:</strong>");
You don't need to check for the existence of <strong> because ^ will match the start of the line and the regex will only match if the sequence of characters that follows the start of the line are AD:.
You're not going against the "Don't use regex to parse HTML" advice because you're not parsing HTML, you're simply replacing a string with another string.
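A quick sketch of how the m flag behaves, using a made-up multi-line string in which one line is already wrapped and another mentions AD: mid-line:

```javascript
const chat = "AD: Hey!\nBC: What's up?\n<strong>AD:</strong> Nothing\nBC: Okay, AD: too";

// With the m flag, ^ matches at the start of every line, not just the start of the string.
const bolded = chat.replace(/^AD:/gm, "<strong>AD:</strong>");
console.log(bolded);
// <strong>AD:</strong> Hey!
// BC: What's up?
// <strong>AD:</strong> Nothing
// BC: Okay, AD: too
```

The third line is left alone because it starts with "<strong>" rather than "AD:", and the AD: in the last line is skipped because it is not at the start of a line.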
An alternative to regex would be to work with ranges, creating a range selecting the text and then using execCommand to make the text bold. However, I think this would be much more difficult and you would likely face differences in browser implementations. The regex way should be enough.
After seeing your comment, the following regex would work fine:
.replace(/<(p|br)>AD:/gm, "<$1><strong>AD:</strong>");
Wouldn't it be easier to set the class or style property of the found paragraph to font-weight: bold or a class that does roughly the same? That way you wouldn't have to worry about adding in tags, or searching for existing tags. Might perform better, too, if you don't have to do any string replaces.
If you really want to add the strong tags anyway, I'd suggest using DOM functions to find childNodes of your paragraph that are <strong>, and if you don't find one, add it and move the original (text) childNode of the paragraph into it.
Using regular expressions on the innerHTML isn't reliable and will potentially lead to problems. The correct way to do this is a tiresome process but is much more reliable.
E.g.
for (var i = 0, l = posts.length; i < l; i++) {
findAndReplaceInDOM(posts[i], /^AD:/g, function(match, node){
// Make sure current node does not have a <strong> as a parent
if (node.parentNode.nodeName.toLowerCase() === 'strong') {
return false;
}
// Create and return new <strong>
var s = document.createElement('strong');
s.appendChild(document.createTextNode(match[0]));
return s;
});
}
And the findAndReplaceInDOM function:
function findAndReplaceInDOM(node, regex, replaceFn) {
// Note: regex MUST have global flag
if (!regex || !regex.global || typeof replaceFn !== 'function') {
return;
}
var start, end, match, parent, leftNode,
rightNode, replacementNode, text,
d = document;
// Loop through all childNodes of "node"
if (node = node && node.firstChild) do {
if (node.nodeType === 1) {
// Regular element, recurse:
findAndReplaceInDOM(node, regex, replaceFn);
} else if (node.nodeType === 3) {
// Text node, introspect
parent = node.parentNode;
text = node.data;
regex.lastIndex = 0;
while (match = regex.exec(text)) {
replacementNode = replaceFn(match, node);
if (!replacementNode) {
continue;
}
end = regex.lastIndex;
start = end - match[0].length;
// Effectively split node up into three parts:
// leftSideOfReplacement + REPLACEMENT + rightSideOfReplacement
leftNode = d.createTextNode( text.substring(0, start) );
rightNode = d.createTextNode( text.substring(end) );
parent.insertBefore(leftNode, node);
parent.insertBefore(replacementNode, node);
parent.insertBefore(rightNode, node);
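// Remove original node from document
parent.removeChild(node);
// Continue with the remaining text so that later matches and the
// sibling loop keep working on a node that is still in the document
node = rightNode;
text = rightNode.data;
regex.lastIndex = 0;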
}
}
} while (node = node.nextSibling);
}
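The "split node up into three parts" step can be tried on a plain string without a DOM. This is only a sketch of that splitting logic; splitAroundMatches is a hypothetical helper, not part of the answer:

```javascript
// Hypothetical helper: split `text` into plain and matched pieces, in order.
function splitAroundMatches(text, regex) {
  if (!regex.global) {
    throw new Error("regex must have the g flag");
  }
  const pieces = [];
  let cursor = 0;
  let match;
  regex.lastIndex = 0;
  while ((match = regex.exec(text)) !== null) {
    if (match[0] === "") {
      regex.lastIndex++; // step past empty matches to avoid an infinite loop
      continue;
    }
    const end = regex.lastIndex;
    const start = end - match[0].length;
    if (start > cursor) {
      pieces.push({ type: "text", value: text.substring(cursor, start) });
    }
    pieces.push({ type: "match", value: match[0] });
    cursor = end;
  }
  if (cursor < text.length) {
    pieces.push({ type: "text", value: text.substring(cursor) });
  }
  return pieces;
}

const pieces = splitAroundMatches("AD: Hey AD: again", /AD:/g);
console.log(pieces.map(p => p.type + ":" + p.value));
// [ 'match:AD:', 'text: Hey ', 'match:AD:', 'text: again' ]
```

In the DOM version, each "text" piece would become a text node and each "match" piece would be handed to replaceFn.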
