How to get numbers in elements' inner text by javascript's regex - javascript

I want to find the digits in the text content of an HTML document with a JavaScript regex so that I can replace them.
For example, in the code below I want to get 1,2,3,4,5,6,1,2,3,1,2,3, but not the 444 inside the div tag's style attribute.
<body>
aaaa123aaa456
<div style="background: #444">aaaa123aaaa</div>
aaaa123aaa
</body>
What could be the regular expression?

Your best bet is to use innerText or textContent to get at the text without the tags and then just use the regex /\d/g to get the numbers.
function digitsInText(rootDomNode) {
var text = rootDomNode.textContent || rootDomNode.innerText;
return text.match(/\d/g) || [];
}
For example,
alert(digitsInText(document.body));
If your HTML is not in the DOM, you can try to strip the tags yourself; see "JavaScript: How to strip HTML tags from string?"
Since you need to do a replacement, I would still try to walk the DOM and operate on text nodes individually, but if that is out of the question, try
var HTML_TOKEN = /(?:[^<\d]|<(?!\/?[a-z]|!--))+|<!--[\s\S]*?-->|<\/?[a-z](?:[^">']|"[^"]*"|'[^']*')*>|(\d+)/gi;
function incrementAllNumbersInHtmlTextNodes(html) {
return html.replace(HTML_TOKEN, function (all, digits) {
if ("string" === typeof digits) {
return "" + (+digits + 1);
}
return all;
});
}
then
incrementAllNumbersInHtmlTextNodes(
'<b>123</b>Hello, World!<p>I <3 Ponies</p><div id=123>245</div>')
produces
'<b>124</b>Hello, World!<p>I <4 Ponies</p><div id=123>246</div>'
It will get confused around where special elements like <script> end and won't recognize digits that are entity encoded, but should work otherwise.
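For the DOM-walking route mentioned above, a minimal sketch might look like the following. The name replaceDigitsInTextNodes and the replaceDigit callback are made up for illustration; the function only relies on nodeType, childNodes and nodeValue, and it does not special-case <script> or <style> content.

```javascript
// Hypothetical helper: rewrites every digit found in the text nodes below `root`.
// `replaceDigit` is an assumed callback that maps one digit character to its replacement.
function replaceDigitsInTextNodes(root, replaceDigit) {
  for (var i = 0; i < root.childNodes.length; i++) {
    var node = root.childNodes[i];
    if (node.nodeType === 3) {
      // Text node: only its text changes, so attributes such as style="#444" are never touched
      node.nodeValue = node.nodeValue.replace(/\d/g, replaceDigit);
    } else if (node.nodeType === 1) {
      // Element node: recurse into its children
      replaceDigitsInTextNodes(node, replaceDigit);
    }
  }
}
```

In a page this could be called as replaceDigitsInTextNodes(document.body, function (d) { return map[d]; }), assuming map holds your digit replacements.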

You don't necessarily need RegExp to get the text contents of an element excluding its descendant elements' — in fact I'd advise against it as RegExp matching for HTML is notoriously difficult — there are DOM solutions:
function getImmediateText(element){
var text = '';
// Text and elements are all DOM nodes. We can grab the lot of immediate descendants and cycle through them.
for(var i = 0, l = element.childNodes.length, node; i < l; ++i){
node = element.childNodes[i];
// nodeType 3 is text
if(node.nodeType === 3){
text += node.nodeValue;
}
}
return text;
}
var bodyText = getImmediateText(document.getElementsByTagName('body')[0]);
So here there's a function that will return only the immediate text content as a string. Of course, you could then extract the digits from it with a RegExp, using something like this (match() returns null when nothing matches, hence the fallback to an empty array):
var numberString = (bodyText.match(/\d+/g) || []).join('');
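Note that the answers so far use two slightly different patterns. As a small illustration (the variable names here are only for the example), /\d/g yields one entry per digit, which is what the question asks for, while /\d+/g yields one entry per run of digits:

```javascript
var sample = "aaaa123aaa456";

// One array entry per single digit
var singleDigits = sample.match(/\d/g) || [];

// One array entry per run of consecutive digits
var digitRuns = sample.match(/\d+/g) || [];

// match() returns null when there is no match, so fall back to an empty array
var noDigits = "aaaa".match(/\d/g) || [];
```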

Just to answer my old question:
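It is possible to achieve this with a lookahead that only accepts a digit when the next angle bracket after it is a "<" (or the end of the string is reached first), i.e. when the digit is not inside a tag:
/\d(?=[^<>]*(<|$))/g
To replace the numbers:
html.replace(/\d(?=[^<>]*(<|$))/g, function($0) {
// map is an object that maps each digit to its replacement
return map[$0];
});
Source of the answer: https://www.drupal.org/node/619198#comment-5710052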
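Here is a small self-contained sketch of that lookahead in use. The wrapper name replaceDigitsOutsideTags and the digit map are assumptions for the example, not part of the linked issue. It will misbehave if a literal "<" or ">" appears inside an attribute value or in the text itself.

```javascript
// Hypothetical wrapper around the lookahead regex.
// `digitMap` is an assumed lookup table from a digit to its replacement.
function replaceDigitsOutsideTags(html, digitMap) {
  return html.replace(/\d(?=[^<>]*(<|$))/g, function (digit) {
    // Keep the original digit when the map has no entry for it
    return Object.prototype.hasOwnProperty.call(digitMap, digit) ? digitMap[digit] : digit;
  });
}

// Example map (an assumption): Western digits to Eastern Arabic digits
var easternArabic = {};
"٠١٢٣٤٥٦٧٨٩".split("").forEach(function (ch, i) {
  easternArabic[String(i)] = ch;
});

// The 444 in the style attribute is left alone; the 123 in the text is converted
var converted = replaceDigitsOutsideTags(
  '<div style="background: #444">aaaa123aaaa</div>',
  easternArabic
);
```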

Related

replace and regex exception

I want to wrap every word of a text in a <trans> tag so that I can work on each word: hover it, translate it on click, etc.
For that I need an exception in my replace function so that it ignores HTML tags like <br> or <span>.
Here is the function I have:
function wrapWords(str, tmpl) {
return str.replace(/(?![<br>\<span class="gras">\</span>])[a-zA-ZÀ-ÿ]+/gi, tmpl || "<trans>$&</trans>");
}
This function works well with Russian characters but not with French ones. The problem is that the <br> and <span> exception also excludes the French characters b, r, s, p, a... Because of that, some words are not wrapped correctly in my <trans> tag.
Does anyone know how I could exclude specific tags such as <br> without affecting the letters b and r in French?
Thanks for any answer!
Doing it properly with the DOM is a bit more complex, but it is straightforward and leaves no corner cases to worry about.
You want to split the text, thus it makes sense to only operate on text nodes. To find all text nodes, we could evaluate an XPath, or we could construct a TreeWalker.
Once we know which nodes we want to operate on, we take one node at a time and split its text into whitespace and non-whitespace sequences. Each sequence becomes a new text node, and the non-whitespace sequences are additionally wrapped inside a <trans> element. We insert them one by one before the original node, which guarantees the correct order, and finally remove the original node once all the replacement nodes are in place.
function getTextNodes(node) {
let walker = document.createTreeWalker(node, NodeFilter.SHOW_TEXT, null, false);
let textnodes = [];
let textnode;
while (textnode = walker.nextNode()) {
textnodes.push(textnode);
}
return textnodes;
}
function wrap(element) {
getTextNodes(element).forEach(node => {
node.textContent.replace(/(\S+)|(\s+)/g, (match, word, space) => {
let textnode = document.createTextNode(match);
let newnode;
if (word) {
newnode = document.createElement('trans');
newnode.appendChild(textnode);
} else {
newnode = textnode;
}
node.parentNode.insertBefore(newnode, node);
});
node.remove();
});
}
wrap(document.getElementById('wrapthis'));
trans {
background-color: pink;
}
Not affected<br/>
<div id="wrapthis">
This is affected<br>
<span class="gras">HTML tags are fine</span><br/>
This as well<br/>
</div>
Not affected<br/>
Here's a quick way:
"foo bar baz".split(" ").map(w => "<trans>" + w + "</trans>").join(" ");
Explanation:
The sentence is split on the space character, which gives an Array. Each element of this Array is then wrapped in <trans> tags. Then everything is joined back into a string.
Edit: usage in the DOM:
var sourceTextNode = document.createElement("div"); // here you're supposed to get an existing node...
sourceTextNode.textContent = "foo bar baz"; // ... and doing this is for the example purposes
sourceTextNode.innerHTML = sourceTextNode.textContent.split(" ").map(w => "<trans>" + w + "</trans>").join(" ");
sourceTextNode is:
<div>
<trans>foo</trans>
<trans>bar</trans>
<trans>baz</trans>
</div>
Note: You may want to exclude the empty elements that appear in the split Array when there are multiple consecutive space characters.
One way to do this is to test the non-emptiness of the elements in a filter:
sourceText.split(" ").filter(Boolean)...
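Put together, the split, filter, map and join steps could look like this sketch. The function name wrapWordsInTrans is made up for the example. Note that it collapses runs of spaces into a single space and does not escape any HTML already present in the text.

```javascript
// Hypothetical helper combining the steps described above
function wrapWordsInTrans(sourceText) {
  return sourceText
    .split(" ")
    // Drop the "" entries produced by leading, trailing or repeated spaces
    .filter(Boolean)
    .map(function (w) { return "<trans>" + w + "</trans>"; })
    .join(" ");
}
```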

Why does RegEx output escaped text instead of HTML

I am writing a Chrome extension which adds a <span> ... </span> around every string that matches a certain regular expression. The RegEx match works perfectly, but I cannot seem to find a way to correctly add the span tag around the text.
My code thus far is:
// main.js
var regex_pattern = new RegExp('(apple)', 'g'); // Let's pretend I want to match every instance of 'apple'
var textNodes = getTextNodes(); // A function that returns a list of every text node from the DOM
for (var i = 0; i < textNodes.length; i++) {
if (textNodes[i].nodeValue.match(regex_pattern)) {
textNodes[i].nodeValue = textNodes[i].nodeValue.replace(regex_pattern, "<span class='highlight'>$&</span>");
}
}
This will correctly identify every match of my RegEx pattern (in this case 'apple') and output <span class="highlight">apple</span>. The only problem is that this is not treated as HTML by Chrome, it's treated as text, so instead of seeing the word 'apple' styled according to the highlight class, one would see the literal output: <span class="highlight">apple</span>
Why does this happen, and how can I fix it so that the style is correctly applied? Realizing that this was less than desirable, I tried using the insertBefore() method to wrap the matched text in a span, but this didn't do anything, it would either error or fail to add the span node, depending on how I tweaked the code. Thanks for any insight you can provide!
You can't use nodeValue to replace a text node with arbitrary HTML.
You must do it manually:
function replaceNodeWithHTML(node, html) {
var parent = node.parentNode;
if(!parent) return;
var next = node.nextSibling;
var parser = document.createElement('div');
parser.innerHTML = html;
while(parser.firstChild)
parent.insertBefore(parser.firstChild, next);
parent.removeChild(node);
}
var regex_pattern = /(apple)/g;
var textNodes = [document.querySelector('div').firstChild];
for (var i = 0; i < textNodes.length; i++)
if (textNodes[i].nodeValue.match(regex_pattern))
replaceNodeWithHTML(
textNodes[i],
textNodes[i].nodeValue.replace(regex_pattern, "<span class='highlight'>$&</span>")
);
.highlight {
background: yellow;
}
<div>I have an (apple). You have an (apple) too.</div>
It would be easier if nodes had an insertAdjacentHTML method, but only elements do.
Set .innerHTML on the element. Setting textNode.nodeValue directly sets the text, so the markup is not parsed.
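If you would rather not parse an HTML string at all, another option is to split the text around the matches and build the nodes yourself. This is only a sketch: splitByMatches and highlightTextNode are hypothetical names, and the regex is assumed to have the global flag. Because the text goes through createTextNode, any "<" in the original text stays plain text.

```javascript
// Pure helper: split `text` into matching and non-matching segments.
function splitByMatches(text, regex) {
  if (!regex.global) {
    throw new Error("regex must have the global flag");
  }
  var parts = [], last = 0, m;
  regex.lastIndex = 0;
  while ((m = regex.exec(text)) !== null) {
    if (m[0] === "") {
      // Guard against empty matches, which would otherwise loop forever
      regex.lastIndex++;
      continue;
    }
    if (m.index > last) {
      parts.push({ text: text.slice(last, m.index), match: false });
    }
    parts.push({ text: m[0], match: true });
    last = m.index + m[0].length;
  }
  if (last < text.length) {
    parts.push({ text: text.slice(last), match: false });
  }
  return parts;
}

// DOM part: replace a text node with plain text nodes and
// <span class="highlight"> elements around each match.
function highlightTextNode(node, regex) {
  var doc = node.ownerDocument;
  var fragment = doc.createDocumentFragment();
  splitByMatches(node.nodeValue, regex).forEach(function (part) {
    var textNode = doc.createTextNode(part.text);
    if (part.match) {
      var span = doc.createElement("span");
      span.className = "highlight";
      span.appendChild(textNode);
      fragment.appendChild(span);
    } else {
      fragment.appendChild(textNode);
    }
  });
  node.parentNode.replaceChild(fragment, node);
}
```

In the loop from the question you would then call highlightTextNode(textNodes[i], regex_pattern) instead of assigning to nodeValue.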

Matching a string only if it is not in <script> or <a> tags

I'm working on a browser plugin that replaces all instances of "someString" (as defined by a complicated regex) with $1. This generally works ok just doing a global replace on the body's innerHTML. However it breaks the page when it finds (and replaces) the "someString" inside <script> tags (i.e. as a JS variable or other JS reference). It also breaks if "someString" is already part of an anchor.
So basically I want to do a global replace on all instances of "someString" unless it falls inside a <script></script> or <a></a> tag set.
Essentially what I have now is:
var body = document.getElementsByTagName('body')[0].innerHTML;
body = body.replace(/(someString)/gi, '$1');
document.getElementsByTagName('body')[0].innerHTML = body;
But obviously that's not good enough. I've been struggling for a couple hours now and reading all of the answers here (including the many adamant ones that insist regex should not be used with HTML), so I'm open to suggestions on how to do this. I'd prefer using straight JS, but can use jQuery if necessary.
Edit - Sample HTML:
<body>
someString
<script type="text/javascript">
var someString = 'blah';
console.log(someString);
</script>
someString
</body>
In that case, only the very first instance of "someString" should be replaced.
Try this and see if it meets your needs (tested in IE 8 and Chrome).
<script src="jquery-1.4.4.js" type="text/javascript"></script>
<script>
var pattern = /(someString)/gi;
var replacement = "$1";
$(function() {
$("body :not(a,script)")
.contents()
.filter(function() {
return this.nodeType == 3 && this.nodeValue.search(pattern) != -1;
})
.each(function() {
var span = document.createElement("span");
span.innerHTML = " " + $.trim(this.nodeValue.replace(pattern, replacement));
this.parentNode.insertBefore(span, this);
this.parentNode.removeChild(this);
});
});
</script>
The code uses jQuery to find all the text nodes within the document's <body> that are not in <a> or <script> blocks and that contain the search pattern. Once those are found, a span is injected containing the target node's modified content, and the old text node is removed.
The only issue I saw was that IE 8 handles text nodes containing only whitespace differently than Chrome, so sometimes a replacement would lose a leading space, hence the insertion of the non-breaking space before the text containing the regex replacements.
Well, you can use XPath with Mozilla (assuming you're writing the plugin for Firefox). The call is document.evaluate. Or you can use an XPath library to do it (there are a few out there)...
var matches = document.evaluate(
'//*[not(name() = "a") and not(name() = "script") and contains(., "string")]',
document,
null,
XPathResult.UNORDERED_NODE_ITERATOR_TYPE,
null
);
Then replace using a callback function:
var callback = function(node) {
var text = node.nodeValue;
text = text.replace(/(someString)/gi, '$1');
var div = document.createElement('div');
div.innerHTML = text;
// childNodes is live: each insertBefore() moves a node out of div, so drain it from the front
while (div.firstChild) {
node.parentNode.insertBefore(div.firstChild, node);
}
node.parentNode.removeChild(node);
};
var nodes = [];
//cache the tree since we want to modify it as we iterate
var node = matches.iterateNext();
while (node) {
nodes.push(node);
node = matches.iterateNext();
}
for (var key = 0, length = nodes.length; key < length; key++) {
node = nodes[key];
// Check for a Text node
if (node.nodeType == Node.TEXT_NODE) {
callback(node);
} else {
for (var i = 0, l = node.childNodes.length; i < l; i++) {
var child = node.childNodes[i];
if (child.nodeType == Node.TEXT_NODE) {
callback(child);
}
}
}
}
I know you don't want to hear this, but this doesn't sound like a job for a regex. Regular expressions don't do negative matches very well before becoming complicated and unreadable.
Perhaps this regex might be close enough though:
/>[^<]*(someString)[^<]*</
It captures any instance of someString that is in between a > and a <.
Another idea is if you do use jQuery, you can use the :contains pseudo-selector.
$('*:contains(someString)').each(function(i)
{
var markup = $(this).html();
// modify markup to insert anchor tag
$(this).html(markup)
});
This will grab any DOM element that contains 'someString' in its text. I don't think it will traverse <script> tags, so you should be good.
You could try the following:
/(someString)(?![^<]*?(<\/a>|<\/script>))/
I didn't test every scenario, but it basically uses a negative lookahead to look for the next opening bracket following someString, and if that bracket is part of an anchor or script closing tag, it does not match.
Your example seems to work in this fiddle, although it certainly doesn't cover all possibilities. In cases where the innerHTML in your <a></a> contains tags (like <b> or <span>), or the code in your script tags generates html (contains strings with tags in it), you would need something more complex.
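To make the behaviour of that negative lookahead concrete, here is a small sketch. The function name and the <mark> wrapper are assumptions for the example, since the real replacement markup is up to you. It leaves occurrences alone when the next "<" after them starts a </a> or </script> tag, and it will still go wrong when the script body or link text itself contains a "<".

```javascript
// Hypothetical demo of the negative-lookahead approach
function wrapOutsideAnchorsAndScripts(html) {
  // Skip a match when the next "<" after it begins </a> or </script>
  var pattern = /(someString)(?![^<]*?(<\/a>|<\/script>))/gi;
  return html.replace(pattern, "<mark>$1</mark>");
}

var sampleHtml =
  "<body>\n" +
  "someString\n" +
  "<script>var someString = 1;</script>\n" +
  '<a href="#">someString</a>\n' +
  "someString\n" +
  "</body>";

var replacedHtml = wrapOutsideAnchorsAndScripts(sampleHtml);
```

Only the two occurrences that sit directly in the body are wrapped; the ones inside the script and the anchor are untouched.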

Help write regex that will surround certain text with <strong> tags, only if the <strong> tag isn't present

I have several posts on a website; all these posts are chat conversations of this type:
AD: Hey!
BC: What's up?
AD: Nothing
BC: Okay
They're marked up as simple paragraphs surrounded by <p> tags.
Using the javascript replace function, I want all instances of "AD" in the beginning of a conversation (ie, all instances of "AD" at the starting of a line followed by a ":") to be surrounded by <strong> tags, but only if the instance isn't already surrounded by a <strong> tag.
What regex should I use to accomplish this? Am I trying to do what this advises against?
The code I'm using is like this:
var posts = document.getElementsByClassName('entry-content');
for (var i = 0; i < posts.length; i++) {
posts[i].innerHTML = posts[i].innerHTML.replace(/some regex here/,
'replaced content here');
}
If AD: is always at the start of a line then the following regex should work, using the m switch:
.replace(/^AD:/gm, "<strong>AD:</strong>");
You don't need to check for the existence of <strong> because ^ will match the start of the line and the regex will only match if the sequence of characters that follows the start of the line are AD:.
You're not going against the "Don't use regex to parse HTML" advice because you're not parsing HTML, you're simply replacing a string with another string.
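As a quick illustration of why the existing <strong> check is unnecessary, here is a sketch on a plain multi-line string (the function name is made up for the example). A line that already starts with <strong> no longer has AD: right after the start of the line, so running the replacement a second time changes nothing.

```javascript
// Hypothetical helper using the ^ anchor together with the m (multiline) flag
function boldSpeakerLabels(text) {
  return text.replace(/^AD:/gm, "<strong>AD:</strong>");
}

var chat = "AD: Hey!\nBC: What's up?\nAD: Nothing\nBC: Okay";
var once = boldSpeakerLabels(chat);
var twice = boldSpeakerLabels(once); // no further change
```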
An alternative to regex would be to work with ranges, creating a range selecting the text and then using execCommand to make the text bold. However, I think this would be much more difficult and you would likely face differences in browser implementations. The regex way should be enough.
After seeing your comment, the following regex would work fine:
.replace(/<(p|br)>AD:/gm, "<$1><strong>AD:</strong>");
Wouldn't it be easier to set the style property of the found paragraph to font-weight: bold, or give it a class that does roughly the same? That way you wouldn't have to worry about adding in tags, or searching for existing tags. Might perform better, too, if you don't have to do any string replaces.
If you really want to add the strong tags anyway, I'd suggest using DOM functions to find childNodes of your paragraph that are <strong>, and if you don't find one, add it and move the original (text) childNode of the paragraph into it.
Using regular expressions on the innerHTML isn't reliable and will potentially lead to problems. The correct way to do this is a tiresome process but is much more reliable.
E.g.
for (var i = 0, l = posts.length; i < l; i++) {
findAndReplaceInDOM(posts[i], /^AD:/g, function(match, node){
// Make sure current node does not have a <strong> as a parent
if (node.parentNode.nodeName.toLowerCase() === 'strong') {
return false;
}
// Create and return new <strong>
var s = document.createElement('strong');
s.appendChild(document.createTextNode(match[0]));
return s;
});
}
And the findAndReplaceInDOM function:
function findAndReplaceInDOM(node, regex, replaceFn) {
// Note: regex MUST have global flag
if (!regex || !regex.global || typeof replaceFn !== 'function') {
return;
}
var start, end, match, parent, leftNode,
rightNode, replacementNode, text,
d = document;
// Loop through all childNodes of "node"
if (node = node && node.firstChild) do {
if (node.nodeType === 1) {
// Regular element, recurse:
findAndReplaceInDOM(node, regex, replaceFn);
} else if (node.nodeType === 3) {
// Text node, introspect
parent = node.parentNode;
text = node.data;
// Index in `text` at which the current `node` begins (it moves as the node is split)
var offset = 0;
regex.lastIndex = 0;
while (match = regex.exec(text)) {
replacementNode = replaceFn(match, node);
if (!replacementNode) {
continue;
}
end = regex.lastIndex;
start = end - match[0].length;
// Effectively split node up into three parts:
// leftSideOfReplacement + REPLACEMENT + rightSideOfReplacement
leftNode = d.createTextNode( text.substring(offset, start) );
rightNode = d.createTextNode( text.substring(end) );
parent.insertBefore(leftNode, node);
parent.insertBefore(replacementNode, node);
parent.insertBefore(rightNode, node);
// Remove original node from document
parent.removeChild(node);
// Continue from the remainder so later matches and following siblings are still processed
node = rightNode;
offset = end;
}
}
} while (node = node.nextSibling);
}

Regex: how to get contents from tag inner (use javascript)?

page contents:
aa<b>1;2'3</b>hh<b>aaa</b>..
.<b>bbb</b>
blabla..
I want to get this result:
1;2'3aaabbb
The tags to match are <b> and </b>.
How do I write this regex in JavaScript?
Thanks!
Lazyanno,
If and only if:
you have read SLaks's post (as well as the previous article he links to), and
you fully understand the numerous and wondrous ways in which extracting information from HTML using regular expressions can break, and
you are confident that none of the concerns apply in your case (e.g. you can guarantee that your input will never contain nested, mismatched etc. <b>/</b> tags or occurrences of <b> or </b> within <script>...</script> or comment <!-- .. --> tags, etc.), and
you absolutely and positively want to proceed with regular expression extraction
...then use:
var str = "aa<b>1;2'3</b>hh<b>aaa</b>..\n.<b>bbb</b>\nblabla..";
var match, result = "", regex = /<b>(.*?)<\/b>/ig;
while (match = regex.exec(str)) { result += match[1]; }
alert(result);
Produces:
1;2'3aaabbb
You cannot parse HTML using regular expressions.
Instead, you should use Javascript's DOM.
For example (using jQuery):
var text = "";
$('<div>' + htmlSource + '</div>')
.find('b')
.each(function() { text += $(this).text(); });
I wrap the HTML in a <div> tag to find both nested and non-nested <b> elements.
Here is an example without a jQuery dependency:
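// get all elements with a certain tag name
var b = document.getElementsByTagName("B");
// getElementsByTagName() returns an HTMLCollection, which has no map() method,
// so borrow Array.prototype.map: it executes a function on each member and
// builds a new array from the function results...
var text = Array.prototype.map.call(b, function(element) {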
// ...in this case we are interested in the element text
if (typeof element.textContent != "undefined")
return element.textContent; // standards compliant browsers
else
return element.innerText; // IE
});
// now that we have an array of strings, we can join it
var result = text.join('');
var regex = /(<([^>]+)>)/ig;
var bdy="aa<b>1;2'3</b>hh<b>aaa</b>..\n.<b>bbb</b>\nblabla..";
var result =bdy.replace(regex, "");
alert(result) ;
See : http://jsfiddle.net/abdennour/gJ64g/
If you want to use regular expressions, just add a '?' after the quantifier in the pattern that matches your inner text, which makes it lazy (non-greedy).
For example, change:
".*" to "(.*?)"

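To see what the lazy '?' changes, here is a small comparison on part of the question's input (the variable names are only for the example). The greedy version runs on to the last </b> on the line, while the lazy version stops at the first one. Keep in mind that "." does not match newlines by default.

```javascript
var source = "aa<b>1;2'3</b>hh<b>aaa</b>";

// Greedy: .* grabs as much as possible, up to the LAST </b>
var greedy = /<b>(.*)<\/b>/.exec(source)[1];

// Lazy: .*? grabs as little as possible, up to the FIRST </b>
var lazy = /<b>(.*?)<\/b>/.exec(source)[1];
```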