mark text in a string with regular expression but exclude links


I have some text, and when a user searches for a term I want that term highlighted by wrapping it in a mark tag.
JavaScript to wrap the matched term:
var sampleText = window.document.getElementById('test').innerHTML;
var _keywordHighlight = function (text, term) {
var pattern = new RegExp('('+term+')', 'gi');
text = text.replace(pattern, '<mark>$1</mark>');
return text;
};
var newText = _keywordHighlight(sampleText, 'sample');
window.document.getElementById('test').innerHTML = newText;
jsfiddle.net link:
https://jsfiddle.net/homa/j0Lgk6pf/
The problem is that a search term inside a URL also gets wrapped in a mark tag, which breaks the link.
How can I exclude links from being wrapped in the mark tag?

Use a negative lookahead to add an additional constraint that the term is not followed by a > without first having a <. This will effectively exclude matches within <...> markup.
var pattern = new RegExp('('+term+')(?![^<]*>)', 'gi');
https://jsfiddle.net/qdk80o0k/
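A minimal sketch of the lookahead approach on a plain string, so it can be tried without a page. The function names markOutsideTags and escapeRegExp are illustrative, not from the question. Escaping the term is an addition here: it keeps characters such as . or ( in the search term from being read as regex syntax. The sketch assumes attribute values never contain a literal > character.

```javascript
// Escape regex metacharacters so the term is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Wrap every occurrence of `term` in <mark>, except where the term is
// followed by a ">" with no "<" in between (i.e. it sits inside a tag).
function markOutsideTags(html, term) {
  var pattern = new RegExp('(' + escapeRegExp(term) + ')(?![^<]*>)', 'gi');
  return html.replace(pattern, '<mark>$1</mark>');
}

var html = 'A sample <a href="http://sample.com/">link</a> text.';
console.log(markOutsideTags(html, 'sample'));
```

The occurrence inside the href attribute is left alone, so the link keeps working. Text between an opening and closing tag (the link label) is still eligible for marking.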

You're reinventing the wheel
Using innerHTML will destroy events
Using innerHTML will trigger regeneration of the DOM
To make things easy you should use an existing plugin. There are many jQuery plugins out there, but as you haven't added the jquery tag I assume that you're looking for a plain JS solution. Then the only plugin is mark.js.
Example of your use case

Related

Regex Help for content between two strings (javascript)

Hoping someone might help. I have a string formatted like the example below:
Lipsum text as part of a paragraph here, yada. |EMBED|{"content":"foo"}|/EMBED|. Yada and the text continues...
What I am looking for is a Javascript RegEx to capture the content between the |EMBED||/EMBED| 'tags', run a function on that content, and then to replace the entire |EMBED|...|/EMBED| string with the return of that function.
The catch is that I may have multiple |EMBED| blocks within a larger string. For example:
Yabba...|EMBED|{"content":"foo"}|/EMBED|. Dabba-do...|EMBED|{"content":"yo"}|/EMBED|.
I need the RegEx to capture and process each |EMBED| block separately, since the content contained within will be similar, but unique.
My initial thought is that I could just have a RegEx that captures the first iteration of the |EMBED| block, and the function which replaces this |EMBED| block is either part of a loop or recursion to continuously find the next block and replace it, until no more blocks are found in the string.
...but this seems expensive. Is there a more elegant way?

You can use String.prototype.replace to replace a substring found via a regular expression with a modified version of the match using a mapping function, e.g.:
var input = 'Yabba...|EMBED|{"content":"foo"}|/EMBED|. Dabba-do...|EMBED|{"content":"yo"}|/EMBED|.'
var output = input.replace(/\|EMBED\|(.*?)\|\/EMBED\|/g, function(match, p1) {
return p1.toUpperCase()
})
console.log(output) // "Yabba...{"CONTENT":"FOO"}. Dabba-do...{"CONTENT":"YO"}."
Make sure that you use a non-greedy selector .*? to select the content between the delimiters to allow multiple replacements per string.
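As a sketch of the "run a function on that content" part of the question, the callback can parse each block before producing the replacement. The helper name renderEmbeds and the bracketed output format are made up for illustration, and the sketch assumes every block contains valid JSON with a content property.

```javascript
// Replace each |EMBED|...|/EMBED| block with the result of `render`,
// which receives the parsed JSON found between the delimiters.
function renderEmbeds(text, render) {
  return text.replace(/\|EMBED\|(.*?)\|\/EMBED\|/g, function (match, json) {
    return render(JSON.parse(json));
  });
}

var input = 'Yabba...|EMBED|{"content":"foo"}|/EMBED|. Dabba-do...|EMBED|{"content":"yo"}|/EMBED|.';
var output = renderEmbeds(input, function (data) {
  return '[' + data.content + ']';
});
console.log(output);
```

Each block is handled in a single pass over the string, so no explicit loop or recursion is needed.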

This is the code that iterates through the matches of the regex (note the non-greedy .*? so that each block is matched separately):
var str = 'Lipsum text as part of a paragraph here, yada. |EMBED|{"content":"foo"}|/EMBED|. Yada and the text continues...';
var rx = /\|EMBED\|(.*?)\|\/EMBED\|/gi;
var match;
while (true)
{
match = rx.exec(str);
if (!match)
break;
console.log(match[1]); //match[1] is the content between "the tags"
}
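If only the captured contents are needed, String.prototype.matchAll (ES2020) can stand in for the exec loop. This is a sketch under that assumption; extractEmbeds is a hypothetical helper name. A lazy quantifier is used because a greedy (.*) would run from the first opening delimiter to the last closing one when a string holds several blocks.

```javascript
// Collect the text between every |EMBED| ... |/EMBED| pair.
function extractEmbeds(text) {
  var matches = text.matchAll(/\|EMBED\|(.*?)\|\/EMBED\|/g);
  // Each match is an array; index 1 holds the first capture group.
  return Array.from(matches, function (m) { return m[1]; });
}

var blocks = extractEmbeds('Yabba...|EMBED|{"content":"foo"}|/EMBED|. Dabba-do...|EMBED|{"content":"yo"}|/EMBED|.');
console.log(blocks.length); // one entry per block
```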

Replace with RegExp only outside tags in the string

I have a string that may contain some HTML tags, like
this is a nice day for bowling <b>bbbb</b>
How can I use a RegExp to replace all b characters with, for example, :blablabla:, but ONLY outside HTML tags?
So in that case the resulting string should become
this is a nice day for :blablabla:owling <b>bbbb</b>
EDIT: I would like to be more specific, based on the answers I have received. So first of all I have just a string, not DOM element, or anything else. The string may or may not contain tags (opening and closing). The main idea is to be able to replace anywhere in the text except inside tags. For example if I have a string like
not feeling well today :/ check out this link <a href="http://example.com">http://example.com</a>
the regexp should replace only the first :/ with a real smiley image, but should not replace the second and third, because they are inside (and part of) a tag. Here's an example snippet using the regexp from one of the answers.
var s = 'not feeling well today :/ check out this link <a href="http://example.com">http://example.com</a>';
var replaced = s.replace(/(?:<[^\/]*?.*?<\/.*?>)|(:\/)/g, "smiley_image_here");
document.querySelector("pre").textContent = replaced;
<pre></pre>
It is strange: the DEMO shows that it captures the correct group, but the same regexp does not seem to work in the replace function.

The regex itself to replace all bs with :blablabla: is not that hard:
.replace(/b/g, ":blablabla:")
It is a bit tricky to get the text nodes where we need to perform search and replace.
Here is a DOM-based example:
function replaceTextOutsideTags(input) {
var doc = document.createDocumentFragment();
var wrapper = document.createElement('myelt');
wrapper.innerHTML = input;
doc.appendChild( wrapper );
return textNodesUnder(doc);
}
function textNodesUnder(el){
var n, walk=document.createTreeWalker(el,NodeFilter.SHOW_TEXT,null,false);
while(n=walk.nextNode())
{
if (n.parentNode.nodeName.toLowerCase() === 'myelt')
n.nodeValue = n.nodeValue.replace(/:\/(?!\/)/g, "smiley_here");
}
return el.firstChild.innerHTML;
}
var s = 'not feeling well today :/ check out this link <a href="http://example.com">http://example.com</a>';
console.log(replaceTextOutsideTags(s));
Here, we only modify the text nodes that are direct children of the custom-created element named myelt.
Result:
not feeling well today smiley_here check out this link <a href="http://example.com">http://example.com</a>

var input = "this is a nice day for bowling <b>bbbb</b>";
var result = input.replace(/(^|>)([^<]*)(<|$)/g, function(_,a,b,c){
return a
+ b.replace(/b/g, ':blablabla:')
+ c;
});
document.querySelector("pre").textContent = result;
<pre></pre>
You can do this:
var result = input.replace(/(^|>)([^<]*)(<|$)/g, function(_,a,b,c){
return a
+ b.replace(/b/g, ':blablabla:') // you may do something else here
+ c;
});
Note that in most (not all, but most) real, complex use cases, it's much more convenient to manipulate a parsed DOM rather than just a string. If you're starting with an HTML page, you might use a library (some, like mine, accept regexes to do so).

I think you can use a regex like this (only for simple data, not nested):
/<[^\/]*?b.*?<\/.*?>|(b)/ig
[Regex Demo]
If you want to use a regex, I suggest using the regex below to remove all tags repeatedly until none are left:
/<[^\/][^<]*>[^<]*<\/.*?>/g
then use a replace for finding any b.
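An alternation of the form "whole element | (target)" matches the elements as well as the target, so passing a plain replacement string to replace() substitutes every match, elements included. One way around that, sketched here, is a replacer function that returns element matches unchanged and only substitutes when the capture group is set. The function name replaceOutsideElements is hypothetical, and the pattern assumes simple markup: no nested elements of the same name and no > inside attribute values.

```javascript
// Replace `target` everywhere except inside <tag ...>...</tag> elements.
function replaceOutsideElements(html, target, replacement) {
  var escaped = target.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Alternative 1 consumes a whole element; alternative 2 captures the target.
  var pattern = new RegExp('<(\\w+)\\b[^>]*>[\\s\\S]*?<\\/\\1>|(' + escaped + ')', 'g');
  return html.replace(pattern, function (match, tagName, hit) {
    // `hit` is undefined when alternative 1 matched: keep the element as is.
    return hit !== undefined ? replacement : match;
  });
}

console.log(replaceOutsideElements('this is a nice day for bowling <b>bbbb</b>', 'b', ':blablabla:'));
```

For anything beyond simple strings like these, the DOM-based answers are the safer route.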

replaceText() RegEx "not followed by"

Any ideas why this simple RegEx doesn't seem to be supported in a Google Docs script?
foo(?!bar)
I'm assuming that Google Apps Script uses the same RegEx as JavaScript. Is this not so?
I'm using the RegEx as such:
DocumentApp.getActiveDocument().getBody().replaceText('foo(?!bar)', 'hello');
This generates the error:
ScriptError: Invalid regular expression pattern foo(?!bar)

As discussed in comments on this question, this is a documented limitation; the replaceText() method doesn't support negative lookaheads or capture groups.
A subset of the JavaScript regular expression features are not fully supported, such as capture groups and mode modifiers.
Serge suggested a work-around, "it should be possible to manipulate your document at a lower level (extracting text from paragraph etc) but it could rapidly become quite cumbersome."
Here's what that could look like. If you don't mind losing all formatting, this example will apply capture groups, RegExp flags (i for case-insensitivity) and negative lookaheads to change:
Little rabbit Foo Foo, running through the foobar.
to:
Little rabbit Fred Fred, running through the foobar.
Code:
function myFunction() {
var body = DocumentApp.getActiveDocument().getBody();
var paragraphs = body.getParagraphs();
for (var i=0; i<paragraphs.length; i++) {
var text = paragraphs[i].getText();
paragraphs[i].replaceText(".*", text.replace(/(f)oo(?!bar)/gi, '$1red') );
}
}

You have a sequence which you can match with a regular expression, but that regular expression will also match one or more things which you do not want to change. The generalized solution to this situation is to:
1. Change the text such that you have known sequences of characters which are definitely not used. Effectively, this gives you sequences of characters which you use as variables to hold the values you don't want to change. Personally, I would use:
body.replaceText('Q','Qz');
Which will make it such that there is no sequence in your document which matches /Q[^z]/. This results in you being able to use sequences like Qa to represent some text you don't want to change. I use Q because it has a low frequency of use in English. You can use any character. For efficiency, choose a character which results in a low number of changes within the text you are affecting.
2. Change the things you don't want to end up changing to one of the character sequences you now know are unused. For example:
body.replaceText('foobar','Qa');
Repeat this for whatever additional items you don't want to end up changing.
3. Change the text you really want to change. In this example:
body.replaceText('foo','hello'.replace(/Q/g,'Qz'));
Note that you need to apply to the new replacement text the first substitution which you used to open up known unused sequences.
4. Restore all of the things you did not want to change to their original state:
body.replaceText('Qa','foobar');
5. Restore the text you used to open up unused character sequences:
body.replaceText('Qz','Q');
All together that would be:
var body = DocumentApp.getActiveDocument().getBody();
body.replaceText('Q','Qz'); //Open up unused character sequences
body.replaceText('foobar','Qa'); //Save the things you don't want to change.
//In the general case, you need to apply to the new text the same substitution
// which you used to open up unused character sequences. If you don't you
// may end up with those sequences being changed in the new text.
body.replaceText('foo','hello'.replace(/Q/g,'Qz')); //Make the change you desire.
body.replaceText('Qa','foobar'); //Restore the things you saved.
body.replaceText('Qz','Q'); //Restore the original sequence.
While solving the problem this way does not allow you to use all the features of JavaScript RegExp (e.g. capture groups, look-ahead assertions, and flags), it should preserve the formatting within your document.
You can choose not to perform steps 1 and 5 above by picking a longer sequence of characters to use to represent the text which you do not want to match (e.g. kNoWn1UnUsEd). However, such a longer sequence is something that must be selected based on your knowledge of what already exists in the document. Doing that can save a couple of steps, but you either have to search for an unused string or accept that there is some probability that the string you use is already in the document, which would result in an undesired substitution.
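The five substitutions can be tried outside Apps Script. The sketch below runs them on a plain string, with a split/join helper standing in for body.replaceText(); both function names are invented for the illustration, and only literal (non-regex) search strings are handled.

```javascript
// Literal replace-all on a string; a stand-in for body.replaceText().
function replaceAllLiteral(text, search, replacement) {
  return text.split(search).join(replacement);
}

// Replace `target` with `replacement`, leaving `protectedText` untouched.
function replaceExcept(text, target, replacement, protectedText) {
  // The same escaping is applied to every string that enters the text.
  var open = function (s) { return replaceAllLiteral(s, 'Q', 'Qz'); };
  var out = open(text);                                        // open up unused sequences
  out = replaceAllLiteral(out, open(protectedText), 'Qa');     // save what must not change
  out = replaceAllLiteral(out, open(target), open(replacement)); // make the desired change
  out = replaceAllLiteral(out, 'Qa', open(protectedText));     // restore the saved text
  return replaceAllLiteral(out, 'Qz', 'Q');                    // restore the original Q
}

console.log(replaceExcept('Quick foo, running through the foobar.', 'foo', 'hello', 'foobar'));
```

Unlike the regex version earlier, this is case-sensitive, so Foo would not be replaced.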

I figured out a way to obtain most of JS's str.replace() functionalities including capture groups and smart replacers in Apps Script without messing up the style. The trick is to use Javascript's regex.exec() function and Apps Script's text.deleteText() and text.insertText() functions.
function replaceText(body, regex, replacer, attribute){
var content = body.getText();
const text = body.editAsText();
var match = "";
while (true){
content = body.getText();
var oldLength = content.length;
match = regex.exec(content);
if (match === null){
break;
}
var start = match.index;
var end = regex.lastIndex - 1;
text.deleteText(start, end);
text.insertText(start, replacer(match, regex));
var newLength = body.getText().length;
var replacedLength = oldLength - newLength;
var newEnd = end - replacedLength;
text.setAttributes(start, newEnd, attribute);
regex.lastIndex -= replacedLength;
}
}
Argument explanations:
body: the body of the document you want to operate on
regex: the normal JS regular expression object used as a search pattern
replacer: the replacer function used to return the string you want to replace with; replacer automatically receives two arguments:
I. match: match object generated by regex.exec() and
II. regex: the regular expression object used as a search pattern
attribute: An Apps Script attribute object
For example, if you want to apply bold style to new strings replacing the old ones, you can create a boldStyle attribute object:
var boldStyle = {};
boldStyle[DocumentApp.Attribute.BOLD] = true;
Tips:
How can I use capture groups in replaceText()?
You can access all capture groups from the replacer function, match[0] is the whole string matched, match[1] is the first capture group, match[2] is the second, etc.
How can I access the index and position of the match in replaceText()?
You can access the start index of the match (match.index) and end index of the match (regex.lastIndex) from the replacer function.
For more in-depth reference of JS RegExp, see this excellent tutorial from Javascript.info.
Example:
Here's an example use case of the replaceText() function. It's a simple implementation of a Markdown to Google Docs conversion script:
function markdownToDocs() {
const body = DocumentApp.getActiveDocument().getBody();
// Use editAsText to obtain a single text element containing
// all the characters in the document.
const text = body.editAsText();
// e.g. replace "**string**" with "string" (bolded)
var boldStyle = {};
boldStyle[DocumentApp.Attribute.BOLD] = true;
replaceDeliminaters(body, "\\*\\*", boldStyle, false);
// e.g. replace multiline "```line 1\nline 2\nline 3```" with "line 1\nline 2\nline 3" (with gray background highlight)
var blockHighlightStyle = {};
blockHighlightStyle[DocumentApp.Attribute.BACKGROUND_COLOR] = "#EEEEEE";
replaceDeliminaters(body, "```", blockHighlightStyle, true);
// e.g. replace inline "`console.log("hello world")`" with "console.log("hello world")" (in "Times New Roman" font and italic)
var inlineStyle = {};
inlineStyle[DocumentApp.Attribute.FONT_FAMILY] = "Times New Roman";
inlineStyle[DocumentApp.Attribute.ITALIC] = true;
replaceDeliminaters(body, "`", inlineStyle, false);
// feel free to change all the styling and markdown deliminaters as you wish.
}
// replace markdown deliminaters like "**", "`", and "```"
function replaceDeliminaters(body, deliminator, attributes, multiline){
var capture;
if (multiline){
capture = "([\\s\\S]+?)"; // capture newline characters as well
} else{
capture = "(.+?)"; // do not capture newline characters
}
const regex = new RegExp(deliminator + capture + deliminator, "g");
const replacer = function(match, regex){
return match[1]; // return the first capture group
}
replaceText(body, regex, replacer, attributes);
}

Regular expression to match a string which is NOT matched by a given regexp

I've been looking through some answers here, and I can't find a solution to my problem:
I have this regexp which matches everything inside an HTML span tag, including contents:
<span\b[^>]*>(.*?)</span>
and I want to find a way to make a search in all the text, except for what is matched with that regexp.
For example, if my text is:
var text = '...for there is a class of <span class="highlight">guinea</span> pigs which...'
... then the regexp would match:
<span class="highlight">guinea</span>
and I want to be able to make a regexp such that if I search for "class", regexp will match "...for there is a class of..."
and will not match inside the tag, like in
"... class="highlight"..."
The word to be matched ("class") might be anywhere within the text. I've tried
(?!<span\b[^>]*>(.*?)</span>)class
but it keeps searching inside tags as well.
I want to find a solution using only regexp, not dealing with DOM nor JQuery. Thanks in advance :).

Although I wouldn't recommend this, I would do something like below
(class)(?:(?=.*<span\b[^>]*>))|(?:(?<=<\/span>).*)(class)
You can see this in action here
Rubular Link for this regex
You can capture your matches from the groups and work with them as needed. If you can, use a HTML parser and then find matches from the text element.

It's not pretty, but if I get you right, this should do what you want. It's done with a single RegEx, but JS can't (to my knowledge) extract the result without joining the matches in a loop.
The RegEx: /(?:<span\b[^>]*>.*?<\/span>)|(.)/g
Example js code:
var str = '...for there is a class of <span class="highlight">guinea</span> pigs which...',
pattern = /(?:<span\b[^>]*>.*?<\/span>)|(.)/g,
match,
res = '';
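match = pattern.exec(str);
while( match != null )
{
// match[1] is undefined when the span alternative matched; skip those
if (match[1] !== undefined)
res += match[1];
match = pattern.exec(str);
}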
document.writeln('Result:' + res);
In English: Do a non capturing test against your tag-expression or capture any character. Do this globally to get the entire string. The result is a capture group for each character in your string, except the tag. As pointed out, this is ugly - can result in a serious number of capture groups - but gets the job done.
If you need to send it in and retrieve the result in one call, I'd have to agree with previous contributors - It can't be done!
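The same "skip the tag or capture" idea can be pointed at the search word instead of at single characters. The sketch below returns the positions where the word occurs outside span elements. The function name findOutsideSpans is hypothetical, String.prototype.matchAll (ES2020) is assumed to be available, and spans are assumed not to be nested or to cross line breaks.

```javascript
// Return the start index of every occurrence of `word` that is not
// inside a <span ...>...</span> element.
function findOutsideSpans(text, word) {
  var escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  var pattern = new RegExp('<span\\b[^>]*>.*?<\\/span>|(' + escaped + ')', 'g');
  var indices = [];
  for (var m of text.matchAll(pattern)) {
    // Group 1 is only set when the word alternative matched.
    if (m[1] !== undefined) indices.push(m.index);
  }
  return indices;
}

var str = '...for there is a class of <span class="highlight">guinea</span> pigs which...';
console.log(findOutsideSpans(str, 'class'));
```

This still takes a loop on the JavaScript side, in line with the answer's point that a single regex call alone cannot return the result.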

Javascript regular expressions

Having a small problem for a quick "Search and Highlight" script that I'm working on. I'm using regular expressions because I'd like to do the searching all on client side, after the document has loaded. My search/highlight function goes like this:
function highlight(word, colour, container) {
var regex = new RegExp("(>[^<]*?)(" + word + ")", "ig");
var replace = "$1<span name='searchTerm' style='background-color: " + colour + "'>$2</span>";
if (regex.exec(container.innerHTML)) {
container.innerHTML = container.innerHTML.replace(regex, replace);
return true;
}
return false;
}
word is the word to search for, colour is the colour to highlight it and container is the element to search in.
Consider an element that contained this:
<ul>
<li>Set the setting to the correct setting.</li>
</ul>
Say I passed the word "set" to the highlight function. In its current state, it only finds the first instance of set due to lazy repetition.
So what if I change the regex to this:
var regex = new RegExp("(>[^<]*?)?(" + word + ")", "ig");
This now works great, it highlights all instances of the string "set". But if I pass the search word "li" then it will replace the text inside the tags!
Is there a quick fix for this regular expression to get the behaviour I want? I need it to replace all instances of the search string but not those found as part of a tag. I'd like to keep it client-side using regex.
Thanks!

You shouldn't be using regex to parse HTML. Walk the DOM tree properly and do a search and replace on pure text.
By the way there's a jQuery plugin that does what you want; you could use it or look at it to get an idea on how to do it:
http://johannburkard.de/blog/programming/javascript/highlight-javascript-text-higlighting-jquery-plugin.html
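Where walking the DOM is not an option and the input is only an HTML string, a rough approximation of "search and replace on pure text" is to split the string into tag and text pieces and touch only the text pieces. This is a sketch, not the plugin's method: highlightOutsideTags is a made-up name, and it assumes well-formed markup with no > inside attribute values and no script or style content.

```javascript
// Highlight `word` in the text portions of an HTML string, leaving tags alone.
function highlightOutsideTags(html, word, colour) {
  var escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  var pattern = new RegExp(escaped, 'gi');
  // Splitting on a captured group keeps the tags: they land at odd indices.
  return html.split(/(<[^>]+>)/).map(function (part, i) {
    if (i % 2 === 1) return part; // a tag: leave untouched
    return part.replace(pattern, function (m) {
      return '<span style="background-color: ' + colour + '">' + m + '</span>';
    });
  }).join('');
}

console.log(highlightOutsideTags('<li>Set the setting</li>', 'set', 'yellow'));
console.log(highlightOutsideTags('<li>Set the setting</li>', 'li', 'yellow'));
```

With the word "li" the markup comes back unchanged, because the only occurrences are inside tags.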
