JavaScript regex word boundary with optional non-word character

I am looking to find a keyword match in a string. I am trying to use word boundary, but this may not be the best case for that solution. The keyword could be any word, and could be preceded with a non-word character. The string could be any string at all and could include all three of these words in the array, but I should only match on the keyword:
['hello', '#hello', '#hello'];
Here is my code, which includes an attempt found in post:
let userStr = 'why hello there, or should I say #hello there?';
let keyword = '#hello';
let re = new RegExp(`/(#\b${userStr})\b/`);
re.exec(keyword);
This would be great if the string always started with #, but it does not.
I then tried this /(#?\b${userStr})\b/, but if the string does start with #, it tries to match ##hello.
The matchThis str could be any of the 3 examples in the array, and the userStr may contain several variations of the matchThis but only one will be exact

You need to account for 3 things here:
The main point is that a \b word boundary is a context-dependent construct: whether it matches depends on whether the characters on either side are word characters. If your keyword does not always start and end with a word character, you need unambiguous word boundaries.
You need to double the backslashes in a pattern that is passed as a string to the RegExp constructor.
As you pass a variable to a regex, you need to make sure all special chars are properly escaped.
Use
let userStr = 'why hello there, or should I say #hello there?';
let keyword = '#hello';
let re_pattern = `(?:^|\\W)(${keyword.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&')})(?!\\w)`;
let res = [], m;
// To find a single (first) match
console.log((m=new RegExp(re_pattern).exec(userStr)) ? m[1] : "");
// To find multiple matches:
let rx = new RegExp(re_pattern, "g");
while (m=rx.exec(userStr)) {
res.push(m[1]);
}
console.log(res);
Pattern description
(?:^|\\W) - a non-capturing group matching the start of string or any non-word char
(${keyword.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&')}) - Group 1: a keyword value with escaped special chars
(?!\\w) - a negative lookahead that fails the match if there is a word char immediately to the right of the current location.
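For illustration, the same pattern can be wrapped in a small helper. This is only a sketch: the names escapeRegExp and findKeyword are made up for this example and are not part of the question or of any library. The last line shows why a plain \b around a keyword that starts with # does not work.

```javascript
// Hypothetical helper: escape characters that are special in a regex.
function escapeRegExp(text) {
  return text.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
}

// Hypothetical helper: collect Group 1 of every match of the keyword pattern.
function findKeyword(text, keyword) {
  const rx = new RegExp(`(?:^|\\W)(${escapeRegExp(keyword)})(?!\\w)`, 'g');
  return Array.from(text.matchAll(rx), m => m[1]);
}

const sample = 'why hello there, or should I say #hello there?';
console.log(findKeyword(sample, '#hello')); // [ '#hello' ]
console.log(findKeyword(sample, 'hello'));  // [ 'hello', 'hello' ]

// \b needs a word char on exactly one side; a space followed by # has none.
console.log(/\b#hello\b/.test(sample)); // false
```

Note that for the keyword hello the pattern also finds the hello inside #hello, because # is itself a non-word character and so satisfies \W. If that is not wanted, the left-hand condition would need to be stricter than \W.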

Check whether the keyword already begins with a special character. If it does, don't include it in the regular expression.
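var re;
if ("##".indexOf(keyword[0]) == -1) {
// Keyword starts with a plain character: allow an optional leading symbol.
re = new RegExp(`[##]?\\b${keyword}\\b`);
} else {
// Keyword already starts with the symbol: no \b in front of it, because \b cannot match between a space and #.
re = new RegExp(`${keyword}\\b`);
}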

Related

How to use regex with an array of keywords to replace?

I am trying to create a loop that will replace certain words with their uppercase version. However I cannot seem to get it to work with capture groups as I need to only uppercase words surrounded by whitespace or a start-line marker. If I understand correctly \b is the boundary matcher? The list below is shortened for convenience.
raw_text = 'crEate Alter Something banana'
var lower_text = raw_text.toLowerCase();
var sql_keywords = ['ALTER', 'ANY', 'CREATE']
for (i = 0; i < sql_keywords.length; i++){
search_key = '(\b)' + sql_keywords[i].toLowerCase() + '(\b)';
replace_key = sql_keywords[i].toUpperCase();
lower_text = lower_text.replace(search_key, '$1' + replace_key + '$2');
}
It loops fine but the replace fails. I assume I have formatted it incorrectly but I cannot work out how to correctly format it. To be clear, it is searching for a word surrounded by either line start or a space, then replacing the word with the upper case version while keeping the boundaries preserved.
Several issues:
A backslash inside a string literal is an escape character, so if you intend to have a literal backslash (for the purpose of generating regex syntax), you need to double it
You did not create a regular expression. A dynamic regular expression is created with a call to RegExp
You would want to provide regex option flags, including g for global, and you might as well ease things by adding the i (case insensitive) flag.
There is no reason to make a capture group of a \b as it represents no character from the input. So even if your code would work, then $1 and $2 would just resolve to empty strings -- they serve no purpose.
You are casting the input to all lower case, so you will lose the capitalisation on words that are not matched.
It will be easier when you create one regular expression for all at the same time, and use the callback argument of replace:
var raw_text = 'crEate Alter Something banana';
var sql_keywords = ['ALTER','ANY','CREATE'];
var regex = RegExp("\\b(" + sql_keywords.join("|") + ")\\b", "gi");
var result = raw_text.replace(regex, word => word.toUpperCase());
console.log(result);
BTW, you probably also want to match reserved words when they are followed by punctuation, such as a comma. \b matches at any position between a word character (a letter, digit, or underscore) and a non-word character, in either order, so that case is already covered.
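The first two issues listed above can be checked directly. The snippet below is a small sketch with made-up variable names; it shows that \b inside a string literal is a backspace character, and that a string passed to replace is matched literally rather than as a regex.

```javascript
// In a string literal, "\b" is the backspace character (code 8), not regex syntax.
const single = '\b';
const doubled = '\\b';
console.log(single.length, single.charCodeAt(0)); // 1 8
console.log(doubled.length);                      // 2 (a backslash and a "b")

// A string first argument to replace() is searched for literally.
const text = 'create table';
const unchanged = text.replace('\\bcreate\\b', 'CREATE');
const changed = text.replace(new RegExp('\\bcreate\\b', 'gi'), 'CREATE');
console.log(unchanged); // create table
console.log(changed);   // CREATE table
```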
You can use the RegExp constructor.
Then make a function:
const listRegexp = list => new RegExp(list.map(word => `(${word})`).join("|"), "gi");
Then use it:
const re = listRegexp(sql_keywords);
Then replace:
const output = raw_text.replace(re, x => x.toUpperCase());
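One caveat with this function is that it has no word boundaries, so a keyword such as ANY would also match inside a longer word. A hedged variant is sketched below, assuming the keywords contain only word characters; keywordRegexp is a name invented for this sketch.

```javascript
const keywords = ['ALTER', 'ANY', 'CREATE'];

// Hypothetical variant: wrap the alternation in \b so only whole words match.
const keywordRegexp = list => new RegExp(`\\b(?:${list.join('|')})\\b`, 'gi');

const input = 'crEate Alter Something banana for any company';
const upper = input.replace(keywordRegexp(keywords), w => w.toUpperCase());
console.log(upper); // CREATE ALTER Something banana for ANY company
```

The "any" at the end of "company" is left alone because there is no word boundary between "p" and "a".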

Remove hashtag symbol js, by regex

I tried searching the forum but could not find anything precisely similar to what I need. I'm basically trying to remove the # symbol from the results that I'm receiving; here is a dummy example of the regex.
let postText = 'this is a #test of #hashtags';
var regexp = new RegExp('#([^\\s])', 'g');
postText = postText.replace(regexp, '');
console.log(postText);
It gives the following result
this is a est of ashtags
What do I need to change so that it removes just the # symbols without cutting off the first letter of each word?
You need a backreference $1 as the replacement:
let postText = 'this is a #test of #hashtags';
var regexp = /#(\S)/g;
postText = postText.replace(regexp, '$1');
console.log(postText);
// Alternative with a lookahead:
console.log('this is a #test of #hashtags'.replace(/#(?=\S)/g, ''));
Note I suggest replacing the constructor notation with a regex literal notation to make the regex a bit more readable, and replacing [^\s] with the shorter \S (any non-whitespace char).
Here, /#(\S)/g matches multiple occurrences (due to g modifier) of # and any non-whitespace char right after it (while capturing it into Group 1) and String#replace will replace the found match with that latter char.
Alternatively, to avoid using backreferences (also called placeholders) you may use a lookahead, as in .replace(/#(?=\S)/g, ''), where (?=\S) requires a non-whitespace char immediately to the right of the current location. If you need to remove # at the end of the string, too, replace (?=\S) with (?!\s) that will fail the match if the next char is a whitespace.
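The difference between the three variants only shows up when a # is followed by whitespace or sits at the very end of the string. A quick comparison, using a made-up test string:

```javascript
const post = 'this is a #test of #hashtags # and a trailing #';

// Capture group plus $1: removes # only when a non-whitespace char follows.
const withGroup = post.replace(/#(\S)/g, '$1');
// Positive lookahead: same result without a capture group.
const withLookahead = post.replace(/#(?=\S)/g, '');
// Negative lookahead: also removes a # at the very end of the string.
const withNegative = post.replace(/#(?!\s)/g, '');

console.log(withGroup);     // this is a test of hashtags # and a trailing #
console.log(withLookahead); // this is a test of hashtags # and a trailing #
console.log(withNegative);  // same, but without the final #
```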
It may be easier to write your own function, which could look like this (it also covers the case where the symbol is repeated):
function replaceSymbol(symbol, string) {
if (string.indexOf(symbol) < 0) {
return string;
}
while(string.indexOf(symbol) > -1) {
string = string.replace(symbol, '');
}
return string;
}
var a = replaceSymbol('#', '##s##u#c###c#e###ss is he#re'); // 'success is here'
You might be able to use the following :
let postText = 'this is a #test of #hashtags';
postText = postText.replace(/#\b/g, '');
It relies on the fact that a #hashtag contains a word-boundary between the # and the word that follows it. By matching that word-boundary with \b, we make sure not to match single #.
However, it might match a bit more than you would expect, because the definition of 'word character' in regex isn't obvious : it includes numbers (so #123 would be matched) and more confusingly, the _ character (so #___ would be matched).
I don't know if there's an authoritative source defining whether those are acceptable hashtags or not, so I'll let you judge whether this suits your needs.
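To see which inputs this affects, here is a short sketch (strip is a name invented for the example). Without the u flag, word characters in JavaScript are only ASCII letters, digits and the underscore, so an accented letter after # does not create a word boundary.

```javascript
// Hypothetical helper wrapping the replacement shown above.
const strip = s => s.replace(/#\b/g, '');

console.log(strip('#test'));  // test
console.log(strip('#123'));   // 123
console.log(strip('#___'));   // ___
console.log(strip('a # b'));  // a # b  (a lone # is kept)
console.log(strip('#é'));     // #é     (é is not a \w character)
```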
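You only need to match the # itself; the part in parentheses matches the character right after the #, which is why that character gets removed too.
postText = postText.replace(/#/g, '');
This will replace all #. Note that passing the string '#' instead of a regex with the g flag would remove only the first occurrence.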

regex match not outputting the adjacent matches javascript

I was experimenting with regex in JavaScript and came across an issue. Consider the string str = "+d+a+". I was trying to output the characters in the string that are surrounded by +, so I used str.match(/\+[a-z]\+/ig). I expected ["+d+","+a+"], but what I got is just ["+d+"]; "+a+" does not show up in the output. Why?
.match(/.../g) returns all non-overlapping matches. Your regex requires a + sign on each side. Given your target string:
+d+a+
^^^
  ^^^
Your matches would have to overlap in the middle in order to return "+a+".
You can use look-ahead and a manual loop to find overlapping matches:
var str = "+d+a+";
var re = /(?=(\+[a-z]\+))/g;
var matches = [], m;
while (m = re.exec(str)) {
matches.push(m[1]);
re.lastIndex++;
}
console.log(matches);
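In engines that provide String.prototype.matchAll (part of ES2020, so this assumes a reasonably recent runtime), the same lookahead can be used without managing lastIndex by hand, because the iterator advances past empty matches on its own. A minimal sketch:

```javascript
const str = '+d+a+';

// The lookahead itself matches an empty string; Group 1 holds the text it looked at.
const overlapping = Array.from(str.matchAll(/(?=(\+[a-z]\+))/gi), m => m[1]);
console.log(overlapping); // [ '+d+', '+a+' ]
```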
With regex, when a character gets consumed with a match, then it won't count for the next match.
For example, a regex like /aba/g wouldn't find 2 aba's in a string like "ababa".
Because the second "a" was already consumed.
However, that can be overcome by using a positive lookahead (?=...).
Because lookaheads just check what's ahead without actually consuming it.
So a regex like /(ab)(?=(a))/g would return 2 capture groups with 'ab' and 'a' for each 'aba'.
But in this case it just needs to be followed by 1 fixed character '+'.
So it can be simplified, because you don't really need capture groups for this one.
Example snippet:
var str = "+a+b+c+";
var matches = str.match(/\+[a-z]+(?=\+)/g).map(function(m){return m + '+'});
console.log(matches);

Javascript word boundaries

I have seen this answer proposed in this question
However, the resulting match is not the same. When the match is at the beginning of the string, only the string itself is returned; when it is matched after a whitespace, the whitespace is also returned as part of the match, even though the non-capturing group syntax (?:...) is used.
I tested with the following code in the Firefox console:
let str1 = "un ejemplo";
let str2 = "ejemplo uno";
let reg = /(?:^|\s)un/gi;
console.log(str1.match(reg)); // ["un"]
console.log(str2.match(reg)); // [" un"]
Why is the whitespace being returned?
The colon in (?:^|\s) just means that it's a non-capturing group. In other words, when reading, back-referencing, or replacing with the captured group values, it will not be included. Without the colon, it would be reference-able as \1, but with the colon, there is no way to reference it. However, non-capturing groups are by default still included in the match. For instance My (?:dog|cat) is sick will still include the word dog or cat in the match, even though it's a non-capturing group.
To make it exclude the value, you have two options. If your regex engine supports look-behinds, you can use a positive one, such as (?<=^|\s). JavaScript gained look-behind assertions with ES2018, so this works in current engines but not in older ones. If look-behinds are not available, you could put a capturing group around just the part you want and then read that group's value rather than the whole match (e.g., (?:^|\s)(un)). For instance:
let reg = /(?:^|\s)(un)/gi;
let match = reg.exec(input)
let result = match[1];
One solution would be to use a capturing group (ie. (un)) so that you can use RegExp.prototype.exec() and then use match[1] of this result to get the matched string, like this:
let str1 = "un ejemplo";
let str2 = "ejemplo uno";
let reg = /(?:^|\s)(un)/i; // no g flag: a global regex carries lastIndex over between exec() calls
var match1 = reg.exec(str1);
var match2 = reg.exec(str2);
console.log(match1[1]); // "un"
console.log(match2[1]); // "un"
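For comparison, here is a sketch of the look-behind option. It assumes a runtime that implements ES2018 look-behind assertions; in an engine without them, the regex literal is a syntax error.

```javascript
// (?<=^|\s) requires start of string or whitespace to the left, without consuming it.
const reg = /(?<=^|\s)un/gi;

console.log('un ejemplo'.match(reg));  // [ 'un' ]
console.log('ejemplo uno'.match(reg)); // [ 'un' ]  (no leading space in the match)
console.log('aun'.match(reg));         // null
```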

Javascript regular expression matching prior and trailing characters

I have this string in a object:
<FLD>dsfgsdfgdsfg;NEW-7db5-32a8-c907-82cd82206788</FLD><FLD>dsfgsdfgsd;NEW-480e-e87c-75dc-d70cd731c664</FLD><FLD>dfsgsdfgdfsgfd;NEW-0aad-440a-629c-3e8f7eda4632</FLD>
this.model.get('value_long').match(/[<FLD>\w+;](NEW[-|\d|\w]+)[</FLD>]/g)
Returns:
[";NEW-7db5-32a8-c907-82cd82206788<", ";NEW-480e-e87c-75dc-d70cd731c664<", ";NEW-0aad-440a-629c-3e8f7eda4632<"]
What is wrong with my regular expression that it is picking up the preceding ; and the trailing <?
here is a link to the regex
http://regexr.com?30k3m
Updated:
this is what I would like returned:
["NEW-7db5-32a8-c907-82cd82206788", "NEW-480e-e87c-75dc-d70cd731c664", "NEW-0aad-440a-629c-3e8f7eda4632"]
here is a JSfiddle for it
http://jsfiddle.net/mwagner72/HHMLK/
Square brackets create a character class, which you do not want here, try changing your regex to the following:
<FLD>\w+;(NEW[-\d\w]+)</FLD>
Since it looks like you want to grab the capture group from each match, you can use the following code to construct an array with the capture group in it:
var regex = /<FLD>\w+;(NEW[\-\d\w]+)<\/FLD>/g;
var match = regex.exec(string);
var matches = [];
while (match !== null) {
matches.push(match[1]);
match = regex.exec(string);
}
[<FLD>\w+;] matches a single character out of those listed inside the square brackets, whereas what you actually want is to match that whole sequence. Also, in the other character class, [-|\d|\w], you can remove the | characters: alternation is already implied in a character class, so a | there only adds a literal pipe to the set. | should only be used for alternation outside of a character class, typically inside a group.
Here is an updated link with the new regex: http://jsfiddle.net/RTkzx/1
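As a side note, the exec loop can be written more compactly where String.prototype.matchAll is available (ES2020, so this is an assumption about the runtime). The variable names and the shortened input below are made up for the sketch; since \w already includes digits, the character class is reduced to [-\w].

```javascript
const value =
  '<FLD>dsfgsdfgdsfg;NEW-7db5-32a8-c907-82cd82206788</FLD>' +
  '<FLD>dsfgsdfgsd;NEW-480e-e87c-75dc-d70cd731c664</FLD>';

// Collect only Group 1 (the NEW-... id) from every match.
const ids = Array.from(value.matchAll(/<FLD>\w+;(NEW[-\w]+)<\/FLD>/g), m => m[1]);
console.log(ids);
// [ 'NEW-7db5-32a8-c907-82cd82206788', 'NEW-480e-e87c-75dc-d70cd731c664' ]
```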
