Why does string.replace(/\W*/g,'_') prepend all characters? - javascript

I've been learning regexp in JS and encountered a situation that I didn't understand.
I ran a test of the replace function with the following regexp:
/\W*/g
I expected it to prepend an underscore to the beginning of the string and then replace all non-word characters.
The Number is (123)(234)
would become:
_The_Number_is__123___234_
I expected the leading underscore because the pattern matches zero or more instances, so it would match once at the start of the string and then replace the spaces and other non-word characters.
Instead, it inserted an underscore before every character and also replaced all non-word characters.
_T_h_e__N_u_m_b_e_r__i_s__1_2_3__2_3_4__
Why did it do this?

The problem is the meaning of \W*. It means "0 or more non-word characters". This means that the empty string "" would match, given that it is indeed 0 non-word characters.
So the regex matches before every character in the string and at the end, hence why all the replacements are done.
You want either /\W/g (replacing each individual non-word character) or /\W+/g (replacing each set of consecutive non-word characters).
"The Number is (123)(234)".replace(/\W/g, '_') // "The_Number_is__123__234_"
"The Number is (123)(234)".replace(/\W+/g, '_') // "The_Number_is_123_234_"

TL;DR
Never use a pattern that can match an empty string in a regex replace method if your aim is to replace and not insert text
To replace all separate occurrences of a non-word char in a string, use .replace(/\W/g, '_') (that is, remove * quantifier that matches zero or more occurrences of the quantified subpattern)
To replace all chunks of non-word chars in a string with a single pattern, use .replace(/\W+/g, '_') (that is, replace * quantifier with + that matches one or more occurrences of the quantified subpattern)
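As a quick side-by-side check of the three quantifier choices listed above, here is a minimal sketch using the string from the question (the variable names are only illustrative):

```javascript
// Compare how *, no quantifier, and + behave in a global replace.
const input = "The Number is (123)(234)";

// \W* can match the empty string, so it also matches before every word character
// and once more at the very end of the string.
const star = input.replace(/\W*/g, "_");

// \W matches exactly one non-word character at a time.
const single = input.replace(/\W/g, "_");

// \W+ matches each run of consecutive non-word characters once.
const plus = input.replace(/\W+/g, "_");

console.log(star);   // _T_h_e__N_u_m_b_e_r__i_s__1_2_3__2_3_4__
console.log(single); // The_Number_is__123__234_
console.log(plus);   // The_Number_is_123_234_
```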
Note: the solution below is tailored to the OP's much more specific requirements.
A string is parsed by the JS regex engine as a sequence of chars and locations in between them. See the following diagram where I marked locations with hyphens:
-T-h-e- -N-u-m-b-e-r- -i-s- -(-1-2-3-)-(-2-3-4-)-
|||                                             |
||Location between T and h, etc. .............  |
|1st symbol                                     |
start                                      -> end
All these positions can be analyzed and matched with a regex.
Since /\W*/g is a regex matching all non-overlapping occurrences (due to the g modifier) of 0 or more (due to the * quantifier) non-word chars, all the positions before word chars are matched. Between T and h, there is a location tested with the regex, and as there is no non-word char there (h is a word char), an empty match is returned (as \W* can match an empty string).
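To make those zero-width matches visible, the following sketch lists every match of /\W*/g together with its index. The short sample string is an assumption chosen for readability; an empty string in the output is a match at a location between characters.

```javascript
// List each match of /\W*/g as [index, matchedText].
const text = "Hi (x)";
const matches = [...text.matchAll(/\W*/g)].map((m) => [m.index, m[0]]);

console.log(JSON.stringify(matches));
// [[0,""],[1,""],[2," ("],[4,""],[5,")"],[6,""]]
```

Only the entries at index 2 and 5 consume characters; the others are empty matches at positions before a word character or at the end of the string, and each of them still triggers a replacement.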
So, you need to replace the start of string and each non-word char with a _. A naive approach is to use .replace(/\W|^/g, '_'). However, there is a caveat: if a string starts with a non-word character, no extra _ will get prepended at the start of the string:
console.log("Hi there.".replace(/\W|^/g, '_')); // _Hi_there_
console.log(" Hi there.".replace(/\W|^/g, '_')); // _Hi_there_
Note that here, \W comes first in the alternation and "wins" when matching at the beginning of the string: the space is matched and then no start position is found at the next match iteration.
You may now think you can match with /^|\W/g. Look here:
console.log("Hi there.".replace(/^|\W/g, '_')); // _Hi_there_
console.log(" Hi there.".replace(/^|\W/g, '_')); // _ Hi_there_
The second result, _ Hi_there_, shows how the JS regex engine handles zero-width matches during a replace operation: once a zero-width match (here, the position at the start of the string) is found, the replacement occurs, and the RegExp.lastIndex property is incremented, thus proceeding to the position after the first character! That is why the first space is preserved, and is no longer matched with \W.
A solution is to let the start-of-string alternative also consume an optional leading non-word character, and to use a callback that decides how many underscores to insert:
console.log("Hi there.".replace(/^(\W?)|\W/g, function($0,$1) { return $1 ? "__" : "_"; })); // _Hi_there_
console.log(" Hi there.".replace(/^(\W?)|\W/g, function($0,$1) { return $1 ? "__" : "_"; })); // __Hi_there_

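You can use RegExp /(^\W*){1}|\W(?!=\w)/g. The first alternative matches the start of the string together with any leading \W characters. The second matches a \W that is not followed by a literal = and a word character; note that (?!=\w) is a negative lookahead containing a literal =, not "not followed by \w", so for this input it matches every \W.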
var str = "The Number is (123)(234)";
var res = str.replace(/(^\W*){1}|\W(?!=\w)/g, "_");
console.log(res);

You should have used /\W+/g instead.
"*" means zero or more repetitions, so \W* on its own also matches the empty string at every position.

It's because you're using the * operator. That matches zero or more characters, so the empty position between every pair of characters matches too. If you replace the expression with /\W+/g it works as you expected.

This should work for you
Find: (?=.)(?:^\W|\W$|\W|^|(.)$)
Replace: $1_
Cases explained:
 (?= . )              # Must be at least 1 char
 (?:                  # Ordered Cases:
      ^ \W            # BOS + non-word (consumes bos)
   |  \W $            # Non-word + EOS (consumes eos)
   |  \W              # Non-word
   |  ^               # BOS
   |  ( . )           # (1), Any char + EOS
      $
 )
Note this could have been done without the lookahead via
(?:^\W|\W$|\W|^$)
But, this will insert a single _ on an empty string.
So, it ends up being more elaborate.
All in all though, it's a simple replacement.
Unlike Stribnez's solution, no callback logic is required
on the replace side.
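A minimal sketch of how this Find/Replace pair might be applied in JavaScript, assuming it is used with String.prototype.replace and the g flag (the function name underscoreify is hypothetical):

```javascript
// The Find pattern from above, applied globally.
// Group 1 is only set by the last alternative (a final word character);
// in the other cases $1 expands to an empty string.
const findPattern = /(?=.)(?:^\W|\W$|\W|^|(.)$)/g;

function underscoreify(str) {
  return str.replace(findPattern, "$1_");
}

console.log(underscoreify("The Number is (123)(234)")); // _The_Number_is__123__234_
console.log(underscoreify("Hi there"));                 // _Hi_there_
console.log(underscoreify("") === "");                  // true: the lookahead blocks empty input
```

Note that with this pattern a string ending in a word character also gets a trailing _, which is what the "(1), Any char + EOS" case is for.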

Related

use regex to replace spaces that occur with a value depending on how many spaces found

I want to use a regex that looks for spaces with a minimum length of 2 in a row, and replaces the occurrence with another value for each occurrence of the space found.
For example:
I love   to eat    cake
There are 3 spaces after love and 4 spaces after eat. I want my regex to replace occurrences of a space more than 1, and to replace it with a value for each occurrence found.
The output I am trying to go for:
I love---to eat----cake
I tried something like
myStr.replace(/ +{2,}/g, '-')
You may use this code with a lookahead and a lookbehind:
const s = 'I love   to eat    cake'
var r = s.replace(/ (?= )|(?<= ) /g, '-');
console.log(r);
//=> 'I love---to eat----cake'
RegEx Details:
' (?= )' (a space, then a lookahead): Match a space only if it is followed by a space
|: OR
'(?<= ) ' (a lookbehind, then a space): Match a space only if it is preceded by a space
You can match two or more whitespaces and replace with the same amount of hyphens:
const s = 'I love   to eat    cake'
console.log(s.replace(/\s{2,}/g, (x) => '-'.repeat(x.length)) )
The same approach can be used in Python (since you asked), re.sub(r'\s{2,}', lambda x: '-' * len(x.group()), s), see the Python demo.
Also, you may replace any whitespace that is followed with a whitespace char or is preceded with whitespace using
const s = 'I love   to eat    cake'
console.log(s.replace(/\s(?=\s|(?<=\s.))/gs, '-') )
console.log(s.replace(/\s(?=\s|(?<=\s\s))/g, '-') )
See this regex demo. Here, s flag makes . match any char. g makes the regex replace all occurrences. Also,
\s - matches any whitespace
(?=\s|(?<=\s.)) - a positive lookahead that matches a location that is immediately followed with a whitespace char (\s), or (|) if it is immediately preceded with a whitespace and any one char (which is the matched whitespace). If you use (?<=\s\s) version, there is no need of s flag, \s\s just makes sure the whitespace before the matched whitespace is checked.
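As a sanity check, the sketch below runs the lookaround version and the callback version on the same input and confirms they agree. It assumes the input has three spaces after "love" and four after "eat", as described in the question, and an engine that supports lookbehind; the variable names are illustrative.

```javascript
// Input with runs of 3 and 4 spaces; single spaces must stay untouched.
const sentence = "I love   to eat    cake";

// Replace a space that has a space next to it (after it or before it).
const viaLookaround = sentence.replace(/ (?= )|(?<= ) /g, "-");

// Replace each run of 2+ whitespace chars with the same number of hyphens.
const viaCallback = sentence.replace(/\s{2,}/g, (run) => "-".repeat(run.length));

console.log(viaLookaround);                 // I love---to eat----cake
console.log(viaLookaround === viaCallback); // true
```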

Regex - I want my string to end with 2 special character

I've been trying to make a regex that matches strings ending with 2 special characters, but I couldn't find a solution. Here is what I tried, but it does not seem to work.
/.[!@#$%^&*]{2}+$/;
Thanks in advance.
Try this regex:
^.*[!@#$%^&*]{2}$
Demo
const regex = /^.*[!@#$%^&*]{2}$/;
const str = `abc##\$`;
if (str.match(regex)) {
    console.log("matched");
}
else {
    console.log("not matched");
}
The /.[!@#$%^&*]{2}+$/ regex matches
. - any character but a line break char
[!@#$%^&*]{2}+ - in PCRE/Boost/Java/Oniguruma and other regex engines supporting possessive quantifiers, it matches exactly 2 chars from the defined set, but in JS, it causes a "Nothing to repeat" error
$ - end of string.
To match any string ending with 2 occurrences of the chars from your defined set, you need to remove the . and + and use
console.log(/[!@#$%^&*]{2}$/.test("##"))
Or, if these 2 chars cannot be preceded by a 3rd one:
console.log(/(?:^|[^!@#$%^&*])[!@#$%^&*]{2}$/.test("##"))
//           ^^^^^^^^^^^^^^^^^
The (?:^|[^!@#$%^&*]) non-capturing group matches start of string (^) or (|) any char other than !, @, #, $, %, ^, &, * ([^!@#$%^&*])
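The two variants can be wrapped in small helpers to compare them on a few inputs. This is only a sketch: the helper names are hypothetical, and it assumes the intended character set is !@#$%^&*.

```javascript
// True if the string ends with at least 2 chars from the set.
const endsWithTwo = (str) => /[!@#$%^&*]{2}$/.test(str);

// True only if exactly 2 such chars end the string (no 3rd one before them).
const endsWithExactlyTwo = (str) => /(?:^|[^!@#$%^&*])[!@#$%^&*]{2}$/.test(str);

console.log(endsWithTwo("abc#$"));         // true
console.log(endsWithTwo("abc#"));          // false
console.log(endsWithTwo("abc!#$"));        // true
console.log(endsWithExactlyTwo("abc!#$")); // false

// In JavaScript, {2}+ is not a possessive quantifier: compiling it throws.
let compileError = null;
try {
  new RegExp(".[!@#$%^&*]{2}+$");
} catch (err) {
  compileError = err;
}
console.log(compileError instanceof SyntaxError); // true
```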

Merge contiguous "," into a single "," and remove leading and trailing "," in one RegExp

Suppose I have a string
",,,a,,,,,b,,c,,,,d,,,,"
I want to convert this into
"a,b,c,d"
in 1 RegExp operation.
I can do it in 2 RegExp operations like
var str = ",,,a,,,b,,,,c,,,,,,,,d,,,,,,";
str = str.replace(/,+,/g,",").replace(/^,*|,*$/g, '');
is it possible to do this in 1 RegExp operation ?
You could use a regular expression that matches commas at the start of the string, or commas that are followed by another comma or by the end of the string, and replace every match with an empty string.
/^,*|,+(?=,|$)/g
1st Alternative ^,*
^ asserts position at start of the string
,* matches the character , literally (case sensitive)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
2nd Alternative ,+(?=,|$)
,+ matches the character , literally (case sensitive)
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
Positive Lookahead (?=,|$)
Assert that the Regex below matches
1st Alternative ,
, matches the character , literally (case sensitive)
2nd Alternative $
$ asserts position at the end of the string
Global pattern flags
g modifier: global. All matches (don't return after first match)
var string = ",,,a,,,,,b,,c,,,,d,,,,";
console.log(string.replace(/^,*|,+(?=,|$)/g, ''));
The approach below returns the expected result in two steps: .match() and a template literal, which converts the resulting array to a string when it is assigned to a variable.
You can use String.prototype.match() with the RegExp /[^,]+/g, where the negated character class [^,] matches any character other than a comma and the + quantifier matches one or more of them in a row (so that multi-character values such as "abc" are kept whole, as suggested by @4castle). The template literal then casts the resulting array to a string, which joins its elements with commas.
var str = ",,,a,,,b,,,,c,,,,,,,,d,,,efg,,,";
str = `${str.match(/[^,]+/g)}`;
console.log(str);
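Both approaches can be checked against the string from the question. The sketch below also guards the match() variant against input with no values at all, since match() with the g flag returns null when nothing matches; the helper names are hypothetical.

```javascript
// Single replace: drop leading commas, and commas followed by a comma or the end.
const collapseWithReplace = (s) => s.replace(/^,*|,+(?=,|$)/g, "");

// match() variant with a null guard, joined explicitly instead of via a template literal.
const collapseWithMatch = (s) => (s.match(/[^,]+/g) || []).join(",");

const messy = ",,,a,,,,,b,,c,,,,d,,,,";
console.log(collapseWithReplace(messy)); // a,b,c,d
console.log(collapseWithMatch(messy));   // a,b,c,d
console.log(collapseWithMatch(",,,") === ""); // true: no values, empty result instead of "null"
```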

regular expression, not reading entire string

I have a regular expression that is not working correctly.
This expression is supposed to catch if a string has invalid characters anywhere in the string. It works perfectly on RegExr.com but not in my tests.
The exp is: /[a-zA-Z0-9'.\-]/g
It is failing on : ####
but passing with : aa####
It should fail both times, what am I doing wrong?
Also, /^[a-zA-Z0-9'.\-]$/g matches nothing...
// All boxes
$('input[type="text"]').each(function () {
    var text = $(this).prop("value")
    var textTest = /[a-zA-Z0-9'.\-]/g.test(text)
    if (!textTest && text != "") {
        allFieldsValid = false
        $(this).css("background-color", "rgba(224, 0, 0, 0.29)")
        alert("Invalid characters found in " + text + " \n\n Valid characters are:\n A-Z a-z 0-9 ' . -")
    }
    else {
        $(this).css("background-color", "#FFFFFF")
        $(this).prop("value", text)
    }
});
edit:added code
UPDATE AFTER QUESTION RE-TAGGING
You need to use
var textTest = /^[a-zA-Z0-9'.-]+$/.test(text)
^^
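Note the absence of the /g modifier and the added + quantifier. With the /g (global) modifier, RegExp#test() updates the regex's lastIndex property after each successful match, so reusing the same regex object can start the next test in the middle of the string and give inconsistent results.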
You may shorten it a bit with the help of the /i case insensitive modifier:
var textTest = /^[A-Z0-9'.-]+$/i.test(text)
Also, as I mention below, you do not have to escape the - at the end of the character class [...], but it is advisable to keep escaped if the pattern will be modified later by less regex-savvy developers.
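The sketch below shows both points: whole-string validation with anchors and +, and the lastIndex behaviour that makes /g risky with test(). The helper name isValid and the sample inputs are assumptions for illustration.

```javascript
// Whole-string validation: every character must come from the allowed set.
const isValid = (text) => /^[a-z0-9'.-]+$/i.test(text);

console.log(isValid("aa####"));          // false
console.log(isValid("####"));            // false
console.log(isValid("O'Neil-Smith.Jr")); // true

// The /g pitfall: a reused global regex remembers where the last match ended.
const globalRe = /[a-z]/g;
const first = globalRe.test("a");  // true, lastIndex is now 1
const second = globalRe.test("a"); // false, the search started at index 1
console.log(first, second);        // true false
```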
ORIGINAL C#-RELATED DETAILS
Ok, say, you are using Regex.IsMatch(str, @"[a-zA-Z0-9'.-]"). The Regex.IsMatch searches for partial matches inside a string. So, if the input string contains an ASCII letter, digit, ', . or -, this will pass. Thus, it is logical that aa#### passes this test, and #### does not.
If you use the second one as Regex.IsMatch(str, @"^[a-zA-Z0-9'.-]$"), only 1 character strings (with an optional newline at the end) would get matched as ^ matches at the start of the string, [a-zA-Z0-9'.-] matches 1 character from the specified ranges/sets, and $ matches the end of the string (or right before the final newline).
So, you need a quantifier (+ to match 1 or more, or * to match zero or more occurrences) and the anchors \A and \z:
Regex.IsMatch(str, @"\A[a-zA-Z0-9'.-]+\z")
                     ^^              ^^^
\A matches the start of string (always) and \z matches the very end of the string in .NET. The [a-zA-Z0-9'.-]+ will match 1+ characters that are either ASCII letters, digits, ', . or -.
Note that - at the end of the character class does not have to be escaped (but you may keep the \- if some other developers will have to modify the pattern later).
And please be careful where you test your regexps. Regexr only supports JavaScript regex syntax. To test .NET regexps, use RegexStorm.net or RegexHero.
/^[a-zA-Z0-9'.-]+$/g
In the second case your regex (/[a-zA-Z0-9'.-]/g) passed because it matched the first letter. To make it correct, you need to match the whole string (use ^ and $) and also allow more than one character by adding + (or * if you want to allow an empty string).
Try this regex; it matches any char which isn't part of the allowed charset:
/[^a-zA-Z0-9'.\-]+/g
Test
>>regex = /[^a-zA-Z0-9'.\-]+/g
/[^a-zA-Z0-9'.\-]+/g
>>regex.test( "####dsfdfjsakldfj")
true
>>regex.test( "dsfdfjsakldfj")
false
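The negated character class can also be used to report which characters are invalid, which may be useful for the alert message in the question. A minimal sketch, with a hypothetical helper name:

```javascript
// Collect every character that is NOT in the allowed set.
// match() with the g flag returns null when nothing matches, hence the fallback.
function invalidChars(text) {
  return text.match(/[^a-zA-Z0-9'.\-]/g) || [];
}

console.log(invalidChars("aa#b$"));        // [ '#', '$' ]
console.log(invalidChars("O'Neil-Smith")); // []
```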

Javascript RegExp Tokenizing

Given a string, I want to use a regular expression to tokenize it. The pattern is as follows: any character (including new line, etc.), until "<", followed by a space zero or more times, followed by "%".
I tried
var patt = /(.)*<(\s)*%/;
but it does not yield the desired result. I would appreciate an explanation along with the pattern.
Use this:
"some string".split(/.*<\s*%/);
/^[\s\S]*?< *%/
should do what you want.
^ causes it to match at the beginning of the string.
[\s\S] matches any character. Literally, it means any space or non-space character, and works around the fact that . does not match newlines.
*? matches zero or more but the fewest necessary for the rest of the pattern to match.
< matches a literal '<'
' *' (note the space before the *) matches zero or more spaces. This is more readable if written as [ ]*.
% finally matches that character.
If you want to match the entire string (i.e. the % should be the last character in the string), then you can put a $ before the last /.
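To see the pieces working together, here is a small sketch that cuts a string at the first "<", optional spaces, "%" marker, even when the text before it spans several lines. The function name and the sample input are assumptions for illustration.

```javascript
// Split text into everything up to and including the first "< ... %" marker, and the rest.
function splitAtFirstMarker(text) {
  const m = text.match(/^[\s\S]*?< *%/);
  if (m === null) return null; // no marker found
  return { head: m[0], rest: text.slice(m[0].length) };
}

const source = "line one\nline two <  % rest";
const parts = splitAtFirstMarker(source);
console.log(JSON.stringify(parts.head)); // "line one\nline two <  %"
console.log(JSON.stringify(parts.rest)); // " rest"

// With . instead of [\s\S], the match cannot cross the newline.
console.log(/^.*< *%/.test(source)); // false
```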
