JavaScript Regex start of string clarification + str.replace() - javascript

I have a question about the start-of-string regex anchor ^.
I was trying to sanitize a string to check whether it's a palindrome and found a solution that uses regex, but I couldn't wrap my head around the explanations I found for the start-of-string anchor:
To my understanding:
^ denotes that whatever expression that follows must match, starting from the beginning of the string.
Question:
Why, then, is there a difference between the three outputs below:
1)
let x = 'A man, a plan, a canal: Panama';
const re = new RegExp(/[^a-z]/, 'gi');
console.log(x.replace(re, '*'));
Output: A*man**a*plan**a*canal**Panama
VS.
2)
let x = 'A man, a plan, a canal: Panama';
const re = new RegExp(/[a-z]/, 'gi');
console.log(x.replace(re, '*'));
Output: * ***, * ****, * *****: ******
VS.
3)
let x = 'A man, a plan, a canal: Panama';
const re = new RegExp(/^[a-z]/, 'gi');
console.log(x.replace(re, '*'));
Output: * man, a plan, a canal: Panama
Please let me know if my explanation for each of the case above is off:
1) Confused about this one. If it matches a character class of [a-z], case insensitive, with a global find, and the start-of-string anchor ^ denotes that it must match at the start of each string, should it not return all the words in the sentence? Isn't each word a match of case-insensitive [a-z] characters that occurs at the start of the string on each global find iteration?
(i.e.
finds "A" at the start
then on the next iteration, it should start searching the remaining string " man"
finds a space... and moves on to search "man"?
and so on and so forth...)
Q: Why, then, does replace target only the non-alphabetic characters? Should I in this case be treating ^ as inverting [a-z]?
2) This seems pretty straightforward: it finds all occurrences of [a-z] and replaces them with the star. Inverse case of 1)?
3) Also confused about this one. I'm not sure how this is different from 1).
/^[a-z]/gi to me means: "starting at the start of the string being looked at, match all alpha characters, case insensitive. Repeat for global find".
Compared to:
1) /[^a-z]/gi to me means: "match all character class that starts each line with alpha character. case insensitive, repeat search for global find."
To me they mean exactly the same #_#. Please let me know how my understanding is off for the above cases.

Your first expression [^a-z] matches any character other than a lowercase letter, which is why replacing with * affects all the special characters such as whitespace, commas and colons.
Your second expression [a-z] matches any lowercase letter, so the special characters mentioned are not replaced by *.
Your third expression ^[a-z] matches a lowercase letter at the start of the string, so only the first letter is replaced by *.
For the first two expressions, the global flag g ensures that all characters that match the specified pattern, regardless of their position in the string, are replaced. For the third pattern however, since ^ anchors the pattern at the beginning of the string, only the first letter is replaced.
As you mentioned, the i flag ensures case insensitivity, so that all three patterns operate on both lower and upper case alphabetic letters, from a to z and A to Z.
The character ^ therefore has two meanings:
It negates characters in a character set.
It asserts position at the start of string.
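To see both meanings side by side, here is a small sketch that runs the three patterns from the question against the same sentence. The variable names are only illustrative, and regex literals are used in place of new RegExp(...), which should behave the same here.

```javascript
const sentence = 'A man, a plan, a canal: Panama';

// ^ inside [] negates the class: every non-letter is replaced.
const negated = sentence.replace(/[^a-z]/gi, '*');

// No ^ at all: every letter is replaced.
const plain = sentence.replace(/[a-z]/gi, '*');

// ^ outside [] anchors to the start: only a letter at index 0 can match.
const anchored = sentence.replace(/^[a-z]/gi, '*');

console.log(negated);  // A*man**a*plan**a*canal**Panama
console.log(plain);    // * ***, * ****, * *****: ******
console.log(anchored); // * man, a plan, a canal: Panama
```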

^ denotes that whatever expression that follows must match, starting from the beginning of the string.
That's only when it's the first thing in the regex; it has other purposes when used elsewhere:
/[^a-z]/gi
In the above regex, the ^ does not indicate anchoring the match to the beginning of a string; it inverts the rest of the contents of the [] -- so the above regex will match any single character except a-z. Since you're using the g flag it will repeat that match for all characters in the string.
/[a-z]/gi
The above is not inverted, so will match a single instance of any character from a-z (and again because of the g flag will repeat to match all of those instances.)
/^[a-z]/gi
In this last example, the caret anchors the match to the beginning of the string; the bracketed portion will match any single a-z character. The g flag is still in use, so the regex would try to continue matching more characters later in the string -- but none of them except the first one will meet the anchored-to-start requirement, so this will end up matching only the first character (if it's within a-z), exactly as if the g flag were not in use.
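(Inside a [] group, a ^ anywhere other than the first position is treated as a literal ^. Outside a [] group, an unescaped ^ is always a start-of-input assertion, or a start-of-line assertion with the m flag, even when it is not the first thing in the regex; to match a literal caret there, escape it as \^.)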
If you're trying to detect palindromes, you'll want to remove everything except letter characters (and will probably want to convert everything to the same letter case, instead of having to detect that "P" == "p":)
const isPalindrome = function(input) {
  let str = input.toLowerCase().replace(/[^a-z]/g, '');
  return str === str.split('').reverse().join('');
}
console.log(isPalindrome("Able was I, ere I saw Elba!"))
console.log(isPalindrome("No, it never propagates if I set a ”gap“ or prevention."))
console.log(isPalindrome("Are we not pure? “No, sir!” Panama’s moody Noriega brags. “It is garbage!” Irony dooms a man –– a prisoner up to new era."))
console.log(isPalindrome("Taco dog is not a palindrome."))

Related

Regex match first character once, followed by repetitive matching until end

I'm trying to match characters that shouldn't be allowed in a username string to then be replaced.
Anything outside this range should match first character [a-zA-Z] <-- restricting the first character is causing problems and I don't know how to fix it
And then match everything else outside this range [0-9a-zA-Z_.] <---- repeat until the end of the string
Matches:
/////hey/// <-- first match /////, second match ///
[][123Bc_.// <-- first match [][, second match //
(/abc <-- should match (/
a2__./) <-- should match /)
Non Matches:
a_____
b__...
Current regex
/^([^a-zA-Z])([^\w.])*/
const regex = /^([^a-zA-Z])([^0-9a-zA-Z_.])*/;
'(/abc'.replace(regex, '') // => return expected abc
'/////hey///'.replace(regex, '') // => return expected "hey"
/^([^a-zA-Z])([^\w.])*/
You cannot do it this way, with negated character classes and the pattern anchored at the start. For your a2__./) example, this of course won’t match: the first character is not in the disallowed range, so the whole expression doesn’t match.
Your allowed characters for the first position are a subset of what you want to allow for “the rest”, so do that second part first: replace everything that does not match [0-9a-zA-Z_.] with an empty string, without anchoring the pattern at the beginning or end.
And then, in the result of that operation, replace any characters not matching [a-zA-Z] from the start. (So that second pattern does get anchored at the beginning, and you’ll want to use + as quantifier - because when you remove the first invalid character, the next one becomes the new first, and that one might still be invalid.)
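A minimal sketch of those two steps, assuming the goal is to return the cleaned username. The function name sanitizeUsername is hypothetical; the two character classes are the ones from the question.

```javascript
// Hypothetical helper name; the two replace calls follow the order described above.
function sanitizeUsername(input) {
  return input
    // Step 1: drop every character outside [0-9a-zA-Z_.], anywhere in the string.
    .replace(/[^0-9a-zA-Z_.]/g, '')
    // Step 2: drop the leading run of characters that are not letters.
    .replace(/^[^a-zA-Z]+/, '');
}

console.log(sanitizeUsername('/////hey///'));  // hey
console.log(sanitizeUsername('[][123Bc_.//')); // Bc_.
console.log(sanitizeUsername('a2__./)'));      // a2__.
console.log(sanitizeUsername('a_____'));       // a_____
```

Note that in the second input the leading digits are removed as well, because after step 1 they sit at the start of the string and are not letters.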

regex if capture group matches string

I need to build a simple script to hyphenate Romanian words. I've seen several and they don't implement the rules correctly.
var words = "arta codru";
Rule: if 2 consonants are between 2 vowels, then they become split between syllables unless they belong in this array in which case both consonants move to the second syllable:
var exceptions_to_regex2 = ["bl","cl","dl","fl","gl","hl","pl","tl","vl","br","cr","dr","fr","gr","hr","pr","tr","vr"];
Expected result: ar-ta co-dru
The code so far:
https://playcode.io/156923?tabs=console&script.js&output
var words = "arta codru";
var exceptions_to_regex2 = ["bl","cl","dl","fl","gl","hl","pl","tl","vl","br","cr","dr","fr","gr","hr","pr","tr","vr"];
var regex2 = /([aeiou])([bcdfghjklmnprstvwxy]{1})(?=[bcdfghjklmnprstvwxy]{1})([aeiou])/gi;
console.log(words.replace(regex2, '$1$2-'));
console.log("desired result: ar-ta co-dru");
Now I would need to do something like this:
if (exceptions_to_regex2.includes($2+$3)){
words.replace(regex2, '$1-');
}
else {
words.replace(regex2, '$1$2-');
}
Obviously it doesn't work because I can't just use the capture groups as I would a regular variable. Please help.
You may code your exceptions as a pattern to check for after a vowel, and stop matching there, or you may still consume any other consonant before another vowel, and replace with the backreference to the whole match with a hyphen right after:
.replace(/[aeiou](?:(?=[bcdfghptv][lr])|[bcdfghj-nprstvwxy](?=[bcdfghj-nprstvwxy][aeiou]))/g, '$&-')
Add i modifier after g if you need case insensitive matching.
See the regex demo.
Details
[aeiou] - a vowel
(?: - start of a non-capturing group:
(?=[bcdfghptv][lr]) - a positive lookahead that requires the exception letter clusters to appear immediately to the right of the current position
| - or
[bcdfghj-nprstvwxy] - a consonant
(?=[bcdfghj-nprstvwxy][aeiou]) - followed by any consonant and a vowel
) - end of the non-capturing group.
The $& in the replacement pattern is the placeholder for the whole match value (at regex101, only $0 can be used for this at the moment, since the site does not support language-specific replacement patterns).
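As an alternative to encoding the exceptions in the pattern, replace also accepts a function, which receives the capture groups as ordinary arguments. That is close to what the question's if/else was reaching for. The sketch below is only an illustration of that approach for the single rule in the question; the names exceptions, twoConsonants and hyphenate are hypothetical.

```javascript
// Exception clusters from the question; both consonants stay with the next syllable.
const exceptions = new Set([
  'bl', 'cl', 'dl', 'fl', 'gl', 'hl', 'pl', 'tl', 'vl',
  'br', 'cr', 'dr', 'fr', 'gr', 'hr', 'pr', 'tr', 'vr',
]);

// A vowel, a consonant, then a lookahead that captures the second consonant
// (which must be followed by a vowel) without consuming it.
const twoConsonants = /([aeiou])([bcdfghj-nprstvwxy])(?=([bcdfghj-nprstvwxy])[aeiou])/gi;

function hyphenate(words) {
  return words.replace(twoConsonants, (match, vowel, first, second) =>
    exceptions.has((first + second).toLowerCase())
      ? `${vowel}-${first}`  // exception: hyphen goes before both consonants
      : `${vowel}${first}-`  // default: hyphen goes between the consonants
  );
}

console.log(hyphenate('arta codru')); // ar-ta co-dru
```

The trade-off is that the exception list stays a plain array of strings that can be edited without touching the regex.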

JS Regex: Remove anything (ONLY) after a word

I want to remove all of the symbols (The symbol depends on what I select at the time) after each word, without knowing what the word could be. But leave them in before each word.
A couple of examples:
!!hello! my! !!name!!! is !!bob!! should return...
!!hello my !!name is !!bob ; for !
and
$remove$ the$ targetted$# $$symbol$$# only $after$ a $word$ should return...
$remove the targetted# $$symbol# only $after a $word ; for $
You need to use capture groups and replace:
"!!hello! my! !!name!!! is !!bob!!".replace(/([a-zA-Z]+)(!+)/g, '$1');
Which works for your test string. To work for any generic character or group of characters:
var stripTrailing = trail => {
let regex = new RegExp(`([a-zA-Z0-9]+)(${trail}+)`, 'g');
return str => str.replace(regex, '$1');
};
Note that this fails on any characters that have meaning in a regular expression: []{}+*^$. etc. Escaping those programmatically is left as an exercise for the reader.
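For reference, one common way to do that escaping is sketched below. The helper name escapeRegExp is an assumption (it is not built in here), and the trailing symbol is wrapped in a non-capturing group so that + repeats the whole symbol instead of only its last character.

```javascript
// Prefix every regex metacharacter with a backslash so it is matched literally.
const escapeRegExp = s => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

const stripTrailing = trail => {
  // One or more word characters, followed by one or more copies of the symbol.
  const regex = new RegExp(`([a-zA-Z0-9]+)(?:${escapeRegExp(trail)})+`, 'g');
  return str => str.replace(regex, '$1');
};

const stripDollar = stripTrailing('$');
console.log(stripDollar('$remove$ the$ targetted$# $$symbol$$# only $after$ a $word$'));
// $remove the targetted# $$symbol# only $after a $word
```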
UPDATE
Per your comment I thought an explanation might help you, so:
First, there's no way in this case to replace only part of a match, you have to replace the entire match. So we need to find a pattern that matches, split it into the part we want to keep and the part we don't, and replace the whole match with the part of it we want to keep. So let's break up my regex above into multiple lines to see what's going on:
First we want to match any number of sequential alphanumeric characters, that would be the 'word' to strip the trailing symbol from:
( // denotes capturing group for the 'word'
[ // [] means 'match any character listed inside brackets'
a-z // list of alpha character a-z
A-Z // same as above but capitalized
0-9 // list of digits 0 to 9
]+ // plus means one or more times
)
The capturing group means we want to have access to just that part of the match.
Then we have another group
(
! // I used ES6's string interpolation to insert the arg here
+ // match that exclamation (or whatever) one or more times
)
Then we add the g flag so the replace will happen for every match in the target string; without the flag, only the first match is replaced. JavaScript provides a convenient shorthand for accessing the capturing groups in the replacement string: the '$1' above means 'insert the contents of the first capture group here'.
So, in the above, if you replaced '$1' with '$1$2' you'd see the same string you started with, if you did 'foo$2' you'd see foo in place of every word trailed by one or more !, etc.
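The three replacement strings mentioned above can be compared directly. This is just a quick check using the test string from the question; the variable names are illustrative.

```javascript
const input = '!!hello! my! !!name!!! is !!bob!!';
const pattern = /([a-zA-Z]+)(!+)/g;

// Keep only the word (group 1): trailing symbols are dropped.
const stripped = input.replace(pattern, '$1');

// Re-insert both groups: the result equals the input.
const unchanged = input.replace(pattern, '$1$2');

// Swap the word for a fixed string but keep the trailing symbols (group 2).
const swapped = input.replace(pattern, 'foo$2');

console.log(stripped);  // !!hello my !!name is !!bob
console.log(unchanged); // !!hello! my! !!name!!! is !!bob!!
console.log(swapped);   // !!foo! foo! !!foo!!! is !!foo!!
```

The word "is" is left alone in the last line because it is not followed by a ! and so is never part of a match.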

How to understand regex '\b'?

I am learning regex, but I can't understand '\b', which matches a word boundary. There are three situations in which it matches:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
I can't understand the third situation. For example:
var reg = /end\bend/g;
var string = 'wenkend,end,end,endend';
alert( reg.test(string) ) ; //false
'\b' requires a '\w' character on one side and a non-'\w' character on the other. The string 'end,end' should satisfy that rule: the character after the first 'end' is ',' and the character before the last 'end' is ',', so why is the result false? Could you help? Thanks in advance!
============dividing line=============
With your help, I understand it now. In 'end,end' the first 'end' matches and is followed by a boundary, but the next character is ',' and not 'e', so /end\bend/ fails.
In other words, /end\bend/g and similar patterns can never match.
Thanks again
The \b matches a position, not a character. So the regex /end\bend/g says that there must be the string end, followed by a word boundary. The position between d and , is such a boundary, so \b matches there, but because \b consumes nothing, the regex engine doesn't move forward in the string and stays at the ,. The next character in your regex is e, and e doesn't match ,, so the regex fails. Here is what happens step by step:
-----------------
/end\bend/g, "end,end" (match)
 |            |
-----------------
/end\bend/g, "end,end" (both regex and string position moved - match)
    |            |
------------------
/end\bend/g, "end,end" (the previous match was zero-length, so only regex position moved - not match)
      |          |
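The same conclusion can be checked directly. In this small sketch (variable names are illustrative), the second pattern is not from the question; it is only there to show that consuming the comma explicitly lets the match go through.

```javascript
const text = 'wenkend,end,end,endend';

// \b consumes nothing, so "end" would have to be followed directly by "end"
// with a word boundary in between, and two letters never have one between them.
const impossible = /end\bend/.test(text);

// Consuming the comma explicitly lets the pattern advance past it.
const withComma = /end\b,\bend/.test(text);

console.log(impossible); // false
console.log(withComma);  // true
```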
With (most) regular expression engines, you can match, capture characters and assert positions within a string.
For the purpose of this example let's assume the string
Rogue One: A Star Wars Story
where you want to match the character o (which is there twice, after R and after t). Now you want to specify the position and match an o only when it comes right before a lowercase r.
You write (with a positive lookahead):
o(?=r)
Now generalize the idea of zero-width assertions: you want to look for a word character ahead while making sure there's no word character immediately behind. For this you could write:
(?=\w)(?<!\w)
A positive lookahead and a negative lookbehind, combined. We're almost there :) You only need the same thing the other way around (a word character behind and no word character ahead), which is:
(?<=\w)(?!\w)
If you combine these two, you'll eventually get (see the | in the middle):
(?:(?=\w)(?<!\w)|(?<=\w)(?!\w))
Which is equivalent to \b (and a lot longer). Coming back to our string, this is true for:
Rogue One: A Star Wars Story
# right before R
# right after e in Rogue
# right before O of One
# right after e of One (: is not a word character)
# and so on...
See a demo on regex101.com.
To conclude, you can think of \b as a zero-width assertion which only ensures a position within the string.
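The equivalence can be checked in JavaScript by listing the positions where each pattern matches. This sketch assumes an engine with lookbehind support (ES2018 or later); the helper name positions is hypothetical.

```javascript
const title = 'Rogue One: A Star Wars Story';

// Collect the index of every zero-width match of a global pattern.
const positions = (regex, str) => [...str.matchAll(regex)].map(m => m.index);

const viaShorthand = positions(/\b/g, title);
const viaLookarounds = positions(/(?:(?=\w)(?<!\w)|(?<=\w)(?!\w))/g, title);

console.log(viaShorthand.join(','));   // 0,5,6,9,11,12,13,17,18,22,23,28
console.log(viaLookarounds.join(',')); // same list as above
```

Index 0 is the position right before R, 5 is right after the e in Rogue, 6 is right before the O of One, and 9 is right after the e of One.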
Try this Expression
/(end)\b|\b(end)/g

Why is 'B' matched by [a-z]?

a very simple & naive question:
why this is true?
new RegExp('^[a-z]+$', 'i').test('B')
apparently 'B' is out of [a-z]?
Yes, but you have the i parameter which tells the regex to ignore case.
From the MDN documentation for RegEx:
Parameters
pattern
The text of the regular expression.
flags
If specified, flags can have any combination of the following values:
...
i
ignore case
It's defining a class, which is to say [a-z] is symbolic of "any character, from a to z."
Regex is, by nature, case SensAtiVe as well, so [a-z] varies from [A-Z] (unless you use the i (case insensitive) flag, like you've demonstrated).
e.g.
/[a-z]/ -- Any single character, a through z
/[A-Z]/ -- Any single uppercase letter, A through Z
/[a-zA-Z]/ -- Any single upper or lowercase letter, a through z
/[a-z]/i or /[A-Z]/i -- (note the i) Any upper or lowercase letter, a through z
Summary
The [a-z] means a character set containing characters a-z.
The ^ is an anchor which means the set must begin with the first character of input.
The + means you must match on one or more from the character set.
The $ is an end anchor meaning the set must end at the last character of input.
The i means to ignore case on your input letters.
It means any character between a and z.
As you specified the i flag (case insensitive), it also matches B.
The whole regexp checks that the string contains at least one character and that all characters are in a-z or A-Z.
You can check that new RegExp('^[a-z]+$', 'i').test('B') returns true.
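A quick way to see the effect of each part is to compare the same pattern with and without the i flag. The variable names here are only for illustration.

```javascript
const insensitive = new RegExp('^[a-z]+$', 'i');
const sensitive = new RegExp('^[a-z]+$');

console.log(insensitive.test('B'));  // true: i makes [a-z] cover A-Z too
console.log(sensitive.test('B'));    // false: without i, only lowercase matches
console.log(insensitive.test('B1')); // false: ^...$ requires every character to be a letter
console.log(insensitive.test(''));   // false: + requires at least one character
```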
