Optional lookahead in JavaScript

I am trying to build a regex to validate user input; I'm building a form based on that input. Let's assume that the user sets the CSS class to "icon-[anything]" (a Bootstrap icon). In this case I have to ensure that "--" never appears, and that "icon-white" is the only class allowed beside the other one; this "icon-white" has to be optional as well.
/^icon-[a-z-]+(\ icon-white)?$/ - this regex works fine for the optional icon-white scenario, but I'm having trouble preventing the repetition of '--'.

If you want to match "icon-somevalue" but not "icon-white" try
icon-(?!white).*

If I understand correctly (although I'm not sure I do, sorry...) I think you're saying that the following two scenarios are allowed:
icon-white
icon-[anything] where [anything] can be any lower-case text and include a hyphen, but never two (or more) hyphens directly next to each other like --.
You've not said where this pattern might occur, although your original regex suggests this pattern will occur anchored to the start of your test string, so I'll assume that's the case. In which case, this regex should help:
^icon-white$|^icon-([a-z]+-?)+$
Breaking that down:
1. ^icon-white$ Match the literal string that contains exactly "icon-white"
2. | or
3. ^icon-([a-z]+-?)+$ Match the literal string that starts with exactly "icon-" and then immediately ends with "something" which is ([a-z]+-?)+.
Now, to be clear - I don't get the relationship between icon-white and icon-[something]. That is, as far as I see it there's no reason why the icon-[something] pattern at 3 above can't cover the "icon-white" literal too. ie 1 and 2 above are redundant. But I've included them here so you can maybe piece something more suitable together.
Breaking that "something" down from 3:
( )+ means one or more instances of whatever's inside the parenthesis, which is [a-z]+-?
Breaking that [a-z]+-? down:
[a-z]+ At least one character "a" through "z" (note hyphen is NOT allowed here to avoid additional hyphen immediately after the previous one)
-? An optional hyphen (ie exactly 0 or 1 hyphen)
This matches the following test cases:
icon-white
icon-x
icon-xx
icon-x-
icon-xx-
icon-x-x
icon-xx-x
icon-xx-xx
icon-x-x-x-
icon-x-xx-xx-x-xxxxx-
... and so on
This DOES NOT match the following test cases:
any case where a capital letter used (you've specified only lower-case)
icon- (because we need one or more characters for "something")
icon--
icon--x
icon-x--
I hope this covers your needs, but I doubt it does (because I didn't really understand your explanation "ensure that "icon-white" should be the only class assigned beside the other one"), but hopefully my breakdown will give you the pieces you need.
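The match and no-match lists above can be checked mechanically. This is a minimal sketch in plain JavaScript; the names iconPattern, flatPattern, shouldMatch and shouldFail are mine, not from the question. flatPattern is an assumed rewrite that avoids the nested quantifier and should accept the same strings.

```javascript
// Pattern from the answer: exactly "icon-white", or "icon-" followed by
// lower-case runs, each optionally followed by a single hyphen.
const iconPattern = /^icon-white$|^icon-([a-z]+-?)+$/;

// Assumed equivalent without a quantifier nested inside another quantifier.
const flatPattern = /^icon-[a-z]+(?:-[a-z]+)*-?$/;

const shouldMatch = [
  "icon-white", "icon-x", "icon-xx", "icon-x-", "icon-xx-", "icon-x-x",
  "icon-xx-x", "icon-xx-xx", "icon-x-x-x-", "icon-x-xx-xx-x-xxxxx-",
];
const shouldFail = ["Icon-x", "icon-", "icon--", "icon--x", "icon-x--"];

// Collect any sample that behaves differently from the lists in the answer.
const unexpectedMisses = shouldMatch.filter((s) => !iconPattern.test(s));
const unexpectedHits = shouldFail.filter((s) => iconPattern.test(s));

// Check that both patterns agree on every sample.
const disagreements = [...shouldMatch, ...shouldFail].filter(
  (s) => iconPattern.test(s) !== flatPattern.test(s)
);

console.log(unexpectedMisses, unexpectedHits, disagreements);
```

Because ([a-z]+-?)+ repeats a group that itself contains a + and an optional separator, a long input that almost matches can force a lot of backtracking; the flat form is one way to sidestep that.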
EDIT:
I think maybe you're saying the scenarios allowed are:
icon-[something]
icon-[something] icon-white
icon-white icon-[something]
where [something] is any combination of lower-case text and hyphens, so long as there's never a double-hyphen, and so long as it's not "white".
So... this defines "icon-[something]" : icon-(?!white$)([a-z]+-?)+
This means our 3 above scenarios are:
^icon-(?!white$)([a-z]+-?)+$
^icon-(?!white$)([a-z]+-?)+ icon-white$
^icon-white icon-(?!white$)([a-z]+-?)+$
And hence, putting it all together:
^icon-(?!white$)([a-z]+-?)+$|^icon-(?!white$)([a-z]+-?)+ icon-white$|^icon-white icon-(?!white$)([a-z]+-?)+$
I tried doing this with the icon-white section as an optional group, but had trouble with the negative lookahead from the first section capturing it... so... this'll do ;-)
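To see how the combined pattern behaves, here is a small sketch that builds it from the "icon-[something]" fragment and tries a few class strings. The helper buildPattern and the stricter lookahead are my own assumptions, not part of the answer.

```javascript
// Builds the three-way alternation from the answer around a given
// "icon-[something]" fragment (hypothetical helper).
const buildPattern = (icon) =>
  new RegExp(`^${icon}$|^${icon} icon-white$|^icon-white ${icon}$`);

// Fragment exactly as written in the answer.
const original = buildPattern(String.raw`icon-(?!white$)([a-z]+-?)+`);

// Assumed variant: also reject "white" when it is followed by a space.
const stricter = buildPattern(String.raw`icon-(?!white(?: |$))([a-z]+-?)+`);

const samples = [
  "icon-home",
  "icon-home icon-white",
  "icon-white icon-home",
  "icon-white",
  "icon--home",
  "icon-home icon-blue",
  "icon-white icon-white",
];

for (const s of samples) {
  console.log(s, original.test(s), stricter.test(s));
}
```

The lookahead (?!white$) only rejects "white" at the very end of the string, so "icon-white icon-white" still passes through the second alternative of the original pattern. The stricter fragment is one possible way to close that gap, if it matters for your form.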

Related

Regex for bible references

I am working on some code for an online bible. I need to identify when references are written out. I have looked all through stackoverflow and tried various regex examples but they all seem to fail with single books (eg Jude) as they require a number to precede the book name. Here is my solution so far:
/((?:(I+|1st|2nd|3rd|First|Second|Third|1|2|3))?( )?(Gen|Ge|Gn|Exo|Ex|Exod|Lev|Le|Lv|Num|Nu|Nm|Nb|Deut|Dt|Josh|Jos|Jsh|Judg|Jdg|Jg|Jdgs|Rth|Ru|Sam|Samuel|Kings|Kgs|Kin|Chron|Chronicles|Ezra|Ezr|Neh|Ne|Esth|Es|Job|Job|Jb|Pslm|Ps|Psalms|Psa|Psm|Pss|Prov|Pr|Prv|Eccles|Ec|Song|So|Canticles|Song of Songs|SOS|Isa|Is|Jer|Je|Jr|Lam|La|Ezek|Eze|Ezk|Dan|Da|Dn|Hos|Ho|Joel|Joe|Jl|Amos|Am|Obad|Ob|Jnh|Jon|Micah|Mic|Nah|Na|Hab|Zeph|Zep|Zp|Haggai|Hag|Hg|Zech|Zec|Zc|Mal|Mal|Ml|Matt|Mt|Mrk|Mk|Mr|Luk|Lk|John|Jn|Jhn|Acts|Ac|Rom|Ro|Rm|Co|Cor|Corinthians|Gal|Ga|Ephes|Eph|Phil|Php|Col|Col|Th|Thes|Thess|Thessalonians|Ti|Tim|Timothy|Titus|Tit|Philem|Phm|Hebrews|Heb|James|Jas|Jm|Pe|Pet|Pt|Peter|Jn|Jo|Joh|Jhn|John|Jude|Jud|Rev|The Revelation|Genesis|Exodus|Leviticus|Numbers|Deuteronomy|Joshua|Judges|Ruth|Samuel|Kings|Chronicles|Ezra|Nehemiah|Esther|Job|Psalms|Psalm|Proverbs|Ecclesiastes|Song of Solomon|Isaiah|Jeremiah|Lamentations|Ezekiel|Daniel|Hosea|Joel|Amos|Obadiah|Jonah|Micah|Nahum|Habakkuk|Zephaniah|Haggai|Zechariah|Malachi|Matthew|Mark|Luke|John|Acts|Romans|Corinthians|Galatians|Ephesians|Philippians|Colossians|Thessalonians|Timothy|Titus|Philemon|Hebrews|James|Peter|John|Jude|Revelation))(([ .)\n|])([^a-zA-Z]))([\d])?([:\d])?([:\d])?/gi;
Here is the regex code with some sample text to match:
https://regexr.com/5pfg3
In the example above you will notice that Jude works if it is followed by a double space, or if I put a full stop after it. I know the issue is this section:
(([ .)\n|])([^a-zA-Z]))
What I want is to match spaces, brackets, new lines BUT not a letter.
It does not match as it expects 2 characters using (([ .)\n|])([^a-zA-Z])) where the second one can not be a char a-zA-Z due to the negated character class, so it can not match the s in Jude some.
What you might do is make the character class in the second part optional, if you intend to keep all the capture groups.
You could also add word boundaries \b to make the pattern a bit more performant as it is right now.
See a regex demo
(Note that Jude is listed twice in the alternation)
If you only want to use 3 groups, you can write the first part as:
\b(?:(I+|1st|2nd|3rd|First|Second|Third|[123]) )?
The second part will be the alternation with the names, and in the 3rd part you can match one of the character class followed by the digit part and make that optional as a whole (so you don't match a trailing space or char after the word without the digits).
(?:[ .)\n|](\d+(?::\d+){0,2}\b))?
The full pattern will look like
\b(?:(I+|1st|2nd|3rd|First|Second|Third|[123]) )?(Gen|Ge|Gn|Exo|Ex|Exod|Lev|Le|Lv|Num|Nu|Nm|Nb|Deut|Dt|Josh|Jos|Jsh|Judg|Jdg|Jg|Jdgs|Rth|Ru|Sam|Samuel|Kings|Kgs|Kin|Chron|Chronicles|Ezra|Ezr|Neh|Ne|Esth|Es|Job|Job|Jb|Pslm|Ps|Psalms|Psa|Psm|Pss|Prov|Pr|Prv|Eccles|Ec|Song|So|Canticles|Song of Songs|SOS|Isa|Is|Jer|Je|Jr|Lam|La|Ezek|Eze|Ezk|Dan|Da|Dn|Hos|Ho|Joel|Joe|Jl|Amos|Am|Obad|Ob|Jnh|Jon|Micah|Mic|Nah|Na|Hab|Zeph|Zep|Zp|Haggai|Hag|Hg|Zech|Zec|Zc|Mal|Mal|Ml|Matt|Mt|Mrk|Mk|Mr|Luk|Lk|John|Jn|Jhn|Acts|Ac|Rom|Ro|Rm|Co|Cor|Corinthians|Gal|Ga|Ephes|Eph|Phil|Php|Col|Col|Th|Thes|Thess|Thessalonians|Ti|Tim|Timothy|Titus|Tit|Philem|Phm|Hebrews|Heb|James|Jas|Jm|Pe|Pet|Pt|Peter|Jn|Jo|Joh|Jhn|John|Jude|Jud|Rev|The Revelation|Genesis|Exodus|Leviticus|Numbers|Deuteronomy|Joshua|Judges|Ruth|Samuel|Kings|Chronicles|Ezra|Nehemiah|Esther|Job|Psalms|Psalm|Proverbs|Ecclesiastes|Song of Solomon|Isaiah|Jeremiah|Lamentations|Ezekiel|Daniel|Hosea|Joel|Amos|Obadiah|Jonah|Micah|Nahum|Habakkuk|Zephaniah|Haggai|Zechariah|Malachi|Matthew|Mark|Luke|John|Acts|Romans|Corinthians|Galatians|Ephesians|Philippians|Colossians|Thessalonians|Timothy|Titus|Philemon|Hebrews|James|Peter|John|Revelation)\b(?:[ .)\n|](\d+(?::\d+){0,2}\b))?
Regex demo of the full pattern
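As a rough illustration of the three-group layout, the sketch below uses a deliberately shortened book list (an assumption for readability; the full alternation above would go in its place). The names books, refPattern and found are hypothetical.

```javascript
// Shortened, hypothetical book list; substitute the full alternation.
const books = "Genesis|Gen|John|Jn|Jude|Romans|Rom";

// Group 1: optional ordinal prefix, group 2: book name,
// group 3: optional chapter/verse digits such as 3:16.
const refPattern = new RegExp(
  String.raw`\b(?:(I+|1st|2nd|3rd|First|Second|Third|[123]) )?(${books})\b(?:[ .)\n|](\d+(?::\d+){0,2}\b))?`,
  "gi"
);

const text = "See Jude some time, then 1 John 3:16 and Gen 1:1.";

// Collect each match together with its three capture groups.
const found = [...text.matchAll(refPattern)].map((m) => ({
  whole: m[0],
  prefix: m[1],
  book: m[2],
  ref: m[3],
}));

console.log(found);
```

Here "Jude" is matched on its own even though a letter follows the space, because the separator and digits are optional as a unit rather than required.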

How to count two words as 1 in same line

In the text file I've got, each sentence is represented with a specific type such as: contrast.
A contrasting sentence can either be represented with a tag "CONTRAST" or "CONTR" or "WEAKCONTR". For instance:
IMPSENT_CONTRAST_VIS(Studying networks in this way can help to
identify the people from whom an individual learns , where
conflicts_MD:+ in understanding_MD:+ may originate , and which
contextual factors influence learning .)
So I count these with following expression: /(\_(WEAK))|(\_CONTRAST)|(\_CONTR(\_|\())/g which works perfectly fine.
Now the problem is some sentences are expressed with more than one contrast tag such as CONTR & WEAKCONTR together. For instance:
IMPSENT_CONTRAST_EMPH_WEAKCONTR_VIS(Studying_MD:+ networks in this way
can help to identify_MD:+ the people from whom an individual learns
, where conflicts_MD:+ in understanding_MD:+ may originate , and
which contextual factors influence learning .)
At this point I have to count these as 1 not 2. Do you have any idea how possible this is with RegExp?
You can use lookaheads to assert it, and then count the matches:
(?=\w*_(?:WEAK|CONTRAST|CONTR[_(]))\b\w+\b
Demo here: http://regex101.com/r/xP2yI7/3
Notice the match count.
This will match the whole IMPSENT_CONTRAST_EMPH_WEAKCONTR_VIS expression, but only if it matches the part in the lookahead, which filters for the keywords you're looking for. This will match even if you have multiple such sentences on the same line.
Also, I've simplified your regex a bit, retaining the same meaning. Notice you don't have to escape the _.
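A short sketch of counting with the lookahead approach follows. It assumes the last alternative is meant to be CONTR followed by "_" or "(", as in the question's own pattern; the sample sentences are shortened and the variable names are mine.

```javascript
// One match per tag word that contains at least one contrast keyword.
const contrastTag = /(?=\w*_(?:WEAK|CONTRAST|CONTR[_(]))\b\w+\b/g;

const sampleText = [
  "IMPSENT_CONTRAST_VIS(Studying networks in this way can help .)",
  "IMPSENT_CONTRAST_EMPH_WEAKCONTR_VIS(Studying_MD:+ networks in this way .)",
  "SENT_CONTR(Another sentence .)",
  "IMPSENT_VIS(No contrast tag here .)",
].join("\n");

// match() with the g flag returns every whole match, or null if none.
const tagMatches = sampleText.match(contrastTag) || [];

console.log(tagMatches.length, tagMatches);
```

The second sentence carries both CONTRAST and WEAKCONTR but is counted once, because the whole tag word is consumed by a single match.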
You really just care if the tag shows up in the line at all, so just grab the whole line, provided it has a tag, like so:
/^([A-Z_]+(WEAK|CONTRAST|CONTR)+[A-Z_]*)/gm
From the start of the line ^ look for a word block with A-Z or _ followed by the tag, optionally followed by more words/underscores.
DEMO
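The line-based variant can be tried the same way. This is only a sketch with shortened sample text; it assumes every tagged sentence starts at the beginning of a line and that continuation lines do not start with an upper-case tag.

```javascript
// Anchored at each line start (m flag); matches a tag block that
// contains at least one of the contrast keywords.
const linePattern = /^([A-Z_]+(WEAK|CONTRAST|CONTR)+[A-Z_]*)/gm;

const fileText = [
  "IMPSENT_CONTRAST_VIS(Studying networks in this way can help to",
  "identify the people from whom an individual learns .)",
  "IMPSENT_CONTRAST_EMPH_WEAKCONTR_VIS(Studying_MD:+ networks in this way .)",
  "IMPSENT_VIS(No contrast tag here .)",
].join("\n");

const taggedLines = fileText.match(linePattern) || [];

console.log(taggedLines.length, taggedLines);
```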
Can you try adding \w+:
/(\_(WEAK\w+))|(\_CONTRAST\w+)|(\_CONTR(\_\w+|\())/g
Something like this?
(^(\_(WEAK))|(\_CONTRAST)|(\_CONTR(\_|\()))

What is the meaning of [_|\_|\.]? in JavaScript regexps?

I have a js code:
/^([a-zA-Z0-9]+[_|\_|\.]?)*[a-zA-Z0-9]+#([a-zA-Z0-9]+[_|\_|\.]?)*[a-zA-Z0-9]+\.[a-zA-Z]{2,3}$/
But what is the meaning of [_|\_|\.]? (in a JS regexp)?
If we use a resource like Regexper, we can visualise this regular expression:
From this we can conclude that [_|\_|\.] requires one of either "_", "|" or ".". We can also see that the double declaration of "_" and "|" is unnecessary. As HamZa commented, this segment can be shortened to [_|.] to achieve the same result.
In fact, we can even use resources like Regexper to visualise the entire expression.
regex101 is a very good tool for understanding regular expressions. Its explanation of this fragment:
[_|\_|\.]? Char class, 0 to 1 times [greedy], matches one of the following characters: _|_|.
So [_|\_|\.] requires one of either "_", "|" or ".".
See the regex101 explanation of your expression.
It matches a pipe character, an underscore, or a period.
It is unnecessarily convoluted, however. It could be simpler.
It could be shortened to this
[|_.]
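A quick way to convince yourself that the long and short character classes are the same set is to test both against a handful of characters. A minimal sketch, with variable names of my own choosing:

```javascript
// The class from the question and the shortened form, each anchored so
// that exactly one character is tested.
const messyClass = /^[_|\_|\.]$/;
const tidyClass = /^[|_.]$/;

const probeChars = ["_", "|", ".", "a", "\\", "-"];

// Characters on which the two classes give different answers (expected: none).
const classMismatches = probeChars.filter(
  (ch) => messyClass.test(ch) !== tidyClass.test(ch)
);

console.log(messyClass.test("|"), classMismatches);
```

Inside a character class the pipe is a literal character, so "|" is accepted by both forms.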
[_|\_|\.] is probably meant to match an underscore (_) or a period (.), and should have been written as [_.].
I'm reasonably sure the author is using the pipe (|) to mean "or" (i.e., alternation), which isn't necessary inside a character class. As the other responders said, the pipe actually matches a literal pipe, but I don't believe that was the author's intent. It's a very common beginner's mistake.
The dot (.) is another special character that loses its special meaning when it appears in a character class. There's no need to escape it with a backslash as the author did, though it does no harm. And the underscore never has any special meaning; I won't even try to guess why the author listed it twice, once with a backslash and once without.
You didn't ask about it, but the ? doesn't belong there either. That's what makes the regex so horribly inefficient, as Kobi remarked. The idea was to match one or more alphanumerics, then optionally match a separator character (dot or underscore), which must be followed by some more alphanumerics, repeating as needed. Here's how I would write that:
[a-zA-Z0-9]+([_.][a-zA-Z0-9]+)*
If it runs out of alphanumerics and the next character is not _ or ., it skips that whole section and tries to match the next part. And if it can't do that, it can bail out immediately because no match is possible. But the way your regex is written, the separator is optional independently of the things it's supposed to separate, which makes it useless. The regex engine has to keep backing up, trying to match characters that it has already consumed in endless, pointless combinations before it can give up. And that, unfortunately, is another common mistake.
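The rewritten fragment can be exercised on its own. The sketch below anchors it with ^ and $ (my addition, for testing a single token) and uses made-up sample strings.

```javascript
// Alphanumerics, then zero or more (separator + alphanumerics) pairs.
// The separator can only appear between two alphanumeric runs.
const tokenPattern = /^[a-zA-Z0-9]+([_.][a-zA-Z0-9]+)*$/;

const tokenSamples = {
  "john.doe_99": tokenPattern.test("john.doe_99"),
  "john": tokenPattern.test("john"),
  "john..doe": tokenPattern.test("john..doe"),
  "john.": tokenPattern.test("john."),
  ".john": tokenPattern.test(".john"),
  "john|doe": tokenPattern.test("john|doe"),
};

console.log(tokenSamples);
```

Since each separator must be followed by at least one alphanumeric, there is only one way to split a given input between the repetitions, which is what lets the engine give up quickly on a non-match.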

The behavior of /g mode matching

This article mentions:
Make sure you are clear on the fact that an expression pattern is
tested on each individual character. And that, just because the engine
moves forward when following the pattern and looking for a match it
still backtracks and examines each character in a string until a match
is found or if the global flag is set until all characters are
examined.
But when I tested this in JavaScript,
"aaa#bbb".match(/a+#b+/g)
does not produce a result like:
["aaa#bbb", "aa#bbb", "a#bbb"]
It only produces ["aaa#bbb"]. It seems it does not examine each character to test the pattern. Can anyone explain the matching steps a little? Thanks.
/g does not mean it will try to find every possible subset of characters in the input string which may match the given pattern. It means that once a match is found, it will continue searching for more substrings which may match the pattern starting from the previous match onward.
For example:
"aaa#bbb ... aaaa#bbbb".match(/a+#b+/g);
Will produce
["aaa#bbb", "aaaa#bbbb"]
That explanation is mixing two distinct concepts that IMO should be kept separate:
A) backtracking
When looking for a match the normal behavior for a quantifier (?, *, +) is to be "greedy", i.e. to munch as much as possible... for example in /(a+)([^b]+)/ tested with aaaacccc all the a's will be part of group 1 even though they also match the char set [^b] (everything but b).
However, if grabbing too much would prevent a match, the RE rules require the quantifier to "backtrack" and capture less if that allows the expression to match. For example, in (a+)([^b]+) tested with aaaa, group 1 will get only three a's, leaving one for group 2 to match.
You can change this greedy behavior with "non-greedy quantifiers" like *?, +?, ??. In this case the engine still backtracks, but in the reverse direction: a non-greedy subexpression will eat as little as possible while still allowing the rest of the expression to match. For example, (a+)(a+b+) tested with aaabbb will leave two a's for group 1 and abbb for group 2, but (a+?)(a+b+) with the same string will instead leave only one a for group 1, because this is the minimum that allows the remaining part to match.
Note also that, because of backtracking, the greedy/non-greedy choice doesn't change whether an expression has a match, only how big the match is and how much goes to each sub-expression.
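The group splits described above can be reproduced directly. A minimal sketch; the helper captureGroups is hypothetical and simply returns the captured groups of the first match.

```javascript
// Returns the capture groups of the first match as a plain array.
const captureGroups = (re, input) => re.exec(input).slice(1);

// Greedy: group 1 takes every "a" because group 2 can still match "cccc".
const greedyFull = captureGroups(/(a+)([^b]+)/, "aaaacccc");

// Greedy with backtracking: group 1 gives back one "a" so group 2 can match.
const greedyBacktrack = captureGroups(/(a+)([^b]+)/, "aaaa");

// Greedy vs non-greedy first group on the same input.
const greedySplit = captureGroups(/(a+)(a+b+)/, "aaabbb");
const lazySplit = captureGroups(/(a+?)(a+b+)/, "aaabbb");

console.log(greedyFull, greedyBacktrack, greedySplit, lazySplit);
```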
B) the "global" option
This is something totally unrelated to backtracking and simply means that instead of stopping at the first match the search must find all non-overlapping matches. This is done by finding the first match and then starting again the search after the end of the match.
Note that each match is computed using the standard regexp rules and there is no look-ahead or backtracking between different matches: in other words if making for example a greedy match shorter would give more matches in the string this option is not considered... a+[^b]+ tested with aaaaaa is going to give only one match even if g option is specified and even if the substrings aa, aa, aa would have been each a valid match for the regexp.
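The effect of the global flag can be seen with the strings already used in this thread. A small sketch, with variable names of my own:

```javascript
// Only one match: the search resumes after "aaa#bbb", where nothing is left.
const singleMatch = "aaa#bbb".match(/a+#b+/g);

// Two non-overlapping matches in a longer string.
const twoMatches = "aaa#bbb ... aaaa#bbbb".match(/a+#b+/g);

// Greedy matching is not shortened to produce more matches.
const oneBigMatch = "aaaaaa".match(/a+[^b]+/g);

console.log(singleMatch, twoMatches, oneBigMatch);
```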
When the global flag is used, it starts searching for the next match after the end of the previous match, to prevent generating lots of overlapping matches like that.
If you don't specify /g, the engine will stop as soon as a match is found.
If you do specify /g, it will keep going after a match. It will, however, still not produce overlapping matches, which is what you're asking about.
It's because of the following.
What regex tries to do:
Every regex tries to find the best match from each starting position.
What regex won't do:
It will not return every possible combination within a single match, as in your case.
When your "aaa#bbb".match(/a+#b+/g) scenario works:
Try something like aaa#bbbHiaa#bbbHelloa#bbbSEEYOU instead, which will give you
aaa#bbb
aa#bbb
a#bbb

RegEx in JS to find No 3 Identical consecutive characters

How to find a sequence of 3 characters, 'abb' is valid while 'abbb' is not valid, in JS using Regex (could be alphabets,numerics and non alpha numerics).
This question is a variation of the question that I asked here: How to combine these regex for javascript.
This is wrong : /(^([0-9a-zA-Z]|[^0-9a-zA-Z]))\1\1/ , so what is the right way to do it?
This depends on what you actually mean. If you only want to match three non-identical characters (that is, if abb is valid for you), you can use this negative lookahead:
(?!(.)\1\1).{3}
It first asserts that the current position is not followed by the same character three times. Then it matches those three characters.
If you really want to match 3 different characters (only stuff like abc), it gets a bit more complicated. Use these two negative lookaheads instead:
(.)(?!\1)(.)(?!\1|\2).
First match one character. Then we assert that this is not followed by the same character. If so, we match another character. Then we assert that these are followed neither by the first nor the second character. Then we match a third character.
Note that those negative lookaheads ((?!...)) do not consume any characters. That is why they are called lookaheads. They just check what is coming next (or in this case what is not coming next) and then the regex continues from where it left off. Here is a good tutorial.
Note also that this matches anything but line breaks, or really anything if you use the DOTALL or SINGLELINE option. Since you are using JavaScript you can activate the option by appending s after the regex's closing delimiter (supported from ES2018 onwards). If (for some reason) you don't want to use this option, replace the .s by [\s\S] (this always matches any character).
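Both patterns can be tried on short strings to see where they match. A minimal sketch, assuming an engine with ES2018 support for the s flag; the variable names are mine.

```javascript
// Three characters that are not all identical.
const notAllSame = /(?!(.)\1\1).{3}/;
// Three pairwise different characters.
const allDifferent = /(.)(?!\1)(.)(?!\1|\2)./;

// "aaa" at index 0 is rejected by the lookahead, so the match starts at 1.
const firstHit = notAllSame.exec("aaab");

// "aab" is skipped because the first two characters are equal.
const distinctHit = allDifferent.exec("aabc");
const noDistinct = allDifferent.test("aba");

// Without the s flag the dot refuses to match the line break.
const withoutDotAll = /(?!(.)\1\1).{3}/.test("a\nb");
const withDotAll = /(?!(.)\1\1).{3}/s.test("a\nb");

console.log(firstHit[0], firstHit.index, distinctHit[0], noDistinct, withoutDotAll, withDotAll);
```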
Update:
After clarification in the comments, I realised that you do not want to find three non-identical characters, but instead you want to assert that your string does not contain three identical (and consecutive) characters.
This is a bit easier, and closer to your former question, since it only requires one negative lookahead. What we do is this: we search the string from the beginning for three consecutive identical characters. But since we want to assert that these do not exist we wrap this in a negative lookahead:
^(?!.*(.)\1\1)
The lookahead is anchored to the beginning of the string, so this is the only place where we will look. The pattern in the lookahead then tries to find three identical characters from any position in the string (because of the .*; the identical characters are matched in the same way as in your previous question). If the pattern finds these, the negative lookahead will fail, and so the string will be invalid. If no three identical characters can be found, the inner pattern will never match, so the negative lookahead will succeed.
To find three characters that are not all identical, use the regex pattern
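Wrapped in a function, the anchored lookahead becomes a simple validity check. This is a sketch only; the function name hasNoTriple and the sample inputs are assumptions of mine.

```javascript
// True when the string contains no run of three identical characters.
const hasNoTriple = (input) => /^(?!.*(.)\1\1)/.test(input);

const tripleChecks = {
  abb: hasNoTriple("abb"),
  abbb: hasNoTriple("abbb"),
  aabbcc: hasNoTriple("aabbcc"),
  "11!!!": hasNoTriple("11!!!"),
  empty: hasNoTriple(""),
};

console.log(tripleChecks);
```

As written, the dot does not cross line breaks, so a triple that appears after a newline would go unnoticed; adding the s flag or using [\s\S] in place of the dots should cover multi-line input.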
([\s\S])(?!\1\1)[\s\S]{2}
