What is meaning of [_|\_|\.]? in Javascript regexps? - javascript

I have this JS code:
/^([a-zA-Z0-9]+[_|\_|\.]?)*[a-zA-Z0-9]+#([a-zA-Z0-9]+[_|\_|\.]?)*[a-zA-Z0-9]+\.[a-zA-Z]{2,3}$/
But what is the meaning of [_|\_|\.]? in this JavaScript regexp?

If we use a resource like Regexper, we can visualise this regular expression:
From this we can conclude that [_|\_|\.] requires one of either "_", "|" or ".". We can also see that the double declaration of "_" and "|" is unnecessary. As HamZa commented, this segment can be shortened to [_|.] to achieve the same result.
In fact, we can even use resources like Regexper to visualise the entire expression.

RegEx101 is a very good tool for understanding regular expressions. Its explanation of this part of your expression reads:
Char class [_|\_|\.] 0 to 1 times [greedy] matches:
[_|\_|\.] One of the following characters _|_|.
In other words, [_|\_|\.] requires one of "_", "|" or ".".

It matches a pipe character, an underscore, or a period.
It is unnecessarily convoluted, however. It could be simpler.
It could be shortened to this
[|_.]
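One way to check that the two classes really are equivalent is to run both over a few characters. This is only an illustrative sketch; the names original, shortened and samples are made up for the example.

```javascript
// The class from the question and the shortened form suggested above,
// each anchored so that exactly one character is tested.
const original = /^[_|\_|\.]$/;
const shortened = /^[|_.]$/;

// Sample characters chosen for illustration only.
const samples = ['_', '|', '.', 'a', '-', '\\'];

for (const ch of samples) {
  // Both columns should agree for every sample.
  console.log(JSON.stringify(ch), original.test(ch), shortened.test(ch));
}
```

Only the underscore, the pipe and the dot are accepted by either class.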

[_|\_|\.] is probably meant to match an underscore (_) or a period (.), and should have been written as [_.].
I'm reasonably sure the author is using the pipe (|) to mean "or" (i.e., alternation), which isn't necessary inside a character class. As the other responders said, the pipe actually matches a literal pipe, but I don't believe that was the author's intent. It's a very common beginner's mistake.
The dot (.) is another special character that loses its special meaning when it appears in a character class. There's no need to escape it with a backslash as the author did, though it does no harm. And the underscore never has any special meaning; I won't even try to guess why the author listed it twice, once with a backslash and once without.
You didn't ask about it, but the ? doesn't belong there either. That's what makes the regex so horribly inefficient, as Kobi remarked. The idea was to match one or more alphanumerics, then optionally match a separator character (dot or underscore), which must be followed by some more alphanumerics, repeating as needed. Here's how I would write that:
[a-zA-Z0-9]+([_.][a-zA-Z0-9]+)*
If it runs out of alphanumerics and the next character is not _ or ., it skips that whole section and tries to match the next part. And if it can't do that, it can bail out immediately because no match is possible. But the way your regex is written, the separator is optional independently of the things it's supposed to separate, which makes it useless. The regex engine has to keep backing up, trying to match characters that it has already consumed in endless, pointless combinations before it can give up. And that, unfortunately, is another common mistake.
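As a rough illustration of the rewritten pattern, the sketch below anchors it with ^ and $ and tries it on a few made-up inputs. The variable names and sample strings are assumptions for the example, not part of the original code.

```javascript
// Anchored version of the rewritten pattern: runs of alphanumerics
// joined by exactly one "_" or "." between them.
const localPart = /^[a-zA-Z0-9]+(?:[_.][a-zA-Z0-9]+)*$/;

// Hypothetical inputs for illustration.
const inputs = ['john', 'john.doe', 'john_doe.smith', 'john..doe', 'john_', 'john|doe'];

for (const input of inputs) {
  console.log(input, localPart.test(input));
}

// A long input that cannot match is rejected without heavy backtracking,
// because a separator must always be followed by alphanumerics.
console.log(localPart.test('a'.repeat(40) + '!'));
```

Because each separator is tied to the alphanumerics that follow it, there is only one way to split a given input, so a failing match ends quickly.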

Related

Need a regular expression in javascript [duplicate]

This question already has answers here:
Regular expression to match a line that doesn't contain a word
(34 answers)
Closed 2 years ago.
I know that I can negate group of chars as in [^bar] but I need a regular expression where negation applies to the specific word - so in my example how do I negate an actual bar, and not "any chars in bar"?
A great way to do this is to use negative lookahead:
^(?!.*bar).*$
The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead [is any regex pattern].
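In JavaScript, the lookahead version can be tried out like this. It is a minimal sketch; the sample lines are invented for the example.

```javascript
// Matches a whole line only if "bar" does not occur anywhere in it.
const withoutBar = /^(?!.*bar).*$/;

// Hypothetical sample lines.
const lines = ['foo', 'foobarbaz', 'ba r', 'BAR'];

// Keep only the lines that do not contain the lowercase word.
const kept = lines.filter((line) => withoutBar.test(line));
console.log(kept);
```

The match is case-sensitive, so 'BAR' is kept unless the i flag is added.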
Unless performance is of utmost concern, it's often easier just to run your results through a second pass, skipping those that match the words you want to negate.
Regular expressions usually mean you're doing scripting or some sort of low-performance task anyway, so find a solution that is easy to read, easy to understand and easy to maintain.
Solution:
^(?!.*STRING1|.*STRING2|.*STRING3).*$
xxxxxx OK
xxxSTRING1xxx KO (rejected, which is the desired result)
xxxSTRING2xxx KO (rejected, which is the desired result)
xxxSTRING3xxx KO (rejected, which is the desired result)
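When the list of forbidden strings is only known at run time, the same pattern can be assembled from an array. The following is a sketch under that assumption; escapeRegExp and buildExcluder are hypothetical helper names, not functions from any library.

```javascript
// Escape regex metacharacters so that each word is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build ^(?!.*(?:w1|w2|...)).*$ from a list of forbidden substrings.
function buildExcluder(words) {
  const alternatives = words.map(escapeRegExp).join('|');
  return new RegExp('^(?!.*(?:' + alternatives + ')).*$');
}

// 'a.b' is included to show that the dot is treated literally.
const excluder = buildExcluder(['STRING1', 'STRING2', 'a.b']);

for (const sample of ['xxxxxx', 'xxxSTRING1xxx', 'xxa.bxx', 'xxaXbxx']) {
  console.log(sample, excluder.test(sample) ? 'OK' : 'KO');
}
```

Factoring the shared .* out of the alternatives gives the same result as repeating it in front of every string.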
You could either use a negative look-ahead or look-behind:
^(?!.*?bar).*
^(.(?<!bar))*?$
Or use just basics:
^(?:[^b]+|b(?:$|[^a]|a(?:$|[^r])))*$
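The first two match anything that does not contain bar. The lookaround-free version is only an approximation: because b(?:[^a]...) consumes the character after each b, it wrongly accepts strings such as bbar.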
The following regex will do what you want (as long as negative lookbehinds and lookaheads are supported), matching things properly; the only problem is that it matches individual characters (i.e. each match is a single character rather than all characters between two consecutive "bar"s), possibly resulting in a potential for high overhead if you're working with very long strings.
b(?!ar)|(?<!b)a|a(?!r)|(?<!ba)r|[^bar]
I came across this forum thread while trying to identify a regex for the following English statement:
Given an input string, match everything unless this input string is exactly 'bar'; for example I want to match 'barrier' and 'disbar' as well as 'foo'.
Here's the regex I came up with
^(bar.+|(?!bar).*)$
My English translation of the regex is "match the string if it starts with 'bar' and has at least one other character, or if the string does not start with 'bar'".
The accepted answer is nice but is really a work-around for the lack of a simple sub-expression negation operator in regexes. This is why grep --invert-match exists. So in *nixes, you can accomplish the desired result using pipes and a second regex.
grep 'something I want' | grep --invert-match 'but not these ones'
Still a workaround, but maybe easier to remember.
If it's truly a word, bar that you don't want to match, then:
^(?!.*\bbar\b).*$
The above will match any string that does not contain bar that is on a word boundary, that is to say, separated from non-word characters. However, the period/dot (.) used in the above pattern will not match newline characters unless the correct regex flag is used:
^(?s)(?!.*\bbar\b).*$
Alternatively:
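^(?![\s\S]*\bbar\b)[\s\S]*$
Instead of using any special flag, we look for any character that is either whitespace or non-whitespace, which covers every character including newlines. Note that the dot inside the lookahead has to be replaced as well; otherwise a bar on a later line would go unnoticed.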
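In JavaScript specifically, the dot-matches-newline behaviour is requested with the s (dotAll) flag after the closing slash rather than with an inline (?s). The sketch below compares that with a version that uses [\s\S] both inside the lookahead and after it; the variable names and the sample strings are made up for the example.

```javascript
// Variant 1: the s (dotAll) flag lets "." match newlines.
const dotAllVersion = /^(?!.*\bbar\b).*$/s;

// Variant 2: no flag; [\s\S] stands in for "any character at all",
// both inside the lookahead and in the main part of the pattern.
const classVersion = /^(?![\s\S]*\bbar\b)[\s\S]*$/;

// Hypothetical multi-line inputs.
const hasBarOnSecondLine = 'foo\nbar';
const hasOnlyBarrier = 'foo\nbarrier';

console.log(dotAllVersion.test(hasBarOnSecondLine), classVersion.test(hasBarOnSecondLine));
console.log(dotAllVersion.test(hasOnlyBarrier), classVersion.test(hasOnlyBarrier));
```

Both variants reject the first input and accept the second, since barrier does not contain bar as a separate word.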
But what if we would like to match words that might contain bar, but just not the specific word bar?
(?!\bbar\b)\b[A-Za-z-]*bar[a-z-]*\b
(?!\bbar\b) Assert that the next input is not bar on a word boundary.
\b[A-Za-z-]*bar[a-z-]*\b Matches any word on a word boundary that contains bar.
See Regex Demo
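For a quick local check without the online demo, the pattern can be run with the g flag over a sample sentence. This is a sketch: the sentence is invented, and the character class is written as [A-Za-z-] without a backslash in front of the bracket.

```javascript
// Words that contain "bar" but are not exactly the word "bar".
const containsBarButIsNotBar = /(?!\bbar\b)\b[A-Za-z-]*bar[a-z-]*\b/g;

// Hypothetical input.
const sentence = 'bar barrier disbar foo';

const words = sentence.match(containsBarButIsNotBar);
console.log(words);
```

The standalone bar and the unrelated foo are skipped; barrier and disbar are returned.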
Extracted from this comment by bkDJ:
^(?!bar$).*
The nice property of this solution is that it's possible to clearly negate (exclude) multiple words:
^(?!bar$|foo$|banana$).*
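A short JavaScript sketch of these two patterns, with made-up test words, shows that only exact matches are rejected:

```javascript
// Rejects only the exact string "bar".
const notExactlyBar = /^(?!bar$).*/;

// Rejects any of several exact strings.
const notReserved = /^(?!bar$|foo$|banana$).*/;

// Hypothetical inputs.
for (const word of ['bar', 'barrier', 'disbar', 'foo', 'banana', 'bananas']) {
  console.log(word, notExactlyBar.test(word), notReserved.test(word));
}
```

Because each alternative carries its own $, longer strings that merely start with one of the words still pass.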
I wish to complement the accepted answer and contribute to the discussion with my late answer.
@ChrisVanOpstal shared this regex tutorial which is a great resource for learning regex.
However, it was really time consuming to read through.
I made a cheatsheet for mnemonic convenience.
This reference is based on the braces [], (), and {} leading each class, and I find it easy to recall.
Regex = {
'single_character': ['[]', '.', {'negate':'^'}],
'capturing_group' : ['()', '|', '\\', 'backreferences and named group'],
'repetition' : ['{}', '*', '+', '?', 'greedy v.s. lazy'],
'anchor' : ['^', '\b', '$'],
'non_printable' : ['\n', '\t', '\r', '\f', '\v'],
'shorthand' : ['\d', '\w', '\s'],
}
Just thought of something else that could be done. It's very different from my first answer, as it doesn't use regular expressions, so I decided to make a second answer post.
Use your language of choice's split() method equivalent on the string with the word to negate as the argument for what to split on. An example using Python:
>>> text = 'barbarasdbarbar 1234egb ar bar32 sdfbaraadf'
>>> text.split('bar')
['', '', 'asd', '', ' 1234egb ar ', '32 sdf', 'aadf']
The nice thing about doing it this way, in Python at least (I don't remember if the functionality would be the same in, say, Visual Basic or Java), is that it lets you know indirectly when "bar" was repeated in the string due to the fact that the empty strings between "bar"s are included in the list of results (though the empty string at the beginning is due to there being a "bar" at the beginning of the string). If you don't want that, you can simply remove the empty strings from the list.
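For comparison, JavaScript's String.prototype.split behaves the same way with a plain string separator. The sketch below reuses the sample text from the Python session; nonEmpty is just an illustrative name.

```javascript
const text = 'barbarasdbarbar 1234egb ar bar32 sdfbaraadf';

// Splitting on a plain string keeps the empty pieces between adjacent "bar"s.
const pieces = text.split('bar');
console.log(pieces);

// Drop the empty strings if they are not wanted.
const nonEmpty = pieces.filter((piece) => piece !== '');
console.log(nonEmpty);
```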
I had a list of file names, and I wanted to exclude certain ones, with this sort of behavior (Ruby):
files = [
'mydir/states.rb', # don't match these
'countries.rb',
'mydir/states_bkp.rb', # match these
'mydir/city_states.rb'
]
excluded = ['states', 'countries']
# set my_rgx here
result = WankyAPI.filter(files, my_rgx) # I didn't write WankyAPI...
assert result == ['mydir/city_states.rb', 'mydir/states_bkp.rb']
Here's my solution:
excluded_rgx = excluded.map{|e| e+'\.'}.join('|')
my_rgx = /(^|\/)((?!#{excluded_rgx})[^\.\/]*)\.rb$/
My assumptions for this application:
The string to be excluded is at the beginning of the input, or immediately following a slash.
The permitted strings end with .rb.
Permitted filenames don't have a . character before the .rb.
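The same idea can be sketched in JavaScript. This is a port for illustration only: it filters with Array.prototype.filter instead of the API mentioned above, and it assumes the excluded names contain no regex metacharacters.

```javascript
const files = [
  'mydir/states.rb',       // should be filtered out
  'countries.rb',          // should be filtered out
  'mydir/states_bkp.rb',   // should be kept
  'mydir/city_states.rb',  // should be kept
];
const excluded = ['states', 'countries'];

// Produces the source text: states\.|countries\.
const excludedSource = excluded.map((name) => name + '\\.').join('|');

// A file name (at the start or right after a slash) that does not begin
// with an excluded name followed by a dot, and that ends in .rb.
const fileFilter = new RegExp('(^|/)((?!' + excludedSource + ')[^./]*)\\.rb$');

const allowed = files.filter((path) => fileFilter.test(path));
console.log(allowed);
```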

how to negate a capture group?

Using a javascript regexp, I would like to find strings like "/foo" or "/foo d/" but not "/foo /"; ie, "annotation character", then either word with no terminating annotation, or multiple words, where the termination comes at the end of the phrase (with no space). Complicating the situation, there are three possible annotation symbols: /, \ and |.
I've tried something like:
/(?:^|\s)([\\\/|])((?:[\w_-]+(?![^\1]+[\w_-]\1))|(?:[\w\s]+[\w](?=\1)))/g
That is, start with space, then annotation, then
word not followed by (anything but annotation) then letter and annotation... or
possibly multiple words, immediately followed by annotation character.
The problem is the [^\1]: inside the square brackets, this doesn't read as "anything but the annotation character".
I could repeat the whole phrase three times, one for each annotation character. Any better ideas?
As you've mentioned, [^\1] doesn't work: inside a character class, \1 is not a backreference, so the class does not mean "anything but the captured character". In JavaScript, you can negate \1 by using a lookahead: (?:(?!\1).)* . This is not as efficient, but it works.
Your pattern can be written as:
([\\\/|])([\w\-]+(?:(?:(?!\1).)*[\w\-]\1)?)
Working example at Regex101
\w already contains underscore.
Instead of alternation (a|ab) I'm using an optional group (a(?:b)?) - we always match the first word, with optional further words and tags.
You may still want to include (?:^|\s) at the beginning.
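To see how the pattern behaves, it can be run against the strings from the question. This is a small sketch without the g flag, so each call reports only the first match; the loop and variable names are just for illustration.

```javascript
// Group 1 captures the annotation character; (?:(?!\1).)* then consumes
// any run of characters that are not that same annotation character.
const annotated = /([\\\/|])([\w\-]+(?:(?:(?!\1).)*[\w\-]\1)?)/;

for (const sample of ['/foo', '/foo d/', '|foo d|', '/foo /']) {
  const match = sample.match(annotated);
  console.log(JSON.stringify(sample), '->', match && match[0]);
}
```

Note that for "/foo /" the pattern still matches the leading "/foo"; only the closing annotation after the space is left out of the match.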

Unable to find a string matching a regex pattern

While trying to submit a form, a JavaScript regex validation always fails for my string.
Regex:- ^(([a-zA-Z]:)|(\\\\{2}\\w+)\\$?)(\\\\(\\w[\\w].*))+(.jpeg|.JPEG|.jpg|.JPG)$
I have tried following strings against it
abc.jpg,
abc:.jpg,
a:.jpg,
a:asdas.jpg,
What string could possibly match this regex?
This regex won't match against anything because of that $? in the middle of the string.
Apparently using the optional modifier ? on the end string symbol $ is not correct (if you paste it on https://regex101.com/ it will give you an error indeed). If the javascript parser ignores the error and keeps the regex as it is this still means you are going to match an end string in the middle of a string which is supposed to continue.
Unescaped it was supposed to match a \$ (dollar symbol) but as it is written it won't work.
If you want your string to be accepted at any cost, you can probably use Firebug or a similar developer tool and edit the string inside the JavaScript code (assuming there is no server-side check as well, or that it is not wrong too). If you ignore the $?, a matching string is \\\\w\\\\ww.jpg (and since the . is unescaped, even \\\\w\\\\ww%jpg is a match).
Of course, I wrote this answer assuming the escaping is indeed the one you showed in the question. If you need to find a matching pattern for the correctly escaped one ^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))+(\.jpeg|\.JPEG|\.jpg|\.JPG)$ then you can use this tool to find one http://fent.github.io/randexp.js/ (though it will find weird matches). A matching pattern is c:\zz.jpg
If you are just looking for a regular expression to match what you got there, go ahead and test this out:
(\w+:?\w*\.[jpe?gJPE?G]+,)
That should match exactly what you are looking for. Remove the optional comma at the end if you feel like it, of course.
If you remove escape level, the actual regex is
^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))+(.jpeg|.JPEG|.jpg|.JPG)$
After the start anchor ^ comes the first part, (([a-zA-Z]:)|(\\{2}\w+)\$?), which matches either a letter followed by a colon, or two backslashes followed by one or more word characters and an optional literal $. Some of the parentheses inside are unnecessary.
The second part, (\\(\w[\w].*))+, matches a backslash followed by two word characters. The \w[\w] looks odd because it is equivalent to \w\w (the second \w does not need a character class). This is followed by any number of any characters, and the whole group repeats one or more times.
In the last part, (.jpeg|.JPEG|.jpg|.JPG), the author probably forgot to escape the dot to match it literally; \. should be used. This part can be reduced to \.(JPE?G|jpe?g).
It would match something like
A:\12anything.JPEG
\\1$\anything.jpg
Play with it at regex101. A more readable version could be
^([a-zA-Z]:|\\{2}\w+\$?)(\\\w{2}.*)+\.(jpe?g|JPE?G)$
Also read the explanation on regex101 to understand any pattern, it's helpful!
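The simplified pattern can also be checked directly in JavaScript. In this sketch the backslashes in the sample paths are doubled because they are written as ordinary string literals; the candidate list is taken from the strings mentioned in this thread.

```javascript
// Drive letter or \\server (with optional $), one or more \xx... segments,
// then a .jpg/.jpeg extension in all-lower or all-upper case.
const imagePath = /^([a-zA-Z]:|\\{2}\w+\$?)(\\\w{2}.*)+\.(jpe?g|JPE?G)$/;

const candidates = [
  'A:\\12anything.JPEG',     // A:\12anything.JPEG
  '\\\\1$\\anything.jpg',    // \\1$\anything.jpg
  'c:\\zz.jpg',              // c:\zz.jpg
  'abc.jpg',                 // no drive or server prefix
  'a:asdas.jpg',             // no backslash after the drive
];

for (const candidate of candidates) {
  console.log(candidate, imagePath.test(candidate));
}
```

The first three candidates match; the last two, which come from the question, do not because they lack the backslash-separated path segment.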

Regular expression - val.replace(/^[^a-zA-Z0-9]*|[^a-zA-Z0-9]*$/g,"'');

I was learning regular expression, It seems very much confusing to me for now.
val.replace(/^[^a-zA-Z0-9]*|[^a-zA-Z0-9]*$/g, '');
In the above expression
1) which part denotes not to include white space? as i am trying to exclude all non alphanumeric characters.
2) Since I don't want to use even '$' and '_' (underscore), can I specify '$' and '_' (underscore) in the expression, something like below?
val.replace(/^[^a-zA-Z0-9$_]*|[^a-zA-Z0-9$_]*/g, '');?
3) As 'x|y' specify that - "Find any of the alternatives specified". Then Why we have used something like this [^a-zA-Z0-9]|[^a-zA-Z0-9] which is same on both sides?
Please help me understand this, Finding it bit confused and difficult.
This regular expression removes all leading and trailing non-alphanumeric characters from the string.
It doesn't specifically mention whitespace. It just matches everything other than alphanumeric characters. Whatever is inside square brackets is a character set: [Whatever]. A leading caret (^) INSIDE the character set negates it. So [^a-zA-Z0-9]* means zero or more characters other than a-z, A-Z or 0-9.
The $ sign at the end means the end of the string and has nothing to do with the $ and _ symbols. Those are already covered by the character set, as it matches all non-alphanumeric characters.
Refer to the answer by @smathy.
Also just FYI, AFAIU regular expression can't be learned by scrolling a tutorial. You just need to go through the basics and try out the examples.
Some basic info.
When you read regular expressions, you read them from left to right. That's how the engine does it.
This is important in the case of alternations, as the alternatives on the left are always tried first.
But in the case of a $ (EOL or EOS) anchor, it might be easier to read from right to left.
Built-in assertions like the anchors ^ $ and the word boundary \b, along with the regular assertions, lookahead (?=) (?!) and lookbehind (?<=) (?<!), do not consume characters.
They are like single-path inline conditionals that pass or fail, where the expression to their right is examined only if they pass. So they do actually match something: they match a condition.
Format your regex so you can see what it's doing. (Use an app such as RegexFormat 5 to help you.)
^                  # BOS
[^a-zA-Z0-9]*      # Optional not any alphanum chars
|                  # or,
[^a-zA-Z0-9]*      # Optional not any alphanum chars
$                  # EOS
Your regex in global context will always match twice, once at the beginning of the string, once at the end because of the line break anchors and because you don't actually require anything else to match.
So basically you should avoid combining purely optional things with the built-in anchors ^ $ \b. That means your regex is better written as ^[^a-zA-Z0-9]+|[^a-zA-Z0-9]+$, since you don't care about the case where nothing is there (which is all the zero-or-more quantifier * adds).
Good Luck, keep studying.
To answer your third question, the alternatives run all the way to the //s, so both sides are not the same. In the original regex the left alternative is "all non alphanumerics at the start of the string" and the right alternative is "all non alphanumerics at the end of the string".
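A small sketch makes the two alternatives visible. It uses the + form of the pattern; the sample strings and variable names are made up for the example.

```javascript
// Left alternative: non-alphanumerics at the start of the string.
// Right alternative: non-alphanumerics at the end of the string.
const edges = /^[^a-zA-Z0-9]+|[^a-zA-Z0-9]+$/g;

// Hypothetical input: junk at both ends, punctuation in the middle.
const trimmed = '  --hello, world!! '.replace(edges, '');
console.log(JSON.stringify(trimmed));

// With nothing to strip, the + version finds no match at all.
console.log('abc'.match(edges));
```

Only the two ends are stripped; the comma and the space in the middle are left alone because neither alternative can reach them.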

Replace Pipe and Comma with Regex in Javascript

I'm sitting here with "The Good Parts" in hand but I'm still none the wiser.
Can anyone knock up a regex for me that will allow me to replace any instances of "|" and "," from a string.
Also, could anyone point me in the direction of a really good resource for learning regular expressions, especially in javascript (are they a particular flavour??) It really is a weak point in my knowledge.
Cheers.
str.replace(/(\||,)/g, "replaceWith") Don't forget the g at the end so that it searches the string globally; without it, the regex will only replace the first instance of the characters.
What it's saying is: replace | (you need to escape this character) OR (|) ,
Nice Cheatsheet here
The best resource I have found if you really want to understand regular expressions (and the special caveats or quirks of any of a majority of the implementations/flavors) is Regular-Expressions.info.
If you really get into regular expressions, I would recommend the product called RegexBuddy for testing and debugging regular expressions in all sorts of languages (though there are a few things it does not quite support, it is rather good overall)
Edit:
The best way (I think, especially if you consider readability) is using a character class rather than alternation (i.e.: [] instead of |)
use:
var newString = str.replace(/[|,]/g, ";");
This will replace either a | or a , with a semicolon
The character class essentially means "match anything inside these square brackets" - with only a few exceptions.
First, you can specify ranges of characters ([a-zA-Z] means any letter from a to z or from A to Z).
Second, putting a caret (^) at the beginning of the character class negates it - it means anything not in this character class ([^0-9] means any character that is not from 0 to 9).
Third, put the dash at the beginning and the caret at the end of the character class to match those characters literally, or escape them anywhere else in the class with a \ if you prefer.
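The points above can be tried out with a few one-liners. This is a minimal sketch; the input strings and variable names are invented for the example.

```javascript
const input = 'a|b,c|d';

// With the g flag, every | and , is replaced.
const allReplaced = input.replace(/[|,]/g, ';');

// Without the g flag, only the first occurrence is replaced.
const firstOnly = input.replace(/[|,]/, ';');

// Dash first and caret last: both are matched literally.
const literalDashCaret = 'x-y^z'.replace(/[-^]/g, ' ');

console.log(allReplaced, firstOnly, literalDashCaret);
```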
