Please explain some Javascript Regular Expressions - javascript

I'm learning Javascript via an online tutorial, but nowhere on that website or any other I googled for was the jumble of symbols explained that makes up a regular expression.
Check if all numbers: /^[0-9]+$/
Check if all letters: /^[a-zA-Z]+$/
And the hardest one:
Validate Email: /^[\w-.+]+\#[a-zA-Z0-9.-]+.[a-zA-z0-9]{2,4}$/
What do all the slashes and dollar signs and brackets mean? Please explain.
(By the way, what languages are required to create a flexible website? I know a bit of Javascript and wanna learn jQuery and PHP. Anything else needed?)
Thanks.

There are already a number of good sites that explain regular expressions so I'll just dive a bit into how each of the specific examples you gave translate.
Check if all numbers: ^ anchors the start of the expression (i.e. the match must start at the beginning of the text). Without it a match could be found anywhere. [0-9] matches any one character in that character class (i.e. the digits 0-9). The + after the character class just means "one or more". The ending $ anchors the end of the text (i.e. the match must run to the end of the input). So if you put that together, that regular expression allows only strings made up of one or more digits. Note that the anchors are important: without them the pattern would also match something like "foo123bar".
Check if all letters: Pretty much the same as above but the character classes are different. In this example the character class [a-zA-Z] represents all lowercase and uppercase characters.
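As a quick illustration of why the anchors matter, here is a minimal sketch using the two patterns from the question plus an unanchored variant (the variable names are made up for this example):

```javascript
// Anchored patterns from the question: the whole input must match.
const allDigits = /^[0-9]+$/;
const allLetters = /^[a-zA-Z]+$/;
// Same character class without anchors: a match anywhere in the input is enough.
const someDigits = /[0-9]+/;

console.log(allDigits.test("12345"));      // true
console.log(allDigits.test("foo123bar"));  // false
console.log(someDigits.test("foo123bar")); // true
console.log(allLetters.test("Hello"));     // true
console.log(allLetters.test("Hello1"));    // false
console.log(allDigits.test(""));           // false, because + needs at least one digit
```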
The last one actually isn't any more difficult than the other two, it's just longer. This answer is getting quite long so I'll just explain the new symbols. A \w in a character class will match word characters (which are defined per regex implementation but are generally 0-9a-zA-Z_ at least). The backslash before the # escapes the # so that it is treated as a literal character rather than as a token in the regex. Outside a character class, a period matches any character, so the lone . near the end matches any single character (e.g. a, 1, Z); inside a character class such as [\w-.+], the . and + are just literal characters. The last part of the regex ({2,4}) defines an interval expression. This means that it can match a minimum of 2 of the thing that precedes it, and a maximum of 4.
Hope you got something out of the above.

There is an awesome explanation of regular expressions at http://www.regular-expressions.info/ including notes on language and implementation specifics.

Let me explain:
Check if all numbers: /^[0-9]+$/
So, the first thing we see is the "/" at the beginning and the end. This is a delimiter, and only serves to show the beginning and end of the regular expression.
Next, we have a "^", which means the beginning of the string. [0-9] means a digit from 0-9. + is a quantifier, which modifies the term in front of it; in this case, it means you can have one or more of something, so you can have one or more digits from 0-9.
Finally, we end with "$", which is the opposite of "^", and means the end of the string. So put that all together and it basically makes sure that, in between the start and end of the string, there is nothing but one or more digits from 0-9.
Check if all letters: /^[a-zA-Z]+$/
We notice this is very similar, but instead of checking for numbers 0-9, it checks for letters a-z (lowercase) and A-Z (uppercase).
And the hardest one:
Validate Email: /^[\w-.+]+\#[a-zA-Z0-9.-]+.[a-zA-z0-9]{2,4}$/
"\w" means a word character: a letter, a digit, or an underscore. Inside the first set of brackets it sits alongside -, . and +, so the part before the "#" can be one or more of those characters. (Inside brackets a period is just a literal period; outside them it matches pretty much any character.)
The new thing here is escape characters. Many symbols cannot be used literally without escaping them by placing a backslash in front, as is the case with "\#". This means it is looking directly for the symbol "#".
Next it looks for letters, numbers, periods and hyphens, and then a period (this one seems incorrect: it should be escaping the period too, though it will still work, since an unescaped period will match any symbol). Numbers inside {} say how many times the previous term may repeat, so of the [a-zA-Z0-9], there should be 2-4 characters (this part here is the top-level domain, such as .com, .ca, or .info). Note there's another error in this one: the [a-zA-z0-9] should be [a-zA-Z0-9] (capital Z).
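The capital-Z point can be checked directly. The sketch below isolates the final part of the pattern; the names typoRange and fixedRange are made up for this example. In ASCII the characters [ \ ] ^ _ and ` sit between Z and a, so the range A-z lets them through.

```javascript
// Last part of the email pattern as written (A-z) and as presumably intended (A-Z).
const typoRange = /^[a-zA-z0-9]{2,4}$/;
const fixedRange = /^[a-zA-Z0-9]{2,4}$/;

console.log(typoRange.test("com"));     // true
console.log(fixedRange.test("com"));    // true
console.log(typoRange.test("c_m"));     // true: _ falls inside the A-z range
console.log(fixedRange.test("c_m"));    // false
console.log(fixedRange.test("c"));      // false: fewer than 2 characters
console.log(fixedRange.test("museum")); // false: more than 4 characters
```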
Oh, and check out that site listed above, it is a great set of tutorials too.

Regular Expressions is a complex beast and, as already pointed out, there are quite a few guides off of google you can go read.
To answer the OP questions:
Check if all numbers: /^[0-9]+$/
Regexps here are all delimited with //, much like strings are quoted with '' or "".
^ means start of string or line (depending on what options you have about multiline matching)
[...] are called character classes. Anything in [] is a list of single matching characters at that position in this case 0-9. The minus sign has a special meaning of "sequence of characters between". So [0-9] means "one of 0123456789".
+ means "1 or more" of the preceding match (in this case [0-9]), so one or more digits
$ means end of string/line match.
So in summary: match any string that contains only digits, i.e. '0123a' will not match, as [0-9]+ fails to match the a before $.
Check if all letters: /^[a-zA-Z]+$/
Hopefully [A-Za-z] makes sense now (A-Z = ABCDEF...XYZ and a-z abcdef...xyz)
Validate Email: /^[\w-.+]+\#[a-zA-Z0-9.-]+.[a-zA-z0-9]{2,4}$/
Not all regexp parsers know the \w sequence. JavaScript, Java and Perl, as far as I know, do support it.
I have already covered the /^ at the beginning. For this [] match we are looking for
\w, -, . and +. I think that regexp is incorrect: either the minus sign should be escaped with \ or it should be at the end of the [] (i.e. [\w+.-]). But that is an aside; they are basically attempting to allow anything of abcdefghijklmnopqrstuvwxyz0123456789-.+ (plus the uppercase letters and _, which \w also covers)
so fred.smith-foo+wee#mymail.com will match but fred.smith%foo+wee#mymail.com won't (the % is not matched by [\w.+-]).
\# is the literal at sign (it is escaped because Perl would otherwise expand it as an array variable reference)
[a-zA-Z0-9.-]+ is almost the same as [\w.-]+ (\w additionally matches _). Very much like the user part of the match, but does not match +. So this matches foo.com. and google.co. but not my+foo.com or my***domain.co.
. means match any one character. This again is incorrect, as fred#foo%com will match because . matches %*^%$£! etc. This should have been written as \.
The last character class [a-zA-z0-9]{2,4} looks for 2, 3 or 4 of the characters specified in the character class (much like + looks for "1 or more", {2,4} means at least 2 with a maximum of 4 of the preceding match). So 'foo' matches, '11' matches, but '11111' and 'information' do not match.
The "tweaked" regexp should be:
/^[\w.+-]+\#[a-zA-Z0-9.-]+\.[a-zA-z0-9]{2,4}$/
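To see what the escaped period changes, here is a small sketch that runs the original and the "tweaked" pattern against the sample inputs mentioned above. Both patterns are copied as they appear in this answer (including the # separator); the variable names are made up for this example.

```javascript
// Original pattern: the period before the last class is unescaped.
const original = /^[\w-.+]+\#[a-zA-Z0-9.-]+.[a-zA-z0-9]{2,4}$/;
// Tweaked pattern: hyphen moved to the end of the class, period escaped.
const tweaked = /^[\w.+-]+\#[a-zA-Z0-9.-]+\.[a-zA-z0-9]{2,4}$/;

const samples = [
  "fred.smith-foo+wee#mymail.com", // accepted by both
  "fred.smith%foo+wee#mymail.com", // rejected by both: % is not in the first class
  "fred#foo%com",                  // accepted only by the original: . matches %
];

for (const s of samples) {
  console.log(s, original.test(s), tweaked.test(s));
}
```

Neither pattern is a complete email validator; the sketch only shows the effect of the two corrections.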

I'm not doing a tutorial on RegEx's, that's been done really well already, but here are what your expressions mean.
/^<something>$/ String begins, has something in the middle, and then immediately ends.
/^foo$/.test('foo'); // true
/^foo$/.test('fool'); // false
/^foo$/.test('afoo'); // false
+ One or more of something:
/a+/.test('cot');//false
/a+/.test('cat');//true
/a+/.test('caaaaaaaaaaaat');//true
[<something>] Match any one character found between the brackets (includes ranges like 0-9, a-z, and A-Z, as well as special codes like \w for 0-9a-zA-Z_):
/^[0-9]+/.test('f00')//false
/^[0-9]+/.test('000')//true
{x,y} between X and Y occurrences
/^[0-9]{1,2}$/.test('12');// true
/^[0-9]{1,2}$/.test('1');// true
/^[0-9]{1,2}$/.test('d');// false
/^[0-9]{1,2}$/.test('124');// false
So, that should cover everything, but for good measure:
/^[\w-.+]+\#[a-zA-Z0-9.-]+.[a-zA-z0-9]{2,4}$/
Begins with at least one character from \w, -, +, or ., followed by a #, followed by at least one in the set a-zA-Z0-9.-, followed by one character of anything (. means anything, they meant \.), followed by 2-4 characters of a-zA-z0-9.
As a side note, this regular expression to check emails is not only dated, but it is very, very, very incorrect.

Related

Forcing a Strict Character Order in a Regex Expression

I'm trying to create a regex in Javascript that has a limited order the characters can be placed in, but I'm having trouble getting the validation to be fully correct.
The criteria for the expression is a little complicated. The user must input strings with the following criteria:
The string contains two parts, an initial group, and an end group.
The groups are separated by a colon (:).
Strings are separated by a semi-colon (;).
The initial group can start with one optional forward-slash and end with one optional forward-slash, but these forward-slashes may not appear anywhere else in the group.
Inside forward-slashes, one optional underscore may appear on either end, but they may not appear anywhere else in the group.
Inside these optional elements, the user may enter any number of numbers or letters, uppercase or lowercase, but exactly one of these characters must be surrounded with angular brackets (<>).
If the letter inside the brackets is an uppercase C, it may be followed by one of a lowercase u or v.
The end group may contain one or more of a number or letter, uppercase or lowercase (If it is an uppercase C, it can be followed by a lowercase u or v.) or one asterisk (*), but not both.
A string must be able to validate with multiple groupings.
This probably sounds a little confusing.
For example, the following examples are valid:
<C>:Cu;
<Cu>:Cv;
/_V<C>V:C;
/_VV<Cv>VV_/:Cu;
_<V>:V1;
_<V>_:V1;
_<V>/:V1;
_<V>:*;
_<m>:n;
The following are invalid:
Cu:Cv;
Cu:Cv
CuCv;
<Cu/>:Cv;
<Cu_>:Cv;
<Cu>:Cv/;
_/<Cu>:Cv;
<Cu>/_:Cv;
They should validate when grouped together like so.
<Cu>:Cv;/_V<C>V:C;_<V>:V1;_<V>/:V1;_<V>:*;_<m>:n;
Hopefully, these examples help you understand what I'm trying to match.
I created the following regexp and tested it on Regex101.com, but this is the closest I could come:
\\/{0,1}_{0,1}[A-Za-z0-9]{0,}<{1}[A-Za-z0-9]{1,2}>{1}[A-Za-z0-9]{0,}_{0,1}\\/{0,1}):([A-Za-z0-9]{1,2}|\\*;$
It's mostly correct, but it allows strings that should be invalid such as:
_/<C>:C;
If an underscore comes before the first forward-slash, it should be rejected. Otherwise, my regexp seems to be correct for all other cases.
If anyone has any suggestions on how to fix this, or knows of a way to match all criteria much more efficiently, any help is appreciated.
The following seems to fulfill all the criteria:
(?:^|;)(\/?_?[a-zA-Z0-9]*<(?:[a-zA-Z]|C[uv]?)>[a-zA-Z0-9]*_?\/?):([a-zA-Z0-9]+|\*)(?=;|$)
Regex101 demo.
It puts each of the "groups" in a capturing group so you can access them individually.
Details:
(?:^|;) A non-capturing group to make sure the string is either at the beginning or starts with a semicolon.
( Start of group 1.
\/?_? An optional forward-slash followed by an optional underscore.
[a-zA-Z0-9]* Any letter or number - Matches zero or more.
<(?:[a-zA-Z]|C[uv]?)> Mandatory <> pair containing one letter, or the capital letter C optionally followed by a lowercase u or v.
[a-zA-Z0-9]* Any letter or number - Matches zero or more.
_?\/? An optional underscore followed by an optional forward-slash.
) End of group 1.
: Matches a colon character literally.
([a-zA-Z0-9]+|\*) Group 2 - containing one or more numbers or letters or a single * character.
(?=;|$) A positive Lookahead to make sure the string is either followed by a semicolon or is at the end.
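A minimal sketch of how the two capturing groups could be read out in JavaScript. The g flag and the helper name parseEntries are additions for this example and are not part of the answer itself.

```javascript
// Pattern from the answer, with the g flag added so matchAll can walk every entry.
const entryPattern =
  /(?:^|;)(\/?_?[a-zA-Z0-9]*<(?:[a-zA-Z]|C[uv]?)>[a-zA-Z0-9]*_?\/?):([a-zA-Z0-9]+|\*)(?=;|$)/g;

// parseEntries is a hypothetical helper: it returns one object per matched entry.
function parseEntries(input) {
  return Array.from(input.matchAll(entryPattern), (m) => ({
    initial: m[1], // group 1: the part before the colon
    end: m[2],     // group 2: the part after the colon
  }));
}

console.log(parseEntries("<Cu>:Cv;/_V<C>V:C;_<V>:*;"));
console.log(parseEntries("_/<C>:C;")); // [] because the underscore precedes the slash
```

Note that the pattern finds valid entries rather than validating the whole input, so an invalid entry in the middle of a longer string is skipped rather than reported.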
Did you mean this?
/^(?:(^|\s*;\s*)(?:\/_|_)?[a-z]*<[a-z]+>[a-z]*_?\/?:(?:[a-z0-9]+|\*)(?=;))+;$/i
We start with a case-insensitive expression /.../i to keep it more readable. You have to rewrite it to a case-sensitive expression if you only want to allow uppercase at the beginning of a word.
^ means the beginning of the string. $ means the end of the string.
The whole string ends with ';' after multiple repetitions of the inner expression (?:...)+ where + means 1 or more occurrences. ;$ at the end includes the last semicolon in the result. It is not necessary for a test only, since the look-ahead already does the job.
(^|\s*;\s*) every part is at the beginning of the string or after a semicolon surrounded by arbitrary whitespace, including linefeeds. Use \n instead of \s if you do not want to allow spaces and tabs.
(?:...|...) is a non-captured alternative. ? after a character or group is the quantifier 0/1 - none or once.
So (?:\/_|_)? means '/_', '_' or nothing. Use \/?_? if you do want to allow strings starting with a single slash as well.
[a-z]*<[a-z]+>[a-z]* 0 or more letters followed by <...> with at least one letter inside and again followed by 0 or more letters.
_?\/?: optional '_', optional '/', mandatory : in this sequence.
(?:[a-z0-9]+|\*) The part after the colon contains letters and numbers or the asterisk.
(?=;) Look-ahead: Every group must be followed by a semicolon. Look-ahead conditions do not move the search position.
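Since this pattern is anchored at both ends, it can be used with test() to accept or reject a whole input. A short sketch against some of the question's examples (the variable name wholeInput is made up for this example):

```javascript
// Whole-input pattern from this answer, unchanged.
const wholeInput =
  /^(?:(^|\s*;\s*)(?:\/_|_)?[a-z]*<[a-z]+>[a-z]*_?\/?:(?:[a-z0-9]+|\*)(?=;))+;$/i;

const valid = [
  "<Cu>:Cv;",
  "_<V>_:V1;",
  "<Cu>:Cv;/_V<C>V:C;_<V>:V1;_<V>/:V1;_<V>:*;_<m>:n;",
];
const invalid = ["Cu:Cv;", "<Cu>:Cv", "<Cu/>:Cv;", "<Cu>/_:Cv;", "_/<C>:C;"];

for (const s of valid) console.log(s, wholeInput.test(s));   // each prints true
for (const s of invalid) console.log(s, wholeInput.test(s)); // each prints false
```

As the answer notes, this version is looser than the stated rules in places (for example, it allows any run of letters inside the brackets), so treat it as a starting point.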

Regex for matching HashTags in any language

I have a field in my application where users can enter a hashtag.
I want to validate their entry and make sure they enter what would be a proper HashTag.
It can be in any language and it should NOT be preceded by the # sign.
I am writing in JavaScript.
So the following are GOOD examples:
Abcde45454_fgfgfg (good because: only letters, numbers and _)
2014_is-the-year (good because: only letters, numbers, _ and -)
בר_רפאלי (good because: only letters and _)
арбуз (good because: only letters)
And the following are BAD examples:
Dan Brown (Bad because has a space)
OMG!!!!! (Bad because has !)
בר רפ#לי (Bad because has # and a space)
We had a regex that matched only a-zA-Z0-9, we needed to add language support so we changed it to ignore white spaces and forgot to ignore special characters, so here I am.
Some other StackOverflow examples I saw but didn't work for me:
Other languages don't work
Again, English only
[edit]
Added explanation why bad is bad and good is good
I don't want a preceding # character, but if I were to add a # at the beginning, it should still be a valid hashtag
Basically I don't want to allow any special characters like !##$%^&*()=+./,[{]};:'"?><
If your disallowed characters list is thorough (!##$%^&*()=+./,[{]};:'"?><), then the regex is:
^#?[^\s!##$%^&*()=+./,\[{\]};:'"?><]+$
Demo
This allows an optional leading # sign: #?. It disallows the special characters using a negative character class. I just added \s to the list (spaces), and also I escaped [ and ].
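Here is a rough check of that pattern against the examples from the question. The only change is that the / inside the class is escaped so the pattern can be written as a JavaScript regex literal; the name hashtagPattern is made up for this example.

```javascript
// Blacklist pattern from the answer, written as a regex literal.
const hashtagPattern = /^#?[^\s!##$%^&*()=+.\/,\[{\]};:'"?><]+$/;

const good = ["Abcde45454_fgfgfg", "2014_is-the-year", "בר_רפאלי", "арбуз", "#арбуз"];
const bad = ["Dan Brown", "OMG!!!!!", "בר רפ#לי"];

for (const s of good) console.log(s, hashtagPattern.test(s)); // each prints true
for (const s of bad) console.log(s, hashtagPattern.test(s));  // each prints false
```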
Unfortunately, JavaScript regexes did not support constructs like \p{P} (Unicode punctuation) when this was written (Unicode property escapes arrived later, in ES2018, and require the u flag), so without them you basically have to blacklist characters or take a different approach if the regex solution isn't good enough for your needs.
I don't understand why this question does not get more votes. Hashtag detection for multiple languages is a problem. The only working option I could find is posted by Lucas above (all other ones do not work so well).
It needs a modification though:
#[^\s!##$%^&*()=+.\/,\[{\]};:'"?><]+
DEMO
This detects all the hashtags, not only at the beginning of the string, fixes an unescaped character, and removes the unnecessary $ at the end.
First, excluding every symbol is not a practical solution, because the available symbols depend on the keyboard layout and there are hundreds of math symbols and so on. So use this instead:
[\p{sc=Bengali}\p{L}_\p{N}]+ (in JavaScript, \p{...} only works when the regex has the u flag; no | is needed between the items, since inside brackets it would match a literal |)
1. If a language needs extra care, add its script explicitly, e.g. \p{sc=Bengali} (several scripts can be listed side by side inside the brackets). Bangla, for example, has dependent vowel signs like া, ে, ৌ; these are combining marks rather than letters, so \p{L} alone does not match them and the script has to be recognized separately with \p{sc=Bengali}.
2. Then use \p{L}, which matches a single code point in the Unicode category "letter": a-z as well as letters like é, ü, ğ, i, ç, or any ordinary alphabetic character from any script.
3. _ allows the underscore.
4. \p{N} matches any kind of numeric character in any script. (\d matches only the ASCII digits [0-9], so \p{N} is the only option for allowing Unicode digits, because it works with any digit code point.)
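A minimal sketch of this approach, assuming an engine with ES2018 Unicode property escapes (the u flag is required). The anchors, the variable names and the hyphen variant are additions for this example, and any | is left out of the class because inside brackets it would match a literal |.

```javascript
// Letters and digits in any script, the underscore, plus everything in the Bengali script.
const unicodeHashtag = /^[\p{sc=Bengali}\p{L}\p{N}_]+$/u;
// Assumed variant: also allow the hyphen, which the question's examples permit.
const withHyphen = /^[\p{sc=Bengali}\p{L}\p{N}_-]+$/u;
// Without the script entry, Bengali vowel signs (combining marks) are rejected.
const lettersOnly = /^[\p{L}\p{N}_]+$/u;

console.log(unicodeHashtag.test("арбуз"));            // true
console.log(unicodeHashtag.test("বাংলা"));             // true
console.log(lettersOnly.test("বাংলা"));                // false
console.log(unicodeHashtag.test("Dan Brown"));        // false
console.log(unicodeHashtag.test("OMG!!!!!"));         // false
console.log(unicodeHashtag.test("2014_is-the-year")); // false: hyphen not in the class
console.log(withHyphen.test("2014_is-the-year"));     // true
```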

Can it be done with regex?

Having the following regex: ([a-zA-Z0-9//._-]{3,12}[^//._-]) used like pattern="([a-zA-Z0-9/._-]{3,12}[^/._-])" to validate an HTML text input for username, I wonder if there is any way of telling it to check that the string has only one of the following: ., -, _
By that I mean, that I'm in need of regex that would accomplish the following (if possible)
alex-how => Valid
alex-how. => Not valid, because finishing in .
alex.how => Valid
alex.how-ha => Not valid, contains already a .
alex-how_da => Not valid, contains already a -
The problem with my current regex is that, for some reason, it accepts any character at the end of the string that is not ._-, and I can't figure out why.
The other problem is that it doesn't check that the string contains only one of the allowed special characters.
Any ideas?
Try this one out:
^(?!(.*[.|_|-].*){2})(?!.*[.|_|-]$)[a-zA-Z0-9//._-]{3,12}$
Regexpal link. The regex above allows at most one of ., _ or -.
What you want is one or more upper-case, lower-case or digit characters, followed by either one or none of the characters "-", ".", or "_", followed by at least one more such character:
^[a-zA-Z0-9]+[-|_|\.]{0,1}[a-zA-Z0-9]+$
Hope this will work for you:
It says: start with characters, optionally followed by any of (-, ., _), and end with a character.
^[\w\d]*[-_\.\w\d]*[\w\d]$
Seems to me you want:
^[A-Za-z0-9]+(?:[\._-][A-Za-z0-9]+)?$
Breaking it down:
^: beginning of line
[A-Za-z0-9]+: one or more alphanumeric characters
(?:[\._-][A-Za-z0-9]+)?: (optional, non-captured) one of your allowed special characters followed by one or more alphanumeric characters
$: end of line
It's unclear from your question if you wanted one of your special characters (., -, and _) to be optional or required (e.g., zero-or-one versus exactly-one). If you actually wanted to require one such special character, you would just get rid of the ? at the very end.
Here's a demonstration of this regular expression on your example inputs:
http://rubular.com/r/SQ4aKTIEF6
As for the length requirement (between 3 and 12 characters): This might be a cop-out, but personally I would argue that it would make more sense to validate this by just checking the length property directly in JavaScript, rather than over-complicating the regular expression.
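A small sketch of that split, with the regex handling the shape and a plain length check handling the 3 to 12 limit. The helper name isValidUsername is made up for this example.

```javascript
// Shape only: alphanumerics, optionally one separator followed by more alphanumerics.
const usernameShape = /^[A-Za-z0-9]+(?:[._-][A-Za-z0-9]+)?$/;

// isValidUsername is a hypothetical helper combining the length and shape checks.
function isValidUsername(name) {
  return name.length >= 3 && name.length <= 12 && usernameShape.test(name);
}

console.log(isValidUsername("alex-how"));    // true
console.log(isValidUsername("alex-how."));   // false: ends with a separator
console.log(isValidUsername("alex.how"));    // true
console.log(isValidUsername("alex.how-ha")); // false: two separators
console.log(isValidUsername("alex-how_da")); // false: two separators
console.log(isValidUsername("ab"));          // false: too short
```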
^(?=[a-zA-Z0-9/._-]{3,12}$)[a-zA-Z0-9]+(?:[/._-][a-zA-Z0-9]+)?$
or, as a JavaScript regex literal:
/^(?=[a-zA-Z0-9\/._-]{3,12}$)[a-zA-Z0-9]+(?:[\/._-][a-zA-Z0-9]+)?$/
The lookahead, (?=[a-zA-Z0-9/._-]{3,12}$), does the overall-length validation.
Then [a-zA-Z0-9]+ ensures that the name starts with at least one non-separator character.
If there is a separator, (?:[/._-][a-zA-Z0-9]+)? ensures that there's at least one non-separator following it.
Note that / has no special meaning in a regex. You only have to escape it if you're using a regex literal (because / is the regex delimiter), and you escape it by prefixing with a backslash, not another forward-slash. And inside a character class, you don't need to escape the dot (.) to make it match a literal dot.
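The $ inside the lookahead is what enforces the upper length limit. The sketch below compares the pattern with and without it; the variable names are made up for this example.

```javascript
// Lookahead anchored with $: the whole input must be 3 to 12 allowed characters.
const withEnd = /^(?=[a-zA-Z0-9\/._-]{3,12}$)[a-zA-Z0-9]+(?:[\/._-][a-zA-Z0-9]+)?$/;
// Lookahead without $: it only checks that the input starts with at least 3 allowed characters.
const withoutEnd = /^(?=[a-zA-Z0-9\/._-]{3,12})[a-zA-Z0-9]+(?:[\/._-][a-zA-Z0-9]+)?$/;

const tooLong = "abcdefghijklmnop"; // 16 characters

console.log(withEnd.test("alex.how"));    // true
console.log(withoutEnd.test("alex.how")); // true
console.log(withEnd.test(tooLong));       // false
console.log(withoutEnd.test(tooLong));    // true: the maximum is not enforced
console.log(withEnd.test("ab"));          // false
```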
The dot in regex has a special meaning: "any character here".
If you mean a literal dot, you should escape it to tell the regex parser so.
Escape dot in a regex range

RegEx in JS to find No 3 Identical consecutive characters

How to find a sequence of 3 characters, 'abb' is valid while 'abbb' is not valid, in JS using Regex (could be alphabets,numerics and non alpha numerics).
This question is a variation of the question that I have asked in here : How to combine these regex for javascript.
This is wrong : /(^([0-9a-zA-Z]|[^0-9a-zA-Z]))\1\1/ , so what is the right way to do it?
This depends on what you actually mean. If you only want to match three non-identical characters (that is, if abb is valid for you), you can use this negative lookahead:
(?!(.)\1\1).{3}
It first asserts, that the current position is not followed by three times the same character. Then it matches those three characters.
If you really want to match 3 different characters (only stuff like abc), it gets a bit more complicated. Use these two negative lookaheads instead:
(.)(?!\1)(.)(?!\1|\2).
First match one character. Then we assert that this is not followed by the same character. If so, we match another character. Then we assert that these are followed neither by the first nor the second character. Then we match a third character.
Note that those negative lookaheads ((?!...)) do not consume any characters. That is why they are called lookaheads. They just check what is coming next (or in this case what is not coming next) and then the regex continues from where it left off. Here is a good tutorial.
Note also that this matches anything but line breaks, or really anything if you use the DOTALL or SINGLELINE option. In JavaScript you can activate that option by appending s after the regex's closing delimiter, but only in engines that support the s flag (added in ES2018). If you can't or don't want to use this option, replace the .s by [\s\S] (this always matches any character).
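A short sketch of both patterns on a few inputs, to show where each one finds its first match (the variable names are made up for this example):

```javascript
// Three characters that are not all the same.
const notAllSame = /(?!(.)\1\1).{3}/;
// Three characters that are all different from each other.
const allDifferent = /(.)(?!\1)(.)(?!\1|\2)./;

console.log("abb".match(notAllSame)[0]);    // "abb"
console.log("aaa".match(notAllSame));       // null
console.log("aaab".match(notAllSame)[0]);   // "aab": the match starts at index 1
console.log("abc".match(allDifferent)[0]);  // "abc"
console.log("abab".match(allDifferent));    // null
console.log("aabc".match(allDifferent)[0]); // "abc"
```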
Update:
After clarification in the comments, I realised that you do not want to find three non-identical characters, but instead you want to assert that your string does not contain three identical (and consecutive) characters.
This is a bit easier, and closer to your former question, since it only requires one negative lookahead. What we do is this: we search the string from the beginning for three consecutive identical characters. But since we want to assert that these do not exist we wrap this in a negative lookahead:
^(?!.*(.)\1\1)
The lookahead is anchored to the beginning of the string, so this is the only place where we will look. The pattern in the lookahead then tries to find three identical characters from any position in the string (because of the .*; the identical characters are matched in the same way as in your previous question). If the pattern finds these, the negative lookahead will fail, and so the string will be invalid. If no three identical characters can be found, the inner pattern will never match, so the negative lookahead will succeed.
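Used with test(), the pattern works as a plain yes/no check. A minimal sketch (the name noTripleRepeat is made up for this example):

```javascript
// Succeeds only if no character occurs three times in a row anywhere in the string.
const noTripleRepeat = /^(?!.*(.)\1\1)/;

console.log(noTripleRepeat.test("abb"));      // true
console.log(noTripleRepeat.test("abbb"));     // false
console.log(noTripleRepeat.test("aabbcc!!")); // true
console.log(noTripleRepeat.test("xx---yy"));  // false: three hyphens in a row
```

Because . does not match line breaks by default, a repeat that sits after a line break would not be seen; swapping . for [\s\S] is one way around that.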
To find three characters that are not all identical, use the regex pattern
([\s\S])(?!\1\1)[\s\S]{2}

Javascript Regular Expression for Password

I am writing the regex for validating password in Javascript. The constraints are:
Password must contain at least one uppercase character
Password must contain at least a special character
With trial and error and some searching on the net, I found that this works:
/(?=.*[A-Z]+)(?=.*[!##\$%]+)/
Can someone please explain the part of this expression which mentions that the uppercase letter and special character can come in ANY order?
I think this would work even better:
/(?=.*[A-Z])(?=.*[!##\$%])/
Look-arounds do not consume characters; therefore, the starting position for the second look-ahead is the same as for the first, which makes the checks for those two characters independent of each other. You could swap them around and the resulting regex would still be equivalent to this one.
The following regex (suggested by Gumbo) is slightly more efficient, as it avoids unnecessary backtracking:
/(?=[^A-Z]*[A-Z])(?=[^!##\$%]*[!##\$%])/
On passwords of usual lengths the time difference probably won't be easily measurable, though.
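A quick sketch showing that the order of the uppercase letter and the special character does not matter, and that both variants agree. The patterns are copied from above; the variable names are made up for this example.

```javascript
// Both lookaheads start at the same position, so they are checked independently.
const simple = /(?=.*[A-Z])(?=.*[!##\$%])/;
const lessBacktracking = /(?=[^A-Z]*[A-Z])(?=[^!##\$%]*[!##\$%])/;

const samples = ["Passw0rd!", "!passworD", "password!", "Password"];

for (const s of samples) {
  console.log(s, simple.test(s), lessBacktracking.test(s));
}
// Passw0rd!  true  true   (uppercase first, special last)
// !passworD  true  true   (special first, uppercase last)
// password!  false false  (no uppercase letter)
// Password   false false  (no special character)
```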
The ?= is called a lookahead: it scans the rest of the string to see if a match can be found. Normally, a regex consumes the string character by character, but the ?= tells it to "look ahead" to see whether something exists without consuming it.
There is also a negative lookahead, ?!.
The "?=" does this. It is a "Positive Lookahead".
From JavaScript Regular Expression Syntax
Positive lookahead matches the search string at any point where a string matching pattern begins. This is a non-capturing match, that is, the match is not captured for possible later use. For example 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000" but not "Windows" in "Windows 3.1". Lookaheads do not consume characters, that is, after a match occurs, the search for the next match begins immediately following the last match, not after the characters that comprised the lookahead.
