Forcing a Strict Character Order in a Regex Expression

Forcing a Strict Character Order in a Regex Expression - javascript

I'm trying to create a regex in Javascript that has a limited order the characters can be placed in, but I'm having trouble getting the validation to be fully correct.
The criteria for the expression is a little complicated. The user must input strings with the following criteria:
The string contains two parts, an initial group, and an end group.
The groups are separated by a colon (:).
Strings are separated by a semi-colon (;).
The initial group can start with one optional forward-slash and end with one optional forward-slash, but these forward-slashes may not appear anywhere else in the group.
Inside forward-slashes, one optional underscore may appear on either end, but they may not appear anywhere else in the group.
Inside these optional elements, the user may enter any number of numbers or letters, uppercase or lowercase, but exactly one of these characters must be surrounded with angular brackets (<>).
If the letter inside the brackets is an uppercase C, it may be followed by one of a lowercase u or v.
The end group may contain one or more of a number or letter, uppercase or lowercase (If it is an uppercase C, it can be followed by a lowercase u or v.) or one asterisk (*), but not both.
A string must be able to validate with multiple groupings.
This probably sounds a little confusing.
For example, the following examples are valid:
<C>:Cu;
<Cu>:Cv;
/_V<C>V:C;
/_VV<Cv>VV_/:Cu;
_<V>:V1;
_<V>_:V1;
_<V>/:V1;
_<V>:*;
_<m>:n;
The following are invalid:
Cu:Cv;
Cu:Cv
CuCv;
<Cu/>:Cv;
<Cu_>:Cv;
<Cu>:Cv/;
_/<Cu>:Cv;
<Cu>/_:Cv;
They should validate when grouped together like so.
<Cu>:Cv;/_V<C>V:C;_<V>:V1;_<V>/:V1;_<V>:*;_<m>:n;
Hopefully, these examples help you understand what I'm trying to match.
I created the following regexp and tested it on Regex101.com, but this is the closest I could come:
\\/{0,1}_{0,1}[A-Za-z0-9]{0,}<{1}[A-Za-z0-9]{1,2}>{1}[A-Za-z0-9]{0,}_{0,1}\\/{0,1}):([A-Za-z0-9]{1,2}|\\*;$
It's mostly correct, but it allows strings that should be invalid such as:
_/<C>:C;
If an underscore comes before the first forward-slash, it should be rejected. Otherwise, my regexp seems to be correct for all other cases.
If anyone has any suggestions on how to fix this, or knows of a way to match all criteria much more efficiently, any help is appreciated.

The following seems to fulfill all the criteria:
(?:^|;)(\/?_?[a-zA-Z0-9]*<(?:[a-zA-Z]|C[uv]?)>[a-zA-Z0-9]*_?\/?):([a-zA-Z0-9]+|\*)(?=;|$)
Regex101 demo.
It puts each of the "groups" in a capturing group so you can access them individually.
Details:
(?:^|;) A non-capturing group to make sure the string is either at the beginning or starts with a semicolon.
( Start of group 1.
\/?_? An optional forward-slash followed by an optional underscore.
[a-zA-Z0-9]* Any letter or number - Matches zero or more.
<(?:[a-zA-Z]|C[uv]?)> Mandatory <> pair containing one letter or the capital letter C followed by a lowercase u or v.
[a-zA-Z0-9]* Any letter or number - Matches zero or more.
_?\/? An optional underscore followed by an optional forward-slash.
) End of group1.
: Matches a colon character literally.
([a-zA-Z0-9]+|\*) Group 2 - containing one or more numbers or letters or a single * character.
(?=;|$) A positive Lookahead to make sure the string is either followed by a semicolon or is at the end.

Did you mean this?
/^(?:(^|\s*;\s*)(?:\/_|_)?[a-z]*<[a-z]+>[a-z]*_?\/?:(?:[a-z0-9]+|\*)(?=;))+;$/i
We start with a case-insensitive expression /.../i to keep it more readable. You have to rewrite it to a case-sensitive expression if you only want to allow uppercase at the beginning of a word.
^ means the begin of the string. $ means the end of the string.
The whole string ends with ';' after multiple repeatitions of the inner expression (?:...)+ where + means 1 or more ocurrences. ;$ at the end includes the last semicolon into the result. It is not necessary for a test only, since the look-ahead already does the job.
(^|\s*;\s*) every part is at the begin of the string or after a semicolon surrounded by arbitrary whitespaces including linefeed. Use \n if you do not want to allow spaces and tabs.
(?:...|...) is a non-captured alternative. ? after a character or group is the quantifier 0/1 - none or once.
So (?:\/_|_)? means '/', '' or nothing. Use \/?_? if you do want to allow strings starting with a single slash as well.
[a-z]*<[a-z]+>[a-z]* 0 or more letters followed by <...> with at least one letter inside and again followed by 0 or more letters.
_?\/?: optional '_', optional '/', mandatory : in this sequence.
(?:[a-z0-9]+|\*) The part after the colon contains letters and numbers or the asterisk.
(?=;) Look-ahead: Every group must be followed by a semicolon. Look-ahead conditions do not move the search position.

Related

Regex creation to allow, disallow few characters

I am new to regex, i have this use case:
Allow characters, numbers.
Zero or one question mark allowed. (? - valid, consecutive question marks are not allowed (??)).
test-valid
?test - valid
??test- invalid
?test?test - valid
???test-invalid
test??test -invalid
Exlcude $ sign.
[a-zA-Z0-9?] - seems this doesn't work
Thanks.

Try the following regular expression: ^(?!.*\?\?)[a-zA-Z0-9?]+$
first we're using Negetive lookahead - which allows us to exclude any character which is followed by double question marks (Negetive lookahaed does not consume characters)
Since question mark has special meaning in regular expressions (Quantifier — Matches between zero and one times), each question mark is escaped using backslash.
The plus sign at the end is a Quantifier — Matches between one and unlimited times, as many times as possible
You can test it here

Your description can be broken down into the regex:
^(?:\??[a-zA-Z0-9])+\??$
You say characters and your description shows letters and numbers only, but it's possible \w (word characters) may be used instead - this includes underscore
It's between ^ and $ meaning the whole field must match (no partial matches, although if you want those you can remove this. The + means there must be at least one match (so empty string won't match). The capturing group ((\??[a-zA-Z0-9])) says I must either see a question mark followed by letters or just letters repeating many times, and the final question mark allows the string to end with a single question mark.
You probably don't want capturing groups here, so we can start that with ?: to prevent capture leading to:
^(?:\??[a-zA-Z0-9])+\??$
Matches
test
?test
?test?test
test?
Doesn't match
??test
???test
test??test
test??
<empty string>
?

RegEx does not match my example

This RegEx could not find example string.
RegEx:
^ALTER\\sTABLE\\sADMIN_\\sADD CONSTRAINT \\s(.*)\\sPRIMARY KEY \\s(\(.*\))\\.([a-zA-Z0-9_]+)
Example:
ALTER TABLE ADMIN_ ADD CONSTRAINT PK_ADMIN_ PRIMARY KEY (RECNOADM);
I am new to regex and tried to complete my RegEx at REGEX101.COM but with no success. What am I missing?
Djorjde

^\s*ALTER\s+TABLE\s+ADMIN_\s+ADD\s+CONSTRAINT\s+(.+)\s+PRIMARY\s+KEY\s*\((.+)\)\s*;\s*$
This expression will match the SQL statement you used as an example, capturing PK_ADMIN_ in the first group and RECNOADM in the second.
My suggestion is to use always \s+ to match the spaces (\s* when they are optional, like the leading or trailing spaces), unless they have to be exactly a single space.
So let's break the regex down:
^ Marks the beginning of the line. You don't want the line to match if there's anything else before.
\s* Optional leading spaces.
ALTER\s+TABLE\s+ADMIN_\s+ADD\s+CONSTRAINT This will match ALTER TABLE ADMIN_ ADD CONSTRAINT, regardless of the spacing used.
\s+(.+)\s+ Then, the next space-bound word(s)** will be captured into the first group. You're accepting any character here! Maybe you could want to restrict that to \w+ or the like. Unless you accept an empty group, use the + closure (i.e., one or more), not the * one (i.e., zero or more)
PRIMARY\s+KEY Matches the sequence PRIMARY KEY, again, regardless of the spacing.
\s*\((.+)\) This will capture anything inside the parentheses as the PK in the second capture group.
\s* Means that it can be optionally preceded by an arbitrary number of spaces (although they are optional. They are in SQL if I recall correctly)
\(...\) You have to escape the parentheses because they are characters to match, no special characters of the regex.
(.+) Here you capture (between unescaped parentheses) everything between the (escaped) parentheses into a capture group. The second one in this case.
\s*;\s* The sentence has to end with a semicolon, optionally preceded and/or succeeded by any spaces.
$ Marks the end of the line.
In case you want to accept more than one sentence in the same line, you'd remove the ^ and $ zero-width delimiters.
About the escaping, the easiest way here is to simply double every backslash in the expression you built in the editor: ^\\s*ALTER\\s+TABLE\\s+ADMIN_\\s+ADD\\s+CONSTRAINT\\s+(.+)\\s+PRIMARY\\s+KEY\\s*\\((.+)\\)\\s*;\\s*$ However, there are context and/or languages where a more complex escaping may be needed (e.g., the Linux shell)
** Note that in 4, the inner expression .+ will take as many characters as possible, as long as the remaining parts also match the string. This is because the closures are by default greedy, meaning that the engine will try to match the longest string possible. That means that, for instance, this entry will match: ALTER TABLE ADMIN_ ADD CONSTRAINT PK_ADMIN_ OR *WHATEVER* YOU "WANT" TO PUT HERE! PRIMARY KEY (RECNOADM);, capturing PK_ADMIN_ OR *WHATEVER* YOU "WANT" TO PUT HERE! in the first group. Hence the importance of restricting the set of accepted characters ;)

Have you tried the following?
^ALTER\sTABLE\sADMIN_\sADD\sCONSTRAINT\s((.*))\sPRIMARY\sKEY\s\((.*)\);
I am wrapping two separate blocks through () in order to identify from the Regex the values inserted if you need to access them too.
In your regex there are few issues with white spaces (mixing up white spaces with \s and the white space should be \s not \s)

In JavaScript you only need to escape backslashes that are part of escape sequences when you're composing a regexp from a string, e.g.:
var r = new RegExp('\\d');
console.log(r.test('2'));
But the additional \ is not part of the regexp and you don't need it when using the literal syntax (or regexp101):
var r = /\d/;
console.log(r.test('2'));

Regular expression match 0 or exact number of characters

I want to match an input string in JavaScript with 0 or 2 consecutive dashes, not 1, i.e. not range.
If the string is:
-g:"apple" AND --projectName:"grape": it should match --projectName:"grape".
-g:"apple" AND projectName:"grape": it should match projectName:"grape".
-g:"apple" AND -projectName:"grape": it should not match, i.e. return null.
--projectName:"grape": it should match --projectName:"grape".
projectName:"grape": it should match projectName:"grape".
-projectName:"grape": it should not match, i.e. return null.
To simplify this question considering this example, the RE should match the preceding 0 or 2 dashes and whatever comes next. I will figure out the rest. The question still comes down to matching 0 or 2 dashes.
Using -{0,2} matches 0, 1, 2 dashes.
Using -{2,} matches 2 or more dashes.
Using -{2} matches only 2 dashes.
How to match 0 or 2 occurrences?

Answer
If you split your "word-like" patterns on spaces, you can use this regex and your wanted value will be in the first capturing group:
(?:^|\s)((?:--)?[^\s-]+)
\s is any whitespace character (tab, whitespace, newline...)
[^\s-] is anything except a whitespace-like character or a -
Once again the problem is anchoring the regex so that the relevant part isn't completely optionnal: here the anchor ^ or a mandatory whitespace \s plays this role.
What we want to do
Basically you want to check if your expression (two dashes) is there or not, so you can use the ? operator:
(?:--)?
"Either two or none", (?:...) is a non capturing group.
Avoiding confusion
You want to match "zero or two dashes", so if this is your entire regex it will always find a match: in an empty string, in --, in -, in foobar... What will be match in these string will be an empty string, but the regex will return a match.
This is a common source of misunderstanding, so bear in mind the rule that if everything in your regex is optional, it will always find a match.
If you want to only return a match if your entire string is made of zero or two dashes, you need to anchor the regex:
^(?:--)?$
^$ match respectively the beginning and end of the string.

a(-{2})?(?!-)
This is using "a" as an example. This will match a followed by an optional 2 dashes.
Edit:
According to your example, this should work
(?<!-)(-{2})?projectName:"[a-zA-Z]*"
Edit 2:
I think Javascript has problems with lookbehinds.
Try this:
[^-](-{2})?projectName:"[a-zA-Z]*"
Debuggex Demo

JavaScript RegExp to match a (partial) hour

I want to allow people to enter times into a textbox in various formats. One of the formats would be either:
2h for 2 hours, or
2.5h for 2 and a half hours
I want to use a regex to recognise the pattern but it's not picking it up for some reason:
I have:
var hourRegex = /^\d{1,2}[\.\d+]?[h|H]$/;
which works for 2h, but not for 2.5h.
I thought that this regex would mean - Start at the beginning of the string, have one or two digits, then have none or one decimal points which if present must be followed by one or more digits then have a h or a H and then it must be the end of the string.
I have tried the regex tool here but no luck.

/^\d{1,2}(?:\.\d+)?h$/i; Use parentheses instead of square braces.
Start at the beginning
One or two digits
Optional: a dot followed by at least one digit
End with a h
Case insensitive
RegExp tuturial
[...] - square braces mean: anything which is within the provided range.
[^...] means: Match a character which is not within the provided range
(...) - parentheses mean: Group me. Optionally, the first characters of a group can start with:
?: - Don't reference me (me, I = group)
?= - Don't include me in the match, though I have to be here
?! - I may not show up at this point
{a,b}, {a,} means: At least a, maximum b characters. Omitting b = Infinity
+ means: at least one time, match as much as possible equivalen to {1,}
* means: match as much as possible equivalent to {0,}
+? and *? have the same effect as previously described, with one difference: Match as less as possible
Examples
[a-z] One character, any character between a, b, c, ..., z
(a-z) Match "a-z", and group it
[^0-9] Match any non-number character
See also
MDN: Regular Expressions - A more detailed guide

The trouble is here :
[\.\d+]
you can not use character classes inside brackets.
Use this instead:
(\.[0-9]+)?

You've confused your square brackets with your parenthesis. Square brackets look for a single match of any contained character, whereas parenthesis look for a match of the entire enclosed pattern.
Your issue lies in [\.\d+]? It's looking for . or 0-9 or +.
Instead you should try:
/^\d{1,2}(\.\d+)?(h|H)$/
Although that will still allow users to enter invalid numbers, such as 99.3 which is probably not the expected behavior.

Please explain some Javascript Regular Expressions

I'm learning Javascript via an online tutorial, but nowhere on that website or any other I googled for was the jumble of symbols explained that makes up a regular expression.
Check if all numbers: /^[0-9]+$/
Check if all letters: /^[a-zA-Z]+$/
And the hardest one:
Validate Email: /^[\w-.+]+\#[a-zA-Z0-9.-]+.[a-zA-z0-9]{2,4}$/
What do all the slashes and dollar signs and brackets mean? Please explain.
(By the way, what languages are required to create a flexible website? I know a bit of Javascript and wanna learn jQuery and PHP. Anything else needed?)
Thanks.

There are already a number of good sites that explain regular expressions so I'll just dive a bit into how each of the specific examples you gave translate.
Check if all numbers: ^ anchors the start of the expression (e.g. start at the beginning of the text). Without it a match could be found anywhere. [0-9] finds the characters in that character class (e.g. the numbers 0-9). The + after the character class just means "one or more". The ending $ anchors the end of the text (e.g. the match should run to the end of the input). So if you put that together, that regular expression would allow for only 1 or more numbers in a string. Note that the anchors are important as without them it might match something like "foo123bar".
Check if all letters: Pretty much the same as above but the character classes are different. In this example the character class [a-zA-Z] represents all lowercase and uppercase characters.
The last one actually isn't any more difficult than the other two it's just longer. This answer is getting quite long so I'll just explain the new symbols. A \w in a character class will match word characters (which are defined per regex implementation but are generally 0-9a-zA-Z_ at least). The backslash before the # escapes the # so that it isn't seen as a token in the regex. A period will match any character so .+ will match one or more of any character (e.g. a, 1, Z, 1a, etc). The last part of the regex ({2,4}) defines an interval expression. This means that it can match a minimum of 2 of the thing that precedes it, and a maximum of 4.
Hope you got something out of the above.

There is an awesome explanation of regular expressions at http://www.regular-expressions.info/ including notes on language and implementation specifics.

Let me explain:
Check if all numbers: /^[0-9]+$/
So, first thing we see is the "/" at the beginning and the end. This is a deliminator, and only serves to show the beginning and end of the regular expression.
Next, we have a "^", this means the beginning of the string. [0-9] means a number from 0-9. + is a modifier, which modifies the term in front of it, in this case, it means you can have one or more of something, so you can have one or more numbers from 0-9.
Finally, we end with "$", which is the opposite of "^", and means the end of the string. So put that all together and it basically makes sure that inbetween the start and end of the string, there can be any number of digits from 0-9.
Check if all letters: /^[a-zA-Z]+$/
We notice this is very similar, but instead of checking for numbers 0-9, it checks for letters a-z (lowercase) and A-Z (uppercase).
And the hardest one:
Validate Email: /^[\w-.+]+\#[a-zA-Z0-9.-]+.[a-zA-z0-9]{2,4}$/
"\w" means that it is a word, in this case we can have any number of letters or numbers, as well as the period means that it can be pretty much any character.
The new thing here is escape characters. Many symbols cannot be used without escaping them by placing a slash in front, as is the case with "\#". This means it is looking directly for the symbol "#".
Now it looks for letters and symbols, a period (this one seems incorrect, it should be escaping the period too, though it will still work, since an unescaped period will make any symbol). Numbers inside {} mean that there is inbetween this many terms in the previous term, so of the [a-zA-Z0-9], there should be 2-4 characters (this part here is the website domain, such as .com, .ca, or .info). Note there's another error in this one here, the [a-zA-z0-9] should be [a-zA-Z0-9] (capital Z).
Oh, and check out that site listed above, it is a great set of tutorials too.

Regular Expressions is a complex beast and, as already pointed out, there are quite a few guides off of google you can go read.
To answer the OP questions:
Check if all numbers: /^[0-9]+$/
regexps here are all delimated with //, much like strings are quoted with '' or "".
^ means start of string or line (depending on what options you have about multiline matching)
[...] are called character classes. Anything in [] is a list of single matching characters at that position in this case 0-9. The minus sign has a special meaning of "sequence of characters between". So [0-9] means "one of 0123456789".
+ means "1 or more" of the preceeding match (in this case [0-9]) so one or more numbers
$ means end of string/line match.
So in summary find any string that contains only numbers, i.e '0123a' will not match as [0-9]+ fails to match a before $).
Check if all letters: /^[a-zA-Z]+$/
Hopefully [A-Za-z] makes sense now (A-Z = ABCDEF...XYZ and a-z abcdef...xyz)
Validate Email: /^[\w-.+]+\#[a-zA-Z0-9.-]+.[a-zA-z0-9]{2,4}$/
Not all regexp parses know the \w sequence. Javascript, java and perl I know do support it.
I have already have covered '/^ at the beginning, for this [] match we are looking for
\w - . and +. I think that regexp is incorrect. Either the minus sign should be escaped with \ or it should be at the end of the [] (i.e [\w+.-]). But that is an aside they are basically attempting to allow anything of abcdefghijklmnopqrstuvwxyz01234567890-.+
so fred.smith-foo+wee#mymail.com will match but fred.smith%foo+wee#mymail.com wont (the % is not matched by [\w.+-]).
\# is the litteral atsil sign (it is escaped as perl expands # an array variable reference)
[a-zA-Z0-9.-]+ is the same as [\w.-]+. Very much like the user part of the match, but does not match +. So this matches foo.com. and google.co. but not my+foo.com or my***domain.co.
. means match any one character. This again is incorrect as fred#foo%com will match as . matches %*^%$£! etc. This should of been written as \.
The last character class [a-zA-z0-9]{2,4} looks for between 2 3 or 4 of the a-zA-Z0-9 specified in the character class (much like + looks for "1 more more" {2,4} means at least 2 with a maximum of 4 of the preceeding match. So 'foo' matches, '11' matches, '11111' does not match and 'information' does not.
The "tweaked" regexp should be:
/^[\w.+-]+\#[a-zA-Z0-9.-]+\.[a-zA-z0-9]{2,4}$/

I'm not doing a tutorial on RegEx's, that's been done really well already, but here are what your expressions mean.
/^<something>$/ String begins, has something in the middle, and then immediately ends.
/^foo$/.test('foo'); // true
/^foo$/.test('fool'); // false
/^foo$/.test('afoo'); // false
+ One or more of something:
/a+/.test('cot');//false
/a+/.test('cat');//true
/a+/.test('caaaaaaaaaaaat');//true
[<something>] Include any characters found between the brackets. (includes ranges like 0-9, a-z, and A-Z, as well as special codes like \w for 0-9a-zA-Z_-
/^[0-9]+/.test('f00')//false
/^[0-9]+/.test('000')//true
{x,y} between X and Y occurrences
/^[0-9]{1,2}$/.test('12');// true
/^[0-9]{1,2}$/.test('1');// true
/^[0-9]{1,2}$/.test('d');// false
/^[0-9]{1,2}$/.test('124');// false
So, that should cover everything, but for good measure:
/^[\w-.+]+\#[a-zA-Z0-9.-]+.[a-zA-z0-9]{2,4}$/
Begins with at least character from \w, -, +, or .. Followed by an #, followed by at least one in the set a-zA-Z0-9.- followed by one character of anything (. means anything, they meant \.), followed by 2-4 characters of a-zA-z0-9
As a side note, this regular expression to check emails is not only dated, but it is very, very, very incorrect.

Develop Reference

JavaScript is the programming language of the Web.

Forcing a Strict Character Order in a Regex Expression - javascript

Related

Regex creation to allow, disallow few characters

RegEx does not match my example

Regular expression match 0 or exact number of characters

JavaScript RegExp to match a (partial) hour

Please explain some Javascript Regular Expressions

Categories

Resources