Is this regex the most efficient way of parsing my string? - javascript

First off, here are the parameters to follow in the string I allow the user to input:
If there is a slash, it has to appear at the start of the string, nowhere else, is limited to 1, is optional and must be succeeded by [a-zA-Z].
If there is a tilde, it has to appear after a space " ", nothing else, is optional and must be succeeded by [a-zA-Z]. Also, this expression is limited to 2. (ie: ~exa ~mple is passed but ~exa ~mp ~le is not passed)
The slash followed by a word is an instruction, like /get or /post.
The tilde followed by a word is a parameter like ~now or ~later.
String format:
[instruction] (optional) [query] [extra parameters] (optional)
[instruction] - Must contain / succeeded with [a-zA-Z] only
[query] - Can contain [\w\s()'-] (alphanumeric, whitespace, parentheses, apostrophe, dash)
[extra parameters] - ~ preceded by whitespace, succeeded with only [a-zA-Z]
String examples that should work:
/get D0cUm3nt ex4Mpl3' ~now
D0cUm3nt ex4Mpl3'
/post T(h)(i5 s(h)ou__ld w0rk t0-0'
String examples that shouldn't work:
//get document~now
~later
example ~now~later
Before I pass the string through the regex I trim any whitespace at the start and end of the string (before any text is seen) but I don't trim double whitespaces within the string as some queries require them.
Here is the regex I used:
^(/{0,1}[a-zA-Z])?[\w\s()'-]*((\s~[a-zA-Z]*){0,2})?$
To break it down slightly:
[instruction check] - (/{0,1}[a-zA-Z])?
[query check] - [\w\s()'-]*
[parameter check] - ((\s~[a-zA-Z]*){0,2})?
This is the first time I've actually done any serious regex away from a tutorial so I'm wondering is there anything I can change within my regex to make it more compact/efficient?
All fresh perspectives are appreciated!
Thanks.

From your regex: ^(/{0,1}[a-zA-Z])?[\w\s()'-]*((\s~[a-zA-Z]*){0,2})?$,
you can change {0,1} to ?, which is shorthand for "0 or 1 times":
^(/?[a-zA-Z])?[\w\s()'-]*((\s~[a-zA-Z]*){0,2})?$
The last part is already allowed 0, 1 or 2 times, so the trailing ? is superfluous:
^(/?[a-zA-Z])?[\w\s()'-]*(\s~[a-zA-Z]*){0,2}$
The first part may be simplified too. The ? just after the / is superfluous, because a leading letter without a slash is already matched by the query class [\w\s()'-]:
^(/[a-zA-Z])?[\w\s()'-]*(\s~[a-zA-Z]*){0,2}$
If you don't use the captured groups, you can change them to non-capturing groups, (?: ), which are slightly more efficient:
^(?:/[a-zA-Z])?[\w\s()'-]*(?:\s~[a-zA-Z]*){0,2}$
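You can also use the case-insensitive flag i (JavaScript has no inline (?i) modifier, so the flag goes after the closing delimiter of the regex literal, where / must also be escaped):
/^(?:\/[a-z])?[\w\s()'-]*(?:\s~[a-z]*){0,2}$/i
Finally, as stated in the question, ~ must be followed by [a-zA-Z], so change the last * to +:
/^(?:\/[a-z])?[\w\s()'-]*(?:\s~[a-z]+){0,2}$/i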
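As a rough check, the final pattern can be run against the examples from the question. This is only a sketch: the variable names are made up for illustration, and it assumes the pattern is written as a JavaScript literal with the i flag and an escaped slash.

```javascript
// Final pattern from the answer above, written as a JavaScript literal.
// Assumption: the i flag stands in for the explicit A-Z ranges.
const commandPattern = /^(?:\/[a-z])?[\w\s()'-]*(?:\s~[a-z]+){0,2}$/i;

// Examples taken from the question.
const shouldPass = [
  "/get D0cUm3nt ex4Mpl3' ~now",
  "D0cUm3nt ex4Mpl3'",
  "/post T(h)(i5 s(h)ou__ld w0rk t0-0'",
];
const shouldFail = ["//get document~now", "~later", "example ~now~later"];

// Print each input next to the result of the test.
for (const input of shouldPass.concat(shouldFail)) {
  console.log(JSON.stringify(input), commandPattern.test(input));
}
```

The first three inputs should print true and the last three false.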

This looks slightly better:
^(?:/?[a-zA-Z]*\s)?[\w\s()'-]*(?:\s~[a-zA-Z]*)*$
https://codereview.stackexchange.com/ is more the place for this kind of thing

Assuming that capture groups are useful to you:
^((?:\/|\s~)[a-z]+)?([\w\s()'-]+)(~[a-z]+)?$
Regex101 Demo

Maybe this is what you're looking for:
var regex = /^((\/)?[a-zA-Z]+)?[\w\s()'-]*((\s~)?[a-zA-Z]+){0,2}$/;

Related

Replace a phrase in a string that is being broken up into 2 separate lines [duplicate]

Is there a simple way to ignore the white space in a target string when searching for matches using a regular expression pattern? For example, if my search is for "cats", I would want "c ats" or "ca ts" to match. I can't strip out the whitespace beforehand because I need to find the begin and end index of the match (including any whitespace) in order to highlight that match and any whitespace needs to be there for formatting purposes.
You can stick optional whitespace characters \s* in between every other character in your regex. Although granted, it will get a bit lengthy.
/cats/ -> /c\s*a\s*t\s*s/
While the accepted answer is technically correct, a more practical approach, if possible, is to just strip whitespace out of both the regular expression and the search string.
If you want to search for "my cats", instead of:
myString.match(/m\s*y\s*c\s*a\s*t\s*s\s*/g)
Just do:
myString.replace(/\s*/g,"").match(/mycats/g)
Warning: You can't automate this on the regular expression by just replacing all spaces with empty strings because they may occur in a negation or otherwise make your regular expression invalid.
Addressing Steven's comment to Sam Dufel's answer
Thanks, sounds like that's the way to go. But I just realized that I only want the optional whitespace characters if they follow a newline. So for example, "c\n ats" or "ca\n ts" should match. But wouldn't want "c ats" to match if there is no newline. Any ideas on how that might be done?
This should do the trick:
/c(?:\n\s*)?a(?:\n\s*)?t(?:\n\s*)?s/
See this page for all the different variations of 'cats' that this matches.
You can also solve this using conditionals, but they are not supported in the javascript flavor of regex.
You could put \s* in between every character in your search string, so if you were looking for cats you would use c\s*a\s*t\s*s
It's long but you could build the string dynamically of course.
You can see it working here: http://www.rubular.com/r/zzWwvppSpE
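Building the pattern dynamically might look like the sketch below. The helper name buildSpacedRegex is hypothetical, and the sketch assumes the search term is a literal string, so regex metacharacters in it are escaped and any whitespace in it is dropped first.

```javascript
// Hypothetical helper: builds a regex that matches `word` with optional
// whitespace between its characters.
function buildSpacedRegex(word, flags = "g") {
  // Drop whitespace from the term, then escape each remaining character
  // so that metacharacters such as "." or "(" are matched literally.
  const escaped = Array.from(word.replace(/\s+/g, ""), (ch) =>
    ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
  );
  return new RegExp(escaped.join("\\s*"), flags);
}

const text = "two c ats and a ca ts";
for (const match of text.matchAll(buildSpacedRegex("cats"))) {
  // match.index is the start offset in the original text, and
  // match[0] includes the whitespace, so both can be used for highlighting.
  console.log(match.index, JSON.stringify(match[0]));
}
```

Because the whitespace stays inside the match, the start and end offsets refer to the original string directly.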
If you only want to allow spaces, then
\bc *a *t *s\b
should do it. To also allow tabs, use
\bc[ \t]*a[ \t]*t[ \t]*s\b
Remove the \b anchors if you also want to find cats within words like bobcats or catsup.
This approach can be automated (the following example is in Python, although it can be ported to any language): strip the whitespace beforehand and save the positions of the non-whitespace characters, so that you can later map the match boundaries back onto the original string:
import re

def regex_search_ignore_space(regex, string):
    no_spaces = ''
    char_positions = []
    for pos, char in enumerate(string):
        if re.match(r'\S', char):  # upper-case \S matches non-whitespace chars
            no_spaces += char
            char_positions.append(pos)
    match = re.search(regex, no_spaces)
    if not match:
        return match
    # match.start() and match.end() are indices into the spaceless
    # string (as we have searched in it); match.end() is exclusive.
    start = char_positions[match.start()]  # in the original string
    # Map the last matched character, then add 1 for an exclusive end.
    end = char_positions[match.end() - 1] + 1  # in the original string
    # The match WITH spaces is returned.
    return string[start:end]

with_spaces = 'a li on and a cat'
print(regex_search_ignore_space('lion', with_spaces))
# prints 'li on'
If you want to go further you can construct the match object and return it instead, so the use of this helper will be more handy.
And the performance of this function can of course also be optimized, this example is just to show the path to a solution.
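Since the original question is about highlighting matches, the same idea might be ported to JavaScript roughly as follows. The function name and the shape of the returned object are assumptions made for this sketch, and empty matches are simply treated as no match.

```javascript
// Hypothetical port: search a whitespace-free copy of the string, then map
// the match boundaries back to indices in the original string.
function searchIgnoringWhitespace(regex, original) {
  let compact = "";
  const positions = []; // positions[i] = index in `original` of compact[i]
  for (let i = 0; i < original.length; i++) {
    if (!/\s/.test(original[i])) {
      compact += original[i];
      positions.push(i);
    }
  }
  const match = regex.exec(compact);
  if (match === null || match[0].length === 0) return null;
  const start = positions[match.index];
  // The last matched character sits at match.index + length - 1;
  // add 1 to turn its original position into an exclusive end index.
  const end = positions[match.index + match[0].length - 1] + 1;
  return { start, end, text: original.slice(start, end) };
}

console.log(searchIgnoringWhitespace(/lion/, "a li on and a cat"));
```

Here start and end are offsets into the original string, so they can be used for highlighting without any further adjustment.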
The accepted answer will not work if and when you are passing a dynamic value (such as "current value" in an array loop) as the regex test value. You would not be able to input the optional white spaces without getting some really ugly regex.
Konrad Hoffner's solution is therefore better in such cases, as it strips both the regex and the test string of whitespace. The test is then conducted as though neither contains any whitespace.

Regex: how to exclude empty match from something like (RegexA)?(RegexB)?(RegexA)? [duplicate]

I have regex which works fine in my application, but it matches an empty string too, i.e. no error occurs when the input is empty. How do I modify this regex so that it will not match an empty string ? Note that I DON'T want to change any other functionality of this regex.
This is the regex which I'm using: ^([0-9\(\)\/\+ \-]*)$
I don't know a lot about regex formulation myself, which is why I'm asking. I have searched for an answer, but couldn't find a direct one. Closest I got to was this: regular expression for anything but an empty string in c#, but that doesn't really work for me ..
Replace "*" with "+", as "*" means "0 or more occurrences", while "+" means "at least one occurrence"
There are a lot of pattern types that can match empty strings. The OP regex belongs to an ^.*$ type, and it is easy to modify it to prevent empty string matching by replacing * (= {0,}) quantifier (meaning zero or more) with the + (= {1,}) quantifier (meaning one or more), as has already been mentioned in the posts here.
There are other pattern types matching empty strings, and it is not always obvious how to prevent them from matching empty strings.
Here are a few of those patterns with solutions:
[^"\\]*(?:\\.[^"\\]*)* ⇒ (?:[^"\\]|\\.)+
abc||def ⇒ abc|def (remove the extra | alternation operator)
^a*$ ⇒ ^a+$ (+ matches 1 or more chars)
^(a)?(b)?(c)?$ ⇒ ^(?!$)(a)?(b)?(c)?$ (the (?!$) negative lookahead fails the match if the end of the string is at the start of the string)
or ⇒ ^(?=.)(a)?(b)?(c)?$ (the (?=.) positive lookahead requires at least a single char; . may or may not match line break chars depending on modifiers/regex flavor)
^$|^abc$ ⇒ ^abc$ (remove the ^$ alternative that enables a regex to match an empty string)
^(?:abc|def)?$ ⇒ ^(?:abc|def)$ (remove the ? quantifier that made the (?:abc|def) group optional)
To keep \b(?:north|south)?(?:east|west)?\b (which matches north, south, east, west, northeast, northwest, southeast, southwest) from matching an empty string, the word boundaries must be made more precise: make the initial word boundary match only at the start of a word by adding (?<!\w) after it, and make the trailing word boundary match only at the end of a word by adding (?!\w) after it.
\b(?:north|south)?(?:east|west)?\b ⇒ \b(?<!\w)(?:north|south)?(?:east|west)?\b(?!\w)
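The lookahead variant for all-optional groups can be checked with a short JavaScript sketch (the variable names are illustrative only):

```javascript
// Every group is optional, so the loose pattern accepts the empty string.
const loose = /^(a)?(b)?(c)?$/;
// The (?!$) lookahead rejects the case where the string ends at position 0.
const strict = /^(?!$)(a)?(b)?(c)?$/;

console.log(loose.test(""), strict.test(""));     // true false
console.log(loose.test("ac"), strict.test("ac")); // true true
```

Non-empty inputs behave the same under both patterns; only the empty string is treated differently.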
You can either use + or the {min, max} Syntax:
^[0-9\(\)\/\+ \-]{1,}$
or
^[0-9\(\)\/\+ \-]+$
By the way: this is a great source for learning regular expressions (and it's fun): http://regexone.com/
Obviously you need to replace * with +, as + matches 1 or more characters. However, inside a character class you don't need to do all that escaping. Your regex can be simplified to:
^([0-9()\/+ -]+)$

Javascript RegEx match 1-1-1 and 1-1-1-1-1 but not -1-1-1-1 or 1-1-1-1-

I haven't found anything using Google or Stack Overflow.
I need to match 1-1-1 but not -1-1-1 or 1-1-1- with a JavaScript RegEx.
So it has to start with a number, end with a number, and be separated with "-".
I can't figure out, how to do it.
Is it even possible?
Older JavaScript engines have no look-behind (it was added in ES2018; see javascript regex - look behind alternative?), so to exclude a preceding - without it, the regex has to match the preceding character too (as long as it's not a -).
Since there might not be a preceding character (input starts with 1), you have to also match on beginning of input (^).
So, this regex will do it: (?:[^-]|^)(1(?:-1)+)(?!-)
See regex101.com.
Whether it should match a standalone 1, or only on 1-1 (and longer), is up to you. The regex above will not match standalone 1. Change + to * to change that.
I also added capturing of the actual text you wanted to match, i.e. without the leading character. You can remove the extra () around 1(?:-1)+ if that's not needed.
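A quick way to try the regex against the inputs from the title is sketched below. The variable name is illustrative; group 1 holds the text without the leading character.

```javascript
// Regex from the answer above; group 1 is the part that should be used.
const dashPattern = /(?:[^-]|^)(1(?:-1)+)(?!-)/;

for (const input of ["1-1-1", "1-1-1-1-1", "-1-1-1-1", "1-1-1-1-"]) {
  const m = dashPattern.exec(input);
  // Print the captured text, or null when the input is rejected.
  console.log(JSON.stringify(input), m ? m[1] : null);
}
```

The first two inputs yield themselves in group 1, while the last two give null.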

Need help to find the right regex pattern to match

My RegEx is not working the way I think it should.
[^a-zA-Z](\d+-)?OSM\d*(?![a-zA-Z])
I will use this regex in a javascript, to check if a string match with it.
Should match:
12345612-OSM34
12-OSM34
OSM56
7-OSM
OSM
Should not match:
-OSM
a-OSM
rOSMann
rOSMa
asdrOSMa
rOSM89
01-OSMann
OSMond
23OSM
45OSM678
One line, represents a string in my javascript.
https://www.regex101.com/r/xQ0zG1/3
The rules for matching:
match OSM if it stands alone
optional match if line starts with digit/s AND is followed by a -
optional match if line ends with digit/s
match all 3 above combined
no match if line starts with a character/word except OSM
no match if line ends with a character/word except OSM
I Hope someone can help.
You can use the following simplified pattern using anchors:
^(?:\d+-)?OSM\d*$
The flags needed (if matching multi-line paragraph) would be: g for global match and m for multi-line match, so that ^ and $ match the begin/end of each line.
EDIT
Changed the (\d+-) match to (?:\d+-) so that it doesn't capture.
[^a-zA-Z](\d+-)?OSM\d*(?![a-zA-Z])
[^a-zA-Z] In regex, you specify what you want, not what you don't want. This piece of code says there must be one character that isn't a letter. I believe what you wanted to say is to match the start of a line. You don't need to specify that there's no letter, you're about to specify what there will be on the line anyway. The start of a regex is represented with ^ (outside of brackets). You'll have to use the m flag to make the regex multi-line.
(\d+-)? means one or more digits followed by a - character. The ? means this whole block isn't required. In JavaScript, \d matches only the ASCII digits 0-9, so writing [0-9] instead makes no difference here. You got this part right. However, if you don't need capturing groups, you could write (?:) instead of ().
\d*(?![a-zA-Z]) uses a lookahead, but you rarely need one here. Again, specifying what you don't want is a bad idea, because I could write OSMé and it would match, since you didn't specify that é is forbidden. It's much simpler to specify what is allowed. In your case, you want to match the end of the line, so you can write \d*$ instead, which means zero or more digits followed by the end of the line.
/^(?:\d+-)?OSM\d*$/gm is the final result.
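The anchored pattern can be checked against the lists from the question with a sketch like this. It assumes each line is tested as a separate string, as the question describes, so the g and m flags are left out.

```javascript
// Anchored pattern from the answers above, without flags because every
// input is tested as its own string.
const osmPattern = /^(?:\d+-)?OSM\d*$/;

// Lists taken from the question.
const shouldMatch = ["12345612-OSM34", "12-OSM34", "OSM56", "7-OSM", "OSM"];
const shouldNotMatch = [
  "-OSM", "a-OSM", "rOSMann", "rOSMa", "asdrOSMa",
  "rOSM89", "01-OSMann", "OSMond", "23OSM", "45OSM678",
];

for (const input of shouldMatch.concat(shouldNotMatch)) {
  console.log(input, osmPattern.test(input));
}
```

Leaving out the g flag also avoids the stateful lastIndex behaviour of test() when one regex object is reused across strings.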

Javascript regex: how to not capture an optional string on the right side

For example /(www\.)?(.+)(\.com)?/.exec("www.something.com") will result in 'something.com' at index 2 of the resulting array (index 1 holds 'www.'). But what if we want to capture only 'something' in a capturing group?
Clarifications:
The above string is just for example - we dont want to assume anything about the suffix string (.com above). It could as well be orange.
Just this part can be solved in C# by matching from right to left (I dont know of a way of doing that in JS though) but that will end up having www. included then!
Sure, this problem as such is easily solvable mixing regex with other string methods like replace / substring. But is there a solution with only regex?
(?:www\.)?(.+?)(?:\.com|$)
This will give only something in the groups. Just make the other groups non-capturing. See demo.
https://regex101.com/r/rO0yD8/4
Just removing the last character (?) from the regex does the trick:
https://regex101.com/r/uR0iD2/1
The last ? allows a valid output without the (\.com) matching anything, so the (.+) can match all the characters after the www..
Another option is to replace the greedy quantifier +, which always tries to match as many characters as possible, with +?, which tries to match as few characters as possible:
(www\.)?(.+?)(\.com)?$
https://regex101.com/r/oY7fE0/2
Note that it is necessary to force a match with the entire string through the end of line anchor ($).
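If you only want to capture "something", use non-capturing groups for the other sections. The middle group must also be lazy and the pattern anchored, otherwise the greedy (.+) still swallows .com:
/^(?:www\.)?(.+?)(?:\.com)?$/.exec("www.something.com")
The ?: denotes the groups as non-capturing.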
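The effect of greediness on the captured group can be seen in a small sketch. The variable names are illustrative, and the lazy variant assumes the whole string should be consumed, hence the anchors.

```javascript
const input = "www.something.com";

// Greedy: (.+) runs to the end of the string and the optional suffix
// group then matches nothing.
const greedy = /(?:www\.)?(.+)(?:\.com)?/.exec(input);

// Lazy and anchored: (.+?) grows only as far as needed for the optional
// suffix plus $ to match.
const lazy = /^(?:www\.)?(.+?)(?:\.com)?$/.exec(input);

console.log(greedy[1]); // something.com
console.log(lazy[1]);   // something
```

Without the $ anchor, the lazy group would stop after a single character, because the optional suffix is allowed to match nothing.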
