I have a long Regex (JavaScript), and it contains the following construct:
((\\\\)|(\\[abc])|([^abc]))*
The regex says:
Match any string that doesn't contain the letters a, b or c,
except when they're escaped by a backslash.
If the backslash is itself escaped (e.g. \\a), don't match these letters either.
Here's a simple match-example:
eeeaeaee\aee\\\\ae\\\\\aee
I wonder if it's possible to optimise this regular expression. This is only a small example; the actual regex I'm using is bigger, and a lot of it is duplicated.
I think a more logical (and likely faster) regexp would be something like:
(?:[^abc\\]|\\.)*
In other words, a backslash will escape anything, including another backslash.
Note a few things: first, if you don't need to capture parts of the match, use non-capturing groups. That buys you a little performance. Second, when there are multiple alternatives, put the most common one first.
You might get even better performance this way (try it):
[^abc\\]*(?:\\.[^abc\\]*)*
Rather than going through the alternation for each and every character, that will "eat" runs of non-special characters with a single step. Nested * can be bad news, leading to quadratic (or worse) runtime in cases where the regex doesn't match, but in this case that won't happen.
When writing this answer, I discovered that JS's regex engine has no possessive matchers. That sucks -- you could get better worst-case performance if they were available. (An important tip for working towards regex mastery: when performance testing a regex, always test cases where it does match AND where it doesn't match. The worst-case performance generally occurs when it doesn't.)
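To compare the variants discussed above, here is a small sketch in JavaScript. The ^ and $ anchors and the helper name check are additions for illustration (they are not part of the original patterns); the anchors force each pattern to cover the whole input, so test() reports a full match only.

```javascript
// Anchored copies of the three patterns; capturing groups replaced by (?:...).
const original = /^(?:\\\\|\\[abc]|[^abc])*$/;
const simple = /^(?:[^abc\\]|\\.)*$/;
const unrolled = /^[^abc\\]*(?:\\.[^abc\\]*)*$/;

// Hypothetical helper: run one input through all three patterns.
function check(input) {
  return [original, simple, unrolled].map((re) => re.test(input));
}

console.log(check(String.raw`ee\aee`)); // [ true, true, true ]
console.log(check("eeaee"));            // [ false, false, false ]
console.log(check(String.raw`\\a`));    // [ true, false, false ]
```

The last line shows where the variants differ: in the original, [^abc] can consume a lone backslash, so an escaped backslash followed by a still passes, while the versions in which a backslash always escapes the next character reject it.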
You can match any character after a backslash or any character that is not in [abc]:
(\\.|[^abc])*
That will match the exact same language.
I think it's actually clearer what your intention is if you flip it around like:
([^abc]|\\.)*
I'm trying to match a string starting from the last character, so that the match fails as soon as possible. This way I can reject a custom string cstr (see the specification below) with the least number of operations (see its 4th property).
From a theoretical perspective, the regex can be represented as a finite state machine whose arrows can be flipped, creating the reversed regex.
I'm looking for an implementation of this: a library or program to which I can give the string and the pattern. cstr is implemented in Python, so a Python module if possible. (For the curious: the i-th character is not calculated until needed.) Anything else would mean much more work, because cstr's calculation is hard to port to another language.
The implementation doesn't have to cover all regex syntax. I'm looking for the basics. No lookaheads or fancy stuff. See specification below.
I may be lacking common knowledge. Please do comment obvious things, too.
Specification
The custom string cstr has the following properties:
String can be calculated in finite time.
String has finite length
The last character is known
Every previous character requires a costly calculation
Until the string is calculated fully, length is unknown
When the string is calculated fully, I want to match it against a simple regex which may contain the following syntax. No lookaheads or fancy stuff.
alphanumeric characters
Unicode characters
., *, +, ?, \w, \W, [], |, escape char \, range specification with { , }
PS: This is not a homework question. I'm trying to formulate my question as clearly as possible.
OP here. Here are some thoughts:
Since I'm looking for an unoptimized regex machine, I have to build it myself, which takes time.
Alternatively, we can define an upper bound for the length of cstr and generate all strings shorter than that bound that match the given regex. Then we put all solutions into a trie data structure and match against it. This depends on the use case, and maybe a cache can be involved.
What I'm going with is the Python module greenery:
from greenery import parse
pattern = parse.Pattern(...)
pattern.reversed()
...
This sometimes provides a good matching experience. Sometimes not, but that is OK for me.
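The arrow-flipping idea from the question can be illustrated with a hand-built automaton. This is a minimal sketch in JavaScript, not greenery's API or implementation: the NFA for ab*c, its state numbers and the names reverseNfa, matchFromEnd and lazy are all made up for illustration.

```javascript
// Hand-built NFA for ab*c (illustrative only).
const nfa = {
  start: [0],
  accept: [2],
  edges: [
    { from: 0, ch: "a", to: 1 },
    { from: 1, ch: "b", to: 1 },
    { from: 1, ch: "c", to: 2 },
  ],
};

// Flip every arrow and swap the start and accept sets.
function reverseNfa(m) {
  return {
    start: m.accept.slice(),
    accept: m.start.slice(),
    edges: m.edges.map((e) => ({ from: e.to, ch: e.ch, to: e.from })),
  };
}

// Feed characters from last to first. charFromEnd(i) returns the i-th
// character counted from the end, or undefined once the string is exhausted.
function matchFromEnd(m, charFromEnd) {
  let states = new Set(m.start);
  let used = 0;
  for (;;) {
    const ch = charFromEnd(used);
    if (ch === undefined) break;
    used += 1;
    const next = new Set();
    for (const e of m.edges) {
      if (e.ch === ch && states.has(e.from)) next.add(e.to);
    }
    states = next;
    // No live state left: stop without asking for earlier characters.
    if (states.size === 0) return { matched: false, charsComputed: used };
  }
  return { matched: m.accept.some((s) => states.has(s)), charsComputed: used };
}

// Stand-in for cstr: characters are looked up only when requested.
const lazy = (s) => (i) => (i < s.length ? s[s.length - 1 - i] : undefined);
const reversed = reverseNfa(nfa);

console.log(matchFromEnd(reversed, lazy("abbc"))); // { matched: true, charsComputed: 4 }
console.log(matchFromEnd(reversed, lazy("abbx"))); // { matched: false, charsComputed: 1 }
```

The second call rejects the input after looking at only the last character, which is the early-failure behaviour asked for.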
I have these two simple regex patterns to match URLs from these stores, but they lead to catastrophic backtracking and a frozen browser when run on certain edge-case URL strings. This logic runs on thousands of random requests, so the chance of catastrophic backtracking is high. Does anyone have an idea of what could be wrong in the way I wrote this regex?
> ".*://.*.newegg.com/Product/Product.*"
> ".*://.*.gamestop.com*.*Product-Variation*.*productDetailsRedesign"
You have too many greedy dot patterns in the expressions. Try to be a tad more verbose:
\w+://[^/]*\.newegg\.com/Product/Product\S*
The second pattern:
\w+://[^\s/]*\.gamestop\.com\S*?Product-Variation\S*?productDetailsRedesign
See proof #1 | proof #2.
Use \S*? to match any characters different from whitespace (as few as possible).
Escape the period characters as they are regex metacharacters.
Use [^...] negated character classes if you know there can be no such characters between two substrings in a match.
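As a quick check of the two rewritten patterns, here is a sketch in JavaScript. The URLs and the variable names are made up for illustration; only the patterns come from the answer above.

```javascript
const newegg = /\w+:\/\/[^\/]*\.newegg\.com\/Product\/Product\S*/;
const gamestop =
  /\w+:\/\/[^\s\/]*\.gamestop\.com\S*?Product-Variation\S*?productDetailsRedesign/;

// Hypothetical sample URLs.
console.log(newegg.test("https://www.newegg.com/Product/Product.aspx?Item=123")); // true
console.log(newegg.test("https://www.neweggXcom/Product/Product.aspx"));          // false
console.log(
  gamestop.test("https://www.gamestop.com/pc/Product-Variation?view=productDetailsRedesign")
); // true
console.log(gamestop.test("https://www.gamestop.com/pc/productDetailsRedesign")); // false

// A long host that never contains ".newegg.com": [^\/]* cannot run past the
// first slash, so the attempt is abandoned quickly.
const longUrl = "http://" + "a.".repeat(5000) + "/x";
console.log(newegg.test(longUrl)); // false
```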
I was using a lookbehind to check for a dot before the #, but just realized that not all browsers support lookbehinds. It works perfectly in Chrome but fails in Firefox and IE.
This is what I came up with, but it certainly is messy:
^([a-zA-Z0-9&^*%#~{}=+?`_-]\.?)*[a-zA-Z0-9&^*%#~{}=+?`_-]#([a-zA-Z0-9]+\.)+[a-zA-Z]$
Is there a simpler and/or more elegant way to do this? I don't think I can negate the dot (^.) because I'm only allowing certain characters to be present in the local part.
This ([a-zA-Z0-9&^*%#~{}=+?`_-]\.?)*[a-zA-Z0-9&^*%#~{}=+?`_-] part is not messy, but inefficient, because the * quantifies a group containing an obligatory part, [...], and an optional one, \.?. Instead of (ab?)*a, you may use a+(?:ba+)*, which makes matching linear and swift: in your case, [a-zA-Z0-9&^*%#~{}=+?`_-]+(?:\.[a-zA-Z0-9&^*%#~{}=+?`_-]+)*.
Moreover, [a-zA-Z0-9_] equals \w in JS regex; you may use this to shorten the pattern.
Besides, the last [a-zA-Z]$ pattern only matches a single letter; you most probably need [a-zA-Z]{2,}$ there, as TLDs consist of 2+ letters.
So, you may use
^[\w&^*%#~{}=+?`-]+(?:\.[\w&^*%#~{}=+?`-]+)*#(?:[a-zA-Z0-9]+\.)+[a-zA-Z]{2,}$
See the regex demo.
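A short sketch of how the final pattern behaves, in JavaScript. The sample inputs and the variable name address are made up; the # separator is kept exactly as it appears in the pattern above.

```javascript
const address =
  /^[\w&^*%#~{}=+?`-]+(?:\.[\w&^*%#~{}=+?`-]+)*#(?:[a-zA-Z0-9]+\.)+[a-zA-Z]{2,}$/;

console.log(address.test("john.doe#example.com"));  // true
console.log(address.test("john..doe#example.com")); // false: two dots in a row
console.log(address.test(".john#example.com"));     // false: leading dot
console.log(address.test("john#example.c"));        // false: last label too short
```

Each dot in the first part must be followed by at least one allowed character, which is what rules out a dot directly before the separator without needing a lookbehind.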
I have absolutely no experience with regular expressions and I need some help setting one up to match a string against. This is for phone number validation. I need to make sure that a string a user inputs has only upper case letters A-Z, numbers 0-9, open/close parentheses [()], and hyphens (-). I also don't know which string method I need to use, either match or string.
RegEx is explained poorly all over the web. I don't fault anyone for asking more general questions about it, and this is different from the other post, which is more do-it-for-me Google-evasion than a specific question. The characters you asked about:
[A-Z]
[0-9] or \d
\(
\)
-
/matchme/ is a regular expression literal. This is preferable to using the RegExp constructor, because with the constructor you end up having to escape your escape backslashes, which gets real ugly.
You can actually use regEx literals in a lot of string methods, like replace, split, etc.
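Putting the listed characters into one class gives a whole-string check for the question's requirement. This is a minimal sketch; the variable names and sample inputs are made up, and it also shows the backslash doubling that the RegExp constructor needs.

```javascript
// Literal form: one backslash per escape.
const literal = /^[A-Z0-9()\-]+$/;
// Constructor form: the same pattern as a string needs every backslash doubled.
const constructed = new RegExp("^[A-Z0-9()\\-]+$");

console.log(literal.test("(555)-ABC-1234")); // true
console.log(literal.test("555 1234"));       // false: the space is not allowed
console.log(literal.test("555-abcd"));       // false: lower case is not allowed
console.log(literal.source === constructed.source); // true

// Regex literals also work in other string methods, e.g. replace.
console.log("555-12-34".replace(/-/g, "")); // "5551234"
```

test() is called on the regex and returns a boolean, which is usually all that validation needs; match() is called on the string and returns the match details.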
Without special characters following it, any non-special character matches exactly one character at that position in a string. Stuff in [] is a class and can match more than one KIND of character, but only the character at the position following the last position matched. You might find [-. ] useful for identifying non-number characters in telephone numbers (keep the hyphen first or escape it, otherwise it is read as a range). You can also express ranges in character classes, e.g. [a-hA-H] or [4-9].
But one str position at a time goes out the window when you start using the follow-up characters:
? - one or none
* - 0 or many
+ - 1 or more
Avoid the . wildcard character. It is inefficient. For some reason that I suspect goes down all the way to implementation in assembly for efficiency's sake, it checks against every single possibility rather than the 1-2 teletype whitespace characters it actually doesn't represent, and for which there is no honest use on a computer. More importantly, the better-performing alternative is much more powerful and helpful. Negated character classes are much faster. [^<]* represents 0 or more positions of anything that is NOT a < character.
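To see what a negated class buys you in terms of what gets matched, here is a small sketch in JavaScript. The sample markup and variable names are made up, and it only compares the matched text, not speed.

```javascript
const html = "<b>bold</b> and <b>more</b>";

// Greedy dot runs to the LAST closing tag.
const withDot = /<b>(.*)<\/b>/.exec(html)[1];
// The negated class cannot cross a "<", so it stops at the first closing tag.
const withNegatedClass = /<b>([^<]*)<\/b>/.exec(html)[1];

console.log(withDot);          // "bold</b> and <b>more"
console.log(withNegatedClass); // "bold"
```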
Very handy stuff for XML/SGML-style parsing which, in spite of what many on Stack have said, is perfectly feasible with regEx, which is no longer technically confined to "regular" languages. You have to be aware of what you're looking at with something that allows as much sloppiness as somebody else's HTML, but that's just a 'duh' in my book.
Crockford warns against negating character classes in JSLint. Crockford is painfully wrong on that count. They are not only much more efficient, they also make it much easier to think through how to tokenize stuff. If there is a security risk, you can set explicit limits on the number of characters matched with {} brackets, e.g. p{2,5}, which matches two to five p chars, or {5} for exactly 5, {0,5} for up to 5, or {5,} for at least 5. (In JavaScript, {,5} is not a quantifier; without the u flag it matches those characters literally.)
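The brace forms are easy to check directly. A quick sketch in JavaScript, with anchors added so that each test covers the whole string; note that in JavaScript (without the u flag) {,5} is not parsed as a quantifier and matches the literal characters instead.

```javascript
console.log(/^p{2,5}$/.test("ppp"));    // true: between two and five
console.log(/^p{2,5}$/.test("pppppp")); // false: six is too many
console.log(/^p{5}$/.test("ppppp"));    // true: exactly five
console.log(/^p{5,}$/.test("pppppp"));  // true: at least five
console.log(/^p{0,5}$/.test("ppp"));    // true: up to five
console.log(/^p{,5}$/.test("ppp"));     // false: {,5} is not a quantifier here
console.log(/^p{,5}$/.test("p{,5}"));   // true: it matches the literal text
```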
Other random stuff you should look up:
(ph|f) - ph or f - helpful for finding phish and fish (when a class won't do, basically)
^ - represents beginning of a string - think of as a condition for the next character more than a character itself. Yes, it also negates character classes.
$ - represents end of a string - same caveat as above but on the previous character.
\ - used to escape special symbols. Note: a lot of special symbols that have no meaning in character classes require no \ inside []
\s\w\d - These represent commonly used sets of characters. The first is pretty much all whitespace (js-style escapes typically have regEx equivalents) followed by w for word characters (class equivalent [a-zA-Z0-9_]) and d for digits [0-9]. Capitalize any of these for the exact opposite.
There's more, like back-references, and lookaheads whose use-case scenarios are worth knowing but this is the commonly used stuff I actually remember from regular experience (bwaahaahaa).
I assume you're looking for non US since you have that A-Z concern and I'm sure there's plenty of US phone-numbers regExes out there but I'd probably do something like this for US numbers:
/\(?\d{3}[)\-. ]?\d{3}[\-. ]?\d{4}/
to match:
123-456-7890
(123)456-7890
123.456.7890
123 456 7890
1234567890
But also perhaps messily allows:
(123456.7890
...which I'm willing to live with for the sake of avoiding complexity. Resist the temptation to do it all with one expression. Sometimes it's much cleaner to eliminate trailing/leading whitespace, for instance, and then hit something with an expression. Split and join methods are very powerful for tokenizing.
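The pattern above can be tried against the listed numbers like this. A minimal sketch in JavaScript; the names usPhone, samples and strict are made up, and the anchored variant is an addition to illustrate trimming first and then matching the whole string.

```javascript
const usPhone = /\(?\d{3}[)\-. ]?\d{3}[\-. ]?\d{4}/;

const samples = [
  "123-456-7890",
  "(123)456-7890",
  "123.456.7890",
  "123 456 7890",
  "1234567890",
  "(123456.7890", // the messy one that also gets through
];
console.log(samples.every((s) => usPhone.test(s))); // true

// Unanchored, the pattern finds a number anywhere in the text.
console.log(usPhone.test("call 123-456-7890 now")); // true

// Anchored copy for whole-string validation; trim the input first.
const strict = new RegExp("^" + usPhone.source + "$");
console.log(strict.test("  123-456-7890  "));        // false
console.log(strict.test("  123-456-7890  ".trim())); // true
```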
If this goes like a usual regEx conversation, somebody will shortly point out something I missed in my pattern. So yeah, test 'em out on stuff. There are sites that let you set the expression and then just plug in characters to try and break them.
In my web application, I created a framework that binds model data to controls on the page. Each model property has some rules, like string length, not null and a regular expression. Before the page is submitted, the framework validates every bound control against the defined rules.
So, I want to detect which characters are allowed by each regular expression rule, as in the following examples.
"^[0-9]+$" allows only digit characters like 1, 2, 3.
"^[a-zA-Z_][a-zA-Z_\-0-9]+$" allows only letters, digits, - and _ characters.
However, this function should not care about grouping or the position of the allowed characters. It should only report which characters are possible.
Do you have any idea how to create this function?
PS. I know it is easy to create a specific function, such as a numeric-only one that allows only digit characters. But I need to share/reuse the same piece of code in both the data tier (which contains all the model validators) and the UI tier without modifying anything.
Thanks
You can't solve this for the general case. Regexps don't generally ‘fail’ at a particular character, they just get to a point where they can't match any more, and have to backtrack to try another method of matching.
One could make a regex implementation that remembered which was the farthest it managed to match before backtracking, but most implementations don't do that, including JavaScript's.
A possible way forward would be to match first against ^pattern$, and if that failed match against ^pattern without the end-anchor. This would be more likely to give you some sort of match of the left hand part of the string, so you could count how many characters were in the match, and say the following character was ‘invalid’. For more complicated regexps this would be misleading, but it would certainly work for the simple cases like [a-zA-Z0-9_]+.
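The two-step approach described above could be sketched like this in JavaScript. The function name firstInvalidIndex and its return convention (-1 for a fully valid input) are assumptions made for illustration, and as noted, the result is only meaningful for simple patterns.

```javascript
// source is the pattern text WITHOUT ^ and $ anchors.
function firstInvalidIndex(source, input) {
  // Step 1: does the whole input match?
  if (new RegExp("^(?:" + source + ")$").test(input)) return -1;
  // Step 2: how far does a match get without the end anchor?
  const prefix = new RegExp("^(?:" + source + ")").exec(input);
  return prefix ? prefix[0].length : 0;
}

console.log(firstInvalidIndex("[a-zA-Z0-9_]+", "abc_123")); // -1
console.log(firstInvalidIndex("[a-zA-Z0-9_]+", "abc-123")); // 3, the "-"
console.log(firstInvalidIndex("[0-9]+", "x12"));            // 0, the "x"
```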
I must admit that I'm struggling to parse your question.
If you are looking for a regular expression that will match only if a string consists entirely of a certain collection of characters, regardless of their order, then your examples of character classes were quite close already.
For instance, ^[A-Za-z0-9]+$ will only allow strings that consist of letters A through Z (upper and lower case) and numbers, in any order, and of any length.