It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
I have absolutely no experience with regular expressions and I need some help setting one up to match a string against. This is for phone number validation. I need to make sure that a string a user inputs has only upper case letters A-Z, numbers 0-9, open/close parentheses [()], and hyphens (-). I also don't know which string method I need to use, either match or string.
RegEx is explained poorly all over the web. I don't fault anyone for asking more general questions about it, and this is different from the other post, which is more do-it-for-me google-evasion than specific question. The characters you asked about:
[A-Z]
[0-9] or \d
\(
\)
-
/matchme/ is a regular expression literal. This is preferable to using the RegExp constructor because with the constructor you end up having to escape your escape backslashes, which gets real ugly.
You can actually use regEx literals in a lot of string methods, like replace, split, etc.
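As a rough illustration of the two points above, here is a minimal JavaScript sketch. The helper name hasOnlyAllowedChars and the sample inputs are made up for this example; the character set is the one listed in the question (A-Z, 0-9, parentheses, hyphens).

```javascript
// Literal form: each backslash is written once.
const allowedLiteral = /^[A-Z0-9()\-]+$/;

// Constructor form of the same pattern: the backslash has to be doubled
// because the pattern passes through a string literal first.
const allowedConstructed = new RegExp("^[A-Z0-9()\\-]+$");

// Hypothetical helper: true when the whole input uses only allowed characters.
function hasOnlyAllowedChars(input) {
  return allowedLiteral.test(input);
}

console.log(hasOnlyAllowedChars("(555)-GET-FOOD"));     // true
console.log(hasOnlyAllowedChars("555 1234"));           // false, contains a space
console.log(allowedConstructed.test("(555)-GET-FOOD")); // true

// Literals also work as arguments to string methods such as replace and split.
const stripped = "(555)-GET-FOOD".replace(/[()\-]/g, "");
const pieces = "555-GET-FOOD".split(/-/);
console.log(stripped); // "555GETFOOD"
console.log(pieces);   // [ '555', 'GET', 'FOOD' ]
```

The ^ and $ anchors are what turn "contains an allowed character" into "contains only allowed characters".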
Without special characters following, any non-special character is about matching one character at that position in a string. Stuff in [] is a class and can match more than one KIND of character, but only at the single position following the last position matched. You might find [.\- ] useful for identifying non-number characters in telephone numbers (the hyphen is escaped so it isn't read as a range). You can also express ranges in character classes, e.g. [a-hA-H] or [4-9]
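A quick sketch of classes and ranges in JavaScript, using made-up sample strings. The anchors are only there so each test looks at exactly one character.

```javascript
// A class matches exactly one character at one position.
console.log(/^[a-hA-H]$/.test("g")); // true
console.log(/^[a-hA-H]$/.test("z")); // false, z is outside both ranges
console.log(/^[4-9]$/.test("3"));    // false

// Escaping the hyphen keeps it from being read as a range operator.
const separators = /[.\- ]/;
const parts = "123.456-7890".split(separators);
console.log(parts); // [ '123', '456', '7890' ]
```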
But one str position at a time goes out the window when you start using the follow-up characters:
? - one or none
* - 0 or many
+ - 1 or more
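The three follow-up characters can be tried out like this (a small sketch with arbitrary sample words):

```javascript
const optionalU = /^colou?r$/; // ? : the u may appear once or not at all
const zeroOrMore = /^ab*$/;    // * : any number of b's, including none
const oneOrMore = /^ab+$/;     // + : at least one b

console.log(optionalU.test("color"), optionalU.test("colour")); // true true
console.log(optionalU.test("colouur"));                         // false
console.log(zeroOrMore.test("a"), zeroOrMore.test("abbb"));     // true true
console.log(oneOrMore.test("a"), oneOrMore.test("abbb"));       // false true
```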
Avoid the . wildcard character where you can. It is inefficient: it matches every character except line terminators, so something like .* grabs far more than you wanted and then has to backtrack. More importantly, the better-performing alternative is much more powerful and helpful. Negated character classes are much faster. [^<]* represents 0 or more positions of anything that is NOT a < character.
Very handy stuff for XML/SGML-style parsing which, in spite of what many on Stack have said, is perfectly feasible with regEx, which is no longer technically confined to "regular" languages. You have to be aware of what you're looking at with something that allows as much sloppiness as somebody else's HTML, but that's just a 'duh' in my book.
Crockford warns against negated character classes in JSLint. Crockford is painfully wrong on that count. They are not only much more efficient, they also make it much easier to think through how to tokenize stuff. If there is a security risk, you can set explicit limits on the number of characters matched with {} brackets, e.g. p{2,5}, which matches two to five p chars, or {5} for exactly 5, or {5,} for at least 5. For up to 5, write {0,5}: JavaScript treats {,5} as literal text rather than as a quantifier.
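Trying the negated class and the brace forms in JavaScript gives the following. This is only a sketch with invented sample strings; other regex flavors may treat {,5} differently.

```javascript
// Negated class: everything up to, but not including, the next <.
const leading = "plain text<br>".match(/^[^<]*/)[0];
console.log(leading); // "plain text"

// Brace quantifiers put explicit limits on repetition.
console.log(/^p{2,5}$/.test("ppp"));   // true, between two and five
console.log(/^p{5}$/.test("pppp"));    // false, exactly five required
console.log(/^p{5,}$/.test("pppppp")); // true, five or more
console.log(/^p{0,5}$/.test("ppp"));   // true, up to five

// {,5} is not a quantifier in JavaScript; it matches those literal characters.
console.log(/^p{,5}$/.test("ppp"));    // false
console.log(/^p{,5}$/.test("p{,5}"));  // true
```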
Other random stuff you should look up:
(ph|f) - ph or f - helpful for finding phish and fish (when a class won't do, basically)
^ - represents beginning of a string - think of as a condition for the next character more than a character itself. Yes, it also negates character classes.
$ - represents end of a string - same caveat as above but on the previous character.
\ - used to escape special symbols. Note: a lot of special symbols that have no meaning in character classes require no \ inside []
\s\w\d - These represent commonly used sets of characters. The first is pretty much all whitespace (js-style escapes typically have regEx equivalents) followed by w for word characters (class equivalent [a-zA-Z0-9_]) and d for digits [0-9]. Capitalize any of these for the exact opposite.
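A few of the items in this list, sketched in JavaScript with made-up inputs:

```javascript
// Alternation plus anchors: the whole string must be "phish" or "fish".
const fish = /^(ph|f)ish$/;
console.log(fish.test("phish"), fish.test("fish")); // true true
console.log(fish.test("selfish"));                  // false, ^ pins the start

// \D is the opposite of \d, so this strips everything that is not a digit.
const digitsOnly = "(123) 456-7890".replace(/\D/g, "");
console.log(digitsOnly); // "1234567890"

// \s covers spaces, tabs and newlines; \w covers letters, digits and _.
console.log(/^\s*$/.test(" \t\n"));       // true
console.log(/^\w+$/.test("snake_case1")); // true
console.log(/^\w+$/.test("kebab-case"));  // false, - is not a word character

// Inside a class, . $ and ( need no backslash.
console.log(/^[.$(]$/.test("."), /^[.$(]$/.test("x")); // true false
```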
There's more, like back-references, and lookaheads whose use-case scenarios are worth knowing but this is the commonly used stuff I actually remember from regular experience (bwaahaahaa).
I assume you're looking for non US since you have that A-Z concern and I'm sure there's plenty of US phone-numbers regExes out there but I'd probably do something like this for US numbers:
/\(?\d{3}[)\-. ]?\d{3}[\-. ]?\d{4}/
to match:
123-456-7890
(123)456-7890
123.456.7890
123 456 7890
1234567890
But also perhaps messily allows:
(123456.7890
...which I'm willing to live with for the sake of avoiding complexity. Resist the temptation to do it all with one expression. Sometimes it's much cleaner to eliminate trailing/leading whitespace, for instance, and then hit something with an expression. Split and join methods are very powerful for tokenizing.
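To check the pattern against the samples listed above, and to show the "clean up first, then match" idea, something like this could work. The isTenDigits helper is hypothetical and is just one way to split the job into steps.

```javascript
const usPhone = /\(?\d{3}[)\-. ]?\d{3}[\-. ]?\d{4}/;

const samples = [
  "123-456-7890",
  "(123)456-7890",
  "123.456.7890",
  "123 456 7890",
  "1234567890",
];
console.log(samples.every((s) => usPhone.test(s))); // true
console.log(usPhone.test("(123456.7890"));          // true, the messy case

// Hypothetical two-step alternative: trim, strip separators, then count digits.
function isTenDigits(input) {
  const digits = input.trim().split(/[().\- ]/).join("");
  return /^\d{10}$/.test(digits);
}
console.log(isTenDigits("  (123) 456-7890  ")); // true
console.log(isTenDigits("123-45-6789"));        // false, only nine digits
```

Note that the single-expression pattern is unanchored, so it also finds a phone number inside a longer string; the two-step version checks the whole input.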
If this goes like a usual regEx conversation, somebody will shortly point out something I missed in my pattern. So yeah, test 'em out on stuff. There are sites that let you set the expression and then just plug in characters to try and break them.
I'm trying to match a string starting from the last character so the match can fail as soon as possible. This way I can fail a match against a custom string cstr (see specification below) with the least number of operations (4th property).
From a theoretical perspective, the regex can be represented as a finite state machine and the arrows can be flipped, creating the reversed regex.
I'm looking for an implementation of this: a library/program to which I can give the string and the pattern. cstr is implemented in Python, so if possible a Python module. (For the curious: the i-th character is not calculated until needed.) With anything else I need to do much more work, because cstr's calculation is hard to port to another language.
The implementation doesn't have to cover all regex syntax. I'm looking for the basics. No lookaheads or fancy stuff. See specification below.
I may be lacking common knowledge. Please do comment obvious things, too.
Specification
The custom string cstr has the following properties:
String can be calculated in finite time.
String has finite length
The last character is known
Every previous character requires a costly calculation
Until the string is calculated fully, length is unknown
When the string is fully calculated, I want to match it with a simple regex which may contain the following syntax. No lookaheads or fancy stuff.
alphanumeric characters
unicode characters
., *, +, ?, \w, \W, [], |, escape char \, range specification with { , }
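To make the "fail early from the end" idea concrete, here is a minimal sketch. It is written in JavaScript rather than Python purely for illustration, and it makes strong assumptions: the pattern is a fixed sequence of single-character classes (no *, +, ? or |), and lazyReversed is a hypothetical stand-in for cstr that hands out characters from last to first while counting how many were computed. It is not a general reversed-regex engine.

```javascript
// Hypothetical stand-in for cstr: yields characters from last to first
// and counts how many had to be "calculated".
function* lazyReversed(text, stats) {
  for (let i = text.length - 1; i >= 0; i--) {
    stats.computed++;
    yield text[i];
  }
}

// Match a fixed-length pattern right to left. `classes` holds one
// single-character regex per position, in normal left-to-right order.
function matchFromEnd(classes, reversedChars) {
  const it = reversedChars[Symbol.iterator]();
  for (let i = classes.length - 1; i >= 0; i--) {
    const { value, done } = it.next();
    if (done || !classes[i].test(value)) return false; // fail as early as possible
  }
  return it.next().done; // nothing may remain in front of the match
}

const pattern = [/\w/, /\w/, /[0-9]/]; // behaves like ^\w\w[0-9]$

const failStats = { computed: 0 };
const failed = matchFromEnd(pattern, lazyReversed("a long costly string x", failStats));
console.log(failed, failStats.computed); // false 1

const okStats = { computed: 0 };
const matched = matchFromEnd(pattern, lazyReversed("ab7", okStats));
console.log(matched, okStats.computed); // true 3
```

In the failing case only the last character is ever produced, which is the property the question is after. Supporting quantifiers and alternation would need a real automaton with its transitions reversed.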
PS: This is not a homework question. I'm trying to formulate my question as clear as possible.
OP here. Here are some thoughts:
Since I'm looking for an unoptimized regex machine, I have to build it myself, which takes time.
Alternatively, we can define an upper bound for the cstr length and create all strings shorter than that bound that match the given regex. Then we put all solutions into a trie data structure and match against it. This depends on the use case, and maybe a cache can be involved.
What I'm going for is python module greenery
from greenery import parse
pattern = parse.Pattern(...)
pattern.reversed()
...
This sometimes provides a good matching experience. Sometimes not, but that is OK for me.
This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?
Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:
Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negative character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means “everything except whitespace”, and this is exactly what you want.
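The difference between the three patterns can be seen side by side in a short JavaScript sketch. One assumption to flag: the log is treated here as a single line with spaces between entries, because in JavaScript the . does not cross newlines unless the s flag is set.

```javascript
// Assumption: the entries are separated by spaces, not newlines.
const log =
  "J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM " +
  "J0000010: Project name: E:\\foo.pf " +
  "J0000011: Job name: MBiek Direct Mail Test " +
  "J0000020: Document 1 - Completed successfully";

// Greedy: runs to the end, then backs up to the LAST J-number.
const greedy = log.match(/Project name:\s+(.*)\s+J[0-9]{7}:/)[1];
// Non-greedy: stops at the FIRST J-number.
const lazy = log.match(/Project name:\s+(.*?)\s+J[0-9]{7}:/)[1];
// \S*: cannot run past the first whitespace at all.
const noSpace = log.match(/Project name:\s+(\S*)\s+J[0-9]{7}:/)[1];

console.log(greedy);  // E:\foo.pf J0000011: Job name: MBiek Direct Mail Test
console.log(lazy);    // E:\foo.pf
console.log(noSpace); // E:\foo.pf
```

The \S* version only works as long as the project path itself contains no spaces.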
Well, ".*" is a greedy selector. You make it non-greedy by using ".*?". With the latter construct, each time the regex engine is about to consume another character with the ".", it first attempts to match whatever may come after the ".*?". This means that if, for instance, nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, @"Project name: (?<name>.*?) J\d+").Groups["name"].Value;
I would also recommend you experiment with regular expressions using "Expresso" - it's a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people inexperienced with regex might not be familiar with, in a way that makes it easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm
(Project name:\s+[A-Z]:(?:\\\w+)+\.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Using (?:\\\w+)+\.[a-zA-Z]+ is more restrictive than .*
I have a long Regex (JavaScript), and it contains the following construct:
((\\\\)|(\\[abc])|([^abc]))*
The regex says:
Match any string that doesn't contain the letters a, b and c,
except if they're escaped by a backslash.
If the backslash itself is escaped (e.g. \\a), also don't match these letters.
Here's a simple match-example:
eeeaeaee\aee\\\\ae\\\\\aee
I wonder if it's possible to optimise this regular expression. This is only a small example; the actual regex I'm using is bigger, and a lot of it is duplicated.
I think a more logical (and likely faster) regexp would be something like:
(?:[^abc\\]|\\.)*
In other words, a backslash will escape anything, including another backslash.
Note a few things: first, if you don't need to capture parts of the match, use non-capturing groups. That buys you a little performance. Second, when there are multiple alternatives, put the most common one first.
You might get even better performance this way (try it):
[^abc\\]*(?:\\.[^abc\\]*)*
Rather than going through the alternation for each and every character, that will "eat" runs of non-special characters with a single step. Nested * can be bad news, leading to quadratic (or worse) runtime in cases where the regex doesn't match, but in this case that won't happen.
When writing this answer, I discovered that JS's regex engine has no possessive matchers. That sucks -- you could get better worst-case performance if they were available. (An important tip for working towards regex mastery: when performance testing a regex, always test cases where it does match AND where it doesn't match. The worst-case performance generally occurs when it doesn't.)
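To sanity-check that the alternation form and the unrolled form accept the same strings, a small JavaScript sketch like this can help. The ^ and $ anchors and the test strings are additions for the sake of the check, not part of the original patterns.

```javascript
const alternation = /^(?:[^abc\\]|\\.)*$/;
const unrolled = /^[^abc\\]*(?:\\.[^abc\\]*)*$/;

// Each pair is [input, expected]. Backslashes are doubled in JS string literals.
const cases = [
  ["xyz", true],        // no special letters at all
  ["x\\ay", true],      // x\ay   : the a is escaped
  ["xay", false],       // bare a
  ["x\\\\ay", false],   // x\\ay  : escaped backslash, then a bare a
  ["x\\\\\\ay", true],  // x\\\ay : escaped backslash, then an escaped a
];

const results = cases.map(([input, expected]) => ({
  input,
  expected,
  alternation: alternation.test(input),
  unrolled: unrolled.test(input),
}));
console.log(results);
```

This checks correctness only; for timing, it is worth measuring with long inputs that match and long inputs that fail.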
You can match any character after a backslash or any character that is not in [abc]:
(\\.|[^abc])*
That will match the exact same language.
I think it's actually clearer what your intention is if you flip it around like:
([^abc]|\\.)*
Closed 9 years ago.
I want to find a . and get all the characters after it until I find a (, using a regex. How do I make that happen?
I would also like to see some good tutorials on regexes for Javascript.
Following regex should work for you:
[.]([^(]*)[(]
Text you want to capture will be available in group # 1.
Javascript Code:
var str='here is a sentance. and some other text ( here )';
var match = str.match(/[.]([^(]*)[(]/);
console.log(match[1]); // " and some other text " (including the surrounding spaces)
Live Demo: http://www.rubular.com/r/ALqusiC9EQ
It seems like you're having trouble with the concept of escaping. The . and ( characters have special meaning in RegEx, so you need to escape them by placing a \ in front of them. For example, to match a literal dot, you might use \.
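A short JavaScript sketch of what escaping changes, with throwaway sample strings:

```javascript
// Unescaped . matches any single character.
console.log(/^a.c$/.test("abc"));  // true
// Escaped \. matches only a literal dot.
console.log(/^a\.c$/.test("abc")); // false
console.log(/^a\.c$/.test("a.c")); // true

// Unescaped parentheses form a group, so they match no ( or ) in the text.
console.log(/^f(x)$/.test("f(x)"));   // false
console.log(/^f(x)$/.test("fx"));     // true
// Escaped, they match the literal characters.
console.log(/^f\(x\)$/.test("f(x)")); // true
```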
For repetition, you can use * or + for 0+ and 1+ respectively. These are used as modifiers on preceding expressions. So, for example, A+ means "one or more A characters", whereas A* means "zero or more A characters". You can also use the ? modifier to alter the "greedy" behavior of these matches, but that's a more complicated topic.
If you need to constrain the exact number of repetitions, you can use the {n} syntax. For example, you might use A{10} to match exactly 10 A characters, or A{3,5} to match between 3 and 5 A characters.
These also work on groups and classes, e.g. [A-Z]{3} or (a*b+){3}.
As far as RegEx tutorials go, pretty much nowhere beats Regular-Expressions.info, though the MDN article on RegEx might be useful on the JavaScript side of things too.
In my web application, I created a framework that binds model data to controls on the page. Each model property has rules such as string length, not null, and a regular expression. Before the page is submitted, the framework validates every bound control against its defined rules.
So, I want to detect which characters are allowed by each regular expression rule, as in the following examples.
"^[0-9]+$" allows only digit characters like 1, 2, 3.
"^[a-zA-Z_][a-zA-Z_\-0-9]+$" allows only letters, digits, - and _ characters.
However, this function should not care about grouping or the position of the allowed characters. It should only report which characters are possible.
Do you have any idea how to create this function?
PS. I know it is easy to write a specific function, such as a numeric-only one that allows only digit characters. But I need to share/reuse the same piece of code in both the data tier (which contains all the model validators) and the UI tier without modifying anything.
Thanks
You can't solve this for the general case. Regexps don't generally ‘fail’ at a particular character, they just get to a point where they can't match any more, and have to backtrack to try another method of matching.
One could make a regex implementation that remembered which was the farthest it managed to match before backtracking, but most implementations don't do that, including JavaScript's.
A possible way forward would be to match first against ^pattern$, and if that failed match against ^pattern without the end-anchor. This would be more likely to give you some sort of match of the left hand part of the string, so you could count how many characters were in the match, and say the following character was ‘invalid’. For more complicated regexps this would be misleading, but it would certainly work for the simple cases like [a-zA-Z0-9_]+.
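The two-step approach described above might be sketched like this in JavaScript. The function name firstInvalidIndex and its return convention (-1 for a full match) are invented for the example, and it assumes the rule is passed as a pattern string without its own ^ and $ anchors.

```javascript
// Hypothetical helper: returns -1 when the whole input matches, otherwise
// the index of the first character after the longest-found matching prefix.
function firstInvalidIndex(source, input) {
  const whole = new RegExp("^(?:" + source + ")$");
  if (whole.test(input)) return -1;

  const prefix = new RegExp("^(?:" + source + ")").exec(input);
  return prefix ? prefix[0].length : 0;
}

console.log(firstInvalidIndex("[a-zA-Z0-9_]+", "abc$def")); // 3, the $ sign
console.log(firstInvalidIndex("[0-9]+", "12345"));          // -1, all valid
console.log(firstInvalidIndex("[0-9]+", "x1"));             // 0
```

As the answer says, this is only trustworthy for simple rules; with more complicated patterns the reported position can be misleading.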
I must admit that I'm struggling to parse your question.
If you are looking for a regular expression that will match only if a string consists entirely of a certain collection of characters, regardless of their order, then your examples of character classes were quite close already.
For instance, ^[A-Za-z0-9]+$ will only allow strings that consist of letters A through Z (upper and lower case) and numbers, in any order, and of any length.