What Regular Expression Can I Use To Find Simple Regular Expressions - javascript

If I have a string, which is the source of a regular expression:
"For example, I have (.*) string with (\.d+) special bits (but this is just an aside)."
Is there a way to extract the special parts of the regular expression?
In particular, I'm interested in the parts that will give back values when I call string.match(expr);

Regex can be complicated, but if you run a global regex with ([\.\\]([*a-z])\+?), it will capture your individual fields without including the parentheses, per your request. Demo code is below.
var testString = 'For example, I have (.*) string with (.d+) special bits (but this is just an aside). (\\w+)';
var regex = /([\.\\]([*a-z])\+?)/gi;
var matches_array = testString.match(regex);
//Outputs the following: [".*", ".d+", "\w+"]

Regular expressions are not powerful enough to recognize the language of matching parentheses. (The formal proof uses the equivalence of regular expressions and finite state machines and the fact that there are infinitely many levels of nesting possible.) Thus, matching the first ) after each ( would make (\d+(\.d+)?) return (\d+(\.d+) and matching the last ) after each ( would make (\w+) (\w+) match the entire string.
The correct way to do this is with recursion (which mathematical regular expressions do not allow, but actual implementations such as PCRE do). You can also get a simple expression for non-nested parentheses. Just be careful to parse escape characters: to be fully robust, note that \( and \\\( both end in an escaped (literal) parenthesis, whereas \\( is an escaped backslash followed by a real group-opening parenthesis.
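For the non-nested case, a minimal sketch might look like the following. The name findSimpleGroups is hypothetical, and the scanner deliberately skips (?...) constructs and does not understand character classes such as [(], so treat it as an illustration of the escape handling rather than a full parser.

```javascript
// Hypothetical helper: list non-nested capturing groups in a regex source string.
function findSimpleGroups(source) {
  // Alternative 1 consumes any escaped character, so "\(" and "\\" are skipped as pairs.
  // Alternative 2 captures "(" ... ")" when no unescaped parenthesis appears inside
  // and the group does not start with "(?" (non-capturing groups, look-arounds).
  const scanner = /\\.|(\((?!\?)(?:\\.|[^()\\])*\))/g;
  const groups = [];
  let m;
  while ((m = scanner.exec(source)) !== null) {
    if (m[1] !== undefined) groups.push(m[1]);
  }
  return groups;
}

// String.raw keeps the backslashes exactly as typed.
const source = String.raw`I have (.*) string with (\d+) bits \(literal\) and \\(\w+)`;
console.log(findSimpleGroups(source)); // [ '(.*)', '(\\d+)', '(\\w+)' ]
```

Because escaped pairs are consumed first, \( never opens a group, while the parenthesis after \\ does.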

Related

If-else condition in Regex [duplicate]

This is what I have so far...
var regex_string = "s(at)?u(?(1)r|n)day";
console.log("Before: " + regex_string);
// Rewrite (?(n)a|b) as ((?!\n)a|\nb)
regex_string = regex_string.replace(/\(\?\((\d)\)(.+?\|)(.+?)\)/g, '((?!\\$1)$2\\$1$3)');
console.log("After: " + regex_string);
var rex = new RegExp(regex_string);
var arr = "thursday tuesday thuesday tursday saturday sunday surday satunday monday".split(" ");
for (var i = 0; i < arr.length; i++) {
    var m = arr[i].match(rex);
    if (m) {
        console.log(m[0]);
    }
}
I am swapping (?(n)a|b) for ((?!\n)a|\nb) where n is a number, and a and b are strings. This seems to work fine - however, I am aware that it is a big fat hack.
Is there a better way to approach this problem?
In the specific case of your regex, it is much simpler and more readable to use alternation:
(?:sunday|saturday)
Or you can create the alternation only between the 2 positions where the conditional regex is involved (this is more useful when there are many such conditional expressions, each referring only to a nearby capturing group). Using your case as an example, we will only create the alternation for un and atur, since only those are involved in the condition:
s(?:un|atur)day
There are 2 common types of conditional regex. (Perl regular expressions support more exotic variants, but those require features that JavaScript and other common regex engines don't have.)
The first type is where an explicit pattern is provided as the condition. This type can be mimicked in JavaScript regex. In languages that support conditional regex, the pattern will be:
(?(conditional-pattern)yes-pattern|no-pattern)
In JavaScript, you can mimic it with look-ahead, with the (obvious) assumption that the original conditional-pattern is a look-ahead:
((?=conditional-pattern)yes-pattern|(?!conditional-pattern)no-pattern)
The negative look-ahead is necessary to prevent cases where the input passes the conditional-pattern and fails the yes-pattern, yet still matches the no-pattern. It is safe to add, because a positive look-ahead and a negative look-ahead on the same pattern are exact logical opposites.
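A small sketch of why the negative look-ahead matters. The conditional being emulated here, (?(?=\d)\d+x|\w+), is an invented example rather than one from the question.

```javascript
// Emulating (?(?=\d)\d+x|\w+): "if the next character is a digit, require
// digits followed by x; otherwise accept any run of word characters".
const withNegation = /^(?:(?=\d)\d+x|(?!\d)\w+)$/;
const withoutNegation = /^(?:(?=\d)\d+x|\w+)$/;

for (const input of ['12x', 'abc', '12y']) {
  console.log(input, withNegation.test(input), withoutNegation.test(input));
}
// 12x true true
// abc true true
// 12y false true
```

The input 12y passes the condition but fails the yes-pattern. Without the negative look-ahead it falls through to the no-pattern and is wrongly accepted.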
The second type is where a reference to a capturing group (by name or number) is provided, and the condition evaluates to true when that capturing group has participated in the match. In this case, there is no simple solution.
The only way I can think of is duplication, as I have done with your case as an example. This of course reduces maintainability. It is possible to compose your regex by writing it in parts (as RegExp literals), retrieving each part's string with the source property, then concatenating them; this allows changes to propagate to the duplicated parts, but makes the regex harder to understand and harder to modify in major ways.
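As a rough sketch of composing by parts, the shared fragments can be written once and stitched together through their source property. The variable names head, tail, branches and weekend are made up for this illustration.

```javascript
// Shared fragments, written once as RegExp literals.
const head = /s/;
const tail = /day/;

// Each branch spells out one outcome of the original condition:
// "at" was captured (so "ur" follows) or it was not (so "un" follows).
const branches = ['atur', 'un'].map(
  (middle) => head.source + middle + tail.source
);
const weekend = new RegExp('^(?:' + branches.join('|') + ')$');

const words = 'thursday tuesday thuesday tursday saturday sunday surday satunday monday'.split(' ');
console.log(weekend.source);                        // ^(?:saturday|sunday)$
console.log(words.filter((w) => weekend.test(w)));  // [ 'saturday', 'sunday' ]
```

A change to head or tail now reaches both branches, at the cost of the pattern no longer being readable in one place.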
References
Alternation Constructs in Regular Expression - .NET - Microsoft
re package in Python: Ctrl+F for (?(
perlre - Perl regular expression: Ctrl+F for (?(

Regex not working as expected in JavaScript

I wrote the following regex:
(https?:\/\/)?([da-z\.-]+)\.([a-z]{2,6})(\/(\w|-)*)*\/?
Its behaviour can be seen here: http://gskinner.com/RegExr/?34b8m
I wrote the following JavaScript code:
var urlexp = new RegExp(
'^(https?:\/\/)?([da-z\.-]+)\.([a-z]{2,6})(\/(\w|-)*)*\/?$', 'gi'
);
document.write(urlexp.test("blaaa"))
And it returns true even though the regex was supposed to not allow single words as valid.
What am I doing wrong?
Your problem is that JavaScript is viewing all your escape sequences as escapes for the string. So your regex goes to memory looking like this:
^(https?://)?([da-z.-]+).([a-z]{2,6})(/(w|-)*)*/?$
You may notice that this causes a problem in the middle, where what you thought was a literal period turns into a regular-expression wildcard. You can solve this in a couple of ways. Using the forward slash regular expression syntax JavaScript provides:
var urlexp = /^(https?:\/\/)?([da-z\.-]+)\.([a-z]{2,6})(\/(\w|-)*)*\/?$/gi
Or by escaping your backslashes (and not your forward slashes, as you had been doing - that's exclusively for when you're using /regex/mod notation, just like you don't have to escape your single quotes in a double quoted string and vice versa):
var urlexp = new RegExp('^(https?://)?([da-z.-]+)\\.([a-z]{2,6})(/(\\w|-)*)*/?$', 'gi')
Please note the double backslash before the w - also necessary for matching word characters.
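The effect is easy to see by inspecting the source property of the resulting RegExp. This sketch uses a shortened pattern rather than the full URL regex; String.raw is shown as one further way to keep the backslashes intact.

```javascript
// In an ordinary string literal, "\w" and "\." lose their backslashes.
const broken = new RegExp('^\w+\.$');
// Doubling each backslash preserves it for the regex parser.
const doubled = new RegExp('^\\w+\\.$');
// String.raw leaves backslashes in a template literal untouched.
const raw = new RegExp(String.raw`^\w+\.$`);

console.log(broken.source);                 // ^w+.$
console.log(doubled.source);                // ^\w+\.$
console.log(raw.source === doubled.source); // true
console.log(broken.test('wwX'), doubled.test('wwX')); // true false
```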
A couple notes on your regular expression itself:
[da-z.-]
d is contained in the a-z range. Unless you meant \d? In that case, the backslash is important.
(/(\w|-)*)*/?
My own misgivings about the nested Kleene stars aside, you can whittle that alternation down into a character class, and drop the terminating /? entirely, as a trailing slash will be matched by the group as you've given it. I'd rewrite as:
(/[\w-]*)*
Though, maybe you'd just like to catch non space characters?
(/[^/\s]*)*
Anyway, modified this way your regular expression winds up looking more like:
^(https?://)?([\da-z.-]+)\.([a-z]{2,6})(/[\w-]*)*$
Remember, if you're going to use string notation: Double EVERY backslash. If you're going to use native /regex/mod notation (which I highly recommend), escape your forward slashes.
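A quick check of the modified expression in literal notation, with the forward slashes escaped. The sample inputs other than blaaa are invented, and the g flag is left out here because test() on a global regex keeps state in lastIndex between calls.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z.-]+)\.([a-z]{2,6})(\/[\w-]*)*$/i;

for (const candidate of ['blaaa', 'example.com', 'http://example.com/path/']) {
  console.log(candidate, urlPattern.test(candidate));
}
// blaaa false
// example.com true
// http://example.com/path/ true
```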

Solving regular expression recursive strings

The Problem
I could match this string
(xx)
using this regex
\([^()]*\)
But it wouldn't match
(x(xx)x)
So, this regex would
\([^()]*\([^()]*\)[^()]*\)
However, this would fail to match
(x(x(xx)x)x)
But again, this new regex would
\([^()]*\([^()]*\([^()]*\)[^()]*\)[^()]*\)
This is where you can notice the replication: everything in the second regex after the first \( and before the last \) is copied and replaces the centermost [^()]*. Of course, this last regex wouldn't match
(x(x(x(xx)x)x)x)
But you could always replace the centermost [^()]* with [^()]*\([^()]*\)[^()]* again, as we did for the last regex, and it will capture more nested (xx) groups. The more you add to the regex, the more it can handle, but it will always be limited by how much you add.
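The replication can be automated, which makes the limitation easy to see. This is only a sketch; nestedParensPattern is a hypothetical helper that builds the pattern for exactly the given number of nesting levels.

```javascript
// Build the hand-written pattern for a fixed nesting depth by repeatedly
// replacing the centermost [^()]* with [^()]*\( ... \)[^()]*.
function nestedParensPattern(depth) {
  let inner = '[^()]*';
  for (let level = 1; level < depth; level++) {
    inner = '[^()]*\\(' + inner + '\\)[^()]*';
  }
  return new RegExp('\\(' + inner + '\\)');
}

const three = nestedParensPattern(3);
console.log(three.exec('(x(x(xx)x)x)')[0]);     // (x(x(xx)x)x)
console.log(three.exec('(x(x(x(xx)x)x)x)')[0]); // (x(x(xx)x)x), the outermost pair is lost
```

Whatever depth is chosen, an input nested one level deeper is only partially matched.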
So, how do you get around this limitation and capture a group of parentheses (or any two delimiter characters, for that matter) that can contain extra groups within it?
Falsely Assumed Solutions
I know you might think to just use
\(.*\)
But this will match all of
(xx)xx)
when it should only match the sub-string (xx).
Even this
\([^)]*\)
will not match pairs of parentheses that have pairs nested like
(xx(xx)xx)
From this, it'll only match up to (xx(xx).
Is it possible?
So is it possible to write a regex that can match groups of parentheses? Or is this something that must be handled by a routine?
Edit
The solution must work in the JavaScript implementation of Regular Expressions
If you want to match only when the round brackets are balanced, you cannot do it with a regex alone.
A better way would be to:
1. Match the string using \(.*\)
2. Count the ( and ) characters and check whether the counts are equal. If they are, you have your match.
3. If they are not equal, use \([^()]*\) to match the required string.
Formally speaking, this isn't possible using regular expressions! Regular expressions define regular languages, and regular languages can't describe balanced parentheses.
However, it turns out that this is the sort of thing people need to do all the time, so lots of regex engines have been extended beyond formal regular expressions (.NET has balancing groups, PCRE has recursion). This article covers the .NET approach: http://weblogs.asp.net/whaggard/archive/2005/02/20/377025.aspx . Note that the standard JavaScript regex engine has neither balancing groups nor recursion, so the technique does not carry over to JavaScript as-is.
Personally though, I think it's best to solve a complex problem like this with your own function rather than leveraging the extended features of a Regex engine.
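A hand-written routine of that kind can be quite short. The sketch below is one possible shape, not a definitive implementation: findBalanced is a hypothetical name, it reports only top-level balanced groups, and an opening parenthesis that is never closed hides anything nested inside it.

```javascript
// Collect every top-level balanced group by tracking nesting depth.
function findBalanced(text, open = '(', close = ')') {
  const results = [];
  let depth = 0;
  let start = -1;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === open) {
      if (depth === 0) start = i; // remember where the outermost group begins
      depth++;
    } else if (ch === close && depth > 0) {
      depth--;
      if (depth === 0) results.push(text.slice(start, i + 1));
    }
    // A stray closing character at depth 0 is ignored.
  }
  return results;
}

console.log(findBalanced('(x(x(xx)x)x)'));     // [ '(x(x(xx)x)x)' ]
console.log(findBalanced('(xx)xx)'));          // [ '(xx)' ]
console.log(findBalanced('a (b) c (d(e)) f')); // [ '(b)', '(d(e))' ]
```

The open and close parameters cover the "any two characters" part of the question, for example findBalanced(text, '[', ']').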

Issue with custom javascript regex

I have a custom regular expression which I use to detect whole numbers, fractions and floats.
var regEx = new RegExp("^((^[1-9]|(0\.)|(\.))([0-9]+)?((\s|\.)[0-9]+(/[0-9])?)?)$");
var quantity = 'd';
var matched = quantity.match(regEx);
alert(matched);
(The code is also found here: http://jsfiddle.net/aNb3L/ .)
The problem is that it matches a single letter, and I can't figure out why. For more letters it fails (which is good).
Disclaimer: I am new to regular expressions, although in http://gskinner.com/RegExr/ it doesn't match a single letter
It's easier to use straight regular expression syntax:
var regEx = /^((^[1-9]|(0\.)|(\.))([0-9]+)?((\s|\.)[0-9]+(\/[0-9])?)?)$/;
When you use the RegExp constructor, you have to double-up on the backslashes. As it is, your code only has single backslashes, so the \. subexpressions are being treated as . — and that's how single non-digit characters are slipping through.
Thus yours would also work this way:
var regEx = new RegExp("^((^[1-9]|(0\\.)|(\\.))([0-9]+)?((\\s|\\.)[0-9]+(/[0-9])?)?)$");
This happens because the string syntax also uses backslash as a quoting mechanism. When your regular expression is first parsed as a string constant, those backslashes are stripped out if you don't double them. When the string is then passed to the regular expression parser, they're gone.
The only time you really need to use the RegExp constructor is when you're building up the regular expression dynamically or when it's delivered to your code via JSON or something.
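When a pattern really is built at run time, any literal text that goes into it has to be escaped first. The following is a sketch under that assumption; escapeForRegExp and exactMatcher are hypothetical helper names, not built-in functions.

```javascript
// Prefix every regex metacharacter in the text with a backslash.
function escapeForRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build a pattern that matches the given text literally and nothing else.
function exactMatcher(userText) {
  return new RegExp('^' + escapeForRegExp(userText) + '$');
}

const matcher = exactMatcher('1.5');
console.log(matcher.source);      // ^1\.5$
console.log(matcher.test('1.5')); // true
console.log(matcher.test('1x5')); // false
```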
Well, for a whole number this would be your regex:
/^(0|[1-9]\d*)$/
Then you have to account for the possibility of a float:
/^(0|[1-9]\d*)(\.\d+)?$/
Then you have to account for the possibility of a fraction:
/^(0|[1-9]\d*)((\.\d+)|(\/[1-9]\d*))?$/
To me this regex is much easier to read than your original, but it's up to you of course.
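A short check of the combined expression, assuming the dot is meant literally (written as \.) and the optional float-or-fraction group is closed before the ?. The sample inputs are invented for illustration.

```javascript
// Whole number, optionally followed by a decimal part or a /denominator.
const quantityPattern = /^(0|[1-9]\d*)((\.\d+)|(\/[1-9]\d*))?$/;

for (const input of ['12', '3.14', '1/2', 'd', '01', '3x14']) {
  console.log(input, quantityPattern.test(input));
}
// 12 true
// 3.14 true
// 1/2 true
// d false
// 01 false
// 3x14 false
```

The last input is the one an unescaped dot would let through.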

Regular expression match only if subpattern doesn't match

I'm trying to match C-style comments from a file, but only if the comment doesn't start with a certain label introduced by #.
For example from
/* some comment to match */
/* another comment.
this should match also */
/*#special shouldn't match*/
Is this possible using regular expressions only?
I'm trying this using JavaScript implementation of regular expressions.
/\*\s*(?!#)(?:(?!\*/).)*\*/
Breaks down as:
/\* // "/*"
\s* // optional space
(?!#) // not followed by "#"
(?: // don't capture...
(?!\*/). // ...anything that is not "*/"
)* // but match it as often as possible
\*/ // "*/"
Use in "global" and "dotall" mode (i.e. the dot should match newlines as well).
The usual word of warning: As with all parsing jobs that are executed with regular expressions, this will fail on nested patterns and broken input.
emk points out a nice example of (otherwise valid) input that will cause this expression to break. This can't be helped, regex is not for parsing. If you are positive that things like this can never occur in your input, a regex might still work for you.
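Written as a JavaScript literal, the forward slashes need escaping, and [\s\S] can stand in for a dot in dotall mode. The sketch below also includes an assumed variation, named stricter here, that moves the optional whitespace inside the look-ahead: in the pattern as written, \s* can backtrack to zero characters, so a label preceded by whitespace still gets through. The last sample line is an invented input to show that.

```javascript
const sample = [
  '/* some comment to match */',
  '/* another comment.',
  'this should match also */',
  "/*#special shouldn't match*/",
  '/* #spaced label */',
].join('\n');

// The pattern from the answer, as a JavaScript literal.
const asWritten = /\/\*\s*(?!#)(?:(?!\*\/)[\s\S])*\*\//g;
// Assumed variation: the look-ahead covers the optional whitespace too.
const stricter = /\/\*(?!\s*#)(?:(?!\*\/)[\s\S])*\*\//g;

console.log(sample.match(asWritten).length); // 3 (the spaced label slips through)
console.log(sample.match(stricter).length);  // 2
```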
You could start with something like this:
/\*[^#]
But in general, you don't want to match C-style comments with regular expressions, because of nasty corner cases. Consider:
"foo\" /* " " */ "
There's no comment in that code (it's a compile-time concatenation of two string literals), but you're not going to have much luck parsing it without a real parser. (Technically, you could use a regular expression, because you only need a simple finite state machine. But it's a very disgusting regular expression.)
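One way to approximate that finite state machine is to let a single expression consume string literals and comments alike, and keep only the comments. This is a sketch with known gaps: findComments is a hypothetical name, and character literals and // line comments are not handled.

```javascript
// Scan for string literals and block comments in one pass; only the
// comment alternative is captured, so strings are consumed and discarded.
function findComments(code) {
  const token = /"(?:\\.|[^"\\])*"|(\/\*[\s\S]*?\*\/)/g;
  const comments = [];
  let m;
  while ((m = token.exec(code)) !== null) {
    if (m[1] !== undefined) comments.push(m[1]);
  }
  return comments;
}

console.log(findComments('"foo\\" /* " " */ "'));                      // []
console.log(findComments('int x; /* real */ char *s = "/* fake */";')); // [ '/* real */' ]
```

Because a string literal is matched as a whole, a /* inside it never gets the chance to start a comment.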
Use a negative lookahead.
