I'd like to create a regular expression that is able to fetch every loop inside a brainfuck code.
Let's say this code is given:
++++[>+[>,++.]<<-]++[>,.<-]
I want to fetch these three loops (actually it would be sufficient just to fetch the first one):
[>+[>,++.]<<-]
[>,++.]
[>,.<-]
My knowledge of regular expressions is pretty weak, so I can't do much more than basics. What I have thought of is this expression:
\[[-+><.,\[\]]*]
\[ - Match the first (opening) bracket
[-+><.,\[\]]* - followed by a number of brainfuck operators
] - followed by a closing bracket
This however matches (obviously) everything between the first opening, and the last closing bracket:
[>+[>,++.]<<-]++[>,.<-]
It might need something to test for the same number of opening and closing brackets inside the loop, before matching the last closing bracket - If that makes any sense.
Maybe a lookaround (I need to use this in javascript, so I can only use lookaheads) is the right way to do this, but I can't figure out how it's supposed to be done.
I had written this one once when I needed to match a pair of square brackets (while handling nesting correctly)
It is a .NET regex that uses some features that aren't available in all regex engines. Here goes:
\[(?>\[(?<d>)|\](?<-d>)|.?)*(?(d)(?!))\]
Regular expressions cannot match infinitely recursing things. Look at the Chomsky hierarchy of languages.
You can write a regular expression matching finitely recursing things by expanding them. For example, this POSIX ERE (tested with egrep) will match brainfuck loops up to nesting depth 3:
(\[[^][]*\]|\[([^][]|\[[^][]*\])*\]|\[([^][]|\[([^][]|\[[^][]*\])*\])*\])
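Since the question asks for JavaScript, here is a minimal sketch of the same expansion idea, built programmatically instead of by hand. The helper name buildLoopPattern is made up for this example, and the brackets are escaped inside the character class because JavaScript does not treat [^][] the way POSIX does.

```javascript
// Build a regex source that matches [...] with at most `depth` levels of nesting.
// `buildLoopPattern` is a hypothetical helper name, not part of any library.
function buildLoopPattern(depth) {
  // Depth 1: a bracket pair with no brackets inside.
  let pattern = "\\[[^\\[\\]]*\\]";
  for (let i = 1; i < depth; i++) {
    // Each extra level allows non-bracket characters or a complete shallower loop.
    pattern = "\\[(?:[^\\[\\]]|" + pattern + ")*\\]";
  }
  return pattern;
}

const code = "++++[>+[>,++.]<<-]++[>,.<-]";
const loopRegex = new RegExp(buildLoopPattern(3), "g");
console.log(code.match(loopRegex)); // [ '[>+[>,++.]<<-]', '[>,.<-]' ]
```

This only reports the outermost loops, and anything nested deeper than the chosen depth is still out of reach, which is exactly the limitation described above.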
Use non-greedy (lazy) matching:
\[[-+><.,\[\]]*?\]
Notice the ?. However, this matches the shortest string between a [ and the next ], so nested loops are cut short. One of the results would be:
[>+[>,++.]
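Because the lazy version stops at the first closing bracket, one alternative is to skip the regex and pair the brackets with a stack. This is only a sketch of that swapped-in technique, not something proposed in the answers above; findLoops is a made-up name, and unmatched brackets are simply ignored.

```javascript
// Collect every loop, nested ones included, by pairing brackets with a stack.
// `findLoops` is a hypothetical helper name used only for this sketch.
function findLoops(source) {
  const openPositions = []; // indices of '[' still waiting for their ']'
  const loops = [];
  for (let i = 0; i < source.length; i++) {
    if (source[i] === "[") {
      openPositions.push(i);
    } else if (source[i] === "]" && openPositions.length > 0) {
      const start = openPositions.pop();
      loops.push({ start, text: source.slice(start, i + 1) });
    }
    // A ']' with no open '[' is skipped; a '[' that never closes yields no loop.
  }
  // Report loops in the order their opening brackets appear.
  return loops.sort((a, b) => a.start - b.start).map((loop) => loop.text);
}

console.log(findLoops("++++[>+[>,++.]<<-]++[>,.<-]"));
// [ '[>+[>,++.]<<-]', '[>,++.]', '[>,.<-]' ]
```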
Related
Experimenting with simple regexes I found some weird behavior.
A single pair of brackets [] is treated either as an incomplete character class (PCRE and Python), which throws an error, or as an empty character class (JS), which is not an error but doesn't match anything.
Going further, JS treats [][] as expected, as two empty classes, but in PCRE and Python the innermost brackets ][ are interpreted as literals, even though they are not escaped.
Further experiments showed that three expressions are equivalent in practice:
[][]
[\]\[]
[\[\]]
The second and the third one make sense to me, but why does the first one work? Can someone please explain to me how exactly [][] construction is parsed?
Chalk it up to excessive cleverness on the part of the JavaScript designers. They decided [] means nothing (a null construct, no effect on the match), and [^] means not nothing--in other words, anything including newlines. Most other flavors have a singleline/DOTALL mode that allows . to match newlines, but JavaScript doesn't. Instead it offers [^] as a sort of super-dot.
That didn't catch on, which is just as well. As you've observed, it's thoroughly incompatible with other flavors. Everyone else took the attitude that a closing bracket right after an opening bracket should be treated as a literal character. And, since character classes can't be nested (traditionally), the opening bracket never has special meaning inside one. Thus, [][] is simply a compact way to match a square bracket.
Taking it further, if you want to match any character except ], [ or ^, in most flavors you can write it exactly like that: [^][^]. The closing bracket immediately after the negating ^ is treated as a literal, the opening bracket isn't special, and the second ^ is also treated as a literal. But in JavaScript, [^][^] is two separate atoms, each matching any character (including newlines). To get the same meaning as the other flavors, you have to escape the first closing bracket: [^\][^].
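The JavaScript side of this can be checked with a few test calls. This is just a small sketch, and the constant names are made up for the example.

```javascript
// [] is an empty class in JavaScript: it can never match a character.
const emptyClass = /[]/;
// [^] is the negated empty class: it matches any character, newlines included.
const anyChar = /^[^]$/;
// [^][^] is therefore two "any character" atoms in a row.
const twoAtoms = /^[^][^]$/;
// Escaping the first ] gives one class: any character except ], [ or ^.
const notBracketOrCaret = /^[^\][^]$/;

console.log(emptyClass.test("abc"));        // false
console.log(anyChar.test("\n"));            // true
console.log(twoAtoms.test("]["));           // true
console.log(notBracketOrCaret.test("]"));   // false
console.log(notBracketOrCaret.test("a"));   // true
```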
The pond gets even muddier when Java jumps in. It introduced a set intersection feature, so you can use, for example, [a-z&&[^aeiou]] to match consonants (the set of characters in the range a to z, intersected with the set of all characters that are not a, e, i, o or u). However, the [ doesn't have to be right after && to have special meaning; [[a-z]&&[^aeiou]] is the same as the previous regex.
That means, in Java you always have to escape an opening bracket with a backslash inside a character class, but you can still escape a closing bracket by placing it first. So the most compact way to match a square bracket in Java is []\[]. I find that confusing and ugly, so I often escape both brackets, at least in Java and JavaScript.
.NET has a similar feature called set subtraction that's much simpler and uses a tighter syntax: [a-z--[aeiou]]. The only place a nested class can appear is after --, and the whole construct must be at the end of the enclosing character class. You can still match a square bracket using [][] in .NET.
My string is the following:
str = "(2+2)^(4*(5+6^(5^6))))";
As you can see, a power can be nested inside another power, with or without parentheses.
I want to convert this string with a regexp, replacing each ^ with JavaScript's Math.pow(a,b).
Any ideas? Thank you very much in advance.
I think that using a regex to parse expressions will not turn out well for you...
Why not use a math expression parser library like http://mathjs.org/
These are the steps your algorithm would have to perform:
Find the "root ^" char
Capture the groups before and after ^
Repeat
The problem here is that this type of data structure is both recursive and non regular...
It's recursive since you can have an infinite number of nested parentheses, and each needs to be evaluated separately
It's non regular since, for instance, you can have groups that don't have parentheses: (2+2)^2
...which makes finding the said "root ^" problematic
Also, the input might not always be valid (for instance, user forgets to close a parenthesis).
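If pulling in a library is not an option, the steps above can be sketched as a small hand-written recursive-descent parser. Everything here is an assumption made for illustration: the name convertPowers, the tiny grammar (numbers, identifiers, + - * / ^ and parentheses, no unary minus), and the choice to treat ^ as right-associative. Note that the string in the question has one closing parenthesis too many, so this sketch would reject it; the example call uses a balanced variant.

```javascript
// Rewrite "a^b" as "Math.pow(a, b)" with a small recursive-descent parser.
// `convertPowers` is a hypothetical helper; it handles numbers, identifiers,
// + - * / ^ and parentheses only (no unary minus, no function calls).
function convertPowers(input) {
  // Characters that are not part of any token (such as spaces) are skipped.
  const tokens = input.match(/\d+(?:\.\d+)?|[A-Za-z_]\w*|[-+*/^()]/g) || [];
  let pos = 0;

  function parseExpression() {            // term (('+' | '-') term)*
    let out = parseTerm();
    while (tokens[pos] === "+" || tokens[pos] === "-") {
      const op = tokens[pos++];
      out += op + parseTerm();
    }
    return out;
  }

  function parseTerm() {                  // power (('*' | '/') power)*
    let out = parsePower();
    while (tokens[pos] === "*" || tokens[pos] === "/") {
      const op = tokens[pos++];
      out += op + parsePower();
    }
    return out;
  }

  function parsePower() {                 // atom ('^' power)?  right-associative
    const base = parseAtom();
    if (tokens[pos] === "^") {
      pos++;
      return "Math.pow(" + base + ", " + parsePower() + ")";
    }
    return base;
  }

  function parseAtom() {                  // number | identifier | '(' expression ')'
    const token = tokens[pos++];
    if (token === "(") {
      const inner = parseExpression();
      if (tokens[pos++] !== ")") throw new SyntaxError("missing closing parenthesis");
      return "(" + inner + ")";
    }
    if (token === undefined || /^[-+*/^)]$/.test(token)) {
      throw new SyntaxError("unexpected token: " + token);
    }
    return token;
  }

  const result = parseExpression();
  if (pos < tokens.length) throw new SyntaxError("unexpected token: " + tokens[pos]);
  return result;
}

console.log(convertPowers("(2+2)^2"));                // Math.pow((2+2), 2)
console.log(convertPowers("(2+2)^(4*(5+6^(5^6)))"));  // nested powers become nested Math.pow calls
```

The output keeps the original parentheses, so it contains some redundant ones, but it is still a valid JavaScript expression.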
I am trying to make a regex that will match a number of x's that is a power of two. I am using JavaScript. I tried this one:
^(x\1?)$
but it doesn't work. Shouldn't the \1 refer to the outer parentheses, so it should match xx, and therefore also xxxx, etc.?
I tried a simpler one that I thought would match x and xx:
^((x)|(\2{2}))$
but this only matches x.
What am I doing wrong?
You can't do "recursive backreferences". At least, it is not so easy.
I'm not sure that you need recursive regular expressions here. Maybe you could just count the number of characters in the string and check whether it is a power of two?
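A minimal sketch of that counting approach, assuming the string should contain nothing but x characters (the function name isPowerOfTwoXs is made up for this example):

```javascript
// True when `text` consists only of "x" and its length is a power of two.
// `isPowerOfTwoXs` is a hypothetical name used for this sketch.
function isPowerOfTwoXs(text) {
  if (!/^x+$/.test(text)) return false;   // only x's, at least one
  const n = text.length;
  // A power of two has exactly one bit set, so n & (n - 1) clears it to 0.
  return (n & (n - 1)) === 0;
}

console.log(["x", "xx", "xxx", "xxxx"].map(isPowerOfTwoXs)); // [ true, true, false, true ]
```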
But if you really need recursive regular expressions (I'm almost sure, you don't),
you can check this question:
Recursive matching with regular expressions in Javascript
and this blog
http://blog.stevenlevithan.com/archives/javascript-match-nested
The Problem
I could match this string
(xx)
using this regex
\([^()]*\)
But it wouldn't match
(x(xx)x)
So, this regex would
\([^()]*\([^()]*\)[^()]*\)
However, this would fail to match
(x(x(xx)x)x)
But again, this new regex would
\([^()]*\([^()]*\([^()]*\)[^()]*\)[^()]*\)
This is where you can see the replication: everything in the second regex after the first \( and before the last \) is copied and replaces the centermost [^()]*. Of course, this last regex wouldn't match
(x(x(x(xx)x)x)x)
But you could always replace the centermost [^()]* with [^()]*\([^()]*\)[^()]*, as we did for the last regex, and it will capture more (xx) groups. The more you add to the regex, the more it can handle, but it will always be limited by how much you add.
So, how do you get around this limitation and capture a group of parentheses (or any two characters, for that matter) that can contain extra groups within it?
Falsely Assumed Solutions
I know you might think to just use
\(.*\)
But this will match all of
(xx)xx)
when it should only match the sub-string (xx).
Even this
\([^)]*\)
will not match pairs of parentheses that have pairs nested like
(xx(xx)xx)
From this, it'll only match up to (xx(xx).
Is it possible?
So is it possible to write a regex that can match groups of parentheses? Or is this something that must be handled by a routine?
Edit
The solution must work in the JavaScript implementation of Regular Expressions
If you want to match only when the round brackets are balanced, you cannot do it with a regex alone.
A better way would be to:
1. Match the string using \(.*\)
2. Count the number of ( and ) characters and check whether they are equal. If they are, you have the match.
3. If they are not equal, use \([^()]*\) to match the required string.
Formally speaking, this isn't possible using regular expressions! Regular expressions define regular languages, and regular languages can't have balanced parentheses.
However, it turns out that this is the sort of thing people need to do all the time, so many regex engines have been extended beyond formal regular expressions. In engines with such extensions, you can match balanced brackets with a regex. This article might help get you started: http://weblogs.asp.net/whaggard/archive/2005/02/20/377025.aspx . It covers .NET balancing groups; note that the standard JavaScript regex engine has no equivalent feature, so the technique does not carry over directly.
Personally though, I think it's best to solve a complex problem like this with your own function rather than leveraging the extended features of a Regex engine.
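As a rough sketch of what such a function could look like in JavaScript, the version below keeps a depth counter and returns the first balanced group it finds. The name firstBalancedGroup and the decision to return null for unclosed input are assumptions made for this example.

```javascript
// Return the first balanced (...) group in `text`, or null if none closes.
// `firstBalancedGroup` is a hypothetical helper name.
function firstBalancedGroup(text) {
  const start = text.indexOf("(");
  if (start === -1) return null;          // no opening parenthesis at all
  let depth = 0;
  for (let i = start; i < text.length; i++) {
    if (text[i] === "(") depth++;
    else if (text[i] === ")") depth--;
    if (depth === 0) return text.slice(start, i + 1); // outermost pair closed
  }
  return null; // ran out of input with parentheses still open
}

console.log(firstBalancedGroup("(xx)xx)"));          // (xx)
console.log(firstBalancedGroup("(x(x(xx)x)x)"));     // (x(x(xx)x)x)
```

Unlike the expanded regexes in the question, the counter has no fixed nesting limit.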
I'm writing a brush for Alex Gorbatchev's Syntax Highlighter to get highlighting for Smalltalk code. Now, consider the following Smalltalk code:
aCollection do: [ :each | each shout ]
I want to find the block argument ":each" and then match "each" every time it occurs afterwards (for simplicity, let's say every occurrence and not just inside the brackets).
Note that the argument can have any name, e.g. ":myArg".
My attempt to match ":each":
\:([\d\w]+)
This seems to work. The problem is for me to match the occurrences of "each". I thought something like this could work:
\:([\d\w]+)|\1
But the right hand side of the alternation seems to be treated as an independent expression, so backreferencing doesn't work.
Is it even possible to accomplish what I want in a single expression? Or would I have to use the backreference within a second expression (via another function call)?
You could do it in languages that support variable-length lookbehind (AFAIK only the .NET framework languages do, Perl 6 might). There you could highlight a word if it matches (?<=:(\w+)\b.*)\1. But JavaScript doesn't support lookbehind at all.
But anyway this regex would be very inefficient (I just checked a simple example in RegexBuddy, and the regex engine needs over 60 steps for nearly every character in the document to decide between match and non-match), so this is not a good idea if you want to use it for code highlighting.
I'd recommend you use the two-step approach you mentioned: First match :(\w+)\b (word boundary inserted for safety, \d is implied in \w), then do a literal search for match result \1.
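In plain JavaScript, the two-step approach might look roughly like this. It is only a sketch: findArgumentUses is a made-up name and not part of SyntaxHighlighter, and the second step uses a RegExp built from the captured name with word boundaries added, rather than a bare substring search, so that "each" inside a longer word is not reported.

```javascript
// Step 1: find the block argument; step 2: search for its name afterwards.
// `findArgumentUses` is a hypothetical helper, not part of SyntaxHighlighter.
function findArgumentUses(code) {
  const declaration = /:(\w+)\b/.exec(code);
  if (!declaration) return [];
  const name = declaration[1];
  // \w+ yields only word characters, so the name needs no regex escaping.
  const uses = new RegExp("\\b" + name + "\\b", "g");
  uses.lastIndex = declaration.index + declaration[0].length; // skip the declaration itself
  const positions = [];
  let match;
  while ((match = uses.exec(code)) !== null) {
    positions.push(match.index);
  }
  return positions; // start index of every later occurrence
}

console.log(findArgumentUses("aCollection do: [ :each | each shout ]")); // [ 26 ]
```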
I believe the only thing stored by the Regex engine between matches is the position of the last match. Therefore, when looking for the next match, you cannot use a backreference to the match before.
So, no, I do not think that this is possible.