How is [][] parsed in regex? - javascript

Experimenting with simple regexes I found some weird behavior.
A single pair of brackets, [], is treated either as an incomplete character class (PCRE and Python), which throws an error, or as an empty character class (JS), which is not an error but doesn't match anything.
Going further, JS treats [][] as expected, as two empty classes, but in PCRE and Python the innermost brackets ][ are interpreted as literals, even though they are not escaped.
Further experiments showed that three expressions are equivalent in practice:
[][]
[\]\[]
[\[\]]
The second and the third one make sense to me, but why does the first one work? Can someone please explain to me how exactly [][] construction is parsed?

Chalk it up to excessive cleverness on the part of the JavaScript designers. They decided [] means nothing (an empty set, so it can never match), and [^] means not nothing, in other words, anything including newlines. Most other flavors have a singleline/DOTALL mode that allows . to match newlines, but JavaScript didn't until the s flag was added in ES2018. Instead it offered [^] as a sort of super-dot.
That didn't catch on, which is just as well. As you've observed, it's thoroughly incompatible with other flavors. Everyone else took the attitude that a closing bracket right after an opening bracket should be treated as a literal character. And, since character classes can't be nested (traditionally), the opening bracket never has special meaning inside one. Thus, [][] is simply a compact way to match a square bracket.
Taking it further, if you want to match any character except ], [ or ^, in most flavors you can write it exactly like that: [^][^]. The closing bracket immediately after the negating ^ is treated as a literal, the opening bracket isn't special, and the second ^ is also treated as a literal. But in JavaScript, [^][^] is two separate atoms, each matching any character (including newlines). To get the same meaning as the other flavors, you have to escape the first closing bracket: [^\][^].
The pond gets even muddier when Java jumps in. It introduced a set intersection feature, so you can use, for example, [a-z&&[^aeiou]] to match consonants (the set of characters in the range a to z, intersected with the set of all characters that are not a, e, i, o or u). However, the [ doesn't have to be right after && to have special meaning; [[a-z]&&[^aeiou]] is the same as the previous regex.
That means, in Java you always have to escape an opening bracket with a backslash inside a character class, but you can still escape a closing bracket by placing it first. So the most compact way to match a square bracket in Java is []\[]. I find that confusing and ugly, so I often escape both brackets, at least in Java and JavaScript.
.NET has a similar feature called set subtraction that's much simpler and uses a tighter syntax: [a-z-[aeiou]]. The only place a nested class can appear is after the -, and the whole construct must be at the end of the enclosing character class. You can still match a square bracket using [][] in .NET.
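The JavaScript side of this can be checked with a short sketch. The variable names are only illustrative, and the regex literals are written without the u or v flags, which is the mode described above.

```javascript
// In JavaScript, [] is an empty class: it can never match a character.
const emptyClassMatches = /[]/.test("abc");

// [^] is the negated empty class: it matches any character, newlines included.
const anyCharMatchesNewline = /[^]/.test("\n");

// [][] is two empty classes in a row, so it cannot match a bracket.
const twoEmptyClassesMatch = /[][]/.test("[]");

// Escaping the closing bracket gives the "match a square bracket" meaning.
const brackets = "a[b]c".match(/[\][]/g);

// The negated form: any character except ], [ or ^.
const nonBrackets = "a]b[c^d".match(/[^\][^]/g);

console.log(emptyClassMatches, anyCharMatchesNewline, twoEmptyClassesMatch);
console.log(brackets, nonBrackets);
```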

Related

Why 'ABC'.replace('B', '$`') gives AAC

Why does this code print AAC instead of the expected A$`C?
console.log('ABC'.replace('B', '$`'));
==>
AAC
And how to make it give the expected result?
To insert a literal $ you have to pass $$, because $`:
Inserts the portion of the string that precedes the matched substring.
console.log('ABC'.replace('B', "$$`"));
See the documentation.
Other patterns:
Pattern  Inserts
$$  Inserts a $.
$&  Inserts the matched substring.
$`  Inserts the portion of the string that precedes the matched substring.
$'  Inserts the portion of the string that follows the matched substring.
$n  Where n is a positive integer less than 100, inserts the nth parenthesized submatch string, provided the first argument was a RegExp object. Note that this is 1-indexed. If group n is not present (e.g., n is 3 but the pattern has fewer than three groups), it is replaced as a literal (e.g., $3).
$<Name>  Where Name is a capturing group name. If the group is not in the match, or not in the regular expression, or if a string was passed as the first argument to replace instead of a regular expression, this resolves to a literal (e.g., $<Name>). Only available in browser versions supporting named capturing groups.
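As a rough illustration of the patterns listed above, here is a small sketch. The sample string, the pattern, and the group name last are made up for the example, and it assumes an engine that supports named capturing groups.

```javascript
const fullName = "John Smith";
// Group 1 is the first word; group 2 is also reachable by the name "last".
const namePattern = /(\w+)\s(?<last>\w+)/;

// $n inserts the nth captured group (1-indexed).
const swapped = fullName.replace(namePattern, "$2, $1");

// $<Name> inserts a named group.
const byName = fullName.replace(namePattern, "$<last>, $1");

// $& inserts the whole match.
const wrapped = fullName.replace(namePattern, "[$&]");

// $` and $' insert the text before and after the match.
const before = "ABC".replace("B", "$`");
const after = "ABC".replace("B", "$'");

console.log(swapped, "|", byName, "|", wrapped, "|", before, "|", after);
```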
JSFiddle
The reference linked above covers more than this. If you still have doubts you can probably find an answer there; the table above was taken from that page.
It is worth saying, in my opinion, that any $ sequence that doesn't match one of the patterns above doesn't need to be escaped. A lone $ needs no escaping, and neither does something like $AAA.
In the comments a user asked why you need to "escape" $ with another $. I'm not entirely sure, but since invalid patterns are not interpreted, I suspect $$ exists for one specific case: when you need to replace the match with a dollar sign followed by a character that would otherwise complete a pattern, such as the backtick (`) or the ampersand (&).
In every other case the dollar sign doesn't need to be escaped, so a narrow rule like this probably made more sense than requiring $ to be escaped everywhere (which, I think, would have meant escaping the $ even in something like var a = "hello, $ hey this one is a dollar";).
If you’re still interested and want to read more, please also check regular-expressions.info and this JSFiddle with more cases.
In the replacement the $ dollar sign has a special meaning and is used when data from the match should be used in the replacement.
MDN: String.prototype.replace(): Specifying a string as a parameter
$$ Inserts a "$".
$` Inserts the portion of the string that precedes the matched substring.
As long as the $ does not form a combination that has a special meaning, it is handled as a regular character. But you should still always write it as $$ in the replacement; otherwise it might break in the future if a new $x combination is added.
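When the replacement text is not known in advance, there are two common ways to keep $ from being interpreted; the sketch below shows both. The helper name escapeReplacement is hypothetical, not a built-in. A string returned by a replacer function is inserted as-is, without any $ processing.

```javascript
// Doubling the dollar sign yields a literal "$" followed by the backtick.
const viaDoubledDollar = "ABC".replace("B", "$$`");

// A replacer function's return value is inserted verbatim.
const viaFunction = "ABC".replace("B", () => "$`");

// "$A" is not a special pattern, so it passes through unchanged.
const harmlessDollar = "ABC".replace("B", "$AAA");

// Hypothetical helper: double every "$" so arbitrary text is inserted literally.
// In the replacement string, "$$$$" produces two dollar signs.
function escapeReplacement(text) {
  return text.replace(/\$/g, "$$$$");
}

const viaHelper = "ABC".replace("B", escapeReplacement("$&"));

console.log(viaDoubledDollar, viaFunction, harmlessDollar, viaHelper);
```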

Regex for brainfuck loops

I'd like to create a regular expression that is able to fetch every loop inside a brainfuck code.
Let's say this code is given:
++++[>+[>,++.]<<-]++[>,.<-]
I want to fetch these three loops (actually it would be sufficient just to fetch the first one):
[>+[>,++.]<<-]
[>,++.]
[>,.<-]
My knowledge of regular expressions is pretty weak, so I can't do much more than basics. What I have thought of is this expression:
\[[-+><.,\[\]]*]
\[ - Match the first (opening) bracket
[-+><.,\[\]]* - followed by a number of brainfuck operators
] - followed by a closing bracket
This however matches (obviously) everything between the first opening, and the last closing bracket:
[>+[>,++.]<<-]++[>,.<-]
It might need something to test for the same number of opening and closing brackets inside the loop, before matching the last closing bracket - If that makes any sense.
Maybe a lookaround (I need to use this in javascript, so I can only use lookaheads) is the right way to do this, but I can't figure out how it's supposed to be done.
I had written this one once when I needed to match a pair of square brackets (while handling nesting correctly)
It is a .NET regex that uses some features that aren't available in all regex engines. Here goes:
\[(?>\[(?<d>)|\](?<-d>)|.?)*(?(d)(?!))\]
Regular expressions cannot match infinitely recursing things. Look at the Chomsky hierarchy of languages.
You can write a regular expression matching finitely recursing things by expanding them. For example, this POSIX ERE (tested with egrep) will match brainfuck loops up to nesting depth 3:
(\[[^][]*\]|\[([^][]|\[[^][]*\])*\]|\[([^][]|\[([^][]|\[[^][]*\])*\])*\])
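Since the question asks for JavaScript, the same expansion idea can be sketched there by building the pattern in a loop instead of writing it out by hand. The function name buildLoopPattern is hypothetical. In JavaScript the closing bracket inside the class has to be escaped, so the POSIX [^][] becomes [^\][].

```javascript
// Build a regex matching bracket pairs nested up to maxDepth levels.
function buildLoopPattern(maxDepth) {
  // Depth 1: an opening bracket, non-bracket characters, a closing bracket.
  let source = "\\[[^\\][]*\\]";
  for (let depth = 2; depth <= maxDepth; depth++) {
    // Each extra level allows non-bracket characters or a shallower loop inside.
    source = "\\[(?:[^\\][]|" + source + ")*\\]";
  }
  return new RegExp(source, "g");
}

const program = "++++[>+[>,++.]<<-]++[>,.<-]";
const topLevelLoops = program.match(buildLoopPattern(3));
console.log(topLevelLoops);
```

With the global flag, match returns only non-overlapping matches, so the inner loop [>,++.] is not reported separately here.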
Use a non-greedy (or lazy) matching:
\[[-+><.,\[\]]*?\]
Notice the ?. Though, it'll match the shortest string between [ and ]. Thus, one of the results would be:
[>+[>,++.]
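If a regex is not strictly required, arbitrary nesting is easier to handle in JavaScript with a plain scan and a stack of opening-bracket positions. This is a swapped-in technique rather than anything proposed in the answers above, and the function name findLoops is hypothetical.

```javascript
// Return every loop in the program, outer loops before the loops they contain.
function findLoops(code) {
  const openPositions = []; // indexes of "[" that are still waiting for their "]"
  const loops = [];
  for (let i = 0; i < code.length; i++) {
    if (code[i] === "[") {
      openPositions.push(i);
    } else if (code[i] === "]") {
      const start = openPositions.pop();
      if (start === undefined) {
        throw new SyntaxError("Unmatched ] at index " + i);
      }
      loops.push({ start, text: code.slice(start, i + 1) });
    }
  }
  if (openPositions.length > 0) {
    throw new SyntaxError("Unmatched [ at index " + openPositions.pop());
  }
  // Loops are recorded when they close; sort them by where they open.
  loops.sort((a, b) => a.start - b.start);
  return loops.map((loop) => loop.text);
}

const allLoops = findLoops("++++[>+[>,++.]<<-]++[>,.<-]");
console.log(allLoops);
```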

RegExp in JavaScript, when a quantifier is part of the pattern

I have been trying to use a regexp that matches any text that is between a caret, less than and a greater than, caret.
So it would look like: ^< THE TEXT I WANT SELECTED >^
I have tried something like this, but it isn't working: ^<(.*?)>^
I'm assuming this is possible, right? I think the reason I have been having such a tough time is because the caret serves as a quantifier. Thanks for any help I get!
Update
Just so everyone knows, the following, from am not i am, worked:
/\^<(.*?)>\^/
But, it turned out that I was getting HTML entities, since I was getting my string by using the .innerHTML property. In other words,
> ... &gt;
< ... &lt;
To solve this, my regexp actually looks like this:
\^&lt;(.*?)((.|\n)*)&gt;\^
This includes the fact that the string in between should be any character or new line. Thanks!
You need to escape the ^ symbol since it has special meaning in a JavaScript regex.
/\^<(.*?)>\^/
In a JavaScript regex, the ^ means beginning of the string, unless the m modifier was used, in which case it means beginning of the line.
This should work:
\^<(.*?)>\^
In a regex, if you want to use a character that has a special meaning (caret, brackets, pipe, ...), you have to escape it using a backslash. For example, (\w\b)*\w\. will select a sequence of words terminated by a dot.
Careful!
If you have to pass the regex pattern as a string, i.e. there's no regex literal like in JavaScript or Perl, you may have to use a double backslash. The programming language turns it into a single backslash, which is then processed by the regex engine.
Same regex in multiple languages:
Python:
import re
myRegex=re.compile(r"\^<(.*?)>\^") # The r before the string prevents backslash escaping
PHP:
$result=preg_match("/\\^<(.*?)>\\^/",$subject); // Notice the double backslashes here?
JavaScript:
var myRegex=/\^<(.*?)>\^/,
subject="^<blah example>^";
subject.match(myRegex);
If you tell us what programming language you're writing in, we'll be able to give you some finished code to work with.
Edit: Whoops, didn't even notice this was tagged as javascript. Then, you don't have to worry about double backslash at all.
Edit 2: \b represents a word boundary. Though I agree yours is what I would have used myself.
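In JavaScript the double backslash only matters when the pattern is built from a string with the RegExp constructor; a regex literal needs just one. A minimal sketch with illustrative variable names follows. The [\s\S] variant is one possible way to let the captured text span newlines, offered as an alternative to (.|\n)*.

```javascript
const subject = "before ^<blah example>^ after";

// Regex literal: one backslash escapes the caret.
const fromLiteral = /\^<(.*?)>\^/;

// String pattern: the backslash itself must be escaped inside the string.
const fromString = new RegExp("\\^<(.*?)>\\^");

const captured = subject.match(fromLiteral)[1];
const sameSource = fromLiteral.source === fromString.source;

// [\s\S] matches any character, including newlines.
const multiLine = /\^<([\s\S]*?)>\^/;
const spanning = "^<line one\nline two>^".match(multiLine)[1];

console.log(captured, sameSource, JSON.stringify(spanning));
```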

understanding regular expression for detecting string

I encountered this regular expression that detects string literal of Unicode characters in JavaScript.
'"'("\\x"[a-fA-F0-9]{2}|"\\u"[a-fA-F0-9]{4}|"\\"[^xu]|[^"\n\\])*'"'
but I couldn't understand the role and need of
1) "\\x"[a-fA-F0-9]{2}
2) "\\"[^xu]|[^"\n\\]
My guess about 1) is that it is detecting control characters.
"\\x"[a-fA-F0-9]{2}
This is a literal \x followed by two characters from the hex-digit group.
This matches the shorter-form character escapes for the code points 0–255, \x00–\xFF. These are valid in JavaScript string literals but they aren't in JSON, where you have to use \u0000–\u00FF instead.
"\\"[^xu]|[^"{esc}\n]
This matches one of:
backslash followed by one more character, except for x or u. The valid cases for \xNN and \uNNNN were picked up in the previous |-separated clauses, so what this does is avoid matching invalid syntax like \uqX.
anything else, except for the " or newline. It is probably also supposed to be excluding other escape characters, which I'm guessing is what {esc} means. That isn't part of the normal regex syntax, but it may be some extended syntax or templating over the top of regex. Otherwise, [^"{esc}\n] would mean just any character except ", {, e, s, c, } or newline, which would be wrong.
Notably, the last clause, that picks up ‘anything else’, doesn't exclude \ itself, so you can still have \uqX in your string and get a match even though that is invalid in both JSON and JavaScript.
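To experiment with the clauses, the pattern can be translated into a JavaScript regex. This sketch assumes the original is written in a lex-style syntax where quoted pieces such as "\\x" are literal text, and it uses the [^"\n\\] form of the last clause from the question, which does exclude the backslash. The name stringLiteral is only illustrative.

```javascript
// Anchored so the whole input must be one double-quoted literal.
const stringLiteral =
  /^"(?:\\x[a-fA-F0-9]{2}|\\u[a-fA-F0-9]{4}|\\[^xu]|[^"\n\\])*"$/;

// Each sample is the raw text of a literal, including its surrounding quotes.
const samples = [
  '"hello"',    // plain characters
  '"\\x41"',    // \x plus two hex digits
  '"\\u0041"',  // \u plus four hex digits
  '"\\n"',      // backslash plus one character other than x or u
  '"\\uqX"',    // invalid: \u without four hex digits
  '"\\x4"',     // invalid: \x with only one hex digit
];

const results = samples.map((text) => stringLiteral.test(text));
console.log(results);
```

Because this variant excludes the backslash from the last clause, the invalid \uqX sample is rejected, unlike with the {esc} form discussed above.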

Replace Pipe and Comma with Regex in Javascript

I'm sitting here with "The Good Parts" in hand but I'm still none the wiser.
Can anyone knock up a regex for me that will allow me to replace any instances of "|" and "," from a string.
Also, could anyone point me in the direction of a really good resource for learning regular expressions, especially in javascript (are they a particular flavour??) It really is a weak point in my knowledge.
Cheers.
str.replace(/(\||,)/g, "replaceWith"). Don't forget the g at the end so it searches the string globally; without it, the regex will only replace the first instance.
What it says is: replace | (you need to escape this character) OR (|) ,
Nice Cheatsheet here
The best resource I have found if you really want to understand regular expressions (and the special caveats or quirks of any of a majority of the implementations/flavors) is Regular-Expressions.info.
If you really get into regular expressions, I would recommend the product called RegexBuddy for testing and debugging regular expressions in all sorts of languages (though there are a few things it does not quite support, it is rather good overall)
Edit:
The best way (I think, especially if you consider readability) is using a character class rather than alternation (i.e.: [] instead of |)
use:
var newString = str.replace(/[|,]/g, ";");
This will replace either a | or a , with a semicolon
The character class essentially means "match anything inside these square brackets" - with only a few exceptions.
First, you can specify ranges of characters ([a-zA-Z] means any letter from a to z or from A to Z).
Second, putting a caret (^) at the beginning of the character class negates it - it means anything not in this character class ([^0-9] means any character that is not from 0 to 9).
Third, to match a dash or a caret literally, put the dash at the beginning and the caret at the end of the character class, or escape them with a \ anywhere else in the class if you prefer.
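The points above can be tried out with a short sketch; the sample strings and variable names are made up for illustration.

```javascript
const input = "a|b,c|d";

// Alternation: the pipe must be escaped to be matched literally.
const withAlternation = input.replace(/\||,/g, ";");

// Character class: the pipe needs no escaping inside the brackets.
const withClass = input.replace(/[|,]/g, ";");

// Without the g flag only the first occurrence is replaced.
const firstOnly = input.replace(/[|,]/, ";");

// Dash first and caret last are both taken literally inside a class.
const dashAndCaret = "a-b^c".replace(/[-^]/g, "_");

console.log(withAlternation, withClass, firstOnly, dashAndCaret);
```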
