Back Reference Fault in JavaScript Regex

Back Reference Fault in JavaScript Regex - javascript

I am teaching myself regex using a combination of the two websites Eloquent JavaScript and regular expressions.info. I am trying to use back references via a self made example in which I want to roughly be able to test for syntactic correctness of a Java while loop (assuming we limit it to while( value operator value) for the sake of simplicity).
However take a look at my code below and you will see that the reference \1 does not appear to work. I've tried my solution in JS. but also using software tool The Regex Coach.
Can anyone see the problem here?
var rx = /^while\s*\((\s*[a-zA-Z][a-zA-Z0-9_]*\s*)(\<\=|\<|\>\=|\>|\!\=|\=\=)\s*\1\)/
document.writeln(rx.test("while(x <= y)"));

Your regex would match
while(x <= x )
because \1 matches the exact text that was matched by the first capturing group - which in this case is "x ". And since "y" isn't the same as "x ", your regex fails on the example you've chosen.
For your example, the following would work:
var rx = /^while\s*\(\s*([a-zA-Z]\w*)\s*(<=?|>=?|!=|==)\s*([a-zA-Z]\w*)\s*\)$/
Note that \w is a shorthand for [a-zA-Z0-9_] in JavaScript, and that you don't need to escape all those symbols.

Related

Javascript regex to pick all multi line text between two strings [duplicate]

var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre.*?<\/pre>/gm );
alert(arr); // null
I'd want the PRE block be picked up, even though it spans over newline characters. I thought the 'm' flag does it. Does not.
Found the answer here before posting. SInce I thought I knew JavaScript (read three books, worked hours) and there wasn't an existing solution at SO, I'll dare to post anyways. throw stones here
So the solution is:
var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre[\s\S]*?<\/pre>/gm );
alert(arr); // <pre>...</pre> :)
Does anyone have a less cryptic way?
Edit: this is a duplicate but since it's harder to find than mine, I don't remove.
It proposes [^] as a "multiline dot". What I still don't understand is why [.\n] does not work. Guess this is one of the sad parts of JavaScript..

DON'T use (.|[\r\n]) instead of . for multiline matching.
DO use [\s\S] instead of . for multiline matching
Also, avoid greediness where not needed by using *? or +? quantifier instead of * or +. This can have a huge performance impact.
See the benchmark I have made: https://jsben.ch/R4Hxu
Using [^]: fastest
Using [\s\S]: 0.83% slower
Using (.|\r|\n): 96% slower
Using (.|[\r\n]): 96% slower
NB: You can also use [^] but it is deprecated in the below comment.

[.\n] does not work because . has no special meaning inside of [], it just means a literal .. (.|\n) would be a way to specify "any character, including a newline". If you want to match all newlines, you would need to add \r as well to include Windows and classic Mac OS style line endings: (.|[\r\n]).
That turns out to be somewhat cumbersome, as well as slow, (see KrisWebDev's answer for details), so a better approach would be to match all whitespace characters and all non-whitespace characters, with [\s\S], which will match everything, and is faster and simpler.
In general, you shouldn't try to use a regexp to match the actual HTML tags. See, for instance, these questions for more information on why.
Instead, try actually searching the DOM for the tag you need (using jQuery makes this easier, but you can always do document.getElementsByTagName("pre") with the standard DOM), and then search the text content of those results with a regexp if you need to match against the contents.

You do not specify your environment and version of JavaScript (ECMAScript), and I realise this post was from 2009, but just for completeness:
With the release of ECMA2018 we can now use the s flag to cause . to match \n (see https://stackoverflow.com/a/36006948/141801).
Thus:
let s = 'I am a string\nover several\nlines.';
console.log('String: "' + s + '".');
let r = /string.*several.*lines/s; // Note 's' modifier
console.log('Match? ' + r.test(s)); // 'test' returns true
This is a recent addition and will not work in many current environments, for example Node v8.7.0 does not seem to recognise it, but it works in Chromium, and I'm using it in a Typescript test I'm writing and presumably it will become more mainstream as time goes by.

Now there's the s (single line) modifier, that lets the dot matches new lines as well :)
\s will also match new lines :D
Just add the s behind the slash
/<pre>.*?<\/pre>/gms

[.\n] doesn't work, because dot in [] (by regex definition; not javascript only) means the dot-character. You can use (.|\n) (or (.|[\n\r])) instead.

I have tested it (Chrome) and it's working for me (both [^] and [^\0]), by changing the dot (.) with either [^\0] or [^] , because dot doesn't match line break (See here: http://www.regular-expressions.info/dot.html).
var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre[^\0]*?<\/pre>/gm );
alert(arr); //Working

In addition to above-said examples, it is an alternate.
^[\\w\\s]*$
Where \w is for words and \s is for white spaces

[\\w\\s]*
This one was beyond helpful for me, especially for matching multiple things that include new lines, every single other answer ended up just grouping all of the matches together.

Regex to find if there is only one block of code

My input is a string, I want to validate that there is only one first-level block of code.
Examples :
{ abc } TRUE
{ a { bc } } TRUE
{ a {{}} } TRUE
{ abc {efg}{hij}} TRUE
{ a b cde }{aa} FALSE
/^\{.*\}$/ is valid for the 5 cases, can you help me to find a regex invalid for the last case ?
Language is JavaScript.

EDIT: I started writing the answer before JavaScript was specified. Will leave it as for the record as it fully explains the regex.
In short: In JavaScript I cannot think of a reliable solution. In other engines there are several options:
Recursion (on which I will expand below)
Balancing group (.NET)
For solutions 2 (which anyhow won't work in JS either), I'll refer you to the example in this question
Recursive Regex
In Perl, PCRE (e.g. Notepad++, PHP, R) and the Matthew Barnett's regex module for Python, you can use:
^({(?:[^{}]++|(?1))*})$
The idea is to match exactly one set of nested braces. Anything more makes the regex fail.
See what matches and fails in the Regex Demo.
Explanation
The ^ anchor asserts that we are at the beginning of the string
The outer parentheses define Group 1 (or Subroutine 1)
{ match the opening brace
(?: ... )* zero or more times, we will...
[^{}]++ match any chars that are not { or }
OR |
(?1) repeat the expression of subroutine 1
} match closing brace
The $ anchor asserts that we are at the end of the string. Therefore,

This is a terrible workaround.
Since this is in Javascript there's not really much to do, but please see the following regex:
/^{([^{}]*|{})*}$/
Where you copy ([^{}]*|{})* and insert it between the last pair of curly brackets (rinse and repeat). Every duplication of this pattern allows another level of nesting between your elements. (This is a workaround for the lack of recursion in JS regex, required to solve nesting problems.)
Online Regex Demo

In JavaScript what you need to do is strip out all the nested blocks until no nested blocks are left and then check whether there are still multiple blocks:
var r = input.replace(/(['"])(?:(?!\1|\\).|\\.)*\1|\/(?![*/])(?:[^\\/]|\\.)+\/[igm]*|\/\/[^\n]*(?:\n|$)|\/\*(?:[^*]|\*(?!\/))*\*\//gi, '');
if (r.split('{').length != r.split('}').length || r.indexOf('}') < r.indexOf('{')) {
// ERROR
continue;
}
while (r.match(/\{[^}]*\{[^{}]*\}/))
r = r.replace(/(\{[^}]*)\{[^{}]*\}/g, '$1');
if (r.match(/\}.*\{/)
// FALSE
else
// TRUE
Working JSFiddle
Be sure to make the regex in the while and the regex in the replace match the same otherwise this might result in infinite loops.
Updated to address ERROR cases and remove anything in comments, strings and regex-literals first after Unihedron asked.

(\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*|\(([^()]*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*\))*
Code for brackets

difference between ruby regex and javascript regex

I made this regular expression: /.net.(\w*)/
I'm trying to capture the qa in a string like this:
https://xxxxxx.cloudfront.net/qa/club/Slide1.PNG
I'm doing .replace on it like so location.replace(/.net.(\w*)/,data.newName));
But instead of capturing qa, it captures .net, when I run the code in Javascript
According to this online regex tool made for ruby, it captures qa as intended
http://rubular.com/r/ItrG7BRNRn
What's the difference between Javascript regexes and Ruby regexes, and how can I make my regex work as intended in javascript?
Edit:
I changed my code to this:
var str = `https://xxxxxxxxxx.cloudfront.net/qa/club`;
var re = /\.net\/([^\/]*)\//;
console.log(data2.files[i].location.replace(re,'$1'+ "test"));
And instead of
https://dm7svtk8jb00c.cloudfront.net/test/club
I get this:
https://dm7svtk8jb00c.cloudfrontqatestclub
If I remove the $1 I get https://dm7svtk8jb00c.cloudfronttestclub, which is closer, but I want to keep the slashes.

This would be a better regex:
/\.net\/([^\/]*)\//
Remember that . will match any character, not the period character. For that you need to escape it with a leading backslash: \.
Also, \w will only match numbers, letters and underscores. You could quite legitimately have a dash in that part of the URL. Therefore you're far better off matching anything that isn't a forward slash.

I am not sure how Ruby works, but JavaScript replace will not just replace the capture group, it replaces the whole matched string. By adding another capture group, you can use $1 to add back in the string you want to keep.
...replace(/(.net.)(\w*)/,"$1" + data.newName");

You have to do that like this:
location.replace(/(\.net.)(\w*)/, '$1' + data.newName)
replace replaces the whole matched substring, not a particular group. Ruby works exactly in the same way:
ruby -e "puts 'https://xxxxxx.cloudfront.net/qa/club/Slide1.PNG'.sub(/.net.(\w*)/, '##')"
https://xxxxxx.cloudfront##/club/Slide1.PNG
ruby -e "puts 'https://xxxxxx.cloudfront.net/qa/club/Slide1.PNG'.sub(/(.net.)(\w*)/, '\\1' + '##')"
https://xxxxxx.cloudfront.net/##/club/Slide1.PNG

There's no difference (at least with the pattern you've provided). In both cases, the expression matches ".net/qa", with qa being the first capture group within the expression. Notice that even in your linked example the entire match is highlighted.
I'd recommend something like this:
location.replace(/(.net.)\w*/, "$1" + data.newName);
Or this, to be a bit safer:
location.replace(/(.net.)\w*/, function(m, a) { return a + data.newName; });

It's not so much a different between JavaScript and Ruby's implementations of regular expressions, it's your pattern that needs a bit of work. It's not tight enough.
You can use something like /\.net\/([^\/]+)/, which you can see in action at Rubular.
That returns the characters delimited by / following .net.
Regex patterns are very powerful, but they're also fraught with dangerous side-effects that open up big holes easily, causing false-positives, which can ruin results unexpectedly. Until you know them well, start simply, and test them every imaginable way. And, once you think you know them well, keep doing that; Patterns in code we write where I work are a particular hot-button for me, and I'm always finding holes in them in our code-reviews and requiring them to be tightened until they do exactly what the developer meant, not what they thought they meant.
While the pattern above works, I'd probably do it a bit differently in Ruby. Using the tools made for the job:
require 'uri'
URL = 'https://xxxxxx.cloudfront.net/qa/club/Slide1.PNG'
uri = URI.parse(URL)
path = uri.path # => "/qa/club/Slide1.PNG"
path.split('/')[1] # => "qa"
Or, more succinctly:
URI.parse(URL).path.split('/')[1] # => "qa"

JavaScript RegExp: R naming conventions

We're all familiar with naming conventions in R (if not: Venables & Smith - Introduction to R, Chapter 1.8). Regular expressions, on the other hand, remain terra incognita and the most hated part in my programming life so far ... Recently, I've officially started JavaScript recapitulation, and I'm trying to create precise RegExp to check correct R object name.
Brief intro:
Object names must start with . or lower/uppercase letter, and if they start with . cannot continue with numeric... afterward, alphanumeric symbols are allowed with . and underscore _.
Long story short, here's my JS code:
var txt = "._.";
var inc = /^\.(?!\d)|^[a-z\.]/i;
document.write(inc.test(txt));
This approach manages the first part (. and/or lower/upper case and numeric after .), but I cannot pass something like & [\w\.]. I can write a function that will take care of this one, but is it at all possible to manage this with a single RegExp?

I'm not familiar with R or its naming conventions, but I'll give it a shot:
If you're only trying to verify that the name begins correctly, all you need to do is remove the \. from the character class, leaving you with: /^\.(?!\d)|^[a-z]/i. Otherwise, the . may still be the first character with no restrictions on the remaining ones.
If you want to verify the that entire name is correct, something like this should work:
/^(?:\.(?!\d)|[a-z])[a-z0-9_\.]+$/i

How to use JavaScript regex over multiple lines?

var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre.*?<\/pre>/gm );
alert(arr); // null
I'd want the PRE block be picked up, even though it spans over newline characters. I thought the 'm' flag does it. Does not.
Found the answer here before posting. SInce I thought I knew JavaScript (read three books, worked hours) and there wasn't an existing solution at SO, I'll dare to post anyways. throw stones here
So the solution is:
var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre[\s\S]*?<\/pre>/gm );
alert(arr); // <pre>...</pre> :)
Does anyone have a less cryptic way?
Edit: this is a duplicate but since it's harder to find than mine, I don't remove.
It proposes [^] as a "multiline dot". What I still don't understand is why [.\n] does not work. Guess this is one of the sad parts of JavaScript..

DON'T use (.|[\r\n]) instead of . for multiline matching.
DO use [\s\S] instead of . for multiline matching
Also, avoid greediness where not needed by using *? or +? quantifier instead of * or +. This can have a huge performance impact.
See the benchmark I have made: https://jsben.ch/R4Hxu
Using [^]: fastest
Using [\s\S]: 0.83% slower
Using (.|\r|\n): 96% slower
Using (.|[\r\n]): 96% slower
NB: You can also use [^] but it is deprecated in the below comment.

[.\n] does not work because . has no special meaning inside of [], it just means a literal .. (.|\n) would be a way to specify "any character, including a newline". If you want to match all newlines, you would need to add \r as well to include Windows and classic Mac OS style line endings: (.|[\r\n]).
That turns out to be somewhat cumbersome, as well as slow, (see KrisWebDev's answer for details), so a better approach would be to match all whitespace characters and all non-whitespace characters, with [\s\S], which will match everything, and is faster and simpler.
In general, you shouldn't try to use a regexp to match the actual HTML tags. See, for instance, these questions for more information on why.
Instead, try actually searching the DOM for the tag you need (using jQuery makes this easier, but you can always do document.getElementsByTagName("pre") with the standard DOM), and then search the text content of those results with a regexp if you need to match against the contents.

You do not specify your environment and version of JavaScript (ECMAScript), and I realise this post was from 2009, but just for completeness:
With the release of ECMA2018 we can now use the s flag to cause . to match \n (see https://stackoverflow.com/a/36006948/141801).
Thus:
let s = 'I am a string\nover several\nlines.';
console.log('String: "' + s + '".');
let r = /string.*several.*lines/s; // Note 's' modifier
console.log('Match? ' + r.test(s)); // 'test' returns true
This is a recent addition and will not work in many current environments, for example Node v8.7.0 does not seem to recognise it, but it works in Chromium, and I'm using it in a Typescript test I'm writing and presumably it will become more mainstream as time goes by.

Now there's the s (single line) modifier, that lets the dot matches new lines as well :)
\s will also match new lines :D
Just add the s behind the slash
/<pre>.*?<\/pre>/gms

[.\n] doesn't work, because dot in [] (by regex definition; not javascript only) means the dot-character. You can use (.|\n) (or (.|[\n\r])) instead.

I have tested it (Chrome) and it's working for me (both [^] and [^\0]), by changing the dot (.) with either [^\0] or [^] , because dot doesn't match line break (See here: http://www.regular-expressions.info/dot.html).
var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre[^\0]*?<\/pre>/gm );
alert(arr); //Working

In addition to above-said examples, it is an alternate.
^[\\w\\s]*$
Where \w is for words and \s is for white spaces

[\\w\\s]*
This one was beyond helpful for me, especially for matching multiple things that include new lines, every single other answer ended up just grouping all of the matches together.

Develop Reference

JavaScript is the programming language of the Web.

Back Reference Fault in JavaScript Regex - javascript

Related

Javascript regex to pick all multi line text between two strings [duplicate]

Regex to find if there is only one block of code

difference between ruby regex and javascript regex

JavaScript RegExp: R naming conventions

How to use JavaScript regex over multiple lines?

Categories

Resources