When does Javascript "initialize" a RegEx pattern? - javascript

I take care to declare a RegEx pattern once and reuse if possible, for performance reasons. I'm not entirely certain why - something I probably read once many years ago and has been filed away in the ol' skull sponge.
I find myself in a regex-heavy situation, and a thought occurred... does declaring a RegEx pattern "instantiate" or "initialize" that pattern, or does it just store the pattern until it's needed?
var NonNumbers = /[^0-9]/g; //"initialized" here?
"h5u4i15h1iu".replace(NonNumbers, "*"); //or "initialized" here?
Maybe RegExp() actually creates one and the literal waits until it's used, even though both patterns return the same results?
var NonNumbers = /[^0-9]/g; //just stores the pattern
var NonNumbers = RegExp(/[^0-9]/, 'g'); //actually creates the RegExp
Just an itch I'm hoping someone who understands the inner workings can scratch.

From the Mozilla spec:
You construct a regular expression in one of two ways:
Using a regular expression literal, which consists of a pattern enclosed between slashes, as follows:
var re = /ab+c/;
Regular expression literals provide compilation of the regular expression when the script is loaded. If the regular expression remains constant, using this can improve performance.
Or calling the constructor function of the RegExp object, as follows:
var re = new RegExp('ab+c');
Using the constructor function provides runtime compilation of the regular expression. Use the constructor function when you know the regular expression pattern will be changing, or you don't know the pattern and are getting it from another source, such as user input.
Since the spec indicates that the regular expression is being compiled when using the literal syntax, it is also safe to assume that it is being initialized as a full, bona-fide regular expression object at that point.
Another advantage of using literals is that regular expressions can be interned, meaning that if the same regular expression literal is found in multiple places, both literals can refer to the same object, saving both memory and initialization costs.

Related

What is the function of .source in context of this new RegExp

I ran into the below monster of a regex in the wild today. The regex is meant to validate a url.
function superUrlValidation(url) {
return new RegExp(/^/.source + "((.+):\/\/)?" + /(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$/.source, "i")
.test(url);
}
I've never seen .source used in a regex like this so I looked it up.
The MDN docs for RegExp.prototype.source states:
The source property returns a String containing the source text of the regexp object, and it doesn't contain the two forward slashes on both sides and any flags.
... and gives this example:
var regex = /fooBar/ig;
console.log(regex.source); // "fooBar", doesn't contain /.../ and "ig".
I understand the MDN example (you're getting the source text of the regex object after it is created, makes sense), but I dont understand how this is being used in the superUrlValidation regex above.
How is the source being used before the regex object is completed and what does this accomplish? I cant find any documentation showing .source being used in this way.
Note that .source is used twice in the regex, at the beginning and the end
Use of .source everywhere in your regex seems totally unnecessary, may be just a trick to avoid double escaping. In fact even use of new RegExp is not needed and you can get away with just the regex literal as this:
var re = /^((.+):\/\/)?(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$/i;
/^/ is a regex literal, meaning it's a valid regex object in it's own right. This means that /^/.source === "^".
This seems like an arbitrary example of using the source property as this means the author could have just placed a "^" in it's place, or even just put a ^ at the beginning of the next string, and it would have the same effect.
The .source property returns the content of the regex between the forward slashes as you say. so the result of the above is equivalent to this string:
/^((.+):\/\/)?(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$/i
In JavaScript you can write regexes like this: /matchsomething/ or using the RegExp function/constructor above. It looks like the code you found is the result of someone not know what they were doing. They seem to have taken a few regexes using the literal syntax (i.e /match_here/) and plugged it into the constructor version and stuck them all together.
I can't see any benefit in using the source property this way. I would just use the string version or the constructor version. Or better, find out what the original author intended and write it again or find a respected regex library with the criteria you need.
And, yeah, wow. It's massive.

Do RegExps made by expression literals share a single instance?

The following snippet of code (from Crockford's Javascript: The Good Parts) demonstrates that RegExp objects made by regular expression literals share a single instance:
function make_a_matcher( ) {
return /a/gi;
}
var x = make_a_matcher( );
var y = make_a_matcher( );
// Beware: x and y are the same object!
x.lastIndex = 10;
document.writeln(y.lastIndex); // 10
Question: Is this the same with any other literals? I tried modifying the code above to work with the string "string", but got a bunch of errors.
No, they are not shared. From the spec on Regular Expression Literals:
A regular expression literal is an input element that is converted to a RegExp object (see 15.10) each time the literal is evaluated. Two regular expression literals in a program evaluate to regular expression objects that never compare as === to each other even if the two literals' contents are identical.
However, this changed with ES5. Old ECMAScript 3 had a different behavior:
A regular expression literal is an input element that is converted to a RegExp object (section 15.10) when it is scanned. The object is created before evaluation of the containing program or function begins. Evaluation of the literal produces a reference to that object; it does not create a new object. Two regular expression literals in a program evaluate to regular expression objects that never compare as === to each other even if the two literals' contents are identical.
This was supposed to share the compilation result of the regex engine across evaluations, but obviously led to buggy progams.
You should throw your book away and get a newer edition.

Is using var foo = new RegExp necessary? [duplicate]

This question already has answers here:
Are /regex/ Literals always RegExp Objects?
(3 answers)
Closed 10 years ago.
While researching the use of regular expressions in javascript, one can encounter examples of two types:
A:
var regex = /^[a-zA-Z0-9]+$/;
B:
var regex = new RegExp ("^[a-zA-Z0-9]*$");
Is it necessary to use var foo = new RegExp? Or, when should one pick each method?
The RegExp() constructor is useful when you have to assemble a regular expression dynamically at run time. If the expression is completely static, it's easier to use the native regex syntax (your "A"). The ease-of-use of the native syntax stems from the fact that you don't have to worry about quoting backslashes, as you do when your regular expression begins life as a string constant.
Is it necessary to use var foo = new RegExp?
No, obviously not. The other one works as well.
Or, when should one pick each method?
Regex literals are easier to read and write as you do not need to string-escape regex escape characters - you can just use them (backslashes, quotes). Also, they are parsed only once during script "compilation" - nothing needs to be executed each time you the line is evaluated.
The RegExp constructor only needs to be used if you want to build regexes dynamically.
Here's an example of a 'dynamic' regular expression where you might need new RegExp.
var search = 'dog',
re = new RegExp('.*' + search + '.*');
If it's a static regular expression, then the literal syntax (your A option) is better because it's easier to write and read.

What is the algorithm of the search() function?

Does any body know what the algorithm used for the search() function in javascript is?
var myRegExp = /Alex/;
var string1 = "Today John went to the store and talked with Alex.";
var matchPos1 = string1.search(myRegExp);
if(matchPos1 != -1)
document.write("There was a match at position " + matchPos1);
else
document.write("There was no match in the first string");
Example copied tizaq.com
I need to use this function to search a text document for different string values. But I need to document what the algorithm behind this method is, and what the complexity is. Otherwise I have to write my own method that searches the text file that I have.
The specification says it's implemented as a regular expression match:
3) If Type(regexp) is Object and the value of the [[Class]] internal
property of regexp is "RegExp", then let rx be regexp;
4) Else, let rx be a new RegExp object created as if by the
expression new RegExp( regexp) where RegExp is the standard built-in
constructor with that name.
5) Search the value string from its beginning for an occurrence of
the regular expression pattern rx. Let result be a Number indicating
the offset within string where the pattern matched, or –1 if there was
no match. (...)
(Section 15.5.4.12 String.prototype.search (regexp)).
This means your question boils down to the regex matching algorithm. But that is not in the specification either, it depends on the implementation:
The value of the [[Match]] internal property is an implementation dependent representation of the Pattern of the RegExp object.
(Section 15.10.7 Properties of RegExp Instances).
So, if documenting the complexity of that algorithm is really a requirement, I guess you'll have to write your own method. But keep in mind that, by doing that, you'll probably come up with something less efficient, and probably dependent on other built-in methods whose complexity is unknown (maybe even RegExp itself). So, can't you convince the powers that be that documenting the complexity of a built-in, implementation-dependent js method is not your job?

Can I avoid code duplication by referencing one Javascript Regexp literal inside another?

Is there a simple way to refer to one Javascript literal (e.g. "string") within another regexp literal?
Kind of familiar with Javascript Regexp but far from a guru. Trying to write a simple parser for a small handful of expression types. E.g. One type is expressions like:
`value gender 1='Male' 2 ='Female' 3="Didn't answer" >3 = 'Other';
Rather than write a whole parser in say, Jison, and the attendant learning curve, I thought it would be simple enough to use RegExp.
It appears Javascript Regexp can't capture an arbitrary number of repeating subgroups, and there's no clear character to split on, I'm parsing subgroups with their own regexps.
The following works okay, but the regexp literals are far from DRY, and all but unreadable. Each higher level construct repeats the lower level constructs.
var re_value_stmt = /value\s+(\w+)((?:\s+(?:[^=]+[=](?:(?:["][^"]+["])|(?:['][^']+[']))))+)/i
var re_value_clause = /([^=]+[=](\s*(?:(['][^']*['])|(["][^"]*["])))+)/ig
var re_value_elems = /([^=]+)[=]\s*(?:(?:[']([^']*)['])|(?:["]([^"]*)["]))/ig
console.log(re_value_elems.exec("1='Male'"));
console.log(re_value_clause.exec("1=\"Male\" 2=\"Female\""));
console.log(re_value_stmt.exec("value gender 1='Male' 2='Female'"));
For instance, (?:(?:["][^"]+["])|(?:['][^']+['])) just means QuotedString. Can I write that instead?
Is there a simple way to refer to one Javascript literal (e.g. "string") within another regexp literal? Specifying regexp by munging strings might work, but also seems awkward and error prone (e.g. needing to escape quote marks and escape escapes).
Or is this already the poster child for why people create parsers based on grammars and move out of Regexp?

Categories

Resources