JavaScript Regex Compile()


Is there a shorter way to write this?
var needed = /\$\[\w+\]/mi;
needed.compile(/\$\[\w+\]/mi);
Why do I have to pass the pattern back into the regex when I've already declared it in the first line?!

There are two ways of defining regular expressions in JavaScript: one through an object constructor and one through a literal. A regex built with the constructor can be changed at runtime, whereas the literal is compiled when the script loads and provides better performance.
var txt=new RegExp(pattern,modifiers);
or more simply:
var txt=/pattern/modifiers;
This is the same thing that cobbai is saying. In short, you do not have to do both.

From MDC:
The literal notation provides compilation of the regular expression when the expression is evaluated.
So /\$\[\w+\]/mi is a compiled regex already.
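As a minimal sketch of the point above (the sample string is made up for illustration), the literal can be used directly, and the constructor form describes the same pattern once the backslashes are doubled inside the string. Neither needs a follow-up compile() call:

```javascript
// The literal from the question is ready to use as soon as it is evaluated.
var needed = /\$\[\w+\]/mi;

// Equivalent constructor form; backslashes are doubled inside the string.
var sameThing = new RegExp("\\$\\[\\w+\\]", "mi");

console.log(needed.test("price: $[amount]"));    // true
console.log(sameThing.test("price: $[amount]")); // true
console.log(needed.source === sameThing.source); // true
```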

Related

Creating dynamic regex match in JavaScript

Is there a way to define a new regex character pattern in JavaScript with NodeJS? For example, turning /hello\sworld/gm into /{message}/gm, where {message} is interpreted to match "hello world" or whatever other string I decide.
Essentially I'm trying to avoid this:
var message = "hello world";
(new RegExp(message, "gm")).test(someString);
In hopes of getting something like:
/{message}/gm.test(someString);
I'd like to note that it shouldn't work for only the test method. Any method that RegExp uses to match, test, search, etc, should all work. I imagine this would be possible to do if there is a way to override the functions? Or is there a way to edit the RegExp arguments on object creation?
The idea is for me to define {message} as meaning something, and for that to be interpreted globally without having to deal with concatenating a variable into every regex pattern.
I am aware that others have asked about dynamic regexps before. The answer to all of those is to use the RegExp constructor. I am wondering if there is an alternative, possibly like overriding vanilla JavaScript classes.
Also note that I'm not asking whether or not this is good practice. I'm asking whether or not it is possible with or without good practice in mind.
For clarity, {message} should be replaced in every single regex made in any file. So /{message}/ and /bananas:\s{message}/ become /hello\sworld/ and /bananas:\shello\sworld/ respectively, etc etc.

No, it's not possible. The best you can do is create a function which will take a string, replace {message}, create a RegExp object and return it.
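A minimal sketch of the helper function described above. The names makeRegExp and placeholders are hypothetical, chosen only for this illustration, and the helper has to be called explicitly: regex literals themselves are not affected.

```javascript
// Hypothetical placeholder table; names and values are illustrative only.
var placeholders = { message: "hello\\sworld" };

// Build a RegExp from a pattern string, expanding {name} placeholders first.
function makeRegExp(pattern, flags) {
    var expanded = pattern.replace(/\{(\w+)\}/g, function (whole, name) {
        return Object.prototype.hasOwnProperty.call(placeholders, name)
            ? placeholders[name]
            : whole; // leave unknown placeholders (and quantifiers like {3}) untouched
    });
    return new RegExp(expanded, flags);
}

var re = makeRegExp("bananas:\\s{message}", "m");
console.log(re.source);                       // bananas:\shello\sworld
console.log(re.test("bananas: hello world")); // true
```

Because the pattern is passed as a string, backslashes have to be doubled, which is the usual cost of going through the RegExp constructor.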

Dynamic vs Inline RegExp performance in JavaScript

I stumbled upon a performance test saying that RegExps in JavaScript are not necessarily slow: http://jsperf.com/regexp-indexof-perf
There's one thing I didn't get though: two cases involve something that I believed to be exactly the same:
RegExp('(?:^| )foo(?: |$)').test(node.className);
And
/(?:^| )foo(?: |$)/.test(node.className);
In my mind, those two lines were exactly the same, the second one being some kind of shorthand to create a RegExp object. Still, it's twice as fast as the first.
Those cases are called "dynamic regexp" and "inline regexp".
Could someone help me understand the difference (and the performance gap) between these two?

Nowadays, answers given here are not entirely complete/correct.
Starting from ES5, the literal syntax behaves the same as the RegExp() syntax regarding object creation: both of them create a new RegExp object every time the code path hits an expression in which they take part.
Therefore, the only difference between them now is how often that regexp is compiled:
With literal syntax - one time, during initial code parsing and compiling
With RegExp() syntax - every time a new object gets created
See, for instance, Stoyan Stefanov's JavaScript Patterns book:
Another distinction between the regular expression literal and the constructor is that the literal creates an object only once during parse time. If you create the same regular expression in a loop, the previously created object will be returned with all its properties (such as lastIndex) already set from the first time. Consider the following example as an illustration of how the same object is returned twice.
function getRE() {
    var re = /[a-z]/;
    re.foo = "bar";
    return re;
}
var reg = getRE(),
    re2 = getRE();
console.log(reg === re2); // true
reg.foo = "baz";
console.log(re2.foo); // "baz"
This behavior has changed in ES5 and the literal also creates new objects. The behavior has also been corrected in many browser environments, so it’s not to be relied on.
If you run this sample in all modern browsers or NodeJS, you get the following instead:
false
bar
Meaning that every time you're calling the getRE() function, a new RegExp object is created even with literal syntax approach.
The above not only explains why you shouldn't use RegExp() for immutable regexps (a well-known performance issue today), but also explains:
(I am more surprised that inlineRegExp and storedRegExp have different results.)
The storedRegExp is about 5-20% faster across browsers than inlineRegExp because there is no overhead of creating (and garbage collecting) a new RegExp object every time.
Conclusion:
Always create your immutable regexps with literal syntax and cache it if it's to be re-used. In other words, don't rely on that difference in behavior in envs below ES5, and continue caching appropriately in envs above.
Why literal syntax? It has some advantages compared to the constructor syntax:
It is shorter and doesn’t force you to think in terms of class-like constructors.
When using the RegExp() constructor, you also need to escape quotes and double-escape backslashes. That makes regular expressions, which are hard to read and understand by their nature, even harder.
(Loosely quoted from the same Stoyan Stefanov's JavaScript Patterns book.)
Hence, it's always a good idea to stick with the literal syntax, unless your regexp isn't known at compile time.
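To illustrate the double-escaping point, here is a small sketch (the pattern and variable names are made up for this example). The same pattern is written as a literal, as a correctly escaped constructor string, and as a constructor string where the backslashes were forgotten:

```javascript
var literal = /\d+\.\d+/;
var viaConstructor = new RegExp("\\d+\\.\\d+"); // every backslash doubled
var forgotToEscape = new RegExp("\d+\.\d+");    // the string literal drops the backslashes

console.log(literal.source);        // \d+\.\d+
console.log(viaConstructor.source); // \d+\.\d+
console.log(forgotToEscape.source); // d+.d+

console.log(literal.test("3.14"));        // true
console.log(forgotToEscape.test("3.14")); // false
```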

The difference in performance is partly related to the syntax that is used: in /pattern/ and RegExp(/pattern/) (where you did not test the latter) the regular expression is only compiled once, but for RegExp('pattern') the expression is compiled on each usage. See Alexander's answer, which should be the accepted answer today.
Apart from the above, in your tests for inlineRegExp and storedRegExp you're looking at code that is initialized once when the source code text is parsed, while for dynamicRegExp the regular expression is created for each invocation of the method. Note that the actual tests run things like r = dynamicRegExp(element) many times, while the preparation code is only run once.
The following gives you about the same results, according to another jsPerf:
var reContains = /(?:^| )foo(?: |$)/;
...and
var reContains = RegExp('(?:^| )foo(?: |$)');
...when both are used with
function storedRegExp(node) {
    return reContains.test(node.className);
}
Sure, the source code of RegExp('(?:^| )foo(?: |$)') might first be parsed into a String, and then into a RegExp, but I doubt that by itself will be twice as slow. However, the following will create a new RegExp(..) again and again for each method call:
function dynamicRegExp(node) {
    return RegExp('(?:^| )foo(?: |$)').test(node.className);
}
If in the original test you'd only call each method once, then the inline version would not be a whopping 2 times faster.
(I am more surprised that inlineRegExp and storedRegExp have different results. This is explained in Alexander's answer too.)

In the second case, the regular expression object is created during the parsing of the language, and in the first case, the RegExp class constructor has to parse an arbitrary string.

How often does JavaScript recompile regex literals in functions?

Given this function:
function doThing(values, things) {
    var thatRegex = /^http:\/\//i; // is this created once or on every execution?
    if (values.match(thatRegex)) return values;
    return things;
}
How often does the JavaScript engine have to create the regex? Once per execution or once per page load/script parse?
To prevent needless answers or comments, I personally favor putting the regex outside the function, not inside. The question is about the behavior of the language, because I'm not sure where to look this up, or if this is an engine issue.
EDIT:
I was reminded I didn't mention that this was going to be used in a loop. My apologies:
var newList = [];
ListOfItems1.forEach(function (item1) {
    ListOfItems2.forEach(function (item2) {
        newList.push(doThing(item1, item2));
    });
});
So given that it's going to be used many times in a loop, it makes sense to define the regex outside the function; that's the idea.
Also note that the script is deliberately generic, so as to examine only the behavior and cost of the regex creation.

From Mozilla's JavaScript Guide on regular expressions:
Regular expression literals provide compilation of the regular expression when the script is evaluated. When the regular expression will remain constant, use this for better performance.
And from the ECMA-262 spec, §7.8.5 Regular Expression Literals:
A regular expression literal is an input element that is converted to a RegExp object (see 15.10) each time the literal is evaluated.
In other words, it's compiled once, when it's evaluated as the script is first parsed.
It's worth noting also, from the ES5 spec, that two literals will compile to two distinct instances of RegExp, even if the literals themselves are the same. Thus if a given literal appears twice within your script, it will be compiled twice, to two distinct instances:
Two regular expression literals in a program evaluate to regular expression objects that never compare as === to each other even if the two literals' contents are identical.
...
... each time the literal is evaluated, a new object is created as if by the expression new RegExp(Pattern, Flags) where RegExp is the standard built-in constructor with that name.
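One observable consequence of the spec text quoted above, as a small sketch assuming an ES5-or-later engine (the variable names are made up): a literal with the g flag that is re-evaluated on every loop iteration always starts from lastIndex 0, while a single shared object carries its lastIndex from one call to the next.

```javascript
// Literal evaluated on every iteration: a fresh object each time.
var results = [];
for (var i = 0; i < 2; i++) {
    var re = /a/g;
    results.push(re.test("a")); // always starts at lastIndex 0
}
console.log(results.join(", ")); // true, true

// One shared object: state is kept between calls.
var shared = /a/g;
var first = shared.test("a");  // matches, lastIndex becomes 1
var second = shared.test("a"); // search starts at index 1, so no match
console.log(first, second);    // true false

// Two literals with identical contents are still distinct objects.
console.log(/a/ === /a/);      // false
```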

The provided answers don't clearly distinguish between two different processes behind the scenes: regexp compilation, and regexp object creation when a regexp-creating expression is hit.
Yes, by using regexp literal syntax you gain the performance benefit of one-time regexp compilation.
But if your code executes in an ES5+ environment, every time the code path enters the doThing() function in your example, it actually creates a new RegExp object, though without needing to compile the regexp again and again.
In ES5, literal syntax produces a new RegExp object every time the code path hits an expression that creates a regexp via a literal:
function getRE() {
    var re = /[a-z]/;
    re.foo = "bar";
    return re;
}
var reg = getRE(),
    re2 = getRE();
console.log(reg === re2); // false
reg.foo = "baz";
console.log(re2.foo); // "bar"
To illustrate the above statements from the point of actual numbers, take a look at the performance difference between storedRegExp and inlineRegExp tests in this jsperf.
storedRegExp would be about 5-20% faster across browsers than inlineRegExp, the difference being the overhead of creating (and garbage collecting) a new RegExp object every time.
Conclusion:
If you're heavily using your literal regexps, consider caching them outside the scope where they are needed, so that not only are they compiled once, but the actual regexp objects for them are created once as well.
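A minimal sketch of that caching advice applied to the doThing() function from the question. The name HTTP_PREFIX is an assumption made for this example, and test() is used in place of match() since only a yes/no answer is needed:

```javascript
// Created once, when this line is evaluated, and reused by every call.
var HTTP_PREFIX = /^http:\/\//i;

function doThing(values, things) {
    // No regex object is created here; the cached one is reused.
    return HTTP_PREFIX.test(values) ? values : things;
}

console.log(doThing("HTTP://example.com", "fallback")); // HTTP://example.com
console.log(doThing("ftp://example.com", "fallback"));  // fallback
```

The cached regex has no g flag, so it carries no lastIndex state between calls and is safe to share.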

There are two "regular expression" type objects in javascript.
Regular expression instances and the RegExp object.
Also, there are two ways to create regular expression instances:
using the /regex/ syntax and
using new RegExp('regex');
Each of these will create new regular expression instance each time.
However there is only ONE global RegExp object.
var input = 'abcdef';
var r1 = /(abc)/;
var r2 = /(def)/;
r1.exec(input);
alert(RegExp.$1); //outputs 'abc'
r2.exec(input);
alert(RegExp.$1); //outputs 'def'
The actual pattern is compiled as the script is loaded when you use Syntax 1:
The pattern argument is compiled into an internal format before use. For Syntax 1, pattern is compiled as the script is loaded. For Syntax 2, pattern is compiled just before use, or when the compile method is called.
But you could still get different regular expression instances on each method call. Test in Chrome vs Firefox:
function testregex() {
    var localreg = /abc/;
    if (testregex.reg != null) {
        alert(localreg === testregex.reg);
    }
    testregex.reg = localreg;
}
testregex();
testregex();
It's VERY little overhead, but if you want exactly one regex, it's safest to create only one instance outside of your function.

The regex will be compiled every time you call the function if it's not in literal form.
Since you are including it in a literal form, you've got nothing to worry about.
Here's a quote from websina.com:
Regular expression literals provide compilation of the regular expression when the script is evaluated. When the regular expression will remain constant, use this for better performance.
Calling the constructor function of the RegExp object, as follows:
re = new RegExp("ab+c")
Using the constructor function provides runtime compilation of the regular expression. Use the constructor function when you know the regular expression pattern will be changing, or you don't know the pattern and are getting it from another source, such as user input.
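As a sketch of the user-input case mentioned above: a pattern that arrives as a string at runtime has to go through the constructor, and since it may be malformed, the call can throw a SyntaxError. The helper name compileUserPattern is hypothetical:

```javascript
// Compile a pattern received at runtime; return null if it is not valid.
function compileUserPattern(source, flags) {
    try {
        return new RegExp(source, flags);
    } catch (e) {
        if (e instanceof SyntaxError) return null; // malformed pattern
        throw e; // anything else is unexpected
    }
}

var ok = compileUserPattern("ab+c");
var bad = compileUserPattern("ab+("); // unterminated group

console.log(ok.test("xabbbc")); // true
console.log(bad);               // null
```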

Why use (function(){})() or !function(){}()?

I was reading In JavaScript, what is the advantage of !function(){}() over (function () {})()? then it hit me, why use :
(function(){})() or !function(){}() instead of just function(){}()?
Is there any specific reason?

It depends on where you write this. function(){}() by itself will generate a syntax error, as it is parsed as a function declaration, and those need names.
By using parentheses or the not operator, you force it to be interpreted as a function expression, which doesn't need a name.
Where it would be treated as an expression anyway, you can omit the parentheses or the operator.
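The cases described above can be sketched as follows (the variable names and return values are made up for illustration). The bare form is checked through new Function so that the syntax error can be caught at runtime instead of breaking the whole script:

```javascript
// Parentheses force an expression.
var a = (function () { return "parens"; })();

// Already in expression position, so nothing extra is needed.
var b = function () { return "assignment"; }();

// The ! operator also forces an expression, but it negates the return value.
var c = !function () { return "bang"; }();

// At statement level, "function" starts a declaration, which needs a name.
var threw = false;
try {
    new Function("function(){}()");
} catch (e) {
    threw = e instanceof SyntaxError;
}

console.log(a, b, c, threw); // parens assignment false true
```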

I guess you are asking why use:
var fn = (function(){}());
versus:
var fn = function(){}();
The simple answer for me is that often the function on the RHS is long and it's not until I get to the bottom and see the closing () that I realise I've been reading a function expression and not a function assignment.
A full explanation is in Peter Michaux's An Important Pair of Parens.

A slight variation on RobG's answer.
Many scripts encompass the entire program in one function to ensure proper scoping. This function is then immediately run using the double parentheses at the end. However, this is slightly different from programs which define a function that can be used in the page but is not run initially.
The only difference between these two scenarios is the last two characters (the addition of the double parentheses). Since these could be very long programs, the initial parenthesis is there to indicate that "this will be run immediately."
Is it necessary for the program to run? No. Is it helpful for someone looking at the code and trying to understand it? Yes.

In javascript, can I override the brackets to access characters in a string?

Is there some way I can define String[int] to avoid using String.CharAt(int)?

No, there isn't a way to do this.
This is a common question from developers who are coming to JavaScript from another language, where operators can be defined or overridden for a certain type.
In C++, it's not entirely out of the question to overload operator* on MyType, ending up with a unique asterisk operator for operations involving objects of type MyType. The readability of this practice might still be called into question, but the language affords for it, nevertheless.
In JavaScript, this is simply not possible. You will not be able to define a method which allows you to index chars from a String using brackets.
@Lee Kowalkowski brings up a good point, namely that it is, in a way, possible to access characters using the brackets, because the brackets can be used to access members of a JavaScript Array. This would involve creating a new Array, using each of the characters of the string as its members, and then accessing the Array.
This is probably a confusing approach. Some implementations of JavaScript will provide access to a string via the brackets and some will not, so it's not standard practice. The object may be confused for a string, and as JavaScript is a loosely typed language, there is already a risk of misrepresenting a type. Defining an array solely for the purposes of using a different syntax from what the language already affords is only going to promote this type of confusion. This gives rise to @Andrew Hedges's question: "Why fight the language?"
There are useful patterns in JavaScript for legitimate function overloading and polymorphic inheritance. This isn't an example of either.
All semantics aside, the operators still haven't been overridden.
Side note: Developers who are accustomed to the conventions of strong type checking and classical inheritance are sometimes confused by JavaScript's C-family syntax. Under the hood, it is working in an unfamiliar way. It's best to write JavaScript in clean and unambiguous ways, in order to prevent confusion.

Please note: Before anybody else would like to vote my answer down, the question I answered was:
IE javascript string indexers
Is there some way I can define string[int] to avoid using string.CharAt(int)?
Nothing about specifically overriding brackets, or syntax, or best-practice, the question just asked for "some way". (And the only other answer said "No, there isn't.")
Well, there is actually, kind of:
var newArray = oldString.split('');
...now you can access newArray using bracket notation, because you've just converted it to an array.

Use String.charAt()
It's standard and works in all browsers.
In non-IE browsers you can use bracket notation to access characters like this:
"TEST"[1]; // = E
You could convert a string into an array of characters doing this:
var myString = "TEST";
var charArray = myString.split(''); // charArray[1] == E
These would be discouraged. There isn't any reason not to use the charAt() method, and there is no benefit to doing anything else.
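For comparison, here are the three approaches side by side. This is a sketch that assumes an ES5-or-later engine, where bracket access on strings is part of the language; the older IE versions this answer has in mind did not support it. The forms differ mainly in what they return for an index that is out of range:

```javascript
var myString = "TEST";

console.log(myString.charAt(1));    // E
console.log(myString[1]);           // E
console.log(myString.split("")[1]); // E

// Out-of-range indexes behave differently:
console.log(myString.charAt(10) === ""); // true: charAt returns an empty string
console.log(myString[10] === undefined); // true: bracket access returns undefined
```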

This is not an answer, just a trick (strongly discouraged!). It shows, in particular, that in JavaScript you can do whatever you want. It's just a matter of imagination.
You can use the fact that you can set additional properties on a String object, as on any other object, so you can create String.0, String.1, ... properties:
String.prototype.toChars = function() {
    for (var i = 0; i < this.length; i++) {
        this[i + ""] = this.charAt(i);
    }
};
Now you can access single characters using:
var str = "Hello World";
str.toChars();
var i = 1+"";
var c = str[i]; // "e"
Note that it's useful only for access. Another method would have to be defined for assigning string chars in such a manner.
Also note that you must call the .toChars() method every time you modify the string.
