Javascript literal regexp vs RegExp object, instances inside loops - javascript

I have the following two js example codes, one using literal regexp and the other one using RegExp object:
"use strict";
var re;
// literal regexp
for(var i = 0; i<10; i++)
{
re = /cat/g;
console.log(re.test("catastrophe"));
}
// RegExp constructor
for(var i = 0; i<10;i++)
{
re = new RegExp("cat", "g");
console.log(re.test("catastrophe"));
}
Some books say that using the first example "true" should be printed on each second iteration given the fact that the using the literal expression there will be created only one instance of RegExp. So the loop finds on the first run the substring "cat", than on the second run continues from where is left and finds nothing. On the third run it starts from the beginning and so on. I've tested this but it seems that in both examples i get the count of 10.
Can you explain why this is happening?

The 3rd Edition ECMAScript (JavaScript) specification allowed caching and reusing regular expression literals, including their state, leading to the "surprising" behavior you mention in relation to your first code example, which certainly looks like it should create a new regular expression object on every loop. That caching of literals was not implemented by most engines and was a phenomenally bad idea, and the 5th Edition specification fixes it.
I believe all modern engines that used to do caching (primarily SpiderMonkey, Firefox's engine) were updated accordingly. A new regex is created for every iteration in both of your examples.
More in this blog post (right at the end) by Steven Levithan, and in the fourth paragraph of Annex E in the specification:
7.8.5: Regular expression literals now return a unique object each time the literal is evaluated. This change is detectable by any programs that test the object identity of such literal values or that are sensitive to the shared side effects.

In both cases, you are creating a new RegExp each time through the for loop. It doesn't matter which way you declare the RegExp - it's still creating a new one each time the loop iterates. Thus, you get the same behavior.
Now, if you initialized the re variable before the for loop, you would get a different behavior because of the persistence of the same RegExp object and how it uses the g flag.

Related

Indexing through a constant string

I was playing with a constant string in a loop from another question…
Here it is:
str = "abcd";
for (i = 0; i < 4; i++) {
console.log(str[i]);
}
… and I ended up doing that:
for (i = 0; i < 4; i++) {
console.log("abcd"[i]);
}
I didn't know this kind of coding was working before I tried!
How is this way of doing called?
Should it be avoided for any technical reason?
Thanks for any answer.
How is this way of doing called?
I'm not aware of it having any specific name. You're just using a string literal inside the loop.
Should it be avoided for any technical reason?
With a string literal it probably doesn't matter, because string literals define primitive strings (and are likely reused by the JavaScript engine, as they're immutable). But if you were creating an object every time, that would be unnecessary overhead compared with just creating it once and reusing it.
For instance, if you were doing this:
for (var i = 0; i < 4; i++) {
console.log([1, 2, 3, 4][i]);
}
That code tells the JavaScript engine to create that array each time the loop body runs, which is fast, but not instantaneous. (The JavaScript engine might be able to analyze the code and optimize it if the code were used enough that it seemed worth bothering, but that's a different topic.)
The way to tried is okay but not very useful as if you specify the string in the loop it becomes static to the loop.
It is advised to use Variable insteads of "HARDCODED" values.
All your code really does is do away with the variable and index the "array-like" object directly.
Strings are "array-like" objects. They have .length property and can be indexed just as Arrays can be. They are not however, actual arrays and don't support the full Array API. JavaScript is full of "array-like" objects and they are certainly not anything to be avoided. To the contrary, it is a great feature of the language to be able to leverage this. It's just important to know when you have an actual Array and when you have an "array-like" object, so you don't use the latter incorrectly.
So, because "abcd" is array-like, there is no reason you can't place an index right after it:
"abcd"[2]
Scott Marcus explains what's happening here well. As far as whether this should be avoided, many believe it is better to access chars in a string using chatAt() instead of bracket notation. Namely because:
Bracket notation is part of ECMAScript 5 and therefore not universally supported
The similarity to bracket notation of an array or hash object can be confusing. For instance, strings are immutable so you cannot set the value of a string at a certain index as you can with an array or hash. Using chatAt() can therefore elucidate that one should not expect this to be possible.
Source:
string.charAt(x) or string[x]?

Javascript variables names which are reserved words

I find myself, for reasons which are too irrelevant to go into here but which involve machine-generated code, needing to create variables with names which are Javascript reserved words.
For example, with object literals, by quoting the keys if they're invalid identifiers:
var o = { validIdentifier: 1, "not a valid identifier": 2 };
Is there a similar technique which works for simple variable references?
A poke around the spec shows that there used to be a mechanism that allowed this, by abusing Unicode escapes:
f\u0075nction = 7;
However this seems incredibly dubious, and is apparently rapidly vanishing (although my recent Chrome still appears to support it). Is there a more modern equivalent?
If they're object keys, you can call them what you like (even reserved names), and you don't need to quote them.
var o = { function: 'a' }
console.log(o.function) // a
DEMO
Is there a more modern equivalent?
No. Reserved words cannot be used as variable names. You can use names that look like those words but that's it.
From the spec:
A reserved word is an IdentifierName that cannot be used as an Identifier.
FYI, reserved words can be used as property names, even without quotes:
var o = { function: 1 };
Variable names can't be reserved words.
You can always do bad things like:
window["function"] = 'foo';
console.log(window.function);
> foo
You still can't reference it using a bare reserved, because it's reserved.
That it's machine-generated code means whatever is generating the code is broken.

Difference between RegExp constructor and Regex literal test function? [duplicate]

This question already has answers here:
Why does a RegExp with global flag give wrong results?
(7 answers)
Closed 6 years ago.
I'm confused on how this is possible...
var matcher = new RegExp("d", "gi");
matcher.test(item)
The code above contains the following values
item = "Douglas Enas"
matcher = /d/gi
Yet when I run the matcher.test function back to back I get true for the first run and false for the second run.
matcher.test(item) // true
matcher.test(item) // false
If I use a regexp literal such as
/d/gi.test("Douglas Enas")
and run it back to back in chrome I get true both times. Is there an explanation for this?
Sample of a back to back run in chrome console creating a regexp object using constructor
matcher = new RegExp("d","gi")
/d/gi
matcher.test("Douglas Enas")
true
matcher.test("Douglas Enas")
false
matcher
/d/gi
Sample using back to back calls on literal
/d/gi.test("Douglas Enas")
true
/d/gi.test("Douglas Enas")
true
The reason for this question if because using the RegExp constructor and the test function against a list of values I'm losing matches... However using the literal I'm getting back all the values I expect
UPDATE
var suggestions = [];
////process response
$.each(responseData, function (i, val)
{
suggestions.push(val.desc);
});
var arr = $.grep(suggestions, function(item) {
var matcher = new RegExp("d", "gi");
return matcher.test(item);
});
Moving the creation of the matcher inside the closure included the missing results. the "d" is actually a dynamically created string but I used "d" for simplicity sake. I'm still not sure now creating a new expression every time I do the test when I am iterating over the suggestions array would inadvertently exclude results is a little confusing still, and probably has something to do with the advancement of the match test
From RegExp.test():
test called multiple times on the same global regular expression instance will advance past the previous match.
So basically when you have an instance of RegExp, each call to test advances the matcher. Once you've found the first d, it will look beyond that and try to find another d. Well, there are none anymore, so it returns false.
On the other hand, when you do:
/d/gi.test("Douglas Enas")
You create a new RegExp instance every time on the spot, so it will always find that first d (and thus return true).
According to Mozilla Developer Network,
As with exec (or in combination with it), test called multiple times
on the same global regular expression instance will advance past the
previous match.
https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Global_Objects/RegExp/test
I can't find the link ATM, but as I recall, this is a known issue with the test method in combination with a g pattern: the match position isn't "forgotten" somehow. Either drop the global flag and/or use the .match method, and it'll work just fine.

Dynamic vs Inline RegExp performance in JavaScript

I stumbled upon that performance test, saying that RegExps in JavaScript are not necessarily slow: http://jsperf.com/regexp-indexof-perf
There's one thing i didn't get though: two cases involve something that i believed to be exactly the same:
RegExp('(?:^| )foo(?: |$)').test(node.className);
And
/(?:^| )foo(?: |$)/.test(node.className);
In my mind, those two lines were exactly the same, the second one being some kind of shorthand to create a RegExp object. Still, it's twice faster than the first.
Those cases are called "dynamic regexp" and "inline regexp".
Could someone help me understand the difference (and the performance gap) between these two?
Nowadays, answers given here are not entirely complete/correct.
Starting from ES5, the literal syntax behavior is the same as RegExp() syntax regarding object creation: both of them creates a new RegExp object every time code path hits an expression in which they are taking part.
Therefore, the only difference between them now is how often that regexp is compiled:
With literal syntax - one time during initial code parsing and
compiling
With RegExp() syntax - every time new object gets created
See, for instance, Stoyan Stefanov's JavaScript Patterns book:
Another distinction between the regular expression literal and the
constructor is that the literal creates an object only once during
parse time. If you create the same regular expression in a loop, the
previously created object will be returned with all its properties
(such as lastIndex) already set from the first time. Consider the
following example as an illustration of how the same object is
returned twice.
function getRE() {
var re = /[a-z]/;
re.foo = "bar";
return re;
}
var reg = getRE(),
re2 = getRE();
console.log(reg === re2); // true
reg.foo = "baz";
console.log(re2.foo); // "baz"
This behavior has changed in ES5 and the literal also creates new objects. The behavior has also been corrected in many browser
environments, so it’s not to be relied on.
If you run this sample in all modern browsers or NodeJS, you get the following instead:
false
bar
Meaning that every time you're calling the getRE() function, a new RegExp object is created even with literal syntax approach.
The above not only explains why you shouldn't use the RegExp() for immutable regexps (it's very well known performance issue today), but also explains:
(I am more surprised that inlineRegExp and storedRegExp have different
results.)
The storedRegExp is about 5 - 20% percent faster across browsers than inlineRegExp because there is no overhead of creating (and garbage collecting) a new RegExp object every time.
Conclusion:
Always create your immutable regexps with literal syntax and cache it if it's to be re-used. In other words, don't rely on that difference in behavior in envs below ES5, and continue caching appropriately in envs above.
Why literal syntax? It has some advantages comparing to constructor syntax:
It is shorter and doesn’t force you to think in terms of class-like
constructors.
When using the RegExp() constructor, you also need to escape quotes and double-escape backslashes. It makes regular expressions
that are hard to read and understand by their nature even more harder.
(Free citation from the same Stoyan Stefanov's JavaScript Patterns book).
Hence, it's always a good idea to stick with the literal syntax, unless your regexp isn't known at the compile time.
The difference in performance is not related to the syntax that is used is partly related to the syntax that is used: in /pattern/ and RegExp(/pattern/) (where you did not test the latter) the regular expression is only compiled once, but for RegExp('pattern') the expression is compiled on each usage. See Alexander's answer, which should be the accepted answer today.
Apart from the above, in your tests for inlineRegExp and storedRegExp you're looking at code that is initialized once when the source code text is parsed, while for dynamicRegExp the regular expression is created for each invocation of the method. Note that the actual tests run things like r = dynamicRegExp(element) many times, while the preparation code is only run once.
The following gives you about the same results, according to another jsPerf:
var reContains = /(?:^| )foo(?: |$)/;
...and
var reContains = RegExp('(?:^| )foo(?: |$)');
...when both are used with
function storedRegExp(node) {
return reContains.test(node.className);
}
Sure, the source code of RegExp('(?:^| )foo(?: |$)') might first be parsed into a String, and then into a RegExp, but I doubt that by itself will be twice as slow. However, the following will create a new RegExp(..) again and again for each method call:
function dynamicRegExp(node) {
return RegExp('(?:^| )foo(?: |$)').test(node.className);
}
If in the original test you'd only call each method once, then the inline version would not be a whopping 2 times faster.
(I am more surprised that inlineRegExp and storedRegExp have different results. This is explained in Alexander's answer too.)
in the second case, the regular expression object is created during the parsing of the language, and in the first case, the RegExp class constructor has to parse an arbitrary string.

How often does JavaScript recompile regex literals in functions?

Given this function:
function doThing(values,things){
var thatRegex = /^http:\/\//i; // is this created once or on every execution?
if (values.match(thatRegex)) return values;
return things;
}
How often does the JavaScript engine have to create the regex? Once per execution or once per page load/script parse?
To prevent needless answers or comments, I personally favor putting the regex outside the function, not inside. The question is about the behavior of the language, because I'm not sure where to look this up, or if this is an engine issue.
EDIT:
I was reminded I didn't mention that this was going to be used in a loop. My apologies:
var newList = [];
foreach(item1 in ListOfItems1){
foreach(item2 in ListOfItems2){
newList.push(doThing(item1, item2));
}
}
So given that it's going to be used many times in a loop, it makes sense to define the regex outside the function, but so that's the idea.
also note the script is rather genericized for the purpose of examining only the behavior and cost of the regex creation
From Mozilla's JavaScript Guide on regular expressions:
Regular expression literals provide compilation of the regular expression when the script is evaluated. When the regular expression will remain constant, use this for better performance.
And from the ECMA-262 spec, §7.8.5 Regular Expression Literals:
A regular expression literal is an input element that is converted to a RegExp object (see 15.10) each time the literal is evaluated.
In other words, it's compiled once when it's evaluated as a script is first parsed.
It's worth noting also, from the ES5 spec, that two literals will compile to two distinct instances of RegExp, even if the literals themselves are the same. Thus if a given literal appears twice within your script, it will be compiled twice, to two distinct instances:
Two regular expression literals in a program evaluate to regular expression objects that never compare as === to each other even if the two literals' contents are identical.
...
... each time the literal is evaluated, a new object is created as if by the expression new RegExp(Pattern, Flags) where RegExp is the standard built-in constructor with that name.
The provided answers don't clearly distinguish between two different processes behind the scene: regexp compilation and regexp object creation when hitting regexp object creation expression.
Yes, using regexp literal syntax, you're gaining the performance benefit of one time regexp compilation.
But if your code executes in ES5+ environment, every time the code path enters the doThing() function in your example, it actually creates a new RegExp object, though, without need to compile the regexp again and again.
In ES5, literal syntax produces a new RegExp object every time code path hits expression that creates a regexp via literal:
function getRE() {
var re = /[a-z]/;
re.foo = "bar";
return re;
}
var reg = getRE(),
re2 = getRE();
console.log(reg === re2); // false
reg.foo = "baz";
console.log(re2.foo); // "bar"
To illustrate the above statements from the point of actual numbers, take a look at the performance difference between storedRegExp and inlineRegExp tests in this jsperf.
storedRegExp would be about 5 - 20% percent faster across browsers than inlineRegExp - the overhead of creating (and garbage collecting) a new RegExp object every time.
Conslusion:
If you're heavily using your literal regexps, consider caching them outside the scope where they are needed, so that they are not only be compiled once, but actual regexp objects for them would be created once as well.
There are two "regular expression" type objects in javascript.
Regular expression instances and the RegExp object.
Also, there are two ways to create regular expression instances:
using the /regex/ syntax and
using new RegExp('regex');
Each of these will create new regular expression instance each time.
However there is only ONE global RegExp object.
var input = 'abcdef';
var r1 = /(abc)/;
var r2 = /(def)/;
r1.exec(input);
alert(RegExp.$1); //outputs 'abc'
r2.exec(input);
alert(RegExp.$1); //outputs 'def'
The actual pattern is compiled as the script is loaded when you use Syntax 1
The pattern argument is compiled into an internal format before use. For Syntax 1, pattern is compiled as the script is loaded. For Syntax 2, pattern is compiled just before use, or when the compile method is called.
But you still could get different regular expression instances each method call. Test in chrome vs firefox
function testregex() {
var localreg = /abc/;
if (testregex.reg != null){
alert(localreg === testregex.reg);
};
testregex.reg = localreg;
}
testregex();
testregex();
It's VERY little overhead, but if you wanted exactly one regex, its safest to only create one instance outside of your function
The regex will be compiled every time you call the function if it's not in literal form.
Since you are including it in a literal form, you've got nothing to worry about.
Here's a quote from websina.com:
Regular expression literals provide compilation of the regular expression when the script is evaluated. When the regular expression will remain constant, use this for better performance.
Calling the constructor function of the RegExp object, as follows:
re = new RegExp("ab+c")
Using the constructor function provides runtime compilation of the regular expression. Use the constructor function when you know the regular expression pattern will be changing, or you don't know the pattern and are getting it from another source, such as user input.

Categories

Resources