Eloquent Javascript Looping over RegExp Matches


The following example is a bit confusing to me:
var text = "A string with 3 numbers in it ... 42 and 88.";
var number = /\b(\d+)\b/g;
var match;
while (match = number.exec(text)){
console.log("Found", match[1], "at", match.index);
}
Specifically, I don't understand how this has a "looping" effect. How does it run through all the matches within one string if it keeps reading match[1]? Is there some kind of side effect of exec that I am unaware of?
Edit:
I still would like an answer to how match[1] is working.
How does match[1] produce any answer? When I test this type of thing myself, I get undefined. Look:
> var y = /\d+/g.exec('5')
undefined
> y
[ '5', index: 0, input: '5' ]
> y[1]
undefined
What's going on here? Wouldn't it be y[0], or in the case above, match[0]? Like:
> y[0]
'5'

The RegExp object remembers the position just after the last match in its lastIndex property.
Quoting MDN Documentation,
If your regular expression uses the "g" flag, you can use the exec() method multiple times to find successive matches in the same string. When you do so, the search starts at the substring of str specified by the regular expression's lastIndex property (test() will also advance the lastIndex property).
Important Note: The first part of the first line of the quoted section is important: "If your regular expression uses the "g" flag". You get this behavior only if the RegExp has the g flag.
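On the match[1] part of the question: index 0 of the array returned by exec holds the whole match, and index 1 onward hold the capturing groups. The pattern in the question wraps \d+ in parentheses, so match[1] exists; /\d+/ has no group, so y[1] is undefined. A minimal sketch of both points, where collectMatches is a hypothetical helper name used only for illustration:

```javascript
// Hypothetical helper: same loop as in the question, but it also records
// lastIndex after every successful exec call.
function collectMatches(text) {
  const number = /\b(\d+)\b/g; // one capturing group, so match[1] is defined
  const found = [];
  let match;
  while ((match = number.exec(text)) !== null) {
    found.push({ value: match[1], index: match.index, next: number.lastIndex });
  }
  return found; // exec returned null, which also reset lastIndex to 0
}

const found = collectMatches("A string with 3 numbers in it ... 42 and 88.");
console.log(found[0]); // { value: '3', index: 14, next: 15 }
console.log(found.map((entry) => entry.value));

// Without parentheses there is no group 1, so only index 0 is populated.
const noGroup = /\d+/.exec("5");
console.log(noGroup[0], noGroup[1]); // 5 undefined
```

Each call resumes at the recorded next position, which is what gives the while loop its "looping" effect.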

Related

javascript regular expression error in test function? [duplicate]

What is the meaning of the g flag in regular expressions?
What is the difference between /.+/g and /.+/?

g is for global search. Meaning it'll match all occurrences. You'll usually also see i which means ignore case.
Reference: global - JavaScript | MDN
The "g" flag indicates that the regular expression should be tested against all possible matches in a string.
Without the g flag, it'll only test for the first.
Additionally, make sure to check cchamberlain's answer below for details on how it sets the lastIndex property, which can cause unexpected side effects when re-using a regex against a series of values.

Example in Javascript to explain:
> 'aaa'.match(/a/g)
[ 'a', 'a', 'a' ]
> 'aaa'.match(/a/)
[ 'a', index: 0, input: 'aaa' ]

As @matiska pointed out, the g flag sets the lastIndex property as well.
A very important side effect of this is if you are reusing the same regex instance against a matching string, it will eventually fail because it only starts searching at the lastIndex.
// regular regex
const regex = /foo/;
// same regex with global flag
const regexG = /foo/g;
const str = " foo foo foo ";
const test = (r) => console.log(
r,
r.lastIndex,
r.test(str),
r.lastIndex
);
// Test the normal one 4 times (success)
test(regex);
test(regex);
test(regex);
test(regex);
// Test the global one 4 times
// (3 passes and a fail)
test(regexG);
test(regexG);
test(regexG);
test(regexG);
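One way to keep a shared global regex but avoid this side effect is to reset lastIndex before each test. The sketch below assumes a hypothetical helper named testFromStart; it is an illustration, not part of any library:

```javascript
// Hypothetical helper: always start scanning at index 0, even for a /g regex.
function testFromStart(regex, str) {
  regex.lastIndex = 0;
  return regex.test(str);
}

const sharedRegex = /foo/g;
const inputs = ["foo", "foo", "foo"];

// Naive reuse: the second call starts at lastIndex 3 and fails.
const naive = inputs.map((s) => sharedRegex.test(s));
console.log(naive); // [ true, false, true ]

// Resetting lastIndex first gives a consistent answer for every input.
const safe = inputs.map((s) => testFromStart(sharedRegex, s));
console.log(safe); // [ true, true, true ]
```

Dropping the g flag has the same effect when only a yes/no answer is needed.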

g is the global search flag.
The global search flag makes the RegExp search for a pattern throughout the string, creating an array of all occurrences it can find matching the given pattern.
So the difference between /.+/g and /.+/ is that the g version will find every occurrence instead of just the first.

For a single-line string there is no difference between /.+/g and /.+/, because both will only ever match the whole string once (. does not match line terminators, so with g a multi-line string yields one match per line). The g flag makes a difference if the regular expression could match more than once or contains groups, in which case .match() will return an array of all the matches instead of the first match followed by its groups.
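A short sketch of what .match() returns in each case; the sample strings are made up for illustration:

```javascript
const sample = "a1 b2";

// Without g: the first match followed by its capturing groups.
const withoutG = sample.match(/([a-z])(\d)/);
console.log(withoutG[0], withoutG[1], withoutG[2]); // a1 a 1

// With g: every full match, and no groups.
const withG = sample.match(/([a-z])(\d)/g);
console.log(withG); // [ 'a1', 'b2' ]

// /.+/ stops at a line break, so g matters for multi-line input.
const perLine = "one\ntwo".match(/.+/g);
console.log(perLine); // [ 'one', 'two' ]
const firstLine = "one\ntwo".match(/.+/)[0];
console.log(firstLine); // one
```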

g -> returns all matches
without g -> returns first match
example:
'1 2 1 5 6 7'.match(/\d+/) returns ["1", index: 0, input: "1 2 1 5 6 7", groups: undefined]. As you see we can only take first match "1".
'1 2 1 5 6 7'.match(/\d+/g) returns an array of all matches ["1", "2", "1", "5", "6", "7"].

Besides the already-mentioned meaning of the g flag, it also affects the regexp.lastIndex property:
The lastIndex is a read/write integer property of regular expression
instances that specifies the index at which to start the next match.
(...) This property is set only if the regular expression instance
used the "g" flag to indicate a global search.
Reference: Mozilla Developer Network

The g flag in a regular expression defines a global search, meaning that it searches for all occurrences in the whole string rather than stopping at the first.

Here is an example based on a string. Let's say we want to remove all occurrences of "o" from "hello world" by replacing each one with "":
"hello world".replace(/o/g,'');

In my case, I needed to re-evaluate the string from its first character each time, so I had to remove the global flag from /my_regexp/g to stop lastIndex from moving.
As mentioned on MDN:
Be sure that the global (g) flag is set, or lastIndex will never be advanced.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/exec#specifications
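A small sketch of that difference, using a made-up sample string: without g, exec always searches from the start and lastIndex stays at 0, while with g each call resumes after the previous match.

```javascript
const text = "hello world";

// No g flag: every call finds the same first "o".
const plain = /o/;
const first = plain.exec(text);
const second = plain.exec(text);
console.log(first.index, second.index, plain.lastIndex); // 4 4 0

// With g: the second call continues after the first match.
const withG = /o/g;
const g1 = withG.exec(text);
const afterFirst = withG.lastIndex;
const g2 = withG.exec(text);
console.log(g1.index, afterFirst, g2.index, withG.lastIndex); // 4 5 7 8
```

This is also why a while loop around exec with a non-global regex never terminates when the string contains a match.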

Why `pattern.test(name)` opposite results on consecutive calls [duplicate]

This question already has answers here:
Why does a RegExp with global flag give wrong results?
(7 answers)
Closed 7 years ago.
Why is this code returning first true, then false
var pattern = new RegExp("mstea", 'gi'), name = "Amanda Olmstead";
console.log('1', pattern.test(name));
console.log('1', pattern.test(name));

g is for repeating searches. It turns the regular expression object into a kind of iterator. If you want to use the test function to check whether your string is valid according to your pattern, remove this modifier:
var pattern = new RegExp("mstea", 'i'), name = "Amanda Olmstead";
The test function, unlike replace or match, doesn't consume the whole iteration, which leaves the regex in a "bad" state. You should probably never use this modifier with the test function.

You don't want to use gi in combination with pattern.test. The g flag means that it keeps track of where you are running so it can be reused. So instead, you should use:
var pattern = new RegExp("mstea", 'i'), name = "Amanda Olmstead";
console.log('1', pattern.test(name));
console.log('1', pattern.test(name));
Also, you can use /.../[flags] syntax for regex, like so:
var pattern = /mstea/i;

Because you set the g modifier.
Remove it for your case.
var pattern = new RegExp("mstea", 'i'), name = "Amanda Olmstead";

It isn't a bug.
The g flag causes the next attempted match to start in the substring after the previous match. That is why it returns false on every even-numbered attempt.
First attempt:
It is testing "Amanda Olmstead"
Second attempt:
It is testing "d" //match found in previous attempt (performs substring there)
Third attempt:
It is testing "Amanda Olmstead" again //no match found in previous attempt
... so on
MDN page for Regexp.exec states:
If your regular expression uses the "g" flag, you can use the exec
method multiple times to find successive matches in the same string.
When you do so, the search starts at the substring of str specified by
the regular expression's lastIndex property
MDN page for test states:
As with exec (or in combination with it), test called multiple times
on the same global regular expression instance will advance past the
previous match.
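The alternating results can be made visible by logging lastIndex after each call. This is a sketch using the question's pattern and string; the variable names are chosen here for illustration:

```javascript
const pattern = /mstea/gi;
const fullName = "Amanda Olmstead";

// Record [result, lastIndex] for three consecutive calls.
const trace = [];
for (let i = 0; i < 3; i++) {
  const result = pattern.test(fullName);
  trace.push([result, pattern.lastIndex]);
}
console.log(trace); // [ [ true, 14 ], [ false, 0 ], [ true, 14 ] ]

// What is left to search on the second attempt:
const remainder = fullName.slice(14);
console.log(remainder); // d
```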

Is there a way to do a substring in Javascript but use string characters as the parameters for what you want to select?

So a substring can take two parameters, the index to start at and the index to stop at like so
var str="Hello beautiful world!";
document.write(str.substring(3,7));
but is there a way to designate the start and stopping points as a set of characters to grab, so instead of the starting point being 3 I would want it to be "lo" and instead of the end point being 7 I would want it to be "wo" so I would be grabbing "lo beautiful wo". Is there a Javascript function that serves that purpose already?

Sounds like you want to use regular expressions and string.match() instead:
var str="Hello beautiful world!";
document.write(str.match(/lo.*wo/)[0]); // document.write("lo beautiful wo");
Note, match() returns an array of matches, which might be null if there is no match. So you should include a null check.
If you're not familiar with regexes, this is a pretty good source:
http://www.w3schools.com/jsref/jsref_obj_regexp.asp

Use the indexOf method: document.write(str.substring(str.indexOf('lo'), str.indexOf('wo') + 2));

Yup, you can do this easily with regular expressions:
var substr = /lo.+wo/.exec( 'Hello beautiful world!' )[0];
console.log( substr ); //=> 'lo beautiful wo'

Use a regex brother:
if (/(lo.+wo)/.test("Hello beautiful world!")) {
document.write(RegExp.$1);
}
You need a backup plan in case the string does not match. Hence the use of test.

Regular expressions can achieve this to some extent, but there are details you must be aware of.
For example, suppose you want to find all the substrings that start with "lo" and end with the nearest "wo" after that "lo". (If there is more than one match, each subsequent match starts at the first "lo" after the "wo" of the previous match.)
"Hello beautiful world!".match(/lo.*?wo/g);
Using the RegExp constructor, you can make it more flexible (you can substitute "lo" and "wo" with the actual string you want to find):
"Hello beautiful world!".match(new RegExp("lo" + ".*?" + "wo", "g"));
Important: The downside of the RegExp approach above is that you need to know which characters are special and escape them; otherwise, the pattern will not match the literal substring you want to find.
It can also be achieved with indexOf, albeit a little less cleanly. For the first substring:
var str = "Hello beautiful world!", startString = "lo", endString = "wo";
var startIndex = str.indexOf(startString);
// look for the end marker only after the start marker
var endIndex = str.indexOf(endString, startIndex + startString.length);
var result = null;
if (startIndex >= 0 && endIndex >= 0)
result = str.substring(startIndex, endIndex + endString.length); // "lo beautiful wo"
If you want to find the substring that starts with the first "lo" and ends with the last "wo" in the string, you can use indexOf and lastIndexOf to find it (with a small modification to the code above). RegExp can also do it, by changing .*? to .* in the two examples above (there will be at most one match, so the "g" flag at the end is redundant).
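The escaping concern can be handled with a small helper. The sketch below is one possible approach, not a built-in API: escapeRegExp and substringBetween are hypothetical names, and because . does not match line breaks, the markers must be on the same line.

```javascript
// Hypothetical helper: backslash-escape characters that are special in a regex.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: first substring from startString to the nearest endString.
function substringBetween(str, startString, endString) {
  const pattern = new RegExp(
    escapeRegExp(startString) + ".*?" + escapeRegExp(endString)
  );
  const match = pattern.exec(str);
  return match ? match[0] : null; // null when either marker is missing
}

console.log(substringBetween("Hello beautiful world!", "lo", "wo")); // lo beautiful wo
console.log(substringBetween("price (USD) is $5.00 now", "(", "$5.")); // (USD) is $5.
console.log(substringBetween("Hello beautiful world!", "xx", "wo")); // null
```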

Why does Javascript's regex.exec() not always return the same value? [duplicate]

This question already has answers here:
Why does a RegExp with global flag give wrong results?
(7 answers)
Closed 6 years ago.
In the Chrome or Firebug console:
reg = /ab/g
str = "abc"
reg.exec(str)
==> ["ab"]
reg.exec(str)
==> null
reg.exec(str)
==> ["ab"]
reg.exec(str)
==> null
Is exec somehow stateful and depends on what it returned the previous time? Or is this just a bug? I can't get it to happen all the time. For example, if 'str' above were "abc abc" it doesn't happen.

A JavaScript RegExp object is stateful.
When the regex is global, if you call a method on the same regex object, it will start from the index past the end of the last match.
When no more matches are found, the index is reset to 0 automatically.
To reset it manually, set the lastIndex property.
reg.lastIndex = 0;
This can be a very useful feature. You can start the evaluation at any point in the string if desired, or if in a loop, you can stop it after a desired number of matches.
Here's a demonstration of a typical approach to using the regex in a loop. It takes advantage of the fact that exec returns null when there are no more matches by performing the assignment as the loop condition.
var re = /foo_(\d+)/g,
str = "text foo_123 more text foo_456 foo_789 end text",
match,
results = [];
while (match = re.exec(str))
results.push(+match[1]);
DEMO: http://jsfiddle.net/pPW8Y/
If you don't like the placement of the assignment, the loop can be reworked, like this for example...
var re = /foo_(\d+)/g,
str = "text foo_123 more text foo_456 foo_789 end text",
match,
results = [];
do {
match = re.exec(str);
if (match)
results.push(+match[1]);
} while (match);
DEMO: http://jsfiddle.net/pPW8Y/1/
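The two uses mentioned earlier, starting at an arbitrary position and stopping after a set number of matches, could be combined as in the sketch below. takeMatches is a hypothetical helper; it assumes a global regex with one numeric capturing group, like the one above.

```javascript
// Hypothetical helper: collect at most `limit` captures, starting at `startAt`.
function takeMatches(re, str, limit, startAt = 0) {
  re.lastIndex = startAt; // begin the search wherever the caller asks
  const results = [];
  let match;
  while (results.length < limit && (match = re.exec(str)) !== null) {
    results.push(Number(match[1]));
  }
  re.lastIndex = 0; // leave the shared regex in a clean state
  return results;
}

const re = /foo_(\d+)/g;
const str = "text foo_123 more text foo_456 foo_789 end text";

const firstTwo = takeMatches(re, str, 2);
console.log(firstTwo); // [ 123, 456 ]

const afterMore = takeMatches(re, str, 5, str.indexOf("more"));
console.log(afterMore); // [ 456, 789 ]
```

Resetting lastIndex on the way out matters here because the loop may stop before exec has returned null.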

From MDN docs:
If your regular expression uses the "g" flag, you can use the exec method multiple times to find successive matches in the same string. When you do so, the search starts at the substring of str specified by the regular expression's lastIndex property (test will also advance the lastIndex property).
Since you are using the g flag, exec continues from the last matched string until it gets to the end (returns null), then starts over.
Personally, I prefer to go the other way around with str.match(reg)

Multiple Matches
If your regex needs the g flag (global match), you will need to reset the index (the position after the last match) using the lastIndex property.
reg.lastIndex = 0;
This is because exec() stops at each occurrence so that you can run it again on the remaining part. This behavior also exists with test():
If your regular expression uses the "g" flag, you can use the exec
method multiple times to find successive matches in the same string.
When you do so, the search starts at the substring of str specified by
the regular expression's lastIndex property (test will also advance
the lastIndex property)
Single Match
When there is only one possible match, you can simply rewrite your regex without the g flag; a non-global regex always searches from index 0.

Regex returning a value in IE, 'undefined' in Firefox and Safari/Chrome

Have a regex:
.*?
(rule1|rule2)
(?:(rule1|rule2)|[^}])*
(It's designed to parse CSS files, and the 'rules' are generated by JS.)
When I try this in IE, all works as it should.
Ditto when I try it in RegexBuddy or The Regex Coach.
But when I try it in Firefox or Chrome, the results are missing values.
Can anyone please explain what the real browsers are thinking, or how I can achieve results similar to IE's?
To see this in action, load up a page that gives you interactive testing, such as the W3Schools try-it-out editor.
Here's the source that can be pasted in:
http://www.w3schools.com/jsref/tryit.asp?filename=tryjsref_regexp_exec
<html>
<body>
<script type="text/javascript">
var str="#rot { rule1; rule2; }";
var patt=/.*?(rule1|rule2)(?:(rule1|rule2)|[^}])*/i;
var result=patt.exec(str);
for(var i = 0; i < 3; i++) document.write(i+": " + result[i]+"<br>");
</script>
</body>
</html>
Here is the output in IE:
0: #rot { rule1; rule2;
1: rule1
2: rule2
Here is the output in Firefox and Chrome:
0: #rot { rule1; rule2;
1: rule1
2: undefined
When I try the same using string.match, I get back an array of undefined in all browsers, including IE.
var str="#rot { rule2; rule1; rule2; }";
var patt=/.*?(rule1|rule2)(?:(rule1|rule2)|[^}])*/gi;
var result=str.match(patt);
for(var i = 0; i < 5; i++) document.write(i+": "+result[i]+"<br>");
As far as I can tell, the issue is the last non-capturing parenthesis.
When I remove them, the results are consistent cross browser - and match() gets results.
However, it does capture from the last parenthesis, in all browsers, in the following example:
<script>
var str="#rot { rule1; rule2 }";
var patt=/.*?(rule1|rule2)(?:(rule1 |rule2 )|[^}])*/gi;
var result=patt.exec(str);
for(var i =0; i < 3; i++) document.write(i+": "+result[i]+"<br>");
</script>
Notice that I've added a space to the patterns in the second regex.
The same applies if I add any negative character to the strings in the second regex:
var patt=/.*?(rule1|rule2)(?:(rule1[^1]|rule2[^1])|[^}])*/gi;
What the expletive is going on?!
All other strings that I've tried result in the first set of non-catches.
Any help is greatly appreciated!
EDIT:
The code has been shortened, and many hours of research put in, on Mathhew's advice.
The title has been changed to make the thread easier to find.
I have marked Mathew's answer as correct, as it is well researched and described.
My answer below (written before Mathew revised his) states the logic in simpler and more direct terms.

There is a disagreement about how to handle repeated capturing parentheses.
Firefox and WebKit both make the following two assumptions; IE makes only the first:
If a capturing group is repeated, capturing something new each time, only the last result is stored.
If the capturing group is inside a larger, repeated non-capturing group and does not capture anything on the last iteration, it should capture nothing.
For example:
var str = 'abcdef';
var pat = /([a-f])+/;
pat.exec will catch an 'a', then replace it with a 'b' etc, until it returns an 'f'.
In all browsers.
var str = 'abcdefg';
var pat = /(?:([a-f])|g)+/;
pat.exec will first fill in the capturing parenthesis with an 'a', 'b', through 'f'.
But the non-capturing parent will then continue and match the 'g'. During that iteration there is nothing to go into the capturing parentheses, so the capture is emptied.
The regex therefore returns undefined for that group.
IE considers the capturing parentheses to have caught nothing on the last iteration, and therefore sticks with the last valid capture, 'f'.
Which is useful, but not logical.
Being illogically useful is more destructive than useful. (We all hate quirks mode.)
Advantage Firefox/Chrome.
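The two patterns above can be run directly. The results in the comments are what the ECMAScript rules described here produce in a spec-compliant engine; old IE would report 'f' for the second capture as well.

```javascript
// A repeated capturing group keeps only its last capture.
const plainRepeat = /([a-f])+/.exec("abcdef");
console.log(plainRepeat[1]); // f

// The group does not take part in the final iteration (which matches "g"),
// so its capture is cleared.
const nestedRepeat = /(?:([a-f])|g)+/.exec("abcdefg");
console.log(nestedRepeat[0]); // abcdefg
console.log(nestedRepeat[1]); // undefined
```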

The test case can be simplified, e.g.:
/^(?:(Foo)|Bar)(?:(Foo)|Bar)/.exec("FooBar") // => [ 'FooBar', 'Foo', undefined ]
/^(?:(Foo)|Bar){2}/.exec("FooBar") // => [ 'FooBar', undefined ]
The only difference here is that the (?:(Foo)|Bar) atom is repeated (by a quantifier) in the second case, which results in its captures being cleared.
This behavior is stipulated by the ECMAScript spec:
Step 4 of the RepeatMatcher clears Atom's captures each time Atom is repeated.
IE's deviation from this spec is also documented:
ES3 states that "Step 4 of the RepeatMatcher clears Atom's captures each time Atom is repeated."
JScript does not clear the Atom's matches each time the Atom is repeated.
It's worth noting that the ES spec is at odds with the behavior of other Perl-flavored regex engines, which typically behave like IE:
Chrome, Firefox
"FooBar".match(/^(?:(Foo)|Bar)*/)[1] // => undefined
Perl
("FooBar" =~ m/^(?:(Foo)|Bar)*/)[0] # => "Foo"
Python
re.match("^(?:(Foo)|Bar)*", "FooBar").group(1) # => "Foo"
Ruby
"FooBar"[/^(?:(Foo)|Bar)*/, 1] # => "Foo"

IE is wrong. In ECMAScript, exactly one alternative can result in a string. All the others have to be undefined (not "" or anything else).
So for your alternatives, including (transform[^-][^;}]+)|(transform-origin[^;}]+), Firefox and Chrome are correct in setting the failed capture to undefined.
There's an example in the ECMAScript 5 standard (§15.10.2.3) specifically about this:
NOTE The | regular expression operator
separates two alternatives. The
pattern first tries to match the left
Alternative (followed by the sequel of
the regular expression); if it fails,
it tries to match the right
Disjunction (followed by the sequel of
the regular expression). If the left
Alternative, the right Disjunction,
and the sequel all have choice points,
all choices in the sequel are tried
before moving on to the next choice in
the left Alternative. If choices in
the left Alternative are exhausted,
the right Disjunction is tried instead
of the left Alternative. Any capturing
parentheses inside a portion of the
pattern skipped by | produce undefined
values instead of Strings.
Thus, for
example, /a|ab/.exec("abc") returns
the result "a" and not "ab". Moreover,
/((a)|(ab))((c)|(bc))/.exec("abc")
returns the array ["abc", "a", "a",
undefined, "bc", undefined, "bc"] and
not ["abc", "ab", undefined, "ab",
"c", "c", undefined]
EDIT: I figured the last part out. This applies to the original as well as the simplified version. In both cases, rule1 and rule2 can't match the ; (in the original because ; is in the negated character class [^;}]). Thus, when a ; is hit between declarations, the alternation chooses [^}]. Thus, it must set the last two captures to undefined.
For the * to be fully greedy, the final ; and space in the input must also be matched. For the last two * repetitions (';' and ' '), the alternation again chooses [^}], so the captures should be set undefined at the end too.
IE fails to do this in both cases, so they stay equal to "rule1" and "rule2".
Finally, the reason that the second example behaves differently is that (transform-origin[^;}]+)) matches on the very last * repetition, since there's no ; before the end.
EDIT 2: I'll walk through what should be happening in both current examples. match is the match array.
var str="#rot { rule1; rule2; }";
var patt=/.*?(rule1|rule2)(?:(rule1|rule2)|[^}])*/i;
.*? - "#rot { "
(rule1|rule2) - "rule1"
match[1] = "rule1"
Star 1
[^}] - ";"
match[2] = undefined
Star 2
[^}] - " "
match[2] = undefined
Star 3
(rule1|rule2) - "rule2"
match[2] = "rule2"
Star 4
[^}] - ";"
match[2] = undefined
Star 5
[^}] - " "
match[2] = undefined
Again, IE isn't setting match[2] to undefined.
For the str.match example, you're using the global flag. That means it returns an array of matches, without captures. This applies to any use of String.match. If you use g, you have to use exec to get captures.
var str="#rot { rule1; rule2 }";
var patt=/.*?(rule1|rule2)(?:(rule1 |rule2 )|[^}])*/gi;
.*? - "#rot { "
(rule1|rule2) - "rule1"
match[1] = "rule1"
Star 1
[^}] - ";"
match[2] = undefined
Star 2
[^}] - " "
match[2] = undefined
Star 3
(rule1 |rule2 ) - "rule2 "
match[2] = "rule2 "
Since this is the last *, the capture never gets set to undefined.
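The earlier point about str.match with the g flag can be sketched as follows. To keep the output easy to follow, this uses a simplified pattern, /rule(\d)/g, which is an assumption made for illustration and not the pattern from the question.

```javascript
const css = "#rot { rule2; rule1; rule2; }";
const simple = /rule(\d)/g; // simplified stand-in pattern

// match() with g: whole matches only, the captures are dropped.
const wholeMatches = css.match(simple);
console.log(wholeMatches); // [ 'rule2', 'rule1', 'rule2' ]

// exec() in a loop: one result array per match, captures included.
const captures = [];
let m;
simple.lastIndex = 0; // make the starting position explicit
while ((m = simple.exec(css)) !== null) {
  captures.push(m[1]);
}
console.log(captures); // [ '2', '1', '2' ]
```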

Try removing the ?: at the front of lines 4 and 5 in your regex above. I haven't tested it, but it really looks like they don't belong there.
(?:^|})
([^{]+)
[^}]+?-moz-
((transform[^-][^;}]+)|(transform-origin[^;}]+))
(-moz-(?:(transform[^-][^;}]+)|(transform-origin[^;}]+))|[^}])*

Your 4th and 5th patterns are competing. Ultimately it is up to the implementation of the browser's regex engine to determine the matches. This wouldn't be the first difference between IE and others.
(?:(transform[^-][^;}]+)|(transform-origin[^;}]+))
(?:-moz-(?:(transform[^-][^;}]+)|(transform-origin[^;}]+))|[^}])*
Both of these are prefixed by transform and suffixed by origin. You need to condense these into a more concise expression. Something like the following is an example:
((?:-moz-)?(?:transform-origin[^;}]+))
