Javascript regex: why groups with one syntax, and not the other - javascript

I just spent a couple of hours wondering why a regular expression that I thought I understood, wasn't giving me the results I expected.
Consider these two ways of using the same regular expression:
var str="This will put us on the map!"
var a=str.match(/(?:\bwill\W+)(\w+)(\W+)/g)
alert(a[0]) //will put
alert(a[1]) //undefined
var regex=/(?:\bwill\W+)(\w+)(\W+)/g
var match = regex.exec(str)
alert(match[0]) //will put
alert(match[1]) //put
Obviously, the latter form is working properly; but what's wrong with the former?
Also, for thoroughness:
var re = new RegExp("(?:\\bwill\\W+)(\\w+)(\\W+)","g")
var rematch = re.exec(str)
alert(rematch[0]) //will put
alert(rematch[1]) //put
When I was searching here, I came across this question ("Javascript Regex Missing Groups") which claims that the g flag was causing the problem. However, that is clearly not the problem here, since the RE is exactly the same in the two cases; the only difference is how it's executed.
Thanks for your help!
Edit: The responses below do an excellent job of clearing this up. One thing I learned from this that I'd like to make clear for the record, is that the re.exec() method can be used to get all the matches, and it can also be used to get all the groups, but the way of accessing those two modes is somewhat subtle: With or without the g flag, the return value is always an array with the full match followed by the match groups. It is never an array containing multiple matches. The way to access multiple matches is to call the exec() method again on the same RegExp object.
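The exec() behaviour described in this edit can be sketched roughly as follows. It reuses the pattern and string from the question; the names allMatches and m are made up for the illustration, and this is only one way of writing the loop.

```javascript
// Collect every match, together with its capture groups, by calling exec()
// repeatedly on the same RegExp object.
const str = "This will put us on the map!";
const regex = /(?:\bwill\W+)(\w+)(\W+)/g;

const allMatches = []; // hypothetical name, used only in this sketch
let m;
// With the g flag, exec() resumes from regex.lastIndex and returns null once
// there are no more matches (which also resets lastIndex to 0).
while ((m = regex.exec(str)) !== null) {
  // m[0] is the full match, m[1..] are the groups of this one match.
  allMatches.push({ full: m[0], groups: m.slice(1) });
}

console.log(allMatches);
// → [ { full: "will put ", groups: ["put", " "] } ]
```

Each call returns a single match, so the outer array has to be built by hand.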
It was mystifying to me why I was unable to answer this question myself with several hours of Google searching. The behavior in question is described in the documentation of string.match() and RegExp.exec(), although it was not described in a way that made those come up with any of the search strings related to the way I was experiencing the problem. So, for reference, I'm linking those here:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/exec
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/match

The difference is indeed with the g modifier. When used together with .match() it will yield all values of $0 for each match that was found.
For example:
> "This will put us on the map!".match(/(\w+)/g)
["This", "will", "put", "us", "on", "the", "map"]
But:
> "This will put us on the map!".match(/(\w+)/)
["This", "This"]

With the g flag, string.match() returns only the full matches and leaves off all subexpressions. It only returns $0 for each match.
Without the g flag, it returns the first match followed by its groups, which is why a pattern that consists entirely of one group appears to give duplicates: one element is the full match and the other is the group.
regex.exec(), on the other hand, always keeps the groups in the array; with the g modifier it is meant to be called in a loop, one match per call.
To put it simply:
.match() with g gives you every full match without groups; without g, it gives you the first match and its groups.
.exec() gives you one match and its groups per call.
So use .match() with g when you only want the matches, and use .exec() when you want the groups of every match.

Here's some perspective to clarify the point the others have made:
var str="This will put us on the map will do this!"
var a=str.match(/(?:\bwill\W+)(\w+)(\W+)/g)
console.log(a);
gives you this:
["will put ", "will do "]
So when you have the g modifier, it only does a global match for the full pattern which would normally be $0.
It's not like e.g. php that gives you a multi-dim array of full pattern matches and grouped matches e.g.
preg_match_all('~(?:\bwill\W+)(\w+)(\W+)~',$string,$matches);
Array
(
[0] => Array
(
[0] => will put
[1] => will do
)
[1] => Array
(
[0] => put
[1] => do
)
[2] => Array
(
[0] =>
[1] =>
)
)
In JavaScript, you only ever get a single-dimensional array. So .match() will either give you one element per full-pattern match (with the g modifier), or else element 0 as the full-pattern match and elements 1+ as the groups. Whereas .exec() will only do the latter. Neither one of them will give you a multi-dimensional array with both, like in my PHP example.
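If you do want something shaped like the preg_match_all output, it can be assembled by hand. The following is a minimal sketch, assuming a runtime that supports String.prototype.matchAll (ES2020); the names perMatch, groupCount and byGroup are invented for the example.

```javascript
// Build a preg_match_all-style result: one inner array per group index.
// matchAll() requires the g flag on the regex.
const str = "This will put us on the map will do this!";
const regex = /(?:\bwill\W+)(\w+)(\W+)/g;

// Each item yielded by matchAll looks like an exec() result:
// [fullMatch, group1, group2, ...]. Array.from drops the extra properties.
const perMatch = [...str.matchAll(regex)].map((m) => Array.from(m));

// Transpose, so that byGroup[0] holds all full matches, byGroup[1] all
// first groups, and so on.
const groupCount = perMatch.length > 0 ? perMatch[0].length : 0;
const byGroup = Array.from({ length: groupCount }, (_, i) =>
  perMatch.map((m) => m[i])
);

console.log(byGroup);
// → [ ["will put ", "will do "], ["put", "do"], [" ", " "] ]
```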

Related

Why JS Regexp.exec returns an array with more elements than expected?

I'm attempting to regex match various duration strings (e.g. 1d10h, 30m, 90s, etc.) and have come up with a regex string to split the string into pieces, but it seems that I'm getting two undefined results at the ends that shouldn't be there. I imagine it has to do with the greedy matching via the ? groupings, but I'm not sure how to fix it.
My code looks like this:
const regex = /^(\d+?[d])?(\d+?[h])?(\d+[m])?(\d+[s])?$/gmi
const results = regex.exec('1d10h')
and the results I get look like so:
[
"1d10h",
"1d",
"10h",
undefined,
undefined,
]
I was only expecting the first three results (and in fact, I only really want 1d and 10h) but the two remaining undefined results keep popping up.
You have 4 groups in the regular expression, each enclosed in parentheses ( ... ) and numbered in the natural way: the earlier a group's opening parenthesis appears in the expression, the lower its index.
And, of course, the whole match that could be named a "zero" group.
So, result of regex.exec('1d10h') contains 5 items:
results[0] - the whole expression match
results[i] - match of each group, i in {1,2,3,4}
Since in this case each group is optional (followed by ?) - it is allowed to have undefined in place of any unmatched group.
It is easy to see that if you remove a ? symbol after an unmatched group, the whole expression will fail to match and hence regex.exec('1d10h') will return null.
To get rid of undefined elements just filter them out:
const result = regex.exec('1d10h').filter(x => x);
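One caveat with the line above: the regex in the question carries the g flag, so it remembers lastIndex between calls. If exec() has already been called once on the same regex object with this string, the next call returns null and .filter() throws. Below is a sketch that avoids this, assuming a simplified pattern (no g or m flags, no lazy quantifiers) is acceptable for the asker's inputs; parseDuration is a hypothetical helper name.

```javascript
// No g flag: exec() on a non-global regex is stateless, so every call
// starts matching at index 0.
const durationRegex = /^(\d+d)?(\d+h)?(\d+m)?(\d+s)?$/i;

// Hypothetical helper for this sketch: returns the matched pieces only.
function parseDuration(input) {
  const match = durationRegex.exec(input);
  if (match === null) {
    return null; // the input is not a duration string at all
  }
  // Drop element 0 (the whole match) and any group that did not participate.
  return match.slice(1).filter((part) => part !== undefined);
}

console.log(parseDuration("1d10h")); // → ["1d", "10h"]
console.log(parseDuration("90s"));   // → ["90s"]
console.log(parseDuration("abc"));   // → null
```

Using slice(1) also removes the whole-match entry, which the question says is not wanted.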

Extracting a complicated part of the string with plain Javascript

I have a following string:
Text
I want to extract from this string, with the use of JavaScript 'pl' or 'pl_company_com'
There are a few variables:
jan_kowalski is a name and surname it can change, and sometimes even have 3 elements
the country code (in this example 'pl') will change to other en / de / fr (this is that part of the string i want to get)
the rest of the string remains the same for every case (beginning + everything after starting with _company_com ...
P.S. I tried to do it with split, but my knowledge of JS is very basic and I can't get what I want, please help.
An alternative to Randy Casburn's solution using regex
let out = new URL('https://my.domain.com/personal/jan_kowalski_pl_company_com/Documents/Forms/All.aspx').href.match('.*_(.*_company_com)')[1];
console.log(out);
Or if you want to just get that string with those country codes you specified
let out = new URL('https://my.domain.com/personal/jan_kowalski_pl_company_com/Documents/Forms/All.aspx').href.match('.*_((en|de|fr|pl)_company_com)')[1];
console.log(out);
A proof of concept that this solution also works for other combinations
let urls = [
new URL('https://my.domain.com/personal/jan_kowalski_pl_company_com/Documents/Forms/All.aspx'),
new URL('https://my.domain.com/personal/firstname_middlename_lastname_pl_company_com/Documents/Forms/All.aspx')
]
urls.forEach(url => console.log(url.href.match('.*_(en|de|fr|pl).*')[1]))
I have been very successful before with this kind of problems with regular expressions:
var string = 'Text';
var regExp = /([\w]{2})_company_com/;
var find = string.match(regExp);
console.log(find); // array with found matches
console.log(find[1]); // first group of regexp = country code
First you got your given string. Second you have a regular expression, which is marked with two slashes at the beginning and at the end. A regular expression is mostly used for string searches (you can even replace complicated text in all major editors with it, which can be VERY useful).
In this case it matches exactly two word characters, [\w]{2}, followed directly by _company_com (\w indicates a word character, the [] lists all wanted character types, here only word characters, and the {} indicates the number of characters to be found). Now, to find the wanted part, string.match(regExp) has to be called. Because the regExp has no g flag, it returns an array with the whole matched string followed by all capture groups within the regExp (which are denoted by ()), or null if nothing matched. So in this case you get the country code with find[1], which is the first and only capture group of the regular expression.
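As a runnable illustration of the explanation above: the original string from the question was not preserved, so this sketch borrows the URL from the other answer as an assumed input. extractCountryCode is a hypothetical helper name.

```javascript
// Assumed input, taken from the other answer in this thread.
const url =
  "https://my.domain.com/personal/jan_kowalski_pl_company_com/Documents/Forms/All.aspx";

// Two word characters, captured, directly followed by "_company_com".
const regExp = /(\w{2})_company_com/;

// Hypothetical helper: returns the country code, or null if nothing matched.
function extractCountryCode(text) {
  const found = text.match(regExp);
  // match() returns null when there is no match, so guard before indexing.
  return found === null ? null : found[1];
}

console.log(url.match(regExp)[0]);          // → "pl_company_com"
console.log(extractCountryCode(url));       // → "pl"
console.log(extractCountryCode("no code")); // → null
```

Note that \w also matches digits and the underscore; if only letters are valid, [a-z]{2} would be a stricter choice.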

How to look for a pattern that might be missing some characters, but following a certain order?

I am trying to make a validation for "KQkq" <or> "-". In the first case, any of the letters can be missing (except all of them, in which case it should be "-"). The order of the characters is also important.
Some quick examples of legal values are:
-
Kkq
q
This is for a Chess FEN validation. I have validated the first two parts using:
var fen_parts = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1";
fen_parts = fen_parts.split(" ");
if(!fen_parts[0].replace(/[1-8/pnbrqk]/gi,"").length
&& !fen_parts[1].replace(/[wb]/,"").length
&& !fen_parts[2].replace(/[kq-]/gi,"").length /*not working, allows KKKKKQkq to be valid*/
){
//...
}
But simply using /[kq-]/gi to validate the third part allows too many things to be introduced, here are some quick examples of illegal examples:
KKKKQkq (there is more than one K)
QK (order is incorrect)
You can do
-|K?Q?k?q?
though you will need to do a second test to ensure that the input is not empty. Alternatively, using only regex:
KQ?k?q?|Qk?q?|kq?|q|-
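Here is a small sketch of how the regex-only pattern could be used for validation. The anchors and the non-capturing group are additions made for this example (without them the pattern would also match inside longer strings), and castlingField and samples are made-up names.

```javascript
// Anchored version of the alternation above. The (?: ... ) wrapper makes
// ^ and $ apply to the whole alternation, not just its first and last branch.
const castlingField = /^(?:KQ?k?q?|Qk?q?|kq?|q|-)$/;

const samples = ["-", "Kkq", "q", "KQkq", "KKKKQkq", "QK", ""];
for (const s of samples) {
  console.log(JSON.stringify(s), castlingField.test(s));
}
// "-" true, "Kkq" true, "q" true, "KQkq" true,
// "KKKKQkq" false, "QK" false, "" false
```

Because every branch starts with a mandatory character, the empty string is rejected without a second test.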
This seems to work for me...
^(-|(K)?((?!\2)Q)?((?!\2\3)k)?((?!\2\3\4)q)?)$
A .match() returns null if the expression did not match. In that case you can use the logical OR to default to an array with an empty-string (a structure similar to the one returned by .match() on a successful match), which will allow you to check the length of the matched expression. The length will be 0 if the expression did not match, or K?Q?k?q? matched the empty string. If the pattern matches, the length will be > 0. in code:
("KQkq".match(/^(?:K?Q?k?q?|-)$/) || [""])[0].length
Because | has lower precedence than you might expect (lower than the ^ and $ anchors next to it), it is necessary to wrap your actual expression in a non-capturing group (?:).
Having answered the question, let's have a look at the rest of your code:
if (!fen_parts[0].replace(/[1-8/pnbrqk]/gi,"").length)
is, from the javascript's perspective equivalent to
if (!fen_parts[0].match(/[^1-8/pnbrqk]/gi))
which translates to "false if any character but 1-8/pnbrqk". This notation is not only simpler to read, it should also execute faster, as no intermediate string has to be built (replace) and measured (length).
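To make the equivalence concrete, here is a sketch comparing the two forms, plus a third variant based on RegExp.prototype.test() that returns a boolean directly. The sample strings and the names viaReplace, viaMatch and viaTest are assumptions for illustration. The slash inside the character class is escaped here, which is equivalent to the unescaped form.

```javascript
const placement = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR";
const broken = "rnbqkbnr/ppppxppp/8/8/8/8/PPPPPPPP/RNBQKBNR"; // contains an "x"

// Original form: strip every allowed character and check nothing is left.
const viaReplace = (s) => !s.replace(/[1-8\/pnbrqk]/gi, "").length;

// Rewritten form: look for any character that is NOT allowed.
const viaMatch = (s) => !s.match(/[^1-8\/pnbrqk]/i);

// test() variant. No g flag here: test() on a global regex advances
// lastIndex between calls, which can give surprising results on reuse.
const viaTest = (s) => !/[^1-8\/pnbrqk]/i.test(s);

for (const s of [placement, broken]) {
  console.log(viaReplace(s), viaMatch(s), viaTest(s));
}
// → true true true
// → false false false
```

All three accept the empty string, so a separate length check is still needed if an empty field must be rejected.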

JavaScript regex back references returning an array of matches from single capture group (multiple groups)

I'm fairly sure after spending the night trying to find an answer that this isn't possible, and I've developed a work around - but, if someone knows of a better method, I would love to hear it...
I've gone through a lot of iterations on the code, and the following is just a line of thought really. At some point I was using the global flag, I believe, in order for match() to work, and I can't remember if it was necessary now or not.
var str = "#abc#def#ghi&jkl";
var regex = /^(?:#([a-z]+))?(?:&([a-z]+))?$/;
The idea here, in this simplified code, is the optional group 1, of which there is an unspecified amount, will match #abc, #def and #ghi. It will only capture the alpha characters of which there will be one or more. Group 2 is the same, except matches on & symbol. It should also be anchored to the start and end of the string.
I want to be able to back reference all matches of both groups, ie:
result = str.match(regex);
alert(result[1]); //abc,def,ghi
alert(result[1][0]); //abc
alert(result[1][1]); //def
alert(result[1][2]); //ghi
alert(result[2]); //jkl
My mate says this works fine for him in .net, unfortunately I simply can't get it to work - only the last matched of any group is returned in the back reference, as can be seen in the following:
(additionally, making either group optional makes a mess, as does setting global flag)
var str = "#abc#def#ghi&jkl";
var regex = /(?:#([a-z]+))(?:&([a-z]+))/;
var result = str.match(regex);
alert(result[1]); //ghi
alert(result[1][0]); //g
alert(result[2]); //jkl
The following is the solution I arrived at, capturing the whole portion in question, and creating the array myself:
var str = "#abc#def#ghi&jkl";
var regex = /^([#a-z]+)?(?:&([a-z]+))?$/;
var result = regex.exec(str);
alert(result[1]); //#abc#def#ghi
alert(result[2]); //jkl
var result1 = result[1].toString();
result[1] = result1.split('#')
alert(result[1][1]); //abc
alert(result[1][2]); //def
alert(result[1][3]); //ghi
alert(result[2]); //jkl
That's simply not how .match() works in JavaScript. The returned array is an array of simple strings. There's no "nesting" of capture groups; you just count the ( symbols from left to right.
The first string (at index [0]) is always the overall matched string. Then come the capture groups, one string (or undefined, if the group did not participate in the match) per array element.
You can, as you've done, rearrange the result array to your heart's content. It's just an array.
edit — oh, and the reason your result[1][0] was "g" is that array indexing notation applied to a string gets you the individual characters of the string.
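For what it's worth, the repeated captures can be collected in two steps instead of splitting by hand. This is a sketch only, assuming an ES2020 runtime for matchAll; the pattern named shape is a variation on the asker's workaround regex, and the variable names are invented here.

```javascript
const str = "#abc#def#ghi&jkl";

// Step 1: validate the overall shape and isolate the "#..." section (group 1)
// and the "&..." value (group 2).
const shape = /^((?:#[a-z]+)*)(?:&([a-z]+))?$/;
const whole = shape.exec(str);

// Step 2: walk the "#..." section with a global regex, keeping group 1 of
// every individual match.
const hashParts =
  whole === null
    ? []
    : [...whole[1].matchAll(/#([a-z]+)/g)].map((m) => m[1]);
const ampPart = whole === null ? undefined : whole[2];

console.log(hashParts); // → ["abc", "def", "ghi"]
console.log(ampPart);   // → "jkl"
```

The first regex keeps the start and end anchoring the question asks for, while the second one does the per-item capturing that a single match() call cannot.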

Regex to extract substring, returning 2 results for some reason

I need to do a lot of regex things in javascript but am having some issues with the syntax and I can't seem to find a definitive resource on this.. for some reason when I do:
var tesst = "afskfsd33j"
var test = tesst.match(/a(.*)j/);
alert (test)
it shows
"afskfsd33j, fskfsd33"
I'm not sure why it's giving this output of the original and the matched string. I am wondering how I can get it to give just the match (essentially extracting the part I want from the original string).
Thanks for any advice
match returns an array.
The default string representation of an array in JavaScript is the elements of the array separated by commas. In this case the desired result is in the second element of the array:
var tesst = "afskfsd33j"
var test = tesst.match(/a(.*)j/);
alert (test[1]);
Each group defined by parentheses () is captured during processing, and the content of each captured group is pushed into the result array in the same order as the groups start within the pattern. See more on http://www.regular-expressions.info/brackets.html and http://www.regular-expressions.info/refcapture.html (choose the right language to see supported features)
var source = "afskfsd33j"
var result = source.match(/a(.*)j/);
result: ["afskfsd33j", "fskfsd33"]
The reason why you received this exact result is the following:
The first value in the array is the first found string that matches the entire pattern. So it starts with "a", followed by any number of any characters, and ends with the last "j" after the starting "a" (because .* is greedy).
The second value in the array is the captured group defined by the parentheses. In your case the group contains the entire pattern match without the content defined outside the parentheses, so exactly "fskfsd33".
If you want to get rid of the second value in the array, you may define the pattern like this:
/a(?:.*)j/
where "?:" means that the characters matching the content in the parentheses will not be part of the resulting array.
Another option in this simple case is to write the pattern without any group, because a group is not necessary at all:
/a.*j/
If you just want to check whether the source text matches the pattern and do not care about which text it found, then you may try:
var result = /a.*j/.test(source);
The result will then be only true or false. For more info see http://www.javascriptkit.com/javatutors/re3.shtml
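The options discussed in this answer can be compared side by side. A short sketch using the string from the question; the variable names are chosen here for illustration.

```javascript
const source = "afskfsd33j";

// Capturing group: full match at index 0, group content at index 1.
const withGroup = source.match(/a(.*)j/);
console.log(Array.from(withGroup)); // → ["afskfsd33j", "fskfsd33"]
console.log(withGroup[1]);          // → "fskfsd33"

// Non-capturing group: only the full match is returned.
const nonCapturing = source.match(/a(?:.*)j/);
console.log(Array.from(nonCapturing)); // → ["afskfsd33j"]

// test(): just a boolean, no array at all.
const matches = /a.*j/.test(source);
console.log(matches); // → true
```

Array.from is used only to print the plain elements, since the array returned by match() also carries extra properties such as index and input.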
I think your problem is that the match method is returning an array. The 0th item in the array is the whole matched string; the 1st through nth items correspond to the 1st through nth parenthesised groups. Your "alert()" call is showing the entire array.
Just get rid of the parentheses and that will give you an array with one element:
Change this line
var test = tesst.match(/a(.*)j/);
To this
var test = tesst.match(/a.*j/);
If you add parentheses, the match() function will return two items for you: one for the whole expression and one for the expression inside the parentheses.
Also according to developer.mozilla.org docs :
If you only want the first match found, you might want to use
RegExp.exec() instead.
You can use the below code:
RegExp(/a.*j/).exec("afskfsd33j")
I've just had the same problem.
You get the text twice in your result whenever you include a match group (in brackets), unless you call match() with the 'g' (global) modifier.
The first item is always the full match, which is normally OK when using match(reg) on a short string; however, when using a construct like:
while ((result = reg.exec(string)) !== null){
console.log(result);
}
the results are a little different.
Try the following code:
var regEx = new RegExp('([0-9]+ (cat|fish))','g'), sampleString="1 cat and 2 fish";
var result = sampleString.match(regEx);
console.log(JSON.stringify(result));
// ["1 cat","2 fish"]
var reg = new RegExp('[0-9]+ (cat|fish)','g'), sampleString="1 cat and 2 fish";
while ((result = reg.exec(sampleString)) !== null) {
console.dir(JSON.stringify(result))
};
// '["1 cat","cat"]'
// '["2 fish","fish"]'
var reg = new RegExp('([0-9]+ (cat|fish))','g'), sampleString="1 cat and 2 fish";
while ((result = reg.exec(sampleString)) !== null){
console.dir(JSON.stringify(result))
};
// '["1 cat","1 cat","cat"]'
// '["2 fish","2 fish","fish"]'
(tested on recent V8 - Chrome, Node.js)
The best answer is currently a comment which I can't upvote, so credit to @Mic.
