How does this regexp work? - javascript

RegExes give me headaches. I have a very simple regex but I don't understand how it works.
The code:
var str= "startBlablablablablaend";
var regex = /start(.*?)end/;
var match = str.match(regex);
console.log( match[0] ); //startBlablablablablaend
console.log( match[1] ); //Blablablablabla
What I ultimately want would be the second one, in other words the text between the two delimiters (start,end).
My questions:
How does it work? (each character explained please)
Why does it match two different things?
Is there a better way to get match[1]?
If I want to get all the text between every start-end pair, how would I go about it?
For the last question, what I mean:
var str = "startBla1end startBla2end startBla3end";
var regex = /start(.*?)end/gmi;
var match = str.match(regex);
console.log( match ); // [ "startBla1end" , "startBla2end" , "startBla3end" ]
What I need is:
console.log( match ); // [ "Bla1" , "Bla2" , "Bla3" ];
Thanks :)

How does it work?
start matches the literal text start in the string
(.*?) lazily (non-greedily) matches any run of characters and captures it as group 1
end matches the literal text end in the string
Matching
startBlablablablablaend
|
start
startBlablablablablaend
|
.
startBlablablablablaend
|
.
# and so on: . matches any character (except a line terminator) and the quantifier * repeats it any number of times. The ? makes the match non-greedy, so it stops at the first end
startBlablablablablaend
|
end
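The effect of the lazy ? is easiest to see on a string that contains more than one end. A minimal sketch, where the sample string is made up for illustration and is not from the question:

```javascript
// Hypothetical sample with two start/end pairs.
var sample = "startAend startBend";

// Greedy: .* runs to the end of the string, then backtracks to the LAST "end".
var greedy = sample.match(/start(.*)end/);

// Lazy: .*? grows one character at a time and stops at the FIRST "end".
var lazy = sample.match(/start(.*?)end/);

console.log(greedy[1]); // "Aend startB"
console.log(lazy[1]);   // "A"
```

With only one start/end pair, as in the question's string, both forms capture the same text.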
Why does it match two different things?
It doesn't match two different things.
match[0] contains the entire match.
match[1] contains the first capture group (the part matched inside the first pair of parentheses).
Is there a better way to get match[1]?
Short answer: no.
In languages other than JavaScript, it is possible using lookarounds:
(?<=start)(.*?)(?=end)
#Blablablablabla
Note: when this answer was written, this would not work in JavaScript, which did not support lookbehind (here a positive lookbehind). Lookbehind was only added to the language in ES2018.
Last Question
The best that you can get from a single match statement would be
var str = "startBla1end startBla2end startBla3end";
var regex = /start(.*?)(?=end)/gmi;
var match = str.match(regex);
console.log( match ); // [ "startBla1" , "startBla2" , "startBla3" ]
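If more than one statement is acceptable, the capture groups can be collected by calling exec in a loop. This is a sketch rather than part of the answer above; the names captured and m are chosen here for illustration:

```javascript
var str = "startBla1end startBla2end startBla3end";
var regex = /start(.*?)end/g; // with g, exec() resumes after the previous match

var captured = [];
var m;
while ((m = regex.exec(str)) !== null) {
  captured.push(m[1]); // keep only the first capture group
}

console.log(captured); // [ "Bla1", "Bla2", "Bla3" ]
```

The loop terminates because the pattern can never match an empty string, so lastIndex advances on every iteration.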

This doesn't take much effort.
Try this regex:
start(.*)end
You can also look at this Stack Overflow question, which has already been answered:
Regular Expression to get a string between two strings in Javascript
Hope it helps.

To solve your last question, you can split up your string and iterate:
var str = "startBla1end startBla2end startBla3end";
var str_array = str.split(" ");
Then iterate over each element of the str_array using your existing code to extract each Bla# substring.
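The iteration step might look like the sketch below. It assumes, as in the example string, that the start...end pairs are separated by single spaces and contain no spaces themselves; the name parts is introduced here for illustration:

```javascript
var str = "startBla1end startBla2end startBla3end";
var regex = /start(.*?)end/; // no g flag, so match() returns the capture groups

var parts = str.split(" ").map(function (piece) {
  var m = piece.match(regex);
  return m ? m[1] : null; // null marks a piece without a start...end pair
});

console.log(parts); // [ "Bla1", "Bla2", "Bla3" ]
```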

Related

How to split a string by a character not directly preceded by a character of the same type?

Let's say I have a string: "We.need..to...split.asap". What I would like to do is to split the string by the delimiter ., but I only wish to split by the first . and include any recurring .s in the succeeding token.
Expected output:
["We", "need", ".to", "..split", "asap"]
In other languages, I know that this is possible with a look-behind /(?<!\.)\./ but Javascript unfortunately does not support such a feature.
I am curious to see your answers to this question. Perhaps there is a clever use of look-aheads that presently evades me?
I was considering reversing the string, then re-reversing the tokens, but that seems like too much work for what I am after... plus controversy: How do you reverse a string in place in JavaScript?
Thanks for the help!
Here's a variation of the answer by guest271314 that handles more than two consecutive delimiters:
var text = "We.need.to...split.asap";
var re = /(\.*[^.]+)\./;
var items = text.split(re).filter(function(val) { return val.length > 0; });
It uses the detail that if the split expression includes a capture group, the captured items are included in the returned array. These capture groups are actually the only thing we are interested in; the tokens are all empty strings, which we filter out.
EDIT: Unfortunately there's perhaps one slight bug with this. If the text to be split starts with a delimiter, that will be included in the first token. If that's an issue, it can be remedied with:
var re = /(?:^|(\.*[^.]+))\./;
var items = text.split(re).filter(function(val) { return !!val; });
(I think this regex is ugly and would welcome an improvement.)
You can do this without any lookaheads:
var subject = "We.need.to....split.asap";
var regex = /\.?(\.*[^.]+)/g;
var matches, output = [];
while(matches = regex.exec(subject)) {
output.push(matches[1]);
}
document.write(JSON.stringify(output));
It seemed like it'd work in one line, as it did on https://regex101.com/r/cO1dP3/1, but had to be expanded in the code above because the /g option by default prevents capturing groups from returning with .match (i.e. the correct data was in the capturing groups, but we couldn't immediately access them without doing the above).
See: JavaScript Regex Global Match Groups
An alternative solution with the original one liner (plus one line) is:
document.write(JSON.stringify(
"We.need.to....split.asap".match(/\.?(\.*[^.]+)/g)
.map(function(s) { return s.replace(/^\./, ''); })
));
Take your pick!
Note: This answer can't handle more than 2 consecutive delimiters, since it was written according to the example in the revision 1 of the question, which was not very clear about such cases.
var text = "We.need.to..split.asap";
// split "." if followed by "."
var res = text.split(/\.(?=\.)/).map(function(val, key) {
// if `val[0]` does not begin with "." split "."
// else split "." if not followed by "."
return val[0] !== "." ? val.split(/\./) : val.split(/\.(?!.*\.)/)
});
// concat arrays `res[0]` , `res[1]`
res = res[0].concat(res[1]);
document.write(JSON.stringify(res));
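For comparison, the look-behind from the question can be used directly in engines that implement ES2018 look-behind assertions. That is an assumption about the runtime: the question and answers above were written when JavaScript had no such feature. A minimal sketch:

```javascript
var text = "We.need..to...split.asap";

// Split on a "." only when the character before it is not also a ".".
var tokens = text.split(/(?<!\.)\./);

console.log(JSON.stringify(tokens)); // ["We","need",".to","..split","asap"]
```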

Why is my regex capture group only capturing the last part of the string when it matches multiple parts?

What I Tried
var test = "asdfdas ABCD EFGH";
var regex = /^\S+( [A-Z]{4})+$/;
// Also tried: /^\S+( [A-Z]{4})+$/g
// And: /^\S+( [A-Z]{4})+?$/g
var matches = test.match(regex);
I made a JSFiddle.
What I Expect
The variable matches should become this array:
[
"asdfdas ABCD EFGH",
" ABCD",
" EFGH"
]
What I Get
The variable matches is actually this array:
[
"asdfdas ABCD EFGH",
" EFGH"
]
My Thoughts
My guess is that there's something I'm missing with the capture group and/or $ logic. Any help would be appreciated. (I know I can figure out how to do this in multiple regular expressions, but I want to understand what is happening here.)
Yes, that’s exactly what it does; you’re not doing anything wrong. When a group is given a quantifier, it only captures its last match, and that’s all it will ever do in JavaScript. The general fix is to use multiple regular expressions, as you said, e.g.
var test = "asdfdas ABCD EFGH";
var match = test.match(/^\S+((?: [A-Z]{4})+)$/); // capture all repetitions
var matches = match[1].match(/ [A-Z]{4}/g); // match again to get individual ones
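To get exactly the array the question expected, the two steps can be wrapped in a small helper. The function name matchRepetitions is hypothetical and used only for this sketch:

```javascript
// Returns [fullMatch, repetition1, repetition2, ...], or null if the input does not match.
function matchRepetitions(input) {
  var whole = input.match(/^\S+((?: [A-Z]{4})+)$/); // group 1 spans all repetitions
  if (whole === null) {
    return null;
  }
  var pieces = whole[1].match(/ [A-Z]{4}/g); // second pass splits them apart
  return [whole[0]].concat(pieces);
}

console.log(matchRepetitions("asdfdas ABCD EFGH"));
// [ "asdfdas ABCD EFGH", " ABCD", " EFGH" ]
```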

Using a lookahead to get the last occurrence of a pattern in JavaScript

I was able to build a regex to extract a part of a pattern:
var regex = /\w+\[(\w+)_attributes\]\[\d+\]\[own_property\]/g;
var match = regex.exec( "client_profile[foreclosure_defenses_attributes][0][own_property]" );
match[1] // "foreclosure_defenses"
However, I also have a situation where there will be a repetitive pattern like so:
"client_profile[lead_profile_attributes][foreclosure_defenses_attributes][0][own_property]"
In that case, I want to ignore [lead_profile_attributes] and just extract the portion of the last occurrence as I did in the first example. In other words, I still want to match "foreclosure_defenses" in this case.
Since all patterns will be like [(\w+)_attributes], I tried to do a lookahead, but it is not working:
var regex = /\w+\[(\w+)_attributes\](?!\[(\w+)_attributes\])\[\d+\]\[own_property\]/g;
var match = regex.exec("client_profile[lead_profile_attributes][foreclosure_defenses_attributes][0][own_property]");
match // null
match returns null meaning that my regex isn't working as expected. I added the following:
\[(\w+)_attributes\](?!\[(\w+)_attributes\])
Because I want to match only the last occurrence of the following pattern:
[lead_profile_attributes][foreclosure_defenses_attributes]
I just want to grab the foreclosure_defenses, not the lead_profile.
What might I be doing wrong?
I think I got it working without positive lookahead:
regex = /(\[(\w+)_attributes\])+/
/(\[(\w+)_attributes\])+/
match = regex.exec(str);
["[a_attributes][b_attributes][c_attributes]", "[c_attributes]", "c"]
I was able to also achieve it through noncapturing groups. Output from chrome console:
var regex = /(?:\w+(\[\w+\]\[\d+\])+)(\[\w+\])/;
undefined
regex
/(?:\w+(\[\w+\]\[\d+\])+)(\[\w+\])/
str = "profile[foreclosure_defenses_attributes][0][properties_attributes][0][other_stuff]";
"profile[foreclosure_defenses_attributes][0][properties_attributes][0][other_stuff]"
match = regex.exec(str);
["profile[foreclosure_defenses_attributes][0][properties_attributes][0][other_stuff]", "[properties_attributes][0]", "[other_stuff]"]
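Another option, sketched here under the assumption that the wanted [..._attributes] segment is always the one directly followed by [digits][own_property], is to drop the leading \w+ from the original pattern. Since \w cannot match ] or [, the capture cannot spill across segments:

```javascript
// No leading \w+ and no g flag: exec() always searches from the start of the string.
var regex = /\[(\w+)_attributes\]\[\d+\]\[own_property\]/;

var inputs = [
  "client_profile[foreclosure_defenses_attributes][0][own_property]",
  "client_profile[lead_profile_attributes][foreclosure_defenses_attributes][0][own_property]"
];

var names = inputs.map(function (s) {
  var m = regex.exec(s);
  return m ? m[1] : null;
});

console.log(names); // [ "foreclosure_defenses", "foreclosure_defenses" ]
```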

Regexp to capture comma separated values

I have a string that can be a comma separated list of \w, such as:
abc123
abc123,def456,ghi789
I am trying to find a JavaScript regexp that will return ['abc123'] (first case) or ['abc123', 'def456', 'ghi789'] (without the comma).
I tried:
^(\w+,?)+$ -- Nope, as only the last repeating pattern will be matched, 789
^(?:(\w+),?)+$ -- Same story. I am using non-capturing bracket. However, the capturing just doesn't seem to happen for the repeated word
Is what I am trying to do even possible with regexp? I tried pretty much every combination of grouping, using capturing and non-capturing brackets, and still not managed to get this happening...
If you want to discard the whole input when there is something wrong, the simplest way is to validate, then split:
if (/^\w+(,\w+)*$/.test(input)) {
var values = input.split(',');
// Process the values here
}
If you want to allow empty value, change \w+ to \w*.
Trying to match and validate at the same time with a single regex requires emulating the \G feature, which asserts the position of the last match. Why is \G required? Because it prevents the engine from retrying the match at the next position and thereby bypassing your validation. Remember that ECMAScript regex doesn't have look-behind, so you can't differentiate between the position of an invalid character and the character(s) after it:
something,=bad,orisit,cor&rupt
^^ ^^
When you can't differentiate between the 2 positions, you can't rely on the engine to do a match-all operation alone. While it is possible to use a while loop with RegExp.exec and assert the position of last match yourself, why would you do so when there is a cleaner option?
If you want to salvage whatever is available, torazaburo's answer is a viable option.
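For completeness, the loop-based variant mentioned above can be sketched with the sticky flag y, which behaves much like \G. This assumes an ES2015 or later engine, which the answer above did not; the function name parseList is hypothetical:

```javascript
// Returns the list of values, or null if any part of the input is invalid.
function parseList(input) {
  var re = /(\w+)(,|$)/y; // y: each exec() must match exactly at re.lastIndex
  var values = [];
  var m;
  while ((m = re.exec(input)) !== null) {
    values.push(m[1]);
    if (m[2] === "") {
      return values; // matched $ rather than a comma: reached the end cleanly
    }
  }
  return null; // exec() failed before the end of the input
}

console.log(parseList("abc123,def456,ghi789")); // [ "abc123", "def456", "ghi789" ]
console.log(parseList("something,=bad,orisit,cor&rupt")); // null
```

Validating first and then calling split remains the shorter option; the sketch only shows that the single-pass approach is possible.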
Try this regex:
'/([^,]+)/'
Alternatively, strings in JavaScript have a split method that can split a string based on a delimiter:
s.split(',')
Split on the comma first, then filter out results that do not match:
str.split(',').filter(function(s) { return /^\w+$/.test(s); })
This regex pattern picks out the runs of digits from input that contains special characters such as ., , and #, one match per run:
var val = [1234,1213.1212, 1.3, 1.4];
var re = /[0-9]*[0-9]/gi;
var str = "abc123,def456, asda12, 1a2ass, yy8,ghi789";
var re = /[a-z]{3}\d{3}/g;
var list = str.match(re);
document.write("<BR> list.length: " + list.length);
for(var i=0; i < list.length; i++) {
document.write("<BR>list(" + i + "): " + list[i]);
}
This puts only values in the "abc123" style (three lowercase letters followed by three digits) in the list and nothing else.
Maybe you can use the split function:
var st = "abc123,def456,ghi789";
var res = st.split(',');

Javascript regex: discard end of string match

I want to split a string while preserving the newlines. The string can be anything, so the code must work in every case (newlines at the beginning of the string, at the end of the string, consecutive newlines...).
I'm using this code:
var text = "abcd\nefg\n\nhijk\n"
var matches = text.match(/.*\n?/g)
which produces the following result:
[ 'abcd\n', 'efg\n', '\n', 'hijk\n', '' ]
That is what I need, except for the last match ('').
Currently I use matches.pop() to remove it, but I wonder if the regex could be improved to avoid that match.
Bonus points if you can explain why that match is present (I can't find any reason, but I suck at regexs :-) ).
Use an alternative:
var text = "abcd\nefg\n\nhijk\n";
var matches = text.match(/.+\n?|\n/g);
You can use array#filter:
var matches = text.match(/.*\n?/g).filter(Boolean);
//=> [ 'abcd\n', 'efg\n', '\n', 'hijk\n' ]
Or use a slightly different regex with a non-optional \n (but this assumes a newline is always present after the last line):
var matches = text.match(/.*\n/g);
//=> [ 'abcd\n', 'efg\n', '\n', 'hijk\n' ]
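As for why the empty match appears: both .* and \n? are allowed to match nothing, so after the last character has been consumed the engine can still succeed once more with an empty match at the end position. The sketch below makes that visible by recording match positions. It assumes an engine with String.prototype.matchAll (ES2020), and the names found, last and fixed are illustrative:

```javascript
var text = "abcd\nefg\n\nhijk\n";

// Record where each match starts and what it contains.
var found = Array.from(text.matchAll(/.*\n?/g), function (m) {
  return { index: m.index, match: m[0] };
});
var last = found[found.length - 1];
console.log(last.index === text.length, JSON.stringify(last.match)); // true ""

// Requiring at least one character per match leaves nothing to match at the end.
var fixed = text.match(/.+\n?|\n/g);
console.log(fixed.length); // 4
```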
