How does .split(/_(.+)?/)[i] work? - javascript

After finding this solution useful,
split string only on first instance of specified character
I'm confused about how this actually works. One top comment explains, "Just to be clear, the reason this solution works is because everything after the first _ is matched inside a capturing group, and gets added to the token list for that reason." - @Alan Moore
That doesn't make sense to me; what's a "capturing group"? Additionally, the author's positive-rated solution,
"good_luck_buddy".split(/_(.+)?/)[1]
"luck_buddy"
is noted in the comments as being improvable by omitting the question mark, ?,
split(/_(.+)/)
or omitting the question mark and replacing the plus sign, +, with an asterisk, *.
split(/_(.*)/)
Which is actually the best solution and why?
Thank you.

"good_luck_buddy".split(/_(.+)?/)
doesn't really make much sense. It's essentially the same as
"good_luck_buddy".split(/_(.*)/)
("match 1 or more, optionally" is the same as "match 0 or more").
The behaviour of regex.split in most languages is "take pieces of string that do not match":
"a_#b_#c".split(/_#/) => ["a", "b", "c"]
If the split expression contains capturing groups (...), these are also included in the resulting list:
"a_#b_#c".split(/_(#)/) => ["a", "#", "b", "#", "c"]
So the above code
"good_luck_buddy".split(/_(.*)/)
works as follows:
it finds the first piece in the string that doesn't match _(.*). This is good.
it finds a piece that does match _(.*). This is _luck_buddy. Since there's a capturing group, its content (luck_buddy) is also included in the output
finally, it finds the next piece that doesn't match _(.*). This is an empty string, and it's added to the output, so the output becomes ["good", "luck_buddy", ""]
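The steps above can be checked directly. This is a minimal sketch that runs the three variants from the question on the same input; the variable names are only for illustration.

```javascript
// All three patterns produce the same tokens for this input.
var input = "good_luck_buddy";
var variants = [/_(.+)?/, /_(.+)/, /_(.*)/];

var results = variants.map(function (re) {
  return input.split(re);
});
// Each entry is ["good", "luck_buddy", ""]

// Without a capturing group the matched text is discarded;
// with one, the captured text is spliced into the result.
var plain = "a_#b_#c".split(/_#/);     // ["a", "b", "c"]
var grouped = "a_#b_#c".split(/_(#)/); // ["a", "#", "b", "#", "c"]
```

The trailing empty string is the piece left over after the match, which is why the original solution indexes the result with [1].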
To address the "what's the best" part, I'd use the second voted solution for a literal splitter:
result = str.slice(str.indexOf('_') + 1)
and .replace for a regex splitter:
result = str.replace(/.*?<regex>/, '')
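As a rough sketch of both suggestions, here they are with the placeholder filled in by a literal underscore. The helper name afterFirst is made up for this example, and returning the whole string when the separator is missing is an assumption (it matches what the one-line slice version does).

```javascript
// Hypothetical helper: everything after the first occurrence of a literal separator.
function afterFirst(str, sep) {
  var index = str.indexOf(sep);
  // Assumption: return the whole string when the separator is absent.
  return index === -1 ? str : str.slice(index + sep.length);
}

var viaSlice = afterFirst("good_luck_buddy", "_"); // "luck_buddy"

// Regex splitter: lazily drop everything up to and including the first match.
var viaReplace = "good_luck_buddy".replace(/.*?_/, ""); // "luck_buddy"
```

Note that . does not match line breaks by default, so the replace form is best suited to single-line input.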

I'm not going to explain how basic regex works ("what is a capture group" ...). But to answer your question "which is best and why": it's just a matter of performance. Different regexes result in different processing times in the regex engine.
See this jsperf comparison:
http://jsperf.com/regex-split-on-first-occurence-of-char
I tested IE11, FF and Chrome. There is no really noticeable difference between the three regex variants in this case.

No need for a regular expression. Just find the index of the '_' (underscore) and take the substring.
function head(str, pattern) {
  // Everything before the first occurrence of pattern
  var index = str.indexOf(pattern);
  return index > -1 ? str.substring(0, index) : '';
}
function tail(str, pattern) {
  // Everything after the first occurrence; skip the whole pattern, not just one character
  var index = str.indexOf(pattern);
  return index > -1 ? str.substring(index + pattern.length) : '';
}
function foot(str, pattern) { // Made this one up...
  // Everything after the last occurrence of pattern
  var index = str.lastIndexOf(pattern);
  return index > -1 ? str.substring(index + pattern.length) : '';
}
var str = "good_luck_buddy";
var pattern = '_';
document.body.innerHTML = head(str, pattern) + '<br />';
document.body.innerHTML += tail(str, pattern) + '<br />';
document.body.innerHTML += foot(str, pattern);
If you want to find the index of a pattern (regex) in a string, this question will show you the way:
Polyfill for String.prototype.regexIndexOf(regex, startpos)
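For a quick idea of what such a helper might look like, here is a minimal sketch built on String.prototype.search. It is written as a standalone function rather than a prototype extension, and the name and signature simply mirror the linked question's title; it is not a built-in.

```javascript
// Hypothetical helper (not a built-in): index of the first regex match
// at or after startpos, or -1 when there is none.
function regexIndexOf(str, regex, startpos) {
  var offset = startpos || 0;
  // search() works on the remainder, so shift the result back by the offset
  var index = str.substring(offset).search(regex);
  return index >= 0 ? index + offset : -1;
}

var first = regexIndexOf("good_luck_buddy", /_/);     // 4
var second = regexIndexOf("good_luck_buddy", /_/, 5); // 9
```

Because the search runs on a substring, anchors such as ^ are relative to startpos rather than to the start of the original string.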

Related

Need a javascript regex to match word without different word both before and after

I'm trying to match the full word "and" EXCEPT when it appears in idioms that repeat the same word before and after it, like "more and more" or "again and again". I've got this:
/(\b\w*\b)\s\band\s(?!\1)/gi
Which works, except it also captures the word before "and." I understand you can't do lookbehind regex in JS... Any help appreciated!
EXAMPLE:
It should match this and plus this and but not more and more or less and less.
Would it be possible for you to test the opposite: test for the existence of matching pairs and negate the result?
/(\b\w+\b)\s\band\s(\1)/gi
I did a quick few tests on this at https://regex101.com/r/ZqXbtL/1/tests and it gave me the results I would expect with negation.
I think it's better to get the index of each matched item and add the length of the word and the space after it to arrive at the index of "and"; from there you can use substring or anything similar. I know it's not exactly what you want, but maybe it helps.
The snippet below gets the index of the second "and".
edit: got help from this answer
var reg = /(\b\w*\b)\s(?=\band\s(?!\1))/g;
var match,indexes = [];
while(match = reg.exec(document.body.innerHTML)){
indexes.push(match.index + match[0].length);
}
// testing
indexes.forEach(function(element){
document.body.innerHTML += "<br>\"" +
document.body.innerHTML.substr(element,3) + "\" found at " + element;
});
test and test<br>
test and not
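The question assumes lookbehind is unavailable. Engines that implement ES2018 do support it, so a different technique from the answers above becomes possible: match only "and" itself and check both neighbours with lookarounds. This is a sketch under that assumption, using the example sentence from the question.

```javascript
// Assumes an engine with ES2018 lookbehind support.
// (?<=\b(\w+)\s)  the word before "and" is captured as group 1 without being consumed
// (?!\s\1\b)      the word after "and" must not be that same word
var re = /(?<=\b(\w+)\s)and\b(?!\s\1\b)/gi;
var text = "It should match this and plus this and but not more and more or less and less.";

var positions = [];
var m;
while ((m = re.exec(text)) !== null) {
  positions.push(m.index); // index of "and" itself, not of the preceding word
}
// positions is [21, 35]
```

Since the preceding word sits inside the lookbehind, m[0] is just "and" and no index arithmetic is needed.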

Don't replace regex if it is enclosed by a character

I would like to replace all strings that are enclosed by - into strings enclosed by ~, but not if this string again is enclosed by *.
As an example, this string...
The -quick- *brown -f-ox* jumps.
...should become...
The ~quick~ *brown -f-ox* jumps.
We see - is only replaced if it is not within *<here>*.
My javascript-regex for now (which takes no care whether it is enclosed by * or not):
var message = source.replace(/-(.[^-]+?)-/g, "~$1~");
Edit: Note that it might be the case that there is an odd number of *s.
That's a tricky sort of thing to do with regular expressions. I think what I'd do is something like this:
var msg = source.replace(/(-[^-]+-|\*[^*]+\*)/g, function(_, grp) {
return grp[0] === '-' ? grp.replace(/^-(.*)-$/, "~$1~") : grp;
});
jsFiddle Demo
That looks for either - or * groups, and only performs the replacement on dashed ones. In general, "nesting" syntaxes are challenging (or impossible) with regular expressions. (And of course as a comment on the question notes, there are special cases — dangling metacharacters — that complicate this too.)
I would solve it by splitting the string on * and then replacing only within the even-indexed pieces (the odd-indexed ones sit between stars). Matching unbalanced stars is trickier; it involves knowing whether the last piece's index is odd or even:
'The -quick- *brown -f-ox* jumps.'
  .split('*')
  .map(function(item, index, arr) {
    if (index % 2) {
      if (index < arr.length - 1) {
        return '*' + item + '*'; // balanced: restore the stars, leave the content alone
      }
      // not balanced: restore the lone opening star and treat the rest as plain text
      item = '*' + item;
    }
    return item.replace(/-([^-]+)-/g, '~$1~');
  })
  .join('');
Demo
Finding out whether a match is not enclosed by some delimiters is a very complicated task - see also this example. Lookaround could help, but at the time of writing JS only supported lookahead (lookbehind arrived with ES2018). So we could rewrite "not surrounded by *" as "followed by an even number of *", and match on that:
source.replace(/-([^-]+)-(?=(?:[^*]*\*[^*]*\*)*[^*]*$)/g, "~$1~");
But better we match on both - and *, so that we consume anything wrapped in *s as well and can then decide in a callback function not to replace it:
source.replace(/-([^-]+)-|\*([^*]+)\*/g, function(m, hyp) {
if (hyp) // the first group has matched
return "~"+hyp+"~";
// else let the match be unchanged:
return m;
});
This has the advantage of being able to specify "enclosed" more precisely, e.g. by adding word boundaries on the "inside", for better handling of invalid patterns (an odd number of * characters, as mentioned by @Maras, for example) - the current regex just takes the next two appearances.
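To see the callback approach on the question's input, here is a small sketch. The wrapper name replaceOutsideStars is invented for the example; the regex is the one from the answer.

```javascript
// Hypothetical wrapper around the callback approach.
function replaceOutsideStars(source) {
  return source.replace(/-([^-]+)-|\*([^*]+)\*/g, function (match, hyp) {
    // hyp is defined only when the -...- alternative matched;
    // a *...* match is consumed and returned unchanged
    return hyp ? "~" + hyp + "~" : match;
  });
}

var balanced = replaceOutsideStars("The -quick- *brown -f-ox* jumps.");
// "The ~quick~ *brown -f-ox* jumps."

var unbalanced = replaceOutsideStars("a -b- *c -d-");
// "a ~b~ *c ~d~" (a lone * has no partner, so what follows is treated as plain text)
```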
A terser version of Jack's very clear answer.
source.split(/(\*[^*]*\*)/g).map(function(x,i){
return i%2?x:x.replace(/-/g,'~');
}).join('');
Seems to work,
Cheers.

Moving index in JavaScript regex matching

I have this regex to extract double words from text
/[A-Za-z]+\s[A-Za-z]+/g
And this sample text
Mary had a little lamb
My output is this
[0] - Mary had; [1] - a little;
Whereas my expected output is this:
[0] - Mary had; [1] - had a; [2] - a little; [3] - little lamb
How can I achieve this output? As I understand it, the index of the search moves to the end of the first match. How can I move it back one word?
Abusing String.replace function
I use a little trick with the replace function. Since replace loops through the matches and lets us supply a function, the possibilities are endless. The result will be in output.
var output = [];
var str = "Mary had a little lamb";
str.replace(/[A-Za-z]+(?=(\s[A-Za-z]+))/g, function ($0, $1) {
output.push($0 + $1);
return $0; // Actually we don't care. You don't even need to return
});
Since the output contains overlapping portions of the input string, we must not consume the next word while matching the current one; look-ahead does exactly that [1].
The regex /[A-Za-z]+(?=(\s[A-Za-z]+))/g does exactly what I described above: it consumes only one word at a time with the [A-Za-z]+ portion (the start of the regex), looks ahead for the next word with (?=(\s[A-Za-z]+)) [2], and also captures the text matched there.
The function passed to replace receives the matched string as its first argument and the captured text in subsequent arguments. (There are more - check the documentation - I don't need them here.) Since the look-ahead is zero-width (the input is not consumed), the whole match is conveniently just the first word. The text captured inside the look-ahead goes into the 2nd argument.
Proper solution with RegExp.exec
Note that String.replace function incurs a replacement overhead, since the replacement result is not used at all. If this is unacceptable, you can rewrite the above code with RegExp.exec function in a loop:
var output = [];
var str = "Mary had a little lamb";
var re = /[A-Za-z]+(?=(\s[A-Za-z]+))/g;
var arr;
while ((arr = re.exec(str)) != null) {
output.push(arr[0] + arr[1]);
}
Footnotes
[1] In other regex flavors that support variable-width look-behind it is possible to retrieve the previous word instead. JavaScript regex did not support look-behind when this was written; it was added in ES2018.
[2] (?=pattern) is the syntax for look-ahead.
Appendix
String.match can't be used here since it ignores the capturing group when g flag is used. The capturing group is necessary in the regex, as we need look-around to avoid consuming input and match overlapping text.
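The difference is easy to see side by side. This is a small sketch using the same regex and input as above; the variable names are only illustrative.

```javascript
var str = "Mary had a little lamb";
var re = /[A-Za-z]+(?=(\s[A-Za-z]+))/g;

// With the g flag, match returns whole matches only; the captures are dropped.
var firstWords = str.match(re); // ["Mary", "had", "a", "little"]

// exec exposes the capture on every iteration.
var pairs = [];
var m;
while ((m = re.exec(str)) !== null) {
  pairs.push(m[0] + m[1]);
}
// pairs is ["Mary had", "had a", "a little", "little lamb"]
```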
It can be done without regexp
"Mary had a little lamb".split(" ")
.map(function(item, idx, arr) {
if(idx < arr.length - 1){
return item + " " + arr[idx + 1];
}
}).filter(function(item) {return item;})
Here's a non-regex solution (it's not really a regular problem).
function pairs(str) {
var parts = str.split(" "), out = [];
for (var i=0; i < parts.length - 1; i++)
out.push([parts[i], parts[i+1]].join(' '));
return out;
}
Pass your string and you get an array back.
demo
Side note: if you're worried about non-words in your input (making a case for regular expressions!) you can run tests on parts[i] and parts[i+1] inside the for loop. If the tests fail: don't push them onto out.
Another approach you might like:
var s = "Mary had a little lamb";
// Break on each word and loop
s.match(/\w+/g).map(function(w) {
// Get the word, a space and another word
return s.match(new RegExp(w + '\\s\\w+'));
// At this point, there is one "null" value (the last word), so filter it out
}).filter(Boolean)
// There, we have an array of matches -- we want the matched value, i.e. the first element
.map(Array.prototype.shift.call.bind(Array.prototype.shift));
If you run this in your console, you'll see ["Mary had", "had a", "a little", "little lamb"].
This way, you keep your original regex and can do whatever else you want with it, although it needs some code around it to really work.
By the way, this code is not cross-browser. The following functions are not supported in IE8 and below:
Array.prototype.filter
Array.prototype.map
Function.prototype.bind
But they're easily shimmable. Or the same functionality is easily achievable with for.
Here we go:
The key is how the regular expression's internal pointer works, so I will explain it with a little example:
Mary had a little lamb with this regex /[A-Za-z]+\s[A-Za-z]+/g
Here, the first part of the regex, [A-Za-z]+, will match Mary, so the pointer will be just after the y:
Mary had a little lamb
    ^
In the next part (\s[A-Za-z]+) it will match a space followed by another word, so...
Mary had a little lamb
        ^
The pointer will be where the word had ends. So here's your problem: you are advancing the internal pointer of the regular expression without wanting to. How is this solved? Lookaround is your friend. With lookarounds (lookahead and lookbehind) you can walk through your text without advancing the main internal pointer of the regular expression (it uses another pointer for that).
So in the end, the regular expression that matches what you want is: ([A-Za-z]+(?=\s[A-Za-z]+))
Explanation:
The only thing you don't know about that regular expression is the (?=\s[A-Za-z]+) part. It means that [A-Za-z]+ must be followed by a word, or else the regular expression won't match. This is exactly what you seem to want, because the internal pointer will not be advanced past the following word, and every word but the last one will match, since the last one isn't followed by a word.
Then, once you have that, you only have to swap it in for the regex you are using right now.
Here you have a working example, DEMO
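The pointer described above is exposed in JavaScript as the regex's lastIndex property, so the two behaviours can be observed directly. A minimal sketch follows; the trace helper is made up for this illustration.

```javascript
// Hypothetical helper: record each match together with where the pointer
// (lastIndex) sits afterwards.
function trace(re, str) {
  var steps = [];
  var m;
  while ((m = re.exec(str)) !== null) {
    steps.push({ match: m[0], lastIndex: re.lastIndex });
  }
  return steps;
}

var text = "Mary had a little lamb";

// Consuming both words: the pointer jumps past "had", so "had a" is never tried.
var consuming = trace(/[A-Za-z]+\s[A-Za-z]+/g, text);
// "Mary had" (lastIndex 8), "a little" (lastIndex 17)

// Lookahead: only the first word is consumed, so every word gets its turn.
var peeking = trace(/[A-Za-z]+(?=\s[A-Za-z]+)/g, text);
// "Mary" (4), "had" (8), "a" (10), "little" (17)
```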
In full admiration of the concept of 'look-ahead', I still propose a pairwise function (demo), since it's really Regex's task to tokenize a character stream, and the decision of what to do with the tokens is up to the business logic. At least, that's my opinion.
A shame that Javascript hasn't got a pairwise, yet, but this could do it:
function pairwise(a, f) {
for (var i = 0; i < a.length - 1; i++) {
f(a[i], a[i + 1]);
}
}
var str = "Mary had a little lamb";
pairwise(str.match(/\w+/g), function(a, b) {
document.write("<br>"+a+" "+b);
});

Regex: match word (but delete commas after OR before)

I have tried to delete an item from a string divided with commas:
var str="this,is,unwanted,a,test";
if I do a simple str.replace('unwanted',''); I end up with 2 commas
if I do a more complex str.replace('unwanted','').replace(',,','');
It might work
But the problem comes when the str is like this:
var str="unwanted,this,is,a,test"; // or "...,unwanted"
However, I could do a 'if char at [0 or str.length] == comma', then remove it
But I really think this is not the way to go, it is absurd I need to do 2 replaces and 2 ifs to achieve what I want
I have heard that regex can do powerful stuff, but I simply can't understand it no matter how hard I try
Important Notes:
It should match after OR before (not both), or we will end with
"this,is,,a,test"
There are no spaces between commas
How about something less flaky than a regex for this sort of replacement?
str = str
.split(',')
.filter(function(token) { return token !== 'unwanted' })
.join(',');
jsFiddle.
However if you are convinced a regex is the best way...
str = str.replace(/(^|,)?unwanted(,|$)?/g, function(all, leading, trailing) {
return leading && trailing ? ',' : '';
});
(thanks Logan F. Smyth.)
jsFiddle.
Since Alex hasn't fixed this in his solution, I wanted to get a fully functional version up somewhere.
var unwanted = 'unwanted';
var regex = new RegExp('(^|,)' + unwanted + '(,|$)', 'g');
str = str.replace(regex, function(a, pre, suf) {
return pre && suf ? ',' : '';
});
The only thing to be careful of when dynamically building a regex is that the 'unwanted' variable can't have anything in it that could be interpreted as a regex pattern.
There are way easier ways to parse this though, as Alex mentioned. Don't resort to regular expressions unless you have to.
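One common way to handle that caveat is to escape the regex metacharacters before building the pattern. This is a sketch, not part of the answers above; escapeRegExp and removeItem are names chosen for the example.

```javascript
// Escape characters that have a special meaning in a regex pattern.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: remove exact occurrences of `unwanted`
// from a comma-separated list.
function removeItem(list, unwanted) {
  var regex = new RegExp("(^|,)" + escapeRegExp(unwanted) + "(,|$)", "g");
  return list.replace(regex, function (all, pre, suf) {
    // Keep one comma only when the item sat between two others.
    return pre && suf ? "," : "";
  });
}

var middle = removeItem("this,is,unwanted,a,test", "unwanted"); // "this,is,a,test"
var start = removeItem("unwanted,this,is,a,test", "unwanted");  // "this,is,a,test"
var dotted = removeItem("a,b.c,d", "b.c");                      // "a,d"
```

Two adjacent unwanted items share a comma, so a single pass can leave the second one behind ("a,x,x,b" becomes "a,x,b"); the split/filter/join version does not have that limitation.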

Regex to match all '&' before first '?'

Basically, I want to do "zipzam&&&?&&&?&&&" -> "zipzam%26%26%26?&&&?&&&". I can do that without regex many different ways, but it'd cleanup things a tad bit if I could do it with regex.
Thanks
Edit: "zip=zam&&&=?&&&?&&&" -> "zip=zam%26%26%26=?&&&?&&&" should make things a little clearer.
Edit: "zip=zam=&=&=&=?&&&?&&&" -> "zip=zam=%26=%26=%26=?&&&?&&&" should make things clearer.
However, these are just examples. I still want to replace all '&' before the first '?' no matter where the '&' are before the first '?' and no matter if the '&' are consecutive or not.
This should do it:
"zip=zam=&=&=&=?&&&?&&&".replace(/^[^?]+/, function(match) { return match.replace(/&/g, "%26"); });
You need negative lookbehinds, which are tricky to replicate in JS, but fortunately there are ways and means:
var x = "zipzam&&&?&&&?&&&";
x.replace(/(&+)(?=.*?\?)/,function ($1) {for(var i=$1.length, s='';i;i--){s+='%26';} return s;})
commentary: this works because it's not global. The first match is therefore a given, and the trick of replacing all of the matching "&" chars 1:1 with "%26" is achieved with the function loop
edit: a solution for unknown groupings of "&" can be achieved simply (if perhaps a little clunkily) with a little modification. The basic pattern for replacer methods is infinitely flexible.
var x = "zipzam&foo&bar&baz?&&&?&&&";
var f = function ($1,$2)
{
return $2 + ($2=='' || $2.indexOf('?')>-1 ? '&' : '%26')
}
x.replace(/(.*?)&(?=.*?\?)/g,f)
This should do it:
^[^?]*&[^?]*\?
Or this one, I think:
^[^?]*(&+?)\?
In this case regexes are really not the most appropriate things to use. A simple search for the first index of '?' and then replacing each '&' character would be best. However, if you really want a regex then this should do the job.
(?:.*?(&))*?\?
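The index-based approach mentioned in this answer could be sketched as follows. The function name is invented for the example, and leaving the string untouched when it contains no '?' is an assumption, since the question does not cover that case.

```javascript
// Hypothetical helper: encode every "&" that appears before the first "?".
function encodeBeforeFirstQuestionMark(str) {
  var index = str.indexOf("?");
  // Assumption: leave the string untouched when there is no "?".
  if (index === -1) return str;
  var head = str.slice(0, index).replace(/&/g, "%26");
  return head + str.slice(index);
}

var simple = encodeBeforeFirstQuestionMark("zipzam&&&?&&&?&&&");
// "zipzam%26%26%26?&&&?&&&"
var spread = encodeBeforeFirstQuestionMark("zip=zam=&=&=&=?&&&?&&&");
// "zip=zam=%26=%26=%26=?&&&?&&&"
```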
This is close enough to what you are after:-
alert("zipzam&&&?&&&?&&&".replace(/^([^&\?]*)(&*)\?/, function(s, p, m)
{
for (var i = 0; i < m.length; i++) p += '%26';
return p +'?';
}));
Since the OP only wants to match ampersands before the first question mark, slightly modifying Michael Borgwardt's answer gives me this Regex which appears to be appropriate :
^[^?&]*(\&+)\?
Replace all matches with "%26"
This will not match zipzam&&abc?&&&?&&& because the first "?" does not have an ampersand immediately before it.
