Tolerate certain characters in RegEx - javascript

I am writing a message formatting parser that can, among other things, parse links. This specific case requires parsing a link in the form of <url|linkname> and replacing that text with just the linkname. The issue is that both url and linkname may contain \1 or \2 characters anywhere, in any order (at most one of each, though). I want to match the pattern but keep the "invalid" characters. This problem solves itself for linkname, as that part of the pattern is just ([^\n]+), but the url fragment is matched by a much more complicated pattern, specifically the URL validation pattern from is.js. It would not be trivial to modify the whole pattern manually to tolerate [\1\2] everywhere, and I need the pattern to preserve those characters because they are used for tracking purposes (so I can't simply .replace(/\1|\2/g, "") before matching).
If this kind of matching is not possible, is there some automated way to reliably modify the RegExp to add [\1\2]{0,2} between every character match, add \1\2 to all [chars] matches, etc.?
This is the url pattern taken from is.js:
/(?:(?:https?|ftp):\/\/)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?/i
This pattern was adapted for my purposes and for the <url|linkname> format as follows:
let namedUrlRegex = /<((?:(?:https?|ftp):\/\/)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?)\|([^\n]+)>/ig;
The code where this is used is here: JSFiddle
Examples for clarification (... represents the namedUrlRegex variable from above, and $2 is the capture group that captures linkname):
Current behavior:
"<googl\1e.com|Google>".replace(..., "$2") // "<googl\1e.com|Google>" WRONG
"<google.com|Goo\1gle>".replace(..., "$2") // "Goo\1gle" CORRECT
"<not_\1a_url|Google>".replace(..., "$2") // "<not_\1a_url|Google>" CORRECT
Expected behavior:
"<googl\1e.com|Google>".replace(..., "$2") // "Google" (note there is no \1)
"<google.com|Goo\1gle>".replace(..., "$2") // "Goo\1gle"
"<not_\1a_url|Google>".replace(..., "$2") // "<not_\1a_url|Google>"
Note the same rules for \1 apply to \2, \1\2, \1...\2, \2...\1 etc
Context: This is used to normalize a string from a WYSIWYG editor to the length/content that it will display as, preserving the location of the current selection (denoted by \1 and \2 so it can be restored after parsing). If the "caret" is removed completely (e.g. if the cursor was in the URL of a link), it will select the whole string instead. Everything works as expected, except for when the selection starts or ends in the url fragment.
Edit for clarification: I only want to change a segment in a string if it follows the format of <url|linkname> where url matches the URL pattern (tolerating \1, \2) and linkname consists of non-\n characters. If this condition is not met within a <...|...> string, it should be left unaltered as per the not_a_url example above.

I ended up making a RegEx that matches all "symbols" in the expression. One quirk of this is that it expects :, =, ! characters to be escaped, even outside of a (?:...), (?=...), (?!...) expression. This is addressed by escaping them before processing.
let r = /(\\.|\[.+?\]|\w|[^\\\/\[\]\^\$\(\)\?\*\+\{\}\|\+\:\=\!]|(\{.+?\}))(?:((?:\{.+?\}|\+|\*)\??)|\??)/g;
let url = /((?:(?:https?|ftp):\/\/)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?)/
function tolerate(regex, insert) {
let first = true;
// convert to string
return regex.toString().replace(/\/(.+)\//, "$1").
// escape :=!
replace(/((?:^|[^\\])\\(?:\\)*\(\?|[^?])([:=!]+)/g, (m, g1, g2) => g1 + (g2.split("").join("\\"))).
// substitute string
replace(r, function(m, g1, g2, g3, g4) {
// g2 = {...} multiplier (to prevent matching digits as symbols)
if (g2) return m;
// g3 = multiplier after symbol (must wrap in parenthesis to preserve behavior)
if (g3) return "(?:" + insert + g1 + ")" + g3;
// prevent matching tolerated characters at beginning, remove to change this behavior
if (first) {
first = false;
return m;
}
// insert the insert
return insert + m;
}
);
}
alert(tolerate(url, "\1?\2?"));
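A different route that avoids rewriting the URL pattern at all, sketched under assumptions: capture the url part loosely, strip the marker characters from a copy of it, and validate only that copy. The names replaceNamedLinks and simpleUrl are hypothetical. simpleUrl is a deliberately small stand-in for the is.js pattern, and the markers are assumed to be the characters U+0001 and U+0002 (which is what the string escapes "\1" and "\2" produce in non-strict JavaScript). Unlike the question's pattern, the linkname here also excludes ">" so that several links on one line are matched separately.

```javascript
// Stand-in URL check for illustration only (an assumption, not the is.js pattern).
// Swap in the full pattern, anchored with ^ and $, and without the g flag.
const simpleUrl = /^(?:https?:\/\/)?[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:\/\S*)?$/i;

// Replace <url|linkname> with linkname when url is valid once the
// marker characters U+0001 / U+0002 have been ignored.
function replaceNamedLinks(text, urlPattern) {
  return text.replace(/<([^|<>\n]+)\|([^\n>]+)>/g, (match, url, linkname) => {
    // Validate a copy of the url with the markers removed.
    const cleaned = url.replace(/[\u0001\u0002]/g, "");
    // Valid: keep only the linkname (markers inside it survive).
    // Invalid: leave the whole segment untouched.
    return urlPattern.test(cleaned) ? linkname : match;
  });
}

console.log(JSON.stringify(replaceNamedLinks("<googl\u0001e.com|Google>", simpleUrl)));
console.log(JSON.stringify(replaceNamedLinks("<google.com|Goo\u0001gle>", simpleUrl)));
console.log(JSON.stringify(replaceNamedLinks("<not_\u0001a_url|Google>", simpleUrl)));
```

Because the markers are removed only from the copy that is validated, the URL pattern itself stays unmodified, and markers inside the linkname are preserved in the output.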

Related

Using JS to modify user input for REGEXP search

I'm taking user input from a searchbar and modifying it to a regexp. From there I can search a json file for valid values and return them. It works fine with input without quotes, but with them, I'm appending "\Q" and "\E" so I can find the entirety of the string (with spaces and other special characters).
if (searchField.includes('"')){
var tempexpress = searchField.substring(1,searchField.length-1);
var tempexpress = "\\Q" + tempexpress + "\\E";
var expression = new RegExp(tempexpress);
} else {
var tempexpress = searchField.replace('(',"\\(");
var tempexpress = tempexpress.replace(')',"\\)");
var tempexpress = tempexpress.replace(/'/g,"\\'");
var tempexpress = tempexpress.replace('*',"\.");
var expression = new RegExp(tempexpress, "i");
};
if (value.data.label.search(expression) != -1){
console.log('found it');
}
If I input "QTT6" into the search field (with quotes for a literal), then it creates the following regexp: /\QQTT6\E/
In my testing, I found that it doesn't match to QTT6 for some reason and I'm not sure why. Any help is appreciated.
Also I'm very new to JS and Jquery, so sorry if my code isn't very well put together.
Per Kelly's comment:
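JavaScript regular expressions do not support \Q and \E. Without the u flag they are treated as the literal characters Q and E, so /\QQTT6\E/ matches the text QQTT6E rather than QTT6. To require that the whole label equals the search term, anchor the pattern with ^ and $ instead; any special characters in the term still need to be escaped.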
For more information, see the MDN docs on Regex Assertions:
^:
Matches the beginning of input. If the multiline flag is set to true, also matches immediately after a line break character. For example, /^A/ does not match the "A" in "an A", but does match the first "A" in "An A".
Note: This character has a different meaning when it appears at the start of a character class.
$:
Matches the end of input. If the multiline flag is set to true, also matches immediately before a line break character. For example, /t$/ does not match the "t" in "eater", but does match it in "eat".
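Since JavaScript has no \Q...\E quoting, a common substitute is to escape the regex metacharacters in the user's text before building the RegExp. The following is a minimal sketch of that idea; escapeRegExp and buildSearchExpression are hypothetical helper names, and the handling of * as "any single character" only mirrors what the question's code appears to intend.

```javascript
// Escape every character that has a special meaning in a regular expression.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a RegExp from the search field:
// - "quoted" input is matched literally and case-sensitively
// - unquoted input is matched case-insensitively, with * as a one-character wildcard
function buildSearchExpression(searchField) {
  const quoted = searchField.length >= 2 &&
    searchField.startsWith('"') && searchField.endsWith('"');
  if (quoted) {
    return new RegExp(escapeRegExp(searchField.slice(1, -1)));
  }
  // After escaping, a user-typed * appears as \* ; turn it into a wildcard.
  const pattern = escapeRegExp(searchField).replace(/\\\*/g, ".");
  return new RegExp(pattern, "i");
}

console.log(buildSearchExpression('"QTT6"').test("label QTT6"));
console.log(buildSearchExpression('"a.b"').test("aXb"));
console.log(buildSearchExpression("q*t6").test("QTT6"));
```

To require a full match rather than a substring match, the escaped pattern can be wrapped as "^" + pattern + "$".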

Uppercase for each new word swedish characters and html markup

I was pointed to this post, which does not seem to meet my criteria:
Replace a Regex capture group with uppercase in Javascript
I am trying to make a regex that will:
format a string by adding uppercase for the first letter of each word and lower case for the rest of the characters
ignore HTML markup
Accept swedish characters (åäöÅÄÖ)
Say I've got this string:
<b>app</b>le store östersund
Then I want it to be (changes marked by uppercase characters)
<b>App</b>le Store Östersund
I've been playing around with it and the closest I've got is the following:
(?!([^<])*?>)[åäöÅÄÖ]|\s\b\w
Resulted in
<b>app</b>le Store Östersund
Or this
/(?!([^<])*?>)[åäöÅÄÖ]|\S\b\w/g
Resulted in
<B>App</B>Le store Östersund
Here's a fiddle:
http://refiddle.com/refiddles/598aabef75622d4a531b0000
Any help or advice is much appreciated.
It is not possible to do this with regexp alone, since regexp doesn't understand HTML structure. [*] Instead, we need to process each text node, and carry through our logic for what is the beginning of the word in case a word continues across different text nodes. A character is at start of the word if it is preceded by a whitespace, or if it is at the start of the string and it is either the first text node, or the previous text node ended in whitespace.
function htmlToTitlecase(html, letters) {
let div = document.createElement('div');
let re = new RegExp("(^|\\s)([" + letters + "])", "gi");
div.innerHTML = html;
let treeWalker = document.createTreeWalker(div, NodeFilter.SHOW_TEXT);
let startOfWord = true;
while (treeWalker.nextNode()) {
let node = treeWalker.currentNode;
node.data = node.data.replace(re, function(match, space, letter) {
if (space || startOfWord) {
return space + letter.toUpperCase();
} else {
return match;
}
});
startOfWord = node.data.match(/\s$/);
}
return div.innerHTML;
}
console.log(htmlToTitlecase("<b>app</b>le store östersund", "a-zåäö"));
// <b>App</b>le Store Östersund
[*] Maybe possible, but even if so, it would be horribly ugly, since it would need to cover an awful amount of corner cases. Also might need a stronger RegExp engine than JavaScript's, like Ruby's or Perl's.
EDIT:
Even if just specifying really simple html tags? The only ones I am actually in need of covering is <b> and </b> at the moment.
This was not specified in the question. The solution is general enough to work for any markup (including simple tags). But...
function simpleHtmlToTitlecaseSwedish(html) {
return html.replace(/(^|\s)(<\/?b>|)([a-zåäö])/gi, function(match, space, tag, letter) {
return space + tag + letter.toUpperCase();
});
}
console.log(simpleHtmlToTitlecaseSwedish("<b>app</b>le store östersund"));
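For the simple <b>-only case, the letter class does not have to be listed by hand in engines that support Unicode property escapes (the u flag with \p{L}). This is a sketch of that variation; titleCaseSimpleBold is a hypothetical name, and like the function above it only uppercases word starts and does not lowercase the remaining letters.

```javascript
// Uppercase the first letter of each word, allowing one optional <b> or </b>
// tag between the word boundary and the letter. \p{L} matches any Unicode
// letter, so å, ä, ö need no special handling.
function titleCaseSimpleBold(html) {
  return html.replace(/(^|\s)(<\/?b>|)(\p{L})/gu, (match, space, tag, letter) => {
    return space + tag + letter.toUpperCase();
  });
}

console.log(titleCaseSimpleBold("<b>app</b>le store östersund"));
// <b>App</b>le Store Östersund
```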
I have a solution which uses almost only regex. It may not be the most intuitive way to do it, but it should be effective and I find it fun :)
You have to append to the end of your string every lowercase character followed by its uppercase counterpart, like this (it must also be preceded by a space for my regex):
aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZåÅäÄöÖ
(I don't know which letters are missing; I know nothing about the Swedish alphabet, sorry... I'm counting on you to correct that!)
Then you can use the following regex :
(?![^<]*>)(\s<[^/]*?>|\s|^)([\wåäö])(?=.*\2(.)\S*$)|[\wåÅäÄöÖ]+$
Replace by :
$1$3
Here is working JavaScript code:
// Initialization
var regex = /(?![^<]*>)(\s<[^/]*?>|\s|^)([\wåäö])(?=.*\2(.)\S*$)|[\wåÅäÄöÖ]+$/g;
var string = "test <b when=\"2>1\">ap<i>p</i></b>le store östersund";
// Processing
var result = string + " aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZåÅäÄöÖ";
result = result.replace(regex, "$1$3");
// Display result
console.log(result);
Edit : I forgot to handle first word of the string, it's corrected :)

replaceText() RegEx "not followed by"

Any ideas why this simple RegEx doesn't seem to be supported in a Google Docs script?
foo(?!bar)
I'm assuming that Google Apps Script uses the same RegEx as JavaScript. Is this not so?
I'm using the RegEx as such:
DocumentApp.getActiveDocument().getBody().replaceText('foo(?!bar)', 'hello');
This generates the error:
ScriptError: Invalid regular expression pattern foo(?!bar)
As discussed in comments on this question, this is a documented limitation; the replaceText() method doesn't support negative lookaheads or capture groups. From the documentation:
A subset of the JavaScript regular expression features are not fully supported, such as capture groups and mode modifiers.
Serge suggested a work-around, "it should be possible to manipulate your document at a lower level (extracting text from paragraph etc) but it could rapidly become quite cumbersome."
Here's what that could look like. If you don't mind losing all formatting, this example applies capture groups, RegExp flags (i for case-insensitivity) and negative lookaheads to change:
Little rabbit Foo Foo, running through the foobar.
to:
Little rabbit Fred Fred, running through the foobar.
Code:
function myFunction() {
var body = DocumentApp.getActiveDocument().getBody();
var paragraphs = body.getParagraphs();
for (var i=0; i<paragraphs.length; i++) {
var text = paragraphs[i].getText();
paragraphs[i].replaceText(".*", text.replace(/(f)oo(?!bar)/gi, '$1red') );
}
}
You have a sequence which you can match with a regular expression, but that regular expression will also match one, or more, things which you do not desire to change. The generalized solution to this situation is to:
1. Change the text such that you have known sequences of characters which are definitely not used. Effectively, this gives you sequences of characters which you use as variables to hold the values you don't want to change. Personally, I would use:
body.replaceText('Q','Qz');
Which will make it such that there is no sequence in your document which matches /Q[^z]/. This results in you being able to use sequences like Qa to represent some text you don't want to change. I use Q because it has a low frequency of use in English. You can use any character. For efficiency, choose a character which results in a low number of changes within the text you are affecting.
2. Change the things you don't want to end up changing to one of the character sequences you now know are unused. For example:
body.replaceText('foobar','Qa');
Repeat this for whatever additional items you don't want to end up changing.
3. Change the text you really want to change. In this example:
body.replaceText('foo','hello'.replace(/Q/g,'Qz'));
Note that you need to apply to the new replacement text the first substitution which you used to open up known unused sequences.
4. Restore all of the things you did not want to change to their original state:
body.replaceText('Qa','foobar');
5. Restore the text you used to open up unused character sequences:
body.replaceText('Qz','Q');
All together that would be:
var body = DocumentApp.getActiveDocument().getBody();
body.replaceText('Q','Qz'); //Open up unused character sequences
body.replaceText('foobar','Qa'); //Save the things you don't want to change.
//In the general case, you need to apply to the new text the same substitution
// which you used to open up unused character sequences. If you don't you
// may end up with those sequences being changed in the new text.
body.replaceText('foo','hello'.replace(/Q/g,'Qz')); //Make the change you desire.
body.replaceText('Qa','foobar'); //Restore the things you saved.
body.replaceText('Qz','Q'); //Restore the original sequence.
While solving the problem this way does not allow you to use all the features of JavaScript RegExp (e.g. capture groups, look-ahead assertions, and flags), it should preserve the formatting within your document.
You can choose not to perform steps 1 and 5 above by picking a longer sequence of characters to use to represent the text which you do not want to match (e.g. kNoWn1UnUsEd). However, such a longer sequence is something that must be selected based on your knowledge of what already exists in the document. Doing that can save a couple of steps, but you either have to search for an unused string or accept that there is some probability that the string you use is already in the document, which would result in an undesired substitution.
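The same placeholder technique can be tried outside of Apps Script on a plain string, which makes it easy to check that the substitutions round-trip. This is only an illustrative sketch: replaceAllLiteral and replaceFooButNotFoobar are hypothetical names, and literal split/join stands in for body.replaceText().

```javascript
// Replace every literal occurrence of `search` (no regex semantics).
function replaceAllLiteral(text, search, replacement) {
  return text.split(search).join(replacement);
}

// Replace "foo" with "hello" everywhere except inside "foobar".
function replaceFooButNotFoobar(text) {
  let out = replaceAllLiteral(text, "Q", "Qz");     // open up unused sequences
  out = replaceAllLiteral(out, "foobar", "Qa");     // save what must not change
  // Apply the same Q -> Qz substitution to the replacement text itself.
  out = replaceAllLiteral(out, "foo", replaceAllLiteral("hello", "Q", "Qz"));
  out = replaceAllLiteral(out, "Qa", "foobar");     // restore the saved text
  return replaceAllLiteral(out, "Qz", "Q");         // restore the original Q
}

console.log(replaceFooButNotFoobar("foo foobar Q foo"));
// hello foobar Q hello
```

Text that already contains Q, or even the sequence Qa, survives unchanged because the first step rewrites every Q to Qz before any placeholder is introduced.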
I figured out a way to obtain most of JS's str.replace() functionalities including capture groups and smart replacers in Apps Script without messing up the style. The trick is to use Javascript's regex.exec() function and Apps Script's text.deleteText() and text.insertText() functions.
function replaceText(body, regex, replacer, attribute){
var content = body.getText();
const text = body.editAsText();
var match = "";
while (true){
content = body.getText();
var oldLength = content.length;
match = regex.exec(content);
if (match === null){
break;
}
var start = match.index;
var end = regex.lastIndex - 1;
text.deleteText(start, end);
text.insertText(start, replacer(match, regex));
var newLength = body.getText().length;
var replacedLength = oldLength - newLength;
var newEnd = end - replacedLength;
text.setAttributes(start, newEnd, attribute);
regex.lastIndex -= replacedLength;
}
}
Argument explanations:
body: the body of the document you want to operate on
regex: the normal JS regular expression object used as a search pattern
replacer: the replacer function used to return the string you want to replace with; replacer automatically receives two arguments:
I. match: match object generated by regex.exec() and
II. regex: the regular expression object used as a search pattern
attribute: An Apps Script attribute object
For example, if you want to apply bold style to new strings replacing the old ones, you can create a boldStyle attribute object:
var boldStyle = {};
boldStyle[DocumentApp.Attribute.BOLD] = true;
Tips:
How can I use capture groups in replaceText()?
You can access all capture groups from the replacer function, match[0] is the whole string matched, match[1] is the first capture group, match[2] is the second, etc.
How can I access the index and position of the match in replaceText()?
You can access the start index of the match (match.index) and end index of the match (regex.lastIndex) from the replacer function.
For more in-depth reference of JS RegExp, see this excellent tutorial from Javascript.info.
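To see what the replacer has to work with, here is a small plain-JavaScript sketch (no Apps Script objects involved) of the regex.exec() loop that the function above relies on. findMatches is a hypothetical helper; it assumes the regex has the g flag, since without it lastIndex is not advanced and the loop would never end.

```javascript
// Collect each match together with its capture group and the inclusive
// start/end offsets that deleteText(start, end) would need.
function findMatches(content, regex) {
  const results = [];
  let match;
  while ((match = regex.exec(content)) !== null) {
    results.push({
      whole: match[0],                  // entire matched text
      group1: match[1],                 // first capture group
      start: match.index,               // offset of the first matched character
      endInclusive: regex.lastIndex - 1 // offset of the last matched character
    });
  }
  return results;
}

console.log(findMatches("a **b** c **dd**", /\*\*(.+?)\*\*/g));
```

A pattern that can match the empty string would also need extra care in such a loop, because lastIndex does not move forward after a zero-length match.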
Example:
Here's an example use case of the replaceText() function. It's a simple implementation of a markdown to Google Docs conversion script:
function markdownToDocs() {
const body = DocumentApp.getActiveDocument().getBody();
// Use editAsText to obtain a single text element containing
// all the characters in the document.
const text = body.editAsText();
// e.g. replace "**string**" with "string" (bolded)
var boldStyle = {};
boldStyle[DocumentApp.Attribute.BOLD] = true;
replaceDeliminaters(body, "\\*\\*", boldStyle, false);
// e.g. replace multiline "```line 1\nline 2\nline 3```" with "line 1\nline 2\nline 3" (with gray background highlight)
var blockHighlightStyle = {};
blockHighlightStyle[DocumentApp.Attribute.BACKGROUND_COLOR] = "#EEEEEE";
replaceDeliminaters(body, "```", blockHighlightStyle, true);
// e.g. replace inline "`console.log("hello world")`" with "console.log("hello world")" (in "Times New Roman" font and italic)
var inlineStyle = {};
inlineStyle[DocumentApp.Attribute.FONT_FAMILY] = "Times New Roman";
inlineStyle[DocumentApp.Attribute.ITALIC] = true;
replaceDeliminaters(body, "`", inlineStyle, false);
// feel free to change all the styling and markdown deliminaters as you wish.
}
// replace markdown deliminaters like "**", "`", and "```"
function replaceDeliminaters(body, deliminator, attributes, multiline){
var capture;
if (multiline){
capture = "([\\s\\S]+?)"; // capture newline characters as well
} else{
capture = "(.+?)"; // do not capture newline characters
}
const regex = new RegExp(deliminator + capture + deliminator, "g");
const replacer = function(match, regex){
return match[1]; // return the first capture group
}
replaceText(body, regex, replacer, attributes);
}

Regex converting & to &amp;

I am developing a small character encoder generator where the user inputs their text and, on the click of a button, it outputs the encoded version.
I've defined an object of the characters that need to be encoded like so:
map = {
'©' : '&copy;',
'&' : '&amp;'
},
And here is the loop that gets the values from the map and replaces them:
Object.keys(map).forEach(function (ico) {
var icoE = ico.replace(/([.?*+^$[\]\\(){}|-])/g, "\\$1");
raw = raw.replace( new RegExp(icoE, 'g'), map[ico] );
});
I am then simply outputting the result to a textarea. This all works fine, however the problem I'm facing is this.
© is replaced with &copy;, but the & at the beginning of that entity is then converted to &amp;, so the result ends up being &amp;copy;.
I see why this is happening; however, I'm not sure how to ensure that & is not replaced inside strings that are already encoded.
Here is a JSFiddle for a live preview of what I mean:
http://jsfiddle.net/4m3nw/1/
Any help would be much appreciated
Prelude: Apart from regex, an idea worth considering is something like this JS function that already handles html entities. Now, on to the regex question.
HTML Special Characters, Negative Lookahead
In HTML, special characters can look not only like &copy; but also like &#8212;, and they can have upper-case characters.
To replace ampersands that are not immediately followed by a hash or word characters and a semicolon, you can use something like this:
&(?!(?:#[0-9]+|[a-z]+);)
Make sure to use the i flag to activate case-insensitive mode
& matches the literal ampersand
The negative lookahead (?!(?:#[0-9]+|[a-z]+);) asserts that it is not followed by...
(?:#[0-9]+|[a-z]+) a hash and digits, | OR letters...
then a semicolon.
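Applied in JavaScript, the lookahead approach might look like the sketch below. encodeBareAmpersands is a hypothetical name, and the pattern is the one from this answer, so hexadecimal references such as &#x2014; are not recognized as entities by it.

```javascript
// Encode only ampersands that do not already start an entity such as
// &copy; or &#169; (named or decimal numeric references).
function encodeBareAmpersands(raw) {
  return raw.replace(/&(?!(?:#[0-9]+|[a-z]+);)/gi, "&amp;");
}

console.log(encodeBareAmpersands("Tom & Jerry &copy; 2014 &#169; &AMP; &"));
// Tom &amp; Jerry &copy; 2014 &#169; &AMP; &amp;
```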
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind
The problem is that, because you process the same string repeatedly, you also replace the & in &copy;. Re-ordering your map seemingly solves the problem. However, according to the ECMAScript specification, this ordering is not a given, so you would be relying on implementation details of the ECMAScript engine used.
What you can do to make sure it will always work is to swap the keys so that & is always processed first:
map = {
'©' : '&copy;',
'&' : '&amp;'
};
var keys = Object.keys(map);
keys[keys.indexOf('&')] = keys[0];
keys[0] = '&';
keys.forEach(function (ico) {
var icoE = ico.replace(/([.?*+^$[\]\\(){}|-])/g, "\\$1");
raw = raw.replace( new RegExp(icoE, 'g'), map[ico] );
});
Obviously you need to add checks for the &'s existence if it isn't always there.
Probably the simplest code change is to reorder your map by putting the ampersand on top.
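Another way to sidestep the ordering question entirely, offered as a sketch rather than as what either answer proposes: build one regular expression from all keys and replace in a single pass, so text produced by one replacement is never scanned again. encodeEntities and entityMap are hypothetical names.

```javascript
const entityMap = {
  "©": "&copy;",
  "&": "&amp;"
};

// Replace every mapped character in one pass over the input.
function encodeEntities(raw, map) {
  // Escape each key so it is matched literally, then join with | (alternation).
  const escapedKeys = Object.keys(map).map(function (key) {
    return key.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  });
  const pattern = new RegExp(escapedKeys.join("|"), "g");
  // The callback receives the matched character and looks up its entity.
  return raw.replace(pattern, function (ch) {
    return map[ch];
  });
}

console.log(encodeEntities("© 2014 Tom & Jerry", entityMap));
// &copy; 2014 Tom &amp; Jerry
```

Because the output of a replacement is not re-examined by String.prototype.replace, the & inside &copy; is left alone regardless of key order.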

Make AngularJS routes named groups non-greedy

Given the following route:
$routeProvider.when('/users/:userId-:userEncodedName', {
...
})
When hitting the URL /users/42-johndoe, the $routeParams are initialized as expected:
$routeParams.userId // is 42
$routeParams.userEncodedName // is johndoe
But when hitting the URL /users/42-john-doe, the $routeParams are initialized as follows:
$routeParams.userId // is 42-john
$routeParams.userEncodedName // is doe
Is there any way to make the named groups non-greedy, i.e. to obtain the following $routeParams:
$routeParams.userId // is 42
$routeParams.userEncodedName // is john-doe
?
You can change the path
from
$routeProvider.when('/users/:userId-:userEncodedName', {});
to
$routeProvider.when('/users/:userId*-:userEncodedName', {})
As stated in the AngularJS documentation regarding $routeProvider's path property:
path can contain named groups starting with a colon and ending with a
star: e.g.:name*. All characters are eagerly stored in $routeParams
under the given name when the route matches.
Oddly enough Ryeballar's answer does indeed work (as is demonstrated in this short demo). I say "oddly enough", because based on the docs ("[...] characters are eagerly stored [...]"), I would expect it to work exactly the opposite way.
So, out of curiosity, I did some digging into the source code (v1.2.16) and it turns out that by a strange coincidence it indeed works. (Actually, this looks more like an inconsistency in the way route-paths are parsed).
The pathRegExp() function is responsible for converting the route path template into a regular expression, which is later used to match against the actual route paths.
The code that converts the route path template string into a RegExp pattern is the following:
path = path
.replace(/([().])/g, '\\$1')
.replace(/(\/)?:(\w+)([\?\*])?/g, function(_, slash, key, option){
var optional = option === '?' ? option : null;
var star = option === '*' ? option : null;
...
slash = slash || '';
return ''
+ (optional ? '' : slash)
+ '(?:'
+ (optional ? slash : '')
+ (star && '(.+?)' || '([^/]+)')
+ (optional || '')
+ ')'
+ (optional || '');
})
.replace(/([\/$\*])/g, '\\$1');
Based on the code above, the two route path templates (with and without *) end up in the following (totally different) regular expressions:
'/test/:param1-:param2' ==> '\/test\/(?:([^\/]+))-(?:([^\/]+))'
'/test/:param1*-:param2' ==> '\/test\/(?:(.+?))-(?:([^\/]+))'
So, what does each RegExp mean ?
/test/(?:([^/]+))-(?:([^/]+))
Let's break this up:
\/test\/: Match the string '/test/'.
(?:([^\/]+)) is equivalent to ([^\/]+): the outer (?:...) is a non-capturing group that merely wraps the inner capturing group, so it does not store a backreference of its own.
([^\/]+): Match any sequence of 1 or more characters that does not contain /. By default, the RegExp engine will try to match as many characters as possible, as long as the rest of the string can match the remaining pattern (-(?:([^\/]+))).
Since the minimum substring that matches -(?:([^\/]+)) is -doe, :param2 will be matched to doe and :param1 to 42-john.
/test/(?:(.+?))-(?:([^/]+))
Let's break this up:
\/test\/: Match the string '/test/'.
(?:(.+?)) is equivalent to (.+?): the outer (?:...) is a non-capturing group that merely wraps the inner capturing group, so it does not store a backreference of its own.
(.+?): Non-greedily match any sequence of 1 or more characters (any characters), as long as the rest of the string can match the remaining pattern (-(?:([^\/]+))). The key here is the ? following .+ which adds the non-greedy behaviour.
Since the minimum substring that matches (.+?) (and at the same time lets the rest of the string match -(?:([^\/]+))) is 42, :param1 will be matched to 42 and :param2 to john-doe.
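The difference between the two generated patterns can be reproduced in plain JavaScript, independent of AngularJS. In this sketch the ^ and $ anchors and the matchRoute helper are assumptions added for illustration; only the middle part of each pattern is taken from the expressions above, with /users/ in place of /test/.

```javascript
// Pattern produced for '/users/:userId-:userEncodedName' (greedy first group).
const greedyRoute = /^\/users\/(?:([^\/]+))-(?:([^\/]+))$/;
// Pattern produced for '/users/:userId*-:userEncodedName' (lazy first group).
const lazyRoute = /^\/users\/(?:(.+?))-(?:([^\/]+))$/;

// Return the two captured parameters, or null when the path does not match.
function matchRoute(routeRegex, path) {
  const m = routeRegex.exec(path);
  return m === null ? null : { userId: m[1], userEncodedName: m[2] };
}

console.log(matchRoute(greedyRoute, "/users/42-john-doe"));
// { userId: '42-john', userEncodedName: 'doe' }
console.log(matchRoute(lazyRoute, "/users/42-john-doe"));
// { userId: '42', userEncodedName: 'john-doe' }
```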
I hope this makes sense. Feel free to leave a comment if it doesn't :)
