Unable to craft dynamically growing regex - javascript

I'm trying to build a regex in JavaScript that will match parts of an arithmetic operation. For instance, here are a few inputs and expected outputs:
What is 7 minus 5? >> ['7','minus','5']
What is 6 multiplied by -3? >> ['6','multiplied by', '-3']
I have this working regex: /^What is (-?\d+) (minus|plus|multiplied by|divided by) (-?\d+)\?$/
Now I want to expand things to capture additional operations. For instance:
What is 7 minus 5 plus 3? >> ['7','minus','5','plus','3']
So I used: ^What is (-?\d+)(?: (minus|plus|multiplied by|divided by) (-?\d+))+\?$. But it yields:
What is 7 minus 5 plus 3? >> ['7','plus','3']
Why is the minus 5 skipped? And how do I include it in results as I'd like? (here is my sample)

The problem comes from the fact that a capturing group can only hold one value. If the same capturing group matches more than once (as it does in your case), it returns only the last match.
I like how it is explained at http://www.rexegg.com/regex-capture.html#spawn_groups
The capturing parentheses you see in a pattern only capture a single
group. So in (\d)+, capture groups do not magically mushroom as you
travel down the string. Rather, they repeatedly refer to Group 1,
Group 1, Group 1… If you try this regex on 1234 (assuming your regex
flavor even allows it), Group 1 will contain 4—i.e. the last capture.
In essence, Group 1 gets overwritten every time the regex iterates
through the capturing parentheses.
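A quick way to see the overwriting described above is to run the quoted (\d)+ example, and then the pattern from the question, in JavaScript. This is only a minimal sketch; the variable names are illustrative.

```javascript
// Group 1 is overwritten on every repetition, so only the last digit survives.
const repeated = /(\d)+/.exec("1234");
console.log(repeated[0]); // the whole match
console.log(repeated[1]); // the last value captured by group 1

// The repeated group in the question's pattern behaves the same way.
const questionRegex = /^What is (-?\d+)(?: (minus|plus|multiplied by|divided by) (-?\d+))+\?$/;
const groups = questionRegex.exec("What is 7 minus 5 plus 3?").slice(1);
console.log(groups); // groups 2 and 3 only hold the final repetition
```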
So the trick is to use a regex with the global flag (g) and execute it more than once. With the g flag, each execution starts where the previous one ended.
I've written an example to show the strategy: isolate the formula, then iterate until you have found everything.
var formula = "What is 2 minus 1 minus 1";
var regex = /^What is ((?:-?\d+)(?: (?:minus|plus|multiplied by|divided by) (?:-?\d+))+)$/;
// exec() returns null when the phrase does not match, so check before indexing
var match = regex.exec(formula);
if (match) {
  var math_string = match[1];
  console.log(math_string);
  // the leading number is optional because only the first operation has one
  var math_regex = /(-?\d+)? (minus|plus|multiplied by|divided by) (-?\d+)/g;
  var operation;
  var result = [];
  while ((operation = math_regex.exec(math_string))) {
    if (operation[1]) {
      result.push(operation[1]);
    }
    result.push(operation[2], operation[3]);
  }
  console.log(result);
}
Another solution, if you don't need anything fancy, is to remove the "What is", replace "multiplied by" with "multiplied_by" (and likewise for "divided by"), and split the string on spaces.
var formula = "What is 2 multiplied by 1 divided by 1";
var regex = /^What is ((?:-?\d+)(?: (?:minus|plus|multiplied by|divided by) (?:-?\d+))+)$/;
// exec() returns null when the phrase does not match, so check before indexing
var match = regex.exec(formula);
if (match) {
  // global regexes replace every occurrence; a string pattern would replace only the first
  var math_string = match[1].replace(/multiplied by/g, 'multiplied_by').replace(/divided by/g, 'divided_by');
  console.log(math_string.split(" "));
}

Each capturing group in a regex can hold only a single value. So, if you repeat a group, you get only one result for that group (in JavaScript, the last one). In your case, the repeated part is the following:
(?: (minus|plus|multiplied by|divided by) (-?\d+))+
You're repeating the outer non-capturing group, which matches several times. But the capturing groups inside it can, in the end, hold only a single match each, which is the result of the last repetition.
You should probably switch to matching tokens instead of having a single regex that tries to match the whole phrase and dissects it via capturing groups. Something like a two-step process where you first verify that the whole phrase is constructed correctly (starts with »What is«, ends with »?«, etc.) and then a pass that extracts the individual tokens, e.g. something like
-?\d+|minus|plus|multiplied by|divided by
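The two-step process described above could look roughly like this. It is a sketch, not a definitive implementation: the function name parseQuestion and the exact validation pattern are assumptions, and the token pattern is the one given above with the g flag added.

```javascript
// Step 1: verify the overall shape of the phrase.
// Step 2: pull out the individual tokens with a global match.
function parseQuestion(text) {
  // Hypothetical validation pattern: a number, then zero or more "operator number" pairs.
  const whole = /^What is -?\d+(?: (?:minus|plus|multiplied by|divided by) -?\d+)*\?$/;
  if (!whole.test(text)) {
    return null; // malformed phrase
  }
  // With the g flag, match() returns every token instead of capture groups.
  return text.match(/-?\d+|minus|plus|multiplied by|divided by/g);
}

console.log(parseQuestion("What is 7 minus 5 plus 3?"));
console.log(parseQuestion("What is 6 multiplied by -3?"));
console.log(parseQuestion("What is 7 minus?"));
```

Because validation and extraction are separate, the token list grows with the input and no capture group is ever overwritten.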

Related

Mathematical Formula - Variable Substitution with Regex

I'm working on a Javascript function that evaluates a user-entered string as a mathematical formula.
For example, the user may type in 1 + 1, and the function will evaluate it to 2. I'm using a library to do this, so the math and syntax is handled for that already. However, I have variables that the user can reference. The user can create a number variable, give it a name (of their choosing), and reference it in the equation. Assume the user writes 1 + counter, the math eval library obviously doesn't know what counter is, so I am using regular expressions to pre-process the formula. The preprocessing function will see counter, lookup its value, and replace it with the literal. So if the user had set counter to 3 elsewhere, my function will take 1 + counter, replace counter with 3 to get 1 + 3, and then send the formula to the math evaluation library.
The issue I'm having is writing a function that processes this using regular expressions.
I'm starting with the regular expression ([^A-Za-z0-9])counter($|[^A-Za-z0-9]), which matches counter only if there is NOT another alphanumeric character on either side of it. For example, the user may type in counter2 at some point, and I want to make sure that counter2 is looked up, but that counter would not match. Secretly, to improve performance, I actually loop through the variables, generate a regular expression for each one, and match them that way. Some may not match at all, but it runs in O(n) rather than having to search through a list of variables for every reference in the formula. In other words, I don't build a syntax tree or anything, so if I had the variables counter and counter2, I would generate a regex for each and try to match them. Hence, if the formula was counter2, the function still tries a match for counter and counter2, but only counter2 should match.
The code I'm using is as follows:
var re = new RegExp(`(^|[^A-Za-z0-9])${variableName}($|[^A-Za-z0-9])`, "g");
let match = re.exec(formula);
while (match !== null) {
  // If "+counter+" is matched, I have to make sure that the +'s remain, hence replacing on the match
  var sub = match[0].replace(`${variableName}`, `{${variableValue}}`);
  formula = formula.replace(match[0], sub)
  re.lastIndex = 0; // just reset to the start for now
  match = re.exec(formula);
}
// Pass to math library next
This works in most cases but I have the following issue:
For the formula counter+counter, only the first counter+ matches, when both should match.
So, what I need is basically a regular expression/function that does the following:
Take a variable name
Replace all occurrences of it as long as the occurrences don't have an alphanumeric character in front or back. So if I'm matching counter against a formula, +counter+ would match (+ isn't alphanumeric), + counter would match (space isn't alphanumeric), but counter2 wouldn't match, because it's a different variable name entirely, and 2 is alphanumeric.
Any ideas? I'm trying to do this the right way, I imagine there can be many unknown side effects if I don't do this correctly.
Thanks for the help!
You may use a lookahead at the end, (?=$|[^A-Za-z0-9]) instead of a ($|[^A-Za-z0-9]) capturing group, and shrink the code to a greater extent if you just use replace:
var re = new RegExp(`(^|[^A-Za-z0-9])${variableName}(?=\$|[^A-Za-z0-9])`, "g");
formula = formula.replace(re, "$1"+variableValue)
Note the $1 in the replacement part is the backreference to the value stored in Group 1, that is, start of string or any char but an ASCII alphanumeric (captured with (^|[^A-Za-z0-9])).
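Wrapped in a function, the lookahead approach above might look like the sketch below. The helper names escapeRegExp and substituteVariable are hypothetical. Escaping the variable name and using a replacer function instead of a "$1" string are extra precautions added here (they avoid surprises if a name contains regex metacharacters or a value contains "$"); they are not part of the original answer.

```javascript
// Escape regex metacharacters so a variable name is always matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace every standalone occurrence of variableName with variableValue.
function substituteVariable(formula, variableName, variableValue) {
  const re = new RegExp(
    "(^|[^A-Za-z0-9])" + escapeRegExp(variableName) + "(?=$|[^A-Za-z0-9])",
    "g"
  );
  // The lookahead does not consume the following character,
  // so adjacent occurrences such as "counter+counter" are both found.
  return formula.replace(re, function (match, prefix) {
    return prefix + String(variableValue);
  });
}

console.log(substituteVariable("counter+counter", "counter", 3));
console.log(substituteVariable("counter2 + counter", "counter", 3));
```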

Javascript regex pattern match multiple strings ( AND, OR, NEAR/n, P/n )

I need to filter a collection of strings based on a rather complex query
I have query input as a string
var query1 ='Abbott near/10 (assay* OR test* ) AND BLOOD near/10 (Point P/1 Care)';
From this query INPUT string I want to collect just the important words:
var words= 'Abbott assay* test* BLOOD Point care';
The query can change for example:
var query2='(assay* OR test* OR analy* OR array) OR (Abbott p/1 Point P/1 Care)';
from this query need to collect
var words='assay* test* analy* array Abbott Point Care';
I'm looking for your suggestion.
Thanks.
You may just use | in your regex to capture the words and/or special characters that you want to remove:
([()]|AND|OR|(NEAR|P)\/\d+) ?
DEMO: https://regex101.com/r/rqpmXr/2
Note the /gi in the regex options, with i meaning that it's case insensitive.
EXPLANATION:
([()]|AND|OR|(NEAR|P)\/\d+) - This is a capture group containing all the words you specified in your title, plus the parentheses.
(NEAR|P)\/\d+ - Just to clear out this part, \d+ means that one or more digits are following the words NEAR or P.
 ? - This captures the possible trailing space after the captured word.

javascript regex capturing parentheses

I don't really get the concept on capturing parentheses when dealing with javascript regex. I don't understand why we need parentheses for the following example
var x = "{xxx} blah blah blah {yyy} and {111}";
x.replace( /{([^{}]*)}/g ,
function(match,content) {
console.log(match,content);
return "whatever";
});
//it will print
{xxx} xxx
{yyy} yyy
{111} 111
So when I drop the parentheses from the pattern, the results are different:
x.replace( /{[^{}]*}/g ,
function(match,content) {
console.log(match,content);
return "whatever";
});
//it will print
{xxx} 0
{yyy} 21
{111} 31
So the content values are now numbers, and I have no idea why. Can someone explain what's going on behind the scenes?
According to the MDN documentation, the parameters to the function will be, in order:
The matched substring.
Any groups that are defined, if there are any.
The index in the original string where the match was found.
The original string.
So in the first example, content will be the string which was captured in group 1. But when you remove the group in the second example, content is actually the index where the match was found.
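The argument order listed above can be checked directly. This is a small sketch using the string from the question; the array names are illustrative only.

```javascript
const text = "{xxx} blah blah blah {yyy} and {111}";

// With one capture group: (match, group1, offset, wholeString)
const withGroup = [];
text.replace(/{([^{}]*)}/g, function (match, content, offset) {
  withGroup.push([match, content, offset]);
  return "whatever";
});

// Without a group the arguments shift left: (match, offset, wholeString)
const withoutGroup = [];
text.replace(/{[^{}]*}/g, function (match, offset) {
  withoutGroup.push([match, offset]);
  return "whatever";
});

console.log(withGroup);    // each entry: matched text, captured content, index
console.log(withoutGroup); // each entry: matched text, index
```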
Capturing groups are useful when replacing text.
For example, I have this string "one two three four" that I want to reverse like "four three two one". To achieve that I will use this line of code:
var reversed = "one two three four".replace(/(one) (two) (three) (four)/, "$4 $3 $2 $1");
Note how $n refers to the text captured by the n-th group, here one word each.
Another example: I have the same string "one two three four" and I want to print each word twice:
var eachWordTwice = "one two three four".replace(/(one) (two) (three) (four)/, "$1 $1 $2 $2 $3 $3 $4 $4");
As for the numbers you are seeing, the documentation describes that argument as:
The offset of the matched substring within the total string being
examined. (For example, if the total string was "abcd", and the
matched substring was "bc", then this argument will be 1.)
Source:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace
"Specifying a function as a parameter" section
Parentheses are used to capture/replace only a portion of the match. For instance, I use them to match phone numbers that may or may not have extensions. The function below matches the whole string (when the if condition passes), so the entire string is replaced, but I reuse only specific characters in a specific order; whitespace and other characters ("() -x") are allowed in the input.
It will always output a string formatted like (651) 258-9631 x1234 if given 6512589631x1234 or 1 651 258 9631 1234. It also doesn't allow (or in this case format) toll-free numbers, as they aren't allowed in my field.
function phoneNumber(v) {
// take in a string, return a formatted string (651) 651-6511 x1234
if (v.search(/^[1]{0,1}[-(\s.]{0,1}(?!800|888|877|866|855|900)([2-9][0-9]{2})[-)\s.]{0,2}([2-9][0-9]{2})[-.\s]{0,2}([0-9]{4})[\s]*[x]{0,1}([0-9]{1,5}){1}$/gi) !== -1) {return v.replace(/^[1]{0,1}[-(\s.]{0,1}(?!800|888|877|866|855|900)([2-9][0-9]{2})[-)\s.]{0,2}([2-9][0-9]{2})[-.\s]{0,2}([0-9]{4})[\s]*[x]{0,1}([0-9]{1,5}){1}$/gi,"($1) $2-$3 x$4"); }
if (v.search(/^[1]{0,1}[-(\s.]{0,1}(?!800|888|877|866|855|900)([2-9][0-9]{2})[-)\s.]{0,1}([2-9][0-9]{2})[-.\s]{0,2}([0-9]{4})$/gi) !== -1) { return v.replace(/^[1]{0,1}[-(\s.]{0,1}(?!800|888|877|866|855|900)([2-9][0-9]{2})[-)\s.]{0,1}([2-9][0-9]{2})[-.\s]{0,2}([0-9]{4})$/gi,"($1) $2-$3"); }
return v;
}
What this allows me to do is gather the area code, prefix, line number, and an optional extension, and format it the way I need it (for users who can't follow directions, for instance).
So if you input 6516516511x1234 or "(651) 651-6511 x1234", it will match one regex or the other in this example.
Now what is happening in your code is as @amine-hajyoussef said: the index of the start of each match is being passed in. Your code would be better served by match for example one (text returned), or by search for the index, as in example two. p.s.w.g's answer expands on this.

Regex with limited use of specific characters (like a scrabble bank of letters)

I would like to be able to do a regex where I can identify sort of a bank of letters like [dgos] for example and use that within my regex... but whenever a letter from that gets used, it takes it away from the bank.
So let's say that somehow \1 is able to stand for that bank of letters (dgos). I could write a regex something like:
^\1{1}o\1{2}$
and it would match basically:
\1{1} = [dgos]{1}
o
\1{2} = [dgos]{2} minus whatever was used in the first one
Matches could include good, dosg, sogd, etc... and would not include sods (because s would have to be used twice) or sooo (because the o would have to be used twice).
I would also like to be able to identify letters that can be used more than once. I started to write this myself but then realized I didn't even know where to begin with this so I've also searched around and haven't found a very elegant way to do this, or a way to do it that would be flexible enough that the regex could easily be generated with minimal input.
I have a solution below using a combination of conditions and multiple regexes (feel free to comment with thoughts on that answer; perhaps it's the way I'll have to do it?), but I would prefer a single-regex solution if possible, and something more elegant and efficient.
Note that the higher level of elegance and single regex part are just my preferences, the most important thing is that it works and the performance is good enough.
Assuming you are using javascript version 1.5+ (so you can use lookaheads), here is my solution:
^([dogs])(?!.*\1)([dogs])(?!.*\2)([dogs])(?!.*\3)[dogs]$
So after each letter is matched, you perform a negative lookahead to ensure that this matched letter never appears again.
This method wouldn't work (or at least, would need to be made a heck of a lot more complicated!) if you want to allow some letters to be repeated, however. (E.g. if your letters to match are "example".)
EDIT: I just had a little re-think, and here is a much more elegant solution:
^(?:([dogs])(?!.*\1)){4}$
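To check the behaviour of the compact pattern, here is a short sketch. It uses a trailing $ so that longer strings are rejected; the constant name is illustrative.

```javascript
// Each iteration captures one bank letter, then asserts it never appears again.
const usesEachLetterOnce = /^(?:([dogs])(?!.*\1)){4}$/;

const samples = ["dogs", "gods", "good", "sods", "dog"];
for (const word of samples) {
  console.log(word, usesEachLetterOnce.test(word));
}
```

Here "dogs" and "gods" pass, while "good" and "sods" fail because a letter repeats and "dog" fails because it is too short.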
After thinking about it some more, I have found a way to allow repeated letters, although the solution is pretty ugly! For example, suppose you want to match the letters "goods" (i.e. including TWO "o"s):
^(?=.*g)(?=.*o.*o)(?=.*d)(?=.*s).{5}$
So, I have used positive lookaheads to check that all of these letters are in the text somewhere, then simply checked that there are exactly 5 letters.
Or as another example, suppose we want to match the letters "banana", with a letter "a" in position 2 of the word. Then you could match:
^(?=.*b)(?=.*a.*a.*a)(?=.*n.*n).a.{4}$
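Since the original question asks for patterns that can be generated with minimal input, the lookahead form above can be built from the letters themselves. The function name buildAnagramRegex is hypothetical, and this sketch assumes the bank contains only plain letters (no regex metacharacters).

```javascript
// Build ^(?=.*g)(?=.*o.*o)...{n}$ from a bank such as "goods".
function buildAnagramRegex(letters) {
  // Count how often each letter occurs in the bank.
  const counts = new Map();
  for (const ch of letters) {
    counts.set(ch, (counts.get(ch) || 0) + 1);
  }
  // One positive lookahead per distinct letter, repeated by its count.
  let source = "^";
  for (const [ch, n] of counts) {
    source += "(?=" + (".*" + ch).repeat(n) + ")";
  }
  // Finally require exactly as many characters as the bank holds.
  return new RegExp(source + ".{" + letters.length + "}$");
}

const goods = buildAnagramRegex("goods");
console.log(goods.source);
console.log(goods.test("dogso"), goods.test("goads"));
```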
Building on @TomLord's answer, taking into account that you don't necessarily need to exhaust the bank of letters, you can use negative lookahead assertions instead of positive ones. For your example D<2 from bank>R<0-5 from bank>, that regex would be
/^(?![^o]*o[^o]*o)(?![^a]*a[^a]*a)(?![^e]*e[^e]*e[^e]*e)(?![^s]*s[^s]*s)d[oaes]{2}r[oaes]{0,5}$/i
Explanation:
^ # Start of string
(?![^o]*o[^o]*o) # Assert no more than one o
(?![^a]*a[^a]*a) # Assert no more than one a
(?![^e]*e[^e]*e[^e]*e) # Assert no more than two e
(?![^s]*s[^s]*s) # Assert no more than one s
d # Match d
[oaes]{2} # Match two of the letters in the bank
r # Match r
[oaes]{0,5} # Match 0-5 of the letters in the bank
$ # End of string
You could also write (?!.*o.*o) instead of (?![^o]*o[^o]*o), but the latter is faster, just harder to read. Another way to write this is (?!(?:[^o]*o){2}) (useful if the number of repetitions increases).
And of course you need to account for the number of letters in your "fixed" part of the string (which in this case (d and r) don't interfere with the bank, but they might do so in other examples).
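The negative lookaheads can also be generated from the bank. The sketch below uses the compact (?!(?:[^o]*o){2}) form mentioned above; bankLimitLookaheads is a hypothetical helper, and it assumes the bank holds plain letters only and that the fixed letters (d and r) are not in the bank.

```javascript
// For each distinct letter, forbid one more occurrence than the bank holds.
function bankLimitLookaheads(bank) {
  const counts = new Map();
  for (const ch of bank.toLowerCase()) {
    counts.set(ch, (counts.get(ch) || 0) + 1);
  }
  let lookaheads = "";
  for (const [ch, n] of counts) {
    lookaheads += "(?!(?:[^" + ch + "]*" + ch + "){" + (n + 1) + "})";
  }
  return lookaheads;
}

const bank = "oaees";
const letterClass = "[" + Array.from(new Set(bank)).join("") + "]"; // [oaes]
const pattern = new RegExp(
  "^" + bankLimitLookaheads(bank) + "d" + letterClass + "{2}r" + letterClass + "{0,5}$",
  "i"
);

for (const word of ["doer", "DEER", "door", "deere"]) {
  console.log(word, pattern.test(word));
}
```

With this bank, "doer" and "DEER" are accepted, while "door" (two o's) and "deere" (three e's) are rejected.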
Something like that would be pretty simple to specify, just very long
gdos|gdso|gods|gosd|....
You'd basically specify every permutation. Just write some code to generate every permutation, combine with the alternate operator, and you're done!
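A rough sketch of that generator follows. The function name permutations is an assumption, and the approach grows factorially with the number of letters, so it is only practical for small banks.

```javascript
// Return every distinct ordering of the given letters.
function permutations(letters) {
  if (letters.length <= 1) {
    return [letters];
  }
  const seen = new Set(); // a Set removes duplicates when letters repeat
  for (let i = 0; i < letters.length; i++) {
    const rest = letters.slice(0, i) + letters.slice(i + 1);
    for (const tail of permutations(rest)) {
      seen.add(letters[i] + tail);
    }
  }
  return Array.from(seen);
}

// Join the permutations with the alternation operator and anchor the result.
const alternatives = permutations("dgos");
const permutationRegex = new RegExp("^(?:" + alternatives.join("|") + ")$");

console.log(alternatives.length);
console.log(permutationRegex.test("gods"), permutationRegex.test("good"));
```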
Although... It might pay to actually encode the decision tree used to generate the permutations...
I swear I think I answered this before on stackoverflow. Let me get back to you...
I couldn't think of a way to do this completely in a regex, but I was able to come up with this: http://jsfiddle.net/T2TMd/2/
As jsfiddle is limited to post size, I couldn't do the larger dictionary there. Check here for a better example using a 180k word dictionary.
Main function:
/*
* filter = regex to filter/split potential matches
* bank = available letters
* groups = capture groups that use the bank
* dict = list of words to search through
*/
var matchFind = function(filter, bank, groups, dict) {
  var matches = [];
  for(var i=0; i < dict.length; i++) {
    if(dict[i].match(filter)){
      var fail = false;
      var b = bank;
      var arr = dict[i].split(filter);
      //loop groups that use letter bank
      for(var k=0; k<groups.length && !fail; k++) {
        var grp = arr[groups[k]] || [];
        //loop characters of that group
        for(var j=0; j<grp.length && !fail; j++) {
          var regex = new RegExp(b);
          var currChar = grp.charAt(j);
          if(currChar.match(regex)) {
            //match found, remove from bank
            b = b.replace(currChar,"");
          } else {
            fail = true;
          }
        }
      }
      if(!fail) {
        matches.push(dict[i]);
      }
    }
  }
  return matches;
}
Usage:
$("#go").click( function() {
  var f = new RegExp($("#filter").val());
  var b = "["+$("#bank").val().replace(/[^A-Za-z]+/g,"").toUpperCase()+"]";
  var g = $("#groups").val().replace(/[^0-9,]+/g,"").split(",") || [];
  $("#result").text(matchFind(f,b,g,dict).toString());
});
To make it easier to create scenarios, I created this as well:
$("#build").click( function() {
  var bank = "["+$("#buildBank").val().replace(/[^A-Za-z]+/g,"").toUpperCase()+"]";
  var buildArr = $("#builder").val().split(",");
  var groups = [];
  var build = "^";
  for(var i=0; i<buildArr.length; i++) {
    var part = buildArr[i];
    if(/\</.test(part)) {
      part = "(" + bank + part.replace("<", "{").replace(">", "}").replace("-",",") + ")";
      build = build + part;
      groups.push(i+1);
    } else {
      build = build + "("+part+")";
    }
  }
  build = build + "$";
  $("#filter").val(build);
  $("#bank").val(bank);
  $("#groups").val(groups.toString());
  $("#go").click();
});
This would be useful in Scrabble. Let's say that you are in a position where a word must start with a "D", there are two spaces between that "D" and an "R" from a parallel word, and you have OAEES for your letter bank. In the builder I could put D,<2>,R,<0-3> because it must start with a D, then it must have 2 letters from the bank, then there is an R, then I'd have up to 3 letters to use (since I'm using 2 in between D and R).
The builder would use the letter bank and convert D,<2>,R,<0-3> to ^(D)([OAEES]{2})(R)([OAEES]{0,3})$, which would be used to filter for possible matches. Then, for each possible match, it looks at the capture groups that use the letter bank, character by character, removing letters from the bank as it finds them, so that a word won't match if it uses more of a letter than the bank contains.
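The bank-consumption step can also be written without jQuery or split. The sketch below is a simplified variant of the idea, not the code from the fiddle: fitsBank and matchesBoard are hypothetical names, it reads the groups from exec instead of split, and the candidate list stands in for a real dictionary.

```javascript
// True when every letter can be taken from the bank without reusing a tile.
function fitsBank(letters, bank) {
  const remaining = bank.toUpperCase().split("");
  for (const ch of letters.toUpperCase()) {
    const index = remaining.indexOf(ch);
    if (index === -1) {
      return false; // letter not available (or already used up)
    }
    remaining.splice(index, 1); // consume one tile
  }
  return true;
}

// Filter by shape first, then check the bank-backed capture groups.
function matchesBoard(word, filter, bankGroups, bank) {
  const match = filter.exec(word);
  if (!match) {
    return false;
  }
  const used = bankGroups.map((g) => match[g] || "").join("");
  return fitsBank(used, bank);
}

const filter = /^(D)([OAEES]{2})(R)([OAEES]{0,3})$/;
const candidates = ["DOER", "DOOR", "DEERS", "DEAR"];
const playable = candidates.filter((w) => matchesBoard(w, filter, [2, 4], "OAEES"));
console.log(playable);
```

"DOOR" passes the regex filter but is dropped in the second step, because the bank holds only one O.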
Test the above scenario here.

JavaScript regex back references returning an array of matches from single capture group (multiple groups)

I'm fairly sure after spending the night trying to find an answer that this isn't possible, and I've developed a work around - but, if someone knows of a better method, I would love to hear it...
I've gone through a lot of iterations on the code, and the following is just a line of thought really. At some point I was using the global flag, I believe, in order for match() to work, and I can't remember if it was necessary now or not.
var str = "#abc#def#ghi&jkl";
var regex = /^(?:#([a-z]+))?(?:&([a-z]+))?$/;
The idea here, in this simplified code, is the optional group 1, of which there is an unspecified amount, will match #abc, #def and #ghi. It will only capture the alpha characters of which there will be one or more. Group 2 is the same, except matches on & symbol. It should also be anchored to the start and end of the string.
I want to be able to back reference all matches of both groups, ie:
result = str.match(regex);
alert(result[1]); //abc,def,ghi
alert(result[1][0]); //abc
alert(result[1][1]); //def
alert(result[1][2]); //ghi
alert(result[2]); //jkl
My mate says this works fine for him in .net, unfortunately I simply can't get it to work - only the last matched of any group is returned in the back reference, as can be seen in the following:
(additionally, making either group optional makes a mess, as does setting global flag)
var str = "#abc#def#ghi&jkl";
var regex = /(?:#([a-z]+))(?:&([a-z]+))/;
var result = str.match(regex);
alert(result[1]); //ghi
alert(result[1][0]); //g
alert(result[2]); //jkl
The following is the solution I arrived at, capturing the whole portion in question, and creating the array myself:
var str = "#abc#def#ghi&jkl";
var regex = /^([#a-z]+)?(?:&([a-z]+))?$/;
var result = regex.exec(str);
alert(result[1]); //#abc#def#ghi
alert(result[2]); //jkl
var result1 = result[1].toString();
result[1] = result1.split('#')
alert(result[1][1]); //abc
alert(result[1][2]); //def
alert(result[1][3]); //ghi
alert(result[2]); //jkl
That's simply not how .match() works in JavaScript. The returned array is an array of simple strings. There's no "nesting" of capture groups; you just count the ( symbols from left to right.
The first string (at index [0]) is always the overall matched string. Then come the capture groups, one string (or undefined, for a group that did not participate in the match) per array element.
You can, as you've done, rearrange the result array to your heart's content. It's just an array.
edit — oh, and the reason your result[1][0] was "g" is that array indexing notation applied to a string gets you the individual characters of the string.
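If an array per group is what you are after, one alternative to splitting by hand is to validate the overall shape once and then collect each kind of part with a global regex. This is a sketch under assumptions: it uses String.prototype.matchAll (ES2020, so not available in older engines), and the variable names are illustrative.

```javascript
const str = "#abc#def#ghi&jkl";

// Step 1: check the overall shape (any number of #parts, then any number of &parts).
const wholeShape = /^(?:#[a-z]+)*(?:&[a-z]+)*$/;

let hashParts = [];
let ampParts = [];
if (wholeShape.test(str)) {
  // Step 2: each global pass yields one match per occurrence; keep group 1 of each.
  hashParts = Array.from(str.matchAll(/#([a-z]+)/g), (m) => m[1]);
  ampParts = Array.from(str.matchAll(/&([a-z]+)/g), (m) => m[1]);
}

console.log(hashParts);
console.log(ampParts);
```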
