I'm working on a JavaScript function that evaluates a user-entered string as a mathematical formula.
For example, the user may type in 1 + 1, and the function will evaluate it to 2. I'm using a library to do this, so the math and syntax are already handled. However, I have variables that the user can reference. The user can create a number variable, give it a name of their choosing, and reference it in the equation. Say the user writes 1 + counter: the math eval library obviously doesn't know what counter is, so I am using regular expressions to pre-process the formula. The preprocessing function will see counter, look up its value, and replace it with the literal. So if the user had set counter to 3 elsewhere, my function will take 1 + counter, replace counter with 3 to get 1 + 3, and then send the formula to the math evaluation library.
The issue I'm having is writing a function that processes this using regular expressions.
I'm starting with the regular expression ([^A-Za-z0-9])counter($|[^A-Za-z0-9]), which matches counter only if there is NOT another alphanumeric character on either side of it. For example, the user may type in counter2 at some point, and I want to make sure that counter2 is looked up, but that counter would not match. Secretly, to improve performance, I actually loop through the variables, generate a regular expression for each, and match them that way. Some may not match at all, but it runs in O(n) rather than having to search through a list of variables for every reference in the array. In other words, I don't build a syntax tree or anything, so if I had the variables counter and counter2, I would generate a regex for each and try to match them. Hence if the formula was counter2, the function still tries a match for both counter and counter2, but only counter2 should match.
The code I'm using is as follows:
var re = new RegExp(`(^|[^A-Za-z0-9])${variableName}($|[^A-Za-z0-9])`, "g");
let match = re.exec(formula);
while (match !== null) {
// If "+counter+" is matched, I have to make sure that the +'s remain, hence replacing on the match
var sub = match[0].replace(`${variableName}`, `{${variableValue}}`);
formula = formula.replace(match[0], sub)
re.lastIndex = 0; // just reset to the start for now
match = re.exec(formula);
}
// Pass to math library next
This works in most cases but I have the following issue:
For the formula counter+counter, only the first counter+ matches, when both should match.
So, what I need is basically regular expression/function that does the following:
Take a variable name
Replace all occurrences of it as long as the occurrences don't have an alphanumeric character in front or back. So if I'm matching counter against a formula, +counter+ would match (+ isn't alphanumeric), + counter would match (space isn't alphanumeric), but counter2 wouldn't match, because it's a different variable name entirely, and 2 is alphanumeric.
Any ideas? I'm trying to do this the right way, I imagine there can be many unknown side effects if I don't do this correctly.
Thanks for the help!
You may use a lookahead at the end, (?=$|[^A-Za-z0-9]) instead of a ($|[^A-Za-z0-9]) capturing group, and shrink the code to a greater extent if you just use replace:
var re = new RegExp(`(^|[^A-Za-z0-9])${variableName}(?=\$|[^A-Za-z0-9])`, "g");
formula = formula.replace(re, "$1"+variableValue)
Note the $1 in the replacement part is the backreference to the value stored in Group 1, that is, start of string or any char but an ASCII alphanumeric (captured with (^|[^A-Za-z0-9])).
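One caveat with building the pattern straight from variableName: any regex metacharacters in the name, or a $ in the value, would be interpreted rather than taken literally. Below is a minimal sketch of the same idea with those two cases handled. The names escapeRegExp, replaceVariable and replaceAllVariables are hypothetical helpers, not part of any library, and the second function assumes variable names start with a letter and are purely ASCII alphanumeric.

```javascript
// Escape regex metacharacters so a name like "a.b" is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace every standalone occurrence of a single variable name.
function replaceVariable(formula, variableName, variableValue) {
  const re = new RegExp(
    `(^|[^A-Za-z0-9])${escapeRegExp(variableName)}(?=$|[^A-Za-z0-9])`,
    "g"
  );
  // A replacer function keeps "$" in the value from being read as a
  // replacement pattern such as "$1" or "$&".
  return formula.replace(re, (whole, before) => before + String(variableValue));
}

// Alternative: a single pass over the formula that looks each identifier
// up in a plain object, instead of one regex per variable.
function replaceAllVariables(formula, variables) {
  return formula.replace(
    /(^|[^A-Za-z0-9])([A-Za-z][A-Za-z0-9]*)/g,
    (whole, before, name) =>
      Object.prototype.hasOwnProperty.call(variables, name)
        ? before + String(variables[name])
        : whole // unknown identifiers are left untouched
  );
}

console.log(replaceVariable("counter+counter", "counter", 3)); // "3+3"
console.log(replaceAllVariables("counter+counter2*counter", { counter: 3, counter2: 4 })); // "3+4*3"
```

The single-pass version does one scan of the formula regardless of how many variables exist, which may matter if the variable list grows.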
I am trying to capture mathematical expressions between parentheses in a string with JavaScript. I need to capture parentheses that ONLY include numbers and mathematical operators: [0-9], +, -, *, /, % and the decimal dot. The examples below demonstrate what I am after. I managed to get close to the desired result, but the nested parentheses always screw my regex up, so I need help! I also need it to match globally, not only the first occurrence.
let string = "If(2>1,if(a>100, (-2*(3-5)(8-2)), (1+2)), (3(1+2)) )";
What I want to do if possible is manage to transform this syntax
if(condition, iftrue, iffalse)
to this syntax
if(condition) { iftrue } else { iffalse }
so that it can be evaluated by JavaScript and previewed in the browser. I have done it so far, but if the iftrue or iffalse blocks contain parentheses, everything blows up! So I'm trying to capture those parentheses and calculate them before transforming the syntax. Any advice is appreciated.
The closest I got was this: /[\d()+-*/.]/g, which gets what I want, but in this example
(1+2) (1 < 1) sdasdasd (1*(2+3))
instead of dismissing the (1<1) group entirely, it matches (1 and 1). My ideal scenario would be
(1+2) (1<1) sdasdasd (1*(2+3))
Another example:
let codeToEval = "if(a>10, 2, 2*(b+4))";
codeToEval is then passed to a function that replaces a and b with the correct values, so it ends up like this:
codeToEvalAfterReplacement = "if(5>10,2,2*(5+4))";
And now I want to transform this into
if(5>10) {
2
} else {
2*(5+4)
}
so it can be evaluated by javascript eval() and eventually previewed to the users.
Your current regex /[\d()+-*/.]/g will match single characters from the class
but multiple times because of the g flag, this is why (1 and 1) are still matched
in (1 < 1).
Based on your pattern requirements I would change it to /\([-+*/%.0-9()]+\)/g.
This will match parentheses containing one or more of the characters you describe within them.
Note that your current pattern has a - somewhere in the middle of a class, which can lead to weird behaviour because some regex engines will treat +-* within a class as a range (plus through asterisk, which is a strange range). Notice I put - at the start of the class in the new pattern so it matches an actual -.
I've assumed there will be no empty parentheses (), if there are you can change + (one or more) after ] to * (zero or more)
The g flag is still added so you match every one of such expressions.
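As a quick check, here is the suggested pattern run against the example string from the question. findMathGroups is only an illustrative wrapper name, not something from the question.

```javascript
// Parentheses that contain only digits, operators, dots and nested parentheses.
const mathGroupPattern = /\([-+*/%.0-9()]+\)/g;

// Return every matching group, or an empty array when there is none.
function findMathGroups(text) {
  return text.match(mathGroupPattern) || [];
}

console.log(findMathGroups("(1+2) (1 < 1) sdasdasd (1*(2+3))"));
// [ '(1+2)', '(1*(2+3))' ]
```

Note that the pattern does not check that the parentheses are balanced. It only restricts which characters may appear between the outer pair.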
I can't say with 100% certainty that the new regex will allow you to robustly transform the syntax you state, as it depends on the complexity of the 'iftrue' and 'iffalse' code blocks. See if you can make it work with the new pattern, otherwise you may want to look into other solutions for parsing code.
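If the regex turns out not to be robust enough, one of those other solutions is to skip regex for the nesting part and split the arguments by counting parentheses. The following is a rough sketch, not a full parser: splitTopLevel and transformIf are hypothetical names, and it assumes the whole input (or a whole branch) is a single if(condition, iftrue, iffalse) call with no commas inside string literals.

```javascript
// Split a string on commas that are not nested inside parentheses.
function splitTopLevel(text) {
  const parts = [];
  let depth = 0;
  let start = 0;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === "(") depth++;
    else if (ch === ")") depth--;
    else if (ch === "," && depth === 0) {
      parts.push(text.slice(start, i).trim());
      start = i + 1;
    }
  }
  parts.push(text.slice(start).trim());
  return parts;
}

// Rewrite "if(cond, a, b)" as "if(cond) { a } else { b }".
// Anything that is not a single if(...) call with three arguments
// is returned unchanged.
function transformIf(code) {
  const m = /^\s*if\s*\((.*)\)\s*$/is.exec(code);
  if (m === null) return code;
  const parts = splitTopLevel(m[1]);
  if (parts.length !== 3) return code;
  // Recurse so that a nested if(...) inside a branch is rewritten too.
  const [condition, ifTrue, ifFalse] = parts.map((part) => transformIf(part));
  return `if(${condition}) { ${ifTrue} } else { ${ifFalse} }`;
}

console.log(transformIf("if(5>10,2,2*(5+4))"));
// if(5>10) { 2 } else { 2*(5+4) }
```

Because the depth counter ignores commas inside parentheses, an expression such as 2*(5+4) or a nested if(...) stays in one piece.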
Call a function in the if condition, and put all the conditions in that function.
if(test()){
// if code
}else{
// else code
}
function test(){
// check both cases here
if(case 1 && case 2){
return true
}
return false;
}
I'm trying to build a regex in JavaScript that will match parts of an arithmetic operation. For instance, here are a few inputs and expected outputs:
What is 7 minus 5? >> ['7','minus','5']
What is 6 multiplied by -3? >> ['6','multiplied by', '-3']
I have this working regex: /^What is (-?\d+) (minus|plus|multiplied by|divided by) (-?\d+)\?$/
Now I want to expand things to capture additional operations. For instance:
What is 7 minus 5 plus 3? >> ['7','minus','5','plus','3']
So I used: ^What is (-?\d+)(?: (minus|plus|multiplied by|divided by) (-?\d+))+\?$. But it yields:
What is 7 minus 5 plus 3? >> ['7','plus','3']
Why is the minus 5 skipped? And how do I include it in results as I'd like? (here is my sample)
The problem you are facing comes from the fact that a capturing group can only return one value. If the same capturing group would have more than one value (like it is in your case) it would always return the last one.
I like how it is explained at http://www.rexegg.com/regex-capture.html#spawn_groups
The capturing parentheses you see in a pattern only capture a single
group. So in (\d)+, capture groups do not magically mushroom as you
travel down the string. Rather, they repeatedly refer to Group 1,
Group 1, Group 1… If you try this regex on 1234 (assuming your regex
flavor even allows it), Group 1 will contain 4—i.e. the last capture.
In essence, Group 1 gets overwritten every time the regex iterates
through the capturing parentheses.
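The behaviour described in that quote is easy to see in JavaScript. A small sketch (lastCapture and everyCapture are made-up names, and matchAll needs an ES2020 engine):

```javascript
// A repeated group only keeps its last iteration.
function lastCapture(text) {
  const m = /^(\d)+$/.exec(text);
  return m === null ? null : m[1];
}

// Matching one item at a time with the g flag keeps all of them.
function everyCapture(text) {
  return Array.from(text.matchAll(/(\d)/g), (m) => m[1]);
}

console.log(lastCapture("1234"));  // "4"
console.log(everyCapture("1234")); // [ '1', '2', '3', '4' ]
```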
So the trick is to use a regex with the global flag (g) and execute the expression more than once. With the g flag, each execution starts where the previous one ended.
I've made a regex to show you the strategy: isolate the formula, then iterate until you have found everything.
var formula = "What is 2 minus 1 minus 1";
var regex = /^What is ((?:-?\d+)(?: (?:minus|plus|multiplied by|divided by) (?:-?\d+))+)$/;
// exec() returns null when nothing matches, so check the result before indexing it
var whole = regex.exec(formula);
if (whole !== null) {
  var math_string = whole[1];
  console.log(math_string);
  var math_regex = /(-?\d+)? (minus|plus|multiplied by|divided by) (-?\d+)/g;
  var operation;
  var result = [];
  // each exec() call resumes where the previous match ended
  while ((operation = math_regex.exec(math_string)) !== null) {
    if (operation[1]) {
      result.push(operation[1]);
    }
    result.push(operation[2], operation[3]);
  }
  console.log(result);
}
Another solution, if you don't need anything fancy, would be to remove the "What is", replace multiplied by with multiplied_by (same for divided by) and split the string on spaces.
var formula = "What is 2 multiplied by 1 divided by 1";
var regex = /^What is ((?:-?\d+)(?: (?:minus|plus|multiplied by|divided by) (?:-?\d+))+)$/;
// exec() returns null when nothing matches, so check the result before indexing it
var whole = regex.exec(formula);
if (whole !== null) {
  // a regex with the g flag rewrites every occurrence, not just the first one
  var math_string = whole[1].replace(/multiplied by/g, 'multiplied_by').replace(/divided by/g, 'divided_by');
  console.log(math_string.split(" "));
}
Each capturing group in a regex can only hold a single value. So, if you have a repetition on a group, you're only going to get one result for that group (usually the last one, I think). In your case it's the following:
(?: (minus|plus|multiplied by|divided by) (-?\d+))+
You're repeating the surrounding non-capturing group, which will match repeatedly. But the groups within can, in the end, only hold a single match, which is the result of the last repetition.
You should probably switch to matching tokens instead of having a single regex that tries to match the whole phrase and dissects it via capturing groups. Something like a two-step process where you first verify that the whole phrase is constructed correctly (starts with »What is«, ends with »?«, etc.) and then a pass that extracts the individual tokens, e.g. something like
-?\d+|minus|plus|multiplied by|divided by
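The two-step process could look roughly like this. It is only a sketch: parseQuestion is a hypothetical name, and the validation regex is one possible reading of "constructed correctly".

```javascript
// Step 1: check the overall shape and isolate the arithmetic part.
const phrase = /^What is (-?\d+(?: (?:minus|plus|multiplied by|divided by) -?\d+)*)\?$/;
// Step 2: pull out the individual tokens.
const token = /-?\d+|minus|plus|multiplied by|divided by/g;

function parseQuestion(question) {
  const m = phrase.exec(question);
  if (m === null) return null; // not a well-formed question
  return m[1].match(token);
}

console.log(parseQuestion("What is 7 minus 5 plus 3?"));
// [ '7', 'minus', '5', 'plus', '3' ]
console.log(parseQuestion("What is 6 multiplied by -3?"));
// [ '6', 'multiplied by', '-3' ]
```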
All I want is to test whether a string contains non-overlapping substrings to match the array of regexes in the following way: if a substring matches some item of the array, remove the corresponding regex from the array, and continue. I will need a function func(arg1, arg2) that will take two arguments: the first one is the string itself, and the second one is an array of regular expressions to test.
I've read some explanations (such as Regular Expressions: Is there an AND operator?), but they do not answer this specific question. For example, in Javascript, the following three code snippets will return true:
/(?=ab)(?=abc)(?=abcd)/gi.test("eabzzzabcde");
/(?=.*ab)(?=.*abc)(?=.*abcd)/gi.test("eabzzzabcde");
/(?=.*?ab)(?=.*?abc)(?=.*?abcd)/gi.test("eabzzzabcde");
which is, obviously, not what I want (because "abc" and "abcd" in "eabzzzabcde" are just mixed together in an overlapping way). So, func("eabzzzabcde", [/ab/gi, /abc/gi, /abcd/gi]) should return false.
But, func("Fhh, fabcw wxabcdy yz... zab.", [/ab/gi, /abc/gi, /abcd/gi]) should return true because none of "ab", "abc" and "abcd" substrings overlap each other. The logic is the following. We have an array of regexes: [/ab/gi, /abc/gi, /abcd/gi], and some possible combination of three (where 3 is equal to the length of that array) non-overlapping, separate substrings of the original string: fabcw, xabcdy and zab. Does fabcw match /abc/gi? Yes. Okay, we remove /abc/gi from the array, and we have [/ab/gi, /abcd/gi] for xabcdy and zab. Does xabcdy match /abcd/gi? Yes. Okay, we remove /abcd/gi from the current array, and we have [/ab/gi] for zab. Does zab. match /ab/gi? Yes. No more regexes left in the current array, and we always answered "yes", so — return true.
The tricky part here is to find an efficient (such that performance is not too terrible) way to get at least one possible “good” combination of non-overlapping substrings.
The more complex case is e.g. func("acdxbaab ababaacb", [/.*?a.*?b.*?c/gi, /.*?c.*?b.*?a/gi]). Using the logic described above, we can see that if we take two non-overlapping parts of the original string — "acdxba" (or "cdxba") and "abaac" (or "abaacb", "babaac" etc.) — the first one matches /.*?c.*?b.*?a/gi, and the second one matches /.*?a.*?b.*?c/gi. So, func("acdxbaab ababaacb", [/.*?a.*?b.*?c/gi, /.*?c.*?b.*?a/gi]) should return true.
Is there any efficient way to solve such a problem?
Assuming each pattern should match exactly once, then we can construct a regexp of all of their permutations:
const patterns= ['ab', 'abc', 'abcd'];
const input = "Fhh, fabcw wxabcdy yz... zab.";
// Create a regexp of the form
// (.*?ab.*?abc.*?abcd.*?)
function build(patterns) {
return `(${['', ...patterns, ''].join('.*?')})`;
}
function match(input, patterns) {
const regexps = [...permute(patterns)].map(build);
// Create a regexp of the form
// /(.*?ab.*?abc.*?abcd.*?)|(.*?ab.*?abcd.*?abc.*?)|.../
const regexp = new RegExp(regexps.join('|'));
return regexp.test(input);
}
// Simple permutation generator.
function *permute(a, n = a.length) {
if (n <= 1) yield a.slice();
else for (let i = 0; i < n; i++) {
yield *permute(a, n - 1);
const j = n % 2 ? 0 : i;
[a[n-1], a[j]] = [a[j], a[n-1]];
}
}
console.log(match(input, patterns));
This will result in a very long regexp if there are more than a half-dozen or so patterns. To deal with this, we can test each permutation one at a time:
function match(input, patterns) {
return Array.from(permute(patterns))
.some(perm => input.match(build(perm)));
}
If there are ten patterns, there are 10! = 3,628,800 permutations, so in the worst case we will end up doing a few million tests.
Disclaimers
This uses ES6 features. Fall back to equivalent ES5 syntax if you need to.
The input patterns here are strings. To handle regexps instead would require a little bit of logic to extract the pattern from the regexp, and also escape any special regexp characters appearing in it.
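For the regexp case, one way to avoid the escaping problem is to reuse each regexp's source property instead of a string. The sketch below is self-contained and uses its own simple permutation generator. The name matchesAllNonOverlapping is made up, and it assumes a single set of flags (default "i") applies to every pattern, which loses any per-regexp flags.

```javascript
// Yield every ordering of the given items.
function* permutations(items) {
  if (items.length <= 1) {
    yield items.slice();
    return;
  }
  for (let i = 0; i < items.length; i++) {
    const rest = [...items.slice(0, i), ...items.slice(i + 1)];
    for (const tail of permutations(rest)) yield [items[i], ...tail];
  }
}

// True if the input contains one non-overlapping match per regexp,
// in at least one ordering. Assumption: "flags" applies to all patterns.
function matchesAllNonOverlapping(input, regexps, flags = "i") {
  for (const order of permutations(regexps)) {
    // Wrap each source in a non-capturing group so alternations stay intact,
    // and allow any text (including newlines) between consecutive matches.
    const source = order.map((re) => `(?:${re.source})`).join("[\\s\\S]*?");
    if (new RegExp(source, flags).test(input)) return true;
  }
  return false;
}

console.log(matchesAllNonOverlapping("Fhh, fabcw wxabcdy yz... zab.", [/ab/gi, /abc/gi, /abcd/gi])); // true
console.log(matchesAllNonOverlapping("eabzzzabcde", [/ab/gi, /abc/gi, /abcd/gi])); // false
```

This is still factorial in the number of patterns. It only removes the need to escape and rebuild the patterns by hand.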
Is there an efficient way to test whether a string contains non-overlapping substrings to match the array of regular expressions?
I doubt that you would call the above solution "efficient", but I don't know if there is a more efficient one. As far as I can see, any approach to this problem is going to involve backtracking. You could match the first nine of ten patterns, and then discover that the last one won't match because one of the earlier nine greedily ate up part of what the tenth needed, even though it could have matched itself somewhere later in the string. Therefore, I will go out on a limb and say that this problem is intrinsically of order O(n!).
The use case is I want to compare a query string of characters to an array of words, and return the matches. A match is when a word contains all the characters in the query string, order doesn't matter, repeated characters are okay. Regex seems like it may be too powerful (a sledgehammer where only a hammer is needed). I've written a solution that compares the characters by looping through them and using indexOf, but it seems consistently slower. (http://jsperf.com/indexof-vs-regex-inside-a-loop/10) Is Regex the fastest option for this type of operation? Are there ways to make my alternate solution faster?
var query = "word",
words = ['word', 'wwoorrddss', 'words', 'argument', 'sass', 'sword', 'carp', 'drowns'],
reStoredMatches = [],
indexOfMatches = [];
function match(word, query) {
var len = word.length,
charMatches = [],
charMatch,
char;
while (len--) {
char = word[len];
charMatch = query.indexOf(char);
if (charMatch !== -1) {
charMatches.push(char);
}
}
return charMatches.length === query.length;
}
function linearIndexOf(words, query) {
var wordsLen = words.length,
wordMatch,
word;
while (wordsLen--) {
word = words[wordsLen];
wordMatch = match(word, query);
if (wordMatch) {
indexOfMatches.push(word);
}
}
}
function linearRegexStored(words, query) {
var wordsLen = words.length,
re = new RegExp('[' + query + ']', 'g'),
match,
word;
while (wordsLen--) {
word = words[wordsLen];
match = word.match(re);
if (match !== null) {
if (match.length >= query.length) {
reStoredMatches.push(word);
}
}
}
}
Note that your regex is wrong, that's most certainly why it goes so fast.
Right now, if your query is "word" (as in your example), the regex is going to be:
/[word]/g
This means: look for any one of the characters 'w', 'o', 'r', or 'd'. With the 'g' flag, match() returns an array of every such character it finds, and your code only checks that the array has at least query.length entries. So a word like 'wwww' is accepted even though it does not contain all four letters. That is a much cheaper test than the more careful indexOf() version, which is why it looks so fast.
Also, you mention the idea/concept of any number of characters, I suppose as shown here:
'word', 'wwoorrddss'
The indexOf() will definitely not catch that properly if you really mean "any number" of each and every character, because there is an unbounded number of cases to match. As a regex it would be something like this:
/w+o+r+d+s+/g
Writing the equivalent code in plain JavaScript, rather than using a regex, will certainly be harder. Either way, it is going to be somewhat slow.
From the comment below, all the letters of the word are required. To check that with a regex, you need 3! alternatives (3 factorial) for a 3-letter word:
/(a.*b.*c)|(a.*c.*b)|(b.*a.*c)|(b.*c.*a)|(c.*a.*b)|(c.*b.*a)/
Obviously, a factorial is going to very quickly grow your number of possibilities and blow away your memory in a super long regex (although you can simplify if a word has the same letter multiple times, you do not have to test that letter more than once).
1! = 1
2! = 2
3! = 6
4! = 24
5! = 120
6! = 720
...
That's probably why your properly written test in plain JavaScript is much slower.
Also, in your case you should store the words much as Scrabble dictionaries do: all letters once, in alphabetical order (Scrabble keeps duplicates). So the word "word" would be "dorw". And as you showed in your example, the word "wwoorrddss" would be "dorsw". You can have some backend tool generate your table of words (so you still write them as "word" and "words", and your tool massages those and converts them to "dorw" and "dorsw"). Then you can sort the letters of the words you are testing in alphabetical order, and the result is that you do not need a silly factorial for the regex; you can simply do this:
/d.*o.*r.*w/
And that will match any word that contains all the letters of "word", such as "password" (once its letters have been sorted the same way).
One easy way to sort the letters is to split your word into an array of letters and then sort the array. You will still get duplicates, because the default JavaScript sort does not remove them; deduplicate separately if you need to.
One more detail: if your test is supposed to be case insensitive, then you want to transform your strings to lowercase before running the test. So something like:
query = query.toLowerCase();
early on in your top function.
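Putting the sorted-letters idea together might look like the sketch below. sortedUniqueLetters and containsAllLetters are hypothetical names, and the code assumes the query consists of plain letters only (no regex metacharacters).

```javascript
// "wwoorrddss" -> "dorsw": lowercase, drop duplicates, sort alphabetically.
function sortedUniqueLetters(word) {
  return Array.from(new Set(word.toLowerCase())).sort().join("");
}

// Build /d.*o.*r.*w/ from the query and test it against the sorted word.
// Assumption: the query contains letters only, so no escaping is done.
function containsAllLetters(word, query) {
  const pattern = new RegExp(sortedUniqueLetters(query).split("").join(".*"));
  return pattern.test(sortedUniqueLetters(word));
}

console.log(sortedUniqueLetters("wwoorrddss"));      // "dorsw"
console.log(containsAllLetters("password", "word")); // true
console.log(containsAllLetters("carp", "word"));     // false
```

In a real application the sorted form of each dictionary word would be computed once up front rather than on every test.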
You are trying to speed up the algorithm "chars in word are a subset of the chars of query." You can short circuit this check and avoid some assignments (that are more readable but not strictly needed). Try the following version of match
function match(word, query) {
var len = word.length;
while (len--) {
if (query.indexOf(word[len]) === -1) { // found a missing char
return false;
}
}
return true; // couldn't find any missing chars
}
This gives a 4-5x improvement.
Depending on the application you could try presorting words and presorting each word in words as another optimization.
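If a match is meant the way the question words it (the word contains every character of the query, repeats allowed), the check runs in the other direction: loop over the query and look each character up in the word. A minimal sketch under that reading, with the same short-circuit idea; wordContainsQuery and filterMatches are hypothetical names.

```javascript
// True if every character of the query appears somewhere in the word.
function wordContainsQuery(word, query) {
  const letters = new Set(word); // one pass over the word
  for (const ch of query) {
    if (!letters.has(ch)) return false; // stop at the first missing character
  }
  return true;
}

function filterMatches(words, query) {
  return words.filter((word) => wordContainsQuery(word, query));
}

const words = ['word', 'wwoorrddss', 'words', 'argument', 'sass', 'sword', 'carp', 'drowns'];
console.log(filterMatches(words, "word"));
// [ 'word', 'wwoorrddss', 'words', 'sword', 'drowns' ]
```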
The regexp match algorithm constructs a finite state automaton and makes its decisions based on the current state and the character read, going from left to right. This involves reading each character once and making a decision.
For static strings (looking for a fixed string in a body of text) you have better algorithms, like Knuth-Morris-Pratt, which never re-read text characters after a mismatch, but you must understand that this algorithm is not for matching regular expressions, just plain strings.
If you are interested in Knuth-Morris-Pratt (there are several other algorithms), just have a look at Wikipedia: http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
A good thing you can do is to investigate whether your regexp match routines use a DFA or an NFA, as NFAs occupy less memory and are easier to construct, while DFAs match faster, at the cost of some compilation time and more memory.
The Knuth-Morris-Pratt algorithm also needs to compile the string into an automaton before working, so perhaps it doesn't apply to your problem if you are using it just to find one word in some string.
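For reference, here is a compact sketch of Knuth-Morris-Pratt for plain strings. The names buildFailureTable and kmpIndexOf are illustrative, and in practice the built-in indexOf() is likely to be the better choice in JavaScript. The "compilation" step mentioned above is the failure table built from the pattern.

```javascript
// For each prefix of the pattern, the length of its longest proper prefix
// that is also a suffix. This is the precomputed part of the algorithm.
function buildFailureTable(pattern) {
  const table = new Array(pattern.length).fill(0);
  let k = 0;
  for (let i = 1; i < pattern.length; i++) {
    while (k > 0 && pattern[i] !== pattern[k]) k = table[k - 1];
    if (pattern[i] === pattern[k]) k++;
    table[i] = k;
  }
  return table;
}

// Index of the first occurrence of pattern in text, or -1.
function kmpIndexOf(text, pattern) {
  if (pattern.length === 0) return 0;
  const table = buildFailureTable(pattern);
  let k = 0; // number of pattern characters matched so far
  for (let i = 0; i < text.length; i++) {
    // On a mismatch, fall back in the pattern; the text index never moves back.
    while (k > 0 && text[i] !== pattern[k]) k = table[k - 1];
    if (text[i] === pattern[k]) k++;
    if (k === pattern.length) return i - k + 1;
  }
  return -1;
}

console.log(kmpIndexOf("password", "word")); // 4
console.log(kmpIndexOf("abababca", "abca")); // 4
```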
I'm fairly sure after spending the night trying to find an answer that this isn't possible, and I've developed a work around - but, if someone knows of a better method, I would love to hear it...
I've gone through a lot of iterations on the code, and the following is just a line of thought really. At some point I was using the global flag, I believe, in order for match() to work, and I can't remember if it was necessary now or not.
var str = "#abc#def#ghi&jkl";
var regex = /^(?:#([a-z]+))?(?:&([a-z]+))?$/;
The idea here, in this simplified code, is the optional group 1, of which there is an unspecified amount, will match #abc, #def and #ghi. It will only capture the alpha characters of which there will be one or more. Group 2 is the same, except matches on & symbol. It should also be anchored to the start and end of the string.
I want to be able to back reference all matches of both groups, ie:
result = str.match(regex);
alert(result[1]); //abc,def,ghi
alert(result[1][0]); //abc
alert(result[1][1]); //def
alert(result[1][2]); //ghi
alert(result[2]); //jkl
My mate says this works fine for him in .net, unfortunately I simply can't get it to work - only the last matched of any group is returned in the back reference, as can be seen in the following:
(additionally, making either group optional makes a mess, as does setting global flag)
var str = "#abc#def#ghi&jkl";
var regex = /(?:#([a-z]+))(?:&([a-z]+))/;
var result = str.match(regex);
alert(result[1]); //ghi
alert(result[1][0]); //g
alert(result[2]); //jkl
The following is the solution I arrived at, capturing the whole portion in question, and creating the array myself:
var str = "#abc#def#ghi&jkl";
var regex = /^([#a-z]+)?(?:&([a-z]+))?$/;
var result = regex.exec(str);
alert(result[1]); //#abc#def#ghi
alert(result[2]); //jkl
var result1 = result[1].toString();
result[1] = result1.split('#')
alert(result[1][1]); //abc
alert(result[1][2]); //def
alert(result[1][3]); //ghi
alert(result[2]); //jkl
That's simply not how .match() works in JavaScript. The returned array is an array of simple strings. There's no "nesting" of capture groups; you just count the ( symbols from left to right.
The first string (at index [0]) is always the overall matched string. Then come the capture groups, one string (or undefined, if the group did not take part in the match) per array element.
You can, as you've done, rearrange the result array to your heart's content. It's just an array.
edit — oh, and the reason your result[1][0] was "g" is that array indexing notation applied to a string gets you the individual characters of the string.
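To get the nested result the question was after, one option is to capture the whole run of #... items once and then pull the pieces out with a global regex. This is a sketch that assumes the input format from the question; parseTags is a made-up name, and it relies on String.prototype.matchAll, which needs an ES2020 engine.

```javascript
// Returns [arrayOfHashItems, ampersandItem], or null if the string
// does not have the expected overall shape.
function parseTags(str) {
  const whole = /^((?:#[a-z]+)*)(?:&([a-z]+))?$/.exec(str);
  if (whole === null) return null;
  // Group 1 holds the entire "#abc#def#ghi" run; split it with a global regex.
  const hashes = Array.from(whole[1].matchAll(/#([a-z]+)/g), (m) => m[1]);
  // Group 2 is undefined when there is no "&..." part.
  return [hashes, whole[2]];
}

const result = parseTags("#abc#def#ghi&jkl");
console.log(result[0]);    // [ 'abc', 'def', 'ghi' ]
console.log(result[0][1]); // "def"
console.log(result[1]);    // "jkl"
```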