I'm trying to split a string in infix notation into a tokenized list, ideally with regex.
e.g. ((10 + 4) ^ 2) * 5 would return ['(', '(', '10', '+', '4', ')', '^', '2', ')', '*', '5']
At the moment I'm just splitting it up by character, but this doesn't allow me to represent numbers with more than one digit.
I tried tokens = infixString.split("(\d+|[^ 0-9])"); which I found online for this very same problem, but I think it was for Java and it simply gives a list with only one element, being the entire infixString itself.
I know next to nothing about regex, so any tips would be appreciated. Thanks!
That's because you're passing a string to split, which is treated as a literal separator rather than a pattern. If you use a regex literal, the output is closer to what you'd expect:
infixString.split(/(\d+|[^ 0-9])/)
// Array(23) [ "", "(", "", "(", "", "10", " ", "+", " ", "4", … ]
However, there are a bunch of empty elements and whitespace that you might want to filter out:
infixString.split(/(\d+|[^ 0-9])/).filter(e => e.trim().length > 0)
// Array(11) [ "(", "(", "10", "+", "4", ")", "^", "2", ")", "*", … ]
Depending on the version of JavaScript/ECMAScript you're targeting here, the syntax in the filter (or the filter function itself) might need to be adapted.
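For engines that predate arrow functions, the same filter can be sketched with a function expression. This is only a minimal variant of the snippet above, assuming String.prototype.trim and Array.prototype.filter (both ES5) are available; the variable name tokens is just illustrative:

```javascript
// ES5-compatible variant: a function expression instead of an arrow function.
var infixString = "((10 + 4) ^ 2) * 5";
var tokens = infixString.split(/(\d+|[^ 0-9])/).filter(function (e) {
  return e.trim().length > 0; // drop empty strings and whitespace-only pieces
});
console.log(tokens);
```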
let test = "((10 + 4) ^ 2) * 5 * -1.5";
let arr = test.replace(/\s+/g, "").match(/(?:(?<!\d)-)?\d+(?:\.\d+)?|./g);
console.log(arr);
(?:(?<!\d)-)?\d+(?:\.\d+)?
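(?:(?<!\d)-)? : optional non-capturing group with a negative lookbehind. It matches a minus sign only when no digit (\d) comes right before it, i.e. when it is a sign rather than a subtraction.
(?:\.\d+)? : ?: makes the group non-capturing, \.\d+ matches a dot followed by one or more digits, and the trailing ? makes the whole group optional.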
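To see the lookbehind at work, here is a small sketch that wraps the same pattern in a helper (the name tokenize is hypothetical, not part of the answer). It assumes an engine with lookbehind support (ES2018 or later):

```javascript
// Same pattern as above: a signed number, or any single character.
const numberOrChar = /(?:(?<!\d)-)?\d+(?:\.\d+)?|./g;

// Hypothetical helper: strip whitespace, then collect every match.
function tokenize(expr) {
  return expr.replace(/\s+/g, "").match(numberOrChar) || [];
}

console.log(tokenize("5 - 1"));    // ["5", "-", "1"]   minus after a digit is an operator
console.log(tokenize("5 * -1.5")); // ["5", "*", "-1.5"] minus after "*" is a sign
```

One limitation worth knowing: the lookbehind only checks for a digit, so in "(2) - 1" the minus follows ")" and is folded into the number, giving ["(", "2", ")", "-1"]. If that matters for your input, the sign detection needs more context than this pattern provides.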
I want to split a string with separators ' or .. WHILE KEEPING them:
"'TEST' .. 'TEST2' ".split(/([' ] ..)/g);
to get:
["'", "TEST", "'", "..", "'", "TEST2", "'" ]
but it doesn't work. Do you know how to fix this?
The [' ] .. pattern matches a ' or space followed with a space and any two chars other than line break chars.
You may use
console.log("'TEST' .. 'TEST2' ".trim().split(/\s*('|\.{2})\s*/).filter(Boolean))
Here,
.trim() - remove leading/trailing whitespace
.split(/\s*('|\.{2})\s*/) - splits string with ' or double dot (that are captured in a capturing group and thus are kept in the resulting array) that are enclosed in 0+ whitespaces
.filter(Boolean) - removes empty items.
I'm not sure it will work in every situation, but you can try this:
"'TEST' .. 'TEST2' ".replace(/(\'|\.\.)/g, ' $1 ').trim().split(/\s+/)
which returns:
["'", "TEST", "'", "..", "'", "TEST2", "'"]
Splitting while keeping the delimiters can often be reduced to a matchAll. In this case, /(?:'|\.\.|\S[^']+)/g seems to do the job on the example. The idea is to alternate between literal single quote characters, two literal periods, or any sequence up to a single quote that starts with a non-space.
const result = [..."'TEST' .. 'TEST2' ".matchAll(/(?:'|\.\.|\S[^']+)/g)].flat();
console.log(result);
Another idea that might be more robust even if it's not a single shot regex is to use a traditional, non-clever "stuff between delimiters" pattern like /'([^']+)'/g, then flatMap to clean up the result array to match your format.
const s = "'TEST' .. 'TEST2' ";
const result = [...s.matchAll(/'([^']+)'/g)].flatMap(e =>
["'", e[1], "'", ".."]
).slice(0, -1);
console.log(result);
I'm building something called a formula builder. The idea is that the user types a formula into a textarea, and then we parse the string value. The result is an array.
For example, this text will be parsed
LADV-(GCNBIZ+UNIN)+(TNW*-1)
then generate result below
["LADV", "-", "(", "GCNBIZ", "+", "UNIN", ")", "+", "(", "TNW", "*", "-1", ")"]
The point is to split at each word joined by one of these characters: +, *, /, -, ( and ), while still including the splitter itself.
I have tried to split using the expression /[-+*/()]/g, but the result doesn't include the splitter characters. Also, the -1 needs to be detected as a single expression.
["LADV", "MISC", "", "GCNBIZ", "UNIN", "", "", "TNW", "", "1", ""]
What is the match regex to solve this?
var input = 'LADV-(GCNBIZ+UNIN)+(TNW*-1)';
var match = input.match(/(-?\d+)|([a-z]+)|([-+*()\/])/gmi);
console.log(match);
You can use match instead of split with an alternation regex:
var s = 'LADV-(GCNBIZ+UNIN)+(TNW*-1)';
var m = s.match(/(-\d+(?:\.\d+)?|[-+\/*()]|\w+)/g);
console.log(m);
//=> ["LADV", "-", "(", "GCNBIZ", "+", "UNIN", ")", "+", "(", "TNW", "*", "-1", ")"]
RegEx Breakup:
( # start capture group
- # match a hyphen
\d+(?:\.\d+)? # match a number
| # OR
[-+\/*()] # match one of the symbols
| # OR
\w+ # match 1 or more word characters
) # end capture group
Order of patterns in alternation is important.
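A quick way to see why the order matters is to run the same alternatives in two different orders. This is a minimal sketch on a fragment of the example input; the variable names are illustrative only:

```javascript
const fragment = "TNW*-1";

// Signed number tried first: "-1" is kept together.
const numberFirst = fragment.match(/-\d+(?:\.\d+)?|[-+\/*()]|\w+/g);

// Symbol class tried first: the "-" is consumed on its own before the number is tried.
const symbolFirst = fragment.match(/[-+\/*()]|-\d+(?:\.\d+)?|\w+/g);

console.log(numberFirst); // ["TNW", "*", "-1"]
console.log(symbolFirst); // ["TNW", "*", "-", "1"]
```

The regex engine takes the first alternative that matches at the current position, so the more specific pattern has to come before the more general one.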
When using a regex as the separator in the split(), is there a way to know what string it matched?
Example:
var
string = "12+34-12",
numberlist = string.split(/[^0-9]/);
how would I know if it found a + or a -?
You can use capturing group to also capture string that was used in String#split:
var m = string.split(/(\D)/);
//=> ["12", "+", "34", "-", "12"]
To see the difference here is the output without capturing group:
var m = string.split(/\D/);
//=> ["12", "34", "12"]
PS: I have changed your use of [^0-9] to \D since they are equivalent.
Just capture the splitting regular expression, like
numberlist = string.split(/([^0-9])/);
and the output will be
[ '12', '+', '34', '-', '12' ]
Since you are capturing the splitting regular expression, it will also be a part of the resulting array.
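Once the separators are part of the array, they can be picked out by position. The sketch below assumes the input starts and ends with a number and that every separator is a single non-digit character, so numbers land on even indices and separators on odd ones; the variable names are illustrative:

```javascript
const parts = "12+34-12".split(/(\D)/);

// Even positions hold the numbers, odd positions hold the captured separators.
const numbers = parts.filter((_, i) => i % 2 === 0).map(Number);
const operators = parts.filter((_, i) => i % 2 === 1);

console.log(numbers);   // [12, 34, 12]
console.log(operators); // ["+", "-"]
```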
I want to find in a math expression elements that are not wrapped between { and }
Examples:
Input: abc+1*def
Matches: ["abc", "1", "def"]
Input: {abc}+1+def
Matches: ["1", "def"]
Input: abc+(1+def)
Matches: ["abc", "1", "def"]
Input: abc+(1+{def})
Matches: ["abc", "1"]
Input: abc def+(1.1+{ghi})
Matches: ["abc def", "1.1"]
Input: 1.1-{abc def}
Matches: ["1.1"]
Rules
The expression is well-formed. (So there won't be start parenthesis without closing parenthesis or starting { without })
The math symbols allowed in the expression are + - / * and ( )
Numbers could be decimals.
Variables could contain spaces.
Only one level of { } (no nested brackets)
So far, I ended up with: http://regex101.com/r/gU0dO4
(^[^/*+({})-]+|(?:[/*+({})-])[^/*+({})-]+(?:[/*+({})-])|[^/*+({})-]+$)
I split the task into 3:
match elements at the beginning of the string
match elements that are between two { and }
match elements at the end of the string
But it doesn't work as expected.
Any idea ?
Matching {}s, especially nested ones, is hard (read: impossible) for a standard regular expression, since it requires counting the number of {s you encountered so you know which } terminates them.
Instead, a simple string manipulation method could work. This is a very basic parser that just reads the string left to right and consumes it when outside of the braces.
var input = "abc def+(1.1+{ghi})"; // I assume well formed, as well as no precedence
var inParens = false;
var output = [], buffer = "", parenCount = 0;
for (var i = 0; i < input.length; i++) {
    if (!inParens) {
        if (input[i] === "{") {
            inParens = true;
            parenCount++;
        } else if (["+","-","(",")","/","*"].some(function (x) {
            return x === input[i];
        })) { // got symbol
            if (buffer !== "") { // buffer has stuff to add to output
                output.push(buffer); // add the element collected so far
                buffer = "";
            }
        } else { // letter or number
            buffer += input[i]; // push to buffer
        }
    } else { // inParens is true
        if (input[i] === "{") parenCount++;
        if (input[i] === "}") parenCount--;
        if (parenCount === 0) inParens = false; // consume again
    }
}
if (buffer !== "") output.push(buffer); // flush an element that ends the input, e.g. "def" in abc+1*def
This might be an interesting regexp challenge, but in the real world you'd be much better off simply finding all [^+/*()-]+ groups and removing those enclosed in {}'s
"abc def+(1.1+{ghi})".match(/[^+/*()-]+/g).filter(
function(x) { return !/^{.+?}$/.test(x) })
// ["abc def", "1.1"]
That being said, regexes are not the right tool for parsing math expressions. For serious parsing, consider using formal grammars and parsers. There are plenty of parser generators for JavaScript; for example, in PEG.js you can write a grammar like
expr
    = left:multiplicative "+" expr
    / multiplicative

multiplicative
    = left:primary "*" right:multiplicative
    / primary

primary
    = atom
    / "{" expr "}"
    / "(" expr ")"

atom = number / word
number = n:[0-9.]+ { return parseFloat(n.join("")) }
word = w:[a-zA-Z ]+ { return w.join("") }
and generate a parser which will be able to turn
abc def+(1.1+{ghi})
into
[
    "abc def",
    "+",
    [
        "(",
        [
            1.1,
            "+",
            [
                "{",
                "ghi",
                "}"
            ]
        ],
        ")"
    ]
]
Then you can iterate this array just normally and fetch the parts you're interested in.
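As a rough illustration of that last step, the nested array can be walked recursively while skipping any group that opens with "{". The tree literal below is copied from the output shown above; the function name collectUnbraced and the SYMBOLS set are hypothetical helpers, not something the parser generator provides:

```javascript
// Parse result copied from the example output.
const tree = ["abc def", "+", ["(", [1.1, "+", ["{", "ghi", "}"]], ")"]];

// Operator and parenthesis tokens that should not be reported as elements.
const SYMBOLS = new Set(["+", "-", "*", "/", "(", ")"]);

function collectUnbraced(node, out = []) {
  if (Array.isArray(node)) {
    if (node[0] === "{") return out;          // skip everything inside braces
    for (const child of node) collectUnbraced(child, out);
  } else if (!SYMBOLS.has(node)) {
    out.push(node);                           // a word or a parsed number
  }
  return out;
}

console.log(collectUnbraced(tree)); // ["abc def", 1.1]
```

Note that the number comes back as 1.1 rather than "1.1", because the grammar's number rule already calls parseFloat.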
The variable names you mentioned can be matched by \b[\w.]+\b since they are strictly bounded by word separators.
Since you have well-formed formulas, the names you don't want to capture are strictly followed by }, so you can use a lookahead expression to exclude them:
(\b[\w.]+ \b)(?!})
Will match the required elements (http://regexr.com/38rch).
Edit:
For more complex uses like correctly matching :
abc {def{}}
abc def+(1.1+{g{h}i})
We need to change the lookahead term to (?!({|}))
To include the match of 1.2-{abc def} we need to change the \b [1]. That term relies on lookaround expressions, which are not available in JavaScript, so we have to work around it.
(?:^|[^a-zA-Z0-9. ])([a-zA-Z0-9. ]+(?=[^0-9A-Za-z. ]))(?!({|}))
Seems to be a good one for our examples (http://regex101.com/r/oH7dO1).
[1] \b is the boundary between a \w and a \W, \z or \a. Since \w does not include the space character and \W does, it is incompatible with the definition of our variable names.
Going forward with user2864740's comment, you can replace everything between { and } with an empty string and then match what remains.
var matches = "string here".replace(/{.+?}/g,"").match(/\b[\w. ]+\b/g);
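Wrapped in a function, the same two steps can be checked against the examples from the question. This is a sketch only: the name unbracedElements is hypothetical, and the || [] fallback is an addition to cover the case where match finds nothing and returns null:

```javascript
function unbracedElements(expression) {
  // Drop every {...} group first, then match runs of word characters, dots and spaces.
  return expression.replace(/{.+?}/g, "").match(/\b[\w. ]+\b/g) || [];
}

console.log(unbracedElements("abc def+(1.1+{ghi})")); // ["abc def", "1.1"]
console.log(unbracedElements("{abc}+1+def"));         // ["1", "def"]
console.log(unbracedElements("1.1-{abc def}"));       // ["1.1"]
```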
Since you know that expressions are valid, just select \w+
Suppose I've a long string containing newlines and tabs as:
var x = "This is a long string.\n\t This is another one on next line.";
So how can we split this string into tokens, using regular expression?
I don't want to use .split(' ') because I want to learn Javascript's Regex.
A more complicated string could be this:
var y = "This #is a #long $string. Alright, lets split this.";
Now I want to extract only the valid words out of this string, without special characters, and punctuation, i.e I want these:
var xwords = ["This", "is", "a", "long", "string", "This", "is", "another", "one", "on", "next", "line"];
var ywords = ["This", "is", "a", "long", "string", "Alright", "lets", "split", "this"];
Here is a jsfiddle example of what you asked: http://jsfiddle.net/ayezutov/BjXw5/1/
Basically, the code is very simple:
var y = "This #is a #long $string. Alright, lets split this.";
var regex = /[^\s]+/g; // one or more non-whitespace characters; the g flag finds every match, not just the first
var match = y.match(regex);
for (var i = 0; i<match.length; i++)
{
document.write(match[i]);
document.write('<br>');
}
UPDATE:
Basically you can expand the list of separator characters: http://jsfiddle.net/ayezutov/BjXw5/2/
var regex = /[^\s\.,!?]+/g;
UPDATE 2:
Only word characters (letters, digits and underscore):
http://jsfiddle.net/ayezutov/BjXw5/3/
var regex = /\w+/g;
Use \s+ to tokenize the string.
exec can loop through the matches to remove non-word (\W) characters.
var A= [], str= "This #is a #long $string. Alright, let's split this.",
rx=/\W*([a-zA-Z][a-zA-Z']*)(\W+|$)/g, words;
while((words= rx.exec(str))!= null){
A.push(words[1]);
}
A.join(', ')
/* returned value: (String)
This, is, a, long, string, Alright, let's, split, this
*/
var words = y.split(/[^A-Za-z0-9]+/);
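One detail to watch with this split: when the string ends (or starts) with a separator, the result contains an empty string at that end. A minimal sketch using the y string from the question, with filter(Boolean) added to drop the empty entries (the variable name raw is illustrative):

```javascript
var y = "This #is a #long $string. Alright, lets split this.";

var raw = y.split(/[^A-Za-z0-9]+/); // the trailing "." leaves an empty string at the end
var words = raw.filter(Boolean);    // remove empty entries

console.log(raw.length);   // 10
console.log(words.length); // 9
console.log(words);
```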
Here is a solution using regex groups to tokenise the text using different types of tokens.
You can test the code here https://jsfiddle.net/u3mvca6q/5/
/*
Basic Regex explanation:
/ Regex start
(\w+) First group, words: \w means an ASCII word character (letter, digit or underscore) and + means 1 or more of them
| or
(,|!) Second group, punctuation
| or
(\s) Third group, white spaces
/ Regex end
g "global", enables looping over the string to capture one element at a time
Regex result:
result[0] : default group : any match
result[1] : group1 : words
result[2] : group2 : punctuation , !
result[3] : group3 : whitespace
*/
var basicRegex = /(\w+)|(,|!)|(\s)/g;
/*
Advanced Regex explanation:
[a-zA-Z\u0080-\u00FF] instead of \w Supports some Unicode letters instead of ASCII letters only. Find Unicode ranges here https://apps.timwhitlock.info/js/regex
(\.\.\.|\.|,|!|\?) Identify ellipsis (...) and points as separate entities
You can improve it by adding ranges for special punctuation and so on
*/
var advancedRegex = /([a-zA-Z\u0080-\u00FF]+)|(\.\.\.|\.|,|!|\?)|(\s)/g;
var basicString = "Hello, this is a random message!";
var advancedString = "Et en français ? Avec des caractères spéciaux ... With one point at the end.";
console.log("------------------");
var result = null;
do {
result = basicRegex.exec(basicString)
console.log(result);
} while(result != null)
console.log("------------------");
var result = null;
do {
result = advancedRegex.exec(advancedString)
console.log(result);
} while(result != null)
/*
Output:
Array [ "Hello", "Hello", undefined, undefined ]
Array [ ",", undefined, ",", undefined ]
Array [ " ", undefined, undefined, " " ]
Array [ "this", "this", undefined, undefined ]
Array [ " ", undefined, undefined, " " ]
Array [ "is", "is", undefined, undefined ]
Array [ " ", undefined, undefined, " " ]
Array [ "a", "a", undefined, undefined ]
Array [ " ", undefined, undefined, " " ]
Array [ "random", "random", undefined, undefined ]
Array [ " ", undefined, undefined, " " ]
Array [ "message", "message", undefined, undefined ]
Array [ "!", undefined, "!", undefined ]
null
*/
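If checking which numbered group is defined feels fragile, the same idea can be sketched with named groups and matchAll, so that each token carries its type. This assumes an engine supporting named capture groups (ES2018) and String.prototype.matchAll (ES2020); the group names and the tokenize helper are hypothetical:

```javascript
// Same three alternatives as basicRegex, with a name on each group.
const tokenRegex = /(?<word>\w+)|(?<punct>,|!)|(?<space>\s)/g;

function tokenize(text) {
  const tokens = [];
  for (const m of text.matchAll(tokenRegex)) {
    // Exactly one named group is defined per match; its name is the token type.
    const type = Object.keys(m.groups).find(k => m.groups[k] !== undefined);
    tokens.push({ type: type, value: m[0] });
  }
  return tokens;
}

console.log(tokenize("Hello, world!"));
```

Characters that match none of the alternatives are silently skipped, exactly as in the exec loop above.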
In order to extract word-only characters, we use the \w symbol. Whether this matches Unicode characters is implementation-dependent, and you can use this reference to see what the case is for your language/library.
Please see Alexander Yezutov's answer (update 2) on how to apply this into an expression.
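In JavaScript specifically, \w only covers [A-Za-z0-9_], so accented letters split a word. A small sketch of the difference, assuming an engine that supports Unicode property escapes with the u flag (ES2018); the sample text and variable names are illustrative:

```javascript
const text = "Et en fran\u00e7ais"; // "Et en français"

const asciiWords = text.match(/\w+/g);       // \w stops at the non-ASCII letter
const unicodeWords = text.match(/\p{L}+/gu); // \p{L} matches any Unicode letter

console.log(asciiWords);   // ["Et", "en", "fran", "ais"]
console.log(unicodeWords); // ["Et", "en", "français"]
```

Note that \p{L} matches letters only, so unlike \w it does not include digits or the underscore.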