Regex split on upper case and first digit - javascript

I need to split the string "thisIs12MyString" to an array looking like [ "this", "Is", "12", "My", "String" ]
I've got so far as to "thisIs12MyString".split(/(?=[A-Z0-9])/) but it splits on each digit and gives the array [ "this", "Is", "1", "2", "My", "String" ]
So in words I need to split the string on upper case letter and digits that does not have an another digit in front of it.

Are you looking for this?
"thisIs12MyString".match(/[A-Z]?[a-z]+|[0-9]+/g)
returns
["this", "Is", "12", "My", "String"]

As I said in my comment, my approach would be to insert a special character before each sequence of digits first, as a marker:
"thisIs12MyString".replace(/\d+/g, '~$&').split(/(?=[A-Z])|~/)
where ~ could be any other character, preferably a non-printable one (e.g. a control character), as it is unlikely to appear "naturally" in a string.
In that case, you could even insert the marker before each capital letter as well, and omit the lookahead, making the split very easy:
"thisIs12MyString".replace(/\d+|[A-Z]/g, '~$&').split('~')
It might or might not perform better.

In my rhino console,
js> "thisIs12MyString".replace(/([A-Z]|\d+)/g, function(x){return " "+x;}).split(/ /);
this,Is,12,My,String
another one,
js> "thisIs12MyString".split(/(?:([A-Z]+[a-z]+))/g).filter(function(a){return a;});
this,Is,12,My,String

You can fix the JS missing of lookbehinds working on the array split using your current regex.
Quick pseudo code:
var result = [];
var digitsFlag = false;
"thisIs12MyString".split(/(?=[A-Z0-9])/).forEach(function(word) {
if (isSingleDigit(word)) {
if (!digitsFlag) {
result.push(word);
} else {
result[result.length - 1] += word;
}
digitsFlag = true;
} else {
result.push(word);
digitsFlag = false;
}
});

I can't think of any ways to achieve this with a RegEx.
I think you will need to do it in code.
Please check the URL, same question different language (ruby) ->
The code is at the bottom:
http://code.activestate.com/recipes/440698-split-string-on-capitalizeduppercase-char/

Related

Regex to for .split() to separate string on spaces except for quotes

Is it possible to use a Regex to split a string into spaces and quotes? I can only use .split() instead of .match() for performance reasons.
Example:
'This is an "example for" stack overflow.'
Output:
["This", "is", "an", "example for", "stack", "overflow"]
Short answer to your question is Yes, you can use a regex in String.prototype.split().
This is the code you want based on your example:
'This is an "example for" stack overflow.'.split(/\"\s|\s\"|\s|\"/g);
Rather than splitting the string, it will be easier for you to capture them in tokens either purely a word or capture a word having space within but enclosed within double quotes with using this regex,
"([\w ]*?)"|\w+
Sample Javascript code for same is following,
var s = 'This is an "example for" stack overflow.';
var re = /"([\w ]*?)"|\w+/g;
var arr = [];
do {
m = re.exec(s);
if (m) {
if (m[1]) {
arr.push(m[1]);
} else if (m[0]) {
arr.push(m[0]);
}
}
} while (m);
console.log(arr);

regex capturing with repeating pattern

I'm trying to capture all parts of a string, but I can't seem to get it right.
The string has this structure: 1+22+33. Numbers with an operator in between. There could be any number of terms.
What I want is ["1+22+33", "1", "+", "22", "+", "33"]
But I get: ["1+22+33", "22", "+", "33"]
I've tried all kinds of regexes, this is the best I've got, but it's obviously wrong.
let re = /(?:(\d+)([+]+))+(\d+)/g;
let s = '1+22+33';
let m;
while (m = re.exec(s))
console.log(m);
Note: the operators may vary. So in reality I'd look for [+/*-].
You can simply use String#split, like this:
const input = '3+8 - 12'; // I've willingly added some random spaces
console.log(input.split(/\s*(\+|-)\s*/)); // Add as many other operators as needed
Just thought of a solution: /(\d+)|([+*/-]+)/g;
You only have to split on digits:
console.log(
"1+22+33".split(/(\d+)/).filter(Boolean)
);

Regex match entire string while grouping

I'm trying to match a currency string that may or may not be suffixed with one of K, M, or Bn, and group them into two parts
Valid matches:
500 K // Expected grouping: ["500", "K"]
900,000 // ["900,000", ""]
2.3 Bn // ["2.3", "Bn"]
800M // ["800", "M"]
ps: I know the matches first item in match output array is the entire match string, the above expected grouping in only an example
The Regex I've got so far is this:
/\b([-\d\,\.]+)\s?([M|Bn|K]?)\b/i
When I match it with a normal string, it does OK.
"898734 K".match(/\b([-\d\,\.]+)\s?([M|Bn|K]?)\b/i)
=> ["898734 K", "898734", "K"] // output
"500,000".match(/\b([-\d\,\.]+)\s?([M|Bn|K]?)\b/i)
=> ["500,000", "500,000", ""]
Trouble is, it also matches space in there
"89 8734 K".match(/\b([-\d\,\.]+)\s?([M|Bn|K]?)\b/i)
=> ["89 ", "89", ""]
And I'm not sure why. So I thought I'd add /g option in there to match entire string, but now it doesn't group the matches.
"898734 K".match(/\b([-\d\,\.]+)\s?([M|Bn|K]?)\b/gi)
=> ["898734 K"]
What change do I need to make to get the regex behave as expected?
You could use a different regular expression, which looks for some numbers, a comma or dot and some other numbers as well, some whitepspace and the wanted letters.
var array = ['500 K', '900,000', '2.3 Bn', '800M'],
regex = /(\d+[.,]?\d*)\s*(K|Bn|M|$)/
array.forEach(function (a) {
var m = a.match(regex);
if (m) {
m.shift();
console.log(m);
}
});
.as-console-wrapper { max-height: 100% !important; top: 0; }
You have a problem and want to use a regex to solve the problem. Now you have two problems...
Joke aside, I think you can achieve what you want to do without any regex:
"".join([c for i, c in enumerate(itertools.takewhile(lambda c: c.isdigit() or c in ',.', s))]), s[i+1:]
I tried this with s="560 K", s="900,000", etc and it seems to work.

Separating words with Regex

I am trying to get this result: 'Summer-is-here'. Why does the code below generate extra spaces? (Current result: '-Summer--Is- -Here-').
function spinalCase(str) {
var newA = str.split(/([A-Z][a-z]*)/).join("-");
return newA;
}
spinalCase("SummerIs Here");
You are using a variety of split where the regexp contains a capturing group (inside parentheses), which has a specific meaning, namely to include all the splitting strings in the result. So your result becomes:
["", "Summer", "", "Is", " ", "Here", ""]
Joining that with - gives you the result you see. But you can't just remove the unnecessary capture group from the regexp, because then the split would give you
["", "", " ", ""]
because you are splitting on zero-width strings, due to the * in your regexp. So this doesn't really work.
If you want to use split, try splitting on zero-width or space-only matches looking ahead to a uppercase letter:
> "SummerIs Here".split(/\s*(?=[A-Z])/)
^^^^^^^^^ LOOK-AHEAD
< ["Summer", "Is", "Here"]
Now you can join that to get the result you want, but without the lowercase mapping, which you could do with:
"SummerIs Here" .
split(/\s*(?=[A-Z])/) .
map(function(elt, i) { return i ? elt.toLowerCase() : elt; }) .
join('-')
which gives you want you want.
Using replace as suggested in another answer is also a perfectly viable solution. In terms of best practices, consider the following code from Ember:
var DECAMELIZE_REGEXP = /([a-z\d])([A-Z])/g;
var DASHERIZE_REGEXP = /[ _]/g;
function decamelize(str) {
return str.replace(DECAMELIZE_REGEXP, '$1_$2').toLowerCase();
}
function dasherize(str) {
return decamelize(str).replace(DASHERIZE_REGEXP, '-');
}
First, decamelize puts an underscore _ in between two-character sequences of lower-case letter (or digit) and upper-case letter. Then, dasherize replaces the underscore with a dash. This works perfectly except that it lower-cases the first word in the string. You can sort of combine decamelize and dasherize here with
var SPINALIZE_REGEXP = /([a-z\d])\s*([A-Z])/g;
function spinalCase(str) {
return str.replace(SPINALIZE_REGEXP, '$1-$2').toLowerCase();
}
You want to separate capitalized words, but you are trying to split the string on capitalized words that's why you get those empty strings and spaces.
I think you are looking for this :
var newA = str.match(/[A-Z][a-z]*/g).join("-");
([A-Z][a-z]*) *(?!$|[a-z])
You can simply do a replace by $1-.See demo.
https://regex101.com/r/nL7aZ2/1
var re = /([A-Z][a-z]*) *(?!$|[a-z])/g;
var str = 'SummerIs Here';
var subst = '$1-';
var result = str.replace(re, subst);
var newA = str.split(/ |(?=[A-Z])/).join("-");
You can change the regex like:
/ |(?=[A-Z])/ or /\s*(?=[A-Z])/
Result:
Summer-Is-Here

Javascript regex find variables in a math equation

I want to find in a math expression elements that are not wrapped between { and }
Examples:
Input: abc+1*def
Matches: ["abc", "1", "def"]
Input: {abc}+1+def
Matches: ["1", "def"]
Input: abc+(1+def)
Matches: ["abc", "1", "def"]
Input: abc+(1+{def})
Matches: ["abc", "1"]
Input: abc def+(1.1+{ghi})
Matches: ["abc def", "1.1"]
Input: 1.1-{abc def}
Matches: ["1.1"]
Rules
The expression is well-formed. (So there won't be start parenthesis without closing parenthesis or starting { without })
The math symbols allowed in the expression are + - / * and ( )
Numbers could be decimals.
Variables could contains spaces.
Only one level of { } (no nested brackets)
So far, I ended with: http://regex101.com/r/gU0dO4
(^[^/*+({})-]+|(?:[/*+({})-])[^/*+({})-]+(?:[/*+({})-])|[^/*+({})-]+$)
I split the task into 3:
match elements at the beginning of the string
match elements that are between two { and }
match elements at the end of the string
But it doesn't work as expected.
Any idea ?
Matching {}s, especially nested ones is hard (read impossible) for a standard regular expression, since it requires counting the number of {s you encountered so you know which } terminated it.
Instead, a simple string manipulation method could work, this is a very basic parser that just reads the string left to right and consumes it when outside of parentheses.
var input = "abc def+(1.1+{ghi})"; // I assume well formed, as well as no precedence
var inParens = false;
var output = [], buffer = "", parenCount = 0;
for(var i = 0; i < input.length; i++){
if(!inParens){
if(input[i] === "{"){
inParens = true;
parenCount++;
} else if (["+","-","(",")","/","*"].some(function(x){
return x === input[i];
})){ // got symbol
if(buffer!==""){ // buffer has stuff to add to input
output.push(buffer); // add the last symbol
buffer = "";
}
} else { // letter or number
buffer += input[i]; // push to buffer
}
} else { // inParens is true
if(input[i] === "{") parenCount++;
if(input[i] === "}") parenCount--;
if(parenCount === 0) inParens = false; // consume again
}
}
This might be an interesting regexp challenge, but in the real world you'd be much better off simply finding all [^+/*()-]+ groups and removing those enclosed in {}'s
"abc def+(1.1+{ghi})".match(/[^+/*()-]+/g).filter(
function(x) { return !/^{.+?}$/.test(x) })
// ["abc def", "1.1"]
That being said, regexes is not a correct way to parse math expressions. For serious parsing, consider using formal grammars and parsers. There are plenty of parser generators for javascript, for example, in PEG.js you can write a grammar like
expr
= left:multiplicative "+" expr
/ multiplicative
multiplicative
= left:primary "*" right:multiplicative
/ primary
primary
= atom
/ "{" expr "}"
/ "(" expr ")"
atom = number / word
number = n:[0-9.]+ { return parseFloat(n.join("")) }
word = w:[a-zA-Z ]+ { return w.join("") }
and generate a parser which will be able to turn
abc def+(1.1+{ghi})
into
[
"abc def",
"+",
[
"(",
[
1.1,
"+",
[
"{",
"ghi",
"}"
]
],
")"
]
]
Then you can iterate this array just normally and fetch the parts you're interested in.
The variable names you mentioned can be match by \b[\w.]+\b since they are strictly bounded by word separators
Since you have well formed formulas, the names you don't want to capture are strictly followed by }, therefore you can use a lookahead expression to exclude these :
(\b[\w.]+ \b)(?!})
Will match the required elements (http://regexr.com/38rch).
Edit:
For more complex uses like correctly matching :
abc {def{}}
abc def+(1.1+{g{h}i})
We need to change the lookahead term to (?|({|}))
To include the match of 1.2-{abc def} we need to change the \b1. This term is using lookaround expression which are not available in javascript. So we have to work around.
(?:^|[^a-zA-Z0-9. ])([a-zA-Z0-9. ]+(?=[^0-9A-Za-z. ]))(?!({|}))
Seems to be a good one for our examples (http://regex101.com/r/oH7dO1).
1 \b is the separation between a \w and a \W \z or \a. Since \w does not include space and \W does, it is incompatible with the definition of our variable names.
Going forward with user2864740's comment, you can replace all things between {} with empty and then match the remaining.
var matches = "string here".replace(/{.+?}/g,"").match(/\b[\w. ]+\b/g);
Since you know that expressions are valid, just select \w+

Categories

Resources