Regex split on upper case and first digit

Regex split on upper case and first digit - javascript

I need to split the string "thisIs12MyString" to an array looking like [ "this", "Is", "12", "My", "String" ]
I've got as far as "thisIs12MyString".split(/(?=[A-Z0-9])/), but it splits on each digit and gives the array [ "this", "Is", "1", "2", "My", "String" ]
So, in words: I need to split the string before each upper-case letter and before each digit that does not have another digit in front of it.

Are you looking for this?
"thisIs12MyString".match(/[A-Z]?[a-z]+|[0-9]+/g)
returns
["this", "Is", "12", "My", "String"]
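To check the match-based pattern against other inputs, a small wrapper like the one below can help. This is only a sketch: the helper name splitCamel and the second sample string are made up for illustration. The second call shows a limitation worth knowing about: because [A-Z]?[a-z]+ needs at least one lower-case letter, a run of capitals that is not followed by lower-case letters is silently dropped.

```javascript
// Hypothetical helper wrapping the match-based approach from the answer.
function splitCamel(str) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return str.match(/[A-Z]?[a-z]+|[0-9]+/g) || [];
}

console.log(splitCamel("thisIs12MyString")); // ["this", "Is", "12", "My", "String"]
console.log(splitCamel("parseHTML5Doc"));    // ["parse", "5", "Doc"] ("HTML" is lost)
```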

As I said in my comment, my approach would be to insert a special character before each sequence of digits first, as a marker:
"thisIs12MyString".replace(/\d+/g, '~$&').split(/(?=[A-Z])|~/)
where ~ could be any other character, preferably a non-printable one (e.g. a control character), as it is unlikely to appear "naturally" in a string.
In that case, you could even insert the marker before each capital letter as well, and omit the lookahead, making the split very easy:
"thisIs12MyString".replace(/\d+|[A-Z]/g, '~$&').split('~')
It might or might not perform better.
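The marker idea could be sketched as follows, assuming "\u0000" as the control character (an arbitrary choice, as is the helper name splitWithMarker). A filter step is added because a string that starts with a capital or a digit would otherwise produce an empty first element.

```javascript
// Hypothetical helper; "\u0000" is just one possible marker character.
function splitWithMarker(str) {
  const MARK = "\u0000";
  return str
    .replace(/\d+|[A-Z]/g, (match) => MARK + match) // put the marker before each token start
    .split(MARK)
    .filter((part) => part !== ""); // drop the empty head when str starts with a token
}

console.log(splitWithMarker("thisIs12MyString")); // ["this", "Is", "12", "My", "String"]
console.log(splitWithMarker("ThisIs12"));         // ["This", "Is", "12"]
```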

In my rhino console,
js> "thisIs12MyString".replace(/([A-Z]|\d+)/g, function(x){return " "+x;}).split(/ /);
this,Is,12,My,String
another one,
js> "thisIs12MyString".split(/(?:([A-Z]+[a-z]+))/g).filter(function(a){return a;});
this,Is,12,My,String

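You can work around JavaScript's lack of lookbehind by post-processing the array produced by your current regex.
Quick sketch:
// True when the chunk is exactly one digit, e.g. "1" or "2".
function isSingleDigit(word) {
    return /^\d$/.test(word);
}
var result = [];
var digitsFlag = false; // was the previous chunk a digit?
"thisIs12MyString".split(/(?=[A-Z0-9])/).forEach(function(word) {
    if (isSingleDigit(word)) {
        if (!digitsFlag) {
            result.push(word); // first digit of a run starts a new item
        } else {
            result[result.length - 1] += word; // later digits join the run
        }
        digitsFlag = true;
    } else {
        result.push(word);
        digitsFlag = false;
    }
});
// result is ["this", "Is", "12", "My", "String"]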
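Where the engine implements ES2018 lookbehind assertions, the post-processing step may not be needed at all. The following is a minimal sketch under that assumption (the function name is made up): it splits before a capital, or before a digit that is not itself preceded by a digit.

```javascript
// Assumes an ES2018+ engine: (?<!\d) is a negative lookbehind.
function splitOnCapsAndNumbers(str) {
  // Split before a capital letter, or before a digit not preceded by another digit.
  return str.split(/(?=[A-Z])|(?<!\d)(?=\d)/);
}

console.log(splitOnCapsAndNumbers("thisIs12MyString")); // ["this", "Is", "12", "My", "String"]
```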

I can't think of any way to achieve this with a regex alone.
I think you will need to do it in code.
Please check this URL for the same question in a different language (Ruby).
The code is at the bottom:
http://code.activestate.com/recipes/440698-split-string-on-capitalizeduppercase-char/

Related

Regex for .split() to separate string on spaces except for quotes

Is it possible to use a regex to split a string on spaces and quotes? I can only use .split() instead of .match() for performance reasons.
Example:
'This is an "example for" stack overflow.'
Output:
["This", "is", "an", "example for", "stack", "overflow"]

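Short answer to your question: yes, you can use a regex in String.prototype.split().
This is the code based on your example:
'This is an "example for" stack overflow.'.split(/\"\s|\s\"|\s|\"/g);
Note that the \s alternative also matches the space inside the quotes, so "example for" comes back as two items, "example" and "for".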
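One way to stay with .split() and still keep a quoted phrase together is to put the phrase in a capturing group, since split() adds captured text to its result. This is only a sketch under that idea, with a made-up helper name; note that it leaves the trailing period attached to "overflow." and drops empty quoted strings.

```javascript
// Hypothetical helper: split on whole quoted phrases or on runs of whitespace.
function splitKeepingQuotes(str) {
  return str
    .split(/"([^"]*)"|\s+/) // the capture group puts the quoted text into the result
    .filter(Boolean);       // drop "" and the undefined left by the \s+ branch
}

console.log(splitKeepingQuotes('This is an "example for" stack overflow.'));
// ["This", "is", "an", "example for", "stack", "overflow."]
```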

Rather than splitting the string, it is easier to capture tokens: each token is either a plain word or a phrase that may contain spaces and is enclosed in double quotes. This regex does that:
"([\w ]*?)"|\w+
Sample JavaScript code:
var s = 'This is an "example for" stack overflow.';
var re = /"([\w ]*?)"|\w+/g;
var arr = [];
var m; // declared here so it does not leak as a global
do {
    m = re.exec(s);
    if (m) {
        if (m[1]) {
            arr.push(m[1]); // quoted phrase, without the quotes
        } else if (m[0]) {
            arr.push(m[0]); // plain word
        }
    }
} while (m);
console.log(arr);
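On engines that provide String.prototype.matchAll (ES2020), the loop above could be condensed as in this sketch. The function name tokenize is made up, and unlike the loop it keeps an empty quoted string as an empty token.

```javascript
// Hypothetical helper using matchAll (ES2020) with the same pattern.
function tokenize(str) {
  const re = /"([\w ]*?)"|\w+/g;
  // m[1] is defined only when the quoted branch matched.
  return Array.from(str.matchAll(re), (m) => (m[1] !== undefined ? m[1] : m[0]));
}

console.log(tokenize('This is an "example for" stack overflow.'));
// ["This", "is", "an", "example for", "stack", "overflow"]
```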

regex capturing with repeating pattern

I'm trying to capture all parts of a string, but I can't seem to get it right.
The string has this structure: 1+22+33. Numbers with an operator in between. There could be any number of terms.
What I want is ["1+22+33", "1", "+", "22", "+", "33"]
But I get: ["1+22+33", "22", "+", "33"]
I've tried all kinds of regexes, this is the best I've got, but it's obviously wrong.
let re = /(?:(\d+)([+]+))+(\d+)/g;
let s = '1+22+33';
let m;
while (m = re.exec(s))
    console.log(m);
Note: the operators may vary. So in reality I'd look for [+/*-].

You can simply use String#split, like this:
const input = '3+8 - 12'; // I've deliberately added some random spaces
console.log(input.split(/\s*(\+|-)\s*/)); // Add as many other operators as needed

Just thought of a solution: /(\d+)|([+*/-]+)/g;

You only have to split on digits:
console.log(
    "1+22+33".split(/(\d+)/).filter(Boolean)
);
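The reason the original attempt loses the first term is that a capturing group inside a repeated group only remembers its last iteration. The sketch below (function names are made up) shows that behaviour with the regex from the question, next to a global match that returns every term and operator.

```javascript
// Returns the exec() result of the regex from the question as a plain array.
function lastIterationOnly(expr) {
  const m = /(?:(\d+)([+]+))+(\d+)/.exec(expr);
  return m ? Array.from(m) : null;
}

// Tokenizes numbers and single operators with a global match instead.
function tokens(expr) {
  return expr.match(/\d+|[+*\/-]/g) || [];
}

console.log(lastIterationOnly("1+22+33")); // ["1+22+33", "22", "+", "33"]
console.log(tokens("1+22+33"));            // ["1", "+", "22", "+", "33"]
```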

Regex match entire string while grouping

I'm trying to match a currency string that may or may not be suffixed with one of K, M, or Bn, and group them into two parts
Valid matches:
500 K // Expected grouping: ["500", "K"]
900,000 // ["900,000", ""]
2.3 Bn // ["2.3", "Bn"]
800M // ["800", "M"]
PS: I know the first item in the match output array is the entire matched string; the expected grouping above is only an example.
The Regex I've got so far is this:
/\b([-\d\,\.]+)\s?([M|Bn|K]?)\b/i
When I match it with a normal string, it does OK.
"898734 K".match(/\b([-\d\,\.]+)\s?([M|Bn|K]?)\b/i)
=> ["898734 K", "898734", "K"] // output
"500,000".match(/\b([-\d\,\.]+)\s?([M|Bn|K]?)\b/i)
=> ["500,000", "500,000", ""]
Trouble is, it also matches space in there
"89 8734 K".match(/\b([-\d\,\.]+)\s?([M|Bn|K]?)\b/i)
=> ["89 ", "89", ""]
And I'm not sure why. So I thought I'd add /g option in there to match entire string, but now it doesn't group the matches.
"898734 K".match(/\b([-\d\,\.]+)\s?([M|Bn|K]?)\b/gi)
=> ["898734 K"]
What change do I need to make to get the regex behave as expected?

You could use a different regular expression, which looks for some digits, an optional comma or dot followed by more digits, some whitespace, and then the wanted letters.
var array = ['500 K', '900,000', '2.3 Bn', '800M'],
    regex = /(\d+[.,]?\d*)\s*(K|Bn|M|$)/;
array.forEach(function (a) {
    var m = a.match(regex);
    if (m) {
        m.shift(); // drop the full match, keep only the two groups
        console.log(m);
    }
});
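Two details of the regex in the question explain its behaviour: [M|Bn|K] is a character class that matches a single one of the characters M, |, B, n or K rather than one of three suffixes, and without anchors the pattern is free to match just a prefix such as "89 ". A minimal sketch that anchors the pattern and uses an alternation instead is shown below; the helper name parseAmount is made up, and it deliberately returns null for input that does not fit as a whole.

```javascript
// Hypothetical helper: returns [amount, suffix], or null when the whole string does not fit.
function parseAmount(str) {
  // ^ and $ force the entire string to match; (Bn|M|K)? is an optional suffix.
  const m = /^([-\d,.]+)\s?(Bn|M|K)?$/i.exec(str);
  return m ? [m[1], m[2] || ""] : null;
}

console.log(parseAmount("500 K"));     // ["500", "K"]
console.log(parseAmount("900,000"));   // ["900,000", ""]
console.log(parseAmount("89 8734 K")); // null
```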

You have a problem and want to use a regex to solve the problem. Now you have two problems...
Joking aside, I think you can achieve what you want without any regex (shown here in Python):
num = "".join(itertools.takewhile(lambda c: c.isdigit() or c in ',.', s))
num, s[len(num):]  # needs "import itertools" first
I tried this with s="560 K", s="900,000", etc. and it seems to work.
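Since the question is about JavaScript, the same regex-free idea might look like the sketch below. The helper name splitAmount is made up, and trimming the suffix is an added assumption so that "560 K" yields "K" rather than " K".

```javascript
// Hypothetical helper: scan the leading run of digits, commas and dots by hand.
function splitAmount(s) {
  let i = 0;
  while (i < s.length && "0123456789,.".includes(s[i])) {
    i++;
  }
  // Everything before i is the amount; the rest (trimmed) is the suffix.
  return [s.slice(0, i), s.slice(i).trim()];
}

console.log(splitAmount("560 K"));   // ["560", "K"]
console.log(splitAmount("900,000")); // ["900,000", ""]
```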

Separating words with Regex

I am trying to get this result: 'Summer-is-here'. Why does the code below generate extra spaces? (Current result: '-Summer--Is- -Here-').
function spinalCase(str) {
    var newA = str.split(/([A-Z][a-z]*)/).join("-");
    return newA;
}
spinalCase("SummerIs Here");

You are using a variety of split where the regexp contains a capturing group (inside parentheses), which has a specific meaning, namely to include the matched separators in the result. So your result becomes:
["", "Summer", "", "Is", " ", "Here", ""]
Joining that with - gives you the result you see. But you can't just remove the unnecessary capture group from the regexp, because then the split would give you
["", "", " ", ""]
because the words themselves are the separators you are splitting on, so only the text between them survives. So this doesn't really work.
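The two results quoted above can be reproduced directly; the snippet below is just a small check of how split() behaves with and without the capturing group (the variable names are arbitrary).

```javascript
const input = "SummerIs Here";

// With a capturing group, the matched separators are kept in the result.
const withGroup = input.split(/([A-Z][a-z]*)/);
// Without it, the words are consumed and only the text between them remains.
const withoutGroup = input.split(/[A-Z][a-z]*/);

console.log(withGroup);           // ["", "Summer", "", "Is", " ", "Here", ""]
console.log(withoutGroup);        // ["", "", " ", ""]
console.log(withGroup.join("-")); // "-Summer--Is- -Here-"
```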
If you want to use split, try splitting on zero-width or space-only matches looking ahead to an uppercase letter:
> "SummerIs Here".split(/\s*(?=[A-Z])/)
                            ^^^^^^^^^ LOOK-AHEAD
< ["Summer", "Is", "Here"]
Now you can join that to get the result you want, but without the lowercase mapping, which you could do with:
"SummerIs Here" .
    split(/\s*(?=[A-Z])/) .
    map(function(elt, i) { return i ? elt.toLowerCase() : elt; }) .
    join('-')
which gives you what you want.
Using replace as suggested in another answer is also a perfectly viable solution. In terms of best practices, consider the following code from Ember:
var DECAMELIZE_REGEXP = /([a-z\d])([A-Z])/g;
var DASHERIZE_REGEXP = /[ _]/g;
function decamelize(str) {
    return str.replace(DECAMELIZE_REGEXP, '$1_$2').toLowerCase();
}
function dasherize(str) {
    return decamelize(str).replace(DASHERIZE_REGEXP, '-');
}
First, decamelize inserts an underscore _ between every lower-case letter (or digit) that is directly followed by an upper-case letter, then lower-cases the result. Then, dasherize replaces each underscore or space with a dash. This works perfectly except that it also lower-cases the first word in the string. You can more or less combine decamelize and dasherize here with
var SPINALIZE_REGEXP = /([a-z\d])\s*([A-Z])/g;
function spinalCase(str) {
    return str.replace(SPINALIZE_REGEXP, '$1-$2').toLowerCase();
}
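If the leading capital should survive, as in the 'Summer-is-here' result the question asks for, one option is to lower-case only the capital that follows each boundary. This is a sketch of that variation, not part of Ember; the function name is made up.

```javascript
// Hypothetical variant: only the capital after each word boundary is lower-cased.
function spinalCaseKeepFirst(str) {
  return str.replace(/([a-z\d])\s*([A-Z])/g, function (_, before, capital) {
    return before + "-" + capital.toLowerCase();
  });
}

console.log(spinalCaseKeepFirst("SummerIs Here")); // "Summer-is-here"
```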

You want to separate capitalized words, but you are splitting the string on capitalized words. That's why you get those empty strings and spaces.
I think you are looking for this :
var newA = str.match(/[A-Z][a-z]*/g).join("-");

([A-Z][a-z]*) *(?!$|[a-z])
You can simply replace each match with $1-. See the demo:
https://regex101.com/r/nL7aZ2/1
var re = /([A-Z][a-z]*) *(?!$|[a-z])/g;
var str = 'SummerIs Here';
var subst = '$1-';
var result = str.replace(re, subst);

var newA = str.split(/ |(?=[A-Z])/).join("-");
You can change the regex like:
/ |(?=[A-Z])/ or /\s*(?=[A-Z])/
Result:
Summer-Is-Here

Javascript regex find variables in a math equation

I want to find in a math expression elements that are not wrapped between { and }
Examples:
Input: abc+1*def
Matches: ["abc", "1", "def"]
Input: {abc}+1+def
Matches: ["1", "def"]
Input: abc+(1+def)
Matches: ["abc", "1", "def"]
Input: abc+(1+{def})
Matches: ["abc", "1"]
Input: abc def+(1.1+{ghi})
Matches: ["abc def", "1.1"]
Input: 1.1-{abc def}
Matches: ["1.1"]
Rules
The expression is well-formed. (So there won't be start parenthesis without closing parenthesis or starting { without })
The math symbols allowed in the expression are + - / * and ( )
Numbers could be decimals.
Variables can contain spaces.
Only one level of { } (no nested brackets)
So far, I ended with: http://regex101.com/r/gU0dO4
(^[^/*+({})-]+|(?:[/*+({})-])[^/*+({})-]+(?:[/*+({})-])|[^/*+({})-]+$)
I split the task into 3:
match elements at the beginning of the string
match elements that are between two { and }
match elements at the end of the string
But it doesn't work as expected.
Any idea ?

Matching {}s, especially nested ones, is hard (read: impossible) for a standard regular expression, since it requires counting the number of {s you have encountered so you know which } terminates each one.
Instead, a simple string manipulation method could work. This is a very basic parser that just reads the string left to right and consumes it when outside of {}.
var input = "abc def+(1.1+{ghi})"; // I assume well formed, as well as no precedence
var inParens = false; // true while we are inside { }
var output = [], buffer = "", parenCount = 0;
for (var i = 0; i < input.length; i++) {
    if (!inParens) {
        if (input[i] === "{") {
            inParens = true;
            parenCount++;
        } else if (["+", "-", "(", ")", "/", "*"].some(function(x) {
            return x === input[i];
        })) { // got symbol
            if (buffer !== "") { // buffer has stuff to add to output
                output.push(buffer); // add the last operand
                buffer = "";
            }
        } else { // letter or number
            buffer += input[i]; // push to buffer
        }
    } else { // inParens is true
        if (input[i] === "{") parenCount++;
        if (input[i] === "}") parenCount--;
        if (parenCount === 0) inParens = false; // consume again
    }
}
if (buffer !== "") output.push(buffer); // flush an operand that ends the string
// output is ["abc def", "1.1"]
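The same left-to-right scan can be wrapped in a function and checked against the examples from the question. This is a minimal sketch, assuming a single level of braces as the question states; extractOperands is a made-up name.

```javascript
// Hypothetical helper: collects operands that are not wrapped in { }.
function extractOperands(input) {
  const operators = new Set(["+", "-", "*", "/", "(", ")"]);
  const output = [];
  let buffer = "";
  let inBraces = false;
  for (const ch of input) {
    if (inBraces) {
      if (ch === "}") inBraces = false; // leave the braced section
    } else if (ch === "{") {
      inBraces = true; // skip everything up to the closing brace
    } else if (operators.has(ch)) {
      if (buffer !== "") output.push(buffer); // an operator ends the current operand
      buffer = "";
    } else {
      buffer += ch; // letter, digit, dot or space
    }
  }
  if (buffer !== "") output.push(buffer); // operand at the very end of the string
  return output;
}

console.log(extractOperands("abc+1*def"));           // ["abc", "1", "def"]
console.log(extractOperands("abc def+(1.1+{ghi})")); // ["abc def", "1.1"]
```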

This might be an interesting regexp challenge, but in the real world you'd be much better off simply finding all [^+/*()-]+ groups and removing those enclosed in {}'s
"abc def+(1.1+{ghi})".match(/[^+/*()-]+/g).filter(
    function(x) { return !/^{.+?}$/.test(x) })
// ["abc def", "1.1"]
That being said, regexes are not the right tool for parsing math expressions. For serious parsing, consider using formal grammars and parsers. There are plenty of parser generators for JavaScript; for example, in PEG.js you can write a grammar like
expr
    = left:multiplicative "+" expr
    / multiplicative
multiplicative
    = left:primary "*" right:multiplicative
    / primary
primary
    = atom
    / "{" expr "}"
    / "(" expr ")"
atom = number / word
number = n:[0-9.]+ { return parseFloat(n.join("")) }
word = w:[a-zA-Z ]+ { return w.join("") }
and generate a parser which will be able to turn
abc def+(1.1+{ghi})
into
[
    "abc def",
    "+",
    [
        "(",
        [
            1.1,
            "+",
            [
                "{",
                "ghi",
                "}"
            ]
        ],
        ")"
    ]
]
Then you can iterate this array just normally and fetch the parts you're interested in.
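Iterating that structure could look like the sketch below. The tree literal is copied from the output shown above rather than produced by a real parser run, and collectAtoms and SYMBOLS are made-up names; the walk simply skips any sub-array that starts with "{".

```javascript
// Tree copied from the parser output shown above (not generated here).
const tree = ["abc def", "+", ["(", [1.1, "+", ["{", "ghi", "}"]], ")"]];
const SYMBOLS = new Set(["+", "*", "(", ")", "{", "}"]);

// Recursively collects numbers and words that are not inside a { } group.
function collectAtoms(node, out = []) {
  if (Array.isArray(node)) {
    if (node[0] === "{") return out; // skip everything wrapped in braces
    node.forEach((child) => collectAtoms(child, out));
  } else if (!SYMBOLS.has(node)) {
    out.push(node);
  }
  return out;
}

console.log(collectAtoms(tree)); // ["abc def", 1.1]
```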

The variable names you mentioned can be matched by \b[\w.]+\b since they are strictly bounded by word separators.
Since you have well-formed formulas, the names you don't want to capture are strictly followed by }, so you can use a negative lookahead to exclude them:
(\b[\w.]+ \b)(?!})
This will match the required elements (http://regexr.com/38rch).
Edit:
For more complex uses like correctly matching :
abc {def{}}
abc def+(1.1+{g{h}i})
We need to change the lookahead term to (?!({|}))
To include the match of 1.2-{abc def} we need to change the \b [1]. This term relies on lookaround expressions, which are not available in JavaScript, so we have to work around it.
(?:^|[^a-zA-Z0-9. ])([a-zA-Z0-9. ]+(?=[^0-9A-Za-z. ]))(?!({|}))
Seems to be a good one for our examples (http://regex101.com/r/oH7dO1).
[1] \b is the boundary between a \w and a \W, \z, or \a. Since \w does not include the space character and \W does, it is incompatible with the definition of our variable names.

Going forward with user2864740's comment, you can replace all things between {} with empty and then match the remaining.
var matches = "string here".replace(/{.+?}/g,"").match(/\b[\w. ]+\b/g);
Since you know that expressions are valid, just select \w+
