Let's say I have a string testTESTCheckTESTAnother and I want to split it into words, like this: ["test", "TEST", "Check", "TEST", "Another"].
Input:
Only [A-Za-z] characters allowed
testTESTCheckTESTAnother
Code:
My best try with regex was:
"testTESTCheckTESTAnother".match(/^[a-z]+|[A-Z][a-z]*/g)
Output: ["test", "T", "E", "S", "T", "Check", "T", "E", "S", "T", "Another"]
I tried a negative lookahead, but it didn't work either:
"testTESTCheckTESTAnother".match(/(?![A-Z][a-z]+)[A-Z]+/g)
Output: ["TESTC", "TESTA"]
Desired output:
["test", "TEST", "Check", "TEST", "Another"]
Other inputs-outputs:
input: "ITest"
output: ["I", "Test"]
input: "WHOLETESTWORD"
output: ["WHOLETESTWORD"]
input: "C"
output: ["C"]
Regex
/[a-z]+|[A-Z]+(?=[A-Z]|$)|([A-Z][a-z]+)/g
Demo
[a-z]+ - Lowercase
[A-Z]+(?=[A-Z]|$) - Uppercase
([A-Z][a-z]+) - TitleCase
let string = "testTESTCheckTESTAnother"
console.log(string.match(/[a-z]+|[A-Z]+(?=[A-Z]|$)|([A-Z][a-z]+)/g))
Use this regular expression: ^[a-z]+|((?![A-Z][a-z])[A-Z])+|[A-Z][a-z]+
See it in action at https://regex101.com/r/5r8MzJ/1
Explanation. We have three alternative patterns we will capture.
^[a-z]+
Accept a series of lowercase letters at the start of the string only.
((?![A-Z][a-z])[A-Z])+
Accept a series of uppercase letters, excluding the last one if it is followed by a lowercase letter.
[A-Z][a-z]+
Accept one uppercase letter followed by at least one lowercase letter.
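As a quick check, the three alternatives described above can be run against the extra inputs listed in the question. This is only a sketch: the helper name splitByCase is made up for illustration, and it assumes the input contains only [A-Za-z] characters, as the question states.

```javascript
// Hypothetical helper (not from the original answer): wraps the
// three-alternative pattern so it can be tried on several inputs.
function splitByCase(input) {
  const pattern = /^[a-z]+|((?![A-Z][a-z])[A-Z])+|[A-Z][a-z]+/g;
  // With the g flag, match() returns only the full matches;
  // it returns null when nothing matches, so fall back to [].
  return input.match(pattern) || [];
}

const samples = ["testTESTCheckTESTAnother", "ITest", "WHOLETESTWORD", "C"];
for (const sample of samples) {
  console.log(sample, "=>", splitByCase(sample));
}
// testTESTCheckTESTAnother => ["test", "TEST", "Check", "TEST", "Another"]
// ITest => ["I", "Test"]
// WHOLETESTWORD => ["WHOLETESTWORD"]
// C => ["C"]
```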
Related
I want to match sets of characters that include a letter and non-letter characters. Many of them are a single letter. Or two letters.
const match = 'tɕ\'i mɑ mɑ ku ʂ ɪɛ'.match(/\b(p|p\'|m|f|t|t\'|n|l|k|k\'|h|tɕ|tɕ\'|ɕ|tʂ|tʂ\'|ʂ|ʐ|ts|ts\'|s)\b/g)
console.log(match)
I thought I could use \b, but that is wrong because there are "non-word" characters in the sets.
This is the current output:
[
"t",
"m",
"m"
]
But I want this to be the output:
[
"tɕ'",
"m",
"m",
"k",
"ʂ"
]
Note: notice that some sets end with a non-word boundary, like tɕ'.
(In phonetic terms, the consonants.)
As stated in the comments above, \b doesn't work with Unicode characters in JS; moreover, from your expected output it appears that you don't need word boundaries.
You can use this shortened and refactored regex:
t[ɕʂs]'?|[tkp]'?|[tmfnlhshɕʐʂ]
Code:
const s = 'tɕ\'i mɑ mɑ ku ʂ ɪɛ';
const re = /t[ɕʂs]'?|[tkp]'?|[tmfnlhshɕʐʂ]/g
console.log(s.match(re))
//=> ["tɕ'", "m", "m", "k", "ʂ" ]
RegEx Demo
RegEx Details:
t[ɕʂs]'?: Match t followed by any letter inside [...] and then an optional '
|: OR
[tkp]'?: Match the letter t, k, or p and then an optional '
|: OR
[tmfnlhshɕʐʂ]: Match any letter inside [...]
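If hand-refactoring the alternation feels error-prone, one alternative is to generate it from the original list of sets, sorted so that longer sets are tried first. This is a sketch of a different technique, not the answer's method; the names consonantSets and buildSetRegex are invented for the example.

```javascript
// The sets from the question, copied as plain strings.
const consonantSets = [
  "p", "p'", "m", "f", "t", "t'", "n", "l", "k", "k'", "h",
  "tɕ", "tɕ'", "ɕ", "tʂ", "tʂ'", "ʂ", "ʐ", "ts", "ts'", "s",
];

// Hypothetical helper: builds one global regex from a list of literal sets.
function buildSetRegex(sets) {
  // Escape regex metacharacters in case a set ever contains one.
  const escapeRegex = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const alternation = [...sets]
    // Longest first, so "tɕ'" is tried before "tɕ" and "t".
    .sort((a, b) => b.length - a.length)
    .map(escapeRegex)
    .join("|");
  return new RegExp(alternation, "g");
}

const text = "tɕ'i mɑ mɑ ku ʂ ɪɛ";
console.log(text.match(buildSetRegex(consonantSets)));
// => ["tɕ'", "m", "m", "k", "ʂ"]
```

Sorting by length matters because a regex alternation takes the first alternative that matches at a position, not the longest one.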
I would like to split a text but keep "a-zA-z" and "'" (single quote).
I need this:
let str = "-I'm (going crazy with) this*, so I'%ve ^decided ?(to ask /for help. I hope you'll_ help me before I go crazy!"
To be this:
let arr = ["i'm", "going", "crazy", "with", "this", "so", "I've", "decided", "to", "ask", "for", "help", "I", "hope", "you'll", "help", "me", "before", "I", "go", "crazy"]
Currently I have this:
function splitText(text) {
let words = text.split(/\s|\W/);
return words;
}
Obviously, this won't keep "I'm" or "you'll", for example, which is what I need. I've tried a few combinations with W$, ^W and so on, but with no success.
All I want to keep is letters and "'" wherever there's a declination.
Help! Thanks!
You can use
let str = "-I'm (going crazy with) this*, so I'%ve ^decided ?(to ask /for help. I hope you'll_ help me before I go crazy!";
str = str.replace(/[^a-zA-Z0-9\s']+/g, '').split(/\s+/);
console.log(str);
// => [ "I'm", "going", "crazy", "with", "this", "so", "I've", "decided", "to", "ask", "for", "help", "I",
// "hope", "you'll", "help", "me", "before", "I", "go", "crazy" ]
NOTES:
.replace(/[^a-zA-Z0-9\s']+/g, '') - removes all chars other than letters, digits, whitespace and single quotation marks
.split(/\s+/) - split with one or more whitespace chars.
Also, if you want to only keep ' between word chars, you may use an enhanced version of the first regex:
/[^a-zA-Z0-9\s']+|\B'|'\B/g
See the regex demo with an input containing ' not in the middle of the words.
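To see what the enhanced pattern changes, here is a small sketch with apostrophes that are not in the middle of a word. The helper name wordsWithInnerApostrophes and the sample sentence are made up for illustration; the .trim() call is an addition, there to avoid an empty first element when the cleaned text starts with whitespace.

```javascript
// Hypothetical helper: strips unwanted characters, then splits on whitespace.
function wordsWithInnerApostrophes(text) {
  return text
    // Remove anything that is not a letter, digit, whitespace or ',
    // plus any ' that does not sit between two word characters.
    .replace(/[^a-zA-Z0-9\s']+|\B'|'\B/g, "")
    .trim()
    .split(/\s+/);
}

const sample = "'Tis the dogs' bowl, isn't it? 'quoted'";
console.log(wordsWithInnerApostrophes(sample));
// => ["Tis", "the", "dogs", "bowl", "isn't", "it", "quoted"]
```

Only the apostrophe in isn't survives, because it is the only one with a word character on both sides.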
For example, get a string abaacaaa, a character a, split the string to get ['ab', 'aac', 'aaa'].
string = 'abaacaaa'
string.split('a') // 1. ["", "b", "", "c", "", "", ""]
string.split(/(?=a)/) // 2. ["ab", "a", "ac", "a", "a", "a"]
string.split(/(?=a+)/) // 3. ["ab", "a", "ac", "a", "a", "a"]
string.split(/*???*/) // 4. ['ab', 'aac', 'aaa']
Why does the 3rd expression output the same value as the 2nd even though + is present after a, and what should go into the 4th?
Edit:
string.match(/a+[^a]*/g) doesn't work properly in babaacaaa.
string = 'babaacaaa' // should be split into ['b', 'ab', 'aac', 'aaa']
string.match(/a+[^a]*/g) // ["ab", "aac", "aaa"]
Solutions 2 and 3 are equal because unanchored lookaheads test each position in the input string. (?=a) tests the start of the string in abaacaaa and finds a match, but the leading empty result is discarded. Next, it tries the position after a and finds no match, since the character to the right is b, so the regex engine goes on to the next position. Next, it matches after b, and ab is added to the result. Then it matches the position after the next a, adds a to the resulting array, and goes on to the next position to find a match, and so on. With (?=a+) the process is identical: it just matches one or more a characters, but still tests each position.
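The position-by-position behavior described above can be made visible by listing the indices at which each lookahead matches. This is only an illustrative sketch; the positions helper is a made-up name.

```javascript
const input = "abaacaaa";

// Hypothetical helper: indices at which a zero-width pattern matches.
const positions = (re) => [...input.matchAll(re)].map((m) => m.index);

console.log(positions(/(?=a)/g));  // => [0, 2, 3, 5, 6, 7]
console.log(positions(/(?=a+)/g)); // => [0, 2, 3, 5, 6, 7]

// Same match positions, so split() cuts the string in the same places.
console.log(input.split(/(?=a)/));  // => ["ab", "a", "ac", "a", "a", "a"]
console.log(input.split(/(?=a+)/)); // => ["ab", "a", "ac", "a", "a", "a"]
```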
To split babaacaaa, you need
var s = 'babaacaaa';
console.log(
s.split(/(a+[^a]*)/).filter(Boolean)
);
The a+[^a]* matches
a+ - 1 or more a
[^a]* - 0 or more chars other than a
The capturing group allows adding matched substrings to the resulting split array, and .filter(Boolean) will discard empty matches in between adjoining matches.
let string = 'abaacaaa'
let result = string.match(/a*([^a]+|a)/g)
console.log(result)
string = 'babaacaaa'
result = string.match(/a*([^a]+|a)/g)
console.log(result)
string.match(/^[^a]+|a+[^a]*/g) seems to work as expected.
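Another option, not among the answers above, is to keep using split but add a lookbehind so that only the first a of each run is a split point. This sketch assumes an engine that supports lookbehind assertions (ES2018); splitBeforeRuns is a made-up name.

```javascript
// Hypothetical helper: split before each run of "a".
function splitBeforeRuns(str) {
  // (?<!a) : the previous character is not "a" (or there is none)
  // (?=a)  : the next character is "a"
  return str.split(/(?<!a)(?=a)/);
}

console.log(splitBeforeRuns("abaacaaa"));  // => ["ab", "aac", "aaa"]
console.log(splitBeforeRuns("babaacaaa")); // => ["b", "ab", "aac", "aaa"]
```

Because the pattern is zero-width, no characters are consumed and no filtering of empty strings is needed for these inputs.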
I'm building something called a formula builder. The idea is that the user types the text of a formula into a textarea, then we parse the string value. The result is an array.
For example, this text will be parsed
LADV-(GCNBIZ+UNIN)+(TNW*-1)
then generate result below
["LADV", "-", "(", "GCNBIZ", "+", "UNIN", ")", "+", "(", "TNW", "*", "-1", ")"]
The point is to split each word joined by one of these characters: +, *, /, -, (, and ), while still including the splitter itself.
I have tried to split using the expression /[-+*/()]/g, but the result doesn't include the splitter characters. Also, -1 needs to be detected as one expression.
["LADV", "MISC", "", "GCNBIZ", "UNIN", "", "", "TNW", "", "1", ""]
What is the match regex to solve this?
var input = 'LADV-(GCNBIZ+UNIN)+(TNW*-1)';
var match = input.match(/(-?\d+)|([a-z]+)|([-+*()\/])/gmi);
console.log(match);
You can use match instead of split with an alternation regex:
var s = 'LADV-(GCNBIZ+UNIN)+(TNW*-1)';
var m = s.match(/(-\d+(?:\.\d+)?|[-+\/*()]|\w+)/g);
console.log(m);
//=> ["LADV", "-", "(", "GCNBIZ", "+", "UNIN", ")", "+", "(", "TNW", "*", "-1", ")"]
RegEx Breakup:
( # start capture group
- # match a hyphen
\d+(?:\.\d+)? # match a number
| # OR
[-+\/*()] # match one of the symbols
| # OR
\w+ # match 1 or more word characters
) # end capture group
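The order of patterns in the alternation is important: -\d+(?:\.\d+)? must come before [-+\/*()], otherwise the hyphen in -1 is matched as a symbol on its own.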
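A short sketch of why the order matters: the same three alternatives, tried in two different orders on the formula from the question. The variable names are made up, and the last line shows a limitation worth keeping in mind, namely that the pattern cannot tell a binary minus from a sign.

```javascript
const formula = "LADV-(GCNBIZ+UNIN)+(TNW*-1)";

// Signed number first (the order used in the answer above).
const numberFirst = /-\d+(?:\.\d+)?|[-+\/*()]|\w+/g;
// Symbols first: the hyphen is consumed before -\d+ gets a chance.
const symbolFirst = /[-+\/*()]|-\d+(?:\.\d+)?|\w+/g;

console.log(formula.match(numberFirst));
// => ["LADV", "-", "(", "GCNBIZ", "+", "UNIN", ")", "+", "(", "TNW", "*", "-1", ")"]
console.log(formula.match(symbolFirst));
// => ["LADV", "-", "(", "GCNBIZ", "+", "UNIN", ")", "+", "(", "TNW", "*", "-", "1", ")"]

// Limitation: a subtraction directly followed by a digit is read as a sign.
console.log("A-1".match(numberFirst)); // => ["A", "-1"]
```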
I have a very specific requirement. Consider the sentence "I am a robot X-rrt, I am 35 and my creator is 5-MAF. Everything here is 5 times than my world5 - hurray"
I am interested in a regexp which recognizes "I", "am", "a" , "robot", "X-rrt", ",", "I", "am", "35", "and", "my", "creator", "is", "5-MAF", ".", "Everything", "here", "is", "5", "times", "than", "my", "world5", "-", "hurray"
i.e. 1) it should recognize all punctuation, except "-" when it is part of a word
2) numbers that are part of a word containing letters should not be recognized separately
I am extremely confused by this one. Would appreciate some advice!
Try splitting at each group of whitespaces, and before dots and commas:
str.split(/\s+|(?=[.,])/);
This is not too easy. I suggest some preprocessing of the text before the split, for example:
var text = "I am a robot X-rrt, I am 35 and my creator is 5-MAF. Everything here is 5 times than my world5 - hurray";
var preprocessedText = text.replace(/(\w|^)(\W)( |$)/g, "$1 $2$3");
var tokens = preprocessedText.split(" ");
alert(tokens.join("\n"));
I tested this in perl. Shouldn't be too hard to translate to javascript.
use feature 'say'; # enables say()
my $sentence = 'I am a robot X-rrt, I am 35 and my creator is 5-MAF. Everything here is 5 times than my world5 - hurray';
my @words = split(/\s|(?<!-)\b(?!-)/, $sentence);
say "'" . join ("', '", @words) . "'";
Try this match regexp:
str.match(/[\w\d-]+|\.|,/g);
Here is a solution that meets both your requirements:
/(?:\w|\b-\b)+|[^\w\s]+/g
See the regex demo.
Details:
(?:\w|\b-\b)+ - 1 or more
\w - word char
| - or
\b-\b - a hyphen in between word characters
| - or
[^\w\s]+ - 1 or more characters other than word and whitespace symbols.
See the JS demo below:
var s = "I am a robot X-rrt, I am 35 and my creator is 5-MAF. Everything here is 5 times than my world5 - hurray";
console.log(s.match(/(?:\w|\b-\b)+|[^\w\s]+/g));
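For comparison, here is a sketch that runs the split-based suggestion from earlier in this thread and the match-based pattern above on the same sentence. On this input they agree; the second, made-up input shows where they differ, since the split version only knows about dots and commas.

```javascript
const sentence = "I am a robot X-rrt, I am 35 and my creator is 5-MAF. Everything here is 5 times than my world5 - hurray";

// Split at whitespace and just before dots and commas.
const bySplit = sentence.split(/\s+|(?=[.,])/);
// Match words (with inner hyphens) or runs of other punctuation.
const byMatch = sentence.match(/(?:\w|\b-\b)+|[^\w\s]+/g);

console.log(bySplit.length, byMatch.length); // => 25 25
console.log(JSON.stringify(bySplit) === JSON.stringify(byMatch)); // => true

// Illustrative input with punctuation other than "." and ",":
const other = "Wait! Really?";
console.log(other.split(/\s+|(?=[.,])/));            // => ["Wait!", "Really?"]
console.log(other.match(/(?:\w|\b-\b)+|[^\w\s]+/g)); // => ["Wait", "!", "Really", "?"]
```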