Struggling with Regex

I would like to split a text but keep "a-zA-Z" and "'" (single quote).
I need this:
let str = "-I'm (going crazy with) this*, so I'%ve ^decided ?(to ask /for help. I hope you'll_ help me before I go crazy!"
To be this:
let arr = ["I'm", "going", "crazy", "with", "this", "so", "I've", "decided", "to", "ask", "for", "help", "I", "hope", "you'll", "help", "me", "before", "I", "go", "crazy"]
Currently I have this:
function splitText(text) {
let words = text.split(/\s|\W/);
return words;
}
Obviously, this won't keep "I'm" or "you'll", for example, which is what I need. I've tried a few combinations with W$, ^W and so on, but with no success.
All I want to keep is letters and "'" wherever there's a contraction.
Help! Thanks!

You can use
let str = "-I'm (going crazy with) this*, so I'%ve ^decided ?(to ask /for help. I hope you'll_ help me before I go crazy!";
str = str.replace(/[^a-zA-Z0-9\s']+/g, '').split(/\s+/);
console.log(str);
// => [ "I'm", "going", "crazy", "with", "this", "so", "I've", "decided", "to", "ask", "for", "help", "I",
// "hope", "you'll", "help", "me", "before", "I", "go", "crazy" ]
NOTES:
.replace(/[^a-zA-Z0-9\s']+/g, '') - removes all chars other than letters, digits, whitespace and single quotation marks
.split(/\s+/) - split with one or more whitespace chars.
Also, if you want to only keep ' between word chars, you may use an enhanced version of the first regex:
/[^a-zA-Z0-9\s']+|\B'|'\B/g
See the regex demo with an input containing ' not in the middle of the words.
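A quick way to see the difference between the two patterns is to run both on an input with stray quotes. This is only a minimal sketch: the helper name splitWords, its strict parameter, and the sample string are made up for illustration, and the .filter(Boolean) is an added assumption so that leading or trailing whitespace does not leave empty strings in the result.

```javascript
// Hypothetical helper: strips unwanted characters, then splits on whitespace.
// strict = true also removes quotes that are not between two word characters.
function splitWords(text, strict) {
  const pattern = strict
    ? /[^a-zA-Z0-9\s']+|\B'|'\B/g
    : /[^a-zA-Z0-9\s']+/g;
  return text.replace(pattern, '').split(/\s+/).filter(Boolean);
}

const sample = "'hello' it's *not* 'ok";
console.log(splitWords(sample, false)); // ["'hello'", "it's", "not", "'ok"]
console.log(splitWords(sample, true));  // ["hello", "it's", "not", "ok"]
```

Note that both variants delete the unwanted characters rather than splitting on them, so an input such as "foo*bar" would come out as the single word "foobar".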

Related

Matching UPPERCASE, PascalCase and camelCase in single word

Let's say I have a string testTESTCheckTESTAnother and I want to split it in few words, like that ["test", "TEST", "Check", "TEST", "Another"].
Input:
Only [A-Za-z] characters allowed
testTESTCheckTESTAnother
Code:
My best try with regex was:
"testTESTCheckTESTAnother".match(/^[a-z]+|[A-Z][a-z]*/g)
Output: ["test", "T", "E", "S", "T", "Check", "T", "E", "S", "T", "Another"]
I tried negative lookahead but it didn't work either:
"testTESTCheckTESTAnother".match(/(?![A-Z][a-z]+)[A-Z]+/g)
Output: ["TESTC", "TESTA"]
Desired output:
["test", "TEST", "Check", "TEST", "Another"]
Other inputs-outputs:
input: "ITest"
output: ["I", "Test"]
input: "WHOLETESTWORD"
output: ["WHOLETESTWORD"]
input: "C"
output: ["C"]

Regex
/[a-z]+|[A-Z]+(?=[A-Z]|$)|([A-Z][a-z]+)/g
Demo
[a-z]+ - Lowercase
[A-Z]+(?=[A-Z]|$) - Uppercase
([A-Z][a-z]+) - TitleCase
let string = "testTESTCheckTESTAnother"
console.log(string.match(/[a-z]+|[A-Z]+(?=[A-Z]|$)|([A-Z][a-z]+)/g))

Use this regular expression: ^[a-z]+|((?![A-Z][a-z])[A-Z])+|[A-Z][a-z]+
See it in action at https://regex101.com/r/5r8MzJ/1
Explanation. We have three alternative patterns we will capture.
^[a-z]+
Accept a series of lowercase letters at the start of the string only.
((?![A-Z][a-z])[A-Z])+
Accept a series of uppercase letters, excluding the last one if it is followed by a lowercase letter.
[A-Z][a-z]+
Accept one uppercase letter followed by at least one lowercase letter.
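Both patterns from the answers above can be checked against the inputs listed in the question. The sketch below is only a small test loop; the names patterns, cases and runCases are invented for illustration and are not part of either answer.

```javascript
// The two patterns proposed in the answers above (g flag added for match()).
const patterns = {
  lookahead: /[a-z]+|[A-Z]+(?=[A-Z]|$)|([A-Z][a-z]+)/g,
  negativeLookahead: /^[a-z]+|((?![A-Z][a-z])[A-Z])+|[A-Z][a-z]+/g,
};

// Inputs and desired outputs, copied from the question.
const cases = [
  ["testTESTCheckTESTAnother", ["test", "TEST", "Check", "TEST", "Another"]],
  ["ITest", ["I", "Test"]],
  ["WHOLETESTWORD", ["WHOLETESTWORD"]],
  ["C", ["C"]],
];

// Returns a list of mismatches; an empty list means every case passed.
function runCases() {
  const failures = [];
  for (const [name, re] of Object.entries(patterns)) {
    for (const [input, expected] of cases) {
      const actual = input.match(re) || [];
      if (JSON.stringify(actual) !== JSON.stringify(expected)) {
        failures.push({ name, input, actual });
      }
    }
  }
  return failures;
}

console.log(runCases());
```

With the g flag, match() returns only the full matches, so the capturing groups in either pattern do not show up in the result.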

Split a string into an array of words, punctuation and spaces in JavaScript

I have a string which I'd like to split into items contained in an array as the following example:
var text = "I like grumpy cats. Do you?"
// to result in:
var wordArray = ["I", " ", "like", " ", "grumpy", " ", "cats", ".", " ", "Do", " ", "you", "?" ]
I've tried the following expression (and similar variations) without success:
var wordArray = text.split(/(\S+|\W)/)
//this disregards spaces and doesn't separate punctuation from words
In Ruby there's a regex operator (\b) that splits at any word boundary, preserving spaces and punctuation, but I can't find a similar one for JavaScript. Would appreciate your help.

Use String#match method with regex /\w+|\s+|[^\s\w]+/g.
\w+ - for any word match
\s+ - for whitespace
[^\s\w]+ - for matching combination of anything other than whitespace and word character.
var text = "I like grumpy cats. Do you?";
console.log(
text.match(/\w+|\s+|[^\s\w]+/g)
)
Regex explanation here
FYI: If you just want to match a single special char at a time, you can use \W or . instead of [^\s\w]+.

The word boundary \b should work fine.
Example
"I like grumpy cats. Do you?".split(/\b/)
// ["I", " ", "like", " ", "grumpy", " ", "cats", ". ", "Do", " ", "you", "?"]
Edit
To handle the case of ., we can split it on [.\s] as well
Example
"I like grumpy cats. Do you?".split(/(?=[.\s]|\b)/)
// ["I", " ", "like", " ", "grumpy", " ", "cats", ".", " ", "Do", " ", "you", "?"]
(?=[.\s]|\b) is a positive lookahead: it splits just before a . or whitespace, or at a word boundary

var text = "I like grumpy cats. Do you?"
var arr = text.split(/\s|\b/);
alert(arr);

javascript regexp to identify different components of a sentence

I have a very specific requirement. Consider the sentence "I am a robot X-rrt, I am 35 and my creator is 5-MAF. Everything here is 5 times than my world5 - hurray"
I am interested in a regexp which recognizes "I", "am", "a" , "robot", "X-rrt", ",", "I", "am", "35", "and", "my", "creator", "is", "5-MAF", ".", "Everything", "here", "is", "5", "times", "than", "my", "world5", "-", "hurray"
i.e. 1) it should recognize all punctuation, except "-" when it is part of a word
2) numbers that are part of a word containing letters should not be recognized separately
I am extremely confused by this one. Would appreciate some advice!

Try splitting at each group of whitespaces, and before dots and commas:
str.split(/\s+|(?=[.,])/);

This is not too easy. I suggest some preprocessing of the text before the split, for example:
var text = "I am a robot X-rrt, I am 35 and my creator is 5-MAF. Everything here is 5 times than my world5 - hurray";
var preprocessedText = text.replace(/(\w|^)(\W)( |$)/g, "$1 $2$3");
var tokens = preprocessedText.split(" ");
alert(tokens.join("\n"));

I tested this in perl. Shouldn't be too hard to translate to javascript.
my $sentence = 'I am a robot X-rrt, I am 35 and my creator is 5-MAF. Everything here is 5 times than my world5 - hurray';
my @words = split(/\s|(?<!-)\b(?!-)/, $sentence);
say "'" . join ("', '", @words) . "'";

Try this match regexp:
str.match(/[\w\d-]+|\.|,/g);

Here is a solution that meets both your requirements:
/(?:\w|\b-\b)+|[^\w\s]+/g
See the regex demo.
Details:
(?:\w|\b-\b)+ - 1 or more
\w - word char
| - or
\b-\b - a hyphen in between word characters
| - or
[^\w\s]+ - 1 or more characters other than word and whitespace symbols.
See the JS demo below:
var s = "I am a robot X-rrt, I am 35 and my creator is 5-MAF. Everything here is 5 times than my world5 - hurray";
console.log(s.match(/(?:\w|\b-\b)+|[^\w\s]+/g));
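To see how \b-\b treats hyphens in different positions, the same pattern can be tried on a few shorter inputs. This is a sketch only; the wrapper name tokenize and the sample strings are made up for illustration.

```javascript
// Hypothetical wrapper around the pattern from the answer above.
function tokenize(sentence) {
  return sentence.match(/(?:\w|\b-\b)+|[^\w\s]+/g) || [];
}

// A hyphen between two word characters stays inside the token;
// a free-standing hyphen becomes a token of its own.
console.log(tokenize("well-known - yes, 5-MAF."));
// ["well-known", "-", "yes", ",", "5-MAF", "."]

// Consecutive punctuation stays together because of the + in [^\w\s]+
console.log(tokenize("wait... what?!"));
// ["wait", "...", "what", "?!"]
```

If each punctuation mark should be its own token, dropping the + from [^\w\s]+ would be one way to get that.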

Javascript Split difference

Javascript
var sitename="Welcome to JavaScript Kit"
var words=sitename.split(" ") //split using blank space as delimiter
for (var i=0; i<words.length; i++)
alert(words[i])
//4 alerts: "Welcome", "to", "JavaScript", and "Kit"
And
var sitename="Welcome to JavaScript Kit"
var words=sitename.split("") //split using blank space as delimiter
for (var i=0; i<words.length; i++)
alert(words[i])
//6 alerts: "W", "e", "l", "c","o","m"
What is the difference between
var words=sitename.split(" ");
And
var words=sitename.split("");
Here, what is the difference between two splits.

var sitename="Welcome to JavaScript Kit"
var words=sitename.split("") //split using blank space as delimiter
for (var i=0; i<words.length; i++)
alert(words[i])
//6 alerts: "W", "e", "l", "c","o","m"
It won't stop at just "m"; there will be many more alerts after that.
Every character will be alerted, through to "K", "i", "t": http://jsfiddle.net/zwJJN/
var words=sitename.split("") //split using the empty string as delimiter
var words=sitename.split(" ") //split using a space character as delimiter
When we use split, the whole string is searched for the given delimiter and is split at each occurrence of it.
var words=sitename.split("")// every character is split off.
var words=sitename.split(" ")// the string is split into words at each space.

var words=sitename.split(" ");
This code splits on the space character.
var words=sitename.split("");
But here you haven't given a delimiter character, so it splits the string into individual characters.

var words=sitename.split(" ");
this will split around space character
var words=sitename.split("");
this will split around each character
I ran the script and it works fine in my browser; I get all the alerts through to the final 't'. Maybe your browser is not allowing the web page to generate any more dialogs.

I'm guessing your browser is preventing the alerts from spamming.
Don't use alert to inspect the .split result. Use something like console.log to get a better look:
console.log("Welcome to JavaScript Kit".split(""));
// ["W", "e", "l", "c", "o", "m", "e", " ", "t", "o", " ", "J", "a", "v", "a", "S", "c", "r", "i", "p", "t", " ", "K", "i", "t"]
And
console.log("Welcome to JavaScript Kit".split(" "));
// ["Welcome", "to", "JavaScript", "Kit"]
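A few more separators side by side may make the difference clearer. This is a minimal sketch using the string from the question; the variable names and the "a  b" example are illustrative additions.

```javascript
const sitename = "Welcome to JavaScript Kit";

const bySpace = sitename.split(" ");  // separator is one space character
const byEmpty = sitename.split("");   // separator is the empty string
const noSeparator = sitename.split(); // no separator: the whole string in a one-element array

console.log(bySpace.length);     // 4
console.log(byEmpty.length);     // 25
console.log(noSeparator.length); // 1

// Two consecutive spaces leave an empty string between them with " ",
// while /\s+/ treats a run of whitespace as a single delimiter.
console.log("a  b".split(" "));   // ["a", "", "b"]
console.log("a  b".split(/\s+/)); // ["a", "b"]
```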

var words=sitename.split(" ");
This one splits the string into words at each space.
var words=sitename.split("");
This one splits the string into characters, i.e. it separates every character, including white-space.
Ref: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/split

Javascript Match and RegExp Issue -- Strange Behavior

I have been trying to use a simple jQuery operation to dynamically match and store all anchor tags and their texts on the page. But I have found a weird behavior. When you are using match() or exec(), if you designate the needle as a separate RegExp object or a pattern variable, then your query matches only one instance among dozens in the haystack.
And if you designate the pattern like this
match(/needle/gi)
then it matches every instance of the needle.
Here is my code.
You can even fire up Firebug and try this code right here on this page.
var a = {'text':'','parent':[]};
$("a").each(function(i,n) {
var module = $.trim($(n).text());
a.text += module.toLowerCase() + ',' + i + ',';
a.parent.push($(n).parent().parent());
});
var stringLowerCase = 'b';
var regex = new RegExp(stringLowerCase, "gi");
//console.log(a.text);
console.log("regex 1: ", regex.exec(a.text));
var regex2 = "/" + stringLowerCase + "/";
console.log("regex 2: ", a.text.match(regex2));
console.log("regex 3: ", a.text.match(/b/gi));
For me it is returning:
regex 1: ["b"]
regex 2: null
regex 3: ["b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b"]
Can anyone explain the root of this behavior?
EDIT: I forgot to mention that for regex1, it doesn't make any difference whether you add the flags "gi" for global and case insensitive matching. It still returns only one match.
EDIT2: Solved my own problem. I still don't know why regex1 matches only one instance, but I managed to match all instances using match() with regex1.
So..this matches all and dynamically!
var regex = new RegExp(stringLowerCase, "gi");
console.log("regex 2: ", a.text.match(regex));

This is not unusual behaviour at all. In regex 1 you are only checking for one instance of it, whereas in regex 3 you have told it to return all instances of the item by using the g flag.
In regex 2 you are assuming that "/b/" === /b/, but it isn't: "/b/" !== /b/. "/b/" is a string, so match() searches for the literal text "/b/" and only finds something if your string actually contains those slashes. /b/ is a regex literal: the slashes only delimit the pattern b, so matching it against "abc" returns "b".
I hope that helps.
EDIT:
Looking into it a little bit more, the exec method returns the first match that it finds rather than all the matches.
EDIT:
var myRe = /ab*/g;
var str = "abbcdefabh";
var myArray;
while ((myArray = myRe.exec(str)) != null)
{
var msg = "Found " + myArray[0] + ". ";
msg += "Next match starts at " + myRe.lastIndex;
console.log(msg);
}
Having a look at it again, it definitely does return only the first instance that it finds. If you loop over it, as above, it returns more.
Why does it do this? I have no idea... my JavaScript Kung Fu clearly isn't strong enough to answer that part.

The reason regex 2 is returning null is that you're passing "/b/" as the pattern parameter, while "b" is the only thing that is actually part of the pattern. The slashes are shorthand for a regex, just as [ ] is for an array. So if you were to replace that with just new RegExp("b"), you'd get one match, but only one, since you're omitting the global and ignore-case flags in that example. To get the same results for #2 and #3, modify accordingly:
var regex2 = new RegExp(stringLowerCase, "gi");
console.log("regex 2: ", a.text.match(regex2));
console.log("regex 3: ", a.text.match(/b/gi));

regex2 is a string, not a RegExp. I had trouble too using this kind of syntax, though I'm not really sure of the behavior.
Edit: Remembered: for regex2, JS looks for "/b/" as the needle, not "b".
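The three behaviours discussed in this thread can be reproduced without jQuery. The sketch below uses a made-up text string in place of a.text. It relies on two standard behaviours: match() compiles a string argument with new RegExp(string), and exec() on a g-flagged regex returns one match per call and resumes from lastIndex.

```javascript
const text = "b,0,about,1,Blob,2"; // stand-in for a.text, invented for this example
const needle = "b";

// A string passed to match() is compiled with new RegExp(string),
// so "/b/" looks for the three characters slash, b, slash.
console.log(text.match("/" + needle + "/")); // null

// Building the RegExp from the bare string plus flags finds every occurrence.
const regex = new RegExp(needle, "gi");
console.log(text.match(regex)); // ["b", "b", "B", "b"]

// exec() returns one match per call; with the g flag each call
// continues from regex.lastIndex until it returns null.
regex.lastIndex = 0;
const found = [];
let m;
while ((m = regex.exec(text)) !== null) {
  found.push(m.index);
}
console.log(found); // [0, 5, 12, 15]
```

If the needle can contain characters that are special in a regex (such as . or *), it would need to be escaped before being passed to new RegExp.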
