javascript regexp to identify different components of a sentence

I have a very specific requirement. Consider the sentence "I am a robot X-rrt, I am 35 and my creator is 5-MAF. Everything here is 5 times than my world5 - hurray"
I am interested in a regexp which recognizes "I", "am", "a" , "robot", "X-rrt", ",", "I", "am", "35", "and", "my", "creator", "is", "5-MAF", ".", "Everything", "here", "is", "5", "times", "than", "my", "world5", "-", "hurray"
i.e. 1) it should recognize all punctuation, except "-" when it is part of a word
2) numbers that are part of a word containing letters should not be recognized separately
I am extremely confused by this one. Would appreciate some advice!

Try splitting at each group of whitespaces, and before dots and commas:
str.split(/\s+|(?=[.,])/);

This is not too easy. I suggest some preprocessing of the text before a split, for example:
var text = "I am a robot X-rrt, I am 35 and my creator is 5-MAF. Everything here is 5 times than my world5 - hurray";
var preprocessedText = text.replace(/(\w|^)(\W)( |$)/g, "$1 $2$3");
var tokens = preprocessedText.split(" ");
alert(tokens.join("\n"));

I tested this in Perl. It shouldn't be too hard to translate to JavaScript.
use feature 'say';
my $sentence = 'I am a robot X-rrt, I am 35 and my creator is 5-MAF. Everything here is 5 times than my world5 - hurray';
my @words = split(/\s|(?<!-)\b(?!-)/, $sentence);
say "'" . join ("', '", @words) . "'";

Try this match regexp:
str.match(/[\w-]+|\.|,/g);

Here is a solution that meets both your requirements:
/(?:\w|\b-\b)+|[^\w\s]+/g
See the regex demo.
Details:
(?:\w|\b-\b)+ - 1 or more occurrences of either:
  \w - a word char
  | - or
  \b-\b - a hyphen in between word characters
| - or
[^\w\s]+ - 1 or more characters other than word and whitespace characters.
See the JS demo below:
var s = "I am a robot X-rrt, I am 35 and my creator is 5-MAF. Everything here is 5 times than my world5 - hurray";
console.log(s.match(/(?:\w|\b-\b)+|[^\w\s]+/g));
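As a minimal sketch of how the pattern above could be wrapped for reuse (the names TOKEN_RE and tokenize are hypothetical, not part of the original answer), the following returns an empty array instead of null when nothing matches:

```javascript
// Hypothetical helper wrapping the regex from the answer above.
const TOKEN_RE = /(?:\w|\b-\b)+|[^\w\s]+/g;

function tokenize(sentence) {
  // String#match returns null when there is no match, so fall back to [].
  return sentence.match(TOKEN_RE) || [];
}

const tokens = tokenize("my creator is 5-MAF. world5 - hurray");
console.log(tokens);
// ["my", "creator", "is", "5-MAF", ".", "world5", "-", "hurray"]
console.log(tokenize("   "));
// []
```

Note that [^\w\s]+ groups consecutive punctuation, so a run such as "?!" comes back as a single token.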

Related

Matching sets consisting of letters plus non-letter characters

I want to match sets of characters that include a letter and non-letter characters. Many of them are a single letter. Or two letters.
const match = 'tɕ\'i mɑ mɑ ku ʂ ɪɛ'.match(/\b(p|p\'|m|f|t|t\'|n|l|k|k\'|h|tɕ|tɕ\'|ɕ|tʂ|tʂ\'|ʂ|ʐ|ts|ts\'|s)\b/g)
console.log(match)
I thought I could use \b, but it's wrong because there are "non-words" characters in the sets.
This is the current output:
[
"t",
"m",
"m"
]
But I want this to be the output:
[
"tɕ'",
"m",
"m",
"k",
"ʂ"
]
Note: notice that some sets end with a non-word boundary, like tɕ'.
(In phonetic terms, the consonants.)

As stated in the comments above, \b doesn't work with Unicode characters in JS. Moreover, from your expected output it appears that you don't need word boundaries.
You can use this shortened and refactored regex:
t[ɕʂs]'?|[tkp]'?|[tmfnlhshɕʐʂ]
Code:
const s = 'tɕ\'i mɑ mɑ ku ʂ ɪɛ';
const re = /t[ɕʂs]'?|[tkp]'?|[tmfnlhshɕʐʂ]/g
console.log(s.match(re))
//=> ["tɕ'", "m", "m", "k", "ʂ" ]
RegEx Demo
RegEx Details:
t[ɕʂs]'?: Match t followed by any letter inside [...] and then an optional '
|: OR
[tkp]'?: Match letters t or k or p and then an optional '
|: OR
[tmfnlhshɕʐʂ]: Match any letter inside [...]
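An alternative to refactoring the alternation by hand is to generate it from the list of sets, sorted so that longer sets are tried first. This is only a sketch: the list is copied from the question, while consonants, escapeRegExp and consonantRe are names made up for this illustration.

```javascript
// The consonant list is copied from the question; everything else is illustrative.
const consonants = ["p", "p'", "m", "f", "t", "t'", "n", "l", "k", "k'", "h",
  "tɕ", "tɕ'", "ɕ", "tʂ", "tʂ'", "ʂ", "ʐ", "ts", "ts'", "s"];

// Escape characters that have a special meaning in a regex.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Longest alternatives first, so "tɕ'" is tried before "tɕ" and "t".
const source = [...consonants]
  .sort((a, b) => b.length - a.length)
  .map(escapeRegExp)
  .join("|");
const consonantRe = new RegExp(source, "g");

const found = "tɕ'i mɑ mɑ ku ʂ ɪɛ".match(consonantRe);
console.log(found);
// ["tɕ'", "m", "m", "k", "ʂ"]
```

Sorting by length matters because a regex alternation takes the first alternative that matches, not the longest one.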

Regex Javascript add space after punctuation

Currently I'm using replace(/\s*([,.!?:;])[,.!?:;]*\s*/g, '$1 ') to add space after punctuation. But it doesn't work if the sentence contains three dots.
Example text: "Hello,today is a beautiful day...But tomorrow is,not."
Expected output: "Hello, today is a beautiful day... But tomorrow is, not."
let text = "Hello,today is a beautiful day...But tomorrow is,not.";
text = text.replace(/\s*([,.!?:;])[,.!?:;]*\s*/g, '$1 ')
Gives:
"Hello, today is a beautiful day. But tomorrow is, not. "
Please tell me what regex I can use so that I can get the expected output.

You should match all consecutive punctuation chars into Group 1, not just the first char. Also, it makes sense to exclude a match of the punctuation at the end of the string.
You can use
text.replace(/\s*([,.!?:;]+)(?!\s*$)\s*/g, '$1 ')
Also, it still might be handy to .trim() the result. See the regex demo.
Details
\s* - 0 or more whitespace chars
([,.!?:;]+) - Group 1 ($1): one or more ,, ., !, ?, : or ;
(?!\s*$) - a negative lookahead that fails the match if the current position is immediately followed by zero or more whitespace chars and then the end of the string
\s* - 0 or more whitespace chars
See a JavaScript demo:
let text = "Hello,today is a beautiful day...But tomorrow is,not.";
text = text.replace(/\s*([,.!?:;]+)(?!\s*$)\s*/g, '$1 ');
console.log(text);
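A minimal sketch of the same replacement wrapped in a function, with the suggested .trim() applied (the function name addSpaceAfterPunctuation and the variable spaced are assumptions made for this example):

```javascript
// Hypothetical wrapper around the regex from the answer above.
function addSpaceAfterPunctuation(input) {
  return input
    // Collapse whitespace around each run of punctuation into one trailing space,
    // except for punctuation at the very end of the string.
    .replace(/\s*([,.!?:;]+)(?!\s*$)\s*/g, "$1 ")
    .trim();
}

const spaced = addSpaceAfterPunctuation("Hello,today is a beautiful day...But tomorrow is,not.");
console.log(spaced);
// "Hello, today is a beautiful day... But tomorrow is, not."
```

This sketch also inserts a space inside numbers such as 1.200 (giving "1. 200"), which is why the follow-up answer below adds a \d[,.]\d alternative.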

Thanks to @Wiktor Stribiżew for his suggestions. I came up with a final regex that meets my requirements:
let text = 'Test punctuation:c,c2,c3,d1.D2.D3.Q?.Ec!Sc;Test number:1.200 1,200 2.3.Test:money: $15,000.Test double quote and three dots:I said "Today is a...beautiful,sunny day.".But tomorrow will be a long day...';
text = text.replace(/\.{3,}$|\s*(?:\d[,.]\d|([,.!?:;]+))(?!\s*$)(?!")\s*/g, (m0, m1) => { return m1 ? m1 + ' ' : m0; });
console.log(text); // It will print out: "Test punctuation: c, c2, c3, d1. D2. D3. Q?. Ec! Sc; Test number: 1.200 1,200 2.3. Test: money: $15,000. Test double quote and three dots: I said "Today is a... beautiful, sunny day.". But tomorrow will be a long day..."
Bonus! I also converted this into Dart, as I'm using this feature in a Flutter app as well. So, just in case someone needs to use it in Dart:
void main() {
  String addSpaceAfterPunctuation(String words) {
    var regExp = r'\.{3,}$|\s*(?:\d[,.]\d|([,.!?:;]+))(?!\s*$)(?!")\s*';
    return words.replaceAllMapped(
      RegExp(regExp),
      (Match m) {
        return m[1] != null ? "${m[1]} " : "${m[0]}";
      },
    );
  }

  var text = 'Test punctuation:c,c2,c3,d1.D2.D3.Q?.Ec!Sc;Test number:1.200 1,200 2.3.Test:money: \$15,000.Test double quote and three dots:I said "Today is a...beautiful,sunny day.".But tomorrow will be a long day...';
  text = addSpaceAfterPunctuation(text);
  print(text); // Print out: Test punctuation: c, c2, c3, d1. D2. D3. Q?. Ec! Sc; Test number: 1.200 1,200 2.3. Test: money: $15,000. Test double quote and three dots: I said "Today is a... beautiful, sunny day.". But tomorrow will be a long day...
}

Regex to split with multiple separators of one or several characters

I want to split a string on the separators ' or .. while keeping them:
"'TEST' .. 'TEST2' ".split(/([' ] ..)/g);
to get:
["'", "TEST", "'", "..", "'", "TEST2", "'" ]
but it doesn't work. Do you know how to fix this?

The [' ] .. pattern matches a ' or space followed with a space and any two chars other than line break chars.
You may use
console.log("'TEST' .. 'TEST2' ".trim().split(/\s*('|\.{2})\s*/).filter(Boolean))
Here,
.trim() - remove leading/trailing whitespace
.split(/\s*('|\.{2})\s*/) - splits the string on ' or a double dot (these are captured in a capturing group and are therefore kept in the resulting array), together with any surrounding whitespace
.filter(Boolean) - removes empty items.
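To illustrate why the capturing group matters, here is a small sketch comparing the same split with a capturing and a non-capturing group (the variable names are made up for this example):

```javascript
const input = "'TEST' .. 'TEST2' ".trim();

// Capturing group: the matched delimiters are included in the result.
const withDelimiters = input.split(/\s*('|\.{2})\s*/).filter(Boolean);
console.log(withDelimiters);
// ["'", "TEST", "'", "..", "'", "TEST2", "'"]

// Non-capturing group: the delimiters are dropped.
const withoutDelimiters = input.split(/\s*(?:'|\.{2})\s*/).filter(Boolean);
console.log(withoutDelimiters);
// ["TEST", "TEST2"]
```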

I'm not sure it will work in every situation, but you can try this:
"'TEST' .. 'TEST2' ".replace(/(\'|\.\.)/g, ' $1 ').trim().split(/\s+/)
returns:
["'", "TEST", "'", "..", "'", "TEST2", "'"]

Splitting while keeping the delimiters can often be reduced to a matchAll. In this case, /(?:'|\.\.|\S[^']+)/g seems to do the job on the example. The idea is to alternate between literal single quote characters, two literal periods, or any sequence up to a single quote that starts with a non-space.
const result = [..."'TEST' .. 'TEST2' ".matchAll(/(?:'|\.\.|\S[^']+)/g)].flat();
console.log(result);
Another idea that might be more robust even if it's not a single shot regex is to use a traditional, non-clever "stuff between delimiters" pattern like /'([^']+)'/g, then flatMap to clean up the result array to match your format.
const s = "'TEST' .. 'TEST2' ";
const result = [...s.matchAll(/'([^']+)'/g)].flatMap(e =>
["'", e[1], "'", ".."]
).slice(0, -1);
console.log(result);

Using Regex, how to check if second to last character is odd

I'm trying to wrap my head around Regex, but having some troubles with the basics.
I want to check whether the last character in a string is either a "0" or a "5", but I also want to check whether the second-to-last character (if it exists) is odd.
If it matters, I'm trying to do this in Javascript for some form validation. I have the following Regex to satisfy my first condition of checking the last character and making sure its a "0" or a "5"
/([0|5]$)/g
But how do I properly add a 2nd condition to see if the 2nd to last character exists and is odd? Something like the following...?
/([0|5]$)([1|3|5|7|9]$-1)/g
If someone doesn't mind helping me out here and also explain to me what each part of their regex is doing, I'd be very grateful.

I'd go with /(?<=[13579])[05]$|^[05]$/.
This uses two alternatives: one that checks for an odd digit in the second-to-last position when there are at least two characters in the string, and one that handles a single-character string.
Breaking this down:
(?<=[13579]) - a positive lookbehind for exactly one odd digit
[05] - match a 0 or a 5 directly following the lookbehind
$ - the end of the string
| - denotes an OR
^ - the start of the string
[05] - match a 0 or a 5
$ - the end of the string
This can be seen in the following:
var re = /(?<=[13579])[05]$|^[05]$/;
console.log(re.test('12345')); // 12345 should return `false`
console.log(re.test('12335')); // 12335 should return `true`
console.log(re.test('1')); // 1 should return `false`
console.log(re.test('5')); // 5 should return `true`
And also seen on Regex101 here.

You're thinking about it the wrong way.
Try this:
/([13579])([05])$/g

If you want to check whether the last character in a string is either a "0" or a "5", and also whether the second-to-last character (if it exists) is odd, I think you do not need the capturing groups.
You could use an alternation and character classes for your requirements.
(?:\D[05]|[13579][05]|^[05])$
That would match:
(?: Non capturing group
\D[05] Match not a digit and 0 or 5
| Or
[13579][05] Match an odd digit and 0 or 5
| Or
^[05] Match from the beginning of the string 0 or 5
) Close non capturing group
$ Assert the end of the string
const strings = [
  "00",
  "11",
  "text1",
  "text10",
  "text00",
  "text5",
  "10",
  "05",
  "15",
  "99",
  "12345",
  "12335",
  "0000",
  "0010",
  "5",
  "1",
  "0",
];
let pattern = /(?:[13579][05]|\D[05]|^[05])$/;
strings.forEach((s) => {
  console.log(s + " ==> " + pattern.test(s));
});

/(^|[13579])[05]$/
Explained:
[05]$ means "0 or 5 followed by end of string"
(^|[13579]) means "beginning of string OR 1 or 3 or 5 or 7 or 9"
Tested in console:
re.test('aaa0') - false
re.test('aa15') - true
re.test('aa20') - false
re.test('0') - true
Is this what you were after?
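One way to gain confidence in a pattern like this is to compare it against the same rule written out in plain JavaScript. The following is only a sketch; lastDigitRe, isValidPlain and the sample list are made up for this illustration.

```javascript
const lastDigitRe = /(^|[13579])[05]$/;

// Plain-JavaScript reference check, written out step by step (illustrative only).
function isValidPlain(s) {
  if (s.length === 0) return false;
  const last = s[s.length - 1];
  if (last !== "0" && last !== "5") return false; // last char must be 0 or 5
  if (s.length === 1) return true;                // no second-to-last char
  return "13579".includes(s[s.length - 2]);       // second-to-last must be odd
}

const samples = ["0", "5", "15", "10", "25", "12335", "12345", "aaa0", "1", ""];
// Collect every sample on which the regex and the plain check disagree.
const mismatches = samples.filter((s) => lastDigitRe.test(s) !== isValidPlain(s));
console.log(mismatches); // []
```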

As you said
I want to check to see if a the last character in a string is either a "0" or a "5", but I also want to check to is if the second to last character (if it exists) is odd
Try this:
// Digits only: an optional prefix ending in an odd digit, then a final 0 or 5.
var rgx = /^(?:[0-9]*[13579])?[05]$/;
// Logs "match" or "not match" for every string in the array.
function test(strings) {
  for (var i = 0; i < strings.length; i++) {
    var res = strings[i].match(rgx);
    if (res) {
      console.log("match");
    } else {
      console.log("not match");
    }
  }
}
var arr = ["12335", "12350", "45", "10", "12337", "11", "01", "820"];
test(arr);

You would want to do:
/(^|[13579])([05])$/
https://regex101.com/r/nMX7L2/4
Inside a character class, | is a literal character rather than an OR, so the digits are listed without separators.
1st Capturing Group (^|[13579])
1st Alternative ^
^ asserts position at start of the string
2nd Alternative [13579]
Match a single character present in the list 13579
2nd Capturing Group ([05])
Match a single character present in the list 05
$ asserts position at the end of the string

Split a string into an array of words, punctuation and spaces in JavaScript

I have a string which I'd like to split into items contained in an array as the following example:
var text = "I like grumpy cats. Do you?"
// to result in:
var wordArray = ["I", " ", "like", " ", "grumpy", " ", "cats", ".", " ", "Do", " ", "you", "?" ]
I've tried the following expression (and similar variations) without success:
var wordArray = text.split(/(\S+|\W)/)
//this disregards spaces and doesn't separate punctuation from words
In Ruby there's a Regex operator (\b) that splits at any word boundary, preserving spaces and punctuation, but I can't find a similar one for JavaScript. Would appreciate your help.

Use String#match method with regex /\w+|\s+|[^\s\w]+/g.
\w+ - for any word match
\s+ - for whitespace
[^\s\w]+ - for matching combination of anything other than whitespace and word character.
var text = "I like grumpy cats. Do you?";
console.log(
text.match(/\w+|\s+|[^\s\w]+/g)
)
Regex explanation here
FYI: if you just want to match a single special character, you can use \W or . instead of [^\s\w]+.
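As a quick sanity check on this approach, the sketch below (variable names are made up for the example) shows that the three alternatives together cover every character, so joining the tokens gives back the original string:

```javascript
const sentence = "I like grumpy cats. Do you?";
const parts = sentence.match(/\w+|\s+|[^\s\w]+/g) || [];
console.log(parts);
// ["I", " ", "like", " ", "grumpy", " ", "cats", ".", " ", "Do", " ", "you", "?"]

// Every character belongs to exactly one token, so joining restores the input.
console.log(parts.join("") === sentence); // true
```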

The word boundary \b should work fine.
Example
"I like grumpy cats. Do you?".split(/\b/)
// ["I", " ", "like", " ", "grumpy", " ", "cats", ". ", "Do", " ", "you", "?"]
Edit
To handle the case of ., we can split it on [.\s] as well
Example
"I like grumpy cats. Do you?".split(/(?=[.\s]|\b)/)
// ["I", " ", "like", " ", "grumpy", " ", "cats", ".", " ", "Do", " ", "you", "?"]
(?=[.\s]|\b) - positive lookahead: splits just before a . or a whitespace character, or at a word boundary

var text = "I like grumpy cats. Do you?"
var arr = text.split(/\s|\b/);
alert(arr);
