Match sentences and whitespace separately - javascript

Take the following text:
This is a sentence. This is a sentence... This is a sentence! This is a sentence? This is a sentence.This is a sentence. This is a sentence
I'd like to match this so I have an array like the following:
[
"This is a sentence.",
" ",
"This is a sentence...",
" ",
"This is a sentence!",
" ",
"This is a sentence?",
" ",
"This is a sentence.",
"",
"This is a sentence.",
" ",
"This is a sentence",
]
With my current regex, however:
str.match(/[^.!?]+[.!?]*(\s*)/g);
I get the following:
[
"This is a sentence. ",
"This is a sentence... ",
"This is a sentence! ",
"This is a sentence? ",
"This is a sentence.",
"This is a sentence. ",
"This is a sentence"
]
How can I achieve this with a JavaScript RegExp?
Thanks in advance!

Just add [^\s] at the beginning and change (\s*) to |\s+.
The final regex will be like:
str.match(/[^\s][^.!?]+[.!?]*|\s+/g)
[^\s] (equivalent to \S) makes each sentence match start at a non-whitespace character, so leading whitespace is not included in the sentence.
|\s+ matches each run of whitespace as a separate item.
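As a quick check, the pattern can be run against the sample text from the question. This is only a sketch; the names sample and tokens are made up for illustration. Note that the result has no empty string between the two sentences that touch each other, because neither alternative can produce an empty match.

```javascript
// Sample text from the question.
const sample = "This is a sentence. This is a sentence... This is a sentence! " +
  "This is a sentence? This is a sentence.This is a sentence. This is a sentence";

// Either a sentence that starts at a non-whitespace character, or a run of whitespace.
const tokens = sample.match(/[^\s][^.!?]+[.!?]*|\s+/g);

console.log(tokens);
console.log(tokens.length); // 12: seven sentences and five whitespace separators
```

Because every character ends up in exactly one token, tokens.join("") gives back the original string.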

Here is a solution that keeps the regex from the question and post-processes the matches to separate out the whitespace. Each match is split on its trailing whitespace (the (?=$) lookahead restricts the split to whitespace at the end of the string), empty strings are filtered out, and the nested arrays are flattened again. Note that filtering out empty strings also means the "" entry between the two unseparated sentences does not appear in the result.
const baseStr = "This is a sentence. This is a sentence... This is a sentence! This is a sentence? This is a sentence.This is a sentence. This is a sentence";
var result = baseStr.match(/[^.!?]+[.!?]*(\s*)/g).map( str => str.split(/(\s*)(?=$)/).filter(_=>_)).flat();
console.log(result);
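If the empty string between the two sentences that are not separated by a space also has to appear, one option is to capture the sentence and its trailing whitespace as two groups and push both. This is a sketch rather than the approach from the answer above; splitSentences and sampleText are hypothetical names, and it assumes String.prototype.matchAll is available (ES2020).

```javascript
// Hypothetical helper: returns sentences and the whitespace between them as
// alternating entries, including "" when two sentences touch.
function splitSentences(text) {
  const parts = [];
  // Group 1: sentence body plus closing punctuation.
  // Group 2: trailing whitespace (may be empty).
  for (const match of text.matchAll(/([^.!?]+[.!?]*)(\s*)/g)) {
    parts.push(match[1], match[2]);
  }
  // Drop an empty separator after the last sentence so the array ends with a sentence.
  if (parts.length > 0 && parts[parts.length - 1] === "") {
    parts.pop();
  }
  return parts;
}

const sampleText = "This is a sentence. This is a sentence... This is a sentence! " +
  "This is a sentence? This is a sentence.This is a sentence. This is a sentence";
console.log(splitSentences(sampleText));
```

For the sample text this yields the 13 entries listed in the question, with "" between the fifth and sixth sentence.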

Related

How to remove duplicate \n (line break) from a string and keep only one?

I have a string like this:
This is a sentence.\n This is sentence 2.\n\n\n\n\n\n This is sentence 3.\n\n And here is the final sentence.
What I want is:
This is a sentence.\n This is sentence 2.\n This is sentence 3.\n And here is the final sentence.
I want to collapse each run of repeated \n characters into a single \n. Is that possible in JavaScript?
You may try replacing \n{2,} with a single \n:
var input = "This is a sentence.\n This is sentence 2.\n\n\n\n\n\n This is sentence 3.\n\n And here is the final sentence.";
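// Collapse each run of two or more newlines into one; the following space is kept, as in the desired output.
var output = input.replace(/\n{2,}/g, '\n');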
console.log(output);
You can also use the regex /\n+/g and replace each match with a single \n:
const str =
"This is a sentence.\n This is sentence 2.\n\n\n\n\n\n This is sentence 3.\n\n And here is the final sentence.";
const result = str.replace(/\n+/g, "\n");
console.log(result);

How to prevent this regex code from removing the period?

I'm writing code to turn dumb quotes into smart quotes:
text.replace(/\b"|\."/g, '”')
(I added OR period because sometimes sentences end with a period not a word.)
Input:
"This is a text."
Output:
“This is a text”
Desired output:
“This is a text.”
As you can see, that code removes the dot.
How to prevent this?
RULES: I want to replace dumb double quotes that are at the end of a word or after a period with right double smart quotes.
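You should put the period in capturing group 1 and include that group in the replacement:
replace(/\b"$|(\.)"$/g, "$1”");
$1 will contain the period (or be empty when the first alternative matched).
Anchoring with $ restricts the replacement to the quote at the end of the string, which matters for input like: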
"This is a "text"."
EDIT for the new rule:
If you also want to replace the closing quotes of quotations nested inside the outer quote, use this:
const regex = /( ")([\w\s.]*)"(?=.*"$)|\b"$|(\.)?"$/g;
const str = `"This is a "subquote" about "life"."`;
const subst = `$1$2$3”`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
"You live "your live" always in company"
"You live "alone" always in company"
"You live "" always in company"
"You live "in the dark..." always in company"
"You live "alone" very "alone" always in company"
Please try this:
let text = '"This is a text."'
console.log(
text.replace(/\b"|(\.)"/g,'$1\u201d')
)
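Another way to keep the period is to not consume it at all. The sketch below uses a lookbehind assertion, which assumes an engine with ES2018 lookbehind support; closeQuotes is a hypothetical helper name, and it only handles closing quotes, as in the question.

```javascript
// Hypothetical helper: replaces a straight double quote with a right curly
// quote (U+201D) when it directly follows a word character or a period.
function closeQuotes(text) {
  // (?<=[\w.]) only asserts the preceding character, so that character is not
  // part of the match and is left untouched by the replacement.
  return text.replace(/(?<=[\w.])"/g, "\u201D");
}

console.log(closeQuotes('"This is a text."'));
console.log(closeQuotes('He said "hello" twice'));
```

Since nothing but the quote itself is matched, no capturing group or $1 is needed in the replacement string.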

Regular expression to match group of consecutive characters [duplicate]

This question already has answers here:
How to check whether a string contains a substring in JavaScript?
(3 answers)
Closed 4 years ago.
I need a JS regular expression to match the character sequence /*.
I have this now
/(\b\/*\b)g
but it does not work.
ETA:
any string that has /* should match
so...
Hello NO MATCH
123 NO MATCH
/* HELLo MATCH
/*4534534 MATCH
Since you only want to detect if it contains something you don't have to use regex and can just use .includes("/*"):
function fits(str) {
return str.includes("/*");
}
var test = [
"Hello NO MATCH",
"123 NO MATCH",
"/* HELLo MATCH",
"/*4534534 MATCH"
];
var result = test.map(str => fits(str));
console.log(result);
You might use a positive lookahead to test whether the string contains /*.
If so, match any character one or more times (.+) from the beginning of the string (^) until the end of the string ($):
^(?=.*\/\*).+$
Explanation
^ Begin of the string
(?= Positive lookahead that asserts what is on the right
.*\/\* Match any character zero or more times and then /*
) Close positive lookahead
.+ Match any character one or more times
$ End of the string
const strings = [
"Hello NO MATCH",
"123 NO MATCH",
"/* HELLo MATCH",
"/*4534534 MATCH",
"test(((*/*"
];
let pattern = /^(?=.*\/\*).+$/;
strings.forEach((s) => {
console.log(s + " ==> " + pattern.test(s));
});
I think you could also use indexOf() to get the index of the first occurrence of /*. It returns -1 if the value is not found.
const strings = [
"Hello NO MATCH",
"123 NO MATCH",
"/* HELLo MATCH",
"/*4534534 MATCH",
"test(((*/*test",
"test /",
"test *",
"test /*",
"/*"
];
let pattern = /^(?=.*\/\*).+$/;
strings.forEach((s) => {
console.log(s + " ==> " + pattern.test(s));
console.log(s + " ==> " + (s.indexOf("/*") !== -1));
});
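For reference, one reason the attempt in the question fails is that the unescaped * is read as a quantifier, so \/* means "zero or more slashes" rather than the two characters /*. If a regular expression is still preferred over includes() or indexOf(), escaping both characters is enough. A small sketch, with hasCommentOpener and samples as made-up names:

```javascript
// Hypothetical helper: true when the string contains the two characters "/*".
function hasCommentOpener(str) {
  // Both "/" and "*" are escaped so they are matched literally.
  return /\/\*/.test(str);
}

const samples = ["Hello", "123", "/* HELLo", "/*4534534", "test /", "test *"];
for (const s of samples) {
  console.log(JSON.stringify(s) + " ==> " + hasCommentOpener(s));
}
```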

Split a string into an array of words, punctuation and spaces in JavaScript

I have a string which I'd like to split into items contained in an array as the following example:
var text = "I like grumpy cats. Do you?"
// to result in:
var wordArray = ["I", " ", "like", " ", "grumpy", " ", "cats", ".", " ", "Do", " ", "you", "?" ]
I've tried the following expression (and similar variations) without success:
var wordArray = text.split(/(\S+|\W)/)
//this disregards spaces and doesn't separate punctuation from words
In Ruby there's a regex operator (\b) that splits at any word boundary, preserving spaces and punctuation, but I can't find a similar one for JavaScript. I would appreciate your help.
Use the String#match method with the regex /\w+|\s+|[^\s\w]+/g.
\w+ - for any word match
\s+ - for whitespace
[^\s\w]+ - for matching combination of anything other than whitespace and word character.
var text = "I like grumpy cats. Do you?";
console.log(
text.match(/\w+|\s+|[^\s\w]+/g)
)
FYI: If you just want to match a single special character, you can use \W or . instead of [^\s\w]+.
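The three alternatives (word characters, whitespace, everything else) cover every possible character, so no input is lost by this tokenization. The sketch below wraps the pattern in a helper and checks that joining the tokens restores the input; tokenize and wordTokens are hypothetical names.

```javascript
// Hypothetical helper: splits text into words, whitespace runs and punctuation runs.
function tokenize(text) {
  // match() returns null when nothing matches (for example, for an empty string),
  // so fall back to an empty array.
  return text.match(/\w+|\s+|[^\s\w]+/g) || [];
}

const wordTokens = tokenize("I like grumpy cats. Do you?");
console.log(wordTokens);
// Every character belongs to exactly one token, so joining restores the input.
console.log(wordTokens.join("") === "I like grumpy cats. Do you?");
```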
The word boundary \b should work fine.
Example
"I like grumpy cats. Do you?".split(/\b/)
// ["I", " ", "like", " ", "grumpy", " ", "cats", ". ", "Do", " ", "you", "?"]
Edit
To split off the period as well, we can also split just before [.\s]:
Example
"I like grumpy cats. Do you?".split(/(?=[.\s]|\b)/)
// ["I", " ", "like", " ", "grumpy", " ", "cats", ".", " ", "Do", " ", "you", "?"]
(?=[.\s]|\b) is a positive lookahead: it splits just before a . or a whitespace character, or at a word boundary.
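var text = "I like grumpy cats. Do you?";
// Splits on whitespace or at word boundaries; the whitespace itself is discarded.
var arr = text.split(/\s|\b/);
console.log(arr); // ["I", "like", "grumpy", "cats", ".", "Do", "you", "?"]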

Javascript replace string which doesn't match?

Say I have this string:
cat hates dog
When I do a replace:
str = str.replace('cat', 'fish');
only "cat" is replaced by "fish". How can I make it work like this:
"cat" replaced by "fish"
any other word replaced by "goat"
so that I get the new string:
fish goat goat
You can use this regexp \b\w+?\b:
"cat hates dog".replace(/\b\w+?\b/g, function(a) {
return a === 'cat' ? 'fish' : 'goat';
});
It matches every word (a sequence of word characters \w surrounded by word boundaries \b) and passes each match to the replace callback.
Output:
fish goat goat
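If more than one word needs its own replacement, the same callback idea extends to a lookup table. This is a sketch under that assumption; replaceWords, replacements and fallback are hypothetical names, not part of the answer above.

```javascript
// Hypothetical helper: replaces each word using a lookup table, with a
// fallback for words that are not listed.
function replaceWords(text, replacements, fallback) {
  return text.replace(/\b\w+\b/g, (word) =>
    replacements.has(word) ? replacements.get(word) : fallback
  );
}

const replacements = new Map([["cat", "fish"]]);
console.log(replaceWords("cat hates dog", replacements, "goat")); // fish goat goat
```

A Map is used instead of a plain object so that words such as "constructor" are not accidentally found on the prototype.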
