Regex Javascript add space after punctuation - javascript

Currently I'm using replace(/\s*([,.!?:;])[,.!?:;]*\s*/g, '$1 ') to add space after punctuation. But it doesn't work if the sentence contains three dots.
Example text: "Hello,today is a beautiful day...But tomorrow is,not."
Expected output: "Hello, today is a beautiful day... But tomorrow is, not."
let text = "Hello,today is a beautiful day...But tomorrow is,not.";
text = text.replace(/\s*([,.!?:;])[,.!?:;]*\s*/g, '$1 ')
Gives:
"Hello, today is a beautiful day. But tomorrow is, not. "
Please tell me what regex I can use so that I can get the expected output.

You should match all consecutive punctuation chars into Group 1, not just the first char. Also, it makes sense to exclude a match of the punctuation at the end of the string.
You can use
text.replace(/\s*([,.!?:;]+)(?!\s*$)\s*/g, '$1 ')
It may still be handy to .trim() the result.
Details
\s* - zero or more whitespace chars
([,.!?:;]+) - Group 1 ($1): one or more ,, ., !, ?, : or ; chars
(?!\s*$) - a negative lookahead that fails the match if zero or more whitespace chars followed by the end of string appear immediately to the right
\s* - zero or more whitespace chars
See a JavaScript demo:
let text = "Hello,today is a beautiful day...But tomorrow is,not.";
text = text.replace(/\s*([,.!?:;]+)(?!\s*$)\s*/g, '$1 ');
console.log(text);
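To see why the position of the quantifier matters, the following sketch runs the question's original pattern and the corrected one on the same input. The sample string and the names spaceFirstCharOnly and spaceWholeRun are made up for illustration.

```javascript
// Original pattern: only the first punctuation char is captured;
// the rest of the run is matched outside the group and discarded.
const spaceFirstCharOnly = (s) =>
  s.replace(/\s*([,.!?:;])[,.!?:;]*\s*/g, '$1 ');

// Corrected pattern: the whole run is captured, and punctuation that
// ends the string is left alone by the negative lookahead.
const spaceWholeRun = (s) =>
  s.replace(/\s*([,.!?:;]+)(?!\s*$)\s*/g, '$1 ').trim();

const sample = 'Wait...what?Really';
console.log(spaceFirstCharOnly(sample)); // "Wait. what? Really"
console.log(spaceWholeRun(sample));      // "Wait... what? Really"
```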

Thanks to @Wiktor Stribiżew for his suggestions, I came up with the final regex that meets my requirements:
let text = 'Test punctuation:c,c2,c3,d1.D2.D3.Q?.Ec!Sc;Test number:1.200 1,200 2.3.Test:money: $15,000.Test double quote and three dots:I said "Today is a...beautiful,sunny day.".But tomorrow will be a long day...';
text = text.replace(/\.{3,}$|\s*(?:\d[,.]\d|([,.!?:;]+))(?!\s*$)(?!")\s*/g, (m0, m1) => { return m1 ? m1 + ' ' : m0; });
console.log(text); // It will print out: "Test punctuation: c, c2, c3, d1. D2. D3. Q?. Ec! Sc; Test number: 1.200 1,200 2.3. Test: money: $15,000. Test double quote and three dots: I said "Today is a... beautiful, sunny day.". But tomorrow will be a long day..."
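A smaller input makes it easier to see which alternative fires. This is only a sketch: the regex is the one above, while spacePunctuation and the sample string are illustrative names. The capture group is set only when the [,.!?:;]+ branch matches, so the callback leaves number separators such as 1,200 and a trailing ... untouched.

```javascript
const spacePunctuation = (s) =>
  s.replace(
    /\.{3,}$|\s*(?:\d[,.]\d|([,.!?:;]+))(?!\s*$)(?!")\s*/g,
    // punct is undefined unless the punctuation branch matched
    (match, punct) => (punct ? punct + ' ' : match)
  );

console.log(spacePunctuation('a,b 1,200 x...')); // "a, b 1,200 x..."
```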
Bonus! I also converted this to Dart, as I'm using this feature in a Flutter app as well. So, just in case someone needs it in Dart:
void main() {
String addSpaceAfterPunctuation(String words) {
var regExp = r'\.{3,}$|\s*(?:\d[,.]\d|([,.!?:;]+))(?!\s*$)(?!")\s*';
return words.replaceAllMapped(
RegExp(regExp),
(Match m) {
return m[1] != null ? "${m[1]} " : "${m[0]}";
},
);
}
var text = 'Test punctuation:c,c2,c3,d1.D2.D3.Q?.Ec!Sc;Test number:1.200 1,200 2.3.Test:money: \$15,000.Test double quote and three dots:I said "Today is a...beautiful,sunny day.".But tomorrow will be a long day...';
text = addSpaceAfterPunctuation(text);
print(text); // Print out: Test punctuation: c, c2, c3, d1. D2. D3. Q?. Ec! Sc; Test number: 1.200 1,200 2.3. Test: money: $15,000. Test double quote and three dots: I said "Today is a... beautiful, sunny day.". But tomorrow will be a long day...
}

Related

Regex not finding two letter words that include Swedish letters

So I am very new with Regex and I have managed to create a way to check if a specific word exists inside of a string without just being part of another word.
Example:
I am looking for the word "banana".
banana == true, bananarama == false
This is all fine, however a problem occurs when I am looking for words containing Swedish letters (Å,Ä,Ö) with words containing only two letters.
Example:
I am looking for the word "på" in a string looking like this: "på påsk"
and it comes back as negative.
However if I look for the word "påsk" then it comes back positive.
This is the regex I am using:
const doesWordExist = (s, word) => new RegExp('\\b' + word + '\\b', 'i').test(s);
stringOfWords = "Färg på plagg";
console.log(doesWordExist(stringOfWords, "på"))
//Expected result: true
//Actual result: false
However if I were to change the word "på" to a three letter word then it comes back true:
const doesWordExist = (s, word) => new RegExp('\\b' + word + '\\b', 'i').test(s);
stringOfWords = "Färg pås plagg";
console.log(doesWordExist(stringOfWords, "pås"))
//Expected result: true
//Actual result: true
I have been looking around for answers and have found a few with similar issues involving Swedish letters, but none of them really look for only the word in its entirety.
Could anyone explain what I am doing wrong?
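The word boundary \b depends strictly on the characters matched by \w, which is shorthand for [A-Za-z0-9_]. Since å is not in that class, there is no boundary between å and the following space in "på ", so \bpå\b fails. "pås" works only because it ends in s, which is a \w character.
To get similar behaviour you have to re-implement the boundary check, for example like this (it relies on lookbehind support):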
const swedishCharClass = '[a-zäöå]';
const doesWordExist = (s, word) => new RegExp(
'(?<!' + swedishCharClass + ')' + word + '(?!' + swedishCharClass + ')', 'i'
).test(s);
console.log(doesWordExist("Färg på plagg", "på")); // true
console.log(doesWordExist("Färg pås plagg", "pås")); // true
console.log(doesWordExist("Färg pås plagg", "på")); // false
For more complex alphabets, I'd suggest taking a look at Concrete Javascript Regex for Accented Characters (Diacritics).
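As an alternative sketch, engines that support the u flag with Unicode property escapes (and lookbehind) can describe "letter" generically instead of listing the Swedish characters. The helper escapeRegExp and the name doesWordExistUnicode are hypothetical, and treating letters, digits and underscore as word characters is an assumption that mirrors \w.

```javascript
// Escape regex metacharacters so the search word is matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Unicode-aware stand-in for \w: any letter, any digit, or underscore.
const wordChar = '[\\p{L}\\p{N}_]';

const doesWordExistUnicode = (s, word) =>
  new RegExp(
    '(?<!' + wordChar + ')' + escapeRegExp(word) + '(?!' + wordChar + ')',
    'iu'
  ).test(s);

console.log(/\bpå\b/i.test('Färg på plagg'));             // false
console.log(doesWordExistUnicode('Färg på plagg', 'på')); // true
console.log(doesWordExistUnicode('Färg påsk', 'på'));     // false
console.log(doesWordExistUnicode('PÅ PÅSK', 'på'));       // true
```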

JavaScript Regex split at first letter?

Since regex solutions differ from case to case, depending on the format of the string, I'm having a hard time finding a solution to my problem.
I have an array containing strings in the format, as an example:
"XX:XX - XX:XX Algorithm and Data Structures"
Where "XX:XX - XX:XX" is the timespan of a lecture and each X is a digit.
I'm new to Regex and trying to split the string at the first letter occurring, like so:
let str = "08:15 - 12:50 Algorithm and Data Structures";
let re = //Some regex expression
let result = str.split(re); // Output: ["08:15 - 12:50", "Algorithm and Data Structures"]
I'm thinking it should be something like /[a-Z]/ but I'm not sure at all...
Thanks in advance!
The easiest way is probably to "mark" where you want to split and then split:
const str = '12 34 abcde 45 abcde'.replace(/^([^a-z]+)([a-z])/i, '$1,$2');
// '12 34 ,abcde 45 abcde'
str.split(',')
// [ '12 34 ', 'abcde 45 abcde' ]
This matches from the start of the string: a run of non-letter characters followed by one a-z character, and it puts a comma right in between them. Then you split by the comma.
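Note that a comma marker breaks if the text itself contains a comma. A possible alternative, not taken from the answer above, is to locate the first letter with String.prototype.search and slice around it. The name splitAtFirstLetter is hypothetical.

```javascript
function splitAtFirstLetter(input) {
  // Index of the first ASCII letter, or -1 if there is none.
  const index = input.search(/[a-z]/i);
  if (index === -1) return [input];
  // Drop the whitespace that separates the two parts.
  return [input.slice(0, index).trimEnd(), input.slice(index)];
}

console.log(splitAtFirstLetter('08:15 - 12:50 Algorithm and Data Structures'));
// [ '08:15 - 12:50', 'Algorithm and Data Structures' ]
```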
You can also split directly with a regex that contains a capturing group, although it makes the regex a bit less readable.
console.log(
"08:15 - 12:50 Algorithm and Data Structures".split(/ ([A-Za-z].*)/).filter(Boolean)
)
or, if it's really always XX:XX - XX:XX, easier to just do:
const splitTimeAndCourse = (input) => {
return [
input.slice(0, "XX:XX - XX:XX".length),
input.slice("XX:XX - XX:XX".length + 1)
]
}
console.log(splitTimeAndCourse("08:15 - 12:50 Algorithm and Data Structures"))
If the time part of the string always has a fixed length, you can use this regex, for example:
(^.{0,13})(.*)
Check this here https://regex101.com/r/ANMHy5/1
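Applied to the sample string, that pattern could be used as follows. This is a sketch that assumes the time span is always exactly 13 characters long; splitFixed is an illustrative name. The second group starts with the separating space, hence the trimStart().

```javascript
function splitFixed(input) {
  const match = input.match(/(^.{0,13})(.*)/);
  // match[1]: up to the first 13 chars, match[2]: the remainder of the line
  return [match[1], match[2].trimStart()];
}

console.log(splitFixed('08:15 - 12:50 Algorithm and Data Structures'));
// [ '08:15 - 12:50', 'Algorithm and Data Structures' ]
```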
I know you asked about regex in particular, but here is a way to do this without regex...
This works provided your time span is always at the beginning of the string and is always formatted with white space between the parts, as in XX:XX - XX:XX. You could use a function that splits the string at the white space, joins the first three pieces into one chunk (the time span) and the remaining pieces into a second chunk (the lecture title), and then returns the two chunks as an array.
let str = "08:15 - 12:50 Algorithm and Data Structures";
const splitString = (str) => {
  // split the string at the white spaces
  const strings = str.split(' ');
  // the first three pieces ("08:15", "-", "12:50") form the timespan
  const timespan = strings.slice(0, 3).join(' ');
  // the remaining pieces form the lecture title
  const lecture = strings.slice(3).join(' ');
  // place them into an array
  return [timespan, lecture];
};
console.log(splitString(str))
For that format, you might also use 2 capture groups instead of using split.
^(\d{1,2}:\d{1,2}\s*-\s*\d{1,2}:\d{1,2})\s+([A-Za-z].*)
The pattern matches:
^ Start of string
(\d{1,2}:\d{1,2}\s*-\s*\d{1,2}:\d{1,2}) Capture group 1, match a timespan like pattern
\s+ Match 1+ whitespace chars
([A-Za-z].*) Capture group 2, start with a char A-Za-z and match the rest of the line.
let str = "08:15 - 12:50 Algorithm and Data Structures";
let regex = /^(\d{1,2}:\d{1,2}\s*-\s*\d{1,2}:\d{1,2})\s+([A-Za-z].*)/;
let [, ...groups] = str.match(regex);
console.log(groups);
Another option with split is a pattern that uses a lookbehind to assert that only non-letter characters occur between the start of the string and the current position (check that your engine supports lookbehind), then matches 1+ whitespace chars, and finally asserts a char A-Za-z to the right.
(?<=^[^a-zA-Z]+)\s+(?=[A-Za-z])
let str = "08:15 - 12:50 Algorithm and Data Structures";
let regex = /(?<=^[^a-zA-Z]+)\s+(?=[A-Za-z])/;
console.log(str.split(regex))

Using Regex to add spaces after punctuation but ignore instances of U.S

I am using
(/(?<=[.,])(?=[^\s])/mg,' ')
to add spaces after . and , that are not followed by spaces. I want to ignore instances of the word U.S. Could someone help do this?
You can use this regex
\b(U\.S)\b|([,.])(?=\S)
\b(U\.S)\b - Matches U.S. Since the question does not say otherwise, I am assuming word boundaries around it. (g1)
([.,])(?=\S) - Matches . or , followed by a non space character. (g2)
let str = 'ab.c,de'
let str2 = 'U.S xyzU.S U.S xyz.x'
const replacer = (input)=>{
return input.replace(/\b(U\.S)\b|([,.])(?=\S)/gm, function(match,g1,g2){
return g1 ? g1 : g2+' '
})
}
console.log(replacer(str))
console.log(replacer(str2))
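The same idea extends to more than one protected token. The sketch below builds the alternation from a list; the function name addSpacesExcept and the list contents are hypothetical, and the list is assumed to be non-empty.

```javascript
// Escape regex metacharacters so each protected token is matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// protectedWords must contain at least one entry (assumption of this sketch).
function addSpacesExcept(input, protectedWords) {
  const alternation = protectedWords.map(escapeRegExp).join('|');
  const re = new RegExp('\\b(' + alternation + ')\\b|([,.])(?=\\S)', 'g');
  // keep is set for a protected token, punct for a lone . or ,
  return input.replace(re, (match, keep, punct) =>
    keep !== undefined ? keep : punct + ' '
  );
}

console.log(addSpacesExcept('ab.c,de', ['U.S']));
// "ab. c, de"
console.log(addSpacesExcept('The U.S economy,e.g.trade', ['U.S', 'e.g']));
// "The U.S economy, e.g. trade"
```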

JavaScript regex symbol occurrence

I have some problem with regex in JS. I wrote my regular expression:
/^([A-Z]+)\s+([^\s]+)\s+([^\s]+)\s(\[.*\])\s+(.+)$/g
But it gives wrong result with one example:
WARN 2016-01-19 13:17:32,051 [localhost-startStop-1] Duplicate property values for key Data\ Df : [ Date from] and [ Starting Day]
I want the regex to divide the string into these parts:
WARN
2016-01-19
13:17:32,051
[localhost-startStop-1]
Duplicate property values for key Data\ Df : [ Date from] and [ Starting Day]
Everything is OK except for the last 2 parts. There I got:
[localhost-startStop-1] Duplicate property values for key Data\ Df : [ Date from]
and [ Starting Day]
Why? I want to divide that part of string by first ] occurrence. Don't know why it takes the second.
PS: Here is the example: https://regex101.com/r/wG5xV6/2
Thanks.
You need to restrict the .* (that matches zero or more characters other than a newline, as many as possible) with a lazy dot matching .*? that matches zero or more characters other than a newline, as few as possible:
^([A-Z]+)\s+([^\s]+)\s+([^\s]+)\s(\[.*?\])\s+(.+)$
                                    ^^^
You can also shorten the pattern a bit by replacing [^\s] with \S:
^([A-Z]+)\s+(\S+)\s+(\S+)\s(\[.*?\])\s+(.+)$
var re = /^([A-Z]+)\s+(\S+)\s+(\S+)\s(\[.*?\])\s+(.+)$/gm;
// The backslash in "Data\\ Df" is doubled: a single "\ " inside a string literal would silently drop the backslash.
var str = 'INFO 2016-01-20 08:03:21,113 [C3P0PooledConnectionPoolManager[identityToken->1bqu9pa9eq1cqr515yzwu7|6c240779]-HelperThread-#0] Connection to \'rander\' established. Notifying listeners...\nWARN 2016-01-19 13:17:32,051 [localhost-startStop-1] Duplicate property values for key Data\\ Df : [ Date from] and [ Starting Day]';
var m; // declare the match variable instead of leaking a global
while ((m = re.exec(str)) !== null) {
  document.body.innerHTML += "<pre>"+ JSON.stringify([m[1],m[2],m[3],m[4], m[5]], 0, 4) + "</pre>";
}
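The difference between the greedy and the lazy quantifier can be seen side by side in the sketch below. The shortened log line and the variable names are made up for illustration; the two patterns are the ones discussed above.

```javascript
const line = 'WARN 2016-01-19 13:17:32,051 [main] key : [ a] and [ b]';

const greedy = /^([A-Z]+)\s+(\S+)\s+(\S+)\s(\[.*\])\s+(.+)$/;
const lazy = /^([A-Z]+)\s+(\S+)\s+(\S+)\s(\[.*?\])\s+(.+)$/;

// Greedy .* backtracks from the end of the line, so the bracket group
// stops at the last "]" that is still followed by whitespace and text.
console.log(line.match(greedy).slice(4));
// [ '[main] key : [ a]', 'and [ b]' ]

// Lazy .*? grows from the left and stops at the first "]" that lets
// the rest of the pattern match.
console.log(line.match(lazy).slice(4));
// [ '[main]', 'key : [ a] and [ b]' ]
```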
You can try this:
^(.*?)(\s*?)(\S*?)(\s*?)(\S*?)(\s*?)(\[.*?\])(\s*)(.*?)$
You can also replace \S with . here.
\S matches any non-whitespace character.
? after a quantifier makes it lazy, so it matches as few characters as possible.
The structure of the line can be expressed as follows:
begin + word + space + word + space + word + space + word + space + word + end
The bracketed part has to stop at the first ], so we use the lazy ? there.
If you want to change the format of the line, you can use a replacement such as
($1)\r($3)\r($5)\r($7)\r($9) or similar.

Adding a condition to a regex

Given the Javascript below how can I add a condition to the clause? I would like to add a "space" character after a separator only if a space does not already exist. The current code will result in double-spaces if a space character already exists in spacedText.
var separators = ['.', ',', '?', '!'];
for (var i = 0; i < separators.length; i++) {
var rg = new RegExp("\\" + separators[i], "g");
spacedText = spacedText.replace(rg, separators[i] + " ");
}
'. , ? ! .,?!foo'.replace(/([.,?!])(?! )/g, '$1 ');
//-> ". , ? ! . , ? ! foo"
This means: replace every occurrence of one of .,?! that is not followed by a space with itself plus a space.
I would suggest the following regexp to solve your problem:
"Test!Test! Test.Test 1,2,3,4 test".replace(/([!,.?])(?!\s)/g, "$1 ");
// "Test! Test! Test. Test 1, 2, 3, 4 test"
The regexp matches any character in the character class [!,.?] that is not followed by a whitespace character, (?!\s). The parentheses around the character class capture the matched separator into the first group, which the replacement string references as $1.
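One edge case worth noting: (?!\s) also succeeds at the very end of the string, so a final period gets a trailing space. If that is unwanted, a possible variant (not part of the answer above) adds $ to the lookahead. The name addSpaceAfter is hypothetical.

```javascript
// Skip separators that are followed by whitespace OR that end the string.
const addSpaceAfter = (s) => s.replace(/([!,.?])(?!\s|$)/g, '$1 ');

// JSON.stringify makes the trailing space visible in the output.
console.log(JSON.stringify('Done.'.replace(/([!,.?])(?!\s)/g, '$1 ')));
// "Done. "
console.log(JSON.stringify(addSpaceAfter('Test!Test! Test.Test 1,2 end.')));
// "Test! Test! Test. Test 1, 2 end."
```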
You could replace all of the above characters together with any spaces. That way you capture any punctuation and its trailing space and replace both with a single space.
"H1a!. A ?. ".replace(/[.,?! ]+/g, " ")
[.,?! ] is a character class. It matches ., ,, ?, ! or a space, and + makes it match at least once (and more times if possible).
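spacedText = spacedText.replace(/([\.,!\?])([^\s])/g,"$1 $2")
This means: replace one of these characters ([\.,!\?]) followed by a non-whitespace character ([^\s]) with the first group, a space, and the second group ("$1 $2"). The second group has to be put back, otherwise the character after the punctuation is dropped.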
Here is some working code:
var nonSpaced= 'Hello World!Which is your favorite number? 10,20,25,30 or other.answer fast.';
var spaced;
var patt = /\b([!\.,\?])+\b/g;
spaced = nonSpaced.replace(patt, '$1 ');
If you console.log the value of spaced, it will be: Hello World! Which is your favorite number? 10, 20, 25, 30 or other. answer fast. Note that there is only one space after the ? sign and no extra space after the final full stop.
