Single regex to remove empty lines and double spaces from multiline input - javascript

I would like to combine two regex replacements to clean up some textarea input. I wonder whether it is even possible, or whether I should keep them as two separate ones (which work fine but don't look as pretty or clean).
I have adjusted both so that they use the global and multiline flags (/gm) and replace matches with nothing (''). I tried combining them with brackets and vertical bars (|) in every position, but it never gives the expected result, so I can only assume there is a way that I have overlooked or that I should keep it as is.
Regex 1: /^\s+[\r\n]/gm
Regex 2: /^\s+| +(?= )|\s+$/gm
Currently in JavaScript: string.replace(/^\s+[\r\n]/gm,'').replace(/^\s+| +(?= )|\s+$/gm,'')
The goal is to remove:
Empty spaces in the beginning and end of each line
Empty lines (including any in the very beginning and end)
Double spaces
Without it ending up on one and the same line. The single line breaks (\r\n) should still be there in the end.
Regex 1 removes any empty line (^\s+[\r\n]). Regex 2 trims whitespace at the beginning (^\s+) and end (\s+$) of each line, and removes double (and triple, quadruple, etc.) spaces in between ( +(?= )).
Input:
Let's
make this
look
a little
nicer
and
more
readible
Output:
Let's
make this
look
a little
nicer
and
more
readible
Edit: Many thanks to Wiktor Stribiżew and his comment for this complete solution:
/^\s*$[\r\n]*|^[^\S\r\n]+|[^\S\r\n]+$|([^\S\r\n]){2,}|\s+$(?![^])/gm
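For anyone who wants to try the combined pattern, here is a minimal sketch. The question does not show the replacement string that goes with it, so '$1' is an assumption here (it puts a single space back where a run of spaces was matched). cleanText and the sample string are made-up names for illustration, and the sample only uses \n line breaks.

```javascript
// Combined pattern quoted in the edit above.
const cleanup = /^\s*$[\r\n]*|^[^\S\r\n]+|[^\S\r\n]+$|([^\S\r\n]){2,}|\s+$(?![^])/gm;

// Assumption: '$1' as replacement. Group 1 only participates when a run of
// 2+ horizontal spaces matched, so every other match is replaced by ''.
function cleanText(input) {
  return input.replace(cleanup, '$1');
}

// Hypothetical sample: leading/trailing blank lines, padding, triple space.
const sample = "\n  Let's  \n\n  make   this \n\n";
console.log(JSON.stringify(cleanText(sample))); // "Let's\nmake this"
```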

I'd suggest the following expression with a substitution template "$1$2" (demo):
/^\s*|\s*$|\s*(\r?\n)\s*|(\s)\s+/g
Explanation:
^\s* - matches whitespace at the beginning of the text
\s*$ - matches whitespace at the end of the text
\s*(\r?\n)\s* - matches whitespace (including blank lines) between text on different lines and captures one line break (LF or CRLF) in group $1
(\s)\s+ - captures the first whitespace char in a sequence of 2+ whitespace chars in group $2
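A quick way to check this expression is the sketch below. The tidy helper and the sample string are hypothetical, and there is deliberately no m flag, so ^ and $ refer to the start and end of the whole text.

```javascript
// Pattern and "$1$2" template from the answer above.
const pattern = /^\s*|\s*$|\s*(\r?\n)\s*|(\s)\s+/g;

function tidy(input) {
  // $1 restores one line break, $2 restores one whitespace char;
  // whichever group did not participate contributes an empty string.
  return input.replace(pattern, '$1$2');
}

const sample = "  Let's  \n\n  make   this \n";
console.log(JSON.stringify(tidy(sample))); // "Let's\nmake this"
```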


The second pattern of a regex not replacing apostrophe

I'm creating a regex that matches straight apostrophes and replaces them with curly ones. Sometimes an apostrophe sits between two characters. Other times it goes at the end of a character/word (e.g. ellipsis').
So I have two patterns that handle both situations (separated by an alternation, |).
However, only the first case is being replaced, not the second. In other words, this:
"Wor'd word'".replace(/(?<=\w)\'(?=\w)|(?<=\w)\'(?=\s)/, '’')
Becomes this:
"Wor’d word'"
This confuses me because both types of apostrophes are matching: https://regexr.com/4td7p
Why is this, and how to fix it?
Update: I figured the problem was that there's no space after the last apostrophe, so I changed the second part of the regex to this: (?<=\w)\'(?!\w) (don't match if there's a character after the apostrophe). But I'm getting the same result.
If you want to match (?<=\w)\' followed by a character and also match (?<=\w)\' not followed by a character, why not just drop the logic after it altogether and just use (?<=\w)'? (no need to escape 's in a regex)
You also need the global flag to replace more than one thing at a time:
console.log(
"Wor'd word'".replace(/(?<=\w)'/g, '’')
);
Update: a word boundary before the apostrophe (\b') does the same job without a lookbehind:
var str = "Wor'd word' that's a good thing'";
var afterReplace = str.replace(/\b'/g, '’');
console.log(afterReplace);
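To see the two separate issues side by side, here is a small sketch using the string from the question (the variable names are mine). The missing g flag explains why only one apostrophe is replaced, and the (?=\s) lookahead explains why the last one is skipped even with g, since nothing follows it at the end of the string.

```javascript
const text = "Wor'd word'";

// Original pattern, no g flag: only the first match is replaced.
const firstOnly = text.replace(/(?<=\w)'(?=\w)|(?<=\w)'(?=\s)/, '’');

// Original pattern with g: the trailing apostrophe still fails both
// lookaheads, because neither a word char nor whitespace follows it.
const stillMissing = text.replace(/(?<=\w)'(?=\w)|(?<=\w)'(?=\s)/g, '’');

// Simplified pattern with g: every apostrophe after a word char is replaced.
const allReplaced = text.replace(/(?<=\w)'/g, '’');

console.log(firstOnly);    // Wor’d word'
console.log(stillMissing); // Wor’d word'
console.log(allReplaced);  // Wor’d word’
```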

Replacing carriage returns (?) after compiling HTML

After parsing HTML I get the following object:
I would like to strip all the "↵" except one. How can I do this? I tried something like this:
weirdString.replace(/(\r\n|\n|\r)/gm, "");
However, this replaces all the "↵", but as I've already mentioned, I want to replace all of them except the first.
You may capture it and restore with a backreference:
weirdString.replace(/^([^\S\r\n]*(?:\r\n?|\n))|(?:\r\n?|\n)/g, "$1");
No need to use the m modifier here.
Details:
^ - start of a string
([^\S\r\n]*(?:\r\n?|\n)) - Capturing group 1:
[^\S\r\n]* - any 0+ whitespaces other than CR and LF
(?:\r\n?|\n) - any style line break
| - or
(?:\r\n?|\n) - any style line break.
With $1, only the contents captured into Group 1 are put back in the replacement result.
var weirdString = " \r\n\r\n\n\rSome text";
console.log(weirdString.replace(/^([^\S\r\n]*(?:\r\n?|\n))|(?:\r\n?|\n)/g, "$1"));
It's a little bit tricky, but why don't you first replace your first carriage return with something else, e.g. %#% or anything else that you are not using in your text? Then remove all the other carriage returns, and finally turn your %#% tag back into a carriage return.
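A rough sketch of that placeholder idea follows. keepFirstLineBreak and PLACEHOLDER are hypothetical names, the code assumes %#% never occurs in the real text, and it restores the first line break as \n whatever style it had originally.

```javascript
// Assumption: this marker never appears in the input text.
const PLACEHOLDER = '%#%';

function keepFirstLineBreak(input) {
  return input
    .replace(/\r\n|\n|\r/, PLACEHOLDER) // no g flag: only the first line break
    .replace(/\r\n|\n|\r/g, '')         // remove every remaining line break
    .replace(PLACEHOLDER, '\n');        // put a single line break back
}

const weirdString = " \r\n\r\n\n\rSome text";
console.log(JSON.stringify(keepFirstLineBreak(weirdString))); // " \nSome text"
```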
An exactly matching regexp must cope with some things you have not accounted for:
First, there can be whitespace between two such line ends, and it should be treated as part of the run.
Second, the \r in front of \n should be considered optional, as it appears in text that comes from socket connections over the internet (most protocols require \r\n, but it can be optional).
Third, a sequence of two or more newlines of this type should be collapsed to one \n (or one \r\n, as you prefer).
If you do a pattern match and substitute with the global flag enabled, you'll get the desired effect with this pattern:
([ \t]*\r*\n)+
as seen in the following demo. I have substituted the newlines with [<--']\r\n to make the effect visible. It also deletes all trailing whitespace at line ends (normally invisible) but doesn't touch leading whitespace at the beginning of lines (removing that could affect the visible layout of your text).
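In JavaScript the pattern above could be applied roughly like this. collapseLineBreaks and the sample string are made up for illustration, and the g flag plus the "\n" replacement are my assumptions about how the answer intends it to be used.

```javascript
// Runs of line ends (with optional trailing spaces/tabs and optional \r)
// are collapsed into a single \n.
const lineEndRuns = /([ \t]*\r*\n)+/g;

function collapseLineBreaks(input) {
  return input.replace(lineEndRuns, '\n');
}

const sample = "one  \r\n \n\n  two\nthree";
// Trailing whitespace after "one" is dropped, the indent before "two" is kept.
console.log(JSON.stringify(collapseLineBreaks(sample))); // "one\n  two\nthree"
```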

Replace comma as separator

I am trying to build a Regex to replace commas used as a separator in between normal text.
Cases where a comma counts as a separator and should be replaced:
Space before comma
Comma is between text and/or numbers, without any space
Several commas after each other
Example:
"This is a text separated with comma, that I try to fix. , It can be split in several ways.,1234321 , I try to make all the examples in one string,,4321,"
Results:
This is a text separated with comma, that I try to fix.
It can be split in several ways.
1234321
I try to make all the examples in one string
4321
This is the code I have so far using Node.js / Javascript:
data.replace(/(\S,\S)|( ,)|(,,)|(,([a-z0-9]))/ig,';')
The answer from #torazaburo works best, except for several commas with spaces in between (, , , ,)
console.log(str.split(/ +, *|,(?=\w|,|$)/));
var str = "This is a text separated with comma, that I try to fix. , It can be split in several ways.,1234321 , I try to make all the examples in one string,,4321,";
console.log(str.split(/ +, *|,(?=\w|,|$)/));
This will split on any comma preceded by one or more spaces, no matter what follows (and eat the preceding spaces, and following spaces if any); or, any comma followed by an alphanumeric or comma or end-of-string.
There is no easy way with the regexp to get rid of the final empty string in the result, caused by the comma at the very end of the input. You can get rid of that yourself if you don't want it.
To rejoin with semi-colon, add .join(';').
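Putting those last two remarks together, one possible way (my own addition, not part of the regexp itself) is to filter out empty strings after the split and then join. The parts and joined names are just for this sketch.

```javascript
const str = "This is a text separated with comma, that I try to fix. , It can be split in several ways.,1234321 , I try to make all the examples in one string,,4321,";

// Split with the pattern from above, then drop any empty strings
// (such as the one caused by the comma at the very end of the input).
const parts = str.split(/ +, *|,(?=\w|,|$)/).filter(part => part !== '');

// Rejoin with semicolons.
const joined = parts.join(';');

console.log(parts.length); // 5
console.log(joined);
```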
data.replace(/\s*,+\s*/g, ';');
This will yield:
This is a text separated with comma;that I try to fix.;It can be split in several ways.;1234321;I try to make all the examples in one string;4321;
There are three parts to this:
\s*: Match zero or more whitespace characters.
,+: Match one or more commas.
\s*: Match zero or more whitespace characters.
If, instead, you want to replace any number of consecutive commas with a single semi-colon:
data.replace(/,+/g, ';');
Honestly, I'm not sure I understood your requirements. If I did misunderstand, please provide the output string you're expecting.

matching multiple lines beginning with whitespaces

I have a simple regex syntax to match lines that begin with exactly 4 spaces.
/^(\s{4}).*/g
The problem is that the . token matches everything except a new line, so when multiple lines begin with 4 spaces, only the first line is matched. I've tried explicitly matching \n tokens, but I haven't been able to get the results I need. I've been testing this on regexr.com, and I can't use any syntax that isn't supported by JavaScript.
The ^ symbol can denote 2 things: a beginning of string, or a beginning of a line. To make it denote the latter, you need to specify the /m MULTILINE modifier:
/^(\s{4}).*/gm
Or, to match only literal regular spaces (note that \s can also match newlines):
/^( {4}).*/gm
See regex demo
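A small sketch of the difference the m flag makes, using a made-up four-line string:

```javascript
// Hypothetical input: lines 1 and 3 start with four spaces.
const text = "    first\nnot indented\n    second\n  two spaces";

// Without m, ^ only matches at the very start of the string.
const withoutM = text.match(/^( {4}).*/g);

// With m, ^ also matches right after every line break.
const withM = text.match(/^( {4}).*/gm);

console.log(withoutM); // [ '    first' ]
console.log(withM);    // [ '    first', '    second' ]
```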

Cannot get a regex to work in JavaScript that allows whitespace and backslash

I have a regular expression as below. It should allow letters, digits, round brackets, square brackets, backslash and the following punctuation marks: period, comma, semicolon, colon, exclamation mark, percent sign and dash.
^[(a-z)(A-Z) .,;:!'%\-(0-9)(\\)\(\)[\]\s]+$
Question : I have tried this regular expression with some text at this online tester: https://regex101.com/r/kO5tW2/2, but it always comes up with no matches. What is causing the expression to fail in above case? To me, the string being tested should come back as valid, but it's not.
Your spec does not mention a question mark. However, the test text you give does include a question mark. You could have tested this easily enough by removing one character at a time from the test text until you got a match, which would have happened when you removed the question mark.
Either add the question mark to the regexp, or remove it from your test text.
Also, you do not need to (and should not) enclose ranges in parentheses.
In the below, I've also removed escaping for characters which do not need to be escaped:
^[a-zA-Z .,;:!'%\-0-9\\()[\]\s?]+$
Note the added ? just before the closing bracket.
https://regex101.com/r/kO5tW2/4
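The effect can be checked with a quick sketch like the one below. The test strings are made up (the actual text from the regex101 link is not reproduced here); the point is only that the original class rejects a question mark while the corrected one accepts it.

```javascript
// Original expression from the question and the corrected one from above.
const original = /^[(a-z)(A-Z) .,;:!'%\-(0-9)(\\)\(\)[\]\s]+$/;
const corrected = /^[a-zA-Z .,;:!'%\-0-9\\()[\]\s?]+$/;

// Hypothetical test strings, one with and one without a question mark.
const withQuestionMark = "Is this (valid) text? Yes, 100%!";
const withoutQuestionMark = "Valid text, 100%!";

console.log(original.test(withQuestionMark));    // false
console.log(original.test(withoutQuestionMark)); // true
console.log(corrected.test(withQuestionMark));   // true
```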
Try adding the m (multiline) modifier to the regex.
If you have a string consisting of multiple lines, like first line\nsecond line (where \n indicates a line break), it is often desirable to work with lines, rather than the entire string. Therefore, all the regex engines discussed in this tutorial have the option to expand the meaning of both anchors. ^ can then match at the start of the string (before the f in the above string), as well as after each line break (between \n and s). Likewise, $ still matches at the end of the string (after the last e), and also before every line break (between e and \n). Source
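The quoted behaviour can be reproduced with the same example string. This is only an illustrative sketch, and the variable names are mine.

```javascript
const text = "first line\nsecond line";

// ^ without m: only the start of the whole string.
const startNoM = text.match(/^\w+/g);
// ^ with m: start of the string and after each line break.
const startM = text.match(/^\w+/gm);

// $ without m: only the end of the whole string.
const endNoM = text.match(/\w+$/g);
// $ with m: end of the string and before each line break.
const endM = text.match(/\w+$/gm);

console.log(startNoM, startM); // [ 'first' ] [ 'first', 'second' ]
console.log(endNoM, endM);     // [ 'line' ] [ 'line', 'line' ]
```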
