I want to replace "\n\t" with "xxx" in a text file that contains:
"数字多功能光盘 DVD shùzì"
I do this: str.replace("\n\t","xxx")
The method matches the parts I need, but it leaves the \n in place and only replaces the \t with "xxx". Why?
And why does the same search work like a charm with Ctrl+F in VSCode, but not in code?
First of all, str.replace("a","b") only replaces the first occurrence in JavaScript. To replace all of them, you need to use a regex with g modifier. So, you could try str.replace(/\n\t/g,"xxx") first.
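The difference between a string pattern and a global regex can be checked with a minimal sketch like this one. The sample string and variable names are made up for illustration, and replaceAll assumes an ES2021 runtime:

```javascript
// Sample string is an assumption: two "line feed + tab" sequences.
const tabbed = "a\n\tb\n\tc";

// String pattern: only the first occurrence is replaced.
const firstOnly = tabbed.replace("\n\t", "#");
console.log(JSON.stringify(firstOnly)); // "a#b\n\tc"

// Regex with the g flag: every occurrence is replaced.
const allByRegex = tabbed.replace(/\n\t/g, "#");
console.log(allByRegex); // a#b#c

// replaceAll (ES2021) does the same with a plain string pattern.
const allByString = tabbed.replaceAll("\n\t", "#");
console.log(allByString); // a#b#c
```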
Next, why does it work in VSCode? In VSCode's regex search, \n matches whichever line break sequence is selected for the file in the bottom right-hand corner of the VSCode window. In this case it behaves like \R in PCRE, Java, Onigmo, etc.
As there can be many line ending sequences, you may consider "converting" the VSCode \n to (?:\r\n|[\r\n\x0B\x0C\x85\u2028\u2029]), which matches any single Unicode line break sequence, and use
s = s.replace(/(?:\r\n|[\r\n\x0B\x0C\x85\u2028\u2029])\t/g, 'xxx')
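To see why the file's line endings matter, here is a small sketch that runs the broader pattern against a CRLF and an LF version of the same text. The helper name replaceBreakTab and the sample strings are assumptions, not part of the question:

```javascript
// The alternation tries \r\n first so a CRLF pair is consumed as one break.
const LINE_BREAK_TAB = /(?:\r\n|[\r\n\x0B\x0C\x85\u2028\u2029])\t/g;

// Hypothetical helper: replaces every "line break + tab" with a marker.
function replaceBreakTab(text, marker) {
  return text.replace(LINE_BREAK_TAB, marker);
}

// Assumed sample input: the same content with Windows and Unix line endings.
const crlfText = "DVD\r\n\tshùzì\r\n\tend";
const lfText = "DVD\n\tshùzì\n\tend";

console.log(replaceBreakTab(crlfText, "xxx")); // DVDxxxshùzìxxxend
console.log(replaceBreakTab(lfText, "xxx"));   // DVDxxxshùzìxxxend

// The plain /\n\t/g pattern leaves the \r of each CRLF pair behind.
const leftover = crlfText.replace(/\n\t/g, "xxx");
console.log(JSON.stringify(leftover)); // "DVD\rxxxshùzì\rxxxend"
```

If the stray \r is what was left in the output, the file most likely uses CRLF line endings.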
Update: my original test involving copy/pasting from a text file into the browser was flawed. I created a new test in JavaScript which verified that carriage return \r is in fact being matched.
The following code logs ['\r', '\r', '\r'] to the console, which verifies that the \r is being matched:
<script>
const CarriageReturn = String.fromCharCode(13); // char code for carriage return is 13
const str = CarriageReturn + CarriageReturn + CarriageReturn;
const matches = str.match(/\r/g);
console.log(matches); // this will output ['\r', '\r', '\r']
</script>
Original Question
The common method suggested by numerous StackOverflow answers and articles across the internet to match a line break in regular expressions is to use the ubiquitous token [\r\n]. It is supposedly there to ensure compatibility with Windows systems, since Windows uses the carriage return \r and line feed \n together to make a new line, as opposed to just the line feed \n on UNIX-based operating systems such as Linux or macOS.
I'm beginning to think JavaScript ignores this distinction and just treats every line break as \n.
Today, I did an experiment where I created a text file with 10 carriage returns, opened up the text file, then copy/pasted the carriage returns into the regular expression tester at https://regex101.com.
When I tested all those carriage returns against the simple regular expression \r, nothing matched. However, using the alternative \n matched all 10 carriage returns.
So my question is, based on my experiment, is it safe to just write \n instead of [\r\n] when matching line breaks in JavaScript?
No, do not replace [\r\n] with \n.
Line ends at http://regex101.com are only \n and that is why you had no match with \r.
In real texts, both carriage return and line feed characters might need matching.
Besides, the dot does not match \r in JavaScript regex.
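The point about the dot can be verified with a few lines. This is only a sketch; the s (dotAll) flag used for comparison assumes an ES2018 runtime:

```javascript
// By default the dot excludes line terminators, including \r and \n.
const dotMatchesCR = /./.test("\r");
const dotMatchesLF = /./.test("\n");
console.log(dotMatchesCR, dotMatchesLF); // false false

// With the s (dotAll) flag the dot matches them as well.
const dotAllMatchesCR = /./s.test("\r");
console.log(dotAllMatchesCR); // true

// Consequence: .+ stops at \r as well as \n, so a CRLF text splits cleanly.
const pieces = "a\r\nb".match(/.+/g);
console.log(pieces); // [ 'a', 'b' ]
```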
The text you pasted reached the regex with its line breaks already converted to \n, which is why \n matched all of them when you tested it. \r\n is the Windows style of representing new lines, while Unix-based systems use \n. If you are not sure which one you have, you can use this regex: /\r?\n/
After doing a different test, it appears JavaScript does make a distinction between \r and \n, but not in all cases. Here are the exceptions:
If you generate a carriage return string in JavaScript using String.fromCharCode(13), and try to match it with pattern \r, the pattern will match successfully.
If you type a line break with your keyboard directly into a <textarea> in your browser, it is interpreted by JavaScript as just \n. There will be no matches for \r.
If you copy/paste text containing carriage returns (\r) from a text file into a <textarea> in your browser, your browser will convert all the sequences of \r\n into just \n. So, it will appear as if JavaScript is ignoring the \rs in your text, but it's only because your browser removed them in the process of pasting it into the <textarea>.
I updated my original question with the test I ran to confirm that the \r token is matched when generated with String.fromCharCode(13).
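Since the line endings you receive depend on where the text came from, one common approach is to normalize them before matching. A minimal sketch, assuming you are happy to convert everything to \n (the function name normalizeLineBreaks and the sample string are made up here):

```javascript
// Hypothetical helper: turns \r\n and lone \r into \n.
function normalizeLineBreaks(text) {
  // \r\n? matches a CRLF pair or a lone carriage return.
  return text.replace(/\r\n?/g, "\n");
}

// Assumed sample with all three line break styles mixed together.
const mixed = "one\r\ntwo\rthree\nfour";
const normalized = normalizeLineBreaks(mixed);

console.log(JSON.stringify(normalized));    // "one\ntwo\nthree\nfour"
console.log(normalized.split("\n").length); // 4
console.log(/\r/.test(normalized));         // false
```

After this step a plain \n is enough in later patterns, whatever the original source was.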
After parsing HTML I get the following object:
I would like to strip all the "↵" except for one. How can I do this? I tried something like this:
weirdString.replace(/(\r\n|\n|\r)/gm, "");
However, this replaces all the "↵", but as I've already mentioned, I want to replace all of them except the first...
You may capture it and restore with a backreference:
weirdString.replace(/^([^\S\r\n]*(?:\r\n?|\n))|(?:\r\n?|\n)/g, "$1");
There is no need for the m modifier here.
Details:
^ - start of a string
([^\S\r\n]*(?:\r\n?|\n)) - Capturing group 1:
[^\S\r\n]* - zero or more whitespace characters other than CR and LF
(?:\r\n?|\n) - any style line break
| - or
(?:\r\n?|\n) - any style line break.
With $1, only the contents captured into Group 1 are put back in the replacement result.
var weirdString = " \r\n\r\n\n\rSome text";
console.log(weirdString.replace(/^([^\S\r\n]*(?:\r\n?|\n))|(?:\r\n?|\n)/g, "$1"));
A little bit tricky, but why don't you first replace your first carriage return with something else, e.g. %#% or some other token that you are not using in your text? Then remove all the other carriage returns, and finally turn your %#% tag back into a carriage return...
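The placeholder idea could look roughly like this. It is a sketch under two assumptions: the placeholder really does not occur in the text, and it is acceptable to restore the first break as a plain \n. The function name keepFirstBreak is hypothetical:

```javascript
// Hypothetical helper implementing the three-step placeholder approach.
function keepFirstBreak(text, placeholder = "%#%") {
  if (text.includes(placeholder)) {
    throw new Error("placeholder already occurs in the text");
  }
  // Step 1: mark the first line break (no g flag, so only the first one).
  const marked = text.replace(/\r\n?|\n/, placeholder);
  // Step 2: remove every remaining line break.
  const stripped = marked.replace(/\r\n?|\n/g, "");
  // Step 3: turn the placeholder back into a line break.
  return stripped.replace(placeholder, "\n");
}

const kept = keepFirstBreak(" \r\n\r\n\n\rSome text");
console.log(JSON.stringify(kept)); // " \nSome text"
```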
An exactly matching regexp must cope with some things you have not accounted for:
First, there can be whitespace between two such line ends, and this intervening whitespace should be treated as part of the run.
Second, the \r in front of \n should be considered optional, as it appears in texts that come from socket connections over the internet (most protocols require \r\n, but the \r can be optional).
Third, a sequence of two or more newlines of this type should be collapsed to one \n (or one \r\n, as you prefer).
If you do a pattern match and substitute with the global flag enabled, so that every run is replaced, you'll get the desired effect with this pattern:
([ \t]*\r*\n)+
as seen in the following demo. I have substituted the newlines with [<--']\r\n to make the effect visible. The pattern also deletes all trailing whitespace at line ends (normally invisible) but doesn't touch leading whitespace at the beginning of lines (removing that could change how your text looks).
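In JavaScript the same pattern could be tried like this. It is a minimal sketch: the helper name collapseBreaks, the default replacement "\n", and the sample text are assumptions made for illustration:

```javascript
// The pattern from the answer, with the g flag so every run is replaced.
const BREAK_RUN = /([ \t]*\r*\n)+/g;

// Hypothetical helper: collapses each run of line ends into one line end.
function collapseBreaks(text, lineEnd = "\n") {
  return text.replace(BREAK_RUN, lineEnd);
}

// Assumed sample: trailing spaces, CRLF, a whitespace-only line, leading spaces.
const messy = "first  \r\n\r\n   \nsecond\n\n  third";
const collapsed = collapseBreaks(messy);

// Trailing whitespace is gone, leading whitespace before "third" is kept.
console.log(JSON.stringify(collapsed)); // "first\nsecond\n  third"
```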
I have the regular expression below. It should allow letters, digits, round brackets, square brackets, backslash and the following punctuation marks: period, comma, semicolon, colon, exclamation mark, percent sign and dash.
^[(a-z)(A-Z) .,;:!'%\-(0-9)(\\)\(\)[\]\s]+$
Question: I have tried this regular expression with some text at this online tester: https://regex101.com/r/kO5tW2/2, but it always comes up with no matches. What is causing the expression to fail in the above case? To me, the string being tested should come back as valid, but it doesn't.
Your spec does not mention a question mark. However, the test text you give does include a question mark. You could have tested this easily enough by removing one character at a time from the test text until you got a match, which would have happened when you removed the question mark.
Either add the question mark to the regexp, or remove it from your test text.
Also, you do not need to (and should not) enclose ranges in parentheses.
In the version below, I've also removed the escaping for characters which do not need to be escaped, and added the question mark just before the closing bracket:
^[a-zA-Z .,;:!'%\-0-9\\()[\]\s?]+$
https://regex101.com/r/kO5tW2/4
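The effect of the missing question mark can be reproduced in a few lines. The test strings here are invented for illustration, since the original test text is only available behind the regex101 link:

```javascript
// Original pattern from the question (no question mark in the class).
const ORIGINAL = /^[(a-z)(A-Z) .,;:!'%\-(0-9)(\\)\(\)[\]\s]+$/;
// Corrected pattern from the answer (ranges unwrapped, "?" added).
const CORRECTED = /^[a-zA-Z .,;:!'%\-0-9\\()[\]\s?]+$/;

// Assumed sample text containing a question mark.
const question = "Is this valid? Yes (100%) [ok]!";

const originalResult = ORIGINAL.test(question);
const correctedResult = CORRECTED.test(question);
console.log(originalResult);  // false
console.log(correctedResult); // true

// Characters outside the class are still rejected.
const rejected = CORRECTED.test("no @ sign allowed");
console.log(rejected); // false
```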
Try adding the m (multiline) modifier to the regex.
If you have a string consisting of multiple lines, like first line\nsecond line (where \n indicates a line break), it is often desirable to work with lines, rather than the entire string. Therefore, all the regex engines discussed in this tutorial have the option to expand the meaning of both anchors. ^ can then match at the start of the string (before the f in the above string), as well as after each line break (between \n and s). Likewise, $ still matches at the end of the string (after the last e), and also before every line break (between e and \n). Source
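In JavaScript, the behaviour described in the quote can be seen with the string from the quote itself. This is a small sketch; the variable names are arbitrary:

```javascript
const twoLines = "first line\nsecond line";

// Without m, ^ only matches at the very start of the string.
const withoutM = twoLines.match(/^\w+/g);
console.log(withoutM); // [ 'first' ]

// With m, ^ also matches right after each line break.
const withM = twoLines.match(/^\w+/gm);
console.log(withM); // [ 'first', 'second' ]

// With m, $ also matches right before each line break.
const lineEnds = twoLines.match(/\w+$/gm);
console.log(lineEnds); // [ 'line', 'line' ]
```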
I want to take the text from my textarea, put it in a variable, and change all the line breaks (\n\r) to "##".
For some reason it won't work.
Help please, here's a fiddle: http://jsfiddle.net/HK82q/
$("#go").click(function(){
curtext = $("textarea").val();
curtext = curtext.replace("\n\r\n","##");
alert(curtext);
});
"Line break" can mean one of three things:
\r (carriage return), used by old Mac computers
\n (line feed), used by Linux, Unix, and modern macOS
\r\n (CRLF), used by Windows
Therefore, you need to handle all three cases. This can be done with multiple .replace calls, or a regex:
curtext = curtext.replace(/(?=[\r\n])\r?\n?/g,"##");
This regex works by first asserting that there is either a CR or LF ahead, then matching them optionally to allow for all three options. The assertion ensures that "nothing" doesn't match.
You're expecting to find a \n followed by \r, which is wrong. You should expect either one or the other. Also, the pattern passed to replace should be enclosed in / rather than in ": it should be a regex, not a string. Finally, add the g modifier (it stands for global and replaces all occurrences, not only the first one).
curtext = curtext.replace(/[\n\r]/g,"##");
Here's the updated fiddle: http://jsfiddle.net/HK82q/1/
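One difference between the two suggested patterns shows up on text with Windows line endings, where the character class treats \r and \n as two separate breaks. A minimal sketch with an assumed sample string (a textarea value normally contains only \n, so this matters mostly for text from other sources):

```javascript
// Assumed sample with a single Windows-style (CRLF) line break.
const windowsText = "a\r\nb";

// Character class: \r and \n are replaced separately.
const byClass = windowsText.replace(/[\n\r]/g, "##");
console.log(byClass); // a####b

// Lookahead version: the CRLF pair is replaced as one break.
const byLookahead = windowsText.replace(/(?=[\r\n])\r?\n?/g, "##");
console.log(byLookahead); // a##b

// An explicit alternation gives the same result as the lookahead version.
const byAlternation = windowsText.replace(/\r\n|\r|\n/g, "##");
console.log(byAlternation); // a##b
```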
note: I'm -not- trying to parse HTML with regex
I'm trying to replace any content wrapped in $ signs ($for example$) in a string. I've managed to come up with str.replace(/\$([^\$]*)\$/g, "hello $1!"), but I'm having issues with making sure I don't replace such strings when they are wrapped in HTML tags.
Example string: $someone$, <a>$welcome$</a>, and $another$
Expression: /[^>]\$([^\$]*)\$[^<]/g
Expected output: hello someone!, <a>$welcome$</a>, and hello another!
Actual output: $someonhello , <a>!elcomhello </a>, and !nother$
Test code: alert("$someone$, <a>$welcome$</a>, and $another$".replace(/[^>]\$([^\$]*)\$[^<]/g, "hello $1!"));
fiddle:
http://jsfiddle.net/WMWHZ/
Thanks!
Keep in mind that you have 6 '$' in your test case. The problem here is that when you try to check if the previous character isn't a '>', the regexp moves forward and matches what's between the 4th and the 5th dollar symbol, capturing "</a>, and " and making a mess.
Try this one:
$('div').text(test.replace(/(^|[^>])\$([^<][^\$]*)\$(?!<)/g, "$1hello $2!"))
Older JavaScript engines don't support lookbehinds in regular expressions (they were only added in ES2018), but lookaheads (the (?!<) part) have always been supported. To emulate a lookbehind, you correctly tried to put [^>] before the dollar, but then that character becomes part of the match, so you have to capture it and put it back into the string.
You just have to refine it a little, because if the '$' is at the beginning of the string there is no preceding character for [^>] to match, which is why the group is written as (^|[^>]).
Also, to avoid problems like the one above, you should check that there isn't a '<' after the first dollar, so I put a [^<] at the beginning of the capturing group. This also means that it won't match empty strings between dollar symbols (as in '$$'); they must contain at least one character.
This way, you have the expected result.
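Without jQuery, the replacement can be checked directly against the example string from the question. This is a sketch: the function names are made up, and the lookbehind variant is an assumption that only works in engines with ES2018 lookbehind support:

```javascript
// Pattern from the answer: the character before "$" is captured and restored.
function greetPlaceholders(text) {
  return text.replace(/(^|[^>])\$([^<][^\$]*)\$(?!<)/g, "$1hello $2!");
}

// Assumed ES2018 variant: a real lookbehind, so nothing has to be restored.
function greetWithLookbehind(text) {
  return text.replace(/(?<!>)\$([^<$][^$]*)\$(?!<)/g, "hello $1!");
}

// Example string taken from the question.
const greeting = "$someone$, <a>$welcome$</a>, and $another$";

const emulated = greetPlaceholders(greeting);
const native = greetWithLookbehind(greeting);

console.log(emulated); // hello someone!, <a>$welcome$</a>, and hello another!
console.log(native);   // hello someone!, <a>$welcome$</a>, and hello another!
```

The two versions agree on this input, but they have not been compared on edge cases such as consecutive dollar signs.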