Ignoring whitespace in Javascript regular expression patterns? - javascript

From my research it looks like Javascript's regular expressions don't have any built-in equivalent to Perl's /x modifier, or .NET's RegexOptions.IgnorePatternWhitespace modifier. These are very useful as they can make a complex regex much easier to read. Firstly, have I missed something and is there a Javascript built-in equivalent to these? Secondly, if there isn't, does anyone know of a good jQuery plugin that will implement this functionality? It's a shame to have to compress my complex regex into one line because of Javascript's apparent regex limitations.

If I understand you correctly you want to add white space that isn't part of the regexp?
As far as I know it isn't possible with literal regexp.
Example:
var a = /^[\d]+$/
You can break up the regexp in several lines like this:
var a = RegExp(
"^" +
"[\\d]+" + // This is a comment
"$"
);
Notice that since it is now a normal string, you have to escape \ with \\
Or if you have a complex one:
var digit_8 = "[0-9]{8}";
var alpha_4 = "[A-Za-z]{4}";
var a = RegExp(
digit_8 +
alpha_4 + // Optional comment
digit_8
);
Update: Using a temporary array to concatenate the regular expression:
var digit_8 = "[0-9]{8}";
var alpha_4 = "[A-Za-z]{4}";
var a = RegExp([
digit_8,
alpha_4, // Optional comment
digit_8,
"[0-9A-F]" // Another comment to a string
].join(""));

Unfortunately there isn't such option in ES5, and I suspect it's unlikely to ever be in RegExp literals, because they're already very hard to parse and line breaks would make them even more ambiguous.
If you want easy escaping and syntax highlighting of RegExp literals, you can join them by taking advantage of the source property. It's not perfect, but IMHO less bad than falling all the way back to strings:
new RegExp(
/foo/.source +
/[\d+]/.source +
/bar/.source
);
In ES6 you can create your own template string:
regexp`
foo
[\d+]
bar
`;
function regexp(parts) {
// I'm ignoring support for ${} placeholders for brevity,
// but full implementation should escape them.
// Note that `.raw` allows use of \d instead of \\d.
return new RegExp(parts.raw.join('').replace(/\s+/g,''));
}

Related

Using search and replace with regex in javascript

I have a regular expression that I have been using in notepad++ for search&replace to manipulate some text, and I want to incorporate it into my javascript code. This is the regular expression:
Search
(?-s)(.{150,250}\.(\[\d+\])*)\h+ and replace with \1\r\n\x20\x20\x20
In essence creating new paragraphs for every 150-250 words and indenting them.
This is what I have tried in JavaScript. For a text area <textarea name="textarea1" id="textarea1"></textarea>in the HTML. I have the following JavaScript:
function rep1() {
var re1 = new RegExp('(?-s)(.{150,250}\.(\[\d+\])*)\h+');
var re2 = new RegExp('\1\r\n\x20\x20\x20');
var s = document.getElementById("textarea1").value;
s = string.replace(re1, re2);
document.getElementById("textarea1").value = s;
}
I have also tried placing the regular expressions directly as arguments for string.replace() but that doesn't work either. Any ideas what I'm doing wrong?
Several issues:
JavaScript does not support (?-s). You would need to add modifiers separately. However, this is the default setting in JavaScript, so you can just leave it out. If it was your intention to let . also match line breaks, then use [^] instead of . in JavaScript regexes.
JavaScript does not support \h -- the horizontal white space. Instead you could use [^\S\r\n].
When passing a string literal to new RegExp be aware that backslashes are escape characters for the string literal notation, so they will not end up in the regex. So either double them, or else use JavaScript's regex literal notation
In JavaScript replace will only replace the first occurrence unless you provided the g modifier to the regular expression.
The replacement (second argument to replace) should not be a regex. It should be a string, so don't apply new RegExp to it.
The backreferences in the replacement string should be of the $1 format. JavaScript does not support \1 there.
You reference string where you really want to reference s.
This should work:
function rep1() {
var re1 = /(.{150,250}\.(\[\d+\])*)[^\S\r\n]+/g;
var re2 = '$1\r\n\x20\x20\x20';
var s = document.getElementById("textarea1").value;
s = s.replace(re1, re2);
document.getElementById("textarea1").value = s;
}

javascript, declare associative array of regex expressions

How to declare associative array of regex?
This is not working
var Validators = {
url : /^http(s?)://((\w+\.)?\w+\.\w+|((2[0-5]{2}|1[0-9]{2}|[0-9]{1,2})\.){3}(2[0-5]{2}|1[0-9]{2}|[0-9]{1,2}))(/)?$/gm
};
EDITED: Now working!
This will be valid in JS (like # operator in C#)
url : `/^http(s?)://((\w+\.)?\w+\.\w+|((2[0-5]{2}|1[0-9]{2}|[0-9]{1,2})\.){3}(2[0-5]{2}|1[0-9]{2}|[0-9]{1,2}))(/)?$/gm`
However, will still not work due to double escape, one in JS and other in Regex. If expression is small, perhaps naked eye can manually escape for both JS and Regex. My brain just can't :)
In order to use strings as tested on regex101.com for example, all required strings should be declared as 'row' like this:
var exp = String.raw`^(http(s?):\/\/)?(((www\.)?[a-zA-Z0-9\.\-\_]+(\.[a-zA-Z]{2,3})+)|(\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b))(\/[a-zA-Z0-9\_\-\s\.\/\?\%\#\&\=]*)?$`;
var strings = [
String.raw`http://www.goo gle.com`,
String.raw`http://www.google.com`,
];
Wrap it with new RegExp() and escape slashes
var Validators = {
url : new RegExp( /^http(s?):\/\/((\w+\.)?\w+\.\w+|((2[0-5]{2}|1[0-9]{2}|[0-9]{1,2})\.){3}(2[0-5]{2}|1[0-9]{2}|[0-9]{1,2}))(\/)?$/gm )
};
Your regex has forward slashes in it. This symbol needs to be escaped because it is supposed to indicate the start and end of the expression. Try \/.

What are the actual uses of ES6 Raw String Access?

What are the actual uses of String.raw Raw String Access introduced in ECMAScript 6?
// String.raw(callSite, ...substitutions)
function quux (strings, ...values) {
strings[0] === "foo\n"
strings[1] === "bar"
strings.raw[0] === "foo\\n"
strings.raw[1] === "bar"
values[0] === 42
}
quux `foo\n${ 42 }bar`
String.raw `foo\n${ 42 }bar` === "foo\\n42bar"
I went through the below docs.
http://es6-features.org/#RawStringAccess
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/template_strings
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/raw
http://www.2ality.com/2015/01/es6-strings.html
https://msdn.microsoft.com/en-us/library/dn889830(v=vs.94).aspx
The only the thing that I understand, is that it is used to get the raw string form of template strings and used for debugging the template string.
When this can be used in real time development? They were calling this a tag function. What does that mean?
What concrete use cases am I missing?
The best, and very nearly only, use case for String.raw I can think of is if you're trying to use something like Steven Levithan's XRegExp library that accepts text with significant backslashes. Using String.raw lets you write something semantically clear rather than having to think in terms of doubling your backslashes, just like you can in a regular expression literal in JavaScript itself.
For instance, suppose I'm doing maintenance on a site and I find this:
var isSingleUnicodeWord = /^\w+$/;
...which is meant to check if a string contains only "letters." Two problems: A) There are thousands of "word" characters across the realm of human language that \w doesn't recognize, because its definition is English-centric; and B) It includes _, which many (including the Unicode consortium) would argue is not a "letter."
So if we're using XRegExp on the site, since I know it supports \pL (\p for Unicode categories, and L for "letter"), I might quickly swap this in:
var isSingleUnicodeWord = XRegExp("^\pL+$"); // WRONG
Then I wonder why it didn't work, facepalm, and go back and escape that backslash, since it's being consumed by the string literal.
Easy enough in that simple regex, but in something complicated, remembering to double all those backslashes is a maintenance pain. (Just ask Java programmers trying to use Pattern.)
Enter String.raw:
let isSingleUnicodeWord = XRegExp(String.raw`^\pL+$`);
Example:
let isSingleUnicodeWord = XRegExp(String.raw`^\pL+$`); // L: Letter
console.log(isSingleUnicodeWord.test("Русский")); // true
console.log(isSingleUnicodeWord.test("日本語")); // true
console.log(isSingleUnicodeWord.test("العربية")); // true
console.log(isSingleUnicodeWord.test("foo bar")); // false
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.1.1/xregexp-all.min.js"></script>
Now I just kick back and write what I mean. I don't even really have to worry about ${...} constructs used in template literals to do substitution, because the odds of my wanting to apply a quantifier {...} to the end-of-line assertion ($) are...low. So I can happily use substitutions and still not worry about backslashes. Lovely.
Having said that, though, if I were doing it a lot, I'd probably want to write a function and use a tagged template instead of String.raw itself. But it's surprisingly awkward to do correctly:
// My one-time tag function
function xrex(strings, ...values) {
let raw = strings.raw;
let max = Math.max(raw.length, values.length);
let result = "";
for (let i = 0; i < max; ++i) {
if (i < raw.length) {
result += raw[i];
}
if (i < values.length) {
result += values[i];
}
}
console.log("Creating with:", result);
return XRegExp(result);
}
// Using it, with a couple of substitutions to prove to myself they work
let category = "L"; // L: Letter
let maybeEol = "$";
let isSingleUnicodeWord = xrex`^\p${category}+${maybeEol}`;
console.log(isSingleUnicodeWord.test("Русский")); // true
console.log(isSingleUnicodeWord.test("日本語")); // true
console.log(isSingleUnicodeWord.test("العربية")); // true
console.log(isSingleUnicodeWord.test("foo bar")); // false
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.1.1/xregexp-all.min.js"></script>
Maybe the hassle is worth it if you're using it in lots of places, but for a couple of quick ones, String.raw is the simpler option.
First, a few things:
Template strings is old name for template literals.
A tag is a function.
String.raw is a method.
String.raw `foo\n${ 42 }bar\` is a tagged template literal.
Template literals are basically fancy strings.
Template literals can interpolate.
Template literals can be multi-line without using \.
String.raw is required to escape the escape character \.
Try putting a string that contains a new-line character \n through a function that consumes newline character.
console.log("This\nis\nawesome"); // "This\nis\nawesome"
console.log(String.raw`This\nis\nawesome`); // "This\\nis\\nawesome"
If you are wondering, console.log is not one of them. But alert is. Try running these through http://learnharmony.org/ .
alert("This\nis\nawesome");
alert(String.raw`This\nis\nawesome`);
But wait, that's not the use of String.raw.
Possible uses of String.raw method:
To show string without interpretation of backslashed characters (\n, \t) etc.
To show code for the output. (As in example below)
To be used in regex without escaping \.
To print windows director/sub-directory locations without using \\ to much. (They use \ remember. Also, lol)
Here we can show output and code for it in single alert window:
alert("I printed This\nis\nawesome with " + Sring.raw`This\nis\nawesome`);
Though, it would have been great if It's main use could have been to get back the original string. Like:
var original = String.raw`This is awesome.`;
where original would have become: This\tis \tawesome.. This isn't the case sadly.
References:
http://exploringjs.com/es6/ch_template-literals.html
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/raw
Template strings can be useful in many situations which I will explain below. Considering this, the String.raw prevents escapes from being interpreted. This can be useful in any template string in which you want to contain the escape character but do not want to escape it. A simple example could be the following:
var templateWithBackslash = String.raw `someRegExp displayed in template /^\//`
There are a few things inside that are nice to note with template strings.
They can contain unescaped line breaks without problems.
They can contain "${}". Inside these curly braces the javascript is interpreted instead.
(Note: running these will output the result to your console [in browser dev tools])
Example using line breaks:
var myTemplate = `
<div class="myClass">
<pre>
My formatted text
with multiple lines
{
asdf: "and some pretty printed json"
}
</pre>
</div>
`
console.log(myTemplate)
If you wanted to do the above with a normal string in Javascript it would look like the following:
var myTemplate = "\
<div class="myClass">\
<pre>\
My formatted text\
with multiple lines\
{\
asdf: "and some pretty printed json"\
}\
</pre>\
</div>"
console.log(myTemplate)
You will notice the first probably looks much nicer (no need to escape line breaks).
For the second I will use the same template string but also insert the some pretty printed JSON.
var jsonObj = {asdf: "and some pretty printed json", deeper: {someDeep: "Some Deep Var"}}
var myTemplate = `
<div class="myClass">
<pre>
My formatted text
with multiple lines
${JSON.stringify(jsonObj, null, 2)}
</pre>
</div>
`
console.log(myTemplate)
In NodeJS it is extremely handy when it comes to filepath handling:
var fs=require('fs');
var s = String.raw`C:\Users\<username>\AppData\Roaming\SomeApp\someObject.json`;
var username = "bob"
s=s.replace("<username>",username)
fs.readFile(s,function(err,result){
if (err) throw error;
console.log(JSON.parse(result))
})
It improves readability of filepaths on Windows. \ is also a fairly common separator, so I can definitely see why it would be useful in general. However it is pretty stupid how \ still escapes `... So ultimately:
String.raw`C:\Users\` //#==> C:\Users\`
console.log(String.raw`C:\Users\`) //#==> SyntaxError: Unexpected end of input.
In addition to its use as a tag, String.raw is also useful in implementing new tag functions as a tool to do the interleaving that most people do with a weird loop. For example, compare:
function foo(strs, ...xs) {
let result = strs[0];
for (let i = 0; i < xs.length; ++i) {
result += useFoo(xs[i]) + strs[i + 1];
}
return result;
}
with
function foo(strs, ...xs) {
return String.raw({raw: strs}, ...xs.map(useFoo));
}
The Use
(Requisite knowledge: tstring §.)
Instead of:
console.log(`\\a\\b\\c\\n\\z\\x12\\xa9\\u1234\\u00A9\\u{1234}\\u{00A9}`);
.you can:
console.log(String.raw`\a\b\c\n\z\x12\xa9\u1234\u00A9\u{1234}\u{00A9}`);
"Escaping"
<\\u> is fine, yet <\u> needs "escaping", eg:
console.log(String.raw`abc${'\\u'}abc`);
.Dit <\\x>, <\x>,
<console.log(String.raw`abc${`\\x`}abc`)>;
.<\`>, <`>, <console.log(String.raw`abc${`\``}abc`)>;
.<\${>, <${&>, <console.log(String.raw`abc${`$\{`}abc`)>;
.<\\1> (till <\\7>), <\1>, <console.log(String.raw`abc${`\\1`}abc`)>;
.<\\>, endunit <\>, <console.log(String.raw`abc${`\\`}`)>.
Nb
There's also a new "latex" string. Cf §.
I've found it to be useful for testing
my RegExps. Say I have a RegExp which
should match end-of-line comments because
I want to remove them. BUT, it must not
match source-code for a regexp like /// .
If your code contains /// it is not the
start of an EOL comment but a RegExp, as
per the rules of JavaScript syntax.
I can test whether my RegExp in variable patEOLC
matches or doesn't /// with:
String.raw`/\//` .match (patEOLC)
In other words it is a way to let my
code "see" code the way it exists in
source-code, not the way it exists
in memory after it has been read
into memory from source-code, with
all backslashes removed.
It is a way to "escape escaping" but
without having to do it separately
for every backslash in a string, but
for all of them at the same time.
It is a way to say that in a given
(back-quoted) string backslash
shall behave just like any other
character, it has no special
meaning or interpretation.

javascript - Better Way to Escape Dollar Signs in the String Used By String.prototype.replace

I want to replace a string by another. I found when the replaceValue contains "$", the replace will fail. So I am trying to escape "$" by "$$" first. The code is looks like this:
var str = ..., reg = ...;
function replaceString(replaceValue) {
str.replace(reg, replaceValue.replace(/\$/g, '$$$$'));
}
But I think it is ugly since I need to write 4 dollar signs.
Is there any other charactors that I need to escape? And is there any better way to do this?
There is a way to call replace that allows us not to worry about escaping anything.
var str = ..., reg = ...;
function replaceString(replaceValue) {
return str.replace(reg, function () { return replaceValue });
}
Your method to escape the replacement string is correct.
According to section 15.5.4.11 String.prototype.replace of ECMAScript specification edition 5.1, all special replacement sequences begins with $ ($&, $`, $', $n, $nn) and $$ specify a single $ in the replacement.
Therefore, it is sufficient to escape all $ with double $$ like what you are doing right now if the replacement text is meant to be treated literally.
There is no other concise way to do the replacement as far as I can see.
Unfortunately, nothing you can do about it.
It's just how JavaScript works with regular expressions.
Here's a good article with the list of all replacement patterns you should be aware of:
http://es5.github.io/#x15.5.4.11

Embed comments within JavaScript regex like in Perl

Is there any way to embed a comment in a JavaScript regex, like you can do in Perl? I'm guessing there is not, but my searching didn't find anything stating you can or can't.
You can't embed a comment in a regex literal.
You may insert comments in a string construction that you pass to the RegExp constructor :
var r = new RegExp(
"\\b" + // word boundary
"A=" + // A=
"(\\d+)"+ // what is captured : some digits
"\\b" // word boundary again
, 'i'); // case insensitive
But a regex literal is so much more convenient (notice how I had to escape the \) I'd rather separate the regex from the comments : just put some comments before your regex, not inside.
EDIT 2018: This question and answer are very old. EcmaScript now offers new ways to handle this, and more precisely template strings.
For example I now use this simple utility in node:
module.exports = function(tmpl){
let [, source, flags] = tmpl.raw.toString()
.replace(/\s*(\/\/.*)?$\s*/gm, "") // remove comments and spaces at both ends of lines
.match(/^\/?(.*?)(?:\/(\w+))?$/); // extracts source and flags
return new RegExp(source, flags);
}
which lets me do things like this or this or this:
const regex = rex`
^ // start of string
[a-z]+ // some letters
bla(\d+)
$ // end
/ig`;
console.log(regex); // /^[a-z]+bla(\d+)$/ig
console.log("Totobla58".match(regex)); // [ 'Totobla58' ]
Now with the grave backticky things, you can do inline comments with a little finagling. Note that in the example below there are some assumptions being made about what won't appear in the strings being matched, especially regarding the whitespace. But I think often you can make intentional assumptions like that, if you write the process() function carefully. If not, there are probably creative ways to define the little "mini-language extension" to regexes in such a way as to make it work.
function process() {
var regex = new RegExp("\\s*([^#]*?)\\s*#.*$", "mg");
var output = "";
while ((result = regex.exec(arguments[0])) !== null ){
output += result[1];
}
return output;
}
var a = new RegExp(process `
^f # matches the first letter f
.* # matches stuff in the middle
h # matches the letter 'h'
`);
console.log(a);
console.log(a.test("fish"));
console.log(a.test("frog"));
Here's a codepen.
Also, to the OP, just because I feel a need to say this, this is neato, but if your resulting code turns out just as verbose as the string concatenation or if it takes you 6 hours to figure out the right regexes and you are the only one on your team who will bother to use it, maybe there are better uses of your time...
I hope you know that I am only this blunt with you because I value our friendship.

Categories

Resources