Embed comments within JavaScript regex like in Perl

Is there any way to embed a comment in a JavaScript regex, like you can do in Perl? I'm guessing there is not, but my searching didn't find anything stating you can or can't.

You can't embed a comment in a regex literal.
You can, however, put comments between the string pieces that you concatenate and pass to the RegExp constructor:
var r = new RegExp(
"\\b" + // word boundary
"A=" + // A=
"(\\d+)"+ // what is captured : some digits
"\\b" // word boundary again
, 'i'); // case insensitive
But a regex literal is much more convenient (notice how I had to escape every \ in the strings), so I'd rather keep the comments separate from the regex: just put some comments before your regex, not inside it.
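One way to keep per-piece comments without doubling every backslash is to write each piece as a small regex literal and join their source properties. This is only a sketch: the helper name commentedRegExp is made up for illustration, and it assumes every piece is a valid regex on its own (so a group cannot be split across two pieces).

```javascript
// Hypothetical helper: joins the .source of each regex-literal piece into one RegExp.
function commentedRegExp(pieces, flags) {
  var source = pieces.map(function (piece) { return piece.source; }).join("");
  return new RegExp(source, flags);
}

var r = commentedRegExp([
  /\b/,    // word boundary
  /A=/,    // A=
  /(\d+)/, // what is captured: some digits
  /\b/     // word boundary again
], "i");   // case insensitive

console.log(r.source);               // \bA=(\d+)\b
console.log("x a=42 y".match(r)[1]); // 42
```

Because each piece is a literal, the backslashes are written once, exactly as they would appear in a single regex literal.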
EDIT 2018: This question and answer are very old. ECMAScript now offers new ways to handle this, most notably template strings.
For example I now use this simple utility in node:
module.exports = function(tmpl){
let [, source, flags] = tmpl.raw.toString()
.replace(/\s*(\/\/.*)?$\s*/gm, "") // remove comments and spaces at both ends of lines
.match(/^\/?(.*?)(?:\/(\w+))?$/); // extracts source and flags
return new RegExp(source, flags);
}
which lets me do things like this:
const regex = rex`
^ // start of string
[a-z]+ // some letters
bla(\d+)
$ // end
/ig`;
console.log(regex); // /^[a-z]+bla(\d+)$/gi
console.log("Totobla58".match(regex)); // [ 'Totobla58' ]

Now with template literals (the backtick strings), you can do inline comments with a little finagling. Note that the example below makes some assumptions about what won't appear in the regex source, especially regarding whitespace and the # character. But I think you can often make intentional assumptions like that, if you write the process() function carefully. If not, there are probably creative ways to define the little "mini-language extension" to regexes in such a way as to make it work.
function process(strings) {
    // Used as a template tag: strings[0] is the literal text (no interpolations are expected).
    var regex = new RegExp("\\s*([^#]*?)\\s*#.*$", "mg");
    var output = "";
    var result; // declared locally so it does not leak into the global scope
    while ((result = regex.exec(strings[0])) !== null) {
        output += result[1];
    }
    return output;
}
var a = new RegExp(process `
^f # matches the first letter f
.* # matches stuff in the middle
h # matches the letter 'h'
`);
console.log(a);
console.log(a.test("fish"));
console.log(a.test("frog"));
Also, to the OP, just because I feel a need to say this, this is neato, but if your resulting code turns out just as verbose as the string concatenation or if it takes you 6 hours to figure out the right regexes and you are the only one on your team who will bother to use it, maybe there are better uses of your time...
I hope you know that I am only this blunt with you because I value our friendship.

Related

RegExp replacement with variable and anchor

I've done some research, like 'How do you use a variable in a regular expression?' but no luck.
Here is the given input
const input = 'abc $apple 123 apple $banana'
Each variable has its own value.
Currently I'm able to query all variables from the given string using
const variables = input.match(/([$]\w+)/g)
Replacing them by looping over the variables array with the following code does not work:
const r = `/(\$${variable})/`;
const target = new RegExp(r, 'gi');
input.replace(target, value);
However, without the variable, it works:
const target = new RegExp(/(\$apple)/, 'gi');
input.replace(target, value);
I also changed the variable marker from $ to % or #, and it works with the following code:
// const target = new RegExp(`%${variable}`, 'gi');
// const target = new RegExp(`#${variable}`, 'gi');
input.replace(target, value);
How to match the $ symbol with variable in this case?

If I understand correctly, use (\\$${variable}).
With only one backslash, the template literal consumes it (\$ is just an escaped $, so \$ => $) and the RegExp becomes /($apple)/gi. An unescaped $ in a regex means end of input, not a literal dollar sign.
So the solution is to add another backslash, as in the demo below:
const input = 'abc $apple 123 apple $banana'
let variable = 'apple'
let value = '#test#'
const r = `(\\$${variable})`;
const target = new RegExp(r, 'gi');
console.log('template literals', r, `(\$${variable})`)
console.log('regex:', target, new RegExp(`(\$${variable})`, 'gi'))
console.log(input.replace(target, value))
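If the variable name can contain other regex metacharacters, the same idea can be generalized by escaping the whole literal text before handing it to the constructor. The following is a minimal sketch, not part of the answer above: escapeRegExp and replaceVariable are hypothetical helper names, and the character class is the commonly used list of regex metacharacters.

```javascript
// Hypothetical helper: backslash-escapes every regex metacharacter in text.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: replaces every "$name" in input with value.
function replaceVariable(input, name, value) {
  var target = new RegExp(escapeRegExp("$" + name), "gi");
  // A function replacement keeps "$" sequences inside value from being interpreted.
  return input.replace(target, function () { return value; });
}

var input = "abc $apple 123 apple $banana";
console.log(replaceVariable(input, "apple", "#test#")); // abc #test# 123 apple $banana
```

Since the escaping happens at run time on the actual string, there is no need to count backslashes in the template literal.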

I have been using regular expressions just about every day for almost a year now. I'll post my thoughts and just say:
Regular expressions are most useful for finding parts of a text or data file.
If there is a text file that contains the word "apple" or some derivative of it, regular expressions can be a great tool. I use them every day for parsing HTML content, as I write foreign-news translations (based in HTML).
I believe the code that was posted is JavaScript, because I saw the replace() call and the 'gi' flags, which I know are used in that language. I use Java's java.util.regex package myself, and the syntax is just slightly different.
If all you want to do is put a placeholder inside a string, this code could work, I guess, but understanding why regular expressions are necessary here seems like the real question. In this example, I have used the ampersand ('&') as the placeholder marker, since it is not one of the reserved characters of most (but not necessarily all) regular-expression engines.
var s1 = "&VAR1"; // Unused variable, left here for show!
var myString = "An example text-string with &VAR1, a particular kind of fruit.";
myString = myString.replace(/&VAR1/gi, "apple"); // replace() returns a new string, so assign the result
If you want a great way to practice with Regular-Expressions, go to this web-site and play around with them:
https://regexr.com/
Here are the rules for "reserved key symbols" of regex patterns (copied from that site):
The following characters have special meaning, and should be preceded by a \ (backslash) to represent a literal character:
+*?^$.[]{}()|/
Within a character set, only \, -, and ] need to be escaped.
Also, and perhaps most importantly, regular expressions are compiled in Java. I'm not exactly sure how they work in JavaScript, but there is no such concept as a "variable" inside a compiled expression, only in the text and data it analyzes. That means that if you want to change what you are searching for in a particular piece of text or data in Java, you must recompile your expression using:
Pattern p = Pattern.compile(regExString, flags);
There is no easy way to dynamically change particular values of text in the expression. The complexity it would add would be astronomical, and the value minimal. Instead, just compile another expression in your code and search again. Another option is to better understand things like .* .+ .*? .+? and even (.*) (.+) (.*?) (.+?) so that things that change, do change, and things that don't change, won't!
For instance if you used this pattern to match different-variables:
input.replace(/&VAR.*VAREND/gi, "apple");
All of your variables could be identified by the re-used pattern "&VAR-stuff-VAREND", but this is just one of many ways to skin the cat.

Using a replace callback function you can avoid building a separate regex for each variable and replace them all at once:
const input = 'abc $apple 123 apple $banana'
const vars = {
apple: 'Apfel',
banana: 'Banane'
}
let result = input.replace(/\$(\w+)/g, (_, v) => vars[v])
console.log(result)
This won't work if apple and banana are local variables rather than object properties, but looking up locals by name is a bad idea anyway.
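One detail of the callback approach is that a placeholder with no entry in the lookup object would be replaced by the text "undefined". A possible variation, sketched here with a hypothetical function name substitute, leaves unknown placeholders untouched:

```javascript
// Hypothetical helper: replaces each $name that exists in vars, keeps the rest as-is.
function substitute(input, vars) {
  return input.replace(/\$(\w+)/g, function (match, name) {
    // Only own properties count, so names like "toString" are not picked up by accident.
    return Object.prototype.hasOwnProperty.call(vars, name) ? vars[name] : match;
  });
}

const vars = { apple: 'Apfel', banana: 'Banane' };
console.log(substitute('abc $apple 123 apple $banana $cherry', vars));
// abc Apfel 123 apple Banane $cherry
```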

regex to remove certain characters at the beginning and end of a string

Let's say I have a string like this:
...hello world.bye
But I want to remove the first three dots and replace .bye with !
So the output should be
hello world!
it should only match if both conditions apply (... at the beginning and .bye at the end)
And I'm trying to use js replace method. Could you please help? Thanks

First match the dots, capture and lazy-repeat any character until you get to .bye, and match the .bye. Then, you can replace with the first captured group, plus an exclamation mark:
const str = '...hello world.bye';
console.log(str.replace(/\.\.\.(.*?)\.bye/, '$1!'));
The lazy repetition (.*?) is there to ensure you don't match too much, for example:
const str = `...hello world.bye
...Hello again! Goodbye.`;
console.log(str.replace(/\.\.\.(.*?)\.bye/g, '$1!'));
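The question asks for a replacement only when the dots are at the very beginning and .bye is at the very end. A variation that enforces this with the ^ and $ anchors might look like the sketch below; the function name trimDotsAndBye is made up, and it assumes a single-line input string.

```javascript
// Hypothetical helper: rewrites "...<text>.bye" to "<text>!" only when both
// the leading dots and the trailing ".bye" are present.
function trimDotsAndBye(str) {
  return str.replace(/^\.\.\.(.*)\.bye$/, '$1!');
}

console.log(trimDotsAndBye('...hello world.bye')); // hello world!
console.log(trimDotsAndBye('hello world.bye'));    // hello world.bye (unchanged)
console.log(trimDotsAndBye('...hello world'));     // ...hello world (unchanged)
```

With both anchors in place, a greedy (.*) is safe, because the match has to end at the end of the string anyway.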

You don't actually need a regex to do this. Although it's a bit inelegant, the following should work fine (obviously the function can be called whatever makes sense in the context of your application):
function manipulate(string) {
    if (string.slice(0, 3) === "..." && string.slice(-4) === ".bye") {
        return string.slice(3, -4) + "!"; // drop the 3 leading dots and the 4-character ".bye"
    }
    return string;
}
(Apologies if I made any stupid errors with indexing there, but the basic idea should be obvious.)
This, to me at least, has the advantage of being easier to reason about than a regex. Of course if you need to deal with more complicated cases you may reach the point where a regex is best - but I personally wouldn't bother for a simple use-case like the one mentioned in the OP.

Your regex would be
const rx = /\.\.\.([\s\S]*?)\.bye/g
const out = '\n\nfoobar...hello world.bye\nfoobar...ok.bye\n...line\nbreak.bye\n'.replace(rx, `$1!`)
console.log(out)
In English: find three dots, then anything (matched lazily) in a group, ending with .bye.
The replacement uses the first captured group, $1, followed by !. A template literal is used here, but a plain string works just as well.

An arguably simpler solution:
const str = '...hello world.bye'
const newStr = /\.\.\.(.+)\.bye/.exec(str) // the dots must be escaped, otherwise they match any character
const formatted = newStr ? newStr[1] + '!' : str
console.log(formatted)
If the string doesn't match the regex, formatted is just the original string.

How to split a string by a character not directly preceded by a character of the same type?

Let's say I have a string: "We.need..to...split.asap". What I would like to do is to split the string by the delimiter ., but I only wish to split by the first . and include any recurring .s in the succeeding token.
Expected output:
["We", "need", ".to", "..split", "asap"]
In other languages, I know that this is possible with a look-behind /(?<!\.)\./ but Javascript unfortunately does not support such a feature.
I am curious to see your answers to this question. Perhaps there is a clever use of look-aheads that presently evades me?
I was considering reversing the string, then re-reversing the tokens, but that seems like too much work for what I am after... plus controversy: How do you reverse a string in place in JavaScript?
Thanks for the help!

Here's a variation of the answer by guest271314 that handles more than two consecutive delimiters:
var text = "We.need.to...split.asap";
var re = /(\.*[^.]+)\./;
var items = text.split(re).filter(function(val) { return val.length > 0; });
It uses the detail that if the split expression includes a capture group, the captured items are included in the returned array. These capture groups are actually the only thing we are interested in; the tokens are all empty strings, which we filter out.
EDIT: Unfortunately there's perhaps one slight bug with this. If the text to be split starts with a delimiter, that will be included in the first token. If that's an issue, it can be remedied with:
var re = /(?:^|(\.*[^.]+))\./;
var items = text.split(re).filter(function(val) { return !!val; });
(I think this regex is ugly and would welcome an improvement.)

You can do this without any lookaheads:
var subject = "We.need.to....split.asap";
var regex = /\.?(\.*[^.]+)/g;
var matches, output = [];
while(matches = regex.exec(subject)) {
output.push(matches[1]);
}
document.write(JSON.stringify(output));
It seemed like it would work in one line, as it did on https://regex101.com/r/cO1dP3/1, but it had to be expanded into the loop above: with the /g flag, .match returns only the full matches and drops the capturing groups (the correct data was in the capturing groups, but we couldn't get at it without calling exec repeatedly).
See: JavaScript Regex Global Match Groups
An alternative solution with the original one liner (plus one line) is:
document.write(JSON.stringify(
"We.need.to....split.asap".match(/\.?(\.*[^.]+)/g)
.map(function(s) { return s.replace(/^\./, ''); })
));
Take your pick!
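In engines that provide String.prototype.matchAll (added in ES2020, so this assumes a reasonably recent browser or Node version), the capture groups of a global regex can be read without an explicit exec loop. A small sketch using the same regex as above:

```javascript
var subject = "We.need.to....split.asap";
var regex = /\.?(\.*[^.]+)/g;

// matchAll yields one match object per hit; index 1 is the first capture group.
var output = Array.from(subject.matchAll(regex), function (m) { return m[1]; });

console.log(JSON.stringify(output)); // ["We","need","to","...split","asap"]
```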

Note: This answer can't handle more than 2 consecutive delimiters, since it was written according to the example in the revision 1 of the question, which was not very clear about such cases.
var text = "We.need.to..split.asap";
// split "." if followed by "."
var res = text.split(/\.(?=\.)/).map(function(val, key) {
// if `val[0]` does not begin with "." split "."
// else split "." if not followed by "."
return val[0] !== "." ? val.split(/\./) : val.split(/\.(?!.*\.)/)
});
// concat arrays `res[0]` , `res[1]`
res = res[0].concat(res[1]);
document.write(JSON.stringify(res));
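The question notes that a look-behind would solve this directly. Look-behind assertions were added to the language in ES2018, so in engines that implement them (an assumption you should check for your target browsers) the split from the question can be written as a one-liner:

```javascript
var text = "We.need..to...split.asap";

// Split on a "." only when it is not preceded by another ".".
var tokens = text.split(/(?<!\.)\./);

console.log(JSON.stringify(tokens)); // ["We","need",".to","..split","asap"]
```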

What is the memory issue in this RegEx function

I am trying to scrape a web page for email addresses. I almost have it working, but there seems to be some kind of huge memory error that makes the page freeze when my script loads.
This is what I have:
var bodyText = document.body.textContent.replace(/\n/g, " ").split(' '); // Location to pull our text from. In this case it's the whole body
var r = new RegExp("[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])", 'i');
function validateEmail(string) {
return r.test(string);
}
var domains = [];
var domain;
for (var i = 0; i < bodyText.length; i++){
domain = bodyText[i].toString();
if (validateEmail(domain)) {
domains.push(domain);
}
}
The only thing I can think of is that the email validating function I'm using is a 32 step expression and the page I'm running it on returns with over 3,000 parts, but I feel like this should be possible.
Here is a script that reproduces the error:
var str = "help.yahoo.com/us/tutorials/cg/mail/cg_addressguard2.html";
var r = new RegExp("[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])", 'i');
console.log("before:"+(new Date()));
console.log(r.test(str));
console.log("after:"+(new Date()));
What can I do to overcome the memory issue?

stribizhev has pointed out the solution in the comment: specify the regex in RegExp literal syntax. Another solution, as shown in the comment by sln, is to escape \ in the string literal properly.
I will not address what the correct regex for validating or matching email addresses is in this answer, since that has been rehashed many times over.
To demonstrate what causes the problem, let us print the string passed to RegExp constructor to the console. Did you notice that some \ are missing?
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])
               ^               ^               ^                                                 ^
The string above is what the RegExp constructor sees and compiles.
/ only needs to be escaped in a RegExp literal (since RegExp literals are delimited by /), and doesn't need to be escaped in the string passed to the RegExp constructor, so that omission doesn't cause any problem.
Below are equivalent examples showing how to write a regex to match / with RegExp literal and RegExp constructor:
/\//;
new RegExp("/");
However, since the \ in \. is not properly escaped in the string, the resulting . matches any character (except line terminators) instead of a literal dot.
As a result, instead of being a perfectly fine pattern, these parts of the regex suffer from catastrophic backtracking:
(?:.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?.)+
Since . can match any character, the fragments above degenerate into the classic catastrophic backtracking pattern (A*)*. Reducing the regex to a strict subset of what it matches makes the problem easier to see:
(?:a[a]+)*
(?:[a](?:[a]*[a])?a)+
This is the solution with a RegExp literal, which is exactly what you wrote inside the string literal in the question. You escaped everything correctly for a RegExp literal, but then passed it as a string to the RegExp constructor:
var r = /[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])/i;
As for the equivalent RegExp constructor solution:
var r = new RegExp("[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])", "i");
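The effect of the missing backslash can be checked on a much smaller pattern. The sketch below is not part of the original answer, and the variable names are only illustrative; it compares what the constructor receives for a single backslash, a doubled backslash, and a String.raw template:

```javascript
var dropped = new RegExp("a\.b");          // the string is "a.b": the backslash is gone
var doubled = new RegExp("a\\.b");         // the string is "a\.b": the regex sees an escaped dot
var raw = new RegExp(String.raw`a\.b`);    // String.raw keeps the backslash as written

console.log(dropped.source);      // a.b
console.log(doubled.source);      // a\.b
console.log(dropped.test("axb")); // true  (. matches any character)
console.log(doubled.test("axb")); // false (only a literal dot matches)
console.log(raw.test("axb"));     // false
```

Printing the source property of a constructed regex is a quick way to spot this kind of lost escape.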

Not exactly an answer to your question, but the first thing to do is reduce the number of text parts you have to test with your corrected pattern. In your example HTML file, you have about 3300 text strings to test with a regex. Keep in mind that running a regex has a cost, so discarding useless text parts first is a priority:
var textParts = document.body.textContent
.split(/\s+/) // see the note
.filter(function(part) {
return part.length > 4 && part.length < 255 && part.indexOf('@') > 1;
});
alert(textParts.join("\n"));
Now you have only ~50 text parts to test.
Note: if you want to take into account email addresses with spaces inside double quotes, you can try changing:
.split(/\s+/)
to
.split(/(?=[\s"])((?:"[^"\n\\]*(?:\\.[^"\n\\]*)*"[^"\s]*)*)(?:\s+|$)/)
(without any warranty)
About your pattern: the mistake in it has already been pointed out in other answers and comments, but note that you can probably obtain the same result (the same matches) faster with this one:
/\b\w[!#-'*+\/-9=?^-~-]*(?:\.[!#-'*+\/-9=?^-~-]+)*@[a-z0-9]+(?:-[a-z0-9]+)*\.[a-z0-9]+(?:[-.][a-z0-9]+)*\b/i

Here's an example with a less strict regex that's fast.
function getEmails(str) {
var r = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/ig;
var emails = [];
var e = null;
var n = 0;
while ((e = r.exec(str)) !== null) {
emails[n++] = e[0];
}
return emails;
}
function emailTest() {
var str = document.getElementsByTagName('body')[0].innerHTML;
var emails = getEmails(str);
document.getElementById('found').innerHTML=emails.join("\n");
}
emailTest();
#found {
color:green;
font-weight:bold;
}
<pre id="email_test">
test@test.test
foo@bar.baz.test
foo@bar.baz.longdomain
foo-bar@foo.bar
foo_bar99@foo.bar
foo@foo@foo.bar
foo$bar@33@test.test
foo+bar-baz%99@someplace.top
</pre>
<pre id="found"></pre>

java script Regular Expressions patterns problem

My problem starts like this:
var str='0|31|2|03|.....|4|2007'
str=str.replace(/[^|]\d*[^|]/,'5');
so the output becomes "0|5|2|03|.....|4|2007", i.e. it replaces 31 with 5.
But this doesn't work for replacing other segments when I change the code like this:
str=str.replace(/[^|]{2}\d*[^|]/,'6');
It doesn't change 2 to 6.
What am I missing here? Any help?

I think a regular expression is a bad solution for that problem. I'd rather do something like this:
var str = '0|31|2|03|4|2007';
var segments = str.split("|");
segments[1] = "35";
segments[2] = "123";
str = segments.join("|"); // put the segments back together
Can't think of a good way to solve this with a regexp.

Here is a specific regex solution which replaces the number following the first | pipe symbol with the number 5:
var re = /^((?:\d+\|){1})\d+/;
return text.replace(re, '$15');
If you want to replace the digits following the third |, simply change the {1} portion of the regex to {3}.
Here is a generalized function that will replace any given number slot (zero-based index), with a specified new number:
function replaceNthNumber(text, n, newnum) {
var re = new RegExp("^((?:\\d+\\|){"+ n +'})\\d+');
return text.replace(re, '$1'+ newnum);
}
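As a usage illustration, here is a variant of the generalized function that builds the replacement in a callback instead of concatenating '$1' with the new number, so the digits of the new number can never be read as part of a group reference. The name replaceNthNumberSafe is hypothetical, and the input is the pipe-separated string from the question (without the elided middle part):

```javascript
// Hypothetical variant of replaceNthNumber: n is the zero-based slot to replace.
function replaceNthNumberSafe(text, n, newnum) {
  var re = new RegExp("^((?:\\d+\\|){" + n + "})\\d+");
  return text.replace(re, function (match, prefix) {
    // prefix holds the first n "number|" slots, which are kept unchanged.
    return prefix + newnum;
  });
}

console.log(replaceNthNumberSafe("0|31|2|03|4|2007", 1, "5")); // 0|5|2|03|4|2007
console.log(replaceNthNumberSafe("0|31|2|03|4|2007", 2, "6")); // 0|31|6|03|4|2007
```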

Firstly, you don't have to escape | in the character set, because it doesn't have any special meaning in character sets.
Secondly, you don't put quantifiers in character sets.
And finally, to create a global matching expression, you have to use the g flag.

[^\|] means anything but a '|', so in your case it only matches a digit. The whole expression will therefore only match something with 2 or more digits.
Second, you should put the {2} outside of the [] brackets.
I'm not sure what you want to achieve here.
