Match whitespace in Javascript regexp by object, created through RegExp constructor

Please take a look at this code. Why does creating the same regular expression in two different ways (with a /regex/ literal and through the RegExp constructor) give different results? Why doesn't the second pattern match the whitespace in str?
var str = " ";
var pat1 = /\s/;
document.writeln(pat1.test(str)); // shows "true"
var pat2 = new RegExp("\s");
document.writeln(pat2.test(str)); // shows "false"
I can't find an answer to my question anywhere. Thanks.

You need to escape the backslash since it's in a string:
var pat2 = new RegExp("\\s");
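To see why, it can help to look at the string itself before RegExp ever receives it. A small sketch (the variable names are mine, not from the question):

```javascript
// In a string literal, "\s" is not a recognised escape sequence,
// so the backslash is dropped and only the letter "s" remains.
const unescaped = "\s";
const escaped = "\\s";

console.log(unescaped.length); // 1
console.log(escaped.length);   // 2

const pat2Broken = new RegExp(unescaped); // same as /s/
const pat2Fixed = new RegExp(escaped);    // same as /\s/

console.log(pat2Broken.test(" "), pat2Broken.test("s")); // false true
console.log(pat2Fixed.test(" "), pat2Fixed.test("s"));   // true false
```

So the constructor never sees a backslash unless the string literal contains two of them.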

Related

How can you add e.g. 'gm' to a regex to avoid repeating the full regex again? [duplicate]

I am trying to create something similar to this:
var regexp_loc = /e/i;
except I want the regexp to depend on a string, so I tried to use new RegExp, but I couldn't get what I wanted.
Basically, I want the e in the above regexp to come from a string variable, but I can't get the syntax right.
I tried something like this:
var keyword = "something";
var test_regexp = new RegExp("/" + keyword + "/i");
Basically, I want to search for a substring in a larger string and then replace it with some other string, case-insensitively.
regards,
alexander

You need to pass the second parameter:
var r = new RegExp(keyword, "i");
You will also need to escape any special characters in the string to prevent regex injection attacks.
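A minimal sketch of that escaping step, assuming the keyword should always be matched literally. escapeRegExp is a hypothetical helper name (it is not built in), and the sample keyword is made up:

```javascript
// Hypothetical helper: backslash-escape every character that has a
// special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const keyword = "1+1 (approx.)";
const r = new RegExp(escapeRegExp(keyword), "i");

console.log(r.test("What is 1+1 (APPROX.)?")); // true
console.log(r.test("What is 111 (approxx)?")); // false
```

Without the escaping, +, ( and . in the keyword would be read as regex syntax rather than as literal characters.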

You should also remember to watch out for escape characters within a string...
For example, if you wished to match a single digit with \d{1} and you did this...
var pattern = "\d{1}";
var re = new RegExp(pattern);
re.exec("1"); // fail! :( returns null
that would fail because the initial \ is treated as an escape character by the string literal and dropped, so the pattern becomes d{1}. You would need to "escape the escape", like so...
var pattern = "\\d{1}"; // <-- spot the extra '\'
var re = new RegExp(pattern);
re.exec("1"); // success! :D

When using the RegExp constructor, you don't need the slashes like you do when using a regexp literal. So:
new RegExp(keyword, "i");
Note that you pass the flags in the second parameter.
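The same second parameter also covers the case in the thread title, adding a flag such as g to a regex you already have without typing the pattern again. A sketch, assuming an environment where RegExp objects expose the flags property (ES2015 and later); the sample text is made up:

```javascript
// Reuse the pattern of an existing regex and only change its flags.
const base = /e/i;
const withGlobal = new RegExp(base.source, base.flags + "g");

console.log(withGlobal.flags);                      // "gi"
console.log("Cheese Egg".replace(base, "_"));       // "Ch_ese Egg"
console.log("Cheese Egg".replace(withGlobal, "_")); // "Ch__s_ _gg"
```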

I want to share an example here:
I want to replace a string like hi[var1][var2] with hi[newVar][var2],
where var1 is generated dynamically in the page,
so I had to use:
var regex = new RegExp("\\[" + var1 + "\\]", "ig");
mystring = mystring.replace(regex, "[newVar]");
This works pretty well for me, in case anyone else needs it.
The reason I include the [] is that var1 might be a very common string by itself; matching the brackets as well makes the replacement much more accurate.

var keyword = "something";
var test_regexp = new RegExp(keyword, "i");

You need to convert the string to a RegExp. You can write a simple function to do it for you:
function toReg(str) {
  if (!str || typeof str !== "string") {
    return;
  }
  return new RegExp(str, "i");
}
and call it like:
toReg("something")

What is the memory issue in this RegEx function

I am trying to scrape a web page for email addresses. I almost have it working, but there seems to be some kind of huge memory error that makes the page freeze when my script loads.
This is what I have:
var bodyText = document.body.textContent.replace(/\n/g, " ").split(' '); // Location to pull our text from. In this case it's the whole body
var r = new RegExp("[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])", 'i');
function validateEmail(string) {
  return r.test(string);
}
var domains = [];
var domain;
for (var i = 0; i < bodyText.length; i++) {
  domain = bodyText[i].toString();
  if (validateEmail(domain)) {
    domains.push(domain);
  }
}
The only thing I can think of is that the email validating function I'm using is a 32 step expression and the page I'm running it on returns with over 3,000 parts, but I feel like this should be possible.
Here is a script that reproduces the error:
var str = "help.yahoo.com/us/tutorials/cg/mail/cg_addressguard2.html";
var r = new RegExp("[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])", 'i');
console.log("before:"+(new Date()));
console.log(r.test(str));
console.log("after:"+(new Date()));
What can I do to overcome the memory issue?

stribizhev has pointed out the solution in the comments: specify the regex in RegExp literal syntax. Another solution, as shown in the comment by sln, is to properly escape \ in the string literal.
I will not address what the correct regex for validating/matching an email address is in this answer, since that has been rehashed many times over.
To demonstrate what causes the problem, let us print the string passed to the RegExp constructor to the console. Did you notice that some \ are missing?
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])
               ^               ^               ^                                                 ^
The string above is what the RegExp constructor sees and compiles.
/ only needs to be escaped in a RegExp literal (since RegExp literals are delimited by /), and doesn't need to be escaped in the string passed to the RegExp constructor, so the omission doesn't cause any problem.
Below are equivalent examples showing how to write a regex that matches / with a RegExp literal and with the RegExp constructor:
/\//;
new RegExp("/");
However, since the \ in \. is not properly escaped in the string, instead of matching a literal ., the pattern allows any character (except for line terminators) to be matched.
As a result, instead of being perfectly fine, these parts of the regex suffer from catastrophic backtracking:
(?:.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?.)+
Since . can match any character, the fragments above degenerate to the classic catastrophic backtracking pattern (A*)*. By reducing the power of the regex to a strict subset, you can see the problem more clearly:
(?:a[a]+)*
(?:[a](?:[a]*[a])?a)+
This is the solution with a RegExp literal, which is the same pattern as in the string literal in the question. You got the escaping right for a RegExp literal, but then used it in the RegExp constructor instead:
var r = /[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])/i;
As for the equivalent RegExp constructor solution:
var r = new RegExp("[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])", "i");
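The effect of the lost backslash is easy to check on a much shorter pattern. A sketch with made-up names, deliberately not using the full email regex:

```javascript
// "a\.b" loses its backslash inside the string literal; "a\\.b" keeps it.
const loose = new RegExp("a\.b");   // string is a.b: the dot matches any character
const strict = new RegExp("a\\.b"); // string is a\.b: the dot matches only "."

console.log(loose.test("a.b"), loose.test("axb"));   // true true
console.log(strict.test("a.b"), strict.test("axb")); // true false

// The constructor form with the doubled backslash is the same pattern as the literal.
console.log(strict.source === /a\.b/.source); // true
```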

Not exactly an answer to your question, but the first thing you need to do is reduce the number of text parts you have to test with your "corrected" pattern. In your HTML example file, you have about 3300 text strings to test with a regex. Keep in mind that using a regex has a cost, so removing useless text parts is a priority:
var textParts = document.body.textContent
  .split(/\s+/) // see the note
  .filter(function(part) {
    return part.length > 4 && part.length < 255 && part.indexOf('@') > 1;
  });
alert(textParts.join("\n"));
Now you have only ~50 text parts to test.
Note: if you want to take into account email addresses with spaces inside double quotes, you can try changing:
.split(/\s+/)
to
.split(/(?=[\s"])((?:"[^"\n\\]*(?:\\.[^"\n\\]*)*"[^"\s]*)*)(?:\s+|$)/)
(without any warranty)
About your pattern: the mistake in your pattern has already been pointed out by other answers and comments, but note that you can probably obtain the same result (the same matches) faster with this one:
/\b\w[!#-'*+\/-9=?^-~-]*(?:\.[!#-'*+\/-9=?^-~-]+)*@[a-z0-9]+(?:-[a-z0-9]+)*\.[a-z0-9]+(?:[-.][a-z0-9]+)*\b/i

Here's an example with a less strict regex that's fast.
function getEmails(str) {
  var r = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/ig;
  var emails = [];
  var e = null;
  var n = 0;
  while ((e = r.exec(str)) !== null) {
    emails[n++] = e[0];
  }
  return emails;
}
function emailTest() {
  var str = document.getElementsByTagName('body')[0].innerHTML;
  var emails = getEmails(str);
  document.getElementById('found').innerHTML = emails.join("\n");
}
emailTest();
#found {
  color: green;
  font-weight: bold;
}
<pre id="email_test">
test@test.test
foo@bar.baz.test
foo@bar.baz.longdomain
foo-bar@foo.bar
foo_bar99@foo.bar
foo@foo@foo.bar
foo$bar#33@test.test
foo+bar-baz%99@someplace.top
</pre>
<pre id="found"></pre>

Why doesn't this RegExp work and which notation is more standards compliant?

Disclaimer: I realize asking "Why doesn't my regular expression work" is pretty amateur.
I have looked at the documentation, though I'm just plain struggling. I have a URL (as a string), and what I want is to replace the placeholders (i.e. {objectID} and {queryTerm}).
For a while now, I've been making attempts like this:
var _serviceURL = "http://my-server.com/rest-services/someObject/{objectID}/entries?term={queryTerm}";
var re1 = new RegExp("/{([A-Za-z])+}","gi");
var re2 = new RegExp("/{([A-Za-z]+)}+","gi");
var re3 = new RegExp("/{([A-Za-z])+}","gi");
var re4 = new RegExp("/({[A-Za-z]+})+","gi");
var re5 = new RegExp("({[A-Za-z]+})+","gi");
var re6 = new RegExp("({[A-Za-z]}+)*","g");
var re6a = new RegExp("({([a-z]+)})+","gi");
var re7 = /{([^}]+)}/g;
var tokens = re6a.exec(_serviceURL);
if (tokens !== null) {
  for (var i = 0; i < tokens.length; i++) {
    var t = tokens[i];
    console.log("tokens[i]: " + t);
  }
}
else {
  console.log("RegEx fail...");
}
re6a above produces an array like this upon execution:
tokens: Array[3]
0: "{objectID}"
1: "{objectID}"
2: "objectID"
Related to the scenario above:
Why is it I'm never getting the queryTerm ?
Does the RegExp 'i' (ignore case) flag mean I can list a character class like [a-z] and also capture [A-Z] ?
Which method of constructing a RegExp object is better? ...new RegExp(...); or var regExp = /{([^}]+)}/g; . In terms of "what's better", what I mean is cross-browser compatibility and as similar to other RegEx implementations (if I'm learning RegEx, I want to get the most value I can out of it).

Does the RegExp i (ignore case) flag mean I can list a character class like [a-z] and also capture [A-Z]?
Yes, it'll capture them all.
Which method of constructing a RegExp object is better? new RegExp(...) or var regExp = /{([^}]+)}/g;? In terms of "what's better", what I mean is cross-browser compatibility and as similar to other RegEx implementations (if I'm learning RegEx, I want to get the most value I can out of it).
You should definitely use the literal notation.
A literal is compiled once, when the script is loaded, whereas the constructor compiles the pattern at runtime each time it is called.
They're both equally cross browser compatible.
All that said, I'd use this:
_serviceURL.match(/[^{}]+(?=})/g);
Here's the fiddle: http://jsfiddle.net/CAugU/
Here's an explanation of the above regex:
[ opens the character set
^ negates the set. Will only match whatever is NOT in these brackets
{} match anything that is NOT a curly brace
] close the character set
+ match that as many times as possible
(?= ascertain that it is possible to match the following here (won't be included in the match, this is called a lookahead)
} match a curly brace
) close the lookahead
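On the first question (why queryTerm never shows up): a single exec() call returns only one match together with its capture groups, which is why the array above holds {objectID} and its groups and nothing else. A sketch of collecting every placeholder by calling exec() in a loop, reusing the re7 pattern from the question (the other names are mine):

```javascript
const serviceURL =
  "http://my-server.com/rest-services/someObject/{objectID}/entries?term={queryTerm}";

// With the g flag, each exec() call resumes from re.lastIndex,
// so looping until null visits every placeholder.
const re = /{([^}]+)}/g;
const names = [];
let match;
while ((match = re.exec(serviceURL)) !== null) {
  names.push(match[1]); // match[0] is "{objectID}", match[1] is "objectID"
}

console.log(names); // [ 'objectID', 'queryTerm' ]
```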

As you're going to replace the placeholders, it seems more natural to use replace rather than match, for example:
var _serviceURL = "http://my-server.com/rest-services/someObject/{objectID}/entries?term={queryTerm}";
var values = {
  objectID: 1234,
  queryTerm: "hello"
};
var result = _serviceURL.replace(/{(.+?)}/g, function($0, $1) {
  return values[$1];
});
yields http://my-server.com/rest-services/someObject/1234/entries?term=hello
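One thing to watch with this approach: a placeholder with no entry in values is replaced by the text "undefined". A possible variation, as a sketch only: the extra {page} placeholder and the URL-encoding step are my own assumptions, not part of the question.

```javascript
const serviceURL =
  "http://my-server.com/rest-services/someObject/{objectID}/entries?term={queryTerm}&page={page}";
const values = { objectID: 1234, queryTerm: "hello world" };

const result = serviceURL.replace(/{(.+?)}/g, (placeholder, name) =>
  Object.prototype.hasOwnProperty.call(values, name)
    ? encodeURIComponent(values[name]) // make the value safe inside a URL
    : placeholder                      // unknown name: leave "{page}" as it is
);

console.log(result);
// http://my-server.com/rest-services/someObject/1234/entries?term=hello%20world&page={page}
```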

Javascript replace several character including '/'

I'm using this snippet to replace several characters in a string.
var badwords = eval("/foo|bar|baz/ig");
var text="foo the bar!";
document.write(text.replace(badwords, "***"));
But one of the characters I want to replace is '/'. I assume it's not working because it's a reserved character in regular expressions, but how can I get it done then?
Thanks!

You simply escape the "reserved" char in your RegExp:
var re = /abc\/def/;
You are probably having trouble with that because you are, for some reason, using a string as your RegExp and then evaling it...so odd.
var badwords = /foo|bar|baz/ig;
is all you need.
If you INSIST on using a string, then you have to escape your escape:
var badwords = eval( "/foo|ba\\/r|baz/ig" );
This gets a backslash through the JS interpreter to make it to the RegExp engine.

First off, DON'T USE EVAL. It's the most evil function ever and completely unnecessary here.
var badwords = /foo|bar|baz/ig;
works just as well (or use the new RegExp("foo|bar|baz","ig"); constructor)
and when you want to have a / in the regex, add a \ before the character you want to escape
var badwords = /\/foo|bar|baz/ig;
//or
var badwords = new RegExp("\\/foo|bar|baz","ig");//double escape to escape the backslash in the string like one has to do in java
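If the list of words lives in a variable, the constructor can also build the alternation with no eval at all. A sketch, assuming a made-up word list; entries that contain regex metacharacters such as . or * would need escaping first, while / needs none in a constructor string:

```javascript
// Hypothetical word list; "/" needs no special treatment in a string
// passed to the RegExp constructor.
const words = ["foo", "bar", "baz", "/"];
const badwords = new RegExp(words.join("|"), "ig");

console.log("foo the bar/baz!".replace(badwords, "***")); // "*** the *********!"
```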
