What is the memory issue in this RegEx function - javascript

I am trying to scrape a web page for email addresses. I almost have it working, but there seems to be some kind of huge memory error that makes the page freeze when my script loads.
This is what I have:
var bodyText = document.body.textContent.replace(/\n/g, " ").split(' '); // Location to pull our text from. In this case it's the whole body
var r = new RegExp("[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])", 'i');
function validateEmail(string) {
return r.test(string);
}
var domains = [];
var domain;
for (var i = 0; i < bodyText.length; i++){
domain = bodyText[i].toString();
if (validateEmail(domain)) {
domains.push(domain);
}
}
The only thing I can think of is that the email validating function I'm using is a 32 step expression and the page I'm running it on returns with over 3,000 parts, but I feel like this should be possible.
Here is a script that reproduces the error:
var str = "help.yahoo.com/us/tutorials/cg/mail/cg_addressguard2.html";
var r = new RegExp("[a-z0-9!#$%&'*+\/=?^_{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])", 'i');
console.log("before:"+(new Date()));
console.log(r.test(str));
console.log("after:"+(new Date()));
What can I do to overcome the memory issue?

stribizhev has pointed out the solution in the comment: specify the regex in RegExp literal syntax. Another solution, as shown in the comment by sln, is to escape \ in the string literal properly.
I will not address what the correct regex for validating or matching an email address is in this answer, since that has been rehashed many times over.
To demonstrate what causes the problem, let us print the string passed to RegExp constructor to the console. Did you notice that some \ are missing?
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])
               ^               ^               ^                                                 ^
The string above is what the RegExp constructor sees and compiles.
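The same check works for any pattern: log the string before handing it to the constructor. Here is a minimal sketch using a shortened, hypothetical pattern (not the full email regex) to show that the string literal itself drops the backslash in `\.` and `\/`:

```javascript
// Shortened, hypothetical pattern used only to illustrate the escaping issue.
var asWritten = "a\.b\/c";   // single backslashes inside a string literal
var escaped = "a\\.b\\/c";   // doubled backslashes survive string parsing

console.log(asWritten);      // the backslashes are already gone here
console.log(escaped);        // each "\\" has become a single backslash

// With the backslash lost, "." matches any character, not just a literal dot.
console.log(new RegExp(asWritten).test("aXb/c")); // true
console.log(new RegExp(escaped).test("aXb/c"));   // false
```

The unwanted `true` on the first test is the same effect that makes the original pattern far more permissive than intended.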
/ only needs to be escaped in a RegExp literal (since RegExp literals are delimited by /) and doesn't need to be escaped in the string passed to the RegExp constructor, so the omission doesn't cause any problem.
Below are equivalent examples showing how to write a regex to match / with RegExp literal and RegExp constructor:
/\//;
new RegExp("/");
However, since \ in \. is not properly escaped in the string, instead of matching literal ., it allows any character (except for line separator) to be matched.
As a result, instead of being perfectly fine, these parts of the regex suffer from catastrophic backtracking:
(?:.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?.)+
Since . can match any character, the fragments above degenerate to the classic catastrophic backtracking pattern (A*)*. Reducing each fragment to a strict subset of what it matches makes the problem easier to see:
(?:a[a]+)*
(?:[a](?:[a]*[a])?a)+
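To get a feel for how quickly this blows up, here is a rough timing sketch. It uses an anchored stand-in for the first reduced fragment, and the helper name `timeTest` and the input sizes are my own choices. The timings depend on the engine and machine, so treat them as indicative only:

```javascript
// Anchored stand-in for the reduced fragment (?:a[a]+)*. The anchors force a
// full-string match, so the engine has to try every way of splitting the run
// of "a"s into groups before it can report failure.
var slow = /^(?:aa+)*$/;

// Hypothetical helper: runs one test and reports the result and elapsed time.
function timeTest(regex, input) {
  var start = Date.now();
  var matched = regex.test(input);
  return { matched: matched, ms: Date.now() - start };
}

// Sizes are kept small on purpose; much larger values can freeze the page.
var results = [20, 25, 30].map(function (n) {
  var input = "a".repeat(n) + "b"; // the trailing "b" guarantees a failed match
  var r = timeTest(slow, input);
  return { length: n, matched: r.matched, ms: r.ms };
});

console.log(results);
```

Every run fails to match, but the work needed to find that out grows roughly exponentially with the length of the run of "a"s. That is what freezes the page on a 3,000-part input.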
This is the solution with a RegExp literal, which uses the same pattern as the string literal in the question. You escaped everything correctly for a RegExp literal, but then used it in the RegExp constructor instead:
var r = /[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])/i;
And the equivalent RegExp constructor solution:
var r = new RegExp("[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])", "i");
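If you need the constructor (for example because part of the pattern is dynamic) and can use ES2015, `String.raw` is another way to avoid doubling every backslash. This is not part of the original answer, just a sketch with a shortened, hypothetical pattern:

```javascript
// Shortened, hypothetical pattern; String.raw keeps backslashes exactly as typed.
var fromLiteral = /\w+(?:\.\w+)*/i;
var fromRaw = new RegExp(String.raw`\w+(?:\.\w+)*`, "i");
var fromEscaped = new RegExp("\\w+(?:\\.\\w+)*", "i");

// All three compile the same pattern text.
console.log(fromLiteral.source === fromRaw.source);     // true
console.log(fromLiteral.source === fromEscaped.source); // true
```

Backticks and `${` still need care inside the template, so this only helps with backslashes.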

Not exactly an answer to your question, but the first thing you need to do is reduce the number of text parts you have to test with your "corrected" pattern. In your HTML example file, you have about 3300 text strings to test with a regex. Keep in mind that using a regex has a cost, so removing useless text parts is a priority:
var textParts = document.body.textContent
.split(/\s+/) // see the note
.filter(function(part) {
return part.length > 4 && part.length < 255 && part.indexOf('#') > 1;
});
alert(textParts.join("\n"));
Now you have only ~50 text parts to test.
note: if you want to take into account email addresses with spaces inside double quotes, you can try to change:
.split(/\s+/)
to
.split(/(?=[\s"])((?:"[^"\n\\]*(?:\\.[^"\n\\]*)*"[^"\s]*)*)(?:\s+|$)/)
(without any warranty)
About your pattern: the mistake in your pattern has already been pointed out by other answers and comments, but note that you can probably obtain the same result (the same matches) faster with this one:
/\b\w[!#-'*+\/-9=?^-~-]*(?:\.[!#-'*+\/-9=?^-~-]+)*#[a-z0-9]+(?:-[a-z0-9]+)*\.[a-z0-9]+(?:[-.][a-z0-9]+)*\b/i

Here's an example with a less strict regex that's fast.
function getEmails(str) {
var r = /\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b/ig;
var emails = [];
var e = null;
var n = 0;
while ((e = r.exec(str)) !== null) {
emails[n++] = e[0];
}
return emails;
}
function emailTest() {
var str = document.getElementsByTagName('body')[0].innerHTML;
var emails = getEmails(str);
document.getElementById('found').innerHTML=emails.join("\n");
}
emailTest();
#found {
color:green;
font-weight:bold;
}
<pre id="email_test">
test#test.test
foo#bar.baz.test
foo#bar.baz.longdomain
foo-bar#foo.bar
foo_bar99#foo.bar
foo#foo#foo.bar
foo$bar#33#test.test
foo+bar-baz%99#someplace.top
</pre>
<pre id="found"></pre>

Related

How do I pass a variable into regex with Node js?

So basically, I have a regular expression which is
var regex1 = /10661\" class=\"fauxBlockLink-linkRow u-concealed\">([\s\S]*?)<\/a>/;
var result=text.match(regex1);
user_activity = result[1].replace(/\s/g, "")
console.log(user_activity);
What I'm trying to do is this
var number = 1234;
var regex1 = /${number}\" class=\"fauxBlockLink-linkRow u-concealed\">([\s\S]*?)<\/a>/;
but it is not working, and when I tried with RegExp, I kept getting errors.
You can use RegExp to create regexp from a string and use variables in that string.
var number = 1234;
var regex1 = new RegExp(`${number}aa`);
console.log("1234aa".match(regex1));
You can build the regex string with templates and/or string concatenation and then pass it to the RegExp constructor. The key is to get the escaping correct: you need an extra level of escaping for backslashes, because parsing the string consumes one level of backslash and one still has to survive to reach the RegExp constructor. Here's a working example:
function match(number, str) {
let r = new RegExp(`${number}" class="fauxBlockLink-linkRow u-concealed">([\\s\\S]*?)<\\/a>`);
return str.match(r);
}
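const exampleHTML = '<a href="/forum/search/member?user_id=1234" class="fauxBlockLink-linkRow u-concealed">Some link text</a>'; // markup reconstructed for illustration; the href value is an assumption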
console.log(match(1234, exampleHTML));
Note, using regex to match HTML like this becomes very order-sensitive (whereas the HTML itself isn't order-sensitive). And, your regex requires exactly one space between classes which HTML doesn't. If the class names were in a slightly different order or spacing different in the <a> tag, then it would not match. Depending upon what you're really trying to do, there may be better ways to parse and use the HTML that isn't order-sensitive.
I solved it with the method of Adem,
function escapeRegExp(string) {
return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); // $& means the whole matched string
}
var number = 1234;
var firstPart = `<a href="/forum/search/member?user_id=${number}" class="fauxBlockLink-linkRow u-concealed">`
var regexpString = escapeRegExp(firstPart) + '([\\s\\S]*?)' + escapeRegExp('</a>');
console.log(regexpString)
var sample = ` `
var regex1 = new RegExp(regexpString);
console.log(sample.match(regex1));
In the first place, the issue was actually the way I was reading the file: the data I was applying the match to was undefined.

Matching a string with a regex gives null even though it should match

I am trying to get my regex to work in JavaScript, but I have a problem.
Code:
var reg = new RegExp('978\d{10}');
var isbn = '9788740013498';
var res = isbn.match(reg);
console.log(res);
However, res is always null in the console.
This is quite interesting, as the regex should work.
My question: then, what is the right syntax to match a string and a regex?
(If it matters and could have any say in the environment: this code is taken from an app.get view made in Express.js in my Node.js application)
Because you're using a string to build your regex, you need to escape the \. It's currently working to escape the d, which doesn't need escaping.
You can see what happens if you create your regex on the chrome console:
new RegExp('978\d{10}');
// => /978d{10}/
Note that there is no \d, only a d, so your regex matches 978dddddddddd. That is, the literal 'd' character repeated 10 times.
You need to use \\ to insert a literal \ in the string you're building the regex from:
var reg = new RegExp('978\\d{10}');
var isbn = '9788740013498';
var res = isbn.match(reg);
console.log(res)
// => ["9788740013498", index: 0, input: "9788740013498"]
You need to escape with a double backslash if you use the RegExp constructor:
var reg = new RegExp('978\\d{10}');
Quote from documentation:
When using the constructor function, the normal string escape rules (preceding special characters with \ when included in a string) are necessary. For example, the following are equivalent:
var re = /\w+/;
var re = new RegExp("\\w+");

Embed comments within JavaScript regex like in Perl

Is there any way to embed a comment in a JavaScript regex, like you can do in Perl? I'm guessing there is not, but my searching didn't find anything stating you can or can't.
You can't embed a comment in a regex literal.
You may insert comments in a string construction that you pass to the RegExp constructor :
var r = new RegExp(
"\\b" + // word boundary
"A=" + // A=
"(\\d+)"+ // what is captured : some digits
"\\b" // word boundary again
, 'i'); // case insensitive
But a regex literal is so much more convenient (notice how I had to escape the \) that I'd rather separate the regex from the comments: just put some comments before your regex, not inside.
EDIT 2018: This question and answer are very old. EcmaScript now offers new ways to handle this, and more precisely template strings.
For example I now use this simple utility in node:
module.exports = function(tmpl){
let [, source, flags] = tmpl.raw.toString()
.replace(/\s*(\/\/.*)?$\s*/gm, "") // remove comments and spaces at both ends of lines
.match(/^\/?(.*?)(?:\/(\w+))?$/); // extracts source and flags
return new RegExp(source, flags);
}
which lets me do things like this:
const regex = rex`
^ // start of string
[a-z]+ // some letters
bla(\d+)
$ // end
/ig`;
console.log(regex); // /^[a-z]+bla(\d+)$/ig
console.log("Totobla58".match(regex)); // [ 'Totobla58' ]
Now with the grave backticky things, you can do inline comments with a little finagling. Note that in the example below there are some assumptions being made about what won't appear in the strings being matched, especially regarding the whitespace. But I think often you can make intentional assumptions like that, if you write the process() function carefully. If not, there are probably creative ways to define the little "mini-language extension" to regexes in such a way as to make it work.
function process() {
var regex = new RegExp("\\s*([^#]*?)\\s*#.*$", "mg");
var output = "";
var result; // declared locally so the loop does not create an implicit global
while ((result = regex.exec(arguments[0])) !== null ){
output += result[1];
}
return output;
}
var a = new RegExp(process `
^f # matches the first letter f
.* # matches stuff in the middle
h # matches the letter 'h'
`);
console.log(a);
console.log(a.test("fish"));
console.log(a.test("frog"));
Also, to the OP, just because I feel a need to say this, this is neato, but if your resulting code turns out just as verbose as the string concatenation or if it takes you 6 hours to figure out the right regexes and you are the only one on your team who will bother to use it, maybe there are better uses of your time...
I hope you know that I am only this blunt with you because I value our friendship.

Why doesn't this RegExp work and which notation is more standards compliant?

Disclaimer: I realize asking "Why doesn't my regular expression work" is pretty amateur.
I have looked at the documentation, though I'm just plain struggling. I have a url (as a string) and what I want is to replace the placeholders (i.e. {objectID} and {queryTerm}).
For a while now, I've been making attempts like this:
var _serviceURL = "http://my-server.com/rest-services/someObject/{objectID}/entries?term={queryTerm}";
var re1 = new RegExp("/{([A-Za-z])+}","gi");
var re2 = new RegExp("/{([A-Za-z]+)}+","gi");
var re3 = new RegExp("/{([A-Za-z])+}","gi");
var re4 = new RegExp("/({[A-Za-z]+})+","gi");
var re5 = new RegExp("({[A-Za-z]+})+","gi");
var re6 = new RegExp("({[A-Za-z]}+)*","g");
var re6a = new RegExp("({([a-z]+)})+","gi");
var re7 = /{([^}]+)}/g;
var tokens = re6a.exec(_serviceURL);
if (null != tokens.length ){
for(i = 0; i < tokens.length; i++){
var t = tokens[i];
console.log("tokens[i]: " + t);
}
}
else {
console.log("RegEx fail...")
}
re6a above produces an array like this upon execution:
tokens: Array[3]
0: "{objectID}"
1: "{objectID}"
2: "objectID"
Related to the scenario above:
Why is it I'm never getting the queryTerm ?
Does the RegExp 'i' (ignore case) flag mean I can list a character class like [a-z] and also capture [A-Z] ?
Which method of constructing a RegExp object is better? ...new RegExp(...); or var regExp = /{([^}]+)}/g; . In terms of "what's better", what I mean is cross-browser compatibility and as similar to other RegEx implementations (if I'm learning RegEx, I want to get the most value I can out of it).
Does the RegExp i (ignore case) flag mean I can list a character class like [a-z] and also capture [A-Z]?
Yes, it'll capture them all.
Which method of constructing a RegExp object is better? new RegExp(...) or var regExp = /{([^}]+)}/g;? In terms of "what's better", what I mean is cross-browser compatibility and as similar to other RegEx implementations (if I'm learning RegEx, I want to get the most value I can out of it).
You should definitely use the literal notation.
It gets compiled once at runtime, instead of every time you use it.
They're both equally cross browser compatible.
All that said, I'd use this:
_serviceURL.match(/[^{}]+(?=})/g);
Here's the fiddle: http://jsfiddle.net/CAugU/
Here's an explanation of the above regex:
[ opens the character set
^ negates the set. Will only match whatever is NOT in these brackets
{} match anything that is NOT a curly brace
] close the character set
+ match that as many times as possible
(?= ascertain that it is possible to match the following here (won't be included in the match, this is called a lookahead)
} match a curly brace
) close the lookahead
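As for why queryTerm never showed up in the question's code: a single `exec` call returns only the first match (plus its capture groups), even with the `g` flag. Here is a small sketch that contrasts `match` with an `exec` loop. The variable names are mine, and the second pattern is the question's `re7` with the braces escaped:

```javascript
var serviceURL = "http://my-server.com/rest-services/someObject/{objectID}/entries?term={queryTerm}";

// match() with the g flag returns all matches at once (without capture groups).
var names = serviceURL.match(/[^{}]+(?=})/g);
console.log(names); // ["objectID", "queryTerm"]

// exec() returns one match per call. With the g flag it resumes from lastIndex,
// so it has to be called in a loop to reach the second placeholder.
var re = /\{([^}]+)\}/g;
var found = [];
var m;
while ((m = re.exec(serviceURL)) !== null) {
  found.push(m[1]); // m[0] is "{objectID}", m[1] is "objectID"
}
console.log(found); // ["objectID", "queryTerm"]
```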
As you're going to replace the placeholders, it seems more natural to use replace rather than match, for example:
var _serviceURL = "http://my-server.com/rest-services/someObject/{objectID}/entries?term={queryTerm}"
var values = {
objectID: 1234,
queryTerm: "hello"
}
var result = _serviceURL.replace(/{(.+?)}/g, function($0, $1) {
return values[$1]
})
yields http://my-server.com/rest-services/someObject/1234/entries?term=hello
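One thing to watch with this approach: a placeholder that has no entry in `values` is replaced with the string "undefined". Below is a hedged variant that leaves unknown placeholders untouched and URL-encodes the values. The function name `fillPlaceholders` and the use of `encodeURIComponent` are my assumptions about what you'd want, not something the question asked for:

```javascript
var template = "http://my-server.com/rest-services/someObject/{objectID}/entries?term={queryTerm}";

// Hypothetical helper: substitutes {name} with params[name] when it exists.
function fillPlaceholders(url, params) {
  return url.replace(/\{([^{}]+)\}/g, function (whole, name) {
    // Leave unknown placeholders as they are instead of inserting "undefined".
    if (!Object.prototype.hasOwnProperty.call(params, name)) return whole;
    // Assumption: values should be safe to embed in a URL.
    return encodeURIComponent(params[name]);
  });
}

console.log(fillPlaceholders(template, { objectID: 1234, queryTerm: "hello world" }));
// http://my-server.com/rest-services/someObject/1234/entries?term=hello%20world
console.log(fillPlaceholders(template, { objectID: 1 }));
// http://my-server.com/rest-services/someObject/1/entries?term={queryTerm}
```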

Javascript regular expression to replace word but not within curly brackets

I have some content, for example:
If you have a question, ask for help on StackOverflow
I have a list of synonyms:
a={one typical|only one|one single|one sole|merely one|just one|one unitary|one small|this solitary|this slight}
ask={question|inquire of|seek information from|put a question to|demand|request|expect|inquire|query|interrogate}
I'm using JavaScript to:
Split synonyms based on =
Looping through every synonym, if found in content replace with {...|...}
The output should look like:
If you have {one typical|only one|one single|one sole|merely one|just one|one unitary|one small|this solitary|this slight} question, {question|inquire of|seek information from|put a question to|demand|request|expect|inquire|query|interrogate} for help on StackOverflow
Problem:
Instead of replacing the entire word, it's replacing every character found. My code:
for(syn in allSyn) {
var rtnSyn = allSyn[syn].split("=");
var word = rtnSyn[0];
var synonym = (rtnSyn[1]).trim();
if(word && synonym){
var match = new RegExp(word, "ig");
postProcessContent = preProcessContent.replace(match, synonym);
preProcessContent = postProcessContent;
}
}
It should replace content word with synonym which should not be in {...|...}.
When you build the regexps, you need to include word boundary anchors at both the beginning and the end to match whole words (beginning and ending with characters from [a-zA-Z0-9_]) only:
var match = new RegExp("\\b" + word + "\\b", "ig");
Depending on the specific replacements you are making, you might want to apply your method to individual words (rather than to the entire text at once) matched using a regexp like /\w+/g to avoid replacing words that themselves are the replacements for others. Something like:
content = content.replace(/\w+/g, function(word) {
for(var i = 0, L = allSyn.length; i < L; ++i) {
var rtnSyn = allSyn[i].split("="); // index with the loop counter i
var synonym = (rtnSyn[1]).trim();
if(synonym && rtnSyn[0].toLowerCase() == word.toLowerCase()) return synonym;
}
return word; // no synonym found: keep the original word
});
Regular expressions include something called a "word-boundary", represented by \b. It is a zero-width assertion (it just checks something, it doesn't "eat" input) that says in order to match, certain word boundary conditions have to apply. One example is a space followed by a letter; given the string ' X', this regex would match it: / \bX/. So to make your code work, you just have to add word boundaries to the beginning and end of your word regex, like this:
for(syn in allSyn) {
var rtnSyn = allSyn[syn].split("=");
var word = rtnSyn[0];
var synonym = (rtnSyn[1]).trim();
if(word && synonym){
var match = new RegExp("\\b"+word+"\\b", "ig");
postProcessContent = preProcessContent.replace(match, synonym);
preProcessContent = postProcessContent;
}
}
[Note that there are two backslashes in each of the word boundary matchers because in javascript strings, the backslash is for escape characters -- two backslashes turns into a literal backslash.]
For optimization, don't create a new RegExp on each iteration. Instead, build up a big regex like [^{A-Za-z](a|ask|...)[^}A-Za-z] and a hash with a value for each key specifying what to replace it with. I'm not familiar enough with JavaScript to create the code on the fly.
Note the separator regex which says the match cannot begin with { or end with }. This is not terribly precise, but hopefully acceptable in practice. If you genuinely need to replace words next to { or } then this can certainly be refined, but I'm hoping we won't have to.
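Here is a rough JavaScript sketch of that "one big regex plus a lookup table" idea. To keep it short it uses `\b` word boundaries rather than the `[^{A-Za-z]...[^}A-Za-z]` separators, and the synonym list is a shortened, hypothetical one. The helper names (`escapeRegExp`, `spin`) are also just for illustration:

```javascript
// Escape regex metacharacters so a word is always matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical, shortened synonym list in the same "word={a|b}" format.
var allSyn = [
  "a={one typical|only one}",
  "ask={question|inquire}"
];

// Lookup table: lower-cased word -> replacement group.
var table = Object.create(null);
allSyn.forEach(function (line) {
  var idx = line.indexOf("=");
  if (idx < 0) return; // skip malformed entries
  var word = line.slice(0, idx).trim().toLowerCase();
  var synonym = line.slice(idx + 1).trim();
  if (word && synonym) table[word] = synonym;
});

// Longer words first, so "ask" is tried before "a" at the same position.
var words = Object.keys(table).sort(function (x, y) { return y.length - x.length; });
var combined = new RegExp("\\b(?:" + words.map(escapeRegExp).join("|") + ")\\b", "gi");

function spin(content) {
  // Single pass over the original text: the inserted "{...|...}" groups
  // are never rescanned, so words inside them are not replaced again.
  return content.replace(combined, function (match) {
    return table[match.toLowerCase()];
  });
}

console.log(spin("If you have a question, ask for help on StackOverflow"));
// If you have {one typical|only one} question, {question|inquire} for help on StackOverflow
```

Because the replacement happens in one pass, the synonyms you insert are safe. Text that already contains `{...|...}` groups before the call would still be scanned, so that case would need the stricter separators described above.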
