regular expressions difference between ((?:[^\"])*) and ([^\"]*) - javascript

What is the difference between these regular expressions? Are they interchangeable?
((?:[^\"])*)
([^\"]*)
background to this question:
The javascript WYSIWYG editor (tinymce) fails to parse my html code
in Firefox (23.0.1 and 25.0a2) but works in Chrome.
I found the regular expression to blame:
attrRegExp = /([\w:\-]+)(?:\s*=\s*(?:(?:\"((?:[^\"])*)\")|(?:\'((?:[^\'])*)\')|([^>\s]+)))?/g;
which I modified, replacing
((?:[^\"])*)
with
([^\"]*)
and
((?:[^\'])*)
with
([^\']*)
the resulting regular expression is working in both browsers for my test case
attrRegExp = /([\w:\-]+)(?:\s*=\s*(?:(?:\"([^\"]*)\")|(?:\'([^\']*)\')|([^>\s]+)))?/g
Can someone shed some light on that?
my test data that only works with the modified regular expression is a big image >700 kb like:
var testdata = '<img alt="" src="data:image/jpeg;base64,/9j/4AAQSkZJRgA...5PmDk4FOGOHy6S3JW120W1uCJ5M0PBa54edOFAc8ePX/2Q==">'
doing something like that to test:
testdata.match(attrRegExp);
especially when the test data is big the unmodified regex is likely to fail in firefox.
You can find the jsfiddle example here:

There should be no difference in the result. So you should be fine.
However, there might be a big difference in how RegExp engines will process these two expressions, and in the case of Firefox/Safari you just proved there actually is ;)
Firefox makes use of WebKit/JavaScriptCore YARR.
YARR imposes an arbitrary, artificial limit, which the non-capturing group variant hits:
// The below limit restricts the number of "recursive" match calls in order to
// avoid spending exponential time on complex regular expressions.
static const unsigned matchLimit = 1000000;
As such Safari is affected as well.
See the relevant Webkit bug and relevant Firefox bug and the nice test case comparing different expression types somebody put together.
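The claim that both patterns produce the same result can be checked on a small input. The sketch below uses a short, made-up img tag (the sample string is an assumption, not the asker's 700 kB data) and compares the two match arrays. It says nothing about the engine limit itself, which only shows up on large inputs.

```javascript
// Original pattern from the question: non-capturing group around each character.
const original = /([\w:\-]+)(?:\s*=\s*(?:(?:\"((?:[^\"])*)\")|(?:\'((?:[^\'])*)\')|([^>\s]+)))?/g;
// Modified pattern: the quantifier is applied directly to the character class.
const modified = /([\w:\-]+)(?:\s*=\s*(?:(?:\"([^\"]*)\")|(?:\'([^\']*)\')|([^>\s]+)))?/g;

// Hypothetical small sample; the real test data was a large base64 image.
const sample = '<img alt="" src="data:image/jpeg;base64,AAAA==" width=10>';

const a = sample.match(original);
const b = sample.match(modified);

// Compare the two result arrays element by element via their JSON form.
const sameResult = JSON.stringify(a) === JSON.stringify(b);
console.log(sameResult);
```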

Related

Get an example matched text from a regex pattern [duplicate]

Is there any way of generating random text which satisfies provided regular expression.
I am looking for a function which works like below
var reg = Some Regular Expression
var str = RandString(reg)
I have seen fairly good solutions in perl and ruby on github, but I think there are technical issues that make a complete solution impossible. For example, /[0-9]+/ has an infinite upper bound, which is not practical for selecting random numbers from.
Never seen it in JavaScript, but you could translate.
EDIT: After googling for a few seconds...
https://github.com/fent/randexp.js
if you know what the regular expression is, you can just generate random strings, then use a function that references the index of the letters and changes them as needed. Regex expressions vary widely, so it will be difficult to find one in particular that satisfies all possible regex.
Your question is pretty open so hopefully this steers you to the right solution. Get the current time (in seconds), MD5 it, check it against a REGEX, return the match.
Running Example: http://jsfiddle.net/MattLo/3gKrb/
Usage: RandString(/([A-Za-z])/ig); // expected to be a string
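The generate-and-check idea in the answer above can be sketched roughly as follows. This is only an illustration: the helper name randStringBySampling and its parameters are made up here, and it draws from Math.random over a caller-supplied alphabet instead of hashing the current time. It is practical only when a reasonable share of random strings match the pattern.

```javascript
// Hypothetical helper: draws random strings and returns the first one the regex accepts.
function randStringBySampling(regex, alphabet, length, maxTries) {
  for (let attempt = 0; attempt < maxTries; attempt++) {
    let candidate = '';
    for (let i = 0; i < length; i++) {
      candidate += alphabet[Math.floor(Math.random() * alphabet.length)];
    }
    regex.lastIndex = 0; // keep global/sticky regexes from carrying state between tests
    if (regex.test(candidate)) {
      return candidate;
    }
  }
  return null; // no match found within the try budget
}

// Every 3-letter string over "abc" matches, so the first draw is accepted.
const sample = randStringBySampling(/^[a-c]{3}$/, 'abc', 3, 100);
console.log(sample);
```

Libraries such as randexp.js take the opposite approach and build a string from the pattern itself, which avoids wasted tries.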
For JavaScript, the following modules can generate a random match to a regex:
pxeger
randexp.js
regexgen

End-of-string regex match too slow

Demo here. The regex:
([^>]+)$
I want to match text at the end of a HTML snippet that is not contained in a tag (i.e., a trailing text node). The regex above seems like the simplest match, but the execution time seems to scale linearly with the length of the match-text (and has caused hangs in the wild when used in my browser extension). It's also equally slow for matching and non-matching text.
Why is this seemingly simple regex so bad?
(I also tried RegexBuddy but can't seem to get an explanation from it.)
Edit: Here's a snippet for testing the various regexes (click "Run" in the console area).
Edit 2: And a no-match test.
Consider an input like this
abc<def>xyz
With your original expression, ([^>]+)$, the engine starts from a, fails on >, backtracks, restarts from b, then from c etc. So yes, the time grows with size of the input. If, however, you force the engine to consume everything up to the rightmost > first, as in:
.+>([^>]+)$
the backtracking will be limited by the length of the last segment, no matter how much input is before it.
The second expression is not equivalent to the first one, but since you're using grouping, it doesn't matter much, just pick matches[1].
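As a quick check, both expressions capture the same trailing text on the sample input. The sketch also adds a regex-free alternative based on lastIndexOf; that helper (trailingText) is not from the answer, just a swapped-in technique that avoids backtracking altogether.

```javascript
const input = 'abc<def>xyz';

// Original expression: the engine retries from every start position.
const slow = input.match(/([^>]+)$/);

// Anchored variant: consume everything up to the rightmost ">" first.
const anchored = input.match(/.+>([^>]+)$/);

// Hypothetical regex-free helper: take whatever follows the last ">".
// If there is no ">", lastIndexOf returns -1 and the whole string is returned.
function trailingText(html) {
  return html.slice(html.lastIndexOf('>') + 1);
}

console.log(slow[1], anchored[1], trailingText(input));
```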
Hint: even when you target javascript, switch to the pcre mode, which gives you access to the step info and debugger:
(look at the green bars!)
Instead of a regex, which is time-consuming here, you could use the actual DOM:
var html = "<div><span>blabla</span></div><div>bla</div>Here I am !";
var temp = document.createElement('div');
temp.innerHTML = html; // let the browser parse the snippet
var lastNode = temp.lastChild;
// nodeType 3 is a text node
if (lastNode && lastNode.nodeType === 3) {
    alert(lastNode.nodeValue);
}

Preparing a regular expression for javascript

I have made this regular expression which does exactly what I want when I test it in e.g. RegExr:
^https?:\/\/(www\.)?(test\.yahoo\.com|sub\.yahoo\.com)?(?!([a-z0-9]+\.)?(localhost|yahoo\.com))(.*)?
However when I test it in javascript it says that the expression is invalid. After hours of debugging I found out that this expression works in javascript:
^https?:\/\/(www\.)?(test\.yahoo\.com|sub\.yahoo\.com)?(?![a-z0-9]+\.)?(localhost|yahoo\.com)(.*)?
However this doesn't do what I want (again testing in RegExr).
Why can't I use the first expression in javascript? And how do I fix it?
UPDATE JULY 25
Sorry for the lack of info. The way I am using the Regexp is through a jQuery extension which lets me select using regexp. The script can be seen here: http://james.padolsey.com/javascript/regex-selector-for-jquery/
The specific code I am trying to get to work is:
$('a:regex(href, ^https?:\/\/(www\.)?(test\.yahoo\.com|sub\.yahoo\.com)?(?!([a-z0-9]+\.)?(localhost|yahoo\.com))(.*)?)').live('click', function(e) {
After including the linked jQuery plugin. The text strings I am testing are:
http://yahoo.com
http://google.dk
http://subdomain.yahoo.com
http://test.yahoo.com
http://localhost.dk
http://sub.yahoo.com/lalala
It is supposed to match "http://google.dk", "http://test.yahoo.com" and "http://sub.yahoo.com/lalala", which it does in RegExr, but it fails (invalid expression) with the jQuery plugin.
The first regular expression is not invalid:
var regexp = /^https?:\/\/(www\.)?(test\.yahoo\.com|sub\.yahoo\.com)?(?!([a-z0-9]+\.)?(localhost|yahoo\.com))(.*)?/;
works fine.
If you want to instantiate the expression from a string, you have to double all the backslashes:
var regexp = new RegExp("^https?:\\/\\/(www\\.)?(test\\.yahoo\\.com|sub\\.yahoo\\.com)?(?!([a-z0-9]+\\.)?(localhost|yahoo\\.com))(.*)?");
When you start from a string, you have to account for the fact that the string constant itself uses backslashes as a quoting mechanism, so there will be two evaluations made: one as a string, and one as a regular expression.
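The two evaluations can be seen in a small sketch. It uses a shortened, made-up pattern (not the full expression from the question) to compare a regex literal with the RegExp constructor, and shows what happens when the backslashes are not doubled.

```javascript
// Regex literal: backslashes are read once, by the regex parser.
const fromLiteral = /^https?:\/\/(www\.)?yahoo\.com/;

// String form: each backslash is doubled so that one survives string parsing.
const fromString = new RegExp("^https?:\\/\\/(www\\.)?yahoo\\.com");

const sameSource = fromLiteral.source === fromString.source;

// Single backslashes are consumed by the string literal, so "\." reaches
// RegExp as a bare "." and matches any character.
const underEscaped = new RegExp("^https?:\/\/(www\.)?yahoo\.com");
const matchesWrongHost = underEscaped.test("http://yahooXcom");

console.log(sameSource, matchesWrongHost);
```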
edit — OK I think I see the problem. That plugin you're trying to use is simply attempting to do something that's just not going to work, given the way that Sizzle parses selectors. In other words, the problem is not with your regular expression, it's with the overall selector. It is not even getting far enough to parse the regular expression.
Specifically, the problem seems to be nested parentheses inside the regular expression. Something as simple as
$('a:regex(href, ((abc)))')
causes an error. You can instead do something like this:
$('a').filter(function() {
    return /^https?:\/\/(www\.)?(test\.yahoo\.com|sub\.yahoo\.com)?(?!([a-z0-9]+\.)?(localhost|yahoo\.com))(.*)?/.test(this.href);
}).whatever( ... );
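The same filter idea can be tried without jQuery or a DOM. The sketch below applies the question's expression to the six test strings using a plain array (the variable names are made up for illustration) and keeps the ones that match.

```javascript
// The expression from the question, used as a regex literal.
const pattern = /^https?:\/\/(www\.)?(test\.yahoo\.com|sub\.yahoo\.com)?(?!([a-z0-9]+\.)?(localhost|yahoo\.com))(.*)?/;

// Test strings listed in the question.
const urls = [
  'http://yahoo.com',
  'http://google.dk',
  'http://subdomain.yahoo.com',
  'http://test.yahoo.com',
  'http://localhost.dk',
  'http://sub.yahoo.com/lalala'
];

// Plain Array.prototype.filter stands in for jQuery's .filter() here.
const matching = urls.filter(function (url) {
  return pattern.test(url);
});

console.log(matching);
```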

Is < intentionally ignored in Javascript RegExp as an assertion (as JSLint suggests)?

I ran JSLint on some inherited code, and received the following:
Problem at line 24 character 36: Unexpected '\'.
content = content.replace(/\<a href=/g, '<a target="blank" href=');
I've been Googling for a while and found this which suggests that I don't have to escape less thans in a Javascript replace.
HTML += '<div class="articleBody">'
+ this.myData.record[0][0][5].replace(/<\/?a[^>]*>/gi,'') + '</div>';
But my standard regExp intro would suggest they do, as do some other random sites, that last even talking specifically about Javascript (though its own examples suggest JSLint is right).
The next assertion characters match at the beginning and end of a
word, they are:
< and >
Though Mozilla's pages (replace() and RegExp) don't say it does do assertions with < & >, I've been unable to find a place that explicitly says Javascript intentionally doesn't do assertions with < in its RegExp/replace method. That is, I haven't found anywhere that says a Javascript implementation that escapes < is wrong. And, indeed, the escaped or unescaped < seems to work fine. Admittedly any escaped char that isn't reserved seems to work fine -- for example, \e === e, though \t !== t wrt replace().
Aside: I do realize that not all RegExp implementations are equal, and realize Javascript doesn't do, for example, look-behind. But that's pretty common knowledge. The assertions I'm having a harder time finding.
Can someone put this to bed for me? Is there someplace to find that It Is Written that < is intentionally ignored as a marker of an assertion in Javascript RegExp and that anything to the contrary is incorrect behavior?
JSLint is lint, it tells you your code is problematic (according to the author's standard), but that does not mean your code is semantically wrong. People escape < and > to avoid the piece of Javascript code being interpreted as HTML, although escaping < in your case is useless.
It is guaranteed that, in JS Regex, \< is equivalent to <. (ECMA-262 §15.10.2, IdentityEscape). Basically, any escaped non-alphanumeric character is equal to itself. (However, \e does not seem to be defined in the standard.) It is unrelated to assertions (?<=…).
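A short check of the identity escape, using a made-up HTML string. One caveat that goes beyond the answer above: in regexes created with the u flag (added in later editions of the standard), identity escapes are restricted, so \< is rejected there.

```javascript
const html = '<a href="x">link</a>';

// Escaped and unescaped "<" behave the same in a regex without the u flag.
const escaped = html.replace(/\<a href=/g, '<a target="blank" href=');
const unescaped = html.replace(/<a href=/g, '<a target="blank" href=');
const identical = escaped === unescaped;

// With the u flag the same escape is a syntax error.
let rejectedInUnicodeMode = false;
try {
  new RegExp('\\<', 'u');
} catch (e) {
  rejectedInUnicodeMode = e instanceof SyntaxError;
}

console.log(identical, rejectedInUnicodeMode);
```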
here you go, found on http://www.regular-expressions.info/lookaround.html
"Finally, flavors like JavaScript, Ruby and Tcl do not support lookbehind at all, even though they do support lookahead."
See from Important Notes About Lookbehind

JavaScript: indexOf vs. Match when Searching Strings?

Readability aside, are there any discernable differences (performance perhaps) between using
str.indexOf("src")
and
str.match(/src/)
I personally prefer match (and regexp) but colleagues seem to go the other way. We were wondering if it mattered ...?
EDIT:
I should have said at the outset that this is for functions that will be doing partial plain-string matching (to pick up identifiers in class attributes for JQuery) rather than full regexp searches with wildcards etc.
class='redBorder DisablesGuiClass-2345-2d73-83hf-8293'
So it's the difference between:
string.indexOf('DisablesGuiClass-');
and
string.match(/DisablesGuiClass-/)
RegExp is indeed slower than indexOf (you can see it here), though normally this shouldn't be an issue. With RegExp, you also have to make sure the string is properly escaped, which is an extra thing to think about.
Both of those issues aside, if two tools do exactly what you need them to, why not choose the simpler one?
Your comparison may not be entirely fair. indexOf is used with plain strings and is therefore very fast; match takes a regular expression - of course it may be slower in comparison, but if you want to do a regex match, you won't get far with indexOf. On the other hand, regular expression engines can be optimized, and have been improving in performance in the last years.
In your case, where you're looking for a verbatim string, indexOf should be sufficient. There is still one application for regexes, though: If you need to match entire words and want to avoid matching substrings, then regular expressions give you "word boundary anchors". For example:
indexOf('bar')
will find bar three times in bar, fubar, barmy, whereas
match(/\bbar\b/)
will only match bar when it is not part of a longer word.
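The difference can be counted directly. In this sketch countOccurrences is a hypothetical helper that loops over indexOf; the regex side uses the word-boundary pattern from the answer with the g flag added so that all matches are returned.

```javascript
// Hypothetical helper: count non-overlapping substring hits with indexOf.
function countOccurrences(text, needle) {
  let count = 0;
  let from = text.indexOf(needle);
  while (from !== -1) {
    count++;
    from = text.indexOf(needle, from + needle.length);
  }
  return count;
}

const text = 'bar, fubar, barmy';

// Plain substring search finds "bar" inside fubar and barmy too.
const substringHits = countOccurrences(text, 'bar');

// Word-boundary anchors only accept the standalone word.
const wholeWordHits = (text.match(/\bbar\b/g) || []).length;

console.log(substringHits, wholeWordHits);
```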
As you can see in the comments, some comparisons have been done that show that a regex may be faster than indexOf - if it's performance-critical, you may need to profile your code.
Here are the main ways to search for a substring:
// 1. includes (introduced in ES6)
var string = "string to search for substring",
    substring = "sea";
string.includes(substring);
// 2. string.indexOf
var string = "string to search for substring",
    substring = "sea";
string.indexOf(substring) !== -1;
// 3. RegExp: test
var string = "string to search for substring",
    expr = /sea/; // no quotes here
expr.test(string);
// 4. string.match
var string = "string to search for substring",
    expr = /sea/; // a regex literal; the string "/sea/" would search for the slashes too
string.match(expr);
// 5. string.search
var string = "string to search for substring",
    expr = /sea/; // again a regex literal, not a quoted string
string.search(expr);
Source: https://koukia.ca/top-6-ways-to-search-for-a-string-in-javascript-and-performance-benchmarks-ce3e9b81ad31
The benchmarks there seem skewed, especially for ES6 includes; read the comments.
In summary, if you don't need the matches:
=> If you need a regex, use test. Otherwise use ES6 includes or indexOf. Even so, test and indexOf are close.
As for includes vs indexOf:
They seem to be the same: https://jsperf.com/array-indexof-vs-includes/4 (it would be weird if they differed; they mostly perform the same apart from the differences in what they expose).
Here is my own benchmark: http://jsben.ch/fFnA0
You can run it yourself (it's browser dependent; run it multiple times).
Here is how it performed: over multiple runs indexOf and includes took turns beating each other, and they stayed close, so they are effectively the same. (This uses the same test platform as the article above.)
And here is the long text version (8 times longer):
http://jsben.ch/wSBA2
I tested both Chrome and Firefox, with the same result.
Note that jsben.ch doesn't handle memory overflow (or its limits) correctly and doesn't show any message, so results can be wrong if you duplicate the text more than 8 times (8 works well). The conclusion is that for very long text all three perform the same. For short text, indexOf and includes are the same and test is a little slower, or the same, as it seemed in Chrome (in Firefox 60 it is slower).
A note on jsben.ch: don't freak out if you get inconsistent results. Run it at different times and see whether it's consistent. Change browsers; sometimes a run is just totally wrong, because of a bug, bad memory handling, or something else.
Here, too, are my benchmarks on jsperf (more detail, with graphs for multiple browsers; Chrome is on top):
Normal text
https://jsperf.com/indexof-vs-includes-vs-test-2019
Summary: includes and indexOf have the same performance; test is slower.
(All three seem to perform the same in Chrome.)
Long text (12 times longer than normal)
https://jsperf.com/indexof-vs-includes-vs-test-2019-long-text-str/
Summary: all three perform the same (Chrome and Firefox).
Very short string
https://jsperf.com/indexof-vs-includes-vs-test-2019-too-short-string/
Summary: includes and indexOf perform the same and test is slower.
Note about the benchmark above: for the very short string version, jsperf showed a large error for Chrome. Watching the run myself, around 60 samples ran for both indexOf and includes (repeated many times), and slightly fewer for test, which makes it slower.
Don't be fooled by the wrong graph; it is clearly wrong. The same test works fine in Firefox, so it is surely a bug.
Here is the illustration (the first image was the test on Firefox):
Suddenly indexOf became superman. But as I said, I watched the test, and the number of samples was around 60 for both indexOf and includes, and they performed the same. It is a bug in jsperf. Apart from this one (maybe caused by a memory restriction), all the other results were consistent. jsperf gives more detail, and you can see how many samples run in real time.
Final summary
indexOf vs includes => same performance
test => can be slower for short strings or text, and the same for long texts. That makes sense given the overhead the regex engine adds. In Chrome it didn't seem to matter at all.
If you're trying to search for substring occurrences case-insensitively then match seems to be faster than a combination of indexOf and toLowerCase()
Check here - http://jsperf.com/regexp-vs-indexof/152
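The two case-insensitive approaches being compared look roughly like this. The sample string is made up, and the sketch only shows that both give the same answer; it makes no claim about which is faster.

```javascript
// Hypothetical attribute text with mixed case.
const haystack = 'class="redBorder DisablesGuiClass-2345"';

// Regex with the i flag: case-insensitive in one step.
const viaRegex = /disablesguiclass-/i.test(haystack);

// indexOf needs the haystack lowercased first (and a lowercase needle).
const viaIndexOf = haystack.toLowerCase().indexOf('disablesguiclass-') !== -1;

console.log(viaRegex, viaIndexOf);
```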
You ask whether str.indexOf('target') or str.match(/target/) should be preferred. As other posters have suggested, the use cases and return types of these methods are different. The first asks "where in str can I first find 'target'?" The second asks "does str match the regex and, if so, what are all of the matches for any associated capture groups?"
The issue is that neither one technically is designed to ask the simpler question "does the string contain the substring?" There is something that is explicitly designed to do so:
var doesStringContainTarget = /target/.test(str);
There are several advantages to using regex.test(string):
It returns a boolean, which is what you care about
It is more performant than str.match(/target/) (and rivals str.indexOf('target'))
If for some reason, str is undefined or null, you'll get false (the desired result) instead of throwing a TypeError
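The last point can be seen in a small sketch. Note that test does not special-case missing values: it converts its argument to a string first, so undefined is searched as the text "undefined". That gives false for most patterns, but not for one that happens to match that text.

```javascript
let str; // deliberately left undefined

// test converts its argument to a string ("undefined") and searches that.
const viaTest = /target/.test(str);

// Calling a method on undefined throws before any search happens.
let threwTypeError = false;
try {
  str.indexOf('target');
} catch (e) {
  threwTypeError = e instanceof TypeError;
}

// Edge case of the string conversion: this pattern matches "undefined".
const surprising = /undef/.test(str);

console.log(viaTest, threwTypeError, surprising);
```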
Using indexOf should, in theory, be faster than a regex when you're just searching for some plain text, but you should do some comparative benchmarks yourself if you're concerned about performance.
If you prefer match and it's fast enough for your needs then go for it.
For what it's worth, I agree with your colleagues on this: I'd use indexOf when searching for a plain string, and use match etc only when I need the extra functionality provided by regular expressions.
Performance wise indexOf will at the very least be slightly faster than match. It all comes down to the specific implementation. When deciding which to use ask yourself the following question:
Will an integer index suffice or do I need the functionality of a RegExp match result?
The return values are different
Aside from the performance implications, which are addressed by other answers, it is important to note that the return values for each method are different; so the methods cannot merely be substituted without also changing your logic.
Return value of .indexOf: integer
The index within the calling String object of the first occurrence of the specified value, starting the search at fromIndex. Returns -1 if the value is not found.
Return value of .match: array
An Array containing the entire match result and any parentheses-captured matched results. Returns null if there were no matches.
Because .indexOf returns 0 if the calling string begins with the specified value, a simple truthy test will fail.
For example:
Given this class…
class='DisablesGuiClass-2345-2d73-83hf-8293 redBorder'
…the return values for each would differ:
// returns `0`, evaluates to `false`
if (string.indexOf('DisablesGuiClass-')) {
… // this block is skipped.
}
vs.
// returns `["DisablesGuiClass-"]`, evaluates to `true`
if (string.match(/DisablesGuiClass-/)) {
… // this block is run.
}
The correct way to run a truthy test with the return from .indexOf is to test against -1:
if (string.indexOf('DisablesGuiClass-') !== -1) {
// ^returns `0` ^evaluates to `true`
… // this block is run.
}
Remember that Internet Explorer 8 doesn't support indexOf on arrays (String indexOf, as used in the question, does work there).
If none of your users are on IE8 (Google Analytics would tell you), you can ignore this answer.
Possible fix for IE8:
How to fix Array indexOf() in JavaScript for Internet Explorer browsers
