What is the algorithm of the search() function? - javascript

Does anybody know what algorithm JavaScript uses for the search() function?
var myRegExp = /Alex/;
var string1 = "Today John went to the store and talked with Alex.";
var matchPos1 = string1.search(myRegExp);
if (matchPos1 != -1)
    document.write("There was a match at position " + matchPos1);
else
    document.write("There was no match in the first string");
Example copied from tizaq.com
I need to use this function to search a text document for different string values. But I need to document what the algorithm behind this method is, and what the complexity is. Otherwise I have to write my own method that searches the text file that I have.

The specification says it's implemented as a regular expression match:
3) If Type(regexp) is Object and the value of the [[Class]] internal
property of regexp is "RegExp", then let rx be regexp;
4) Else, let rx be a new RegExp object created as if by the
expression new RegExp( regexp) where RegExp is the standard built-in
constructor with that name.
5) Search the value string from its beginning for an occurrence of
the regular expression pattern rx. Let result be a Number indicating
the offset within string where the pattern matched, or –1 if there was
no match. (...)
(Section 15.5.4.12 String.prototype.search (regexp)).
This means your question boils down to the regex matching algorithm. But that is not in the specification either; it depends on the implementation:
The value of the [[Match]] internal property is an implementation dependent representation of the Pattern of the RegExp object.
(Section 15.10.7 Properties of RegExp Instances).
So, if documenting the complexity of that algorithm is really a requirement, I guess you'll have to write your own method. But keep in mind that, by doing that, you'll probably come up with something less efficient, and probably dependent on other built-in methods whose complexity is unknown (maybe even RegExp itself). So, can't you convince the powers that be that documenting the complexity of a built-in, implementation-dependent js method is not your job?
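To make the "it is just a regex match" point concrete, here is a small sketch (the "a.b" sample string is made up for illustration). It shows that search() compiles a plain string argument into a RegExp, whereas indexOf() does a literal substring search. The specification fixes the complexity of neither method.

```javascript
// search() turns a non-RegExp argument into a RegExp, so "." means "any character".
const text = "Today John went to the store and talked with Alex.";

const viaRegex = text.search(/Alex/);      // regex match: offset of the match, or -1
const viaIndexOf = text.indexOf("Alex");   // literal substring search, same offset here

// The difference shows up with regex metacharacters:
const dotAsPattern = "a.b".search(".");    // "." is compiled as a pattern and matches "a"
const dotAsLiteral = "a.b".indexOf(".");   // "." is a literal character

console.log(viaRegex, viaIndexOf, dotAsPattern, dotAsLiteral);
```

If only fixed strings need to be found, indexOf() avoids the regex engine entirely, although its algorithm is just as implementation-dependent.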

Related

When does Javascript "initialize" a RegEx pattern?

I take care to declare a RegEx pattern once and reuse if possible, for performance reasons. I'm not entirely certain why - something I probably read once many years ago and has been filed away in the ol' skull sponge.
I find myself in a regex-heavy situation, and a thought occurred... does declaring a RegEx pattern "instantiate" or "initialize" that pattern, or does it just store the pattern until it's needed?
var NonNumbers = /[^0-9]/g; //"initialized" here?
"h5u4i15h1iu".replace(NonNumbers, "*"); //or "initialized" here?
Maybe RegExp() actually creates one and the literal waits until it's used, even though both patterns return the same results?
var NonNumbers = /[^0-9]/g; //just stores the pattern
var NonNumbers = RegExp(/[^0-9]/, 'g'); //actually creates the RegExp
Just an itch I'm hoping someone who understands the inner workings can scratch.
From the Mozilla (MDN) documentation:
You construct a regular expression in one of two ways:
Using a regular expression literal, which consists of a pattern enclosed between slashes, as follows:
var re = /ab+c/;
Regular expression literals provide compilation of the regular expression when the script is loaded. If the regular expression remains constant, using this can improve performance.
Or calling the constructor function of the RegExp object, as follows:
var re = new RegExp('ab+c');
Using the constructor function provides runtime compilation of the regular expression. Use the constructor function when you know the regular expression pattern will be changing, or you don't know the pattern and are getting it from another source, such as user input.
Since the documentation indicates that the regular expression is compiled when the literal syntax is used, it is also safe to assume that it is initialized as a full, bona fide regular expression object at that point.
Another potential advantage of literals is that an engine may cache the compiled pattern internally, saving compilation cost when the same literal is evaluated repeatedly. Note, however, that since ES5 each evaluation of a literal produces a distinct RegExp object, so two identical literals never refer to the same object.
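A short sketch of the two construction styles described above. The helper name escapeRegExp and the sample input "1+1" are assumptions for illustration, not part of the quoted documentation. Either way, a full RegExp object exists as soon as the expression has been evaluated; when the engine compiles the pattern internally is an implementation detail.

```javascript
// Literal: the pattern is fixed in the source text.
const nonDigits = /[^0-9]/g;

// Constructor: the pattern is assembled at run time, e.g. from user input.
// escapeRegExp is a hypothetical helper that neutralises regex metacharacters.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}
const userInput = "1+1";
const fromInput = new RegExp(escapeRegExp(userInput), "g");

const masked = "h5u4i15h1iu".replace(nonDigits, "*");
const counted = "1+1=2, 1+1".match(fromInput).length;

console.log(nonDigits instanceof RegExp, fromInput instanceof RegExp);
console.log(masked, counted);
```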

What is the function of .source in context of this new RegExp

I ran into the below monster of a regex in the wild today. The regex is meant to validate a url.
function superUrlValidation(url) {
return new RegExp(/^/.source + "((.+):\/\/)?" + /(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$/.source, "i")
.test(url);
}
I've never seen .source used in a regex like this so I looked it up.
The MDN docs for RegExp.prototype.source states:
The source property returns a String containing the source text of the regexp object, and it doesn't contain the two forward slashes on both sides and any flags.
... and gives this example:
var regex = /fooBar/ig;
console.log(regex.source); // "fooBar", doesn't contain /.../ and "ig".
I understand the MDN example (you're getting the source text of the regex object after it is created, which makes sense), but I don't understand how this is being used in the superUrlValidation regex above.
How is the source being used before the regex object is completed, and what does this accomplish? I can't find any documentation showing .source being used in this way.
Note that .source is used twice in the regex, at the beginning and at the end.
Use of .source everywhere in your regex seems totally unnecessary; it may just be a trick to avoid double escaping. In fact, even new RegExp is not needed, and you can get away with just a regex literal like this:
var re = /^((.+):\/\/)?(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$/i;
/^/ is a regex literal, meaning it's a valid regex object in its own right. This means that /^/.source === "^".
This seems like an arbitrary use of the source property, since the author could have just placed a "^" in its place, or even just put a ^ at the beginning of the next string, and it would have the same effect.
The .source property returns the content of the regex between the forward slashes, as you say, so the result of the above is equivalent to this:
/^((.+):\/\/)?(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$/i
In JavaScript you can write regexes like this: /matchsomething/ or using the RegExp function/constructor above. It looks like the code you found is the result of someone not knowing what they were doing. They seem to have taken a few regexes using the literal syntax (i.e. /match_here/), plugged them into the constructor version and stuck them all together.
I can't see any benefit in using the source property this way. I would just use the string version or the constructor version. Or better, find out what the original author intended and write it again or find a respected regex library with the criteria you need.
And, yeah, wow. It's massive.
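A scaled-down sketch of what the validator is doing, using a made-up pattern rather than the URL regex itself. It shows that .source is read from regex literals that are already complete objects, that the pieces are plain strings by the time they are concatenated, and that the only practical gain is avoiding the doubled backslashes a string literal would need.

```javascript
// Each literal is a finished RegExp object; .source is just its pattern text.
const start = /^/.source;            // "^"
const body = /(\d+)-(\d+)$/.source;  // pattern text with single backslashes

// Concatenating sources and a plain string, as the validator does.
const viaSource = new RegExp(start + "(id:)?" + body, "i");

// The same pattern written as one string needs every backslash doubled...
const viaString = new RegExp("^(id:)?(\\d+)-(\\d+)$", "i");

// ...and written as one literal needs neither .source nor new RegExp.
const viaLiteral = /^(id:)?(\d+)-(\d+)$/i;

console.log(viaSource.source === viaLiteral.source);
console.log(viaString.source === viaLiteral.source);
console.log(viaSource.test("ID:12-34"), viaLiteral.test("12-x"));
```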

Do RegExps made by expression literals share a single instance?

The following snippet of code (from Crockford's Javascript: The Good Parts) demonstrates that RegExp objects made by regular expression literals share a single instance:
function make_a_matcher( ) {
    return /a/gi;
}
var x = make_a_matcher( );
var y = make_a_matcher( );
// Beware: x and y are the same object!
x.lastIndex = 10;
document.writeln(y.lastIndex); // 10
Question: Is this the same with any other literals? I tried modifying the code above to work with the string "string", but got a bunch of errors.
No, they are not shared. From the spec on Regular Expression Literals:
A regular expression literal is an input element that is converted to a RegExp object (see 15.10) each time the literal is evaluated. Two regular expression literals in a program evaluate to regular expression objects that never compare as === to each other even if the two literals' contents are identical.
However, this changed with ES5. Old ECMAScript 3 had a different behavior:
A regular expression literal is an input element that is converted to a RegExp object (section 15.10) when it is scanned. The object is created before evaluation of the containing program or function begins. Evaluation of the literal produces a reference to that object; it does not create a new object. Two regular expression literals in a program evaluate to regular expression objects that never compare as === to each other even if the two literals' contents are identical.
This was supposed to share the compilation result of the regex engine across evaluations, but obviously led to buggy programs.
You should throw your book away and get a newer edition.
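The ES5 behaviour quoted above can be checked directly. This sketch reuses the book's function under the name makeMatcher (renamed here for illustration); in an ES5-or-later engine each call evaluates the literal again and yields a fresh object.

```javascript
function makeMatcher() {
  return /a/gi; // evaluated anew on every call in ES5 and later
}

const x = makeMatcher();
const y = makeMatcher();

x.lastIndex = 10;

console.log(x === y);      // false in ES5+ engines: two distinct objects
console.log(y.lastIndex);  // 0 in ES5+ engines: y is unaffected by x
```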

Javascripts String.split - how does it work internally?

I've recently discussed with a colleague how the separator of String.split is treated internally by JavaScript.
Is the separator always converted into a regular expression? E.g. will calling String.split(",", myvar) convert the "," into a regular expression matching that string?
Well the answer for your question: "Is the separator always converted into a regular expression?" is:
It depends solely on the implementation. For example, if you look at the WebKit implementation http://svn.webkit.org/repository/webkit/trunk/Source/JavaScriptCore/runtime/StringPrototype.cpp (find stringProtoFuncSplit), you will see that it is not always converted to a RegExp. However, this does not imply anything; it is just a matter of implementation.
Here's the official writeup over at ecma, but the relevant part is around this section:
8. If separator is a RegExp object (its [[Class]] is "RegExp"), let R = separator; otherwise let R = ToString(separator).
That being said it is the ecma spec, and as Anthony Grist mentioned in the comments, browsers can implement as they want, for instance V8 implements ecma262.
Edit: expanded thought on browser/js engines implementation, it appears the majority implement versions of ecma, as seen on this wiki
Yes, the JavaScript function split allows you to use a regex:
EX:
var str = "I am confused";
str.split(/\s/g)
The call returns ["I","am","confused"]; str itself is not modified.
separator specifies the character(s) to use for separating the string. The separator is treated as a string or a regular expression. If separator is omitted, the array returned contains one element consisting of the entire string. If separator is an empty string, str is converted to an array of characters.
Please see the below link to know more about this, hope it will help you:
String.prototype.split()
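A small sketch of the separator rules described above, with made-up sample strings. It only shows observable behaviour; whether an engine internally turns a string separator into a regex is not visible from script.

```javascript
// A string separator is matched literally; a RegExp separator is matched as a pattern.
const byString = "a.b.c".split(".");      // "." is a literal dot here
const byRegex = "a1b22c".split(/\d+/);    // split on runs of digits
const byWhitespace = "I am confused".split(/\s/);

// Edge cases described in the quoted documentation.
const noSeparator = "abc".split();        // one element: the whole string
const emptySeparator = "abc".split("");   // one element per character

console.log(byString, byRegex, byWhitespace, noSeparator, emptySeparator);
```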

Regular Expression to MATCH ALL words in a query, in any order

I'm trying to build a search feature for a project which narrows down items based on a user search input and if it matches the keywords listed against items. For this, I'm saving the item keywords in a data attribute and matching the query with these keywords using a RegExp pattern.
I'm currently using this expression, which I know is not correct and need your help on that:
new RegExp('\\b(' + query + ')', 'gi'))) where query is | separated values of the query entered by the user (e.g. \\b(meat|pasta|dinner)). This returns me a match even if there is only 1 match, say for example - meat
Just to throw some context, here's a small example:
If a user types: meat pasta dinner it should list all items which have ALL the 3 keywords listed against them i.e. meat pasta and dinner. These are independent of the order they're typed in.
Can you help me with an expression which will match ALL words in a query, in any order?
You can achieve this with lookahead assertions:
^(?=.*\bmeat\b)(?=.*\bpasta\b)(?=.*\bdinner\b).+
See it here on Regexr
(?=.*\bmeat\b) is a positive lookahead assertion, that ensures that \bmeat\b is somewhere in the string. Same for the other keywords and the .+ is then actually matching the whole string, but only if the assertions are true.
Note that it will also match "dinner meat Foobar pasta", since extra words are allowed.
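The lookahead pattern can be generated from the user's query. A minimal sketch, assuming a whitespace-separated query; buildAllWordsRegex and escapeRegExp are hypothetical helper names, not part of the original answer.

```javascript
// Hypothetical helper: escape regex metacharacters in user-supplied words.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build ^(?=.*\bw1\b)(?=.*\bw2\b)....+ from a whitespace-separated query.
function buildAllWordsRegex(query) {
  const words = query.trim().split(/\s+/).filter(Boolean);
  const lookaheads = words.map((w) => "(?=.*\\b" + escapeRegExp(w) + "\\b)");
  return new RegExp("^" + lookaheads.join("") + ".+", "i");
}

const allWords = buildAllWordsRegex("meat pasta dinner");

console.log(allWords.test("dinner meat Foobar pasta")); // all three words present
console.log(allWords.test("meat and pasta only"));      // "dinner" is missing
```

The regex is built without the g flag so that repeated test() calls do not carry lastIndex from one item to the next.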
stema's answer is technically correct, but it doesn't take performance into account at all. Look aheads are extremely slow (in the context of regular expressions, which are lightning fast). Even with the current logic, the regular expression is not optimal.
So here are some measurements, calculated on larger strings which contain all three words, running the search 1000 times and using four different approaches:
stema's regular expression
/^(?=.*\bmeat\b)(?=.*\bpasta\b)(?=.*\bdinner\b).+/
result: 605ms
optimized regular expression
/^(?=.*?\bmeat\b)(?=.*?\bpasta\b)(?=.*?\bdinner\b)/
uses lazy matching and drops the trailing .+ that consumes the rest of the string
result: 291ms
permutation regular expression
/(\bmeat\b.*?(\bpasta\b.*?\bdinner\b|\bdinner\b.*?\bpasta\b)|\bpasta\b.*?(\bmeat\b.*?\bdinner\b|\bdinner\b.*?\bmeat\b)|\bdinner\b.*?(\bpasta\b.*?\bmeat\b|\bmeat\b.*?\bpasta\b))/
result: 56ms
This is fast because the first alternative matches; if only the last alternative matched, it would be even slower than the lookahead version (300 ms).
array of regular expressions
var regs=[/\bmeat\b/,/\bpasta\b/,/\bdinner\b/];
var result = regs.every(reg=>reg.test(text));
result: 26ms
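Note that if the strings are crafted to not match, then the results are:
stema's regular expression: 521ms
optimized regular expression: 220ms
permutation regular expression: 161ms (much slower because it has to go through all the branches)
array of regular expressions: 14ms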
As you can see, in all cases just using a loop is an order of magnitude faster, not to mention easier to read.
The original question was asking for a regular expression, so my answer to that is the permutation regular expression, but I would not use it, as its size grows factorially with the number of search words (one branch per ordering).
Also, in most cases this performance issue is academic, but it is worth highlighting.
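The loop-based approach generalises easily to a user query. A minimal sketch, assuming the query consists of plain words without regex metacharacters; the function name matchesAllKeywords and the sample texts are made up for illustration.

```javascript
// Assumes each keyword is a plain word (no regex metacharacters to escape).
function matchesAllKeywords(text, query) {
  const regs = query
    .trim()
    .split(/\s+/)
    .map((word) => new RegExp("\\b" + word + "\\b", "i"));
  // every() returns false as soon as one keyword is not found.
  return regs.every((reg) => reg.test(text));
}

console.log(matchesAllKeywords("quick pasta dinner with meat sauce", "meat pasta dinner"));
console.log(matchesAllKeywords("pasta dinner, vegetarian", "meat pasta dinner"));
```

Each keyword gets its own small regex, so the number of patterns grows linearly with the number of search words.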
Your regex looks pretty good:
\b(meat|pasta|dinner)\b
Check that the length of matches equals the number of keywords (in this case, three):
string.match(re).length === numberOfKeywords
where re is the regex with a g flag, string is the data and numberOfKeywords is the number of keywords
This assumes that there are no repeated keywords.
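The counting idea needs two guards to be safe: match() returns null when nothing matches, and a keyword that occurs more than once in the data would inflate the count. A sketch of one way to handle both by counting distinct matches, again assuming plain-word keywords; hasEveryKeyword is a hypothetical name.

```javascript
// Assumes keywords are plain words with no regex metacharacters.
function hasEveryKeyword(data, keywords) {
  const re = new RegExp("\\b(" + keywords.join("|") + ")\\b", "gi");
  const found = data.match(re) || [];            // match() yields null on no match
  const distinct = new Set(found.map((m) => m.toLowerCase()));
  const wanted = new Set(keywords.map((k) => k.toLowerCase()));
  return distinct.size === wanted.size;          // every distinct keyword was seen
}

console.log(hasEveryKeyword("meat, meat and more meat", ["meat", "pasta", "dinner"]));
console.log(hasEveryKeyword("Pasta dinner with meat", ["meat", "pasta", "dinner"]));
```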
Based on the accepted answer, I wrote a simple Java method that builds the regex from an array of keywords:
public static String regexIfAllKeywordsExists(String[] keywords) {
    StringBuilder sb = new StringBuilder("^");
    for (String keyword : keywords) {
        // Quote each keyword so regex metacharacters in it are matched literally.
        sb.append("(?=.*\\b");
        sb.append(java.util.regex.Pattern.quote(keyword));
        sb.append("\\b)");
    }
    sb.append(".+");
    return sb.toString();
}
