Faster way to match characters between strings than Regex? - javascript

The use case is I want to compare a query string of characters to an array of words, and return the matches. A match is when a word contains all the characters in the query string, order doesn't matter, repeated characters are okay. Regex seems like it may be too powerful (a sledgehammer where only a hammer is needed). I've written a solution that compares the characters by looping through them and using indexOf, but it seems consistently slower. (http://jsperf.com/indexof-vs-regex-inside-a-loop/10) Is Regex the fastest option for this type of operation? Are there ways to make my alternate solution faster?
var query = "word",
    words = ['word', 'wwoorrddss', 'words', 'argument', 'sass', 'sword', 'carp', 'drowns'],
    reStoredMatches = [],
    indexOfMatches = [];

function match(word, query) {
    var len = word.length,
        charMatches = [],
        charMatch,
        char;
    while (len--) {
        char = word[len];
        charMatch = query.indexOf(char);
        if (charMatch !== -1) {
            charMatches.push(char);
        }
    }
    return charMatches.length === query.length;
}

function linearIndexOf(words, query) {
    var wordsLen = words.length,
        wordMatch,
        word;
    while (wordsLen--) {
        word = words[wordsLen];
        wordMatch = match(word, query);
        if (wordMatch) {
            indexOfMatches.push(word);
        }
    }
}

function linearRegexStored(words, query) {
    var wordsLen = words.length,
        re = new RegExp('[' + query + ']', 'g'),
        match,
        word;
    while (wordsLen--) {
        word = words[wordsLen];
        match = word.match(re);
        if (match !== null) {
            if (match.length >= query.length) {
                reStoredMatches.push(word);
            }
        }
    }
}

Note that your regex is wrong, that's most certainly why it goes so fast.
Right now, if your query is "word" (as in your example), the regex is going to be:
/[word]/g
This means: look for any one of the characters 'w', 'o', 'r', or 'd'. With the 'g' flag, match() returns an array of every such character found in the word, and your code only compares the length of that array with query.length. A word like "wwww" therefore counts as a match even though it lacks 'o', 'r' and 'd'. The regex does far less work than a correct check would, which is most likely why it is faster than the indexOf() version.
Also, you mention the idea/concept of any number of characters, I suppose as shown here:
'word', 'wwoorrddss'
The indexOf() approach will definitely not catch that properly if you really mean "any number" of each and every character, because you would have to match an unbounded number of cases. As a regex it would be something like:
/w+o+r+d+s+/g
You will certainly have a hard time writing the equivalent code in plain JavaScript rather than using a regex. Either way, it is going to be somewhat slow.
From the comment below, all the letters of the word are required. To do that in a single regex, you need 3! (3 factorial) alternatives for a 3-letter word:
/(a.*b.*c)|(a.*c.*b)|(b.*a.*c)|(b.*c.*a)|(c.*a.*b)|(c.*b.*a)/
Obviously, a factorial is going to very quickly grow your number of possibilities and blow away your memory in a super long regex (although you can simplify if a word has the same letter multiple times, you do not have to test that letter more than once).
1! = 1
2! = 2
3! = 6
4! = 24
5! = 120
6! = 720
...
That's probably why your properly written test in plain JavaScript is much slower.
Also, in your case you should store the words much as Scrabble dictionaries do: each letter once, in alphabetical order (Scrabble dictionaries keep duplicates). So the word "word" would be "dorw" and, as in your example, the word "wwoorrddss" would be "dorsw". You can have a backend tool generate your table of words (so you still write them as "word" and "words", and the tool converts them to "dorw" and "dorsw"). Then you sort the letters of the word you are searching for in alphabetical order as well, and the result is that you do not need a silly factorial for the regex; you can simply do this:
/d.*o.*r.*w/
And that will match the sorted form of any word that includes all the letters of "word", such as "password" (stored as "adoprsw").
One easy way to sort the letters is to split your word into an array of letters and then sort the array. You will still have duplicates afterwards, because the default JavaScript sort does not remove them; you have to drop them yourself.
One more detail: if your test is supposed to be case insensitive, then you want to transform your strings to lowercase before running the test. So something like:
query = query.toLowerCase();
early on in your top function.
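The sorted-letters idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names (letterSignature, signatureRegex, matchBySignature) are made up for this example, the query is assumed to contain only letters (no regex escaping is done), and in practice the signatures of the dictionary words would be computed once up front rather than on every search.

```javascript
// Reduce a word to its distinct letters in alphabetical order,
// e.g. "wwoorrddss" -> "dorsw". (Hypothetical helper name.)
function letterSignature(word) {
    return Array.from(new Set(word.toLowerCase())).sort().join('');
}

// Build a regex such as /d.*o.*r.*w/ from the query's signature.
// Assumes the query contains only letters, so nothing is escaped.
function signatureRegex(query) {
    return new RegExp(letterSignature(query).split('').join('.*'));
}

// Return the words whose signature contains every letter of the query.
function matchBySignature(words, query) {
    var re = signatureRegex(query);
    return words.filter(function (word) {
        return re.test(letterSignature(word));
    });
}

var sampleWords = ['word', 'wwoorrddss', 'words', 'argument', 'sass', 'sword', 'carp', 'drowns'];
console.log(matchBySignature(sampleWords, 'word'));
// -> [ 'word', 'wwoorrddss', 'words', 'sword', 'drowns' ]
```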

You are trying to speed up the algorithm "chars in word are a subset of the chars of query." You can short circuit this check and avoid some assignments (that are more readable but not strictly needed). Try the following version of match
function match(word, query) {
    var len = word.length;
    while (len--) {
        if (query.indexOf(word[len]) === -1) { // found a missing char
            return false;
        }
    }
    return true; // couldn't find any missing chars
}
This gives a 4-5X improvement
Depending on the application you could try presorting words and presorting each word in words as another optimization.
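Note that the question defines a match as a word that contains all the characters of the query, which is the opposite direction from the loop above (the loop above rejects "words", for instance, because "s" is not in the query). A minimal sketch of the same short-circuit idea in that direction might look like this; containsAllChars is a name invented for the example, and no speed-up is claimed for it.

```javascript
// Returns true when every character of query occurs somewhere in word.
// Order does not matter and repeated characters in word are fine.
function containsAllChars(word, query) {
    var i = query.length;
    while (i--) {
        if (word.indexOf(query[i]) === -1) {
            return false; // a query character is missing from the word
        }
    }
    return true; // every query character was found
}

var candidates = ['word', 'wwoorrddss', 'words', 'argument', 'sass', 'sword', 'carp', 'drowns'];
console.log(candidates.filter(function (w) {
    return containsAllChars(w, 'word');
}));
// -> [ 'word', 'wwoorrddss', 'words', 'sword', 'drowns' ]
```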

The regexp match algorithm constructs a finite state automaton and makes its decisions based on the current state and the character read, from left to right. This involves reading each character once and making a decision.
For static strings (looking for a fixed string in a body of text) there are better algorithms, like Knuth-Morris-Pratt, which avoids re-examining text characters after a partial match, but you must understand that this algorithm is not for matching regular expressions, just plain strings.
If you are interested in Knuth-Morris-Pratt (there are several other algorithms), just have a look at Wikipedia: http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
A good thing you can do is to investigate whether your regexp match routines use a DFA or an NFA: NFAs occupy less memory and are easier to construct, while DFAs match faster, at the cost of some compilation time and more memory.
The Knuth-Morris-Pratt algorithm also needs to compile the string into an automaton before working, so perhaps it doesn't apply to your problem if you are using it just to find one word in some string.
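For reference, here is a rough sketch of the Knuth-Morris-Pratt search mentioned above, showing the "compile the string first" step as a failure table. The function names are made up for this example, and it finds a plain substring only; it does not solve the "all characters in any order" problem from the question.

```javascript
// Build the failure table: table[i] is the length of the longest proper
// prefix of pattern[0..i] that is also a suffix of it.
function buildFailureTable(pattern) {
    var table = [0];
    var k = 0;
    for (var i = 1; i < pattern.length; i++) {
        while (k > 0 && pattern[i] !== pattern[k]) {
            k = table[k - 1];
        }
        if (pattern[i] === pattern[k]) {
            k++;
        }
        table[i] = k;
    }
    return table;
}

// Return the index of the first occurrence of pattern in text, or -1.
function kmpSearch(text, pattern) {
    if (pattern.length === 0) return 0;
    var table = buildFailureTable(pattern);
    var k = 0; // number of pattern characters matched so far
    for (var i = 0; i < text.length; i++) {
        while (k > 0 && text[i] !== pattern[k]) {
            k = table[k - 1]; // fall back without moving backwards in text
        }
        if (text[i] === pattern[k]) {
            k++;
        }
        if (k === pattern.length) {
            return i - k + 1;
        }
    }
    return -1;
}

console.log(kmpSearch('password', 'word')); // -> 4
```

The table is built once per pattern, which is the compilation cost the answer refers to; it only pays off when the same pattern is searched many times or in long texts.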

Related

Is there an efficient way to test whether a string contains non-overlapping substrings to match the array of regular expressions?

All I want is to test whether a string contains non-overlapping substrings to match the array of regexes in the following way: if a substring matches some item of the array, remove the corresponding regex from the array, and continue. I will need a function func(arg1, arg2) that will take two arguments: the first one is the string itself, and the second one is an array of regular expressions to test.
I've read some explanations (such as Regular Expressions: Is there an AND operator?), but they do not answer this specific question. For example, in Javascript, the following three code snippets will return true:
/(?=ab)(?=abc)(?=abcd)/gi.test("eabzzzabcde");
/(?=.*ab)(?=.*abc)(?=.*abcd)/gi.test("eabzzzabcde");
/(?=.*?ab)(?=.*?abc)(?=.*?abcd)/gi.test("eabzzzabcde");
which is, obviously, not what I want (because "abc" and "abcd" in "eabzzzabcde" are just mixed together in an overlapping way). So, func("eabzzzabcde", [/ab/gi, /abc/gi, /abcd/gi]) should return false.
But, func("Fhh, fabcw wxabcdy yz... zab.", [/ab/gi, /abc/gi, /abcd/gi]) should return true because none of "ab", "abc" and "abcd" substrings overlap each other. The logic is the following. We have an array of regexes: [/ab/gi, /abc/gi, /abcd/gi], and some possible combination of three (where 3 is equal to the length of that array) non-overlapping, separate substrings of the original string: fabcw, xabcdy and zab. Does fabcw match /abc/gi? Yes. Okay, we remove /abc/gi from the array, and we have [/ab/gi, /abcd/gi] for xabcdy and zab. Does xabcdy match /abcd/gi? Yes. Okay, we remove /abcd/gi from the current array, and we have [/ab/gi] for zab. Does zab. match /ab/gi? Yes. No more regexes left in the current array, and we always answered "yes", so — return true.
The tricky part here is to find an efficient (such that performance is not too terrible) way to get at least one possible “good” combination of non-overlapping substrings.
The more complex case is e.g. func("acdxbaab ababaacb", [/.*?a.*?b.*?c/gi, /.*?c.*?b.*?a/gi]). Using the logic described above, we can see that if we take two non-overlapping parts of the original string — "acdxba" (or "cdxba") and "abaac" (or "abaacb", "babaac" etc.) — the first one matches /.*?c.*?b.*?a/gi, and the second one matches /.*?a.*?b.*?c/gi. So, func("acdxbaab ababaacb", [/.*?a.*?b.*?c/gi, /.*?c.*?b.*?a/gi]) should return true.
Is there any efficient way to solve such a problem?
Assuming each pattern should match exactly once, then we can construct a regexp of all of their permutations:
const patterns = ['ab', 'abc', 'abcd'];
const input = "Fhh, fabcw wxabcdy yz... zab.";

// Create a regexp of the form
// (.*?ab.*?abc.*?abcd.*?)
function build(patterns) {
    return `(${['', ...patterns, ''].join('.*?')})`;
}

function match(input, patterns) {
    const regexps = [...permute(patterns)].map(build);
    // Create a regexp of the form
    // /(.*?ab.*?abc.*?abcd.*?)|(.*?ab.*?abcd.*?abc.*?)|.../
    const regexp = new RegExp(regexps.join('|'));
    return regexp.test(input);
}

// Simple permutation generator.
function *permute(a, n = a.length) {
    if (n <= 1) yield a.slice();
    else for (let i = 0; i < n; i++) {
        yield *permute(a, n - 1);
        const j = n % 2 ? 0 : i;
        [a[n-1], a[j]] = [a[j], a[n-1]];
    }
}

console.log(match(input, patterns));
This will result in a very long regexp if there are more than a half-dozen or so patterns. To deal with this, we can test each permutation one at a time:
function match(input, patterns) {
    return Array.from(permute(patterns))
        .some(perm => input.match(build(perm)));
}
If there are ten patterns, we will end up doing up to 10! = 3,628,800 tests.
Disclaimers
This uses ES6 features. Fall back to equivalent ES5 syntax if you need to.
The input patterns here are strings. To handle regexps instead would require a little bit of logic to extract the pattern from the regexp, and also escape any special regexp characters appearing in it.
Is there an efficient way to test whether a string contains non-overlapping substrings to match the array of regular expressions?
I doubt that you would call the above solution "efficient", but I don't know if there is a more efficient one. As far as I can see, any approach to this problem is going to involve backtracking. You could match the first nine of ten patterns, and then discover that the last one won't match because one of the earlier nine greedily ate up part of what the tenth needed, even though it could have matched itself somewhere later in the string. Therefore, I will go out on a limb and say that this problem is intrinsically of order O(n!).

How to compare two Strings and get Different part

now I have two strings,
var str1 = "A10B1C101D11";
var str2 = "A1B22C101D110E1";
What I intend to do is to tell the difference between them, the result will look like
A10B1C101D11
A10 B22 C101 D110E1
It follows the same pattern, one character and a number. And if the character doesn't exist or the number is different between them, I will say they are different, and highlight the different part. Can regular expression do it or any other good solution? thanks in advance!
Let me start by stating that regexp might not be the best tool for this. As the strings have a simple format that you are aware of it will be faster and safer to parse the strings into tokens and then compare the tokens.
However, you can do this with a regexp, although in JavaScript you have traditionally been hampered by the lack of lookbehind (it was only added in ES2018).
The way to do this is to use negative lookahead to prevent matches that are included in the other string. However, since older JavaScript engines do not support lookbehind, you might need to search from both directions.
We do this by concatenating the strings, with a delimiter that we can test for.
If using '|' as a delimiter the regexp becomes;
/(\D\d*)(?=(?:\||\D.*\|))(?!.*\|(.*\d)?\1(\D|$))/g
To find the tokens in the second string that are not present in the first you do;
var bothstring=str2.concat("|",str1);
var re=/(\D\d*)(?=(?:\||\D.*\|))(?!.*\|(.*\d)?\1(\D|$))/g;
var match=re.exec(bothstring);
Subsequent calls to re.exec will return later matches. So you can iterate over them as in the following example;
while (match != null) {
    // match[0] is the differing token, match.index its position in bothstring
    alert("\"" + match[0] + "\" At position " + match.index);
    match = re.exec(bothstring);
}
As stated this gives tokens in str2 that are different in str1. To get the tokens in str1 that are different use the same code but change the order of str1 and str2 when you concatenate the strings.
The above code might not be safe if dealing with potentially dirty input. In particular it might misbehave if fed a string like "A100|A100": the first A100 will not be considered as having a missing object because the regexp is not aware that the source is supposed to be two different strings. If this is a potential issue, then search for occurrences of the delimiting character first.
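The token-based approach recommended at the start of this answer could be sketched roughly as below. The helper names parseTokens and diffTokens are invented for this example, and the sketch assumes every token is exactly one non-digit character followed by digits, with each letter appearing at most once per string.

```javascript
// Parse "A10B1C101D11" into { A: "10", B: "1", C: "101", D: "11" }.
// Assumes each token is one non-digit character followed by digits.
function parseTokens(str) {
    var tokens = {};
    var re = /(\D)(\d*)/g;
    var m;
    while ((m = re.exec(str)) !== null) {
        tokens[m[1]] = m[2];
    }
    return tokens;
}

// Return the tokens of `a` that are missing from `b` or carry a different number.
function diffTokens(a, b) {
    var ta = parseTokens(a);
    var tb = parseTokens(b);
    return Object.keys(ta).filter(function (key) {
        return ta[key] !== tb[key];
    }).map(function (key) {
        return key + ta[key];
    });
}

var str1 = "A10B1C101D11";
var str2 = "A1B22C101D110E1";
console.log(diffTokens(str1, str2)); // -> [ 'A10', 'B1', 'D11' ]
console.log(diffTokens(str2, str1)); // -> [ 'A1', 'B22', 'D110', 'E1' ]
```

Running the comparison in both directions gives the parts to highlight in each string.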
You can break each string into an array:
var aStr1 = str1.split('');
var aStr2 = str2.split('');
Then check which one has more characters, and save the smaller number
var totalCharacters;
if (aStr1.length > aStr2.length) {
    totalCharacters = aStr2.length;
} else {
    totalCharacters = aStr1.length;
}
And loop comparing both
var diff = [];
for (var i = 0; i < totalCharacters; i++) {
    if (aStr1[i] != aStr2[i]) {
        diff.push(aStr1[i]); // or something else
    }
}
At the very end you can append the remaining characters from the longer string (since they obviously have no counterpart in the other one).
Does that help?

change regex to match some words instead of all words containing PRP

This regex matches all characters between whitespace if the word contains PRP.
How can I get it to match all words, or characters in between whitespace, if they contain PRP, but not if they contain "me" in any case?
So match all words containing PRP, but not containing ME or me.
Here is the regex to match words containing PRP: \S*PRP\S*
You can use negative lookahead for this:
(?:^|\s)((?!\S*?(?:ME|me))\S*?PRP\S*)
PS: Use group #1 for your matched word.
Code:
var re = /(?:^|\s)((?!\S*?(?:ME|me))\S*?PRP\S*)/;
var s = 'word abcPRP def';
var m = s.match(re);
if (m) console.log(m[1]); //=> abcPRP
Instead of using complicated regular expressions which would be confusing for almost anyone who's reading it, why don't you break up your code into two sections, separating the words into an array and filtering out the results with stuff you don't want?
function prpnotme(w) {
    var r = w.match(/\S+/g);
    if (r == null)
        return [];
    var i = 0;
    while (i < r.length) {
        // String.prototype.includes replaces the non-standard contains()
        if (!r[i].includes('PRP') || r[i].toLowerCase().includes('me'))
            r.splice(i, 1);
        else
            i++;
    }
    return r;
}
console.log(prpnotme('whattttttt ok')); // []
console.log(prpnotme('MELOLPRP PRPRP PRPthemeok PRPmhm')); // ['PRPRP', 'PRPmhm']
As a very good reason why this matters, imagine you later want to add more logic. You're much more likely to make a mistake when modifying a complicated regex to make it even more complicated, whereas this way it's done with simple logic that makes perfect sense when reading each predicate, no matter how much you add on.

Regex with limited use of specific characters (like a scrabble bank of letters)

I would like to be able to do a regex where I can identify sort of a bank of letters like [dgos] for example and use that within my regex... but whenever a letter from that gets used, it takes it away from the bank.
So let's say that somehow \1 is able to stand for that bank of letters (dgos). I could write a regex something like:
^\1{1}o\1{2}$
and it would match basically:
\1{1} = [dgos]{1}
o
\1{2} = [dgos]{2} minus whatever was used in the first one
Matches could include good, dosg, sogd, etc... and would not include sods (because s would have to be used twice) or sooo (because the o would have to be used twice).
I would also like to be able to identify letters that can be used more than once. I started to write this myself but then realized I didn't even know where to begin with this so I've also searched around and haven't found a very elegant way to do this, or a way to do it that would be flexible enough that the regex could easily be generated with minimal input.
I have a solution below using a combination of conditions and multiple regexs (feel free to comment thoughts on that answer - perhaps it's the way I'll have to do it?), but I would prefer a single regex solution if possible... and something more elegant and efficient if possible.
Note that the higher level of elegance and single regex part are just my preferences, the most important thing is that it works and the performance is good enough.
Assuming you are using javascript version 1.5+ (so you can use lookaheads), here is my solution:
^([dogs])(?!.*\1)([dogs])(?!.*\2)([dogs])(?!.*\3)[dogs]$
So after each letter is matched, you perform a negative lookahead to ensure that this matched letter never appears again.
This method wouldn't work (or at least, would need to be made a heck of a lot more complicated!) if you want to allow some letters to be repeated, however. (E.g. if your letters to match are "example".)
EDIT: I just had a little re-think, and here is a much more elegant solution:
^(?:([dogs])(?!.*\1)){4}$
After thinking about it some more, I have thought of a way that this IS possible, although this solution is pretty ugly! For example, suppose you want to match the letters "goods" (i.e. including TWO "o"s):
^(?=.*g)(?=.*o.*o)(?=.*d)(?=.*s).{5}$
So, I have used positive lookaheads to check that all of these letters are in the text somewhere, then simply checked that there are exactly 5 letters.
Or as another example, suppose we want to match the letters "banana", with a letter "a" in position 2 of the word. Then you could match:
^(?=.*b)(?=.*a.*a.*a)(?=.*n.*n).a.{4}$
Building on @TomLord's answer, and taking into account that you don't necessarily need to exhaust the bank of letters, you can use negative lookahead assertions instead of positive ones. For your example D<2 from bank>R<0-5 from bank>, that regex would be
/^(?![^o]*o[^o]*o)(?![^a]*a[^a]*a)(?![^e]*e[^e]*e[^e]*e)(?![^s]*s[^s]*s)d[oaes]{2}r[oaes]{0,5}$/i
Explanation:
^ # Start of string
(?![^o]*o[^o]*o) # Assert no more than one o
(?![^a]*a[^a]*a) # Assert no more than one a
(?![^e]*e[^e]*e[^e]*e) # Assert no more than two e
(?![^s]*s[^s]*s) # Assert no more than one s
d # Match d
[oaes]{2} # Match two of the letters in the bank
r # Match r
[oaes]{0,5} # Match 0-5 of the letters in the bank
$ # End of string
You could also write (?!.*o.*o) instead of (?![^o]*o[^o]*o), but the latter is faster, just harder to read. Another way to write this is (?!(?:[^o]*o){2}) (useful if the number of repetitions increases).
And of course you need to account for the number of letters in your "fixed" part of the string (which in this case (d and r) don't interfere with the bank, but they might do so in other examples).
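Since the question asks for a regex that can be generated with minimal input, here is a rough sketch of building the negative lookaheads from a letter bank. Everything named here is an assumption made for illustration: the functions countLetters and bankRegex, and the convention that "#" in the body stands for "one letter from the bank". It assumes the bank contains only letters (nothing is escaped) and that the fixed letters in the body are not in the bank, and it uses the (?!(?:[^x]*x){n}) form mentioned above.

```javascript
// Count how often each letter occurs in the bank,
// e.g. "oaees" -> { o: 1, a: 1, e: 2, s: 1 }.
function countLetters(bank) {
    var counts = {};
    bank.toLowerCase().split('').forEach(function (ch) {
        counts[ch] = (counts[ch] || 0) + 1;
    });
    return counts;
}

// Build a regex from a letter bank and a body. Inside `body`, the
// hypothetical placeholder "#" means "one letter from the bank".
function bankRegex(bank, body) {
    var counts = countLetters(bank);
    var letters = Object.keys(counts);
    var lookaheads = letters.map(function (ch) {
        // Forbid (count + 1) occurrences of ch anywhere in the string.
        return '(?!(?:[^' + ch + ']*' + ch + '){' + (counts[ch] + 1) + '})';
    }).join('');
    var cls = '[' + letters.join('') + ']';
    return new RegExp('^' + lookaheads + body.split('#').join(cls) + '$', 'i');
}

var bankRe = bankRegex('oaees', 'd#{2}r#{0,5}');
console.log(bankRe.test('dear')); // -> true
console.log(bankRe.test('deer')); // -> true  (two e's are in the bank)
console.log(bankRe.test('door')); // -> false (only one o in the bank)
```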
Something like that would be pretty simple to specify, just very long
gdos|gdso|gods|gosd|....
You'd basically specify every permutation. Just write some code to generate every permutation, combine with the alternate operator, and you're done!
Although... It might pay to actually encode the decision tree used to generate the permutations...
I swear I think I answered this before on stackoverflow. Let me get back to you...
I couldn't think of a way to do this completely in a regex, but I was able to come up with this: http://jsfiddle.net/T2TMd/2/
As jsfiddle is limited to post size, I couldn't do the larger dictionary there. Check here for a better example using a 180k word dictionary.
Main function:
/*
* filter = regex to filter/split potential matches
* bank = available letters
* groups = capture groups that use the bank
* dict = list of words to search through
*/
var matchFind = function(filter, bank, groups, dict) {
    var matches = [];
    for(var i=0; i < dict.length; i++) {
        if(dict[i].match(filter)){
            var fail = false;
            var b = bank;
            var arr = dict[i].split(filter);
            //loop groups that use letter bank
            for(var k=0; k<groups.length && !fail; k++) {
                var grp = arr[groups[k]] || [];
                //loop characters of that group
                for(var j=0; j<grp.length && !fail; j++) {
                    var regex = new RegExp(b);
                    var currChar = grp.charAt(j);
                    if(currChar.match(regex)) {
                        //match found, remove from bank
                        b = b.replace(currChar,"");
                    } else {
                        fail = true;
                    }
                }
            }
            if(!fail) {
                matches.push(dict[i]);
            }
        }
    }
    return matches;
}
Usage:
$("#go").click( function() {
    var f = new RegExp($("#filter").val());
    var b = "["+$("#bank").val().replace(/[^A-Za-z]+/g,"").toUpperCase()+"]";
    var g = $("#groups").val().replace(/[^0-9,]+/g,"").split(",") || [];
    $("#result").text(matchFind(f,b,g,dict).toString());
});
To make it easier to create scenarios, I created this as well:
$("#build").click( function() {
    var bank = "["+$("#buildBank").val().replace(/[^A-Za-z]+/g,"").toUpperCase()+"]";
    var buildArr = $("#builder").val().split(",");
    var groups = [];
    var build = "^";
    for(var i=0; i<buildArr.length; i++) {
        var part = buildArr[i];
        if(/\</.test(part)) {
            part = "(" + bank + part.replace("<", "{").replace(">", "}").replace("-",",") + ")";
            build = build + part;
            groups.push(i+1);
        } else {
            build = build + "("+part+")";
        }
    }
    build = build + "$";
    $("#filter").val(build);
    $("#bank").val(bank);
    $("#groups").val(groups.toString());
    $("#go").click();
});
This would be useful in Scrabble, so let's say that you are in a position where a word must start with a "D", there are two spaces between that "D" and an "R" from a parallel word, and you have OAEES for your letter bank. In the builder I could put D,<2>,R,<0-3> because it must start with a D, then it must have 2 letters from the bank, then there is an R, then I'd have up to 3 letters to use (since I'm using 2 in between D and R).
The builder would use the letter bank and convert D,<2>,R,<0-3> to ^(D)([OAEES]{2})(R)([OAEES]{0,3})$, which would be used to filter for possible matches. Then, for each possible match, it looks at the capture groups that use the letter bank, character by character, removing each letter from the bank as it is found, so that a word won't match if it uses more of a letter than the bank contains.

Using javascript regexp to find the first AND longest match

I have a RegExp like the following simplified example:
var exp = /he|hell/;
When I run it on a string it will give me the first match, e.g.:
var str = "hello world";
var match = exp.exec(str);
// match contains ["he"];
I want the first and longest possible match,
and by that I mean sorted by index, then length.
Since the expression is combined from an array of RegExp's, I am looking for a way to find the longest match without having to rewrite the regular expression.
Is that even possible?
If it isn't, I am looking for a way to easily analyze the expression and arrange it in the proper order. But I can't figure out how, since the expressions could be a lot more complex, e.g.:
var exp = /h..|hel*/
How about /hell|he/ ?
All regex implementations I know of will (try to) match characters/patterns from left to right and terminate whenever they find an over-all match.
In other words: if you want to make sure you get the longest possible match, you'll need to try all your patterns (separately), store all matches and then get the longest match from all possible matches.
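A minimal sketch of that "try every pattern separately" approach, using the question's ordering (earliest index first, then longest), could look like this. The function name firstLongestMatch is made up for the example, and the patterns are assumed to be RegExp objects without the g flag so that exec() always starts from the beginning of the string.

```javascript
// Try each pattern separately and return the match that starts earliest;
// ties on the start index are broken in favour of the longer match.
// Returns null when no pattern matches.
function firstLongestMatch(str, patterns) {
    var best = null;
    patterns.forEach(function (re) {
        var m = re.exec(str);
        if (m === null) return;
        if (best === null ||
            m.index < best.index ||
            (m.index === best.index && m[0].length > best[0].length)) {
            best = m;
        }
    });
    return best;
}

var found = firstLongestMatch("hello world", [/he/, /hell/]);
console.log(found[0], found.index); // -> hell 0
```

Each individual pattern still reports only its own leftmost match, so this sketch compares one candidate per pattern rather than every possible match.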
You can do it. It's explained here:
http://www.regular-expressions.info/alternation.html
(In summary, change the operand order or group with question mark the second part of the search.)
You cannot do "longest match" (or anything involving counting, minus look-aheads) with regular expressions.
Your best bet is to find all matches, and simply compare the lengths in the program.
I don't know if this is what you're looking for (Considering this question is almost 8 years old...), but here's my grain of salt:
(Switching the he for hell will perform the search based on the biggest first)
var exp = /hell|he/;
var str = "hello world";
var match = exp.exec(str);
if (match) {
    match.sort(function(a, b){ return b.length - a.length; });
    console.log(match[0]);
}
Where match[0] is going to be the longest of all the strings matched.
