Regular Expression "AND" - javascript

I'm doing some basic text matching from an input. I need the ability to perform a basic "AND". For "ANY" I split the input by spaces and join each word with the pipe ("|") character, but I haven't found a way to tell the regular expression to match all of the words.
switch (searchOption) {
case "any":
inputArray = input.split(" ");
if (inputArray.length > 1) { input = inputArray.join("|"); }
text = input;
break;
case "all":
inputArray = input.split(" ");
***[WHAT TO DO HERE?]***
text = input;
break;
case "exact":
inputArray = new Array(input);
text = input;
break;
}
Seems like it should be easy.

Use lookahead. Try this:
if( inputArray.length>1 ) rgx = "(?=.*" + inputArray.join( ")(?=.*" ) + ").*";
You'll end up with something like
(?=.*dog)(?=.*cat)(?=.*mouse).*
Which should only match if all the words appear, but they can be in any order.
The dog ate the cat who ate the mouse.
The mouse was eaten by the dog and the cat.
Most cats love mouses and dogs.
But not
The dog at the mouse.
Cats and dogs like mice.
The way it works is that the regex engine scans from the current match point (0) looking for .*dog, the first sub-regex (any number of any character, followed by dog). When it determines true-ness of that regex, it resets the match point (back to 0) and continues with the next sub-regex. So, the net is that it doesn't matter where each word is; only that every word is found.
EDIT: @Justin pointed out that I should have a trailing .*, which I've added above. Without it, text.match(regex) works, but regex.exec(text) returns an empty match string. With the trailing .*, you get the matching string.
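One thing the one-liner above does not cover is input containing regex metacharacters such as "." or "(". Here is a minimal sketch of building the same lookahead pattern with escaping. The helper names escapeRegExp and buildAllWordsRegex are made up for illustration and are not part of the question's code.

```javascript
// Escape characters that have special meaning inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a regex that matches only if every word occurs somewhere in the text.
function buildAllWordsRegex(input) {
  var words = input.split(/\s+/).filter(Boolean).map(escapeRegExp);
  return new RegExp("(?=.*" + words.join(")(?=.*") + ").*");
}

var rgx = buildAllWordsRegex("dog cat mouse");
console.log(rgx.source); // (?=.*dog)(?=.*cat)(?=.*mouse).*
console.log(rgx.test("The mouse was eaten by the dog and the cat.")); // true
console.log(rgx.test("Cats and dogs like mice.")); // false
```

Note that "." does not match line breaks, so this assumes the searched text is a single line.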

The problem with "and" is: in what combination do you want the words? Can they appear in any order, or must they be in the order given? Can they appear consecutively or can there be other words in between?
These decisions impact heavily what search (or searches) you do.
If you're looking for "A B C" (in order, consecutively), the expression is simply /A B C/. Done!
If you're looking for "A foo B bar C" it might be /A.*?B.*?C/
If you're looking for "B foo A foo C" you'd be better off doing three separate tests for /A/, /B/, and /C/

Do a simple for loop and search for every term, something like this:
var n = inputArray.length;
if (n) {
for (var i=0; i<n; i++) {
if (text.indexOf(inputArray[i]) === -1) { // inputArray[i] not in text
break;
}
}
if (i != n) {
// not all terms were found
}
}
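The same loop can be written more compactly with Array.prototype.every, which stops at the first missing term. A small sketch follows; the function name containsAllTerms is hypothetical, and it does a plain substring test rather than a regex match.

```javascript
// Returns true when every term occurs somewhere in text (plain substring test).
function containsAllTerms(text, terms) {
  return terms.every(function (term) {
    return text.indexOf(term) !== -1;
  });
}

console.log(containsAllTerms("The dog ate the cat", ["dog", "cat"]));   // true
console.log(containsAllTerms("The dog ate the cat", ["dog", "mouse"])); // false
```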

My regular expressions cookbook does feature a regular expression that can possibly do this using conditionals. However, it's quite complicated, so I'd go for the currently top rated answer which is iterating over the options. Anyway, trying to adapt their example I think it would be something like:
\b(?:(?:(word1)|(word2))(\b.*?)){2,}(?(1)|(?!))(?(2)|(?!))
No guarantees that this'll work as is (JavaScript's regex engine, for one, does not support the (?(1)...) conditional syntax), but it's the basic idea I think. See what I mean about complicated?

Related

Faster way match characters between strings than Regex?

The use case is I want to compare a query string of characters to an array of words, and return the matches. A match is when a word contains all the characters in the query string, order doesn't matter, repeated characters are okay. Regex seems like it may be too powerful (a sledgehammer where only a hammer is needed). I've written a solution that compares the characters by looping through them and using indexOf, but it seems consistently slower. (http://jsperf.com/indexof-vs-regex-inside-a-loop/10) Is Regex the fastest option for this type of operation? Are there ways to make my alternate solution faster?
var query = "word",
words = ['word', 'wwoorrddss', 'words', 'argument', 'sass', 'sword', 'carp', 'drowns'],
reStoredMatches = [],
indexOfMatches = [];
function match(word, query) {
var len = word.length,
charMatches = [],
charMatch,
char;
while (len--) {
char = word[len];
charMatch = query.indexOf(char);
if (charMatch !== -1) {
charMatches.push(char);
}
}
return charMatches.length === query.length;
}
function linearIndexOf(words, query) {
var wordsLen = words.length,
wordMatch,
word;
while (wordsLen--) {
word = words[wordsLen];
wordMatch = match(word, query);
if (wordMatch) {
indexOfMatches.push(word);
}
}
}
function linearRegexStored(words, query) {
var wordsLen = words.length,
re = new RegExp('[' + query + ']', 'g'),
match,
word;
while (wordsLen--) {
word = words[wordsLen];
match = word.match(re);
if (match !== null) {
if (match.length >= query.length) {
reStoredMatches.push(word);
}
}
}
}
Note that your regex is wrong, that's most certainly why it goes so fast.
Right now, if your query is "word" (as in your example), the regex is going to be:
/[word]/g
This means: match any single character that is 'w', 'o', 'r', or 'd'. With the 'g' flag, match() returns an array of every such character, so the check match.length >= query.length only counts how many characters of the word belong to the set. It never verifies that each letter of the query is present (a word like 'wwww' passes), which makes it a lot cheaper than a correct test.
Also, you mention the idea/concept of any number of characters, I suppose as shown here:
'word', 'wwoorrddss'
The indexOf() version will definitely not catch that properly if you really mean "any number" of each and every character, because there is an unbounded number of cases to match. As a regex it would be something like this:
/w+o+r+d+s+/g
You will certainly have a hard time writing that correctly in plain JavaScript without a regex. Either way, it is going to be somewhat slow.
From the comment below, all the letters of the word are required. To do that with a single regex you need 3! (3 factorial) alternatives for a 3-letter word:
/(a.*b.*c)|(a.*c.*b)|(b.*a.*c)|(b.*c.*a)|(c.*a.*b)|(c.*b.*a)/
Obviously, a factorial is going to very quickly grow your number of possibilities and blow away your memory in a super long regex (although you can simplify if a word has the same letter multiple times, you do not have to test that letter more than once).
1! = 1
2! = 2
3! = 6
4! = 24
5! = 120
6! = 720
...
That's probably why your properly written test in plain JavaScript is much slower.
Also, in your case you should store the words much as Scrabble dictionaries do: all letters once, in alphabetical order (Scrabble keeps duplicates). So the word "word" would be "dorw", and, as in your example, the word "wwoorrddss" would be "dorsw". You can have a backend tool generate your table of words (so you still write them as "word" and "words", and the tool converts them to "dorw" and "dorsw"). Then you sort the letters of the query in alphabetical order as well, and the result is that you do not need a silly factorial for the regex; you can simply do this:
/d.*o.*r.*w/
And that will match any word that contains all the letters of "word", such as "password" (stored as "adoprsw").
One easy way to sort the letters is to split the word into an array of letters and then sort the array. You will still have duplicates afterwards, since the default JavaScript sort does not remove them, so drop them in a separate step.
One more detail: if your test is supposed to be case insensitive, then you want to transform your strings to lowercase before running the test. So something like:
query = query.toLowerCase();
early on in your top function.
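To make the sorted-letters idea concrete, here is a rough sketch. The helper names letterSignature and signatureRegex are made up, and it assumes the query consists of plain letters only (no regex metacharacters).

```javascript
// Reduce a word to its distinct letters in alphabetical order,
// e.g. "wwoorrddss" becomes "dorsw".
function letterSignature(word) {
  var seen = {};
  return word.toLowerCase().split("").filter(function (ch) {
    if (seen[ch]) return false; // drop duplicate letters
    seen[ch] = true;
    return true;
  }).sort().join("");
}

// Build /d.*o.*r.*w/ from the query "word".
function signatureRegex(query) {
  return new RegExp(letterSignature(query).split("").join(".*"));
}

var words = ['word', 'wwoorrddss', 'words', 'argument', 'sass', 'sword', 'carp', 'drowns'];
var re = signatureRegex("word");
var matches = words.filter(function (w) {
  return re.test(letterSignature(w));
});
console.log(matches); // ['word', 'wwoorrddss', 'words', 'sword', 'drowns']
```

In a real application the signatures of the dictionary words would be computed once up front rather than on every search.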
You are trying to speed up the algorithm "chars in word are a subset of the chars of query." You can short circuit this check and avoid some assignments (that are more readable but not strictly needed). Try the following version of match
function match(word, query) {
var len = word.length;
while (len--) {
if (query.indexOf(word[len]) === -1) { // found a missing char
return false;
}
}
return true; // couldn't find any missing chars
}
This gives a 4-5X improvement
Depending on the application you could try presorting words and presorting each word in words as another optimization.
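Note that the version above checks that every character of the word appears in the query. If the intent is the one stated in the question (a word matches when it contains every character of the query), the loop would run over the query instead. A sketch under that assumption, with the hypothetical name containsAllChars:

```javascript
// True when word contains every character of query (order and repeats ignored).
function containsAllChars(word, query) {
  var i = query.length;
  while (i--) {
    if (word.indexOf(query[i]) === -1) {
      return false; // this query character is missing from the word
    }
  }
  return true;
}

var words = ['word', 'wwoorrddss', 'words', 'argument', 'sass', 'sword', 'carp', 'drowns'];
var found = words.filter(function (w) {
  return containsAllChars(w, "word");
});
console.log(found); // ['word', 'wwoorrddss', 'words', 'sword', 'drowns']
```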
A classical regexp matcher constructs a finite state automaton and makes its decisions based on the current state and the character read, from left to right. This involves reading each character once and making a decision.
For static strings (looking for a fixed string in a body of text) there are dedicated algorithms, such as Knuth-Morris-Pratt, which never goes back over text it has already read after a mismatch. You must understand, though, that this algorithm is not for matching regular expressions, just plain strings.
If you are interested in Knuth-Morris-Pratt (there are several other algorithms), have a look at Wikipedia: http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
It is also worth investigating whether your regexp match routines use a DFA or an NFA: NFAs occupy less memory and are easier to construct, while DFAs match faster but carry a compilation penalty and use more memory.
The Knuth-Morris-Pratt algorithm also needs to compile the string into a table before working, so perhaps it doesn't pay off if you are using it just to find one word in one string.
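For reference, a compact sketch of Knuth-Morris-Pratt for plain substring search. The function names are my own, and in practice the built-in indexOf is likely to be faster for short strings; this only illustrates the "compile the pattern first" step mentioned above.

```javascript
// Build the failure table: table[i] is the length of the longest proper
// prefix of pattern[0..i] that is also a suffix of it.
function buildFailureTable(pattern) {
  var table = [0];
  var k = 0;
  for (var i = 1; i < pattern.length; i++) {
    while (k > 0 && pattern[i] !== pattern[k]) k = table[k - 1];
    if (pattern[i] === pattern[k]) k++;
    table[i] = k;
  }
  return table;
}

// Return the index of the first occurrence of pattern in text, or -1.
function kmpIndexOf(text, pattern) {
  if (pattern.length === 0) return 0;
  var table = buildFailureTable(pattern);
  var k = 0; // number of pattern characters matched so far
  for (var i = 0; i < text.length; i++) {
    while (k > 0 && text[i] !== pattern[k]) k = table[k - 1];
    if (text[i] === pattern[k]) k++;
    if (k === pattern.length) return i - k + 1;
  }
  return -1;
}

console.log(kmpIndexOf("password", "word"));     // 4
console.log(kmpIndexOf("abcabcabd", "abcabd"));  // 3
console.log(kmpIndexOf("abc", "x"));             // -1
```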

change regex to match some words instead of all words containing PRP

This regex matches all characters between whitespace if the word contains PRP.
How can I get it to match all words, or characters in between whitespace, if they contain PRP, but not if they contain "me" in either case?
So match all words containing PRP, but not containing ME or me.
Here is the regex to match words containing PRP: \S*PRP\S*
You can use negative lookahead for this:
(?:^|\s)((?!\S*?(?:ME|me))\S*?PRP\S*)
PS: Use group #1 for your matched word.
Code:
var re = /(?:^|\s)((?!\S*?(?:ME|me))\S*?PRP\S*)/;
var s = 'word abcPRP def';
var m = s.match(re);
if (m) console.log(m[1]); //=> abcPRP
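The snippet above returns only the first matching word. If every matching word is wanted, one option is to add the g flag and loop with exec, collecting group #1 each time. A sketch (the sample string is borrowed from elsewhere in this thread):

```javascript
var re = /(?:^|\s)((?!\S*?(?:ME|me))\S*?PRP\S*)/g;
var s = 'MELOLPRP PRPRP PRPthemeok PRPmhm';
var found = [];
var m;
// exec() with the g flag resumes from re.lastIndex on each call.
while ((m = re.exec(s)) !== null) {
  found.push(m[1]);
}
console.log(found); // ['PRPRP', 'PRPmhm']
```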
Instead of using complicated regular expressions which would be confusing for almost anyone who's reading it, why don't you break up your code into two sections, separating the words into an array and filtering out the results with stuff you don't want?
function prpnotme(w) {
var r = w.match(/\S+/g);
if(r == null)
return [];
var i=0;
while(i<r.length) {
if(r[i].indexOf('PRP') === -1 || r[i].toLowerCase().indexOf('me') !== -1)
r.splice(i,1);
else
i++;
}
return r;
}
console.log(prpnotme('whattttttt ok')); // []
console.log(prpnotme('MELOLPRP PRPRP PRPthemeok PRPmhm')); // ['PRPRP', 'PRPmhm']
For a very good reason why this is important, imagine you ever wanted to add more logic. You're much more likely to make a mistake when modifying a complicated regex to make it even more complicated, and this way it's done with simple logic that makes perfect sense when reading each predicate, no matter how much you add on.

Javascript/Jquery - how to replace a word but only when not part of another word?

I am currently doing a regex comparison to remove words (rude words) from a text field when written by the user. At the moment it performs the check when the user hits space and removes the word if matches. However it will remove the word even if it is part of another word. So if you type apple followed by space it will be removed, that's ok. But if you type applepie followed by space it will remove 'apple' and leave pie, that's not ok. I am trying to make it so that in this instance if apple is part of another word it will not be removed.
Is there any way I can perform the comparison on the whole word only or ignore the comparison if it is combined with other characters?
I know that this allows people to write many rude things with no space. But that is the desired effect by the people that give me orders :(
Thanks for any help.
function rude(string) {
var regex = /apple|pear|orange|banana/ig;
//example words because I'm sure you don't need to read profanity
var updatedString = string.replace( regex, function(s) {
var blank = "";
return blank;
});
return updatedString;
}
$(input).keyup(function(event) {
var text;
if (event.keyCode == 32) {
text = rude($(this).val());
$(this).val(text);
$("someText").html(text);
}
});
You can use word boundaries (\b), which match 0 characters, but only at the beginning or end of a word. I'm also using grouping (the parentheses), so it's easier to read and write such expressions.
var regex = /\b(apple|pear|orange|banana)\b/ig;
BTW, in your example you don't need to use a function. This is sufficient:
function rude(string) {
var regex = /\b(apple|pear|orange|banana)\b/ig;
return string.replace(regex, '');
}
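If the word list lives in an array rather than in the regex literal, the same pattern can be assembled at run time. A sketch, assuming the words may contain regex metacharacters and therefore need escaping; the names escapeRegExp and removeWholeWords are made up for the example:

```javascript
// Escape characters that have special meaning inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Remove each listed word, but only where it stands as a whole word.
function removeWholeWords(string, words) {
  var regex = new RegExp("\\b(" + words.map(escapeRegExp).join("|") + ")\\b", "ig");
  return string.replace(regex, "");
}

console.log(removeWholeWords("apple applepie pear", ["apple", "pear"])); // " applepie "
```

Keep in mind that \b is defined in terms of \w characters, so words that begin or end with punctuation will not behave as whole words here.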

Regex with limited use of specific characters (like a scrabble bank of letters)

I would like to be able to do a regex where I can identify sort of a bank of letters like [dgos] for example and use that within my regex... but whenever a letter from that gets used, it takes it away from the bank.
So let's say that somehow \1 is able to stand for that bank of letters (dgos). I could write a regex something like:
^\1{1}o\1{2}$
and it would match basically:
\1{1} = [dgos]{1}
o
\1{2} = [dgos]{2} minus whatever was used in the first one
Matches could include good, dosg, sogd, etc... and would not include sods (because s would have to be used twice) or sooo (because the o would have to be used twice).
I would also like to be able to identify letters that can be used more than once. I started to write this myself but then realized I didn't even know where to begin with this so I've also searched around and haven't found a very elegant way to do this, or a way to do it that would be flexible enough that the regex could easily be generated with minimal input.
I have a solution below using a combination of conditions and multiple regexes (feel free to comment on that answer; perhaps it's the way I'll have to do it?), but I would prefer a single regex solution if possible, and something more elegant and efficient if possible.
Note that the higher level of elegance and single regex part are just my preferences, the most important thing is that it works and the performance is good enough.
Assuming you are using javascript version 1.5+ (so you can use lookaheads), here is my solution:
^([dogs])(?!.*\1)([dogs])(?!.*\2)([dogs])(?!.*\3)[dogs]$
So after each letter is matched, you perform a negative lookahead to ensure that this matched letter never appears again.
This method wouldn't work (or at least, would need to be made a heck of a lot more complicated!) if you want to allow some letters to be repeated, however. (E.g. if your letters to match are "example".)
EDIT: I just had a little re-think, and here is a much more elegant solution:
^(?:([dogs])(?!.*\1)){4}
After thinking about it some more, I have thought of a way that this IS possible, although this solution is pretty ugly! For example, suppose you want to match the letters "goods" (i.e. including TWO "o"s):
^(?=.*g)(?=.*o.*o)(?=.*d)(?=.*s).{5}$
- So, I have used forward lookaheads to check that all of these letters are in the text somewhere, then simply checked that there are exactly 5 letters.
Or as another example, suppose we want to match the letters "banana", with a letter "a" in position 2 of the word. Then you could match:
^(?=.*b)(?=.*a.*a.*a)(?=.*n.*n).a.{4}$
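Since the question asks for something that can be generated with minimal input, here is a sketch that builds the positive-lookahead form from a string of letters. The function name exactBankRegex is hypothetical, and it assumes the bank contains only plain letters.

```javascript
// Build a regex that matches exactly the rearrangements of the given letters,
// e.g. "goods" gives ^(?=.*g)(?=.*o.*o)(?=.*d)(?=.*s).{5}$
function exactBankRegex(letters) {
  var counts = {};
  letters.split("").forEach(function (ch) {
    counts[ch] = (counts[ch] || 0) + 1;
  });
  var lookaheads = Object.keys(counts).map(function (ch) {
    // A letter needed twice becomes (?=.*o.*o), three times (?=.*a.*a.*a), ...
    var parts = [];
    for (var i = 0; i < counts[ch]; i++) parts.push(ch);
    return "(?=.*" + parts.join(".*") + ")";
  });
  return new RegExp("^" + lookaheads.join("") + ".{" + letters.length + "}$");
}

var re = exactBankRegex("goods");
console.log(re.source);         // ^(?=.*g)(?=.*o.*o)(?=.*d)(?=.*s).{5}$
console.log(re.test("dogso"));  // true
console.log(re.test("godss"));  // false (only one "o")
```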
Building on @TomLord's answer, taking into account that you don't necessarily need to exhaust the bank of letters, you can use negative lookahead assertions instead of positive ones. For your example D<2 from bank>R<0-5 from bank>, that regex would be
/^(?![^o]*o[^o]*o)(?![^a]*a[^a]*a)(?![^e]*e[^e]*e[^e]*e)(?![^s]*s[^s]*s)d[oaes]{2}r[oaes]{0,5}$/i
Explanation:
^ # Start of string
(?![^o]*o[^o]*o) # Assert no more than one o
(?![^a]*a[^a]*a) # Assert no more than one a
(?![^e]*e[^e]*e[^e]*e) # Assert no more than two e
(?![^s]*s[^s]*s) # Assert no more than one s
d # Match d
[oaes]{2} # Match two of the letters in the bank
r # Match r
[oaes]{0,5} # Match 0-5 of the letters in the bank
$ # End of string
You could also write (?!.*o.*o) instead of (?![^o]*o[^o]*o), but the latter is faster, just harder to read. Another way to write this is (?!(?:[^o]*o){2}) (useful if the number of repetitions increases).
And of course you need to account for the number of letters in your "fixed" part of the string (which in this case (d and r) don't interfere with the bank, but they might do so in other examples).
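The negative lookaheads can be generated from the bank as well. A sketch using the (?!(?:[^o]*o){2}) form mentioned above; the function name bankLimitLookaheads is made up, and the fixed letters d and r are assumed not to be in the bank:

```javascript
// One negative lookahead per distinct bank letter, forbidding one more
// occurrence of that letter than the bank holds.
function bankLimitLookaheads(bank) {
  var counts = {};
  bank.split("").forEach(function (ch) {
    counts[ch] = (counts[ch] || 0) + 1;
  });
  return Object.keys(counts).map(function (ch) {
    return "(?!(?:[^" + ch + "]*" + ch + "){" + (counts[ch] + 1) + "})";
  }).join("");
}

var bank = "oaees";
var re = new RegExp("^" + bankLimitLookaheads(bank) + "d[oaes]{2}r[oaes]{0,5}$", "i");

console.log(re.test("dear"));  // true
console.log(re.test("deer"));  // true  (two e's are available)
console.log(re.test("door"));  // false (only one o in the bank)
console.log(re.test("deere")); // false (three e's)
```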
Something like that would be pretty simple to specify, just very long
gdos|gdso|gods|gosd|....
You'd basically specify every permutation. Just write some code to generate every permutation, combine with the alternate operator, and you're done!
Although... It might pay to actually encode the decision tree used to generate the permutations...
I swear I think I answered this before on stackoverflow. Let me get back to you...
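For what it's worth, generating the permutations and joining them with the alternation operator takes only a few lines. A sketch (the function name permutations is my own); as noted elsewhere in this thread, the result grows factorially, so this is only practical for short banks:

```javascript
// Return all distinct orderings of the letters in a string.
function permutations(letters) {
  if (letters.length <= 1) return [letters];
  var result = [];
  var seen = {};
  for (var i = 0; i < letters.length; i++) {
    var ch = letters[i];
    if (seen[ch]) continue; // a repeated letter here would only produce duplicates
    seen[ch] = true;
    var rest = letters.slice(0, i) + letters.slice(i + 1);
    permutations(rest).forEach(function (p) {
      result.push(ch + p);
    });
  }
  return result;
}

var re = new RegExp("^(?:" + permutations("dgos").join("|") + ")$");
console.log(permutations("dgos").length); // 24
console.log(re.test("gods")); // true
console.log(re.test("good")); // false
```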
I couldn't think of a way to do this completely in a regex, but I was able to come up with this: http://jsfiddle.net/T2TMd/2/
As jsfiddle is limited to post size, I couldn't do the larger dictionary there. Check here for a better example using a 180k word dictionary.
Main function:
/*
* filter = regex to filter/split potential matches
* bank = available letters
* groups = capture groups that use the bank
* dict = list of words to search through
*/
var matchFind = function(filter, bank, groups, dict) {
var matches = [];
for(var i=0; i < dict.length; i++) {
if(dict[i].match(filter)){
var fail = false;
var b = bank;
var arr = dict[i].split(filter);
//loop groups that use letter bank
for(var k=0; k<groups.length && !fail; k++) {
var grp = arr[groups[k]] || [];
//loop characters of that group
for(var j=0; j<grp.length && !fail; j++) {
var regex = new RegExp(b);
var currChar = grp.charAt(j);
if(currChar.match(regex)) {
//match found, remove from bank
b = b.replace(currChar,"");
} else {
fail = true;
}
}
}
if(!fail) {
matches.push(dict[i]);
}
}
}
return matches;
}
Usage:
$("#go").click( function() {
var f = new RegExp($("#filter").val());
var b = "["+$("#bank").val().replace(/[^A-Za-z]+/g,"").toUpperCase()+"]";
var g = $("#groups").val().replace(/[^0-9,]+/g,"").split(",") || [];
$("#result").text(matchFind(f,b,g,dict).toString());
});
To make it easier to create scenarios, I created this as well:
$("#build").click( function() {
var bank = "["+$("#buildBank").val().replace(/[^A-Za-z]+/g,"").toUpperCase()+"]";
var buildArr = $("#builder").val().split(",");
var groups = [];
var build = "^";
for(var i=0; i<buildArr.length; i++) {
var part = buildArr[i];
if(/\</.test(part)) {
part = "(" + bank + part.replace("<", "{").replace(">", "}").replace("-",",") + ")";
build = build + part;
groups.push(i+1);
} else {
build = build + "("+part+")";
}
}
build = build + "$";
$("#filter").val(build);
$("#bank").val(bank);
$("#groups").val(groups.toString());
$("#go").click();
});
This would be useful in Scrabble, so let's say that you are in a position where a word must start with a "D", there are two spaces between that "D" and an "R" from a parallel word, and you have OAEES for your letter bank. In the builder I could put D,<2>,R,<0-3> because it must start with a D, then it must have 2 letters from the bank, then there is an R, then I'd have up to 3 letters to use (since I'm using 2 in between D and R).
The builder would use the letter bank and convert D,<2>,R,<0-3> to ^(D)([OAEES]{2})(R)([OAEES]{0,3})$, which would be used to filter for possible matches. Then, for each possible match, it looks at the capture groups that use the letter bank, character by character, removing each letter from the bank regex as it is found, so that a word won't match if it uses more of a letter than there are in the letter bank.
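The bank bookkeeping can also be done on an array of tiles instead of rebuilding a regex for every character. This is a simplified variant of the idea above, not the code from the fiddle; the names fitsBank and findWords and the tiny word list are invented for the example.

```javascript
// True when the letters can all be drawn from the bank, each tile used at most once.
function fitsBank(letters, bank) {
  var remaining = bank.split("");
  for (var i = 0; i < letters.length; i++) {
    var at = remaining.indexOf(letters.charAt(i));
    if (at === -1) return false; // letter not (or no longer) in the bank
    remaining.splice(at, 1);     // use up one tile
  }
  return true;
}

// Keep the words that pass the filter regex and whose bank groups fit the bank.
function findWords(dict, filter, groups, bank) {
  return dict.filter(function (word) {
    var m = filter.exec(word);
    if (!m) return false;
    var used = groups.map(function (g) { return m[g] || ""; }).join("");
    return fitsBank(used, bank);
  });
}

var sampleDict = ["DEAR", "DEARS", "DOOR", "DEER", "DART", "CAR"]; // hypothetical word list
var result = findWords(sampleDict, /^(D)([OAES]{2})(R)([OAES]{0,3})$/, [2, 4], "OAEES");
console.log(result); // ['DEAR', 'DEARS', 'DEER']
```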

Moving index in JavaScript regex matching

I have this regex to extract double words from text
/[A-Za-z]+\s[A-Za-z]+/g
And this sample text
Mary had a little lamb
My output is this
[0] - Mary had; [1] - a little;
Whereas my expected output is this:
[0] - Mary had; [1] - had a; [2] - a little; [3] - little lamb
How can I achieve this output? As I understand it, the index of the search moves to the end of the first match. How can I move it back one word?
Abusing String.replace function
I use a little trick with the replace function. Since the replace function loops through the matches and allows us to specify a function, the possibilities are endless. The result will be in output.
var output = [];
var str = "Mary had a little lamb";
str.replace(/[A-Za-z]+(?=(\s[A-Za-z]+))/g, function ($0, $1) {
output.push($0 + $1);
return $0; // Actually we don't care. You don't even need to return
});
Since the output contains overlapping portions of the input string, it is necessary not to consume the next word when matching the current word, which is done with a look-ahead [1].
The regex /[A-Za-z]+(?=(\s[A-Za-z]+))/g does exactly what I have said above: it consumes only one word at a time with the [A-Za-z]+ portion (the start of the regex), looks ahead for the next word with (?=(\s[A-Za-z]+)) [2], and also captures the matched text.
The function passed to the replace function will receive the matched string as the first argument and the captured text in subsequent arguments. (There are more - check the documentation - I don't need them here). Since the look-ahead is zero-width (the input is not consumed), the whole match is also conveniently the first word. The capture text in the look-ahead will go into the 2nd argument.
Proper solution with RegExp.exec
Note that String.replace function incurs a replacement overhead, since the replacement result is not used at all. If this is unacceptable, you can rewrite the above code with RegExp.exec function in a loop:
var output = [];
var str = "Mary had a little lamb";
var re = /[A-Za-z]+(?=(\s[A-Za-z]+))/g;
var arr;
while ((arr = re.exec(str)) != null) {
output.push(arr[0] + arr[1]);
}
Footnotes
[1] In other regex flavors that support variable-width look-behind, it is possible to retrieve the previous word instead. JavaScript regex did not support look-behind when this was written; it was added in ES2018.
[2] (?=pattern) is the syntax for look-ahead.
Appendix
String.match can't be used here since it ignores the capturing group when g flag is used. The capturing group is necessary in the regex, as we need look-around to avoid consuming input and match overlapping text.
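In engines that support String.prototype.matchAll (ES2020), the capturing groups are available for every match even with the g flag, so the loop can be collapsed. A sketch, assuming such an engine:

```javascript
var str = "Mary had a little lamb";
var re = /[A-Za-z]+(?=(\s[A-Za-z]+))/g;

// Each match object carries the consumed word in m[0] and the
// looked-ahead " nextWord" in m[1].
var output = Array.from(str.matchAll(re), function (m) {
  return m[0] + m[1];
});
console.log(output); // ['Mary had', 'had a', 'a little', 'little lamb']
```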
It can be done without regexp
"Mary had a little lamb".split(" ")
.map(function(item, idx, arr) {
if(idx < arr.length - 1){
return item + " " + arr[idx + 1];
}
}).filter(function(item) {return item;})
Here's a non-regex solution (it's not really a regular problem).
function pairs(str) {
var parts = str.split(" "), out = [];
for (var i=0; i < parts.length - 1; i++)
out.push([parts[i], parts[i+1]].join(' '));
return out;
}
Pass your string and you get an array back.
Side note: if you're worried about non-words in your input (making a case for regular expressions!) you can run tests on parts[i] and parts[i+1] inside the for loop. If the tests fail: don't push them onto out.
Here is another way you might like:
var s = "Mary had a little lamb";
// Break on each word and loop
s.match(/\w+/g).map(function(w) {
// Get the word, a space and another word
return s.match(new RegExp(w + '\\s\\w+'));
// At this point, there is one "null" value (the last word), so filter it out
}).filter(Boolean)
// There, we have an array of matches -- we want the matched value, i.e. the first element
.map(Array.prototype.shift.call.bind(Array.prototype.shift));
If you run this in your console, you'll see ["Mary had", "had a", "a little", "little lamb"].
With this way, you keep your original regex and can do the other stuff you want in it. Although with some code around it to make it really work.
By the way, this code is not cross-browser. The following functions are not supported in IE8 and below:
Array.prototype.filter
Array.prototype.map
Function.prototype.bind
But they're easily shimmable. Or the same functionality is easily achievable with for.
Here we go:
You still don't know how the regular expression internal pointer really works, so I will explain it to you with a little example:
Mary had a little lamb with this regex /[A-Za-z]+\s[A-Za-z]+/g
Here, the first part of the regex: [A-Za-z]+ will match Mary so the pointer will be at the end of the y
Mary had a little lamb
    ^
In the next part (\s[A-Za-z]+) it will match a space followed by another word, so...
Mary had a little lamb
        ^
The pointer will be where the word had ends. So here's your problem: you are advancing the internal pointer of the regular expression without wanting to. How is this solved? Lookaround is your friend. With lookarounds (lookahead and lookbehind) you are able to walk through your text without advancing the main internal pointer of the regular expression (it uses another pointer for that).
So at the end, the regular expression that would match what you want would be: ([A-Za-z]+(?=\s[A-Za-z]+))
Explanation:
The only thing you don't know about that regular expression is the (?=\s[A-Za-z]+) part. It means that the [A-Za-z]+ must be followed by a space and another word, or else the regular expression won't match. This is exactly what you seem to want: the internal pointer does not advance past the second word, and every word is matched except the last one, because the last one isn't followed by a word.
Then, once you have that, you only have to adapt what you are doing right now.
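To answer the literal question ("how can I move it back one word?"), the pointer described above is exposed as the regex's lastIndex property, and it can be rewound by hand between exec calls. A sketch that keeps a regex close to the original and only adds a capture group around the second word:

```javascript
var str = "Mary had a little lamb";
var re = /[A-Za-z]+\s([A-Za-z]+)/g;
var output = [];
var m;
while ((m = re.exec(str)) !== null) {
  output.push(m[0]);
  // Rewind to the start of the second word so it can begin the next pair.
  re.lastIndex = m.index + m[0].length - m[1].length;
}
console.log(output); // ['Mary had', 'had a', 'a little', 'little lamb']
```

The loop terminates because the second word always starts after the first, so lastIndex still moves forward on every iteration.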
In full admiration of the concept of 'look-ahead', I still propose a pairwise function (demo), since it's really Regex's task to tokenize a character stream, and the decision of what to do with the tokens is up to the business logic. At least, that's my opinion.
A shame that Javascript hasn't got a pairwise, yet, but this could do it:
function pairwise(a, f) {
for (var i = 0; i < a.length - 1; i++) {
f(a[i], a[i + 1]);
}
}
var str = "Mary had a little lamb";
pairwise(str.match(/\w+/g), function(a, b) {
document.write("<br>"+a+" "+b);
});
