Moving index in JavaScript regex matching - javascript

I have this regex to extract double words from text
/[A-Za-z]+\s[A-Za-z]+/g
And this sample text
Mary had a little lamb
My output is this
[0] - Mary had; [1] - a little;
Whereas my expected output is this:
[0] - Mary had; [1] - had a; [2] - a little; [3] - little lamb
How can I achieve this output? As I understand it, the index of the search moves to the end of the first match. How can I move it back one word?

Abusing String.replace function
I use a little trick with the replace function. Since replace loops through the matches and lets us specify a function, the possibilities are endless. The result will be in output.
var output = [];
var str = "Mary had a little lamb";
str.replace(/[A-Za-z]+(?=(\s[A-Za-z]+))/g, function ($0, $1) {
output.push($0 + $1);
return $0; // Actually we don't care. You don't even need to return
});
Since the output contains overlapping portions of the input string, we must not consume the next word while matching the current one; look-ahead makes this possible [1].
The regex /[A-Za-z]+(?=(\s[A-Za-z]+))/g does exactly that: it consumes only one word at a time with the [A-Za-z]+ portion (the start of the regex), looks ahead for the next word with (?=(\s[A-Za-z]+)) [2], and captures the looked-ahead text.
The function passed to replace receives the matched string as the first argument and the captured text in subsequent arguments. (There are more arguments, see the documentation, but I don't need them here.) Since the look-ahead is zero-width (the input is not consumed), the whole match is conveniently just the first word. The text captured inside the look-ahead goes into the second argument.
Proper solution with RegExp.exec
Note that String.replace incurs replacement overhead even though the replacement result is never used. If this is unacceptable, you can rewrite the above code with RegExp.exec in a loop:
var output = [];
var str = "Mary had a little lamb";
var re = /[A-Za-z]+(?=(\s[A-Za-z]+))/g;
var arr;
while ((arr = re.exec(str)) != null) {
output.push(arr[0] + arr[1]);
}
Footnote
[1] In other regex flavors that support variable-width look-behind, it is possible to retrieve the previous word instead. JavaScript regex did not support look-behind at all when this was written; it was only added in ES2018.
[2] (?=pattern) is the syntax for look-ahead.
Appendix
String.match can't be used here since it ignores the capturing group when g flag is used. The capturing group is necessary in the regex, as we need look-around to avoid consuming input and match overlapping text.
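In newer engines, String.prototype.matchAll (ES2020) does return the capture groups even with the g flag, so the same look-ahead regex can be used without the replace trick. A minimal sketch, assuming such an engine is available; the variable names are only illustrative:

```javascript
// Assumes an ES2020+ engine where String.prototype.matchAll exists.
var pairStr = "Mary had a little lamb";
var pairRe = /[A-Za-z]+(?=(\s[A-Za-z]+))/g;

// Each match m holds the consumed word in m[0] and the looked-ahead
// " nextWord" text in m[1].
var wordPairs = Array.from(pairStr.matchAll(pairRe), function (m) {
  return m[0] + m[1];
});

console.log(wordPairs); // ["Mary had", "had a", "a little", "little lamb"]
```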

It can be done without regexp
"Mary had a little lamb".split(" ")
.map(function(item, idx, arr) {
if(idx < arr.length - 1){
return item + " " + arr[idx + 1];
}
}).filter(function(item) {return item;})

Here's a non-regex solution (it's not really a regular problem).
function pairs(str) {
var parts = str.split(" "), out = [];
for (var i=0; i < parts.length - 1; i++)
out.push([parts[i], parts[i+1]].join(' '));
return out;
}
Pass your string and you get an array back.
demo
Side note: if you're worried about non-words in your input (making a case for regular expressions!) you can run tests on parts[i] and parts[i+1] inside the for loop. If the tests fail: don't push them onto out.

Here is one way you might like:
var s = "Mary had a little lamb";
// Break on each word and loop
s.match(/\w+/g).map(function(w) {
// Get the word, a space and another word
return s.match(new RegExp('\\b' + w + '\\s\\w+')); // \b keeps w from matching inside a longer word
// At this point, there is one "null" value (the last word), so filter it out
}).filter(Boolean)
// There, we have an array of matches -- we want the matched value, i.e. the first element
.map(Array.prototype.shift.call.bind(Array.prototype.shift));
If you run this in your console, you'll see ["Mary had", "had a", "a little", "little lamb"].
This way, you keep your original regex and can do the other stuff you want with it, although it needs some code around it to really work.
By the way, this code is not cross-browser. The following functions are not supported in IE8 and below:
Array.prototype.filter
Array.prototype.map
Function.prototype.bind
But they're easily shimmable. Or the same functionality is easily achievable with for.

Here we go:
You still don't know how the regular expression internal pointer really works, so I will explain it to you with a little example:
Mary had a little lamb with this regex /[A-Za-z]+\s[A-Za-z]+/g
Here, the first part of the regex, [A-Za-z]+, will match Mary, so the pointer will be just after the y:
Mary had a little lamb
    ^
In the next part (\s[A-Za-z]+) it will match a space followed by another word, so...
Mary had a little lamb
        ^
The pointer will be where the word had ends. So here's your problem: you are advancing the internal pointer of the regular expression without wanting to. How is this solved? Lookaround is your friend. With lookarounds (lookahead and lookbehind) you can inspect the text without advancing the main internal pointer of the regular expression (the engine uses a separate pointer for that).
So at the end, the regular expression that would match what you want would be: ([A-Za-z]+(?=\s[A-Za-z]+))
Explanation:
The only thing you don't know about that regular expression is the (?=\s[A-Za-z]+) part. It means that [A-Za-z]+ must be followed by a whitespace character and another word, otherwise the regular expression won't match. This is exactly what you seem to want: the internal pointer is not advanced past the next word, and every word except the last one is matched, because the last one is not followed by a word.
Then, once you have that, you only have to adapt what you are doing right now.
Here you have a working example, DEMO
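To watch the internal pointer described above, one can log RegExp.prototype.lastIndex after each exec call. This is only an illustrative sketch; the variable names are made up for the example:

```javascript
var ptrStr = "Mary had a little lamb";
var ptrRe = /[A-Za-z]+(?=\s[A-Za-z]+)/g;
var ptrLog = [];
var ptrMatch;

while ((ptrMatch = ptrRe.exec(ptrStr)) !== null) {
  // ptrMatch.index is where the match starts; ptrRe.lastIndex is where
  // the next search begins. The look-ahead did not advance it past the
  // following word.
  ptrLog.push(ptrMatch[0] + " @" + ptrMatch.index + " -> " + ptrRe.lastIndex);
}

console.log(ptrLog.join("\n"));
// Mary @0 -> 4
// had @5 -> 8
// a @9 -> 10
// little @11 -> 17
```

Each search resumes right after the single word that was consumed, which is why the following word is still available to start the next match.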

In full admiration of the concept of 'look-ahead', I still propose a pairwise function (demo), since it's really Regex's task to tokenize a character stream, and the decision of what to do with the tokens is up to the business logic. At least, that's my opinion.
A shame that Javascript hasn't got a pairwise, yet, but this could do it:
function pairwise(a, f) {
for (var i = 0; i < a.length - 1; i++) {
f(a[i], a[i + 1]);
}
}
var str = "Mary had a little lamb";
pairwise(str.match(/\w+/g), function(a, b) {
document.write("<br>"+a+" "+b);
});

Related

Faster way match characters between strings than Regex?

The use case is I want to compare a query string of characters to an array of words, and return the matches. A match is when a word contains all the characters in the query string, order doesn't matter, repeated characters are okay. Regex seems like it may be too powerful (a sledgehammer where only a hammer is needed). I've written a solution that compares the characters by looping through them and using indexOf, but it seems consistently slower. (http://jsperf.com/indexof-vs-regex-inside-a-loop/10) Is Regex the fastest option for this type of operation? Are there ways to make my alternate solution faster?
var query = "word",
words = ['word', 'wwoorrddss', 'words', 'argument', 'sass', 'sword', 'carp', 'drowns'],
reStoredMatches = [],
indexOfMatches = [];
function match(word, query) {
var len = word.length,
charMatches = [],
charMatch,
char;
while (len--) {
char = word[len];
charMatch = query.indexOf(char);
if (charMatch !== -1) {
charMatches.push(char);
}
}
return charMatches.length === query.length;
}
function linearIndexOf(words, query) {
var wordsLen = words.length,
wordMatch,
word;
while (wordsLen--) {
word = words[wordsLen];
wordMatch = match(word, query);
if (wordMatch) {
indexOfMatches.push(word);
}
}
}
function linearRegexStored(words, query) {
var wordsLen = words.length,
re = new RegExp('[' + query + ']', 'g'),
match,
word;
while (wordsLen--) {
word = words[wordsLen];
match = word.match(re);
if (match !== null) {
if (match.length >= query.length) {
reStoredMatches.push(word);
}
}
}
}
Note that your regex is wrong, that's most certainly why it goes so fast.
Right now, if your query is "word" (as in your example), the regex is going to be:
/[word]/g
This means: match any single one of the characters 'w', 'o', 'r', or 'd'. With the 'g' flag, match() returns an array of every such character found in the word, and your code only checks that this array has at least query.length entries. That is a lot less work than the most certainly more correct indexOf() version, which is why it is faster. (A word such as 'wwww' would pass the length check even though it lacks 'o', 'r', and 'd'.)
Also, you mention the idea/concept of any number of characters, I suppose as shown here:
'word', 'wwoorrddss'
The indexOf() will definitely not catch that properly if you really mean "any number" for each and every character, because you would have to match an infinite number of cases. Something like this as a regex:
/w+o+r+d+s+/g
You will certainly have a hard time writing the right code for that in plain JavaScript rather than using a regex. Either way, however, it is going to be somewhat slow.
From the comment below, all the letters of the word are required. To check that with a single regex, you need 3! (3 factorial) alternatives for a 3-letter word:
/(a.*b.*c)|(a.*c.*b)|(b.*a.*c)|(b.*c.*a)|(c.*a.*b)|(c.*b.*a)/
Obviously, a factorial is going to very quickly grow your number of possibilities and blow away your memory in a super long regex (although you can simplify if a word has the same letter multiple times, you do not have to test that letter more than once).
1! = 1
2! = 2
3! = 6
4! = 24
5! = 120
6! = 720
...
That's probably why your properly written test in plain JavaScript is much slower.
Also, in your case you should store the words nearly as Scrabble dictionaries do: each letter once, in alphabetical order (Scrabble keeps duplicates). So the word "word" would be "dorw" and, as you showed in your example, the word "wwoorrddss" would be "dorsw". You can have some backend tool generate your table of words (so you still write them as "word" and "words", and your tool massages those and converts them to "dorw" and "dorsw"). Then you sort the letters of the word you are testing for in alphabetical order as well, and the result is that you do not need a silly factorial for the regex. You can simply do this:
/d.*o.*r.*w/
And that will match any word that includes all the letters of "word", such as "password" (stored as "adoprsw").
One easy way to sort the letters is to split the word into an array of letters and then sort the array. You will still get duplicates: the default JavaScript sort does not remove them, so you have to do that yourself.
One more detail: if your test is supposed to be case insensitive, then you want to transform your strings to lowercase before running the test. So something like:
query = query.toLowerCase();
early on in your top function.
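The sorted-letters idea above can be sketched as follows. letterKey and queryPattern are names invented for this example, and the sketch assumes the query contains only plain letters (nothing that is special in a regex):

```javascript
// Reduce a word to its distinct letters in alphabetical order,
// e.g. "wwoorrddss" -> "dorsw".
function letterKey(word) {
  var seen = {};
  return word.toLowerCase().split("").filter(function (ch) {
    if (seen[ch]) return false; // drop repeated letters
    seen[ch] = true;
    return true;
  }).sort().join("");
}

// Build /d.*o.*r.*w/ from the query "word".
// Assumption: the query holds letters only, so no escaping is done.
function queryPattern(query) {
  return new RegExp(letterKey(query).split("").join(".*"));
}

var sampleWords = ['word', 'wwoorrddss', 'words', 'argument', 'sass', 'sword', 'carp', 'drowns'];
var sortedPattern = queryPattern("word");
var sortedMatches = sampleWords.filter(function (w) {
  return sortedPattern.test(letterKey(w));
});

console.log(sortedMatches); // ["word", "wwoorrddss", "words", "sword", "drowns"]
```

In a real application the keys would be computed once and stored next to the words, so only the regex test runs per query.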
You are trying to speed up the check "the chars in word are a subset of the chars of query." You can short-circuit this check and avoid some assignments (they are more readable but not strictly needed). Try the following version of match:
function match(word, query) {
var len = word.length;
while (len--) {
if (query.indexOf(word[len]) === -1) { // found a missing char
return false;
}
}
return true; // couldn't find any missing chars
}
This gives a 4-5x improvement.
Depending on the application, you could try presorting the words array and presorting the letters of each word in it as another optimization.
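The question defines a match as a word that contains every character of the query, which suggests looping over the query rather than over the word. If that is the intended direction, a hedged sketch with the same short circuit could look like this (containsAllChars is a name introduced here, not part of the original code):

```javascript
// Returns true when every character of query occurs somewhere in word.
function containsAllChars(word, query) {
  var len = query.length;
  while (len--) {
    if (word.indexOf(query[len]) === -1) { // word lacks this query char
      return false;
    }
  }
  return true; // no query char was missing
}

var subsetWords = ['word', 'wwoorrddss', 'words', 'argument', 'sass', 'sword', 'carp', 'drowns'];
var subsetMatches = subsetWords.filter(function (w) {
  return containsAllChars(w, "word");
});

console.log(subsetMatches); // ["word", "wwoorrddss", "words", "sword", "drowns"]
```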
The regexp match algorithm constructs a finite state automaton and makes its decisions based on the current state and the characters read from left to right. This involves reading each character once and making a decision.
For static strings (looking for a fixed string in a body of text) there are better algorithms, like Knuth-Morris-Pratt, that avoid re-scanning text they have already examined, but you must understand that this algorithm is not for matching regular expressions, just plain strings.
If you are interested in Knuth-Morris-Pratt (there are several other algorithms), have a look at Wikipedia: http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
A good thing you can do is to investigate whether your regexp match routines use a DFA or an NDFA: NDFAs occupy less memory and are easier to construct, while DFAs match faster, at the cost of some compilation penalty and more memory.
The Knuth-Morris-Pratt algorithm also needs to compile the string into an automaton before working, so perhaps it doesn't pay off for your problem if you are using it just to find one word in some string.

Is there a way to do a substring in Javascript but use string characters as the parameters for what you want to select?

So a substring can take two parameters, the index to start at and the index to stop at like so
var str="Hello beautiful world!";
document.write(str.substring(3,7));
but is there a way to designate the start and stopping points as a set of characters to grab, so instead of the starting point being 3 I would want it to be "lo" and instead of the end point being 7 I would want it to be "wo" so I would be grabbing "lo beautiful wo". Is there a Javascript function that serves that purpose already?
Sounds like you want to use regular expressions and string.match() instead:
var str="Hello beautiful world!";
document.write(str.match(/lo.*wo/)[0]); // document.write("lo beautiful wo");
Note, match() returns an array of matches, which might be null if there is no match. So you should include a null check.
If you're not familiar with regexes, this is a pretty good source:
http://www.w3schools.com/jsref/jsref_obj_regexp.asp
Use the indexOf method: document.write(str.substring(str.indexOf('lo'), str.indexOf('wo') + 2));
Yup, you can do this easily with regular expressions:
var substr = /lo.+wo/.exec( 'Hello beautiful world!' )[0];
console.log( substr ); //=> 'lo beautiful wo'
Use a regex brother:
if (/(lo.+wo)/.test("Hello beautiful world!")) {
document.write(RegExp.$1);
}
You need a backup plan in case the string does not match. Hence the use of test.
Regular expression may be able to achieve this to some extent, but there are many details that you must be aware of.
For example, suppose you want to find all the substrings that start with "lo" and end with the nearest "wo" after that "lo". (If there is more than one match, each subsequent match picks up the first "lo" after the "wo" of the previous match.)
"Hello beautiful world!".match(/lo.*?wo/g);
Using the RegExp constructor, you can make it more flexible (you can substitute "lo" and "wo" with the actual string you want to find):
"Hello beautiful world!".match(new RegExp("lo" + ".*?" + "wo", "g"));
Important: The downside of the RegExp approach above is that, you need to know what characters are special to escape them - otherwise, they will not match the actual substring you want to find.
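One common way around the escaping problem is a small helper that backslash-escapes every regex metacharacter before the strings go into the RegExp constructor. The sketch below is not part of the original answer; escapeRegExp and between are hypothetical helper names:

```javascript
// Escape characters that are special in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return the first substring running from startString to the nearest
// following endString, or null when there is no such substring.
function between(str, startString, endString) {
  var re = new RegExp(escapeRegExp(startString) + ".*?" + escapeRegExp(endString));
  var m = str.match(re);
  return m === null ? null : m[0];
}

console.log(between("Hello beautiful world!", "lo", "wo")); // "lo beautiful wo"
console.log(between("f(x) = [1, 2]", "(", "]"));            // "(x) = [1, 2]"
```

Without the escaping step, a start string such as "(" would make the RegExp constructor throw a syntax error.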
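It can also be achieved with indexOf, albeit a little bit dirty. For the first substring:
var str = "Hello beautiful world!";
var startString = "lo", endString = "wo";
var startIndex = str.indexOf(startString);
// Look for the end marker only from the start marker onwards
var endIndex = str.indexOf(endString, startIndex);
var result = null;
if (startIndex >= 0 && endIndex >= 0)
    result = str.substring(startIndex, endIndex + endString.length); // "lo beautiful wo"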
If you want to find the substring that starts with the first "lo" and ends with the last "wo" in the string, you can use indexOf and lastIndexOf to find it (with a small modification to the code above). RegExp can also do it, by changing .*? to .* in the two examples above (there will be at most 1 match, so the "g" flag at the end is redundant).

JavaScript regex back references returning an array of matches from single capture group (multiple groups)

I'm fairly sure after spending the night trying to find an answer that this isn't possible, and I've developed a work around - but, if someone knows of a better method, I would love to hear it...
I've gone through a lot of iterations on the code, and the following is just a line of thought really. At some point I was using the global flag, I believe, in order for match() to work, and I can't remember if it was necessary now or not.
var str = "#abc#def#ghi&jkl";
var regex = /^(?:#([a-z]+))?(?:&([a-z]+))?$/;
The idea here, in this simplified code, is that the optional group 1, of which there can be an unspecified number, will match #abc, #def and #ghi. It captures only the alpha characters, of which there will be one or more. Group 2 is the same, except that it matches on the & symbol. The pattern should also be anchored to the start and end of the string.
I want to be able to back reference all matches of both groups, ie:
result = str.match(regex);
alert(result[1]); //abc,def,ghi
alert(result[1][0]); //abc
alert(result[1][1]); //def
alert(result[1][2]); //ghi
alert(result[2]); //jkl
My mate says this works fine for him in .net, unfortunately I simply can't get it to work - only the last matched of any group is returned in the back reference, as can be seen in the following:
(additionally, making either group optional makes a mess, as does setting global flag)
var str = "#abc#def#ghi&jkl";
var regex = /(?:#([a-z]+))(?:&([a-z]+))/;
var result = str.match(regex);
alert(result[1]); //ghi
alert(result[1][0]); //g
alert(result[2]); //jkl
The following is the solution I arrived at, capturing the whole portion in question, and creating the array myself:
var str = "#abc#def#ghi&jkl";
var regex = /^([#a-z]+)?(?:&([a-z]+))?$/;
var result = regex.exec(str);
alert(result[1]); //#abc#def#ghi
alert(result[2]); //jkl
var result1 = result[1].toString();
result[1] = result1.split('#')
alert(result[1][1]); //abc
alert(result[1][2]); //def
alert(result[1][3]); //ghi
alert(result[2]); //jkl
That's simply not how .match() works in JavaScript. The returned array is an array of simple strings. There's no "nesting" of capture groups; you just count the ( symbols from left to right.
The first string (at index [0]) is always the overall matched string. Then come the capture groups, one string (or undefined, if the group did not take part in the match) per array element.
You can, as you've done, rearrange the result array to your heart's content. It's just an array.
edit — oh, and the reason your result[1][0] was "g" is that array indexing notation applied to a string gets you the individual characters of the string.
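If the goal is an array of every repeated capture, one option is to run a simpler global regex in an exec loop and collect group 1 from each match. This is a sketch under the assumption that the anchoring and overall validation are checked separately; collectFirstGroup is a name made up for the example:

```javascript
// Collect capture group 1 from every match of a global regex.
function collectFirstGroup(re, s) {
  if (!re.global) {
    // Without the g flag, exec would keep returning the first match.
    throw new Error("collectFirstGroup needs a regex with the g flag");
  }
  var out = [], m;
  re.lastIndex = 0; // start from the beginning of the string
  while ((m = re.exec(s)) !== null) {
    out.push(m[1]);
  }
  return out;
}

var capStr = "#abc#def#ghi&jkl";
var hashParts = collectFirstGroup(/#([a-z]+)/g, capStr);
var ampParts = collectFirstGroup(/&([a-z]+)/g, capStr);

console.log(hashParts); // ["abc", "def", "ghi"]
console.log(ampParts);  // ["jkl"]
```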

getting contents of string between digits

have a regex problem :(
What I would like to do is find the contents between two or more numbers.
var string = "90+*-+80-+/*70";
I'm trying to edit the symbols in between so that only the last symbol shows up and not the ones before it, i.e. to turn the above variable into 90+80*70. This is just an example, and I have no idea how to do this. The length of the numbers, how many "sets" of numbers there are, and the length of the symbols in between could be anything.
many thanks,
Steve,
The trick is in matching '90+*-+' and '80-+/*' separately, and selecting only the number and the last symbol.
The expression for finding a number followed by 1 or more non-numbers would be
\d+[^\d]+
To select the number and the last non-number, add parens:
(\d+)[^\d]*([^\d])
Finally add a /g to repeat the procedure for each match, and replace it with the 2 matched groups for each match:
js> '90+*-+80-+/*70'.replace(/(\d+)[^\d]*([^\d])/g, '$1$2');
90+80*70
js>
Or you can use lookahead assertion and simply remove all non-numerical characters which are not last: "90+*-+80-+/*70".replace(/[^0-9]+(?=[^0-9])/g,'');
You can use a regular expression to match the non-digits and a callback function to process the match and decide what to replace:
var test = "90+*-+80-+/*70";
var out = test.replace(/[^\d]+/g, function(str) {
return(str.substr(-1));
})
alert(out);
See it work here: http://jsfiddle.net/jfriend00/Tncya/
This works by using a regular expression to match sequences of non-digits and then replacing that sequence of non-digits with the last character in the matched sequence.
i would use this tutorial, first, then review this for javascript-specific regex questions.
This should do it -
var string = "90+*-+80-+/*70"
var result = '';
var arr = string.split(/(\d+)/)
for (var i = 0; i < arr.length; i++) {
if (!isNaN(arr[i])) result = result + arr[i];
else result = result + arr[i].slice(arr[i].length - 1, arr[i].length);
}
alert(result);
Working demo - http://jsfiddle.net/ipr101/SA2pR/
Similar to @Arnout Engelen
var string = "90+*-+80-+/*70";
string = string.replace(/(\d+)[^\d]*([^\d])(?=\d+)/g, '$1$2');
This was my first thought on how the RegEx should perform. It also looks ahead to make sure the non-digit pattern is followed by another digit, which is what the question asked for (between two numbers).
Similar to @jfriend00
var string = "90+*-+80-+/*70";
string = string.replace( /(\d+?)([^\d]+?)(?=\d+)/g
, function(){
return arguments[1] + arguments[2].substr(-1);
});
Instead of only matching on non-digits, it matches on non-digits between two numbers, which is what the question asked
Why would this be any better?
If your equation was embedded in a paragraph or string of text. Like:
This is a test where I want to clean up something like 90+*-+80-+/*70 and don't want to scrap the whole paragraph.
Result (Expected) :
This is a test where I want to clean up something like 90+80*70 and don't want to scrap the whole paragraph.
Why would this not be any better?
There is more pattern matching, which makes it theoretically slower (negligible)
It would fail if your paragraph had embedded numbers. Like:
This is a paragraph where Sally bought 4 eggs from the supermarket, but only 3 of them made it back in one piece.
Result (Unexpected):
This is a paragraph where Sally bought 4 3 of them made it back in one piece.
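One way to avoid damaging ordinary prose is to match only operator characters between the digits instead of any non-digit. The sketch below assumes the symbols are limited to +, -, * and /, which the question does not guarantee; collapseOperators is a hypothetical name:

```javascript
// Keep only the last operator of each run of operators between digits.
// Assumption: the only symbols that occur are + - * and /.
function collapseOperators(text) {
  return text.replace(/(\d)[+\-*\/]*([+\-*\/])(?=\d)/g, "$1$2");
}

console.log(collapseOperators("90+*-+80-+/*70")); // "90+80*70"

var sally = "This is a paragraph where Sally bought 4 eggs from the supermarket, but only 3 of them made it back in one piece.";
console.log(collapseOperators(sally) === sally); // true, the text is left alone
```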

Regular Expression "AND"

I'm doing some basic text matching from an input. I need the ability to perform a basic "AND". For "ANY" I split the input by spaces and join each word by the pipe ("|") character, but I haven't found a way to tell the regular expression to match all of the words.
switch (searchOption) {
case "any":
inputArray = input.split(" ");
if (inputArray.length > 1) { input = inputArray.join("|"); }
text = input;
break;
case "all":
inputArray = input.split(" ");
***[WHAT TO DO HERE?]***
text = input;
break;
case "exact":
inputArray = new Array(input);
text = input;
break;
}
Seems like it should be easy.
Use lookahead. Try this:
if( inputArray.length>1 ) rgx = "(?=.*" + inputArray.join( ")(?=.*" ) + ").*";
You'll end up with something like
(?=.*dog)(?=.*cat)(?=.*mouse).*
Which should only match if all the words appear, but they can be in any order.
The dog ate the cat who ate the mouse.
The mouse was eaten by the dog and the cat.
Most cats love mouses and dogs.
But not
The dog at the mouse.
Cats and dogs like mice.
The way it works is that the regex engine scans from the current match point (0) looking for .*dog, the first sub-regex (any number of any character, followed by dog). When it determines true-ness of that regex, it resets the match point (back to 0) and continues with the next sub-regex. So, the net is that it doesn't matter where each word is; only that every word is found.
EDIT: @Justin pointed out that I should have a trailing .*, which I've added above. Without it, text.match(regex) works, but regex.exec(text) returns an empty match string. With the trailing .*, you get the matching string.
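Put together, the look-ahead approach might be wrapped up as below. This is a sketch rather than the answerer's exact code: escapeTerm and allWordsRegex are invented names, and escaping the search terms is an addition so that input such as "c++" does not break the pattern:

```javascript
// Escape regex metacharacters in a search term.
function escapeTerm(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build (?=.*word1)(?=.*word2)....* from a space-separated input.
function allWordsRegex(input) {
  var terms = input.split(" ").filter(Boolean).map(escapeTerm);
  return new RegExp("(?=.*" + terms.join(")(?=.*") + ").*");
}

var andRe = allWordsRegex("dog cat mouse");

console.log(andRe.source);                                         // (?=.*dog)(?=.*cat)(?=.*mouse).*
console.log(andRe.test("The dog ate the cat who ate the mouse.")); // true
console.log(andRe.test("Most cats love mouses and dogs."));        // true
console.log(andRe.test("The dog at the mouse."));                  // false
console.log(andRe.test("Cats and dogs like mice."));               // false
```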
The problem with "and" is: in what combination do you want the words? Can they appear in any order, or must they be in the order given? Can they appear consecutively or can there be other words in between?
These decisions impact heavily what search (or searches) you do.
If you're looking for "A B C" (in order, consecutively), the expression is simply /A B C/. Done!
If you're looking for "A foo B bar C" it might be /A.*?B.*?C/
If you're looking for "B foo A foo C" you'd be better off doing three separate tests for /A/, /B/, and /C/
Do a simple for loop and search for every term, something like this:
var n = inputArray.length;
if (n) {
for (var i=0; i<n; i++) {
if (text.indexOf(inputArray[i]) === -1) { // term not found (assumes text is the string being searched)
break;
}
}
if (i != n) {
// not all terms were found
}
}
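The same loop can be written more compactly with Array.prototype.every, which stops at the first missing term. A minimal sketch, assuming plain substring matching is good enough; containsAllTerms is a name chosen for the example:

```javascript
// True when every term occurs somewhere in text (plain substring check).
function containsAllTerms(text, terms) {
  return terms.every(function (term) {
    return text.indexOf(term) !== -1;
  });
}

var searchTerms = "dog cat mouse".split(" ");

console.log(containsAllTerms("The mouse was eaten by the dog and the cat.", searchTerms)); // true
console.log(containsAllTerms("The dog at the mouse.", searchTerms));                       // false
```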
My regular expressions cookbook does feature a regular expression that can possibly do this using conditionals. However, it's quite complicated, so I'd go for the currently top rated answer which is iterating over the options. Anyway, trying to adapt their example I think it would be something like:
\b(?:(?:(word1)|(word2))(\b.*?)){2,}(?(1)|(?!))(?(2)|(?!))
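No guarantees that this'll work as is, but it's the basic idea I think. See what I mean about complicated? (Also note that JavaScript's regex flavor has no (?(1)...) conditionals, so this would only work in a flavor that supports them, such as PCRE or .NET.)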
