Regex skips some matches by consuming from the input string

I know why this problem occurs, but I need a way to solve it. Suppose I have a string such as:
var str = "123+53ff+124+ks23+223+22+mo";
and I want to replace the numbers between two + signs with number to get this:
str = "123+53ff+number+ks23+number+number+mo";
So I use this regular expression: /\+\d+\+/g like this:
str = str.replace(/\+\d+\+/g, "+number+");
But when there are several matches right after each other (+3+4+5+...), it matches one, skips the next, matches the one after that, skips the next, and so on.
I kind of know why this is happening: when it matches one, it consumes the right + and then starts matching just after that +, where the next element won't get matched because technically it's no longer surrounded by + signs.
If the input is +3++4++5+ then the result is as expected, because each number is surrounded by its own + signs (consuming one isn't a problem here because if one + is consumed the next one will do).
I've worked my way around this by calling the replace twice. But this is very hacky. I want a solid way. So, how to make the regular expression work here? Or how to make two consecutive numbers share the same + sign?
Note: I don't want to solve this particular example's issue (replace numbers between + signs). It was just an example of the issue.
EDIT:
If there is a way to do it, then can I group the + signs? I mean can I do this /(\+).../ so I can access it using $1?

Since the overlapping matches aren't matched, you could use a positive lookahead in order to determine if there is a + character following the digit(s).
In doing so, you aren't actually matching the succeeding + since it is part of the lookahead:
/\+\d+(?=\+)/g
For instance:
"123+53ff+124+ks23+223+22+mo".replace(/\+\d+(?=\+)/g, "+number");
Output:
"123+53ff+number+ks23+number+number+mo"
It's worth pointing out that you would use +number (rather than +number+) to replace the match now that the succeeding + is no longer being matched.
Note: If you want to group the + signs, group the first one like this: /(\+)\d+(?=\+)/g as the first one is sufficient. The second one will be automatically grouped in the next match.
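To see the difference side by side, here is a small sketch (the variable names are illustrative, not from the question) that runs the consuming pattern and the lookahead pattern on the same string:

```javascript
const str = "123+53ff+124+ks23+223+22+mo";

// Consuming version: the trailing + is part of the match, so the next
// number has no leading + left to match against and gets skipped.
const consuming = str.replace(/\+\d+\+/g, "+number+");

// Lookahead version: the trailing + is only asserted, not consumed,
// so it is still available as the leading + of the next match.
const withLookahead = str.replace(/\+\d+(?=\+)/g, "+number");

// Same idea with the leading + captured and reused through $1.
const withGroup = str.replace(/(\+)\d+(?=\+)/g, "$1number");

console.log(consuming);
console.log(withLookahead);
console.log(withGroup);
```

The first result still contains 22, while the other two replace every number that sits between two + signs.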

Regular expressions only find non-overlapping matches. Once a match is found, the search is continued from match end. You can use lookaround assertions to add context without adding to the match itself:
/(?<=\+)\d+(?=\+)/
(?<=\+) asserts that before the current location there is a plus; (?=\+) asserts that after the current location there is a plus. Both of them are zero-width.
EDIT: JavaScript did not support lookbehind when this was written (it was added in ES2018), so in engines without it /\+\d+(?=\+)/ is what you can do.
EDIT2: The lookaround assertions do not capture by themselves. You can't group the lookahead together with the match. However, you can group it independently, and concatenate later:
/(\+)(\d+)(?=(\+))/
will give you all three pieces.
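As a rough illustration of the two points above, the sketch below collects the three groups for every match and, assuming an engine with lookbehind support (ES2018 or later), uses both assertions at once. The sample input and variable names are made up for the example.

```javascript
const input = "+3+4+5+";

// A capturing group inside a lookahead still captures, even though the
// lookahead itself consumes nothing. Each entry is [leading +, digits, trailing +].
const pieces = [...input.matchAll(/(\+)(\d+)(?=(\+))/g)]
  .map(m => [m[1], m[2], m[3]]);

// With lookbehind available, neither + is consumed, so only the digits
// are replaced.
const digitsOnly = input.replace(/(?<=\+)\d+(?=\+)/g, "n");

console.log(pieces);
console.log(digitsOnly);
```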

Related

Unable to match regex for any character except ' and "

I've written a regex to match against the string
{{AB.group.one}}:"eighth",{{AB.group.TWO}}:"third",{{attr1111}}:"fourth","fifth":{{attr_22_2qq2}},"sixth":{{AB.group.three}},{{ab.group.fourth}}:"seventh","ninth":{{attr1111}}}
Regex:
/[^'"]({{2}[a-zA-Z0-9$_].*?}{2})[^'"]/gi
Breaking the regex above:
[^'"]: Start with a character which is neither ' nor ".
({{2}[a-zA-Z0-9$_].*?}{2}): Have exactly 2 {{, then any character in the range a-zA-Z0-9$_ . After that, exactly 2 }}
[^'"]: Any character except for ' and ".
Below matches are not the exact matches but the captured groups. I'll perform my operations on the captured groups so for simplicity, we can consider them as our matches.
Expected matches:
{{AB.group.one}}
{{AB.group.TWO}}
{{attr1111}}
{{attr_22_2qq2}}
{{AB.group.three}}
{{ab.group.fourth}}
{{attr1111}}}
Resultant matches:
{{AB.group.TWO}}
{{attr1111}}
{{attr_22_2qq2}}
{{AB.group.three}}
{{attr1111}}}
As the two lists above show, {{AB.group.one}} and {{ab.group.fourth}} do not match. I want them to match as well.
I know the reasons why they aren't matching.
The reason why {{AB.group.one}} doesn't match is because [^'"] expects one character except for ' and " and I'm not providing one. If I replace [^'"] with ["'"]*, it'll work but in that case "{{AB.group.one}}" will match as well.
So, the problem statement is match any character(if there's any) before {{ and after }} but the character can't be ' or ".
The reason why {{ab.group.fourth}} doesn't match is because the character preceding this match i.e. , is part of another match. This is just my speculation, the reason could be something else. But if I include any character between {{AB.group.three}}, and {{ab.group.fourth}} (e.g. {{AB.group.three}}, {{ab.group.fourth}}), then the pattern matches. I have no idea how can I fix this.
Please help me in solving these two problems. Thank you.

Here is a regex-based approach which seems to be working. First, we strip off all double-quoted terms, then replace islands of commas/colons with a single comma separator. Finally, we split on the comma to generate an array of terms.
var input = "{{AB.group.one}}:\"eighth\",{{AB.group.TWO}}:\"third\",{{attr1111}}:\"fourth\",\"fifth\":{{attr_22_2qq2}},\"sixth\":{{AB.group.three}},{{ab.group.fourth}}:\"seventh\",\"ninth\":{{attr1111}}},\"blah\":\"stuff\",{{one}}:{{two}}";
var terms = input.replace(/\".*?\"/g, "").replace(/[,:]+/g, ",").split(",");
console.log(terms);

You were actually really close with what you had.
let input = '{{AB.group.one}}:"eighth",{{AB.group.TWO}}:"third",{{attr1111}}:"fourth","fifth":{{attr_22_2qq2}},"sixth":{{AB.group.three}},{{ab.group.fourth}}:"seventh","ninth":{{attr1111}}}'
let regex = /(?<=[^'"]?)({{2}[a-zA-Z0-9$_].*?}{2})(?=[^'"]?)/gi;
console.log(input.match(regex))
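(?<=[^'"]?) is a positive lookbehind. The negated character set is meant to check that the character before the match is not ' or ". The question mark makes this optional (zero or one of the previous token, the negated character set), so that a match at the very start of the input is allowed. Be aware that an optional assertion can also succeed by matching nothing, so on its own it does not reject a preceding quote; a negative lookbehind such as (?<!['"]) states the restriction directly.
(?=[^'"]?) is a positive lookahead and looks at the token immediately after the expression, with the intent that it is not a ' or " (or that there is no token after the expression). The same caveat applies: because the set is optional, (?!['"]) is the stricter way to express it.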
Another option, since lookbehinds aren't supported in every browser:
let input = '{{AB.group.one}}:"eighth",{{AB.group.TWO}}:"third",{{attr1111}}:"fourth","fifth":{{attr_22_2qq2}},"sixth":{{AB.group.three}},{{ab.group.fourth}}:"seventh","ninth":{{attr1111}}}'
let regex = /(?:[^{'"])?({{2}[a-zA-Z0-9$_].*?}{2})(?:[^}'"])?/gi
console.log([...input.matchAll(regex)].map(reg => reg[1]))
String.match() loses reference to capture groups when the global flag is passed, so only returns the "match". Since you're creating a capture group with ({{2}[a-zA-Z0-9$_].*?}{2}), if you wanted to just ensure the characters immediately surrounding the bracketed expression aren't quotation marks, you can just use non-capture groups for those optional checks.
(?:[^{'"])? is a non-capturing group, as is (?:[^}'"])?
Using String.matchAll, the first element of the arrays created for each match is the entire match, the second element is the first capturing group, etc. So the logic for mapping over [...input.matchAll(regex)] is just to collect the capturing group from each match.
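If quoted placeholders such as "{{x}}" must really be rejected, one possible variation is to use negative lookarounds instead of optional positive ones. This is only a sketch under two assumptions that go beyond the question: the engine supports lookbehind (ES2018 or later), and a placeholder body never contains braces, which is why [^{}]* replaces .*? here. The sample input is shortened and has a quoted placeholder added for contrast.

```javascript
const input = '{{AB.group.one}}:"eighth","{{quoted}}",{{ab.group.fourth}}:"seventh"';

// (?<!['"])  the character before {{ must not be a quote (start of input is fine)
// (?!['"])   the character after }} must not be a quote (end of input is fine)
// Neither assertion consumes a character, so adjacent placeholders
// cannot steal each other's separators.
const placeholder = /(?<!['"])\{\{[a-zA-Z0-9$_][^{}]*\}\}(?!['"])/g;

const found = input.match(placeholder);
console.log(found);
```

Because nothing around the braces is consumed, the whole match is already the placeholder and no capture group is needed.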

Regexp: excluding a word but including non-standard punctuation

I want to find strings that contain words in a particular order, allowing non-standard characters in between the words but excluding a particular word or symbol.
I'm using javascript's replace function to find all instances and put into an array.
So, I want select...from, with anything except 'from' in between the words. Or I can separate select...from from select...from (, as long as I exclude nesting. I think the answer is the same for both, i.e. how do I write: find x and not y within the same regexp?
From the internet, I feel this should work: /\bselect\b^(?!from).*\bfrom\b/gi but this finds no matches.
This works to find all select...from: /\bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b/gi but modifying it to exclude the parenthesis "(" at the end prevents any matches: /\bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b\s*^\(/gi
Can anyone tell me how to exclude words and symbols within this regexp?
Many thanks
Emma
Edit: partial string input:
left outer join [stage].[db].[table14] o on p.Project_id = o.project_id
left outer join
(
select
different_id
,sum(costs) - ( sum(brushes) + sum(carpets) + sum(fabric) + sum(other) + sum(chairs)+ sum(apples) ) as overallNumber
from
(
select ace from [stage].db.[table18] J
Javascript:
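var sequel = stringInputAsAbove;
var selects = []; // start offsets of each select...from match
// With no capture groups in the regex, the callback's second argument is the match offset.
var tst = sequel.replace(/\bselect\b[\s\S]*?\bfrom\b/gi, function(a,b) { console.log('match: '+a); selects.push(b); return a; });
console.log(selects);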
console.log(selects) should print an array of numbers, where each number is the starting character offset of a select...from. This works for the second regexp I gave in my info, printing: [95, 251]. Your \s\S variation does the same, @stribizhev.
The first example ^(?!from).* should do likewise but returns [].
The third example \s*^\( should return 251 only but returns []. However I have just noticed that the positive expression \s*\( does give 95, so some progress! It's the negatives I'm getting wrong.

Your \bselect\b^(?!from).*\bfrom\b regex doesn't work as expected because:
^ here means the beginning of a line, not a negation of the next part, so \bselect\b^ means the word select followed by the beginning of a line. After removing ^ the regex starts to match something, but it is still not correct.
In multiline text, .* without modification will not match a newline, so the regex will only match a select...from that sits on a single line. If you change it to (.|\n)* (as a simple example) it will match across lines, but the result is still not correct.
* is a greedy quantifier, so it will match as much as possible. If you use the reluctant quantifier *? instead, the regex will match up to the first occurrence of the word from, and it will start to return relatively correct results.
\bselect\b(?!from) means: match the separate word select when it is not directly followed by from. For the lookahead to reject anything, the text would have to be selectfrom and still have a word boundary between the two words (select\bfrom), which cannot happen, so (?!from) has no effect here and is redundant.
In effect you will get a regex very similar to what Stribizhev gave you: \bselect\b(.|\n)*?\bfrom\b
In the third expression you make the same mistake: \bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b\s*^\( uses ^ as (I assume) a negation, not as the beginning of a line. Remove ^ and you will again get a relatively valid result (a match from select through from to the opening parenthesis ( ).
Your second regex works similarly to \bselect\b(.|\n)*?\bfrom\b or \bselect\b[\s\S]*?\bfrom\b.
I wrote "relatively valid result" because I also think that parsing SQL with regex can be very complicated, so I am not sure it will work in every case.
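The point about .* and newlines can be checked with a short sketch. The SQL fragment below is a shortened, made-up stand-in for the input in the question, and matchAll is used here instead of replace only to collect the start offsets.

```javascript
const sql = "left outer join (\nselect\n  different_id\nfrom (\nselect ace from t18 J";

// . does not match a newline, so only a select...from that sits on one
// line is found.
const singleLine = [...sql.matchAll(/\bselect\b.*?\bfrom\b/gi)].map(m => m.index);

// [\s\S] matches any character including newlines, and *? stops at the
// first from, so both statements are found.
const anyChar = [...sql.matchAll(/\bselect\b[\s\S]*?\bfrom\b/gi)].map(m => m.index);

console.log(singleLine); // offset of the last select only
console.log(anyChar);    // offsets of both selects
```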
You can also try a positive lookahead to match just the position in the text, like:
(?=\bselect\b(?:.|\n)*?\bfrom\b)
In the demo, an empty group () was added to the regex just to return the starting index of each match in the groups, so that it is easier to check its validity.
Negation in regex
We use ^ as negation in a character class. For example, [^a-z] means match anything but a lowercase letter, so it will match a digit, symbol, whitespace, etc., but not a letter in the range a to z. But this negation works at the level of a single character. If you use [^from] it will prevent the regex from matching the characters f, r, o and m. Likewise, [^from]{4} will avoid matching from, but also form, morf, etc.
To exclude a whole word from being matched, you need a negative lookahead such as (?!from), which fails if the chosen word from follows the given position. To avoid matching a whole line that contains from you could use ^(?!.*from.*).+$.
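A small sketch of the difference between the two kinds of negation; the sample strings are invented for illustration:

```javascript
// A negated character class works one character at a time: any of the
// letters f, r, o, m is rejected, whatever order they appear in.
const fourNonFrom = /^[^from]{4}$/;
const rejectsForm = fourNonFrom.test("form"); // false: every letter is in the class
const acceptsData = fourNonFrom.test("data"); // true: none of d, a, t is in the class

// A negative lookahead rejects the whole word. Anchored at the line
// start, it drops every line that contains "from" anywhere.
const lines = ["select a", "copy from b", "form letter"];
const withoutFrom = lines.filter(line => /^(?!.*from.*).+$/.test(line));

console.log(rejectsForm, acceptsData);
console.log(withoutFrom);
```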
However, in your case you don't need this construction, because if you replace the greedy .*\bfrom with the reluctant .*?\bfrom it will match up to the first occurrence of the word. What's more, the construction would cause problems. A regex that places (?![\s\S]*from[\s\S]*) after select will not match anything, because the lookahead is not restricted by anything: it succeeds only if there is no from anywhere after select, but we also want to match from! In effect this regex tries to match and exclude from at the same time, and fails. So the (?!.*word.*) construction is much better suited to excluding a whole line that contains a given word.
So what to do if we don't want a word to appear inside a fragment of a match? I think select\b([^f]|f(?!rom))*?\bfrom\b is a good solution. With ([^f]|f(?!rom))*? it will match everything between select and from without consuming a from along the way, while the closing from is still matched.
But if you would like to match only a select...from that is not followed by (, then it is a good idea to use (?!\(). However, in your regex (multiline, using (.|\n)*? or [\s\S]*?) this will cause it to match up to the next select...from part, because the reluctant quantifier will extend the match as far as it needs to in order to make the whole regex succeed. In my opinion, a good solution would be to use again:
select\b([^f]|f(?!rom))*?\bfrom\b(?!\s*?\()
which will not overlap an additional select...from and will not match if there is a ( after select...from.
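To make the overlap problem concrete, here is a sketch on a made-up one-line SQL fragment. It compares the [\s\S]*? body with the ([^f]|f(?!rom))*? body, both followed by the "not followed by (" lookahead; a non-capturing group is used only to keep the match array small.

```javascript
const sql = "select a from ( select b from t1 ) x";

// Reluctant "anything": when the first from is rejected by the lookahead,
// the quantifier simply extends across it to the next from.
const lazyAny = /\bselect\b[\s\S]*?\bfrom\b(?!\s*\()/gi;

// The body can never step over a from, so a rejected select...from is
// skipped as a whole instead of being merged with the next one.
const noInnerFrom = /\bselect\b(?:[^f]|f(?!rom))*?\bfrom\b(?!\s*\()/gi;

const collect = re => [...sql.matchAll(re)].map(m => m[0]);

console.log(collect(lazyAny));     // one match spanning both statements
console.log(collect(noInnerFrom)); // only the inner select...from
```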

Javascript Regex: how to simulate "match without capture" behavior of positive lookbehind?

I have a relatively simple regex problem - I need to match specific words in a string, if they are entire words or a prefix. With word boundaries, it would look something like this:
\b(word1|word2|prefix1|prefix2)
However, I can't use the word boundary condition because some words may start with odd characters, e.g. .999
My solution was to look for whitespace or starting token for these odd cases.
(\b|^|\s)(word1|word2|prefix1|prefix2)
Now words like .999 will still get matched correctly, BUT it also captures the whitespace preceding the matched words/prefixes. For my purposes, I can't have it capture the whitespace.
Positive lookbehinds seem to solve this, but javascript doesn't support them. Is there some other way I can get the same behavior to solve this problem?

You can use a non-capturing group using (?:):
/(?:\b|^|\s)(word1|word2|prefix1|prefix2)/
UPDATE:
Based on what you want to replace it with (and @AlanMoore's good point about the \b), you probably want to go with this:
var regex = /(^|\s)(word1|word2|prefix1|prefix2)/g;
myString.replace(regex,"$1<span>$2</span>");
Note that I changed the first group back to a capturing one, since it'll be part of the match but you want to keep it in the replacement string (right?). I also added the g modifier so that this happens for all occurrences in the string (assuming that's what you wanted).

Let's get the terminology straight first. A regex normally consumes everything it matches. When you do a replace(), everything that was consumed is overwritten. You can also capture parts of the matched text separately and plug them back in using $1, $2, etc.
When you were using the word boundary you didn't have to worry about this, because \b doesn't consume anything. But now you're consuming the leading whitespace character if there is one, so you have to plug it back in. I don't know what you're replacing the match with, so I'll just replace them with nothing for this demonstration.
result = subject.replace(/(^|\s)(word1|word2|prefix1|prefix2)/g, "$1");
Note that the \b isn't needed any more. In fact, you must remove it, or it will match things like .999 in xyz.999, because \b matches between z and .. I'm pretty sure you don't want that.
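A quick sketch of both points, using invented words (foo and .999) and a <b> wrapper as stand-ins for whatever the real word list and replacement are:

```javascript
const subject = "foo xyz.999 .999 foobar";

// The leading whitespace is consumed by group 1, so $1 puts it back.
const withoutBoundary = subject.replace(/(^|\s)(\.999|foo)/g, "$1<b>$2</b>");

// Keeping \b in the first group also matches .999 inside xyz.999,
// because there is a word boundary between z and the dot.
const withBoundary = subject.replace(/(\b|^|\s)(\.999|foo)/g, "$1<b>$2</b>");

console.log(withoutBoundary);
console.log(withBoundary);
```

In the first result xyz.999 is left untouched; in the second it is not.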

How to make regex match only first occurrence of each match?

/\b(keyword|whatever)\b/gi
How can I modify the above JavaScript regex to match only the first occurrence of each word (I believe this is called non-greedy)?
That is, the first occurrence of "keyword" and the first occurrence of "whatever", and I may put more words in there.

Remove g flag from your regex:
/\b(keyword|whatever)\b/i

What you're doing is simply unachievable with a singular regular expression. Instead you will have to store every word you wish to find in an array, loop through them all searching for an answer, and then for any matches, store the result in an array.
Example:
var words = ["keyword","whatever"];
var text = "Whatever, keywords are like so, whatever... Unrelated, I now know " +
           "what it's like to be a tweenage girl. Go Edward.";
var matches = []; // An empty array to store results in.
/* When you search the text you need to convert it to lower case to make it
   searchable.
 * We'll be using the built in method 'String.indexOf(needle)' to match
   the strings as it avoids the need to escape the input for regular expression
   metacharacters. */
// Text converted to lower case to allow a case-insensitive search.
var lowerCaseText = text.toLowerCase();
for (var i=0;i<words.length;i++) { // Loop through the `words` array
    // indexOf returns -1 if no match is found
    if (lowerCaseText.indexOf(words[i]) != -1)
        matches.push(words[i]); // Add to the `matches` array
}

Remove the g modifier from your regex. Then it will find only one match.

What you're talking about can't be done with a JavaScript regex. It might be possible with advanced regex features like .NET's unrestricted lookbehind, but JavaScript's feature set is extremely limited. And even in .NET, it would probably be simplest to create a separate regex for each word and apply them one by one; in JavaScript it's your only option.
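A minimal sketch of the "one regex per word" approach mentioned above. It assumes the words contain no regex metacharacters (otherwise they would need escaping first); the sample text and the shape of the result objects are invented for the example.

```javascript
const text = "Whatever, keyword one, whatever... another Keyword.";
const words = ["keyword", "whatever"];

// Build a separate, non-global regex for each word. Without the g flag,
// exec() always reports the first occurrence only.
const firsts = words
  .map(word => {
    const m = new RegExp("\\b" + word + "\\b", "i").exec(text);
    return m ? { word: word, index: m.index, match: m[0] } : null;
  })
  .filter(Boolean); // drop words that never occur

console.log(firsts);
```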
Greediness only applies to regexes that employ quantifiers, like /START.*END/. The . means "any character" and the * means "zero or more". After the START is located, the .* greedily consumes the rest of the text. Then it starts backtracking, "giving back" one character at a time until the next part of the regex, END succeeds in matching.
We call this regex "greedy" because it matches everything from the first occurrence of START to the last occurrence of END.
If there may be more than one "START"-to-"END" sequence, and you want to match just the first one, you can append a ? to the * to make it non-greedy: /START.*?END/. Now, each time the . tries to consume the next character, it first checks to see if it could match END at that spot instead. Thus it matches from the first START to the first END after that. And if you want to match all the "START"-to-"END" sequences individually, you add the 'g' modifier: /START.*?END/g.
It's a bit more complicated than that, of course. For example, what if these sequences can be nested, as in START…START…END…END? If I've gotten a little carried away with this answer, it's because understanding greediness is the first important step to mastering regexes. :-/
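The greedy and non-greedy behaviour described in this answer can be seen in a short sketch (the sample string is made up):

```javascript
const s = "START a END START b END";

// Greedy: from the first START to the last END.
const greedy = /START.*END/.exec(s)[0];

// Non-greedy: from the first START to the first END after it.
const lazy = /START.*?END/.exec(s)[0];

// Non-greedy plus the g flag: every START-to-END sequence on its own.
const each = s.match(/START.*?END/g);

console.log(greedy);
console.log(lazy);
console.log(each);
```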

New to Regular Expressions need help

I need a form with one button and an input field
that will check an array via a regular expression
and find an exact match of letters + numbers. Example: wxyz [some space btw] 0960000
or a mix of numbers and letters [some space btw] + numbers: 01xg [some space btw] 0960000
The array has four objects for now.
Once found, I need a function that will open a new page or window when a match is found.
Thank you for your help.
Michael

To answer the Javascript part, here's one way to "grep" through the array to find matching elements:
var matches = [];
var re = /whatever/;
// foo is the array being searched
foo.forEach(
    function(el) {
        if( re.exec(el) )
            matches.push(el);
    }
);
To attempt to answer the regular expression part: I don't know what "exact match" means to you, and I'm assuming "some space" belongs only in between the other terms, and I'm assuming letters means the English alphabet from 'a' to 'z' in lower and upper case and the digits should be 0-9 (otherwise, other language characters might be matched).
The first pattern would be /[a-zA-Z0-9]+\s*0960000/. Change "\s*" to "\s+" if there is at least one space, instead of zero or more space characters. Change "\s" to " " if matching the tab character (and some lesser-used space chars) is not desirable.
For the second pattern, I don't know what "numbers 01xg" means, but if it means numbers followed by that string, then the pattern would be /[a-zA-Z0-9]+\s*[0-9]+\s*01xg\s*0960000/. The same caveats apply as above.
Additionally, this will match a partial string. If the string must be matched in its entirety (if nothing may exist in the string except what is matched), add "^" to the beginning of the pattern to anchor it to the beginning of the string, and "$" at the end to anchor it to the end of the string. For example, /[a-zA-Z0-9]+\s*0960000/ matches "foo_bar 5 0960000", but /^[a-zA-Z0-9]+\s*0960000$/ does not.
For more on regular expressions in Javascript, take a look at developer.mozilla.org's article on the RegExp object (the link takes you to JS version 1.5 reference, which should apply to all JS-capable browsers).
(edited to add): To match either situation, since they have overlapping parts, you could use the following pattern: /[a-zA-Z0-9]+(?:\s*[0-9]+\s*01xg)?\s*0960000/. The question mark says to match the part that differs, wrapped in a non-capturing group (?:foo), once or zero times. (?:foo)? and (?:foo|) do the same thing in this case, but I'm not sure whether there is a performance difference; I would recommend using the one that makes the most sense to you, so you can read it later.
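Putting the pieces together, here is a sketch that filters an array with the anchored form of the combined pattern. The entries array is hypothetical sample data, and filter is used as a shorter alternative to the forEach loop above.

```javascript
// Hypothetical sample data standing in for the array in the question.
const entries = [
  "wxyz 0960000",
  "01xg   0960000",
  "abc 12 01xg 0960000",
  "foo_bar 5 0960000",
  "0960000 extra"
];

// Anchored with ^ and $ so the whole string has to fit the pattern.
const exact = /^[a-zA-Z0-9]+(?:\s*[0-9]+\s*01xg)?\s*0960000$/;
const matches = entries.filter(entry => exact.test(entry));

// Without anchors, a partial match is enough.
const partial = /[a-zA-Z0-9]+\s*0960000/.test("foo_bar 5 0960000");

console.log(matches);
console.log(partial);
```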
