Check if string is repetition of an unknown substring in javascript - javascript

I try to solve the same problem in javascript with regexp mentioned here: Check if string is repetition of an unknown substring
I translated the regex in the first answer to Javascript: ^(.+){2,}$
But it does not work as I expect:
'SingleSingleSingle'.replace(/^(.+){2,}$/m, '$1') // returns 'e' instead of exptected 'Single'
What am I overlooking?

I currently have no explanation for why it returns e, but . matches any character and .{2,} basically just means "match any two or more characters".
What you want is to match whatever you captured in the capture group, by using backreferences:
/^(.+)\1+$/m
I just noticed that this is also what the answer you linked to suggests to use: /(.+)\1+/. The expression is exactly the same, there is nothing you have to change for JavaScript.

I think the reason why you get 'e' is that {2,} implies two or more repetitions of a match to the regex that preceeds it, in this case (.+) . {2,} does not guarantee that the repetitions match each other, only that they all qualify as a match for (.+).
From what I can see (using Expresso) it looks like the first match to
(.+) is 'SingleSingleSingl' (due to greedy matching) and the second match is 'e'. Since capturing groups only remember their last match, that is why replace() is giving you back 'e'. If you use (.+?) (for non-greedy or reluctant matching) each individual character will match, but you will still only get the last one, 'e'.
Using a back reference, as Felix mentioned, is the only way that I know of to guarantee that the repetitions match each other.

Related

How to match one 'x' but not one or both of xs in 'xx' globally in string [duplicate]

Not quite sure how to go about this, but basically what I want to do is match a character, say a for example. In this case all of the following would not contain matches (i.e. I don't want to match them):
aa
aaa
fooaaxyz
Whereas the following would:
a (obviously)
fooaxyz (this would only match the letter a part)
My knowledge of RegEx is not great, so I am not even sure if this is possible. Basically what I want to do is match any single a that has any other non a character around it (except for the start and end of the string).
Basically what I want to do is match any single a that has any other non a character around it (except for the start and end of the string).
^[^\sa]*\Ka(?=[^\sa]*$)
DEMO
\K discards the previously matched characters and lookahead assertes whether a match is possibel or not. So the above matches only the letter a which satifies the conditions.
OR
a{2,}(*SKIP)(*F)|a
DEMO
You may use a combination of a lookbehind and a lookahead:
(?<!a)a(?!a)
See the regex demo and the regex graph:
Details
(?<!a) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is a a char
a - an a char
(?!a) - a negative lookahead that fails the match if, immediately to the right of the current location, there is a a char.
You need two things:
a negated character class: [^a] (all except "a")
anchors (^ and $) to ensure that the limits of the string are reached (in other words, that the pattern matches the whole string and not only a substring):
Result:
^[^a]*a[^a]*$
Once you know there is only one "a", you can use the way you want to extract/replace/remove it depending of the language you use.

Regexp: excluding a word but including non-standard punctuation

I want to find strings that contain words in a particular order, allowing non-standard characters in between the words but excluding a particular word or symbol.
I'm using javascript's replace function to find all instances and put into an array.
So, I want select...from, with anything except 'from' in between the words. Or I can separate select...from from select...from (, as long as I exclude nesting. I think the answer is the same for both, i.e. how do I write: find x and not y within the same regexp?
From the internet, I feel this should work: /\bselect\b^(?!from).*\bfrom\b/gi but this finds no matches.
This works to find all select...from: /\bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b/gi but modifying it to exclude the parenthesis "(" at the end prevents any matches: /\bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b\s*^\(/gi
Can anyone tell me how to exclude words and symbols within this regexp?
Many thanks
Emma
Edit: partial string input:
left outer join [stage].[db].[table14] o on p.Project_id = o.project_id
left outer join
(
select
different_id
,sum(costs) - ( sum(brushes) + sum(carpets) + sum(fabric) + sum(other) + sum(chairs)+ sum(apples) ) as overallNumber
from
(
select ace from [stage].db.[table18] J
Javascript:
sequel = stringInputAsAbove;
var tst = sequel.replace(/\bselect\b[\s\S]*?\bfrom\b/gi, function(a,b) { console.log('match: '+a); selects.push(b); return a; });
console.log(selects);
Console.log(selects) should print an array of numbers, where each number is the starting character of a select...from. This works for the second regexp I gave in my info, printing: [95, 251]. Your \s\S variation does the same, #stribizhev.
The first example ^(?!from).* should do likewise but returns [].
The third example \s*^\( should return 251 only but returns []. However I have just noticed that the positive expression \s*\( does give 95, so some progress! It's the negatives I'm getting wrong.
Your \bselect\b^(?!from).*\bfrom\b regex doesn't work as expected because:
^ means here beginning of a line, not negation of next part, so
the \bselect\b^ means, select word followed by beginning of a
line. After removal of ^ regex start to match something
(DEMO) but it is still invalid.
in multiline text .* without modification will not match new line,
so regex will match only select...from in single lines, but if you
change it for (.|\n)* (as a simple example) it will match
multiline, but still invalid
the * is greede quantifire, so it will match as much a possible,
but if you use reluctant quantifire *?, regex will match to first
occurance of from word, and int will start to return relativly
correct result.
\bselect\b(?!from) means match separate select word which is not
directly followed by separate from word, so it would be
selectfrom somehow composed of separate words (because
select\bfrom) so (?!from) doesn't work and it is redundant
In effect you will get regex very similar to what Stribizhev gave you: \bselect\b(.|\n)*?\bfrom\b
In third expression you meke same mistake: \bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b\s*^\( using ^ as (I assume) a negation, not beginning of a line. Remove ^ and you will again get relativly valid result (match from select through from to closing parathesis ) ).
Your second regex works similar to \bselect\b(.|\n)*?\bfrom\b or \bselect\b[\s\S]*?\bfrom\b.
I wrote "relativly valid result", as I also think, that parsing SQL with regex could be very camplicated, so I am not sure if it will work in every case.
You can also try to use positive lookahead to match just position in text, like:
(?=\bselect\b(?:.|\n)*?\bfrom\b)
DEMO - the () was added to regex just to return beginning index of match in groups, so it would be easier to check it validity
Negation in regex
We use ^ as negation in character class, for example [^a-z] means match anything but not letter, so it will match number, symbol, whitespace, etc, but not letter from range a to z (Look here). But this negation is on a level of single character. I you use [^from] it will prevent regex from matching characters f,r,o and m (demo). Also the [^from]{4} will avoid matching from but also form, morf, etc.
To exlude the whole word from matching by regex, you need to use negative look ahead, like (?!from), which will fail to match, if there will be chosen word from fallowing given position. To avoid matching whole line containing from you could use ^(?!.*from.*).+$ (demo).
However in your case, you don't need to use this construction, because if you replace greedy quantifire .*\bfrom with .*?\bfrom it will match to first occurance of this word. Whats more it would couse problems. Take a look on this regex, it will not match anything because (?![\s\S]*from[\s\S]*) is not restricted by anything, so it will match only if there is no from after select, but we want to match also from! in effect this regex try to match and exclude from at once, and fail. so the (?!.*word.*) construction works much better to exclude matching line with given word.
So what to do if we don't what to match a word in a fragment of a match? I think select\b([^f]|f(?!rom))*?\bfrom\b is a good solution. With ([^f]|f(?!rom))*? it will match everything between select and from, but will not exclude from.
But if you would like to match only select...from not followed by ( then it is good idea to use (?!\() like. But in your regex (multiline, use of (.|\n)*? or [\s\S]*? it will cause to match up to next select...from part, because reluctant quantifire will chenge a plece where it need to match to make whole regex . In my opinion, good solution would be to use again:
select\b([^f]|f(?!rom))*?\bfrom\b(?!\s*?\()
which will not overlap additional select..from and will not match if there is \( after select...from - check it here

why does ((a(-b)?)(?!Z)) match the a in "a-bZ"?

I want to write a regular Expression that matches
a
a-b
but only if these sequences are not followed by Z
((a(-b)?)(?!Z))
a matches a ok
a-b matches a-b ok
aZ empty ok
a-bZ matches a NOT OK
Why does "a-bZ" match the first a although there is a group around (a(-b)?) ?
How can I correct it?
Need this in javascript RegExp, which should not matter however. Tried it in http://regexpal.com/
a-bZ is matched because (-b)? is ignored and (?!Z) matches the - symbol.
Because (-b) is optional, every string of the form ((a)(?!Z)) also gets matched.
You could match (a(?!Z))|(a-b(?!Z))
However, this will match a-bZ (because a is followed by a non-Z character).
If you want to find all instances of the strings where, for example, a-c doesn't get matched (even though - is a non-Z character), you could do this:
(a(?![-Z]))|(a-b(?!Z))
You could use atomic grouping to make your regex work. Unfortunately, the JavaScript regex engine does not support this feature.
But there is a trick to mimic its effect using a look-ahead and a back-reference (explained here):
(?=(pattern to make atomic))\1
so with your a-b or just a situation, this would become:
(?=(a-b|a))\1(?!Z)
Note that the longer sub-pattern a-b needs to be mentioned first in the group, otherwise it does not work.
The key mechanism is that the look-ahead finds the ealiest, longest-possible sub-match, while the back-reference prevents any backtracking in the engine and moves the position in the string, so the following test (?!Z) can be executed.
If you specify the start and end anchors, the above regex ((a(-b)?)(?!Z)) wouldn't match the string a-bZ, see the demo here. Because the anchors are not specified and the (-b) is made optional, the regex engine try to match a-b anywhere at first and then discards the match on seeing the following Z letter. Now the regex engine backtracks because of the optional -b to get a match. Now it's on a, the letter a is not immediately followed by Z, so the engine now matches the letter a

The behavior of /g mode matching

On this article, it mentioned
Make sure you are clear on the fact that an expression pattern is
tested on each individual character. And that, just because the engine
moves forward when following the pattern and looking for a match it
still backtracks and examines each character in a string until a match
is found or if the global flag is set until all characters are
examined.
But what I tested in Javascript
"aaa#bbb".match(/a+#b+/g)
does not produce a result like:
["aaa#bbb", "aa#bbb", "a#bbb"]
It only produces ["aaa#bbb"]. It seems it does not examine each character to test the pattern. Can anyone can explain a little on matching steps ? Thanks.
/g does not mean it will try to find every possible subset of characters in the input string which may match the given pattern. It means that once a match is found, it will continue searching for more substrings which may match the pattern starting from the previous match onward.
For example:
"aaa#bbb ... aaaa#bbbb".match(/a+#b+/g);
Will produce
["aaa#bbb", "aaaa#bbbb"]
That explanation is mixing two distinct concepts that IMO should be kept separated
A) backtracking
When looking for a match the normal behavior for a quantifier (?, *, +) is to be "greedy", i.e. to munch as much as possible... for example in /(a+)([^b]+)/ tested with aaaacccc all the a will be part of group 1 even if of course they also match the char set [^b] (everything but b).
However if grabbing too much is going to prevent a match the RE rules require that the quantifier "backtracks" capturing less if this allows the expression to match. For example in (a+)([^b]+) tested with aaaa the group 1 will get only three as, to leave one for group 2 to match.
You can change this greedy behavior with "non-greedy quantifiers" like *?, +?, ??. In this case stills the engine does backtracking but with the reverse meaning: a non-greedy subexpression will eat as little as possible that allows the rest of expression to match. For example (a+)(a+b+) tested with aaabbb will leave two as for group 1 and abbb for group 2, but (a+?)(a+b+) with the same string instead will leave only one a for group 1 because this is the minimum that allows matching the remaining part.
Note that also because of backtracking the greedy/non-greedy options doesn't change if an expression has a match or not, but only how big is the match and how much goes to each sub-expression.
B) the "global" option
This is something totally unrelated to backtracking and simply means that instead of stopping at the first match the search must find all non-overlapping matches. This is done by finding the first match and then starting again the search after the end of the match.
Note that each match is computed using the standard regexp rules and there is no look-ahead or backtracking between different matches: in other words if making for example a greedy match shorter would give more matches in the string this option is not considered... a+[^b]+ tested with aaaaaa is going to give only one match even if g option is specified and even if the substrings aa, aa, aa would have been each a valid match for the regexp.
When the global flag is used, it starts searching for the next match after the end of the previous match, to prevent generating lots of overlapping matches like that.
If you don't specify /g, the engine will stop as soon as a match is found.
If you do specify /g, it will keep going after a match. It will, however, still not produce overlapping matches, which is what you're asking about.
Its because.,
What Regex try to do:
All regex expression will try to match the best match.
What Regex wont do
It will not match the combinations for a single match as in your case.
When your "aaa#bbb".match(/a+#b+/g) scenario works
Rather, aaa#bbbHiaa#bbbHelloa#bbbSEEYOU try for some thing like this, which will give you
aaa#bbb
aa#bbb
a#bbb

RegEx in JS to find No 3 Identical consecutive characters

How to find a sequence of 3 characters, 'abb' is valid while 'abbb' is not valid, in JS using Regex (could be alphabets,numerics and non alpha numerics).
This question is a variation of the question that I have asked in here : How to combine these regex for javascript.
This is wrong : /(^([0-9a-zA-Z]|[^0-9a-zA-Z]))\1\1/ , so what is the right way to do it?
This depends on what you actually mean. If you only want to match three non-identical characters (that is, if abb is valid for you), you can use this negative lookahead:
(?!(.)\1\1).{3}
It first asserts, that the current position is not followed by three times the same character. Then it matches those three characters.
If you really want to match 3 different characters (only stuff like abc), it gets a bit more complicated. Use these two negative lookaheads instead:
(.)(?!\1)(.)(?!\1|\2).
First match one character. Then we assert, the this is not followed by the same character. If so, we match another character. Then we assert that these are followed neither by the first nor the second character. Then we match a third character.
Note that those negative lookaheads ((?!...)) do not consume any characters. That is why they are called lookaheads. They just check what is coming next (or in this case what is not coming next) and then the regex continues from where it left of. Here is a good tutorial.
Note also that this matches anything but line breaks, or really anything if you use the DOTALL or SINGLELINE option. Since you are using JavaScript you can just activate the option by appending s after the regexes closing delimiter. If (for some reason) you don't want to use this option, replace the .s by [\s\S] (this always matches any character).
Update:
After clarification in the comments, I realised that you do not want to find three non-identical characters, but instead you want to assert that your string does not contain three identical (and consecutive) characters.
This is a bit easier, and closer to your former question, since it only requires one negative lookahead. What we do is this: we search the string from the beginning for three consecutive identical characters. But since we want to assert that these do not exist we wrap this in a negative lookahead:
^(?!.*(.)\1\1)
The lookahead is anchored to the beginning of the string, so this is the only place where we will look. The pattern in the lookahead then tries to find three identical characters from any position in the string (because of the .*; the identical characters are matched in the same way as in your previous question). If the pattern finds these, the negative lookahead will thus fail, and so the string will be invalid. If not three identical characters can be found, the inner pattern will never match, so the negative lookahead will succeed.
To find non-three-identical characters use regex pattern
([\s\S])(?!\1\1)[\s\S]{2}

Categories

Resources