Regexp: excluding a word but including non-standard punctuation - javascript

I want to find strings that contain words in a particular order, allowing non-standard characters in between the words but excluding a particular word or symbol.
I'm using javascript's replace function to find all instances and put into an array.
So, I want select...from, with anything except 'from' in between the words. Or I can separate select...from from select...from (, as long as I exclude nesting. I think the answer is the same for both, i.e. how do I write: find x and not y within the same regexp?
From the internet, I feel this should work: /\bselect\b^(?!from).*\bfrom\b/gi but this finds no matches.
This works to find all select...from: /\bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b/gi but modifying it to exclude the parenthesis "(" at the end prevents any matches: /\bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b\s*^\(/gi
Can anyone tell me how to exclude words and symbols within this regexp?
Many thanks
Emma
Edit: partial string input:
left outer join [stage].[db].[table14] o on p.Project_id = o.project_id
left outer join
(
select
different_id
,sum(costs) - ( sum(brushes) + sum(carpets) + sum(fabric) + sum(other) + sum(chairs)+ sum(apples) ) as overallNumber
from
(
select ace from [stage].db.[table18] J
Javascript:
sequel = stringInputAsAbove;
var tst = sequel.replace(/\bselect\b[\s\S]*?\bfrom\b/gi, function(a,b) { console.log('match: '+a); selects.push(b); return a; });
console.log(selects);
Console.log(selects) should print an array of numbers, where each number is the starting character of a select...from. This works for the second regexp I gave in my info, printing: [95, 251]. Your \s\S variation does the same, #stribizhev.
The first example ^(?!from).* should do likewise but returns [].
The third example \s*^\( should return 251 only but returns []. However I have just noticed that the positive expression \s*\( does give 95, so some progress! It's the negatives I'm getting wrong.

Your \bselect\b^(?!from).*\bfrom\b regex doesn't work as expected because:
^ means here beginning of a line, not negation of next part, so
the \bselect\b^ means, select word followed by beginning of a
line. After removal of ^ regex start to match something
(DEMO) but it is still invalid.
in multiline text .* without modification will not match new line,
so regex will match only select...from in single lines, but if you
change it for (.|\n)* (as a simple example) it will match
multiline, but still invalid
the * is greede quantifire, so it will match as much a possible,
but if you use reluctant quantifire *?, regex will match to first
occurance of from word, and int will start to return relativly
correct result.
\bselect\b(?!from) means match separate select word which is not
directly followed by separate from word, so it would be
selectfrom somehow composed of separate words (because
select\bfrom) so (?!from) doesn't work and it is redundant
In effect you will get regex very similar to what Stribizhev gave you: \bselect\b(.|\n)*?\bfrom\b
In third expression you meke same mistake: \bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b\s*^\( using ^ as (I assume) a negation, not beginning of a line. Remove ^ and you will again get relativly valid result (match from select through from to closing parathesis ) ).
Your second regex works similar to \bselect\b(.|\n)*?\bfrom\b or \bselect\b[\s\S]*?\bfrom\b.
I wrote "relativly valid result", as I also think, that parsing SQL with regex could be very camplicated, so I am not sure if it will work in every case.
You can also try to use positive lookahead to match just position in text, like:
(?=\bselect\b(?:.|\n)*?\bfrom\b)
DEMO - the () was added to regex just to return beginning index of match in groups, so it would be easier to check it validity
Negation in regex
We use ^ as negation in character class, for example [^a-z] means match anything but not letter, so it will match number, symbol, whitespace, etc, but not letter from range a to z (Look here). But this negation is on a level of single character. I you use [^from] it will prevent regex from matching characters f,r,o and m (demo). Also the [^from]{4} will avoid matching from but also form, morf, etc.
To exlude the whole word from matching by regex, you need to use negative look ahead, like (?!from), which will fail to match, if there will be chosen word from fallowing given position. To avoid matching whole line containing from you could use ^(?!.*from.*).+$ (demo).
However in your case, you don't need to use this construction, because if you replace greedy quantifire .*\bfrom with .*?\bfrom it will match to first occurance of this word. Whats more it would couse problems. Take a look on this regex, it will not match anything because (?![\s\S]*from[\s\S]*) is not restricted by anything, so it will match only if there is no from after select, but we want to match also from! in effect this regex try to match and exclude from at once, and fail. so the (?!.*word.*) construction works much better to exclude matching line with given word.
So what to do if we don't what to match a word in a fragment of a match? I think select\b([^f]|f(?!rom))*?\bfrom\b is a good solution. With ([^f]|f(?!rom))*? it will match everything between select and from, but will not exclude from.
But if you would like to match only select...from not followed by ( then it is good idea to use (?!\() like. But in your regex (multiline, use of (.|\n)*? or [\s\S]*? it will cause to match up to next select...from part, because reluctant quantifire will chenge a plece where it need to match to make whole regex . In my opinion, good solution would be to use again:
select\b([^f]|f(?!rom))*?\bfrom\b(?!\s*?\()
which will not overlap additional select..from and will not match if there is \( after select...from - check it here

Related

How to match one 'x' but not one or both of xs in 'xx' globally in string [duplicate]

Not quite sure how to go about this, but basically what I want to do is match a character, say a for example. In this case all of the following would not contain matches (i.e. I don't want to match them):
aa
aaa
fooaaxyz
Whereas the following would:
a (obviously)
fooaxyz (this would only match the letter a part)
My knowledge of RegEx is not great, so I am not even sure if this is possible. Basically what I want to do is match any single a that has any other non a character around it (except for the start and end of the string).
Basically what I want to do is match any single a that has any other non a character around it (except for the start and end of the string).
^[^\sa]*\Ka(?=[^\sa]*$)
DEMO
\K discards the previously matched characters and lookahead assertes whether a match is possibel or not. So the above matches only the letter a which satifies the conditions.
OR
a{2,}(*SKIP)(*F)|a
DEMO
You may use a combination of a lookbehind and a lookahead:
(?<!a)a(?!a)
See the regex demo and the regex graph:
Details
(?<!a) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is a a char
a - an a char
(?!a) - a negative lookahead that fails the match if, immediately to the right of the current location, there is a a char.
You need two things:
a negated character class: [^a] (all except "a")
anchors (^ and $) to ensure that the limits of the string are reached (in other words, that the pattern matches the whole string and not only a substring):
Result:
^[^a]*a[^a]*$
Once you know there is only one "a", you can use the way you want to extract/replace/remove it depending of the language you use.

Ambiguity in regex in javascript

var a = 'a\na'
console.log(a.match(/.*/g)) // ['a', '', 'a', '']
Why are there two empty strings in the result?
Let's say if there are empty strings, why isn't there one at beginning and at the end of each line as well, hence 4 empty strings?
I am not looking for how to select 'a's but just want to understand the presence of the empty strings.
The best explanation I can offer for the following:
'ab\na'.match(/.*/g)
["ab", "", "a", ""]
Is that JavaScript's match function uses dot not in DOT ALL mode, meaning that dot does not match across newlines. When the .* pattern is applied to ab\na, it first matches ab, then stops at the newline. The newline generates an empty match. Then, a is matched, and then for some reason the end of the string matches another empty match.
If you just want to extract the non whitespace content from each line, then you may try the following:
print('ab\na'.match(/.+/g))
ab,a
Let's say if there are empty strings, why isn't there one at beginning
and at the end...
.* applies greediness. It swallows a complete line asap. By a line I mean everything before a line break. When it encounters end of a line, it matches again due to star quantifier.
If you want 4 you may add ? to star quantifier and make it lazy .*? but yet this regex has different result in different flavors because of the way they handle zero-length matches.
You can try .*? with both PCRE and JS engines in regex101 and see the differences.
Question:
You may ask why does engine try to find a match at the end of line while whole thing is already matched?
Answer:
It's for the reason that we have a definition for end of lines and end of strings. So not whole thing is matched. There is a left position that has a chance to be matched and we have it with star quantifier.
This left position is end of line here which is a true match for $ when m flag is on. A . doesn't match this position but a .* or .*? match because they would be a pattern for zero-length positions too as any X-STAR patterns like \d*, \D*, a* or b?
Star operator * means there can be any number of ocurrences (even 0 ocurrences). With the expression used, an empty string can be a match. Not sure what are you looking for, but maybe a + operator (1 or more ocurrences) will be better?
Want to add some more info, regex use a greedy algorithm by default (in some languages you can override this behaviour), so it will pick as much of the text as it can. In this case, it will pick the a, because it can be processed with the regex, so the "\na" is still there. "\n" does not match the ".", so the only available option is the empty string. Then, we will process the next line, and again, we can match a "a". After this, only the empty string matches the regex.
* Matches the preceding expression 0 or more times.
. matches any single character except the newline character.
That is what official doc says about . and *. So i guess the array you received is something like this:
[ the first "any character" of the first line, following "nothing", the first "any character" of the second line, following "nothing"]
And the new-line character is just ignored

How to find all words with x (and one or more) occurrences of a letter?

I have an answer to my second question right here:
To find words with one or more occurrences of the letter 'a' in it
var re = /(\w+a)/;
With regards to the above, how does it work? For example,
var re = /(\w+a)/g;
var str = "gamma";
console.log(re.exec(str));
Output:
[ 'gamma', 'gamma', index: 0, input: 'gamma' ]
However; these are not the results I expected (although it IS what I want). That is to say, re should have found patterns such that there were any number of occurrences of \w. Then the first occurrence of the letter 'a'. Then stop.
I.e. I expected: ga.
Then mma
Next, how do I look for words with a pre-defined number of occurrences (call it x) of the letter 'a'. Such that f(x)=gamma iff x=2.
Repetition in regex is greedy. That is it takes as much as possible. You happen to get the full word, because it ends in an a. To make it ungreedy, (stop at the first one), you'd use:
\w+?a
But to actually get the full word, I'd rather use
\w*a\w*
Note the *, otherwise you'll get problems with words that have an a only as the first or last letter.
To get words with exactly 2 a you need to exclude a from the repeated letters. This is best done with a negated character class, that disallows non-word characters and as. In addition you need to make sure, that you get full words. This is easily done with the word boundary \b:
\b[^\Wa]*a[^\Wa]*a[^\Wa]*\b
For more flexibility in terms of the number of repetitions, this can be rewritten as
\b[^\Wa]*(?:a[^\Wa]*){2}\b
Regular expressions are greedy by default. That means that if they can grab more characters they will. You need to consider greed when using quantifiers, like + and *.
To make a quantifier not greedy (lazy) suffix it with a ?.
/(\w+?a)/
You can use regex for something, such as
/\b\w*a\w*\b/ - find a word with at least 1 a (can match the word 'a')
/\b\w*(?:a\w*){2}\b/ - find a word with at least 2 as
But it gets tricky when the amount is exact, because you must change the \w to include all letters except a... works by the negated class, thus
/\b[^\Wa]*(?:a[^\Wa]*){2}\b/ - matches a word with exactly 2 as
To find the syllables or so until the "a" letter, then you can use
/\b(?:[^\Wa]*a)/ - matches ga alone and in gamma
/\b(?:[^\Wa]*a){1,4}/ - matches word having 1-4 a, ending in a.
The easiest way to achieve something like this is however is to match all words /\w+/, and filter them by Javascript.

The behavior of /g mode matching

On this article, it mentioned
Make sure you are clear on the fact that an expression pattern is
tested on each individual character. And that, just because the engine
moves forward when following the pattern and looking for a match it
still backtracks and examines each character in a string until a match
is found or if the global flag is set until all characters are
examined.
But what I tested in Javascript
"aaa#bbb".match(/a+#b+/g)
does not produce a result like:
["aaa#bbb", "aa#bbb", "a#bbb"]
It only produces ["aaa#bbb"]. It seems it does not examine each character to test the pattern. Can anyone can explain a little on matching steps ? Thanks.
/g does not mean it will try to find every possible subset of characters in the input string which may match the given pattern. It means that once a match is found, it will continue searching for more substrings which may match the pattern starting from the previous match onward.
For example:
"aaa#bbb ... aaaa#bbbb".match(/a+#b+/g);
Will produce
["aaa#bbb", "aaaa#bbbb"]
That explanation is mixing two distinct concepts that IMO should be kept separated
A) backtracking
When looking for a match the normal behavior for a quantifier (?, *, +) is to be "greedy", i.e. to munch as much as possible... for example in /(a+)([^b]+)/ tested with aaaacccc all the a will be part of group 1 even if of course they also match the char set [^b] (everything but b).
However if grabbing too much is going to prevent a match the RE rules require that the quantifier "backtracks" capturing less if this allows the expression to match. For example in (a+)([^b]+) tested with aaaa the group 1 will get only three as, to leave one for group 2 to match.
You can change this greedy behavior with "non-greedy quantifiers" like *?, +?, ??. In this case stills the engine does backtracking but with the reverse meaning: a non-greedy subexpression will eat as little as possible that allows the rest of expression to match. For example (a+)(a+b+) tested with aaabbb will leave two as for group 1 and abbb for group 2, but (a+?)(a+b+) with the same string instead will leave only one a for group 1 because this is the minimum that allows matching the remaining part.
Note that also because of backtracking the greedy/non-greedy options doesn't change if an expression has a match or not, but only how big is the match and how much goes to each sub-expression.
B) the "global" option
This is something totally unrelated to backtracking and simply means that instead of stopping at the first match the search must find all non-overlapping matches. This is done by finding the first match and then starting again the search after the end of the match.
Note that each match is computed using the standard regexp rules and there is no look-ahead or backtracking between different matches: in other words if making for example a greedy match shorter would give more matches in the string this option is not considered... a+[^b]+ tested with aaaaaa is going to give only one match even if g option is specified and even if the substrings aa, aa, aa would have been each a valid match for the regexp.
When the global flag is used, it starts searching for the next match after the end of the previous match, to prevent generating lots of overlapping matches like that.
If you don't specify /g, the engine will stop as soon as a match is found.
If you do specify /g, it will keep going after a match. It will, however, still not produce overlapping matches, which is what you're asking about.
Its because.,
What Regex try to do:
All regex expression will try to match the best match.
What Regex wont do
It will not match the combinations for a single match as in your case.
When your "aaa#bbb".match(/a+#b+/g) scenario works
Rather, aaa#bbbHiaa#bbbHelloa#bbbSEEYOU try for some thing like this, which will give you
aaa#bbb
aa#bbb
a#bbb

New to Regular Expressions need help

I need a form with one button and window for input
that will check an array, via a regular expression.
And will find a exact match of letters + numbers. Example wxyz [some space btw] 0960000
or a mix of numbers and letters [some space btw] + numbers 01xg [some space btw] 0960000
The array has four objects for now.
Once found i need a function the will open a new page or window when match is found .
Thanks you for your help.
Michael
To answer the Javascript part, here's one way to "grep" through the array to find matching elements:
var matches = [];
var re = /whatever/;
foo.forEach(
function(el) {
if( re.exec(el) )
matches.push(el);
}
);
To attempt to answer the regular expression part: I don't know what "exact match" means to you, and I'm assuming "some space" belongs only in between the other terms, and I'm assuming letters means the English alphabet from 'a' to 'z' in lower and upper case and the digits should be 0-9 (otherwise, other language characters might be matched).
The first pattern would be /[a-zA-Z0-9]+\s*0960000/. Change "\s*" to "\s+" if there is at least one space, instead of zero or more space characters. Change "\s" to " " if matching the tab character (and some lesser-used space chars) is not desirable.
For the second pattern, I don't know what "numbers 01xg" means, but if it means numbers followed by that string, then the pattern would be /[a-zA-Z0-9]+\s*[0-9]+\s*01xg\s*0960000/. The same caveats apply as above.
Additionally, this will match a partial string. If the string much be matched in entirety (if nothing in the string must exist except that which is matched), add "^" to the beginning of the pattern to anchor it to the beginning of the string, and "$" at the end to anchor it to the end of the string. For example, /[a-zA-Z0-9]+\s*0960000/ matches "foo_bar 5 0960000", but /^[a-zA-Z0-9]+\s*0960000$/ does not.
For more on regular expressions in Javascript, take a look at developer.mozilla.org's article on the RegExp object (the link takes you to JS version 1.5 reference, which should apply to all JS-capable browsers).
(edited to add): To match either situation, since they have overlapping parts, you could use the following pattern: /[a-zA-Z0-9]+(?:\s*[0-9]+\s*01xg)?\s*0960000/. The question mark says to match the part that differs -- in a non-matching group (?:foo) -- once or zero times. (?:foo)? and (?:foo|) do the same thing in this case, but I'm not sure whether there is a performance difference; I would recommend to use the one that makes the most sense to you, so you can read it later.

Categories

Resources