Javascript lookahead regular expression - javascript

I'm trying to write a regular expression to parse the following string out into three distinct parts. This is for a highlighting engine I'm writing:
"\nOn and available after solution."
I have a regular expression that's dynamically created for any word a user might input. In the above example, the word is "on".
The regular expression expects a word with any amount of white space ([\s]*) followed by the search word (with no -\w following it, eg: on-time, on-wards should not be a valid result. To complicate this, there can be a -,$,< or > symbol following the example, so on-, on> or on$ are valid. This is why there is a negative lookahead after the search word in my regular expression below.
There's a complicated reason for this, but it's not relevant to the question. The last part should be the rest of the sentence. In this example, " and available after solution."
So,
p1 = "\n"
p2 = "On"
p3 = " and available after solution"
I currently have the following regular expression.
test = new RegExp('([\\s]*)(on(?!\\-\\w))([$\\-><]*?\\s(?=[.]*))',"gi")
The first part of this regular expression ([\\s]*)(on(?!\\-\\w))[$\\-><]*? works as expected. The last part does not.
In the last part, what I'm trying to do is force the regular expression engine to match whitespace before matching additional characters. If it can not match a space, then the regular expression should end. However, when I run this regular expression, I get the following results
str1 = "\nOn ly available after solution."
test.exec(str1)
["\n On ", "\n ", "On"]
So it would appear to me that the last positive look ahead is not working. Thanks for any suggestions, and if anyone needs some clarification, let me know.
EDIT:
It would appear that my regular expression was not matching because I didn't realize the following caveat:
You can use any regular expression inside the lookahead. (Note that this is not the case with lookbehind. I will explain why below.) Any valid regular expression can be used inside the lookahead. If it contains capturing parentheses, the backreferences will be saved. Note that the lookahead itself does not create a backreference. So it is not included in the count towards numbering the backreferences. If you want to store the match of the regex inside a backreference, you have to put capturing parentheses around the regex inside the lookahead, like this: (?=(regex)). The other way around will not work, because the lookahead will already have discarded the regex match by the time the backreference is to be saved.

The dot in the character class [.] means a literal dot. Change it to just . if you wish to match any character.
The lookahead (?=.*) will always match and is completely pointless. Change it to (.*) if you just want to capture that part of the string.

I think the problem is your positive lookahead on(?!\-\w) is trying to match any on that is not followed by - then \w. I think what you want instead is on(?!\-|\w), which matches on that is not followed by - OR \w

Related

Regex to select all spaces between a special enclosure

I am trying to write a regex for Javascript that can select all whitespaces in between AMPScript brackets, the syntax for the Lang is something like this
%%[
set #truth = 'this is amp'
IF #truth == 'this is amp' THEN
set #response = 'amp rocks'
ELSE
set #response = '️...'
ENDIF
]%%
So far, I am able to select all the characters inside the given brackets using this expression:
%%\[(\s|.)*?\]%%
But this selects all the characters inside the enclosure, is there a method I can use to only select spaces/new lines/tabindents only?
Thanks in advance!
I believe this is what you want:
(?<=%%\[(\s|.)*)\s*(?=(\s|.)*\]%%)
https://regex101.com/r/EWhbuX/1
Edit
Here's a breakdown of the regular expression:
1.
First we start with a Positive Lookbehind (?<= )
This will make sure that the pattern that's following this, will be preceded by the pattern inside, but will not be include it in the matches.
In this case we want our matching pattern to be preceded by %%[ and any other character %%\[(\s|.)*
So, our resulting code for the Positive Lookbehind is
(?<=%%\[(\s|.)*)
2.
Next comes the pattern that we actually want to match after our Lookabehind, and (spoiler alert), before a Lookahead that we'll define later.
In this case, that's just any whitespace character, so our pattern will be
\s
(Yes, I just noticed that we don't even need the * in my original answer)
3.
Similarly to what we did at the beginning of the expression with the Lookbehind, we now need a Positive Lookahead (?= )
This is to make sure that our whitespaces will be followed by any character and ]%% (\s|.)*\]%%.
So this is our resulting Lookahead:
(?=(\s|.)*\]%%)
4.
Put everything together and you have your regular expression!
(?<=%%\[(\s|.)*)\s(?=(\s|.)*\]%%)

Unable to match regex for any character except ' and "

I've written a regex to match against the string
{{AB.group.one}}:"eighth",{{AB.group.TWO}}:"third",{{attr1111}}:"fourth","fifth":{{attr_22_2qq2}},"sixth":{{AB.group.three}},{{ab.group.fourth}}:"seventh","ninth":{{attr1111}}}
Regex:
/[^'"]({{2}[a-zA-Z0-9$_].*?}{2})[^'"]/gi
Breaking the regex above:
[^'"]: Start with a character which is neither ' nor ".
({{2}[a-zA-Z0-9$_].*?}{2}): Have exactly 2 {{, then any character in the range a-zA-Z0-9$_ . After that, exactly 2 }}
[^'"]: Any character except for ' and ".
Below matches are not the exact matches but the captured groups. I'll perform my operations on the captured groups so for simplicity, we can consider them as our matches.
Expected matches:
{{AB.group.one}}
{{AB.group.TWO}}
{{attr1111}}
{{attr_22_2qq2}}
{{AB.group.three}}
{{ab.group.fourth}}
{{attr1111}}}
Resultant matches:
{{AB.group.TWO}}
{{attr1111}}
{{attr_22_2qq2}}
{{AB.group.three}}
{{attr1111}}}
As you can see in the image below {{AB.group.one}} and {{ab.group.fourth}} do not match. I want them to match them as well.
I know the reasons why they aren't matching.
The reason why {{AB.group.one}} doesn't match is because [^'"] expects one character except for ' and " and I'm not providing one. If I replace [^'"] with ["'"]*, it'll work but in that case "{{AB.group.one}}" will match as well.
So, the problem statement is match any character(if there's any) before {{ and after }} but the character can't be ' or ".
The reason why {{ab.group.fourth}} doesn't match is because the character preceding this match i.e. , is part of another match. This is just my speculation, the reason could be something else. But if I include any character between {{AB.group.three}}, and {{ab.group.fourth}} (e.g. {{AB.group.three}}, {{ab.group.fourth}}), then the pattern matches. I have no idea how can I fix this.
Please help me in solving these two problems. Thank you.
Here is a regex based approach which seems to be working. First, we can string off all double-quoted terms, then replace islands of comma/colon with just a single comma separator. Finally, split on comma to generate an array of terms.
var input = "{{AB.group.one}}:\"eighth\",{{AB.group.TWO}}:\"third\",{{attr1111}}:\"fourth\",\"fifth\":{{attr_22_2qq2}},\"sixth\":{{AB.group.three}},{{ab.group.fourth}}:\"seventh\",\"ninth\":{{attr1111}}},\"blah\":\"stuff\",{{one}}:{{two}}";
var terms = input.replace(/\".*?\"/g, "").replace(/[,:]+/g, ",").split(",");
console.log(terms);
You were actually really close with what you had.
let input = '{{AB.group.one}}:"eighth",{{AB.group.TWO}}:"third",{{attr1111}}:"fourth","fifth":{{attr_22_2qq2}},"sixth":{{AB.group.three}},{{ab.group.fourth}}:"seventh","ninth":{{attr1111}}}'
let regex = /(?<=[^'"]?)({{2}[a-zA-Z0-9$_].*?}{2})(?=[^'"]?)/gi;
console.log(input.match(regex))
(?<=[^'"]?) is a positive lookbehind. Since the negated character set is used, we're checking that the character before the match is not ' or ". The question mark makes this optional - match zero or one of the previous token (the negated character set).
(?=[^'"]?) is a positive lookahead and checks the token immediately after the expression to ensure that it's not a ' or " (or that there is no token after the expression).
Another option, since lookbehinds aren't supported in every browser:
let input = '{{AB.group.one}}:"eighth",{{AB.group.TWO}}:"third",{{attr1111}}:"fourth","fifth":{{attr_22_2qq2}},"sixth":{{AB.group.three}},{{ab.group.fourth}}:"seventh","ninth":{{attr1111}}}'
let regex = /(?:[^{'"])?({{2}[a-zA-Z0-9$_].*?}{2})(?:[^}'"])?/gi
console.log([...input.matchAll(regex)].map(reg => reg[1]))
String.match() loses reference to capture groups when the global flag is passed, so only returns the "match". Since you're creating a capture group with ({{2}[a-zA-Z0-9$_].*?}{2}), if you wanted to just ensure the characters immediately surrounding the bracketed expression aren't quotation marks, you can just use non-capture groups for those optional checks.
(?:[^{'"])? is a non-capturing group, as is (?:[^}'"])?
Using String.matchAll, the first element of the arrays created for each match is the entire match, the second element is the first capturing group, etc. So the logic for mapping over [...input.matchAll(regex)] is just to collect the capturing group from each match.

how to negate a capture group?

Using a javascript regexp, I would like to find strings like "/foo" or "/foo d/" but not "/foo /"; ie, "annotation character", then either word with no terminating annotation, or multiple words, where the termination comes at the end of the phrase (with no space). Complicating the situation, there are three possible annotation symbols: /, \ and |.
I've tried something like:
/(?:^|\s)([\\\/|])((?:[\w_-]+(?![^\1]+[\w_-]\1))|(?:[\w\s]+[\w](?=\1)))/g
That is, start with space, then annotation, then
word not followed by (anything but annotation) then letter and annotation... or
possibly multiple words, immediately followed by annotation character.
The problem is the [^\1]: this doesn't read as "anything but the annotation character" in the angle brackets.
I could repeat the whole phrase three times, one for each annotation character. Any better ideas?
As you've mentioned, [^\1] doesn't work - it matches anything that is not the character 1. In JavaScript, you can negate \1 by using a lookahead: (?:(?!\1).)* . This is not as efficient, but it works.
Your pattern can be written as:
([\\\/|])([\w\-]+(?:(?:(?!\1).)*[\w\-]\1)?)
Working example at Regex101
\w already contains underscore.
Instead of alternation (a|ab) I'm using an optional group (a(?:b)?) - we always match the first word, with optional further words and tags.
You may still want to include (?:^|\s) at the beginning.

JS regular expression, basic lookahead

I cannot figure out, for the life of me, why this regular expression
^\.(?=a)$
does not match
".a"
anyone know why?
I am going off the information provided here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
The reason it doesn't work is because the lookahead doesn't actually consume any characters, so your matching position doesn't advance.
^\.(?=a)$
Matches the beginning of line (^ -- this matches) followed by a literal . (\. -- this also matches), and then (without consuming any characters), checks to see if the next character is a literal a ((?=a)). It is, so the lookahead matches. It then asserts that your position is at the end of the string ($). This is not the case, because we're still right after the ., so the match fails.
Another possible matching expression would be
^\.(?=a$)
Which works just as above, but the assertion about the end of the line is contained in the lookahead, so this time, it matches.
Your regex is only going to match a period that's followed by an 'a', without including 'a' in the match.
Another issue is that you're using $ after a character that's basically being ignored.
Remove the $ and it will work as described.
Bonus: I've enjoyed using this lately http://www.regexpal.com/

Regexp: excluding a word but including non-standard punctuation

I want to find strings that contain words in a particular order, allowing non-standard characters in between the words but excluding a particular word or symbol.
I'm using javascript's replace function to find all instances and put into an array.
So, I want select...from, with anything except 'from' in between the words. Or I can separate select...from from select...from (, as long as I exclude nesting. I think the answer is the same for both, i.e. how do I write: find x and not y within the same regexp?
From the internet, I feel this should work: /\bselect\b^(?!from).*\bfrom\b/gi but this finds no matches.
This works to find all select...from: /\bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b/gi but modifying it to exclude the parenthesis "(" at the end prevents any matches: /\bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b\s*^\(/gi
Can anyone tell me how to exclude words and symbols within this regexp?
Many thanks
Emma
Edit: partial string input:
left outer join [stage].[db].[table14] o on p.Project_id = o.project_id
left outer join
(
select
different_id
,sum(costs) - ( sum(brushes) + sum(carpets) + sum(fabric) + sum(other) + sum(chairs)+ sum(apples) ) as overallNumber
from
(
select ace from [stage].db.[table18] J
Javascript:
sequel = stringInputAsAbove;
var tst = sequel.replace(/\bselect\b[\s\S]*?\bfrom\b/gi, function(a,b) { console.log('match: '+a); selects.push(b); return a; });
console.log(selects);
Console.log(selects) should print an array of numbers, where each number is the starting character of a select...from. This works for the second regexp I gave in my info, printing: [95, 251]. Your \s\S variation does the same, #stribizhev.
The first example ^(?!from).* should do likewise but returns [].
The third example \s*^\( should return 251 only but returns []. However I have just noticed that the positive expression \s*\( does give 95, so some progress! It's the negatives I'm getting wrong.
Your \bselect\b^(?!from).*\bfrom\b regex doesn't work as expected because:
^ means here beginning of a line, not negation of next part, so
the \bselect\b^ means, select word followed by beginning of a
line. After removal of ^ regex start to match something
(DEMO) but it is still invalid.
in multiline text .* without modification will not match new line,
so regex will match only select...from in single lines, but if you
change it for (.|\n)* (as a simple example) it will match
multiline, but still invalid
the * is greede quantifire, so it will match as much a possible,
but if you use reluctant quantifire *?, regex will match to first
occurance of from word, and int will start to return relativly
correct result.
\bselect\b(?!from) means match separate select word which is not
directly followed by separate from word, so it would be
selectfrom somehow composed of separate words (because
select\bfrom) so (?!from) doesn't work and it is redundant
In effect you will get regex very similar to what Stribizhev gave you: \bselect\b(.|\n)*?\bfrom\b
In third expression you meke same mistake: \bselect\b[0-9a-zA-Z#\(\)\[\]\s\.\*,%_+-]*?\bfrom\b\s*^\( using ^ as (I assume) a negation, not beginning of a line. Remove ^ and you will again get relativly valid result (match from select through from to closing parathesis ) ).
Your second regex works similar to \bselect\b(.|\n)*?\bfrom\b or \bselect\b[\s\S]*?\bfrom\b.
I wrote "relativly valid result", as I also think, that parsing SQL with regex could be very camplicated, so I am not sure if it will work in every case.
You can also try to use positive lookahead to match just position in text, like:
(?=\bselect\b(?:.|\n)*?\bfrom\b)
DEMO - the () was added to regex just to return beginning index of match in groups, so it would be easier to check it validity
Negation in regex
We use ^ as negation in character class, for example [^a-z] means match anything but not letter, so it will match number, symbol, whitespace, etc, but not letter from range a to z (Look here). But this negation is on a level of single character. I you use [^from] it will prevent regex from matching characters f,r,o and m (demo). Also the [^from]{4} will avoid matching from but also form, morf, etc.
To exlude the whole word from matching by regex, you need to use negative look ahead, like (?!from), which will fail to match, if there will be chosen word from fallowing given position. To avoid matching whole line containing from you could use ^(?!.*from.*).+$ (demo).
However in your case, you don't need to use this construction, because if you replace greedy quantifire .*\bfrom with .*?\bfrom it will match to first occurance of this word. Whats more it would couse problems. Take a look on this regex, it will not match anything because (?![\s\S]*from[\s\S]*) is not restricted by anything, so it will match only if there is no from after select, but we want to match also from! in effect this regex try to match and exclude from at once, and fail. so the (?!.*word.*) construction works much better to exclude matching line with given word.
So what to do if we don't what to match a word in a fragment of a match? I think select\b([^f]|f(?!rom))*?\bfrom\b is a good solution. With ([^f]|f(?!rom))*? it will match everything between select and from, but will not exclude from.
But if you would like to match only select...from not followed by ( then it is good idea to use (?!\() like. But in your regex (multiline, use of (.|\n)*? or [\s\S]*? it will cause to match up to next select...from part, because reluctant quantifire will chenge a plece where it need to match to make whole regex . In my opinion, good solution would be to use again:
select\b([^f]|f(?!rom))*?\bfrom\b(?!\s*?\()
which will not overlap additional select..from and will not match if there is \( after select...from - check it here

Categories

Resources