Negative Lookahead & Lookbehind with Capture Groups and Word Boundaries - javascript

We are auto-formatting hyperlinks in a message composer but would like to avoid matching links that are already formatted.
Attempt: Build a regex that uses a negative lookbehind and negative lookahead to exclude matches where the link is surrounded by href=" and ".
Problem: Negative lookbehind/lookahead are not working with our regex:
Regex:
/(?<!href=")(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9#:%._+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9#:%_+.~#?&\/\/=;]*)(?!")/g
Usage:
html.match(/(?<!")(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9#:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9#:%_\+.~#?&//=;]*)(?!")/g);
When testing, we notice that swapping the negative lookahead/lookbehind for the positive versions makes the regex behave as expected. So only the negative lookbehind/lookahead seem to fail.
Does anyone know why these negative lookbehind/lookaheads are not functioning with this regex?
Thank you!

With @Barmar's help in the question comments, it is clear that the problem lies in the optional beginning and end of the regex.
"Basically, anything that allows something to be optional next to a negative lookaround may negate the effect of the lookaround, if it can find a shorter match that isn't next to it."

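To make the quoted explanation concrete, here is a minimal sketch. The two patterns are simplified, hypothetical stand-ins (not the regex from the question), and it assumes an engine with lookbehind support:

```javascript
// Simplified, hypothetical patterns (not the regex from the question).
const html = '<a href="https://example.com/path">';

// Optional scheme next to a negative lookbehind.
const optionalStart = /(?<!href=")(?:https?:\/\/)?example\.com/;
// Directly after href=" the lookbehind rejects the match, but the engine then
// retries further right, where the optional scheme is simply skipped.
const startMatch = html.match(optionalStart);
console.log(startMatch[0]); // "example.com"

// Greedy, optional tail next to a negative lookahead.
const optionalEnd = /example\.com[a-z\/]*(?!")/;
// The tail first takes "/path", the lookahead sees the quote and fails, so the
// tail gives back one character and the lookahead is then satisfied.
const endMatch = html.match(optionalEnd);
console.log(endMatch[0]); // "example.com/pat"
```

In both cases the engine finds a shorter match that is no longer adjacent to the text the lookaround was meant to exclude.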
If you are using modern JS that supports variable-length lookbehind assertions, you can
put a non-greedy, variable-length part inside the lookbehind.
The lookbehind then covers every position inside the attribute value, so the regex can keep optional beginnings like the ones you have.
/(?<!href="[^"]*?)(?:https?:\/\/.)?(?:www\.)?[a-zA-Z0-9#%+\-.:=#_~]{2,256}\.[a-z]{2,6}\b[a-zA-Z0-9#%&+\--\/:;=?#_~]*(?!")/
https://regex101.com/r/OdJyZf/1
(?<! href=" [^"]*? )
(?: https?:// . )?
(?: www \. )?
[a-zA-Z0-9#%+\-.:=#_~]{2,256} \. [a-z]{2,6} \b [a-zA-Z0-9#%&+\--/:;=?#_~]*
(?! " )
I must make a correction. In my comments I said that
the word boundary \b here [a-z]{2,6}\b[a-zA-Z0-9#%&+\--/:;=?#_~] effectively removes the word class \w in the following class.
This is true, but only for the first character after the boundary. The characters after that can be word chars, so they are needed in the class.
It's a clear example of overthinking something that does not need to be.
The whole regex should be able to be rewritten using \w in the classes unless ASCII is required.
Note that this will only work in engines that support variable-length lookbehind, such as the new JS engine and C# (of course).
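A quick way to try the regex above is sketched below. The regex is copied from this answer with only the g flag added; the sample HTML string is a made-up assumption for illustration:

```javascript
// Regex copied from the answer above, with the g flag added to collect all matches.
const unformattedLink = /(?<!href="[^"]*?)(?:https?:\/\/.)?(?:www\.)?[a-zA-Z0-9#%+\-.:=#_~]{2,256}\.[a-z]{2,6}\b[a-zA-Z0-9#%&+\--\/:;=?#_~]*(?!")/g;

// Hypothetical composer content: one already-formatted link and one bare link.
const html = '<a href="https://example.com/path">link</a> and www.test.org';

// Every position inside the href value is rejected by the lookbehind,
// so only the bare link is reported.
const found = html.match(unformattedLink);
console.log(found); // [ 'www.test.org' ]
```

Note that the anchor text here is the plain word "link". If the anchor text itself looked like a URL, this regex would still match it, since only the href value is excluded.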

Related

How to replace my current regular expression without using negative lookbehind

I have the following regular expression which matches on all double quotes besides those that are escaped:
The regular expression is as follows:
((?<![\\])")
How could I alter this to no longer use the negative lookbehind as it is not supported on some browsers?
Any help is greatly appreciated, thanks!
I wasn't able to get anything currently working
You can match
/\\"|(")/
and keep only captured matches. Being so simple, it should work with most every regex engine.
Demo
This matches what you don't want (\\")--to be discarded--and captures what you do want (")--to be kept.
This technique has been referred to by one regex expert as The Greatest Regex Trick Ever. To get to the punch line at the link search for "(at last!)".
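As a rough sketch of the "match what you don't want, capture what you do" approach in JavaScript (the helper name `unescapedQuoteIndexes` and the sample text are assumptions made for illustration):

```javascript
// Hypothetical helper: returns the indexes of all unescaped double quotes.
function unescapedQuoteIndexes(text) {
  const indexes = [];
  for (const m of text.matchAll(/\\"|(")/g)) {
    // Group 1 is only set when the second alternative (a bare quote) matched;
    // matches of \" leave it undefined and are discarded.
    if (m[1] !== undefined) indexes.push(m.index);
  }
  return indexes;
}

// The actual string is: say \"hi\" to "bob"
const sample = 'say \\"hi\\" to "bob"';
const quoteIndexes = unescapedQuoteIndexes(sample);
console.log(quoteIndexes); // [ 14, 18 ]
```

No lookbehind is involved, so this should also run on engines that lack it, as long as `matchAll` (or an equivalent `exec` loop) is available.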
Neither of these may be a completely satisfactory solution.
This regex won't match just the unescaped "; it also consumes the character before it, so additional logic is required to check whether the first character of the match is " and adjust the match position:
(?:^|[^\\])(")
This may be a better choice, but it depends on positive lookahead - which may have the same issue as negative lookbehind.
Version 1a (again requires additional logic)
(?:^|\b)(?=[^\\])(")
Version 2a (depends on positive lookahead)
(?:^|\b|\\\\)(?=[^\\])(")
Assuming you need to also handle escaped slashes followed by escaped quotes (not in the question, but ok):
Version 1a (requires the additional logic):
(?:^|[^\\]|\\\\)(")
Building on this answer, I'd like to add that you may also want to ignore escaped backslashes, and match the closing quote in this string:
"ab\\"
In that case, /\\[\\"]|(")/g is what you're after.
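The difference between the two patterns can be seen on the string from this answer. This is a small sketch; the helper name `capturedQuoteIndexes` is an assumption made for illustration:

```javascript
// Hypothetical helper: indexes of the quotes that end up in capture group 1.
function capturedQuoteIndexes(text, pattern) {
  return [...text.matchAll(pattern)]
    .filter((m) => m[1] !== undefined)
    .map((m) => m.index);
}

// String.raw keeps the backslashes literal: the text is "ab\\" (6 characters).
const text = String.raw`"ab\\"`;

// Treats the second backslash plus the closing quote as an escaped quote.
const simple = capturedQuoteIndexes(text, /\\"|(")/g);
// Consumes the escaped backslash as a pair first, so the closing quote is kept.
const backslashAware = capturedQuoteIndexes(text, /\\[\\"]|(")/g);

console.log(simple);         // [ 0 ]
console.log(backslashAware); // [ 0, 5 ]
```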

Regex for hashtags at the very beginning does not work in C# but works in JavaScript

Hi, guys!
I wrote this regex:
^((#\w+\b(\s?|#))+)
and it works fine... but only here (in JavaScript mode).
As you can see, it highlights all the lines up to the point where the text without hashtags begins (I only need the hashtags from the very beginning of the text).
If I try the same thing at http://regexstorm.net/tester, the part I need is not fully captured (the ECMAScript option does not help either).
What's the best way to fix it for C#? And why doesn't it work there, when everything looks fine with the other options in regex101?
The main issue is the difference in line break style between the Regex101 and RegexStorm sites: the former uses LF and the latter uses CRLF. So \s?, which matches only one or zero whitespace characters, cannot bridge the lines at RegexStorm, since there are two whitespace characters (CR and LF) between the end of the first line and the beginning of the second.
You can fix it by replacing \s? with \s* (or at least \s{0,2}, to match 0 to 2 whitespace characters).
However, your regex needs improving since it is causing too much overhead for the regex engine. You may write it linearly as
^#\w+(?:\s*#\w+)*
See the RegexStorm regex demo. It matches a hashtag, followed with 0+ sequences of 0+ whitespaces and a hashtag.
Note that ^ may be redefined to match the start of a line. To avoid that, in .NET, you may use \A anchor that always matches the start of the string.
Pattern details:
^ (or \A) - start of the string
#\w+ - a # followed with 1+ word chars
(?:\s*#\w+)* - zero or more sequences of:
\s* - zero or more whitespaces
#\w+ - a hashtag pattern.
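The line-ending effect can be reproduced without either test site. The sketch below uses JavaScript rather than C#, on the assumption that ^, \w, \s and \b behave the same way in both flavors for this ASCII input; the sample strings are made up:

```javascript
// Pattern from the question and the linear rewrite from this answer.
const original = /^((#\w+\b(\s?|#))+)/;
const linear = /^#\w+(?:\s*#\w+)*/;

// The same hypothetical text with LF and with CRLF line endings.
const lf = '#one\n#two\nplain text';
const crlf = '#one\r\n#two\r\nplain text';

// JSON.stringify makes the captured line-break characters visible.
console.log(JSON.stringify(original.exec(lf)[0]));   // "#one\n#two\n"
console.log(JSON.stringify(original.exec(crlf)[0])); // "#one\r"
console.log(JSON.stringify(linear.exec(lf)[0]));     // "#one\n#two"
console.log(JSON.stringify(linear.exec(crlf)[0]));   // "#one\r\n#two"
```

With CRLF input the original pattern stops after the carriage return, because \s? can consume only one of the two line-break characters before the next # is required.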

JS Regexp: \b for hyphenated words

In JavaScript regexp, what can I use in place of \b to get the same effect but on words that may be hyphenated?
(This question is directed at readers familiar with \b and with hyphenation, and so does not provide examples.)
UPDATE
Addison's (?<!-)\b(?!-) here is a partial solution for PCRE. It falls short on -500, by losing the boundary that \b delivers. It doesn't work on lookbehind-less JavaScript.
You can't create your own version of \b in regex flavors like JavaScript that don't support lookbehind. \b matches at a position. It needs to check the character (or lack thereof) before and after that position in order to determine whether the position should be matched. This requires both lookahead and lookbehind.
You can match hyphenated words (ASCII only) with this regex:
\b[a-zA-Z\-]+\b
This regex will allow hyphens before and after the word but does not include those in the match.
I would consider using the \b expression, but modify it to be a little more fussy. Add a negative lookahead and lookbehind to it, so that it doesn't appear beside a hyphen:
(?<!-)\b(?!-)
Try it on Regex 101
Note that this might cause problems with words such as -500, depending on what behaviour you want. You might want there to be a boundary before or after the hyphen (or not at all).
UPDATE
The regex gets much more complex, since there is not ordinarily a boundary before a hyphen, meaning one must be added.
(?<!-)\b(?!-)|\B(?=-\w)
The second condition adds a boundary wherever there is a non-word boundary followed by a hyphen and a word character. It's very explicit, but this is the only case it happens.
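To see where each variant places its boundaries, one option is to mark every matched position with a "|". This is only a sketch: the `mark` helper and the sample strings are assumptions, and it needs a JavaScript engine with lookbehind support:

```javascript
// Hypothetical helper: inserts "|" at every position the boundary pattern matches.
const mark = (text, boundary) => text.replace(boundary, '|');

const plain = /\b/g;
const hyphenAware = /(?<!-)\b(?!-)/g;
const withLeadingHyphen = /(?<!-)\b(?!-)|\B(?=-\w)/g;

console.log(mark('well-known fact', plain));       // "|well|-|known| |fact|"
console.log(mark('well-known fact', hyphenAware)); // "|well-known| |fact|"
// The boundary before -500 is lost by the first variant...
console.log(mark('-500', hyphenAware));            // "-500|"
// ...and restored by the second alternative of the updated regex.
console.log(mark('-500', withLeadingHyphen));      // "|-500|"
```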

Regex lookaround for a group doesn't work

Happy Saturday,
I'm wondering if Stackoverflow's users could give me a clue about one specific Regex..
(^visite\d+)(?!\D)
The above regex works well..
It says that :
visite12345 --> is a good answer (the string does match)
visite1a --> is not a good answer (the string doesn't match)
However for:
visite12345a --> It doesn't work.
Indeed, the output is visite1234, whereas I'd like to get the same answer as for visite1a (string doesn't match)...
I use http://regexr.com/ to test my regexp.
Do you have any idea how to do so?
Thank you very much.
The regex (^visite\d+)(?!\D) matches visite at the start of the string, followed with one or more digits that should not be followed with a non-digit.
The "issue" is that the engine can backtrack within the \d+ pattern: for visite12345a it gives back the last digit and matches visite1234, because the next character (5) is a digit and so (?!\D) succeeds.
The best way to solve it is to check the actual requirements and adjust the pattern.
If the digits are the last characters in the string you just should replace the lookahead with the $ anchor.
A generic solution for this is to make the subpattern atomic with a capturing group inside a positive lookahead and a backreference, and to change the lookahead to something like (?![a-zA-Z]) (fail if there is a letter):
/^visite(?=(\d+))\1(?![a-z])/i
See the regex demo
Or, if a word boundary should follow the digits (i.e. the digits should not be followed with a letter, digit or underscore), use \b instead of the lookahead:
/^visite\d+\b/
See another demo
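The three patterns discussed in this answer can be compared side by side. A minimal sketch, using the test strings from the question:

```javascript
const naive = /(^visite\d+)(?!\D)/;               // pattern from the question
const atomic = /^visite(?=(\d+))\1(?![a-z])/i;    // lookahead + backreference
const bounded = /^visite\d+\b/;                   // word boundary instead of a lookahead

// The naive pattern backtracks inside \d+ and settles for one digit fewer.
console.log(naive.exec('visite12345a')[0]); // "visite1234"
console.log(naive.test('visite1a'));        // false

// The digits captured inside the lookahead cannot be given back afterwards.
console.log(atomic.test('visite12345a'));   // false
console.log(atomic.test('visite12345'));    // true

// There is no word boundary between any two of the characters in "12345a".
console.log(bounded.test('visite12345a'));  // false
console.log(bounded.test('visite12345'));   // true
```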

JS regular expression, basic lookahead

I cannot figure out, for the life of me, why this regular expression
^\.(?=a)$
does not match
".a"
anyone know why?
I am going off the information provided here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
The reason it doesn't work is because the lookahead doesn't actually consume any characters, so your matching position doesn't advance.
^\.(?=a)$
Matches the beginning of line (^ -- this matches) followed by a literal . (\. -- this also matches), and then (without consuming any characters), checks to see if the next character is a literal a ((?=a)). It is, so the lookahead matches. It then asserts that your position is at the end of the string ($). This is not the case, because we're still right after the ., so the match fails.
Another possible matching expression would be
^\.(?=a$)
Which works just as above, but the assertion about the end of the line is contained in the lookahead, so this time, it matches.
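A short sketch of the difference, using the string from the question:

```javascript
const outside = /^\.(?=a)$/;  // $ is checked right after the ".", before the "a"
const inside = /^\.(?=a$)/;   // $ is checked inside the lookahead, after the "a"

console.log(outside.test('.a'));   // false
console.log(inside.test('.a'));    // true

// The lookahead consumes nothing, so the match itself is only the period.
console.log(inside.exec('.a')[0]); // "."

// If the "a" should be part of the match, it can simply be consumed.
console.log(/^\.a$/.exec('.a')[0]); // ".a"
```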
Your regex is only going to match a period that's followed by an 'a', without including 'a' in the match.
Another issue is that you're using $ after a character that's basically being ignored.
Remove the $ and it will work as described.
Bonus: I've enjoyed using this lately http://www.regexpal.com/
