JS Regexp: \b for hyphenated words - javascript

In JavaScript regexp, what can I use in place of \b to get the same effect but on words that may be hyphenated?
(This question is directed at readers familiar with \b and with hyphenation, and so does not provide examples.)
UPDATE
Addison's (?<!-)\b(?!-) here is a partial solution for PCRE. It falls short on -500, by losing the boundary that \b delivers. It doesn't work on lookbehind-less JavaScript.

You can't create your own version of \b in regex flavors that don't support lookbehind, such as JavaScript before ES2018. \b matches at a position. It needs to check the character (or lack thereof) before and after that position in order to determine whether the position should be matched. This requires both lookahead and lookbehind.
You can match hyphenated words (ASCII only) with this regex:
\b[a-zA-Z\-]+\b
This regex will allow hyphens before and after the word but does not include those in the match.
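A quick way to see how that pattern behaves in JavaScript. The helper name hyphenatedWords and the sample sentence are made up for illustration; this is only a sketch of the regex above, not a general tokenizer.

```javascript
// Sketch: collect ASCII words, keeping internal hyphens together.
// The helper name and the sample text are illustrative assumptions.
function hyphenatedWords(text) {
  // \b anchors both ends of the match on a word character, so a leading
  // or trailing hyphen is never included in the match.
  return text.match(/\b[a-zA-Z\-]+\b/g) || [];
}

console.log(hyphenatedWords("well-known -foo- bar"));
// → [ 'well-known', 'foo', 'bar' ]
```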

I would consider using the \b expression, but modify it to be a little more fussy. Add a negative lookahead and lookbehind to it, so that it doesn't match beside a hyphen:
(?<!-)\b(?!-)
Try it on Regex 101
Note that this might cause problems with words such as -500, depending on what behaviour you want. You might want there to be a boundary before or after the hyphen (or not at all).
UPDATE
The regex gets much more complex, since there is not ordinarily a boundary before a hyphen, meaning one must be added.
(?<!-)\b(?!-)|\B(?=-\w)
The second alternative adds a boundary wherever a non-word-boundary position is followed by a hyphen and a word character. It's very explicit, but this is the only case in which a boundary is missing.
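To check where this combined pattern actually matches, one option is to list the match positions. The sketch below assumes an engine with lookbehind support (ES2018 or later), so it does not help on lookbehind-less JavaScript; boundaryPositions is a hypothetical helper name.

```javascript
// Requires lookbehind support (ES2018 or later).
const hyphenAwareBoundary = /(?<!-)\b(?!-)|\B(?=-\w)/g;

// Hypothetical helper: list the string indexes where the pattern matches.
function boundaryPositions(text) {
  return Array.from(text.matchAll(hyphenAwareBoundary), (m) => m.index);
}

// No boundary on either side of the inner hyphen.
console.log(boundaryPositions("well-known")); // → [ 0, 10 ]
// The second alternative supplies the boundary in front of the leading hyphen.
console.log(boundaryPositions("-500"));       // → [ 0, 4 ]
```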

Related

Negative Lookahead & Lookbehind with Capture Groups and Word Boundaries

We are auto-formatting hyperlinks in a message composer but would like to avoid matching links that are already formatted.
Attempt: Build a regex that uses a negative lookbehind and negative lookahead to exclude matches where the link is surrounded by href=" and ".
Problem: Negative lookbehind/lookahead are not working with our regex:
Regex:
/(?<!href=")(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9#:%._+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9#:%_+.~#?&\/\/=;]*)(?!")/g
Usage:
html.match(/(?<!")(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9#:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9#:%_\+.~#?&//=;]*)(?!")/g);
When testing, we notice that exchanging the negative lookahead/lookbehind with a positive version causes it to work. Thus, only negative lookbehind/lookaheads are not working.
Does anyone know why these negative lookbehind/lookaheads are not functioning with this regex?
Thank you!
With @Barmar's help in the question comments, it is clear that the problem lies in the optional beginning and end of the regex.
"Basically, anything that allows something to be optional next to a negative lookaround may negate the effect of the lookaround, if it can find a shorter match that isn't next to it. "
If you are using a modern JS engine that supports variable-length lookbehind assertions, you can put a non-greedy, variable-length part inside the lookbehind.
That lets the regex keep optional beginnings like the ones you have.
/(?<!href="[^"]*?)(?:https?:\/\/.)?(?:www\.)?[a-zA-Z0-9#%+\-.:=#_~]{2,256}\.[a-z]{2,6}\b[a-zA-Z0-9#%&+\--\/:;=?#_~]*(?!")/
https://regex101.com/r/OdJyZf/1
(?<! href=" [^"]*? )
(?: https?:// . )?
(?: www \. )?
[a-zA-Z0-9#%+\-.:=#_~]{2,256} \. [a-z]{2,6} \b [a-zA-Z0-9#%&+\--/:;=?#_~]*
(?! " )
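The effect is easier to see on a cut-down pattern. The sketch below is not the regex from the question: it uses a simplified, hypothetical URL pattern and sample HTML, and assumes an engine with variable-length lookbehind (ES2018 or later). It compares a fixed-width lookbehind next to an optional prefix with the variable-length version.

```javascript
const html = '<a href="www.example.com">x</a> www.example.org';

// Fixed-width lookbehind next to an optional prefix: the engine skips the
// optional "www." and starts the match at "example", which is preceded by
// a dot rather than a quote, so the lookbehind no longer objects.
const naive = /(?<!")(?:www\.)?[a-z]+\.[a-z]{2,6}\b/g;

// Variable-length lookbehind: reject every start position that still sits
// inside an unclosed href="..." value.
const guarded = /(?<!href="[^"]*)(?:www\.)?[a-z]+\.[a-z]{2,6}\b/g;

console.log(html.match(naive));   // → [ 'example.com', 'www.example.org' ]
console.log(html.match(guarded)); // → [ 'www.example.org' ]
```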
I must make a correction. In my comments I said that
the word boundary \b here [a-z]{2,6}\b[a-zA-Z0-9#%&+\--/:;=?#_~] effectively removes the word class \w in the following class.
This is true, but only for the first character after the boundary. The characters after that can still be word characters, so they are needed in the class.
It's a clear example of overthinking something that doesn't need it.
The whole regex should be rewritable using \w in the classes unless ASCII is required.
Note that this will only work for the new JS engine and C# (of course).

How to exclude such pattern from regex matching?

I would like to compose a regular expression to highlight keywords.
The regex is kind of like
\btap\b.
For the sentence below, it is expected to match only the one "tap" without double quotation marks. In reality, it also matches the second "tap" inside the quotation marks.
tap click "tap"
How can I exclude the second tap word from being matched?
This seems to work fine:
var reg = new RegExp('\\b(tap(?!\"))', 'ig')
('tap click "tap" tap.').match(reg)
Rules
Starting word
not quotes at end
case insensitive.
Fiddle
A word boundary \b matches at any position between a word character and a non-word character, and " is a non-word character, so the position between " and tap counts as a boundary too.
You can simulate your own word boundaries that include only the delimiters you consider appropriate.
For example:
\s|^|\.|!|\?|$ - space, start of string, dot, exclamation mark, question mark, or end of string
I would also suggest using negative lookbehinds/lookaheads, but...
JavaScript only gained lookbehind support in ES2018, so it isn't available in older engines.
So you could use some capturing groups and then use the group which you need.
Sample regex: (?:\s|^|\.|!|\?)(tap)(\s|$|\.|!|\?)
And then in the javascript use the first capturing group - match[1].
See this SO answer for details how to use capturing groups in JavaScript.
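A minimal sketch of the capturing-group approach in JavaScript, using the sample regex above. The function name findKeywords is hypothetical. One caveat of this sketch: the delimiters are consumed by the match, so two keywords separated by a single space share that space and only the first one is found.

```javascript
// Sample regex from the answer: delimiter, keyword (group 1), delimiter.
const keyword = /(?:\s|^|\.|!|\?)(tap)(\s|$|\.|!|\?)/gi;

// Hypothetical helper: return only the keyword captured in group 1.
function findKeywords(text) {
  return Array.from(text.matchAll(keyword), (m) => m[1]);
}

// The quoted "tap" is skipped because a double quote is not in the
// list of accepted delimiters.
console.log(findKeywords('tap click "tap"')); // → [ 'tap' ]
```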

Specific regex positive look(around|ahead|behind) in Javascript

I'm looking to match /(?=\W)(gimme)(?=\W)/gi or alike. The \W are supposed to be zero-width characters to surround my actual match.
Maybe some background. I want to replace certain words (always \w+) with some literal padding added, but only if it's not surrounded by a \w. (That does sound like a negative lookaround, but I hear JS doesn't do those!?)
(Btw: the above "gimme" is the word literal I want to replace. If that wasn't obvious.)
It has to be (?) a lookaround, because the \W have to be zero-width, because the intention is a .replace(...) and I cannot replace/copy the surrounding characters.
So this won't work:
text.replace(/(?=\W)(gimme)(?=\W)/gi, function(l, match, r) {
return l + doMagic(match) + r;
})
The zero-width chars have to be ignored, so the function can return (and replace) only doMagic(match).
I have only very limited lookaround experience and none of it in JS. Grazie.
PS. Or maybe I need a lookbehind and those aren't supported in JS..? I'm confused?
PS. A little bit of context: http://jsfiddle.net/rudiedirkx/kMs2N/show/ (ooh a link!)
You can use the word boundary shortcut \b to assert that you are matching the whole word.
The easiest way to achieve what you want to do is probably to match:
/(\s+gimme)(?=\W)/gi
and replace with [yourReplacement] - i.e. capture the whitespaces before 'gimme' and then include one in the replacement.
Another way to approach this would be capturing more characters before and after the gimme literal and then using the groups with backreference:
(\W+?)gimme(\W+?) - your match - note that this time the before and after characters are in the capturing groups 1 and 2
And you'd want to use \1[yourReplacement]\2 as the replacement string. The idea is to tell the engine that \1 means whatever was matched by the first capturing parenthesis. In some languages, JavaScript replacement strings among them, these are written $1 and $2.
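In JavaScript the same idea can be written with a replacement callback, which receives the captured groups as arguments. This is only a sketch: doMagic is a stand-in for the asker's function, and the pattern is a variation that captures the character before the word and uses a lookahead after it, so the trailing character is not consumed.

```javascript
// Stand-in for the asker's doMagic(); the real behavior is unknown.
function doMagic(word) {
  return "[" + word + "]";
}

// Capture the preceding non-word character (or the start of the string)
// and put it back; the lookahead checks the following character without
// consuming it, so adjacent occurrences are still found.
function padWord(text) {
  return text.replace(/(^|\W)(gimme)(?=\W|$)/gi, (whole, before, word) => {
    return before + doMagic(word);
  });
}

console.log(padWord("gimme more, gimmedat gimme"));
// → [gimme] more, gimmedat [gimme]
```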
What you currently have will not work, for the following reason, (?=\W) means "the next character is not a word character", and the next thing you try to match is a literal g, so you have a contradiction ("next character is a g, but isn't a letter").
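The contradiction is easy to confirm. The sketch below tests the original pattern against a sample string (made up for illustration) and compares it with the \b form suggested elsewhere in this thread.

```javascript
// The leading (?=\W) demands a non-word character exactly where the
// literal "g" must start, so this pattern can never match.
const original = /(?=\W)(gimme)(?=\W)/i;
console.log(original.test(" gimme "));     // → false

// \b is zero-width and looks at both neighbours of the position.
const withBoundary = /\b(gimme)\b/i;
console.log(withBoundary.test(" gimme ")); // → true
console.log(withBoundary.test("gimmedat")); // → false
```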
You do in fact need a lookbehind, but lookbehinds were not supported by JavaScript before ES2018.
Check out this article on Mimicking Lookbehind in JavaScript for a possible approach.
Have you considered using a lexer/parser combo?
This one is javascript based, and comes with a spiffy demonstration.

Can I write a regex expression where one symbol matches twice?

I'm matching words with regex in javascript. The following expression uses whitespace to separate the potential matches:
/(\W)(foo)(\W)/g
This works most of the time, but it fails when there are two matches separated by a single space. (e.g. "foo foo") I think this is because the space that separates them is the last \W of the first match and the first of the second.
Is there any way to modify this expression to work in this edge case?
You can use \b instead of \W. It matches a zero-width word boundary (a boundary between a \w and a \W, or the start/end of the string), while \W matches a character, which may not exist at the start or end of a string.
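A small comparison of the two patterns, using made-up sample strings and a hypothetical countMatches helper:

```javascript
// Hypothetical helper: count how many times a global regex matches.
function countMatches(text, regex) {
  return (text.match(regex) || []).length;
}

// \W consumes the shared space, so the second "foo" has no leading \W left.
console.log(countMatches(" foo foo ", /(\W)(foo)(\W)/g)); // → 1

// \b consumes nothing, so both words match, even at the string edges.
console.log(countMatches(" foo foo ", /\bfoo\b/g)); // → 2
console.log(countMatches("foo foo", /\bfoo\b/g));   // → 2
```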
Javascript regexes have lookahead, so you can probably do something like this:
/(\W)(foo)(?=\W)/g
I don't think lookbehinds are available, but there are other techniques that have the same effect.
Of course, this is functionally different, in that the lookahead doesn't capture, so it depends on the nature of your problem. The main point here is not that it doesn't capture, but that it doesn't consume the character it checks, thereby avoiding your problem.
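A sketch of the lookahead version in a replace call. The sample string and the bracket markup are assumptions for illustration only.

```javascript
// The trailing \W is only checked, not consumed, so the space between the
// two words is still available as the leading \W of the second match.
const withLookahead = /(\W)(foo)(?=\W)/g;

const result = " foo foo ".replace(withLookahead, "$1[$2]");
console.log(JSON.stringify(result)); // → " [foo] [foo] "
```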
Give this a try, I think it will work for you:
/(\W)(foo| )(\W)/g
This will tell the regex to match foo or whitespace between the two \Ws.

How can I make a regular expression which takes accented characters into account?

I have a JavaScript regular expression which basically finds two-letter words. The problem seems to be that it interprets accented characters as word boundaries. Indeed, it seems that
A word boundary ("\b") is a spot between two characters that has a "\w" on one side of it and a "\W" on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a "\W".
AS3 RegExp to match words with boundry type characters in them
And since
\w matches any alphanumerical character (word characters) including underscore (short for [a-zA-Z0-9_]).
\W matches any non-word characters (short for [^a-zA-Z0-9_])
http://www.javascriptkit.com/javatutors/redev2.shtml
obviously accented characters are not taken into account. This becomes a problem with words like Montréal. If the é is considered a word boundary, then al is a two-letter word. I have tried making my own definition of a word boundary which would allow for accented characters, but seeing as a word boundary isn't even a character, I don't exactly know how to go about finding it.
Any help?
Here is the relevant JavaScript code, which searches userInput and finds two-letter words using the re_state regular expression:
var re_state = new RegExp("\\b([a-z]{2})[,]?\\b", "mi");
var match_state = re_state.exec(userInput);
document.getElementById("state").value = (match_state)?match_state[1]:"";
While JavaScript regexes recognize non-ASCII characters in some cases (like \s), they're hopelessly inadequate when it comes to \w and \b. If you want them to work with anything beyond the ASCII word characters, you'll have to either use a different language, or install Steve Levithan's XRegExp library with the Unicode plugin.
By the way, there's an error in your regex. You have a \b after the optional trailing comma, but it should be in front:
"\\b([a-z]{2})\\b,?"
I also removed the square brackets; you would only need those if the comma had a special meaning in regexes, which it doesn't. But I suspect you don't need to match the comma at all; \b should be sufficient to make sure you're at the end of the word. And if you don't need the comma, you don't need the capturing group either:
"\\b[a-z]{2}\\b"
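For comparison, here is a sketch of a different technique from the one this answer describes: engines that implement ES2018 accept Unicode property escapes with the u flag, together with lookbehind, which allows a hand-built boundary that treats accented letters as word characters. The pattern and helper names are assumptions, not part of the original code.

```javascript
// ASCII-only version from the answer: é counts as a non-word character,
// so "al" in "Montréal" looks like a two-letter word.
const asciiTwoLetter = /\b([a-z]{2})\b/i;

// Assumes ES2018+: \p{L} and \p{N} need the u flag, and lookbehind must be
// supported. The lookarounds stand in for \b with a Unicode-aware notion
// of "word character".
const unicodeTwoLetter = /(?<![\p{L}\p{N}_])\p{L}{2}(?![\p{L}\p{N}_])/u;

// Hypothetical helper mirroring the question's code.
function findTwoLetterWord(input, regex) {
  const match = regex.exec(input);
  return match ? match[0] : "";
}

const userInput = "Montr\u00e9al, QC";
console.log(findTwoLetterWord(userInput, asciiTwoLetter));   // → al
console.log(findTwoLetterWord(userInput, unicodeTwoLetter)); // → QC
```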
Have you set JavaScript to use non-ASCII?
Here is a page that suggests setting JavaScript to use UTF-8:
http://blogs.oracle.com/shankar/entry/how_to_handle_utf_8
It says:
add a charset attribute (charset="utf-8") to your script tags in the parent page:
<script type="text/javascript" src="[path]/myscript.js" charset="utf-8"></script>
