Understanding the usage of \b in regex that matches multiple strings - javascript

I just found the below regex online while browsing:
(?:^|\b)(bitcoin atm|new text|bitcoin|test a|test)(?!\w)
I was just curious to know what is the advantage of using (?:^|\b) here ?
I understand that (?:) is a non-capturing group, but I am a bit stumped by ^|\b inside these particular parentheses. I understand that ^ asserts the start of the string.
The examples of \b on MDN gave me a fair understanding of what \b does, but I am still not able to put things into context based on the example I have provided.

The (?:^|\b) is a non-capturing group that contains 2 alternatives both of which are zero-width assertions. That means, they just match locations in a string, and thus do not affect the text you get in the match.
Besides, since the next subpattern always starts with b, n or t (a word char), the \b (a word boundary) in the first non-capturing group will also match at the beginning of the string, which makes the ^ (start of string anchor) alternative redundant here.
Thus, you may safely use
\b(bitcoin atm|new text|bitcoin|test a|test)(?!\w)
and even
\b(bitcoin atm|new text|bitcoin|test a|test)\b
since the alternatives end with a word char here.
If the alternatives in the (bitcoin atm|new text|bitcoin|test a|test) group are user-defined, dynamic, and can start or end with a non-word char, then the (?:^|\b) and (?!\w) patterns make sense. They would not be precise, though: (?:^|\b)\.txt(?!\w) will not match .txt as a whole word, because \b before the dot requires a word char to precede it (unless the match is at the start of the string). I would use (?:^|\W) rather than (?:^|\b).
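A minimal sketch of both points, assuming a made-up sample sentence (the variable names and test strings are illustrative, not from the original question). It checks that the original and simplified patterns agree on word-char alternatives, and that \b behaves differently from \W in front of .txt:

```javascript
// The alternatives from the question; each starts and ends with a word char.
const words = ["bitcoin atm", "new text", "bitcoin", "test a", "test"];
const group = words.join("|");

const original = new RegExp(`(?:^|\\b)(${group})(?!\\w)`, "g");
const simplified = new RegExp(`\\b(${group})\\b`, "g");

// Hypothetical sample text.
const sample = "bitcoin atm, bitcoins, a test and a retest";
const originalMatches = sample.match(original);
const simplifiedMatches = sample.match(simplified);
console.log(originalMatches);   // [ 'bitcoin atm', 'test' ]
console.log(simplifiedMatches); // [ 'bitcoin atm', 'test' ]

// An alternative that starts with a non-word char behaves differently.
const withBoundary = /(?:^|\b)\.txt(?!\w)/;
const withNonWord = /(?:^|\W)\.txt(?!\w)/;

// There is no word boundary between the space and the dot.
const boundaryResult = withBoundary.test("open .txt files");
// \W consumes the space, so the match includes it.
const nonWordResult = withNonWord.test("open .txt files");
console.log(boundaryResult, nonWordResult); // false true
```

Note that (?:^|\W) consumes the preceding character, so that character becomes part of the overall match.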

Related

How to match one 'x' but not one or both of xs in 'xx' globally in string [duplicate]

Not quite sure how to go about this, but basically what I want to do is match a character, say a for example. In this case all of the following would not contain matches (i.e. I don't want to match them):
aa
aaa
fooaaxyz
Whereas the following would:
a (obviously)
fooaxyz (this would only match the letter a part)
My knowledge of RegEx is not great, so I am not even sure if this is possible. Basically what I want to do is match any single a that has any other non a character around it (except for the start and end of the string).
Basically what I want to do is match any single a that has any other non a character around it (except for the start and end of the string).
^[^\sa]*\Ka(?=[^\sa]*$)
\K discards the previously matched characters, and the lookahead asserts whether a match is possible or not. So the above matches only the letter a that satisfies the conditions.
OR
a{2,}(*SKIP)(*F)|a
You may use a combination of a lookbehind and a lookahead:
(?<!a)a(?!a)
Details
(?<!a) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is an a char
a - an a char
(?!a) - a negative lookahead that fails the match if, immediately to the right of the current location, there is an a char.
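As a rough illustration in JavaScript, assuming an engine with lookbehind support (ES2018 or later), the lookaround pattern can be run against the strings from the question. The variable names are only illustrative:

```javascript
// Matches an "a" that has no "a" directly before or after it.
const singleA = /(?<!a)a(?!a)/g;

const inputs = ["aa", "aaa", "fooaaxyz", "a", "fooaxyz"];

// For each input, collect the index of every match.
const results = inputs.map((s) =>
  [...s.matchAll(singleA)].map((m) => m.index)
);

console.log(results); // [ [], [], [], [ 0 ], [ 3 ] ]
```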
You need two things:
a negated character class: [^a] (all except "a")
anchors (^ and $) to ensure that the limits of the string are reached (in other words, that the pattern matches the whole string and not only a substring):
Result:
^[^a]*a[^a]*$
Once you know there is only one "a", you can extract, replace, or remove it in whatever way suits the language you use.
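A small sketch of this whole-string approach in JavaScript; the helper name indexOfOnlyA is hypothetical. Unlike the lookaround version, it rejects any string that contains more than one "a" anywhere:

```javascript
// The whole string must contain exactly one "a".
const exactlyOneA = /^[^a]*a[^a]*$/;

// Hypothetical helper: position of the single "a", or -1 if the
// string has zero or several of them.
function indexOfOnlyA(s) {
  return exactlyOneA.test(s) ? s.indexOf("a") : -1;
}

console.log(indexOfOnlyA("fooaxyz"));  // 3
console.log(indexOfOnlyA("fooaaxyz")); // -1
console.log(indexOfOnlyA("banana"));   // -1
console.log(indexOfOnlyA("a"));        // 0
```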

how to negate a capture group?

Using a javascript regexp, I would like to find strings like "/foo" or "/foo d/" but not "/foo /"; ie, "annotation character", then either word with no terminating annotation, or multiple words, where the termination comes at the end of the phrase (with no space). Complicating the situation, there are three possible annotation symbols: /, \ and |.
I've tried something like:
/(?:^|\s)([\\\/|])((?:[\w_-]+(?![^\1]+[\w_-]\1))|(?:[\w\s]+[\w](?=\1)))/g
That is, start with space, then annotation, then
word not followed by (anything but annotation) then letter and annotation... or
possibly multiple words, immediately followed by annotation character.
The problem is the [^\1]: inside the square brackets this doesn't read as "anything but the annotation character".
I could repeat the whole phrase three times, one for each annotation character. Any better ideas?
As you've mentioned, [^\1] doesn't work: inside a character class \1 is not a backreference, so the class matches anything that is not the character with code 1 (U+0001). In JavaScript, you can negate \1 by using a lookahead: (?:(?!\1).)* . This is not as efficient, but it works.
Your pattern can be written as:
([\\\/|])([\w\-]+(?:(?:(?!\1).)*[\w\-]\1)?)
Working example at Regex101
\w already contains underscore.
Instead of alternation (a|ab) I'm using an optional group (a(?:b)?) - we always match the first word, with optional further words and tags.
You may still want to include (?:^|\s) at the beginning.
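To see the negated backreference at work, here is a short sketch that runs the pattern above (without the global flag) on a few sample strings. The parse helper and the inputs are assumptions made for illustration:

```javascript
// Pattern from the answer: group 1 is the annotation symbol, and
// (?:(?!\1).)* consumes any char that is not that same symbol.
const annotation = /([\\\/|])([\w\-]+(?:(?:(?!\1).)*[\w\-]\1)?)/;

// Hypothetical helper returning the symbol and the annotated body.
function parse(s) {
  const m = annotation.exec(s);
  return m ? { symbol: m[1], body: m[2] } : null;
}

const single = parse("/foo");        // one word, no terminator
const phrase = parse("|foo d|");     // several words, terminated by "|"
const mixed = parse("\\a|b c\\");    // "|" is allowed inside a "\" phrase

console.log(single); // { symbol: '/', body: 'foo' }
console.log(phrase); // { symbol: '|', body: 'foo d|' }
console.log(mixed);  // { symbol: '\\', body: 'a|b c\\' }
```

The last case shows that only the symbol captured in group 1 is excluded; the other two annotation symbols may still appear inside the phrase.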

Find all char excluding group at the end with regexp

I have this string:
this is a test
at the end of this string I have a space and the new line.
I want to extract (for counting) all groups of spaces in the string except the last one.
With my simple regex
/\s+/g
I obtain these groups:
this(1)is(2)a(3)test(4)
I want to exclude the fourth space group, because I want to get only 3 groups if the string ends with a space.
What is the correct regexp?
Depending on the regex flavor, you can use two approaches.
If atomic groups/possessive quantifiers are not supported, use a lookahead solution like this:
(?:\s(?!\s*$))+
The main point is that we only match a whitespace that is not followed with 0+ other whitespace symbols followed with an end of string (the check is performed with the (?!\s*$) lookahead).
Else, use
\s++(?!$)
An equivalent expression with an atomic group is (?>\s+)(?!$).
Here, we check for the end of string position ONLY after grabbing all whitespaces without backtracking into the \s++ pattern (so, if after the last space there is an end of string, the whole match is failed).
Also, it is possible to emulate an atomic group in JavaScript with the help of capturing inside the positive lookahead and then using a backreference like
(?=(\s+))\1(?!$)
However, this pattern is costly in terms of performance.
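A quick check of the two JavaScript-compatible patterns, assuming the input is the sample string followed by a space and a newline (the countMatches helper is hypothetical):

```javascript
// Sample string ending with a space and a newline, as in the question.
const text = "this is a test \n";

// Hypothetical helper: number of matches of a global regex.
function countMatches(re, s) {
  return (s.match(re) || []).length;
}

const naive = countMatches(/\s+/g, text);
const lookahead = countMatches(/(?:\s(?!\s*$))+/g, text);
const emulatedAtomic = countMatches(/(?=(\s+))\1(?!$)/g, text);

console.log(naive, lookahead, emulatedAtomic); // 4 3 3
```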

Regexp for accept numbers, letters and special characters

NKA-198, HM-1-0022, SCIDG 133
want regexp for the above codes. How can I Accept these codes and assign it to a variable??
Please suggest me and Thanks in advance.
First make sure that you have a solid understanding of the general structure of the strings you want to match, e.g., which separator symbols are permissible (your example suggests - and space, but what about +?). Would you want to match NKA 198 and SCIDG-133 too?
As a base for further refinement, use the following code fragment:
var orig = "some string containing ids like 'NKA-198' and 'SCIDG 133'";
var first_id = orig.replace(/^.*?([A-Z]+([ -][0-9]+)+).*/, "$1");
var last_id = orig.replace(/(?:.*[^A-Z]|^)([A-Z]+([ -][0-9]+)+).*/, "$1");
Explanation
core( ([A-Z]+([ -][0-9]+)+) )
Match any sequence of capital letters followed by a digit sequence preceded by a single hyphen or space character. The sequence 'space or hyphen plus number' may repeat arbitrarily often but at least once. This specification may be too restrictive or too lax, which is why you have to look up or guess the general rules that the IDs you wish to match obey. In a strict sense, the regex you've been asking for is ^(NKA-198|HM-1-0022|SCIDG 133)$, which most certainly is not what you need.
The outermost parentheses define the match as the first capture group, allowing to reference the matched content as $1 in the replace method. Using replace also mandates that your regexp needs to match the whole original string.
additional parts / first regexp
Matches anything non-greedily, starting at the string's beginning. The non-greedy operator (.*?) makes sure that the shortest possible match is found that still allows a match of the complete pattern (see what happens if you drop the question mark). Thus you'll end up with the first matching id in first_id.
additional parts / second regexp
Matches greedily (= as much as possible) until an identifier pattern matches. Thus you'll end up with the last match. The negated character class ([^A-Z]) is necessary, since there is no further information about the structure of the IDs in question, specifically which/how many leading capital characters there are. The class makes sure that the last character before the beginning of the matched id is not a capital character. The ^ in the alternation caters for the special case that orig starts with a matchable ID - in this case, the negated char class would not match, because there is no 'last prefix character' before the match.
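If all IDs are needed rather than just the first or last one, one possible variation is to reuse the core pattern with the global flag and String.prototype.match. This is only a sketch under the same structural assumptions as above; the sample string and variable names are made up:

```javascript
// Core pattern from the explanation, with a non-capturing inner group
// and the global flag so that match() returns every occurrence.
const idPattern = /[A-Z]+(?:[ -][0-9]+)+/g;

// Hypothetical input containing the three codes from the question.
const codes = "codes: NKA-198, HM-1-0022, SCIDG 133";

// match() returns null when nothing matches, hence the fallback.
const ids = codes.match(idPattern) || [];

console.log(ids); // [ 'NKA-198', 'HM-1-0022', 'SCIDG 133' ]
```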
References
A more detailed (and more competent) explanation of regexp pattern and usage can be found here. MDN provides info on regular expression usage in javascript.

Specific regex positive look(around|ahead|behind) in Javascript

I'm looking to match /(?=\W)(gimme)(?=\W)/gi or alike. The \W are supposed to be zero-width characters to surround my actual match.
Maybe some background. I want to replace certain words (always \w+) with some literal padding added, but only if it's not surrounded by a \w. (That does sound like a negative lookaround, but I hear JS doesn't do those!?)
(Btw: the above "gimme" is the word literal I want to replace. If that wasn't obvious.)
It has to be (?) a lookaround, because the \W have to be zero-width, because the intention is a .replace(...) and I cannot replace/copy the surrounding characters.
So this won't work:
text.replace(/(?=\W)(gimme)(?=\W)/gi, function(l, match, r) {
return l + doMagic(match) + r;
})
The zero-width chars have to be ignored, so the function can return (and replace) only doMagic(match).
I have only very limited lookaround experience and none of it in JS. Grazie.
PS. Or maybe I need a lookbehind and those aren't supported in JS..? I'm confused?
PS. A little bit of context: http://jsfiddle.net/rudiedirkx/kMs2N/show/ (ooh a link!)
You can use the word boundary shortcut \b to assert that it's the whole word that you are matching.
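A minimal sketch of the \b approach; doMagic here is a hypothetical stand-in for the asker's function and the sample text is made up. Because \b is zero-width, the callback receives only the word itself:

```javascript
// Hypothetical stand-in for the asker's doMagic function.
function doMagic(word) {
  return `[${word.toUpperCase()}]`;
}

const text = "gimme that, Gimme! not gimmex or xgimme";

// \b is zero-width, so only the word itself is replaced.
const out = text.replace(/\bgimme\b/gi, (match) => doMagic(match));

console.log(out); // "[GIMME] that, [GIMME]! not gimmex or xgimme"
```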
The easiest way to achieve what you want to do is probably to match:
/(\s+gimme)(?=\W)/gi
and replace with [yourReplacement] - i.e. capture the whitespaces before 'gimme' and then include one in the replacement.
Another way to approach this would be capturing more characters before and after the gimme literal and then using the groups with backreference:
(\W+?)gimme(\W+?) - your match - note that this time the before and after characters are in the capturing groups 1 and 2
And you'd want to use \1[yourReplacement]\2 as the replacement string. The idea is to tell the engine that \1 means whatever was matched by the first capturing parenthesis. In some languages, JavaScript's replace among them, these are written as $1 and $2 in the replacement string.
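In JavaScript, the capturing-group variant could look roughly like this, with $1 and $2 standing for the surrounding characters. The sample strings and the [gimme] replacement are placeholders:

```javascript
// Groups 1 and 2 capture the non-word chars around the literal.
const pattern = /(\W+?)gimme(\W+?)/gi;

// $1 and $2 put the captured surroundings back around the replacement.
const padded = " say gimme, not gimmex ".replace(pattern, "$1[gimme]$2");
console.log(padded); // " say [gimme], not gimmex "

// Caveat: the pattern needs a non-word char on both sides, so a word
// at the very start or end of the string is left untouched.
const atStart = "gimme now".replace(pattern, "$1[gimme]$2");
console.log(atStart); // "gimme now"
```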
What you currently have will not work for the following reason: (?=\W) means "the next character is not a word character", and the next thing you try to match is a literal g, so you have a contradiction ("next character is a g, but isn't a letter").
You do in fact need a lookbehind, but lookbehinds were not supported by JavaScript before ES2018.
Check out this article on Mimicking Lookbehind in JavaScript for a possible approach.
Have you considered using a lexer/parser combo?
This one is javascript based, and comes with a spiffy demonstration.
