javascript regexp for word boundary detection with parenthesis - javascript

I have a string "I am a robot, I have been named 456/m(4). Forget the name (it does not mean anything)"
Now I would like to extract all words from this string.
For this I use the regular expression:
/\b[\w\S]+\b/g
It returns all the words in the string, except that it gives the word "456/m(4" instead of "456/m(4)". I understand that this happens because there is a word boundary between "4" and ")", but is there a way I could say that it is not a legal word boundary since there was no "legal" starting parenthesis?

I made it even better now. It does exactly what you want.
\b(?>\([\w\/]+\)|[\w\/])+
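If you want a version that's JavaScript-friendly (JavaScript regexes have no atomic groups, so this one emulates the atomic group with a lookahead and a backreference):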
((?:(?=(\([\w\/]+\)|[\w\/]))\2)+)
Just use capture group #1 here.
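As a rough check of the JavaScript-friendly pattern, the sketch below runs it over the sample string from the question and collects capture group 1 of every match. The variable names are made up for illustration.

```javascript
// Sample string from the question.
const text = "I am a robot, I have been named 456/m(4). Forget the name (it does not mean anything)";

// The lookahead captures one token (a parenthesised group or a single
// word/slash character) into group 2, and \2 then consumes exactly that token.
const wordPattern = /((?:(?=(\([\w\/]+\)|[\w\/]))\2)+)/g;

// Collect capture group 1 of every match.
const words = Array.from(text.matchAll(wordPattern), (m) => m[1]);

console.log(words);
```

With this input the closing parenthesis stays attached to "456/m(4)", while the unbalanced parentheses around "(it ... anything)" are left out of the words.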

Related

Unable to find a string matching a regex pattern

While trying to submit a form a javascript regex validation always proves to be false for a string.
Regex:- ^(([a-zA-Z]:)|(\\\\{2}\\w+)\\$?)(\\\\(\\w[\\w].*))+(.jpeg|.JPEG|.jpg|.JPG)$
I have tried following strings against it
abc.jpg,
abc:.jpg,
a:.jpg,
a:asdas.jpg,
What string could possibly match this regex?
This regex won't match against anything because of that $? in the middle of the string.
Apparently, applying the optional quantifier ? to the end-of-string anchor $ is not valid (if you paste it into https://regex101.com/ it does report an error). If the JavaScript parser ignores the error and keeps the regex as it is, you are still trying to match the end of the string in the middle of a string that is supposed to continue.
Unescaped, it was supposed to match \$ (a literal dollar sign), but as written it won't work.
If you want your string to be accepted at any cost, you can probably use Firebug or a similar developer tool and edit the string inside the JavaScript code (this assumes there is no server-side check too, and that it isn't wrong as well). If you ignore the $?, then a matching string is \\\\w\\\\ww.jpg (but since the . is unescaped, even \\\\w\\\\ww%jpg is a match).
Of course, I wrote this answer assuming the escaping is indeed the one you showed in the question. If you need to find a matching string for the correctly escaped one, ^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))+(\.jpeg|\.JPEG|\.jpg|\.JPG)$, then you can use this tool to find one: http://fent.github.io/randexp.js/ (though it will find weird matches). A matching string is c:\zz.jpg
If you are just looking for a regular expression to match what you got there, go ahead and test this out:
(\w+:?\w*\.(?:jpe?g|JPE?G),?)
That should match exactly what you are looking for. Remove the optional comma at the end if you feel like it, of course.
If you remove escape level, the actual regex is
^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))+(.jpeg|.JPEG|.jpg|.JPG)$
After the ^ start anchor comes the first group, (([a-zA-Z]:)|(\\{2}\w+)\$?), which matches either a letter followed by a colon, or two backslashes followed by one or more word characters and an optional literal $. Some of the inner parentheses are unnecessary.
The second part, (\\(\w[\w].*))+, matches a backslash followed by two word characters; \w[\w] looks odd because it is equivalent to \w\w (the character class around the second \w is not needed). That is followed by any number of any characters, and the whole group repeats one or more times.
In the last part, (.jpeg|.JPEG|.jpg|.JPG), the author probably forgot to escape the dot to match it literally; \. should be used. This part can be reduced to \.(JPE?G|jpe?g).
It would match something like:
A:\12anything.JPEG
\\1$\anything.jpg
Play with it at regex101. A more readable version could be
^([a-zA-Z]:|\\{2}\w+\$?)(\\\w{2}.*)+\.(jpe?g|JPE?G)$
Also read the explanation on regex101 to understand any pattern, it's helpful!
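To see how the more readable pattern behaves, here is a small sketch that tests it against the two example paths above, the c:\zz.jpg string from the earlier answer, and two of the strings from the question. The variable names are illustrative only.

```javascript
// The simplified pattern from the answer above.
const pathPattern = /^([a-zA-Z]:|\\{2}\w+\$?)(\\\w{2}.*)+\.(jpe?g|JPE?G)$/;

// Backslashes are doubled because these are JavaScript string literals.
const candidates = [
  "A:\\12anything.JPEG",  // A:\12anything.JPEG
  "\\\\1$\\anything.jpg", // \\1$\anything.jpg
  "c:\\zz.jpg",           // c:\zz.jpg
  "abc.jpg",              // no drive letter or \\ prefix
  "a:asdas.jpg",          // no backslash after the drive letter
];

const results = candidates.map((s) => [s, pathPattern.test(s)]);

for (const [s, ok] of results) {
  console.log(s, ok ? "matches" : "does not match");
}
```

The first three strings match; the last two do not, because the pattern requires at least one backslash-separated path segment after the drive or share prefix.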

Javascript Regex: how to simulate "match without capture" behavior of positive lookbehind?

I have a relatively simple regex problem - I need to match specific words in a string, if they are entire words or a prefix. With word boundaries, it would look something like this:
\b(word1|word2|prefix1|prefix2)
However, I can't use the word boundary condition because some words may start with odd characters, e.g. .999
My solution was to look for whitespace or starting token for these odd cases.
(\b|^|\s)(word1|word2|prefix1|prefix2)
Now words like .999 will still get matched correctly, BUT it also captures the whitespace preceding the matched words/prefixes. For my purposes, I can't have it capture the whitespace.
Positive lookbehinds seem to solve this, but javascript doesn't support them. Is there some other way I can get the same behavior to solve this problem?
You can use a non-capturing group using (?:):
/(?:\b|^|\s)(word1|word2|prefix1|prefix2)/
UPDATE:
Based on what you want to replace it with (and @AlanMoore's good point about the \b), you probably want to go with this:
var regex = /(^|\s)(word1|word2|prefix1|prefix2)/g;
myString.replace(regex,"$1<span>$2</span>");
Note that I changed the first group back to a capturing one, since it will be part of the match but you want to keep it in the replacement string (right?). I also added the g modifier so that this happens for all occurrences in the string (assuming that's what you wanted).
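A minimal runnable sketch of this replacement, assuming a made-up word list ("alpha" and ".999" as whole words, "pre" as a prefix) in place of word1, word2 and so on:

```javascript
// Hypothetical word list standing in for word1|word2|prefix1|prefix2.
const wordPattern = /(^|\s)(alpha|\.999|pre)/g;

function wrapWords(text) {
  // $1 puts back the consumed whitespace (or nothing at the start of the string),
  // $2 is the matched word or prefix.
  return text.replace(wordPattern, "$1<span>$2</span>");
}

const wrapped = wrapWords("alpha xyz.999 .999 prefix unalpha");
console.log(wrapped);
```

Here ".999" inside "xyz.999" and "alpha" inside "unalpha" are left alone, because neither is preceded by whitespace or the start of the string.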
Let's get the terminology straight first. A regex normally consumes everything it matches. When you do a replace(), everything that was consumed is overwritten. You can also capture parts of the matched text separately and plug them back in using $1, $2, etc.
When you were using the word boundary you didn't have to worry about this, because \b doesn't consume anything. But now you're consuming the leading whitespace character if there is one, so you have to plug it back in. I don't know what you're replacing the match with, so I'll just replace them with nothing for this demonstration.
result = subject.replace(/(^|\s)(word1|word2|prefix1|prefix2)/g, "$1");
Note that the \b isn't needed any more. In fact, you must remove it, or it will match things like .999 in xyz.999, because \b matches between z and .. I'm pretty sure you don't want that.
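The effect of leaving \b in the pattern can be checked with a short sketch. Only the ".999" alternative is kept here to keep the example small; the variable names are illustrative.

```javascript
// Pattern that still contains \b, and the corrected one without it.
const withBoundary = /(\b|^|\s)(\.999)/;
const withoutBoundary = /(^|\s)(\.999)/;

// \b matches between "z" and ".", so the embedded .999 is found.
const embeddedWithBoundary = withBoundary.test("xyz.999");

// Without \b the embedded .999 is ignored, but a standalone one still matches.
const embeddedWithoutBoundary = withoutBoundary.test("xyz.999");
const standalone = withoutBoundary.test("abc .999");

console.log(embeddedWithBoundary, embeddedWithoutBoundary, standalone);
```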

Regex format from PHP to Javascript

Can you please help me. How can I add this regex (?<=^|\s):d(?=$|\s) in javascript RegExp?
e.g
regex = new RegExp("?????" , 'g');
I want to replace the emoticon :d, but only if it is surrounded by spaces (or at an end of the string).
Firstly, as Some1.Kill.The.DJ mentioned, I recommend you use the literal syntax to create the regular expression:
var pattern = /yourPatternHere/g;
It's shorter, easier to read and you avoid complications with escape sequences.
The reason why the pattern does not work is that JavaScript does not support lookbehinds ((?<=...). So you have to find a workaround for that. You won't get around including that character in your pattern:
var pattern = /(?:^|\s):d(?!\S)/g;
Since there is no use in capturing anything in your pattern anyway (because :d is fixed) you are probably only interested in the position of the match. That means, when you find a match, you will have to check whether the first character is a space character (or is not :). If that is the case you have to increment the position by 1. If you know that your input string can never start with a space, you can simply increment any found position if it is not 0.
Note that I simplified your lookahead a bit. That is actually the beauty of lookarounds that you do not have to distinguish between end-of-string and a certain character type. Just use the negative lookahead, and assure that there is no non-space character ahead.
Just for future reference that means you could have simplified your initial pattern to:
(?<!\S):d(?!\S)
(If you were using a regex engine that supports lookbehinds.)
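The position adjustment described above could look roughly like this. It is only a sketch: the function name and the sample input are invented for illustration.

```javascript
// Returns the index of every ":d" that is surrounded by whitespace
// or by the start/end of the string.
function findEmoticonPositions(text) {
  const pattern = /(?:^|\s):d(?!\S)/g;
  const positions = [];
  let match;
  while ((match = pattern.exec(text)) !== null) {
    // When the match begins with a whitespace character, the emoticon
    // itself starts one position later.
    const offset = match[0].startsWith(":") ? 0 : 1;
    positions.push(match.index + offset);
  }
  return positions;
}

const positions = findEmoticonPositions(":d hi :d x:d :dd :d");
console.log(positions);
```

In the sample input, "x:d" and ":dd" are skipped because the emoticon is not delimited by whitespace on both sides.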
EDIT:
After your comment on the other answer, it's actually a lot easier to use the workaround. Just write back the captured space-character:
string = string.replace(/(^|\s):d(?!\S)/g, "$1emoticonCode");
Where $1 refers to what was matched with (^|\s). I.e. if the match was at the beginning of the string, $1 will be empty, and if there was a space before :d, then $1 will contain that space character.
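Wrapped in a function, the same idea might look like the sketch below. The function name and the "[grin]" replacement are placeholders; a replacement callback is used instead of a "$1..." string so that a replacement containing "$" is inserted literally.

```javascript
// Replaces every whitespace-delimited ":d" while keeping the whitespace
// that the pattern had to consume in front of it.
function replaceEmoticon(text, emoticonCode) {
  return text.replace(/(^|\s):d(?!\S)/g, (whole, leading) => leading + emoticonCode);
}

const replacedText = replaceEmoticon(":d hi :d x:d", "[grin]");
console.log(replacedText);
```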
JavaScript doesn't support lookbehind, i.e. (?<=...).
It does support lookahead.
Better use
/(?:^|\s)(:d)(?=$|\s)/g
Group 1 captures the required match.

Specific regex positive look(around|ahead|behind) in Javascript

I'm looking to match /(?=\W)(gimme)(?=\W)/gi or alike. The \W are supposed to be zero-width characters to surround my actual match.
Maybe some background. I want to replace certain words (always \w+) with some literal padding added, but only if they are not surrounded by a \w. (That does sound like a negative lookaround, but I hear JS doesn't do those!?)
(Btw: the above "gimme" is the word literal I want to replace. If that wasn't obvious.)
It has to be (?) a lookaround, because the \W have to be zero-width, because the intention is a .replace(...) and I cannot replace/copy the surrounding characters.
So this won't work:
text.replace(/(?=\W)(gimme)(?=\W)/gi, function(l, match, r) {
return l + doMagic(match) + r;
})
The zero-width chars have to be ignored, so the function can return (and replace) only doMagic(match).
I have only very limited lookaround experience and none of it in JS. Grazie.
PS. Or maybe I need a lookbehind and those aren't supported in JS..? I'm confused?
PS. A little bit of context: http://jsfiddle.net/rudiedirkx/kMs2N/show/ (ooh a link!)
You can use the word boundary shortcut \b to assert that it's the whole word that you are matching.
The easiest way to achieve what you want to do is probably to match:
/(\s+gimme)(?=\W)/gi
and replace with [yourReplacement] - i.e. capture the whitespaces before 'gimme' and then include one in the replacement.
Another way to approach this would be capturing more characters before and after the gimme literal and then using the groups with backreference:
(\W+?)gimme(\W+?) - your match - note that this time the before and after characters are in the capturing groups 1 and 2
And you'd want to use \1[yourReplacement]\2 as the replacement string. I'm not sure how you use backreferences in JS, but the idea is to tell the engine that \1 means whatever was matched by the first capturing parenthesis. In some languages these are accessed with $1.
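In JavaScript, String.prototype.replace refers to captured groups as $1, $2 in a replacement string, and passes them as arguments when the replacement is a function. The sketch below shows both forms; doMagic here is only a stand-in for the asker's function, and the sample inputs are invented.

```javascript
// Hypothetical stand-in for the asker's doMagic().
function doMagic(word) {
  return "[" + word.toUpperCase() + "]";
}

// Replacement string: $1 and $2 put the surrounding characters back.
const viaString = "x gimme y".replace(/(\W)gimme(\W)/gi, "$1please$2");

// Replacement function: group 1 captures the character before the word
// (or the start of the string); the lookahead checks what follows without
// consuming it, so only the leading character has to be restored.
function replaceGimme(text) {
  return text.replace(/(^|\W)(gimme)(?=\W|$)/gi, (whole, before, word) => before + doMagic(word));
}

const viaFunction = replaceGimme("gimme that, Gimme! gimmes");
console.log(viaString);
console.log(viaFunction);
```

Using a lookahead for the trailing side means two adjacent occurrences separated by a single space are both replaced, which would not be the case if the trailing character were consumed.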
What you currently have will not work, for the following reason, (?=\W) means "the next character is not a word character", and the next thing you try to match is a literal g, so you have a contradiction ("next character is a g, but isn't a letter").
You do in fact need a lookbehind, but they are not supported by JavaScript.
Check out this article on Mimicking Lookbehind in JavaScript for a possible approach.
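The contradiction in the original pattern can be checked directly. As a sketch, the snippet below also runs the \b variant mentioned in another answer for comparison; the sample strings are made up.

```javascript
// (?=\W) demands that the next character is a non-word character,
// yet the very next thing to match is the letter "g".
const originalPattern = /(?=\W)(gimme)(?=\W)/gi;
const originalMatches = " gimme ".match(originalPattern);

// Word boundaries are zero-width and work on both sides of the word.
const boundaryMatches = " gimme gimmes ".match(/\bgimme\b/gi);

console.log(originalMatches);
console.log(boundaryMatches);
```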
Have you considered using a lexer/parser combo?
This one is javascript based, and comes with a spiffy demonstration.

Match altered version of first match with only one expression?

I'm writing a brush for Alex Gorbatchev's Syntax Highlighter to get highlighting for Smalltalk code. Now, consider the following Smalltalk code:
aCollection do: [ :each | each shout ]
I want to find the block argument ":each" and then match "each" every time it occurs afterwards (for simplicity, let's say every occurrence and not just inside the brackets).
Note that the argument can have any name, e.g. ":myArg".
My attempt to match ":each":
\:([\d\w]+)
This seems to work. The problem is for me to match the occurrences of "each". I thought something like this could work:
\:([\d\w]+)|\1
But the right hand side of the alternation seems to be treated as an independent expression, so backreferencing doesn't work.
Is it even possible to accomplish what I want in a single expression? Or would I have to use the backreference within a second expression (via another function call)?
You could do it in languages that support variable-length lookbehind (AFAIK only the .NET framework languages do, Perl 6 might). There you could highlight a word if it matches (?<=:(\w+)\b.*)\1. But JavaScript doesn't support lookbehind at all.
But anyway this regex would be very inefficient (I just checked a simple example in RegexBuddy, and the regex engine needs over 60 steps for nearly every character in the document to decide between match and non-match), so this is not a good idea if you want to use it for code highlighting.
I'd recommend you use the two-step approach you mentioned: First match :(\w+)\b (word boundary inserted for safety, \d is implied in \w), then do a literal search for match result \1.
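The two-step approach could be sketched as follows. This is an illustration only, not SyntaxHighlighter brush code: the function name and return shape are assumptions, and only the first block argument in the source is handled.

```javascript
function findArgumentUses(source) {
  // Step 1: find the block argument declaration, e.g. ":each".
  const declaration = /:(\w+)\b/.exec(source);
  if (declaration === null) {
    return null;
  }
  const name = declaration[1];

  // Step 2: search for the captured name as a whole word. The name consists
  // only of \w characters, so it can be embedded in a pattern without escaping.
  const usePattern = new RegExp("\\b" + name + "\\b", "g");
  // Start searching right after the declaration.
  usePattern.lastIndex = declaration.index + declaration[0].length;

  const positions = [];
  let match;
  while ((match = usePattern.exec(source)) !== null) {
    positions.push(match.index);
  }
  return { name, positions };
}

const found = findArgumentUses("aCollection do: [ :each | each shout ]");
console.log(found);
```

The colon in "do:" is not mistaken for a declaration because it is followed by a space rather than a word character.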
I believe the only thing stored by the Regex engine between matches is the position of the last match. Therefore, when looking for the next match, you cannot use a backreference to the match before.
So, no, I do not think that this is possible.
