smallest match in regular expression - javascript

I have this kind of expression:
var string = "[a][1] [b][2] [c][3] [d .-][] [e][4]";
I would like to match the fourth element, [d .-][], which may contain any characters (letters, numbers, punctuation, etc.) within the first pair of brackets, while the second pair of brackets remains empty. Other elements, for example [a][1], may contain any characters but they do have a number inside the second pair of brackets.
I tried this:
string.match(/\\[[^]+]\\[ ]/);
but it is too greedy.
Any help would be appreciated.

I would like to match the fourth element [d .-][] which may contain any character (letters, numbers, punctuation, etc) within the first pair of brackets but the second pair of brackets remains empty
string.match(/\[[^\]]*\]\[\]/)
should do it.
To break it down,
\[ matches a literal left square bracket,
[^\]]* matches any number of characters other than a right square bracket,
\] matches a literal right square bracket, and
\[\] matches the two character sequence [], square brackets with nothing in between.
To answer your question about greediness though, you can make the greedy match [^]+ non-greedy by adding a question-mark: [^]+?. You should know though that [^] does not work in IE. To match any UTF-16 code-unit I tend to use [\s\S] which is a bit more verbose but works on all browsers.
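As a quick check, here is a minimal sketch that runs both approaches on the sample string from the question. The variable names are illustrative and not part of the original post. Note that the lazy version still begins at the leftmost opening bracket, so on its own it returns more than the fourth element.

```javascript
// Sample string from the question, quoted so that it is a valid JS string.
const sample = "[a][1] [b][2] [c][3] [d .-][] [e][4]";

// Negated character class: the first pair of brackets cannot span a "]".
const strict = sample.match(/\[[^\]]*\]\[\]/);
console.log(strict[0]); // "[d .-][]"

// Lazy quantifier: stops at the first "][]", but the match still
// starts at the leftmost "[" in the string.
const lazy = sample.match(/\[[\s\S]+?\]\[\]/);
console.log(lazy[0]); // "[a][1] [b][2] [c][3] [d .-][]"
```

The negated class is what actually restricts the match to a single element; laziness only decides where an already started match ends.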

Related

Negative lookahead ends match before the last character I need

I am looking to identify parts of a string that are hex.
So if you consider the string
CHICKENORBEEFPIE, the match would be BEEF.
To do this I came up with this expression /[A-F0-9]{2,}(?![^A-F0-9])/g
This works perfectly - except it only matches BEE, not BEEF. Unless BEEF happened to be at the end of the string.

The negative lookahead (?![^A-F0-9]) means: do not match if the next character is anything other than A-F or 0-9. In other words, the match must be followed either by a hex character or by the end of the string. Your regex is matching 'BEE' because it is followed by F, which satisfies the condition.
If you want to identify sequences of two or more characters that are hex code, just eliminate the negative lookahead altogether.
/[A-F0-9]{2,}/g translates to: find all matches of a pattern consisting of two or more characters from A-F or 0-9.

It is because the last part of your regex: (?![^A-F0-9])
Because of that, you are matching any strings that aren't followed by a non-hex character... which ultimately means to find strings where the next character is a hex character.
You could either remove the ^ or remove that whole piece altogether as it isn't necessary. The following will retrieve what you are looking for: /[A-F0-9]{2,}/g

[A-F0-9]{2,}(?![A-F0-9]) will match what is expected; however, the negative lookahead is superfluous because quantifiers are greedy by default.
[A-F0-9]{2,}(?![^A-F0-9]) doesn't work because it asserts that the following character must not be any character other than A-F or 0-9 (a double negation).
The last character F in BEEF is not matched because, after BEEF has been matched, the negative lookahead fails: P is in [^A-F0-9]. The engine therefore backtracks to BEE, which succeeds because F is not in [^A-F0-9].
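The backtracking described here can be observed directly. The following is a small sketch using the string from the question; the second input, "ROASTBEEF", is a made-up example to show the end-of-string case.

```javascript
const input = "CHICKENORBEEFPIE";

// Original pattern: the lookahead rejects a following non-hex character,
// so the engine backtracks from "BEEF" (followed by "P") to "BEE" (followed by "F").
const original = input.match(/[A-F0-9]{2,}(?![^A-F0-9])/g);
console.log(original); // ["BEE"]

// Without the lookahead, the greedy quantifier takes the whole run.
const noLookahead = input.match(/[A-F0-9]{2,}/g);
console.log(noLookahead); // ["BEEF"]

// With the inner ^ removed, the lookahead is redundant but harmless.
const fixedLookahead = input.match(/[A-F0-9]{2,}(?![A-F0-9])/g);
console.log(fixedLookahead); // ["BEEF"]

// At the end of the string there is no next character, so the original
// lookahead succeeds and the full run is returned.
const atEnd = "ROASTBEEF".match(/[A-F0-9]{2,}(?![^A-F0-9])/g);
console.log(atEnd); // ["BEEF"]
```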

If you need the given result with pair-based values you can use /([A-F0-9]{2})+/g, if not (if it doesn't matter whether it's odd or not) you can use /[A-F0-9]{2,}/g instead.
Hope it helps.

Use
/[A-F0-9]{2,}(?![^A-F0-9])*/g

Ambiguity in regex in javascript

var a = 'a\na'
console.log(a.match(/.*/g)) // ['a', '', 'a', '']
Why are there two empty strings in the result?
Let's say if there are empty strings, why isn't there one at beginning and at the end of each line as well, hence 4 empty strings?
I am not looking for how to select 'a's but just want to understand the presence of the empty strings.

The best explanation I can offer for the following:
'ab\na'.match(/.*/g)
["ab", "", "a", ""]
Is that JavaScript's match function does not run the dot in DOTALL mode, meaning that the dot does not match newlines. When the .* pattern is applied to ab\na, it first matches ab, then stops at the newline. At that position, just before the newline, .* can still match zero characters, which produces an empty match. Then a is matched, and finally .* matches zero characters once more at the end of the string, which gives the second empty match.
If you just want to extract the non-empty content of each line, then you may try the following:
console.log('ab\na'.match(/.+/g))
// ["ab", "a"]
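To see where each match comes from, it can help to print the index of every match. This is a minimal sketch that assumes an environment with String.prototype.matchAll (for example a recent Node.js or browser); the variable names are illustrative.

```javascript
const text = 'ab\na';

// Record [index, matchedText] for every match of .* in the string.
const attempts = [...text.matchAll(/.*/g)].map(m => [m.index, m[0]]);
console.log(attempts);
// [[0, "ab"], [2, ""], [3, "a"], [4, ""]]
// index 2 is the position just before the newline,
// index 4 is the end of the string.

// With + instead of *, zero-length matches are not possible.
const nonEmpty = text.match(/.+/g);
console.log(nonEmpty); // ["ab", "a"]
```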

Let's say if there are empty strings, why isn't there one at beginning
and at the end...
.* is greedy: it swallows a complete line at once. By a line I mean everything before a line break. When it reaches the end of a line, it matches again, because the star quantifier also allows a zero-length match.
If you want 4, you may add ? to the star quantifier to make it lazy (.*?), but this regex gives different results in different flavors because of the way they handle zero-length matches.
You can try .*? with both PCRE and JS engines in regex101 and see the differences.
Question:
Why does the engine try to find a match at the end of a line when the whole line has already been matched?
Answer:
Because ends of lines and ends of strings are positions in their own right, so not everything has been matched yet. There is a remaining position that can still be matched, and the star quantifier allows a match there.
Here that remaining position is the end of a line, which is also what $ matches when the m flag is on. A . doesn't match at this position, but .* or .*? do, because they can match a zero-length position, as can any pattern that accepts zero characters, such as \d*, \D*, a* or b?
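A short sketch of this behaviour in a JavaScript engine follows. The inputs are made up for illustration, and, as noted above, other flavors may return different results for the lazy case.

```javascript
// Lazy star: every attempt succeeds immediately with zero characters,
// once per position 0, 1, 2 and 3 (the end of the string).
const lazy = 'a\na'.match(/.*?/g);
console.log(lazy); // ["", "", "", ""]

// Any pattern that accepts zero characters behaves the same way:
// there are no digits in "ab", yet \d* matches at positions 0, 1 and 2.
const digits = 'ab'.match(/\d*/g);
console.log(digits); // ["", "", ""]

// With + there is no zero-length option, so nothing matches at all.
const plus = 'ab'.match(/\d+/g);
console.log(plus); // null
```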

The star operator * means there can be any number of occurrences (even 0 occurrences). With the expression used, an empty string can be a match. I'm not sure what you are looking for, but maybe the + operator (1 or more occurrences) would be better?
To add some more info: regex quantifiers are greedy by default (in some languages you can override this behaviour), so they consume as much of the text as they can. In this case, the regex consumes the a, because it matches the pattern, so "\na" remains. "\n" does not match ".", so the only available option is the empty string. Then the next line is processed, and again an "a" is matched. After this, only the empty string matches the regex.

* Matches the preceding expression 0 or more times.
. matches any single character except the newline character.
That is what the official doc says about . and *. So I guess the array you received is something like this:
[the "any character" run of the first line, followed by "nothing", the "any character" run of the second line, followed by "nothing"]
And the newline character is simply skipped.

Examples and explanation for javascript regular expression (x), decimal point, and word boundary

Can someone give a better explanation for these special characters examples in here? Or provide some clearer examples?
(x)
The '(foo)' and '(bar)' in the pattern /(foo) (bar) \1 \2/ match and
remember the first two words in the string "foo bar foo bar". The \1
and \2 in the pattern match the string's last two words.
decimal point
For example, /.n/ matches 'an' and 'on' in "nay, an apple is on the
tree", but not 'nay'.
Word boundary \b
/\w\b\w/ will never match anything, because a word character can never
be followed by both a non-word and a word character.
non word boundary \B
/\B../ matches 'oo' in "noonday", and /y\B./ matches 'ye' in
"possibly yesterday."
totally having no idea what the above example is showing :(
Much thanks!

Parentheses (aka capture groups)
Parentheses are used to indicate a group of symbols in the regular expression that, when matched, are 'remembered' in the match result. Each matched group is numbered in order and can be referenced as \1, \2, and so on. In the example /(foo) (bar) \1 \2/ we remember the match foo as \1, and the match bar as \2. This means that the string "foo bar foo bar" matches the regular expression because the third and fourth terms (the \1 and \2) are matching the first and second capture groups (i.e. (foo) and (bar)). You can use capture groups in JavaScript like this:
/id:(\d+)/.exec("the item has id:57") // => ["id:57", "57"]
Note that in the return we get the whole match, and the subsequent groups that were captured.
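The backreference example from the question can be checked in the same way. This is a small sketch; the "swapped" input is a made-up counterexample.

```javascript
const pattern = /(foo) (bar) \1 \2/;

// \1 must repeat what group 1 captured ("foo"), \2 what group 2 captured ("bar").
const same = pattern.test("foo bar foo bar");
console.log(same); // true

// Same words in a different order: \1 expects "foo" but finds "bar".
const swapped = pattern.test("foo bar bar foo");
console.log(swapped); // false

// exec returns the whole match first, then each captured group.
const result = pattern.exec("foo bar foo bar");
console.log(result[0], "|", result[1], "|", result[2]);
// foo bar foo bar | foo | bar
```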
Decimal point (aka wildcard)
A decimal point is used to represent a single character that can have any value. This means that the regular expression /.n/ will match any two-character string where the second character is an 'n'. So /.n/.test("on") // => true, /.n/.test("an") // => true but /.n/.test("or") // => false. DrC brings up a good point in the comments that this won't match a newline character; that only matters when the input contains line breaks. Note that the multiline flag (m) does not change this; it is the dotAll flag (s), where supported, that makes the dot match newlines too.
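Here is a brief sketch of the wildcard on the sentence from the question, plus the newline case. It assumes an engine that supports the s (dotAll) flag, which was added in ES2018.

```javascript
const sentence = "nay, an apple is on the tree";

// Every "any character followed by n". The leading "n" of "nay" has
// no character before it, so it cannot be part of a match.
const hits = sentence.match(/.n/g);
console.log(hits); // ["an", "on"]

// By default the dot does not match a newline...
const plainDot = /a.b/.test("a\nb");
console.log(plainDot); // false

// ...but with the s (dotAll) flag it does.
const dotAll = /a.b/s.test("a\nb");
console.log(dotAll); // true
```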
Word boundaries
A word boundary matches the position between a word character and a non-word character, i.e. the point directly before or directly after a word (the start and end of the string count as non-word here). In JavaScript the word characters are the alphanumerics and the underscore (see MDN); non-word is everything else. The trick for word boundaries is that they are zero-width assertions, which means they don't consume a character. That's why /\w\b\w/ will never match: you can never have a word boundary between two word characters.
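A few illustrative checks of \b follow. The input strings are made up for this sketch and do not come from the original post.

```javascript
// Two adjacent word characters never have a boundary between them.
const never = /\w\b\w/.test("any words you like, 123_abc");
console.log(never); // false

// \b marks the edges of each run of word characters; the underscore
// counts as a word character, the hyphen and the space do not.
const words = "well-made plan_b".match(/\b\w+\b/g);
console.log(words); // ["well", "made", "plan_b"]

// A common use: match a whole word only.
const whole = /\bcat\b/.test("a cat sat");
const inside = /\bcat\b/.test("concatenate");
console.log(whole, inside); // true false
```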
Non-word boundaries
The opposite of a word boundary, instead of matching a point that goes from non-word to word, or word to non-word (i.e. the ends of a word) it will match points where it's moving between the same types of character. So for our examples /\B../ will match the first point in the string that is between two characters of the same type and the next two characters, in this case it's between the first 'n' and 'o', and the next two characters are "oo". In the second example /y\B./ we are looking for the character 'y' followed by a character of matching type (so a word character), and the '.' will match that second character. So "possibly yesterday" won't match on the 'y' at the end of "possibly" because the next character is a space, which is a non word, but it will match the 'y' at the beginning of "yesterday", because it's followed by a word character, which is then included in the match by the '.' in the regular expression.
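Both \B examples from the question can be run as written. The sketch below also prints the index of each match to show where the non-boundary position lies.

```javascript
// The position before the first "n" is a word boundary (start of string),
// so the first non-boundary is between "n" and "o", at index 1.
const first = "noonday".match(/\B../);
console.log(first[0], first.index); // "oo" 1

// The "y" that ends "possibly" is followed by a space (a boundary),
// so the match is the "y" of "yesterday" plus the next character.
const second = "possibly yesterday.".match(/y\B./);
console.log(second[0], second.index); // "ye" 9
```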
Overall, regular expressions are popular in many languages and rest on a sound theoretical basis, so there's a lot of material on these characters. In general, JavaScript regular expressions are very similar to Perl-style (PCRE) regular expressions (but not exactly the same!), so the majority of your questions about JavaScript regular expressions would be answered by any PCRE regex tutorial (of which there are many).
Hope that helps!

Regex not working as expected

What's wrong with this regular expression?
/^[a-zA-Z\d\s&#-\('"]{1,7}$/;
when I enter the following valid input, it fails:
a&'-#"2
Also check for 2 consecutive spaces within the input.

The dash needs to be either escaped (\-) or placed at the end of the character class, or it will signify a range (as in A-Z), not a literal dash:
/^[A-Z\d\s&#('"-]{1,7}$/i
would be a better regex.
N. B: [#-\(] would have matched #, $, %, &, ' or (.
To address the added requirement of not allowing two consecutive spaces, use a lookahead assertion:
/^(?!.*\s{2})[A-Z\d\s&#('"-]{1,7}$/i
(?!.*\s{2}) means "Assert that it's impossible to match (from the current position) any string followed by two whitespace characters". One caveat: The dot doesn't match newline characters.
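The following sketch checks the original and corrected patterns against the input from the question. The inputs containing spaces are made-up examples, and the variable names are illustrative.

```javascript
// The input from the question: a & ' - # " 2 (seven characters).
const input = 'a&\'-#"2';

// Original: "#-\(" is a range from "#" to "(", which does not contain "-".
const original = /^[a-zA-Z\d\s&#-\('"]{1,7}$/.test(input);
console.log(original); // false

// Corrected: the hyphen at the end of the class is a literal hyphen.
const fixed = /^[A-Z\d\s&#('"-]{1,7}$/i.test(input);
console.log(fixed); // true

// What the accidental range actually matches.
const range = ['#', '$', '%', '&', "'", '(', '-'].map(ch => /[#-\(]/.test(ch));
console.log(range); // [true, true, true, true, true, true, false]

// The lookahead version rejects two consecutive whitespace characters.
const noDouble = /^(?!.*\s{2})[A-Z\d\s&#('"-]{1,7}$/i;
const singleSpaces = noDouble.test("a b c");
const doubleSpace = noDouble.test("a  b");
console.log(singleSpaces, doubleSpace); // true false
```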

The - (hyphen) has a special meaning inside a character class, used for specifying ranges. Did you mean to escape it?:
/^[a-zA-Z\d\s&#\-\('"]{1,7}$/;
This RegExp matches your input.

You have an unescaped - in the middle of your character class. This means that you're actually searching for all characters between and including # and ( (which are #, $, %, &, ', and (). Either move it to the end or escape it with a backslash. Your regex should read:
/^[a-zA-Z\d\s&#\('"-]{1,7}$/
or
/^[a-zA-Z\d\s&#\-\('"]{1,7}$/

remove the ; at the end and
^[a-zA-Z\d\s\&\#\-\(\'\"]+$

Your input does not match the regular expression. The problem here is the hyphen in your regexp. If you move it from its position after the '#' character to the start of the character class, like so:
/^[-a-zA-Z\d\s&#\('"]{1,7}$/;
everything is fine and dandy.
You can always use Rubular for checking your regular expressions. I use it on a regular (no pun intended) basis.

How can I remove all content in brackets except entirely numerical content?

I want to take a string and remove all occurrences of characters within square brackets:
[foo], [foo123bar], and [123bar] should be removed
But I want to keep intact any brackets consisting of only numbers:
[1] and [123] should remain
I've tried a couple of things, to no avail:
text = text.replace(/\[^[0-9+]\]/gi, "");
text = text.replace(/\[^[\d]\]/gi, "");

The tool you're looking for is negative lookahead. Here's how you would use it:
text = text.replace(/\[(?!\d+\])[^\[\]]+\]/g, "");
After \[ locates an opening bracket, the lookahead, (?!\d+\]) asserts that the brackets do not contain only digits.
Then, [^\[\]]+ matches anything that's not square brackets, ensuring (for example) that you don't accidentally match "nested" brackets, like [[123]].
Finally, \] matches the closing bracket.
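A minimal sketch of this replacement on made-up input follows. The letters between the brackets are only there to make the result easy to read.

```javascript
const pattern = /\[(?!\d+\])[^\[\]]+\]/g;

// Brackets containing anything other than digits only are removed.
const cleaned = "a[foo]b[foo123bar]c[123bar]d[1]e[123]".replace(pattern, "");
console.log(cleaned); // "abcd[1]e[123]"

// "Nested" brackets around a number are left alone: the outer "[" is not
// followed by bracket-free content, and the inner "[123]" is digits only.
const nested = "[[123]]".replace(pattern, "");
console.log(nested); // "[[123]]"
```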

You probably need this:
text = text.replace(/\[[^\]]*[^0-9\]][^\]]*\]/gi, "");
Explanation: you want to keep those sequences within brackets that contain only numbers. An alternative way to say this is to delete those sequences that are 1) enclosed within brackets, 2) contain no closing bracket and 3) contain at least one non-numeric character. The above regex matches an opening bracket (\[), followed by an arbitrary sequence of characters except the closing bracket ([^\]], note that the closing bracket had to be escaped), then a non-numeric character (also excluding the closing bracket), then an arbitrary sequence of characters except the closing bracket, then the closing bracket.
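For illustration, here is this pattern applied to made-up input. The second case shows one edge to be aware of: the parts before and after the required non-numeric character may themselves contain an opening bracket, so nested brackets are partly consumed.

```javascript
const pattern = /\[[^\]]*[^0-9\]][^\]]*\]/g;

// Brackets with at least one non-digit inside are removed; digit-only ones stay.
const cleaned = "a[foo]b[foo123bar]c[123bar]d[1]e[123]".replace(pattern, "");
console.log(cleaned); // "abcd[1]e[123]"

// With nested brackets, the inner "[" counts as the non-numeric character,
// so "[[123]" is removed and a stray "]" is left behind.
const nested = "[[123]]".replace(pattern, "");
console.log(nested); // "]"
```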

In python:
import re
text = '[foo] [foo123bar] [123bar] [foo123] [1] [123]'
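# Python 3: print is a function; a raw string keeps the backslashes intact.
# [^\]]* instead of .* keeps each match inside a single pair of brackets.
print(re.sub(r'(\[[^\]]*[^0-9\]]\])|(\[[^0-9\]][^\]]*\])', '', text))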
