Regex consume a character if it matches, but not otherwise - javascript

I am trying to write a regular expression which will capture all instances of the '#' character, except when two such characters appear in succession (essentially, an escape sequence). For example:
abd#ajk: # should be matched
abd##ajk: No matches
abd###ajk: The final # should match.
abd####ajk: No matches
This almost works with the negative lookahead expression #(?!#), except that because the second # is not consumed, the last of two # symbols will still be matched. What I think I want to do is to look ahead for a # and consume the character if it is there; otherwise, do not consume it. Is this possible?
Edit: I'm using Javascript which unfortunately rules out several good approaches :(

In JavaScript, to split strings at an unescaped #, you may instead match chunks of text that consist of either ## (an escaped #) or any chars other than #:
var strs = ['abd#ajk','abd##ajk','abd###ajk','abd####ajk'];
var rx = /(?:[^#]|##)+/g;
for (var s of strs) {
console.log(s, "=>", s.match(rx))
}
The regex is
/(?:[^#]|##)+/g
See its demo
Details
(?: - start of a non-capturing group that matches either of the 2 alternatives:
[^#] - any char other than #
| - or
## - 2 #s
)+ - repeat matching 1 or more times.
The g modifier finds all matching occurrences inside the input string.
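If the goal is to locate the unescaped # characters themselves rather than to split around them, one option is to match whole runs of # and look at the run length: a run of odd length ends in exactly one unescaped #. The following is only a minimal sketch under that assumption; findUnescapedHashes is a hypothetical helper name, and String.prototype.matchAll needs an ES2020 engine.

```javascript
// Hypothetical helper: returns the index of every unescaped '#'.
// Assumption: '##' is an escape pair, so only a run of odd length
// contains an unescaped '#', and it is the last character of the run.
function findUnescapedHashes(input) {
  const indexes = [];
  for (const run of input.matchAll(/#+/g)) {
    if (run[0].length % 2 === 1) {
      indexes.push(run.index + run[0].length - 1);
    }
  }
  return indexes;
}

for (const s of ['abd#ajk', 'abd##ajk', 'abd###ajk', 'abd####ajk']) {
  console.log(s, '=>', findUnescapedHashes(s));
}
```

For the four sample strings from the question this reports one index for abd#ajk and abd###ajk and none for the other two.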

Since you didn't tag a programming language to your question here is my 2 cents for Java:
(?<=(?<!#)(?:##){0,999})#(?!#)
Java doesn't support unbounded lookbehinds, only bounded ones, so here I explicitly capped the number of preceding ## pairs at 999.
JavaScript
Lookbehinds are not yet implemented in many browsers' JavaScript engines. If you are trying to do this in JS, then this would be your working solution:
Method 1
((?:[^#]*(?:##)+[^#]*)+)|#
(?:[^#]*(?:##)+[^#]*)+ Match ## occurrences and all its leading / trailing characters
|# Or a single #
JS Code:
str.split(/((?:[^#]*(?:##)+[^#]*)+)|#/).filter(Boolean);
Method 2 (Recommended)
Or, if you don't have a problem with using match(), this is much cleaner and of course faster:
(?:[^#]*(?:##)+[^#]*)+|[^#]+
JS Code:
console.log(
"aaaa#######bbb#aa###cccc##ddddd#".match(/(?:[^#]*(?:##)+[^#]*)+|[^#]+/g)
);
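In engines that do support lookbehind (it was standardized in ES2018), the Java pattern above can probably be carried over directly, and since JavaScript lookbehinds may be unbounded, the {0,999} cap can become *. This is a sketch, not a replacement for the methods above; unescapedHashIndexes is a hypothetical helper name.

```javascript
// Assumes an engine with ES2018 lookbehind support.
// (?<=(?<!#)(?:##)*)  preceded by zero or more complete '##' pairs,
//                     with no further '#' before those pairs
// #(?!#)              a '#' that is not followed by another '#'
const unescapedHash = /(?<=(?<!#)(?:##)*)#(?!#)/g;

// Hypothetical helper: indexes of all unescaped '#' characters.
function unescapedHashIndexes(input) {
  return Array.from(input.matchAll(unescapedHash), (m) => m.index);
}

for (const s of ['abd#ajk', 'abd##ajk', 'abd###ajk', 'abd####ajk']) {
  console.log(s, '=>', unescapedHashIndexes(s));
}
```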

Extracting text from a string after 5 characters and without the last slash

I have a few strings:
some-text-123123#####abcdefg/
some-STRING-413123#####qwer123t/
some-STRING-413123#####456zxcv/
I would like to receive:
abcdefg
qwer123t
456zxcv
I have tried regexp:
/[^#####]*[^\/]/
But this is not working...
To get whatever comes after five #s and before the last /, you can use
/#####(.*)\//
and pick up the first group.
Demo:
const regex = /#####(.*)\//;
console.log('some-text-123123#####abcdefg/'.match(regex)[1]);
console.log('some-STRING-413123#####qwer123t/'.match(regex)[1]);
console.log('some-STRING-413123#####456zxcv/'.match(regex)[1]);
assumptions:
the desired part of the string sample will always:
start after 5 #'s
end before a single /
suggestion: /(?<=#{5})\w*(?=\/)/
So (?<=#{5}) is a lookbehind assertion which will check to see if any matching string has the provided assertion immediately behind it (in this case, 5 #'s).
(?=\/) is a lookahead assertion, which will check ahead of a matching string segment to see if it matches the provided assertion (in this case, a single /).
The actual text the regex will return as a match is \w*, consisting of a character class and a quantifier. The character class \w matches any alphanumeric character or underscore ([A-Za-z0-9_]). The * quantifier matches the preceding item 0 or more times.
successful matches:
'some-text-123123#####abcdefg/'
'some-STRING-413123#####qwer123t/'
'some-STRING-413123#####456zxcv/'
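As a rough check of the lookaround suggestion above, here is a sketch that runs it against the three sample strings. It assumes an engine with lookbehind support (ES2018 or later); extractAfterHashes is a hypothetical helper name.

```javascript
// Lookbehind for five '#', then word characters, then a lookahead for '/'.
const afterFiveHashes = /(?<=#{5})\w*(?=\/)/;

// Hypothetical helper: returns the matched text, or null when nothing matches.
function extractAfterHashes(input) {
  const m = input.match(afterFiveHashes);
  return m ? m[0] : null;
}

console.log(extractAfterHashes('some-text-123123#####abcdefg/'));    // 'abcdefg'
console.log(extractAfterHashes('some-STRING-413123#####qwer123t/')); // 'qwer123t'
console.log(extractAfterHashes('some-STRING-413123#####456zxcv/'));  // '456zxcv'
```

Because the lookarounds do not consume anything, the full match (index 0) is already the wanted text and no capture group is needed.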
I would highly recommend learning Regular Expressions in-depth, as it's a very powerful tool when fully utilised.
MDN, as with most things web-dev, is a fantastic resource for regex. Everything from my answer here can be learned on MDN's Regular expression syntax cheatsheet.
Also, an interactive tool can be very helpful when putting together a complex regular expression. Regex 101 is typically what I use, but there are many similar web-tools online that can be found from a google search.
Your pattern does not work because you are using negated character classes [^
The pattern [^#####]*[^\/] can be written as [^#]*[^\/] and matches optional chars other than # and then a single char other than /
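To see why repeating # inside the class makes no difference, here is a small sketch comparing the original pattern with its simplified form on one of the sample strings (the variable names are only illustrative):

```javascript
// A character class is a set, so listing '#' five times equals listing it once.
const original = /[^#####]*[^\/]/;
const simplified = /[^#]*[^\/]/;
const sample = 'some-text-123123#####abcdefg/';

// Both match everything up to the first '#', plus one char that is not '/'.
console.log(original.exec(sample)[0]);   // 'some-text-123123#'
console.log(simplified.exec(sample)[0]); // 'some-text-123123#'
```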
Here are some examples of other patterns that can give the same match.
At least 5 leading # chars and then matching 1+ word chars in a group and the / at the end of the string using an anchor $, or omit the anchor if that is not the case:
#####(\w+)\/$
Regex demo
If there should be a preceding character other than #
[^#]#####(\w+)\/$
(?<!#)#####(\w+)\/$
Regex demo
Matching at least 5 # chars and no # or / in between using a negated character class in this case:
#####([^#\/]+)\/
Or with lookarounds:
(?<=(?<!#)#####)[^#\/]+(?=\/)
Regex demo

Regex: match underscore-wrapped words unless they start with @ / #

I'm trying to work around this bug in Tiptap (a WYSIWYG editor for Vue) by passing in a custom regex so that the regex that identifies italics notation in Markdown (_value_) would not be applied to strings that start with @ or #, e.g. #some_tag_value would not get transformed into #sometagvalue.
This is my regex so far - /(^|[^@#_\w])(?:\w?)(_([^_]+)_)/g
Edit: new regex with help from @Wiktor Stribiżew /(^|[^@#_\w])(_([^_]+)_)/g
While it satisfies most of the common cases, it currently still fails when
underscores are mid-word, e.g. ant_farm_ should be matched (antfarm)
I have also provided some "should match" and "should not match" cases here https://regexr.com/50ibf for easier testing
Should match (between underscores)
_italic text here_
police_woman_
_fire_fighter
a thousand _words_
_brunch_ on a Sunday
Should not match
#ta_g_
__value__
#some_tag_value
#some_value_here
#some_tag_
#some_val_
#_hello_
You may use the following pattern:
(?:^|\s)[^@#\s_]*(_([^_]+)_)
See the regex demo
Details
(?:^|\s) - start of string or whitespace
[^@#\s_]* - 0 or more chars other than @, #, _ and whitespace
(_([^_]+)_) - Group 1: _, 1+ chars other than _ (captured into Group 2) and then _.
For science, this monstrosity works in Chrome (and Node.js).
let text = `
<strong>Should match</strong> (between underscores)
_italic text here_
police_woman_
_fire_fighter
a thousand _words_
_brunch_ on a Sunday
<strong>Should not match</strong>
#ta_g_
__value__
#some_tag_value
#some_value_here
#some_tag_
#some_val_
#_hello_
`;
let re = /(?<=(?:\s|^)(?![@#])[^_\n]*)_([^_]+)_/g;
document.querySelector('div').innerHTML = text.replace(re, '<em>$1</em>');
div { white-space: pre; }
<div/>
This captures _something_ as full match, and something as 1st capture group (in order to remove the underscores). You can't capture just something, because then you lose the ability to tell what is inside the underscores, and what is outside (try it with (?<=(?:\s|^)(?![@#])[^_\n]*_)([^_]+)(?=_)).
There are two things that prevent it being universally applicable:
Look-behinds are not supported in all JavaScript engines
Most regexp engines do not support variable-length look-behinds
EDIT: This is a bit stronger, and should allow you to additionally match_this_and_that_ but not #match_this_and_that correctly:
/(?<=(?:\s|^)(?![@#])(?!__)\S*)_([^_]+)_/
Explanation:
_([^_]+)_ Match non-underscory bit between two underscores
(?<=...) that is preceded by
(?:\s|^) either a whitespace or a start of a line/string
(i.e. a proper word boundary, since we can't use `\b`)
\S* and then some non-space characters
(?![@#]) that don't start with `@`, `#`,
(?!__) or `__`.
regex101 demo
Here's something, it's not as compact as other answers, but I think it's easier to understand what is going on. Match group \3 is what you want.
Needs the multiline flag
^([a-zA-Z\s]+|_)(([a-zA-Z\s]+)_)+?[a-zA-Z\s]*?$
^ - match the start of the line
([a-zA-Z\s]+|_) - multiple words or _
(([a-zA-Z\s]+)_)+? - multiple words followed by _ at least once, but the minimum match.
[a-zA-Z\s]*? - any final words
$ - the end of the line
In summary, the pattern matches one of the following forms:
_<words>_
<words>_<words>_
<words>_<words>_<words>
_<words>_<words>

regular expression to replace with ','

I have one RegExp, could anyone explain exactly what it does?
Regexp
b=b.replace(/(\d{1,3}(?=(?:\d\d\d)+(?!\d)))/g,"$1 ")
I think it is replacing with space(' ')
If I'm right, I want to replace it with a comma (,) instead of a space (' ').
To explain the regex, let's break it down:
( # Match and capture in group number 1:
\d{1,3} # one to three digits (as many as possible),
(?= # but only if it's possible to match the following afterwards:
(?: # A (non-capturing) group containing
\d\d\d # exactly three digits
)+ # once or more (so, three/six/nine/twelve/... digits)
(?!\d) # but only if there are no further digits ahead.
) # End of (?=...) lookahead assertion
) # End of capturing group
Actually, the outer parentheses are unnecessary if you use $& instead of $1 for the replacement string ($& contains the entire match).
The regex (\d{1,3}(?=(?:\d\d\d)+(?!\d))) matches any 1-3 digits (\d{1,3}) that are followed by a multiple of 3 digits ((?:\d\d\d)+) that isn't followed by another digit ((?!\d)). The match is replaced with "$1 ": $1 is replaced by the first capture group, and the space behind it is... a space.
See regular expressions on MDN for more information about the different syntaxes.
If you want to separate the numbers with a comma instead of a space, you'll need to replace it with "$1," instead.
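Put together, the comma variant might look like the sketch below. The two function names are hypothetical; the second one uses $& with the outer parentheses dropped, as mentioned in the breakdown above.

```javascript
// Same regex as in the question, with ',' instead of ' ' in the replacement.
function groupThousands(b) {
  return b.replace(/(\d{1,3}(?=(?:\d\d\d)+(?!\d)))/g, '$1,');
}

// Equivalent form without the capturing group: $& is the entire match.
function groupThousandsWholeMatch(b) {
  return b.replace(/\d{1,3}(?=(?:\d\d\d)+(?!\d))/g, '$&,');
}

console.log(groupThousands('1234567'));           // '1,234,567'
console.log(groupThousandsWholeMatch('1234567')); // '1,234,567'
console.log(groupThousands('123'));               // '123' (nothing to group)
```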
Don't try to solve everything by using regular expressions.
Regular expressions are meant for matching, not to fix non-text-encoded-as-text formatting.
If you want to format numbers differently, extract them and use format strings to reformat them on a character processing level. Doing the formatting inside the regex replacement is just an ugly hack.
It is okay to use regular expressions to find the numbers in the text, e.g. \d{4,} but trying to do the actual formatting with regexp is a crazy abuse.
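A minimal sketch of that split of responsibilities, assuming the numbers are plain integers and that en-US grouping is what is wanted: the regex only finds the digit runs, and Number.prototype.toLocaleString does the formatting. formatNumbersInText is a hypothetical name.

```javascript
// The regex only locates runs of four or more digits;
// the number formatting itself is left to toLocaleString.
// Caveat: digit runs beyond Number.MAX_SAFE_INTEGER would lose precision.
function formatNumbersInText(text) {
  return text.replace(/\d{4,}/g, (digits) =>
    Number(digits).toLocaleString('en-US')
  );
}

console.log(formatNumbersInText('Total: 1234567 items, 42 boxes'));
// 'Total: 1,234,567 items, 42 boxes'
```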

Regular expression for excluding some characters with multiline matching

I want to ensure that the user input doesn't contain characters like <, > or &#, whether it is text input or textarea. My pattern:
var pattern = /^((?!&#|<|>).)*$/m;
The problem is, that it still matches multiline strings from a textarea like
this text matches
though this should not, because of this character <
EDIT:
To be more clear, I need exclude &# combination only, not & or #.
Please suggest the solution. Very grateful.
You're probably not looking for the m (multiline) switch but for the s (DOTALL) switch. Unfortunately, the s flag is only available in JavaScript engines that implement ES2018 or later.
The good news is that DOTALL can be simulated using [\s\S]. Try the following regex:
/^(?![\s\S]*?(&#|<|>))[\s\S]*$/
OR:
/^((?!&#|<|>)[\s\S])*$/
Live Demo
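As a quick check of the difference, here is a sketch that runs the original pattern and the first [\s\S] variant above against the two-line input from the question (variable names are only illustrative):

```javascript
// Original pattern: with the m flag, ^ and $ match at line boundaries,
// so a clean first line is enough for the whole test to succeed.
const original = /^((?!&#|<|>).)*$/m;

// [\s\S] variant without the m flag: the whole input must be clean.
const fixed = /^(?![\s\S]*?(&#|<|>))[\s\S]*$/;

const input =
  'this text matches\nthough this should not, because of this character <';

console.log(original.test(input)); // true  (the unwanted result)
console.log(fixed.test(input));    // false
console.log(fixed.test('a & b # c\nsecond line')); // true: & and # alone are fine
console.log(fixed.test('bad &# here'));            // false
```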
I don't think you need a lookaround assertion in this case. Simply use a negated character class:
var pattern = /^[^<>&#]*$/m;
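If you're also disallowing the characters -, [ and ], make sure to escape ] and to put - last (or escape it too). In JavaScript an unescaped ] right after [^ closes the class, so [^] on its own would match any character:
var pattern = /^[^\][<>&#-]*$/m;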
Alternate answer to specific question:
anubhava's solution works accurately, but is slow because it must perform a negative lookahead at each and every character position in the string. A simpler approach is to use reverse logic, i.e. instead of verifying that /^((?!&#|<|>)[\s\S])*$/ does match, verify that /[<>]|&#/ does NOT match. To illustrate this, let's create a function, hasSpecial(), which tests if a string has one of the special chars. Here are two versions; the first uses anubhava's second regex:
function hasSpecial_1(text) {
// If regex matches, then string does NOT contain special chars.
return /^((?!&#|<|>)[\s\S])*$/.test(text) ? false : true;
}
function hasSpecial_2(text) {
// If regex matches, then string contains (at least) one special char.
return /[<>]|&#/.test(text) ? true : false;
}
These two functions are functionally equivalent, but the second one is probably quite a bit faster.
Note that when I originally read this question, I misinterpreted it to really want to exclude HTML special chars (including HTML entities). If that were the case, then the following solution will do just that.
Test if a string contains HTML special Chars:
It appears that the OP wants to ensure a string does not contain any special HTML characters including: <, >, as well as decimal and hex HTML entities such as: &#160;, &#xA0;, etc. If this is the case then the solution should probably also exclude the other (named) type of HTML entities such as: &amp;, &lt;, etc. The solution below excludes all three forms of HTML entities as well as the <> tag delimiters.
Here are two approaches: (Note that both approaches do allow the sequence: &# if it is not part of a valid HTML entity.)
FALSE test using positive regex:
function hasHtmlSpecial_1(text) {
/* Commented regex:
# Match string having no special HTML chars.
^ # Anchor to start of string.
[^<>&]* # Zero or more non-[<>&] (normal*).
(?: # Unroll the loop. ((special normal*)*)
& # Allow a & but only if
(?! # not an HTML entity (3 valid types).
(?: # One from 3 types of HTML entities.
[a-z\d]+ # either a named entity,
| \#\d+ # or a decimal entity,
| \#x[a-f\d]+ # or a hex entity.
) # End group of HTML entity types.
; # All entities end with ";".
) # End negative lookahead.
[^<>&]* # More (normal*).
)* # End unroll the loop.
$ # Anchor to end of string.
*/
var re = /^[^<>&]*(?:&(?!(?:[a-z\d]+|#\d+|#x[a-f\d]+);)[^<>&]*)*$/i;
// If regex matches, then string does NOT contain HTML special chars.
return re.test(text) ? false : true;
}
Note that the above regex utilizes Jeffrey Friedl's "Unrolling-the-Loop" efficiency technique and will run very quickly for both matching and non-matching cases. (See his regex masterpiece: Mastering Regular Expressions (3rd Edition))
TRUE test using negative regex:
function hasHtmlSpecial_2(text) {
/* Commented regex:
# Match string having one special HTML char.
[<>] # Either a tag delimiter
| & # or a & if start of
(?: # one of 3 types of HTML entities.
[a-z\d]+ # either a named entity,
| \#\d+ # or a decimal entity,
| \#x[a-f\d]+ # or a hex entity.
) # End group of HTML entity types.
; # All entities end with ";".
*/
var re = /[<>]|&(?:[a-z\d]+|#\d+|#x[a-f\d]+);/i;
// If regex matches, then string contains (at least) one special HTML char.
return re.test(text) ? true : false;
}
Note also that I have included a commented version of each of these (non-trivial) regexes in the form of a JavaScript comment.
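For illustration, a few inputs run through the second (negative) regex. This sketch repeats the regex so that it is self-contained; containsHtmlSpecial is a hypothetical name standing in for hasHtmlSpecial_2.

```javascript
// Same regex as in hasHtmlSpecial_2 above.
const htmlSpecial = /[<>]|&(?:[a-z\d]+|#\d+|#x[a-f\d]+);/i;

// Hypothetical stand-in for hasHtmlSpecial_2.
function containsHtmlSpecial(text) {
  return htmlSpecial.test(text);
}

console.log(containsHtmlSpecial('<b>bold</b>')); // true: tag delimiters
console.log(containsHtmlSpecial('a &amp; b'));   // true: named entity
console.log(containsHtmlSpecial('a&#160;b'));    // true: decimal entity
console.log(containsHtmlSpecial('a&#xA0;b'));    // true: hex entity
console.log(containsHtmlSpecial('AT&T'));        // false: & without a closing ;
console.log(containsHtmlSpecial('a &# b'));      // false: &# is not a valid entity here
```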

Javascript multiple regex pattern

I'm trying to exclude some internal IP addresses and some internal IP address formats from viewing certain logos and links in the site. I have multiple ranges of IP addresses (sample given below). Is it possible to write a regex that could match all the IP addresses in the list below using JavaScript?
10.X.X.X
12.122.X.X
12.211.X.X
64.X.X.X
64.23.X.X
74.23.211.92
and 10 more
Quote the periods, replace the X's with \d+, and join them all together with pipes:
const allowedIPpatterns = [
"10.X.X.X",
"12.122.X.X",
"12.211.X.X",
"64.X.X.X",
"64.23.X.X",
"74.23.211.92" //, etc.
];
const allowedRegexStr = '^(?:' +
allowedIPpatterns.
join('|').
replace(/\./g, '\\.').
replace(/X/g, '\\d+') +
')$';
const allowedRegexp = new RegExp(allowedRegexStr);
Then you're all set:
'10.1.2.3'.match(allowedRegexp) // => ['10.1.2.3']
'100.1.2.3'.match(allowedRegexp) // => null
How it works:
First, we have to turn the individual IP patterns into regular expressions matching their intent. One regular expression for "all IPs of the form '12.122.X.X'" is this:
^12\.122\.\d+\.\d+$
^ means the match has to start at the beginning of the string; otherwise, 112.122.X.X IPs would also match.
12 etc: digits match themselves
\.: an unescaped period in a regex matches any character (other than a line terminator); we want literal periods, so we put a backslash in front.
\d: shorthand for [0-9]; matches any digit.
+: means "1 or more" - 1 or more digits, in this case.
$: similarly to ^, this means the match has to end at the end of the string.
So, we turn the IP patterns into regexes like that. For an individual pattern you could use code like this:
const regexStr = `^` + ipXpattern.
replace(/\./g, '\\.').
replace(/X/g, '\\d+') +
`$`;
Which just replaces all .s with \. and Xs with \d+ and sticks the ^ and $ on the ends.
(Note the doubled backslashes; both string parsing and regex parsing use backslashes, so wherever we want a literal one to make it past the string parser to the regular expression parser, we have to double it.)
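A small sketch of what the doubled backslash changes, comparing a regex built from a string with the equivalent literal (variable names are only illustrative):

```javascript
// With a doubled backslash the string contains \d, same as the literal.
const fromString = new RegExp('^\\d+$');
const fromLiteral = /^\d+$/;
console.log(fromString.source === fromLiteral.source); // true

// With a single backslash the string parser drops it: '\d' is just 'd',
// so this regex matches runs of the letter d, not digits.
const broken = new RegExp('^\d+$');
console.log(broken.test('123')); // false
console.log(broken.test('ddd')); // true
```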
In a regular expression, the alternation this|that matches anything that matches either this or that. So we can check for a match against all the IPs at once if we turn the list into a single regex of the form re1|re2|re3|...|relast.
Then we can do some refactoring to make the regex matcher's job easier; in this case, since all the regexes are going to have ^...$, we can move those constraints out of the individual regexes and put them on the whole thing: ^(10\.\d+\.\d+\.\d+|12\.122\.\d+\.\d+|...)$. The parentheses keep the ^ from being only part of the first pattern and $ from being only part of the last. But since plain parentheses capture as well as group, and we don't need to capture anything, I replaced them with the non-capturing version (?:...).
And in this case we can do the global search-and-replace once on the giant string instead of individually on each pattern. So the result is the code above:
const allowedRegexStr = '^(?:' +
allowedIPpatterns.
join('|').
replace(/\./g, '\\.').
replace(/X/g, '\\d+') +
')$';
That's still just a string; we have to turn it into an actual RegExp object to do the matching:
const allowedRegexp = new RegExp(allowedRegexStr);
As written, this doesn't filter out illegal IPs - for instance, 10.1234.5678.9012 would match the first pattern. If you want to limit the individual byte values to the decimal range 0-255, you can use a more complicated regex than \d+, like this:
(?:\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])
That matches "any one or two digits, or '1' followed by any two digits, or '2' followed by any of '0' through '4' followed by any digit, or '25' followed by any of '0' through '5'". Replacing the \d+ with that turns the full string-munging expression into this:
const allowedRegexStr = '^(?:' +
allowedIPpatterns.
join('|').
replace(/\./g, '\\.').
replace(/X/g, '(?:\\d{1,2}|1\\d{2}|2[0-4]\\d|25[0-5])') +
')$';
And makes the actual regex look much more unwieldy:
^(?:10\.(?:\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])\.(?:\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])\.(?:\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])|12\.122\....
but you don't have to look at it, just match against it. :)
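Wrapped up as a self-contained sketch, the range-limited version could look like this. OCTET and buildIpRegExp are hypothetical names, and the pattern list is just the sample from the question.

```javascript
// One octet in the decimal range 0-255 (same alternation as above).
const OCTET = '(?:\\d{1,2}|1\\d{2}|2[0-4]\\d|25[0-5])';

// Hypothetical helper: turns patterns such as '12.122.X.X' into one RegExp.
function buildIpRegExp(patterns) {
  const body = patterns
    .join('|')
    .replace(/\./g, '\\.') // literal periods
    .replace(/X/g, OCTET); // each X becomes a 0-255 octet
  return new RegExp('^(?:' + body + ')$');
}

const allowed = buildIpRegExp([
  '10.X.X.X', '12.122.X.X', '12.211.X.X', '64.X.X.X', '64.23.X.X', '74.23.211.92',
]);

console.log(allowed.test('10.1.2.3'));          // true
console.log(allowed.test('10.1234.5678.9012')); // false: octets out of range
console.log(allowed.test('100.1.2.3'));         // false: not in the list
```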
You could do it in regex, but it's not going to be pretty, especially since JavaScript doesn't even support verbose regexes, which means that it has to be one humongous line of regex without any comments. Furthermore, regexes are ill-suited for matching ranges of numbers. I suspect that there are better tools for dealing with this.
Well, OK, here goes (for the samples you provided):
var myregexp = /\b(?:74\.23\.211\.92|(?:12\.(?:122|211)|64\.23)\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])|(?:10|64)\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9]))\b/g;
As a verbose ("readable") regex:
\b # start of number
(?: # Either match...
74\.23\.211\.92 # an explicit address
| # or
(?: # an address that starts with
12\.(?:122|211) # 12.122 or 12.211
| # or
64\.23 # 64.23
)
\. # .
(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\. # followed by 0..255 and a dot
(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9]) # followed by 0..255
| # or
(?:10|64) # match 10 or 64
\. # .
(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\. # followed by 0..255 and a dot
(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\. # followed by 0..255 and a dot
(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9]) # followed by 0..255
)
\b # end of number
/^(X|\d{1,3})(\.(X|\d{1,3})){3}$/ should do it.
If you don't actually need to match the "X" character you could use this:
\b(?:\d{1,3}\.){3}\d{1,3}\b
Otherwise I would use the solution cebarrett provided.
I'm not entirely sure of what you're trying to achieve here (doesn't look like anyone else is either).
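However, if it's validation, then here's a solution to validate an IP address that doesn't use RegEx. First, split the input string at the dots and make sure there are four segments. Then, using parseInt on each segment, make sure it is a number between 0 and 255.
function ipValidator(ipAddress) {
  var ipSegments = ipAddress.split('.');
  // An IPv4 address has exactly four segments.
  if (ipSegments.length !== 4) {
    return 'fail';
  }
  for (var i = 0; i < ipSegments.length; i++) {
    // parseInt yields NaN for non-numeric segments such as '' or 'abc'.
    var value = parseInt(ipSegments[i], 10);
    if (isNaN(value) || value < 0 || value > 255) {
      return 'fail';
    }
  }
  return 'match';
}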
Running the following returns 'match':
document.write(ipValidator('10.255.255.125'));
Whereas this will return 'fail':
document.write(ipValidator('10.255.256.125'));
Here's an annotated version in a jsfiddle with some examples, http://jsfiddle.net/VGp2p/2/
