Match words that consist of specific characters, excluding between special brackets - javascript

I'm trying to match words that consist only of characters in this character class: [A-z'\\/%], excluding cases where:
they are between < and >
they are between [ and ]
they are between { and }
So, say I've got this funny string:
[beginning]<start>How's {the} /weather (\\today%?)[end]
I need to match the following strings:
[ "How's", "/weather", "\\today%" ]
I've tried using this pattern:
/[A-z'/\\%]*(?![^{]*})(?![^\[]*\])(?![^<]*>)/gm
But for some reason, it matches:
[ "[beginning]", "", "How's", "", "", "", "/weather", "", "", "\\today%", "", "", "[end]", "" ]
I'm not sure why my pattern allows stuff between [ and ], since I used (?![^\[]*\]), and a similar approach seems to work for not matching {these cases} and <these cases>. I'm also not sure why it matches all the empty strings.
Any wisdom? :)

There are essentially two problems with your pattern:
Never use A-z in a character class if you intend to match only letters, because it will match more than just letters [1]. Instead, use a-zA-Z (or A-Za-z).
Using the * quantifier after the character class will allow empty matches. Use the + quantifier instead.
So, the fixed pattern should be:
[A-Za-z'/\\%]+(?![^{]*})(?![^\[]*\])(?![^<]*>)
[1] The [A-z] character class means "match any character with an ASCII code between 65 and 122". The problem is that codes 91 through 96 are not letters; they are [, \, ], ^, _, and `, which is why the original pattern matches characters like '[' and ']'.
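To make the footnote concrete, here is a minimal sketch that lists the non-letter characters covered by A-z and then runs the corrected pattern against the sample string from the question. The variable names (extras, fixedPattern, sample, words) are illustrative and do not come from the original posts.

```javascript
// Characters that [A-z] accepts (char codes 65..122) but [A-Za-z] does not.
const extras = [];
for (let code = 65; code <= 122; code++) {
  const ch = String.fromCharCode(code);
  if (/[A-z]/.test(ch) && !/[A-Za-z]/.test(ch)) extras.push(ch);
}
console.log(extras.join(" ")); // [ \ ] ^ _ `

// The corrected pattern applied to the sample string from the question.
// In this string literal, "\\" stands for a single backslash.
const fixedPattern = /[A-Za-z'\/\\%]+(?![^{]*})(?![^\[]*\])(?![^<]*>)/g;
const sample = "[beginning]<start>How's {the} /weather (\\today%?)[end]";
const words = sample.match(fixedPattern);
console.log(words); // three matches: How's, /weather and \today%
```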

Split it with regular expression:
let data = "[beginning]<start>How's {the} /weather (\\today%?)[end]";
let matches = data.split(/\s*(?:<[^>]+>|\[[^\]]+\]|\{[^\}]+\}|[()])\s*/);
console.log(matches.filter(v => "" !== v));

You can match all the cases that you don't want using an alternation and place the character class in a capturing group to capture what you want to keep.
The [^ is a negated character class that matches any character except what is specified.
(?:\[[^\][]*]|<[^<>]*>|{[^{}]*})|([A-Za-z'/\\%]+)
Explanation
(?: Non capture group
\[[^\][]*] Match from opening till closing []
| Or
<[^<>]*> Match from opening till closing <>
| Or
{[^{}]*} Match from opening till closing {}
) Close non capture group
| Or
([A-Za-z'/\\%]+) Repeat the character class 1+ times to prevent empty matches and capture in group 1
const regex = /(?:\[[^\][]*]|<[^<>]*>|{[^{}]*})|([A-Za-z'/\\%]+)/g;
const str = `[beginning]<start>How's {the} /weather (\\\\today%?)[end]`;
let m;
while ((m = regex.exec(str)) !== null) {
  if (m[1] !== undefined) console.log(m[1]);
}

Related

Break down a string with regex

I have some example strings I need to process
string1 = "_Wondrous item, common (requires attunement by a wizard or cleric)_"
string2 = "_Weapon (glaive), rare (requires attunement)_"
string3 = "_Wondrous item, common_"
I want to break them down into the following
group1 = {
type: "Wondrous item";
rarity: "common";
attune: True
class: "wizard or cleric"
}
group2 = {
type: "Weapon (glaive)";
rarity: "rare";
attune : True
}
group3 = {
type: "Wondrous item"
rarity: "common"
attune: False
}
The regex that I currently have is messy and probably inefficient, and it only breaks down the first one.
regex = /_(?<type>[^:]*),\s(?<rarity>[^:]*)\s\((?<attune>[^:]+)by a(?<class>[^:]*)\)_/U
Added details
This will be used when processing text documents one by one
The string will occur once in each document
I am using this in Obsidian.MD with Templater, if anyone is curious
And yes, this is to process D&D magic items captured from Reddit
To get all groups for the 3 lines using your pattern:
_(?<type>[^:]*?),\s+(?<rarity>[^:]*?)(?:\s+\((?<attune>[^:]+?)\s*(?:by\s+a\s+(?<class>[^:]*?))?\))?_
_(?<type>[^:]*?) Match _, then group type matches any char except : (non-greedy)
,\s+ Match , and 1+ whitespace chars
(?<rarity>[^:]*?) Group rarity matches any char except : (non-greedy)
(?: Non capture group
\s+\( Match 1+ whitespace chars and (
(?<attune>[^:]+?)\s* Group attune matches 1+ chars except : (non-greedy), followed by optional whitespace chars
(?:by\s+a\s+(?<class>[^:]*?))? Optionally match by a and group class, which matches any char except : (non-greedy)
\) Match )
)?_ Make the outer group optional and match _
Using the groups property if supported, you can check for the values and update the object accordingly.
const regex = /_(?<type>[^:]*?),\s+(?<rarity>[^:]*?)(?:\s+\((?<attune>[^:]+?)\s*(?:by\s+a\s+(?<class>[^:]*?))?\))?_/;
[
"_Wondrous item, common (requires attunement by a wizard or cleric)_",
"_Weapon (glaive), rare (requires attunement)_",
"_Wondrous item, common_"
].forEach(s => {
const m = s.match(regex);
if (m) {
if (m.groups.class === undefined) {
delete m.groups.class;
}
m.groups.attune = m.groups.attune === undefined ? false : true;
console.log(m.groups)
}
});
Note that your pattern excludes : in the negated character classes, but there is no : in the example data.
For the first negated character class, you can exclude the comma instead, and for the others exclude the parentheses to get the same result.
That way, not all quantifiers have to be non-greedy, which can prevent some unnecessary backtracking.
_(?<type>[^,]*),\s(?<rarity>[^:()]*)(?:\s\((?<attune>[^()]+?)\s*(?:by a\s+(?<class>[^()]*))?\))?_
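As a rough sketch of how the second pattern could be used to build the objects the question asks for, the snippet below wraps it in a helper. The names itemPattern and parseItemLine are hypothetical, and the handling of attune and class mirrors the earlier snippet rather than anything prescribed by the question.

```javascript
// The second pattern from above, with negated classes that exclude , and ().
const itemPattern = /_(?<type>[^,]*),\s(?<rarity>[^:()]*)(?:\s\((?<attune>[^()]+?)\s*(?:by a\s+(?<class>[^()]*))?\))?_/;

// parseItemLine is a hypothetical helper name, not part of the original answer.
function parseItemLine(line) {
  const m = line.match(itemPattern);
  if (!m) return null; // the line does not have the expected shape
  const { type, rarity, attune, class: cls } = m.groups;
  // attune becomes a boolean: true when the parenthesized part was present.
  const item = { type, rarity, attune: attune !== undefined };
  // Only add class when "by a ..." was present.
  if (cls !== undefined) item.class = cls;
  return item;
}

[
  "_Wondrous item, common (requires attunement by a wizard or cleric)_",
  "_Weapon (glaive), rare (requires attunement)_",
  "_Wondrous item, common_"
].forEach(s => console.log(parseItemLine(s)));
```

Returning null for a non-matching line keeps the caller in charge of deciding what to do with documents that lack the expected line.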

Regex uppercase separation but not separating more than 1 next to each other

I have array of values which I have to separate by their uppercase. But there are some cases where the value of the array has 2, 3 or 4 serial uppercases that I must not separate. Here are some values:
ERISACheckL
ERISA404cCheckL
F401kC
DisclosureG
SafeHarborE
To be clear result must be:
ERISA Check L
ERISA 404c Check L
F 401k C
Disclosure G
Safe Harbor E
I tried using:
value.match(/[A-Z].*[A-Z]/g).join(" ")
But of course it is not working for serial letters.
One option is to match 1 or more uppercase characters while asserting that what is directly to the right is not a lowercase character, or to match the position where what is on the left is a char a-z or a digit and what is on the right is an uppercase char.
Then use split, with a capture group in the pattern so that the matched uppercase run is kept in the result.
([A-Z]+(?![a-z]))|(?<=[\da-z])(?=[A-Z])
( Capture group 1 (To be kept using split)
[A-Z]+(?![a-z]) Match 1+ uppercase chars asserting what is directly to the right is not a-z
) Close group 1
| Or
(?<=[\da-z])(?=[A-Z]) Get the position where what is directly to the left is either a-z or a digit and what is directly to the right is A-Z
const pattern = /([A-Z]+(?![a-z]))|(?<=[\da-z])(?=[A-Z])/;
[
"ERISACheckL",
"ERISA404cCheckL",
"F401kC",
"DisclosureG",
"SafeHarborE"
].forEach(s => console.log(s.split(pattern).filter(Boolean).join(" ")))
Another option is to use an alternation | matching the different parts:
[A-Z]+(?![a-z])|[A-Z][a-z]*|\d+[a-z]+
[A-Z]+(?![a-z]) Match 1+ uppercase chars asserting what is directly to the right is not a-z
| Or
[A-Z][a-z]* Match A-Z optionally followed by a-z to also match single uppercase chars
| Or
\d+[a-z]+ match 1+ digits and 1+ chars a-z
const pattern = /[A-Z]+(?![a-z])|[A-Z][a-z]*|\d+[a-z]+/g;
[
"ERISACheckL",
"ERISA404cCheckL",
"F401kC",
"DisclosureG",
"SafeHarborE"
].forEach(s => console.log(s.match(pattern).join(" ")))
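As a quick sanity check, the alternation pattern can be compared against the expected results listed in the question. This is only a sketch: partsPattern, separate and expected are illustrative names, and falling back to the unchanged input when nothing matches is an assumption, since the question does not say what should happen in that case.

```javascript
const partsPattern = /[A-Z]+(?![a-z])|[A-Z][a-z]*|\d+[a-z]+/g;

// separate is a hypothetical helper; match() returns null when nothing matches,
// so the input is returned unchanged in that case (an assumption).
function separate(value) {
  const parts = value.match(partsPattern);
  return parts ? parts.join(" ") : value;
}

// Expected results copied from the question.
const expected = {
  ERISACheckL: "ERISA Check L",
  ERISA404cCheckL: "ERISA 404c Check L",
  F401kC: "F 401k C",
  DisclosureG: "Disclosure G",
  SafeHarborE: "Safe Harbor E"
};

for (const [input, want] of Object.entries(expected)) {
  const got = separate(input);
  console.log(input, "=>", got, got === want ? "(ok)" : "(mismatch)");
}
```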
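function formatString(str) {
  // Pad each capitalized word or digits+letters chunk with spaces,
  // then collapse runs of whitespace into a single space.
  return str.replace(/([A-Z][a-z]+|\d+[a-z]+)/g, ' $1 ').replace(/\s+/g, ' ').trim();
}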
// test
[
'ERISACheckL',
'ERISA404cCheckL',
'F401kC',
'DisclosureG',
'SafeHarborE'
].forEach(item => {
console.log(formatString(item));
});

Setting the end of the match

I have the following string:
[TITLE|prefix=a] [STORENAME|prefix=b|suffix=c] [DYNAMIC|limit=10|random=0|reverse=0]
And I would like to get the value of the prefix of TITLE, which is a.
I have tried it with (?<=TITLE|)(?<=prefix=).*?(?=]|\|) and that seems to work but that gives me also the prefix of STORENAME (b). So if [TITLE|prefix=a] will be missing in the string, I'll have the wrong value.
So I need to set the end of the match with ] that belongs to [TITLE. Please notice that this string is dynamic. So it could be [TITLE|suffix=x|prefix=y] as well.
const regex = "[TITLE|prefix=a] [STORENAME|prefix=b|suffix=c] [DYNAMIC|limit=10|random=0|reverse=0]".match(/(?<=TITLE|)(?<=prefix=).*?(?=]|\|)/);
console.log(regex);
You can use
(?<=TITLE(?:\|suffix=[^\]|]+)?\|prefix=)[^\]|]+
Details:
(?<=TITLE(?:\|suffix=[^\]|]+)?\|prefix=) - a location in the string immediately preceded by TITLE|prefix= or TITLE|suffix=...|prefix=
[^\]|]+ - one or more chars other than ] and |.
See JavaScript demo:
const texts = ['[TITLE|prefix=a] [STORENAME|prefix=b|suffix=c] [DYNAMIC|limit=10|random=0|reverse=0]', '[TITLE|suffix=s|prefix=a]'];
for (let s of texts) {
console.log(s, '=>', s.match(/(?<=TITLE(?:\|suffix=[^\]|]+)?\|prefix=)[^\]|]+/)[0]);
}
You could also use a capturing group
\[TITLE\|(?:\w+=\w+\|)*prefix=(\w+)[^\]]*]
Explanation
\[TITLE\| Match [TITLE|
(?:\w+=\w+\|)* Repeat 0+ occurrences wordchars = wordchars and |
prefix= Match literally
(\w+) Capture group 1, match 1+ word chars
[^\]]* Match any char except ]
] Match the closing ]
const regex = /\[TITLE\|(?:\w+=\w+\|)*prefix=(\w+)[^\]]*\]/g;
const str = `[TITLE|prefix=a] [STORENAME|prefix=b|suffix=c] [DYNAMIC|limit=10|random=0|reverse=0]
[TITLE|suffix=x|prefix=y]`;
let m;
while ((m = regex.exec(str)) !== null) {
console.log(m[1]);
}
Or with a negated character class instead of \w
\[TITLE\|(?:[^|=\]]*=[^|=\]]*\|)*prefix=([^|=\]]*)[^\]]*]
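A small sketch of the negated character class variant in use, including a value that \w would not accept and a string where [TITLE has no prefix at all. The names titlePrefixPattern and getTitlePrefix are hypothetical, and the extra test strings are made up for illustration.

```javascript
const titlePrefixPattern = /\[TITLE\|(?:[^|=\]]*=[^|=\]]*\|)*prefix=([^|=\]]*)[^\]]*\]/;

// getTitlePrefix is a hypothetical helper: returns the prefix value of the
// [TITLE...] block, or null when there is no such block or it has no prefix.
function getTitlePrefix(text) {
  const m = text.match(titlePrefixPattern);
  return m ? m[1] : null;
}

console.log(getTitlePrefix("[TITLE|prefix=a] [STORENAME|prefix=b|suffix=c]")); // a
console.log(getTitlePrefix("[TITLE|suffix=x-1|prefix=y z|limit=3]"));          // y z
console.log(getTitlePrefix("[TITLE|suffix=x] [STORENAME|prefix=b]"));          // null
console.log(getTitlePrefix("[STORENAME|prefix=b|suffix=c]"));                  // null
```

Because every character class excludes ], the match cannot run past the closing bracket of [TITLE...], so the prefix of STORENAME is never picked up by mistake.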

Using Regex, how to check if second to last character is odd

I'm trying to wrap my head around Regex, but having some troubles with the basics.
I want to check whether the last character in a string is either a "0" or a "5", but I also want to check whether the second-to-last character (if it exists) is odd.
If it matters, I'm trying to do this in Javascript for some form validation. I have the following Regex to satisfy my first condition of checking the last character and making sure its a "0" or a "5"
/([0|5]$)/g
But how do I properly add a 2nd condition to see if the 2nd to last character exists and is odd? Something like the following...?
/([0|5]$)([1|3|5|7|9]$-1)/g
If someone doesn't mind helping me out here and also explain to me what each part of their regex is doing, I'd be very grateful.
I'd go with /(?<=[13579]{1})[05]$|^[05]$/.
This uses two alternatives. One checks for an odd digit in the second-to-last position when there are at least two characters in the string, and the other handles a single-character string.
Breaking this down:
(?<=[13579]{1}) - does a positive lookbehind on exactly one odd digit
[05] - match a 0 or a 5 directly following the lookbehind
$ - the end of the string (without it, a 0 or 5 preceded by an odd digit would match anywhere in the string)
| - denotes an OR
^ - denotes the start of the string
[05] - match a 0 or a 5
$ - the end of the string
This can be seen in the following:
var re = /(?<=[13579]{1})[05]$|^[05]$/;
console.log(re.test('12345')); // 12345 should return `false`
console.log(re.test('12335')); // 12335 should return `true`
console.log(re.test('1')); // 1 should return `false`
console.log(re.test('5')); // 5 should return `true`
You're thinking about it the wrong way.
Try this:
/([13579])([05])$/g
If you want to check whether the last character in a string is either a "0" or a "5" and also whether the second-to-last character (if it exists) is odd, I think you do not need the capturing groups.
You could use an alternation and character classes for your requirements.
(?:\D[05]|[13579][05]|^[05])$
That would match:
(?: Non capturing group
\D[05] Match not a digit and 0 or 5
| Or
[13579][05] Match an odd digit and 0 or 5
| Or
^[05] Match from the beginning of the string 0 or 5
) Close non capturing group
$ Assert the end of the string
const strings = [
"00",
"11",
"text1",
"text10",
"text00",
"text5",
"10",
"05",
"15",
"99",
"12345",
"12335",
"0000",
"0010",
"5",
"1",
"0",
];
let pattern = /(?:[13579][05]|\D[05]|^[05])$/;
strings.forEach((s) => {
console.log(s + " ==> " + pattern.test(s));
});
/(^|[13579])[05]$/
Explained:
[05]$ means "0 or 5 followed by end of string"
(^|[13579]) means "beginning of string OR 1 or 3 or 5 or 7 or 9"
Tested in console:
re.test('aaa0') - false
re.test('aa15') - true
re.test('aa20') - false
re.test('0') - true
Is this what you were after?
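The console checks above can be reproduced with a short script. This is just a sketch; lastDigitPattern, samples and results are illustrative names, and the extra inputs beyond the four listed above were added for illustration.

```javascript
const lastDigitPattern = /(^|[13579])[05]$/;

const samples = ["aaa0", "aa15", "aa20", "0", "5", "10", "25", "1"];
// Pair each input with the result of testing it against the pattern.
const results = samples.map(s => [s, lastDigitPattern.test(s)]);

for (const [s, ok] of results) {
  console.log(s, "=>", ok);
}
// aaa0 => false, aa15 => true, aa20 => false, 0 => true,
// 5 => true, 10 => true, 25 => false, 1 => false
```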
As you said
I want to check to see if a the last character in a string is either a "0" or a "5", but I also want to check to is if the second to last character (if it exists) is odd
Try this :
var rgx = /^([1-9]+[13579][05]|[1-9][05])$/;
function test(str) {
for (var i = 0; i < str.length; i++) {
var res = str[i].match(rgx);
if (res) {
console.log("match");
} else {
console.log("not match");
}
}
}
var arr = ["12335", "12350", "45", "10", "12337", "11", "01", "820"];
test(arr);
You would want to do:
/(^|[1|3|5|7|9])([0|5])$/
https://regex101.com/r/nMX7L2/4
1st Capturing Group (^|[1|3|5|7|9])
1st Alternative ^
^ asserts position at start of the string
2nd Alternative [1|3|5|7|9]
Match a single character present in the list below [1|3|5|7|9]
1|3|5|7|9 matches a single character in the list 1|3|5|7|9 (case sensitive)
2nd Capturing Group ([0|5])
Match a single character present in the list below [0|5]
0|5 matches a single character in the list 0|5 (case sensitive)
$ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)
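One thing the breakdown above implies but does not spell out: inside a character class, | is a literal character rather than an alternation operator, so [0|5] also accepts a pipe. The sketch below compares the pattern as written with a version that drops the pipes; the variable names are illustrative.

```javascript
const withPipes = /(^|[1|3|5|7|9])([0|5])$/;
const withoutPipes = /(^|[13579])([05])$/;

// Both accept an odd digit followed by 0 or 5.
console.log(withPipes.test("15"), withoutPipes.test("15")); // true true

// Only the version with pipes inside the classes accepts a literal "|".
console.log(withPipes.test("|5"), withoutPipes.test("|5")); // true false
console.log(withPipes.test("1|"), withoutPipes.test("1|")); // true false
```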

Trying to filter a string for multiple values with REGEX

I need to match multiple groups from multiple lines according to a structured source string.
The string is formatted with one name per line, but with some other values, in this order:
May have a number before the name starting each line;
May have some junk separators between the number and the name;
The name may have any character, including symbols as parentheses, apostrophes, etc;
May have a code between parentheses with 3 or 4 letters after the name (don't worry about the name itself containing 3 or 4 letters between parentheses; this will not happen);
May have an asterisk at the end of the line, before the line break.
I need to retrieve these 4 groups for each line. This is what I'm trying:
/^(\d+)?(?:[ \t]?[x:.=]?)[ \t]?(.+?)(?=[ \t]?(\(\w{3,4}\))?[ \t]?(\*))$/igm
To catch the number:
^(\d+)?
To clean the possible separators:
(?:[ \t]?[x:.=]?)
Filtering the space between each group:
[ \t]?
The name (and the rest):
(.+?(?=[ \t]?(\(\w{3,4}\))?[ \t]?(\*)?))
The problem is, obviously, with this last one. It's catching all together (groups 2, 3 and 4). As you can see, I'm trying the two last optional groups as positive lookaheads to separate them from the name.
What am I doing wrong or how would be the better way to achieve the result?
EDIT
String sample:
2 John Smith
3 Messala Oliveira (NMN) *
Mary Pop *
Joshua Junior (pMHH)
What I need:
[ "2", "John Smith", "", "" ],
[ "3", "Messala Oliveira", "(NMN)", "*" ],
[ "", "Mary Pop", "", "*" ],
[ "", "Joshua Junior", "(pMHH)", "" ],
You need to wrap the capturing groups that can be present or absent with optional non-capturing groups:
/^(?:(\d+)[ \t]*)?(.*?)(?:[ \t](\(\w{3,4}\)))?(?:[ \t](\*))?$/igm
Details:
^ - start of a line (start of string when the m flag is not used)
(?:(\d+)[ \t]*)? - an optional non-capturing group matching
(\d+) - (Group 1) 1+ digits
[ \t]* - 0+ spaces or tabs (if \s is used, 0+ whitespaces)
(.*?) - Group 2 capturing any 0+ chars other than line break characters, as few as possible
(?:[ \t](\(\w{3,4}\)))? - an optional group matching
[ \t] - a space or tab
(\(\w{3,4}\)) - Group 3 capturing a (, 3 or 4 word chars, )
(?:[ \t](\*))? - another optional group matching a space or tab and capturing into Group 4 a * symbol.
$ - end of a line (end of string when the m flag is not used).
If you test the strings separately, the [ \t] can be replaced with a simpler \s:
var regex = /^(?:(\d+)\s*)?(.*?)(?:\s(\(\w{3,4}\)))?(?:\s(\*))?$/i;
var strs = ['2 John Smith', '3 Messala Oliveira (NMN) *', 'Mary Pop *', 'Joshua Junior (pMHH)'];
for (var i = 0; i < strs.length; i++) {
  var m = regex.exec(strs[i]);
  if (m !== null) {
    // Optional groups that did not participate are undefined; use "" instead.
    var res = [m[1] || "", m[2], m[3] || "", m[4] || ""];
    console.log(res);
  }
}
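For the multiline form of the pattern (with the g and m flags), the whole sample can be processed in one pass. The following is a sketch assuming an environment that supports String.prototype.matchAll; linePattern, text and rows are illustrative names, and the sample is joined without a trailing line break.

```javascript
const linePattern = /^(?:(\d+)[ \t]*)?(.*?)(?:[ \t](\(\w{3,4}\)))?(?:[ \t](\*))?$/gim;

// The sample from the question, one entry per line.
const text = [
  "2 John Smith",
  "3 Messala Oliveira (NMN) *",
  "Mary Pop *",
  "Joshua Junior (pMHH)"
].join("\n");

// One row per matched line: [number, name, code, asterisk].
// Groups that did not participate are undefined, so map them to "".
const rows = Array.from(text.matchAll(linePattern), m =>
  m.slice(1).map(g => (g === undefined ? "" : g))
);
console.log(rows);
```

With the sample above, each row lines up with the four-element arrays listed under "What I need" in the question.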
