Need to obtain match between specified delimiters - javascript

I'm trying to match specific tags between double block quote delimiters within a sentence:
Look for `foo="x"` ONLY between the specific double block quote delimiters [[foo="x"|bar="y"|baz="z"]]
Using the following regex also matches the foo="x" outside the delimiters:
(?:(foo|bar|baz)="([^"]+)")+
I've tried adding the positive lookbehind (?<=\[\[), but it only returns the first foo="x" within the block quotes and ignores the bar="y" and baz="z" matches.
const regex = /(?:(foo|bar|baz)="([^"]+)")+/gm;
const str = `Look for \`foo="x"\` ONLY between the specific double block quote delimiters [[foo="x"|bar="y"|baz="z"]]`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}

If the strings inside [[ and ]] don't contain [ or ], a simple
/(foo|bar|baz)="([^"]+)"(?=[^\][]*]])/g
will work for you. The (?=[^\][]*]]) part makes sure that 0 or more chars other than [ and ], followed by ]], are immediately to the right of the current location. See the regex demo.
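As a rough sketch of how that lookahead pattern could be used (the helper name extractBracketedPairs is made up for illustration, and it assumes the text between [[ and ]] never contains [ or ]):

```javascript
// Hypothetical helper: returns [name, value] pairs that are followed by "]]"
// with no "[" or "]" in between, i.e. pairs sitting inside a [[...]] block.
function extractBracketedPairs(text) {
  const pattern = /(foo|bar|baz)="([^"]+)"(?=[^\][]*]])/g;
  return Array.from(text.matchAll(pattern), (m) => [m[1], m[2]]);
}

const sample = 'Look for `foo="x"` ONLY between the delimiters [[foo="x"|bar="y"|baz="z"]]';
console.log(extractBracketedPairs(sample));
// logs the three pairs from inside the brackets; the foo="x" in backticks is skipped
```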
The safest solution includes two steps: 1) get the Group 1 value with /\[\[((foo|bar|baz)="([^"]+)"(?:\|(foo|bar|baz)="([^"]+)")*)]]/ (or a simpler but less precise but more generic /\[\[\w+="[^"]+"(?:\|\w+="[^"]+")*]]/g, see demo), and 2) use /(foo|bar|baz)="([^"]+)"/g (or /(\w+)="([^"]+)"/g) to extract the necessary values from Group 1.
const x = '(foo|bar|baz)="([^"]+)"'; // A key-value pattern block
const regex = new RegExp(`\\[\\[(${x}(?:\\|${x})*)]]`, 'g'); // Extracting the whole `[[]]`
const str = `Look for \`foo="x"\` ONLY between the specific double block quote delimiters [[foo="x"|bar="y"|baz="z"]]`;
let m;
while (m = regex.exec(str)) {
let results = [...m[1].matchAll(/(foo|bar|baz)="([^"]+)"/g)]; // Grabbing individual matches
console.log(Array.from(results, m => [m[1],m[2]]));
}
The \[\[((foo|bar|baz)="([^"]+)"(?:\|(foo|bar|baz)="([^"]+)")*)]] pattern will match
\[\[ - [[
((foo|bar|baz)="([^"]+)"(?:\|(foo|bar|baz)="([^"]+)")*) - Group 1:
(foo|bar|baz) - foo, bar or baz
= - =
"([^"]+)" - ", 1 or more chars other than " and a "
(?:\|(foo|bar|baz)="([^"]+)")* - 0 or more repetitions of | and the pattern described above
]] - ]] substring.
See the regex demo.

Try a slightly different definition of your requirements:
match name="value", with capturing groups for both name and value,
before the name there should be:
either double opening bracket ([[),
or a vertical bar (|),
after the value (and closing double quote) there should be:
either double closing bracket (]]),
or a vertical bar (|).
Then the regex can be as follows:
(?:\[\[|\|)(foo|ba[rz])="(\w+)"(?=]]|\|)
Details:
(?:\[\[|\|) - the content before (will be a part of the match,
but not a part of any capturing group),
(foo|ba[rz])="(\w+)" - name / value pair (with double quotes),
(?=]]|\|) - the content after (this time expressed as a
positive lookahead).
For a working example see https://regex101.com/r/dj51GS/1
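A minimal sketch of that pattern in JavaScript, assuming values consist only of word characters (as \w+ requires). The function name is hypothetical. Note that a name="value" pair placed between two vertical bars outside of any brackets would also match, so this relies on | not being used that way elsewhere in the text.

```javascript
// Hypothetical helper: a pair must be preceded by "[[" or "|"
// and followed by "]]" or "|".
function extractDelimitedPairs(text) {
  const pattern = /(?:\[\[|\|)(foo|ba[rz])="(\w+)"(?=]]|\|)/g;
  return Array.from(text.matchAll(pattern), (m) => [m[1], m[2]]);
}

const input = 'Look for `foo="x"` ONLY between the delimiters [[foo="x"|bar="y"|baz="z"]]';
console.log(extractDelimitedPairs(input));
```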

Regex match apostrophe inside, but not around words, inside a character set

I'm counting how many times different words appear in a text using Regular Expressions in JavaScript. My problem is when I have quoted words: 'word' should be counted simply as word (without the quotes, otherwise they'll behave as two different words), while it's should be counted as a whole word.
(?<=\w)(')(?=\w)
This regex can identify apostrophes inside, but not around words. Problem is, I can't use it inside a character set such as [\w]+.
(?<=\w)(')(?=\w)|[\w]+
Will count it's a 'miracle' of nature as 7 words, instead of 5 (it, ', s becoming 3 different words). Also, the third word should be selected simply as miracle, and not as 'miracle'.
To make things even more complicated, I need to capture diacritics too, so I'm using [A-Za-zÀ-ÖØ-öø-ÿ] instead of \w.
How can I accomplish that?
1) You can simply use /[^\s]+/g regex
const str = `it's a 'miracle' of nature`;
const result = str.match(/[^\s]+/g);
console.log(result.length);
console.log(result);
2) If you are calculating total number of words in a string then you can also use split as:
const str = `it's a 'miracle' of nature`;
const result = str.split(/\s+/);
console.log(result.length);
console.log(result);
3) If you want each word without a quote at the start and at the end, you can do this:
const str = `it's a 'miracle' of nature`;
const result = str.match(/[^\s]+/g).map((s) => {
s = s[0] === "'" ? s.slice(1) : s;
s = s[s.length - 1] === "'" ? s.slice(0, -1) : s;
return s;
});
console.log(result.length);
console.log(result);
You might use an alternation with 2 capture groups, and then check for the values of those groups.
(?<!\S)'(\S+)'(?!\S)|(\S+)
(?<!\S)' Negative lookbehind, assert a whitespace boundary to the left and match '
(\S+) Capture group 1, match 1+ non whitespace chars
'(?!\S) Match ' and assert a whitespace boundary to the right
| Or
(\S+) Capture group 2, match 1+ non whitespace chars
See a regex demo.
const regex = /(?<!\S)'(\S+)'(?!\S)|(\S+)/g;
const s = "it's a 'miracle' of nature";
Array.from(s.matchAll(regex), m => {
if (m[1]) console.log(m[1])
if (m[2]) console.log(m[2])
});
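Since the original goal was counting how often each word appears, the two-group pattern could be combined with a Map along these lines. This is only a sketch: the function name countWords and the decision to lowercase words before counting are assumptions, not part of the question. Lookbehind needs a reasonably recent JavaScript engine.

```javascript
// Hypothetical helper: counts words, treating 'word' and word as the same
// token while keeping inner apostrophes such as the one in "it's".
function countWords(text) {
  const pattern = /(?<!\S)'(\S+)'(?!\S)|(\S+)/g;
  const counts = new Map();
  for (const m of text.matchAll(pattern)) {
    // Group 1 is set for quoted words, group 2 for everything else.
    const word = (m[1] ?? m[2]).toLowerCase();
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return counts;
}

console.log(countWords("it's a 'miracle' of nature"));
```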

Get last 2 or 3 elements from path regex

So I currently have a path and I am trying to fetch the last 3 elements:
Test:
/testing/path/here/src/handlebar/sample/colors.txt
/testing/path/here/src/handlebar/testing/another/colors.txt
Regex:
\/([^/]+\/[^/]+\/[^/]+)\.[^.]+$
Result:
handlebar/sample/colors
testing/another/colors
What I want it to return:
sample/colors
testing/another/colors
If there are 2 directories and then the item, it should use all 3; if the result contains the word handlebar, it should only keep the last two.
You could just create a group for everything behind handlebar/ like this:
with a named capturing group (subPath group contains wanted value):
/handlebar\/(?<subPath>\S*)\.\S+$/gm
without naming (first group contains wanted value):
/handlebar\/(\S*)\.\S+$/gm
Explanation: This regex matches text ending with 'handlebar/', followed by any non-whitespace characters (0 to infinite times, captured), a dot, and any non-whitespace characters (1 to infinite times, the file extension). The global and multiline flags are there in case you want to check multiple paths within one string separated by line breaks.
As you tagged the question with the tag javascript, here is some example code, how to retrieve the value of the regex group
function getSubPath(fullPath = '') {
const regex = /handlebar\/(?<subPath>\S*)\.\S+$/gm
const match = regex.exec(fullPath)
if (match) {
return match.groups.subPath
}
return fullPath // regex.exec did not deliver match
}
getSubPath('/testing/path/here/src/handlebar/sample/colors.txt')
// returns 'sample/colors'
getSubPath('/testing/path/here/src/handlebar/testing/another/colors.txt')
// returns 'testing/another/colors'
Without the named group, just read / return match[1] for the first capturing group; index 0 is the full match (which would include 'handlebar/' and the file extension).
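For completeness, the unnamed-group variant might look like the sketch below. The function name getSubPathUnnamed is made up here, and the fallback of returning the input unchanged simply mirrors the named-group version above.

```javascript
// Sketch of the unnamed-group variant: the wanted value is at index 1.
function getSubPathUnnamed(fullPath = '') {
  const match = /handlebar\/(\S*)\.\S+$/.exec(fullPath);
  // match[0] would be the full match, including "handlebar/" and the extension.
  return match ? match[1] : fullPath;
}

console.log(getSubPathUnnamed('/testing/path/here/src/handlebar/sample/colors.txt'));
// logs 'sample/colors'
```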
I hope this helps.
This version is dynamic: you can pass whatever marker text you need as a parameter and get the result.
<script>
var res = "/testing/path/here/src/handlebar/sample/colors.txt";
var res1 = "/testing/path/here/src/handlebar/testing/another/colors.txt";
// Returns the part of val after "<text>/", without the file extension
var Result = (val, text) => {
  var r = val.split(text + '/')[1];
  return r.slice(0, r.lastIndexOf('.'));
};
console.log(Result(res, "handlebar"));
console.log(Result(res1, "handlebar"));
</script>
A javascript solution without regex would look like this:
const getTokenizedPath = path => {
const pathArray = path.split('/');
// last element of array looks like "colors.txt" - split by dot and read the first value, removing the extension
pathArray[pathArray.length-1] = pathArray[pathArray.length-1].split('.')[0];
// Remove all elements before the 'handlebar' token and join the remaining values together by '/'.
return pathArray.slice(pathArray.indexOf('handlebar')+1).join('/');
}
getTokenizedPath('/testing/path/here/src/handlebar/sample/colors.txt');
--- sample/colors
getTokenizedPath('/testing/path/here/src/handlebar/testing/another/colors.txt');
--- testing/another/colors
I guess,
(?!.*handlebar)/([^/]+/[^/]+/[^/]+)\.[^.]+$|/([^/]+/[^/]+)\.[^.]+$
might work OK.
Demo 1
and if lookarounds would be supported,
(?!.*handlebar)(?<=/)[^/]+/[^/]+/[^/]+(?=\.[^.]+$)|$|(?<=/)([^/]+/[^/]+)(?=\.[^.]+$)
Demo 2
would be an option too.
const regex = /(?!.*handlebar)\/([^\/]+\/[^\/]+\/[^\/]+)\.[^.]+$|\/([^\/]+\/[^\/]+)\.[^.]+$/gm;
const str = `/testing/path/here/src/handlebar/sample/colors.txt
/testing/path/here/src/handlebar/testing/another/colors.txt`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.

Regex optimization and best practice

I need to parse information out from a legacy interface. We do not have the ability to update the legacy message. I'm not very proficient at regular expressions, but I managed to write one that does what I want it to do. I just need peer-review and feedback to make sure it's clean.
The message from the legacy system returns values resembling the example below.
%name0=value
%name1=value
%name2=value
Expression: /\%(.*)\=(.*)/g;
var strBody = body_text.toString();
var myRegexp = /\%(.*)\=(.*)/g;
var match = myRegexp.exec(strBody);
var objPair = {};
while (match != null) {
if (match[1]) {
objPair[match[1].toLowerCase()] = match[2];
}
match = myRegexp.exec(strBody);
}
This code works, and I can add partial matches in the middle of the name/values without anything breaking. I have to assume that any combination of characters could appear in the "values" match, meaning it could have equal and percent signs within the message.
Is this clean enough?
Is there something that could break the expression?
First of all, don't escape characters that don't need escaping: %(.*)=(.*)
The problem with your expression: an equals sign in the value would break your parser. Because the first (.*) is greedy, %name0=val=ue would be split into name0=val and ue instead of name0 and val=ue.
One possible fix is to make the first repetition lazy by appending a question mark: %(.*?)=(.*)
But this is not optimal due to unneeded backtracking. You can do better by using a negated character class: %([^=]*)=(.*)
And finally, if empty names should not be allowed, replace the first asterisk with a plus: %([^=]+)=(.*)
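Putting the final pattern to work could look roughly like this. The function name parseLegacyPairs is hypothetical; lowercasing the names follows the code in the question.

```javascript
// Sketch: collect %name=value lines into an object using the
// negated-character-class pattern, so "=" and "%" may appear in values.
function parseLegacyPairs(body) {
  const pattern = /%([^=]+)=(.*)/g;
  const pairs = {};
  for (const [, name, value] of body.matchAll(pattern)) {
    pairs[name.toLowerCase()] = value;
  }
  return pairs;
}

console.log(parseLegacyPairs('%name0=val=ue\n%Name1=50%\n%name2='));
// logs { name0: 'val=ue', name1: '50%', name2: '' }
```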
This is a good resource: Regex Tutorial - Repetition with Star and Plus
Your expression is fine, and wrapping it with two capturing groups is simple to get your desired variables and values.
You likely may not need to escape some chars and it would still work.
You can use this tool and test/edit/modify/change your expressions if you wish:
%(.+)=(.+)
Since your data is pretty structured, you can also do so with string split and get the same desired outputs, if you want.
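One way the split-based approach might look, assuming each pair sits on its own line and starts with % (the function name parsePairsWithSplit is invented for this sketch):

```javascript
// Sketch: parse %name=value lines without a regex for the pair itself.
function parsePairsWithSplit(body) {
  const pairs = {};
  for (const line of body.split(/\r?\n/)) {
    if (!line.startsWith('%')) continue; // not a name/value line
    const eq = line.indexOf('=');
    if (eq < 2) continue; // no "=" at all, or an empty name
    // Everything after the first "=" is the value, even further "=" signs.
    pairs[line.slice(1, eq).toLowerCase()] = line.slice(eq + 1);
  }
  return pairs;
}

console.log(parsePairsWithSplit('%name0=value\n%name1=val=ue'));
```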
JavaScript Test
const regex = /%(.+)=(.+)/gm;
const str = `%name0=value
%name1=value
%name2=value`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
Performance Test
This JavaScript snippet shows the performance of that expression using a simple 1-million times for loop.
const repeat = 1000000;
const start = Date.now();
for (var i = repeat; i >= 0; i--) {
const string = '%name0=value';
const regex = /(%(.+)=(.+))/gm;
var match = string.replace(regex, "\nGroup #1: $1 \n Group #2: $2 \n Group #3: $3 \n");
}
const end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚💚💚 ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");

Parse query parameters with regexp

I need to parse the url /domain.com?filter[a.b.c]=value1&filter[a.b.d]=value2
and get 2 groups: 'a.b.c' and 'a.b.d'.
I tried to parse it with the regexp [\?&]filter\[(.+\..+)+\]= but the result is 'a.b.c]=value1&filter[a.b.d'. How can I make it stop at the first occurrence?
You may use
/[?&]filter\[([^\].]+\.[^\]]+)]=/g
See the regex demo
Details
[?&] - a ? or &
filter\[ - a filter[ substring
([^\].]+\.[^\]]+) - Capturing group 1:
[^\].]+ - 1 or more chars other than ] and .
\. - a dot
[^\]]+ - 1 or more chars other than ]
]= - a ]= substring
JS demo:
var s = '/domain.com?filter[a.b.c]=value1&filter[a.b.d]=value2';
var rx = /[?&]filter\[([^\].]+\.[^\]]+)]=/g;
var m, res=[];
while(m=rx.exec(s)) {
res.push(m[1]);
}
console.log(res);
Note that in case & is never present as part of the query param value, you may add it to the negated character classes, [^\].]+ => [^\]&.]+, to make sure the regex does not overmatch across param values.
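A sketch of that stricter variant, with & added to both negated classes (extending the note to the second class is an assumption made here, and filterKeys is a made-up name):

```javascript
// Sketch: like the pattern above, but a key may not contain "&",
// so a match cannot run across a parameter boundary.
function filterKeys(url) {
  const rx = /[?&]filter\[([^\]&.]+\.[^\]&]+)]=/g;
  return Array.from(url.matchAll(rx), (m) => m[1]);
}

console.log(filterKeys('/domain.com?filter[a.b.c]=value1&filter[a.b.d]=value2'));
// logs [ 'a.b.c', 'a.b.d' ]
```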
Since you need to extract text inside outer square brackets that may contain consecutive [...] substrings with at least 1 dot inside one of them, you may use a simpler regex with a bit more code:
var strs = ['/domain.com?filter[a.b.c]=value1&filter[a.b.d]=value2',
'/domain.com?filter[a.b.c]=value1&filter[a.b.d]=value2&filter[a][b.e]=value3',
'/domain.com?filter[a.b.c]=value1&filter[b][a.b.d][d]=value2&filter[a][b.e]=value3'];
var rx = /[?&]filter((?:\[[^\][]*])+)=/g;
for (var s of strs) {
var m, res=[];
console.log(s);
while(m=rx.exec(s)) {
if (m[1].indexOf('.') > -1) {
res.push(m[1].substring(1,m[1].length-1));
}
}
console.log(res);
console.log("--- NEXT STRING ----");
}
(?<=[?&]filter\[)([^\]]+\.[^\]]+)(?=\]=)
This will give you only the groups you mentioned (a.b.c and a.b.d).
The (?<=[?&]filter\[) part is a lookbehind: it requires [?&]filter[ before what you want without making it part of the match. The (?=\]=) part is a lookahead: it requires ]= after what you want, again without making it part of the match.
[^\]] matches any character that isn't a closing square bracket.
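A lookaround-only version can be tried in JavaScript as sketched below. It uses a positive lookahead (?=\]=) for the closing part, and lookbehind needs a reasonably recent engine. Because the surrounding text is only asserted and never consumed, String.prototype.match with the g flag returns the keys directly.

```javascript
// Sketch: lookbehind for "?filter[" or "&filter[", lookahead for "]=".
const url = '/domain.com?filter[a.b.c]=value1&filter[a.b.d]=value2';
const lookaroundKeys = url.match(/(?<=[?&]filter\[)[^\]]+\.[^\]]+(?=\]=)/g) ?? [];
console.log(lookaroundKeys);
// logs [ 'a.b.c', 'a.b.d' ]
```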

Regex - ignoring text between quotes / HTML(5) attribute filtering

So I have this regular expression, which basically has to filter the given string into an HTML(5)-format list of attributes. It currently isn't doing what I need, but that's about to change! (I hope so)
I'm trying to achieve that whenever an occurrence is found, it selects the text until the next occurrence OR the end of the string, as the second match. So if you'd take a look at the current regular expression:
/([a-zA-Z]+|[a-zA-Z]+-[a-zA-Z0-9]+)=["']/g
A string like this: hey="hey world" hey-heyhhhhh3123="Hello world" data-goed="hey"
Would be filtered / matched out like this:
MATCH 1. [0-3] `hey`
MATCH 2. [16-32] `hey-heyhhhhh3123`
MATCH 3. [47-56] `data-goed`
This has to be seen as the attribute-name(s), and now.. we just have to fetch the attribute's value(s). So the mentioned string has to have an outcome like this:
MATCH 1.
1 [0-3] `hey`
2 [6-14] `hey world`
MATCH 2.
1 [16-32] `hey-heyhhhhh3123`
2 [35-45] `Hello world`
MATCH 3.
1 [47-56] `data-goed`
2 [59-61] `hey`
Could anyone help me get this working? It would be appreciated a lot!
You can use
/([^\s=]+)=(?:"([^"\\]*(?:\\.[^"\\]*)*)"|(\S+))/g
See regex demo
Pattern details:
([^\s=]+) - Group 1 capturing 1 or more characters other than whitespace and = symbol
= - an equal sign
(?:"([^"\\]*(?:\\.[^"\\]*)*)"|(\S+)) - a non-capturing group of 2 alternatives (one more '([^'\\]*(?:\\.[^'\\]*)*)' alternative can be added to account for single quoted string literals)
"([^"\\]*(?:\\.[^"\\]*)*)" - a double quoted string literal pattern:
" - a double quote
([^"\\]*(?:\\.[^"\\]*)*) - Group 2 capturing 0+ characters other than \ and ", followed with 0+ sequences of any escaped symbol followed with 0+ characters other than \ and "
" - a closing double quote
| - or
(\S+) - Group 3 capturing one or more non-whitespace characters
JS demo (no single quoted support):
var re = /([^\s=]+)=(?:"([^"\\]*(?:\\.[^"\\]*)*)"|(\S+))/g;
var str = 'hey="hey world" hey-heyhhhhh3123="Hello \\"world\\"" data-goed="hey" more=here';
var res = [], m;
while ((m = re.exec(str)) !== null) {
if (m[3]) {
res.push([m[1], m[3]]);
} else {
res.push([m[1], m[2]]);
}
}
console.log(res);
JS demo (with single quoted literal support)
var re = /([^\s=]+)=(?:"([^"\\]*(?:\\.[^"\\]*)*)"|'([^'\\]*(?:\\.[^'\\]*)*)'|(\S+))/g;
var str = 'pseudoprefix-before=\'hey1"\' data-hey="hey\'hey" more=data and="more \\"here\\""';
var res = [], m;
while ((m = re.exec(str)) !== null) {
if (m[2]) {
res.push([m[1], m[2]])
} else if (m[3]) {
res.push([m[1], m[3]])
} else if (m[4]) {
res.push([m[1], m[4]])
}
}
console.log(res);
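One edge case worth noting: in the demo above, an empty quoted value such as a="" yields an empty string in its group, which is falsy, so none of the if branches push it. A possible variation (a sketch only, with the invented name parseAttributes) picks the first group that actually participated in the match by using ?? instead of truthiness checks:

```javascript
// Sketch: same pattern as above, but empty quoted values are kept.
function parseAttributes(text) {
  const re = /([^\s=]+)=(?:"([^"\\]*(?:\\.[^"\\]*)*)"|'([^'\\]*(?:\\.[^'\\]*)*)'|(\S+))/g;
  // Groups that did not participate are undefined, so ?? falls through to
  // the next one; an empty string is not undefined and is returned as is.
  return Array.from(text.matchAll(re), (m) => [m[1], m[2] ?? m[3] ?? m[4]]);
}

console.log(parseAttributes('a="" b=\'x y\' c=z'));
```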
