Regex optimization and best practice

Regex optimization and best practice - javascript

I need to parse information out from a legacy interface. We do not have the ability to update the legacy message. I'm not very proficient at regular expressions, but I managed to write one that does what I want it to do. I just need peer-review and feedback to make sure it's clean.
The message from the legacy system returns values resembling the example below.
%name0=value
%name1=value
%name2=value
Expression: /\%(.*)\=(.*)/g;
var strBody = body_text.toString();
var myRegexp = /\%(.*)\=(.*)/g;
var match = myRegexp.exec(strBody);
var objPair = {};
while (match != null) {
if (match[1]) {
objPair[match[1].toLowerCase()] = match[2];
}
match = myRegexp.exec(strBody);
}
This code works, and I can add partial matches in the middle of the name/value pairs without anything breaking. I have to assume that any combination of characters could appear in the "values" match, meaning it could have equal and percent signs within the message.
Is this clean enough?
Is there something that could break the expression?

First of all, don't escape characters that don't need escaping: %(.*)=(.*)
The problem with your expression: an equals sign in the value would break your parser. Because the first (.*) is greedy, %name0=val=ue would be parsed as name "name0=val" and value "ue" instead of name "name0" and value "val=ue".
One possible fix is to make the first repetition lazy by appending a question mark: %(.*?)=(.*)
But this is not optimal due to unneeded backtracking. You can do better by using a negated character class: %([^=]*)=(.*)
And finally, if empty names should not be allowed, replace the first asterisk with a plus: %([^=]+)=(.*)
This is a good resource: Regex Tutorial - Repetition with Star and Plus
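Putting the last expression back into the asker's loop could look like the sketch below. The function name parsePairs and the sample input are made up for illustration; only the pattern comes from this answer.

```javascript
// Hypothetical helper: collects "%name=value" pairs into an object.
// The name stops at the first "=", so the value may contain "=" and "%".
function parsePairs(text) {
  const pairs = {};
  const re = /%([^=]+)=(.*)/g;
  let match;
  while ((match = re.exec(text)) !== null) {
    pairs[match[1].toLowerCase()] = match[2];
  }
  return pairs;
}

// Illustrative input, not taken from the legacy system.
console.log(parsePairs('%Name0=val=ue\n%name1=50%\n%name2='));
```

Note that [^=] also matches line breaks, so this assumes every "%" that starts a name is followed by "=" on the same line.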

Your expression is fine, and wrapping it with two capturing groups is simple to get your desired variables and values.
The % and = characters do not need to be escaped; the expression works the same without the backslashes.
A simplified version that you can test and modify if you wish:
%(.+)=(.+)
Since your data is pretty structured, you can also do so with string split and get the same desired outputs, if you want.
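A minimal sketch of that split-based alternative, assuming each pair sits on its own line, starts with "%", and lines are separated by "\n". The function name parsePairsBySplit is hypothetical.

```javascript
// Hypothetical helper: parses "%name=value" lines without a regex.
function parsePairsBySplit(text) {
  const pairs = {};
  for (const line of text.split('\n')) {
    const eq = line.indexOf('=');
    // Skip lines without a leading "%", without "=", or with an empty name.
    if (!line.startsWith('%') || eq < 2) continue;
    // Everything after the first "=" is the value, including further "=" or "%".
    pairs[line.slice(1, eq).toLowerCase()] = line.slice(eq + 1);
  }
  return pairs;
}

console.log(parsePairsBySplit('%Name0=val=ue\nnoise\n%name1=50%'));
```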
JavaScript Test
const regex = /%(.+)=(.+)/gm;
const str = `%name0=value
%name1=value
%name2=value`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
Performance Test
This JavaScript snippet shows the performance of that expression using a simple 1-million times for loop.
const repeat = 1000000;
const start = Date.now();
for (var i = repeat; i >= 0; i--) {
const string = '%name0=value';
const regex = /(%(.+)=(.+))/gm;
var match = string.replace(regex, "\nGroup #1: $1 \n Group #2: $2 \n Group #3: $3 \n");
}
const end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚💚💚 ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");

Related

Get last 2 or 3 elements from path regex

I currently have a path and I am trying to fetch the last three elements.
Test:
/testing/path/here/src/handlebar/sample/colors.txt
/testing/path/here/src/handlebar/testing/another/colors.txt
Regex:
\/([^/]+\/[^/]+\/[^/]+)\.[^.]+$
Result:
handlebar/sample/colors
testing/another/colors
What I want it to do:
sample/colors
testing/another/colors
If there are two directories followed by the item, it should return all three elements; if those elements include the word handlebar, it should return only the last two.

You could just create a group for everything behind handlebar/ like this:
with a named capturing group (subPath group contains wanted value):
/handlebar\/(?<subPath>\S*)\.\S+$/gm
without naming (first group contains wanted value):
/handlebar\/(\S*)\.\S+$/gm
Explanation: This regex matches everything ending with 'handlebar/(any non-whitespace characters, zero or more times).(any non-whitespace characters, one or more times)'. The global and multiline flags are there in case you want to check multiple paths within one string, separated by line breaks.
As you tagged the question with the tag javascript, here is some example code, how to retrieve the value of the regex group
function getSubPath(fullPath = '') {
const regex = /handlebar\/(?<subPath>\S*)\.\S+$/gm
const match = regex.exec(fullPath)
if (match) {
return match.groups.subPath
}
return fullPath // regex.exec did not deliver match
}
getSubPath('/testing/path/here/src/handlebar/sample/colors.txt')
// returns 'sample/colors'
getSubPath('/testing/path/here/src/handlebar/testing/another/colors.txt')
// returns 'testing/another/colors'
Without the named group, just read or return match[1] for the first capturing group; index 0 is the full match (which would include 'handlebar/' and the file extension).

I hope this gives you what you need.
This version is dynamic: you can pass whatever parameters you require and get the result.
<script>
var res = "/testing/path/here/src/handlebar/sample/colors.txt";
var res1 = "/testing/path/here/src/handlebar/testing/another/colors.txt";
Result = (val, text) => {
var r = val.split(text + '/')[1];
return r.substr(0, r.lastIndexOf('.'));
}
console.log(Result(res, "handlebar"));
console.log(Result(res1, "handlebar"));
</script>

A javascript solution without regex would look like this:
const getTokenizedPath = path => {
const pathArray = path.split('/');
// last element of array looks like "colors.txt" - split by dot and read the first value, removing the extension
pathArray[pathArray.length-1] = pathArray[pathArray.length-1].split('.')[0];
// Remove all elements before the 'handlebar' token and join the remaining values together by '/'.
return pathArray.slice(pathArray.indexOf('handlebar')+1).join('/');
}
getTokenizedPath('/testing/path/here/src/handlebar/sample/colors.txt');
--- sample/colors
getTokenizedPath('/testing/path/here/src/handlebar/testing/another/colors.txt');
--- testing/another/colors
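A variant of the same split idea that follows the rule from the question more literally (take the last three elements, but drop the first of them when it is handlebar) might look like this. The name lastSegments is hypothetical, and the sketch assumes a non-empty path whose last element has a file extension.

```javascript
// Hypothetical helper: last two directories plus the file name without
// its extension, dropping a leading "handlebar" directory.
function lastSegments(path) {
  const parts = path.split('/').filter(Boolean); // ignore empty pieces
  const file = parts.pop().replace(/\.[^.]+$/, ''); // strip the extension
  const tail = parts.slice(-2); // the two directories before the file
  if (tail[0] === 'handlebar') {
    tail.shift();
  }
  return [...tail, file].join('/');
}

console.log(lastSegments('/testing/path/here/src/handlebar/sample/colors.txt'));
console.log(lastSegments('/testing/path/here/src/handlebar/testing/another/colors.txt'));
```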

I guess,
(?!.*handlebar)/([^/]+/[^/]+/[^/]+)\.[^.]+$|/([^/]+/[^/]+)\.[^.]+$
might work OK.
and if lookarounds would be supported,
(?!.*handlebar)(?<=/)[^/]+/[^/]+/[^/]+(?=\.[^.]+$)|$|(?<=/)([^/]+/[^/]+)(?=\.[^.]+$)
would be an option too.
const regex = /(?!.*handlebar)\/([^\/]+\/[^\/]+\/[^\/]+)\.[^.]+$|\/([^\/]+\/[^\/]+)\.[^.]+$/gm;
const str = `/testing/path/here/src/handlebar/sample/colors.txt
/testing/path/here/src/handlebar/testing/another/colors.txt`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.

RegEx for matching the first word

I have the following prop {priority} that outputs ‘high priority’. Is there a way I can render it simply as ‘high’? Could I use standard JS or something like the below?
var getPriority = {priority};
var priority = getPriority.replace( regex );
console.log( priority );
How do I solve this problem?

If you wish to do that with a regular expression, this expression would do so, even if there might be a misspelling in the word "priority":
(.+)(\s[priorty]+)
It can simply use capturing groups for capturing your desired word before "priority". If you wish to add any boundaries to it, it would be much easier to do so, especially if your input string would change.
const regex = /(.+)(\s[priorty]+)/gmi;
const str = `high priority
low priority
medium priority
under-processing pririty
under-processing priority
400-urget priority
400-urget Priority
400-urget PRIority`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
Performance Test
This JavaScript snippet shows the performance of that expression using a simple 1-million times for loop.
var repeat = 1000000;
var start = Date.now();
for (var i = repeat; i >= 0; i--) {
var string = "high priority";
var regex = /(.+)(\s[priorty]+)/gmi;
var match = string.replace(regex, "$1");
}
var end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚💚💚 ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");

You can use substring to get the required string. Note that this hard-codes the length of "high", so it only works for four-letter values:
var str = 'high priority';
console.log(str.substring(0, 4));
// expected output: "high"
So, in your code:
var getPriority = {priority};
var priority = getPriority.priority.substring(0, 4);
console.log( priority );

You can get just the first word of the string using .split().
The code below prints the first word of the string:
var getPriority = {priority};
console.log( getPriority.priority.split(' ', 1)[0]);
Or, if the value always ends with the word priority, you can get rid of it by using it as the separator for .split():
var getPriority = {priority};
console.log( getPriority.priority.split(' priority')[0] );
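If a regular expression is preferred, as the question title asks, one possible sketch is to capture the first run of non-whitespace characters. The function name firstWord is made up here, and the input is assumed to be a plain string.

```javascript
// Hypothetical helper: returns the first whitespace-delimited word,
// or an empty string when the input has no word at all.
function firstWord(text) {
  const match = /^\s*(\S+)/.exec(text);
  return match ? match[1] : '';
}

console.log(firstWord('high priority'));
console.log(firstWord('under-processing priority'));
```

Unlike substring(0, 4), this does not depend on the length of the first word.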

Find a string surrounded by square brackets and not prefaced with a specific character

I would like to have a match with
[testing]
but not
![testing]
This is my query to grab a string surrounded by square brackets:
\[([^\]]+)\]
var match = /^[^!]*\[([^\]]+)\]/.exec(issueBody);
if (match)
{
$ISSUE_BODY.selectRange(match.index, match.index+match[0].length);
}
and it works marvelously.
However, I have spent a good half hour on http://regexr.com/ trying to skip strings with a "!" in front, and couldn't.
EDIT: I'm sorry guys I didn't realize that there were operations that could not be supported by specific interpreters. I am writing in Javascript and apparently lookbehind is not supported, I get this error:
Uncaught SyntaxError: Invalid regular expression:
/(?
Sorry for wasting time :\

You can use alternation:
(?:^|[^!])(\[[^\]]+\])
Here (?:^|[^!]) will match start of input OR any character that is NOT !
Code:
var re = /(?:^|[^!])(\[[^\]]+\])/gm;
var str = '![foobar123]\n[xyz789]';
while ((m = re.exec(str)) !== null)
console.log(m[1]);
Output:
[xyz789]

In JavaScript engines that do not support lookbehind, you can use:
^[^!]*\[([^\]]+)\]
(with the multiline flag to match every start of a line)
See it on regexr.com.
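Newer JavaScript engines (ES2018 and later) do accept lookbehind assertions, so where that support can be assumed, a negative lookbehind expresses the requirement directly. The sketch below is not part of the original answer, and the function name findPlainBrackets is hypothetical.

```javascript
// Hypothetical helper: returns the contents of every [...] that is not
// immediately preceded by "!". Requires lookbehind support (ES2018).
function findPlainBrackets(text) {
  const re = /(?<!!)\[([^\]]+)\]/g;
  return Array.from(text.matchAll(re), (m) => m[1]);
}

console.log(findPlainBrackets('![bad] [good] x[also]'));
```

Unlike the anchored pattern above, this also finds a bracket that follows a "!" elsewhere on the same line, as long as the "!" is not directly in front of it.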

You can just use capturing:
var re = /(?:^|[^!])(\[[^[\]]*])/g;
var str = '[goodtesting] ![badtesting] ';
var m;
while ((m = re.exec(str)) !== null) {
document.getElementById("r").innerHTML += m[1] + "<br/>";
}
<div id="r"/>
The (?:^|[^!])(\[[^[\]]*]) regex matches the start of string or any character other than a ! (with a non-capturing group (?:^|[^!])) and matches and captures the substring enclosed with [ and ] that has no [ and ] inside (with (\[[^[\]]*])). When we need to get multiple matches, we need to use RegExp#exec() and access the captured groups using the indices (here, index 1).
Also, in JS, when you do not need to check what is after the match, just a lookbehind without a lookahead, you can use a reverse string technique (use a lookahead with the reversed string):
function revStr(s) {
return s.split('').reverse().join('');
}
var re = /][^[\]]*\[(?!!)/g; // Here, the regex pattern is reverse, too
var str = '![badtesting] [goodtesting]';
var m;
while ((m = re.exec(revStr(str))) !== null) { // We reverse a string here
document.getElementById("res").innerHTML += revStr(m[0]); // and the matched value here
}
<div id="res"/>
This is not possible with longer patterns but this one seems simple enough to go for it.

Extract string when preceding number or combo of preceding characters is unknown

Here's an example string:
++++#foo+bar+baz++#yikes
I need to extract foo and only foo from there or a similar scenario.
The + and the # are the only characters I need to worry about.
However, regardless of what precedes foo, it needs to be stripped or ignored. Everything else after it needs to as well.

Try this:
/\++#(\w+)/
and read capturing group 1.
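In code, that could be used roughly as follows. The wrapper name firstToken is made up for illustration; the pattern is the one given above.

```javascript
// Hypothetical helper: returns the word that follows the first run of
// "+" characters and a "#", or null when the pattern is not found.
function firstToken(text) {
  const match = /\++#(\w+)/.exec(text);
  return match ? match[1] : null;
}

console.log(firstToken('++++#foo+bar+baz++#yikes'));
```

Because the pattern requires at least one "+" before the "#", an input such as "#foo" gives null.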

You can simply use the match() method.
var str = "++++#foo+bar+baz++#yikes";
var res = str.match(/\w+/g);
console.log(res[0]); // foo
console.log(res); // foo,bar,baz,yikes
Or use exec
var str = "++++#foo+bar+baz++#yikes";
var match = /(\w+)/.exec(str);
alert(match[1]); // foo
Using exec with a g modifier (global) is meant to be used in a loop getting all sub matches.
var str = "++++#foo+bar+baz++#yikes";
var re = /\w+/g;
var match;
while (match = re.exec(str)) {
// In array form, match is now your next match..
}

How exactly do + and # play a role in identifying foo? If you just want any string that follows # and is terminated by + that's as simple as:
var foostring = '++++#foo+bar+baz++#yikes';
var matches = (/\#([^+]+)\+/g).exec(foostring);
if (matches) { // exec returns null when nothing matches
// element 0 is the full match; the captured group is in element 1 of the matches array
alert('found ' + matches[1] + '!'); // alerts 'found foo!'
}
To help you more specifically, please provide information about the possible variations of your data and how you would go about identifying the token you want to extract even in cases of differing lengths and characters.
If you are just looking for the first segment of text preceded and followed by any combination of + and #, then use:
var foostring = '++++#foo+bar+baz++#yikes';
var result = foostring.match(/[^+#]+/);
// will be the single-element array, ['foo'], or null.
Depending on your data, using \w may be too restrictive as it is equivalent to [A-Za-z0-9_]. Does your data have anything else such as punctuation, dashes, parentheses, or any other characters that you do want to include in the match? Using the negated character class I suggest will catch every token that does not contain a + or a #.
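To illustrate the difference, here is a small comparison on a made-up input where the first token contains a dash. The sample string is an assumption, not data from the question.

```javascript
// Illustrative input: the first token contains a dash.
const sample = '++++#foo-bar+baz++#yikes';

// \w+ stops at the dash because "-" is not a word character.
const byWordChars = sample.match(/\w+/)[0];

// [^+#]+ keeps going until the next "+" or "#".
const byNegatedClass = sample.match(/[^+#]+/)[0];

console.log(byWordChars);    // foo
console.log(byNegatedClass); // foo-bar
```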

split string only on first instance of specified character

In my code I split a string based on _ and grab the second item in the array.
var element = $(this).attr('class');
var field = element.split('_')[1];
Takes good_luck and provides me with luck. Works great!
But, now I have a class that looks like good_luck_buddy. How do I get my javascript to ignore the second _ and give me luck_buddy?
I found this var field = element.split(new char [] {'_'}, 2); in a c# stackoverflow answer but it doesn't work. I tried it over at jsFiddle...

Use capturing parentheses:
'good_luck_buddy'.split(/_(.*)/s)
['good', 'luck_buddy', ''] // ignore the third element
They are defined as
If separator contains capturing parentheses, matched results are returned in the array.
So in this case we want to split at _.* (i.e. split separator being a sub string starting with _) but also let the result contain some part of our separator (i.e. everything after _).
In this example our separator (matching _(.*)) is _luck_buddy and the captured group (within the separator) is luck_buddy. Without the capturing parentheses, luck_buddy (matching .*) would not have been included in the result array, because with a simple split the separators are not included in the result.
We use the s regex flag to make . match on newline (\n) characters as well, otherwise it would only split to the first newline.
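Wrapped into a function, the capturing split could be used like the sketch below. The name splitFirst is hypothetical, and the s flag needs an ES2018 engine.

```javascript
// Hypothetical helper: splits on the first "_" only and returns
// [head, tail]; tail is "" when the input contains no "_".
function splitFirst(text) {
  // split returns [head, captured tail, ""]; the trailing "" is ignored.
  const [head, tail = ''] = text.split(/_(.*)/s);
  return [head, tail];
}

console.log(splitFirst('good_luck_buddy'));
console.log(splitFirst('good'));
```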

What do you need regular expressions and arrays for?
myString = myString.substring(myString.indexOf('_')+1)
var myString= "hello_there_how_are_you"
myString = myString.substring(myString.indexOf('_')+1)
console.log(myString)

I avoid RegExp at all costs. Here is another thing you can do:
"good_luck_buddy".split('_').slice(1).join('_')

With help of destructuring assignment it can be more readable:
let [first, ...rest] = "good_luck_buddy".split('_')
rest = rest.join('_')

A simple ES6 way to get both the first key and remaining parts in a string would be:
const [key, ...rest] = "good_luck_buddy".split('_')
const value = rest.join('_')
console.log(key, value) // good, luck_buddy

Nowadays String.prototype.split does indeed allow you to limit the number of splits.
str.split([separator[, limit]])
...
limit Optional
A non-negative integer limiting the number of splits. If provided, splits the string at each occurrence of the specified separator, but stops when limit entries have been placed in the array. Any leftover text is not included in the array at all.
The array may contain fewer entries than limit if the end of the string is reached before the limit is reached.
If limit is 0, no splitting is performed.
caveat
It might not work the way you expect. I was hoping it would just ignore the rest of the delimiters, but instead, when it reaches the limit, it splits the remaining string again, omitting the part after the split from the return results.
let str = 'A_B_C_D_E'
const limit_2 = str.split('_', 2)
limit_2
(2) ["A", "B"]
const limit_3 = str.split('_', 3)
limit_3
(3) ["A", "B", "C"]
I was hoping for:
let str = 'A_B_C_D_E'
const limit_2 = str.split('_', 2)
limit_2
(2) ["A", "B_C_D_E"]
const limit_3 = str.split('_', 3)
limit_3
(3) ["A", "B", "C_D_E"]
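The behaviour hoped for above is not built in, but it can be sketched with indexOf and slice. The function name splitKeepRest is made up, and the sketch assumes a non-empty string separator and a limit of at least 1.

```javascript
// Hypothetical helper: splits into at most `limit` parts and keeps the
// unsplit remainder as the last part.
function splitKeepRest(text, separator, limit) {
  const parts = [];
  let start = 0;
  while (parts.length < limit - 1) {
    const found = text.indexOf(separator, start);
    if (found === -1) break; // no more separators
    parts.push(text.slice(start, found));
    start = found + separator.length;
  }
  parts.push(text.slice(start)); // whatever is left, separators included
  return parts;
}

console.log(splitKeepRest('A_B_C_D_E', '_', 2));
console.log(splitKeepRest('A_B_C_D_E', '_', 3));
```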

This solution worked for me
var str = "good_luck_buddy";
var index = str.indexOf('_');
var arr = [str.slice(0, index), str.slice(index + 1)];
//arr[0] = "good"
//arr[1] = "luck_buddy"
OR
var str = "good_luck_buddy";
var index = str.indexOf('_');
var [first, second] = [str.slice(0, index), str.slice(index + 1)];
//first = "good"
//second = "luck_buddy"

You can use the regular expression like:
var arr = element.split(/_(.*)/)
Note that the second parameter of split only limits how many items are returned; it does not keep the remainder, so it does not help here:
var field = element.split('_', 2)[1]; // "luck", not "luck_buddy"

Replace the first instance with a unique placeholder then split from there.
"good_luck_buddy".replace(/\_/,'&').split('&')
["good","luck_buddy"]
This is more useful when both sides of the split are needed.

I need the two parts of string, so, regex lookbehind help me with this.
const full_name = 'Maria do Bairro';
const [first_name, last_name] = full_name.split(/(?<=^[^ ]+) /);
console.log(first_name);
console.log(last_name);

Non-regex solution
I ran some benchmarks, and this solution won by a wide margin (the test environment is noted at the end of this answer):
str.slice(str.indexOf(delim) + delim.length)
// as function
function gobbleStart(str, delim) {
return str.slice(str.indexOf(delim) + delim.length);
}
// as polyfill
String.prototype.gobbleStart = function(delim) {
return this.slice(this.indexOf(delim) + delim.length);
};
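One thing to watch: when the delimiter is not present, indexOf returns -1, so the one-liner slices from delim.length - 1 and can drop leading characters for delimiters longer than one character. A guarded variant might look like this sketch; the name gobbleStartSafe is hypothetical and it was not part of the benchmark.

```javascript
// Hypothetical variant: returns the string unchanged when the
// delimiter does not occur.
function gobbleStartSafe(str, delim) {
  const index = str.indexOf(delim);
  return index === -1 ? str : str.slice(index + delim.length);
}

console.log(gobbleStartSafe('good_luck_buddy', '_'));
console.log(gobbleStartSafe('abcdef', '--'));
```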
Performance comparison with other solutions
The only close contender was the same line of code, except using substr instead of slice.
Other solutions I tried involving split or RegExps took a big performance hit and were about 2 orders of magnitude slower. Using join on the results of split, of course, adds an additional performance penalty.
Why are they slower? Any time a new object or array has to be created, JS has to request a chunk of memory from the OS. This process is very slow.
Here are some general guidelines, in case you are chasing benchmarks:
New dynamic memory allocations for objects {} or arrays [] (like the one that split creates) will cost a lot in performance.
RegExp searches are more complicated and therefore slower than string searches.
If you already have an array, destructuring arrays is about as fast as explicitly indexing them, and looks awesome.
Removing beyond the first instance
Here's a solution that will slice up to and including the nth instance. It's not quite as fast, but on the OP's question, gobble(element, '_', 1) is still >2x faster than a RegExp or split solution and can do more:
/*
`gobble`, given a positive, non-zero `limit`, deletes
characters from the beginning of `haystack` until `needle` has
been encountered and deleted `limit` times or no more instances
of `needle` exist; then it returns what remains. If `limit` is
zero or negative, delete from the beginning only until `-(limit)`
occurrences or less of `needle` remain.
*/
function gobble(haystack, needle, limit = 0) {
let remain = limit;
if (limit <= 0) { // set remain to count of delim - num to leave
let i = 0;
while (i < haystack.length) {
const found = haystack.indexOf(needle, i);
if (found === -1) {
break;
}
remain++;
i = found + needle.length;
}
}
let i = 0;
while (remain > 0) {
const found = haystack.indexOf(needle, i);
if (found === -1) {
break;
}
remain--;
i = found + needle.length;
}
return haystack.slice(i);
}
With the above definition, gobble('path/to/file.txt', '/') would give the name of the file, and gobble('prefix_category_item', '_', 1) would remove the prefix like the first solution in this answer.
Tests were run in Chrome 70.0.3538.110 on macOSX 10.14.

Use the string replace() method with a regex:
var result = "good_luck_buddy".replace(/.*?_/, "");
console.log(result);
This regex matches 0 or more characters before the first _, and the _ itself. The match is then replaced by an empty string.

JavaScript's String.split unfortunately has no way of limiting the actual number of splits. It has a second argument that specifies how many of the actual split items are returned, which isn't useful in your case. The solution would be to split the string, shift the first item off, then rejoin the remaining items:
var element = $(this).attr('class');
var parts = element.split('_');
parts.shift(); // removes the first item from the array
var field = parts.join('_');

Here's one RegExp that does the trick.
'good_luck_buddy' . split(/^.*?_/)[1]
First it forces the match to start from the start with the '^'. Then '.*?' matches as few characters as possible before the next '_', in other words all characters before the first '_'. The '?' makes '.*' lazy: it matches the minimal number of characters that lets the whole pattern match, and because it is followed by '_', that '_' is included in the match as its last character.
Therefore this split() uses such a matching part as its 'splitter' and removes it from the results. So it removes everything up to and including the first '_' and gives you the rest as the second element of the result. The first element is "" representing the part before the matched part. It is "" because the match starts from the beginning.
There are other RegExps that work as well, like /_(.*)/ given by Chandu in a previous answer.
The /^.*?_/ has the benefit that you can understand what it does without having to know about the special role capturing groups play with split().

If you are looking for a more modern way of doing this:
let raw = "good_luck_buddy"
raw.split("_")
.filter((part, index) => index !== 0)
.join("_")

Mark F's solution is awesome but it's not supported by old browsers. Kennebec's solution is awesome and supported by old browsers but doesn't support regex.
So, if you're looking for a solution that splits your string only once, that is supported by old browsers and supports regex, here's my solution:
String.prototype.splitOnce = function(regex)
{
var match = this.match(regex);
if(match)
{
var match_i = this.indexOf(match[0]);
return [this.substring(0, match_i),
this.substring(match_i + match[0].length)];
}
else
{ return [this, ""]; }
}
var str = "something/////another thing///again";
alert(str.splitOnce(/\/+/)[1]);

For beginners like me who are not used to regular expressions, this workaround worked:
var field = "Good_Luck_Buddy";
var newString = field.slice( field.indexOf("_")+1 );
The slice() method extracts a part of a string and returns it as a new string, and the indexOf() method returns the position of the first occurrence of a specified value in a string.

This should be quite fast
function splitOnFirst (str, sep) {
const index = str.indexOf(sep);
return index < 0 ? [str] : [str.slice(0, index), str.slice(index + sep.length)];
}
console.log(splitOnFirst('good_luck', '_')[1])
console.log(splitOnFirst('good_luck_buddy', '_')[1])

This worked for me on Chrome + FF:
"foo=bar=beer".split(/^[^=]+=/)[1] // "bar=beer"
"foo==".split(/^[^=]+=/)[1] // "="
"foo=".split(/^[^=]+=/)[1] // ""
"foo".split(/^[^=]+=/)[1] // undefined
If you also need the key try this:
"foo=bar=beer".split(/^([^=]+)=/) // Array [ "", "foo", "bar=beer" ]
"foo==".split(/^([^=]+)=/) // [ "", "foo", "=" ]
"foo=".split(/^([^=]+)=/) // [ "", "foo", "" ]
"foo".split(/^([^=]+)=/) // [ "foo" ]
//[0] = ignored (holds the string when there's no =, empty otherwise)
//[1] = hold the key (if any)
//[2] = hold the value (if any)

A simple ES6 one-statement solution to get the first key and the remaining parts:
let raw = 'good_luck_buddy'
raw.split('_')
.reduce((p, c, i) => i === 0 ? [c] : [p[0], [...p.slice(1), c].join('_')], [])

You could also use non-greedy match, it's just a single, simple line:
a = "good_luck_buddy"
const [,g,b] = a.match(/(.*?)_(.*)/)
console.log(g,"and also",b)
