regex between two character positions with known start and end indices - javascript

In regex, generally speaking, is there a way to select data between two line positions? I'm not even sure the correct terminology (character/line position, index, column?) after a few days of reading up on regex, but what I mean is...
Select the data between two indices, what is between ^.{4} and ^.{7}, for example:
TESTINGREGEX
ISNTTHEBEST!
or
TESTINGREGEXCANBEFUN
ISNTTHEBEST!ANDFARFROMFUN
the results I'm looking for would be:
TESTREGEX
ISNTBEST!
and
TESTREGEXCANBEFUN
ISNTBEST!ANDFARFROMFUN
I'm wondering, so I can learn if it's possible, how to achieve it? I'm very familiar with other ways to do this using other tools, but I'm curious how to achieve this using regex.
I've tried working with non capturing groups, and wondering if maybe I'm being limited by the fact that I'm attempting to apply this regex within the atom editor find and replace regex feature (falling victim to: Avoiding Common Pitfalls), so I'm hoping to get a few suggestions to broaden my knowledge and try out. I'm guessing javascript, and/or sed style regex answers would be acceptable...really anything would help!
EDIT:
.{3}(?=.{5}$) from Mark's answer works for me and with the example text I gave in the OP. And it's a good thing to know when able to count from the $ end of line. But I'm realizing I actually need the opposite... I need to count out from the ^ start of line. Is this not possible; re: comments on there being no support for lookbehind?

With just regex it's possible, just not in older JavaScript engines. The regex (?<=^.{4}).+(?=.{5}$) matches the text between the 4th character and the 5th-to-last character. Since JavaScript did not support positive lookbehind before ES2018, in older engines you'll have to use some amount of JavaScript beyond a simple .replace(regex, "") to remove those characters.
The next closest regex possible in javascript would be .{3}(?=.{5}$), which would match 3 characters before the 5th to last letter.
Without lookbehind, a pure regex in JavaScript cannot match only the text that begins a few characters after the start of a string.
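In engines that do implement lookbehind (it was standardized in ES2018), the start-anchored version can be written directly. The following is a minimal sketch under that assumption; the helper name removeColumns is made up for illustration.

```javascript
// Assumes an ES2018+ engine with lookbehind support (for example a current Node.js).
// removeColumns is a hypothetical helper name, not taken from the answers above.
function removeColumns(text) {
  // (?<=^.{4}) asserts that exactly four characters precede the match on this line.
  // .{3} then matches the 5th through 7th characters, which are removed.
  // The m flag makes ^ match at the start of every line, g handles every line.
  return text.replace(/(?<=^.{4}).{3}/gm, "");
}

console.log(removeColumns("TESTINGREGEX\nISNTTHEBEST!"));
console.log(removeColumns("TESTINGREGEXCANBEFUN\nISNTTHEBEST!ANDFARFROMFUN"));
```

Because the lookbehind is anchored to the line start, at most one match is possible per line, so lines of any length are handled the same way.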

The regex ^(.{4}).{3}(.{5})$ (expressed in JavaScript's dialect, but the features used in it are quite common) will give you two capture groups you can combine to get the output you describe:
function test(str) {
    var match = str.match(/^(.{4}).{3}(.{5})$/);
    console.log(str, '=>', match[1] + match[2]);
}
test("TESTINGREGEX");
test("ISNTTHEBEST!");
If the lines are of varying length and you want to ignore everything after the end of what you want, just drop the $ assertion at the end.
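If the rest of a longer line should be kept rather than ignored, one possible variant is to let the second group capture everything that remains. This is only a sketch; the name removeRange and the fallback for short lines are assumptions, not part of the answer above.

```javascript
// removeRange is a hypothetical name; the offsets 4 and 3 mirror the example above.
function removeRange(line) {
  // Group 1: the first four characters. The next three characters are skipped.
  // Group 2: whatever remains on the line, of any length.
  const match = line.match(/^(.{4}).{3}(.*)$/);
  // Lines shorter than seven characters do not match; return them unchanged.
  return match ? match[1] + match[2] : line;
}

console.log(removeRange("TESTINGREGEXCANBEFUN"));
console.log(removeRange("ISNTTHEBEST!ANDFARFROMFUN"));
```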

If the purpose is to get the text between two character offsets then regular expressions are overkill. Just use slice:
function exclude(str, i, j) {
    return str.slice(0, i) + str.slice(j);
}
console.log(exclude("TESTINGREGEX", 4, 7));
console.log(exclude("ISNTTHEBEST!", 4, 7));
If you really need to do this with regular expressions then proceed as follows:
function exclude(str, i, j) {
    return str.replace(new RegExp(`^(.{${i}})(.{${j-i}})`), "$1");
}
console.log(exclude("TESTINGREGEX", 4, 7));
console.log(exclude("ISNTTHEBEST!", 4, 7));

Related

Get an example matched text from a regex pattern [duplicate]

Is there any way of generating random text which satisfies provided regular expression.
I am looking for a function which works like below
var reg = Some Regular Expression
var str = RandString(reg)
I have seen fairly good solutions in perl and ruby on github, but I think there are technical issues that make a complete solution impossible. For example, /[0-9]+/ has an infinite upper bound, which is not practical for selecting random numbers from.
Never seen it in JavaScript, but you could translate.
EDIT: After googling for a few seconds...
https://github.com/fent/randexp.js
if you know what the regular expression is, you can just generate random strings, then use a function that references the index of the letters and changes them as needed. Regex expressions vary widely, so it will be difficult to find one in particular that satisfies all possible regex.
Your question is pretty open so hopefully this steers you to the right solution. Get the current time (in seconds), MD5 it, check it against a REGEX, return the match.
Running Example: http://jsfiddle.net/MattLo/3gKrb/
Usage: RandString(/([A-Za-z])/ig); // expected to be a string
For JavaScript, the following modules can generate a random match to a regex:
pxeger
randexp.js
regexgen

Catastrophic backtracking in regular expression

I am using the regular expression below:
[^\/]*([A-Za-z]+_([A-Za-z]+|-?[A-Z0-9]+(\.[A_Z0-9]+)?|(?:_|:|:-|-[a-zA-Z]+|\.[a-zA-Z]+|[A-Z0-9a-z]+|=|\s|\?|\%|\.|!|#|\*)?)+(?=(,|\/)))+|,[^\/]*
and it shows catastrophic backtracking when I try to match it against this input string:
/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,l_text:Neucha_26_bold:Loremipsum./l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/1488800313_DSC_0334__3_.JPG_mweubp.jpg
The expected output array of the matching regex will be like
[ 'w_100',
'h_500',
'e_saturation:50,e_tint:red:blue',
'c_crop,a_100,l_text:Neucha_26_bold:Loremipsum.',
'l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc' ]
I don't want the image name 1488800313_DSC_0334__3_.JPG_mweubp.jpg to be included in the match.
Is there any way to fix the backtracking in this regular expression, or can you suggest a better regex for my input string?
The problem
You use a lot of alternations when a character class would be more effective. Also, you're getting the catastrophic backtracking due to the following quantifier:
[^\/]*([A-Za-z]+_([A-Za-z]+|-?[A-Z0-9]+(\.[A_Z0-9]+)?|(?:_|:|:-|-[a-zA-Z]+|\.[a-zA-Z]+|[A-Z0-9a-z]+|=|\s|\?|\%|\.|!|#|\*)?)+(?=(,|\/)))+|,[^\/]*
^
It's trying to match any of the alternations you have, but keeps backtracking and never makes it past all your alternations (it's sometimes comparable to an infinite loop). In your case, your regex is so inefficient that it times out. I removed half your pattern and it still takes half a second to complete, with almost 200K steps (and that's only half your pattern).
Original Answer
How can it be fixed?
First step is to fix the quantifier and prevent it from continuously backtracking. This is actually quite easy, just make it possessive: + becomes ++. Changing the quantifier to possessive yields a pattern that takes about 56ms to complete and approx 9K steps (on my computer)
Second step is to improve the efficiency of the pattern. Change your alternations to character classes where possible.
(?:_|:|:-|-[a-zA-Z]+|\.[a-zA-Z]+|[A-Z0-9a-z]+|=|\s|\?|\%|\.|!|#|\*)?
# should instead be
(?::-|[_:-=\s?%.!#*]|[-.][a-zA-Z]+|[A-Z0-9a-z]+)?
It's much shorter, much more concise and less prone to errors.
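JavaScript regular expressions have no possessive quantifiers, so the ++ from the first step is a syntax error there. A commonly used emulation is a lookahead that captures the run, followed by a backreference, because a JavaScript lookahead is not re-entered once it has succeeded. The sketch below only illustrates that semantics on a toy pattern; it is not the pattern from this answer.

```javascript
// Plain greedy quantifier: a+ gives back one "a" so that the trailing "a" can match.
const greedy = /^a+a$/;

// Emulated possessive a++: the lookahead captures the longest run of "a" in group 1,
// and \1 consumes exactly that run. The engine does not backtrack into the lookahead,
// so nothing is left over for the trailing "a".
const possessiveLike = /^(?=(a+))\1a$/;

console.log(greedy.test("aaa"));
console.log(possessiveLike.test("aaa"));
```

The first test succeeds and the second fails, which is the behaviour a possessive quantifier would give: once the run is consumed, it is never given back.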
The new pattern
See regex in use here
This pattern only takes 271 steps and less than one millisecond to complete (yes, using PCRE engine, works in Java too)
(?<=[,\/])[A-Za-z]+_(?:[A-Z0-9a-z]+|-?[A-Z0-9]+(?:\.[A-Z0-9]+)?|:-|[_:-=\s?%.!#*]|[-.][a-zA-Z]+)++
I also changed your positive lookahead to a positive lookbehind (?<=[,\/]) to improve performance.
Additionally, if you don't need all the specific logic, you can quite simply use the following regex (just under half as many steps as my regex above):
See regex in use here
(?<=[,\/])[A-Za-z]+_[^,\/]+
Results
This results in the following array:
P.S. I'm assuming there's a typo in your expected output and that the / between l_text and l_fetch should also be split on; needs clarification.
w_100
h_500
e_saturation:50
e_tint:red:blue
c_crop
a_100
l_text:Neucha_26_bold:Loremipsum.
l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc
Edit #1
The OP clarified the expected results. I added , to the character class in the fourth option of the non-capture group:
See regex in use here
(?<=[,\/])[A-Za-z]+_(?:[A-Z0-9a-z]+|-?[A-Z0-9]+(?:\.[A-Z0-9]+)?|:-|[_:-=\s?%.!#*,]|[-.][a-zA-Z]+)++
And in its shortened form:
See regex in use here
(?<=\/)[A-Za-z]+_[^\/]+
Results
This results in the following array:
w_100
h_500
e_saturation:50,e_tint:red:blue
c_crop,a_100,l_text:Neucha_26_bold:Loremipsum.
l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc
Edit #2
The OP presented another input and identified issues with Edit #1 related to that input. I added logic to force a fail on the last item in a string.
New test string:
/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,l_text:Neucha_26_bold:Loremipsum./l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/sample_url_image.jpg
See regex in use here
(?<=\/)(?![A-Za-z]+_[^\/]+$)[A-Za-z]+_[^\/]+
Same results as in Edit #1.
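For readers who want to try the Edit #2 pattern from JavaScript, here is a minimal sketch. It assumes an engine with lookbehind support (ES2018 or later); the variable names are illustrative only.

```javascript
// Assumes an ES2018+ engine, since the pattern uses a lookbehind.
const url = "/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,l_text:Neucha_26_bold:Loremipsum./l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/sample_url_image.jpg";

// (?<=\/)                    the segment must start right after a slash
// (?![A-Za-z]+_[^\/]+$)      reject the segment that runs to the end of the string
// [A-Za-z]+_[^\/]+           letters, an underscore, then everything up to the next slash
const pattern = /(?<=\/)(?![A-Za-z]+_[^\/]+$)[A-Za-z]+_[^\/]+/g;

// match() returns null when nothing matches, so fall back to an empty array.
const parts = url.match(pattern) || [];
console.log(parts);
```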
PCRE version (if anyone is looking for it) - more efficient than the method above:
See regex in use here
(?<=\/)[A-Za-z]+_[^\/]+(?:$(*SKIP)(*FAIL))?
Assuming your example has a typo, e.g. the last / would be split too:
You can simply split on /, then filter out the .jpg items:
function splitWithFilter(line, filter) {
    var filterRe = filter ? new RegExp(filter, 'i') : null;
    return line
        .replace(/^\//, '') // remove leading /
        .split(/\//)
        //.filter(Boolean) // filter out empty items (alternative to above replace())
        .filter(function(item) {
            return !filterRe || !item.match(filterRe);
        });
}
var str = "/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,l_text:Neucha_26_bold:Loremipsum./l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/1488800313_DSC_0334__3_.JPG_mweubp.jpg";
console.log(JSON.stringify(splitWithFilter(str, '\\.jpg$'), null, ' '));
Expected output:
[
"w_100",
"h_500",
"e_saturation:50,e_tint:red:blue",
"c_crop,a_100,l_text:Neucha_26_bold:Loremipsum.",
"l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc"
]

Why would the replace with regex not work even though the regex does?

There may be a very simple answer to this, probably related to my familiarity (or possibly lack thereof) with the replace method and how it works with regex.
Let's say I have the following string: abcdefHellowxyz
I just want to strip the first six characters and the last four, to return Hello, using regex... Yes, I know there may be other ways, but I'm trying to explore the boundaries of what these methods are capable of doing...
Anyway, I've tinkered on http://regex101.com and got the following Regex worked out:
/^(.{6}).+(.{4})$/
Which seems to pass the string well and shows that abcdef is captured as group 1, and wxyz captured as group 2. But when I try to run the following:
"abcdefHellowxyz".replace(/^(.{6}).+(.{4})$/,"")
to replace those captured groups with "" I receive an empty string as my final output... Am I doing something wrong with this syntax? And if so, how does one correct it, keeping my original stance on wanting to use Regex in this manner...
Thanks so much everyone in advance...
The code below does what you want:
"abcdefHellowxyz".replace(/^.{6}(.+).{4}$/,"$1")
Use () to capture only the text you want to keep; in the second parameter of replace(), $1, $2, ... refer to group 1, group 2, and so on.
You can also pass a function as the second parameter of replace() and transform the captured text into whatever you want inside that function.
For more detail, as @Akxe recommends, see the documentation at https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace.
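Both forms of the second parameter can be compared side by side. This is a small sketch using the string from the question; the variable names are arbitrary.

```javascript
const input = "abcdefHellowxyz";

// String replacement: "$1" stands for whatever group 1 captured.
const middle = input.replace(/^.{6}(.+).{4}$/, "$1");

// Function replacement: called with the whole match first, then each captured group.
// Whatever the function returns is used as the replacement text.
const shouted = input.replace(/^.{6}(.+).{4}$/, (whole, group1) => group1.toUpperCase());

console.log(middle, shouted);
```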
You are replacing any substring that matches /^(.{6}).+(.{4})$/, with this line of code:
"abcdefHellowxyz".replace(/^(.{6}).+(.{4})$/,"")
The regex matches the whole string "abcdefHellowxyz"; thus, the whole string is replaced. Instead, if you are strictly stripping by the lengths of the extraneous substrings, you could simply use substring or substr.
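When the lengths to strip are fixed, the non-regex route is a one-liner. The sketch below uses slice with a negative end index, which counts from the end of the string; it is shown as an alternative to the substring and substr calls mentioned above.

```javascript
const text = "abcdefHellowxyz";

// Start at index 6 (skipping six characters) and stop four characters before the end.
const stripped = text.slice(6, -4);

console.log(stripped);
```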
Edit
The answer you're probably looking for is capturing the middle token, instead of the outer ones:
var str = "abcdefHellowxyz";
var matches = str.match(/^.{6}(.+).{4}$/);
str = matches[1]; // index 0 is entire match
console.log(str);

Regex to match all instances not inside quotes

From this q/a, I deduced that matching all instances of a given regex not inside quotes, is impossible. That is, it can't match escaped quotes (ex: "this whole \"match\" should be taken"). If there is a way to do it that I don't know about, that would solve my problem.
If not, however, I'd like to know if there is any efficient alternative that could be used in JavaScript. I've thought about it a bit, but can't come with any elegant solutions that would work in most, if not all, cases.
Specifically, I just need the alternative to work with .split() and .replace() methods, but if it could be more generalized, that would be the best.
For Example:
An input string of: +bar+baz"not+or\"+or+\"this+"foo+bar+
replacing + with #, not inside quotes, would return: #bar#baz"not+or\"+or+\"this+"foo#bar#
Actually, you can match all instances of a regex not inside quotes for any string in which each opening quote is closed again. Say, as in your example above, you want to match \+.
The key observation here is, that a word is outside quotes if there are an even number of quotes following it. This can be modeled as a look-ahead assertion:
\+(?=([^"]*"[^"]*")*[^"]*$)
Now, you'd like to not count escaped quotes. This gets a little more complicated. Instead of [^"]*, which advances to the next quote, you need to consider backslashes as well and use [^"\\]*. After you arrive at either a backslash or a quote, you need to ignore the next character if you encountered a backslash, or else advance to the next unescaped quote. That looks like (\\.|"([^"\\]*\\.)*[^"\\]*"). Combined, you arrive at
\+(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)
I admit it is a little cryptic. =)
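To see the combined lookahead at work, here is a minimal sketch that applies it to the input from the question. Note the doubled backslashes in the string literal: they are needed so that the string really contains \" sequences.

```javascript
// The example input from the question; each \\ in the literal is one backslash in the string.
const subject = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';

// A plus sign followed only by complete quoted strings and escaped characters
// up to the end of the input is considered to be outside quotes.
const outsideQuotes = /\+(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)/g;

const replaced = subject.replace(outsideQuotes, "#");
console.log(replaced);
```

Each candidate plus sign triggers a scan to the end of the string, so on long inputs this approach may be noticeably slower than a single-pass alternative.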
Azmisov, resurrecting this question because you said you were looking for any efficient alternative that could be used in JavaScript and any elegant solutions that would work in most, if not all, cases.
There happens to be a simple, general solution that wasn't mentioned.
Compared with alternatives, the regex for this solution is amazingly simple:
"[^"]+"|(\+)
The idea is that we match but ignore anything within quotes to neutralize that content (on the left side of the alternation). On the right side, we capture all the + that were not neutralized into Group 1, and the replace function examines Group 1. Here is full working code:
<script>
var subject = '+bar+baz"not+these+"foo+bar+';
var regex = /"[^"]+"|(\+)/g;
var replaced = subject.replace(regex, function(m, group1) {
    if (!group1) return m;
    else return "#";
});
document.write(replaced);
</script>
Online demo
You can use the same principle to match or split. See the question and article in the reference, which will also point you to code samples.
Hope this gives you a different idea of a very general way to do this. :)
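As an illustration of the split case, one possible sketch walks over the matches and cuts the string only where Group 1 was set. The function name splitOutsideQuotes is hypothetical, and String.prototype.matchAll requires an ES2020 engine.

```javascript
// splitOutsideQuotes is a hypothetical helper; it splits on + signs outside double quotes.
function splitOutsideQuotes(subject) {
  const regex = /"[^"]*"|(\+)/g;
  const parts = [];
  let last = 0;
  for (const m of subject.matchAll(regex)) {
    // Quoted strings match the left branch and leave group 1 undefined: skip them.
    if (m[1] === undefined) continue;
    // A real separator: emit the text since the previous separator.
    parts.push(subject.slice(last, m.index));
    last = m.index + m[0].length;
  }
  // Emit the tail after the last separator.
  parts.push(subject.slice(last));
  return parts;
}

console.log(splitOutsideQuotes('a+b"c+d"e+f'));
```

Calling split() with the regex directly would not work here, because split() also cuts at the quoted-string matches and inserts the captured groups into the result.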
What about Empty Strings?
The above is a general answer to showcase the technique. It can be tweaked depending on your exact needs. If you worry that your text might contain empty strings, just change the quantifier inside the string-capture expression from + to *:
"[^"]*"|(\+)
See demo.
What about Escaped Quotes?
Again, the above is a general answer to showcase the technique. Not only can the "ignore this match" regex be refined to your needs, you can also add multiple expressions to ignore. For instance, if you want to make sure escaped quotes are adequately ignored, you can start by adding an alternation \\"| in front of the other two in order to match (and ignore) straggling escaped double quotes.
Next, within the section "[^"]*" that captures the content of double-quoted strings, you can add an alternation to ensure escaped double quotes are matched before their " has a chance to turn into a closing sentinel, turning it into "(?:\\"|[^"])*"
The resulting expression has three branches:
\\" to match and ignore
"(?:\\"|[^"])*" to match and ignore
(\+) to match, capture and handle
Note that in other regex flavors, we could do this job more easily with lookbehind, but JS doesn't support it.
The full regex becomes:
\\"|"(?:\\"|[^"])*"|(\+)
See regex demo and full script.
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
You can do it in three steps.
Use a regex global replace to extract all string body contents into a side-table.
Do your comma translation
Use a regex global replace to swap the string bodies back
Code below
// Step 1
var sideTable = [];
myString = myString.replace(
    /"(?:[^"\\]|\\.)*"/g,
    function (_) {
        var index = sideTable.length;
        sideTable[index] = _;
        return '"' + index + '"';
    });
// Step 2, replace commas with newlines
myString = myString.replace(/,/g, "\n");
// Step 3, swap the string bodies back
myString = myString.replace(/"(\d+)"/g,
    function (_, index) {
        return sideTable[index];
    });
If you run that after setting
myString = '{:a "ab,cd, efg", :b "ab,def, egf,", :c "Conjecture"}';
you should get
{:a "ab,cd, efg"
:b "ab,def, egf,"
:c "Conjecture"}
It works, because after step 1,
myString = '{:a "0", :b "1", :c "2"}'
sideTable = ["ab,cd, efg", "ab,def, egf,", "Conjecture"];
so the only commas in myString are outside strings. Step 2, then turns commas into newlines:
myString = '{:a "0"\n :b "1"\n :c "2"}'
Finally we replace the strings that only contain numbers with their original content.
Although the answer by zx81 seems to be the cleanest and best-performing one, it needs these fixes to correctly catch the escaped quotes:
var subject = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';
and
var regex = /"(?:[^"\\]|\\.)*"|(\+)/g;
Also the already mentioned "group1 === undefined" or "!group1".
Especially 2. seems important to actually take everything asked in the original question into account.
It should be mentioned though that this method implicitly requires the string to not have escaped quotes outside of unescaped quote pairs.
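Putting these fixes together, a minimal sketch might look as follows. The function name replaceOutsideQuotes and its parameters are assumptions made for illustration.

```javascript
// replaceOutsideQuotes is a hypothetical wrapper around the corrected regex.
function replaceOutsideQuotes(subject, replacement) {
  // Left branch: a whole quoted string, where \\. consumes any escaped character
  // (including \") so it cannot end the string early.
  // Right branch: a + outside quotes, captured in group 1.
  const regex = /"(?:[^"\\]|\\.)*"|(\+)/g;
  return subject.replace(regex, (match, plus) =>
    // Group 1 is undefined when the quoted-string branch matched: keep that text as is.
    plus === undefined ? match : replacement
  );
}

const sample = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';
console.log(replaceOutsideQuotes(sample, "#"));
```

As noted above, an escaped quote that appears outside of a quoted string would still be treated as the start of a string by this pattern.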

How to search csv string and return a match by using a Javascript regex

I'm trying to extract the first user-right from semicolon separated string which matches a pattern.
Users rights are stored in format:
LAA;LA_1;LA_2;LE_3;
String is empty if user does not have any rights.
My best solution so far is to use the following regex in regex.replace statement:
.*?;(LA_[^;]*)?.*
(The question mark at the end of group is for the purpose of matching the whole line in case user has not the right and replace it with empty string to signal that she doesn't have it.)
However, it doesn't work correctly in case the searched right is in the first position:
LA_1;LA_2;LE_3;
It is easy to fix it by just adding a semicolon at the beginning of line before regex replace but my question is, why doesn't the following regex match it?
.*?(?:(?:^|;)(LA_[^;]*))?.*
I have tried numerous other regular expressions to find the solution but so far without success.
I am not sure I get your question right, but in regards to the regular expressions you are using, you are overcomplicating them for no clear reason (at least not to me). You might want something like:
function getFirstRight(rights) {
    var m = rights.match(/(^|;)(LA_[^;]*)/);
    return m ? m[2] : "";
}
You could just split the string first:
function getFirstRight(rights)
{
    return rights.split(";",1)[0] || "";
}
To answer the specific question "why doesn't the following regex match it?", one problem is the mix of this at the beginning:
.*?
eventually followed by:
^|;
Which might be like saying, skip over any extra characters until you reach either the start or a semicolon. But you can't skip over anything and then later arrive at the start (unless it involves newlines in a multiline string).
Something like this works:
.*?(\bLA_[^;]*).*
Meaning, skip over characters until a word boundary followed by "LA_".
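The behaviour of the regex from the question can be observed directly. In the sketch below, attempt is the asker's pattern; fieldwise is a hypothetical alternative (not from any answer above) that skips whole fields lazily and falls back to matching the entire line when no right is found.

```javascript
// The asker's pattern: the whole group is optional, so it may match nothing.
const attempt = /.*?(?:(?:^|;)(LA_[^;]*))?.*/;

// The lazy .*? starts out empty. If the group matches at position 0, it is captured.
console.log(JSON.stringify("LA_1;LA_2;LE_3;".replace(attempt, "$1")));
// If the group fails at position 0, the engine takes the "match nothing" option
// and the trailing .* swallows the line, so group 1 stays empty.
console.log(JSON.stringify("LAA;LA_1;LA_2;LE_3;".replace(attempt, "$1")));

// Hypothetical alternative: the capture is mandatory in the first branch, and
// (?:[^;]*;)*? skips as few complete fields as possible before it.
// The second branch matches lines without any LA_ right, leaving group 1 empty.
const fieldwise = /^(?:[^;]*;)*?(LA_[^;]*).*$|^.*$/;

console.log(JSON.stringify("LAA;LA_1;LA_2;LE_3;".replace(fieldwise, "$1")));
console.log(JSON.stringify("LAA;LE_3;".replace(fieldwise, "$1")));
```

The key difference is that a mandatory capture forces the engine to keep extending the lazy prefix until the right is found, whereas an optional group lets it succeed immediately without capturing anything.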
