"Learning JavaScript" Chapter 17: Regular Expressions...Backreference examples failing [duplicate] - javascript

I'm currently reading "Learning JavaScript" by Ethan Brown (2016). I'm going through the examples in the Backreferences section and they keep coming up as 'null'. There are two examples.
Example 1: Match names that follow the pattern XYYX.
const promo = "Opening for XAAX is the dynamic GOOG! At the box office now!";
const bands = promo.match(/(?:[A-Z])(?:[A-Z])\2\1/g);
console.log('bands: '+ bands);//output was null
If I understand the text correctly, the result should be...
bands: XAAX, GOOG
Example 2: Matching single and/or double quotation marks.
//we use backticks here because we're using single and
//double quotation marks:
const html = `<img alt='A "simple" example,'>` +
`<img alt="Don't abuse it!">`;
const matches = html.match(/<img alt=(?:['"]).*?\1/g);
console.log('matches: '+ matches);//output was null
Again, if I understand the text correctly, the result should not be 'null'. The text doesn't say exactly what the result should be.
I'm at a loss trying to figure out why when I run this in Node.js it keeps giving me 'null' for these two examples. Anyone have any insight?

The problem is that your group there is
(?:['"])
The ?: indicates that it's a non-capturing group, which means that you can't backreference the group (or get the group in your match result). Use plain parentheses instead to indicate that the group should be captured:
const html = `<img alt='A "simple" example,'>` +
`<img alt="Don't abuse it!">`;
const matches = html.match(/<img alt=(['"]).*?\1/g);
console.log('matches: '+ matches);
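A minimal sketch of the difference, runnable in Node.js, that puts the two variants side by side. The variable names withCapture and withoutCapture are illustrative and do not come from the book:

```javascript
const html = `<img alt='A "simple" example,'>` +
  `<img alt="Don't abuse it!">`;

// Capturing group: \1 refers back to whichever quote character was matched.
const withCapture = html.match(/<img alt=(['"]).*?\1/g);

// Non-capturing group: there is no group 1 for \1 to refer to.
const withoutCapture = html.match(/<img alt=(?:['"]).*?\1/g);

console.log(withCapture);    // two matches, one per <img> tag
console.log(withoutCapture); // null
```

Each match stops at the closing quote of the same kind that opened the attribute, so the quote of the other kind inside the value is skipped over.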

Looks like an error in the book.
The regexes in the code snippets use non-capturing groups (see: What is a non-capturing group? What does (?:) do?).
These are not usable with back references. Use normal parentheses instead:
const promo = "Opening for XAAX is the dynamic GOOG! At the box office now!";
const bands = promo.match(/([A-Z])([A-Z])\2\1/g);
console.log('bands: '+ bands);//output: bands: XAAX,GOOG
The same goes for the other samples...
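As a side note on why the book's version returns null rather than throwing: as far as I know, in a JavaScript regex without the u flag, a numbered escape such as \1 that has no matching capturing group is treated as a legacy octal escape (the character U+0001). The pattern is therefore valid but never matches ordinary text. With the u flag the same pattern is rejected outright. A small sketch, with illustrative variable names:

```javascript
// Without the u flag, \1 with no capturing group means the character U+0001.
const legacy = /(?:[A-Z])\1/;
console.log(legacy.test("AA"));      // false
console.log(legacy.test("A\u0001")); // true

// With the u flag, the dangling backreference is a SyntaxError.
let unicodeError = null;
try {
  new RegExp("(?:[A-Z])\\1", "u");
} catch (err) {
  unicodeError = err;
}
console.log(unicodeError instanceof SyntaxError); // true
```

Adding the u flag while experimenting is one way to have typos like this reported early instead of silently producing null.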
Update: I have checked the original source (3rd edition) and can confirm that all the samples are wrong and use non-capturing groups.
BTW: The author writes:
Grouping enables another technique called backreferences. In my
experience, this is one of the least used regex features, but there is
one instance where it comes in handy. ...
The only time I think I have ever needed to use backreferences (other
than solving puzzles) is matching quotation marks. In HTML, you can
use either single or double quotes for attribute values.
And then follows the HTML regex sample shown in the OP. Cthulhu is calling?

Related

Catastrophic backtracking in regular expression

I am using the regular expression below:
[^\/]*([A-Za-z]+_([A-Za-z]+|-?[A-Z0-9]+(\.[A_Z0-9]+)?|(?:_|:|:-|-[a-zA-Z]+|\.[a-zA-Z]+|[A-Z0-9a-z]+|=|\s|\?|\%|\.|!|#|\*)?)+(?=(,|\/)))+|,[^\/]*
and it reports catastrophic backtracking when I try to match it against this input string:
/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,l_text:Neucha_26_bold:Loremipsum./l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/1488800313_DSC_0334__3_.JPG_mweubp.jpg
The expected output array of the matching regex will be like
[ 'w_100',
'h_500',
'e_saturation:50,e_tint:red:blue',
'c_crop,a_100,l_text:Neucha_26_bold:Loremipsum.',
'l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc' ]
I don't want the image name 1488800313_DSC_0334__3_.JPG_mweubp.jpg to be included in the matches.
Is there any way to avoid this backtracking, or can you suggest a better regex for my input string?
The problem
You use a lot of alternations when a character class would be more effective. Also, you're getting the catastrophic backtracking due to the following quantifier:
[^\/]*([A-Za-z]+_([A-Za-z]+|-?[A-Z0-9]+(\.[A_Z0-9]+)?|(?:_|:|:-|-[a-zA-Z]+|\.[a-zA-Z]+|[A-Z0-9a-z]+|=|\s|\?|\%|\.|!|#|\*)?)+(?=(,|\/)))+|,[^\/]*
^
It tries each of your alternations in turn, but keeps backtracking and never gets past them (the effect can look like an infinite loop). In your case, the regex is so inefficient that it times out. I removed half of your pattern, and it still takes half a second and almost 200K steps to complete.
Original Answer
How can it be fixed?
First step is to fix the quantifier and prevent it from continuously backtracking. This is actually quite easy, just make it possessive: + becomes ++. Changing the quantifier to possessive yields a pattern that takes about 56ms to complete and approx 9K steps (on my computer)
Second step is to improve the efficiency of the pattern. Change your alternations to character classes where possible.
(?:_|:|:-|-[a-zA-Z]+|\.[a-zA-Z]+|[A-Z0-9a-z]+|=|\s|\?|\%|\.|!|#|\*)?
# should instead be
(?::-|[_:-=\s?%.!#*]|[-.][a-zA-Z]+|[A-Z0-9a-z]+)?
It's much shorter, much more concise and less prone to errors.
The new pattern
See regex in use here
This pattern only takes 271 steps and less than one millisecond to complete (yes, using PCRE engine, works in Java too)
(?<=[,\/])[A-Za-z]+_(?:[A-Z0-9a-z]+|-?[A-Z0-9]+(?:\.[A-Z0-9]+)?|:-|[_:-=\s?%.!#*]|[-.][a-zA-Z]+)++
I also changed your positive lookahead to a positive lookbehind (?<=[,\/]) to improve performance.
Additionally, if you don't need all the specific logic, you can quite simply use the following regex (just under half as many steps as my regex above):
See regex in use here
(?<=[,\/])[A-Za-z]+_[^,\/]+
Results
This results in the following array:
P.S. I'm assuming there's a typo in your expected output and that the / between l_text and l_fetch should also be split on; needs clarification.
w_100
h_500
e_saturation:50
e_tint:red:blue
c_crop
a_100
l_text:Neucha_26_bold:Loremipsum.
l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc
Edit #1
The OP clarified the expected results. I added , to the character class in the fourth option of the non-capture group:
See regex in use here
(?<=[,\/])[A-Za-z]+_(?:[A-Z0-9a-z]+|-?[A-Z0-9]+(?:\.[A-Z0-9]+)?|:-|[_:-=\s?%.!#*,]|[-.][a-zA-Z]+)++
And in its shortened form:
See regex in use here
(?<=\/)[A-Za-z]+_[^\/]+
Results
This results in the following array:
w_100
h_500
e_saturation:50,e_tint:red:blue
c_crop,a_100,l_text:Neucha_26_bold:Loremipsum.
l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc
Edit #2
The OP presented another input and identified issues with Edit #1 related to that input. I added logic to force a fail on the last item in a string.
New test string:
/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,l_text:Neucha_26_bold:Loremipsum./l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/sample_url_image.jpg
See regex in use here
(?<=\/)(?![A-Za-z]+_[^\/]+$)[A-Za-z]+_[^\/]+
Same results as in Edit #1.
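The pattern from Edit #2 can also be tried directly in JavaScript, assuming an engine that supports lookbehind assertions (recent versions of Node.js and current browsers do). A minimal sketch; the names url and segments are illustrative:

```javascript
const url = "/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100," +
  "l_text:Neucha_26_bold:Loremipsum./" +
  "l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/" +
  "sample_url_image.jpg";

// (?<=\/)   the segment must start right after a slash
// (?!...$)  reject the segment if it would run to the end of the string
const segments = url.match(/(?<=\/)(?![A-Za-z]+_[^\/]+$)[A-Za-z]+_[^\/]+/g);

console.log(segments);
```

The trailing file name is skipped because the negative lookahead sees that it reaches the end of the input; every other segment is followed by another slash and is kept.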
PCRE version (if anyone is looking for it) - more efficient than the method above:
See regex in use here
(?<=\/)[A-Za-z]+_[^\/]+(?:$(*SKIP)(*FAIL))?
Assuming your example has a typo, i.e. the last / should be split on too:
You can simply split on /, then filter out the .jpg items:
function splitWithFilter(line, filter) {
  var filterRe = filter ? new RegExp(filter, 'i') : null;
  return line
    .replace(/^\//, '') // remove leading /
    .split(/\//)
    //.filter(Boolean) // filter out empty items (alternative to above replace())
    .filter(function(item) {
      return !filterRe || !item.match(filterRe);
    });
}
var str = "/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,l_text:Neucha_26_bold:Loremipsum./l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/1488800313_DSC_0334__3_.JPG_mweubp.jpg";
console.log(JSON.stringify(splitWithFilter(str, '\\.jpg$'), null, ' '));
Expected output:
[
"w_100",
"h_500",
"e_saturation:50,e_tint:red:blue",
"c_crop,a_100,l_text:Neucha_26_bold:Loremipsum.",
"l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc"
]

Regex get all quoted words that are not also single quoted

Would it be possible to get all quoted text with a single regex?
Example text from regexr:
Edit the "Expression" & Text to see matches. Roll over "matches" or the expression for details.
Undo mistakes with ctrl-z.
Save 'Favorites & "Share" expressions' with friends or the Community. "Explore" your results with Tools. A full Reference & Help is available in the Library, or watch the video Tutorial.
In this case I would like to capture Expression, matches and Explore but not Share since 'Favorites & "Share" expressions' is single quoted.
You can't build a regex that matches only the parts you want in JavaScript; however, you can build a pattern that matches the whole string without gaps and use a capture group to extract the part you want:
/[^"']*(?:'[^']*'[^"']*)*"([^"]*)"/g
#^----------------------^ all that isn't content between double quotes
Since your string may end with something like abcd 'efgh "ijkl" mnop' qrst (in short, without the part you want but with a double-quoted part inside a single-quoted substring), it's safer to change the pattern to:
/[^"']*(?:'[^']*(?:'[^"']*|$))*(?:"([^"]*)"|$)/g
and to discard the last match.
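The capture-group idea can be sketched with the first pattern above and String.prototype.matchAll. The sample text is a shortened copy of the question's input, and the names text, pattern and quoted are illustrative:

```javascript
const text = `Edit the "Expression" & Text to see matches. Roll over "matches" or the expression for details.
Undo mistakes with ctrl-z.
Save 'Favorites & "Share" expressions' with friends or the Community. "Explore" your results with Tools.`;

// Everything before the capture group consumes text that must be skipped,
// including complete single-quoted sections; group 1 holds the wanted part.
const pattern = /[^"']*(?:'[^']*'[^"']*)*"([^"]*)"/g;
const quoted = Array.from(text.matchAll(pattern), (m) => m[1]);

console.log(quoted); // [ 'Expression', 'matches', 'Explore' ]
```

"Share" is absent from the result because the single-quoted section that contains it is consumed as a whole before the capture group is reached.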
Without a special regex pattern:
var mystr = "Edit the \"Expression\" & Text to see matches. Roll over \"matches\" or the expression for details. Undo mistakes with ctrl-z. Save 'Favorites & \"Share\" expressions' with friends or the Community. \"Explore\" your results with Tools. A full Reference & Help is available in the Library, or watch the video Tutorial.";
var myarr = mystr.split(/"/g);
var opening = false;
for (var i = 1; i < myarr.length; i = i + 2) {
  // an odd number of ' in the preceding item toggles the "inside single quotes" state
  if ((myarr[i-1].length - myarr[i-1].replace(/'/g, "").length) % 2 === 1) { opening = !opening; }
  if (!opening) { console.log(myarr[i]); }
}
How it works:
split the text on "
items at odd indexes are the strings that were wrapped in "
if an odd number of ' appears in the item just before one of these, the single-quoted state flips; items found while inside single quotes are skipped

Back Reference Fault in JavaScript Regex

I am teaching myself regex using a combination of two websites, Eloquent JavaScript and regular-expressions.info. I am trying to use backreferences in a self-made example in which I want to roughly test the syntactic correctness of a Java while loop (assuming we limit it to while( value operator value) for the sake of simplicity).
However, take a look at my code below and you will see that the reference \1 does not appear to work. I've tried my solution in JS and also in the software tool The Regex Coach.
Can anyone see the problem here?
var rx = /^while\s*\((\s*[a-zA-Z][a-zA-Z0-9_]*\s*)(\<\=|\<|\>\=|\>|\!\=|\=\=)\s*\1\)/
document.writeln(rx.test("while(x <= y)"));
Your regex would match
while(x <= x )
because \1 matches the exact text that was matched by the first capturing group - which in this case is "x ". And since "y" isn't the same as "x ", your regex fails on the example you've chosen.
For your example, the following would work:
var rx = /^while\s*\(\s*([a-zA-Z]\w*)\s*(<=?|>=?|!=|==)\s*([a-zA-Z]\w*)\s*\)$/
Note that \w is a shorthand for [a-zA-Z0-9_] in JavaScript, and that you don't need to escape all those symbols.
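A short sketch that runs both patterns against a few inputs makes the difference visible. The names original and corrected are illustrative; the regexes are the ones from the question and from this answer:

```javascript
const original = /^while\s*\((\s*[a-zA-Z][a-zA-Z0-9_]*\s*)(\<\=|\<|\>\=|\>|\!\=|\=\=)\s*\1\)/;
const corrected = /^while\s*\(\s*([a-zA-Z]\w*)\s*(<=?|>=?|!=|==)\s*([a-zA-Z]\w*)\s*\)$/;

// \1 demands a repeat of the exact text "x " captured by group 1.
console.log(original.test("while(x <= y)"));  // false
console.log(original.test("while(x <= x )")); // true

// The corrected pattern matches a second identifier with its own group.
console.log(corrected.test("while(x <= y)")); // true
console.log(corrected.test("while(x <= )"));  // false

const parts = corrected.exec("while(x <= y)").slice(1);
console.log(parts); // [ 'x', '<=', 'y' ]
```

A backreference repeats previously matched text; it does not reuse the sub-pattern. To reuse a sub-pattern, it has to be written out again, as the corrected regex does.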

Match attribute value of XML string in JS

I've searched Stack Overflow and found similar questions, but none of them is really what I want.
Given an XML string: "<a b=\"c\"></a>" in a JavaScript context, I want to create a regex that will capture the attribute value including the quotation marks.
NOTE: this is similar if you're using single quotation marks.
Currently I have a regular expression tailored to the XML specification:
[_A-Za-z][\w\.\-]*(?:=\"[^\"]*\")?
[_A-Za-z][\w\.\-]* //This will match the attribute name.
(?:=\"[^\"]*\")? //This will match the attribute value.
\"[^\"]*\" //This part concerns me.
My question now is, what if the xml string looks like this:
<shout statement="Hi! \"Richeve\"."></shout>
I know this is a dumb question to ask but I just want to capture rare cases that this scenario might happen (I know the coder can use single quotes on this scenario) but there are cases that we don't know the current value of the attribute given that the attribute value changes dynamically at runtime.
So to make this clearer, the result of that using the correct regex should be:
"Hi! \"Richeve\"."
I hope my question is clear. Thanks for all the help!
PS: Note that the language context is JavaScript, and I know it is tempting to use lookbehinds, but lookbehinds are currently not supported.
PS: I know it is really hard to parse XML, but I have an elegant solution for that :) so I just need this small problem solved. The only focus of this question is capturing quoted string tokens that contain quotation marks inside the token.
The standard pattern for content with matching delimiters and embedded escaped delimiters goes like this:
"[^"\\]*(?:\\.[^"\\]*)*"
Ignoring the obvious first and last characters in the pattern, here's how the rest of the pattern works:
[^"\\]*: Consume all characters until a delimiter OR backslash (matching Hi! in your example)
(?:\\.[^"\\]*)* Try to consume a single escaped character \\. followed by a series of non delimiter/backslash characters, repeatedly (matching \"Richeve first and then \". next in your example)
That's it.
You can try to use a more generic delimiter approach using (['"]) and back references, or you can just allow for an alternate pattern with single quotes like so:
("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')
Here's another description of this technique that might also help (see the section called Strings): http://www.regular-expressions.info/examplesprogrammer.html
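A minimal sketch of the pattern applied to the question's sample. String.raw is used so that the backslashes stay in the string exactly as they appear in the XML text; the names xml, quotedValue and found are illustrative:

```javascript
// String.raw keeps the backslashes exactly as they appear in the XML text.
const xml = String.raw`<shout statement="Hi! \"Richeve\"."></shout>`;

// Double-quoted or single-quoted value, each allowing backslash-escaped characters.
const quotedValue = /"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'/;

const found = xml.match(quotedValue);
console.log(found[0]); // "Hi! \"Richeve\"."
```

Because an escaped character is always consumed as a pair (backslash plus the next character), an escaped quote can never be mistaken for the closing delimiter.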
Description
I'm pretty sure embedding double quotes inside a double-quoted attribute value is not legal XML. You could use a character reference for the double quote, such as &quot; or &#x22;, inside the value.
However, to answer the question, this expression will:
allow escaped quotes inside attribute values
capture the statement attribute's value
allow attributes to appear in any order inside the tag
avoid many of the edge cases which will trip up pattern matching inside HTML text
avoid lookbehinds
<shout\b(?=\s)(?=(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*?\sstatement=(['"])((?:\\['"]|.)*?)\1(?:\s|\/>|>))(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>.*?<\/shout>
Example
Pretty Rubular
Ugly RegexPlanet set to Javascript
Sample Text
Note the difficult edge case in the first attribute :)
<shout onmouseover=' statement="He said \"I am Inside the onMouseOver\" " ; if ( 6 > a ) { funRotate(statement) } ; ' statement="Hi! \"Richeve\"." title="sometitle">SomeString</shout>
Matches
Group 0 gets the entire tag from open to close
Group 1 gets the quote surrounding the statement attribute value, this is used to match the closing quote correctly
Group 2 gets the statement attribute value which may include escaped quotes like \" but not including the surrounding quotes
[0][0] = <shout onmouseover=' statement="He said \"I am Inside the onMouseOver\" " ; if ( 6 > a ) { funRotate(statement) } ; ' statement="Hi! \"Richeve\"." title="sometitle">SomeString</shout>
[0][1] = "
[0][2] = Hi! \"Richeve\".

Regex not detecting swear words inside string?

I have a script here:
http://jsfiddle.net/d2rcx/
It has an array of badWords to compare input strings with.
This script works fine if the input string exactly matches the swear word, but it does not pick up any variations where there are more characters in the string, e.g. whitespace before the swear word.
Using this site as reference : http://www.zytrax.com/tech/web/regex.htm
It said the following regex would detect a string within a string.
var regex = new RegExp("/" + badWords[i] + "/g");
if (fieldValue.match(regex) == true)
return true;
However, that does not seem to be the case.
What do I need to change in the regex to make it work?
Thanks
Also, any links that explain regex better than what Google turns up would be appreciated.
Here's a corrected JSFiddle:
http://jsfiddle.net/d2rcx/5/
See the following documentation for RegExp:
https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Global_Objects/RegExp
Note that the second parameter is where you should specify your flags (e.g. 'g' or 'i'). For example:
new RegExp(badWords[i], 'gi');
Please review http://www.w3schools.com/jsref/jsref_match.asp and note that fieldValue.match() will not return a boolean but an array of matches (or null when there is no match).
Rather than using a regex to do this, you are probably better off just looping through an array of bad words and looking for instances within a string using indexOf (https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Global_Objects/String/indexOf).
Otherwise you could make a regex like...
/badword1|badword2|badword3/
and just check for any match.
A word boundary in regex is \b, so you could say
/\b(badword1|badword2|badword3)\b/
which will match only whole words, i.e. Scunthorpe will be ok :)
var rx = new RegExp("\\b(donkey|twerp|idiot)\\b","i"); // i = case-insensitive option
alert(rx.test('you are a twerp')); //true
alert(rx.test('hello idiotstick')); //false - not a whole word
alert(rx.test('nice "donkey"')); //true
http://jsfiddle.net/F8svC/
Changing this, which requires a loop:
var regex = new RegExp("/" + badWords[i] + "/g");
for this:
var regex = new RegExp(badWords.join("|"), "g");
would be a start. This will do all the matches in one go because the array becomes one string with each element separated by pipes.
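Putting the points from these answers together, a hedged sketch might look like the following. The word list is borrowed from the example above; escapeRegExp, badWordRe and containsBadWord are hypothetical helper names, not part of the original fiddle:

```javascript
const badWords = ["donkey", "twerp", "idiot"];

// Escape regex metacharacters so each word is matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// The pattern goes in the first argument without slashes; flags go in the second.
const badWordRe = new RegExp(
  "\\b(?:" + badWords.map(escapeRegExp).join("|") + ")\\b",
  "i"
);

// test() returns a boolean; match() returns an array or null.
function containsBadWord(text) {
  return badWordRe.test(text);
}

console.log(containsBadWord("  you are a Twerp")); // true
console.log(containsBadWord("hello idiotstick"));   // false
```

The g flag is left out on purpose: a global regex keeps its position in lastIndex between test() calls, which can give surprising results when the same object is reused for several inputs.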
P.S.
Reference guide for RegEx here. But there isn't a lot of clear information online about what is and isn't possible with respect to certain functions nor what's good code. I've found a couple of books most useful: David Flanagan's latest JavaScript: The Definitive Guide and Douglas Crockford's JavaScript: The Good Parts for the most usable subset of JavaScript, including RegEx. The railroad diagrams by Crockford are especially good but I'm not sure if they're available online anywhere.
EDIT: Here's an online copy of the relevant chapter including some of those railroad diagrams I mentioned, in case it helps.
