JavaScript RegEx - Match quoted string - Possibly unexpected result?

Why does this:
console.log(/^(['"])(?:(?:\\[^])|[^\\])*\1/.test('"\"'))
result in true? Is this expected behavior or a bug? If it's expected, how can I get the intended result, which is false? The closing quote in the example is escaped, so it shouldn't be matched. Maybe I made a mistake in writing the RegEx, in which case I hope someone can kindly point out the error to me...
For the uninitiated, the above regular expression in JavaScript is intended to match only a complete single- or double-quoted string that may or may not contain backslash-escaped special characters. "Complete" here means that the matched portion should be a complete quoted string, not that the whole input string should be one. Nested levels of escaped strings may be present. Also, for simplicity and as a requirement, the match starts from the beginning of the input string; otherwise a match could incorrectly start from an already escaped quote.
Tested in Firefox 82.0.2 and Edge 86.0.622.63

Ah, never mind! I figured out that the problem is not in the RegEx, but in the way I crafted the input string. The way I've written it, the outer string interprets the escape instead of the backslash acting as an escape for the inner string! The correct way to write it is to escape the backslash, so the above code should be rewritten as:
console.log(/^(['"])(?:(?:\\[^])|[^\\])*\1/.test('"\\"'))
So, the result is as expected after all, and not a bug!
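
A minimal sketch of the difference between the two inputs, using only the regex from the question. The variable names are illustrative, and String.raw is shown only as one optional way to avoid doubling the backslash:

```javascript
const quoted = /^(['"])(?:(?:\\[^])|[^\\])*\1/;

// In a normal string literal, \" is just an escaped quote, so the backslash disappears.
const accidental = '"\"';  // two characters: quote, quote
const intended = '"\\"';   // three characters: quote, backslash, quote

console.log(accidental.length, quoted.test(accidental)); // 2 true
console.log(intended.length, quoted.test(intended));     // 3 false

// String.raw keeps backslashes exactly as written.
console.log(String.raw`"\"` === intended);               // true
```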

Related

How to Escape the unescaped double quotes in a CSV string in Node

This is very similar to
Regular expression to find unescaped double quotes in CSV file
However, the solutions presented don't work with Node.js's regex engine. Given a CSV string where columns are quoted with double quotes, but some columns have unescaped double quotes in them, what regex could be used to match these unescaped quotes and remove them?
Example rows
"123","","SDFDS SDFSDF EEE "S"","asdfas","b","lll"
"123","","SDFDS SDFSDF EEE "S"","asdfas","b","lll"
So the two double quotes surrounding the S in the third column would get matched and removed. Needs to work in Node.js (14.16.1)
I have tried (?m)""(?![ \t]*(,|$)) but get an Invalid regular expression: /(?m)""(?![ \t]*(,|$))/: Invalid group exception

I don't know much about Node.js, but assuming it uses the JavaScript flavor of regex, I have the following comments about the example you took from the prior answer:
I think your example is choking on the first element, (?m), which is unsupported in JavaScript. However, that part is not essential to your task. It only turns on multiline processing, and you don't need that if you feed the regex engine each line individually. If you still want to feed it a multiline string, you can turn on multiline mode in JavaScript with the "m" flag after the final delimiter: /myregex/m. All of the other elements, including the negative lookahead, are supported by JavaScript and probably by your engine as well. So drop the (?m) part of your expression and try again.
Even after you get it to work, the example row you provided will not be parsed according to your expectations by the sample regular expression. Its function is to identify all occurrences of two double-quotes that are not followed by a comma (or end of string). The ONLY two occurrences of doubled quotes in your example each have a comma after, so you will get no matches on this regex in your example.
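
A short sketch of both points, assuming the example row from the question. The inline (?m) is replaced by the m flag, and the resulting pattern finds nothing in that row. The second string is a made-up row added only to show a case where the pattern does match:

```javascript
const row = '"123","","SDFDS SDFSDF EEE "S"","asdfas","b","lll"';

// The m flag after the closing delimiter replaces the unsupported inline (?m).
const doubled = /""(?![ \t]*(,|$))/gm;

// Both "" pairs in the row are followed by a comma, so the lookahead rejects them.
console.log(row.match(doubled)); // null

// Hypothetical row where the doubled quotes are not followed by a comma.
console.log('"a ""b"" c"'.match(doubled).length); // 2
```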
It seems like you want some context-sensitive scanning to match and remove the inner pairs of double quotes while leaving the outer ones in place and handling commas inside your strings and possibly correctly quoted double quotes. Regular expression engines are really bad at this kind of processing and I don't think you are going to get satisfactory results whatever you come up with.
You can get an approximate solution to your problem by using one regex to parse the individual elements of the .csv, stripping the outer quotes as you go, and then running a second regex against each parsed element to either remove single occurrences of a double quote or add a second double quote where necessary. Then you can reassemble the string under program control.
This still will break if someone embeds a "", sequence in a data field string, so it's not perfect but it might be good enough for you.
The regex for splitting the .csv and stripping the double quotes is:
/(("(.*?)")|([^,]*))(,|$)/gm
This will accept either a "anything", OR a anything, repeatedly until the source is exhausted. Because of the capturing groups, the parsed text will be either in $3 (if the field was quoted) or in $4 (if it was not quoted), but not both.
Here's a regexpReplace of your string with $3$4 and a semicolon after each iteration (I took the liberty of adding a numeric field without the quotes so you could see that it handles both cases):
"123","","SDFDS SDFSDF EEE "S"",456,"asdfas","b","lll"
RegexpReplace(<above>,"((""(.*?)"")|([^,]*))(,|$)","$3$4;")
=> 123;;SDFDS SDFSDF EEE "S";456;asdfas;b;lll;;
See how the outer quotes have been stripped away. Now it's a simple thing to go through all the matches to remove all the remaining quotes, and then you can reconstruct the string from the array of matches.
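
The two-pass idea above could be sketched in JavaScript roughly as follows. This is only an illustration under some assumptions: cleanRow is a hypothetical helper name, rows are processed one at a time (so the m flag is dropped), and every quote left inside a field is simply removed, which also strips legitimately doubled quotes:

```javascript
// Splitter from the answer above. Group 3 holds a quoted field's content,
// group 4 holds an unquoted field, group 5 is the comma or end of input.
const FIELD = /(("(.*?)")|([^,]*))(,|$)/g;

// Hypothetical helper: strips stray quotes inside fields and rebuilds the row.
function cleanRow(row) {
  const fields = [];
  for (const m of row.matchAll(FIELD)) {
    const quoted = m[3] !== undefined;
    const text = (quoted ? m[3] : m[4]).replace(/"/g, '');
    fields.push(quoted ? `"${text}"` : text);
    // An empty group 5 means this field ended at the end of the input,
    // so stop before the pattern produces a trailing empty match.
    if (m[5] === '') break;
  }
  return fields.join(',');
}

console.log(cleanRow('"123","","SDFDS SDFSDF EEE "S"",456,"asdfas","b","lll"'));
// "123","","SDFDS SDFSDF EEE S",456,"asdfas","b","lll"
```

Stopping at the first field that ends at the end of the input keeps a genuine trailing empty field (a row ending in a comma) while dropping the extra empty match seen at the end of the RegexpReplace output above.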

How to prevent regex characters from being changed after page is rendered?

I'm stuck after searching and trying several tests, and I can't figure out how to fix the following issue.
I use the characters \x3c, \x3e and \x22 in a regex and save it in a variable in *.component.ts, but when I use the variable in the markup/HTML, they turn into <, > and ". As a result, my pattern doesn't work as expected.
Here is one of my tests on regex101.com; as you can see, it works as it should:
^(?=.*[a-zA-Z\d!\x22#$%&\'()*+,.:;\x3c=\x3e?#[\]^_`{|}~/\\-])[A-Za-z\d!\x22#$%&\'()*+,.:;\x3c=\x3e?#[\]^_`{|}~/\\-]{8,50}$
How can I prevent this and keep the characters as they are in the original when the page is rendered? Is it a behavior of TypeScript or JavaScript browser engine or what? Any hint would be great.

First of all, you need to use double backslashes to introduce literal backslashes into the regex patterns. I.e. if you write "\x22" as a string literal, it is in fact a mere ". So, to define \x22 in a string literal, write "\\x22".
Then, you have
^(?=.*[a-zA-Z\d!\x22#$%&\'()*+,.:;\x3c=\x3e?#[\]^_`{|}~/\\-])[A-Za-z\d!\x22#$%&\'()*+,.:;\x3c=\x3e?#[\]^_`{|}~/\\-]{8,50}$
The lookahead here is redundant because it requires the same set of chars as is required by the consuming part. The lookahead can be removed, or better replaced with the one you need, (?=[^A-Z]*[A-Z]), requiring at least 1 uppercase ASCII letter:
^(?=[^A-Z]*[A-Z])[A-Za-z\d!\x22#$%&\'()*+,.:;\x3c=\x3e?#[\]^_`{|}~/\\-]{8,50}$
As a string literal:
"^(?=[^A-Z]*[A-Z])[A-Za-z\\d!\\x22#$%&'()*+,.:;\\x3c=\\x3e?#[\\]^_`{|}~/\\\\-]{8,50}$"
See the regex demo.
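
A small sketch of why the doubled backslashes matter, assuming the pattern is built from a string and passed to the RegExp constructor. The sample inputs are made up for illustration:

```javascript
// In a string literal "\x22" is already a double quote; "\\x22" keeps the escape text.
console.log("\x22" === '"');   // true
console.log("\\x22".length);   // 4

// String literal from the answer above, compiled with the RegExp constructor.
const patternText = "^(?=[^A-Z]*[A-Z])[A-Za-z\\d!\\x22#$%&'()*+,.:;\\x3c=\\x3e?#[\\]^_`{|}~/\\\\-]{8,50}$";
const re = new RegExp(patternText);

console.log(re.test('Passw0rd!')); // true  (has an uppercase letter, 9 allowed characters)
console.log(re.test('password1')); // false (no uppercase letter)
console.log(re.test('Ab<>"123'));  // true  (<, > and " are allowed)
console.log(re.test('Short1A'));   // false (only 7 characters)
```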

Unable to find a string matching a regex pattern

While trying to submit a form, a JavaScript regex validation always evaluates to false for a string.
Regex:- ^(([a-zA-Z]:)|(\\\\{2}\\w+)\\$?)(\\\\(\\w[\\w].*))+(.jpeg|.JPEG|.jpg|.JPG)$
I have tried following strings against it
abc.jpg,
abc:.jpg,
a:.jpg,
a:asdas.jpg,
What string could possibly match this regex?

This regex won't match anything because of that $? in the middle of the pattern.
Apparently, using the optional quantifier ? on the end-of-string anchor $ is not correct (if you paste it into https://regex101.com/ it does give you an error). If the JavaScript parser ignores the error and keeps the regex as it is, you are still trying to match the end of the string in the middle of a string that is supposed to continue.
With the escaping removed, it was supposed to match \$ (a literal dollar sign), but as written it won't work.
If you want your string to be accepted at any cost, you can probably use Firebug or a similar developer tool and edit the string inside the JavaScript code (this assumes there's no server-side check too, and that it isn't wrong as well). If you ignore the $? then a matching string will be \\\\w\\\\ww.jpg (but since the . is unescaped, even \\\\w\\\\ww%jpg is a match)
Of course, I wrote this answer assuming the escaping is indeed the one you showed in the question. If you need to find a matching pattern for the correctly escaped one ^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))+(\.jpeg|\.JPEG|\.jpg|\.JPG)$ then you can use this tool to find one http://fent.github.io/randexp.js/ (though it will find weird matches). A matching pattern is c:\zz.jpg

If you are just looking for a regular expression to match what you got there, go ahead and test this out:
(\w+:?\w*\.[jpe?gJPE?G]+,)
That should match exactly what you are looking for. Remove the optional comma at the end if you feel like it, of course.

If you remove one level of escaping, the actual regex is
^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))+(.jpeg|.JPEG|.jpg|.JPG)$
After the ^ anchor comes the first group, (([a-zA-Z]:)|(\\{2}\w+)\$?), which matches either a letter followed by a colon, or two backslashes followed by one or more word characters and an optional literal $. Some of the parentheses inside are unnecessary.
The second part, (\\(\w[\w].*))+, matches a backslash followed by two word characters. \w[\w] looks odd because it's equivalent to \w\w (the second \w doesn't need a character class). That is followed by any number of any characters. This whole group repeats one or more times.
In the last part, (.jpeg|.JPEG|.jpg|.JPG), the author probably forgot to escape the dot to match it literally; \. should be used. This part can be reduced to \.(JPE?G|jpe?g).
It would match something like
A:\12anything.JPEG
\\1$\anything.jpg
Play with it at regex101. A more readable version could be
^([a-zA-Z]:|\\{2}\w+\$?)(\\\w{2}.*)+\.(jpe?g|JPE?G)$
Also read the explanation on regex101 to understand any pattern, it's helpful!
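
As a quick check of the analysis above, here is a sketch that runs the unescaped original and the more readable version against the sample paths. Backslashes are doubled only because the paths are written as JavaScript string literals, and the last input is a made-up string that shows the effect of the unescaped dot:

```javascript
const original = /^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))+(.jpeg|.JPEG|.jpg|.JPG)$/;
const readable = /^([a-zA-Z]:|\\{2}\w+\$?)(\\\w{2}.*)+\.(jpe?g|JPE?G)$/;

const samples = [
  'A:\\12anything.JPEG',     // A:\12anything.JPEG
  '\\\\1$\\anything.jpg',    // \\1$\anything.jpg
  'c:\\zz.jpg',              // c:\zz.jpg
  'abc.jpg',                 // from the question, no backslash segment
  'a:asdas.jpg',             // from the question, no backslash after the colon
];

for (const s of samples) {
  console.log(s, original.test(s), readable.test(s));
}

// The unescaped dot in the original accepts any character before "jpg".
console.log(original.test('c:\\zzXjpg'), readable.test('c:\\zzXjpg')); // true false
```

The first three samples match both patterns, while the two strings from the question match neither.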

Cannot get a regex to work in JavaScript that allows whitespace and backslash

I have a regular expression as below. It should allow letters, digits, round brackets, square brackets, backslash and the following punctuation marks: period, comma, semicolon, colon, exclamation mark, percent sign and dash.
^[(a-z)(A-Z) .,;:!'%\-(0-9)(\\)\(\)[\]\s]+$
Question : I have tried this regular expression with some text at this online tester: https://regex101.com/r/kO5tW2/2, but it always comes up with no matches. What is causing the expression to fail in above case? To me, the string being tested should come back as valid, but it's not.

Your spec does not mention a question mark. However, the test text you give does include a question mark. You could have tested this easily enough by removing one character at a time from the test text until you got a match, which would have happened when you removed the question mark.
Either add the question mark to the regexp, or remove it from your test text.
Also, you do not need to (and should not) enclose ranges in parentheses.
In the below, I've also removed escaping for characters which do not need to be escaped:
^[a-zA-Z .,;:!'%\-0-9\\()[\]\s?]+$
                              ^
https://regex101.com/r/kO5tW2/4
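
A brief sketch comparing the pattern from the question with the corrected one. The test strings are made up for illustration, since the original test text is only available behind the regex101 link:

```javascript
const fromQuestion = /^[(a-z)(A-Z) .,;:!'%\-(0-9)(\\)\(\)[\]\s]+$/;
const withQuestionMark = /^[a-zA-Z .,;:!'%\-0-9\\()[\]\s?]+$/;

console.log(fromQuestion.test('Is this valid'));      // true
console.log(fromQuestion.test('Is this valid?'));     // false, ? is not in the class
console.log(withQuestionMark.test('Is this valid?')); // true

// Brackets, backslash and the listed punctuation are accepted; underscore is not.
console.log(withQuestionMark.test('a\\b [x] (y) 50%!')); // true
console.log(withQuestionMark.test('a_b'));               // false
```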

Try adding m (multiline) modifier to regex
If you have a string consisting of multiple lines, like first line\nsecond line (where \n indicates a line break), it is often desirable to work with lines, rather than the entire string. Therefore, all the regex engines discussed in this tutorial have the option to expand the meaning of both anchors. ^ can then match at the start of the string (before the f in the above string), as well as after each line break (between \n and s). Likewise, $ still matches at the end of the string (after the last e), and also before every line break (between e and \n). Source
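
A minimal sketch of the anchor behavior described above, using the same two-line example string:

```javascript
const text = 'first line\nsecond line';

// Without m, ^ and $ only match at the start and end of the whole string.
console.log(/^second/.test(text));         // false
console.log(text.match(/line$/g).length);  // 1

// With m, they also match after and before each line break.
console.log(/^second/m.test(text));        // true
console.log(text.match(/line$/gm).length); // 2
```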

RegExp in JavaScript, when a quantifier is part of the pattern

I have been trying to use a regexp that matches any text between a caret followed by a less-than sign and a greater-than sign followed by a caret.
So it would look like: ^< THE TEXT I WANT SELECTED >^
I have tried something like this, but it isn't working: ^<(.*?)>^
I'm assuming this is possible, right? I think the reason I have been having such a tough time is because the caret serves as a quantifier. Thanks for any help I get!
Update
Just so everyone knows, the following, from am not i am, worked:
/\^<(.*?)>\^/
But, it turned out that I was getting html entities since I was getting my string by using the .innerHTML property. In other words,
> ... &gt;
< ... &lt;
To solve this, my regexp actually looks like this:
\^&lt;(.*?)((.|\n)*)&gt;\^
This includes the fact that the string in between should be any character or new line. Thanks!

You need to escape the ^ symbol since it has special meaning in a JavaScript regex.
/\^<(.*?)>\^/
In a JavaScript regex, the ^ means beginning of the string, unless the m modifier was used, in which case it means beginning of the line.
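
A short sketch of the effect of escaping the carets, using the sample text from the question. The second string and the [\s\S] class are illustrative additions showing one common way to let the captured text span line breaks:

```javascript
const subject = '^< THE TEXT I WANT SELECTED >^';

// Unescaped, ^ is an anchor, so the pattern demands "<" at the very start.
console.log(/^<(.*?)>^/.test(subject));        // false

// Escaped, \^ matches a literal caret and group 1 holds the text in between.
console.log(subject.match(/\^<(.*?)>\^/)[1]);  // " THE TEXT I WANT SELECTED "

// . does not match line breaks by default; [\s\S] matches any character.
const multiline = '^<line one\nline two>^';
console.log(/\^<(.*?)>\^/.test(multiline));                                   // false
console.log(multiline.match(/\^<([\s\S]*?)>\^/)[1] === 'line one\nline two'); // true
```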

This should work:
\^<(.*?)>\^
In a regex, if you want to use a character that has a special meaning (caret, brackets, pipe, ...), you have to escape it using a backslash. For example, (\w\b)*\w\. will select a sequence of words terminated by a dot.
Careful!
If you have to pass the regex pattern as a string, i.e. there's no regex literal like in JavaScript or Perl, you may have to use a double backslash. The programming language reduces it to a single backslash, which is then processed by the regex engine.
Same regex in multiple languages:
Python:
import re
myRegex=re.compile(r"\^<(.*?)>\^") # The r before the string prevents backslash escaping
PHP:
$result=preg_match("/\\^<(.*?)>\\^/",$subject); // Notice the double backslashes here?
JavaScript:
var myRegex=/\^<(.*?)>\^/,
subject="^<blah example>^";
subject.match(myRegex);
If you tell us what programming language you're writing in, we'll be able to give you some finished code to work with.
Edit: Whoops, didn't even notice this was tagged as javascript. Then, you don't have to worry about double backslash at all.
Edit 2: \b represents a word boundary. Though I agree yours is what I would have used myself.
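
One caveat to the first edit: in JavaScript the double backslash still matters if the pattern is passed as a string to the RegExp constructor rather than written as a regex literal. A minimal sketch, with illustrative variable names:

```javascript
const subject = '^<blah example>^';

const fromLiteral = /\^<(.*?)>\^/;
const fromString = new RegExp('\\^<(.*?)>\\^');

// Both forms compile to the same pattern.
console.log(fromLiteral.source === fromString.source); // true
console.log(fromString.exec(subject)[1]);              // blah example

// With single backslashes the string literal drops them, leaving unescaped carets.
const broken = new RegExp('\^<(.*?)>\^');
console.log(broken.source);        // ^<(.*?)>^
console.log(broken.test(subject)); // false
```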
