I'm trying to write a regex that will match Oracle q-quotes for a PL/SQL lexer in code-prettify.js. For example,
q'[Here's Johnny]'
This should be matched the same as
'Here''s Johnny'
(that is, SQL should treat it all as one text string. The advantage of q-quotes over the conventional doubled single quotes is that you don't have to go through your text string doubling up every single quote.)
The quote delimiter can be any of [, {, <, or (, but I think if I can get it working with one bracket type then I can repeat the variations as ORs, like
/^(?:pattern1|pattern2|pattern3)/
Ultimately I want a single regex that will match an ordinary single-quoted string or a q-quote with any of the bracket types.
For your lexer you would like the text of q'[Here's Johnny]' and 'Here''s Johnny' to be matched. Assuming that you want the match to include all characters comprising the string token, including quotes, brackets, etc, this regular expression should work:
(?:q'\[.*?\](?=')'|q'<.*?>(?=')'|q'\(.*?\)(?=')'|q'{.*?}(?=')'|(?!q)'(?:[^']|'')*')
The two relevant pieces are:
q'\[.*?\](?=')' is the basis for q-quoted strings; each of the other bracket types has its own alternative of the same shape, and
'(?:[^']|'')*' matches ordinary single-quoted strings, treating '' as an embedded quote.
You can see matching examples here.
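As a rough illustration of how the pattern above behaves, here is a minimal JavaScript sketch. The names stringToken and matchStringToken are made up for this example, and the leading ^ and the escaped braces are my own adjustments (assumptions), not part of the answer's pattern.

```javascript
// Same alternatives as the pattern above; the leading ^ (lexer-style anchoring)
// and the escaped braces are assumptions added for this sketch.
const stringToken =
  /^(?:q'\[.*?\](?=')'|q'<.*?>(?=')'|q'\(.*?\)(?=')'|q'\{.*?\}(?=')'|(?!q)'(?:[^']|'')*')/;

// Hypothetical helper: returns the string token at the start of `source`, or null.
function matchStringToken(source) {
  const m = stringToken.exec(source);
  return m ? m[0] : null;
}

console.log(matchStringToken("q'[Here's Johnny]' || x")); // q'[Here's Johnny]'
console.log(matchStringToken("'Here''s Johnny' || x"));   // 'Here''s Johnny'
console.log(matchStringToken("q'{it's}' from dual"));     // q'{it's}'
console.log(matchStringToken("select 1 from dual"));      // null
```

Note that . does not match newlines in JavaScript unless the s flag is set, so this sketch only handles literals that fit on one line.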
Related
Need some help with a regex for nested double quotes
I have the following string:
"abcd-1234\":" : value\":1234\":
I want to capture the entire string and separate it into a key and value pair, but I am not able to come up with a proper regex.
Basically, I have the following string format -->
"key" : "value"
and I want to find a proper regex for the string format.
I am able to capture the key and value individually with the following regex -->
((^[\"]).*\2(?![^:]))
But I am not able to get a proper regex for the entire string.
Please, can someone help me with the regex.
Imagine the following string: "\\" - it contains \" but is still a complete, valid string. You can't just ignore \" - you have to count backslashes.
(?:[^"\\]|\\.) covers any 'in-the-string' character: either a backslash followed by anything (. is anything), or any character at all, as long as it isn't a backslash or a quote. A string is a quote, followed by any number of those, followed by a quote, which gives the regexp "(?:[^"\\]|\\.)*".
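A minimal JavaScript sketch of the idea described above, under some assumptions: the names STR, anyString and pair are hypothetical, and the sample line is a made-up, fully quoted variant of the question's input rather than the original string.

```javascript
// A quote, then any number of "in-string" units (a backslash plus any
// character, or a character that is neither a quote nor a backslash), then a quote.
const STR = String.raw`"(?:[^"\\]|\\.)*"`;

// All complete double-quoted strings in a piece of text.
const anyString = new RegExp(STR, 'g');

// Hypothetical pattern for a whole `"key" : "value"` line.
const pair = new RegExp(`^\\s*(${STR})\\s*:\\s*(${STR})\\s*$`);

const sample = String.raw`"\\" and "a \" b"`;
console.log(sample.match(anyString)); // two strings: "\\" and "a \" b"

// Made-up input modeled on the question's, with both sides fully quoted.
const line = String.raw`"abcd-1234\":" : "value\":1234\":"`;
const m = pair.exec(line);
console.log(m[1]); // "abcd-1234\":"
console.log(m[2]); // "value\":1234\":"
```

Because each escape is consumed as a two-character unit, an escaped backslash right before the closing quote does not hide that quote.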
However, regexps probably aren't the right tool for the job. This looks like a part of a JSON formatted input; there are JSON parsers that do a much better job on this, covering far more cases.
This is very similar to
Regular expression to find unescaped double quotes in CSV file
However, the solutions presented there don't work with Node.js's regex engine. Given a CSV string where columns are quoted with double quotes, but some columns have unescaped double quotes in them, what regex could be used to match these unescaped quotes and just remove them?
Example rows
"123","","SDFDS SDFSDF EEE "S"","asdfas","b","lll"
"123","","SDFDS SDFSDF EEE "S"","asdfas","b","lll"
So the two double quotes surrounding the S in the third column would get matched and removed. Needs to work in Node.js (14.16.1)
I have tried (?m)""(?![ \t]*(,|$)) but get an "Invalid regular expression: /(?m)""(?![ \t]*(,|$))/: Invalid group" exception.
I don't know much about node.js, but assuming it is like the JavaScript flavor of regex then I have the following comments about the example you took from the prior answer:
I think your example is choking on the first element, (?m), which is unsupported in JavaScript. However, that part is not essential to your task. It only turns on multiline processing, and you don't need that if you feed the regex engine each line individually. If you find you still want to feed it a multiline string, you can turn on multiline mode in JavaScript with the "m" flag after the final delimiter: /myregex/m. All of the other elements, including the negative lookahead, are supported by JavaScript and probably by your engine as well. So, drop the (?m) part of your expression and try it again.
Even after you get it to work, the example row you provided will not be parsed according to your expectations by the sample regular expression. Its function is to identify all occurrences of two double-quotes that are not followed by a comma (or end of string). The ONLY two occurrences of doubled quotes in your example each have a comma after, so you will get no matches on this regex in your example.
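To make the two points above concrete, here is a small sketch. The variable names are arbitrary, and the second row is a made-up example, not taken from the question.

```javascript
// The question's pattern with the inline (?m) replaced by the trailing m flag.
const doubled = /""(?![ \t]*(,|$))/gm;

// Example row from the question: both "" pairs are followed by a comma.
const row = '"123","","SDFDS SDFSDF EEE "S"","asdfas","b","lll"';
console.log(row.match(doubled)); // null

// Made-up row with doubled quotes in the middle of a field.
const other = '"1","say ""hi"" now"';
console.log(other.match(doubled).length); // 2
```

So the pattern compiles once (?m) is gone, but it finds nothing in the example row, because neither doubled quote there is followed by anything other than a comma.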
It seems like you want some context-sensitive scanning to match and remove the inner pairs of double quotes while leaving the outer ones in place and handling commas inside your strings and possibly correctly quoted double quotes. Regular expression engines are really bad at this kind of processing and I don't think you are going to get satisfactory results whatever you come up with.
You can get an approximate solution to your problem by using one regex to parse the individual elements of the .csv, stripping the outer quotes as you go, and then running a second regex against each parsed element to either remove single occurrences of a double quote or add a second double quote where necessary. Then you can reassemble the string under program control.
This still will break if someone embeds a "", sequence in a data field string, so it's not perfect but it might be good enough for you.
The regex for splitting the .csv and stripping the double quotes is:
/(("(.*?)")|([^,]*))(,|$)/gm
This will accept either "anything" or anything, each followed by a comma or end of line, repeatedly until the source is exhausted. Because of the capturing groups, the parsed text will be in either $3 (if the field was quoted) or $4 (if it was not quoted), but not both.
Here's a regexpReplace of your string with $3$4 and a semicolon after each iteration (I took the liberty of adding a numeric field without quotes so you can see that it handles both cases):
"123","","SDFDS SDFSDF EEE "S"",456,"asdfas","b","lll"
RegexpReplace(<above>,"((""(.*?)"")|([^,]*))(,|$)","$3$4;")
=> 123;;SDFDS SDFSDF EEE "S";456;asdfas;b;lll;;
See how the outer quotes have been stripped away. Now it's a simple thing to go through all the matches to remove all the remaining quotes, and then you can reconstruct the string from the array of matches.
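The same two-step approach could be sketched in JavaScript roughly as follows. The function name cleanRow is hypothetical, and the choice to drop every remaining inner quote and to re-quote only the fields that were quoted originally is an assumption about what "good enough" means here.

```javascript
// Splitting regex from above: $3 holds a quoted field, $4 an unquoted one.
const field = /(("(.*?)")|([^,]*))(,|$)/gm;

const row = '"123","","SDFDS SDFSDF EEE "S"",456,"asdfas","b","lll"';

// Step 1 on its own: same replacement as the RegexpReplace example.
const joined = row.replace(field, '$3$4;');
console.log(joined); // 123;;SDFDS SDFSDF EEE "S";456;asdfas;b;lll;;

// Steps 1 and 2 together (hypothetical helper): strip the outer quotes,
// remove any quotes left inside the field, then rebuild the row.
function cleanRow(line) {
  const out = [];
  for (const m of line.matchAll(field)) {
    const wasQuoted = m[3] !== undefined;
    const text = (wasQuoted ? m[3] : m[4]).replace(/"/g, '');
    out.push(wasQuoted ? `"${text}"` : text);
    if (m[5] !== ',') break; // this field ended at end of line
  }
  return out.join(',');
}

console.log(cleanRow(row));
// "123","","SDFDS SDFSDF EEE S",456,"asdfas","b","lll"
```

The break on a non-comma terminator skips the empty match at the end of the line, which is what produces the extra trailing semicolon in the replace output.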
I have the following fragment from a JSON string, and I want to extract only the text of the "value" part of the string, i.e. Service Copy
"customtext_230216":"self":"https://jsonapi.com/restsv/api/13/customtext/11233","value":"Service Copy","id":"11211"}
After several searches on SO I was able to find the regex
\b(?<=value)(.*?)(?=\s*id)
and the result is
":"Service Copy","
How can I enhance this so that the special characters are not captured?
Expected result:
Service Copy
If you want to stick with regex:
(?<=value\":\")(.*?)(?=\s*\",\"id)
Basically, I took out the word boundary at the beginning and moved the special characters (quotes, colons, etc.) out of the capturing group and into the lookbehind and lookahead.
However, there are better ways to extract values from JSON and I encourage you to look at those as this could potentially be a brittle solution.
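For reference, a short JavaScript sketch of both options. It assumes an engine with lookbehind support (recent Node.js versions have it); the variable names are arbitrary, and the small object passed to JSON.parse is a made-up, well-formed stand-in, since the fragment in the question is not valid JSON on its own.

```javascript
const text =
  '"customtext_230216":"self":"https://jsonapi.com/restsv/api/13/customtext/11233","value":"Service Copy","id":"11211"}';

// Regex option: the quotes and colon live in the lookarounds, so only the
// value text itself is captured.
const valueRe = /(?<=value":")(.*?)(?=\s*","id)/;
const found = text.match(valueRe);
console.log(found[1]); // Service Copy

// Parser option, on a hypothetical well-formed object.
const parsed = JSON.parse('{"value":"Service Copy","id":"11211"}');
console.log(parsed.value); // Service Copy
```

The regex depends on "id" directly following "value", which is one reason the parser route tends to be less brittle.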
I have the following JavaScript:
let strTest = `
"The issue": "L'oggetto ",
"issue": "oggetto",
"issue": 'oggetto "novo" ',
`;
I'm trying to tokenize a string like the one above.
My regexp attempt:
let regExp = /["'](.*?)["']\s*?:\s*?['"](.*?)["']/gm;
This works fine, except in the case where I have a pair of single quotes (') inside of double quotes (") or vice-versa.
Is this possible with only one regular expression?
Answering my own question: I think I came up with a smaller regex:
/["'](.*)["']\s*?:\s*?["'[](.*)["']]/g
Have a look at regex101.com/r/g9WCbi/1
You can use backreferences:
/(["'])(.*?)\1\s*?:\s*?(['"])(.*?)\3/gm
This captures the quotes as well, but you can drop them by taking only the even-numbered capture groups (2 and 4) from each match.
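A brief sketch of the backreference pattern applied to the question's input. The variable names pairRe and pairs are arbitrary choices for this example.

```javascript
const strTest = `
"The issue": "L'oggetto ",
"issue": "oggetto",
"issue": 'oggetto "novo" ',
`;

// \1 and \3 require the closing quote to be the same character as the opening one.
const pairRe = /(["'])(.*?)\1\s*?:\s*?(['"])(.*?)\3/gm;

// Keep only groups 2 and 4 (key and value); groups 1 and 3 are the quotes.
const pairs = Array.from(strTest.matchAll(pairRe), (m) => [m[2], m[4]]);
console.log(pairs);
```

Each entry of pairs is a two-element array of key and value, with the other kind of quote left intact inside the value.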
Edit:
As @TJ Crowder points out, this will not work correctly if the string contains escaped quotes of the form \". To fully accommodate those escaped quotes, and not break on strings like \\" (an escaped backslash preceding a quote), you will need to parse with multiple regexes or take a different tactic.
The other thing you might want to look at, if this is coming from JSON, is skipping regex altogether and just iterating through the properties of your JSON object. It depends on whether the string you're getting comes in as valid JSON or not.
I encountered this regular expression that detects string literal of Unicode characters in JavaScript.
'"'("\\x"[a-fA-F0-9]{2}|"\\u"[a-fA-F0-9]{4}|"\\"[^xu]|[^"\n\\])*'"'
but I couldn't understand the role of and need for
1) "\\x"[a-fA-F0-9]{2}
2) "\\"[^xu]|[^"\n\\]
My guess about 1) is that it is detecting control characters.
"\\x"[a-fA-F0-9]{2}
This is a literal \x followed by two characters from the hex-digit group.
This matches the shorter-form character escapes for the code points 0–255, \x00–\xFF. These are valid in JavaScript string literals but they aren't in JSON, where you have to use \u0000–\u00FF instead.
"\\"[^xu]|[^"{esc}\n]
This matches one of:
backslash followed by one more character, except for x or u. The valid cases for \xNN and \uNNNN were picked up in the previous |-separated clauses, so what this does is avoid matching invalid syntax like \uqX.
anything else, except for the " or newline. It is probably also supposed to be excluding other escape characters, which I'm guessing is what {esc} means. That isn't part of the normal regex syntax, but it may be some extended syntax or templating over the top of regex. Otherwise, [^"{esc}\n] would mean just any character except ", {, e, s, c, } or newline, which would be wrong.
Notably, the last clause, that picks up ‘anything else’, doesn't exclude \ itself, so you can still have \uqX in your string and get a match even though that is invalid in both JSON and JavaScript.
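To see the difference the final character class makes, here is a hedged JavaScript translation of the lexer-style pattern. The translation itself, the ^ and $ anchors, and the names strict and loose are assumptions made for this sketch; strict excludes the backslash in the last class (as in the pattern quoted in the question), while loose does not (the behavior described just above).

```javascript
// Final class excludes the backslash, as in the pattern quoted in the question.
const strict = /^"(?:\\x[a-fA-F0-9]{2}|\\u[a-fA-F0-9]{4}|\\[^xu]|[^"\n\\])*"$/;
// Final class does not exclude the backslash.
const loose = /^"(?:\\x[a-fA-F0-9]{2}|\\u[a-fA-F0-9]{4}|\\[^xu]|[^"\n])*"$/;

const valid = '"caf\\xE9 \\u00e9 \\n"'; // the text "caf\xE9 \u00e9 \n"
const invalid = '"\\uqX"';               // the text "\uqX"

console.log(strict.test(valid));   // true
console.log(strict.test(invalid)); // false
console.log(loose.test(valid));    // true
console.log(loose.test(invalid));  // true
```

With the backslash excluded, a stray \u that is not followed by four hex digits has no clause left to match it, so the whole literal is rejected.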