I need some help with a regex for nested double quotes.
I have the following string:
"abcd-1234\":" : value\":1234\":
and I want to capture the entire string and separate it into a key and value pair, but I am not able to come up with a proper regex.
Basically, I have the following string format -->
"key" : "value"
and I want to find a proper regex for the string format.
I am able to capture the key and value individually with the following regex -->
((^[\"]).*\2(?![^:]))
But I am not able to get a proper regex for the entire string.
Can someone please help me with the regex?
Imagine the following string: "\\". It contains \" but is still a complete, valid string, because the backslash is itself escaped. You can't just ignore every \"; you have to count backslashes.
(?:[^"\\]|\\.) covers any single in-string character: either a backslash followed by anything (. is anything), or any character that is neither a backslash nor a quote. A string is a quote, followed by any number of those, followed by a quote, which gives the regexp "(?:[^"\\]|\\.)*".
However, regexps probably aren't the right tool for the job. This looks like a part of a JSON formatted input; there are JSON parsers that do a much better job on this, covering far more cases.
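As a minimal sketch of the string pattern described in this answer, the JavaScript below wraps the in-string alternation in quotes and uses it twice to split a "key" : "value" line. The names quotedString, pairPattern and splitPair are made up for this illustration, and the JSON.parse comparison assumes the input really is a fragment of JSON.

```javascript
// One quoted string: a quote, any number of in-string characters, a quote.
// The capture group holds the raw (still escaped) contents.
const quotedString = String.raw`"((?:[^"\\]|\\.)*)"`;

// Hypothetical helper pattern: a single "key" : "value" pair on its own line.
const pairPattern = new RegExp(`^\\s*${quotedString}\\s*:\\s*${quotedString}\\s*$`);

function splitPair(line) {
  const m = pairPattern.exec(line);
  return m === null ? null : { key: m[1], value: m[2] };
}

const simple = splitPair('"key" : "value"');
// String.raw keeps the backslashes exactly as typed: "a\"b" : "c\\"
const tricky = splitPair(String.raw`"a\"b" : "c\\"`);
console.log(simple, tricky);

// If the text is a JSON fragment, a parser also decodes the escapes for you.
const parsed = JSON.parse("{" + String.raw`"a\"b" : "c\\"` + "}");
console.log(parsed);
```

The regex only separates the raw text of the key and the value; unlike the parser, it does not turn \" back into a plain quote.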
This is very similar to
Regular expression to find unescaped double quotes in CSV file
However, the solutions presented don't work with Node.js's regex engine. Given a CSV string where columns are quoted with double quotes, but some columns have unescaped double quotes in them, what regex could be used to match these unescaped quotes and just remove them.
Example rows
"123","","SDFDS SDFSDF EEE "S"","asdfas","b","lll"
"123","","SDFDS SDFSDF EEE "S"","asdfas","b","lll"
So the two double quotes surrounding the S in the third column would get matched and removed. Needs to work in Node.js (14.16.1)
I have tried (?m)""(?![ \t]*(,|$)) but get an "Invalid regular expression: /(?m)""(?![ \t]*(,|$))/: Invalid group" exception.
I don't know much about node.js, but assuming it is like the JavaScript flavor of regex then I have the following comments about the example you took from the prior answer:
I think your example is choking on the first element, (?m), which is unsupported in JavaScript. However, that part is not essential to your task. It only turns on multiline processing, and you don't need that if you feed the regex engine each line individually. If you find you still want to feed it a multiline string, you can still turn on multiline in JavaScript: you do it with the "m" flag after the final delimiter, "/myregex/m". All of the other elements, including the negative lookahead, are supported by JavaScript and probably by your engine as well. So, drop the (?m) part of your expression and try it again.
Even after you get it to work, the example row you provided will not be parsed according to your expectations by the sample regular expression. Its function is to identify all occurrences of two double-quotes that are not followed by a comma (or end of string). The ONLY two occurrences of doubled quotes in your example each have a comma after, so you will get no matches on this regex in your example.
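To illustrate both points, here is a small sketch that moves the multiline switch into the m flag and runs the pattern against the row from the question. The second row (otherRow) is a made-up example, added only to show a case where the pattern does find something.

```javascript
// The row from the question, plus a hypothetical row with doubled quotes
// in the middle of a field.
const questionRow = '"123","","SDFDS SDFSDF EEE "S"","asdfas","b","lll"';
const otherRow = '"x","SDF ""S"" EEE","y"';

// Same pattern as in the question, with (?m) replaced by the m flag.
const doubledQuotes = /""(?![ \t]*(,|$))/gm;

// Every "" in the question's row is followed by a comma, so nothing matches.
const questionMatches = questionRow.match(doubledQuotes);

// Here both "" pairs sit inside a field, away from any comma.
const otherMatches = otherRow.match(doubledQuotes);

console.log(questionMatches, otherMatches);
```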
It seems like you want some context-sensitive scanning to match and remove the inner pairs of double quotes while leaving the outer ones in place and handling commas inside your strings and possibly correctly quoted double quotes. Regular expression engines are really bad at this kind of processing and I don't think you are going to get satisfactory results whatever you come up with.
You can get an approximate solution to your problem by using regex once to parse the individual elements of the .csv, stripping the outer quotes as you go, and then running a second regex against each parsed element to either remove single occurrences of a double quote or add a second double quote where necessary. Then you can reassemble the string under program control.
This still will break if someone embeds a "", sequence in a data field string, so it's not perfect but it might be good enough for you.
The regex for splitting the .csv and stripping the double quotes is:
/(("(.*?)")|([^,]*))(,|$)/gm
This will accept either a "anything", or an anything, repeatedly until the source is exhausted. Because of the capturing groups, the parsed text will be either in $3 (if the field was quoted) or in $4 (if it was not quoted), but not both.
Here's a regexpReplace of your string with $3$4 and a semicolon after each iteration (I took the liberty of adding a numeric field without the quotes so you could see that it handles both cases):
"123","","SDFDS SDFSDF EEE "S"",456,"asdfas","b","lll"
RegexpReplace(<above>,"((""(.*?)"")|([^,]*))(,|$)","$3$4;")
=> 123;;SDFDS SDFSDF EEE "S";456;asdfas;b;lll;;
See how the outer quotes have been stripped away. Now it's a simple thing to go through all the matches to remove all the remaining quotes, and then you can reconstruct the string from the array of matches.
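The two-pass idea could look roughly like this in Node.js. This is only a sketch under the same caveats as above: cleanCsvRow is a hypothetical helper name, it simply deletes every leftover quote inside a field (rather than doubling it), and it still breaks if a field contains a "", sequence.

```javascript
// Field splitter from the answer: $3 is a quoted field, $4 an unquoted one.
const fieldPattern = /(("(.*?)")|([^,]*))(,|$)/g;

// Hypothetical helper: strip stray quotes inside each field and rebuild the row.
function cleanCsvRow(row) {
  const fields = [];
  for (const m of row.matchAll(fieldPattern)) {
    const wasQuoted = m[3] !== undefined;
    const raw = wasQuoted ? m[3] : m[4];
    // Second pass: remove every quote left inside the field.
    const cleaned = raw.replace(/"/g, "");
    fields.push(wasQuoted ? `"${cleaned}"` : cleaned);
    // Group 5 is empty when the field ended at the end of the row; stopping
    // here skips the extra empty match the pattern produces afterwards.
    if (m[5] === "") break;
  }
  return fields.join(",");
}

const cleanedRow = cleanCsvRow('"123","","SDFDS SDFSDF EEE "S"",456,"asdfas","b","lll"');
console.log(cleanedRow);
// "123","","SDFDS SDFSDF EEE S",456,"asdfas","b","lll"
```

String.prototype.matchAll is available in Node.js 12 and later, so it should work on the 14.16.1 mentioned in the question.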
I have the following JavaScript:
let strTest = `
"The issue": "L'oggetto ",
"issue": "oggetto",
"issue": 'oggetto "novo" ',
`;
I'm trying to tokenize a string like the one above.
My regexp attempt:
let regExp = /["'](.*?)["']\s*?:\s*?['"](.*?)["']/gm;
This works fine, except in the case where I have a pair of single quotes (') inside of double quotes (") or vice-versa.
Is this possible with only one regular expression?
Answering my own question: I think I came up with a smaller regex:
/["'](.*)["']\s*?:\s*?["'[](.*)["']]/g
Have a look at regex101.com/r/g9WCbi/1
You can use backreferences:
/(["'])(.*?)\1\s*?:\s*?(['"])(.*?)\3/gm
This will include the quote characters in the match (they are captured in groups 1 and 3), but you can leave them out by taking only the even-numbered groups, 2 and 4, from each match.
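A short sketch of that approach, run against the string from the question. Only the variable names pairPattern and pairs are new; the pattern and the input are the ones shown above.

```javascript
const strTest = `
"The issue": "L'oggetto ",
"issue": "oggetto",
"issue": 'oggetto "novo" ',
`;

// \1 and \3 force each closing quote to be the same character as its opening quote.
const pairPattern = /(["'])(.*?)\1\s*?:\s*?(['"])(.*?)\3/gm;

// Keep only the even-numbered groups: 2 is the key, 4 is the value.
const pairs = Array.from(strTest.matchAll(pairPattern), (m) => [m[2], m[4]]);

console.log(pairs);
// [ [ 'The issue', "L'oggetto " ],
//   [ 'issue', 'oggetto' ],
//   [ 'issue', 'oggetto "novo" ' ] ]
```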
Edit:
As @TJ Crowder points out, this will not work correctly if the string contains escaped quotes in the form of \". To fully accommodate escaped quotes, and not break on strings like \\" (an escaped backslash preceding a quote), you will need to parse with multiple regexes or take a different tack.
The other thing you might want to look at, if this is coming from JSON, is skipping regex entirely and just iterating through the properties of your JSON object. It depends on whether the string you're getting comes in as valid JSON or not.
While trying to submit a form, a JavaScript regex validation always fails for my string.
Regex:- ^(([a-zA-Z]:)|(\\\\{2}\\w+)\\$?)(\\\\(\\w[\\w].*))+(.jpeg|.JPEG|.jpg|.JPG)$
I have tried following strings against it
abc.jpg,
abc:.jpg,
a:.jpg,
a:asdas.jpg,
What string could possibly match this regex?
This regex won't match against anything because of that $? in the middle of the string.
Apparently, using the optional quantifier ? on the end-of-string anchor $ is not correct (if you paste it into https://regex101.com/ it will indeed give you an error). If the JavaScript parser ignores the error and keeps the regex as it is, you are still trying to match an end of string in the middle of a string that is supposed to continue.
Unescaped it was supposed to match a \$ (dollar symbol) but as it is written it won't work.
If you want your string to be accepted at any cost, you can probably use Firebug or a similar developer tool and edit the string inside the JavaScript code (assuming there is no server-side check as well, or that it is not wrong too). If you ignore the $? then a matching string will be \\\\w\\\\ww.jpg (but since the . is unescaped even \\\\w\\\\ww%jpg is a match)
Of course, I wrote this answer assuming the escaping is indeed the one you showed in the question. If you need to find a matching pattern for the correctly escaped one ^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))+(\.jpeg|\.JPEG|\.jpg|\.JPG)$ then you can use this tool to find one http://fent.github.io/randexp.js/ (though it will find weird matches). A matching pattern is c:\zz.jpg
If you are just looking for a regular expression to match what you got there, go ahead and test this out:
(\w+:?\w*\.[jpe?gJPE?G]+,)
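That should match each of the strings listed in the question, including the trailing comma; remove the comma from the pattern if you don't want to require it. Note that [jpe?gJPE?G]+ is a character class, so it also accepts other runs of those characters, not just jpg and jpeg.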
If you remove escape level, the actual regex is
^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))+(.jpeg|.JPEG|.jpg|.JPG)$
After the ^ anchor comes the first group, (([a-zA-Z]:)|(\\{2}\w+)\$?), which matches either a letter followed by a colon, or two backslashes followed by one or more word characters and an optional literal $. Some of the parentheses inside are needless.
The second part, (\\(\w[\w].*))+, matches a backslash followed by two word characters; \w[\w] looks weird because it is equivalent to \w\w (the second \w doesn't need a character class). That is followed by any number of any characters, and the whole group repeats one or more times.
In the last part, (.jpeg|.JPEG|.jpg|.JPG), the author probably forgot to escape the dot to match it literally; \. should be used. This part can be reduced to \.(JPE?G|jpe?g).
It would match something like
A:\12anything.JPEG
\\1$\anything.jpg
Play with it at regex101. A more readable version could be
^([a-zA-Z]:|\\{2}\w+\$?)(\\\w{2}.*)+\.(jpe?g|JPE?G)$
Also read the explanation on regex101 to understand any pattern, it's helpful!
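As a quick check of the analysis above, this sketch runs the unescaped original and the more readable version against the four strings from the question and the two example matches. The variable names and the extra "c:\zz%jpg" sample (borrowed from the unescaped-dot remark in an earlier answer) are additions for illustration.

```javascript
// The regex from the question with one escape level removed.
const original = /^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))+(.jpeg|.JPEG|.jpg|.JPG)$/;
// The more readable version, with the dot escaped.
const simplified = /^([a-zA-Z]:|\\{2}\w+\$?)(\\\w{2}.*)+\.(jpe?g|JPE?G)$/;

const samples = [
  "abc.jpg",
  "abc:.jpg",
  "a:.jpg",
  "a:asdas.jpg",
  "A:\\12anything.JPEG",   // A:\12anything.JPEG
  "\\\\1$\\anything.jpg",  // \\1$\anything.jpg
  "c:\\zz%jpg",            // c:\zz%jpg, no dot before the extension
];

const results = samples.map((s) => ({
  sample: s,
  original: original.test(s),
  simplified: simplified.test(s),
}));

console.table(results);
```

The four strings from the question fail both patterns because none of them contains a backslash after the drive letter. The last sample passes only the original, since its unescaped . accepts any character in front of jpg.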
I'm trying to write a regex in javascript to identify string representations of arbitrary javascript functions found in json, ie. something like
{
"key": "function() { return 'I am a function'; }"
}
It's easy enough to identify the start, but I can't figure out how to identify the ending double quotes since the function might also contain escaped double quotes. My best try so far is
/"\s*function\(.*\)[^"]*/g
which works nicely if there are no double quotes in the function string. The end of a json key value will end with a double quote and a subsequent comma or closing bracket. Is there some way to retrieve all characters (including newline?) until a negated pattern such as
not "\s*, and not "\s*}
... or do I need to take a completely different approach without regex?
Here is the current test data I'm working with:
http://regexr.com/39pvi
Seems like you want something like this,
"\s*function\(.*\)(?:\\.|[^\\"])*
It also matches the escaped double quotes (\") in between.
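A small sketch of how that pattern behaves on JSON text. The sample object is made up for the illustration, and appending a closing quote before calling JSON.parse is just one assumed way to turn the raw match back into the function source.

```javascript
// Build some JSON text whose "key" value is a function string with escaped quotes.
const jsonText = JSON.stringify({
  key: 'function() { return "I am a function"; }',
  other: "plain text",
});

// Pattern from the answer: after the parameter list, consume either an escaped
// character (\\.) or anything that is neither a backslash nor a quote.
const functionString = /"\s*function\(.*\)(?:\\.|[^\\"])*/g;

// Each match stops just before the closing quote, so add it back and let
// JSON.parse decode the escapes.
const found = (jsonText.match(functionString) || []).map((raw) => JSON.parse(raw + '"'));

console.log(found);
// [ 'function() { return "I am a function"; }' ]
```

Note that .* in the parameter part is greedy, so on a line with several closing parentheses it runs to the last one before the rest of the pattern takes over.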
Can somebody explain what this regular expression does?
document.cookie.match(/cookieInfo=([^;]*).*$/)[1]
Also it would be great if I can strip out the double quotes I'm seeing in the cookieInfo values. i.e. when cookieInfo="xyz+asd" - I want to strip out the double quotes using the above regular expression.
It is basically saying: grab as many characters as possible that are not semicolons and that follow the string 'cookieInfo='.
Try this to eliminate the double quotes:
document.cookie.match(/cookieInfo="([^;]*)".*$/)[1]
It searches the document.cookie string for cookieInfo=.
Next it grabs all of the characters which are not ; (until it hits the first semicolon).
[...] set of all characters included inside.
[^...] set of all characters which don't match
Then it lets the RegEx search through all other characters.
.* any character, 0 or more times.
$ end of string (or in some special cases, end of line).
You could replace " a couple of different ways, but rather than stuffing it into the regex, I'd recommend doing a replace on it after the fact:
var string = document.cookie.match(...)[1],
cleaned_string = string.replace(/^"|"$/g, "");
That second regex says: look at the start of the string and see if there's a ", or look at the end of the string and see if there's a ".
Normally, a RegEx would stop after it did the first thing it found. The g at the end means to keep going for every match it can possibly find in the string that you gave it.
I wouldn't put it in the original RegEx, because playing around with optional quotes can be ugly.
If they're guaranteed to always, always be there, then that's great, but if you assume they are, and you hit one that doesn't have them, then you're going to get a null match.
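Putting those pieces together, a defensive version might look like the sketch below. readCookieInfo is a hypothetical helper name, and it takes the cookie string as a parameter (you would pass document.cookie in a browser) so that the example can run anywhere.

```javascript
// Hypothetical helper: returns the cookieInfo value without surrounding quotes,
// or null when the cookie is not present.
function readCookieInfo(cookieString) {
  const match = cookieString.match(/cookieInfo=([^;]*).*$/);
  if (match === null) return null; // avoids indexing [1] on a null match
  // Strip one leading and one trailing double quote, if they are there.
  return match[1].replace(/^"|"$/g, "");
}

console.log(readCookieInfo('a=1; cookieInfo="xyz+asd"; b=2')); // xyz+asd
console.log(readCookieInfo("cookieInfo=plain"));               // plain
console.log(readCookieInfo("a=1; b=2"));                       // null
```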
The regular expression matches a string starting with 'cookieInfo=', followed by 0 or more non-semicolon characters (which it captures), followed by 0 or more 'anythings'.
To strip out the double quotes, you can use the regex /"/g and replace each match with an empty string.