How to match non-escaped quoted strings and also non-quoted strings?


I have a string that contains single, double, and escaped quotations:
Telling myself 'you are \'great\' ' and then saying "thank you" feels "a \"little\" nice"
I would like a single regex to pull out:
single quoted strings
double quoted strings
strings not in quotes
Expected Result: the following groups
Telling myself
you are \'great\'
and then saying
thank you
feels
a \"little\" nice
Requirements: don't return quotes, and ignore escaped quotes
What I have so far:
Regex #1 to return single and double quotes (source):
((?<![\\])['"])((?:.(?!(?<![\\])\1))*.?)\1
Regex #2 to return non-quoted strings:
((?<![\\])['"]|^).*?((?<![\\])['"]|$)
Problems:
I am unable to make regex #2 put the non-quoted string into a consistent group
I am unable to combine regex #1 and #2 to return all strings in one regex function

How about something like this:
(?<!\\)'(.+?)(?<!\\)'|(?<!\\)"(.+?)(?<!\\)"|(.+?)(?='|"|$)
Demo.
The basic idea is that the regex tries to match the quoted strings first, so whatever is left over is the text that was not enclosed in quotes. All the matched strings (without the quotes) end up in the capturing groups.
Shortened version:
(?<!\\)(['"])(.+?)(?<!\\)\1|(.+?)(?='|"|$)
Demo.
If you don't want to use capturing groups, you may adjust it to work with Lookarounds like the following:
(?<=(?<!\\)').+?(?=(?<!\\)')|(?<=(?<!\\)").+?(?=(?<!\\)")|(?<=^|['"]).+?(?=(?<!\\)['"]|$)
Demo.
Shortened version:
(?<=(?<!\\)(['"])).+?(?=(?<!\\)\1)|(?<=^|['"]).+?(?=(?<!\\)['"]|$)
Demo.
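As a rough illustration of how the shortened capturing-group version could be used from JavaScript, here is a minimal sketch. The helper name splitQuoted and the trimming step are assumptions added for this example (the question's expected output has no surrounding whitespace); the regex itself is the one shown above. It relies on lookbehind support, which JavaScript engines gained with ES2018.

```javascript
// Sketch only: `splitQuoted` is a hypothetical helper, not part of the answer.
// Requires an engine with lookbehind support (ES2018+).
function splitQuoted(input) {
  // Group 1: the opening quote character.
  // Group 2: text inside a quote pair (escaped quotes are skipped over).
  // Group 3: text outside quotes, up to the next quote or end of input.
  const re = /(?<!\\)(['"])(.+?)(?<!\\)\1|(.+?)(?='|"|$)/g;
  const parts = [];
  for (const m of input.matchAll(re)) {
    const text = (m[2] !== undefined ? m[2] : m[3]).trim();
    if (text) parts.push(text); // drop chunks that were only whitespace
  }
  return parts;
}

const sample = String.raw`Telling myself 'you are \'great\' ' and then saying "thank you" feels "a \"little\" nice"`;
console.log(splitQuoted(sample));
```

Note that the third alternative stops at any quote character, escaped or not, so an escaped quote in the unquoted text would split that text into separate chunks.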

JS version
/(?:"([^"\\]*(?:\\[\S\s][^"\\]*)*)"|'([^'\\]*(?:\\[\S\s][^'\\]*)*)'|([^'"\\]+)|(\\[\S\s]))/
https://regex101.com/r/5xfs7q/1
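A small sketch of how the JS version might be driven, assuming the g flag is added so the whole input can be walked. The function name tokenize and the { type, value } token shape are made up for this example; the four capture groups are taken from the regex above (1: double-quoted data, 2: single-quoted data, 3: plain text, 4: a lone escape sequence outside quotes).

```javascript
// Sketch only: `tokenize` and the token shape are assumptions for illustration.
function tokenize(input) {
  const re = /(?:"([^"\\]*(?:\\[\S\s][^"\\]*)*)"|'([^'\\]*(?:\\[\S\s][^'\\]*)*)'|([^'"\\]+)|(\\[\S\s]))/g;
  const tokens = [];
  for (const m of input.matchAll(re)) {
    if (m[1] !== undefined) tokens.push({ type: "double", value: m[1] });
    else if (m[2] !== undefined) tokens.push({ type: "single", value: m[2] });
    else if (m[3] !== undefined) tokens.push({ type: "plain", value: m[3] });
    else tokens.push({ type: "escape", value: m[4] });
  }
  return tokens;
}

const line = String.raw`say \'hi\' to "the \"boss\"" and 'it\'s done'`;
for (const t of tokenize(line)) {
  console.log(t.type, JSON.stringify(t.value));
}
```

With this pattern an escape sequence outside quotes comes back as its own token, and an unbalanced quote matches none of the alternatives, so it is silently skipped.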
PCRE - Pro level, super version ..
(?|(?|\s*((?:[^'"\\]|(?:\\[\S\s][^'"\\]*))+)(?<!\s)\s*|\s+(*SKIP)(*FAIL))|(?<!\\)(?|"([^"\\]*(?:\\[\S\s][^"\\]*)*)"|'([^'\\]*(?:\\[\S\s][^'\\]*)*)')|([\S\s]))
https://regex101.com/r/Tdyd3y/1
This is the cleanest, nicest one I've ever seen.
It trims whitespace, and the regex contains just a single capture group.
Explained
(?| # BReset
(?| # BReset
\s* # Wsp trim
( # (1 start), Non-quoted data
(?:
[^'"\\]
| (?: \\ [\S\s] [^'"\\]* )
)+
) # (1 end)
(?<! \s )
\s* # Wsp trim
| # or,
\s+ (*SKIP) (*FAIL) # Skip intervals with all whitespace
)
|
(?<! \\ ) # Not an escape behind
(?| # BReset
"
( # (1 start), double quoted string data
[^"\\]*
(?: \\ [\S\s] [^"\\]* )*
) # (1 end)
"
| # or,
'
( # (1 start), single quoted string data
[^'\\]*
(?: \\ [\S\s] [^'\\]* )*
) # (1 end)
'
)
|
( [\S\s] ) # (1), Pass through, single char
# Un-balanced " or ' or \ at EOF
)

Related

How to count regex match for ' but not \' in a string

I know the basics of processing regexes in JavaScript. I can count ' and \' in a string, but how do I match ' and not \'? For now my workaround is to subtract the results of the two matches. Is it possible to count ' while excluding \' in a single regex pattern?

JavaScript engines before ES2018 don't support lookbehind assertions, so in those environments you can't easily check if there is an odd number of backslashes before a ' unless you make those characters part of the match.
Therefore, you need to do something a bit more complicated - match all the strings that end in a single, unescaped quote, effectively splitting the entire input into chunks, one for each quote. Then count those chunks:
/[^\\']*(?:\\.[^\\']*)*'/g
will match all those parts of the string.
Test it live on regex101.com.
Explanation:
[^\\']* # Match any number of characters except backslashes and quotes.
(?: # Start of non-capturing group: Match...
\\. # an escaped character (any escape sequence, such as \' or \\),
[^\\']* # followed by any number of characters except backslashes and quotes.
)* # Do this as often as needed (even 0 times)
' # until you can match a single (unescaped) quote
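The chunk-counting idea can be sketched as below. The function name countUnescapedQuotes is hypothetical. One detail is an addition in this sketch rather than part of the answer: the sticky flag y is used alongside g. With g alone, when no further unescaped quote exists the engine retries one character later, and an escaped quote in the remaining tail can then be matched as if it were unescaped. The y flag forces each chunk to start exactly where the previous one ended, so matching simply stops at that point.

```javascript
// Sketch only: `countUnescapedQuotes` is a hypothetical helper name.
function countUnescapedQuotes(text) {
  // Same pattern as above; the added `y` flag makes every chunk start
  // where the previous chunk ended, so no position is retried mid-escape.
  const chunks = text.match(/[^\\']*(?:\\.[^\\']*)*'/gy);
  return chunks ? chunks.length : 0;
}

console.log(countUnescapedQuotes(String.raw`it's a \'test\' isn't it'`)); // 3
console.log(countUnescapedQuotes(String.raw`it's \'x`)); // 1
```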

Placing each matched value in its own capturing group

I've been at this for too long, trying to figure out how to match a comma-delimited string of values, while breaking apart the values into their own capturing groups. Here are my requirements:
No leading comma
Terms can be alphanumeric, with between 1 and 7 characters
Min: 1 term; Max: unlimited
Unlimited whitespace between terms and commas
No trailing comma
I'm so close, but I'm not able to get all terms in the string into their own capture groups. Instead it places the last matched term from the first capturing group into group #1, instead of placing all matches into previous groups. So here's my example:
abc1234, def5678, ghi9012
I would expect abc1234 to be group #1, def5678 to be group #2, and ghi9012 to be group #3. Instead, using the expression below, I get def5678 in group #1 and ghi9012 in group #2.
/(?:([A-z0-9]{1,7})\s*,\s*)+([A-z0-9]{1,7})/g
Link to RegExr example
I'm pretty sure I haven't set up my capturing/non-capturing groups correctly. Any help would be greatly appreciated.

This can do it for you. With the extraction regex, each value is captured in group 1 and is already trimmed.
Let me know if you need one for quoted fields.
Note that the extraction regex alone can't enforce the 1-7 character requirement,
so the input has to be validated ahead of time.
Validation regex:
# /^(?:(?:(?:^|,)\s*)[a-zA-Z0-9]{1,7}(?:\s*(?:(?=,)|$)))+$/
^
(?:
(?: # leading comma + optional whitespaces
(?: ^ | , )
\s*
)
[a-zA-Z0-9]{1,7} # alpha-num, 1-7 chars
(?: # trailing optional whitespaces
\s*
(?:
(?= , )
| $
)
)
)+
$
Extraction regex.
# /(?:(?:^|,)\s*)([^,]*?)(?:\s*(?:(?=,)|$))/
(?: # leading comma + optional whitespaces
(?: ^ | , )
\s*
)
( [^,]*? ) # (1), non-quoted field
(?: # trailing optional whitespaces
\s*
(?:
(?= , )
| $
)
)
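A minimal sketch of the validate-then-extract flow, assuming the two regexes above are used as given, with the g flag added to the extraction regex so every field is visited. The names VALIDATE, EXTRACT and parseTerms, and the choice to return null for invalid input, are assumptions made for this example.

```javascript
// Sketch only: constant and function names are hypothetical.
const VALIDATE = /^(?:(?:(?:^|,)\s*)[a-zA-Z0-9]{1,7}(?:\s*(?:(?=,)|$)))+$/;
const EXTRACT = /(?:(?:^|,)\s*)([^,]*?)(?:\s*(?:(?=,)|$))/g;

function parseTerms(line) {
  // Reject input that breaks the rules (empty terms, terms over 7 chars,
  // leading or trailing commas) before extracting anything.
  if (!VALIDATE.test(line)) return null;
  // Each match carries one trimmed term in group 1.
  return Array.from(line.matchAll(EXTRACT), (m) => m[1]);
}

console.log(parseTerms("abc1234, def5678, ghi9012")); // [ 'abc1234', 'def5678', 'ghi9012' ]
console.log(parseTerms("abc1234,, def5678")); // null
```

Instead of numbered groups within one match, each term arrives as group 1 of its own match, which avoids the problem of a repeated group keeping only its last iteration.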

Combine Regex to match variations of a String

I have a string that I'd like to pull some content from using javascript. The string can have multiple forms as follows:
[[(a*, b*) within 20]] or [[...(a*, b*) within 20]] where the "..." may or may not exist.
I'd like a regex that will match the "(a*, b*) within 20" portion.
/\[\[(.*?)\]\]/.exec(text)[1] will match [[(a*, b*) within 20]]
and
/([^\.]+)\]\]/.exec(text)[1] will match [[...(a*, b*) within 20]]
How can I combine these so that both version of the text will match "(a*, b*) within 20"?

You can use this regex:
var m = s.match(/\[\[.*?(\([^)]*\).*?)\]\]/);
if (m)
console.log(m[1]);
// (a*, b*) within 20 for both input strings

I'd like a regex that will match the (a*, b*) within 20 portion.
You can try
\[\[.*?(\(a\*, b\*\) .*?)\]\]
Here is demo on regex101
Note: depending on your needs, you can use \w or [a-z] instead of the literal a and b:
\[\[.*?(\(\w\*, \w\*\) .*?)\]\]
Here the escape character \ is used to escape characters that are special in a regex pattern, such as . [ ] * ( )

You can use the following to match both variations.
\[\[[^(]*(\([^)]*\)[^\]]*)\]\]
Explanation:
\[ # '['
\[ # '['
[^(]* # any character except: '(' (0 or more times)
( # group and capture to \1:
\( # '('
[^)]* # any character except: ')' (0 or more times)
\) # ')'
[^\]]* # any character except: ']' (0 or more times)
) # end of \1
\] # ']'
\] # ']'
Working Demo
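For reference, a short sketch that runs the pattern above against both input forms from the question. The wrapper name extractInner is hypothetical, as is the choice to return null when nothing matches.

```javascript
// Sketch only: `extractInner` is a hypothetical wrapper around the regex above.
function extractInner(text) {
  const m = /\[\[[^(]*(\([^)]*\)[^\]]*)\]\]/.exec(text);
  return m ? m[1] : null; // group 1 holds "(...)" plus the text up to "]]"
}

console.log(extractInner("[[(a*, b*) within 20]]")); // (a*, b*) within 20
console.log(extractInner("[[...(a*, b*) within 20]]")); // (a*, b*) within 20
```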

Check for open, unescaped quotes in a given string

A part of the application I'm building allows you to evaluate bash commands in an interactive terminal. On enter, the command is run. I'm trying to make it a bit more flexible, and allow for commands spanning multiple lines.
I already check for a trailing backslash, now I'm trying to figure out how to tell if there is an open string. I have not been successful in writing a regex for this, as it should also support escaped quotes.
For example:
echo "this is a
\"very\" cool quote"

If you want a regex that matches a string (subject) only if it doesn't contain unbalanced (unescaped) quotes, then try the following:
/^(?:[^"\\]|\\.|"(?:\\.|[^"\\])*")*$/.test(subject)
Explanation:
^ # Match the start of the string.
(?: # Match either...
[^"\\] # a character besides quotes or backslash
| # or
\\. # any escaped character
| # or
" # a closed string, i. e. one that starts with a quote,
(?: # followed by either
\\. # an escaped character
| # or
[^"\\] # any other character except quote or backslash
)* # any number of times,
" # and a closing quote.
)* # Repeat as often as needed.
$ # Match the end of the string.
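A sketch of how this test might be wired into the multi-line check from the question. The function name isQuoteBalanced is hypothetical. Note that the pattern only tracks double quotes, and that a lone trailing backslash also makes it fail, so it would be reported as unbalanced; the question already handles that case separately.

```javascript
// Sketch only: `isQuoteBalanced` is a hypothetical helper name.
function isQuoteBalanced(subject) {
  // true when every unescaped double quote is closed by another one
  return /^(?:[^"\\]|\\.|"(?:\\.|[^"\\])*")*$/.test(subject);
}

const firstLine = 'echo "this is a';
const bothLines = 'echo "this is a\n\\"very\\" cool quote"';

console.log(isQuoteBalanced(firstLine)); // false: keep reading input
console.log(isQuoteBalanced(bothLines)); // true: the command is complete
```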

Match everything but not quoted strings

I want to match everything except quoted strings.
I can match all quoted strings with this: /(("([^"\\]|\\.)*")|('([^'\\]|\\.)*'))/
So I tried to match everything except quoted strings with this: /[^(("([^"\\]|\\.)*")|('([^'\\]|\\.)*'))]/ but it doesn't work.
I would like to use only a regex, because I want to replace the matched text and keep the quoted text after it intact.
string.replace(regex, function(a, b, c) {
// return after a lot of operations
});
A quoted string is for me something like this "bad string" or this 'cool string'
So if I input:
he\'re is "watever o\"k" efre 'dder\'4rdr'?
It should output this matches:
["he\'re is ", " efre ", "?"]
And then I want to replace them.
I know my question is very difficult but it is not impossible! Nothing is impossible.
Thanks

EDIT: Rewritten to cover more edge cases.
This can be done, but it's a bit complicated.
result = subject.match(/(?:(?=(?:(?:\\.|"(?:\\.|[^"\\])*"|[^\\'"])*'(?:\\.|"(?:\\.|[^"'\\])*"|[^\\'])*')*(?:\\.|"(?:\\.|[^"\\])*"|[^\\'])*$)(?=(?:(?:\\.|'(?:\\.|[^'\\])*'|[^\\'"])*"(?:\\.|'(?:\\.|[^'"\\])*'|[^\\"])*")*(?:\\.|'(?:\\.|[^'\\])*'|[^\\"])*$)(?:\\.|[^\\'"]))+/g);
will return
, he said.
, she replied.
, he reminded her.
,
from this string (line breaks added and enclosing quotes removed for clarity):
"Hello", he said. "What's up, \"doc\"?", she replied.
'I need a 12" crash cymbal', he reminded her.
"2\" by 4 inches", 'Back\"\'slashes \\ are OK!'
Explanation: (sort of, it's a bit mindboggling)
Breaking up the regex:
(?:
(?= # Assert even number of (relevant) single quotes, looking ahead:
(?:
(?:\\.|"(?:\\.|[^"\\])*"|[^\\'"])*
'
(?:\\.|"(?:\\.|[^"'\\])*"|[^\\'])*
'
)*
(?:\\.|"(?:\\.|[^"\\])*"|[^\\'])*
$
)
(?= # Assert even number of (relevant) double quotes, looking ahead:
(?:
(?:\\.|'(?:\\.|[^'\\])*'|[^\\'"])*
"
(?:\\.|'(?:\\.|[^'"\\])*'|[^\\"])*
"
)*
(?:\\.|'(?:\\.|[^'\\])*'|[^\\"])*
$
)
(?:\\.|[^\\'"]) # Match text between quoted sections
)+
First, you can see that there are two similar parts. Both these lookahead assertions ensure that there is an even number of single/double quotes in the string ahead, disregarding escaped quotes and quotes of the opposite kind. I'll show it with the single quotes part:
(?= # Assert that the following can be matched:
(?: # Match this group:
(?: # Match either:
\\. # an escaped character
| # or
"(?:\\.|[^"\\])*" # a double-quoted string
| # or
[^\\'"] # any character except backslashes or quotes
)* # any number of times.
' # Then match a single quote
(?:\\.|"(?:\\.|[^"'\\])*"|[^\\'])*' # Repeat once to ensure even number,
# (but don't allow single quotes within nested double-quoted strings)
)* # Repeat any number of times including zero
(?:\\.|"(?:\\.|[^"\\])*"|[^\\'])* # Then match the same until...
$ # ... end of string.
) # End of lookahead assertion.
The double quotes part works the same.
Then, at each position in the string where these two assertions succeed, the next part of the regex actually tries to match something:
(?: # Match either
\\. # an escaped character
| # or
[^\\'"] # any character except backslash, single or double quote
) # End of non-capturing group
The whole thing is repeated once or more, as many times as possible. The /g modifier makes sure we get all matches in the string.
See it in action here on RegExr.
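The parity idea behind the two lookaheads is easier to see in a deliberately simplified form. The sketch below is not the regex from this answer: it ignores escape sequences and single quotes entirely, and only asserts that an even number of double quotes follows a comma. The function name splitOutsideDoubleQuotes and the comma-splitting task are assumptions chosen purely for illustration.

```javascript
// Simplified illustration only: no escape handling, double quotes only.
function splitOutsideDoubleQuotes(line) {
  // A comma is a separator only if the rest of the string contains an
  // even number of double quotes, i.e. the comma is not inside a pair.
  return line.split(/,(?=(?:[^"]*"[^"]*")*[^"]*$)/);
}

console.log(splitOutsideDoubleQuotes('a,"b,c",d')); // [ 'a', '"b,c"', 'd' ]
```

The full regex above applies the same test twice, once per quote type, and extends each piece so that escaped characters and quotes of the other kind are stepped over.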

Here is a tested function that does the trick:
function getArrayOfNonQuotedSubstrings(text) {
/* Regex with three global alternatives to section the string:
('[^'\\]*(?:\\[\S\s][^'\\]*)*') # $1: Single quoted string.
| ("[^"\\]*(?:\\[\S\s][^"\\]*)*") # $2: Double quoted string.
| ([^'"\\]*(?:\\[\S\s][^'"\\]*)*) # $3: Un-quoted string.
*/
var re = /('[^'\\]*(?:\\[\S\s][^'\\]*)*')|("[^"\\]*(?:\\[\S\s][^"\\]*)*")|([^'"\\]*(?:\\[\S\s][^'"\\]*)*)/g;
var a = []; // Empty array to receive the goods;
text = text.replace(re, // "Walk" the text chunk-by-chunk.
function(m0, m1, m2, m3) {
if (m3) a.push(m3); // Push non-quoted stuff into array.
return m0; // Return this chunk unchanged.
});
return a;
}
This solution uses the String.replace() method with a replacement callback function to "walk" the string section by section. The regex has three global alternatives, one for each kind of section: $1 for single-quoted, $2 for double-quoted, and $3 for non-quoted substrings. Each non-quoted chunk is pushed onto the return array. It correctly handles all escaped characters, including escaped quotes, both inside and outside quoted strings. Single-quoted substrings may contain any number of double quotes and vice versa. Illegal orphan quotes are removed and serve to divide a non-quoted section into two chunks. Note that this solution requires no lookaround and only one pass. It also implements Friedl's "Unrolling-the-Loop" technique and is quite efficient.
Additional: Here is some code to test the function with the original test string:
// The original test string (with necessary escapes):
var s = "he\\'re is \"watever o\\\"k\" efre 'dder\\'4rdr'?";
alert(s); // Show the test string without the extra backslashes.
console.log(getArrayOfNonQuotedSubstrings(s).toString());
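Since the question also asks about replacing the non-quoted parts while keeping the quoted text, the same walk can return a transformed chunk instead of collecting it. This is a sketch under that assumption; the name transformUnquoted and its callback parameter fn are hypothetical, and the regex is the one used in the function above.

```javascript
// Sketch only: `transformUnquoted` and `fn` are hypothetical names.
function transformUnquoted(text, fn) {
  const re = /('[^'\\]*(?:\\[\S\s][^'\\]*)*')|("[^"\\]*(?:\\[\S\s][^"\\]*)*")|([^'"\\]*(?:\\[\S\s][^'"\\]*)*)/g;
  // m3 is set only for non-quoted chunks; quoted chunks pass through as-is.
  return text.replace(re, (m0, m1, m2, m3) => (m3 ? fn(m3) : m0));
}

const original = String.raw`he\'re is "watever o\"k" efre 'dder\'4rdr'?`;
console.log(transformUnquoted(original, (chunk) => chunk.toUpperCase()));
// HE\'RE IS "watever o\"k" EFRE 'dder\'4rdr'?
```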

You can't invert a regex. What you tried turns the pattern into a character class and negates that, which only excludes the individual characters listed; and even for that you would have to escape all closing brackets ("\]").
EDIT: I would have started with
/(^|" |' ).+?($| "| ')/
This matches anything between the beginning or the end of a quoted string (very simple: a quotation mark plus a blank) and the end of the string or the start of a quoted string (a blank plus a quotation mark). Of course this doesn't handle any escape sequences or quotations which don't follow the scheme / ['"].*['"] /. See above answers for more detailed expressions :-)
