Match everything but not quoted strings - javascript

I want to match everything except quoted strings.
I can match all quoted strings with this: /(("([^"\\]|\\.)*")|('([^'\\]|\\.)*'))/
So I tried to match everything except quoted strings with this: /[^(("([^"\\]|\\.)*")|('([^'\\]|\\.)*'))]/ but it doesn't work.
I would like to use only a regex, because I want to run a replace on the matches and still get the quoted text back afterwards.
string.replace(regex, function(a, b, c) {
// return after a lot of operations
});
A quoted string is for me something like this "bad string" or this 'cool string'
So if I input:
he\'re is "watever o\"k" efre 'dder\'4rdr'?
It should output these matches:
["he\'re is ", " efre ", "?"]
And then I want to replace them.
I know my question is very difficult but it is not impossible! Nothing is impossible.
Thanks

EDIT: Rewritten to cover more edge cases.
This can be done, but it's a bit complicated.
result = subject.match(/(?:(?=(?:(?:\\.|"(?:\\.|[^"\\])*"|[^\\'"])*'(?:\\.|"(?:\\.|[^"'\\])*"|[^\\'])*')*(?:\\.|"(?:\\.|[^"\\])*"|[^\\'])*$)(?=(?:(?:\\.|'(?:\\.|[^'\\])*'|[^\\'"])*"(?:\\.|'(?:\\.|[^'"\\])*'|[^\\"])*")*(?:\\.|'(?:\\.|[^'\\])*'|[^\\"])*$)(?:\\.|[^\\'"]))+/g);
will return
, he said.
, she replied.
, he reminded her.
,
from this string (line breaks added and enclosing quotes removed for clarity):
"Hello", he said. "What's up, \"doc\"?", she replied.
'I need a 12" crash cymbal', he reminded her.
"2\" by 4 inches", 'Back\"\'slashes \\ are OK!'
Explanation: (sort of, it's a bit mindboggling)
Breaking up the regex:
(?:
  (?=      # Assert even number of (relevant) single quotes, looking ahead:
    (?:
      (?:\\.|"(?:\\.|[^"\\])*"|[^\\'"])*
      '
      (?:\\.|"(?:\\.|[^"'\\])*"|[^\\'])*
      '
    )*
    (?:\\.|"(?:\\.|[^"\\])*"|[^\\'])*
    $
  )
  (?=      # Assert even number of (relevant) double quotes, looking ahead:
    (?:
      (?:\\.|'(?:\\.|[^'\\])*'|[^\\'"])*
      "
      (?:\\.|'(?:\\.|[^'"\\])*'|[^\\"])*
      "
    )*
    (?:\\.|'(?:\\.|[^'\\])*'|[^\\"])*
    $
  )
  (?:\\.|[^\\'"])   # Match text between quoted sections
)+
First, you can see that there are two similar parts. Both these lookahead assertions ensure that there is an even number of single/double quotes in the string ahead, disregarding escaped quotes and quotes of the opposite kind. I'll show it with the single quotes part:
(?=                        # Assert that the following can be matched:
  (?:                      # Match this group:
    (?:                    # Match either:
      \\.                  # an escaped character
      |                    # or
      "(?:\\.|[^"\\])*"    # a double-quoted string
      |                    # or
      [^\\'"]              # any character except backslashes or quotes
    )*                     # any number of times.
    '                      # Then match a single quote
    (?:\\.|"(?:\\.|[^"'\\])*"|[^\\'])*'   # Repeat once to ensure even number,
                           # (but don't allow single quotes within nested double-quoted strings)
  )*                       # Repeat any number of times including zero
  (?:\\.|"(?:\\.|[^"\\])*"|[^\\'])*       # Then match the same until...
  $                        # ... end of string.
)                          # End of lookahead assertion.
The double quotes part works the same.
Then, at each position in the string where these two assertions succeed, the next part of the regex actually tries to match something:
(?: # Match either
\\. # an escaped character
| # or
[^\\'"] # any character except backslash, single or double quote
) # End of non-capturing group
The whole thing is repeated once or more, as many times as possible. The /g modifier makes sure we get all matches in the string.
See it in action here on RegExr.

Here is a tested function that does the trick:
function getArrayOfNonQuotedSubstrings(text) {
    /*  Regex with three global alternatives to section the string:
          ('[^'\\]*(?:\\[\S\s][^'\\]*)*')  # $1: Single quoted string.
        | ("[^"\\]*(?:\\[\S\s][^"\\]*)*")  # $2: Double quoted string.
        | ([^'"\\]*(?:\\[\S\s][^'"\\]*)*)  # $3: Un-quoted string.
    */
    var re = /('[^'\\]*(?:\\[\S\s][^'\\]*)*')|("[^"\\]*(?:\\[\S\s][^"\\]*)*")|([^'"\\]*(?:\\[\S\s][^'"\\]*)*)/g;
    var a = [];                     // Empty array to receive the goods.
    text = text.replace(re,         // "Walk" the text chunk-by-chunk.
        function(m0, m1, m2, m3) {
            if (m3) a.push(m3);     // Push non-quoted stuff into array.
            return m0;              // Return this chunk unchanged.
        });
    return a;
}
This solution uses the String.replace() method with a replacement callback function to "walk" the string section by section. The regex has three global alternatives, one for each section: $1 single quoted, $2 double quoted, and $3 non-quoted substrings. Each non-quoted chunk is pushed onto the return array. It correctly handles all escaped characters, including escaped quotes, both inside and outside quoted strings. Single quoted substrings may contain any number of double quotes and vice-versa. Illegal orphan quotes are removed and serve to divide a non-quoted section into two chunks. Note that this solution requires no lookaround and requires only one pass. It also implements Friedl's "Unrolling-the-Loop" efficiency technique and is quite efficient.
Additional: Here is some code to test the function with the original test string:
// The original test string (with necessary escapes):
var s = "he\\'re is \"watever o\\\"k\" efre 'dder\\'4rdr'?";
alert(s); // Show the test string without the extra backslashes.
console.log(getArrayOfNonQuotedSubstrings(s).toString());
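Since the original goal was to replace the non-quoted parts rather than just list them, the same walk can return a transformed chunk instead of collecting it. The following is a minimal sketch that reuses the regex from the function above; the name replaceOutsideQuotes and the transform callback are illustrative assumptions, not part of the answer.

```javascript
// Hypothetical helper: apply `transform` to each non-quoted chunk and
// leave single- and double-quoted strings exactly as they are.
function replaceOutsideQuotes(text, transform) {
    var re = /('[^'\\]*(?:\\[\S\s][^'\\]*)*')|("[^"\\]*(?:\\[\S\s][^"\\]*)*")|([^'"\\]*(?:\\[\S\s][^'"\\]*)*)/g;
    return text.replace(re, function (m0, m1, m2, m3) {
        // m3 is only set (and non-empty) for a non-quoted chunk.
        return m3 ? transform(m3) : m0;
    });
}

// The original test string (with the escapes a JS string literal needs).
var sample = "he\\'re is \"watever o\\\"k\" efre 'dder\\'4rdr'?";
console.log(replaceOutsideQuotes(sample, function (chunk) {
    return chunk.toUpperCase();
}));
```

Only the text outside the quotes is upper-cased here; any other callback can be substituted.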

You can't invert a regex. What you tried was to wrap the pattern in a character class and negate that, but a negated character class only excludes single characters, not whole patterns. Even for that to parse as written, you would have to escape all closing brackets as "\]".
EDIT: I would have started with
/(^|" |' ).+?($| "| ')/
This matches anything between the beginning or the end of a quoted string (very simple: a quotation mark plus a blank) and the end of the string or the start of a quoted string (a blank plus a quotation mark). Of course this doesn't handle any escape sequences or quotations which don't follow the scheme / ['"].*['"] /. See above answers for more detailed expressions :-)

Related

Regular Expression to match text between # and only if # is not preceded by '

Hello I'm trying to find a regular expression that can help me find all matches inside a string when they're inside # and only if # are not preceded by an apostrophe "'".
Basically I need to bold the text just as here when we use double * to bold text like this, but the apostrophe should work as an escape character.
For example
#Hello my name is Noé# should look like Hello my name is Noé
#Hello this has an escape apostrophe '# so I'll match until here# should look like Hello this has an escape apostrophe '# so I'll match until here
Inside a long text there might or might not be several matches:
"Hello I'm a text #I'm bold#, and I need to know how to match my text that's inside two '#, and #I will not match either 'cause I got no end"
So I can print it like
"Hello I'm a text I'm bold, and I need to know how to match my text that's inside two '#, and #I will not match either 'cause I got no end"
If that's not possible with a RegExp I could program a finite state machine, but I was hoping it was possible. Thank you in advance, God bless you!
Note: I will handle the escape characters later; for now I just need to know how to match this
/(?<!')#.*(?<!')#/gim
This was the only thing I could come up with, but honestly, I have no idea how negative lookbehind works :(. With this regexp the match is wrong. For example, if I type:
"I'm a text #and I should be a match# and this should not #But this should as well# and I'm just some random extra text"
it matches from the first # occurrence until the last one, like so:
"I'm a text #and I should be a match# and this should not #But this should as well# and I'm just some random extra text"
I think this should work:
(?<!')#(.*?)(?<!')#
Here you can see the regexp working with your examples: https://regex101.com/r/wnguiA/1
(?<!') is a negative lookbehind. It tells the regex engine to temporarily step backwards in the string to check whether the text inside the lookbehind can be matched there. (?<!a)b matches a b that is not preceded by an a.
The (.*?) part is easier: it matches any character (except line terminators), and adding ? makes the quantifier non-greedy, so it stops at the first occurrence of the token that follows.
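As a rough illustration of the pattern in use, here is a sketch that wraps each #...# match in bold tags. The <b> markup and the toBold name are assumptions made for the example, and lookbehind in a JavaScript regex needs an ES2018-capable engine.

```javascript
// (?<!')# : a # that is not preceded by an apostrophe
// (.*?)   : the shortest possible run of text up to the closing #
const boldPattern = /(?<!')#(.*?)(?<!')#/g;

// Hypothetical helper: replace every #text# with <b>text</b>.
function toBold(text) {
  return text.replace(boldPattern, "<b>$1</b>");
}

console.log(toBold("Hello I'm a text #I'm bold#, and '# stays as it is"));
```

An opening # that never finds an unescaped closing # is left untouched, which matches the behaviour asked for in the question.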
To avoid running the negative lookbehind at every position in the string, you can also match # first and do the assertion after it.
#(?<!'#)(.*?)#(?<!'#)
Regex demo
Another option instead of using the non greedy .*? is to use a negated character class matching any char except #
Then when you encounter # only match it if there is ' before it using a positive lookbehind.
#(?<!'#)([^#\n]*(?:#(?<='#)[^#\n]*)*)#(?<!'#)
#(?<!'#) Match # not directly preceded by '
( Capture group 1
[^#\n]* Optionally match any char except # or a newline
(?: Non capture group
#(?<='#) Match # directly preceded by '
[^#\n]* Match optional repetitions of any char except # or a newline
)* Close non capture group and optionally repeat it to match all occurrences
) Close group 1
#(?<!'#) Match # not directly preceded by '
Regex demo
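To pull out just the captured text rather than replace it, the pattern can be fed to String.prototype.matchAll. This is a sketch under the assumption of an ES2020 engine (for matchAll, plus lookbehind support); extractMarked is a made-up name.

```javascript
// Same pattern as broken down above: group 1 holds the text between
// two unescaped #, and may itself contain escaped '# pairs.
const marked = /#(?<!'#)([^#\n]*(?:#(?<='#)[^#\n]*)*)#(?<!'#)/g;

// Hypothetical helper: return the contents of every #...# section.
function extractMarked(text) {
  return Array.from(text.matchAll(marked), function (m) {
    return m[1];
  });
}

console.log(extractMarked("#Hello this has an escape apostrophe '# so I'll match until here#"));
```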

How to match non-escaped quoted strings and also non-quoted strings?

I have a string that contains single, double, and escaped quotations:
Telling myself 'you are \'great\' ' and then saying "thank you" feels "a \"little\" nice"
I would like a single regex to pull out:
single quoted strings
double quoted strings
strings not in quotes
Expected Result: the following groups
Telling myself
you are \'great\'
and then saying
thank you
feels
a \"little\" nice
Requirements: don't return quotes, and ignore escaped quotes
What I have so far:
Regex #1 to return single and double quotes (source):
((?<![\\])['"])((?:.(?!(?<![\\])\1))*.?)\1
Result:
Regex #2 to return non-quoted strings:
((?<![\\])['"]|^).*?((?<![\\])['"]|$)
Result:
Problems:
I am unable to make regex #2 put the non-quoted string into a consistent group
I am unable to combine regex #1 and #2 to return all strings in one regex function
How about something like this:
(?<!\\)'(.+?)(?<!\\)'|(?<!\\)"(.+?)(?<!\\)"|(.+?)(?='|"|$)
Demo.
The basic idea behind this is that it tries to match the strings with quotes first, so that whatever is left after that is the strings that were not enclosed in quotes. You will have all the matched strings (not including the quotes) in the capturing groups.
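In JavaScript (ES2018 or later, since the pattern relies on lookbehind) the three capturing groups can be collected like this. The sketch below is only an illustration: splitQuoted is a hypothetical name, and the trim() call is an addition to match the expected result in the question.

```javascript
// Group 1: single-quoted content, group 2: double-quoted content,
// group 3: text outside quotes.
const pieces = /(?<!\\)'(.+?)(?<!\\)'|(?<!\\)"(.+?)(?<!\\)"|(.+?)(?='|"|$)/g;

// Hypothetical helper: return every piece, quotes stripped, whitespace trimmed.
function splitQuoted(text) {
  const out = [];
  for (const m of text.matchAll(pieces)) {
    // Exactly one group participates per match; the others are undefined.
    out.push((m[1] || m[2] || m[3]).trim());
  }
  return out;
}

// String.raw keeps the backslashes of the sample text as they are.
const phrase = String.raw`Telling myself 'you are \'great\' ' and then saying "thank you" feels "a \"little\" nice"`;
console.log(splitQuoted(phrase));
```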
Shortened version:
(?<!\\)(['"])(.+?)(?<!\\)\1|(.+?)(?='|"|$)
Demo.
If you don't want to use capturing groups, you may adjust it to work with Lookarounds like the following:
(?<=(?<!\\)').+?(?=(?<!\\)')|(?<=(?<!\\)").+?(?=(?<!\\)")|(?<=^|['"]).+?(?=(?<!\\)['"]|$)
Demo.
Shortened version:
(?<=(?<!\\)(['"])).+?(?=(?<!\\)\1)|(?<=^|['"]).+?(?=(?<!\\)['"]|$)
Demo.
JS version
/(?:"([^"\\]*(?:\\[\S\s][^"\\]*)*)"|'([^'\\]*(?:\\[\S\s][^'\\]*)*)'|([^'"\\]+)|(\\[\S\s]))/
https://regex101.com/r/5xfs7q/1
PCRE - Pro level, super version ..
(?|(?|\s*((?:[^'"\\]|(?:\\[\S\s][^'"\\]*))+)(?<!\s)\s*|\s+(*SKIP)(*FAIL))|(?<!\\)(?|"([^"\\]*(?:\\[\S\s][^"\\]*)*)"|'([^'\\]*(?:\\[\S\s][^'\\]*)*)')|([\S\s]))
https://regex101.com/r/Tdyd3y/1
This is the cleanest, nicest one I've ever seen.
It trims whitespace, and the regex contains just a single capture group.
Explained
(?|                                 # BReset
  (?|                               # BReset
    \s*                             # Wsp trim
    (                               # (1 start), Non-quoted data
      (?:
        [^'"\\]
        | (?: \\ [\S\s] [^'"\\]* )
      )+
    )                               # (1 end)
    (?<! \s )
    \s*                             # Wsp trim
    |                               # or,
    \s+ (*SKIP) (*FAIL)             # Skip intervals with all whitespace
  )
  |
  (?<! \\ )                         # Not an escape behind
  (?|                               # BReset
    "
    (                               # (1 start), double quoted string data
      [^"\\]*
      (?: \\ [\S\s] [^"\\]* )*
    )                               # (1 end)
    "
    |                               # or,
    '
    (                               # (1 start), single quoted string data
      [^'\\]*
      (?: \\ [\S\s] [^'\\]* )*
    )                               # (1 end)
    '
  )
  |
  ( [\S\s] )                        # (1), Pass through, single char
                                    # Un-balanced " or ' or \ at EOF
)

How to count regex match for ' but not \' in a string

I know the basics of how to process regex in JavaScript. I can count ' and \' in a string, but how do I match ' and not \'? As of now I have a workaround: I subtract the results of the two matches. Is it possible to get a count that includes ' and excludes \' with a single regex pattern?
In JavaScript engines without lookbehind support (it was only added in ES2018), you can't easily check if there is an odd number of backslashes before a ' unless you make those characters part of the match.
Therefore, you need to do something a bit more complicated - match all the strings that end in a single, unescaped quote, effectively splitting the entire input into chunks, one for each quote. Then count those chunks:
/[^\\']*(?:\\.[^\\']*)*'/g
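will match all those parts of the string. Note that the pattern is not anchored, so if an escaped quote follows the last unescaped one (as in a'b\'c), the engine can still start a match directly at that trailing quote and count it too.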
Test it live on regex101.com.
Explanation:
[^\\']* # Match any number of characters except backslashes and quotes.
(?: # Start of non-capturing group: Match...
\\. # an escaped character (any escape sequence, such as \'),
[^\\']* # followed by any number of characters except backslashes and quotes.
)* # Do this as often as needed (even 0 times)
' # until you can match a single (unescaped) quote
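A sketch of the counting step follows. It departs from the pattern above in two labeled ways: the sticky flag (y, ES2015) is added so that every chunk has to start where the previous one ended, which keeps an escaped quote near the end of the string from being picked up on its own, and [\s\S] replaces the dot so that a backslash may also escape a line break. The function name is an assumption.

```javascript
// Hypothetical helper: count quotes that are not escaped by a backslash.
function countUnescapedQuotes(text) {
  // Each chunk: plain characters and escape pairs, ending in one real quote.
  // g + y together: matches must be contiguous, starting at index 0.
  const chunk = /[^\\']*(?:\\[\s\S][^\\']*)*'/gy;
  const chunks = text.match(chunk);
  return chunks === null ? 0 : chunks.length;
}

console.log(countUnescapedQuotes("it's"));    // one real quote
console.log(countUnescapedQuotes("it\\'s"));  // the text it\'s : none
```

Whatever follows the last unescaped quote simply fails to form a chunk, so it does not contribute to the count.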

regular expression to replace with ','

I have one RegExp, could anyone explain exactly what it does?
Regexp
b=b.replace(/(\d{1,3}(?=(?:\d\d\d)+(?!\d)))/g,"$1 ")
I think it inserts a space (' ').
If I'm right, I want it to insert a comma (,) instead of a space (' ').
To explain the regex, let's break it down:
(                 # Match and capture in group number 1:
  \d{1,3}         # one to three digits (as many as possible),
  (?=             # but only if it's possible to match the following afterwards:
    (?:           # A (non-capturing) group containing
      \d\d\d      # exactly three digits
    )+            # once or more (so, three/six/nine/twelve/... digits)
    (?!\d)        # but only if there are no further digits ahead.
  )               # End of (?=...) lookahead assertion
)                 # End of capturing group
Actually, the outer parentheses are unnecessary if you use $& instead of $1 for the replacement string ($& contains the entire match).
The regex (\d{1,3}(?=(?:\d\d\d)+(?!\d))) matches any 1-3 digits (\d{1,3}) that are followed by a multiple of 3 digits ((?:\d\d\d)+) that isn't followed by another digit ((?!\d)). Each match is replaced with "$1 ": $1 stands for the first capture group, and the space after it is a literal space.
See the Regular Expressions guide on MDN for more information about the different syntaxes.
If you want to separate the numbers with a comma instead of a space, you'll need to replace it with "$1," instead.
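For illustration, here is the comma variant wrapped in a small function. The name groupThousands and the separator parameter are assumptions; the regex is the one from the question, without the outer capturing group and with $& as the replacement.

```javascript
// Hypothetical helper: insert `separator` between groups of three digits.
// Note: a separator containing "$" would be read as a replacement pattern.
function groupThousands(digits, separator) {
  // $& is the whole match: 1-3 digits followed by a multiple of 3 digits.
  return digits.replace(/\d{1,3}(?=(?:\d\d\d)+(?!\d))/g, "$&" + separator);
}

console.log(groupThousands("1234567", ","));  // "1,234,567"
console.log(groupThousands("1234567", " "));  // "1 234 567"
```

The pattern treats any run of digits the same way, so digits after a decimal point would be grouped as well.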
Don't try to solve everything by using regular expressions.
Regular expressions are meant for matching, not to fix non-text-encoded-as-text formatting.
If you want to format numbers differently, extract them and use format strings to reformat them on a character processing level. That is just an ugly hack.
It is okay to use regular expressions to find the numbers in the text, e.g. \d{4,} but trying to do the actual formatting with regexp is a crazy abuse.
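One way to follow that advice in JavaScript is to let the regex only find the digit runs and hand the formatting to Intl.NumberFormat. This is a sketch, assuming en-US grouping is what is wanted; formatNumbersInText is a hypothetical name.

```javascript
// The formatter does the grouping; the regex only locates the numbers.
const usFormat = new Intl.NumberFormat("en-US");

// Hypothetical helper: reformat every run of four or more digits.
function formatNumbersInText(text) {
  return text.replace(/\d{4,}/g, function (digits) {
    // Number() loses precision above 2^53 and drops leading zeros.
    return usFormat.format(Number(digits));
  });
}

console.log(formatNumbersInText("Total: 1234567 units, 42 spare"));
```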

Escape a white space character using Javascript

I have the following jQuery statement, and I want to escape the whitespace as shown below. For example, if I have a word like #Operating/System, I would like the end result to be #Operating\/System (i.e. with an escape sequence). But if I have #Operating/System test, then I want #Operating\/System plus an escape sequence for the space. The .replace(/ /,'') part is incorrect, but .replace("/","\\/") works well for my requirements.
Please help!
$("#word" + lbl.eq(i).text().replace("/","\\/").replace(/ /,'')).hide();
$( "#word" + lbl.eq(i).text().replace(/([ /])/g, '\\$1') ).hide();
This matches all spaces and slashes in a string (and saves the respective char in group $1):
/([ /])/g
replacement with
'\\$1'
means a backslash plus the original char in group $1.
"#Operating/System test".replace(/([ /])/g, '\\$1');
-->
"#Operating\/System\ test"
Side advantage - there is only a single call to replace().
EDIT: As requested by the OP, a short explanation of the regular expression /([ /])/g. It breaks down as follows:
/ # start of regex literal
( # start of match group $1
[ /] # a character class (spaces and slashes)
) # end of group $1
/g # end of regex literal + "global" modifier
When used with replace() as above, all spaces and slashes are replaced with themselves, preceded by a backslash.
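The same replacement can be pulled out into a plain function, which makes it easy to try without jQuery. The name escapeSpacesAndSlashes is an assumption for this sketch.

```javascript
// Hypothetical helper: put a backslash before every space and slash.
function escapeSpacesAndSlashes(text) {
  // $1 is the matched space or slash; "\\" is one literal backslash.
  return text.replace(/([ /])/g, "\\$1");
}

console.log(escapeSpacesAndSlashes("#Operating/System test"));

// For comparison: a string pattern only replaces the first occurrence.
console.log("a/b/c".replace("/", "\\/"));
```

The /g flag is what makes the regex version cover every occurrence, which the string-pattern call does not.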
