RegExp - find all occurrences, but not inside quotes - javascript

I have this text (it's a string value, not a language expression):
hello = world + 'foo bar' + gizmo.hoozit + "escaped \"quotes\"";
And I would like to find all words ([a-zA-Z]+) which are not enclosed in double or single quotes. The quotes can be escaped (\" or \'). The result should be:
hello, world, gizmo, hoozit
Can I do this using regular expressions in JavaScript?

You can use this pattern; what you need is in the second capturing group.
EDIT: a little bit shorter with a negative lookahead:
var re = /(['"])(?:[^"'\\]+|(?!\1)["']|\\{2}|\\[\s\S])*\1|([a-z]+)/ig;
var mystr = 'hello = world + \'foo bar\' + gizmo.hoozit + "escaped \\"quotes\\"";';
var result = [];
var match;
// Quoted strings are consumed by the first alternative; only bare words reach group 2.
while ((match = re.exec(mystr)) !== null) {
  if (match[2]) result.push(match[2]);
}
console.log(mystr);
console.log(result);
The idea is to match (and so consume) quoted content before the target words get a chance to match.
Enclosed content details: '(?:[^'\\]+|\\{2}|\\[\s\S])*'
(["']) # opening quote, single or double (capture group 1)
(?: # open a non capturing group
[^"'\\]+ # all that is not a quote or a backslash
| # OR
(?!\1)["'] # a quote but not the captured quote
| # OR
\\{2} # 2 backslashes (to compose all even numbers of backslash)*
| # OR
\\[\s\S] # an escaped character (to allow escaped single quotes)
)* # repeat the group zero or more times
\1 # the closing quote, same kind as the opening one (backreference)
(* an even number of backslashes doesn't escape anything)
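As a rough sketch of how this pattern could be wrapped for reuse: the helper name wordsOutsideQuotes is made up for illustration, and String.prototype.matchAll assumes an ES2020 environment. Quoted runs are consumed by the first alternative, and only matches that populate group 2 are kept.

```javascript
// Hypothetical helper; the pattern itself is the one from the answer above.
function wordsOutsideQuotes(text) {
  const re = /(['"])(?:[^"'\\]+|(?!\1)["']|\\{2}|\\[\s\S])*\1|([a-z]+)/gi;
  const words = [];
  for (const m of text.matchAll(re)) {
    // Group 2 is only set when the match was a bare word, not a quoted string.
    if (m[2] !== undefined) words.push(m[2]);
  }
  return words;
}

const sample = 'hello = world + \'foo bar\' + gizmo.hoozit + "escaped \\"quotes\\"";';
console.log(wordsOutsideQuotes(sample));
```

Because the quoted alternative comes first, a word inside quotes never gets the chance to match on its own.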

You might want to use several regular expression methods one after the other for simplicity and clarity of function (large Regexes may be fast, but they're hard to construct, understand and edit): first remove all escaped quotes, then remove all quoted strings, then run your search.
var matches = string
.replace( /\\'|\\"/g, '' )
.replace( /'[^']*'|"[^"]*"/g, '' )
.match( /\w+/g );
A few notes on the regular expressions involved:
The central construct in the 2nd replacement is a quote character ('), followed by zero or more (*) characters from a negated set ([^']) that excludes that quote, followed by the closing quote; the same construct is repeated for double quotes
| means or, meaning either the part before or after the pipe can be matched
'\w' means 'any word character' and is shorthand for '[A-Za-z0-9_]', so it also matches digits and underscores; use '[a-zA-Z]' if only letters should match
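A quick way to see what \w covers compared with [a-zA-Z] (the sample string is invented for illustration):

```javascript
const sampleText = 'foo_bar baz42 qux';

// \w is equivalent to [A-Za-z0-9_], so underscores and digits stay inside a match.
const wordChars = sampleText.match(/\w+/g);

// [a-zA-Z] matches letters only, so the same text splits at "_" and at digits.
const lettersOnly = sampleText.match(/[a-zA-Z]+/g);

console.log(wordChars);   // [ 'foo_bar', 'baz42', 'qux' ]
console.log(lettersOnly); // [ 'foo', 'bar', 'baz', 'qux' ]
```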
jsFiddle demo.

Replace each escaped quote with an empty string;
Replace each pair of quotes and the string between with an empty string:
If you use a capture group for the opening quote (["']) then you can use a back-reference \1 to match the same style quote at the other end of the quoted string;
Matching with a back reference means you need to use a non-greedy (match as few characters as possible) wildcard match .*? to get the minimum possible quoted string.
Finally, find the matches using your regular expression [a-zA-Z]+.
Like this:
var text = "hello = world + 'foo bar' + gizmo.hoozit + \"escaped \\\"quotes\\\"\";";
var matches = text.replace( /\\["']/g, '' )
.replace( /(["']).*?\1/g, '' )
.match( /[a-zA-Z]+/g );
console.log( matches );

Related

Javascript how to identify a combination of letters and strip a portion of it

I'm very new to regex. Right now I'm trying to use regex to prepare my markup string before sending it to the database.
Here is an example string:
#[admin](user:3) Testing this string #[hellotessginal](user:4) Hey!
So far I am able to match the entire term #[admin](user:3) using /#\[(.*?)]\((.*?):(\d+)\)/g
The next step is to remove the (user:3) part, leaving me with #[admin].
Hence the result of passing through the stripper function would be:
#[admin] Testing this string #[hellotessginal] Hey!
Please help!
You may use
s.replace(/(#\[[^\][]*])\([^()]*?:\d+\)/g, '$1')
See the regex demo. Details:
(#\[[^\][]*]) - Capturing group 1: #[, 0 or more chars other than [ and ] as many as possible and then ]
\( - a ( char
[^()]*? - 0 or more (but as few as possible) chars other than ( and )
: - a colon
\d+ - 1+ digits
\) - a ) char.
The $1 in the replacement pattern refers to the value captured in Group 1.
See the JavaScript demo:
const rx = /(#\[[^\][]*])\([^()]*?:\d+\)/g;
const remove_parens = (string, regex) => string.replace(regex, '$1');
let s = '#[admin](user:3) Testing this string #[hellotessginal](user:4) Hey!';
s = remove_parens(s, rx);
console.log(s);
Try this:
var str = "#[admin](user:3) Testing this string #[hellotessginal](user:4) Hey!";
str = str.replace(/ *\([^)]*\) */g, ' ');
console.log(str);
You can replace matches of the following regular expression with empty strings.
str.replace(/(?<=\#\[(.*?)\])\(.*?:\d+\)/g, '');
regex demo
I've assumed the strings for which "admin" and "user" are placeholders in the example cannot contain the characters in the string "()[]". If that's not the case please leave a comment and I will adjust the regex.
I've kept the first capture group on the assumption that it is needed for some unstated purpose. If it's not needed, remove it:
(?<=\#\[.*?\])\(.*?:\d+\)
There is of course no point creating a capture group for a substring that is to be replaced with an empty string.
JavaScript's regex engine performs the following operations.
(?<= : begin positive lookbehind
\#\[ : match '#['
(.*?) : match 0+ chars, lazily, save to capture group 1
\] : match ']'
) : end positive lookbehind
\(.*?:\d+\) : match '(', 0+ chars, lazily, ':', 1+ digits, ')'
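The lookbehind variant can be tried with a small sketch like the one below. The function name stripUserRefs is hypothetical, and lookbehind assertions assume an ES2018-capable engine.

```javascript
// Hypothetical helper: removes "(name:digits)" only when it directly follows "#[...]".
function stripUserRefs(s) {
  // The lookbehind checks for the "#[...]" tag without consuming it,
  // so only the parenthesised part is replaced by the empty string.
  return s.replace(/(?<=#\[.*?\])\(.*?:\d+\)/g, '');
}

const markup = '#[admin](user:3) Testing this string #[hellotessginal](user:4) Hey!';
console.log(stripUserRefs(markup));
// #[admin] Testing this string #[hellotessginal] Hey!
```

Parenthesised text that does not directly follow a #[...] tag is left alone, because the lookbehind fails there.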

How to replace different characters with regex and add conditionals.

Example string: George's - super duper (Computer)
Wanted new string: georges-super-duper-computer
Current regex: .replace(/\s+|'|()/g, '-')
It does not work, and when I remove the spaces and there is already a - in between, I get something like george's---super.
tl;dr Your regex is malformed. Also you can't conditionally remove ' and \s ( ) in a single expression.
Your regex is malformed since ( and ) have special meanings. They are used to form groups so you have to escape them as \( and \). You'll also have to place another pipe | in between them, otherwise you're going to match the literal "()", which is not what you want.
The proper expression would look like this: .replace(/\s+|'|\(|\)/g, '-').
However, this is not what you want, since it would produce George-s---super-duper--Computer-. I would recommend that you use character classes, which will also make your expression easier to read:
.replace(/[\s'()-]+/g, '-')
This matches whitespace, ', (, ) and any additional - one or more times and replaces each run with a single -, yielding George-s-super-duper-Computer-.
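The two intermediate results quoted above can be reproduced with a short check (the variable names are only for illustration):

```javascript
const input = "George's - super duper (Computer)";

// Escaped alternation: every single match becomes its own hyphen.
const withAlternation = input.replace(/\s+|'|\(|\)/g, '-');

// Character class with +: a whole run of separators collapses into one hyphen.
const withCharClass = input.replace(/[\s'()-]+/g, '-');

console.log(withAlternation); // George-s---super-duper--Computer-
console.log(withCharClass);   // George-s-super-duper-Computer-
```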
This is still not quite right, so have this:
var myString = "George's - super duper (Computer)";
var myOtherString = myString
// Remove characters that are neither whitespace nor word characters (note: ^ negates the character class)
// also trim any whitespace from the beginning and end of the string (so we don't end up with hyphens at the start and end of the string)
.replace(/^\s+|[^\s\w]+|\s+$/g, "")
// Replace the remaining whitespace with hyphens
.replace(/\s+/g, "-")
// Finally make all characters lower case
.toLowerCase();
console.log(myString, '=>', myOtherString);
You could use match instead of replace and then join the results on -. You then need a replace to remove the single quotes. The regex would be:
[a-z]+('[a-z]+)*
JS code:
var str = "George's - super duper (Computer)";
console.log(
str.match(/[a-z]+('[a-z]+)*/gi).join('-').replace(/'/g, "").toLowerCase()
);

RegEx : Match a string enclosed in single quotes but don't match those inside double quotes

I wanted to write a regex to match the strings enclosed in single quotes but should not match a string with single quote that is enclosed in a double quote.
Example 1:
a = 'This is a single-quoted string';
the whole value of a should match because it is enclosed with single quotes.
EDIT: Exact match should be:
'This is a single-quoted string'
Example 2:
x = "This is a 'String' with single quote";
x should not return any match because the single quotes are found inside double quotes.
I have tried /'.*'/g but it also matches the single quoted string inside a double quoted string.
Thanks for the help!
EDIT:
To make it clearer
Given the below strings:
The "quick 'brown' fox" jumps
over 'the lazy dog' near
"the 'riverbank'".
The match should only be:
'the lazy dog'
Assuming that you won't have to deal with escaped quotes (which would be possible but make the regex complicated), and that all quotes are correctly balanced (nothing like It's... "Monty Python's Flying Circus"!), then you could look for single-quoted strings that are followed by an even number of double quotes:
/'[^'"]*'(?=(?:[^"]*"[^"]*")*[^"]*$)/g
See it live on regex101.com.
Explanation:
' # Match a '
[^'"]* # Match any number of characters except ' or "
' # Match a '
(?= # Assert that the following regex could match here:
(?: # Start of non-capturing group:
[^"]*" # Any number of non-double quotes, then a quote.
[^"]*" # The same thing again, ensuring an even number of quotes.
)* # Match this group any number of times, including zero.
[^"]* # Then match any number of characters except "
$ # until the end of the string.
) # (End of lookahead assertion)
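To see the even-number-of-double-quotes lookahead at work, here is a minimal sketch that runs the pattern over the examples from the question (the variable names are assumptions for illustration):

```javascript
const singleQuotedOutsideDouble = /'[^'"]*'(?=(?:[^"]*"[^"]*")*[^"]*$)/g;

// The three-line sample from the question, joined into one string.
// Without the m flag, $ only matches at the very end of the whole text.
const text = [
  'The "quick \'brown\' fox" jumps',
  "over 'the lazy dog' near",
  '"the \'riverbank\'".'
].join('\n');

console.log(text.match(singleQuotedOutsideDouble)); // [ "'the lazy dog'" ]

// A single-quoted string inside double quotes is followed by an odd number of
// double quotes, so the lookahead fails and match() returns null.
console.log('x = "This is a \'String\' with single quote";'.match(singleQuotedOutsideDouble));
```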
Try something like this:
^[^"]*?('[^"]+?')[^"]*$
Live Demo
If you are not strictly bounded by the regex, you could use the function "indexOf" to find out if it's a substring of the double quoted match:
var a = "'This is a single-quoted string'";
var x = "\"This is a 'String' with single quote\"";
singlequoteonly(x);
function singlequoteonly(line){
  var singleMatch = line.match(/'(.+)'/);
  var doubleMatch = line.match(/"(.+)"/);
  // Without a single-quoted part there is nothing to check.
  if ( singleMatch === null ){
    alert("No single-quoted string found");
    return;
  }
  var single = singleMatch[1];
  var double = doubleMatch !== null ? doubleMatch[1] : "";
  if( double.indexOf(single) == -1 ){
    alert(single + " is safe");
  }else{
    alert("Warning: Match [ " + single + " ] is in Line: [ " + double + " ]");
  }
}

((xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,} works on regexpal.com, but not on jsfiddle.net

So this regular expression, contained in "pattern" below, is only supposed to match what I describe in the comment below (the most basic match being 1 letter followed by a dot and then two letters)
var link = "Help"
// matches www-data.it -- needs at least (1 letter + '.' + 2 letters )
var pattern = '((xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}';
var re2 = new RegExp('^' + pattern, 'i');
// if no http and there is something.something
if (link.search(re2) == 0)
{
link = link;
}
When I test this pattern at http://regexpal.com/ it works, i.e. only something.something passes.
When I test it at JSFiddle and in production it matches more than it should, e.g. "Help" matches.
http://jsfiddle.net/2jU4D/
what's the deal?
You should construct the regular expression with native regex syntax:
var re2 = /^((xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}/i;
In particular, the \. in the regular expression will look like just plain . by the time you call new RegExp(). The string grammar also uses backslash for quoting, so the backslash will be "eaten" when the expression is first parsed as a string.
Alternatively:
var pattern = '((xn--)?[a-z0-9]+(-[a-z0-9]+)*\\.)+[a-z]{2,}';
var re2 = new RegExp('^' + pattern, 'i');
Doubling the backslash will leave you with the proper string to pass to the RegExp constructor.
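The effect of the string-literal escaping can be checked directly. This is a small sketch using the pattern from the question; the variable names lost and kept are made up for illustration.

```javascript
// In a string literal "\." is just ".", so the backslash never reaches RegExp.
const lost = '((xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}';
// Doubling it keeps one literal backslash in the resulting string.
const kept = '((xn--)?[a-z0-9]+(-[a-z0-9]+)*\\.)+[a-z]{2,}';

console.log(lost.includes('\\')); // false: the backslash was consumed by the string parser
console.log(kept.includes('\\')); // true

// With the unescaped dot, "e" in "Help" is accepted as the "any character" dot.
console.log(new RegExp('^' + lost, 'i').test('Help'));        // true
console.log(new RegExp('^' + kept, 'i').test('Help'));        // false
console.log(new RegExp('^' + kept, 'i').test('www-data.it')); // true
```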
Here is the breakdown of what it matches. I would replace all the capture groups with non-capture groups. And put all the anchors in the body of a regex (don't append later).
The regex itself is valid; I don't know about its delimiters or the way you are using it.
Pay attention to the required parts and you will see it's not matching correctly, but I don't think that is because of the regex.
( # (1 start)
( xn-- )? # (2), optional capture 'xn--'
[a-z0-9]+ # many lower case letters or digits
( - [a-z0-9]+ )* # (3), optional many captures of '-' followed by many lower case letters or digits
\. # a dot '.'
)+ # (1 end), overwrite this capture buffer many times
[a-z]{2,} # Two or more lower case letters

Match everything but not quoted strings

I want to match everything except quoted strings.
I can match all quoted strings with this: /(("([^"\\]|\\.)*")|('([^'\\]|\\.)*'))/
So I tried to match everything except quoted strings with this: /[^(("([^"\\]|\\.)*")|('([^'\\]|\\.)*'))]/ but it doesn't work.
I would like to use only a regex because I want to replace the matches and still get the quoted text back afterwards.
string.replace(regex, function(a, b, c) {
// return after a lot of operations
});
A quoted string is for me something like this "bad string" or this 'cool string'
So if I input:
he\'re is "watever o\"k" efre 'dder\'4rdr'?
It should output this matches:
["he\'re is ", " efre ", "?"]
And then I want to replace them.
I know my question is very difficult but it is not impossible! Nothing is impossible.
Thanks
EDIT: Rewritten to cover more edge cases.
This can be done, but it's a bit complicated.
result = subject.match(/(?:(?=(?:(?:\\.|"(?:\\.|[^"\\])*"|[^\\'"])*'(?:\\.|"(?:\\.|[^"'\\])*"|[^\\'])*')*(?:\\.|"(?:\\.|[^"\\])*"|[^\\'])*$)(?=(?:(?:\\.|'(?:\\.|[^'\\])*'|[^\\'"])*"(?:\\.|'(?:\\.|[^'"\\])*'|[^\\"])*")*(?:\\.|'(?:\\.|[^'\\])*'|[^\\"])*$)(?:\\.|[^\\'"]))+/g);
will return
, he said.
, she replied.
, he reminded her.
,
from this string (line breaks added and enclosing quotes removed for clarity):
"Hello", he said. "What's up, \"doc\"?", she replied.
'I need a 12" crash cymbal', he reminded her.
"2\" by 4 inches", 'Back\"\'slashes \\ are OK!'
Explanation: (sort of, it's a bit mindboggling)
Breaking up the regex:
(?:
(?= # Assert even number of (relevant) single quotes, looking ahead:
(?:
(?:\\.|"(?:\\.|[^"\\])*"|[^\\'"])*
'
(?:\\.|"(?:\\.|[^"'\\])*"|[^\\'])*
'
)*
(?:\\.|"(?:\\.|[^"\\])*"|[^\\'])*
$
)
(?= # Assert even number of (relevant) double quotes, looking ahead:
(?:
(?:\\.|'(?:\\.|[^'\\])*'|[^\\'"])*
"
(?:\\.|'(?:\\.|[^'"\\])*'|[^\\"])*
"
)*
(?:\\.|'(?:\\.|[^'\\])*'|[^\\"])*
$
)
(?:\\.|[^\\'"]) # Match text between quoted sections
)+
First, you can see that there are two similar parts. Both these lookahead assertions ensure that there is an even number of single/double quotes in the string ahead, disregarding escaped quotes and quotes of the opposite kind. I'll show it with the single quotes part:
(?= # Assert that the following can be matched:
(?: # Match this group:
(?: # Match either:
\\. # an escaped character
| # or
"(?:\\.|[^"\\])*" # a double-quoted string
| # or
[^\\'"] # any character except backslashes or quotes
)* # any number of times.
' # Then match a single quote
(?:\\.|"(?:\\.|[^"'\\])*"|[^\\'])*' # Repeat once to ensure even number,
# (but don't allow single quotes within nested double-quoted strings)
)* # Repeat any number of times including zero
(?:\\.|"(?:\\.|[^"\\])*"|[^\\'])* # Then match the same until...
$ # ... end of string.
) # End of lookahead assertion.
The double quotes part works the same.
Then, at each position in the string where these two assertions succeed, the next part of the regex actually tries to match something:
(?: # Match either
\\. # an escaped character
| # or
[^\\'"] # any character except backslash, single or double quote
) # End of non-capturing group
The whole thing is repeated once or more, as many times as possible. The /g modifier makes sure we get all matches in the string.
See it in action here on RegExr.
Here is a tested function that does the trick:
function getArrayOfNonQuotedSubstrings(text) {
/* Regex with three global alternatives to section the string:
('[^'\\]*(?:\\[\S\s][^'\\]*)*') # $1: Single quoted string.
| ("[^"\\]*(?:\\[\S\s][^"\\]*)*") # $2: Double quoted string.
| ([^'"\\]*(?:\\[\S\s][^'"\\]*)*) # $3: Un-quoted string.
*/
var re = /('[^'\\]*(?:\\[\S\s][^'\\]*)*')|("[^"\\]*(?:\\[\S\s][^"\\]*)*")|([^'"\\]*(?:\\[\S\s][^'"\\]*)*)/g;
var a = []; // Empty array to receive the goods;
text = text.replace(re, // "Walk" the text chunk-by-chunk.
function(m0, m1, m2, m3) {
if (m3) a.push(m3); // Push non-quoted stuff into array.
return m0; // Return this chunk unchanged.
});
return a;
}
This solution uses the String.replace() method with a replacement callback function to "walk" the string section by section. The regex has three global alternatives, one for each kind of section: $1 for single-quoted, $2 for double-quoted, and $3 for non-quoted substrings. Each non-quoted chunk is pushed onto the return array. It correctly handles all escaped characters, including escaped quotes, both inside and outside quoted strings. Single-quoted substrings may contain any number of double quotes and vice versa. Illegal orphan quotes are removed and serve to divide a non-quoted section into two chunks. Note that this solution requires no lookaround and only one pass. It also implements Friedl's "Unrolling-the-Loop" technique, which keeps it efficient.
Additional: Here is some code to test the function with the original test string:
// The original test string (with necessary escapes):
var s = "he\\'re is \"watever o\\\"k\" efre 'dder\\'4rdr'?";
alert(s); // Show the test string without the extra backslashes.
console.log(getArrayOfNonQuotedSubstrings(s).toString());
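Since the question also asks to replace the unquoted parts while getting the quoted text back, the same three-alternative regex can be used with a callback that rewrites only the third group. This is a sketch under that reading of the question; replaceOutsideQuotes and the transform parameter are hypothetical names.

```javascript
// Hypothetical helper: applies `transform` to unquoted chunks only
// and passes quoted chunks through untouched.
function replaceOutsideQuotes(text, transform) {
  const re = /('[^'\\]*(?:\\[\S\s][^'\\]*)*')|("[^"\\]*(?:\\[\S\s][^"\\]*)*")|([^'"\\]*(?:\\[\S\s][^'"\\]*)*)/g;
  return text.replace(re, (m0, m1, m2, m3) => (m3 ? transform(m3) : m0));
}

// The test string from the question (with the escapes a JS literal needs).
const quotedSample = "he\\'re is \"watever o\\\"k\" efre 'dder\\'4rdr'?";
console.log(replaceOutsideQuotes(quotedSample, (chunk) => chunk.toUpperCase()));
// HE\'RE IS "watever o\"k" EFRE 'dder\'4rdr'?
```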
You can't invert a regex. What you tried was to make a character class out of it and negate that, but to do so you would also have to escape all closing brackets as "\]".
EDIT: I would have started with
/(^|" |' ).+?($| "| ')/
This matches anything between the beginning or the end of a quoted string (very simple: a quotation mark plus a blank) and the end of the string or the start of a quoted string (a blank plus a quotation mark). Of course this doesn't handle any escape sequences or quotations which don't follow the scheme / ['"].*['"] /. See above answers for more detailed expressions :-)
