A simpler regular expression to parse quoted strings - javascript

The question is simple. I have a string that contains multiple elements which are embedded in single-quotation marks:
var str = "'alice' 'anna marie' 'benjamin' 'christin' 'david' 'muhammad ali'"
And I want to parse it so that I have all those names in an array:
result = [
'alice',
'anna marie',
'benjamin',
'christin',
'david',
'muhammad ali'
]
Currently I'm using this code to do the job:
var result = str.match(/\s*'(.*?)'\s*'(.*?)'\s*'(.*?)'\s*'(.*?)'/);
But this regular expression is too long and it's not flexible, so if I have more elements in the str string, I have to edit the regular expression.
What is the fastest and most efficient way to do this parsing? Performance and flexibility are important in our web application.
I have looked at the following questions, but they do not answer mine:
Regular Expression For Quoted String
Regular Expression - How To Find Words and Quoted Phrases

Define the pattern once and use the global g flag.
var matches = str.match(/'[^']*'/g);
If you want the tokens without the single quotes around them, the normal approach would be to use sub-matches (capture groups) in the regex; however, in JavaScript str.match() does not return capture groups when the g flag is used. The simplest (though not necessarily most efficient) way around this is to remove the quotes afterwards, iteratively:
if (matches)
for (var i=0, len=matches.length; i<len; i++)
matches[i] = matches[i].replace(/'/g, '');
[EDIT] - as the other answers say, you could use split() instead, but only if you can rely on there always being a space (or some common delimiter) between each token in your string.
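The quote-stripping loop above can be avoided by reading the capture group directly. A minimal sketch, assuming an ES2020 environment where String.prototype.matchAll is available (the variable name `names` is only illustrative):

```javascript
var str = "'alice' 'anna marie' 'benjamin' 'christin' 'david' 'muhammad ali'";

// matchAll requires the g flag and yields one match array per hit,
// with the capture groups included (index 1 is the text between the quotes).
var names = Array.from(str.matchAll(/'([^']*)'/g), function (m) {
  return m[1];
});

console.log(names);
// [ 'alice', 'anna marie', 'benjamin', 'christin', 'david', 'muhammad ali' ]
```

In older environments, a loop over RegExp.prototype.exec gives the same access to capture groups.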

A different approach
I came here needing an approach that could parse a string into quoted and unquoted parts, preserve their order, and then output them wrapped in specific tags for React or React Native. I wasn't sure how to adapt the answers here to that need, so I did this instead:
function parseQuotes(str) {
var openQuote = false;
var parsed = [];
var quote = '';
var text = '';
for (var i = 0; i < str.length; i++) {
var item = str[i];
if (item === '"' && !openQuote) {
openQuote = true;
parsed.push({ type: 'text', value: text });
text = '';
}
else if (item === '"' && openQuote) {
openQuote = false;
parsed.push({ type: 'quote', value: quote });
quote = '';
}
else if (openQuote) quote += item;
else text += item;
}
if (openQuote) parsed.push({ type: 'text', value: '"' + quote });
else parsed.push({ type: 'text', value: text });
return parsed;
}
Given this input:
'Testing this "shhhh" if it "works!" " hahahah!'
it produces:
[
{
"type": "text",
"value": "Testing this "
},
{
"type": "quote",
"value": "shhhh"
},
{
"type": "text",
"value": " if it "
},
{
"type": "quote",
"value": "works!"
},
{
"type": "text",
"value": " "
},
{
"type": "text",
"value": "\" hahahah!"
}
]
This lets you easily wrap tags around each item depending on its type.
https://jsfiddle.net/o6seau4e/4/
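As a sketch of the wrapping step, assuming plain string tags rather than React elements: the `wrap` helper and the `<q>` tag are hypothetical, and parseQuotes is repeated in condensed form only so that the snippet runs on its own.

```javascript
// Condensed version of parseQuotes, producing the same token list.
function parseQuotes(str) {
  var parsed = [], buf = '', openQuote = false;
  for (var i = 0; i < str.length; i++) {
    var ch = str[i];
    if (ch === '"') {
      // A quote character closes the current run and flips the state.
      parsed.push({ type: openQuote ? 'quote' : 'text', value: buf });
      buf = '';
      openQuote = !openQuote;
    } else {
      buf += ch;
    }
  }
  // An unterminated quote is emitted as text, with its opening quote restored.
  parsed.push({ type: 'text', value: (openQuote ? '"' : '') + buf });
  return parsed;
}

// Hypothetical helper: wrap quoted items in <q>, leave plain text as-is.
function wrap(tokens) {
  return tokens.map(function (t) {
    return t.type === 'quote' ? '<q>' + t.value + '</q>' : t.value;
  }).join('');
}

var out = wrap(parseQuotes('Testing this "shhhh" if it "works!" " hahahah!'));
console.log(out);
// Testing this <q>shhhh</q> if it <q>works!</q> " hahahah!
```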

When a regex object has the global flag set, you can execute it multiple times against a string to find all matches. It works by starting the next search after the last character matched in the previous run:
var buf = "'abc' 'def' 'ghi'";
var exp = /'(.*?)'/g;
for(var match=exp.exec(buf); match!=null; match=exp.exec(buf)) {
alert(match[0]);
}
Personally, I find it a really good way to parse strings.
EDIT: the expression /'(.*?)'/g matches any content between single quotes ('); the quantifier *? is non-greedy, which greatly simplifies the pattern.
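To make the "next search starts after the last match" behaviour visible, here is a small sketch that records the capture group and the regex's lastIndex on each iteration (the arrays `found` and `positions` are illustrative names):

```javascript
var buf = "'abc' 'def' 'ghi'";
var exp = /'(.*?)'/g;
var found = [];
var positions = [];
var match;

while ((match = exp.exec(buf)) !== null) {
  found.push(match[1]);          // text between the quotes
  positions.push(exp.lastIndex); // index where the next search will start
}

console.log(found);     // [ 'abc', 'def', 'ghi' ]
console.log(positions); // [ 5, 11, 17 ]
console.log(exp.lastIndex); // 0, reset after the final failed match
```

Because the position is stored on the regex object itself, reusing the same global regex on another string before the loop has finished would continue from the stored index.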

One way:
var str = "'alice' 'benjamin' 'christin' 'david'";
var result = {};
str.replace(/'([^']*)'/g, function(m, p1) {
result[p1] = "";
});
for (var k in result) {
alert(k);
}

If someone gets here and needs more complex string parsing, with both single and double quotes and the ability to escape the quote character, this is the regex. Tested in JS and Ruby.
r = /(['"])((?:\\\1|(?!\1).)*)(\1)/g
str = "'alice' ddd vvv-12 'an\"na m\\'arie' \"hello ' world\" \"hello \\\" world\" 'david' 'muhammad ali'"
console.log(str.match(r).join("\n"))
'alice'
'an"na m\'arie'
"hello ' world"
"hello \" world"
'david'
'muhammad ali'
Note that the unquoted words were not matched. If the goal is to also find unquoted words, a small change will do:
r = /(['"])((?:\\\1|(?!\1).)*)(\1)|([^'" ]+)/g
console.log(str.match(r).join("\n"))
'alice'
ddd
vvv-12
'an"na m\'arie'
"hello ' world"
"hello \" world"
'david'
'muhammad ali'
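The matches above still carry their quotes and backslashes. A possible way to get the unescaped content is to read the second capture group and replace the escaped quote with the plain one. This is only a sketch, assuming matchAll is available; the name `values` and the shortened input string are illustrative:

```javascript
var r = /(['"])((?:\\\1|(?!\1).)*)(\1)/g;
var str = "'alice' ddd 'an\"na m\\'arie' \"hello \\\" world\"";

var values = Array.from(str.matchAll(r), function (m) {
  var quoteChar = m[1]; // the quote character that opened the string
  var raw = m[2];       // content between the quotes, still escaped
  // Turn \' (or \") back into the bare quote character.
  return raw.split('\\' + quoteChar).join(quoteChar);
});

console.log(values);
// [ 'alice', 'an"na m\'arie', 'hello " world' ]
```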

Related

String manipulation JavaScript - replace placeholders

I have a long string, which I have to manipulate in a specific way. The string can include other substrings which causes problems with my code. For that reason, before doing anything to the string, I replace all the substrings (anything introduced by " and ended with a non escaped ") with placeholders in the format: $0, $1, $2, ..., $n. I know for sure that the main string itself doesn't contain the character $ but one of the substrings (or more) could be for example "$0".
Now the problem: after manipulation/formatting the main string, I need to replace all the placeholders with their actual values again.
Conveniently I have them saved in this format:
// TypeScript
let substrings: { placeholderName: string; value: string }[];
But doing:
// JavaScript
let mainString1 = "main string $0 $1";
let mainString2 = "main string $0 $1";
let substrings = [
{ placeholderName: "$0", value: "test1 $1" },
{ placeholderName: "$1", value: "test2" }
];
for (const substr of substrings) {
mainString1 = mainString1.replace(substr.placeholderName, substr.value);
mainString2 = mainString2.replaceAll(substr.placeholderName, substr.value);
}
console.log(mainString1); // actual result: "main string test1 test2 $1"
console.log(mainString2); // actual result: "main string test1 test2 test2"
// wanted result: "main string test1 $1 test2"
is not an option since the substrings could include $x which would replace the wrong thing (by .replace() and by .replaceAll()).
Getting the substrings is achieved with a regex; maybe a regex could help here too? Though I have no control over what is saved inside the substrings...
If you're sure that all placeholders will follow the $x format, I'd go with the .replace() method with a callback:
const result = mainString1.replace(
/\$\d+/g,
placeholder => substrings.find(
substring => substring.placeholderName === placeholder
)?.value ?? placeholder
);
// result is "main string test1 $1 test2"
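A variation on the same idea, sketched under the assumption that there may be many placeholders: build a Map once so each lookup is constant-time instead of a linear find(). The names `lookup` and `restore` are hypothetical.

```javascript
const substrings = [
  { placeholderName: "$0", value: "test1 $1" },
  { placeholderName: "$1", value: "test2" }
];

// Build the lookup table once.
const lookup = new Map(substrings.map(s => [s.placeholderName, s.value]));

// A single pass over the main string: text inserted by the callback is
// never re-scanned, and a function's return value is inserted literally.
const restore = text =>
  text.replace(/\$\d+/g, p => (lookup.has(p) ? lookup.get(p) : p));

const restored = restore("main string $0 $1");
console.log(restored); // main string test1 $1 test2
```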
This may not be the most efficient code. But here is the function I made with comments.
Note: be careful because if you put the same placeholder inside itself it will create an infinite loop. Ex:
{ placeholderName: "$1", value: "test2 $1" }
let mainString1 = "main string $0 $1";
let mainString2 = "main string $0 $1";
let substrings = [{
placeholderName: "$0",
value: "test1 $1"
},
{
placeholderName: "$1",
value: "test2"
},
];
function replacePlaceHolders(mainString, substrings) {
let replacedString = mainString
//Find every placeholder; the following line returns an array with all of them. Ex: ['$1', '$n']
let placeholders = replacedString.match(/\$[0-9]+/gm)
//while there is some placeholder left to replace
while (placeholders !== null && placeholders.length > 0) {
//iterate over each placeholder
placeholders.forEach(placeholder => {
//extract the value to replace
let value = substrings.filter(x => x.placeholderName === placeholder)[0].value
//replace it
replacedString = replacedString.replace(placeholder, value)
})
//finally, check whether the replacement inserted any new placeholder. If so, the loop starts again.
placeholders = replacedString.match(/\$[0-9]+/gm)
}
}
return replacedString
}
console.log(replacePlaceHolders(mainString1, substrings))
console.log(replacePlaceHolders(mainString2, substrings))
EDIT:
OK... I think I understand your problem now. You didn't want the placeholder-like strings inside your values to be replaced.
This version of the code should work as expected, and you won't have to worry about infinite loops. However, be careful with your placeholders: "$" is a reserved character in regex, and there are others that you would need to escape. I assume all your placeholders look like "$1", "$2", etc. If they do not, you should edit the regexPlaceholder function, which wraps the placeholder and escapes that character.
let mainString1 = "main string $0 $1";
let mainString2 = "main string $0 $1 $2";
let substrings = [
{ placeholderName: "$0", value: "$1 test1 $2 $1" },
{ placeholderName: "$1", value: "test2 $2" },
{ placeholderName: "$2", value: "test3" },
];
function replacePlaceHolders(mainString, substrings) {
//You will need to escape the $ characters, or maybe even others, depending on how you made your placeholders
function regexPlaceholder(p) {
return new RegExp('\\' + p, "gm")
}
let replacedString = mainString
//Find every placeholder; the following line returns an array with all of them. Ex: ['$1', '$n']
let placeholders = replacedString.match(/\$[0-9]+/gm)
//if there is any placeHolder to replace
if (placeholders !== null && placeholders.length > 0) {
//we will declare some variable to check if the values had something inside that can be
//mistaken for a placeHolder.
//We will store how many of them have we changed and replace them back at the end
let replacedplaceholdersInValues = []
let indexofReplacedValue = 0
placeholders.forEach(placeholder => {
//extract the value to replace
let value = substrings.filter(x => x.placeholderName === placeholder)[0].value
//check whether the value contains a possible placeholder
let placeholdersInValues = value.match(/\$[0-9]+/gm)
if (placeholdersInValues !== null && placeholdersInValues.length > 0) {
placeholdersInValues.forEach(placeholdersInValue => {
//if it does, replace them with another mark, so our primary function won't change them
value = value.replace(regexPlaceholder(placeholdersInValue), "<markToReplace" + indexofReplacedValue + ">")
//and store every change to make a rollback later
replacedplaceholdersInValues.push({
placeholderName: placeholdersInValue,
value: "<markToReplace" + indexofReplacedValue + ">"
})
})
indexofReplacedValue++
}
//replace the actual placeholders
replacedString = replacedString.replace(regexPlaceholder(placeholder), value)
})
//if there was anything placeholder-like inside the values, change it back to normal
if (replacedplaceholdersInValues.length > 0) {
replacedplaceholdersInValues.forEach(replaced => {
replacedString = replacedString.replace(replaced.value, replaced.placeholderName)
})
}
}
return replacedString
}
console.log(replacePlaceHolders(mainString1, substrings))
console.log(replacePlaceHolders(mainString2, substrings))
The key is to choose a placeholder that cannot occur in either the main string or the substrings. My trick is to use non-printable characters as the placeholder. My favorite is the NUL character (0x00), because most people avoid it: C/C++ treat it as the end of a string. JavaScript, however, is robust enough to handle strings that contain NUL (written as \0 or \u0000):
let mainString1 = "main string \0-0 \0-1";
let mainString2 = "main string \0-0 \0-1";
let substrings = [
{ placeholderName: "\0-0", value: "test1 $1" },
{ placeholderName: "\0-1", value: "test2" }
];
The rest of your code does not need to change.
Note that I'm using the - character to prevent JavaScript from reading the digits 0 and 1 as part of an octal escape after \0.
If you have an aversion to \0, like most programmers, you can use any other non-printing character, such as \x01 (start of heading) or \x07 (the character that makes your terminal ring a bell; also, James Bond).
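A short end-to-end sketch of the NUL-placeholder idea. One assumption beyond the answer: the restore step uses split/join instead of replace(), so that a "$" sequence inside a value can never be interpreted as a replacement pattern.

```javascript
let mainString = "main string \0-0 \0-1";
const substrings = [
  { placeholderName: "\0-0", value: "test1 $1" },
  { placeholderName: "\0-1", value: "test2" }
];

for (const s of substrings) {
  // split/join inserts the value literally and replaces every occurrence.
  mainString = mainString.split(s.placeholderName).join(s.value);
}

console.log(mainString); // main string test1 $1 test2
```

Since the values cannot contain the NUL-based placeholders, the order in which the substrings are restored does not matter.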

javascript regex pattern with an array

I want to create a regex pattern which should be able to search through an array.
Let's say :
var arr = [ "first", "second", "third" ];
var match = text.match(/<arr>/);
which should be able to match only
<first> or <second> or <third> ......
but should ignore
<ffirst> or <dummy>
I need an efficient approach, please.
Any help would be great.
Thanks
First you can do array.map to quote all special regex characters.
Then you can do array.join to join the array elements using | and create an instance of RegExp.
Code:
function quoteSpecial(str) { return str.replace(/([\[\]^$|()\\+*?{}=!.])/g, '\\$1'); }
var arr = [ "first", "second", "third", "fo|ur" ];
var re = new RegExp('<(?:' + arr.map(quoteSpecial).join('|') + ')>');
//=> /<(?:first|second|third|fo\|ur)>/
then use this RegExp object:
'<first>'.match(re); // ["<first>"]
'<ffirst>'.match(re); // null
'<dummy>'.match(re); // null
'<second>'.match(re); // ["<second>"]
'<fo|ur>'.match(re); // ["<fo|ur>"]
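If the goal is to find every allowed tag in a longer text rather than test a single one, the same pattern can be built with the g flag and a capturing group. A sketch, assuming matchAll is available; `reAll`, `text` and `hits` are illustrative names:

```javascript
function quoteSpecial(str) { return str.replace(/([\[\]^$|()\\+*?{}=!.])/g, '\\$1'); }

var arr = ["first", "second", "third"];

// Capturing group instead of (?:...) so the bare word can be read from each match.
var reAll = new RegExp('<(' + arr.map(quoteSpecial).join('|') + ')>', 'g');

var text = "<first> <ffirst> <dummy> <third>";
var hits = Array.from(text.matchAll(reAll), function (m) { return m[1]; });

console.log(hits); // [ 'first', 'third' ]
```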
You should search for a specific word from a list using (a|b|c).
The list is made from the arr by joining the values with | char as glue
var arr = [ "first", "second", "third" ];
var match = text.match(new RegExp("<(?:"+arr.join("|")+")>")); //matches <first> <second> and <third>
Note that if your "source" words might contain characters that are reserved in regular expressions, you might get into trouble, so you might need to escape those characters before joining the array.
A good function for doing so can be found here:
function regexpQuote(str, delimiter) {
return String(str)
.replace(new RegExp('[.\\\\+*?\\[\\^\\]$(){}=!<>|:\\' + (delimiter || '') + '-]', 'g'), '\\$&');
}
so in this case you'll have
function escapeArray(arr){
var escaped = [];
for(var i = 0; i < arr.length; i++){
escaped.push(regexpQuote(arr[i]));
}
return escaped;
}
var arr = [ "first", "second", "third" ];
var pattern = new RegExp("<(?:"+escapeArray(arr).join("|")+")>");
var match = text.match(pattern); //matches <first> <second> and <third>

replace all commas within a quoted string

Is there any way to capture and replace all the commas inside the quoted parts of a string, but not the commas outside them? I'd like to change them to pipes; however, this:
/("(.*?)?,(.*?)")/gm
is only getting the first instance:
JSBIN
If callbacks are okay, you can go for something like this:
var str = '"test, test2, & test3",1324,,,,http://www.asdf.com';
var result = str.replace(/"[^"]+"/g, function (match) {
return match.replace(/,/g, '|');
});
console.log(result);
//"test| test2| & test3",1324,,,,http://www.asdf.com
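Once the quoted commas are turned into pipes, the remaining commas are safe field separators. A sketch of that follow-up step, assuming the data itself never contains a pipe (the names `piped` and `fields` are illustrative); it uses `[^"]*` so that an empty quoted field `""` is also matched:

```javascript
var str = '"test, test2, & test3",1324,,,,http://www.asdf.com';

// Replace commas only inside double-quoted sections.
var piped = str.replace(/"[^"]*"/g, function (quoted) {
  return quoted.replace(/,/g, '|');
});

// Every comma that is left sits outside the quotes.
var fields = piped.split(',');

console.log(fields);
// [ '"test| test2| & test3"', '1324', '', '', '', 'http://www.asdf.com' ]
```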
This is very convoluted compared to the regular expression version; however, I wanted to do it just for the sake of experiment:
var PEG = require("pegjs");
var parser = PEG.buildParser(
["start = seq",
"delimited = d:[^,\"]* { return d; }",
"quoted = q:[^\"]* { return q; }",
"quote = q:[\"] { return q; }",
"comma = c:[,] { return ''; }",
"dseq = delimited comma dseq / delimited",
"string = quote dseq quote",
"seq = quoted string seq / quoted quote? quoted?"].join("\n")
);
function flatten(array) {
return (array instanceof Array) ?
[].concat.apply([], array.map(flatten)) :
array;
}
flatten(parser.parse('foo "bar,bur,ber" baz "bbbr" "blerh')).join("");
// 'foo "barburber" baz "bbbr" "blerh'
I don't advise you to do this in this particular case, but maybe it will create some interest :)
PS. pegjs can be found here: (I'm not an author and have no affiliation, I simply like PEG) http://pegjs.majda.cz/documentation

How to determine if a string matches a pattern type

Given the following pattern types, where 11 and 22 are variable:
#/projects/11
#/projects/11/tasks/22
With Javascript/jQuery, given var url, how can I determine if var url equals either string 1, 2 or neither?
Thanks
You can do it using a single regular expression:
var reg = /^#\/projects\/(\d+)(?:\/tasks\/(\d+))?$/,
str = "#/projects/11/tasks/22",
match = str.match(reg);
if (match && !match[2]) {
// Match on string 1
} else if (match && match[2]) {
// Match on string 2
} else {
// No match
}
The expression I wrote uses sub-expressions to capture the digits; the result would be an array that looks like this:
"#/projects/11/tasks/22".match(reg);
//-> ["#/projects/11/tasks/22", "11", "22"]
"#/projects/11".match(reg);
//-> ["#/projects/11", "11", undefined]
There are many regular expression tutorials online that will help you understand how to solve problems like this one - I'd recommend searching Google for such a tutorial.
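The branching above can be folded into a small function that reports which pattern matched. This is a sketch only; the function name `classify` and the shape of the returned object are assumptions, not part of the answer.

```javascript
var reg = /^#\/projects\/(\d+)(?:\/tasks\/(\d+))?$/;

// Returns null for no match, otherwise an object describing the match.
function classify(url) {
  var m = reg.exec(url); // reg has no g flag, so exec keeps no state between calls
  if (!m) return null;
  return m[2] === undefined
    ? { type: 'project', project: m[1] }
    : { type: 'task', project: m[1], task: m[2] };
}

console.log(classify('#/projects/11'));          // { type: 'project', project: '11' }
console.log(classify('#/projects/11/tasks/22')); // { type: 'task', project: '11', task: '22' }
console.log(classify('#/other/11'));             // null
```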
I would look into Regex personally, as it is easy to set up a pattern and test if a string applies to it. Try this: http://www.regular-expressions.info/javascript.html
Here's a more flexible approach you can use for other urls too http://jsfiddle.net/EXRXE/
/*
in: "#/cat/34/item/24"
out: {
cat: "34",
item: "24"
}
*/
function translateUrl(url) {
// strip everything from the beginning that's not a letter
url = url.replace(/^[^a-zA-Z]*/, "");
var parts = url.split("/");
var obj = {};
for(var i=0; i < parts.length; i+=2) {
obj[parts[i]] = parts[i+1]
}
return obj;
}
var url = translateUrl('#/projects/11/tasks/22');
console.log(url);
if (url.projects) {
console.log("Project is " + url.projects);
}
if (url.tasks) {
console.log("Task is " + url.tasks);
}

Javascript split string on space or on quotes to array

var str = 'single words "fixed string of words"';
var astr = str.split(" "); // need fix
I would like the array to be like this:
var astr = ["single", "words", "fixed string of words"];
The accepted answer is not entirely correct. It separates on non-space characters like . and - and leaves the quotes in the results. A better way to do this, so that the quotes are excluded, is with capturing groups, like so:
//The parentheses in the regex create a capturing group within the quotes
var myRegexp = /[^\s"]+|"([^"]*)"/gi;
var myString = 'single words "fixed string of words"';
var myArray = [];
do {
//Each call to exec returns the next regex match as an array
var match = myRegexp.exec(myString);
if (match != null)
{
//Index 1 in the array is the captured group if it exists
//Index 0 is the matched text, which we use if no captured group exists
myArray.push(match[1] ? match[1] : match[0]);
}
} while (match != null);
myArray will now contain exactly what the OP asked for:
single,words,fixed string of words
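The same capturing-group idea can be packaged as a function. The sketch below makes two assumptions beyond the answer: it also accepts single-quoted phrases, and it checks the groups against undefined so that an empty quoted string "" yields an empty token. The name `splitArgs` is hypothetical.

```javascript
function splitArgs(input) {
  // Group 1: double-quoted content, group 2: single-quoted content,
  // otherwise a run of characters that are neither whitespace nor quotes.
  var re = /"([^"]*)"|'([^']*)'|[^\s"']+/g;
  return Array.from(input.matchAll(re), function (m) {
    if (m[1] !== undefined) return m[1];
    if (m[2] !== undefined) return m[2];
    return m[0];
  });
}

var tokens = splitArgs('single words "fixed string of words" \'and more\' ""');
console.log(tokens);
// [ 'single', 'words', 'fixed string of words', 'and more', '' ]
```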
str.match(/\w+|"[^"]+"/g)
//single, words, "fixed string of words"
This uses a mix of split and regex matching.
var str = 'single words "fixed string of words"';
var matches = /".+?"/.exec(str);
str = str.replace(/".+?"/, "").replace(/^\s+|\s+$/g, "");
var astr = str.split(" ");
if (matches) {
for (var i = 0; i < matches.length; i++) {
astr.push(matches[i].replace(/"/g, ""));
}
}
This returns the expected result, although a single regexp should be able to do it all.
// ["single", "words", "fixed string of words"]
Update
And this is the improved version of the method proposed by S.Mark
var str = 'single words "fixed string of words"';
var aStr = str.match(/\w+|"[^"]+"/g), i = aStr.length;
while(i--){
aStr[i] = aStr[i].replace(/"/g,"");
}
// ["single", "words", "fixed string of words"]
Here might be a complete solution:
https://github.com/elgs/splitargs
ES6 solution supporting:
Split by space except for inside quotes
Removing quotes but not for backslash escaped quotes
Escaped quotes become quotes
Can put quotes anywhere
Code:
str.match(/\\?.|^$/g).reduce((p, c) => {
if(c === '"'){
p.quote ^= 1;
}else if(!p.quote && c === ' '){
p.a.push('');
}else{
p.a[p.a.length-1] += c.replace(/\\(.)/,"$1");
}
return p;
}, {a: ['']}).a
Output:
[ 'single', 'words', 'fixed string of words' ]
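To see the escape handling at work, here is a sketch that wraps the expression above in a function (the name `splitEscaped` is hypothetical) and feeds it an input with a backslash-escaped quote:

```javascript
function splitEscaped(str) {
  // The regex yields one token per character, keeping a backslash
  // together with the character it escapes.
  return str.match(/\\?.|^$/g).reduce((p, c) => {
    if (c === '"') {
      p.quote ^= 1;                 // toggle the "inside quotes" state
    } else if (!p.quote && c === ' ') {
      p.a.push('');                 // an unquoted space starts a new token
    } else {
      // append the character, dropping the escaping backslash if present
      p.a[p.a.length - 1] += c.replace(/\\(.)/, '$1');
    }
    return p;
  }, { a: [''] }).a;
}

const parts = splitEscaped('say "hello world" a\\"b');
console.log(parts);
// [ 'say', 'hello world', 'a"b' ]
```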
This will split it into an array and strip off the surrounding quotes from any remaining string.
const parseWords = (words = '') =>
(words.match(/[^\s"]+|"([^"]*)"/gi) || []).map((word) =>
word.replace(/^"(.+(?="$))"$/, '$1'))
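This solution also excludes single quotes (') from unquoted words. Note that only double-quoted phrases are kept together, and that match() leaves their quotes in place:
Code:
str.match(/[^\s"']+|"([^"]*)"/gmi)
// ["single", "words", "\"fixed string of words\""]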
Here it shows how this regular expression would work: https://regex101.com/r/qa3KxQ/2
Until I found @dallin's answer (this thread: https://stackoverflow.com/a/18647776/1904943), I was having difficulty processing strings with a mix of unquoted and quoted terms / phrases, via JavaScript.
In researching that issue, I ran a number of tests.
As I found it difficult to find this information, I have collated the relevant information (below), which may be useful to others seeking answers on the processing in JavaScript of strings containing quoted words.
let q = 'apple banana "nova scotia" "british columbia"';
Extract [only] quoted words and phrases:
// https://stackoverflow.com/questions/12367126/how-can-i-get-a-substring-located-between-2-quotes
const r = q.match(/"([^']+)"/g);
console.log('r:', r)
// r: Array [ "\"nova scotia\" \"british columbia\"" ]
console.log('r:', r.toString())
// r: "nova scotia" "british columbia"
// ----------------------------------------
// [alternate regex] https://www.regextester.com/97161
const s = q.match(/"(.*?)"/g);
console.log('s:', s)
// s: Array [ "\"nova scotia\"", "\"british columbia\"" ]
console.log('s:', s.toString())
// s: "nova scotia","british columbia"
Extract [all] unquoted, quoted words and phrases:
// https://stackoverflow.com/questions/2817646/javascript-split-string-on-space-or-on-quotes-to-array
const t = q.match(/\w+|"[^"]+"/g);
console.log('t:', t)
// t: Array(4) [ "apple", "banana", "\"nova scotia\"", "\"british columbia\"" ]
console.log('t:', t.toString())
// t: apple,banana,"nova scotia","british columbia"
// ----------------------------------------------------------------------------
// https://stackoverflow.com/questions/2817646/javascript-split-string-on-space-or-on-quotes-to-array
// [@dallin's answer (this thread)] https://stackoverflow.com/a/18647776/1904943
var myRegexp = /[^\s"]+|"([^"]*)"/gi;
var myArray = [];
do {
/* Each call to exec returns the next regex match as an array. */
var match = myRegexp.exec(q); // << "q" = my query (string)
if (match != null)
{
/* Index 1 in the array is the captured group if it exists.
* Index 0 is the matched text, which we use if no captured group exists. */
myArray.push(match[1] ? match[1] : match[0]);
}
} while (match != null);
console.log('myArray:', myArray, '| type:', typeof(myArray))
// myArray: Array(4) [ "apple", "banana", "nova scotia", "british columbia" ] | type: object
console.log(myArray.toString())
// apple,banana,nova scotia,british columbia
Work with a set (rather than an array):
// https://stackoverflow.com/questions/28965112/javascript-array-to-set
var mySet = new Set(myArray);
console.log('mySet:', mySet, '| type:', typeof(mySet))
// mySet: Set(4) [ "apple", "banana", "nova scotia", "british columbia" ] | type: object
Iterating over set elements:
mySet.forEach(x => console.log(x));
/* apple
* banana
* nova scotia
* british columbia
*/
// https://stackoverflow.com/questions/16401216/iterate-over-set-elements
myArrayFromSet = Array.from(mySet);
for (let i=0; i < myArrayFromSet.length; i++) {
console.log(i + ':', myArrayFromSet[i])
}
/*
0: apple
1: banana
2: nova scotia
3: british columbia
*/
Asides
The JavaScript responses above are from the Firefox Developer Tools (F12, from web page). I created a blank HTML file that calls a .js file that I edit with Vim, as my IDE. Simple JavaScript IDE
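Based on my tests, the cloned set behaves like an independent copy. Note that new Set() and Array.from() copy the elements shallowly, which makes no difference for strings. Shallow-clone an ES6 Map or Set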
I noticed the disappearing characters, too. I think you can include them - for example, to have it include "+" with the word, use something like "[\w\+]" instead of just "\w".
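A quick sketch of that suggestion, using a made-up input string: with plain \w, characters such as "+" and "." are dropped or split words apart, while a character class that lists them keeps the words intact.

```javascript
var q = 'c++ node.js "nova scotia"';

// Plain \w+ loses "+" and splits on "."
var plain = q.match(/\w+|"[^"]+"/g);
console.log(plain);    // [ 'c', 'node', 'js', '"nova scotia"' ]

// Adding "+" and "." to the class keeps them as part of the word
var extended = q.match(/[\w+.]+|"[^"]+"/g);
console.log(extended); // [ 'c++', 'node.js', '"nova scotia"' ]
```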
