A JavaScript regular expression to tokenize the query

Hi, I've stumbled upon a problem related to regular expressions that I cannot resolve.
I need to tokenize the query (split query into parts), suppose the following one as an example:
These are the separate query elements "These are compound composite term"
What I eventually need is to have an array of 7 tokens:
1) These
2) are
3) the
4) separate
5) query
6) elements
7) These are compound composite term
The seventh token consists of several words because it was inside double quotation marks.
My question is: is it possible to tokenize the input string according to the explanation above using one regular expression?
Edit
I was curious about the possibility of using RegExp.prototype.exec or similar code instead of split to achieve the same thing, so I did some investigation, which was followed by another question here. As another answer to that question, the following regex can be used:
(?:")(?:\w+\W*)+(?:")|\w+
With the following one-liner usage scenario:
var tokens = query.match(/(?:")(?:\w+\W*)+(?:")|\w+/g);
Hope it will be useful...

You can use this regex:
var s = 'These are the separate query elements "These are compound composite term"';
var arr = s.split(/(?=(?:(?:[^"]*"){2})*[^"]*$)\s+/g);
//=> ["These", "are", "the", "separate", "query", "elements", "\"These are compound composite term\""]
This regex splits on whitespace only when it is outside double quotes, by using a lookahead to make sure there is an even number of quotes after the whitespace.

You can use a simpler approach to split the string and grab the substrings inside double quotation marks, and then get rid of empty array items with clean function:
Array.prototype.clean = function() {
for (var i = 0; i < this.length; i++) {
if (this[i] == undefined || this[i] == '') {
this.splice(i, 1);
i--;
}
}
return this;
};
var re = /"(.*?)"|\s/g;
var str = 'These are the separate query elements "These are compound composite term"';
var arr = str.split(re);
alert(arr.clean());

You can get everything that is between one quote and the next ".*?" or everything that is not a whitespace \S+:
var re = /".*?"|\S+/g,
str = 'These are the separate query elements "These are compound composite term"',
m,
arr = [];
while ( m = re.exec( str ) ){
arr.push( m[0] );
}
alert( arr.join('\n') );
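If the surrounding quotation marks should not be part of the seventh token, a capture group can strip them. The following is a minimal sketch, assuming quotes are balanced and never nested; the tokenize name is made up for this illustration, and String.prototype.matchAll needs an ES2020 environment.

```javascript
// Hypothetical helper: split a query into bare words and quoted phrases,
// dropping the quotation marks around each phrase.
function tokenize(query) {
  // Group 1 captures the inside of a quoted phrase, group 2 a bare word.
  const re = /"([^"]*)"|(\S+)/g;
  return Array.from(query.matchAll(re), (m) =>
    m[1] !== undefined ? m[1] : m[2]
  );
}

const tokens = tokenize(
  'These are the separate query elements "These are compound composite term"'
);
console.log(tokens.length); // 7
console.log(tokens[6]);     // These are compound composite term
```

With an unbalanced quote the stray " simply ends up inside a bare-word token, which may or may not be what you want.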

\s(?=[^"]*(?:"[^"]*")*[^"]*$)
You can split by this. See the demo:
https://www.regex101.com/r/fJ6cR4/20
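As a hedged comparison, the sketch below runs this pattern and the quote-pairing pattern from the earlier split answer against a made-up input with two quoted phrases. The sample string is an assumption for illustration only. The pattern here expects all quoted phrases after the split point to be directly adjacent, so it skips a space that is followed by a quoted phrase, more text, and another quoted phrase.

```javascript
// Pattern from this answer: whitespace followed only by back-to-back quoted phrases.
const simple = /\s(?=[^"]*(?:"[^"]*")*[^"]*$)/;
// Pattern from the earlier answer: counts the remaining quotes in pairs.
const paired = /(?=(?:(?:[^"]*"){2})*[^"]*$)\s+/;

// Hypothetical input with two quoted phrases separated by a bare word.
const sample = 'x "a" b "c"';

console.log(sample.split(simple)); // ['x "a"', 'b', '"c"']
console.log(sample.split(paired)); // ['x', '"a"', 'b', '"c"']
```

For the single quoted phrase in the question both patterns give the same seven tokens.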

Related

Regex matching comma delimited strings

Given any of the following strings, where operator and value are just placeholders:
"operator1(value)"
"operator1(value), operator2(value)"
"operator1(value), operator2(value), operator_n(value)"
I need to be able to match so I can get each operator and its value as follows:
[[operator1, value]]
[[operator1, value], [operator2, value]]
[[operator1, value], [operator2, value], [operator_n, value]]
Please Note: There could be n number of operators (comma delimited) in the given string.
My current attempt will match on operator1(value) but nothing with multiple operators. See regex101 for the results.
/^(.*?)\((.*)\)$/
You should be able to do this with a single regex using the global flag.
var re= /(?:,\s*)?([^(]+?)\(([^)]+)\)/g;
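var results = Array.from(str.matchAll(re), function (m) { return [m[1], m[2]]; }); // a single exec() call returns only the first match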
See the result at Regex 101: https://regex101.com/r/eC3uK3/2
Here's a pure regex answer to this question. It will work as long as your entries are always separated by a comma and a space, and it should handle multiple lines without much issue:
https://regex101.com/r/eC3uK3/4
([^\(]*)(\([^, ]*\))(?:, )?(?:\n)?
Matches on:
operator1(value), operator2(value), operator_n(value),
operator1(value), operator2(value)
Explanation:
So, this sets up 2 capture groups and 2 non-capture groups.
The first capture group matches an operator name up to the opening parenthesis (using a greedy negated set). The second capture group grabs the parentheses and the value between them. Note that you can drop the parentheses from the capture by escaping the outer set of parentheses rather than the inner one (example here: https://regex101.com/r/eC3uK3/6). There is an optional ", " in one non-capturing group and an optional "\n" in another to handle any newline characters you may come across.
This should break your data out into:
'operator1'
'(value)'
'operator2'
'(value)'
For as many as there are.
You can do this by first splitting then using a regular expression:
[
"operator1(value)",
"operator1(value), operator2(value)",
"operator1(value), operator2(value), operator_n(value)"
].forEach((str)=>{
var results = str
.split(/[,\s]+/) // split operations
.map(s=>s.match(/(\w+)\((\w+)\)/)) // extracts parts of the operations
.filter(Boolean) // ensure there's no error (in case of impure entries)
.map(s=>s.slice(1)); // make the desired result
console.log(results);
});
The following function, check, will achieve what you are looking for. If you want a string instead of an array of results, simply call .toString() on the array returned from the function.
function check(str) {
var myRe = /([^(,\s]*)\(([^)]*)\)/g;
var myArray;
var result = [];
while ((myArray = myRe.exec(str)) !== null) {
result.push(`[${myArray[1]}, ${myArray[2]}]`);
};
return result;
}
var check1 = check("operator1(value)");
console.log("check1", check1);
var check2 = check("operator1(value), operator2(value)");
console.log("check2", check2);
var check3 = check("operator1(value), operator2(value), operator_n(value)");
console.log("check3", check3);
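The check function above pushes formatted strings. If the nested arrays from the question are wanted instead, a variation along these lines might do. This is a sketch: parsePairs is a hypothetical name, and the operator part uses + rather than * so that an empty operator name is rejected, which is an assumption about the input.

```javascript
// Hypothetical helper: return [operator, value] pairs as nested arrays.
function parsePairs(str) {
  // Group 1: operator name (no "(", "," or whitespace); group 2: the value.
  const re = /([^(,\s]+)\(([^)]*)\)/g;
  const pairs = [];
  let m;
  // With the g flag, each exec() call resumes after the previous match.
  while ((m = re.exec(str)) !== null) {
    pairs.push([m[1], m[2]]);
  }
  return pairs;
}

console.log(parsePairs("operator1(value), operator2(value)"));
// [['operator1', 'value'], ['operator2', 'value']]
```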
This can also be done with a simple split and a for loop.
var data = "operator1(value), operator2(value), operator_n(value)",
ops = data.substring(0, data.length - 1), // Remove the last parenth from the string
arr = ops.split(/\(|\), /),
res = [], n, eN = arr.length;
for (n = 0; n < eN; n += 2) {
res.push([arr[n], arr[n + 1]]);
}
console.log(res);
The code creates a flattened array from a string, and then nests arrays of "operator"/"value" pairs to the result array. Works for older browsers too.

Separate value from string using javascript

I have a string in which every value is between [] and it has a . at the end. How can I separate all values from the string?
This is the example string:
[value01][value02 ][value03 ]. [value04 ]
//want something like this
v1 = value01;
v2 = value02;
v3 = value03;
v4 = value04
The number of values is not constant. How can I get all values separately from this string?
Use regular expressions to specify multiple separators. Please check the following posts:
How do I split a string with multiple separators in javascript?
Split a string based on multiple delimiters
var str = "[value01][value02 ][value03 ]. [value04 ]"
var arr = str.split(/[\[\]\.\s]+/);
arr.shift(); arr.pop(); //discard the first and last "" elements
console.log( arr ); //output: ["value01", "value02", "value03", "value04"]
How This Works
.split(/[\[\]\.\s]+/) splits the string wherever it finds one or more of the following characters: [, ], ., or whitespace. Since these characters are also found at the beginning and end of the string, .shift() discards the first element and .pop() discards the last element, both of which are empty strings. However, you may want to use .filter() instead, in which case you can replace lines 2 and 3 with:
var arr = str.split(/[\[\]\.\s]+/).filter(function(elem) { return elem.length > 0; });
Now you can use jQuery/JS to iterate through the values:
$.each( arr, function(i,v) {
console.log( v ); // outputs the i'th value;
});
And arr.length will give you the number of elements you have.
If you want to get the characters between "[" and "]" and the data is regular and always has the pattern:
'[chars][chars]...[chars]'
then you can get the chars using match to get sequences of characters that aren't "[" or "]":
var values = '[value01][value02 ][value03 ][value04 ]'.match(/[^\[\]]+/g)
which returns an array, so values is:
["value01", "value02 ", "value03 ", "value04 "]
Match is very widely supported, so no cross browser issues.
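Note that the values returned above keep their trailing spaces, and that the original string also contains ". " between two of the brackets. A minimal sketch that captures only what sits inside each pair of brackets and trims it could look like this. The bracketValues name is hypothetical, and matchAll assumes an ES2020 environment rather than the older browsers mentioned above.

```javascript
// Hypothetical helper: collect the trimmed text inside each [...] pair.
function bracketValues(str) {
  // Group 1 is everything between "[" and the next "]".
  return Array.from(str.matchAll(/\[([^\]]*)\]/g), (m) => m[1].trim());
}

console.log(bracketValues("[value01][value02 ][value03 ]. [value04 ]"));
// ['value01', 'value02', 'value03', 'value04']
```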
Here's a fiddle: http://jsfiddle.net/5xVLQ/
Regex pattern: /(\w)+/ig
Matches all words using \w (letters, digits, and underscore). Whitespace, dots, and square brackets do not match, so they are not returned.
What I do is create an object to hold results in key/value pairs such as v1:'value01'. You can iterate through this object, or you can access the values directly using objRes.v1
var str = '[value01][value02 ][value03 ]. [value04 ]';
var myRe = /(\w)+/ig;
var res;
var objRes = {};
var i=1;
while ( ( res = myRe.exec(str) ) != null )
{
objRes['v'+i] = res[0];
i++;
}
console.log(objRes);

How to remove the last matched regex pattern in javascript

I have a text which goes like this...
var string = '~a=123~b=234~c=345~b=456'
I need to extract the string such that it splits into
['~a=123~b=234~c=345','']
That is, I need to split the string with /b=.*/ pattern but it should match the last found pattern. How to achieve this using RegEx?
Note: The numbers present after the equal is randomly generated.
Edit:
The above one was just an example. I did not make the question clear I guess.
Generalized String being...
<word1>=<random_alphanumeric_word>~<word2>=<random_alphanumeric_word>..~..~..<word2>=<random_alphanumeric_word>
All parts have random length and every word<i> consists of letters only; the length of the whole string is not fixed. The only known text is <word2>. Hence I need a RegEx for it, the pattern being /<word2>=.*/
This doesn't sound like a job for regexen considering that you want to extract a specific piece. Instead, you can just use lastIndexOf to split the string in two:
var lio = str.lastIndexOf('b=');
var arr = [];
arr[0] = str.substr(0, lio); // everything before the last "b="
arr[1] = str.substr(lio); // the last "b=" pair itself
http://jsfiddle.net/NJn6j/
I don't think I'd personally use a regex for this type of problem, but you can extract the last option pair with a regex like this:
var str = '~a=123~b=234~c=345~b=456';
var matches = str.match(/^(.*)~([^=]+=[^=]+)$/);
// matches[1] = "~a=123~b=234~c=345"
// matches[2] = "b=456"
Demo: http://jsfiddle.net/jfriend00/SGMRC/
Assuming the format is (~, alphanumeric name, =, and numbers) repeated arbitrary number of times. The most important assumption here is that ~ appear once for each name-value pair, and it doesn't appear in the name.
You can remove the last token by a simple replacement:
str.replace(/(.*)~.*/, '$1')
This works by using the greedy property of * to force it to match the last ~ in the input.
This can also be achieved with lastIndexOf, since you only need to know the index of the last ~:
str.substring(0, (str.lastIndexOf('~') + 1 || str.length + 1) - 1)
(Well, I don't know if the code above is good JS or not... I would rather write in a few lines. The above is just for showing one-liner solution).
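Written over a few lines, the same idea might look like the sketch below. The removeLastPair name is made up here, and the function assumes, as stated above, that ~ appears exactly once per name-value pair.

```javascript
// Hypothetical helper: drop the last "~name=value" pair from the string.
function removeLastPair(str) {
  const i = str.lastIndexOf("~");
  // lastIndexOf returns -1 when there is no "~"; keep the string as-is then.
  return i === -1 ? str : str.slice(0, i);
}

console.log(removeLastPair("~a=123~b=234~c=345~b=456")); // ~a=123~b=234~c=345
```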
A RegExp that will give a result you may be able to use is:
string.match(/[a-z]*?=(.*?((?=~)|$))/gi);
// ["a=123", "b=234", "c=345", "b=456"]
But in your case the simplest solution is to split the string before extracting the content:
var results = string.split('~'); // ["", "a=123", "b=234", "c=345", "b=456"]
Now it is easy to extract each key and value and add them to an object:
var myObj = {};
results.forEach(function (item) {
if(item) {
var r = item.split('=');
if (!myObj[r[0]]) {
myObj[r[0]] = [r[1]];
} else {
myObj[r[0]].push(r[1]);
}
}
});
console.log(myObj);
Object:
a: ["123"]
b: ["234", "456"]
c: ["345"]
(?=.*(~b=[^~]*))\1
will get it done in one match, but if there are duplicate entries it will go to the first. Performance also isn't great and if you string.replace it will destroy all duplicates. It would pass your example, but against '~a=123~b=234~c=345~b=234' it would go to the first 'b=234'.
.*(~b=[^~]*)
will run a lot faster, but it requires another step because the match comes out in a group:
var re = /.*(~b=[^~]*)/.exec(string);
var result = re[1]; //~b=234
var array = string.split(re[1]);
This method will also have the same problem with exact duplicates. Another option is:
var regex = /.*(~b=[^~]*)/g;
var re = regex.exec(string);
var result = re[1];
// if you want an array from either side of the string:
var array = [string.slice(0, regex.lastIndex - re[1].length - 1), string.slice(regex.lastIndex, string.length)];
This actually finds the exact location of the last match and removes it. regex.lastIndex - re[1].length - 1 is my guess for the index at which to cut the leading side, but I didn't test it, so it might be off by 1.
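One way to avoid guessing at the offset is to derive the start of the captured group from the match itself. The following is only a sketch: splitAroundLast is a hypothetical name, and the key is assumed to contain no regex metacharacters.

```javascript
// Hypothetical helper: split the string around the last "~key=value" pair.
function splitAroundLast(str, key) {
  // The greedy .* pushes the captured group to the last "~key=" occurrence.
  const m = new RegExp(".*(~" + key + "=[^~]*)").exec(str);
  if (m === null) return [str];
  // The group ends where the whole match ends, so its start is computable.
  const start = m.index + m[0].length - m[1].length;
  return [str.slice(0, start), str.slice(start + m[1].length)];
}

console.log(splitAroundLast("~a=123~b=234~c=345~b=456", "b"));
// ['~a=123~b=234~c=345', '']
```

Because the index is computed rather than searched for, an exact duplicate of the last pair earlier in the string does not affect the result.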

How can I split this string in JavaScript?

I have strings like this:
ab
rx'
wq''
pok'''
oyu,
mi,,,,
Basically, I want to split the string into two parts. The first part should have the alphabetical characters intact, the second part should have the non-alphabetical characters.
The alphabetical part is guaranteed to be 2-3 lowercase characters between a and z; the non-alphabetical part can be any length, and is guaranteed to only be the characters , or ', but not both in the one string (e.g. eex,', will never occur).
So the result should be:
[ab][]
[rx][']
[wq]['']
[pok][''']
[oyu][,]
[mi][,,,,]
How can I do this? I'm guessing a regular expression but I'm not particularly adept at coming up with them.
Regular expressions have a nice special assertion called "word boundary" (\b). You can use it, well, to detect the boundary of a word, which is a sequence of alphanumeric characters (plus the underscore).
So all you have to do is
foo.split(/\b/)
For example,
"pok'''".split(/\b/) // ["pok", "'''"]
If you can 100% guarantee that:
Letter-strings are 2 or 3 characters
There are always one or more primes/commas
There is never any empty space before, after or in-between the letters and the marks
(aside from line-break)
You can use:
/^([a-zA-Z]{2,3})('+|,+)$/gm
var arr = /^([a-zA-Z]{2,3})('+|,+)$/gm.exec("pok'''");
arr === ["pok'''", "pok", "'''"];
var arr = /^([a-zA-Z]{2,3})('+|,+)$/gm.exec("baf,,,");
arr === ["baf,,,", "baf", ",,,"];
Of course, save yourself some sanity, and save that RegEx as a var.
And as a warning, if you haven't dealt with RegEx like this:
If a match isn't found -- if you try to match foo','' by mixing marks, or you have 0-1 or 4+ letters, or 0 marks... ...then instead of getting an array back, you'll get null.
So you can do this:
var reg = /^([a-zA-Z]{2,3})('+|,+)$/gm,
string = "foobar'',,''",
result_array = reg.exec(string) || [string];
In this case, the result of the exec is null; by putting the || (or) there, we can return an array that has the original string in it, as index-0.
Why?
Because the result of a successful exec will have 3 slots; [*string*, *letters*, *marks*].
You might be tempted to just read the letters like result_array[1].
But if the match failed and result_array === null, then JavaScript will scream at you for trying null[1].
So returning the array at the end of a failed exec will allow you to get result_array[1] === undefined (ie: there was no match to the pattern, so there are no letters in index-1), rather than a JS error.
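The pattern above requires at least one mark, so the question's first sample (ab, with no marks) would not match. A hedged variation that allows an empty marks part and returns just the two pieces is sketched below. splitParts is a made-up name, and lowercase-only letters are assumed, as the question states.

```javascript
// Hypothetical helper: return [letters, marks], or null when the input
// does not fit the format described in the question.
function splitParts(s) {
  // 2-3 lowercase letters, then only apostrophes or only commas (or nothing).
  const m = /^([a-z]{2,3})('*|,*)$/.exec(s);
  return m === null ? null : [m[1], m[2]];
}

console.log(splitParts("ab"));     // ['ab', '']
console.log(splitParts("pok'''")); // ['pok', "'''"]
console.log(splitParts("mi,,,,")); // ['mi', ',,,,']
console.log(splitParts("eex,',")); // null
```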
You could try something like that:
function splitString(string) {
  var match1 = string.indexOf(',');
  var match2 = string.indexOf("'");
  // indexOf returns -1 (not 0) when the character is absent
  var idx = match1 !== -1 ? match1 : match2;
  if (idx !== -1) {
    // letters before the first mark, marks from there to the end
    return [string.slice(0, idx), string.slice(idx)];
  }
  return [string, ''];
}
var str = "mi,,,,";
var idx = str.search(/\W/);
// search returns -1 when there is no non-word character
var list = idx === -1 ? [str, ''] : [str.slice(0, idx), str.slice(idx)];
You'll have the parts in list[0] and list[1].
P.S. There might be some better ways than this.
yourStr.match(/(\w{2,3})([,']*)/)
if (match = string.match(/^([a-z]{2,3})(,+?$|'+?$)/)) {
match = match.slice(1);
}

split string only on first instance of specified character

In my code I split a string based on _ and grab the second item in the array.
var element = $(this).attr('class');
var field = element.split('_')[1];
Takes good_luck and provides me with luck. Works great!
But, now I have a class that looks like good_luck_buddy. How do I get my javascript to ignore the second _ and give me luck_buddy?
I found this var field = element.split(new char [] {'_'}, 2); in a c# stackoverflow answer but it doesn't work. I tried it over at jsFiddle...
Use capturing parentheses:
'good_luck_buddy'.split(/_(.*)/s)
['good', 'luck_buddy', ''] // ignore the third element
They are defined as
If separator contains capturing parentheses, matched results are returned in the array.
So in this case we want to split at _.* (i.e. split separator being a sub string starting with _) but also let the result contain some part of our separator (i.e. everything after _).
In this example our separator (matching _(.*)) is _luck_buddy and the captured group (within the separator) is luck_buddy. Without the capturing parentheses, luck_buddy (matching .*) would not have been included in the result array, because with a simple split the separators are not included in the result.
We use the s regex flag to make . match on newline (\n) characters as well, otherwise it would only split to the first newline.
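Combined with array destructuring, the trailing empty element can simply be ignored. A small sketch follows; the variable names are illustrative, and the s flag assumes an ES2018 or newer engine.

```javascript
// Split once on the first "_"; the captured remainder becomes element 1.
const [head, tail] = "good_luck_buddy".split(/_(.*)/s);
console.log(head); // good
console.log(tail); // luck_buddy

// Without any "_" there is no second element to pick up.
const [only, rest] = "good".split(/_(.*)/s);
console.log(only, rest); // good undefined
```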
What do you need regular expressions and arrays for?
myString = myString.substring(myString.indexOf('_')+1)
var myString= "hello_there_how_are_you"
myString = myString.substring(myString.indexOf('_')+1)
console.log(myString)
I avoid RegExp at all costs. Here is another thing you can do:
"good_luck_buddy".split('_').slice(1).join('_')
With help of destructuring assignment it can be more readable:
let [first, ...rest] = "good_luck_buddy".split('_')
rest = rest.join('_')
A simple ES6 way to get both the first key and remaining parts in a string would be:
const [key, ...rest] = "good_luck_buddy".split('_')
const value = rest.join('_')
console.log(key, value) // good, luck_buddy
Nowadays String.prototype.split does indeed allow you to limit the number of splits.
str.split([separator[, limit]])
...
limit Optional
A non-negative integer limiting the number of splits. If provided, splits the string at each occurrence of the specified separator, but stops when limit entries have been placed in the array. Any leftover text is not included in the array at all.
The array may contain fewer entries than limit if the end of the string is reached before the limit is reached.
If limit is 0, no splitting is performed.
caveat
It might not work the way you expect. I was hoping it would just ignore the rest of the delimiters, but instead, when it reaches the limit, it splits the remaining string again, omitting the part after the split from the return results.
let str = 'A_B_C_D_E'
const limit_2 = str.split('_', 2)
limit_2
(2) ["A", "B"]
const limit_3 = str.split('_', 3)
limit_3
(3) ["A", "B", "C"]
I was hoping for:
let str = 'A_B_C_D_E'
const limit_2 = str.split('_', 2)
limit_2
(2) ["A", "B_C_D_E"]
const limit_3 = str.split('_', 3)
limit_3
(3) ["A", "B", "C_D_E"]
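A helper that behaves the way the hoped-for output describes could be sketched as follows. splitKeepRest is a hypothetical name, and it assumes a string separator and a limit of at least 1.

```javascript
// Hypothetical helper: return at most `limit` items, with the last item
// holding the unsplit remainder instead of discarding it.
function splitKeepRest(str, sep, limit) {
  const parts = str.split(sep);
  if (parts.length <= limit) return parts;
  // Keep the first limit - 1 pieces and re-join everything after them.
  return [...parts.slice(0, limit - 1), parts.slice(limit - 1).join(sep)];
}

console.log(splitKeepRest("A_B_C_D_E", "_", 2)); // ['A', 'B_C_D_E']
console.log(splitKeepRest("A_B_C_D_E", "_", 3)); // ['A', 'B', 'C_D_E']
```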
This solution worked for me
var str = "good_luck_buddy";
var index = str.indexOf('_');
var arr = [str.slice(0, index), str.slice(index + 1)];
//arr[0] = "good"
//arr[1] = "luck_buddy"
OR
var str = "good_luck_buddy";
var index = str.indexOf('_');
var [first, second] = [str.slice(0, index), str.slice(index + 1)];
//first = "good"
//second = "luck_buddy"
You can use the regular expression like:
var arr = element.split(/_(.*)/)
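You can use the second parameter, which specifies the limit of the split. Note that the limit caps the number of items returned and discards the rest, so this yields only the second piece (luck), not luck_buddy.
i.e:
var field = element.split('_', 2)[1];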
Replace the first instance with a unique placeholder then split from there.
"good_luck_buddy".replace(/\_/,'&').split('&')
["good","luck_buddy"]
This is more useful when both sides of the split are needed.
I needed both parts of the string, so a regex lookbehind helped me with this.
const full_name = 'Maria do Bairro';
const [first_name, last_name] = full_name.split(/(?<=^[^ ]+) /);
console.log(first_name);
console.log(last_name);
Non-regex solution
I ran some benchmarks, and this solution won hugely:
str.slice(str.indexOf(delim) + delim.length)
// as function
function gobbleStart(str, delim) {
return str.slice(str.indexOf(delim) + delim.length);
}
// as polyfill
String.prototype.gobbleStart = function(delim) {
return this.slice(this.indexOf(delim) + delim.length);
};
Performance comparison with other solutions
The only close contender was the same line of code, except using substr instead of slice.
Other solutions I tried involving split or RegExps took a big performance hit and were about 2 orders of magnitude slower. Using join on the results of split, of course, adds an additional performance penalty.
Why are they slower? Any time a new object or array has to be created, JS has to request a chunk of memory from the OS. This process is very slow.
Here are some general guidelines, in case you are chasing benchmarks:
New dynamic memory allocations for objects {} or arrays [] (like the one that split creates) will cost a lot in performance.
RegExp searches are more complicated and therefore slower than string searches.
If you already have an array, destructuring arrays is about as fast as explicitly indexing them, and looks awesome.
Removing beyond the first instance
Here's a solution that will slice up to and including the nth instance. It's not quite as fast, but on the OP's question, gobble(element, '_', 1) is still >2x faster than a RegExp or split solution and can do more:
/*
`gobble`, given a positive, non-zero `limit`, deletes
characters from the beginning of `haystack` until `needle` has
been encountered and deleted `limit` times or no more instances
of `needle` exist; then it returns what remains. If `limit` is
zero or negative, delete from the beginning only until `-(limit)`
occurrences or less of `needle` remain.
*/
function gobble(haystack, needle, limit = 0) {
let remain = limit;
if (limit <= 0) { // set remain to count of delim - num to leave
let i = 0;
while (i < haystack.length) {
const found = haystack.indexOf(needle, i);
if (found === -1) {
break;
}
remain++;
i = found + needle.length;
}
}
let i = 0;
while (remain > 0) {
const found = haystack.indexOf(needle, i);
if (found === -1) {
break;
}
remain--;
i = found + needle.length;
}
return haystack.slice(i);
}
With the above definition, gobble('path/to/file.txt', '/') would give the name of the file, and gobble('prefix_category_item', '_', 1) would remove the prefix like the first solution in this answer.
Tests were run in Chrome 70.0.3538.110 on macOSX 10.14.
Use the string replace() method with a regex:
var result = "good_luck_buddy".replace(/.*?_/, "");
console.log(result);
This regex matches 0 or more characters before the first _, and the _ itself. The match is then replaced by an empty string.
JavaScript's String.split unfortunately has no way of limiting the actual number of splits. It has a second argument that specifies how many of the actual split items are returned, which isn't useful in your case. The solution would be to split the string, shift the first item off, then rejoin the remaining items:
var element = $(this).attr('class');
var parts = element.split('_');
parts.shift(); // removes the first item from the array
var field = parts.join('_');
Here's one RegExp that does the trick.
'good_luck_buddy' . split(/^.*?_/)[1]
First, the '^' forces the match to begin at the start of the string. Then '.*?' matches as few characters as possible before the following '_', in other words all characters before the first '_'; the '_' itself is included in the match as its last character.
split() uses that matched part as its separator and removes it from the results. So it removes everything up to and including the first '_' and gives you the rest as the second element of the result. The first element is "" and represents the part before the matched part; it is "" because the match starts at the beginning of the string.
There are other RegExps that work as well, like /_(.*)/ given by Chandu in a previous answer.
The /^.*?_/ has the benefit that you can understand what it does without having to know about the special role capturing groups play with split().
If you are looking for a more modern way of doing this:
let raw = "good_luck_buddy"
raw.split("_")
.filter((part, index) => index !== 0)
.join("_")
Mark F's solution is awesome but it's not supported by old browsers. Kennebec's solution is awesome and supported by old browsers but doesn't support regex.
So, if you're looking for a solution that splits your string only once, that is supported by old browsers and supports regex, here's my solution:
String.prototype.splitOnce = function(regex)
{
var match = this.match(regex);
if(match)
{
var match_i = this.indexOf(match[0]);
return [this.substring(0, match_i),
this.substring(match_i + match[0].length)];
}
else
{ return [this, ""]; }
}
var str = "something/////another thing///again";
alert(str.splitOnce(/\/+/)[1]);
For beginners like me who are not used to regular expressions, this workaround worked:
var field = "Good_Luck_Buddy";
var newString = field.slice( field.indexOf("_")+1 );
slice() method extracts a part of a string and returns a new string and indexOf() method returns the position of the first found occurrence of a specified value in a string.
This should be quite fast
function splitOnFirst (str, sep) {
const index = str.indexOf(sep);
return index < 0 ? [str] : [str.slice(0, index), str.slice(index + sep.length)];
}
console.log(splitOnFirst('good_luck', '_')[1])
console.log(splitOnFirst('good_luck_buddy', '_')[1])
This worked for me on Chrome + FF:
"foo=bar=beer".split(/^[^=]+=/)[1] // "bar=beer"
"foo==".split(/^[^=]+=/)[1] // "="
"foo=".split(/^[^=]+=/)[1] // ""
"foo".split(/^[^=]+=/)[1] // undefined
If you also need the key try this:
"foo=bar=beer".split(/^([^=]+)=/) // Array [ "", "foo", "bar=beer" ]
"foo==".split(/^([^=]+)=/) // [ "", "foo", "=" ]
"foo=".split(/^([^=]+)=/) // [ "", "foo", "" ]
"foo".split(/^([^=]+)=/) // [ "foo" ]
//[0] = ignored (holds the string when there's no =, empty otherwise)
//[1] = hold the key (if any)
//[2] = hold the value (if any)
A simple ES6 one-statement solution to get the first key and the remaining parts:
let raw = 'good_luck_buddy'
raw.split('_')
.reduce((p, c, i) => i === 0 ? [c] : [p[0], [...p.slice(1), c].join('_')], [])
You could also use non-greedy match, it's just a single, simple line:
a = "good_luck_buddy"
const [,g,b] = a.match(/(.*?)_(.*)/)
console.log(g,"and also",b)
