Tokenizing strings using regular expression in Javascript


Suppose I have a long string containing newlines and tabs, such as:
var x = "This is a long string.\n\t This is another one on next line.";
How can I split this string into tokens using a regular expression?
I don't want to use .split(' ') because I want to learn JavaScript's regular expressions.
A more complicated string could be this:
var y = "This #is a #long $string. Alright, lets split this.";
Now I want to extract only the valid words from this string, without special characters or punctuation, i.e. I want these:
var xwords = ["This", "is", "a", "long", "string", "This", "is", "another", "one", "on", "next", "line"];
var ywords = ["This", "is", "a", "long", "string", "Alright", "lets", "split", "this"];

Here is a jsfiddle example of what you asked: http://jsfiddle.net/ayezutov/BjXw5/1/
Basically, the code is very simple:
var y = "This #is a #long $string. Alright, lets split this.";
var regex = /[^\s]+/g; // one or more non-whitespace characters; the g flag makes match return every occurrence, not just the first
var match = y.match(regex);
for (var i = 0; i<match.length; i++)
{
document.write(match[i]);
document.write('<br>');
}
UPDATE:
Basically you can expand the list of separator characters: http://jsfiddle.net/ayezutov/BjXw5/2/
var regex = /[^\s\.,!?]+/g;
UPDATE 2:
Word characters only (\w matches ASCII letters, digits, and the underscore):
http://jsfiddle.net/ayezutov/BjXw5/3/
var regex = /\w+/g;
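As a minimal sketch of the last update (the helper name tokenizeWords is only illustrative and not part of the answer), the \w+ pattern can be wrapped so that it also copes with input containing no words at all, where match returns null:

```javascript
// Hypothetical helper: returns every run of word characters in the text.
function tokenizeWords(text) {
  // match() with the g flag returns all matches, or null when there are none
  return text.match(/\w+/g) || [];
}

var x = "This is a long string.\n\t This is another one on next line.";
var y = "This #is a #long $string. Alright, lets split this.";
console.log(tokenizeWords(x));
console.log(tokenizeWords(y));
```

Note that \w also accepts digits and underscores, so "abc_1" stays a single token under this sketch.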

Use \s+ to tokenize the string.

Calling exec in a loop steps through the matches, and the capture group leaves out the surrounding non-word (\W) characters.
var A= [], str= "This #is a #long $string. Alright, let's split this.",
rx=/\W*([a-zA-Z][a-zA-Z']*)(\W+|$)/g, words;
while((words= rx.exec(str))!= null){
A.push(words[1]);
}
A.join(', ')
/* returned value: (String)
This, is, a, long, string, Alright, let's, split, this
*/

var words = y.split(/[^A-Za-z0-9]+/);
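One caveat with split, shown here as a small sketch (splitWords is an illustrative name, not from the answer): when the string begins or ends with a separator, the result contains an empty string at that end, so it may be worth filtering those out.

```javascript
// Hypothetical helper: split on runs of non-alphanumeric characters
// and drop the empty strings that appear at the edges.
function splitWords(text) {
  return text.split(/[^A-Za-z0-9]+/).filter(function (word) {
    return word.length > 0;
  });
}

var y = "This #is a #long $string. Alright, lets split this.";
console.log(y.split(/[^A-Za-z0-9]+/)); // last element is "" because y ends with "."
console.log(splitWords(y));            // no empty strings
```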

Here is a solution using regex groups to tokenise the text using different types of tokens.
You can test the code here https://jsfiddle.net/u3mvca6q/5/
/*
Basic Regex explanation:
/ Regex start
(\w+) First group, words: \w matches an ASCII letter, digit, or underscore; + means one or more
| or
(,|!) Second group, punctuation
| or
(\s) Third group, white spaces
/ Regex end
g "global", enables looping over the string to capture one element at a time
Regex result:
result[0] : default group : any match
result[1] : group1 : words
result[2] : group2 : punctuation , !
result[3] : group3 : whitespace
*/
var basicRegex = /(\w+)|(,|!)|(\s)/g;
/*
Advanced Regex explanation:
[a-zA-Z\u0080-\u00FF] instead of \w Supports some Unicode letters instead of ASCII letters only. Find Unicode ranges here https://apps.timwhitlock.info/js/regex
(\.\.\.|\.|,|!|\?) Identify ellipsis (...) and points as separate entities
You can improve it by adding ranges for special punctuation and so on
*/
var advancedRegex = /([a-zA-Z\u0080-\u00FF]+)|(\.\.\.|\.|,|!|\?)|(\s)/g;
var basicString = "Hello, this is a random message!";
var advancedString = "Et en français ? Avec des caractères spéciaux ... With one point at the end.";
console.log("------------------");
var result = null;
do {
result = basicRegex.exec(basicString)
console.log(result);
} while(result != null)
console.log("------------------");
var result = null;
do {
result = advancedRegex.exec(advancedString)
console.log(result);
} while(result != null)
/*
Output of the first loop (basicString):
Array [ "Hello", "Hello", undefined, undefined ]
Array [ ",", undefined, ",", undefined ]
Array [ " ", undefined, undefined, " " ]
Array [ "this", "this", undefined, undefined ]
Array [ " ", undefined, undefined, " " ]
Array [ "is", "is", undefined, undefined ]
Array [ " ", undefined, undefined, " " ]
Array [ "a", "a", undefined, undefined ]
Array [ " ", undefined, undefined, " " ]
Array [ "random", "random", undefined, undefined ]
Array [ " ", undefined, undefined, " " ]
Array [ "message", "message", undefined, undefined ]
Array [ "!", undefined, "!", undefined ]
null
*/
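The group-per-token-type idea above can also be turned into labelled tokens. The following is only a sketch under assumptions: the function name tokenizeWithTypes and the type labels are made up for illustration, and it relies on String.prototype.matchAll (ES2020), which the original answer does not use.

```javascript
// Hypothetical helper: returns [{ type, value }, ...] using the three
// capture groups of the "advanced" pattern from the answer.
function tokenizeWithTypes(text) {
  var pattern = /([a-zA-Z\u0080-\u00FF]+)|(\.\.\.|\.|,|!|\?)|(\s)/g;
  var tokens = [];
  for (var m of text.matchAll(pattern)) {
    if (m[1] !== undefined) {
      tokens.push({ type: "word", value: m[1] });
    } else if (m[2] !== undefined) {
      tokens.push({ type: "punctuation", value: m[2] });
    } else {
      tokens.push({ type: "whitespace", value: m[3] });
    }
  }
  return tokens;
}

console.log(tokenizeWithTypes("Hello, this is a random message!"));
```

Characters that none of the groups match (for example "#") are silently skipped by this sketch, so a real tokenizer would probably want an extra catch-all group.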

To extract only word characters, use the \w class. Whether it also matches non-ASCII (Unicode) letters depends on the implementation, so check the documentation for your language or library; in JavaScript, \w covers only [A-Za-z0-9_].
See Alexander Yezutov's answer (update 2) for how to apply this in an expression.
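For JavaScript specifically, a hedged sketch of a Unicode-aware alternative: it assumes an engine with ES2018 Unicode property escapes (the u flag with \p{L} and \p{N}), and unicodeWords is just an illustrative name.

```javascript
// Hypothetical helper: letters and digits from any script, plus underscore.
function unicodeWords(text) {
  return text.match(/[\p{L}\p{N}_]+/gu) || [];
}

console.log("fran\u00e7ais".match(/\w+/g)); // \w stops at the ç, so the word is cut in two
console.log(unicodeWords("Et en fran\u00e7ais ? Avec des caract\u00e8res sp\u00e9ciaux"));
```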

Related

How to split comma, semicolon and comma separated phrases with semicolons around?

How can I split a string on commas and semicolons, but keep a comma-separated phrase that sits between semicolons together as one value, all in one function?
String of words:
var words = "apple;banana;apple, banana;fruit"
Splitting with a regex on , and ;:
var result = words.split(/[,;]+/);
Result from that function is:
[ "apple", "banana", "apple", " banana", "fruit" ]
But what I want is this, with "banana, apple" kept together as a single value:
[ "apple", "banana, apple", " banana", "fruit" ]
So is it possible to combine 2 cases in that function to output as the second example? Or maybe some other elegant way?

A combination of match and replace can be used.
With match we pick out the two patterns that occur:
Values between delimiters: (?:\w+;\w+,)
A value not between delimiters: \w+;?
Then, depending on which group matched, replace rewrites each value into the desired format:
let words = `apple;banana;apple, banana;fruit`
let op = words.match(/(?:\w+;\w+,)|\w+;?/g)
.map(e=>e.replace(/(?:(\w+);(\w+),)|(\w+);?/g, (m,g1,g2,g3)=> g1 ? g1+', '+g2 : g3 ))
console.log(op)

Array math is easier than regex.
Objective
apple;banana;apple, banana;fruit > [ "apple", "banana, apple", " banana", "fruit" ]
Remove spaces, split at ;, split at ,.
Solution
const words = 'apple;banana;apple, banana;fruit';
const result = words.split(' ').join('').split(';').join(',').split(',');
console.log(result);

According to the second apple-banana example in the question (after its update), try words.split(';'):
var words = "apple;banana;apple, banana;fruit"
var result = words.split(';');
console.log(result);

How to extract text in square brackets from string in JavaScript?

How can I extract the text between all pairs of square brackets from a string such as "[a][b][c][d][e]", so that I can get the following results:
→ Array: ["a", "b", "c", "d", "e"]
→ String: "abcde"
I have tried the following Regular Expressions, but to no avail:
→ (?<=\[)(.*?)(?=\])
→ \[(.*?)\]

Research:
After searching Stack Overflow, I found only two solutions, both of which use regular expressions:
→ (?<=\[)(.*?)(?=\]) (1)
(?<=\[) : Positive Lookbehind.
\[ :matches the character [ literally.
(.*?) : matches any character except newline and expands as needed.
(?=\]) : Positive Lookahead.
\] : matches the character ] literally.
→ \[(.*?)\] (2)
\[ : matches the character [ literally.
(.*?) : matches any character except newline and expands as needed.
\] : matches the character ] literally.
Notes:
(1) This pattern throws an error in JavaScript engines that predate ES2018, because they do not support the lookbehind operator.
Example:
console.log(/(?<=\[)(.*?)(?=\])/.exec("[a][b][c][d][e]"));
Uncaught SyntaxError: Invalid regular expression: /(?<=\[)(.*?)(?=\])/: Invalid group(…)
(2) This pattern returns the text inside only the first pair of square brackets as the second element.
Example:
console.log(/\[(.*?)\]/.exec("[a][b][c][d][e]"));
Returns: ["[a]", "a"]
Solution:
The most precise solution for JavaScript that I have come up with is:
var string, array;
string = "[a][b][c][d][e]";
array = string.split("["); // → ["", "a]", "b]", "c]", "d]", "e]"]
string = array.join(""); // → "a]b]c]d]e]"
array = string.split("]"); // → ["a", "b", "c", "d", "e", ""]
Now, depending upon whether we want the end result to be an array or a string we can do:
array = array.slice(0, array.length - 1) // → ["a", "b", "c", "d", "e"]
/* OR */
string = array.join("") // → "abcde"
One liner:
Finally, here is a handy one-liner for each scenario, for people who, like me, prefer to achieve the most with the least code, and for the TL;DR readers.
Array:
var a = "[a][b][c][d][e]".split("[").join("").split("]").slice(0,-1);
/* OR */
var a = "[a][b][c][d][e]".slice(1,-1).split(']['); // Thanks @xorspark
String:
var a = "[a][b][c][d][e]".split("[").join("").split("]").join("");

I don't know what text you expect between the brackets, but for the example you've given:
var arrStr = "[a][b][c][d][e]";
var arr = arrStr.match(/[a-z]/g); // → [ 'a', 'b', 'c', 'd', 'e' ] (Array.isArray(arr) is true)
Then you can call .join('') on the resulting array to combine the elements into a string.
If you expect multiple characters between the square brackets, the regex can be /[a-z]+/g, or tweaked to your liking.

I think this approach will be interesting to you.
var arr = [];
var str = '';
var input = "[a][b][c][d][e]";
input.replace(/\[(.*?)\]/g, function(match, pattern){
arr.push(pattern);
str += pattern;
return match;//just in case ;)
});
console.log('Arr:', arr);
console.log('String:', str);
//And trivial solution if you need only string
var a = input.replace(/\[|\]/g, '');
console.log('Trivial:',a);
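The same capture-group idea can be written without side effects inside replace. This is a sketch that assumes String.prototype.matchAll (ES2020) is available; extractBracketed is an illustrative name.

```javascript
// Hypothetical helper: collect the text inside every [...] pair.
function extractBracketed(text) {
  // [^\]]* matches everything up to the next closing bracket;
  // the g flag is required by matchAll
  return Array.from(text.matchAll(/\[([^\]]*)\]/g), function (m) {
    return m[1];
  });
}

var parts = extractBracketed("[a][b][c][d][e]");
console.log(parts);          // array form
console.log(parts.join("")); // string form
```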

What is the purpose of the 'y' sticky pattern modifier in JavaScript RegExps?

MDN documents the 'y' sticky flag for JavaScript RegExp. Here is an excerpt:
y
sticky; matches only from the index indicated by the lastIndex property of this regular expression in the target string (and does not attempt to match from any later indexes).
There's also an example:
var text = 'First line\nSecond line';
var regex = /(\S+) line\n?/y;
var match = regex.exec(text);
console.log(match[1]); // prints 'First'
console.log(regex.lastIndex); // prints '11'
var match2 = regex.exec(text);
console.log(match2[1]); // prints 'Second'
console.log(regex.lastIndex); // prints '22'
var match3 = regex.exec(text);
console.log(match3 === null); // prints 'true'
But there isn't actually any difference from using the g global flag in this case:
var text = 'First line\nSecond line';
var regex = /(\S+) line\n?/g;
var match = regex.exec(text);
console.log(match[1]); // prints 'First'
console.log(regex.lastIndex); // prints '11'
var match2 = regex.exec(text);
console.log(match2[1]); // prints 'Second'
console.log(regex.lastIndex); // prints '22'
var match3 = regex.exec(text);
console.log(match3 === null); // prints 'true'
Same output. So I guess there might be something else regarding the 'y' flag and it seems that MDN's example isn't a real use-case for this modifier, as it seems to just work as a replacement for the 'g' global modifier here.
So, what could be a real use-case for this experimental 'y' sticky flag? What's its purpose in "matching only from the RegExp.lastIndex property" and what makes it differ from 'g' when used with RegExp.prototype.exec?
Thanks for the attention.

The difference between y and g is described in Practical Modern JavaScript:
The sticky flag advances lastIndex like g but only if a match is found
starting at lastIndex, there is no forward search. The sticky flag was added to improve the performance of writing lexical analyzers using
JavaScript...
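The lexical-analyzer use mentioned in that quote can be sketched roughly as follows. The rule set, the token type names, and the function name lex are all assumptions made for illustration; the point is only that each rule is tried exactly at the current position, with no forward search.

```javascript
// Hypothetical mini-lexer: every rule uses the y flag, so exec() either
// matches exactly at lastIndex or fails immediately.
function lex(input) {
  var rules = [
    { type: "number", re: /\d+(?:\.\d+)?/y },
    { type: "ident", re: /[A-Za-z_]\w*/y },
    { type: "op", re: /[-+*\/=()]/y },
    { type: "space", re: /\s+/y }
  ];
  var tokens = [];
  var pos = 0;
  while (pos < input.length) {
    var matched = false;
    for (var i = 0; i < rules.length; i++) {
      var rule = rules[i];
      rule.re.lastIndex = pos; // anchor this attempt at the current position
      var m = rule.re.exec(input);
      if (m) {
        if (rule.type !== "space") tokens.push({ type: rule.type, value: m[0] });
        pos = rule.re.lastIndex; // advanced past the match by exec()
        matched = true;
        break;
      }
    }
    if (!matched) {
      throw new SyntaxError("Unexpected character at index " + pos + ": " + input[pos]);
    }
  }
  return tokens;
}

console.log(lex("x = 4.66 + y2"));
```

With the g flag instead, a rule could skip ahead over unrecognized input and report a match further along, which is exactly what a lexer must not do.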
As for a real use case,
It could be used to require a regular expression match starting at position n where n
is what lastIndex is set to. In the case of a non-multiline regular
expression, a lastIndex value of 0 with the sticky flag would be in
effect the same as starting the regular expression with ^ which
requires the match to start at the beginning of the text searched.
And here is an example from that blog, where the lastIndex property is manipulated before the test method invocation, thus forcing different match results:
var searchStrings, stickyRegexp;
stickyRegexp = /foo/y;
searchStrings = [
"foo",
" foo",
"  foo", // two leading spaces
];
searchStrings.forEach(function(text, index) {
stickyRegexp.lastIndex = 1;
console.log("found a match at", index, ":", stickyRegexp.test(text));
});
Result:
"found a match at" 0 ":" false
"found a match at" 1 ":" true
"found a match at" 2 ":" false

There is definitely a difference in behaviour, as shown below:
var text = "abc def ghi jkl"
undefined
var regexy = /\S(\S)\S/y;
undefined
var regexg = /\S(\S)\S/g;
undefined
regexg.exec(text)
Array [ "abc", "b" ]
regexg.lastIndex
3
regexg.exec(text)
Array [ "def", "e" ]
regexg.lastIndex
7
regexg.exec(text)
Array [ "ghi", "h" ]
regexg.lastIndex
11
regexg.exec(text)
Array [ "jkl", "k" ]
regexg.lastIndex
15
regexg.exec(text)
null
regexg.lastIndex
0
regexy.exec(text)
Array [ "abc", "b" ]
regexy.lastIndex
3
regexy.exec(text)
null
regexy.lastIndex
0
...but I have yet to fully understand what is going on there.
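A possible reading of the transcript, as a sketch (collectMatches is an illustrative helper): after "abc" both regexes have lastIndex 3, which points at a space. The g regex searches forward from there and finds "def"; the y regex must match exactly at index 3, fails on the space, and resets lastIndex to 0. Letting the sticky pattern consume the trailing whitespace keeps it going.

```javascript
// Hypothetical helper: run exec() until it returns null and collect m[0].
function collectMatches(regex, text) {
  if (!regex.global && !regex.sticky) {
    throw new Error("needs the g or y flag, otherwise the loop never ends");
  }
  regex.lastIndex = 0;
  var found = [];
  var m;
  while ((m = regex.exec(text)) !== null) {
    found.push(m[0]);
  }
  return found;
}

var text = "abc def ghi jkl";
console.log(collectMatches(/\S(\S)\S/g, text));    // g searches forward past the spaces
console.log(collectMatches(/\S(\S)\S/y, text));    // y stops at the first space
console.log(collectMatches(/\S(\S)\S\s*/y, text)); // y continues once the spaces are consumed
```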

Split number and string from the value using java script or regex

I have the value "4.66lb" and I want to separate "4.66" and "lb" using a regex.
I tried the code below, but it extracts only the digits ("4,66"), whereas I want both values, 4.66 and lb.
var text = "4.66lb";
var regex = /(\d+)/g;
alert(text.match(/(\d+)/g));

Try this:
var res = text.match(/(\d+(?:\.\d+)?)(\D+)/);
res[1] contains 4.66
res[2] contains lb
To also match 4/5lb, you could use:
var res = text.match(/(\d+(?:[.\/]\d+)?)(\D+)/);
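Wrapped into a function, the pattern above might look like the following sketch. The name splitQuantity, the returned property names, the anchors ^ and $, and the optional whitespace between number and unit are additions for illustration, not part of the answer.

```javascript
// Hypothetical helper: returns { amount, unit } as strings, or null
// when the text is not "number followed by unit".
function splitQuantity(text) {
  var m = text.match(/^(\d+(?:[.\/]\d+)?)\s*(\D+)$/);
  if (m === null) return null;
  return { amount: m[1], unit: m[2] };
}

console.log(splitQuantity("4.66lb"));
console.log(splitQuantity("4/5lb"));
console.log(splitQuantity("lb")); // null: there is no leading number
```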

You could also use a character class:
> var res = text.match(/([0-9\.]+)(\w+)/);
undefined
> res[1]
'4.66'
> res[2]
'lb'

Let me explain with an example:
var str = ' 1 ab 2 bc 4 dd'; // sample string
str.split(/\s+\d+\s+/)
// → ["", "ab", "bc", "dd"] (no capturing group: the string is split at each match and the matches are discarded)
str.split(/(\s+\d+\s+)/)
// → ["", " 1 ", "ab", " 2 ", "bc", " 4 ", "dd"] (capturing group: the matched separators are kept in the result as well)
// Your case is the second one, so enclose the desired regex in parentheses:
solution: text.split(/(\d+\.*\d+[^\D])/) // with text = "4.66lb" → ["", "4.66", "lb"]
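As a hedged variation on the split-with-capturing-group approach, the sketch below uses a simpler number pattern (an assumption, chosen so that single-digit numbers such as "2" are also recognized) and drops the empty strings split produces at the edges. splitKeepingNumbers is an illustrative name.

```javascript
// Hypothetical helper: split around numbers while keeping the numbers,
// because the separator pattern is wrapped in a capturing group.
function splitKeepingNumbers(text) {
  return text.split(/(\d+(?:\.\d+)?)/).filter(function (part) {
    return part !== "";
  });
}

console.log(splitKeepingNumbers("4.66lb"));
console.log(splitKeepingNumbers("2kg and 3.5lb"));
```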

Javascript split string on space or on quotes to array

var str = 'single words "fixed string of words"';
var astr = str.split(" "); // need fix
I would like the array to be like this:
var astr = ["single", "words", "fixed string of words"];

The accepted answer is not entirely correct. It separates on non-space characters like . and - and leaves the quotes in the results. A better way, which excludes the quotes, is to use a capturing group:
//The parentheses in the regex create a capturing group within the quotes
var myRegexp = /[^\s"]+|"([^"]*)"/gi;
var myString = 'single words "fixed string of words"';
var myArray = [];
do {
//Each call to exec returns the next regex match as an array
var match = myRegexp.exec(myString);
if (match != null)
{
//Index 1 in the array is the captured group if it exists
//Index 0 is the matched text, which we use if no captured group exists
myArray.push(match[1] ? match[1] : match[0]);
}
} while (match != null);
myArray will now contain exactly what the OP asked for:
single,words,fixed string of words
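A more compact form of the same loop, as a sketch: it assumes String.prototype.matchAll (ES2020), and the name splitArgs is illustrative. It also checks the capture group against undefined rather than for truthiness, so that an empty quoted string ("") yields an empty entry instead of the two quote characters.

```javascript
// Hypothetical helper: split on whitespace, keeping quoted phrases
// together and dropping the surrounding double quotes.
function splitArgs(text) {
  return Array.from(text.matchAll(/[^\s"]+|"([^"]*)"/g), function (m) {
    // m[1] is defined only when the quoted alternative matched
    return m[1] !== undefined ? m[1] : m[0];
  });
}

console.log(splitArgs('single words "fixed string of words"'));
```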

str.match(/\w+|"[^"]+"/g)
//single, words, "fixed string of words"

This uses a mix of split and regex matching.
var str = 'single words "fixed string of words"';
var matches = /".+?"/.exec(str);
str = str.replace(/".+?"/, "").replace(/^\s+|\s+$/g, "");
var astr = str.split(" ");
if (matches) {
for (var i = 0; i < matches.length; i++) {
astr.push(matches[i].replace(/"/g, ""));
}
}
This returns the expected result, although a single regexp should be able to do it all.
// ["single", "words", "fixed string of words"]
Update
And this is an improved version of the method proposed by S.Mark:
var str = 'single words "fixed string of words"';
var aStr = str.match(/\w+|"[^"]+"/g), i = aStr.length;
while(i--){
aStr[i] = aStr[i].replace(/"/g,"");
}
// ["single", "words", "fixed string of words"]

Here might be a complete solution:
https://github.com/elgs/splitargs

An ES6 solution that supports:
Splitting on spaces, except inside quotes
Removing quotes, except backslash-escaped ones
Turning an escaped quote into a plain quote
Quotes placed anywhere in the string
Code:
str.match(/\\?.|^$/g).reduce((p, c) => {
if(c === '"'){
p.quote ^= 1;
}else if(!p.quote && c === ' '){
p.a.push('');
}else{
p.a[p.a.length-1] += c.replace(/\\(.)/,"$1");
}
return p;
}, {a: ['']}).a
Output:
[ 'single', 'words', 'fixed string of words' ]

This will split it into an array and strip off the surrounding quotes from any remaining string.
const parseWords = (words = '') =>
(words.match(/[^\s"]+|"([^"]*)"/gi) || []).map((word) =>
word.replace(/^"(.+(?="$))"$/, '$1'))

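This solution is intended to handle both double (") and single (') quotes. As written, only double-quoted phrases are kept together (single quotes are simply skipped), and because match with the g flag returns whole matches, the double quotes stay in the result:
Code:
str.match(/[^\s"']+|"([^"]*)"/gmi)
// ["single", "words", "\"fixed string of words\""]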
Here it shows how this regular expression would work: https://regex101.com/r/qa3KxQ/2

Until I found @dallin's answer (this thread: https://stackoverflow.com/a/18647776/1904943), I was having difficulty processing strings with a mix of unquoted and quoted terms and phrases in JavaScript.
While researching that issue, I ran a number of tests.
Because this information was hard to find, I have collated it below; it may be useful to others looking for ways to process strings containing quoted words in JavaScript.
let q = 'apple banana "nova scotia" "british columbia"';
Extract [only] quoted words and phrases:
// https://stackoverflow.com/questions/12367126/how-can-i-get-a-substring-located-between-2-quotes
const r = q.match(/"([^']+)"/g);
console.log('r:', r)
// r: Array [ "\"nova scotia\" \"british columbia\"" ]
console.log('r:', r.toString())
// r: "nova scotia" "british columbia"
// ----------------------------------------
// [alternate regex] https://www.regextester.com/97161
const s = q.match(/"(.*?)"/g);
console.log('s:', s)
// s: Array [ "\"nova scotia\"", "\"british columbia\"" ]
console.log('s:', s.toString())
// s: "nova scotia","british columbia"
Extract [all] unquoted, quoted words and phrases:
// https://stackoverflow.com/questions/2817646/javascript-split-string-on-space-or-on-quotes-to-array
const t = q.match(/\w+|"[^"]+"/g);
console.log('t:', t)
// t: Array(4) [ "apple", "banana", "\"nova scotia\"", "\"british columbia\"" ]
console.log('t:', t.toString())
// t: apple,banana,"nova scotia","british columbia"
// ----------------------------------------------------------------------------
// https://stackoverflow.com/questions/2817646/javascript-split-string-on-space-or-on-quotes-to-array
// [@dallin's answer (this thread)] https://stackoverflow.com/a/18647776/1904943
var myRegexp = /[^\s"]+|"([^"]*)"/gi;
var myArray = [];
do {
/* Each call to exec returns the next regex match as an array. */
var match = myRegexp.exec(q); // << "q" = my query (string)
if (match != null)
{
/* Index 1 in the array is the captured group if it exists.
* Index 0 is the matched text, which we use if no captured group exists. */
myArray.push(match[1] ? match[1] : match[0]);
}
} while (match != null);
console.log('myArray:', myArray, '| type:', typeof(myArray))
// myArray: Array(4) [ "apple", "banana", "nova scotia", "british columbia" ] | type: object
console.log(myArray.toString())
// apple,banana,nova scotia,british columbia
Work with a set (rather than an array):
// https://stackoverflow.com/questions/28965112/javascript-array-to-set
var mySet = new Set(myArray);
console.log('mySet:', mySet, '| type:', typeof(mySet))
// mySet: Set(4) [ "apple", "banana", "nova scotia", "british columbia" ] | type: object
Iterating over set elements:
mySet.forEach(x => console.log(x));
/* apple
* banana
* nova scotia
* british columbia
*/
// https://stackoverflow.com/questions/16401216/iterate-over-set-elements
const myArrayFromSet = Array.from(mySet);
for (let i=0; i < myArrayFromSet.length; i++) {
console.log(i + ':', myArrayFromSet[i])
}
/*
0: apple
1: banana
2: nova scotia
3: british columbia
*/
Asides
The JavaScript output above is from the Firefox Developer Tools (F12, from a web page). I created a blank HTML file that loads a .js file, which I edit in Vim as my IDE (see: Simple JavaScript IDE).
In my tests the set built from the array behaved like an independent copy. Note that new Set(iterable) copies element references, so that holds here only because the elements are strings (see: Shallow-clone an ES6 Map or Set).

I noticed the disappearing characters, too. I think you can include them - for example, to have it include "+" with the word, use something like "[\w\+]" instead of just "\w".
