How do you split a javascript string by spaces and punctuation? - javascript

I have an arbitrary string, for example "Hello, my name is john.", and I want it split into an array like this: ["Hello", ",", " ", "my", "name", "is", "john", "."]. I tried str.split(/[^\w\s]|_/g), but it does not seem to work. Any ideas?

To split str on any run of non-word characters, i.e. anything other than A-Z, a-z, 0-9, and underscore:
var words=str.split(/\W+/); // assumes str neither begins nor ends with a non-word character
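As a minimal sketch of that caveat, using the sentence from the question (the variable names here are illustrative): when the string ends with a non-word character, split(/\W+/) leaves an empty string at that end, which can be filtered out.

```javascript
var sample = 'Hello, my name is john.';

// The trailing "." is a separator, so the last element is an empty string.
var raw = sample.split(/\W+/);

// filter(Boolean) drops empty strings (and any other falsy members).
var words = raw.filter(Boolean);

console.log(raw);   // [ 'Hello', 'my', 'name', 'is', 'john', '' ]
console.log(words); // [ 'Hello', 'my', 'name', 'is', 'john' ]
```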
Or, assuming your target language is English, you can extract all semantically useful values from a string (i.e. "tokenizing" a string) using:
var str='Here\'s a (good, bad, indifferent, ...) '+
'example sentence to be used in this test '+
'of English language "token-extraction".',
punct='\\['+ '\\!'+ '\\"'+ '\\#'+ '\\$'+ // since javascript does not
'\\%'+ '\\&'+ '\\\''+ '\\('+ '\\)'+ // support POSIX character
'\\*'+ '\\+'+ '\\,'+ '\\\\'+ '\\-'+ // classes, we'll need our
'\\.'+ '\\/'+ '\\:'+ '\\;'+ '\\<'+ // own version of [:punct:]
'\\='+ '\\>'+ '\\?'+ '\\@'+ '\\['+
'\\]'+ '\\^'+ '\\_'+ '\\`'+ '\\{'+
'\\|'+ '\\}'+ '\\~'+ '\\]',
re=new RegExp( // tokenizer
'\\s*'+ // discard possible leading whitespace
'('+ // start capture group
'\\.{3}'+ // ellipsis (must appear before punct)
'|'+ // alternator
'\\w+\\-\\w+'+ // hyphenated words (must appear before punct)
'|'+ // alternator
'\\w+\'(?:\\w+)?'+ // compound words (must appear before punct)
'|'+ // alternator
'\\w+'+ // other words
'|'+ // alternator
'['+punct+']'+ // punct
')' // end capture group
);
// grep(ary[,filt]) - filters an array
// note: could use jQuery.grep() instead
// @param {Array} ary array of members to filter
// @param {Function} filt function to test truthiness of member,
// if omitted, "function(member){ if(member) return member; }" is assumed
// @returns {Array} all members of ary where result of filter is truthy
function grep(ary,filt) {
var result=[];
for(var i=0,len=ary.length;i<len;i++) { // visit every index, starting at 0
var member=ary[i]||'';
if(filt && (typeof filt === 'function') ? filt(member) : member) { // typeof yields lowercase 'function'
result.push(member);
}
}
return result;
}
var tokens=grep( str.split(re) ); // note: filter function omitted
// since all we need to test
// for is truthiness
which produces:
tokens=[
'Here\'s',
'a',
'(',
'good',
',',
'bad',
',',
'indifferent',
',',
'...',
')',
'example',
'sentence',
'to',
'be',
'used',
'in',
'this',
'test',
'of',
'English',
'language',
'"',
'token-extraction',
'"',
'.'
]
EDIT
Also available as a Github Gist

Try this (I'm not sure if this is what you wanted):
str.replace(/[^\w\s]|_/g, function ($1) { return ' ' + $1 + ' ';}).replace(/[ ]+/g, ' ').split(' ');
http://jsfiddle.net/zNHJW/3/

Try:
str.split(/([_\W])/)
This will split by any non-alphanumeric character (\W) and any underscore. It uses capturing parentheses to include the item that was split by in the final result.

This solution caused a challenge with spaces for me (still needed them), then I gave str.split(/\b/) a shot and all is well. Spaces are output in the array, which won't be hard to ignore, and the ones left after punctuation can be trimmed out.
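To compare the approaches above on the sentence from the question, here is a small sketch. The variable names are illustrative, and the last variant (match instead of split) is an alternative technique, not one of the answers above.

```javascript
var sentence = 'Hello, my name is john.';

// Capturing group keeps the separators; adjacent separators produce
// empty strings, which are filtered out here.
var withCapture = sentence.split(/([_\W])/).filter(function (s) {
  return s !== '';
});

// Splitting on word boundaries keeps runs of non-word characters together.
var onBoundaries = sentence.split(/\b/);

// Alternative: match the tokens directly and skip whitespace entirely.
// Note that \w treats "_" as a word character here.
var tokensOnly = sentence.match(/\w+|[^\w\s]/g);

console.log(withCapture);  // [ 'Hello', ',', ' ', 'my', ' ', 'name', ' ', 'is', ' ', 'john', '.' ]
console.log(onBoundaries); // [ 'Hello', ', ', 'my', ' ', 'name', ' ', 'is', ' ', 'john', '.' ]
console.log(tokensOnly);   // [ 'Hello', ',', 'my', 'name', 'is', 'john', '.' ]
```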

Related

regex: get match and remaining string with regex

I want to use the RegExp constructor to run a regular expression against a string and get both the match and the remaining string.
This is needed to implement a UI pattern in which the match must be separated from the rest of the string, so that a style or any other processing can be applied to each part separately.
/**
* INPUT
*
* input: 'las vegas'
* pattern: 'las'
*
*
* EXPECTED OUTPUT
*
* match: 'las'
* remaining: 'vegas'
*/
Get the match then replace the match with nothing in the string, and return both results.
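function matchR(str, regex){
// get the match (null when nothing matches)
var _match = str.match(regex);
if (!_match) return {match:null, remaining:str};
// return the first match and the remaining string;
// _match is an array, so remove only its first element (the matched text)
return {match:_match[0], remaining:str.replace(_match[0], "")};
}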
Here is a function that takes the user input and an array of strings to match as parameters, and returns an array of arrays:
const strings = [
'Las Cruces',
'Las Vegas',
'Los Altos',
'Los Gatos',
];
function getMatchAndRemaining(input, strings) {
let escaped = input.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
let regex = new RegExp('^(' + escaped + ')(.*)$', 'i');
return strings.map(str => {
return (str.match(regex) || [str, '', str]).slice(1);
});
}
//tests:
['l', 'las', 'los', 'x'].forEach(input => {
let matches = getMatchAndRemaining(input, strings);
console.log(input, '=>', matches);
});
Some notes:
You need to escape the user input before creating the regex, because some characters have a special meaning.
If there is no match, the before part is empty and the remaining part contains the full string.
You could add an additional parameter to the function with a style or class to add to the before part, in which case you would return a string instead of an array of [before, remaining].
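The last note could be sketched roughly as follows. The helper name highlightPrefix and the class name "hit" are hypothetical, and the sketch assumes the strings are trusted (it does no HTML escaping).

```javascript
// Hypothetical helper: wraps the matched prefix in a <span> with the given
// class and returns a single string instead of [before, remaining].
function highlightPrefix(input, str, className) {
  // Escape regex metacharacters in the user input.
  var escaped = input.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  var m = str.match(new RegExp('^(' + escaped + ')(.*)$', 'i'));
  // No match, or empty input: return the string unchanged.
  if (!m || m[1] === '') return str;
  return '<span class="' + className + '">' + m[1] + '</span>' + m[2];
}

console.log(highlightPrefix('las', 'Las Vegas', 'hit'));
// <span class="hit">Las</span> Vegas
console.log(highlightPrefix('x', 'Las Vegas', 'hit'));
// Las Vegas
```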

Regex match cookie value and remove hyphens

I'm trying to extract out a group of words from a larger string/cookie that are separated by hyphens. I would like to replace the hyphens with a space and set to a variable. Javascript or jQuery.
As an example, the larger string has a name and value like this within it:
facility=34222%7CConner-Department-Store;
(notice the leading "C")
So first, I need to match()/find facility=34222%7CConner-Department-Store; with regex. Then break it down to "Conner Department Store"
var cookie = document.cookie;
var facilityValue = cookie.match( REGEX ); ??
var test = "store=874635%7Csomethingelse;facility=34222%7CConner-Department-Store;store=874635%7Csomethingelse;";
var test2 = test.replace(/^(.*)facility=([^;]+)(.*)$/, function(matchedString, match1, match2, match3){
return decodeURIComponent(match2);
});
console.log( test2 );
console.log( test2.split('|')[1].replace(/[-]/g, ' ') );
If I understood correctly, you want to build a phrase from all the words between hyphens, where a word is an uppercase letter followed by lowercase letters, so I'd prefer using a regex in that case.
This is a regex solution that works with any cookie in the same format and extracts the wanted sentence from it:
var matches = str.match(/([A-Z][a-z]+)-?/g);
console.log(matches.map(function(m) {
return m.replace('-', '');
}).join(" "));
Demo:
var str = "facility=34222%7CConner-Department-Store;";
var matches = str.match(/([A-Z][a-z]+)-?/g);
console.log(matches.map(function(m) {
return m.replace('-', '');
}).join(" "));
Explanation:
Use this regex, /([A-Z][a-z]+)-?/g, to match the words between hyphens.
Remove the trailing - from each matched word.
Then join the array of matches with a space.
Ok,
first, you should decode this string as follows:
var str = "facility=34222%7CConner-Department-Store;"
var decoded = decodeURIComponent(str);
// decoded = "facility=34222|Conner-Department-Store;"
Then you have multiple possibilities to split up this string.
The easiest way is to use substring()
var solution1 = decoded.substring(decoded.indexOf('|') + 1, decoded.length)
// solution1 = "Conner-Department-Store;"
solution1 = solution1.replace(/-/g, ' '); // a string pattern would replace only the first hyphen
// solution1 = "Conner Department Store;"
As you can see, substring(arg1, arg2) returns the part of the string that starts at index arg1 and ends just before index arg2.
If you want to cut the last ; just set decoded.length - 1 as arg2 in the snippet above.
decoded.substring(decoded.indexOf('|') + 1, decoded.length - 1)
//returns "Conner-Department-Store"
or all of the above in just one line:
decoded.substring(decoded.indexOf('|') + 1, decoded.length - 1).replace(/-/g, ' ')
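When the cookie string holds several name=value pairs, the same idea could be wrapped in a small helper. This is only a sketch: the function name facilityName is hypothetical, and it assumes every value has the "id|Name-With-Hyphens" shape shown in the question.

```javascript
// Hypothetical helper: finds the cookie called `key`, decodes its value,
// keeps the part after "|" and turns every hyphen into a space.
function facilityName(cookieString, key) {
  var pairs = cookieString.split(';');
  for (var i = 0; i < pairs.length; i++) {
    var eq = pairs[i].indexOf('=');
    if (eq === -1) continue;                        // skip empty fragments
    if (pairs[i].slice(0, eq).trim() !== key) continue;
    var decoded = decodeURIComponent(pairs[i].slice(eq + 1));
    // The /g flag is needed to replace all hyphens, not just the first.
    return decoded.slice(decoded.indexOf('|') + 1).replace(/-/g, ' ');
  }
  return null; // no cookie with that name
}

var cookies = 'store=874635%7Csomethingelse;facility=34222%7CConner-Department-Store;';
console.log(facilityName(cookies, 'facility')); // Conner Department Store
```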
If you still want to use a regular expression to retrieve (perhaps more) data from the string, you could use something similar to this snippet:
var solution2 = "";
var regEx= /([A-Za-z]*)=([0-9]*)\|(\S[^:\/?#\[\]\@\;\,']*)/;
if (regEx.test(decoded)) {
solution2 = decoded.match(regEx);
/* returns
[0:"facility=34222|Conner-Department-Store",
1:"facility",
2:"34222",
3:"Conner-Department-Store",
index:0,
input:"facility=34222|Conner-Department-Store;"
length:4] */
solution2 = solution2[3].replace(/-/g, ' '); // /g replaces every hyphen
// "Conner Department Store"
}
I have applied some rules for the regex to work; feel free to modify them according to your needs.
facility can be any Word built with alphabetical characters lower and uppercase (no other chars) at any length
= needs to be the char =
34222 can be any number but no other characters
| needs to be the char |
Conner-Department-Store can be any characters except one of the following (reserved delimiters): :/?#[]@;,'
Hope this helps :)
edit: to find only the part
facility=34222%7CConner-Department-Store; just modify the regex to
match facility instead of ([A-Za-z]*):
/(facility)=([0-9]*)\|(\S[^:\/?#\[\]\@\;\,']*)/
You can use cookies.js, a mini framework from MDN (Mozilla Developer Network).
Simply include the cookies.js file in your application, and read the cookie by its name:
docCookies.getItem("facility");

regex to get all occurrences with optional next character or end of string

I have a string separated by forward slashes, and wildcards are denoted by beginning with a $:
/a/string/with/$some/$wildcards
I need a regex to get all wildcards (without the "$"), where wildcards can either have more "string" ahead of them (and the next character should always be a forward slash) or will be at the end of the string. Here is where I'm at (it matches to the end of the string rather to the next "/"):
//Just want to match $one
var string = "/a/string/with/$one/wildcard"
var re = /\$(.*)($|[/]?)/g
var m = re.exec(string)
console.log(m);
// [ '$one/wildcard',
// 'one/wildcard',
// '',
// index: 123,
// input: '/a/string/with/$one/wildcard'
// ]
Here was a previous attempt (that doesn't account for wildcards that are at the end of the string):
//Want to match $two and $wildcards
var string = "/a/string/with/$two/$wildcards"
var re = /\$(.*)\//g
var m = re.exec(string)
console.log(m);
// [ '$two/',
// 'two',
// '',
// index: 123,
// input: '/a/string/with/$two/$wildcards'
// ]
I've searched around for matching a character or end of string and have found several answers, but none that try to account for multiple matches. I think I need the ability to match the next character as a / greedily, and then try to match the end of the string.
The desired functionality is to take the input string:
/a/string/with/$two/$wildcards
and transform it to the following:
/a/string/with/[two]/[wildcards]
Thanks in advance! Also, apologies if this has been explicitly covered in detail, I was unable to find a replica after various searches.
I think this should do it:
/\$([^\/]+)/g
And then you can use the replace() function:
"/a/string/with/$two/$wildcards".replace(/\$([^\/]+)/g, "[$1]");
// "/a/string/with/[two]/[wildcards]"
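The question also asks for the wildcard names themselves, without the "$". One way to collect them with the same pattern is String.prototype.matchAll (ES2020), sketched below with illustrative variable names.

```javascript
var path = '/a/string/with/$two/$wildcards';

// matchAll requires the g flag and yields one match object per wildcard;
// m[1] is the captured name without the leading "$".
var names = Array.from(path.matchAll(/\$([^\/]+)/g), function (m) {
  return m[1];
});

console.log(names); // [ 'two', 'wildcards' ]
```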
You can use the replace function on the string like so:
var s = '/a/string/with/$two/$wildcards';
s = s.replace(/\$([a-zA-Z]+)/g, '[$1]'); // replace returns a new string, so assign it back
s will then have the value:
/a/string/with/[two]/[wildcards]
Here's a reference to replace documentation https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Global_Objects/String/replace

Why is this regex matching also words within a non-capturing group?

I have this string (notice the multi-line syntax):
var str = ` Number One: Get this
Number Two: And this`;
And I want a regex that returns (with match):
[str, 'Get this', 'And this']
So I tried str.match(/Number (?:One|Two): (.*)/g);, but that's returning:
["Number One: Get this", "Number Two: And this"]
There can be any whitespace/line-breaks before any "Number" word.
Why doesn't it return only what is inside of the capturing group? Am I misunderstanding something? And how can I achieve the desired result?
Per the MDN documentation for String.match:
If the regular expression includes the g flag, the method returns an Array containing all matched substrings rather than match objects. Captured groups are not returned. If there were no matches, the method returns null.
(emphasis mine).
So, what you want is not possible.
The same page adds:
if you want to obtain capture groups and the global flag is set, you need to use RegExp.exec() instead.
So if you're willing to give up on using match, you can write your own function that repeatedly applies the regex, gets the captured substrings, and builds an array.
Or, for your specific case, you could write something like this:
var these = str.split(/(?:^|\n)\s*Number (?:One|Two): /);
these[0] = str;
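The "write your own function" route with RegExp.exec() might look like the sketch below. The name captureAll is hypothetical, and the sketch only collects the first capture group.

```javascript
// Hypothetical helper: returns [text, capture1, capture1, ...] by calling
// exec() repeatedly. The regex must have the g flag, otherwise exec()
// would keep returning the first match forever.
function captureAll(text, regex) {
  if (!regex.global) {
    throw new Error('captureAll expects a regex with the g flag');
  }
  regex.lastIndex = 0;                      // start from the beginning
  var results = [text];
  var m;
  while ((m = regex.exec(text)) !== null) {
    results.push(m[1]);
    if (m[0] === '') regex.lastIndex++;     // avoid looping on empty matches
  }
  return results;
}

var input = ' Number One: Get this\n Number Two: And this';
var found = captureAll(input, /Number (?:One|Two): (.*)/g);
console.log(found);
// [ ' Number One: Get this\n Number Two: And this', 'Get this', 'And this' ]
```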
Replace and store the result in a new string, like this:
var str = ` Number One: Get this
Number Two: And this`;
var output = str.replace(/Number (?:One|Two): (.*)/g, "$1");
console.log(output);
which outputs:
Get this
And this
If you want the match array like you requested, you can try this:
var getMatch = function(string, split, regex) {
var match = string.replace(regex, "$1" + split);
match = match.split(split);
match = match.reverse();
match.push(string);
match = match.reverse();
match.pop();
return match;
}
var str = ` Number One: Get this
Number Two: And this`;
var regex = /Number (?:One|Two): (.*)/g;
var match = getMatch(str, "#!SPLIT!#", regex);
console.log(match);
which displays the array as desired:
[ ' Number One: Get this\n Number Two: And this',
' Get this',
'\n And this' ]
Where split (here #!SPLIT!#) should be a unique string to split the matches. Note that this only works for single groups. For multi groups add a variable indicating the number of groups and add a for loop constructing "$1 $2 $3 $4 ..." + split.
Try
var str = " Number One: Get this\n Number Two: And this"; // "\n" keeps the line break between the two entries
// `/\w+\s+\w+(?=\s|$)/g` match one or more alphanumeric characters ,
// followed by one or more space characters ,
// followed by one or more alphanumeric characters ,
// if following space or end of input , set `g` flag
// return `res` array `["Get this", "And this"]`
var res = str.match(/\w+\s+\w+(?=\s|$)/g);
document.write(JSON.stringify(res));

How to match one, but not two characters using regular expressions

Using javascript regular expressions, how do you match one character while ignoring any other characters that also match?
Example 1: I want to match $, but not $$ or $$$.
Example 2: I want to match $$, but not $$$.
A typical string that is being tested is, "$ $$ $$$ asian italian"
From a user experience perspective, the user selects, or deselects, a checkbox whose value matches tags found in a list of items. All the tags must be matched (checked) for the item to show.
function filterResults(){
// Make an array of the checked inputs
var aInputs = $('.listings-inputs input:checked').toArray();
// alert(aInputs);
// Turn that array into a new array made from each items value.
var aValues = $.map(aInputs, function(i){
// alert($(i).val());
return $(i).val();
});
// alert(aValues);
// Create new variable, set the value to the joined array set to lower case.
// Use this variable as the string to test
var sValues = aValues.join(' ').toLowerCase();
// alert(sValues);
// sValues = sValues.replace(/\$/ig,'\\$');
// alert(sValues);
// this examines each the '.tags' of each item
$('.listings .tags').each(function(){
var sTags = $(this).text();
// alert(sTags);
sSplitTags = sTags.split(' \267 '); // JavaScript uses octal encoding for special characters
// alert(sSplitTags);
// sSplitTags = sTags.split(' \u00B7 '); // This also works
var show = true;
$.each(sSplitTags, function(i,tag){
if(tag.charAt(0) == '$'){
// alert(tag);
// alert('It begins with a $');
// You have to escape special characters for the RegEx
tag = tag.replace(/\$/ig,'\\$');
// alert(tag);
}
tag = '\\b' + tag + '\\b';
var re = new RegExp(tag,'i');
if(!(re.test(sValues))){
alert(tag);
show = false;
alert('no match');
return false;
}
else{
alert(tag);
show = true;
alert('match');
}
});
if(show == false){
$(this).parent().hide();
}
else{
$(this).parent().show();
}
});
// call the swizzleRows function in the listings.js
swizzleList();
}
Thanks in advance!
Normally, with regex, you can use (?<!x)x(?!x) to match an x that is not preceded nor followed with x.
With the modern ECMAScript 2018+ compliant JS engines, you may use lookbehind based regex:
(?<!\$)\$(?!\$)
See the JS demo (run it only in browsers that support lookbehind):
const str ="$ $$ $$$ asian italian";
const regex = /(?<!\$)\$(?!\$)/g;
console.log( str.match(regex).length ); // Count the single $ occurrences
console.log( str.replace(regex, '<span>$&</span>') ); // Enclose single $ occurrences with tags
console.log( str.split(regex) ); // Split with single $ occurrences
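The same lookaround idea extends to the second example in the question (match $$ but not $$$). A minimal sketch, assuming an engine with lookbehind support; the helper name exactDollarRun is hypothetical:

```javascript
// Hypothetical helper: builds a regex that matches a run of exactly n "$"
// characters, i.e. n dollars not preceded and not followed by another "$".
function exactDollarRun(n) {
  return new RegExp('(?<!\\$)\\${' + n + '}(?!\\$)', 'g');
}

var tags = '$ $$ $$$ asian italian';
var one = tags.match(exactDollarRun(1));
var two = tags.match(exactDollarRun(2));
var three = tags.match(exactDollarRun(3));

console.log(one);   // [ '$' ]
console.log(two);   // [ '$$' ]
console.log(three); // [ '$$$' ]
```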
\bx\b
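Explanation: Matches x between two word boundaries. \b also matches at the start or end of the string when the adjacent character is a word character. Note that a word boundary only exists next to a letter, digit, or underscore, so this works for tokens made of word characters but not for $ itself.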
I'm taking advantage of the space delimiting in your question. If that is not there, then you will need a more complex expression like (^x$|^x[^x]|[^x]x[^x]|[^x]x$) to match different positions possibly at the start and/or end of the string. This would limit it to single character matching, whereas the first pattern matches entire tokens.
The alternative is just to tokenize the string (split it at spaces) and construct an object from the tokens which you can just look up to see if a given string matched one of the tokens. This should be much faster per-lookup than regex.
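A rough sketch of that tokenize-and-look-up idea, using a Set as the lookup structure. The names and sample values are illustrative, and lowercasing mirrors the case-insensitive test in the question.

```javascript
// Checked checkbox values, already joined and lowercased as in the question.
var selectedValues = '$ $$ asian'.split(' ');
var selectedSet = new Set(selectedValues);

// An item is shown only if every one of its tags is among the checked
// values. Exact token comparison means "$" never matches "$$" or "$$$".
function allTagsSelected(itemTags) {
  return itemTags.every(function (tag) {
    return selectedSet.has(tag.toLowerCase());
  });
}

console.log(allTagsSelected(['$', 'Asian']));   // true
console.log(allTagsSelected(['$$$', 'Asian'])); // false
```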
Something like this (Python):
import re
q = re.match(r"""(x{2})($|[^x])""", 'xx')
q.groups() # ('xx', '')
q = re.match(r"""(x{2})($|[^x])""", 'xxx')
q is None # True
