I'm trying to extract the PROCEDURE sections from CLAIM, EOB and COB in a text file
and create an object like so:
claim : [{PROCEDURE1}, {PROCEDURE2}, {PROCEDURE3}],
eob : [{PROCEDURE1}, {PROCEDURE2}, {PROCEDURE3}],
cob: [{PROCEDURE1}, {PROCEDURE2}, {PROCEDURE3}]
let data = ` SEND CLAIM {
PREFIX="9403 "
PROCEDURE { /* #1 */
PROCEDURE_LINE="1"
PROCEDURE_CODE="01201"
}
PROCEDURE { /* #2 */
PROCEDURE_LINE="2"
PROCEDURE_CODE="02102"
}
PROCEDURE { /* #3 */
PROCEDURE_LINE="3"
PROCEDURE_CODE="21222"
}
}
SEND EOB {
PREFIX="9403 "
OFFICE_SEQUENCE="000721"
PROCEDURE { /* #1 */
PROCEDURE_LINE="1"
ELIGIBLE="002750"
}
PROCEDURE { /* #2 */
PROCEDURE_LINE="2"
ELIGIBLE="008725"
}
PROCEDURE { /* #3 */
PROCEDURE_LINE="3"
ELIGIBLE="010200"
}
}
SEND COB {
PREFIX="TEST4 "
OFFICE_SEQUENCE="000721"
PROCEDURE { /* #1 */
PROCEDURE_LINE="1"
PROCEDURE_CODE="01201"
}
PROCEDURE { /* #2 */
PROCEDURE_LINE="2"
PROCEDURE_CODE="02102"
}
PROCEDURE { /* #3 */
PROCEDURE_LINE="3"
PROCEDURE_CODE="21222"
DATE="19990104"
}
PRIME_EOB=SEND EOB {
PREFIX="9403 "
OFFICE_SEQUENCE="000721"
PROCEDURE { /* #1 */
PROCEDURE_LINE="1"
ELIGIBLE="002750"
}
PROCEDURE { /* #2 */
PROCEDURE_LINE="2"
ELIGIBLE="008725"
}
PROCEDURE { /* #3 */
PROCEDURE_LINE="3"
ELIGIBLE="010200"
}
}
}`
let re = /(^\s+PROCEDURE\s\{)([\S\s]*?)(?:})/gm
console.log(data.match(re));
Here is what I have tried so far: (^\s+PROCEDURE\s\{)([\S\s]*?)(?:}), but I can't figure out how to match only the PROCEDUREs that follow the key CLAIM or EOB.
For "claim", you could match the following regular expression.
/(?<=^ *SEND CLAIM +\{\r?\n(?:^(?! *SEND EOB *\{)(?! *SEND COB *\{).*\r?\n)*^ *PROCEDURE *)\{[^\}]*\}/
With the g and m flags set (the expression relies on ^ matching at the start of each line), this matches the following strings, which I assume can easily be saved to an array with a sprinkling of JavaScript code.
{ /* CLAIM #1 */
PROCEDURE_LINE="1"
PROCEDURE_CODE="01201"
}
{ /* CLAIM #2 */
PROCEDURE_LINE="2"
PROCEDURE_CODE="02102"
}
{ /* CLAIM #3 */
PROCEDURE_LINE="3"
PROCEDURE_CODE="21222"
}
JavaScript's regex engine performs the following operations.
(?<= : begin positive lookbehind
^ : match beginning of line
\ *SEND CLAIM\ + : match 'SEND CLAIM' preceded by 0+ spaces and followed by 1+ spaces
\{\r?\n : match '{' then line terminators
(?: : begin non-capture group
^ : match beginning of line
(?! : begin negative lookahead
\ *SEND EOB\ * : match 'SEND EOB' surrounded by 0+ spaces
\{ : match '{'
) : end negative lookahead
(?! : begin negative lookahead
\ *SEND COB\ * : match 'SEND COB' surrounded by 0+ spaces
\{ : match '{'
) : end negative lookahead
.*\r?\n : match line including terminators
) : end non-capture group
* : execute non-capture group 0+ times
^ : match beginning of line
\ *PROCEDURE\ * : match 'PROCEDURE' surrounded by 0+ spaces
) : end positive lookbehind
\{[^\}]*\} : match '{', 0+ characters other than '}', '}'
I've escaped space characters above to improve readability.
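As a minimal sketch of how the matches might be collected into an array: this assumes the g and m flags are set and that the engine supports lookbehind. The shortened sample and the names sample, claimRe and claimProcedures are made up for illustration.

```javascript
// Shortened, hypothetical sample in the same format as the question's data.
const sample = [
  "SEND CLAIM {",
  'PREFIX="9403 "',
  "PROCEDURE { /* #1 */",
  'PROCEDURE_LINE="1"',
  "}",
  "PROCEDURE { /* #2 */",
  'PROCEDURE_LINE="2"',
  "}",
  "}",
  "SEND EOB {",
  "PROCEDURE { /* #1 */",
  'ELIGIBLE="002750"',
  "}",
  "}",
].join("\n");

// The CLAIM expression from above, with g (find all matches) and m (^ matches at line starts).
const claimRe = /(?<=^ *SEND CLAIM +\{\r?\n(?:^(?! *SEND EOB *\{)(?! *SEND COB *\{).*\r?\n)*^ *PROCEDURE *)\{[^\}]*\}/gm;

// With a global regex, match() returns every matched string, or null when nothing matches.
const claimProcedures = sample.match(claimRe) ?? [];
console.log(claimProcedures.length); // number of PROCEDURE blocks found under CLAIM
console.log(claimProcedures);
```

Only the blocks under SEND CLAIM should be collected here; the PROCEDURE under SEND EOB is skipped because the lookbehind cannot cross the SEND EOB line.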
For "eob", use the slightly-modified regex:
/(?<=^ *SEND EOB +\{\r?\n(?:^(?! *SEND CLAIM *\{)(?! *SEND COB *\{).*\r?\n)*^ *PROCEDURE *)\{[^\}]*\}/
I've made no attempt to do the same for "cob" as that part has a different structure than "claim" and "eob" and it is not clear to me how it is to be treated.
A final note, should it not be obvious: it would be far easier to extract the strings of interest using conventional code with loops and, possibly, simple regular expressions, but I hope some readers may find my answer instructive about some elements of regular expressions.
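For comparison, here is a rough sketch of what such loop-based code might look like. It is not taken from the question: the function name extractProcedures, the shortened sample and the decision to skip nested blocks such as PRIME_EOB are all assumptions made for illustration.

```javascript
// Hypothetical helper: walk the text line by line and collect the PROCEDURE
// blocks that sit directly inside each top-level "SEND <NAME> {" section.
function extractProcedures(text) {
  const result = {};
  let section = null; // lower-cased name of the current top-level section
  let depth = 0;      // current brace nesting depth
  let current = null; // PROCEDURE object being filled, if any

  for (const rawLine of text.split(/\r?\n/)) {
    // Drop /* ... */ comments and surrounding whitespace.
    const line = rawLine.replace(/\/\*.*?\*\//g, "").trim();
    if (line === "") continue;

    const header = line.match(/^SEND\s+(\w+)\s*\{$/);
    if (header && depth === 0) {
      section = header[1].toLowerCase();
      result[section] = [];
      depth = 1;
    } else if (/^PROCEDURE\s*\{$/.test(line) && depth === 1) {
      current = {};
      depth = 2;
    } else if (line.endsWith("{")) {
      depth++; // nested block (e.g. PRIME_EOB=SEND EOB {): tracked but ignored
    } else if (line === "}") {
      if (current && depth === 2) {
        result[section].push(current);
        current = null;
      }
      depth--;
    } else if (current && depth === 2) {
      const pair = line.match(/^(\w+)="(.*)"$/);
      if (pair) current[pair[1]] = pair[2];
    }
  }
  return result;
}

// Shortened, hypothetical sample in the question's format.
const sample = [
  "SEND CLAIM {",
  'PREFIX="9403 "',
  "PROCEDURE { /* #1 */",
  'PROCEDURE_LINE="1"',
  'PROCEDURE_CODE="01201"',
  "}",
  "PROCEDURE { /* #2 */",
  'PROCEDURE_LINE="2"',
  'PROCEDURE_CODE="02102"',
  "}",
  "}",
  "SEND COB {",
  "PROCEDURE { /* #1 */",
  'PROCEDURE_LINE="1"',
  "}",
  "PRIME_EOB=SEND EOB {",
  "PROCEDURE { /* #1 */",
  'ELIGIBLE="002750"',
  "}",
  "}",
  "}",
].join("\n");

console.log(extractProcedures(sample));
```

Tracking the brace depth is what keeps the PROCEDUREs inside PRIME_EOB out of the cob list; if they should be included, the depth checks would need to be relaxed.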
Will CLAIM, EOB and COB always be in the same order? If so, you can split the text before using the regex you already have:
const procRegex = /(^\s+PROCEDURE\s\{)([\S\s]*?)(?:})/gm;
let claimData = data.split("EOB")[0];
let claimProcedures = claimData.match(procRegex);
let eobData = data.split("COB")[0].split("EOB")[1];
let eobProcedures = eobData.match(procRegex);
let cobData = data.split("COB")[1];
let cobProcedures = cobData.match(procRegex);
// If you want to leave out the PRIME_EOB, you can split COB again
cobData = cobData.split("EOB")[0];
cobProcedures = cobData.match(procRegex);
console.log(claimProcedures);
Output:
[
' PROCEDURE { /* #1 */\n' +
' PROCEDURE_LINE="1"\n' +
' PROCEDURE_CODE="01201"\n' +
' \n' +
' }',
' PROCEDURE { /* #2 */\n' +
' PROCEDURE_LINE="2"\n' +
' PROCEDURE_CODE="02102"\n' +
' \n' +
' }',
' PROCEDURE { /* #3 */\n' +
' PROCEDURE_LINE="3"\n' +
' PROCEDURE_CODE="21222"\n' +
' \n' +
' }'
]
As an alternate method, your data is not terribly far away from valid JSON, so you could run with that. The code below translates the data into JSON, then parses it into a JavaScript object that you can use however you want.
/* data cannot have Javascript comments in it for this to work, or you need
another regex to remove them */
data = data.replace(/=/g, ":") // replace = with :
.replace(/\s?{/g, ": {") // replace { with : {
.replace(/SEND/g, "") // remove "SEND"
.replace(/\"\s*$(?!\s*\})/gm, "\",") // add commas after object properties
.replace(/}(?=\s*\w)/g, "},") // add commas after objects
.replace(/(?<!\}),\s*PROCEDURE: /g, ",\nPROCEDURES: [") // start procedures list
.replace(/(PROCEDURE:[\S\s]*?\})\s*(?!,\s*PROCEDURE)/g, "$1]\n") // end list
.replace(/PROCEDURE: /g, "") // remove "PROCEDURE"
.replace("PRIME_EOB: EOB:", "PRIME_EOB:") // replace double key with single key. Is this the behavior you want?
.replace(/(\S*):/g, "\"$1\":") // put quotes around object key names
let dataObj = JSON.parse("{" + data + "}");
console.log(dataObj.CLAIM.PROCEDURES);
Output:
[ { PROCEDURE_LINE: '1', PROCEDURE_CODE: '01201' },
{ PROCEDURE_LINE: '2', PROCEDURE_CODE: '02102' },
{ PROCEDURE_LINE: '3', PROCEDURE_CODE: '21222' } ]
What you are trying to do is write a parser for the syntax used in your text file.
The syntax looks much like JSON.
I would recommend modifying the text with regexes to get valid JSON syntax and parsing it with the JavaScript JSON parser. The parser is able to handle nesting. At the end you will have a JavaScript object from which you can remove, or to which you can add, whatever you need. In addition, the hierarchy of the source will be preserved.
This code does the job for the provided example:
let data = ` SEND CLAIM {
// your text file contents
}`;
// handle PRIME_EOB=SEND EOB {
var regex = /(\w+)=\w+.*{/gm;
var replace = data.replace(regex, "$1 {");
// append double quotes in lines like PROCEDURE_LINE="1"
var regex = /(\w+)=/g;
var replace = replace.replace(regex, "\"$1\": ");
// append double quotes in lines like PROCEDURE {
var regex = /(\w+.*)\s{/g;
var replace = replace.replace(regex, "\"$1\": {");
// remove comments: /* */
var regex = /\/\**.*\*\//g;
var replace = replace.replace(regex, "");
// append commas to lines i.e. "PROCEDURE_LINE": "2"
var regex = /(\".*\":\s*\".*\")/gm;
var replace = replace.replace(regex, "$1,");
// append commas to '}'
var regex = /^.*}.*$/gm;
var replace = replace.replace(regex, "},");
// remove trailing commas
var regex = /\,(?!\s*?[\{\[\"\'\w])/g;
var replace = replace.replace(regex, "");
// surround with {}
replace = "{" + replace + "}";
console.log(replace);
var obj = JSON.parse(replace);
console.log(obj);
The JSON looks like this snippet:
{ "SEND CLAIM": {
"PREFIX": "9403 ",
"PROCEDURE": {
"PROCEDURE_LINE": "1",
"PROCEDURE_CODE": "01201"
},
"PROCEDURE": {
"PROCEDURE_LINE": "2",
"PROCEDURE_CODE": "02102"
And the final object appears in the debugger like this: (screenshot not included)
It is not completely clear to me what your final array or object should look like. But from here I expect only little effort to produce what you desire.
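One caveat worth checking before relying on this: the generated JSON repeats the "PROCEDURE" key inside each section, and JSON.parse keeps only the last value for a repeated key. The sketch below shows the effect on a shortened, made-up JSON string, together with one possible workaround (numbering the keys before parsing); the names jsonText, counter and numbered are hypothetical.

```javascript
// Hypothetical, shortened version of the generated JSON text.
const jsonText =
  '{ "SEND CLAIM": { "PROCEDURE": { "PROCEDURE_LINE": "1" }, "PROCEDURE": { "PROCEDURE_LINE": "2" } } }';

// Duplicate keys do not make JSON.parse throw, but the last one overwrites the earlier ones.
const collapsed = JSON.parse(jsonText);
console.log(Object.keys(collapsed["SEND CLAIM"]));             // [ 'PROCEDURE' ]
console.log(collapsed["SEND CLAIM"].PROCEDURE.PROCEDURE_LINE); // 2

// One possible workaround: give every PROCEDURE key a running number before parsing.
// "PROCEDURE_LINE": is not affected because the pattern requires the closing quote.
let counter = 0;
const numberedText = jsonText.replace(/"PROCEDURE":/g, () => `"PROCEDURE_${++counter}":`);
const numbered = JSON.parse(numberedText);
console.log(Object.keys(numbered["SEND CLAIM"]));              // [ 'PROCEDURE_1', 'PROCEDURE_2' ]
```

Note that the counter in this sketch runs across the whole text, so numbering does not restart in each section.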
I am trying to extract JPA named parameters in JavaScript. This is the algorithm I came up with:
const notStrRegex = /(?<![\S"'])([^"'\s]+)(?![\S"'])/gm
const namedParamCharsRegex = /[a-zA-Z0-9_]/;
/**
* @returns array of named parameters which
* 1. always begin with :
* 2. have remaining characters guaranteed to match {@link namedParamCharsRegex}
*
* @example
* 1. "select * from a where id = :myId3;" -> [':myId3']
* 2. "to_timestamp_tz(:FROM_DATE, 'YYYY-MM-DD\"T\"HH24:MI:SS')" -> [':FROM_DATE']
* 3. "TO_CHAR(ep.CHANGEDT,'yyyy=mm-dd hh24:mi:ss')" -> []
*/
export function extractNamedParam(query: string): string[] {
return (query.match(notStrRegex) ?? [])
.filter((word) => word.includes(':'))
.map((splittedWord) => splittedWord.substring(splittedWord.indexOf(':')))
.filter((splittedWord) => splittedWord.length > 1) // ignore ":"
.map((word) => {
// i starts from 1 because word[0] is :
for (let i = 1; i < word.length; i++) {
const isAlphaNum = namedParamCharsRegex.test(word[i]);
if (!isAlphaNum) return word.substring(0, i);
}
return word;
});
}
I got inspired by the solution in
https://stackoverflow.com/a/11324894/12924700
to filter out all characters that are enclosed in single/double quotes.
The code above fulfils the 3 use cases listed in the comment.
But when a user inputs
const testStr = '"user input invalid string \' :shouldIgnoreThisNamedParam \' in a string"'
extractNamedParam(testStr) // should return [] but it returns [":shouldIgnoreThisNamedParam"] instead
I did visit the source code of hibernate to see how named parameters are extracted there, but I couldn't find the algorithm that is doing the work. Please help.
You can use
/"[^\\"]*(?:\\[\w\W][^\\"]*)*"|'[^\\']*(?:\\[\w\W][^\\']*)*'|(:\w+)/g
Get the Group 1 values only. See the regex demo. The regex matches strings between single/double quotes and captures : + one or more word chars in all other contexts.
See the JavaScript demo:
const re = /"[^\\"]*(?:\\[\w\W][^\\"]*)*"|'[^\\']*(?:\\[\w\W][^\\']*)*'|(:\w+)/g;
const text = "to_timestamp_tz(:FROM_DATE, 'YYYY-MM-DD\"T\"HH24:MI:SS')";
let matches=[], m;
while (m=re.exec(text)) {
if (m[1]) {
matches.push(m[1]);
}
}
console.log(matches);
Details:
"[^\\"]*(?:\\[\w\W][^\\"]*)*" - a ", then zero or more chars other than " and \ ([^"\\]*), and then zero or more repetitions of any escaped char (\\[\w\W]) followed with zero or more chars other than " and \, and then a "
| - or
'[^\\']*(?:\\[\w\W][^\\']*)*' - a ', then zero or more chars other than ' and \ ([^'\\]*), and then zero or more repetitions of any escaped char (\\[\w\W]) followed with zero or more chars other than ' and \, and then a '
| - or
(:\w+) - Group 1 (this is the value we need to get, the rest is just used to consume some text where matches must be ignored): a colon and one or more word chars.
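As a sketch, the same expression can be wrapped in a small function using matchAll and checked against the examples from the question. The function name extractNamedParams is made up here, and matchAll is assumed to be available.

```javascript
// Quoted strings are matched (and thereby skipped); only ":name" outside quotes lands in group 1.
const namedParamRe = /"[^\\"]*(?:\\[\w\W][^\\"]*)*"|'[^\\']*(?:\\[\w\W][^\\']*)*'|(:\w+)/g;

// Hypothetical wrapper: returns the group 1 values only.
function extractNamedParams(query) {
  return Array.from(query.matchAll(namedParamRe), (m) => m[1])
    .filter((param) => param !== undefined);
}

console.log(extractNamedParams("select * from a where id = :myId3;"));
// [ ':myId3' ]
console.log(extractNamedParams("to_timestamp_tz(:FROM_DATE, 'YYYY-MM-DD\"T\"HH24:MI:SS')"));
// [ ':FROM_DATE' ]
console.log(extractNamedParams("TO_CHAR(ep.CHANGEDT,'yyyy=mm-dd hh24:mi:ss')"));
// []
console.log(extractNamedParams('"user input invalid string \' :shouldIgnoreThisNamedParam \' in a string"'));
// []
```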
I'm trying to get an array of JSON objects. To do that, I'm trying to make the input I have parsable, then parse it and push it to that array using a for loop. The inputs I have to work with look like this:
firstname: Chris, lastname: Cheshire, email: chris#cmdcheshire.com, viewerlink: audiencematic.com/viewer?v\u003dTESTSHOW\u0026push\u003d8A043B5A, tempid: 8A043B5A, permaid: F8tGYNx, showid: TESTSHOW
I've gotten it to the point where each loop produces something like this:
{ "firstname": First Name, "lastname": Last Name, "email": sample#gmail.com, "viewerlink": audiencematic.com/viewer?v=TESTSHOW&push=715B3074, "tempid": 715B3074, "permaid": F8tGYNx, "showid": TESTSHOW }
But got stuck on the last bit, making the values strings. I want it to look like this, so I can use JSON.parse():
{ "firstname": "First Name", "lastname": "Last Name", "email": "sample#gmail.com", "viewerlink": "audiencematic.com/viewer?v=TESTSHOW&push=715B3074", "tempid": "715B3074", "permaid": "F8tGYNx", "showid": "TESTSHOW" }
I tried a couple of different methods I found on here, but one of the values is a URL and the period is screwing with the replace expressions. I tried using the replace function like this:
var jsonStr2 = jsonStr.replace(/(: +\w)|(:+\w)/g, function(matchedStr) {
return ':"' + matchedStr.substring(2, matchedStr.length) + '"';
});
But it just becomes this:
{ "firstname":""irst Name, "lastname":""ast Name, "email":""ample#gmail.com, "viewerlink":""udiencematic.com/viewer?v=TESTSHOW&push=715B3074, "tempid":""15B3074, "permaid":""8tGYNx, "showid":""ESTSHOW }
How should I change my replace function?
(I tried that code because I'm using
var jsonStr = string.replace(/(\w+:)|(\w+ :)/g, function(matchedStr) {
return '"' + matchedStr.substring(0, matchedStr.length - 1) + '":';
});
to put quotes around the keys and that seems to work.)
FIGURED IT OUT!! SEE MY ANSWER BELOW.
One option might be to try using a deserialized version of the string, alter the values associated with the properties of the object, and then convert back to a string.
var person = "{\"fname\":\"John\", \"lname\":\"Doe\", \"age\":25}"; // keys must be quoted, or JSON.parse throws
var obj = JSON.parse(person);
for (var x in obj) {
obj[x] = "";
}
var result = JSON.stringify(obj);
It's a little longer than doing a string replacement, but I find it a little easier to follow.
I figured it out! I just had to mess around in regexr to figure out what conditions I needed. Here's the working for loop code:
for (i = 0; i < audiencelistdirty.feed.openSearch$totalResults.$t; i++) {
var string = '{ ' + audiencelistdirty.feed.entry[i].content.$t + ' }';
var jsonStr = string.replace(/(\w+:)|(\w+ :)/g, function(matchedStr) {
return '"' + matchedStr.substring(0, matchedStr.length - 1) + '":';
});
var jsonStr1 = jsonStr.replace(/(:(.*?),)|(:\s(.*?)\s)/g, function(matchedStr) {
return ':"' + matchedStr.substring(2, matchedStr.length - 1) + '",';
});
var jsonStr2 = jsonStr1.replace(/(",})/g, function(matchedStr) {
return '" }';
});
var newObj = JSON.parse(jsonStr2);
audiencelist.push(newObj);
};
It's pretty ugly but it works.
EDIT: Sorry, I completely misread the question. To replace the values with quoted strings use this regex replace function:
const str =
'firstname: Chris, lastname: Cheshire, email: chris#cmdcheshire.com, viewerlink: audiencematic.com/viewer?v\u003dTESTSHOW\u0026push\u003d8A043B5A, tempid: 8A043B5A, permaid: F8tGYNx, showid: TESTSHOW'
const json = (() => {
const result = str
.replace(/\w+:\s(.*?)(?:,|$)/g, function (match, subStr) {
return match.replace(subStr, `"${subStr}"`)
})
.replace(/(\w+):/g, function (match, subStr) {
return match.replace(subStr, `"${subStr}"`)
})
return '{' + result + '}'
})()
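A variation on the same idea, as a sketch only: quote the key and the value in a single pass with JSON.stringify (so quotes or backslashes inside a value get escaped) and parse the result. The function name toObject is made up, and the sketch assumes that no value itself contains text of the form ", word: ".

```javascript
// Hypothetical helper: turn "key: value, key: value" text into an object.
function toObject(line) {
  // A value runs lazily up to the next ", key: " or to the end of the input.
  const quoted = line.replace(
    /(\w+):\s(.*?)(?=,\s\w+:\s|$)/g,
    (match, key, value) => `${JSON.stringify(key)}: ${JSON.stringify(value)}`
  );
  return JSON.parse(`{${quoted}}`);
}

const entry = toObject(
  "firstname: First Name, lastname: Cheshire, viewerlink: audiencematic.com/viewer?v=TESTSHOW&push=8A043B5A, showid: TESTSHOW"
);
console.log(entry.firstname);  // First Name
console.log(entry.viewerlink); // audiencematic.com/viewer?v=TESTSHOW&push=8A043B5A
console.log(entry.showid);     // TESTSHOW
```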
Wrap the input string in commas, then use a regex to identify the keys (between , and :) and their associated values (between : and ,) and construct the object directly, as in the example below:
const input = ' firstname : Chris , lastname: Cheshire, email: chris#cmdcheshire.com, viewerlink: audiencematic.com/viewer?v\u003dTESTSHOW\u0026push\u003d8A043B5A, tempid: 8A043B5A, permaid: F8tGYNx, showid: TESTSHOW ';
const wrapped = `,${input},`;
const re = /,\s*([^:\s]*)\s*:\s*(.*?)\s*(?=,)/g;
const obj = {}
Array.from(wrapped.matchAll(re)).forEach((match) => obj[match[1]] = match[2]);
console.log(obj)
String.prototype.matchAll() is a newer function and not all JavaScript engines have implemented it yet. If you are one of the unlucky ones (or if you write code to be executed in a browser), you can use the old-school way:
const input = ' firstname : Chris , lastname: Cheshire, email: chris#cmdcheshire.com, viewerlink: audiencematic.com/viewer?v\u003dTESTSHOW\u0026push\u003d8A043B5A, tempid: 8A043B5A, permaid: F8tGYNx, showid: TESTSHOW ';
const wrapped = `,${input},`;
const re = /,\s*([^:\s]*)\s*:\s*(.*?)\s*(?=,)/g;
const obj = {}
let match = re.exec(wrapped);
while (match) {
obj[match[1]] = match[2];
match = re.exec(wrapped);
}
console.log(obj);
The anatomy of the regex used above
The regular expression piece by piece:
/ # regex delimiter; not part of the regex but JavaScript syntax
, # match a comma
\s # match a white space character (space, tab, new line)
* # the previous symbol zero or more times
( # start the first capturing group; does not match anything
[ # start a character class...
^ # ... that matches any character not listed inside the class
: # ... i.e. any character but colon...
\s # ... and white space character
] # end of the character class; the entire class matches only one character
* # the previous symbol zero or more times
) # end of the first capturing group; does not match anything
\s*:\s* # zero or more spaces before and after the colon
( # start of the second capturing group
.* # any character, any number of times; this is greedy by default
? # make it not greedy
) # end of the second capturing group
\s* # zero or more spaces
(?= # lookahead positive assertion; matches but does not consume the matched substring
, # matches a comma
) # end of the assertion
/ # regex delimiter; not part of the regex but JavaScript
g # regex flag; 'g' for 'global' is needed to find all matches
Read about the syntax of regular expressions in JavaScript. For a more comprehensive description of the regex patterns I recommend reading the PHP documentation of PCRE (Perl-Compatible Regular Expressions).
You can see the regex in action and play with it on regex101.com.
Regex to fetch all spaces as long as they are not enclosed in braces
This is for a javascript mention system
ex: "Speak #::{Joseph Empyre}{b0268efc-0002-485b-b3b0-174fad6b87fc}, all right?"
Need to get:
[ "Speak ", "#::{Joseph Empyre}{b0268efc-0002-485b-b3b0-174fad6b87fc}", ",", "all ", "right?" ]
[Edit]
Solved in: https://codesandbox.io/s/rough-http-8sgk2
Sorry for my bad english
I interpreted your question as asking to fetch all spaces as long as they are not enclosed in braces, although your example result isn't what I would expect. It contains a space after Speak, as well as a separate match for the , after the {} groups. My output below shows what I would expect for what I think you are asking for: a list of strings split on just the spaces outside of braces.
const str =
"Speak #::{Joseph Empyre}{b0268efc-0002-485b-b3b0-174fad6b87fc}, all right?";
// This regex matches both pairs of {} with things inside and spaces
// It will not properly handle nested {{}}
// It does this such that instead of capturing the spaces inside the {},
// it instead captures the whole of the {} group, spaces and all,
// so we can discard those later
var re = /(?:\{[^}]*?\})|( )/g;
var match;
var matches = [];
while ((match = re.exec(str)) != null) {
matches.push(match);
}
var cutString = str;
var splitPieces = [];
for (var len=matches.length, i=len - 1; i>=0; i--) {
match = matches[i];
// Since we have matched both groups of {} and spaces, ignore the {} matches
// just look at the matches that are exactly a space
if(match[0] == ' ') {
// Note that if there is a trailing space at the end of the string,
// we will still treat it as delimiter and give an empty string
// after it as a split element
// If this is undesirable, check if match.index + 1 >= cutString.length first
splitPieces.unshift(cutString.slice(match.index + 1));
cutString = cutString.slice(0, match.index);
}
}
splitPieces.unshift(cutString);
console.log(splitPieces)
Console:
["Speak", "#::{Joseph Empyre}{b0268efc-0002-485b-b3b0-174fad6b87fc},", "all", "right?"]
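A shorter alternative, offered only as a sketch: instead of locating the spaces and cutting the string, match the pieces directly, where a piece is a run of {...} groups and non-whitespace characters. The function name splitOutsideBraces is made up. Unlike the code above it treats any whitespace as a delimiter and never produces empty strings for repeated or trailing spaces, and like the code above it does not handle nested braces.

```javascript
// Hypothetical helper: a piece is one or more {…} groups or non-whitespace characters.
// The {…} alternative is tried first, so spaces inside braces stay within the piece.
function splitOutsideBraces(text) {
  return text.match(/(?:\{[^}]*\}|\S)+/g) ?? [];
}

const mention = "Speak #::{Joseph Empyre}{b0268efc-0002-485b-b3b0-174fad6b87fc}, all right?";
console.log(splitOutsideBraces(mention));
// [ 'Speak', '#::{Joseph Empyre}{b0268efc-0002-485b-b3b0-174fad6b87fc},', 'all', 'right?' ]
```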
I need a regex that will match the following:
a.b.c
a0.b_.c
a.bca._cda.dca-fb
Notice that it can contain numbers, and the groups are separated by dots. The allowed characters are a-zA-Z, -, _, 0-9
The rule is that it cannot start with a number, and it cannot end with a dot, i.e. the regex should not match
0a.b.c
a.b.c.d.
I have come up with a regex, which seems to work on regex101, but not javascript
([a-zA-Z]+.?)((\w+).)*(\w+)
But does not seem to work in js:
var str = "a.b.c"
if (str.match("([a-zA-Z]+.?)((\w+).)*(\w+)")) {
console.log("match");
} else {
console.log("not match");
}
// says not match
Your regex matches your values if you use anchors to assert the start ^ and the end $ of the line.
As an alternative you might use:
^[a-z][\w-]*(?:\.[\w-]+)*$
This asserts the start of the line ^, matches a single letter [a-z], and then matches zero or more word characters \w (which matches [a-zA-Z0-9_]) or hyphens using the character class [\w-]*.
Then repeat the pattern that will match a dot and the allowed characters in the character class (?:\.[\w-]+)* until the end of the line $
const strings = [
"a.b.c",
"A.b.c",
"a0.b_.c",
"a.bca._cda.dca-fb",
"0a.b.c",
"a.b.c.d."
];
let pattern = /^[a-z][\w-]*(?:\.[\w-]+)*$/i;
strings.forEach((s) => {
console.log(s + " ==> " + pattern.test(s));
});
If the match should not start with a digit but can start with an underscore or hyphen you might use:
^[a-z_-][\w-]*(?:\.[\w-]+)*$
When you use JavaScript, put the regex between forward slashes / when pasting it from an online regex tester.
Here is what I've changed in your regex pattern:
added ^ at the beginning of your regex to match the beginning of the input
added $ at the end to match the end of the input
removed A-Z and added the i modifier for case-insensitive search (this is optional).
Also, when you use regex101, make sure to select JavaScript Flavor, when creating/testing your regex for JavaScript.
var pattern = /^([a-z]+.?)((\w+).)*(\w+)$/i;
// list of strings, that should be matched
var shouldMatch = [
'a.b.c',
'a0.b_.c',
'a.bca._cda.dca-fb'
];
// list of strings, that should not be matched
var shouldNotMatch = [
'0a.b.c',
'a.b.c.d.'
];
shouldMatch.forEach(function (string) {
if (string.match(pattern)) {
console.log('matched, as it should: "' + string + '"');
} else {
console.log('should\'ve matched, but it didn\'t: "' + string + '"');
}
});
shouldNotMatch.forEach(function (string) {
if (!string.match(pattern)) {
console.log('didn\'t match, as it should: "' + string + '"');
} else {
console.log('shouldn\'t have matched, but it did: "' + string + '"');
}
});
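As a small illustration of why the string version in the question reports "not match": inside an ordinary string literal "\w" is just "w", so the engine ends up looking for the letter w. The variable names below are made up; the anchored pattern is the alternative suggested earlier in this thread.

```javascript
// Built from a plain string: the backslashes are dropped before the regex engine sees them.
const fromString = new RegExp("([a-zA-Z]+.?)((\w+).)*(\w+)");
console.log(fromString.source);        // ([a-zA-Z]+.?)((w+).)*(w+)
console.log(fromString.test("a.b.c")); // false, because "a.b.c" contains no letter w

// A regex literal keeps \w and \. intact.
const literal = /^[a-z][\w-]*(?:\.[\w-]+)*$/i;
console.log(literal.test("a.b.c"));    // true
console.log(literal.test("a.b.c.d.")); // false

// If a string is required, every backslash has to be doubled.
const escaped = new RegExp("^[a-z][\\w-]*(?:\\.[\\w-]+)*$", "i");
console.log(escaped.source === literal.source); // true
```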
More on regexes in JavaScript
I have some random string, for example: Hello, my name is john.. I want that string split into an array like this: Hello, ,, , my, name, is, john, .,. I tried str.split(/[^\w\s]|_/g), but it does not seem to work. Any ideas?
To split str on any run of non-word characters (i.e. anything other than A-Z, a-z, 0-9 and underscore):
var words=str.split(/\W+/); // assumes str does not begin nor end with whitespace
Or, assuming your target language is English, you can extract all semantically useful values from a string (i.e. "tokenizing" a string) using:
var str='Here\'s a (good, bad, indifferent, ...) '+
'example sentence to be used in this test '+
'of English language "token-extraction".',
punct='\\['+ '\\!'+ '\\"'+ '\\#'+ '\\$'+ // since javascript does not
'\\%'+ '\\&'+ '\\\''+ '\\('+ '\\)'+ // support POSIX character
'\\*'+ '\\+'+ '\\,'+ '\\\\'+ '\\-'+ // classes, we'll need our
'\\.'+ '\\/'+ '\\:'+ '\\;'+ '\\<'+ // own version of [:punct:]
'\\='+ '\\>'+ '\\?'+ '\\@'+ '\\['+
'\\]'+ '\\^'+ '\\_'+ '\\`'+ '\\{'+
'\\|'+ '\\}'+ '\\~'+ '\\]',
re=new RegExp( // tokenizer
'\\s*'+ // discard possible leading whitespace
'('+ // start capture group
'\\.{3}'+ // ellipsis (must appear before punct)
'|'+ // alternator
'\\w+\\-\\w+'+ // hyphenated words (must appear before punct)
'|'+ // alternator
'\\w+\'(?:\\w+)?'+ // compound words (must appear before punct)
'|'+ // alternator
'\\w+'+ // other words
'|'+ // alternator
'['+punct+']'+ // punct
')' // end capture group
);
// grep(ary[,filt]) - filters an array
// note: could use jQuery.grep() instead
// @param {Array} ary array of members to filter
// @param {Function} filt function to test truthiness of member,
// if omitted, "function(member){ if(member) return member; }" is assumed
// @returns {Array} all members of ary where result of filter is truthy
function grep(ary,filt) {
var result=[];
for(var i=0,len=ary.length;i<len;i++) { // visit every index from 0 to len-1
var member=ary[i]||'';
if(filt && (typeof filt === 'function') ? filt(member) : member) { // typeof yields lowercase 'function'
result.push(member);
}
}
return result;
}
var tokens=grep( str.split(re) ); // note: filter function omitted
// since all we need to test
// for is truthiness
which produces:
tokens=[
'Here\'s',
'a',
'(',
'good',
',',
'bad',
',',
'indifferent',
',',
'...',
')',
'example',
'sentence',
'to',
'be',
'used',
'in',
'this',
'test',
'of',
'English',
'language',
'"',
'token-extraction',
'"',
'.'
]
EDIT
Also available as a Github Gist
Try this (I'm not sure if this is what you wanted):
str.replace(/[^\w\s]|_/g, function ($1) { return ' ' + $1 + ' ';}).replace(/[ ]+/g, ' ').split(' ');
http://jsfiddle.net/zNHJW/3/
Try:
str.split(/([_\W])/)
This will split by any non-alphanumeric character (\W) and any underscore. It uses capturing parentheses to include the item that was split by in the final result.
This solution caused a challenge with spaces for me (still needed them), then I gave str.split(/\b/) a shot and all is well. Spaces are output in the array, which won't be hard to ignore, and the ones left after punctuation can be trimmed out.
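To compare the last two suggestions on the sentence from the question, here is a short sketch; the variable names are made up, and the filter step is just one way of dropping the empty and whitespace-only pieces.

```javascript
const sentence = "Hello, my name is john.";

// Splitting on a captured separator keeps the separators, plus an empty string
// wherever two separators are adjacent or the text ends with one.
const withSeparators = sentence.split(/([_\W])/);
console.log(withSeparators);
// [ 'Hello', ',', '', ' ', 'my', ' ', 'name', ' ', 'is', ' ', 'john', '.', '' ]

// Dropping empty and whitespace-only pieces leaves words and punctuation.
const tokens = withSeparators.filter((piece) => piece.trim() !== "");
console.log(tokens);
// [ 'Hello', ',', 'my', 'name', 'is', 'john', '.' ]

// Splitting on word boundaries keeps punctuation and spaces glued together.
const byBoundary = sentence.split(/\b/);
console.log(byBoundary);
// [ 'Hello', ', ', 'my', ' ', 'name', ' ', 'is', ' ', 'john', '.' ]
```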