JavaScript - Regex split string to array allowing for apostrophes - javascript

I have some Express middleware which handles a string - a sentence entered by a user through a text field - and does some analysis on it. For this I need both the words and the punctuation broken into an array.
An example string is:
"It's familiar. Not much has really changed, which is surprising, but
it's nice to come back to where I was as a kid."
As part of the process I replace new lines with <br /> and split the string into an array:
res.locals.storyArray =
res.locals.story.storyText.replace(/(?:\r\n|\r|\n)/g, ' <br/>' ).split(" ");
This works to a certain degree, but when a sentence contains an apostrophe, e.g. "It's familiar.", things get thrown out of sync and I get an array like the following (note that there is detail I'm not showing here regarding how each word gets mapped to its grammar type):
[ [ '"', 'quote' ],
['It', 'Personal pronoun' ], <--these items are the issue
[ '\'', 'quote' ], < --------these items are the issue
[ 's', 'Personal pronoun'], <------these items are the issue
[ 'familiar', 'Adjective' ],
[ '.', 'Sent-final punct' ],
[ 'Not', 'Adverb' ],
[ 'much', 'Adjective' ],
[ 'has', 'Verb, present' ],
[ 'really', 'Adverb' ],
[ 'changed', 'verb, past part' ],
[ ',', 'Comma' ],
[ 'which', 'Wh-determiner' ],
[ 'is', 'Verb, present' ]]
I'm actually surprised that the commas and full stops seem to be split correctly seeing I am only splitting on white space but I'm trying to get my array to be:
[ [ '"', 'quote' ],
[ 'It\'s', 'Personal pronoun' ],
[ 'familiar', 'Adjective' ],
[ '.', 'Sent-final punct' ],
.....
]

You could use String.raw to make sure the string remains intact, including its punctuation.
The only issue I had was in keeping the "." punctuation marks. For that I added a new replace function before splitting .replace(/\./g, " .") - this was done for all commas as well.
let strArray = myStr.replace(/\./g, " .")
.replace(/\,/g, " ,")
.replace(/\"/g, String.raw` " `)
.split(/\s/g)
.filter(_=>_);
let myStr = String.raw `"It's familiar. Not much has really changed, which is surprising, but
it's nice to come back to where I was as a kid."`;
let strArray = myStr.replace(/\./g, " .")
.replace(/\,/g, " ,")
.replace(/\"/g, String.raw` " `)
.split(/\s/g)
.filter(_=>_);
let HTML = myStr.replace(/(?:\r\n|\r|\n)/g, " <br/>");
console.log(myStr);
console.log(strArray);
EDIT: Added replace for comma separation as well.
I'm not sure what you expect to be done about the <br/> - it seems silly to insert them while trying to turn your string into an array. In the code I've separated the process. You now have a string that spits out with <br/> tags and another variable that contains the array.
If this doesn't solve your issue, or if you have any supplemental information, I'd be happy to help.
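A different way to get the array the question asks for is to match tokens instead of splitting on whitespace. The following is a minimal sketch, not the answer's method: the function name tokenize and the token pattern are assumptions for illustration, and only the straight apostrophe (') is treated as part of a word.

```javascript
// Hypothetical helper: returns words (with internal apostrophes kept),
// <br/> markers, and single punctuation characters as separate tokens.
function tokenize(text) {
  // Turn line breaks into a standalone <br/> token, as in the question.
  const withBreaks = text.replace(/\r\n|\r|\n/g, " <br/> ");
  // Alternatives, tried in order:
  //   <br\/>          the inserted line-break marker
  //   \w+(?:'\w+)*    a word, optionally containing apostrophes (It's)
  //   [^\w\s]         any single character that is neither word nor space
  const tokenPattern = /<br\/>|\w+(?:'\w+)*|[^\w\s]/g;
  return withBreaks.match(tokenPattern) || [];
}

console.log(tokenize('"It\'s familiar.\nNot much."'));
// [ '"', "It's", 'familiar', '.', '<br/>', 'Not', 'much', '.', '"' ]
```

Because the pattern describes what a token is, no extra replace calls are needed to pad full stops, commas, or quotes with spaces.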

Related

How to extract content in between an opening and a closing bracket?

I am trying to split a string into an array of text contents, each of which sits between the [# and ] delimiters. Only characters in between [# and ] are allowed to match. Being provided with a string like ...
const stringA = '[#Mary James], [#Jennifer John] and [#Johnny Lever[#Patricia Robert] are present in the meeting and [#Jerry[#Jeffery Roger] is absent.'
... the following result is expected ...
[
'Mary James',
'Jennifer John',
'Patricia Robert',
'Jeffery Roger'
]
Any logic which leads to the expected outcome can be used.
A self search for a solution brought up the following applied regex ...
stringA.match(/(?<=\[#)[^\]]*(?=\])/g);
But the result doesn't fulfill the requirements because the array features the following items ...
[
'Mary James',
'Jennifer John',
'Johnny Lever[#Patricia Robert',
'Jerry[#Jeffery Roger'
]
The OP's regex does not exclude the opening bracket in its negated character class. Changing the OP's /(?<=\[#)[^\]]*(?=\])/g to /(?<=\[#)[^\[\]]*(?=\])/g already solves the problem in most environments; the exception is Safari versions that do not support lookbehind.
Solution based on a regex ... /\[#(?<content>[^\[\]]+)\]/g ... with a named capture group ...
const sampleText = '[#Mary James], [#Jennifer John] and [#Johnny Lever[#Patricia Robert] are present in the meeting and [#Jerry[#Jeffery Roger] is absent.'
// see ... [https://regex101.com/r/v234aT/1]
const regXCapture = /\[#(?<content>[^\[\]]+)\]/g;
console.log(
Array.from(
sampleText.matchAll(regXCapture)
).map(
({ groups: { content } }) => content
)
);
Close, just missing the exclusion of [:
stringA.match(/(?<=\[#)[^\[\]]*(?=\])/g);
// ^^ exclude '[' as well as ']'

pug - array output via each without commas

I have the following Pug object and iterate over it with each. The problem is that the values are printed separated by commas; I want them without commas.
I could write the object inline in the each, like each x, y in {'value1': 'value2', ...}, but that isn't convenient.
The current code:
-
  var starWars = {
    "people": [
      "Yoda",
      "Obi-Wan",
      "Anakin"
    ],
    "rank": [
      "master",
      "master",
      "knight"
    ]
  }
each person, rank in {starWars}
  p= person.people
  p= person.rank
Output:
Yoda,Obi-Wan,Anakin
master,master,knight
The = character after the tag p is for buffered code. Any JavaScript expression is valid input and will be converted to a string before being printed.
So when you put in an array, it is converted to the string representation of that array which is to separate each element with a comma.
Add a .join(" ") after each array to convert them to a string yourself and delimit them by space rather than comma:
each person, rank in {starWars}
  p= person.people.join(" ")
  p= person.rank.join(" ")
Output with my changes:
Yoda Obi-Wan Anakin
master master knight
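The difference between the two outputs can be reproduced in plain JavaScript. This is only a sketch of the string conversion described above, assuming String() behaves like the conversion applied to buffered code; the variable names are made up for illustration.

```javascript
const starWars = {
  people: ["Yoda", "Obi-Wan", "Anakin"],
  rank: ["master", "master", "knight"],
};

// Implicit conversion of an array to a string joins the elements with ",".
const implicitText = String(starWars.people);

// join() lets you pick the delimiter yourself.
const joinedText = starWars.people.join(" ");

console.log(implicitText); // Yoda,Obi-Wan,Anakin
console.log(joinedText);   // Yoda Obi-Wan Anakin
```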

NODEJS: extracting strings between two DIFFERENT characters and storing them in an array

Using nodejs, I need to extract ALL strings between two characters that are DIFFERENT, and store them in an array for future use.
For instance, consider a file with the following content.
"type":"multi",
"folders": [
"cities/",
"users/"
]
I need to extract the words: cities and users, and place them in an array. In general, I want the words between " and /"
As Bergi mentions in a comment, this looks suspiciously similar to JSON (JavaScript Object Notation), so I'll write my answer assuming that it is. For your current example to be valid JSON, it needs to be wrapped in braces like this:
{
"type": "multi",
"folders": [
"cities/",
"users/"
]
}
If you parse this:
var parsed_json = JSON.parse( json_string );
// You could add the brackets yourself if they are missing:
var parsed_json = JSON.parse('{' + json_string + '}');
Then all you have to do to get to the array:
var arr = parsed_json.folders;
console.log(arr);
And to fix the annoying trailing slashes we remap the array:
// .map calls a function for every item in an array
// And whatever you choose to return becomes the new array
arr = arr.map(function(item){
  // slice returns part of a string. Here from the start (0) up to, but not including, the last character (the slash).
  return item.slice( 0, -1 );
  // Another option is to remove every slash instead (use this line in place of the return above):
  // return item.replace( /\//g , '' );
});
Now the trailing slashes are gone:
console.log( arr );
This should work.
"(.+?)\/"
" preceding
1 or more character (non-greedy)
followed by /"
REGEX101
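A minimal sketch of how that pattern could be applied with matchAll follows. The sample text is the file content from the question, and the function name extractFolderNames is an assumption for illustration.

```javascript
// The file content from the question, one entry per line.
const fileContent = [
  '"type":"multi",',
  '"folders": [',
  '"cities/",',
  '"users/"',
  ']',
].join("\n");

// Hypothetical helper: collects capture group 1 of every match.
function extractFolderNames(text) {
  // "  then one or more characters (lazy), then /"
  const pattern = /"(.+?)\/"/g;
  return Array.from(text.matchAll(pattern), (match) => match[1]);
}

console.log(extractFolderNames(fileContent)); // [ 'cities', 'users' ]
```

This relies on . not matching line breaks. If the whole content sat on a single line, the lazy .+? could start at an earlier quote and run across several quoted strings; a character class such as "([^"\/]+)\/" would be one stricter alternative.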

JS RegExp not working with alphabetical chars

As part of a custom WYSIWYG editor, we've been asked to implement automatic emoticon parsing if enabled. To do this, we use Regular Expressions to replace character combinations with their associated PNG files.
Here is the relevant part of the code which handles this (it's triggered by an onkeyup event on a contenteditable element; I've trimmed it back to the relevant parts):
// Parse emoji:
this.parseEmoji = function()
{
if( ! this.settings.parseSmileys )
{
return;
}
var _self = this,
url = 'http://cdn.jsdelivr.net/emojione/assets/png/',
$html = this.$editor.html();
// Loop through:
for( var i in _self.emoji )
{
var re = new RegExp( '\\B' + _self.regexpEscape(i) + '\\B', 'g' ),
em = _self.emoji[i];
if( re.test($html) )
{
var replace = '<img class="lw-emoji" height="16" src="'+(url + em[0] + '.png')+'" alt="'+em[1]+'" />';
this.insertAtCaret( replace );
_self.$editor.html(function() { return $(this).html().replace(re, ''); });
}
}
};
And here is the regexpEscape() function:
// Escape a string so that it's RegExp safe!
this.regexpEscape = function( txt )
{
return txt.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
};
We define all of the emoticons used in the system inside of an object which is referenced by the char combination itself as follows:
this.emoji = {
':)' : [ '1F642', 'Smiling face' ],
':-)' : [ '1F642', 'Smiling face' ],
':D' : [ '1F601', 'Happy face' ],
':-D' : [ '1F601', 'Happy face' ],
':\'(': [ '1F622', 'Crying face' ],
':(' : [ '1F614', 'Sad face' ],
':-(' : [ '1F614', 'Sad face' ],
':P' : [ '1F61B', 'Cheeky' ],
':-P' : [ '1F61B', 'Cheeky' ],
':/' : [ '1F615', 'Unsure face' ],
':-/' : [ '1F615', 'Unsure face' ],
'B)' : [ '1F60E', 'Too cool face' ],
'B-)' : [ '1F60E', 'Too cool face' ]
};
Now, the odd thing is that any of the character combinations which contain an alphabetical character do not get replaced, and fail the re.test() function. For example: :), :-), :( and :'( all get replaced without issue. However, :D and B) do not.
Can anyone explain why the alpha chars are causing issues inside of the RegExp?
Pared-back jsFiddle Demo
The problem is that \B is context-dependent. If the pattern starts with a word character, \B requires a word character immediately before it in the input string; if the pattern starts with a non-word character, \B requires a non-word character (or the start of the string) before it. The same applies at the end of the pattern: the trailing \B requires the next character to be of the same type as the pattern's last character, with the end of the string counting as a non-word character.
To avoid that issue, a lookaround-based solution is usually used: (?<!\w)YOUR_PATTERN(?!\w). However, lookbehind was not supported in JavaScript when this answer was written (it was added in ES2018). It can be worked around with a capturing group and a backreference in the replacement string.
So, to replace those cases correctly, you need to change that part of code to
var re = new RegExp( '(^|\\W)' + _self.regexpEscape(i) + '(?!\\w)' ),
em = _self.emoji[i]; // match the pattern when not preceded and not followed by a word character
if( re.test($html) )
{
var replace = '<img class="lw-emoji" height="16" src="'+(url + em[0] + '.png')+'" alt="'+em[1]+'" />';
this.insertAtCaret( replace );
_self.$editor.html(function() { return $(this).html().replace(re, '$1'); }); // restore the matched symbol (the one \W matched) with $1
}
Here is the updated fiddle.
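The behaviour described above can be checked outside the editor. The sketch below reuses the escaping function from the question; the helper names matchesWithNonBoundary and stripEmoticon are made up for illustration and are not part of the original code.

```javascript
// Escape helper copied from the question.
function regexpEscape(txt) {
  return txt.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
}

// Original approach: \B on both sides of the emoticon.
function matchesWithNonBoundary(emoticon, text) {
  return new RegExp("\\B" + regexpEscape(emoticon) + "\\B").test(text);
}

// The answer's approach: a capturing group before the emoticon and a
// negative lookahead after it; $1 puts the captured character back.
function stripEmoticon(emoticon, text) {
  const re = new RegExp("(^|\\W)" + regexpEscape(emoticon) + "(?!\\w)");
  return text.replace(re, "$1");
}

console.log(matchesWithNonBoundary(":)", "hello :)")); // true
console.log(matchesWithNonBoundary(":D", "hello :D")); // false: D then end of string is a word boundary
console.log(matchesWithNonBoundary("B)", "hello B)")); // false: space then B is a word boundary

console.log(stripEmoticon(":D", "hello :D"));   // "hello "
console.log(stripEmoticon("B)", "hello B)"));   // "hello "
console.log(stripEmoticon(":D", "hello :Day")); // "hello :Day" (left alone)
```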

Javascript, Regex - I need to grab each section of a string contained in brackets

Here's what I need in what I guess must be the right order:
The contents of each section of the string contained in square brackets (which each must follow after the rest of the original string) need to be extracted out and stored, and the original string returned without them.
If there is a recognized string followed by a colon at the start of a given extracted section, then I need that identified and removed.
For what's left (comma delimited), I need it dumped into an array.
Do not attempt to parse nested brackets.
What is a good way to do this?
Edit: Here's an example of a string:
hi, i'm a string [this: is, how] [it: works, but, there] [might be bracket, parts, without, colons ] [[nested sections should be ignored?]]
Edit: Here's what might be the results:
After extraction: 'hi, i'm a string'
Array identified as 'this': ['is', 'how']
Array identified as 'it': ['works', 'but', 'there']
Array identified without a label: ['might be bracket', 'parts', 'without', 'colons']
Array identified without a label: []
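var results = [];
// s holds the input string from the question
s = s.replace(/\[+(?:(\w+):)?(.*?)\]+/g,
  function(g0, g1, g2){
    // g1 is undefined when a section has no "label:" prefix, so fall back to ""
    results.push([g1 || "", g2.split(',')]);
    return "";
  });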
Gives the results:
>> results =
[["this", [" is", " how"]],
["it", [" works", " but", " there"]],
["", ["might be bracket", " parts", " without", " colons "]],
["", ["nested sections should be ignored?"]]
]
>> s = "hi, i'm a string "
Note it leaves spaces between tokens. Also, you can remove [[]] tokens in an earlier stage by calling s = s.replace(/\[\[.*?\]\]/g, ''); - this code captures them as a normal group.
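Both notes can be folded into one function. This is a sketch under assumptions, not the answer's code: the name extractSections and the { label, items } result shape are invented for illustration, and [[nested]] sections are dropped entirely rather than reported as empty arrays.

```javascript
// Hypothetical helper: returns the remaining text plus the extracted sections.
function extractSections(input) {
  const sections = [];
  const text = input
    // Remove [[nested]] sections first so they are never parsed.
    .replace(/\[\[.*?\]\]/g, "")
    // Then pull out each [label: a, b, c] or [a, b, c] section.
    .replace(/\[(?:(\w+):)?(.*?)\]/g, (match, label, body) => {
      sections.push({
        label: label || "", // no "label:" prefix gives an empty label
        items: body.split(",").map((item) => item.trim()),
      });
      return "";
    })
    .trim();
  return { text, sections };
}

const sample =
  "hi, i'm a string [this: is, how] [it: works, but, there] " +
  "[might be bracket, parts, without, colons ] [[nested sections should be ignored?]]";

console.log(extractSections(sample));
```

Trimming each item after the split removes the leading and trailing spaces that a plain split(',') leaves behind.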
