How to extract text in square brackets from string in JavaScript? - javascript

How can I extract the text between all pairs of square brackets from the string "[a][b][c][d][e]", so that I can get the following results:
→ Array: ["a", "b", "c", "d", "e"]
→ String: "abcde"
I have tried the following Regular Expressions, but to no avail:
→ (?<=\[)(.*?)(?=\])
→ \[(.*?)\]

Research:
After searching Stack Overflow, I found only two solutions, both of which use regular expressions:
→ (?<=\[)(.*?)(?=\]) (1)
(?<=\[) : Positive Lookbehind.
\[ :matches the character [ literally.
(.*?) : matches any character except newline and expands as needed.
(?=\]) : Positive Lookahead.
\] : matches the character ] literally.
→ \[(.*?)\] (2)
\[ : matches the character [ literally.
(.*?) : matches any character except newline and expands as needed.
\] : matches the character ] literally.
Notes:
(1) This pattern throws an error in JavaScript, because the lookbehind operator is not supported (lookbehind was only added to the language in ES2018, so older engines reject it).
Example:
console.log(/(?<=\[)(.*?)(?=\])/.exec("[a][b][c][d][e]"));
Uncaught SyntaxError: Invalid regular expression: /(?<=\[)(.*?)(?=\])/: Invalid group(…)
(2) This pattern returns the text inside only the first pair of square brackets as the second element.
Example:
console.log(/\[(.*?)\]/.exec("[a][b][c][d][e]"));
Returns: ["[a]", "a"]
Solution:
The most precise solution for JavaScript that I have come up with is:
var string, array;
string = "[a][b][c][d][e]";
array = string.split("["); // → ["", "a]", "b]", "c]", "d]", "e]"]
string = array.join(""); // → "a]b]c]d]e]"
array = string.split("]"); // → ["a", "b", "c", "d", "e", ""]
Now, depending upon whether we want the end result to be an array or a string we can do:
array = array.slice(0, array.length - 1) // → ["a", "b", "c", "d", "e"]
/* OR */
string = array.join("") // → "abcde"
One liner:
Finally, here's a handy one-liner for each scenario, for people like me who prefer to achieve the most with the least code, and for the TL;DR crowd.
Array:
var a = "[a][b][c][d][e]".split("[").join("").split("]").slice(0,-1);
/* OR */
var a = "[a][b][c][d][e]".slice(1,-1).split(']['); // Thanks #xorspark
String:
var a = "[a][b][c][d][e]".split("[").join("").split("]").join("");
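For comparison, pattern (2) from the research section can collect every pair once it carries the g flag. The following is only a sketch: the helper name extractBracketed is made up for illustration, and it assumes an engine that supports String.prototype.matchAll (ES2020).

```javascript
// Hypothetical helper (not from the original post): collects the text
// inside every pair of square brackets using the g flag and matchAll.
function extractBracketed(input) {
  // matchAll requires a global regex; index 1 of each match is the capture.
  return Array.from(input.matchAll(/\[(.*?)\]/g), function (m) {
    return m[1];
  });
}

var parts = extractBracketed("[a][b][c][d][e]");
console.log(parts);          // → ["a", "b", "c", "d", "e"]
console.log(parts.join("")); // → "abcde"
```

Unlike the split-based one-liners, this does not assume that the string consists of nothing but bracketed groups.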

I don't know what text you are expecting between the brackets, but for the example you've given:
var arrStr = "[a][b][c][d][e]";
var arr = arrStr.match(/[a-z]/g); // → ["a", "b", "c", "d", "e"] (an Array)
Then you can just use `.join("")` on the resulting array to combine its elements into a string.
If you're expecting multiple characters between the square brackets, the regex can be /[a-z]+/g, or tweaked to your liking.
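A minimal sketch of this approach, assuming the bracketed text contains only lowercase ASCII letters. The function name lettersInBrackets is hypothetical. Array.prototype.join is what turns the matches into a single string, and the `|| []` fallback is there because match() with the g flag returns null when nothing matches.

```javascript
// Hypothetical helper: returns both the array and the joined string.
function lettersInBrackets(input) {
  // With the g flag, match() returns all matches, or null if there are none.
  var found = input.match(/[a-z]+/g) || [];
  return { array: found, string: found.join("") };
}

var result = lettersInBrackets("[a][bc][d]");
console.log(result.array);  // → ["a", "bc", "d"]
console.log(result.string); // → "abcd"
```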

I think this approach will be interesting to you.
var arr = [];
var str = '';
var input = "[a][b][c][d][e]";
input.replace(/\[(.*?)\]/g, function(match, pattern){
arr.push(pattern);
str += pattern;
return match;//just in case ;)
});
console.log('Arr:', arr);
console.log('String:', str);
//And trivial solution if you need only string
var a = input.replace(/\[|\]/g, '');
console.log('Trivial:',a);

Related

How to split a string on pattern of one or more repeating character and retain match?

For example, get a string abaacaaa, a character a, split the string to get ['ab', 'aac', 'aaa'].
string = 'abaacaaa'
string.split('a') // 1. ["", "b", "", "c", "", "", ""]
string.split(/(?=a)/) // 2. ["ab", "a", "ac", "a", "a", "a"]
string.split(/(?=a+)/) // 3. ["ab", "a", "ac", "a", "a", "a"]
string.split(/*???*/) // 4. ['ab', 'aac', 'aaa']
Why does the 3rd expression output the same value as the 2nd, even though there is a + after a, and what should go into the 4th?
Edit:
string.match(/a+[^a]*/g) doesn't work properly for babaacaaa.
string = 'babaacaaa' // should be split into ['b', 'ab', 'aac', 'aaa']
string.match(/a+[^a]*/g) // ["ab", "aac", "aaa"]
Solutions 2 and 3 give the same result because unanchored lookaheads are tested at each position in the input string. (?=a) is tested at the start of abaacaaa and matches, but the leading empty result is discarded. Next, it is tried after the first a; there is no match, since the char to the right is b, so the regex engine moves on to the next position. Next, it matches after b, and ab is added to the result. Then it matches at the position after the next a, adds a to the resulting array, and goes on to the next position to find a match. And so on. With (?=a+) the process is identical: it looks ahead for one or more a characters, but still tests each position.
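To make the position-by-position testing visible, here is a rough sketch that lists every index at which a lookahead succeeds. The helper lookaheadPositions is invented for this illustration and relies on the sticky (y) flag from ES2015 to force each attempt to start at an exact index.

```javascript
// Hypothetical helper: returns every index where the given zero-width
// pattern matches when the attempt starts exactly at that index.
function lookaheadPositions(input, lookahead) {
  var positions = [];
  for (var i = 0; i < input.length; i++) {
    // The sticky flag anchors the match attempt at lastIndex.
    var re = new RegExp(lookahead.source, "y");
    re.lastIndex = i;
    if (re.test(input)) positions.push(i);
  }
  return positions;
}

console.log(lookaheadPositions("abaacaaa", /(?=a)/));  // → [0, 2, 3, 5, 6, 7]
console.log(lookaheadPositions("abaacaaa", /(?=a+)/)); // → [0, 2, 3, 5, 6, 7]
```

Both patterns succeed at exactly the same indices, which is why split cuts the string in the same places for each of them.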
To split babaacaaa, you need
var s = 'babaacaaa';
console.log(
s.split(/(a+[^a]*)/).filter(Boolean)
);
The a+[^a]* matches
a+ - 1 or more a
[^a]* - 0 or more chars other than a
The capturing group allows adding matched substrings to the resulting split array, and .filter(Boolean) will discard empty matches in between adjoining matches.
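The effect of the capturing group and of .filter(Boolean) is easier to see when the intermediate array is printed. This is just the same expression split into two steps; the variable name raw is an assumption for illustration.

```javascript
var s = "babaacaaa";

// Step 1: split with a capturing group keeps the separators in the result,
// along with the (possibly empty) pieces between them.
var raw = s.split(/(a+[^a]*)/);
console.log(raw); // → ["b", "ab", "", "aac", "", "aaa", ""]

// Step 2: Boolean("") is false, so filter(Boolean) drops the empty strings.
var cleaned = raw.filter(Boolean);
console.log(cleaned); // → ["b", "ab", "aac", "aaa"]
```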
let string = 'abaacaaa'
let result = string.match(/a*([^a]+|a)/g)
console.log(result)
string = 'babaacaaa'
result = string.match(/a*([^a]+|a)/g)
console.log(result)
string.match(/^[^a]+|a+[^a]*/g) seems to work as expected.

Separating words with Regex

I am trying to get this result: 'Summer-is-here'. Why does the code below generate extra spaces? (Current result: '-Summer--Is- -Here-').
function spinalCase(str) {
var newA = str.split(/([A-Z][a-z]*)/).join("-");
return newA;
}
spinalCase("SummerIs Here");
You are using a variety of split where the regexp contains a capturing group (inside parentheses), which has a specific meaning, namely to include all the splitting strings in the result. So your result becomes:
["", "Summer", "", "Is", " ", "Here", ""]
Joining that with - gives you the result you see. But you can't just remove the unnecessary capture group from the regexp, because then the split would give you
["", "", " ", ""]
because each capitalized word is itself the separator and is therefore dropped from the result. So this doesn't really work.
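The two split results described above can be reproduced side by side. The variable names below are assumptions chosen for the illustration.

```javascript
var input = "SummerIs Here";

// With a capturing group, the matched words are kept in the result.
var withGroup = input.split(/([A-Z][a-z]*)/);
console.log(withGroup);           // → ["", "Summer", "", "Is", " ", "Here", ""]
console.log(withGroup.join("-")); // → "-Summer--Is- -Here-"

// Without the group, the matched words are consumed as separators.
var withoutGroup = input.split(/[A-Z][a-z]*/);
console.log(withoutGroup);        // → ["", "", " ", ""]
```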
If you want to use split, try splitting on zero-width or space-only matches, looking ahead to an uppercase letter:
> "SummerIs Here".split(/\s*(?=[A-Z])/)
^^^^^^^^^ LOOK-AHEAD
< ["Summer", "Is", "Here"]
Now you can join that to get the result you want, but without the lowercase mapping, which you could do with:
"SummerIs Here" .
split(/\s*(?=[A-Z])/) .
map(function(elt, i) { return i ? elt.toLowerCase() : elt; }) .
join('-')
which gives you what you want.
Using replace as suggested in another answer is also a perfectly viable solution. In terms of best practices, consider the following code from Ember:
var DECAMELIZE_REGEXP = /([a-z\d])([A-Z])/g;
var DASHERIZE_REGEXP = /[ _]/g;
function decamelize(str) {
return str.replace(DECAMELIZE_REGEXP, '$1_$2').toLowerCase();
}
function dasherize(str) {
return decamelize(str).replace(DASHERIZE_REGEXP, '-');
}
First, decamelize puts an underscore _ in between two-character sequences of lower-case letter (or digit) and upper-case letter. Then, dasherize replaces the underscore with a dash. This works perfectly except that it lower-cases the first word in the string. You can sort of combine decamelize and dasherize here with
var SPINALIZE_REGEXP = /([a-z\d])\s*([A-Z])/g;
function spinalCase(str) {
return str.replace(SPINALIZE_REGEXP, '$1-$2').toLowerCase();
}
You want to separate capitalized words, but you are splitting the string on the capitalized words themselves, which is why you get those empty strings and spaces.
I think you are looking for this:
var newA = str.match(/[A-Z][a-z]*/g).join("-");
([A-Z][a-z]*) *(?!$|[a-z])
You can simply do a replace with $1-. See demo:
https://regex101.com/r/nL7aZ2/1
var re = /([A-Z][a-z]*) *(?!$|[a-z])/g;
var str = 'SummerIs Here';
var subst = '$1-';
var result = str.replace(re, subst);
var newA = str.split(/ |(?=[A-Z])/).join("-");
You can change the regex like:
/ |(?=[A-Z])/ or /\s*(?=[A-Z])/
Result:
Summer-Is-Here

js split() using regex, what expression was matched

When using a regex as the separator in the split(), is there a way to know what string it matched?
Example:
var string = "12+34-12",
    numberlist = string.split(/[^0-9]/);
How would I know if it found a + or a -?
You can use a capturing group to also keep the separator that String#split matched:
var m = string.split(/(\D)/);
//=> ["12", "+", "34", "-", "12"]
To see the difference, here is the output without the capturing group:
var m = string.split(/\D/);
//=> ["12", "34", "12"]
PS: I have changed your use of [^0-9] to \D since they are equivalent.
Just capture the splitting regular expression, like
numberlist = string.split(/([^0-9])/);
and the output will be
[ '12', '+', '34', '-', '12' ]
Since you are capturing the splitting regular expression, it will also be a part of the resulting array.
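Because a single capturing group makes the result alternate between pieces and separators, the two can be pulled apart by index parity. A minimal sketch, assuming exactly one capturing group in the pattern; the helper name splitWithSeparators is hypothetical.

```javascript
// Hypothetical helper: separates the split pieces from the matched separators.
function splitWithSeparators(expr) {
  var parts = expr.split(/(\D)/);
  var numbers = [];
  var separators = [];
  parts.forEach(function (part, i) {
    // Even indices are the pieces, odd indices are the captured separators.
    (i % 2 === 0 ? numbers : separators).push(part);
  });
  return { numbers: numbers, separators: separators };
}

var parsed = splitWithSeparators("12+34-12");
console.log(parsed.numbers);    // → ["12", "34", "12"]
console.log(parsed.separators); // → ["+", "-"]
```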

Tokenizing strings using regular expression in Javascript

Suppose I've a long string containing newlines and tabs as:
var x = "This is a long string.\n\t This is another one on next line.";
So how can we split this string into tokens, using regular expression?
I don't want to use .split(' ') because I want to learn Javascript's Regex.
A more complicated string could be this:
var y = "This #is a #long $string. Alright, lets split this.";
Now I want to extract only the valid words from this string, without special characters and punctuation, i.e. I want these:
var xwords = ["This", "is", "a", "long", "string", "This", "is", "another", "one", "on", "next", "line"];
var ywords = ["This", "is", "a", "long", "string", "Alright", "lets", "split", "this"];
Here is a jsfiddle example of what you asked: http://jsfiddle.net/ayezutov/BjXw5/1/
Basically, the code is very simple:
var y = "This #is a #long $string. Alright, lets split this.";
var regex = /[^\s]+/g; // one or more non-whitespace characters; the g flag finds every occurrence in the string
var match = y.match(regex);
for (var i = 0; i<match.length; i++)
{
document.write(match[i]);
document.write('<br>');
}
UPDATE:
Basically you can expand the list of separator characters: http://jsfiddle.net/ayezutov/BjXw5/2/
var regex = /[^\s\.,!?]+/g;
UPDATE 2:
Only word characters (letters, digits, and underscore) all the time:
http://jsfiddle.net/ayezutov/BjXw5/3/
var regex = /\w+/g;
Use \s+ to tokenize the string.
exec can loop through the matches to remove non-word (\W) characters.
var A= [], str= "This #is a #long $string. Alright, let's split this.",
rx=/\W*([a-zA-Z][a-zA-Z']*)(\W+|$)/g, words;
while((words= rx.exec(str))!= null){
A.push(words[1]);
}
A.join(', ')
/* returned value: (String)
This, is, a, long, string, Alright, let's, split, this
*/
var words = y.split(/[^A-Za-z0-9]+/);
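One caveat with the split-based version: when the string starts or ends with a separator, such as the final period in y, split leaves an empty string at that end of the array. A small sketch that filters those out; the function name tokenize is an assumption for illustration.

```javascript
// Hypothetical helper: splits on runs of non-alphanumeric characters and
// drops the empty strings that appear at the edges of the result.
function tokenize(text) {
  return text.split(/[^A-Za-z0-9]+/).filter(function (token) {
    return token.length > 0;
  });
}

var y = "This #is a #long $string. Alright, lets split this.";
console.log(y.split(/[^A-Za-z0-9]+/).pop() === ""); // → true (trailing empty token)
console.log(tokenize(y));
// → ["This", "is", "a", "long", "string", "Alright", "lets", "split", "this"]
```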
Here is a solution using regex groups to tokenise the text using different types of tokens.
You can test the code here https://jsfiddle.net/u3mvca6q/5/
/*
Basic Regex explanation:
/ Regex start
(\w+) First group, words: \w means an ASCII word character (letter, digit, or underscore), + means 1 or more of them
| or
(,|!) Second group, punctuation
| or
(\s) Third group, white spaces
/ Regex end
g "global", enables looping over the string to capture one element at a time
Regex result:
result[0] : default group : any match
result[1] : group1 : words
result[2] : group2 : punctuation , !
result[3] : group3 : whitespace
*/
var basicRegex = /(\w+)|(,|!)|(\s)/g;
/*
Advanced Regex explanation:
[a-zA-Z\u0080-\u00FF] instead of \w Supports some Unicode letters instead of ASCII letters only. Find Unicode ranges here https://apps.timwhitlock.info/js/regex
(\.\.\.|\.|,|!|\?) Identify ellipsis (...) and points as separate entities
You can improve it by adding ranges for special punctuation and so on
*/
var advancedRegex = /([a-zA-Z\u0080-\u00FF]+)|(\.\.\.|\.|,|!|\?)|(\s)/g;
var basicString = "Hello, this is a random message!";
var advancedString = "Et en français ? Avec des caractères spéciaux ... With one point at the end.";
console.log("------------------");
var result = null;
do {
result = basicRegex.exec(basicString)
console.log(result);
} while(result != null)
console.log("------------------");
var result = null;
do {
result = advancedRegex.exec(advancedString)
console.log(result);
} while(result != null)
/*
Output:
Array [ "Hello", "Hello", undefined, undefined ]
Array [ ",", undefined, ",", undefined ]
Array [ " ", undefined, undefined, " " ]
Array [ "this", "this", undefined, undefined ]
Array [ " ", undefined, undefined, " " ]
Array [ "is", "is", undefined, undefined ]
Array [ " ", undefined, undefined, " " ]
Array [ "a", "a", undefined, undefined ]
Array [ " ", undefined, undefined, " " ]
Array [ "random", "random", undefined, undefined ]
Array [ " ", undefined, undefined, " " ]
Array [ "message", "message", undefined, undefined ]
Array [ "!", undefined, "!", undefined ]
null
*/
In order to extract word-only characters, we use the \w symbol. Whether this matches Unicode characters is implementation-dependent; in JavaScript, \w matches only ASCII letters, digits, and the underscore.
Please see Alexander Yezutov's answer (update 2) on how to apply this into an expression.
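Since \w in JavaScript covers only [A-Za-z0-9_], accented letters split a word in two. One possible alternative, assuming an engine with Unicode property escapes (ES2018), is \p{L} with the u flag. The function name unicodeWords is hypothetical, and the sketch assumes the accented characters are precomposed.

```javascript
// Hypothetical helper: extracts runs of Unicode letters.
function unicodeWords(text) {
  // \p{L} matches any Unicode letter; the u flag is required for \p{...}.
  return text.match(/\p{L}+/gu) || [];
}

// \u00e7 is "ç" and \u00e8 is "è", written as escapes to keep them precomposed.
var sample = "Et en fran\u00e7ais ? Avec des caract\u00e8res.";
console.log(sample.match(/\w+/g).length); // → 8 (the accented words are broken apart)
console.log(unicodeWords(sample).length); // → 6
```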

Javascript Match and RegExp Issue -- Strange Behavior

I have been trying to use a simple jQuery operation to dynamically match and store all anchor tags and their texts on the page. But I have found a weird behavior. When you are using match() or exec(), if you designate the needle as a separate RegExp object or a pattern variable, then your query matches only one instance among dozens in the haystack.
And if you designate the pattern like this
match(/needle/gi)
then it matches every instance of the needle.
Here is my code.
You can even fire up Firebug and try this code right here on this page.
var a = {'text':'','parent':[]};
$("a").each(function(i,n) {
var module = $.trim($(n).text());
a.text += module.toLowerCase() + ',' + i + ',';
a.parent.push($(n).parent().parent());
});
var stringLowerCase = 'b';
var regex = new RegExp(stringLowerCase, "gi");
//console.log(a.text);
console.log("regex 1: ", regex.exec(a.text));
var regex2 = "/" + stringLowerCase + "/";
console.log("regex 2: ", a.text.match(regex2));
console.log("regex 3: ", a.text.match(/b/gi));
For me it is returning:
regex 1: ["b"]
regex 2: null
regex 3: ["b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b"]
Can anyone explain the root of this behavior?
EDIT: I forgot to mention that for regex1, it doesn't make any difference whether you add the flags "gi" for global and case insensitive matching. It still returns only one match.
EDIT2: Solved my own problem. I still don't know why regex1 matches only one instance, but I managed to match all instances using match() with regex1.
So, this matches all instances, and dynamically!
var regex = new RegExp(stringLowerCase, "gi");
console.log("regex 2: ", a.text.match(regex));
This is not unusual behaviour at all. In regex 1 you are only checking for one instance, whereas in regex 3 you have told match() to return all instances by using the g flag in /gi.
In regex 2 you are assuming that "/b/" is the same as /b/, but it is not: "/b/" !== /b/. "/b/" is a string, so the search looks for the literal text "/b/" and only succeeds if your string contains it. /b/ is a regex literal whose pattern is what lies between the slashes, so searching "abc" with it returns "b".
I hope that helps.
EDIT:
Looking into it a little bit more, the exec method returns the first match that it finds rather than all the matches.
EDIT:
var myRe = /ab*/g;
var str = "abbcdefabh";
var myArray;
while ((myArray = myRe.exec(str)) != null)
{
var msg = "Found " + myArray[0] + ". ";
msg += "Next match starts at " + myRe.lastIndex;
console.log(msg);
}
Having a look at it again, it definitely does return only the first instance that it finds. If you loop over it, then it returns more.
Why does it do this? I have no idea... my JavaScript kung fu clearly isn't strong enough to answer that part.
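As for the why: a RegExp created with the g flag keeps its position in its lastIndex property, and each exec() call resumes from there and returns a single match, or null once the input is exhausted. The following is a sketch under that understanding; the helper name allMatches and the sample text are made up for illustration.

```javascript
// Hypothetical helper: collects every match by calling exec() repeatedly.
function allMatches(pattern, flags, text) {
  // Make sure the g flag is present so lastIndex advances between calls.
  var re = new RegExp(pattern, flags.indexOf("g") === -1 ? flags + "g" : flags);
  var found = [];
  var m;
  while ((m = re.exec(text)) !== null) {
    found.push({ match: m[0], index: m.index });
    // Guard against zero-length matches, which would otherwise never advance.
    if (m[0] === "") re.lastIndex++;
  }
  return found;
}

var hits = allMatches("b", "gi", "abc,0,Bob,1,");
console.log(hits.map(function (h) { return h.match + "@" + h.index; }));
// → ["b@1", "B@6", "b@8"]
```

A single exec() call therefore corresponds to one iteration of this loop, which is why regex 1 in the question logs only one match.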
The reason regex 2 is returning null is that you're passing "/b/" as the pattern parameter, while "b" is the only thing that is actually part of the pattern. The slashes are shorthand for a regex, just as [ ] is for an array. So if you were to replace that with just new RegExp("b"), you'd get one match, but only one, since you're omitting the "global+ignorecase" flags in that example. To get the same results for #2 and #3, modify accordingly:
var regex2 = stringLowerCase;
console.log("regex 2: ", a.text.match(new RegExp(regex2, "gi"))); // match() takes no flags argument, so build the RegExp first
console.log("regex 3: ", a.text.match(/b/gi));
regex2 is a string, not a RegExp. I had trouble too using this kind of syntax, though I'm not really sure of the behavior.
Edit: Remembered: for regex2, JS looks for "/b/" as the needle, not "b".
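When match() receives a string, it converts it with new RegExp(string), so the slashes become ordinary characters in the pattern. A short sketch with made-up sample text shows the three cases from the question:

```javascript
var text = "abc,0,bob,1,";

// The pattern is the four-character-free literal text "/b/", which is absent.
console.log(text.match("/b/")); // → null

// A bare "b" becomes /b/ (no flags), so only the first match is returned.
var first = text.match("b");
console.log(first[0], first.index); // → "b" 1

// Building the RegExp explicitly lets the flags be supplied.
var all = text.match(new RegExp("b", "gi"));
console.log(all); // → ["b", "b", "b"]
```

If the dynamic string may contain regex metacharacters, it would need escaping before being passed to new RegExp; that step is omitted here.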
