String Split With Unicode

String Split With Unicode - javascript

First off I been searching the web for this solution.
How to:
<''.split('');
> ['','','']
Simply express of what I'll like to do. But also with other Unicode characters like poo.

As explained in JavaScript has a Unicode problem, in ES6 you can do this quite easily by using the new ... spread operator. This causes the string iterator (another new ES6 feature) to be used internally, and because that iterator is designed to deal with code points rather than UCS-2/UTF-16 code units, it works the way you want:
console.log([...'💩💩']);
// → ['💩', '💩']
Try it out here: https://babeljs.io/repl/#?experimental=true&evaluate=true&loose=false&spec=false&code=console.log%28%0A%20%20%5B%2e%2e%2e%27%F0%9F%92%A9%F0%9F%92%A9%27%5D%0A%29%3B
A more generic solution:
function splitStringByCodePoint(string) {
return [...string];
}
console.log(splitStringByCodePoint('💩💩'));
// → ['💩', '💩']

for ... of could loop through string contains unicode characters,
let string = "😀😃😄😁😆😅🤣😂🙂🙃😉😊😇"
for(var c of string)
console.log(c);

The above solutions work well for simple emojis, but not for the one from an extended set and the ones that use Surrogate Pairs
For example:
splitStringByCodePoint("❤️")
// Returns: [ "❤", "️" ]
To handle these cases properly you'll need a purpose-built library, like for example:
https://github.com/dotcypress/runes
https://github.com/essdot/spliddit

Related

Special unicode string splitting in Javascript [duplicate]

First off I been searching the web for this solution.
How to:
<''.split('');
> ['','','']
Simply express of what I'll like to do. But also with other Unicode characters like poo.

As explained in JavaScript has a Unicode problem, in ES6 you can do this quite easily by using the new ... spread operator. This causes the string iterator (another new ES6 feature) to be used internally, and because that iterator is designed to deal with code points rather than UCS-2/UTF-16 code units, it works the way you want:
console.log([...'💩💩']);
// → ['💩', '💩']
Try it out here: https://babeljs.io/repl/#?experimental=true&evaluate=true&loose=false&spec=false&code=console.log%28%0A%20%20%5B%2e%2e%2e%27%F0%9F%92%A9%F0%9F%92%A9%27%5D%0A%29%3B
A more generic solution:
function splitStringByCodePoint(string) {
return [...string];
}
console.log(splitStringByCodePoint('💩💩'));
// → ['💩', '💩']

for ... of could loop through string contains unicode characters,
let string = "😀😃😄😁😆😅🤣😂🙂🙃😉😊😇"
for(var c of string)
console.log(c);

The above solutions work well for simple emojis, but not for the one from an extended set and the ones that use Surrogate Pairs
For example:
splitStringByCodePoint("❤️")
// Returns: [ "❤", "️" ]
To handle these cases properly you'll need a purpose-built library, like for example:
https://github.com/dotcypress/runes
https://github.com/essdot/spliddit

What is the function of .source in context of this new RegExp

I ran into the below monster of a regex in the wild today. The regex is meant to validate a url.
function superUrlValidation(url) {
return new RegExp(/^/.source + "((.+):\/\/)?" + /(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$/.source, "i")
.test(url);
}
I've never seen .source used in a regex like this so I looked it up.
The MDN docs for RegExp.prototype.source states:
The source property returns a String containing the source text of the regexp object, and it doesn't contain the two forward slashes on both sides and any flags.
... and gives this example:
var regex = /fooBar/ig;
console.log(regex.source); // "fooBar", doesn't contain /.../ and "ig".
I understand the MDN example (you're getting the source text of the regex object after it is created, makes sense), but I dont understand how this is being used in the superUrlValidation regex above.
How is the source being used before the regex object is completed and what does this accomplish? I cant find any documentation showing .source being used in this way.
Note that .source is used twice in the regex, at the beginning and the end

Use of .source everywhere in your regex seems totally unnecessary, may be just a trick to avoid double escaping. In fact even use of new RegExp is not needed and you can get away with just the regex literal as this:
var re = /^((.+):\/\/)?(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$/i;

/^/ is a regex literal, meaning it's a valid regex object in it's own right. This means that /^/.source === "^".
This seems like an arbitrary example of using the source property as this means the author could have just placed a "^" in it's place, or even just put a ^ at the beginning of the next string, and it would have the same effect.

The .source property returns the content of the regex between the forward slashes as you say. so the result of the above is equivalent to this string:
/^((.+):\/\/)?(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$/i
In JavaScript you can write regexes like this: /matchsomething/ or using the RegExp function/constructor above. It looks like the code you found is the result of someone not know what they were doing. They seem to have taken a few regexes using the literal syntax (i.e /match_here/) and plugged it into the constructor version and stuck them all together.
I can't see any benefit in using the source property this way. I would just use the string version or the constructor version. Or better, find out what the original author intended and write it again or find a respected regex library with the criteria you need.
And, yeah, wow. It's massive.

Javascripts String.split - how does it work internally?

I've recently discussed with a colleague how the separator of String.split is treated internally by JavaScript.
Is the separator always converted into a regular expression? E.g. will calling String.split(",", myvar) convert the "," into a regualar expression matching that string?

Well the answer for your question: "Is the separator always converted into a regular expression?" is:
It depends solely on the implementation. For example if you look at WebKit implementation http://svn.webkit.org/repository/webkit/trunk/Source/JavaScriptCore/runtime/StringPrototype.cpp (find stringProtoFuncSplit) then you see it is not always converted to RegEx. However, this does not imply anything, it is just a matter of implementation

Here's the official writeup over at ecma, but the relevant part is around this section:
8.If separator is a RegExp object (its [[Class]] is "RegExp"), let R = separator; otherwise let R = ToString(separator).
That being said it is the ecma spec, and as Anthony Grist mentioned in the comments, browsers can implement as they want, for instance V8 implements ecma262.
Edit: expanded thought on browser/js engines implementation, it appears the majority implement versions of ecma, as seen on this wiki

Yes, the javascript function split allow you to use regex:
EX:
var str = "I am confused";
str.split(/\s/g)
Str then contains ["I","am","confused"]

separator specifies the character(s) to use for separating the string. The separator is treated as a string or a regular expression. If separator is omitted, the array returned contains one element consisting of the entire string. If separator is an empty string, str is converted to an array of characters.
Please see the below link to know more about this, hope it will help you:
String.prototype.split()

Call [].reverse on a string

Why does
[].reverse.call("string");
fails (error in both firefox and ie, returns the original string in chrome) while calling all other arrays methods on a string work ?
>>> [].splice.call("string",3)
["i", "n", "g"]
>>> [].map.call("string",function (a) {return a +a;} )
["ss", "tt", "rr", "ii", "nn", "gg"]

Because .reverse() modifies an Array, and strings are immutable.
You could borrow Array.prototype.slice to convert to an Array, then reverse and join it.
var s = "string";
var s2 = [].slice.call(s).reverse().join('');
Just be aware that in older versions of IE, you can't manipulate a string like an Array.

The following technique (or similar) is commonly used to reverse a string in JavaScript:
// Don’t use this!
var naiveReverse = function(string) {
return string.split('').reverse().join('');
}
In fact, all the answers posted so far are a variation of this pattern. However, there are some problems with this solution. For example:
naiveReverse('foo 𝌆 bar');
// → 'rab �� oof'
// Where did the `𝌆` symbol go? Whoops!
If you’re wondering why this happens, read up on JavaScript’s internal character encoding. (TL;DR: 𝌆 is an astral symbol, and JavaScript exposes it as two separate code units.)
But there’s more:
// To see which symbols are being used here, check:
// http://mothereff.in/js-escapes#1ma%C3%B1ana%20man%CC%83ana
naiveReverse('mañana mañana');
// → 'anãnam anañam'
// Wait, so now the tilde is applied to the `a` instead of the `n`? WAT.
A good string to test string reverse implementations is the following:
'foo 𝌆 bar mañana mañana'
Why? Because it contains an astral symbol (𝌆) (which are represented by surrogate pairs in JavaScript) and a combining mark (the ñ in the last mañana actually consists of two symbols: U+006E LATIN SMALL LETTER N and U+0303 COMBINING TILDE).
The order in which surrogate pairs appear cannot be reversed, else the astral symbol won’t show up anymore in the ‘reversed’ string. That’s why you saw those �� marks in the output for the previous example.
Combining marks always get applied to the previous symbol, so you have to treat both the main symbol (U+006E LATIN SMALL LETTER N) as the combining mark (U+0303 COMBINING TILDE) as a whole. Reversing their order will cause the combining mark to be paired with another symbol in the string. That’s why the example output had ã instead of ñ.
Hopefully, this explains why all the answers posted so far are wrong.
To answer your initial question — how to [properly] reverse a string in JavaScript —, I’ve written a small JavaScript library that is capable of Unicode-aware string reversal. It doesn’t have any of the issues I just mentioned. The library is called Esrever; its code is on GitHub, and it works in pretty much any JavaScript environment. It comes with a shell utility/binary, so you can easily reverse strings from your terminal if you want.
var input = 'foo 𝌆 bar mañana mañana';
esrever.reverse(input);
// → 'anañam anañam rab 𝌆 oof'

See #am not i am's answer for why it doesn't work. However, if you want to know how to accomplish this, convert it to an array first:
"string".split('').reverse().join(''); // "gnirts"

Javascript Regex: extracting variables from paths

Trying to extract variable names from paths (variable is preceded with : ,optionally enclosed by ()), the number of variables may vary
"foo/bar/:firstVar/:(secondVar)foo2/:thirdVar"
Expected output should be:
['firstVar', 'secondVar', 'thirdVar']
Tried something like
"foo/bar/:firstVar/:(secondVar)foo2/:thirdVar".match(/\:([^/:]\w+)/g)
but it doesnt work (somehow it captures colons & doesnt have optional enclosures), if there is some regex mage around, please help. Thanks a lot in advance!

var path = "foo/bar/:firstVar/:(secondVar)foo2/:thirdVar";
var matches = [];
path.replace(/:\(?(\w+)\)?/g, function(a, b){
matches.push(b)
});
matches; // ["firstVar", "secondVar", "thirdVar"]

What about this:
/\:\(?([A-Za-z0-9_\-]+)\)?/
matches:
:firstVar
:(secondVar)
:thirdVar
$1 contains:
firstVar
secondVar
thirdVar

May I recommend that you look into the URI template specification? It does exactly what you're trying to do, but more elegantly. I don't know of any current URI template parsers for JavaScript, since it's usually a server-side operation, but a minimal implementation would be trivial to write.
Essentially, instead of:
foo/bar/:firstVar/:(secondVar)foo2/:thirdVar
You use:
foo/bar/{firstVar}/{secondVar}foo2/{thirdVar}
Hopefully, it's pretty obvious why this format works better in the case of secondVar. Plus it has the added advantage of being a specification, albeit currently still a draft.

Develop Reference

JavaScript is the programming language of the Web.

String Split With Unicode - javascript

First off I been searching the web for this solution. How to: <''.split(''); > ['','',''] Simply express of what I'll like to do. But also with other Unicode characters like poo.

for ... of could loop through string contains unicode characters, let string = "😀😃😄😁😆😅🤣😂🙂🙃😉😊😇" for(var c of string) console.log(c);

Related

Special unicode string splitting in Javascript [duplicate]

What is the function of .source in context of this new RegExp

Javascripts String.split - how does it work internally?

Call [].reverse on a string

Javascript Regex: extracting variables from paths

Categories

Resources