Trying to design a WORD SEARCH puzzle with Unicode Letters (TAMIL) Using HTML and JAVASCRIPT [duplicate] - javascript

Splitting a JavaScript string into "characters" can be done trivially but there are problems if you care about Unicode (and you should care about Unicode).
JavaScript natively treats characters as 16-bit entities (UCS-2 or UTF-16) but this does not allow for Unicode characters outside the BMP (Basic Multilingual Plane).
To deal with Unicode characters beyond the BMP, JavaScript must take into account "surrogate pairs", which it does not do natively.
I'm looking for how to split a js string by codepoint, whether the codepoints require one or two JavaScript "characters" (code units).
Depending on your needs, splitting by codepoint might not be enough, and you might want to split by "grapheme cluster", where a cluster is a base codepoint followed by all its non-spacing modifier codepoints, such as combining accents and diacritics.
For the purposes of this question I do not require splitting by grapheme cluster.

@bobince's answer has (luckily) become a bit dated; you can now simply use
var chars = Array.from( text )
to obtain a list of single-codepoint strings which does respect astral / 32bit / surrogate Unicode characters.

Along the lines of @John Frazer's answer, one can use this even more succinct form of string iteration:
const chars = [...text]
e.g., with:
const text = 'A\uD835\uDC68B\uD835\uDC69C\uD835\uDC6A'
const chars = [...text] // ["A", "𝑨", "B", "𝑩", "C", "𝑪"]

In ECMAScript 6 you'll be able to use a string as an iterator to get code points, or you could search a string for /./ug, or you could call codePointAt(i) repeatedly.
Unfortunately for..of syntax and regexp flags can't be polyfilled and calling a polyfilled codePointAt() would be super slow (O(n²)), so we can't realistically use this approach for a while yet.
So doing it the manual way:
String.prototype.toCodePoints= function() {
var chars = []; // declared locally to avoid leaking a global
for (var i= 0; i<this.length; i++) {
var c1= this.charCodeAt(i);
if (c1>=0xD800 && c1<0xDC00 && i+1<this.length) {
var c2= this.charCodeAt(i+1);
if (c2>=0xDC00 && c2<0xE000) {
chars.push(0x10000 + ((c1-0xD800)<<10) + (c2-0xDC00));
i++;
continue;
}
}
chars.push(c1);
}
return chars;
}
For the inverse to this see https://stackoverflow.com/a/3759300/18936
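For engines that already ship ES2015, the inverse can also be sketched with the built-in String.fromCodePoint. The helper name fromCodePoints and the chunk size of 1024 are assumptions made for illustration; they are not taken from the linked answer.

```javascript
// Hypothetical helper: turn an array of code points back into a string.
// Assumes an ES2015+ engine that provides String.fromCodePoint.
function fromCodePoints(codePoints) {
  var out = '';
  // Convert in chunks so a very long array is not passed as one argument list.
  for (var i = 0; i < codePoints.length; i += 1024) {
    out += String.fromCodePoint.apply(null, codePoints.slice(i, i + 1024));
  }
  return out;
}

var text = 'A\uD835\uDC68B';
// Iterating a string with Array.from yields one entry per code point.
var codePoints = Array.from(text, function (ch) { return ch.codePointAt(0); });
// codePoints is [0x41, 0x1D468, 0x42]
console.log(fromCodePoints(codePoints) === text); // true
```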

Another method using codePointAt:
String.prototype.toCodePoints = function () {
var arCP = [];
for (var i = 0; i < this.length; i += 1) {
var cP = this.codePointAt(i);
arCP.push(cP);
if (cP >= 0x10000) {
i += 1;
}
}
return arCP;
}

Related

Regex split comma except escaped [duplicate]

I have this string:
a\,bcde,fgh,ijk\,lmno,pqrst\,uv
I need a JavaScript function that will split the string by every , but only those that don't have a \ before them
How can this be done?
Here's the shortest thing I could come up with:
'a\\,bcde,fgh,ijk\\,lmno,pqrst\\,uv'.replace(/([^\\]),/g, '$1\u000B').split('\u000B')
The idea is to find every comma that isn't prefixed with a backslash, replace it with a string that is unlikely to appear in your data, and then split by that string.
Note that the backslashes before the commas have to be escaped with another backslash in the string literal. Otherwise JavaScript treats \, as an escaped comma and produces just a comma. In other words, if you don't escape the backslash, JavaScript sees a\,bcde,fgh,ijk\,lmno,pqrst\,uv as a,bcde,fgh,ijk,lmno,pqrst,uv.
Since regular expressions in JavaScript do not support lookbehinds, I'm not going to cook up a giant hack to mimic this behavior. Instead, you can just split() on all commas (,) and then glue back the pieces that shouldn't have been split in the first place.
Quick 'n' dirty demo:
var str = 'a\\,bcde,fgh,ijk\\,lmno,pqrst\\,uv'.split(','), // Split on all commas
out = []; // Output
for (var i = 0, j = str.length - 1; i < j; i++) { // Iterate all but last (last can never be glued to non-existing next)
var curr = str[i]; // This piece
if (curr.charAt(curr.length - 1) == '\\') { // If ends with \ ...
curr += ',' + str[++i]; // ... glue with next and skip next (increment i)
}
out.push(curr); // Add to output
}
Another ugly hack around the lack of look-behinds:
function rev(s) {
return s.split('').reverse().join('');
}
var s = 'a\\,bcde,fgh,ijk\\,lmno,pqrst\\,uv';
// Enter bizarro world...
var r = rev(s);
// Split with a look-ahead
var rparts = r.split(/,(?!\\)/);
// And put it back together with double reversing.
var sparts = [ ];
while(rparts.length)
sparts.push(rev(rparts.pop()));
for(var i = 0; i < sparts.length; ++i)
$('#out').append('<pre>' + sparts[i] + '</pre>');
Demo: http://jsfiddle.net/ambiguous/QbBfw/1/
I don't think I'd do this in real life but it works even if it does make me feel dirty. Consider this a curiosity rather than something you should really use.
In case you also need to remove the backslashes:
var test='a\\.b.c';
var result = test.replace(/\\?\./g, function (t) { return t == '.' ? '\u000B' : '.'; }).split('\u000B');
//result: ["a.b", "c"]
As of 2022, most browsers support lookbehinds:
https://caniuse.com/js-regexp-lookbehind
Safari should be your only concern.
With a lookbehind you can split your string this way:
"a\\,bcde,fgh,ijk\\,lmno,pqrst\\,uv".split(/(?<!\\),/)
// => ['a\\,bcde', 'fgh', 'ijk\\,lmno', 'pqrst\\,uv']
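The regex variants above treat any backslash directly before a comma as an escape, even when that backslash is itself escaped. Where that matters, or where lookbehind is unavailable, a plain character scan is one alternative. The sketch below is not from any of the answers; the name splitUnescaped is hypothetical and a single-character separator is assumed.

```javascript
// Hypothetical helper: split on `sep` unless it is escaped with a backslash.
// Escape sequences are kept in the output exactly as they were written.
function splitUnescaped(input, sep) {
  var parts = [];
  var current = '';
  for (var i = 0; i < input.length; i++) {
    var ch = input.charAt(i);
    if (ch === '\\' && i + 1 < input.length) {
      // Keep the backslash together with whatever it escapes.
      current += ch + input.charAt(i + 1);
      i++;
    } else if (ch === sep) {
      parts.push(current);
      current = '';
    } else {
      current += ch;
    }
  }
  parts.push(current);
  return parts;
}

console.log(splitUnescaped('a\\,bcde,fgh,ijk\\,lmno,pqrst\\,uv', ','));
// ["a\\,bcde", "fgh", "ijk\\,lmno", "pqrst\\,uv"]
```

Because the scan consumes a backslash together with the character after it, an escaped backslash followed by a comma still splits at that comma.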
You can use a regex to do the split.
Here is a link to regex in JavaScript: http://www.w3schools.com/jsref/jsref_obj_regexp.asp
Here is a link to another post where the author used a regex for split: Javascript won't split using regex
From the first link, note that you can create a regular expression using
?!n Matches any string that is not followed by a specific string n
[,]!\\

How do I determine the width of the result of codePointAt?

I'm trying to loop over the Unicode characters in a Javascript string, that I assume is encoded with UTF-16.
It is my understanding that UTF-16 is variable width. That is, a single Unicode character may be split across multiple 16-bit characters. I can use s.codePointAt(i) to get the Unicode character beginning at a given index. But once I have it, how do I know how far to advance i?
Roughly, what is getWidth here? Is it simply c > Math.pow(2, 16)?
for (var i = 0; i < s.length;) {
var c = s.codePointAt(i);
// do some operation with c
i = i + getWidth(c)
}
Is there a standard library function I can use to determine how far to advance? Or a way to iterate over the Unicode code points in a string?
Is there a standard […] way to iterate over the Unicode code points in a string?
Yes, since ES6 you can simply iterate all strings to get the code points:
for (const character of string) {
const codepoint = character.codePointAt(0);
// do some operation with codepoint
}
A simple approach:
for (var i = 0; i < s.length; ++i) {
var c = s.codePointAt(i);
// do some operation with c
if( s.charCodeAt(i) != c) {
++i; // step past the next sixteen bits of the surrogate pair
}
}
(where the value of c is the Unicode codepoint, not the character).
If you want to split the string into an array of Unicode characters you can make use of the string iterator invoked by the spread operator introduced in ES6:
var array = [...s];
In pre-ES6 browsers the start of a surrogate pair can be identified in order to skip the second part:
for (var i = 0; i < s.length; ++i) {
var k = s.charCodeAt(i);
if( k < 0xD800 || k > 0xDBFF) {
var c = s[i]; // character in BMP
}
else {
c = s.substring( i,i+2); // use surrogate pair
++i;
}
// do something with c
console.log(c)
}
See: http://www.unicode.org/glossary/#supplementary_code_point
Basically, if your code point is 0x010000 or above, you are dealing with a supplementary character that takes two UTF-16 code units.
const MIN_SUPPLEMENTARY_CODE_POINT = 0x010000;
function charCount(codePoint) {
return codePoint >= MIN_SUPPLEMENTARY_CODE_POINT ? 2 : 1;
}
JavaScript strings were designed when Unicode still fit in 16 bits, and the language exposes them as sequences of 16-bit code units (originally UCS-2, in practice UTF-16). Methods such as length, charAt and charCodeAt work on code units, so they don't pair up surrogates and don't treat a character that needs more than two bytes as a single unit.
If you are stepping through a string looking at codepoints, you can look at the codepoint value itself: if the value is 2^16 (0x10000) or greater, you have to advance 2 string characters (code units), otherwise advance 1 string character.
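Put together with the loop from the question, that rule can be sketched as follows. getWidth is the name the question uses; codePointsOf is a hypothetical wrapper added here so the result can be inspected. Note that the boundary is 0x10000 itself, so the test is "greater than 0xFFFF" rather than "greater than 2^16".

```javascript
// Number of UTF-16 code units occupied by a code point.
function getWidth(codePoint) {
  return codePoint > 0xFFFF ? 2 : 1;
}

// Hypothetical wrapper: collect the code points of a string.
function codePointsOf(s) {
  var result = [];
  for (var i = 0; i < s.length;) {
    var c = s.codePointAt(i);
    result.push(c);
    i += getWidth(c); // skip the low surrogate when c is supplementary
  }
  return result;
}

console.log(codePointsOf('a\uD83D\uDCA9b')); // [97, 128169, 98]
```

A lone surrogate is returned by codePointAt as its own code unit value, which is below 0x10000, so the loop still advances by one and terminates.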
You might try a new ES6 syntax that works really well at splitting up strings into characters, even if those characters are high-order.
// High-order Unicode character
const k = '💩';
// Takes two UTF-16 code units (four bytes)
console.log(k.length); // 2
const chars = [...k];
// But it's only one character
console.log(chars.length); // 1

Regex matching JS source that's not in a string or regex literal

Do there exist comprehensive regular expressions that, when applied to JavaScript source code, will match all valid string literals (such as "say \"Hello\"") and regex literals (such as /and\/or/)? The expressions would have to cover all edge cases, including line breaks and escape sequences.
Alternatively, does anyone know of regexes for matching patterns outside of string and regex literals?
My goal is to implement a simple JavaScript syntax extension that allows macros in delimiters (e.g. {{#foo.bar}} or ##foo.bar#) to be expanded by a preprocessor. However, I'd like the macros to be processed only outside of literals.
For now, I'm trying to accomplish this using just string replacement, without having to augment an existing JavaScript lexer/parser.
This JavaScript preprocessor will itself be implemented in JavaScript.
This is the regex that I've been using to match quoted strings which is pretty good since it should work with almost all engines since it does not require backtracking or backreferences or any of that voodoo. This will match all text INSIDE literals.
"(\\.|[^"])*"
Depending on the engine, it might support non capturing groups. In that case you can use
"(?:\\.|[^"])*"
and it should be faster.
I think this is too much for regexes.
Consider var foo = "//" // /"(?:\\.|[^"])*"/. Where do the strings, comments and regex literals start and end? You would need to write a complete JavaScript parser to cover all edge cases. Of course, the parser will be using regexes...
I would probably go about doing something like the following. It will need to be improved for certain possible conditions, though.
var str = '"aaa \"sss \\t bbb" sss #3 ss# ((t sdsds)) ff ';
str += '/gg sdfd \/dsds/ {aaa bbb} {{ss}} {#sdsd#}';
var repeating = ['"','\\\'','/','\\~','\\#'];
// "example" 'example' /example/ ~example~ #example#
var enclosing = [];
enclosing.push(['\\{','\\}']);
enclosing.push(['\\{\\{','\\}\\}']);
enclosing.push(['\\[','\\]']);
enclosing.push(['\\(\\(','\\)\\)']);
// {example} {{example}} [example] ((example))
for (var forEnclosing='',i = 0 ; i < enclosing.length; i++) {
var e = enclosing[i];
var r = e[0]+'(\\\\['+e[0]+e[1]+']|[^'+e[0]+e[1]+'])*'+e[1];
forEnclosing += r + (i < enclosing.length-1 ? '|' : '');
}
for (var forRepeating='',i = 0; i < repeating.length; i++) {
var e = repeating[i];
var r = e+'(\\'+e+'|[^'+e+'])*'+e;
forRepeating += r + (i < repeating.length-1 ? '|' : '');
}
var rx = new RegExp('('+forEnclosing+'|'+forRepeating+')','g');
var m = str.match(rx);
try { for (var i = 0; i < m.length; i++) console.log(m[i]) }
catch(e) {}
Outputs:
"aaa "sss \t bbb"
#3 ss#
((t sdsds))
/gg sdfd /dsds/
{aaa bbb}
{{ss}}
{#sdsd#}
The closest you can get with a regex is to have one regex that matches EITHER a string literal (single- or double-quoted) OR a regex OR a comment (OR whatever else might contain bogus matches) OR one of your macro thingies:
"[^"\\]*(?:\\.[^"\\]*)*"
|
'[^'\\]*(?:\\.[^'\\]*)*'
|
/[^/\\]*(?:\\.[^/\\]*)*/[gim]*
|
/\*[^*]*(?:\*(?!/)[^*]*)*\*/
|
##(\w+\.\w+)#
If group #1 contains anything after the match, it must be what you're looking for. Otherwise, ignore this match and go on to the next one.
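In JavaScript, that "match everything, act only on group 1" idea maps onto String.prototype.replace with a callback. The following is a minimal sketch under some assumptions: the names TOKEN and expandMacros are hypothetical, macro values come from a plain object, and the regex-literal branch is left out because a regex alone cannot reliably tell a regex literal from division. Line comments are not handled either.

```javascript
// Literals and comments come first so a macro inside them is consumed as
// part of the literal; only the last branch has a capturing group.
var branches = [
  /"[^"\\]*(?:\\.[^"\\]*)*"/,         // double-quoted string
  /'[^'\\]*(?:\\.[^'\\]*)*'/,         // single-quoted string
  /\/\*[^*]*(?:\*(?!\/)[^*]*)*\*\//,  // block comment
  /##(\w+\.\w+)#/                     // macro such as ##foo.bar# (group 1)
];
var TOKEN = new RegExp(
  branches.map(function (r) { return r.source; }).join('|'),
  'g'
);

// Hypothetical helper: expand macros that occur outside strings and comments.
function expandMacros(source, values) {
  return source.replace(TOKEN, function (match, name) {
    // `name` is undefined when a string or comment matched: keep it as is.
    if (name === undefined) return match;
    return Object.prototype.hasOwnProperty.call(values, name) ? values[name] : match;
  });
}

var src = 'var a = ##foo.bar#; var b = "##foo.bar#"; /* ##foo.bar# */';
console.log(expandMacros(src, { 'foo.bar': '42' }));
// var a = 42; var b = "##foo.bar#"; /* ##foo.bar# */
```

Unknown macro names are left in place rather than replaced with "undefined", which makes missing definitions easy to spot in the output.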

How do I split a string into an array of characters? [duplicate]

var s = "overpopulation";
var ar = [];
ar = s.split();
alert(ar);
I want to string.split a word into array of characters.
The above code doesn't seem to work - it returns "overpopulation" as Object..
How do i split it into array of characters, if original string doesn't contain commas and whitespace?
You can split on an empty string:
var chars = "overpopulation".split('');
If you just want to access a string in an array-like fashion, you can do that without split:
var s = "overpopulation";
for (var i = 0; i < s.length; i++) {
console.log(s.charAt(i));
}
You can also access each character with its index using normal array syntax. Note, however, that strings are immutable, which means you can't set the value of a character using this method, and that it isn't supported by IE7 (if that still matters to you).
var s = "overpopulation";
console.log(s[3]); // logs 'r'
Old question but I should warn:
Do NOT use .split('')
You'll get weird results with non-BMP (non-Basic-Multilingual-Plane) character sets.
Reason is that methods like .split() and .charCodeAt() only respect the characters with a code point below 65536; bec. higher code points are represented by a pair of (lower valued) "surrogate" pseudo-characters.
'๐Ÿ™๐Ÿš๐Ÿ›'.length // โ€”> 6
'๐Ÿ™๐Ÿš๐Ÿ›'.split('') // โ€”> ["๏ฟฝ", "๏ฟฝ", "๏ฟฝ", "๏ฟฝ", "๏ฟฝ", "๏ฟฝ"]
'๐Ÿ˜Ž'.length // โ€”> 2
'๐Ÿ˜Ž'.split('') // โ€”> ["๏ฟฝ", "๏ฟฝ"]
Use ES2015 (ES6) features where possible:
Using the spread operator:
let arr = [...str];
Or Array.from
let arr = Array.from(str);
Or split with the new u RegExp flag:
let arr = str.split(/(?!$)/u);
Examples:
[...'𝟙𝟚𝟛'] // -> ["𝟙", "𝟚", "𝟛"]
[...'😎😜🙃'] // -> ["😎", "😜", "🙃"]
For ES5, options are limited:
I came up with this function that internally uses MDN example to get the correct code point of each character.
function stringToArray(str) {
var i = 0,
arr = [],
codePoint;
while (!isNaN(codePoint = knownCharCodeAt(str, i))) {
arr.push(String.fromCodePoint(codePoint));
i++;
}
return arr;
}
This requires the knownCharCodeAt() function and, for some browsers, a String.fromCodePoint() polyfill.
if (!String.fromCodePoint) {
// ES6 Unicode Shims 0.1 , © 2012 Steven Levithan , MIT License
String.fromCodePoint = function fromCodePoint () {
var chars = [], point, offset, units, i;
for (i = 0; i < arguments.length; ++i) {
point = arguments[i];
offset = point - 0x10000;
units = point > 0xFFFF ? [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)] : [point];
chars.push(String.fromCharCode.apply(null, units));
}
return chars.join("");
}
}
Examples:
stringToArray('๐Ÿ™๐Ÿš๐Ÿ›') // โ€”> ["๐Ÿ™", "๐Ÿš", "๐Ÿ›"]
stringToArray('๐Ÿ˜Ž๐Ÿ˜œ๐Ÿ™ƒ') // โ€”> ["๐Ÿ˜Ž", "๐Ÿ˜œ", "๐Ÿ™ƒ"]
Note: str[index] (ES5) and str.charAt(index) will also return weird results with non-BMP charsets. e.g. '๐Ÿ˜Ž'.charAt(0) returns "๏ฟฝ".
UPDATE: Read this nice article about JS and unicode.
.split('') splits emojis in half.
Onur's solutions work for some emojis, but can't handle more complex languages or combined emojis.
Consider this emoji being ruined:
[..."🏳️‍🌈"] // returns ["🏳", "\uFE0F", "\u200D", "🌈"] instead of ["🏳️‍🌈"]
Also consider this Hindi text अनुच्छेद which is split like this:
[..."अनुच्छेद"] // returns ["अ", "न", "ु", "च", "्", "छ", "े", "द"]
but should in fact be split like this:
["अ","नु","च्","छे","द"]
This happens because some of the characters are combining marks (think diacritics/accents in European languages).
You can use the grapheme-splitter library for this:
It does proper standards-based letter split in all the hundreds of exotic edge-cases - yes, there are that many.
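Where the runtime provides it, the built-in Intl.Segmenter is another way to get grapheme clusters without a library. This is a different technique from grapheme-splitter, shown only as a sketch; the helper name splitGraphemes is hypothetical, and the exact cluster boundaries for some scripts depend on the Unicode version the engine ships.

```javascript
// Hypothetical helper: split a string into grapheme clusters.
// Assumes a runtime that implements Intl.Segmenter.
function splitGraphemes(text) {
  var segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  // Each item yielded by segment() has a `segment` property holding the cluster.
  return Array.from(segmenter.segment(text), function (s) { return s.segment; });
}

// "e" followed by a combining acute accent stays together; "a" is separate.
console.log(splitGraphemes('e\u0301a').length); // 2

// Rainbow flag: white flag + variation selector + zero-width joiner + rainbow.
console.log(splitGraphemes('\uD83C\uDFF3\uFE0F\u200D\uD83C\uDF08').length); // 1
```

For a word-search grid in a script with combining vowel signs, such as Tamil or Hindi, each grid cell would normally hold one such cluster rather than one code point.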
It's as simple as:
s.split("");
The delimiter is an empty string, hence it will break up between each single character.
The split() method in javascript accepts two parameters: a separator and a limit.
The separator specifies the character to use for splitting the string. If you don't specify a separator, the entire string is returned, non-separated. But, if you specify the empty string as a separator, the string is split between each character.
Therefore:
s.split('')
will have the effect you seek.
More information here
A string in Javascript is already a character array.
You can simply access any character in the array as you would any other array.
var s = "overpopulation";
alert(s[0]) // alerts o.
UPDATE
As is pointed out in the comments below, the above method for accessing a character in a string is part of ECMAScript 5 which certain browsers may not conform to.
An alternative method you can use is charAt(index).
var s = "overpopulation";
alert(s.charAt(0)) // alerts o.
To support emojis use this
('Dragon 🐉').split(/(?!$)/u);
=> ['D', 'r', 'a', 'g', 'o', 'n', ' ', '🐉']
You can use the regular expression /(?!$)/:
"overpopulation".split(/(?!$)/)
The negative look-ahead assertion (?!$) will match right in front of every character.
