How to append strings in linear time - javascript

I want to change a string with dashes placed randomly in between groups of characters to a string with dashes in between groups of n characters. I would like to keep this at a worst case of O(n) time complexity. The following soln works but I believe that the string concatenation is slow and would prefer a constant time operation in order to maintain O(n).
//Desired string is 15-678435-555339
let s = "1-5678-43-5555-339"
let newString = ""
let counter = 0
let n = 6
if(s.length === 1 || s.length < k) return s
for(let i = s.length-1; i >= 0; i--){
if(counter === 6) {
counter = 0;
newString = "-" + newString
}
if(s.charAt(i) !== "-") {
counter += 1
newString = s.charAt(i) + newString
}
}

As this is javascript you usually get the most performant solution with the least code, which would be:
function* chunk(iterable, size) {
for(let i = iterable.length; i >= 0; i -= size)
yield iterable.slice(Math.max(0, i - size), i);
}
let result = [...chunk(s.replace(/-/g, ""), 6)].reverse().join("-");
(But thats just speculation, that heavily depends on the engine)
Hmm, I thought that the string concat is expensive, making it well above o(n).
Usually yes, but some very agressive inlining might optimize it away.

If this is a real life problem and not homework with arbitrary constraints, you should just concatenate the strings. On a modern javascript, and especially for short strings like you've got there, this will not present a performance problem under normal conditions.
If you really wanted to minimize the number of strings created, you could construct an integer array of character codes using .charCodeAt(i), and then s = String.fromCharCode.apply(null, arrayOfIntegers). But this isn't something you'd normally have to do.

Not sure what language you are operating in, but most have some concept of a StringBuilder, which is just implemented as an ArrayList of strings underneath and concatenated when you ask for the resulting string - usually via a toString() method.
Here is a Java example: https://ideone.com/vHHdHH
public static void main(String[] args) {
String s = "1-5678-43-5555-339";
StringBuilder sb = new StringBuilder();
int dashPosition = 2;
int count = 0;
char[] ch = s.toCharArray();
for (int i = 0; i < ch.length; i++) {
if (Character.getNumericValue(ch[i]) != -1) {
sb.append(ch[i]);
count++;
if (count % dashPosition == 0) {
sb.append('-');
count = 0;
dashPosition = 6;
}
}
}
if (sb.charAt(sb.length() - 1) == '-') {
sb.deleteCharAt(sb.length() - 1);
}
//Desired string is 15-678435-555339
System.out.println(sb.toString());
}

Related

How can i make a loop that will show '-' mark x time iteration was? [duplicate]

In Perl I can repeat a character multiple times using the syntax:
$a = "a" x 10; // results in "aaaaaaaaaa"
Is there a simple way to accomplish this in Javascript? I can obviously use a function, but I was wondering if there was any built in approach, or some other clever technique.
These days, the repeat string method is implemented almost everywhere. (It is not in Internet Explorer.) So unless you need to support older browsers, you can simply write:
"a".repeat(10)
Before repeat, we used this hack:
Array(11).join("a") // create string with 10 a's: "aaaaaaaaaa"
(Note that an array of length 11 gets you only 10 "a"s, since Array.join puts the argument between the array elements.)
Simon also points out that according to this benchmark, it appears that it's faster in Safari and Chrome (but not Firefox) to repeat a character multiple times by simply appending using a for loop (although a bit less concise).
In a new ES6 harmony, you will have native way for doing this with repeat. Also ES6 right now only experimental, this feature is already available in Edge, FF, Chrome and Safari
"abc".repeat(3) // "abcabcabc"
And surely if repeat function is not available you can use old-good Array(n + 1).join("abc")
Convenient if you repeat yourself a lot:
String.prototype.repeat = String.prototype.repeat || function(n){
n= n || 1;
return Array(n+1).join(this);
}
alert( 'Are we there yet?\nNo.\n'.repeat(10) )
Array(10).fill('a').join('')
Although the most voted answer is a bit more compact, with this approach you don't have to add an extra array item.
An alternative is:
for(var word = ''; word.length < 10; word += 'a'){}
If you need to repeat multiple chars, multiply your conditional:
for(var word = ''; word.length < 10 * 3; word += 'foo'){}
NOTE: You do not have to overshoot by 1 as with word = Array(11).join('a')
The most performance-wice way is https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/repeat
Short version is below.
String.prototype.repeat = function(count) {
if (count < 1) return '';
var result = '', pattern = this.valueOf();
while (count > 1) {
if (count & 1) result += pattern;
count >>>= 1, pattern += pattern;
}
return result + pattern;
};
var a = "a";
console.debug(a.repeat(10));
Polyfill from Mozilla:
if (!String.prototype.repeat) {
String.prototype.repeat = function(count) {
'use strict';
if (this == null) {
throw new TypeError('can\'t convert ' + this + ' to object');
}
var str = '' + this;
count = +count;
if (count != count) {
count = 0;
}
if (count < 0) {
throw new RangeError('repeat count must be non-negative');
}
if (count == Infinity) {
throw new RangeError('repeat count must be less than infinity');
}
count = Math.floor(count);
if (str.length == 0 || count == 0) {
return '';
}
// Ensuring count is a 31-bit integer allows us to heavily optimize the
// main part. But anyway, most current (August 2014) browsers can't handle
// strings 1 << 28 chars or longer, so:
if (str.length * count >= 1 << 28) {
throw new RangeError('repeat count must not overflow maximum string size');
}
var rpt = '';
for (;;) {
if ((count & 1) == 1) {
rpt += str;
}
count >>>= 1;
if (count == 0) {
break;
}
str += str;
}
// Could we try:
// return Array(count + 1).join(this);
return rpt;
}
}
If you're not opposed to including a library in your project, lodash has a repeat function.
_.repeat('*', 3);
// β†’ '***
https://lodash.com/docs#repeat
For all browsers
The following function will perform a lot faster than the option suggested in the accepted answer:
var repeat = function(str, count) {
var array = [];
for(var i = 0; i < count;)
array[i++] = str;
return array.join('');
}
You'd use it like this :
var repeatedString = repeat("a", 10);
To compare the performance of this function with that of the option proposed in the accepted answer, see this Fiddle and this Fiddle for benchmarks.
For moderns browsers only
In modern browsers, you can now do this using String.prototype.repeat method:
var repeatedString = "a".repeat(10);
Read more about this method on MDN.
This option is even faster. Unfortunately, it doesn't work in any version of Internet explorer. The numbers in the table specify the first browser version that fully supports the method:
In ES2015/ES6 you can use "*".repeat(n)
So just add this to your projects, and your are good to go.
String.prototype.repeat = String.prototype.repeat ||
function(n) {
if (n < 0) throw new RangeError("invalid count value");
if (n == 0) return "";
return new Array(n + 1).join(this.toString())
};
String.repeat() is supported by 96.39% of browsers as of now.
function pad(text, maxLength){
return text + "0".repeat(maxLength - text.length);
}
console.log(pad('text', 7)); //text000
/**
* Repeat a string `n`-times (recursive)
* #param {String} s - The string you want to repeat.
* #param {Number} n - The times to repeat the string.
* #param {String} d - A delimiter between each string.
*/
var repeat = function (s, n, d) {
return --n ? s + (d || "") + repeat(s, n, d) : "" + s;
};
var foo = "foo";
console.log(
"%s\n%s\n%s\n%s",
repeat(foo), // "foo"
repeat(foo, 2), // "foofoo"
repeat(foo, "2"), // "foofoo"
repeat(foo, 2, "-") // "foo-foo"
);
Just for the fun of it, here is another way by using the toFixed(), used to format floating point numbers.
By doing
(0).toFixed(2)
(0).toFixed(3)
(0).toFixed(4)
we get
0.00
0.000
0.0000
If the first two characters 0. are deleted, we can use this repeating pattern to generate any repetition.
function repeat(str, nTimes) {
return (0).toFixed(nTimes).substr(2).replaceAll('0', str);
}
console.info(repeat('3', 5));
console.info(repeat('hello ', 4));
Another interesting way to quickly repeat n character is to use idea from quick exponentiation algorithm:
var repeatString = function(string, n) {
var result = '', i;
for (i = 1; i <= n; i *= 2) {
if ((n & i) === i) {
result += string;
}
string = string + string;
}
return result;
};
For repeat a value in my projects i use repeat
For example:
var n = 6;
for (i = 0; i < n; i++) {
console.log("#".repeat(i+1))
}
but be careful because this method has been added to the ECMAScript 6 specification.
function repeatString(n, string) {
var repeat = [];
repeat.length = n + 1;
return repeat.join(string);
}
repeatString(3,'x'); // => xxx
repeatString(10,'🌹'); // => "🌹🌹🌹🌹🌹🌹🌹🌹🌹🌹"
This is how you can call a function and get the result by the helps of Array() and join()
using Typescript and arrow fun
const repeatString = (str: string, num: number) => num > 0 ?
Array(num+1).join(str) : "";
console.log(repeatString("🌷",10))
//outputs: 🌷🌷🌷🌷🌷🌷🌷🌷🌷🌷
function repeatString(str, num) {
// Array(num+1) is the string you want to repeat and the times to repeat the string
return num > 0 ? Array(num+1).join(str) : "";
}
console.log(repeatString("a",10))
// outputs: aaaaaaaaaa
console.log(repeatString("🌷",10))
//outputs: 🌷🌷🌷🌷🌷🌷🌷🌷🌷🌷
Here is what I use:
function repeat(str, num) {
var holder = [];
for(var i=0; i<num; i++) {
holder.push(str);
}
return holder.join('');
}
I realize that it's not a popular task, what if you need to repeat your string not an integer number of times?
It's possible with repeat() and slice(), here's how:
String.prototype.fracRepeat = function(n){
if(n < 0) n = 0;
var n_int = ~~n; // amount of whole times to repeat
var n_frac = n - n_int; // amount of fraction times (e.g., 0.5)
var frac_length = ~~(n_frac * this.length); // length in characters of fraction part, floored
return this.repeat(n) + this.slice(0, frac_length);
}
And below a shortened version:
String.prototype.fracRepeat = function(n){
if(n < 0) n = 0;
return this.repeat(n) + this.slice(0, ~~((n - ~~n) * this.length));
}
var s = "abcd";
console.log(s.fracRepeat(2.5))
I'm going to expand on #bonbon's answer. His method is an easy way to "append N chars to an existing string", just in case anyone needs to do that. For example since "a google" is a 1 followed by 100 zeros.
for(var google = '1'; google.length < 1 + 100; google += '0'){}
document.getElementById('el').innerText = google;
<div>This is "a google":</div>
<div id="el"></div>
NOTE: You do have to add the length of the original string to the conditional.
Lodash offers a similar functionality as the Javascript repeat() function which is not available in all browers. It is called _.repeat and available since version 3.0.0:
_.repeat('a', 10);
var stringRepeat = function(string, val) {
var newString = [];
for(var i = 0; i < val; i++) {
newString.push(string);
}
return newString.join('');
}
var repeatedString = stringRepeat("a", 1);
Can be used as a one-liner too:
function repeat(str, len) {
while (str.length < len) str += str.substr(0, len-str.length);
return str;
}
In CoffeeScript:
( 'a' for dot in [0..10]).join('')
String.prototype.repeat = function (n) { n = Math.abs(n) || 1; return Array(n + 1).join(this || ''); };
// console.log("0".repeat(3) , "0".repeat(-3))
// return: "000" "000"

How to improve performance of this Javascript/Cracking the code algorithm?

so here is the question below, with my answer to it. I know that because of the double nested for loop, the efficiency is O(n^2), so I was wondering if there were a way to improve my algorithm/function's big O.
// Design an algorithm and write code to remove the duplicate characters in a string without using any additional buffer. NOTE: One or two additional variables are fine. An extra copy of the array is not.
function removeDuplicates(str) {
let arrayString = str.split("");
let alphabetArray = [["a", 0],["b",0],["c",0],["d",0],["e",0],["f",0],["g",0],["h",0],["i",0],["j",0],["k",0],["l",0],["m",0],["n",0],["o",0],["p",0],["q",0],["r",0],["s",0],["t",0],["u",0],["v",0],["w",0],["x",0],["y",0],["z",0]]
for (let i=0; i<arrayString.length; i++) {
findCharacter(arrayString[i].toLowerCase(), alphabetArray);
}
removeCharacter(arrayString, alphabetArray);
};
function findCharacter(character, array) {
for (let i=0; i<array.length; i++) {
if (array[i][0] === character)Β {
array[i][1]++;
}
}
}
function removeCharacter(arrString, arrAlphabet) {
let finalString = "";
for (let i=0; i<arrString.length; i++) {
for (let j=0; j<arrAlphabet.length; j++) {
if (arrAlphabet[j][1] < 2 && arrString[i].toLowerCase() == arrAlphabet[j][0]) {
finalString += arrString[i]
}
}
}
console.log("The string with removed duplicates is:", finalString)
}
removeDuplicates("Hippotamuus")
The ASCII/Unicode character codes of all letters of the same case are consecutive. This allows for an important optimization: You can find the index of a character in the character count array from its ASCII/Unicode character code. Specifically, the index of the character c in the character count array will be c.charCodeAt(0) - 'a'.charCodeAt(0). This allows you to look up and modify the character count in the array in O(1) time, which brings the algorithm run-time down to O(n).
There's a little trick to "without using any additional buffer," although I don't see a way to improve on O(n^2) complexity without using a hash map to determine if a particular character has been seen. The trick is to traverse the input string buffer (assume it is a JavaScript array since strings in JavaScript are immutable) and overwrite the current character with the next unique character if the current character is a duplicate. Finally, mark the end of the resultant string with a null character.
Pseudocode:
i = 1
pointer = 1
while string[i]:
if not seen(string[i]):
string[pointer] = string[i]
pointer = pointer + 1
i = i + 1
mark string end at pointer
The function seen could either take O(n) time and O(1) space or O(1) time and O(|alphabet|) space if we use a hash map.
Based on your description, I'm assuming the input is a string (which is immutable in javascript) and I'm not sure what exactly does "one or two additional variables" mean so based on your implementation, I'm going to assume it's ok to use O(N) space. To improve time complexity, I think implementations differ according to different requirements for the outputted string.
Assumption1: the order of the outputted string is in the order that it appears the first time. eg. "bcabcc" -> "bca"
Suppose the length of s is N, the following implementation uses O(N) space and O(N) time.
function removeDuplicates(s) {
const set = new Set(); // use set so that insertion and lookup time is o(1)
let res = "";
for (let i = 0; i < s.length; i++) {
if (!set.has(s[i])) {
set.add(s[i]);
res += s[i];
}
}
return res;
}
Assumption2: the outputted string has to be of ascending order.
You may use quick-sort to do in-place sorting and then loop through the sorted array to add the last-seen element to result. Note that you may need to split the string into an array first. So the implementation would use O(N) space and the average time complexity would be O(NlogN)
Assumption3: the result is the smallest in lexicographical order among all possible results. eg. "bcabcc" -> "abc"
The following implementation uses O(N) space and O(N) time.
const removeDuplicates = function(s) {
const stack = []; // stack and set are in sync
const set = new Set(); // use set to make lookup faster
const lastPos = getLastPos(s);
let curVal;
let lastOnStack;
for (let i = 0; i < s.length; i++) {
curVal = s[i];
if (!set.has(curVal)) {
while(stack.length > 0 && stack[stack.length - 1] > curVal && lastPos[stack[stack.length - 1]] > i) {
set.delete(stack[stack.length - 1]);
stack.pop();
}
set.add(curVal);
stack.push(curVal);
}
}
return stack.join('');
};
const getLastPos = (s) => {
// get the last index of each unique character
const lastPosMap = {};
for (let i = 0; i < s.length; i++) {
lastPosMap[s[i]] = i;
}
return lastPosMap;
}
I was unsure what was mean't by:
...without using any additional buffer.
So I thought I would have a go at doing this in one loop, and let you tell me if it's wrong.
I have worked on the basis that the function you have provided gives the correct output, you were just looking for it to run faster. The function below gives the correct output and run's a lot faster with any large string with lots of duplication that I throw at it.
function removeDuplicates(originalString) {
let outputString = '';
let lastChar = '';
let lastCharOccurences = 1;
for (let char = 0; char < originalString.length; char++) {
outputString += originalString[char];
if (lastChar === originalString[char]) {
lastCharOccurences++;
continue;
}
if (lastCharOccurences > 1) {
outputString = outputString.slice(0, outputString.length - (lastCharOccurences + 1)) + originalString[char];
lastCharOccurences = 1;
}
lastChar = originalString[char];
}
console.log("The string with removed duplicates is:", outputString)
}
removeDuplicates("Hippotamuus")
Again, sorry if I have misunderstood the post...

Create a string of variable length, filled with a repeated character

So, my question has been asked by someone else in it's Java form here: Java - Create a new String instance with specified length and filled with specific character. Best solution?
. . . but I'm looking for its JavaScript equivalent.
Basically, I'm wanting to dynamically fill text fields with "#" characters, based on the "maxlength" attribute of each field. So, if an input has has maxlength="3", then the field would be filled with "###".
Ideally there would be something like the Java StringUtils.repeat("#", 10);, but, so far, the best option that I can think of is to loop through and append the "#" characters, one at a time, until the max length is reached. I can't shake the feeling that there is a more efficient way to do it than that.
Any ideas?
FYI - I can't simply set a default value in the input, because the "#" characters need to clear on focus, and, if the user didn't enter a value, will need to be "refilled" on blur. It's the "refill" step that I'm concerned with
The best way to do this (that I've seen) is
var str = new Array(len + 1).join( character );
That creates an array with the given length, and then joins it with the given string to repeat. The .join() function honors the array length regardless of whether the elements have values assigned, and undefined values are rendered as empty strings.
You have to add 1 to the desired length because the separator string goes between the array elements.
Give this a try :P
s = '#'.repeat(10)
document.body.innerHTML = s
ES2015 the easiest way is to do something like
'X'.repeat(data.length)
X being any string, data.length being the desired length.
see: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/repeat
Unfortunately although the Array.join approach mentioned here is terse, it is about 10X slower than a string-concatenation-based implementation. It performs especially badly on large strings. See below for full performance details.
On Firefox, Chrome, Node.js MacOS, Node.js Ubuntu, and Safari, the fastest implementation I tested was:
function repeatChar(count, ch) {
if (count == 0) {
return "";
}
var count2 = count / 2;
var result = ch;
// double the input until it is long enough.
while (result.length <= count2) {
result += result;
}
// use substring to hit the precise length target without
// using extra memory
return result + result.substring(0, count - result.length);
};
This is verbose, so if you want a terse implementation you could go with the naive approach; it still performs betweeb 2X to 10X better than the Array.join approach, and is also faster than the doubling implementation for small inputs. Code:
// naive approach: simply add the letters one by one
function repeatChar(count, ch) {
var txt = "";
for (var i = 0; i < count; i++) {
txt += ch;
}
return txt;
}
Further information:
Run speed test in your own browser
Full source code of speed test
Speed test results
I would create a constant string and then call substring on it.
Something like
var hashStore = '########################################';
var Fiveup = hashStore.substring(0,5);
var Tenup = hashStore.substring(0,10);
A bit faster too.
http://jsperf.com/const-vs-join
A great ES6 option would be to padStart an empty string. Like this:
var str = ''.padStart(10, "#");
Note: this won't work in IE (without a polyfill).
Version that works in all browsers
This function does what you want, and performs a lot faster than the option suggested in the accepted answer :
var repeat = function(str, count) {
var array = [];
for(var i = 0; i <= count;)
array[i++] = str;
return array.join('');
}
You use it like this :
var repeatedCharacter = repeat("a", 10);
To compare the performance of this function with that of the option proposed in the accepted answer, see this Fiddle and this Fiddle for benchmarks.
Version for moderns browsers only
In modern browsers, you can now also do this :
var repeatedCharacter = "a".repeat(10) };
This option is even faster. However, unfortunately it doesn't work in any version of Internet explorer.
The numbers in the table specify the first browser version that fully supports the method :
For Evergreen browsers, this will build a staircase based on an incoming character and the number of stairs to build.
function StairCase(character, input) {
let i = 0;
while (i < input) {
const spaces = " ".repeat(input - (i+1));
const hashes = character.repeat(i + 1);
console.log(spaces + hashes);
i++;
}
}
//Implement
//Refresh the console
console.clear();
StairCase("#",6);
You can also add a polyfill for Repeat for older browsers
if (!String.prototype.repeat) {
String.prototype.repeat = function(count) {
'use strict';
if (this == null) {
throw new TypeError('can\'t convert ' + this + ' to object');
}
var str = '' + this;
count = +count;
if (count != count) {
count = 0;
}
if (count < 0) {
throw new RangeError('repeat count must be non-negative');
}
if (count == Infinity) {
throw new RangeError('repeat count must be less than infinity');
}
count = Math.floor(count);
if (str.length == 0 || count == 0) {
return '';
}
// Ensuring count is a 31-bit integer allows us to heavily optimize the
// main part. But anyway, most current (August 2014) browsers can't handle
// strings 1 << 28 chars or longer, so:
if (str.length * count >= 1 << 28) {
throw new RangeError('repeat count must not overflow maximum string size');
}
var rpt = '';
for (;;) {
if ((count & 1) == 1) {
rpt += str;
}
count >>>= 1;
if (count == 0) {
break;
}
str += str;
}
// Could we try:
// return Array(count + 1).join(this);
return rpt;
}
}
Based on answers from Hogan and Zero Trick Pony. I think this should be both fast and flexible enough to handle well most use cases:
var hash = '####################################################################'
function build_string(length) {
if (length == 0) {
return ''
} else if (hash.length <= length) {
return hash.substring(0, length)
} else {
var result = hash
const half_length = length / 2
while (result.length <= half_length) {
result += result
}
return result + result.substring(0, length - result.length)
}
}
You can use the first line of the function as a one-liner if you like:
function repeat(str, len) {
while (str.length < len) str += str.substr(0, len-str.length);
return str;
}
I would do
Buffer.alloc(length, character).toString()
If it's performance you need (prior to ES6), then a combination of substr and a template string is probably best. This function is what I've used for creating space padding strings, but you can change the template to whatever you need:
function strRepeat(intLen, strTemplate) {
strTemplate = strTemplate || " ";
var strTxt = '';
while(intLen > strTemplate.length) {
strTxt += strTemplate;
intLen -= strTemplate.length;
}
return ((intLen > 0) ? strTxt + strTemplate.substr(0, intLen) : strTxt);
}

Most efficient way to generate a really long string (tens of megabytes) in JS

I find myself needing to synthesize a ridiculously long string (like, tens of megabytes long) in JavaScript. (This is to slow down a CSS selector-matching operation to the point where it takes a measurable amount of time.)
The best way I've found to do this is
var really_long_string = (new Array(10*1024*1024)).join("x");
but I'm wondering if there's a more efficient way - one that doesn't involve creating a tens-of-megabytes array first.
For ES6:
'x'.repeat(10*1024*1024)
The previously accepted version uses String.prototype.concat() which is vastly slower than using the optimized string concatenating operator, +. MDN also recommends to keep away from using it in speed critical code.
I have made three versions of the above code to show the speed differences in a JsPerf. Converting it to using only using concat is only a third as fast as only using the string concatenating operator (Chrome - your mileage will vary). The edited version below will run twice as fast in Chrome
var x = '1234567890'
var iterations = 14
for (var i = 0; i < iterations; i++) {
x += x + x
}
This is the more efficient algorithm for generating very long strings in javascript:
function stringRepeat(str, num) {
num = Number(num);
var result = '';
while (true) {
if (num & 1) { // (1)
result += str;
}
num >>>= 1; // (2)
if (num <= 0) break;
str += str;
}
return result;
}
more info here: http://www.2ality.com/2014/01/efficient-string-repeat.html.
Alternatively, in ECMA6 you can use String.prototype.repeat() method.
Simply accumulating is vastly faster in Safari 5:
var x = "1234567890";
var iterations = 14;
for (var i = 0; i < iterations; i++) {
x += x.concat(x);
}
alert(x.length); // 47829690
Essentially, you'll get x.length * 3^iterations characters.
Not sure if this is a great implementation, but here's a general function based on #oligofren's solution:
function repeat(ch, len) {
var result = ch;
var halfLength = len / 2;
while (result.length < len) {
if (result.length <= halfLength) {
result += result;
} else {
return result + repeat(ch, len - result.length);
}
}
return result;
}
This assumes that concatenating a large string is faster than a series of small strings.

JavaScript strings outside of the BMP

BMP being Basic Multilingual Plane
According to JavaScript: the Good Parts:
JavaScript was built at a time when Unicode was a 16-bit character set, so all characters in JavaScript are 16 bits wide.
This leads me to believe that JavaScript uses UCS-2 (not UTF-16!) and can only handle characters up to U+FFFF.
Further investigation confirms this:
> String.fromCharCode(0x20001);
The fromCharCode method seems to only use the lowest 16 bits when returning the Unicode character. Trying to get U+20001 (CJK unified ideograph 20001) instead returns U+0001.
Question: is it at all possible to handle post-BMP characters in JavaScript?
2011-07-31: slide twelve from Unicode Support Shootout: The Good, The Bad, & the (mostly) Ugly covers issues related to this quite well:
Depends what you mean by β€˜support’. You can certainly put non-UCS-2 characters in a JS string using surrogates, and browsers will display them if they can.
But, each item in a JS string is a separate UTF-16 code unit. There is no language-level support for handling full characters: all the standard String members (length, split, slice etc) all deal with code units not characters, so will quite happily split surrogate pairs or hold invalid surrogate sequences.
If you want surrogate-aware methods, I'm afraid you're going to have to start writing them yourself! For example:
String.prototype.getCodePointLength= function() {
return this.length-this.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g).length+1;
};
String.fromCodePoint= function() {
var chars= Array.prototype.slice.call(arguments);
for (var i= chars.length; i-->0;) {
var n = chars[i]-0x10000;
if (n>=0)
chars.splice(i, 1, 0xD800+(n>>10), 0xDC00+(n&0x3FF));
}
return String.fromCharCode.apply(null, chars);
};
I came to the same conclusion as bobince. If you want to work with strings containing unicode characters outside of the BMP, you have to reimplement javascript's String methods. This is because javascript counts characters as each 16-bit code value. Symbols outside of the BMP need two code values to be represented. You therefore run into a case where some symbols count as two characters and some count only as one.
I've reimplemented the following methods to treat each unicode code point as a single character: .length, .charCodeAt, .fromCharCode, .charAt, .indexOf, .lastIndexOf, .splice, and .split.
You can check it out on jsfiddle: http://jsfiddle.net/Y89Du/
Here's the code without comments. I tested it, but it may still have errors. Comments are welcome.
if (!String.prototype.ucLength) {
String.prototype.ucLength = function() {
// this solution was taken from
// http://stackoverflow.com/questions/3744721/javascript-strings-outside-of-the-bmp
return this.length - this.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g).length + 1;
};
}
if (!String.prototype.codePointAt) {
String.prototype.codePointAt = function (ucPos) {
if (isNaN(ucPos)){
ucPos = 0;
}
var str = String(this);
var codePoint = null;
var pairFound = false;
var ucIndex = -1;
var i = 0;
while (i < str.length){
ucIndex += 1;
var code = str.charCodeAt(i);
var next = str.charCodeAt(i + 1);
pairFound = (0xD800 <= code && code <= 0xDBFF && 0xDC00 <= next && next <= 0xDFFF);
if (ucIndex == ucPos){
codePoint = pairFound ? ((code - 0xD800) * 0x400) + (next - 0xDC00) + 0x10000 : code;
break;
} else{
i += pairFound ? 2 : 1;
}
}
return codePoint;
};
}
if (!String.fromCodePoint) {
String.fromCodePoint = function () {
var strChars = [], codePoint, offset, codeValues, i;
for (i = 0; i < arguments.length; ++i) {
codePoint = arguments[i];
offset = codePoint - 0x10000;
if (codePoint > 0xFFFF){
codeValues = [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
} else{
codeValues = [codePoint];
}
strChars.push(String.fromCharCode.apply(null, codeValues));
}
return strChars.join("");
};
}
if (!String.prototype.ucCharAt) {
String.prototype.ucCharAt = function (ucIndex) {
var str = String(this);
var codePoint = str.codePointAt(ucIndex);
var ucChar = String.fromCodePoint(codePoint);
return ucChar;
};
}
if (!String.prototype.ucIndexOf) {
String.prototype.ucIndexOf = function (searchStr, ucStart) {
if (isNaN(ucStart)){
ucStart = 0;
}
if (ucStart < 0){
ucStart = 0;
}
var str = String(this);
var strUCLength = str.ucLength();
searchStr = String(searchStr);
var ucSearchLength = searchStr.ucLength();
var i = ucStart;
while (i < strUCLength){
var ucSlice = str.ucSlice(i,i+ucSearchLength);
if (ucSlice == searchStr){
return i;
}
i++;
}
return -1;
};
}
if (!String.prototype.ucLastIndexOf) {
String.prototype.ucLastIndexOf = function (searchStr, ucStart) {
var str = String(this);
var strUCLength = str.ucLength();
if (isNaN(ucStart)){
ucStart = strUCLength - 1;
}
if (ucStart >= strUCLength){
ucStart = strUCLength - 1;
}
searchStr = String(searchStr);
var ucSearchLength = searchStr.ucLength();
var i = ucStart;
while (i >= 0){
var ucSlice = str.ucSlice(i,i+ucSearchLength);
if (ucSlice == searchStr){
return i;
}
i--;
}
return -1;
};
}
if (!String.prototype.ucSlice) {
String.prototype.ucSlice = function (ucStart, ucStop) {
var str = String(this);
var strUCLength = str.ucLength();
if (isNaN(ucStart)){
ucStart = 0;
}
if (ucStart < 0){
ucStart = strUCLength + ucStart;
if (ucStart < 0){ ucStart = 0;}
}
if (typeof(ucStop) == 'undefined'){
ucStop = strUCLength - 1;
}
if (ucStop < 0){
ucStop = strUCLength + ucStop;
if (ucStop < 0){ ucStop = 0;}
}
var ucChars = [];
var i = ucStart;
while (i < ucStop){
ucChars.push(str.ucCharAt(i));
i++;
}
return ucChars.join("");
};
}
if (!String.prototype.ucSplit) {
String.prototype.ucSplit = function (delimeter, limit) {
var str = String(this);
var strUCLength = str.ucLength();
var ucChars = [];
if (delimeter == ''){
for (var i = 0; i < strUCLength; i++){
ucChars.push(str.ucCharAt(i));
}
ucChars = ucChars.slice(0, 0 + limit);
} else{
ucChars = str.split(delimeter, limit);
}
return ucChars;
};
}
More recent JavaScript engines have String.fromCodePoint.
const ideograph = String.fromCodePoint( 0x20001 ); // outside the BMP
Also a code-point iterator, which gets you the code-point length.
function countCodePoints( str )
{
const i = str[Symbol.iterator]();
let count = 0;
while( !i.next().done ) ++count;
return count;
}
console.log( ideograph.length ); // gives '2'
console.log( countCodePoints(ideograph) ); // '1'
Yes, you can. Although support to non-BMP characters directly in source documents is optional according to the ECMAScript standard, modern browsers let you use them. Naturally, the document encoding must be properly declared, and for most practical purposes you would need to use the UTF-8 encoding. Moreover, you need an editor that can handle UTF-8, and you need some input method(s); see e.g. my Full Unicode Input utility.
Using suitable tools and settings, you can write var foo = '𠀁'.
The non-BMP characters will be internally represented as surrogate pairs, so each non-BMP character counts as 2 in the string length.
Using for (c of this) instruction, one can make various computations on a string that contains non-BMP characters. For instance, to compute the string length, and to get the nth character of the string:
String.prototype.magicLength = function()
{
var c, k;
k = 0;
for (c of this) // iterate each char of this
{
k++;
}
return k;
}
String.prototype.magicCharAt = function(n)
{
var c, k;
k = 0;
for (c of this) // iterate each char of this
{
if (k == n) return c + "";
k++;
}
return "";
}
This old topic has now a simple solution in ES6:
Split characters into an array
simple version
[..."πŸ˜΄πŸ˜„πŸ˜ƒβ›”πŸŽ πŸš“πŸš‡"] // ["😴", "πŸ˜„", "πŸ˜ƒ", "β›”", "🎠", "πŸš“", "πŸš‡"]
Then having each one separated you can handle them easily for most common cases.
Credit: DownGoat
Full solution
To overcome special emojis as the one in the comment, one can search for the connection charecter (char code 8205 in UTF-16) and make some modifications. Here is how:
let myStr = "πŸ‘©β€πŸ‘©β€πŸ‘§β€πŸ‘§πŸ˜ƒπŒ†"
let arr = [...myStr]
for (i = arr.length-1; i--; i>= 0) {
if (arr[i].charCodeAt(0) == 8205) { // special combination character
arr[i-1] += arr[i] + arr[i+1]; // combine them back to a single emoji
arr.splice(i, 2)
}
}
console.log(arr.length) //3
Haven't found a case where this doesn't work. Comment if you do.
To conclude
it seems that JS uses the 8205 char code to represent UCS-2 characters as a UTF-16 combinations.

Categories

Resources