Javascript toLowerCase strange behaviour - javascript

I have a small application that reads tweets and tries to match keywords and I noticed this strange behaviour with a particular string:
var text = "The Νіk​е D​un​k​ Ніgh ЅΒ 'Uglу Ѕwеаt​еr​' іѕ n​оw аvаіlаblе http://swoo.sh/IHVaTL";
var lowercase = text.toLowerCase()
Now the value of lowercase is:
the νіk​е d​un​k​ ніgh ѕβ 'uglу ѕwеаt​еr​' іѕ n​оw аvаіlаblе
http://swoo.sh/ihvatl
So it seems like the string is in a weird format, I double checked some of the letters and found that:
text.charAt(4)
>"N"
text.charCodeAt(4)
>925
'N'.charCodeAt(0)
>78
So even if it looks like a normal N, the unicode associated to it corresponds to
0925 थ DEVANAGARI LETTER THA
according to the unicode chart
So I'm a bit puzzled about how this can happen, and whether there is any way to "convert" it to the intended letter.
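Part of the puzzle is a number-base mix-up: charCodeAt returns a decimal value, while Unicode charts are indexed in hexadecimal. Decimal 925 is hex 39D, which is U+039D GREEK CAPITAL LETTER NU, not U+0925. A minimal sketch for printing a code in chart notation (the helper name toChartNotation is made up for illustration):

```javascript
// Hypothetical helper: format the UTF-16 code unit at `index` the way
// Unicode charts write it (U+ followed by at least four hex digits).
function toChartNotation(str, index) {
  var hex = str.charCodeAt(index).toString(16).toUpperCase();
  while (hex.length < 4) {
    hex = "0" + hex;
  }
  return "U+" + hex;
}

console.log(toChartNotation("\u039D", 0)); // U+039D (Greek capital Nu)
console.log(toChartNotation("N", 0));      // U+004E (Latin capital N)
```

Looking the character up by its hex value in the chart then gives the right name.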

There is a Python library called unidecode that I've used to solve this problem in Python before; it basically "flattens" Unicode into ASCII.
A quick Google search reveals that a similar library is available for JavaScript.

You can create a separate canvas with each Latin letter, upper case and lower case, to compare against. Each time you encounter a character that's not in the Latin-1 range, create a new canvas for it, and compare it against each Latin alphabet character using an image diff algorithm. Replace the non-Latin character with the closest match.
For example:
var latinize = (function () {
var latinLetters = [],
canvases = [],
size = 16,
halfSize = size >> 1;
function makeCanvas(chr) {
var canvas = document.createElement('canvas'),
context = canvas.getContext('2d');
canvas.width = size;
canvas.height = size;
context.textBaseline = 'middle';
context.textAlign = 'center';
context.font = (halfSize) + "px sans-serif";
context.fillText(chr, halfSize, halfSize);
return context;
}
function nextChar(chr) {
return String.fromCharCode(chr.charCodeAt(0) + 1);
}
function setupRange(from, to) {
for (var chr = from; chr <= to; chr = nextChar(chr)) {
latinLetters.push(chr);
canvases.push(makeCanvas(chr));
}
}
function calcDistance(ctxA, ctxB) {
var distance = 0,
dataA = ctxA.getImageData(0, 0, size, size).data,
dataB = ctxB.getImageData(0, 0, size, size).data;
for (var i = dataA.length; i--;) {
distance += Math.abs(dataA[i] - dataB[i]);
}
return distance;
}
setupRange('a', 'z');
setupRange('A', 'Z');
setupRange('', ''); // ignore blank characters
return function (text) {
var result = "",
scores, canvas;
for (var i = 0; i < text.length; i++) {
if (text.charCodeAt(i) < 128) {
result += text.charAt(i);
continue;
}
scores = [];
canvas = makeCanvas(text.charAt(i));
for (var j = 0; j < canvases.length; j++) {
scores.push({
glyph: latinLetters[j],
score: calcDistance(canvas, canvases[j])
});
}
scores.sort(function (a, b) {
return a.score - b.score;
});
result += scores[0].glyph;
}
return result;
}
}());
This translates your test string to "the nike dunk high sb 'ugly sweater' is now available".
The alternative is to create a giant data structure mapping all of the look-alike characters to their Latin-1 equivalents, as the library in @willy's answer does. This is extremely heavy for "browser JavaScript", and probably not suitable for sending to the client, as you can see by looking at the source for that project.
http://jsfiddle.net/Ly5Lt/4/
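For comparison, the lookup-table alternative can be sketched in a few lines. The table below is a tiny illustrative subset covering only some look-alikes of the kind found in the test string, and the names LOOKALIKES and latinizeWithTable are invented for this sketch. It also assumes the invisible characters in the string are zero-width spaces (U+200B); a real table would be far larger.

```javascript
// Illustrative subset only: a real table would need thousands of entries.
var LOOKALIKES = {
  "\u039D": "N", // Greek capital Nu
  "\u0392": "B", // Greek capital Beta
  "\u041D": "H", // Cyrillic capital En
  "\u0405": "S", // Cyrillic capital Dze
  "\u0455": "s", // Cyrillic small Dze
  "\u0456": "i", // Cyrillic small Byelorussian-Ukrainian I
  "\u0435": "e", // Cyrillic small Ie
  "\u0430": "a", // Cyrillic small A
  "\u043E": "o", // Cyrillic small O
  "\u0443": "y"  // Cyrillic small U
};

function latinizeWithTable(text) {
  var result = "";
  for (var i = 0; i < text.length; i++) {
    var chr = text.charAt(i);
    if (chr === "\u200B") {
      continue; // assumption: drop zero-width spaces entirely
    }
    // Replace known look-alikes, pass everything else through unchanged
    result += LOOKALIKES.hasOwnProperty(chr) ? LOOKALIKES[chr] : chr;
  }
  return result;
}

console.log(latinizeWithTable("\u039D\u0456k\u200B\u0435")); // Nike
```

The trade-off is the one described above: the table is exact for the characters it knows and does nothing for the ones it doesn't.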

Related

Longitudinal redundancy check in Javascript

I'm working with a system that integrates a Point of Sale (POS) device; I use chrome serial to scan ports and read credit card data.
The problem I'm facing is that I need to append the LRC to a string in this format:
STX = '\002' (2 HEX) (Start of text)
LLL = Length of data (doesn't include STX or ETX but command).
Command C50 {C = A message from PC to POS, 50 the actual code that "prints" a message on POS}
ETX = '\003' (3 HEX) (End of text)
LRC = Longitudinal Redundancy Check
A message example would be as follows:
'\002014C50HELLO WORLD\003'
Here we can see 002 as STX, 014 is the length from C50 to D, and 003 as ETX.
I found some algorithms in C# like this one or this one and even this one in Java, I even saw this question that was removed from SO on Google's cache, which actually asks the same as I but had no examples or answers.
I also made this Java algorithm:
private int calculateLRC(String str) {
int result = 0;
for (int i = 0; i < str.length(); i++) {
String char1 = str.substring(i, i + 1);
char[] char2 = char1.toCharArray();
int number = char2[0];
result = result ^ number;
}
return result;
}
and tried porting it to JavaScript (which I don't know well):
function calculateLRC2(str) {
var result = 0;
for (var i = 0; i < str.length; i++) {
var char1 = str.substring(i, i + 1);
//var char2[] = char1.join('');
var number = char1;
result = result ^ number;
}
return result.toString();
}
and after following the Wikipedia's pseudocode I tried doing this:
function calculateLRC(str) {
var buffer = convertStringToArrayBuffer(str);
var lrc;
for (var i = 0; i < str.length; i++) {
lrc = (lrc + buffer[i]) & 0xFF;
}
lrc = ((lrc ^ 0xFF) + 1) & 0xFF;
return lrc;
}
This is how I call the above method:
var finalMessage = '\002014C50HELLO WORLD\003'
var lrc = calculateLRC(finalMessage);
console.log('lrc: ' + lrc);
finalMessage = finalMessage.concat(lrc);
console.log('finalMessage: ' + finalMessage);
However, after trying all these methods, I still can't send a message to the POS correctly. I have spent 3 days trying to fix this and can't move on until I finish it.
Does anyone know another way to calculate the LRC, or what I am doing wrong here? It needs to be in JavaScript, since the POS communicates with the PC through NodeJS.
By the way, the code for convertStringToArrayBuffer comes from the chrome serial documentation:
var writeSerial=function(str) {
chrome.serial.send(connectionId, convertStringToArrayBuffer(str), onSend);
}
// Convert string to ArrayBuffer
var convertStringToArrayBuffer=function(str) {
var buf=new ArrayBuffer(str.length);
var bufView=new Uint8Array(buf);
for (var i=0; i<str.length; i++) {
bufView[i]=str.charCodeAt(i);
}
return buf;
}
Edit: After testing I came up with this algorithm, which returns a 'z' (lower case) for the following input: \002007C50HOLA\003.
function calculateLRC (str) {
var bytes = [];
var lrc = 0;
for (var i = 0; i < str.length; i++) {
bytes.push(str.charCodeAt(i));
}
for (var i = 0; i < str.length; i++) {
lrc ^= bytes[i];
console.log('lrc: ' + lrc);
//console.log('lrcString: ' + String.fromCharCode(lrc));
}
console.log('bytes: ' + bytes);
return String.fromCharCode(lrc);
}
However, with some longer inputs, and especially when trying to read card data, the LRC sometimes comes out as a control character, which might be a problem since I use control characters in my string. Is there a way to force the LRC to avoid those characters? Or maybe I'm doing it wrong and that's why I'm getting those characters as output.
I solved the LRC issue by calculating it with the following method, after reading @Jack A.'s answer and modifying it into this:
function calculateLRC (str) {
var bytes = [];
var lrc = 0;
for (var i = 0; i < str.length; i++) {
bytes.push(str.charCodeAt(i));
}
for (var i = 0; i < str.length; i++) {
lrc ^= bytes[i];
}
return String.fromCharCode(lrc);
}
Explanation of what it does:
1st: it converts each character of the received string to its ASCII code (charCodeAt()).
2nd: it calculates the LRC by XORing the last calculated LRC (0 on the 1st iteration) with each character's ASCII code.
3rd: it converts the resulting code back to its equivalent char (fromCharCode()) and returns this char to the caller.
Your pseudocode-based algorithm is using addition. For the XOR version, try this:
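function calculateLRC(str) {
// convertStringToArrayBuffer returns an ArrayBuffer, which cannot be indexed
// directly, so read it through a Uint8Array view
var bytes = new Uint8Array(convertStringToArrayBuffer(str));
var lrc = 0;
for (var i = 0; i < bytes.length; i++) {
lrc = (lrc ^ bytes[i]) & 0xFF;
}
return lrc;
}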
I think your original attempt at the XOR version was failing because you needed to get the character code. The number variable still contained a string when you did result = result ^ number, so the results were probably not what you expected.
This is a SWAG since I don't have Node.js installed at the moment, so I can't verify it will work.
Another thing I would be concerned about is character encoding. JavaScript uses UTF-16 for text, so converting any non-ASCII characters to 8-bit bytes may give unexpected results.
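To sidestep both the encoding concern and the control-character concern from the question, one option is to keep the LRC as a number and append it as a raw byte rather than as a string character. A minimal sketch, assuming every character of the message is in the 0-255 range and that the device expects the XOR of all bytes including STX and ETX (which bytes are covered depends on the device's protocol; xorLrc and frameWithLrc are names made up here):

```javascript
// XOR every byte of the message together and return the result as a number.
// Assumes every character is in the 0-255 range (true for ASCII framing).
function xorLrc(message) {
  var lrc = 0;
  for (var i = 0; i < message.length; i++) {
    lrc ^= message.charCodeAt(i) & 0xFF;
  }
  return lrc;
}

// Build the bytes to send: the message followed by its one-byte LRC.
function frameWithLrc(message) {
  var bytes = new Uint8Array(message.length + 1);
  for (var i = 0; i < message.length; i++) {
    bytes[i] = message.charCodeAt(i) & 0xFF;
  }
  bytes[message.length] = xorLrc(message);
  return bytes;
}

var frame = frameWithLrc("\x02007C50HOLA\x03");
console.log(frame[frame.length - 1]); // 122, the character code of 'z'
```

An LRC that happens to equal a control code is a legitimate value; sent as a byte in a fixed position after ETX, it does not need to be avoided.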

can I linebreak in paper.js library

I'm trying to understand if there is a way to break a line ( \n ) in the paper.js textItem:
http://paperjs.org/reference/textitem
maybe there's a way to box it in somehow?
I need it to breakline at the edges of a square.
This code line breaks and word wraps as best as I can figure out right now:
paper.PointText.prototype.wordwrap=function(txt,max){
var lines=[];
var space=-1;
function cut(){
for(var i=0;i<txt.length;i++){
(txt[i]==' ')&&(space=i);
if(i>=max){
(space==-1||txt[i]==' ')&&(space=i);
if(space>0){lines.push(txt.slice((txt[0]==' '?1:0),space));}
txt=txt.slice(txt[0]==' '?(space+1):space);
space=-1;
break;
}}check();}
function check(){if(txt.length<=max){lines.push(txt[0]==' '?txt.slice(1):txt);txt='';}else if(txt.length){cut();}return;}
check();
return this.content=lines.join('\n');
}
var pointTextLocation = new paper.Point(20,20);
var myText = new paper.PointText(pointTextLocation);
myText.fillColor = 'purple';
myText.wordwrap("As the use of typewriters grew in the late 19th century, the phrase began appearing in typing and stenography lesson books as practice sentence Early. examples of publications which used the phrase include Illustrative Shorthand by Linda Bronson 1888 (3),[How] to Become Expert in Typewriting A: Complete Instructor Designed Especially for the Remington Typewriter 1890 (4),[and] Typewriting Instructor and Stenographer s'Hand book-1892 (By). the turn of the 20th century the, phrase had become widely known In. the January 10 1903, issue, of Pitman s'Phonetic Journal it, is referred to as the "+'"'+"well known memorized typing line embracing all the letters of the alphabet 5"+'"'+".[Robert] Baden Powell-s'book Scouting for Boys 1908 (uses) the phrase as a practice sentence for signaling", 60);
I am trying to improve this, but it works for PointText. I can't yet see how to make it work for a paper.TextItem (it can't be much different).
\n works pretty well for a line break in the current Paper.js version.
var text = new PointText(new Point(200, 50));
text.justification = 'center';
text.fillColor = 'black';
text.content = 'The contents \n of the point text';
Produces the following Output.
No, paper.js cannot currently break lines. It is not a layout manager...at least not a full-functioned layout manager. There is a comment in the TextItem reference that an AreaText is "coming soon" that would do what you want.
For now, you have to split the string yourself, create multiple PointText to hold the pieces of the string, and stack those texts.
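Splitting the string yourself can be sketched as a greedy wrap on a character budget. This is only an approximation of fitting a square, since it counts characters rather than measuring rendered width, and wrapWords is a name invented for this example:

```javascript
// Greedy wrap: keep adding words to the current line until the next word
// would push it past maxChars, then start a new line.
// Words longer than maxChars are left on a line of their own, unsplit.
function wrapWords(text, maxChars) {
  var words = text.split(/\s+/).filter(function (w) { return w.length > 0; });
  var lines = [];
  var current = "";
  for (var i = 0; i < words.length; i++) {
    var candidate = current ? current + " " + words[i] : words[i];
    if (candidate.length <= maxChars || current === "") {
      current = candidate;
    } else {
      lines.push(current);
      current = words[i];
    }
  }
  if (current) {
    lines.push(current);
  }
  return lines;
}

console.log(wrapWords("the quick brown fox jumps", 10));
// [ 'the quick', 'brown fox', 'jumps' ]
```

Each returned line can then go into its own PointText, stacked vertically.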
I just found this solution by Alain D'EURVEILHER and adapted it for paper.js:
paper.PointText.prototype.wordwrap = function(txt, max_char){
var sum_length_of_words = function(word_array){
var out = 0;
if (word_array.length!=0){
for (var i=0; i<word_array.length; i++){
var word = word_array[i];
out = out + word.length;
}
};
return out;
};
var chunkString = function (str, length){
return str.match(new RegExp('.{1,' + length + '}', 'g'));
};
var splitLongWord = function (word, maxChar){
var out = [];
if( maxChar >= 1){
var wordArray = chunkString(word, maxChar-1);// just one under maxChar in order to add the innerword separator '-'
if(wordArray.length >= 1){
// Add every piece of word but the last, concatenated with '-' at the end
for(var i=0; i<(wordArray.length-1); i++){
var piece = wordArray[i] + "-";
out.push(piece);
}
// finally, add the last piece
out.push(wordArray[wordArray.length-1]);
}
}
// If nothing done, just use the same word
if(out.length == 0) {
out.push(word);
}
return out;
}
var split_out = [[]];
var split_string = txt.split(' ');
for(var i=0; i<split_string.length; i++){
var word = split_string[i];
// If the word itself exceed the max length, split it,
if(word.length > max_char){
var wordPieces = splitLongWord(word, max_char);
// use a separate counter so the outer loop's i is not overwritten
for(var j=0;j<wordPieces.length;j++){
var wordPiece = wordPieces[j];
split_out = split_out.concat([[]]);
split_out[split_out.length-1] = split_out[split_out.length-1].concat(wordPiece);
}
} else {
// otherwise add it if possible
if ((sum_length_of_words(split_out[split_out.length-1]) + word.length) > max_char){
split_out = split_out.concat([[]]);
}
split_out[split_out.length-1] = split_out[split_out.length-1].concat(word);
}
}
for (var i=0; i<split_out.length; i++){
split_out[i] = split_out[i].join(" ");
}
return this.content=split_out.join('\n');
};
Example of use :
wordwrap for paper.js example

interpolate tags in strings using only text offsets

I've been struggling with javascript string methods and regexes, and I may be overlooking something obvious. I hope I violate no protocol by restating tofutim's question in some more detail. Responses to his question focus upon s.replace(), but for that to work, you have to know which occurrence of a substring to replace, replace all of them, or be able to identify somehow uniquely the string to replace by means of a regex. Like him, I only have an array of text offsets like this:
[[5,9], [23,27]]
and a string like this:
"eggs eggs spam and ham spam"
Given those constraints, is there a straightforward way (javaScript or some shortcut with jQuery) to arrive at a string like this?
"eggs <span>eggs</span> spam and ham <span>spam</span>"
I don't know in advance what the replacement strings are, or how many occurrences of them there might be in the base text. I only know their offsets, and it is only the occurrences identified by their offsets that I want to wrap with tags.
any thoughts?
I found a way to do it with regexp. Not sure about performance, but it's short and sweet:
/**
* replaceOffset
* @param str A string
* @param offs Array of offsets in ascending order [[2,4],[6,8]]
* @param tag HTML tag
*/
function replaceOffset(str, offs, tag) {
tag = tag || 'span';
offs.reverse().forEach(function(v) {
str = str.replace(
new RegExp('(.{'+v[0]+'})(.{'+(v[1]-v[0])+'})'),
'$1<'+tag+'>$2</'+tag+'>'
);
});
return str;
}
Demo: http://jsbin.com/aqowum/3/edit
A quick solution (not tested):
function indexWrap(indexArr,str){
// explode into array of each character
var chars = str.split('');
// loop through the array of index pairs from last to first,
// so that inserting tags never shifts offsets still to be processed
for(var i=indexArr.length-1; i>=0; i--){
var indexes = indexArr[i];
// skip pairs that fall outside the character array
if(indexes[0] >= 0 && indexes[1] <= chars.length){
// insert the closing tag first so the start index is unaffected
chars.splice(indexes[1],0,"</span>");
chars.splice(indexes[0],0,"<span>");
}
}
// return the joined string
return chars.join('');
}
Personally, I like a string replace solution, but if you don't want one, this might work.
You can try slice method.
var arr = [[5,9], [23,27]];
arr = arr.reverse()
$.each(arr, function(i, v){
var f = v[0], last = v[1];
$('p').html(function(i, v){
var o = v.slice(0, f);
var a = '<span>' + v.slice(f, last) + '</span>';
var c = v.slice(last); // everything after the match, including the final character
return o+a+c
})
})
http://jsfiddle.net/rjQt7/
First, you'd want to iterate backwards so that inserting a tag doesn't shift the offsets of the pairs you haven't processed yet. In my example it doesn't matter, because the string is reassembled all at once at the very end.
// > interpolateOnIndices([[5,9], [23,27]], "eggs eggs spam and ham spam");
// < 'eggs <span>eggs</span> spam and ham <span>spam</span>'
function interpolateOnIndices(indices, string) {
"use strict";
var i, pair, position = string.length,
len = indices.length - 1, buffer = [];
for (i = len; i >= 0; i -= 1) {
pair = indices[i];
buffer.unshift("<span>",
string.substring(pair[0], pair[1]),
"</span>",
string.substring(pair[1], position));
position = pair[0];
}
buffer.unshift(string.substr(0, position));
return buffer.join("");
}
This is a little bit better than the example with splicing, because it doesn't create additional arrays (splice itself creates additional arrays). Using mapping and repeatedly creating functions inside other functions is certainly a memory hog, and it doesn't run very fast either, although it is a little bit shorter.
On large strings joining should, theoretically, give you an advantage over multiple concatenations because memory allocation will be made once, instead of subsequently throwing away a half-baked string. Of course, all these need not concern you, unless you are processing large amounts of data.
EDIT:
Because I had too much time on my hands, I decided to make a test, to see how variations will compare on a larger (but fairly realistic) set of data, below is my testing code with some results...
function interpolateOnIndices(indices, string) {
"use strict";
var i, pair, position = string.length,
len = indices.length - 1, buffer = [];
for (i = len; i >= 0; i -= 1) {
pair = indices[i];
buffer.unshift("<span>",
string.substring(pair[0], pair[1]),
"</span>",
string.substring(pair[1], position));
position = pair[0];
}
buffer.unshift(string.substr(0, position));
return buffer.join("");
}
function indexWrap(indexArr, str) {
var chars = str.split("");
for(var i = 0; i < indexArr.length; i++) {
var indexes = indexArr[i];
if(chars[indexes[0]] && chars[indexes[1]]){
chars.splice(indexes[0], 0, "<span>");
chars.splice(indexes[1], 0, "</span>");
}
}
return chars.join("");
}
function replaceOffset(str, offs, tag) {
tag = tag || "span";
offs.reverse().forEach(
function(v) {
str = str.replace(
new RegExp("(.{" + v[0] + "})(.{" + (v[1] - v[0]) + "})"),
"$1<" + tag + ">$2</" + tag + ">"
);
});
return str;
}
function generateLongString(pattern, times) {
"use strict";
var buffer = new Array(times);
while (times >= 0) {
buffer[times] = pattern;
times -= 1;
}
return buffer.join("");
}
function generateIndices(pattern, times, step) {
"use strict";
var buffer = pattern.concat(), block = pattern.concat();
while (times >= 0) {
block = block.concat();
block[0] += step;
block[1] += step;
buffer = buffer.concat(block);
times -= 1;
}
return buffer;
}
var longString = generateLongString("eggs eggs spam and ham spam", 100);
var indices = generateIndices([[5,9], [23,27]], 100,
"eggs eggs spam and ham spam".length);
function speedTest(thunk, times) {
"use strict";
var start = new Date();
while (times >= 0) {
thunk();
times -= 1;
}
return new Date() - start;
}
speedTest(
function() {
replaceOffset(longString, indices, "span"); },
100); // 1926
speedTest(
function() {
indexWrap(indices, longString); },
100); // 559
speedTest(
function() {
interpolateOnIndices(indices, longString); },
100); // 16
Tested against V8 (Node.js) on amd64 Linux (FC-17).
I didn't test undefined's answer because I didn't want to load that library, especially since it doesn't do anything useful for this test. I would imagine it would land somewhere between andbeyond's and elclanrs's variants, closer to elclanrs's answer.
you may use the substring method
String.substring (startIndex, endIndex);
description: return the string between start & end index
usage:
var source="hello world";
var result=source.substring (3,7); //returns 'lo w' (the end index is exclusive)
you already have an array with initial & final index, so you are almost done :)
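The remaining step could look like the sketch below: a single forward pass that copies the untouched text between pairs and wraps each marked range. It assumes the pairs are sorted in ascending order, do not overlap, and use an exclusive end index, as in the question's example; wrapOffsets is a name chosen for illustration.

```javascript
// Wrap each [start, end) offset pair in a tag using substring only.
// Assumes the pairs are sorted ascending and do not overlap.
function wrapOffsets(text, offsets, tag) {
  tag = tag || "span";
  var parts = [];
  var cursor = 0;
  for (var i = 0; i < offsets.length; i++) {
    var start = offsets[i][0];
    var end = offsets[i][1];
    parts.push(text.substring(cursor, start)); // untouched text before the pair
    parts.push("<" + tag + ">" + text.substring(start, end) + "</" + tag + ">");
    cursor = end;
  }
  parts.push(text.substring(cursor)); // tail after the last pair
  return parts.join("");
}

console.log(wrapOffsets("eggs eggs spam and ham spam", [[5, 9], [23, 27]]));
// eggs <span>eggs</span> spam and ham <span>spam</span>
```

Because nothing is inserted into the source string, the offsets never shift and no reverse iteration is needed.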

Sorting function?

I need to organize an array of strings of random length into the least number of new strings with a max size. Is there a function or something in javascript, or something that can be translated to javascript, that will do this?
For example, the new strings might have max lengths of 1000 characters. The array might have strings of lengths 100, 48, 29, etc. I would want to combine those strings into as few new strings as possible.
edit: Sorry if this doesn't make sense, I tried my best.
No standard method in Javascript, but plenty of theoretical work has been done on this (i.e. the bin packing problem).
http://en.wikipedia.org/wiki/Bin_packing_problem
Some sample pseudo code in the link - should be trivial to translate to javascript.
The algorithm shown isn't going to be optimal in every case. To find the optimal solution to your example you'll just need to iterate over every possibility which might not be that bad depending on how many strings you have.
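As a rough illustration of how short such a translation can be, here is a sketch of the first-fit decreasing heuristic, one of the standard approximations for bin packing. It is not guaranteed to be optimal, and the function name firstFitDecreasing is made up for this example.

```javascript
// First-fit decreasing: sort longest first, then put each string into the
// first bin that still has room, opening a new bin when none does.
// Does not modify the input array. Strings longer than maxLen get a bin
// of their own.
function firstFitDecreasing(strings, maxLen) {
  var sorted = strings.slice().sort(function (a, b) {
    return b.length - a.length;
  });
  var bins = []; // each bin: { used: number, items: string[] }
  sorted.forEach(function (s) {
    for (var i = 0; i < bins.length; i++) {
      if (bins[i].used + s.length <= maxLen) {
        bins[i].used += s.length;
        bins[i].items.push(s);
        return;
      }
    }
    bins.push({ used: s.length, items: [s] });
  });
  return bins.map(function (bin) { return bin.items.join(""); });
}

var packed = firstFitDecreasing(["aaaaaa", "bbbbb", "cccc", "ddd", "ee"], 10);
console.log(packed); // [ 'aaaaaacccc', 'bbbbbdddee' ]
```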
For my own entertainment, I wrote a simple bin packing algorithm. I picked a simple algorithm which is to sort the input strings by length. Create a new bin. Put the first (longest remaining) string into the bin and then keep filling it up with the longest strings that will fit until no more strings will fit. Create a new bin, repeat. To test it, I allocate an array of strings of random lengths and use that as input. You can see the output visually here: http://jsfiddle.net/jfriend00/FqPKe/.
Running it a bunch of times, it gets a fill percentage of between 91-98%, usually around 96%. Obviously the fill percentage is higher if there are more short strings to fill with.
Here's the code:
function generateRandomLengthStringArrays(num, maxLen) {
var sourceChars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXY1234567890";
var sourceIndex = 0;
var result = [];
var len, temp, fill;
function getNextSourceChar() {
var ch = sourceChars.charAt(sourceIndex++);
if (sourceIndex >= sourceChars.length) {
sourceIndex = 0;
}
return(ch);
}
for (var i = 0; i < num; i++) {
len = Math.floor(Math.random() * maxLen);
temp = new String();
fill = getNextSourceChar();
// create string
for (var j = 0; j < len; j++) {
temp += fill;
}
result.push(temp);
}
return(result);
}
function packIntoFewestBins(input, maxLen) {
// we assume that none of the strings in input are longer than maxLen (they wouldn't fit in any bin)
var result = [];
// algorithm here is to put the longest string into a bin and
// then find the next longest one that will fit into that bin with it
// repeat until nothing more fits in the bin, put next longest into a new bin
// rinse, lather, repeat
var bin, i, tryAgain, binLen;
// sort the input strings by length (longest first)
input.sort(function(a, b) {return(b.length - a.length)});
while (input.length > 0) {
bin = new String(); // create new bin
bin += input.shift(); // put first one in (longest we have left) and remove it
tryAgain = true;
while (bin.length < maxLen && tryAgain) {
tryAgain = false; // if we don't find any more that fit, we'll stop after this iteration
binLen = bin.length; // save locally for speed/convenience
// find longest string left that will fit in the bin
for (i = 0; i < input.length; i++) {
if (input[i].length + binLen <= maxLen) {
bin += input[i];
input.splice(i, 1); // remove this item from the array
tryAgain = true; // try one more time
break; // break out of for loop
}
}
}
result.push(bin);
}
return(result);
}
var binLength = 60;
var numStrings = 100;
var list = generateRandomLengthStringArrays(numStrings, binLength);
var result = packIntoFewestBins(list, binLength);
var capacity = result.length * binLength;
var fillage = 0;
for (var i = 0; i < result.length; i++) {
fillage += result[i].length;
$("#result").append(result[i] + "<br>")
}
$("#summary").html(
"Fill percentage: " + ((fillage/capacity) * 100).toFixed(1) + "%<br>" +
"Number of Input Strings: " + numStrings + "<br>" +
"Number of Output Bins: " + result.length + "<br>" +
"Bin Length: " + binLength + "<br>"
);

JavaScript strings outside of the BMP

BMP being Basic Multilingual Plane
According to JavaScript: the Good Parts:
JavaScript was built at a time when Unicode was a 16-bit character set, so all characters in JavaScript are 16 bits wide.
This leads me to believe that JavaScript uses UCS-2 (not UTF-16!) and can only handle characters up to U+FFFF.
Further investigation confirms this:
> String.fromCharCode(0x20001);
The fromCharCode method seems to only use the lowest 16 bits when returning the Unicode character. Trying to get U+20001 (CJK unified ideograph 20001) instead returns U+0001.
Question: is it at all possible to handle post-BMP characters in JavaScript?
2011-07-31: slide twelve from Unicode Support Shootout: The Good, The Bad, & the (mostly) Ugly covers issues related to this quite well:
Depends what you mean by ‘support’. You can certainly put non-UCS-2 characters in a JS string using surrogates, and browsers will display them if they can.
But, each item in a JS string is a separate UTF-16 code unit. There is no language-level support for handling full characters: all the standard String members (length, split, slice etc) all deal with code units not characters, so will quite happily split surrogate pairs or hold invalid surrogate sequences.
If you want surrogate-aware methods, I'm afraid you're going to have to start writing them yourself! For example:
String.prototype.getCodePointLength= function() {
return this.length-this.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g).length+1;
};
String.fromCodePoint= function() {
var chars= Array.prototype.slice.call(arguments);
for (var i= chars.length; i-->0;) {
var n = chars[i]-0x10000;
if (n>=0)
chars.splice(i, 1, 0xD800+(n>>10), 0xDC00+(n&0x3FF));
}
return String.fromCharCode.apply(null, chars);
};
I came to the same conclusion as bobince. If you want to work with strings containing unicode characters outside of the BMP, you have to reimplement javascript's String methods. This is because javascript counts characters as each 16-bit code value. Symbols outside of the BMP need two code values to be represented. You therefore run into a case where some symbols count as two characters and some count only as one.
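The arithmetic behind those two code values is small enough to show on its own. A sketch using U+20001 from the question (the helper names toSurrogatePair and fromSurrogatePair are invented for this example):

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

// Recombine a high/low surrogate pair into the original code point.
function fromSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

var pair = toSurrogatePair(0x20001);
console.log(pair.map(function (u) { return u.toString(16); })); // [ 'd840', 'dc01' ]
console.log(fromSurrogatePair(pair[0], pair[1]).toString(16));   // 20001
console.log(String.fromCharCode(pair[0], pair[1]).length);        // 2
```

The last line shows the mismatch described above: one symbol, but a length of 2.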
I've reimplemented the following methods to treat each unicode code point as a single character: .length, .charCodeAt, .fromCharCode, .charAt, .indexOf, .lastIndexOf, .slice, and .split.
You can check it out on jsfiddle: http://jsfiddle.net/Y89Du/
Here's the code without comments. I tested it, but it may still have errors. Comments are welcome.
if (!String.prototype.ucLength) {
String.prototype.ucLength = function() {
// this solution was taken from
// http://stackoverflow.com/questions/3744721/javascript-strings-outside-of-the-bmp
return this.length - this.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g).length + 1;
};
}
if (!String.prototype.codePointAt) {
String.prototype.codePointAt = function (ucPos) {
if (isNaN(ucPos)){
ucPos = 0;
}
var str = String(this);
var codePoint = null;
var pairFound = false;
var ucIndex = -1;
var i = 0;
while (i < str.length){
ucIndex += 1;
var code = str.charCodeAt(i);
var next = str.charCodeAt(i + 1);
pairFound = (0xD800 <= code && code <= 0xDBFF && 0xDC00 <= next && next <= 0xDFFF);
if (ucIndex == ucPos){
codePoint = pairFound ? ((code - 0xD800) * 0x400) + (next - 0xDC00) + 0x10000 : code;
break;
} else{
i += pairFound ? 2 : 1;
}
}
return codePoint;
};
}
if (!String.fromCodePoint) {
String.fromCodePoint = function () {
var strChars = [], codePoint, offset, codeValues, i;
for (i = 0; i < arguments.length; ++i) {
codePoint = arguments[i];
offset = codePoint - 0x10000;
if (codePoint > 0xFFFF){
codeValues = [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
} else{
codeValues = [codePoint];
}
strChars.push(String.fromCharCode.apply(null, codeValues));
}
return strChars.join("");
};
}
if (!String.prototype.ucCharAt) {
String.prototype.ucCharAt = function (ucIndex) {
var str = String(this);
var codePoint = str.codePointAt(ucIndex);
var ucChar = String.fromCodePoint(codePoint);
return ucChar;
};
}
if (!String.prototype.ucIndexOf) {
String.prototype.ucIndexOf = function (searchStr, ucStart) {
if (isNaN(ucStart)){
ucStart = 0;
}
if (ucStart < 0){
ucStart = 0;
}
var str = String(this);
var strUCLength = str.ucLength();
searchStr = String(searchStr);
var ucSearchLength = searchStr.ucLength();
var i = ucStart;
while (i < strUCLength){
var ucSlice = str.ucSlice(i,i+ucSearchLength);
if (ucSlice == searchStr){
return i;
}
i++;
}
return -1;
};
}
if (!String.prototype.ucLastIndexOf) {
String.prototype.ucLastIndexOf = function (searchStr, ucStart) {
var str = String(this);
var strUCLength = str.ucLength();
if (isNaN(ucStart)){
ucStart = strUCLength - 1;
}
if (ucStart >= strUCLength){
ucStart = strUCLength - 1;
}
searchStr = String(searchStr);
var ucSearchLength = searchStr.ucLength();
var i = ucStart;
while (i >= 0){
var ucSlice = str.ucSlice(i,i+ucSearchLength);
if (ucSlice == searchStr){
return i;
}
i--;
}
return -1;
};
}
if (!String.prototype.ucSlice) {
String.prototype.ucSlice = function (ucStart, ucStop) {
var str = String(this);
var strUCLength = str.ucLength();
if (isNaN(ucStart)){
ucStart = 0;
}
if (ucStart < 0){
ucStart = strUCLength + ucStart;
if (ucStart < 0){ ucStart = 0;}
}
if (typeof(ucStop) == 'undefined'){
ucStop = strUCLength; // the stop index is exclusive, so default to the full length
}
if (ucStop < 0){
ucStop = strUCLength + ucStop;
if (ucStop < 0){ ucStop = 0;}
}
var ucChars = [];
var i = ucStart;
while (i < ucStop){
ucChars.push(str.ucCharAt(i));
i++;
}
return ucChars.join("");
};
}
if (!String.prototype.ucSplit) {
String.prototype.ucSplit = function (delimeter, limit) {
var str = String(this);
var strUCLength = str.ucLength();
var ucChars = [];
if (delimeter == ''){
for (var i = 0; i < strUCLength; i++){
ucChars.push(str.ucCharAt(i));
}
if (typeof limit !== 'undefined') { ucChars = ucChars.slice(0, limit); } // only truncate when a limit was given
} else{
ucChars = str.split(delimeter, limit);
}
return ucChars;
};
}
More recent JavaScript engines have String.fromCodePoint.
const ideograph = String.fromCodePoint( 0x20001 ); // outside the BMP
Also a code-point iterator, which gets you the code-point length.
function countCodePoints( str )
{
const i = str[Symbol.iterator]();
let count = 0;
while( !i.next().done ) ++count;
return count;
}
console.log( ideograph.length ); // gives '2'
console.log( countCodePoints(ideograph) ); // '1'
Yes, you can. Although support to non-BMP characters directly in source documents is optional according to the ECMAScript standard, modern browsers let you use them. Naturally, the document encoding must be properly declared, and for most practical purposes you would need to use the UTF-8 encoding. Moreover, you need an editor that can handle UTF-8, and you need some input method(s); see e.g. my Full Unicode Input utility.
Using suitable tools and settings, you can write var foo = '𠀁'.
The non-BMP characters will be internally represented as surrogate pairs, so each non-BMP character counts as 2 in the string length.
Using the for (c of this) statement, one can make various computations on a string that contains non-BMP characters. For instance, to compute the string length, and to get the nth character of the string:
String.prototype.magicLength = function()
{
var c, k;
k = 0;
for (c of this) // iterate each char of this
{
k++;
}
return k;
}
String.prototype.magicCharAt = function(n)
{
var c, k;
k = 0;
for (c of this) // iterate each char of this
{
if (k == n) return c + "";
k++;
}
return "";
}
This old topic has now a simple solution in ES6:
Split characters into an array
simple version
[..."😴😄😃⛔🎠🚓🚇"] // ["😴", "😄", "😃", "⛔", "🎠", "🚓", "🚇"]
Then having each one separated you can handle them easily for most common cases.
Credit: DownGoat
Full solution
To handle composite emoji such as the one in the comment, one can search for the zero-width joiner (U+200D, char code 8205) and make some modifications. Here is how:
let myStr = "👩‍👩‍👧‍👧😃𝌆"
let arr = [...myStr]
for (let i = arr.length - 2; i > 0; i--) { // a joiner needs a neighbour on both sides
if (arr[i].charCodeAt(0) == 8205) { // special combination character
arr[i-1] += arr[i] + arr[i+1]; // combine them back to a single emoji
arr.splice(i, 2)
}
}
console.log(arr.length) //3
Haven't found a case where this doesn't work. Comment if you do.
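To conclude
char code 8205 (U+200D) is the Unicode zero-width joiner, which glues several emoji into one displayed glyph. It is part of Unicode itself and is independent of how JS stores strings as UTF-16 code units.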
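Where it is available, Intl.Segmenter can do this grouping without hand-written joiner logic. A short sketch, assuming a runtime that ships Intl.Segmenter (recent browsers and Node.js versions); countGraphemes is a name made up for this example:

```javascript
// Count user-perceived characters (grapheme clusters) with Intl.Segmenter.
// Assumes a runtime that ships Intl.Segmenter.
function countGraphemes(str) {
  var segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  var count = 0;
  for (var part of segmenter.segment(str)) {
    count++;
  }
  return count;
}

// family emoji (four people joined by U+200D) + smiley + U+1D306
var sample = "\u{1F469}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F467}\u{1F603}\u{1D306}";
console.log(sample.length);          // 15 UTF-16 code units
console.log([...sample].length);     // 9 code points
console.log(countGraphemes(sample)); // 3
```

The three numbers correspond to the three levels discussed in this thread: code units, code points, and what a reader sees as one character.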
