non-ascii char as arguments - javascript

printargv.js:
console.log(Buffer.byteLength(process.argv[2]));
In cmd.exe (with chcp 65001 and the font set to 'Lucida Console'), I ran:
node printargv.js Ā
(Note: unicode code point of Ā is U+0100.) The script outputted:
1
I expected the script to print a number greater than 1, but it didn't. Does anyone know why?
Edit:
I think that Node parses the initial arguments incorrectly under cmd.exe, based on trying the code below:
var i = require('readline').createInterface(process.stdin, process.stdout);
i.question('char: ', function (c) {
    console.log(Buffer.byteLength(c));
    i.close();
    process.stdin.destroy();
});
The output is 2.
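For reference, Buffer.byteLength on a correctly received Ā should report two bytes, since U+0100 encodes to 0xC4 0x80 in UTF-8:

```javascript
// Ā (U+0100) takes two bytes in UTF-8; plain A takes one.
console.log(Buffer.byteLength('\u0100')); // 2
console.log(Buffer.byteLength('A'));      // 1
```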

Your program is not receiving the Ā, it's receiving an A instead. I used this program to test:
var n;
for (n = 0; n < process.argv.length; ++n) {
    console.log(n + ": '" + process.argv[n] + "'");
}
console.log("length: " + process.argv[2].length);
console.log("code: " + process.argv[2].charCodeAt(0));
console.log("size: " + Buffer.byteLength(process.argv[2]));
On Ubuntu using UTF-8 in the console, I got:
$ node test.js Ā
0: 'node'
1: '/home/tjc/temp/test.js'
2: 'Ā'
length: 1
code: 256
size: 2
...which is correct.
On Windows 7 using chcp 65001 and Lucida Console, I got:
C:\tmp>node temp.js Ā
0: 'node'
1: 'C:\tmp\temp.js'
2: 'A'
length: 1
code: 65
size: 1
Note that the Ā became an A at some point along the way.
As I said in my comment on the question, I can only assume there's some issue with Lucida Console, or cmd.exe's handling of UTF-8, or perhaps node.exe's handling of Unicode from the console on Windows (I used the pre-built 0.5.7 version).
Update: This might be something to take up with the NodeJS folks, since Windows appears to get it right on its own. If I put this code in a test.vbs file:
WScript.Echo WScript.Arguments(0)
WScript.Echo AscW(WScript.Arguments(0))
I get a correct result:
C:\tmp>cscript /nologo test.vbs Ā
Ā
256
...suggesting that the terminal is passing the argument correctly to the program. So it could be an issue with the Windows node.exe build.

Related

Confusing JavaScript statement about string Concatenation

I was developing a node.js site and I made a copy and paste error that resulted in the following line (simplified for this question):
var x = "hi" + + "mom"
It doesn't crash, and x ends up as "hiNaN". Now that I have fixed this bug, I am curious what is going on here, since if I remove the space between the + signs I get an error (SyntaxError: invalid increment operand)
My Question is : Can some explain to me what is going on in the statement and how nothing (a space between the + signs) changes this from an error to a NaN?
PS. I am not sure if this should go here or on programmers.stackexchange.com. Let me know if I posted on the wrong site.
It's being interpreted like this:
var x = "hi" + (+"mom")
The prefix + tries to coerce the string to a number. Number('mom') is NaN, so +'mom' is also NaN, and "hi" + NaN concatenates to the string "hiNaN".
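To see it in action:

```javascript
// The unary plus binds to "mom" first, so the expression is string
// concatenation with NaN.
var x = "hi" + + "mom"; // parsed as "hi" + (+"mom")
console.log(x);         // "hiNaN"
console.log(typeof x);  // "string"
```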

Get result of last statement in JavaScript Code

Given is a file with JavaScript-Code. For example:
1 + 1;
3 + 3;
I want to receive the value of the last expression (which is 6 in this case).
This could be achieved via
node --print "1 + 1; 3 + 3;"
But I cannot pass the code as a string because the code can contain quotes which conflict with the quotes around the code (e.g. node -p "1 + 1; aFunction("string")").
Unfortunately, node's --print parameter cannot deal with files.
Another approach would be to modify the source file. I could use the eval function, which has the desired behaviour that eval("1 + 1; 3 + 3") returns 6. Unfortunately, I run into the same conflicts with the quotes.
I hope I could make my point clear. I'm looking forward to your answers.
If you're on Linux (and maybe macOS, and maybe even Windows/Cygwin), you can put the code in a file and then try this:
node -p < thefile.js

Encoding issues for UTF8 CSV file when opening Excel and TextEdit

I recently added a CSV-download button that takes data from the database (Postgres) as an array from the server (Ruby on Rails) and turns it into a CSV file on the client side (JavaScript, HTML5). I'm currently testing the CSV file, and I am coming across some encoding issues.
When I view the CSV file via 'less', the file appears fine. But when I open the file in Excel OR TextEdit, I start seeing weird characters like
—, â€, “
appear in the text. Basically, I see the characters that are described here: http://digwp.com/2011/07/clean-up-weird-characters-in-database/
I read that this sort of issue can arise when the database encoding is set to the wrong one. BUT the database that I am using is set to use UTF8 encoding. And when I debug through the JS code that creates the CSV file, the text appears normal. (This could be down to what Chrome can render versus what less can.)
I'm feeling frustrated because the only thing I am learning from my online search is that there could be many reasons why encoding is not working, I'm not sure which part is at fault (so excuse me as I initially tag numerous things), and nothing I tried has shed new light on my problem.
For reference, here's the JavaScript snippet that creates the CSV file!
$(document).ready(function() {
    var csvData = <%= raw to_csv(@view_scope, clicks_post).as_json %>;
    var csvContent = "data:text/csv;charset=utf-8,";
    csvData.forEach(function(infoArray, index){
        var dataString = infoArray.join(",");
        csvContent += dataString + "\n";
    });
    var encodedUri = encodeURI(csvContent);
    var button = $('<a>');
    button.text('Download CSV');
    button.addClass("button right");
    button.attr('href', encodedUri);
    button.attr('target', '_blank');
    button.attr('download', '<%=title%>_25_posts.csv');
    $("#<%=title%>_download_action").append(button);
});
As @jlarson updated with the information that Mac was the biggest culprit, we might get somewhat further. Office for Mac has, at least in 2011 and back, rather poor support for reading Unicode formats when importing files.
Support for UTF-8 seems to be close to non-existent; I have read a few comments saying it works, while the majority say it does not. Unfortunately I do not have any Mac to test on. So again: the files themselves should be OK as UTF-8, but the import halts the process.
I wrote up a quick test in JavaScript for exporting percent-escaped UTF-16 little and big endian, with/without BOM, etc.
The code should probably be refactored, but should be OK for testing. It might work better than UTF-8. Of course this also usually means bigger data transfers, as any glyph is two or four bytes.
You can find a fiddle here:
Unicode export sample Fiddle
Note that it does not handle CSV in any particular way. It is mainly meant for pure conversion to a data URL having UTF-8, UTF-16 big/little endian and +/- BOM. There is one option in the fiddle to replace commas with tabs, but I believe that would be a rather hackish and fragile solution if it works.
Typically use like:
// Initiate
encoder = new DataEnc({
    mime   : 'text/csv',
    charset: 'UTF-16BE',
    bom    : true
});
// Convert data to percent escaped text
encoder.enc(data);
// Get result
var result = encoder.pay();
There are two result properties on the object:
1.) encoder.lead
This is the mime-type, charset etc. for data URL. Built from options passed to initializer, or one can also say .config({ ... new conf ...}).intro() to re-build.
data:[<MIME-type>][;charset=<encoding>][;base64]
You can specify base64, but there is no base64 conversion (at least not this far).
2.) encoder.buf
This is a string with the percent escaped data.
The .pay() function simply returns 1.) and 2.) concatenated as one string.
Main code:
function DataEnc(a) {
    this.config(a);
    this.intro();
}
/*
 * http://www.iana.org/assignments/character-sets/character-sets.xhtml
 */
DataEnc._enctype = {
    u8    : ['u8', 'utf8'],
    // RFC-2781: big endian should be presumed if none given
    u16be : ['u16', 'u16be', 'utf16', 'utf16be', 'ucs2', 'ucs2be'],
    u16le : ['u16le', 'utf16le', 'ucs2le']
};
DataEnc._BOM = {
    'none'     : '',
    'UTF-8'    : '%ef%bb%bf', // Discouraged
    'UTF-16BE' : '%fe%ff',
    'UTF-16LE' : '%ff%fe'
};
DataEnc.prototype = {
    // Basic setup
    config : function(a) {
        var opt = {
            charset: 'u8',
            mime   : 'text/csv',
            base64 : 0,
            bom    : 0
        };
        a = a || {};
        this.charset = typeof a.charset !== 'undefined' ?
                       a.charset : opt.charset;
        this.base64 = typeof a.base64 !== 'undefined' ? a.base64 : opt.base64;
        this.mime   = typeof a.mime !== 'undefined' ? a.mime : opt.mime;
        this.bom    = typeof a.bom !== 'undefined' ? a.bom : opt.bom;
        this.enc  = this.utf8;
        this.buf  = '';
        this.lead = '';
        return this;
    },
    // Create lead based on config
    // data:[<MIME-type>][;charset=<encoding>][;base64],<data>
    intro : function() {
        var
            g = [],
            c = this.charset || '',
            b = 'none'
        ;
        if (this.mime && this.mime !== '')
            g.push(this.mime);
        if (c !== '') {
            c = c.replace(/[-\s]/g, '').toLowerCase();
            if (DataEnc._enctype.u8.indexOf(c) > -1) {
                c = 'UTF-8';
                if (this.bom)
                    b = c;
                this.enc = this.utf8;
            } else if (DataEnc._enctype.u16be.indexOf(c) > -1) {
                c = 'UTF-16BE';
                if (this.bom)
                    b = c;
                this.enc = this.utf16be;
            } else if (DataEnc._enctype.u16le.indexOf(c) > -1) {
                c = 'UTF-16LE';
                if (this.bom)
                    b = c;
                this.enc = this.utf16le;
            } else {
                if (c === 'copy')
                    c = '';
                this.enc = this.copy;
            }
        }
        if (c !== '')
            g.push('charset=' + c);
        if (this.base64)
            g.push('base64');
        this.lead = 'data:' + g.join(';') + ',' + DataEnc._BOM[b];
        return this;
    },
    // Deliver
    pay : function() {
        return this.lead + this.buf;
    },
    // UTF-16BE
    utf16be : function(t) { // U+0500 => %05%00
        var i, c, buf = [];
        for (i = 0; i < t.length; ++i) {
            if ((c = t.charCodeAt(i)) > 0xff) {
                buf.push(('00' + (c >> 0x08).toString(16)).substr(-2));
                buf.push(('00' + (c & 0xff).toString(16)).substr(-2));
            } else {
                buf.push('00');
                buf.push(('00' + (c & 0xff).toString(16)).substr(-2));
            }
        }
        this.buf += '%' + buf.join('%');
        // Note: the hex array is returned, not the string with '%'.
        // Might be useful if one wants to loop over the data.
        return buf;
    },
    // UTF-16LE
    utf16le : function(t) { // U+0500 => %00%05
        var i, c, buf = [];
        for (i = 0; i < t.length; ++i) {
            if ((c = t.charCodeAt(i)) > 0xff) {
                buf.push(('00' + (c & 0xff).toString(16)).substr(-2));
                buf.push(('00' + (c >> 0x08).toString(16)).substr(-2));
            } else {
                buf.push(('00' + (c & 0xff).toString(16)).substr(-2));
                buf.push('00');
            }
        }
        this.buf += '%' + buf.join('%');
        // Note: the hex array is returned, not the string with '%'.
        // Might be useful if one wants to loop over the data.
        return buf;
    },
    // UTF-8
    utf8 : function(t) {
        this.buf += encodeURIComponent(t);
        return this;
    },
    // Direct copy
    copy : function(t) {
        this.buf += t;
        return this;
    }
};
Previous answer:
I do not have any setup to replicate yours, but if your case is the same as @jlarson's, then the resulting file should be correct.
This answer became somewhat long (fun topic, you say?), but it discusses various aspects of the question: what is (likely) happening, and how to actually check what is going on in various ways.
TL;DR:
The text is likely imported as ISO-8859-1, Windows-1252, or the like, and not as UTF-8. Force application to read file as UTF-8 by using import or other means.
PS: The UniSearcher is a nice tool to have available on this journey.
The long way around
The "easiest" way to be 100% sure what we are looking at is to use a hex-editor on the result. Alternatively use hexdump, xxd or the like from command line to view the file. In this case the byte sequence should be that of UTF-8 as delivered from the script.
As an example if we take the script of jlarson it takes the data Array:
data = ['name', 'city', 'state'],
['\u0500\u05E1\u0E01\u1054', 'seattle', 'washington']
This one is merged into the string:
name,city,state<newline>
\u0500\u05E1\u0E01\u1054,seattle,washington<newline>
which translates by Unicode to:
name,city,state<newline>
Ԁסกၔ,seattle,washington<newline>
As UTF-8 uses ASCII as its base (bytes with the highest bit not set are the same as in ASCII), the only special sequence in the test data is "Ԁסกၔ", which in turn is:
Code-point Glyph UTF-8
----------------------------
U+0500 Ԁ d4 80
U+05E1 ס d7 a1
U+0E01 ก e0 b8 81
U+1054 ၔ e1 81 94
Looking at the hex-dump of the downloaded file:
0000000: 6e61 6d65 2c63 6974 792c 7374 6174 650a name,city,state.
0000010: d480 d7a1 e0b8 81e1 8194 2c73 6561 7474 ..........,seatt
0000020: 6c65 2c77 6173 6869 6e67 746f 6e0a le,washington.
On second line we find d480 d7a1 e0b8 81e1 8194 which match up with the above:
0000010: d480 d7a1 e0b8 81 e1 8194 2c73 6561 7474 ..........,seatt
| | | | | | | | | | | | | |
+-+-+ +-+-+ +--+--+ +--+--+ | | | | | |
| | | | | | | | | |
Ԁ ס ก ၔ , s e a t t
None of the other characters are mangled either.
Do similar tests if you want. The result should be similar.
By the sample provided: —, â€, “
We can also have a look at the sample provided in the question. It is reasonable to assume that the text is represented in Excel / TextEdit by code page 1252.
To quote Wikipedia on Windows-1252:
Windows-1252 or CP-1252 is a character encoding of the Latin alphabet, used by
default in the legacy components of Microsoft Windows in English and some other
Western languages. It is one version within the group of Windows code pages.
In LaTeX packages, it is referred to as "ansinew".
Retrieving the original bytes
To translate it back into its original form we can look at the code page layout, from which we get:
Character: <â> <€> <”> <,> < > <â> <€> < > <,> < > <â> <€> <œ>
U.Hex : e2 20ac 201d 2c 20 e2 20ac 9d 2c 20 e2 20ac 153
T.Hex : e2 80 94 2c 20 e2 80 9d* 2c 20 e2 80 9c
U is short for Unicode
T is short for Translated
For example:
â => Unicode 0xe2 => CP-1252 0xe2
” => Unicode 0x201d => CP-1252 0x94
€ => Unicode 0x20ac => CP-1252 0x80
Special cases like 9d do not have a corresponding code point in CP-1252; these we simply copy directly.
Note: if one looks at the mangled string by copying the text to a file and doing a hex dump, save the file with, for example, UTF-16 encoding to get the Unicode values as represented in the table. E.g. in Vim:
set fenc=utf-16
# Or
set fenc=ucs-2
Bytes to UTF-8
We then combine the result, the T.Hex line, into UTF-8. In UTF-8 sequences, the bytes are represented by a leading byte telling us how many subsequent bytes make up the glyph. For example, if a byte has the binary value 110x xxxx, we know that this byte and the next represent one code point, for a total of two bytes. 1110 xxxx tells us it is three, and so on. ASCII values do not have the high bit set, so any byte matching 0xxx xxxx is a standalone, one byte in total.
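The leading-byte rules above can be sketched as a small helper (the function name is made up):

```javascript
// Determine how many bytes a UTF-8 sequence occupies, judged from its
// lead byte alone.
function utf8SeqLength(byte) {
    if ((byte & 0x80) === 0x00) return 1; // 0xxxxxxx: ASCII, standalone
    if ((byte & 0xE0) === 0xC0) return 2; // 110xxxxx: one continuation byte
    if ((byte & 0xF0) === 0xE0) return 3; // 1110xxxx: two continuation bytes
    if ((byte & 0xF8) === 0xF0) return 4; // 11110xxx: three continuation bytes
    return 0; // continuation byte or invalid lead byte
}
console.log(utf8SeqLength(0xe2)); // 3, matching e2 80 94 for the em-dash
console.log(utf8SeqLength(0x2c)); // 1, the comma
```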
0xe2 = 1110 0010bin => 3 bytes => 0xe28094 (em-dash)  —
0x2c = 0010 1100bin => 1 byte  => 0x2c     (comma)    ,
0x20 = 0010 0000bin => 1 byte  => 0x20     (space)
0xe2 = 1110 0010bin => 3 bytes => 0xe2809d (right-dq) ”
0x2c = 0010 1100bin => 1 byte  => 0x2c     (comma)    ,
0x20 = 0010 0000bin => 1 byte  => 0x20     (space)
0xe2 = 1110 0010bin => 3 bytes => 0xe2809c (left-dq)  “
Conclusion: the original UTF-8 string was:
—, ”, “
Mangling it back
We can also do the reverse. The original string as bytes:
UTF-8: e2 80 94 2c 20 e2 80 9d 2c 20 e2 80 9c
Corresponding values in cp-1252:
e2 => â
80 => €
94 => ”
2c => ,
20 => <space>
...
and so on, result:
—, â€, “
Importing to MS Excel
In other words: The issue at hand could be how to import UTF-8 text files into MS Excel, and some other applications. In Excel this can be done in various ways.
Method one:
Do not save the file with an extension recognized by the application, like .csv, or .txt, but omit it completely or make something up.
As an example save the file as "testfile", with no extension. Then in Excel open the file, confirm that we actually want to open this file, and voilà we get served with the encoding option. Select UTF-8, and file should be correctly read.
Method two:
Use import data instead of open file. Something like:
Data -> Import External Data -> Import Data
Select encoding and proceed.
Check that Excel and selected font actually supports the glyph
We can also test the font support for the Unicode characters by using the, sometimes, friendlier clipboard. For example, copy text from this page into Excel:
page with code points U+0E00 to U+0EFF
If support for the code points exist, the text should render fine.
Linux
On Linux, which is primarily UTF-8 in userland, this should not be an issue. Libre Office Calc, Vim, etc. all show the files correctly rendered.
Why it works (or should)
The spec for encodeURI states (also read sec-15.1.3):
The encodeURI function computes a new version of a URI in which each instance of certain characters is replaced by one, two, three, or four escape sequences representing the UTF-8 encoding of the character.
We can simply test this in our console, for example by saying:
>> encodeURI('Ԁסกၔ,seattle,washington')
<< "%D4%80%D7%A1%E0%B8%81%E1%81%94,seattle,washington"
As we can see, the escape sequences are equal to the ones in the hex dump above:
%D4%80%D7%A1%E0%B8%81%E1%81%94 (encodeURI in log)
d4 80 d7 a1 e0 b8 81 e1 81 94 (hex-dump of file)
or, testing a 4-byte code:
>> encodeURI('󱀁')
<< "%F3%B1%80%81"
If this does not apply
If none of this applies, it could help if you added:
Sample of expected input vs mangled output, (copy paste).
Sample hex-dump of original data vs result file.
I ran into exactly this yesterday. I was developing a button that exports the contents of an HTML table as a CSV download. The functionality of the button itself is almost identical to yours – on click I read the text from the table and create a data URI with the CSV content.
When I tried to open the resulting file in Excel it was clear that the "£" symbol was getting read incorrectly. The 2 byte UTF-8 representation was being processed as ASCII resulting in an unwanted garbage character. Some Googling indicated this was a known issue with Excel.
I tried adding the byte order mark at the start of the string – Excel just interpreted it as ASCII data. I then tried various things to convert the UTF-8 string to ASCII (such as csvData.replace('\u00a3', '\xa3')) but I found that any time the data is coerced to a JavaScript string it will become UTF-8 again. The trick is to convert it to binary and then Base64 encode it without converting back to a string along the way.
I already had CryptoJS in my app (used for HMAC authentication against a REST API) and I was able to use that to create an ASCII encoded byte sequence from the original string then Base64 encode it and create a data URI. This worked and the resulting file when opened in Excel does not display any unwanted characters.
The essential bit of code that does the conversion is:
var csvHeader = 'data:text/csv;charset=iso-8859-1;base64,'
var encodedCsv = CryptoJS.enc.Latin1.parse(csvData).toString(CryptoJS.enc.Base64)
var dataURI = csvHeader + encodedCsv
Where csvData is your CSV string.
There are probably ways to do the same thing without CryptoJS if you don't want to bring in that library but this at least shows it is possible.
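One such CryptoJS-free sketch (the variable names are made up, and this assumes an environment with btoa, i.e. browsers or recent Node): btoa operates on "binary strings" whose code points are 0-255, so a character like '£' (U+00A3) becomes the single Latin-1 byte 0xA3 rather than a two-byte UTF-8 sequence. Characters above U+00FF would make btoa throw.

```javascript
// Base64-encode a Latin-1-compatible CSV string without CryptoJS.
var csvData = 'price\n\u00a310\n'; // contains the pound sign U+00A3
var dataURI = 'data:text/csv;charset=iso-8859-1;base64,' + btoa(csvData);
```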
Excel likes Unicode in UTF-16 LE with BOM encoding. Output the correct BOM (FF FE), then convert all your data from UTF-8 to UTF-16 LE.
Windows uses UTF-16 LE internally, so some applications work better with UTF-16 than with UTF-8.
I haven't tried to do that in JS, but there are various scripts on the web to convert UTF-8 to UTF-16. Conversion between UTF variants is pretty easy and takes just a dozen lines.
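A minimal sketch of such a conversion (the function name is made up): since JS strings are already sequences of UTF-16 code units, emitting each code unit low byte first yields UTF-16 LE, and prefixing the bytes FF FE supplies the BOM.

```javascript
// Build a UTF-16 LE byte sequence with BOM from a JS string.
function toUtf16leWithBom(str) {
    var bytes = [0xFF, 0xFE]; // UTF-16 LE byte order mark
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        bytes.push(code & 0xFF);        // low byte first
        bytes.push((code >> 8) & 0xFF); // then high byte
    }
    return bytes;
}
toUtf16leWithBom('A'); // [0xFF, 0xFE, 0x41, 0x00]
```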
I was having a similar issue with data that was pulled into JavaScript from a SharePoint list. It turned out to be something called a "Zero Width Space" character, and it was being displayed as †when it was brought into Excel. Apparently, SharePoint inserts these sometimes when a user hits 'backspace'.
I replaced them with this quickfix:
var mystring = myString.replace(/\u200B/g,'');
It looks like you may have other hidden characters in there. I found the code point for the zero-width character in mine by looking at the output string in the Chrome inspector. The inspector couldn't render the character, so it replaced it with a red dot. When you hover your mouse over that red dot, it gives you the code point (e.g. \u200B), and you can sub in the various code points of the invisible characters and remove them that way.
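Without the inspector, a small helper (name made up) can dump every code unit of a string, making invisible characters visible:

```javascript
// Print the code point of each UTF-16 code unit, exposing invisible
// characters such as the zero-width space U+200B.
function dumpCodePoints(s) {
    var out = [];
    for (var i = 0; i < s.length; i++) {
        out.push('U+' + s.charCodeAt(i).toString(16).toUpperCase());
    }
    return out.join(' ');
}
dumpCodePoints('a\u200Bb'); // "U+61 U+200B U+62"
```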
button.href = 'data:' + mimeType + ';charset=UTF-8,%ef%bb%bf' + encodedUri;
This should do the trick: the %ef%bb%bf prefix is the percent-escaped UTF-8 BOM, which nudges Excel into reading the data as UTF-8.
It could be a problem with your server's encoding.
If you are running Linux, you could try (assuming the en_US locale):
sudo locale-gen en_US en_US.UTF-8
dpkg-reconfigure locales
These three rules should be applied when writing a multibyte CSV file so that it is readable in Excel across different OS platforms (Windows, Linux, macOS):
Use the tab character \t to separate fields instead of the comma (,)
The content must be encoded in UTF-16 little endian (UTF-16LE)
The content must be prefixed with the UTF-16LE byte order mark (BOM), which is 0xFEFF
Here is an article that shows how to reproduce the encoding issue and walks through the solution. Node.js is used to create the CSV file.
As a side note, the UTF-16LE BOM has to be explicitly set when writing a file using the Node.js fs module. Refer to this GitHub issue for a more detailed discussion.

Why is the operation address incremented by two?

I am looking at a Javascript emulator of a NES to try and understand how it works.
On this line:
addr = this.load(opaddr+2);
The opcode is incremented by two. However, the documentation (see appendix E) I'm reading says:
Zero page addressing uses a single operand which serves as a pointer
to an address in zero page ($0000-$00FF) where the data to be operated
on can be found. By using zero page addressing, only one byte is
needed for the operand, so the instruction is shorter and, therefore,
faster to execute than with addressing modes which take two operands.
An example of a zero page instruction is AND $12.
So if the operand's argument is only one byte, shouldn't it appear directly after it, and be + 1 instead of + 2? Why +2?
This is how I think it works, which may be incorrect. Suppose our memory looks like:
-------------------------
| 0 | 1 | 2 | 3 | 4 | 5 | <- index
-------------------------
| a | b | c | d | e | f | <- memory
-------------------------
^
\
PC
and our PC is 0, pointing to a. For this cycle, we say that the opcode:
var pc= 0; //for example's sake
var opcode= memory[pc]; //a
So shouldn't the first operand be the next slot, i.e. b?
var first_operand = memory[pc + 1]; //b
Your analysis appears to be correct at first glance, but since the emulator works, there must be something else going on.
The relevant code is as follows:
var opinf = this.opdata[this.nes.mmap.load(this.REG_PC + 1)];
var cycleCount = (opinf >> 24);
var cycleAdd = 0;

// Find address mode:
var addrMode = (opinf >> 8) & 0xFF;

// Increment PC by number of op bytes:
var opaddr = this.REG_PC;
this.REG_PC += ((opinf >> 16) & 0xFF);

var addr = 0;
switch (addrMode) {
    case 0: {
        // Zero Page mode. Use the address given after the opcode,
        // but without high byte.
        addr = this.load(opaddr + 2);
        break;
Note how on the first line shown, the memory access that fetches the instruction information is at address REG_PC+1. So the PC actually points to the byte preceding the opcode being executed, and the operands therefore start at that address + 2. The opcode itself is encoded in the lower 8 bits of opinf and used in the execute switch a page or so below the code segment shown.
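For illustration, the bit layout of opinf implied by those shifts can be sketched as follows (the sample packed value is made up):

```javascript
// Field layout inferred from the shifts in the emulator code above:
//   bits 24-31: cycle count
//   bits 16-23: instruction size in bytes
//   bits  8-15: addressing mode
//   bits  0-7 : opcode
var opinf = 0x03020125; // hypothetical packed table entry

var cycleCount = opinf >> 24;          // 3
var size       = (opinf >> 16) & 0xFF; // 2
var addrMode   = (opinf >> 8) & 0xFF;  // 1
var opcode     = opinf & 0xFF;         // 0x25
```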

how to render 32bit unicode characters in google v8 (and nodejs)

does anyone have an idea how to render unicode 'astral plane' characters (whose CIDs are beyond 0xffff) in google v8, the javascript vm that drives both google chrome and nodejs?
funnily enough, when i give google chrome (it identifies as 11.0.696.71, running on ubuntu 10.4) an html page like this:
<script>document.write( "helo" )
document.write( "𡥂 ⿸𠂇子" );
</script>
it will correctly render the 'wide' character 𡥂 alongside with the 'narrow' ones, but when i try the equivalent in nodejs (using console.log()) i get a single � (0xfffd, REPLACEMENT CHARACTER) for the 'wide' character instead.
i have also been told that for whatever non-understandable reason google have decided to implement characters using a 16bit-wide datatype. while i find that stupid, the surrogate codepoints have been designed precisely to enable the 'channeling' of 'astral codepoints' through 16bit-challenged pathways. and somehow the v8 running inside of chrome 11.0.696.71 seems to use this bit of unicode-foo or other magic to do its work (i seem to remember years ago i always got boxes instead even on static pages).
ah yes, node --version reports v0.4.10, gotta figure out how to obtain a v8 version number from that.
update i did the following in coffee-script:
a = String.fromCharCode( 0xd801 )
b = String.fromCharCode( 0xdc00 )
c = a + b
console.log a
console.log b
console.log c
console.log String.fromCharCode( 0xd835, 0xdc9c )
but that only gives me
���
���
������
������
the thinking behind this is that since that braindead part of the javascript specification that deals with unicode appears to mandate? / not downright forbid? / allow? the use of surrogate pairs, then maybe my source file encoding (utf-8) might be part of the problem. after all, there are two ways to encode 32bit codepoints in utf-8: one is to write out the utf-8 octets needed for the first surrogate, then those for the second; the other way (which is the preferred way, as per the utf-8 spec) is to calculate the resulting codepoint and write out the octets needed for that codepoint. so here i completely exclude the question of source file encoding by dealing only with numbers. the above code does work with document.write() in chrome, giving 𐐀𝒜, so i know i got the numbers right.
sigh.
EDIT i did some experiments and found out that when i do
var f = function( text ) {
    document.write( '<h1>', text, '</h1>' );
    document.write( '<div>', text.length, '</div>' );
    document.write( '<div>0x', text.charCodeAt(0).toString( 16 ), '</div>' );
    document.write( '<div>0x', text.charCodeAt(1).toString( 16 ), '</div>' );
    console.log( '<h1>', text, '</h1>' );
    console.log( '<div>', text.length, '</div>' );
    console.log( '<div>0x', text.charCodeAt(0).toString( 16 ), '</div>' );
    console.log( '<div>0x', text.charCodeAt(1).toString( 16 ), '</div>' );
};
f( '𩄎' );
f( String.fromCharCode( 0xd864, 0xdd0e ) );
i do get correct results in google chrome---both inside the browser window and on the console:
𩄎
2
0xd864
0xdd0e
𩄎
2
0xd864
0xdd0e
however, this is what i get when using nodejs' console.log:
<h1> � </h1>
<div> 1 </div>
<div>0x fffd </div>
<div>0x NaN </div>
<h1> �����</h1>
<div> 2 </div>
<div>0x d864 </div>
<div>0x dd0e </div>
this seems to indicate that both parsing utf-8 with CIDs beyond 0xffff and outputting those characters to the console is broken. python 3.1, by the way, does treat the character as a surrogate pair and can print the character to the console.
NOTE i've cross-posted this question to the v8-users mailing list.
This recent presentation covers all sorts of issues with Unicode in popular languages, and isn't kind to Javascript: The Good, the Bad, & the (mostly) Ugly
He covers the issue with two-byte representation of Unicode in Javascript:
The UTF‐16 née UCS‐2 Curse
Like several other languages, Javascript
suffers from The UTF‐16 Curse. Except that Javascript has an even
worse form of it, The UCS‐2 Curse. Things like charCodeAt and
fromCharCode only ever deal with 16‐bit quantities, not with real,
21‐bit Unicode code points. Therefore, if you want to print out
something like 𝒜, U+1D49C, MATHEMATICAL SCRIPT CAPITAL A, you have to
specify not one character but two “char units”: "\uD835\uDC9C". 😱
// ERROR!!
document.write(String.fromCharCode(0x1D49C));
// needed bogosity
document.write(String.fromCharCode(0xD835,0xDC9C));
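Rather than hard-coding the two "char units", a small helper (name made up) can compute the surrogate pair for any astral code point using the standard UTF-16 algorithm:

```javascript
// Convert an astral code point (> 0xFFFF) to its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
    var cp = codePoint - 0x10000;       // 20-bit offset into the astral planes
    var high = 0xD800 + (cp >> 10);     // top 10 bits -> high surrogate
    var low  = 0xDC00 + (cp & 0x3FF);   // bottom 10 bits -> low surrogate
    return String.fromCharCode(high, low);
}
toSurrogatePair(0x1D49C); // same string as "\uD835\uDC9C", i.e. 𝒜
```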
I think it's a console.log issue. Since console.log is only for debugging, do you have the same issues when you output from node via http to a browser?