How to get the real length of Japanese string in javascript? [duplicate] - javascript

I have an ASP Classic page with SHIFT_JIS charset. The meta tag under the page's head section is like this:
<meta http-equiv="Content-Type" content="text/html; charset=shift_jis">
My page has a text box (txtName) that should only allow 200 characters. I have a Javascript function that validates the character length, which is called on the onclick() event of my Submit button.
if(document.frmPage.txtName.value.length > 200) {
alert("You have exceeded the maximum length of 200.");
return false;
}
The problem is, Javascript is not getting the correct length of Japanese character encoded in SHIFT_JIS. For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding that Javascript uses by default. Some characters like ケ have 2 or 3 characters when in SHIFT_JIS.
If I rely only on the length provided by JavaScript, long Japanese input will pass the page validation and the page will try to save it to the database, which will then fail because of the column's maximum length of 200.
The browser that I'm using is Internet Explorer. Is there a way to get the SHIFT_JIS length of the Japanese character using Javascript? Is it possible to convert from Unicode to SHIFT_JIS using Javascript? How?
Thanks for the help!

For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding
Let's be clear: 测, U+6D4B (Han Character 'measure, estimate, conjecture') is a single character. When you encode it to a particular encoding like Shift-JIS, it may very well become multiple bytes.
In general JavaScript doesn't make encoding tables available so you can't find out how many bytes a character will take up. If you really need to, you have to carry around enough data to work it out yourself. For example, if you assume that the input contains only characters that are valid in Shift-JIS, this function would work out how many bytes are needed by keeping a list of all the characters that are a single byte, and assuming every other character takes two bytes:
function getShiftJISByteLength(s) {
return s.replace(/[^\x00-\x80\uFF61-\uFF9F]/g, 'xx').length;
}
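Tying that back to the validation in the question, the byte-based check would then look like this (a sketch reusing the question's form names; the \uFF61-\uFF9F range above covers the halfwidth katakana and punctuation that Shift_JIS encodes as a single byte, so everything else counts as two):

if (getShiftJISByteLength(document.frmPage.txtName.value) > 200) {
    alert("You have exceeded the maximum length of 200.");
    return false;
}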
However, there are no 8-byte sequences in Shift-JIS, and the character 测 is not available in Shift-JIS at all. (It's a Chinese character not used in Japan.)
The reason you might be thinking it constitutes an 8-byte sequence is this: when a browser can't submit a character in a form, because it does not exist in the target charset, it replaces it with an HTML character reference: in this case &#27979; (eight characters, which is where your count of 8 comes from). This is a lossy mangling: you can't tell whether the user typed literally &#27979; or 测. And if you are displaying the submitted content &#27979; as 测 then that means you are forgetting to HTML-encode your output, which probably means your application is highly vulnerable to cross-site scripting.
The only sensible answer is to use UTF-8 instead of Shift-JIS. UTF-8 can happily encode 测, or any other character, without having to resort to broken HTML character references. If you need to store content limited by encoded byte length in your database, there is a sneaky hack you can use to get the number of UTF-8 bytes in a string:
function getUTF8ByteLength(s) {
return unescape(encodeURIComponent(s)).length;
}
although probably it would be better to store native Unicode strings in the database so that the length limit refers to actual characters and not bytes in some encoding.
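For what it's worth, in environments newer than the Internet Explorer / ASP Classic setup described in the question, TextEncoder gives the same UTF-8 byte count without the unescape trick (a minimal sketch):

function getUTF8ByteLengthModern(s) {
    // TextEncoder always produces UTF-8; the length of the resulting
    // Uint8Array is the encoded byte count.
    return new TextEncoder().encode(s).length;
}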

You are getting confused between characters and bytes. 测 is ONE character, however you look at it. In UTF-16 (which is what Javascript uses), it's two BYTES. In Shift_JIS, 8 bytes, apparently. But in both cases, it's ONE character. So what you are trying to do is limit the text length to 200 BYTES. Since Javascript is using UTF-16 (UCS-2, really) you can get its byte length by multiplying the string length by 2, but that doesn't help you with Shift_JIS. Then again, you should probably consider switching to Unicode anyway, if you're working with Javascript...

Related

How viable is base128 encoding for scenarios like JavaScript strings?

I recently found that base32, base64 and base128 are the most efficient forms of base-n encoding, and that while base58, Ascii85, base91, base92 et al do provide some efficiency improvements over the ubiquitous base64 due to their use of more characters, there are some mapping losses; for example, there happen to be 272 indices per character-pair in base92 that are impossible to map to from base-10 powers of 2 and are thus completely wasted. (Base91 encoding only has a similar loss of 89 characters (as found by the script in the link above) but it's patented.)
It would be great if it were viable to use base128 in modern-day real-world scenarios.
There are 92 characters available within 0x21 (33) to 0x7E (126) sans \ and ", which make for a great start to creating JSONifiable strings with the most characters possible.
Here are a few ways I envisage the rest of the characters could be found. This is the question I'm asking.
Just dumbly use Unicode
Two-byte Unicode characters could be used to fill in the remaining 36 required indices. Highly suboptimal; I wouldn't be surprised if this was worse than base64 on the wire. Would only be useful for Unicode character counting scenarios like tweet length. Not exactly what I'm going for.
Select 36 non-Unicode characters from within the upper (>128) ASCII range
JavaScript was built with the expectation that character encoding configuration will occasionally go horribly wrong. So the language (and web browsers) handle printing arbitrary and unprintable binary data just fine. So why not just use the upper ASCII range? It's there to be used, right?
One very real problem could be data going over HTTP and falling through one or more proxies on the way between my browser and the server. How badly could this go? I'm aware that WebSockets over HTTP caused some real pain a couple of years ago, and potentially even today.
Kind of use UTF-8 in interesting ways
UTF-8 defines 1- to 4-byte long sequences to encapsulate Unicode codepoints. Bytes 2 to 4 always start with 10xxxxxx. There are 64 characters within that range. If I pass through a naïve proxy that filters characters outside the Unicode range on a character-by-character basis, using bytes within this range might mean my data would get through unscathed!
Determine 36 magic bytes that will work for various esoteric reasons
Maybe there are some high ASCII characters that will successfully traverse >99% of the Internet infrastructure for various historical or implementational reasons. What characters might these be?
Base64 is ubiquitous and has wound up being used everywhere, and it's easy to understand why: it was defined in 1987 to use a carefully chosen, very restricted alphabet of A-Z, a-z, 0-9, + and / that was (and remains) hard for even awkward environments (such as mainframes using non-ASCII encodings) to mangle.
EBCDIC mainframes and MIME email are still very much out there, but today base64 has also wound up as a heavily-used pipe within JavaScript to handle the case of "something in this data path might choke on binary", and the collective overhead it adds is nontrivial.
There's currently only one other question on SO regarding the general viability of base128 encoding, and literally every single answer has one or more issues. The accepted answer suggests that base128 must exactly use the first 128 characters of ASCII, and the only answer that acknowledges that the encoded alphabet can use any characters proceeds to claim that base128 is not in use because the encoded characters must be easily retypeable (which base58 is optimized for, FWIW). All the others have various problems (which I can explain further if desired).
This question is an attempt to re-ask the above with some additional unambiguous subject clarification, in the hope that a concrete go/no-go can be determined.
It's viable in the sense of being technically possible, but it's not viable in the sense of being able to achieve a result better than a much simpler alternative: using HTTP gzip compression. In practice if compression is enabled, the Huffman encoding of the strings will negate the 1/3 increase in size from base64 encoding because each character in the base64 string has only 6 bits of entropy.
As a test, I tried generating a 1 MB file of random data using a utility like Dummy File Creator, then base64 encoded it and gzipped the resulting file using 7zip.
Original data: 1,048,576 bytes
Base64 encoded data: 1,398,104 bytes
Gzipped base64 encoded data: 1,060,329 bytes
That's only a 1.12% increase in size (and the overhead of encoding -> compressing -> decompressing -> decoding).
Base128 encoding would take 1,198,373 bytes, so you'd have to compress it too if you wanted comparable file size. Gzip compression is a standard feature in all modern browsers so what's the case for base128 and all the extra complexity that would entail?
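If you want to reproduce that experiment, a rough Node.js sketch (using random bytes in place of the Dummy File Creator output, and the built-in zlib module in place of 7zip) would be:

const crypto = require('crypto');
const zlib = require('zlib');

const original = crypto.randomBytes(1024 * 1024);          // 1 MiB of incompressible data
const base64 = Buffer.from(original.toString('base64'));   // ~33% larger
const gzipped = zlib.gzipSync(base64);                      // Huffman coding claws most of that back

console.log('original bytes:       ', original.length);
console.log('base64 bytes:         ', base64.length);
console.log('gzipped base64 bytes: ', gzipped.length);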
Select 36 non-Unicode characters from within the upper (>128) ASCII range
Base128 is not effective because you must use characters with code points greater than 128. For characters with code points >= 128, Chrome sends two bytes... (so a string of 1 MB of such characters becomes 2 MB on the wire, and you lose all of the gain). For base64 strings this phenomenon doesn't appear (so we lose only ~33%). More details are here, in the "update" section.
The reason base64 is used so widely is that it uses only English letters and numbers to encode a binary stream.
Technically we can use higher bases, but the problem is that they need to fit into some character set.
UTF-8 is one of the widely used charsets and if you are using XML or JSON to transmit data, you can very well use a Base256 encoding like the below
https://github.com/bharatmicrosystems/base256
Kind of use UTF-8 in interesting ways
UTF-8 defines 1- to 4-byte long sequences to encapsulate Unicode codepoints. Bytes 2 to 4 always start with 10xxxxxx. There are 64 characters within that range. If I pass through a naïve proxy that filters characters outside the Unicode range on a character-by-character basis, using bytes within this range might mean my data would get through unscathed!
This is actually quite viable and has been used in base-122. Despite the name, it's in fact base-128, because the 6 invalid values (128 − 122) are encoded specially so that a series of 14 bits can always be represented with at most 2 bytes, exactly like base-128's 7 bits per byte; in reality it can even be optimized to be more efficient than base-128.
Base-122 encoding takes chunks of seven bits of input data at a time. If the chunk maps to a legal character, it is encoded with the single byte UTF-8 character: 0xxxxxxx. If the chunk would map to an illegal character, we instead use the two-byte UTF-8 character: 110xxxxx 10xxxxxx. Since there are only six illegal code points, we can distinguish them with only three bits. Denoting these bits as sss gives us the format: 110sssxx 10xxxxxx. The remaining eight bits could seemingly encode more input data. Unfortunately, two-byte UTF-8 characters representing code points less than 0x80 are invalid. Browsers will parse invalid UTF-8 characters into error characters. A simple way of enforcing code points greater than 0x80 is to use the format 110sss1x 10xxxxxx, equivalent to a bitwise OR with 0x80 (this can likely be improved, see §4). Figure 3 summarizes the complete base-122 encoding.
§2.2 Base-122 Encoding
You can find the implementation on github
The accepted answer suggests that base128 must exactly use the first 128 characters of ASCII, ...
Base-122 doesn't use exactly the first 128 ASCII characters, so it can be stored in a normal null-terminated string. And as for:
... and the only answer that acknowledges that the encoded alphabet can use any characters proceeds to claim that that base128 is not in use because the encoded characters must be easily retypeable (which base58 is optimized for, FWIW)
Encodings that use non-printable characters are generally not meant for typing by hand but for transmission. For example, base-122 is optimized for storing binary data in JavaScript strings in a UTF-8 HTML file, which probably fits your use case best.
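To make the 7-bit chunking concrete, here is a stripped-down sketch of the core idea (this is not the linked base-122 implementation; instead of emitting the 110sssxx 10xxxxxx escape for the six illegal values, it simply refuses them):

// The six code points base-122 treats as illegal in a one-byte chunk:
// null, newline, carriage return, double quote, ampersand, backslash.
const ILLEGAL = new Set([0x00, 0x0A, 0x0D, 0x22, 0x26, 0x5C]);

function encode7Bit(bytes) {
    const out = [];
    let buffer = 0;   // holds up to 14 pending bits
    let bits = 0;
    for (const b of bytes) {
        buffer = (buffer << 8) | b;
        bits += 8;
        while (bits >= 7) {
            const chunk = (buffer >> (bits - 7)) & 0x7F;   // next 7 bits, emitted as 0xxxxxxx
            bits -= 7;
            buffer &= (1 << bits) - 1;
            if (ILLEGAL.has(chunk)) throw new Error('would need the two-byte escape');
            out.push(chunk);
        }
    }
    if (bits > 0) {
        const tail = (buffer << (7 - bits)) & 0x7F;        // left-align the final partial chunk
        if (ILLEGAL.has(tail)) throw new Error('would need the two-byte escape');
        out.push(tail);
    }
    return Uint8Array.from(out);   // every value is a legal one-byte UTF-8 character
}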

Javascript - Alternative to lzw compression for Database entry

I have strings (about 1-5Kb) of the form:
FF,A3V,X7Y,aA4,....
lzw compresses these really nicely, but includes Turkish characters. These are then submitted to a MySQL database.
Sometimes MySQL can 'play-up' and not submit these properly, putting question marks '?' in place of the Turkish characters. They can do this even when you have your text areas properly defined. Exporting and reimporting the table can sort this out. This is fine for my test database, but not something I am happy with when this goes live.
Consequently I am looking for an alternative to lzw, which will compress but only using normal letters/numbers etc.
Does anyone know of a PUBLIC DOMAIN compression method that avoids Turkish characters (and any other non-standard characters)? Can anyone point me to some code in JavaScript (or C++ or C#, which I can convert)?
To expand a bit on what's been said in the comments... Storing strings of bytes, such as the output a compression algorithm typically produces, in a VARCHAR, CHAR, or TEXT column is not valid usage.
These column types are not for byte strings; they are for strings of valid characters only. Not every string of bytes is a valid string of characters in any given character set, and MySQL isn't going to allow invalid characters (and for many character sets, the correspondence between "character" and "byte" isn't 1:1).
In the good ol' days™, the two were interchangeable but this is not the case any more (and hasn't been, to one degree or another, for a while).
If your column type, instead, were BINARY or VARBINARY or BLOB, the issue should disappear, because those data types are for binary data.
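If changing the column type really isn't an option, another route (a workaround of my own, not something the answer above suggests) is to re-encode the compressed output so that only base64 characters ever reach MySQL. LZW over JavaScript strings typically produces code units above 255, which btoa cannot handle directly, so split each 16-bit code unit into two bytes first:

function compressedStringToBase64(s) {
    const bytes = new Uint8Array(s.length * 2);
    for (let i = 0; i < s.length; i++) {
        const code = s.charCodeAt(i);   // 16-bit code unit from the LZW output
        bytes[2 * i] = code >> 8;       // high byte
        bytes[2 * i + 1] = code & 0xFF; // low byte
    }
    let binary = '';
    for (const b of bytes) binary += String.fromCharCode(b);
    return btoa(binary);                // safe: every code unit is now <= 255
}

function base64ToCompressedString(b64) {
    const binary = atob(b64);
    let s = '';
    for (let i = 0; i < binary.length; i += 2) {
        s += String.fromCharCode((binary.charCodeAt(i) << 8) | binary.charCodeAt(i + 1));
    }
    return s;
}

This costs roughly 2.7x the compressed size (2x for the byte split, 4/3 for base64), so measure whether it still beats storing the uncompressed text.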

Using charCodeAt() and fromCharCode to obtain Unicode characters (code value > 55349) with JS

I use "".charCodeAt(pos) to get the Unicode number for a strange character, and then String.fromCharCode for the reverse.
But I'm having problems with characters that have a Unicode number greater than 55349. For example, the Blackboard Bold characters. If I want Lowercase Blackboard Bold X (𝕩), which has a Unicode number of 120169, if I alert the code from JavaScript:
alert(String.fromCharCode(120169));
I get another character. The same thing happens if I log an Uppercase Blackboard Bold X (𝕏), which has a Unicode number of 120143, from directly within JavaScript:
s="𝕏";
alert(s.charCodeAt(0))
alert(s.charCodeAt(1))
Output:
55349
56655
Is there a method to work with these kind of characters?
Internally, Javascript stores strings in a 16-bit encoding resembling UCS2 and UTF-16. (I say resembling, since it’s really neither of those two). The fact that they’re 16-bits means that characters outside the BMP, with code points above 65535, will be split up into two different characters. If you store the two different characters separately, and recombine them later, you should get the original character without problem.
Recognizing that you have such a character can be rather tricky, though.
Mathias Bynens has written a blog post about this: JavaScript’s internal character encoding: UCS-2 or UTF-16?. It’s very interesting (though a bit arcane at times), and concludes with several references to code libraries that support the conversion from UCS-2 to UTF-16 and vice versa. You might be able to find what you need in there.
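As a concrete illustration of what those libraries are doing, the surrogate-pair arithmetic itself is short. Modern engines expose it directly as String.fromCodePoint and String.prototype.codePointAt, but the underlying math is just this (a sketch):

function codePointToString(cp) {
    if (cp <= 0xFFFF) return String.fromCharCode(cp);
    cp -= 0x10000;
    const high = 0xD800 + (cp >> 10);    // high (lead) surrogate
    const low = 0xDC00 + (cp & 0x3FF);   // low (trail) surrogate
    return String.fromCharCode(high, low);
}

function codePointAt(s, pos) {
    const high = s.charCodeAt(pos);
    if (high >= 0xD800 && high <= 0xDBFF) {
        const low = s.charCodeAt(pos + 1);
        return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
    }
    return high;
}

codePointToString(120169);   // "𝕩"
codePointAt("𝕏", 0);         // 120143 (and charCodeAt still reports 55349, 56655)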

How do I get an ASCII code from a string in JavaScript?

(Similar questions to this have been asked on StackOverflow, but not exactly this. The nearest is probably "javascript how to convert unicode string to ascii", where there is already the remark "this has to be a dup[licate]". I have read some similar posts, but they don't answer my specific question. I've looked on the very good W3Schools site, and have also Googled it, but not found the answer that way either. So any hints here would be very much appreciated.)
I have an array of bytes being passed to a piece of JavaScript. In the JavaScript the data arrives in a string. I do not know the mechanism of transfer, as it's from a 3rd-party application. I do not know even whether the string is "wide" or "narrow".
In my JavaScript, I have some code like b = str.charCodeAt(pos);.
My problem is that a byte value such as 0x86 = 134 is coming through as character 0x2020 = 8224. This seems to be because my original byte is interpreted as a (probably Latin-1) 'dagger' character, and is then being translated to the equivalent Unicode code point. (The problem may or may not be JavaScript's 'fault'.) Similar problems occur with other values: the ranges 0x00..0x7F and 0xA0..0xFF seem to be fine, but most values from 0x80..0x9F are affected, and in each case the value seems to be the Unicode equivalent of the original Latin-1 interpretation.
Another observation is that the length of the string is what I'd expect for a narrow string if the length were measured in bytes. (On the other hand, if length returns a value in abstract characters, this doesn't tell me anything.)
So, in JavaScript, is there a way of getting at the 'raw' bytes in a string, or of getting a Latin-1 or ASCII character code directly, or of converting between character encodings, or of defining the default encoding?
I could write my own mapping, but I'd rather not. I expect that is what I'll end up doing, but that has the feel of a kludge on a kludge.
I'm also looking into whether there's anything I can adjust in the calling application (as it could be passing the data as a wide string, although I doubt it).
Either way, though, I'd be interested in whether there is a simple JavaScript solution, or to understand why there isn't.
(If the incoming data was character data, having Unicode dealt with so automatically would be great. But it's not, it's just a binary data stream.)
Thanks.
There is no such thing as the raw bytes in a String. The EcmaScript spec defines a string as a sequence of UTF-16 code units. That is the most fine-grained representation exposed by any interpreter I have ever encountered.
In the browser there are no encoding libraries. You have to roll your own if you are trying to represent a byte array as a string and want to re-encode it.
If your string already happens to be valid ASCII, then you can get the numeric value of a code unit by using the charCodeAt method.
"\n".charCodeAt(0) === 10
Start with the Javascript (Ecmascript) spec: http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-262.pdf. It says:
8.4 The String Type
The String type is the set of all finite ordered sequences of zero or more 16-bit unsigned integer values (“elements”). The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a code unit value (see Clause 6). Each element is regarded as occupying a position within the sequence. These positions are indexed with nonnegative integers. The first element (if any) is at position 0, the next element (if any) at position 1, and so on. The length of a String is the number of elements (i.e., 16-bit values) within it. The empty String has length zero and therefore contains no elements.
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.
NOTE The rationale behind this design was to keep the implementation of Strings as simple and high-performing as possible. The intent is that textual data coming into the execution environment from outside (e.g., user input, text read from a file or received over the network, etc.) be converted to Unicode Normalised Form C before the running program sees it. Usually this would occur at the same time incoming text is converted from its original character encoding to Unicode (and would impose no additional overhead). Since it is recommended that ECMAScript source code be in Normalised Form C, string literals are guaranteed to be normalised (if source text is guaranteed to be normalised), as long as they do not contain any Unicode escape sequences.
What charCodeAt(p) gives you is the UTF-16 value (a 16-bit number) of the character at index p in the string. Since UTF-16 directly represents Unicode's Basic Multilingual Plane (that would be code points U+0000–U+D7FF and U+E000–U+FFFF), your Latin-1 characters should be the values you expect them to be.
The fact that they are not suggests to me that you have an encoding problem with the inbound third-party octet stream: if the conversion to UTF-16 is being done and gets the encoding of the inbound octet stream wrong, you'll get odd results.
Perhaps it is being treated as vanilla ASCII when in fact it is UTF-8 (or vice versa). UTF-8 represents code points above 0x7F as 2-, 3- or 4-octet "digraphs".
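If it does turn out to be UTF-8 octets decoded one byte per character, the old escape/decodeURIComponent trick (deprecated, but still widely cited) will reassemble the multi-octet sequences; shown here only to illustrate the mismatch, since the questioner's data is binary rather than text:

const bytesAsString = '\xC3\xA2';               // the two UTF-8 octets of 'â', one code unit per byte
decodeURIComponent(escape(bytesAsString));      // 'â': the pair really was one UTF-8 character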

Character Encoding: â?

I am trying to piece together the mysterious string of characters â?? I am seeing quite a bit of in our database - I am fairly sure this is a result of conversion between character encodings, but I am not completely positive.
The users are able to enter text (or cut and paste) into an Ext-JS rich text editor. The data is posted to a servlet which persists it to the database, and when I view it in the database I see those strange characters...
Is there any way to decode these back to their original meaning, if I am able to discover the correct encoding, or is there a loss of bits or bytes that has occurred through the conversion process?
Users are cutting and pasting from multiple versions of MS Word and PDF. Does the encoding follow where the user copied from?
Thank you
website is UTF-8
We are using ms sql server 2005;
SELECT serverproperty('Collation') -- Server default collation.
Latin1_General_CI_AS
SELECT databasepropertyex('xxxx', 'Collation') -- Database default
SQL_Latin1_General_CP1_CI_AS
and the column:
Column_name Type Computed Length Prec Scale Nullable TrimTrailingBlanks FixedLenNullInSource Collation
text varchar no -1 yes no yes SQL_Latin1_General_CP1_CI_AS
The non-Unicode equivalents of the nchar, nvarchar, and ntext data types in SQL Server 2000 are listed below. When Unicode data is inserted into one of these non-Unicode data type columns through a command string (otherwise known as a "language event"), SQL Server converts the data to the data type using the code page associated with the collation of the column. When a character cannot be represented on a code page, it is replaced by a question mark (?), indicating the data has been lost. Appearance of unexpected characters or question marks in your data indicates your data has been converted from Unicode to non-Unicode at some layer, and this conversion resulted in lost characters.
So this may be the root cause of the problem... and not an easy one to solve on our end.
â is encoded as 0xE2 in ISO-8859-1 and windows-1252. 0xE2 is also a lead byte for a three-byte sequence in UTF-8. (Specifically, for the range U+2000 to U+2FFF, which includes the windows-1252 characters –—‘’‚“”„†‡•…‰‹›€™).
So it looks like you have text encoded in UTF-8 that's getting misinterpreted as being in windows-1252, and displays as a â followed by two unprintable characters.
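That misinterpretation is easy to demonstrate; a Node.js sketch, with Buffer standing in for whatever layer is doing the decoding:

const smartQuote = '\u201C';                        // left double quotation mark, a typical Word "smart" character
const utf8Bytes = Buffer.from(smartQuote, 'utf8');  // <Buffer e2 80 9c>
console.log(utf8Bytes.toString('latin1'));          // 'â' followed by two unprintable bytes, i.e. the "â??" in the database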
This is something of an educated guess: you're probably just experiencing a naive conversion of Word/PDF documents to HTML (windows-1252 to UTF-8, most likely). If that's the case, probably 2/3 of the mysterious characters from Word documents are "smart quotes" and most of the rest are a result of their other "smart" editing features: ellipses, em dashes, etc. PDFs probably have similar features.
I would also guess that if the formatting after pasting into the ExtJS editor looks OK, then the encoding is getting passed along. Depending on the resulting use of the text, you may not need to convert.
If I'm still on base, and we're not talking about internationalization issues, then I can add that there are Word to HTML converters out there, but I don't know the details of how they operate, and I had mixed success when evaluating them. There is almost certainly some small information loss/error involved with such converters, since they need to make guesses about the original source of the "smart" characters. In my isolated case it was easier to just go back to the users and have them turn off the "smart" features.
The issue is clear: if the browser is good enough, a form in a web page can accept any Unicode character you can type or paste. If the character belongs to the HTML charset, it will be sent as is. If it doesn't, it'll get converted to an HTML entity. SQL Server will perform the appropriate conversion and silently corrupt your data when a character does not have an equivalent.
There's not much you can do to fully fix it, but you can make a workaround: let your servlet perform the conversion. This way you have full control over it. You can, for instance, compile a list of the most common non-Latin1 characters users paste (smart quotes, Unicode spaces...), which should be fairly easy to identify from context, and replace them with something better than ?. Or use a library that does this for you.
Or you can switch your DB to Unicode :)
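A sketch of that replace-the-usual-suspects idea (the particular character list is my assumption; the answer leaves the list up to you, and the servlet is the natural place to run it):

const REPLACEMENTS = [
    [/[\u2018\u2019\u201A]/g, "'"],    // smart single quotes
    [/[\u201C\u201D\u201E]/g, '"'],    // smart double quotes
    [/\u2026/g, '...'],                // ellipsis
    [/[\u2013\u2014]/g, '-'],          // en and em dashes
    [/[\u2000-\u200A]/g, ' ']          // assorted Unicode spaces
];

function toLatin1Friendly(text) {
    return REPLACEMENTS.reduce((t, [pattern, replacement]) => t.replace(pattern, replacement), text);
}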
You're storing Unicode data that uses 2 bytes per character into a varchar column that uses 1 byte per character; any text that uses 2 bytes per character will have a byte lost when stored in the DB.
All you need to do is change the varchar column to nvarchar, and then change the SQL parameters you're using in code, of course.
