I'm new to JavaScript and do not have a good grasp of its Unicode handling. If I understand correctly, it's kind of like C/C++, where a string contains a binary sequence without any encoding info.
When I use something like var str=window.getSelection().toString() to get the highlighted text, will the resulting string have the same encoding as the web page? If so, what's the best way of finding out that encoding and converting it to a Unicode one (e.g. UTF-8)?
Strings in Javascript are not like "strings" in C or PHP, which are actually byte arrays and have encoding semantics. Strings in Javascript are quite different from that; they are like strings in Java/C# or Python's unicode type.
They are strings of abstract characters, at least if you don't try to have non-BMP characters. In practice, you don't have to worry about that; I am just mentioning it for completeness.
As per above, var str=window.getSelection().toString() does not have any encoding semantics, it's just a string of the characters that are selected. You don't state any actual problem in your question, but if you are wondering if "special" characters will just work in Javascript, well, they do just work.
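A minimal sketch of what that looks like in practice (run in the browser console with some text selected; this is only an illustration, not part of the answer above):

var str = window.getSelection().toString();
// Each element of the string is an abstract 16-bit code unit; what you see here
// is independent of whether the page itself was served as UTF-8, ISO-8859-1, etc.
for (var i = 0; i < str.length; i++) {
    console.log(str.charAt(i), "U+" + str.charCodeAt(i).toString(16).toUpperCase());
}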
Related
I need to parse URLs from JavaScript source code using a different language (i.e. PHP and Java). The issue is that I don't know what kind of encoding is used on certain URLs (see below). Is there a standard (i.e. an RFC or specification) for this kind of encoding that I can use to implement it in the language I need? Initially I thought it simply backslashes the forward slashes, but it seems it is more than that, as we have escaped characters such as \x3d...
var _F_jsUrl = 'https:\/\/www.example.com\/accounts\/static\/_\/js\/k\x3dgaia.gaiafe_glif.en.nnMHsIffkD4.O\/m\x3dglifb,identifier,unknownerror\/am\x3dggIgAAAAAAEKBCEImA2CYiCxoQo\/rt\x3dj\/d\x3d1\/rs\x3dABkqax3Fc8CWFtgWOYXlvHJI_bE3oVSwgA';
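These are the standard ECMAScript string-literal escape sequences: \/ is just an (optional) escaped slash, \xHH is a two-digit hexadecimal escape (\x3d is '='), and \uHHHH is the four-digit form. A minimal sketch of the decoding, shown in JavaScript for illustration (the same rules carry over to PHP or Java; it deliberately ignores the less common escapes such as \n, \t and octal):

function decodeJsStringEscapes(s) {
    return s
        .replace(/\\x([0-9A-Fa-f]{2})/g, function (m, hex) {    // \x3d -> "="
            return String.fromCharCode(parseInt(hex, 16));
        })
        .replace(/\\u([0-9A-Fa-f]{4})/g, function (m, hex) {    // \u003d -> "="
            return String.fromCharCode(parseInt(hex, 16));
        })
        .replace(/\\(.)/g, "$1");                               // \/ -> "/", \\ -> "\"
}

decodeJsStringEscapes('https:\\/\\/www.example.com\\/k\\x3dv');
// "https://www.example.com/k=v"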
I have strings (about 1-5Kb) of the form:
FF,A3V,X7Y,aA4,....
LZW compresses these really nicely, but the compressed output includes Turkish characters. These strings are then submitted to a MySQL database.
Sometimes MySQL can 'play up' and not store these properly, putting question marks '?' in place of the Turkish characters. It can do this even when you have your text columns properly defined. Exporting and reimporting the table can sort this out. This is fine for my test database, but not something I am happy with when this goes live.
Consequently I am looking for an alternative to lzw, which will compress but only using normal letters/numbers etc.
Does anyone know of a PUBLIC DOMAIN compression method that avoids Turkish characters (and any other non-standard characters)? Can anyone point me to some code in JavaScript (or C++ or C# which I can convert)?
To expand a bit on what's been said in the comments... Storing strings of bytes, such as the output a compression algorithm typically produces, in a VARCHAR, CHAR, or TEXT column is not valid usage.
These column types are not for byte strings, they are for strings of valid characters only. Not every string of bytes forms a valid string of characters in any given character set, and MySQL isn't going to allow invalid characters (note that, for some character sets, the correspondence between "character" and "byte" isn't 1:1).
In the good ol' days™, the two were interchangeable but this is not the case any more (and hasn't been, to one degree or another, for a while).
If your column type, instead, were BINARY or VARBINARY or BLOB, the issue should disappear, because those data types are for binary data.
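If changing the column type is not an option, another common workaround (offered here as a suggestion, not part of the answer above) is to Base64-encode the compressed bytes before storage; Base64 output uses only A-Z, a-z, 0-9, '+', '/' and '=', so it survives any text column and character set, at the cost of roughly 33% size overhead. A minimal sketch, assuming the compressor's output is available as byte values 0-255:

function bytesToBase64(bytes) {
    var binary = "";
    for (var i = 0; i < bytes.length; i++) {
        binary += String.fromCharCode(bytes[i]);   // one byte per code unit
    }
    return btoa(binary);                           // built-in browser Base64 encoder
}

function base64ToBytes(b64) {
    var binary = atob(b64);
    var bytes = new Uint8Array(binary.length);
    for (var i = 0; i < binary.length; i++) {
        bytes[i] = binary.charCodeAt(i);
    }
    return bytes;
}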
I have written a personal web app that uses charCodeAt() to convert text that is input by the user into the relevant character codes (for example ⊇ is converted to 8839 for storage), which is then sent to Perl, which sends them to MySQL. To retrieve the input text, the app uses fromCharCode() to convert the numbers back to text.
I chose to do this because Perl's unicode support is very hard to deal with correctly. So Perl and MySQL only see numbers, which makes life a lot simpler.
My question is: can I depend on fromCharCode() to always convert a number like 8834 to the relevant character? I don't know what standard it uses, but let's say it uses UTF-8; if it is changed to use UTF-16 in the future, this would obviously break my program if there is no backward compatibility.
I know that my ideas about these concepts aren't that clear, therefore please care to clarify if I've shown a misunderstanding.
fromCharCode and charCodeAt deal with Unicode code points, i.e. numbers between 0 and 65535 (0xFFFF), assuming all characters are in the Basic Multilingual Plane (BMP). Unicode and its code points are permanent, so you can trust them to remain the same forever.
Encodings such as UTF-8 and UTF-16 take a stream of code points (numbers) and output a byte stream. JavaScript is somewhat strange in that characters outside the BMP have to be constructed from two code units (two fromCharCode arguments, or two charCodeAt results), according to UTF-16's surrogate-pair rules. However, virtually every character you'll ever encounter (including Chinese, Japanese etc.) is in the BMP, so your program will work even if you don't handle these cases.
One thing you can do is convert the numbers back into bytes (in big-endian int16 format) and interpret the resulting text as UTF-16. The behavior of fromCharCode and charCodeAt is fixed in current JavaScript implementations and will not ever change.
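For illustration, a quick sketch of the round trip (the ⊇ example is the one from the question; the second part shows the two-code-unit case mentioned above):

"⊇".charCodeAt(0);                      // 8839 (U+2287)
String.fromCharCode(8839);              // "⊇"

// Outside the BMP, one character is two UTF-16 code units,
// so it takes two numbers to build it and two to read it back:
var s = String.fromCharCode(0xD834, 0xDF06);   // "𝌆" (U+1D306)
s.length;                                      // 2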
I chose to do this because Perl's unicode support is very hard to deal with correctly.
This is ɴᴏᴛ true!
Perl has the strongest Unicode support of any major programming language. It is much easier to work with Unicode if you use Perl than if you use any of C, C++, Java, C♯, Python, Ruby, PHP, or Javascript. This is not hyperbole and boosterism from uneducated, blind allegiance; it is a considered appraisal based on more than ten years of professional experience and study.
The problems encountered by naïve users are virtually always because they have deceived themselves about what Unicode is. The number-one worst brain-bug is thinking that Unicode is like ASCII but bigger. This is absolutely and completely wrong. As I wrote elsewhere:
It’s fundamentally and critically not true that Uɴɪᴄᴏᴅᴇ is just some enlarged character set relative to ᴀsᴄɪɪ. At most, that’s true of nothing more than the stultified ɪsᴏ‑10646. Uɴɪᴄᴏᴅᴇ includes much much more than just the assignment of numbers to glyphs: rules for collation and comparisons, three forms of casing, non-letter casing, multi-codepoint casefolding, both canonical and compatible composed and decomposed normalization forms, serialization forms, grapheme clusters, word- and line-breaking, scripts, numeric equivs, widths, bidirectionality, mirroring, print widths, logical ordering exclusions, glyph variants, contextual behavior, locales, regexes, multiple forms of combining classes, multiple types of decompositions, hundreds and hundreds of critically useful properties, and much much much more‼
Yes, that’s a lot, but it has nothing to do with Perl. It has to do with Unicode. That Perl allows you to access these things when you work with Unicode is not a bug but a feature. That those other languages do not allow you full access to Unicode can by no means be construed as a point in their favor: rather, those are all major bugs of the highest possible severity, because if you cannot work with Unicode in the 21st century, then that language is primitive, broken, and fundamentally useless for the demanding requirements of modern text processing.
Perl is not. And it is a gazillion times easier to do those things right in Perl than in those other languages; in most of them, you cannot even begin to work around their design flaws. You’re just plain screwed. If a language doesn’t provide full Unicode support, it is not fit for this century; discard it.
Perl makes Unicode infinitely easier than languages that don’t let you use Unicode properly can ever do.
In this answer you will find, at the front, Seven Simple Steps for dealing with Unicode in Perl, and, at the bottom of that same answer, some boilerplate code that will help. Understand it, then use it. Do not accept brokenness. You have to learn Unicode before you can use Unicode.
And that is why there is no simple answer. Perl makes it easy to work with Unicode, provided that you understand what Unicode really is. And if you’re dealing with external sources, you are going to have to arrange for that source to use some sort of encoding.
Also read up on all the stuff I said about 𝔸𝕤𝕤𝕦𝕞𝕖 𝔹𝕣𝕠𝕜𝕖𝕟𝕟𝕖𝕤𝕤. Those are things that you truly need to understand. Another brokenness issue that falls out of Rule #49 is that Javascript is broken because it doesn’t treat all valid Unicode code points in exactly the same way irrespective of their plane. Javascript is broken in almost all the other ways, too. It is unsuitable for Unicode work. Just Rule #34 will kill you, since you can’t get Javascript to follow the required standard about what things like \w are defined to do in Unicode regexes.
It’s amazing how many languages are utterly useless for Unicode. But Perl is most definitely not one of those!
In my opinion it won't break.
Read Joel Spolsky's article on Unicode and character encoding. The relevant part of the article is quoted below:
Every letter in every alphabet is assigned a number by the Unicode consortium which is written like this: U+0639. This number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. The English letter A would be U+0041.
It does not matter whether this magical number is encoded in UTF-8 or UTF-16 or any other encoding. The number will still be the same.
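A small illustration of that distinction between the code point and its encoded bytes (TextEncoder is a modern-browser API used here only for demonstration, not something from the quoted article):

"A".charCodeAt(0).toString(16);    // "41" -> the code point U+0041
"é".charCodeAt(0).toString(16);    // "e9" -> the code point U+00E9

// An encoding only determines which bytes represent that number;
// TextEncoder always produces UTF-8:
new TextEncoder().encode("é");     // Uint8Array [0xC3, 0xA9] - two bytes, same code point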
As pointed out in other answers, fromCharCode() and charCodeAt() deal with Unicode code points for any code point in the Basic Multilingual Plane (BMP). Strings in JavaScript are UCS-2 encoded, and any code point outside the BMP is represented as two JavaScript characters. None of these things are going to change.
To handle any Unicode character on the JavaScript side, you can use the following function, which will return an array of numbers representing the sequence of Unicode code points for the specified string:
var getStringCodePoints = (function() {
    function surrogatePairToCodePoint(charCode1, charCode2) {
        return ((charCode1 & 0x3FF) << 10) + (charCode2 & 0x3FF) + 0x10000;
    }

    // Read string in character by character and create an array of code points
    return function(str) {
        var codePoints = [], i = 0, charCode;
        while (i < str.length) {
            charCode = str.charCodeAt(i);
            if ((charCode & 0xF800) == 0xD800) {
                codePoints.push(surrogatePairToCodePoint(charCode, str.charCodeAt(++i)));
            } else {
                codePoints.push(charCode);
            }
            ++i;
        }
        return codePoints;
    }
})();
var str = "𝌆";
var codePoints = getStringCodePoints(str);
console.log(str.length); // 2
console.log(codePoints.length); // 1
console.log(codePoints[0].toString(16)); // 1d306
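For completeness, a sketch of the inverse direction (not part of the original answer): turning an array of code points back into a JavaScript string, splitting anything above U+FFFF into a UTF-16 surrogate pair:

function codePointsToString(codePoints) {
    var chars = [];
    for (var i = 0; i < codePoints.length; i++) {
        var cp = codePoints[i];
        if (cp > 0xFFFF) {
            // Split the supplementary code point into a surrogate pair
            cp -= 0x10000;
            chars.push(String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)));
        } else {
            chars.push(String.fromCharCode(cp));
        }
    }
    return chars.join("");
}

codePointsToString([0x1D306]) === "𝌆";   // true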
JavaScript strings are UTF-16; this isn't something that is going to change.
But don't forget that UTF-16 is a variable-length encoding.
In 2018, you can use String.prototype.codePointAt() and String.fromCodePoint().
These methods work even if a character is not in the Basic Multilingual Plane (BMP).
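A quick sketch of the difference these methods make for a non-BMP character:

var s = "𝌆";                          // U+1D306, outside the BMP
s.length;                             // 2 (two UTF-16 code units)
s.charCodeAt(0).toString(16);         // "d834" - just the high surrogate
s.codePointAt(0).toString(16);        // "1d306" - the full code point
String.fromCodePoint(0x1D306) === s;  // true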
(Similar questions to this have been asked on StackOverflow, but not exactly this. The nearest is probably "javascript how to convert unicode string to ascii", where there is already the remark "this has to be a dup[licate]". I have read some similar posts, but they don't answer my specific question. I've looked on the very good W3Schools site, and have also Googled it, but not found the answer that way either. So any hints here would be very much appreciated.)
I have an array of bytes being passed to a piece of JavaScript. In the JavaScript the data arrives in a string. I do not know the mechanism of transfer, as it's from a 3rd-party application. I do not know even whether the string is "wide" or "narrow".
In my JavaScript, I have some code like b = str.charCodeAt(pos);.
My problem is that a byte value such as 0x86 = 134 is coming through as character 0x2020 = 8224. This seems to be because my original byte is being interpreted as a (probably Latin-1) 'dagger' character, which is then being translated to the equivalent Unicode code point. (The problem may or may not be JavaScript's 'fault'.) Similar problems occur with other values: the ranges 0x00..0x7F and 0xA0..0xFF seem to be fine, but most values from 0x80..0x9F are affected; in each case the value seems to be the Unicode equivalent of the original Latin-1 character.
Another observation is that the length of the string is what I'd expect for a narrow string if the length were measured in bytes. (On the other hand, if length returns a value in abstract characters, this doesn't tell me anything.)
So, in JavaScript, is there a way of getting at the 'raw' bytes in a string, or of getting a Latin-1 or ASCII character code directly, or of converting between character encodings, or of defining the default encoding?
I could write my own mapping, but I'd rather not. I expect that is what I'll end up doing, but that has the feel of a kludge on a kludge.
I'm also looking into whether there's anything I can adjust in the calling application (as it could be passing the data as a wide string, although I doubt it).
Either way, though, I'd be interested in whether there is a simple JavaScript solution, or to understand why there isn't.
(If the incoming data was character data, having Unicode dealt with so automatically would be great. But it's not, it's just a binary data stream.)
Thanks.
There is no such thing as the raw bytes in a String. The EcmaScript spec defines a string as a sequence of UTF-16 code units. That is the most fine-grained representation exposed by any interpreter I have ever encountered.
In the browser there are no encoding libraries. You have to roll your own if you are trying to represent a byte array as a string and want to reencode it.
If your string already happens to be valid ASCII, then you can get the numeric value of a code unit by using the charCodeAt method.
"\n".charCodeAt(0) === 10
Start with the Javascript (Ecmascript) spec: http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-262.pdf. It says:
8.4 The String Type
The String type is the set of all finite ordered sequences of zero or more 16-bit unsigned integer values (“elements”). The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a code unit value (see Clause 6). Each element is regarded as occupying a position within the sequence. These positions are indexed with nonnegative integers. The first element (if any) is at position 0, the next element (if any) at position 1, and so on. The length of a String is the number of elements (i.e., 16-bit values) within it. The empty String has length zero and therefore contains no elements.
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.
NOTE The rationale behind this design was to keep the implementation of Strings as simple and high-performing as possible. The intent is that textual data coming into the execution environment from outside (e.g., user input, text read from a file or received over the network, etc.) be converted to Unicode Normalised Form C before the running program sees it. Usually this would occur at the same time incoming text is converted from its original character encoding to Unicode (and would impose no additional overhead). Since it is recommended that ECMAScript source code be in Normalised Form C, string literals are guaranteed to be normalised (if source text is guaranteed to be normalised), as long as they do not contain any Unicode escape sequences.
What charCodeAt(p) gives you is the UTF-16 value (a 16-bit number) of the character at index p in the string. Since UTF-16 directly represents Unicode's Basic Multilingual Plane (that would be code points U+0000–U+D7FF and U+E000–U+FFFF), your Latin-1 characters should be the values you expect them to be.
The fact that they are not suggests to me that you have an encoding problem with the inbound third-party octet stream — if the conversion to UTF-16 is being done and gets the encoding of the inbound octet stream wrong, you'll get odd results.
Perhaps it is being treated as vanilla ASCII when in fact it is UTF-8 (or vice versa). UTF-8 represents code points above 0x7F as 2-, 3-, or 4-octet sequences.
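As a concrete illustration of how much the assumed encoding matters for that 0x80..0x9F range (TextDecoder is a modern-browser API, shown here only to demonstrate the mismatch; it is not something the original answer relied on):

var bytes = new Uint8Array([0x86]);
new TextDecoder("windows-1252").decode(bytes);  // "†" (U+2020 - the symptom described in the question)
new TextDecoder("iso-8859-1").decode(bytes);    // also "†" - browsers treat the latin1 label as windows-1252
new TextDecoder("utf-8").decode(bytes);         // "\uFFFD" - a lone 0x86 is not valid UTF-8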
I need to create an EBCDIC string within my JavaScript and save it into an EBCDIC database. A process on the EBCDIC system then uses the data. I haven't had any problems until I came across the character '¬'. In EBCDIC it has a hex value of 5F. All of the usual letters and symbols seem to automagically convert with no problem. Any idea how I can create the EBCDIC value for '¬' within JavaScript so I can store it properly in the EBCDIC db?
Thanks!
If "all of the usual letters and symbols seem to automagically convert", then I very strongly suspect that you do not have to create an EBCDIC string in Javascript. The character codes for Latin letters and digits are completely different in EBCDIC than they are in Unicode, so something in your server code is already converting the strings.
Thus what you need to determine is how that process works, and specifically you need to find out how the translation maps character codes from Unicode source into the EBCDIC equivalents. Once you know that, you'll know what Unicode character to use in your Javascript code.
As a further note: every single time I've been told by an IT organization that their mainframe software requires that data be supplied in EBCDIC, that advice has been dead wrong. The fact that there's some external interface means that something in the pile of iron that makes up the mainframe and its tentacles, something the IT people have forgotten about and probably couldn't find if they needed to, is already mapping "real world" character encodings like Unicode into EBCDIC. How does it work? Well, it may be impossible to figure out.
You might try whether this works: var notSign = "\u00AC";
Edit: also, here's a good reference for HTML entities and Unicode glyphs: http://www.elizabethcastro.com/html/extras/entities.html The HTML/XML syntax uses decimal numbers for the character codes. For Javascript, you have to convert those to hex, and the notation in Javascript strings is "\u" followed by a 4-digit hex constant. (That reference isn't complete, but it's pretty easy to read and it's got lots of useful symbols.)
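A quick sketch of that decimal-to-hex step for the character in question (the not sign is &#172; in HTML, i.e. decimal 172):

(172).toString(16);        // "ac"
var notSign = "\u00AC";    // "¬"
notSign.charCodeAt(0);     // 172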