How do I get an ASCII code from a string in JavaScript? - javascript

(Similar questions to this have been asked on StackOverflow, but not exactly this. The nearest is probably "javascript how to convert unicode string to ascii", where there is already the remark "this has to be a dup[licate]". I have read some similar posts, but they don't answer my specific question. I've looked on the very good W3Schools site, and have also Googled it, but not found the answer that way either. So any hints here would be very much appreciated.)
I have an array of bytes being passed to a piece of JavaScript. In the JavaScript the data arrives in a string. I do not know the mechanism of transfer, as it's from a 3rd-party application. I do not know even whether the string is "wide" or "narrow".
In my JavaScript, I have some code like b = str.charCodeAt(pos);.
My problem is that a byte value such as 0x86 = 134 is coming through as character 0x2020 = 8224. This seems to be because my original byte is interpreted as a Latin-1 (probably) 'dagger' character, and is then being translated to the equivalent Unicode code-point. (The problem may or may not be JavaScript's 'fault'.) Similar problems occur with other values: the ranges 0x00..0x7F and 0xA0..0xFF seem to be fine, but most values from 0x80..0x9F are affected, and in each case the value seems to be the Unicode code point for the original Latin-1 character.
Another observation is that the length of the string is what I'd expect for a narrow string if the length were measured in bytes. (On the other hand, if length returns a value in abstract characters, this doesn't tell me anything.)
So, in JavaScript, is there a way of getting at the 'raw' bytes in a string, or getting a Latin-1 or ASCII character code directly, or of converting between character encodings, or defining the default encoding?
I could write my own mapping, but I'd rather not. I expect that is what I'll end up doing, but that has the feel of a kludge on a kludge.
I'm also looking into whether there's anything I can adjust in the calling application (as it could be passing the data as a wide string, although I doubt it).
Either way, though, I'd be interested in whether there is a simple JavaScript solution, or to understand why there isn't.
(If the incoming data was character data, having Unicode dealt with so automatically would be great. But it's not, it's just a binary data stream.)
Thanks.

There is no such thing as the raw bytes in a String. The ECMAScript spec defines a string as a sequence of UTF-16 code units. That is the most fine-grained representation exposed by any interpreter I have ever encountered.
On the browser there are no encoding libraries. You have to roll your own if you are trying to represent a byte array as a string and want to reencode it.
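A minimal sketch of such a hand-rolled mapping follows. It assumes the host application decoded the bytes as Windows-1252, which is only a guess that fits the symptom in the question (byte 0x86 arriving as U+2020). The names CP1252_HIGH, byteAt and toBytes are made up for this illustration.

```javascript
// Assumption: the bytes were decoded as Windows-1252 before reaching the script.
// Entry i holds the Unicode code point that Windows-1252 assigns to byte 0x80 + i.
// The five bytes Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D)
// map to themselves.
const CP1252_HIGH = [
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
];

// Reverse lookup: Unicode code point -> original byte value.
const unicodeToByte = new Map(CP1252_HIGH.map((cp, i) => [cp, 0x80 + i]));

// Hypothetical helper: recover the original byte at position pos.
function byteAt(str, pos) {
  const code = str.charCodeAt(pos);
  if (code <= 0xFF) return code; // 0x00..0x7F and 0xA0..0xFF pass through unchanged
  const byte = unicodeToByte.get(code);
  if (byte === undefined) {
    throw new RangeError("Code unit 0x" + code.toString(16) + " has no Windows-1252 byte");
  }
  return byte;
}

// Hypothetical helper: convert the whole string back to a byte array.
function toBytes(str) {
  const bytes = new Uint8Array(str.length);
  for (let i = 0; i < str.length; i++) bytes[i] = byteAt(str, i);
  return bytes;
}
```

With this in place, byteAt(str, pos) would replace str.charCodeAt(pos) in the question's code. If the host turns out to use a different code page, only the table changes.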
If your string already happens to be valid ASCII, then you can get the numeric value of a code unit by using the charCodeAt method.
"\n".charCodeAt(0) === 10

Start with the JavaScript (ECMAScript) specs: http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-262.pdf. It says:
8.4 The String Type
The String type is the set of all finite ordered
sequences of zero or more 16-bit unsigned integer
values (“elements”). The String type is generally
used to represent textual data in a running ECMAScript
program, in which case each element in the String is
treated as a code unit value (see Clause 6). Each
element is regarded as occupying a position within
the sequence. These positions are indexed with
nonnegative integers. The first element (if any) is
at position 0, the next element (if any) at position
1, and so on. The length of a String is the number
of elements (i.e., 16-bit values) within it. The
empty String has length zero and therefore contains
no elements.
When a String contains actual textual data, each
element is considered to be a single UTF-16 code unit.
Whether or not this is the actual storage format of a
String, the characters within a String are numbered by
their initial code unit element position as though they
were represented using UTF-16. All operations on Strings
(except as otherwise stated) treat them as sequences of
undifferentiated 16-bit unsigned integers; they do not
ensure the resulting String is in normalised form, nor
do they ensure language-sensitive results.
NOTE The rationale behind this design was to keep the
implementation of Strings as simple and high-performing
as possible. The intent is that textual data coming into
the execution environment from outside (e.g., user input,
text read from a file or received over the network, etc.)
be converted to Unicode Normalised Form C before the
running program sees it. Usually this would occur at the
same time incoming text is converted from its original
character encoding to Unicode (and would impose no additional
overhead). Since it is recommended that ECMAScript source
code be in Normalised Form C, string literals are guaranteed
to be normalised (if source text is guaranteed to be
normalised), as long as they do not contain any Unicode
escape sequences.
What charCodeAt(p) gives you is the UTF-16 value (a 16-bit number) of the character at index p in the string. Since UTF-16 directly represents Unicode's Basic Multilingual Plane (that would be code points U+0000–U+D7FF and U+E000–U+FFFF), your Latin-1 characters should be the values you expect them to be.
The fact that they are not suggests to me that you have an encoding problem with the inbound 3rd-party octet stream: if the conversion to UTF-16 is being done with the wrong assumption about the encoding of the inbound octet stream, you'll get odd results.
Perhaps it is being treated as vanilla ASCII when in fact it is UTF-8 (or vice versa). UTF-8 represents code points above 0x7F as 2-, 3- or 4-octet sequences.
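To see how the UTF-16 value returned by charCodeAt differs from the octets UTF-8 would put on the wire, here is a small sketch. It assumes a runtime that provides the TextEncoder global (current browsers and Node.js do); the helper name utf8Octets is made up for this illustration.

```javascript
// TextEncoder always produces UTF-8.
const encoder = new TextEncoder();

// Hypothetical helper: the UTF-8 octets for a string, as a plain array.
function utf8Octets(str) {
  return Array.from(encoder.encode(str));
}

// One octet for ASCII, more for anything above 0x7F.
console.log(utf8Octets("A"));      // [ 65 ]                (0x41)
console.log(utf8Octets("\u00E9")); // [ 195, 169 ]          (0xC3 0xA9)
console.log(utf8Octets("\u2020")); // [ 226, 128, 160 ]     (0xE2 0x80 0xA0)

// charCodeAt, by contrast, reports the UTF-16 code unit.
console.log("\u2020".charCodeAt(0)); // 8224 (0x2020)
```

If the inbound stream were UTF-8 read as a single-byte encoding, each of those octets would show up as a separate character, so the string would be longer than the original byte count suggests.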

Related

Javascript - Alternative to lzw compression for Database entry

I have strings (about 1-5Kb) of the form:
FF,A3V,X7Y,aA4,....
lzw compresses these really nicely, but includes Turkish characters. These are then submitted to a MySQL database.
Sometimes MySQL can 'play-up' and not submit these properly, putting question marks '?' in place of the Turkish characters. They can do this even when you have your text areas properly defined. Exporting and reimporting the table can sort this out. This is fine for my test database, but not something I am happy with when this goes live.
Consequently I am looking for an alternative to lzw, which will compress but only using normal letters/numbers etc.
Does anyone know of a PUBLIC DOMAIN compression method that avoids Turkish characters (and any other non-standard characters)? Can anyone point me to some code in JavaScript (or C++ or C#, which I can convert)?
To expand a bit on what's been said in the comments... Storing strings of bytes, such as the output of a compression algorithm, in a VARCHAR or CHAR or TEXT column is not valid usage.
These column types are not for byte strings; they are for strings of valid characters only. Not every string of bytes is a valid string of characters in any given character set... and MySQL isn't going to allow invalid characters (and, for some character sets, the correlation between "character" and "byte" isn't 1:1).
In the good ol' days™, the two were interchangeable but this is not the case any more (and hasn't been, to one degree or another, for a while).
If your column type, instead, were BINARY or VARBINARY or BLOB, the issue should disappear, because those data types are for binary data.
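The point that not every byte string is a valid character string can be seen on the JavaScript side too. The sketch below is only an analogy, not MySQL's own validation: it assumes a runtime with the TextDecoder global and uses UTF-8 as the character set. The helper name isValidUtf8 is made up.

```javascript
// Hypothetical helper: true if the bytes form well-formed UTF-8 text.
function isValidUtf8(bytes) {
  try {
    // With fatal: true, decode() throws a TypeError on malformed input
    // instead of substituting U+FFFD.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(isValidUtf8(new Uint8Array([0x41, 0xC3, 0xA9]))); // true  ("Aé")
console.log(isValidUtf8(new Uint8Array([0x41, 0xFF])));       // false (0xFF never occurs in UTF-8)
console.log(isValidUtf8(new Uint8Array([0xC3])));             // false (truncated 2-octet sequence)
```

Compressed output is effectively arbitrary bytes, so it will routinely fail a check like this, which is why a binary column type is the better fit.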

JavaScript print all used Unicode characters

I am trying to make JavaScript print all Unicode characters. According to my research, there are 1,114,112 Unicode characters.
A script like the following could work:
// String.fromCodePoint handles values above 0xFFFF; String.fromCharCode would wrap them.
for (let i = 0; i < 1114112; i++)
console.log(String.fromCodePoint(i));
But I found out that only 10% of the 1,114,112 Unicode characters are used.
How can I print only the used Unicode characters?
As Jukka said, JavaScript has no built-in way of knowing whether a given Unicode code point has been assigned a symbol yet or not.
There is still a way to do what you want, though.
I’ve written several scripts that parse the Unicode database and create separate data files for each category, property, script, block, etc. in Unicode. I’ve also created an HTTP API that allows you to programmatically get all code points (i.e. an array of numbers) in a given Unicode category, or all symbols (i.e. an array of strings for each character) with a given Unicode property, or a regular expression that matches any symbols in a certain Unicode script.
For example, to get an array of strings that contains one item for each Unicode code point that has been assigned a symbol in Unicode v6.3.0, you could use the following URL:
http://mathias.html5.org/data/unicode/format?version=6.3.0&property=Assigned&type=symbols&prepend=window.symbols%20%3D%20&append=%3B
Note that you can prepend and append anything you like to the output by tweaking the URL parameters, to make it easier to reuse the data in your own scripts. An example HTML page that console.log()s all these symbols, as you requested, could be written as follows:
<!DOCTYPE html>
<meta charset="utf-8">
<title>All assigned Unicode v6.3.0 symbols</title>
<script src="http://mathias.html5.org/data/unicode/format?version=6.3.0&property=Assigned&type=symbols&prepend=window.symbols%20%3D%20&append=%3B"></script>
<script>
window.symbols.forEach(function(symbol) {
// Do what you want to do with `symbol` here, e.g.
console.log(symbol);
});
</script>
Demo. Note that since this is a lot of data, you can expect your DevTools console to become slow when opening this page.
Update: Nowadays, you should use Unicode data packages such as unicode-11.0.0 instead. In Node.js, you can then do the following:
const symbols = require('unicode-11.0.0/Binary_Property/Assigned/symbols.js');
console.log(symbols);
// Or, to get the code points:
require('unicode-11.0.0/Binary_Property/Assigned/code-points.js');
// Or, to get a regular expression that only matches these characters:
require('unicode-11.0.0/Binary_Property/Assigned/regex.js');
There is no direct way in JavaScript to find out whether a code point is assigned to a character or not, which appears to be the question here. You need information extracted from suitable sources, and this information needs to be updated whenever new characters are assigned in new versions of Unicode.
There are 1,114,112 code points in Unicode. The Unicode standard assigns to each code point the property gc, General Category. If the value of this property is anything but Cs, Co, or Cn, then the code point is assigned to a character. (Code points with gc equal to Co are Private Use code points, to which no character is assigned, but they may be used for characters by private agreements.)
What you would need to do is to get a copy of some relevant files in the Unicode character database (just a collection of files in specific formats, really) and write code that reads it and generates information about assigned code points. For the purposes of printing all Unicode characters, it might be best to generate the information as an array of ranges of assigned codepoints. And this would need to be repeated when the standard is updated with new characters.
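As a hedged alternative to parsing the database files yourself: engines that implement Unicode property escapes in regular expressions (ES2018) already carry the General Category data, so the "anything but Cs, Co, or Cn" rule can be sketched directly. The result depends on the Unicode version built into the engine, and the names isAssignedCharacter and assignedRanges are made up for this illustration.

```javascript
// Matches exactly one code point whose General Category is not Cs, Co, or Cn.
const ASSIGNED_CHARACTER = /^[^\p{Cs}\p{Co}\p{Cn}]$/u;

// Hypothetical helper: is this code point assigned to a character?
function isAssignedCharacter(codePoint) {
  return ASSIGNED_CHARACTER.test(String.fromCodePoint(codePoint));
}

// Hypothetical helper: collapse the assigned code points in [first, last]
// into an array of [start, end] ranges.
function assignedRanges(first = 0, last = 0x10FFFF) {
  const ranges = [];
  let start = -1; // -1 means "not currently inside a range"
  for (let cp = first; cp <= last; cp++) {
    if (isAssignedCharacter(cp)) {
      if (start < 0) start = cp;
    } else if (start >= 0) {
      ranges.push([start, cp - 1]);
      start = -1;
    }
  }
  if (start >= 0) ranges.push([start, last]);
  return ranges;
}

console.log(assignedRanges(0x41, 0x5A)); // [ [ 65, 90 ] ]
```

Calling assignedRanges() with no arguments walks all 1,114,112 code points, which takes a moment but only needs to be done once per engine version.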
Even the rest isn’t trivial. You would need to decide what it means to print a character. Some characters are control characters that may have an effect, such as causing a newline, but lack a visible glyph. Some (spaces) have empty glyphs. Some (combining marks) are meant to be rendered as marks attached to the preceding character, though they have conventional renderings as “standalone” characters, too. Some are meant to take essentially different shapes depending on the nearest context; they may have isolated forms, too, but just writing a character after another by no means guarantees that an isolated form is used.
Then there’s the problem of fonts. No single font can contain all Unicode characters, so you would need to find a collection of fonts that cover all of Unicode when used together, preferably so that they stylistically match somehow.
So if you are just looking for a compilation of all printable Unicode characters, consider using the Unicode code charts.
The trouble here is that JavaScript is not, contrary to popular opinion, a Unicode environment.
Internally, it uses UCS-2, an incompatible 16-bit encoding method that predates UTF-16.
In addition, many of the Unicode characters are not directly printable by themselves. Some of them are modifiers for the previous character: for example, the Spanish letter ñ can be written in Unicode either as a single code point (that character itself) or as two code points (n followed by a combining ~).
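The two spellings of ñ can be compared with String.prototype.normalize (ES2015). This is a small sketch; the helper name sameText is made up.

```javascript
const precomposed = "\u00F1"; // ñ as a single code point
const decomposed = "n\u0303"; // n followed by U+0303 COMBINING TILDE

// The two strings render the same but are not equal code unit by code unit.
console.log(precomposed === decomposed);          // false
console.log(precomposed.length, decomposed.length); // 1 2

// Hypothetical helper: compare after normalising both sides to NFC.
function sameText(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(sameText(precomposed, decomposed));   // true
console.log(precomposed.normalize("NFD").length); // 2
```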
Here are a couple of resources that should really help you in understanding this:
http://mathiasbynens.be/notes/javascript-encoding
http://mathiasbynens.be/notes/javascript-unicode

Using charCodeAt() and fromCharCode to obtain Unicode characters (code value > 55349) with JS

I use "".charCodeAt(pos) to get the Unicode number for a strange character, and then String.fromCharCode for the reverse.
But I'm having problems with characters that have a Unicode number greater than 55349. For example, the Blackboard Bold characters. If I want Lowercase Blackboard Bold X (𝕩), which has a Unicode number of 120169, if I alert the code from JavaScript:
alert(String.fromCharCode(120169));
I get another character. The same thing happens if I log an Uppercase Blackboard Bold X (𝕏), which has a Unicode number of 120143, from directly within JavaScript:
s="𝕏";
alert(s.charCodeAt(0))
alert(s.charCodeAt(1))
Output:
55349
56655
Is there a method to work with this kind of character?
Internally, JavaScript stores strings in a 16-bit encoding resembling UCS-2 and UTF-16. (I say resembling, since it’s really neither of those two.) The fact that the units are 16 bits wide means that characters outside the BMP, with code points above 65535, will be split up into two different characters. If you store the two different characters separately, and recombine them later, you should get the original character without problem.
Recognizing that you have such a character can be rather tricky, though.
Mathias Bynens has written a blog post about this: JavaScript’s internal character encoding: UCS-2 or UTF-16?. It’s very interesting (though a bit arcane at times), and concludes with several references to code libraries that support the conversion from UCS-2 to UTF-16 and vice versa. You might be able to find what you need in there.
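The splitting and recombining described above can be sketched as follows. The arithmetic is the standard UTF-16 surrogate formula; the helper names combineSurrogates and splitCodePoint are made up. Engines that support ES2015 also offer String.fromCodePoint and codePointAt, which do the same work.

```javascript
// Hypothetical helper: two UTF-16 code units -> one code point.
function combineSurrogates(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

// Hypothetical helper: one code point above 0xFFFF -> [high, low] code units.
function splitCodePoint(codePoint) {
  const offset = codePoint - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

// The values from the question: uppercase Blackboard Bold X is 120143.
console.log(combineSurrogates(55349, 56655)); // 120143
console.log(splitCodePoint(120143));          // [ 55349, 56655 ]

// ES2015 equivalents (assumes an engine that supports them).
console.log("\uD835\uDD4F".codePointAt(0));   // 120143
console.log(String.fromCodePoint(120169) === String.fromCharCode(55349, 56681)); // true
```

String.fromCharCode accepts several code units at once, so passing both halves of the pair also produces the intended character in older engines.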

Insert EBCDIC character into javascript string

I need to create an EBCDIC string within my javascript and save it into an EBCDIC database. A process on the EBCDIC system then uses the data. I haven't had any problems until I came across the character '¬'. In EBCDIC it is hex value of 5F. All of the usual letters and symbols seem to automagically convert with no problem. Any idea how I can create the EBCDIC value for '¬' within javascript so I can store it properly in the EBCDIC db?
Thanks!
If "all of the usual letters and symbols seem to automagically convert", then I very strongly suspect that you do not have to create an EBCDIC string in Javascript. The character codes for Latin letters and digits are completely different in EBCDIC than they are in Unicode, so something in your server code is already converting the strings.
Thus what you need to determine is how that process works, and specifically you need to find out how the translation maps character codes from Unicode source into the EBCDIC equivalents. Once you know that, you'll know what Unicode character to use in your Javascript code.
As a further note: every single time I've been told by an IT organization that their mainframe software requires that data be supplied in EBCDIC, that advice has been dead wrong. The fact that there's some external interface means that something in the pile of iron that makes up the mainframe and its tentacles, something the IT people have forgotten about and probably couldn't find if they needed to, is already mapping "real world" character encodings like Unicode into EBCDIC. How does it work? Well, it may be impossible to figure out.
You might try whether this works: var notSign = "\u00AC";
edit: also: here's a good reference for HTML entities and Unicode glyphs: http://www.elizabethcastro.com/html/extras/entities.html The HTML/XML syntax uses decimal numbers for the character codes. For Javascript, you have to convert those to hex, and the notation in Javascript strings is "\u" followed by a 4-digit hex constant. (That reference isn't complete, but it's pretty easy to read and it's got lots of useful symbols.)
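The decimal-to-hex conversion described above can be sketched like this. The helper name toUnicodeEscape is made up, and it only covers codes that fit in four hex digits.

```javascript
// Hypothetical helper: decimal character code (as used in HTML entities
// like &#172;) -> the text of a JavaScript "\uXXXX" escape.
function toUnicodeEscape(decimalCode) {
  if (!Number.isInteger(decimalCode) || decimalCode < 0 || decimalCode > 0xFFFF) {
    throw new RangeError("Expected an integer between 0 and 65535");
  }
  return "\\u" + decimalCode.toString(16).toUpperCase().padStart(4, "0");
}

console.log(toUnicodeEscape(172)); // \u00AC

// The same character without writing the escape by hand:
console.log(String.fromCharCode(172) === "\u00AC"); // true
```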

Javascript client-data compression

I am trying to develop a paint brush application using processingjs.
This API has a function loadPixels() that will load the RGB values into an array.
Now I want to store the array in the server DB.
The problem is the size of the array: when I convert it to a string, the size is 5 MB.
Is the best solution to do compression at the JavaScript level? How do I do it?
See http://rosettacode.org/wiki/LZW_compression#JavaScript for an LZW compression example. It works best on longer strings with repeated patterns.
From the Wikipedia article on LZW:
A dictionary is initialized to contain
the single-character strings
corresponding to all the possible
input characters (and nothing else
except the clear and stop codes if
they're being used). The algorithm
works by scanning through the input
string for successively longer
substrings until it finds one that is
not in the dictionary. When such a
string is found, the index for the
string less the last character (i.e.,
the longest substring that is in the
dictionary) is retrieved from the
dictionary and sent to output, and the
new string (including the last
character) is added to the dictionary
with the next available code. The last
input character is then used as the
next starting point to scan for
substrings.
In this way, successively longer
strings are registered in the
dictionary and made available for
subsequent encoding as single output
values. The algorithm works best on
data with repeated patterns, so the
initial parts of a message will see
little compression. As the message
grows, however, the compression ratio
tends asymptotically to the
maximum.
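The procedure in that description can be sketched as below. This is an illustrative version, not the Rosetta Code listing: it assumes every input character has a code in 0..255, it emits a plain array of numeric codes, and it uses no clear or stop codes. The function names lzwCompress and lzwDecompress are made up.

```javascript
// Compress a string of single-byte characters into an array of LZW codes.
function lzwCompress(input) {
  // Start with one dictionary entry per possible input character.
  const dictionary = new Map();
  for (let code = 0; code < 256; code++) dictionary.set(String.fromCharCode(code), code);

  const output = [];
  let phrase = "";
  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    if (ch.charCodeAt(0) > 255) throw new RangeError("Input must use character codes 0..255");
    const candidate = phrase + ch;
    if (dictionary.has(candidate)) {
      phrase = candidate; // keep extending the match
    } else {
      output.push(dictionary.get(phrase));       // emit the longest known substring
      dictionary.set(candidate, dictionary.size); // register the new, longer one
      phrase = ch;                                // restart from the last character
    }
  }
  if (phrase !== "") output.push(dictionary.get(phrase));
  return output;
}

// Rebuild the original string from the codes produced by lzwCompress.
function lzwDecompress(codes) {
  if (codes.length === 0) return "";
  const dictionary = [];
  for (let code = 0; code < 256; code++) dictionary.push(String.fromCharCode(code));

  let previous = dictionary[codes[0]];
  let result = previous;
  for (let i = 1; i < codes.length; i++) {
    const code = codes[i];
    // A code equal to the dictionary size refers to the entry being built right now.
    const entry = code < dictionary.length ? dictionary[code] : previous + previous[0];
    result += entry;
    dictionary.push(previous + entry[0]);
    previous = entry;
  }
  return result;
}

const sample = "TOBEORNOTTOBEORTOBEORNOT";
const packed = lzwCompress(sample);
console.log(packed.length < sample.length);    // true
console.log(lzwDecompress(packed) === sample); // true
```

The output codes grow past 255 as the dictionary fills, so they still need a storage format (for example fixed-width integers in a binary column) before being sent to a server.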
JavaScript implementation of Gzip has a couple answers that are relevant.
Also, Javascript LZW and Huffman Coding with PHP and JavaScript are other implementations I found.
