I need to create an EBCDIC string within my javascript and save it into an EBCDIC database. A process on the EBCDIC system then uses the data. I haven't had any problems until I came across the character '¬'. In EBCDIC it has the hex value 5F. All of the usual letters and symbols seem to automagically convert with no problem. Any idea how I can create the EBCDIC value for '¬' within javascript so I can store it properly in the EBCDIC db?
Thanks!
If "all of the usual letters and symbols seem to automagically convert", then I very strongly suspect that you do not have to create an EBCDIC string in Javascript. The character codes for Latin letters and digits are completely different in EBCDIC than they are in Unicode, so something in your server code is already converting the strings.
Thus what you need to determine is how that process works, and specifically you need to find out how the translation maps character codes from Unicode source into the EBCDIC equivalents. Once you know that, you'll know what Unicode character to use in your Javascript code.
As a further note: every single time I've been told by an IT organization that their mainframe software requires that data be supplied in EBCDIC, that advice has been dead wrong. The fact that there's some external interface means that something in the pile of iron that makes up the mainframe and its tentacles, something the IT people have forgotten about and probably couldn't find if they needed to, is already mapping "real world" character encodings like Unicode into EBCDIC. How does it work? Well, it may be impossible to figure out.
You might try whether this works: var notSign = "\u00AC";
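If it turns out that you genuinely must produce the EBCDIC byte yourself, here is a minimal sketch. It assumes the target is code page 037 (where, per your question, the not sign is 0x5F); the map and helper function are hypothetical, not any standard API:

var unicodeToEbcdic037 = { "\u00AC": 0x5F }; // extend with other mappings as needed
function toEbcdicByte(ch) {
    var b = unicodeToEbcdic037[ch];
    if (b === undefined) throw new Error("No EBCDIC mapping for " + ch);
    return b;
}
console.log(toEbcdicByte("\u00AC").toString(16)); // "5f"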
edit: also: here's a good reference for HTML entities and Unicode glyphs: http://www.elizabethcastro.com/html/extras/entities.html The HTML/XML syntax uses decimal numbers for the character codes. For Javascript, you have to convert those to hex, and the notation in Javascript strings is "\u" followed by a 4-digit hex constant. (That reference isn't complete, but it's pretty easy to read and it's got lots of useful symbols.)
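For instance, the not sign is &#172; in HTML (decimal 172, hex AC), so in Javascript you would write "\u00AC":

console.log((172).toString(16));                    // "ac" -- decimal to hex
console.log(String.fromCharCode(172) === "\u00AC"); // true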
Related
I'm working on a node module that parses RTF files and does some find and replace. I have already come up with a solution for special characters expressed in escaped unicode here, but have run into a wall when it comes to CJK characters. Is there an easy way to do these conversions in JavaScript, either with a library or built in?
Example:
An RTF file viewed in plain text contains:
Now testing symbols {鈴:200638d}
When parsed in NodeJS, this part of the file looks like:
Now testing symbols \{
\f1 \'e2\'8f
\f0 :200638d\}\
I understand that \f1 and \f0 denote font changes, and the \'e2\'8f block is the actual character... but how can I take \'e2\'8f and convert it back to 鈴, or conversely, convert 鈴 to \'e2\'8f?
I have tried looking up the character in different encodings and am not seeing anything that remotely resembles \'e2\'8f. I understand that the RTF control \'hh is A hexadecimal value, based on the specified character set (may be used to identify 8-bit values) (source) or maybe the better definition comes from Microsoft RTF Spec; %xHH (OCTET with the hexadecimal value of HH) (download) but I have no idea what to do with that information to get conversions going on this.
I was able to parse your sample file using my RTF parser and retrieve the correct character.
The key thing is that the \fonttbl command, as the name suggests, defines the fonts used in the document. As part of the definition of each font, the \fcharset command determines the character set to be used with this font. You need to use this to correctly interpret the character data.
My parser maps the argument of \fcharset to a codeset name here, and this is then translated to a character set name which can be used to retrieve the correct Java Charset here. Your character set handling will obviously be different as you are working in Javascript, but hopefully this information will help you move forward.
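In Node, a rough sketch of that approach might look like the following. It assumes the iconv-lite package, and the charsetToEncoding table is a hypothetical stub covering only a few of the \fcharset values defined in the RTF spec:

var iconv = require('iconv-lite');

// Hypothetical stub: map RTF \fcharset arguments to encoding names
// (0 = ANSI, 128 = Shift_JIS, 134 = GB2312; see the spec for the full list).
var charsetToEncoding = { 0: 'win1252', 128: 'shift_jis', 134: 'gbk' };

function decodeHexEscapes(bytes, fcharset) {
    // bytes: e.g. [0xe2, 0x8f], parsed from \'e2\'8f
    return iconv.decode(Buffer.from(bytes), charsetToEncoding[fcharset]);
}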
I have strings (about 1-5Kb) of the form:
FF,A3V,X7Y,aA4,....
lzw compresses these really nicely, but the compressed output includes Turkish characters. These strings are then submitted to a MySQL database.
Sometimes MySQL can 'play up' and not store these properly, putting question marks '?' in place of the Turkish characters. It can do this even when you have your text fields properly defined. Exporting and reimporting the table can sort this out. This is fine for my test database, but not something I am happy with when this goes live.
Consequently I am looking for an alternative to lzw which compresses using only normal letters/numbers etc.
Does anyone know of a PUBLIC DOMAIN compression method that avoids Turkish characters (and any other non-standard characters)? Can anyone point me to some code in javascript (or C++ or C# which I can convert)?
To expand a bit on what's been said in the comments... Storing strings of bytes, such as the output a compression algorithm typically produces, in a VARCHAR or CHAR or TEXT column is not valid usage.
These column types are not for byte strings; they are for strings of valid characters only. Not every string of bytes is a valid string of characters in any given character set, and MySQL isn't going to allow invalid characters. (For some character sets, the correlation between "character" and "byte" isn't even 1:1.)
In the good ol' days™, the two were interchangeable but this is not the case any more (and hasn't been, to one degree or another, for a while).
If your column type, instead, were BINARY or VARBINARY or BLOB, the issue should disappear, because those data types are for binary data.
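As a sketch of what that looks like from Node (assuming the mysql2 package and a table with a BLOB column; the connection settings, table name, and column name here are all placeholders):

var mysql = require('mysql2/promise');

async function storeCompressed(bytes) {
    var conn = await mysql.createConnection({ host: 'localhost', user: 'me', database: 'test' });
    // Buffers are transmitted as raw binary, so a BLOB column stores the
    // compressed output untouched -- no character-set validation applies.
    await conn.execute('INSERT INTO payloads (data) VALUES (?)', [Buffer.from(bytes)]);
    await conn.end();
}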
I am trying to make JavaScript print all Unicode characters. According to my research, there are 1,114,112 Unicode characters.
A script like the following could work:
for (var i = 0; i < 1114112; i++)
    console.log(String.fromCharCode(i));
But I found out that only 10% of the 1,114,112 Unicode characters are used.
How can I print only the used Unicode characters?
As Jukka said, JavaScript has no built-in way of knowing whether a given Unicode code point has been assigned a symbol yet or not.
There is still a way to do what you want, though.
I’ve written several scripts that parse the Unicode database and create separate data files for each category, property, script, block, etc. in Unicode. I’ve also created an HTTP API that allows you to programmatically get all code points (i.e. an array of numbers) in a given Unicode category, or all symbols (i.e. an array of strings for each character) with a given Unicode property, or a regular expression that matches any symbol in a certain Unicode script.
For example, to get an array of strings that contains one item for each Unicode code point that has been assigned a symbol in Unicode v6.3.0, you could use the following URL:
http://mathias.html5.org/data/unicode/format?version=6.3.0&property=Assigned&type=symbols&prepend=window.symbols%20%3D%20&append=%3B
Note that you can prepend and append anything you like to the output by tweaking the URL parameters, to make it easier to reuse the data in your own scripts. An example HTML page that console.log()s all these symbols, as you requested, could be written as follows:
<!DOCTYPE html>
<meta charset="utf-8">
<title>All assigned Unicode v6.3.0 symbols</title>
<script src="http://mathias.html5.org/data/unicode/format?version=6.3.0&property=Assigned&type=symbols&prepend=window.symbols%20%3D%20&append=%3B"></script>
<script>
  window.symbols.forEach(function(symbol) {
    // Do what you want to do with `symbol` here, e.g.
    console.log(symbol);
  });
</script>
Note that since this is a lot of data, you can expect your DevTools console to become slow when opening this page.
Update: Nowadays, you should use Unicode data packages such as unicode-11.0.0 instead. In Node.js, you can then do the following:
const symbols = require('unicode-11.0.0/Binary_Property/Assigned/symbols.js');
console.log(symbols);
// Or, to get the code points:
require('unicode-11.0.0/Binary_Property/Assigned/code-points.js');
// Or, to get a regular expression that only matches these characters:
require('unicode-11.0.0/Binary_Property/Assigned/regex.js');
There is no direct way in JavaScript to find out whether a code point is assigned to a character or not, which appears to be the question here. You need information extracted from suitable sources, and this information needs to be updated whenever new characters are assigned in new versions of Unicode.
There are 1,114,112 code points in Unicode. The Unicode standard assigns to each code point the property gc, General Category. If the value of this property is anything but Cs, Co, or Cn, then the code point is assigned to a character. (Code points with gc equal to Co are Private Use code points, to which no character is assigned, but they may be used for characters by private agreements.)
What you would need to do is to get a copy of some relevant files in the Unicode character database (just a collection of files in specific formats, really) and write code that reads it and generates information about assigned code points. For the purposes of printing all Unicode characters, it might be best to generate the information as an array of ranges of assigned codepoints. And this would need to be repeated when the standard is updated with new characters.
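Worth noting: engines that support Unicode property escapes (ES2018 and later, so well after this answer was written) can do the "assigned" check at runtime without external data files. A minimal sketch; the result depends on the Unicode version the engine implements:

function isAssigned(codePoint) {
    // \p{Cn} matches unassigned code points, \p{Cs} matches surrogates.
    return !/[\p{Cn}\p{Cs}]/u.test(String.fromCodePoint(codePoint));
}
console.log(isAssigned(0x41));  // true  ('A')
console.log(isAssigned(0x378)); // false (unassigned, as of current Unicode)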
Even the rest isn’t trivial. You would need to decide what it means to print a character. Some characters are control characters that may have an effect such as causing a newline, but lack a visible glyph. Some (spaces) have empty glyphs. Some (combining marks) are meant to be rendered as marks attached to the preceding character, though they have conventional renderings as “standalone” characters, too. Some are meant to take essentially different shapes depending on the nearest context; they may have isolated forms, too, but just writing one character after another by no means guarantees that an isolated form is used.
Then there’s the problem of fonts. No single font can contain all Unicode characters, so you would need to find a collection of fonts that cover all of Unicode when used together, preferably so that they stylistically match somehow.
So if you are just looking for a compilation of all printable Unicode characters, consider using the Unicode code charts.
The trouble here is that Javascript is not, contrary to popular opinion, a Unicode environment.
Internally, it uses UCS-2, an incompatible 16-bit encoding method that predates UTF-16.
In addition, many of the unicode characters are not directly printable by themselves -- some of them are modifiers for the previous character -- for example the Spanish letter ñ can be written in unicode either as a single code point -- that character itself -- or as two code points -- n followed by a combining tilde (U+0303).
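You can see both forms, and ES6's String.prototype.normalize() reconciling them, in this small example:

var single = "\u00F1";     // ñ as one code point
var combined = "n\u0303";  // n + combining tilde
console.log(single === combined);                  // false
console.log(single === combined.normalize("NFC")); // true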
Here are a couple of resources that should really help you in understanding this:
http://mathiasbynens.be/notes/javascript-encoding
http://mathiasbynens.be/notes/javascript-unicode
I'm new to javascript and do not have a good grasp of its unicode handling. If I understand correctly it's kind of like C/C++ where a string contains a binary sequence without any encoding info.
When I use something like var str=window.getSelection().toString() to get the highlighted text, will the resulting string have the same encoding as the web-page? If so, what's the best way of finding out that encoding and converting it to a unicode one (e.g. UTF8)?
Strings in Javascript are not like "strings" in C or PHP, which are actually byte arrays and have encoding semantics. Strings in Javascript are quite different than that and are like strings in Java/C# or Python's unicode type.
They are strings of abstract characters, at least if you don't try to have non-BMP characters. In practice, you don't have to worry about that, I am just mentioning it for completeness.
As per above, var str=window.getSelection().toString() does not have any encoding semantics, it's just a string of the characters that are selected. You don't state any actual problem in your question, but if you are wondering if "special" characters will just work in Javascript, well, they do just work.
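If you do need UTF-8 bytes at some I/O boundary, that's a separate, explicit step. For example, modern browsers provide the TextEncoder API, which always produces UTF-8:

var str = window.getSelection().toString();    // plain string, no encoding attached
var utf8Bytes = new TextEncoder().encode(str); // Uint8Array of UTF-8 bytes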
I have written a personal web app that uses charCodeAt() to convert text that is input by the user into the relevant character codes (for example ⊇ is converted to 8839 for storage), which is then sent to Perl, which sends them to MySQL. To retrieve the input text, the app uses fromCharCode() to convert the numbers back to text.
I chose to do this because Perl's unicode support is very hard to deal with correctly. So Perl and MySQL only see numbers, which makes life a lot simpler.
My question is can I depend on fromCharCode() to always convert a number like 8834 to the relevant character? I don't know what standard it uses, but let's say it uses UTF-8; if it is changed to use UTF-16 in the future, this will obviously break my program if there is no backward compatibility.
I know that my ideas about these concepts aren't that clear, therefore please care to clarify if I've shown a misunderstanding.
fromCharCode and charCodeAt deal with Unicode code points, i.e. numbers between 0 and 65535 (0xFFFF), assuming all characters are in the Basic Multilingual Plane (BMP). Unicode and the code points are permanent, so you can trust them to remain the same forever.
Encodings such as UTF-8 and UTF-16 take a stream of code points (numbers) and output a byte stream. JavaScript is somewhat strange in that characters outside the BMP have to be represented as two code units (a surrogate pair), according to UTF-16 rules. However, virtually every character you'll ever encounter (including Chinese, Japanese etc.) is in the BMP, so your program will work even if you don't handle these cases.
One thing you can do is convert the numbers back into bytes (in big-endian int16 format), and interpret the resulting text as UTF-16. The behavior of fromCharCode and toCharCode is fixed in current JavaScript implementations and will not ever change.
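A quick illustration of both points -- the stable BMP roundtrip your app relies on, and the surrogate-pair construction for astral characters:

var cc = "\u2287".charCodeAt(0);                  // 8839, the ⊇ from the question
console.log(String.fromCharCode(cc));             // "⊇" -- roundtrips exactly
// Outside the BMP, one character takes two code units (a surrogate pair):
console.log(String.fromCharCode(0xD834, 0xDF06)); // "𝌆" (U+1D306)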
I chose to do this because Perl's unicode support is very hard to deal with correctly.
This is ɴᴏᴛ true!
Perl has the strongest Unicode support of any major programming language. It is much easier to work with Unicode if you use Perl than if you use any of C, C++, Java, C♯, Python, Ruby, PHP, or Javascript. This is not hyperbole and boosterism from uneducated, blind allegiance; it is a considered appraisal based on more than ten years of professional experience and study.
The problems encountered by naïve users are virtually always because they have deceived themselves about what Unicode is. The number-one worst brain-bug is thinking that Unicode is like ASCII but bigger. This is absolutely and completely wrong. As I wrote elsewhere:
It’s fundamentally and critically not true that Uɴɪᴄᴏᴅᴇ is just some enlarged character set relative to ᴀsᴄɪɪ. At most, that’s true of nothing more than the stultified ɪsᴏ‑10646. Uɴɪᴄᴏᴅᴇ includes much much more than just the assignment of numbers to glyphs: rules for collation and comparisons, three forms of casing, non-letter casing, multi-codepoint casefolding, both canonical and compatible composed and decomposed normalization forms, serialization forms, grapheme clusters, word- and line-breaking, scripts, numeric equivs, widths, bidirectionality, mirroring, print widths, logical ordering exclusions, glyph variants, contextual behavior, locales, regexes, multiple forms of combining classes, multiple types of decompositions, hundreds and hundreds of critically useful properties, and much much much more‼
Yes, that’s a lot, but it has nothing to do with Perl. It has to do with Unicode. That Perl allows you to access these things when you work with Unicode is not a bug but a feature. That those other languages do not allow you full access to Unicode can by no means be construed as a point in their favor: rather, those are all major bugs of the highest possible severity, because if you cannot work with Unicode in the 21st century, then that language is a primitive, broken, and fundamentally useless for the demanding requirements of modern text processing.
Perl is not. And it is a gazillion times easier to do those things right in Perl than in those other languages; in most of them, you cannot even begin to work around their design flaws. You’re just plain screwed. If a language doesn’t provide full Unicode support, it is not fit for this century; discard it.
Perl makes Unicode infinitely easier than languages that don’t let you use Unicode properly can ever do.
In this answer, you will find at the front, Seven Simple Steps for dealing with Unicode in Perl, and at the bottom of that same answer, you will find some boilerplate code that will help. Understand it, then use it. Do not accept brokenness. You have to learn Unicode before you can use Unicode.
And that is why there is no simple answer. Perl makes it easy to work with Unicode, provided that you understand what Unicode really is. And if you’re dealing with external sources, you are going to have to arrange for that source to use some sort of encoding.
Also read up on all the stuff I said about 𝔸𝕤𝕤𝕦𝕞𝕖 𝔹𝕣𝕠𝕜𝕖𝕟𝕟𝕖𝕤𝕤. Those are things that you truly need to understand. Another brokenness issue that falls out of Rule #49 is that Javascript is broken because it doesn’t treat all valid Unicode code points in exactly the same way irrespective of their plane. Javascript is broken in almost all the other ways, too. It is unsuitable for Unicode work. Just Rule #34 will kill you, since you can’t get Javascript to follow the required standard about what things like \w are defined to do in Unicode regexes.
It’s amazing how many languages are utterly useless for Unicode. But Perl is most definitely not one of those!
In my opinion it won't break.
Read Joel Spolsky's article on Unicode and character encoding. Relevant part of the article is quoted below:
Every letter in every alphabet is assigned a number by the Unicode consortium which is written like this: U+0639. This number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. The English letter A would be U+0041.
It does not matter whether this magical number is encoded in UTF-8 or UTF-16 or any other encoding. The number will still be the same.
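For example, using the U+0639 from the quote above (the byte sequences differ per encoding, but the code point never does):

console.log("\u0639".charCodeAt(0).toString(16)); // "639" -- the code point
console.log(encodeURIComponent("\u0639"));        // "%D8%B9" -- its two UTF-8 bytes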
As pointed out in other answers, fromCharCode() and charCodeAt() deal with Unicode code points for any code point in the Basic Multilingual Plane (BMP). Strings in JavaScript are UCS-2 encoded, and any code point outside the BMP is represented as two JavaScript characters. None of these things are going to change.
To handle any Unicode character on the JavaScript side, you can use the following function, which will return an array of numbers representing the sequence of Unicode code points for the specified string:
var getStringCodePoints = (function() {
    function surrogatePairToCodePoint(charCode1, charCode2) {
        return ((charCode1 & 0x3FF) << 10) + (charCode2 & 0x3FF) + 0x10000;
    }

    // Read the string in character by character and create an array of code points
    return function(str) {
        var codePoints = [], i = 0, charCode;
        while (i < str.length) {
            charCode = str.charCodeAt(i);
            if ((charCode & 0xF800) == 0xD800) {
                codePoints.push(surrogatePairToCodePoint(charCode, str.charCodeAt(++i)));
            } else {
                codePoints.push(charCode);
            }
            ++i;
        }
        return codePoints;
    };
})();
var str = "𝌆";
var codePoints = getStringCodePoints(str);

console.log(str.length);                 // 2
console.log(codePoints.length);          // 1
console.log(codePoints[0].toString(16)); // 1d306
JavaScript strings are UTF-16; this isn't something that is going to be changed.
But don't forget that UTF-16 is variable length encoding.
In 2018, you can use String.prototype.codePointAt() and String.fromCodePoint().
These methods work even if a character is not in the Basic Multilingual Plane (BMP).
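A small sketch of the difference from the older methods:

var tetragram = "\uD834\uDF06"; // "𝌆", U+1D306, outside the BMP
console.log(tetragram.codePointAt(0).toString(16));      // "1d306" -- full code point
console.log(tetragram.charCodeAt(0).toString(16));       // "d834"  -- just the high surrogate
console.log(String.fromCodePoint(0x1D306) === tetragram); // true
// for...of also iterates by code point, not by code unit:
for (var ch of tetragram) console.log(ch); // one iteration: "𝌆"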