We are designing a SMS send form where users can type any character they want. The system should determine what type of character they type and based on that it will decide the type of the message and charge the user for SMS credts. This form is going to be used by all over the world.
I am trying this using Javascript. I count the number of characters and loop through each character. If any of the character is double byte (> 255) then I determine it is a UNICODE or else it is a plain ASCII text.
I am not sure whether I am doing in the right way.
Recently one of the user tried the below and he claimed that the system has not deducted as UNICODE. I got surprised that all these characters are less than 255 and I doubt my logic whether am I doing correct.
Sævar Davíðssson. ÆÝÐÞ
Can someone guide me please.
Because of how various sms systems handle characters, you might have to create a whitelist in order to know what people will or will not get charged for.
Some carriers even charge differently depending on whether they're going to other carriers as well, so it can get fairly complex.
And if that wasn't bad enough, some carriers don't use pre-defined standards for their character sets. And several (especially internationally) use different and conflicting standards for character encoding.
Especially using JavaScript if you don't have the same character encoding as the carrier you'll run into problems figuring out what's legal to use.
The original ASCII standard only defines 7-bit characters. There are a variety of 8-bit character encodings expanding on ASCII. One of the most popular ones is ISO 8859-1 ("latin-1", also mostly coincident with the windows codepage 1252). This adds a lot of western european language characters to the 7-bit ASCII set, including the ones in your example string.
Related
I'm new to web development and just trying to check if the user input contains emojis without using regex for performance reasons.
Is there a way to do it with JavaScript on the front end or by using java on the backend?
Java does not identify emoji as such
The official Unicode Character Database does not identify emoji characters as such, according to Annex A of Unicode® Technical Standard #51 UNICODE EMOJI.
I suppose that is why we do not see any kind of isEmoji method on the Java 13 class, Character.
Roll-your-own
According to that Annex A, there are emoji-data data files available describing aspects of emoji characters. If you are sufficiently motivated to reliably identify emoji characters, I suggest reading that Technical Note, and consider importing the data from those files to identify the code points of emoji. There may well be ranges of numbers that the Unicode Consortium uses to cluster the emoji characters.
Keep in mind that the Unicode Consortium in recent years has been frequently adding more and more emoji. So you will be chasing a moving target, needing updates.
You may be able to narrow down your ranges with the named ranges of code points defined in Character.UnicodeBlock.
I am guessing that Character.OTHER_SYMBOL may help, as the emoji I perused are so tagged, according to the handy macOS app, UnicodeChecker.
FYI, the Unicode Consortium does publish a list of emoji: Full Emoji List, v12.0.
By the way, the CLDR published by the Unicode Consortium and used by default in recent versions of Java defines how to sort emoji. Yes, emoji have sort-order: human faces before cat faces, and so on. The code points for emoji characters are assigned rather arbitrarily, so do not go by that for sorting.
Instead of trying to blacklist emojis, it'd probably be easier to whitelist the characters you do want to allow. If your site is multilingual, you'd have to add the characters of the languages you want to support. It should be relatively simple to loop over each character of your input and see if it's in the list of valid characters.
You'll want to do your validation on both the frontend and the backend. You want to do the frontend so you can show feedback to the user immediately, and you have to do validation on the backend so that people can't game your system by opening their browser's console or getting creative. Frontend stuff should never be trusted by the server in general.
I'm developing a Braille-to-text translator, and a nice feature to have is showing an output in Unicode's Braille patterns characters (say, kind of a Unicode Braille generator).
Since I know the dots that are "enabled" in each cell ("Braille character"), it would be trivial to construct the Unicode name of the character I need (they are of the form of BRAILLE PATTERN DOTS-123456 if they are all enabled, or BRAILLE PATTERN DOTS-14 if only dots 1 and 4 are enabled.
Is there any simple method to get a Unicode character in Javascript from its Unicode name?
My second try will be math*ing* with the Unicode values, but I think constructing the names is pretty much straightforward.
Thanks in advance :)
JavaScript, unlike some other languages, does not have any direct way of getting a character from its Unicode name. In my full Unicode input utility, I have therefore used the brute force method of using the Unicode character data base as a text block and parsing it. You might find some better, more efficient and more maintainable tools, but if you need just some specific collections of characters as in the question, an ad hoc approach is better. In this case, you don’t even need the Unicode names as such; they would be just an intermediate step from dot patterns to characters.
Clause 15.11 in the Unicode Standard, chapter 15, describes the allocation principles for Braille symbols.
Very interesting. In my app. I use a DB look up as you described and then use Javascript and the html canvas object to dynamically construct the Braille. This has the added benefit that I can create custom ARIA tags if desired. I say this because ASCII braille and Unicode aren't readable formats by several if not all Screen Readers. I know VoiceOver on iOS and Mac's won't read it. Something I'm working on is a way to make JS read BRL ASCII fields & Unicode and create ARIA tags so that a blind user actual knows what's going on on the webpage.
In JavaScript we can use the below line of code(which uses Unicode) for displaying copyright symbol:
var x = "\u00A9 RPeripherals";
Why can't we type the copyright symbol directly using ALT code (alt+0169) like below :
var x = "© RPeripherals" ;
What is the difference between these two methods?
Why can't we type the copyright symbol directly using ALT code (alt+0169) like below :
Who says so? Of course you can. Just configure your code editor to use UTF-8 encoding for source files. You should never use anything else to begin with...
What is the difference between these two methods?
The difference is that using the \uXXXX scheme you are transmitting at best 2 and at worst 5 extra bytes on the wire. This kind of spelling may help if you need to embed characters in your source code, which your font cannot display properly. For example, I don't have traditional Chinese characters in the font I'm using for programming, so if I type Chinese characters into my code editor, I'll see a bunch of question marks or rectangles with Unicode codepoint digits instead of actual characters. But someone who has Chinese glyphs in the font wouldn't have that problem.
If me and that person want to share our source code, it would be preferable that the other person uses \uXXXX scheme, as I would be able to verify which character is that by looking it up in the Unicode table. That's about all the difference.
EDIT
ECMAScript standard (v 262/5.1) says specifically that
A conforming implementation of this Standard shall interpret
characters in conformance with the Unicode Standard, Version 3.0 or
later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted
encoding form, implementation level 3. If the adopted ISO/IEC 10646-1
subset is not otherwise specified, it is presumed to be the BMP
subset, collection 300. If the adopted encoding form is not otherwise
specified, it presumed to be the UTF-16 encoding form.
So, the standard guarantees that character encoding is Unicode, and enforces the use of UTF-16 (that's strange, I thought it was UTF-8), but I don't think that this is what happens in practice... I believe that browsers use UTF-8 as default. Perhaps this have changed in the later standards, but this is the one last universally accepted.
Why can't we directly type the copyright symbol directly
Because JavaScript engines are capable of parsing UTF-8 encoded source files.
What is the difference between these two methods?
One is short, requires the source file be encoded in an encoding that supports the character, and requires that you type a character that isn't printed on the keyboard's buttons.
The other is (comparatively) long, can be expressed entirely in ASCII, and can be typed with characters printed on the buttons of a standard keyboard.
I have written a personal web app that uses charCodeAt() to convert text that is input by the user into the relevant character codes (for example ⊇ is converted to 8839 for storage), which is then sent to Perl, which sends them to MySQL. To retrieve the input text, the app uses fromCharCode() to convert the numbers back to text.
I chose to do this because Perl's unicode support is very hard to deal with correctly. So Perl and MySQL only see numbers, which makes life a lot simpler.
My question is can I depend on fromCharCode() to always convert a number like 8834 to the relevant character? I don't know what standard it uses, but let's say it uses UTF-8, if it is changed to use UTF-16 in the future, this will obviously break my program if there is no backward compatibility.
I know that my ideas about these concepts aren't that clear, therefore please care to clarify if I've shown a misunderstanding.
fromCharCode and toCharCode deal with Unicode code points, i.e. numbers between 0 and 65535(0xffff), assuming all characters are in the Basic-Multilingual Plane(BMP). Unicode and the code points are permanent, so you can trust them to remain the same forever.
Encodings such as UTF-8 and UTF-16 take a stream of code points (numbers) and output a byte stream. JavaScript is somewhat strange in that characters outside the BMP have to be constructed by two calls to toCharCode, according to UTF-16 rules. However, virtually every character you'll ever encounter (including Chinese, Japanese etc.) is in the BMP, so your program will work even if you don't handle these cases.
One thing you can do is convert the numbers back into bytes (in big-endian int16 format), and interpret the resulting text as UTF-16. The behavior of fromCharCode and toCharCode is fixed in current JavaScript implementations and will not ever change.
I chose to do this because Perl's unicode support is very hard to deal with correctly.
This is ɴᴏᴛ true!
Perl has the strongest Unicode support of any major programming language. It is much easier to work with Unicode if you use Perl than if you use any of C, C++, Java, C♯, Python, Ruby, PHP, or Javascript. This is not hyperbole and boosterism from uneducated, blind allegiance.; it is a considered appraisal based on more than ten years of professional experience and study.
The problems encountered by naïve users are virtually always because they have deceived themselves about what Unicode is. The number-one worst brain-bug is thinking that Unicode is like ASCII but bigger. This is absolutely and completely wrong. As I wrote elsewhere:
It’s fundamentally and critically not true that Uɴɪᴄᴏᴅᴇ is just some enlarged character set relative to ᴀsᴄɪɪ. At most, that’s true of nothing more than the stultified ɪsᴏ‑10646. Uɴɪᴄᴏᴅᴇ includes much much more that just the assignment of numbers to glyphs: rules for collation and comparisons, three forms of casing, non-letter casing, multi-codepoint casefolding, both canonical and compatible composed and decomposed normalization forms, serialization forms, grapheme clusters, word- and line-breaking, scripts, numeric equivs, widths, bidirectionality, mirroring, print widths, logical ordering exclusions, glyph variants, contextual behavior, locales, regexes, multiple forms of combining classes, multiple types of decompositions, hundreds and hundreds of critically useful properties, and much much much more‼
Yes, that’s a lot, but it has nothing to do with Perl. It has to do with Unicode. That Perl allows you to access these things when you work with Unicode is not a bug but a feature. That those other languages do not allow you full access to Unicode can by no means be construed as a point in their favor: rather, those are all major bugs of the highest possible severity, because if you cannot work with Unicode in the 21st century, then that language is a primitive, broken, and fundamentally useless for the demanding requirements of modern text processing.
Perl is not. And it is a gazillion times easier to do those things right in Perl than in those other languages; in most of them, you cannot even begin to work around their design flaws. You’re just plain screwed. If a language doesn’t provide full Unicode support, it is not fit for this century; discard it.
Perl makes Unicode infinitely easier than languages that don’t let you use Unicode properly can ever do.
In this answer, you will find at the front, Seven Simple Steps for dealing with Unicode in Perl, and at the bottom of that same answer, you will find some boilerplate code that will help. Understand it, then use it. Do not accept brokenness. You have to learn Unicode before you can use Unicode.
And that is why there is no simple answer. Perl makes it easy to work with Unicode, provided that you understand what Unicode really is. And if you’re dealing with external sources, you are doing to have to arrange for that source to use some sort of encoding.
Also read up on all the stuff I said about 𝔸𝕤𝕤𝕦𝕞𝕖 𝔹𝕣𝕠𝕜𝕖𝕟𝕟𝕖𝕤𝕤. Those are things that you truly need to understand. Another brokenness issue that falls out of Rule #49 is that Javascript is broken because it doesn’t treat all valid Unicode code points in exactly the same way irrespective of their plane. Javascript is broken in almost all the other ways, too. It is unsuitable for Unicode work. Just Rule #34 will kill you, since you can’t get Javascript to follow the required standard about what things like \w are defined to do in Unicode regexes.
It’s amazing how many languages are utterly useless for Unicode. But Perl is most definitely not one of those!
In my opinion it won't break.
Read Joel Spolsky's article on Unicode and character encoding. Relevant part of the article is quoted below:
Every letter in every
alphabet is assigned a number by
the Unicode consortium which is
written like this: U+0639. This
number is called a code point. The U+
means "Unicode" and the numbers are
hexadecimal. The English letter A would
be U+0041.
It does not matter whether this magical number is encoded in utf-8 or utf-16 or any other encoding. The number will still be the same.
As pointed out in other answers, fromCharCode() and toCharCode() deal with Unicode code points for any code point in the Basic Multilingual Plane (BMP). Strings in JavaScript are UCS-2 encoded, and any code point outside the BMP is represented as two JavaScript characters. None of these things are going to change.
To handle any Unicode character on the JavaScript side, you can use the following function, which will return an array of numbers representing the sequence of Unicode code points for the specified string:
var getStringCodePoints = (function() {
function surrogatePairToCodePoint(charCode1, charCode2) {
return ((charCode1 & 0x3FF) << 10) + (charCode2 & 0x3FF) + 0x10000;
}
// Read string in character by character and create an array of code points
return function(str) {
var codePoints = [], i = 0, charCode;
while (i < str.length) {
charCode = str.charCodeAt(i);
if ((charCode & 0xF800) == 0xD800) {
codePoints.push(surrogatePairToCodePoint(charCode, str.charCodeAt(++i)));
} else {
codePoints.push(charCode);
}
++i;
}
return codePoints;
}
})();
var str = "𝌆";
var codePoints = getStringCodePoints(s);
console.log(str.length); // 2
console.log(codePoints.length); // 1
console.log(codePoints[0].toString(16)); // 1d306
JavaScript Strings are UTF-16 this isn't something that is going to be changed.
But don't forget that UTF-16 is variable length encoding.
In 2018, you can use String.codePointAt() and String.fromCodePoint().
These methods work even if a character is not in the Basic-Multilingual Plane(BMP).
I am trying to piece together the mysterious string of characters â?? I am seeing quite a bit of in our database - I am fairly sure this is a result of conversion between character encodings, but I am not completely positive.
The users are able to enter text (or cut and paste) into a Ext-Js rich text editor. The data is posted to a severlet which persists it to the database, and when I view it in the database i see those strange characters...
is there any way to decode these back to their original meaning, if I was able to discover the correct encoding - or is there a loss of bits or bytes that has occured through the conversion process?
Users are cutting and pasting from multiple versions of MS Word and PDF. Does the encoding follow where the user copied from?
Thank you
website is UTF-8
We are using ms sql server 2005;
SELECT serverproperty('Collation') -- Server default collation.
Latin1_General_CI_AS
SELECT databasepropertyex('xxxx', 'Collation') -- Database default
SQL_Latin1_General_CP1_CI_AS
and the column:
Column_name Type Computed Length Prec Scale Nullable TrimTrailingBlanks FixedLenNullInSource Collation
text varchar no -1 yes no yes SQL_Latin1_General_CP1_CI_AS
The non-Unicode equivalents of the
nchar, nvarchar, and ntext data types
in SQL Server 2000 are listed below.
When Unicode data is inserted into one
of these non-Unicode data type columns
through a command string (otherwise
known as a "language event"), SQL
Server converts the data to the data
type using the code page associated
with the collation of the column. When
a character cannot be represented on a
code page, it is replaced by a
question mark (?), indicating the data
has been lost. Appearance of
unexpected characters or question
marks in your data indicates your data
has been converted from Unicode to
non-Unicode at some layer, and this
conversion resulted in lost
characters.
So this may be the root cause of the problem... and not an easy one to solve on our end.
â is encoded as 0xE2 in ISO-8859-1 and windows-1252. 0xE2 is also a lead byte for a three-byte sequence in UTF-8. (Specifically, for the range U+2000 to U+2FFF, which includes the windows-1252 characters –—‘’‚“”„†‡•…‰‹›€™).
So it looks like you have text encoded in UTF-8 that's getting misinterpreted as being in windows-1252, and displays as a â followed by two unprintable characters.
This is an something of an educated guess that you're just experiencing a naive conversion of Word/PDF documents to HTML. (windows-1252 to utf8 most likely) If that's the case probably 2/3 of the mysterious characters from Word documents are "smart quotes" and most of the rest are a result of their other "smart" editing features, elipsis, em dashes, etc. PDF's probably have similar features.
I would also guess that if the formatting after pasting into the ExtJS editor looks OK, then the encoding is getting passed along. Depending on the resulting use of the text, you may not need to convert.
If I'm still on base, and we're not talking about internationalization issues, then I can add that there are Word to HTML converters out there, but I don't know the details of how they operate, and I had mixed success when evaluating them. There is almost certainly some small information loss/error involved with such converters, since they need to make guesses about the original source of the "smart" characters. In my isolated case it was easier to just go back to the users and have them turn off the "smart" features.
The issue is clear: if the browser is good enough, a form in a web page can accept any Unicode character you can type or paste. If the character belongs to the HTML charset, it will be sent as is. If it doesn't, it'll get converted to an HTML entity. SQL Server will perform the appropriate conversion and silently corrupt your data when a character does not have an equivalent.
There's not much you can do to fully fix it but you can make a workaround: let your servlet perform the conversion. This way you have full control about it. You can, for instance, compile a list of the most common non-Latin1 characters users paste (smart quotes, unicode spaces...), which should be fairly easy to identify from context, and replace them with something else better that ?. Or you use a library that makes this for you.
Or you can switch your DB to Unicode :)
you're storing unicode data that uses 2 bytes per charcter into a varchar type columns that uses 1 byte per character. any text that uses 2 bytes per chars will have 1 byte lost when stored in the db.
all you need to do is change varchar column to nvarchar.
and then change sql parameters you're using in code of course.