insert unicode like \u1d6fc in a javascript text string - javascript

I'm writing some code that scans a string for TeX-style Greek characters (like \Delta or \alpha) and replaces each one with the matching Unicode symbol. It works fine for the non-italic Greek characters. The problem is that I want to use mathematical italic for the lower case, and those codes are one digit longer. For example, the code for the letter alpha is 1d6fc. When I put \u1d6fc into my string, it displays as the character that matches \u1d6f (a lower case m with a superimposed tilde) followed by the letter c. How do I force the "correct" reading of the code?

You have to use UTF-16 surrogate pairs for characters beyond the Basic Multilingual Plane, i.e. code points above U+FFFF. In your particular case, you can use 0xD835 0xDEFC:
console.log('\uD835\uDEFC')
Here is a handy pair calculator. If you don't have to worry about Internet Explorer, you can also use String.fromCodePoint(), which will deal with that mess for you. If you do have to worry about Internet Explorer, MDN has a polyfill for that method.
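For reference, the arithmetic behind such a pair calculator can be sketched as follows. The helper name toSurrogatePair is hypothetical and not part of any library; it only illustrates how 0x1D6FC maps to 0xD835 0xDEFC.

```javascript
// Hypothetical helper (not from the original answer): split a code point
// above U+FFFF into its UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('code point must be in the range 0x10000-0x10FFFF');
  }
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1D6FC);
console.log(high.toString(16), low.toString(16));               // d835 defc
console.log(String.fromCharCode(high, low) === '\uD835\uDEFC'); // true
```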

To produce a \u escape sequence with more than 4 hex digits (code point belonging to a so-called astral plane), you can use the Unicode code point escape notation \u{xxxxx}:
console.log ('\u{1d6fc}');
or you can call String.fromCodePoint with the code point value expressed in hexadecimal using the 0x prefix notation:
console.log (String.fromCodePoint (0x1d6fc));
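Applied to the original use case, a minimal sketch might look like the following. The table TEX_CODE_POINTS and the function replaceTexGreek are hypothetical names, and only two entries are shown; the asker's real scanning code is not known.

```javascript
// Hypothetical lookup table: TeX command name -> Unicode code point.
// Only two entries are shown; a real table would list every supported name.
const TEX_CODE_POINTS = {
  alpha: 0x1D6FC, // MATHEMATICAL ITALIC SMALL ALPHA
  Delta: 0x0394,  // GREEK CAPITAL LETTER DELTA
};

function replaceTexGreek(text) {
  // Match a backslash followed by letters, e.g. \alpha or \Delta.
  return text.replace(/\\([A-Za-z]+)/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(TEX_CODE_POINTS, name)
      ? String.fromCodePoint(TEX_CODE_POINTS[name])
      : match // leave unknown commands untouched
  );
}

console.log(replaceTexGreek('\\Delta x = \\alpha t'));
```

Because String.fromCodePoint accepts the full code point, the table can mix four-digit and five-digit values without any special handling.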

Related

Detecting characters having a similar connotation as in ASCII set

To detect if the string is composed of ASCII characters, I am using a regex that looks as follows:
"string".match(/^[\x00-\x7F]*$/gm)
This works fine for detecting ASCII characters. But it leaves out characters that are similar in meaning to ASCII characters, such as a double quote that falls outside the ASCII set but is included in the Unicode set. For example:
"see the difference in double quotes“
With the above regex, this string will fail the detection test because of “. How could I extend the above regex to include characters such as these, which are very similar in meaning to characters in the ASCII set? For example: , [comma], " [double quote], etc.
Regex doesn't understand the meaning of anything, it only follows its rules to match the sequence of characters.
If you want to match a comma, you need to put a comma in your character set. If you are looking for "similar" characters, you need to identify each and every one of them and put them inside the character set.
[,"]
will match "comma" and "double quote".
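As a sketch of that approach, the ASCII range can be extended with a hand-picked list of look-alike punctuation. Which characters count as "similar" is an assumption here; the list below is only an illustration. The g and m flags from the question are dropped, since a global regex keeps state between calls to test().

```javascript
// Assumed list of "ASCII-like" punctuation; extend it as needed.
// \u2018 \u2019 = curly single quotes, \u201C \u201D = curly double quotes,
// \u2013 \u2014 = en and em dash, \u2026 = horizontal ellipsis.
const asciiLike = /^[\x00-\x7F\u2018\u2019\u201C\u201D\u2013\u2014\u2026]*$/;

console.log(asciiLike.test('"see the difference in double quotes\u201C')); // true
console.log(asciiLike.test('caf\u00E9')); // false: é is not in the set
```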

Unicode characters not working

I'm really new to Javascript and I heard about unicode characters, but I don't know how they work. I did this:
alert("U+00BF");
which is the unicode for an upside-down question mark, but for some reason it just alerts the letters "U+00BF". I've tried using unicode characters with a format more like this:
alert("/xF3");
and those worked, but I don't know what I'm doing wrong with the first one. Does anyone know?
The "U+00BF" in alert("U+00BF"); is a string of length 6 containing the characters 'U', '+', '0', '0', 'B', 'F'. Hence the string "U+00BF" is echoed out in the alert.
Based on https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Values,_variables,_and_literals#Unicode ,
You can use the Unicode escape sequence in string literals, regular expressions, and identifiers. The escape sequence consists of six ASCII characters: \u and a four-digit hexadecimal number. For example, \u00A9 represents the copyright symbol. Every Unicode escape sequence in JavaScript is interpreted as one character.
Which means, we need to do:
alert("\u00BF");
to see the "upside-down question mark" unicode character.
The notation U+00BF is simply a conventional way of emphasizing that you are mentioning, in text, a Unicode character by its code number 00BF. It is not an escape notation of any kind in JavaScript.
You can use the character as such,
alert("¿")
provided that you handle the character encoding issues, as you should anyway.
If, however, you find this simple approach not applicable due to some external constraints, you can use a classic JavaScript escape notation:
alert("\xBF")
for characters in the range up to U+00FF, or the Unicode-based JavaScript escape notation
alert("\u00BF")
for characters in the range up to U+FFFF. (For characters beyond that, you need a so-called surrogate pair.)
Note that the special character used in these notations is \ U+005C REVERSE SOLIDUS, commonly, and originally, called “backslash”, not / U+002F SOLIDUS, commonly called “slash”, or sometimes (for emphasis) “forward slash”.
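A quick check of these notations, as a small sketch (the variable names are only for illustration):

```javascript
// Both escapes denote the same one-character string, U+00BF.
const hexEscape = '\xBF';
const unicodeEscape = '\u00BF';

console.log(hexEscape === unicodeEscape);             // true
console.log(unicodeEscape === String.fromCharCode(0xBF)); // true
console.log(unicodeEscape.length);                    // 1
console.log('U+00BF'.length);                         // 6: plain text, not an escape
console.log('/xF3' === '\xF3');                       // false: a slash does not start an escape
```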

Regular expression to allow all alphabet characters plus unicode characters

I need a regular expression to allow all alphabet characters plus the Greek/German alphabet in a string, but replace the symbols ?,&,^,". with *
I skipped the list of characters to escape to make the question simple.
I really want to see how to construct this and afterwards include alphabet sets using ASCII codes.
If you have a finite and short set of elements to replace, you could just use a character class, e.g.
string.replace(/[?\^&]/g, '*');
and add as many symbols as you want to reject. You could also add ranges of Unicode symbols you want to replace (e.g. \u017F-\u036F\u0400-\uFFFF).
Otherwise, use a negated class to specify which symbols don't need to be replaced, like a-z, accented/diacritic letters and Greek symbols:
string.replace(/[^a-z\u00C0-\u017E\u0370-\u03FF]/gi, '*');
You have to use the XRegexp plugin, along with the Unicode add-on.
Once you have that, you can use modern regexes like /[\p{L}\p{Nl}]/, which necessarily also includes those \p{Greek} code points which are letters or letter-numbers. But you could also match /[\p{Latin}\p{Greek}]/ if you wanted.
Javascript’s own regexes are terrible. Use XRegexp.
So something like: /^[^?&\^"]*$/ (that means the string is composed only of characters other than ?, &, ^ and ")...
But if you want to have the greek characters and the unicode characters (what are unicode characters? àèéìòù? Japanese?) perhaps you'll have to use http://xregexp.com/ It is a regex library for javascript that includes character classes for the various unicode character classes (I know I'm repeating myself) plus other "commands" for unicode handling.
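As an alternative to a library, engines that implement ES2018 Unicode property escapes accept \p{...} natively when the regex carries the u flag. The following is a sketch under that assumption; the function name maskSymbols is hypothetical, and keeping whitespace is a choice made here, not something stated in the question.

```javascript
// Replace everything that is not a Latin or Greek letter (or whitespace) with '*'.
// Requires an engine with ES2018 Unicode property escapes; note the `u` flag.
const notAllowed = /[^\p{Script=Latin}\p{Script=Greek}\s]/gu;

function maskSymbols(text) {
  return text.replace(notAllowed, '*');
}

console.log(maskSymbols('Größe & αβγ?')); // Größe * αβγ*
```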

How to make my own string in javascript as if I hit ALT codes on keyboard (UTF-8)

I am trying to create some random unicode strings within javascript and was wondering if there was an easy way. I tried doing something like this...
var username = "David Perry" + "/u4589";
But it just appends /u4589 to the end which is to be expected since it's just a string. What I WANT it to do is convert that into the unicode character in the string (AS IF I typed ALT 4589 on the keypad). I'm trying to build the string within javascript because I wanna test my form with various symbols and stuff and I'm tired of trying ALT codes to see what weird characters there are... so I thought.. I would loop through ALL unicode characters for FUN and populate my form and submit it automatically...
I was going to start at /u0000 and go up to /uffff and see which codes break my website when outputting them :)
I know there are different functions in JS but I can't seem to figure out why I can't build a string of unicode characters. lol.
If it's too complicated don't worry about it. It's just something I wanted to tinker with.
Try "\u4589" instead of "/u4589":
>>> "/u4589"
"/u4589"
>>> "\u4589"
"䖉"
the forward slash (/) is just a forward slash in a string, however the backslash (\) is an escape character.
If you wish to generate random characters or loop through a range of characters, then you could use String.fromCharCode(), which gives you the character with the Unicode number passed as argument, e.g. String.fromCharCode(0x4589) or String.fromCharCode(i) where i is a variable with an integer value.
Both the \uxxxx notation and the String.fromCharCode() work up to 0xFFFF only, i.e. for Basic Multilingual Plane characters. This may well suffice, but if you need non-BMP characters, check out e.g. the MDN page on fromCharCode.
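For looping over a range that may extend past 0xFFFF, a minimal sketch with String.fromCodePoint() could look like this. The function name charsInRange is hypothetical, and skipping the surrogate range is a design choice made here so that the output never contains unpaired surrogates.

```javascript
// Build a string from every code point in [start, end], skipping the
// surrogate range U+D800-U+DFFF, whose members are not characters on their own.
function charsInRange(start, end) {
  let out = '';
  for (let cp = start; cp <= end; cp++) {
    if (cp >= 0xD800 && cp <= 0xDFFF) continue;
    out += String.fromCodePoint(cp);
  }
  return out;
}

console.log('David Perry' + charsInRange(0x4589, 0x458B));
console.log(charsInRange(0x1F600, 0x1F602).length); // 6: each is a surrogate pair
```

Building the whole range into one string can get large; for form testing it may be more practical to call the function on small slices of the range.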

Javascript utf-8 substr and length function

I am trying to do a substr on a UTF-8 string like हिन्दी.
The problem is that the result gets totally screwed up, with some weird box at the end (it does not show here, although I copy-pasted it; it's something like [00 02]): हिन...
Okay, this is how it appears after using the substr function:
[screenshot: http://img27.imageshack.us/img27/765/capturexv.png]
I'm wondering if there is some function to solve this problem. At least I want to remove that funny box.
Thank you for your time.
JavaScript encodes strings with UTF-16, meaning characters outside the basic multilingual plane have to be represented as a surrogate pair. Splitting a string in the middle of such a pair might explain your results.
As I understand the Wikipedia article, you'll have to check if your last character lies in the range 0xD800–0xDBFF and, if so, either drop it or add the following character (which should be in the range 0xDC00–0xDFFF) to the substring.
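The check described above could be sketched like this, taking the "drop it" option. The helper name safePrefix is hypothetical.

```javascript
// Hypothetical helper: like str.substr(0, length), but never ends on a
// lone high surrogate.
function safePrefix(str, length) {
  let end = Math.max(0, Math.min(length, str.length));
  if (end > 0 && end < str.length) {
    const last = str.charCodeAt(end - 1);
    if (last >= 0xD800 && last <= 0xDBFF) {
      end -= 1; // drop the unpaired high surrogate
    }
  }
  return str.slice(0, end);
}

console.log(safePrefix('a\u{1D6FC}b', 2)); // "a"
console.log(safePrefix('a\u{1D6FC}b', 3)); // "a𝛼"
```

This only guards against splitting a surrogate pair. It does not keep combining marks together with their base characters, so it may not be enough for every script.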
I believe that the box is the font's representation of the UTF-8 values that the substring created. Try to remove the character at the box's position and it should be removed.
Try to avoid putting UTF-8 byte sequences into JavaScript string objects. Instead, rely on the Unicode support of JavaScript, and use a proper Unicode string (instead of a UTF-8 string).
My guess is that you managed to slice the string in the middle of a character, so that the result is an incomplete character. Browsers then try to render it anyway, leading to mojibake.
