Create invalid UTF8 string - javascript

Is it possible to create an invalid UTF8 string using Javascript?
Every solution I've found relies on String.fromCharCode, which generates undefined rather than an invalid string. I've seen mention of errors being generated by ill-formed UTF-8 strings (e.g. https://developer.mozilla.org/en-US/docs/Web/API/WebSocket#send()) but I can't figure out how you would actually create one.

One way to generate an ill-formed string in JavaScript is to take an emoji and remove its last UTF-16 code unit (JavaScript strings are sequences of UTF-16 code units, not bytes). Each 🐶 occupies two code units, so keeping only five splits the third one, leaving a lone high surrogate that cannot be encoded as valid UTF-8:
const invalidUtf8 = '🐶🐶🐶'.substr(0,5);
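You can see the effect by trying to encode the result. A rough sketch (TextEncoder is standard in modern browsers and Node.js):

// Two complete dogs plus a lone high surrogate.
const invalidUtf8 = '🐶🐶🐶'.substr(0, 5);
// TextEncoder cannot emit the lone surrogate as UTF-8; it substitutes
// U+FFFD instead (the bytes EF BF BD at the end of the output).
const bytes = new TextEncoder().encode(invalidUtf8);
console.log(bytes); // Uint8Array(11) [240, 159, 144, 182, 240, 159, 144, 182, 239, 191, 189]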

A string in JavaScript is a counted sequence of UTF-16 code units. There is an implicit contract that the code units represent Unicode codepoints. Even so, it is possible to represent any sequence of UTF-16 code units—even unpaired surrogates.
I find String.fromCharCode(0xD801) returns a string containing that lone surrogate, which typically renders as the replacement character; that seems quite reasonable (rather than undefined). Any text function might do that but, for efficiency reasons, I'm sure that many text manipulations would just pass invalid sequences through unless the manipulation required interpreting them as codepoints.
The easiest way to create such a string is with a string literal. For example, "\uD83D \uDEB2" or "\uD83D" or "\uDEB2" instead of the valid "\uD83D\uDEB2".
"\uD83D \uDEB2".replace(" ","") actually does return "\uD83D\uDEB2" ("🚲") but I don't think you should count on anything good coming from a string that isn't a valid UTF-16 encoding of Unicode codepoints.

Related

insert unicode like \u1d6fc in a javascript text string

I'm writing some code that scans a string for TeX-style Greek characters (like \Delta or \alpha), and replaces them with the Unicode symbol. It works fine for the non-italic Greek characters. The problem is that I want to use mathematical italic for the lower case. These codes are one digit longer. For example, the code for the letter alpha is 1d6fc. When I put \u1d6fc into my string it displays as the character that matches \u1d6f (a lower case m with a superimposed tilde) followed by the letter c. How do I force the "correct" reading of the code?
You have to use UTF-16 surrogate pairs for characters outside the Basic Multilingual Plane (code points above U+FFFF). In your particular case, you can use 0xD835 0xDEFC:
console.log('\uD835\uDEFC')
Here is a handy pair calculator. If you don't have to worry about Internet Explorer, you can also use String.fromCodePoint(), which will deal with that mess for you. If you do have to worry about Internet Explorer, MDN has a polyfill for that method.
To produce a \u escape sequence with more than 4 hex digits (code point belonging to a so-called astral plane), you can use the Unicode code point escape notation \u{xxxxx}:
console.log ('\u{1d6fc}');
or you can call String.fromCodePoint with the code point value expressed in hexadecimal using the 0x prefix notation:
console.log (String.fromCodePoint (0x1d6fc));
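If you are curious what the pair calculator is doing, the mapping from an astral code point to its surrogate pair is simple arithmetic. A minimal sketch (the helper name is my own):

// Split a code point above 0xFFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10); // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return String.fromCharCode(high, low);
}
console.log(toSurrogatePair(0x1D6FC)); // "𝛼", same as "\uD835\uDEFC"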

In Node.js, the newline code (%0A) will decode back to what character?

I have a pretty simple question, but a few quick Google and StackExchange queries were not able to answer it, so I guess I'm missing something here.
Here are my simplified parameters:
I'm using Javascript.
I have a text that needs to get URL-encoded, and the text has more than one line.
My question is: What is the character for newline before the text get encoded? (I know that after the encoding the newline will be encoded into %0A)
I guess asking "What char is decoded when decoding %0A" will be the same.
Those codes consist of a percent sign followed by a two-character hexadecimal number representing a byte value.
So in this case, the byte value is 0A, representing the ASCII newline character. This is commonly written as \n inside strings in JavaScript (and others, like PHP).
But I think your question suggests you want to do some search and replace for this character. I would not do that, since there can be other characters too that need encoding. Instead, use the function encodeURIComponent, which can encode the entire string for you. There is encodeURI as well, but in your case, I think the first is more appropriate.
This example shows how special characters (newline, space, and others) are encoded to a URL-friendly format. Note that the accented character é translates to the two bytes of its UTF-8 representation.
document.write(encodeURIComponent("Normal text\nEéy, check the specials: /, + and \t!"));
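Decoding simply reverses the process; for example:

console.log(decodeURIComponent("%0A") === "\n"); // true
console.log(encodeURIComponent("line one\nline two")); // "line%20one%0Aline%20two"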

How to make my own string in javascript as if I hit ALT codes on keyboard (UTF-8)

I am trying to create some random unicode strings within javascript and was wondering if there was an easy way. I tried doing something like this...
var username = "David Perry" + "/u4589";
But it just appends /u4589 to the end which is to be expected since it's just a string. What I WANT it to do is convert that into the unicode character in the string (AS IF I typed ALT 4589 on the keypad). I'm trying to build the string within javascript because I wanna test my form with various symbols and stuff and I'm tired of trying ALT codes to see what weird characters there are... so I thought.. I would loop through ALL unicode characters for FUN and populate my form and submit it automatically...
I was going to start at /u0000 and go up to /uffff and see which codes break my website when outputting them :)
I know there are different functions in JS but I can't seem to figure out why I can't build a string of unicode characters. lol.
If it's too complicated don't worry about it. It's just something I wanted to tinker with.
Try "\u4589" instead of "/u4589":
>>> "/u4589"
"/u4589"
>>> "\u4589"
"䖉"
The forward slash (/) is just a forward slash in a string; the backslash (\), however, is an escape character.
If you wish to generate random characters or loop through a range of characters, then you could use String.fromCharCode(), which gives you the character with the Unicode number passed as argument, e.g. String.fromCharCode(0x4589) or String.fromCharCode(i) where i is a variable with an integer value.
Both the \uxxxx notation and the String.fromCharCode() work up to 0xFFFF only, i.e. for Basic Multilingual Plane characters. This may well suffice, but if you need non-BMP characters, check out e.g. the MDN page on fromCharCode.
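A sketch of the loop described in the question, assuming the goal is just to generate test strings (it skips the surrogate range, which is not valid on its own):

// Walk every BMP code unit value except surrogates (0xD800–0xDFFF).
for (let i = 0; i <= 0xFFFF; i++) {
  if (i >= 0xD800 && i <= 0xDFFF) continue;
  const ch = String.fromCharCode(i);
  // ... feed ch into the form under test ...
}
// Beyond the BMP, String.fromCodePoint builds the surrogate pair for you:
console.log(String.fromCodePoint(0x1F436)); // "🐶"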

Javascript utf-8 substr and length function

I am trying to do a substr on a UTF-8 string like हिन्दी.
The problem is that the result becomes totally screwed up, with some weird box at the end (it does not show here, although I copy-pasted; it's something like [00 02]): हिन...
This is how it appears after using the substr function (screenshot: http://img27.imageshack.us/img27/765/capturexv.png).
Is there some function to solve this problem? At least I want to remove that funny box.
Thank you for your time.
JavaScript encodes strings with UTF-16, meaning characters outside the basic multilingual plane have to be represented as a surrogate pair. Splitting a string in the middle of such a pair might explain your results.
As I understand the Wikipedia article, you'll have to check if your last character lies in the range 0xD800–0xDBFF and, if so, either drop it or add the following character (which should be in the range 0xDC00–0xDFFF) to the substring.
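A sketch of that check as a surrogate-safe variant of substr (the helper name is my own; it only handles a split at the end, not at the start):

// Truncate without splitting a surrogate pair: if the last kept code
// unit is a high surrogate (0xD800–0xDBFF), drop it.
function safeSubstr(str, start, len) {
  let out = str.substr(start, len);
  const last = out.charCodeAt(out.length - 1);
  if (last >= 0xD800 && last <= 0xDBFF) {
    out = out.slice(0, -1);
  }
  return out;
}
console.log(safeSubstr('🐶🐶🐶', 0, 5)); // "🐶🐶" (drops the half dog)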
I believe the box is the font's rendering of the invalid value that the substring created. Remove the character at the box's position and the box should disappear.
Avoid putting UTF-8 byte sequences into JavaScript string objects. Instead, rely on the Unicode support of JavaScript and use a proper Unicode string (instead of a UTF-8 byte string).
My guess is that you managed to slice the string in the middle of a character, so that the result is an incomplete character. The browser then tries to render it anyway, leading to mojibake.

Detect non-printable characters in JavaScript

Is it possible to detect binary data in JavaScript?
I'd like to be able to detect binary data and convert it to hex for easier readability/debugging.
After more investigation I've realized that detecting binary data is not the right question, because binary data can contain both regular characters and non-printable characters.
outis's answer (/[\x00-\x1F]/) is really the best we can do in an attempt to detect binary characters.
Note: you must strip line feeds (and possibly other whitespace characters) from your ASCII string for the check to actually work.
If by "binary", you mean "contains non-printable characters", try:
/[\x00-\x1F]/.test(data)
If whitespace is considered non-binary data, try:
/[\x00-\x08\x0E-\x1F]/.test(data)
If you know the string is either ASCII or binary, use:
/[\x00-\x1F\x80-\xFF]/.test(data)
or:
/[\x00-\x08\x0E-\x1F\x80-\xFF]/.test(data)
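Putting it together with the original goal of converting to hex for readability, a minimal sketch (the helper name is my own):

// If the string contains control characters, dump it as space-separated hex.
function toDebugString(data) {
  if (/[\x00-\x08\x0E-\x1F]/.test(data)) {
    return Array.from(data, function (ch) {
      return ch.charCodeAt(0).toString(16).padStart(2, '0');
    }).join(' ');
  }
  return data;
}
console.log(toDebugString("abc")); // "abc"
console.log(toDebugString("a\x01b")); // "61 01 62"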
