Detect non-printable characters in JavaScript

Is it possible to detect binary data in JavaScript?
I'd like to be able to detect binary data and convert it to hex for easier readability/debugging.
After more investigation I've realized that detecting binary data is not the right question, because binary data can contain regular characters, and non-printable characters.
Outis's question and answer (/[\x00-\x1F]/) is really the best we can do in an attempt to detect binary characters.
Note: You must remove line feeds, and possibly other whitespace characters, from your ASCII string before running the check; otherwise they are reported as non-printable too.

If by "binary", you mean "contains non-printable characters", try:
/[\x00-\x1F]/.test(data)
If whitespace is considered non-binary data, try:
/[\x00-\x08\x0E-\x1F]/.test(data)
If you know the string is either ASCII or binary, use:
/[\x00-\x1F\x80-\xFF]/.test(data)
or:
/[\x00-\x08\x0E-\x1F\x80-\xFF]/.test(data)
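Combining one of these checks with a hex dump gives the readability the question asks for. The following is only a sketch: the names NON_PRINTABLE, looksBinary, toHex and describe are made up for illustration, and it assumes the data is a string in which each character holds one byte value.

```javascript
// Hypothetical helper names; the pattern is the whitespace-tolerant check from above.
const NON_PRINTABLE = /[\x00-\x08\x0E-\x1F]/;

// True if the string contains a control character other than tab, LF, VT, FF or CR.
function looksBinary(data) {
  return NON_PRINTABLE.test(data);
}

// Render every code unit as hex, separated by spaces.
// Assumption: values fit in one byte; anything larger is shown with four digits.
function toHex(data) {
  const parts = [];
  for (let i = 0; i < data.length; i++) {
    const code = data.charCodeAt(i);
    parts.push(code.toString(16).padStart(code > 0xff ? 4 : 2, "0"));
  }
  return parts.join(" ");
}

// Leave printable text alone and hex-dump everything else.
function describe(data) {
  return looksBinary(data) ? toHex(data) : data;
}

console.log(describe("plain text\n")); // printed unchanged
console.log(describe("a\x00b"));       // 61 00 62
```

Swapping in one of the other patterns (for example the one that also flags \x80-\xFF) only changes what counts as binary; the hex dump stays the same.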

Related

Create invalid UTF8 string

Is it possible to create an invalid UTF8 string using Javascript?
Every solution I've found relies on String.fromCharCode, which generates undefined rather than an invalid string. I've seen mention of errors being generated by ill-formed UTF-8 strings (e.g. https://developer.mozilla.org/en-US/docs/Web/API/WebSocket#send()) but I can't figure out how you would actually create one.
One way to generate a string that cannot be encoded as valid UTF-8 is to take an emoji and cut off its last UTF-16 code unit (the low surrogate).
For example, this leaves an unpaired high surrogate at the end of the string:
const invalidUtf8 = '🐶🐶🐶'.substr(0,5);
A string in JavaScript is a counted sequence of UTF-16 code units. There is an implicit contract that the code units represent Unicode codepoints. Even so, it is possible to represent any sequence of UTF-16 code units—even unpaired surrogates.
I find String.fromCharCode(0xd801) returns the replacement character, which seems quite reasonable (rather than undefined). Any text function might do that but, for efficiency reasons, I'm sure that many text manipulations would just pass invalid sequences through unless the manipulation required interpreting them as codepoints.
The easiest way to create such a string is with a string literal. For example, "\uD83D \uDEB2" or "\uD83D" or "\uDEB2" instead of the valid "\uD83D\uDEB2".
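To check whether a given string contains such unpaired surrogates, something like the following could be used. It is a sketch under assumptions: the names LONE_SURROGATE, hasLoneSurrogate and canEncodeAsUtf8 are invented here, and the second helper relies on the fact that encodeURIComponent throws a URIError when it meets a lone surrogate.

```javascript
// Matches a high surrogate not followed by a low one, or a low surrogate
// not preceded by a high one. There is no "u" flag, so the pattern works
// on individual UTF-16 code units.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]/;

// Hypothetical helper: true if the string is not well-formed UTF-16.
function hasLoneSurrogate(str) {
  return LONE_SURROGATE.test(str);
}

// Hypothetical helper: encodeURIComponent has to produce UTF-8,
// so it throws a URIError on a lone surrogate.
function canEncodeAsUtf8(str) {
  try {
    encodeURIComponent(str);
    return true;
  } catch (err) {
    if (err instanceof URIError) return false;
    throw err;
  }
}

console.log(hasLoneSurrogate("\uD83D\uDEB2"));  // false
console.log(hasLoneSurrogate("\uD83D \uDEB2")); // true
console.log(canEncodeAsUtf8("\uD83D"));         // false
```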
"\uD83D \uDEB2".replace(" ","") actually does return "\uD83D\uDEB2" ("🚲") but I don't think you should count on anything good coming from a string that isn't a valid UTF-16 encoding of Unicode codepoints.

javascript encodeURI() output

According to MDN, the encodeURI() function encodes a URI by:
replacing each instance of certain characters by one, two, three, or four escape sequences representing the UTF-8 encoding of the character
However, when invoking encodeURI('\u0082') (in Chrome) I'm getting %C2%82 as output.
I expected to get %82 or %00%82. What does the %C2 mean?
The '0082' in '\u0082' is the Unicode code point, not the UTF-8 byte representation.
UTF-8 maps the code point U+0082 to two bytes: C2 82.
Unicode to UTF-8 mapping table
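The mapping from code point to UTF-8 bytes can be worked out by hand. The sketch below does so with hypothetical helpers (utf8Bytes and percentEncode are names made up for this example) and assumes the input is a valid, non-surrogate code point. Code points from U+0080 to U+07FF take two bytes, which is where the leading %C2 comes from.

```javascript
// Hypothetical helper: UTF-8 bytes for one code point.
// Assumption: cp is a valid Unicode scalar value (not a surrogate).
function utf8Bytes(cp) {
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  if (cp < 0x10000) {
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

// Hypothetical helper: write each byte as a %XX escape.
function percentEncode(cp) {
  return utf8Bytes(cp)
    .map((b) => "%" + b.toString(16).toUpperCase().padStart(2, "0"))
    .join("");
}

console.log(percentEncode(0x82));  // %C2%82
console.log(encodeURI("\u0082")); // %C2%82
```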
Decoding %C2 at http://www.albionresearch.com/misc/urlencode.php leads to Â
When dealing with German texts and ISO 8859-15 / ISO 8859-1 vs. UTF-8 I often ran into the à character. The characters are quite close to each other. May this also be an encoding problem?
Maybe HTML encoding issues - "Â" character showing up instead of " " helps.

In NODE.JS, the newline code (%0A) will decode back to what character?

I have a pretty simple question, but a few simple Google and Stack Exchange searches were not able to answer it, so I guess I'm missing something here.
Here are my simplified parameters:
I'm using JavaScript.
I have a text that needs to be URL-encoded, and the text has more than one line.
My question is: what is the newline character before the text gets encoded? (I know that after encoding, the newline becomes %0A.)
I guess asking "What character do you get when decoding %0A?" would be the same.
Those codes consist of a percent sign, followed by a two character hexadecimal number representing a byte value.
So in this case, the byte value is 0A, representing the ASCII newline character. This is commonly written as \n inside strings in JavaScript (and others, like PHP).
But I think your question suggests you want to do some search and replace for this character. I would not do that, since there can be other characters too that need encoding. Instead, use the function encodeURIComponent, which can encode the entire string for you. There is encodeURI as well, but in your case, I think the first is more appropriate.
This example shows how special characters (newline, space, and others) are encoded to an url-friendly format. Note that the diacritic é translates to the two bytes of its UTF-8 representation.
document.write(encodeURIComponent("Normal text\nEéy, check the specials: /, + and \t!"));
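Going the other way answers the original question directly. This small sketch (the variable names are arbitrary) decodes %0A and round-trips a two-line string; console.log is used instead of document.write so that it also runs in Node.js.

```javascript
// Decoding %0A yields the line feed character, written "\n" in a string literal.
const newline = decodeURIComponent("%0A");
console.log(newline === "\n");      // true
console.log(newline.charCodeAt(0)); // 10

// Round trip of a two-line text.
const text = "line one\nline two";
const encoded = encodeURIComponent(text);
console.log(encoded);                              // line%20one%0Aline%20two
console.log(decodeURIComponent(encoded) === text); // true
```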

Read non printable characters from Javascript

Say I have the following Javascript instruction:
var a="hiàja, c . Non di–g t";
a contains binary data, i.e., any ASCII from 0-255.
Before which bytes should I add a backslash so that a is read properly (for example, before ")?
Should I use a specific charset, or a content type other than text/javascript with UTF-8?
Thanks
The ASCII range is 0 to 127, but strings are not limited to ASCII in JavaScript. According to the ECMAScript standard, “All characters may appear literally in a string literal except for the closing quote character, backslash, carriage return, line separator, paragraph separator, and line feed.” If the encoding of your document is suitable (e.g., windows-1252 or utf-8) and properly declared, you can use your example string as it is.
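If the literal is generated by a program rather than typed by hand, one option is to let JSON.stringify add the backslashes. This is an assumption about the use case, not something the answer above proposes, and toLiteral is a hypothetical name. JSON.stringify escapes the double quote, the backslash and control characters below U+0020, and leaves characters such as à as they are.

```javascript
// Hypothetical helper: produce a double-quoted literal with the needed escapes.
function toLiteral(str) {
  return JSON.stringify(str);
}

// A string with a quote, a backslash, a line feed, a NUL byte and a non-ASCII letter.
const raw = 'say "hi"\\ \n\x00\xE0';
const literal = toLiteral(raw);

console.log(literal);                     // the escaped form, including the surrounding quotes
console.log(JSON.parse(literal) === raw); // true
```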

Javascript utf-8 substr and length function

I am trying to do a substr on a UTF-8 string like हिन्दी.
The problem is that the result gets mangled, with a weird box at the end (it does not show here, although I copy-pasted it; it looks something like [00 02]): हिन...
This is how it appears after using the substr function:
[screenshot: http://img27.imageshack.us/img27/765/capturexv.png]
Is there a function that solves this problem? At the very least I want to remove that funny box.
Thank you for your time.
JavaScript encodes strings with UTF-16, meaning characters outside the basic multilingual plane have to be represented as a surrogate pair. Splitting a string in the middle of such a pair might explain your results.
As I understand the wikipedia article, you'll have to check if your last character lies in the range 0xD800–0xDBFF and, if so, either drop it or add the following character (which should be in range 0xDC00-0xDFFF) to the substring.
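The check described above might look like this. It is a sketch, with isHighSurrogate and safeSlice as made-up names, and it takes the "drop it" option. It only deals with surrogate pairs; cutting between a base letter and a combining mark (as can happen in Devanagari text) is a separate problem that this sketch does not address.

```javascript
// True for code units in the high-surrogate range 0xD800-0xDBFF.
function isHighSurrogate(codeUnit) {
  return codeUnit >= 0xd800 && codeUnit <= 0xdbff;
}

// Hypothetical helper: like str.slice(start, end), but if the result would end
// on the first half of a surrogate pair, that half is dropped.
function safeSlice(str, start, end) {
  const piece = str.slice(start, end);
  if (piece.length > 0 && isHighSurrogate(piece.charCodeAt(piece.length - 1))) {
    return piece.slice(0, -1);
  }
  return piece;
}

const sample = "a\uD83D\uDEB2b"; // "a", a bicycle emoji (two code units), "b"
console.log(safeSlice(sample, 0, 2) === "a");             // true: the half pair is dropped
console.log(safeSlice(sample, 0, 3) === "a\uD83D\uDEB2"); // true: the pair is complete
```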
I believe that the box is the font's representation of the UTF-8 values that the substring created. Try to remove the character at the box's position and it should be removed.
Avoid putting UTF-8 byte sequences into JavaScript string objects. Instead, rely on JavaScript's Unicode support and use a proper Unicode string (rather than a UTF-8 string).
My guess is that you managed to slice the string in the middle of a character, so that the result is an incomplete character. Browsers then try to render it anyway, leading to mojibake.
