JavaScript hexadecimal to ASCII with Latin extended symbols

I am getting a hexadecimal value of my string that looks like this:
String has letters with diacritics: č,š,ř, ...
Hexadecimal value of this string is:
0053007400720069006E006700200068006100730020006C0065007400740065007200730020007700690074006800200064006900610063007200690074006900630073003A0020010D002C00200161002C00200159002C0020002E002E002E
The problem is that when I try to convert this value back to ASCII, the č, š, ř, ... come out wrong: instead of these symbols I get a little box with a question mark in it.
My code for converting hex to ascii:
function convertHexadecimal(hexx){
let index = hexx.indexOf("~");
let strInfo = hexx.substring(0, index+1);
let strMessage = hexx.substring(index+1);
var hex = strMessage.toString();
var str = '';
for (var i = 0; i < hex.length; i += 2){
str += String.fromCharCode(parseInt(hex.substr(i, 2), 16));
}
console.log("Zpráva: " + str);
var strFinal = strInfo + str;
return strFinal;
}
Can somebody help me with this?

First an example solution:
let demoHex = `0053007400720069006E006700200068006100730020006C0065007400740065007200730020007700690074006800200064006900610063007200690074006900630073003A0020010D002C00200161002C00200159002C0020002E002E002E`;
function hexToString(hex) {
let str="";
for( var i = 0; i < hex.length; i +=4) {
str += String.fromCharCode( Number("0x" + hex.substr(i,4)));
}
return str;
}
console.log("Decoded string: %s", hexToString(demoHex) );
What it's doing:
It's treating the hex characters as a sequence of 4 hexadecimal digits that provide the UTF-16 character code of a character.
It gets each set of 4 digits in a loop using String.prototype.substr. Note that MDN marks .substr as deprecated, and the ECMAScript standard keeps it only in Annex B as a legacy feature, so rewrite it to use substring or slice if you wish.
Each group of hex digits is prefixed with "0x" to make it a valid number representation in JavaScript and converted to a number using Number. The number is then converted to a character string using the String.fromCharCode static method.
I guessed the format of the hex string by looking at it, which means a general purpose encoding routine to encode UTF-16 code units (not code points) into hex could look like:
const hexEncodeUTF16 =
str=>str.split('')
.map( char => char.charCodeAt(0).toString(16).padStart(4,'0'))
.join('');
console.log( hexEncodeUTF16( "String has letters with diacritics: č, š, ř, ..."));
I hope these examples show what needs doing - there are any number of ways to implement it in code.
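As one such variation, here is a minimal sketch that avoids substr and rejects input that is not made of 4-digit groups. The function names hexToStringUTF16 and stringToHexUTF16 are made up for this illustration, and the strictness about input is my assumption, not something the question asked for:

```javascript
// Decode a string of 4-hex-digit groups, each group being one UTF-16 code unit.
function hexToStringUTF16(hex) {
  if (hex.length % 4 !== 0 || /[^0-9a-f]/i.test(hex)) {
    throw new RangeError("expected groups of 4 hex digits");
  }
  let str = "";
  for (let i = 0; i < hex.length; i += 4) {
    // slice(start, end) replaces the deprecated substr(start, length)
    str += String.fromCharCode(parseInt(hex.slice(i, i + 4), 16));
  }
  return str;
}

// Encode each UTF-16 code unit as 4 hex digits (the mirror of the decoder).
function stringToHexUTF16(str) {
  let hex = "";
  for (let i = 0; i < str.length; i++) {
    hex += str.charCodeAt(i).toString(16).padStart(4, "0");
  }
  return hex;
}

const sample = "\u010D, \u0161, \u0159"; // "č, š, ř"
const sampleHex = stringToHexUTF16(sample);
console.log(sampleHex);
console.log(hexToStringUTF16(sampleHex) === sample);
```

The original code in the question read 2 hex digits at a time, which is why every character above 0xFF was split into two wrong characters.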

Related

Writing a Hexadecimal Escape Character Sequence using Variables

Testing Hex Character Codes
Problem
What does a Vertical Tab or a Backspace character actually do? I want to find out.
My experiment is to find out exactly what happens when every hex character is put into a string. I thought the best way to do this would be to create a nested loop to go through each of the 16 hexadecimal characters to create each possible 2 digit hex character code.
I soon discovered that you cannot use the \x escape character with interpolated variables, and so I expect what I have set out to do might be impossible.
const hexCharacters = "0123456789ABCDEF";
let code = "";
let char1 = "";
let char2 = "";
for (charPos1 = 0; charPos1 < hexCharacters.length; charPos1++) {
for (charPos2 = 0; charPos2 < hexCharacters.length; charPos2++) {
char1 = hexCharacters[charPos1];
char2 = hexCharacters[charPos2];
code = `${char1}${char2}`;
printHexChar(code);
}
}
function printHexChar(string) {
let output = `<p>Hex Code ${string} = \x${string}</p>`; // THE PROBLEM IS CLEAR
document.write(output)
}
I know it will also probably fail once it gets past 7F or whichever is the last character in the set, but that's not the main issue here! :D
Potential solution
string.prototype.fromCharCode
This sort of string method approach would seem to be the answer, but it is meant for U-16 character codes, and that is not what I wanted to test. There doesn't seem to be an existing string method for hex codes. Probably because nobody would ever want one, but nevertheless it would be cool.
Conclusion
Is there any way to create an escape character sequence from assembled parts that will render not as plain text, but as a proper escape character sequence?
Apologies if this has been asked before in some form, but with my feeble understanding of things I just couldn't find an answer.
You can use String.fromCharCode with parseInt.
`<p>Hex Code ${string} = ${String.fromCharCode(parseInt(string, 16))}</p>`;
const hexCharacters = "0123456789ABCDEF";
let code = "";
let char1 = "";
let char2 = "";
for (charPos1 = 0; charPos1 < hexCharacters.length; charPos1++) {
for (charPos2 = 0; charPos2 < hexCharacters.length; charPos2++) {
char1 = hexCharacters[charPos1];
char2 = hexCharacters[charPos2];
code = `${char1}${char2}`;
printHexChar(code);
}
}
function printHexChar(string) {
let output = `<p>Hex Code ${string} = ${String.fromCharCode(parseInt(string, 16))}</p>`;
document.write(output)
}
eval works as well, though it should generally be avoided.
`<p>Hex Code ${string} = ${eval('"\\x'+string+'"')}</p>`
If you want to output \x literally, then in a string literal you need to escape the escape character, so `\\x`.
string.prototype.fromCharCode [...] is meant for U-16 character codes
JavaScript uses one character encoding. The following strings are all equal:
let a = String.fromCharCode(27);
let b = "\x1B";
let c = "\u001B";
console.log(a === b, b === c);
If I understand correctly, you want to produce a string literal that shows \x escape sequences -- not the actual character:
// Prepare string
let chars = Array.from({length: 128}, (_, i) => String.fromCharCode(i))
.join("");
// Escape them
let escaped = Array.from(chars, ch => `\\x${ch.charCodeAt().toString(16).padStart(2, "0")}`).join("");
console.log(escaped);
But you might also use JSON.stringify. Although it uses different escape sequences (\u instead of \x), and only for non-display characters, it will be the exact same string when evaluated. Here is a demo:
// Prepare string
let chars = Array.from({length: 128}, (_, i) => String.fromCharCode(i))
.join("");
// Escape them
let escaped = '"' + Array.from(chars, ch => `\\x${ch.charCodeAt().toString(16).padStart(2, "0")}`).join("") + '"';
console.log(escaped);
// Or JSONify them
let json = JSON.stringify(chars);
console.log(json);
// Compare them, when evaluated:
console.log(eval(escaped) === eval(json));
Finally, note that there is nothing special about hexadecimal: it is just a representation of an integer. In the end, it is the numerical value that is important, not the representation of it. It is that numerical value that corresponds to a character.
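A small sketch of that last point, using variable names I picked for the illustration: the same integer reached from hexadecimal, decimal and octal text yields the same character.

```javascript
// Three textual representations of the same integer, 27.
const viaHex = parseInt("1B", 16);
const viaDecimal = parseInt("27", 10);
const viaOctal = parseInt("33", 8); // 3 * 8 + 3 = 27

console.log(viaHex === viaDecimal && viaDecimal === viaOctal);

// The character depends only on the number, not on how it was written.
console.log(String.fromCharCode(viaHex) === "\x1B");
console.log(String.fromCharCode(viaOctal) === "\u001B");
```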
Addendum
If you prefer code that sticks to old-style JavaScript, here is an equivalent of the last code snippet:
// Prepare string
let chars = "";
for (let i = 0; i < 128; i++) {
chars += String.fromCharCode(i);
}
// Escape the characters in this string
let escaped = '"';
for (let i = 0; i < chars.length; i++) {
let ch = chars.charCodeAt(i);
let hex = ch.toString(16);
if (hex.length === 1) hex = "0" + hex;
escaped += "\\x" + hex;
}
escaped += '"';
console.log(escaped);
// Or JSONify them
let json = JSON.stringify(chars);
console.log(json);
// Compare them, when evaluated:
console.log(eval(escaped) === eval(json));

How do I determine the width of the result of codePointAt?

I'm trying to loop over the Unicode characters in a JavaScript string, which I assume is encoded with UTF-16.
It is my understanding that UTF-16 is variable width. That is, a single Unicode character may be split across multiple 16-bit code units. I can use s.codePointAt(i) to get the Unicode character beginning at a given index. But once I have it, how do I know how far to advance i?
Roughly, what is getWidth here? Is it simply c > Math.pow(2, 16)?
for (var i = 0; i < s.length;) {
var c = s.codePointAt(i);
// do some operation with c
i = i + getWidth(c)
}
Is there a standard library function I can use to determine how far to advance? Or a way to iterate over the Unicode code points in a string?
Is there a standard […] way to iterate over the Unicode code points in a string?
Yes, since ES6 you can simply iterate all strings to get the code points:
for (const character of string) {
const codepoint = character.codePointAt(0);
// do some operation with codepoint
}
A simple approach:
for (var i = 0; i < s.length; ++i) {
var c = s.codePointAt(i);
// do some operation with c
if( s.charCodeAt(i) != c) {
++i; // step past the next sixteen bits of the surrogate pair
}
}
(where the value of c is the Unicode codepoint, not the character).
If you want to split the string into an array of Unicode characters you can make use of the string iterator invoked by the spread operator introduced in ES6:
var array = [...s];
In pre-ES6 browsers the start of a surrogate pair can be identified in order to skip the second part:
for (var i = 0; i < s.length; ++i) {
var k = s.charCodeAt(i);
if( k < 0xD800 || k > 0xDBFF) {
var c = s[i]; // character in BMP
}
else {
c = s.substring( i,i+2); // use surrogate pair
++i;
}
// do something with c
console.log(c)
}
See: http://www.unicode.org/glossary/#supplementary_code_point
Basically, if your code point is 0x010000 or above, you are dealing with a supplementary character, which takes two 16-bit code units (a surrogate pair) in a JavaScript string.
const MIN_SUPPLEMENTARY_CODE_POINT = 0x010000;
function charCount(codePoint) {
return codePoint >= MIN_SUPPLEMENTARY_CODE_POINT ? 2 : 1;
}
JavaScript strings are sequences of 16-bit code units, a model inherited from UCS-2, an older encoding that is very similar to UTF-16 but doesn't handle surrogate pairs and cannot represent any character that needs more than two bytes. That is why string indexing and length count code units rather than code points.
If you are stepping through a string looking at code points, you can look at the code point value itself... if the value is greater than 0xFFFF (that is, 2^16 or more), you have to advance 2 string characters, otherwise advance 1 string character.
You might try the newer ES6 syntax that works really well at splitting up strings into characters, even if those characters are high-order.
// High-order Unicode character
const k = '💩';
// Takes two UTF-16 code units (four bytes), so this logs 2
console.log(k.length);
const chars = [...k];
// But it's only one character, so this logs 1
console.log(chars.length);
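Putting the answers together, the getWidth from the question can be sketched as below. The helper codePointsOf is a hypothetical name used only to make the loop testable, and the sketch assumes you want the index-based loop rather than for...of:

```javascript
// Number of UTF-16 code units that a code point occupies in a JS string.
function getWidth(codePoint) {
  return codePoint > 0xFFFF ? 2 : 1;
}

// Collect the code points of a string using the index-based loop.
function codePointsOf(s) {
  const result = [];
  for (let i = 0; i < s.length; ) {
    const c = s.codePointAt(i);
    result.push(c);
    i += getWidth(c); // skip the low surrogate when c is supplementary
  }
  return result;
}

const text = "a\u{1F4A9}b";
console.log(text.length); // counts code units, not characters
console.log(codePointsOf(text).map(c => c.toString(16)));
```

A lone surrogate is returned by codePointAt as its own value, which is below 0x10000, so the loop still advances by one and cannot get stuck.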

How to convert non numeric string to int

For an assignment, I am trying to recreate a small project I once made in ASP.NET.
It converted each letter of a text to its int value, then added 1, converted it back to a char and put them all back into a single string.
Now I am trying to do this in Angular, but I am having trouble converting my non-numeric strings to their int values.
I tried it with parseInt(), but this only seems to work if the string is a valid integer.
Is there any way to parse or convert non-numeric strings to an int value, and how?
'String here'.split('').map(function (char) {
return String.fromCharCode(char.charCodeAt(0) + 1);
}).join('');
If you mean char code.
Thanks to the helpful insights of Claies and Damien Czapiewski I constructed the following solution.
Loop through each character in my string in a for loop.
Then, for each char I retrieved its value with charCodeAt()
And to return to a string value I used fromCharCode()
encode(msg:string):string {
let result: string = "";
if (msg) {
for (var i = 0; i < msg.length; i++) {
let msgToInt = msg.charCodeAt(i);
// do stuff here
result += String.fromCharCode(msgToInt);
}
}
return result;
}

Encode String to HEX

I have my function to convert string to hex:
function encode(str){
str = encodeURIComponent(str).split('%').join('');
return str.toLowerCase();
}
example:
守护村子
alert(encode('守护村子'));
the output would be:
e5ae88e68aa4e69d91e5ad90
It works on Chinese characters. But when I do it with English letters
alert(encode('Hello World'));
it outputs:
hello20world
I have tried this for converting string to hex:
function String2Hex(tmp) {
var str = '';
for(var i = 0; i < tmp.length; i++) {
str += tmp[i].charCodeAt(0).toString(16);
}
return str;
}
then tried it on the Chinese characters above, but it outputs the UTF-16 code units in hex:
5b8862a467515b50
not the byte-wise hex I want (which is the UTF-8 encoding):
e5ae88e68aa4e69d91e5ad90
I have also searched for ways to convert between the two encodings, but no luck.
Could anyone help me? Thanks!
As a self-contained solution in functional style, you can encode with:
plain.split("")
.map(c => c.charCodeAt(0).toString(16).padStart(2, "0"))
.join("");
The split on an empty string produces an array with one character (or rather, one UTF-16 code unit) in each element. Then we can map each to a hex string of the character code.
Then to decode:
hex.split(/(\w\w)/g)
.filter(p => !!p)
.map(c => String.fromCharCode(parseInt(c, 16)))
.join("")
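This time the regex passed to split captures groups of two characters, but this form of split will intersperse them with empty strings (the stuff "between" the captured groups, which is nothing!). So filter is used to remove the empty strings. Then map decodes each character. Note that this pair only round-trips when every character code is below 0x100: a larger code produces more than two hex digits, and the decoder's fixed two-digit grouping then splits it incorrectly.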
On Node.js, you can do:
const myString = "This is my string to be encoded/decoded";
const encoded = Buffer.from(myString).toString('hex'); // encoded == 54686973206973206d7920737472696e6720746f20626520656e636f6465642f6465636f646564
const decoded = Buffer.from(encoded, 'hex').toString(); // decoded == "This is my string to be encoded/decoded"
I solved it by downloading utf8.js
https://github.com/mathiasbynens/utf8.js
then using the String2Hex function from above:
alert(String2Hex(utf8.encode('守护村子')));
It gives me the output I want:
e5ae88e68aa4e69d91e5ad90
This should work.
var str = "some random string";
var result = "";
for (var i = 0; i < str.length; i++) {
// each UTF-16 code unit becomes exactly 4 hex digits
var hex = str.charCodeAt(i).toString(16);
result += ("000" + hex).slice(-4);
}
If you want to properly handle UTF-8 strings you can try these:
function utf8ToHex(str) {
// Array.from iterates by code point, so surrogate pairs stay together
return Array.from(str).map(c =>
// pad ASCII codes so that e.g. "\n" becomes "0a" rather than "a"
c.charCodeAt(0) < 128 ? c.charCodeAt(0).toString(16).padStart(2, '0') :
encodeURIComponent(c).replace(/\%/g,'').toLowerCase()
).join('');
}
function hexToUtf8(hex) {
return decodeURIComponent('%' + hex.match(/.{1,2}/g).join('%'));
}
Demo: https://jsfiddle.net/lyquix/k2tjbrvq/
Another way to do it
function toHex(txt){
const encoder = new TextEncoder();
return Array
.from(encoder.encode(txt))
.map(b => b.toString(16).padStart(2, '0'))
.join('')
}
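The reverse direction can be sketched with TextDecoder, which is available as a global in modern browsers and in Node.js. The name hexToText and the strict input check are my own assumptions for this illustration; toHex is repeated so the snippet runs on its own:

```javascript
// Text -> UTF-8 bytes -> lowercase hex (same as toHex above).
function toHex(txt) {
  return Array.from(new TextEncoder().encode(txt))
    .map(b => b.toString(16).padStart(2, "0"))
    .join("");
}

// Hex -> UTF-8 bytes -> text. Rejects input that is not whole hex byte pairs.
function hexToText(hex) {
  if (!/^(?:[0-9a-f]{2})*$/i.test(hex)) {
    throw new RangeError("expected an even number of hex digits");
  }
  const pairs = hex.match(/../g) || [];
  const bytes = Uint8Array.from(pairs, p => parseInt(p, 16));
  return new TextDecoder().decode(bytes);
}

console.log(toHex("Hello World"));
console.log(hexToText("e5ae88e68aa4e69d91e5ad90"));
```

Because both functions work on UTF-8 bytes, ASCII text and Chinese text go through the same code path, which is what the encodeURIComponent approach in the question was missing.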

Decode Unicode to character in javascript

I have the following unicode sequence:
d76cb9dd0020b370b2c8c758
I picked some arbitrary non-English text for this experiment (I tried Korean) as the original of the Unicode sequence above:
희망 데니의
How can I decode the Unicode sequence above back into its original form?
As a JavaScript string literal, escape hex codes with \u:
var koreanString = "\ud76c\ub9dd\u0020\ub370\ub2c8\uc758";
Or just enter the korean characters into the string:
var koreanString = "희망 데니의";
To process a hex string representing Unicode characters, parse the hex string to numbers and then build the string using String.fromCharCode():
var hex = "d76cb9dd0020b370b2c8c758";
var koreanString = "";
for (var i = 0; i < hex.length; i += 4) {
// substring takes (start, end), so the end index must be i + 4
koreanString += String.fromCharCode(parseInt(hex.substring(i, i + 4), 16));
}
Edit: You can get the length of any string by accessing its length property:
var stringLength = koreanString.length;
This will return 6. There is no "English" string. You have a string representing hexadecimal numbers, and hexadecimal numbers consist of characters from the Latin character set, but these are not in any spoken language. They are just numbers. You can, of course, get the length of the hexadecimal string using the length property, but I'm not sure why you'd want to do that. It would be more straightforward to use an array of numbers instead of a string:
var charCodes = [0xd76c, 0xb9dd, 0x0020, 0xb370, 0xb2c8, 0xc758];
var koreanString = String.fromCharCode.apply(null, charCodes);
In this way, charCodes.length will be the same as koreanString.length.
How about
var str = 'd76cb9dd0020b370b2c8c758';
str = '"'+str.replace(/([0-9a-z]{4})/g, '\\u$1')+'"';
alert(JSON.parse(str));
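If you would rather not build a JSON string just to parse it again, a similar regex can feed String.fromCharCode directly. This is only a sketch: decodeHexCodeUnits is a name I made up, and it assumes the input is always whole groups of 4 hex digits:

```javascript
// Split the hex text into 4-digit groups and turn each into one UTF-16 code unit.
function decodeHexCodeUnits(hex) {
  const groups = hex.match(/[0-9a-f]{4}/gi) || [];
  // If anything was left unmatched, the input was not whole 4-digit groups.
  if (groups.length * 4 !== hex.length) {
    throw new RangeError("expected groups of 4 hex digits");
  }
  return String.fromCharCode(...groups.map(g => parseInt(g, 16)));
}

console.log(decodeHexCodeUnits("d76cb9dd0020b370b2c8c758"));
```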
