How to convert large UTF-8 strings into ASCII? - javascript

I need to convert large UTF-8 strings into ASCII. It should be reversible, and ideally a quick/lightweight algorithm.
How can I do this? I need the source code (using loops) or the JavaScript code. (should not be dependent on any platform/framework/library)
Edit: I understand that the ASCII representation will not look correct and would be larger (in terms of bytes) than its UTF-8 counterpart, since it's an encoded form of the UTF-8 original.

You could use an ASCII-only version of Douglas Crockford's json2.js quote function, which would look like this:
var escapable = /[\\\"\x00-\x1f\x7f-\uffff]/g,
meta = { // table of character substitutions
'\b': '\\b',
'\t': '\\t',
'\n': '\\n',
'\f': '\\f',
'\r': '\\r',
'"' : '\\"',
'\\': '\\\\'
};
function quote(string) {
// If the string contains no control characters, no quote characters, and no
// backslash characters, then we can safely slap some quotes around it.
// Otherwise we must also replace the offending characters with safe escape
// sequences.
escapable.lastIndex = 0;
return escapable.test(string) ?
'"' + string.replace(escapable, function (a) {
var c = meta[a];
return typeof c === 'string' ? c :
'\\u' + ('0000' + a.charCodeAt(0).toString(16)).slice(-4);
}) + '"' :
'"' + string + '"';
}
This will produce a valid ASCII-only, JavaScript-quoted version of the input string,
e.g. quote("Doppelgänger!") will be "Doppelg\u00e4nger!"
To revert the encoding you can parse the result with JSON.parse (or eval it, but only if you trust the input):
var encoded = quote("Doppelgänger!");
var back = JSON.parse(encoded); // eval(encoded);

Any UTF-8 string that is reversibly convertible to ASCII is already ASCII.
UTF-8 can represent any unicode character - ASCII cannot.

As others have said, you can't convert UTF-8 text/plain into ASCII text/plain without dropping data.
You could convert UTF-8 text/plain into ASCII someother/format. For instance, HTML lets any character in UTF-8 be represented in an ASCII data file using character references.
If we continue with that example, in JavaScript, charCodeAt could help with converting a string to a representation of it using HTML character references.
Another approach is taken by URLs, and implemented in JS as encodeURIComponent.
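The two approaches mentioned above could be sketched roughly as follows. The helper name toHtmlCharRefs is hypothetical (it is not part of any library), and the sketch does not escape a literal & in the input, so text that already contains something like &#228; would not round-trip cleanly.

```javascript
// Hypothetical helper: replaces every non-ASCII code point with a
// numeric HTML character reference such as &#228;
function toHtmlCharRefs(str) {
  let out = '';
  for (const ch of str) { // for...of iterates by code point, not by UTF-16 unit
    const cp = ch.codePointAt(0);
    out += cp > 127 ? '&#' + cp + ';' : ch;
  }
  return out;
}

const asHtml = toHtmlCharRefs('Doppelgänger!');    // "Doppelg&#228;nger!"

// URL-style percent encoding is built in and reversible:
const asUrl = encodeURIComponent('Doppelgänger!'); // "Doppelg%C3%A4nger!"
const back = decodeURIComponent(asUrl);            // "Doppelgänger!"
```

Percent encoding has the advantage that the decoder (decodeURIComponent) already exists, so no custom parser is needed on the receiving side.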

Your requirement is pretty strange.
Converting UTF-8 into ASCII would lose all information about Unicode codepoints > 127 (i.e. everything that's not in ASCII).
You could, however try to encode your Unicode data (no matter what source encoding) in an ASCII-compatible encoding, such as UTF-7. This would mean that the data that is produced could legally be interpreted as ASCII, but it is really UTF-7.

If the string is encoded as UTF-8, it's not a string any more. It's binary data, and if you want to represent the binary data as ASCII, you have to format it into a string that can be represented using the limited ASCII character set.
One way is to use base-64 encoding (example in C#):
string original = "asdf";
// encode the string into UTF-8 data:
byte[] encodedUtf8 = Encoding.UTF8.GetBytes(original);
// format the data into base-64:
string base64 = Convert.ToBase64String(encodedUtf8);
If you want the string encoded as ASCII data:
// encode the base-64 string into ASCII data:
byte[] encodedAscii = Encoding.ASCII.GetBytes(base64);
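Since the question asks for JavaScript, here is a rough equivalent of the C# example. It is only a sketch: the function names are made up for illustration, and it assumes an environment where TextEncoder, TextDecoder, btoa and atob are available as globals (current browsers and recent Node.js versions).

```javascript
// string -> UTF-8 bytes -> one char per byte -> base-64 (ASCII only)
function toBase64Utf8(str) {
  const bytes = new TextEncoder().encode(str);
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// base-64 -> one char per byte -> UTF-8 bytes -> string
function fromBase64Utf8(b64) {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new TextDecoder().decode(bytes);
}

const encoded = toBase64Utf8('asdf');     // "YXNkZg=="
const decoded = fromBase64Utf8(encoded);  // "asdf"
```

The output is roughly a third larger than the UTF-8 input, which matches the size trade-off the question already accepts.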

It is impossible to convert a UTF-8 string into ASCII, but it is possible to encode Unicode as an ASCII-compatible string.
Probably you want to use Punycode - this is already a standard Unicode encoding that encodes all Unicode characters into ASCII. For JavaScript code check this question
Please edit your question title and description in order to prevent others from down-voting it - do not use the term conversion, use encoding.

function utf8ToAscii(str) {
/**
* ASCII contains 128 characters (codes 0 to 127).
*
* In JavaScript, strings are sequences of UTF-16 code units, which means
* a single char code cannot be 2^16 or greater. E.g.:
* `String.fromCharCode(0) === String.fromCharCode(2**16)`
*
* @see https://developer.mozilla.org/en-US/docs/Web/API/DOMString/Binary
*/
const reg = /[\x7f-\uffff]/g; // charCode: [127, 65535]
const replacer = (s) => {
const charCode = s.charCodeAt(0);
const unicode = charCode.toString(16).padStart(4, '0');
return `\\u${unicode}`;
};
return str.replace(reg, replacer);
}
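The question asks for a reversible encoding, and the function above has no matching decoder. One possible pair is sketched below. The names escapeNonAscii and unescapeNonAscii are invented for this example. Unlike the function above, this variant also escapes the backslash itself, on the assumption that the input may already contain text that looks like \u00e4 and must survive the round trip unchanged.

```javascript
// Encode: backslash -> "\\", anything from 0x7f upwards -> "\uXXXX"
function escapeNonAscii(str) {
  return str.replace(/[\\\x7f-\uffff]/g, (ch) =>
    ch === '\\'
      ? '\\\\'
      : '\\u' + ch.charCodeAt(0).toString(16).padStart(4, '0'));
}

// Decode: reverse both escape forms in a single left-to-right pass
function unescapeNonAscii(str) {
  return str.replace(/\\(\\|u([0-9a-fA-F]{4}))/g, (match, esc, hex) =>
    esc === '\\' ? '\\' : String.fromCharCode(parseInt(hex, 16)));
}

const escaped = escapeNonAscii('Doppelgänger!'); // Doppelg\u00e4nger!
const restored = unescapeNonAscii(escaped);      // Doppelgänger!
```

Characters outside the Basic Multilingual Plane come out as two \uXXXX escapes (one per surrogate), and the decoder puts the pair back together.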
Better way
See Uint8Array to string in Javascript also. You can use TextEncoder and Uint8Array:
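function utf8ToAscii(str) {
// TextEncoder always produces UTF-8; its constructor takes no arguments.
const enc = new TextEncoder();
const u8s = enc.encode(str);
// One char per byte: the result holds char codes 0 to 255 (a "binary string"),
// so bytes above 127 are still outside 7-bit ASCII.
return Array.from(u8s).map(v => String.fromCharCode(v)).join('');
}
// To turn the binary string back into the original string:
// new TextDecoder().decode(new Uint8Array(str.split('').map(v=>v.charCodeAt(0))))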

Do you want to strip all non-ASCII chars (or replace them with '?', etc.) or to store Unicode code points in a non-Unicode system?
The first can be done in a loop checking for values > 127 and replacing them.
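A minimal sketch of the first option, using a plain loop as the question requests. The function name and the default replacement character are arbitrary choices for this example. Note that this direction is lossy and cannot be reversed.

```javascript
// Replace every UTF-16 code unit above 127 with a placeholder.
// Characters outside the BMP occupy two code units, so they yield two placeholders.
function replaceNonAscii(str, replacement = '?') {
  let out = '';
  for (let i = 0; i < str.length; i++) {
    out += str.charCodeAt(i) > 127 ? replacement : str[i];
  }
  return out;
}

const stripped = replaceNonAscii('Doppelgänger!'); // "Doppelg?nger!"
```

Passing an empty string as the second argument strips the characters instead of replacing them.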
If you don't want to use "any platform/framework/library" then you will need to write your own encoder. Otherwise I'd just use JQuery's .html();

Here is a function to convert accented characters (àéèî etc.) to their extended-ASCII codes.
If there is an accent in the string, it's converted to % followed by the code; for example, é becomes %130.
Then on the other side, I parse the string, so I know when there is an accent and what the ASCII char is.
I used it in JavaScript software to send data to a microcontroller that works in ASCII.
var convertUtf8ToAscii = function (str) {
var asciiStr = "";
var refTable = { // Reference table Unicode vs ASCII
199: 128, 252: 129, 233: 130, 226: 131, 228: 132, 224: 133, 231: 135, 234: 136, 235: 137, 232: 138,
239: 139, 238: 140, 236: 141, 196: 142, 201: 144, 244: 147, 246: 148, 242: 149, 251: 150, 249: 151
};
for(var i = 0; i < str.length; i++){
var ascii = refTable[str.charCodeAt(i)];
if (ascii != undefined)
asciiStr += "%" +ascii;
else
asciiStr += str[i];
}
return asciiStr;
}

An implementation of the quote() function might do what you want.
My version can be found here
You can use eval() to reverse the encoding:
var foo = 'Hägar';
var quotedFoo = quote(foo);
var unquotedFoo = eval(quotedFoo);
alert(foo === unquotedFoo);

Related

Javascript hexadecimal to ASCII with latin extended symbols

I am getting a hexadecimal value of my string that looks like this:
String has letters with diacritics: č,š,ř, ...
Hexadecimal value of this string is:
0053007400720069006E006700200068006100730020006C0065007400740065007200730020007700690074006800200064006900610063007200690074006900630073003A0020010D002C00200161002C00200159002C0020002E002E002E
The problem is that when I try to convert this value back to ASCII, it fails to convert č, š, ř, ... and returns a little box with a question mark in it instead of these symbols.
My code for converting hex to ascii:
function convertHexadecimal(hexx){
let index = hexx.indexOf("~");
let strInfo = hexx.substring(0, index+1);
let strMessage = hexx.substring(index+1);
var hex = strMessage.toString();
var str = '';
for (var i = 0; i < hex.length; i += 2){
str += String.fromCharCode(parseInt(hex.substr(i, 2), 16));
}
console.log("Zpráva: " + str);
var strFinal = strInfo + str;
return strFinal;
}
Can somebody help me with this?
First an example solution:
let demoHex = `0053007400720069006E006700200068006100730020006C0065007400740065007200730020007700690074006800200064006900610063007200690074006900630073003A0020010D002C00200161002C00200159002C0020002E002E002E`;
function hexToString(hex) {
let str="";
for( var i = 0; i < hex.length; i +=4) {
str += String.fromCharCode( Number("0x" + hex.substr(i,4)));
}
return str;
}
console.log("Decoded string: %s", hexToString(demoHex) );
What it's doing:
It's treating the hex characters as a sequence of 4 hexadecimal digits that provide the UTF-16 character code of a character.
It gets each set of 4 digits in a loop using String.prototype.substr. Note MDN says .substr is deprecated but this is not mentioned in the ECMAScript standard - rewrite it to use substring or something else as you wish.
Hex characters are prefixed with "0x" to make them a valid number representation in JavaScript and converted to a number object using Number. The number is then converted to a character string using the String.fromCharCode static method.
I guessed the format of the hex string by looking at it, which means a general purpose encoding routine to encode UTF16 characters (not code points) into hex could look like:
const hexEncodeUTF16 =
str=>str.split('')
.map( char => char.charCodeAt(0).toString(16).padStart(4,'0'))
.join('');
console.log( hexEncodeUTF16( "String has letters with diacritics: č, š, ř, ..."));
I hope these examples show what needs doing - there are any number of ways to implement it in code.
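For readers who prefer to avoid substr, the decoder could be written with substring and parseInt instead. This is just one possible variant of the example above; the name hexToStringUtf16 is made up here, and it assumes the hex string length is a multiple of 4.

```javascript
// Decode a hex string in which every 4 digits hold one UTF-16 code unit
function hexToStringUtf16(hex) {
  let str = '';
  for (let i = 0; i < hex.length; i += 4) {
    // substring takes (start, end), so the end index is i + 4
    str += String.fromCharCode(parseInt(hex.substring(i, i + 4), 16));
  }
  return str;
}

const sample = hexToStringUtf16('0053010D01610159'); // "Sčšř"
```

The original question's loop stepped through the hex string 2 digits at a time, which splits each code unit such as 010D into two separate characters; that is why č, š and ř were lost.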

Javascript: Error while converting Unicode string to hex

I'm trying to convert a unicode string to a hexadecimal representation in javascript controller in SAPUI5 WebIDE.
I am using this function to convert the unicode data to hex. Str variable contains the Unicode data
convertToHex: function(str) {
var hex = '';
var i = 0;
while (str.length > i) {
hex += '' + str.charCodeAt(i).toString(16);
i++;
}
console.log(hex);
return hex;
},
This is the first line of the result I am getting in the hex variable:
504b3414060800021062ee9d685e10090400130825b436f6e74656e745f54797065735d2e786d6c20a24228a002
Now when I upload the same data to SAP NetWeaver Gateway, it converts the Unicode data to hex as follows (first line):
504B03041400060008000000210062EE9D685E01000090040000130008025B436F6E74656E745F54797065735D2E786D6C20A2040228A00002
This is the decoded unicode:
PK!bîh^[Content_Types].xml ¢( 
For my application to work I need both hex codes to be the same, but I am not able to generate the correct hex code in JavaScript, whereas in SAP I am getting the correct hex values.
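Comparing the two outputs, the JavaScript result appears to be missing leading zeros: 03 and 04 come out as 3 and 4, and 00 comes out as 0. A likely fix, offered as a sketch only, is to pad each value to two hex digits. This assumes every char code in the string is a single byte (0 to 255); a code above 0xFF would still produce more than two digits. The function name is hypothetical.

```javascript
// Pad every byte to two hex digits so that 0x03 becomes "03", not "3"
function convertToHexPadded(str) {
  let hex = '';
  for (let i = 0; i < str.length; i++) {
    hex += str.charCodeAt(i).toString(16).padStart(2, '0');
  }
  return hex.toUpperCase(); // upper case, to match the SAP output shown above
}

const header = convertToHexPadded('PK\x03\x04\x14\x00'); // "504B03041400"
```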

Base64 encoded String URL friendly

I'm using this function to Base64 encode/decode URLs in JavaScript, but I've seen many people recommend making the Base64-encoded string URL friendly by using:
// if this is your Base64 encoded string
var str = 'VGhpcyBpcyBhbiBhd2Vzb21lIHNjcmlwdA==';
// make URL friendly:
str = str.replace(/\+/g, '-').replace(/\//g, '_').replace(/\=+$/, '');
// reverse to original encoding (restore the padding to a multiple of 4)
str = str + '='.repeat((4 - (str.length % 4)) % 4);
str = str.replace(/-/g, '+').replace(/_/g, '/');
I see it only replaces + with - and / with _, so what is the point? I mean, why should I?
If you did not replace the characters, they could get misinterpreted, as pointed out in this answer. This would result in either displaying wrong content or, more likely, displaying nothing.
For explanation:
In URLs the following characters are reserved: : / ? # [ ] @ ! $ & ' ( ) * + , ; =
As pointed out in this answer, you need 65 characters to encode data, but the sum of uppercase characters, lowercase characters and digits is only 62. So it needs three additional characters, which are + = /. These three are reserved, so you need to replace them. As of right now I am not entirely sure about -.
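The replacement and its reversal could be wrapped in two small helpers, sketched below. The function names are invented for this example. The padding is restored by computing how many = characters are needed to reach a multiple of 4, so the sketch does not depend on any particular input length.

```javascript
// Standard Base64 -> URL-friendly form: + becomes -, / becomes _, padding dropped
function toBase64Url(b64) {
  return b64.replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
}

// URL-friendly form -> standard Base64, with the = padding restored
function fromBase64Url(b64url) {
  const b64 = b64url.replace(/-/g, '+').replace(/_/g, '/');
  const pad = (4 - (b64.length % 4)) % 4;
  return b64 + '='.repeat(pad);
}

const original = 'VGhpcyBpcyBhbiBhd2Vzb21lIHNjcmlwdA==';
const urlSafe = toBase64Url(original);    // trailing "==" removed
const roundTrip = fromBase64Url(urlSafe); // equal to original
```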

JavaScript how to split a string every n characters while ignoring ANSI codes

How would you approach splitting a JavaScript string every n characters while ignoring the ansi codes? (so splitting every n + length of ansi characters contained in that string)
It is important to keep the ansi code in the final array.
I know using regex you'd write something like /.{1,3}/, but how would you ignore the ansi chars in the count?
Example:
Given \033[34mHey \033[35myou\033[0m, how would you split every 3 chars to get:
[
'\033[34mHey',
' \033[35myo',
'u\033[0m'
]
Here is a way to achieve what you need:
s = "\033[34mHey \033[35myou\033[0mfd\033[1m";
chunks = s.match(/(?:(?:\033\[[0-9;]*m)*.?){1,3}/g);
var arr = [];
[].forEach.call(chunks, function(a) {
if (!/^(?:\033\[[0-9;]*m)*$/.test(a)) {
arr.push(a);
}
});
document.getElementById("r").innerHTML = JSON.stringify(arr);
<div id="r"/>
Note that octal codes can be used directly in the regex. We filter all the empty and ANSI-color code only elements in the forEach call.
You can match ANSI color escapes as \x1B\[[\d;]*m, and only count all characters except escapes [^\x1B]
/(?:(?:\x1B\[[\d;]*m)*[^\x1B]){1,3}/g
Also, to include escapes in the end of string as part of the last token:
/(?:(?:\x1B\[[\d;]*m)*[^\x1B]){1,3}(?:(?:\x1B\[[\d;]*m)+$)?/g
Code
subject = "\033[34mHey y\033[0mou\033[0m";
pattern = /(?:(?:\x1B\[[\d;]*m)*[^\x1B]){1,3}(?:(?:\x1B\[[\d;]*m)+$)?/g;
result = subject.match(pattern);
document.write('<pre>' + JSON.stringify(result) + '</pre>');
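To use the same pattern with an arbitrary chunk size and outside the browser, it could be built with the RegExp constructor. This is a sketch under the assumption that only color-style escapes of the form ESC [ ... m occur in the input; the helper name splitAnsi is hypothetical.

```javascript
// Split str into chunks of at most n visible characters, keeping ANSI
// color escapes attached to the character that follows them
// (or to the last chunk when they end the string).
function splitAnsi(str, n) {
  const esc = '(?:\\x1B\\[[\\d;]*m)';
  const pattern = new RegExp(
    '(?:' + esc + '*[^\\x1B]){1,' + n + '}(?:' + esc + '+$)?', 'g');
  return str.match(pattern) || [];
}

const parts = splitAnsi('\x1B[34mHey \x1B[35myou\x1B[0m', 3);
// parts: ['\x1B[34mHey', ' \x1B[35myo', 'u\x1B[0m']
```

The escape is written as \x1B here rather than \033 because octal escapes in string literals are rejected in strict mode.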

Decode Unicode to character in javascript

I have the following unicode sequence:
d76cb9dd0020b370b2c8c758
As an experiment, I picked some non-English text (Korean) as the original of the above Unicode sequence:
희망 데니의
How can I decode the above Unicode sequence into its original form?
As a JavaScript string literal, escape hex codes with \u:
var koreanString = "\ud76c\ub9dd\u0020\ub370\ub2c8\uc758";
Or just enter the korean characters into the string:
var koreanString = "희망 데니의";
To process a hex string representing Unicode characters, parse the hex string into numbers and then build the Unicode string using String.fromCharCode():
var hex = "d76cb9dd0020b370b2c8c758";
var koreanString = "";
for (var i = 0; i < hex.length; i += 4) {
koreanString += String.fromCharCode(parseInt(hex.substring(i, i + 4), 16));
}
Edit: You can get the length of any string by accessing its length property:
var stringLength = koreanString.length;
This will return 6. There is no "English" string. You have a string representing hexadecimal numbers, and hexadecimal numbers consist of characters from the Latin character set, but these are not in any spoken language. They are just numbers. You can, of course, get the length of the hexadecimal string using the length property, but I'm not sure why you'd want to do that. It would be more straightforward to use an array of numbers instead of a string:
var charCodes = [0xd76c, 0xb9dd, 0x0020, 0xb370, 0xb2c8, 0xc758];
var koreanString = String.fromCharCode.apply(null, charCodes);
In this way, charCodes.length will be the same as koreanString.length.
How about
var str = 'd76cb9dd0020b370b2c8c758';
str = '"'+str.replace(/([0-9a-z]{4})/g, '\\u$1')+'"';
alert(JSON.parse(str));
