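&#xe21;&#xe2d;&#xe40;&#xe15;&#xe2d;&#xe23;&#xe4c;&#xe44;&#xe0b;&#xe04;&#xe4c;
Can I convert this Unicode to a string with JS? (It is Thai.)
I use
console.log(String.fromCharCode("&#xe21;&#xe2d;&#xe40;&#xe15;&#xe2d;&#xe23;&#xe4c;&#xe44;&#xe0b;&#xe04;&#xe4c;"));
And it's not correct. If it were right, it would show มอเตอร์ไซค์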
Your Unicode string is encoded using HTML entity notation. Generally that means that whatever encoded the string expected it to end up in the middle of an HTML document, where it would be seen by an HTML parser.
If you've somehow got that string in JavaScript in a browser, you can get to the encoded Unicode by letting the browser parse it:
var str = "&#xe21;&#xe2d;&#xe40;&#xe15;&#xe2d;&#xe23;&#xe4c;&#xe44;&#xe0b;&#xe04;&#xe4c;";
var elem = document.createElement("div");
elem.innerHTML = str;
alert(elem.textContent);
The String.fromCharCode() function expects one or more numeric arguments; it won't understand HTML entities. So if you're not in a browser (for example, if you've got the string in a Node.js program), you could convert the string with your own code:
var str = "&#xe21;&#xe2d;&#xe40;&#xe15;&#xe2d;&#xe23;&#xe4c;&#xe44;&#xe0b;&#xe04;&#xe4c;";
var thai = String.fromCharCode.apply(String, str.match(/x[^;]*;/g).map(function(n) { return parseInt(n.slice(1, -1), 16); }));
That conversion only works when every code point involved is below U+10000 (the first 64K values), because String.fromCharCode takes 16-bit UTF-16 code units rather than full code points.
You may want something like this :
var input = "&#xe21;&#xe2d;&#xe40;&#xe15;&#xe2d;&#xe23;&#xe4c;&#xe44;&#xe0b;&#xe04;&#xe4c;";
var output = input.replace(/&#x[0-9A-Fa-f]+;/g,
function(htmlCode) {
var codePoint = parseInt( htmlCode.slice(3, -1), 16 );
return String.fromCharCode( codePoint );
});
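As a rough sketch of the same replace-based idea, the version below also accepts decimal entities and code points above U+FFFF by using String.fromCodePoint. The name decodeNumericEntities is hypothetical, and the choice to leave out-of-range values untouched is an assumption, not something the answers above specify.

```javascript
// Hypothetical helper: decodes numeric HTML entities such as "&#xe21;" or "&#3617;".
// Named entities like "&amp;" are deliberately left alone.
function decodeNumericEntities(input) {
  return input.replace(/&#(?:x([0-9A-Fa-f]+)|([0-9]+));/gi, function (whole, hex, dec) {
    var codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Assumption: values beyond the Unicode range are kept as-is instead of throwing.
    return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : whole;
  });
}

console.log(decodeNumericEntities("&#xe21;&#xe2d;")); // มอ
console.log(decodeNumericEntities("&#65;&#x42;"));    // AB
```

Unlike String.fromCharCode, String.fromCodePoint produces a surrogate pair on its own when the value is above U+FFFF.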
I have a string in JS in this format:
http\x3a\x2f\x2fwww.url.com
How can I get the decoded string out of this? I tried unescape() and string.decode, but neither decodes it. If I display that encoded string in the browser it looks fine (http://www.url.com), but I want to manipulate this string before displaying it.
Thanks.
You could write your own replacement method:
String.prototype.decodeEscapeSequence = function() {
return this.replace(/\\x([0-9A-Fa-f]{2})/g, function() {
return String.fromCharCode(parseInt(arguments[1], 16));
});
};
"http\\x3a\\x2f\\x2fwww.example.com".decodeEscapeSequence()
There is nothing to decode here. \xNN is an escape sequence in JavaScript that denotes the character with hexadecimal code NN. An escape sequence is simply a way of writing a character in a string literal: once the literal is parsed, it is already "decoded", which is why it displays fine in the browser.
When you do:
var str = 'http\x3a\x2f\x2fwww.url.com';
it is internally stored as http://www.url.com. You can manipulate this directly.
If you already have:
var encodedString = "http\x3a\x2f\x2fwww.url.com";
Then decoding the string manually is unnecessary. The JavaScript interpreter would already be decoding the escape sequences for you, and in fact double-unescaping can cause your script to not work properly with some strings. If, in contrast, you have:
var encodedString = "http\\x3a\\x2f\\x2fwww.url.com";
Those backslashes would be considered escaped (so the hex escape sequences remain undecoded), so keep reading.
The easiest way in that case is to use the eval function, which runs its argument as JavaScript code and returns the result:
var decodedString = eval('"' + encodedString + '"');
This works because \x3a is a valid JavaScript string escape sequence. However, don't do it this way if the string does not come from your own server; doing so would create a new security weakness, because eval can execute arbitrary JavaScript code.
A better (but less concise) approach would be to use JavaScript's string replace method to create valid JSON, then use the browser's JSON parser to decode the resulting string:
var decodedString = JSON.parse('"' + encodedString.replace(/([^\\]|^)\\x/g, '$1\\u00') + '"');
// or using jQuery
var decodedString = $.parseJSON('"' + encodedString.replace(/([^\\]|^)\\x/g, '$1\\u00') + '"');
You don't need to decode it. You can manipulate it safely as it is:
var str = "http\x3a\x2f\x2fwww.url.com";
alert(str.charAt(4)); // :
alert("\x3a" === ":"); // true
alert(str.slice(0,7)); // http://
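To tell the two situations in this thread apart, a quick check along these lines may help. It is only a sketch: decodeHexEscapes is a hypothetical name, and it handles two-digit \xNN sequences only.

```javascript
var parsed = "http\x3a\x2f\x2fwww.url.com";   // escapes already resolved by the parser
var raw = "http\\x3a\\x2f\\x2fwww.url.com";   // contains literal backslash characters

// Hypothetical helper: turns literal \xNN sequences into the characters they name.
function decodeHexEscapes(text) {
  return text.replace(/\\x([0-9A-Fa-f]{2})/g, function (whole, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

console.log(parsed.length);                    // 18
console.log(raw.length);                       // 27
console.log(decodeHexEscapes(raw) === parsed); // true
```

If the length of your string already matches the decoded form, there is nothing left to decode.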
maybe this helps: http://cass-hacks.com/articles/code/js_url_encode_decode/
function URLDecode (encodedString) {
var output = encodedString;
var binVal, thisString, match; // declare match so it does not leak as a global
var myregexp = /(%[^%]{2})/;
while ((match = myregexp.exec(output)) != null
&& match.length > 1
&& match[1] != '') {
binVal = parseInt(match[1].substr(1),16);
thisString = String.fromCharCode(binVal);
output = output.replace(match[1], thisString);
}
return output;
}
2019
You can use decodeURI or decodeURIComponent and not unescape.
console.log(
decodeURI('http\x3a\x2f\x2fwww.url.com')
)
So I'm converting a string to BASE64 as shown in the code below...
var str = "Hello World";
var enc = window.btoa(str);
This yields SGVsbG8gV29ybGQ=. However, if I add the characters – and ”, as in the code shown below, the conversion doesn't happen. What is the reason behind this? Thank you so much.
var str = "Hello – World”";
var enc = window.btoa(str);
btoa is an exotic function in that it requires a "Binary String", which is an 8-bit clean string format. It doesn't work with Unicode values above char code 255, such as those used by your en dash and "fancy" quote symbol.
You'll either have to turn your string into a new string that conforms to single-byte packing (and then manually reconstitute the result of the associated atob), or you can URI-encode the data first, making it safe:
> var str = `Hello – World`;
> window.btoa(encodeURIComponent(str));
"SGVsbG8lMjAlRTIlODAlOTMlMjBXb3JsZA=="
And then remember to decode it again when unpacking:
> var base64= "SGVsbG8lMjAlRTIlODAlOTMlMjBXb3JsZA==";
> decodeURIComponent(window.atob(base64));
"Hello – World"
The problem is that characters such as ” lie outside the Latin-1 range.
For this you can use unescape (now deprecated):
var str = "Hello – World”";
var enc = btoa(unescape(encodeURIComponent(str)));
alert(enc);
And to decode:
var encStr = "SGVsbG8g4oCTIFdvcmxk4oCd";
var dec = decodeURIComponent(escape(window.atob(encStr)))
alert(dec);
This ultimately owes to a deficiency in the JavaScript type system.
JavaScript strings are strings of 16-bit code units, which are customarily interpreted as UTF-16. The Base64 encoding is a method of transforming an 8-bit byte stream into a string of digits, by taking each three bytes and mapping them into four digits, each covering 6 bits: 3 × 8 = 4 × 6. As we see, this is crucially dependent on the bit width of each symbol.
At the time the btoa function was defined, JavaScript had no type for 8-bit byte streams, so the API was defined to take the ordinary 16-bit string type as input, with the restriction that each code unit was supposed to be confined to the range [U+0000, U+00FF]; when encoded into ISO-8859-1, such a string would reproduce the intended byte stream exactly.
The character – is U+2013, while ” is U+201D; neither of those characters fits into the above-mentioned range, so the function rejects it.
If you want to convert Unicode text into Base64, you need to pick a character encoding and convert it into a byte string first, and encode that. Asking for a Base64 encoding of a Unicode string itself is meaningless.
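A minimal sketch of that two-step approach, assuming UTF-8 as the chosen encoding and an environment that provides TextEncoder, TextDecoder, btoa and atob (current browsers and recent Node.js versions do). The helper names utf8ToBase64 and base64ToUtf8 are hypothetical.

```javascript
// Step 1: text -> UTF-8 bytes. Step 2: bytes -> "binary string" -> Base64.
function utf8ToBase64(text) {
  var bytes = new TextEncoder().encode(text);
  var binary = "";
  for (var i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]); // every value is in 0..255
  }
  return btoa(binary);
}

// The reverse: Base64 -> bytes -> text decoded as UTF-8.
function base64ToUtf8(base64) {
  var binary = atob(base64);
  var bytes = new Uint8Array(binary.length);
  for (var i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new TextDecoder().decode(bytes);
}

console.log(utf8ToBase64("Hello – World”"));           // SGVsbG8g4oCTIFdvcmxk4oCd
console.log(base64ToUtf8("SGVsbG8g4oCTIFdvcmxk4oCd")); // Hello – World”
```

Both sides have to agree on the encoding: decoding these bytes as anything other than UTF-8 will not give back the original text.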
The most bulletproof way is to work on binary data directly.
For this, you can encode your string to a Uint8Array holding the UTF-8 bytes of your string.
Then a FileReader instance will be able to give you the Base64 quite easily.
var str = "Hello – World”";
var buf = new TextEncoder().encode( str );
var reader = new FileReader();
reader.onload = evt => { console.log( reader.result.split(',')[1] ); };
reader.readAsDataURL( new Blob([buf]) );
And since the Blob() constructor does automagically encode DOMString instances to UTF-8, we could even get rid of the TextEncoder object:
var str = "Hello – World”";
var reader = new FileReader();
reader.onload = evt => { console.log( reader.result.split(',')[1] ); };
reader.readAsDataURL( new Blob([str]) );
I have a string which contains xml. It has the following substring
<Subject>������������������</subject>
I'm pulling the XML from a server and I need to display it to the user. I've noticed the ampersand has been escaped and there are UTF-16 surrogate pairs. How do I ensure the emojis/emoticons are displayed correctly in a browser?
Currently I'm just getting these characters: �������������� instead of the actual emojis.
I'm looking for a simple way to fix this without any external libraries or any 3rd party code if possible just plain old javascript, html or css.
You can convert UTF-16 code units including surrogates to a JavaScript string with String.fromCharCode. The following code snippet should give you an idea.
var str = '��ABC����������������';
// Regex matching either a surrogate or a character.
var re = /&#(\d+);|([^&])/g;
var match;
var charCodes = [];
// Find successive matches
while (match = re.exec(str)) {
if (match[1] != null) {
// Surrogate
charCodes.push(parseInt(match[1], 10)); // store a number, not the matched string
}
else {
// Unescaped character (assuming the code point is below 0x10000),
charCodes.push(match[2].charCodeAt(0));
}
}
// Create string from UTF-16 code units.
var result = String.fromCharCode.apply(null, charCodes);
console.log(result);
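The same idea can be sketched with String.prototype.replace. The input below is hypothetical: it spells U+1F600 as two decimal entities, one per UTF-16 code unit, with the ampersand escaped once more as in the question. decodeSurrogateEntities is a made-up name, and blindly unescaping every &amp; is an assumption that only holds if the text contains no other intentionally escaped ampersands.

```javascript
// Hypothetical input: "&amp;#55357;&amp;#56832;" is U+1F600 as an escaped surrogate pair.
var escaped = "&amp;#55357;&amp;#56832; ok";

function decodeSurrogateEntities(text) {
  return text
    .replace(/&amp;/g, "&") // undo the extra escaping of the ampersand
    .replace(/&#(\d+);/g, function (whole, digits) {
      var value = parseInt(digits, 10);
      if (value <= 0xFFFF) {
        return String.fromCharCode(value);  // a single UTF-16 code unit
      }
      if (value <= 0x10FFFF) {
        return String.fromCodePoint(value); // a full code point
      }
      return whole;                         // not a valid code point: keep as-is
    });
}

var decoded = decodeSurrogateEntities(escaped);
console.log(decoded.codePointAt(0).toString(16)); // 1f600
```

Adjacent high and low surrogates combine into one character as soon as they sit next to each other in the resulting string, so no extra pairing step is needed.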
I'm trying to validate, on the client as well as the server, a Base64-encoded 256-bit number without the = padding, using this regex:
^[A-Za-z0-9+/]{42}[AEIMQUYcgkosw048]$
This is my code which isn't working as expected as any value seems to return true:
$.fn.validateKey = function() {
var re = /^[A-Za-z0-9+/]{42}[AEIMQUYcgkosw048]$/
var re = new RegExp($(this).val());
return re;
};
How can I validate Base 64 encoded 256-bit signing keys without padding with javascript?
You're returning a RegExp object. You want to return its evaluation with an input string instead.
$.fn.validateKey = function() {
var re = /^[A-Za-z0-9+/]{42}[AEIMQUYcgkosw048]$/;
return re.test($(this).val());
};
Jan pointed out in the comments that the / doesn't need to be escaped in the regex (at least in my browser).
I believe that's because it is part of a character class.
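For reference, here is a jQuery-free sketch of the same check. A 256-bit value is 32 bytes, which Base64 encodes as 43 characters once the single = is dropped; the last character carries only 4 data bits followed by 2 zero bits, so it must be one of 16 symbols, which is what the final character class lists. The names KEY_PATTERN and isValidKey are hypothetical.

```javascript
// 42 unrestricted Base64 characters, then one whose low 2 bits are zero.
var KEY_PATTERN = /^[A-Za-z0-9+/]{42}[AEIMQUYcgkosw048]$/;

// Hypothetical helper: returns true or false rather than a RegExp object.
function isValidKey(value) {
  return KEY_PATTERN.test(value);
}

console.log(isValidKey("A".repeat(43)));       // true
console.log(isValidKey("A".repeat(42) + "B")); // false: "B" has a non-zero low bit
console.log(isValidKey("A".repeat(44)));       // false: wrong length
```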
I have a mixed source of unicode and ascii characters, for example:
var source = "\u5c07\u63a2\u8a0e HTML5 \u53ca\u5176\u4ed6";
How do I convert it to a string by extending the uniCodeToString function below, which I wrote myself in JavaScript? This function can convert a pure Unicode string.
function uniCodeToString(source){
//for example, source = "\u5c07\u63a2\u8a0e"
var escapedSource = escape(source);
var codeArray = escapedSource.split("%u");
var str = "";
for(var i=1; i<codeArray.length; i++){
str += String.fromCharCode("0x"+codeArray[i]);
}
return str;
}
Use encodeURIComponent; escape was never meant for Unicode.
var source = "\u5c07\u63a2\u8a0e HTML5 \u53ca\u5176\u4ed6";
var enc=encodeURIComponent(source)
//returned value: (String)
%E5%B0%87%E6%8E%A2%E8%A8%8E%20HTML5%20%E5%8F%8A%E5%85%B6%E4%BB%96
decodeURIComponent(enc)
//returned value: (String)
將探討 HTML5 及其他
I think you are misunderstanding the purpose of Unicode escape sequences.
var source = "\u5c07\u63a2\u8a0e HTML5 \u53ca\u5176\u4ed6";
JavaScript strings are always Unicode (each code unit is a 16-bit UTF-16 value). The purpose of the escapes is to allow you to describe values that are unsupported by the encoding used to save the source file (e.g. the HTML page or .JS file is encoded as ISO-8859-1) or to overcome things like keyboard limitations. This is no different from using \n to indicate a linefeed code point.
The above string ("將探討 HTML5 及其他") is made up of the values 5c07 63a2 8a0e 0020 0048 0054 004d 004c 0035 0020 53ca 5176 4ed6 whether you write the sequence as a literal or in escape sequences.
See the String Literals section of ECMA-262 for more details.
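A small check, assuming the source file itself is saved in an encoding that can hold the literal characters (such as UTF-8), shows that both spellings produce the same string. The variable names are only illustrative.

```javascript
var escaped = "\u5c07\u63a2\u8a0e HTML5 \u53ca\u5176\u4ed6";
var literal = "將探討 HTML5 及其他";

// List each UTF-16 code unit as four hex digits.
var units = [];
for (var i = 0; i < escaped.length; i++) {
  units.push(escaped.charCodeAt(i).toString(16).padStart(4, "0"));
}

console.log(escaped === literal); // true
console.log(units.join(" "));
// 5c07 63a2 8a0e 0020 0048 0054 004d 004c 0035 0020 53ca 5176 4ed6
```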