How to convert mixed ASCII and Unicode to a string in JavaScript?

I have a mixed source of unicode and ascii characters, for example:
var source = "\u5c07\u63a2\u8a0e HTML5 \u53ca\u5176\u4ed6";
How can I convert it to a string by extending the uniCodeToString function below, which I wrote myself in JavaScript? The function can already convert a string of pure Unicode escapes.
function uniCodeToString(source){
    // for example, source = "\u5c07\u63a2\u8a0e"
    var escapedSource = escape(source);
    var codeArray = escapedSource.split("%u");
    var str = "";
    for(var i=1; i<codeArray.length; i++){
        str += String.fromCharCode("0x"+codeArray[i]);
    }
    return str;
}

Use encodeURIComponent; escape was never meant for Unicode.
var source = "\u5c07\u63a2\u8a0e HTML5 \u53ca\u5176\u4ed6";
var enc = encodeURIComponent(source);
//returned value: (String)
%E5%B0%87%E6%8E%A2%E8%A8%8E%20HTML5%20%E5%8F%8A%E5%85%B6%E4%BB%96
decodeURIComponent(enc)
//returned value: (String)
將探討 HTML5 及其他

I think you are misunderstanding the purpose of Unicode escape sequences.
var source = "\u5c07\u63a2\u8a0e HTML5 \u53ca\u5176\u4ed6";
JavaScript strings are always Unicode (each code unit is a 16-bit UTF-16 value). The purpose of the escapes is to let you describe values that are unsupported by the encoding used to save the source file (e.g. the HTML page or .js file is encoded as ISO-8859-1) or to overcome things like keyboard limitations. This is no different from using \n to indicate a linefeed code point.
The above string ("將探討 HTML5 及其他") is made up of the values 5c07 63a2 8a0e 0020 0048 0054 004d 004c 0035 0020 53ca 5176 4ed6 whether you write the sequence as a literal or in escape sequences.
See the String Literals section of ECMA-262 for more details.
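To make the point concrete, here is a minimal sketch. The first part checks that the escaped and literal spellings produce the same string. The second part assumes a different situation from the one in the question: the text arrives at run time (for example from a file or a server response) with literal backslash-u sequences mixed in with plain ASCII. The helper name decodeUnicodeEscapes is hypothetical and not part of the original posts.

```javascript
// Escapes and literals describe the same UTF-16 code units.
var escaped = "\u5c07\u63a2\u8a0e HTML5 \u53ca\u5176\u4ed6";
var literal = "將探討 HTML5 及其他";
console.log(escaped === literal); // true, provided this file is read as UTF-8

// Hypothetical helper: assumes the input holds literal "\uXXXX" text
// (a backslash, a "u" and four hex digits) mixed with ordinary characters.
function decodeUnicodeEscapes(text) {
  return text.replace(/\\u([0-9a-fA-F]{4})/g, function (match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

// "\\u" in source code is a real backslash followed by "u" at run time.
var raw = "\\u5c07\\u63a2\\u8a0e HTML5 \\u53ca\\u5176\\u4ed6";
console.log(decodeUnicodeEscapes(raw) === escaped); // true
```

Plain ASCII passes through unchanged because the pattern only matches the escape sequences.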

Related

Why the .replace() and toUppercase() did not work in the second function? [duplicate]

I want to replace smart quotes such as ‘, ’, “ and ” with regular quotes. I also want to replace ©, ® and ™. I used the following code, but it doesn't work.
Please help me resolve this issue.
str.replace(/[“”]/g, '"');
str.replace(/[‘’]/g, "'");
Use:
str = str.replace(/[“”]/g, '"');
str = str.replace(/[‘’]/g, "'");
or to do it in one statement:
str = str.replace(/[“”]/g, '"').replace(/[‘’]/g,"'");
In JavaScript (as in many other languages) strings are immutable - string "replacement" methods actually just return the new string instead of modifying the string in place.
The MDN JavaScript reference entry for replace states:
Returns a new string with some or all matches of a pattern replaced by a replacement.
…
This method does not change the String object it is called on. It simply returns a new string.
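A minimal sketch of that behavior, with the smart quotes written as Unicode escapes so the result does not depend on the file encoding (the variable names are illustrative only):

```javascript
var original = "\u201CHello\u201D"; // “Hello” with smart double quotes
var replaced = original.replace(/[\u201C\u201D]/g, '"');

// The original string still contains the smart quotes.
console.log(original === "\u201CHello\u201D"); // true
// Only the returned value carries the replacement.
console.log(replaced === '"Hello"'); // true
```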
replace returns the resulting string:
str = str.replace(/["']/, '');
The OP doesn't say why it isn't working, but there seem to be problems related to the encoding of the file. If I have an ANSI encoded file and I do:
var s = "“This is a test” ‘Another test’";
s = s.replace(/[“”]/g, '"').replace(/[‘’]/g,"'");
document.writeln(s);
I get:
"This is a test" "Another test"
I converted the encoding to UTF-8, fixed the smart quotes (which broke when I changed encoding), then converted back to ANSI and the problem went away.
Note that when I copied and pasted the double and single smart quotes off this page into my test document (ANSI encoded) and ran this code:
var s = "“This is a test” ‘Another test’";
for (var i = 0; i < s.length; i++) {
    document.writeln(s.charAt(i) + '=' + s.charCodeAt(i));
}
I discovered that all the smart quotes showed up as ? = 63.
So, to the OP: determine where the smart quotes originate and make sure they have the character codes you expect. If they do not, consider changing the encoding of the source so they arrive as “ = 8220, ” = 8221, ‘ = 8216 and ’ = 8217. Use my loop to examine the source; if the smart quotes show up with any charCodeAt() values other than those listed, replace() will not work as written.
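One way to sidestep the file-encoding problem described above is to write the characters as Unicode escapes, so the script itself stays pure ASCII. The following is only a sketch: the function name toPlainAscii and the substitutes chosen for ©, ® and ™ are assumptions for illustration, not something the original question specified.

```javascript
// Hypothetical map from typographic characters to plain ASCII substitutes.
var REPLACEMENTS = {
  "\u201C": '"',    // left double quotation mark (8220)
  "\u201D": '"',    // right double quotation mark (8221)
  "\u2018": "'",    // left single quotation mark (8216)
  "\u2019": "'",    // right single quotation mark (8217)
  "\u00A9": "(c)",  // copyright sign
  "\u00AE": "(r)",  // registered sign
  "\u2122": "(tm)"  // trade mark sign
};

function toPlainAscii(text) {
  // The character class lists exactly the keys of REPLACEMENTS.
  return text.replace(/[\u201C\u201D\u2018\u2019\u00A9\u00AE\u2122]/g, function (ch) {
    return REPLACEMENTS[ch];
  });
}

console.log(toPlainAscii("\u201CTest\u201D \u2018it\u2019 \u00A9 \u00AE \u2122"));
// "Test" 'it' (c) (r) (tm)
```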
To replace smart quotes with regular quotes, I use a similar function. Specify the replacement by its character code, because the default settings of different computers and browsers may interpret the plain characters (", ', ‘, ’) differently.
Using the character code yields the exact ASCII character, which removes room for error across browsers and operating systems. This is also helpful for bilingual text (accents, etc.).
To replace smart quotes with single quotes:
function unSmartQuotify(n){
    var name = n;
    var apos = String.fromCharCode(39); // straight apostrophe (')
    // Test name, not n: n never changes, so testing it could loop forever.
    // \u2019 is the right single smart quote (8217).
    while (name.indexOf("\u2019") > -1)
        name = name.replace("\u2019", apos);
    return name;
}
To find other ASCII values you may need, check here.

How to decode UTF-16 emoji surrogate pairs into UTF-8 and display them correctly in HTML?

I have a string which contains xml. It has the following substring
<Subject>&#55357;&#56898;&#55357;&#56838;&#55357;&#56846;&#55357;&#56838;&#55357;&#56843;&#55357;&#56838;&#55357;&#56843;&#55357;&#56832;&#55357;&#56846;</subject>
I'm pulling the XML from a server and I need to display it to the user. I've noticed that the ampersand has been escaped and that there are UTF-16 surrogate pairs. How do I ensure the emojis/emoticons are displayed correctly in a browser?
Currently I'm just getting these characters: �������������� instead of the actual emojis.
I'm looking for a simple way to fix this without external libraries or third-party code, if possible: just plain old JavaScript, HTML or CSS.
You can convert UTF-16 code units including surrogates to a JavaScript string with String.fromCharCode. The following code snippet should give you an idea.
var str = '&#55357;&#56898;ABC&#55357;&#56838;&#55357;&#56846;&#55357;&#56838;&#55357;&#56843;&#55357;&#56838;&#55357;&#56843;&#55357;&#56832;&#55357;&#56846;';
// Regex matching either a numeric character reference or a single character.
var re = /&#(\d+);|([^&])/g;
var match;
var charCodes = [];
// Find successive matches
while (match = re.exec(str)) {
    if (match[1] != null) {
        // Numeric reference (here, one half of a surrogate pair)
        charCodes.push(parseInt(match[1], 10));
    }
    else {
        // Unescaped character (assuming the code point is below 0x10000)
        charCodes.push(match[2].charCodeAt(0));
    }
}
// Create string from UTF-16 code units.
var result = String.fromCharCode.apply(null, charCodes);
console.log(result);
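The same idea can also be sketched with String.prototype.replace and a callback, so unescaped characters pass through untouched and no intermediate array is needed. The function name decodeDecimalEntities is hypothetical, and the sketch assumes every reference holds a UTF-16 code unit (a value below 0x10000), as in the question. Each surrogate half is converted on its own; because the halves end up adjacent in the result, they form a valid pair.

```javascript
function decodeDecimalEntities(text) {
  return text.replace(/&#(\d+);/g, function (match, digits) {
    // Each reference is treated as one UTF-16 code unit.
    return String.fromCharCode(parseInt(digits, 10));
  });
}

var decoded = decodeDecimalEntities("&#55357;&#56898;ABC");
console.log(decoded.length);                      // 5 (two code units plus "ABC")
console.log(decoded.codePointAt(0).toString(16)); // 1f642
```

In a page, assigning the decoded string to an element's textContent should then display the emoji, provided the document is served with a suitable encoding and a font with the glyph is available.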

How to convert unicode \:0936\:093e\:092e to Devanagari script?

I have created a Twitter bot based on Google Apps Script and Wolfram|Alpha. The bot answers questions the way Wolfram|Alpha does.
I need to translate a string from English to Devanagari.
The result I get is \:0936\:093e\:092e, which should be converted to "शाम".
Here is a link with more information: https://codepoints.net/U+0936?lang=en
How can I achieve this using Google Apps Script (JavaScript)?
The values in \:0936\:093e\:092e are UTF-16 character codes, but are not expressed in a way that will render the characters you need. If they were, you could use the answer from Expressing UTF-16 unicode characters in JavaScript directly.
Demo
This script extracts the hexadecimal numbers from the given string, then uses the getUnicodeCharacter() function from the linked question to convert each number, or codepoint, into its Unicode character.
function utf16demo() {
    // Note: in a JavaScript string literal, "\:" is simply ":".
    var str = "\:0936\:093e\:092e";
    var charCodes = str.replace(/\:/,'').split('\:').map(function(st){return parseInt(st,16);});
    var newStr = '';
    for (var ch=0; ch < charCodes.length; ch++) {
        // getUnicodeCharacter() is defined in the linked answer.
        newStr += getUnicodeCharacter(charCodes[ch]);
    }
    Logger.log(newStr);
}
Log
[15-11-03 23:04:16:096 EST] शाम
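If the service returns the text with real backslashes, a self-contained sketch along the following lines may be closer to what is needed. It assumes every code is exactly four hex digits (code points below 0x10000), uses only String.fromCharCode, and the function name decodeColonEscapes is hypothetical.

```javascript
function decodeColonEscapes(text) {
  // Matches a literal backslash, a colon and four hex digits.
  return text.replace(/\\:([0-9a-fA-F]{4})/g, function (match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

// "\\:" in source code is one backslash followed by a colon at run time.
var devanagari = decodeColonEscapes("\\:0936\\:093e\\:092e");
console.log(devanagari); // शाम
```

In Google Apps Script, Logger.log(devanagari) could be used in place of console.log.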

Convert this unicode to string with javascript (Thai Language)

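&#x0E21;&#x0E2D;&#x0E40;&#x0E15;&#x0E2D;&#x0E23;&#x0E4C;&#x0E44;&#x0E0B;&#x0E04;&#x0E4C;
Can I convert this Unicode to a string with JS? (It is in the Thai language.)
I use
console.log(String.fromCharCode("&#x0E21;&#x0E2D;&#x0E40;&#x0E15;&#x0E2D;&#x0E23;&#x0E4C;&#x0E44;&#x0E0B;&#x0E04;&#x0E4C;"));
But the result is not correct. If it were right, it would show มอเตอร์ไซค์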
Your Unicode string is encoded using HTML entity notation. Generally that means that whatever encoded the string expected it to end up in the middle of an HTML document, where it would be seen by an HTML parser.
If you've somehow got that string in JavaScript in a browser, you can get to the encoded Unicode by letting the browser parse it:
var str = "&#x0E21;&#x0E2D;&#x0E40;&#x0E15;&#x0E2D;&#x0E23;&#x0E4C;&#x0E44;&#x0E0B;&#x0E04;&#x0E4C;";
var elem = document.createElement("div");
elem.innerHTML = str;
alert(elem.textContent);
The String.fromCharCode() function expects one or more numeric arguments; it won't understand HTML entities. Thus, if you're not in a browser (for example, if you've got the string in a Node.js program), you could convert the string with your own code:
var str = "&#x0E21;&#x0E2D;&#x0E40;&#x0E15;&#x0E2D;&#x0E23;&#x0E4C;&#x0E44;&#x0E0B;&#x0E04;&#x0E4C;";
var thai = String.fromCharCode.apply(String, str.match(/x[^;]*;/g).map(function(n) { return parseInt(n.slice(1, -1), 16); }));
That conversion will only work when the code points involved are within the first 64K values.
You may want something like this:
var input = "&#x0E21;&#x0E2D;&#x0E40;&#x0E15;&#x0E2D;&#x0E23;&#x0E4C;&#x0E44;&#x0E0B;&#x0E04;&#x0E4C;";
var output = input.replace(/&#x[0-9A-Fa-f]+;/g,
    function(htmlCode) {
        var codePoint = parseInt( htmlCode.slice(3, -1), 16 );
        return String.fromCharCode( codePoint );
    });
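Both snippets above rely on String.fromCharCode, which only handles values below 0x10000. A hedged variant using String.fromCodePoint (available since ES2015) also covers supplementary-plane characters and decimal references. The function name decodeNumericEntities is hypothetical, and named entities such as &amp; are deliberately not handled in this sketch.

```javascript
function decodeNumericEntities(text) {
  return text.replace(/&#(?:[xX]([0-9a-fA-F]+)|(\d+));/g, function (match, hex, dec) {
    var codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave out-of-range references as they are instead of throwing.
    return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : match;
  });
}

var sample = decodeNumericEntities("&#x0E21;&#x0E2D; &#3648; &#x1F600;");
console.log(sample); // มอ เ 😀
```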
