How to escape HTML - javascript

I have a string which contains HTML text. I need to escape just the text, not the tags.
For example, I have a string which contains:
<ul class="main_nav">
<li>
<a class="className1" id="idValue1" tabindex="2">Test & Sample</a>
</li>
<li>
<a class="className2" id="idValue2" tabindex="2">Test & Sample2</a>
</li>
</ul>
How do I escape just the text, to get:
<ul class="main_nav">
<li>
<a class="className1" id="idValue1" tabindex="2">Test & Sample</a>
</li>
<li>
<a class="className2" id="idValue2" tabindex="2">Test & Sample2</a>
</li>
</ul>
without modifying the tags.
Can this be handled with the HTML DOM and JavaScript?

I'm very surprised no one has answered this. You can just use the browser itself to do the escaping for you. No regex is better or safer than letting the browser do what it does best: handle HTML.
function escapeHTML(str){
var p = document.createElement("p");
p.appendChild(document.createTextNode(str));
return p.innerHTML;
}
or a shorter alternative using the Option() constructor:
function escapeHTML(str){
return new Option(str).innerHTML;
}
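As a quick illustration (my own example, not part of the original answers), both variants give the same result, because the browser serializes the text node back into markup:
escapeHTML('Test & <b>Sample</b>');   // "Test &amp; &lt;b&gt;Sample&lt;/b&gt;"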

(See further down for an answer to the question as updated by comments from the OP below)
Can this be handled with HTML DOM and javascript?
No, once the text is in the DOM, the concept of "escaping" it doesn't apply. The HTML source text needs to be escaped so that it's parsed into the DOM correctly; once it's in the DOM, it isn't escaped.
This can be a bit tricky to understand, so let's use an example. Here's some HTML source text (such as in an HTML file that you would view with your browser):
<div>This &amp; That</div>
Once that's parsed into the DOM by the browser, the text within the div is This & That, because the &amp; has been interpreted at that point.
So you'll need to catch this earlier, before the text is parsed into the DOM by the browser. You can't handle it after the fact, it's too late.
Separately, the string you're starting with is invalid if it has things like <div>This & That</div> in it. Pre-processing that invalid string will be tricky. You can't just use built-in features of your environment (PHP or whatever you're using server-side) because they'll escape the tags as well. You'll need to do text processing, extracting only the parts that you want to process and then running those through an escaping process. That process will be tricky. An & followed by whitespace is easy enough, but if there are unescaped entities in the source text, how do you know whether to escape them or not? Do you assume that if the string contains &, you leave it alone? Or turn it into &amp;? (Which is perfectly valid; it's how you show the actual string & in an HTML page.)
What you really need to do is correct the underlying problem: The thing creating these invalid, half-encoded strings.
Edit: From our comment stream below, the question is totally different than it seemed from your example (that's not meant critically). To recap the comments for those coming to this fresh, you said that you were getting these strings from WebKit's innerHTML, and I said that was odd, innerHTML should encode & correctly (and pointed you at a couple of test pages that suggested it did). Your reply was:
This works for &. But the same test page does not work for entities like ©, ®, « and many more.
That changes the nature of the question. You want to make entities out of characters that, while perfectly valid when used literally (provided you have your text encoding right), could be expressed as entities instead and therefore made more resilient to text encoding changes.
We can do that. According to the spec, the character values in a JavaScript string are UTF-16 (using Unicode Normalized Form C) and any conversion from the source character encoding (ISO 8859-1, Windows-1252, UTF-8, whatever) is performed before the JavaScript runtime sees it. (If you're not 100% sure you know what I mean by character encoding, it's well worth stopping now, going off and reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky, then coming back.) So that's the input side. On the output side, HTML entities identify Unicode code points. So we can convert from JavaScript strings to HTML entities reliably.
The devil is in the detail, though, as always. JavaScript explicitly assumes that each 16-bit value is a character (see section 8.4 in the spec), even though that's not actually true of UTF-16 — one 16-bit value might be a "surrogate" (such as 0xD800) that only makes sense when combined with the next value, meaning that two "characters" in the JavaScript string are actually one character. This isn't uncommon for far Eastern languages.
So a robust conversion that starts with a JavaScript string and results in an HTML entity can't assume that a JavaScript "character" actually equals a character in the text, it has to handle surrogates. Fortunately, doing so is dead easy because the smart people defining Unicode made it dead easy: The first surrogate value is always in the range 0xD800-0xDBFF (inclusive), and the second surrogate is always in the range 0xDC00-0xDFFF (inclusive). So any time you see a pair of "characters" in a JavaScript string that match those ranges, you're dealing with a single character defined by a surrogate pair. The formulae for converting from the pair of surrogate values to a code point value are given in the above links, although fairly obtusely; I find this page much more approachable.
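As a quick sanity check of those formulae (using U+1F513, OPEN LOCK, purely as an illustration):
var s = "\uD83D\uDD13"; // one real character, two JavaScript "characters"
var cp = (s.charCodeAt(0) - 0xD800) * 0x400 + (s.charCodeAt(1) - 0xDC00) + 0x10000;
// cp === 0x1F513 (decimal 128275)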
Armed with all of this information, we can write a function that will take a JavaScript string and search for characters (real characters, which may be one or two "characters" long) you might want to turn into entities, replacing them with named entities from a map or numeric entities if we don't have them in our named map:
// A map of the entities we want to handle.
// The numbers on the left are the Unicode code point values; their
// matching named entity strings are on the right.
var entityMap = {
"160": "&nbsp;",
"161": "&iexcl;",
"162": "&cent;",
"163": "&pound;",
"164": "&curren;",
"165": "&yen;",
"166": "&brvbar;",
"167": "&sect;",
"168": "&uml;",
"169": "&copy;",
// ...and lots and lots more, see http://www.w3.org/TR/REC-html40/sgml/entities.html
"8364": "&euro;" // Last one must not have a comma after it, IE doesn't like trailing commas
};
// The function to do the work.
// Accepts a string, returns a string with replacements made.
function prepEntities(str) {
// The regular expression below uses an alternation to look for a surrogate pair _or_
// a single character that we might want to make an entity out of. The first part of the
// alternation (the [\uD800-\uDBFF][\uDC00-\uDFFF] before the |), you want to leave
// alone, it searches for the surrogates. The second part of the alternation you can
// adjust as you see fit, depending on how conservative you want to be. The example
// below uses [\u0000-\u001f\u0080-\uFFFF], meaning that it will match and convert any
// character with a value from 0 to 31 ("control characters") or above 127 -- e.g., if
// it's not "printable ASCII" (in the old parlance), convert it. That's probably
// overkill, but you said you wanted to make entities out of things, so... :-)
return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\u001f\u0080-\uFFFF]/g, function(match) {
var high, low, charValue, rep;
// Get the character value, handling surrogate pairs
if (match.length == 2) {
// It's a surrogate pair, calculate the Unicode code point
high = match.charCodeAt(0) - 0xD800;
low = match.charCodeAt(1) - 0xDC00;
charValue = (high * 0x400) + low + 0x10000;
}
else {
// Not a surrogate pair, the value *is* the Unicode code point
charValue = match.charCodeAt(0);
}
// See if we have a mapping for it
rep = entityMap[charValue];
if (!rep) {
// No, use a numeric entity instead
rep = "&#" + charValue + ";";
}
// Return replacement
return rep;
});
}
You should be fine passing all of the HTML through it, since if these characters appear in attribute values, you almost certainly want to encode them there as well.
I have not used the above in production (I actually wrote it for this answer, because the problem intrigued me) and it is supplied totally without warranty of any kind. I have tried to ensure that it handles surrogate pairs because that's necessary for far Eastern languages, and supporting them is something we should all be doing now that the world has gotten smaller.
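For instance, calling it might look like this (illustrative calls assuming the small map above, not from the original answer):
prepEntities("Copyright \u00A9 ACME");   // "Copyright &copy; ACME"
prepEntities("\uD83D\uDD13");            // "&#128275;" -- one surrogate pair becomes one numeric entity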
Complete example page:
<!DOCTYPE HTML>
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
<title>Test Page</title>
<style type='text/css'>
body {
font-family: sans-serif;
}
#log p {
margin: 0;
padding: 0;
}
</style>
<script type='text/javascript'>
// Make the function available as a global, but define it within a scoping
// function so we can have data (the `entityMap`) that only it has access to
var prepEntities = (function() {
// A map of the entities we want to handle.
// The numbers on the left are the Unicode code point values; their
// matching named entity strings are on the right.
var entityMap = {
"160": "&nbsp;",
"161": "&iexcl;",
"162": "&cent;",
"163": "&pound;",
"164": "&curren;",
"165": "&yen;",
"166": "&brvbar;",
"167": "&sect;",
"168": "&uml;",
"169": "&copy;",
// ...and lots and lots more, see http://www.w3.org/TR/REC-html40/sgml/entities.html
"8364": "&euro;" // Last one must not have a comma after it, IE doesn't like trailing commas
};
// The function to do the work.
// Accepts a string, returns a string with replacements made.
function prepEntities(str) {
// The regular expression below uses an alternation to look for a surrogate pair _or_
// a single character that we might want to make an entity out of. The first part of the
// alternation (the [\uD800-\uDBFF][\uDC00-\uDFFF] before the |), you want to leave
// alone, it searches for the surrogates. The second part of the alternation you can
// adjust as you see fit, depending on how conservative you want to be. The example
// below uses [\u0000-\u001f\u0080-\uFFFF], meaning that it will match and convert any
// character with a value from 0 to 31 ("control characters") or above 127 -- e.g., if
// it's not "printable ASCII" (in the old parlance), convert it. That's probably
// overkill, but you said you wanted to make entities out of things, so... :-)
return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\u001f\u0080-\uFFFF]/g, function(match) {
var high, low, charValue, rep;
// Get the character value, handling surrogate pairs
if (match.length == 2) {
// It's a surrogate pair, calculate the Unicode code point
high = match.charCodeAt(0) - 0xD800;
low = match.charCodeAt(1) - 0xDC00;
charValue = (high * 0x400) + low + 0x10000;
}
else {
// Not a surrogate pair, the value *is* the Unicode code point
charValue = match.charCodeAt(0);
}
// See if we have a mapping for it
rep = entityMap[charValue];
if (!rep) {
// No, use a numeric entity instead
rep = "&#" + charValue + ";";
}
// Return replacement
return rep;
});
}
// Return the function reference out of the scoping function to publish it
return prepEntities;
})();
function go() {
var d = document.getElementById('d1');
var s = d.innerHTML;
alert("Before: " + s);
s = prepEntities(s);
alert("After: " + s);
}
</script>
</head>
<body>
<div id='d1'>Copyright: © Yen: ¥ Cedilla: ¸ Surrogate pair: 𐀀</div>
<input type='button' id='btnGo' value='Go' onclick="return go();">
</body>
</html>
There I've included the cedilla as an example of converting to a numeric entity rather than a named one (since I left &cedil; out of my very small example map). And note that the surrogate pair at the end shows up in the first alert as two "characters" because of the way JavaScript handles UTF-16.

You can encode all characters in your string:
function encode(e){return e.replace(/[^]/g,function(e){return"&#"+e.charCodeAt(0)+";"})}
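For instance (a quick illustration, not part of the original answer), every character, even plain ASCII, becomes a numeric reference:
encode('a<b & c');   // "&#97;&#60;&#98;&#32;&#38;&#32;&#99;"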
Or just target the main characters to worry about (&, linebreaks, <, >, " and ') like:
function encode(r){
return r.replace(/[\x26\x0A\<>'"]/g,function(r){return"&#"+r.charCodeAt(0)+";"})
}
var myString='Encode HTML entities!\n"Safe" escape <script></'+'script> & other tags!';
test.value=encode(myString);
testing.innerHTML=encode(myString);
/*************
* \x26 is the ampersand (&); it has to be first,
* \x0A is the newline character,
*************/
<p><b>What JavaScript Sees:</b></p>
<textarea id=test rows="3" cols="55"></textarea>
<p><b>What It Renders To In HTML:</b></p>
<div id="testing">www.WHAK.com</div>

Did you try the escape() function in JavaScript? See: JavaScript escape() Function

What server-side language are you using?
If you're using PHP, you could use htmlentities.
Example:
<?php
$myHTML = "<h1>Some HTML Tags</h1><br />";
echo htmlentities($myHTML);
?>

Related

How to feed strange characters to javaScript? [duplicate]

I need to insert an Omega (Ω) onto my HTML page. I am using its HTML escaped code to do that, so I can write &#937; and get Ω. That's all fine and well when I put it into an HTML element; however, when I try to put it into my JS, e.g. var Omega = &#937;, it parses that code as JS and the whole thing doesn't work. Anyone know how to go about this?
I'm guessing that you actually want Omega to be a string containing an uppercase omega? In that case, you can write:
var Omega = '\u03A9';
(Because Ω is the Unicode character with codepoint U+03A9; that is, 03A9 is 937, except written as four hexadecimal digits.)
Edited to add (in 2022): There now exists an alternative form that better supports codepoints above U+FFFF:
let Omega = '\u{03A9}';
let desertIslandEmoji = '\u{1F3DD}';
Judging from https://caniuse.com/mdn-javascript_builtins_string_unicode_code_point_escapes, most or all browsers added support for it in 2015, so it should be reasonably safe to use.
Although @ruakh gave a good answer, I will add some alternatives for completeness:
You could in fact use even var Omega = '&#937;' in JavaScript, but only if your JavaScript code is:
inside an event attribute, as in onclick="var Omega = '&#937;'; alert(Omega)", or
in a script element inside an XHTML (or XHTML + XML) document served with an XML content type.
In these cases, the code will first (before getting passed to the JavaScript interpreter) be parsed by an HTML parser, so that character references like &#937; are recognized. The restrictions make this an impractical approach in most cases.
You can also enter the Ω character as such, as in var Omega = 'Ω', but then the character encoding must allow that, the encoding must be properly declared, and you need software that lets you enter such characters. This is a clean solution and quite feasible if you use UTF-8 encoding for everything and are prepared to deal with the issues it creates. Source code will be readable, and reading it, you immediately see the character itself, instead of code notations. On the other hand, it may cause surprises if other people start working with your code.
Using the \u notation, as in var Omega = '\u03A9', works independently of character encoding, and it is in practice almost universal. It can however be as such used only up to U+FFFF, i.e. up to \uffff, but most characters that most people ever heard of fall into that area. (If you need “higher” characters, you need to use either surrogate pairs or one of the two approaches above.)
You can also construct a character using the String.fromCharCode() method, passing as a parameter the Unicode number, in decimal as in var Omega = String.fromCharCode(937) or in hexadecimal as in var Omega = String.fromCharCode(0x3A9). This works up to U+FFFF. This approach can be used even when you have the Unicode number in a variable.
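For example (an illustrative sketch; the variable name is made up):
var codePoint = 937;                          // decimal value of U+03A9
var Omega = String.fromCharCode(codePoint);   // "Ω" -- works only up to U+FFFF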
One option is to put the character literally in your script, e.g.:
const omega = 'Ω';
This requires that you let the browser know the correct source encoding, see Unicode in JavaScript
However, if you can't or don't want to do this (e.g. because the character is too exotic and can't be expected to be available in the code editor font), the safest option may be to use new-style string escape or String.fromCodePoint:
const omega = '\u{3a9}';
// or:
const omega = String.fromCodePoint(0x3a9);
This is not restricted to UTF-16 but works for all unicode code points. In comparison, the other approaches mentioned here have the following downsides:
HTML escapes (const omega = '&#937';): only work when rendered unescaped in an HTML element
old style string escapes (const omega = '\u03A9';): restricted to UTF-16
String.fromCharCode: restricted to UTF-16
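To illustrate the last two limitations with a code point above U+FFFF (a quick check, not from the original answers):
String.fromCharCode(0x1F3DD);    // wrong: the value is truncated to 16 bits, giving "\uF3DD"
String.fromCodePoint(0x1F3DD);   // correct: "🏝" (DESERT ISLAND)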
The answer is correct, but you don't need to declare a variable.
A string can contain your character:
"This string contains omega, that looks like this: \u03A9"
Unfortunately, these ASCII escape codes are still needed for producing UTF-8 text, but I am still waiting (after too many years...) for the day when UTF-8 will be as universal as ASCII once was, and ASCII will be just a remembrance of the past.
I found this question when trying to implement a font-awesome style icon system in html. I have an API that provides me with a hex string and I need to convert it to unicode to match with the font-family.
Say I have the string const code = 'f004'; from my API. I can't do simple string concatenation (const unicode = '\u' + code;), since the escape sequence has to be recognized as Unicode at parse time, and this will in fact cause a syntax error if you try.
@coldfix mentioned using String.fromCodePoint, but it takes a number as an argument, not a string.
To finally cross the finish line, just add parseInt and pass 16 (since hex is base 16) as its second parameter. You'll finally get a Unicode string from a simple hex string.
This is what I did:
const code = 'f004';
const toUnicode = code => String.fromCodePoint(parseInt(code, 16));
toUnicode(code);
// => '\uf004'
Try using Function(), like this:
var code = "2710"
var char = Function("return '\\u"+code+"';")()
It works well; just do not add any ' or " characters or spaces.
In the example, char is "✐".

Match interconnected arabic characters

I need to match interconnected Arabic characters to do expansion like this:
بسم الله الرحمن الرحيم
becomes
بـسـم الـلـه الـرحـمـن الـرحـيـم
is there a way to do that using regular expressions?
How about something like this:
"بسم الله الرحمن الرحيم".replace(/(ب|ت|ث|ج|ح|خ|س|ش|ص|ض|ط|ظ|ع|غ|ف|ق|ك|ل|م|ن|ه|ي)(?=\S)/g, "$1ـ");
returns:
"بـسـم الـلـه الـرحـمـن الـرحـيـم"
Clarification:
We're matching letters that can be connected to the character that follows them, by using an OR group of all those characters; then we make sure the match is not followed by whitespace (i.e. it is not at the end of a word). Then we replace the first matched group (the letter) with itself ($1) followed by an expansion character (the tatweel, ـ).
I had a project once in which I had to choose the correct unicode codes to render depending on the position of the letters; so that they appear connected (or disconnected) as appropriate, because I was using a system non-compliant with Unicode.
The Unicode values for the disconnected Meem (م) and the connected one are different. BUT:
Unfortunately for your case, and most fortunately for many other cases, it is part of the Unicode specification that how letters are displayed is separate from their actual Unicode value. This is why you might have the Unicode value for a disconnected Meem, but see it displayed as connected! The specification also states that comparing a connected Meem with a disconnected one yields semantic equivalence. This makes things a lot easier!
What I ended up doing is to create a static data structure (use hard coded dictionaries or arrays) or XML or whatever. This data structure would tell us when each Arabic letter is connected or not (to both after and before).
For example:
//list of chars that can connect before and after
var canConnectBeforeAfter = new List<char>() { 'ع', 'ت', 'ب', 'ي' /*and so on*/ };
//list of chars that can connect only to the character before them (if that character can connect to the one after it! watch out for وو)
var cannotConnectAfter = new List<char>() { 'ر', 'و' };
var cannotConnect = new List<char>() { 'ء' };
You will need to add the right characters for the right lists. I hope you don't have to deal with Harakat!!!!
سلام, let me know if you need clarification
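A rough JavaScript sketch of that lookup idea (the character table below is illustrative and very incomplete; fill in the full alphabet for real use):
// 1 = the letter can connect to the letter after it, 0 = it cannot
var connectsForward = { 'ب': 1, 'ت': 1, 'ث': 1, 'ح': 1, 'س': 1, 'ل': 1, 'م': 1, 'ن': 1, 'ه': 1, 'ي': 1, 'ا': 0, 'ر': 0, 'و': 0, 'ء': 0 };
function addTatweel(text) {
    // Insert a tatweel (ـ) after any letter that connects forward and is not at the end of a word
    return text.replace(/\S(?=\S)/g, function (ch) {
        return connectsForward[ch] ? ch + 'ـ' : ch;
    });
}
// addTatweel("بسم الله") => "بـسـم الـلـه"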

jquery .text() and unicode

I'd like to display the "Open Lock" character in my HTML link text.
If I do it directly it shows up correctly with <a id="myId">&#128274;</a>, but I found no way to change it dynamically with the jQuery .text() function, like in:
$("#myID").text(openLockText);
What should I put in openLockText?
Javascript internally only supports UTF-16.
Because this is an extended 32-bit UTF character (not in the "Basic Multilingual Plane") you need to insert the "UTF-16 surrogate pair", which is helpfully provided on the same page that you linked to:
0xD83D 0xDD13
i.e.
$('#myId').text('\ud83d\udd13');
More details can be found in RFC 4627, which is strictly speaking the format for JSON.
edited — If it were a Unicode code point that could be represented in a single UTF-16 character, then you could use JavaScript escape sequences in such situations:
$('#foo').text('\uXXXX');
However, because your character requires more bits, that doesn't work. It would probably be possible to construct the byte sequence that'd allow the character to be represented as UTF-16, but it'd be a pain. I'd go with .html().
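For example, the .html() route would be something like this (a sketch; &#128275; is the decimal character reference for U+1F513, OPEN LOCK, and the selector is taken from the question):
$('#myId').html('&#128275;');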
Note that not all fonts provide glyphs for "exotic" code points like that, and in my experience those that do provide incredibly ugly ones.
You can put the character there directly, as a quoted string, e.g.
$("#myID").text('🔓');
provided that the file is UTF-8 encoded and you properly declare the character encoding. (In theory, you could alternatively use UTF-16 or even UTF-32, but browsers should not be expected to support them.)
Although support for non-BMP characters directly in source documents is optional according to the ECMAScript standard, modern browsers let you use them. Naturally, you need an editor that can handle UTF-8, and you need some input method(s); see e.g. my Full Unicode Input utility.
The question contains some mistakes that have gone unnoticed: Since id attribute values are case-sensitive, the spelling myId needs to be fixed to myID. And the OPEN LOCK character is U+1F513, not U+1F512, so the reference &#128274; would give a wrong character.
Moreover, very few fonts contain OPEN LOCK, and browsers, especially IE, may have difficulties in finding the glyph even if some font in the system contains it, so you should give browsers help and declare a list of fonts known to contain the character, in order of preference. Example:
<style>
#myID { font-family: Symbola, Quivira, Segoe UI Symbol; }
</style>
<a id="myID">stuff</a>
<script>
$("#myID").text('🔓');
</script>
A non-BMP character is internally represented as a surrogate pair, and it could be written using \u notation for the two components of the pair, but this is very unintuitive.
The following script should convert UTF-32 hex values to UTF-16 surrogate pairs:
function toUTF16Pair(hex) {
hex = hex.replace(/[&#x;]/g,'');
var x = parseInt(hex, 16);
if (x >= 0x10000 && x <= 0x10FFFF) {
var first = Math.floor((x - 0x10000) / 0x400) + 0xD800;
var second = ((x - 0x10000) % 0x400) + 0xDC00;
return {
first: first.toString(16).toUpperCase(),
second: second.toString(16).toUpperCase(),
combined: '\\u'+first.toString(16).toUpperCase() + '\\u'+second.toString(16).toUpperCase()
};
} else {
return {}
}
}
<input type='text' id='in' />
<input type='button' value='Click' onclick="document.getElementById('result').innerHTML = toUTF16Pair(document.getElementById('in').value).combined" />
<p id='result'></p>
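For instance (an illustrative call, not part of the original answer):
toUTF16Pair('&#x1F513;').combined;   // the 12-character string "\uD83D\uDD13", ready to paste into a string literal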

Why does Closure Compiler insist on adding more bytes?

If I give Closure Compiler something like this:
window.array = '0123456789'.split('');
It "compiles" it to this:
window.array="0,1,2,3,4,5,6,7,8,9".split(",");
Now as you can tell, that's bigger. Is there any reason why Closure Compiler is doing this?
I think this is what's going on, but I am by no means certain...
The code that causes the insertion of commas is tryMinimizeStringArrayLiteral in PeepholeSubstituteAlternateSyntax.java.
That method contains a list of characters that are likely to have a low Huffman encoding, and are therefore preferable to split on than other characters. You can see the result of this if you try something like this:
"a b c d e f g".split(" "); //Uncompiled, split on spaces
"a,b,c,d,e,f,g".split(","); //Compiled, split on commas (same size)
The compiler will replace the character you try to split on with one it thinks is favourable. It does so by iterating over the characters of the string and finding the most favourable splitting character that does not occur within the string:
// These delimiters are chars that appears a lot in the program therefore
// probably have a small Huffman encoding.
NEXT_DELIMITER: for (char delimiter : new char[]{',', ' ', ';', '{', '}'}) {
for (String cur : strings) {
if (cur.indexOf(delimiter) != -1) {
continue NEXT_DELIMITER;
}
}
String template = Joiner.on(delimiter).join(strings);
//...
}
In the above snippet you can see the array of characters the compiler claims to be optimal to split on. The comma is first (which is why in my space example above, the spaces have been replaced by commas).
I believe the insertion of commas in the case where the string to split on is the empty string may simply be an oversight. There does not appear to be any special treatment of this case, so it's treated like any other split call and each character is joined with the first appropriate character from the array shown in the above snippet.
Another example of how the compiler deals with the split method:
"a,;b;c;d;e;f;g".split(";"); //Uncompiled, split on semi-colons
"a, b c d e f g".split(" "); //Compiled, split on spaces
This time, since the original string already contains a comma (and we don't want to split on the comma character), the comma can't be chosen from the array of low-Huffman-encoded characters, so the next best choice is selected (the space).
Update
Following some further research into this, it is definitely not a bug. This behaviour is actually by design, and in my opinion it's a very clever little optimisation, when you bear in mind that the Closure compiler tends to favour the speed of the compiled code over size.
Above I mentioned Huffman encoding a couple of times. The Huffman coding algorithm, explained very simply, assigns a weight to each character appearing in the text to be encoded. The weight is based on the frequency with which each character appears. These frequencies are used to build a binary tree, with the most common character at the root. That means the most common characters are quicker to decode, since they are closer to the root of the tree.
And the Huffman algorithm is a large part of the DEFLATE algorithm used by gzip, so if your web server is configured to use gzip, your users will be benefiting from this clever optimisation.
This issue was fixed on Apr 20, 2012 see revision:
https://code.google.com/p/closure-compiler/source/detail?r=1267364f742588a835d78808d0eef8c9f8ba8161
Ironically, split in the compiled code has nothing to do with split in the source. Consider:
Source : a = ["0","1","2","3","4","5"]
Compiled: a="0,1,2,3,4,5".split(",")
Here, split is just a way to represent long arrays (long enough that the sum of all the quotes and commas is longer than the extra split(",") call). So, what's going on in your example? First, the compiler sees a string function applied to a constant and evaluates it right away:
'0123456789'.split('') => ["0","1","2","3","4","5","6","7","8","9"]
At some later point, when generating output, the compiler considers this array to be "long" and writes it in the above "split" form:
["0","1","2","3","4","5","6","7","8","9"] => "0,1,2,3,4,5,6,7,8,9".split(",")
Note that all information about split('') in the source is already lost at this point.
If the source string were shorter, it would be generated in the array form, without extra splitting:
Source : a = '0123'.split('')
Compiled: a=["0","1","2","3"]
