jquery .text() and unicode - javascript

I'd like to display the "Open Lock" character in my HTML link text.
If I do it directly it shows up correctly with <a id="myId">🔒</a>, but I found no way to change it dinamically with the jQuery .text() function, like in:
$("#myID").text(openLockText);
What should I put in openLockText?

Javascript internally only supports UTF-16.
Because this is an extended 32-bit UTF character (not in the "Basic Multilingual Plane") you need to insert the "UTF-16 surrogate pair", which is helpfully provided on the same page that you linked to:
0xD83D 0xDD13
i.e.
$('#myId').text('\ud83d\udd13');
More details can be found in RFC 4627, which is strictly speaking the format for JSON.

edited — If it were a Unicode code point that could be represented in a single UTF-16 character, then ou could use JavaScript escape sequences in such situations:
$('#foo').text('\uXXXX');
However, because your character requires more bits, that doesn't work. It would probably be possible to construct the byte sequence that'd allow the character to be represented as UTF-16, but it'd be a pain. I'd go with .html().
Note that not all fonts provide glyphs for "exotic" code points like that, and in my experience those that do provide incredibly ugly ones.

You can put the character there directly, as a quoted string, e.g.
$("#myID").text('🔓');
provided that the file is UTF-8 encoded and you properly declare the character encoding. (In theory, you could alternatively use UTF-16 or even UTF-32, but browsers should not be expected to support them.)
Although support to non-BMP characters directly in source documents is optional according to the ECMAScript standard, modern browsers let you use them. Naturally, you need an editor that can handle UTF-8, and you need some input method(s); see e.g. my Full Unicode Input utility.
The question contains some mistakes that have gone unnoticed: Since id attribute values are case-sensitive, the spelling myId needs to be fixed to myID. And the OPEN LOCK character is U+1F513, not U+1F512, so the reference 🔒 would give a wrong character.
Moreover, very few fonts contain OPEN LOCK, and browsers, especially IE, may have difficulties in finding the glyph even if some font in the system contains it, so you should give browsers help and declare a list of fonts known to contain the character, in order of preference. Example:
<style>
#myID { font-family: Symbola, Quivira, Segoe UI Symbol; }
</style>
<a id="myID">stuff</a>
<script>
$("#myID").text('🔓');
</script>
A non-BMP character is internally represented as a surrogate pair, and it could be written using \u notations for the components of the pair, but this is very unintuitive

Following script should convert UTF32 hex values to UTF16 pairs
function toUTF16Pair(hex) {
hex = hex.replace(/[&#x;]/g,'');
var x = parseInt(hex, 16);
if (x >= 0x10000 && x <= 0x10FFFF) {
var first = Math.floor((x - 0x10000) / 0x400) + 0xD800;
var second = ((x - 0x10000) % 0x400) + 0xDC00;
return {
first: first.toString(16).toUpperCase(),
second: second.toString(16).toUpperCase(),
combined: '\\u'+first.toString(16).toUpperCase() + '\\u'+second.toString(16).toUpperCase()
};
} else {
return {}
}
}
<input type='text' id='in' />
<input type='button' value='Click' onclick="document.getElementById('result').innerHTML = toUTF16Pair(document.getElementById('in').value).combined" />
<p id='result'></p>

Related

How to feed strange characters to javaScript? [duplicate]

I need to insert an Omega (Ω) onto my html page. I am using its HTML escaped code to do that, so I can write Ω and get Ω. That's all fine and well when I put it into a HTML element; however, when I try to put it into my JS, e.g. var Omega = Ω, it parses that code as JS and the whole thing doesn't work. Anyone know how to go about this?
I'm guessing that you actually want Omega to be a string containing an uppercase omega? In that case, you can write:
var Omega = '\u03A9';
(Because Ω is the Unicode character with codepoint U+03A9; that is, 03A9 is 937, except written as four hexadecimal digits.)
Edited to add (in 2022): There now exists an alternative form that better supports codepoints above U+FFFF:
let Omega = '\u{03A9}';
let desertIslandEmoji = '\u{1F3DD}';
Judging from https://caniuse.com/mdn-javascript_builtins_string_unicode_code_point_escapes, most or all browsers added support for it in 2015, so it should be reasonably safe to use.
Although #ruakh gave a good answer, I will add some alternatives for completeness:
You could in fact use even var Omega = 'Ω' in JavaScript, but only if your JavaScript code is:
inside an event attribute, as in onclick="var Omega = '&#937';
alert(Omega)" or
in a script element inside an XHTML (or XHTML + XML) document
served with an XML content type.
In these cases, the code will be first (before getting passed to the JavaScript interpreter) be parsed by an HTML parser so that character references like Ω are recognized. The restrictions make this an impractical approach in most cases.
You can also enter the Ω character as such, as in var Omega = 'Ω', but then the character encoding must allow that, the encoding must be properly declared, and you need software that let you enter such characters. This is a clean solution and quite feasible if you use UTF-8 encoding for everything and are prepared to deal with the issues created by it. Source code will be readable, and reading it, you immediately see the character itself, instead of code notations. On the other hand, it may cause surprises if other people start working with your code.
Using the \u notation, as in var Omega = '\u03A9', works independently of character encoding, and it is in practice almost universal. It can however be as such used only up to U+FFFF, i.e. up to \uffff, but most characters that most people ever heard of fall into that area. (If you need “higher” characters, you need to use either surrogate pairs or one of the two approaches above.)
You can also construct a character using the String.fromCharCode() method, passing as a parameter the Unicode number, in decimal as in var Omega = String.fromCharCode(937) or in hexadecimal as in var Omega = String.fromCharCode(0x3A9). This works up to U+FFFF. This approach can be used even when you have the Unicode number in a variable.
One option is to put the character literally in your script, e.g.:
const omega = 'Ω';
This requires that you let the browser know the correct source encoding, see Unicode in JavaScript
However, if you can't or don't want to do this (e.g. because the character is too exotic and can't be expected to be available in the code editor font), the safest option may be to use new-style string escape or String.fromCodePoint:
const omega = '\u{3a9}';
// or:
const omega = String.fromCodePoint(0x3a9);
This is not restricted to UTF-16 but works for all unicode code points. In comparison, the other approaches mentioned here have the following downsides:
HTML escapes (const omega = '&#937';): only work when rendered unescaped in an HTML element
old style string escapes (const omega = '\u03A9';): restricted to UTF-16
String.fromCharCode: restricted to UTF-16
The answer is correct, but you don't need to declare a variable.
A string can contain your character:
"This string contains omega, that looks like this: \u03A9"
Unfortunately still those codes in ASCII are needed for displaying UTF-8, but I am still waiting (since too many years...) the day when UTF-8 will be same as ASCII was, and ASCII will be just a remembrance of the past.
I found this question when trying to implement a font-awesome style icon system in html. I have an API that provides me with a hex string and I need to convert it to unicode to match with the font-family.
Say I have the string const code = 'f004'; from my API. I can't do simple string concatenation (const unicode = '\u' + code;) since the system needs to recognize that it's unicode and this will in fact cause a syntax error if you try.
#coldfix mentioned using String.fromCodePoint but it takes a number as an argument, not a string.
To finally cross the finish line, just add parseInt and pass 16 (since hex is base 16) to it's second parameter. You'll finally get a unicode string from a simple hex string.
This is what I did:
const code = 'f004';
const toUnicode = code => String.fromCodePoint(parseInt(code, 16));
toUnicode(code);
// => '\uf004'
Try using Function(), like this:
var code = "2710"
var char = Function("return '\\u"+code+"';")()
It works well, just do not add any 's or "s or spaces.
In the example, char is "✐".

What is the best way to serialize a JavaScript object into something that can be used as a fragment identifier (url#hash)?

My page state can be described by a JavaScript object that can be serialized into JSON. But I don't think a JSON string is suitable for use in a fragment ID due to, for example, the spaces and double-quotes.
Would encoding the JSON string into a base64 string be sensible, or is there a better way? My goal is to allow the user to bookmark the page and then upon returning to that bookmark, have a piece of JavaScript read window.location.hash and change state accordingly.
I think you are on a good way. Let's write down the requirements:
The encoded string must be usable as hash, i.e. only letters and numbers.
The original value must be possible to restore, i.e. hashing (md5, sha1) is not an option.
It shouldn't be too long, to remain usable.
There should be an implementation in JavaScript, so it can be generated in the browser.
Base64 would be a great solution for that. Only problem: base64 also contains characters like - and +, so you win nothing compared to simply attaching a JSON string (which also would have to be URL encoded).
BUT: Luckily, theres a variant of base64 called base64url which is exactly what you need. It is specifically designed for the type of problem you're describing.
However, I was not able to find a JS implementation; maybe you have to write one youself – or do a bit more research than my half-assed 15 seconds scanning the first 5 Google results.
EDIT: On a second thought, I think you don't need to write an own implementation. Use a normal implementation, and simply replace the “forbidden” characters with something you find appropriate for your URLs.
Base64 is an excellent way to store binary data in text. It uses just 33% more characters/bytes than the original data and mostly uses 0-9, a-z, and A-Z. It also has three other characters that would need encoded to be stored in the URL, which are /, =, and +. If you simply used URL encoding, it would take up 300% (3x) the size.
If you're only storing the characters in the fragment of the URL, base64-encoded text it doesn't need to be re-encoded and will not change. But if you want to send the data as part of the actual URL to visit, then it matters.
As referenced by lxg, there there is a base64url variant for that. This is a modified version of base64 to replace unsafe characters to store in the URL. Here is how to encode it:
function tobase64url(s) {
return btoa(x).replace(/\+/g,'-').replace(/\//g,'_').replace(/=/g,'');
}
console.log(tobase64url('\x00\xff\xff\xf1\xf1\xf1\xff\xff\xfe'));
// Returns "AP__8fHx___-" instead of "AP//8fHx///+"
And to decode a base64 string from the URL:
function frombase64url(s) {
return atob(x.replace(/-/g,'+').replace(/_/g, '/'));
}
Use encodeURIComponent and decodeURIComponent to serialize data for the fragment (aka hash) part of the URL.
This is safe because the character set output by encodeURIComponent is a subset of the character set allowed in the fragment. Specifically, encodeURIComponent escapes all characters except:
A - Z
a - z
0 - 9
- . _ ~ ! ' ( ) *
So the output includes the above characters, plus escaped characters, which are % followed by hexadecimal digits.
The set of allowed characters in the fragment is:
A - Z
a - z
0 - 9
? / : # - . _ ~ ! $ & ' ( ) * + , ; =
percent-encoded characters (a % followed by hexadecimal digits)
This set of allowed characters includes all the characters output by encodeURIComponent, plus a few other characters.

Insert Unicode character into JavaScript

I need to insert an Omega (Ω) onto my html page. I am using its HTML escaped code to do that, so I can write Ω and get Ω. That's all fine and well when I put it into a HTML element; however, when I try to put it into my JS, e.g. var Omega = Ω, it parses that code as JS and the whole thing doesn't work. Anyone know how to go about this?
I'm guessing that you actually want Omega to be a string containing an uppercase omega? In that case, you can write:
var Omega = '\u03A9';
(Because Ω is the Unicode character with codepoint U+03A9; that is, 03A9 is 937, except written as four hexadecimal digits.)
Edited to add (in 2022): There now exists an alternative form that better supports codepoints above U+FFFF:
let Omega = '\u{03A9}';
let desertIslandEmoji = '\u{1F3DD}';
Judging from https://caniuse.com/mdn-javascript_builtins_string_unicode_code_point_escapes, most or all browsers added support for it in 2015, so it should be reasonably safe to use.
Although #ruakh gave a good answer, I will add some alternatives for completeness:
You could in fact use even var Omega = 'Ω' in JavaScript, but only if your JavaScript code is:
inside an event attribute, as in onclick="var Omega = '&#937';
alert(Omega)" or
in a script element inside an XHTML (or XHTML + XML) document
served with an XML content type.
In these cases, the code will be first (before getting passed to the JavaScript interpreter) be parsed by an HTML parser so that character references like Ω are recognized. The restrictions make this an impractical approach in most cases.
You can also enter the Ω character as such, as in var Omega = 'Ω', but then the character encoding must allow that, the encoding must be properly declared, and you need software that let you enter such characters. This is a clean solution and quite feasible if you use UTF-8 encoding for everything and are prepared to deal with the issues created by it. Source code will be readable, and reading it, you immediately see the character itself, instead of code notations. On the other hand, it may cause surprises if other people start working with your code.
Using the \u notation, as in var Omega = '\u03A9', works independently of character encoding, and it is in practice almost universal. It can however be as such used only up to U+FFFF, i.e. up to \uffff, but most characters that most people ever heard of fall into that area. (If you need “higher” characters, you need to use either surrogate pairs or one of the two approaches above.)
You can also construct a character using the String.fromCharCode() method, passing as a parameter the Unicode number, in decimal as in var Omega = String.fromCharCode(937) or in hexadecimal as in var Omega = String.fromCharCode(0x3A9). This works up to U+FFFF. This approach can be used even when you have the Unicode number in a variable.
One option is to put the character literally in your script, e.g.:
const omega = 'Ω';
This requires that you let the browser know the correct source encoding, see Unicode in JavaScript
However, if you can't or don't want to do this (e.g. because the character is too exotic and can't be expected to be available in the code editor font), the safest option may be to use new-style string escape or String.fromCodePoint:
const omega = '\u{3a9}';
// or:
const omega = String.fromCodePoint(0x3a9);
This is not restricted to UTF-16 but works for all unicode code points. In comparison, the other approaches mentioned here have the following downsides:
HTML escapes (const omega = '&#937';): only work when rendered unescaped in an HTML element
old style string escapes (const omega = '\u03A9';): restricted to UTF-16
String.fromCharCode: restricted to UTF-16
The answer is correct, but you don't need to declare a variable.
A string can contain your character:
"This string contains omega, that looks like this: \u03A9"
Unfortunately still those codes in ASCII are needed for displaying UTF-8, but I am still waiting (since too many years...) the day when UTF-8 will be same as ASCII was, and ASCII will be just a remembrance of the past.
I found this question when trying to implement a font-awesome style icon system in html. I have an API that provides me with a hex string and I need to convert it to unicode to match with the font-family.
Say I have the string const code = 'f004'; from my API. I can't do simple string concatenation (const unicode = '\u' + code;) since the system needs to recognize that it's unicode and this will in fact cause a syntax error if you try.
#coldfix mentioned using String.fromCodePoint but it takes a number as an argument, not a string.
To finally cross the finish line, just add parseInt and pass 16 (since hex is base 16) to it's second parameter. You'll finally get a unicode string from a simple hex string.
This is what I did:
const code = 'f004';
const toUnicode = code => String.fromCodePoint(parseInt(code, 16));
toUnicode(code);
// => '\uf004'
Try using Function(), like this:
var code = "2710"
var char = Function("return '\\u"+code+"';")()
It works well, just do not add any 's or "s or spaces.
In the example, char is "✐".

Remove a long dash from a string in JavaScript?

I've come across an error in my web app that I'm not sure how to fix.
Text boxes are sending me the long dash as part of their content (you know, the special long dash that MS Word automatically inserts sometimes). However, I can't find a way to replace it; since if I try to copy that character and put it into a JavaScript str.replace statement, it doesn't render right and it breaks the script.
How can I fix this?
The specific character that's killing it is —.
Also, if it helps, I'm passing the value as a GET parameter, and then encoding it in XML and sending it to a server.
This code might help:
text = text.replace(/\u2013|\u2014/g, "-");
It replaces all – (–) and — (—) symbols with simple dashes (-).
DEMO: http://jsfiddle.net/F953H/
That character is call an Em Dash. You can replace it like so:
str.replace('\u2014', '');​​​​​​​​​​
Here is an example Fiddle: http://jsfiddle.net/x67Ph/
The \u2014 is called a unicode escape sequence. These allow to to specify a unicode character by its code. 2014 happens to be the Em Dash.
There are three unicode long-ish dashes you need to worry about: http://en.wikipedia.org/wiki/Dash
You can replace unicode characters directly by using the unicode escape:
'—my string'.replace( /[\u2012\u2013\u2014\u2015]/g, '' )
There may be more characters behaving like this, and you may want to reuse them in html later. A more generic way to to deal with it could be to replace all 'extended characters' with their html encoded equivalent. You could do that Like this:
[yourstring].replace(/[\u0080-\uC350]/g,
function(a) {
return '&#'+a.charCodeAt(0)+';';
}
);
With the ECMAScript 2018 standard, JavaScript RegExp now supports Unicode property (or, category) classes. One of them, \p{Dash}, matches any Unicode character points that are dashes:
/\p{Dash}/gu
In ES5, the equivalent expression is:
/[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD/g
See the Unicode Utilities reference.
Here are some JavaScript examples:
const text = "Dashes: \uFF0D\uFE63\u058A\u1400\u1806\u2010-\u2013\uFE32\u2014\uFE58\uFE31\u2015\u2E3A\u2E3B\u2053\u2E17\u2E40\u2E5D\u301C\u30A0\u2E1A\u05BE\u2212\u207B\u208B\u3030𐺭";
const es5_dash_regex = /[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD/g;
console.log(text.replace(es5_dash_regex, '-')); // Normalize each dash to ASCII hyphen
// => Dashes: ----------------------------
To match one or more dashes and replace with a single char (or remove in one go):
/\p{Dash}+/gu
/(?:[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD)+/g

How to escape HTML

I have a string which contains HTML text. I need to escape just the strings and not tags.
For example, I have string which contains,
<ul class="main_nav">
<li>
<a class="className1" id="idValue1" tabindex="2">Test & Sample</a>
</li>
<li>
<a class="className2" id="idValue2" tabindex="2">Test & Sample2</a>
</li>
</ul>
How to escape just the text to,
<ul class="main_nav">
<li>
<a class="className1" id="idValue1" tabindex="2">Test & Sample</a>
</li>
<li>
<a class="className2" id="idValue2" tabindex="2">Test & Sample2</a>
</li>
</ul>
with out modifying the tags.
Can this be handled with HTML DOM and javascript?
I'm very surprised no one answered this. You can just use the browser it self to do the escaping for you. No regex is better or safer than letting the browser do what it does best, handle HTML.
function escapeHTML(str){
var p = document.createElement("p");
p.appendChild(document.createTextNode(str));
return p.innerHTML;
}
or a short alternative using the Option() constructor
function escapeHTML(str){
return new Option(str).innerHTML;
}
(See further down for an answer to the question as updated by comments from the OP below)
Can this be handled with HTML DOM and javascript?
No, once the text is in the DOM, the concept of "escaping" it doesn't apply. The HTML source text needs to be escaped so that it's parsed into the DOM correctly; once it's in the DOM, it isn't escaped.
This can be a bit tricky to understand, so let's use an example. Here's some HTML source text (such as in an HTML file that you would view with your browser):
<div>This & That</div>
Once that's parsed into the DOM by the browser, the text within the div is This & That, because the & has been interpreted at that point.
So you'll need to catch this earlier, before the text is parsed into the DOM by the browser. You can't handle it after the fact, it's too late.
Separately, the string you're starting with is invalid if it has things like <div>This & That</div> in it. Pre-processing that invalid string will be tricky. You can't just use built-in features of your environment (PHP or whatever you're using server-side) because they'll escape the tags as well. You'll need to do text processing, extracting only the parts that you want to process and then running those through an escaping process. That process will be tricky. An & followed by whitespace is easy enough, but if there are unescaped entities in the source text, how do you know whether to escape them or not? Do you assume that if the string contains &, you leave it alone? Or turn it into &amp;? (Which is perfectly valid; it's how you show the actual string & in an HTML page.)
What you really need to do is correct the underlying problem: The thing creating these invalid, half-encoded strings.
Edit: From our comment stream below, the question is totally different than it seemed from your example (that's not meant critically). To recap the comments for those coming to this fresh, you said that you were getting these strings from WebKit's innerHTML, and I said that was odd, innerHTML should encode & correctly (and pointed you at a couple of test pages that suggested it did). Your reply was:
This works for &. But the same test page do not work for entities like ©, ®, « and many more.
That changes the nature of the question. You want to make entities out of characters that, while perfectly valid when used literally (provided you have your text encoding right), could be expressed as entities instead and therefore made more resilient to text encoding changes.
We can do that. According to the spec, the character values in a JavaScript string are UTF-16 (using Unicode Normalized Form C) and any conversion from the source character encoding (ISO 8859-1, Windows-1252, UTF-8, whatever) is performed before the JavaScript runtime sees it. (If you're not 100% sure you know what I mean by character encoding, it's well worth stopping now, going off and reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky, then coming back.) So that's the input side. On the output side, HTML entities identify Unicode code points. So we can convert from JavaScript strings to HTML entities reliably.
The devil is in the detail, though, as always. JavaScript explicitly assumes that each 16-bit value is a character (see section 8.4 in the spec), even though that's not actually true of UTF-16 — one 16-bit value might be a "surrogate" (such as 0xD800) that only makes sense when combined with the next value, meaning that two "characters" in the JavaScript string are actually one character. This isn't uncommon for far Eastern languages.
So a robust conversion that starts with a JavaScript string and results in an HTML entity can't assume that a JavaScript "character" actually equals a character in the text, it has to handle surrogates. Fortunately, doing so is dead easy because the smart people defining Unicode made it dead easy: The first surrogate value is always in the range 0xD800-0xDBFF (inclusive), and the second surrogate is always in the range 0xDC00-0xDFFF (inclusive). So any time you see a pair of "characters" in a JavaScript string that match those ranges, you're dealing with a single character defined by a surrogate pair. The formulae for converting from the pair of surrogate values to a code point value are given in the above links, although fairly obtusely; I find this page much more approachable.
Armed with all of this information, we can write a function that will take a JavaScript string and search for characters (real characters, which may be one or two "characters" long) you might want to turn into entities, replacing them with named entities from a map or numeric entities if we don't have them in our named map:
// A map of the entities we want to handle.
// The numbers on the left are the Unicode code point values; their
// matching named entity strings are on the right.
var entityMap = {
"160": " ",
"161": "¡",
"162": "&#cent;",
"163": "&#pound;",
"164": "&#curren;",
"165": "&#yen;",
"166": "&#brvbar;",
"167": "&#sect;",
"168": "&#uml;",
"169": "©",
// ...and lots and lots more, see http://www.w3.org/TR/REC-html40/sgml/entities.html
"8364": "€" // Last one must not have a comma after it, IE doesn't like trailing commas
};
// The function to do the work.
// Accepts a string, returns a string with replacements made.
function prepEntities(str) {
// The regular expression below uses an alternation to look for a surrogate pair _or_
// a single character that we might want to make an entity out of. The first part of the
// alternation (the [\uD800-\uDBFF][\uDC00-\uDFFF] before the |), you want to leave
// alone, it searches for the surrogates. The second part of the alternation you can
// adjust as you see fit, depending on how conservative you want to be. The example
// below uses [\u0000-\u001f\u0080-\uFFFF], meaning that it will match and convert any
// character with a value from 0 to 31 ("control characters") or above 127 -- e.g., if
// it's not "printable ASCII" (in the old parlance), convert it. That's probably
// overkill, but you said you wanted to make entities out of things, so... :-)
return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\u001f\u0080-\uFFFF]/g, function(match) {
var high, low, charValue, rep
// Get the character value, handling surrogate pairs
if (match.length == 2) {
// It's a surrogate pair, calculate the Unicode code point
high = match.charCodeAt(0) - 0xD800;
low = match.charCodeAt(1) - 0xDC00;
charValue = (high * 0x400) + low + 0x10000;
}
else {
// Not a surrogate pair, the value *is* the Unicode code point
charValue = match.charCodeAt(0);
}
// See if we have a mapping for it
rep = entityMap[charValue];
if (!rep) {
// No, use a numeric entity. Here we brazenly (and possibly mistakenly)
rep = "&#" + charValue + ";";
}
// Return replacement
return rep;
});
}
You should be fine passing all of the HTML through it, since if these characters appear in attribute values, you almost certainly want to encode them there as well.
I have not used the above in production (I actually wrote it for this answer, because the problem intrigued me) and it is totally supplied without warrantee of any kind. I have tried to ensure that it handles surrogate pairs because that's necessary for far Eastern languages, and supporting them is something we should all be doing now that the world has gotten smaller.
Complete example page:
<!DOCTYPE HTML>
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
<title>Test Page</title>
<style type='text/css'>
body {
font-family: sans-serif;
}
#log p {
margin: 0;
padding: 0;
}
</style>
<script type='text/javascript'>
// Make the function available as a global, but define it within a scoping
// function so we can have data (the `entityMap`) that only it has access to
var prepEntities = (function() {
// A map of the entities we want to handle.
// The numbers on the left are the Unicode code point values; their
// matching named entity strings are on the right.
var entityMap = {
"160": " ",
"161": "¡",
"162": "&#cent;",
"163": "&#pound;",
"164": "&#curren;",
"165": "&#yen;",
"166": "&#brvbar;",
"167": "&#sect;",
"168": "&#uml;",
"169": "©",
// ...and lots and lots more, see http://www.w3.org/TR/REC-html40/sgml/entities.html
"8364": "€" // Last one must not have a comma after it, IE doesn't like trailing commas
};
// The function to do the work.
// Accepts a string, returns a string with replacements made.
function prepEntities(str) {
// The regular expression below uses an alternation to look for a surrogate pair _or_
// a single character that we might want to make an entity out of. The first part of the
// alternation (the [\uD800-\uDBFF][\uDC00-\uDFFF] before the |), you want to leave
// alone, it searches for the surrogates. The second part of the alternation you can
// adjust as you see fit, depending on how conservative you want to be. The example
// below uses [\u0000-\u001f\u0080-\uFFFF], meaning that it will match and convert any
// character with a value from 0 to 31 ("control characters") or above 127 -- e.g., if
// it's not "printable ASCII" (in the old parlance), convert it. That's probably
// overkill, but you said you wanted to make entities out of things, so... :-)
return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\u001f\u0080-\uFFFF]/g, function(match) {
var high, low, charValue, rep
// Get the character value, handling surrogate pairs
if (match.length == 2) {
// It's a surrogate pair, calculate the Unicode code point
high = match.charCodeAt(0) - 0xD800;
low = match.charCodeAt(1) - 0xDC00;
charValue = (high * 0x400) + low + 0x10000;
}
else {
// Not a surrogate pair, the value *is* the Unicode code point
charValue = match.charCodeAt(0);
}
// See if we have a mapping for it
rep = entityMap[charValue];
if (!rep) {
// No, use a numeric entity. Here we brazenly (and possibly mistakenly)
rep = "&#" + charValue + ";";
}
// Return replacement
return rep;
});
}
// Return the function reference out of the scoping function to publish it
return prepEntities;
})();
function go() {
var d = document.getElementById('d1');
var s = d.innerHTML;
alert("Before: " + s);
s = prepEntities(s);
alert("After: " + s);
}
</script>
</head>
<body>
<div id='d1'>Copyright: © Yen: ¥ Cedilla: ¸ Surrogate pair: 𐀀</div>
<input type='button' id='btnGo' value='Go' onclick="return go();">
</body>
</html>
There I've included the cedilla as an example of converting to a numeric entity rather than a named one (since I left cedil out of my very small example map). And note that the surrogate pair at the end shows up in the first alert as two "characters" because of the way JavaScript handles UTF-16.
You can encode all characters in your string:
function encode(e){return e.replace(/[^]/g,function(e){return"&#"+e.charCodeAt(0)+";"})}
Or just target the main characters to worry about (&, inebreaks, <, >, " and ') like:
function encode(r){
return r.replace(/[\x26\x0A\<>'"]/g,function(r){return"&#"+r.charCodeAt(0)+";"})
}
var myString='Encode HTML entities!\n"Safe" escape <script></'+'script> & other tags!';
test.value=encode(myString);
testing.innerHTML=encode(myString);
/*************
* \x26 is &ampersand (it has to be first),
* \x0A is newline,
*************/
<p><b>What JavaScript Sees:</b></p>
<textarea id=test rows="3" cols="55"></textarea>
<p><b>What It Renders Too In HTML:</b></p>
<div id="testing">www.WHAK.com</div>
Did you tried escape() function in Javascript? JavaScript escape() Function
What server-side language are you using?
if you're using PHP you could use htmlentities
Example:
<?php
$myHTML = "<h1>Some HTML Tags</h1><br />";
echo htmlentities($myHTML);
?>

Categories

Resources