UTF-8 in HTML input added by JavaScript


I just don't get it.
My case is that my application sends all the needed GUI text as JSON from my PHP server at page startup. On my PHP server I have all special characters in the text written in UTF-8. Example: F&uuml;r
So on the client side I have exactly the same value, and it gets displayed nicely everywhere except on input fields. When I do this with JavaScript:
document.getElementById('myInputField').value = "F&Ouml;r";
Then it is written exactly like that without any transformation into the special character.
Did I understand something wrong in UTF-8 concepts?
Thanks for any hints.

The notation &uuml; has nothing particular to do with UTF-8. The use of character references is a common way of avoiding the need to use UTF-8; they can be used with any encoding, but if you use UTF-8, you don’t need them.
The notation &uuml; is an HTML notation, not JavaScript. Whether it gets interpreted by HTML rules when it appears inside your JavaScript code depends on the context (like JavaScript inside an HTML document vs. separate JavaScript file). This problem is best avoided by using either characters as such or by using JavaScript notations for characters.
For example, &uuml; means the same as &#xfc;, i.e. U+00FC, ü (u with diaeresis). The JavaScript notation, for use inside string literals, for this is \u00fc (\u followed by exactly four hexadecimal digits). E.g., the following sets the value to “Für”:
document.getElementById('myInputField').value = "F\u00fcr";
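As a small illustration of the point above, the following sketch compares a few JavaScript notations for U+00FC with the HTML character reference. The variable names are made up for this example, and the literal "Für" assumes the source file is saved and served as UTF-8.

```javascript
// Several JavaScript spellings of the same three-character string.
const literal = "Für";                        // the character as such
const unicodeEscape = "F\u00fcr";             // \u plus four hex digits
const hexEscape = "F\xfcr";                   // \x plus two hex digits
const codePointEscape = "F\u{fc}r";           // ES2015 code point escape
const viaFunction = "F" + String.fromCodePoint(0xfc) + "r";

// The HTML character reference is NOT decoded by JavaScript:
// inside a string literal it is just eight ordinary characters.
const entity = "F&uuml;r";

console.log(unicodeEscape === hexEscape);     // true
console.log(entity.length, unicodeEscape.length);
```

Any of the first five values can be assigned to the input's value property; the entity form is shown verbatim because the value property is not parsed as HTML.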

You're using what are called HTML entities to encode characters, which is not the same as UTF-8, although of course a UTF-8 string can include HTML entities.
I think the problem is that a value assigned through the DOM is not parsed as HTML, so HTML entities in it are not decoded, and you have to use some other encoding when assigning the text input's value. I think you have two options:
Decode the HTML entity on the client side. A quite ugly solution is to piggyback on the decoder available in the browser (I'm using jQuery in the example, but you probably get the point).
inputElement.value = $("<p/>").html("F&Ouml;r").text();
Another option, which I think is nicer, is to not send HTML entities in the server response but instead use proper UTF-8 encoding for all characters, which should work fine when put into text nodes or tag attributes. This assumes the HTML page uses UTF-8 encoding, of course.
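For the first option, here is a rough sketch of client-side decoding without jQuery or the DOM. The function decodeEntities and the NAMED_ENTITIES table are hypothetical names for this example, and the table is deliberately tiny; the browser's own decoder knows the full list of named references, so treat this as an illustration only.

```javascript
// Deliberately incomplete table of named references (assumption: the
// server only ever sends a handful of German umlauts and basic escapes).
const NAMED_ENTITIES = {
  amp: "&", lt: "<", gt: ">", quot: '"', apos: "'",
  auml: "\u00e4", ouml: "\u00f6", uuml: "\u00fc",
  Auml: "\u00c4", Ouml: "\u00d6", Uuml: "\u00dc", szlig: "\u00df",
};

// Decodes &name;, &#NNN; and &#xHH; in a single pass, so that
// "&amp;uuml;" becomes "&uuml;" and is not decoded a second time.
function decodeEntities(text) {
  const pattern = /&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);/g;
  return text.replace(pattern, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match; // unknown names are left as they are
  });
}

console.log(decodeEntities("F&uuml;r") === "F\u00fcr"); // true
```

The second option (sending real UTF-8 from the server) makes this step unnecessary, which is one more reason to prefer it.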

Related

JavaScript/NodeJS RTF CJK Conversions

I'm working on a node module that parses RTF files and does some find and replace. I have already come up with a solution for special characters expressed in escaped unicode here, but have run into a wall when it comes to CJK characters. Is there an easy way to do these conversions in JavaScript, either with a library or built in?
Example:
An RTF file viewed in plain text contains:
Now testing symbols {鈴:200638d}
When parsed in NodeJS, this part of the file looks like:
Now testing symbols \{
\f1 \'e2\'8f
\f0 :200638d\}\
I understand that \f1 and \f0 denote font changes, and the \'e2\'8f block is the actual character... but how can I take \'e2\'8f and convert it back to 鈴, or conversely, convert 鈴 to \'e2\'8f?
I have tried looking up the character in different encodings and am not seeing anything that remotely resembles \'e2\'8f. I understand that the RTF control \'hh is "A hexadecimal value, based on the specified character set (may be used to identify 8-bit values)" (source), or, in the perhaps better definition from the Microsoft RTF spec, "%xHH (OCTET with the hexadecimal value of HH)" (download), but I have no idea what to do with that information to get these conversions going.

I was able to parse your sample file using my RTF parser and retrieve the correct character.
The key thing is the \fonttbl command, which, as the name suggests, defines the fonts used in the document. As part of the definition of each font, the \fcharset command determines the character set to be used with this font. You need to use this to correctly interpret the character data.
My parser maps the argument of \fcharset to a codeset name, which is then translated to a character set name that can be used to retrieve the correct Java Charset. Your character set handling will obviously be different as you are working in JavaScript, but hopefully this information will help you move forward.
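A JavaScript counterpart of that mapping could look roughly like the sketch below. It assumes a runtime with the standard TextDecoder and legacy encoding support (browsers, or a Node.js build with full ICU). The names FCHARSET_TO_LABEL and decodeRtfHexRun are hypothetical, the table lists only a few \fcharset values, and the function only handles runs made entirely of \'hh escapes; real RTF may also write a trailing byte as a plain ASCII character.

```javascript
// Partial \fcharset -> encoding label table (a few common values only).
const FCHARSET_TO_LABEL = {
  0: "windows-1252",   // ANSI
  128: "shift_jis",    // Japanese
  129: "euc-kr",       // Korean (Hangul)
  134: "gbk",          // Simplified Chinese (GB2312)
  136: "big5",         // Traditional Chinese
  204: "windows-1251", // Russian
};

// Collects the bytes of every \'hh escape in `run` and decodes them
// with the encoding selected by the font's \fcharset value.
function decodeRtfHexRun(run, fcharset) {
  const label = FCHARSET_TO_LABEL[fcharset];
  if (label === undefined) {
    throw new Error("Unsupported \\fcharset value: " + fcharset);
  }
  const bytes = [];
  const hexEscape = /\\'([0-9a-fA-F]{2})/g;
  let match;
  while ((match = hexEscape.exec(run)) !== null) {
    bytes.push(parseInt(match[1], 16));
  }
  return new TextDecoder(label).decode(new Uint8Array(bytes));
}

// Usage: the two escapes \'d6\'d0 under \fcharset134 are one character.
console.log(decodeRtfHexRun("\\'d6\\'d0", 134));
```

Going the other way (from a character to \'hh escapes) needs an encoder for the legacy charset. The built-in TextEncoder only produces UTF-8, so that direction would need a lookup table or a third-party library.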

DOM XSS and Javascript Escaping

I am going through all the OWASP rules for DOM Based XSS prevention and trying to get a full understanding of each rule. I'm a bit stuck on this rule:
"RULE #2 - JavaScript Escape Before Inserting Untrusted Data into HTML Attribute Subcontext within the Execution Context"
See here:
https://www.owasp.org/index.php/DOM_based_XSS_Prevention_Cheat_Sheet#RULE_.232_-_JavaScript_Escape_Before_Inserting_Untrusted_Data_into_HTML_Attribute_Subcontext_within_the_Execution_Context
The problem is that I'm not sure what method to use when "javascript escaping" on the front-end? I know it is not a very likely use case because most front-end developers would generally avoid inserting untrusted data in to an html attribute in the first place, but nonetheless I would like to fully understand what is meant with this rule by understanding exactly what the escape method should be. Is there a simple javascript escape method people typically use on the front-end? Thanks!
EDIT: Other answers I find on stackoverflow all mention html escapers. I'm specifically looking for a javascript escaper and I want to know why owasp specifically uses the term "javascript escaper" if, as some people would suggest, an html escaper is sufficient.
Perhaps the question could also be phrased as "In the context of OWASP's cheat sheet for DOM Based XSS what is the difference between html escaping and javascript escaping? Please give an example of javascript escaping."

The escaping needed depends on the context that a value is inserted in. Using the wrong escaping may let through characters that are special in the target context but not in the context the escaper was written for, or it may corrupt the values.
JavaScript escaping is for values that are inserted directly into a JavaScript string literal via a server-side templating language.
So the example they have is:
x.setAttribute("value", '<%=Encoder.encodeForJS(companyName)%>');
Here, the value of companyName is inserted into a script, surrounded by single quotes making it a JavaScript string literal. The special characters here are things like quotes, new lines, and some unicode whitespace characters. These should be converted to JavaScript escape sequences. So a quote would become \x27 rather than the HTML entity &#39;. If you were to use HTML encoding then a quote character would be displayed as &#39; and a newline character would cause a syntax error. JavaScript encoding can be done in Java with encodeForJavaScript, or PHP with json_encode.
It's inserted into a JavaScript value so it should be JavaScript encoded. People are used to HTML encoding attributes but this only makes sense when directly inserting into the HTML, not when using the setAttribute DOM method. The encoding needed is the same as if it were like:
var x = '<%=Encoder.encodeForJS(companyName)%>';
The attribute doesn't need to be HTML encoded because it's not in an HTML context. HTML encoding is needed when the value is inserted directly into an attribute like:
<input value='<%=Encoder.encodeForHTML(companyName)%>'>
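To make the difference concrete, here is a minimal sketch of the two kinds of escaping written in JavaScript. The helpers escapeForJsString and escapeHtml are hypothetical names for this example and are not the OWASP encoder; in the scenario above the escaping happens in the server-side template, and a maintained library should be used for real code.

```javascript
// JavaScript string escaping: every character that is not an ASCII
// letter or digit becomes a \xHH or \uHHHH escape sequence.
function escapeForJsString(value) {
  return String(value).replace(/[^A-Za-z0-9]/g, (ch) => {
    const code = ch.charCodeAt(0);
    const hex = code.toString(16).toUpperCase();
    return code < 0x100
      ? "\\x" + hex.padStart(2, "0")
      : "\\u" + hex.padStart(4, "0");
  });
}

// HTML escaping: the five characters that are special in HTML text
// and quoted attribute values become character references.
function escapeHtml(value) {
  const references = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  return String(value).replace(/[&<>"']/g, (ch) => references[ch]);
}

const companyName = "O'Reilly";
console.log(escapeForJsString(companyName)); // O\x27Reilly
console.log(escapeHtml(companyName));        // O&#39;Reilly
```

Placed between single quotes in a script, the first output is read back by the JavaScript parser as the original name. The second would leave the six characters &#39; in the string, because a JavaScript string literal does not decode HTML references.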

JS eval() and XSS

If I JS encode untrusted data, and put it into the eval() function, for example like this:
eval('var a="JS_ENCODED_UNTRUSTED_DATA";alert(a);');
How is XSS still possible in that case?
Edit: To clarify what I meant by "JS encode": In Java, I can use OWASP Java Encoder to encode untrusted data for various contexts. For example Encoder.forHTML(UNTRUSTED_DATA) if I'm inserting untrusted data into HTML or Encoder.forJavaScript(UNTRUSTED_DATA) if I'm inserting untrusted data into JS. It simply encodes or escapes dangerous characters in the input string before inserting it into the HTML page or JavaScript. I'm not exactly sure how the Encoder.forJavaScript function encodes each character, but I know that some characters are simply escaped with '\', and some are converted to the \xHH format.

It depends on how you have escaped that "data". Your data is located
in a "-delimited JavaScript string
inside a '-delimited JavaScript string
possibly inside an HTML <script> element (if not being loaded as an external script).
So you would need to call up to three different escape functions on your data to make it secure. That said, there are really few cases where you actually need eval.
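The nesting described above can be simulated in plain JavaScript. The sketch below assumes a hypothetical escaper (escapeForJsString, which is not the OWASP encoder) and uses new Function only to stand in for the browser parsing the page's script; the marker globalThis.pwned is likewise invented for the demonstration.

```javascript
// Hypothetical JS string escaper: non-alphanumerics become \xHH / \uHHHH.
function escapeForJsString(value) {
  return String(value).replace(/[^A-Za-z0-9]/g, (ch) => {
    const code = ch.charCodeAt(0);
    const hex = code.toString(16).toUpperCase();
    return code < 0x100
      ? "\\x" + hex.padStart(2, "0")
      : "\\u" + hex.padStart(4, "0");
  });
}

// Builds the script text a server template would emit, runs it, and
// reports whether the untrusted data managed to execute as code.
function renderAndRun(escapedData) {
  delete globalThis.pwned;
  const pageSource = "eval('var a=\"" + escapedData + "\";');";
  new Function(pageSource)();
  return globalThis.pwned === true;
}

const untrusted = '";globalThis.pwned=true;//';
const escapedOnce = escapeForJsString(untrusted);
const escapedTwice = escapeForJsString(escapedOnce);

// Escaped once: the outer '...' literal turns \x22 back into a real
// double quote, which then closes the inner "..." literal inside eval.
const injectedOnce = renderAndRun(escapedOnce);
// Escaped twice: one layer of escaping survives for the inner literal.
const injectedTwice = renderAndRun(escapedTwice);

console.log(injectedOnce, injectedTwice); // true false
```

Because this escaper also rewrites < and /, the output cannot contain a literal </script>, which covers the third (HTML) layer in this particular sketch. An escaper that leaves those characters alone would need a separate step for it.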

Escape HTML tags. Any issue possible with charset encoding?

I have a function to escape HTML tags, to be able to insert text into HTML.
Very similar to:
Can I escape html special chars in javascript?
I know that JavaScript uses Unicode internally, but HTML pages may be encoded in different charsets like UTF-8 or ISO8859-1, etc.
My question is: Is there any issue with this very simple conversion, or should I take the page charset into consideration?
If yes, how to handle that?
PS: For example, the equivalent PHP function (http://php.net/manual/en/function.htmlspecialchars.php) has a parameter to select a charset.

No, JavaScript lives in the Unicode world so encoding issues are generally invisible to it. escapeHtml in the linked question is fine.
The only place I can think of where JavaScript gets to see bytes would be data: URLs (typically hidden beneath base64). So this:
var markup = '<p>Hello, '+escapeHtml(user_supplied_data);
var url = 'data:text/html;base64,'+btoa(markup);
iframe.src = url;
is in principle a bad thing. Although I don't know of any browsers that will guess UTF-7 in this situation, a charset=... parameter should be supplied to ensure that the browser uses the appropriate encoding for the data. (btoa uses ISO-8859-1, for what it's worth.)
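One way to supply that parameter is sketched below. It assumes a runtime that provides TextEncoder and btoa as globals (browsers, or a recent Node.js); the helper names toBase64Utf8 and htmlDataUrl are made up for this example. The text is encoded to UTF-8 bytes first, because btoa only accepts characters up to U+00FF.

```javascript
// Encodes a JavaScript string as UTF-8 and returns the bytes as base64.
function toBase64Utf8(text) {
  const bytes = new TextEncoder().encode(text);
  let binary = "";
  for (const byte of bytes) {
    binary += String.fromCharCode(byte); // one char per byte, as btoa expects
  }
  return btoa(binary);
}

// Builds a data: URL whose charset parameter matches the bytes inside it.
function htmlDataUrl(markup) {
  return "data:text/html;charset=utf-8;base64," + toBase64Utf8(markup);
}

console.log(htmlDataUrl("<p>Hello, F\u00fcr</p>"));
```

In the snippet above, the result of htmlDataUrl(markup) would take the place of the hand-built url before it is assigned to iframe.src.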

What function can I use in JavaScript to convert from one character encoding to another?

What built-in or user-defined function can I use in JavaScript or jQuery to convert from one character encoding to another?
For Example,
FROM "utf-8" TO "windows-1256"
OR
FROM "windows-1256" TO "utf-8"
A practical use case: you have a PHP page with a specific character encoding such as "windows-1256" that cannot be changed for business reasons, and you use AJAX to fetch a block of data from the database as JSON, which uses "utf-8" encoding only. You then need to convert the JSON output to the page's encoding so that the characters and strings are displayed correctly.
Thanks in advance .....

From the standpoint of a JavaScript runtime environment, there's really no such thing as character encodings – the messiness of encodings is abstracted away from you. By spec, all JS source text is interpreted as Unicode characters, and all Strings are Unicode.
As such, there's no way in JavaScript to represent characters in anything other than Unicode. Look at the methods available on a String instance – you'll see there's nothing related to character encoding.
Because JavaScript strings are Unicode, AJAX data is transmitted over the wire in a Unicode encoding (UTF-8 in jQuery's case). From the jQuery AJAX docs:
Data will always be transmitted to the server using UTF-8 charset; you must decode this appropriately on the server side.
Your PHP script is going to have to cope with Unicode input from AJAX calls.
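Strings stay Unicode, but raw bytes (for example an ArrayBuffer from a binary response) can still be turned into a string with the standard TextDecoder. The sketch below assumes a runtime with legacy encoding support (browsers, or a Node.js build with full ICU); decodeBytes is a hypothetical helper name. Note that the built-in TextEncoder goes the other way for UTF-8 only, so producing windows-1256 bytes would need a library or server-side conversion.

```javascript
// Decodes bytes in a legacy encoding into an ordinary (Unicode) string.
function decodeBytes(bytes, label) {
  // fatal: true makes malformed input throw instead of yielding U+FFFD.
  return new TextDecoder(label, { fatal: true }).decode(bytes);
}

// 0xC8 is the Arabic letter BEH (U+0628) in windows-1256.
const legacyBytes = new Uint8Array([0xc8]);
const text = decodeBytes(legacyBytes, "windows-1256");

// Re-encoding is only available for UTF-8.
const utf8Bytes = new TextEncoder().encode(text);

console.log(text === "\u0628", Array.from(utf8Bytes));
```

For the PHP case in the question, converting on the server before or after the JSON step is usually the simpler route, as the answer above suggests.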
