Javascript encoding question

We have an external .js file that we want to include in a number of different pages. The file contains code for sorting a table on the client-side, and uses the ▲ and ▼ characters in the script to indicate which column is sorted and in which direction.
The script was originally written for an ASP.NET page, to offload some sorting work from the server to the client (preventing sorting postbacks when JavaScript is enabled). In that case the encoding is pretty much always UTF-8, and it works great in that context.
However, we also have a number of older Classic ASP pages where we want to include the script. For these pages the encoding is more of a hodgepodge, depending on who wrote the page when and what tool they were using (Notepad, VS6, VS2005, some other HTML helper). Often no encoding is specified in the page at all, so it's up to the browser to pick, and there's really no hard rule for it that I can see.
The problem is that if a different (non-UTF-8) encoding is used, the ▼ and ▲ characters won't show up correctly. I tried using HTML entities instead, but couldn't get them to work well from the JavaScript.
How can I make the script adjust for the various potential encodings so that the "special" characters always show up correctly? Are there different characters I could use, or a trick I missed to make the HTML entities work from JavaScript?
Here is the snippet where the characters are used:
// get sort direction, arrow
var dir = 1;
if (self.innerHTML.indexOf(" ▲") > -1)
    dir = -1;
var arrow = (dir == 1) ? " ▲" : " ▼";
// SORT -- function that actually sorts - not relevant to the question
if (!SimpleTableSort(t.id, self.cellIndex, dir, sortType)) return;
// remove all arrows
for (var c = 0, cl = t.rows[0].cells.length; c < cl; c += 1)
{
    var cell = t.rows[0].cells[c];
    cell.innerHTML = cell.innerHTML.replace(" ▲", "").replace(" ▼", "");
}
// set new arrow
self.innerHTML += arrow;
For the curious, the code points I ended up using with the accepted answer were \u25B4 and \u25BC.

The encoding of the JavaScript file depends on the encoding of the HTML page in which it is embedded. If you have a UTF-8 JavaScript file and an ISO-8859-1 HTML page, the JavaScript is interpreted as ISO-8859-1.
If you load the JavaScript as an external file, you can specify its encoding:
<script type="text/javascript" charset="UTF-8" src="externalJS.js"></script>
In any case, the best option is to save all files related to a web project in one encoding; UTF-8 is recommended.

You want JavaScript Unicode escapes, i.e. "\uxxxx", where "xxxx" is the Unicode code point for the character. I believe "\u25B2" and "\u25BC" are the two you need.
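For example, the arrow handling from the question could use escapes instead of literal characters (a sketch based on the snippet above, not the asker's final code):
// Unicode escapes keep the file's byte encoding out of the picture
var UP = " \u25B2";   // " ▲"
var DOWN = " \u25BC"; // " ▼"
var dir = (self.innerHTML.indexOf(UP) > -1) ? -1 : 1;
var arrow = (dir == 1) ? UP : DOWN;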

I voted for both. I think both answers put together would be your best bet.
You're probably going to have to write the script twice, with one part for UTF-8 and one part for non-UTF-8. It's more trouble, and it still might not work all the time.
Someone needs to come up with standards for your developers. If you all write with at least the same encoding, it'll make things a lot easier for yourselves in the future.

Related

How do I properly escape inline Javascript in a <script> tag?

I'm writing a server-side function for a framework that will let me inline a Javascript file. It takes the filename as input, and its output would be like this:
<script>
/* contents of Javascript file */
</script>
How do I escape the contents of the Javascript file safely?
I am particularly worried if the file contains something like </script>. If the input Javascript file has syntax errors, I still want it to escape correctly. I also realise that XHTML expects some entities to be encoded, whereas HTML doesn't.
There are a lot of questions similar to this asking about how to escape string literals or JSON. But I want something that can handle the general case, so that I can write a tool for the general case.
(I realise inlining potentially untrusted Javascript isn't the best idea, so no need to spend time discussing that.)
This is a work in progress, let me know if I've missed a corner case!
The answer is different depending on whether you're using XHTML or HTML.
1. XHTML with Content-Type: application/xhtml+xml header
In this case, you can simply XML-escape the content, turning this file:
console.log("Example Javascript file");
console.log(1</script>2/);
console.log("That previous line prints false");
To this:
<script>
console.log("Example Javascript file");
console.log(1&lt;/script&gt;2/);
console.log("That previous line prints false");
</script>
Note that if you're using XHTML with a different Content-Type header, then different browsers may behave differently, and I haven't researched it, so I would recommend fixing the Content-Type header.
2. HTML
Unfortunately, I know of no way to escape it properly in this case (without at least parsing the Javascript). Replacing all instances of / with \/ would cause some Javascript to break, including the previous example.
The best that I can recommend is that you search for </script case-insensitively and throw an exception if you find it. If you're only dealing with string literals or JSON, substitute all instances of / with \/.
Some JavaScript minifiers might deal with </script in a safe manner; let me know if you find one.
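A minimal sketch of that recommendation (the function name inlineScript and the surrounding code are my own illustration, not code from this answer):
function inlineScript(jsSource) {
    // Refuse to inline anything that could prematurely close the surrounding <script> tag.
    if (/<\/script/i.test(jsSource)) {
        throw new Error("Source contains </script; cannot inline it safely in HTML");
    }
    return "<script>\n" + jsSource + "\n</script>";
}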

Chars sanitization and XSS

I was doing Google's XSS game (https://xss-game.appspot.com/level4) and I managed to solve the 4th level. I didn't completely understand how, though.
I don't understand why, if I inject the encoded version of a character (say %3B), it is translated into the character itself (that is, ';') in the final HTML page. Who does this, the browser? Why?
Furthermore, I don't understand where in the code the injected characters are checked. I ran some tests and saw that if I try to inject strings like '()';"' , whatever comes after the ; is cut out! Where does this happen in the code?
Finally, if I inject a tag like <asd>, it is encoded within the <div> (that is, &lt;asd&gt;), but it is not in the onload attribute of the <img> tag. Where in the code is this done?
(This answer makes a number of assumptions because I don't have access to Google's client side or server side code (the link goes to an error page because I haven't played the game to reach the level)).
The URL parser (which is probably part of the server-side code) is responsible for converting percent-encoded data in URLs into characters.
; is a key/value pair separator in form-encoding syntax, so the URL parser will cut the data off at that point.
Responsibility for converting text into HTML is usually given to the template engine, but it might be done in some general server-side code before the data gets to the template (assuming there is a template; the general server-side code might just smash strings together).
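As a quick illustration of the percent-decoding step (this uses the standard decodeURIComponent function, not Google's code):
console.log(decodeURIComponent("%3B")); // ";" -- percent-encoded bytes become characters again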
In order to manage level 4 just enter
')*alert('xss

Escape HTML tags. Any issue possible with charset encoding?

I have a function to escape HTML tags, to be able to insert text into HTML.
Very similar to:
Can I escape html special chars in javascript?
I know that JavaScript uses Unicode internally, but HTML pages may be encoded in different charsets like UTF-8 or ISO-8859-1, etc.
My question is: is there any issue with this very simple conversion, or should I take the page charset into consideration?
If so, how do I handle that?
PS: For example, the equivalent PHP function (http://php.net/manual/en/function.htmlspecialchars.php) has a parameter to select a charset.
No, JavaScript lives in the Unicode world so encoding issues are generally invisible to it. escapeHtml in the linked question is fine.
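For reference, a typical escapeHtml along the lines of the linked question looks something like this (a sketch, not the exact code from that answer):
function escapeHtml(text) {
    return String(text)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;")
        .replace(/'/g, "&#39;");
}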
The only place I can think of where JavaScript gets to see bytes would be data: URLs (typically hidden beneath base64). So this:
var markup = '<p>Hello, '+escapeHtml(user_supplied_data);
var url = 'data:text/html;base64,'+btoa(markup);
iframe.src = url;
is in principle a bad thing. Although I don't know of any browsers that will guess UTF-7 in this situation, a charset=... parameter should be supplied to ensure that the browser uses the appropriate encoding for the data. (btoa uses ISO-8859-1, for what it's worth.)
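A minimal sketch of that fix (assuming the same escapeHtml, user_supplied_data and iframe as above; the unescape/encodeURIComponent trick converts the string to UTF-8 bytes that btoa can handle):
var markup = '<p>Hello, ' + escapeHtml(user_supplied_data);
var utf8Bytes = unescape(encodeURIComponent(markup)); // UTF-8 bytes as a "binary" string
iframe.src = 'data:text/html;charset=utf-8;base64,' + btoa(utf8Bytes);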

Finding comments in HTML

I have an HTML file and within it there may be Javascript, PHP and all this stuff people may or may not put into their HTML file.
I want to extract all comments from this html file.
I can point out two problems in doing this:
What is a comment in one language may not be a comment in another.
In JavaScript, the remainder of a line is commented out with the // marker. But URLs also contain //, so I may well eliminate parts of URLs if I just substitute // and the remainder of the line with nothing (see the illustration below).
So this is not a trivial problem.
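For example, naively stripping everything from the first // to the end of a line mangles a URL (illustration only, not code from any answer):
var line = 'window.location = "http://example.com/page"; // go home';
console.log(line.replace(/\/\/.*$/, "")); // prints: window.location = "http: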
Is there a solution for this already available somewhere?
Has anybody already done this?
Problem 2: Isn't every URL quoted, with either "www.url.com" or 'www.url.com', when you write it in either language? I'm not sure. If that's the case, then all you have to do is parse the code and check whether any quote marks precede the slashes, to know if it's a real URL or just a comment.
Look into parser generators like ANTLR, which has grammars for many languages, and write a nesting parser to reliably find comments. Regular expressions aren't going to help you if accuracy is important. Even then, it won't be 100% accurate.
Consider:
Problem 3, what looks like a comment in a language is not always a comment in that language.
<textarea><!-- not a comment --></textarea>
<script>var re = /[/*]not a comment[*/]/, str = "//not a comment";</script>
Problem 4, a comment embedded in a language may not obviously be a comment.
<button onclick="// this is a comment//
notAComment()">
Problem 5, what is a comment may depend on how the browser is configured.
<noscript><!-- </noscript> Whether this is a comment depends on whether JS is turned on -->
<!--[if IE 8]>This is a comment, except on IE 8<![endif]-->
I had to solve this problem partially for contextual templating systems that elide comments from source code to prevent leaking software implementation details.
https://github.com/mikesamuel/html-contextual-autoescaper-java/blob/master/src/tests/com/google/autoesc/HTMLEscapingWriterTest.java#L1146 shows a testcase where a comment is identified in JavaScript, and later testcases show comments identified in CSS and HTML. You may be able to adapt that code to find comments. It will not handle comments in PHP code sections.
It seems from your wording that you are pondering an approach based on regular expressions. It is a pain to do that on the whole file; try using some tools to highlight or discard interesting or uninteresting text, and then work on what is left from your sieve according to the keep/discard criteria. Have a look at HTML::Tree and TreeBuilder; they can be very useful for dealing with the HTML markup.
I would convert the HTML file into a character array and parse it. You can detect key strings like "<", "--", "www", "http" as you move forward, and either skip or delete those segments.
The start/end indices will have to be identified properly, which is a challenge but you will have full power.
There are also other ways to simplify the process if performance is not a problem. For example, all tags can be grabbed with XML::Twig and the string can be parsed to detect JS comments.

Javascript File Name Encryption and Referencing

I want to change the JavaScript file names in my application during deployment, for additional security, like Facebook and Google Plus do.
Is there an application or library that can change the JavaScript file names and update the references to them in my view (written in PHP) and in my JavaScript files?
EXAMPLE OF THE DESIRED EFFECT
Change this (Before Deployment):
In folder: js/myfunction.js
In file: <script type="text/javascript" src="https://mysite.com/myfunction.js"></script>
To this (After Deployment):
In folder: js/PuKJS78UyH.js
In file: <script type="text/javascript" src="https://mysite.com/PuKJS78UyHK.js"></script>
Instead of obfuscation and encryption, you should think about optimization. A couple of things that you could do:
Combine all common JS files into one file (this minimizes the number of requests and also solves your problem - there will be no file names to obfuscate)
Minify the JS - it's faster that way and takes less space (and in addition it becomes unreadable)
This tool looks like a good place to start: http://code.google.com/p/minify/
You should not depend on JavaScript encryption. It is not safe, and might be hacked in a short time. Using server-side languages like PHP is much safer than JavaScript.
However, if you would like to perform a simple base-64 encoding in JavaScript, which normal people will not be able to read, you are in luck: it doesn't need any library. \(^o^)/
Just use btoa() for encoding, and atob() for decoding. Then you can create a <script> tag using the encoded URL.
Read more in MDN: window.atob
Example:
var txt = "myfunction.js";
var encode = btoa(txt);
var decode = atob(encode);
console.log( encode ); // returns "bXlmdW5jdGlvbi5qcw=="
console.log( decode ); // returns "myfunction.js" (original)
// Do whatever you want with the encoded text, e.g.
$("<script src='/js/" + encode + ".js' type='text/javascript'></script>")
    .appendTo("head"); // dynamically adding a script tag using jQuery
Demo: http://jsfiddle.net/DerekL/JWSUs/
Result (screenshot from jsFiddle): "myfunction.js" is encoded to "bXlmdW5jdGlvbi5qcw==", which normal people will not be able to read.
