Prevent Javascript Injection in data attribute

I have a script that pulls a text from an API and sets that as a tooltip in my html.
<div class="item ttip" data-html="<?php echo $obj->titleTag;?>">...</div>
The API allows html and javascript to be entered on their side for that field.
I tried this: $obj->titleTag = htmlentities(strip_tags_content($this->channel->status));
A user then entered the following (or something similar; he is blocked now, so I cannot check it again):
\" <img src="xx" onerror=window.location.replace(https://www.youtube.com/watch?v=IAISUDbjXj0)>
which does not get caught by the above.
I could str_replace the window.location stuff, but that seems dirty.
What would be the right approach? I keep reading about "whitelists", but I don't understand how the concept applies to a case like this.
//EDIT strip_tags_content comes from here: https://php.net/strip_tags#86964

Well, it's not tags you're dealing with now but code within a tag. Rather than stripping tags, you need to allow only certain attributes, since you've only got one tag in there ;)
What you want to do is check for any event handlers being bound in the markup (a full list here) and remove any attribute such as onerror.
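The "whitelist" idea the question asks about can be sketched roughly as follows: escape everything first, then re-enable only a short list of harmless, attribute-free tags. This is a minimal sketch in JavaScript rather than the PHP of the question; escapeHtml, allowlistHtml and the ALLOWED_TAGS list are hypothetical names chosen for illustration, not part of any library.

```javascript
// Escape every character that has a special meaning in HTML text or attributes.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Hypothetical allowlist: only these tags, and only without any attributes.
var ALLOWED_TAGS = ["b", "i", "em", "strong"];

function allowlistHtml(untrusted) {
  var escaped = escapeHtml(untrusted);
  // Turn "&lt;b&gt;" or "&lt;/b&gt;" back into real tags; everything else stays escaped.
  var pattern = new RegExp("&lt;(/?)(" + ALLOWED_TAGS.join("|") + ")&gt;", "gi");
  return escaped.replace(pattern, "<$1$2>");
}

console.log(allowlistHtml('<b>hi</b> <img src="xx" onerror=alert(1)>'));
```

Because any tag that carries attributes never matches the allowlist pattern, an onerror handler stays escaped text. If the result is then written into an attribute such as data-html, it still has to be escaped for the attribute context at output time.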

Related

How to make a live HTML preview textarea safe against HTML/Script Injection

I'm turning here as a last resort. I've scoured google and I'm having troubles coming to a solution. I have a form with a textarea element that allows you to type html in the area and it will render the HTML markup live as you type if you have the preview mode active. Not too different from the way StackOverflow shows the preview below a new post.
However, I have recently discovered that my functionality has a vulnerability. All I have to do is type something like:
</textarea>
<script>alert("Hello World!");</script>
<textarea style="display: none;">
And not only does this run from within the textarea live; if you save the form and reload said data on a different page, this code still executes within the textarea on that page, unbeknownst to the user. To them, all they see is a textarea (if there is no alert, obviously).
I found this post; Live preview of textarea input with javascript html, and attempted to refactor my JS to the accepted answer there, because I noticed I couldn't write a script tag in the JSFiddle example, though maybe that's some JSFiddle blocking that behaviour, but I couldn't get it working within my JS file.
These few lines are what I use to live render HTML markup:
$(".main").on("keyup", "#actualTextArea", function () {
$('#previewTextArea').html($('#actualTextArea').val());
});
$(".main").on("keydown", "#actualTextArea", function () {
$('#previewTextArea').html($('#actualTextArea').val());
});
Is there a way this can be refactored so it's safe? My only idea at the moment is to wipe the live preview and use a toggle on/off and encode it, but I really think this is a cool feature and would like to keep it live instead of toggle. Is there a way to "live encode" it or escape certain tags or something?

In order to sanitise your text area preview simply replace all the < and > with their html character code equivalents:
function showPreview()
{
var value = $('#writer').val().trim();
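// Escape & first so the entities produced below are not escaped twice
value = value.replace(/&/g, "&amp;");
value = value.replace(/</g, "&lt;");
value = value.replace(/>/g, "&gt;");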
$('#preview').html(value);
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<textarea id="writer" onInput="showPreview();">
</textarea>
<br/>
<hr/>
<div id="preview">
</div>

Edit: Actually, I think this solution is a little cleaner, and makes the below code unnecessary. In the velocity page all that is needed is to take advantage of the Spring framework. So I replace the textarea with this like so:
#springBindEscaped("myJavaObj.textAreaText" true)
<textarea id="actualTextArea" name="${status.expression}" class="myClass" rows="10" cols="120">$!status.value</textarea>
This paired with some backend Java validation and it ends up being a much cleaner solution.
But if you want a non-spring/ velocity solution, then this below works just fine
I cobbled together a quick fix, as my main purpose is to eliminate the ability for others to execute scripts easily. It's not ideal, and I'm not claiming it to be the best answer, so if someone finds a better solution, please do share. I created a "sanitize" function like so:
function sanitize(text){
var sanitized = text.replace("<script>", "");
sanitized = sanitized.replace("</script>", "");
return sanitized;
}
Then the previous two event handlers now look like:
$(".main").on("keyup", "#actualTextArea", function () {
var textAreaMarkup = $('#actualTextArea').val();
var sanitizedMarkup = sanitize(textAreaMarkup );
$('#actualTextArea').val(sanitizedMarkup);
$('#previewTextArea').html(sanitizedMarkup);
});
// This one can remain unchanged, and in fact needs to be.
// If it's the same as above it will wipe the text area
// on a highlight-backspace
$(".main").on("keydown", "#actualTextArea", function () {
$('#previewTextArea').html($('#actualTextArea').val());
});
Along with Java side sanitation to prevent anything harmful being stored in the DB, this serves my purpose, but I'm very open to a better solution if it exists.
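The quick fix above only removes the first literal <script> and </script>, so it is fairly easy to get around. A small check, assuming the same sanitize function as above plus a hypothetical escapeForPreview helper (not from the original answer), shows which inputs slip through:

```javascript
// Copy of the blocklist-style function from the answer above.
function sanitize(text) {
  var sanitized = text.replace("<script>", "");
  sanitized = sanitized.replace("</script>", "");
  return sanitized;
}

// Hypothetical alternative: escape markup characters so the preview shows them as text.
function escapeForPreview(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

var payloads = [
  "<SCRIPT>alert(1)</SCRIPT>",                    // different letter case
  '<script src="//example.test/x.js"></script>',  // tag with an attribute
  '<img src="x" onerror="alert(1)">',             // no script tag at all
  "<script><script>alert(1)</script></script>"    // nested, only the first copy is removed
];

payloads.forEach(function (payload) {
  console.log("blocklist:", sanitize(payload));
  console.log("escaped:  ", escapeForPreview(payload));
});
```

Escaping gives up rendered HTML in the preview. Keeping a rendered preview safe would need an allowlist-based sanitizer on both the client and the server, which is beyond this sketch.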

how to avoid fetching a part of html page which is being called inside another page?

I am loading an .html page (say A.html, which is dynamically created by another piece of software each time a request is made) inside another webpage (say B.html). I am doing this using the .load() function. Everything works fine, but the problem is that I do not want the many "br" tags (empty tags) present at the end of A.html to end up in B.html. Is there any way to avoid fetching those "br" tags into B.html? Any suggestion would be of great help. Thank you in advance.

You can't avoid loading part of a file when you are just accessing it.
The best option would be to simply remove the extra <br> tags from the document to begin with. There is probably a better way to accomplish whatever they are attempting to accomplish.
With some server-side scripting, it could be possible to strip them automatically when you load it, but would probably be pretty bothersome to do.
Instead, if you can't remove the <br> elements for some reason, what might be easier, if you are just dealing with a handful of <br> tags would be to simply strip them out.
Since you mention using the load() function, I'm guessing you are using jQuery.
If that's the case, something like this would cleanly strip out any extra <br> tags from the end of the document.
Here is a JSfiddle which will do it: http://jsfiddle.net/dMJ2F/
var html = "<p>A</p><br><p>B</p><br><p>C</p><br><br /><br/>";
var $html = $('<div>').append(html);
var $br;
while (($br = $html.find('br:last-child')).length > 0) {
$br.remove();
}
$('p').text($html.html());
Basically, throw the loaded stuff in to a div (in memory), then loop through and remove each <br> at the end until there aren't any. You could use regex to do this as well, but it runs a few risks that this jQuery method doesn't.
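For comparison, the regex alternative mentioned above might look something like this. It is only a sketch that works on the raw HTML string; stripTrailingBreaks is a hypothetical helper name, and the pattern assumes plain <br>, <br/> or <br /> tags without attributes, which is one of the risks the DOM-based loop avoids.

```javascript
// Remove any run of <br>, <br/> or <br /> tags (plus surrounding whitespace)
// from the very end of an HTML string. Breaks elsewhere are left alone.
function stripTrailingBreaks(html) {
  return html.replace(/\s*(?:<br\s*\/?>\s*)+$/i, "");
}

var html = "<p>A</p><br><p>B</p><br><p>C</p><br><br /><br/>";
console.log(stripTrailingBreaks(html)); // <p>A</p><br><p>B</p><br><p>C</p>
```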

You should delete the br tags in your A.html.
Replace them by giving the class .sequence a margin-top: 30px,
and use a different value in your B.html file.
You also can run this:
$('br', '.sequence').remove();
in the load-function. It will strip all br-tags.

You can't avoid fetching a part of your page, but you CAN fetch only a part of it.
According to the jQuery docs, you can call load like this:
$("#result").load("urlorpage #form-id");
That way, you only load the form html inside the result element.

Javascript: document innerHTML replace breaks forms

I'm currently trying to replace a piece of plain text in a page that also contains a form. I am aware that upon replacing code containing a form, the form elements get recreated. This can break forms (and it does on the webpage I'm manipulating).
Usually, I go about this by using the "getElementsByTagName" function, to make sure that I don't need to replace the code containing the form and this has always been possible so far. However at this point, I have arrived at a page where the smallest tagname is a div that contains the text I need to replace and a form. This div is further subdivided in tables so initially I thought "let's get elements by table", but exactly the piece that I need to replace is not subdivided in a table.
So I used this code to replace:
document.documentElement.innerHTML = document.documentElement.innerHTML.replace(RegEx, replaceString);
Of course, this breaks the form on the page, which is not wanted behavior.
Does anyone have any idea how to go about this without breaking the form? Is it possible to somehow get a reference to the part of the div that does not contain a table? Is it possible to alter just part of the code? Right now I take an instance of the code, replace the matches in the instance, and then overwrite the original code with the altered instance. I once remember trying document.documentElement.innerHTML.replace(RegEx, replaceString); on another page but this only returned an instance of altered code, it did not alter the original code.
This is part of the page:
<div class="BoxContent" style="background-image:url(http://static.tibia.com/images/global/content/scroll.gif);">
<TABLE></TABLE>
<BR>
Some text here.
<BR>
And some more.
<table></table>
<table></table>
</div>
I need to do some changes in the text between the tables.
I have looked around on SO and found similar question about adding things to a form with innerHTML, but this did not help my cause. So, all help is appreciated here!
Kenneth

Here - plain JS
DEMO
window.onload=function() {
var nodes = document.getElementsByClassName("BoxContent")[0].childNodes;
for (var i=0,n=nodes.length;i<n;i++) {
if (nodes[i].nodeType==3) {
// console.log(nodes[i].textContent)
nodes[i].textContent=nodes[i].textContent.replace(/some/gi,"Lots");
}
}
}

Regex replace string but not inside html tag

I want to replace a string in HTML page using JavaScript but ignore it, if it is in an HTML tag, for example:
visit google search engine
you can search on google tatatata...
I want to replace google by <b>google</b>, but not here:
visit google search engine
you can search on <b>google</b> tatatata...
I tried with this one:
regex = new RegExp(">([^<]*)?(google)([^>]*)?<", 'i');
el.innerHTML = el.innerHTML.replace(regex,'>$1<b>$2</b>$3<');
but the problem: I got <b>google</b> inside the <a> tag:
visit <b>google</b> search engine
you can search on <b>google</b> tatatata...
How can I fix this?

You'd be better using an html parser for this, rather than regex. I'm not sure it can be done 100% reliably.

You may or may not be able to do this with a regexp. It depends on how precisely you can define the conditions. Saying you want the string replaced except if it's in an HTML tag is not narrow enough, since everything on the page is presumably within some HTML tag (BODY if nothing else).
It would probably work better to traverse the DOM tree for this instead of trying to use a regexp on the HTML.

Parsing HTML with a regular expression is not going to be easy for anything other than trivial cases, since HTML isn't regular.
For more details see this Stackoverflow question (and answers).

I think you're all missing the question here...
When he says inside the tag, he means inside the opening tag, as in the <a href="google.com"> tag. This is quite different from text inside, say, a <p> </p> tag pair or <body> </body>. While I don't have the answer yet, I'm struggling with this same problem and I know it has to be solvable using regex. Once I figure it out, I'll come back and post.

WORKAROUND
If you can't use an HTML parser, or are quite confident about your HTML structure, try this:
do the "bad" replacement
then repeatedly replace (<[^>]*)(<[^>]+>) with $1, as many times as you need
It's a simple workaround, but it works for me.
Cons?
Well... you have to do the replace twice for the case ... ...> as it removes only the first unwanted tag from every tag on the page
[edit:]
SOLUTION
Why not use jQuery, put the html code into the page and do something like this:
$(containerOrSth).find('a').each(function(){
if($(this).children().length==0){
$(this).text($(this).text().replace('google','evil'));
}else{
//here You have to care about children tags, but You have to know where to expect them - before or after text. comment for more help
}
});

I'm using
regex = new RegExp("(?=[^>]*<)google", 'i');

you can't really do that, your "google" is always in some tag, either replace all or none

Well, since everything is part of a tag, your request makes no real sense. If it's just the <a /> tag, you might just check for that part. Mainly by making sure you don't have a tailing </a> tag before a fresh <a>

You can do that using regex, but filtering blocks like STYLE, SCRIPT and CDATA needs more work and is not implemented in the following solution.
Most of the answers state that 'your data is always in some tags', but they are missing the point: the data is always 'between' some tags, and you want to filter out the places where it is 'in' a tag.
Note that tag characters in inline scripts will likely break this, so if they exist, they should be processed separately with this method. Take a look here:
complex html string.replace function
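One way to sketch the 'between tags' versus 'in a tag' distinction is to walk the string as alternating tag and text segments and only touch the text segments. This is an assumption-laden sketch, not the linked solution: highlightOutsideTags is a hypothetical name, the sample markup is made up for illustration, and STYLE, SCRIPT and CDATA blocks are not handled.

```javascript
// Wrap every occurrence of `word` in <b>...</b>, but only in text between tags.
function highlightOutsideTags(html, word) {
  // Escape regex metacharacters so `word` is matched literally.
  var escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  var wordPattern = new RegExp(escaped, "gi");
  // Alternation: either a whole tag (captured in group 1) or a run of text.
  return html.replace(/(<[^>]*>)|[^<]+/g, function (match, tag) {
    if (tag) {
      return match; // leave tag names and attributes untouched
    }
    return match.replace(wordPattern, "<b>$&</b>");
  });
}

var sample = '<a href="google.com">visit google search engine</a>';
console.log(highlightOutsideTags(sample, "google"));
```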

I can give you a hacky solution…
Pick a non printable character that’s not in your string…. Dup your buffer… now overwrite the tags in your dup buffer using the non printable character… perform regex to find position and length of match on dup buffer … Now you know where to perform replace in original buffer
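The masking idea described above could be sketched like this. It assumes the filler character U+0000 never appears in the input, and replaceOutsideTags and its replacer callback are hypothetical names, not an existing API.

```javascript
// Find `word` in a masked copy of the HTML, then apply the replacement
// at the same positions in the original string.
function replaceOutsideTags(html, word, replacer) {
  if (!word) {
    return html; // an empty search term would match everywhere
  }
  var FILLER = "\u0000"; // assumed not to occur in the input
  // Duplicate buffer: every tag is overwritten with filler of equal length,
  // so match positions in the copy line up with the original.
  var masked = html.replace(/<[^>]*>/g, function (tag) {
    return FILLER.repeat(tag.length);
  });
  var pattern = new RegExp(word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"), "gi");
  var result = "";
  var last = 0;
  var m;
  while ((m = pattern.exec(masked)) !== null) {
    var found = html.slice(m.index, m.index + m[0].length);
    result += html.slice(last, m.index) + replacer(found);
    last = m.index + m[0].length;
  }
  return result + html.slice(last);
}

var page = '<a href="google.com">Google it</a>';
console.log(replaceOutsideTags(page, "google", function (text) {
  return "<b>" + text + "</b>";
}));
```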

How to store arbitrary data for some HTML tags

I'm making a page which has some interaction provided by javascript. Just as an example: links which send an AJAX request to get the content of articles and then display that data in a div. Obviously in this example, I need each link to store an extra bit of information: the id of the article. The way I've been handling it in this case was to put that information in the href, like this:
<a class="article" href="#5">
I then use jQuery to find the a.article elements and attach the appropriate event handler. (don't get too hung up on the usability or semantics here, it's just an example)
Anyway, this method works, but it smells a bit, and isn't extensible at all (what happens if the click function has more than one parameter? what if some of those parameters are optional?)
The immediately obvious answer was to use attributes on the element. I mean, that's what they're for, right? (Kind of).
<a articleid="5" href="link/for/non-js-users.html">
In my recent question I asked if this method was valid, and it turns out that short of defining my own DTD (I don't), then no, it's not valid or reliable. A common response was to put the data into the class attribute (though that might have been because of my poorly-chosen example), but to me, this smells even more. Yes it's technically valid, but it's not a great solution.
Another method I'd used in the past was to actually generate some JS and insert it into the page in a <script> tag, creating a struct which would associate with the object.
var myData = {
link0 : {
articleId : 5,
target : '#showMessage'
// etc...
},
link1 : {
articleId : 13
}
};
<a href="..." id="link0">
But this can be a real pain in the butt to maintain and is generally just very messy.
So, to get to the question, how do you store arbitrary pieces of information for HTML tags?

Which version of HTML are you using?
In HTML 5, it is totally valid to have custom attributes prefixed with data-, e.g.
<div data-internalid="1337"></div>
In XHTML, this is not really valid. If you are in XHTML 1.1 mode, the browser will probably complain about it, but in 1.0 mode, most browsers will just silently ignore it.
If I were you, I would follow the script based approach. You could make it automatically generated on server side so that it's not a pain in the back to maintain.
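As a rough illustration of generating that script on the server, here is a sketch written for a JavaScript (Node.js) backend, which is an assumption since the question does not name a server language. buildDataScript is a hypothetical helper, and variableName is assumed to be a trusted identifier chosen by the developer.

```javascript
// Build a <script> block that defines a lookup table keyed by element id.
function buildDataScript(variableName, dataById) {
  var json = JSON.stringify(dataById)
    // Keep "</script>" or "<!--" inside string values from ending the block early.
    .replace(/</g, "\\u003c");
  return "<script>var " + variableName + " = " + json + ";</script>";
}

var links = {
  link0: { articleId: 5, target: "#showMessage" },
  link1: { articleId: 13 }
};

console.log(buildDataScript("myData", links));
```

Escaping < matters whenever any of the stored values come from users, since a value containing </script> would otherwise close the block and allow markup injection.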

If you are using jQuery already then you should leverage the "data" method which is the recommended method for storing arbitrary data on a dom element with jQuery.
To store something:
$('#myElId').data('nameYourData', { foo: 'bar' });
To retrieve data:
var myData = $('#myElId').data('nameYourData');
That is all that there is to it but take a look at the jQuery documentation for more info/examples.

Just another way; I personally wouldn't use this, but it works. Parse the JSON with JSON.parse rather than eval(), because eval() is dangerous.
<a class="article" href="link/for/non-js-users.html">
<span style="display: none;">{"id": 1, "title":"Something"}</span>
Text of Link
</a>
// javascript
var article = document.getElementsByClassName("article")[0];
// childNodes[0] is the whitespace text node before the span, so select the span itself
var data = JSON.parse(article.querySelector("span").textContent);

Arbitrary attributes are not valid, but are perfectly reliable in modern browsers. If you are setting the properties via javascript, then you don't have to worry about validation either.
An alternative is to set attributes in javascript. jQuery has a nice utility method just for that purpose, or you can roll your own.

A hack that's going to work with pretty much every possible browser is to use open classes like this: <a class='data_articleid_5' href="link/for/non-js-users.html">
This is not all that elegant to the purists, but it's universally supported, standard-compliant, and very easy to manipulate. It really seems like the best possible method. If you serialize, modify, copy your tags, or do pretty much anything else, data will stay attached, copied etc.
The only problem is that you cannot store non-serializable objects that way, and there might be limits if you put something really huge there.
A second way is to use fake attributes like: <a articleid='5' href="link/for/non-js-users.html">
This is more elegant, but breaks standard, and I'm not 100% sure about support. Many browsers support it fully, I think IE6 supports JS access for it but not CSS selectors (which doesn't really matter here), maybe some browsers will be completely confused, you need to check it.
Doing funny things like serializing and deserializing would be even more dangerous.
Using ids to pure JS hash mostly works, except when you try to copy your tags. If you have tag <a href="..." id="link0">, copy it via standard JS methods, and then try to modify data attached to just one copy, the other copy will be modified.
It's not a problem if you don't copy tags, or use read only data. If you copy tags and they're modified you'll need to handle that manually.

Using jquery,
to store: $('#element_id').data('extra_tag', 'extra_info');
to retrieve: $('#element_id').data('extra_tag');

I know that you're currently using jQuery, but what if you defined the onclick handler inline. Then you could do:
<a href='/link/for/non-js-users.htm' onclick='loadContent(5);return false;'>
Article 5</a>

You could use hidden input tags. I get no validation errors at w3.org with this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang='en' xml:lang='en' xmlns='http://www.w3.org/1999/xhtml'>
<head>
<meta content="text/html;charset=UTF-8" http-equiv="content-type" />
<title>Hello</title>
</head>
<body>
<div>
<a class="article" href="link/for/non-js-users.html">
<input style="display: none" name="articleid" type="hidden" value="5" />
</a>
</div>
</body>
</html>
With jQuery you'd get the article ID with something like (not tested):
$('.article input[name=articleid]').val();
But I'd recommend HTML5 if that is an option.

Why not make use of the meaningful data already there, instead of adding arbitrary data?
i.e. use <a href="/articles/5/page-title" class="article-link">, and then you can programmatically get all article links on the page (via the classname) and the article ID (matching the regex /articles\/(\d+)/ against this.href).
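A minimal sketch of that lookup, using the regex from this answer. getArticleId is a hypothetical helper name, and returning null for non-article links is an assumption made for illustration.

```javascript
// Pull the numeric article id out of a link such as "/articles/5/page-title".
function getArticleId(href) {
  var match = /articles\/(\d+)/.exec(href);
  return match ? parseInt(match[1], 10) : null;
}

console.log(getArticleId("/articles/5/page-title")); // 5
console.log(getArticleId("/about"));                 // null
```

In a click handler, this.href would be passed in as the argument.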

As a jQuery user I would use the Metadata plugin. The HTML looks clean, it validates, and you can embed anything that can be described using JSON notation.

This is good advice. Thanks to @Prestaul:
If you are using jQuery already then you should leverage the "data"
method which is the recommended method for storing arbitrary data on a
dom element with jQuery.
Very true, but what if you want to store arbitrary data in plain-old HTML? Here's yet another alternative...
<input type="hidden" name="whatever" value="foobar"/>
Put your data in the name and value attributes of a hidden input element. This might be useful if the server is generating HTML (i.e. a PHP script or whatever), and your JavaScript code is going to use this information later.
Admittedly, not the cleanest, but it's an alternative. It's compatible with all
browsers and is valid XHTML. You should NOT use custom attributes, nor should you really use attributes with the 'data-' prefix, as it might not work on all browsers. And, in addition, your document will not pass W3C validation.

As long as your actual work is done server-side, why would you need custom information in the html tags in the output anyway? All you need to know back on the server is an index into whatever kind of list of structures holds your custom info. I think you're looking to store the information in the wrong place.
I will recognize, however unfortunate, that in lots of cases the right solution isn't the right solution. In which case I would strongly suggest generating some javascript to hold the extra information.
Many years later:
This question was posted roughly three years before data-... attributes became a valid option with the advent of html 5 so the truth has shifted and the original answer I gave is no longer relevant. Now I'd suggest to use data attributes instead.
<a data-article-id="5" href="link/for/non-js-users.html">
<script>
let anchors = document.getElementsByTagName('a');
for (let anchor of anchors) {
// data-article-id is exposed as dataset.articleId
let articleId = anchor.dataset.articleId;
}
</script>
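One detail worth keeping in mind with dataset: HTML parsers lowercase attribute names, and the dataset key is derived by dropping the data- prefix and turning each hyphen followed by a lowercase letter into an uppercase letter. So an attribute written as data-articleId is read back as dataset.articleid, while data-article-id is read back as dataset.articleId. The helper below only mirrors that naming rule for illustration; datasetKey is a hypothetical name, not a browser API.

```javascript
// Mirror of the attribute-name to dataset-key rule described above.
function datasetKey(attributeName) {
  const name = attributeName.toLowerCase(); // HTML parsers lowercase attribute names
  if (!name.startsWith("data-")) {
    return null; // not a data attribute
  }
  // Drop "data-", then turn each "-x" into "X".
  return name.slice(5).replace(/-([a-z])/g, (_, letter) => letter.toUpperCase());
}

console.log(datasetKey("data-articleId"));  // articleid
console.log(datasetKey("data-article-id")); // articleId
```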

I advocate use of the "rel" attribute. The XHTML validates, the attribute itself is rarely used, and the data is efficiently retrieved.

So there are four choices for doing this:
Put the data in the id attribute
Put the data in an arbitrary attribute
Put the data in the class attribute
Put the data in another tag
http://www.shanison.com/?p=321

You could use the data- prefix of your own made attribute of a random element (<span data-randomname="Data goes here..."></span>), but this is only valid in HTML5. Thus browsers may complain about validity.
You could also use a <span style="display: none;">Data goes here...</span> tag. But this way you can not use the attribute functions, and if css and js is turned off, this is not really a neat solution either.
But what I personally prefer is the following:
<input type="hidden" title="Your key..." value="Your value..." />
The input will in all cases be hidden, the attributes are completely valid, and it will not get sent if it is within a <form> tag, since it has not got any name, right?
Above all, the attributes are really easy to remember and the code looks nice and easy to understand. You could even put an ID-attribute in it, so you can easily access it with JavaScript as well, and access the key-value pair with input.title; input.value.

One possibility might be:
Create a new div to hold all the extended/arbitrary data
Do something to ensure that this div is invisible (e.g. CSS plus a class attribute of the div)
Put the extended/arbitrary data within [X]HTML tags (e.g. as text within cells of a table, or anything else you might like) within this invisible div

Another approach can be to store a key:value pair as a simple class using the following syntax :
<div id="my_div" class="foo:'bar'">...</div>
This is valid and can easily be retrieved with jQuery selectors or a custom made function.
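The "custom made function" could look roughly like this. readClassData is a hypothetical name, and the sketch assumes values contain neither whitespace nor single quotes, since class names are split on whitespace.

```javascript
// Collect key:'value' pairs from a class attribute string, ignoring ordinary class names.
function readClassData(className) {
  var data = {};
  className.split(/\s+/).forEach(function (token) {
    var match = /^([\w-]+):'([^']*)'$/.exec(token);
    if (match) {
      data[match[1]] = match[2];
    }
  });
  return data;
}

console.log(readClassData("foo:'bar' highlighted size:'large'"));
```

In the browser, the className property of the element (for example document.getElementById("my_div").className) would be passed in.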

In html, we can store custom attributes with the prefix 'data-' before the attribute name like
<p data-animal='dog'>This animal is a dog.</p>.
Check documentation
We can use this property to dynamically set and get attributes using jQuery like:
If we have a p tag like
<p id='animal'>This animal is a dog.</p>
Then to create an attribute called 'breed' for the above tag, we can write:
$('#animal').attr('data-breed', 'pug');
To retrieve the data anytime, we can write:
var breedtype = $('#animal').data('breed');

At my previous employer, we used custom HTML tags all the time to hold info about the form elements. The catch: We knew that the user was forced to use IE.
It didn't work well for Firefox at the time. I don't know if Firefox has changed this or not, but be aware that adding your own attributes to HTML elements may or may not be supported by your reader's browser.
If you can control which browser your reader is using (i.e. an internal web applet for a corporation), then by all means, try it. What can it hurt, right?

This is how I do my ajax pages... it's a pretty easy method...
function ajax_urls() {
var objApps= ['ads','user'];
$("a.ajx").each(function(){
var url = $(this).attr('href');
for ( var i=0;i< objApps.length;i++ ) {
if (url.indexOf("/"+objApps[i]+"/")>-1) {
$(this).attr("href",url.replace("/"+objApps[i]+"/","/"+objApps[i]+"/#p="));
}
}
});
}
How this works: it looks at all URLs that have the class 'ajx', replaces a keyword and adds the # sign. If JS is turned off, the URLs act as they normally do. Each "app" (each section of the site) has its own keyword, so all I need to do is add to the JS array above to cover more pages.
So for example my current settings are set to:
var objApps= ['ads','user'];
So if i have a url such as:
www.domain.com/ads/3923/bla/dada/bla
the js script would replace the /ads/ part so my URL would end up being
www.domain.com/ads/#p=3923/bla/dada/bla
Then I use jquery bbq plugin to load the page accordingly...
http://benalman.com/projects/jquery-bbq-plugin/
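The URL rewrite itself can be tried without a page by pulling it into a plain function. This is a sketch, not the code above verbatim: toAjaxUrl is a hypothetical name, and unlike the loop above it stops at the first matching keyword.

```javascript
// Insert "#p=" after the first known app segment, e.g. "/ads/" becomes "/ads/#p=".
function toAjaxUrl(url, apps) {
  for (var i = 0; i < apps.length; i++) {
    var segment = "/" + apps[i] + "/";
    if (url.indexOf(segment) > -1) {
      return url.replace(segment, segment + "#p=");
    }
  }
  return url; // no known app keyword, leave the link as it is
}

var objApps = ["ads", "user"];
console.log(toAjaxUrl("www.domain.com/ads/3923/bla/dada/bla", objApps));
// www.domain.com/ads/#p=3923/bla/dada/bla
```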

I have found the metadata plugin to be an excellent solution to the problem of storing arbitrary data with the html tag in a way that makes it easy to retrieve and use with jQuery.
Important: The actual file you include is only 5 kb and not 37 kb (which is the size of the complete download package)
Here is an example of it being used to store values I use when generating a google analytics tracking event (note: data.label and data.value happen to be optional params)
$(function () {
$.each($(".ga-event"), function (index, value) {
$(value).click(function () {
var data = $(value).metadata();
if (data.label && data.value) {
_gaq.push(['_trackEvent', data.category, data.action, data.label, data.value]);
} else if (data.label) {
_gaq.push(['_trackEvent', data.category, data.action, data.label]);
} else {
_gaq.push(['_trackEvent', data.category, data.action]);
}
});
});
});
<input class="ga-event {category:'button', action:'click', label:'test', value:99}" type="button" value="Test"/>

My answer might not apply to your case. I needed to store a 2D table in HTML, and I needed to do it with the fewest possible keystrokes. Here's my data in HTML:
<span hidden id="my-data">
IMG,,LINK,,CAPTION
mypic.jpg,,khangssite.com,,Khang Le
funnypic.jpg,,samssite.com,,Smith, Sam
sadpic.png,,joyssite.com,,Joy Jones
sue.jpg,,suessite.com,,Sue Sneed
dog.jpg,,dogssite.com,,Brown Dog
cat.jpg,,catssite.com,,Black Cat
</span>
Explanation
It's hidden using hidden attribute. No CSS needed.
This is processed by Javascript. I use two split statements, first on newline, then on double-comma delimiter. That puts the whole thing into a 2D array.
I wanted to minimize typing. I didn't want to redundantly retype the fieldnames on every row (json/jso style), so I just put the fieldnames on the first row. That's a visual key for the programmer, and it is also used by Javascript to know the fieldnames. I eliminated all braces, brackets, equals, parens, etc. End-of-line is the record delimiter.
I use double-commas as delimiters. I figured no one would normally use double-commas for anything, and they're easy to type. Beware, programmer must enter a space for any empty cells, to prevent unintended double-commas. The programmer can easily use a different delimiter if they prefer, as long as they update the Javascript. You can use single-commas if you're sure there will be no embedded commas within a cell.
It's a span to ensure it takes up no room on the page.
Here's the Javascript:
// pull 2D text-data into array
let sRawData = document.querySelector("#my-data").innerHTML.trim();
// get headers from first row of data and load to array. Trim and split.
const headersEnd = sRawData.indexOf("\n");
const headers = sRawData.slice(0, headersEnd).trim().split(",,");
// load remaining rows to array. Trim and split.
const aRows = sRawData.slice(headersEnd).trim().split("\n");
// trim and split columns
const data = aRows.map((element) => {
return element.trim().split(",,");
});
Explanation:
JS uses lots of trims to get rid of any extra whitespace.
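If named fields are more convenient than a 2D array, the same parsing can be taken one step further so that each row becomes an object keyed by the header row. This is a sketch under the same double-comma convention; parseTable is a hypothetical helper, and the raw string stands in for the span's innerHTML.

```javascript
// Turn header-first, delimiter-separated text into an array of record objects.
function parseTable(rawText, delimiter) {
  const lines = rawText
    .trim()
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  const headers = lines[0].split(delimiter).map((header) => header.trim());
  return lines.slice(1).map((line) => {
    const cells = line.split(delimiter);
    const record = {};
    headers.forEach((header, index) => {
      // Missing trailing cells become empty strings.
      record[header] = (cells[index] || "").trim();
    });
    return record;
  });
}

const raw = `
IMG,,LINK,,CAPTION
mypic.jpg,,khangssite.com,,Khang Le
funnypic.jpg,,samssite.com,,Smith, Sam
`;

const records = parseTable(raw, ",,");
console.log(records[1].CAPTION); // Smith, Sam
```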
