Detect if source is CSS/HTML/JavaScript

I want to run js-beautify on some source, but I have no way to detect what type of source it is. Is there any way, crude or not, to detect whether the source is CSS, HTML, JavaScript, or none of these?
Their site has this function, which looks like it will figure out whether it's HTML:
function looks_like_html(source) {
// <foo> - looks like html
// <!--\nalert('foo!');\n--> - doesn't look like html
var trimmed = source.replace(/^[ \t\n\r]+/, '');
var comment_mark = '<' + '!-' + '-';
return (trimmed && (trimmed.substring(0, 1) === '<' && trimmed.substring(0, 4) !== comment_mark));
}
I just need to see whether it's CSS, JavaScript or neither. This is running in Node.js.
So this code would need to be reported as JavaScript:
var foo = {
bar : 'baz'
};
whereas this code needs to be reported as CSS:
.foo {
background : red;
}
So a function to test this would return the type:
function getSourceType(source) {
if (isJs) {
return 'js';
}
if (isHtml) {
return 'html';
}
if (isCss) {
return 'css';
}
}
There will be cases where other languages such as Java are used, which I need to ignore, but for CSS/HTML/JS I can use the beautifier.

Short answer: Almost impossible.
- Thanks to Katana's input
The reason: valid HTML can contain JS and CSS (and it usually does). JS can contain both CSS and HTML (e.g. var myContent = '< div >< style >CSS-Rules< script >JS Commands';). And even CSS can contain both in comments.
So writing a parser for this is close to impossible. You just cannot separate them easily.
The languages have rules for how they are written; what you want to do is reverse-engineer a source and check whether those rules apply. That's probably not worth the effort.
Approach 1
If the requirement is worth the effort, you could try to run different parsers on the source and see whether they throw errors. For example, Java source is unlikely to be valid HTML/JS/CSS, but it will be valid Java (if written properly).
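A minimal sketch of this parser-based idea in Node.js, under stated assumptions: all function names below are hypothetical, the HTML check mirrors the looks_like_html function quoted in the question, the CSS check is a crude regular expression rather than a real parser, and the JS check relies on the Function constructor throwing a SyntaxError for source that does not parse (it compiles the source but never runs it).

```javascript
// Hypothetical helpers; none of these names come from js-beautify.
function looksLikeHtml(source) {
  var trimmed = source.replace(/^\s+/, '');
  // Starts with a tag, but not with an HTML comment marker.
  return trimmed.charAt(0) === '<' && trimmed.substring(0, 4) !== '<!--';
}

function parsesAsJs(source) {
  try {
    // The Function constructor only compiles the body; it is never called here.
    new Function(source);
    return true;
  } catch (err) {
    return false;
  }
}

function looksLikeCss(source) {
  // Crude shape check: selector { property: value; ... }
  return /^\s*[^{}]+\{\s*(?:[\w-]+\s*:[^;{}]+;?\s*)+\}/.test(source);
}

function getSourceType(source) {
  if (source.trim() === '') return null;
  if (looksLikeHtml(source)) return 'html';
  // JS is tested before CSS because an object literal can resemble a CSS rule.
  if (parsesAsJs(source)) return 'js';
  if (looksLikeCss(source)) return 'css';
  return null; // e.g. Java or plain text
}

console.log(getSourceType("var foo = {\nbar : 'baz'\n};"));
console.log(getSourceType(".foo {\nbackground : red;\n}"));
```

The order of the checks is a design choice, not a rule: a JS fragment that fails to parse on its own (for example ES module import/export statements) would fall through to the CSS check or to null.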
Approach 2
- Thanks to Bram's input
However, if you know the source very well and can assume that these things don't occur in your code, you could try the following with regular expressions.
Example
<code><div>This div is HTML var i=32;</div></code>
<code>#thisiscss { margin: 0; padding: 0; }</code>
<code>.thisismorecss { border: 1px solid; background-color: #0044FF;}</code>
<code>function jsfunc(){ { var i = 1; i+=1;<br>}</code>
Parsing
$("code").each(function() {
code = $(this).text();
if (code.match(/<(br|basefont|hr|input|source|frame|param|area|meta|!--|col|link|option|base|img|wbr|!DOCTYPE).*?>|<(a|abbr|acronym|address|applet|article|aside|audio|b|bdi|bdo|big|blockquote|body|button|canvas|caption|center|cite|code|colgroup|command|datalist|dd|del|details|dfn|dialog|dir|div|dl|dt|em|embed|fieldset|figcaption|figure|font|footer|form|frameset|head|header|hgroup|h1|h2|h3|h4|h5|h6|html|i|iframe|ins|kbd|keygen|label|legend|li|map|mark|menu|meter|nav|noframes|noscript|object|ol|optgroup|output|p|pre|progress|q|rp|rt|ruby|s|samp|script|section|select|small|span|strike|strong|style|sub|summary|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|track|tt|u|ul|var|video).*?<\/\2/)) {
$(this).after("<span>This is HTML</span>");
}
else if (code.match(/(([ \t\r\n]*)([a-zA-Z-]*)([.#]{1,1})([a-zA-Z-]*)([ \t\r\n]*)+)([{]{1,1})((([ \t\r\n]*)([a-zA-Z-]*)([:]{1,1})((([ \t\r\n]*)([a-zA-Z-0-9#]*))+)[;]{1})*)([ \t\r\n]*)([}]{1,1})([ \t\r\n]*)/)) {
$(this).after("<span>This is CSS</span>");
}
else {
$(this).after("<span>This is JS</span>");
}
});
What does it do: Parse the text.
HTML
If it contains a '<' followed by br (or any of the other tags above) and then '>', it's HTML. (Check for the tag name as well, since '<' and '>' can also appear in JS when comparing numbers.)
CSS
If it follows the pattern of an optional name, followed by . or #, followed by an id or class, followed by {, it's CSS. You should get the idea from here. In the pattern above I also allowed for spaces and tabs.
JS
Else it is JS.
You could also use a regex along these lines: if it contains '= {' or 'function...', then it's JS. You can also refine the regular expressions to check more precisely and/or provide whitelists and blacklists (like 'var' but with no < or > around it, 'function(asdsd,asdsad){assads}' ..)
Bram's starting point, which I built on, was:
$("code").each(function() {
code = $(this).text();
if (code.match(/^<[^>]+>/)) {
$(this).after("<span>This is HTML</span>");
}
else if (code.match(/^(#|\.)?[^{]+{/)) {
$(this).after("<span>This is CSS</span>");
}
});
For more Information:
http://regexone.com is a good reference.
Also check http://www.sitepoint.com/jquery-basic-regex-selector-examples/ for inspiration.

It depends on whether you are allowed to mix languages, as mentioned in the comments (i.e. having embedded JS and CSS in your HTML), or whether those are separate files that you need to detect for some reason.
A rigorous approach would be to build a tree from the file, where each node would be a statement (in Perl, you can use HTML::TreeBuilder). Then you could parse it and compare with the original source. Then proceed by applying eliminating regexes to weed out chunks of code and split languages.
Another way would be to search for language-specific patterns (I was thinking that CSS only uses " *= " in some situations, so if you have " = " by itself, it must be JavaScript, embedded or not).
For HTML you can certainly detect the tags with a regex like
if($source =~ m/(<.+>)/){}
Then you would need to take into account some trickier cases, such as JavaScript being used to display some HTML code:
var code = "<body>";
Then again, it really depends on the situation you are facing and how the languages mix.
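One way to handle the string-literal case above is to blank out quoted strings before looking for tags. This is a rough sketch in JavaScript rather than Perl; both function names are hypothetical, and the string pattern ignores template literals and regex literals.

```javascript
// Replace the contents of '...' and "..." literals with empty strings.
function stripStringLiterals(source) {
  return source.replace(/"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'/g, '""');
}

// True when something tag-like remains outside of quoted strings.
function containsTagOutsideStrings(source) {
  return /<[a-z!\/][^>]*>/i.test(stripStringLiterals(source));
}

console.log(containsTagOutsideStrings('var code = "<body>";'));
console.log(containsTagOutsideStrings('<body><p>hi</p></body>'));
```

Comparisons written without spaces, such as a<b && c>d, can still look like a tag to this check, so it only narrows the problem.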

Related

Can JavaScript be executed from a textarea that does not communicate to the internet?

In a tool I'm creating, there are several <input> and <textarea> fields that are essentially acting as glorified notepads. The only things that happen to these fields are basic editing (adding strings to the value) or copying/pasting them. Here's a snippet of one part of the tool:
HTML:
<button onclick="auth();" class="bOwn bLight">Copy</button> <button onclick="authClear();" class="bOwn bRed">Clear</button>
<input type="checkbox" id="postSCActivity" name="postSCActivity" value="Activity after Stat Change"><label for="postSCActivity">Activity after Status Change?</label><br>
<textarea id="auth" class="textbox"></textarea><p>
JS:
function auth() {
$("#auth").removeClass("textSelected");
lastClickedField = "#auth";
var prefix = "\n\n";
var content = $("#auth").val();
var suffix = " //" + $("#opcode").val();
var postSCActivity = "";
if ($("#postSCActivity").prop("checked")) {
postSCActivity = " Additional activity follows the status change(s).";
} else {
postSCActivity = " No activity follows the status change(s).";
}
if (content != "") {
$("#auth").val(prefix + content + postSCActivity + suffix);
$("#auth").select();
document.execCommand('copy');
$("#auth").addClass("textSelected");
$("#auth").val(content);
} else {
$('#blankCopy').show();
$("#blankButton").focus();
}
}
function authClear() {
$("#auth").val("");
$("#auth").removeClass("textSelected");
$("#postSCActivity").prop("checked", false);
}
For the above Javascript, the variable suffix is entered at a login screen that only accepts letters and numbers, no symbols.
At no point does anything get submitted; this is literally only a glorified notepad that automatically adds text and copies text to the clipboard. Are there any known ways to insert javascript that would run? I can't figure out anything... I tried typing out functions, but it just copies the text. I feel it's pretty solid, but knowing my luck, I'm overlooking something. I wanted to see what I could be missing, and how I can better secure the tool, if anything else has to be done.

It has nothing to do with the server or a submit, although that's a common way to exploit things. What matters is the context into which you introduce user-provided data.
There are two contexts - interpreting and rendering.
If an attacker can use your program to write their data into a context where that data is interpreted when it should only be rendered, then they can trick your program into executing their hack.
Take this example:
document.write('<textarea>' + userData + '</textarea>')
versus:
document.getElementById('myTextArea').value = userData
Do you see the difference? In the first example, the information is parsed by the browser, which gives the attacker the opportunity to trick the browser into executing their code. In the second, the value property takes a string and doesn't parse it - it's a string to be rendered, not interpreted. There's no ambiguity.
Regular HTML doesn't have function calls. It jumps between rendered and interpreted and back, over and over. It's easy to trick it.

How to distinguish if code in string is a piece of JS or CSS code? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I receive strings of code via simple POST requests, and I am looking for a clever way (without having to run the script itself) to distinguish whether a string is JavaScript or CSS, or at least to be fairly sure (I'd say a 55% chance that it is one or the other).
These are not files, they are strings, so I don't have any information about the code in the string: no file, no file extension, no headers...
Do you have any advice or resources?
Thanks a lot.

If this has to work with broken code too, I think your best chance is to search for "typical CSS" and "typical JS" stuff, and compare how much speaks for JS and how much for CSS.
Typical for JS are its reserved words and its operators.
Typical for CSS is the structure: [, separated selectors] { [ ; separated key-value pairs] }
First, a few utilities that try to evaluate how much of a passed string is part of a particular language (a very basic approach, so it should also work with broken code).
//returns **kind of** a percentage of how much of the string has been identified as JS/CSS
function evaluateCode(pattern, commentPattern, correctionalFactor){
correctionalFactor = +correctionalFactor || 1;
return function(string){
//removing comments and compacting whitespace.
//this avoids false hits, and provides a better estimation of how much significant text/code we have (to compute the percentage)
var t = string.replace(commentPattern || "", "").replace(/\s+/g, " ");
return correctionalFactor * (t.match(pattern) || []).reduce(sumLengths, 0) / t.length;
}
}
var sumLengths = (acc, match) => acc + match.length;
var evaluateJS = evaluateCode(
/\b(?:function|return|arguments|this|var|const|let|typeof|instanceof|Array|Object)\b|[+\-*/<>&|=]+|[()\[\]\{\}]/g,
// lazy [\s\S]*? so each block comment is removed on its own, not everything between the first /* and the last */
/\/\*[\s\S]*?\*\/|\/\/[^\n]*/g,
1.5
);
var evaluateCSS = evaluateCode(
/[a-z0-9\.#:\[\]=,\s-]+\{(?:\s*[a-z-]+\s*:[^;]+;?)*\s*\}/gi,
/\/\*[\s\S]*?\*\//g
);
And the usage:
var jsRatio = evaluateJS(string),
cssRatio = evaluateCSS(string);
//If there's less than 10% difference between the two estimations, I'd call it "unclear"
if(Math.abs(jsRatio - cssRatio) < .1){
console.log("result is ambiguous, but I tend more towards");
}
console.log("%s (probabilities: css %f%, js %f%)", cssRatio > jsRatio? "css": "js", cssRatio, jsRatio);
I use an estimated/guessed "correctional factor" of 1.5 on evaluateJS, because the regex matches only a part of the code,
whereas the CSS regex matches almost everything.
This factor only matters when the results are ambiguous; usually there should be a huge gap between the two ratios.
Edit: another (probably better) regex to search for CSS:
/[a-z0-9\-]+\s*:[^;{}]+[;}]|(?:[#.]?[a-z]+(?:[#.:\s][a-z0-9-_]+)*\s*[,{])/gi
This looks only for key-value pairs and "typical" selectors containing ids and classes, rather than the whole structure, which should be beneficial if the CSS structure is broken or too complex for the fairly simple regex.
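As a rough illustration of how that alternative pattern behaves, the sketch below scores two small samples with it. The helper name matchedFraction and both sample strings are hypothetical, and the helper is a simplified stand-in for the evaluateCode utility, without comment stripping or a correction factor.

```javascript
var cssPattern = /[a-z0-9\-]+\s*:[^;{}]+[;}]|(?:[#.]?[a-z]+(?:[#.:\s][a-z0-9-_]+)*\s*[,{])/gi;

// Fraction of the whitespace-compacted text that is covered by pattern matches.
function matchedFraction(pattern, text) {
  var compact = text.replace(/\s+/g, ' ').trim();
  if (compact.length === 0) return 0;
  var matches = compact.match(pattern) || [];
  var matchedLength = matches.reduce(function (sum, m) { return sum + m.length; }, 0);
  return matchedLength / compact.length;
}

var cssScore = matchedFraction(cssPattern, '.foo { background: red; }');
var jsScore = matchedFraction(cssPattern, "var foo = { bar: 'baz' };");
console.log('css sample:', cssScore.toFixed(2), 'js sample:', jsScore.toFixed(2));
```

The JS object literal still picks up a partial score because bar: 'baz' looks like a key-value pair, which is why comparing the score against a JS estimate matters more than the absolute value.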

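You might enclose the string in a wrapper that prevents it from being executed (if it's JavaScript) and see whether it can be parsed.
function isJavaScript(str)
{
try
{
// The Function constructor compiles the wrapped source without calling it.
// The parentheses make the wrapper a function expression; a bare anonymous
// "function(){...}" statement is always a syntax error.
Function('(function(){' + str + '\n})');
return true; // Looks like valid JS
}
catch (error)
{
// no valid JavaScript, may be CSS
return false;
}
}
I don't think this is 100% foolproof, but it may work for your purpose.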

Converting javascript hexadecimal code

I was going through some downloaded JavaScript files and found code written with hexadecimal values instead of 'normal' JS syntax. For example:
if (!_0x7cd2x2[_0x2dae[19]](_0x2dae[18])) {
var _0x7cd2x8 = true;
_0x7cd2x2[_0x2dae[21]](_0x2dae[20]);
} else {
var _0x7cd2x8 = false;
_0x7cd2x2[_0x2dae[21]](_0x2dae[22]);
}
;
if (_0x7cd2x2[_0x2dae[19]](_0x2dae[23])) {
var _0x7cd2x9 = true
}
;
Can somebody please help me understand the code and how it was produced?

So, in fact, the code above is perfectly valid JavaScript. The original script has been run through an obfuscator in order to make it difficult to understand.
Most likely, whichever obfuscator was used replaces variable names with numbers, prints each number as a hex value and prefixes it with "_".
To understand the code you will need the entire sample, and a lot of patience.
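The indexed lookups in the question (_0x2dae[19] and so on) suggest that string literals were moved into one array and are referenced by index. Assuming that is the case, one first step towards readable code is to substitute the lookups back in. The sketch below uses a hypothetical array name and made-up table contents, because the real _0x2dae array is not shown.

```javascript
// Replace every arrayName[<number>] lookup with the quoted string it refers to.
function inlineLookups(code, arrayName, values) {
  // arrayName is assumed to contain only identifier characters, so it is safe in a RegExp.
  var pattern = new RegExp(arrayName + '\\[(\\d+)\\]', 'g');
  return code.replace(pattern, function (whole, index) {
    var i = Number(index);
    // Leave lookups that fall outside the known table untouched.
    return i < values.length ? JSON.stringify(values[i]) : whole;
  });
}

// Made-up example data, for illustration only.
var table = ['hasClass', 'active', 'addClass'];
var obfuscated = 'if (!el[_0xabc[0]](_0xabc[1])) { el[_0xabc[2]](_0xabc[1]); }';
console.log(inlineLookups(obfuscated, '_0xabc', table));
// prints: if (!el["hasClass"]("active")) { el["addClass"]("active"); }
```

Renaming the remaining _0x... variables to something meaningful still has to be done by hand, which is where the patience comes in.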

finding and replacing parameters in href with javascript

Here's the issue:
I need to check for a parameter in the query. If it's been set, then I need to change it (harder part). If it hasn't been set, I need to set it (easy part)
I was so sure it would be an easy task, but it somehow ended up being a bit more complicated than I thought. Especially for those cases where there are multiple parameters, and the one I'm looking for is in the middle somewhere.
The parameter is "siteLanguage", and is always followed by =xx where xx represents any two characters, like en or es or whatever. So maybe regex is the answer (boy, do I suck at regex)
No frameworks for this one, guys, just plain ol' javascript.

I guess you've figured out how to find all the links.
The standard format of a URL is service://host/path?query
I suggest cutting off the query (just take everything after the first ?) and then splitting it at & (because that separates parameters).
You'll be left with an array of strings of the form key=value. Split those into an associative array. Now you can work on that array. After you've made your modifications, you need to join the query again and set the href attribute of the link.
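The steps above can be sketched in plain JavaScript as follows. The function name setQueryParam is hypothetical, and the sketch keeps the pairs in an array instead of an associative array so that the original parameter order is preserved; it also sets aside an optional #fragment, which the description does not mention.

```javascript
function setQueryParam(href, key, value) {
  // Separate an optional #fragment so it can be re-attached at the end.
  var hashIndex = href.indexOf('#');
  var hash = hashIndex === -1 ? '' : href.slice(hashIndex);
  var base = hashIndex === -1 ? href : href.slice(0, hashIndex);

  // Cut the query off at the first "?".
  var queryIndex = base.indexOf('?');
  var path = queryIndex === -1 ? base : base.slice(0, queryIndex);
  var query = queryIndex === -1 ? '' : base.slice(queryIndex + 1);

  // Split into key=value pairs and replace the one we are looking for.
  var pairs = query === '' ? [] : query.split('&');
  var found = false;
  for (var i = 0; i < pairs.length; i++) {
    if (pairs[i].split('=')[0] === key) {
      pairs[i] = key + '=' + encodeURIComponent(value);
      found = true;
    }
  }
  // Not set yet: append it.
  if (!found) {
    pairs.push(key + '=' + encodeURIComponent(value));
  }
  return path + '?' + pairs.join('&') + hash;
}

console.log(setQueryParam('http://example.com/p?a=1&siteLanguage=en&b=2', 'siteLanguage', 'es'));
// prints: http://example.com/p?a=1&siteLanguage=es&b=2
```

To apply it to a page, the result would be assigned back to each link's href attribute.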

This would check every "a href=" in the document, appending or adjusting the language.
var checkhrefs = function(lang){
var links = document.getElementsByTagName("a");
for (var i=0;i<links.length;i++){
if (links[i].href.indexOf("siteLanguage") == -1){
// start the query with "?" if the link has none yet, otherwise append with "&"
links[i].href += (links[i].href.indexOf("?") == -1 ? "?" : "&") + "siteLanguage="+lang;
} else {
links[i].href = links[i].href.replace(new RegExp("siteLanguage=[a-z][a-z]"),"siteLanguage="+lang);
}
}
}

Ended up just doing a quick hack like so:
function changeLanguage(lang) {
if (location.search.indexOf("siteLanguage") > -1) { //siteLanguage set
location.href = location.search.replace(/siteLanguage=[a-z][a-z]/, "siteLanguage="+lang);
} else if (location.search == "") {
location.href += "?siteLanguage="+lang;
} else {
location.href += "&siteLanguage="+lang;
}
}
Actually pretty happy with a 9-liner function...

jquery match() variable interpolation - complex regexes

I've already looked at this, which was helpful to a point.
Here's the problem. I have a list of users propagated into an element via user click; something like this:
<div id="box">
joe-user
page-joe-user
someone-else
page-someone-else
</div>
On click, I want to make sure that the user has not already been clicked into the div. So, I'm doing something like:
if ( ! $('#box').html().match(rcpt) )
{
update_div();
}
else
{
alert(rcpt+' already exists.');
}
However, JavaScript's lack of variable interpolation in regular expressions causes my alert to trigger in the case where page-joe-user is selected and then the user selects joe-user, which are clearly not the same.
In Perl I would do something like:
if ( $rcpt =~ /^\Qrcpt\E/ )
{
# code
}
All I want to do is change my match() to be:
if ( ! $('#box').html().match(/^rcpt/) )
{
# code
}
if ( ! $('#box').html().match(rcpt) ) seemed a little promising but it, too, fails. Using new RegExp() with concatenated RE syntax also does not work, i.e. $('#box').html().match(new RegExp('^'+rcpt)). I also tried $('#box').html().match('/^'+rcpt'/'). I can only imagine that I'm missing something. I'm pretty new to JavaScript.
I can't seem to find anything on this site that really addresses such a use case.
TIA

The match function only works on strings, not jQuery objects.
The best way to do this is to put each username into a separate HTML tag.
For example:
<ul id="users">
<li>joe-user</li>
<li>page-joe-user</li>
<li>someone-else</li>
<li>page-someone-else</li>
</ul>
You can then write the following:
if($('#users li').is(function () { return $(this).text() === rcpt; }))
If you want to do it your way, you should call text() to get the string inside the element. ($('#box').text().match(...))
EDIT: The best way to do this using your HTML would be to split the string.
For example:
var userExists = false;
var users = $('#box').text().split(/\r?\n/);
for(var i = 0; i < users.length; i++) { //IE doesn't have indexOf
if (users[i] == rcpt) {
userExists = true;
break;
}
}
if (userExists) {
//Do something
}
This has the added benefit of not being vulnerable to regex-injection.
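If a regular expression is still preferred, the Perl \Q...\E behaviour from the question can be approximated by escaping the variable before building the RegExp. This is a sketch only: escapeRegExp and hasExactLine are hypothetical helper names, and the text is assumed to hold one username per line, as in the question's #box element.

```javascript
// Escape every character that has a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// True when one line of text consists of exactly rcpt (ignoring surrounding whitespace).
function hasExactLine(text, rcpt) {
  // The "m" flag makes ^ and $ match at line boundaries.
  var pattern = new RegExp('^\\s*' + escapeRegExp(rcpt) + '\\s*$', 'm');
  return pattern.test(text);
}

var box = 'page-joe-user\nsomeone-else\n';
console.log(hasExactLine(box, 'joe-user'));     // false
console.log(hasExactLine(box, 'someone-else')); // true
```

In the page itself, the text argument would come from $('#box').text().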
