I'm trying to parse a blob of text in HTML format that only allows bold <b></b> and italic <i></i>.
I know it is nearly impossible to parse HTML text in a way that is secure against XSS. But given that only bold and italic are allowed, is it feasible to use regex to filter out the unnecessary tags?
Thanks.
--- Edit ---
I meant to do the parsing on the client side, and render it right back.
Please test your code against this before jumping to conclusions.
http://voog.github.io/wysihtml/examples/simple.html
BTW, why did the question itself get downvoted?
--- Closed ---
I picked @Siguza's answer to close this discussion.
The easiest and probably most secure way I can think of (doing this with regex) is to first replace all < and > with &lt; and &gt; respectively, and then explicitly "un-replace" the b and i tags.
To replace < and > you just need text substitution, no regex. But I trust you know how to do this in regex anyway.
To re-enable the i and b tags, you could also use four text replacements:
&lt;b&gt; => <b>
&lt;/b&gt; => </b>
&lt;i&gt; => <i>
&lt;/i&gt; => </i>
Or, in regex, replace /&lt;(\/?[bi])&gt;/g with <$1>.
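The escape-then-restore approach above could be sketched in JavaScript roughly like this. The function name `allowBoldItalic` is made up for illustration, and escaping `&` first is an added assumption (the answer does not mention it) so that text which already contains the literal characters `&lt;b&gt;` is not turned into a live tag.

```javascript
// Sketch: escape everything, then re-enable only bare <b>, </b>, <i>, </i>.
function allowBoldItalic(input) {
  return input
    .replace(/&/g, "&amp;") // assumption: escape ampersands before anything else
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    // Only tags with no attributes and no whitespace are restored.
    .replace(/&lt;(\/?[bi])&gt;/g, "<$1>");
}

console.log(allowBoldItalic('<b>Hi</b> <script>x</script>'));
```

Because only the exact four escaped sequences are restored, something like `<b onmouseover="...">` stays escaped.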
But...
...for the sake of completeness, it actually is possible with just one single regex substitution:
Replace /<(|\/|[^>\/bi]|\/[^>bi]|[^\/>][^>]+|\/[^>][^>]+)>/g with &lt;$1&gt;.
I will not guarantee that this is bullet-proof, but I tested it against the following block using RegExr, where it appeared to hold up:
<>Test</>
<i>Test</i>
<iii>Test</iii>
<b>Test</b>
<bbb>Test</bbb>
<a>Test</a>
<abc>Test</abc>
<some tag with="attributes">Test</some>
<br/>
<br />
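For anyone who wants to repeat that test outside RegExr, here is a small sketch that runs the single-substitution regex over a few of the lines above. The helper name `escapeOtherTags` is hypothetical; the regex is the one from the answer.

```javascript
// Matches every <...> whose content is not exactly b, /b, i or /i.
const notBoldItalicTag = /<(|\/|[^>\/bi]|\/[^>bi]|[^\/>][^>]+|\/[^>][^>]+)>/g;

function escapeOtherTags(input) {
  return input.replace(notBoldItalicTag, "&lt;$1&gt;");
}

const samples = ["<>Test</>", "<i>Test</i>", "<iii>Test</iii>", "<b>Test</b>", "<br/>"];
for (const sample of samples) {
  console.log(sample, "=>", escapeOtherTags(sample));
}
```

Like the answer says, this is not guaranteed to be bullet-proof; it only shows how the substitution behaves on these inputs.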
Can you do this with regex? Kind of. You have to write a regex to find all tags that are not b or i tags. Below is a simple example of one. It is meant to match any tag with more than 1 character in it, which leaves only <a>, <b>, <i>, <p>, <q>, <s>, and <u> (no spaces, no attributes and no classes allowed), which I believe fits your needs. There may well be a more precise regex for this, but this is simple. It may or may not catch everything. It probably doesn't: as written it needs at least three characters between the angle brackets, so two-character opening tags such as <br> or <em> slip through.
<[^>]{2,}[^/]>
Should you do this with regex? No. There are other better, more secure ways.
Parse out tags, replace with a special delimiter (or store indices).
XSS sanitize the input.
Replace the delimiters with tags.
Make sure you don't have any mismatched tags.
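The four steps above might look something like the following sketch. It is JavaScript, so it could run under Node on the server; the placeholder characters (private-use code points), the function names `sanitize` and `isBalanced`, and the plain entity escaping used as the "XSS sanitize" step are all assumptions made for illustration, not something the answer specifies.

```javascript
// Hypothetical placeholders: private-use characters standing in for the allowed tags.
const PLACEHOLDERS = { "<b>": "\uE000", "</b>": "\uE001", "<i>": "\uE002", "</i>": "\uE003" };
const ESCAPES = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };

function sanitize(input) {
  // Drop placeholder characters already in the input so they cannot be smuggled in.
  let text = input.replace(/[\uE000-\uE003]/g, "");
  // 1. Replace the allowed tags with placeholders.
  text = text.replace(/<\/?[bi]>/g, (tag) => PLACEHOLDERS[tag]);
  // 2. Sanitize everything else (here: plain entity escaping).
  text = text.replace(/[&<>"']/g, (ch) => ESCAPES[ch]);
  // 3. Put the real tags back.
  for (const [tag, mark] of Object.entries(PLACEHOLDERS)) {
    text = text.split(mark).join(tag);
  }
  return text;
}

// 4. Check that the restored <b>/<i> tags are properly nested and closed.
function isBalanced(html) {
  const stack = [];
  for (const match of html.matchAll(/<(\/?)([bi])>/g)) {
    if (match[1] === "") stack.push(match[2]);
    else if (stack.pop() !== match[2]) return false;
  }
  return stack.length === 0;
}

const clean = sanitize('<b>Hi</b> <script>alert(1)</script>');
console.log(clean, isBalanced(clean));
```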
XSS sanitizing needs to be done server-side - the client is in control of the client-side, and can circumvent any checks there.
I still maintain that the OWASP Cheat Sheet is sufficient for XSS sanitization, and replacing only empty bold and italic tags shouldn't compromise any of the rules.
I'm looking for a solution similar to
Regex to replace multiple spaces with a single space
but instead of space the question is about <span>. It doesn't contain additional attributes in it such as class. It's just exactly 6 symbols <span> (no spaces, no nothing).
As result, the string
"<span>The <span><span><span><span>dog <span><span>has</span> a long</span> tail, and it </span></span></span>is RED</span></span>!"
should be replaced to
"<span>The <span>dog <span>has</span> a long</span> tail, and it </span></span></span>is RED</span><span>!"
(please don't pay attention closing spans will be more, additional modifications are expected thereafter).
P.S. Yes, you're right, you may want to ask if 2+ consecutive spans may have spaces, tabs or even new lines in between. Honestly - yes, but even without spaces, tabs, new lines the answer will be useful. Thank you.
Try out the following two replace methods (you can chain them):
If <span> or </span> is repeated directly after another (twice or more), replace the whole run with just one occurrence:
.replace(/(\<span\>){2,}/g, "<span>")
.replace(/(\<\/span\>){2,}/g, "</span>")
By the way, regexr.com is a great place if you want to try out regex!
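Here is a sketch of the two chained replaces, extended to tolerate whitespace (spaces, tabs, new lines) between the repeated tags as raised in the question's P.S. The function name `collapseSpans` is made up, and the whitespace handling is an addition that is not part of the answer above.

```javascript
// Collapse runs of <span> (or </span>), optionally separated by whitespace, into one.
function collapseSpans(html) {
  return html
    .replace(/<span>(?:\s*<span>)+/g, "<span>")
    .replace(/<\/span>(?:\s*<\/span>)+/g, "</span>");
}

const input = "<span>The <span><span><span><span>dog <span><span>has</span> a long</span> tail, and it </span></span></span>is RED</span></span>!";
console.log(collapseSpans(input));
```

Whitespace after the last tag of a run is left alone, so the text that follows keeps its spacing.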
I know there are other questions on editable divs, but I couldn't find one specific to the Markdown-related issue I have.
The user will be typing inside a contenteditable div, and they may choose to do any number of Markdown-related things like code blocks, headers, and so on.
I am having issues extracting the source properly and storing it into my database to be displayed again later by a standard Markdown parser. I have tried two ways:
$('.content').text()
In this method, the problem is that all the line breaks are stripped out and of course that is not okay.
$('.content').html()
In this method, I can get the line breaks working fine by using regex to replace <br\> with \n before inserting into database. But the browser also wraps things like ## Heading Here with divs, like this: <div>## Heading Here</div>. This is problematic for me because when I go to display this afterwards, I don't get the proper Markdown formatting.
What's the best (most simple and reliable) way to solve this problem as of 2015?
EDIT: Found a potential solution here: http://www.davidtong.me/innerhtml-innertext-textcontent-html-and-text/
If you check the documentation of jQuery's .text() method:
The result of the .text() method is a string containing the combined text of all matched elements. (Due to variations in the HTML parsers in different browsers, the text returned may vary in newlines and other white space.)
So preserving whitespace is not guaranteed in all browsers.
Try using the innerText property of the element:
document.getElementsByClassName('content')[0].innerText
This returns the text with all whitespace intact, but it is not cross-browser compatible. It works in IE and Chrome, but not in Firefox.
The innerText equivalent for Firefox is textContent, but that strips out the whitespace.
This is what I've been able to come up with using that link I posted above in my edit. It's in Coffeescript.
div = $('.content')[0]
if div.innerText
  text = div.innerText
else
  escapedText = div.innerHTML
    .replace(/(?:\r\<br\>|\r|\<br\>)/g, '\n')
    .replace(/(\<([^\>]+)\>)/gi, "")
  text = _.unescape(escapedText)
Basically, I'm checking whether or not innerText works, and if it doesn't then we do this other thing where we:
Take the HTML, which has escaped text.
Replace all the <br> tags with line breaks.
Strip out any tags (escaped ones won't be stripped, i.e. the stuff the user types).
Unescape the escaped text.
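For readers who don't use CoffeeScript, the fallback branch could be sketched in plain JavaScript as below. It works on an HTML string rather than a DOM node so it can be tried anywhere; the function name `htmlToPlainText` is hypothetical, and the small entity table is an assumed stand-in for `_.unescape`.

```javascript
// Assumed stand-in for _.unescape: only the five common entities are handled.
const ENTITIES = { "&lt;": "<", "&gt;": ">", "&quot;": '"', "&#39;": "'", "&amp;": "&" };

function htmlToPlainText(html) {
  return html
    .replace(/\r?<br\s*\/?>|\r/gi, "\n") // <br> tags become line breaks
    .replace(/<[^>]+>/g, "")              // strip real tags; escaped ones survive
    .replace(/&(?:lt|gt|quot|#39|amp);/g, (entity) => ENTITIES[entity]); // unescape in one pass
}

console.log(htmlToPlainText("<div>## Heading Here</div>Line one<br>a &lt;b&gt; tag"));
```

Note that, like the original, this drops wrapping `<div>` tags without inserting a line break for them.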
1) I get a response with HTML tags, for instance: This is <b>Test</b>
2) Sometimes the response may contain script (or iframe, canvas, etc.) tags (XSS), for instance: This <script>alert("Hello from XSS")</script> is <b>Test</b>
3) How can I remove all XSS tags (script, iframe, canvas...) while keeping the other HTML tags?
PS: I can't use escape because it removes <b>, <strong> and other tags.
How can I remove all XSS tags (script, iframe, canvas...) while keeping the other HTML tags?
All tags can harbour XSS risks. For example <b onmouseover="...">, <a href="javascript:..."> or <strong style="padding: expression(...)">.
To render HTML ‘safe’ you need to filter it to only allow a minimal set of known-safe elements and attributes. All URL attributes need further checking for known-good protocols. This is known as ‘whitelisting’.
It's not a simple task, as you will typically have to parse the HTML properly to detect which elements and attributes are present. A simple regex will not be enough to pick up the range of potentially-troublesome content, especially in JavaScript which has a relatively limited regex engine (no lookbehind, unreliable lookahead, etc).
There are tools for server-side languages that will do this for you, for example PHP's HTML Purifier. I would recommend using one of those at the server-side before returning the content, as I'm currently unaware of a good library of this kind for JavaScript.
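To make the whitelisting idea concrete, here is a minimal sketch of just the decision step, assuming a real HTML parser has already produced a tag name and an attribute map. The tables `ALLOWED` and `SAFE_PROTOCOLS` and the function names are illustrative assumptions, not a vetted policy and not a replacement for a library such as HTML Purifier.

```javascript
// Illustrative allowlist: element name -> attributes permitted on it.
const ALLOWED = { b: [], i: [], strong: [], em: [], a: ["href"] };
const SAFE_PROTOCOLS = ["http:", "https:", "mailto:"];

// Let the URL parser normalise the value before checking the protocol.
function isSafeUrl(value) {
  try {
    const url = new URL(value, "https://example.invalid/"); // placeholder base for relative URLs
    return SAFE_PROTOCOLS.includes(url.protocol);
  } catch {
    return false;
  }
}

// Returns null for disallowed elements, otherwise the element with only safe attributes.
function filterElement(tagName, attributes) {
  const name = tagName.toLowerCase();
  if (!Object.prototype.hasOwnProperty.call(ALLOWED, name)) return null;
  const kept = {};
  for (const [attr, value] of Object.entries(attributes)) {
    const key = attr.toLowerCase();
    if (!ALLOWED[name].includes(key)) continue;
    if (key === "href" && !isSafeUrl(value)) continue;
    kept[key] = value;
  }
  return { tagName: name, attributes: kept };
}

console.log(filterElement("a", { href: "javascript:alert(1)", onclick: "x()" }));
```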
The function below can be used to encode input data to fix XSS vulnerabilities in JavaScript:
/* Using jQuery: the script to escape HTML/JS characters */
function htmlEncode(value) {
    if (value) {
        // Setting the value as text and reading it back as HTML escapes it.
        return $('<div/>').text(value).html();
    } else {
        return '';
    }
}
You don't need to remove the tags, just do the translations.
For example, turn < into &lt;, > into &gt;, etc.
If you are using PHP, there are functions for this:
htmlspecialchars
htmlentities
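On the JavaScript side, the same translation could be sketched without any library as follows. The function name `escapeHtml` is made up, and the set of five characters is an assumption about what "etc." covers.

```javascript
// Translate the characters that are special in HTML into entities.
const HTML_ESCAPES = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };

function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

console.log(escapeHtml('This <script>alert("Hello from XSS")</script> is <b>Test</b>'));
```

As the question points out, this escapes `<b>` along with everything else, so it only fits when no markup needs to survive.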
I'm trying to get the first letter in a paragraph and wrap it with a <span> tag. Notice I said letter and not character, as I'm dealing with messy markup that often has blank spaces.
Existing markup (which I can't edit):
<p> Actual text starts after a few blank spaces.</p>
Desired result:
<p> <span class="big-cap">A</span>ctual text starts after a few blank spaces.</p>
How do I ignore anything but /[a-zA-Z]/ ? Any help would be greatly appreciated.
$('p').html(function (i, html)
{
    return html.replace(/^[^a-zA-Z]*([a-zA-Z])/g, '<span class="big-cap">$1</span>');
});
Demo: http://jsfiddle.net/mattball/t3DNY/
I would vote against using JS for this task. It'll make your page slower and also it's a bad practice to use JS for presentation purposes.
Instead I can suggest using :first-letter pseudo-class to assign additional styles to the first letter in paragraph. Here is the demo: http://jsfiddle.net/e4XY2/. It should work in all modern browsers except IE7.
Matt Ball's solution is good, but if your paragraph has an image, markup, or quotes, the regex will not just fail but break the HTML,
for instance
<p><strong>Important</strong></p>
or
<p>"Important"</p>
You can avoid breaking the HTML in these cases by adding "'< to the excluded initial characters, though in this case no span will be wrapped around the first character.
return html.replace(/^[^a-zA-Z'"<]*([a-zA-Z])/g, '<span class="big-cap">$1</span>');
I think optimally you may wish to wrap the first character after a ' or ".
I would however consider it best to not wrap the character if it was already in markup, but that probably requires a second replace trial.
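One way to get closer to that without a second replace is to capture the leading characters and put them back. The sketch below is a variation, not the code from either answer: the function name `wrapFirstLetter` is made up, and excluding `&` (so entities such as `&nbsp;` are not split) is an added assumption.

```javascript
// Wrap the first letter, keeping any leading whitespace or quotes ($1) in place.
// Paragraphs that start with markup ("<") or an entity ("&") are left untouched.
function wrapFirstLetter(html) {
  return html.replace(/^([^a-zA-Z<&]*)([a-zA-Z])/, '$1<span class="big-cap">$2</span>');
}

console.log(wrapFirstLetter('   Actual text starts after a few blank spaces.'));
console.log(wrapFirstLetter('"Important"'));
console.log(wrapFirstLetter('<strong>Important</strong>'));
```

With jQuery it could be plugged in as `$('p').html(function (i, html) { return wrapFirstLetter(html); });`.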
I do not seem to have permission to reply to an answer so forgive me for doing it like this. The answer given by Matt Ball will not work if the P contains another element as first child. Go to the fiddle and add a IMG (very common) as first child of the P and the I from Img will turn into a drop cap.
If you use the x parameter (not sure if it's supported in jQuery), you can have the script ignore whitespace in the pattern. Then use something like this:
/^([a-zA-Z]).*$/
You know what format your first character should be, and it should grab only that character into a group. If you could have other characters other than whitespace before your first letter, maybe something like this:
/.*?([a-zA-Z]).*/
Conditionally catch other characters first, and then capture the first letter into a group, which you could then wrap around a span tag.
I want to replace a string in HTML page using JavaScript but ignore it, if it is in an HTML tag, for example:
visit google search engine
you can search on google tatatata...
I want to replace google by <b>google</b>, but not here:
visit google search engine
you can search on <b>google</b> tatatata...
I tried with this one:
regex = new RegExp(">([^<]*)?(google)([^>]*)?<", 'i');
el.innerHTML = el.innerHTML.replace(regex,'>$1<b>$2</b>$3<');
but the problem: I got <b>google</b> inside the <a> tag:
visit <b>google</b> search engine
you can search on <b>google</b> tatatata...
How can I fix this?
You'd be better using an html parser for this, rather than regex. I'm not sure it can be done 100% reliably.
You may or may not be able to do this with a regexp. It depends on how precisely you can define the conditions. Saying you want the string replaced except if it's in an HTML tag is not narrow enough, since everything on the page is presumably within some HTML tag (BODY if nothing else).
It would probably work better to traverse the DOM tree for this instead of trying to use a regexp on the HTML.
Parsing HTML with a regular expression is not going to be easy for anything other than trivial cases, since HTML isn't regular.
For more details see this Stack Overflow question (and answers).
I think you're all missing the question here...
When he says inside the tag, he means inside the opening tag, as in the <a href="google.com"> tag...This is something quite different than text, say, inside a <p> </p> tag pair or <body> </body>. While I don't have the answer yet, I'm struggling with this same problem and I know it has to be solvable using regex. Once I figure it out, i'll come back and post.
WORKAROUND
If You can't use a html parser or are quite confident about Your html structure try this:
do the "bad" changing
repeat replace (<[^>]*)(<[^>]+>) to $1 a few times (as much as You need)
It's a simple workaround, but works for me.
Cons?
Well... You have to do the replace twice for the case ... ...> as it removes only first unwanted tag from every tag on the page
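A sketch of this workaround is below, repeating the cleanup until nothing changes instead of guessing how many passes are needed. The function name `boldWithCleanup` and the hard-coded word are illustrative; note it assumes the text itself contains no stray `<` characters, since those would confuse the cleanup regex.

```javascript
function boldWithCleanup(html) {
  // Step 1: the "bad" change, which also hits text inside opening tags.
  let result = html.replace(/google/gi, "<b>$&</b>");
  // Step 2: remove tags that ended up inside another tag, until the string is stable.
  let previous;
  do {
    previous = result;
    result = result.replace(/(<[^>]*)(<[^>]+>)/g, "$1");
  } while (result !== previous);
  return result;
}

console.log(boldWithCleanup('<a href="http://google.com">visit google search engine</a>'));
```

For this input the cleanup needs two passes (one for the stray `<b>`, one for the stray `</b>`), which matches the caveat above.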
[edit:]
SOLUTION
Why not use jQuery, put the html code into the page and do something like this:
$(containerOrSth).find('a').each(function(){
    if($(this).children().length==0){
        $(this).text($(this).text().replace('google','evil'));
    }else{
        //here You have to care about children tags, but You have to know where to expect them - before or after text. comment for more help
    }
});
I'm using
regex = new RegExp("(?=[^>]*<)google", 'i');
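To show what that lookahead does, here is a small sketch. The lookahead only lets a match through if the next angle bracket after it is a `<`, i.e. the match sits in text rather than inside an opening tag. The function name `boldBeforeNextTag` is made up, and `word` is assumed to be plain letters with no regex metacharacters.

```javascript
function boldBeforeNextTag(html, word) {
  // (?=[^>]*<) : from here, a "<" must come before any ">"
  const pattern = new RegExp("(?=[^>]*<)" + word, "gi");
  return html.replace(pattern, "<b>$&</b>");
}

console.log(boldBeforeNextTag('<a href="http://google.com">visit google search engine</a>', "google"));
```

One limitation of this sketch: text that comes after the last tag in the string is never matched, because no `<` follows it.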
you can't really do that, your "google" is always in some tag, either replace all or none
Well, since everything is part of a tag, your request makes no real sense. If it's just the <a /> tag, you might just check for that part, mainly by making sure you don't have a trailing </a> tag before a fresh <a>.
You can do that using REGEX, but filtering blocks like STYLE, SCRIPT and CDATA will need more work, and not implemented in the following solution.
Most of the answers state that 'your data is always in some tags' but they are missing the point, the data is always 'between' some tags, and you want to filter where it is 'in' a tag.
Note that tag characters in inline scripts will likely break this, so if they exist, they should be processed separately with this method. Take a look here:
complex html string.replace function
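The "between tags" versus "in a tag" distinction can be sketched by splitting the HTML into tag and text pieces and replacing only in the text pieces. This is an illustration of the idea, not the code from the linked answer; the function name `boldOutsideTags` is made up, and as noted above STYLE, SCRIPT and CDATA blocks are not handled.

```javascript
function boldOutsideTags(html, word) {
  // Escape regex metacharacters so the word is matched literally.
  const literal = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(literal, "gi");
  return html
    .split(/(<[^>]*>)/) // the capture group keeps the tags: odd indexes are tags
    .map((part, index) => (index % 2 === 1 ? part : part.replace(pattern, "<b>$&</b>")))
    .join("");
}

console.log(boldOutsideTags('<a href="http://google.com">visit google search engine</a>', "google"));
```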
I can give you a hacky solution…
Pick a non printable character that’s not in your string…. Dup your buffer… now overwrite the tags in your dup buffer using the non printable character… perform regex to find position and length of match on dup buffer … Now you know where to perform replace in original buffer
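Here is a rough sketch of that masking trick. The choice of `\u0000` as the non-printable character, the function name `boldUsingMask`, and the assumption that `word` has no regex metacharacters are all mine.

```javascript
function boldUsingMask(html, word) {
  const MASK = "\u0000"; // assumed not to occur in the input
  // Duplicate buffer with every tag overwritten, keeping all positions aligned.
  const masked = html.replace(/<[^>]*>/g, (tag) => MASK.repeat(tag.length));
  const pattern = new RegExp(word, "gi");
  let result = "";
  let last = 0;
  let match;
  // Find positions in the masked copy, then edit the original at those positions.
  while ((match = pattern.exec(masked)) !== null) {
    const end = match.index + match[0].length;
    result += html.slice(last, match.index) + "<b>" + html.slice(match.index, end) + "</b>";
    last = end;
  }
  return result + html.slice(last);
}

console.log(boldUsingMask('<a href="http://google.com">visit google search engine</a>', "google"));
```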