Json.encode special symbols \u003c MVC3 - javascript

I have JavaScript application, where I use client-side templates (underscore.js, Backbone.js).
Data for initial page load is strapped into the page like this (.cshtml Razor-file):
<div id="model">#Json.Encode(Model)</div>
Razor engine performs escaping, so, if the Model is
new { Title = "<script>alert('XSS');</script>" }
, in output we have:
<div id="model">{"Title":"\u003cscript\u003ealert(\u0027XSS\u0027)\u003c/script\u003e"}</div>
Which after "parse" operation:
var data = JSON.parse($("#model").html());
we have object data with "Title" field exactly "<script>alert('XSS');</script>"!
When this goes to underscore template, it alerts.
Somehow \u003c-like symbols are treated like proper "<" symbols.
How do I escape "<" symbols to < and > from DB (if they somehow got there)?
Maybe I can tune Json.Encode serialization for escaping these symbols?
Maybe I can set up Entity Framework which I`m using, for automatically escape these symbols absolutely all the time when getting data from DB?

\u003c and similar codes are perfectly valid for JS. You can obfuscate whole JS files using this syntax, if you so choose. Essentially, you're seeing an escape character \, u for unicode, and then a 4-character Hex code which relates to a symbol.
http://javascript.about.com/library/blunicode.htm
\u003c - as you've noted, is the < character.
One approach to "fixing" this on the MVC side would be to write a RegEx which looks for the pattern \u - and then captures the next 4 characters. You could then un-encode them into actual unicode characters - and run the resultant text through your XSS prevention algorithms.
As you've noted in your question - just looking for "<" doesn't help. You also can't just look for "\u003cscript" - because this assumes the potential hacker hasn't simply unicode-encoded the entire "script" tag word. The safer approach is to un-escape all of these kinds of codes and then cleanse your HTML in plain-text.
Incidentally, it might make you feel better to note that this is one of the common (and thusfar poorly resolved) issues in XSS prevention. So you aren't alone in wanting a better solution...
You might check out the following libraries to assist in the actual html cleansing:
http://wpl.codeplex.com/ (Microsoft's attempt at a solution - though very bad user feedback)
https://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project_.NET (A private project which is designed to do a lot of this kind of prevention. I find it hard to use, and poorly implemented in .NET)
Both are good references, though.

You need to encode your string as HTML before providing it to Underscore.
"HTML escaping in Underscore.js templates" explains how to do this.

If you want to write unencoded content you will need to use the Html.Raw() helper:
#Html.Raw(Json.Encode(Model))
Edit:
I guess, perhaps I'm not understanding what your problem is. For example within a test controller I have the following
ViewBag.Test = new { Title = "<script>alert('XSS');</script>" };
In the related view:
<script type="text/javascript">
var test = #Html.Raw(Json.Encode(ViewBag.Test));
console.log(test.Title);
document.write(test.Title);
</script>
Which in turn outputs to the console:
<script>alert('XSS');</script>
And opens the alert.

Related

single quotations vs double quotations [duplicate]

Consider the following two alternatives:
console.log("double");
console.log('single');
The former uses double quotes around the string, whereas the latter uses single quotes around the string.
I see more and more JavaScript libraries out there using single quotes when handling strings.
Are these two usages interchangeable? If not, is there an advantage in using one over the other?
The most likely reason for use of single vs. double in different libraries is programmer preference and/or API consistency. Other than being consistent, use whichever best suits the string.
Using the other type of quote as a literal:
alert('Say "Hello"');
alert("Say 'Hello'");
This can get complicated:
alert("It's \"game\" time.");
alert('It\'s "game" time.');
Another option, new in ECMAScript 6, is template literals which use the backtick character:
alert(`Use "double" and 'single' quotes in the same string`);
alert(`Escape the \` back-tick character and the \${ dollar-brace sequence in a string`);
Template literals offer a clean syntax for: variable interpolation, multi-line strings, and more.
Note that JSON is formally specified to use double quotes, which may be worth considering depending on system requirements.
If you're dealing with JSON, it should be noted that strictly speaking, JSON strings must be double quoted. Sure, many libraries support single quotes as well, but I had great problems in one of my projects before realizing that single quoting a string is in fact not according to JSON standards.
There is no one better solution; however, I would like to argue that double quotes may be more desirable at times:
Newcomers will already be familiar with double quotes from their language. In English, we must use double quotes " to identify a passage of quoted text. If we were to use a single quote ', the reader may misinterpret it as a contraction. The other meaning of a passage of text surrounded by the ' indicates the 'colloquial' meaning. It makes sense to stay consistent with pre-existing languages, and this may likely ease the learning and interpretation of code.
Double quotes eliminate the need to escape apostrophes (as in contractions). Consider the string: "I'm going to the mall", vs. the otherwise escaped version: 'I\'m going to the mall'.
Double quotes mean a string in many other languages. When you learn a new language like Java or C, double quotes are always used. In Ruby, PHP and Perl, single-quoted strings imply no backslash escapes while double quotes support them.
JSON notation is written with double quotes.
Nonetheless, as others have stated, it is most important to remain consistent.
Section 7.8.4 of the specification describes literal string notation. The only difference is that DoubleStringCharacter is "SourceCharacter but not double-quote" and SingleStringCharacter is "SourceCharacter but not single-quote". So the only difference can be demonstrated thusly:
'A string that\'s single quoted'
"A string that's double quoted"
So it depends on how much quote escaping you want to do. Obviously the same applies to double quotes in double quoted strings.
I'd like to say the difference is purely stylistic, but I'm really having my doubts. Consider the following example:
/*
Add trim() functionality to JavaScript...
1. By extending the String prototype
2. By creating a 'stand-alone' function
This is just to demonstrate results are the same in both cases.
*/
// Extend the String prototype with a trim() method
String.prototype.trim = function() {
return this.replace(/^\s+|\s+$/g, '');
};
// 'Stand-alone' trim() function
function trim(str) {
return str.replace(/^\s+|\s+$/g, '');
};
document.writeln(String.prototype.trim);
document.writeln(trim);
In Safari, Chrome, Opera, and Internet Explorer (tested in Internet Explorer 7 and Internet Explorer 8), this will return the following:
function () {
return this.replace(/^\s+|\s+$/g, '');
}
function trim(str) {
return str.replace(/^\s+|\s+$/g, '');
}
However, Firefox will yield a slightly different result:
function () {
return this.replace(/^\s+|\s+$/g, "");
}
function trim(str) {
return str.replace(/^\s+|\s+$/g, "");
}
The single quotes have been replaced by double quotes. (Also note how the indenting space was replaced by four spaces.) This gives the impression that at least one browser parses JavaScript internally as if everything was written using double quotes. One might think, it takes Firefox less time to parse JavaScript if everything is already written according to this 'standard'.
Which, by the way, makes me a very sad panda, since I think single quotes look much nicer in code. Plus, in other programming languages, they're usually faster to use than double quotes, so it would only make sense if the same applied to JavaScript.
Conclusion: I think we need to do more research on this.
This might explain Peter-Paul Koch's test results from back in 2003.
It seems that single quotes are sometimes faster in Explorer Windows (roughly 1/3 of my tests did show a faster response time), but if Mozilla shows a difference at all, it handles double quotes slightly faster. I found no difference at all in Opera.
2014: Modern versions of Firefox/Spidermonkey don’t do this anymore.
If you're doing inline JavaScript (arguably a "bad" thing, but avoiding that discussion) single quotes are your only option for string literals, I believe.
E.g., this works fine:
<a onclick="alert('hi');">hi</a>
But you can't wrap the "hi" in double quotes, via any escaping method I'm aware of. Even " which would have been my best guess (since you're escaping quotes in an attribute value of HTML) doesn't work for me in Firefox. " won't work either because at this point you're escaping for HTML, not JavaScript.
So, if the name of the game is consistency, and you're going to do some inline JavaScript in parts of your application, I think single quotes are the winner. Someone please correct me if I'm wrong though.
Technically there's no difference. It's only matter of style and convention.
Douglas Crockford1 recommends using double quotes.
I personally follow that.
Strictly speaking, there is no difference in meaning; so the choice comes down to convenience.
Here are several factors that could influence your choice:
House style: Some groups of developers already use one convention or the other.
Client-side requirements: Will you be using quotes within the strings? (See Ady's answer.)
Server-side language: VB.NET people might choose to use single quotes for JavaScript so that the scripts can be built server-side (VB.NET uses double-quotes for strings, so the JavaScript strings are easy to distinguished if they use single quotes).
Library code: If you're using a library that uses a particular style, you might consider using the same style yourself.
Personal preference: You might think one or other style looks better.
Just keep consistency in what you use. But don't let down your comfort level.
"This is my string."; // :-|
"I'm invincible."; // Comfortable :)
'You can\'t beat me.'; // Uncomfortable :(
'Oh! Yes. I can "beat" you.'; // Comfortable :)
"Do you really think, you can \"beat\" me?"; // Uncomfortable :(
"You're my guest. I can \"beat\" you."; // Sometimes, you've to :P
'You\'re my guest too. I can "beat" you too.'; // Sometimes, you've to :P
ECMAScript 6 update
Using template literal syntax.
`Be "my" guest. You're in complete freedom.`; // Most comfort :D
I hope I am not adding something obvious, but I have been struggling with Django, Ajax, and JSON on this.
Assuming that in your HTML code you do use double quotes, as normally should be, I highly suggest to use single quotes for the rest in JavaScript.
So I agree with ady, but with some care.
My bottom line is:
In JavaScript it probably doesn't matter, but as soon as you embed it inside HTML or the like you start to get troubles. You should know what is actually escaping, reading, passing your string.
My simple case was:
tbox.innerHTML = tbox.innerHTML + '<div class="thisbox_des" style="width:210px;" onmouseout="clear()"><a href="/this/thislist/'
+ myThis[i].pk +'"><img src="/site_media/'
+ myThis[i].fields.thumbnail +'" height="80" width="80" style="float:left;" onmouseover="showThis('
+ myThis[i].fields.left +','
+ myThis[i].fields.right +',\''
+ myThis[i].fields.title +'\')"></a><p style="float:left;width:130px;height:80px;"><b>'
+ myThis[i].fields.title +'</b> '
+ myThis[i].fields.description +'</p></div>'
You can spot the ' in the third field of showThis.
The double quote didn't work!
It is clear why, but it is also clear why we should stick to single quotes... I guess...
This case is a very simple HTML embedding, and the error was generated by a simple copy/paste from a 'double quoted' JavaScript code.
So to answer the question:
Try to use single quotes while within HTML. It might save a couple of debug issues...
It's mostly a matter of style and preference. There are some rather interesting and useful technical explorations in the other answers, so perhaps the only thing I might add is to offer a little worldly advice.
If you're coding in a company or team, then it's probably a good idea to follow the "house style".
If you're alone hacking a few side projects, then look at a few prominent leaders in the community. For example, let's say you getting into Node.js. Take a look at core modules, for example, Underscore.js or express and see what convention they use, and consider following that.
If both conventions are equally used, then defer to your personal preference.
If you don't have any personal preference, then flip a coin.
If you don't have a coin, then beer is on me ;)
I am not sure if this is relevant in today's world, but double quotes used to be used for content that needed to have control characters processed and single quotes for strings that didn't.
The compiler will run string manipulation on a double quoted string while leaving a single quoted string literally untouched. This used to lead to 'good' developers choosing to use single quotes for strings that didn't contain control characters like \n or \0 (not processed within single quotes) and double quotes when they needed the string parsed (at a slight cost in CPU cycles for processing the string).
If you are using JSHint, it will raise an error if you use a double quoted string.
I used it through the Yeoman scafflholding of AngularJS, but maybe there is somehow a manner to configure this.
By the way, when you handle HTML into JavaScript, it's easier to use single quote:
var foo = '<div class="cool-stuff">Cool content</div>';
And at least JSON is using double quotes to represent strings.
There isn't any trivial way to answer to your question.
There isn't any difference between single and double quotes in JavaScript.
The specification is important:
Maybe there are performance differences, but they are absolutely minimum and can change any day according to browsers' implementation. Further discussion is futile unless your JavaScript application is hundreds of thousands lines long.
It's like a benchmark if
a=b;
is faster than
a = b;
(extra spaces)
today, in a particular browser and platform, etc.
Examining the pros and cons
In favor of single quotes
Less visual clutter.
Generating HTML: HTML attributes are usually delimited by double quotes.
elem.innerHTML = 'Hello';
However, single quotes are just as legal in HTML.
elem.innerHTML = "<a href='" + url + "'>Hello</a>";
Furthermore, inline HTML is normally an anti-pattern. Prefer templates.
Generating JSON: Only double quotes are allowed in JSON.
myJson = '{ "hello world": true }';
Again, you shouldn’t have to construct JSON this way. JSON.stringify() is often enough. If not, use templates.
In favor of double quotes
Doubles are easier to spot if you don't have color coding. Like in a console log or some kind of view-source setup.
Similarity to other languages: In shell programming (Bash etc.), single-quoted string literals exist, but escapes are not interpreted inside them. C and Java use double quotes for strings and single quotes for characters.
If you want code to be valid JSON, you need to use double quotes.
In favor of both
There is no difference between the two in JavaScript. Therefore, you can use whatever is convenient at the moment. For example, the following string literals all produce the same string:
"He said: \"Let's go!\""
'He said: "Let\'s go!"'
"He said: \"Let\'s go!\""
'He said: \"Let\'s go!\"'
Single quotes for internal strings and double for external. That allows you to distinguish internal constants from strings that are to be displayed to the user (or written to disk etc.). Obviously, you should avoid putting the latter in your code, but that can’t always be done.
Talking about performance, quotes will never be your bottleneck. However, the performance is the same in both cases.
Talking about coding speed, if you use ' for delimiting a string, you will need to escape " quotes. You are more likely to need to use " inside the string. Example:
// JSON Objects:
var jsonObject = '{"foo":"bar"}';
// HTML attributes:
document.getElementById("foobar").innerHTML = '<input type="text">';
Then, I prefer to use ' for delimiting the string, so I have to escape fewer characters.
There are people that claim to see performance differences: old mailing list thread. But I couldn't find any of them to be confirmed.
The main thing is to look at what kind of quotes (double or single) you are using inside your string. It helps to keep the number of escapes low. For instance, when you are working with HTML content inside your strings, it is easier to use single quotes so that you don't have to escape all double quotes around the attributes.
When using CoffeeScript I use double quotes. I agree that you should pick either one and stick to it. CoffeeScript gives you interpolation when using the double quotes.
"This is my #{name}"
ECMAScript 6 is using back ticks (`) for template strings. Which probably has a good reason, but when coding, it can be cumbersome to change the string literals character from quotes or double quotes to backticks in order to get the interpolation feature. CoffeeScript might not be perfect, but using the same string literals character everywhere (double quotes) and always be able to interpolate is a nice feature.
`This is my ${name}`
I would use double quotes when single quotes cannot be used and vice versa:
"'" + singleQuotedValue + "'"
'"' + doubleQuotedValue + '"'
Instead of:
'\'' + singleQuotedValue + '\''
"\"" + doubleQuotedValue + "\""
There is strictly no difference, so it is mostly a matter of taste and of what is in the string (or if the JavaScript code itself is in a string), to keep number of escapes low.
The speed difference legend might come from PHP world, where the two quotes have different behavior.
The difference is purely stylistic. I used to be a double-quote Nazi. Now I use single quotes in nearly all cases. There's no practical difference beyond how your editor highlights the syntax.
You can use single quotes or double quotes.
This enables you for example to easily nest JavaScript content inside HTML attributes, without the need to escape the quotes.
The same is when you create JavaScript with PHP.
The general idea is: if it is possible use such quotes that you won't need to escape.
Less escaping = better code.
In addition, it seems the specification (currently mentioned at MDN) doesn't state any difference between single and double quotes except closing and some unescaped few characters. However, template literal (` - the backtick character) assumes additional parsing/processing.
A string literal is 0 or more Unicode code points enclosed in single or double quotes. Unicode code points may also be represented by an escape sequence. All code points may appear literally in a string literal except for the closing quote code points, U+005C (REVERSE SOLIDUS), U+000D (CARRIAGE RETURN), and U+000A (LINE FEED). Any code points may appear in the form of an escape sequence. String literals evaluate to ECMAScript String values...
Source: https://tc39.es/ecma262/#sec-literals-string-literals

Get between curly braces including the first and the last curly braces [duplicate]

Locked. Comments on this question have been disabled, but it is still accepting new answers and other interactions. Learn more.
I need to match all of these opening tags:
<p>
<a href="foo">
But not these:
<br />
<hr class="foo" />
I came up with this and wanted to make sure I've got it right. I am only capturing the a-z.
<([a-z]+) *[^/]*?>
I believe it says:
Find a less-than, then
Find (and capture) a-z one or more times, then
Find zero or more spaces, then
Find any character zero or more times, greedy, except /, then
Find a greater-than
Do I have that right? And more importantly, what do you think?
Locked. There are disputes about this answer’s content being resolved at this time. It is not currently accepting new interactions.
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the n​erves of the sentient whilst you observe, your psyche withering in the onslaught of horror. Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the transgression of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of reg​ex parsers for HTML will ins​tantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection wil​l devour your HT​ML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fi​ght he com̡e̶s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain, the song of re̸gular exp​ression parsing will exti​nguish the voices of mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t​he final snuffing of the lie​s of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE ᵒh god no NO NOO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ
Have you tried using an XML parser instead?
Moderator's Note
This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look - there are no problems with its content. Please do not flag it for our attention.
While arbitrary HTML with only a regex is impossible, it's sometimes appropriate to use them for parsing a limited, known set of HTML.
If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament's web site. This was a limited, one-time job.
Regexes worked just fine for me, and were very fast to set up.
I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and a regular expression is a Chomsky Type 3 grammar (regular grammar). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar (see the Chomsky hierarchy), you can't possibly make this work.
But many will try, and some will even claim success - but until others find the fault and totally mess you up.
Disclaimer: use a parser if you have the option. That said...
This is the regex I use (!) to match HTML tags:
<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>
It may not be perfect, but I ran this code through a lot of HTML. Note that it even catches strange things like <a name="badgenerator"">, which show up on the web.
I guess to make it not match self contained tags, you'd either want to use Kobi's negative look-behind:
<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+(?<!/\s*)>
or just combine if and if not.
To downvoters: This is working code from an actual product. I doubt anyone reading this page will get the impression that it is socially acceptable to use regexes on HTML.
Caveat: I should note that this regex still breaks down in the presence of CDATA blocks, comments, and script and style elements. Good news is, you can get rid of those using a regex...
There are people that will tell you that the Earth is round (or perhaps that the Earth is an oblate spheroid if they want to use strange words). They are lying.
There are people that will tell you that Regular Expressions shouldn't be recursive. They are limiting you. They need to subjugate you, and they do it by keeping you in ignorance.
You can live in their reality or take the red pill.
Like Lord Marshal (is he a relative of the Marshal .NET class?), I have seen the Underverse Stack Based Regex-Verse and returned with powers knowledge you can't imagine. Yes, I think there were an Old One or two protecting them, but they were watching football on the TV, so it wasn't difficult.
I think the XML case is quite simple. The RegEx (in the .NET syntax), deflated and coded in base64 to make it easier to comprehend by your feeble mind, should be something like this:
7L0HYBxJliUmL23Ke39K9UrX4HShCIBgEyTYkEAQ7MGIzeaS7B1pRyMpqyqBymVWZV1mFkDM7Z28
995777333nvvvfe6O51OJ/ff/z9cZmQBbPbOStrJniGAqsgfP358Hz8itn6Po9/3eIue3+Px7/3F
86enJ8+/fHn64ujx7/t7vFuUd/Dx65fHJ6dHW9/7fd/t7fy+73Ye0v+f0v+Pv//JnTvureM3b169
OP7i9Ogyr5uiWt746u+BBqc/8dXx86PP7tzU9mfQ9tWrL18d3UGnW/z7nZ9htH/y9NXrsy9fvPjq
i5/46ss3p4z+x3e8b452f9/x93a2HxIkH44PpgeFyPD6lMAEHUdbcn8ffTP9fdTrz/8rBPCe05Iv
p9WsWF788Obl9MXJl0/PXnwONLozY747+t7x9k9l2z/4vv4kqo1//993+/vf2kC5HtwNcxXH4aOf
LRw2z9/v8WEz2LTZcpaV1TL/4c3h66ex2Xv95vjF0+PnX744PbrOm59ZVhso5UHYME/dfj768H7e
Yy5uQUydDAH9+/4eR11wHbqdfPnFF6cv3ogq/V23t++4z4620A13cSzd7O1s/77rpw+ePft916c7
O/jj2bNnT7e/t/397//M9+ibA/7s6ZNnz76PP0/kT2rz/Ts/s/0NArvziYxVEZWxbm93xsrUfnlm
rASN7Hf93u/97vvf+2Lx/e89L7+/FSXiz4Bkd/hF5mVq9Yik7fcncft9350QCu+efkr/P6BfntEv
z+iX9c4eBrFz7wEwpB9P+d9n9MfuM3yzt7Nzss0/nuJfbra3e4BvZFR7z07pj3s7O7uWJM8eCkme
nuCPp88MfW6kDeH7+26PSTX8vu+ePAAiO4LVp4zIPWC1t7O/8/+pMX3rzo2KhL7+8s23T1/RhP0e
vyvm8HbsdmPXYDVhtpdnAzJ1k1jeufOtUAM8ffP06Zcnb36fl6dPXh2f/F6nRvruyHfMd9rgJp0Y
gvsRx/6/ZUzfCtX4e5hTndGzp5jQo9e/z+s3p1/czAUMlts+P3tz+uo4tISd745uJxvb3/v4ZlWs
mrjfd9SG/swGPD/6+nh+9MF4brTBRmh1Tl5+9eT52ckt5oR0xldPzp7GR8pfuXf5PWJv4nJIwvbH
W3c+GY3vPvrs9zj8Xb/147/n7/b7/+52DD2gsSH8zGDvH9+i9/fu/PftTfTXYf5hB+9H7P1BeG52
MTtu4S2cTAjDizevv3ry+vSNb8N+3+/1po2anj4/hZsGt3TY4GmjYbEKDJ62/pHB+3/LmL62wdsU
1J18+eINzTJr3dMvXr75fX7m+MXvY9XxF2e/9+nTgPu2bgwh5U0f7u/74y9Pnh6/OX4PlA2UlwTn
xenJG8L996VhbP3++PCrV68QkrjveITxr2TIt+lL+f3k22fPn/6I6f/fMqZvqXN/K4Xps6sazUGZ
GeQlar49xEvajzI35VRevDl78/sc/b7f6jkG8Va/x52N4L9lBe/kZSh1hr9fPj19+ebbR4AifyuY
12efv5CgGh9TroR6Pj2l748iYxYgN8Z7pr0HzRLg66FnRvcjUft/45i+pRP08vTV6TOe2N/9jv37
R9P0/5YxbXQDeK5E9R12XdDA/4zop+/9Ht/65PtsDVlBBUqko986WsDoWqvbPD2gH/T01DAC1NVn
3/uZ0feZ+T77fd/GVMkA4KjeMcg6RcvQLRl8HyPaWVStdv17PwHV0bOB9xUh7rfMp5Zu3icBJp25
D6f0NhayHyfI3HXHY6YYCw7Pz17fEFhQKzS6ZWChrX+kUf7fMqavHViEPPKjCf1/y5hukcyPTvjP
mHQCppRDN4nbVFPaT8+ekpV5/TP8g/79mVPo77PT1/LL7/MzL7548+XvdfritflFY00fxIsvSQPS
mvctdYZpbt7vxKRfj3018OvC/hEf/79lTBvM3debWj+b8KO0wP+3OeM2aYHumuCAGonmCrxw9cVX
X1C2d4P+uSU7eoBUMzI3/f9udjbYl/el04dI7s8fan8dWRjm6gFx+NrKeFP+WX0CxBdPT58df/X8
DaWLX53+xFdnr06f/szv++NnX7x8fnb6NAhIwsbPkPS7iSUQAFETvP2Tx8+/Og0Xt/yBvDn9vd/c
etno8S+81QKXptq/ffzKZFZ+4e/743e8zxino+8RX37/k595h5/H28+y7fPv490hQdJ349E+txB3
zPZ5J/jsR8bs/y1j2hh/2fkayOqEmYcej0cXUWMN7QrqBwjDrVZRfyQM3xjj/EgYvo4wfLTZrnVS
ebdKq0XSZJvzajKQDUv1/P3NwbEP7cN5+Odivv9/ysPfhHfkOP6b9Fl+91v7LD9aCvp/+Zi+7lLQ
j0zwNzYFP+/Y6r1NcFeDbfBIo8rug3zS3/3WPumPlN3/y8f0I2X3cz4FP+/Y6htSdr2I42fEuSPX
/ewpL4e9/n1evzn94hb+Plpw2+dnbyh79zx0CsPvbq0lb+UQ/h7xvqPq/Gc24PnR18fzVrp8I57d
mehj7ebk5VdPnp+d3GJOSP189eTsaXyk/JV7l98j4SAZgRxtf7x155PR+O6jz36Pw9/1Wz/+e/5u
v//vbsfQAxobws8M9v7xLXp/785/395ED4nO1wx5fsTeH4LnRva+eYY8rpZUBFb/j/jfm8XAvfEj
4/b/ljF1F9B/jx5PhAkp1nu/+y3n+kdZp/93jWmjJ/M11TG++VEG6puZn593PPejoOyHMQU/79jq
GwrKfpSB+tmcwZ93XPkjZffDmIKfd2z1DSm7bmCoPPmjBNT74XkrVf71I/Sf6wTU7XJA4RB+lIC6
mW1+xN5GWw1/683C5rnj/m364cmr45Pf6/SN9H4Us4LISn355vjN2ZcvtDGT6fHvapJcMISmxc0K
MAD4IyP6/5Yx/SwkP360FvD1VTH191mURr/HUY+2P3I9boPnz7Ju/pHrcWPnP3I9/r/L3sN0v52z
0fEgNrgbL8/Evfh9fw/q5Xf93u/97vvf+2Lx/e89L7+/Fe3iZ37f34P5h178kTfx/5YxfUs8vY26
7/d4/OWbb5++ogn7PX5XzOHtOP3GrsHmqobOVO/8Hh1Gk/TPl198QS6w+rLb23fcZ0fMaTfjsv29
7Zul7me2v0FgRoYVURnf9nZEkDD+H2VDf8hjeq8xff1s6GbButNLacEtefHm9VdPXp++CRTw7/v9
r6vW8b9eJ0+/PIHzs1HHdyKE/x9L4Y+s2f+PJPX/1dbsJn3wrY6wiqv85vjVm9Pnp+DgN8efM5va
j794+eb36Xz3mAf5+58+f3r68s230dRvJcxKn/l//oh3f+7H9K2O0r05PXf85s2rH83f/1vGdAvd
w+qBFqsoWvzspozD77EpXYeZ7yzdfxy0ec+l+8e/8FbR84+Wd78xbvn/qQQMz/J7L++GPB7N0MQa
2vTMBwjDrVI0PxKGb4xxfiQMX0cYPuq/Fbx2C1sU8yEF+F34iNsx1xOGa9t6l/yX70uqmxu+qBGm
AxlxWwVS11O97ULqlsFIUvUnT4/fHIuL//3f9/t9J39Y9m8W/Tuc296yUeX/b0PiHwUeP1801Y8C
j/9vz9+PAo8f+Vq35Jb/n0rAz7Kv9aPA40fC8P+RMf3sC8PP08DjR1L3DXHoj6SuIz/CCghZNZb8
fb/Hf/2+37tjvuBY9vu3jmRvxNeGgQAuaAF6Pwj8/+e66M8/7rwpRNj6uVwXZRl52k0n3FVl95Q+
+fz0KSu73/dtkGDYdvZgSP5uskadrtViRKyal2IKAiQfiW+FI+tET/9/Txj9SFf8SFf8rOuKzagx
+r/vD34mUADO1P4/AQAA//8=
The options to set is RegexOptions.ExplicitCapture. The capture group you are looking for is ELEMENTNAME. If the capture group ERROR is not empty then there was a parsing error and the Regex stopped.
If you have problems reconverting it to a human-readable regex, this should help:
static string FromBase64(string str)
{
byte[] byteArray = Convert.FromBase64String(str);
using (var msIn = new MemoryStream(byteArray))
using (var msOut = new MemoryStream()) {
using (var ds = new DeflateStream(msIn, CompressionMode.Decompress)) {
ds.CopyTo(msOut);
}
return Encoding.UTF8.GetString(msOut.ToArray());
}
}
If you are unsure, no, I'm NOT kidding (but perhaps I'm lying). It WILL work. I've built tons of unit tests to test it, and I have even used (part of) the conformance tests. It's a tokenizer, not a full-blown parser, so it will only split the XML into its component tokens. It won't parse/integrate DTDs.
Oh... if you want the source code of the regex, with some auxiliary methods:
regex to tokenize an xml or the full plain regex
In shell, you can parse HTML using sed:
Turing.sed
Write HTML parser (homework)
???
Profit!
Related (why you shouldn't use regex match):
If You Like Regular Expressions So Much, Why Don't You Marry Them?
Regular Expressions: Now You Have Two Problems
Hacking stackoverflow.com's HTML sanitizer
I agree that the right tool to parse XML and especially HTML is a parser and not a regular expression engine. However, like others have pointed out, sometimes using a regex is quicker, easier, and gets the job done if you know the data format.
Microsoft actually has a section of Best Practices for Regular Expressions in the .NET Framework and specifically talks about Consider[ing] the Input Source.
Regular Expressions do have limitations, but have you considered the following?
The .NET framework is unique when it comes to regular expressions in that it supports Balancing Group Definitions.
See Matching Balanced Constructs with .NET Regular Expressions
See .NET Regular Expressions: Regex and Balanced Matching
See Microsoft's docs on Balancing Group Definitions
For this reason, I believe you CAN parse XML using regular expressions. Note however, that it must be valid XML (browsers are very forgiving of HTML and allow bad XML syntax inside HTML). This is possible since the "Balancing Group Definition" will allow the regular expression engine to act as a PDA.
Quote from article 1 cited above:
.NET Regular Expression Engine
As described above properly balanced constructs cannot be described by
a regular expression. However, the .NET regular expression engine
provides a few constructs that allow balanced constructs to be
recognized.
(?<group>) - pushes the captured result on the capture stack with
the name group.
(?<-group>) - pops the top most capture with the name group off the
capture stack.
(?(group)yes|no) - matches the yes part if there exists a group
with the name group otherwise matches no part.
These constructs allow for a .NET regular expression to emulate a
restricted PDA by essentially allowing simple versions of the stack
operations: push, pop and empty. The simple operations are pretty much
equivalent to increment, decrement and compare to zero respectively.
This allows for the .NET regular expression engine to recognize a
subset of the context-free languages, in particular the ones that only
require a simple counter. This in turn allows for the non-traditional
.NET regular expressions to recognize individual properly balanced
constructs.
Consider the following regular expression:
(?=<ul\s+id="matchMe"\s+type="square"\s*>)
(?>
<!-- .*? --> |
<[^>]*/> |
(?<opentag><(?!/)[^>]*[^/]>) |
(?<-opentag></[^>]*[^/]>) |
[^<>]*
)*
(?(opentag)(?!))
Use the flags:
Singleline
IgnorePatternWhitespace (not necessary if you collapse regex and remove all whitespace)
IgnoreCase (not necessary)
Regular Expression Explained (inline)
(?=<ul\s+id="matchMe"\s+type="square"\s*>) # match start with <ul id="matchMe"...
(?> # atomic group / don't backtrack (faster)
<!-- .*? --> | # match xml / html comment
<[^>]*/> | # self closing tag
(?<opentag><(?!/)[^>]*[^/]>) | # push opening xml tag
(?<-opentag></[^>]*[^/]>) | # pop closing xml tag
[^<>]* # something between tags
)* # match as many xml tags as possible
(?(opentag)(?!)) # ensure no 'opentag' groups are on stack
You can try this at A Better .NET Regular Expression Tester.
I used the sample source of:
<html>
<body>
<div>
<br />
<ul id="matchMe" type="square">
<li>stuff...</li>
<li>more stuff</li>
<li>
<div>
<span>still more</span>
<ul>
<li>Another >ul<, oh my!</li>
<li>...</li>
</ul>
</div>
</li>
</ul>
</div>
</body>
</html>
This found the match:
<ul id="matchMe" type="square">
<li>stuff...</li>
<li>more stuff</li>
<li>
<div>
<span>still more</span>
<ul>
<li>Another >ul<, oh my!</li>
<li>...</li>
</ul>
</div>
</li>
</ul>
although it actually came out like this:
<ul id="matchMe" type="square"> <li>stuff...</li> <li>more stuff</li> <li> <div> <span>still more</span> <ul> <li>Another >ul<, oh my!</li> <li>...</li> </ul> </div> </li> </ul>
Lastly, I really enjoyed Jeff Atwood's article: Parsing Html The Cthulhu Way. Funny enough, it cites the answer to this question that currently has over 4k votes.
I suggest using QueryPath for parsing XML and HTML in PHP. It's basically much the same syntax as jQuery, only it's on the server side.
While the answers that you can't parse HTML with regexes are correct, they don't apply here. The OP just wants to parse one HTML tag with regexes, and that is something that can be done with a regular expression.
The suggested regex is wrong, though:
<([a-z]+) *[^/]*?>
If you add something to the regex, by backtracking it can be forced to match silly things like <a >>, [^/] is too permissive. Also note that <space>*[^/]* is redundant, because the [^/]* can also match spaces.
My suggestion would be
<([a-z]+)[^>]*(?<!/)>
Where (?<! ... ) is (in Perl regexes) the negative look-behind. It reads "a <, then a word, then anything that's not a >, the last of which may not be a /, followed by >".
Note that this allows things like <a/ > (just like the original regex), so if you want something more restrictive, you need to build a regex to match attribute pairs separated by spaces.
Sun Tzu, an ancient Chinese strategist, general, and philosopher, said:
It is said that if you know your enemies and know yourself, you can win a hundred battles without a single loss.
If you only know yourself, but not your opponent, you may win or may lose.
If you know neither yourself nor your enemy, you will always endanger yourself.
In this case your enemy is HTML and you are either yourself or regex. You might even be Perl with irregular regex. Know HTML. Know yourself.
I have composed a haiku describing the nature of HTML.
HTML has
complexity exceeding
regular language.
I have also composed a haiku describing the nature of regex in Perl.
The regex you seek
is defined within the phrase
<([a-zA-Z]+)(?:[^>]*[^/]*)?>
Try:
<([^\s]+)(\s[^>]*?)?(?<!/)>
It is similar to yours, but the last > must not be after a slash, and also accepts h1.
<?php
$selfClosing = explode(',', 'area,base,basefont,br,col,frame,hr,img,input,isindex,link,meta,param,embed');
$html = '
<p>foo</p>
<hr/>
<br/>
<div>name</div>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$els = $dom->getElementsByTagName('*');
foreach ( $els as $el ) {
$nodeName = strtolower($el->nodeName);
if ( !in_array( $nodeName, $selfClosing ) ) {
var_dump( $nodeName );
}
}
Output:
string(4) "html"
string(4) "body"
string(1) "p"
string(1) "a"
string(3) "div"
Basically just define the element node names that are self closing, load the whole html string into a DOM library, grab all elements, loop through and filter out ones which aren't self closing and operate on them.
I'm sure you already know by now that you shouldn't use regex for this purpose.
I don't know your exact need for this, but if you are also using .NET, couldn't you use Html Agility Pack?
Excerpt:
It is a .NET code library that allows
you to parse "out of the web" HTML
files. The parser is very tolerant
with "real world" malformed HTML.
You want the first > not preceded by a /. Look here for details on how to do that. It's referred to as negative lookbehind.
However, a naïve implementation of that will end up matching <bar/></foo> in this example document
<foo><bar/></foo>
Can you provide a little more information on the problem you're trying to solve? Are you iterating through tags programatically?
If you need this for PHP:
The PHP DOM functions won't work properly unless it is properly formatted XML. No matter how much better their use is for the rest of mankind.
simplehtmldom is good, but I found it a bit buggy, and it is is quite memory heavy [Will crash on large pages.]
I have never used querypath, so can't comment on its usefulness.
Another one to try is my DOMParser which is very light on resources and I've been using happily for a while. Simple to learn & powerful.
For Python and Java, similar links were posted.
For the downvoters - I only wrote my class when the XML parsers proved unable to withstand real use. Religious downvoting just prevents useful answers from being posted - keep things within perspective of the question, please.
Here's the solution:
<?php
// here's the pattern:
$pattern = '/<(\w+)(\s+(\w+)\s*\=\s*(\'|")(.*?)\\4\s*)*\s*(\/>|>)/';
// a string to parse:
$string = 'Hello, try clicking here
<br/>and check out.<hr />
<h2>title</h2>
<a name ="paragraph" rel= "I\'m an anchor"></a>
Fine, <span title=\'highlight the "punch"\'>thanks<span>.
<div class = "clear"></div>
<br>';
// let's get the occurrences:
preg_match_all($pattern, $string, $matches, PREG_PATTERN_ORDER);
// print the result:
print_r($matches[0]);
?>
To test it deeply, I entered in the string auto-closing tags like:
<hr />
<br/>
<br>
I also entered tags with:
one attribute
more than one attribute
attributes which value is bound either into single quotes or into double quotes
attributes containing single quotes when the delimiter is a double quote and vice versa
"unpretty" attributes with a space before the "=" symbol, after it and both before and after it.
Should you find something which does not work in the proof of concept above, I am available in analyzing the code to improve my skills.
<EDIT>
I forgot that the question from the user was to avoid the parsing of self-closing tags.
In this case the pattern is simpler, turning into this:
$pattern = '/<(\w+)(\s+(\w+)\s*\=\s*(\'|")(.*?)\\4\s*)*\s*>/';
The user #ridgerunner noticed that the pattern does not allow unquoted attributes or attributes with no value. In this case a fine tuning brings us the following pattern:
$pattern = '/<(\w+)(\s+(\w+)(\s*\=\s*(\'|"|)(.*?)\\5\s*)?)*\s*>/';
</EDIT>
Understanding the pattern
If someone is interested in learning more about the pattern, I provide some line:
the first sub-expression (\w+) matches the tag name
the second sub-expression contains the pattern of an attribute. It is composed by:
one or more whitespaces \s+
the name of the attribute (\w+)
zero or more whitespaces \s* (it is possible or not, leaving blanks here)
the "=" symbol
again, zero or more whitespaces
the delimiter of the attribute value, a single or double quote ('|"). In the pattern, the single quote is escaped because it coincides with the PHP string delimiter. This sub-expression is captured with the parentheses so it can be referenced again to parse the closure of the attribute, that's why it is very important.
the value of the attribute, matched by almost anything: (.*?); in this specific syntax, using the greedy match (the question mark after the asterisk) the RegExp engine enables a "look-ahead"-like operator, which matches anything but what follows this sub-expression
here comes the fun: the \4 part is a backreference operator, which refers to a sub-expression defined before in the pattern, in this case, I am referring to the fourth sub-expression, which is the first attribute delimiter found
zero or more whitespaces \s*
the attribute sub-expression ends here, with the specification of zero or more possible occurrences, given by the asterisk.
Then, since a tag may end with a whitespace before the ">" symbol, zero or more whitespaces are matched with the \s* subpattern.
The tag to match may end with a simple ">" symbol, or a possible XHTML closure, which makes use of the slash before it: (/>|>). The slash is, of course, escaped since it coincides with the regular expression delimiter.
Small tip: to better analyze this code it is necessary looking at the source code generated since I did not provide any HTML special characters escaping.
Whenever I need to quickly extract something from an HTML document, I use Tidy to convert it to XML and then use XPath or XSLT to get what I need.
In your case, something like this:
//p/a[#href='foo']
I used a open source tool called HTMLParser before. It's designed to parse HTML in various ways and serves the purpose quite well. It can parse HTML as different treenode and you can easily use its API to get attributes out of the node. Check it out and see if this can help you.
I like to parse HTML with regular expressions. I don't attempt to parse idiot HTML that is deliberately broken. This code is my main parser (Perl edition):
$_ = join "",<STDIN>; tr/\n\r \t/ /s; s/</\n</g; s/>/>\n/g; s/\n ?\n/\n/g;
s/^ ?\n//s; s/ $//s; print
It's called htmlsplit, splits the HTML into lines, with one tag or chunk of text on each line. The lines can then be processed further with other text tools and scripts, such as grep, sed, Perl, etc. I'm not even joking :) Enjoy.
It is simple enough to rejig my slurp-everything-first Perl script into a nice streaming thing, if you wish to process enormous web pages. But it's not really necessary.
HTML Split
Some better regular expressions:
/(<.*?>|[^<]+)\s*/g # Get tags and text
/(\w+)="(.*?)"/g # Get attibutes
They are good for XML / XHTML.
With minor variations, it can cope with messy HTML... or convert the HTML -> XHTML first.
The best way to write regular expressions is in the Lex / Yacc style, not as opaque one-liners or commented multi-line monstrosities. I didn't do that here, yet; these ones barely need it.
There are some nice regexes for replacing HTML with BBCode here. For all you nay-sayers, note that he's not trying to fully parse HTML, just to sanitize it. He can probably afford to kill off tags that his simple "parser" can't understand.
For example:
$store =~ s/http:/http:\/\//gi;
$store =~ s/https:/https:\/\//gi;
$baseurl = $store;
if (!$query->param("ascii")) {
$html =~ s/\s\s+/\n/gi;
$html =~ s/<pre(.*?)>(.*?)<\/pre>/\[code]$2\[\/code]/sgmi;
}
$html =~ s/\n//gi;
$html =~ s/\r\r//gi;
$html =~ s/$baseurl//gi;
$html =~ s/<h[1-7](.*?)>(.*?)<\/h[1-7]>/\n\[b]$2\[\/b]\n/sgmi;
$html =~ s/<p>/\n\n/gi;
$html =~ s/<br(.*?)>/\n/gi;
$html =~ s/<textarea(.*?)>(.*?)<\/textarea>/\[code]$2\[\/code]/sgmi;
$html =~ s/<b>(.*?)<\/b>/\[b]$1\[\/b]/gi;
$html =~ s/<i>(.*?)<\/i>/\[i]$1\[\/i]/gi;
$html =~ s/<u>(.*?)<\/u>/\[u]$1\[\/u]/gi;
$html =~ s/<em>(.*?)<\/em>/\[i]$1\[\/i]/gi;
$html =~ s/<strong>(.*?)<\/strong>/\[b]$1\[\/b]/gi;
$html =~ s/<cite>(.*?)<\/cite>/\[i]$1\[\/i]/gi;
$html =~ s/<font color="(.*?)">(.*?)<\/font>/\[color=$1]$2\[\/color]/sgmi;
$html =~ s/<font color=(.*?)>(.*?)<\/font>/\[color=$1]$2\[\/color]/sgmi;
$html =~ s/<link(.*?)>//gi;
$html =~ s/<li(.*?)>(.*?)<\/li>/\[\*]$2/gi;
$html =~ s/<ul(.*?)>/\[list]/gi;
$html =~ s/<\/ul>/\[\/list]/gi;
$html =~ s/<div>/\n/gi;
$html =~ s/<\/div>/\n/gi;
$html =~ s/<td(.*?)>/ /gi;
$html =~ s/<tr(.*?)>/\n/gi;
$html =~ s/<img(.*?)src="(.*?)"(.*?)>/\[img]$baseurl\/$2\[\/img]/gi;
$html =~ s/<a(.*?)href="(.*?)"(.*?)>(.*?)<\/a>/\[url=$baseurl\/$2]$4\[\/url]/gi;
$html =~ s/\[url=$baseurl\/http:\/\/(.*?)](.*?)\[\/url]/\[url=http:\/\/$1]$2\[\/url]/gi;
$html =~ s/\[img]$baseurl\/http:\/\/(.*?)\[\/img]/\[img]http:\/\/$1\[\/img]/gi;
$html =~ s/<head>(.*?)<\/head>//sgmi;
$html =~ s/<object>(.*?)<\/object>//sgmi;
$html =~ s/<script(.*?)>(.*?)<\/script>//sgmi;
$html =~ s/<style(.*?)>(.*?)<\/style>//sgmi;
$html =~ s/<title>(.*?)<\/title>//sgmi;
$html =~ s/<!--(.*?)-->/\n/sgmi;
$html =~ s/\/\//\//gi;
$html =~ s/http:\//http:\/\//gi;
$html =~ s/https:\//https:\/\//gi;
$html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gsi;
$html =~ s/\r\r//gi;
$html =~ s/\[img]\//\[img]/gi;
$html =~ s/\[url=\//\[url=/gi;
About the question of the regular expression methods to parse (x)HTML, the answer to all of the ones who spoke about some limits is: you have not been trained enough to rule the force of this powerful weapon, since nobody here spoke about recursion.
A regular expression-agnostic colleague notified me this discussion, which is not certainly the first on the web about this old and hot topic.
After reading some posts, the first thing I did was looking for the "?R" string in this thread. The second was to search about "recursion".
No, holy cow, no match found. Since nobody mentioned the main mechanism a parser is built onto, I was soon aware that nobody got the point.
If an (x)HTML parser needs recursion, a regular expression parser without recursion is not enough for the purpose. It's a simple construct.
The black art of regular expressions is hard to master, so maybe there are further possibilities we left out while trying and testing our personal solution to capture the whole web in one hand... Well, I am sure about it :)
Here's the magic pattern:
$pattern = "/<([\w]+)([^>]*?)(([\s]*\/>)|(>((([^<]*?|<\!\-\-.*?\-\->)|(?R))*)<\/\\1[\s]*>))/s";
Just try it. It's written as a PHP string, so the "s" modifier makes classes include newlines.
Here's a sample note on the PHP manual I wrote in January: Reference
(Take care. In that note I wrongly used the "m" modifier; it should be erased, notwithstanding it is discarded by the regular expression engine, since no ^ or $ anchoring was used).
Now, we could speak about the limits of this method from a more informed point of view:
according to the specific implementation of the regular expression engine, recursion may have a limit in the number of nested patterns parsed, but it depends on the language used
although corrupted, (x)HTML does not drive into severe errors. It is not sanitized.
Anyhow, it is only a regular expression pattern, but it discloses the possibility to develop of a lot of powerful implementations.
I wrote this pattern to power the recursive descent parser of a template engine I built in my framework, and performances are really great, both in execution times or in memory usage (nothing to do with other template engines which use the same syntax).
<\s*(\w+)[^/>]*>
The parts explained:
<: Starting character
\s*: It may have whitespaces before the tag name (ugly, but possible).
(\w+): tags can contain letters and numbers (h1). Well, \w also matches '_', but it does not hurt I guess. If curious, use ([a-zA-Z0-9]+) instead.
[^/>]*: Anything except > and / until closing >
>: Closing >
UNRELATED
And to the fellows, who underestimate regular expressions, saying they are only as powerful as regular languages:
anbanban which is not regular and not even context free, can be matched with ^(a+)b\1b\1$
Backreferencing FTW!
As many people have already pointed out, HTML is not a regular language which can make it very difficult to parse. My solution to this is to turn it into a regular language using a tidy program and then to use an XML parser to consume the results. There are a lot of good options for this. My program is written using Java with the jtidy library to turn the HTML into XML and then Jaxen to xpath into the result.
If you're simply trying to find those tags (without ambitions of parsing) try this regular expression:
/<[^/]*?>/g
I wrote it in 30 seconds, and tested here:
http://gskinner.com/RegExr/
It matches the types of tags you mentioned, while ignoring the types you said you wanted to ignore.
It seems to me you're trying to match tags without a "/" at the end. Try this:
<([a-zA-Z][a-zA-Z0-9]*)[^>]*(?<!/)>
It's true that when programming it's usually best to use dedicated parsers and APIs instead of regular expressions when dealing with HTML, especially if accuracy is paramount (e.g., if your processing might have security implications). However, I don’t ascribe to a dogmatic view that XML-style markup should never be processed with regular expressions. There are cases when regular expressions are a great tool for the job, such as when making one-time edits in a text editor, fixing broken XML files, or dealing with file formats that look like but aren’t quite XML. There are some issues to be aware of, but they're not insurmountable or even necessarily relevant.
A simple regex like <([^>"']|"[^"]*"|'[^']*')*> is usually good enough, in cases such as those I just mentioned. It's a naive solution, all things considered, but it does correctly allow unencoded > symbols in attribute values. If you're looking for, e.g., a table tag, you could adapt it as </?table\b([^>"']|"[^"]*"|'[^']*')*>.
Just to give a sense of what a more "advanced" HTML regex would look like, the following does a fairly respectable job of emulating real-world browser behavior and the HTML5 parsing algorithm:
</?([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)
The following matches a fairly strict definition of XML tags (although it doesn't account for the full set of Unicode characters allowed in XML names):
<(?:([_:A-Z][-.:\w]*)(?:\s+[_:A-Z][-.:\w]*\s*=\s*(?:"[^"]*"|'[^']*'))*\s*/?|/([_:A-Z][-.:\w]*)\s*)>
Granted, these don't account for surrounding context and a few edge cases, but even such things could be dealt with if you really wanted to (e.g., by searching between the matches of another regex).
At the end of the day, use the most appropriate tool for the job, even in the cases when that tool happens to be a regex.
Although it's not suitable and effective to use regular expressions for that purpose sometimes regular expressions provide quick solutions for simple match problems and in my view it's not that horrbile to use regular expressions for trivial works.
There is a definitive blog post about matching innermost HTML elements written by Steven Levithan.
If you only want the tag names, it should be possible to do this via a regular expression.
<([a-zA-Z]+)(?:[^>]*[^/] *)?>
should do what you need. But I think the solution of "moritz" is already fine. I didn't see it in the beginning.
For all downvoters: In some cases it just makes sense to use a regular expression, because it can be the easiest and quickest solution. I agree that in general you should not parse HTML with regular expressions.
But regular expressions can be a very powerful tool when you have a subset of HTML where you know the format and you just want to extract some values. I did that hundreds of times and almost always achieved what I wanted.
The OP doesn't seem to say what he needs to do with the tags. For example, does he need to extract inner text, or just examine the tags?
I'm firmly in the camp that says a regular expression is not the be-all, end-all text parser. I've written a large amount of text-parsing code including this code to parse HTML tags.
While it's true I'm not all that great with regular expressions, I consider regular expressions just too rigid and hard to maintain for this sort of parsing.
This may do:
<.*?[^/]>
Or without the ending tags:
<[^/].*?[^/]>
What's with the flame wars on HTML parsers? HTML parsers must parse (and rebuild!) the entire document before it can categorize your search. Regular expressions may be a faster / elegant in certain circumstances. My 2 cents...

What is the difference of operators " and ' in javascript [duplicate]

Consider the following two alternatives:
console.log("double");
console.log('single');
The former uses double quotes around the string, whereas the latter uses single quotes around the string.
I see more and more JavaScript libraries out there using single quotes when handling strings.
Are these two usages interchangeable? If not, is there an advantage in using one over the other?
The most likely reason for use of single vs. double in different libraries is programmer preference and/or API consistency. Other than being consistent, use whichever best suits the string.
Using the other type of quote as a literal:
alert('Say "Hello"');
alert("Say 'Hello'");
This can get complicated:
alert("It's \"game\" time.");
alert('It\'s "game" time.');
Another option, new in ECMAScript 6, is template literals which use the backtick character:
alert(`Use "double" and 'single' quotes in the same string`);
alert(`Escape the \` back-tick character and the \${ dollar-brace sequence in a string`);
Template literals offer a clean syntax for: variable interpolation, multi-line strings, and more.
Note that JSON is formally specified to use double quotes, which may be worth considering depending on system requirements.
If you're dealing with JSON, it should be noted that strictly speaking, JSON strings must be double quoted. Sure, many libraries support single quotes as well, but I had great problems in one of my projects before realizing that single quoting a string is in fact not according to JSON standards.
There is no one better solution; however, I would like to argue that double quotes may be more desirable at times:
Newcomers will already be familiar with double quotes from their language. In English, we must use double quotes " to identify a passage of quoted text. If we were to use a single quote ', the reader may misinterpret it as a contraction. The other meaning of a passage of text surrounded by the ' indicates the 'colloquial' meaning. It makes sense to stay consistent with pre-existing languages, and this may likely ease the learning and interpretation of code.
Double quotes eliminate the need to escape apostrophes (as in contractions). Consider the string: "I'm going to the mall", vs. the otherwise escaped version: 'I\'m going to the mall'.
Double quotes mean a string in many other languages. When you learn a new language like Java or C, double quotes are always used. In Ruby, PHP and Perl, single-quoted strings imply no backslash escapes while double quotes support them.
JSON notation is written with double quotes.
Nonetheless, as others have stated, it is most important to remain consistent.
Section 7.8.4 of the specification describes literal string notation. The only difference is that DoubleStringCharacter is "SourceCharacter but not double-quote" and SingleStringCharacter is "SourceCharacter but not single-quote". So the only difference can be demonstrated thusly:
'A string that\'s single quoted'
"A string that's double quoted"
So it depends on how much quote escaping you want to do. Obviously the same applies to double quotes in double quoted strings.
I'd like to say the difference is purely stylistic, but I'm really having my doubts. Consider the following example:
/*
Add trim() functionality to JavaScript...
1. By extending the String prototype
2. By creating a 'stand-alone' function
This is just to demonstrate results are the same in both cases.
*/
// Extend the String prototype with a trim() method
String.prototype.trim = function() {
return this.replace(/^\s+|\s+$/g, '');
};
// 'Stand-alone' trim() function
function trim(str) {
return str.replace(/^\s+|\s+$/g, '');
};
document.writeln(String.prototype.trim);
document.writeln(trim);
In Safari, Chrome, Opera, and Internet Explorer (tested in Internet Explorer 7 and Internet Explorer 8), this will return the following:
function () {
return this.replace(/^\s+|\s+$/g, '');
}
function trim(str) {
return str.replace(/^\s+|\s+$/g, '');
}
However, Firefox will yield a slightly different result:
function () {
return this.replace(/^\s+|\s+$/g, "");
}
function trim(str) {
return str.replace(/^\s+|\s+$/g, "");
}
The single quotes have been replaced by double quotes. (Also note how the indenting space was replaced by four spaces.) This gives the impression that at least one browser parses JavaScript internally as if everything was written using double quotes. One might think, it takes Firefox less time to parse JavaScript if everything is already written according to this 'standard'.
Which, by the way, makes me a very sad panda, since I think single quotes look much nicer in code. Plus, in other programming languages, they're usually faster to use than double quotes, so it would only make sense if the same applied to JavaScript.
Conclusion: I think we need to do more research on this.
This might explain Peter-Paul Koch's test results from back in 2003.
It seems that single quotes are sometimes faster in Explorer Windows (roughly 1/3 of my tests did show a faster response time), but if Mozilla shows a difference at all, it handles double quotes slightly faster. I found no difference at all in Opera.
2014: Modern versions of Firefox/Spidermonkey don’t do this anymore.
If you're doing inline JavaScript (arguably a "bad" thing, but avoiding that discussion) single quotes are your only option for string literals, I believe.
E.g., this works fine:
<a onclick="alert('hi');">hi</a>
But you can't wrap the "hi" in double quotes, via any escaping method I'm aware of. Even " which would have been my best guess (since you're escaping quotes in an attribute value of HTML) doesn't work for me in Firefox. " won't work either because at this point you're escaping for HTML, not JavaScript.
So, if the name of the game is consistency, and you're going to do some inline JavaScript in parts of your application, I think single quotes are the winner. Someone please correct me if I'm wrong though.
Technically there's no difference. It's only matter of style and convention.
Douglas Crockford1 recommends using double quotes.
I personally follow that.
Strictly speaking, there is no difference in meaning; so the choice comes down to convenience.
Here are several factors that could influence your choice:
House style: Some groups of developers already use one convention or the other.
Client-side requirements: Will you be using quotes within the strings? (See Ady's answer.)
Server-side language: VB.NET people might choose to use single quotes for JavaScript so that the scripts can be built server-side (VB.NET uses double-quotes for strings, so the JavaScript strings are easy to distinguished if they use single quotes).
Library code: If you're using a library that uses a particular style, you might consider using the same style yourself.
Personal preference: You might think one or other style looks better.
Just keep consistency in what you use. But don't let down your comfort level.
"This is my string."; // :-|
"I'm invincible."; // Comfortable :)
'You can\'t beat me.'; // Uncomfortable :(
'Oh! Yes. I can "beat" you.'; // Comfortable :)
"Do you really think, you can \"beat\" me?"; // Uncomfortable :(
"You're my guest. I can \"beat\" you."; // Sometimes, you've to :P
'You\'re my guest too. I can "beat" you too.'; // Sometimes, you've to :P
ECMAScript 6 update
Using template literal syntax.
`Be "my" guest. You're in complete freedom.`; // Most comfort :D
I hope I am not adding something obvious, but I have been struggling with Django, Ajax, and JSON on this.
Assuming that in your HTML code you do use double quotes, as normally should be, I highly suggest to use single quotes for the rest in JavaScript.
So I agree with ady, but with some care.
My bottom line is:
In JavaScript it probably doesn't matter, but as soon as you embed it inside HTML or the like you start to get troubles. You should know what is actually escaping, reading, passing your string.
My simple case was:
tbox.innerHTML = tbox.innerHTML + '<div class="thisbox_des" style="width:210px;" onmouseout="clear()"><a href="/this/thislist/'
+ myThis[i].pk +'"><img src="/site_media/'
+ myThis[i].fields.thumbnail +'" height="80" width="80" style="float:left;" onmouseover="showThis('
+ myThis[i].fields.left +','
+ myThis[i].fields.right +',\''
+ myThis[i].fields.title +'\')"></a><p style="float:left;width:130px;height:80px;"><b>'
+ myThis[i].fields.title +'</b> '
+ myThis[i].fields.description +'</p></div>'
You can spot the ' in the third field of showThis.
The double quote didn't work!
It is clear why, but it is also clear why we should stick to single quotes... I guess...
This case is a very simple HTML embedding, and the error was generated by a simple copy/paste from a 'double quoted' JavaScript code.
So to answer the question:
Try to use single quotes while within HTML. It might save a couple of debug issues...
It's mostly a matter of style and preference. There are some rather interesting and useful technical explorations in the other answers, so perhaps the only thing I might add is to offer a little worldly advice.
If you're coding in a company or team, then it's probably a good idea to follow the "house style".
If you're alone hacking a few side projects, then look at a few prominent leaders in the community. For example, let's say you getting into Node.js. Take a look at core modules, for example, Underscore.js or express and see what convention they use, and consider following that.
If both conventions are equally used, then defer to your personal preference.
If you don't have any personal preference, then flip a coin.
If you don't have a coin, then beer is on me ;)
I am not sure if this is relevant in today's world, but double quotes used to be used for content that needed to have control characters processed and single quotes for strings that didn't.
The compiler will run string manipulation on a double quoted string while leaving a single quoted string literally untouched. This used to lead to 'good' developers choosing to use single quotes for strings that didn't contain control characters like \n or \0 (not processed within single quotes) and double quotes when they needed the string parsed (at a slight cost in CPU cycles for processing the string).
If you are using JSHint, it will raise an error if you use a double quoted string.
I used it through the Yeoman scafflholding of AngularJS, but maybe there is somehow a manner to configure this.
By the way, when you handle HTML into JavaScript, it's easier to use single quote:
var foo = '<div class="cool-stuff">Cool content</div>';
And at least JSON is using double quotes to represent strings.
There isn't any trivial way to answer to your question.
There isn't any difference between single and double quotes in JavaScript.
The specification is important:
Maybe there are performance differences, but they are absolutely minimum and can change any day according to browsers' implementation. Further discussion is futile unless your JavaScript application is hundreds of thousands lines long.
It's like a benchmark if
a=b;
is faster than
a = b;
(extra spaces)
today, in a particular browser and platform, etc.
Examining the pros and cons
In favor of single quotes
Less visual clutter.
Generating HTML: HTML attributes are usually delimited by double quotes.
elem.innerHTML = 'Hello';
However, single quotes are just as legal in HTML.
elem.innerHTML = "<a href='" + url + "'>Hello</a>";
Furthermore, inline HTML is normally an anti-pattern. Prefer templates.
Generating JSON: Only double quotes are allowed in JSON.
myJson = '{ "hello world": true }';
Again, you shouldn’t have to construct JSON this way. JSON.stringify() is often enough. If not, use templates.
In favor of double quotes
Doubles are easier to spot if you don't have color coding. Like in a console log or some kind of view-source setup.
Similarity to other languages: In shell programming (Bash etc.), single-quoted string literals exist, but escapes are not interpreted inside them. C and Java use double quotes for strings and single quotes for characters.
If you want code to be valid JSON, you need to use double quotes.
In favor of both
There is no difference between the two in JavaScript. Therefore, you can use whatever is convenient at the moment. For example, the following string literals all produce the same string:
"He said: \"Let's go!\""
'He said: "Let\'s go!"'
"He said: \"Let\'s go!\""
'He said: \"Let\'s go!\"'
Single quotes for internal strings and double for external. That allows you to distinguish internal constants from strings that are to be displayed to the user (or written to disk etc.). Obviously, you should avoid putting the latter in your code, but that can’t always be done.
Talking about performance, quotes will never be your bottleneck. However, the performance is the same in both cases.
Talking about coding speed, if you use ' for delimiting a string, you will need to escape " quotes. You are more likely to need to use " inside the string. Example:
// JSON Objects:
var jsonObject = '{"foo":"bar"}';
// HTML attributes:
document.getElementById("foobar").innerHTML = '<input type="text">';
Then, I prefer to use ' for delimiting the string, so I have to escape fewer characters.
There are people that claim to see performance differences: old mailing list thread. But I couldn't find any of them to be confirmed.
The main thing is to look at what kind of quotes (double or single) you are using inside your string. It helps to keep the number of escapes low. For instance, when you are working with HTML content inside your strings, it is easier to use single quotes so that you don't have to escape all double quotes around the attributes.
When using CoffeeScript I use double quotes. I agree that you should pick either one and stick to it. CoffeeScript gives you interpolation when using the double quotes.
"This is my #{name}"
ECMAScript 6 is using back ticks (`) for template strings. Which probably has a good reason, but when coding, it can be cumbersome to change the string literals character from quotes or double quotes to backticks in order to get the interpolation feature. CoffeeScript might not be perfect, but using the same string literals character everywhere (double quotes) and always be able to interpolate is a nice feature.
`This is my ${name}`
I would use double quotes when single quotes cannot be used and vice versa:
"'" + singleQuotedValue + "'"
'"' + doubleQuotedValue + '"'
Instead of:
'\'' + singleQuotedValue + '\''
"\"" + doubleQuotedValue + "\""
There is strictly no difference, so it is mostly a matter of taste and of what is in the string (or if the JavaScript code itself is in a string), to keep number of escapes low.
The speed difference legend might come from PHP world, where the two quotes have different behavior.
The difference is purely stylistic. I used to be a double-quote Nazi. Now I use single quotes in nearly all cases. There's no practical difference beyond how your editor highlights the syntax.
You can use single quotes or double quotes.
This enables you for example to easily nest JavaScript content inside HTML attributes, without the need to escape the quotes.
The same is when you create JavaScript with PHP.
The general idea is: if it is possible use such quotes that you won't need to escape.
Less escaping = better code.
In addition, it seems the specification (currently mentioned at MDN) doesn't state any difference between single and double quotes except closing and some unescaped few characters. However, template literal (` - the backtick character) assumes additional parsing/processing.
A string literal is 0 or more Unicode code points enclosed in single or double quotes. Unicode code points may also be represented by an escape sequence. All code points may appear literally in a string literal except for the closing quote code points, U+005C (REVERSE SOLIDUS), U+000D (CARRIAGE RETURN), and U+000A (LINE FEED). Any code points may appear in the form of an escape sequence. String literals evaluate to ECMAScript String values...
Source: https://tc39.es/ecma262/#sec-literals-string-literals

Is it enough to use HTMLEncode to display uploaded text?

We're allowing users to upload pictures and provide a text description. Users can view this through a pop up box (actually a div ) via javascript. The uploaded text is a parameter to a javascript function. I 'm worried about XSS and also finding issues with HTMLEncode().
We're using HTMLEncode to guard against XSS. Unfortunately, we're finding that HTMLEncode() only replaces '<' and '>'. We also need to replace single and double quotes that people may include. Is there a single function that will do all these special type characters or must we do that manually via .NET string.Replace()?
Unfortunately, we're finding that HTMLEncode() only replaces '<' and '>'.
Assuming you are talking about HttpServerUtility.HtmlEncode, that does encode the double-quote character. It also encodes as character references the range U+0080 to U+00FF, for some reason.
What it doesn't encode is the single quote. Bit of a shame but you can usually work around it by using only double quotes as attribute value delimiters in your HTML/XML. In that case, HtmlEncode is enough to prevent HTML-injection.
However, javascript is in your tags, and HtmlEncode is decidedly not enough to escape content to go in a JavaScript string literal. JavaScript-encoding is a different thing to HTML-encoding, so if that's the reason you're worried about the single quote then you need to employ a JS string encoder instead.
(A JSON encoder is a good start for that, but you would want to ensure it encodes the U+2028 and U+2029 characters which are, annoyingly, valid in JSON but not in JavaScript. Also you might well need some variety of HTML-escaping on top of that, if you have JavaScript in an HTML context. This can get hairy; it's usually better to avoid these problems by hiding the content you want in plain HTML, for example in a hidden input or custom attribute, where you can use standard HTML-escaping, and then read that data from the DOM in JS.)
If the text description is embedded inside a JavaScript string literal, then to prevent XSS, you will need to escape special characters such as quotes, backslashes, and newlines. The HttpUtility.HtmlEncode method is not suitable for this task.
If the JavaScript string literal is in turn embedded inside HTML (for example, in an attribute), then you will need to apply HTML encoding as well, on top of the JavaScript escaping.
You can use Microsoft's Anti-Cross Site Scripting library to perform the necessary escaping and encoding, but I recommend that you try to avoid doing this yourself. For example, if you're using WebForms, consider using an <asp:HiddenField> control: Set its Value property (which will be HTML-encoded automatically) in your server-side code, and access its value property from client-side code.
how about you htmlencode all of the input with this extended function:
private string HtmlEncode(string text)
{
char[] chars = HttpUtility.HtmlEncode(text).ToCharArray();
StringBuilder result = new StringBuilder(text.Length + (int)(text.Length * 0.1));
foreach (char c in chars)
{
int value = Convert.ToInt32(c);
if (value > 127)
result.AppendFormat("&#{0};", value);
else
result.Append(c);
}
return result.ToString();
}
this function will convert all non-english characters, symbols, quotes, etc to html-entities..
try it out and let me know if this helps..
If you're using ASP.NET MVC2 or ASP.NET 4 you can replace <%= with <%: to encode your output. It's safe to use for everything it seems (like HTML Helpers).
There is a good write up of this here: New <%: %> Syntax for HTML Encoding Output in ASP.NET 4 (and ASP.NET MVC 2)

Passing bad string as javascript parameter

I'm having a problem passing parameters that I think I should be able to solve elegantly (with 1 line of code) but I can't seem to solve the issue. I have a link to delete items, like so:
<a href="javascript:confirmDelete('${result.headline}',${result.id});">
The problem arises when result.headline contains characters like ' or " or others that break the javascript call. I have tried ${fn:replace(result.headline,"'","")} which fixes issues for ' however I want to do some sort of URLEncode (I think) so that these characters dont break the javascript call. Any ideas on how to fix this without changing anything on the back-end?
Avoid putting inline JavaScript in elements, especially javascript: pseudo-URLs which (apart from being Evil that should never be used) are JavaScript-in-URL-in-HTML. You would have to take the headline and:
JavaScript string literal-encode it. (Replacing \, ', control characters and potentially Unicode characters, or just using a JSON encoder to do it for you.) Then,
URL-encode it. Then,
HTML-encode it.
Better to put the data in the markup where you only have to worry about simple HTML encoding in the way you do everywhere else to avoid cross-site-scripting:
Delete
and have the JavaScript extract them:
<script type="text/javascript">
for (var i= document.links.length; i-->0;) {
var link= document.links[i];
if (link.className.indexOf('confirmdelete-')===0) {
link.onclick= function() {
confirmDelete(this.title, this.className.substring(14));
return false;
};
}
}
</script>
(there are potentially simpler, shorter or cleaner ways of writing the above using utility functions or frameworks, and there's an argument to be had over whether a link is really the right markup for this kind of action, but that's the basic idea.)
As long as you can rely on the fact that the strings don't have newlines, all you need to do is double all backslashes and escape all single quotes.
For example:
replace(replace(result.headline, "\\", "\\\\"), "'", "\\'")
This might not be valid syntax in your server-side language; I don't recognize it.

Categories

Resources