How to remove characters from html attribute values?

According to the author of htmlcompressor.com this can not be done as they have semantic meaning.
Here is the particular example:
<meta name='description' content='Foo lets you save and share all your
web bookmarks / favorites in one place. It is free with no advertising for life, and
has straight forward privacy controls.'>
removing the return characters you have:
<meta name='description' content='Foo lets you save and share all your web bookmarks / favorites in one place. It is free with no advertising for life, and has straight forward privacy controls.'>
which is a single line which is what I want to send to the browser.
I want to do this for all my HTML using some string manipulation. Is this possible to do or are there other cases where a return character has meaning? Is there a way to differentiate?

According to the HTML4.01 specification ( http://www.w3.org/TR/html4/struct/global.html#h-7.4.4.2 ), the content="" attribute of the <meta /> element is CDATA, which means that whitespace is not significant:
CDATA is a sequence of characters from the document character set and may include character entities. User agents should interpret attribute values as follows:
Replace character entities with characters,
Ignore line feeds,
Replace each carriage return or tab with a single space.
User agents may ignore leading and trailing white space in CDATA attribute values (e.g., " myval " may be interpreted as "myval"). Authors should not declare attribute values with leading or trailing white space.
So it looks like the author of htmlcompressor.com is wrong.
Anyway, despite dire warnings to the contrary, you can probably get away with using a regular expression to fix this.
I've forgotten the syntax to combine "match only this group, and replace in this sub-region" in regex, but this hack works:
This simple regex will capture the content of the content="" attribute:
<meta.+content='(.*)'>
Once you've got the content, you can do a straightforward '\r', '\n', ' ' -> ' ' replacement.
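As a rough sketch of that two-step idea in JavaScript (the function name collapseMetaContent is made up for illustration, and it assumes single-quoted content attributes with no ">" earlier in the tag, as in the example above):

```javascript
// Hypothetical helper: collapse line breaks inside content='...' of <meta> tags.
// Assumes single-quoted values and no ">" before the content attribute.
function collapseMetaContent(html) {
  return html.replace(
    /(<meta\b[^>]*\bcontent=')([^']*)(')/gi,
    function (match, open, value, close) {
      // Replace each run of line breaks (plus surrounding whitespace) with one space.
      return open + value.replace(/\s*[\r\n]+\s*/g, ' ') + close;
    }
  );
}
```

Anything outside a meta content attribute is left untouched, so it does not answer the question of whether other attributes are safe to collapse.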

Even if the specification is correct that the content attribute is CDATA, a webmaster may read the value of any attribute, such as the content of the meta tag in the given example, via JavaScript, and compressing the value would alter the result they expect.
So the author of htmlcompressor.com is correct in that they have a semantic meaning for the purpose of compression.
<meta id="m1" name="item1" content="Sample stuff:
1. This text is multiline on purpose.
2. And the author expects it to remain this way after compression.
So yes, it does matter...">
The same meta tag compressed:
<meta id="m2" name="item2" content="Sample stuff: 1. This text is multiline on purpose. 2. And the author expects it to remain this way after compression. So yes, it does matter...">
And to show the difference:
<script>
alert('"'
+ document.getElementById('m1').content
+ '"\n\n---------------\n\n"'
+ document.getElementById('m2').content + '"'
);
</script>
Afaik, the goal of that site is to compress documents without altering the resulting layout or functionality.
Live example: http://jsfiddle.net/7Qb74/

Related

jQuery encoding in html()

Take the following (simple) HTML page:
<html>
<head>
<script src="jquery-1.12.3.min.js"></script>
</head>
<body>
<div id='test'>
<img src='/path/to/image?width=1024&height=768' />
</div>
</body>
</html>
If in browser console I type something like:
$("#test").html()
I obtain:
<img src="/path/to/image?width=1024&amp;height=768">
Why has the & in the img source attribute been turned into &amp;?
I can understand if the ampersand appears in a paragraph text (or something like that)... but why are image sources touched that way? This is going to break the page for further processing...
Isn't there a way for obtaining "raw" HTML out from a <div/>?

Why has the & in the img source attribute been turned into &amp;?
Because it should¹ have been &amp; in the first place; the browser fixed it for you when it parsed the HTML, because browsers are tolerant. :-)
The text inside an HTML attribute is HTML text. In HTML text, both < and & must be encoded, because they both have special meanings: < is the beginning of a tag, and & is the beginning of a character entity. The typical way to encode them is with named character entities: &lt; and &amp; (> is also frequently written &gt;, but it's not necessary outside a tag). If you have a & that the browser's parser determines doesn't start a character entity, the parser backs up and acts as though it saw &amp; instead. The HTML5 specification addresses doing this in §8.2.4.2: The & puts the parser in the "data state" and the parser attempts to consume a character reference; it falls back to processing it as a literal & if it fails to consume a character reference.
So the browser fixed it, and then jQuery retrieved the corrected version and that's what gets logged to the console.
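If you ever need to go the other way and write an attribute value into HTML text yourself, the encoding described above can be sketched like this. The name escapeAttribute is made up for illustration, and escaping the double quote is an extra step that assumes the value ends up inside a double-quoted attribute:

```javascript
// Hypothetical helper: turn a raw attribute value into HTML text.
function escapeAttribute(value) {
  return value
    .replace(/&/g, '&amp;')   // must come first, or later entities get double-encoded
    .replace(/</g, '&lt;')
    .replace(/"/g, '&quot;'); // assumption: value is placed inside "..."
}

var html = '<img src="' + escapeAttribute('/path/to/image?width=1024&height=768') + '">';
```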
This is going to break the page for further processing...
Nothing that correctly processes HTML text will be impacted by this, nor will anything that deals with just the value of that attribute rather than the HTML text that defines the value of it.
For instance, if you ask that img element what its src is, you'll get back a string with just an & in it:
var img = document.querySelector("#test img");
console.log(img.getAttribute("src"));
console.log(img.src);
<div id='test'>
<img src='/path/to/image?width=1024&height=768' />
</div>
That's because both src and getAttribute return the string, not the way we write the string in HTML.
Similarly, anything using attribute matching selectors will work as well.
// src*="&height" means "an element with a src attribute
// containing &height anywhere in the value"
var img = document.querySelector('img[src*="&height"]');
console.log("Found it? " + (img ? "true" : "false"));
<div id='test'>
<img src='/path/to/image?width=1024&height=768' />
</div>
&amp; is only used in the HTML text defining that attribute in HTML. If a tool is processing the HTML text, it needs to correctly understand HTML text.
¹ "should" is arguably a strong word here, since again the HTML specification clearly defines that an & that doesn't start a character entity and isn't an ambiguous ampersand should be read as an &. (This would be an ambiguous ampersand: &asldkfj; because it starts something that looks like a character entity, but isn't one.) So in that sense, the original text is just another way to write the same thing, relying on the fact that the & isn't ambiguous.

change color of special characters in HTML

I have a Unicode HTML page (saved in a DB). Is there any way I can programmatically change the color of all "." and ":" characters in the text? (Please note that my HTML content also has inline CSS which may contain "." or ":" characters, but I just want to change the color of those characters in the real text.)
What are my options? One way could be to find these characters in the text and wrap them in a tag so that they can be styled. Any other suggestions? (If I use this method, how can I distinguish between HTML/CSS characters and real characters in the text?) I'm using ASP.NET/C#.

Try using String.prototype.replace() with the RegExp /\.|:/g, returning an i element whose style attribute sets the color:
var div = document.getElementsByTagName("div")[0];
div.innerHTML = div.innerHTML.replace(/\.|:/g, function(match) {
return "<i style=color:tomato;font-weight:bold>" + match + "</i>"
})
<head>
<meta charset="utf-8" />
</head>
<body>
<div>
I've a Unicode HTML page (saved in DB), is there anyway that I can programmatically change color of all "." and ":" characters in text (please pay attention that my HTML content has also inline CSS which may contain "." or ":" characters, but I just want
to change color of the mentioned characters in real text. what are my options? One way can be finding these characters in the text and put them in tag, so that can be styled, any other suggestion? (if I'm going to use this method, how can I distinguish
between HTML/CSS characters and real characters in the text?) I'm using ASP.NET/C#
</div>
</body>
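Note that the snippet above runs the replacement over the whole innerHTML string, so it would also touch any "." or ":" inside tags, such as inline style attributes. A minimal sketch that only rewrites the text between tags could look like this. The function name is made up, and it assumes there is no ">" inside attribute values and no style or script blocks in the content:

```javascript
// Hypothetical helper: wrap "." and ":" in text, leaving tags (and their attributes) alone.
function colorPunctuation(html) {
  // Splitting on a capture group keeps the tags in the result at odd indices.
  return html.split(/(<[^>]*>)/).map(function (part, index) {
    if (index % 2 === 1) {
      return part; // a tag: leave untouched
    }
    return part.replace(/[.:]/g, function (match) {
      return '<i style="color:tomato;font-weight:bold">' + match + '</i>';
    });
  }).join('');
}
```

The same split-then-replace idea carries over to C# with Regex.Split, but a real HTML parser is the safer choice for stored pages.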

This is the Simple way to change color of any character in HTML language
"Spacial Character"

Using regular expression to parse text to prevent XSS

I'm trying to parse a blob of text in HTML format that only allows bold <b></b> and italic <i></i>.
I know it's nearly impossible to parse HTML text in a way that is secure against XSS. But given the constraint to only bold and italic, is it feasible to use regex to filter out the unnecessary tags?
Thanks.
--- Edit ---
I meant to do the parsing on the client side, and render it right back.
Please test your code against this, before jumping into conclusion.
http://voog.github.io/wysihtml/examples/simple.html
BTW, why did the question itself get downvoted?
--- Closed ---
I picked @Siguza's answer to close this discussion.

The easiest and probably most secure way I can think of (doing this with regex) is to first replace all < and > with &lt; and &gt; respectively, and then explicitly "un-replace" the b and i tags.
To replace < and > you just need text substitution, no regex. But I trust you know how to do this in regex anyway.
To re-enable the i and b tags, you could also use four text replacements:
&lt;b&gt; => <b>
&lt;/b&gt; => </b>
&lt;i&gt; => <i>
&lt;/i&gt; => </i>
Or, in regex replace /&lt;(\/?[bi])&gt;/g with <$1>.
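Put together, the escape-then-restore approach might look like the following sketch (the function name allowBoldItalic is made up; like the description above, it leaves & alone and only handles bare <b>, </b>, <i> and </i> with no attributes):

```javascript
// Hypothetical helper: escape every angle bracket, then re-enable plain b and i tags.
function allowBoldItalic(input) {
  var escaped = input.replace(/</g, '&lt;').replace(/>/g, '&gt;');
  // Only the exact forms <b>, </b>, <i>, </i> are restored.
  return escaped.replace(/&lt;(\/?[bi])&gt;/g, '<$1>');
}
```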
But...
...for the sake of completeness, it actually is possible with just one single regex substitution:
Replace /<(|\/|[^>\/bi]|\/[^>bi]|[^\/>][^>]+|\/[^>][^>]+)>/g with &lt;$1&gt;.
I will not guarantee that this is bullet-proof, but I tested it against the following block using RegExr, where it appeared to hold up:
<>Test</>
<i>Test</i>
<iii>Test</iii>
<b>Test</b>
<bbb>Test</bbb>
<a>Test</a>
<abc>Test</abc>
<some tag with="attributes">Test</some>
<br/>
<br />
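For anyone who wants to repeat that check outside RegExr, here is a small sketch that applies the single-regex substitution in JavaScript (the function name is made up; it escapes every tag except bare b and i tags and, like the regex itself, leaves a stray ">" untouched):

```javascript
// Hypothetical helper: escape every tag that is not exactly <b>, </b>, <i> or </i>.
function escapeDisallowedTags(input) {
  var notBoldOrItalic = /<(|\/|[^>\/bi]|\/[^>bi]|[^\/>][^>]+|\/[^>][^>]+)>/g;
  return input.replace(notBoldOrItalic, '&lt;$1&gt;');
}
```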

Can you do this with regex? Kind of. You have to write a regex to find all tags that are not b or i tags. Below is a simple example of one: it matches any tag with more than 2 characters between the angle brackets, so single-letter tags such as <a>, <b>, <i>, <p>, <q>, <s>, and <u> and their closing tags are left alone (no spaces, no attributes and no classes allowed), which I believe fits your needs. There may well be a more precise regex for this, but this is simple. It may or may not catch everything. It probably doesn't.
<[^>]{2,}[^/]>
Should you do this with regex? No. There are other better, more secure ways.

Parse out tags, replace with a special delimiter (or store indices).
XSS sanitize the input.
Replace the delimiters with tags.
Make sure you don't have any mismatched tags.
XSS sanitizing needs to be done server-side - the client is in control of the client-side, and can circumvent any checks there.
I still maintain that the OWASP Cheat Sheet is sufficient for XSS sanitization, and replacing only empty bold and italic tags shouldn't compromise any of the rules.
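The first three steps above can be sketched roughly as follows. This is only an illustration under several assumptions: plain entity escaping stands in for a real XSS sanitizer, the private-use characters used as delimiters are an arbitrary choice, the function name is made up, and the mismatched-tag check is left out. It is meant to run server-side (for example under Node.js):

```javascript
// Hypothetical helper: protect <b>/<i> with delimiters, escape the rest, then restore.
function sanitizeWithPlaceholders(input) {
  var OPEN = '\uE000';  // assumed delimiter characters (Unicode private use area)
  var CLOSE = '\uE001';
  // Drop delimiter characters already in the input so they cannot be forged.
  var text = input.replace(/[\uE000\uE001]/g, '');
  // 1. Swap the allowed tags for delimiters.
  text = text.replace(/<(\/?[bi])>/g, OPEN + '$1' + CLOSE);
  // 2. Escape everything else (stand-in for a proper sanitizer).
  text = text.replace(/&/g, '&amp;')
             .replace(/</g, '&lt;')
             .replace(/>/g, '&gt;')
             .replace(/"/g, '&quot;');
  // 3. Put the allowed tags back.
  return text.replace(/\uE000(\/?[bi])\uE001/g, '<$1>');
}
```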

how to display whitespace characters.. but omit when text is selected

Setup:
I'd like to output some text that shows visible spaces, line breaks, etc.
(for displaying strings for debugging, or for, say, a rich-text editor)
i.e., I'd like to make the following kinds of substitutions:
" " -> "<span class="whitespace">·</span>"
"\r" -> "<span class="whitespace">\\r</span>"
"\n" -> "<span class="whitespace">\\n</span>"
perhaps the following CSS rule could be defined
/*display whitespace chars as a light grey*/
.whitespace { color:#CCC; }
so that
this two line
string
would be displayed as
this·two·lined\n
\t string
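For reference, the substitution step described above could be sketched like this (the function name markWhitespace is made up, the tab case is an assumption based on the \t in the example output, and the text is HTML-escaped first so it can be assigned to innerHTML):

```javascript
// Hypothetical helper: wrap whitespace characters in visible marker spans.
function markWhitespace(text) {
  function wrap(label) {
    return '<span class="whitespace">' + label + '</span>';
  }
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/[ \r\n\t]/g, function (ch) {
      switch (ch) {
        case ' ':  return wrap('\u00B7'); // middle dot
        case '\r': return wrap('\\r');
        case '\n': return wrap('\\n');
        default:   return wrap('\\t');    // assumption: tabs get a marker too
      }
    });
}
```

This only covers the display side; it does nothing about what ends up on the clipboard.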
The Question:
Is it possible to make it so that when the above "visual-whitespace" text is selected / copied to the clipboard, it copies without the whitespace markup?
Is there some CSS property to display x, but copy y?
javascript hack?
special whitespace-font?
other?

<style>.paragraph-marker:after { content: "\B6" }</style>
<p>Foo<span class="paragraph-marker"></span></p>
<p>Bar<span class="paragraph-marker"></span></p>
The :after is a pseudo-element: it creates a pseudo-node as the last child of the affected element, rendered right after that element's content.
The content property can be used with these pseudo-nodes to specify the textual content of them. It comes in handy when specifying quotation marks before and after quoted sections, or list separators like commas in semantic HTML <ol> which you don't want to display in bullet format.
It should come in handy for your use case since browsers don't deal with pseudo-nodes when converting a DOM selection stored in the clipboard to plain text on paste.

http://codepen.io/msvbg/pen/ebgrj
Works fine in the latest version of Chrome. Flip the showWhitespace variable to try it both ways. It works by sticking a visible whitespace layer underneath the text layer, and only the top-most layer is copied by default.

Getting unparsed (raw) HTML with JavaScript

I need to get the actual html code of an element in a web page.
For example, if the actual HTML code inside the element is "How to&nbsp;fix"
Running this JavaScript:
document.getElementById('myE').innerHTML
Gives me "How to fix" which is the parsed HTML.
How can I get the unparsed "How to&nbsp;fix" using JavaScript?

You cannot get the actual HTML source of part of your web page.
When you give a web browser an HTML page, it parses the HTML into some DOM nodes that are the definitive version of your document as far as the browser is concerned. The DOM keeps the significant information from the HTML, like that you used the Unicode character U+00A0 Non-Breaking Space before the word fix, but not the irrelevant information that you wrote it by means of an entity reference (&nbsp;) rather than just typing it raw.
When you ask the browser for an element node's innerHTML, it doesn't give you the original HTML source that was parsed to produce that node, because it no longer has that information. Instead, it generates new HTML from the data stored in the DOM. The browser decides on how to format that HTML serialisation; different browsers produce different HTML, and chances are it won't be the same way you formatted it originally.
In particular,
element names may be upper- or lower-cased;
attributes may not be in the same order as you stated them in the HTML;
attribute quoting may not be the same as in your source. IE often generates unquoted attributes that aren't even valid HTML; all you can be sure of is that the innerHTML generated will be safe to use in the same browser by writing it to another element's innerHTML;
it may not use entity references for anything but characters that would otherwise be impossible to include directly in text content: ampersands, less-thans and attribute-value-quotes. Instead of returning &nbsp; it may simply give you the raw character.
You may not be able to see that that's a non-breaking space, but it still is one and if you insert that HTML into another element it will act as one. You shouldn't need to rely anywhere on a non-breaking space character being entity-escaped to &nbsp; ... if you do, for some reason, you can get that by doing:
x= el.innerHTML.replace(/\xA0/g, '&nbsp;')
but that's only escaping U+00A0 and not any of the other thousands of possible Unicode characters, so it's a bit questionable.
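If you did want to cover more than U+00A0, one option is to turn every non-ASCII character into a numeric character reference. This is only a sketch with a made-up function name; it produces &#160; rather than &nbsp;, so it still will not reproduce your original source:

```javascript
// Hypothetical helper: replace each non-ASCII character with a numeric character reference.
function escapeNonAscii(html) {
  // The u flag makes the pattern match whole code points, including astral ones.
  return html.replace(/[^\x00-\x7F]/gu, function (ch) {
    return '&#' + ch.codePointAt(0) + ';';
  });
}
```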
If you really really need to get your page's actual source HTML, you can make an XMLHttpRequest to your own URL (location.href) and get the full, unparsed HTML source in the responseText. There is almost never a good reason to do this.

What you have should work:
Element test:
<div id="myE">How to&nbsp;fix</div>
JavaScript test:
alert(document.getElementById("myE").innerHTML); //alerts "How to&nbsp;fix"
Make sure that wherever you're using the result doesn't show the &nbsp; as a space, which is likely the case. If you want to show it somewhere that's designed for HTML, you'll need to escape it.

You can use a script tag instead, which will not parse the HTML. This is more relevant when there are angle brackets, like loading a lodash or underscore template.
document.getElementById("asDiv").value = document.getElementById("myDiv").innerHTML;
document.getElementById("asScript").value = document.getElementById("myScript").innerHTML;
<div id="myDiv">
<h1>
<%= ${var} %> %>
How to&nbsp;fix
</h1>
</div>
<script id="myScript" type="text/template">
<h1>
<%= ${var} %>
How to&nbsp;fix
</h1>
</script>
<textarea rows="10" cols="40" id="asDiv"></textarea>
<textarea rows="10" cols="40" id="asScript"></textarea>
Because the HTML in a div is parsed, the inner HTML for brackets comes back as &lt;, but as a script it does not.
