jQuery encoding in html()

Take the following (simple) HTML page:
<html>
<head>
<script src="jquery-1.12.3.min.js"></script>
</head>
<body>
<div id='test'>
<img src='/path/to/image?width=1024&height=768' />
</div>
</body>
</html>
If in browser console I type something like:
$("#test").html()
I obtain:
<img src="/path/to/image?width=1024&amp;height=768">
Why has the & in img source attribute been turned to &amp;?
I can understand if the ampersand appears in a paragraph text (or something like that)... but why are image sources touched that way? This is going to break the page for further processing...
Isn't there a way to obtain the "raw" HTML from a <div/>?

Why has the & in img source attribute been turned to &amp;?
Because it should [1] have been &amp; in the first place; the browser fixed it for you when it parsed the HTML, because browsers are tolerant. :-)
The text inside an HTML attribute is HTML text. In HTML text, both < and & must be encoded, because they both have special meanings: < is the beginning of a tag, and & is the beginning of a character entity. The typical way to encode them is with named character entities: &lt; and &amp; (> is also frequently written &gt;, but that's not necessary outside a tag). If you have a & that the browser's parser determines doesn't start a character entity, the parser backs up and acts as though it saw &amp; instead. The HTML5 specification addresses this in §8.2.4.2: the & puts the parser in the "character reference in data state", where it attempts to consume a character reference; if that fails, it falls back to treating the & as a literal &.
So the browser fixed it, and then jQuery retrieved the corrected version and that's what gets logged to the console.
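The serialization half of that round trip can be sketched without a browser. The snippet below is a simplified illustration, not jQuery's or any browser's actual code: `serializeAttributeValue` is a hypothetical helper that mimics the basic attribute-escaping rules applied when the DOM is written back out as HTML text.

```javascript
// Simplified sketch of how an attribute value is escaped when the DOM is
// serialized back to HTML text (which is what innerHTML and jQuery's html() return).
// serializeAttributeValue is a hypothetical helper, not a browser API.
function serializeAttributeValue(value) {
  return value
    .replace(/&/g, "&amp;")       // ampersand first, so later references are not double-escaped
    .replace(/\u00A0/g, "&nbsp;") // non-breaking space
    .replace(/"/g, "&quot;");     // the quote character that delimits the attribute
}

// The DOM stores the plain value; "&amp;" only appears in the serialized HTML.
const src = "/path/to/image?width=1024&height=768";
const html = '<img src="' + serializeAttributeValue(src) + '">';
console.log(html); // <img src="/path/to/image?width=1024&amp;height=768">
```

The attribute value itself never changes; only its written form does.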
This is going to break the page for further processing...
Nothing that correctly processes HTML text will be impacted by this, nor will anything that deals with just the value of that attribute rather than the HTML text that defines the value of it.
For instance, if you ask that img element what its src is, you'll get back a string with just an & in it:
var img = document.querySelector("#test img");
console.log(img.getAttribute("src"));
console.log(img.src);
<div id='test'>
<img src='/path/to/image?width=1024&height=768' />
</div>
That's because both src and getAttribute return the string, not the way we write the string in HTML.
Similarly, anything using attribute matching selectors will work as well.
// src*="&height" means "an element with a src attribute
// containing &height anywhere in the value"
var img = document.querySelector('img[src*="&height"]');
console.log("Found it? " + (img ? "true" : "false"));
<div id='test'>
<img src='/path/to/image?width=1024&height=768' />
</div>
&amp; is only used in the HTML text defining that attribute. If a tool is processing the HTML text, it needs to correctly understand HTML text.
[1] "should" is arguably too strong a word here, since the HTML specification clearly defines that an & that doesn't start a character entity and isn't an ambiguous ampersand should be read as an &. (This would be an ambiguous ampersand: &asldkfj; because it starts something that looks like a character entity, but isn't one.) So in that sense, the original text is just another way to write the same thing, relying on the fact that the & isn't ambiguous.

Related

Escape characters within an HTML tag

I have a specific div that cannot have tags within it.
Whenever tags are found, I would like to escape them and display as regular text.
For example:
<div class='no-tags-div'>
<h1>Hi!</h1>
<p>Blablablabalablal</p>
</div>
Instead of displaying the Hi! as a header text followed by a paragraph of Blablablabalablal, I would like to literally display it with the tags:
<h1>Hi!</h1>
<p>Blablablabalablal</p>
I already have access to the content I just need to figure out how to escape any of these special characters.
Edit: I should probably specify, the content within the div is posted through an input. I am attempting to not allow users to post other tags through the input, so this isn't just static HTML text we are talking about here.

You can use &lt; and &gt; to escape < and >. If you're doing this on the server side, you can find and replace those. On the client, you can use element.innerText, as D. Pardal suggested, which replaces the contents of element with a text node, rather than interpreting it as HTML.
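For the find-and-replace route, a minimal sketch might look like the following. `escapeHtml` is a hypothetical helper name, and it only covers the characters that matter inside element content; attribute values would also need their quotes escaped.

```javascript
// Hypothetical helper: escape the characters that are special in HTML text
// so the string is displayed literally instead of being parsed as markup.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;") // must come first, or the "&" in "&lt;" would be escaped again
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

const userInput = "<h1>Hi!</h1>";
console.log(escapeHtml(userInput)); // &lt;h1&gt;Hi!&lt;/h1&gt;
```

The escaped string can then be assigned to innerHTML and will show up as visible tags rather than as a heading.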

Insert emoji with zero width joiner using Javascript

I have not been able to successfully insert an emoji into the DOM using Javascript when I am given the codepoints and zero width joiners are used.
Consider this emoji: 👩‍👩‍👦
I am able to create a string that looks like this:
&#x1f469;&#x200d;&#x1f469;&#x200d;&#x1f466;
and insert it into the innerHTML of an element but the 3 characters end up getting displayed instead of the single combined character. If you look at the html on this page for this character, you can see that the html is formatted in the same way as my string is:
https://emojipedia.org/family-woman-woman-boy/
This is only an issue when zero width joiners are used.
So doing this:
el.innerHTML = "&#x1f469;&#x200d;&#x1f469;&#x200d;&#x1f466;"
should result in a single character but it doesn't, so how can I get the single character to display? NOTE: the character cannot just be added by typing the text into an editor. The content is generated by JavaScript.

Not really sure what the question is here, but if you have a good UTF-8/Unicode editor you can of course just paste the emoji into your text file.
If this is problematic, you could build it up using HTML escaping.
Below I have done both. The first is just pasted into the editor (unfortunately the SO editor is not the best here), and the second uses HTML escaping.
Hope this helps.
Update: using your version also seems to work for me in Chrome. What browsers are you using?
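// Pasted directly into the source
document.querySelector("#container").innerHTML = "👩‍👩‍👦";
// Built with HTML escaping
document.querySelector("#container2").innerHTML =
"&#x1F469;&zwj;&#x1F469;&zwj;&#x1F466;";
// The version from the question
document.querySelector("#container3").innerHTML =
"&#x1f469;&#x200d;&#x1f469;&#x200d;&#x1f466;";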
<div id="container">
</div>
<div id="container2">
</div>
<div id="container3">
</div>
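Since the question starts from code points rather than from markup, another option is to build the string directly and skip HTML escaping altogether. This is a minimal sketch; the `codePoints` array is assumed to be whatever your data source provides, and whether the sequence renders as a single glyph still depends on the font and platform.

```javascript
// Code points for WOMAN, ZERO WIDTH JOINER, WOMAN, ZERO WIDTH JOINER, BOY.
const codePoints = [0x1f469, 0x200d, 0x1f469, 0x200d, 0x1f466];

// String.fromCodePoint handles code points above U+FFFF (surrogate pairs) correctly.
const family = String.fromCodePoint(...codePoints);

console.log(family);
console.log(family.length); // 8 UTF-16 code units: 3 surrogate pairs + 2 joiners

// In a browser the string can be inserted as plain text, for example:
// el.textContent = family;
```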

How should I prevent HTML from interpreting user-entered text as an entity?

I have a website where users can enter text. A user entered something like "I worked on the #3&#4 valves" into an <input>. That text gets stored in a database and displayed on screen somewhere else. My problem is that the "&#4" is being interpreted as an HTML entity or special character, and I want it to be interpreted literally.
Do I need to use JavaScript to escape the & from the <input>? I was hoping that <pre> would work, but it also interprets the text as code. Again, this is user-entered text.
For example, when I run the code below, the <input> shows different text than the <p>. I want the <p> to show exactly what the <input> shows.
<html>
<body>
<input id="box">
<p id="para"></p>
</body>
<script>
document.getElementById("box").value = "something #3&#4";
document.getElementById("para").innerHTML = "something #3&#4";
</script>
</html>
Fiddle
EDIT:
I realized that I'll need both a client-side solution and a server-side solution. In one place that user-inputted text is displayed, I'm using Javascript's .innerHTML, and on another webpage, I'm echoing it with PHP.

I think your real issue is a lack of server-side filtering. Given that you are having this problem, it seems very likely to me that you aren't doing any server-side input filtering/cleaning at all, which means that you are also going to be vulnerable to XSS.
On the server side you should be sanitizing everything that goes back out to the client, which includes both stripping HTML tags (and also returning errors on save if people try to send up HTML tags) as well as replacing HTML special characters (see htmlspecialchars). The latter will convert your & into &amp;, which will have the end result you desire: your text will not be interpreted as HTML special characters.
The problem with fixing this with javascript client side is that, not only do you have to do it everywhere, but you also have to remember to do it in a different way if there are cases where this same output is shown in the HTML document itself, i.e. not displayed by javascript.
In short, coming up with a coherent (and thorough) method for sanitizing user data before it goes back to the browser will fix your problem and also provide a first layer of protection against a number of malicious attacks.

Working fiddle.
Try to append the content as text, not as HTML, using one of the following properties (innerText or textContent), like:
document.getElementById("para").innerText = "something #3&#4";
document.getElementById("para").textContent = "something #3&#4";
NOTE : In case of server-side display you could use htmlentities($content).
Hope this helps.
document.getElementById("para").textContent = "something #3&#4";
<p id="para"></p>

Use innerText instead of innerHTML.
https://jsfiddle.net/9746ah8s/2/

You need to stop manipulating it as HTML, because text only becomes code if you do it explicitly. In a slightly modified version of your example, please compare:
var txt = "one <strong>two</strong>";
document.getElementById("box").value = txt;
document.getElementById("para1").innerHTML = txt;
document.getElementById("para2").innerText = txt;
<input id="box">
<p id="para1"></p>
<p id="para2"></p>
(In the case of <input> there's only one option because the element cannot hold HTML in the first place.)

To display &, you could replace every & with &amp;. This way you will see #3&#4, and '&#4' won't be interpreted.

why javascript string replace using regex removes a "/" from my br tag

I'm using javascript with a super simple regex to replace a "<" with the HTML character code for it so I can place some code on my site using the pre and code tags and have it done automatically.
jsFiddle link
basically I'm trying to figure out why this js code:
var str = document.getElementById("cleanme").innerHTML;
str=str.replace(/</g,"&lt;");
document.getElementById("cleanme").innerHTML = str;
removes the "/" in the br tag
<pre><code id="cleanme">
<p><br />this is some code</p>
</code></pre>
not a huge deal because I'm just displaying code, but I'd still like to know.
it outputs this:
<p><br>this is some code</p>
thanks

I believe it has to do with the way certain browsers return the innerHTML property. If you use Google Chrome, inspect any <br /> tag using the debugging tools and you'll notice they don't show the trailing slash. The same is true when Chrome returns an innerHTML property: the slash is stripped out.
So when you pass in:
<pre><code id="cleanme">
<p><br />this is some code</p>
</code></pre>
The browser returns an innerHTML property of:
<pre><code id="cleanme">
<p><br>this is some code</p>
</code></pre>
Your RegEx is not the issue.

Your script is OK.
If you try this:
var str = '<p><br />this is some code</p>';
str=str.replace(/</g,"&lt;");
str=str.replace(/>/g,"&gt;");
document.getElementById("cleanme").innerHTML = str;
It'll correctly print <br />.
It's possibly an effect of the browser's HTML normalization.

Maybe too late to help you, and you've accepted a correct answer, but there's another big potential problem.
I tried this with Firefox 3.6.11 on Linux and 3.6.12 on Windows and they both behaved the same --
I did not see the <p><br>this is some code</p> in the Result pane on your fiddle, instead I saw simply this is some code with no markup at all.
Throwing firebug at it by adding a debugger; statement as the first line in the JavaScript pane and tracing through it, I found that str was getting a value of '\n', that is, just a newline was being returned from innerHTML and nothing else.
Thinking about this, but with no way to confirm it, I suspect it's because Firefox is building the DOM tree differently than you expect, because the HTML you're using is invalid. Inline elements are not allowed to contain block elements; specifically, the <code> tag is not allowed to contain a <p> tag, and <pre> is likewise not allowed to contain a <p> tag (again, only limited inline elements can be used inside a <pre> tag).
I think FF is implicitly closing the code block before opening the paragraph so the innerHTML of id="cleanme" is nothing but the newline. It renders with the "pre" font as you expect because you've thrown the browser into Quirks Mode.

innerHTML does not return the literal source code, but the result of the browser's interpretation of it.
Different browsers will return very different results for innerHTML, sometimes omitting some quotes and 'optional' end tags, capitalizing some tag names and attributes, and collapsing extra white-space.
And HTML does not close open tags that can't have end tags, so they are not included either.

Getting unparsed (raw) HTML with JavaScript

I need to get the actual html code of an element in a web page.
For example, if the actual HTML code inside the element is "How to&nbsp;fix"
Running this JavaScript:
getElementById('myE').innerHTML
Gives me "How to fix" which is the parsed HTML.
How can I get the unparsed "How to&nbsp;fix" using JavaScript?

You cannot get the actual HTML source of part of your web page.
When you give a web browser an HTML page, it parses the HTML into some DOM nodes that are the definitive version of your document as far as the browser is concerned. The DOM keeps the significant information from the HTML (like that you used the Unicode character U+00A0 Non-Breaking Space before the word fix) but not the irrelevant information that you used it by means of an entity reference (&nbsp;) rather than just typing the raw character.
When you ask the browser for an element node's innerHTML, it doesn't give you the original HTML source that was parsed to produce that node, because it no longer has that information. Instead, it generates new HTML from the data stored in the DOM. The browser decides on how to format that HTML serialisation; different browsers produce different HTML, and chances are it won't be the same way you formatted it originally.
In particular,
element names may be upper- or lower-cased;
attributes may not be in the same order as you stated them in the HTML;
attribute quoting may not be the same as in your source. IE often generates unquoted attributes that aren't even valid HTML; all you can be sure of is that the innerHTML generated will be safe to use in the same browser by writing it to another element's innerHTML;
it may not use entity references for anything but characters that would otherwise be impossible to include directly in text content: ampersands, less-thans and attribute-value-quotes. Instead of returning &nbsp; it may simply give you the raw U+00A0 character.
You may not be able to see that that's a non-breaking space, but it still is one, and if you insert that HTML into another element it will act as one. You shouldn't need to rely anywhere on a non-breaking space character being entity-escaped to &nbsp;. If you do, for some reason, you can get that by doing:
x = el.innerHTML.replace(/\xA0/g, '&nbsp;')
but that's only escaping U+00A0 and not any of the other thousands of possible Unicode characters, so it's a bit questionable.
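If you did want to cover every non-ASCII character rather than just U+00A0, one possible approach is to emit numeric character references instead of named ones. This is a sketch under that assumption: `escapeNonAscii` is a hypothetical helper, and it produces hexadecimal references such as &#xA0;, not the named form &nbsp;.

```javascript
// Hypothetical helper: rewrite every non-ASCII character as a hexadecimal
// numeric character reference. The "u" flag makes the regex match whole
// code points, so characters above U+FFFF are not split into surrogates.
function escapeNonAscii(html) {
  return html.replace(/[^\x00-\x7F]/gu, (ch) =>
    "&#x" + ch.codePointAt(0).toString(16).toUpperCase() + ";"
  );
}

console.log(escapeNonAscii("How to\u00A0fix")); // How to&#xA0;fix
```

This still does not recover the original source; it only normalizes the serialized HTML to an ASCII-only form.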
If you really really need to get your page's actual source HTML, you can make an XMLHttpRequest to your own URL (location.href) and get the full, unparsed HTML source in the responseText. There is almost never a good reason to do this.

What you have should work:
Element test:
<div id="myE">How to&nbsp;fix</div>
JavaScript test:
alert(document.getElementById("myE").innerHTML); //alerts "How to&nbsp;fix"
You can try it out here. Make sure that wherever you're using the result, it isn't shown as a space, which is likely the case. If you want to show it somewhere that's designed for HTML, you'll need to escape it.

You can use a script tag instead, which will not parse the HTML. This is more relevant when there are angle brackets, like loading a lodash or underscore template.
document.getElementById("asDiv").value = document.getElementById("myDiv").innerHTML;
document.getElementById("asScript").value = document.getElementById("myScript").innerHTML;
<div id="myDiv">
<h1>
<%= ${var} %> %>
How to fix
</h1>
</div>
<script id="myScript" type="text/template">
<h1>
<%= ${var} %>
How to fix
</h1>
</script>
<textarea rows="10" cols="40" id="asDiv"></textarea>
<textarea rows="10" cols="40" id="asScript"></textarea>
Because the HTML in a div is parsed, the inner HTML for angle brackets comes back as &lt;, but in a script element it does not.
