Getting raw text content of HTML element with HTML uninterpreted - javascript

I have Googled my brains out and can't figure out how to make this work. Here is what I'm trying to do:
HTML:
<div id=derp>"Hi, my name is.."</div>
Javascript:
var div = document.getElementById('derp');
alert(div.innerHTML);
alert(div.innerText);
alert(div.textContent);
All of those alerts interpret and return the " as " in the resulting string. I want to get the raw text with " uninterpreted.
They all return:
"Hi, my name is.."
When I want to get:
"Hi, my name is.."
Is there a way to do this? Preferably without trying to use a regex to replace every instance of " with ".
It's kind of a long story of what I'm trying to do, but simply using replace() to search and replace every instance of " would be a headache to implement because of other regex matching/parsing that needs to occur.
Thanks in advance for any Javascript wizards who can save my sanity!

To quote bobince
When you ask the browser for an element node's innerHTML, it doesn't
give you the original HTML source that was parsed to produce that
node, because it no longer has that information. Instead, it generates
new HTML from the data stored in the DOM. The browser decides on how
to format that HTML serialisation; different browsers produce
different HTML, and chances are it won't be the same way you formatted
it originally.
In summary: innerHTML/innerText/text/textContent/nodeValue/indexOf, none of them will give you the unparsed text.
The only possible way to do this is with regex, or you can do an ajax post to the page itself, but that is a bad practice.

I prepared some days ago a bin with some different approaches: http://jsbin.com/urazer/4/edit
My favorite:
var text = "");
var html = text.replace(/[<&>'"]/g, function(c) {
return "&#" + c.charCodeAt() + ";";
});

Related

How do I get document.getElementsByTagName('').innerHTML to make text between 2 tags?

I'm trying to use JavaScript to include a footer on several webpages, so if I want to change the footer, I only have to change it in one place. PHP is not available on this server and neither are server side inserts (SSI), but Perl, Python, and Tcl are available. I have been trying with document.getElementsByTagName('footer').innerHTML = "text"; but it doesn't produce text. I copied this code from dev.mozilla, and it tells me how many tags I have:
var footer = document.getElementsByTagName('footer');
var num = footer.length;
console.log('There is ' + num + ' footer in this document');
So, I don't know what's wrong with the innerHTML script. I also tried with paragraph tags and got the same results in both cases.
I reccoment using textContent instead. Se why here.
To see how it works, paste the following into your browser console while you're on StackOverflow and hit enter.
document.querySelector('.site-footer').textContent = 'Custom footer content.'
note: use querySelector with a class instead of getElementByTagName
Cheers! 🍻
Before asking this question, I had searched for Python includes without any luck, so I stopped there, but after asking this question, I thought that I should search for Perl/Ruby includes. Today, I found out that I can use the Perl use function, so I could study that and try to implement it although I am completely new to Perl. Ruby also appears capable, perhaps even more. I have no experience with Ruby either, but maybe I should start there.
I just figured out that getElementsByTagName() results in an array, so I have to refer to the footer's index with [0]:
var footerTags = document.getElementsByTagName('footer');
footerTags[0].innerHTML = "test";

createContextualFragment is just spitting back the html from the string in string form

I apologize, this should be simple. I just have a string containing html that I want to append to el as "real html" and return for display.
My code, taken from other stackoverflow answers:
let fragmentFromString = function (strHTML) {
return document.createRange().createContextualFragment(strHTML);
}
let x = fragmentFromString(decodeURI("<span><B>Test</B> this</span>"));
el.appendChild(x);
BUT, all it does is append the text including the html as a child on el. It doesn't actually create the nodes and make "Test" bold, etc...
What I see in my browser is all of:
<span><B>Test<>/B> this</span>
What I WANT to see in my browser is all of:
Test this
What simple thing am I missing? Thank you.
Short answer, you use 'decodeURI' which is wrong in this case, since it's not an encoded URI.
Simply delete it, and it should work.
You should use a html string in your function.

Better Way to Sanitize HTML for Insertion

In a recent review by the AMO editors, my Thunderbird addon's version was rejected because it "creates HTML from strings containing unsanitized data" - which "is a major security risk".
I think I understand why. Now, my problem is about how to solve that issue.
This thread gave me some clues, but it's not quite what I need.
My addon needs to paste the contents of the clipboard as a hyperlink, by using the clipboard contents as the link text, and inserting html around it like this: `" + clipboardtext + "".
Now, if I am inserting the clipboard contents as HTML, I need to "sanitize" it first. Here is what I came up with. Now, I haven't written in the regex part yet, because I don't think this is the best way to do this, although I think it will work:
function makeSafeHTML(whathtml){
var parser = Cc["#mozilla.org/parserutils;1"].getService(Ci.nsIParserUtils);
var sanitizedHTML = parser.sanitize(whathtml, 01);
//now remove the extratags added by the sanitization method, perhaps via regex
//"<html><head></head><body>"
//"</body></html>"
return sanitizedHTML;
}
My intent is to do this with the resulting "sanitized" string - this will paste the string as the href value of a hyperlink:
var html_editor = editor.QueryInterface(Components.interfaces.nsIHTMLEditor);
html_editor.insertHTML("<a href='"+whathref+"'>"+whattext+"</a>");
So I am looking for a better way to get sanitized HTML into a simple string variable. Would any of you do it this way?
It seems that you simply want to insert clipboard contents into HTML code as pure text - you don't need any complicated escaping approach then, it's enough to make sure all "dangerous" characters are replaced by HTML entities:
var sanitizedText = text.replace(/&/g, "&").replace(/</g, "<")
.replace(/>/g, ">").replace(/"/g, """);
It's not clear from your question what you do with the generated HTML code. If you add it to a DOM document via something like innerHTML then you can do better - add the HTML code first and manipulate the text in the document then:
document.getElementById("text-container").textContent = text;
Using Node.textContent to set text in a document is always safe, no escaping needs to be performed.

Looking for a way to search an html page with javascript

what I would like to do is to the html page for a specific string and read in a certain amount of characters after it and present those characters in an anchor tag.
the problem I'm having is figuring out how to search the page for a string everything I've found relates to by tag or id. Also hoping to make it a greasemonkey script for my personal use.
function createlinks(srchstart,srchend){
var page = document.getElementsByTagName('html')[0].innerHTML;
page = page.substring(srchstart,srchend);
if (page.search("file','http:") != -1)
{
var begin = page.search("file','http:") + 7;
var end = begin + 79;
var link = page.substring(begin,end);
document.body.innerHTML += 'LINK | ';
createlinks(end+1,page.length);
}
};
what I came up with unfortunately after finding the links it loops over the document again
Assisted Direction
Lookup JavaScript Regex.
Apply your regex to the page's HTML (see below).
Different regex functions do different things. You could search the document for the string, as suggested, but you'd have to do it recursively, since the string you're searching for may be listed in multiple places.
To Get the Text in the Page
JavaScript: document.getElementsByTagName('html')[0].innerHTML
jQuery: $('html').html()
Note:
IE may require the element to be capitalized (eg 'HTML') - I forget
Also, the document may have newline characters \n that might want to take out, since one could be between the string you're looking for.
Okay, so in javascript you've got the whole document in the DOM tree. You an search for your string by recursively searching the DOM for the string you want. This is striaghtforward; I'll put in pseudocode because you want to think about what libraries (if any) you're using.
function search(node, string):
if node.innerHTML contains string
-- then you found it
else
for each child node child of node
search(child,string)
rof
fi

How to decode HTML entities using jQuery?

How do I use jQuery to decode HTML entities in a string?
Security note: using this answer (preserved in its original form below) may introduce an XSS vulnerability into your application. You should not use this answer. Read lucascaro's answer for an explanation of the vulnerabilities in this answer, and use the approach from either that answer or Mark Amery's answer instead.
Actually, try
var encodedStr = "This is fun & stuff";
var decoded = $("<div/>").html(encodedStr).text();
console.log(decoded);
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<div/>
Without any jQuery:
function decodeEntities(encodedString) {
var textArea = document.createElement('textarea');
textArea.innerHTML = encodedString;
return textArea.value;
}
console.log(decodeEntities('1 & 2')); // '1 & 2'
This works similarly to the accepted answer, but is safe to use with untrusted user input.
Security issues in similar approaches
As noted by Mike Samuel, doing this with a <div> instead of a <textarea> with untrusted user input is an XSS vulnerability, even if the <div> is never added to the DOM:
function decodeEntities(encodedString) {
var div = document.createElement('div');
div.innerHTML = encodedString;
return div.textContent;
}
// Shows an alert
decodeEntities('<img src="nonexistent_image" onerror="alert(1337)">')
However, this attack is not possible against a <textarea> because there are no HTML elements that are permitted content of a <textarea>. Consequently, any HTML tags still present in the 'encoded' string will be automatically entity-encoded by the browser.
function decodeEntities(encodedString) {
var textArea = document.createElement('textarea');
textArea.innerHTML = encodedString;
return textArea.value;
}
// Safe, and returns the correct answer
console.log(decodeEntities('<img src="nonexistent_image" onerror="alert(1337)">'))
Warning: Doing this using jQuery's .html() and .val() methods instead of using .innerHTML and .value is also insecure* for some versions of jQuery, even when using a textarea. This is because older versions of jQuery would deliberately and explicitly evaluate scripts contained in the string passed to .html(). Hence code like this shows an alert in jQuery 1.8:
//<!-- CDATA
// Shows alert
$("<textarea>")
.html("<script>alert(1337);</script>")
.text();
//-->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.2.3/jquery.min.js"></script>
* Thanks to Eru Penkman for catching this vulnerability.
Like Mike Samuel said, don't use jQuery.html().text() to decode html entities as it's unsafe.
Instead, use a template renderer like Mustache.js or decodeEntities from #VyvIT's comment.
Underscore.js utility-belt library comes with escape and unescape methods, but they are not safe for user input:
_.escape(string)
_.unescape(string)
I think you're confusing the text and HTML methods. Look at this example, if you use an element's inner HTML as text, you'll get decoded HTML tags (second button). But if you use them as HTML, you'll get the HTML formatted view (first button).
<div id="myDiv">
here is a <b>HTML</b> content.
</div>
<br />
<input value="Write as HTML" type="button" onclick="javascript:$('#resultDiv').html($('#myDiv').html());" />
<input value="Write as Text" type="button" onclick="javascript:$('#resultDiv').text($('#myDiv').html());" />
<br /><br />
<div id="resultDiv">
Results here !
</div>
First button writes : here is a HTML content.
Second button writes : here is a <B>HTML</B> content.
By the way, you can see a plug-in that I found in jQuery plugin - HTML decode and encode that encodes and decodes HTML strings.
The question is limited by 'with jQuery' but it might help some to know that the jQuery code given in the best answer here does the following underneath...this works with or without jQuery:
function decodeEntities(input) {
var y = document.createElement('textarea');
y.innerHTML = input;
return y.value;
}
You can use the he library, available from https://github.com/mathiasbynens/he
Example:
console.log(he.decode("Jörg &amp Jürgen rocked to & fro "));
// Logs "Jörg & Jürgen rocked to & fro"
I challenged the library's author on the question of whether there was any reason to use this library in clientside code in favour of the <textarea> hack provided in other answers here and elsewhere. He provided a few possible justifications:
If you're using node.js serverside, using a library for HTML encoding/decoding gives you a single solution that works both clientside and serverside.
Some browsers' entity decoding algorithms have bugs or are missing support for some named character references. For example, Internet Explorer will both decode and render non-breaking spaces ( ) correctly but report them as ordinary spaces instead of non-breaking ones via a DOM element's innerText property, breaking the <textarea> hack (albeit only in a minor way). Additionally, IE 8 and 9 simply don't support any of the new named character references added in HTML 5. The author of he also hosts a test of named character reference support at http://mathias.html5.org/tests/html/named-character-references/. In IE 8, it reports over one thousand errors.
If you want to be insulated from browser bugs related to entity decoding and/or be able to handle the full range of named character references, you can't get away with the <textarea> hack; you'll need a library like he.
He just darn well feels like doing things this way is less hacky.
encode:
$("<textarea/>").html('<a>').html(); // return '<a&gt'
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<textarea/>
decode:
$("<textarea/>").html('<a&gt').val() // return '<a>'
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<textarea/>
Try this :
var htmlEntities = "<script>alert('hello');</script>";
var htmlDecode =$.parseHTML(htmlEntities)[0]['wholeText'];
console.log(htmlDecode);
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
parseHTML is a Function in Jquery library and it will return an array that includes some details about the given String..
in some cases the String is being big, so the function will separate the content to many indexes..
and to get all the indexes data you should go to any index, then access to the index called "wholeText".
I chose index 0 because it's will work in all cases (small String or big string).
Use
myString = myString.replace( /\&/g, '&' );
It is easiest to do it on the server side because apparently JavaScript has no native library for handling entities, nor did I find any near the top of search results for the various frameworks that extend JavaScript.
Search for "JavaScript HTML entities", and you might find a few libraries for just that purpose, but they'll probably all be built around the above logic - replace, entity by entity.
I just had to have an HTML entity charater (⇓) as a value for a HTML button. The HTML code looks good from the beginning in the browser:
<input type="button" value="Embed & Share ⇓" id="share_button" />
Now I was adding a toggle that should also display the charater. This is my solution
$("#share_button").toggle(
function(){
$("#share").slideDown();
$(this).attr("value", "Embed & Share " + $("<div>").html("⇑").text());
}
This displays ⇓ again in the button. I hope this might help someone.
You have to make custom function for html entities:
function htmlEntities(str) {
return String(str).replace(/&/g, '&').replace(/</g, '<').replace(/>/g,'>').replace(/"/g, '"');
}
Suppose you have below String.
Our Deluxe cabins are warm, cozy & comfortable
var str = $("p").text(); // get the text from <p> tag
$('p').html(str).text(); // Now,decode html entities in your variable i.e
str and assign back to tag.
that's it.
For ExtJS users, if you already have the encoded string, for example when the returned value of a library function is the innerHTML content, consider this ExtJS function:
Ext.util.Format.htmlDecode(innerHtmlContent)
Extend a String class:
String::decode = ->
$('<textarea />').html(this).text()
and use as method:
"<img src='myimage.jpg'>".decode()
You don't need jQuery to solve this problem, as it creates a bit of overhead and dependency.
I know there are a lot of good answers here, but since I have implemented a bit different approach, I thought to share.
This code is a perfectly safe security-wise approach, as the escaping handler depends on the browser, instead on the function. So, if some vulnerability will be discovered in the future, this solution is covered.
const decodeHTMLEntities = text => {
// Create a new element or use one from cache, to save some element creation overhead
const el = decodeHTMLEntities.__cache_data_element
= decodeHTMLEntities.__cache_data_element
|| document.createElement('div');
const enc = text
// Prevent any mixup of existing pattern in text
.replace(/⪪/g, '⪪#')
// Encode entities in special format. This will prevent native element encoder to replace any amp characters
.replace(/&([a-z1-8]{2,31}|#x[0-9a-f]+|#\d+);/gi, '⪪$1⪫');
// Encode any HTML tags in the text to prevent script injection
el.textContent = enc;
// Decode entities from special format, back to their original HTML entities format
el.innerHTML = el.innerHTML
.replace(/⪪([a-z1-8]{2,31}|#x[0-9a-f]+|#\d+)⪫/gi, '&$1;')
.replace(/⪪#/g, '⪪');
// Get the decoded HTML entities
const dec = el.textContent;
// Clear the element content, in order to preserve a bit of memory (in case the text is big)
el.textContent = '';
return dec;
}
// Example
console.log(decodeHTMLEntities("<script>alert('&awconint;&CounterClockwiseContourIntegral;∳∳⪪#x02233⪫');</script>"));
// Prints: <script>alert('∳∳∳∳⪪#x02233⪫');</script>
By the way, I have chosen to use the characters ⪪ and ⪫, because they are rarely used, so the chance of impacting the performance by matching them is significantly lower.
Here are still one problem:
Escaped string does not look readable when assigned to input value
var string = _.escape("<img src=fake onerror=alert('boo!')>");
$('input').val(string);
Exapmle: https://jsfiddle.net/kjpdwmqa/3/
Alternatively, theres also a library for it..
here, https://cdnjs.com/libraries/he
npm install he //using node.js
<script src="js/he.js"></script> //or from your javascript directory
The usage is as follows...
//to encode text
he.encode('© Ande & Nonso® Company LImited 2018');
//to decode the
he.decode('© Ande & Nonso® Company Limited 2018');
cheers.
To decode HTML Entities with jQuery, just use this function:
function html_entity_decode(txt){
var randomID = Math.floor((Math.random()*100000)+1);
$('body').append('<div id="random'+randomID+'"></div>');
$('#random'+randomID).html(txt);
var entity_decoded = $('#random'+randomID).html();
$('#random'+randomID).remove();
return entity_decoded;
}
How to use:
Javascript:
var txtEncoded = "á é í ó ú";
$('#some-id').val(html_entity_decode(txtEncoded));
HTML:
<input id="some-id" type="text" />
The easiest way is to set a class selector to your elements an then use following code:
$(function(){
$('.classSelector').each(function(a, b){
$(b).html($(b).text());
});
});
Nothing any more needed!
I had this problem and found this clear solution and it works fine.
I think that is the exact opposite of the solution chosen.
var decoded = $("<div/>").text(encodedStr).html();

Categories

Resources