How is plain text modified when set through innerHTML? - javascript

When setting innerHTML = '\r\n', it seems like browsers may end up writing '\n'.
This introduces a gap between the actual plain text content of the element and what I have been keeping track of.
Is this a rather isolated problem or are there many more potential changes I should be aware of?
How to ensure that the content of the text nodes matches exactly what I'm trying to write?
I guess it's possible just not to use innerHTML, build the nodes and the text nodes and insert them, but it's much less convenient.

When you read a string from innerHTML, it's not the string you wrote, it's created completely from scratch by converting the DOM structure of the element into HTML that will (mostly) create it. That means lots of things happen:
Newlines are normalized
Character entities are normalized
Quotes are normalized
Tags are normalized
Tags are corrected if the text you supplied defined an invalid HTML structure
...and so on. You can't expect a round-trip through the DOM to result in exactly the same string.
If you're dealing with pure text content, you can use textContent instead:
const x = document.getElementById("x");
const str = "CRLF: \r\n";
x.textContent = str;
console.log(x.textContent === str);
<div id="x"></div>
I can't vouch 100% that there is no newline normalization (or, for that matter, Unicode normalization; you might run the string through normalize first). A quick test with a Chromium-based browser, Firefox, and iOS Safari suggested there is none, and in any case most of the issues with innerHTML don't occur.
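If innerHTML has to stay, one way to keep a tracked copy in sync is to apply the same newline normalization to it before comparing. This is a minimal sketch, assuming the parser turns both CRLF and a lone CR into LF; normalizeNewlines is a hypothetical helper name, and it covers only newlines, not the entity or tag changes listed above.

```javascript
// Hypothetical helper: mirrors the newline normalization an HTML parser
// is assumed to apply (CRLF and lone CR both become LF).
function normalizeNewlines(str) {
  return str.replace(/\r\n?/g, "\n");
}

// What the code intends to write, and what it should expect to read back
// from the text node afterwards (for plain text with no markup in it).
const written = "line one\r\nline two\rline three";
const tracked = normalizeNewlines(written);
```

Comparing tracked (rather than written) against the text node's data avoids the false mismatch described in the question.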

It is not an isolated issue; it is expected behavior, because assigning to the innerHTML property writes HTML, which the browser then parses.
If you want the text to be displayed exactly as you wrote it, including newlines and spaces, use the HTML pre tag and put your text node there.
pre tag description: https://www.w3schools.com/tags/tag_pre.asp

Related

How to manipulate particular character in DOM in JavaScript?

Suppose I have text called "Hello World" inside a DIV in html file. I want to manipulate sixth position in "Hello World" text and replace that result in DOM, like using innerHTML or something like that.
The way i do is
var text = document.getElementById("divID").innerText;
Now I have the text, manipulate the result using charAt for a particular position, and put the result back in the HTML by replacing the whole string, not just the element at that position. What I want to ask is: do we have to replace the whole string every time, or is there a way to extract the character at a particular position and replace only that position, not the whole string or text inside the div?
If you just need to insert some text into an already existing string you should use replace(). You won't really gain anything by trying to replace only one character as it will need to make a new string anyway (as strings are immutable).
var text = document.getElementById("divID").innerText;
// find and replace
document.getElementById("divID").innerText = text.replace('Hello World', 'Hello big World');
var newtext=text.replace(text[6],'b'); should work here. Glad you asked, I didn't know that would work.
It works, and doesn't replace all instances of that character, because replace with a string pattern only replaces the first occurrence. Bracket notation just returns an ordinary one-character string, not some special 'character' object.
So it only hits position 6 when that character does not appear earlier in the string.
Yes, you have to replace the entire string by another, since strings are immutable in JavaScript. You can in various ways hide this behind a function call, but in the end what happens is construction of a new string that replaces the old one.
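The immutability point can be hidden behind a small helper that builds the new string from the parts around the target index. This is only a sketch: replaceAt is a hypothetical name, and it counts UTF-16 code units, so characters outside the Basic Multilingual Plane would need extra care.

```javascript
// Hypothetical helper: returns a new string with the character at `index`
// replaced. Strings are immutable, so a new string is always constructed.
function replaceAt(str, index, replacement) {
  if (index < 0 || index >= str.length) {
    return str; // out of range: nothing to replace
  }
  return str.slice(0, index) + replacement + str.slice(index + 1);
}

const text = "Hello World";
const byIndex = replaceAt(text, 7, "0");      // targets index 7 exactly
const byValue = text.replace(text[7], "0");   // targets the first "o" instead
```

With a DOM element, the result would then be assigned back, for example to textContent, which still replaces the whole string of that text node.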
Text inside a div is actually held in text nodes, so we have to manipulate the content explicitly by replacing the old content with the new.
If you are using jQuery then you can refer to the below link for a possible technique:
Replacing Text Nodes With jQuery: http://www.bennadel.com/blog/2253-Replacing-Text-Nodes-With-jQuery.htm
Behind the scenes, I would guess that jQuery still replaces the entire string for that text node.

Javascript regex whitespace is being wacky

I'm trying to write a regex that searches a page for any script tags and extracts the script content, and in order to accommodate any HTML-writing style, I want my regex to include script tags with any arbitrary number of whitespace characters (e.g. <script type = blahblah> and <script type=blahblah> should both be found). My first attempt ended up with funky results, so I broke down the problem into something simpler, and decided to just test and play around with a regex like /\s*h\s*/g.
When testing it out on a string, for some reason completely arbitrary amounts of whitespace around the 'h' would be a match, and other arbitrary amounts wouldn't, e.g. something like " h " would match but " h " wouldn't. Does anyone have an idea of why this is occurring or the error I'm making?
Since you're using JavaScript, why can't you just use getElementsByTagName('script')? That's how you should be doing it.
If you somehow have an HTML string, create an iframe and dump the HTML into it, then run getElementsByTagName('script') on it.
OK, to extend Kolink's answer, you don't need an iframe, or event handlers:
var temp = document.createElement('div');
temp.innerHTML = otherHtml;
var scripts = temp.getElementsByTagName('script');
... now scripts is a DOM collection of the script elements - and the script doesn't get executed ...
Why regex is not a fantastic idea for this:
As a <script> element may not contain the string </script> anywhere, writing a regex to match them would not be difficult: /<script[\s\S]+?<\/script>/gi
It looks like you want to only match scripts with a specific type attribute. You could try to include that in your pattern too: /<script[^>]+type\s*=\s*(["']?)blahblah\1[\s\S]*?<\/script>/gi - but that is horrible. (That's what happens when you use regular expressions on irregular strings; you need to simplify.)
So instead you iterate through all the basic matched scripts, extract the starting tag: result.match(/<script[^>]*>/i)[0] and within that, search for your type attribute /type\s*=\s*((["'])blahblah\2|\bblahblah\b)/.test(startTag). Oh look - it's back to horrible - simplify!
This time via normalisation:
startTag = startTag.replace(/\s*=\s*/g, '=').replace(/=([^\s"'>]+)/g, '="$1"') - now you're in danger territory, what if the = is inside a quoted string? Can you see how it just gets more and more complicated?
You can only have this work using regex if you make robust assumptions about the HTML you'll use it on (i.e. to make it regular). Otherwise your problems will grow and grow and grow!
disclaimer: I haven't tested any of the regex used to see if they do what I say they do, they're just example attempts.
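For completeness, here is what the regex route could look like when those robust assumptions do hold. This is a sketch, not a general HTML parser: extractInlineScripts is a hypothetical name, and it assumes no attribute value contains a > character and that the HTML is available as a string.

```javascript
// Hypothetical helper: collects the text between <script ...> and </script>.
// [\s\S] matches any character including newlines; a "." inside a character
// class would only match a literal dot.
function extractInlineScripts(html) {
  const pattern = /<script\b[^>]*>([\s\S]*?)<\/script\s*>/gi;
  return Array.from(html.matchAll(pattern), (match) => match[1]);
}

const sampleHtml =
  '<p>hi</p><script type = "text/javascript">var a = 1;</script>\n' +
  "<SCRIPT type=text/javascript>\nvar b = 2;\n</SCRIPT>";
const scriptBodies = extractInlineScripts(sampleHtml);
```

When a DOM is available, the getElementsByTagName approach above remains the safer choice, since the browser's parser already handles the whitespace and quoting variations.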

What regex to determine if < or > is part of HTML tag

If I have HTML like this:
<dsometext<f<legit></legit> whatever
What regex pattern do I use to switch < to &lt; before d and f?
I think it's all < which are not followed by a > but I can't wrap the regex for that around my head. I have users typing HTML and then am using jQuery to wrap the HTML and parse the nodes, however bad interim markup blows it up, so I want to swap out the <
Ideas?
Edit
I'm not trying to parse the HTML to valid HTML. I just want to knock out interim characters as users type and the HTML is updated on page. If they are typing <strong>, and are still at the < and I try to put the HTML on the page, it will cause horrible markup. That's why I need to swap it out.
Answer
I chose @pimvdb's answer because it correctly answers the question I asked.
However, to make the world happier, I found a much simpler way of doing things without using any regex. Basically, I originally had an issue where [title] was in place of an element and it had no container element guaranteed to contain just the title. Therefore changing the innerHTML of anything would cause horrors. We simply added the wrapping element. The hesitation to do that, and the cause of this thread, was due to some crazy reasons specific to the app and backwards compatibility for our users.
It's not good practice to parse HTML with regexps, but this will do fine for your sample:
"<dsometext<f<legit></legit> whatever".replace(/(?!<[^<>]+>)</g, "&lt;");
The (?!<[^<>]+>) ensures that the < character to be replaced does not match the <...> pattern.
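The lookahead approach can be wrapped in a function and tried on the half-typed input from the question. This is a sketch: escapeStrayLessThan is a hypothetical name, and the replacement is assumed to be the &lt; entity, which is what the question asks for.

```javascript
// Hypothetical helper: escapes every "<" that does not start a complete
// <...> sequence, leaving well-formed tags untouched.
function escapeStrayLessThan(html) {
  return html.replace(/(?!<[^<>]+>)</g, "&lt;");
}

const interim = "<dsometext<f<legit></legit> whatever";
const safeInterim = escapeStrayLessThan(interim);
const halfTyped = escapeStrayLessThan("<strong"); // user has not typed ">" yet
```

Note that this only checks for a closing >, not whether the tag name is a real element, so something like <notatag> is left alone.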
Parsing HTML or XML this way is not recommended, but it can be done with the replace method itself:
"<dsometext<f<legit></legit>".replace("<d","&lt;d").replace("<f","&lt;f")

Is there a javascript function that converts characters to the &code; equivalent?

I have text that was created with CKEditor, and it seems to be inserting &nbsp; where spaces should be. It appears to do similar conversions with >, <, &, etc., which is fine, except that when I make a DOMSelection, those codes are removed.
So, this is what is selected:
beforeHatchery (2)
But this is what is actually in the DOM:
beforeHatchery (2)
note that I outputted the selection and the original text stored in the database using variable.inspect, so all the quotes are escaped (they wouldn't be when sent to the browser).
To save everyone the pain of looking for the difference:
From the first: Hatchery</a> (2) (The Selection)
From the second: Hatchery</a>&nbsp;(2) (The original)
These differences are at the very end of the selection.
So... there are three ways, I can see of, to approach this.
1) - Replace all characters commonly replaced with codes with their codes,
and hope for the best.
2) - Javascript may have some uncommon function / a library may exist that
replaces these characters for me (I think this might be the way CKeditor
does its character conversion).
3) - Figure out the way CKeditor converts and do the conversion exactly that way.
I'm using Ruby on Rails, but that shouldn't matter for this problem.
Some other things that I found out that it converts:
1: It seems to only convert spaces to &nbsp; if the space(s) is before or after a tag:
e.g.: "With quick&nbsp;<a href..."
2: It changes apostrophes to the hex value
e.g.: "opponent&#39;s"
3: It changes "&" to "&amp;"
4: It changes angle brackets to "&gt;" and "&lt;" appropriately.
Does anyone have any thoughts on this?
To encode html entities in str (your question title asks for this, if I understand correctly):
$('<div/>').text(str).html();
To decode html entities in str:
$('<div/>').html(str).text();
These rely on jQuery, but vanilla alternatives are basically the same but more verbose.
To encode html entities in str:
var el = document.createElement('div');
el.innerText = str;
el.innerHTML;
To decode html entities in str:
var el = document.createElement('div');
el.innerHTML = str;
el.innerText;
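Where no DOM is available, the same idea can be approximated with plain string replacement. This is a sketch under stated assumptions: encodeText and decodeText are hypothetical names, and the entity table covers only the characters discussed in the question (&, <, >, non-breaking space, and &#39; on decode), not the full HTML entity list.

```javascript
// Assumed minimal entity table for text content.
const ENCODE_MAP = { "&": "&amp;", "<": "&lt;", ">": "&gt;", "\u00A0": "&nbsp;" };
const DECODE_MAP = { amp: "&", lt: "<", gt: ">", nbsp: "\u00A0", "#39": "'" };

// Encode: replace each special character with its entity in a single pass.
function encodeText(str) {
  return str.replace(/[&<>\u00A0]/g, (ch) => ENCODE_MAP[ch]);
}

// Decode: a single pass as well, so "&amp;lt;" becomes "&lt;" and not "<".
function decodeText(str) {
  return str.replace(/&(amp|lt|gt|nbsp|#39);/g, (whole, name) => DECODE_MAP[name]);
}

const original = "Hatchery\u00A0(2) < 3 & opponent's";
const encoded = encodeText(original);
const roundTripped = decodeText(encoded);
```

The DOM-based versions above handle every entity the browser knows, so they are preferable whenever a document is at hand.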
1. Conversion of spaces to &nbsp; is usually done by the browser while editing content.
2. Conversion of ' to &#39; can be controlled with http://docs.cksource.com/ckeditor_api/symbols/CKEDITOR.config.html#.entities_additional
3. and 4. are usually needed to avoid breaking code written in design view when that content is loaded again. You can try to change http://docs.cksource.com/ckeditor_api/symbols/CKEDITOR.config.html#.basicEntities but that can lead to problems in the future.

Recommendations on Trimming Large Amounts of Text from a DOM Object

I'm doing some in browser editing, and I have some content that's on the order of around 20k characters long in a <textarea>.
So it looks something like:
<textarea>
Text 1
Text 2
Text 3
Text 4
[...]
Text 20,000
</textarea>
I'd like to use jquery to trim it down when someone hits a button to chop, but I'm having trouble doing it without overloading the browser. Assume I know that the character numbers are at 16,510 - 17,888, and what I'd like to do is trim it.
I was using:
jQuery('#textsection').html(jQuery('textarea').html().substr(range.start));
But browsers seem to enjoy crashing when I do this. Alternatives?
EDIT
Solution from the comments:
var removeTextNode = document.getElementById('textarea').firstChild;
removeTextNode.splitText(indexOfCharacterToRemoveEverythingBefore);
removeTextNode.parentNode.removeChild(removeTextNode);
Not sure about jQuery, but with plain vanilla Javascript, this can be done by using the splitText() method of the textNode object. Your <pre> has a textNode child which contains all the text inside of it. (You can get it from the childNodes collection.) Split it at the desired index, then use removeChild() to delete the part you don't need.
What browser are you testing on?
substr takes the starting index and an optional length. If the length is omitted, it extracts up to the end of the string. substring takes the starting and ending index of the string to extract, which I think might be a better option since you already have those available.
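The difference between the two methods is easy to see on a short string. This is a sketch with made-up values; the names start and end stand in for the known character positions from the question.

```javascript
const sampleText = "0123456789";
const start = 3;
const end = 7;

// substring(start, end): extracts from start up to, but not including, end.
const bySubstring = sampleText.substring(start, end);

// substr(start, length): the second argument is a length, not an end index,
// so the equivalent call has to pass end - start.
const bySubstr = sampleText.substr(start, end - start);

// Removing the range instead of keeping it: join what lies on either side.
const withoutRange = sampleText.slice(0, start) + sampleText.slice(end);
```

Passing the end index to substr by mistake would return a longer piece than intended, which is one reason substring (or slice) is the less error-prone choice here.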
I've created a small example at fiddle using the book Alice's Adventures in Wonderland, by Lewis Carroll. The book is about 160,000 characters in length, and you can try with various starting/ending indexes and see if it crashes the browser. Seems to work fine on my Chrome, Firefox, and Safari. Unfortunately I don't have access to IE. Here's the function that's used:
function chop(start, end) {
  var trimmedText = $('#preId').html().substring(start, end);
  $('textarea').val(trimmedText);
}
