Recommendations on Trimming Large Amounts of Text from a DOM Object - javascript

I'm doing some in-browser editing, and I have some content that's around 20k characters long in a <textarea>.
So it looks something like:
<textarea>
Text 1
Text 2
Text 3
Text 4
[...]
Text 20,000
</textarea>
I'd like to use jQuery to trim it down when someone hits a button to chop, but I'm having trouble doing it without overloading the browser. Assume I know the relevant character positions are 16,510 - 17,888, and I'd like to trim the text accordingly.
I was using:
jQuery('#textsection').html(jQuery('textarea').html().substr(range.start));
But browsers seem to enjoy crashing when I do this. Alternatives?
EDIT
Solution from the comments:
var removeTextNode = document.getElementById('textarea').firstChild;
removeTextNode.splitText(indexOfCharacterToRemoveEverythingBefore);
removeTextNode.parentNode.removeChild(removeTextNode);

Not sure about jQuery, but with plain vanilla JavaScript, this can be done by using the splitText() method of the text node. Your <pre> has a text node child which contains all the text inside of it. (You can get it from the childNodes collection.) Split it at the desired index, then use removeChild() to delete the part you don't need.

What browser are you testing on?
substr takes the starting index and an optional length. If the length is omitted, it extracts up to the end of the string. substring takes the starting and ending index of the string to extract, which I think might be a better option since you already have those available.
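To make the difference concrete, here is a minimal sketch using a short made-up string rather than the real 20k-character content. The variable names and sample values are assumptions for illustration only; the last line uses slice to remove a range, which is a variation not taken from the answer. Note that substr is a legacy method, so substring or slice may be the safer choice in new code.

```javascript
// Illustrative values only; the real indexes would come from the question's range.
const text = "0123456789";
const start = 3;
const end = 7;

// substr(start, length): the second argument is a length, not an index.
const fromStart = text.substr(start);       // everything from index 3 onward
const twoChars = text.substr(start, 2);     // 2 characters starting at index 3

// substring(start, end): the second argument is the (exclusive) end index.
const between = text.substring(start, end); // characters at indexes 3 through 6

// Removing the range instead of keeping it (slice is an assumption here).
const without = text.slice(0, start) + text.slice(end);

console.log(fromStart, twoChars, between, without);
```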
I've created a small example at fiddle using the book Alice's Adventures in Wonderland, by Lewis Carroll. The book is about 160,000 characters in length, and you can try with various starting/ending indexes and see if it crashes the browser. Seems to work fine on my Chrome, Firefox, and Safari. Unfortunately I don't have access to IE. Here's the function that's used:
function chop(start, end) {
    var trimmedText = $('#preId').html().substring(start, end);
    $('textarea').val(trimmedText);
}

Related

How is plain text modified when set through innerHTML?

When setting innerHTML = '\r\n', it seems like browsers may end up writing '\n'.
This introduces a gap between the actual plain text content of the element and what I have been keeping track of.
Is this a rather isolated problem or are there many more potential changes I should be aware of?
How to ensure that the content of the text nodes matches exactly what I'm trying to write?
I guess it's possible just not to use innerHTML, build the nodes and the text nodes and insert them, but it's much less convenient.
When you read a string from innerHTML, it's not the string you wrote, it's created completely from scratch by converting the DOM structure of the element into HTML that will (mostly) create it. That means lots of things happen:
Newlines are normalized
Character entities are normalized
Quotes are normalized
Tags are normalized
Tags are corrected if the text you supplied defined an invalid HTML structure
...and so on. You can't expect a round-trip through the DOM to result in exactly the same string.
If you're dealing with pure text content, you can use textContent instead:
const x = document.getElementById("x");
const str = "CRLF: \r\n";
x.textContent = str;
console.log(x.textContent === str);
<div id="x"></div>
I can't 100% vouch for there being no newline normalization (or come to that, Unicode normalization; you might run the string through normalize first) although a quick test with a Chromium-based browser, Firefox, and iOS Safari suggested there wasn't, but certainly most of the issues with innerHTML don't occur.
It is not an isolated issue; it is expected behavior, since assigning to the innerHTML property means the browser parses your string as HTML.
If you want to make sure your text matches exactly what you are writing, including new lines and spaces, use the HTML <pre> tag and write your text node there.
pre tag description: https://www.w3schools.com/tags/tag_pre.asp

End-of-string regex match too slow

Demo here. The regex:
([^>]+)$
I want to match text at the end of an HTML snippet that is not contained in a tag (i.e., a trailing text node). The regex above seems like the simplest match, but the execution time seems to scale linearly with the length of the match-text (and has caused hangs in the wild when used in my browser extension). It's also equally slow for matching and non-matching text.
Why is this seemingly simple regex so bad?
(I also tried RegexBuddy but can't seem to get an explanation from it.)
Edit: Here's a snippet for testing the various regexes (click "Run" in the console area).
Edit 2: And a no-match test.
Consider an input like this
abc<def>xyz
With your original expression, ([^>]+)$, the engine starts from a, fails on >, backtracks, restarts from b, then from c etc. So yes, the time grows with size of the input. If, however, you force the engine to consume everything up to the rightmost > first, as in:
.+>([^>]+)$
the backtracking will be limited by the length of the last segment, no matter how much input is before it.
The second expression is not equivalent to the first one, but since you're using grouping, it doesn't matter much, just pick matches[1].
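As a rough sketch of the two patterns side by side: the helper names below are hypothetical, and the third function is a non-regex alternative based on lastIndexOf that is not part of the answer above. One difference worth noting is that the anchored pattern requires at least one > in the input, so it returns no match for plain text without tags; also, . does not match newlines, so multi-line input may need [\s\S] in its place.

```javascript
// Original pattern: retried from every start position, so it can be slow on long input.
function trailingTextSlow(html) {
  const m = /([^>]+)$/.exec(html);
  return m ? m[1] : null;
}

// Anchored pattern from the answer: consume up to the rightmost ">" first,
// then read the trailing text from capture group 1.
function trailingTextAnchored(html) {
  const m = /.+>([^>]+)$/.exec(html);
  return m ? m[1] : null;
}

// Non-regex alternative (an assumption, not from the answer):
// take everything after the last ">".
function trailingTextLastIndexOf(html) {
  const tail = html.slice(html.lastIndexOf(">") + 1);
  return tail.length > 0 ? tail : null;
}

const sample = "abc<def>xyz";
console.log(trailingTextSlow(sample));
console.log(trailingTextAnchored(sample));
console.log(trailingTextLastIndexOf(sample));
```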
Hint: even when you target javascript, switch to the pcre mode, which gives you access to the step info and debugger:
(look at the green bars!)
Instead of a regex, which is time-consuming here, you could use the actual DOM:
var html = "<div><span>blabla</span></div><div>bla</div>Here I am !";
var temp = document.createElement('div');
temp.innerHTML = html;
var lastNode = temp.lastChild;
// nodeType 3 is a text node
if (lastNode && lastNode.nodeType === 3) {
    alert(lastNode.nodeValue);
}

Javascript doesn't interpret Hair space as a space with regex

I use a regex for my split function.
string.split(/\s/)
But &#8202; (which is a Hair Space) is not recognised. How can I make sure it is, without hard-coding that exact code in the regex?
Per MDN, the definition of \s in a regex (in the Firefox browser) is this:
[ \f\n\r\t\v\u00a0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]
So, if you want to split on something in addition to this (e.g. an HTML entity), then you will need to add that to your own regex. Remember, string.split() is not an HTML function, it's a string function so it doesn't know anything special about HTML. If you want to split on certain HTML tags or entities, you will have to code up a regex that includes the things you want to split on.
You can code for it yourself like this:
string.split(/\s|&#8202;/);
Working demo: http://jsfiddle.net/jfriend00/nAQ97/
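The distinction between the hair space character itself and the entity text that stands for it in HTML source can be sketched as follows. The sample strings and variable names are made up for illustration, and the example assumes the entity is written as &#8202; (U+200A in decimal).

```javascript
// The actual hair space character (U+200A).
const HAIR_SPACE = "\u200a";
const withCharacter = "a" + HAIR_SPACE + "b c";

// The same thing written as entity text, as it might appear in HTML source.
const withEntityText = "a&#8202;b c";

// \s matches the real character...
const splitCharacter = withCharacter.split(/\s/);

// ...but not the seven literal characters "&#8202;".
const splitEntityPlain = withEntityText.split(/\s/);

// Adding the entity text as an alternative handles that case.
const splitEntityExtended = withEntityText.split(/\s|&#8202;/);

console.log(splitCharacter, splitEntityPlain, splitEntityExtended);
```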
If what you really want to do is to have your HTML parsed and converted to text by the browser (which will process all entities and HTML tags), then you can do this:
function getPlainText(str) {
var x = document.createElement("div");
x.innerHTML = str;
return (x.textContent || x.innerText);
}
Then, you could split your string like this:
getPlainText(str).split(/\s/);
Working demo: http://jsfiddle.net/jfriend00/KR2aa/
If you want to make absolutely sure this works in older browsers, you have three choices: test one of the functions above in every browser you care about; for the first option, use a custom regex listing all the entities you want to split on; or, for the second option, search for every Unicode character you want to split on and replace it with a regular space before doing the split. Because older browsers weren't very consistent here, there is no free lunch if you want safe compatibility with old browsers.
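The search/replace idea could look roughly like this. The function name and the exact character list are assumptions for illustration; the list is taken from the \s definition quoted earlier, minus the ASCII whitespace that any engine already handles, and should be adjusted to whatever characters you actually need.

```javascript
// Unicode space characters to normalize (an illustrative list, not exhaustive).
const EXTRA_SPACES = /[\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000]/g;

// Hypothetical helper: turn each listed character into a regular space,
// then split on whitespace as before.
function splitOnAnySpace(text) {
  return text.replace(EXTRA_SPACES, " ").split(/\s/);
}

console.log(splitOnAnySpace("a\u200ab\u3000c d"));
```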

Slice text in two without breaking tags in jQuery

I have the following code that I managed to put together by combining different resources. It takes the HTML of some content and breaks it into two halves (for a "read more" feature). The code is written so that it doesn't break in the middle of a word (it backs up to the last space).
var minCharCount = 600;
var divcontent = $('#myDiv').html();
var firstHalf = divcontent.substr(0, minCharCount);
firstHalf = firstHalf.substr(0, Math.min(firstHalf.length, firstHalf.lastIndexOf(" ")));
var secondHalf = divcontent.substr(firstHalf.length, divcontent.length);
However, one last issue with this is that it can break html tags resulting in bad code. Is there a way to make sure that the code breaks them in two after any potential tag ends?
Edit: Maybe that was a little difficult to understand. What I want is:
long text comes here with tags like <b>bold</b> or even <i>italic</i>.
^1 ^2 ^3
So my point is if we break at 1 its fine, but if we break at 2 and append the two parts somewhere, we get problems. So before breaking at 2 we need to check if it is in the middle of a tag. If it is then wait until the tag ends and then break: i.e. at 3.
WORKING DEMO
Instead of
$('#myDiv').html();
run the function on the string returned from
$('#myDiv').text();
This way you don't get any html tags in the input string.
http://api.jquery.com/text/
UPDATE:
(in response to comment)
Since you want to keep the HTML tags, you can loop through the children() of the target, measure the length of each child's .text(), and add the children to the output until you reach the minChars amount. Then do the same inside the last child you reached, until you get as close as possible to the target character count.
Note that children() excludes text nodes, so you have to use contents() instead.
However, this approach is cumbersome. I think your best bet is to create a Range object;
see here: https://developer.mozilla.org/en-US/docs/Web/API/Document.createRange
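For comparison, here is a DOM-free sketch that works directly on the HTML string. This is a different technique from the ones suggested above, not the answerer's method: it scans the tags, tracks how many elements are open, and returns the first index at or after the minimum where a cut lands in top-level text. The function name and the small void-element list are assumptions, and the scan is deliberately naive (it ignores comments and a > inside attribute values).

```javascript
// Elements that never have a closing tag (a partial, illustrative list).
const VOID_ELEMENTS = new Set(["br", "hr", "img", "input"]);

// Hypothetical helper: returns the smallest index >= minChars at which the
// HTML string can be cut without splitting a tag or an open element.
function findSafeBreak(html, minChars) {
  const tagPattern = /<(\/?)([a-zA-Z][^\s/>]*)[^>]*>/g;
  let depth = 0; // number of currently open elements
  let pos = 0;   // index just past the last tag seen
  let m;
  while ((m = tagPattern.exec(html)) !== null) {
    // The text run [pos, m.index] is top-level only when depth is 0.
    if (depth === 0 && m.index >= minChars) {
      return Math.max(pos, minChars);
    }
    const isClosing = m[1] === "/";
    const isVoid = VOID_ELEMENTS.has(m[2].toLowerCase()) || m[0].endsWith("/>");
    if (isClosing) {
      depth = Math.max(0, depth - 1);
    } else if (!isVoid) {
      depth += 1;
    }
    pos = tagPattern.lastIndex;
  }
  // No more tags: cut in the trailing text if everything is closed.
  return depth === 0 ? Math.min(html.length, Math.max(pos, minChars)) : html.length;
}

const html = "long text comes here with tags like <b>bold</b> or even <i>italic</i>.";
const cut = findSafeBreak(html, 40); // 40 falls inside <b>bold</b>
console.log(html.slice(0, cut));
console.log(html.slice(cut));
```

The word-boundary adjustment from the question (backing up to the last space) is left out here to keep the sketch short; it could be applied to the returned index as long as the adjusted index stays within the same top-level text run.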

REGEXP based string.prototype.split insertion glitch

I was working on a parser that could read HTML; however, the code that splits it causes "l"s to be inserted in every other entry of the produced array.
The regexp is this:
textarea.value.split(/(?=<(.|\n)+>)/)
What it's supposed to do is split entry/exit/single HTML/XML tags while ignoring tabs and line terminators (it just appends them to tags they were split with)
May I have some insight as to what's happening?
You can view code in action and edit here:
http://jsfiddle.net/termtm/ew7Mt/2/
Just look in console for result it produces.
EDIT: MaxArt is right: the l in the last <html> tag is what causes the anomalies to be "l"s.
Try this:
textarea.value.split(/(?=<[^>]+>)/);
But... what Alnitak said. A fully-fledged HTML parser based on regexps, especially with the poor feature support of regexps in JavaScript, would be a terrible (and slow) mess.
I still have to find out the reason for the odd behaviour you found. Notice that "l" (ell) is the last letter of "<html>", i.e., the first tag of your HTML code. Change it to something else and you'll notice the letters change.
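One likely explanation is that split() splices the contents of any capturing group into the result array each time the separator matches. In the original pattern the group (.|\n)+ is greedy and repeats up to the character before the last > in the whole input, so its final capture is the "l" of the closing </html>. The sketch below uses a made-up sample string; the non-capturing variant at the end is an assumption added for comparison.

```javascript
const sample = "<html><p>hi</p></html>";

// Original pattern: the capturing group's last iteration ("l") is spliced
// into the output after every split point.
const withCapture = sample.split(/(?=<(.|\n)+>)/);

// Pattern from the answer: no capturing group, so nothing extra is inserted.
const withCharClass = sample.split(/(?=<[^>]+>)/);

// Keeping the original shape but making the group non-capturing (an assumption).
const nonCapturing = sample.split(/(?=<(?:.|\n)+>)/);

console.log(withCapture);
console.log(withCharClass);
console.log(nonCapturing);
```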
