Tokenize HTML string in JavaScript [duplicate] - javascript

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 5 years ago.
I would like to split a string that looks like:
This is <strong>a</strong> test link and <br /> line. break
into the following with JavaScript:
[
'This',
'is',
'<strong>a</strong>',
'test',
'link',
'<br />',
'line.',
]
I tried splitting on spaces, and < >, but that obviously doesn't work for tags like strong and a. I'm not sure how to write a regex that doesn't split within HTML tags. I also tried to use jQuery children(), but it doesn't extract plain text, just the html tags. Any help would be great.

If the code is executing in a browser, using the browser's parser to separate the string into text and tag components may provide an alternative workaround:
var text = 'This is <strong>a</strong> link and <br /> line. break';
function splitHTML(text) {
    var parts = [];
    var div = document.createElement('DIV');
    div.innerHTML = text;
    div.normalize();
    for (var node = div.firstChild; node; node = node.nextSibling) {
        if (node.nodeType == Node.TEXT_NODE) {
            parts.push.apply(parts, node.textContent.split(" "));
        }
        else if (node.nodeType == Node.ELEMENT_NODE) {
            parts.push(node.outerHTML);
        }
    }
    return parts;
}
console.log(splitHTML(text));
Note that the line that adds text nodes, split on spaces, to the result:
parts.push.apply( parts, node.textContent.split(" "));
is for demonstration only and needs further work to prevent zero-length strings in the output where spaces sit between text and HTML elements. Also, the HTML tags are reconstructed from the DOM elements and may not exactly match the input: in this case the XHTML-style <br /> tags are returned as <br> HTML tags (which don't take a closing tag).
The general idea is to sidestep parsing HTML with a regex by letting the browser parse it. Understandably, this may or may not fit the target environment and the full set of requirements.
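One way to avoid the zero-length strings mentioned above is to split text on runs of whitespace and drop empty pieces. This is only a sketch: `splitWords` is a hypothetical helper name, not part of the original answer, and it assumes that any whitespace (not just a single space) should separate words.

```javascript
// Hypothetical helper: split a text node's content into words,
// discarding the empty strings produced by leading/trailing whitespace.
function splitWords(text) {
    return text.split(/\s+/).filter(function (word) {
        return word.length > 0;
    });
}

console.log(splitWords(' link and '));
```

Inside splitHTML, the text-node branch could then call parts.push.apply(parts, splitWords(node.textContent)) instead of splitting on a single space.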

To achieve what you want, you need to consider two rules:
Rule 1) If no "<" has occurred yet, simply split at " ".
Rule 2) If a "<" has occurred, look for the end of the tag (either "/>" for a self-closing tag, or the matching "</...>" closing tag), split after it, then start at rule 1 again.
Apply those rules while looping through the string and you are golden.
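The two rules can be sketched as a single loop over the string. This is a minimal illustration, not a real HTML parser: `tokenizeHtml` is a hypothetical name, and the sketch assumes tags are not nested inside a tag of the same name and that attribute values contain no ">" characters.

```javascript
// Hypothetical sketch of the two rules; not a general HTML parser.
function tokenizeHtml(input) {
    var tokens = [];
    var i = 0;
    while (i < input.length) {
        if (input[i] === ' ') { i++; continue; }   // skip separators
        var end;
        if (input[i] === '<') {
            // Rule 2: find the end of the tag and split after it.
            var tagEnd = input.indexOf('>', i);
            if (tagEnd === -1) { tokens.push(input.slice(i)); break; }
            if (input[tagEnd - 1] === '/') {
                end = tagEnd + 1;                  // self-closing, e.g. <br />
            } else {
                var name = input.slice(i + 1, tagEnd).split(/\s/)[0];
                var close = '</' + name + '>';
                var closeAt = input.indexOf(close, tagEnd);
                end = closeAt === -1 ? tagEnd + 1 : closeAt + close.length;
            }
        } else {
            // Rule 1: plain text runs until the next space or "<".
            end = i;
            while (end < input.length && input[end] !== ' ' && input[end] !== '<') end++;
        }
        tokens.push(input.slice(i, end));
        i = end;
    }
    return tokens;
}

console.log(tokenizeHtml('This is <strong>a</strong> test link and <br /> line. break'));
```

Everything between an opening tag and its closing tag is kept as one token, so spaces inside an element do not cause a split.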
Making this recursive, i.e. handling nested tags like
<div>
<p>Hi</p>
<p>Bye</p>
</div>
is harder. As mentioned above, actually parsing an HTML tree is very complex.

Try this:
#(?:(?!<)[^<>]+(?!>))|(?:<(?=[^/>]+\/>).*\/>)|(?:<([^\s]+).*>.*(?=<\/\1>)<\/\1>)#g
It should work in simple cases, which is all I can think of right now.
Use the captured group to find the tag name, then run it recursively on block elements such as div.

Related

IE11 innerHTML strange behaviour

I have very strange behaviour with element.innerHTML in IE11.
As you can see there: http://pe281.s3.amazonaws.com/index.html, some riotjs expressions are not evaluated.
I've tracked it down to 2 things:
- the euro sign above it. It's encoded as €, but I have the same behaviour with \u20AC or €. It happens with all characters in the currency symbols range, and some other ranges. Removing the character, or using a standard character instead, makes the issue go away.
- The way riotjs creates a custom tag and template. Basically it does this:
var html = "{reward.amount.toLocaleString()}<span>€</span>{moment(expiracyDate).format('DD/MM/YYYY')}";
var e = document.createElement('div');
e.innerHTML = html;
In the resulting e node, e.childNodes returns the following array:
[0]: {reward.amount.toLocaleString()}
[1]: <span>€</span>
[2]: {
[3]: moment(expiracyDate).format('DD/MM/YYYY')}
Obviously nodes 2 and 3 should be a single node. Having them split prevents riot from recognizing an expression to evaluate, hence the issue.
But there's more: The problem is not consistent, and for instance cannot be reproduced on a fiddle: https://jsfiddle.net/5wg3zxk5/4/, where the html string is correctly parsed.
So I guess my question is how can some specific characters change the way element.innerHTML parses its input? How can it be solved?
.childNodes is a generated array (...well NodeList) that is filled with ELEMENT_NODE but may also be filled with: ATTRIBUTE_NODE, TEXT_NODE, CDATA_SECTION_NODE, ENTITY_REFERENCE_NODE, ENTITY_NODE, PROCESSING_INSTRUCTION_NODE, COMMENT_NODE, DOCUMENT_NODE, DOCUMENT_TYPE_NODE, DOCUMENT_FRAGMENT_NODE, NOTATION_NODE, ...
You probably want only nodes from the type: ELEMENT_NODE (div and such..) and maybe also TEXT_NODE.
Use a simple loop to keep just those nodes with .nodeType === Node.ELEMENT_NODE (or just compare it to its numeric value, which is 1).
You can also use the much simpler alternative, .children, which returns only element nodes.
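A minimal sketch of that filtering loop is shown below. `keepElementsAndText` is a hypothetical name, and the function only looks at the nodeType property, so it accepts a real NodeList as well as any array-like of objects with that property.

```javascript
// Numeric values of Node.ELEMENT_NODE and Node.TEXT_NODE.
var ELEMENT_NODE = 1;
var TEXT_NODE = 3;

// Hypothetical helper: keep only element and text nodes from a
// NodeList (or any array-like whose items have a nodeType).
function keepElementsAndText(nodes) {
    return Array.prototype.filter.call(nodes, function (node) {
        return node.nodeType === ELEMENT_NODE || node.nodeType === TEXT_NODE;
    });
}
```

In a browser it could be called as keepElementsAndText(e.childNodes).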
Replace <br> with <br /> (they are self-closing tags). IE is trying to close the tags for you. That's why you have doubled br tags.
I think it should be something like this:
var html = reward.amount.toLocaleString() + "€<br>" + moment(expiracyDate).format('DD/MM/YYYY') + " <br>";
var e = document.createElement('div');
e.innerHTML = html;
The parts I moved out of the quotes seem to be variables or other expressions, not strings, so they should not be in quotes.

Match text not inside span tags

Using Javascript, I'm trying to wrap span tags around certain text on the page, but I don't want to wrap tags around text already inside a set of span tags.
Currently I'm using:
html = $('#container').html();
var regex = /([\s| ]*)(apple)([\s| ]*)/g;
html = html.replace(regex, '$1<span class="highlight">$2</span>$3');
It works but if it's used on the same string twice or if the string appears in another string later, for example 'a bunch of apples' then later 'apples', I end up with this:
<span class="highlight">a bunch of <span class="highlight">apples</span></span>
I don't want it to replace 'apples' the second time because it's already inside span tags.
It should match 'apples' here:
Red apples are my <span class="highlight">favourite fruit.</span>
But not here:
<span class="highlight">Red apples are my favourite fruit.</span>
I've tried using this but it doesn't work:
([\s| ]*)(apples).*(?!</span)
Any help would be appreciated. Thank you.
First off, you should know that parsing HTML with regex is generally considered to be a bad idea; a DOM parser is usually recommended. With this disclaimer, I will show you a simple regex solution.
This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."
We can solve it with a beautifully simple regex:
<span.*?<\/span>|(\bapples\b)
The left side of the alternation | matches complete <span... /span> tags. We will ignore these matches. The right side matches and captures apples to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
This program shows how to use the regex (see the results in the right pane of the online demo). Please note that in the demo I replaced with [span] instead of <span> so that the result would show in the browser (which interprets the html):
var subject = 'Red apples are my <span class="highlight">favourite apples.</span>';
var regex = /<span.*?<\/span>|(\bapples\b)/g;
var replaced = subject.replace(regex, function(m, group1) {
    // group1 is undefined (not "") when the left side matched a whole span
    if (group1 === undefined) return m;
    else return "<span class=\"highlight\">" + group1 + "</span>";
});
document.write("<br>*** Replacements ***<br>");
document.write(replaced);
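The same alternation trick can be wrapped in a reusable function that runs without a browser. This is a sketch under stated assumptions: `highlightOutsideSpans` and `escapeRegExp` are hypothetical names, spans are assumed not to be nested, and `[\s\S]` is used instead of `.` so that a span may contain line breaks.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wrap whole-word occurrences of `word` in a
// highlight span, skipping text that already sits inside a span.
function highlightOutsideSpans(subject, word) {
    var regex = new RegExp(
        '<span[\\s\\S]*?</span>|(\\b' + escapeRegExp(word) + '\\b)', 'g');
    return subject.replace(regex, function (match, group1) {
        // group1 is undefined when the left alternative (a span) matched.
        return group1 === undefined
            ? match
            : '<span class="highlight">' + group1 + '</span>';
    });
}

console.log(highlightOutsideSpans(
    'Red apples are my <span class="highlight">favourite apples.</span>', 'apples'));
```

Because existing spans are matched and returned unchanged, running the function a second time on its own output leaves the string as it is.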
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...

Removing non-break-spaces in JavaScript

I am having trouble removing spaces from a string. First I am converting the div to text(); to remove the tags (which works) and then I'm trying to remove the "&nbsp;" part of the string, but it won't work. Any idea what I'm doing wrong?
newStr = $('#myDiv').text();
newStr = newStr.replace(/&nbsp;/g, '');
$('#myText').val(newStr);
<html>
    <div id = "myDiv"><p>remove&nbsp;space</p></div>
    <input type = "text" id = "myText" />
</html>
When you use the text function, you're not getting HTML, but text: the entities have been changed to spaces.
So simply replace spaces:
var str = " a     b   ", // bunch of NBSPs
newStr = str.replace(/\s/g,'');
console.log(newStr)
If you want to replace only the spaces coming from &nbsp;, do the replacement before the conversion to text:
newStr = $($('#myDiv').html().replace(/&nbsp;/g,'')).text();
.text()/textContent do not contain HTML entities (such as &nbsp;); these are returned as literal characters. Here's a regular expression using the non-breaking space Unicode escape sequence:
var newStr = $('#myDiv').text().replace(/\u00A0/g, '');
$('#myText').val(newStr);
It is also possible to use a literal non-breaking space character instead of the escape sequence in the Regex, however I find the escape sequence more clear in this case. Nothing that a comment wouldn't solve, though.
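The difference between the patterns discussed here can be seen on a plain string, without jQuery. The `sample` value below is made up for illustration; it stands in for what .text() would return for markup containing a non-breaking space entity.

```javascript
// Stand-in for the text content of markup that contained one
// non-breaking space entity and one ordinary space.
var sample = 'remove\u00A0space here';

// Removes only the non-breaking space.
var onlyNbspRemoved = sample.replace(/\u00A0/g, '');

// \s matches U+00A0 as well as ordinary whitespace, so both go.
var allWhitespaceRemoved = sample.replace(/\s/g, '');

// The entity text never appears in text content, so nothing changes.
var entityPatternResult = sample.replace(/&nbsp;/g, '');

console.log(onlyNbspRemoved, allWhitespaceRemoved, entityPatternResult === sample);
```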
It is also possible to use .html()/innerHTML to retrieve the HTML containing HTML entities, as in @Dystroy's answer.
Below is my original answer, where I misinterpreted the OP's use case. I'll leave it here in case anyone needs to remove &nbsp; from DOM elements' text content:
[...] However, be aware that re-setting the .html()/innerHTML of an element means trashing out all of the listeners and data associated with it.
So here's a recursive solution that only alters the text content of text nodes, without reparsing the HTML and without side effects.
function removeNbsp($el) {
    $el.contents().each(function() {
        if (this.nodeType === 3) {
            this.nodeValue = this.nodeValue.replace(/\u00A0/g, '');
        } else {
            removeNbsp( $(this) );
        }
    });
}
removeNbsp( $('#myDiv') );

regex replace characters within tags

I'm already using a html parser, but I need to create a regex that will select the < and > symbols within the first instance of <code> tags - in this case, the one with the class "html".
<code class="html">
    <b>test</b><script>lol</script>
    <code>test</code> <b> test </b>
    <lol>
    </lol>
    <test>
</code>
So every < or > within the indented area starting from <b> to the start of the last </code> should be replaced, leaving the outer <code> tags alone.
I'm using JavaScript's .replace method and would like all < and > symbols within the code area to turn into their escaped forms, &lt; and &gt;.
I imagine it's best to use a lookahead/lookbehind regex with $1 etc., but I can't figure out where to begin, so any help would be much appreciated.
How about something like this? In this example I'm creating an element and populating it with the HTML, just to get things started:
var doc = document.createElement( 'div' );
doc.innerHTML = inputHtml; // inputHtml is a placeholder for your input HTML string
Here I'm pulling out the first code tag:
var string = doc.getElementsByTagName( 'code' )[0].innerHTML;
Once you have the string, simply replace the desired brackets with:
string = string.replace(/</g, "&lt;");
string = string.replace(/>/g, "&gt;");
Then just reinsert the replaced value back into your source HTML.
The easy way:
var elem = $('.html');
elem.text(elem.html());
This will not necessarily use a literal &lt; for escaping; if you're fine with a different escape, it's much simpler than anything else you can do, though.
If you have multiple elements like that, you might need to wrap the second line in an elem.each(); otherwise the html() method will probably just concatenate the content from all elements or something similarly pointless.
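If no DOM is available, the replacement can also be sketched on the raw string. This is an illustration under stated assumptions, not a robust solution: `escapeInsideFirstCode` is a hypothetical name, and the sketch assumes that the last </code> in the string is the one that closes the first <code> tag, as in the example from the question.

```javascript
// Hypothetical helper: escape < and > between the first opening <code ...>
// tag and the last </code>, leaving those two outer tags untouched.
function escapeInsideFirstCode(html) {
    var open = html.match(/<code\b[^>]*>/);
    if (!open) return html;                       // no code tag at all
    var start = open.index + open[0].length;      // first char after the opening tag
    var end = html.lastIndexOf('</code>');        // assumed to close the outer tag
    if (end < start) return html;
    var inner = html.slice(start, end)
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');
    return html.slice(0, start) + inner + html.slice(end);
}

console.log(escapeInsideFirstCode('<code class="html"><b>x</b><code>y</code></code>'));
```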

Remove and extract text in javascript

I'm wanting to do the following in JavaScript as efficiently as possible:
Remove <ul></ul> tags from a string and everything in between.
For what remains, every string that is encased within <li> and </li> I want dumped in an array, without any newline characters lurking at the end.
I'm thinking regexes are the answer but I've never used them before. Guess I could figure out a way but eventually it would probably not be the most efficient.
As others have said, you do have to be careful parsing HTML with regexes. If the HTML is controlled (e.g. it comes from a known source in a known format), has no nested ul or li tags, and has no embedded strings containing valid HTML tags or < or > characters, it can work fine. Here's one way to do what I think you were asking for:
function parseList(str) {
    var output = [], matches;
    var re = /<\s*li[^>]*>(.*?)<\/li>/gi;
    // remove newlines
    str = str.replace(/\n|\r/igm, "");
    // get text between ul tags
    matches = str.match(/<\s*ul[^>]*>(.*?)<\/ul\s*>/);
    if (matches) {
        str = matches[1];
        // get text between each li tag
        while (matches = re.exec(str)) {
            output.push(matches[1]);
        }
    }
    return output;
}
It is more foolproof to use an actual HTML parser that understands the finer points of the format (like nested tags, tag values in embedded strings, etc...), but if you have none of that, a simpler parser like this can be used.
You can see it work here: http://jsfiddle.net/jfriend00/c9ZLT/
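A quick usage sketch for parseList follows. The function is repeated so that the snippet runs on its own, and the `sample` string is made up for illustration.

```javascript
// Same function as above, repeated so this snippet is self-contained.
function parseList(str) {
    var output = [], matches;
    var re = /<\s*li[^>]*>(.*?)<\/li>/gi;
    // remove newlines
    str = str.replace(/\n|\r/g, "");
    // get text between ul tags
    matches = str.match(/<\s*ul[^>]*>(.*?)<\/ul\s*>/);
    if (matches) {
        str = matches[1];
        // get text between each li tag
        while ((matches = re.exec(str)) !== null) {
            output.push(matches[1]);
        }
    }
    return output;
}

// Made-up sample input with newlines between the items.
var sample = '<ul>\n<li>one</li>\n<li>two</li>\n</ul>';
var items = parseList(sample);
console.log(items);
```

A string without a ul element gives an empty array, because the outer match fails and the loop is never entered.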
