Split any html element separately using regex in JavaScript split

What I'm trying to achieve: when I split the inner contents of an element, each word should become a separate item, but each HTML element must stay intact as a single item in the split.
For example:
<p id="name1" class=""> We deliver
<span class="color-secondary">software</span> &
<span class="color-secondary">websites</span> for your organization<span class="color-secondary">.</span>
</p>
Like in the example above, I want to make anything inside the <span> 1 array item after splitting the inner contents of #name1.
So in other words, I want the split array to look like this:
[
'we',
'deliver',
'<span class="color-secondary">software</span>',
'&',
'<span class="color-secondary">websites</span>'
... etc.
]
Currently this is what I have. But this does not work since it ignores the text inside of the html element and therefore splits it halfway through the element. I would also like it to be any html element, and not just limited to <span>.
let sentence = el.innerHTML; // el being #name1 in this case
let words = sentence.split(/\s(?=<span)/i);
How would I be able to achieve this with regex? Is this possible? Thank you for any help.

Here is a DOMParser-based solution which parses the HTML and then iterates over the top node's children, pushing the HTML into the result array if the node is an element, or splitting the text on whitespace (if it is a text node) and adding those values to the result array:
const html = `<p id="name1" class=""> We deliver
<span class="color-secondary">software</span> &
<span class="color-secondary">websites</span> for your organization<span class="color-secondary">.</span>
</p>`
const parser = new DOMParser();
const s = parser.parseFromString(html, 'text/html');
let result = [];
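for (const el of s.body.firstChild.childNodes) {
if (el.nodeType == 3 /* TEXT_NODE */ ) {
// split on any run of whitespace so newlines inside a text node do not end up inside a word
result = result.concat(el.nodeValue.trim().split(/\s+/).filter(Boolean));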
}
else if (el.nodeType == 1 /* ELEMENT_NODE */ ) {
result.push(el.outerHTML);
}
}
console.log(result);
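Since the question asked specifically about regex: for flat markup like the sample, matching tokens tends to be easier than splitting. The following is only a sketch under stated assumptions: elements are not nested inside elements of the same name, and attribute values contain no `>`. The helper name `tokenizeHtml` is made up for this illustration and is not part of the answer above.

```javascript
// Hypothetical helper: return whole elements and bare words as separate tokens.
// Assumes flat markup (no same-name nesting, no ">" inside attribute values).
function tokenizeHtml(html) {
  // First alternative: an opening tag, anything up to the matching closing tag.
  // Second alternative: a run of characters that is neither whitespace nor "<".
  const tokenPattern = /<([a-z][\w-]*)\b[^>]*>[\s\S]*?<\/\1>|[^\s<]+/gi;
  return html.match(tokenPattern) || [];
}

// The inner HTML of #name1 as the browser would serialize it ("&" becomes "&amp;")
const sample = ` We deliver
<span class="color-secondary">software</span> &amp;
<span class="color-secondary">websites</span> for your organization<span class="color-secondary">.</span>
`;

console.log(tokenizeHtml(sample));
```

For arbitrary or nested HTML the DOMParser approach above is the more robust choice; the regex only holds up while the assumptions do.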

Details are commented in the example below.
const nodeSplitter = (mainNode) => {
let scan;
/*
Check if initial node has text or elements
*/
if (mainNode.hasChildNodes()) {
scan =
/*
Collect all elements, text, and comments
into an array
*/
Array.from(mainNode.childNodes)
/*
If node is an element, return it...
...if node is text, use `.matchAll()` to
find each word and add to array...
.filter() any falsy values and flatten
the array and then return it
*/
.flatMap(node => {
if (node.nodeType === 1) {
return node;
} else if (node.nodeType === 3) {
const rgx = new RegExp(/[\w\\\-\.\]\&]+/, 'g');
let strings = [...node.textContent.matchAll(rgx)]
.filter(node => node).flat()
return strings;
} else {
/*
Otherwise, return empty array which is
basically nothing since .flatMap()
flattens an array as default
*/
return [];
}
});
} else {
// Return if mainNode is empty
return;
}
// return results
return scan;
}
const main = document.getElementById('name1');
console.log(nodeSplitter(main));
<p id="name1" class=""> We deliver
<span class="color-secondary">software</span> &
<span class="color-secondary">websites</span> for your organization
<span class="color-secondary">.</span>
</p>

Related

Highlight matched text instead of the whole the text

I have a function, getTextNodes, that searches text nodes recursively. Then I use a addHighlight function to highlight the text with <mark> tags:
const buttonEl = `<button>
<span>
Icon
</span>
Text
</button>
`;
document.body.innerHTML = buttonEl;
const foundButtonEl = document.querySelector("button");
const elements = [];
elements.push(foundButtonEl);
addHighlight(elements, "T");
function addHighlight(elements, text) {
elements.forEach((element, index) => {
const textNodes = getTextNodes(document.body);
const matchingNode = textNodes.find(node => node.textContent.includes(text));
const markElement = document.createElement('mark');
markElement.innerHTML = matchingNode.textContent;
matchingNode.replaceWith(markElement);
});
}
function getTextNodes(node) {
let textNodes = [];
if (node.nodeType === Node.TEXT_NODE) {
textNodes.push(node);
}
node.childNodes.forEach(childNode => {
textNodes.push(...getTextNodes(childNode));
});
return textNodes;
}
The problem is that addHighlight is highlighting the whole text (in the example, Text), instead of the matched text (in the example, T).
How can I change this code so that only the matched text is highlighted (text)?

matchingNode is the whole node, so you're replacing everything. If you want to match just part of it, you need to iterate through the text node and find the index position of the substring that you're searching for.
Start by splitting the node into an array
matchingNode.wholeText.split("")
Then find the index position, insert markElement at that position, and go from there.
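The index-based idea can be sketched as follows. This is a minimal illustration, not the answerer's code: `splitAroundMatch` and `highlightInTextNode` are hypothetical names, only the first occurrence is handled, and the DOM part relies on the standard `Text.splitText()` method instead of splitting into an array of characters.

```javascript
// Pure helper: find the first occurrence of `query` in `text` and return the
// pieces before, at, and after it. Returns null when there is nothing to mark.
function splitAroundMatch(text, query) {
  const start = query === '' ? -1 : text.indexOf(query);
  if (start === -1) return null;
  const end = start + query.length;
  return {
    before: text.slice(0, start),
    match: text.slice(start, end),
    after: text.slice(end)
  };
}

// Browser-only wrapper: wrap just the matched part of a text node in <mark>.
// It is only defined here, not called, so the sketch also loads outside a browser.
function highlightInTextNode(textNode, query) {
  const parts = splitAroundMatch(textNode.data, query);
  if (!parts) return null;
  // splitText(offset) keeps the first part in place and returns the remainder
  const matchNode = textNode.splitText(parts.before.length);
  matchNode.splitText(parts.match.length); // detach whatever follows the match
  const mark = document.createElement('mark');
  matchNode.replaceWith(mark);
  mark.appendChild(matchNode);
  return mark;
}

console.log(splitAroundMatch('\n  Text\n', 'T'));
```

In the question's addHighlight, the two lines that build and insert markElement could then become a call such as highlightInTextNode(matchingNode, text).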

The problem is that the node you match is the whole text node whose content contains the string you want to highlight.
What you should do instead of:
markElement.innerHTML = matchingNode.textContent;
matchingNode.replaceWith(markElement);
is probably something like
markElement.innerHTML = text;
matchingNode.replaceTextWithHTML(text, markElement);
replaceTextWithHTML is a fictitious function :)

Check the content in script tag

I have wrapped every comma in a span tag using jQuery so that I can add a CSS class. Unfortunately, it also adds span tags inside script elements, which breaks them. I want to check that the content is not in a script tag before adding the span tag, i.e. replace all commas (,) except those inside script tags.
if ($("#overview").length > 0) {
$("#overview").html( $("#overview").html().replace(/,/g,"<span class='comma'>,</span>"));
}

You need to be careful because the HTML attributes can also contain text with ','.
One way to solve this problem is to iterate over the text nodes and, for each one, split its value into multiple text nodes separated by span elements.
// get all text nodes excluding the script
function getTextNodes(el) {
if (el.nodeType == 3) { // nodeType 3 is a TextNode
return el.parentElement && el.parentElement.nodeName == "SCRIPT" ? [] : [el];
}
return Array.from(el.childNodes)
.reduce((acc, item) => acc.concat(getTextNodes(item)), []);
}
// this will replace the TextNode with the necessary Span and Texts nodes
function replaceComma(textNode) {
const parent = textNode.parentElement;
const subTexts = textNode.textContent.split(','); // get all the subtexts separated by ,
// for each item in subtext it will insert a new TextNode with a SpanNode
// (iterate from end to beginning to use the insertBefore function)
textNode.textContent = subTexts[subTexts.length - 1];
let currentNode = textNode;
for(var i = subTexts.length - 2; i>= 0 ; i --) {
const spanEl = createCommaEl();
parent.insertBefore(spanEl, currentNode);
currentNode = document.createTextNode(subTexts[i]);
parent.insertBefore(currentNode, spanEl)
}
}
// create the html node: <span class="comma">,</span>
// you can do this more easily with jQuery
function createCommaEl() {
const spanEl = document.createElement('span');
spanEl.setAttribute('class', 'comma');
spanEl.textContent = ',';
return spanEl;
}
// then, if you want to replace all comma text from the element with id 'myId'
// you can do
getTextNodes(document.getElementById('myId'))
.forEach(replaceComma);

Problems when parsing nested html tags from string

I have this code that parses a string into HTML and displays the text of each element.
It works well except when I have nested tags, for example <div><p>Element 1</p><p>Element 2</p></div>. In this case, the code displays <p>Element 1</p><p>Element 2</p>.
How can I get each tag one after the other? (Here I want Element 1 and then Element 2.)
Here's the code:
let text = new DOMParser().parseFromString(stringHtml, 'text/html');
let textBody = text.body.firstChild;
while (textBody) {
alert(textBody.innerHTML);
// other actions on the textBody element
textBody = textBody.nextSibling;
}
Thanks for helping me out

It sounds like you want a recursive function that prints the textContent of itself, or of its children, if it has children:
const stringHtml = '<div><p>Element 1</p><p>Element 2</p></div><div><p>Element 3</p><p>Element 4</p></div>';
const doc = new DOMParser().parseFromString(stringHtml, 'text/html');
const showElms = parent => {
const { children } = parent;
if (children.length) Array.prototype.forEach.call(children, showElms);
else console.log(parent.textContent);
}
showElms(doc.body);
That's assuming you want to iterate over the actual elements. If you want all text nodes instead, then recursively iterate over the childNodes instead.
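The childNodes variant mentioned above could look roughly like this. It is a sketch, assuming whitespace-only text nodes should be skipped; `collectTextNodes` is a hypothetical name. The function touches only nodeType, nodeValue and childNodes, so the small stand-in tree at the bottom lets it run outside a browser too.

```javascript
// Recursively collect the trimmed, non-empty text of every text node under `node`.
function collectTextNodes(node, out = []) {
  if (node.nodeType === 3) { // 3 is Node.TEXT_NODE
    const text = node.nodeValue.trim();
    if (text) out.push(text);
  } else {
    for (const child of Array.from(node.childNodes || [])) {
      collectTextNodes(child, out);
    }
  }
  return out;
}

// Stand-in for <body><div><p>Element 1</p><p>Element 2</p></div></body>,
// built from plain objects that mimic the DOM properties used above.
const makeText = (nodeValue) => ({ nodeType: 3, nodeValue, childNodes: [] });
const makeElement = (...childNodes) => ({ nodeType: 1, childNodes });
const fakeBody = makeElement(
  makeElement(
    makeElement(makeText('Element 1')),
    makeElement(makeText('Element 2'))
  )
);

console.log(collectTextNodes(fakeBody)); // [ 'Element 1', 'Element 2' ]
```

In a browser, the same function can be called as collectTextNodes(doc.body) with the doc from the snippet above.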

Add arrays into multi-dimensional array or object

I'm parsing content generated by a wysiwyg into a table of contents widget in React.
So far I'm looping through the headers and adding them into an array.
How can I get them all into one multi-dimensional array or object (what's the best way) so that it looks more like:
h1-1
h2-1
h3-1
h1-2
h2-2
h3-2
h1-3
h2-3
h3-3
and then I can render it with an ordered list in the UI.
const str = "<h1>h1-1</h1><h2>h2-1</h2><h3>h3-1</h3><p>something</p><h1>h1-2</h1><h2>h2-2</h2><h3>h3-2</h3>";
const patternh1 = /<h1>(.*?)<\/h1>/g;
const patternh2 = /<h2>(.*?)<\/h2>/g;
const patternh3 = /<h3>(.*?)<\/h3>/g;
let h1s = [];
let h2s = [];
let h3s = [];
let matchh1, matchh2, matchh3;
while (matchh1 = patternh1.exec(str))
h1s.push(matchh1[1])
while (matchh2 = patternh2.exec(str))
h2s.push(matchh2[1])
while (matchh3 = patternh3.exec(str))
h3s.push(matchh3[1])
console.log(h1s)
console.log(h2s)
console.log(h3s)

I don't know about you, but I hate parsing HTML using regexes. Instead, I think it's a better idea to let the DOM handle this:
const str = `<h1>h1-1</h1>
<h3>h3-1</h3>
<h3>h3-2</h3>
<p>something</p>
<h1>h1-2</h1>
<h2>h2-2</h2>
<h3>h3-2</h3>`;
const wrapper = document.createElement('div');
wrapper.innerHTML = str.trim();
let tree = [];
let leaf = null;
for (const node of wrapper.querySelectorAll("h1, h2, h3, h4, h5, h6")) {
const nodeLevel = parseInt(node.tagName[1]);
const newLeaf = {
level: nodeLevel,
text: node.textContent,
children: [],
parent: leaf
};
while (leaf && newLeaf.level <= leaf.level)
leaf = leaf.parent;
if (!leaf)
tree.push(newLeaf);
else
leaf.children.push(newLeaf);
leaf = newLeaf;
}
console.log(tree);
This answer does not require h3 to follow h2; h3 can follow h1 if you so please. If you want to turn this into an ordered list, that can also be done:
const str = `<h1>h1-1</h1>
<h3>h3-1</h3>
<h3>h3-2</h3>
<p>something</p>
<h1>h1-2</h1>
<h2>h2-2</h2>
<h3>h3-2</h3>`;
const wrapper = document.createElement('div');
wrapper.innerHTML = str.trim();
let tree = [];
let leaf = null;
for (const node of wrapper.querySelectorAll("h1, h2, h3, h4, h5, h6")) {
const nodeLevel = parseInt(node.tagName[1]);
const newLeaf = {
level: nodeLevel,
text: node.textContent,
children: [],
parent: leaf
};
while (leaf && newLeaf.level <= leaf.level)
leaf = leaf.parent;
if (!leaf)
tree.push(newLeaf);
else
leaf.children.push(newLeaf);
leaf = newLeaf;
}
const ol = document.createElement("ol");
(function makeOl(ol, leaves) {
for (const leaf of leaves) {
const li = document.createElement("li");
li.appendChild(new Text(leaf.text));
if (leaf.children.length > 0) {
const subOl = document.createElement("ol");
makeOl(subOl, leaf.children);
li.appendChild(subOl);
}
ol.appendChild(li);
}
})(ol, tree);
// add it to the DOM
document.body.appendChild(ol);
// or get it as text
const result = ol.outerHTML;
Since the HTML is parsed by the DOM and not by a regex, this solution will not encounter any errors if the h1 tags have attributes, for example.
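The nesting logic itself does not depend on the DOM. As a rough sketch (not the code above, and using a stack instead of parent pointers so the result has no circular references), the same idea can run on plain { level, text } pairs. `buildOutline` is a hypothetical name, and the input list is assumed to come from something like the querySelectorAll loop above.

```javascript
// Build a nested outline from a flat list of headings.
// Each heading is { level: 1..6, text: string }; skipped levels are allowed.
function buildOutline(headings) {
  const root = { level: 0, text: null, children: [] };
  const stack = [root]; // the chain of currently open ancestors
  for (const { level, text } of headings) {
    // Close every open heading of the same or a deeper level.
    // The root has level 0, so it is never popped.
    while (stack[stack.length - 1].level >= level) stack.pop();
    const node = { level, text, children: [] };
    stack[stack.length - 1].children.push(node);
    stack.push(node);
  }
  return root.children;
}

const outline = buildOutline([
  { level: 1, text: 'h1-1' },
  { level: 3, text: 'h3-1' },
  { level: 3, text: 'h3-2' },
  { level: 1, text: 'h1-2' },
  { level: 2, text: 'h2-2' },
  { level: 3, text: 'h3-2' }
]);

// No parent references, so the tree can be serialized directly
console.log(JSON.stringify(outline, null, 2));
```

Dropping the parent property is a design choice for this sketch: it keeps the tree serializable with JSON.stringify, which can be handy when passing it to a React component as props.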

You can simply gather all h* and then iterate over them to construct a tree as such:
Using ES6 (I inferred this is ok from your usage of const and let)
const str = `
<h1>h1-1</h1>
<h2>h2-1</h2>
<h3>h3-1</h3>
<p>something</p>
<h1>h1-2</h1>
<h2>h2-2</h2>
<h3>h3-2</h3>
`
// \1 makes the closing tag match the same level as the opening tag
const patternh = /<h(\d)>(.*?)<\/h\1>/g;
let hs = [];
let matchh;
while (matchh = patternh.exec(str))
hs.push({ lev: matchh[1], text: matchh[2] })
console.log(hs)
// constructs a tree with the format [{ value: ..., children: [{ value: ..., children: [...] }, ...] }, ...]
const add = (res, lev, what) => {
if (lev === 0) {
res.push({ value: what, children: [] });
} else {
add(res[res.length - 1].children, lev - 1, what);
}
}
// reduces all hs found into a tree using above method starting with an empty list
const tree = hs.reduce((res, { lev, text }) => {
add(res, lev-1, text);
return res;
}, []);
console.log(tree);
But because your html headers are not in a tree structure themselves (which I guess is your use case) this only works under certain assumptions, e.g. you cannot have a <h3> unless there's a <h2> above it and a <h1> above that. It will also assume a lower-level header will always belong to the latest header of an immediately higher level.
If you want to further use the tree structure for e.g. rendering a representative ordered-list for a TOC, you can do something like:
// function to render one tree node as an <li>, nesting an <ol> for its children
const renderLI = node => `<li>${node.value}${renderOL(node.children)}</li>`;
// function to render an <ol> from a list of tree nodes (empty string if there are none)
const renderOL = nodes => nodes.length > 0 ? `<ol>${nodes.map(renderLI).join('')}</ol>` : '';
// render the whole tree as the TOC
const toc = renderOL(tree);
console.log(toc);
Hope it helps.

What you want to do is known as (a variant of) a document outline, i.e. creating a nested list from the headings of a document, honoring their hierarchy.
A simple implementation for the browser using the DOM and DOMParser APIs goes as follows (put into an HTML page and coded in ES5 for easy testing):
<!DOCTYPE html>
<html>
<head>
<title>Document outline</title>
</head>
<body>
<div id="outline"></div>
<script>
// test string wrapped in a document (and body) element
var str = "<html><body><h1>h1-1</h1><h2>h2-1</h2><h3>h3-1</h3><p>something</p><h1>h1-2</h1><h2>h2-2</h2><h3>h3-2</h3></body></html>";
// util for traversing a DOM and emit SAX startElement events
function emitSAXLikeEvents(node, handler) {
handler.startElement(node)
for (var i = 0; i < node.children.length; i++)
emitSAXLikeEvents(node.children.item(i), handler)
handler.endElement(node)
}
var outline = document.getElementById('outline')
var rank = 0
var context = outline
emitSAXLikeEvents(
(new DOMParser()).parseFromString(str, "text/html").body,
{
startElement: function(node) {
if (/^h[1-6]$/.test(node.localName)) {
var newRank = +node.localName.charAt(1)
// set context li node to append
while (newRank <= rank--)
context = context.parentNode.parentNode
rank = newRank
// create (if 1st li) or
// get (if 2nd or subsequent li) ol element
var ol
if (context.children.length > 0)
ol = context.children[0]
else {
ol = document.createElement('ol')
context.appendChild(ol)
}
// create and append li with text from
// heading element
var li = document.createElement('li')
li.appendChild(
document.createTextNode(node.innerText))
ol.appendChild(li)
context = li
}
},
endElement: function(node) {}
})
</script>
</body>
</html>
I'm first parsing your fragment into a Document, then traverse it to create SAX-like startElement() calls. In the startElement() function, the rank of a heading element is checked against the rank of the most recently created list item (if any). Then a new list item is appended at the correct hierarchy level, and possibly an ol element is created as container for it. Note the algorithm as it is won't work with "jumping" from h1 to h3 in the hierarchy, but can be easily adapted.
If you want to create an outline/table of contents on Node.js, the code could be made to run server-side, but that requires a decent HTML parsing library (a DOMParser polyfill for Node.js, so to speak). There are also the https://github.com/h5o/h5o-js and https://github.com/hoyois/html5outliner packages for creating outlines, though I haven't tested those. These packages supposedly can also deal with corner cases such as heading elements in iframe and quote elements, which you generally don't want in the outline of your document.
The topic of creating an HTML5 outline has a long history; see e.g. http://html5doctor.com/computer-says-no-to-html5-document-outline/. HTML4's practice of not using wrapper elements (sectioning roots, in HTML5 parlance) for sectioning, and instead placing headings and content at the same hierarchy level, is known as "flat-earth markup". SGML has the RANK feature for dealing with ranked elements such as H1, H2, etc., and can be made to infer omitted section elements, and thus automatically create an outline, from HTML4-like "flat-earth markup" in simple cases (e.g. where only section or another single element is allowed as sectioning root).

I'll use a single regex to get the <hx></hx> contents and then group them by x using Array.prototype.reduce.
Here is the base, but it's not finished yet:
// The string you need to parse
const str = "\
<h1>h1-1</h1>\
<h2>h2-1</h2>\
<h3>h3-1</h3>\
<p>something</p>\
<h1>h1-2</h1>\
<h2>h2-2</h2>\
<h3>h3-2</h3>";
// The regex that will cut down the <hx>something</hx>
const regex = /<h[0-9]{1}>(.*?)<\/h[0-9]{1}>/g;
// We get the matches now
const matches = str.match(regex);
// We match the hx togethers as requested
const matchesSorted = Object.values(matches.reduce((tmp, x) => {
// We get the number behind hx ---> the x
const hNumber = x[2];
// If the container do not exist, create it
if (!tmp[hNumber]) {
tmp[hNumber] = [];
}
// Push the new parsed content into the array
// 4 is to start after <hx>
// length - 9 is to get all except <hx></hx>
tmp[hNumber].push(x.substr(4, x.length - 9));
return tmp;
}, {}));
console.log(matchesSorted);
Since you are parsing HTML content, be aware of special cases such as the presence of \n or spaces. For example, look at the following non-working snippet:
// The string you need to parse
const str = "\
<h1>h1-1\n\
</h1>\
<h2> h2-1</h2>\
<h3>h3-1</h3>\
<p>something</p>\
<h1>h1-2 </h1>\
<h2>h2-2 \n\
</h2>\
<h3>h3-2</h3>";
// The regex that will cut down the <hx>something</hx>
const regex = /<h[0-9]{1}>(.*?)<\/h[0-9]{1}>/g;
// We get the matches now
const matches = str.match(regex);
// We match the hx togethers as requested
const matchesSorted = Object.values(matches.reduce((tmp, x) => {
// We get the number behind hx ---> the x
const hNumber = x[2];
// If the container do not exist, create it
if (!tmp[hNumber]) {
tmp[hNumber] = [];
}
// Push the new parsed content into the array
// 4 is to start after <hx>
// length - 9 is to get all except <hx></hx>
tmp[hNumber].push(x.substr(4, x.length - 9));
return tmp;
}, {}));
console.log(matchesSorted);
We have to add .replace() and .trim() to remove the unwanted \n and spaces.
Use this snippet:
// The string you need to parse
const str = "\
<h1>h1-1\n\
</h1>\
<h2> h2-1</h2>\
<h3>h3-1</h3>\
<p>something</p>\
<h1>h1-2 </h1>\
<h2>h2-2 \n\
</h2>\
<h3>h3-2</h3>";
// Remove all unwanted \n
const preparedStr = str.replace(/(\r\n\t|\n|\r\t)/gm, "");
// The regex that will cut down the <hx>something</hx>
const regex = /<h[0-9]{1}>(.*?)<\/h[0-9]{1}>/g;
// We get the matches now
const matches = preparedStr.match(regex);
// We match the hx togethers as requested
const matchesSorted = Object.values(matches.reduce((tmp, x) => {
// We get the number behind hx ---> the x
const hNumber = x[2];
// If the container do not exist, create it
if (!tmp[hNumber]) {
tmp[hNumber] = [];
}
// Push the new parsed content into the array
// 4 is to start after <hx>
// length - 9 is to get all except <hx></hx>
// call trim() to remove unwanted spaces
tmp[hNumber].push(x.substr(4, x.length - 9).trim());
return tmp;
}, {}));
console.log(matchesSorted);
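A more compact variant of the same grouping, offered as a sketch rather than a replacement: capture groups together with matchAll avoid the substr offsets, and [\s\S] lets the content span line breaks so no pre-cleaning step is needed. `groupHeadingsByLevel` is a hypothetical name, and the pattern assumes headings are not nested inside each other.

```javascript
// Group heading contents by level: [[all h1 texts], [all h2 texts], ...]
function groupHeadingsByLevel(html) {
  // Group 1 is the level digit, group 2 the content; \1 forces a matching closing tag.
  const pattern = /<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi;
  const groups = {};
  for (const [, level, inner] of html.matchAll(pattern)) {
    if (!groups[level]) groups[level] = [];
    groups[level].push(inner.trim());
  }
  // Integer-like keys are listed in ascending order, so h1 comes first
  return Object.values(groups);
}

const messy = "<h1>h1-1\n</h1><h2> h2-1</h2><h3>h3-1</h3><p>something</p>" +
  "<h1>h1-2 </h1><h2>h2-2 \n</h2><h3>h3-2</h3>";

console.log(groupHeadingsByLevel(messy));
```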

I wrote this code to work with jQuery. (Please don't DV. Maybe someone will need a jQuery answer later.)
This recursive function creates lis from the string, and if an item has children, it converts them to an ol.
const str =
"<div><h1>h1-1</h1><h2>h2-1</h2><h3>h3-1</h3></div><p>something</p><h1>h1-2</h1><h2>h2-2</h2><h3>h3-2</h3>";
function strToList(stri) {
const tags = $(stri);
function partToList(el) {
let output = "<li>";
if ($(el).children().length) {
output += "<ol>";
$(el)
.children()
.each(function() {
output += partToList($(this));
});
output += "</ol>";
} else {
output += $(el).text();
}
return output + "</li>";
}
let output = "<ol>";
tags.each(function(itm) {
output += partToList($(this));
});
return output + "</ol>";
}
$("#output").append(strToList(str));
li {
padding: 10px;
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="output"></div>
(This code can be converted to pure JS easily)

how to get first text node while bypassing <b> and <i>?

I want to get the first text node from a string, but it may contain a few tags like <b>, <i> and <span>. I have tried it like this, but it gives only login whereas it should give login<b>user</b> account
var s = $.trim('login<b>user</b> account<tbody> <tr> <td class="translated">Lorem ipsum dummy text</td></tr><tr><td class="translated">This is a new paragraph</td></tr><tr><td class="translated"><b>Email</b></td></tr><tr><td><i>This is yet another text</i></td> </tr></tbody>');
if( $(s).find('*').andSelf().not('b,i').length > 1 ) {
if( s.substring( 0, s.indexOf('<') ) != '') {
alert(s.substring(0, s.indexOf('<')));
} else {
alert($(s).find('*:not(:empty)').first().text());
}
}
check it on jsfiddle
Note:
This string will be dynamic, so please write a generic answer, not one specific to this text only.
More information:
@Jeremy J Starcher: I just want to get the first non-empty text node of the iframe being clicked. This node will include <b> or <i> and whatever is in between them, like this:
hi my <b>bold</b> text is here // note the bold tags as it is
If only one element is clicked, then its text is output, but if more than one element is selected, then it must get the very first text node among all the nodes.

Kind of a brute method but you can take out all the tags you're expecting to occur in the first node and read until the start of the next tag like so:
// a string pattern replaces only the first occurrence, so use global regexes
text = text.replace(/<b>/g, " ");
text = text.replace(/<\/b>/g, " ");
text = text.replace(/<i>/g, " ");
text = text.replace(/<\/i>/g, " ");
text = text.replace(/<span>/g, " ");
text = text.replace(/<\/span>/g, " ");
text = text.substr(0, text.indexOf("<"));
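If the inline tags should be kept rather than stripped, as the expected output login<b>user</b> account in the question suggests, one option is to cut the string at the first tag that is not on an allow-list. This is only a sketch under assumptions: `leadingInlineHtml` and its default tag list are made up for illustration, and attribute values are assumed not to contain `>`.

```javascript
// Return the leading part of `html` up to the first tag that is not an
// allowed inline tag. Allowed tags (opening or closing) stay in the result.
function leadingInlineHtml(html, inlineTags = ['b', 'i', 'span']) {
  const tagPattern = /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi;
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    if (!inlineTags.includes(match[1].toLowerCase())) {
      return html.slice(0, match.index).trim();
    }
  }
  return html.trim(); // no disallowed tag found, so keep everything
}

const s = 'login<b>user</b> account<tbody> <tr> <td class="translated">Lorem ipsum dummy text</td></tr></tbody>';
console.log(leadingInlineHtml(s)); // login<b>user</b> account
```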

I didn't fully follow the question, but if you are trying to extract the text from a DOM element, this may help:
var getText = function (el) {
var ret;
var txt = [],
i = 0;
if (!el) {
ret = "";
} else if (el.nodeType === 3) {
// No problem if it's a text node
ret = el.nodeValue;
} else {
// If there is more to it, then let's gather it all.
while (el.childNodes[i]) {
txt[txt.length] = getText(el.childNodes[i]);
i++;
}
// return the array as a string
ret = txt.join("");
}
return ret;
};
