NodeJS: Extract a sentence from html text based on a phrase

I have some text stored in a database, which looks something like below:
let text = "<p>Some people live so much in the future they they lose touch with reality.</p><p>They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.</p>"
The text can have many paragraphs and HTML tags.
Now, I also have a phrase:
let phrase = 'lose touch'
What I want to do is search for the phrase in text, and return the complete sentence containing the phrase in strong tag.
In the above example, even though the first para also contains the phrase 'lose touch', it should return the second sentence because it is in the second sentence that the phrase is inside strong tag. The result will be:
They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.
On the client-side, I could create a DOM tree with this HTML text, convert it into an array and search through each item in the array, but in NodeJS document is not available, so this is basically just plain text with HTML tags. How do I go about finding the right sentence in this blob of text?

I think this might help you.
No need to involve DOM in this if I understood the problem correctly.
This solution would work even if the p or strong tags have attributes in them.
And if you want to search for tags other than p, simply update the regex for it and it should work.
const search_phrase = "lose touch";
// Backslashes must be doubled inside string and template literals, otherwise "\s" collapses to "s".
// strong_regex has no "g" flag: a global regex keeps lastIndex between test() calls and can skip matches.
const strong_regex = new RegExp(`<\\s*strong[^>]*>${search_phrase}<\\s*/\\s*strong>`);
const paragraph_regex = new RegExp("<\\s*p[^>]*>(.*?)<\\s*/\\s*p>", "g");
const text = "<p>Some people live so much in the future they they lose touch with reality.</p><p>They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.</p>";
const paragraphs = text.match(paragraph_regex);
if (paragraphs && paragraphs.length) {
  const paragraphs_with_strong_text = paragraphs.filter(paragraph => {
    return strong_regex.test(paragraph);
  });
  console.log(paragraphs_with_strong_text);
  // prints [ '<p>They don\'t just <strong>lose touch</strong> with reality, they get obsessed with the future.</p>' ]
}
P.S. The code is not optimised, you can change it as per the requirement in your application.
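If the phrase comes from user input, it may contain characters that are special in a regular expression. Here is a minimal sketch of the same idea wrapped in a reusable function that escapes the phrase first. The names escapeRegExp and findParagraphsWithStrongPhrase are my own, and the example text is shortened; it still assumes reasonably well-formed, non-nested p tags.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return every <p>...</p> block that contains the phrase inside a <strong> tag.
// (Hypothetical helper name, not part of any library.)
function findParagraphsWithStrongPhrase(html, phrase) {
  const paragraphRegex = /<\s*p\b[^>]*>[\s\S]*?<\s*\/\s*p\s*>/gi;
  // No "g" flag here: a global regex keeps lastIndex between test() calls.
  const strongRegex = new RegExp(
    "<\\s*strong\\b[^>]*>\\s*" + escapeRegExp(phrase) + "\\s*<\\s*/\\s*strong\\s*>",
    "i"
  );
  const paragraphs = html.match(paragraphRegex) || [];
  return paragraphs.filter((paragraph) => strongRegex.test(paragraph));
}

const html =
  "<p>Some people lose touch with reality.</p>" +
  "<p>They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.</p>";

console.log(findParagraphsWithStrongPhrase(html, "lose touch"));
```

Only the second paragraph is returned, because the first one contains the phrase outside of a strong tag.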

There is cheerio which is something like server-side jQuery. So you can get your page as text, build DOM, and search inside of it.

First you could var arr = text.split("<p>") so that you can work with each paragraph individually.
Then you could loop through the array and search for your phrase inside strong tags:
for (var i = 0; i < arr.length; i++) {
  // indexOf does a literal search; search() would treat the string as a regular expression
  if (arr[i].indexOf("<strong>" + phrase + "</strong>") != -1) {
    console.log("<p>" + arr[i]);
    // arr[i] is the entire paragraph containing the phrase inside strong tags, minus "<p>"
  }
}

Related

Removing html tags from multiple strings array in Javascript/React Native

I am receiving some bad data for certain product items, and in my React Native app it's causing a bug where the bold HTML tags are output as literal text. This doesn't happen on the website, because the browser converts the bold tags into bold text.
I am wondering what the best way would be to check whether the array contains the bold tag and filter/remove it from the state.
Here is an example of the data I am getting back and how its currently rendering:
["<bold>Dish Washer</bold>", "fridge", "<bold>kettle</bold", "Oven"]
. <bold>Dish Washer</bold>
. Fridge
. <bold>Kettle</bold>
. Oven
I was also wondering if there is a way to possibly check which products are displaying the tags, as it only seems to be happening with certain product descriptions.

Use forEach and the replace method:
<script>
var a = ["<bold>Dish Washer</bold>", "fridge", "<bold>kettle</bold>", "Oven"];
var regex = /(<([^>]+)>)/ig;
a.forEach((item) => {
  var values = item.replace(regex, '');
  console.log(values);
});
</script>

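I experimented a bit and found this solution. It filters out the items that contain bold tags; you can alter the regex at any time to filter other values.
const x = ["<bold>Dish Washer</bold>", "fridge", "<bold>kettle</bold", "Oven"];
// No "g" flag: with it, test() keeps lastIndex between calls and can give wrong results inside filter()
const regex = /(?<=<bold>)(.*)(?=<\/bold>)/i;
const result = x.filter(i => !regex.test(i));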

You can use array map() + string replace() for a quick and dirty fix.
Something like this (or better, use a regex replace, since replace() with a string only removes the first occurrence):
["<bold>Dish Washer</bold>", "fridge", "<bold>kettle</bold>", "Oven"].map(item => item.replace('<bold>','').replace('</bold>',''))
But you should in fact fix the root of the problem (the source of the products).
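To also find out which products carry the stray tags, something along these lines might work. It is only a sketch: stripTags and findTaggedItems are made-up names, and the pattern is deliberately loose so that it also catches a closing tag that lost its final ">". The flip side is that plain text such as "a<b" would be stripped too.

```javascript
// Matches anything that looks like an HTML tag, including a closing tag
// that is missing its final ">" (for example "</bold" at the end of a string).
const TAG_PATTERN = /<\/?[a-z][^<>]*>?/gi;

// Remove all tag-like fragments from one string. (Hypothetical helper name.)
function stripTags(item) {
  return item.replace(TAG_PATTERN, "");
}

// Report which entries contained markup, so the bad records can be traced
// back to their source. (Hypothetical helper name.)
function findTaggedItems(items) {
  return items.filter((item) => stripTags(item) !== item);
}

const products = ["<bold>Dish Washer</bold>", "fridge", "<bold>kettle</bold", "Oven"];

console.log(products.map(stripTags));
console.log(findTaggedItems(products));
```

The first call gives the cleaned list for the state, the second gives the entries worth reporting to whoever owns the product data.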

Going through a Variable that contains HTML to check for specific words

I'm using Quill Editor, which is basically a Box like in SO where you type in text and get HTML out.
I store that HTML in a variable. Now my Website has a List of Tags (certain keywords) of brands and categories.
I would like a functionality to see if a Tag is inside that HTML.
For example, my Author types The new Nike Store is open now I would need the Nike to be a Tag. It can be a span with a certain class.
What is the best way to accomplish this? I want to check before publishing as live detection is not needed.
My Solution for now:
I haven't implemented it yet, but I think I would check every word against my tag list before moving on to the next one, wrapping it in the needed HTML tags if the word is a Tag. But this could get messy to code because of all the other HTML tags that the editor generates.

Assuming that the author types just plain text, you can use a regular expression to search for tag words (or phrases) and replace the phrase in the HTML with that phrase, surrounded in a span with your new class:
const input = 'The new Nike Store is open now';
const htmlStr = input.replace(/\bnike\b/gi, `<span class="tag">$&</span>`);
document.body.appendChild(document.createElement('div')).innerHTML = htmlStr;
.tag {
background-color: yellow;
}
Or, for dynamic tags, create the pattern dynamically:
const input = 'The new Nike Store is open now';
const tags = ['nike', 'open'];
const pattern = new RegExp(tags.join('|'), 'gi');
const htmlStr = input.replace(pattern, `<span class="tag">$&</span>`);
document.body.appendChild(document.createElement('div')).innerHTML = htmlStr;
.tag {
background-color: yellow;
}
(if the tags can contain characters with a special meaning in a regular expression, escape them first before passing to new RegExp)
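A rough sketch of that escaping step, combined with word boundaries so that "open" does not match inside "reopened". The helper names and the sample tag list are just for illustration, and like the code above this assumes plain text input rather than editor HTML. Note that \b only helps for tags that start and end with a word character.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build one case-insensitive pattern from a list of tag words or phrases.
// (Hypothetical helper name.)
function buildTagPattern(tags) {
  const alternatives = tags
    .slice()
    .sort((a, b) => b.length - a.length) // try longer tags first
    .map(escapeRegExp);
  return new RegExp("\\b(?:" + alternatives.join("|") + ")\\b", "gi");
}

// Wrap every tag occurrence in a span. (Hypothetical helper name.)
function wrapTags(input, tags) {
  if (tags.length === 0) return input;
  return input.replace(buildTagPattern(tags), '<span class="tag">$&</span>');
}

const wrapped = wrapTags(
  "The new Nike Store is open now and reopened by Dr. Martens",
  ["nike", "open", "dr. martens"]
);
console.log(wrapped);
```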

How can I truncate the text contents of an Element while preserving HTML?

I realize that there are several similar questions here but none of the answers solve my case.
I need to be able to take the innerHTML of an element and truncate it to a given character length with the text contents of any inner HTML element taken into account and all HTML tags preserved.
I have found several answers that cover this portion of the question fine as well as several plugins which all do exactly this.
However, in all cases the solution will truncate directly in the middle of any inner elements and then close the tag.
In my case I need the contents of all inner tags to remain intact, essentially allowing any "would be" truncated inner tags to exceed the given character limit.
Any help would be greatly appreciated.
EDIT:
For example:
This is an example of a link inside another element
The above is 51 characters long including spaces. If I wanted to truncate this to 23 characters, we would have to shorten the text inside the <a> tag, which is exactly what most solutions out there do.
This would give me the following:
This is an example of a
However, for my use case I need to keep any remaining visible tags completely intact and not truncated in any way.
So given the above example, the final output I would like, when attempting to truncate to 23 characters is the following:
This is an example of a link
So essentially we are checking where the truncation takes place. If it is outside of an element we can split the HTML string to exactly that length. If on the other hand it is inside an element, we move to the closing tag of that element, repeating for any parent elements until we get back to the root string and split it there instead.

It sounds like you'd like to be able to truncate the length of your HTML string as a text string, for example consider the following HTML:
'<b>foo</b> bar'
In this case the HTML is 14 characters in length and the text is 7. You would like to be able to truncate it to X text characters (for example 2) so that the new HTML is now:
'<b>fo</b>'
Disclosure: My answer uses a library I developed.
You could use the HTMLString library - Docs : GitHub.
The library makes this task pretty simple. To truncate the HTML as we've outlined above (e.g to 2 text characters) using HTMLString you'd use the following code:
var myString = new HTMLString.String('<b>foo</b> bar');
var truncatedString = myString.slice(0, 2);
console.log(truncatedString.html());
EDIT: After additional information from the OP.
The following truncate function truncates to the last full tag and caters for nested tags.
function truncate(str, len) {
// Convert the string to a HTMLString
var htmlStr = new HTMLString.String(str);
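// Check the string needs truncating
if (htmlStr.length() <= len) {
// Return the HTMLString rather than the raw str, so callers can always call .html() on the result
return htmlStr;
}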
// Find the closing tag for the character we are truncating to
var tags = htmlStr.characters[len - 1].tags();
var closingTag = tags[tags.length - 1];
// Find the last character to contain this tag
for (var index = len; index < htmlStr.length(); index++) {
if (!htmlStr.characters[index].hasTags(closingTag)) {
break;
}
}
return htmlStr.slice(0, index);
}
var myString = 'This is an <b>example ' +
'of a link ' +
'inside</b> another element';
console.log(truncate(myString, 23).html());
console.log(truncate(myString, 18).html());
This will output:
This is an <b>example of a link</b>
This is an <b>example of a link inside</b>

Although HTML is notorious for being terribly formed and has edge cases which are impervious to regex, here is a super light way you could hackily handle HTML with nested tags in vanilla JS.
(function(s, approxNumChars) {
  var taggish = /<[^>]+>/g;
  s = s.slice(0, approxNumChars); // ignores tag lengths for solution brevity
  s = s.replace(/<[^>]*$/, ''); // rm any trailing partial tags
  var tags = s.match(taggish) || []; // match() returns null when there are no tags
  // find out which tags are unmatched
  var openTagsSeen = [];
  for (var i = 0; i < tags.length; i++) {
    var tag = tags[i];
    if (tag.charAt(1) !== '/') { // an opening tag
      openTagsSeen.push(tag);
    }
    else {
      // quick version that assumes your HTML is correctly formatted (alas) -- else we would have to check the content inside for matches and loop through the opentags
      openTagsSeen.pop();
    }
  }
  // reverse and close unmatched tags
  openTagsSeen.reverse();
  for (var j = 0; j < openTagsSeen.length; j++) {
    s += '</' + openTagsSeen[j].match(/\w+/)[0] + '>';
  }
  return s + '...';
})
In a nutshell: truncate it (ignores that some chars will be invisible), regex match the tags, push open tags onto a stack, and pop off the stack as you encounter closing tags (again, assumes well-formed); then close any still-open tags at the end.
(If you want to actually get a certain number of visible characters, you can keep a running counter of how many non-tag chars you've seen so far, and stop the truncation when you fill your quota.)
DISCLAIMER: You shouldn't use this as a production solution, but if you want a super light, personal, hacky solution, this will get basic well-formed HTML.
Since it's blind and lexical, this solution misses a lot of edge cases, including tags that should not be closed, like <img>, but you can hardcode those edge cases or, you know, include a lib for a real HTML parser if you want. Fortunately, since HTML is poorly formed, you won't see it ;)
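For the running-counter variant mentioned above, a sketch could look like the following. It is still lexical and still assumes well-formed HTML; truncateVisible is a made-up name, entities such as &amp;amp; are counted as several characters, and it cuts inside inner elements rather than extending to their closing tag.

```javascript
// Tags that never have a closing tag, so they must not go on the stack.
const VOID_TAGS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);

// Truncate to maxVisible text characters, keeping tags and closing any
// element that is still open at the cut. (Hypothetical helper name.)
function truncateVisible(html, maxVisible) {
  const tokens = html.match(/<[^>]+>|[^<]+/g) || [];
  const openTags = []; // stack of currently open tag names
  let visible = 0;
  let out = "";

  for (const token of tokens) {
    if (token[0] === "<") {
      const name = (token.match(/^<\/?\s*([a-zA-Z][\w-]*)/) || [])[1];
      if (name !== undefined) {
        if (token[1] === "/") {
          openTags.pop(); // assumes tags are properly nested
        } else if (!VOID_TAGS.has(name.toLowerCase()) && !token.endsWith("/>")) {
          openTags.push(name);
        }
      }
      out += token; // tags (and comments) do not count as visible text
    } else {
      const remaining = maxVisible - visible;
      if (token.length > remaining) {
        out += token.slice(0, remaining); // quota filled: cut here and stop
        break;
      }
      out += token;
      visible += token.length;
    }
  }

  // Close whatever is still open, innermost first.
  while (openTags.length > 0) {
    out += "</" + openTags.pop() + ">";
  }
  return out;
}

console.log(truncateVisible("<b>foo</b> bar", 2));
console.log(truncateVisible("This is <i>an <b>example</b> text</i>", 12));
```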

You've tagged your question regex, but you cannot reliably do this with regular expressions. Obligatory link. So innerHTML is out.
If you're really talking characters, I don't see a way to do it other than to loop through the nodes within the element, recursing into descendant elements, totalling up the lengths of the text nodes you find as you go. When you find the point where you need to truncate, you truncate that text node and then remove all following ones — or probably better, you split that text node into two parts (using splitText) and move the second half into a display: none span (using insertBefore), and then move all subsequent text nodes into display: none spans. (This makes it much easier to undo it.)

Thanks to T.J. Crowder I soon came to the realization that the only way to do this with any kind of efficiency is to use the native DOM methods and iterate through the elements.
I've knocked up a quick, reasonably elegant function which does the trick.
function truncate(rootNode, max){
//Text method for cross browser compatibility
var text = ('innerText' in rootNode)? 'innerText' : 'textContent';
//If total length of characters is less than the limit, short circuit
if(rootNode[text].length <= max){ return; }
var cloneNode = rootNode.cloneNode(true),
currentNode = cloneNode,
//Create DOM iterator to loop only through text nodes
ni = document.createNodeIterator(currentNode, NodeFilter.SHOW_TEXT),
frag = document.createDocumentFragment(),
len = 0;
//loop through text nodes
while (currentNode = ni.nextNode()) {
//if nodes parent is the rootNode, then we are okay to truncate
if (currentNode.parentNode === cloneNode) {
//if we are in the root textNode and the character length exceeds the maximum, truncate the text, add to the fragment and break out of the loop
if (len + currentNode[text].length > max){
currentNode[text] = currentNode[text].substring(0, max - len);
frag.appendChild(currentNode);
break;
}
else{
frag.appendChild(currentNode);
}
}
//If not, simply add the node to the fragment
else{
frag.appendChild(currentNode.parentNode);
}
//Track current character length
len += currentNode[text].length;
}
rootNode.innerHTML = '';
rootNode.appendChild(frag);
}
This could probably be improved, but from my initial testing it is very quick, probably due to using the native DOM methods and it appears to do the job perfectly for me. I hope this helps anyone else with similar requirements.
DISCLAIMER: The above code will only deal with one level deep HTML tags, it will not deal with tags inside tags. Though it could easily be modified to do so by keeping track of the nodes parent and appending the nodes to the correct place in the fragment. As it stands, this is fine for my requirements but may not be useful to others.

Split a sentence with HTML into words (but leave inline HTML intact)

I am looking for a way to use javascript for splitting a sentence with HTML into words, and leaving the inline HTML tags with the text content intact. Punctuation can be regarded as a part of the word it is closest to. I'd like to use regex, and probably preg_split() for splitting the sentences. Here follows an example:
A word, <a href='#' title=''>words within tags should remain intact</a>, so here's
<b>even more</b> <u>words</u>
Preferably, I would like to end up with the following:
[0] => A
[1] => word,
[2] => <a href='#' title=''>words within tags should remain intact</a>,
[3] => so
[4] => here's
[5] => <b>even more</b>
[6] => <u>words</u>
I know about the discussion on parsing HTML with Regex (I enjoyed reading Bobince' answer :-P ), but I need to split the words of a sentence without harming html-tags with attributes. I don't see how I can do this with JS in a different way than Regex. Of course, if there are alternatives, I'd be more than happy to adapt them, to achieve a similar result. :-)
Edit:
I searched for similar questions on Stackoverflow about this, but these don't tick the boxes for me. To put it a little into perspective:
splitting-up-html-code-tags-and-content: targets to split up the inline HTML, which is what I want to leave intact.
php-regex-to-match-outside-of-html-tags: targets all text nodes in a HTML snippet, even within HTML tags. But as a matter of fact, I want to target only the spaces outside of HTML elements (so even excluding the spaces within the text nodes being wrapped with HTML tags).

This is possible, but there will be some drawbacks to using a pure regex solution. The easiest to call out is nested HTML. The solution I'm about to show uses some back referencing to try get around this, but if you get some complicated nested HTML it will probably start failing in weird ways.
/(?:<(\w+)[^>]*>(?:[\w+]+(?:(?!<).*?)<\/\1>?)[^\s\w]?|[^\s]+)/g
Regex Demo
The regex uses a back reference and a negative lookahead to do the work. You could potentially remove the back reference depending on your requirements. The back reference helps with supporting nested tags.
JSFiddler Example - Check your console output for the example.
Here's the output from JS Fiddler (I formatted the output a bit)
[
"A",
"word,",
"<a href='#' title=''>words within tags should remain intact</a>,",
"so",
"here's",
"<b>even more</b>",
"<u>words</u>"
]
Depending on your use case you'll need to modify it to work for you. I considered a word to be anything that wasn't a space, but you may have different criteria.
One negative to this method is if the start HTML tag is at the end of a word, it won't be picked up properly. ie. test<span>something else</span>.
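In case the demo links are not available, this is roughly how the regex is applied with String.prototype.match, using the example from the question. The variable names are my own. The last call shows the drawback just mentioned.

```javascript
// The regex from this answer, unchanged.
const splitter = /(?:<(\w+)[^>]*>(?:[\w+]+(?:(?!<).*?)<\/\1>?)[^\s\w]?|[^\s]+)/g;

const sentence =
  "A word, <a href='#' title=''>words within tags should remain intact</a>, so here's\n" +
  "<b>even more</b> <u>words</u>";

// With the "g" flag, match() returns an array of all matches.
const words = sentence.match(splitter);
console.log(words);

// A start tag glued to the end of a word is not picked up properly:
const glued = "test<span>something else</span>".match(splitter);
console.log(glued); // the span is split at the space inside it
```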

You can use the following snippet:
function splitIntoWords(div) {
function removeEmptyStrings(k) {
return k !== '';
}
var rWordBoundary = /[\s\n\t]+/; // Includes space, newline, tab
var output = [];
for (var i = 0; i < div.childNodes.length; ++i) { // Iterate through all nodes
var node = div.childNodes[i];
if (node.nodeType === Node.TEXT_NODE) { // The child is a text node
var words = node.nodeValue.split(rWordBoundary).filter(removeEmptyStrings);
if (words.length) {
output.push.apply(output, words);
}
} else if (node.nodeType === Node.COMMENT_NODE) {
// What to do here? You can do what you want
} else {
output.push(node.outerHTML);
}
}
return output;
}
window.onload = function() {
var div = document.querySelector("div");
document.querySelector("pre").innerText = 'Output: ' + JSON.stringify(splitIntoWords(div));
}
<!-- Note you have to surround your html with a div element -->
<div>A word, <a href='#' title=''>words within tags should remain intact</a>, so here's
<b>even more</b> <u>words</u>
</div>
<pre></pre>
It iterates through all child nodes, takes the text nodes and splits them into words (you can do this safely since text nodes can't contain children).
This takes care of most issues. With this, HTML such as text<span>Test</span> will come out ["text", "<span>Test</span>"] unlike the answer above.
This may fail with <span>There are</span>: 4 words which results in ["<span>There are</span>", ":" /* Extra colon */, "4", "words"] (which it's supposed to do, but not sure if it is desirable).
I would think this is very safe with nested elements.

Replace string when it matches a key in a key/value pair (with its corresponding value)

I'm trying to use javascript/jQuery to wrap any abbreviations in a paragraph in a <abbr title=""> tag.
For example, in a sentence like, The WHO eLENA clarifies guidance on life-saving nutrition interventions, and assists in scaling up action against malnutrition, WHO and eLENA would both be wrapped in an <abbr> tag. I'd like the title attribute to display the extended version of the abbreviation; i.e. WHO = World Health Organization.
Whats the best way of accomplishing this? I'm a bit new to javascript/jQuery so I'm fiddling in the dark here. So far I've created a variable that contains all the abbreviations as key/value pairs, and I can replace a specific instance of an abbreviation, but not much else.

First you must decide exactly what criteria you will use for selecting a replacement -- I would suggest doing it on a word boundary, such that "I work with WHO" will wrap "WHO" in an abbr, but "WHOEVER TOUCHED MY BIKE WILL REGRET IT" won't abbreviate "WHO". You should also decide if you are going to be case sensitive (probably you want to be, so that "The guy who just came in" doesn't abbreviate "who".)
Use jQuery to recurse over all of the text in the document. This can be done with the .contents() method (unlike .children(), it also returns text nodes), stepping through the elements and reading all the text.
For each text node, split the text into words.
For each word, look it up in your key value store to see if it matches a key. If so, get the value, and construct a new element <abbr title="value">key</abbr>.
Break up the text node into a) the text before the abbreviation (a text node), b) the abbreviation itself (an element), and c) the text after the abbreviation (a text node). Insert all three as child nodes of the original text node's parent, replacing the original text node.
Each of these steps will require a bit of work and looking up some API docs, but that is the basic process.
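The lookup step on its own can be sketched without any DOM at all. The following works on the plain text of a single text node and returns an HTML string; wrapAbbreviations and escapeHtml are hypothetical helper names, and the Map is just sample data. Matching is case sensitive and on whole words, as suggested above.

```javascript
// Sample key/value store of abbreviations (illustrative data only).
const abbreviations = new Map([
  ["WHO", "World Health Organization"],
  ["NATO", "North Atlantic Treaty Organisation"],
]);

// Escape text so it can be placed safely in HTML content or attributes.
function escapeHtml(str) {
  return str
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wrap known abbreviations in <abbr> tags. Expects plain text, not markup.
// (Hypothetical helper name.)
function wrapAbbreviations(text, abbrs) {
  return text
    .split(/\b/) // split on word boundaries; separators stay as their own pieces
    .map((piece) =>
      abbrs.has(piece)
        ? '<abbr title="' + escapeHtml(abbrs.get(piece)) + '">' + piece + "</abbr>"
        : escapeHtml(piece)
    )
    .join("");
}

console.log(wrapAbbreviations("I work with WHO, but WHOEVER asked who?", abbreviations));
```

Only the standalone "WHO" is wrapped; "WHOEVER" and the lowercase "who" are left alone.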

Firstly, this should really be done on the server, doing it on the client is very inefficient and much more prone to error. But having said that...
You can try processing the innerHTML of the element, but javascript and regular expressions are really bad at that.
The best way is to use DOM methods and parse the text of each element. When a matching word is found, replace it with an abbr element. This requires that where a match is found in a text node, the entire node is replaced because what was one text node will now be two text nodes (or more) either side of an abbr element.
Here is a simple function that goes close, but it likely has foibles that you need to address. It works on simple text strings, but you'll need to test it thoroughly on more complex strings. Naturally it should only ever be run once on a particular node or abbreviations will be doubly wrapped.
var addAbbrHelp = (function() {
var abbrs = {
'WHO': 'World Health Organisation',
'NATO': 'North Atlantic Treaty Organisation'
};
return function(el) {
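// Work on a static copy: el.childNodes is live and changes as text nodes are replaced,
// which could otherwise cause freshly inserted abbr elements to be processed again
var node, nodes = Array.prototype.slice.call(el.childNodes);
var word, words;
var adding, text, frag;
var abbr, oAbbr = document.createElement('abbr');
var oFrag = document.createDocumentFragment();
for (var i=0, iLen=nodes.length; i<iLen; i++) {
node = nodes[i];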
if (node.nodeType == 3) { // if text node
words = node.data.split(/\b/);
adding = false;
text = '';
frag = oFrag.cloneNode(false);
for (var j=0, jLen=words.length; j<jLen; j++) {
word = words[j];
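// hasOwnProperty avoids matching inherited keys such as "constructor" or "toString"
if (Object.prototype.hasOwnProperty.call(abbrs, word)) {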
adding = true;
// Add the text gathered so far
frag.appendChild(document.createTextNode(text));
text = '';
// Add the wrapped word
abbr = oAbbr.cloneNode(false);
abbr.title = abbrs[word];
abbr.appendChild(document.createTextNode(word));
frag.appendChild(abbr);
// Otherwise collect the words processed so far
} else {
text += word;
}
}
// If found some abbrs, replace the text
// Otherwise, do nothing
if (adding) {
frag.appendChild(document.createTextNode(text));
node.parentNode.replaceChild(frag, node);
}
// If found another element, add abbreviation help
// to its content too
} else if (node.nodeType == 1) {
addAbbrHelp(node);
}
}
}
}());
For the markup:
<div id="d0">
<p>This is the WHO and NATO string.</p>
<p>Some non-NATO forces were involved.</p>
</div>
and calling:
addAbbrHelp(document.getElementById('d0'));
results in (my formatting):
<div id="d0">
<p>This is the <abbr title="World Health Organisation">WHO</abbr>
and <abbr title="North Atlantic Treaty Organisation">NATO</abbr>
string.</p>
<p>Some non-<abbr title="North Atlantic Treaty Organisation">NATO</abbr> forces were involved.</p>
</div>
Using the word break pattern to split words is interesting because in strings like "with non–NATO forces", the word NATO will still get wrapped but not the "non–" part. However, if the abbreviation is split across a text node or by a hyphen, it will not be recognised unless the same pattern is included as a property name in the abbrs object.

Check out the javascript replace method.
I'd use JQuery to pull out all the text in the paragraph
var text = $('p#paragraphId').html()
Use a for loop to loop through the list of abbreviations you have and then use the replace() method mentioned above to swap out the abbreviation for the tag you need.
Finally use JQuery to set the html of the paragraph back to your newly updated string.
