Compare string with HTML text

I have a string of text that I want to compare with another string that has HTML code. The problem is that the text I need to compare it to in the HTML code is within different tags. Also, if the string exists in the HTML code then I want to wrap it inside a <mark> tag.
This is the example I am using:
var html = '<h1>This is a heading</h1><div class="subtitle">and this is the subheading</div><p class="small">this is some example text</p>';
var lookup = "is a heading and this is the subheading this is some";
var finalHtml = ""; //will contain new html
//Need to do some comparison and then add a <mark> tag around found string.
console.log(finalHtml);
//This should print "<h1>This <mark>is a heading</h1><div class="subtitle">and this is the subheading</div><p class="small">this is some</mark> example text</p>"
I am using JavaScript/jQuery to do this.

This only checks whether your lookup occurs in the HTML (i.e., no marking). I strip the tags and all whitespace from both strings, then compare.
var html = '<h1>This is a heading</h1><div class="subtitle">and this is the subheading</div><p class="small">this is some example text</p>';
// remove HTML tags and whitespace
var cleanText = html.replace(/<\/?[^>]+(>|$)/g, "").replace(/\s/g, "");
var lookup = "is a heading and this is the subheading this is some";
lookup = lookup.replace(/\s/g, '');
if (cleanText.includes(lookup)) {
//match found
}
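To go one step further and actually insert the <mark> tag, one possible sketch is to record where each non-whitespace text character sits in the original HTML string, run the same whitespace-insensitive search, and then map the match back to those positions. The function name markAcrossTags is made up for illustration, and the scanner assumes that every "<" in the input starts a tag. It reproduces the output requested in the question, where the <mark> element straddles several elements; a browser will re-nest such markup when it parses it, so wrapping each text segment separately may be the safer choice in a real page.

```javascript
// Hypothetical helper: wraps the first whitespace-insensitive match of
// `lookup` in <mark>...</mark>, even if the match spans several tags.
function markAcrossTags(html, lookup) {
  var chars = [];      // non-whitespace characters found outside tags
  var positions = [];  // index of each of those characters in `html`
  var inTag = false;

  for (var i = 0; i < html.length; i++) {
    var ch = html[i];
    if (inTag) {
      if (ch === '>') { inTag = false; }
      continue;
    }
    if (ch === '<') { inTag = true; continue; }  // assumption: "<" always opens a tag
    if (!/\s/.test(ch)) {
      chars.push(ch);
      positions.push(i);
    }
  }

  var needle = lookup.replace(/\s/g, '');
  if (needle === '') { return html; }

  var start = chars.join('').indexOf(needle);
  if (start === -1) { return html; }  // no match: return the input unchanged

  var first = positions[start];
  var last = positions[start + needle.length - 1];
  return html.slice(0, first) + '<mark>' +
         html.slice(first, last + 1) + '</mark>' +
         html.slice(last + 1);
}

var html = '<h1>This is a heading</h1><div class="subtitle">and this is the subheading</div><p class="small">this is some example text</p>';
var lookup = "is a heading and this is the subheading this is some";
console.log(markAcrossTags(html, lookup));
```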

Related

Removing span tags issue if text has encoded characters

I'm looking to remove span tags that wrap blocks of text in an in-browser editor, but I'm having trouble if the text contains any sort of special characters such as a newline '\n' or encoded characters like •, etc.
Here's my code that works on sentences without encoded characters
function fnIgnoreThisErr(evtTargID){
// use the passed parameter
var errIdx = evtTargID.substr(evtTargID.indexOf('err-') + 4);
// build span selector for finding
var errSpan = "span.err-" + evtTargID;
// declare the editor
var editor = CKEDITOR.instances.editor1;
// get text from the editor
var edata = editor.getData();
// find the specific span in the text
var spanData = $( edata ).find(errSpan);
// get outerHTML and innerText to use for replacement
var myCurrText = spanData[0].outerHTML;
var myNewText = spanData[0].innerHTML;
// standard js replace works if no special chars
var replace_text = edata.replace(myCurrText, myNewText);
// sets the data back in CKEditor
editor.setData(replace_text);
}
Here's an example of the text with the span tag
myCurrText:
<span class=\"vts-warn vts-ParseFailure err-2\">Approval of ICA<br />\n GAMA requested further clarification of proposed §§25.1739 (now §25.1729) and 25.1805(b) (now §26.11(b)) requirements that ICA prepared in accordance with paragraph H.</span>
And with the span tag removed.
Approval of ICA<br />\n GAMA requested further clarification of proposed §§25.1739 (now §25.1729) and 25.1805(b) (now §26.11(b)) requirements that ICA prepared in accordance with paragraph H.
It works great on plain sentences without any encoded characters. I can switch to jQuery but couldn't get replaceWith to work either.
What am I missing here?

I figured it out. There appears to be a discrepancy between the HTML entities in the editor data and the way the browser and my JS render/interpret them,
i.e. the outerHTML of the span is not a character-for-character match of the text in edata.
So I take the indexOf value for the start of the span and the length of the span node's outerHTML. Because of the discrepancy above, that length may include extra characters, so next I find the exact position of the closing '</span>' tag. From there, I build a string that exactly matches the text that needs to be replaced.
Here's my final code. (I kept it long-form for clarity)
function fnIgnoreThisErr(evtTargID){
// use the passed parameter
var errIdx = evtTargID.substr(evtTargID.indexOf('err-') + 4);
// build span selector for finding
var errSpan = "span.err-" + evtTargID;
// declare the editor
var editor = CKEDITOR.instances.editor1;
// get text from the editor
var edata = editor.getData();
// find the specific span in the text
var spanData = $( edata ).find(errSpan);
// extract the span class name
var spanTag = '<span class="'+spanData[0].className+'">'
// find indexOf value for the span opening tag
var spanPos = edata.indexOf(spanTag);
// get the initial length of the span.
var spanLength = spanData[0].outerHTML.length;
// get the actual text from that span length.
var spanString = edata.substring(spanPos,spanPos+spanLength);
// find the actual position of the span closing tag
var spanClose = spanString.indexOf('</span>');
var spanTagClosePos = spanClose+7;
// extract the true text comprising the span tag
var spanStringMod = edata.substring(spanPos,spanPos+spanTagClosePos);
var spanInnerHtm = spanData[0].innerHTML;
log("errSpan: "+ errSpan);
log("spanTag: "+ spanTag);
log("spanData: "+ JSON.stringify(spanData));
log("spanPos: "+ spanPos);
log("spanTagClosePos: "+ spanTagClosePos);
log("spanStringMod: "+ spanStringMod);
log("spanInnerHtm: "+ spanInnerHtm);
var newEdata = edata.replace(spanStringMod, spanInnerHtm);
log(" newEdata: "+ newEdata);
// update the editor
editor.setData(newEdata);
}
I hope this helps someone, somewhere, at some time!
Cheers!
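The same idea can be sketched without going through the DOM at all, so the entities in the editor data are never re-serialized by the browser. The helper below is hypothetical (unwrapSpanByClass is not part of CKEditor or jQuery) and works only on the raw string; it assumes the class attribute is double-quoted and that the target span contains no nested span.

```javascript
// Hypothetical helper: removes the first <span> carrying `className`
// from an HTML string but keeps the span's contents untouched.
function unwrapSpanByClass(html, className) {
  // Escape regex metacharacters in the class name.
  var escaped = className.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Opening tag whose class attribute contains className as a whole token.
  var openRe = new RegExp(
    '<span\\s[^>]*class="(?:[^"]*\\s)?' + escaped + '(?:\\s[^"]*)?"[^>]*>'
  );

  var open = openRe.exec(html);
  if (!open) { return html; }

  var innerStart = open.index + open[0].length;
  // Assumption: no nested <span>, so the next closing tag belongs to this span.
  var closeAt = html.indexOf('</span>', innerStart);
  if (closeAt === -1) { return html; }

  return html.slice(0, open.index) +
         html.slice(innerStart, closeAt) +
         html.slice(closeAt + '</span>'.length);
}

var sample = 'A <span class="vts-warn vts-ParseFailure err-2">Approval of ICA<br />\n GAMA &sect;25.1729</span> B';
console.log(unwrapSpanByClass(sample, 'err-2'));
```

With CKEditor, the input would be the string returned by editor.getData() and the result would go back through editor.setData().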

How to ignore HTML tags in innerHTML attribute?

I'm making a messenger and my messages don't ignore HTML tags, because I simply put the text from the input into the innerHTML of the message. My code:
function Message(sender) {
...
this["text"] = "";
...
this.addText = function (text) {
this["text"] = text;
};
...
};
And here I display it:
...
var chatMessageText = document.createElement("p");
chatMessageText.innerHTML = message["text"];
...
What can I do for ignoring HTML tags in message["text"]?

Set the HTMLElement#innerText property (or the Node#textContent property) instead of innerHTML.
chatMessageText.innerText = message["text"];
For the difference between the two, see: innerText vs textContent
See also: Difference between text content vs inner text

You can't. The point of innerHTML is that you give it HTML and it interprets it as HTML.
You could escape all the special characters, but the easier solution is to not use innerHTML.
var chatMessagePara = document.createElement("p");
var chatMessageText = document.createTextNode(message["text"]);
chatMessagePara.appendChild(chatMessageText);
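If innerHTML has to stay for some reason (for example, the message is embedded in a larger HTML template), the escaping mentioned above could look roughly like this. escapeHtml is a hypothetical helper, not a built-in function.

```javascript
// Hypothetical helper: replaces the characters that are special in HTML
// with their entity equivalents so the text is displayed literally.
function escapeHtml(text) {
  var map = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#39;'
  };
  return String(text).replace(/[&<>"']/g, function (ch) {
    return map[ch];
  });
}

console.log(escapeHtml('<b>hi</b> & "bye"'));
```

The result can then be assigned to innerHTML, e.g. chatMessageText.innerHTML = escapeHtml(message["text"]).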

detecting multiple html tags with javascript and regex

I am building a Chrome extension that reads the current page and detects specific HTML/XML tags in it.
For example, suppose my current page contains the following tags and data:
some random text here and there
<investmentAccount acctType="individual" uniqueId="1629529524">
<accountName>state bank of america</accountName>
<accountHolder>rahul raina</accountHolder>
<balance balType="totalBalance">
<curAmt curCode="USD">516545.84</curAmt>
</balance>
<asOf localFormat="MMM dd, yyyy">2013-08-31T00:00:00</asOf>
<holdingList>
<holding holdingType="mutualFund" uniqueId="-2044388005">
<description>Active Global Equities</description>
<value curCode="USD">159436.01</value>
</holding>
<holding holdingType="mutualFund" uniqueId="-556870249">
<description>Passive Non-US Equities</description>
<value curCode="USD">72469.76</value>
</holding>
</holdingList>
<transactionList/>
</investmentAccount>
</site>
some data 123
<site name="McKinsey401k">
<investmentAccount acctType="individual" uniqueId="1629529524">
<accountName>rahuk</accountName>
<accountHolder>rahuk</accountHolder>
<balance balType="totalBalance">
<curAmt curCode="USD">516545.84</curAmt>
</balance>
<asOf localFormat="MMM dd, yyyy">2013-08-31T00:00:00</asOf>
<holdingList>
<holding holdingType="mutualFund" uniqueId="1285447255">
<description>Special Sits. Aggr. Long-Term</description>
<value curCode="USD">101944.69</value>
</holding>
<holding holdingType="mutualFund" uniqueId="1721876694">
<description>Special Situations Moderate $</description>
<value curCode="USD">49444.98</value>
</holding>
</holdingList>
<transactionList/>
</investmentAccount>
</site>
So I need to identify, say, the <accountName> tag and print the text between the starting and ending tags, i.e. "state bank of america" and "rahuk".
This is what I have done so far:
function countString(document_r,a,b) {
var test = document_r.body;
var text = typeof test.textContent == 'string'? test.textContent : test.innerText;
var testRE = text.match(a+"(.*)"+b);
return testRE[1];
}
chrome.extension.sendMessage({
action: "getSource",
source: "XML DETAILS>>>>>"+"\nAccount name is: " +countString(document,'<accountName>','</accountName>')
});
But this only prints the inner text of the first tag it encounters in the page, i.e. "state bank of america".
What if I want to print only "rahuk", which is the inner text of the last tag in the page, or both?
How do I print the inner text of the last tag it encounters in the page, or of all the matching tags?
Thanks in advance.
EDIT: The document above is itself an HTML page; I have just pasted the contents of the page.
UPDATE: I tried a few things based on the suggestions below, and the best I could get was this code:
function countString(document_r) {
var test = document_r.body;
var text = test.innerText;
var tag = "accountName";
var regex = "<" + tag + ">(.*?)<\/" + tag + ">";
var regexg = new RegExp(regex,"g");
var testRE = text.match(regexg);
return testRE;
}
chrome.extension.sendMessage({
action: "getSource",
source: "XML DETAILS>>>>>"+"\nAccount name is: " +countString(document)
});
But this gave me :
XML DETAILS>>>>> Retirement Program (Profit-Sharing
Retirement Plan (PSRP) and Money Purchase Pension Plan
(MPPP)),Retirement Program (Profit-Sharing
Retirement Plan (PSRP) and Money Purchase Pension Plan
(MPPP)),Retirement Program (Profit-Sharing
Retirement Plan (PSRP) and Money Purchase Pension Plan
(MPPP))
This is because the same XML was present in the page 3 times. What I want is for the regex to match only the last XML block, and I don't want the tag names in the output either.
So my desired output would be:
XML DETAILS>>>>> Retirement Program (Profit-Sharing
Retirement Plan (PSRP) and Money Purchase Pension Plan
(MPPP))

Your match call is not global.
var regex = new RegExp(a+"(.*)"+b, "g");
text.match(regex);

If the full XML string is valid, you can parse it into an XML document using the DOMParser.parseFromString method:
var xmlString = '<root>[Valid XML string]</root>';
var parser = new DOMParser();
var doc = parser.parseFromString(xmlString, 'text/xml');
Then you can get a list of tags with a specified name directly:
var found = doc.getElementsByTagName('tagName');
Here's a jsFiddle example using the XML you provided, with two minor tweaks—I had to add a root element and an opening tag for the first site.

Regex pattern like this: <accountName>(.*?)<\/accountName>
var tag = "accountName";
var regex = "<" + tag + ">(.*?)<\/" + tag + ">";
var testRE = text.match(regex);
=> testRE should contain your matches; for tag = accountName that is "state bank of america" and "rahuk"
UPDATE
According to this page, to receive all matches instead of only the first one, you must add a "g" flag to the match pattern, e.g. new RegExp(regex, "g").
"g: The global search flag makes the RegExp search for a pattern
throughout the string, creating an array of all occurrences it can
find matching the given pattern." (found here)
Hope this helps you!
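One detail worth spelling out: with the "g" flag, String.prototype.match returns the full matches including the tags, not the captured groups. A minimal sketch that collects only the text between the tags uses RegExp.prototype.exec in a loop; innerTexts is a hypothetical helper name, and it assumes the tag name contains no regex metacharacters.

```javascript
// Hypothetical helper: returns the text between every <tag>...</tag> pair.
function innerTexts(text, tag) {
  var re = new RegExp('<' + tag + '\\b[^>]*>([\\s\\S]*?)</' + tag + '>', 'g');
  var results = [];
  var m;
  // exec() advances re.lastIndex on each call because of the "g" flag.
  while ((m = re.exec(text)) !== null) {
    results.push(m[1]);  // m[1] is the captured group, without the tags
  }
  return results;
}

var sampleXml =
  '<accountName>state bank of america</accountName> some data 123 ' +
  '<accountName>rahuk</accountName>';
var names = innerTexts(sampleXml, 'accountName');
console.log(names);                    // all matches
console.log(names[names.length - 1]);  // only the last match
```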

You don't need regular expressions for your task (besides, read RegEx match open tags except XHTML self-contained tags for why it's not a good idea!). You can do this completely via javascript:
var tag = "section";
var targets = document.getElementsByTagName(tag);
// iterate from the last element to the first; valid indexes are 0 .. length - 1
for (var i = targets.length - 1; i >= 0; i--) {
console.log(targets[i].innerText);
}

Adding bold tags to a string of html

I have a string of html in javascript/jquery.
var str = '<div id="cheese" class="appleSauce"> I like apple and cheese</div>';
I want to make the string 'apple' bold. So I do:
str = str.replace('apple','<b>apple</b>');
but this breaks the html part of the string. I get:
<div id="cheese" class="<b>apple</b>Sauce"> I like <b>apple</b> and cheese</div>
How can I replace all occurrences of a string in the text of an html string without changing the matches inside of html markup?

var e = $('#cheese');
e.html(e.text().replace('apple','<b>apple</b>'));
Working Fiddle

Create an element, jQuery element in this case, and set the innerHTML property:
var el = $('<div id="cheese" class="appleSauce"> I like apple and cheese</div>');
el.html(el.html().replace('apple','<b>apple</b>'));

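You can do it with a regular expression that skips matches inside tags:
// the lookahead rejects any "apple" that is followed by ">" before the next "<"
str = str.replace(/apple(?![^<]*>)/g, "<b>apple</b>");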
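Another way to keep the markup untouched, sketched here on the raw string, is to split the HTML into tags and text and run the replacement only on the text parts. boldInText is a hypothetical helper name, and the splitter assumes that no attribute value contains a ">" character.

```javascript
// Hypothetical helper: wraps every occurrence of `word` in <b>...</b>,
// but only in text that sits outside of tags.
function boldInText(html, word) {
  // Escape regex metacharacters so `word` is matched literally.
  var escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  var wordRe = new RegExp(escaped, 'g');

  // Splitting on a capturing group keeps the tags in the result:
  // even indexes hold text, odd indexes hold tags.
  return html.split(/(<[^>]*>)/).map(function (part, i) {
    return i % 2 === 1 ? part : part.replace(wordRe, '<b>$&</b>');
  }).join('');
}

var str = '<div id="cheese" class="appleSauce"> I like apple and cheese</div>';
console.log(boldInText(str, 'apple'));
```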

Remove every html tag with JsHtmlSanitizer

I finally got the JsHtmlSanitizer working as a standalone client-side script.
Now I'd like to remove all HTML-Tags from a string and not just script-tags and links.
This example
html_sanitize('<b>hello</b><img src="http://google.com"><a href="javascript:alert(0)"><script src="http://www.google.com"><\/script>');
returns "<b>hello</b>", but I'd like to remove all tags.

Why not use regular expressions to remove all HTML tags after sanitizing?
var input = '<b>hello</b><img src="http://google.com"><a href="javascript:alert(0)"><script src="http://www.google.com"></script>';
var output = null;
output = html_sanitize(input);
output = output.replace(/<[^>]+>/g, '');
This should strip your input string of all html tags after sanitization.
If you want to do just basic sanitization (removing script and style tags with their content and all html tags only) you could implement the entire thing within regex. I have demonstrated an example below.
var input = '<b>hello</b><img src="http://google.com"><a href="javascript:alert(0)"><script src="http://www.google.com"></script>';
input += '<script> if (1 < 2) { alert("This script should be removed!"); } </script><style type="text/css">.cssSelectorShouldBeRemoved > .includingThis { background-color: #FF0000; } </style>';
var output = null;
output = input.replace(/(?:<(?:script|style)[^>]*>[\s\S]+?<\/(?:script|style)[^>]*>)|<[^>]+>/ig, '');

Use this javascript function below to remove all html tags from the string you get from html_sanitize().
var output = html_sanitize('<b>hello</b><img src="http://google.com"><a href="javascript:alert(0)"><script src="http://www.google.com"><\/script>');
output = output.replace(/(<.*?>)/ig,"");
Hope it helps :)
