Get value from parsed HTML using Regex


For a project that makes a website's communications clearer, I have to pull messages out of the page with a regex. The reason is that the message is commented out, so I can't reach it with the usual document.getElement... methods, but the regex below should be able to.
I am trying to get a value using this expression:
\s*<td width="61%"class="valorCampoSinTamFijoPeque">(.|\n)*?<\/td>
How I use this expression:
var pulledmessage = /\s*<td width="61%"class="valorCampoSinTamFijoPeque">(.|\n)*?<\/td>/.exec(htmlDoc);
This expression returns null when I console.log() the result. My guess is that htmlDoc, as I pass it to the regex, is not in a format the regex can work with, but I have no idea how to change it so that the value is actually pulled.
What I use to parse the HTML:
var html1 = httpGet(messages);
parser = new DOMParser();
htmlDoc = parser.parseFromString(html1,"text/html");
The result I want to get:
<td width="61%"class="valorCampoSinTamFijoPeque"><b>D.</b> De:
Information, Information.
Information, Information
Para: Information
CC: Information
Alot of text here ............
</td>
I edited the above value to remove personal information.
html1 contains a full HTML page with the information required.

New attempt. Seeing how the td you need is commented out, remove all HTML comment delimiters from the loaded HTML file before parsing the document. This will result in the td being rendered in the document and you can use innerHTML to get the message content.
const
documentString = `
<!doctype html>
<html>
<body>
<div class="valorCampoSinTamFijoPeque">1</div>
<div class="valorCampoSinTamFijoPeque">2</div>
<div class="valorCampoSinTamFijoPeque">3</div>
<div class="valorCampoSinTamFijoPeque">4</div>
<div class="valorCampoSinTamFijoPeque">5</div>
<div class="valorCampoSinTamFijoPeque">6</div>
<!--<div class="valorCampoSinTamFijoPeque"><b>D.</b> De: Information, Information. Information, Information Para: Information CC: Information Alot of text here ............</div>-->
<div class="valorCampoSinTamFijoPeque">8</div>
</body>
</html>`,
outputElement = document.getElementById('output');
const
// Remove all comment delimiters from the input string.
cleanupDocString = documentString.replace(/<!--|-->/g, ''),
// Create a parser and construct a document based on the string. It should
// output 8 divs.
parser = new DOMParser(),
htmlDoc = parser.parseFromString(cleanupDocString, "text/html");
const
// Get the 7th div with the class name from the parsed document.
element = htmlDoc.getElementsByClassName('valorCampoSinTamFijoPeque')[6];
// Log the element found in the parsed document.
console.log(element);
// Log the content from the element.
console.log(element.innerHTML);
<div id="output"></div>
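A note on why the original call returned null: RegExp.prototype.exec converts its argument to a string, and a parsed Document stringifies to something like "[object HTMLDocument]", so the pattern never sees the markup. Running the pattern against the raw response text (html1 in the question) avoids that. The sketch below assumes the cell looks like the one quoted in the question; extractCellHtml and the sample string are hypothetical names introduced for illustration.

```javascript
// Hypothetical helper: pull the inner HTML of the first matching <td>
// out of a raw HTML string (not a parsed Document).
function extractCellHtml(rawHtml) {
  // [\s\S]*? matches any character, including newlines, lazily.
  const pattern = /<td width="61%"\s*class="valorCampoSinTamFijoPeque">([\s\S]*?)<\/td>/;
  const match = pattern.exec(rawHtml);
  return match ? match[1].trim() : null;
}

// Assumed sample input: the cell sits inside an HTML comment.
const sample =
  '<!--<tr><td width="61%"class="valorCampoSinTamFijoPeque"><b>D.</b> De:\n' +
  'Information, Information.\n</td></tr>-->';

console.log(extractCellHtml(sample));
```

Because the pattern works on the string, it does not matter that the cell is commented out. It is still brittle against attribute reordering, so the comment-stripping approach above remains the sturdier option.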

There is no need for a regex, native JS has your back!
const
documentString = '<!doctype html><html><body><div class="valorCampoSinTamFijoPeque">1</div><div class="valorCampoSinTamFijoPeque">2</div><div class="valorCampoSinTamFijoPeque">3</div><div class="valorCampoSinTamFijoPeque">4</div><div class="valorCampoSinTamFijoPeque">5</div><div class="valorCampoSinTamFijoPeque">6</div><div class="valorCampoSinTamFijoPeque">7<!--<b>D.</b> De: Information, Information. Information, Information Para: Information CC: Information Alot of text here ............--></div><div class="valorCampoSinTamFijoPeque">8</div></body></html>',
outputElement = document.getElementById('output');
function getCommentText(element) {
for (var index=0; index<element.childNodes.length;index++){
const
node = element.childNodes[index];
if (node.nodeType === Node.COMMENT_NODE) {
return node.data;
}
}
}
// Create a parser and construct a document based on the string. It should
// output 8 divs.
const parser = new DOMParser();
const htmlDoc = parser.parseFromString(documentString, "text/html");
const
// Get the 7th div with the class name from the parsed document.
element = htmlDoc.getElementsByClassName('valorCampoSinTamFijoPeque')[6];
// Replace the HTML of the element with the content of the comment.
element.innerHTML = getCommentText(element);
// Take the inner HTML of the parsed document's body and place it inside the output
// element in the page that is visible in the user agent. The 7th div should not
// contain a number but the text that was originally in the comment.
outputElement.innerHTML = htmlDoc.body.innerHTML;
<div id="output"></div>
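getCommentText above only inspects the direct children of one element and returns the first comment it finds. If the position of the comment is not known in advance, a recursive walk can collect every comment below a node. This is a minimal sketch under that assumption; collectCommentText is a hypothetical name, and the constant 8 is the value of Node.COMMENT_NODE, written out so the function also runs on plain objects outside a browser.

```javascript
const COMMENT_NODE = 8; // same value as Node.COMMENT_NODE in browsers

// Hypothetical helper: gather the text of every comment node below `node`.
function collectCommentText(node, found = []) {
  // Text and comment nodes have no children worth visiting; default to [].
  for (const child of Array.from(node.childNodes || [])) {
    if (child.nodeType === COMMENT_NODE) {
      found.push(child.data);
    } else {
      collectCommentText(child, found);
    }
  }
  return found;
}

// Stand-in for a parsed document body, using plain objects for illustration.
const fakeBody = {
  nodeType: 1,
  childNodes: [
    { nodeType: 1, childNodes: [
      { nodeType: 3, data: '7' },
      { nodeType: 8, data: '<b>D.</b> De: Information' }
    ] },
    { nodeType: 8, data: 'top-level comment' }
  ]
};

console.log(collectCommentText(fakeBody));
```

In a browser the same function could be called with htmlDoc.body in place of the plain-object stand-in.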

Related

Compare string with HTML text

I have a string of text that I want to compare with another string that has HTML code. The problem is that the text I need to compare it to in the HTML code is within different tags. Also, if the string exists in the HTML code then I want to wrap it inside a <mark> tag.
This is the example I am using:
var html = '<h1>This is a heading</h1><div class="subtitle">and this is the subheading</div><p class="small">this is some example text</p>';
var lookup = "is a heading and this is the subheading this is some";
var finalHtml = ""; //will contain new html
//Need to do some comparison and then add a <mark> tag around found string.
console.log(finalHtml);
//This should print "<h1>This <mark>is a heading</h1><div class="subtitle">and this is the subheading</div><p class="small">this is some</mark> example text</p>"
I am using JavaScript/jQuery to do this.

This only helps you check whether lookup occurs within html (i.e., no marking). I remove the tags and whitespace and then compare.
var html = '<h1>This is a heading</h1><div class="subtitle">and this is the subheading</div><p class="small">this is some example text</p>';
// Remove HTML tags and whitespace.
var cleanHtml = html.replace(/<\/?[^>]+(>|$)/g, "").replace(/\s/g, "");
var lookup = "is a heading and this is the subheading this is some";
lookup = lookup.replace(/\s/g, '');
if (cleanHtml.includes(lookup)) {
// match found
}
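To go one step further and insert the <mark> tags, one option is to remember where each visible character sits in the original string, search in the stripped text, and map the match back. The following is a rough sketch of that idea, not a full solution: markAcrossTags is a hypothetical name, only the first occurrence is marked, and it assumes that "<" and ">" appear only as tag delimiters.

```javascript
// Hypothetical helper: wrap the first occurrence of `lookup` in <mark>,
// ignoring tags and whitespace while searching.
function markAcrossTags(html, lookup) {
  const chars = [];      // visible, non-whitespace characters
  const positions = [];  // index of each of those characters in `html`
  let inTag = false;
  for (let i = 0; i < html.length; i++) {
    const ch = html[i];
    if (ch === '<') { inTag = true; continue; }
    if (ch === '>' && inTag) { inTag = false; continue; }
    if (inTag || /\s/.test(ch)) continue;
    chars.push(ch);
    positions.push(i);
  }
  const needle = lookup.replace(/\s/g, '');
  if (needle.length === 0) return html;
  const at = chars.join('').indexOf(needle);
  if (at === -1) return html;
  // Map the match in the stripped text back to offsets in the original.
  const start = positions[at];
  const end = positions[at + needle.length - 1] + 1;
  return html.slice(0, start) + '<mark>' + html.slice(start, end) + '</mark>' + html.slice(end);
}

const html = '<h1>This is a heading</h1><div class="subtitle">and this is the subheading</div><p class="small">this is some example text</p>';
const lookup = 'is a heading and this is the subheading this is some';
console.log(markAcrossTags(html, lookup));
```

The result mirrors the string the question asks for. A mark element that crosses element boundaries is not well-formed, though, so a browser will repair it in its own way when parsing; wrapping each affected text run separately would be needed for reliable highlighting.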

Removing span tags issue if text has encoded characters

I'm looking to remove span tags that wrap blocks of text in an in-browser editor, but I'm having trouble when the text contains special characters such as a newline '\n' or encoded characters such as &nbsp;, &bull;, etc.
Here's my code that works on sentences without encoded characters
function fnIgnoreThisErr(evtTargID){
// use the passed parameter
var errIdx = evtTargID.substr(evtTargID.indexOf('err-') + 4);
// build span selector for finding
var errSpan = "span.err-" + evtTargID;
// declare the editor
var editor = CKEDITOR.instances.editor1;
// get text from the editor
var edata = editor.getData();
// find the specific span in the text
var spanData = $( edata ).find(errSpan);
// get outerHTML and innerText to use for replacement
var myCurrText = spanData[0].outerHTML;
var myNewText = spanData[0].innerHTML;
// standard js replace works if no special chars
var replace_text = edata.replace(myCurrText, myNewText); //
// sets the data back in CKEditor
editor.setData(replace_text);
}
Here's an example of the text with the span tag
myCurrText:
<span class=\"vts-warn vts-ParseFailure err-2\">Approval of ICA<br />\n GAMA requested further clarification of proposed §§25.1739 (now §25.1729) and 25.1805(b) (now §26.11(b)) requirements that ICA prepared in accordance with paragraph H.</span>
And with the span tag removed.
Approval of ICA<br />\n GAMA requested further clarification of proposed §§25.1739 (now §25.1729) and 25.1805(b) (now §26.11(b)) requirements that ICA prepared in accordance with paragraph H.
It works great on plain sentences without any encoded characters. I can switch to jQuery but couldn't get replaceWith to work either.
What am I missing here?

I figured it out. There appears to be a discrepancy between html entities and the way they are being rendered/interpreted by the browser and my JS.
i.e. The outerHTML of the span is not a character-for-character match of the text in edata.
So I get the indexOf value for the start of the span and the length of the span node. Because of the discrepancy mentioned above, this length may include additional characters, so I next find the exact position of the closing '</span>' tag. From there, I build a string variable that exactly matches the text that needs to be replaced.
Here's my final code. (I kept it long-form for clarity)
function fnIgnoreThisErr(evtTargID){
// use the passed parameter
var errIdx = evtTargID.substr(evtTargID.indexOf('err-') + 4);
// build span selector for finding
var errSpan = "span.err-" + evtTargID;
// declare the editor
var editor = CKEDITOR.instances.editor1;
// get text from the editor
var edata = editor.getData();
// find the specific span in the text
var spanData = $( edata ).find(errSpan);
// extract the span class name
var spanTag = '<span class="'+spanData[0].className+'">'
// find indexOf value for the span opening tag
var spanPos = edata.indexOf(spanTag);
// get the initial length of the span.
var spanLength = spanData[0].outerHTML.length;
// get the actual text from that span length.
var spanString = edata.substring(spanPos,spanPos+spanLength);
// find the actual position of the span closing tag
var spanClose = spanString.indexOf('</span>');
var spanTagClosePos = spanClose+7;
// extract the true text comprising the span tag
var spanStringMod = edata.substring(spanPos,spanPos+spanTagClosePos);
var spanInnerHtm = spanData[0].innerHTML;
log("errSpan: "+ errSpan);
log("spanTag: "+ spanTag);
log("spanData: "+ JSON.stringify(spanData));
log("spanPos: "+ spanPos);
log("spanTagClosePos: "+ spanTagClosePos);
log("spanStringMod: "+ spanStringMod);
log("spanInnerHtm: "+ spanInnerHtm);
var newEdata = edata.replace(spanStringMod, spanInnerHtm);
log(" newEdata: "+ newEdata);
// update the editor
editor.setData(newEdata);
}
I hope this helps someone, somewhere, at some time!
Cheers!
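The same idea can be written against the raw string only, which sidesteps the mismatch between outerHTML and the editor data. This is a minimal sketch, assuming the span's class attribute is known exactly and that the span contains no nested spans; unwrapSpan is a hypothetical name. Slicing also avoids String.prototype.replace, which treats sequences such as $& in the replacement string specially.

```javascript
// Hypothetical helper: remove one <span class="..."> wrapper from an HTML
// string while keeping its content untouched, character for character.
function unwrapSpan(html, className) {
  const openTag = '<span class="' + className + '">';
  const closeTag = '</span>';
  const start = html.indexOf(openTag);
  if (start === -1) return html; // nothing to unwrap
  const end = html.indexOf(closeTag, start + openTag.length);
  if (end === -1) return html;   // malformed input, leave it alone
  const inner = html.slice(start + openTag.length, end);
  return html.slice(0, start) + inner + html.slice(end + closeTag.length);
}

const editorData =
  '<p><span class="vts-warn vts-ParseFailure err-2">Approval of ICA<br />\n' +
  ' GAMA requested &sect;&sect;25.1739 &bull; paragraph H.</span></p>';

console.log(unwrapSpan(editorData, 'vts-warn vts-ParseFailure err-2'));
```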

How to ignore HTML tags in innerHTML attribute?

I'm making a messenger, and my messages don't ignore HTML tags because I simply paste the text from the input into the innerHTML of the message. My code:
function Message(sender) {
...
this["text"] = "";
...
this.addText = function (text) {
this["text"] = text;
};
...
};
And here I display it:
...
var chatMessageText = document.createElement("p");
chatMessageText.innerHTML = message["text"];
...
What can I do to ignore HTML tags in message["text"]?

Set the element's innerText property (or its textContent property) instead of innerHTML:
chatMessageText.innerText = message["text"];
For the difference between the two, see: innerText vs textContent
See also: Difference between text content vs inner text

You can't. The point of innerHTML is that you give it HTML and it interprets it as HTML.
You could escape all the special characters, but the easier solution is to not use innerHTML.
var chatMessagePara = document.createElement("p");
var chatMessageText = document.createTextNode(message["text"]);
chatMessagePara.appendChild(chatMessageText)
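For cases where an HTML string really is required, the escaping route mentioned above could look like the following. This is a minimal sketch; escapeHtml is a hypothetical helper name, and it covers the five characters that are significant in HTML text and quoted attribute values.

```javascript
// Hypothetical helper: replace HTML-significant characters with entities
// so the text is displayed literally instead of being parsed as markup.
function escapeHtml(text) {
  const entities = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#39;'
  };
  return String(text).replace(/[&<>"']/g, (ch) => entities[ch]);
}

console.log(escapeHtml('<b>hi</b> & "bye"'));
```

The escaped string can then be assigned to innerHTML, although createTextNode or textContent remains the simpler choice when no markup is needed.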

Replace String Text with HTML

I'm trying to implement emoticons so for example :happy: should display an emoticon. However, the text is sanitized and does not generate the actual html.
How can I replace certain strings such as :happy: with html image?
Current attempt (replace :happy: with html):
var data = snapshot.val();
var username = data.name || "anonymous";
var message = data.text;
// REPLACE STRING WITH IMAGE
message = message.replace(":happy:", "<img src='https://s3-us-west-1.amazonaws.com/static/emoticons/1.0.jpg'>");
//CREATE ELEMENTS MESSAGE & SANITIZE TEXT
var messageElement = $("<li>");
nameElement.text(username);
messageElement.text(": "+message).prepend(nameElement);
//DISPLAY MESSAGE
messageList.append(messageElement)
Update:
I think something like text.splitText() should be used here. But I'm trying to find the best way to find the text, then split it.

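Use .html() if the markup should be rendered as HTML; .text() inserts the string as literal text, so the tags show up as plain characters. Keep in mind that .html() also renders any markup the user typed, so the message text should be sanitized before the emoticon replacement.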
messageElement.html(": "+message).prepend(nameElement);

You can use a single detached DOM element to sanitize the input and then convert the emoticon strings to images: set Node.textContent on the detached node and read back Element.innerHTML, then call String.prototype.replace() on the resulting string.
This approach needs no library.
var username = "anonymous",
message = "Some text with an happy <b>emoticon</b> :happy: and a sad <i>emoticon</i> :sad:.",
messageList = document.getElementById("messageList"),
sanitizer = document.createElement("p");
//CREATE ELEMENTS MESSAGE
var messageElement = document.createElement("li");
//SANITIZE TEXT
sanitizer.textContent = message;
message = sanitizer.innerHTML;
//REPLACE STRING WITH IMAGE
message = message.replace(":happy:", "<img src='https://s3-us-west-1.amazonaws.com/horde.tv/static/emoticons/1.0.jpg'>");
message = message.replace(":sad:", "<img src='https://static-cdn.jtvnw.net/emoticons/v1/86/1.0'>");
messageElement.innerHTML = username + ": " + message;
//DISPLAY MESSAGE
messageList.appendChild(messageElement);
<ul id="messageList"></ul>
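Note that replace() with a string pattern only replaces the first occurrence, so a message with two :happy: codes keeps the second one as text. A variant that escapes first and then replaces every known code in one pass could look like this. It is a sketch only: renderMessage, escapeText and the EMOTICONS table are hypothetical names, and the image URLs are placeholders.

```javascript
// Hypothetical emoticon table; the URLs are placeholders, not real assets.
const EMOTICONS = {
  happy: 'https://example.com/emoticons/happy.png',
  sad: 'https://example.com/emoticons/sad.png'
};

// Escape the characters that are significant in HTML text and attributes.
function escapeText(text) {
  const entities = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return String(text).replace(/[&<>"']/g, (ch) => entities[ch]);
}

// Escape the user text first, then swap every known :name: code for an image.
function renderMessage(text) {
  return escapeText(text).replace(/:([a-z]+):/g, (whole, name) =>
    Object.prototype.hasOwnProperty.call(EMOTICONS, name)
      ? '<img src="' + EMOTICONS[name] + '" alt="' + whole + '">'
      : whole // unknown codes stay as plain text
  );
}

console.log(renderMessage('<b>hi</b> :happy: :happy: :unknown:'));
```

The returned string is safe to assign to innerHTML because the only markup it contains comes from the emoticon table.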

detecting multiple html tags with javascript and regex

I am building a chrome extension which would read the current page and detect specific html/xml tags out of it :
For example if my current page contains the following tags or data :
some random text here and there
<investmentAccount acctType="individual" uniqueId="1629529524">
<accountName>state bank of america</accountName>
<accountHolder>rahul raina</accountHolder>
<balance balType="totalBalance">
<curAmt curCode="USD">516545.84</curAmt>
</balance>
<asOf localFormat="MMM dd, yyyy">2013-08-31T00:00:00</asOf>
<holdingList>
<holding holdingType="mutualFund" uniqueId="-2044388005">
<description>Active Global Equities</description>
<value curCode="USD">159436.01</value>
</holding>
<holding holdingType="mutualFund" uniqueId="-556870249">
<description>Passive Non-US Equities</description>
<value curCode="USD">72469.76</value>
</holding>
</holdingList>
<transactionList/>
</investmentAccount>
</site>
some data 123
<site name="McKinsey401k">
<investmentAccount acctType="individual" uniqueId="1629529524">
<accountName>rahuk</accountName>
<accountHolder>rahuk</accountHolder>
<balance balType="totalBalance">
<curAmt curCode="USD">516545.84</curAmt>
</balance>
<asOf localFormat="MMM dd, yyyy">2013-08-31T00:00:00</asOf>
<holdingList>
<holding holdingType="mutualFund" uniqueId="1285447255">
<description>Special Sits. Aggr. Long-Term</description>
<value curCode="USD">101944.69</value>
</holding>
<holding holdingType="mutualFund" uniqueId="1721876694">
<description>Special Situations Moderate $</description>
<value curCode="USD">49444.98</value>
</holding>
</holdingList>
<transactionList/>
</investmentAccount>
</site>
So I need to identify, say, the <accountName> tag and print the text between its opening and closing tags, i.e. "State bank of america" and "rahukk".
So this is what I have done till now:
function countString(document_r,a,b) {
var test = document_r.body;
var text = typeof test.textContent == 'string'? test.textContent : test.innerText;
var testRE = text.match(a+"(.*)"+b);
return testRE[1];
}
chrome.extension.sendMessage({
action: "getSource",
source: "XML DETAILS>>>>>"+"\nAccount name is: " +countString(document,'<accountName>','</accountName>')
});
But this prints the inner text of only the first matching tag on the page, i.e. "State bank of america".
What if I want to print only "rahukk", the inner text of the last matching tag on the page, or both?
How do I print the inner text of the last matching tag, or of all matching tags?
Thanks in advance.
EDIT: The document above is itself an HTML page; I have just pasted the contents of the page.
UPDATE: I experimented with the suggestions below, and this is the best I could reach:
function countString(document_r) {
var test = document_r.body;
var text = test.innerText;
var tag = "accountName";
var regex = "<" + tag + ">(.*?)<\/" + tag + ">";
var regexg = new RegExp(regex,"g");
var testRE = text.match(regexg);
return testRE;
}
chrome.extension.sendMessage({
action: "getSource",
source: "XML DETAILS>>>>>"+"\nAccount name is: " +countString(document)
});
But this gave me :
XML DETAILS>>>>> Retirement Program (Profit-Sharing
Retirement Plan (PSRP) and Money Purchase Pension Plan
(MPPP)),Retirement Program (Profit-Sharing
Retirement Plan (PSRP) and Money Purchase Pension Plan
(MPPP)),Retirement Program (Profit-Sharing
Retirement Plan (PSRP) and Money Purchase Pension Plan
(MPPP))
Again, this is because the same XML is present in the page 3 times. What I want is for the regex to match only the last XML block, and I don't want the tag names in the result either.
So my desired output would be:
XML DETAILS>>>>> Retirement Program (Profit-Sharing
Retirement Plan (PSRP) and Money Purchase Pension Plan
(MPPP))

Your match call is not global.
var regex = new RegExp(a+"(.*)"+b, "g");
text.match(regex);

If the full XML string is valid, you can parse it into an XML document using the DOMParser.parseFromString method:
var xmlString = '<root>[Valid XML string]</root>';
var parser = new DOMParser();
var doc = parser.parseFromString(xmlString, 'text/xml');
Then you can get a list of tags with a specified name directly:
var found = doc.getElementsByTagName('tagName');
Here's a jsFiddle example using the XML you provided, with two minor tweaks—I had to add a root element and an opening tag for the first site.

Regex pattern like this: <accountName>(.*?)<\/accountName>
var tag = "accountName";
var regex = "<" + tag + ">(.*?)<\/" + tag + ">";
var testRE = text.match(regex);
=> testRE should contain your matches; with tag = accountName that means "state bank of america" and "rahukk".
UPDATE
To receive all matches instead of only the first one, you must add a "g" flag to the pattern.
"g: The global search flag makes the RegExp search for a pattern
throughout the string, creating an array of all occurrences it can
find matching the given pattern."
Hope this helps you!
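One caveat with the "g" flag: String.prototype.match then returns the full matches, tag names included, and drops the capture groups. String.prototype.matchAll keeps the groups, so the inner text of every tag, or just the last one, can be read directly. This is a minimal sketch, assuming the markup is available as a plain string and the tag name contains no regex metacharacters; extractTagTexts is a hypothetical name.

```javascript
// Hypothetical helper: return the inner text of every <tag>...</tag> pair.
function extractTagTexts(text, tag) {
  // Allow optional attributes after the tag name; capture the content lazily.
  const pattern = new RegExp(
    '<' + tag + '(?:\\s[^>]*)?>([\\s\\S]*?)</' + tag + '>',
    'g'
  );
  return Array.from(text.matchAll(pattern), (m) => m[1]);
}

const pageText =
  'some random text <accountName>state bank of america</accountName>' +
  ' some data 123 <accountName>rahuk</accountName>';

const names = extractTagTexts(pageText, 'accountName');
console.log(names);                   // every account name
console.log(names[names.length - 1]); // only the last one
```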

You don't need regular expressions for your task (besides, read RegEx match open tags except XHTML self-contained tags for why it's not a good idea!). You can do this completely via javascript:
var tag = "section";
var targets = document.getElementsByTagName(tag);
for (var i = targets.length - 1; i >= 0; i--) {
console.log(targets[i].innerText);
}
