Replace non-code text on webpage

I searched through a bunch of related questions that help with replacing a site's innerHTML using JavaScript, but most rely on targeting the ID or class of the text. However, my text can be inside either a span or a td tag, or possibly elsewhere. I finally was able to gather a few resources to make the following code work:
$("body").children().each(function() {
  $(this).html($(this).html().replace(/\$/g, "%"));
});
The problem with the above code is that I randomly see some code artifacts or other issues on the loaded page. I think it has something to do with there being multiple "$" characters in the website's code: the above script converts them to "%", hence breaking things.
Is there any way to modify the code (JavaScript/jQuery) so that it does not affect code elements and only replaces the visible text (i.e. >Here<)?
Thanks!
---Edit---
It looks like the reason I'm getting a conflict with some other code is this error: "Uncaught TypeError: Cannot read property 'innerText' of undefined". So I'm guessing there are some elements that don't have innerText (even though they don't meet the regex criteria), and it breaks other inline script code.
Is there anything I can add to or change in the code so that it skips the .replace when the text doesn't match the regex, or when the value is undefined?

Wholesale regex modifications to the DOM are a little dangerous; it's best to limit your work to only the DOM nodes you're certain you need to check. In this case, you want text nodes only (the visible parts of the document.)
This answer gives a convenient way to select all text nodes contained within a given element. Then you can iterate through that list and replace nodes based on your regex, without having to worry about accidentally modifying the surrounding HTML tags or attributes:
var getTextNodesIn = function(el) {
  return $(el)
    .find(":not(iframe, script)") // skip <script> and <iframe> tags
    .andSelf() // works with the jQuery 2.1.1 loaded below; removed in jQuery 3.0, where .addBack() replaces it
    .contents()
    .filter(function() {
      return this.nodeType == 3; // text nodes only
    });
};
getTextNodesIn($('#foo')).each(function() {
  var txt = $(this).text().trim(); // trimming surrounding whitespace
  txt = txt.replace(/^\$\d$/g, "%"); // your regex
  $(this).replaceWith(txt);
});
console.log($('#foo').html()); // tags and attributes were not changed
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="foo"> Some sample data, including bits that a naive regex would trip up on:
foo<span data-attr="$1">bar<i>$1</i>$12</span><div>baz</div>
<p>$2</p>
$3
<div>bat</div>$0
<!-- $1 -->
<script>
// embedded script tag:
console.log("<b>$1</b>"); // won't be replaced
</script>
</div>

I solved it slightly differently, testing each value against the regex before attempting to replace it:
var regEx = new RegExp(/^\$\d$/);
var allElements = document.querySelectorAll("*");
for (var i = 0; i < allElements.length; i++) {
  var allElementsText = allElements[i].innerText;
  var regExTest = regEx.test(allElementsText);
  if (regExTest === true) {
    console.log(allElements[i]); // was el[i], which is not defined anywhere
    var newText = allElementsText.replace(regEx, '%');
    allElements[i].innerText = newText;
  }
}
Does anyone see any potential issues with this?
One issue I found is that it does not work if part of the page refreshes after the page has loaded. Is there any way to have it re-run the script when new content is generated on page?
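On re-running when new content appears: one common approach is a MutationObserver that processes only the nodes that were added. The following is a minimal sketch, not a drop-in solution. The helper name replaceInTextNodes is hypothetical, the /\$/g pattern is taken from the original question, and the sketch assumes the script runs after document.body exists.

```javascript
// Hypothetical helper: replace `pattern` in every text node at or under `root`,
// skipping the contents of <script> and <style> elements.
function replaceInTextNodes(root, pattern, replacement) {
  if (root.nodeType === 3) { // text node
    root.nodeValue = root.nodeValue.replace(pattern, replacement);
    return;
  }
  if (root.nodeType !== 1) return; // ignore comments and other node types
  if (root.nodeName === 'SCRIPT' || root.nodeName === 'STYLE') return;
  var children = root.childNodes;
  for (var i = 0; i < children.length; i++) {
    replaceInTextNodes(children[i], pattern, replacement);
  }
}

// Browser-only part: one initial pass, then process nodes added later.
if (typeof document !== 'undefined' && typeof MutationObserver !== 'undefined' && document.body) {
  replaceInTextNodes(document.body, /\$/g, '%');
  var observer = new MutationObserver(function (records) {
    for (var r = 0; r < records.length; r++) {
      var added = records[r].addedNodes;
      for (var n = 0; n < added.length; n++) {
        replaceInTextNodes(added[n], /\$/g, '%');
      }
    }
  });
  // Only childList changes are observed, so editing nodeValue above
  // does not trigger the observer again.
  observer.observe(document.body, { childList: true, subtree: true });
}
```

Because only text nodes are touched, event handlers and attributes on the surrounding elements stay intact.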

Related

jQuery contents from a higher level?

I have the following jQuery that mostly works:
$("article > p, article > div, article > ol > li, article > ul > li").contents().each(function() {
  if (this.nodeType === 3) {
    strippedValue = $.trim($(this).text());
    doStuff(strippedValue);
  }
  if (this.nodeType === 1) {
    strippedValue = $.trim($(this).html());
    doStuff(strippedValue);
  }
})
The problem comes when (inside doStuff()) I try to replace HTML tags. Here is a view of my elements:
And I'm trying to replace those <kbd> tags thusly:
newStr = newStr.replace(/<kbd>/g, " <b>");
newStr = newStr.replace(/<\/kbd>/g, "<b> ");
That doesn't work, and I'm seeing in the debugger that the <kbd> tags are seen as first-class children and looped separately, whereas I want everything inside my selectors to be seen as a raw string so I can replace things. And I realize I'm asking for a contradiction, because .contents() means get children and their contents. So if I have a selector that is a direct parent of <kbd>, then <kbd> ceases to be a raw string and instead becomes a node that is being looped.
So it seems like my selectors are wrong BUT whenever I try to bring my selectors higher in the hierarchy, immediately I lose textual contents and I end up with a bunch of html with no contents inside the elements. (The screenshot shows good contents, as expected.)
So for example I tried this:
$("article").contents().each(function() {
...
}
...hoping that the selector looping would occur a little higher, and thus allow HTML tags further down to come through as raw text. But clearly I'm lost.
My objective is to simply perform a bunch of string replacements on the contents of the html. But there are two challenges with this:
The page contents load dynamically, with ajaxy calls or similar, so full contents are not available until about a second or two after page load.
When I try to grab high-level elements such as body, it ends up devoid of much of the textual contents. The selectors I currently have don't suffer from that problem; those get everything I want BUT then HTML/XML elements get looped instead of coming through as plain text so that I can perform replacements.

Why do you need to perform the modification on raw HTML? You could just replace the DOM elements directly (not to mention that this is much more reliable than using string replacement):
$('kbd').replaceWith(function() {
  return ` <b>${this.textContent}</b> `;
  // or directly create DOM elements:
  // const b = document.createElement('b');
  // b.textContent = this.textContent;
  // return b;
});
console.log($('b').length);
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<kbd>hello world</kbd>
Of course you can still do string replacements where it makes sense, but you should work with DOM elements as much as possible.

Using regexes to modify the text of html (with javascript)

I want to modify the text in a html file using javascript in an android webview.
Essentially, I want to do what android Linkify does to text, but I don't want to do it with java code, because I feel like that might delay the webview rendering the html (if I parse the text before sending it to the webview).
So, for example a piece of html like this:
<html>
<body>
google.com <!--these two shouldn't be linked-->
akhilcherian#gmail.com <!--these two shouldn't be linked-->
<p>www.google.com</p> <!--this should be linked-->
<p>102-232-2312 2032-122-332 </p><!-- should be linked as numbers-->
</body>
</html>
Should become this:
<html>
<body>
google.com
akhilcherian#gmail.com
<p>www.google.com</p>
<p>102-232-2312 <a href="tel:2032-122-332">2032-122-332</a> </p>
</body>
</html>
I already have the regexes to convert numbers and email ids to links, and they're working well enough. What I want to ensure is that I don't link anything that's already within tags. I've removed anchor tags, so they're not an issue, but I also need to avoid linking things like this:
<div width="1000"> <!-- Don't want this '1000' to be linked (but I do want other 4 digit numbers to be)-->
So for example if my regex for links is:
var replacePattern1 = /((https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gim
How do I make sure that it's not within < and >? (Answers using javascript would be appreciated, but if you feel like this is a stupid way of doing it, please let me know about alternatives).
If you're answering with javascript, this question can essentially be shortened to:
How do I write a regex in javascript to search for patterns which are not surrounded by '<' '>' tags

Since you are using JavaScript, you are on the client side, so your page has free access to the DOM and to all of its objects and events.
At this step you may not need a regex at all; you can just use the DOM.
The jQuery library makes it easy to update DOM objects.
In your case you only want the <p> tags.
So I suggest:
//using jquery
$("p").each(function(){
  console.log($(this));
});
//js
var paras = document.getElementsByTagName("p");
// index loop: for...in on an HTMLCollection would also visit "length", "item", etc.
for (var i = 0; i < paras.length; i++) {
  console.log(paras[i]);
}

As I said, the idea is to manipulate the DOM. Here is an example based on your case; I don't know if it is exactly what you are trying to get:
var paras = document.getElementsByTagName("p");
var hrefs = [];
// the prefixes you want to add in the loop over the <p> elements
var json_urls = {"links": ["http://", "tel:"]};
for (var i = 0; i < paras.length; i++) {
  // copy the text content of the <p>
  var text_cp = paras[i].textContent;
  // clear the <p> content
  paras[i].textContent = "";
  // create an <a> DOM element
  hrefs[i] = document.createElement("a");
  // give it a unique id
  hrefs[i].id = "_" + i;
  // set href to prefix + content (toy example: only the first two <p> have a prefix defined)
  hrefs[i].href = (json_urls.links[i] || "") + text_cp;
  hrefs[i].textContent = text_cp;
  paras[i].appendChild(hrefs[i]);
}
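For the narrower question of applying a regex only to text that is not inside `<` and `>`, one string-based option is to split the markup on tags and run the replacement on the pieces in between. This is a minimal sketch under stated assumptions: the helper name replaceOutsideTags and the four-digit pattern are hypothetical, and the tag regex is naive (it would be confused by a `>` inside an attribute value, and it does not treat comments or script contents specially). Walking the DOM's text nodes is generally more robust.

```javascript
// Hypothetical helper: apply `replaceFn` only to the text between tags,
// leaving the tags themselves (and their attributes) untouched.
function replaceOutsideTags(html, replaceFn) {
  // The capturing group keeps the tags in the split result,
  // so odd indexes are tags and even indexes are text between tags.
  return html.split(/(<[^>]*>)/).map(function (part, i) {
    return i % 2 === 1 ? part : replaceFn(part);
  }).join('');
}

// Example replacement: wrap standalone four-digit numbers in a tel: link.
function linkFourDigits(text) {
  return text.replace(/\b\d{4}\b/g, function (m) {
    return '<a href="tel:' + m + '">' + m + '</a>';
  });
}

var result = replaceOutsideTags('<div width="1000">call 2032</div>', linkFourDigits);
console.log(result);
// <div width="1000">call <a href="tel:2032">2032</a></div>
```

The `1000` inside the `width` attribute is left alone because it sits in a tag segment, while `2032` in the text is linked.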

Does javascript consider everything enclosed in <> as html tags?

I am tasked with converting hundreds of Word document pages into a knowledge base html application. This means copying and pasting the HTML of the word document into an editor like Notepad++ and cleaning it up. (Since it is internal document I need to convert, I cannot use online converters).
I have been able to do most of what I need with a javascript function that works "onload" of the body tag. I then copy the resulting HTML into my application framework.
Here is part of the function I wrote: (it shows only code for removing attributes of div and p tags but works for all html tags in the document)
function removeatts() //this function will remove all attributes from all elements and also remove empty span elements
{//for removing div tag attributes
  var divs=document.getElementsByTagName('div'); //look at all div tags
  var divnum=divs.length; //number of div tags on the page
  for (var i=0; i<divnum; i++) //run through all the div tags
  {//remove attributes for each div tag
    divs[i].removeAttribute("class");
    divs[i].removeAttribute("id");
    divs[i].removeAttribute("name");
    divs[i].removeAttribute("style");
    divs[i].removeAttribute("lang");
  }
  //for removing p tag attributes
  var ps=document.getElementsByTagName('p'); //look at all p tags
  var pnum=ps.length; //number of p tags on the page
  for (var i=0; i<pnum; i++) //run through all the p tags
  {//remove attributes for each p tag
    var para=ps[i].innerHTML;
    if (para.length!==0) //ie if there is content inside the p tag
    {
      ps[i].removeAttribute("class");
      ps[i].removeAttribute("id");
      ps[i].removeAttribute("name");
      ps[i].removeAttribute("style");
      ps[i].removeAttribute("lang");
    }
    else
    {//remove empty p tag
      ps[i].remove() ;
    }
    if (para=="<o:p></o:p>" || para=="<o:p> </o:p>" || para=="<o:p> </o:p>")
    {
      ps[i].remove() ;
    }
  }
The first problem I encountered is that if I included the if (para=="<o:p></o:p>" || para=="<o:p> </o:p>" || para=="<o:p> </o:p>") part in an else if statement, the whole function stopped executing.
However, without the if (para=="<o:p></o:p>" || para=="<o:p> </o:p>" || para=="<o:p> </o:p>") part, the function does exactly what it is supposed to.
If, however, I keep it the way it is right now, it does some of what I want it to do.
The trouble occurs over some of the Word generated html that looks like this:
<p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto; margin-
left:.25in;text-align:justify;text-indent:-.25in;line-height:150%;
mso-list:l0 level1 lfo1;tab-stops:list .75in'>
<![if !supportLists]><span style='font-family:Symbol;mso-fareast-font-family:Symbol;mso-bidi-font-family:Symbol;color:black'><span style='mso-list:Ignore'>·
<span style='font:7.0pt "Times New Roman"'>
</span></span></span>
<![endif]><span style='font-family:"Arial","sans-serif";mso-fareast-font-family:Calibri;color:black'>
SOME TEXT.<span style='mso-spacerun:yes'>  </span>SOME MORE TEXT.<span style='mso-spacerun:yes'>  </span>EVEN MORE TEXT.
<span style='mso-spacerun:yes'>  </span>BLAH BLAH BLAH.<o:p></o:p></span></p>
<p><o:p></o:p></p>
Notice the <o:p></o:p> in the last two lines..... This is not getting removed either when treated as plain text or if I write code for it in the function just like the divs and paragraphs as shown in the function above. When I run the function on this, I get
<p>
<![if !supportLists]><span>·
<span>
</span></span></span>
<![endif]><span>
SOME TEXT.<span> </span>SOME MORE TEXT.<span> </span>EVEN MORE TEXT.
<span> </span>BLAH BLAH BLAH.<o:p></o:p></span></p>
<p><o:p></o:p></p>
I have looked around but cannot find any information about whether javascript works the same on known html tags and on something like this that follows the principle of opening and closing tags but doesn't match known HTML tags!
Any ideas about a workaround would be greatly appreciated!

Javascript has no special processing of HTML tags in javascript strings. It honestly doesn't know anything about HTML in the string.
More likely your issue is trying to compare the .innerHTML of a tag to a predetermined string. You cannot and should not do that, because there is no guarantee for the format of .innerHTML. There are hundreds of ways that the same HTML can be formatted, and some browsers don't remember the original HTML but reconstitute it when you ask for .innerHTML, so you simply can't do that type of string comparison.
To be sure of your comparison, you will have to actually parse the HTML (at least with some sort of crude parser, which perhaps could even be a regex) to see if it matches what you want, because you can't rely on optional spacing or optional capitalization in a direct string comparison.
Or, perhaps even better, since your HTML is already parsed, why not just look at the actual HTML objects themselves and see if you have what you want there. You shouldn't even have to remove all those attributes then.
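Checking the parsed elements instead of their serialized HTML could look like the following minimal sketch. The helper name isVisuallyEmpty is hypothetical, and it assumes a paragraph counts as empty when its text content is only whitespace (including non-breaking spaces). Note that under this assumption a paragraph holding only an image would also be treated as empty.

```javascript
// Hypothetical helper: true when an element contains no visible text.
// String.prototype.trim() also strips non-breaking spaces (U+00A0).
function isVisuallyEmpty(el) {
  return (el.textContent || '').trim() === '';
}

// Browser-only usage: remove every <p> that has no visible text,
// which covers <p><o:p></o:p></p> regardless of how it is serialized.
if (typeof document !== 'undefined') {
  // querySelectorAll returns a static list, so removing while looping is safe.
  var paragraphs = document.querySelectorAll('p');
  for (var i = 0; i < paragraphs.length; i++) {
    if (isVisuallyEmpty(paragraphs[i])) {
      paragraphs[i].remove();
    }
  }
}
```

Unlike getElementsByTagName, which returns a live collection that shrinks as elements are removed, the static list here keeps its indexes stable during the loop.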

It's not Javascript that is unhappy with the unknown tags. It's the browser.
For JS it's simply a string. So if this is a very specific case where you just don't need <o:p>, you could remove it from the string with a regex. Note that replace() returns a new string rather than changing the original, so assign the result:
para = para.replace(/<\/?o:p>/ig, "");
But if there are many more, I would strongly suggest you to get familiar with XSLT transformation.

The first problem I encountered is that if I included the if (para=="<o:p></o:p>" || para=="<o:p> </o:p>" || para=="<o:p> </o:p>")
part in an else if statement, the whole function stopped executing.
This is because you cannot have else if after else.
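A minimal sketch of a valid ordering, with every else if placed before the final else. The function name paragraphAction and its return values are hypothetical, and the regex is only an assumed, more tolerant stand-in for the three string comparisons in the question.

```javascript
// Hypothetical helper: decide what to do with a paragraph's innerHTML.
function paragraphAction(para) {
  if (para.length === 0) {
    return 'remove'; // empty paragraph
  } else if (/^<o:p>(\s|&nbsp;)*<\/o:p>$/i.test(para)) {
    return 'remove'; // only an empty <o:p> placeholder
  } else {
    return 'strip-attributes'; // real content: keep it, clean the attributes
  }
}

console.log(paragraphAction('<o:p></o:p>')); // remove
console.log(paragraphAction('SOME TEXT.'));  // strip-attributes
```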
Notice the <o:p></o:p> in the last two lines..... This is not getting removed
I cannot confirm that. When I run your function it removes the <o:p> inside the <p>, as it is supposed to. The <o:p> within the <span> is not processed, because your function does not do that.
If you want to remove all <o:p>s, try
[].forEach.call(document.querySelectorAll('o\\:p'), function (el) {
  el.remove();
});
After that, you may want to remove empty <p>s like this
[].forEach.call(document.querySelectorAll('p'), function (el) {
  if (!el.childNodes.length) {
    el.remove();
  }
});

Replace html in userscript w/o breaking anything?

The below is in my userscript. It doesnt do the alert because when i replace the html i am clobbering it somehow.
How do i replace regular text in a div or span that is literally domain[dot]com so it will appear as domain.com? Well the below works but breaks code running after and other userscripts.
$(function() {
  var html = $('body').html();
  var res = html.replace(/\[dot\]/g, ".");
  $('body').html(res);
  //doesnt call, however html is replaced
  alert('a');
});

Replace the text in the page instead of replacing the entire HTML. If you get the entire HTML and put it back, the browser reparses all the markup and rebuilds the elements as they were when initially loaded, which means that any events bound to any elements are gone.
Use a recursive function to find the text nodes in the document and do the replacing on the text in each node:
function replaceText(node, replacer) {
  var n = node.childNodes;
  for (var i = 0; i < n.length; i++) {
    if (n[i].nodeType == 3) {
      n[i].nodeValue = replacer(n[i].nodeValue);
    } else {
      replaceText(n[i], replacer);
    }
  }
}
$(function(){
  replaceText(document.body, function(s){
    return s.replace(/\[dot\]/g, '.');
  });
});
Demo: http://jsfiddle.net/Guffa/ex83P/
As you can see, there is no jQuery in the function: jQuery mostly deals with elements and offers very little for working with text nodes (.contents() is the main exception).

Is this for a specific set of pages or do you plan on doing this across every page you encounter? If specific, try narrowing down your selectors significantly. This way you're not trying to process every span/div on the page (which is obv slow). Firebug should be able to help you.

Dynamically add anchor tags around text WITHOUT re-writing the HTML

I'm using javascript, jQuery and regex to add anchors (#hashtag) around all hashtags on the page. The regex detects things that are hashtags, and then I use jQuery to re-write the HTML and a javascript .replace() to add in the anchor tags. I also do a javascript if statement so it doesn't replace things inside of script and style tags.
var regExp = /(\W)#([a-zA-Z_]+)(\W)/gm;
var boxLink = "$1<a class='tagLink' onClick=\"doServer('#$2')\">#$2</a>$3"
$('body').children().each(function(){
  if (($(this).get(0).tagName.toLowerCase() != 'style')
    && ($(this).get(0).tagName.toLowerCase() != 'script')
  ) {
    $(this).html($(this).html().replace(regExp, boxLink));
  }
});
});
Simple enough... right?
The problem is that I'm making a plugin, so developers will deploy this on their websites. The html rewrite ($(this).html($(this).html().replace(regExp, boxLink));) breaks seemingly random areas of javascript on websites. It also messes up some HTML structure sometimes. It's just a really messy thing to be doing on lots of different sites.
So rather than fix the re-write, I'd like to just find another way to do this. Is there any way I can accomplish the same thing (adding anchor tags around all hashtags on the page) without re-writing the entire HTML on the page each load?
If not, how can I tweak the javascript I have so it doesn't conflict so much with the javascript on people's sites?

This replaces every textnode with a hash tag on this page with:
<span>texts without hash <a name = "myplugin">#</a></span>
You can substitute the regex to match yours :)
var getTextNodesIn = function(el) {
  $(el).find("*").andSelf().contents().each(function() {
    var parentNode = this.parentNode.nodeName,
        data = this.data;
    if (this.nodeType == 3 && parentNode !== "SCRIPT" && parentNode !== "STYLE" && data.indexOf("#") > -1) {
      var anch = data.replace(/#/g, "#".anchor("myplugin"));
      $(this).replaceWith("<span>" + anch + "</span>"); // closing tag was written as <span/>
    }
  });
};
getTextNodesIn(document.body);
P.S. The getTextNodesIn function was taken from this post:
https://stackoverflow.com/a/4399718/776575

I think part of the problem is that you need to isolate the text nodes and operate on those, not chunks of html. Your example only iterates across the direct children of body, but then tries to apply replacements to whatever html is within those children. This could easily cause existing markup and javascript to break.
Answers to this question might be helpful: How do I select text nodes with jQuery?
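Operating on isolated text nodes could be sketched as follows. This is only an illustration under stated assumptions: the helper names splitHashtags and linkifyTextNode are hypothetical, the hashtag regex is a simplified version of the one in the question (it does not require a non-word character on either side), and doServer is the question's own function, assumed to exist on the page.

```javascript
// Hypothetical helper: split text into plain strings and { tag: 'name' }
// objects, one object per #hashtag found.
function splitHashtags(text) {
  var parts = [];
  var re = /#([a-zA-Z_]+)/g;
  var last = 0;
  var m;
  while ((m = re.exec(text)) !== null) {
    if (m.index > last) parts.push(text.slice(last, m.index));
    parts.push({ tag: m[1] });
    last = re.lastIndex;
  }
  if (last < text.length) parts.push(text.slice(last));
  return parts;
}

// Hypothetical helper (browser only): replace one text node with a mix of
// text nodes and <a> elements, built through the DOM rather than HTML strings.
function linkifyTextNode(node) {
  var parts = splitHashtags(node.nodeValue);
  var hasTag = parts.some(function (p) { return typeof p !== 'string'; });
  if (!hasTag) return; // nothing to link, leave the node alone
  var frag = document.createDocumentFragment();
  parts.forEach(function (part) {
    if (typeof part === 'string') {
      frag.appendChild(document.createTextNode(part));
    } else {
      var a = document.createElement('a');
      a.className = 'tagLink';
      a.textContent = '#' + part.tag;
      a.addEventListener('click', function () {
        doServer('#' + part.tag); // doServer comes from the question's page
      });
      frag.appendChild(a);
    }
  });
  node.parentNode.replaceChild(frag, node);
}

console.log(JSON.stringify(splitHashtags('see #foo and #bar!')));
// ["see ",{"tag":"foo"}," and ",{"tag":"bar"},"!"]
```

Collect the text nodes first, then call linkifyTextNode on each one, so the tree is not modified while it is being walked. Since no existing element is re-parsed, handlers that other scripts attached to the page stay in place.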
