Remove remote content links in HTML using javascript - javascript

I have to scan an HTML document for remote content (iframe, img, script tags, etc.) and remove the links in them based on a blacklist.
I am able to remove iframe, img and script tags whose src points to a blacklisted URL:
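var block = p[key]; // blacklist pattern; p and key come from the surrounding code
var re = new RegExp(block);
var a = document.getElementsByTagName('iframe');
// getElementsByTagName returns a live collection, so iterate backwards while removing
for (var i = a.length - 1; i >= 0; i--)
{
    var str = a[i].src;
    if (re.test(str))
    {
        // use a fresh empty span for each replacement; a node can only sit in one place
        a[i].parentNode.replaceChild(document.createElement("span"), a[i]);
        // alternatively: a[i].src = '';
    }
}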
The same is done for script and img tags. But there can be many more such tags. Is there a generic way to traverse all tags in the HTML and find/replace links that are blacklisted?
I am very new to JavaScript, so I am a bit weak on the basics. Can this approach work in my case?
I don't want to use jQuery or similar libraries, as I am doing this on Android.

Get all elements in the document with document.getElementsByTagName('*').
Once you have them, use whatever code you find suitable to check each element against your condition.
This makes sure you have checked everything. If you were using jQuery I could make things simpler.
But much respect for being a pure JavaScripter!

Don't use any regexp on HTML; use the DOM.
Review the HTML standard for the list of attributes on tags that can contain external links.
Loop over the collections returned from document.getElementsByTagName(tagname).
Check each attribute against the blacklist and clean up with .getAttribute and .removeAttribute (bonus: you will have normalized data, no need to worry about people trying to sneak by with funky escaping!).
Many of those attributes will be called src, so you might want to loop over tag name "*" and check this attribute, just to be a little future-proof/paranoid. Or just loop over all attributes on all elements. This will be very slow, though, and still doesn't guarantee that somebody won't get around it by using URLs that are hard to distinguish from plain text (like an IP address or a domain name without a protocol), so I recommend against a full scan.
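The attribute-based approach above might look roughly like the sketch below. The attribute list (URL_ATTRIBUTES), the function name stripBlacklisted and the example pattern are illustrative assumptions, not a complete list taken from the HTML standard. The root is passed in as a parameter so the function can be pointed at document or at any subtree, and the blacklist patterns are assumed not to use the g flag (which would make .test stateful).

```javascript
// Attributes that commonly carry external URLs (illustrative, not exhaustive).
var URL_ATTRIBUTES = ['src', 'href', 'data', 'poster', 'action'];

// Remove every URL attribute under `root` whose value matches a blacklist pattern.
// `blacklist` is an array of RegExp objects; returns the number of attributes removed.
function stripBlacklisted(root, blacklist) {
    var elements = root.getElementsByTagName('*');
    var removed = 0;
    for (var i = 0; i < elements.length; i++) {
        for (var j = 0; j < URL_ATTRIBUTES.length; j++) {
            var name = URL_ATTRIBUTES[j];
            var value = elements[i].getAttribute(name);
            if (value === null || value === undefined) continue; // attribute not present
            for (var k = 0; k < blacklist.length; k++) {
                if (blacklist[k].test(value)) {
                    elements[i].removeAttribute(name);
                    removed++;
                    break; // attribute is gone, no need to test further patterns
                }
            }
        }
    }
    return removed;
}

// In a page (hypothetical pattern): stripBlacklisted(document, [/ads\.example\.com/i]);
```

Removing an attribute does not change the element collection, so a forward loop is safe here, unlike when elements themselves are removed.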

Related

Chrome extension - The new >>Manifest_version: 3<< Problem [duplicate]

Can the JavaScript command .replace replace text in any webpage? I want to create a Chrome extension that replaces specific words in any webpage to say something else (example cake instead of pie).
The .replace method is a string operation, so it's not immediately simple to run the operation on HTML documents, which are composed of DOM Node objects.
Use TreeWalker API
The best way to go through every node in a DOM and replace text in it is to use the document.createTreeWalker method to create a TreeWalker object. This is a practice that is used in a number of Chrome extensions!
// create a TreeWalker of all text nodes
var allTextNodes = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT),
    // some temp references for performance
    tmptxt,
    tmpnode,
    // compile the RE and cache the replace string, for performance
    cakeRE = /cake/g,
    replaceValue = "pie";
// iterate through all text nodes
while (allTextNodes.nextNode()) {
    tmpnode = allTextNodes.currentNode;
    tmptxt = tmpnode.nodeValue;
    tmpnode.nodeValue = tmptxt.replace(cakeRE, replaceValue);
}
To replace parts of text with another element, or to add an element in the middle of text, use the DOM splitText, createElement, and insertBefore methods.
See also how to replace multiple strings with multiple other strings.
Don't use innerHTML or innerText or jQuery .html()
// the innerHTML property of any DOM node is a string
document.body.innerHTML = document.body.innerHTML.replace(/cake/g,'pie')
It's generally slower (especially on mobile devices).
It effectively removes and replaces the entire DOM, which is not awesome and could have some side effects: it destroys all event listeners attached in JavaScript code (via addEventListener or .onxxxx properties) thus breaking the functionality partially/completely.
This is, however, a common, quick, and very dirty way to do it.
Ok, so the createTreeWalker method is the RIGHT way of doing this and it's a good way. I unfortunately needed to do this to support IE8 which does not support document.createTreeWalker. Sad Ian is sad.
If you want to do this with a .replace on the page text using a non-standard innerHTML call like a naughty child, you need to be careful because it WILL replace text inside a tag, leading to XSS vulnerabilities and general destruction of your page.
What you need to do is only replace text OUTSIDE of tags, which I matched with:
var search_re = new RegExp("(?:>[^<]*)(" + stringToReplace + ")(?:[^>]*<)", "gi");
Gross, isn't it? You may want to mitigate any slowness by replacing some results and then sticking the rest in a setTimeout call, like so:
// replace some chunk of stuff, the first section of your page works nicely
// if you happen to have that organization
//
setTimeout(function() { /* replace the rest */ }, 10);
which will return immediately after replacing the first chunk, letting your page continue with its happy life. For your replace calls, you're also going to want to replace large chunks in a temp string:
var tmp = element.innerHTML.replace(search_re, whatever);
/* more replace calls, maybe this is in a for loop, i don't know what you're doing */
element.innerHTML = tmp;
so as to minimize reflows (when the page recalculates positioning and re-renders everything). For large pages, this can be slow unless you're careful, hence the optimization pointers. Again, don't do this unless you absolutely need to. Use the createTreeWalker method zetlen has kindly posted above.
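One caveat with the search_re pattern above: stringToReplace is concatenated straight into the RegExp, so characters such as "." or "+" in the search string are treated as regex syntax. A minimal sketch of escaping it first follows; escapeRegExp and buildSearchRe are hypothetical helper names, not part of the original answer.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build the "text outside of tags" pattern from the answer,
// but with the search string escaped so it is matched literally.
function buildSearchRe(stringToReplace) {
    return new RegExp("(?:>[^<]*)(" + escapeRegExp(stringToReplace) + ")(?:[^>]*<)", "gi");
}
```

With this, a search string like "a.c" only matches the literal text "a.c" and not "abc".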
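Have you tried something like this?
// a regex with the g flag replaces every occurrence; a plain string would only replace the first
$('body').html($('body').html().replace(/pie/g, 'cake'));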

How can spaces be converted to &nbsp; without breaking HTML tags?

I've inherited some pretty complex code for a web forum, and one of the features I'm trying to implement is the ability for spaces to not be truncated into only one. This is mainly because our users often want to include ASCII art, tables etc in their posts.
I first did this using a simple search and replace in JavaScript, which had the side effect of breaking HTML tags (e.g. <a href=....> became <a&nbsp;href=.....>).
I then tried doing this on server side, when the strings are retrieved, by having spaces converted before links and code people insert is converted to HTML. This works to a degree but it causes some issues with other parts of the code, for example where a message is truncated to appear on the home page, it might leave some of the space code, such as
Here is a message&nb
I think there may be a way to just alter the original javascript to achieve this - it just needs to only match spaces that are not inside a HTML tag.
The script I was using originally was message = message.replace(/\s/g, "&nbsp;").
Thanks for any help you can provide with this.
You can use the pre element to include preformatted text, which renders spaces as-is. See http://www.w3.org/TR/html5-author/the-pre-element.html
Those docs specifically say one of the best uses of the pre element is "Displaying ASCII art".
Example: http://jsbin.com/owuruz/edit#preview
<pre>
         /\_/\
    ____/ o o \
  /~____  =ø= /
 (______)__m_m)
</pre>
In your case, just put your message inside a pre tag.
Yes, but you need to process the text content of elements, not all of the HTML document content. Moreover, you need to exclude style and script element content. As you can limit yourself to things inside the body element, you could use a recursive function like the following, calling it with process(document.body) to apply it to the entire document (but you probably want to apply it to a specific element only):
function process(element) {
    var children = element.childNodes;
    for (var i = 0; i < children.length; i++) {
        var child = children[i];
        if (child.nodeType === 3) {
            // text node: replace each space with a no-break space
            if (child.data) {
                child.data = child.data.replace(/[ ]/g, "\xa0");
            }
        } else if (child.tagName !== "SCRIPT" && child.tagName !== "STYLE") {
            // recurse into everything except script and style content
            process(child);
        }
    }
}
(No reason to use the entity reference here; you can use the no-break space character U+00A0 itself, referring to it as "\xa0" in JavaScript.)
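If only the message string is available rather than a DOM (for example, before it is inserted into the page), a rough alternative is to match tags and spaces in a single pass and leave the tags untouched. The function name preserveSpaces is hypothetical, and the sketch assumes simple markup with no ">" characters inside attribute values. The DOM approach above is the more robust one.

```javascript
// Replace spaces with no-break spaces, but only outside of HTML tags.
function preserveSpaces(message) {
    // Each match is either a whole tag (captured in group 1) or a single space.
    return message.replace(/(<[^>]*>)|[ ]/g, function (match, tag) {
        return tag ? tag : "\xa0"; // keep tags as they are, convert bare spaces
    });
}
```

Because the tag alternative is tried first, the space inside something like <a href=...> is consumed as part of the tag and never converted.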
One way is to use <pre> tags to wrap your users' posts so that their ASCII art is preserved. But why not use Markdown (like Stack Overflow does)? There are a couple of different ports of Markdown to JavaScript:
Showdown
WMD
uedit

Is it wise to use jQuery for whitelisting tags? Are there existing solutions in JavaScript?

My problem
I want to clean HTML pasted in a rich text editor (FCK 1.6 at the moment). The cleaning should be based on a whitelist of tags (and perhaps another with attributes). This is not primarily in order to prevent XSS, but to remove ugly HTML.
Currently I see no way to do it on the server, so I guess it must be done in JavaScript.
Current ideas
I found the jquery-clean plugin, but as far as I can see, it is using regexes to do the work, and we know that is not safe.
As I've not found any other JS-based solution, I've started to implement one myself using jQuery. It would work by creating a jQuery version of the pasted HTML ($(pastedHtml)) and then traversing the resulting tree, removing each element that does not match the whitelist by looking at its tagName.
My questions
Is this any better?
Can I trust jQuery to represent the pasted content well (there may be unmatched ending tags and what-have-you)?
Is there a better solution already that I couldn't find?
Update
This is my current, jQuery-based, solution (verbose and not extensively tested):
function clean(element, whitelist, replacerTagName) {
    // Use div if no replace tag was specified
    replacerTagName = replacerTagName || "div";
    // Accept anything that jQuery accepts
    var jq = $(element);
    // Create a copy of the current element, but without its children
    var clone = jq.clone();
    clone.children().remove();
    // Wrap the copy in a dummy parent to be able to search with jQuery selectors
    // 1)
    var wrapper = $('<div/>').append(clone);
    // Check if the element is not on the whitelist by searching with the 'not' selector
    var invalidElement = wrapper.find(':not(' + whitelist + ')');
    // If the element wasn't on the whitelist, replace it.
    if (invalidElement.length > 0) {
        // 2)
        var el = $('<' + replacerTagName + '/>');
        el.text(invalidElement.text());
        invalidElement.replaceWith(el);
    }
    // Extract the (maybe replaced) element
    var cleanElement = $(wrapper.children().first());
    // Recursively clean the children of the original element and
    // append them to the cleaned element
    var children = jq.children();
    if (children.length > 0) {
        children.each(function(_index, thechild) {
            var cleaned = clean(thechild, whitelist, replacerTagName);
            cleanElement.append(cleaned);
        });
    }
    return cleanElement;
}
I am wondering about some points (see the numbered comments in the code):
1) Do I really need to wrap my element in a dummy parent to be able to match it with jQuery's ":not"?
2) Is this the recommended way to create a new node?
If you leverage the browser's HTML correcting abilities (e.g. you copy the rich text to the innerHTML of an empty div and take the resulting DOM tree), the HTML will be guaranteed to be valid (the way it is corrected is somewhat browser-dependent). This is probably done by the rich text editor anyway, though.
jQuery's own text-to-DOM transform is probably also safe, but definitely slower, so I would avoid it.
Using a whitelist based on the jQuery selector engine might be somewhat tricky, because removing an element while preserving its children might make the document invalid, so the browser would correct it by changing the DOM tree, which might confuse a script trying to iterate through invalid elements. (E.g. you allow ul and li but not ol; the script removes the list root element, naked li elements are invalid so the browser wraps them in ul again, and that ul will be missed by the cleaning script.) If you throw away unwanted elements together with all their children, I don't see any problems with that.
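The last suggestion (throwing away unwanted elements together with all their children) could be sketched with plain DOM calls as below. The function name removeDisallowed and the example tag list are assumptions for illustration. Tag names are compared in upper case, which is how tagName is reported for HTML elements in HTML documents.

```javascript
// Remove every element under `root` whose tag name is not in `allowedTags`,
// together with all of its children. `allowedTags` holds upper-case tag names.
function removeDisallowed(root, allowedTags) {
    // Copy the live childNodes list first, because removing nodes changes it.
    var children = Array.prototype.slice.call(root.childNodes);
    for (var i = 0; i < children.length; i++) {
        var child = children[i];
        if (child.nodeType !== 1) continue; // keep text and other non-element nodes
        if (allowedTags.indexOf(child.tagName) === -1) {
            root.removeChild(child); // drops the element and its whole subtree
        } else {
            removeDisallowed(child, allowedTags);
        }
    }
    return root;
}

// In a browser, let the parser repair the pasted HTML first:
// var container = document.createElement('div');
// container.innerHTML = pastedHtml;
// removeDisallowed(container, ['P', 'UL', 'LI', 'B', 'I']);
// var cleanHtml = container.innerHTML;
```

Since nothing is unwrapped or re-parented, the tree never passes through an invalid intermediate state that the browser would try to correct.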

Looking for a way to search an html page with javascript

What I would like to do is search the HTML page for a specific string, read in a certain number of characters after it, and present those characters in an anchor tag.
The problem I'm having is figuring out how to search the page for a string; everything I've found relates to searching by tag or id. I'm also hoping to make it a Greasemonkey script for my personal use.
function createlinks(srchstart,srchend){
    var page = document.getElementsByTagName('html')[0].innerHTML;
    page = page.substring(srchstart,srchend);
    if (page.search("file','http:") != -1)
    {
        var begin = page.search("file','http:") + 7;
        var end = begin + 79;
        var link = page.substring(begin,end);
        document.body.innerHTML += '<a href="' + link + '">LINK</a> | ';
        createlinks(end+1,page.length);
    }
};
This is what I came up with. Unfortunately, after finding the links it loops over the document again.
Assisted Direction
Lookup JavaScript Regex.
Apply your regex to the page's HTML (see below).
Different regex functions do different things. You could search the document for the string, as suggested, but you'd have to do it recursively, since the string you're searching for may be listed in multiple places.
To Get the Text in the Page
JavaScript: document.getElementsByTagName('html')[0].innerHTML
jQuery: $('html').html()
Note:
IE may require the element to be capitalized (eg 'HTML') - I forget
Also, the document may have newline characters (\n) that you might want to take out, since one could fall in the middle of the string you're looking for.
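To collect every occurrence without re-reading the page or juggling substring offsets, a plain loop over indexOf is one option. This is only a sketch: findAfterMarker is a hypothetical name, and the marker "file','" and the length 79 are taken from the question's code (the marker is slightly looser than the question's search string, since it does not require "http:" to follow).

```javascript
// Return the `length` characters that follow each occurrence of `marker` in `text`.
function findAfterMarker(text, marker, length) {
    var results = [];
    var from = 0;
    while (true) {
        var at = text.indexOf(marker, from);
        if (at === -1) break; // no more occurrences
        var begin = at + marker.length;
        results.push(text.substring(begin, begin + length));
        from = begin; // continue searching after this marker
    }
    return results;
}

// In a page, using the values from the question:
// var page = document.documentElement.innerHTML;
// var links = findAfterMarker(page, "file','", 79);
```

The page HTML is read once and indexOf always works on the full string, so the positions never need to be translated between substrings.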
Okay, so in JavaScript you've got the whole document in the DOM tree. You can search for your string by recursively searching the DOM for the string you want. This is straightforward; I'll put it in pseudocode because you want to think about what libraries (if any) you're using.
function search(node, string):
    if node.innerHTML contains string
        -- then you found it
    else
        for each child node child of node
            search(child, string)
        rof
    fi
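In plain JavaScript, the recursive idea might be sketched as follows. This variant is an assumption on my part rather than a direct translation: it descends to the text nodes instead of testing innerHTML at each level, because the innerHTML of the top node contains everything below it and would match immediately. findTextNodes is a hypothetical name.

```javascript
// Collect every text node under `node` whose text contains `needle`.
function findTextNodes(node, needle, found) {
    found = found || [];
    var children = node.childNodes || [];
    for (var i = 0; i < children.length; i++) {
        var child = children[i];
        if (child.nodeType === 3) {
            // text node: check its content
            if (child.nodeValue.indexOf(needle) !== -1) found.push(child);
        } else if (child.nodeType === 1) {
            // element node: search its children
            findTextNodes(child, needle, found);
        }
    }
    return found;
}

// In a page: var hits = findTextNodes(document.body, "some string");
```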

Extract src attribute from script tag and parse according to particular matches

So, I have to determine the page type in a proprietary CRM, using JavaScript. The only way to determine the page type (that is, the only consistent difference on the front end) is by examining a script tag (one out of many) whose src attribute begins with /modules/.
In a list of a dozen or so script tags in the header, each page has a line of the following format
<script src="/modules/example/includes/sample.js" type="text/javascript"></script>
Now, the order of the script tags is never the same, but there's always one script that has /modules/blah. I need to extract blah so my script can detect what kind of page it is.
So, how do I, using either JavaScript or jQuery, extract the script tag's src value, where src begins with /modules, and store the value after that ('example', in the example above) as a javascript variable?
Well, you can start by collecting all of the script elements. With jQuery, that's as simple as
var scripts = $("script");
Then limit that set to the elements that have a src attribute:
var scripts = $("script[src]");
...and further limit it to those with a src attribute beginning with "/modules/":
var scripts = $("script[src^='/modules/']");
...which given your description should result in a set of exactly one element, from which you can now pull the src attribute value itself:
var path = $("script[src^='/modules/']").attr('src');
Ok, that was easy - now to extract the next part of the path. There are plenty of ways to do this, but split is quick & dumb: create an array of parts using '/' as the separator, then pick off the third element (which will be the one after "modules"):
var pathPart = $("script[src^='/modules/']").attr('src').split('/')[2];
Obviously, this is all very specific to the exact format of the script path you're using as an example, but it should give you a good idea of how to begin...
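For completeness, the same extraction without jQuery could be sketched like this. moduleFromSrc and detectPageType are hypothetical names. The sketch reads the value with getAttribute('src') rather than the .src property, since the property is resolved to an absolute URL and would no longer begin with /modules/.

```javascript
// Return the path segment that follows "/modules/", or null if src does not start with it.
function moduleFromSrc(src) {
    var match = /^\/modules\/([^\/]+)/.exec(src || '');
    return match ? match[1] : null;
}

// Scan a list of script elements and return the first module name found.
function detectPageType(scripts) {
    for (var i = 0; i < scripts.length; i++) {
        var name = moduleFromSrc(scripts[i].getAttribute('src'));
        if (name !== null) return name;
    }
    return null;
}

// In a page: var pageType = detectPageType(document.getElementsByTagName('script'));
```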
