JavaScript: make html text plain text

I've got a JS function which takes a string as a parameter and displays it in a div element. The string may contain HTML tags.
How do I force JS to display the inner text of div elements as plain text, with the HTML tags shown literally? Also, what is an adequate way to filter particular tags, i.e. apply certain tags for styling and just print the others?

You just need to replace & and < (and optionally >, though you don't have to) with their respective entities, for instance using String#replace.
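A minimal sketch of that replacement, assuming the input is an ordinary string. The function name escapeHtml is only illustrative and is not part of any library:

```javascript
// Hypothetical helper: turns markup characters into entities so that
// the browser prints tags instead of interpreting them.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;") // must run first, or it would re-escape the entities added below
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;"); // optional, kept for symmetry
}

const escaped = escapeHtml("<b>Tom & Jerry</b>");
// In a browser you could then assign: someDiv.innerHTML = escaped;
// Setting someDiv.textContent to the raw string should give the same visible result.
```

The order of the calls matters: escaping & last would turn the freshly created &lt; into &amp;lt;.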

And, also, what is an adequate way to filter particular tags, i.e. apply certain tags for styling and just print others.
Inserting user-supplied HTML directly into the page is dangerous because it opens the door to XSS. You should use a tool to sanitize the HTML (here on Stack Overflow, for example, you can use only some HTML tags).
As posted in this question here on SO you can use this client-side sanitizer: http://code.google.com/p/google-caja/source/browse/trunk/src/com/google/caja/plugin/html-sanitizer.js
On the other hand, you may need to do this on the server side; which tool to use depends on your environment (ASP.NET? PHP?).

How to select all tags except anchors (neither anchors inside another element) with document.querySelectorAll?

edit: Is it possible to get all the inner text from the tags in an HTML document, except the text from anchor tags <a> (including <a> anchors nested inside other elements), with the document.querySelectorAll method?
My program has an input field that allows users to enter a selector to get the text of certain tags in a given site page.
So, if I want to enter a selector that gets the text from all nodes except <a> tags, how can I accomplish that?
I mean, *:not(a) does not work, because it selects tags that may have <a> descendants, and the :not() selector does not accept complex selectors, so *:not(* a) does not work either.
I know I could delete those nodes from document first, but is it possible to accomplish this task only selecting those nodes I want with the document.querySelectorAll method?
Example:
<html>
<... lots of other tags with text inside>
<div>
<p> one paragraph </p>
<a> one link </a>
</div>
</...>
</html>
I want all the text in the html except "one link"
edit:
If you do document.querySelectorAll('*:not(a)'), you select the div, which has an a element inside it. So the innerText of this div contains the text from the a element.
Thank you
Your question is how to allow users to extract information from arbitrary hypertext [documents]. This means that solving the problem of "which elements to scrape" is just part of it. The other part is "how to transform the set of elements to scrape into a data set that the user ultimately is interested in".
Meaning that CSS selectors alone won't do. You need data transformation, which will deal with the set of elements as input and yield the data set of interest as output. In your question, this is illustrated by the case of just wanting the text content of some elements, or entire document, but as if the a elements were not there. That is your transformation procedure in this particular case.
You do state, however, that you want to allow users to specify what they want to scrape. This translates to your transformation procedure having other variables and possibly being general with respect to the kind of transformations it can do.
With this in mind, I would suggest you look into technologies like XSLT. XSLT, for one, is designed for these things -- transforming data.
Depending on how computer literate you expect your users to be, you might need to encapsulate the raw power and complexity of XSLT, giving users a simple UI which translates their queries to XSLT and then feeds the resulting XSL stylesheets to an XSLT processor, for example. In any case, XSLT itself will be able to carry a lot of load. You also won't need both XSLT and CSS selectors -- the former uses XPath which you can utilize and even expose to users.
Let's consider the following short example of an HTML document you want scraped:
<html>
<body>
<p>I think the document you are looking for is at <a>example.com</a>.</p>
</body>
</html>
If you want all text extracted but not a elements, the following XSL stylesheet will configure an XSLT processor to yield exactly that:
<?xml version="1.0" encoding="utf-8" ?>
<stylesheet version="1.0" xmlns="http://www.w3.org/1999/XSL/Transform">
<output method="text" />
<template match="a" /><!-- empty template element, meaning that the transformation result for every 'a' element is empty text -->
</stylesheet>
The result of transforming the HTML document with the above XSL stylesheet document is the following text:
I think the document you are looking for is at .
Note how the a element is "stripped" leaving an empty space between "at" and the sentence punctuation ("."). The template element, being empty, configures the XSLT processor to not produce any text when transforming a elements ("a" is a valid, if very simple, XPath expression, by the way -- it selects all a elements). This is all part of XSLT, of course.
I have tested this with Free Online XSL Transformer which uses the very potent SAX library.
Of course, you can cover one particular use case -- yours -- with JavaScript, without XSLT. But how are you going to let your users express what they want scraped? You will probably need to invent some [simple] language -- which might as well be [the already invented] XSLT.
XSLT isn't readily available across different user agents or JavaScript runtimes, not out of the box -- native XSLT 1.0 implementations are indeed provided by both Firefox and Chrome (with the XSLTProcessor class) but are not specified by any standards body and so may be missing in your particular runtime environment. You may be able to find a suitable JavaScript implementation though, but in any case you can invoke the scraper on the server side.
Encapsulating the XSLT language behind some simpler query language and user interface, is something you will need to decide on -- if you're going to give your users the kind of possibilities you say you want them to have, they need to express their queries somehow, whether through a WYSIWYG form or with text.
Clone the top node, remove the a elements from the clone, then get the text.
const bodyClone = document.body.cloneNode(true);
bodyClone.querySelectorAll("a").forEach(e => e.remove());
const { textContent } = bodyClone;
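An alternative that avoids cloning is to walk the tree and skip every a subtree while concatenating text. This is only a sketch: the name textWithout is made up here, and it relies on nothing but the standard nodeType, nodeName, nodeValue and childNodes properties, so it should work on real DOM nodes as well as on plain objects with the same shape.

```javascript
const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE
const TEXT_NODE = 3;    // same value as Node.TEXT_NODE

// Hypothetical helper: concatenates all descendant text of `node`,
// ignoring any element whose tag name equals `skipTag` (case-insensitive).
function textWithout(node, skipTag) {
  if (node.nodeType === TEXT_NODE) {
    return node.nodeValue;
  }
  if (
    node.nodeType === ELEMENT_NODE &&
    node.nodeName.toUpperCase() === skipTag.toUpperCase()
  ) {
    return ""; // drop the whole subtree, e.g. an <a> and everything inside it
  }
  return Array.from(node.childNodes || [])
    .map((child) => textWithout(child, skipTag))
    .join("");
}

// In a browser: const text = textWithout(document.body, "a");
```

Like textContent, this keeps the text of script and style elements; unlike the clone approach, it never copies or modifies the document.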
You can use
document.querySelectorAll('*:not(a)')
Hope it works.

Remove all javascript from page

I have a web page with a control that renders users' HTML markup.
I want to remove all JS calls (and CSS, I guess) to prevent users from injecting malicious code. Replacing all script tags and all onclick and other handlers seems to be a bad idea, so the question is what the best solution to this XSS problem is in the .NET world.
I'd strongly suggest not going down the regex route (You can't parse HTML with Regex), and consider something like HTMLAgilityPack.
This would allow you to remove all script elements, as well as remove all event handlers from elements regardless of how they're set up.
The alternative is to escape all HTML input, and then manually parse the particular tags you're interested in.
<b>Hello</b>
Becomes
&lt;b&gt;Hello&lt;/b&gt;
And you can then match &lt;(b|i|u|p|em|othertagsgohere)&gt;(.+?)&lt;/\1&gt; so that it only matches tags of the types you're interested in that have no attributes on them. But ultimately I think the HTMLAgilityPack route is the better one.
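The escape-then-allow idea can be sketched as follows. The sketch is in JavaScript rather than .NET, the function names and the tag list are assumptions chosen for illustration, and it is not a substitute for a real sanitizer:

```javascript
// Step 1: escape everything, so no user-supplied markup survives.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Step 2: re-enable only attribute-free tags from a fixed list.
// \1 forces the closing tag to match the opening one.
const ALLOWED = /&lt;(b|i|u|p|em)&gt;(.+?)&lt;\/\1&gt;/g;

// Hypothetical helper combining both steps.
function escapeThenAllow(input) {
  return escapeHtml(input).replace(ALLOWED, "<$1>$2</$1>");
}

const safe = escapeThenAllow("<b>Hello</b><script>alert(1)</script>");
// safe === "<b>Hello</b>&lt;script&gt;alert(1)&lt;/script&gt;"
```

Because the pattern requires the tag name to be followed immediately by the escaped >, something like <b onclick="..."> stays escaped. Known limits of this sketch: an allowed tag nested inside another allowed tag stays escaped, and content spanning several lines does not match because . excludes newlines.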

How can HTML tags only be escaped in code and some tags not being escaped?

When using certain online text editors (such as the text editor on Stack Overflow itself) we see that some tags like <b>, <i>, etc are allowed and in designated code sections all HTML tags are escaped.
How is that possible? I tried it by using jQuery and I think they use regular expressions, but I don't have much experience with using regular expressions.
I succeeded doing such things through jQuery AJAX and PHP scripts in which the result (which is escaped using htmlspecialchars() except certain allowed tags) is shown with jQuery's .html() function. However, I found that it is vulnerable to XSS attack. I also tried with .text(), but it escapes all tags including the tags I tried not to escape and AJAX loading also takes time.
How should I go about doing such a thing?
You can use TinyMCE as your editor and initialize it on a textarea like this:
tinyMCE.init({
  selector: "textarea",
  valid_children : "+body[style],-body[div],p[strong|a|#text]"
});
By doing this, the editor you use allows only certain HTML tags; the rest will be escaped.
The valid_children option enables you to control what child elements can exist within what parent elements. TinyMCE will remove/split any non HTML transitional contents by default. So for example a P can't be a child of another P element.
The syntax for this option is a comma separated list of parents with elements that should be added/removed as valid children for that element. So for example "+body[style]" would add style as a valid child of body.
Hope this will help you

HTML Template (Custom) Tag

I understand that using custom html tags is improper for a variety of reasons, but I wanted to run a specific situation by you that might warrant a custom html tag and hopefully get told otherwise or possibly a better way of achieving my goal.
Throughout my code I have what I term as templates that are made up of a div tag with a template and a hidden class attached to it. This is not visible on the screen, but basically these "template" tags contains html that I use in Javascript to create a variety of different items. I do this so that I can style my templates in html rather than have to worry about mixing CSS in with my Javascript.
<!-- TEMPLATE -->
<div class="template hidden">
<span>Random Container</span>
Random Button
</div>
In JavaScript I would do something like
// the template is identified by its class, not by an id
var template = document.querySelector(".template");
var clone = template.cloneNode(true);
clone.classList.remove("template", "hidden");
I would rather be able to do something like this
<template class="hidden">
<span>Random Container</span>
Random Button
</template>
So that if I have multiple templates in a single div I can grab them all rather than having to give them unique class names. Of course my reasoning for needing an implementation goes a lot deeper than this, but it's not necessary to waste your time with the details. Let's just say that it will help clean up my JavaScript a lot.
The custom template tag is hidden and is really nothing more than a container that is convenient to grab in JavaScript with document.getElementsByTagName("template"). Is this OK to do? I would probably prefix the tag with a custom name in case template ever gets implemented in HTML.
Modern browsers generally “support” custom tags in the sense of parsing them and constructing DOM nodes, so that the elements can be styled and processed in scripting.
The main problem is IE prior to IE 9, but it can be handled using document.createElement('...') once for each custom tag name.
Another problem is that validators will report the tags as errors, and if there are loads of such errors, you might not notice some real errors in markup. In principle you can create your own DTD to deal with this (I have an HTML DTD generator under construction, but it is trickier than I expected...).
With these reservations, use custom tags if they essentially simplify your job as compared with using classes.
Why not use one of HTML5's data attributes? They are for storing private data or custom info.
For your case, you could add data-type="template" or data-name="template" and then search and remove based on that. One simple function just like you would write to remove your <template> tag, but without breaking rules.
So, using your example, <div data-type="template" class="hidden"></div>

Creating colorful text in JavaScript

I am trying to write a function that prints a certain text into a <div id="1"> tag.
The string should mark certain index values in different color.
What I have written so far goes through all the index values I have and adds a <font color="color"> tag at each one, and then I add the result using div1.innerHTML = result;
It's a lot of work, and it's very complicated. Is there another way that I can create a string object like I've described without these HTML tags?
If I can do that, then I would just use div1.appendChild(String);
I generally am loath to recommend that anybody use a library that they don't already claim to use, but this is one of those times where the question almost directly asks for a library as an answer :-)
Check out Lettering.JS. It was designed to do exactly what you describe. It wraps your text content by letter or by word or by line (I think) in <span> tags, under your control. You then use CSS to style elements, or some more JavaScript to manipulate and style the elements it creates for you.
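If a library is more than you need, the same wrap-in-spans idea can be sketched by hand. This is not Lettering.JS's API; the function name colorize and the index-to-color map are assumptions made for illustration, and the color values are assumed to come from your own code rather than from users:

```javascript
// Escapes a single character so it is safe to place inside markup.
function escapeChar(ch) {
  if (ch === "&") return "&amp;";
  if (ch === "<") return "&lt;";
  if (ch === ">") return "&gt;";
  return ch;
}

// Hypothetical helper: wraps the characters at the given indices in a
// colored <span> and leaves every other character as escaped plain text.
// Indices count code points, because Array.from splits the string that way.
function colorize(text, colorByIndex) {
  return Array.from(text)
    .map((ch, i) => {
      const safe = escapeChar(ch);
      const color = colorByIndex[i];
      return color ? `<span style="color:${color}">${safe}</span>` : safe;
    })
    .join("");
}

const html = colorize("a<b", { 1: "red" });
// html === 'a<span style="color:red">&lt;</span>b'
// In a browser: div1.innerHTML = html;
```

Using span with an inline style (or a class) replaces the obsolete font tag, and the styling can then move into CSS.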
If what you're looking for is syntax coloring, you can try this jQuery plugin:
http://www.steamdev.com/snippet/
