edit: Is it possible to get all the inner text from tags in HTML document except text from anchor tags <a> (neither the the text from <a> anchors inside another elements) with the document.querySelectorAll method?
My program has an input field that allows users to insert some selector to get the text for certain tags in a given site page.
So, if I want to insert a selector that gets text from all nodes except <a> tags, how can I accomplish that?
I mean *:not(a) does not work, because it selects tags that may have <a>descendants and not() selector does not accept complex selectors, so *:not(* a) does not work.
I know I could delete those nodes from document first, but is it possible to accomplish this task only selecting those nodes I want with the document.querySelectorAll method?
Example:
<html>
<... lots of other tags with text inside>
<div>
<p> one paragraph </p>
<a> one link </a>
</div>
</...>
</html>
I want all the text in the html except "one link"
edit:
If you do document.querySelectorAll('*:not(a)'), you select the div, that has inside an a element. So, the innerText of this div contains the text from a element
Thank you
Your question is how to allow users to extract information from arbitrary hypertext [documents]. This means that solving the problem of "which elements to scrape" is just part of it. The other part is "how to transform the set of elements to scrape into a data set that the user ultimately is interested in".
Meaning that CSS selectors alone won't do. You need data transformation, which will deal with the set of elements as input and yield the data set of interest as output. In your question, this is illustrated by the case of just wanting the text content of some elements, or entire document, but as if the a elements were not there. That is your transformation procedure in this particular case.
You do state, however, that you want to allow users to specify what they want to scrape. This translates to your transformation procedure having other variables and possibly being general with respect to the kind of transformations it can do.
With this in mind, I would suggest you look into technologies like XSLT. XSLT, for one, is designed for these things -- transforming data.
Depending on how computer literate you expect your users to be, you might need to encapsulate the raw power and complexity of XSLT, giving users a simple UI which translates their queries to XSLT and then feeds the resulting XSL stylesheets to an XSLT processor, for example. In any case, XSLT itself will be able to carry a lot of load. You also won't need both XSLT and CSS selectors -- the former uses XPath which you can utilize and even expose to users.
Let's consider the following short example of a HTML document you want scraped:
<html>
<body>
<p>I think the document you are looking for is at example.com.</p>
</body>
</html>
If you want all text extracted but not a elements, the following XSL stylesheet will configure an XSLT processor to yield exactly that:
<?xml version="1.0" encoding="utf-8" ?>
<stylesheet version="1.0" xmlns="http://www.w3.org/1999/XSL/Transform">
<output method="text" />
<template match="a" /><!-- empty template element, meaning that the transformation result for every 'a' element is empty text -->
</stylesheet>
The result of transforming the HTML document with the above XSL stylesheet document is the following text:
I think the document you are looking for is at .
Note how the a element is "stripped" leaving an empty space between "at" and the sentence punctuation ("."). The template element, being empty, configures the XSLT processor to not produce any text when transforming a elements ("a" is a valid, if very simple, XPath expression, by the way -- it selects all a elements). This is all part of XSLT, of course.
I have tested this with Free Online XSL Transformer which uses the very potent SAX library.
Of course, you can cover one particular use case -- yours -- with JavaScript, without XSLT. But how are you going to let your users express what they want scraped? You will probably need to invent some [simple] language -- which might as well be [the already invented] XSLT.
XSLT isn't readily available across different user agents or JavaScript runtimes, not out of the box -- native XSLT 1.0 implementations are indeed provided by both Firefox and Chrome (with the XSLTProcessor class) but are not specified by any standards body and so may be missing in your particular runtime environment. You may be able to find a suitable JavaScript implementation though, but in any case you can invoke the scraper on the server side.
Encapsulating the XSLT language behind some simpler query language and user interface, is something you will need to decide on -- if you're going to give your users the kind of possibilities you say you want them to have, they need to express their queries somehow, whether through a WYSIWYG form or with text.
clone top node, remove as from the clone, get text.
const bodyClone = document.body.cloneNode(true);
bodyClone.querySelectorAll("a").forEach(e => e.remove());
const { textContent } = bodyClone;
you can use
document.querySelectorAll('*:not(a)')
hope it will work.
I've been unable to find a definitive answer to whether custom tags are valid in HTML5, like this:
<greeting>Hello!</greeting>
I've found nothing in the spec one way or the other:
http://dev.w3.org/html5/spec/single-page.html
And custom tags don't seem to validate with the W3C validator.
The Custom Elements specification is available in Chrome and Opera, and becoming available in other browsers. It provides a means to register custom elements in a formal manner.
Custom elements are new types of DOM elements that can be defined by
authors. Unlike decorators, which are stateless and ephemeral, custom
elements can encapsulate state and provide script interfaces.
Custom elements is a part of a larger W3 specification called Web Components, along with Templates, HTML Imports, and Shadow DOM.
Web Components enable Web application authors to define widgets with a
level of visual richness and interactivity not possible with CSS
alone, and ease of composition and reuse not possible with script
libraries today.
However, from this excellent walk through article on Google Developers about Custom Elements v1:
The name of a custom element must contain a dash (-). So <x-tags>, <my-element>, and <my-awesome-app> are all valid names, while <tabs> and <foo_bar> are not. This requirement is so the HTML parser can distinguish custom elements from regular elements. It also ensures forward compatibility when new tags are added to HTML.
Some Resources
Example Web Components are available at https://WebComponents.org
WebComponents.js serves as a polyfill for Web Components until they are supported everywhere. See also the WebComponents.js github page & web browser support table.
It's possible and allowed:
User agents must treat elements and attributes that they do not
understand as semantically neutral; leaving them in the DOM (for DOM
processors), and styling them according to CSS (for CSS processors),
but not inferring any meaning from them.
http://www.w3.org/TR/html5/infrastructure.html#extensibility-0
But, if you intend to add interactivity, you'll need to make your document invalid (but still fully functional) to accomodate IE's 7 and 8.
See http://blog.svidgen.com/2012/10/building-custom-xhtml5-tags.html (my blog)
Basic Custom Elements and Attributes
Custom elements and attributes are valid in HTML, provided that:
Element names are lowercase and begin with x-
Attribute names are lowercase and begin with data-
For example, <x-foo data-bar="gaz"/> or <br data-bar="gaz"/>.
A common convention for elements is x-foo; x-vendor-feature is recommended.
This handles most cases, since it's arguably rare that a developer would need all the power that comes with registering their elements. The syntax is also adequately valid and stable. A more detailed explanation is below.
Advanced Custom Elements and Attributes
As of 2014, there's a new, much-improved way to register custom elements and attributes. It won't work in older browsers such as IE 9 or Chrome/Firefox 20. But it allows you to use the standard HTMLElement interface, prevent collisions, use non-x-* and non-data-* names, and define custom behavior and syntax for the browser to respect. It requires a bit of fancy JavaScript, as detailed in the links below.
HTML5 Rocks - Defining New Elements in HTML
WebComponents.org - Introduction to Custom Elements
W3C - Web Components: Custom Elements
Regarding The Validity of The Basic Syntax
Using data-* for custom attribute names has been perfectly valid for some time, and even works with older versions of HTML.
W3C - HTML5: Extensibility
As for custom (unregistered) element names, the W3C strongly recommends against them, and considers them non-conforming. But browsers are required to support them, and x-* identifiers won't conflict with future HTML specs and x-vendor-feature identifiers won't conflict with other developers. A custom DTD can be used to work around any picky browsers.
Here are some relevant excerpts from the official docs:
"Applicable specifications MAY define new document content (e.g. a
foobar element) [...]. If the syntax and semantics of a given
conforming HTML5 document is unchanged by the use of applicable
specification(s), then that document remains a conforming HTML5
document."
"User agents must treat elements and attributes that they do not
understand as semantically neutral; leaving them in the DOM (for DOM
processors), and styling them according to CSS (for CSS processors),
but not inferring any meaning from them."
"User agents are not free to handle non-conformant documents as they
please; the processing model described in this specification applies
to implementations regardless of the conformity of the input
documents."
"The HTMLUnknownElement interface must be used for HTML elements that
are not defined by this specification."
W3C - HTML5: Conforming Documents
WhatWG - HTML Standard: DOM Elements
N.B. The answer below was correct when it was written in 2012. Since then, things have moved on a bit. The HTML spec now defines two types of custom elements - "autonomous custom elements" and "customized built-in elements". The former can go anywhere phrasing content is expected; which is most places inside body, but not e.g. children of ul or ol elements, or in table elements other than td, th or caption elements. The latter can go where-ever the element that they extend can go.
This is actually a consequence of the accumulation of the content model of the elements.
For example, the root element must be an html element.
The html element may only contain A head element followed by a body element.
The body element may only contain Flow content where flow content is defined as the elements: a,
abbr,
address,
area (if it is a descendant of a map element),
article,
aside,
audio,
b,
bdi,
bdo,
blockquote,
br,
button,
canvas,
cite,
code,
command,
datalist,
del,
details,
dfn,
div
dl,
em,
embed,
fieldset,
figure,
footer,
form,
h1,
h2,
h3,
h4,
h5,
h6,
header,
hgroup,
hr,
i,
iframe,
img,
input,
ins,
kbd,
keygen,
label,
map,
mark,
math,
menu,
meter,
nav,
noscript,
object,
ol,
output,
p,
pre,
progress,
q,
ruby,
s,
samp,
script,
section,
select,
small,
span,
strong,
style (if the scoped attribute is present),
sub,
sup,
svg,
table,
textarea,
time,
u,
ul,
var,
video,
wbr
and Text
and so on.
At no point does the content model say "you can put any elements you like in this one", which would be necessary for custom elements/tags.
I would like to point out that the word "valid" can have two different meanings in this context, either of which is potentially, um, valid.
Should an HTML document with custom tags be considered valid HTML5?
The answer to this is clearly "no." The spec lists exactly what tags are valid in what contexts. This is why an HTML validator will not accept a document with custom tags, or with standard tags in the wrong places (like an "img" tag in the header).
Will an HTML document with custom tags be parsed and rendered in a standard, clearly-defined way across browsers?
Here, perhaps surprisingly, the answer is "yes." Even though the document would not technically be considered valid HTML5, the HTML5 spec does specify exactly what browsers are supposed to do when they see a custom tag: in short, the custom tag acts kind of like a <span> - it means nothing and does nothing by default, but it can be styled by HTML and accessed by javascript.
To give an updated answer reflecting modern pages.
Custom tags are valid if either,
1) They contain a dash
<my-element>
2) They are embedded XML
<greeting xmlns="http://example.com/customNamespace">Hello!</greeting>
This assumes you are using the HTML5 doctype <!doctype html>
Considering these simple restrictions it now makes sense to do your best to keep your HTML markup valid (please stop closing tags like <img> and <hr>, it's silly and incorrect unless you use an XHTML doctype, which you probably have no need for).
Given that HTML5 clearly defines the parsing rules a compliant browser will be able to handle any tag that you throw at it even if it isn't strictly valid.
Custom HTML elements are an emerging W3 standard I have been contributing to that enables you to declare and register custom elements with the parser, you can read the spec here: W3 Web Components Custom Elements spec. Additionally, Microsoft supports a library (written by former Mozilla devs), called X-Tag - it makes working with Web Components even easier.
Quoting from the Extensibility section of the HTML5 specification:
For markup-level features that can be limited to the XML serialization and need not be supported in the HTML serialization, vendors should use the namespace mechanism to define custom namespaces in which the non-standard elements and attributes are supported.
So if you're using the XML serialization of HTML5, its legal for you to do something like this:
<greeting xmlns="http://example.com/customNamespace">Hello!</greeting>
However, if you're using the HTML syntax you are much more limited in what you can do.
For markup-level features that are intended for use with the HTML syntax, extensions should be limited to new attributes of the form "x-vendor-feature" [...] New element names should not be created.
But those instructions are primarily directed at browser vendors, who would assumedly be providing visual styling and functionality for whatever custom elements they chose to create.
For an author, though, while it may be legal to embed a custom element in the page (at least in the XML serialization), you're not going to get anything more than a node in the DOM. If you want your custom element to actually do something, or be rendered in some special way, you should be looking at the Custom Elements specification.
For a more gentle primer on the subject, read the Web Components Introduction, which also includes information about the Shadow DOM and other related specifications. These specs are still working drafts at the moment - you can see the current status here - but they are being actively developed.
As an example, a simple definition for a greeting element might look something like this:
<element name="greeting">
<template>
<style scoped>
span { color:gray; }
</style>
<span>Simon says:</span>
<q><content/></q>
</template>
</element>
This tells the browser to render the element content in quotes, and prefixed by the text "Simon says:" which is styled with the color gray. Typically a custom element definition like this would be stored in a separate html file that you would import with a link.
<link rel="import" href="greeting-definition.html" />
Although you can also include it inline if you want.
I've created a working demonstration of the above definition using the Polymer polyfill library which you can see here. Note that this is using an old version of the Polymer library - more recent versions work quite differently. However, with the spec still in development, this is not something I would recommend using in production code anyway.
just use whatever you want without any dom declaration
<container>content here</container>
add your own style (display:block) and it will work with any modern browser
data-* attributes are valid in HTML5 and even in HTML4 all web browsers used to respect them.
Adding new tags is technically okay, but is not recommended just because:
It may conflict with something added in the future, and
Makes the HTML document invalid unless dynamically added via JavaScript.
I use custom tags only in places that Google does not care, for ecample in a game engine iframe, i made a <log> tag that contained <msg>, <error> and <warning> - but through JavaScript only. And it was fully valid, according to the validator. It even works in Internet explorer with its styling! ;]
Custom tags are not valid in HTML5. But currently browsers are supporting to parse them and also you can use them using css. So if you want to use custom tags for current browsers then you can. But the support may be taken away once the browsers implement W3C standards strictly for parsing the HTML content.
I know this question is old, but I have been studying this subject and though some of the above statements are correct, they aren't the only way to create custom elements. For example:
<button id="find">click me</button>
<Query? ?attach="find" ?content="alert( find() );" ?prov="Hello there :D" >
I won't be displayed :D
</Query?>
<style type="text/css">
[\?content] {
display: none;
}
</style>
<script type="text/javascript">
S = document.getElementsByTagName("Query?")[0];
Q = S.getAttribute("?content");
A = document.getElementById( S.getAttribute("?attach") );
function find() {
return S.getAttribute("?prov");
}
(function() {
A.setAttribute("onclick", Q);
})();
</script>
would work perfectly fine ( in newer versions of Google Chrome, IE, FireFox, and mobile Safari so far ). All you need is just an alpha character (a-z, A-Z) to start the tag, and then you can use any of the non alpha characters after. If in CSS, you must use the "\" (backslash) in order to find the element, such as would need Query\^ { ... } . But in JS, you just call it how you see it. I hope this helps out. See example here
-Mink CBOS
What is the issue regarding placing xml elements inside html ? I am trying to easily retrieve javascript event info which returns some html when a div is clicked on. I want to parse that (as I cant send any data object afaik) and its very easy so I'm doing
<div><currency>eur</currency><price>120</price><weight>2kg</weight></div>
and in the js im doing
doTap=function(sent) {
console.log(sent.getElementsByTagName('price')[0].innerHTML);
The issue is that XML is not valid HTML (with respect to the elements). The browser does not know how to render a currency element, and afaik there is no standard way to deal with unknown elements. Some browsers might ignore them completely.
You should be fine with using span elements and giving them a class:
<div>
<span class="currency">eur</span>
<span class="price">120</span><
<span class="weight">2kg</span>
</div>
a elements are not supposed to contain block elements btw.
Then get the element in question by its class.
If you don't want to display the information (currency, price, etc) but only need to store it somewhere, you can use HTML5's data-* attributes:
<a data-currency-"eur" ... ></a>
and access them with getAttribute.
The main practical problem is that IE 8 and older do not deal at all with tags they don’t know when processing HTML documents. To deal with this, you can add code like document.createElement('currency').
There are other issues, too, such as the risk that some future browsers will start recognizing the markup you have invented – and this may cause unpredictable rendering features and functionality. Some more notes: http://www.cs.tut.fi/~jkorpela/pragmatic-html.html8#custom
The safer approach is to use span elements with class. You can also use a elements, since without href, they are still valid but do not create links, making the element effectively just a text-level container. By HTML syntax rules, though, you must not nest a elements (but browsers don’t care, in things like this).
So the safest approach uses e.g. <span class=currency>EUR</span> etc. instead.
Image tag (<img src="" alt="" />), line break tags (<br />), or horizontal rule tags (<hr />) have the slashes at the end to indicate itself as self-closing tags. However, when these objects are created by javascript, and I look into the source, they don't have the slashes, making them invalid by W3C standards.
How can I get over this problem?
(I use javascript Prototype library)
How are you looking at ‘the source’? JavaScript-created elements don't appear in ‘View Source’. Are you talking about innerHTML?
If so then what you are getting is the web browser's serialisation of the DOM nodes in the document. In a browser the HTML markup of a page is not the definitive store for document state. The document is stored as a load of Node objects; when these objects are serialised back to markup, that markup may not look much like the original HTML page markup that was parsed to get the document.
So regardless of which of:
<img src="x" alt="y"/>
<img src="x" alt="y">
<img alt = "y" src="x">
img= document.createElement('img'); img.src= 'x'; img.alt= 'y';
you use to create an HTMLImageElement node, when you serialise it using innerHTML the browser will typically generate the same HTML markup.
If the browser is in native-XML mode (ie because the page was served as application/xhtml+html), then the innerHTML value certainly will contain self-closing syntax like<img/>. (You might also see other XML stuff like namespaces too, in some browsers.)
However if, as is more likely today, you're serving ‘HTML-compatible XHTML’ under the media type text/html, the browser isn't actually using XHTML at all, so you'll get old-school-HTML when you access innerHTML and you won't see the self-closing /> form.
There's is no problem. W3C standards only refer to the markup in the original file, any changes after that are made directly to the DOM, (not your sauce code) which is also a W3C standard. This will in no way affect the standards compliance of your website.
To go into further detail, HTML and XHTML are only different ways of building the DOM (Document Object Model), which is best described as a large tree structure of nodes which is used to describe a web page. Once the DOM has been built, the language used to build it is irrelevant, You can even build the DOM using pure javascript if you wished.
It never matters!, I used to confirm every standard of W3C but it turns to be a stupid thing!
Just conform the safe ones of them that allows you to code cross-browsers and your case is definitely not one of them since it's never an issue and causes no problems.