What is the issue with placing XML elements inside HTML? I am trying to easily retrieve JavaScript event info, which returns some HTML when a div is clicked. I want to parse that (as I can't send any data object, afaik), and it's very easy, so I'm doing
<div><currency>eur</currency><price>120</price><weight>2kg</weight></div>
and in the JS I'm doing
doTap = function(sent) {
  console.log(sent.getElementsByTagName('price')[0].innerHTML);
};
The issue is that XML is not valid HTML (with respect to the elements). The browser does not know how to render a currency element, and afaik there is no standard way to deal with unknown elements. Some browsers might ignore them completely.
You should be fine with using span elements and giving them a class:
<div>
<span class="currency">eur</span>
<span class="price">120</span>
<span class="weight">2kg</span>
</div>
a elements are not supposed to contain block elements btw.
Then get the element in question by its class.
If you don't want to display the information (currency, price, etc) but only need to store it somewhere, you can use HTML5's data-* attributes:
<a data-currency="eur" ... ></a>
and access them with getAttribute.
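A minimal sketch of reading such attributes in a click handler. The helper name readProductInfo and the attribute names data-price and data-weight are assumptions that extend the data-currency example above; they are not part of the original markup.

```javascript
// Hypothetical helper: collects data-* values from a clicked element.
// data-price and data-weight are assumed attribute names.
function readProductInfo(el) {
  return {
    currency: el.getAttribute('data-currency'),
    // parseFloat yields NaN when the attribute is missing (getAttribute returns null).
    price: parseFloat(el.getAttribute('data-price')),
    weight: el.getAttribute('data-weight'),
  };
}

// Browser usage (not run here), where link is the clickable element:
// link.addEventListener('click', function (event) {
//   console.log(readProductInfo(event.currentTarget).price);
// });
```

Because the values live in attributes rather than in child elements, nothing extra is rendered and no invented tag names are needed.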
The main practical problem is that IE 8 and older do not deal at all with tags they don’t know when processing HTML documents. To deal with this, you can add code like document.createElement('currency').
There are other issues, too, such as the risk that some future browsers will start recognizing the markup you have invented – and this may cause unpredictable rendering features and functionality. Some more notes: http://www.cs.tut.fi/~jkorpela/pragmatic-html.html8#custom
The safer approach is to use span elements with class. You can also use a elements, since without href, they are still valid but do not create links, making the element effectively just a text-level container. By HTML syntax rules, though, you must not nest a elements (but browsers don’t care, in things like this).
So the safest approach uses e.g. <span class=currency>EUR</span> etc. instead.
edit: Is it possible to get all the inner text from tags in an HTML document except text from anchor tags <a> (nor the text from <a> anchors inside other elements) with the document.querySelectorAll method?
My program has an input field that allows users to insert some selector to get the text for certain tags in a given site page.
So, if I want to insert a selector that gets text from all nodes except <a> tags, how can I accomplish that?
I mean, *:not(a) does not work, because it also selects elements that may have <a> descendants, and the :not() selector does not accept complex selectors, so *:not(* a) does not work either.
I know I could delete those nodes from document first, but is it possible to accomplish this task only selecting those nodes I want with the document.querySelectorAll method?
Example:
<html>
<... lots of other tags with text inside>
<div>
<p> one paragraph </p>
<a> one link </a>
</div>
</...>
</html>
I want all the text in the html except "one link"
edit:
If you do document.querySelectorAll('*:not(a)'), you select the div, which has an a element inside it. So the innerText of this div contains the text from the a element.
Thank you
Your question is how to allow users to extract information from arbitrary hypertext [documents]. This means that solving the problem of "which elements to scrape" is just part of it. The other part is "how to transform the set of elements to scrape into a data set that the user ultimately is interested in".
Meaning that CSS selectors alone won't do. You need data transformation, which will deal with the set of elements as input and yield the data set of interest as output. In your question, this is illustrated by the case of just wanting the text content of some elements, or entire document, but as if the a elements were not there. That is your transformation procedure in this particular case.
You do state, however, that you want to allow users to specify what they want to scrape. This translates to your transformation procedure having other variables and possibly being general with respect to the kind of transformations it can do.
With this in mind, I would suggest you look into technologies like XSLT. XSLT, for one, is designed for these things -- transforming data.
Depending on how computer literate you expect your users to be, you might need to encapsulate the raw power and complexity of XSLT, giving users a simple UI which translates their queries to XSLT and then feeds the resulting XSL stylesheets to an XSLT processor, for example. In any case, XSLT itself will be able to carry a lot of load. You also won't need both XSLT and CSS selectors -- the former uses XPath which you can utilize and even expose to users.
Let's consider the following short example of a HTML document you want scraped:
<html>
<body>
<p>I think the document you are looking for is at <a href="http://example.com">example.com</a>.</p>
</body>
</html>
If you want all text extracted but not a elements, the following XSL stylesheet will configure an XSLT processor to yield exactly that:
<?xml version="1.0" encoding="utf-8" ?>
<stylesheet version="1.0" xmlns="http://www.w3.org/1999/XSL/Transform">
<output method="text" />
<template match="a" /><!-- empty template element, meaning that the transformation result for every 'a' element is empty text -->
</stylesheet>
The result of transforming the HTML document with the above XSL stylesheet document is the following text:
I think the document you are looking for is at .
Note how the a element is "stripped" leaving an empty space between "at" and the sentence punctuation ("."). The template element, being empty, configures the XSLT processor to not produce any text when transforming a elements ("a" is a valid, if very simple, XPath expression, by the way -- it selects all a elements). This is all part of XSLT, of course.
I have tested this with Free Online XSL Transformer which uses the very potent SAX library.
Of course, you can cover one particular use case -- yours -- with JavaScript, without XSLT. But how are you going to let your users express what they want scraped? You will probably need to invent some [simple] language -- which might as well be [the already invented] XSLT.
XSLT isn't readily available across different user agents or JavaScript runtimes, not out of the box -- native XSLT 1.0 implementations are indeed provided by both Firefox and Chrome (with the XSLTProcessor class) but are not specified by any standards body and so may be missing in your particular runtime environment. You may be able to find a suitable JavaScript implementation though, but in any case you can invoke the scraper on the server side.
Encapsulating the XSLT language behind some simpler query language and user interface, is something you will need to decide on -- if you're going to give your users the kind of possibilities you say you want them to have, they need to express their queries somehow, whether through a WYSIWYG form or with text.
Clone the top node, remove the a elements from the clone, then get the text.
const bodyClone = document.body.cloneNode(true);
bodyClone.querySelectorAll("a").forEach(e => e.remove());
const { textContent } = bodyClone;
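A variation that avoids copying the tree, sketched under the assumption that only text nodes outside a given tag are wanted: walk the nodes recursively and skip every subtree rooted at that tag. The function name collectTextExcept is made up for illustration. It relies only on nodeType, nodeName, nodeValue and childNodes, so it accepts real DOM nodes as well as plain objects of the same shape.

```javascript
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: concatenates all text under `node`,
// ignoring any subtree whose root element is `skipTag`.
function collectTextExcept(node, skipTag) {
  if (node.nodeType === TEXT_NODE) return node.nodeValue;
  if (node.nodeType !== ELEMENT_NODE) return ''; // comments etc. contribute nothing
  // nodeName is upper case for HTML elements, so compare case-insensitively.
  if (node.nodeName.toLowerCase() === skipTag.toLowerCase()) return '';
  let out = '';
  for (const child of node.childNodes) {
    out += collectTextExcept(child, skipTag);
  }
  return out;
}

// Browser usage (not run here):
// const text = collectTextExcept(document.body, 'a');
```

This also shows why a selector alone is not enough: selectors pick elements, while the unwanted text lives in text nodes nested inside elements that are otherwise wanted.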
You can use
document.querySelectorAll('*:not(a)')
Hope it works.
The Issue
I have read a couple older SO posts researching info on the anchor pseudo classes, and keep coming across confusion between "a" vs "a:link" and when and why you would use either. In the most common reason I've seen it is often stated that "a" would style links like
<a name="something">
My Questions
I'm just curious if anyone can explain WHY you would want to do something like that?
I've read that maybe it has something to do with JavaScript targeting, but with HTML5/CSS3 and libraries like jQuery, is this even a valid technique to use anymore?
In what instances would using an anchor tag that is not a link (i.e., doesn't have an "href" attribute) be #BestPractice, or is this method completely deprecated?
That can be used for in-page targeting of elements (e.g. to scroll to a certain point):
<a name="table-of-contents"></a>
<h1>Table of Contents</h1>
...
<a href="#table-of-contents">Table of Contents</a>
Though, this is often redundant (and may also take up white space) because elements with IDs can be targeted directly:
<h1 id="table-of-contents">Table of Contents</h1>
...
<a href="#table-of-contents">Table of Contents</a>
The <a> name attribute is technically no longer supported in HTML5, although browsers still support it for backwards compatibility.
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/a#Obsolete
I recommend you stick with <a id="something"> from here on out. If you've seen examples that use name, then they're probably leftovers from the HTML 4 days.
I am developing a kind of HTML+JS control which can be embedded into various web pages. I know nothing about those pages (well, I could, but I don't want to). The control consists of one root element (e.g. DIV) which contains a subtree of child elements. In my script, I need to access the child elements. The question is: how can I mark those child elements to distinguish them?
The straightforward solution is using id-s. The problem here is that the id must be unique in the scope of the entire document, and I know nothing about the document my control will be embedded into. So I can't guarantee the uniqueness of my id-s. If the id-s are not unique, it will work (if used with care), but this does not conform to the standard, so I could run into problems with newer browser versions, for example.
Another solution is to use the "name" attribute. It's not required to be unique -- that's good. But again, the standard allows the presence of "name" attribute only for a restricted set of element types. For example, the "name" attribute is invalid for DIV elements.
I could use, for example, the "class" attribute. It seems to be OK with the standards, but it's not OK with the meaning. "class" should be used for other purposes, and this may be confusing.
Can anybody suggest some other options to implement local id-s for HTML elements?
You could use the HTML5 data-* attributes so you can give them a custom name with the right meaning:
https://html.spec.whatwg.org/multipage/dom.html#embedding-custom-non-visible-data-with-the-data-*-attributes
Do something like:
<div id="element-id" data-local-id="id-value">
...
</div>
and get the value in JavaScript with:
const el = document.getElementById('element-id');
const { localId } = el.dataset;
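If the goal is to avoid document-wide ids altogether, the same data-local-id attribute can be looked up relative to the control's root element, so the value only has to be unique inside the control. A minimal sketch, assuming the helper name findLocal and a restriction of local ids to letters, digits, underscore and hyphen (both are illustrative choices, not part of any standard):

```javascript
// Hypothetical helper: finds a descendant of `root` by its data-local-id.
function findLocal(root, localId) {
  // Restricting the characters avoids having to escape the value
  // inside the attribute selector.
  if (!/^[A-Za-z0-9_-]+$/.test(localId)) {
    throw new Error('unsupported local id: ' + localId);
  }
  return root.querySelector('[data-local-id="' + localId + '"]');
}

// Browser usage (not run here), where rootDiv is the control's root element:
// const title = findLocal(rootDiv, 'id-value');
```

Since querySelector is called on the root element rather than on document, identical local ids in other parts of the host page, or in a second instance of the control, are not matched.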
If you add a prefix such as myWidgetName_98345699- to all of your IDs and/or classes, collisions become highly improbable.
<div id="myWidgetName_98345699-container" class="myWidgetName_98345699-container">
jQuery does have selectors that match part of an ID, so it would also be smart to stay away from common names like container. Using a longish alphanumeric mix for the specific part of the ID would be smart as well.
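The prefixing scheme can be kept in one place so that markup generation and lookups always agree. A small sketch; the function name makeNamespacer is hypothetical, and the widget name and number are simply the ones from the example above:

```javascript
// Hypothetical helper: returns a function that turns a local name
// into a prefixed id/class name for one widget instance.
function makeNamespacer(widgetName, instanceId) {
  const prefix = widgetName + '_' + instanceId + '-';
  return function (localName) {
    return prefix + localName;
  };
}

const ns = makeNamespacer('myWidgetName', '98345699');
const containerId = ns('container'); // "myWidgetName_98345699-container"

// Browser usage (not run here):
// const container = document.getElementById(containerId);
```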
Typically, including hidden information within the web page required creative approaches. For example:
Storing information within HTML element attributes such as id, class, rel, and title, thus overriding the attributes original intent.
Using <span> or <div> blocks that contain the information, while making such blocks invisible to the user through styling (style="display: none;").
Adding JavaScript code to the web page to define data structures that map to HTML ID elements.
Adding your own attributes to existing HTML elements (breaking the HTML standard itself, and relying on the HTML browser to ignore any syntax errors).
The approaches above are not elegant and are not good coding practice, but the good news is that jQuery has a facility that simplifies associating data to DOM elements in a clean, cross-browser manner.
Use the custom data attributes:
Any attribute that starts with "data-" will be treated as a storage area for private data (private in the sense that the end user can't see it; it doesn't affect layout or presentation).
Defining custom data via html:
<div class="bar" id="baz" data-id="foo">
...
</div>
Associating a data-id value with a specific DOM element (jQuery):
$('#baz').data('id', 'foo');
Retrieving an element with a specific data-id attribute:
var $item = $('[data-id="foo"]');
I understand that using custom html tags is improper for a variety of reasons, but I wanted to run a specific situation by you that might warrant a custom html tag and hopefully get told otherwise or possibly a better way of achieving my goal.
Throughout my code I have what I term as templates that are made up of a div tag with a template and a hidden class attached to it. This is not visible on the screen, but basically these "template" tags contains html that I use in Javascript to create a variety of different items. I do this so that I can style my templates in html rather than have to worry about mixing CSS in with my Javascript.
<!-- TEMPLATE -->
<div class="template hidden">
<span>Random Container</span>
Random Button
</div>
In javascript I would do something like
var template = document.querySelector(".template");
var clone = template.cloneNode(true);
clone.classList.remove("template", "hidden");
I would rather be able to do something like this
<template class="hidden">
<span>Random Container</span>
Random Button
</template>
So that if I have multiple templates in a single div, I can grab them all rather than having to give them unique class names. Of course, my reasoning for needing an implementation goes a lot deeper than this, but it's not necessary to waste your time with the details. Let's just say that it will help clean up my JavaScript a lot.
Because the custom template tag is hidden and really is nothing more than a container that is convenient to retrieve in JavaScript with document.getElementsByTagName("template"), is this OK to do? I would probably prefix the tag with a custom name in case template ever gets implemented in HTML.
Modern browsers generally “support” custom tags in the sense of parsing them and constructing DOM nodes, so that the elements can be styled and processed in scripting.
The main problem is IE prior to IE 9, but it can be handled using document.createElement('...') once for each custom tag name.
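A sketch of that workaround, written in ES3-style syntax because it has to run in old IE. The function name shivElements and the tag name x-template are placeholders. The calls need to happen before the parser reaches the custom tags, for instance from a script in the head.

```javascript
// Hypothetical helper: makes old IE aware of unknown tag names by
// creating (and discarding) one element per name.
function shivElements(doc, tagNames) {
  for (var i = 0; i < tagNames.length; i++) {
    doc.createElement(tagNames[i]);
  }
}

// Browser usage (not run here); 'x-template' is a made-up tag name:
// shivElements(document, ['x-template']);
```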
Another problem is that validators will report the tags as errors, and if there are loads of such errors, you might not notice some real errors in markup. In principle you can create your own DTD to deal with this (I have an HTML DTD generator under construction, but it is trickier than I expected...).
With these reservations, use custom tags if they essentially simplify your job as compared with using classes.
Why not use one of HTML5's data attributes? They are for storing private data or custom info.
For your case, you could add data-type="template" or data-name="template" and then search and remove based on that. One simple function just like you would write to remove your <template> tag, but without breaking rules.
So, using your example, <div data-type="template" class="hidden"></div>
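Grabbing every such template inside a container could then look roughly like this. It is only a sketch: the function name cloneTemplates is hypothetical, and it assumes the data-type="template" attribute and the hidden class from the example above.

```javascript
// Hypothetical helper: returns visible deep copies of all template
// containers found under `root`.
function cloneTemplates(root) {
  return Array.from(root.querySelectorAll('[data-type="template"]'), function (tpl) {
    const clone = tpl.cloneNode(true);   // deep copy, children included
    clone.removeAttribute('data-type');  // the copy is no longer a template
    clone.classList.remove('hidden');    // and should be visible
    return clone;
  });
}

// Browser usage (not run here), where someDiv holds several templates:
// cloneTemplates(someDiv).forEach(function (node) { document.body.appendChild(node); });
```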
Out of curiosity, what, if any, impact will it have on a site or page if instead of using IDs or classes for elements, you simply create custom elements w/ JS and stylize them with CSS?
For example, if I create an element "container" and use it as <container> instead of <div class="container">, is there a performance difference or something?
I don't see this being used often and am wondering why?
That's like saying "What if I respect the syntax and grammar of English, but make up all the words?" While this thinking makes for good poetry, it doesn't lend itself well to technical fields ;)
HTML has a defined set of tags which are valid. If you use any tags which are made up, it will be invalid.
Now, that doesn't mean you can't get away with it; on the World Wide Web forgiveness is the default so if you used tags which you made up it wouldn't be the end of the world... but it would still be a bad idea because you'd have no guarantee how browsers handle those tags.
So the only real answer to "what impact will it have on a page if instead of using IDs or classes for elements, you simply create custom elements w/ JS and stylize them with CSS?" is anything could happen. Since you'd be using non-standard HTML elements, you'd end up with non-standard results, which none of us should try and predict.
If you really want to (and/or need to) use custom elements, look into XML. In XML you can "make up" your tags, but can still apply CSS and open the documents in a browser.
For example, save the following two files, and then open the XML file in a browser:
index.xml
<?xml-stylesheet type="text/css" href="style.xml.css"?>
<example>
<summary>
This is an example of making up tags in XML, and applying a stylesheet so you can open the file in a browser.
</summary>
<main>
<container>This is the stuff in the container</container>
</main>
</example>
style.xml.css
summary {
display:none;
}
main container {
border:2px solid blue;
background:yellow;
color:blue;
}
HTML is standardized, you can't simply invent new elements. Some browsers will render the text content of an element they don't recognize, but not all will, and your HTML will not be valid HTML in such a case.
HTML is a defined language, the elements and tags have certain meaning within the format. You cannot invent a new element not only because browsers may render those elements inconsistently, but also because the meaning and structure of the document becomes invalid.
You are best using the element that has the correct meaning for the content you wish to deliver. If you require a generic container for styling, the correct element is a div. There are similar elements that also provide some semantic meaning. I would recommend checking out a HTML tag index and HTML5 doctor for assistance in picking the correct element.
It sounds as though <div class="container">...</div> is the closest to what you need from your brief description.
If custom elements make your HTML easier to read and manage, go ahead and use them.
Since this question was asked, custom elements have been added to the WHATWG HTML Living Standard, along with an associated JavaScript API. Some web component frameworks are already implementing some of these specifications. It's no longer taboo like it was back in 2011. (I remember having some unpleasant issues dealing with the DOM in Internet Explorer when trying to use newly-announced HTML5 elements.)
As of writing this (November 2018), custom elements have been implemented into several major browsers. However, the MDN lists Microsoft Edge as having not yet implemented custom elements, although a blog post from 2015 says that the Edge team is "excited to continue to support and contribute to this journey."
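One practical consequence of the standard is that custom element names must contain a hyphen, so a bare <container> would not qualify while something like <my-container> would. The check below is a deliberately conservative approximation of the naming rule, limited to ASCII characters; the function name is made up, and names it rejects (for example ones with non-ASCII characters) may still be allowed by the specification.

```javascript
// Names the HTML standard reserves even though they contain a hyphen.
const RESERVED_NAMES = new Set([
  'annotation-xml', 'color-profile', 'font-face', 'font-face-src',
  'font-face-uri', 'font-face-format', 'font-face-name', 'missing-glyph',
]);

// Hypothetical helper: ASCII-only approximation of the custom element
// naming rule (lower-case start, at least one hyphen, no upper case).
function isLikelyValidCustomElementName(name) {
  return /^[a-z][a-z0-9._-]*-[a-z0-9._-]*$/.test(name) && !RESERVED_NAMES.has(name);
}

// Browser usage (not run here); MyContainer would be a class extending HTMLElement:
// if (isLikelyValidCustomElementName('my-container')) {
//   customElements.define('my-container', MyContainer);
// }
```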