I'm running a Django site with XML text stored in TextField properties. It's stored as XML rather than plain text because it's heavily annotated with information about the underlying manuscript, such as abbreviations and symbols. Here's an example:
class Entry(models.Model):
# Name and description.
chapter = models.ForeignKey(Chapter)
latin_text = models.TextField()
Here's an example of the content of latin_text:
<initial type="2">I</initial>n <place type="0"><span>Ricmond</span></place>
ten<abbr type="1">et</abbr> aeccl<abbr type="0">esi</abbr>a
de Cietriz .ii. hid<abbr type="0">as</abbr>.
I'd now like to start displaying that XML text on my HTML pages.
I know I can display raw XML by dropping it into a textarea: the issue is that I'd like to display it in a more beautiful way, with:
styling (all the abbr elements in italics, the place element in bold)
JavaScript events to let the user explore the abbreviations (when the user mouses over an abbr or place, show a pop-up explanation)
I'm not sure if XSLT can do what I need, or even if it can be used alongside HTML. So my question is:
Should I transform the XML into HTML before adding it to the Django database?
Or can I do all the rendering I need on the fly with XSLT or JavaScript?
I would use XSLT to transform it and then attach events programmatically with JavaScript. It is data, and converting it beforehand prevents you from interpreting it in a different way later. HTML should be layout only and separate from the data. You could translate it with JavaScript, but that would be intensive on the client. XSLT and CSS will be cleaner, and attaching events in JS is lightweight.
Not familiar with Django, but maybe check this answer out: Can I use XSLT in Django?
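For the event part, here is a minimal sketch, assuming your XSLT turns each abbr into something like <em class="abbr" data-note="..."> and each place into <strong class="place" data-note="..."> (those class and attribute names are just assumptions for illustration; the italics and bold can then be plain CSS rules on .abbr and .place):
// Attach popup behaviour once the transformed HTML is in the page.
document.querySelectorAll(".abbr, .place").forEach(el => {
  el.addEventListener("mouseover", () => {
    const popup = document.createElement("span");
    popup.className = "popup";
    popup.textContent = el.dataset.note || "No explanation available";
    el.appendChild(popup);
  });
  el.addEventListener("mouseout", () => {
    const popup = el.querySelector(".popup");
    if (popup) popup.remove();
  });
});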
Yes, XSLT will do everything you want, and it is heavily used by people who store XML just like you do. It is why and how XHTML came to be: for those who stored data and text in XML but wanted to present it to web browsers.
Once you have it in (X)HTML form, adding and using JavaScript is no different.
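If you end up transforming in the browser rather than before storing, a rough sketch using the built-in XSLTProcessor could look like this (the URLs and the #latin-text container are placeholders, and the stored fragment would need to be wrapped in a single root element so it parses as XML):
// Fetch the stored XML and an XSL stylesheet, transform client-side,
// and insert the resulting fragment into the page.
async function renderLatinText(xmlUrl, xslUrl) {
  const parser = new DOMParser();
  const [xmlText, xslText] = await Promise.all([
    fetch(xmlUrl).then(r => r.text()),
    fetch(xslUrl).then(r => r.text()),
  ]);
  const xmlDoc = parser.parseFromString(xmlText, "application/xml");
  const xslDoc = parser.parseFromString(xslText, "application/xml");

  const processor = new XSLTProcessor();
  processor.importStylesheet(xslDoc);
  const fragment = processor.transformToFragment(xmlDoc, document);
  document.querySelector("#latin-text").replaceChildren(fragment);
}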
I designed my own (simple) framework a while back to take some HTML templates for reports. It allowed the user to complete the template, and then, instead of submitting the data (optional), I pre-processed the template by converting the form elements into spans and divs, etc. Not exactly standard practice, but it worked pretty well for what we were using it for.
I'd like to be able to do something similar with CKEditor, i.e. basically paste the raw HTML of the template into CKEditor (programmatically, from our DB of templates), then have CKEditor let the user fill in the blanks in the template, after which we would probably have to process the result somehow before submitting the data to the back-end.
I sort of tried that just to see what happens and I get something like what is shown below:
It would probably be easier to just edit the templates by hand and convert them to something that displays well in CKEditor, but there are a couple of hundred of them. I'm not sure if CKEditor supports form elements natively, or if there is a plug-in, and then how to convert what we have.
It probably isn't terrible if the form elements are kept in place, as opposed to the kind of conversion I did with the custom method, as long as the result looks professional and readable.
Is it possible to get all the inner text from tags in an HTML document except the text from anchor tags <a> (not even the text of <a> anchors nested inside other elements) with the document.querySelectorAll method?
My program has an input field that allows users to insert some selector to get the text for certain tags in a given site page.
So, if I want to insert a selector that gets text from all nodes except <a> tags, how can I accomplish that?
I mean that *:not(a) does not work, because it selects elements that may have <a> descendants, and the :not() selector does not accept complex selectors, so *:not(* a) does not work either.
I know I could delete those nodes from document first, but is it possible to accomplish this task only selecting those nodes I want with the document.querySelectorAll method?
Example:
<html>
<... lots of other tags with text inside>
<div>
<p> one paragraph </p>
<a> one link </a>
</div>
</...>
</html>
I want all the text in the html except "one link"
edit:
If you do document.querySelectorAll('*:not(a)'), you still select the div, which has an a element inside it. So the innerText of that div contains the text from the a element.
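To illustrate with the example markup above (just a quick demonstration of the problem):
// '*:not(a)' still matches the <div>, and the div's innerText
// includes the text of its <a> child.
document.querySelectorAll('*:not(a)').forEach(el => {
  console.log(el.tagName, el.innerText);
});
// Among other elements, the DIV is logged with both "one paragraph"
// and "one link" in its innerText, so the anchor text is not excluded.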
Thank you
Your question is how to allow users to extract information from arbitrary hypertext [documents]. This means that solving the problem of "which elements to scrape" is just part of it. The other part is "how to transform the set of elements to scrape into a data set that the user ultimately is interested in".
Meaning that CSS selectors alone won't do. You need data transformation, which will deal with the set of elements as input and yield the data set of interest as output. In your question, this is illustrated by the case of just wanting the text content of some elements, or entire document, but as if the a elements were not there. That is your transformation procedure in this particular case.
You do state, however, that you want to allow users to specify what they want to scrape. This translates to your transformation procedure having other variables and possibly being general with respect to the kind of transformations it can do.
With this in mind, I would suggest you look into technologies like XSLT. XSLT, for one, is designed for these things -- transforming data.
Depending on how computer literate you expect your users to be, you might need to encapsulate the raw power and complexity of XSLT, giving users a simple UI which translates their queries to XSLT and then feeds the resulting XSL stylesheets to an XSLT processor, for example. In any case, XSLT itself will be able to carry a lot of load. You also won't need both XSLT and CSS selectors -- the former uses XPath which you can utilize and even expose to users.
Let's consider the following short example of an HTML document you want scraped:
<html>
<body>
<p>I think the document you are looking for is at <a>example.com</a>.</p>
</body>
</html>
If you want all text extracted but not a elements, the following XSL stylesheet will configure an XSLT processor to yield exactly that:
<?xml version="1.0" encoding="utf-8" ?>
<stylesheet version="1.0" xmlns="http://www.w3.org/1999/XSL/Transform">
<output method="text" />
<template match="a" /><!-- empty template element, meaning that the transformation result for every 'a' element is empty text -->
</stylesheet>
The result of transforming the HTML document with the above XSL stylesheet document is the following text:
I think the document you are looking for is at .
Note how the a element is "stripped" leaving an empty space between "at" and the sentence punctuation ("."). The template element, being empty, configures the XSLT processor to not produce any text when transforming a elements ("a" is a valid, if very simple, XPath expression, by the way -- it selects all a elements). This is all part of XSLT, of course.
I have tested this with Free Online XSL Transformer, which uses the very potent Saxon library.
Of course, you can cover one particular use case -- yours -- with JavaScript, without XSLT. But how are you going to let your users express what they want scraped? You will probably need to invent some [simple] language -- which might as well be [the already invented] XSLT.
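For completeness, here is what the plain-JavaScript version of this one use case might look like on the server (jsdom is used here purely as an example of a DOM implementation for Node; it is an assumption, not something your setup requires):
// Drop the <a> elements from a detached DOM, then read the remaining text.
const { JSDOM } = require("jsdom");

function textWithoutAnchors(html) {
  const doc = new JSDOM(html).window.document;
  doc.querySelectorAll("a").forEach(a => a.remove());
  return doc.body.textContent;
}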
XSLT isn't readily available out of the box across different user agents and JavaScript runtimes. Native XSLT 1.0 implementations are indeed provided by both Firefox and Chrome (through the XSLTProcessor class), but they are not specified by any standards body and so may be missing in your particular runtime environment. You may be able to find a suitable JavaScript implementation, but in any case you can invoke the scraper on the server side.
Encapsulating the XSLT language behind some simpler query language and user interface is something you will need to decide on. If you're going to give your users the kind of possibilities you say you want them to have, they need to express their queries somehow, whether through a WYSIWYG form or with text.
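As a rough illustration of that encapsulation idea (the details here are assumptions, not a full design), the user's "exclude these elements" input could be turned into a stylesheet string and run through the browser's XSLTProcessor:
// Generate a stylesheet that strips the elements matched by a
// user-supplied XPath expression and emits the remaining text.
function extractText(xmlDoc, excludeXPath) {
  const xsl = `<?xml version="1.0" encoding="utf-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>
      <xsl:template match="${excludeXPath}"/>
    </xsl:stylesheet>`;
  const xslDoc = new DOMParser().parseFromString(xsl, "application/xml");
  const processor = new XSLTProcessor();
  processor.importStylesheet(xslDoc);
  return processor.transformToFragment(xmlDoc, document).textContent;
}

// e.g. extractText(parsedDocument, "a") drops all anchor text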
Clone the top node, remove the a elements from the clone, then get the text.
const bodyClone = document.body.cloneNode(true);
bodyClone.querySelectorAll("a").forEach(e => e.remove());
const { textContent } = bodyClone;
You can use
document.querySelectorAll('*:not(a)')
Hope it will work.
There are many ways to parse and traverse HTML4 files using many technologies, but I cannot find a suitable one for saving that DOM back to a file.
I want to be able to load an HTML file into a DOM, change one small thing (e.g. an attribute's value), save the DOM to a file again, and, when diffing the source file and the created file, have them be completely identical except for that small change.
This kind of task is absolutely no problem when working with XML and suitable XML libraries, but when it comes to HTML there are several issues: whitespace such as indentation or line breaks gets lost or inserted, self-closing start tags (such as <link...>) emerge as <link.../>, and/or the content of CDATA sections (e.g. between <script> and </script>) is wrapped into <![CDATA[ ]]>. These things are critical in my case.
Which way can I go to load, traverse, manipulate and save HTML without the drawbacks described above, most importantly without whitespace text nodes being altered?
If you want to get really serious, leave out the GUI and go headless; see the SO example with Phantom.
I am going with the HTML Agility Pack.
Loading and saving does not change anything other than the invalid parts.
I have an XML file and an XSLT file that transforms the XML into a table on an HTML page.
I need to be able to update this table based on what the user selects from a drop down. Two options:
send new parameters to the XSLT processor, re-transform, clear the old HTML content, and insert the new HTML content; do this every time the drop-down changes value
use a JavaScript function to navigate the HTML directly and hide/unhide table data cells.
What would be better performance-wise?
EDIT: basically trying to apply filters
The second option. There's a difference between handing the browser new serialized HTML and modifying the existing DOM. If you clear the content and give the browser a new HTML string to replace it with, it will have to parse that HTML into the DOM all over again. If you use JavaScript to modify parts of the DOM, then not only will you skip that step, but you'll also be taking advantage of optimisations in the rendering engine that restrict re-layouts to the affected elements in the DOM, rather than the entire document.
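A minimal sketch of that second option, assuming a #filter select, a #results table and a data-category attribute on each row (all of those names are made up for illustration):
// Toggle existing rows instead of re-transforming and re-parsing.
document.querySelector("#filter").addEventListener("change", event => {
  const selected = event.target.value;
  document.querySelectorAll("#results tbody tr").forEach(row => {
    const show = selected === "all" || row.dataset.category === selected;
    row.style.display = show ? "" : "none";
  });
});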
There is different content on the site that users are allowed to create and edit: news, articles, etc.
How do I transfer the data from the editor to the database correctly and safely?
I'd like to use a WYSIWYG editor, because the potential users of this editor will not be very experienced (Markdown and BB code will be difficult for them; they want something like MS Word =) ).
I'd also like to add restrictions to this editor, for example: no images, only 5 colors, only 3 types of fonts, etc. (This can be done by limiting the editor's controls.)
My question: how do I make this editor safer? How do I prevent users from adding extra HTML or <script> tags? Do I have to run an HTML filter over the data coming from the database (the saved content that users wrote in the editor) while rendering the template page for this content (news or article)?
Should I store the content as HTML in the database? (If I want a WYSIWYG editor, it outputs HTML on saving.) Or maybe I should convert the HTML from the editor to BB code or Markdown (with all my limitations and restrictions), clearing away all extra HTML, and then, when getting the content from the database, convert the BB code/Markdown back to HTML.
Or maybe there are easier and faster ways of making this safe?
If you are putting the text into the innerHTML of, let's say, a div, it allows a user to write HTML and have it displayed as HTML later. However, if you don't want to let people inject HTML, you can use innerText instead. innerText works just like innerHTML but does not go through the HTML parser.
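A quick illustration of that difference (the sample string is made up):
// The same user-supplied string treated as markup vs. as plain text.
const userInput = '<script>alert("xss")</script><b>hello</b>';

const unsafe = document.createElement("div");
unsafe.innerHTML = userInput; // parsed as HTML; the tags become elements

const safe = document.createElement("div");
safe.innerText = userInput;   // treated as plain text; the tags show up literally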
If you plan on using BB code or Markdown, you would parse the text for the code that needs to be converted and leave the rest as plain text.
You could also use a regex parser to convert special characters to their HTML entity equivalents, and then convert the BB code or Markdown to HTML.
Try this:
When saving to the database:
Replace known, well-formed HTML with BB code, e.g. replacing <b> with [b]. Ill-formed HTML will remain as typed: <b > will stay <b >. Then do a regex replace on all HTML special characters (i.e. < and >).
Then, when retrieving from the database, replace the BB code with HTML and you are all set.
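A rough sketch of that round trip (only bold is handled here; the whitelist of tags, and escaping & in addition to < and >, are my own assumptions for illustration):
// On save: well-formed <b>...</b> becomes [b]...[/b]; everything else
// is escaped so leftover HTML is stored as harmless text.
function toStorage(html) {
  return html
    .replace(/<b>/g, "[b]").replace(/<\/b>/g, "[/b]")
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// On retrieve: turn the BB code back into HTML for display.
function toDisplay(stored) {
  return stored
    .replace(/\[b\]/g, "<b>")
    .replace(/\[\/b\]/g, "</b>");
}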