Using EXSLT extensions in JavaScript XPaths - javascript

I would like to use javascript XPaths in a web app using exslt extensions, but I can't figure out how to do this.
Pretend I've got an html doc with some divs in it. I want to run this:
namespaces={'regexp':'http://exslt.org/regular-expressions'};
result = document.evaluate(
    "//div[regexp:test(.,'$')]",
    document,
    function(ns){
        return namespaces.hasOwnProperty(ns) ? namespaces[ns] : null;
    },
    XPathResult.ANY_TYPE,
    null);
Only that results in an invalid XPath expression exception in evaluate. I'm using chrome.
Is there anything else I need to do to make this stuff work? I see on exslt.org that there are implementations for javascript, but how do I make sure those are available? Do I need to insert my javascript into a namespaced script element in the dom or something insane?
UPDATE
If this isn't possible directly using browser dom + javascript and xpath, would it be possible to write XSLT using exslt extensions in the browser to simulate document.evaluate (returning a list of elements that match the xpath)?

I don't think the default browser XPath implementation supports EXSLT. The JavaScript support mentioned on the EXSLT page is likely about how you can provide your own implementation of the EXSLT functions using in-browser JavaScript. Here's one example I was able to find very quickly.
In Firefox, for example, you can have Saxon-B as an extension to run XSLT2.0 and Saxon-B has built-in support for exslt (unlike Saxon-HE), though you will likely be better off just using XSLT/XPath 2.0 features. Here's the regular expression syntax, for example. That said, however, relying on a Mozilla Saxon-B extension isn't something that will help you with Chrome or other browsers for that matter.
With that said I don't think you can find a cross-browser solution to use EXSLT extensions in your XPath. The conformance section of the DOM Level 3 XPath calls for XPath 1.0 support and doesn't mention EXSLT. The INVALID_EXPRESSION_ERR is said to be thrown:
if the expression has a syntax error or otherwise is not a legal expression according to the rules of the specific XPathEvaluator or contains specialized extension functions or variables not supported by this implementation.
Finally, here's an open bugzilla ticket for Firefox to open up EXSLT support for their DOM Level 3 XPath implementation. It seems to be sitting there in NEW status since 2007. The ticket says that:
Currently Mozilla gives an exception "The expression is not a legal expression." even if a namespace resolver correctly resolving the EXSLT prefixes to the corresponding URLs is passed in. Here's the test case.
--
If you don't mind me asking, what exactly did you want to use the regex for? Maybe we can help you get away with a combination of standard XPath string functions?
--
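Since document.evaluate only accepts plain XPath 1.0, one possible workaround is to select the candidate nodes with a regular XPath and apply the regular expression in JavaScript afterwards. The following is only a minimal sketch under that assumption; the helper names selectNodes, filterByRegExp and selectDivsMatching are made up for illustration and are not part of any library.

```javascript
// Browser-only helper: run a plain XPath 1.0 expression and copy the
// matches into an array. A snapshot result is safe to iterate later.
function selectNodes(xpath, contextNode) {
  // ownerDocument is null when contextNode is the document itself.
  var doc = contextNode.ownerDocument || contextNode;
  var snapshot = doc.evaluate(
    xpath, contextNode, null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  var nodes = [];
  for (var i = 0; i < snapshot.snapshotLength; i++) {
    nodes.push(snapshot.snapshotItem(i));
  }
  return nodes;
}

// Pure helper: keep the nodes whose text matches the pattern.
// textContent plays the role of the XPath string value "." of an element.
function filterByRegExp(nodes, pattern) {
  return nodes.filter(function (node) {
    pattern.lastIndex = 0; // keep patterns with the g flag stateless
    return pattern.test(node.textContent);
  });
}

// Hypothetical stand-in for //div[regexp:test(., '...')].
function selectDivsMatching(pattern, contextNode) {
  return filterByRegExp(selectNodes('//div', contextNode), pattern);
}
```

In a page this could be called as selectDivsMatching(/\d+$/, document). Unlike the XSLT approach, the returned nodes are the original nodes of the document, not copies.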
UPDATE You can build an XPath runner via XSLT (like you're asking in the update to your question) but it won't return the nodes from the source document, it will return new nodes that look exactly the same. XSLT produces a new result tree document and I don't think there's a way to let it return references to the original nodes.
As far as I can tell, Mozilla (and Chrome) both support XSLT not only for XML documents loaded from external sources, but also for DOM elements from the document being displayed. The XSLTProcessor documentation mentions how transformToFragment(), for example, will only produce HTML DOM objects if the owner document is itself an HTMLDocument, or if the output method of the stylesheet is HTML.
Here's a simple XPath runner that I built to test out your idea:
1) First you would need an XSLT template to work with.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:regexp="http://exslt.org/regular-expressions"
    extension-element-prefixes="regexp">
    <xsl:template match="/">
        <xsl:copy-of select="."/>
    </xsl:template>
</xsl:stylesheet>
I started building it in JavaScript using the document.implementation.createDocument API but figured it would be easier to just load it. Firefox still supports document.load, while Chrome only lets you load files using XHR. You would need to start Chrome with --allow-file-access-from-files if you want to load files with XHR from your local disk.
2) Once we have the template loaded we would need to modify the value of the select attribute of the xsl:copy-of instruction to run the XPath we need:
function runXPath(xpath) {
    var processor = new XSLTProcessor();
    var xsltns = 'http://www.w3.org/1999/XSL/Transform';
    var xmlhttp = new window.XMLHttpRequest();
    xmlhttp.open("GET", "xpathrunner.xslt", false);
    xmlhttp.send(null);
    var transform = xmlhttp.responseXML.documentElement;
    var copyof = transform.getElementsByTagNameNS(xsltns, 'copy-of')[0];
    copyof.setAttribute('select', xpath);
    processor.importStylesheet(transform);
    var body = document.getElementById('body'); // I gave my <body> an id attribute
    return processor.transformToFragment(body, document);
}
You can now run it with something like:
var nodes = runXPath('//div[@id]');
console.log(nodes.hasChildNodes());
if (nodes.firstChild) {
    console.log(nodes.firstChild.localName);
}
It works great for "regular" XPath like //div[@id] (and correctly finds nothing for //div[@not-there]), but I just can't get it to run the regexp:test extension function. With //div[regexp:test(string(@id), "a")] it doesn't error out, it just returns an empty set.
Mozilla documentation suggests their XSLT processor supports EXSLT. I would imagine they are all using libxml/libxslt behind the scenes anyway. That said, I couldn't get it to work in Mozilla either.
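As an alternative to fetching xpathrunner.xslt over XHR, the same stylesheet could probably be built as a string and parsed with DOMParser, which avoids the local file access restriction in Chrome. This is a sketch only: buildRunnerStylesheet, escapeXmlAttribute and runXPathInline are hypothetical names, and it does not change whether the browser's XSLT processor actually evaluates regexp:test.

```javascript
// Escape a string so it can sit inside a double-quoted XML attribute.
function escapeXmlAttribute(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/"/g, '&quot;');
}

// Build the text of the runner stylesheet with the XPath already in place.
function buildRunnerStylesheet(xpath) {
  return [
    '<xsl:stylesheet version="1.0"',
    '    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"',
    '    xmlns:regexp="http://exslt.org/regular-expressions"',
    '    extension-element-prefixes="regexp">',
    '  <xsl:template match="/">',
    '    <xsl:copy-of select="' + escapeXmlAttribute(xpath) + '"/>',
    '  </xsl:template>',
    '</xsl:stylesheet>'
  ].join('\n');
}

// Browser-only: parse the stylesheet and run it against a source node.
function runXPathInline(xpath, sourceNode) {
  var xslt = new DOMParser().parseFromString(
    buildRunnerStylesheet(xpath), 'application/xml');
  var processor = new XSLTProcessor();
  processor.importStylesheet(xslt);
  return processor.transformToFragment(sourceNode, document);
}
```

Escaping matters here because an XPath such as //div[regexp:test(string(@id), "a")] contains double quotes that would otherwise terminate the select attribute.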
Hope it helps.
Any chance you can get away with jQuery and a regexp? It's not likely to be helpful for your XPath builder utility, but it's still a way to run a regexp against HTML nodes.

Related

Kotlin, how can I read a dynamic website as text?

As titled, I'm trying to read the content of sites like this one, which appears to be javascript based.
I tried using plain jdk lib, then jsoup and then htmlunit, but I couldn't get anything useful out of it (I see just the source code or just the title or null):
val url = URL("https://registry.terraform.io/providers/hashicorp/tls/latest/docs/data-sources/certificate")
val connection = url.openConnection()
val scanner = Scanner(connection.getInputStream())
scanner.useDelimiter("\\Z")
val content = scanner.next()
scanner.close()
println(content)
val doc = Jsoup.connect("https://registry.terraform.io/providers/hashicorp/tls/latest/docs/data-sources/certificate").get()
println(doc.text())
WebClient().use { webClient ->
    val page = webClient.getPage<HtmlPage>("https://registry.terraform.io/providers/hashicorp/tls/latest/docs/data-sources/certificate")
    val pageAsText = page.asNormalizedText()
    println(pageAsText)
}
WebClient(BrowserVersion.FIREFOX).use { webClient ->
    val page = webClient.getPage<HtmlPage>("https://registry.terraform.io/providers/hashicorp/tls/latest/docs/data-sources/certificate")
    println(page.textContent)
}
It should be something easy peasy, but I can't see what's wrong.
In order for this to be possible, you need something to execute the JS that modifies the DOM.
It might be a bit overkill depending on the use case, and probably won't be possible if you're on Android, but one way to do this is to launch a headless browser separately and interact with it from your code. For instance, using Chrome Headless and the Chrome DevTools Protocol. If you're interested, I have written a Kotlin library called chrome-devtools-kotlin to interact with a Chrome browser in a type-safe way.
There might be simpler options, though. For instance maybe you can run an embedded browser instead with JBrowserDriver and still use JSoup to parse the HTML, as mentioned in this other answer.
Regarding HtmlUnit:
The page initially has no content; everything you see is rendered by JavaScript on the client side using one of these SPA frameworks.
It looks like there is some feature check in the beginning that figures out the js support in HtmlUnit does not have all the required features and based on this you only get a hint like "Please enable Javascript to use this application".
You can use
page.asXml()
to have a look at the content through HtmlUnit's eyes.
You can open an HtmlUnit issue on GitHub, but I fear adding support for this will be a longer story.

How to prevent empty namespace generation in Firefox?

What I want to do is to serialize a DOM to XML. So I create a new document
var doc = document.implementation.createDocument ('http://AOR-AppML.org', 'Application', null);
and I add nodes, attributes etc. This is working fine.
The problem is that I have different behaviours with XMLSerializer in Google Chrome and Mozilla Firefox.
Chrome console output:
<Application xmlns="http://AOR-AppML.org" name="SoRiN"><ObjectType name="ObjectTypeName"/><Enumeration name="EnumerationName"/></Application>
Firefox console output (notice the xmlns=""):
<Application xmlns="http://AOR-AppML.org" name="SoRiN"><ObjectType xmlns="" name="ObjectTypeName"/><Enumeration xmlns="" name="EnumerationName"/></Application>
I don't want to generate that empty namespace. I've read that this empty namespace indicates that the corresponding elements have no default namespace (http://www.w3.org/TR/xml-names/#defaulting), but actually I want them to be in the same namespace as Application.
Is there any way to prevent the namespace generation in Firefox?
P.S. - yes, I've followed the advice from this post -> How to prevent the namespace generation?
UPDATE
Here is a fiddle to play with.
You have to use the method createElementNS, instead of createElement, since the latter creates an element with empty namespace URI.
Chrome serializes the document incorrectly (if you parse the string you get a different document, with the wrong namespace URIs), while Firefox does the job right. A bug was actually filed and marked as solved, but the problem still seems to be there.
So, simply replace doc.createElement(yourElementName) with doc.createElementNS('http://AOR-AppML.org', yourElementName).
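To avoid repeating the namespace URI at every call site, the replacement can be wrapped in a small helper that reuses the parent's namespace. This is a minimal sketch; appendChildInParentNamespace is a made-up name, and it assumes every child should live in the same namespace as its parent, as described in the question.

```javascript
// Create a child element in the same namespace as its parent and append it.
// `doc` is the XML document; `attrs` is an optional plain object of attributes.
function appendChildInParentNamespace(doc, parent, name, attrs) {
  var child = doc.createElementNS(parent.namespaceURI, name);
  for (var key in attrs) {
    if (Object.prototype.hasOwnProperty.call(attrs, key)) {
      child.setAttribute(key, attrs[key]);
    }
  }
  parent.appendChild(child);
  return child;
}

// Browser usage, mirroring the document from the question:
// var doc = document.implementation.createDocument(
//     'http://AOR-AppML.org', 'Application', null);
// appendChildInParentNamespace(doc, doc.documentElement,
//     'ObjectType', { name: 'ObjectTypeName' });
// console.log(new XMLSerializer().serializeToString(doc));
```

Because the child then carries the same namespace URI as Application, the serializer has no reason to emit xmlns="" on it.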

Parsing XML in a Web Worker

I have been using a DOMParser object to parse a text string to an XML tree. However it is not available in the context of a Web Worker (and neither is, of course, document.ELEMENT_NODE or the various other constants that would be needed). Is there any other way to do that?
Please note that I do not want to manipulate the DOM of the current page. The XML file won't contain HTML elements or anything of the sort. In fact, I do not want to touch the document object at all. I simply want to provide a text string like the following:
<car color="blue"><driver/></car>
...and get back a suitable tree structure and a way to traverse it. I also do not care about schema validation or anything fancy. I know about XML for <SCRIPT>, which many may find useful (hence I'm linking to it here), however its licensing is not really suitable for me. I'm not sure if jQuery includes an XML parser (I'm fairly new to this stuff), but even if it does (and it is usable inside a Worker), I would not include an extra ~50K lines of code just for this function.
I suppose I could write a simple XML parser in JavaScript, I'm just wondering if I'm missing a quicker option.
According to the spec:
The DOM APIs (Node objects, Document objects, etc) are not available to workers in this version of this specification.
I guess that's why DOMParser is not available, but I don't really understand why that decision was made. (Fetching and processing an XML document in a Web Worker does not seem unreasonable.)
But you can import other available tools, such as "Cross Platform XML Parsing in JavaScript".
At this point I'd like to share my parser: https://github.com/tobiasnickel/tXml
With its tXml() method you can parse a string into an object, and it takes only 0.5kb minified + gzipped.
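For input as simple as the car example in the question, a hand-written parser that never touches the DOM is also feasible inside a worker. The sketch below is not tXml and not a conforming XML parser: parseXml and the shape of its result are assumptions made for illustration, and it ignores comments, CDATA, DOCTYPE, entities, namespaces, and attribute values containing ">".

```javascript
// Minimal, non-validating XML parser sketch. Handles elements, attributes,
// self-closing tags and text. Each element becomes a plain object
// { name, attributes, children }; text nodes become strings.
function parseXml(source) {
  var tagPattern = /<(\/?)([^\s\/>]+)([^>]*?)(\/?)>/g;
  var attrPattern = /([^\s=]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
  var root = { name: '#root', attributes: {}, children: [] };
  var stack = [root];
  var lastIndex = 0;
  var match;

  // Append non-blank text to the element currently being built.
  function addText(text) {
    if (text.trim() !== '') {
      stack[stack.length - 1].children.push(text);
    }
  }

  while ((match = tagPattern.exec(source)) !== null) {
    addText(source.slice(lastIndex, match.index));
    lastIndex = tagPattern.lastIndex;

    if (match[2].charAt(0) === '?') {
      continue; // skip the XML declaration and processing instructions
    }
    if (match[1] === '/') {
      // Closing tag: it must match the innermost open element.
      var open = stack.pop();
      if (stack.length === 0 || open.name !== match[2]) {
        throw new Error('Unexpected closing tag </' + match[2] + '>');
      }
      continue;
    }

    var node = { name: match[2], attributes: {}, children: [] };
    var attr;
    attrPattern.lastIndex = 0;
    while ((attr = attrPattern.exec(match[3])) !== null) {
      node.attributes[attr[1]] = attr[2] !== undefined ? attr[2] : attr[3];
    }
    stack[stack.length - 1].children.push(node);
    if (match[4] !== '/') {
      stack.push(node); // not self-closing, so children may follow
    }
  }

  addText(source.slice(lastIndex));
  if (stack.length !== 1) {
    throw new Error('Unclosed element <' + stack[stack.length - 1].name + '>');
  }
  return root.children[0];
}
```

Calling parseXml('<car color="blue"><driver/></car>') yields an object whose name is "car", whose attributes.color is "blue", and whose only child is an element named "driver". The tree is made of plain objects, so it can also be passed back to the page with postMessage.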

A library in JavaScript for XML parsing with namespaces?

I have a handful of code that uses the DOM to parse and traverse some XML data. It works fine on Gecko and WebKit but, of course, IE absolutely chokes on it. Is there a library for an XML DOM that supports:
getAttributeNS
localName
namespaceURI
Support for IE7 is about as far back as I need to go.
You can use jQuery to safely and easily parse XML in Internet Explorer. The tutorial "Easy XML Consumption using jQuery" gives more in-depth information on how you can do it.
Not sure if you want to go this route, but this can be done with MSXML using their nonstandard way of doing things. MSXML 3.0 comes with IE 6 and later.
I haven't actually done this ;-) but this might be what you need:
IXMLDOMNamedNodeMap.getQualifiedItem looks like getAttributeNS
http://msdn.microsoft.com/en-us/library/ms757075.aspx
IXMLDOMNode has a namespaceURI property.
http://msdn.microsoft.com/en-us/library/ms763813.aspx
IXMLDOMNode.baseName looks like localName
http://msdn.microsoft.com/en-us/library/ms767570.aspx
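The mapping above could be hidden behind two small wrappers that prefer the standard DOM members and fall back to the MSXML ones. Like the answer itself, this sketch has not been tried against a real MSXML document; the function names getLocalName and getAttributeValueNS are made up, and the feature checks are an assumption about how MSXML objects behave in IE.

```javascript
// localName in standards-based DOMs, baseName in MSXML.
function getLocalName(node) {
  return node.localName !== undefined ? node.localName : node.baseName;
}

// Namespace-aware attribute lookup.
// Falls back to IXMLDOMNamedNodeMap.getQualifiedItem(baseName, namespaceURI).
function getAttributeValueNS(element, namespaceURI, localName) {
  if (typeof element.getAttributeNS !== 'undefined') {
    return element.getAttributeNS(namespaceURI, localName);
  }
  var attr = element.attributes.getQualifiedItem(localName, namespaceURI || '');
  return attr ? attr.value : null;
}

// namespaceURI needs no wrapper: the property has the same name in both APIs.
```

Code that traverses the tree would then call getLocalName(node) and getAttributeValueNS(el, ns, name) instead of touching localName and getAttributeNS directly.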

E4X browser support

I'm trying to figure this out, but there's not much information. Which browsers support E4X, and why isn't it more widely adopted?
Which browsers support E4X
Firefox and others based on the Mozilla codebase.
why isn't it more widely adopted?
Because it offers little practical functionality not already covered by existing standards such as DOM.
OK, it's simpler to use than DOM, but as the price for that you don't get access to all the features of XML, and the thoroughly idiotic, needless XML literal/template syntax is a security disaster, making it so authors of even completely static htaccess-protected documents have to worry about working around the feature.
As a simpler method for accessing the results of an XMLHttpRequest, JSON totally won. For full-on XML processing, you still need DOM. For easier document handling, there are selectors, XPath and JS libraries that can do it without having to introduce weird new language syntax.
That doesn't leave much of a niche for E4X. TBH I wish it would die. (ETA: it has now pretty much done so.)
Firefox dropped E4X support in version 16:
E4X is deprecated. It will be disabled by default for content in Firefox 16, disabled by default for chrome in Firefox 17, and removed in Firefox 18. Use DOMParser/DOMSerializer or a non-native JXON algorithm instead.
According to w3schools, "Firefox is currently the only browser with relatively good support for E4X."
You could try XPath instead. Although XPath isn't cross-browser there are several Javascript solutions for it like this jQuery plugin.
EDIT
You could actually use jQuery without a plugin for this:
$('<xml><some><code>code</code><tag>text</tag></some></xml>').find('some > code').text()
I have developed a babel plugin that adds E4X basic support to all browsers via babel compilation.
https://www.npmjs.com/package/babel-plugin-transform-simple-e4x
You can also use the npm simple4x library to parse xml strings to a XML like object.
https://www.npmjs.com/package/simple4x
The plugin transpiles the following E4X:
var fooId = 'foo-id';
var barText = 'bar text';
var xml =
<xml>
<foo id={fooId}>
{barText}
</foo>
</xml>;
To the following JavaScript:
var XML = new require("simple4x");
var fooId = 'foo-id';
var barText = 'bar text';
var xml = new XML("<xml><foo id=\"" + fooId + "\">" + barText + "</foo></xml>");
