XPath conditional text node selection


I need to construct an xpath string to select all descendants of a certain table with these conditions:
The table is a descendant of a form with a specific action attribute value.
The selected descendants are text nodes.
The text node content can only contain whitespace.
It'll probably look something like:
//form[@action = "submit.html"]//table//text()[ ...? ]
Any tips would be appreciated. Thanks.
Edit: Here is my previous working compromise:
function KillTextNodes(rootpath)
{
XPathIterate(rootpath + '//text()', function(node)
{
var tagname = node.parentNode.tagName;
if (tagname != 'OPTION' && tagname != 'TH')
Kill(node);
});
}
Here is my function based on the accepted answer:
function KillTextNodes(rootpath)
{
XPathIterate(rootpath + '//text()[not(normalize-space())]', function(node) { Kill(node); });
}
To explain my motivation a little - I'm iterating through the DOM with Javascript, and run into the same problem that many others do where unexpected empty text nodes throw off the results. This function helps me out a lot by simply deleting all of the empty text nodes so that my iteration logic can stay simple.

Hi there. I need to construct an xpath string to select all descendants of a certain table with these conditions:
• The table is a descendant of a form with a specific action attribute value.
• The selected descendants are text nodes.
• The text node content can only contain whitespace.
Use:
//form[@action = "submit.html"]//table//text()[not(normalize-space())]
This selects all text nodes that contain only whitespace and that are descendants of any table that is a descendant of any form having an action attribute with the value "submit.html".
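As a rough sketch of how that expression could be used from JavaScript: the helper names selectNodes and killBlankTextNodes below are hypothetical (the question's own XPathIterate and Kill helpers are not shown), and doc is assumed to be a DOM Document that supports document.evaluate(). A snapshot result is requested because removing nodes while walking a live iterator result invalidates the iterator.

```javascript
// Hypothetical helper: returns the nodes matched by an XPath expression as an array.
// `doc` is assumed to be a DOM Document that supports document.evaluate().
function selectNodes(doc, xpath) {
  // Same numeric value as XPathResult.ORDERED_NODE_SNAPSHOT_TYPE in browsers.
  const ORDERED_NODE_SNAPSHOT_TYPE = 7;
  const result = doc.evaluate(xpath, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
  const nodes = [];
  for (let i = 0; i < result.snapshotLength; i++) {
    nodes.push(result.snapshotItem(i));
  }
  return nodes;
}

// Hypothetical helper: removes every whitespace-only text node found below
// the elements matched by rootPath and returns how many nodes were matched.
function killBlankTextNodes(doc, rootPath) {
  const nodes = selectNodes(doc, rootPath + '//text()[not(normalize-space())]');
  for (const node of nodes) {
    if (node.parentNode) {
      node.parentNode.removeChild(node);
    }
  }
  return nodes.length;
}
```

In a browser this might be called as killBlankTextNodes(document, '//form[@action = "submit.html"]//table').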

Text nodes containing whitespace only will be stripped from the document representation - i.e. there won't actually be a node. That means you can't access the text itself, but what you can do is match a parent lacking a text node using not() - something like:
//form[@action = "submit.html"]//table//*[not(text())]
Though in your case I would guess that will be far more aggressive than you actually intend. As an aside, be careful with these // matches, they're not very efficient and again very aggressive.
(I've just noticed this isn't an XSLT question! If you're in JS land have you considered using DOM methods to get your list?)

Related

How to get text nodes and tagged nodes out of contents of a Div element in javascript? [duplicate]

I'm having a problem with a text node that I can't convert to a string.
I'm trying to scrape a site and get certain information out of it, and when I use an XPath to find the text I'm after, I get a text node back.
When I look in Chrome's developer tools, I can see that the text node itself contains the text I'm after, but how do I convert the text node to plain text?
Here is the line of code I use:
abstracts = ZU.xpath(doc, '//*[@id="abstract"]/div/div/par/text()');
I have tried things like .innerHTML, toString and textContent, but nothing has worked so far.

I usually use Text.wholeText when I want the content string of a text node. A text node is an object, not a string, so toString() only gives a generic object description, and innerHTML does not exist on text nodes (it is defined on elements).
Example: from https://developer.mozilla.org/en-US/docs/Web/API/Text/wholeText
The Text.wholeText read-only property returns the full text of all Text nodes logically adjacent to the node. The text is concatenated in document order. This allows to specify any text node and obtain all adjacent text as a single string.
Syntax
str = textnode.wholeText;
Notes and example:
Suppose you have the following simple paragraph within your webpage (with some whitespace added to aid formatting throughout the code samples here), whose DOM node is stored in the variable para:
<p>Thru-hiking is great! <strong>No insipid election coverage!</strong>
However, <a href="http://en.wikipedia.org/wiki/Absentee_ballot">casting a
ballot</a> is tricky.</p>
You decide you don’t like the middle sentence, so you remove it:
para.removeChild(para.childNodes[1]);
Later, you decide to rephrase things to, “Thru-hiking is great, but casting a ballot is tricky.” while preserving the hyperlink. So you try this:
para.firstChild.data = "Thru-hiking is great, but ";
All set, right? Wrong! What happened was you removed the strong element, but the removed sentence’s element separated two text nodes. One for the first sentence, and one for the first word of the last. Instead, you now effectively have this:
<p>Thru-hiking is great, but However, <a
href="http://en.wikipedia.org/wiki/Absentee_ballot">casting a
ballot</a> is tricky.</p>
You’d really prefer to treat all those adjacent text nodes as a single one. That’s where wholeText comes in: if you have multiple adjacent text nodes, you can access the contents of all of them using wholeText. Let’s pretend you never made that last mistake. In that case, we have:
assert(para.firstChild.wholeText == "Thru-hiking is great! However, ");
wholeText is just a property of text nodes that returns the string of data making up all the adjacent (i.e. not separated by an element boundary) text nodes combined.
Now let’s return to our original problem. What we want is to be able to replace the whole text with new text. That’s where replaceWholeText() comes in:
para.firstChild.replaceWholeText("Thru-hiking is great, but ");
We’re removing every adjacent text node (all the ones that constituted the whole text) but the one on which replaceWholeText() is called, and we’re changing the remaining one to the new text. What we have now is this:
<p>Thru-hiking is great, but <a
href="http://en.wikipedia.org/wiki/Absentee_ballot">casting a
ballot</a> is tricky.</p>
Some uses of the whole-text functionality may be better served by using Node.textContent, or the longstanding Element.innerHTML; that’s fine and probably clearer in most circumstances. If you have to work with mixed content within an element, as seen here, wholeText and replaceWholeText() may be useful.
More info: https://developer.mozilla.org/en-US/docs/Web/API/Text/wholeText
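A minimal sketch of turning XPath results into plain strings. It assumes the XPath call returns an array-like list of Text nodes rather than a single node, which would explain why reading a property directly on the result yields nothing; that behaviour of ZU.xpath is an assumption here, and textNodesToStrings is a hypothetical name.

```javascript
// Hypothetical helper: converts an array-like list of DOM Text nodes into
// trimmed strings, dropping entries that contain only whitespace.
// Text nodes expose their content through nodeValue (and data).
function textNodesToStrings(nodes) {
  return Array.from(nodes, (node) => node.nodeValue.trim())
    .filter((text) => text.length > 0);
}
```

Under that assumption, the question's result could be converted with textNodesToStrings(abstracts), or joined into one string with .join(' ').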

JavaScript HTML element to DOM selector

I have a function that requires a HTML element to perform the action on. I request the DOM selector as a parameter
function(document.body);
where element is the DOM query, but somewhere else in the function I need the query as a string. Is it possible to turn the object back into its original string form? And if so, how?

You can do it the other way around. Pass your function a selector string:
functionName('body')
And then retrieve the relevant DOM element using .querySelector():
var el = document.querySelector(string);

The approach you are using is not going to produce the right results consistently in my opinion.
Passing in the element is going to mean that the element must be found beforehand and also is going to practically prevent finding other elements of similar location due to a lack of information.
Passing in a selector is an option, and then using that going forward if you need to find a similar set of elements but sometimes the selector is too complex to be placed in one string - for example if it needs filtering or some other metric.
Your best bet is to accept a callback function that returns the desired element or set of elements when you are dealing with complex selectors or locations for elements. It can simply return the same element each time if it is basic, or if it is more complicated the callback can access the DOM, filter based on some metric, and then return the subset which at times is ideal.
A callback function will provide the full amount of support without needing to always have a conversion to query string which is not always possible for complex structures.
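One way the callback idea could look, as a sketch only: resolveElements is a hypothetical name, doc is assumed to be a Document (or anything with querySelectorAll), and a callback is assumed to return an array or array-like collection of elements.

```javascript
// Hypothetical helper: accepts a selector string, an element, or a callback
// and always returns an array of elements.
function resolveElements(target, doc) {
  if (typeof target === 'function') {
    // The callback does its own lookup and filtering, then returns a collection.
    return Array.from(target(doc));
  }
  if (typeof target === 'string') {
    // A plain selector string is resolved with querySelectorAll.
    return Array.from(doc.querySelectorAll(target));
  }
  // Anything else is treated as a single element that was already found.
  return target ? [target] : [];
}
```

A caller with a simple case passes 'body'; a caller with a complex case passes a function such as (d) => Array.from(d.querySelectorAll('li')).filter((li) => li.children.length === 0), so the lookup can be repeated later without a string conversion.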

If I understand you correctly, you could do this:
function (elem) {
if (typeof elem === 'string' || elem instanceof String) {
elem = document.querySelector(elem);
}
var target = elem.querySelector(...);
}

To get the HTML string of an element, use 'innerHTML' on a temporary parent.
Temporary parent:
var div = document.createElement("div");
Append a copy of the desired element to the parent (appending the element itself would move it out of the document):
div.appendChild(document.body.cloneNode(true));
Test result:
alert(div.innerHTML);
If you are trying to get the string of the element returned by 'querySelector()':
div.appendChild(document.body.querySelector('.yourClass').cloneNode(true));
The temporary parent ensures that you get the tags for the element itself. Otherwise you would only get the string for its child elements. If you want to use 'querySelectorAll()', just loop over the returned NodeList.

How to change all text with javascript?

I want to build a Chrome app that finds all the strings that look like a telephone number and replaces them with a link. I want this to happen only for text nodes, so it doesn't break the JavaScript of the websites that the app runs on.
This is what I have so far:
var regex = /((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}(?!([^<]*>)|(((?!<a).)*<\/a>))/g;
var text = $("body:first").html();
text = text.replace(regex, "$&");
$("body:first").html(text);
but it breaks if there are javascript

Yes, your code just retrieves the markup representation of the current state of the DOM, and overwrites that, losing all event bindings and a lot of other significant data.
What you'll need to do is to iterate through all the text nodes. You can't reach text nodes by the sizzle selector alone, so you'll need to rely on jQuery's contents() function.
You could do something like this to get all the text nodes:
var allTextNodes = $('*').contents().filter(function() {
return this.nodeType == Node.TEXT_NODE;
});
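Once the text nodes are available, each one can be rewritten individually instead of replacing the body's markup. The sketch below uses plain DOM calls rather than jQuery; the simplified pattern, the names splitByPhone and linkifyTextNode, and the tel: link format are all assumptions for illustration, since the question does not show the intended replacement.

```javascript
// Simplified phone pattern (assumption): optional "(123) " or "123-" prefix,
// followed by "123-4567".
const PHONE = /(?:\(\d{3}\) ?|\d{3}-)?\d{3}-\d{4}/g;

// Splits a string into plain-text and phone-number segments, in order.
function splitByPhone(text) {
  const parts = [];
  let last = 0;
  for (const match of text.matchAll(PHONE)) {
    if (match.index > last) {
      parts.push({ type: 'text', value: text.slice(last, match.index) });
    }
    parts.push({ type: 'phone', value: match[0] });
    last = match.index + match[0].length;
  }
  if (last < text.length) {
    parts.push({ type: 'text', value: text.slice(last) });
  }
  return parts;
}

// Browser-only: replaces one text node with text nodes and <a> elements.
// Only this node is touched, so event handlers elsewhere are left alone.
function linkifyTextNode(textNode) {
  const parts = splitByPhone(textNode.nodeValue);
  if (!parts.some((part) => part.type === 'phone')) return;
  const doc = textNode.ownerDocument;
  const fragment = doc.createDocumentFragment();
  for (const part of parts) {
    if (part.type === 'phone') {
      const link = doc.createElement('a');
      link.href = 'tel:' + part.value.replace(/\D/g, ''); // assumed link format
      link.textContent = part.value;
      fragment.appendChild(link);
    } else {
      fragment.appendChild(doc.createTextNode(part.value));
    }
  }
  textNode.parentNode.replaceChild(fragment, textNode);
}
```

With the jQuery collection above, this could be applied as allTextNodes.each(function () { linkifyTextNode(this); }). Text nodes inside script, style, or existing a elements would probably need to be skipped first.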

Javascript/XML - Getting the node name

I need to get the name of the tag "myChild" and its "content".
This is simple, but I am stuck and sleepy, and here is what I get with my tests:
XML:
...
<myParent>
<myChild>content</myChild>
</myParent>
<myParent>
<myChild>content</myChild>
</myParent>
...
JS:
var x=xmlDoc.getElementsByTagName("myParent");
alert(x[1].childNodes[0].nodeName); //returns "#text" - "myChild" needed
alert(x[1].childNodes[0].nodeValue); //returns "" - "content" needed

You want tagName, which is the name of the element. (Sorry about that, for Elements, tagName and nodeName are the same.)
The problem is that the first child of your myParent element isn't the myChild element, it's a text node (containing whitespace). Your structure looks like this:
Element "myParent"
Text node with a carriage return and some spaces or tabs
Element "myChild"
Text node with "content"
Text node with a carriage return and some spaces or tabs
Element "myParent"
Text node with a carriage return and some spaces or tabs
Element "myChild"
Text node with "content"
Text node with a carriage return and some spaces or tabs
You need to navigate down to the actual myChild element, which you can do with getElementsByTagName again, or just by scanning:
var x=xmlDoc.getElementsByTagName("myParent");
var c = x[1].firstChild;
while (c && c.nodeType != 1) { // 1 = ELEMENT_NODE
c = c.nextSibling;
}
alert(c.nodeName); // "myChild"
Note that Elements don't have a meaningful nodeValue property; instead, you collect their child text nodes. (More in the DOM specs: DOM2, DOM3.)
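To also get the "content", the child text nodes can be collected in the same scanning style. This is a sketch with hypothetical helper names (firstElementChildOf, textOf); it relies only on firstChild, nextSibling, nodeType and nodeValue.

```javascript
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Returns the first child that is an element, skipping whitespace text nodes.
function firstElementChildOf(parent) {
  let c = parent.firstChild;
  while (c && c.nodeType !== ELEMENT_NODE) {
    c = c.nextSibling;
  }
  return c || null;
}

// Concatenates the values of the element's direct text node children.
function textOf(element) {
  let text = '';
  for (let c = element.firstChild; c; c = c.nextSibling) {
    if (c.nodeType === TEXT_NODE) {
      text += c.nodeValue;
    }
  }
  return text;
}
```

With these, var child = firstElementChildOf(x[0]); gives the myChild element, child.nodeName its name and textOf(child) its content. Current browsers also provide firstElementChild and textContent, which cover the same ground.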
Also note that when indexing into a NodeList, the indexes start at 0. You seem to have started with 1; ignore this comment if you were skipping the first one for a reason.
Off-topic: It's always best to understand the underlying mechanics of what you're working with, and I do recommend playing around with the straight DOM and referring to the DOM specs listed above. But for interacting with these trees, a good library can be really useful and save you a lot of time. jQuery works well with XML data. I haven't used any of the others like Prototype, YUI, Closure, or any of several others with XML, so can't speak to that, but I expect at least some of them support it.

Try x[1].getElementsByTagName('*')[0] instead.
(This is only reliable for index 0; other indexes may return elements that are not direct children, if the direct children contain further element nodes.)

Is it wise to use jQuery for whitelisting tags? Are there existing solutions in JavaScript?

My problem
I want to clean HTML pasted in a rich text editor (FCK 1.6 at the moment). The cleaning should be based on a whitelist of tags (and perhaps another with attributes). This is not primarily in order to prevent XSS, but to remove ugly HTML.
Currently I see no way to do it on the server, so I guess it must be done in JavaScript.
Current ideas
I found the jquery-clean plugin, but as far as I can see, it is using regexes to do the work, and we know that is not safe.
As I've not found any other JS-based solution, I've started to implement one myself using jQuery. It would work by creating a jQuery version of the pasted HTML ($(pastedHtml)) and then traversing the resulting tree, removing each element that does not match the whitelist by looking at its tagName property.
My questions
Is this any better?
Can I trust jQuery to represent the pasted content well (there may be unmatched ending tags and what-have-you)?
Is there a better solution already that I couldn't find?
Update
This is my current, jQuery-based, solution (verbose and not extensively tested):
function clean(element, whitelist, replacerTagName) {
// Use div if no replace tag was specified
replacerTagName = replacerTagName || "div";
// Accept anything that jQuery accepts
var jq = $(element);
// Create a copy of the current element, but without its children
var clone = jq.clone();
clone.children().remove();
// Wrap the copy in a dummy parent to be able to search with jQuery selectors
// 1)
var wrapper = $('<div/>').append(clone);
// Check if the element is not on the whitelist by searching with the 'not' selector
var invalidElement = wrapper.find(':not(' + whitelist + ')');
// If the element wasn't on the whitelist, replace it.
if (invalidElement.length > 0) {
var el = $('<' + replacerTagName + '/>');
el.text(invalidElement.text());
invalidElement.replaceWith(el);
}
// Extract the (maybe replaced) element
var cleanElement = $(wrapper.children().first());
// Recursively clean the children of the original element and
// append them to the cleaned element
var children = jq.children();
if (children.length > 0) {
children.each(function(_index, thechild) {
var cleaned = clean(thechild, whitelist, replacerTagName);
cleanElement.append(cleaned);
});
}
return cleanElement;
}
I am wondering about some points (see comments in the code);
Do I really need to wrap my element in a dummy parent to be able to match it with jQuery's ":not"?
Is this the recommended way to create a new node?

If you leverage the browser's HTML correcting abilities (e.g. you copy the rich text to the innerHTML of an empty div and take the resulting DOM tree), the HTML will be guaranteed to be valid (the way it will be corrected is somewhat browser-dependent). The rich text editor probably does this anyway, though.
jQuery's own text-to-DOM transform is probably also safe, but definitely slower, so I would avoid it.
Using a whitelist based on the jQuery selector engine might be somewhat tricky because removing an element while preserving its children might make the document invalid, so the browser would correct it by changing the DOM tree, which might confuse a script trying to iterate through invalid elements. (E.g. you allow ul and li but not ol; the script removes the list root element, naked li elements are invalid so the browser wraps them in ul again, that ul will be missed by the cleaning script.) If you throw away unwanted elements together with all their children, I don't see any problems with that.
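The last variant, dropping unwanted elements together with their children, could be sketched as follows. It uses plain DOM traversal instead of jQuery, and it assumes the whitelist is an array of tag names rather than a selector string; removeDisallowed is a hypothetical name.

```javascript
// Hypothetical helper: removes every descendant element whose tag is not on
// the whitelist, including everything inside it. Text nodes are kept.
function removeDisallowed(root, allowedTags) {
  const allowed = new Set(allowedTags.map((tag) => tag.toUpperCase()));
  // Iterate over a copy, because childNodes changes while nodes are removed.
  for (const child of Array.from(root.childNodes)) {
    if (child.nodeType !== 1) continue; // 1 = ELEMENT_NODE
    if (allowed.has(child.tagName.toUpperCase())) {
      removeDisallowed(child, allowedTags);
    } else {
      root.removeChild(child);
    }
  }
  return root;
}
```

In a browser, the pasted markup could first be assigned to the innerHTML of a detached div, which is then passed as root together with a list such as ['p', 'ul', 'li', 'strong', 'em'].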
