I need to get the name of the tag "myChild" and the "content".
This is simple, but I am stuck and sleepy; here is what I get with my tests:
XML:
...
<myParent>
<myChild>content</myChild>
</myParent>
<myParent>
<myChild>content</myChild>
</myParent>
...
JS:
var x=xmlDoc.getElementsByTagName("myParent");
alert(x[1].childNodes[0].nodeName); //returns "#text" - "myChild" needed
alert(x[1].childNodes[0].nodeValue); //returns "" - "content" needed
You want tagName, which is the name of the element. (Sorry about that, for Elements, tagName and nodeName are the same.)
The problem is that the first child of your myParent element isn't the myChild element, it's a text node (containing whitespace). Your structure looks like this:
Element "myParent"
Text node with a carriage return and some spaces or tabs
Element "myChild"
Text node with "content"
Text node with a carriage return and some spaces or tabs
Element "myParent"
Text node with a carriage return and some spaces or tabs
Element "myChild"
Text node with "content"
Text node with a carriage return and some spaces or tabs
You need to navigate down to the actual myChild element, which you can do with getElementsByTagName again, or just by scanning:
var x=xmlDoc.getElementsByTagName("myParent");
var c = x[1].firstChild;
while (c && c.nodeType != 1) { // 1 = ELEMENT_NODE
c = c.nextSibling;
}
alert(c.nodeName); // "myChild"
Note that Elements don't have a meaningful nodeValue property; instead, you collect their child text nodes. (More in the DOM specs: DOM2, DOM3.)
Also note that when indexing into a NodeList, the indexes start at 0. You seem to have started with 1; ignore this comment if you were skipping the first one for a reason.
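To get both the tag name and the "content" string, the scan above can be combined with a loop over the child's text nodes. This is a minimal sketch, not the only way to do it; the helper names firstElementChildOf and directText are made up for illustration, and xmlDoc in the usage comment is assumed to be the parsed document from the question.

```javascript
// Node type constants as defined by the DOM specification.
var ELEMENT_NODE = 1;
var TEXT_NODE = 3;

// Return the first child of `parent` that is an element, or null if there is none.
function firstElementChildOf(parent) {
  var c = parent.firstChild;
  while (c && c.nodeType !== ELEMENT_NODE) {
    c = c.nextSibling;
  }
  return c || null;
}

// Concatenate the values of the text nodes that are direct children of `element`.
function directText(element) {
  var text = "";
  for (var c = element.firstChild; c; c = c.nextSibling) {
    if (c.nodeType === TEXT_NODE) {
      text += c.nodeValue;
    }
  }
  return text;
}

// Usage with the question's document (xmlDoc is assumed to exist):
// var parents = xmlDoc.getElementsByTagName("myParent");
// var child = firstElementChildOf(parents[0]);
// if (child) {
//   alert(child.nodeName);    // tag name of the first element child
//   alert(directText(child)); // text held directly inside that element
// }
```

In current browsers, element.firstElementChild and node.textContent cover the same ground; the manual loops are only needed where those properties are unavailable.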
Off-topic: It's always best to understand the underlying mechanics of what you're working with, and I do recommend playing around with the straight DOM and referring to the DOM specs listed above. But for interacting with these trees, a good library can be really useful and save you a lot of time. jQuery works well with XML data. I haven't used any of the others like Prototype, YUI, Closure, or any of several others with XML, so can't speak to that, but I expect at least some of them support it.
Try x[1].getElementsByTagName('*')[0] instead.
(This is only reliable for index 0; other indexes may return elements that are not direct children, if the direct children contain further element nodes.)
Having problem with a textNode that I can't convert to a string.
I'm trying to scrape a site and get certain information out of it, and when I use an XPath to find the text I'm after I get a textNode back.
When I look in Chrome's developer tools, I can see that the textNode itself contains the text I'm after, but how do I convert the textNode to plain text?
here is the line of code I use:
abstracts = ZU.xpath(doc, '//*[@id="abstract"]/div/div/par/text()');
I have tried things like .innerHTML, toString, and textContent, but nothing has worked so far.
I usually use Text.wholeText when I want the content of a text node as a string. A text node is an object, not a string, so toString will not return its text, and innerHTML is not a property of text nodes.
Example: from https://developer.mozilla.org/en-US/docs/Web/API/Text/wholeText
The Text.wholeText read-only property returns the full text of all Text nodes logically adjacent to the node. The text is concatenated in document order. This allows to specify any text node and obtain all adjacent text as a single string.
Syntax
str = textnode.wholeText;
Notes and example:
Suppose you have the following simple paragraph within your webpage (with some whitespace added to aid formatting throughout the code samples here), whose DOM node is stored in the variable para:
<p>Thru-hiking is great! <strong>No insipid election coverage!</strong>
However, <a href="http://en.wikipedia.org/wiki/Absentee_ballot">casting a
ballot</a> is tricky.</p>
You decide you don’t like the middle sentence, so you remove it:
para.removeChild(para.childNodes[1]);
Later, you decide to rephrase things to, “Thru-hiking is great, but casting a ballot is tricky.” while preserving the hyperlink. So you try this:
para.firstChild.data = "Thru-hiking is great, but ";
All set, right? Wrong! What happened was you removed the strong element, but the removed sentence’s element separated two text nodes. One for the first sentence, and one for the first word of the last. Instead, you now effectively have this:
<p>Thru-hiking is great, but However, <a
href="http://en.wikipedia.org/wiki/Absentee_ballot">casting a
ballot</a> is tricky.</p>
You’d really prefer to treat all those adjacent text nodes as a single one. That’s where wholeText comes in: if you have multiple adjacent text nodes, you can access the contents of all of them using wholeText. Let’s pretend you never made that last mistake. In that case, we have:
assert(para.firstChild.wholeText == "Thru-hiking is great! However, ");
wholeText is just a property of text nodes that returns the string of data making up all the adjacent (i.e. not separated by an element boundary) text nodes combined.
Now let’s return to our original problem. What we want is to be able to replace the whole text with new text. That’s where replaceWholeText() comes in:
para.firstChild.replaceWholeText("Thru-hiking is great, but ");
We’re removing every adjacent text node (all the ones that constituted the whole text) but the one on which replaceWholeText() is called, and we’re changing the remaining one to the new text. What we have now is this:
<p>Thru-hiking is great, but <a
href="http://en.wikipedia.org/wiki/Absentee_ballot">casting a
ballot</a> is tricky.</p>
Some uses of the whole-text functionality may be better served by using Node.textContent, or the longstanding Element.innerHTML; that’s fine and probably clearer in most circumstances. If you have to work with mixed content within an element, as seen here, wholeText and replaceWholeText() may be useful.
More info: https://developer.mozilla.org/en-US/docs/Web/API/Text/wholeText
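For the scraping case in the question, a minimal sketch of turning the XPath result into plain strings might look like this. It assumes that ZU.xpath returns an array (or array-like list) of DOM nodes; the helper name nodesToStrings is made up for illustration.

```javascript
// Convert an array-like list of DOM nodes into an array of plain strings.
function nodesToStrings(nodes) {
  var out = [];
  for (var i = 0; i < nodes.length; i++) {
    var n = nodes[i];
    // For text nodes (nodeType 3) nodeValue holds the string itself;
    // for other node types fall back to textContent.
    var s = n.nodeType === 3 ? n.nodeValue : n.textContent;
    out.push(s == null ? "" : String(s));
  }
  return out;
}

// Usage (assumption: ZU.xpath returns a list of nodes):
// var nodes = ZU.xpath(doc, '//*[@id="abstract"]/div/div/par/text()');
// var abstractText = nodesToStrings(nodes).join(" ");
```

Reading nodeValue returns only the node's own text, while wholeText also includes logically adjacent text nodes, so which one fits depends on how the page splits its text.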
I want to build a Chrome app that finds all the strings that look like a telephone number and replaces them with a link. I want this to happen only for text elements, so it doesn't break JavaScript functions on the websites that the app runs on.
This is what I have so far:
var regex = /((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}(?!([^<]*>)|(((?!<a).)*<\/a>))/g;
var text = $("body:first").html();
text = text.replace(regex, "$&");
$("body:first").html(text);
but it breaks if there is JavaScript on the page.
Yes, your code just retrieves the markup representation of the current state of the DOM, and overwrites that, losing all event bindings and a lot of other significant data.
What you'll need to do is iterate through all the text nodes. You can't reach text nodes with a Sizzle selector alone, so you'll need to rely on jQuery's contents() function.
You could do something like this to get all the text nodes:
var allTextNodes = $('*').contents().filter(function() {
return this.nodeType == Node.TEXT_NODE;
});
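Once the text nodes are collected, each one can be rewritten on its own without touching the surrounding markup. The following is a sketch under assumptions: the simplified PHONE pattern, the tel: link format, and the helper names splitPhones and linkifyTextNode are not from the question (whose replacement string appears to have been lost), so adjust them to your needs.

```javascript
// Simplified phone pattern: optional "(ddd) " or "ddd-" prefix, then ddd-dddd.
var PHONE = /(?:\(\d{3}\) ?|\d{3}-)?\d{3}-\d{4}/g;

// Split a string into plain segments and phone-number segments, in order.
function splitPhones(text) {
  var parts = [];
  var last = 0;
  var m;
  PHONE.lastIndex = 0; // the regex is global, so reset its position first
  while ((m = PHONE.exec(text)) !== null) {
    if (m.index > last) {
      parts.push({ text: text.slice(last, m.index), phone: false });
    }
    parts.push({ text: m[0], phone: true });
    last = m.index + m[0].length;
  }
  if (last < text.length) {
    parts.push({ text: text.slice(last), phone: false });
  }
  return parts;
}

// Replace one text node with text nodes and <a href="tel:..."> links.
// Only that node is replaced, so event handlers elsewhere are untouched.
function linkifyTextNode(node) {
  var parts = splitPhones(node.nodeValue);
  var hasPhone = parts.some(function (p) { return p.phone; });
  if (!hasPhone) {
    return;
  }
  var doc = node.ownerDocument;
  var frag = doc.createDocumentFragment();
  parts.forEach(function (p) {
    if (p.phone) {
      var a = doc.createElement("a");
      a.href = "tel:" + p.text.replace(/\D/g, ""); // assumed link format
      a.appendChild(doc.createTextNode(p.text));
      frag.appendChild(a);
    } else {
      frag.appendChild(doc.createTextNode(p.text));
    }
  });
  node.parentNode.replaceChild(frag, node);
}

// Usage with the jQuery collection from above:
// allTextNodes.each(function () { linkifyTextNode(this); });
```

You would probably also want to skip text nodes whose parent is a script, style, or a element, so that code and existing links are left alone.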
I have encountered a very strange bug in Firefox.
I have a javascript function in an external file that works perfectly on regular complexity websites. However I have been putting together a few demonstration examples and come across something odd.
With html formatted like this (in an editor):
<div><p>Q: Where's the rabbit?</p><p class="faq_answer">A: I don't know, honest</p></div>
The Javascript works as expected.
However when like this:
<div>
<p>Q: Where's the rabbit?</p>
<p class="faq_answer">A: I don't know, honest</p>
</div>
It fails at this line:
elementsList[i].parentNode.firstChild.appendChild(finalRender.cloneNode(true));
Why on Earth would formatting of html cause anything at all?
It is not a bug. The DOM has not only element nodes, but also text nodes [docs] (among others). In this example:
<div>
<p>Q: Where's the rabbit?</p>
you have at least two text nodes:
One between the <div> and the <p>, containing a line-break.
One text node inside the <p> element node, containing the text Where's the rabbit?.
Thus, if elementsList[i].parentNode refers to the <div> element,
elementsList[i].parentNode.firstChild
will refer to the first text node.
If you want to get the first element node, use
elementsList[i].parentNode.children[0]
Update: You mentioned Firefox 3.0, and indeed, the children property is not supported in this version.
As far as I know, the only solution is to loop over the child nodes and test whether each one is an element node:
var firstChild = elementsList[i].parentNode.firstChild;
// skip text and comment nodes until an element (nodeType 1) is found;
// the null check avoids an error when the parent has no children
while (firstChild && firstChild.nodeType !== 1) {
firstChild = firstChild.nextSibling;
}
if(firstChild) {
// exists and found
}
You might want to put this in an extra function:
function getFirstElementChild(element) {
var firstChild = null;
if(element.children) {
firstChild = element.children[0] || null;
}
else {
firstChild = element.firstChild;
while (firstChild && firstChild.nodeType !== 1) {
firstChild = firstChild.nextSibling;
}
}
return firstChild;
}
You can (and should) also consider using a library that abstracts from all that, like jQuery.
It depends on what your code is actually doing, but if you run this method for every node, it would be something like:
$('.faq_answer').prev().append(finalRender.cloneNode(true));
(assuming the p element always comes before the .faq_answer element)
This is the whole code, you wouldn't have to loop over the elements anymore.
Because you have a text node between <div> and <p>.
As usual, the assumption of a browser bug is incorrect: this is, instead, a programmer bug!
Couldn't one achieve it by using ParentNode.children instead?
My problem
I want to clean HTML pasted in a rich text editor (FCK 1.6 at the moment). The cleaning should be based on a whitelist of tags (and perhaps another with attributes). This is not primarily in order to prevent XSS, but to remove ugly HTML.
Currently I see no way to do it on the server, so I guess it must be done in JavaScript.
Current ideas
I found the jquery-clean plugin, but as far as I can see, it is using regexes to do the work, and we know that is not safe.
As I've not found any other JS-based solution, I've started to implement one myself using jQuery. It would work by creating a jQuery version of the pasted HTML ($(pastedHtml)) and then traversing the resulting tree, removing each element that does not match the whitelist by looking at its tagName property.
My questions
Is this any better?
Can I trust jQuery to represent the pasted content well (there may be unmatched ending tags and what-have-you)?
Is there a better solution already that I couldn't find?
Update
This is my current, jQuery-based, solution (verbose and not extensively tested):
function clean(element, whitelist, replacerTagName) {
// Use div if no replace tag was specified
replacerTagName = replacerTagName || "div";
// Accept anything that jQuery accepts
var jq = $(element);
// Create a copy of the current element, but without its children
var clone = jq.clone();
clone.children().remove();
// Wrap the copy in a dummy parent to be able to search with jQuery selectors
// 1)
var wrapper = $('<div/>').append(clone);
// Check if the element is not on the whitelist by searching with the 'not' selector
var invalidElement = wrapper.find(':not(' + whitelist + ')');
// If the element wasn't on the whitelist, replace it.
if (invalidElement.length > 0) {
var el = $('<' + replacerTagName + '/>');
el.text(invalidElement.text());
invalidElement.replaceWith(el);
}
// Extract the (maybe replaced) element
var cleanElement = $(wrapper.children().first());
// Recursively clean the children of the original element and
// append them to the cleaned element
var children = jq.children();
if (children.length > 0) {
children.each(function(_index, thechild) {
var cleaned = clean(thechild, whitelist, replacerTagName);
cleanElement.append(cleaned);
});
}
return cleanElement;
}
I am wondering about some points (see comments in the code);
Do I really need to wrap my element in a dummy parent to be able to match it with jQuery's ":not"?
Is this the recommended way to create a new node?
If you leverage the browser's HTML-correcting abilities (e.g. you copy the rich text to the innerHTML of an empty div and take the resulting DOM tree), the HTML will be guaranteed to be valid (the way it will be corrected is somewhat browser-dependent). The rich text editor probably does this anyway.
jQuery's own text-to-DOM transform is probably also safe, but definitely slower, so I would avoid it.
Using a whitelist based on the jQuery selector engine might be somewhat tricky because removing an element while preserving its children might make the document invalid, so the browser would correct it by changing the DOM tree, which might confuse a script trying to iterate through invalid elements. (E.g. you allow ul and li but not ol; the script removes the list root element, naked li elements are invalid so the browser wraps them in ul again, that ul will be missed by the cleaning script.) If you throw away unwanted elements together with all their children, I don't see any problems with that.
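The safer variant described above, dropping a disallowed element together with everything inside it, can be sketched with plain DOM methods. The function name removeDisallowed and the tag list in the usage comment are made up for illustration; this is a sketch, not a complete sanitizer, and it does not look at attributes.

```javascript
// Remove every element under `root` whose tag name is not in `allowedTags`
// (lowercase names), together with all of its descendants.
function removeDisallowed(root, allowedTags) {
  // Copy the child list first, because removeChild changes it while we iterate.
  var kids = Array.prototype.slice.call(root.childNodes);
  for (var i = 0; i < kids.length; i++) {
    var k = kids[i];
    if (k.nodeType !== 1) {
      continue; // keep text nodes and other non-element nodes
    }
    if (allowedTags.indexOf(k.nodeName.toLowerCase()) === -1) {
      root.removeChild(k); // drops the element and its whole subtree
    } else {
      removeDisallowed(k, allowedTags); // allowed: check its children
    }
  }
  return root;
}

// Usage in a browser (pastedHtml is assumed to hold the pasted markup):
// var holder = document.createElement("div");
// holder.innerHTML = pastedHtml; // let the browser repair the markup
// removeDisallowed(holder, ["p", "ul", "li", "strong", "em"]);
// var cleaned = holder.innerHTML;
```

Because nothing is unwrapped or re-parented, the browser has no reason to re-correct the tree while the function runs.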
I need to construct an xpath string to select all descendants of a certain table with these conditions:
The table is a descendant of a form with a specific action attribute value.
The selected descendants are text nodes.
The text node content can only contain whitespace.
It'll probably look something like:
//form[@action = "submit.html"]//table//text()[ ...? ]
Any tips would be appreciated. Thanks.
Edit: Here is my previous working compromise:
function KillTextNodes(rootpath)
{
XPathIterate(rootpath + '//text()', function(node)
{
var tagname = node.parentNode.tagName;
if (tagname != 'OPTION' && tagname != 'TH')
Kill(node);
});
}
Here is my function based on the accepted answer:
function KillTextNodes(rootpath)
{
XPathIterate(rootpath + '//text()[not(normalize-space())]', function(node) { Kill(node); });
}
To explain my motivation a little - I'm iterating through the DOM with Javascript, and run into the same problem that many others do where unexpected empty text nodes throw off the results. This function helps me out a lot by simply deleting all of the empty text nodes so that my iteration logic can stay simple.
Use:
//form[@action = "submit.html"]//table//text()[not(normalize-space())]
This selects all text nodes that contain only whitespace and that are descendants of any table that is a descendant of any form having an action attribute with the value "submit.html".
Text nodes containing whitespace only will be stripped from the document representation - i.e. there won't actually be a node. That means you can't access the text itself, but what you can do is match a parent lacking a text node using not() - something like:
//form[@action = "submit.html"]//table//*[not(text())]
Though in your case I would guess that will be far more aggressive than you actually intend. As an aside, be careful with these // matches, they're not very efficient and again very aggressive.
(I've just noticed this isn't an XSLT question! If you're in JS land have you considered using DOM methods to get your list?)
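For the DOM-method route, a minimal sketch of removing whitespace-only text nodes without XPath could look like the following. The function name removeBlankTextNodes is made up for illustration. The character class matches the four characters that XPath's normalize-space() treats as whitespace (space, tab, carriage return, line feed), so it should select the same nodes as text()[not(normalize-space())].

```javascript
// Remove all text nodes under `root` that are empty or contain only
// space, tab, CR, or LF. Returns the number of nodes removed.
function removeBlankTextNodes(root) {
  var removed = 0;
  // Copy the child list first, because removeChild changes it while we iterate.
  var kids = Array.prototype.slice.call(root.childNodes);
  for (var i = 0; i < kids.length; i++) {
    var k = kids[i];
    if (k.nodeType === 3) {
      // text node: remove it when nothing but whitespace is inside
      if (/^[ \t\r\n]*$/.test(k.nodeValue)) {
        root.removeChild(k);
        removed++;
      }
    } else if (k.nodeType === 1) {
      // element node: recurse into its children
      removed += removeBlankTextNodes(k);
    }
  }
  return removed;
}

// Usage in a browser (the table lookup is an assumption about your page):
// var table = document.querySelector('form[action="submit.html"] table');
// if (table) { removeBlankTextNodes(table); }
```

Unlike the earlier compromise, this needs no special cases for OPTION or TH, because text nodes with visible content are never matched.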