How to retrieve the text in html CDATA section?


I have the following script element section in HTML:
<script type="text/x-markdown"><![CDATA[
# hello, This is Markdown Script Demo]]></script>
When I try to retrieve the inner content via scripttag.innerHTML, it returns the text including the <![CDATA[ ... ]]> parts.
Is there a more efficient way to retrieve the inner part of the CDATA section at once, instead of applying a regexp to remove the markers from the received innerHTML data?

I don't think you will be able to retrieve only what's inside the CDATA, as it's not a tag but plain text. When you get the innerHTML of the tag you will get everything as a string, so a regexp is the only way I see to get what's inside.
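A minimal sketch of that regexp approach, assuming the string comes from scripttag.innerHTML and contains at most one CDATA wrapper around the whole content. The helper name stripCdata is made up for illustration:

```javascript
// Hypothetical helper: removes one leading <![CDATA[ marker and one
// trailing ]]> marker (plus surrounding whitespace) from a string.
function stripCdata(text) {
  return text
    .replace(/^\s*<!\[CDATA\[/, '')  // opening marker at the start
    .replace(/\]\]>\s*$/, '');       // closing marker at the end
}

// Usage, with a literal standing in for scripttag.innerHTML:
var raw = '<![CDATA[\n# hello, This is Markdown Script Demo]]>';
var markdown = stripCdata(raw);
```

A string without the markers passes through unchanged, so it should be safe to call this on script blocks that were written without CDATA.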

CDATA is an XML concept. It is a way of specifying a section of text inside which things that look like mark-up or special XML characters are treated as plain text. It is essentially equivalent to escaping < to &lt; etc. everywhere within the CDATA section.
If the document is parsed as HTML (an HTML doctype, served as text/html), then the CDATA receives no special processing and is just more characters. If the document were XHTML parsed as XML, then you would be able to retrieve the CDATA section as is, with no further ado.

This question is quite old, but this might help somebody.
You can probably use textContent.
Example from parsing an RSS feed node which looks like this:
<title><![CDATA[This contains the title]]></title>
JavaScript:
const desc = el.querySelector('title').textContent;

Related

Load HTML content as text into div?

I'm using this code to grab html stored in a string into a div with contenteditable="true" (the string works, and if I manually place the code there it also works, but I need a way to "inject" html or whatever as text in it)
document.getElementById('content').innerHTML=txt
The problem is that it's not placing the HTML inside the div as text; it's being rendered as if it were part of the page. Is there a way around this? I need the HTML (or JavaScript, or whatever is written in the string) to appear as plain text.

Use textContent instead to inject strings like this:
document.getElementById('content').textContent=txt

You should use the textContent property:
document.getElementById('content').textContent = txt
For more information, have a look at MDN.

Need to color the tags in an xml, displayed in a textarea

I need to color the tags in an XML string, which is displayed in the textarea of an HTML page.
Say, for example, I have an XML string stored in a variable 'xmldata'.
The textarea tag in the HTML is as below:
<textarea id="xmlfile" cols="20" rows="30"></textarea>
Using the JavaScript statement below, I'm displaying the XML string in the textarea:
document.getElementById("xmlfile").value=xmldata;
But the XML string is displayed as plain text in the textarea.
Is there any JavaScript function to color the tags in XML?
I don't want any external JavaScript and CSS code like "google-code-prettify".
All I need is a simple JavaScript function that colors the tags in an XML string which is displayed in the textarea.
Please help me with a solution.
-Dinesh

Since the contents of your text area are not separate DOM elements I don't believe you'll be able to individually set their attributes (since they don't have individual attributes). You might find some variation on a rich text editor that you can plug in. This may or may not violate your stipulation that you don't want external javascript libraries.
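If rendering outside the textarea is acceptable (say, into a pre element via innerHTML), a small dependency-free function could look roughly like the sketch below. The names escapeXml and colorizeXml and the CSS class xml-tag are invented for this example, and it assumes that no tag, comment or CDATA section contains a literal > character:

```javascript
// Escape the characters that would otherwise be read as markup.
function escapeXml(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Hypothetical helper: returns an HTML string in which every XML tag is
// escaped and wrapped in <span class="xml-tag">, ready to be styled via CSS.
function colorizeXml(xml) {
  // The capture group makes split() keep the tags in the resulting array.
  return xml.split(/(<[^>]*>)/).map(function (part) {
    if (/^<[^>]*>$/.test(part)) {
      return '<span class="xml-tag">' + escapeXml(part) + '</span>';
    }
    return escapeXml(part); // plain text between tags
  }).join('');
}
```

The result would be assigned to something like somePre.innerHTML, with a rule such as .xml-tag { color: blue; } doing the actual coloring.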

As answered elsewhere, have a look at this self-contained prettifier. It works for most cases, indents long lines nicely, and colorizes the output if needed. Still, I guess it might not help if you need it inside a textarea.
function formatXml(xml,colorize,indent) {
function esc(s){return s.replace(/[-\/&<> ]/g,function(c){ // Escape special chars
return c==' '?'&nbsp;':'&#'+c.charCodeAt(0)+';';});}
var se='<p class="xel">',tb='<div class="xtb">',d=0,i,re='',ib,
sd='<p class="xdt">',tc='<div class="xtc">',ob,at,sz='</p>',
sa='<p class="xat">',tz='</div>',ind=esc(indent||' ');
if (!colorize) se=sd=sa=sz='';
xml.match(/(?<=<).*(?=>)|$/s)[0].split(/>\s*</).forEach(function(nd){
ob=nd.match(/^([!?\/]?)(.*?)([?\/]?)$/s); // Split outer brackets
ib=ob[2].match(/^(.*?)>(.*)<\/(.*)$/s)||['',ob[2],'']; // Split inner brackets
at=ib[1].match(/^--.*--$|=|('|").*?\1|[^\t\n\f \/>"'=]+/g)||['']; // Split attributes
if (ob[1]=='/') d--; // Decrease indent
re+=tb+tc+ind.repeat(d)+tz+tc+esc('<'+ob[1])+se+esc(at[0])+sz;
for (i=1;i<at.length;i+=3) re+=esc(' ')+sa+esc(at[i])+sz+"="+sd+esc(at[i+2])+sz;
re+=ib[2]?esc('>')+sd+esc(ib[2])+sz+esc('</')+se+ib[3]+sz:'';
re+=esc(ob[3]+'>')+tz+tz;
if (ob[1]+ob[3]+ib[2]=='') d++; // Increase indent
});
return re;
}
For demo see https://jsfiddle.net/dkb0La16/

Getting unparsed (raw) HTML with JavaScript

I need to get the actual html code of an element in a web page.
For example, if the actual HTML code inside the element is "How to&nbsp;fix"
Running this JavaScript:
document.getElementById('myE').innerHTML
Gives me "How to fix" which is the parsed HTML.
How can I get the unparsed "How to&nbsp;fix" using JavaScript?

You cannot get the actual HTML source of part of your web page.
When you give a web browser an HTML page, it parses the HTML into some DOM nodes that are the definitive version of your document as far as the browser is concerned. The DOM keeps the significant information from the HTML, like that you used the Unicode character U+00A0 Non-Breaking Space before the word fix, but not the irrelevant information that you used it by means of an entity reference rather than just typing it raw ( ).
When you ask the browser for an element node's innerHTML, it doesn't give you the original HTML source that was parsed to produce that node, because it no longer has that information. Instead, it generates new HTML from the data stored in the DOM. The browser decides on how to format that HTML serialisation; different browsers produce different HTML, and chances are it won't be the same way you formatted it originally.
In particular,
element names may be upper- or lower-cased;
attributes may not be in the same order as you stated them in the HTML;
attribute quoting may not be the same as in your source. IE often generates unquoted attributes that aren't even valid HTML; all you can be sure of is that the innerHTML generated will be safe to use in the same browser by writing it to another element's innerHTML;
it may not use entity references for anything but characters that would otherwise be impossible to include directly in text content: ampersands, less-thans and attribute-value-quotes. Instead of returning &nbsp; it may simply give you the raw U+00A0 character.
You may not be able to see that that's a non-breaking space, but it still is one and if you insert that HTML into another element it will act as one. You shouldn't need to rely anywhere on a non-breaking space character being entity-escaped to &nbsp;. If you do, for some reason, you can get that by doing:
x= el.innerHTML.replace(/\xA0/g, '&nbsp;')
but that's only escaping U+00A0 and not any of the other thousands of possible Unicode characters, so it's a bit questionable.
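If escaping every non-ASCII character is really what you want, one option (just a sketch, and escapeNonAscii is a made-up name) is to turn each such character into a numeric character reference instead of a named entity:

```javascript
// Hypothetical helper: replaces every non-ASCII code point with a decimal
// numeric character reference such as &#160;. The "u" flag makes the
// character class match whole code points, so characters outside the
// Basic Multilingual Plane are not split into surrogate halves.
function escapeNonAscii(html) {
  return html.replace(/[^\x00-\x7F]/gu, function (ch) {
    return '&#' + ch.codePointAt(0) + ';';
  });
}

// A non-breaking space (U+00A0) becomes &#160;
var escaped = escapeNonAscii('How to\u00A0fix');
```

Note this produces &#160; rather than the named form, since mapping code points back to entity names would need a lookup table.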
If you really really need to get your page's actual source HTML, you can make an XMLHttpRequest to your own URL (location.href) and get the full, unparsed HTML source in the responseText. There is almost never a good reason to do this.

What you have should work:
Element test:
<div id="myE">How to&nbsp;fix</div>
JavaScript test:
alert(document.getElementById("myE").innerHTML); //alerts "How to&nbsp;fix"
Make sure that wherever you're using the result, it isn't shown as a space, which is likely the case. If you want to show it somewhere that's designed for HTML, you'll need to escape it.
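The escaping step can be sketched like this, assuming the result is going into element content or a double-quoted attribute. escapeHtml is a name picked for the example, not a built-in:

```javascript
// Hypothetical helper: escapes the characters that HTML treats as markup,
// so an entity reference like &nbsp; is displayed literally instead of
// being decoded. The ampersand must be replaced first.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// 'How to&nbsp;fix' becomes 'How to&amp;nbsp;fix', which a browser
// renders as the visible text "How to&nbsp;fix".
var shown = escapeHtml('How to&nbsp;fix');
```

When the target is a DOM node rather than an HTML string, assigning to textContent gives the same visible result without manual escaping.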

You can use a script tag instead, which will not parse the HTML. This is more relevant when there are angle brackets, like loading a lodash or underscore template.
document.getElementById("asDiv").value = document.getElementById("myDiv").innerHTML;
document.getElementById("asScript").value = document.getElementById("myScript").innerHTML;
<div id="myDiv">
<h1>
<%= ${var} %> %>
How to fix
</h1>
</div>
<script id="myScript" type="text/template">
<h1>
<%= ${var} %>
How to fix
</h1>
</script>
<textarea rows="10" cols="40" id="asDiv"></textarea>
<textarea rows="10" cols="40" id="asScript"></textarea>
Because the HTML in a div is parsed, the inner HTML for brackets comes back as &lt;, but as a script it does not.

Content inside CDATA is not displayed properly when processed through JavaScript

I have an XML document with some sample content like this:
<someTag>
<![CDATA[Hello World]]>
</someTag>
I'm parsing the above XML in JavaScript. When I try to access and render the Hello World text using xmldoc.getElementsByTagName("someTag")[0].childNodes[0].textContent, all I get is blank text on screen.
The code is not returning undefined or any error messages, so I guess the code is properly accessing the message, but due to the CDATA it is not rendering properly on screen.
Is there any way to fix the issue and get the Hello World out of this XML file?

Note that Firefox's behaviour is absolutely correct. someTag has three children:
A Text node containing the whitespace between the <someTag> and <![CDATA[. This is one newline and one space;
the CDATASection node itself;
another whitespace Text node containing the single newline character between the end of the CDATA and the close-tag.
It's best not to rely closely on what combination of text and CDATA nodes might exist in an element if all you want is the text value inside it. Just call textContent on <someTag> itself and you'll get all the combined text content: '\n Hello World\n'. (You can .trim() this if you like.)

If you're running Firefox, maybe this is the issue you're having. The behavior looks very similar. The following might do the trick:
xmldoc.getElementsByTagName("someTag")[0].childNodes[1].textContent;

Regex replace string but not inside html tag

I want to replace a string in HTML page using JavaScript but ignore it, if it is in an HTML tag, for example:
visit google search engine
you can search on google tatatata...
I want to replace google by <b>google</b>, but not here:
visit google search engine
you can search on <b>google</b> tatatata...
I tried with this one:
regex = new RegExp(">([^<]*)?(google)([^>]*)?<", 'i');
el.innerHTML = el.innerHTML.replace(regex,'>$1<b>$2</b>$3<');
but the problem: I got <b>google</b> inside the <a> tag:
visit <b>google</b> search engine
you can search on <b>google</b> tatatata...
How can I fix this?

You'd be better using an html parser for this, rather than regex. I'm not sure it can be done 100% reliably.

You may or may not be able to do this with a regexp. It depends on how precisely you can define the conditions. Saying you want the string replaced except if it's in an HTML tag is not narrow enough, since everything on the page is presumably within some HTML tag (BODY if nothing else).
It would probably work better to traverse the DOM tree for this instead of trying to use a regexp on the HTML.

Parsing HTML with a regular expression is not going to be easy for anything other than trivial cases, since HTML isn't regular.
For more details see this Stack Overflow question (and answers).

I think you're all missing the question here...
When he says inside the tag, he means inside the opening tag, as in the <a href="google.com"> tag. This is something quite different from text inside, say, a <p> </p> tag pair or <body> </body>. While I don't have the answer yet, I'm struggling with this same problem and I know it has to be solvable using regex. Once I figure it out, I'll come back and post.

WORKAROUND
If you can't use an HTML parser, or are quite confident about your HTML structure, try this:
do the "bad" changing
repeat replace (<[^>]*)(<[^>]+>) to $1 a few times (as many as you need)
It's a simple workaround, but works for me.
Cons?
Well... you have to do the replace twice for the case ... ...> as it removes only the first unwanted tag from every tag on the page
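The "repeat a few times" step of the workaround could be wrapped in a loop that stops once nothing changes. This is only a sketch: removeTagsInsideTags and the maxPasses limit are invented here, and the pattern is the one from the steps above, so it shares its weakness (a bare < in normal text followed by a tag will also match).

```javascript
// Hypothetical helper: repeatedly removes a tag that ended up inside
// another tag's markup, e.g. <a href="<b>google</b>.com">, until a pass
// makes no further change or maxPasses is reached.
function removeTagsInsideTags(html, maxPasses) {
  var pattern = /(<[^>]*)(<[^>]+>)/g;
  var limit = maxPasses || 10; // assumed safety limit, not from the answer
  for (var i = 0; i < limit; i++) {
    var next = html.replace(pattern, '$1');
    if (next === html) break; // nothing left to clean up
    html = next;
  }
  return html;
}

// After the "bad" replace, both the href and the link text were wrapped:
var broken = 'visit <a href="<b>google</b>.com"><b>google</b></a>';
var fixed = removeTagsInsideTags(broken);
```

Here the first pass removes the <b> inside the href and the second removes the matching </b>, which is the "do the replace twice" case mentioned under Cons.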
[edit:]
SOLUTION
Why not use jQuery, put the html code into the page and do something like this:
$(containerOrSth).find('a').each(function(){
if($(this).children().length==0){
$(this).text($(this).text().replace('google','evil'));
}else{
//here You have to care about children tags, but You have to know where to expect them - before or after text. comment for more help
}
});

I'm using
regex = new RegExp("(?=[^>]*<)google", 'i');
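For anyone wanting to try this, here is roughly how that lookahead behaves. The g flag and the boldWithLookahead wrapper are additions for the example. The lookahead only matches when a < comes before any > after the match position, so text that is not followed by another tag is skipped:

```javascript
// Same pattern as above, with "g" added so every occurrence is replaced.
var lookaheadRegex = new RegExp("(?=[^>]*<)google", 'gi');

// Hypothetical wrapper: bolds matches that are followed by a "<" before
// any ">", i.e. matches that are not inside a tag's own markup.
function boldWithLookahead(html) {
  return html.replace(lookaheadRegex, '<b>$&</b>');
}

// The href is left alone, the link text is wrapped:
var linked = boldWithLookahead('<a href="google.com">google</a>');

// Limitation: no "<" follows here, so nothing is replaced.
var trailing = boldWithLookahead('try google');
```

Wrapping the string in a container element before matching would be one way around the trailing-text limitation.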

You can't really do that; your "google" is always in some tag. Either replace all or none.

Well, since everything is part of a tag, your request makes no real sense. If it's just the <a /> tag, you might just check for that part, mainly by making sure you don't have a trailing </a> tag before a fresh <a>.

You can do that using a regex, but filtering blocks like STYLE, SCRIPT and CDATA needs more work and is not implemented in the following solution.
Most of the answers state that 'your data is always in some tags', but they are missing the point: the data is always 'between' some tags, and you want to filter out where it is 'in' a tag.
Note that tag characters in inline scripts will likely break this, so if they exist, they should be processed separately with this method. Take a look here:
complex html string.replace function

I can give you a hacky solution…
Pick a non-printable character that's not in your string. Duplicate your buffer. Now overwrite the tags in the duplicate buffer using the non-printable character. Run the regex on the duplicate buffer to find the position and length of each match. Now you know where to perform the replace in the original buffer.
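A rough sketch of that masking idea, assuming U+0000 never occurs in the input and that "inside a tag" means inside the tag's own markup (text between <a> and </a> is still replaced). The name boldOutsideTags is made up:

```javascript
// Hypothetical helper: wraps each occurrence of `word` in <b>...</b>,
// skipping occurrences inside a tag's markup such as an href value.
function boldOutsideTags(html, word) {
  if (!word) return html; // an empty pattern would never advance
  // Overwrite each tag with an equal-length run of the placeholder so
  // that match offsets in the copy line up with the original string.
  var masked = html.replace(/<[^>]*>/g, function (tag) {
    return '\u0000'.repeat(tag.length);
  });
  // Escape regex metacharacters so `word` is matched literally.
  var literal = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  var re = new RegExp(literal, 'gi');
  var out = '';
  var last = 0;
  var m;
  while ((m = re.exec(masked)) !== null) {
    var end = m.index + m[0].length;
    // Copy from the ORIGINAL string, using offsets found in the copy.
    out += html.slice(last, m.index) + '<b>' + html.slice(m.index, end) + '</b>';
    last = end;
  }
  return out + html.slice(last);
}
```

Because the offsets come from the masked copy but the text is sliced from the original, the original casing of each match is preserved.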
