Removing invalid characters from XML before serializing it with XMLSerializer() - javascript

I'm trying to store user-input in an XML document on the client-side (javascript), and transmit that to the server for persistence.
One user, for example, pasted in text that included an STX character (0x2). The XMLSerializer did not escape the STX character, and therefore, did not serialize to well-formed XML. Or perhaps the .attr() call should have escaped the STX character, but in either case, invalid XML was produced.
I'm finding that the output of the in-browser XMLSerializer() isn't always well-formed (and doesn't even satisfy the browser's own DOMParser()).
This example shows that the STX character is not properly encoded by XMLSerializer():
> doc = $.parseXML('<?xml version="1.0" encoding="utf-8" ?>\n<elem></elem>');
#document
> $(doc).find("elem").attr("someattr", String.fromCharCode(0x2));
[ <elem someattr=""></elem> ]
> serializedDoc = new XMLSerializer().serializeToString(doc);
"<?xml version="1.0" encoding="utf-8"?><elem someattr=""/></elem>"
> $.parseXML(serializedDoc);
Error: Invalid XML: <?xml version="1.0" encoding="utf-8"?><elem someattr=""/></elem>
How should I construct an XML document in-browser (with params determined by arbitrary user-input) such that it will always be well-formed (everything properly escaped)? I don't need to support IE8 or IE7.
(And yes, I do validate the XML on the server side, but if the browser hands the server a document that is not well-formed, the best the server can do is reject it, which isn't that helpful to the poor user)

Here's a function, sanitizeStringForXML(), which can be used to cleanse strings before assignment, and a derivative function, removeInvalidCharacters(xmlNode), which can be passed a DOM tree and will automatically sanitize attributes and text nodes so they are safe to store.
var stringWithSTX = "Bad" + String.fromCharCode(2) + "News";
var xmlNode = $("<myelem/>").attr("badattr", stringWithSTX)[0]; // grab the raw DOM node, not the jQuery wrapper
var serializer = new XMLSerializer();
var invalidXML = serializer.serializeToString(xmlNode);
// Now cleanse it:
removeInvalidCharacters(xmlNode);
var validXML = serializer.serializeToString(xmlNode);
I based this on the list of characters from the non-restricted characters section of this Wikipedia article, but the supplementary planes need 5-hex-digit Unicode escapes, which JavaScript regexes (without the ES2015 u flag) have no syntax for, so for now I'm just stripping those characters out (you aren't missing too much; a variant that keeps them is sketched after the code below):
// WARNING: too painful to include supplementary planes, these characters (0x10000 and higher)
// will be stripped by this function. See what you are missing (hieroglyphics, emoji, etc.) at:
// http://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Multilingual_Plane
var NOT_SAFE_IN_XML_1_0 = /[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm;
function sanitizeStringForXML(theString) {
    "use strict";
    return theString.replace(NOT_SAFE_IN_XML_1_0, '');
}

function removeInvalidCharacters(node) {
    "use strict";
    if (node.attributes) {
        for (var i = 0; i < node.attributes.length; i++) {
            var attribute = node.attributes[i];
            if (attribute.nodeValue) {
                attribute.nodeValue = sanitizeStringForXML(attribute.nodeValue);
            }
        }
    }
    if (node.childNodes) {
        for (var i = 0; i < node.childNodes.length; i++) {
            var childNode = node.childNodes[i];
            if (childNode.nodeType == 1 /* ELEMENT_NODE */) {
                removeInvalidCharacters(childNode);
            } else if (childNode.nodeType == 3 /* TEXT_NODE */) {
                if (childNode.nodeValue) {
                    childNode.nodeValue = sanitizeStringForXML(childNode.nodeValue);
                }
            }
        }
    }
}
Note that this only removes invalid characters from nodeValues of attributes and textNodes. It does not check tag names or attribute names, comments, etc etc.
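Not part of the original answer, but on an ES2015+ engine you can keep the supplementary-plane characters instead of stripping them, by using the regex u flag and \u{...} escapes. A minimal sketch:
// Sketch (assumes ES2015+ support for the regex "u" flag): same idea as NOT_SAFE_IN_XML_1_0 above,
// but keeps supplementary-plane characters and strips lone surrogates.
var NOT_SAFE_IN_XML_1_0_U = /[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD\u{10000}-\u{10FFFF}]/gu;

function sanitizeStringForXMLKeepingAstral(theString) {
    "use strict";
    return theString.replace(NOT_SAFE_IN_XML_1_0_U, '');
}

sanitizeStringForXMLKeepingAstral("Bad" + String.fromCharCode(2) + "News \u{1F600}"); // "BadNews 😀"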

Check out this very useful gist: https://gist.github.com/john-doherty/b9195065884cdbfd2017a4756e6409cc. Example usage:
const resultXml = removeXMLInvalidChars(INPUT_XML_STRING, true);

Related

Firefox pref is destroying JSON

I have the following JSON: http://pastebin.com/Sh20StJY
(SO removed the characters from my post, so look at the link for the real JSON.)
It was generated using JSON.stringify and saved in Firefox prefs (pref.setCharPref(prefName, value);).
The problem is that when I save the value, Firefox does something that corrupts the JSON. If I try JSON.parse on the value retrieved from the config, I get an error:
Error: JSON.parse: bad control character in string literal
If I try to validate the above JSON (which was retrieved from the settings) I get an error at line 20, the tokens value contains two invalid characters.
If I try a JSON.parse immediately after JSON.stringify the error doesn't occur.
Do I have to set something to save in a different encoding? How can I fix it?
nsIPrefBranch.getCharPref() only works for ASCII data, but your JSON contains some non-ASCII characters. You can still store Unicode data in preferences; it is merely a little more complicated:
var str = Components.classes["@mozilla.org/supports-string;1"]
              .createInstance(Components.interfaces.nsISupportsString);
str.data = value;
pref.setComplexValue(prefName, Components.interfaces.nsISupportsString, str);
And to read that preference:
var str = pref.getComplexValue(prefName, Components.interfaces.nsISupportsString);
var value = str.data;
For reference: Documentation
Your JSON appears to contain non-ASCII characters such as ½. Can you check what encoding everything is being handled in?
nsIPrefBranch.setCharPref() assumes that its input is UTF-8 encoded, and the return value of nsIPrefBranch.getCharPref() is always a UTF-8 string. If your input is a byte string or a string in some other encoding, you will either need to switch to UTF-8, or encode and decode it yourself when interacting with preferences.
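For the "encode and decode it yourself" route, one widely used idiom (a sketch, not from the original answer; the helper names and the branch parameter are made up) converts the value to a UTF-8 byte string before storing it and reverses the conversion on read:
// encodeURIComponent() produces %XX-escaped UTF-8, and unescape() turns that into a byte
// string that setCharPref() accepts; the reverse pair decodes it again on the way out.
function setUnicodeCharPref(branch, name, value) {
    branch.setCharPref(name, unescape(encodeURIComponent(value)));
}
function getUnicodeCharPref(branch, name) {
    return decodeURIComponent(escape(branch.getCharPref(name)));
}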
I did this in one place to fix this issue:
(function overrideJsonParse() {
    if (!window.JSON || !window.JSON.parse) {
        window.setTimeout(overrideJsonParse, 1);
        return; // this code has executed before JSON2.js, try again in a moment
    }
    var oldParse = window.JSON.parse;
    window.JSON.parse = function (s) {
        var b = "", i, l = s.length, c;
        for (i = 0; i < l; ++i) {
            c = s[i];
            if (c.charCodeAt(0) >= 32) { b += c; }
        }
        return oldParse(b);
    };
}());
This works in IE8 (using json2 or whatever), IE9, Firefox and Chrome.
The code seems correct. Try using single quotes ('..': '...') instead of double quotes ("..": "...").
I still couldn't find the solution, but I found a workaround:
var b = "";
[].forEach.call("{ JSON STRING }", function(c, i) {
if (c.charCodeAt(0) >= 32)
b += c;
});
Now b is the new JSON, and might work...

Check if HTML snippet is valid with JavaScript

I need a reliable JavaScript library / function to check if an HTML snippet is valid that I can call from my code. For example, it should check that opened tags and quotation marks are closed, nesting is correct, etc.
I don't want the validation to fail because something is not 100% standard (but would work anyway).
Update: this answer is limited - please see the edit below.
Expanding on #kolink's answer, I use:
var checkHTML = function(html) {
    var doc = document.createElement('div');
    doc.innerHTML = html;
    return ( doc.innerHTML === html );
}
I.e., we create a temporary div with the HTML. To do this, the browser builds a DOM tree from the HTML string, which may involve closing unclosed tags, etc.
Comparing the div's HTML contents with the original HTML tells us whether the browser needed to change anything.
checkHTML('<a>hell<b>o</b>')
Returns false.
checkHTML('<a>hell<b>o</b></a>')
Returns true.
Edit: As #Quentin notes below, this is excessively strict for a variety of reasons: browsers will often fix omitted closing tags, even if closing tags are optional for that tag. Eg:
<p>one para
<p>second para
...is considered valid (since Ps are allowed to omit closing tags) but checkHTML will return false. Browsers will also normalise tag case and alter whitespace. You should be aware of these limits when deciding to use this approach.
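To see the kind of normalization involved, here is a small illustration (hypothetical markup) of the innerHTML round trip that checkHTML relies on:
var div = document.createElement('div');
div.innerHTML = '<P CLASS=intro>one para<p>second para';
div.innerHTML;
// '<p class="intro">one para</p><p>second para</p>' -- so checkHTML() returns false here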
Well, this code:
function tidy(html) {
    var d = document.createElement('div');
    d.innerHTML = html;
    return d.innerHTML;
}
This will "correct" malformed HTML to the best of the browser's ability. If that's helpful to you, it's a lot easier than trying to validate HTML.
None of the solutions presented so far does a good job of answering the original question, especially when it comes to
I don't want the validation to fail because something is not 100%
standard (but would work anyways).
tldr >> check the JSFiddle
So I used the input of the answers and comments on this topic and created a method that does the following:
checks the html string tag by tag for validity
tries to render the html string
compares the theoretical tag count with the number of tags actually rendered in the DOM
if checking 'strict', <br/> conversions and empty-attribute normalizations (="") are not ignored
compares the rendered innerHTML with the given html string (while ignoring whitespace and quotes)
Returns
true if rendered html is same as given html string
false if one of the checks fails
normalized html string if rendered html seems valid but is not equal to given html string
normalized means that, on rendering, the browser sometimes ignores or repairs specific parts of the input (like adding missing closing tags for <p>) and converts others (like single quotes to double quotes, or encoding of ampersands).
Making a distinction between "failed" and "normalized" allows you to flag the content to the user as "this will not be rendered as you might expect it".
Most of the time, normalized gives back an only slightly altered version of the original html string - still, sometimes the result is quite different. So this should be used, e.g., to flag user input for further review before saving it to a DB or rendering it blindly. (See the JSFiddle for examples of normalization.)
The checks take the following exceptions into consideration
ignoring of normalization of single quotes to double quotes
image and other tags with a src attribute are 'disarmed' during rendering
(if non strict) ignoring of <br/> >> <br> conversion
(if non strict) ignoring of normalization of empty attributes (<p disabled> >> <p disabled="">)
encoding of initially un-encoded ampersands when reading .innerHTML, e.g. in attribute values
function simpleValidateHtmlStr(htmlStr, strictBoolean) {
    if (typeof htmlStr !== "string")
        return false;

    var validateHtmlTag = new RegExp("<[a-z]+(\\s+|\"[^\"]*\"\\s?|'[^']*'\\s?|[^'\">])*>", "igm"),
        sdom = document.createElement('div'),
        noSrcNoAmpHtmlStr = htmlStr
            .replace(/ src=/, " svhs___src=")      // disarm src attributes
            .replace(/&amp;/igm, "#svhs#amp##"),   // 'save' already-encoded ampersands
        noSrcNoAmpIgnoreScriptContentHtmlStr = noSrcNoAmpHtmlStr
            .replace(/\n\r?/igm, "#svhs#nl##")                      // temporarily remove line breaks
            .replace(/(<script[^>]*>)(.*?)(<\/script>)/igm, "$1$3") // ignore script contents
            .replace(/#svhs#nl##/igm, "\n\r"),                      // re-add line breaks
        htmlTags = noSrcNoAmpIgnoreScriptContentHtmlStr.match(/<[a-z]+[^>]*>/igm), // get all start-tags
        htmlTagsCount = htmlTags ? htmlTags.length : 0,
        tagsAreValid, resHtmlStr;

    if (!strictBoolean) {
        // ignore <br/> conversions
        noSrcNoAmpHtmlStr = noSrcNoAmpHtmlStr.replace(/<br\s*\/>/, "<br>");
    }

    if (htmlTagsCount) {
        tagsAreValid = htmlTags.reduce(function(isValid, tagStr) {
            return isValid && tagStr.match(validateHtmlTag);
        }, true);
        if (!tagsAreValid) {
            return false;
        }
    }

    try {
        sdom.innerHTML = noSrcNoAmpHtmlStr;
    } catch (err) {
        return false;
    }

    // compare rendered tag-count with expected tag-count
    if (sdom.querySelectorAll("*").length !== htmlTagsCount) {
        return false;
    }

    resHtmlStr = sdom.innerHTML.replace(/&amp;/igm, "&"); // undo the browser's encoding of bare ampersands

    if (!strictBoolean) {
        // ignore empty attribute normalizations
        resHtmlStr = resHtmlStr.replace(/=""/, "");
    }

    // compare html strings while ignoring case, quote-changes, trailing spaces
    var simpleIn = noSrcNoAmpHtmlStr.replace(/["']/igm, "").replace(/\s+/igm, " ").toLowerCase().trim(),
        simpleOut = resHtmlStr.replace(/["']/igm, "").replace(/\s+/igm, " ").toLowerCase().trim();

    if (simpleIn === simpleOut)
        return true;

    return resHtmlStr.replace(/ svhs___src=/igm, " src=").replace(/#svhs#amp##/igm, "&amp;");
}
Here you can find it in a JSFiddle https://jsfiddle.net/abernh/twgj8bev/ , together with different test-cases, including
"<a href='blue.html id='green'>missing attribute quotes</a>" // FAIL
"<a>hell<B>o</B></a>" // PASS
'hell<b>o</b>' // PASS
'<a href=test.html>hell<b>o</b></a>', // PASS
"<a href='test.html'>hell<b>o</b></a>", // PASS
'<ul><li>hell</li><li>hell</li></ul>', // PASS
'<ul><li>hell<li>hell</ul>', // PASS
'<div ng-if="true && valid">ampersands in attributes</div>' // PASS
9 years later, how about using DOMParser?
It accepts a string as a parameter and returns a Document, just like an HTML document.
Thus, when there is a parse error, the returned document object has a <parsererror> element in it.
If you parse your HTML as XML, you can at least check that your HTML is XHTML-compliant.
Example
> const parser = new DOMParser();
> const doc = parser.parseFromString('<div>Input: <input /></div>', 'text/xml');
> (doc.documentElement.querySelector('parsererror') || {}).innerText; // undefined
To wrap this as a function
function isValidHTML(html) {
    const parser = new DOMParser();
    const doc = parser.parseFromString(html, 'text/xml');
    if (doc.documentElement.querySelector('parsererror')) {
        return doc.documentElement.querySelector('parsererror').innerText;
    } else {
        return true;
    }
}
Testing the above function
isValidHTML('<a>hell<B>o</B></a>')             // true
isValidHTML('hell')                            // true
isValidHTML("<a href='test.html'>hell</a>")    // true
isValidHTML("<a href=test.html>hell</a>")      // This page contains the following err..
isValidHTML('<ul><li>a</li><li>b</li></ul>')   // true
isValidHTML('<ul><li>a<li>b</ul>')             // This page contains the following err..
isValidHTML('<div><input /></div>')            // true
isValidHTML('<div><input></div>')              // This page contains the following err..
The above works for very simple HTML.
However, if your HTML contains code-like text (<script>, <style>, etc.), you need to massage it just for the XML validation, even though it is valid HTML.
The following converts such code-like HTML into valid XML syntax:
export function getHtmlError(html) {
    const parser = new DOMParser();
    const htmlForParser = `<xml>${html}</xml>`
        .replace(/(src|href)=".*?&.*?"/g, '$1="OMITTED"')
        .replace(/<script[\s\S]+?<\/script>/gm, '<script>OMITTED</script>')
        .replace(/<style[\s\S]+?<\/style>/gm, '<style>OMITTED</style>')
        .replace(/<pre[\s\S]+?<\/pre>/gm, '<pre>OMITTED</pre>')
        .replace(/&nbsp;/g, ' '); // XML does not define the &nbsp; entity
    const doc = parser.parseFromString(htmlForParser, 'text/xml');
    if (doc.documentElement.querySelector('parsererror')) {
        console.error(htmlForParser.split(/\n/).map( (el, ndx) => `${ndx+1}: ${el}`).join('\n'));
        return doc.documentElement.querySelector('parsererror');
    }
}
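A quick usage sketch for getHtmlError() (hypothetical inputs); it returns undefined when the wrapped snippet parses cleanly as XML:
getHtmlError('<ul><li>a</li><li>b</li></ul>');       // undefined (valid)
getHtmlError('<ul><li>a<li>b</ul>');                 // a <parsererror> element (and the numbered source is logged)
getHtmlError('<script>if (a < b) doIt();</script>'); // undefined -- the script body is replaced with OMITTED first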
function validHTML(html) {
    var openingTags, closingTags;

    html = html.replace(/<[^>]*\/\s?>/g, '');      // Remove all self-closing tags
    html = html.replace(/<(br|hr|img).*?>/g, '');  // Remove all <br>, <hr>, and <img> tags

    openingTags = html.match(/<[^\/].*?>/g) || []; // Get remaining opening tags
    closingTags = html.match(/<\/.+?>/g) || [];    // Get remaining closing tags

    return openingTags.length === closingTags.length ? true : false;
}

// Note: a string without any HTML tag is considered a valid snippet here.
// If that is not valid in your case, check the opening-tag count first.
var htmlContent = "<p>your html content goes here</p>";

if (validHTML(htmlContent)) {
    alert('Valid HTML');
} else {
    alert('Invalid HTML');
}
Using pure JavaScript you may check if an element exists using the following function:
if (typeof(element) != 'undefined' && element != null)
Using the following code we can test this in action:
HTML:
<input type="button" value="Toggle .not-undefined" onclick="toggleNotUndefined()">
<input type="button" value="Check if .not-undefined exists" onclick="checkNotUndefined()">
<p class=".not-undefined"></p>
CSS:
p:after {
    content: "Is 'undefined'";
    color: blue;
}
p.not-undefined:after {
    content: "Is not 'undefined'";
    color: red;
}
JavaScript:
function checkNotUndefined() {
    var phrase = "not ";
    var element = document.querySelector('.not-undefined');
    if (typeof(element) != 'undefined' && element != null) phrase = "";
    alert("Element of class '.not-undefined' does " + phrase + "exist!");
    // $(".thisClass").length checks to see if our elem exists in jQuery
}
function toggleNotUndefined() {
    document.querySelector('p').classList.toggle('not-undefined');
}
It can be found on JSFiddle.
function isHTML(str) {
    var a = document.createElement('div');
    a.innerHTML = str;
    for (var c = a.childNodes, i = c.length; i--;) {
        if (c[i].nodeType == 1) return true;
    }
    return false;
}
Good Luck!
It depends on which JS library you use.
HTML validator for Node.js: https://www.npmjs.com/package/html-validator
HTML validator for jQuery: https://api.jquery.com/jquery.parsehtml/
But, as mentioned before, using the browser to validate broken HTML is a great idea:
function tidy(html) {
    var d = document.createElement('div');
    d.innerHTML = html;
    return d.innerHTML;
}
Expanding on #Tarun's answer from above:
function validHTML(html) {
    // checks the validity of html; requires all tags and property names to use only
    // alphabetical characters and numbers (plus hyphens and underscores for properties)

    html = html.toLowerCase().replace(/(?<=<[^>]+?=\s*"[^"]*)[<>]/g, "").replace(/(?<=<[^>]+?=\s*'[^']*)[<>]/g, ""); // remove all angle brackets from tag properties
    html = html.replace(/<script.*?<\/script>/g, '');       // Remove all script elements
    html = html.replace(/<style.*?<\/style>/g, '');         // Remove all style elements
    html = html.toLowerCase().replace(/<[^>]*\/\s?>/g, ''); // Remove all self closing tags
    html = html.replace(/<(\!|br|hr|img).*?>/g, '');        // Remove all <br>, <hr>, and <img> tags
    //var tags=[...str.matchAll(/<.*?>/g)]; this would allow for unclosed initial and final tag to pass parsing
    html = html.replace(/^[^<>]+|[^<>]+$|(?<=>)[^<>]+(?=<)/gs, ""); // remove all clean text nodes, note that < or > in text nodes will result in artefacts for which we check and return false

    var tags = html.split(/(?<=>)(?=<)/);
    if (tags.length % 2 == 1) {
        console.log("uneven number of tags in " + html);
        return false;
    }

    var tagno = 0;
    while (tags.length > 0) {
        if (tagno == tags.length) {
            console.log("these tags are not closed: " + tags.slice(0, tagno).join());
            return false;
        }
        if (tags[tagno].slice(0, 2) == "</") {
            if (tagno == 0) {
                console.log("this tag has not been opened: " + tags[0]);
                return false;
            }
            var tagSearch = tags[tagno].match(/<\/\s*([\w\-\_]+)\s*>/);
            if (tagSearch === null) {
                console.log("could not identify closing tag " + tags[tagno] + " after " + tags.slice(0, tagno).join());
                return false;
            } else tags[tagno] = tagSearch[1];
            if (tags[tagno] == tags[tagno - 1]) {
                tags.splice(tagno - 1, 2);
                tagno--;
            } else {
                console.log("tag '" + tags[tagno] + "' trying to close these tags: " + tags.slice(0, tagno).join());
                return false;
            }
        } else {
            tags[tagno] = tags[tagno].replace(/(?<=<\s*[\w_\-]+)(\s+[\w\_\-]+(\s*=\s*(".*?"|'.*?'|[^\s\="'<>`]+))?)*/g, ""); // remove all correct properties from tag
            var tagSearch = tags[tagno].match(/<(\s*[\w\-\_]+)/);
            if ((tagSearch === null) || (tags[tagno] != "<" + tagSearch[1] + ">")) {
                console.log("fragmented tag with the following remains: " + tags[tagno]);
                return false;
            }
            var tagSearch = tags[tagno].match(/<\s*([\w\-\_]+)/);
            if (tagSearch === null) {
                console.log("could not identify opening tag " + tags[tagno] + " after " + tags.slice(0, tagno).join());
                return false;
            } else tags[tagno] = tagSearch[1];
            tagno++;
        }
    }
    return true;
}
This performs a few additional checks, such as testing whether tags match and whether properties would parse. As it does not depend on an existing DOM, it can be used in a server environment, but beware: it is slow. Also, in theory, tags can be named much more laxly, as you can use basically any Unicode character (with a few exceptions) in tag and property names. That would not pass my own sanity check, however.
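A quick sanity check of the validator above (hypothetical inputs; like the function itself, this needs an engine with regex lookbehind support):
console.log(validHTML('<div class="a"><p>one</p><p>two</p></div>')); // true
console.log(validHTML('<div><p>unclosed</div>'));                    // false (and a diagnostic is logged)
console.log(validHTML('<em>line<br>break<img src="x.png"></em>'));   // true (void tags are stripped first)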

Javascript to add cdata section on the fly?

I'm having trouble with special characters that exist in an xml node attribute. To combat this, I'm trying to render the attributes as child nodes and, where necessary, using cdata sections to get around the special characters. The problem is, I can't seem to get the cdata section appended to the node correctly.
I'm iterating over the source xml node's attributes and creating new nodes. If the attribute.name = "description" I want to put the attribute.text() in a cdata section and append the new node. That's where I jump the track.
// newXMLData is the new xml document that I've created in memory
for (var ctr = 0; ctr < this.attributes.length; ctr++) { // iterate over the attributes
    if (this.attributes[ctr].name == "Description") { // if the attribute name is "Description" add a CDATA section
        var thisNodeName = this.attributes[ctr].name;
        newXMLDataNode.append("<" + thisNodeName + "></" + thisNodeName + ">");
        var cdata = newXMLData.createCDATASection('test'); // here's where it breaks.
    } else {
        // It's not "Description" so just append the new node.
        newXMLDataNode.append("<" + this.attributes[ctr].name + ">" + $(this.attributes[ctr]).text() + "</" + this.attributes[ctr].name + ">");
    }
}
Any ideas? Is there another way to add a cdata section?
Here's a sample snippet of the source...
<row
pSiteID="4"
pSiteTile="Test Site Name "
pSiteURL="http://www.cnn.com"
ID="1"
Description="<div>blah blah blah since June 2007.&nbsp; T<br>&nbsp;<br>blah blah blah blah&nbsp; </div>"
CreatedDate="2010-09-20 14:46:18"
Comments="Comments example.
" >
here's what I'm trying to create...
<Site>
<PSITEID>4</PSITEID>
<PSITETILE>Test Site Name</PSITETILE>
<PSITEURL>http://www.cnn.com</PSITEURL>
<ID>1</ID>
<DESCRIPTION><![CDATA[<div>blah blah blah since June 2007.&nbsp; T<br>&nbsp;<br>blah blah blah blah&nbsp; </div ]]></DESCRIPTION>
<CREATEDDATE>2010-09-20 14:46:18</CREATEDDATE>
<COMMENTS><![CDATA[ Comments example.
]]></COMMENTS>
</Site>
I had the same issue. I was trying to append CDATA to XML nodes, so I thought it was as easy as adding it like so:
valueNode[0].text = "<![CDATA["+ tmpVal +"]]>";
//valueNode[0] represents "<value></value>"
This does not work, because the whole thing gets interpreted as text, and therefore < (less than) and > (greater than) will be escaped automatically.
What you need to do is use createCDATASection, like this:
var tmpCdata = $xmlDoc[0].createCDATASection(escape("muzi test 002"));
// I'm also escaping special characters as well
valueNode[0].appendChild(tmpCdata);
results will be:
<value><![CDATA[muzi%20test%20002]]></value>
Brettz9 (in the previous answer) explains how to do this, but it's quite complex, so I just wanted to add my solution, which is much simpler.
thanks,
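For reference, here is a minimal standalone sketch (the element names are just examples) showing that a CDATA section created with createCDATASection() survives XMLSerializer():
var doc = document.implementation.createDocument(null, 'Site', null);
var desc = doc.createElement('DESCRIPTION');
desc.appendChild(doc.createCDATASection('<div>blah &nbsp; blah</div>'));
doc.documentElement.appendChild(desc);
new XMLSerializer().serializeToString(doc);
// '<Site><DESCRIPTION><![CDATA[<div>blah &nbsp; blah</div>]]></DESCRIPTION></Site>'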
Not sure of browser support for document.implementation.createDocument or createCDataSection, but this works in Mozilla at least:
<script>
// Define some helpers (not available IE < 9)
function parse (str) {
    return new DOMParser().parseFromString(str, 'text/xml').documentElement;
}
function ser (dom) {
    return new XMLSerializer().serializeToString(dom);
}

// Simulate your XML retrieval
var row = '<row pSiteID="4" pSiteTile="Test Site Name " pSiteURL="http://www.cnn.com" ID="1" Description="<div>blah blah blah since June 2007.&nbsp; T<br>&nbsp;<br>blah blah blah blah&nbsp; </div>" CreatedDate="2010-09-20 14:46:18" Comments="Comments example.
" />';

// Hack to convert source to well-formed XML, or otherwise you can't use DOM methods on it which
// depend on well-formed XML
row = row.replace(/(=\s*")([\s\S]*?)(")/g, function (n0, n1, n2, n3) {
    return n1 +                          // Add back equal sign and opening quote
        n2.replace(/</g, '&lt;').        // Create well-formed XML by avoiding less-than signs inside attributes
           replace(/&nbsp;/g, '&#160;') + // HTML entities (except for gt, lt, amp, quot) must be either converted to numeric character references or your XML must define the same entities
        n3;                              // Add back closing quote
});

// Simulate your retrieval of DOM attributes, though in this context, we're just making attributes into a global
this.attributes = parse(row).attributes;

// Simulate your creation of an XML document
var newXMLData = document.implementation.createDocument(null, 'Site', null);

// Modify your code to avoid jQuery dependency for easier testing and to
// avoid confusion (error?) of having two variables, newXMLData and newXMLDataNode
for (var ctr = 0; ctr < this.attributes.length; ctr++) { // iterate over the attributes
    if (this.attributes[ctr].name == "Description") { // if the attribute name is "Description" add a CDATA section
        var thisNodeName = this.attributes[ctr].name;
        var str = "<" + thisNodeName + "></" + thisNodeName + ">";
        var node = parse(str);
        var cdata = newXMLData.createCDATASection(this.attributes[ctr].textContent);
        node.appendChild(cdata);
        newXMLData.documentElement.appendChild(node);
    }
    else {
        // It's not "Description" so just append the new node.
        var str = "<" + this.attributes[ctr].name + ">" + this.attributes[ctr].textContent + "</" + this.attributes[ctr].name + ">";
        newXMLData.documentElement.appendChild(parse(str));
    }
}

// Prove its working (though you may wish to use toUpperCase() if you need the element names upper-cased);
// if you need CDATA for Comments, you can follow the pattern above to add support for that too
alert(ser(newXMLData));
</script>

how do I access XHR responseBody (for binary data) from Javascript in IE?

I've got a web page that uses XMLHttpRequest to download a binary resource.
In Firefox and Gecko I can use responseText to get the bytes, even if the bytestream includes binary zeroes. I may need to coerce the mimetype with overrideMimeType() to make that happen. In IE, though, responseText doesn't work, because it appears to terminate at the first zero. If you read 100,000 bytes, and byte 7 is a binary zero, you will be able to access only 7 bytes. IE's XMLHttpRequest exposes a responseBody property to access the bytes. I've seen a few posts suggesting that it's impossible to access this property in any meaningful way directly from Javascript. This sounds crazy to me.
xhr.responseBody is accessible from VBScript, so the obvious workaround is to define a method in VBScript in the webpage, and then call that method from Javascript. See jsdap for one example. EDIT: DO NOT USE THIS VBScript!!
var IE_HACK = (/msie/i.test(navigator.userAgent) &&
               !/opera/i.test(navigator.userAgent));

// no no no! Don't do this!
if (IE_HACK) document.write('<script type="text/vbscript">\n\
Function BinaryToArray(Binary)\n\
    Dim i\n\
    ReDim byteArray(LenB(Binary))\n\
    For i = 1 To LenB(Binary)\n\
        byteArray(i-1) = AscB(MidB(Binary, i, 1))\n\
    Next\n\
    BinaryToArray = byteArray\n\
End Function\n\
</script>');

var xml = (window.XMLHttpRequest)
    ? new XMLHttpRequest()                    // Mozilla/Safari/IE7+
    : (window.ActiveXObject)
        ? new ActiveXObject("MSXML2.XMLHTTP") // IE6
        : null;                               // Commodore 64?

xml.open("GET", url, true);
if (xml.overrideMimeType) {
    xml.overrideMimeType('text/plain; charset=x-user-defined');
} else {
    xml.setRequestHeader('Accept-Charset', 'x-user-defined');
}
xml.onreadystatechange = function() {
    if (xml.readyState == 4) {
        if (!binary) {
            callback(xml.responseText);
        } else if (IE_HACK) {
            // call a VBScript method to copy every single byte
            callback(BinaryToArray(xml.responseBody).toArray());
        } else {
            callback(getBuffer(xml.responseText));
        }
    }
};
xml.send('');
Is this really true? The best way? copying every byte? For a large binary stream that's not going to be very efficient.
There is also a possible technique using ADODB.Stream, which is a COM equivalent of a MemoryStream. See here for an example. It does not require VBScript but does require a separate COM object.
if (typeof (ActiveXObject) != "undefined" && typeof (httpRequest.responseBody) != "undefined") {
    // Convert httpRequest.responseBody byte stream to shift_jis encoded string
    var stream = new ActiveXObject("ADODB.Stream");
    stream.Type = 1; // adTypeBinary
    stream.Open();
    stream.Write(httpRequest.responseBody);
    stream.Position = 0;
    stream.Type = 1; // adTypeBinary;
    stream.Read.... /// ???? what here
}
But that's not going to work well - ADODB.Stream is disabled on most machines these days.
In The IE8 developer tools - the IE equivalent of Firebug - I can see the responseBody is an array of bytes and I can even see the bytes themselves. The data is right there. I don't understand why I can't get to it.
Is it possible for me to read it with responseText?
hints? (other than defining a VBScript method)
Yes, the answer I came up with for reading binary data via XHR in IE is to use VBScript injection. This was distasteful to me at first, but I look at it as just one more browser-dependent bit of code.
(The regular XHR and responseText work fine in other browsers; you may have to coerce the MIME type with XMLHttpRequest.overrideMimeType(), which isn't available on IE.)
This is how I got a thing that works like responseText in IE, even for binary data.
First, inject some VBScript as a one-time thing, like this:
if (/msie/i.test(navigator.userAgent) && !/opera/i.test(navigator.userAgent)) {
    var IEBinaryToArray_ByteStr_Script =
        "<!-- IEBinaryToArray_ByteStr -->\r\n"+
        "<script type='text/vbscript' language='VBScript'>\r\n"+
        "Function IEBinaryToArray_ByteStr(Binary)\r\n"+
        "   IEBinaryToArray_ByteStr = CStr(Binary)\r\n"+
        "End Function\r\n"+
        "Function IEBinaryToArray_ByteStr_Last(Binary)\r\n"+
        "   Dim lastIndex\r\n"+
        "   lastIndex = LenB(Binary)\r\n"+
        "   if lastIndex mod 2 Then\r\n"+
        "       IEBinaryToArray_ByteStr_Last = Chr( AscB( MidB( Binary, lastIndex, 1 ) ) )\r\n"+
        "   Else\r\n"+
        "       IEBinaryToArray_ByteStr_Last = "+'""'+"\r\n"+
        "   End If\r\n"+
        "End Function\r\n"+
        "</script>\r\n";

    // inject VBScript
    document.write(IEBinaryToArray_ByteStr_Script);
}
The JS class I'm using that reads binary files exposes a single interesting method, readCharAt(i), which reads the character (a byte, really) at the i'th index. This is how I set it up:
// see doc on http://msdn.microsoft.com/en-us/library/ms535874(VS.85).aspx
function getXMLHttpRequest() {
    if (window.XMLHttpRequest) {
        return new window.XMLHttpRequest;
    }
    else {
        try {
            return new ActiveXObject("MSXML2.XMLHTTP");
        }
        catch(ex) {
            return null;
        }
    }
}

// this fn is invoked if IE
function IeBinFileReaderImpl(fileURL){
    var that = this; // keep a reference for the readystatechange handler below
    this.req = getXMLHttpRequest();
    this.req.open("GET", fileURL, true);
    this.req.setRequestHeader("Accept-Charset", "x-user-defined");

    // my helper to convert from responseBody to a "responseText" like thing
    var convertResponseBodyToText = function (binary) {
        var byteMapping = {};
        for ( var i = 0; i < 256; i++ ) {
            for ( var j = 0; j < 256; j++ ) {
                byteMapping[ String.fromCharCode( i + j * 256 ) ] =
                    String.fromCharCode(i) + String.fromCharCode(j);
            }
        }
        // call into VBScript utility fns
        var rawBytes = IEBinaryToArray_ByteStr(binary);
        var lastChr = IEBinaryToArray_ByteStr_Last(binary);
        return rawBytes.replace(/[\s\S]/g,
            function( match ) { return byteMapping[match]; }) + lastChr;
    };

    this.req.onreadystatechange = function(event){
        if (that.req.readyState == 4) {
            that.status = "Status: " + that.req.status;
            //that.httpStatus = that.req.status;
            if (that.req.status == 200) {
                // this doesn't work
                //fileContents = that.req.responseBody.toArray();
                // this doesn't work
                //fileContents = new VBArray(that.req.responseBody).toArray();
                // this works...
                var fileContents = convertResponseBodyToText(that.req.responseBody);
                fileSize = fileContents.length - 1;
                if (that.fileSize < 0) throwException(_exception.FileLoadFailed);
                that.readByteAt = function(i){
                    return fileContents.charCodeAt(i) & 0xff;
                };
            }
            if (typeof callback == "function"){ callback(that); }
        }
    };
    this.req.send();
}

// this fn is invoked if non IE
function NormalBinFileReaderImpl(fileURL){
    var that = this; // keep a reference for the readystatechange handler below
    this.req = new XMLHttpRequest();
    this.req.open('GET', fileURL, true);
    this.req.onreadystatechange = function(aEvt) {
        if (that.req.readyState == 4) {
            if (that.req.status == 200){
                var fileContents = that.req.responseText;
                fileSize = fileContents.length;
                that.readByteAt = function(i){
                    return fileContents.charCodeAt(i) & 0xff;
                }
                if (typeof callback == "function"){ callback(that); }
            }
            else
                throwException(_exception.FileLoadFailed);
        }
    };
    //XHR binary charset opt by Marcus Granado 2006 [http://mgran.blogspot.com]
    this.req.overrideMimeType('text/plain; charset=x-user-defined');
    this.req.send(null);
}
The conversion code was provided by Miskun.
Very fast, works great.
I used this method to read and extract zip files from Javascript, and also in a class that reads and displays EPUB files in Javascript. Very reasonable performance. About half a second for a 500kb file.
XMLHttpRequest.responseBody is a VBArray object containing the raw bytes. You can convert these objects to standard arrays using the toArray() function:
var data = xhr.responseBody.toArray();
I would suggest two other (fast) options:
First, you can use ADODB.Recordset to convert the byte array into a string. I would guess that this object is more commonly available than ADODB.Stream, which is often disabled for security reasons. This option is VERY fast, less than 30ms for a 500kB file.
Second, if the Recordset component is not accessible, there is a trick to access the byte array data from Javascript. Send your xhr.responseBody to VBScript, pass it through any VBScript string function such as CStr (takes no time), and return it to JS. You will get a weird string with bytes concatenated into 16-bit unicode (in reverse). You can then convert this string quickly into a usable bytestring through a regular expression with dictionary-based replacement. Takes about 1s for 500kB.
For comparison, the byte-by-byte conversion through loops takes several minutes for this same 500kB file, so it's a no-brainer :) Below is the code I have been using; insert it into your header, then call the function ieGetBytes with your xhr.responseBody.
<!--[if IE]>
<script type="text/vbscript">
    'Best case scenario when the ADODB.Recordset object exists
    'We will do the existence test in Javascript (see after)
    'Extremely fast, about 25ms for a 500kB file
    Function ieGetBytesADO(byteArray)
        Dim recordset
        Set recordset = CreateObject("ADODB.Recordset")
        With recordset
            .Fields.Append "temp", 201, LenB(byteArray)
            .Open
            .AddNew
            .Fields("temp").AppendChunk byteArray
            .Update
        End With
        ieGetBytesADO = recordset("temp")
        recordset.Close
        Set recordset = Nothing
    End Function

    'Trick to return a Javascript-readable string from a VBScript byte array
    'Yet the string is not usable as such by Javascript, since the bytes
    'are merged into 16-bit unicode characters. Last character missing if odd length.
    Function ieRawBytes(byteArray)
        ieRawBytes = CStr(byteArray)
    End Function

    'Careful: the last character is missing in case of odd file length
    'We will call the ieLastChr function (below) from Javascript
    'Cannot merge directly within ieRawBytes as the final byte would be duplicated
    Function ieLastChr(byteArray)
        Dim lastIndex
        lastIndex = LenB(byteArray)
        If lastIndex mod 2 Then
            ieLastChr = Chr( AscB( MidB( byteArray, lastIndex, 1 ) ) )
        Else
            ieLastChr = ""
        End If
    End Function
</script>

<script type="text/javascript">
    try {
        // best case scenario, the ADODB.Recordset object exists
        // we can use the VBScript ieGetBytesADO function to transform a byte array into a string
        var ieRecordset = new ActiveXObject('ADODB.Recordset');
        var ieGetBytes = function( byteArray ) {
            return ieGetBytesADO(byteArray);
        }
        ieRecordset = null;
    } catch(err) {
        // no ADODB.Recordset object, we will do the conversion quickly through a regular expression
        // initializes once and for all the translation dictionary to speed up our regexp replacement function
        var ieByteMapping = {};
        for ( var i = 0; i < 256; i++ ) {
            for ( var j = 0; j < 256; j++ ) {
                ieByteMapping[ String.fromCharCode( i + j * 256 ) ] = String.fromCharCode(i) + String.fromCharCode(j);
            }
        }
        // since ADODB is not there, we replace the previous VBScript ieGetBytesADO function with a regExp-based function,
        // quite fast, about 1.3 seconds for 500kB (versus several minutes for byte-by-byte loops over the byte array)
        var ieGetBytes = function( byteArray ) {
            var rawBytes = ieRawBytes(byteArray),
                lastChr = ieLastChr(byteArray);
            return rawBytes.replace(/[\s\S]/g, function( match ) {
                return ieByteMapping[match]; }) + lastChr;
        }
    }
</script>
<![endif]-->
Thanks so much for this solution. The BinaryToArray() function in VBScript works great for me.
Incidentally, I need the binary data to provide it to an applet. (Don't ask me why applets can't be used for downloading binary data. Long story short: weird MS authentication that can't go through applet (URLConn) calls. It's especially weird in cases where users are behind a proxy.)
The applet needs a byte array from this data, so here's what I do to get it:
String[] results = result.toString().split(",");
byte[] byteResults = new byte[results.length];
for (int i = 0; i < results.length; i++) {
    byteResults[i] = (byte) Integer.parseInt(results[i]);
}
The byte array can then be converted into a ByteArrayInputStream for further processing.
Thank you for this post.
I found this link useful:
http://www.codingforums.com/javascript-programming/47018-help-using-responsetext-property-microsofts-xmlhttp-activexobject-ie6.html
Especially this part:
</script>
<script language="VBScript">
Function BinaryToString(Binary)
    Dim I, S
    For I = 1 To LenB(Binary)
        S = S & Chr(AscB(MidB(Binary, I, 1)))
    Next
    BinaryToString = S
End Function
</script>
I've added this to my htm page.
Then I call this function from my javascript:
responseText = BinaryToString(xhr.responseBody);
Works on IE8, IE9, IE10, FF & Chrome.
You could also just make a proxy script that fetches the address you're requesting and base64-encodes it. Then you just have to pass a query string to the proxy script that tells it the address. In IE you have to do the base64 decoding manually in JS, though. But this is a way to go if you don't want to use VBScript.
I used this for my GameBoy Color emulator.
Here is the PHP script that does the magic:
<?php
// Binary Proxy
if (isset($_GET['url'])) {
    try {
        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, stripslashes($_GET['url']));
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($curl, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
        curl_setopt($curl, CURLOPT_POST, false);
        curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 30);
        $result = curl_exec($curl);
        curl_close($curl);
        if ($result !== false) {
            header('Content-Type: text/plain; charset=ASCII');
            header('Expires: '.gmdate('D, d M Y H:i:s \G\M\T', time() + (3600 * 24 * 7)));
            echo(base64_encode($result));
        }
        else {
            header('HTTP/1.0 404 File Not Found');
        }
    }
    catch (Exception $error) { }
}
?>
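On the JavaScript side, a rough sketch of the matching client code (the proxy URL and helper name are made up; atob() is assumed, so old IE would need its own base64 decoder, as noted above):
function fetchBinaryViaProxy(url, callback) {
    var xhr = new XMLHttpRequest();
    xhr.open('GET', 'proxy.php?url=' + encodeURIComponent(url), true);
    xhr.onreadystatechange = function () {
        if (xhr.readyState != 4 || xhr.status != 200) return;
        var text = atob(xhr.responseText);   // base64 -> byte string
        var bytes = new Array(text.length);
        for (var i = 0; i < text.length; i++) {
            bytes[i] = text.charCodeAt(i) & 0xff;
        }
        callback(bytes);                     // plain array of byte values
    };
    xhr.send(null);
}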
I was trying to download a file and then sign it using CAPICOM.DLL. The only way I could do it was by injecting a VBScript function that does the download. This is my solution:
if (/msie/i.test(navigator.userAgent) && !/opera/i.test(navigator.userAgent)) {
    var VBConteudo_Script =
        '<!-- VBConteudo -->\r\n'+
        '<script type="text/vbscript">\r\n'+
        'Function VBConteudo(url)\r\n'+
        '   Set objHTTP = CreateObject("MSXML2.XMLHTTP")\r\n'+
        '   objHTTP.open "GET", url, False\r\n'+
        '   objHTTP.send\r\n'+
        '   If objHTTP.Status = 200 Then\r\n'+
        '       VBConteudo = objHTTP.responseBody\r\n'+
        '   End If\r\n'+
        'End Function\r\n'+
        '\<\/script>\r\n';

    // inject VBScript
    document.write(VBConteudo_Script);
}

Detect difference between & and %26 in location.hash

Analyzing the location.hash with this simple javascript code:
<script type="text/javascript">alert(location.hash);</script>
I have a difficult time separating out GET variables that contain a & (encoded as %26) and a & used to separate variables.
Example one:
code=php&age=15d
Example two:
code=php%20%26%20code&age=15d
As you can see, example 1 has no problems, but getting javascript to know that "code=php & code" in example two is beyond my abilities:
(Note: I'm not really using these variable names, and changing them to something else will only work so long as a search term does not match a search key, so I wouldn't consider that a valid solution.)
There is no difference between %26 and & in a fragment identifier (‘hash’). ‘&’ is only a reserved character with special meaning in a query (‘search’) segment of a URI. Escaping ‘&’ to ‘%26’ need be given no more application-level visibility than escaping ‘a’ to ‘%61’.
Since there is no standard encoding scheme for hiding structured data within a fragment identifier, you could make your own. For example, use ‘+XX’ hex-encoding to encode a character in a component:
hxxp://www.example.com/page#code=php+20+2B+20php&age=15d
function encodeHashComponent(x) {
    return encodeURIComponent(x).split('%').join('+');
}
function decodeHashComponent(x) {
    return decodeURIComponent(x.split('+').join('%'));
}

function getHashParameters() {
    var parts = location.hash.substring(1).split('&');
    var pars = {};
    for (var i = parts.length; i-->0;) {
        var kv = parts[i].split('=');
        var k = kv[0];
        var v = kv.slice(1).join('=');
        pars[decodeHashComponent(k)] = decodeHashComponent(v);
    }
    return pars;
}
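A short usage sketch of the helpers above:
location.hash = '#code=' + encodeHashComponent('php & code') +
                '&age=' + encodeHashComponent('15d');
// location.hash is now "#code=php+20+26+20code&age=15d"
getHashParameters(); // { code: "php & code", age: "15d" }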
Testing on Firefox 3.1, it looks as if the browser converts hex codes to the appropriate characters when populating the location.hash variable, so there is no way JavaScript can know how the original was a single character or a hex code.
If you're trying to encode a character like & inside of your hash variables, I would suggest replacing it with another string.
You can also parse the string in weird ways, like (JS 1.6 here):
function pairs(xs) {
    return xs.length > 1 ? [[xs[0], xs[1]]].concat(pairs(xs.slice(2))) : [];
}

function union(xss) {
    return xss.length == 0 ? [] : xss[0].concat(union(xss.slice(1)));
}

function splitOnLast(s, sub) {
    return s.indexOf(sub) == -1 ? [s] :
        [s.substr(0, s.lastIndexOf(sub)),
         s.substr(s.lastIndexOf(sub) + sub.length)];
}

function objFromPairs(ps) {
    var o = {};
    for (var i = 0; i < ps.length; i++) {
        o[ps[i][0]] = ps[i][1];
    }
    return o;
}

function parseHash(hash) {
    return objFromPairs(
        pairs(
            union(
                hash
                    .substr(1)
                    .split("=")
                    .map(
                        function (s) { return splitOnLast(s, '&'); }))));
}
>>> location.hash
"#code=php & code&age=15d"
>>> parseHash(location.hash)
{ "code": "php & code", "age": "15d" }
Just do the same as you do with the first example, but after you have split on the & then call unescape() to convert the %26 to & and the %20 to a space.
Edit:
Looks like I'm a bit out of date and you should be using decodeURIComponent() now, though I don't see any clear explanation of what it does differently from unescape(), apart from a suggestion that unescape() doesn't handle Unicode properly.
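The practical difference shows up with non-ASCII data: percent-escapes in URIs are UTF-8 byte sequences, which decodeURIComponent() understands and unescape() does not:
decodeURIComponent('%C3%A9'); // "é"  (decodes the two bytes as one UTF-8 character)
unescape('%C3%A9');           // "Ã©" (treats each %XX as a separate Latin-1 character)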
This worked fine for me:
var hash = [];
if (location.hash) {
    hash = location.href.split('#')[1].split('&');
}
