JavaScript CSV Parser Library [closed] - javascript

Closed 9 years ago.
Is there a decent CSV parser library for JavaScript? I've used this and that solution so far. The first solution never puts a new line into a new sub-array (the code itself says so), and the second does not work on text files formatted on Windows with <CR><LF> line endings, i.e. \r\n.
Is it sufficient to apply
text = text.replace(/\r/g, "");
to the Windows CSV files? This actually works, but it feels like a bit of a hack. Are there CSV parsers that are more widely used than a random blogger's solution?

Here's the 'easy' solution
csv.split(/\r\n|\r|\n/g)
It handles:
\n
\r
\r\n
\n\r
Unfortunately, it breaks on values that contain newline chars between delimiters.
For example, the following line entry...
"this is some","valid CSV data","with a \r\nnewline char"
Will break it because the '\r\n' will be mistakenly interpreted as the end of an entry.
For a complete solution, your best bet is to create a ND-FSM (Non-Deterministic Finite State Machine) lexer/parser. If you have ever heard of the Chomsky Hierarchy, CSV can be parsed as a Type III grammar. That means char-by-char or token-by-token processing with state tracking.
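The char-by-char processing with state tracking described above can be sketched roughly as follows. This is only a minimal illustration, not the library mentioned in this answer and not a full RFC 4180 implementation: parseCsv is a hypothetical name, and the sketch assumes ',' as the separator and '"' as the quote character.

```javascript
// Minimal character-by-character CSV parser (illustrative sketch only).
// Assumptions: ',' separates fields, '"' quotes them, "" inside quotes is an escaped quote.
function parseCsv(text) {
  const rows = [];
  let row = [];
  let field = '';
  let inQuotes = false; // the only piece of state this sketch tracks
  let i = 0;

  while (i < text.length) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"') {
        if (text[i + 1] === '"') { // doubled quote: keep one literal quote
          field += '"';
          i += 2;
          continue;
        }
        inQuotes = false; // closing quote
      } else {
        field += ch; // separators and newlines are plain data inside quotes
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      row.push(field);
      field = '';
    } else if (ch === '\r' || ch === '\n') {
      if (ch === '\r' && text[i + 1] === '\n') i += 1; // treat \r\n as one line break
      row.push(field);
      rows.push(row);
      row = [];
      field = '';
    } else {
      field += ch;
    }
    i += 1;
  }
  // Flush the last record when the input has no trailing newline.
  if (field !== '' || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}
```

With the example entry above, the embedded \r\n stays inside the third field because the parser is in the quoted state when it sees it.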
I have a fully RFC 4180 compliant client-side library available but somehow I attracted the attention of a delete-happy mod for external linking. There's a link in my profile if you're interested; otherwise, good luck.
I'll give you fair warning from experience, CSV looks deceptively easy on the surface. After studying tens/hundreds of implementations, I have only seen 3 javascript parsers that did a reasonable job of meeting the spec and none of them were completely RFC compliant. I managed to write one but only with the help of the community and lots and lots of pain.

If you're working in Node, there's an excellent CSV parser that can handle extremely large amounts of data (>GB files) and supports escape characters.
If you're working in browser JS, you could still extract the processing logic from the code so that it operates on a string (instead of a Node Stream).

Here is one way to do it:
// based on json_parse from JavaScript: The Good Parts by D. Crockford
var csv_parse = function () {
var at,
ch,
text,
error = function (m) {
throw {
name: 'SyntaxError',
message: m,
at: at,
text: text
};
},
next = function (c) {
if (c && c !== ch) {
error("Expected '" + c + "' instead of '" + ch + "'");
}
ch = text.charAt(at);
at += 1;
return ch;
},
//needed to handle "" which indicates escaped quote
peek = function () {
return text.charAt(at);
},
white = function () {
while (ch && ch <= ' ' && ch !== '\n') {
next();
}
},
// if numeric, then return number
number = function () {
var number,
string = word();
number = +string;
if (isNaN(number)) {
return string;
} else {
return number;
}
},
word = function () {
var string = '';
// also stop at end of input (ch === ''), otherwise this loops forever
while (ch && ch !== ',' && ch !== '\n') {
string += ch;
next();
}
return string;
},
// the matching " is the end of word not ,
// need to worry about "", which is escaped quote
quoted = function () {
var string ='';
if (ch === '"') {
while (next()) {
if (ch === '"') {
//print('need to know ending quote or escaped quote');
// need to know ending quote or escaped quote ("")
if (peek() === '"') {
//print('maybe double quote near '+string);
next('"');
string += ch;
} else {
next('"');
return string;
}
} else {
string += ch;
}
}
return string;
}
error("Bad string");
},
value = function () {
white();
switch(ch) {
case '-':
return number();
case '"':
return quoted();
default:
return ch >= '0' && ch <= '9' ? number() : word();
}
return number();
},
line = function () {
var array = [];
white();
if (ch === '\n') {
next('\n');
return array;//empty []
}
while (ch) {
array.push( value() );
white();
if (ch === '\n') {
next('\n');
return array;//got something
}
if (!ch) {
return array;// end of input without a trailing newline
}
next(',');// not very liberal with delimiter
white();
}
};
return function (_line) {
var result;
text = _line;
at = 0;
ch = ' ';
result = line();
white();
if (ch) {
error("Syntax error");
}
return result;
};
}();

My function is solid: just drop it in and use it. I hope it is of help to you.
csvToArray v1.3
A compact (508 bytes) but compliant function to convert a CSV string into a 2D array, conforming to the RFC4180 standard.
http://code.google.com/p/csv-to-array/
Common Usage: jQuery
$.ajax({
url: "test.csv",
dataType: 'text',
cache: false
}).done(function(csvAsString){
csvAsArray=csvAsString.csvToArray();
});
Common usage: Javascript
csvAsArray = csvAsString.csvToArray();
Override field separator
csvAsArray = csvAsString.csvToArray("|");
Override record separator
csvAsArray = csvAsString.csvToArray("", "#");
Override Skip Header
csvAsArray = csvAsString.csvToArray("", "", 1);
Override all
csvAsArray = csvAsString.csvToArray("|", "#", 1);


JS (no lookback) regex for replacing :bound SQL vars without replacing 'colons:in:literals' ...?

Messing around with node-mysql, I wrote some code that lets me use PDO-style :bound values (plus ::bound field names), and rewrites the query with ? and ?? respectively where they are found, and builds a linear array of the values when I execute the statement. I did this because when I look at a SQL statement with a ton of ? ?? all over it and have to count the number of params in my execution, it makes my eyes bleed. I want to just assign a standard object at execution time.
The trouble is, after writing this (it works) I realized my regex for finding those colons in the statement had one tiny little problem, namely, it looks like this:
/.?:(\w+)/g
It picks up the first colon if needed and we take it from there. The problem is, it also picks up colons in literals within the query. So if for some reason you wanted a non-bound string as part of your insert/update, it would be replaced by this engine.
Is there any standard regex for picking up every global instance of the word ":param{#}" in the following statement, without picking up the word "Hello:world", in JS, without lookbacks?
INSERT INTO test VALUES(:param1, :param2, 'Hello:world', :param3);
You're often much better off writing a parser than using regular expressions. It's much more flexible, gives you better error reporting and allows you to handle current & future edge cases much more easily.
The string parsing deals with MySQL string literals syntax & escape sequences described here and just skips over them.
I'm not dealing with valid/invalid binding boundaries, but you could add that if you wanted. You could also remove error reporting, such as for unterminated string literals, and just be forgiving.
The lookahead === ':' && peek() !== '=' condition is to ignore the := MySQL operator.
const parseBindings = (() => {
const bindingCharRx = /\w/;
return function(sql) {
const bindings = [];
let i = 0,
lookahead = sql[i];
while (lookahead) {
if (isStringDelim(lookahead)) parseString();
else if (lookahead === ':' && peek() !== '=') parseBinding();
else consume();
}
return bindings;
function parseString() {
const start = i,
delim = lookahead;
consume();
while (lookahead) {
if (lookahead === '\\') {
consume();
consume();
continue;
}
if (lookahead === delim) {
consume();
if (lookahead !== delim) return;
}
consume();
}
throw new Error(`Unterminated string literal starting at index ${start}.`);
}
function isStringDelim(char) {
return char === "'" || char === '"';
}
function parseBinding() {
const start = i;
consume();
while (lookahead && bindingCharRx.test(lookahead)) consume();
const name = sql.slice(start + 1, i);
if (!name.length) {
throw new Error(`Invalid binding starting at index ${start}.`);
}
bindings.push({
start,
end: i,
name: name
});
}
function consume() {
lookahead = sql[++i]
}
function peek() {
return sql[i + 1]
}
}
})();
function replaceNamedBindings(values, sql) {
const bindings = parseBindings(sql);
const bindingNames = new Set(bindings.map(b => b.name));
const unknownBinding = Object.keys(values).find(k => !bindingNames.has(k));
if (unknownBinding) throw new Error(`Couldn't find a binding named '${unknownBinding}'.`);
let lastIndex = 0,
newSql = '';
for (const binding of bindings) {
if (binding.name in values) {
newSql += sql.slice(lastIndex, binding.start) + values[binding.name];
lastIndex = binding.end;
}
}
newSql += sql.slice(lastIndex);
return newSql;
}
const sql = `INSERT INTO test VALUES(:param1, :param2, 'Hello:world', :param3);`;
console.log(replaceNamedBindings({
param1: '(param1 value)',
param2: '(param2 value)',
param3: '(param3 value)'
}, sql));
console.log(parseBindings(sql));
console.log(parseBindings(`:pickup1 ":dontpickup1" ':dontpickup2' := """:dontpickup3" ''':dontpickup4' "\\":dontpickup5" :pickup2`));
//Will throw exception b/c :world is not a binding
console.log(replaceNamedBindings({
world: '(world value)'
}, sql));
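The question actually asked for :name to be rewritten to ? (and ::name to ??) with the values collected into a linear array, rather than substituted into the SQL text. A rough, self-contained sketch of that variant follows. toPositional is a hypothetical name, and the literal handling is a simplified version of the string skipping shown above, so treat it as a starting point rather than a complete MySQL lexer.

```javascript
// Hypothetical helper: rewrites :name to ? and ::name to ?? and returns the
// values in placeholder order. Quoted literals are copied through untouched.
function toPositional(sql, params) {
  const bindingRx = /(::?)(\w+)/y; // sticky: must match exactly at lastIndex
  const values = [];
  let out = '';
  let i = 0;

  while (i < sql.length) {
    const ch = sql[i];
    if (ch === "'" || ch === '"') {
      const start = i;
      i += 1;
      while (i < sql.length) {
        if (sql[i] === '\\') { i += 2; continue; } // backslash escape
        if (sql[i] === ch) {
          if (sql[i + 1] === ch) { i += 2; continue; } // doubled quote
          break;
        }
        i += 1;
      }
      if (i >= sql.length) {
        throw new Error('Unterminated string literal starting at index ' + start + '.');
      }
      i += 1; // step over the closing quote
      out += sql.slice(start, i);
      continue;
    }
    if (ch === ':') {
      bindingRx.lastIndex = i;
      const m = bindingRx.exec(sql);
      if (m) { // ':=' and a bare ':' do not match and fall through
        const name = m[2];
        if (!Object.prototype.hasOwnProperty.call(params, name)) {
          throw new Error("No value supplied for binding '" + name + "'.");
        }
        values.push(params[name]);
        out += m[1] === '::' ? '??' : '?';
        i += m[0].length;
        continue;
      }
    }
    out += ch;
    i += 1;
  }
  return { sql: out, values: values };
}
```

The returned sql and values could then be handed to the driver's normal placeholder mechanism, which keeps the escaping in the driver instead of in string concatenation.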

regex detect url and prepend http:// [duplicate]

This question already has answers here:
Adding http:// to all links without a protocol
(4 answers)
Closed 8 years ago.
I would like to detect URLs that are entered in a text input. I have the following code which prepends http:// to the beginning of what has been entered:
var input = $(this);
var val = input.val();
if (val && !val.match(/^http([s]?):\/\/.*/)) {
input.val('http://' + val);
}
How would I go about adapting this to only prepend the http:// if the input contains a string followed by a TLD? At the moment, if I enter a string such as:
Hello. This is a test
the http:// will get prepended to Hello, even though it's not a URL. Any help would be greatly appreciated.
This simple function works for me. We don't care about the real existence of a TLD domain to gain speed, rather we check the syntax like example.com.
Sorry, I forgot that a VBA-style trim() is not built into older JavaScript engines, so here are helper functions:
// Removes leading whitespaces
function LTrim(value)
{
var re = /\s*((\S+\s*)*)/;
return value.replace(re, "$1");
}
// Removes ending whitespaces
function RTrim(value)
{
var re = /((\s*\S+)*)\s*/;
return value.replace(re, "$1");
}
// Removes leading and ending whitespaces
function trim(value)
{
return LTrim(RTrim(value));
}
function hasDomainTld(strAddress)
{
var strUrlNow = trim(strAddress);
if(strUrlNow.match(/[,\s]/))
{
return false;
}
// two label-like tokens separated by a dot at the end of the input, e.g. example.com
var regex = /[A-Za-z0-9\-_]+\.[A-Za-z0-9\-_]+$/;
return regex.test(strUrlNow);
}
In your code, $(this) is the window object, so I pass objInput in as an argument and use classic JS instead of jQuery:
function checkIt(objInput)
{
var val = objInput.value;
if(val.match(/http:/i)) {
return false;
}
else if (hasDomainTld(val)) {
objInput.value = 'http://' + val;
}
}
Please test yourself: http://jsfiddle.net/SDUkZ/8/
The best solution I have found is to use the following regex:
/\.[a-zA-Z]{2,3}/
This detects the . in the URL followed by the characters of the extension, expecting 2 to 3 letters.
Does this seem OK for basic validation? Please let me know if you see any problems that could arise.
I know that it will also detect email addresses, but this won't matter in this instance.
You need to narrow down your requirements first as URL detection with regular expressions can be very tricky. These are just a few situations where your parser can fail:
IDNs (госуслуги.рф)
Punycode cases (xn--blah)
New TLD being registered (.amazon)
SEO-friendly URLs (domain.com/Everything you need to know about RegEx.aspx)
We recently faced a similar problem and what we ended up doing was a simple check whether the URL starts with either http://, https://, or ftp:// and prepending with http:// if it doesn't start with any of the mentioned schemes. Here's the implementation in TypeScript:
public static EnsureAbsoluteUri(uri: string): string {
var ret = uri || '', m = null, i = -1;
var validSchemes = ko.utils.arrayMap(['http', 'https', 'ftp'], (i) => { return i + '://' });
if (ret && ret.length) {
m = ret.match(/[a-z]+:\/\//gi);
/* Checking against a list of valid schemes and prepending with "http://" if check fails. */
if (m == null || !m.length || (i = $.inArray(m[0].toLowerCase(), validSchemes)) < 0 ||
(i >= 0 && ret.toLowerCase().indexOf(validSchemes[i]) != 0)) {
ret = 'http://' + ret;
}
}
return ret;
}
As you can see, we're not trying to be smart here as we can't predict every possible URL form. Furthermore, this method is usually executed against field values we know are meant to be URLs, so the chance of misdetection is minimal.
Hope this helps.
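For readers without Knockout or jQuery on the page, the same scheme check can be sketched in plain JavaScript. ensureAbsoluteUri is a hypothetical name and this is a simplified take on the approach above, not a translation of the TypeScript: it only accepts http://, https:// or ftp:// at the very start of the value and prepends http:// otherwise.

```javascript
// Hypothetical dependency-free variant of the scheme check described above.
function ensureAbsoluteUri(uri) {
  const value = (uri || '').trim();
  if (!value) {
    return value; // nothing to fix for empty input
  }
  // Accept only an explicit http://, https:// or ftp:// prefix (case-insensitive).
  return /^(?:https?|ftp):\/\//i.test(value) ? value : 'http://' + value;
}
```

Like the original, it makes no attempt to decide whether the value is really a URL, so it is best applied to fields that are known to hold URLs.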

encodeURIComponent throws an exception

I am programmatically building a URI with the help of the encodeURIComponent function using user provided input. However, when the user enters invalid unicode characters (such as U+DFFF), the function throws an exception with the following message:
The URI to be encoded contains an invalid character
I looked this up on MSDN, but that didn't tell me anything I didn't already know.
To correct this error
Ensure the string to be encoded contains only valid Unicode sequences.
My question is, is there a way to sanitize the user provided input to remove all invalid Unicode sequences before I pass it on to the encodeURIComponent function?
Taking the programmatic approach to discover the answer, the only range that turned up any problems was \ud800-\udfff, the range for high and low surrogates:
for (var regex = '/[', firstI = null, lastI = null, i = 0; i <= 65535; i++) {
try {
encodeURIComponent(String.fromCharCode(i));
}
catch(e) {
if (firstI !== null) {
if (i === lastI + 1) {
lastI++;
}
else if (firstI === lastI) {
regex += '\\u' + firstI.toString(16);
firstI = lastI = i;
}
else {
regex += '\\u' + firstI.toString(16) + '-' + '\\u' + lastI.toString(16);
firstI = lastI = i;
}
}
else {
firstI = i;
lastI = i;
}
}
}
if (firstI === lastI) {
regex += '\\u' + firstI.toString(16);
}
else {
regex += '\\u' + firstI.toString(16) + '-' + '\\u' + lastI.toString(16);
}
regex += ']/';
alert(regex); // /[\ud800-\udfff]/
I then confirmed this with a simpler example:
for (var i = 0; i <= 65535; i++) {
if (i >= 0xD800 && i <= 0xDFFF) { continue; } // skip the surrogate range without ending the loop early
try {
encodeURIComponent(String.fromCharCode(i));
}
catch(e) {
alert(e); // Doesn't alert
}
}
alert('ok!');
And this fits with what MSDN says because indeed all those Unicode characters (even valid Unicode "non-characters") besides surrogates are all valid Unicode sequences.
You can indeed filter out high and low surrogates, but when used in a high-low pair, they become legitimate (as they are meant to be used in this way to allow for Unicode to expand (drastically) beyond its original maximum number of characters):
alert(encodeURIComponent('\uD800\uDC00')); // ok
alert(encodeURIComponent('\uD800')); // not ok
alert(encodeURIComponent('\uDC00')); // not ok either
So, if you want to take the easy route and block surrogates, it is just a matter of:
urlPart = urlPart.replace(/[\ud800-\udfff]/g, '');
If you want to strip out unmatched (invalid) surrogates while allowing surrogate pairs (which are legitimate sequences but the characters are rarely ever needed), you can do the following:
function stripUnmatchedSurrogates (str) {
return str.replace(/[\uD800-\uDBFF](?![\uDC00-\uDFFF])/g, '').split('').reverse().join('').replace(/[\uDC00-\uDFFF](?![\uD800-\uDBFF])/g, '').split('').reverse().join('');
}
var urlPart = '\uD801 \uD801\uDC00 \uDC01'
alert(stripUnmatchedSurrogates(urlPart)); // Leaves one valid sequence (representing a single non-BMP character)
If JavaScript had negative lookbehind the function would be a lot less ugly...
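Lookbehind assertions were added to the language in ES2018, so on engines that support them the double reversal is no longer needed. A minimal sketch, assuming such an engine (the function name is reused from above purely for comparison):

```javascript
// Assumes an engine with ES2018 lookbehind support.
function stripUnmatchedSurrogates(str) {
  // Remove a high surrogate that is not followed by a low surrogate,
  // or a low surrogate that is not preceded by a high surrogate.
  return str.replace(
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g,
    ''
  );
}
```

Valid high-low pairs pass through untouched, so the result can be handed to encodeURIComponent without it throwing on lone surrogates.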

Split a CSV string by line skipping newlines contained between quotes

The following regex can split a CSV string by line:
var lines = csv.split(/\r|\r?\n/g);
How could this be adapted to skip newline chars that are contained within a CSV value (i.e. between quotes/double quotes)?
Example:
2,"Evans & Sutherland","230-132-111AA",,"Visual","P
CB",,1,"Offsite",
If you don't see it, here's a version with the newlines visible:
2,"Evans & Sutherland","230-132-111AA",,"Visual","P\r\nCB",,1,"Offsite",\r\n
The part I'm trying to skip over is the newline contained in the middle of the "PCB" entry.
Update:
I probably should've mentioned this before but this is a part of a dedicated CSV parsing library called jquery-csv. To provide a better context I have added the current parser implementation below.
Here's the code for validating and parsing an entry (ie one line):
$.csvEntry2Array = function(csv, meta) {
var meta = (meta !== undefined ? meta : {});
var separator = 'separator' in meta ? meta.separator : $.csvDefaults.separator;
var delimiter = 'delimiter' in meta ? meta.delimiter : $.csvDefaults.delimiter;
// build the CSV validator regex
var reValid = /^\s*(?:D[^D\\]*(?:\\[\S\s][^D\\]*)*D|[^SD\s\\]*(?:\s+[^SD\s\\]+)*)\s*(?:S\s*(?:D[^D\\]*(?:\\[\S\s][^D\\]*)*D|[^SD\s\\]*(?:\s+[^SD\s\\]+)*)\s*)*$/;
reValid = RegExp(reValid.source.replace(/S/g, separator));
reValid = RegExp(reValid.source.replace(/D/g, delimiter));
// build the CSV line parser regex
var reValue = /(?!\s*$)\s*(?:D([^D\\]*(?:\\[\S\s][^D\\]*)*)D|([^SD\s\\]*(?:\s+[^SD\s\\]+)*))\s*(?:S|$)/g;
reValue = RegExp(reValue.source.replace(/S/g, separator), 'g');
reValue = RegExp(reValue.source.replace(/D/g, delimiter), 'g');
// Return NULL if input string is not well formed CSV string.
if (!reValid.test(csv)) {
return null;
}
// "Walk" the string using replace with callback.
var output = [];
csv.replace(reValue, function(m0, m1, m2) {
// Remove backslash from any delimiters in the value
if (m1 !== undefined) {
var reDelimiterUnescape = /\\D/g;
reDelimiterUnescape = RegExp(reDelimiterUnescape.source.replace(/D/, delimiter), 'g');
output.push(m1.replace(reDelimiterUnescape, delimiter));
} else if (m2 !== undefined) {
output.push(m2);
}
return '';
});
// Handle special case of empty last value.
var reEmptyLast = /S\s*$/;
reEmptyLast = RegExp(reEmptyLast.source.replace(/S/, separator));
if (reEmptyLast.test(csv)) {
output.push('');
}
return output;
};
Note: I haven't tested yet but I think I could probably incorporate the last match into the main split/callback.
This is the code that does the split-by-line part:
$.csv2Array = function(csv, meta) {
var meta = (meta !== undefined ? meta : {});
var separator = 'separator' in meta ? meta.separator : $.csvDefaults.separator;
var delimiter = 'delimiter' in meta ? meta.delimiter : $.csvDefaults.delimiter;
var skip = 'skip' in meta ? meta.skip : $.csvDefaults.skip;
// process by line
var lines = csv.split(/\r\n|\r|\n/g);
var output = [];
for(var i in lines) {
if(i < skip) {
continue;
}
// process each value
var line = $.csvEntry2Array(lines[i], {
delimiter: delimiter,
separator: separator
});
output.push(line);
}
return output;
};
For a breakdown of how that regex works, take a look at this answer. Mine is a slightly adapted version. I consolidated the single and double quote matching to match just one text delimiter and made the delimiter/separators dynamic. It does a great job of validating entries, but the line-splitting solution I added on top is pretty frail and breaks on the edge case I described above.
I'm just looking for a solution that walks the string extracting valid entries (to pass on to the entry parser) or fails on bad data returning an error indicating the line the parsing failed on.
Update:
splitLines: function(csv, delimiter) {
var state = 0;
var value = "";
var line = "";
var lines = [];
function endOfRow() {
lines.push(value);
value = "";
state = 0;
};
csv.replace(/(\"|,|\n|\r|[^\",\r\n]+)/gm, function (m0){
switch (state) {
// the start of an entry
case 0:
if (m0 === "\"") {
state = 1;
} else if (m0 === "\n") {
endOfRow();
} else if (/^\r$/.test(m0)) {
// carriage returns are ignored
} else {
value += m0;
state = 3;
}
break;
// delimited input
case 1:
if (m0 === "\"") {
state = 2;
} else {
value += m0;
state = 1;
}
break;
// delimiter found in delimited input
case 2:
// is the delimiter escaped?
if (m0 === "\"" && value.substr(value.length - 1) === "\"") {
value += m0;
state = 1;
} else if (m0 === ",") {
value += m0;
state = 0;
} else if (m0 === "\n") {
endOfRow();
} else if (m0 === "\r") {
// Ignore
} else {
throw new Error("Illegal state");
}
break;
// un-delimited input
case 3:
if (m0 === ",") {
value += m0;
state = 0;
} else if (m0 === "\"") {
throw new Error("Unquoted delimiter found");
} else if (m0 === "\n") {
endOfRow();
} else if (m0 === "\r") {
// Ignore
} else {
throw new Error("Illegal data");
}
break;
default:
throw new Error("Unknown state");
}
return "";
});
if (state != 0) {
endOfRow();
}
return lines;
}
All it took is 4 states for a line splitter:
0: the start of an entry
1: the following is quoted
2: a second quote has been encountered
3: the following isn't quoted
It's almost a complete parser. For my use case, I just wanted a line splitter so I could provide a more granular approach to processing CSV data.
Note: Credit for this approach goes to another dev whom I won't name publicly without his permission. All I did was adapt it from a complete parser to a line-splitter.
Update:
Discovered a few broken edge cases in the previous lineSplitter implementation. The one provided should be fully RFC 4180 compliant.
As I noted in a comment, there is no complete solution using just a single regex.
A novel method that uses several regexes, splitting on commas and joining back the strings with embedded commas, is described here.
Personally, I would use a simple finite state machine, as described here.
The state machine has more code, but the code is cleaner and it's clear what each piece of code is doing. Longer term this will be much more reliable and maintainable.
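To illustrate how small such a state machine can be when only line splitting is needed, here is a rough sketch with a single piece of state (inside or outside a quoted value). splitCsvLines is a hypothetical name, and the sketch assumes '"' as the text delimiter with "" as its escaped form; it is not the jquery-csv implementation.

```javascript
// Hypothetical quote-aware line splitter: returns the raw text of each record.
function splitCsvLines(csv) {
  const lines = [];
  let start = 0;        // index where the current record begins
  let inQuotes = false; // true while inside a quoted value

  for (let i = 0; i < csv.length; i++) {
    const ch = csv[i];
    if (ch === '"') {
      // An escaped "" toggles twice, which leaves the state unchanged.
      inQuotes = !inQuotes;
    } else if (!inQuotes && (ch === '\n' || ch === '\r')) {
      lines.push(csv.slice(start, i));
      if (ch === '\r' && csv[i + 1] === '\n') i++; // consume \r\n as one break
      start = i + 1;
    }
  }
  if (inQuotes) {
    throw new Error('Unterminated quoted value in record ' + (lines.length + 1) + '.');
  }
  if (start < csv.length) {
    lines.push(csv.slice(start)); // last record without a trailing newline
  }
  return lines;
}
```

Each returned record can then be passed to an entry parser, and the error reports which record the input broke on.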
It's not a good idea to use regexes to parse. Better to use them to detect the "bad" splits and then merge them back:
var lines = csv.split(/\r?\n/g);
var bad = [];
for(var i=lines.length-1; i> 0; i--) {
// find all the unescaped quotes on the line:
var m = lines[i].match(/[^\\]?\"/g);
// if there is an odd number of them, a quoted value spans this line and the line before it:
if((m ? m.length : 0) % 2 == 1) { bad.push(i--); }
}
// starting at the bottom of the list, merge lines back, using \r\n
for(var b=0,len=bad.length; b < len; b++) {
lines.splice(bad[b]-1, 2, lines[bad[b]-1]+"\r\n"+lines[bad[b]]);
}
(This answer is licensed under both CC0 and WTFPL.)
Be careful: that newline is PART of that value. It's not PCB, it's P\nCB.
However, why can't you just use string.split(',')? If need be, you can run through the list and cast to ints or remove the padded quotation marks.

Convert HTML Character Entities back to regular text using javascript

The question says it all :)
E.g. we have &gt;, we need > using only JavaScript.
Update: It seems jquery is the easy way out. But, it would be nice to have a lightweight solution. More like a function which is capable to do this by itself.
You could do something like this:
String.prototype.decodeHTML = function() {
var map = {"gt":">" /* , … */};
return this.replace(/&(#(?:x[0-9a-f]+|\d+)|[a-z]+);?/gi, function($0, $1) {
if ($1[0] === "#") {
return String.fromCharCode($1[1].toLowerCase() === "x" ? parseInt($1.substr(2), 16) : parseInt($1.substr(1), 10));
} else {
return map.hasOwnProperty($1) ? map[$1] : $0;
}
});
};
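A standalone variant of the same idea, written as a plain function instead of a String.prototype extension, might look like the sketch below. decodeHtmlEntities and the NAMED table are hypothetical; the table holds only a handful of sample entities and would need extending for real use. It uses String.fromCodePoint so that numeric references above U+FFFF decode to a single character.

```javascript
// Hypothetical lookup table with a few sample named entities only.
const NAMED = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00A0' };

function decodeHtmlEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z][a-z0-9]*);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10FFFF ? String.fromCodePoint(code) : match;
    }
    // Named entities are case-sensitive, so look the name up as written.
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match;
  });
}
```

Because replace makes a single pass, an input such as &amp;lt; decodes to &lt; rather than being decoded twice.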
function decodeEntities(s){
var str, temp= document.createElement('p');
temp.innerHTML= s;
str= temp.textContent || temp.innerText;
temp=null;
return str;
}
alert(decodeEntities('&lt;'))
/* returned value: (String)
<
*/
I know there are libraries out there, but here are a couple of solutions for browsers. These work well when placing html entity data strings into human editable areas where you want the characters to be shown, such as textarea's or input[type=text].
I add this answer as I have to support older versions of IE and I feel that it wraps up a few days worth of research and testing. I hope somebody finds this useful.
First this is for more modern browsers using jQuery, Please note that this should NOT be used if you have to support versions of IE before 10 (7, 8, or 9) as it will strip out the newlines leaving you with just one long line of text.
if (!String.prototype.HTMLDecode) {
String.prototype.HTMLDecode = function () {
var str = this.toString(),
$decoderEl = $('<textarea />');
str = $decoderEl.html(str)
.text()
.replace(/<br((\/)|( \/))?>/gi, "\r\n");
$decoderEl.remove();
return str;
};
}
This next one is based on kennebec's work above, with some differences which are mostly for the sake of older IE versions. This does not require jQuery, but does still require a browser.
if (!String.prototype.HTMLDecode) {
String.prototype.HTMLDecode = function () {
var str = this.toString(),
//Create an element for decoding
decoderEl = document.createElement('p');
//Bail if empty, otherwise IE7 will return undefined when
//OR-ing the 2 empty strings from innerText and textContent
if (str.length == 0) {
return str;
}
//convert newlines to <br's> to save them
str = str.replace(/((\r\n)|(\r)|(\n))/gi, " <br/>");
decoderEl.innerHTML = str;
/*
We use innerText first as IE strips newlines out with textContent.
There is said to be a performance hit for this, but sometimes
correctness of data (keeping newlines) must take precedence.
*/
str = decoderEl.innerText || decoderEl.textContent;
//clean up the decoding element
decoderEl = null;
//replace back in the newlines
return str.replace(/<br((\/)|( \/))?>/gi, "\r\n");
};
}
/*
Usage:
var str = "&gt;";
return str.HTMLDecode();
returned value:
(String) >
*/
Here is a "class" for decoding whole HTML document.
HTMLDecoder = {
tempElement: document.createElement('span'),
decode: function(html) {
var _self = this;
// return the decoded text; without the return, decode() always yields undefined
return html.replace(/&(#(?:x[0-9a-f]+|\d+)|[a-z]+);/gi,
function(str) {
_self.tempElement.innerHTML= str;
str = _self.tempElement.textContent || _self.tempElement.innerText;
return str;
}
);
}
}
Note that I used Gumbo's regexp for catching entities, but for fully valid HTML documents (or XHTML) you could simply use /&[^;]+;/g.
There is nothing built in, but there are many libraries that have been written to do this.
Here is one.
And here one that is a jQuery plugin.
