Convert HTML Character Entities back to regular text using javascript


The question says it all :)
E.g. we have &gt; and we need >, using only JavaScript.
Update: It seems jQuery is the easy way out, but it would be nice to have a lightweight solution: a standalone function that can do this by itself.

You could do something like this:
String.prototype.decodeHTML = function() {
var map = {"gt":">" /* , … */};
return this.replace(/&(#(?:x[0-9a-f]+|\d+)|[a-z]+);?/gi, function($0, $1) {
if ($1[0] === "#") {
return String.fromCharCode($1[1].toLowerCase() === "x" ? parseInt($1.substr(2), 16) : parseInt($1.substr(1), 10));
} else {
return map.hasOwnProperty($1) ? map[$1] : $0;
}
});
};
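A minimal standalone sketch of the same idea, without extending String.prototype. The function name decodeHtmlEntities and the tiny named-entity table are assumptions made for illustration; a realistic table would be far longer. It uses String.fromCodePoint so that numeric references above U+FFFF also decode.

```javascript
// Hypothetical helper; the entity table is deliberately tiny.
var NAMED_ENTITIES = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'", nbsp: "\u00A0" };

function decodeHtmlEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z][a-z0-9]*);/gi, function (match, body) {
    if (body.charAt(0) === "#") {
      var isHex = body.charAt(1).toLowerCase() === "x";
      var code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10FFFF ? String.fromCodePoint(code) : match;
    }
    // Unknown names are returned unchanged.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body) ? NAMED_ENTITIES[body] : match;
  });
}

console.log(decodeHtmlEntities("1 &lt; 2 &amp;&amp; 3 &gt; 2")); // 1 < 2 && 3 > 2
console.log(decodeHtmlEntities("&#65;&#x42;&unknown;"));         // AB&unknown;
```

Unlike the DOM-based answers, this runs anywhere, but it only knows the entities listed in its table.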

function decodeEntities(s){
var str, temp= document.createElement('p');
temp.innerHTML= s;
str= temp.textContent || temp.innerText;
temp=null;
return str;
}
alert(decodeEntities('&lt;'))
/* returned value: (String)
<
*/

I know there are libraries out there, but here are a couple of solutions for browsers. These work well when placing HTML entity data strings into human-editable areas where you want the characters to be shown, such as textareas or input[type=text].
I add this answer because I have to support older versions of IE, and I feel that it wraps up a few days' worth of research and testing. I hope somebody finds this useful.
First, this one is for more modern browsers using jQuery. Please note that it should NOT be used if you have to support versions of IE before 10 (7, 8, or 9), as it will strip out the newlines, leaving you with just one long line of text.
if (!String.prototype.HTMLDecode) {
String.prototype.HTMLDecode = function () {
var str = this.toString(),
$decoderEl = $('<textarea />');
str = $decoderEl.html(str)
.text()
.replace(/<br((\/)|( \/))?>/gi, "\r\n");
$decoderEl.remove();
return str;
};
}
This next one is based on kennebec's work above, with some differences which are mostly for the sake of older IE versions. This does not require jQuery, but does still require a browser.
if (!String.prototype.HTMLDecode) {
String.prototype.HTMLDecode = function () {
var str = this.toString(),
//Create an element for decoding
decoderEl = document.createElement('p');
//Bail if empty, otherwise IE7 will return undefined when
//OR-ing the 2 empty strings from innerText and textContent
if (str.length == 0) {
return str;
}
//convert newlines to <br's> to save them
str = str.replace(/((\r\n)|(\r)|(\n))/gi, " <br/>");
decoderEl.innerHTML = str;
/*
We use innerText first as IE strips newlines out with textContent.
There is said to be a performance hit for this, but sometimes
correctness of data (keeping newlines) must take precedence.
*/
str = decoderEl.innerText || decoderEl.textContent;
//clean up the decoding element
decoderEl = null;
//replace back in the newlines
return str.replace(/<br((\/)|( \/))?>/gi, "\r\n");
};
}
/*
Usage:
var str = "&gt;";
return str.HTMLDecode();
returned value:
(String) >
*/

Here is a "class" for decoding a whole HTML document.
var HTMLDecoder = {
tempElement: document.createElement('span'),
decode: function(html) {
var _self = this;
// return the result; without this, decode() always yields undefined
return html.replace(/&(#(?:x[0-9a-f]+|\d+)|[a-z]+);/gi,
function(str) {
_self.tempElement.innerHTML= str;
str = _self.tempElement.textContent || _self.tempElement.innerText;
return str;
}
);
}
};
Note that I used Gumbo's regexp for catching entities, but for fully valid HTML documents (or XHTML) you could simply use /&[^;]+;/g.

There is nothing built in, but there are many libraries that have been written to do this.
Here is one.
And here one that is a jQuery plugin.

Related

Check if HTML snippet is valid with JavaScript

I need a reliable JavaScript library / function to check if an HTML snippet is valid that I can call from my code. For example, it should check that opened tags and quotation marks are closed, nesting is correct, etc.
I don't want the validation to fail because something is not 100% standard (but would work anyway).

Update: this answer is limited - please see the edit below.
Expanding on @kolink's answer, I use:
var checkHTML = function(html) {
var doc = document.createElement('div');
doc.innerHTML = html;
return ( doc.innerHTML === html );
}
I.e., we create a temporary div with the HTML. In order to do this, the browser will create a DOM tree based on the HTML string, which may involve closing tags etc.
Comparing the div's HTML contents with the original HTML will tell us if the browser needed to change anything.
checkHTML('<a>hell<b>o</b>')
Returns false.
checkHTML('<a>hell<b>o</b></a>')
Returns true.
Edit: As @Quentin notes below, this is excessively strict for a variety of reasons: browsers will often add omitted closing tags, even where the closing tag is optional for that element. E.g.:
<p>one para
<p>second para
...is considered valid (since Ps are allowed to omit closing tags) but checkHTML will return false. Browsers will also normalise tag cases, and alter white space. You should be aware of these limits when deciding to use this approach.

Well, this code:
function tidy(html) {
var d = document.createElement('div');
d.innerHTML = html;
return d.innerHTML;
}
This will "correct" malformed HTML to the best of the browser's ability. If that's helpful to you, it's a lot easier than trying to validate HTML.

None of the solutions presented so far is doing a good job in answering the original question, especially when it comes to
I don't want the validation to fail because something is not 100%
standard (but would work anyways).
tldr >> check the JSFiddle
So I used the input of the answers and comments on this topic and created a method that does the following:
checks html string tag by tag if valid
tries to render html string
compares theoretically to be created tag count with actually rendered html dom tag count
if checked 'strict', <br/> and empty attribute normalizations ="" are not ignored
compares rendered innerHTML with given html string (while ignoring whitespaces and quotes)
Returns
true if rendered html is same as given html string
false if one of the checks fails
normalized html string if rendered html seems valid but is not equal to given html string
Normalized means that, on rendering, the browser sometimes ignores or repairs specific parts of the input (like adding missing closing tags for <p>) and converts others (like single to double quotes, or encoding of ampersands).
Making a distinction between "failed" and "normalized" makes it possible to flag the content to the user as "this will not be rendered as you might expect it".
Most of the time, normalized gives back an only slightly altered version of the original html string. Still, sometimes the result is quite different. So this should be used e.g. to flag user input for further review before saving it to a db or rendering it blindly. (See the JSFiddle for examples of normalization.)
The checks take the following exceptions into consideration
ignoring of normalization of single quotes to double quotes
image and other tags with a src attribute are 'disarmed' during rendering
(if non strict) ignoring of <br/> >> <br> conversion
(if non strict) ignoring of normalization of empty attributes (<p disabled> >> <p disabled="">)
encoding of initially un-encoded ampersands when reading .innerHTML, e.g. in attribute values
.
function simpleValidateHtmlStr(htmlStr, strictBoolean) {
if (typeof htmlStr !== "string")
return false;
var validateHtmlTag = new RegExp("<[a-z]+(\\s+|\"[^\"]*\"\\s?|'[^']*'\\s?|[^'\">])*>", "igm"), // \\s so the escape survives the string literal
sdom = document.createElement('div'),
noSrcNoAmpHtmlStr = htmlStr
.replace(/ src=/, " svhs___src=") // disarm src attributes
.replace(/&amp;/igm, "#svhs#amp##"), // 'save' encoded ampersands
noSrcNoAmpIgnoreScriptContentHtmlStr = noSrcNoAmpHtmlStr
.replace(/\n\r?/igm, "#svhs#nl##") // temporarily remove line breaks
.replace(/(<script[^>]*>)(.*?)(<\/script>)/igm, "$1$3") // ignore script contents
.replace(/#svhs#nl##/igm, "\n\r"), // re-add line breaks
htmlTags = noSrcNoAmpIgnoreScriptContentHtmlStr.match(/<[a-z]+[^>]*>/igm), // get all start-tags
htmlTagsCount = htmlTags ? htmlTags.length : 0,
tagsAreValid, resHtmlStr;
if(!strictBoolean){
// ignore <br/> conversions
noSrcNoAmpHtmlStr = noSrcNoAmpHtmlStr.replace(/<br\s*\/>/, "<br>");
}
if (htmlTagsCount) {
tagsAreValid = htmlTags.reduce(function(isValid, tagStr) {
return isValid && tagStr.match(validateHtmlTag);
}, true);
if (!tagsAreValid) {
return false;
}
}
try {
sdom.innerHTML = noSrcNoAmpHtmlStr;
} catch (err) {
return false;
}
// compare rendered tag-count with expected tag-count
if (sdom.querySelectorAll("*").length !== htmlTagsCount) {
return false;
}
resHtmlStr = sdom.innerHTML.replace(/&amp;/igm, "&"); // undo '&' encoding
if(!strictBoolean){
// ignore empty attribute normalizations
resHtmlStr = resHtmlStr.replace(/=""/, "");
}
// compare html strings while ignoring case, quote-changes, trailing spaces
var
simpleIn = noSrcNoAmpHtmlStr.replace(/["']/igm, "").replace(/\s+/igm, " ").toLowerCase().trim(),
simpleOut = resHtmlStr.replace(/["']/igm, "").replace(/\s+/igm, " ").toLowerCase().trim();
if (simpleIn === simpleOut)
return true;
// restore the disarmed src attributes and the saved ampersand entities
return resHtmlStr.replace(/ svhs___src=/igm, " src=").replace(/#svhs#amp##/, "&amp;");
}
Here you can find it in a JSFiddle https://jsfiddle.net/abernh/twgj8bev/ , together with different test-cases, including
"<a href='blue.html id='green'>missing attribute quotes</a>" // FAIL
"<a>hell<B>o</B></a>" // PASS
'hell<b>o</b>' // PASS
'<a href=test.html>hell<b>o</b></a>', // PASS
"<a href='test.html'>hell<b>o</b></a>", // PASS
'<ul><li>hell</li><li>hell</li></ul>', // PASS
'<ul><li>hell<li>hell</ul>', // PASS
'<div ng-if="true && valid">ampersands in attributes</div>' // PASS
.

9 years later, how about using DOMParser?
It accepts a string as its parameter and returns a Document, just like a parsed HTML page.
When parsing fails, the returned document contains a <parsererror> element.
If you parse your HTML as XML, you can at least check that it is XHTML-compliant.
Example
> const parser = new DOMParser();
> const doc = parser.parseFromString('<div>Input: <input /></div>', 'text/xml');
> (doc.documentElement.querySelector('parsererror') || {}).innerText; // undefined
To wrap this as a function
function isValidHTML(html) {
const parser = new DOMParser();
const doc = parser.parseFromString(html, 'text/xml');
if (doc.documentElement.querySelector('parsererror')) {
return doc.documentElement.querySelector('parsererror').innerText;
} else {
return true;
}
}
Testing the above function
isValidHTML('<a>hell<B>o</B></a>') // true
isValidHTML('hell') // true
isValidHTML("<a href='test.html'>hell</a>") // true
isValidHTML("<a href=test.html>hell</a>") // This page contains the following err..
isValidHTML('<ul><li>a</li><li>b</li></ul>') // true
isValidHTML('<ul><li>a<li>b</ul>') // This page contains the following err..
isValidHTML('<div><input /></div>') // true
isValidHTML('<div><input></div>') // This page contains the following err..
The above works for very simple HTML.
However, if your HTML contains code-like text (<script>, <style>, etc.), you need to rewrite it before XML validation, even though it is valid HTML.
The following converts code-like HTML into valid XML syntax.
export function getHtmlError(html) {
const parser = new DOMParser();
const htmlForParser = `<xml>${html}</xml>`
.replace(/(src|href)=".*?&.*?"/g, '$1="OMITTED"')
.replace(/<script[\s\S]+?<\/script>/gm, '<script>OMITTED</script>')
.replace(/<style[\s\S]+?<\/style>/gm, '<style>OMITTED</style>')
.replace(/<pre[\s\S]+?<\/pre>/gm, '<pre>OMITTED</pre>')
.replace(/&nbsp;/g, ' ');
const doc = parser.parseFromString(htmlForParser, 'text/xml');
if (doc.documentElement.querySelector('parsererror')) {
console.error(htmlForParser.split(/\n/).map( (el, ndx) => `${ndx+1}: ${el}`).join('\n'));
return doc.documentElement.querySelector('parsererror');
}
}

function validHTML(html) {
var openingTags, closingTags;
html = html.replace(/<[^>]*\/\s?>/g, ''); // Remove all self closing tags
html = html.replace(/<(br|hr|img).*?>/g, ''); // Remove all <br>, <hr>, and <img> tags
openingTags = html.match(/<[^\/].*?>/g) || []; // Get remaining opening tags
closingTags = html.match(/<\/.+?>/g) || []; // Get remaining closing tags
return openingTags.length === closingTags.length ? true : false;
}
var htmlContent = "<p>your html content goes here</p>"; // Note: a string without any HTML tags is considered a valid HTML snippet. If that is not valid in your case, check the opening tag count first.
if(validHTML(htmlContent)) {
alert('Valid HTML')
}
else {
alert('Invalid HTML');
}

Using pure JavaScript you may check if an element exists using the following function:
if (typeof(element) != 'undefined' && element != null)
Using the following code we can test this in action:
HTML:
<input type="button" value="Toggle .not-undefined" onclick="toggleNotUndefined()">
<input type="button" value="Check if .not-undefined exists" onclick="checkNotUndefined()">
<p class=".not-undefined"></p>
CSS:
p:after {
content: "Is 'undefined'";
color: blue;
}
p.not-undefined:after {
content: "Is not 'undefined'";
color: red;
}
JavaScript:
function checkNotUndefined(){
var phrase = "not ";
var element = document.querySelector('.not-undefined');
if (typeof(element) != 'undefined' && element != null) phrase = "";
alert("Element of class '.not-undefined' does "+phrase+"exist!");
// $(".thisClass").length checks to see if our elem exists in jQuery
}
function toggleNotUndefined(){
document.querySelector('p').classList.toggle('not-undefined');
}
It can be found on JSFiddle.

function isHTML(str)
{
var a = document.createElement('div');
a.innerHTML = str;
for (var c = a.childNodes, i = c.length; i--; ) // childNodes is lower-case; the for header needs both semicolons
{
if (c[i].nodeType == 1) return true;
}
return false;
}
Good Luck!

It depends on which JS library you use.
HTML validator for Node.js https://www.npmjs.com/package/html-validator
Html validator for jQuery https://api.jquery.com/jquery.parsehtml/
But, as mentioned before, using the browser to validate broken HTML is a great idea:
function tidy(html) {
var d = document.createElement('div');
d.innerHTML = html;
return d.innerHTML;
}

Expanding on @Tarun's answer from above:
function validHTML(html) { // checks the validity of html, requires all tags and property-names to only use alphabetical characters and numbers (and hyphens, underscore for properties)
html = html.toLowerCase().replace(/(?<=<[^>]+?=\s*"[^"]*)[<>]/g,"").replace(/(?<=<[^>]+?=\s*'[^']*)[<>]/g,""); // remove all angle brackets from tag properties
html = html.replace(/<script.*?<\/script>/g, ''); // Remove all script-elements
html = html.replace(/<style.*?<\/style>/g, ''); // Remove all style elements tags
html = html.toLowerCase().replace(/<[^>]*\/\s?>/g, ''); // Remove all self closing tags
html = html.replace(/<(\!|br|hr|img).*?>/g, ''); // Remove all <br>, <hr>, and <img> tags
//var tags=[...str.matchAll(/<.*?>/g)]; this would allow for unclosed initial and final tag to pass parsing
html = html.replace(/^[^<>]+|[^<>]+$|(?<=>)[^<>]+(?=<)/gs,""); // remove all clean text nodes, note that < or > in text nodes will result in artefacts for which we check and return false
var tags = html.split(/(?<=>)(?=<)/); // declared with var to avoid an implicit global
if (tags.length%2==1) {
console.log("uneven number of tags in "+html)
return false;
}
var tagno=0;
while (tags.length>0) {
if (tagno==tags.length) {
console.log("these tags are not closed: "+tags.slice(0,tagno).join());
return false;
}
if (tags[tagno].slice(0,2)=="</") {
if (tagno==0) {
console.log("this tag has not been opened: "+tags[0]);
return false;
}
var tagSearch=tags[tagno].match(/<\/\s*([\w\-\_]+)\s*>/);
if (tagSearch===null) {
console.log("could not identify closing tag "+tags[tagno]+" after "+tags.slice(0,tagno).join());
return false;
} else tags[tagno]=tagSearch[1];
if (tags[tagno]==tags[tagno-1]) {
tags.splice(tagno-1,2);
tagno--;
} else {
console.log("tag '"+tags[tagno]+"' trying to close these tags: "+tags.slice(0,tagno).join());
return false;
}
} else {
tags[tagno]=tags[tagno].replace(/(?<=<\s*[\w_\-]+)(\s+[\w\_\-]+(\s*=\s*(".*?"|'.*?'|[^\s\="'<>`]+))?)*/g,""); // remove all correct properties from tag
var tagSearch=tags[tagno].match(/<(\s*[\w\-\_]+)/);
if ((tagSearch===null) || (tags[tagno]!="<"+tagSearch[1]+">")) {
console.log("fragmented tag with the following remains: "+tags[tagno]);
return false;
}
var tagSearch=tags[tagno].match(/<\s*([\w\-\_]+)/);
if (tagSearch===null) {
console.log("could not identify opening tag "+tags[tagno]+" after "+tags.slice(0,tagno).join());
return false;
} else tags[tagno]=tagSearch[1];
tagno++;
}
}
return true;
}
This performs a few additional checks, such as testing whether tags match and whether properties would parse. As it does not depend on an existing DOM, it can be used in a server environment, but beware: it is slow. Also, in theory, tags can be named much more laxly, as you can use basically any Unicode character (with a few exceptions) in tag and property names. Such names would not pass my own sanity check, however.

Problem understanding the function arguments ($0, $1) (jQuery, JS)

I have a doubt about function($0, $1):
// $0, $1: two arguments
These two arguments are not defined anywhere, yet they hold some data.
Can anyone help me understand
how these two arguments get their values?
function strip_tags(input, allowed) {
allowed = (((allowed || "") + "").toLowerCase().match(/<[a-z][a-z0-9]*>/g) || []).join('');
//console.log('----------->'+allowed.join('ss'));
var tags = /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi,
commentsAndPhpTags = /<!--[\s\S]*?-->|<\?(?:php)?[\s\S]*?\?>/gi;
return input.replace(commentsAndPhpTags,'').replace(tags, function ($0, $1) { // need help to understand $0 , $1
//console.log('----------->'+$1);
return allowed.indexOf('<' + $1.toLowerCase() + '>') > -1 ? $0 : '';
});
}

That is a really bad way to sanitize markup. It's almost guaranteed to have some loopholes. A simpler way would just be to strip all markup:
var stripTags = function(str) {
return str.replace(/<[^>]+>/g, '');
};
As far as allowing specific elements goes, it would be better to write a tokenizer, iterate over the tokens, drop everything that's not allowed, and then output the markup from those tokens.
But if you don't care to write a tokenizer, this would be a better way of going about it, even though it's still kind of crude:
var allowed = { p: true, a: true };
var sanitize = function(str) {
return str.replace(/<\s*\/?\s*([^\s>]+)[^>]*>/g, function(tag, name) {
if (!allowed[name.toLowerCase()]) {
return '';
}
return tag;
});
};
But as the comment above mentions, if you're only sanitizing a user's markup on the client-side, it's a major problem. You need to be doing sanitization on the server-side.

return input.replace(commentsAndPhpTags, '').replace(tags, function (input, group1) {
//console.log('----------->'+group1);
return allowed.indexOf('<' + group1.toLowerCase() + '>') > -1 ? input : '';
});
Your regex /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi contains only one capture group, the part inside the parentheses, ([a-z][a-z0-9]*). replace() passes your function the matched substring first, followed by the capture groups.
However, your regex should be like this /(<\/?[a-z][a-z0-9]*\b[^>]*>)/gi in order to be able to strip the tags.
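To make the argument order concrete, here is a small sketch using the tags regex from the question. The sample input and the variable names (match, tagName, offset, seen) are chosen only for illustration; $0 and $1 in the question are just another naming of the first two parameters.

```javascript
var tagPattern = /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi;
var seen = [];

var result = "<p>Hi <b>there</b></p>".replace(tagPattern, function (match, tagName, offset) {
  // match   -> the whole matched tag, e.g. "</b>"  (named $0 in the question)
  // tagName -> capture group 1, e.g. "b"           (named $1 in the question)
  // offset  -> index of the match within the input string
  seen.push(match + " -> " + tagName + " @ " + offset);
  return tagName.toLowerCase() === "b" ? match : ""; // keep <b> tags, drop all others
});

console.log(seen.join("\n"));
console.log(result); // Hi <b>there</b>
```

The callback runs once per match, and whatever it returns replaces that match in the output.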

Regex to detect that the URL doesn't end with an extension

I'm using this regular expression to detect whether a URL ends with .jpg:
var exp = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|]*^\.jpg)/ig;
It detects URLs such as http://www.blabla.com/sdsd.jpg
But now I want to detect that the URL doesn't end with a .jpg extension. I tried this:
var exp = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|]*[^\.jpg]\b)/ig;
but it only matches http://www.blabla.com/sdsd
Then I used this:
var exp = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|]*[^\.jpg]$)/ig;
It works if the URL is alone, but doesn't work if the text is e.g.:
http://www.blabla.com/sdsd.jpg text

Try using a negative lookahead.
(?!\.jpg)
What you have now, [^\.jpg] is saying "any character BUT a period or the letters j, p, or g".
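A short sketch of the difference, run against whole-string URLs. The variable names and the sample URLs are illustrative only. Note how the character class wrongly rejects the .png URL, because its last character is a "g".

```javascript
// [^\.jpg] is a character class: exactly one character that is not ".", "j", "p" or "g".
var charClass = /[^\.jpg]$/i;
// A negative lookahead anchored at the start rejects the whole string if it ends in ".jpg".
var notJpg = /^(?!.*\.jpg$).*$/i;

var samples = [
  "http://www.blabla.com/sdsd.jpg",
  "http://www.blabla.com/sdsd.png",
  "http://www.blabla.com/sdsd"
];

samples.forEach(function (url) {
  console.log(url, "| char class:", charClass.test(url), "| lookahead:", notJpg.test(url));
});
```

This only covers a URL tested on its own; finding URLs inside running text still needs the URL-matching part of the pattern.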
EDIT Here's an answer using negative look ahead and file extensions.
Update
Knowing this is a "url finder" now, here's a better solution:
// parseUri 1.2.2
// (c) Steven Levithan <stevenlevithan.com>
// MIT License
// --- http://blog.stevenlevithan.com/archives/parseuri
function parseUri (str) {
var o = parseUri.options,
m = o.parser[o.strictMode ? "strict" : "loose"].exec(str),
uri = {},
i = 14;
while (i--) uri[o.key[i]] = m[i] || "";
uri[o.q.name] = {};
uri[o.key[12]].replace(o.q.parser, function ($0, $1, $2) {
if ($1) uri[o.q.name][$1] = $2;
});
return uri;
};
parseUri.options = {
strictMode: false,
key: ["source","protocol","authority","userInfo","user","password","host","port","relative","path","directory","file","query","anchor"],
q: {
name: "queryKey",
parser: /(?:^|&)([^&=]*)=?([^&]*)/g
},
parser: {
strict: /^(?:([^:\/?#]+):)?(?:\/\/((?:(([^:@]*)(?::([^:@]*))?)?@)?([^:\/?#]*)(?::(\d*))?))?((((?:[^?#\/]*\/)*)([^?#]*))(?:\?([^#]*))?(?:#(.*))?)/,
loose: /^(?:(?![^:@]+:[^:@\/]*@)([^:\/?#.]+):)?(?:\/\/)?((?:(([^:@]*)(?::([^:@]*))?)?@)?([^:\/?#]*)(?::(\d*))?)(((\/(?:[^?#](?![^?#\/]*\.[^?#\/.]+(?:[?#]|$)))*\/?)?([^?#\/]*))(?:\?([^#]*))?(?:#(.*))?)/
}
};//end parseUri
function convertUrls(element){
var urlRegex = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;
element.innerHTML = element.innerHTML.replace(urlRegex,function(url){
if (parseUri(url).file.match(/\.(jpg|png|gif|bmp)$/i))
return '<img src="'+url+'" alt="'+url+'" />';
return '<a href="'+url+'">'+url+'</a>';
});
}
I used a parseUri method and a slightly different RegEx for detecting the links. Between the two, you can go through and replace the links within an element with either a link or the image equivalent.
Note that my version checks most images types using /\.(jpg|png|gif|bmp)$/i, however this can be altered to explicitly capture jpg using /\.jpg$/i. A demo can be found here.
The usage should be pretty straightforward: pass the function the HTML element you want parsed. You can obtain it using any number of JavaScript methods (getElementById, getElementsByTagName, ...). Hand it off to this function, and it will take care of the rest.
You can also alter it and add it to the String prototype so it can be called natively. This version could be written like so:
String.prototype.convertUrls = function(){
var urlRegex = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;
return this.replace(urlRegex,function(url){
if (parseUri(url).file.match(/\.(jpg|png|gif|bmp)$/i))
return '<img src="'+url+'" alt="'+url+'" />';
return '<a href="'+url+'">'+url+'</a>';
});
}
function convertUrls(element){
element.innerHTML = element.innerHTML.convertUrls();
}
(Note the logic has moved to the prototype function and the element function just calls the new string extension)
This working revision can be found here

Define the URL regex from the RFC 3986 appendix:
function hasJpgExtension(myUrl) {
var urlRegex = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;
var match = myUrl.match(urlRegex);
if (!match) { return false; }
Whitelist the protocol
if (!/^https?/i.test(match[2])) { return false; }
Grab the path portion so that you can filter out the query and the fragment.
var path = match[5];
Decode it so as to normalize any %-encoded characters in the path.
path = decodeURIComponent(path);
And finally, check that it ends with the appropriate extension:
return /\.jpg$/i.test(path);
}
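In environments that provide the WHATWG URL class (current browsers and Node.js), the same steps can be sketched without a hand-written URL regex. The function name endsWithJpg is hypothetical, and this is one possible variant under that assumption rather than a replacement for the approach above.

```javascript
function endsWithJpg(urlString) {
  var parsed;
  try {
    parsed = new URL(urlString); // throws a TypeError on input it cannot parse
  } catch (err) {
    return false;
  }
  // Whitelist the protocol, as in the steps above.
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    return false;
  }
  var path;
  try {
    path = decodeURIComponent(parsed.pathname); // pathname excludes query and fragment
  } catch (err) {
    return false; // malformed %-escape in the path
  }
  return /\.jpg$/i.test(path);
}

console.log(endsWithJpg("http://www.blabla.com/sdsd.jpg?size=large#top")); // true
console.log(endsWithJpg("http://www.blabla.com/sdsd.png"));                // false
console.log(endsWithJpg("ftp://example.com/a.jpg"));                       // false
```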

This is a simple solution based on @Brad's post that doesn't need the parseUri function:
function convertUrls(text){
var urlRegex = /((\b(https?|ftp|file):\/\/|www)[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;
var result = text.replace(urlRegex,function(url){
if (url.match(/\.(jpg|png|gif|bmp)$/i))
return '<img width="185" src="'+url+'" alt="'+url+'" />';
else if(url.match(/^(www)/i))
return '<a href="http://'+url+'">'+url+'</a>';
return '<a href="'+url+'">'+url+'</a>';
});
return result;
}
The same result :
http://jsfiddle.net/dnielF/CC9Va/
I don't know if this is the best solution but works for me :D thanks !

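Generally, you can check for all the extensions (for pictures) with something like:
([^\s]+(\.(?i)(jpg|jpeg|png|gif|bmp))$)
Note that JavaScript does not accept the inline (?i) modifier; drop it and use the i flag on the regex literal instead.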

JavaScript string.format function does not work in IE

I have a JavaScript from this source in a comment of a blog: frogsbrain
It's a string formatter, and it works fine in Firefox, Google Chrome, Opera and Safari.
Only problem is in IE, where the script does no replacement at all. The output in both test cases in IE is only 'hello', nothing more.
Please help me to get this script working in IE also, because I'm not the Javascript guru and I just don't know where to start searching for the problem.
I'll post the script here for convenience. All credits go to Terence Honles for the script so far.
// usage:
// 'hello {0}'.format('world');
// ==> 'hello world'
// 'hello {name}, the answer is {answer}.'.format({answer:'42', name:'world'});
// ==> 'hello world, the answer is 42.'
String.prototype.format = function() {
var pattern = /({?){([^}]+)}(}?)/g;
var args = arguments;
if (args.length == 1) {
if (typeof args[0] == 'object' && args[0].constructor != String) {
args = args[0];
}
}
var split = this.split(pattern);
var sub = new Array();
var i = 0;
for (;i < split.length; i+=4) {
sub.push(split[i]);
if (split.length > i+3) {
if (split[i+1] == '{' && split[i+3] == '}')
sub.push(split[i+1], split[i+2], split[i+3]);
else {
sub.push(split[i+1], args[split[i+2]], split[i+3]);
}
}
}
return sub.join('')
}

I think the issue is with this.
var pattern = /({?){([^}]+)}(}?)/g;
var split = this.split(pattern);
JavaScript's regex split function acts differently in IE than in other browsers.
Please take a look at my other post on SO.

var split = this.split(pattern);
string.split(regexp) is broken in many ways on IE (JScript) and is generally best avoided. In particular:
it does not include match groups in the output array
it omits empty strings
alert('abbc'.split(/(b)/)) // a,c
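For comparison, the a,c result above is what IE's JScript produces. A standards-conforming engine keeps both the captured separators and the empty strings, which is what the original format script relies on. A minimal sketch (variable names are illustrative):

```javascript
// With a capture group, the captured separators are included in the result.
var withGroup = "abbc".split(/(b)/);
console.log(JSON.stringify(withGroup)); // ["a","b","","b","c"]

// Without a capture group, only the pieces between separators are returned,
// including the empty string between the two adjacent "b"s.
var withoutGroup = "abbc".split(/b/);
console.log(JSON.stringify(withoutGroup)); // ["a","","c"]
```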
It would seem simpler to use replace rather than split:
String.prototype.format= function(replacements) {
return this.replace(String.prototype.format.pattern, function(all, name) {
return name in replacements? replacements[name] : all;
});
}
String.prototype.format.pattern= /{?{([^{}]+)}}?/g;
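A standalone sketch of the replace-based approach, written as a plain function instead of a prototype extension. The name formatString is hypothetical. As an assumption, it mirrors the original script's treatment of doubled braces, so "{{name}}" yields a literal "{name}", and it leaves unknown placeholders as written.

```javascript
function formatString(template, replacements) {
  return template.replace(/(\{?)\{([^{}]+)\}(\}?)/g, function (all, open, name, close) {
    if (open && close) {
      return "{" + name + "}"; // "{{name}}" is treated as an escaped literal "{name}"
    }
    if (!Object.prototype.hasOwnProperty.call(replacements, name)) {
      return all; // unknown placeholder: leave it as written
    }
    return open + replacements[name] + close;
  });
}

console.log(formatString("hello {0}", ["world"]));
// hello world
console.log(formatString("hello {name}, the answer is {answer}.", { answer: "42", name: "world" }));
// hello world, the answer is 42.
console.log(formatString("{{0}} is {0}", ["x"]));
// {0} is x
```

Because it uses only replace with a callback, it does not depend on how split treats capture groups.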

How can I get file extensions with JavaScript?

See code:
var file1 = "50.xsl";
var file2 = "30.doc";
getFileExtension(file1); //returns xsl
getFileExtension(file2); //returns doc
function getFileExtension(filename) {
/*TODO*/
}

Newer Edit: Lots of things have changed since this question was initially posted - there's a lot of really good information in wallacer's revised answer as well as VisioN's excellent breakdown
Edit: Just because this is the accepted answer; wallacer's answer is indeed much better:
return filename.split('.').pop();
My old answer:
return /[^.]+$/.exec(filename);
Should do it.
Edit: In response to PhiLho's comment, use something like:
return (/[.]/.exec(filename)) ? /[^.]+$/.exec(filename) : undefined;

return filename.split('.').pop();
Edit:
This is another non-regex solution that I think is more efficient:
return filename.substring(filename.lastIndexOf('.')+1, filename.length) || filename;
There are some corner cases that are better handled by VisioN's answer below, particularly files with no extension (.htaccess etc included).
It's very performant, and handles corner cases in an arguably better way by returning "" instead of the full string when there's no dot or no string before the dot. It's a very well crafted solution, albeit tough to read. Stick it in your helpers lib and just use it.
Old Edit:
A safer implementation if you're going to run into files with no extension, or hidden files with no extension (see VisioN's comment to Tom's answer above) would be something along these lines
var a = filename.split(".");
if( a.length === 1 || ( a[0] === "" && a.length === 2 ) ) {
return "";
}
return a.pop(); // feel free to tack .toLowerCase() here if you want
If a.length is one, it's a visible file with no extension ie. file
If a[0] === "" and a.length === 2 it's a hidden file with no extension ie. .htaccess
This should clear up issues with the slightly more complex cases. In terms of performance, I think this solution is a little slower than regex in most browsers. However, for most common purposes this code should be perfectly usable.

The following solution is fast and short enough to use in bulk operations and save extra bytes:
return fname.slice((fname.lastIndexOf(".") - 1 >>> 0) + 2);
Here is another one-line non-regexp universal solution:
return fname.slice((Math.max(0, fname.lastIndexOf(".")) || Infinity) + 1);
Both work correctly with names having no extension (e.g. myfile) or starting with . dot (e.g. .htaccess):
"" --> ""
"name" --> ""
"name.txt" --> "txt"
".htpasswd" --> ""
"name.with.many.dots.myext" --> "myext"
If you care about the speed you may run the benchmark and check that the provided solutions are the fastest, while the short one is tremendously fast:
How the short one works:
String.lastIndexOf method returns the last position of the substring (i.e. ".") in the given string (i.e. fname). If the substring is not found method returns -1.
The "unacceptable" positions of dot in the filename are -1 and 0, which respectively refer to names with no extension (e.g. "name") and to names that start with dot (e.g. ".htaccess").
The zero-fill right shift operator (>>>), when used with zero, only changes negative numbers, transforming -1 to 4294967295 and -2 to 4294967294. This pushes the slice position far past the end of the string in the edge cases (sort of a trick here).
String.prototype.slice extracts the part of the filename from the position that was calculated as described. If the position number is more than the length of the string method returns "".
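The steps above can be traced with a small sketch that prints the intermediate values. The function name extensionOf and the local variable names are chosen for illustration; the arithmetic is the same as in the short one-liner.

```javascript
function extensionOf(fname) {
  var dot = fname.lastIndexOf(".");  // -1 when there is no dot
  var shifted = (dot - 1) >>> 0;     // -2 -> 4294967294, -1 -> 4294967295, otherwise dot - 1
  return fname.slice(shifted + 2);   // a start index beyond the length makes slice return ""
}

["", "name", "name.txt", ".htpasswd", "name.with.many.dots.myext"].forEach(function (fname) {
  var dot = fname.lastIndexOf(".");
  console.log(JSON.stringify(fname), dot, (dot - 1) >>> 0, JSON.stringify(extensionOf(fname)));
});
```

For "name.txt" the dot is at index 4, so the slice starts at 3 + 2 = 5 and yields "txt". For ".htpasswd" the dot is at index 0, so the slice starts at 4294967297 and yields "".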
If you want more clear solution which will work in the same way (plus with extra support of full path), check the following extended version. This solution will be slower than previous one-liners but is much easier to understand.
function getExtension(path) {
var basename = path.split(/[\\/]/).pop(), // extract file name from full path ...
// (supports `\\` and `/` separators)
pos = basename.lastIndexOf("."); // get last position of `.`
if (basename === "" || pos < 1) // if file name is empty or ...
return ""; // `.` not found (-1) or comes first (0)
return basename.slice(pos + 1); // extract extension ignoring `.`
}
console.log( getExtension("/path/to/file.ext") );
// >> "ext"
All three variants should work in any web browser on the client side and can be used in the server side NodeJS code as well.

function getFileExtension(filename)
{
var ext = /^.+\.([^.]+)$/.exec(filename);
return ext == null ? "" : ext[1];
}
Tested with
"a.b" (=> "b")
"a" (=> "")
".hidden" (=> "")
"" (=> "")
null (=> "")
Also
"a.b.c.d" (=> "d")
".a.b" (=> "b")
"a..b" (=> "b")

There is a standard library function for this in the path module:
import path from 'path';
console.log(path.extname('abc.txt'));
Output:
.txt
So, if you only want the format:
path.extname('abc.txt').slice(1) // 'txt'
If there is no extension, then the function will return an empty string:
path.extname('abc') // ''
If you are using Node, then path is built-in. If you are targeting the browser, then Webpack will bundle a path implementation for you. If you are targeting the browser without Webpack, then you can include path-browserify manually.
There is no reason to do string splitting or regex.
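A few edge cases are worth knowing before relying on it. This sketch assumes Node.js and uses the CommonJS form of the import shown above; the sample file names are illustrative.

```javascript
var path = require("path"); // Node.js built-in module

console.log(path.extname("name.with.many.dots.myext")); // .myext
console.log(path.extname(".htaccess"));                 // (empty string: a leading dot is not an extension)
console.log(path.extname("archive."));                  // .
console.log(path.extname("/path/to/file.ext"));         // .ext
```

Note that a trailing dot yields "." rather than an empty string, so slice(1) still gives "" in that case.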

function getExt(filename)
{
var ext = filename.split('.').pop();
if(ext == filename) return "";
return ext;
}

var extension = fileName.substring(fileName.lastIndexOf('.')+1);

If you are dealing with web urls, you can use:
function getExt(filepath){
return filepath.split("?")[0].split("#")[0].split('.').pop();
}
getExt("../js/logic.v2.min.js") // js
getExt("http://example.net/site/page.php?id=16548") // php
getExt("http://example.net/site/page.html#welcome.to.me") // html
getExt("c:\\logs\\yesterday.log"); // log
Demo: https://jsfiddle.net/squadjot/q5ard4fj/

var parts = filename.split('.');
return parts[parts.length-1];

function file_get_ext(filename)
{
return typeof filename != "undefined" ? filename.substring(filename.lastIndexOf(".")+1, filename.length).toLowerCase() : false;
}

Code
/**
* Extract file extension from URL.
* @param {String} url
* @returns {String} File extension or empty string if no extension is present.
*/
var getFileExtension = function (url) {
"use strict";
if (url === null) {
return "";
}
var index = url.lastIndexOf("/");
if (index !== -1) {
url = url.substring(index + 1); // Keep path without its segments
}
index = url.indexOf("?");
if (index !== -1) {
url = url.substring(0, index); // Remove query
}
index = url.indexOf("#");
if (index !== -1) {
url = url.substring(0, index); // Remove fragment
}
index = url.lastIndexOf(".");
return index !== -1
? url.substring(index + 1) // Only keep file extension
: ""; // No extension found
};
Test
Notice that in the absence of a query, the fragment might still be present.
"https://www.example.com:8080/segment1/segment2/page.html?foo=bar#fragment" --> "html"
"https://www.example.com:8080/segment1/segment2/page.html#fragment" --> "html"
"https://www.example.com:8080/segment1/segment2/.htaccess?foo=bar#fragment" --> "htaccess"
"https://www.example.com:8080/segment1/segment2/page?foo=bar#fragment" --> ""
"https://www.example.com:8080/segment1/segment2/?foo=bar#fragment" --> ""
"" --> ""
null --> ""
"a.b.c.d" --> "d"
".a.b" --> "b"
".a.b." --> ""
"a...b" --> "b"
"..." --> ""
JSLint
0 Warnings.

Fast and works correctly with paths
(filename.match(/[^\\\/]\.([^.\\\/]+)$/) || [null]).pop()
Some edge cases
/path/.htaccess => null
/dir.with.dot/file => null
Solutions using split are slow and solutions with lastIndexOf don't handle edge cases.

// Get the file extension
function getFileExtension(file) {
var regexp = /\.([0-9a-z]+)(?:[\?#]|$)/i;
var extension = file.match(regexp);
return extension && extension[1];
}
console.log(getFileExtension("https://www.example.com:8080/path/name/foo"));
console.log(getFileExtension("https://www.example.com:8080/path/name/foo.BAR"));
console.log(getFileExtension("https://www.example.com:8080/path/name/.quz/foo.bar?key=value#fragment"));
console.log(getFileExtension("https://www.example.com:8080/path/name/.quz.bar?key=value#fragment"));

I just wanted to share this:
fileName.slice(fileName.lastIndexOf('.'))
The downside is that for a file with no extension, lastIndexOf returns -1, so slice(-1) returns the last character of the name.
Checking for that case first fixes it:
function getExtention(fileName){
var i = fileName.lastIndexOf('.');
if(i === -1 ) return false;
return fileName.slice(i)
}

"one-liner" to get filename and extension using reduce and array destructuring :
var str = "filename.with_dot.png";
var [filename, extension] = str.split('.').reduce((acc, val, i, arr) => (i == arr.length - 1) ? [acc[0].substring(1), val] : [[acc[0], val].join('.')], [])
console.log({filename, extension});
With better indentation:
var str = "filename.with_dot.png";
var [filename, extension] = str.split('.')
.reduce((acc, val, i, arr) => (i == arr.length - 1)
? [acc[0].substring(1), val]
: [[acc[0], val].join('.')], [])
console.log({filename, extension});
// {
// "filename": "filename.with_dot",
// "extension": "png"
// }

There's also a simple approach using ES6 destructuring:
const path = 'hello.world.txt'
const [extension, ...nameParts] = path.split('.').reverse();
console.log('extension:', extension);

function extension(fname) {
var pos = fname.lastIndexOf(".");
var strlen = fname.length;
if (pos != -1 && strlen != pos + 1) {
var ext = fname.split(".");
var len = ext.length;
var extension = ext[len - 1].toLowerCase();
} else {
extension = "No extension found";
}
return extension;
}
//usage
extension('file.jpeg')
It always returns the extension in lower case, so you can check it on field change.
Works for:
file.JpEg
file (no extension)
file. (no extension)

This simple solution
function extension(filename) {
var r = /.+\.(.+)$/.exec(filename);
return r ? r[1] : null;
}
Tests
/* tests */
test('cat.gif', 'gif');
test('main.c', 'c');
test('file.with.multiple.dots.zip', 'zip');
test('.htaccess', null);
test('noextension.', null);
test('noextension', null);
test('', null);
// test utility function
function test(input, expect) {
var result = extension(input);
if (result === expect)
console.log(result, input);
else
console.error(result, input);
}

I'm sure someone can, and will, minify and/or optimize my code in the future. But, as of right now, I am 200% confident that my code works in every unique situation (e.g. with just the file name only, with relative, root-relative, and absolute URL's, with fragment # tags, with query ? strings, and whatever else you may decide to throw at it), flawlessly, and with pin-point precision.
For proof, visit: https://projects.jamesandersonjr.com/web/js_projects/get_file_extension_test.php
Here's the JSFiddle: https://jsfiddle.net/JamesAndersonJr/ffcdd5z3/
Not to be overconfident, or blowing my own trumpet, but I haven't seen any block of code for this task (finding the 'correct' file extension, amidst a battery of different function input arguments) that works as well as this does.
Note: By design, if a file extension doesn't exist for the given input string, it simply returns a blank string "", not an error, nor an error message.
It takes two arguments:
String: fileNameOrURL (self-explanatory)
Boolean: showUnixDotFiles (Whether or Not to show files that begin with a dot ".")
Note (2): If you like my code, be sure to add it to your JS libraries and/or repos, because I worked hard on perfecting it, and it would be a shame for it to go to waste. So, without further ado, here it is:
function getFileExtension(fileNameOrURL, showUnixDotFiles)
{
/* First, let's declare some preliminary variables we'll need later on. */
var fileName;
var fileExt;
/* Now we'll create a hidden anchor ('a') element (Note: No need to append this element to the document). */
var hiddenLink = document.createElement('a');
/* Just for fun, we'll add a CSS attribute of [ style.display = "none" ]. Remember: You can never be too sure! */
hiddenLink.style.display = "none";
/* Set the 'href' attribute of the hidden link we just created, to the 'fileNameOrURL' argument received by this function. */
hiddenLink.setAttribute('href', fileNameOrURL);
/* Now, let's take advantage of the browser's built-in parser, to remove elements from the original 'fileNameOrURL' argument received by this function, without actually modifying our newly created hidden 'anchor' element.*/
fileNameOrURL = fileNameOrURL.replace(hiddenLink.protocol, ""); /* First, let's strip out the protocol, if there is one. */
fileNameOrURL = fileNameOrURL.replace(hiddenLink.hostname, ""); /* Now, we'll strip out the host-name (i.e. domain-name) if there is one. */
fileNameOrURL = fileNameOrURL.replace(":" + hiddenLink.port, ""); /* Now finally, we'll strip out the port number, if there is one (Kinda overkill though ;-)). */
/* Now, we're ready to finish processing the 'fileNameOrURL' variable by removing unnecessary parts, to isolate the file name. */
/* Operations for working with [relative, root-relative, and absolute] URL's ONLY [BEGIN] */
/* Break the possible URL at the [ '?' ] and take first part, to shave of the entire query string ( everything after the '?'), if it exist. */
fileNameOrURL = fileNameOrURL.split('?')[0];
/* Sometimes URL's don't have query's, but DO have a fragment [ # ](i.e 'reference anchor'), so we should also do the same for the fragment tag [ # ]. */
fileNameOrURL = fileNameOrURL.split('#')[0];
/* Now that we have just the URL 'ALONE', Let's remove everything to the last slash in URL, to isolate the file name. */
fileNameOrURL = fileNameOrURL.substr(1 + fileNameOrURL.lastIndexOf("/"));
/* Operations for working with [relative, root-relative, and absolute] URL's ONLY [END] */
/* Now, 'fileNameOrURL' should just be 'fileName' */
fileName = fileNameOrURL;
/* Now, we check if we should show UNIX dot-files, or not. This should be either 'true' or 'false'. */
if ( showUnixDotFiles == false )
{
/* If not ('false'), we should check if the filename starts with a period (indicating it's a UNIX dot-file). */
if ( fileName.startsWith(".") )
{
/* If so, we return a blank string to the function caller. Our job here, is done! */
return "";
};
};
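/* Now, let's get everything after the last period in the filename (i.e. the correct 'file extension'). If there is no period at all, there is no extension, so we use a blank string. */
fileExt = fileName.lastIndexOf(".") === -1 ? "" : fileName.substr(1 + fileName.lastIndexOf("."));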
/* Now that we've discovered the correct file extension, let's return it to the function caller. */
return fileExt;
};
Enjoy! You're quite welcome!

Try this:
function getFileExtension(filename) {
var fileinput = document.getElementById(filename);
if (!fileinput)
return "";
var filename = fileinput.value;
if (filename.length == 0)
return "";
var dot = filename.lastIndexOf(".");
if (dot == -1)
return "";
var extension = filename.substr(dot, filename.length);
return extension;
}

If you are looking for a specific extension and know its length, you can use substr:
var file1 = "50.xsl";
if (file1.substr(-4) == '.xsl') {
// do something
}
JavaScript reference: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/substr

I just realized that it's not enough to put a comment on p4bl0's answer, though Tom's answer clearly solves the problem:
return filename.replace(/^.*?\.([a-zA-Z0-9]+)$/, "$1");

For most applications, a simple script such as
return /[^.]+$/.exec(filename);
would work just fine (as provided by Tom). However, this is not foolproof: it does not work if a file name such as the following is provided:
image.jpg?foo=bar
It may be a bit overkill, but I would suggest using a URL parser such as this one to avoid failures due to unpredictable filenames.
Using that particular function, you could get the file name like this:
var trueFileName = parse_url('image.jpg?foo=bar').file;
This will output "image.jpg" without the url vars. Then you are free to grab the file extension.
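The parse_url function above comes from a third-party script. As a rough sketch of the same idea, the standard URL class (available in modern browsers and in Node) can be swapped in to strip the query and fragment before looking for the extension. The function name extensionFromUrl and the placeholder base URL are assumptions made for illustration:

```javascript
// Hypothetical helper: parse the input as a URL, then inspect only the last
// path segment, so "?query" and "#fragment" never reach the extension logic.
function extensionFromUrl(input) {
  // The base is only a placeholder so that relative inputs can be parsed.
  var pathname = new URL(input, "http://placeholder.invalid/").pathname;
  var name = pathname.split("/").pop();
  var dot = name.lastIndexOf(".");
  // dot > 0 excludes "no dot" (-1) and dot-files such as ".htaccess" (0).
  return dot > 0 ? name.slice(dot + 1) : "";
}

console.log(extensionFromUrl("image.jpg?foo=bar"));                               // "jpg"
console.log(extensionFromUrl("http://example.net/site/page.php?id=16548"));       // "php"
console.log(extensionFromUrl("http://example.net/site/page.html#welcome.to.me")); // "html"
console.log(extensionFromUrl("/path/.htaccess"));                                 // ""
```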

function func() {
var val = document.frm.filename.value;
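var arr = val.split(".");
alert(arr[arr.length - 1]);
var ext = arr[arr.length - 1].toLowerCase(); // the extension is the last dot-separated part, not arr[1]
var arr1 = val.split("\\");
alert(arr1[arr1.length - 2]); // name of the containing folder (undefined if the path has no backslash)
if (ext == "gif" || ext == "bmp" || ext == "jpeg") {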
alert("this is an image file ");
} else {
alert("this is not an image file");
}
}

I'm many moons late to the party but for simplicity I use something like this
var fileName = "I.Am.FileName.docx";
var nameLen = fileName.length;
var lastDotPos = fileName.lastIndexOf(".");
var fileNameSub = false;
if(lastDotPos === -1)
{
fileNameSub = false;
}
else
{
//Remove +1 if you want the "." left too
fileNameSub = fileName.substr(lastDotPos + 1, nameLen);
}
document.getElementById("showInMe").innerHTML = fileNameSub;
<div id="showInMe"></div>

A one-line solution that also accounts for query parameters and any characters in the URL.
string.match(/(.*)\??/i).shift().replace(/\?.*/, '').split('.').pop()
// Example
// some.url.com/with.in/&ot.s/files/file.jpg?spec=1&.ext=jpg
// jpg

return filename.replace(/\.([a-zA-Z0-9]+)$/, "$1");
edit: Strangely (or maybe it's not) the $1 in the second argument of the replace method doesn't seem to work... Sorry.
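A likely explanation, sketched here with a made-up file name: the $1 reference does work, but the pattern only matches the final ".ext" part, so replace swaps just that part and leaves the rest of the name in place. Anchoring the pattern at the start so that it consumes the whole string gives the extension alone:

```javascript
var filename = "archive.tar.gz"; // example input, chosen for illustration

// Unanchored: only ".gz" is matched, so only ".gz" is replaced by "gz".
var unanchored = filename.replace(/\.([a-zA-Z0-9]+)$/, "$1");
console.log(unanchored); // "archive.targz"

// Anchored: the match covers the whole string, so the result is just group 1.
var anchored = filename.replace(/^.*\.([a-zA-Z0-9]+)$/, "$1");
console.log(anchored); // "gz"
```

If the name contains no dot, neither pattern matches and replace returns the input unchanged.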

function fetchFileExtention(fileName) {
return fileName.slice((fileName.lastIndexOf(".") - 1 >>> 0) + 2);
}
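The one-liner above relies on the unsigned right shift. The following sketch spells out the same arithmetic step by step; the function name extensionViaUnsignedShift is hypothetical, and the logic is meant to mirror the expression above rather than replace it:

```javascript
function extensionViaUnsignedShift(fileName) {
  var dot = fileName.lastIndexOf("."); // -1 if there is no dot, 0 for ".hidden"
  // Subtracting 1 makes both of those cases negative; ">>> 0" then reinterprets
  // a negative number as a huge unsigned 32-bit integer.
  var shifted = (dot - 1) >>> 0;
  // For a normal name this is dot + 1; otherwise it is far past the end of the
  // string, and slice() with a start beyond the length returns "".
  return fileName.slice(shifted + 2);
}

console.log(-2 >>> 0);                                 // 4294967294
console.log(extensionViaUnsignedShift("a.b"));         // "b"
console.log(extensionViaUnsignedShift("name.tar.gz")); // "gz"
console.log(extensionViaUnsignedShift("noext"));       // ""
console.log(extensionViaUnsignedShift(".hidden"));     // ""
```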

Wallacer's answer is nice, but one more check is needed.
If the file has no extension, it will return the file name as the extension, which is not good.
Try this one:
return ( filename.indexOf('.') > 0 ) ? filename.split('.').pop().toLowerCase() : 'undefined';

Don't forget that some files can have no extension, so:
var parts = filename.split('.');
return (parts.length > 1) ? parts.pop() : '';
