JavaScript regex needed [duplicate]

JavaScript regex needed [duplicate] - javascript

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 8 years ago.
I have this piece of HTML code:
<tr class="AttCardFooter" onMouseOver="this.style.backgroundColor='#E1EAFE'" onMouseOut="this.style.backgroundColor='Transparent'">...etc etc.. new lines etc.. <td title="Saldo+">33:33</td><td title="Saldo-">22:22</td> .. etc etc...
I need JavaScript regex that will grab these two fields 33:33 and 22:22
All that I have tried fails because of new line characters.
If anyone knows how to accomplish that I would be very grateful.

Don't use a regexp. I guess your piece of html is an innerHTML or outerHTML of some element. Instead of parsing html with regexp do this:
var el = document.querySelector("tr.AttCardFooter"); // I guess you have that variable already
for (var i=0; i<el.cells.length; i++) {
var td = el.cells[i];
if (td.title == "Saldo+")
var positiveSaldo = td.innerText;
else if (td.title == "Saldo-")
var negativeSaldo = td.innerText;
}
alert("Voila:\n"+positiveSaldo+"\n"+negativeSaldo);
You may improve this with your favorite libraries dom functions. For example in jQuery it would be something like
var el = $("tr.AttCardFooter");
var positiveSaldo = el.find('td[title="Saldo+"]').text();
var negativeSaldo = el.find('td[title="Saldo-"]').text();

You would have to process the new line characters. Try this:
Pattern p = Pattern.compile(myRegExp, Pattern.DOT_ALL);

Try this lib and add /s flag so the dot matches new lines.
http://xregexp.com/

Regular expressions cannot parse HTML.
If the table row is already part of your document, then there's no need to do anything fancy. Use the DOM to extract the values you need -- Bergi has great suggestions.
If the string comes from somewhere else, consider using a hidden <div> and set its innerHTML. The browser handles parsing the string; you then use DOM methods to extract the values you need.
Of course, if that "somewhere else" is untrusted input (third party site, user input, etc), you can't just use a hidden <div> due to the security problems of blindly sticking potentially executable code into your page. Instead, you need to sandbox this step in an <iframe> running on a different Origin.

Related

jQuery, how to test of a variable is a text node, containing no markup?

http://jsfiddle.net/DerNalia/zrppg/8/
I have two lines of code that pretty much do the same thing
var doesntbreak = $j("hello");
var breaks = $j(" ");
The first one doesn't error, but the second one throws this
Syntax error, unrecognized expression:
should'nt they both behave the same?
any insight as to how to solve this?
in the actual method I'm using, ele is from the Dom, so it could eb a text node, or any other kind of node.
UPDATE:
the input to the function that I'm using that I noticed this takes selection from the dom.
updated example: http://jsfiddle.net/DerNalia/zrppg/11/ <- includes html markup.
So, I guess, my question is, how do I test if something is JUST a text node? and doesn't contain any markup?

In general, you cannot create standalone text nodes with the jQuery function. If a string isn't obviously HTML, it gets treated as a selector, and is not recognized by jQuery as a valid selector.
Assuming you want to parse arbitrary strings (which may have HTML tags or not), I suggest something like var result = $('<div></div>').html(' ').contents();. Place your your HTML or text string in a div to parse it and then immediately extract the parsed result as a jQuery object with the list of elements. You can append the resultant list of elements with $(parentElem).append(result);

try this:
function isTextNode(node){
div=document.createElement('div');
div.innerHTML=node;
return $(div).text()==$(div).html();
}
And " " is'nt a valid selector if you want to find a elements containing some text you must use the :contains selector http://api.jquery.com/contains-selector/

Internet Explorer (older versions at least) don't have built in "querySelector" functions, so the Sizzle engine has to do the work directly. Thus, the slightly different tolerances for bogus input can cause differences in error reporting.
Your selector expression " " is equally invalid in all browsers, however. The library is not obliged to quietly accept anything you pass it, so perhaps you should reconsider your application design.
If you want to check for entities, you could use a regular expression if you're confident that it's just a text node. Or you could get the contents with .text() instead of .html().

So, I have to thank Apsillers and Rolando for pointing me in the right direction. Their answers were very close, but gave me the information I needed.
This is what I ended up using:
TEXT_NODE = 3;
objectify = function(n) {
return $j("<div></div>").html(n).contents();
}
function textOnly(n) {
var o = objectify(n);
for (var i = 0; i < o.length; i++) {
if (objectify(o[i])[0].nodeType != TEXT_NODE) {
return false
}
}
return true;
}
And here is a jsFiddle with some test cases, that neither of the original code submissions passed.
to pass, it needed to handle this kind of input
"hello" // true
"hello<b>there</b>" // false
"<b>there</b>" // false
" " // false

Not actual answer, but may help someone with similar issue as mine and loosely related to this question. :)
I was getting same issue today, so fixed by removing
Changed:
var breaks = $j(" ");
to:
var breaks = $j(" ".replace(/&.*;/g, ""));
Here I am removing , < etc...
Note: value at is dynamic for me, so it can be anything.

How can spaces be converted to &nbsp without breaking HTML tags?

I've inherited some pretty complex code for a web forum, and one of the features I'm trying to implement is the ability for spaces to not be truncated into only one. This is mainly because our users often want to include ASCII art, tables etc in their posts.
I first did this using a simple search and replace in javascript, which had the side effect of breaking HTML tags (eg <a href=....> became <a href=.....>).
I then tried doing this on server side, when the strings are retrieved, by having spaces converted before links and code people insert is converted to HTML. This works to a degree but it causes some issues with other parts of the code, for example where a message is truncated to appear on the home page, it might leave some of the space code, such as
Here is a message&nb
I think there may be a way to just alter the original javascript to achieve this - it just needs to only match spaces that are not inside a HTML tag.
The script I was using originally was message = message.replace(/\s/g, " ").
Thanks for any help you can provide with this.

You can use the pre element to include preformatted text, which renders spaces as-is. See http://www.w3.org/TR/html5-author/the-pre-element.html
Those docs specifically say one of the best uses of the pre element is "Displaying ASCII art".
Example: http://jsbin.com/owuruz/edit#preview
<pre>
/\_/\
____/ o o \
/~____ =ø= /
(______)__m_m)
</pre>
In your case, just put your message inside a pre tag.

Yes, but you need to process text content of elements, not all of the HTML document content. Moreover, you need to exclude style and script element content. As you can limit yourself to things inside the body element, you could use a recursive function like following, calling it with process(document.body) to apply it to the entire document (but you probably want to apply it to a specific element only):
function process(element) {
var children = element.childNodes;
for(var i = 0; i < children.length; i++) {
var child = children[i];
if(child.nodeType === 3) {
if(child.data) {
child.data = child.data.replace(/[ ]/g, "\xa0");
}
} else if(child.tagName != "SCRIPT") {
process(child);
}
}
}
(No reason to use the entity reference here; you can use the no-break space character U+00A0 itself, referring to it as "\xa0" in JavaScript.)

One way is to use <pre> tags to wrap your users posts so that their ASCII art is preserved. But why not use Markdown (like Stackoverflow does). There's a couple of different ports of Markdown to Javascript:
Showdown
WMD
uedit

Looking for a way to search an html page with javascript

what I would like to do is to the html page for a specific string and read in a certain amount of characters after it and present those characters in an anchor tag.
the problem I'm having is figuring out how to search the page for a string everything I've found relates to by tag or id. Also hoping to make it a greasemonkey script for my personal use.
function createlinks(srchstart,srchend){
var page = document.getElementsByTagName('html')[0].innerHTML;
page = page.substring(srchstart,srchend);
if (page.search("file','http:") != -1)
{
var begin = page.search("file','http:") + 7;
var end = begin + 79;
var link = page.substring(begin,end);
document.body.innerHTML += 'LINK | ';
createlinks(end+1,page.length);
}
};
what I came up with unfortunately after finding the links it loops over the document again

Assisted Direction
Lookup JavaScript Regex.
Apply your regex to the page's HTML (see below).
Different regex functions do different things. You could search the document for the string, as suggested, but you'd have to do it recursively, since the string you're searching for may be listed in multiple places.
To Get the Text in the Page
JavaScript: document.getElementsByTagName('html')[0].innerHTML
jQuery: $('html').html()
Note:
IE may require the element to be capitalized (eg 'HTML') - I forget
Also, the document may have newline characters \n that might want to take out, since one could be between the string you're looking for.

Okay, so in javascript you've got the whole document in the DOM tree. You an search for your string by recursively searching the DOM for the string you want. This is striaghtforward; I'll put in pseudocode because you want to think about what libraries (if any) you're using.
function search(node, string):
if node.innerHTML contains string
-- then you found it
else
for each child node child of node
search(child,string)
rof
fi

How to filter using Regex and javascript?

I have some text in an element in my page, and i want to scrap the price on that page without any text beside.
I found the page contain price like that:
<span class="discount">now $39.99</span>
How to filter this and just get "$39.99" just using JavaScript and regular expressions.
The question may be too easy or asked by another way before but i know nothing about regular expressions so asked for your help :).

<script language="javascript">
window.onload = function () {
// Get all of the elements with class name "discount"
var elements = document.getElementsByClassName('discount');
// Loop over each <span class="discount">
for (var i=0; i < elements.length; i++) {
// get the text, e.g. "now $39.99"
var rawText = elements[i].innerHTML;
// Here's a regular expression to match one or more digits (\d+)
// followed by a period (\.) and one or more digits again (\d+)
var priceAsString = rawText.match(/\d+\.\d+/)
// You'll want to make the price a floating point number if you
// intend to do any calculations with it.
var price = parseFloat(priceAsString);
// Now what do you want to do with the price? I'll just write it out
// to the console (using FireBug or something similar)
console.log(price);
}
}
</script>

document.evaluate("//span[#class='discount']",
document,
null,
XPathResult.ANY_UNORDERED_NODE_TYPE,
null).singleNodeValue.textContent.replace("now $", "");
EDIT: This is standard XPath. I'm not sure what kind of explanation you're seeking. For outdated browsers, you will need a third-party library like Sarissa and/or Java-line.

Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
Patrick McElhaney's and Matthew Flaschen's answers are both good ways to solve the problem.

as Matthew Flaschen suggested, XPATH is a better way to go, if you know something about the node structure of the target document (and since you provided an example, you seem to). If you don't know the node structure, regexes are still lousy for parsing XML.
some more resources to kick-start you:
XPath in Javascript: Introduction
DOM Parsing With XPath and JavaScript
Mozilla dev-center: Introduction to using XPath in JavaScript
I've also found the FireFox extension combo of DOM Inspector and XPather to be an invaluable tool for deriving and testing XPath expressions on a given page. (If you're using another browser -- well, I don't know).

How to decode HTML entities using jQuery?

How do I use jQuery to decode HTML entities in a string?

Security note: using this answer (preserved in its original form below) may introduce an XSS vulnerability into your application. You should not use this answer. Read lucascaro's answer for an explanation of the vulnerabilities in this answer, and use the approach from either that answer or Mark Amery's answer instead.
Actually, try
var encodedStr = "This is fun & stuff";
var decoded = $("<div/>").html(encodedStr).text();
console.log(decoded);
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<div/>

Without any jQuery:
function decodeEntities(encodedString) {
var textArea = document.createElement('textarea');
textArea.innerHTML = encodedString;
return textArea.value;
}
console.log(decodeEntities('1 & 2')); // '1 & 2'
This works similarly to the accepted answer, but is safe to use with untrusted user input.
Security issues in similar approaches
As noted by Mike Samuel, doing this with a <div> instead of a <textarea> with untrusted user input is an XSS vulnerability, even if the <div> is never added to the DOM:
function decodeEntities(encodedString) {
var div = document.createElement('div');
div.innerHTML = encodedString;
return div.textContent;
}
// Shows an alert
decodeEntities('<img src="nonexistent_image" onerror="alert(1337)">')
However, this attack is not possible against a <textarea> because there are no HTML elements that are permitted content of a <textarea>. Consequently, any HTML tags still present in the 'encoded' string will be automatically entity-encoded by the browser.
function decodeEntities(encodedString) {
var textArea = document.createElement('textarea');
textArea.innerHTML = encodedString;
return textArea.value;
}
// Safe, and returns the correct answer
console.log(decodeEntities('<img src="nonexistent_image" onerror="alert(1337)">'))
Warning: Doing this using jQuery's .html() and .val() methods instead of using .innerHTML and .value is also insecure* for some versions of jQuery, even when using a textarea. This is because older versions of jQuery would deliberately and explicitly evaluate scripts contained in the string passed to .html(). Hence code like this shows an alert in jQuery 1.8:
//<!-- CDATA
// Shows alert
$("<textarea>")
.html("<script>alert(1337);</script>")
.text();
//-->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.2.3/jquery.min.js"></script>
* Thanks to Eru Penkman for catching this vulnerability.

Like Mike Samuel said, don't use jQuery.html().text() to decode html entities as it's unsafe.
Instead, use a template renderer like Mustache.js or decodeEntities from #VyvIT's comment.
Underscore.js utility-belt library comes with escape and unescape methods, but they are not safe for user input:
_.escape(string)
_.unescape(string)

I think you're confusing the text and HTML methods. Look at this example, if you use an element's inner HTML as text, you'll get decoded HTML tags (second button). But if you use them as HTML, you'll get the HTML formatted view (first button).
<div id="myDiv">
here is a <b>HTML</b> content.
</div>
<br />
<input value="Write as HTML" type="button" onclick="javascript:$('#resultDiv').html($('#myDiv').html());" />
<input value="Write as Text" type="button" onclick="javascript:$('#resultDiv').text($('#myDiv').html());" />
<br /><br />
<div id="resultDiv">
Results here !
</div>
First button writes : here is a HTML content.
Second button writes : here is a <B>HTML</B> content.
By the way, you can see a plug-in that I found in jQuery plugin - HTML decode and encode that encodes and decodes HTML strings.

The question is limited by 'with jQuery' but it might help some to know that the jQuery code given in the best answer here does the following underneath...this works with or without jQuery:
function decodeEntities(input) {
var y = document.createElement('textarea');
y.innerHTML = input;
return y.value;
}

You can use the he library, available from https://github.com/mathiasbynens/he
Example:
console.log(he.decode("Jörg &amp Jürgen rocked to & fro "));
// Logs "Jörg & Jürgen rocked to & fro"
I challenged the library's author on the question of whether there was any reason to use this library in clientside code in favour of the <textarea> hack provided in other answers here and elsewhere. He provided a few possible justifications:
If you're using node.js serverside, using a library for HTML encoding/decoding gives you a single solution that works both clientside and serverside.
Some browsers' entity decoding algorithms have bugs or are missing support for some named character references. For example, Internet Explorer will both decode and render non-breaking spaces ( ) correctly but report them as ordinary spaces instead of non-breaking ones via a DOM element's innerText property, breaking the <textarea> hack (albeit only in a minor way). Additionally, IE 8 and 9 simply don't support any of the new named character references added in HTML 5. The author of he also hosts a test of named character reference support at http://mathias.html5.org/tests/html/named-character-references/. In IE 8, it reports over one thousand errors.
If you want to be insulated from browser bugs related to entity decoding and/or be able to handle the full range of named character references, you can't get away with the <textarea> hack; you'll need a library like he.
He just darn well feels like doing things this way is less hacky.

encode:
$("<textarea/>").html('<a>').html(); // return '<a&gt'
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<textarea/>
decode:
$("<textarea/>").html('<a&gt').val() // return '<a>'
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<textarea/>

Try this :
var htmlEntities = "<script>alert('hello');</script>";
var htmlDecode =$.parseHTML(htmlEntities)[0]['wholeText'];
console.log(htmlDecode);
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
parseHTML is a Function in Jquery library and it will return an array that includes some details about the given String..
in some cases the String is being big, so the function will separate the content to many indexes..
and to get all the indexes data you should go to any index, then access to the index called "wholeText".
I chose index 0 because it's will work in all cases (small String or big string).

Use
myString = myString.replace( /\&/g, '&' );
It is easiest to do it on the server side because apparently JavaScript has no native library for handling entities, nor did I find any near the top of search results for the various frameworks that extend JavaScript.
Search for "JavaScript HTML entities", and you might find a few libraries for just that purpose, but they'll probably all be built around the above logic - replace, entity by entity.

I just had to have an HTML entity charater (⇓) as a value for a HTML button. The HTML code looks good from the beginning in the browser:
<input type="button" value="Embed & Share ⇓" id="share_button" />
Now I was adding a toggle that should also display the charater. This is my solution
$("#share_button").toggle(
function(){
$("#share").slideDown();
$(this).attr("value", "Embed & Share " + $("<div>").html("⇑").text());
}
This displays ⇓ again in the button. I hope this might help someone.

You have to make custom function for html entities:
function htmlEntities(str) {
return String(str).replace(/&/g, '&').replace(/</g, '<').replace(/>/g,'>').replace(/"/g, '"');
}

Suppose you have below String.
Our Deluxe cabins are warm, cozy & comfortable
var str = $("p").text(); // get the text from <p> tag
$('p').html(str).text(); // Now,decode html entities in your variable i.e
str and assign back to tag.
that's it.

For ExtJS users, if you already have the encoded string, for example when the returned value of a library function is the innerHTML content, consider this ExtJS function:
Ext.util.Format.htmlDecode(innerHtmlContent)

Extend a String class:
String::decode = ->
$('<textarea />').html(this).text()
and use as method:
"<img src='myimage.jpg'>".decode()

You don't need jQuery to solve this problem, as it creates a bit of overhead and dependency.
I know there are a lot of good answers here, but since I have implemented a bit different approach, I thought to share.
This code is a perfectly safe security-wise approach, as the escaping handler depends on the browser, instead on the function. So, if some vulnerability will be discovered in the future, this solution is covered.
const decodeHTMLEntities = text => {
// Create a new element or use one from cache, to save some element creation overhead
const el = decodeHTMLEntities.__cache_data_element
= decodeHTMLEntities.__cache_data_element
|| document.createElement('div');
const enc = text
// Prevent any mixup of existing pattern in text
.replace(/⪪/g, '⪪#')
// Encode entities in special format. This will prevent native element encoder to replace any amp characters
.replace(/&([a-z1-8]{2,31}|#x[0-9a-f]+|#\d+);/gi, '⪪$1⪫');
// Encode any HTML tags in the text to prevent script injection
el.textContent = enc;
// Decode entities from special format, back to their original HTML entities format
el.innerHTML = el.innerHTML
.replace(/⪪([a-z1-8]{2,31}|#x[0-9a-f]+|#\d+)⪫/gi, '&$1;')
.replace(/⪪#/g, '⪪');
// Get the decoded HTML entities
const dec = el.textContent;
// Clear the element content, in order to preserve a bit of memory (in case the text is big)
el.textContent = '';
return dec;
}
// Example
console.log(decodeHTMLEntities("<script>alert('&awconint;&CounterClockwiseContourIntegral;∳∳⪪#x02233⪫');</script>"));
// Prints: <script>alert('∳∳∳∳⪪#x02233⪫');</script>
By the way, I have chosen to use the characters ⪪ and ⪫, because they are rarely used, so the chance of impacting the performance by matching them is significantly lower.

Here are still one problem:
Escaped string does not look readable when assigned to input value
var string = _.escape("<img src=fake onerror=alert('boo!')>");
$('input').val(string);
Exapmle: https://jsfiddle.net/kjpdwmqa/3/

Alternatively, theres also a library for it..
here, https://cdnjs.com/libraries/he
npm install he //using node.js
<script src="js/he.js"></script> //or from your javascript directory
The usage is as follows...
//to encode text
he.encode('© Ande & Nonso® Company LImited 2018');
//to decode the
he.decode('© Ande & Nonso® Company Limited 2018');
cheers.

To decode HTML Entities with jQuery, just use this function:
function html_entity_decode(txt){
var randomID = Math.floor((Math.random()*100000)+1);
$('body').append('<div id="random'+randomID+'"></div>');
$('#random'+randomID).html(txt);
var entity_decoded = $('#random'+randomID).html();
$('#random'+randomID).remove();
return entity_decoded;
}
How to use:
Javascript:
var txtEncoded = "á é í ó ú";
$('#some-id').val(html_entity_decode(txtEncoded));
HTML:
<input id="some-id" type="text" />

The easiest way is to set a class selector to your elements an then use following code:
$(function(){
$('.classSelector').each(function(a, b){
$(b).html($(b).text());
});
});
Nothing any more needed!
I had this problem and found this clear solution and it works fine.

I think that is the exact opposite of the solution chosen.
var decoded = $("<div/>").text(encodedStr).html();

Develop Reference

JavaScript is the programming language of the Web.