String is treated differently when extracted from DOM - javascript

I am facing a very weird problem with JavaScript. When I extract text from the DOM and try to decode HTML entities, it doesn't work. However, when I assign the value directly in the code, it works just fine.
I just don't get why the string is treated differently in the two cases. I have tested in Firefox and Chrome, and both produce the same result.
Update:
The correct output should be %7B (after decoding the string). That means that when I assign the value directly to the variable it works correctly, but when it is extracted from the DOM it does not. How can I extract the text from the DOM and decode it so that it produces "%7B"?
DEMO: jsFiddle
HTML:
<div class="myclass">\u00257B</div>
Javascript Code:
$(document).ready(function () {
    // Extracting the text from DOM
    var myText = $(".myclass").html();
    // decoding HTML entities
    var decodedText = $("<div />").html(myText).text();
    // alerting the decoded text
    alert(decodedText); // output: \u00257B

    // assigning the value directly to the variable
    var myText2 = "\u00257B";
    // decoding HTML entities
    var decodedText2 = $("<div />").html(myText2).text();
    // alerting decoded text
    alert(decodedText2); // output: %7B
});

myText2 produces a different result because the backslash is an escape character in string literals: "\u0025" is parsed as the single character %, so myText2 already holds %7B before any HTML decoding takes place.
To get a literal backslash in a string literal, write it twice:
myText2 = "\\u00257b";
Here is some further information about escape characters in JavaScript
EDIT
There's probably a better way, but this will work: (eval is generally frowned upon and has security implications if the value from your text is uncontrolled input)
myText = eval("\"" + decodedText + "\"")
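As an alternative to eval, the escape sequences can be decoded with a plain replace callback. The following is only a minimal sketch: decodeUnicodeEscapes is a hypothetical helper name, not part of jQuery or the original answers, and it only handles four-digit \uXXXX sequences.

```javascript
// Hypothetical helper (not from the original answers): turns literal
// "\uXXXX" sequences, as read from the DOM, into the characters they name.
function decodeUnicodeEscapes(text) {
  return text.replace(/\\u([0-9a-fA-F]{4})/g, function (match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

// Text read from the DOM contains a real backslash, so it is written
// as "\\u" in this string literal.
var fromDom = "\\u00257B";
console.log(decodeUnicodeEscapes(fromDom)); // %7B
```

Because nothing is evaluated as code, uncontrolled input cannot execute anything here.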

I think this is because when you extract the string from the DOM, the "\u" is a literal backslash followed by u rather than an escape sequence.
If you do var myText2 = "\\u00257B"; you'll get the same result:
http://jsfiddle.net/9n6t5qxr/1/
If you do console.log('\u0025') it prints %, which is why you are seeing %7B.

Related

Regex to find a specific string that is not in a HTML attribute

My case is: I have a string with HTML elements:
This is a text and "specific_string"
I need a Regex to match only the one that is not in a HTML attribute.
This is my current Regex. It works, but it gives a false positive when the string is wrapped in double quotes:
((?!\"[\w\s]*)specific_string(?![\w\s]*\"))
If you want to get what's inside the tag, you could use split() to cut the string at every ">" and "<", basically like this:
let string = "<a href='something+specific_string' title='testing'>This is a text and 'specific_string'</a>";
string = string.split('>');
string = string[1].split('<');
console.log(string)
So, when you want to manipulate it, just use position 0 of the resulting array. It's not a regex like you wanted, but it's an idea.
Though it can suffice in simple cases, you should know it's often said that RegExp is ill-suited for parsing HTML, and depending on the environment you could be better off using more robust techniques. (http://htmlparsing.com/ is dedicated to the topic, though it doesn't discuss JS.)
That said, the following works in Chrome 107 and Node 16.13.
(s=>s.match(/(?<=>[^<]*|^[^<]*)specific_string/))
('This is a text and "specific_string"')
It uses look-behind. In lieu of that you could use /(>[^<]*|^[^<]*)(specific_string)/ and compensate index/lengths to get the position of a match...
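A minimal sketch of that look-behind-free variant, where the prefix is captured and its length is added to the match index. The function name findOutsideTags and the sample markup are assumptions made for illustration only.

```javascript
// Hypothetical helper: finds "needle" only where it appears in text content,
// i.e. after a ">" (or at the very start) with no "<" in between.
function findOutsideTags(html, needle) {
  // needle is assumed to contain no regex metacharacters.
  var re = new RegExp("(>[^<]*|^[^<]*)(" + needle + ")");
  var m = re.exec(html);
  if (m === null) return -1;
  // m.index points at the start of the captured prefix, so skip past it.
  return m.index + m[1].length;
}

var sample = "<a href='x+specific_string'>This is 'specific_string'</a>";
var at = findOutsideTags(sample, "specific_string");
console.log(sample.slice(at, at + "specific_string".length)); // specific_string
```

Because the prefix is matched greedily, the reported position is that of the last occurrence within the matched run of text.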
As you answer in a comment that you'll replace in user-provided HTML, I encourage you to consider security implications (namely XSS).
Back on the topic of parsing HTML w/o RegExp we obviously have the techniques in a web browser and I couldn't stop myself writing a quick and dirty textNode replacer in web JS, working in Chrome 107:
((html, fun) => {
    const el = document.createElement('body')
    el.innerHTML = html
    const X = new XPathEvaluator, R = X.evaluate('//*[text()]', el)
    const A = []; for (let n; n = R.iterateNext();) A.push(n) // mutating el while iterating XPathResult is illegal
    for (let n of A) fun(n)
    return el.innerHTML
})
('This is a text and "specific_string"',
    n => n.innerHTML = n.innerHTML
        .replace(/specific_string/, '<b>replaced</b>'))

jQuery html() function and &nbsp;

I have a string that contains a Unicode non-breaking space. I need to save this string to a hidden HTML element, so another function can read this value.
It looks like html() function does some transformation of the string. Example:
var testString = "string with \xa0 non breaking space";
$(".export-file-buffer").html(testString);
var receivedString = $(".export-file-buffer").html();
console.log(testString);
console.log(receivedString);
What I see in console:
string with   non breaking space
string with &nbsp; non breaking space
Why exactly is this happening? Could you point me to the docs that describe this behavior?
Rather than making it displayable, if you just need to store a reference to it on an element you can use the data() method.
var testString = "string with \xa0 non breaking space";
var $target = $('#target');

$target.data('rawData', testString);
console.log($target.data('rawData'));

var fromData = $target.data('rawData');
console.log(
    fromData.split('').map(function (character) {
        // hex-escape anything outside the printable ASCII range
        if (character < ' ' || character > '~') {
            return '\\x' + character.charCodeAt(0).toString(16);
        } else {
            return character;
        }
    }).join('')
);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="target"></div>
As you can see, the value is not converted to &nbsp;. The reason for this is that when jQuery sets a value on an element with data(), it does not put it directly on the element. Rather, it stores the value in an internal cache and associates the value with the element. Since the value lives only in JavaScript memory, the browser does not convert it.
The value still does not print in the console as \xa0, because that character has no visible glyph in the printable ASCII range. For that reason I included a little script that hex-escapes every character that falls before the space or after the tilde on the ASCII chart.
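As for why html() returns &nbsp; in the first place: reading html() returns the element's serialized markup, and HTML serialization escapes a few characters in text nodes, the non-breaking space among them. The sketch below imitates that escaping step in plain JavaScript; escapeTextLikeInnerHTML is a hypothetical name and this is not jQuery's own code.

```javascript
// Sketch of the text-escaping step of HTML serialization (assumption: this
// mirrors what html()/innerHTML does for text nodes; it is not jQuery code).
function escapeTextLikeInnerHTML(text) {
  return text
    .replace(/&/g, "&amp;")       // must run first, or later entities get re-escaped
    .replace(/\u00A0/g, "&nbsp;") // non-breaking space
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

var testString = "string with \xa0 non breaking space";
console.log(escapeTextLikeInnerHTML(testString));
// string with &nbsp; non breaking space
```

If the raw character is what you need, reading the element with text() rather than html() should return it without this escaping.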
There are multiple ways of keeping a string so that it can be shared on an HTML page:
- Using input type="hidden". Here you can simply store the value with $(".export-file-buffer").val(testString), just like you did with html().
- (Less recommended) Using a global variable, i.e. keeping it on window. A hyphen is not valid in a dot-notation property name, so use bracket notation:
window["export-file-buffer"] = testString
and retrieve it later by reading window["export-file-buffer"]

IE11 innerHTML strange behaviour

I have very strange behaviour with element.innerHTML in IE11.
As you can see there: http://pe281.s3.amazonaws.com/index.html, some riotjs expressions are not evaluated.
I've tracked it down to 2 things:
- the euro sign above it. It's encoded as €, but I have the same behaviour with \u20AC or €. It happens with all characters in the currency symbols range, and some other ranges. Removing or using a standard character does not cause the issue.
- The way riotjs creates a custom tag and template. Basically it does this:
var html = "{reward.amount.toLocaleString()}<span>€</span>{moment(expiracyDate).format('DD/MM/YYYY')}";
var e = document.createElement('div');
e.innerHTML = html;
In the resulting e node, e.childNodes returns the following array:
[0]: {reward.amount.toLocaleString()}
[1]: <span>€</span>
[2]: {
[3]: moment(expiracyDate).format('DD/MM/YYYY')}
Obviously nodes 2 and 3 should be a single node. Having them split prevents riot from recognizing an expression to evaluate, hence the issue.
But there's more: The problem is not consistent, and for instance cannot be reproduced on a fiddle: https://jsfiddle.net/5wg3zxk5/4/, where the html string is correctly parsed.
So I guess my question is how can some specific characters change the way element.innerHTML parses its input? How can it be solved?
.childNodes is a generated array (...well NodeList) that is filled with ELEMENT_NODE but may also be filled with: ATTRIBUTE_NODE, TEXT_NODE, CDATA_SECTION_NODE, ENTITY_REFERENCE_NODE, ENTITY_NODE, PROCESSING_INSTRUCTION_NODE, COMMENT_NODE, DOCUMENT_NODE, DOCUMENT_TYPE_NODE, DOCUMENT_FRAGMENT_NODE, NOTATION_NODE, ...
You probably want only nodes from the type: ELEMENT_NODE (div and such..) and maybe also TEXT_NODE.
Use a simple loop to keep just those nodes with .nodeType === Element.ELEMENT_NODE (or just compare it to its enum which is 1).
You can also just use the much simpler alternative, .children, which contains element nodes only.
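The filtering loop described above could look roughly like this. elementAndTextNodes is a hypothetical helper, and plain objects stand in for DOM nodes so that the sketch also runs outside a browser; in a page you would pass e.childNodes instead.

```javascript
// Hypothetical helper: keeps only element (nodeType 1) and text (nodeType 3)
// nodes from a NodeList or any other array-like collection.
function elementAndTextNodes(nodeList) {
  return Array.prototype.filter.call(nodeList, function (node) {
    return node.nodeType === 1 || node.nodeType === 3;
  });
}

// Stand-in objects instead of real DOM nodes.
var fakeChildNodes = [
  { nodeType: 3, nodeValue: "{reward.amount.toLocaleString()}" },
  { nodeType: 1, nodeName: "SPAN" },
  { nodeType: 8, nodeValue: "a comment" }
];
console.log(elementAndTextNodes(fakeChildNodes).length); // 2
```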
Replace <br> with <br /> (they are self-closing tags). IE is trying to close the tags for you. That's why you have doubled br tags
I think it should be something like this:
var html = reward.amount.toLocaleString() + "€<br>" + moment(expiracyDate).format('DD/MM/YYYY') + " <br>";
var e = document.createElement('div');
e.innerHTML = html;
The stuff I removed from the quotes seems to be variables or other expressions rather than string content, so it should not be in quotes.

Javascript string parser - escape issue

I'm running a Node server that receives a plain utf8 text and parses the content to JSON. Part of the JSON will be the body of an HTML document.
The problem is that when the input has characters such as "ä" or " ' ", the HTML document gets all crazy. I guess it has to do with the coding/decoding of the parser for these special characters.
Any ideas regarding this ?
[EDIT]
The parsing and JSON object are basically this:
var string = "<mail_body><html> html code here...<html><mail_body>";
var mail_body = string.split("<mail_body>")[1];
var obj = {
    "subject": "subject 123",
    "mail_body": mail_body
};
You can use this for the "'":
var escapedText = text.replace(/'/g, "\\'");
and use a Unicode escape for the "letter a with eyes" (ä),
like this -> \u00E4
https://mathiasbynens.be/notes/javascript-escapes
The most important thing you need to do is to escape the incoming string to eliminate quotes that will break your JSON, which is the only significant problem I would expect to see with Node - browsers have a slightly harder time. From your input you're looking at something like this:
var string = "<mail_body><html> html code here...<html><mail_body>";
var mail_body = string.split("<mail_body>")[1];
mail_body = mail_body.replace(/\"/g, '\\"'); // global replace: put a backslash before every double quote
That should get you a mail body that doesn't unexpectedly end and break the rest of your JSON.
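If the JSON text is produced with JSON.stringify rather than by concatenating strings, the quote escaping happens automatically. A minimal sketch follows, in which the sample body is made up because the real HTML is elided in the question.

```javascript
// Placeholder input; the real HTML body is elided in the question.
var string = "<mail_body><html>He said \"hi\", it's ä</html><mail_body>";
var mail_body = string.split("<mail_body>")[1];

var obj = { subject: "subject 123", mail_body: mail_body };

// JSON.stringify escapes double quotes and backslashes inside string values.
var json = JSON.stringify(obj);
console.log(json);

// Parsing the result gives back the original body unchanged.
console.log(JSON.parse(json).mail_body === mail_body); // true
```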

Value &# to unicode convert

I have lots of characters in the form &#182; which I would like to display as Unicode characters in my text editor.
This ought to convert them:
var newtext = doctext.replace(
/&#(\d+);/g,
String.fromCharCode(parseInt("$1", 10))
);
But doesn't seem to work. The regular expression /&#(\d+);/ is getting me the numbers out -- but the String.fromCharCode does not appear to give the results I'd like. What is up?
The replacement part should be an anonymous function instead of an expression:
var newtext = doctext.replace(
    /&#(\d+);/g,
    function($0, $1) {
        return String.fromCharCode(parseInt($1, 10));
    }
);
The replace method is not foolproof if you use full HTML (i.e. you don't control what the input is). For example, the method submitted by Jack (and obviously the idea in the original post as well) works excellently if your entities are all decimal, but doesn't work for hex ones like &#x41;, and even less for named entities like &quot;.
For this, there is another trick you can do: create an element, set its innerHTML to the source, then read out its text value. Basically, browsers know what to do with entities, so we delegate. :) In jQuery it is easy:
$('<div/>').html('&amp;').text()
// => "&"
With plain JS it gets a bit more verbose:
var el = document.createElement('div');
el.innerHTML = '&amp;';
el.textContent
// => "&"
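Where no DOM is available (for example in Node), the replace approach can at least be extended to cover hex as well as decimal references. This is only a sketch: decodeNumericEntities is a hypothetical helper, and named entities such as &quot; are deliberately left alone because they would need a lookup table.

```javascript
// Hypothetical helper: decodes decimal (&#182;) and hex (&#xB6;) numeric
// character references. Named entities such as &quot; are left untouched.
function decodeNumericEntities(text) {
  return text.replace(/&#(x[0-9a-fA-F]+|\d+);/gi, function (match, code) {
    var isHex = code.charAt(0).toLowerCase() === "x";
    var codePoint = isHex ? parseInt(code.slice(1), 16) : parseInt(code, 10);
    // Note: String.fromCodePoint throws a RangeError above 0x10FFFF.
    return String.fromCodePoint(codePoint);
  });
}

console.log(decodeNumericEntities("&#182; &#xB6; &#x41; &quot;")); // ¶ ¶ A &quot;
```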
