Is there a JavaScript function that converts characters to the &code; equivalent?

I have text that was created with CKeditor, and it seems to be inserting &nbsp; where spaces should be. It appears to do similar conversions with >, <, &, etc., which is fine, except that when I make a DOMSelection, those codes are removed.
So, this is what is selected:
beforeHatchery (2)
But this is what is actually in the DOM:
beforeHatchery (2)
Note that I output the selection and the original text stored in the database using variable.inspect, so all the quotes are escaped (they wouldn't be when sent to the browser).
To save everyone the pain of looking for the difference:
From the first: Hatchery</a> (2) (The Selection)
From the second: Hatchery</a> (2) (The original)
These differences are at the very end of the selection.
So... there are three ways I can see to approach this.
1) - Replace all characters commonly replaced with codes with their codes,
and hope for the best.
2) - Javascript may have some uncommon function / a library may exist that
replaces these characters for me (I think this might be the way CKeditor
does its character conversion).
3) - Figure out the way CKeditor converts and do the conversion exactly that way.
I'm using Ruby on Rails, but that shouldn't matter for this problem.
Some other things that I found out that it converts:
1: It seems to only convert spaces to &nbsp; if the space(s) is before or after a tag:
e.g.: "With quick <a href..."
2: It changes apostrophes to the hex value
e.g.: "opponent's"
3: It changes "&" to "&amp;"
4: It changes angle brackets to "&gt;" and "&lt;" appropriately.
Does anyone have any thoughts on this?

To encode html entities in str (your question title asks for this, if I understand correctly):
$('<div/>').text(str).html();
To decode html entities in str:
$('<div/>').html(str).text();
These rely on jQuery, but vanilla alternatives are basically the same but more verbose.
To encode html entities in str:
var el = document.createElement('div');
el.innerText = str;
el.innerHTML;
To decode html entities in str:
var el = document.createElement('div');
el.innerHTML = str;
el.innerText;
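If there is no DOM available (for example in a server-side script), the same idea can be approximated with plain string replacement. This is only a minimal sketch: escapeHtml and unescapeHtml are hypothetical helper names, not part of jQuery or CKEditor, and the decoder only knows the handful of entities listed in its table. Unlike the DOM approach, this version also escapes quotes.

```javascript
// Hypothetical helpers: a DOM-free approximation of the encode/decode snippets.
const HTML_ESCAPES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;"
};

function escapeHtml(str) {
  // A single pass over the string, so "&" is never escaped twice.
  return String(str).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

// Only the entities in this table are decoded; anything else is left untouched.
const HTML_UNESCAPES = {
  "&amp;": "&",
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&#39;": "'",
  "&nbsp;": "\u00a0"
};

function unescapeHtml(str) {
  return String(str).replace(
    /&(?:amp|lt|gt|quot|#39|nbsp);/g,
    (entity) => HTML_UNESCAPES[entity]
  );
}

console.log(escapeHtml("a < b & c"));
console.log(unescapeHtml("a &lt; b &amp; c"));
```

For real pages the DOM-based versions are usually the safer choice, because the browser knows every named entity.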

1. Conversion of spaces to &nbsp; is usually done by the browser while editing content.
2. Conversion of ' to &#39; can be controlled with http://docs.cksource.com/ckeditor_api/symbols/CKEDITOR.config.html#.entities_additional
3 and 4 are usually needed to avoid breaking markup written in design view when that content is loaded again. You can try changing http://docs.cksource.com/ckeditor_api/symbols/CKEDITOR.config.html#.basicEntities but that can lead to problems in the future.

Related

How is plain text modified when set through innerHTML?

When setting innerHTML = '\r\n', it seems like browsers may end up writing '\n'.
This introduces a gap between the actual plain text content of the element and what I have been keeping track of.
Is this a rather isolated problem or are there many more potential changes I should be aware of?
How to ensure that the content of the text nodes matches exactly what I'm trying to write?
I guess it's possible just not to use innerHTML, build the nodes and the text nodes and insert them, but it's much less convenient.
When you read a string from innerHTML, it's not the string you wrote, it's created completely from scratch by converting the DOM structure of the element into HTML that will (mostly) create it. That means lots of things happen:
Newlines are normalized
Character entities are normalized
Quotes are normalized
Tags are normalized
Tags are corrected if the text you supplied defined an invalid HTML structure
...and so on. You can't expect a round-trip through the DOM to result in exactly the same string.
If you're dealing with pure text content, you can use textContent instead:
const x = document.getElementById("x");
const str = "CRLF: \r\n";
x.textContent = str;
console.log(x.textContent === str);
<div id="x"></div>
I can't 100% vouch for there being no newline normalization (or come to that, Unicode normalization; you might run the string through normalize first) although a quick test with a Chromium-based browser, Firefox, and iOS Safari suggested there wasn't, but certainly most of the issues with innerHTML don't occur.
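As a rough illustration of the Unicode normalization caveat: two strings can render identically and still compare unequal, and String.prototype.normalize brings them to a common form. The helper name sameText is made up for this sketch, and NFC is just one reasonable choice of form.

```javascript
const composed = "\u00e9";    // "é" as a single code point
const decomposed = "e\u0301"; // "e" followed by a combining acute accent

// Hypothetical helper: compare two strings after normalizing both to NFC.
function sameText(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(composed === decomposed);        // the raw strings differ
console.log(sameText(composed, decomposed)); // the normalized strings match
```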
It is not an isolated issue; it is expected behavior, since you are writing HTML when you use the innerHTML property.
If you want to make sure your text matches exactly what you are writing, including new lines and spaces, use the HTML pre tag and write your text node there.
pre tag description: https://www.w3schools.com/tags/tag_pre.asp

&#65279; appearing in textarea elements but not in string

I am working on an autocomplete used inside a textarea. I know there are already autocompletes available, but anyway.
It works well, but when I'm typing something and I select one or more characters and delete them, a &#65279; appears at the end of my string (or wherever the cursor was inside it). I tried to replace it with replaceAll while retrieving my HTML, but it doesn't work (indexOf doesn't find this special character either). The problem is that the search doesn't find any result because of this character. Let's see an example:
This is my array (a little bit cut but we don't really care)
let array = [{
name: "test",
value: "I'm a test value"
},
{
name: "valueorange",
value: "I'm just an orange"
},
// ...remaining entries omitted
];
// This is how I get the contents of my span (I tried both innerHTML and innerText, same results).
// Same while using .text() or .html() with jquery
let value = jqElement.find("#searching-span")[0].innerHTML.substring(1).toLowerCase();
value = value.replaceAll("&nbsp;", " ");
value = value.replaceAll("&#65279;", "");
I can replace every &nbsp; without any problems. Finally, I loop over the array and use indexOf on each value; any entry that matches is pushed into a new array. But when the &#65279; is present, I get no results.
Any idea how I can resolve it?
I tried to be clear. I hope my English wasn't too bad; sorry if I made many mistakes!
Character entities and HTML escaped characters like &nbsp; and &#65279; appearing in HTML source code are converted by the HTML parser into Unicode characters like \u00a0 and \ufeff before being inserted into the DOM.
If replacing them in JavaScript, use their Unicode characters, not HTML escape sequences, to match them in DOM strings. For example:
p.textContent = p.textContent.replaceAll("\ufeff", '*'); // zero width no-break space (BOM)
p.textContent = p.textContent.replaceAll("\xa0", '-'); // nbsp
<p id="p">   </p>
Note that zero width joiners (U+200D, a different character from the U+FEFF above) are used a lot in emoji character sequences, and arbitrarily removing them may break emoji character decoding (although decoding badly formed emoji strings is almost a prerequisite for handling emojis in the wild).
Second note: I am not suggesting this as a means of circumventing badly decoding characters that have been encoded using a Unicode Transform Format. Making sure decoding is performed correctly is always a better option.
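Applied to the autocomplete from the question, the search string could be cleaned before the indexOf check. This is a sketch under assumptions: cleanForSearch and findMatches are made-up names, the array mirrors the one in the question, and matching against the name field is a guess at what the original loop does.

```javascript
const items = [
  { name: "test", value: "I'm a test value" },
  { name: "valueorange", value: "I'm just an orange" }
];

// Hypothetical helper: drop U+FEFF and turn no-break spaces into plain spaces.
function cleanForSearch(text) {
  return text
    .replace(/\ufeff/g, "")
    .replace(/\u00a0/g, " ")
    .toLowerCase();
}

// Hypothetical helper: return every entry whose name contains the typed text.
function findMatches(list, typed) {
  const needle = cleanForSearch(typed);
  return list.filter((item) => item.name.toLowerCase().indexOf(needle) !== -1);
}

const typed = "ora\ufeffnge"; // what the span might contain after a deletion
console.log("valueorange".indexOf(typed));                 // -1: the raw string never matches
console.log(findMatches(items, typed).map((i) => i.name)); // matches once cleaned
```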

JavaScript + RegEx Complications- Searching Strings Not Containing SubString

I am trying to use a RegEx to search through a long string, and I am having trouble coming up with an expression. I am trying to search through some HTML for a set of tags beginning with a tag containing a certain value and ending with a different tag containing another value. The code I am currently using to attempt this is as follows:
matcher = new RegExp(".*(<[^>]+" + startText + "((?!" + endText + ").)*" + endText + ")", 'g');
data.replace(matcher, "$1");
The strangeness around the middle ( ((\\?\\!endText).)* ) is borrowed from another thread, found here, that seems to describe my problem. The issue I am facing is that the expression matches the beginning tag, but it does not find the ending tag and instead includes the remainder of the data. Also, the lookaround in the middle slowed the expression down a lot. Any suggestions as to how I can get this working?
EDIT: I understand that parsing HTML in RegEx isn't the best option (makes me feel dirty), but I'm in a time-crunch and any other alternative I can think of will take too long. It's hard to say what exactly the markup I will be parsing will look like, as I am creating it on the fly. The best I can do is to say that I am looking at a large table of data that is collected for a range of items on a range of dates. Both of these ranges can vary, and I am trying to select a certain range of dates from a single row. The approximate value of startText and endText are \\#\\#ASSET_ID\\#\\#_<YYYY_MM_DD>. The idea is to find the code that corresponds to this range of cells. (This edit could quite possibly have made this even more confusing, but I'm not sure how much more information I could really give without explaining the entire application).
EDIT: Well, this was a stupid question. Apparently, I just forgot to add .* after the last paren. Can't believe I spent so long on this! Thanks to those of you that tried to help!
First of all, why is there a .* Dot Asterisk in the beginning? If you have text like the following:
This is my Text
And you want "my Text" pulled out, you do my\sText. You don't have to do the .*.
That being said, since all you'll be matching now is what you need, you don't need the main Capture Group around "Everything". This: .*(xxx) is a huge no-no, and can almost always be replaced with this: xxx. In other words, your regex could be replaced with:
<[^>]+xxx((?!zzz).)*zzz
From there I examine what it's doing.
You are looking for an HTML opening delimiter <. You consume it.
You consume at least one character that is NOT a closing HTML delimiter, but can consume many. This is important, because if your tag is <table border=2>, then you have, at minimum, so far consumed <t, if not more.
You are now looking for a StartText. If that StartText is table, you'll never find it, because you have consumed the t. So replace that + with a *.
The regex still succeeds if what follows is NOT the closing text, but it starts matching from the VERY END of the document, because the asterisk is greedy. I suggest making it lazy by adding a ?.
When the backtracking fails, it will look for the closing text and gather it successfully.
The result of that logic:
<[^>]*xxx((?!zzz).)*?zzz
If you're going to use a dot anyway, which is okay for new Regex writers, but not suggested for seasoned, I'd go with this:
<[^>]*xxx.*?zzz
So for Javascript, your code would say:
matcher = new RegExp("<[^>]*" + startText + ".*?" + endText, 'gi');
I put the IgnoreCase "i" in there for good measure, but you may or may not want that.
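Here is a small runnable sketch of that final pattern. Two things are my own additions rather than part of the answer: the escapeRegExp helper (startText and endText may contain characters that are special in a regex) and the findRange wrapper. The sample data is invented. Note that . does not match newlines, so this only works while the range sits on one line.

```javascript
// Hypothetical helper: escape characters that have a special meaning in a regex.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical wrapper: return the first span from the tag containing
// startText up to and including endText, or null when there is none.
function findRange(data, startText, endText) {
  const matcher = new RegExp(
    "<[^>]*" + escapeRegExp(startText) + ".*?" + escapeRegExp(endText),
    "i"
  );
  const match = matcher.exec(data);
  return match ? match[0] : null;
}

const data =
  '<td id="A_2020_01_01">1</td>' +
  '<td id="A_2020_01_02">2</td>' +
  '<td id="A_2020_01_03">3</td>';

console.log(findRange(data, "A_2020_01_01", "A_2020_01_02"));
```

Because the middle part is lazy, the match stops at the first occurrence of endText instead of running to the last one.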

Javascript regex whitespace is being wacky

I'm trying to write a regex that searches a page for any script tags and extracts the script content, and in order to accommodate any HTML-writing style, I want my regex to include script tags with any arbitrary number of whitespace characters (e.g. <script type = blahblah> and <script type=blahblah> should both be found). My first attempt ended up with funky results, so I broke down the problem into something simpler, and decided to just test and play around with a regex like /\s*h\s*/g.
When testing it out on a string, for some reason completely arbitrary amounts of whitespace around the 'h' would be a match, and other arbitrary amounts wouldn't, e.g. something like " h " would match but " h " wouldn't. Does anyone have an idea of why this is occurring or the error I'm making?
Since you're using JavaScript, why can't you just use getElementsByTagName('script')? That's how you should be doing it.
If you somehow have an HTML string, create an iframe and dump the HTML into it, then run getElementsByTagName('script') on it.
OK, to extend Kolink's answer, you don't need an iframe, or event handlers:
var temp = document.createElement('div');
temp.innerHTML = otherHtml;
var scripts = temp.getElementsByTagName('script');
... now scripts is a DOM collection of the script elements - and the script doesn't get executed ...
Why regex is not a fantastic idea for this:
As a <script> element may not contain the string </script> anywhere, writing a regex to match them would not be difficult: /<script[\s\S]+?<\/script>/gi
It looks like you want to only match scripts with a specific type attribute. You could try to include that in your pattern too: /<script[^>]+type\s*=\s*(["']?)blahblah\1[\s\S]*?<\/script>/gi - but that is horrible. (That's what happens when you use regular expressions on irregular strings, you need to simplify)
So instead you iterate through all the basic matched scripts, extract the starting tag: result.match(/<script[^>]*>/i)[0] and within that, search for your type attribute /type\s*=\s*((["'])blahblah\2|\bblahblah\b)/.test(startTag). Oh look - it's back to horrible - simplify!
This time via normalisation:
startTag = startTag.replace(/\s*=\s*/g, '=').replace(/=([^\s"'>]+)/g, '="$1"') - now you're in danger territory, what if the = is inside a quoted string? Can you see how it just gets more and more complicated?
You can only have this work using regex if you make robust assumptions about the HTML you'll use it on (i.e. to make it regular). Otherwise your problems will grow and grow and grow!
disclaimer: I haven't tested any of the regex used to see if they do what I say they do, they're just example attempts.
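For completeness, a sketch of the basic "match every script" step, under the same caveat that it only holds up if you can make strong assumptions about the HTML. The function name and sample string are invented. It uses [\s\S] to mean "any character including newlines", because inside a character class a dot is just a literal dot.

```javascript
// Hypothetical helper: collect the text between <script ...> and </script>.
function extractScriptBodies(html) {
  const pattern = /<script\b[^>]*>([\s\S]*?)<\/script\s*>/gi;
  const bodies = [];
  let match;
  // With the g flag, exec() resumes from pattern.lastIndex on every call.
  while ((match = pattern.exec(html)) !== null) {
    bodies.push(match[1]);
  }
  return bodies;
}

const sample =
  '<p>x</p><script type = "text/javascript">\nvar a = 1;\n</script>' +
  "<SCRIPT>b()</SCRIPT>";

console.log(extractScriptBodies(sample));
console.log(/[.\n]/.test("a")); // false: [.\n] only matches a dot or a newline
```

The temp.getElementsByTagName('script') approach above remains the more reliable option whenever a DOM is available.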

How do I escape a string inside JavaScript code inside an onClick handler?

Maybe I'm just thinking about this too hard, but I'm having a problem figuring out what escaping to use on a string in some JavaScript code inside a link's onClick handler. Example:
<a href="#" onclick="SelectSurveyItem('<%itemid%>', '<%itemname%>'); return false;">Select</a>
The <%itemid%> and <%itemname%> are where template substitution occurs. My problem is that the item name can contain any character, including single and double quotes. Currently, if it contains single quotes it breaks the JavaScript code.
My first thought was to use the template language's function to JavaScript-escape the item name, which just escapes the quotes. That will not fix the case of the string containing double quotes which breaks the HTML of the link. How is this problem normally addressed? Do I need to HTML-escape the entire onClick handler?
If so, that would look really strange since the template language's escape function for that would also HTMLify the parentheses, quotes, and semicolons...
This link is being generated for every result in a search results page, so creating a separate method inside a JavaScript tag is not possible, because I'd need to generate one per result.
Also, I'm using a templating engine that was home-grown at the company I work for, so toolkit-specific solutions will be of no use to me.
In JavaScript you can encode single quotes as "\x27" and double quotes as "\x22". Therefore, with this method you can, once you're inside the (double or single) quotes of a JavaScript string literal, use the \x27 \x22 with impunity without fear of any embedded quotes "breaking out" of your string.
\xXX is for chars < 256, and \uXXXX for the rest of Unicode, so armed with this knowledge you can create a robust JSEncode function for all characters that are outside the usual whitelist.
For example,
Select
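A JSEncode function along those lines might look roughly like this. It is written in JavaScript only so it can be run and checked; in practice the same logic would live in the server-side template engine. The name jsEncode and the exact whitelist are assumptions of this sketch.

```javascript
// Hypothetical encoder: keep a small whitelist, hex-escape everything else.
function jsEncode(text) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    const code = text.charCodeAt(i);
    if (/[A-Za-z0-9 ,._-]/.test(ch)) {
      out += ch; // whitelisted characters pass through unchanged
    } else if (code < 256) {
      out += "\\x" + code.toString(16).padStart(2, "0");
    } else {
      out += "\\u" + code.toString(16).padStart(4, "0");
    }
  }
  return out;
}

console.log(jsEncode("O'Brien \"x\""));
```

Since the output contains no quotes, angle brackets or ampersands, it should be safe both inside a JavaScript string literal and inside a quoted HTML attribute.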
Depending on the server-side language, you could use one of these:
.NET 4.0
string result = System.Web.HttpUtility.JavaScriptStringEncode(jsString);
Java
import org.apache.commons.lang.StringEscapeUtils;
...
String result = StringEscapeUtils.escapeJavaScript(jsString);
Python
import json
result = json.dumps(jsString)
PHP
$result = strtr($jsString, array('\\' => '\\\\', "'" => "\\'", '"' => '\\"',
"\r" => '\\r', "\n" => '\\n' ));
Ruby on Rails
<%= escape_javascript(jsString) %>
Use hidden spans, one for each of the parameters <%itemid%> and <%itemname%>, and write their values inside them.
For example, the span for <%itemid%> would look like <span id='itemid' style='display:none'><%itemid%></span>, and the JavaScript function SelectSurveyItem would pick its arguments from these spans' innerHTML.
If it's going into an HTML attribute, you'll need to both HTML-encode it (as a minimum: > to &gt;, < to &lt; and " to &quot;) and escape single-quotes (with a backslash) so they don't interfere with your javascript quoting.
Best way to do it is with your templating system (extending it, if necessary), but you could simply make a couple of escaping/encoding functions and wrap them both around any data that's going in there.
And yes, it's perfectly valid (correct, even) to HTML-escape the entire contents of your HTML attributes, even if they contain javascript.
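The order matters: JavaScript-escape the value first, then HTML-encode the whole attribute. Here is a rough sketch of the two functions, written in JavaScript for illustration even though they would normally be part of the templating system. All three function names are made up, and & is encoded in addition to the minimum listed above.

```javascript
// Hypothetical step 1: make the value safe inside a single-quoted JS string.
function escapeJsString(text) {
  return text
    .replace(/\\/g, "\\\\")
    .replace(/'/g, "\\'")
    .replace(/\r/g, "\\r")
    .replace(/\n/g, "\\n");
}

// Hypothetical step 2: make the finished code safe inside a double-quoted attribute.
function escapeHtmlAttr(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical wrapper that applies both steps in the right order.
function buildOnclick(itemId, itemName) {
  const js =
    "SelectSurveyItem('" + escapeJsString(itemId) + "', '" +
    escapeJsString(itemName) + "'); return false;";
  return 'onclick="' + escapeHtmlAttr(js) + '"';
}

console.log(buildOnclick("42", "Bob's \"big\" <deal>"));
```

The browser undoes the HTML encoding when it parses the attribute, so the JavaScript engine only ever sees the backslash-escaped string.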
Try to avoid using string-literals in your HTML and use JavaScript to bind JavaScript events.
Also, avoid 'href=#' unless you really know what you're doing. It breaks so much usability for compulsive middleclickers (tab opener).
<a id="tehbutton" href="somewhereToGoWithoutWorkingJavascript.com">Select</a>
My JavaScript library of choice just happens to be jQuery:
<script type="text/javascript">//<!-- <![CDATA[
jQuery(function($){
$("#tehbutton").click(function(){
SelectSurveyItem('<%itemid%>', '<%itemname%>');
return false;
});
});
//]]>--></script>
If you happen to be rendering a list of links like that, you may want to do this:
<a id="link_1" href="foo">Bar</a>
<a id="link_2" href="foo2">Baz</a>
<script type="text/javascript">
jQuery(function($){
var l = [[1,'Bar'],[2,'Baz']];
$(l).each(function(k,v){
$("#link_" + v[0] ).click(function(){
SelectSurveyItem(v[0],v[1]);
return false;
});
});
});
</script>
Another interesting solution might be to do this:
Select
Then you can use a standard HTML-encoding on both the variables, without having to worry about the extra complication of the javascript quoting.
Yes, this does create HTML that is strictly invalid. However, it is a valid technique, and all modern browsers support it.
If it were me, I'd probably go with my first suggestion, and ensure the values are HTML-encoded and have single-quotes escaped.
Declare separate functions in the <head> section and invoke those in your onClick method. If you have lots, you could use a naming scheme that numbers them, or pass an integer in your onClick handlers and have a big fat switch statement in the function.
Any good templating engine worth its salt will have an "escape quotes" function. Ours (also home-grown, where I work) also has a function to escape quotes for javascript. In both cases, the template variable is then just appended with _esc or _js_esc, depending on which you want. You should never output user-generated content to a browser that hasn't been escaped, IMHO.
I have faced this problem as well. I made a script to convert single quotes into escaped double quotes that won't break the HTML.
function noQuote(text)
{
var newtext = "";
for (var i = 0; i < text.length; i++) {
if (text[i] == "'") {
newtext += "\"";
}
else {
newtext += text[i];
}
}
return newtext;
}
Use the Microsoft Anti-XSS library which includes a JavaScript encode.
First, it would be simpler if the onclick handler was set this way:
<a id="someLinkId" href="#">Select</a>
<script type="text/javascript">
document.getElementById("someLinkId").onclick =
function() {
SelectSurveyItem('<%itemid%>', '<%itemname%>'); return false;
};
</script>
Then itemid and itemname need to be escaped for JavaScript (that is, " becomes \", etc.).
If you are using Java on the server side, you might take a look at the class StringEscapeUtils from jakarta's common-lang. Otherwise, it should not take too long to write your own 'escapeJavascript' method.
Are the answers here saying that you can't escape quotes using JavaScript and that you need to start with escaped strings?
So there's no way for JavaScript to handle the string 'Marge said "I'd look that was" to Peter', and you need your data to be cleaned before passing it to the script?
I faced the same problem, and I solved it in a tricky way. First make global variables, v1, v2, and v3. And in the onclick, send an indicator, 1, 2, or 3 and in the function check for 1, 2, 3 to put the v1, v2, and v3 like:
onclick="myfun(1)"
onclick="myfun(2)"
onclick="myfun(3)"
function myfun(which) // "var" is a reserved word, so it can't be a parameter name
{
if (which == 1)
alert(v1);
if (which == 2)
alert(v2);
if (which == 3)
alert(v3);
}
