JavaScript/HTML/Unicode accents: &aacute; != á


I want to check if a user submitted string is the same as the string in my answer key. Sometimes the words involve Spanish accents (like in sábado), and that makes the condition always false.
I have Firebug log $('#answer').val() and it shows up as sábado. (The á comes from a button that inserts the value á, if that matters.) Logging the answer from the answer key, however, shows s&aacute;bado (how I wrote it in the actual answer key).
I have tried replacing the &aacute in the answer key with a normal á, but it still doesn't work, and results in a Unicode diamond-question-mark. When I do that and also replace the value of the button that makes the user-submitted á, the condition works correctly, but then the button, the user string, and the answer string all have the weird Unicode diamond-question-mark.
I have also tried using a different escaped form of á in both places, and it's no different from using &aacute;. Both my HTML and JavaScript are using charset="utf-8".
How can I fix this?

If you're consistently using UTF-8, there's no need for HTML entities except to encode syntax characters (i.e. <, >, & and, within attributes, ").
For anything else, use the proper characters, and your problems should go away, at least until you run into Unicode normalization issues, i.e. the difference between 'a\u0301' and '\u00E1'...
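The normalization issue mentioned above can be sketched as follows. This is a minimal illustration, assuming an environment that supports String.prototype.normalize (ES2015 or later); the helper name sameText is hypothetical and not part of the original answer.

```javascript
// Two spellings of "sábado": precomposed U+00E1 vs. "a" followed by
// the combining acute accent U+0301. They render the same but differ as strings.
var precomposed = 's\u00E1bado';
var decomposed = 'sa\u0301bado';

// Hypothetical helper: compare two strings after normalizing both to NFC.
function sameText(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

console.log(precomposed === decomposed);        // false
console.log(sameText(precomposed, decomposed)); // true
```

Normalizing both the user input and the answer key before comparing avoids depending on which form the keyboard or the source file happened to produce.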

The issue is that you're not using the real UTF-8 characters in both strings (entered answer and the key). You should NOT be supplying "a button that inserts the value á" -- Re: "if that matters" it does!
The characters should be added by the keyboard input system. And your comparison string should also be only utf-8 characters. It should NOT be character entities.

Related

Compare strings with different encodings

I just needed to compare two strings in JavaScript, and the comparison of specific strings sometimes failed.
One value was obtained with jQuery via the text() method (from some auto-generated HTML):
var value1 = $('#somelement').text();
The other value is hardcoded in a JavaScript file (from me).
After some testing I found that these strings have different encodings, which became clear when I logged them with the escape() function.
Firebug showed me something like this:
console.log(escape(value1));
"blabla%A0%28blub%29"
console.log(escape(value2));
"blabla%20%28blub%29"
So in the end it's the whitespace with different encodings which made my comparison fail.
So my question is: how to handle this correctly? Can I just replace the whitespace to be equal? But I guess there are other control characters - like tab, return and so on - which could mess up my comparison?

So in the end it's the whitespace with different encodings which made my comparison fail.
No, it is not a different encoding. It is just a different whitespace - a non-breaking space.
Can I just replace the white space to be equal? But I guess there are other control characters - like tab, return and so on - which could mess up my comparison?
You can replace all of them. You might want to try something like
value1.replace(/\s+/g, " ").replace(/^\s*|\s$/g, "") == value2
which joins multiple whitespaces (of all kinds, including returns) to a single space and also trims the string before the comparison.
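As a rough sketch of the same idea applied to both sides of the comparison: the helper name normalizeWhitespace is hypothetical, and trim() is used here in place of the second replace() from the answer. The sample strings are reconstructed from the escape() output in the question.

```javascript
// Hypothetical helper: collapse every run of whitespace (including the
// non-breaking space U+00A0, tabs and line breaks) into one ordinary space,
// then strip leading and trailing whitespace.
function normalizeWhitespace(s) {
  return s.replace(/\s+/g, ' ').trim();
}

var value1 = 'blabla\u00A0(blub)'; // contains a non-breaking space
var value2 = 'blabla (blub)';      // contains an ordinary space

console.log(value1 === value2);                                           // false
console.log(normalizeWhitespace(value1) === normalizeWhitespace(value2)); // true
```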

Define allowed characters in text objects HTML

Is there any way I can define the encoding in text areas using HTML and pure JS?
I want them to reject special Unicode characters (such as ♣♦♠).
The valid character range (for my purpose) is from Unicode code point U+0000 to U+00FF.
It is OK to silently replace invalid characters with an empty string upon form-submission (without warning to the user).

So, as you have clarified in your comments: you want to replace the characters you deem illegal with empty strings on form-submission without warning.
Given the following example html (body content):
<form action="demo_form.asp">
First name: <input type="text" name="fname" /><br>
Last name: <input type="text" name="lname" /><br>
Likes: <textarea name="txt_a"></textarea><br>
Dislikes: <textarea name="txt_b"></textarea><br>
<input type="submit" value="Submit">
</form>
Here is the basic concept in JavaScript:
function demo() {
  // `this` is the form being submitted
  var elms = this.getElementsByTagName('textarea');
  for (var i = elms.length; i--;) {
    // remove every character outside U+0000 to U+00FF
    elms[i].value = elms[i].value.replace(/[^\u0000-\u00FF]/g, '');
  }
}
window.onload = function () {
  document.forms[0].onsubmit = demo; // hook the form's onsubmit; use any method you like
};
The basic idea is to force the browser's regex engine to match on Unicode (not local charset) using the \uXXXX notation.
Then we simply make a range: [\u0000-\u00FF] and finally specify we want to match on everything outside that range: [^\u0000-\u00FF].
Everything that matches those criteria will be replaced by '' (an empty string) on form-submission. No warning no nothing.
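To try the regular expression without a form, the replacement can be pulled out into a small function. This is only a sketch; the name stripOutsideLatin1 is hypothetical and does not appear in the answer.

```javascript
// Hypothetical helper: remove every UTF-16 code unit outside U+0000 to U+00FF.
// Characters beyond the BMP are stored as surrogate pairs, and both halves
// fall outside the range, so they are removed as well.
function stripOutsideLatin1(text) {
  return text.replace(/[^\u0000-\u00FF]/g, '');
}

console.log(stripOutsideLatin1('s\u00E1bado\u2663\u2666\u2660')); // sábado
console.log(stripOutsideLatin1('\u201Cquoted\u201D \u20AC5'));    // quoted 5
```

Note that á (U+00E1) survives because it lies inside the allowed range, while the card suits, the curly quotes and the euro sign are dropped silently.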
You can/should freely expand this concept to incorporate this into your code (in a way that fits your code-flow) (and where needed, apply it to input type="text" etc), depending on your further requirements.
This should get you started!
EDIT:
Note that your current valid-range specification (\u0000-\u00FF) will effectively dis-allow all such 'pesky' special characters like:
fancy quotes ‘ ’ “ ”
(that's a great feature for people copying from Word etc.),
€ ™ Œ œ, etc.
But it will nicely include the full C1 control block (all 32 control characters). On the other hand, that is consistent with including the full C0 control block.
Effectively, this is now your (what you requested) valid char-set: http://en.wikipedia.org/wiki/ISO/IEC_8859-1
As you can now see, there is a lot more to this. That is why sane applications (finally) are starting to use Unicode (usually encoded for the web as UTF-8) and just accept what the users provide (within (extremely clearly specified) reason)!
Most common validation questions are, in the real world, nothing more than a high-school-class example of the concept of validating (and, even more to the point, a way to explain the basics of regular expressions with what are considered easily understandable examples, like name/email/address). Sadly, they are widely applied even by some government identity systems (up to passports etc.) to people's names, addresses and so on. In fact, even the full current Unicode cannot represent every living person's name in its native writing! Real-world example: try boarding and leaving a commercial flight when your boarding pass has different credentials than your passport (regardless of which one is wrong). 'Just' a missing umlaut is going to be a problem somewhere. Worse example: imagine a woman with a German first name and a Thai last name, married to a man with a Mandarin last name.
Source: xkcd.com/1171/
Finally: Please do realize that in most cases this whole exercise is useless (if you do it silently without warning), because:
You may never just accept user input on the server side without proper cleanup, so you are already (silently, without the user knowing it) cleaning up your input into the form that you require. To a novice programmer (who forgets to think about, for example, users with JavaScript disabled), this sometimes feels like repeating the work already done in JavaScript on the client side.
Usually, the only use of replicating the server-side behavior on the client-side (usually using javascript) is so the user dynamically knows what would be dis-allowed by the server (without sending data back and forth) and can adapt accordingly!

You can use form attribute accept-charset
The accept-charset attribute specifies the character encodings that
are to be used for the form submission.
The default value is the reserved string "UNKNOWN" (indicates that the
encoding equals the encoding of the document containing the
element).
See this documentation http://www.w3schools.com/tags/att_form_accept_charset.asp
I cannot say if this will protect the text field but at least it controls what character set is submitted by the form.
Actually this issue has already been answered
javascript to prevent writing into form elements after n utf 8 characters

JavaScript automatically converts some special characters

I need to extract a HTML-Substring with JS which is position dependent. I store special characters HTML-encoded.
For example:
HTML
<div id="test"><p>l&ouml;sen &amp; gr&uuml;&szlig;en</p></div>
Text
lösen & grüßen
My problem lies in the JS-part, for example when I try to extract the fragment
l&ouml;, which has the HTML-dependent starting position of 3 and the end position of 9 inside the <div> block. JS seems to convert some special characters internally so that the count from 3 to 9 is wrongly interpreted as "lösen " and not "l&ouml;". Other special characters like the &amp; are not affected by this.
So my question is whether someone knows why JS behaves this way. Characters like &auml; or &ouml; are converted, while characters like &amp; or &nbsp; are left as they are. Is there any way to avoid this conversion?
I've set up a fiddle to demonstrate this: JSFiddle
Thanks for any help!
EDIT:
Maybe my explanation was a bit confusing, sorry for that. What I want is the HTML:
<p>l&ouml;sen &amp; gr&uuml;&szlig;en</p> .
Every special character should remain unconverted, except for the HTML tags, as in the HTML above.
But JS automatically converts the &ouml; or &uuml; into ö or ü, which is what I need to avoid.

That's because the browser (and not JavaScript) turns entities that don't need to be escaped in HTML into their respective Unicode characters (e.g. it skips &amp;, &lt; and &gt;).
So by the time you inspect .innerHTML, it no longer contains exactly what was in the original page source; you could reverse this process, but it involves the full map of character <-> entity pairs which is just not practical.
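To illustrate what reversing the process would involve, here is a deliberately partial sketch. It assumes the original source wrote ö, ü and ß as the named entities &ouml;, &uuml; and &szlig;; the map ENTITY_NAMES and the function reencodeEntities are hypothetical and cover only four characters. A real solution would need the complete entity table, and even then it could not know which characters were actually written as entities in the source.

```javascript
// Partial, hand-written map (assumption: only these characters occur).
var ENTITY_NAMES = {
  '\u00E4': 'auml',
  '\u00F6': 'ouml',
  '\u00FC': 'uuml',
  '\u00DF': 'szlig'
};

// Hypothetical helper: turn known non-ASCII characters back into named entities.
// Characters missing from the map are left untouched.
function reencodeEntities(html) {
  return html.replace(/[\u0080-\uFFFF]/g, function (ch) {
    return Object.prototype.hasOwnProperty.call(ENTITY_NAMES, ch)
      ? '&' + ENTITY_NAMES[ch] + ';'
      : ch;
  });
}

// What .innerHTML typically returns for the markup in the question.
var serialized = '<p>l\u00F6sen &amp; gr\u00FC\u00DFen</p>';
var restored = reencodeEntities(serialized);

console.log(restored);                  // <p>l&ouml;sen &amp; gr&uuml;&szlig;en</p>
console.log(restored.substring(3, 10)); // l&ouml;
```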

If I understand you correctly, then try using innerHTML, or .html('your html code') with jQuery, on the target element.

what kind of encoding is this?

I've got some data from DBpedia using Jena. Since Jena's output is based on XML, there are cases where XML characters need to be treated differently, like the following:
Guns n &#039; Roses
I just want to know what kind of encoding this is.
I want decode/encode my input based on above encode(r) with the help of javascript and send it back to a servlet.
(Edited: if you remove the space between & and amp you will get the correct characters; I couldn't find a way to display that on Stack Overflow, so I decided to write it like that.)

Seems to be XML entity encoding, and a numeric character reference (decimal).
A numeric character reference refers to a character by its Universal
Character Set/Unicode code point, and uses the format &#nnnn; or &#xhhhh;
You can get some info here: List of XML and HTML character entity references on Wikipedia.
Your character is number 39, being the apostrophe: ', which can also be referenced with a character entity reference: &apos;.
To decode this using Javascript, you could use for example php.js, which has an html_entity_decode() function (note that it depends on get_html_translation_table()).
UPDATE: in reply to your edit: basically that is the same; the only difference is that it was encoded twice (possibly by mistake). &amp; is the ampersand: &.
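If pulling in php.js is not desirable, numeric character references alone can be decoded with a short regular expression. This is a minimal sketch, assuming String.fromCodePoint is available (ES2015 or later); the function name decodeNumericRefs is hypothetical, and named entities other than the &amp; handled in the usage lines are not covered.

```javascript
// Hypothetical helper: decode decimal (&#39;) and hexadecimal (&#x27;)
// numeric character references. Anything else is left unchanged.
function decodeNumericRefs(text) {
  return text.replace(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g, function (match, hex, dec) {
    var codePoint = hex ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave out-of-range references untouched instead of throwing.
    return codePoint > 0 && codePoint <= 0x10FFFF
      ? String.fromCodePoint(codePoint)
      : match;
  });
}

console.log(decodeNumericRefs('Guns n &#039; Roses')); // Guns n ' Roses

// Double-encoded input: undo the outer &amp; first, then decode the reference.
var twice = 'Guns n &amp;#039; Roses';
console.log(decodeNumericRefs(twice.replace(/&amp;/g, '&'))); // Guns n ' Roses
```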

This is an SGML/HTML/XML numeric character entity reference.
In this case for an apostrophe '.

Jquery embedded quote in attribute

I have a custom attribute that is being filled from a database. This attribute can contain an embedded single quote like this,
MYATT='Tony\'s Test'
At some point in my code I use jQuery to copy this attribute to a field like this,
$('#MY_DESC').val($(recdata).attr('MYATT'));
MY_DESC is a text field in a dialog box. When I display the dialog box all I see in the field is
Tony\
What I need to see is,
Tony's Test
How can I fix this so I can see the entire string?

Try:
MYATT='Tony&#x27;s Test'
I didn't bother verifying this with the HTML spec, but the wikipedia entry says:
The ability to "escape" characters in this way allows for the characters < and & (when written as &lt; and &amp;, respectively) to be interpreted as character data, rather than markup. For example, a literal < normally indicates the start of a tag, and & normally indicates the start of a character entity reference or numeric character reference; writing it as &amp; or &#x26; or &#38; allows & to be included in the content of elements or the values of attributes. The double-quote character ("), when used to quote an attribute value, must also be escaped as &quot; or &#x22; or &#34; when it appears within the attribute value itself. The single-quote character ('), when used to quote an attribute value, must also be escaped as &#x27; or &#39; (should NOT be escaped as &apos; except in XHTML documents) when it appears within the attribute value itself. However, since document authors often overlook the need to escape these characters, browsers tend to be very forgiving, treating them as markup only when subsequent text appears to confirm that intent.
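If the markup is generated from JavaScript, the escaping rules quoted above could be applied with a small helper before the value is written into the attribute. This is only a sketch; escapeAttribute is a hypothetical name, and a server-side template would need the equivalent in its own language.

```javascript
// Hypothetical helper: escape the characters that are significant inside
// a quoted HTML attribute value. The ampersand is replaced first so the
// entities produced by the later steps are not escaped a second time.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#x27;');
}

console.log("MYATT='" + escapeAttribute("Tony's Test") + "'");
// MYATT='Tony&#x27;s Test'
```

When the browser parses this attribute, it turns &#x27; back into an apostrophe, so reading the attribute later yields Tony's Test.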

If the value will never contain double quotes, wrap your custom attribute value in double quotes :)
If it might, I suggest escaping the value.

Before setting the value of your text field, you might try running a regular expression against the string to remove all backslashes from the string.

If you do this:
alert($(recdata).attr('MYATT'));
You will see the same result of "Tony\" meaning that the value isn't being properly consumed by the browser. The escaped \' value isn't working in this case.
Do you have the means to edit these values as they are being produced? Can you parse them to include escape values before being rendered?
