Numbers localization in Web applications - javascript

How can I set the variant of Arabic numeral without changing character codes?
Eastern Arabic ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
Persian variant ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹
Western Arabic 0 1 2 3 4 5 6 7 8 9
(And other numeral systems)
Here is a sample code:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
</head>
<body>
<div lang="fa">0123456789</div>
<div lang="ar">0123456789</div>
<div lang="en">0123456789</div>
</body>
</html>
How can I do this using only client-side technologies (HTML,CSS,JS)?
The solution should have no negative impact on page's SEO score.
Note that in Windows text boxes (e.g. Run) numbers are displayed correctly according to language of surrounding text.
See also: Numbers localization in desktop applications
Note: Localisation of numbers is super easy on the backend using this PHP package https://github.com/salarmehr/cosmopolitan

Here is an approach with code shifting:
// Eastern Arabic (officially "Arabic-Indic digits")
"0123456789".replace(/\d/g, function(v) {
    return String.fromCharCode(v.charCodeAt(0) + 0x0630);
}); // "٠١٢٣٤٥٦٧٨٩"

// Persian variant (officially "Eastern Arabic-Indic digits (Persian and Urdu)")
"0123456789".replace(/\d/g, function(v) {
    return String.fromCharCode(v.charCodeAt(0) + 0x06C0);
}); // "۰۱۲۳۴۵۶۷۸۹"
DEMO: http://jsfiddle.net/bKEbR/
Here we use a Unicode shift, since the digits of every numeral block in Unicode are placed in the same order as the Latin digits (i.e. [0x0030 ... 0x0039]). So, for example, for the Arabic-Indic block the shift is 0x0630 (0x0660 − 0x0030).
Note: it is difficult for me to distinguish Eastern characters, so if I have made a mistake (there are many different blocks of Eastern characters in Unicode), you can always calculate the shift using any online Unicode table. You may use either the official Unicode Character Code Charts or an online Unicode character table.
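The same idea generalizes to other scripts: keep a small table of block offsets (taken from the Unicode charts) and apply whichever one the target language needs. A rough sketch; the helper name and language keys are just for illustration:
// Offsets from ASCII '0' (U+0030) to the zero digit of each block:
// Arabic-Indic U+0660, Extended Arabic-Indic (Persian) U+06F0, Bengali U+09E6.
var DIGIT_OFFSETS = {
    ar: 0x0660 - 0x0030,   // 0x0630
    fa: 0x06F0 - 0x0030,   // 0x06C0
    bn: 0x09E6 - 0x0030    // 0x09B6
};

function localizeDigits(str, lang) {
    var offset = DIGIT_OFFSETS[lang] || 0;
    return str.replace(/\d/g, function (d) {
        return String.fromCharCode(d.charCodeAt(0) + offset);
    });
}

localizeDigits("0123456789", "fa"); // "۰۱۲۳۴۵۶۷۸۹"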

One has to decide if this is a question of appearance or of transformation. One must also decide if this is a question involving character-level semantics or numeral representations. Here are my thoughts:
The question would have entirely different semantics if Unicode had not separated out the codes for numeric characters.
Then, displaying the different glyphs as appropriate would simply be a matter of using the appropriate font. On the other hand, it would not have been possible to simply write out the different characters as I did below without changing fonts. (The situation is not exactly perfect as fonts do not necessarily cover the whole range of the 16-bit Unicode set, let alone the 32-bit Unicode set.)
9, ٩ (Arabic), ۹ (Urdu), 玖 (Chinese, complex), ๙ (Thai), ௯ (Tamil) etc.
Now, assuming we accept Unicode semantics, i.e. that '9', '٩', and '۹' are distinct characters, we may conclude that the question is not about appearance (something that would have been in the purview of CSS), but about transformation -- a few thoughts about this later; for now let us assume this is the case.
When focusing on character-level semantics, the situation is not too dissimilar with what happens with alphabets and letters. For instance, Greek 'α' and Latin 'a' are considered distinct, even though the Latin alphabet is nearly identical to the Greek alphabet used in Euboea. Perhaps even more dramatically, the corresponding capital variants, 'Α' (Greek) and 'A' (Latin) are visually identical in practically all fonts supporting both scripts, yet distinct as far as Unicode is concerned.
Having stated the ground rules, let us see how the question can be answered by ignoring them, and in particular ignoring (character-level) Unicode semantics.
(Horrible, nasty and non-backwards compatible) Solution: Use fonts that map '0' to '9' to the desired glyphs. I am not aware of any such fonts. You would have to use @font-face and some font that has been appropriately hacked to do what you want.
Needless to say, I am not particularly fond of this solution. However, it is the only simple solution I am aware of that does what the question asks "without changing character codes" on either the server or the client side. (Technically speaking, the Cufon solution I propose below does not change the character codes either, but what it does, drawing text into canvases, is vastly more complex and also requires tweaking open-source code.)
Note: Any transformational solution, i.e. any solution that changes the DOM and replaces characters in the range '0' to '9' with, say, their Arabic equivalents, will break code that expects numerals to appear in their original form in the DOM. This problem is, of course, worst when discussing forms and inputs.
An example of an answer taking the transformational approach would be:
$("[lang='fa']").find("*").andSelf().contents().each(function() {
if (this.nodeType === 3)
{
this.nodeValue = this.nodeValue.replace(/\d/g, function(v) {
return String.fromCharCode(v.charCodeAt(0) + 0x0630);
});
}
});
Note: Code taken from VisioN's second jsFiddle. If this is the only part of this answer that you like, make sure you upvote VisioN's answer, not mine!!! :-)
This has two problems:
It messes with the DOM and as a result may break code that used to work assuming it would find numerals in the "standard" form (using digits '0' to '9'). See the problem here: http://jsfiddle.net/bKEbR/10/ For instance, if you had a field containing the sum of some integers the user inputs, you might be in for a surprise when you try to get its value...
It does not address the issue of what goes on inside input (and textarea) elements. If an input field is initialised with, say, "42", it will retain that value. This can be fixed easily, but then there is the issue of actual input... One may decide to change characters as they come, convert the values when they change and so on and so forth. If such conversion is made then both the client side and the server side will need to be prepared to deal with different kinds of numeral. What comes out of the box in Javascript, jQuery and even Globalize (client-side), and ASP.NET, PHP etc. (server-side) will break if fed with numerals in non-standard formats (a sketch of the conversion needed follows).
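Something like this, for example, maps localized digits back to ASCII before parsing; it is only a sketch, assumes only Arabic-Indic and Persian digits appear, and ignores localized decimal separators:
// Map Arabic-Indic (U+0660-0669) and Persian (U+06F0-06F9) digits back to ASCII.
function toAsciiDigits(s) {
    return s.replace(/[\u0660-\u0669\u06F0-\u06F9]/g, function (ch) {
        var code = ch.charCodeAt(0);
        var zero = code >= 0x06F0 ? 0x06F0 : 0x0660;
        return String.fromCharCode(code - zero + 0x0030);
    });
}

parseInt(toAsciiDigits("۴۲"), 10); // 42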
A slightly more comprehensive solution (taking care also of input/textarea elements, both their initial values and user input) might be:
//before the DOM change, test1 holds a numeral parseInt can understand
alert("Before: test1 holds the value: " + parseInt($("#test1").text()));
function convertNumChar(c) {
    return String.fromCharCode(c.charCodeAt(0) + 0x0630);
}
function convertNumStr(s) {
    return s.replace(/\d/g, convertNumChar);
}
//the change in the DOM
$("[lang='fa']").find("*").andSelf().contents()
    .each(function() {
        if (this.nodeType === 3)
            this.nodeValue = convertNumStr(this.nodeValue);
    })
    .filter("input:text,textarea")
    .each(function() {
        this.value = convertNumStr(this.value);
    })
    .change(function() { this.value = convertNumStr(this.value); });
//test1 now holds a numeral parseInt cannot understand
alert("After: test1 holds the value: " + parseInt($("#test1").text()));
The entire jsFiddle can be found here: http://jsfiddle.net/bKEbR/13/
Needless to say, this only solves the aforementioned problems partially. Client-side and/or server-side code will have to recognise the non-standard numerals and convert them appropriately either to the standard format or to their actual values.
This is not a simple matter that a few lines of javascript will solve. And this is but the simplest case of such possible conversion since there is a simple character-to-character mapping that needs to be applied to go from one form of numeral to the other.
Another go at an appearance-based approach:
Cufon-based Solution (Overkill, Non-Backwards Compatible (requires canvas), etc.): One could relatively easily tweak a library like Cufon to do what is envisaged. Cufon can do its thing and draw glyphs on a canvas object, except that the tweak will ensure that when elements have a certain property, the desired glyphs will be used instead of the ones normally chosen. Cufon and other libraries of the kind tend to add elements to the DOM and alter the appearance of existing elements but not touch their text, so the problems with the transformational approaches should not apply. In fact it is interesting to note that while (tweaked) Cufon provides a clearly transformational approach as far as the overall DOM is concerned, it is an appearance-based solution as far as its mentality goes; I would call it a hybrid solution.
Alternative Hybrid-Solution: Create new DOM elements with the arabic content, hide the old elements but leave their ids and content intact. Synchronize the arabic content elements with their corresponding, hidden, elements.
Let's try to think outside the box (the box being current web standards).
The fact that certain characters are unique does not mean they are unrelated. Moreover, it does not necessarily mean that their difference is one of appearance. For instance, 'a' and 'A' are the same letter; in some contexts they are considered to be the same and in others to be different. Having the distinction in Unicode (and ASCII and ISO-Latin-1 etc. before it) means that some effort is required to overcome it.
CSS offers a quick and easy way for changing the case of letters. For instance, body {text-transform:uppercase} would turn all letters in the text in the body of the page into upper case. Note that this is also a case of appearance-change rather than transformation: the DOM of the body element does not change, just the way it is rendered.
Note: If CSS supported something like numerals-transform: 'ar' that would probably have been the ideal answer to the question as it was phrased.
However, before we rush to tell the CSS committee to add this feature, we may want to consider what that would mean. Here, we are tackling a tiny little problem, but they have to deal with the big picture.
Output:
Would this numerals-transform feature allow '10' (2 characters) to appear as 十 (Chinese, simple), 拾 (Chinese, complex), X (Latin) (all 1 character) and so on, if instead of 'ar' the appropriate arguments were given?
Input:
Would this numerals-transform feature change '十' (Chinese, simple) into its Arabic equivalent, or would it simply target '10'? Would it somehow cleverly detect that "MMXII" (the Latin numeral for 2012) is a number and not a word and convert it accordingly?
The question of number representation is not as simple as one might imagine just looking at this question.
So, where does all this leave us:
There is no simple presentation-based solution. If one appears in the future, it will not be backwards compatible.
There can be a transformational "solution" here and now, but even if this is made to work also with form elements as I have done (http://jsfiddle.net/bKEbR/13/) there need to be server-side and client-side awareness of the non-standard format used.
There may be complex hybrid solutions. They are complex but offer some of the advantages of the presentation-based approaches in some cases.
A CSS solution would be nice, but actually the problem is big and complex when one looks at the big picture, which involves other numeric systems (with less trivial conversions from and to the standard system), decimal points, signs, etc.
At the end of the day, the solution I see as realistic and backwards compatible would be an extension of Globalize (and server-side equivalents) possibly with some additional code to take care of user input. The idea is that this is not a problem at the character-level (because once you consider the big picture it is not) and that it will have to be treated in the same way that differences with thousands and decimal separators have been dealt with: as formatting/parsing issues.
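As a rough sketch of that formatting/parsing mindset using a present-day API that was not widely available when this was written (Intl): derive the locale's digits by formatting 0-9 and build a parser from the resulting map. This ignores grouping and decimal separators, so it is only a starting point:
function makeDigitParser(locale) {
    var fmt = new Intl.NumberFormat(locale, { useGrouping: false });
    var map = {};
    for (var d = 0; d <= 9; d++) {
        map[fmt.format(d)] = String(d);   // e.g. "۴" -> "4" for 'fa'
    }
    return function (str) {
        var ascii = str.replace(/./g, function (ch) { return map[ch] || ch; });
        return parseInt(ascii, 10);
    };
}

var parseFa = makeDigitParser("fa");
parseFa("۱۳۹۸"); // 1398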

I imagine the best way is to use a regexp to find the numeric characters that should be changed, and to add a class name to the div that needs a different numeral set.
You can do this using jQuery fairly easily.
jsfiddle DEMO
EDIT: And if you don't want to use a variable, then see this revised demo:
jsfiddle DEMO 2
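Roughly, the idea looks like this (a sketch only; the .persian-digits class name is just an illustration):
// Replace Western digits with Persian ones inside every element that opts in via a class.
$(".persian-digits").text(function (i, text) {
    return text.replace(/\d/g, function (d) {
        return String.fromCharCode(d.charCodeAt(0) + 0x06C0);
    });
});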

I have been working on a general web page localization technique that does more than just numbers (it's similar to .po files).
The localization files are simple (the strings can contain html if needed)
/* Localization file - save as <document URL>-<lang>.js, e.g. index.html-en.js */
items=[
{"id":"string1","value":"Localized text of string1 here."},
{"id":"string2", "value":"۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹ "}
];
rtl=false; /* set to true for rtl languages */
This format is useful to separate out for translators (or mechanical turk)
and a basic page template
<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>My title</title>
<style>.txt{float:left;margin-left:10px}</style>
</head>
<body onload='setLang()'>
<div id="string1" class="txt">This is the default text of string1.</div>
<div id="string2" class="txt">0 1 2 3 4 5 6 7 8 9 </div>
</body></html>
<script>
function setLang(){
  for(var i=0;i<items.length;i++){
    var term=document.getElementById(items[i].id);
    if(term){
      term.innerHTML=items[i].value;
      if(rtl){ /* for rtl languages */
        term.style.styleFloat="right";
        term.style.cssFloat="right";
        term.style.textAlign="right";
      }
    }
  }
}
var lang=navigator.userLanguage || navigator.language;
var script=document.createElement("script");
script.src=document.URL+"-"+lang.substring(0,2)+".js";
var head=document.getElementsByTagName('head')[0];
head.insertBefore(script,head.firstChild);
</script>
I tried to keep it pretty simple, yet cover as many locales as possible so additional css is likely required (I have to admit a lack of exposure to rtl languages, so many more styles may need to be set)
I do have font checking code that would be useful if you know what fonts support your character codes well
function hasFont(f){
  var s=document.createElement("span");
  s.style.fontSize="72px";
  s.innerHTML="MWMWM";
  s.style.visibility="hidden";
  // measure with the fallback fonts only
  s.style.fontFamily=[(f=="monospace")?'':'monospace','sans-serif','serif'];
  document.body.appendChild(s);
  var w=s.offsetWidth;
  // now measure with the requested font first; a width change means it is available
  s.style.fontFamily=[f,'monospace','sans-serif','serif'];
  var result=s.offsetWidth!=w;
  document.body.removeChild(s);
  return result;
}
usage: if(hasFont("myfont"))myelement.style.fontFamily="myfont"

A newer (as of this writing) and simple JS solution would be to use Intl.NumberFormat. It supports numeral localization and formatting variations as well as local currencies (see the documentation for more examples).
To use an example very similar to MDN's own:
const val = 1234567809;
console.log('Eastern Arabic (Arabic-Egyptian)', new Intl.NumberFormat('ar-EG').format(val));
console.log('Persian variant (Farsi)',new Intl.NumberFormat('fa').format(val));
console.log('English (US)',new Intl.NumberFormat('en-US').format(val));
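If you want Eastern Arabic or Persian digits regardless of the locale's other conventions, the Unicode numbering-system extension (-u-nu-...) can be appended to the locale identifier; a small sketch, assuming the engine ships the relevant numbering systems:
// Force a specific numbering system via the Unicode locale extension:
console.log('Arabic-Indic digits', new Intl.NumberFormat('en-US-u-nu-arab').format(val));
console.log('Persian digits', new Intl.NumberFormat('en-US-u-nu-arabext').format(val));
console.log('Western digits', new Intl.NumberFormat('en-US-u-nu-latn').format(val)); // 1,234,567,809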
Intl.NumberFormat also seems to accept numeric strings, and it indicates, in the local language, when the value is not a number.
const val1 = '456';
const val2 = 'Numeric + string example, 123';
console.log('Eastern Arabic', new Intl.NumberFormat('ar-EG').format(val1));
console.log('Eastern Arabic', new Intl.NumberFormat('ar-EG').format(val2));
console.log('Persian variant',new Intl.NumberFormat('fa').format(val1));
console.log('Persian variant',new Intl.NumberFormat('fa').format(val2));
console.log('English',new Intl.NumberFormat('en-US').format(val1));
console.log('English', new Intl.NumberFormat('en-US').format(val2));
For the locale identifier (string passed to NumberFormat constructor indicating locale), I experimented with the values above and they seemed fine. I tried finding a list for all possible values, and through MDN came across this documentation and this list that could be helpful.
I'm not familiar with SEO, and am thus unsure how this answers that part of the question.

You can try this.
This is the CSS source code:
@font-face
{
    font-family: A1Tahoma;
    src: url(yourfont.eot) format('embedded-opentype'),
         url(yourfont.ttf) format('truetype'),
         url(yourfont.woff) format('woff'),
         url(yourfont.svg) format('svg');
}
p { font-family: A1Tahoma; font-size: 30px; }
And this is HTML code:
<p>سلام به همه</p>
<p>1234567890</p>
And finally you will see your result. Remember that the four font formats are used so that every browser, such as IE, Firefox and so on, is covered.
(In Persian: "Hi Reza, you can do this to add the font you want to the site.")

I have created a jQuery plugin that can convert Western Arabic numbers to Eastern ones (Persian only), but it can be extended to convert a number to any desired numeral system. My jQuery plugin has two advantages:
It detects and converts numbers properly in child nodes.
It detects and converts decimal point characters appropriately.
You can clone this plugin from github.
My plugin code:
(function( $ ){
    $.fn.persiaNumber = function() {
        var groupSelection = this;
        for(var i = 0; i < groupSelection.length; i++){
            var htmlTxt = $(groupSelection[i]).html();
            var trueTxt = convertDecimalPoint(htmlTxt);
            trueTxt = convertToPersianNum(trueTxt);
            $(groupSelection[i]).html(trueTxt);
        }
        function convertToPersianNum(htmlTxt){
            var otIndex = htmlTxt.indexOf("<");
            var ctIndex = htmlTxt.indexOf(">");
            // Plain text with no tags: convert every digit and return.
            if(otIndex == -1 && ctIndex == -1 && htmlTxt.length > 0){
                return htmlTxt.replace(/1/gi, "۱").replace(/2/gi, "۲").replace(/3/gi, "۳").replace(/4/gi, "۴").replace(/5/gi, "۵").replace(/6/gi, "۶").replace(/7/gi, "۷").replace(/8/gi, "۸").replace(/9/gi, "۹").replace(/0/gi, "۰");
            }
            // Convert the text before the first tag, keep the tag itself intact,
            // then recurse on the rest of the markup.
            var tag = htmlTxt.substring(otIndex, ctIndex + 1);
            var str = htmlTxt.substring(0, otIndex);
            str = convertDecimalPoint(str);
            str = str.replace(/1/gi, "۱").replace(/2/gi, "۲").replace(/3/gi, "۳").replace(/4/gi, "۴").replace(/5/gi, "۵").replace(/6/gi, "۶").replace(/7/gi, "۷").replace(/8/gi, "۸").replace(/9/gi, "۹").replace(/0/gi, "۰");
            var refinedHtmlTxt = str + tag;
            htmlTxt = htmlTxt.substring(ctIndex + 1, htmlTxt.length);
            if(htmlTxt.length > 0 && otIndex != -1 || ctIndex != -1){
                return refinedHtmlTxt + convertToPersianNum(htmlTxt);
            }else{
                return refinedHtmlTxt + htmlTxt;
            }
        }
        function convertDecimalPoint(str){
            // Replace a "." that sits between two digits with the Persian decimal separator "٫".
            for(var j = 1; j < str.length - 1; j++){
                if(str.charCodeAt(j-1) > 47 && str.charCodeAt(j-1) < 58 && str.charCodeAt(j+1) > 47 && str.charCodeAt(j+1) < 58 && str.charCodeAt(j) == 46)
                    str = str.substring(0, j) + '٫' + str.substring(j+1, str.length);
            }
            return str;
        }
    };
})( jQuery );
http://jsfiddle.net/VPWmq/2/

You can convert numbers in this way:
const persianDigits = ['۰', '۱', '۲', '۳', '۴', '۵', '۶', '۷', '۸', '۹'];
const number = 44653420;
const convertedNumber = String(number).replace(/\d/g, function(digit) {
    return persianDigits[digit];
});
console.log(convertedNumber); // ۴۴۶۵۳۴۲۰

If anyone is looking to localize into Bangla (Bengali) numerals using the same code-shifting method:
$("[lang='bang']").text(function(i, val) {
return val.replace(/\d/g, function(v) {
return String.fromCharCode(v.charCodeAt(0) + 0x09B6);
});
});
You can also visit here to see the Unicode code chart (hexadecimal codes) for Bangla.

Related

How can I convert this UTF-8 string to plain text in javascript and how can a normal user write it in a textarea [duplicate]

While reviewing JavaScript concepts, I found String.normalize(). This is not something that shows up in W3Schools' "JavaScript String Reference", which is probably why I missed it before.
I found more information about it in HackerRank which states:
Returns a string containing the Unicode Normalization Form of the
calling string's value.
With the example:
var s = "HackerRank";
console.log(s.normalize());
console.log(s.normalize("NFKC"));
having as output:
HackerRank
HackerRank
Also, in GeeksForGeeks:
The string.normalize() is an inbuilt function in javascript which is
used to return a Unicode normalisation form of a given input string.
with the example:
<script>
// Taking a string as input.
var a = "GeeksForGeeks";
// calling normalize function.
b = a.normalize('NFC')
c = a.normalize('NFD')
d = a.normalize('NFKC')
e = a.normalize('NFKD')
// Printing normalised form.
document.write(b +"<br>");
document.write(c +"<br>");
document.write(d +"<br>");
document.write(e);
</script>
having as output:
GeeksForGeeks
GeeksForGeeks
GeeksForGeeks
GeeksForGeeks
Maybe the examples given are just really bad as they don't allow me to see any change.
I wonder... what's the point of this method?
It depends on what you will do with the strings: often you do not need it (if you are just getting input from the user and showing it back to the user). But to check/search/use such strings as keys etc., you may want a unique way to identify the same string (semantically speaking).
The main problem is that you may have two strings which are semantically the same, but with two different representations: e.g. one with an accented character [one code point], and one with a base character combined with an accent [one code point for the character, one for the combining accent]. The user may not be in control of how the input text is sent, so you may end up with two different user names, or two different passwords. And if you mangle the data, you may get different results depending on the initial string. Users do not like that.
Another problem is the order of combining characters. You may have an accent and a lower tail (e.g. a cedilla): you may express this with several combinations: "pure char, tail, accent", "pure char, accent, tail", "char+tail, accent", "char+accent, tail".
And you may have degenerate cases (especially if you type from a keyboard): you may get code points which should be removed (you may get an infinitely long string which could be equivalent to a few bytes).
In any case, for sorting strings, you (or your library) require a normalized form: if you already provide the right one, the library will not need to transform it again.
So: you want the same (semantically speaking) string to have the same sequence of Unicode code points.
Note: if you are working directly on UTF-8 bytes, you should also take care of the special cases of UTF-8: the same code point could (illegally) be written in different ways [using more bytes]. This can also be a security problem.
The K forms (compatibility normalization) are often used for "searches" and similar tasks: CO2 and CO₂ will be interpreted in the same manner, but this could change the meaning of the text, so it should often be used only internally, for temporary tasks, while keeping the original text.
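For instance, a quick sketch of that compatibility behaviour (the subscript two, U+2082, has a compatibility decomposition to the plain digit 2):
const a = 'CO2';
const b = 'CO\u2082'; // "CO₂"
console.log(a === b);                                     // false
console.log(a.normalize('NFC')  === b.normalize('NFC'));  // false, canonical forms stay distinct
console.log(a.normalize('NFKC') === b.normalize('NFKC')); // true, compatibility form folds ₂ to 2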
As stated in the MDN documentation, String.prototype.normalize() returns the Unicode Normalization Form of the string. This is because, in Unicode, some characters can have more than one representation.
This is the example (taken from MDN):
const name1 = '\u0041\u006d\u00e9\u006c\u0069\u0065';
const name2 = '\u0041\u006d\u0065\u0301\u006c\u0069\u0065';
console.log(`${name1}, ${name2}`);
// expected output: "Amélie, Amélie"
console.log(name1 === name2);
// expected output: false
console.log(name1.length === name2.length);
// expected output: false
const name1NFC = name1.normalize('NFC');
const name2NFC = name2.normalize('NFC');
console.log(`${name1NFC}, ${name2NFC}`);
// expected output: "Amélie, Amélie"
console.log(name1NFC === name2NFC);
// expected output: true
console.log(name1NFC.length === name2NFC.length);
// expected output: true
As you can see, the string Amélie has two different Unicode representations. With normalization, we can reduce the two forms to the same string.
Very beautifully explained here --> https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize
Short answer: the point is that characters are represented through a coding scheme like ASCII, UTF-8, etc. (we mostly use UTF-8), and some characters have more than one representation. So two strings may render similarly, but their underlying code points may vary, and a string comparison may fail. So we use normalize to return a single representation.
// source from MDN
let string1 = '\u00F1'; // ñ
let string2 = '\u006E\u0303'; // ñ
string1 = string1.normalize('NFC');
string2 = string2.normalize('NFC');
console.log(string1 === string2); // true
console.log(string1.length); // 1
console.log(string2.length); // 1
Normalization of strings isn't exclusive to JavaScript - see, for instance, Python. The valid values for the argument are defined by Unicode (more on Unicode normalization).
When it comes to JavaScript, note that there's documentation referring to both String.normalize() and String.prototype.normalize(). As @ChrisG mentions:
String.prototype.normalize() is correct in a technical sense, because normalize() is a dynamic method you call on instances, not the class itself. The point of normalize() is to be able to compare Strings that look the same but don't consist of the same characters, as shown in the example code on MDN.
Then, when it comes to its usage, I found a great example of the usage of String.normalize():
let s1 = 'sabiá';
let s2 = 'sabiá';
// one is in NFC, the other in NFD, so they're different
console.log(s1 == s2); // false
// with normalization, they become the same
console.log(s1.normalize('NFC') === s2.normalize('NFC')); // true
// transform string into array of codepoints
function codepoints(s) { return Array.from(s).map(c => c.codePointAt(0).toString(16)); }
// printing the codepoints you can see the difference
console.log(codepoints(s1)); // [ "73", "61", "62", "69", "e1" ]
console.log(codepoints(s2)); // [ "73", "61", "62", "69", "61", "301" ]
So while sabiá and sabiá in this example look the same to the human eye, and even to console.log(), without normalization a comparison gives a different result. Then, by analyzing the code points, we can see why they differ.
There are some great answers here already, but I wanted to throw in a practical example.
I enjoy Bible translation as a hobby. I wasn't too thrilled with the flashcard options out there in the wild in my price range (free), so I made my own. The problem is, there is more than one way to do Hebrew and Greek in Unicode to get the exact same thing. For example:
בָּא
בָּא
These should look identical on your screen, and for all practical purposes they are identical. However, the first was typed with the qamats (the little t shaped thing under it) before the dagesh (the dot in the middle of the letter) and the second was typed with the dagesh before the qamats. Now, since you're just reading this, you don't care. And your web browser doesn't care. But when my flashcards compare the two, then they aren't the same. To the code behind the scenes, it's no different than saying "center" and "centre" are the same.
Similarly, in Greek:
ἀ
ἀ
These two should look nearly identical, but the top is one Unicode character and the second one is two Unicode characters. Which one is going to end up typed in my flashcards is going to depend on which keyboard I'm sitting at.
When I'm adding flashcards, believe it or not, I don't always type in vocab lists of 100 words. That's why God gave us spreadsheets. And sometimes the places I'm importing the lists from do it one way, and sometimes they do it the other way, and sometimes they mix it. But when I'm typing, I'm not trying to memorize the order that the dagesh or quamats appear or if the accents are typed as a separate character or not. Regardless if I remember to type the dagesh first or not, I want to get the right answer, because really it's the same answer in every practical sense either way.
So I normalize the order before saving the flashcards and I normalize the order before checking it, and the result is that it doesn't matter which way I type it, it comes out right!
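The check itself is tiny; a rough sketch (the function name and the code points are just an example: bet + qamats + dagesh typed in both orders):
// Compare a typed answer to the stored one, ignoring how the combining marks
// (dagesh, qamats, accents) happen to be ordered or composed.
function isCorrect(typedAnswer, storedAnswer) {
    return typedAnswer.normalize('NFC') === storedAnswer.normalize('NFC');
}

isCorrect('\u05D1\u05B8\u05BC\u05D0', '\u05D1\u05BC\u05B8\u05D0'); // true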
If you want to check out the results:
https://sthelenskungfu.com/flashcards/
You need a Google or Facebook account to log in, so it can track progress and such. As far as I know (or care) only my daughter and I currently use it.
It's free, but eternally in beta.

Is it better to compare strings using toLowerCase or toUpperCase in JavaScript?

I'm going through a code review and I'm curious if it's better to convert strings to upper or lower case in JavaScript when attempting to compare them while ignoring case.
Trivial example:
var firstString = "I might be A different CASE";
var secondString = "i might be a different case";
var areStringsEqual = firstString.toLowerCase() === secondString.toLowerCase();
or should I do this:
var firstString = "I might be A different CASE";
var secondString = "i might be a different case";
var areStringsEqual = firstString.toUpperCase() === secondString.toUpperCase();
It seems like either one should work with limited character sets like only English letters, so is one more robust than the other?
As a note, MSDN recommends normalizing strings to uppercase, but that is for managed code (presumably C# & F# but they have fancy StringComparers and base libraries):
http://msdn.microsoft.com/en-us/library/bb386042.aspx
Revised answer
It's been quite a while since I answered this question. While the cultural issues still hold true (and I don't think they will ever go away), the development of the ECMA-402 standard made my original answer... outdated (or obsolete?).
The best solution for comparing localized strings seems to be using the function localeCompare() with appropriate locales and options:
var locale = 'en'; // that should be somehow detected and passed on to JS
var firstString = "I might be A different CASE";
var secondString = "i might be a different case";
if (firstString.localeCompare(secondString, locale, {sensitivity: 'accent'}) === 0) {
// do something when equal
}
This will compare the two strings case-insensitively, but accent-sensitively (for example ą != a).
If this is not sufficient for performance reasons, you may want to use either toLocaleUpperCase() or toLocaleLowerCase(), passing the locale as a parameter:
if (firstString.toLocaleUpperCase(locale) === secondString.toLocaleUpperCase(locale)) {
// do something when equal
}
In theory there should be no differences. In practice, subtle implementation details (or lack of implementation in the given browser) may yield different results...
Original answer
I am not sure if you really meant to ask this question in Internationalization (i18n) tag, but since you did...
Probably the most unexpected answer is: neither.
There are tons of problems with case conversion, which inevitably lead to functional issues if you convert character case without indicating the language (as is the case in JavaScript). For instance:
There are many natural languages that don't have the concept of upper- and lowercase characters. There is no point in trying to convert them (although it will work).
There are language-specific rules for converting a string. The German sharp S character (ß) is bound to be converted into two uppercase S letters (SS).
Turkish and Azerbaijani (or Azeri if you prefer) have a "very strange" concept of two i characters: dotless ı (which converts to uppercase I) and dotted i (which converts to uppercase İ <- this font may not present it correctly, but it really is a different glyph). See the snippet after this list.
The Greek language has many "strange" conversion rules. One particular rule regards the uppercase letter sigma (Σ), which, depending on its place in a word, has two lowercase counterparts: regular sigma (σ) and final sigma (ς). There are also other conversion rules regarding "accented" characters, but they are commonly omitted when the conversion function is implemented.
Some languages have title-case letters, e.g. Lj, which should be converted to things like LJ or, less appropriately, LJ. The same may apply to ligatures.
Finally there are many compatibility characters that may mean the same as what you are trying to compare to, but be composed of completely different characters. To make it worse, things like "ae" may be the equivalent of "ä" in German and Finnish, but the equivalent of "æ" in Danish.
I am trying to convince you that it is really better to compare user input literally, rather than converting it. If it is not user-related, it probably doesn't matter, but case conversion will always take time. Why bother?
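To make the Turkish point concrete, here is a small sketch (behaviour assumes an engine with full locale data for toLocaleLowerCase/toLocaleUpperCase):
// With the default rules, "I" lowercases to "i"; with Turkish rules it becomes
// dotless "ı", so a "case-insensitive" comparison can give different answers.
console.log('I'.toLowerCase());            // "i"
console.log('I'.toLocaleLowerCase('tr'));  // "ı"
console.log('i'.toUpperCase());            // "I"
console.log('i'.toLocaleUpperCase('tr'));  // "İ"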
Some other options have been presented, but if you must use toLowerCase or toUpperCase, I wanted some actual data on this. I pulled the full list of two-byte characters that fail with toLowerCase or toUpperCase. I then ran this test:
let pairs = [
[0x00E5,0x212B],[0x00C5,0x212B],[0x0399,0x1FBE],[0x03B9,0x1FBE],[0x03B2,0x03D0],
[0x03B5,0x03F5],[0x03B8,0x03D1],[0x03B8,0x03F4],[0x03D1,0x03F4],[0x03B9,0x1FBE],
[0x0345,0x03B9],[0x0345,0x1FBE],[0x03BA,0x03F0],[0x00B5,0x03BC],[0x03C0,0x03D6],
[0x03C1,0x03F1],[0x03C2,0x03C3],[0x03C6,0x03D5],[0x03C9,0x2126],[0x0392,0x03D0],
[0x0395,0x03F5],[0x03D1,0x03F4],[0x0398,0x03D1],[0x0398,0x03F4],[0x0345,0x1FBE],
[0x0345,0x0399],[0x0399,0x1FBE],[0x039A,0x03F0],[0x00B5,0x039C],[0x03A0,0x03D6],
[0x03A1,0x03F1],[0x03A3,0x03C2],[0x03A6,0x03D5],[0x03A9,0x2126],[0x0398,0x03F4],
[0x03B8,0x03F4],[0x03B8,0x03D1],[0x0398,0x03D1],[0x0432,0x1C80],[0x0434,0x1C81],
[0x043E,0x1C82],[0x0441,0x1C83],[0x0442,0x1C84],[0x0442,0x1C85],[0x1C84,0x1C85],
[0x044A,0x1C86],[0x0412,0x1C80],[0x0414,0x1C81],[0x041E,0x1C82],[0x0421,0x1C83],
[0x1C84,0x1C85],[0x0422,0x1C84],[0x0422,0x1C85],[0x042A,0x1C86],[0x0463,0x1C87],
[0x0462,0x1C87]
];
let upper = 0, lower = 0;
for (let pair of pairs) {
    let row = 'U+' + pair[0].toString(16).padStart(4, '0') + ' ';
    row += 'U+' + pair[1].toString(16).padStart(4, '0') + ' pass: ';
    let s = String.fromCodePoint(pair[0]);
    let t = String.fromCodePoint(pair[1]);
    if (s.toUpperCase() == t.toUpperCase()) {
        row += 'toUpperCase ';
        upper++;
    } else {
        row += ' ';
    }
    if (s.toLowerCase() == t.toLowerCase()) {
        row += 'toLowerCase';
        lower++;
    }
    console.log(row);
}
console.log('upper pass: ' + upper + ', lower pass: ' + lower);
Interestingly, one of the pairs fails with both. But based on this,
toUpperCase is the best option.
It never depends upon the browser, as only JavaScript is involved.
Both will perform according to the number of characters that need to be changed (flipping case):
var areStringsEqual = firstString.toLowerCase() === secondString.toLowerCase();
var areStringsEqual = firstString.toUpperCase() === secondString.toUpperCase();
If you use the test prepared by @adeneo you may feel it's browser dependent, but try some other test inputs like:
"AAAAAAAAAAAAAAAAAAAAAAAAAAAA"
and
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
and compare.
JavaScript performance depends upon the browser when a DOM API or any DOM manipulation/interaction is involved; otherwise, for plain JavaScript, both will give the same performance.

jQuery zip masking for multiple formats

I have a requirement for masking a zip field so that it allows the classic 5-digit zip (XXXXX) or the 5 + 4 format (XXXXX-XXXX).
I could do something like:
$('#myZipField').mask("?99999-9999");
but the complication comes from the fact that the dash should not show if the user puts in only 5 digits.
This is the best I came up with so far - I could extend it to auto-insert the dash when they insert the 6th digit, but the problem with this would be funny behavior on deletion (I could stop them from deleting the dash, but that would be patching the patch and so forth; it becomes a nightmare):
$.mask.definitions['~']='[-]';
$("#myZipField").mask("?99999~9999", {placeholder:""});
Is there any out of the box way of doing this or do I have to roll my own?
You don't have to use a different plug-in. Just move the question mark, so that instead of:
$('#myZipField').mask("?99999-9999");
you should use:
$('#myZipField').mask("99999?-9999");
After all, it isn't the entire string which is optional, just the - and onward.
This zip code is actually simple, but when you have a more complex format to handle, here is how it's solved with the plugin (from the demo page):
var options = {onKeyPress: function(cep, e, field, options){
var masks = ['00000-000', '0-00-00-00'];
mask = (cep.length>7) ? masks[1] : masks[0];
$('.crazy_cep').mask(mask, options);
}};
$('.crazy_cep').mask('00000-000', options);
If you're using jQuery already, there are probably hundreds of plugins for masks etc, for example:
http://www.meiocodigo.com/projects/meiomask/
So I don't think you'd have to roll your own
When you use jQuery Inputmask plugin and you want to use 4 or 5 digit values for zip code you should use:
$('#myZipField').inputmask("9999[9]");
Why not have the field be transparent, and have a text object behind it with the format in light grey? So they see #######-#### in the background, and then rig it so the letters disappear as they type. At that point, it suggests that they should enter a dash if they want to put the extra four, right? Then, you could just rig the script to auto-insert the hyphen if they mess up and type 6 numbers?

How to compare locale dependent float numbers?

I need to compare a float value entered in a web form against a range. The problem is that the client computers may have various locale settings, meaning that users may use either "." or "," to separate the integer part from the decimal one.
Is there a simple way to do it? As it is for an intranet and they are only allowed to use IE, VBScript is fine, even if I would prefer to use JavaScript.
EDIT: Let me clarify it a bit:
I cannot rely on the system locale, because, for example, a lot of our french customers use a computer with an english locale, even if they still use the comma to fill data in the web forms.
So I need a way to perform a check accross multiple locale "string to double" conversion.
I know that the corner case is "what about numbers with 3 decimal digits", but in our environment this kind of value never happens, and if it does, it will be treated as an out-of-range error due to the multiplication by a thousand, so it's not a real issue for us.
In JavaScript use parseFloat on the text value to get a number. Similarly in VBScript use CDbl on the text value. Both should conform to the current locale settings in force for the user.
This code should work:
function toFloat(localFloatStr) {
    var x = localFloatStr.split(/,|\./),
        x2 = x[x.length-1],
        x3 = x.join('').replace(new RegExp(x2+'$'),'.'+x2);
    return parseFloat(x3);
    // x2 is for clarity, could be omitted:
    //=> x.join('').replace(new RegExp(x[x.length-1]+'$'),'.'+x[x.length-1])
}
alert(toFloat('1,223,455.223')); //=> 1223455.223
alert(toFloat('1.223.455,223')); //=> 1223455.223
// your numbers ;~)
alert(toFloat('3.123,56')); //=> 3123.56
alert(toFloat('3,123.56')); //=> 3123.56
What we do is try parsing using the culture of the user and if that doesn't work, parse it using an invariant culture.
I wouldn't know how to do it in javascript or vbscript exactly though.
I used KooiInc's answer but changed it a bit, because it didn't account for some cases.
function toFloat(strNum) {
    var full = strNum.split(/[.,]/);
    if (full.length == 1) return parseFloat(strNum);
    var back = full[full.length - 1];
    var result = full.join('').replace(new RegExp(back + '$'), '.' + back);
    return parseFloat(result);
}
Forbid using any thousands separator.
Give the user an example: "Reals should look like this: 3123.56 or 3123,56". Then simply change , to . and parse it.
You can always tell the user that he did something wrong with a message like this:
"I don't understand what you mean by "**,**,**".
Please format numbers like "3123.56"."
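A minimal sketch of that approach (my own illustration; it simply rejects anything with more than one separator):
// Accept "3123.56" or "3123,56", forbid thousands separators, and validate the result.
function parseLocaleFloat(input) {
    var normalized = input.trim().replace(',', '.');
    if (!/^\d+(\.\d+)?$/.test(normalized)) {
        return NaN; // not a plain number: ask the user to reformat it
    }
    return parseFloat(normalized);
}

parseLocaleFloat("3123,56");  // 3123.56
parseLocaleFloat("3.123,56"); // NaN -> show the "please format numbers like 3123.56" message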

Javascript percentage validation

I am after a regular expression that validates a percentage from 0 to 100 and allows two decimal places.
Does anyone know how to do this or know of good web site that has example of common regular expressions used for client side validation in javascript?
@Tom - Thanks for the questions. Ideally there would be no leading 0's or other trailing characters.
Thanks to all those who have replied so far. I have found the comments really interesting.
Rather than using regular expressions for this, I would simply convert the user's entered number to a floating point value, and then check for the range you want (0 to 100). Trying to do numeric range validation with regular expressions is almost always the wrong tool for the job.
var x = parseFloat(str);
if (isNaN(x) || x < 0 || x > 100) {
// value is out of range
}
I propose this one:
(^100(\.0{1,2})?$)|(^([1-9]([0-9])?|0)(\.[0-9]{1,2})?$)
It matches 100, 100.0 and 100.00 using this part
^100(\.0{1,2})?$
and numbers like 0, 15, 99, 3.1, 21.67 using
^([1-9]([0-9])?|0)(\.[0-9]{1,2})?$
Note that leading zeros are prohibited, but trailing zeros are allowed (though no more than two decimal places).
This reminds me of an old blog entry by Alex Papadimoulis (of The Daily WTF fame) where he tells the following story:
"A client has asked me to build and install a custom shelving system. I'm at the point where I need to nail it, but I'm not sure what to use to pound the nails in. Should I use an old shoe or a glass bottle?"
How would you answer the question?
It depends. If you are looking to pound a small (20lb) nail in something like drywall, you'll find it much easier to use the bottle, especially if the shoe is dirty. However, if you are trying to drive a heavy nail into some wood, go with the shoe: the bottle will shatter in your hand.
There is something fundamentally wrong with the way you are building; you need to use real tools. Yes, it may involve a trip to the toolbox (or even to the hardware store), but doing it the right way is going to save a lot of time, money, and aggravation through the lifecycle of your product. You need to stop building things for money until you understand the basics of construction.
This is the kind of question where most people see it as a challenge to come up with the correct regular expression to solve the problem, but it would be much better to just say that using a regular expression is the wrong tool for the job.
The problem with trying to use a regex to validate numeric ranges is that it is hard to change if the requirements for the allowed range change. Today the requirement may be to validate numbers between 0 and 100, and it is possible to write a regex for that which doesn't make your eyes bleed. But next week the requirement may change so that values between 0 and 315 are allowed. Good luck altering your regex.
The solution given by Greg Hewgill is probably better - even though it would validate "99fxx" as "99". But given the circumstances that might actually be ok.
Given that your value is in str
str.match(/^(100(\.0{1,2})?|([0-9]?[0-9](\.[0-9]{1,2})))$/)
^100(\.(0){0,2})?$|^([1-9]?[0-9])(\.(\d{0,2}))?\%$
This would match:
100.00
optional "1-9" followed by a digit (this makes the int part), optionally followed by a dot and two digits
From what I see, Greg Hewgill's example doesn't really work that well because parseFloat('15x') would simply return 15 which would match the 0<x<100 condition. Using parseFloat is clearly wrong because it doesn't validate the percentage value, it tries to force a validation. Some people around here are complaining about leading zeroes and some are ignoring trailing invalid characters. Maybe the author of the question should edit it and make clear what he needs.
I recommend this, if you are not exclusively developing for English-speaking users:
[0-9]{1,2}((,|\.)[0-9]{1,10})?%?
You can simply replace the 10 by a 2 to get two decimal places.
My example will match:
15.5
5.4366%
1,43
50,55%
34
45%
Of course the output of this one is harder to cast, but something like this will do (Java code):
private static Double getMyVal(String myVal) {
if (myVal.contains("%")) {
myVal = myVal.replace("%", "");
}
if (myVal.contains(",")) {
myVal = myVal.replace(',', '.');
}
return Double.valueOf(myVal);
}
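Since the question is about JavaScript, here is a rough JS sketch of the same idea:
// Strip a trailing percent sign, accept either "," or "." as the decimal
// separator, and hand the result to parseFloat.
function getMyVal(myVal) {
    return parseFloat(myVal.replace('%', '').replace(',', '.'));
}

getMyVal('50,55%'); // 50.55
getMyVal('45%');    // 45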
None of the above solutions worked for me, as I needed my regex to allow for values with numbers and a decimal while the user is typing ex: '18.'
This solution allows for an empty string so the user can delete their entire input, and accounts for the other rules articulated above.
/(^$)|(^100(\.0{1,2})?$)|(^([1-9]([0-9])?|0)\.(\.[0-9]{1,2})?$)|(^([1-9]([0-9])?|0)(\.[0-9]{1,2})?$)/
(100|[0-9]{1,2})(\.[0-9]{1,2})?
That should be the regex you want. I suggest you read Mastering Regular Expressions and download RegexBuddy or The Regex Coach.
@mlarsen:
It's not that a regex here won't do the job better.
Remember that validation must be done both on the client and on the server side, so something like:
100|(([1-9][0-9])|[0-9])(\.(([0-9][1-9])|[1-9]))?
would be a cross-language check; just beware of comparing the input length with the output match length.
(100(\.(0){1,2})?|([1-9]{1}|[0-9]{2})(\.[0-9]{1,2})?)
