String comparison - JavaScript

I'm trying to get my head around string comparisons in JavaScript:
function f(str){
return str[0] < str[str.length -1]
}
f("a+"); // false
In ASCII: 'a' == 97, '+' == 43
Am I correct in thinking my test: f(str) is based on ASCII values above?

You don't need a function or a complicated test pulling a string apart for this. Just do 'a' < '+' and learn from what happens. Or, more simply, check the char's charcode using 'a'.charCodeAt(0).
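For example, if you run these in a console you should see something like this (results shown as comments):
'a' < '+'           // false
'a'.charCodeAt(0)   // 97  (0x61)
'+'.charCodeAt(0)   // 43  (0x2B)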

You are almost right. It is based on Unicode code units (not code points; code units are the 16-bit pieces of the encoded string), not on ASCII values.
From the ECMAScript 2015 specification:
If both px and py are Strings, then
If py is a prefix of px, return false. (A String value p is a prefix of String value q if q can be the result of concatenating p and some other String r. Note that any String is a prefix of itself, because r may be the empty String.)
If px is a prefix of py, return true.
Let k be the smallest nonnegative integer such that the code unit at index k within px is different from the code unit at index k within py. (There must be such a k, for neither String is a prefix of the other.)
Let m be the integer that is the code unit value at index k within px.
Let n be the integer that is the code unit value at index k within py.
If m < n, return true. Otherwise, return false.
Note 2
The comparison of Strings uses a simple lexicographic ordering on
sequences of code unit values. There is no attempt to use the more
complex, semantically oriented definitions of character or string
equality and collating order defined in the Unicode specification.
Therefore String values that are canonically equal according to the
Unicode standard could test as unequal. In effect this algorithm
assumes that both Strings are already in normalized form. Also, note
that for strings containing supplementary characters, lexicographic
ordering on sequences of UTF-16 code unit values differs from that on
sequences of code point values.
Basically it means that string comparison uses a lexicographic ordering of "code units", i.e. the 16-bit numeric values that make up the UTF-16 encoding of the string.

JavaScript engines are allowed to use either UCS-2 or UTF-16 (which is the same for most practical purposes).
So, technically, your function is based on UTF-16 values and you were comparing 0x0061 and 0x002B.
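If it helps to see the spec's algorithm as code, here is a rough sketch of the abstract string less-than comparison (the name stringLessThan is just illustrative, not an engine API):
function stringLessThan(px, py) {
  var len = Math.min(px.length, py.length);
  for (var k = 0; k < len; k++) {
    var m = px.charCodeAt(k); // code unit of px at index k
    var n = py.charCodeAt(k); // code unit of py at index k
    if (m !== n) return m < n;
  }
  // Otherwise one string is a prefix of the other:
  // px < py only if px is the shorter one.
  return px.length < py.length;
}
stringLessThan("a", "+"); // false, same as 'a' < '+'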


What is a surrogate pair?

I came across this code in a JavaScript open-source project.
validator.isLength = function (str, min, max) {
    // match surrogate pairs in the string, or fall back to an empty array if none are found
    var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
    // subtract the number of surrogate pairs from the main string length
    var len = str.length - surrogatePairs.length;
    // now compare the string length with min and max; also make sure max is defined (in other words, the max param is optional)
    return len >= min && (typeof max === 'undefined' || len <= max);
};
As far as I understand, the above code is checking the length of the string but not taking the surrogate pairs into account. So:
Is my understanding of the code correct?
What are surrogate pairs?
I have thus far only figured out that this is related to encoding.
Yes, your understanding is correct. The function checks the length of the string measured in Unicode code points rather than UTF-16 code units.
JavaScript uses UTF-16 to encode its strings: each code unit is 16 bits (two bytes), and most Unicode code points fit into a single code unit.
However, there are characters (like the emojis) whose code points are too high to fit into 16 bits, so they are encoded as two UTF-16 code units (4 bytes). Such a two-unit sequence is called a surrogate pair.
Try this
var len = "😀".length // There is an emoji in the string (in case you don't see it)
vs
var str = "😀"
var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
var len = str.length - surrogatePairs.length;
In the first example len will be 2, because the emoji consists of two UTF-16 code units. In the second example len will be 1.
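On an ES2015+ engine there is also a shorter way to count code points, because string iteration walks code points rather than code units (a small aside, not part of the original answer):
var str = "😀";
// Spreading the string iterates by code point, so a surrogate pair
// becomes a single array element (ES2015+):
var codePointLength = [...str].length; // 1
str.length;                            // 2 (counts UTF-16 code units)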
You might want to read
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
by Joel Spolsky
For your second question:
1. What is a "surrogate pair" in Java?
The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme.
In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF.
Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can only contain the range of characters from 0x0 to 0xFFFF, some additional complexity is used to store values above this range (0x10000 to 0x10FFFF). This is done using pairs of code units known as surrogates.
The surrogate code units are in two ranges known as "low surrogates" and "high surrogates", depending on whether they are allowed at the start or end of the two code unit sequence.
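As a small sketch of those ranges (the name classifyCodeUnit is just illustrative), you can tell which half of a pair a code unit is by its numeric value:
function classifyCodeUnit(codeUnit) {
  if (codeUnit >= 0xD800 && codeUnit <= 0xDBFF) return "high surrogate (first of a pair)";
  if (codeUnit >= 0xDC00 && codeUnit <= 0xDFFF) return "low surrogate (second of a pair)";
  return "ordinary code unit";
}
var s = "😀";
classifyCodeUnit(s.charCodeAt(0)); // "high surrogate (first of a pair)"
classifyCodeUnit(s.charCodeAt(1)); // "low surrogate (second of a pair)"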
https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396
Hope this helps.
Did you try to just google it?
The best description is http://unicodebook.readthedocs.io/unicode_encodings.html#surrogates
In UTF-16 most characters are stored in a single 16-bit code unit, while others need two code units (32 bits).
A surrogate pair is such a two-code-unit representation of a single character.
Certain code unit ranges are reserved for the first (high) and second (low) halves of such pairs.

Why does string to number comparison work in Javascript

I am trying to compare a value coming from an HTML text field with integers. And it works as expected.
Condition is -
x >= 1 && x <= 999;
Where x is the value of text field. Condition returns true whenever value is between 1-999 (inclusive), else false.
Problem is, that the value coming from the text field is of string type and I'm comparing it with integer types. Is it okay to have this comparison like this or should I use parseInt() to convert x to integer ?
Because JavaScript defines >= and <= (and several other operators) in a way that allows them to coerce their operands to different types. It's just part of the definition of the operator.
In the case of <, >, <=, and >=, the gory details are laid out in §11.8.5 of the specification. The short version is: If both operands are strings (after having been coerced from objects, if necessary), it does a string comparison. Otherwise, it coerces the operands to numbers and does a numeric comparison.
Consequently, you get fun results, like that "90" > "100" (both are strings, it's a string comparison) but "90" < 100 (one of them is a number, it's a numeric comparison). :-)
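For example, in a console you should see something like:
"90" > "100"  // true  (both strings, so a lexicographic string comparison)
"90" < 100    // true  (one operand is a number, so a numeric comparison)
90 > 100      // false (plain numeric comparison)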
Is it okay to have this comparison like this or should I use parseInt() to convert x to integer ?
That's a matter of opinion. Some people think it's totally fine to rely on the implicit coercion; others think it isn't. There are some objective arguments. For instance, suppose you relied on implicit conversion and it was fine because you had those numeric constants, but later you were comparing x to another value you got from an input field. Now you're comparing strings, but the code looks the same. But again, it's a matter of opinion and you should make your own choice.
If you do decide to explicitly convert to numbers first, parseInt may or may not be what you want, and it doesn't do the same thing as the implicit conversion. Here's a rundown of options:
parseInt(str[, radix]) - Converts as much of the beginning of the string as it can into a whole (integer) number, ignoring extra characters at the end. So parseInt("10x") is 10; the x is ignored. Supports an optional radix (number base) argument, so parseInt("15", 16) is 21 (15 in hex). If there's no radix, assumes decimal unless the string starts with 0x (or 0X), in which case it skips those and assumes hex. Does not look for the new 0b (binary) or 0o (new style octal) prefixes; both of those parse as 0. (Some browsers used to treat strings starting with 0 as octal; that behavior was never specified, and was specifically disallowed in the ES5 specification.) Returns NaN if no parseable digits are found.
Number.parseInt(str[, radix]) - Exactly the same function as parseInt above. (Literally, Number.parseInt === parseInt is true.)
parseFloat(str) - Like parseInt, but does floating-point numbers and only supports decimal. Again extra characters on the string are ignored, so parseFloat("10.5x") is 10.5 (the x is ignored). As only decimal is supported, parseFloat("0x15") is 0 (because parsing ends at the x). Returns NaN if no parseable digits are found.
Number.parseFloat(str) - Exactly the same function as parseFloat above.
Unary +, e.g. +str - (This is the same conversion that implicit coercion uses.) Converts the entire string to a number using floating point and JavaScript's standard number notation (just digits and a decimal point = decimal; 0x prefix = hex; 0b = binary [ES2015+]; 0o prefix = octal [ES2015+]; some implementations extend it to treat a leading 0 as octal, but not in strict mode). +"10x" is NaN because the x is not ignored. +"10" is 10, +"10.5" is 10.5, +"0x15" is 21, +"0o10" is 8 [ES2015+], +"0b101" is 5 [ES2015+]. Has a gotcha: +"" is 0, not NaN as you might expect.
Number(str) - Exactly like implicit conversion (e.g., like the unary + above), but slower on some implementations. (Not that it's likely to matter.)
Bitwise OR with zero, e.g. str|0 - Implicit conversion, like +str, but then it also converts the number to a 32-bit integer (and converts NaN to 0 if the string cannot be converted to a valid number).
So if it's okay that extra bits on the string are ignored, parseInt or parseFloat are fine. parseInt is quite handy for specifying radix. Unary + is useful for ensuring the entire string is considered. Takes your choice. :-)
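A quick sketch of those options side by side (run in a console; results shown as comments):
parseInt("10x")      // 10   (trailing characters ignored)
parseInt("15", 16)   // 21   (explicit radix)
parseFloat("10.5x")  // 10.5
parseFloat("0x15")   // 0    (parsing stops at the "x")
+"10.5"              // 10.5
+"0x15"              // 21
+"10x"               // NaN  (the entire string must be a number)
+""                  // 0    (the gotcha mentioned above)
Number("10.5")       // 10.5 (same rules as unary +)
"10.7" | 0           // 10   (ToInt32 truncates toward zero)
"abc" | 0            // 0    (NaN becomes 0)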
For what it's worth, I tend to use this function:
const parseNumber = (str) => str ? +str : NaN;
(Or a variant that trims whitespace.) Note how it handles the issue with +"" being 0.
And finally: If you're converting to number and want to know whether the result is NaN, you might be tempted to do if (convertedValue === NaN). But that won't work, because as Rick points out below, comparisons involving NaN are always false. Instead, it's if (isNaN(convertedValue)).
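For instance, something like:
var convertedValue = +"abc"; // NaN
convertedValue === NaN       // false -- NaN never compares equal to anything, even itself
isNaN(convertedValue)        // true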
MDN's docs on comparison operators state that the operands are converted to a common type before comparing (with the operator you're using):
The more commonly used abstract comparison (e.g. ==) converts the operands to the same Type before making the comparison. For relational abstract comparisons (e.g., <=), the operands are first converted to primitives, then to the same type, before comparison.
You'd only need to apply parseInt() if you were using a strict comparison, which does not perform that automatic coercion before comparing.
You should use parseInt if the variable is a string, and pass the radix explicitly:
parseInt(x, 10) >= 1 && parseInt(x, 10) <= 999;

Bitwise operations on strings in javascript

In JavaScript, the following test of character-to-character bitwise operations prints 0 676 times:
var s = 'abcdefghijklmnopqrstuvwxyz';
var i, j;
for(i=0; i<s.length;i++){ for(j=0; j<s.length;j++){ console.log(s[i] | s[j]) }};
If js was using the actual binary representation of the strings I would expect some non-zero values here.
Similarly, testing binary operations on strings and integers, the following print 26 255s and 0s, respectively. (255 was chosen because it is 11111111 in binary).
var s = 'abcdefghijklmnopqrstuvwxyz';
var i; for(i=0; i<s.length;i++){ console.log(s[i] | 255) }
var i; for(i=0; i<s.length;i++){ console.log(s[i] & 255) }
What is javascript doing here? It seems like javascript is casting any string to false before binary operations.
Notes
If you try this in python, it throws an error:
>>> s = 'abcdefghijklmnopqrstuvwxyz'
>>> [c1 | c2 for c2 in s for c1 in s]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for |: 'str' and 'str'
But stuff like this seems to work in php.
In JavaScript, when a string is used with a binary operator it is first converted to a number. Relevant portions of the ECMAScript spec are shown below to explain how this works.
Bitwise operators:
The production A : A # B, where # is one of the bitwise operators in the productions above, is evaluated as follows:
Let lref be the result of evaluating A.
Let lval be GetValue(lref).
Let rref be the result of evaluating B.
Let rval be GetValue(rref).
Let lnum be ToInt32(lval).
Let rnum be ToInt32(rval).
Return the result of applying the bitwise operator # to lnum and rnum. The result is a signed 32 bit integer.
ToInt32:
The abstract operation ToInt32 converts its argument to one of 2^32 integer values in the range −2^31 through 2^31−1, inclusive. This abstract operation functions as follows:
Let number be the result of calling ToNumber on the input argument.
If number is NaN, +0, βˆ’0, +∞, or βˆ’βˆž, return +0.
Let posInt be sign(number) * floor(abs(number)).
Let int32bit be posInt modulo 2^32; that is, a finite integer value k of Number type with positive sign and less than 2^32 in magnitude such that the mathematical difference of posInt and k is mathematically an integer multiple of 2^32.
If int32bit is greater than or equal to 2^31, return int32bit − 2^32, otherwise return int32bit.
The internal ToNumber function will return NaN for any string that cannot be parsed as a number, and ToInt32(NaN) will give 0. So in your code example all of the bitwise operators with letters as the operands will evaluate to 0 | 0, which explains why only 0 is printed.
Note that something like '7' | '8' will evaluate to 7 | 8 because in this case the strings used as the operands can be successfully converted to numbers.
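For example:
'7' | '8'   // 15  (both strings convert to the numbers 7 and 8, and 7 | 8 is 15)
'a' | 'b'   // 0   (both convert to NaN, and ToInt32(NaN) is 0)
'a' | 255   // 255 (effectively 0 | 255)
'a' & 255   // 0   (effectively 0 & 255)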
As for why the behavior in Python is different, there isn't really any implicit type conversion in Python so an error is expected for any type that doesn't implement the binary operators (by using __or__, __and__, etc.), and strings do not implement those binary operators.
Perl does something completely different: bitwise operators are implemented for strings, and it will essentially perform the bitwise operation on the corresponding bytes from each string.
If you want to use JavaScript and get the same result as Perl, you will need to first convert the characters to their code points using str.charCodeAt, perform the bitwise operator on the resulting integers, and then use String.fromCodePoint to convert the resulting numeric values into characters.
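Here is a rough sketch of that approach, assuming equal-length strings of BMP characters (the name bitwiseOrStrings is just illustrative; String.fromCharCode works here because the results stay within single code units):
function bitwiseOrStrings(a, b) {
  var out = '';
  for (var i = 0; i < Math.min(a.length, b.length); i++) {
    // OR the UTF-16 code units and turn the result back into a character
    out += String.fromCharCode(a.charCodeAt(i) | b.charCodeAt(i));
  }
  return out;
}
bitwiseOrStrings('abc', 'ABC'); // "abc" (0x61 | 0x41 === 0x61, and so on)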
I'd be surprised if JavaScript worked at all with bitwise operations on non-numerical strings and produced anything meaningful. I'd imagine that because any bitwise operator in JavaScript converts its operand into a 32 bit integer, that it would simply turn all non-numerical strings into 0.
I'd use...
"a".charCodeAt(0) & 0xFF
That produces 97, the ASCII code for "a", which is correct, given it's masked off with a byte with all bits set.
Try to remember that because things work nicely in other languages, it isn't always the case in JavaScript. We're talking about a language conceived and implemented in a very short amount of time.
JavaScript is using type coercion, which makes it attempt to parse the strings as numbers automatically when you try to perform a numeric operation on them. For a letter like 'a' the parsed value is NaN, which the bitwise operators then treat as 0. This obviously won't get you the information you're trying to get.
I think what you're looking for is charCodeAt, which will give you the numeric Unicode value for a character in a string, and possibly the complementary fromCodePoint, which converts a numeric value back into a character.

String sequence similarity / difference ratio in javascript and python

Say I had a reference string
"abcdabcd"
and a target string
"abcdabEd"
Is there a simple way in javascript and python to get the string sequence similarity ratio?
Example:
"abcdabcd" differs from "abcdabEd" by the character "E" so the ratio of similarity is high but less than 1.0
"bcdabcda" differs from "abcdabEd" greatly because every character at a specific string index is different so the similarity ratio is 0.0
note that the similarity ratio is not how many similar characters are in each string but how similar the sequences are from each other
therefore code like
# python - incorrect for this problem
difflib.SequenceMatcher(None, "bcdabcda", "abcdabEd").ratio()
would be wrong
You can use this general formula, it works with strings or arrays of objects with the same or different lengths:
similarity=#common/(sqrt(nx*ny));
where #common is the number of common occurrences (in this case the number of matching characters);
nx is the length of the array of objects x (or the string called x);
ny is the length of the array of objects y (or the string called y).
If the strings have the same length, that formula reduces to the simpler case:
similarity=#common/n;
where:
n=nx=ny.
In python this formula for similarity of strings (considering the order of characters, as you want) can be written as:
from math import sqrt

def similarity(x, y):
    n = min(len(x), len(y))
    common = 0
    for i in range(n):
        if x[i] == y[i]:
            common += 1
    return common / sqrt(len(x) * len(y))
and in javascript it's analogous.
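A rough JavaScript translation (a sketch along the same lines as the Python above):
function similarity(x, y) {
  var n = Math.min(x.length, y.length);
  var common = 0;
  for (var i = 0; i < n; i++) {
    if (x[i] === y[i]) common++;
  }
  return common / Math.sqrt(x.length * y.length);
}
similarity("abcdabcd", "abcdabEd"); // 0.875
similarity("bcdabcda", "abcdabEd"); // 0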
How about:
float(sum([a==b for a,b in zip(my_string1,my_string2)]))/len(my_string1)
>>> s1,s2 = "abcdabcd","abcdabEd"
>>> print float(sum([a==b for a,b in zip(s1,s2)]))/len(s1)
0.875

Javascript array sort speed affected by string length?

Just wondering, I have seen diverging opinions on this subject.
If you take an array of strings, say 1000 elements and use the sort method. Which one would be faster? An array in which the strings are 100 characters long or one in which the strings are only 3 characters long?
I tried to test but I have a bug with Firebug at the moment and Date() appears too random.
Thank you!
It depends on what the strings contain: if they differ in an early character, the rest of the string doesn't have to be checked for the comparison, so the length doesn't matter.
For example, in "abc" < "bca" only the first character has to be checked.
You can read the specs for this: http://ecma-international.org/ecma-262/5.1/#sec-11.8.5
Specifically:
Else, both px and py are Strings
If py is a prefix of px, return false. (A String value p is a prefix of String value q if q can be the result of concatenating p and some other String r. Note that any String is a prefix of itself, because r may be the empty String.)
If px is a prefix of py, return true.
Let k be the smallest nonnegative integer such that the character at position k within px is different from the character at position k within py. (There must be such a k, for neither String is a prefix of the other.)
Let m be the integer that is the code unit value for the character at position k within px.
Let n be the integer that is the code unit value for the character at position k within py.
If m < n, return true. Otherwise, return false.
It really depends on how different the strings are, but I'd guess the differences would be minimal, because the overhead of invoking each comparison is much greater than the character comparisons themselves.
But then again, modern browsers use some special optimizations for sort, so they cut some comparisons to speed things up. And this would happen more often sorting an array of short strings.
And FYI, if you want to make some benchmark, use a reliable tool like jsPerf.
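If you just want a quick-and-dirty measurement before setting up a jsPerf case, a rough console.time sketch like this works (all names here are made up for illustration, and the numbers will be noisy):
function randomString(len) {
  var s = '';
  while (s.length < len) s += Math.random().toString(36).slice(2);
  return s.slice(0, len);
}
function makeArray(count, strLen) {
  var a = [];
  for (var i = 0; i < count; i++) a.push(randomString(strLen));
  return a;
}
var shortStrings = makeArray(1000, 3);
var longStrings = makeArray(1000, 100);

console.time('short strings'); shortStrings.slice().sort(); console.timeEnd('short strings');
console.time('long strings');  longStrings.slice().sort();  console.timeEnd('long strings');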
