Javascript + Emoji strangeness

I'm trying to use some string methods on text that has emoji embedded in it.
However, I've run into something very strange:
"🌑".length == 2
I'm just wondering how it is decided that what appears to me as 1 character is actually 2.

In JavaScript, a string is a sequence of 16-bit code units. Since emoji are encoded above the BMP (Basic Multilingual Plane), they are represented by a pair of code units, also known as a surrogate pair.
So for instance, 0x1F600, which is 😀, is represented by:
"\uD83D\uDE00"
If you're willing to go further on this, you can read this article: Emojis in Javascript - Parsing emoji in Javascript is… not easy.
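In practice, if you need to treat such a character as a single unit, you can iterate the string by code point instead of by code unit. A quick console check (my own illustration, not part of the original answer):
const moon = "🌑";                              // U+1F311, outside the BMP
console.log(moon.length);                       // 2 (two UTF-16 code units)
console.log([...moon].length);                  // 1 (spread iterates by code point)
console.log(moon.codePointAt(0).toString(16));  // "1f311"
console.log("\uD83D\uDE00" === "😀");           // true (the surrogate pair for U+1F600)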


While reviewing JavaScript concepts, I found String.normalize(). This is not something that shows up in W3Schools' "JavaScript String Reference", which is probably why I had missed it before.
I found more information about it on HackerRank, which states:
Returns a string containing the Unicode Normalization Form of the
calling string's value.
With the example:
var s = "HackerRank";
console.log(s.normalize());
console.log(s.normalize("NFKC"));
having as output:
HackerRank
HackerRank
Also, in GeeksForGeeks:
The string.normalize() is an inbuilt function in javascript which is
used to return a Unicode normalisation form of a given input string.
with the example:
<script>
  // Taking a string as input.
  var a = "GeeksForGeeks";
  // Calling normalize() with each normalization form.
  var b = a.normalize('NFC');
  var c = a.normalize('NFD');
  var d = a.normalize('NFKC');
  var e = a.normalize('NFKD');
  // Printing the normalised forms.
  document.write(b + "<br>");
  document.write(c + "<br>");
  document.write(d + "<br>");
  document.write(e);
</script>
having as output:
GeeksForGeeks
GeeksForGeeks
GeeksForGeeks
GeeksForGeeks
Maybe the examples given are just poor ones, since they don't show any change at all.
I wonder... what's the point of this method?
It depends on what you are going to do with the strings: often you do not need it (if you are just taking input from a user and showing it back to them). But to check, search, or use such strings as keys, you may want a unique way to identify strings that are the same, semantically speaking.
The main problem is that you may have two strings which are semantically the same but have two different representations: e.g. one with a precomposed accented character [one code point], and one with a base character followed by a combining accent [one code point for the character, one for the combining accent]. The user is usually not in control of how the input text is sent, so you may end up with two different user names, or two different passwords. And if you mangle the data, you may get different results depending on the initial string. Users do not like that.
Another problem is the order of combining characters. You may have an accent and a lower tail (e.g. a cedilla): you can express this with several combinations: "plain char, tail, accent", "plain char, accent, tail", "char+tail, accent", "char+accent, tail".
And you may have degenerate cases (especially when the text is typed on a keyboard): you may get code points which should be removed (an arbitrarily long string may be equivalent to just a few bytes).
In any case, for sorting strings, you (or your library) require a normalized form: if you already provide the right one, the library will not need to transform the strings again.
So: you want the same string (semantically speaking) to always have the same sequence of Unicode code points.
Note: if you are working directly with UTF-8, you should also take care of UTF-8's own special cases: the same code point can be encoded in different ways [using more bytes than necessary]. This can also be a security problem.
The K forms are often used for searches and similar tasks: CO2 and CO₂ will be interpreted in the same manner. But this can change the meaning of the text, so they should usually be used only internally, for temporary tasks, while keeping the original text.
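For example, the compatibility (K) forms fold CO₂ into CO2; a quick console check (my own illustration):
const plain = "CO2";
const sub = "CO\u2082";                       // "CO₂", using U+2082 SUBSCRIPT TWO
console.log(plain === sub);                   // false
console.log(plain === sub.normalize("NFC"));  // false (canonical forms keep the subscript)
console.log(plain === sub.normalize("NFKC")); // true (compatibility forms fold it to the digit 2)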
As stated in the MDN documentation, String.prototype.normalize() returns the Unicode Normalization Form of the string. This is because, in Unicode, some characters can have more than one representation (sequence of code points).
This is the example (taken from MDN):
const name1 = '\u0041\u006d\u00e9\u006c\u0069\u0065';
const name2 = '\u0041\u006d\u0065\u0301\u006c\u0069\u0065';
console.log(`${name1}, ${name2}`);
// expected output: "Amélie, Amélie"
console.log(name1 === name2);
// expected output: false
console.log(name1.length === name2.length);
// expected output: false
const name1NFC = name1.normalize('NFC');
const name2NFC = name2.normalize('NFC');
console.log(`${name1NFC}, ${name2NFC}`);
// expected output: "Amélie, Amélie"
console.log(name1NFC === name2NFC);
// expected output: true
console.log(name1NFC.length === name2NFC.length);
// expected output: true
As you can see, the string Amélie has two different Unicode representations. With normalization, we can reduce the two forms to the same string.
Very beautifully explained here --> https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize
Short answer: the point is that characters are represented through a coding scheme like ASCII, UTF-8, etc. (we mostly use UTF-8), and some characters have more than one representation. So two strings may render the same, yet their underlying code points may differ, which means a plain string comparison can fail. We use normalize() to reduce them to a single representation:
// source from MDN
let string1 = '\u00F1'; // ñ
let string2 = '\u006E\u0303'; // ñ
string1 = string1.normalize('NFC');
string2 = string2.normalize('NFC');
console.log(string1 === string2); // true
console.log(string1.length); // 1
console.log(string2.length); // 1
Normalization of strings isn't exclusive to JavaScript; see for instance Python. The valid values for the argument are defined by the Unicode standard (more on Unicode normalization).
When it comes to JavaScript, note that there is documentation referring to it both as String.normalize() and String.prototype.normalize(). As @ChrisG mentions:
String.prototype.normalize() is correct in a technical sense, because
normalize() is a dynamic method you call on instances, not the class
itself. The point of normalize() is to be able to compare Strings that
look the same but don't consist of the same characters, as shown in
the example code on MDN.
Then, when it comes to its usage, I found a great example of String.normalize() in action:
let s1 = 'sabiá';
let s2 = 'sabiá';
// one is in NFC, the other in NFD, so they're different
console.log(s1 == s2); // false
// with normalization, they become the same
console.log(s1.normalize('NFC') === s2.normalize('NFC')); // true
// transform string into array of codepoints
function codepoints(s) { return Array.from(s).map(c => c.codePointAt(0).toString(16)); }
// printing the codepoints you can see the difference
console.log(codepoints(s1)); // [ "73", "61", "62", "69", "e1" ]
console.log(codepoints(s2)); // [ "73", "61", "62", "69", "61", "301" ]
So while sabiá and sabiá in this example look the same to the human eye (and even in console.log() output), we can see that without normalization the comparison gives a different result. Then, by looking at the code points, we can see that they are different.
There are some great answers here already, but I wanted to throw in a practical example.
I enjoy Bible translation as a hobby. I wasn't too thrilled with the flashcard options out there in the wild in my price range (free), so I made my own. The problem is, there is more than one way to type Hebrew and Greek in Unicode and get the exact same result. For example:
בָּא
בָּא
These should look identical on your screen, and for all practical purposes they are identical. However, the first was typed with the qamats (the little t shaped thing under it) before the dagesh (the dot in the middle of the letter) and the second was typed with the dagesh before the qamats. Now, since you're just reading this, you don't care. And your web browser doesn't care. But when my flashcards compare the two, then they aren't the same. To the code behind the scenes, it's no different than saying "center" and "centre" are the same.
Similarly, in Greek:
ἀ
ἀ
These two should look nearly identical, but the top is one Unicode character and the second one is two Unicode characters. Which one is going to end up typed in my flashcards is going to depend on which keyboard I'm sitting at.
When I'm adding flashcards, believe it or not, I don't always type in vocab lists of 100 words. That's why God gave us spreadsheets. And sometimes the places I'm importing the lists from do it one way, sometimes they do it the other way, and sometimes they mix it. But when I'm typing, I'm not trying to memorize the order in which the dagesh or qamats were entered, or whether the accents were typed as a separate character or not. Regardless of whether I remember to type the dagesh first, I want to get the right answer, because it really is the same answer in every practical sense either way.
So I normalize the order before saving the flashcards and I normalize the order before checking it, and the result is that it doesn't matter which way I type it, it comes out right!
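A minimal sketch of that idea (checkAnswer and the sample strings are illustrative, not taken from the actual site): normalization puts the combining marks into a single canonical order, so both typing orders compare equal.
// Hypothetical helper: compare two answers ignoring combining-mark order
// and precomposed vs. decomposed characters.
function checkAnswer(expected, typed) {
  return expected.normalize("NFC") === typed.normalize("NFC");
}
const qamatsFirst = "\u05D1\u05B8\u05BC\u05D0"; // bet, qamats, dagesh, alef
const dageshFirst = "\u05D1\u05BC\u05B8\u05D0"; // bet, dagesh, qamats, alef
console.log(qamatsFirst === dageshFirst);           // false
console.log(checkAnswer(qamatsFirst, dageshFirst)); // true
console.log("\u1F00" === "\u03B1\u0313".normalize("NFC")); // true (Greek precomposed vs. combining)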
If you want to check out the results:
https://sthelenskungfu.com/flashcards/
You need a Google or Facebook account to log in, so it can track progress and such. As far as I know (or care) only my daughter and I currently use it.
It's free, but eternally in beta.

Javascript string.length does not equal Python len()

Imagine the following text entered in an HTML textarea:
123456
7
If one calculates the length of this text via javascript, i.e. string.length, that comes out to 10.
Now if that input's length is measured in python, i.e. via len(string), it is 13.
It does not look 13 to the human eye, but if one runs print repr(string) in python, we get 123456\r\n\r\n\r\n7. That is 13 characters, not 10. For reference, this test was carried out in Ubuntu OS.
Is there any way for python to report the string length via a mechanism that imitates javascript's string.length's result? I.e. in simpler terms, how do I get 10 in python?
I understand I can manually iterate and collapse \r\n into a single character, but I wonder if there is a more robust - even inbuilt - way to do it? In any case, an illustrative example would be great!
You can make use of regular expressions, which is more elegant than iterating over the string yourself. Replacing each \r\n pair with \n does the trick.
Use the re module of python.
import re
x = '123456\r\n\r\n\r\n7'
y = re.sub(r'\r\n','\n',x)
print(len(y)) #Answer will be 10
For further reference, check out the python docs

Why doesn't my function correctly replace when using some regex pattern

This is an extension of this SO question
I made a function to see if I can correctly format any number. The answers there work on tools like https://regex101.com and https://regexr.com/, but not within my function (tried in Node and the browser):
const format = (num, regex) => String(num).replace(regex, '$1')
Basically given any whole number, it should not exceed 15 significant digits. Given any decimal, it should not exceed 2 decimal points.
So now,
format(0.12345678901234567890, /^\d{1,13}(\.\d{1,2}|\d{0,2})$/)
returns 0.123456789012345678 instead of 0.123456789012345
but
format(0.123456789012345,/^-?(\d*\.?\d{0,2}).*/)
returns the number formatted to 2 decimal places, as expected.
Let me try to explain what's going on.
For the given input 0.12345678901234567890 and the regex /^\d{1,13}(\.\d{1,2}|\d{0,2})$/, let's go step by step and see what's happening.
^\d{1,13} Does indeed match the start of the string 0
(\. Now you've opened a new group, and it does match .
\d{1,2} It does find the digits 1 and 2
|\d{0,2} So this part is skipped
) So this is the end of your capture group.
$ This indicates the end of the string, but it won't match, because you've still got 345678901234567890 remaining.
JavaScript returns the whole string unchanged because, in the end, the regex failed to match.
Let's try removing $ at the end, to become /^\d{1,13}(\.\d{1,2}|\d{0,2})/
You'd get back ".12345678901234567890". This generates a couple of questions.
Why did the preceding 0 get removed?
Because it was not part of your matching group, enclosed with ().
Why did we not get only two decimal places, i.e. .12?
Remember that you're doing a replace. Which means that by default, the original string will be kept in place, only the parts that match will get replaced. Since 345678901234567890 was not part of the match, it was left intact. The only part that matched was 0.12.
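You can see both behaviours by running the replace on the literal string (my own console check, so String(num)'s rounding doesn't get in the way):
const s = "0.12345678901234567890";
// With the trailing $, the regex doesn't match at all, so replace() returns the string unchanged:
console.log(s.replace(/^\d{1,13}(\.\d{1,2}|\d{0,2})$/, '$1')); // "0.12345678901234567890"
// Without the $, only "0.12" matches and is replaced by the captured ".12":
console.log(s.replace(/^\d{1,13}(\.\d{1,2}|\d{0,2})/, '$1'));  // ".12345678901234567890"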
Answer to the title question: your function doesn't replace anything because there's nothing to replace; the regex doesn't match anything in the string. csb's answer explains that in full detail.
But that's perhaps not the answer you really need.
Now, it seems like you have an XY problem. You ask why your call to .replace() doesn't work, but .replace() is definitely not the function you should use. The role of .replace() is to replace parts of a string, while you actually want to build a different string. Moreover, in the comments you suggest that your formatting is not only for presenting data to the user, but that you also intend to use it in some further computation. You also mention cryptocurrencies.
Let's deal with these problems one by one.
What to do instead of replace?
Well, just produce the string you need instead of replacing something you don't like in an existing string. There are some edge cases; instead of writing an all-in-one regex, just handle them one by one.
The following code is definitely not the best possible, but its main aim is to be simple and to show exactly what is going on.
function format(n) {
  const max_significant_digits = 15;
  const max_precision = 2;
  let digits_before_decimal_point;
  if (n < 0) {
    // Don't count the minus sign.
    digits_before_decimal_point = n.toFixed(0).length - 1;
  } else {
    digits_before_decimal_point = n.toFixed(0).length;
  }
  if (digits_before_decimal_point > max_significant_digits) {
    throw new Error('No good representation for this number');
  }
  const available_significant_digits_for_precision =
    Math.max(0, max_significant_digits - digits_before_decimal_point);
  const effective_max_precision =
    Math.min(max_precision, available_significant_digits_for_precision);
  const with_trailing_zeroes = n.toFixed(effective_max_precision);
  // I want to keep the string and change just the matching part,
  // so here .replace() is a proper method to use.
  const without_trailing_zeroes = with_trailing_zeroes.replace(/\.?0*$/, '');
  return without_trailing_zeroes;
}
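For illustration, here is roughly what you would get from it (my own calls, not part of the original answer):
console.log(format(0.12345678901234567890)); // "0.12"
console.log(format(123.456));                // "123.46"
console.log(format(1000));                   // "1000"
console.log(format(-2.5));                   // "-2.5"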
So, you got the number formatted the way you want. What now?
What can you use this string for?
Well, you can display it to the user. And that's mostly it. The value was rounded to (1) represent it in a different base and (2) fit in limited precision, so it's pretty much useless for any computation. And, BTW, why would you convert it to String in the first place, if what you want is a number?
Was the value you are trying to print ever useful in the first place?
Well, that's the most serious question here. Because, you know, floating point numbers are tricky. And they are absolutely abysmal for representing money. So, most likely the number you are trying to format is already a wrong number.
What to use instead?
Fixed-point arithmetic is the most obvious answer. It works most of the time. However, it's pretty tricky in JS, where a number may slip back into floating-point representation at almost any moment, so it's better to use a decimal arithmetic library. Optionally, switch to a language that has built-in bignums and decimals, like Python.

Javascript eval - obfuscation?

I came across some eval code:
eval('[+!+[]+!+[]+!+[]+!+[]+!+[]]');
This code equals the integer 5.
What is this type of thing called? I've tried searching the web but I can't seem to figure out what it is referred to. I find this very interesting and would like to know where/how one learns to produce different things instead of just the integer 5: letters, symbols, etc. Since I can't pinpoint a pattern in that code, I've had zero success removing parts from it or adding to it to get different results.
Is this some type of obfuscation?
This type of obfuscation, eval() aside, is known as Non-alphanumeric obfuscation.
To be completely non-alphanumeric, the eval would have to be performed through the Array prototype's functions and their constructor, accessed with subscript (bracket) notation:
[]["sort"]["constructor"]("string to be evaled");
These strings are then converted to non-alphanumeric form.
AFAIK, it was first proposed by Yosuke Hosogawa around 2009.
If you want to see it in action see this tool: http://www.jsfuck.com/
It is not considered a good type of obfuscation because it is easy to reverse back to the original source code, without even having to run the code (statically). Plus, it increases the file size tremendously.
But it's an interesting form of obfuscation that exploits JavaScript type coercion. To learn more about it, I recommend this presentation, slide 33:
http://www.slideshare.net/auditmark/owasp-eu-tour-2013-lisbon-pedro-fortuna-protecting-java-script-source-code-using-obfuscation
That's called non-alphanumeric JavaScript, and it's possible because of JavaScript's type coercion capabilities. There are actually some ways to call eval/Function without having to use alphabetic characters:
[]["filter"]["constructor"]('[+!+[]+!+[]+!+[]+!+[]+!+[]]')()
After you replace the strings "filter" and "constructor" with non-alphanumeric representations, you have fully non-alphanumeric JavaScript.
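You can check the first step of that trick in a console (a quick sketch of the idea, not actual JSFuck output):
const F = []["filter"]["constructor"]; // Array.prototype.filter's constructor is Function
console.log(F === Function);           // true
console.log(F("return 6 * 7")());      // 42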
If you want to play around with this, there is a site where you can do it: http://www.jsfuck.com/.
Check this https://github.com/aemkei/jsfuck/blob/master/jsfuck.js for more examples like the following:
'a': '(false+"")[1]',
'b': '(Function("return{}")()+"")[2]',
'c': '([]["filter"]+"")[3]',
...
To get the value as 5, the expression should have been like this
+[!+[] + !+[] + !+[] + !+[] + !+[]]
Let us analyze the common elements first. !+[].
An empty array literal in JavaScript converts to the empty string, and therefore to the number 0.
So the + operator applied to an array literal will try to convert it to a number, and since the array is empty, JavaScript evaluates it to 0.
The ! operator then converts 0, which is falsy, to true.
So,
console.log(!+[]);
would print true. Now, the expression can be reduced like this
+[true + true + true + true + true]
Since true is treated as 1 in arithmetic expressions, the actual expression becomes
+[ 5 ]
The + operator tries to convert the array ([ 5 ]) to a number and which results in 5. That is why you are getting 5.
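You can verify each step in a console (my own check of the steps above):
console.log(+[]);                                  // 0
console.log(!+[]);                                 // true
console.log(!+[] + !+[] + !+[] + !+[] + !+[]);     // 5 (each true coerces to 1)
console.log(+[!+[] + !+[] + !+[] + !+[] + !+[]]);  // 5 (the array [5] coerced to a number)
console.log(eval('[+!+[]+!+[]+!+[]+!+[]+!+[]]'));  // [5] (the original expression yields an array)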
I don't know of any term used to describe this type of code, aside from "abusing eval()".
I find this very interesting and would like to know where/how one
learns how to print different things instead of just the integer 5.
Letters, symbols and etc. Since I can't pin point a pattern in that
code I've had 0 success taking from and adding to it to make different
results.
This part I can at least partially answer. The eval() you pasted relies heavily on JavaScript's strange type coercion rules. There are a lot of pages on the web describing various strange consequences of the coercion rules, which you can easily search for; most of them understandably focus on how to avoid strange bugs in your code. But I can't find any reference on type coercion with the specific goal of getting "surprising" output from things like eval(), unless you count this video by Destroy All Software (the JavaScript part starts at 1:20). For your purpose, I believe the most useful things I know of are:
The ! operator will convert anything to a boolean.
The unary + operator will convert anything to a number.
The binary + operator will coerce its arguments to either numbers or strings before adding or concatenating them. Normally it will only go with strings if one or the other argument is already a string, but there are exceptions.
The bitwise operators output integers, regardless of input.
Converting to numbers is complicated. Booleans become 0 or 1. Strings are parsed as numeric literals (roughly the way Number() parses them), producing NaN if that fails. I forget the rest.
Converting an object to a string or number will invoke its "toString" or "valueOf" method respectively, if one exists.
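A few quick examples of those rules in a console (my own illustrations, not from the original answer):
console.log(!"");          // true ("" is falsy; ! converts and negates)
console.log(+"42");        // 42 (unary + converts to a number)
console.log(+"12px");      // NaN
console.log(1 + "2");      // "12" (binary + concatenates when one side is a string)
console.log(true + true);  // 2 (booleans coerce to numbers here)
console.log("7" | 0);      // 7 (bitwise operators produce integers)
console.log(+new Date(0)); // 0 (objects are converted via valueOf/toString)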
You get the idea. thefourtheye's answer walks through exactly how these rules apply to the example you gave. I'm not even going to try summarizing what happens when Dates, Functions, and Regexps are involved.
Is this some type of obfuscation?
Normally you'd simply get obfuscation for free as part of minification, so I have no idea why someone would write that in real production code (assuming that's where you found it).

Reassembling negative Python marshal int's into Javascript numbers

I'm writing a client-side Python bytecode interpreter in Javascript (specifically Typescript) for a class project. Parsing the bytecode was going fine until I tried out a negative number.
In Python, marshal.dumps(2) gives 'i\x02\x00\x00\x00' and marshal.dumps(-2) gives 'i\xfe\xff\xff\xff'. This makes sense as Python represents integers using two's complement with at least 32 bits of precision.
In my Typescript code, I use the equivalent of Node.js's Buffer class (via a library called BrowserFS, instead of ArrayBuffers and etc.) to read the data. When I see the character 'i' (i.e. buffer.readUInt8(offset) == 105, signalling that the next thing is an int), I then call readInt32LE on the next offset to read a little-endian signed long (4 bytes). This works fine for positive numbers but not for negative numbers: for 1 I get '1', but for '-1' I get something like '-272777233'.
I guess that Javascript represents numbers in 64-bit (floating point?). So, it seems like the following should work:
var longval = buffer.readInt32LE(offset); // reads a 4-byte long, gives -272777233
var low32Bits = longval & 0xffff0000; //take the little endian 'most significant' 32 bits
var newval = ~low32Bits + 1; //invert the bits and add 1 to negate the original value
//but now newval = 272826368 instead of -2
I've tried a lot of different things and I've been stuck on this for days. I can't figure out how to recover the original value of the Python integer from the binary marshal string using Javascript/Typescript. Also I think I deeply misunderstand how bits work. Any thoughts would be appreciated here.
Some more specific questions might be:
Why would buffer.readInt32LE work for positive ints but not negative?
Am I using the correct method to get the 'most significant' or 'lowest' 32 bits (i.e. does & 0xffff0000 work how I think it does?)
Separate but related: in an actual 'long' number (i.e. longer than '-2'), I think there is a sign bit and a magnitude, and I think this information is stored in the 'highest' 2 bits of the number (i.e. at number & 0x000000ff?) -- is this the correct way of thinking about this?
The sequence ef bf bd is the UTF-8 encoding of the Unicode replacement character (U+FFFD), which decoders substitute for invalid byte sequences.
It sounds like whatever method you're using to download the data is getting accidentally run through a UTF-8 decoder and corrupting the raw datastream. Be sure you're using blob instead of text, or whatever the equivalent is for the way you're downloading the bytecode.
This got messed up only for negative values because the bytes of small positive values all fall below 0x80, inside the ASCII range of UTF-8, and thus get translated 1:1 from the original byte stream.
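If the bytes arrive uncorrupted, readInt32LE already handles the two's-complement sign for you, with no extra bit twiddling needed. A minimal Node-style sketch (assuming a Buffer-compatible API, as in the question's setup):
// marshal.dumps(-2) produces b'i\xfe\xff\xff\xff'
const buf = Buffer.from([0x69, 0xfe, 0xff, 0xff, 0xff]);
console.log(String.fromCharCode(buf.readUInt8(0))); // "i" (type tag for int)
console.log(buf.readInt32LE(1));                    // -2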
