JavaScript string normalize() does not work in some cases - why?

In my debugger I can see an object is retrieved with a unicode character. e.g.
{
name: "(Other\uff09"
}
If this object is referenced using the var myObj in the debugger I see that
myObj.name.normalize()
returns
"(Other\uff09"
If instead I use
"(Other\uff09".normalize()
it returns
"(Other)"
Why?

I've learnt a little since posting this question. Initially I thought that \uff09 was equivalent to ). It's not, instead it's ）, aka Full Width Right Parenthesis.
This essentially means that the opening and closing brackets do not match. e.g. ( is not closed by ）. This is causing parsing issues for the NodeMailer AddressParser module I'm using.
To normalize it I'm using
JSON.parse(`"(Other\uff09"`).normalize("NFKC")
which translates the string to (Other) rather than (Other）.
Breaking this down
JSON.parse(`"(Other\uff09"`)
where the double quotes are required, produces (Other）, i.e. it converts the Unicode escape sequence into the character it represents.
Then
.normalize("NFKC")
converts that into (Other).
This worked for me in this sample. I still need to see if it scales up.
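A minimal sketch of the two cases, assuming the retrieved name holds a literal backslash followed by uff09 (six characters of data) rather than the single character U+FF09. The helper name normalizeEscaped is hypothetical, and String.raw is used here only to build such a string for the demonstration:

```javascript
// Hypothetical helper: decode literal "\uXXXX" sequences, then apply NFKC.
function normalizeEscaped(text) {
  const decoded = text.replace(/\\u([0-9a-fA-F]{4})/g, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
  return decoded.normalize("NFKC");
}

const fromLiteral = "(Other\uff09"; // escape resolved by the JS parser: 7 characters
const fromData = String.raw`(Other\uff09`; // backslash kept as data: 12 characters

console.log(fromLiteral.length, fromData.length); // 7 12
console.log(fromLiteral.normalize() === fromLiteral); // true: default NFC keeps U+FF09
console.log(fromLiteral.normalize("NFKC")); // "(Other)"
console.log(normalizeEscaped(fromData)); // "(Other)"
```

The default form, NFC, only composes canonically equivalent sequences, so the full-width parenthesis survives it. Only the compatibility forms (NFKC, NFKD) fold it to the ASCII parenthesis.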

Related

ES6 / JS: Regex for replacing delve with conditional chaining

How to replace delve with conditional chaining in a vs code project?
e.g.
delve(seo,'meta')
delve(item, "image.data.attributes.alternativeText")
desired result
seo?.meta
item?.image.data.attributes.alternativeText
Is it possible using find/replace in Visual Studio Code?

I propose the following RegEx:
delve\(\s*([^,]+?)\s*,\s*['"]([^.]+?)['"]\s*\)
and the following replacement format string:
$1?.$2
Explanation: Match delve(, a first argument up to the first comma (lazy match), then a second string argument, then the closing bracket ) of the call. No care is taken to ensure that the brackets match, as this is rather quick and dirty anyway. Spacing at reasonable places is accounted for.
This will work for simple cases like delve(someVar, "key") but might fail for pathological cases; always review the replacements manually.
Note that this is explicitly made incapable of dealing with delve(var, "a.b.c") because, as far as I know, VS Code replacement strings don't support "joining" a variable number of captures with a given string. As a workaround, you could explicitly create versions for one, two, three, four... dots and write the corresponding replacements. The version for a path with two segments (one dot), for example, looks as follows:
delve\(([^,]+?)\s*,\s*['"]([^.]+?)\.([^.]+?)['"]\s*\)
and the format string is $1?.$2?.$3.
You write:
e.g.
delve(seo,'meta')
delve(item, "image.data.attributes.alternativeText")
desired result
seo?.meta
item?.image.data.attributes.alternativeText
but I highly doubt that this is intended, because delve(item, "image.data.attributes.alternativeText") is in fact equivalent to item?.image?.data?.attributes?.alternativeText rather than the desired result you describe. If you do want exactly the result you describe, replace [^.] with . in the first RegEx so that the second capture accepts any characters, including dots.
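Outside the editor, a replacer function can join any number of path segments, which a fixed replacement string cannot. A rough sketch in plain JavaScript, assuming the second argument is always a quoted string of dot-separated identifier names. The function name delveToOptionalChaining is made up for illustration, and the output should still be reviewed by hand:

```javascript
// Same idea as the RegEx above, but the path capture may contain dots.
const delveCall = /delve\(\s*([^,]+?)\s*,\s*['"]([^'"]+)['"]\s*\)/g;

// Hypothetical helper: rewrite delve(obj, "a.b.c") to obj?.a?.b?.c
function delveToOptionalChaining(source) {
  return source.replace(delveCall, (match, target, path) =>
    [target, ...path.split(".")].join("?.")
  );
}

console.log(delveToOptionalChaining("delve(seo,'meta')"));
// seo?.meta
console.log(
  delveToOptionalChaining('delve(item, "image.data.attributes.alternativeText")')
);
// item?.image?.data?.attributes?.alternativeText
```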

TextEncoder / TextDecoder not round tripping

I'm definitely missing something about the TextEncoder and TextDecoder behavior. It seems to me like the following code should round-trip, but it doesn't seem to:
new TextDecoder().decode(new TextEncoder().encode(String.fromCharCode(55296))).charCodeAt(0);
Since I'm just encoding and decoding the string, the char code seems like it should be the same, but this returns 65533 instead of 55296. What am I missing?

Based on some spelunking, the TextEncoder.encode() method appears to take an argument of type USVString, where USV stands for Unicode Scalar Value. According to this page, a USV cannot be a high-surrogate or low-surrogate code point.
Also, according to MDN:
A USVString is a sequence of Unicode scalar values. This definition
differs from that of DOMString or the JavaScript String type in that
it always represents a valid sequence suitable for text processing,
while the latter can contain surrogate code points.
So, my guess is your String argument to encode() is getting converted to a USVString (either implicitly or within encode()). Based on this page, it looks like to convert from String to USVString, it first converts it to a DOMString, and then follows this procedure, which includes replacing all surrogates with U+FFFD, which is the code point you see, 65533, the "Replacement Character".
The reason String.fromCharCode(55296).charCodeAt(0) works I believe is because it doesn't need to do this String -> USVString conversion.
As to why TextEncoder.encode() was designed this way, I don't understand the unicode details well enough to attempt to explain, but I suspect it's to simplify implementation since the only output encoding it supports seems to be UTF-8, in an Uint8Array. I'm guessing requiring a USVString argument without surrogates (instead of a native UTF-16 String possibly with surrogates) simplifies the encoding to UTF-8, or maybe makes some encoding/decoding use cases simpler?
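The substitution can be observed directly by looking at the encoded bytes. A small sketch, assuming an environment where TextEncoder and TextDecoder are globals (browsers and current Node.js):

```javascript
const lone = String.fromCharCode(55296); // 0xD800, a lone high surrogate
const bytes = Array.from(new TextEncoder().encode(lone));
const decoded = new TextDecoder().decode(new Uint8Array(bytes));

console.log(lone.charCodeAt(0)); // 55296: the string itself is untouched
console.log(bytes); // [239, 191, 189], the UTF-8 bytes of U+FFFD
console.log(decoded.charCodeAt(0)); // 65533
```

The replacement has already happened by the time the bytes exist, so the decoder is only reporting what the encoder wrote.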

For those (like me) who aren't sure what "unicode surrogates" are:
The problem
The character code 55296 is not a valid character by itself. So this part of the code is already a problem:
String.fromCharCode(55296)
String.fromCharCode does return a string holding that single code unit, but the string is not well-formed Unicode. When it is encoded, the lone code is replaced by the error character "�", which happens to have the code 65533.
Codes like 55296 are only valid as the first element of a pair of codes. Pairs of codes are used to represent the characters that didn't fit in Unicode's Basic Multilingual Plane. (There are a lot of characters outside the Basic Multilingual Plane, so they need two 16-bit numbers to encode them.)
For example, here is a valid use of the code 55296:
console.log(String.fromCharCode(55296, 57091));
It prints the character "𐌃", from the ancient Etruscan alphabet.
The solution
This code will round-trip correctly:
const code = new TextEncoder().encode(String.fromCharCode(55296, 57091));
console.log(new TextDecoder().decode(code).charCodeAt(0)); // Returns 55296
But beware: .charCodeAt only returns the first part of the pair. A safer option might be to use .codePointAt, which combines the pair into a single code point:
const code = new TextEncoder().encode(String.fromCharCode(55296, 57091));
console.log(new TextDecoder().decode(code).codePointAt(0)); // Returns 66307
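For reference, the arithmetic that turns the pair into 66307 can be written out by hand. This is a sketch of the standard UTF-16 surrogate formula; the function name fromSurrogates is illustrative only:

```javascript
// Combine a high surrogate (0xD800-0xDBFF) and a low surrogate (0xDC00-0xDFFF)
// into the code point they encode.
function fromSurrogates(high, low) {
  return 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
}

console.log(fromSurrogates(55296, 57091)); // 66307
console.log(
  String.fromCodePoint(66307) === String.fromCharCode(55296, 57091)
); // true
```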

Why does a number inside parentheses have methods, but a number outside parentheses does not? [duplicate]

If I try to write
3.toFixed(5)
there is a syntax error. Using double dots, putting in a space, putting the three in parentheses or using bracket notation allows it to work properly.
3..toFixed(5)
3 .toFixed(5)
(3).toFixed(5)
3["toFixed"](5)
Why doesn't the single dot notation work and which one of these alternatives should I use instead?

The period is part of the number, so the code will be interpreted the same as:
(3.)toFixed(5)
This will naturally give a syntax error, as you can't immediately follow the number with an identifier.
Any method that keeps the period from being interpreted as part of the number would work. I think that the clearest way is to put parentheses around the number:
(3).toFixed(5)

You can't access it because of a flaw in JavaScript's tokenizer. JavaScript tries to parse the dot notation on a number as a floating point literal, so you can't follow it with a property or method:
2.toString(); // raises SyntaxError
As you mentioned, there are a couple of workarounds that make number literals act as objects too. Any of these is equally valid.
2..toString(); // the second point is correctly recognized
2 .toString(); // note the space to the left of the dot
(2).toString(); // 2 is evaluated first
To understand more about object usage and properties, check out JavaScript Garden.

It doesn't work because JavaScript interprets the 3. as being either the start of a floating-point constant (such as 3.5) or else an entire floating-point constant (with 3. == 3.0), so you can't follow it by an identifier (in your case, a property-name). It fails to recognize that you intended the 3 and the . to be two separate tokens.
Any of your workarounds looks fine to me.

This is an ambiguity in the JavaScript grammar. When the parser has got some digits and then encounters a dot, it has a choice between "NumberLiteral" (like 3.5) or "MemberExpression" (like 3.foo). I guess this ambiguity cannot be resolved by lookahead because of scientific notation - should 3.e2 be interpreted as 300 or a property e2 of 3? Therefore they deliberately decided to prefer NumberLiterals here, simply because there's not very much demand for things like 3.foo.
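The 3.e2 case can be checked directly. In this short sketch the variable names are only for illustration:

```javascript
const asNumber = 3.e2; // one numeric literal: "3." followed by the exponent "e2"
const asProperty = 3..e2; // literal "3." followed by a lookup of the property "e2"

console.log(asNumber); // 300
console.log(asProperty); // undefined, because numbers have no e2 property
```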

As others have mentioned, the JavaScript parser interprets the dot after an integer literal as a decimal point, so it does not treat what follows as a method or property access on the number.
To make the parser treat it as a property or method access on an integer literal, you can use any of the options below:
Two Dot Notation
3..toFixed()
Separating with a space
3 .toFixed()
Write integer as a decimal
3.0.toFixed()
Enclose in parentheses
(3).toFixed()
Assign to a constant or variable
const nbr = 3;
nbr.toFixed()
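As a quick check that the alternatives are interchangeable, this sketch collects each form from the question and the answers and compares the results:

```javascript
const nbr = 3;
const results = [
  3..toFixed(5), // two dots
  3 .toFixed(5), // space before the dot
  3.0.toFixed(5), // written as a decimal
  (3).toFixed(5), // parentheses
  3["toFixed"](5), // bracket notation
  nbr.toFixed(5), // via a variable
];

console.log(results.every((result) => result === "3.00000")); // true
```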

g:message with arguments inside Javascript / jQuery not working as expected

I had an issue where String arguments were being truncated to the first character in our g:message tags (longs/integers seemed to be fine).
Ultimately, I figured out that we were not calling g:message in a syntactically correct way from within JavaScript, so some minor tweaks fixed the issue. The problem is that I don't understand why the former doesn't work.
Can anyone describe what was happening here?
jQuery("#myId").html("<g:message code='domain.message.path' args="${command?.foo?.name}"/>"); //incorrect, only displays first character of message
jQuery("#myId").html("${g.message(code: 'domain.message.path', args: [command?.foo?.name])}"); //correct, displays full string

I assume you're rendering this as part of a .gsp page? Here's the thing. In the first one, you're nesting quotes, essentially leaving the ${} section out of the string. Even Stack Overflow can tell; note how that part is a different color:
jQuery("#myId").html("<g:message code='domain.message.path' args="${command?.foo?.name}"/>");
See how the quote at the end of html( is ended by the quote before ${, leaving the ${command?.foo?.name} block outside the string? If command.foo.name was the string "bob", then when this rendered, you'd get:
jQuery("#myId").html("<g:message code='domain.message.path' args="bob"/>");
You might think this looks right, but JavaScript will handle this poorly.
If you used single quotes for the internal string, like you do with 'domain.message.path', it should work fine:
jQuery("#myId").html("<g:message code='domain.message.path' args='${command?.foo?.name}'/>");

regex replace on JSON is removing an Object from Array

I'm trying to improve my understanding of Regex, but this one has me quite mystified.
I started with some text defined as:
var txt = "{\"columns\":[{\"text\":\"A\",\"value\":80},{\"text\":\"B\",\"renderer\":\"gbpFormat\",\"value\":80},{\"text\":\"C\",\"value\":80}]}";
and do a replace as follows:
txt.replace(/\"renderer\"\:(.*)(?:,)/g,"\"renderer\"\:gbpFormat\,");
which results in:
"{"columns":[{"text":"A","value":80},{"text":"B","renderer":gbpFormat,"value":80}]}"
What I expected was for the renderer attribute value to have its quotes removed, which has happened, but the C column is also completely missing! I'd really love for someone to explain how my Regex has removed column C.
As an extra bonus, if you could explain how to remove the quotes around any value for renderer (i.e. so I don't have to hard-code the value gbpFormat in the regex) that'd be fantastic.

You are using a greedy operator while you need a lazy one. Change this:
"renderer":(.*)(?:,)
              ^---- add the '?' here to make it lazy
To
"renderer":(.*?)(?:,)
Your code should be:
txt.replace(/\"renderer\"\:(.*?)(?:,)/g,"\"renderer\"\:gbpFormat\,");
If you are learning regex, take a look at this documentation to learn more about greediness. A nice extract to understand this is:
Watch Out for The Greediness!
Suppose you want to use a regex to match an HTML tag. You know that
the input will be a valid HTML file, so the regular expression does
not need to exclude any invalid use of sharp brackets. If it sits
between sharp brackets, it is an HTML tag.
Most people new to regular expressions will attempt to use <.+>. They
will be surprised when they test it on a string like This is a
<EM>first</EM> test. You might expect the regex to match <EM> and when
continuing after that match, </EM>.
But it does not. The regex will match <EM>first</EM>. Obviously not
what we wanted. The reason is that the plus is greedy. That is, the
plus causes the regex engine to repeat the preceding token as often as
possible. Only if that causes the entire regex to fail, will the regex
engine backtrack. That is, it will go back to the plus, make it give
up the last iteration, and proceed with the remainder of the regex.
Like the plus, the star and the repetition using curly braces are
greedy.

Try like this:
txt = txt.replace(/"renderer":"(.*?)"/g,'"renderer":$1');
The issue in the expression you were using was this part:
(.*)(?:,)
By default, the * quantifier is greedy, which means that it gobbles up as much as it can, so it will run up to the last comma in your string. The easiest solution would be to turn it into a non-greedy quantifier by adding a question mark after the asterisk, changing that part of your expression to look like this:
(.*?)(?:,)
For the solution I proposed at the top of this answer, I also removed the part matching the comma, because I think it's easier just to match everything between quotes. As for your bonus question, to replace the matched value instead of having to hardcode gbpFormat, I used a backreference ($1), which will insert the first matched group into the replacement string.
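To see what each quantifier captures, the two patterns can be run against the text from the question. This is a small sketch in which the variable names greedy and lazy are only for illustration:

```javascript
const txt =
  '{"columns":[{"text":"A","value":80},{"text":"B","renderer":"gbpFormat","value":80},{"text":"C","value":80}]}';

// Greedy: .* runs to the end, then backtracks only as far as the LAST comma.
const greedy = txt.match(/"renderer":(.*),/)[1];
// Lazy: .*? stops at the FIRST comma it can.
const lazy = txt.match(/"renderer":(.*?),/)[1];

console.log(greedy); // "gbpFormat","value":80},{"text":"C"
console.log(lazy); // "gbpFormat"

// The quote-anchored pattern with a backreference drops only the quotes.
console.log(txt.replace(/"renderer":"(.*?)"/g, '"renderer":$1'));
```

The greedy capture swallows the start of column C, which is why the replacement deleted it.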

Don't manipulate JSON with regexp. It's too likely that you will break it, as you have found, and more importantly there's no need to.
In addition, once you have changed
'{"columns": [..."renderer": "gbpFormat", ...]}'
into
'{"columns": [..."renderer": gbpFormat, ...]}' // remove quotes from gbpFormat
then this is no longer valid JSON. (JSON requires that property values be numbers, quoted strings, objects, or arrays.) So you will not be able to parse it, or send it anywhere and have it interpreted correctly.
Therefore you should parse it to start with, then manipulate the resulting actual JS object:
var object = JSON.parse(txt);
object.columns.forEach(function(column) {
  // gbpFormat must already be defined in scope
  column.renderer = gbpFormat;
});
If you want to replace any quoted value of the renderer property with the value itself, then you could try
column.renderer = window[column.renderer];
Assuming that the value is available in the global namespace.
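If the renderers are not globals, an explicit lookup table avoids reaching into window. This is a sketch under that assumption: the renderers object and the body of gbpFormat are hypothetical stand-ins, not something from the question.

```javascript
const txt =
  '{"columns":[{"text":"A","value":80},{"text":"B","renderer":"gbpFormat","value":80},{"text":"C","value":80}]}';

// Hypothetical registry of renderer functions, keyed by the names used in the JSON.
const renderers = {
  gbpFormat: (value) => "GBP " + value.toFixed(2),
};

const config = JSON.parse(txt);
for (const column of config.columns) {
  // Only swap in names that exist in the registry; leave everything else alone.
  if (Object.prototype.hasOwnProperty.call(renderers, column.renderer)) {
    column.renderer = renderers[column.renderer];
  }
}

console.log(config.columns.length); // 3: column C is still there
console.log(config.columns[1].renderer(80)); // GBP 80.00
```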
This question falls into the category of "I need a regexp, or I wrote one and it's not working, and I'm not really sure why it has to be a regexp, but I heard they can do all kinds of things, so that's just what I imagined I must need." People use regexps to try to do far too many complex matching, splitting, scanning, replacement, and validation tasks, including on complex languages such as HTML, or in this case JSON. There is almost always a better way.
The only time I can imagine wanting to manipulate JSON with regexps is if the JSON is broken somehow, perhaps due to a bug in server code, and it needs to be fixed up in order to be parseable.
