Spaces required between keyword and literal - javascript

Looking at the output of UglifyJS2, I noticed that no spaces are required between literals and the in operator (e.g., 'foo'in{foo:'bar'} is valid).
Playing around with Chrome's DevTools, however, I noticed that hex and binary number literals require a space before the in keyword:
Internet explorer returned true to all three tests, while FireFox 48.0.1 threw a SyntaxError for the first one (1in foo), however it is okay with string literals ('1'in foo==true).
It seems that there should be no problem parsing JavaScript, allowing for keywords to be next to numeric literals, but I can't find any explicit rule in the ECMAScript specification (any of them).
Further testing shows that statements like for(var i of[1,2,3])... are allowed in both Chrome and FireFox (IE11 doesn't support for..of loops), and typeof"string" works in all three.
Which behavior is correct? Is it, in fact, defined somewhere that I missed, or are all these effects a result of idiosyncrasies of each browser's parser?

Not an expert - I haven't done a JS compiler, but have done others.
ecma-262.pdf is a bit vague, but it's clear that an expression such as 1 in foo should be parsed as 3 input elements, which are all tokens. Each token is a CommonToken (11.5); in this case, we get numericLiteral, identifierName (yes, in is an identifierName), and identifierName. Exactly the same is true when parsing 0b1 in foo (see 11.8.3).
So, what happens when you take out the WS? It's not covered explicitly (as far as I can see), but it's common practice (in other languages) when writing a lexer to scan the longest character sequence that will match something you could potentially be looking for. The introduction to section 11 pretty much says exactly that:
The source text is scanned from left to right, repeatedly taking the
longest possible sequence of code points as the next input element.
So, for 0b1in foo the lexer goes through 0b1, which matches a numeric literal, and reaches i, giving 0b1i, which doesn't match anything. So it passes the longest match (0b1) to the rest of the parser as a token, and starts again at i. It finds n, followed by WS, so passes in as the second token, and so on.
So, basically, and rather bizarrely, it looks like IE is correct.

TL;DR
There would be no change to how code would be interpreted if whitespace weren't required in these circumstances, but it's part of the spec.
Looking at the source code of v8 that handles number literal parsing, it cites ECMA 262 § 7.8.3:
The source character immediately following a NumericLiteral must not be an IdentifierStart or DecimalDigit.
NOTE For example:
3in
is an error and not the two input elements 3 and in.
This section seems to contradict the introduction of section 7. However, it does not seem that there would be any problems with breaking that rule and allowing for 3in to be parsed. There are cases where allowing for no spaces between literals and identifiers would change how the source is parsed, but all cases merely change which errors are generated.

Related

JS lexing---multi line string

I am making a JS lexer as part of my study. In JS, single line stings start from " or ' and ends with the same character except if that character is preceded by a backslash.
In my current code, I loop through every character and append them to existing tokens based on flags like "string" or "regex". so it feels natural to implement multi line string with " or ' because it seems that it does not affect any other part of my lexer
Is there any practical reason why new line is not allowed as contents of strings?
Many languages, but not all, prohibit unescaped newlines in string literals. So JavaScript is certainly not unique here.
But the motivation really has little to do with the ease, difficulty or efficiency of lexical analysis. In fact, for lexical analysis the simplest syntax is to allow any character rather than having to include special-case checks. [Note 1]
There are other considerations, though; notably, the importance of a program to be readable and easy to debug. Long strings put an extra load on someone reading the code, because they may not be aware that a section of program text is actually part of a string literal. (There's a similar problem with multiline comments, which is why it's usually considered good style to mark every line in a long comment in some way, for example with a vertical column of stars at the left-hand margin. No such solution exists for string literals, though.)
Also, unterminated multiline strings can be annoying to correct. If strings are cannot span lines, the error will be detected on the line containing the problem. But multiline strings might continue until the beginning of the next string, then triggering a syntax error when the contents of the next string are accidentally parsed as program text. Or worse, resulting in a completely incorrect parse of what was supposed to be program text, followed by another incorrect string literal starting where the second literal ends, and continuing from there.
That also makes it hard for developer tools, such as editors and syntax highlighters, to deal with program text as it is being typed.
In the end, you may or may not find these arguments compelling, and a language designer might have other aesthetic preferences as well. I can't really speak for the original designers of the JavaScript language, and neither of us can take a voyage in time to argue with them and maybe change their decision.
For better or worse, languages are designed according to particular subjective judgements, and if the language is successful these judgements become permanent features. They are things you have to accept if you are using a language and they're not usually worth obsessing about. You get used to them, or you find a different language to program in, with its own syntax quirks.
When you design your own language, you will need to resolve a large number of syntactic questions, and you will undoubtedly run into cases where the answer is not clearcut because there is no objectively correct unique solution. Whatever you do, someone will want to argue with you. Perhaps you can refer them to this answer.
Notes:
There is actually a historic reason for not allowing multiline string literals, which is much clearer but has been more or less irrelevant for several decades.
Once Upon A Time, common filesystems considered text files to be linear arrays of fixed-length lines (often 80 character lines, matching a Hollerith card). One advantage of such a filesystem is that it could instantly navigate to a particular line number in a file, since all lines were the same length. But in any case, for systems where programs were entered on punched cards, the fixed length lines were just part of the environment.
To make all lines the same length, lines needed to be filled out with space characters. This would obviously make multiline string literals awkward, and that's why C never allowed multiline string literals, instead relying on a syntactic feature where consecutive string literals are automatically concatenated into a single literal.
In the end, fixed-line-length filesystems proved to be unpopular, and I don't think you're likley to run into one these days. But a careful reading of the C and Posix standards shows that such filesystems must still be usable by conforming implementations, with the consequence that a fully portable program must be prepared to deal with line length limits on output and trailing whitespace on input.
There is also such syntax
const string =
'line1\
line2\
line3'

JavaScript print all used Unicode characters

I am trying to make JavaScript print all Unicode characters. According to my research, there are 1,114,112 Unicode characters.
A script like the following could work:
for(i = 0; i < 1114112; i++)
console.log(String.fromCharCode(i));
But I found out that only 10% of the 1,114,112 Unicode characters are used.
How can I can I only print the used unicode characters?
As Jukka said, JavaScript has no built-in way of knowing whether a given Unicode code point has been assigned a symbol yet or not.
There is still a way to do what you want, though.
I’ve written several scripts that parse the Unicode database and create separate data files for each category, property, script, block, etc. in Unicode. I’ve also created an HTTP API that allows you to programmatically get all code points (i.e. an array of numbers) in a given Unicode category, or all symbols (i.e. an array of strings for each character) with a given Unicode property, or a regular expression with that matches any symbols in a certain Unicode script.
For example, to get an array of strings that contains one item for each Unicode code point that has been assigned a symbol in Unicode v6.3.0, you could use the following URL:
http://mathias.html5.org/data/unicode/format?version=6.3.0&property=Assigned&type=symbols&prepend=window.symbols%20%3D%20&append=%3B
Note that you can prepend and append anything you like to the output by tweaking the URL parameters, to make it easier to reuse the data in your own scripts. An example HTML page that console.log()s all these symbols, as you requested, could be written as follows:
<!DOCTYPE html>
<meta charset="utf-8">
<title>All assigned Unicode v6.3.0 symbols</title>
<script src="http://mathias.html5.org/data/unicode/format?version=6.3.0&property=Assigned&type=symbols&prepend=window.symbols%20%3D%20&append=%3B"></script>
<script>
window.symbols.forEach(function(symbol) {
// Do what you want to do with `symbol` here, e.g.
console.log(symbol);
});
</script>
Demo. Note that since this is a lot of data, you can expect your DevTools console to become slow when opening this page.
Update: Nowadays, you should use Unicode data packages such as unicode-11.0.0 instead. In Node.js, you can then do the following:
const symbols = require('unicode-11.0.0/Binary_Property/Assigned/symbols.js');
console.log(symbols);
// Or, to get the code points:
require('unicode-11.0.0/Binary_Property/Assigned/code-points.js');
// Or, to get a regular expression that only matches these characters:
require('unicode-11.0.0/Binary_Property/Assigned/regex.js');
There is no direct way in JavaScript to find out whether a code point is assigned to a character or not, which appears to be the question here. You need information extracted from suitable sources, and this information needs to be updated whenever new characters are assigned in new versions of Unicode.
There are 1,114,112 code points in Unicode. The Unicode standard assigns to each code point the property gc, General Category. If the value of this property is anything but Cs, Co, or Cn, then the code point is assigned to a character. (Code points with gc equal to Co are Private Use code points, to which no character is assigned, but they may be used for characters by private agreements.)
What you would need to do is to get a copy of some relevant files in the Unicode character database (just a collection of files in specific formats, really) and write code that reads it and generates information about assigned code points. For the purposes of printing all Unicode characters, it might be best to generate the information as an array of ranges of assigned codepoints. And this would need to be repeated when the standard is updated with new characters.
Even the rest isn’t trivial. You would need to decide what it means to print a character. Some characters are control characters that may have an effect such as causing a newline, but lacking a visible glyph. Some (spaces) have empty glyphs. Some (combining marks) are meant to be rendered as marks attached to preceding character, though they have conventional renderings as “standalone” characters, too. Some are meant to take essentially different shapes depending on nearest context; they may have isolated forms, too, but just writing a character after another by no means guarantees that an isolated form is used.
Then there’s the problem of fonts. No single font can contain all Unicode characters, so you would need to find a collection of fonts that cover all of Unicode when used together, preferably so that they stylistically match somehow.
So if you are just looking for a compilation of all printable Unicode characters, consider using the Unicode code charts.
The trouble here is that Javascript is not, contrary to popular opinion, a Unicode environment.
Internally, it uses USC-2, an incompatible 16-bit encoding method that predates UTF16.
In addition, many of the unicode characters are not directly printable by themselves -- some of them are modifies for the previous characters -- for example the Spanish letter ñ can be written in unicode either as a single point -- that character -- or as two points -- n and ~
Here are a couple of resources that should really help you in understanding this:
http://mathiasbynens.be/notes/javascript-encoding
http://mathiasbynens.be/notes/javascript-unicode

Do browsers support different HTML5 pattern regexp features?

I had a simple RegEx pattern in a customer-facing payment form on our website:
<input type="text" pattern="(|\$)[0-9]*(|\.[0-9]{2})"
title="Please enter a valid number in the amount field" required>
It was added to help quickly notify customers when they fail to enter a valid number, before hitting the server-side validation.
After four customers called in complaining that they were unable to submit the form because their browser continually told them the amount they had entered was incorrect, I did some digging and discovered that IE10+ doesn't like the back of that expression--any amount entered that did not include a decimal point was accepted, anything with a decimal was rejected. The pattern works in my development environment (Chrome 30+) and in Opera 12, but Firefox 27 won't validate it at all.
I read the specs, which just says:
If specified, the attribute's value must match the JavaScript Pattern production. [ECMA262]
And since the only browsers that support pattern are capable of supporting ECMAScript 5, I figure this includes the full support of all Javascript regular expressions.
Where can I learn more about the quirks between pattern support in the different browsers?
The problem seems to an IE-only bug. Your link to the spec is pretty dead on, heres the bit IE is missing:
... except that the pattern attribute is matched against the entire value, not just any subset (somewhat as if it implied a ^(?: at the start of the pattern and a )$ at the end)
You can actually fix this bug by doing just that to your own pattern - namely:
^(?:(|\$)[0-9]*(|\.[0-9]{2}))$
This is working for me in IE9 and IE10, as well as Chrome. See updated fiddle
The technical reason this happens is a bit more complex:
If you read the EMCA 5.1 spec, in section 15.10.2.3, it talks about how alternations should be evaluated. Basically, each 'part' of the | is evaluated left to right, until one is found that matches. That value is assumed unless there is a problem in the 'sequel', in which case the other possibilities in the alternation are evaluated.
What it seems IE is doing is matching the beginning of your string using the empty parts of your alternations, and it works: \$[digits][empty] matches the start of $12.12 up to the decimal point. IE's regex engine (correctly) says that this is a match, because a substring matched, and it's not been told to check to the end of the string.
Once the regex engine (without the anchors to force the whole string to match) returns true, that there was a match, some engineer at Microsoft took a shortcut and told the pattern attribute to also check that the matched part equals the whole string, and there's where the failure comes from. The engine only matched part of the string, even though it could have matched more, so the secondary check fails, thinking there is extraneous input at the end.
This case is subtle, so I'm not too surprised it hasn't been caught before. I have created a bug report https://connect.microsoft.com/IE/feedback/details/836117/regex-bug-in-pattern-validator to see if there is a response from Microsoft.
The reason this relates to the EMCA spec is that if the engine was told to match the whole string, it would have backtracked when it hit the decimal and tried to match the 2nd part of the alternation, found and matched (\.[0-9{2}), and the whole thing would have worked.
Now, for some workarounds:
Add the anchors ^(?: and )$ to your patterns
Don't use empty alternations. Personally, I like using the optional $ instead for these cases. Your pattern becomes (\$?)[0-9]*(\.[0-9]{2})? and will work because ? is a greedy match, and the engine will consume the whole string if possible, rather than alternation, which is first match
Swap the order on your alternations. If the longer string is tested first, it will match first, and be used first. This has come up in other languages - Why order matters in this RegEx with alternation?
PS: Be careful with the * for your digits. Right now, "$" is a valid match because * allows for 0 digits. My recommendation for your full regex would be (\$)?(\d+)(\.\d{2})?

Javascript code analysis and constants

Given there is no cross browser const in Javascript and most of the work-arounds are more complex than I care for, I am just going to go with the naming convention of THIS_IS_A_CONSTANT. All well and good, but what occurred to me is that if there was way to get my IDE (VS.NET 2010 with Resharper 6) to give me a warning on any Javascript code that makes an assignment to a variable with that naming convention except in the variable declaration this would handle most of the potential issues around the lack of real constants in Javascript (at least for my needs).
So does anyone know of a good way to generate such warnings? In-IDE would be the best thing but other solutions are fine as well. I have looked for something like FX-Cop for Javascript; jslint doesn't seem to allow the creation of new rules but maybe I didn't look deep enough. I may also suggest this as a feature in Resharper (assuming I am not missing a way to make it do so already).
Thanks,
Matthew
So you want to find any assigment of the form:
id = exp ;
where id doesn't contain the substring CONSTANT and exp is a numeric constant?
Our Source Code Search Engine (SCSE) can do this pretty directly. The SCSE reads source code for a large set of files for many languages (including JavaScript), breaks it into tokens ignoring whitespace, and indexes it all to enable fast search for token sequences. Any hits are displayed in a hit window and can be clicked to see the actual file text in context.
Your particular query would be stated:
(I - I=*CONSTANT*) '=' N ( ';' | O | K | I)
This hunts for any assignment in which the target identifier doesn't contain the string constant (see wildcard stars around the match string), assigned a constant *N*umber is not followed by a ';' or an *O*perator, *K*word or *I*dentifier (all this extra stuff is because JavaScript might not have a semicolon to terminate the statement). It probably picks up some cases it should not but
these are easily inspected.

Is there any practical reason to use quoted strings for JSON keys?

According to Crockford's json.org, a JSON object is made up of members, which is made up of pairs.
Every pair is made of a string and a value, with a string being defined as:
A string is a sequence of zero or more
Unicode characters, wrapped in double
quotes, using backslash escapes. A
character is represented as a single
character string. A string is very
much like a C or Java string.
But in practice most programmers don't even know that a JSON key should be surrounded by double quotes, because most browsers don't require the use of double quotes.
Does it make any sense to bother surrounding your JSON in double quotes?
Valid Example:
{
"keyName" : 34
}
As opposed to the invalid:
{
keyName : 34
}
The real reason about why JSON keys should be in quotes, relies in the semantics of Identifiers of ECMAScript 3.
Reserved words cannot be used as property names in Object Literals without quotes, for example:
({function: 0}) // SyntaxError
({if: 0}) // SyntaxError
({true: 0}) // SyntaxError
// etc...
While if you use quotes the property names are valid:
({"function": 0}) // Ok
({"if": 0}) // Ok
({"true": 0}) // Ok
The own Crockford explains it in this talk, they wanted to keep the JSON standard simple, and they wouldn't like to have all those semantic restrictions on it:
....
That was when we discovered the
unquoted name problem. It turns out
ECMA Script 3 has a whack reserved
word policy. Reserved words must be
quoted in the key position, which is
really a nuisance. When I got around
to formulizing this into a standard, I
didn't want to have to put all of the
reserved words in the standard,
because it would look really stupid.
At the time, I was trying to convince
people: yeah, you can write
applications in JavaScript, it's
actually going to work and it's a good
language. I didn't want to say, then,
at the same time: and look at this
really stupid thing they did! So I
decided, instead, let's just quote the
keys.
That way, we don't have to tell
anybody about how whack it is.
That's why, to this day, keys are quoted in
JSON.
...
The ECMAScript 5th Edition Standard fixes this, now in an ES5 implementation, even reserved words can be used without quotes, in both, Object literals and member access (obj.function Ok in ES5).
Just for the record, this standard is being implemented these days by software vendors, you can see what browsers include this feature on this compatibility table (see Reserved words as property names)
Yes, it's invalid JSON and will be rejected otherwise in many cases, for example jQuery 1.4+ has a check that makes unquoted JSON silently fail. Why not be compliant?
Let's take another example:
{ myKey: "value" }
{ my-Key: "value" }
{ my-Key[]: "value" }
...all of these would be valid with quotes, why not be consistent and use them in all cases, eliminating the possibility of a problem?
One more common example in the web developer world: There are thousands of examples of invalid HTML that renders in most browsers...does that make it any less painful to debug or maintain? Not at all, quite the opposite.
Also #Matthew makes the best point of all in comments below, this already fails, unquoted keys will throw a syntax error with JSON.parse() in all major browsers (and any others that implement it correctly), you can test it here.
If I understand the standard correctly, what JSON calls "objects" are actually much closer to maps ("dictionaries") than to actual objects in the usual sense. The current standard easily accommodates an extension allowing keys of any type, making
{
"1" : 31.0,
1 : 17,
1n : "valueForBigInt1"
}
a valid "object/map" of 3 different elements.
If not for this reason, I believe the designers would have made quotes around keys optional for all cases (maybe except keywords).
YAML, which is in fact a superset of JSON, supports what you want to do. ALthough its a superset, it lets you keep it as simple as you want.
YAML is a breath of fresh air and it may be worth your time to have a look at it. Best place to start is here: http://en.wikipedia.org/wiki/YAML
There are libs for every language under the sun, including JS, eg https://github.com/nodeca/js-yaml

Categories

Resources