JS lexing---multi line string


I am making a JS lexer as part of my studies. In JS, single-line strings start with " or ' and end with the same character, unless that character is preceded by a backslash.
In my current code, I loop through every character and append it to the current token based on flags like "string" or "regex", so it feels natural to implement multi-line strings with " or ', because that does not seem to affect any other part of my lexer.
Is there any practical reason why a newline is not allowed in the contents of a string?

Many languages, but not all, prohibit unescaped newlines in string literals. So JavaScript is certainly not unique here.
But the motivation really has little to do with the ease, difficulty or efficiency of lexical analysis. In fact, for lexical analysis the simplest syntax is to allow any character rather than having to include special-case checks. [Note 1]
There are other considerations, though; notably, the importance of a program being readable and easy to debug. Long strings put an extra load on someone reading the code, because they may not be aware that a section of program text is actually part of a string literal. (There's a similar problem with multiline comments, which is why it's usually considered good style to mark every line in a long comment in some way, for example with a vertical column of stars at the left-hand margin. No such solution exists for string literals, though.)
Also, unterminated multiline strings can be annoying to correct. If strings cannot span lines, the error will be detected on the line containing the problem. But multiline strings might continue until the beginning of the next string, then trigger a syntax error when the contents of the next string are accidentally parsed as program text. Or worse, the result is a completely incorrect parse of what was supposed to be program text, followed by another incorrect string literal starting where the second literal ends, and continuing from there.
That also makes it hard for developer tools, such as editors and syntax highlighters, to deal with program text as it is being typed.
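To make the lexing side concrete, here is a minimal sketch of a scanner for quoted string literals. The name `scanStringLiteral` and its return shape are hypothetical and not taken from any real lexer. It treats only \n and \r as line terminators and does not handle a \r\n pair after a backslash. The point is just to show that the "no raw newline" rule is one extra branch, and that it lets the error be reported where the string started.

```javascript
// Hypothetical helper: scans a quoted string literal beginning at `start`.
// Returns { raw, end }, where `end` is the index just past the closing quote.
function scanStringLiteral(source, start) {
  var quote = source.charAt(start); // either " or '
  if (quote !== '"' && quote !== "'") {
    throw new SyntaxError('Expected a quote at index ' + start);
  }
  var i = start + 1;
  while (i < source.length) {
    var ch = source.charAt(i);
    if (ch === '\\') {
      // Skip the backslash and the single character it escapes
      // (this also covers a backslash followed by a newline).
      i += 2;
    } else if (ch === '\n' || ch === '\r') {
      // The only special case added by the "no raw newline" rule.
      throw new SyntaxError('Unterminated string literal starting at index ' + start);
    } else if (ch === quote) {
      return { raw: source.slice(start, i + 1), end: i + 1 };
    } else {
      i += 1;
    }
  }
  throw new SyntaxError('Unterminated string literal starting at index ' + start);
}
```

Without the newline branch, a missing closing quote would only be noticed at the next quote character or at the end of the file.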
In the end, you may or may not find these arguments compelling, and a language designer might have other aesthetic preferences as well. I can't really speak for the original designers of the JavaScript language, and neither of us can take a voyage in time to argue with them and maybe change their decision.
For better or worse, languages are designed according to particular subjective judgements, and if the language is successful these judgements become permanent features. They are things you have to accept if you are using a language and they're not usually worth obsessing about. You get used to them, or you find a different language to program in, with its own syntax quirks.
When you design your own language, you will need to resolve a large number of syntactic questions, and you will undoubtedly run into cases where the answer is not clearcut because there is no objectively correct unique solution. Whatever you do, someone will want to argue with you. Perhaps you can refer them to this answer.
Notes:
There is actually a historic reason for not allowing multiline string literals, which is much clearer but has been more or less irrelevant for several decades.
Once Upon A Time, common filesystems considered text files to be linear arrays of fixed-length lines (often 80 character lines, matching a Hollerith card). One advantage of such a filesystem is that it could instantly navigate to a particular line number in a file, since all lines were the same length. But in any case, for systems where programs were entered on punched cards, the fixed length lines were just part of the environment.
To make all lines the same length, lines needed to be filled out with space characters. This would obviously make multiline string literals awkward, and that's why C never allowed multiline string literals, instead relying on a syntactic feature where consecutive string literals are automatically concatenated into a single literal.
In the end, fixed-line-length filesystems proved to be unpopular, and I don't think you're likely to run into one these days. But a careful reading of the C and Posix standards shows that such filesystems must still be usable by conforming implementations, with the consequence that a fully portable program must be prepared to deal with line length limits on output and trailing whitespace on input.

There is also this syntax, where a backslash at the end of a line continues the string onto the next line:
const string =
'line1\
line2\
line3'
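A small check of what that syntax produces may help: the backslash and the line break after it are both dropped from the string value, so the result contains no newline. The variable names below are illustrative only.

```javascript
// A backslash immediately before the line break continues the literal
// on the next line; neither character ends up in the value.
var continued = 'line1\
line2\
line3';

// To get real newlines inside a quoted literal, use the \n escape instead.
var withNewlines = 'line1\nline2\nline3';

console.log(continued);                       // line1line2line3
console.log(continued.length);                // 15
console.log(withNewlines.split('\n').length); // 3
```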

Related

In a stringified array is it possible to differentiate between quotes that were in a string and those that surrounded the string itself?

Some Context:
• I'm still learning to code atm (started less than a year ago)
• I'm mostly self taught at that since I think my computer science class feels
too slow.
• The website I'm learning on is code.org, specifically in the "game lab"
• The site's coding environments only use ES5 because they don't want to
update them to ES6 or something like that
• In class we're making function libraries and while not required, I want
mine to be "highly usable," for lack of a better term, while also being
reasonably short (prefer not to automate things if I can get them done
quicker somehow, but that's just personal preference).
So now for where the actual question comes in: in a stringified array, is it possible to differentiate between a quotation mark that was inside a string and a quotation mark that actually denotes a string? Because I noticed something confusing with the output of JSON.parse(JSON.stringify()) on code.org, specifically, if you write something like,
JSON.parse(JSON.stringify(['hi","hi']))
the output will be ["hi","hi"] which looks just like an array containing two strings (on code.org it doesn't show the \'s), but still contains just one, which is fine unless you're using a regular expression to detect whether or not a match is within a string (if every quotation mark after the match has a "partner"), which is what I'm doing in 4 different functions. One flattens a list (since ES5 doesn't have Array.prototype.flat()), one removes all instances of the arguments from a list, one removes all instances of specified operand types, and one replaces all instances of an argument with the one that follows it.
Now I know the odds of a string containing an odd number of quotation marks (whether single or double) are likely extremely low, but it still bothers me that there is no way to differentiate between quotes formerly within a string and quotes which formerly denoted a string (in an array after it's been stringified), as these functions otherwise work exactly as intended. The regular expression I'm using to determine if there's an even number of quotes left in the stringified array is /(?=[^"]*(?:(?:"[^"]*){2})*$)/ where you put the match before the lookahead assertion and anything you absolutely want to follow before the first [^"]*.
To highlight the actual issue I'm trying to solve, this is my flatten function (since it's the shortest of the 4), and yeah, yeah, I know "eval bad" but it's extremely convenient to use here since it shortens the actual modification into a single line, and I highly doubt anyone's actually going to find a way to abuse it given its implementation ("this" needs to be an array for splice to work, so if I'm not mistaken, there isn't really a way to abuse it, but tell me if I'm wrong, since I probably am).
Array.prototype.flatten = function() {
eval(('this.splice(0,this.length,' + JSON.stringify(this).replace(/[\[\]](?=[^"]*(?:(?:"[^"]*){2})*$)/g, '') + ')').replace(/,(?=((,[^"]*(?:(?:"[^"]*){2})*)*.$))/g, ''));
return this;
};
This works really well outside of the previously specified conditions, but if I were to call it with something like [1,'"'] it'd find 3 quotation marks after the \[ and wouldn't be able to remove it but would be able to remove the \], thus when eval actually gets to .splice(), it would look like eval('this.splice(0,this.length,[1,"\"")') causing the error Unexpected token ')' to be thrown
Any help on this is appreciated, even if it's just telling me it isn't possible, thanks for reading my ramblings.
TL;DR: in a stringified array is it possible to differentiate between " and \" (string wrapping quotes of strings within a stringified array and quotes within a string within a stringified array) in a regular expression or any other method using only the tools available in ES5 (site I'm learning on doesn't want to update their project environments for whatever reason)

You are having a problem because your input is not a regular language: JSON is described by a context-free grammar, and it cannot be correctly parsed with regular expressions alone.
Can you explain why JSON.parse is unacceptable? It is even in ancient browsers and versions of node.js.
Someone writing a json parser might use bison or yacc, so if this is a learning experience consider playing with jison.
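If the goal is only to flatten nested arrays in ES5, one option that sidesteps the quoting problem entirely is to walk the array itself instead of editing its stringified form. This is a sketch of that alternative, not the asker's method; `flattenDeep` and `flattenInPlace` are made-up names.

```javascript
// Hypothetical ES5 helper: recursively collects every non-array item.
function flattenDeep(list, out) {
  out = out || [];
  for (var i = 0; i < list.length; i++) {
    if (Array.isArray(list[i])) {
      flattenDeep(list[i], out);
    } else {
      out.push(list[i]);
    }
  }
  return out;
}

// In-place variant that mirrors the splice-based version in the question.
function flattenInPlace(list) {
  var flat = flattenDeep(list);
  // ES5 spelling of list.splice(0, list.length, ...flat).
  Array.prototype.splice.apply(list, [0, list.length].concat(flat));
  return list;
}

console.log(JSON.stringify(flattenInPlace([1, '"', [2, ['hi","hi', [3]]]])));
// [1,"\"",2,"hi\",\"hi",3]
```

Because no text is ever re-parsed, quotation marks inside strings need no special treatment, and eval is not involved.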

I ended up finding a way to do this. For whatever reason (either I didn't notice last night because I was tired, or it legitimately changed overnight, though likely the former), I can now see the \" when viewing the value of the stringified array, and lo and behold, modifying the regular expression so that it ignores instances of \" resolved the issue.
New regular expression for quotation mark pair matching now reads:
// old even number of quotation marks after match check
/(?=[^"]*(?:(?:"[^"]*){2})*$)/
// new even number of quotation marks after match check
/(?=(\\"|[^"])*(?:(?:(?<!\\)"(\\"|[^"])*){2})*$)/
// (only real difference is that it accounts for the \)
Sorry for anyone who may have misunderstood the question due to how all over the place it was, I'm aware that I tend to end up writing a lot more than is necessary and it often leads to tangents that muddle my view of what I was initially asking, which in turn makes the point I'm actually trying to get across even harder to grasp at. Thanks to those who still tried to help me regardless of how much of a mess of a first question this was.
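One caveat on the new regular expression: the lookbehind `(?<!\\)` was only added to JavaScript in ES2018, so it may not be available in an ES5-only environment. As a hedged alternative, the sketch below answers the same question (is this position inside a string?) with a single left-to-right scan that tracks escapes. `isInsideString` is a hypothetical name, and the function assumes its input is valid JSON text such as the output of JSON.stringify.

```javascript
// Returns true if the character at `index` in `json` lies inside a
// string literal (between, but not including, the wrapping quotes).
function isInsideString(json, index) {
  var inString = false;
  for (var i = 0; i < index; i++) {
    var ch = json.charAt(i);
    if (inString && ch === '\\') {
      i++; // skip the escaped character, e.g. the " in \"
    } else if (ch === '"') {
      inString = !inString; // an unescaped quote opens or closes a string
    }
  }
  return inString;
}

var text = JSON.stringify([1, '"', ']']); // [1,"\"","]"]
console.log(isInsideString(text, 0));                 // false: the opening [
console.log(isInsideString(text, text.indexOf(']'))); // true: the ] inside "]"
console.log(isInsideString(text, text.length - 1));   // false: the closing ]
```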

Spaces required between keyword and literal

Looking at the output of UglifyJS2, I noticed that no spaces are required between literals and the in operator (e.g., 'foo'in{foo:'bar'} is valid).
Playing around with Chrome's DevTools, however, I noticed that hex and binary number literals require a space before the in keyword:
Internet Explorer returned true for all three tests, while Firefox 48.0.1 threw a SyntaxError for the first one (1in foo); however, it is okay with string literals ('1'in foo==true).
It seems that there should be no problem parsing JavaScript, allowing for keywords to be next to numeric literals, but I can't find any explicit rule in the ECMAScript specification (any of them).
Further testing shows that statements like for(var i of[1,2,3])... are allowed in both Chrome and Firefox (IE11 doesn't support for..of loops), and typeof"string" works in all three.
Which behavior is correct? Is it, in fact, defined somewhere that I missed, or are all these effects a result of idiosyncrasies of each browser's parser?

Not an expert - I haven't done a JS compiler, but have done others.
ecma-262.pdf is a bit vague, but it's clear that an expression such as 1 in foo should be parsed as 3 input elements, which are all tokens. Each token is a CommonToken (11.5); in this case, we get NumericLiteral, IdentifierName (yes, in is an IdentifierName), and IdentifierName. Exactly the same is true when parsing 0b1 in foo (see 11.8.3).
So, what happens when you take out the WS? It's not covered explicitly (as far as I can see), but it's common practice (in other languages) when writing a lexer to scan the longest character sequence that will match something you could potentially be looking for. The introduction to section 11 pretty much says exactly that:
The source text is scanned from left to right, repeatedly taking the
longest possible sequence of code points as the next input element.
So, for 0b1in foo the lexer goes through 0b1, which matches a numeric literal, and reaches i, giving 0b1i, which doesn't match anything. So it passes the longest match (0b1) to the rest of the parser as a token, and starts again at i. It finds n, followed by WS, so passes in as the second token, and so on.
So, basically, and rather bizarrely, it looks like IE is correct.

TL;DR
There would be no change to how code would be interpreted if whitespace weren't required in these circumstances, but it's part of the spec.
Looking at the source code of v8 that handles number literal parsing, it cites ECMA 262 § 7.8.3:
The source character immediately following a NumericLiteral must not be an IdentifierStart or DecimalDigit.
NOTE For example:
3in
is an error and not the two input elements 3 and in.
This section seems to contradict the introduction of section 7. However, it does not seem that there would be any problems with breaking that rule and allowing for 3in to be parsed. There are cases where allowing for no spaces between literals and identifiers would change how the source is parsed, but all cases merely change which errors are generated.
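The two readings can be compared with a toy scanner. The sketch below is not how V8 or any real engine is written; `tokenize` and its `enforceNumericRule` flag are invented for illustration, and it only understands decimal digits, simple identifiers, and whitespace. With the flag off it applies the longest-match rule alone; with it on, it also applies the restriction quoted from § 7.8.3.

```javascript
function tokenize(source, enforceNumericRule) {
  var tokens = [];
  var i = 0;
  while (i < source.length) {
    var rest = source.slice(i);
    var m;
    if ((m = /^\s+/.exec(rest))) {
      i += m[0].length; // whitespace separates tokens but is not one itself
    } else if ((m = /^[0-9]+/.exec(rest))) {
      var next = rest.charAt(m[0].length);
      // Optional rule: the character right after a numeric literal
      // must not be able to start an identifier.
      if (enforceNumericRule && /[A-Za-z_$]/.test(next)) {
        throw new SyntaxError('Identifier starts immediately after numeric literal ' + m[0]);
      }
      tokens.push({ type: 'number', text: m[0] });
      i += m[0].length;
    } else if ((m = /^[A-Za-z_$][A-Za-z0-9_$]*/.exec(rest))) {
      tokens.push({ type: 'name', text: m[0] });
      i += m[0].length;
    } else {
      throw new SyntaxError('Unexpected character ' + rest.charAt(0));
    }
  }
  return tokens;
}

console.log(tokenize('3in foo', false).map(function (t) { return t.text; }));
// [ '3', 'in', 'foo' ]
```

Calling `tokenize('3in foo', true)` throws instead, which is the behavior the quoted note describes.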

regex for matching finite-depth nested strings -- slow, crashy behavior

I was writing some regexes in my text editor (Sublime) today in an attempt to quickly find specific segments of source code, and it required getting a little creative because sometimes the function call might contain more function calls. For example I was looking for jQuery selectors:
$("div[class='should_be_using_dot_notation']");
$(escapeJQSelector("[name='crazy{"+getName(object)+"}']"));
I don't consider it unreasonable to expect one of my favorite powertools (regex) to help me do this sort of searching, but it's clear that the expression required to parse the second bit of code there will be somewhat complex as there are two levels of nested parens.
I am sufficiently versed in the theory to know that this sort of parsing is exactly what a context-free grammar parser is for, and that building out a regex is likely to suck up more memory and time (perhaps in an exponential rather than O(n^3) fashion). However I am not expecting to see that sort of feature available in my text editor or web browser any time soon, and I just wanted to squeak by with a big nasty regex.
Starting from this (This matches zero levels of nested parens, and no trivial empty ones):
\$\([^)(]+?\)
Here's what the one-level nested parens one I came up with looks like:
\$\(((\([^)(]*\))|[^)(])+?\)
Breaking it down:
\$\( begin text
( groups the contents of the $() call
(\( groups a level 1 nested pair of parens
[^)(]* only accept a valid pair of parens (it shall contain anything but parens)
\)) close level 1 nesting
| contents also can be
[^)(] anything else that also is not made of parens
)+? not sure if this should be plus or star or if can be greedy (the contents are made up of either a level 1 paren group or any other character)
\) end
This worked great! But I need one more level of nesting.
I started typing up the two-level nested expression in my editor and it began to pause for 2-3 seconds at a time when I put in *'s.
So I gave up on that and moved to regextester.com, and before very long at all, the entire browser tab was frozen.
My question is two-fold.
What's a good way of constructing an arbitrary-level regex? Is this something that only human pattern-recognition can ever hope to achieve? It seems to me that I can get a good deal of intuition for how to go about making the regex capable of matching two levels of nesting based on the similarities between the first two. I think this could just be distilled down into a few "guidelines".
Why does regex parsing on non-enormous regexes block or freeze for so long?
I understand the O(n) linear time is for n where n is length of input to run the regex over (i.e. my test strings). But in a system where it recompiles the regex each time I type a new character into it, what would cause it to freeze up? Is this necessarily a bug in the regex code (I hope not, I thought the Javascript regex impl was pretty solid)? Part of my reasoning moving to a different regex tester from my editor was that I'd no longer be running it (on each keypress) over all ~2000 lines of source code, but it did not prevent the whole environment from locking up as I edited my regex. It would make sense if each character changed in the regex would correspond to some simple transformation in the DFA that represents that expression. But this appears not to be the case. If there are certain exponential time or space consequences to adding a star in a regex, it could explain this super-slow-to-update behavior.
Meanwhile I'll just go work out the next higher nested regexes by hand and copy them in to the fields once i'm ready to test them...

Um. Okay, so nobody wants to write the answer, but basically the answer here is
Backtracking
It can cause exponential runtime when you do certain non-greedy things.
The answer to the first part of my question:
The two-nested expression is as follows:
\$\(((\(((\([^)(]*\))|[^)(])*\))|[^)(])*\)
The transformation to make the next nested expression is to replace instances of [^)(]* with ((\([^)(]*\))|[^)(])*, or, as a meta-regex (where the replace-with section does not need escaping):
s/\[^\)\(\]\*/((\([^)(]*\))|[^)(])*/
This is conceptually straightforward: In the expression matching N levels of nesting, if we replace the part that forbids more nesting with something that matches one more level of nesting then we get the expression for N+1 levels of nesting!
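That substitution can also be applied programmatically. The sketch below builds the pattern for a given depth by repeating the replacement; `buildNestedCallPattern` is a made-up name, and it uses non-capturing groups and `*` at every level, which differs slightly from the hand-written expressions above.

```javascript
// Builds a pattern for $( ... ) whose contents may nest parentheses
// up to `depth` levels deep.
function buildNestedCallPattern(depth) {
  var inner = '[^)(]*'; // level 0: no parentheses allowed inside
  for (var level = 0; level < depth; level++) {
    // Level N+1: either a parenthesised level-N body or one non-paren character.
    inner = '(?:\\(' + inner + '\\)|[^)(])*';
  }
  return new RegExp('\\$\\(' + inner + '\\)');
}

var call = '$(escapeJQSelector("[name=\'crazy{"+getName(object)+"}\']"))';
console.log(buildNestedCallPattern(1).test(call)); // false: needs two levels
console.log(buildNestedCallPattern(2).test(call)); // true
```

The two alternatives inside each group start with different characters (an opening paren versus anything that is not a paren), so a given character can only be consumed one way.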

To match an arbitrary number of nested (), with only one pair on each level of nesting, you could use the following, changing 2 to whatever number of nested () you require
/(?:\([^)(]*){2}(?:[^)(]*\)){2}/
To avoid excessive backtracking you want to avoid using nested quantifiers, particularly when the sub-pattern on both sides of an inner alternation is capable of matching the same substring.

Javascript code analysis and constants

Given there is no cross browser const in Javascript and most of the work-arounds are more complex than I care for, I am just going to go with the naming convention of THIS_IS_A_CONSTANT. All well and good, but what occurred to me is that if there was a way to get my IDE (VS.NET 2010 with Resharper 6) to give me a warning on any Javascript code that makes an assignment to a variable with that naming convention except in the variable declaration, this would handle most of the potential issues around the lack of real constants in Javascript (at least for my needs).
So does anyone know of a good way to generate such warnings? In-IDE would be the best thing but other solutions are fine as well. I have looked for something like FX-Cop for Javascript; jslint doesn't seem to allow the creation of new rules but maybe I didn't look deep enough. I may also suggest this as a feature in Resharper (assuming I am not missing a way to make it do so already).
Thanks,
Matthew

So you want to find any assignment of the form:
id = exp ;
where id doesn't contain the substring CONSTANT and exp is a numeric constant?
Our Source Code Search Engine (SCSE) can do this pretty directly. The SCSE reads source code for a large set of files for many languages (including JavaScript), breaks it into tokens ignoring whitespace, and indexes it all to enable fast search for token sequences. Any hits are displayed in a hit window and can be clicked to see the actual file text in context.
Your particular query would be stated:
(I - I=*CONSTANT*) '=' N ( ';' | O | K | I)
This hunts for any assignment in which the target identifier doesn't contain the string constant (see wildcard stars around the match string), assigned a constant *N*umber is not followed by a ';' or an *O*perator, *K*eyword or *I*dentifier (all this extra stuff is because JavaScript might not have a semicolon to terminate the statement). It probably picks up some cases it should not but these are easily inspected.
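For comparison, the check described in the question (flag assignments to ALL_CAPS names outside their declaration) can be roughed out in a few lines of JavaScript. This is not the SCSE query and not a rule from any real lint tool; `findConstantReassignments` is a hypothetical name, and the approach is a naive line-based regex rather than a parser.

```javascript
// Naive, line-based check: reports assignments to ALL_CAPS names that are
// not part of a `var` declaration. It does not parse JavaScript, so strings,
// comments, property names, and compound operators like += are not handled.
function findConstantReassignments(source) {
  var findings = [];
  var lines = source.split('\n');
  for (var n = 0; n < lines.length; n++) {
    // Optional leading "var ", then an ALL_CAPS name, then "=" but not "==".
    var pattern = /(var\s+)?\b([A-Z][A-Z0-9_]*)\s*=(?!=)/g;
    var m;
    while ((m = pattern.exec(lines[n])) !== null) {
      if (!m[1]) {
        findings.push({ line: n + 1, name: m[2] });
      }
    }
  }
  return findings;
}

var sample = [
  'var MAX_SIZE = 10;',
  'MAX_SIZE = 20;',
  'if (MAX_SIZE == 20) { total = 1; }'
].join('\n');
console.log(findConstantReassignments(sample));
// [ { line: 2, name: 'MAX_SIZE' } ]
```

As with the token search above, it will pick up some cases it should not, so the hits are meant to be inspected by hand.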

Can I depend on the behavior of charCodeAt() and fromCharCode() to remain the same?

I have written a personal web app that uses charCodeAt() to convert text that is input by the user into the relevant character codes (for example ⊇ is converted to 8839 for storage), which is then sent to Perl, which sends them to MySQL. To retrieve the input text, the app uses fromCharCode() to convert the numbers back to text.
I chose to do this because Perl's unicode support is very hard to deal with correctly. So Perl and MySQL only see numbers, which makes life a lot simpler.
My question is can I depend on fromCharCode() to always convert a number like 8834 to the relevant character? I don't know what standard it uses, but let's say it uses UTF-8, if it is changed to use UTF-16 in the future, this will obviously break my program if there is no backward compatibility.
I know that my ideas about these concepts aren't that clear, therefore please care to clarify if I've shown a misunderstanding.

fromCharCode and charCodeAt deal with Unicode code points, i.e. numbers between 0 and 65535 (0xffff), assuming all characters are in the Basic Multilingual Plane (BMP). Unicode and the code points are permanent, so you can trust them to remain the same forever.
Encodings such as UTF-8 and UTF-16 take a stream of code points (numbers) and output a byte stream. JavaScript is somewhat strange in that characters outside the BMP have to be constructed from two code units (a surrogate pair) passed to fromCharCode, according to UTF-16 rules. However, virtually every character you'll ever encounter (including Chinese, Japanese etc.) is in the BMP, so your program will work even if you don't handle these cases.
One thing you can do is convert the numbers back into bytes (in big-endian int16 format), and interpret the resulting text as UTF-16. The behavior of fromCharCode and charCodeAt is fixed in current JavaScript implementations and will not ever change.
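As a concrete illustration of the round trip described in the question, here is a minimal sketch. The helper names `toCharCodes` and `fromCharCodes` are invented for the example; each number is one UTF-16 code unit, which equals the code point for BMP characters.

```javascript
// Convert text to an array of UTF-16 code unit numbers.
function toCharCodes(text) {
  var codes = [];
  for (var i = 0; i < text.length; i++) {
    codes.push(text.charCodeAt(i));
  }
  return codes;
}

// Convert an array of code unit numbers back to text.
function fromCharCodes(codes) {
  // String.fromCharCode accepts any number of code units as arguments.
  return String.fromCharCode.apply(null, codes);
}

var stored = toCharCodes('A \u2287 B'); // \u2287 is the character ⊇
console.log(stored);                                   // [ 65, 32, 8839, 32, 66 ]
console.log(fromCharCodes(stored) === 'A \u2287 B');   // true
```

For very long inputs, passing every code unit through `apply` may run into engine argument limits, so a real implementation would probably convert in chunks.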

I chose to do this because Perl's unicode support is very hard to deal with correctly.
This is ɴᴏᴛ true!
Perl has the strongest Unicode support of any major programming language. It is much easier to work with Unicode if you use Perl than if you use any of C, C++, Java, C♯, Python, Ruby, PHP, or Javascript. This is not hyperbole and boosterism from uneducated, blind allegiance; it is a considered appraisal based on more than ten years of professional experience and study.
The problems encountered by naïve users are virtually always because they have deceived themselves about what Unicode is. The number-one worst brain-bug is thinking that Unicode is like ASCII but bigger. This is absolutely and completely wrong. As I wrote elsewhere:
It’s fundamentally and critically not true that Uɴɪᴄᴏᴅᴇ is just some enlarged character set relative to ᴀsᴄɪɪ. At most, that’s true of nothing more than the stultified ɪsᴏ‑10646. Uɴɪᴄᴏᴅᴇ includes much much more that just the assignment of numbers to glyphs: rules for collation and comparisons, three forms of casing, non-letter casing, multi-codepoint casefolding, both canonical and compatible composed and decomposed normalization forms, serialization forms, grapheme clusters, word- and line-breaking, scripts, numeric equivs, widths, bidirectionality, mirroring, print widths, logical ordering exclusions, glyph variants, contextual behavior, locales, regexes, multiple forms of combining classes, multiple types of decompositions, hundreds and hundreds of critically useful properties, and much much much more‼
Yes, that’s a lot, but it has nothing to do with Perl. It has to do with Unicode. That Perl allows you to access these things when you work with Unicode is not a bug but a feature. That those other languages do not allow you full access to Unicode can by no means be construed as a point in their favor: rather, those are all major bugs of the highest possible severity, because if you cannot work with Unicode in the 21st century, then that language is primitive, broken, and fundamentally useless for the demanding requirements of modern text processing.
Perl is not. And it is a gazillion times easier to do those things right in Perl than in those other languages; in most of them, you cannot even begin to work around their design flaws. You’re just plain screwed. If a language doesn’t provide full Unicode support, it is not fit for this century; discard it.
Perl makes Unicode infinitely easier than languages that don’t let you use Unicode properly can ever do.
In this answer, you will find at the front, Seven Simple Steps for dealing with Unicode in Perl, and at the bottom of that same answer, you will find some boilerplate code that will help. Understand it, then use it. Do not accept brokenness. You have to learn Unicode before you can use Unicode.
And that is why there is no simple answer. Perl makes it easy to work with Unicode, provided that you understand what Unicode really is. And if you’re dealing with external sources, you are going to have to arrange for that source to use some sort of encoding.
Also read up on all the stuff I said about 𝔸𝕤𝕤𝕦𝕞𝕖 𝔹𝕣𝕠𝕜𝕖𝕟𝕟𝕖𝕤𝕤. Those are things that you truly need to understand. Another brokenness issue that falls out of Rule #49 is that Javascript is broken because it doesn’t treat all valid Unicode code points in exactly the same way irrespective of their plane. Javascript is broken in almost all the other ways, too. It is unsuitable for Unicode work. Just Rule #34 will kill you, since you can’t get Javascript to follow the required standard about what things like \w are defined to do in Unicode regexes.
It’s amazing how many languages are utterly useless for Unicode. But Perl is most definitely not one of those!

In my opinion it won't break.
Read Joel Spolsky's article on Unicode and character encoding. Relevant part of the article is quoted below:
Every letter in every
alphabet is assigned a number by
the Unicode consortium which is
written like this: U+0639. This
number is called a code point. The U+
means "Unicode" and the numbers are
hexadecimal. The English letter A would
be U+0041.
It does not matter whether this magical number is encoded in utf-8 or utf-16 or any other encoding. The number will still be the same.

As pointed out in other answers, fromCharCode() and charCodeAt() deal with Unicode code points for any code point in the Basic Multilingual Plane (BMP). Strings in JavaScript are UCS-2 encoded, and any code point outside the BMP is represented as two JavaScript characters. None of these things are going to change.
To handle any Unicode character on the JavaScript side, you can use the following function, which will return an array of numbers representing the sequence of Unicode code points for the specified string:
var getStringCodePoints = (function() {
    function surrogatePairToCodePoint(charCode1, charCode2) {
        return ((charCode1 & 0x3FF) << 10) + (charCode2 & 0x3FF) + 0x10000;
    }
    // Read string in character by character and create an array of code points
    return function(str) {
        var codePoints = [], i = 0, charCode;
        while (i < str.length) {
            charCode = str.charCodeAt(i);
            // Code units in 0xD800-0xDFFF are surrogates: combine with the next unit
            if ((charCode & 0xF800) == 0xD800) {
                codePoints.push(surrogatePairToCodePoint(charCode, str.charCodeAt(++i)));
            } else {
                codePoints.push(charCode);
            }
            ++i;
        }
        return codePoints;
    };
})();
var str = "𝌆";
var codePoints = getStringCodePoints(str);
console.log(str.length); // 2
console.log(codePoints.length); // 1
console.log(codePoints[0].toString(16)); // 1d306

JavaScript strings are UTF-16; this isn't something that is going to change.
But don't forget that UTF-16 is variable length encoding.

In 2018, you can use String.prototype.codePointAt() and String.fromCodePoint().
These methods work even if a character is not in the Basic-Multilingual Plane(BMP).
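A short sketch of how these two methods behave for a character outside the BMP, using U+1D306 written as its surrogate pair so the example does not depend on the file's encoding. The variable names are illustrative.

```javascript
var symbol = '\uD834\uDF06'; // U+1D306, written as its surrogate pair

console.log(symbol.length);                            // 2 (UTF-16 code units)
console.log(symbol.charCodeAt(0).toString(16));        // d834 (high surrogate only)
console.log(symbol.codePointAt(0).toString(16));       // 1d306 (full code point)
console.log(String.fromCodePoint(0x1D306) === symbol); // true

// Array.from iterates a string by code point, so each entry is one character.
var symbolCodePoints = Array.from(symbol).map(function (ch) {
  return ch.codePointAt(0);
});
console.log(symbolCodePoints.length); // 1
```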
