Lexers which require parsing information - javascript

I am currently writing a little programming language and have come across a problem. In javascript template literals, we can embed arbitrary expressions, like:
let a = `hello ${ { a: 10, b: 15 } } world`
To properly lex the above snippet, the lexer needs to understand bracket matching (essentially parsing), since it can't just assume that the first } ends the embedded expression. How do lexers idiomatically solve this problem? One way is to check for proper bracket matching instead of treating the braces as simple operators, but I am not sure that is the best way. Looking into the code of some JavaScript lexers was not very helpful either.

The ECMAScript standard specifies (as a theoretical model) different "goal symbols" which are used in different syntactic contexts. Templated strings are one of the contexts with specific goal symbols.
That means that you need a lexical scanner which can switch states. Whether it does so itself, by duplicating part of the work of the parser, or as a result of a syntax action depends on the precise structure of the parsing architecture. You'll find implementations corresponding to both possibilities.
Putting this kind of logic into the parser is easier with predictive (top-down) parsers, such as a recursive descent parser. You could insert a call to the lexer's interface for changing states immediately after recognising the token which triggers the state change (backtick in your templated literal example). Or you could write the lexer interface so the "get a token" function also takes a lexical state argument; then your parser can maintain the lexical state. In effect, this last option is equivalent to using multiple lexical scanners, one for each state, which is also an attractive option, but it depends on separating the lexical scanner from the mechanism for reading input. (Personally, I favour this separation, but it's rarely discussed or implemented.)
Alternatively, you can use a bottom-up parser. In that case you need to be careful about synchronization between the parser's lookahead mechanism and the lexical scanner, since the interface between a bottom-up parser and a lexical scanner always allows the parser to read at least one token ahead, and it's possible that the state change needed to happen before that token was scanned. There are ways to handle this synchronization issue, but it's common for bottom-up parsers to put simple lexical state transitions into the lexer itself in order to avoid it. This necessarily involves a little duplication of effort between scanner and parser, but counting braces is not so complicated.
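For concreteness, here is a minimal sketch of a lexer that keeps a stack of lexical states plus a per-state brace count, so that a } inside ${ ... } is recognised as the end of the substitution rather than as an ordinary closing brace. It is illustrative only (ordinary tokens are elided) and is not taken from any real ECMAScript implementation.
function lex(src) {
    var tokens = [];
    var modes = ['code'];          // lexical states: 'code' or 'template'
    var braceDepth = [0];          // open-brace count for each 'code' state
    var i = 0;
    while (i < src.length) {
        var mode = modes[modes.length - 1];
        var ch = src[i];
        if (mode === 'template') {
            if (ch === '`') { tokens.push({ type: 'template-end' }); modes.pop(); i++; }
            else if (ch === '$' && src[i + 1] === '{') {
                tokens.push({ type: 'subst-start' });
                modes.push('code'); braceDepth.push(0); i += 2;
            } else {
                var start = i;
                while (i < src.length && src[i] !== '`' && !(src[i] === '$' && src[i + 1] === '{')) i++;
                tokens.push({ type: 'template-chars', value: src.slice(start, i) });
            }
        } else {                                   // 'code' state
            if (ch === '`') { tokens.push({ type: 'template-start' }); modes.push('template'); i++; }
            else if (ch === '{') { braceDepth[braceDepth.length - 1]++; tokens.push({ type: '{' }); i++; }
            else if (ch === '}') {
                if (braceDepth[braceDepth.length - 1] === 0 && modes.length > 1) {
                    // this '}' closes a ${...} substitution, not a block or object literal
                    tokens.push({ type: 'subst-end' }); modes.pop(); braceDepth.pop(); i++;
                } else { braceDepth[braceDepth.length - 1]--; tokens.push({ type: '}' }); i++; }
            } else {
                // ordinary tokens (identifiers, numbers, strings, operators) elided for brevity
                i++;
            }
        }
    }
    return tokens;
}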
If you're trying to use ECMAScript parsers as a source of inspiration, you need to be aware that there are many other complications with ECMAScript, particularly automatic semicolon insertion, which also involve coordination between lexer and parser. Solving those may impose other constraints on the overall architecture, and certainly makes the resulting parser code harder to read and understand.

Related

Javascript used to be a lisp? [duplicate]

A friend of mine drew my attention to the welcome message of the 4th European Lisp Symposium:
... implementation and application of any of the Lisp dialects, including Common Lisp, Scheme, Emacs Lisp, AutoLisp, ISLISP, Dylan, Clojure, ACL2, ECMAScript, ...
and then asked if ECMAScript is really a dialect of Lisp. Can it really be considered so? Why?
Is there a well defined and clear-cut set of criteria to help us detect whether a language is a dialect of Lisp? Or is being a dialect taken in a very loose sense (and in that case can we add Python, Perl, Haskell, etc. to the list of Lisp dialects?)
Brendan Eich wanted to do a Scheme-like language for Netscape, but reality intervened and he ended up having to make do with something that looked vaguely like C and Java for "normal" people, but which worked like a functional language.
Personally I think it's an unnecessary stretch to call ECMAScript "Lisp", but to each his own. The key characteristic of a real Lisp seems to be that data-structure notation and code notation are the same, and that's not true of ECMAScript (or Ruby or Python or any other dynamic functional language that's not Lisp).
Caveat: I have no Lisp credentials :-)
It's not. It's got a lot of functional roots, but so do plenty of other non-lisp languages nowadays, as you pointed out.
Lisps have one remaining characteristic that makes them Lisps, which is that Lisp code is written in terms of Lisp data structures (homoiconicity). This is what enables Lisp's powerful macro system, and why it looks so bizarre to non-Lispers. A function call is just a list, where the first element in the list is the name of the function.
Since Lisp code is just Lisp data, it's possible to do some extremely powerful metaprogramming that just can't be done in other languages. Many Lisps, even modern ones like Clojure, are largely implemented in themselves as a set of macros.
Even though I wouldn't call JavaScript a Lisp, it is, in my humble opinion, more akin to the Lisp way of doing things than most mainstream languages (even functional ones).
For one, just like Lisp, it's, in essence, a simple, imperative language based on the untyped lambda calculus that is fit to be driven by a REPL.
Second, it's easy to embed literal data (including code in the form of lambda expressions) in JavaScript, since a subset of it is equivalent to JSON. This is a common Lisp pattern.
Third, its model of values and types is very lispy. It's object-oriented in a broad sense of the word in that all values have a concept of identity, but it's not particularly object-oriented in most narrower senses of the word. Just as in Lisp, objects are typed and very dynamic. Code is usually split into units of functions, not classes.
In fact, there are a couple of (more or less) recent developments in the JavaScript world that make the language feel pretty lispy at times. Take jQuery, for example. Embedding CSS selectors as a sublanguage is a pretty Lisp-like approach, in my opinion. Or consider ECMAScript Harmony's metaobject protocol: It really looks like a direct port of Common Lisp's (much more so than either Python's or Ruby's metaobject systems!). The list goes on.
JavaScript does lack macros and a sensible implementation of a REPL with editor integration, which is unfortunate. Certainly, influences from other languages are very much visible as well (and not necessarily in a bad way). Still, there is a significant amount of cultural compatibility between the Lisp and JavaScript camps. Some of it may be coincidental (like the recent rise of JavaScript JIT compilation), some systematic, but it's definitely there.
If you call ECMAScript Lisp, you're basically asserting that any dynamic language is Lisp. Since we already have "dynamic language", you're reducing "Lisp" to a useless synonym for it instead of allowing it to have a more specific meaning.
Lisp should properly refer to a language with certain attributes.
A language is Lisp if:
Its source code is tree-structured data, which has a straightforward printed notation as nested lists. Every possible tree structure has a rendering in the corresponding notation and is susceptible to being given a meaning as a construct; the notation itself doesn't have to be extended to extend the language.
The tree-structured data is a principal data structure in the language itself, which makes programs susceptible to manipulation by programs.
The language has a symbol data type. Symbols have a printed representation which is interned: when two or more instances of the same printed notation for a symbol appear in the notation, they all denote the same object.
A symbol object's principal virtue is that it is different from all other symbols. Symbols are paired with various other entities in various ways in the semantics of Lisp programs, and thereby serve as names for those entities.
For instance, dialects of Lisp typically have variables, just like other languages. In Lisp, variables are denoted by symbols (the objects in memory) rather than by textual names. When part of a Lisp program defines some variable a, the syntax for that a is a symbol object and not the character string "a", which is just that symbol's name for the purposes of printing. A reference to the variable, the expression written as a elsewhere in the program, is also an object. Because of the way symbols work, it is the same object; this object sameness then connects the reference to the definition. Object sameness might be implemented as pointer equality at the machine level. We know that two symbol values are the same because they are pointers to the same memory location in the heap (an object of symbol type).
Case in point: the NewLisp dialect, which has non-traditional memory management for most data types, including nested lists, makes an exception for symbols by making them behave in the above way. Without this, it wouldn't be Lisp. Quote: "Objects in newLISP (excluding symbols and contexts) are passed by value copy to other user-defined functions. As a result, each newLISP object only requires one reference." [emphasis mine]. Passing symbols by value copy, too, would destroy their identity: a function receiving a symbol wouldn't be getting the original one, and therefore would not correctly receive its identity.
Compound expressions in a Lisp language—those which are not simple primaries like numbers or strings—consist of a simple list, whose first element is a symbol indicating the operation. The remaining elements, if any, are argument expressions. The Lisp dialect applies some sort of evaluation strategy to reduce the expression to a value, and evoke any side effects it may have.
I would tentatively argue that lists being made of binary cells that hold pairs of values, terminated by a special empty list object, probably should be considered part of the definition of Lisp: the whole business of being able to make a new list out of an existing one by "consing" a new item to the front, and the easy recursion on the "first" and "rest" of a list, and so on.
And then I would stop right there. Some people believe that Lisp systems have to be interactive: provide an environment with a listener, in which everything is mutable and can be redefined at any time, and so on. Some believe that Lisps have to have first-class functions: that there has to be a lambda operator and so on. Staunch traditionalists might even insist that there have to be car and cdr functions, the dotted pair notation supporting improper lists, and that lists have to be made up of cells and terminated by specifically the symbol nil, which denotes the empty list and also a Boolean false.
The more we shovel into the definition of "Lisp dialect", though, the more it becomes political; people get upset that their favorite dialect (perhaps which they created themselves) is being excluded on some technicality. Insisting on car and cdr allows Scheme to be a Lisp, but nil being the list terminator and false rules it out. What, Scheme not a Lisp?
So, based on the above, ECMAScript isn't a dialect of Lisp. However, an ECMAScript implementation contains functionality which can be exposed as a Lisp dialect, and numerous such dialects have been developed. Someone who wants ECMAScript to be considered a Lisp for emotional reasons should perhaps be content with that: the semantics to support Lisp is there, and just needs a suitable interface, which can be developed in ECMAScript and which can interoperate with ECMAScript code.
No it's not.
In order to be considered a Lisp, a language has to be homoiconic, which ECMAScript is not.
Not a 'dialect'. I learned LISP in the 70's and haven't used it since, but when I learned JavaScript recently I found myself thinking it was LISP-like. I think that's due to two factors: (1) JSON is a list-like associative structure, and (2) it seems as though JS 'objects' are essentially JSON. So even though you don't write JS programs in JSON as you would write LISP in lists, you kind of almost do.
So my answer is that there are enough similarities that programmers familiar with LISP will be reminded of it when they use JavaScript. Statements like JS = LISP in a Java suit are only expressing that feeling. I believe that's all there is to it.
Yes, it is. Quoting Crockford:
"JavaScript has much in common with Scheme. It is a dynamic language. It has a flexible datatype (arrays) that can easily simulate s-expressions. And most importantly, functions are lambdas.
Because of this deep similarity, all of the functions in [recursive programming primer] 'The Little Schemer' can be written in JavaScript."
http://www.crockford.com/javascript/little.html
On the subject of homoiconicity, I would recommend searching that word along with JavaScript. Saying that it is "not homoiconic" is true but not the end of the story.
I think that ECMAScript is a dialect of LISP in the same sense that English is a dialect of French. There are commonalities, but you'll have trouble with assignments in one armed only with knowledge of the other :)
I find it interesting that only one of the three keynote presentations highlighted for the 4th European Lisp Symposium directly concerns Lisp (the other two being about x86/JVM/Python and Scala).
"dialect" is definitely stretching it too far. Still, as someone who has learned and used Python, Javascript, and Scheme, Javascript clearly has a far Lisp-ier feel to it (and Coffeescript probably even more so) than Python.
As for why the European Lisp Symposium would want to portray JavaScript as a Lisp, obviously they want to piggyback on the popularity of JavaScript, whose programmer population is many, many times larger than that of all the other Lisp dialects in their list combined.

lightweight javascript to javascript parser

How would I go about writing a lightweight JavaScript-to-JavaScript parser? Something simple that can convert some snippets of code.
I would like to basically make the internal scope objects in functions public.
So something like this
var outer = 42;
window.addEventListener('load', function() {
    var inner = 42;
    function magic() {
        var in_magic = inner + outer;
        console.log(in_magic);
    }
    magic();
}, false);
Would compile to
__Scope__.set('outer', 42);
__Scope__.set('console', console);
window.addEventListener('load', constructScopeWrapper(__Scope__, function(__Scope__) {
    __Scope__.set('inner', 42);
    __Scope__.set('magic', constructScopeWrapper(__Scope__, function _magic(__Scope__) {
        __Scope__.set('in_magic', __Scope__.get('inner') + __Scope__.get('outer'));
        __Scope__.get('console').log(__Scope__.get('in_magic'));
    }));
    __Scope__.get('magic')();
}), false);
Demonstration Example
The motivation behind this is to serialize the state of functions and closures and keep them synchronized across different machines (client, server, multiple servers). For this I would need a representation of [[Scope]].
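As an aside, here is a hypothetical sketch of the runtime support that the compiled output above seems to assume: a Scope chain with get/set and the constructScopeWrapper helper. The names follow the question's example, but the implementation is guessed (and it ignores serialization and var hoisting entirely).
// Hypothetical runtime support for the compiled output; a sketch, not the asker's actual code.
function Scope(parent) {
    this.vars = {};
    this.parent = parent || null;
}
Scope.prototype.set = function (name, value) {
    // Walk outward so assignments reach the scope that holds the variable;
    // fall back to declaring it locally (a simplification of var semantics).
    for (var s = this; s; s = s.parent) {
        if (Object.prototype.hasOwnProperty.call(s.vars, name)) { s.vars[name] = value; return; }
    }
    this.vars[name] = value;
};
Scope.prototype.get = function (name) {
    for (var s = this; s; s = s.parent) {
        if (Object.prototype.hasOwnProperty.call(s.vars, name)) return s.vars[name];
    }
    throw new ReferenceError(name + ' is not defined');
};
// Returns a function that runs the body in a fresh child scope on every call.
function constructScopeWrapper(outerScope, body) {
    return function () {
        var childScope = new Scope(outerScope);
        return body.call(this, childScope);
    };
}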
Questions:
Can I do this kind of compiler without writing a full JavaScript -> (slightly different) JavaScript compiler?
How would I go about writing such a compiler?
Can I re-use existing js -> js compilers?
I don't think your task is easy or short given that you want to access and restore all the program state. One of the issues is that you might have to capture the program state at any moment during a computation, right? That means the example as shown isn't quite right; that captures state sort of before execution of that code (except that you've precomputed the sum that initializes magic, and that won't happen before the code runs for the original JavaScript). I assume you might want to capture the state at any instant during execution.
The way you've stated your problem, you want a JavaScript parser written in JavaScript.
I assume you are imagining that your existing JavaScript code J includes such a JavaScript parser and whatever else is necessary to generate your resulting code G, and that when J starts up it feeds copies of itself to G, manufacturing the serialization code S and somehow loading that up.
(I think G is pretty big and hoary if it can handle all of Javascript)
So your JavaScript image contains J, big G, S and does an expensive operation (feed J to G) when it starts up.
What I think might serve you better is a tool G that processes your original JavaScript code J offline, and generates program state/closure serialization code S (to save and restore that state) that can be added to/replace J for execution. J+S are sent to the client, who never sees G or its execution. This decouples the generation of S from the runtime execution of J, saving on client execution time and space.
In this case, you want a tool that will make generation of such code S easiest. A pure JavaScript parser is a start but isn't likely enough; you'll need symbol table support to know which function code is connected to a function call F(...), and which variable definition in which scope corresponds to assignments or accesses to a variable V. You may need to actually modify your original code J to insert points of access where the program state can be captured. You may need flow analysis to find out where some values went. Insisting that all of this be in JavaScript narrows your range of solutions.
For these tasks, you will likely find a program transformation tool useful. Such tools contain parsers for the language of interest, build ASTs representing the program, enable the construction of identifier-to-definition maps ("symbol tables"), can carry out modifications to the ASTs representing insertion of access points or synthesis of ASTs representing your demonstration example, and then regenerate valid JavaScript code containing the modified J and the additions S.
Of all the program transformation systems that I know about (which includes all the ones at the Wikipedia site), none are implemented in JavaScript.
Our DMS Software Reengineering Toolkit is such a program transformation system, offering all the features I just described. (Yes, it's big and hoary; it has to be to handle the complexities of real computer languages.) It has a JavaScript front end that contains a complete JavaScript parser to ASTs, and the machinery to regenerate JavaScript code from modified or synthesized ASTs. (Also big and hoary; good thing that hoary + hoary is still just hoary.) Should it be useful, DMS also provides support for building control and dataflow analysis.
If you want something with a simple interface, you could try node-burrito: https://github.com/substack/node-burrito
It generates an AST using the uglify-js parser and then recursively walks the nodes. All you have to do is give a single callback which tests each node. You can alter the ones you need to change, and it outputs the resulting code.
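A rough usage sketch, going from memory of the node-burrito README (so treat the exact API, node.name and node.wrap, as an assumption to verify against the project docs):
// Every AST node is passed to the callback, which can rewrite it in place.
var burrito = require('burrito');

var out = burrito('var inner = 42; magic();', function (node) {
    // node.name is the uglify-js node type, e.g. 'var', 'call', 'name'
    if (node.name === 'call') {
        node.wrap('trace(%s)');    // wrap every call expression in trace(...)
    }
});
console.log(out);                  // the rewritten source, as a string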
I'd try to look for an existing parser to modify. Perhaps you could adapt JSLint/JSHint?
There is a problem with the rewriting above: you're not hoisting the initialization of magic to the top of the scope.
There are a number of projects out there that parse JavaScript.
Crock's Pratt parser which works well on JavaScript that fits within "The good parts" and less well on other JS.
The es-lab parser based on ometa which handles the full grammar including a lot of corner cases that Crock's parser misses. It may not perform as well as Crock's.
narcissus parser and evaluator. I don't have much experience with this.
There are also a number of high-quality lexers for JavaScript that let you manipulate JS at the token level. This can be tougher than it sounds though since JavaScript is not lexically regular, and predicting semicolon insertion is difficult without a full parse.
My es5-lexer is a carefully constructed and efficient lexer for EcmaScript 5 that provides the ability to tokenize JavaScript. It is heuristic where JavaScript's grammar is not lexically regular, but the heuristic is very good. It also provides a means to transform a token stream so that an interpreter is guaranteed to interpret it the way the lexer interpreted the tokens; so if you don't trust your input, you can still be sure that the interpretation underlying the security transformations is sound, even if not correct according to the spec for some bizarre inputs.
Your problem seems to be in the same family of problems as what is solved by JS obfuscators and JS compressors -- they, as well as you, need to be able to parse and reformat the JS into an equivalent script.
There was a good discussion on obfuscators here, and a possible solution to your problem could be to leverage the parser and generator parts of one of the FOSS versions.
One callout: your example code does not take into account the scopes of the variables you want to set/get, and that will eventually become a problem you will have to solve.
Addition
Given the scope problem for closure-defined functions, you are unlikely to be able to solve this as a static parsing problem, as the scope variables outside the closure will have to be imported/exported to resolve, save, and re-instantiate scope. Hence you may need to dig into the evaluation engine itself, and perhaps take the V8 engine and hack the interpreter -- that is assuming that you do not need this to be generic across all script engines and that you can tie it down to a single implementation which you control.

Recursive Descent Parser for something simple?

I'm writing a parser for a templating language which compiles into JS (if that's relevant). I started out with a few simple regexes, which seemed to work, but regexes are very fragile, so I decided to write a parser instead. I started by writing a simple parser that remembered state by pushing/popping off of a stack, but things kept escalating until I had a recursive descent parser on my hands.
Soon after, I compared the performance of all my previous parsing methods. The recursive descent parser was by far the slowest. I'm stuck: Is it worth using a recursive descent parser for something simple, or am I justified in taking shortcuts? I would love to go the pure regex route, which is insanely fast (almost 3 times faster than the RD parser), but is very hacky and unmaintainable to a degree. I suppose performance isn't terribly important because compiled templates are cached, but is a recursive descent parser the right tool for every task? I guess my question could be viewed as more of a philosophical one: to what degree is it worth sacrificing maintainability/flexibility for performance?
Recursive descent parsers can be extremely fast.
These are usually organized with a lexer that uses regular expressions to recognize language tokens, which are fed to the parser. Most of the work in processing the source text is done character-by-character by the lexer, using the insanely fast FSAs that the REs are often compiled into.
The parser only sees tokens occasionally compared to the rate at which the lexer sees characters, so its speed often doesn't matter. However, when comparing parser-to-parser speeds and ignoring the time required to lex the tokens, recursive descent parsers can be very fast, because they implement the parser stack using function calls, which are already very efficient compared to a general parser's push-current-state-on-a-simulated-stack approach.
So, you can have your cake and eat it, too. Use regexps for the lexemes. Use the parser (any kind; recursive descent is just fine) to process lexemes. You should be pleased with the performance.
This approach also satisfies the observation made by other answers: write it in a way that makes it maintainable. Lexer/parser separation does this very nicely, I assure you.
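To make that split concrete, here is a small sketch in JavaScript: a regular expression does the character-level work of the lexer, and a tiny recursive descent parser consumes the resulting tokens. The grammar is plain arithmetic, purely for illustration.
function tokenize(src) {
    var re = /\s*(\d+|[()+\-*\/])/g, tokens = [], m;
    while ((m = re.exec(src)) !== null) tokens.push(m[1]);
    return tokens;
}

function parse(src) {
    var tokens = tokenize(src), pos = 0;
    function peek() { return tokens[pos]; }
    function next() { return tokens[pos++]; }
    function factor() {                     // factor := NUMBER | '(' expr ')'
        if (peek() === '(') { next(); var v = expr(); next(); return v; }
        return Number(next());
    }
    function term() {                       // term := factor (('*'|'/') factor)*
        var v = factor();
        while (peek() === '*' || peek() === '/') v = next() === '*' ? v * factor() : v / factor();
        return v;
    }
    function expr() {                       // expr := term (('+'|'-') term)*
        var v = term();
        while (peek() === '+' || peek() === '-') v = next() === '+' ? v + term() : v - term();
        return v;
    }
    return expr();
}

parse('2 * (3 + 4)');                       // 14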
Readability first, performance later...
So if your parser makes code more readable then it is the right tool
to what degree is it worth sacrificing maintainability/flexibility for performance?
I think it's very important to write clear, maintainable code as a first priority. Until your code not only shows itself to be a bottleneck but also measurably hurts your application's performance, you should consider clear code to be the best code.
It's also important not to reinvent the wheel. The comment about taking a look at another parser is a very good one; common solutions for writing routines such as this can often be found there.
Recursion is very elegant when applied to a suitable problem. In my own experience, slow code due to recursion is the exception, not the norm.
A Recursive Descent Parser should be faster
...or you're doing something wrong.
First off, your code should be broken into 2 distinct steps. Lexer + Parser.
Some reference examples online will tokenize the entire syntax first into a large intermediate data structure, then pass that along to the parser. While good for demonstration, don't do this; it doubles time and memory complexity. Instead, as soon as a match is determined by the lexer, notify the parser of either a state transition or a state transition + data.
As for the lexer: this is probably where you'll find your current bottleneck. If the lexer is cleanly separated from your parser, you can try swapping between Regex and non-Regex implementations to compare performance.
Regex isn't, by any means, faster than reading raw strings. It just avoids some common mistakes by default. Specifically, the unnecessary creation of string objects. Ideally, your lexer should scan your code and produce an output with zero intermediate data except the bare minimum required to track state within your parser. Memory-wise you should have:
Raw input (ie source)
Parser state (e.g. isExpression, isStatement, row, col)
Data (e.g. AST, tree, 2D array, etc.)
For instance, if your current lexer matches a non-terminal and copies every char over one by one until it reaches the next terminal, you're essentially recreating that string for every letter matched. Keep in mind that strings are immutable; concatenation always creates a new string. You should be scanning the text using pointer arithmetic or some equivalent.
To fix this problem, you need to scan from the startPos of the non-terminal to the end of the non-terminal and copy only when a match is complete.
Regex supports all of this by default out of the box, which is why it's a preferred tool for writing lexers. Instead of trying to write a Regex that parses your entire grammar, write one that focuses only on matching terminals and non-terminals as capture groups. Skip tokenization, and pass the results directly into your parser/state machine.
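As a sketch of that idea in JavaScript (using the sticky 'y' regex flag, an ES2015 feature): the lexer advances an index through the source, only materialises a token's text when a whole match completes, and hands each token straight to a callback instead of building a large intermediate array.
function lex(src, onToken) {
    var re = /(\s+)|(\d+)|([A-Za-z_]\w*)|(\{)|(\})|(\S)/y;   // the last group catches anything else
    var m;
    while (re.lastIndex < src.length && (m = re.exec(src)) !== null) {
        if (m[1]) continue;                                  // skip whitespace
        if (m[2]) onToken({ type: 'number', value: m[2] });
        else if (m[3]) onToken({ type: 'name', value: m[3] });
        else if (m[4]) onToken({ type: '{' });
        else if (m[5]) onToken({ type: '}' });
        else onToken({ type: 'op', value: m[6] });
    }
}

lex('base = 1 { x }', function (tok) { console.log(tok); });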
The key here is: don't try to use Regex as a state machine. At best it will only work for Regular (i.e. Chomsky Type III, no stack) declarative syntaxes -- hence the name Regular Expression. For example, HTML is a Context-Free (i.e. Chomsky Type II, stack based) declarative syntax, which is why a Regex alone is never enough to parse it. Your grammar, and generally all templating syntaxes, fall into this category. You've clearly hit the limit of Regex already, so you're on the right track.
Use Regex for tokenization only. If you're really concerned with performance, rewrite your lexer to eliminate any and all unnecessary string copying and/or intermediate data. See if you can outperform the Regex version.
The key point: the Regex version is easier to understand and maintain, whereas your hand-rolled lexer will likely be just a tinge faster if written correctly. Conventional wisdom says do yourself a favor and prefer the former. In terms of Big-O complexity, there shouldn't be any difference between the two; they're two forms of the same thing.

Syntax / Logical checker In Javascript?

I'm building a solution for a client which allows them to create very basic code.
Now I've done some basic syntax validation, but I'm stuck at variable verification.
I know JSLint does this using JavaScript, and I was wondering if anyone knew of a good way to do this.
So for example say the user wrote the code
moose = "barry"
base = 0
if(moose == "barry"){base += 100}
Then I'm trying to find a way to verify that the "if" expression is in the correct syntax, that the variable moose has been initialized, etc.,
but I want to do this without scanning character by character,
the code is a mini language built just for this application so is very very basic and doesn't need to manage memory or anything like that.
I had thought about splitting first by Carriage Return and then by Space but there is nothing to say the user won't write something like moose="barry" or if(moose=="barry")
and there is nothing to say the user won't keep the result of a condition inline.
Obviously compilers and interpreters do this on a much more extensive scale, but I'm not sure whether they do it character by character, and if they do, how they have optimized it.
(The other option is that I could send it back to PHP to process, which would relieve the browser of responsibility.)
Any suggestions?
Thanks
The use case is limited and the syntax will never be extended. The language is a simple scripting language that enables the client to create a unique cost based on their users' input; the end result will be processed by PHP regardless, to ensure the calculation can't be adjusted by the end user and to ensure there is some consistency.
So for example, say there is a base cost of £1.00
and there is a field on the form called "Additional Cost", the language will allow them to manipulate the base cost relative to the "additional cost" field.
So
base = 1;
if(additional > 100 && additional < 150){base += 50}
elseif(additional == 150){base *= 150}
else{base += additional;}
This is a basic example of how the language would be used.
Thank you for all your answers,
I've investigated a parser, and creating one would be far more complex than is required.
Having run several tests with thousands of lines of code, I found that processing character by character only takes a few seconds, even on a single-core P4 with 512 MB of memory (which is far less than the customer uses).
I've decided to build a PHP-based syntax checker which will check the information and convert the variables etc. into valid PHP code while it's checking (so that it's ready to be called later without recompilation). Using PHP instead of JavaScript seems more appropriate and will allow more complex code to arise without hindering the validation process.
It has only taken an hour, and I have code which is able to check the validity of an if statement and isn't confused by nested ifs, spaces or odd expressions. There is very little left to be checked, whereas a parser and a full-blown scripting language would have taken a lot longer.
You've all given me a lot to think about, and I've rated the relevant answers. Thank you.
If you really want to do this — and by that I mean if you really want your software to work properly and predictably, without a bunch of weird "don't do this" special cases — you're going to have to write a real parser for your language. Once you have that, you can transform any program in your language into a data structure. With that data structure you'll be able to conduct all sorts of analyses of the code, including procedures that at least used to be called use-definition and definition-use chain analysis.
If you concoct a "programming language" that enables some scripting in an application, then no matter how trivial you think it is, somebody will eventually write a shockingly large program with it.
I don't know of any readily-available parser generators that generate JavaScript parsers. Recursive descent parsers are not too hard to write, but they can get ugly to maintain and they make it a little difficult to extend the syntax (esp. if you're not very experienced crafting the original version).
You might want to look at JS/CC, which is a parser generator that generates a parser for a grammar, in JavaScript. You will need to figure out how to describe your language using BNF or EBNF. Also, JS/CC has its own syntax (which is somewhat close to actual BNF/EBNF) for specifying the grammar. Given the grammar, JS/CC will generate a parser for it.
Your other option, as Pointy said, is to write your own lexer and recursive-descent parser from scratch. Once you have a BNF/EBNF, it's not that hard. I recently wrote a parser from an EBNF in JavaScript (the grammar was pretty simple, so it wasn't that hard to write one; YMMV).
To address your comments about it being "client specific". I will also add my own experience here. If you're providing a scripting language and a scripting environment, there is no better route than an actual parser.
Handling special cases through a bunch of if-elses is going to be horribly painful and a maintenance nightmare. When I was a freshman in college, I tried to write my own language. This was before I knew anything about recursive-descent parsers, or just parsers in general. I figured out by myself that code can be broken down into tokens. From there, I wrote an extremely unwieldy parser using a bunch of if-elses, and also splitting the tokens by spaces and other characters (exactly what you described). The end result was terrible.
Once I read about recursive-descent parsers, I wrote a grammar for my language and easily created a parser in a tenth of the time it took me to write my original parser. Seriously, if you want to save yourself a lot of pain, write an actual parser. If you go down your current route, you're going to be fixing issues forever. You're going to have to handle cases where people put the space in the wrong place, or perhaps have one too many (or one too few) spaces. The only other alternative is to provide an extremely rigid structure (i.e., you must have exactly x spaces following this statement), which is liable to make your scripting environment extremely unattractive. An actual parser will automatically fix all these problems.
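Concretely, here is a rough sketch of that approach for the mini-language in the question. The grammar and the names (tokenize, validate) are guesses based on the examples given, not a drop-in solution: a regex tokenizer plus a small recursive descent checker that validates structure and flags variables used before they are assigned.
function tokenize(src) {
    var re = /\s*(?:(\d+)|("(?:[^"\\]|\\.)*")|([A-Za-z_]\w*)|(==|!=|>=|<=|\+=|\*=|&&|\|\||[=+\-*\/<>(){};])|(\S))/g;
    var tokens = [], m;
    while ((m = re.exec(src)) !== null) {
        if (m[5]) throw new Error('Unexpected character "' + m[5] + '"');
        if (m[1]) tokens.push({ type: 'num', value: m[1] });
        else if (m[2]) tokens.push({ type: 'str', value: m[2] });
        else if (m[3]) tokens.push({ type: 'name', value: m[3] });
        else tokens.push({ type: m[4] });
    }
    return tokens;
}

function validate(src) {
    var tokens = tokenize(src), pos = 0;
    var assigned = { additional: true };        // treat form fields as pre-declared (an assumption)
    function peek() { return tokens[pos] || { type: 'eof' }; }
    function take(type) {
        if (peek().type !== type) throw new Error('Expected "' + type + '" but found "' + peek().type + '"');
        return tokens[pos++];
    }
    function operand() {
        if (peek().type === 'num' || peek().type === 'str') { pos++; return; }
        var name = take('name').value;
        if (!assigned[name]) throw new Error('"' + name + '" used before it is assigned');
    }
    function expression() {                     // operand (operator operand)*
        operand();
        while (/^(==|!=|>=|<=|&&|\|\||[+\-*\/<>])$/.test(peek().type)) { pos++; operand(); }
    }
    function block() { take('{'); while (peek().type !== '}') statement(); take('}'); }
    function statement() {
        var t = peek();
        if (t.type === 'name' && (t.value === 'if' || t.value === 'elseif')) {
            pos++; take('('); expression(); take(')'); block();
            if (peek().value === 'elseif' || peek().value === 'else') statement();
            return;
        }
        if (t.type === 'name' && t.value === 'else') { pos++; block(); return; }
        var name = take('name').value;          // assignment: name (= | += | *=) expression ;?
        var op = peek().type;
        if (op !== '=' && op !== '+=' && op !== '*=') throw new Error('Expected an assignment to "' + name + '"');
        pos++; expression(); assigned[name] = true;
        if (peek().type === ';') pos++;
    }
    while (peek().type !== 'eof') statement();
    return true;
}

validate('base = 1; if(additional > 100 && additional < 150){base += 50} elseif(additional == 150){base *= 150} else{base += additional;}');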
Javascript has a function 'eval'.
var code = 'alert(1);';
eval(code);
It will show an alert. You can use 'eval' to execute basic code.

What functions a lexer needs to provide?

I am making a lexer; don't tell me not to, because I already did most of it.
Currently it makes an array of tokens and that's it.
I would like to know what functions the lexer needs to provide, and a brief explanation of what each function needs to do.
I'll accept the most complete list.
An example function would be:
next: Consume the current token and return it
Also, should the lexer have the expect function or should the interpreter implement it?
By the way, the lexer constructor accepts a string as an argument, performs the lexical analysis, and stores all the tokens in the "tokens" variable.
The language is javascript, so I can't overload operators.
In my experience, you need:
nextToken — move forward in the input and get the next token.
curToken — return the current token; don't move
curValue — tokens like STRING and NUMBER have values; tokens like SEMICOLON don't
sourcePos — return the source position (line number, character position) of the first character of the current token
edit — oh also:
prefetch — initialize the lexer by getting the first token.
Additionally, for some languages you might want 2 or more tokens of lookahead. Then you'd want a variation on plain curToken so that you can look at a bigger "window" on the token stream. For most languages that's not really necessary however.
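As a rough illustration of that interface in JavaScript (the scan helper and the token shape are stand-ins of my own, not part of the answer):
function scan(source) {
    // trivial stand-in tokenizer (single-line input only); a real one would mirror your existing lexer
    var re = /(\d+)|([A-Za-z_]\w*)|(\S)/g, tokens = [], m;
    while ((m = re.exec(source)) !== null) {
        tokens.push({
            type: m[1] ? 'NUMBER' : m[2] ? 'NAME' : 'PUNCT',
            value: m[0],
            line: 1,
            col: m.index + 1
        });
    }
    return tokens;
}

function Lexer(source) {
    var tokens = scan(source);
    var index = 0;
    this.prefetch  = function ()  { index = 0; };                          // position on the first token
    this.curToken  = function ()  { return tokens[index]; };
    this.curValue  = function ()  { var t = tokens[index]; return t && t.value; };
    this.sourcePos = function ()  { var t = tokens[index]; return t && { line: t.line, col: t.col }; };
    this.nextToken = function ()  { return tokens[++index]; };             // move forward, return the new current token
    this.peek      = function (k) { return tokens[index + (k || 1)]; };    // optional lookahead
}

var lx = new Lexer('a = 1');
lx.prefetch();
lx.curToken();    // { type: 'NAME', value: 'a', line: 1, col: 1 }
lx.nextToken();   // { type: 'PUNCT', value: '=', line: 1, col: 3 }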
edit again — also I won't tell you not to write one because they're basically the funnest things ever. In javascript you can't get too crazy, but in a language like Erlang you can have your lexer act like a "token pump" by making it generate a stream of tokens it sends to a separate parser process.
You should be able to compile a comprehensive list by writing a program that uses your lexer, and implementing the functions you end up needing.
Think a second time about what you're asking: "what functions the lexer needs to provide"
What it it "needs" depends of course on what you need, not what it needs. We will probably be able to give you better aid if you explain your own needs. But well, here's a shot anyway:
A minimal one would consist of a single function that takes a string as an argument and returns a list of strings (or an iterator over strings if you want to be fancy and deferred). That's enough for many use-cases and hence is what a lexer "needs".
A more descriptive one could return more complex objects than strings, containing further information about each token (such as its position in the original string, so that you'll be able to tell the poor programmer with syntax errors in his code where he should look). You can probably come up with lots of metadata to add besides line numbers, but once again, it all depends on your needs.
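For example, a minimal sketch of that richer variant (the token shape here is just one possibility):
function tokenize(src) {
    var re = /(\n)|([ \t\r]+)|(\d+)|([A-Za-z_]\w*)|(\S)/g;
    var tokens = [], line = 1, lineStart = 0, m;
    while ((m = re.exec(src)) !== null) {
        if (m[1]) { line++; lineStart = re.lastIndex; continue; }   // newline: bump the line counter
        if (m[2]) continue;                                         // other whitespace: skip
        tokens.push({
            text: m[0],
            type: m[3] ? 'number' : m[4] ? 'name' : 'punct',
            line: line,
            col: m.index - lineStart + 1
        });
    }
    return tokens;
}

tokenize('foo = 42\nbar');
// [{text:'foo', type:'name', line:1, col:1}, {text:'=', ...}, {text:'42', ...}, {text:'bar', type:'name', line:2, col:1}]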
