I am making a lexer. Don't tell me not to, because I have already done most of it.
Currently it makes an array of tokens and that's it.
I would like to know what functions the lexer needs to provide, with a brief explanation of what each function needs to do.
I'll accept the most complete list.
An example function would be:
next: Consume the current token and return it
Also, should the lexer have the expect function or should the interpreter implement it?
By the way, the lexer constructor accepts a string as its argument, performs the lexical analysis, and stores all the tokens in the "tokens" variable.
The language is javascript, so I can't overload operators.
In my experience, you need:
nextToken — move forward in the input and get the next token.
curToken — return the current token; don't move
curValue — tokens like STRING and NUMBER have values; tokens like SEMICOLON don't
sourcePos — return the source position (line number, character position) of the first character of the current token
edit — oh also:
prefetch — initialize the lexer by getting the first token.
Additionally, for some languages you might want 2 or more tokens of lookahead. Then you'd want a variation on plain curToken so that you can look at a bigger "window" on the token stream. For most languages that's not really necessary however.
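Here's a rough sketch of how those functions could sit on top of a token array that has already been built, as in the question. The class name TokenStream, the peek helper for extra lookahead, and the token shape ({ type, value, line, column }) are my own assumptions, not anything a lexer is required to look like.

```javascript
// Hypothetical wrapper around a token array that was built up front.
// Each token is assumed to look like { type, value?, line, column }.
class TokenStream {
  constructor(tokens) {
    this.tokens = tokens;
    this.index = 0; // "prefetch" is implicit: the first token is already current
  }

  // Return the token `offset` positions ahead without moving (null past the end).
  peek(offset) {
    const tok = this.tokens[this.index + offset];
    return tok === undefined ? null : tok;
  }

  // Return the current token; don't move.
  curToken() {
    return this.peek(0);
  }

  // Move forward and return the new current token (null at end of input).
  nextToken() {
    if (this.index < this.tokens.length) this.index += 1;
    return this.curToken();
  }

  // Value of the current token, or undefined for tokens like SEMICOLON.
  curValue() {
    const tok = this.curToken();
    return tok !== null && "value" in tok ? tok.value : undefined;
  }

  // Position of the first character of the current token.
  sourcePos() {
    const tok = this.curToken();
    return tok === null ? null : { line: tok.line, column: tok.column };
  }
}
```

With peek(1), peek(2) and so on you get the bigger "window" on the token stream without a separate function per lookahead distance.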
edit again — also I won't tell you not to write one because they're basically the funnest things ever. In javascript you can't get too crazy, but in a language like Erlang you can have your lexer act like a "token pump" by making it generate a stream of tokens it sends to a separate parser process.
You should be able to compile a comprehensive list by writing a program that uses your lexer, and implementing the functions you end up needing.
Think a second time about what you're asking: "what functions the lexer needs to provide"
What it "needs" depends of course on what you need, not what it needs. We will probably be able to give you better aid if you explain your own needs. But well, here's a shot anyway:
A minimal one would consist of a single function that takes a string as an argument and returns a list of strings (or an iterator over strings if you want to be fancy and deferred). That's enough for many use-cases and hence is what a lexer "needs".
A more descriptive one could return more complex objects than strings, containing further information about each token (such as its position in the original string for example, so that you'll be able to tell the poor programmer with syntax errors in his code where he should look). You can probably come up with lots of meta data to add in there besides line numbers, but once again, it all depends on your needs.
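As a minimal sketch of the "more descriptive" variant, something like the following would do. The function name tokenize, the very crude token pattern (identifiers, integers, and any other single non-space character) and the { text, line, column } shape are assumptions made up for the example.

```javascript
// Split a string into tokens that remember where they came from.
function tokenize(source) {
  const tokens = [];
  // Whitespace runs, identifiers, integers, or any other single character.
  const pattern = /\s+|[A-Za-z_]\w*|\d+|\S/g;
  let line = 1;
  let lineStart = 0; // index of the first character of the current line
  let m;
  while ((m = pattern.exec(source)) !== null) {
    const text = m[0];
    if (/^\s/.test(text)) {
      // Whitespace produces no token, but newlines advance the line counter.
      for (let i = 0; i < text.length; i++) {
        if (text[i] === "\n") {
          line += 1;
          lineStart = m.index + i + 1;
        }
      }
      continue;
    }
    tokens.push({ text: text, line: line, column: m.index - lineStart + 1 });
  }
  return tokens;
}
```

Dropping the line and column fields gives you back the minimal "list of strings" version.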
I am currently writing a little programming language and have come across a problem. In javascript template literals, we can embed arbitrary expressions, like:
let a = `hello ${ { a: 10, b: 15 } } world`
To properly lex the above-given snippet, the lexer needs to understand bracket matching (essentially parsing), as it can't just assume the first } to be the end of the embedded expression. How do lexers idiomatically solve this problem? One way is to check for proper bracket matching instead of treating them as simple operators, but I am not sure it is the best way. Looking into the code of some javascript lexers also was not very helpful.
The ECMAScript standard specifies (as a theoretical model) different "goal symbols" which are used in different syntactic contexts. Templated strings are one of the contexts with specific goal symbols.
That means that you need a lexical scanner which can switch states. Whether it does so itself, by duplicating part of the work of the parser, or as a result of a syntax action depends on the precise structure of the parsing architecture. You'll find implementations corresponding to both possibilities.
Putting this kind of logic into the parser is easier with predictive (top-down) parsers, such as a recursive descent parser. You could insert a call to the lexer's interface for changing states immediately after recognising the token which triggers the state change (backtick in your templated literal example). Or you could write the lexer interface so the "get a token" function also takes a lexical state argument; then your parser can maintain the lexical state. In effect, this last option is equivalent to using multiple lexical scanners, one for each state, which is also an attractive option, but it depends on separating the lexical scanner from the mechanism for reading input. (Personally, I favour this separation, but it's rarely discussed or implemented.)
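To make the "lexical state argument" option concrete, here is a small sketch in which the caller (the parser) passes the state on every call. Everything in it is an assumption for illustration: the names createScanner and next, the two states "code" and "template", the token types, and the tiny set of code tokens it recognises.

```javascript
// Hypothetical scanner: the parser passes the lexical state on every call.
function createScanner(source) {
  let pos = 0;

  function next(state) {
    if (pos >= source.length) return { type: "EOF", text: "" };

    if (state === "template") {
      if (source[pos] === "`") {
        pos += 1;
        return { type: "TEMPLATE_END", text: "`" };
      }
      if (source.startsWith("${", pos)) {
        pos += 2;
        return { type: "SUBST_START", text: "${" };
      }
      // Raw characters up to the next "${" or closing backtick.
      const start = pos;
      while (pos < source.length && source[pos] !== "`" && !source.startsWith("${", pos)) {
        pos += source[pos] === "\\" ? 2 : 1; // step over escaped characters
      }
      return { type: "CHARS", text: source.slice(start, pos) };
    }

    // state === "code": skip whitespace, then read one (very simplified) token.
    const rest = source.slice(pos);
    const ws = /^\s+/.exec(rest);
    if (ws) {
      pos += ws[0].length;
      return next(state);
    }
    const m = /^(?:[A-Za-z_$][\w$]*|\d+|[`{}:,=])/.exec(rest);
    if (!m) throw new SyntaxError("Unexpected character at index " + pos);
    pos += m[0].length;
    return { type: m[0] === "`" ? "TEMPLATE_START" : "CODE", text: m[0] };
  }

  return { next: next };
}
```

A recursive descent parser would call next("template") after seeing TEMPLATE_START, switch to next("code") while it parses the expression after SUBST_START, and go back to next("template") once that expression's closing } has been consumed. The scanner itself never counts braces.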
Alternatively, you can use a bottom-up parser. In that case you need to be careful with synchronization between the parser's lookahead mechanism and the lexical scanner, since the interface between a bottom-up parser and a lexical scanner always allows the parser to read at least one token ahead, and it's possible that the state change needed to be done before that token was scanned. There are ways to handle this synchronization issue, but it's common for bottom-up parsers to put simple lexical state transitions into the lexer in order to avoid this issue. This necessarily involves a little duplication of effort between scanner and parser but counting braces is not so complicated.
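The brace-counting variant, where the lexer handles the state transitions itself, could look roughly like this. It is only a sketch under assumptions: the function name, the token types and the handful of code tokens are invented for the example, and it handles nothing beyond what the snippet in the question needs.

```javascript
// The lexer keeps its own stack: "template" while inside backticks, or a
// number counting the plain "{" currently open inside a "${ ... }".
function tokenizeWithBraceCounting(source) {
  const tokens = [];
  const stack = [];
  let pos = 0;
  const push = function (type, text) {
    tokens.push({ type: type, text: text });
    pos += text.length;
  };

  while (pos < source.length) {
    const top = stack[stack.length - 1];
    if (top === "template") {
      if (source[pos] === "`") {
        push("TEMPLATE_END", "`");
        stack.pop();
      } else if (source.startsWith("${", pos)) {
        push("SUBST_START", "${");
        stack.push(0);
      } else {
        let end = pos;
        while (end < source.length && source[end] !== "`" && !source.startsWith("${", end)) {
          end += source[end] === "\\" ? 2 : 1; // step over escaped characters
        }
        push("CHARS", source.slice(pos, end));
      }
      continue;
    }
    const ch = source[pos];
    if (/\s/.test(ch)) {
      pos += 1;
    } else if (ch === "`") {
      push("TEMPLATE_START", "`");
      stack.push("template");
    } else if (ch === "{") {
      if (typeof top === "number") stack[stack.length - 1] += 1;
      push("CODE", "{");
    } else if (ch === "}" && top === 0) {
      push("SUBST_END", "}"); // this brace closes the "${", not an object literal
      stack.pop();
    } else if (ch === "}") {
      if (typeof top === "number") stack[stack.length - 1] -= 1;
      push("CODE", "}");
    } else {
      const m = /^(?:[A-Za-z_$][\w$]*|\d+|[:,=])/.exec(source.slice(pos));
      if (!m) throw new SyntaxError("Unexpected character at index " + pos);
      push("CODE", m[0]);
    }
  }
  if (stack.length > 0) throw new SyntaxError("Unterminated template literal");
  return tokens;
}

const sampleTokens = tokenizeWithBraceCounting("let a = `hello ${ { a: 10, b: 15 } } world`");
```

For the snippet from the question, the first } comes out as an ordinary CODE token and only the second one becomes SUBST_END.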
If you're trying to use ECMAScript parsers as a source of inspiration, you need to be aware that there are many other complications with ECMAScript, particularly automatic semicolon insertion, which also involve coordination between lexer and parser. Solving those may impose other constraints on the overall architecture, and certainly makes the resulting parser code harder to read and understand.
Some Context:
• I'm still learning to code atm (started less than a year ago)
• I'm mostly self taught at that since I think my computer science class feels
too slow.
• The website I'm learning on is code.org, specifically in the "game lab"
• The site's coding environments only use ES5 because they don't want to
update them to ES6 or something like that
• In class we're making function libraries and while not required, I want
mine to be "highly usable," for lack of a better term, while also being
reasonably short (prefer not to automate things if I can get them done
quicker somehow, but that's just personal preference).
So now for where the actual question comes in: in a stringified array, is it possible to differentiate between a quotation mark that was inside a string and a quotation mark that actually denotes a string? Because I noticed something confusing with the output of JSON.parse(JSON.stringify()) on code.org, specifically, if you write something like,
JSON.parse(JSON.stringify(['hi","hi']))
the output will be ["hi","hi"] which looks just like an array containing two strings (on code.org it doesn't show the \'s), but still contains just one, which is fine unless you're using a regular expression to detect whether or not a match is within a string (if every quotation mark after the match has a "partner"), which is what I'm doing in 4 different functions. One flattens a list (since ES5 doesn't have Array.prototype.flat()), one removes all instances of the arguments from a list, one removes all instances of specified operand types, and one replaces all instances of an argument with the one that follows it.
Now I know the odds of a string containing an odd number of quotation marks (whether single or double) are likely extremely low, but it still bothers me that I don't have a way to differentiate between quotes formerly within a string and quotes which formerly denoted a string (in an array after it's been stringified), as these functions otherwise function exactly as intended. The regular expression I'm using to determine if there's an even number of quotes left in the stringified array is /(?=[^"]*(?:(?:"[^"]*){2})*$)/ where you put the match before the lookahead assertion and anything you absolutely want to follow before the first [^"]*.
To highlight the actual issue I'm trying to solve, this is my flatten function (since it's the shortest of the 4), and yeah, yeah, I know "eval bad" but it's extremely convenient to use here since it shortens the actual modification into a single line, and I highly doubt anyone's actually going to find a way to abuse it given its implementation ("this" needs to be an array for splice to work, so if I'm not mistaken, there isn't really a way to abuse it, but tell me if I'm wrong, since I probably am).
Array.prototype.flatten = function() {
eval(('this.splice(0,this.length,' + JSON.stringify(this).replace(/[\[\]](?=[^"]*(?:(?:"[^"]*){2})*$)/g, '') + ')').replace(/,(?=((,[^"]*(?:(?:"[^"]*){2})*)*.$))/g, ''));
return this;
};
This works really well outside of the previously specified conditions, but if I were to call it with something like [1,'"'] it'd find 3 quotation marks after the \[ and wouldn't be able to remove it but would be able to remove the \], thus when eval actually gets to .splice(), it would look like eval('this.splice(0,this.length,[1,"\"")') causing the error Unexpected token ')' to be thrown
Any help on this is appreciated, even if it's just telling me it isn't possible, thanks for reading my ramblings.
TL;DR: in a stringified array is it possible to differentiate between " and \" (string wrapping quotes of strings within a stringified array and quotes within a string within a stringified array) in a regular expression or any other method using only the tools available in ES5 (site I'm learning on doesn't want to update their project environments for whatever reason)
You are having a problem because your input is not a regular language (JSON nests arbitrarily deep, which makes it context-free) and cannot be correctly parsed with regular expressions.
Can you explain why JSON.parse is unacceptable? It is even in ancient browsers and versions of node.js.
Someone writing a json parser might use bison or yacc, so if this is a learning experience consider playing with jison.
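If the goal is only to tell string-delimiting quotes from escaped ones in JSON.stringify output, a single left-to-right pass that tracks "am I inside a string?" is enough, and it stays within ES5. This is a sketch; the function name structuralQuoteIndices is made up for the example.

```javascript
// Returns the indices of the quotation marks that open or close a string in
// JSON text. Escaped quotes (\") inside strings are skipped.
function structuralQuoteIndices(json) {
  var indices = [];
  var inString = false;
  for (var i = 0; i < json.length; i++) {
    var ch = json.charAt(i);
    if (inString && ch === '\\') {
      i++; // skip the escaped character, whatever it is
    } else if (ch === '"') {
      indices.push(i);
      inString = !inString;
    }
  }
  return indices;
}

// JSON.stringify(['hi","hi']) is the 13-character text ["hi\",\"hi"]
var quoteIndices = structuralQuoteIndices(JSON.stringify(['hi","hi']));
```

Here quoteIndices holds only the positions of the first and last quotation mark; the two escaped ones in the middle are ignored.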
I ended up finding a way to do this. For whatever reason (either I didn't notice last night because I was tired, or it legitimately changed overnight, though likely the former), I can now see the \" when viewing the value of the stringified array, and lo and behold, modifying the regular expression so that it ignored instances of \" resolved the issue.
New regular expression for quotation mark pair matching now reads:
// old even number of quotation marks after match check
/(?=[^"]*(?:(?:"[^"]*){2})*$)/
// new even number of quotation marks after match check
/(?=(\\"|[^"])*(?:(?:(?<!\\)"(\\"|[^"])*){2})*$)/
// (only real difference is that it accounts for the \)
Sorry for anyone who may have misunderstood the question due to how all over the place it was, I'm aware that I tend to end up writing a lot more than is necessary and it often leads to tangents that muddle my view of what I was initially asking, which in turn makes the point I'm actually trying to get across even harder to grasp at. Thanks to those who still tried to help me regardless of how much of a mess of a first question this was.
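For the flatten case specifically, the stringify/regex/eval round trip can be avoided altogether with plain recursion, which also works in ES5 (Array.isArray is part of ES5). This is a hedged alternative sketch, not a drop-in copy of the function above; it is written as a standalone function with a made-up name and only handles nested arrays.

```javascript
// Flatten nested arrays in place without going through a string.
function flattenInPlace(list) {
  var flat = [];
  (function collect(items) {
    for (var i = 0; i < items.length; i++) {
      if (Array.isArray(items[i])) {
        collect(items[i]);
      } else {
        flat.push(items[i]);
      }
    }
  })(list);
  // Replace the contents of the original array, as the splice-based version does.
  Array.prototype.splice.apply(list, [0, list.length].concat(flat));
  return list;
}
```

Because nothing is ever turned into text, strings containing quotation marks or brackets need no special handling.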
I regularly receive emails from the same person, each containing one or more unique identifying codes. I need to get those codes.
The email body contains a host of inconsistent email content, but it is the strings I am interested in. They look like...
loYm9vYzE6Z-aaj5lL_Og539wFer0KfD
FuZTFvYzE68y8-t4UgBT9npHLTGmVAor
JpZDRwYzE6dgyo1legz9sqpVy_F21nx8
ZzZ3RwYzE63P3UwX2ANPI-c4PMo7bFmj
What the strings seem to have in common is, they are all 32 characters in length and all composed of a mixture of both uppercase, lowercase, numbers and symbols. But a given email may contain none, one or multiple, and the strings will be in an unpredictable position, not on adjacent lines as above.
I wish to make a Zap workflow in Zapier, the linking tool for web services, to find these strings and use them in another app - ie. whenever a string is found, create a new Trello card.
I have already started the workflow with Zapier's "Gmail" integration as a "trigger", specifically a search using the "from:" field corresponding to the regular sender. That's the easy part.
But the actual parsing of the email body is foxing me. Zapier has a rudimentary email parser, but it is not suitable for this task. What is suitable is using Zapier's own "Code" integration to execute freeform code - namely, a regular expression to identify those strings.
I have never done this before and am struggling to formulate working code. Zapier Code can take either Python (documentation) or Javascript (documentation). It supports data variables "input_data" (Python) or "inputData" (Javascript) and "output" (both).
See, below, how I insert the Gmail body in to "body" for parsing...
I need to use the Code box to construct a regular expression to find each unique identifier string and output it as input to the next integration in the workflow, ie. Trello.
For info, in the above screengrab, the existing "hello world" code in the box is Zapier's own test code. The fields "id" and "hello" are made available to the next workflow app in the chain.
But I need to do my process for all of the strings found within an email body - ie. if an email contains just one code, create one Trello card; but if an email contains four codes, create a Trello card for each of the four.
That is, there could be multiple outputs. I have no idea how this could work, since I think these workflows are only supposed to accommodate one action.
I could use some help getting over the hill. Thank-you.
David here, from the Zapier Platform team.
I'm glad you're showing interest in the code step. Assuming your assumption (exactly 32 characters) always holds, this should be fairly straightforward.
First off, the regex. We want to look for a character that's a letter, number, or punctuation. Luckily, javascript's \w is equivalent to [A-Z0-9a-z_], which covers the bases in all of your examples besides the -, which we'll include manually. Finally, we want exactly 32 character length strings, so we'll ask for that. We also want to add the global flag, so we find all matches, not just the first. So we have the following:
/[\w-]{32}/g
You've already covered mapping the body in, so that's good. The javascript code will be as follows:
// stores an array of any length (0 or more) with the matches
var matches = inputData.body.match(/[\w-]{32}/g)
// the .map function executes the nameless inner function once for each
// element of the array and returns a new array with the results
// [{str: 'loYm9vYzE6Z-aaj5lL_Og539wFer0KfD'}, ...]
return (matches || []).map(function (m) { return {str: m} })
Here, you'll be taking advantage of an undocumented feature of code steps: when you return an array of objects, subsequent steps are executed once for each object. If you return an empty array (which is what'll happen if no keys are found), the zap halts and nothing else happens. When you're testing, there'll be no indicator that anything besides the first result does anything. Once your zap is on and runs for real though, it'll fan out as described here.
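If you want to try the matching outside Zapier first, the same logic can be wrapped in an ordinary function. This is just a local sketch: extractCodes and the sample body are made up for the example, and the body parameter stands in for inputData.body.

```javascript
// Same regex and mapping as the code step, as a plain function.
function extractCodes(body) {
  var matches = body.match(/[\w-]{32}/g);
  return (matches || []).map(function (m) { return { str: m }; });
}

var sampleBody = [
  "Hi,",
  "Your code is loYm9vYzE6Z-aaj5lL_Og539wFer0KfD and also",
  "FuZTFvYzE68y8-t4UgBT9npHLTGmVAor. Thanks!"
].join("\n");

var found = extractCodes(sampleBody); // one object per code found in the body
```

One thing to watch for: the pattern has no boundaries, so any longer unbroken run of letters, digits, _ and - (part of a URL, say) would also yield a match consisting of its first 32 characters.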
That's all it takes! Hopefully that all makes sense. Let me know if you've got any other questions!
I'm currently working on a small little dsl, not unlike rabl. I'm struggling with the implementation of one of my rules. Before we get to the problem, I'll explain a bit about my syntax/grammar.
In my little language you can define properties, object/array blocks, or custom blocks (these are all used to build a json object/array). A "custom block" can either be one that contains my standard expressions (property, object/array block, etc) or some JavaScript. These expressions are written as such -
-- An object block
object #model
-- A property node
property some, key(name="value")
-- A custom node
object custom_obj as
property some(name="key")
end
-- A custom script node
property full_name as (u)
// This is JavaScript
return u.first_name + ' ' + u.last_name;
end
end
The problem I'm running into is with my custom script node. I'm having a real hard time defining the script token so that JISON can properly capture the stuff inside the block.
In my lexer, I currently have...
# script_param is basically a regex to match "(some_ident)"
{script_param} { this.begin('js'); return 'SCRIPT_PARAM'; }
<js>(.|\n|\r)*?"end" %{
this.popState();
yytext = yytext.substr(0, yyleng - 3).trim();
return 'SCRIPT';
%}
That SCRIPT token will basically match anything after (u) up to (and including) the end token (which usually ends a block). I really dislike this because my usual block terminator (end) is actually part of the script token, which feels totally hacky to me. Unfortunately, I'm not able to find a better way to capture ANYTHING between (..) and end.
I've tried writing a regex that captures anything that ends with a ";", but that poses problems when I have multiple script nodes in my dsl code. I've only been able to make this work by including the "end" keyword as part of my capture.
Here are the links to my grammar and lexer files.
I'd greatly appreciate any insight into solving my problem! If I didn't explain my problem clearly, let me know and I'll try my best to clarify!
Many thanks in advance!!
I will also happily accept any advice as to how to clean up my grammar. I'm still fairly new at this stuff and feel like my stuff is a mess right now :)
It's easy enough to match a string up to but not including the first instance of end:
([^e]|e[^n]|en[^d])*
(And it doesn't even need non-greedy repetition.)
However, that's not what you want. The included JavaScript might include:
variables whose names happen to include the characters end (tendency)
comments (/* Take the values up to the end of the line */)
character strings (if (word == "end"))
and, indeed, the word end itself, which is not a reserved word in js.
Really, the only clean solution is to be able to lex javascript. Fortunately, you don't have to do it precisely, because you're not interpreting it, but even so it is a bit of work. The most annoying part of javascript lexing, like other similar languages, is identifying when / is the beginning of a regular expression, and when it is just division; getting that right requires most of a javascript parser, particularly since it also interacts with the semicolon rule.
To deal with the fact that the included javascript might actually use a variable named end, you have a couple of choices, as far as I can see:
Document the fact that end is a reserved word.
Only recognize end when it appears outside of parentheses and in a place where a statement might start (not too difficult if you end up building enough of a JS parser to correctly identify regular expressions)
Only recognize end when it appears by itself on a line.
This last choice would really simplify your problem a lot, so you might want to think about it, although it's not really very elegant.
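To show how little is needed for that last option, here is a sketch in plain JavaScript rather than Jison lexer syntax. The function name splitAtEndLine and the returned shape are assumptions; the text argument is meant to be everything that follows the (u) parameter.

```javascript
// Split at the first line that consists solely of "end" (optionally indented).
function splitAtEndLine(text) {
  var terminator = /^[ \t]*end[ \t]*$/m;
  var m = terminator.exec(text);
  if (m === null) throw new SyntaxError('Missing "end" line');
  return {
    script: text.slice(0, m.index).trim(),     // the JavaScript body
    rest: text.slice(m.index + m[0].length)    // whatever follows the terminator
  };
}
```

A comment mentioning "the end of the line", a variable called tendency, or the string "end" are all left alone, because none of them is alone on its line.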
I'm working on creating a basic RPG game engine prototype using JavaScript and canvas. I'm still working out some design specs on paper, and I've hit a bit of a problem I'm not quite sure how to tackle.
I will have a Character object that will have an array of Attribute objects. Attributes will look something like this:
function Attribute(name, value) {
this.name = name;
this.value = value;
...
}
A Character will also have "skills" that are calculated off attributes. A skills value can also be determined by a formula entered by the user. A legit formula would look something like this:
((#attribute1Name + (#attribute2Name / 2)) * 5)
where any text following the # sign represents the name of an attribute belonging to that character. The formula will be entered into a text field as a string.
What I'm having a problem with is understanding the proper way to parse and evaluate this formula. Initially, my plan was to do a simple replace on the attribute names and eval the expression (if invalid, the eval would fail). However, this presents a problem as it would allow for JavaScript injection into the field. I'm assuming I'll need some kind of FSM similar to an infix calculator to solve this, but I'm a little rusty on my computation theory (thanks corporate world!). I'm really not asking for someone to just hand me the code so much as I'd like to get your input on what is the best solution to this problem?
EDIT: Thanks for the responses. Unfortunately life has kept me busy and I haven't tried a solution yet. Will update when I get a result (good or bad).
Different idea, hence a separate suggestion:
eval() works fine, and there's no need to re-invent the wheel.
Assuming that there's only a small and fixed number of variables in your formula language, it would be sufficient to scan your way through the expression and verify that everything you encounter is either a parenthesis, an operator or one of your variable names. I don't think there would be any way to assemble those pieces into a piece of code that could have malicious side effects on eval.
So:
Scan the expression to verify that it draws from just a very limited vocabulary.
Let eval() work it out.
Probably the compromise with the least amount of work and code while bringing risk down to (near?) 0. At worst, a misuser could tack parentheses on a variable name in an attempt to execute the variable.
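A sketch of those two steps, under some assumptions of mine: the vocabulary is limited to #names, plain decimal numbers, the four arithmetic operators and parentheses; attributes is a plain object mapping names to numbers; and the name evaluateFormula is made up.

```javascript
// Step 1: scan the formula against a tiny vocabulary. Step 2: let eval() do the math.
function evaluateFormula(formula, attributes) {
  // One token: "#name", a decimal number, or a single operator or parenthesis.
  var token = /\s*(?:#([A-Za-z_]\w*)|(\d+(?:\.\d+)?)|([-+*/()]))/y;
  var text = formula.trim();
  var pieces = [];
  var pos = 0;
  while (pos < text.length) {
    token.lastIndex = pos;
    var m = token.exec(text);
    if (m === null) throw new SyntaxError("Unexpected input at position " + pos);
    if (m[1] !== undefined) {
      var known = Object.prototype.hasOwnProperty.call(attributes, m[1]);
      if (!known || !Number.isFinite(attributes[m[1]])) {
        throw new ReferenceError("Unknown attribute #" + m[1]);
      }
      pieces.push("(" + attributes[m[1]] + ")"); // substitute the numeric value
    } else {
      pieces.push(m[2] !== undefined ? m[2] : m[3]);
    }
    pos = token.lastIndex;
  }
  // Joining with spaces keeps adjacent operators from fusing into "//", "/*" or "**".
  return eval(pieces.join(" "));
}
```

Anything outside the vocabulary, such as alert(1), is rejected before eval() ever sees it. Text that is made of valid tokens but is not a valid expression still fails inside eval() with an ordinary exception.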
I think instead of letting them put the whole formula in, you could have select tags that have operations and values, and let them choose.
ie. a set of tags with attribute-operation-number:
<select> <select> <input type="text">
#attribute1Name1 + (check if input is number)
#attribute1Name2 -
#attribute1Name3 *
#attribute1Name4 /
etc.
There is a really simple solution: Just enter a normal JavaScript formula (i.e. as if you were writing a method for your object) and use this to reference the object you're working on.
To change this when evaluating the method use apply() or call() (see this answer).
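A minimal sketch of what that could look like, assuming a character object with numeric properties (the object, its property names and the formula text are all made up here). Note that this does nothing about the injection concern from the question: the formula text is compiled as ordinary JavaScript.

```javascript
// Hypothetical character; the formula is ordinary JavaScript that uses `this`.
var character = { strength: 10, agility: 4 };
var formulaText = "(this.strength + this.agility / 2) * 5";

// Compile the text once. Nothing is validated, so it runs like any other code.
var formulaFn = new Function("return (" + formulaText + ");");

// call() sets `this` to the character the skill belongs to.
var skillValue = formulaFn.call(character);
```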
I recently wrote a similar application. I probably invested far too much work, but I went the whole 9 yards and wrote both a scanner and a parser.
The scanner converted the text into a series of tokens; tokens are simple objects consisting of token type and value. For the punctuation marks, value = character, for numbers the values would be integers corresponding to the numeric value of the number, and for variables it would be (a reference to) a variable object, where that variable would be sitting in a list of objects having a name. Same variable object = same variable, natch.
The parser was a simple brute force recursive descent parser. Here's the code.
My parser does logic expressions, with AND/OR taking the place of +/-, but I think you can see the idea. There are several levels of expressions, and each tries to assemble as much of itself as it can, and calls to lower levels for parsing nested constructs. When done, my parser has generated a single Node containing a tree structure that represents the expression.
In your program, I guess you could just store that Node, as its structure will essentially represent the formula for its evaluation.
Given all that work, though, I'd understand just how tempting it would be to just cave in and use eval!
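For the arithmetic formulas in the question, a scanner plus recursive descent parser along those lines might be sketched as follows. This is not the code linked above; the function names, token shapes and node shapes are all assumptions for illustration.

```javascript
// Scanner: turn the formula text into tokens.
function tokenizeFormula(formula) {
  var pattern = /\s*(?:#([A-Za-z_]\w*)|(\d+(?:\.\d+)?)|([-+*/()]))/y;
  var text = formula.trim();
  var tokens = [];
  var pos = 0;
  while (pos < text.length) {
    pattern.lastIndex = pos;
    var m = pattern.exec(text);
    if (m === null) throw new SyntaxError("Unexpected input at position " + pos);
    if (m[1] !== undefined) tokens.push({ type: "attribute", name: m[1] });
    else if (m[2] !== undefined) tokens.push({ type: "number", value: Number(m[2]) });
    else tokens.push({ type: "op", text: m[3] });
    pos = pattern.lastIndex;
  }
  return tokens;
}

// Parser: each level assembles as much of itself as it can and calls the level below.
function parseFormula(formula) {
  var tokens = tokenizeFormula(formula);
  var pos = 0;

  function isOp(ops) {
    return pos < tokens.length && tokens[pos].type === "op" && ops.indexOf(tokens[pos].text) !== -1;
  }
  function parseBinary(parseOperand, ops) {
    var node = parseOperand();
    while (isOp(ops)) {
      var op = tokens[pos++].text;
      node = { type: "binary", op: op, left: node, right: parseOperand() };
    }
    return node;
  }
  function parseExpression() { return parseBinary(parseTerm, "+-"); }
  function parseTerm() { return parseBinary(parseFactor, "*/"); }
  function parseFactor() {
    var tok = tokens[pos++];
    if (tok === undefined) throw new SyntaxError("Unexpected end of formula");
    if (tok.type !== "op") return tok; // a number or attribute is a leaf node
    if (tok.text !== "(") throw new SyntaxError("Unexpected token " + tok.text);
    var inner = parseExpression();
    if (!isOp(")")) throw new SyntaxError('Expected ")"');
    pos++;
    return inner;
  }

  var tree = parseExpression();
  if (pos < tokens.length) throw new SyntaxError("Unexpected trailing input");
  return tree;
}

// Walk the stored tree to compute a value for one character's attributes.
function evaluateNode(node, attributes) {
  if (node.type === "number") return node.value;
  if (node.type === "attribute") {
    if (!Object.prototype.hasOwnProperty.call(attributes, node.name)) {
      throw new ReferenceError("Unknown attribute #" + node.name);
    }
    return attributes[node.name];
  }
  var left = evaluateNode(node.left, attributes);
  var right = evaluateNode(node.right, attributes);
  if (node.op === "+") return left + right;
  if (node.op === "-") return left - right;
  if (node.op === "*") return left * right;
  return left / right;
}
```

The tree returned by parseFormula can be stored with the skill and re-evaluated with evaluateNode whenever an attribute changes, and user input never reaches eval.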
I'm fascinated by the task of getting this done by the simplest means possible.
Here's another approach:
Convert infix to postfix;
use a very simple stack-based calculator to evaluate the resulting expression.
The rationale here being, once you get rid of the complication of "* before +" and parentheses, the remaining calculation is very straightforward.
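The two steps could be sketched like this, using the shunting-yard algorithm for the infix-to-postfix conversion. To keep it short I assume the formula has already been split into an array of string tokens ("#name", numbers, operators, parentheses); the function names are made up.

```javascript
// Operator precedence; all four operators are treated as left-associative.
var PRECEDENCE = new Map([["+", 1], ["-", 1], ["*", 2], ["/", 2]]);

// Step 1: infix tokens to postfix tokens (shunting-yard).
function toPostfix(tokens) {
  var output = [];
  var operators = [];
  tokens.forEach(function (tok) {
    var top;
    if (PRECEDENCE.has(tok)) {
      // Pop operators that bind at least as tightly as the incoming one.
      while (operators.length > 0) {
        top = operators[operators.length - 1];
        if (!PRECEDENCE.has(top) || PRECEDENCE.get(top) < PRECEDENCE.get(tok)) break;
        output.push(operators.pop());
      }
      operators.push(tok);
    } else if (tok === "(") {
      operators.push(tok);
    } else if (tok === ")") {
      while (operators.length > 0 && operators[operators.length - 1] !== "(") {
        output.push(operators.pop());
      }
      if (operators.length === 0) throw new SyntaxError("Unbalanced parentheses");
      operators.pop(); // discard the "("
    } else {
      output.push(tok); // operand: a number or "#name"
    }
  });
  while (operators.length > 0) {
    var op = operators.pop();
    if (op === "(") throw new SyntaxError("Unbalanced parentheses");
    output.push(op);
  }
  return output;
}

// Step 2: a very simple stack-based calculator for the postfix form.
function evalPostfix(postfix, attributes) {
  var stack = [];
  postfix.forEach(function (tok) {
    if (PRECEDENCE.has(tok)) {
      if (stack.length < 2) throw new SyntaxError("Missing operand for " + tok);
      var right = stack.pop();
      var left = stack.pop();
      if (tok === "+") stack.push(left + right);
      else if (tok === "-") stack.push(left - right);
      else if (tok === "*") stack.push(left * right);
      else stack.push(left / right);
    } else if (tok.charAt(0) === "#") {
      var name = tok.slice(1);
      if (!Object.prototype.hasOwnProperty.call(attributes, name)) {
        throw new ReferenceError("Unknown attribute " + tok);
      }
      stack.push(attributes[name]);
    } else {
      var value = Number(tok);
      if (tok === "" || isNaN(value)) throw new SyntaxError("Not a number: " + tok);
      stack.push(value);
    }
  });
  if (stack.length !== 1) throw new SyntaxError("Malformed expression");
  return stack[0];
}
```

For example, the tokens 3 + 4 * 2 become 3 4 2 * + in postfix, which the calculator reduces to 11.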
You could look at running the user-defined code in a sandbox to prevent attacks:
Is It Possible to Sandbox JavaScript Running In the Browser?