As part of a small project I'm working on, I need to be able to parse a string into a custom object, which represents an action, date and a few other properties. The tricky part is that the input string can come in a variety of flavors that all need to be properly parsed.
Input strings may be in the following formats:
Go to work tomorrow at 9am
Wash my car on Monday, at 3 pm
Call the doctor next Tuesday at 10am
Fill out the rebate form in 3 days at 2:30pm
Wake me up every day at 7:00am
And the output object would look something like this:
{
"Action":"Wash my car",
"DateTime":"2011-12-26 3:00PM", // Format is irrelevant at this point
"Recurring":False,
"RecurranceType":""
}
At first I thought of constructing some sort of tree to represent different states (On, In, Every, etc.) with different outcomes and further states (candidate for a state machine, right?). However, the more I thought about this, the more it started looking like a grammar parsing problem. Due to a (limited) number of ways the sentence could be formed, it looks like some sort of grammar parsing algorithm would need to be implemented.
In addition, I'm doing this on the front end, so JavaScript is the language of choice here. Back end will be written in Python and could be used by calling AJAX methods, if necessary, but I'd prefer to keep it all in JavaScript. (To be honest, I don't think the language is a big issue here).
So, am I in way over my head? I have a strong JavaScript background, but nothing beyond school courses when it comes to language design, parsing, etc. Is there a better way to solve this problem? Any suggestions are greatly appreciated.
I don't know a lot about grammar parsing, but something here might help.
My first thought is that your sentence syntax seems to be pretty consistent
1st 3-4 words are generally VERB text NOUN, followed by some form of time. If the total options are limited to what form the sentence can take, you can hard-code some parsing rules.
I also ran across a couple of js grammar parsers that might get you somewhere:
http://jscc.jmksf.com/
http://pegjs.majda.cz/
http://www.corion.net/perl-dev/Javascript-Grammar.html
This is an interesting problem you have. Please update this with your solutions later.
Related
I'm working on a MongoDB database and so far have stored some information as Numbers instead of Strings because I assumed that would be more efficient. For example, I store countries following ISO 3166-1 numeric and sex following ISO/IEC 5218. But so far I have not found a similar standard for languages, ISO 639 does not appear to have a matching list of numeric codes.
What would be the right way to do this? Should I just use the String codes?
Thanks!
If you're a fan of the numbers, you can use country calling codes, although they "only" represent the ITU members (193 countries according to Wikipedia). But hey, they have Somalia and Palestine, so that's a good hint about how global this is.
However, storing everything in an encoded format (numbers here) implies a decoding step on the fly when any piece of data is requested (with translation tables stored in RAM instead of DB's ROM). Probably on the server whose CPU is precious, but you might have deported the issue on the client, overworking the precious, time-critical server-client link in the process.
So, back in the 90s, when a 40MB HDD was expensive, that might have been interesting. Today, the cost of storing data vs. the cost of processing data is not on the same side of 1... Not counting the time it takes you to think and implement the transformations. All being said "IMHO", I think this level of efficiency actually kills efficiency. ;)
EDIT: Oops, just realized I misthought (does that verb even exist?) the country/language issue. Countries you have sorted out already, my bad. I know no numbered list of languages. However, the second part of the post might still be relevant...
If you are after raw performance and/or want to achieve really small data sizes, I would suggest you use either the three-letter (higher granularity) or the two-letter (lower granularity) codes from IOC ISO-639-1/2.
To my knowledge, there's no helper or anything for this standard built into any programming language that I know, so you'd need to build your own translator (code<->full name) which, however, should be trivial.
And as others already mentioned, you have to assess the cost involved with this (e.g. not being able to simply look at the data and understand it right away anymore) for yourself. I personally do recommend keeping data sizes small since BSON parsing and string operations are horribly expensive compared to dealing with numbers (or shorter strings for that matter). When dealing with small data sets, this won't make a noticeable difference. If, however, you need to churn through millions of documents or more optimizations like this can become mission critical.
Sorry for the somewhat truthful misleading title.
I am currently trying to grasp several formats of dates, within a given sentence, will add more as time permits and neccessity arises, but for the most part this is what I have.
https://regex101.com/r/vV0uZ3/1
The main reason why I said break this, is because I can't think of any other types of dates that could mess this up, but I also feel that it is inefficient, semi new to regexes, and this seems to work, but I feel that there is a better manner in which to do it as well, this is somewhat time insensitive data processing, but speed is always a needed point in the end game.
Shortened scope, are there (base) formats of date inputs that this would not be able to pull in?
Also would it be better to use the case insensitive flag 'i' or pull in the full (base) alphabet into it [A-z], for time, and solution comprehension?
EDIT~~
I may be misrepresenting my question a bit, here goes.
I more or less will be reading in human inputs, and mostly american date formats within a document.
I have seen the comments made, and they are definitely sensible, just not too sure if at the moment those edge cases would be plausible, human randomness is a beautiful thing though, so side question, would having a regex that could catch all of these date formats(n/m/e | n-m-e | n m, e |...) heavily bog down my code in the long ru, and would the trade off for maleability be worth the slight shift in efficiency?
I am working on front-end only React application and will soon be implementing internationalization. We only need one additional language... at this point. I would like to do it in a way that is maintainable, where adding a new language would be ideally as close as possible to merely providing a new config object with translations for various strings.
The issue I know we will have, is that we have dynamic inputs inside of sentences as demonstrated below (where [] are inputs and ** is dynamically changing data). This is just an example sentence... lots of other similar type things elsewhere in the app.
I am [23] years old. I was born in [ ______▾]. In 2055 I would be *65* years old.
We could break out 'I am', '*age input', 'years old. I was born in', '*year dropdown'. etc. But depending on the language, word order could be changed or an input could be at the beginning of a sentence etc, and I feel like doing it in that way would make for a really weird looking and hard to maintain language file.
I'm looking to know if there are common patterns and/or libraries we can use to help with this challenge.
A react specific library is react-intl by yahoo. Which is part of a larger project called FormatJS which has many libraries and solutions for internalization. These and their corresponding docs are a good starting point.
I'm interested in finding a more-sophisticated-than-typical algorithm for finding differences between strings, that can be "tuned" via some parameters, to balance between such things as "maximize count of identical characters" vs. "maximize the length of spans" vs. "try to keep whole words intact".
Ultimately, I want to be able to make the results as human readable as possible. For instance, if a long sentence has been replaced with an entirely new sentence, where the only things it has in common with the original are the words "the" "and" and "a" in that order, I might want it treated as if the whole sentence is changed, rather than just that 4 particular spans are changed --- just like how a reasonable person would see it.
Does such a thing exist? Although I'm working in javascript/node.js, an algorithm in any language would be helpful.
I'm actually ok with something that uses Monte Carlo methods or the like, if its results are better. Computation time is not an issue (within reason), nor is determinism.
Note: although this is beyond the scope of what I'm asking, I'll throw one more thing out there just in case: It would also be great if it could recognize changes that are out of order....for instance if someone changes the order of two paragraphs while leaving them otherwise identical, it would be awesome if it recognized it as a simple move, rather than as one subtraction and and one unrelated addition.
I've had good luck with diff_match_patch. There are some good options for tuning it for readability.
Try http://prettydiff.com/ Its code is already formatted for compatibility with CommonJS, which is the framework Node uses.
I'm building a solution for a client which allows them to create very basic code,
now i've done some basic syntax validation but I'm stuck at variable verification.
I know JSLint does this using Javascript and i was wondering if anyone knew of a good way to do this.
So for example say the user wrote the code
moose = "barry"
base = 0
if(moose == "barry"){base += 100}
Then i'm trying to find a way to clarify that the "if" expression is in the correct syntax, if the variable moose has been initialized etc etc
but I want to do this without scanning character by character,
the code is a mini language built just for this application so is very very basic and doesn't need to manage memory or anything like that.
I had thought about splitting first by Carriage Return and then by Space but there is nothing to say the user won't write something like moose="barry" or if(moose=="barry")
and there is nothing to say the user won't keep the result of a condition inline.
Obviously compilers and interpreters do this on a much more extensive scale but i'm not sure if they do do it character by character and if they do how have they optimized?
(Other option is I could send it back to PHP to process which would then releave the browser of responsibility)
Any suggestions?
Thanks
The use case is limited, the syntax will never be extended in this case, the language is a simple scripted language to enable the client to create a unique cost based on their users input the end result will be processed by PHP regardless to ensure the calculation can't be adjusted by the end user and to ensure there is some consistency.
So for example, say there is a base cost of £1.00
and there is a field on the form called "Additional Cost", the language will allow them manipulate the base cost relative to the "additional cost" field.
So
base = 1;
if(additional > 100 && additional < 150){base += 50}
elseif(additional == 150){base *= 150}
else{base += additional;}
This is a basic example of how the language would be used.
Thank you for all your answers,
I've investigated a parser and creating one would be far more complex than is required
having run several tests with 1000's of lines of code and found that character by character it only takes a few seconds to process even on a single core P4 with 512mb of memory (which is far less than the customer uses)
I've decided to build a PHP based syntax checker which will check the information and convert the variables etc into valid PHP code whilst it's checking it (so that it's ready to be called later without recompilation) using this instead of javascript this seems more appropriate and will allow for more complex code to arise without hindering the validation process
It's only taken an hour and I have code which is able to check the validity of an if statement and isn't confused by nested if's, spaces or odd expressions, there is very little left to be checked whereas a parser and full blown scripting language would have taken a lot longer
You've all given me a lot to think about and i've rated relevant answers thank you
If you really want to do this — and by that I mean if you really want your software to work properly and predictably, without a bunch of weird "don't do this" special cases — you're going to have to write a real parser for your language. Once you have that, you can transform any program in your language into a data structure. With that data structure you'll be able to conduct all sorts of analyses of the code, including procedures that at least used to be called use-definition and definition-use chain analysis.
If you concoct a "programming language" that enables some scripting in an application, then no matter how trivial you think it is, somebody will eventually write a shockingly large program with it.
I don't know of any readily-available parser generators that generate JavaScript parsers. Recursive descent parsers are not too hard to write, but they can get ugly to maintain and they make it a little difficult to extend the syntax (esp. if you're not very experienced crafting the original version).
You might want to look at JS/CC which is a parser generator that generates a parser for a grammer, in Javascript. You will need to figure out how to describe your language using a BNF and EBNF. Also, JS/CC has its own syntax (which is somewhat close to actual BNF/EBNF) for specifying the grammar. Given the grammer, JS/CC will generate a parser for that grammar.
Your other option, as Pointy said, is to write your own lexer and recursive-descent parser from scratch. Once you have a BNF/EBNF, it's not that hard. I recently wrote a parser from an EBNF in Javascript (the grammar was pretty simple so it wasn't that hard to write one YMMV).
To address your comments about it being "client specific". I will also add my own experience here. If you're providing a scripting language and a scripting environment, there is no better route than an actual parser.
Handling special cases through a bunch of if-elses is going to be horribly painful and a maintenance nightmare. When I was a freshman in college, I tried to write my own language. This was before I knew anything about recursive-descent parsers, or just parsers in general. I figured out by myself that code can be broken down into tokens. From there, I wrote an extremely unwieldy parser using a bunch of if-elses, and also splitting the tokens by spaces and other characters (exactly what you described). The end result was terrible.
Once I read about recursive-descent parsers, I wrote a grammar for my language and easily created a parser in a 10th of the time it took me to write my original parser. Seriously, if you want to save yourself a lot of pain, write an actual parser. If you go down your current route, you're going to be fixing issues forever. You're going to have to handle cases where people put the space in the wrong place, or perhaps they have one too many (or one too little) spaces. The only other alternative is to provide an extremely rigid structure (i.e, you must have exactly x number of spaces following this statement) which is liable to make your scripting environment extremely unattractive. An actual parser will automatically fix all these problems.
Javascript has a function 'eval'.
var code = 'alert(1);';
eval(code);
It will show alert. You can use 'eval' to execute basic code.