String translation with dynamic text and inputs - javascript

I am working on a front-end-only React application and will soon be implementing internationalization. We only need one additional language... at this point. I would like to do it in a maintainable way, where adding a new language would ideally be as close as possible to merely providing a new config object with translations for the various strings.
The issue I know we will have is that we have dynamic inputs inside of sentences, as demonstrated below (where [] are inputs and ** is dynamically changing data). This is just an example sentence; there are lots of other similar things elsewhere in the app.
I am [23] years old. I was born in [ ______▾]. In 2055 I would be *65* years old.
We could break the sentence out into 'I am', '*age input*', 'years old. I was born in', '*year dropdown*', etc. But depending on the language, word order could change or an input could land at the beginning of a sentence, and I feel like doing it that way would make for a really weird-looking and hard-to-maintain language file.
I'm looking to know if there are common patterns and/or libraries we can use to help with this challenge.

A React-specific library is react-intl by Yahoo. It is part of a larger project called FormatJS, which has many libraries and solutions for internationalization. These and their corresponding docs are a good starting point.
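For example, react-intl's ICU-style messages let each translation reorder placeholders freely, and the interpolated values can even be React elements such as inputs. A minimal sketch (the message ids, strings, and input elements are illustrative):

import React from 'react';
import { IntlProvider, FormattedMessage } from 'react-intl';

// One message object per language; each translation is free to reorder {age} and {year}
const messages = {
  en: { ageSentence: 'I am {age} years old. I was born in {year}.' },
  fr: { ageSentence: "J'ai {age} ans. Je suis né en {year}." }
};

function AgeSentence({ locale }) {
  return (
    <IntlProvider locale={locale} messages={messages[locale]}>
      <FormattedMessage
        id="ageSentence"
        values={{
          age: <input type="number" />, // rendered in place of {age}
          year: <select><option>1990</option></select> // rendered in place of {year}
        }}
      />
    </IntlProvider>
  );
}

Adding a language then comes down to adding one more entry to the messages object, which is close to the "new config object" goal described above.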


Abstract Classification using NLP/ML

I need to auto-generate the categories of a publication from its abstract, with support for synonyms. I have classification data for 800-900 articles which I can use for training. This classification data was generated by pharma experts by reading unstructured publications.
The existing classification categories for publications look like this:
Drug : Some drug, Some other drug.
Diseases : Some Disease.
Authors : Some authors and so on..
These categories are currently generated by a human expert. I have explored the Natural library in Node.js and LingPipe in Java. Both have classifiers, but I am not able to figure out the most efficient way to train them so that I get 90% accuracy.
The following approaches come to mind:
I can pass entire abstracts of publications one by one and tell the classifier their categories, like below:
var natural = require('natural');
var classifier = new natural.BayesClassifier();
// Label the whole abstract once per category it belongs to
classifier.addDocument('This article is for paracetamol written by Techgyani. Article was written in 2012', 'year:2012');
classifier.addDocument('This article is for paracetamol written by Techgyani. Article was written in 2012', 'author:techgyani');
classifier.train();
Or I can pass it sentences one by one and tell it their categories, which will be a manual and time-consuming process, so that when I pass it an entire abstract it will auto-generate the set of categories for me, like below:
var natural = require('natural');
var classifier = new natural.BayesClassifier();
// Label each sentence individually
classifier.addDocument('This article is for paracetamol written by Techgyani', 'drug:Paracetamol');
classifier.addDocument('This article is for paracetamol written by Techgyani', 'author:techgyani');
classifier.addDocument('Article was written in 2012', 'year:2012');
classifier.train();
I could also extract tokens from the publication, search my database, and figure out the categories on my own without any NLP/ML libraries.
In your experience, which is the most efficient way to solve this problem? I am open to solutions in any language, but I prefer JavaScript because the existing stack is in JavaScript.
I'd recommend using either the most frequent words or word frequencies as features in a Naive Bayes classifier.
There's no need to tag sentences individually. I'd expect reasonable accuracy at the document level, although that will depend on the nature of the documents being trained on and classified.
There's a great discussion of a Python implementation here:
Implementing Bag-of-Words Naive-Bayes classifier in NLTK
In my opinion, your second solution will work like a charm. You need to train your classifier in order to do this: add each labelled sentence with classifier.addDocument(text, label) and then call classifier.train(). I know this will be manual work, but it will only take a little time to train your classifier.
Once it is trained, you can pass it one of your sentences and see the output for yourself.
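A minimal sketch of that flow with natural (the sample strings and labels are illustrative):

var natural = require('natural');
var classifier = new natural.BayesClassifier();
// Add each manually labelled sentence, then train once
classifier.addDocument('This article is for paracetamol written by Techgyani', 'drug:paracetamol');
classifier.addDocument('Article was written in 2012', 'year:2012');
classifier.train();
// Classify a new sentence outright, or inspect the per-label scores
console.log(classifier.classify('A 2012 article about paracetamol'));
console.log(classifier.getClassifications('A 2012 article about paracetamol'));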
You should explore off-the-shelf Named Entity Recognition models first, before investing in training. spaCy is written in Python but has JavaScript bindings. The classifiers in natural use Naive Bayes and logistic regression and will not perform as well as a neural-network library like spaCy. I suspect that natural will not work well for new cases where it has not already seen the drug, disease, or author name in the training set.

How to implement different languages on an HTML page

I am just a newcomer developing an app with HTML/CSS/JS via PhoneGap. I've been searching for info on how to make my app display in different languages, and Google doesn't understand me.
So the idea is to have a button on index.html that lets the user choose the language in which the app will be displayed, in this case Spanish/English, nothing strange like Arabic...
So I guess the solution must involve turning all the text that I load in the HTML into variables and then, depending on the language selected, displaying the correct one. I have no idea how to do this, and I'm not able to find examples. So that's what I'm asking for... if someone could give me a code snippet showing how such variables work and how I should save the user's language selection...
Appreciated guys!
This can be done with internationalization (i18n). To do this you need a separate file for each language containing all your text. Search Google for internationalization.
Otherwise you can look into embedding Google Translate.
This depends on the complexity of language-dependencies in the application. If you have just a handful of short texts in a strongly graphic application, you can just store the texts in JavaScript variables or, better, in properties of an object, with one object per language.
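For example, a minimal sketch of the one-object-per-language approach (the keys, strings, and element id are illustrative):

var texts = {
  en: { greeting: 'Hello', farewell: 'Goodbye' },
  es: { greeting: 'Hola', farewell: 'Adiós' }
};

// Look up a string in the chosen language, falling back to English
function t(lang, key) {
  return (texts[lang] && texts[lang][key]) || texts.en[key];
}

// Persist the user's button choice and use it on the next load
var lang = localStorage.getItem('lang') || 'en';
document.getElementById('greeting').textContent = t(lang, 'greeting');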
But if you expect to encounter deeper language-dependencies as well (e.g., displaying dynamically computed decimal numbers, which should be e.g. 1.5 in English and 1,5 in Spanish), then it’s probably better to use a library like Globalize.js (described in some detail in my book Going Global with JavaScript and Globalize.js). That way you could use a unified approach, writing e.g. a string using Globalize.localize('greeting') and a number using Globalize.format(x, 'n1') and a date using Globalize.format(date, 'MMM d').

Parsing a string into a custom object based on different criteria

As part of a small project I'm working on, I need to be able to parse a string into a custom object, which represents an action, date and a few other properties. The tricky part is that the input string can come in a variety of flavors that all need to be properly parsed.
Input strings may be in the following formats:
Go to work tomorrow at 9am
Wash my car on Monday, at 3 pm
Call the doctor next Tuesday at 10am
Fill out the rebate form in 3 days at 2:30pm
Wake me up every day at 7:00am
And the output object would look something like this:
{
  "Action": "Wash my car",
  "DateTime": "2011-12-26 3:00PM", // Format is irrelevant at this point
  "Recurring": false,
  "RecurrenceType": ""
}
At first I thought of constructing some sort of tree to represent different states (On, In, Every, etc.) with different outcomes and further states (candidate for a state machine, right?). However, the more I thought about this, the more it started looking like a grammar parsing problem. Due to a (limited) number of ways the sentence could be formed, it looks like some sort of grammar parsing algorithm would need to be implemented.
In addition, I'm doing this on the front end, so JavaScript is the language of choice here. Back end will be written in Python and could be used by calling AJAX methods, if necessary, but I'd prefer to keep it all in JavaScript. (To be honest, I don't think the language is a big issue here).
So, am I in way over my head? I have a strong JavaScript background, but nothing beyond school courses when it comes to language design, parsing, etc. Is there a better way to solve this problem? Any suggestions are greatly appreciated.
I don't know a lot about grammar parsing, but something here might help.
My first thought is that your sentence syntax seems pretty consistent:
The first 3-4 words are generally VERB-text-NOUN, followed by some form of time. If the total number of forms the sentence can take is limited, you can hard-code some parsing rules.
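A rough sketch of such hard-coded rules (the keyword list and regex are illustrative, not exhaustive, and will misfire on actions that contain words like "in"):

function parseReminder(text) {
  // "every" anywhere in the string suggests a recurring task
  var recurring = /\bevery\b/i.test(text);
  // Split the action from the time expression at the first time keyword
  var match = text.match(/^(.+?)\s+((?:on|in|at|next|every|tomorrow)\b.+)$/i);
  if (!match) return null;
  return {
    Action: match[1],
    TimeText: match[2], // e.g. "on Monday, at 3 pm" -- hand this to a date parser
    Recurring: recurring
  };
}

parseReminder('Wash my car on Monday, at 3 pm');
// => { Action: 'Wash my car', TimeText: 'on Monday, at 3 pm', Recurring: false }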
I also ran across a couple of js grammar parsers that might get you somewhere:
http://jscc.jmksf.com/
http://pegjs.majda.cz/
http://www.corion.net/perl-dev/Javascript-Grammar.html
This is an interesting problem you have. Please update this with your solutions later.

JavaScript webpage version comparison

In order to expedite our 'content update review process', which is used in approving web page content for publishing, I'm looking to implement a JavaScript function that will compare two webpage versions.
So far, I've created a page that will load the content to be compared from the new and old versions of a particular page. Is there a (relatively) simple way to iterate through the HTML of each using JavaScript/jQuery and highlight what content has changed or is missing?
Since there would be so many HTML-specific details (this is essentially an HTML text comparison), is there a JavaScript library I can use?
I should add that my first choice would be to implement this in PHP. Unfortunately, we have constraints that only permit us to use limited resources such as JavaScript.
Version Control is a non-trivial problem. It's probably not something you should implement from scratch, either, if this is part of your "content update review process."
Instead, consider using a tool like Subversion, Git, or your favorite source control solution.
If you really wanna do this, you can go from something as simple as Regex matching to DOM matching. There's no "magic library" that I'm aware of that will encapsulate this for you, so it'll be work. Work that you'll probably do wrong.
Seriously consider a version control provider, or use a CMS that has built in versioning of pages. If you're feeling squirrely, check out an open source CMS (like Drupal) and try to figure out how they implement versioning, then reverse engineer/re-engineer it yourself. I hope the inefficiency in that is obvious.
I would do this in 3 steps:
1/ Segment the content into 2 arrays. For each page:
. choose a separator, like "."
. you have the content as a big string; split it and build an array
2/ Compare the arrays. Loop over the 2 arrays containing the segmented content, say A[idxA] and B[idxB]:
. if A[idxA] == B[idxB], then idxA++ and idxB++
. else, find whether there is an index where A[idxA] == B[index]
. if there is, mark all indexes between idxB and index as "B modified"
. else, mark idxA as "A modified"
3/ Display the differences. At the end you should have all the indexes where A and B are not equal. You can then join the 2 arrays after adding some markup to highlight the differences.
It is not a perfect solution; it will sometimes be wrong, but not often if you choose your separator well. If you want it perfect, you will have to test several matches and compute the number of differences in order to minimize it.
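A rough JavaScript sketch of this approach (sentence-level, using "." as the separator; the function and field names are illustrative):

function diffPages(oldText, newText) {
  // 1/ segment both pages into arrays
  var A = oldText.split('.');
  var B = newText.split('.');
  // 2/ walk both arrays, recording where they diverge
  var idxA = 0, idxB = 0, changes = [];
  while (idxA < A.length && idxB < B.length) {
    if (A[idxA] === B[idxB]) { idxA++; idxB++; continue; }
    var found = B.indexOf(A[idxA], idxB);
    if (found !== -1) {
      // segments inserted/modified in B before the next common segment
      for (var i = idxB; i < found; i++) changes.push({ side: 'B', index: i });
      idxB = found;
    } else {
      changes.push({ side: 'A', index: idxA });
      idxA++;
    }
  }
  // 3/ any leftover tail segments are also differences
  for (; idxA < A.length; idxA++) changes.push({ side: 'A', index: idxA });
  for (; idxB < B.length; idxB++) changes.push({ side: 'B', index: idxB });
  return changes;
}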

Programming tips with Japanese Language/Characters [closed]

I have an idea for a few web apps I could write to help me, and maybe others, learn Japanese better, since I am studying the language.
My problem is that the site will be mostly in English, so it needs to fluently mix in Japanese characters, usually hiragana and katakana, and later kanji. I am getting closer to accomplishing this; I have figured out that the pages and source files need to be Unicode, with UTF-8 content types.
However, my problem comes in the actual coding. What I need is to manipulate strings of text that are kana. One example is:
けす: I need to take that verb and convert it to the te-form, けして. I would prefer to do this in JavaScript, as it will help down the road with more manipulation, but if I have to I will just do DB calls and hold everything in a DB.
My question is not only how to do this in JavaScript, but also what tips and strategies there are for doing these kinds of things in other languages, too. I am hoping to get more into language-learning apps, but am lost when it comes to this.
Stick to Unicode and UTF-8 everywhere.
Stay away from the native Japanese encodings (EUC-JP, Shift_JIS, ISO-2022-JP), but be aware that you'll probably encounter them at some point if you continue.
Get familiar with a segmenter for doing complicated stuff like POS analysis, word segmentation, etc. The standard tools used by most people who do NLP (natural language processing) work on Japanese are, in order of popularity/power:
MeCab (originally on SourceForge) is awesome: it allows you to take text like,
「日本語は、とても難しいです。」
and get all sorts of great info back
kettle:~$ echo 日本語は、難しいです | mecab
日本語 名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
、 記号,読点,*,*,*,*,、,、,、
難しい 形容詞,自立,*,*,形容詞・イ段,基本形,難しい,ムズカシイ,ムズカシイ
です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS
which is basically a detailed run-down of the parts of speech, readings, pronunciations, etc. It will also do you the favor of analyzing verb tenses:
kettle:~$ echo メキシコ料理が食べたい | mecab
メキシコ 名詞,固有名詞,地域,国,*,*,メキシコ,メキシコ,メキシコ
料理 名詞,サ変接続,*,*,*,*,料理,リョウリ,リョーリ
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
食べ 動詞,自立,*,*,一段,連用形,食べる,タベ,タベ
たい 助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
EOS
However, the documentation is all in Japanese, and it's a bit complicated to set up and figure out how to format the output the way you want it. There are packages available for Ubuntu/Debian, and bindings in a bunch of languages including Perl, Python, Ruby...
Apt repos for Ubuntu:
deb http://cl.naist.jp/~eric-n/ubuntu-nlp intrepid all
deb-src http://cl.naist.jp/~eric-n/ubuntu-nlp intrepid all
Packages to install:
$ apt-get install mecab-ipadic-utf8 mecab python-mecab
should do the trick I think.
The other alternatives to MeCab are ChaSen, which was written years ago by the author of MeCab (who incidentally works at Google now), and KAKASI, which is much less powerful.
I would definitely try to avoid rolling your own conjugation routines. The problem is just that it would require tons and tons of work, which others have already done, and covering all the edge cases with rules is, at the end of the day, impossible.
MeCab is statistically driven, and trained on loads of data. It employs a sophisticated machine learning technique called conditional random fields (CRFs) and the results are really quite good.
Have fun with the Japanese. I'm not sure how good your Japanese is, but if you need help with the docs for mecab or whatever feel free to ask about that as well. Kanji can be quite intimidating at the beginning.
My question is not only how to do this in JavaScript, but also what tips and strategies there are for doing these kinds of things in other languages, too.
What you want to do is pretty basic string manipulation - apart from the missing word separators, as Barry notes, though that's not a technical problem.
Basically, for a modern Unicode-aware programming language (which JavaScript has been since version 1.3, I believe) there is no real difference between a Japanese kana or kanji and a Latin letter - they're all just characters. And a string is just, well, a string of characters.
Where it gets difficult is when you have to convert between strings and bytes, because then you need to pay attention to which encoding you are using. Unfortunately, many programmers, especially native English speakers, tend to gloss over this problem, because ASCII is the de facto standard encoding for Latin letters and other encodings usually try to be compatible with it. If Latin letters are all you need, you can get along blissfully ignorant about character encodings, believe that bytes and characters are basically the same thing, and write programs that mutilate anything that's not ASCII.
So the "secret" of Unicode-aware programming is this: learn to recognize when and where strings/characters are converted to and from bytes, and make sure that in all those places the correct encoding is used, i.e. the same one that will be used for the reverse conversion and one that can encode all the characters you're using. UTF-8 is slowly becoming the de facto standard and should normally be used wherever you have a choice.
Typical examples (non-exhaustive):
When writing source code with non-ASCII string literals (configure encoding in the editor/IDE)
When compiling or interpreting such source code (compiler/interpreter needs to know the encoding)
When reading/writing strings to a file (encoding must be specified somewhere in the API, or in the file's metadata)
When writing strings to a database (encoding must be specified in the configuration of the DB or the table)
When delivering HTML pages via a webserver (encoding must be specified in the HTML headers or the pages' meta header; forms can be even more tricky)
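In modern JavaScript, the standard TextEncoder/TextDecoder APIs make that string-to-byte boundary explicit. A tiny illustration:

// Characters only become bytes once you pick an encoding (UTF-8 here)
const bytes = new TextEncoder().encode('けす');       // Uint8Array of UTF-8 bytes
const text = new TextDecoder('utf-8').decode(bytes); // back to the string 'けす'
console.log(bytes.length); // 6 -- each kana takes three bytes in UTF-8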
What you need to do is look at the rules of grammar. Have an array of rules for each conjugation. Let's take the 〜て form, for example.
For example, in JavaScript:
function teForm(verb) {
  var stem = verb.slice(0, -1); // the verb minus its final kana
  switch (verb.slice(-1)) {
    case 'る': return stem + 'て';   // (verb - る) + て
    case 'す': return stem + 'して'; // (verb - す) + して
    // ...and so on for the other endings
  }
}
Basically, break it down into Type I, II and III verbs.
Your question is totally unclear to me. However, I have some experience working with the Japanese language, so I'll give my 2 cents.
Since Japanese texts do not feature word separation (e.g. a space character), the most important tool we had to acquire was a dictionary-based word recognizer.
Once you've got the text split, it's easier to manipulate it with "normal" tools.
There were only 2 tools that did the above, and as a by-product they also worked as taggers (i.e. noun, verb, etc.).
Edit: always use Unicode when working with languages.
If I recall correctly (and I slacked off a lot the year I took Japanese, so I could be wrong), the replacements you want to do are determined by the last symbol or two in the word. Taking your first example, any verb ending in 'す' will always have 'して' when conjugated this way. Similarly for む -> んで. You could establish a mapping of last character(s) -> conjugated form, as sketched below. You might have to account for exceptions, such as anything which conjugates to xxって.
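A sketch of such a mapping for the te-form (illustrative and incomplete: some verbs ending in る take て instead, and exceptions like 行く→行って need special-casing):

var TE_SUFFIX = {
  'す': 'して', 'く': 'いて', 'ぐ': 'いで',
  'む': 'んで', 'ぶ': 'んで', 'ぬ': 'んで',
  'う': 'って', 'つ': 'って', 'る': 'って'
};

function toTeForm(verb) {
  // Replace the final kana with its te-form suffix
  return verb.slice(0, -1) + TE_SUFFIX[verb.slice(-1)];
}

toTeForm('けす'); // => 'けして'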
As for portability between languages, you'll have to implement the logic differently based on how they work. This solution would be fairly straightforward to implement for Spanish as well, since conjugation depends on whether the verb ends in -ar, -er, or -ir (with some verbs requiring exceptions in your logic). Unfortunately, that's the limit of my multilingual skills, so I don't know how well it would do beyond those two.
Since most verbs in Japanese follow one of a small set of predictable patterns, the easiest and most extensible way to generate all the forms of a given verb is to have the verb know what conjugation it should follow, then write functions to generate each form depending on the conjugation.
In JavaScript, for example (verb.type and verb.stem are whatever your verb objects carry):
function generateDictionaryForm(verb) {
  switch (verb.type) {
    case 'ru-verb': return verb.stem + 'る';
    case 'su-verb': return verb.stem + 'す';
    case 'ku-verb': return verb.stem + 'く';
    // ...etc.
  }
}

function generatePoliteForm(verb) {
  switch (verb.type) {
    case 'ru-verb': return verb.stem + 'ります';
    case 'su-verb': return verb.stem + 'します';
    case 'ku-verb': return verb.stem + 'きます';
    // ...etc.
  }
}
Irregular verbs would of course be special-cased.
Some variant of this would work for any other fairly regular language (i.e. not English).
Try installing my gem (rom2jap). It is written in Ruby.
gem install rom2jap
Then, in a Ruby session:
require 'rom2jap'
