Programming tips with Japanese Language/Characters [closed]

Closed 6 years ago.
I have an idea for a few web apps that would help me, and maybe others, learn Japanese better, since I am studying the language.
My problem is that the site will be mostly in English, so it needs to mix in Japanese characters fluently: usually hiragana and katakana, and later kanji. I am getting closer to accomplishing this; I have figured out that the pages and source files need to be Unicode, served with a UTF-8 content type.
However, my problem comes in the actual coding. What I need is to manipulate strings of kana text. One example is:
けす I need to take that verb and convert it to the te-form けして. I would prefer to do this in JavaScript, as it will help down the road with further manipulation, but if I have to, I will just do DB calls and hold everything in a DB.
My question is not only how to do it in JavaScript, but what are some tips and strategies for doing these kinds of things in other languages, too. I am hoping to do more language-learning apps, but I am lost when it comes to this.

Stick to Unicode and UTF-8 everywhere.
Stay away from the native Japanese encodings (EUC-JP, Shift_JIS, ISO-2022-JP), but be aware that you'll probably encounter them at some point if you continue.
Get familiar with a segmenter for doing complicated stuff like POS analysis, word segmentation, etc. The standard tools used by most people who do NLP (natural language processing) work on Japanese are, in order of popularity/power, MeCab, ChaSen, and Kakasi.
MeCab (originally on SourceForge) is awesome: it allows you to take text like,
「日本語は、とても難しいです。」
and get all sorts of great info back
kettle:~$ echo 日本語は、難しいです | mecab
日本語 名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
、 記号,読点,*,*,*,*,、,、,、
難しい 形容詞,自立,*,*,形容詞・イ段,基本形,難しい,ムズカシイ,ムズカシイ
です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS
which is basically a detailed run-down of the parts-of-speech, readings, pronunciations, etc. It will also do you the favor of analyzing verb tenses,
kettle:~$ echo メキシコ料理が食べたい | mecab
メキシコ 名詞,固有名詞,地域,国,*,*,メキシコ,メキシコ,メキシコ
料理 名詞,サ変接続,*,*,*,*,料理,リョウリ,リョーリ
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
食べ 動詞,自立,*,*,一段,連用形,食べる,タベ,タベ
たい 助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
EOS
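To use output like this from a program, each line can be split into a surface form and a list of comma-separated features. The following is only a sketch: the function name parseMecabOutput is made up for illustration, the field positions (base form at index 6, reading at index 7) are inferred from the sample output above and depend on the dictionary in use, and it assumes the surface form and the features are separated by a tab.

```javascript
// Hypothetical helper: turn raw MeCab text output into token objects.
// Assumes "surface<TAB>feature,feature,..." lines terminated by "EOS",
// with the field layout seen in the sample output above.
function parseMecabOutput(output) {
  const tokens = [];
  for (const line of output.split("\n")) {
    if (line === "" || line === "EOS") continue;
    const tab = line.indexOf("\t");
    if (tab === -1) continue; // not a token line
    const features = line.slice(tab + 1).split(",");
    tokens.push({
      surface: line.slice(0, tab),
      partOfSpeech: features[0],
      baseForm: features[6],
      reading: features[7],
    });
  }
  return tokens;
}

const sample = "食べ\t動詞,自立,*,*,一段,連用形,食べる,タベ,タベ\nEOS\n";
console.log(parseMecabOutput(sample));
```

The base form is the useful part for a learning app: it lets you look up a conjugated word in a dictionary.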
However, the documentation is all in Japanese, and it's a bit complicated to set up and to figure out how to format the output the way you want it. There are packages available for Ubuntu/Debian, and bindings for a bunch of languages, including Perl, Python, Ruby...
Apt repos for Ubuntu:
deb http://cl.naist.jp/~eric-n/ubuntu-nlp intrepid all
deb-src http://cl.naist.jp/~eric-n/ubuntu-nlp intrepid all
Packages to install:
$ apt-get install mecab-ipadic-utf8 mecab python-mecab
should do the trick, I think.
The other alternatives to MeCab are ChaSen, which was written years ago by the author of MeCab (who incidentally works at Google now), and Kakasi, which is much less powerful.
I would definitely try to avoid rolling your own conjugation routines. The problem is that it will require tons and tons of work, which others have already done, and covering all the edge cases with rules is, at the end of the day, impossible.
MeCab is statistically driven, and trained on loads of data. It employs a sophisticated machine learning technique called conditional random fields (CRFs) and the results are really quite good.
Have fun with the Japanese. I'm not sure how good your Japanese is, but if you need help with the docs for MeCab or whatever, feel free to ask about that as well. Kanji can be quite intimidating at the beginning.

My question is not only how to do it
in JavaScript, but what are some tips
and strategies for doing these kinds
of things in other languages, too.
What you want to do is pretty basic string manipulation, apart from the missing word separators, as Barry notes, though that's not a technical problem.
Basically, for a modern Unicode-aware programming language (which JavaScript has been since version 1.3, I believe) there is no real difference between a Japanese kana or kanji and a Latin letter: they're all just characters. And a string is just, well, a string of characters.
Where it gets difficult is when you have to convert between strings and bytes, because then you need to pay attention to what encoding you are using. Unfortunately, many programmers, especially native English speakers, tend to gloss over this problem because ASCII is the de facto standard encoding for Latin letters and other encodings usually try to be compatible with it. If Latin letters are all you need, then you can get along being blissfully ignorant about character encodings, believing that bytes and characters are basically the same thing, and write programs that mutilate anything that's not ASCII.
So the "secret" of Unicode-aware programming is this: learn to recognize when and where strings/characters are converted to and from bytes, and make sure that in all those places the correct encoding is used, i.e. the same one that will be used for the reverse conversion, and one that can encode all the characters you're using. UTF-8 is slowly becoming the de facto standard and should normally be used wherever you have a choice.
Typical examples (non-exhaustive):
When writing source code with non-ASCII string literals (configure encoding in the editor/IDE)
When compiling or interpreting such source code (compiler/interpreter needs to know the encoding)
When reading/writing strings to a file (encoding must be specified somewhere in the API, or in the file's metadata)
When writing strings to a database (encoding must be specified in the configuration of the DB or the table)
When delivering HTML pages via a webserver (encoding must be specified in the HTML headers or the pages' meta header; forms can be even more tricky)
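The string/byte boundary described above can be made visible with the standard TextEncoder and TextDecoder classes, available in browsers and in Node.js. This is only an illustration: the verb is the one from the question, and UTF-16LE is used as the "wrong" decoder simply because it is widely supported.

```javascript
// A string is a sequence of characters; bytes only appear once we encode it.
const verb = "けす";
const bytes = new TextEncoder().encode(verb); // TextEncoder always produces UTF-8

console.log(verb.length);  // number of UTF-16 code units in the string
console.log(bytes.length); // number of bytes after encoding

// Decoding with the same encoding gives the original string back.
const roundTrip = new TextDecoder("utf-8").decode(bytes);

// Decoding the same bytes with a different encoding silently yields garbage.
const garbled = new TextDecoder("utf-16le").decode(bytes);

console.log(roundTrip === verb);
console.log(garbled === verb);
```

The garbled case raises no error, which is why mismatched encodings tend to be noticed only when someone looks at the text.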

What you need to do is to look at the rules of grammar. Have an array of rules for each conjugation. Let's take 〜て form for example.
Pseudocode:
def te_form(verb)
  switch verb.substr(-1, 1)  # the last character of the verb
    case "る": return (verb - る) + て   # Type II verbs; Type I verbs ending in る take って
    case "す": return (verb - す) + して
etc. Basically, break it down into Type I, II and III verbs.
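Filled out in JavaScript, the rule table might look like the sketch below. The function name teForm and the numeric type argument (1, 2, 3 for Type I, II, III) are assumptions made for illustration. The verb type has to be supplied by the caller because it cannot be derived reliably from the spelling alone, and compound する verbs are not handled.

```javascript
// Te-form endings for Type I verbs, keyed by the final kana.
const TYPE_I_TE_ENDINGS = new Map([
  ["う", "って"], ["つ", "って"], ["る", "って"],
  ["む", "んで"], ["ぶ", "んで"], ["ぬ", "んで"],
  ["く", "いて"], ["ぐ", "いで"],
  ["す", "して"],
]);

// Type III (irregular) verbs, plus 行く, which breaks the く rule.
const IRREGULAR_TE_FORMS = new Map([
  ["する", "して"], ["くる", "きて"], ["来る", "来て"],
  ["いく", "いって"], ["行く", "行って"],
]);

function teForm(verb, type) {
  if (IRREGULAR_TE_FORMS.has(verb)) return IRREGULAR_TE_FORMS.get(verb);
  const stem = verb.slice(0, -1);
  const last = verb.slice(-1);
  if (type === 2 && last === "る") return stem + "て";
  if (type === 1 && TYPE_I_TE_ENDINGS.has(last)) {
    return stem + TYPE_I_TE_ENDINGS.get(last);
  }
  throw new Error("No te-form rule for " + verb + " (type " + type + ")");
}

console.log(teForm("けす", 1));
console.log(teForm("たべる", 2));
```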

your question is totally unclear to me.
however, i had some experience working with japanese language, so i'll give my 2 Cents.
since japanese texts do not feature word separation (e.g. space character), the most important tool we had to acquire is a dictionary-based word recognizer.
once you got the text split, it's easier to manipulate it with "normal" tools.
there were only 2 tools which did the above, and as a by-product they also worked as a tagger (i.e. noun, verb, etc.).
edit:
always use unicode when working with languages.

If I recall correctly (and I slacked off a lot the year I took Japanese, so I could be wrong), the replacements you want to do are determined by the last symbol or two in the word. Taking your first example, any verb ending in 'す' will always have 'して' when conjugated this way. Similarly for む -> んで. Could you maybe establish a mapping of last character(s) -> conjugated form? You might have to account for exceptions, such as anything which conjugates to xxって.
As for portability between languages, you'll have to implement the logic differently based on how they work. This solution would be fairly straightforward to implement for Spanish as well, since the conjugation depends on whether the verb ends in -ar, -er, or -ir (with some verbs requiring exceptions in your logic). Unfortunately, that's the limit of my multilingual skills, so I don't know how well it would do beyond those two.

Since most verbs in Japanese follow one of a small set of predictable patterns, the easiest and most extensible way to generate all the forms of a given verb is to have the verb know what conjugation it should follow, then write functions to generate each form depending on the conjugation.
Pseudocode:
generateDictionaryForm(verb)
case Ru-Verb: verb.stem + る
case Su-Verb: verb.stem + す
case Ku-Verb: verb.stem + く
...etc.
generatePoliteForm(verb)
case Ru-Verb: verb.stem + ります
case Su-Verb: verb.stem + します
case Ku-Verb: verb.stem + きます
...etc.
Irregular verbs would of course be special-cased.
Some variant of this would work for any other fairly regular language (i.e. not English).
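In JavaScript, the same idea could be written as a table of endings per conjugation class, with each verb carrying its stem and class. This is a sketch under assumptions: the property names stem and conjugation, the class keys "ru", "su", "ku", and the function generateForm are all invented here, and only the three classes from the pseudocode are included.

```javascript
// Endings per conjugation class, mirroring the pseudocode above.
const CONJUGATIONS = new Map([
  ["ru", { dictionary: "る", polite: "ります" }],
  ["su", { dictionary: "す", polite: "します" }],
  ["ku", { dictionary: "く", polite: "きます" }],
]);

// verb is { stem, conjugation }; form is "dictionary" or "polite".
function generateForm(verb, form) {
  const endings = CONJUGATIONS.get(verb.conjugation);
  const ending = endings ? endings[form] : undefined;
  if (typeof ending !== "string") {
    throw new Error("No rule for " + verb.conjugation + "/" + form);
  }
  return verb.stem + ending;
}

const kesu = { stem: "け", conjugation: "su" };
console.log(generateForm(kesu, "dictionary"));
console.log(generateForm(kesu, "polite"));
```

Adding a new form then means adding one property per class rather than writing a new function.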

Try installing my gem (rom2jap). It is written in Ruby.
gem install rom2jap
Open your terminal and type:
require 'rom2jap'

Related

How to check if the input is emoji without using regular expression?

I'm new to web development and am trying to check whether user input contains emojis without using regex, for performance reasons.
Is there a way to do it with JavaScript on the front end, or with Java on the backend?

Java does not identify emoji as such
The official Unicode Character Database does not identify emoji characters as such, according to Annex A of Unicode® Technical Standard #51 UNICODE EMOJI.
I suppose that is why we do not see any kind of isEmoji method on the Java 13 class, Character.
Roll-your-own
According to that Annex A, there are emoji-data data files available describing aspects of emoji characters. If you are sufficiently motivated to reliably identify emoji characters, I suggest reading that Technical Standard and considering importing the data from those files to identify the code points of emoji. There may well be ranges of numbers that the Unicode Consortium uses to cluster the emoji characters.
Keep in mind that the Unicode Consortium in recent years has been frequently adding more and more emoji. So you will be chasing a moving target, needing updates.
You may be able to narrow down your ranges with the named ranges of code points defined in Character.UnicodeBlock.
I am guessing that Character.OTHER_SYMBOL may help, as the emoji I perused are so tagged, according to the handy macOS app, UnicodeChecker.
FYI, the Unicode Consortium does publish a list of emoji: Full Emoji List, v12.0.
By the way, the CLDR published by the Unicode Consortium and used by default in recent versions of Java defines how to sort emoji. Yes, emoji have sort-order: human faces before cat faces, and so on. The code points for emoji characters are assigned rather arbitrarily, so do not go by that for sorting.
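On the JavaScript side, the roll-your-own approach could start with a regex-free range check like the sketch below. The ranges are a handful of Unicode blocks where many emoji live; the list is an assumption for illustration, not a complete or exact definition of emoji. These blocks contain some non-emoji symbols and miss emoji assigned elsewhere, so a serious implementation would load the emoji-data files instead.

```javascript
// A few Unicode blocks that contain many emoji (illustrative, incomplete).
const EMOJI_BLOCKS = [
  [0x2600, 0x26ff],   // Miscellaneous Symbols
  [0x2700, 0x27bf],   // Dingbats
  [0x1f300, 0x1f5ff], // Miscellaneous Symbols and Pictographs
  [0x1f600, 0x1f64f], // Emoticons
  [0x1f680, 0x1f6ff], // Transport and Map Symbols
  [0x1f900, 0x1f9ff], // Supplemental Symbols and Pictographs
];

function containsEmoji(text) {
  // for...of iterates by code point, so surrogate pairs stay intact.
  for (const ch of text) {
    const codePoint = ch.codePointAt(0);
    if (EMOJI_BLOCKS.some(([low, high]) => codePoint >= low && codePoint <= high)) {
      return true;
    }
  }
  return false;
}

console.log(containsEmoji("hello 😀"));
console.log(containsEmoji("こんにちは"));
```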

Instead of trying to blacklist emojis, it'd probably be easier to whitelist the characters you do want to allow. If your site is multilingual, you'd have to add the characters of the languages you want to support. It should be relatively simple to loop over each character of your input and see if it's in the list of valid characters.
You'll want to do your validation on both the frontend and the backend. You want to do the frontend so you can show feedback to the user immediately, and you have to do validation on the backend so that people can't game your system by opening their browser's console or getting creative. Frontend stuff should never be trusted by the server in general.
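A whitelist check along these lines might look as follows. The allowed ranges here (printable ASCII plus the Japanese blocks) are only an example and would have to be replaced with whatever your site supports; findDisallowed is a hypothetical name. Returning the rejected characters rather than a boolean makes it easy to show the user what to fix.

```javascript
// Example allow-list: printable ASCII plus common Japanese blocks (an assumption).
const ALLOWED_RANGES = [
  [0x0020, 0x007e], // printable ASCII
  [0x3040, 0x309f], // Hiragana
  [0x30a0, 0x30ff], // Katakana
  [0x4e00, 0x9fff], // CJK Unified Ideographs
];

// Returns every character of text that falls outside the allowed ranges.
function findDisallowed(text, ranges = ALLOWED_RANGES) {
  const rejected = [];
  for (const ch of text) {
    const codePoint = ch.codePointAt(0);
    const allowed = ranges.some(([low, high]) => codePoint >= low && codePoint <= high);
    if (!allowed) rejected.push(ch);
  }
  return rejected;
}

console.log(findDisallowed("abc けす"));
console.log(findDisallowed("hi 😀"));
```

The same table can be ported to the backend so both sides enforce identical rules.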

Dealing with illegal characters (apostrophes) in a TXT file with Node.js

I am relying on .txt files that are sent to my Node.js application from external sources. They sometimes contain what I would class as "illegal" characters, such as typographic apostrophes and commas, as a result of copying and pasting from web pages and programs such as Microsoft Word.
How can I get Node.js, or plain JavaScript, to replace incorrectly formatted characters such as these apostrophes with correctly formatted ones, or to strip out any illegal characters altogether?
Here is an example from a web page and shown in PasteBin:
Resilience is what happens when we’re able to move forward even when things don’t fit together the way we expect.
And tolerances are an engineer’s measurement of how well the parts meet spec. (The word ‘precision’ comes to mind). A 2018 Lexus is better than 1968 Camaro because every single part in the car fits together dramatically better. The tolerances are more narrow now.
One way to ensure that things work out the way you hope is to spend the time and money to ensure that every part, every form, every worker meets spec. Tighten your spec, increase precision and you’ll discover that systems become more reliable.
The other alternative is to embrace the fact that nothing is ever exactly on spec, and to build resilient systems.
You’ll probably find that while precision feels like the way forward, resilience, the ability to thrive when things go wrong, is a much safer bet.
The trap? Hoping for one, the other or both but not doing the work to make it likely. What will you do when it doesn’t work?
Neither resilience nor tolerances get better on their own.
https://pastebin.com/uJ7GAKk4
Copied from the following URL and pasted into Notepad and saved
https://seths.blog/storyoftheweek/

You could use a RegExp to remove the unwanted characters:
// text is the pasted text
var filtered = text.replace(/[',]/g, '');
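If the aim is to convert the typographic punctuation in the sample text to plain ASCII rather than delete it, a replacement table is one option. This is a sketch: the function name normalizePunctuation and the particular set of characters are choices made for illustration, and it assumes the file was read as UTF-8, for example with fs.readFileSync(path, "utf8").

```javascript
// Typographic characters commonly introduced by word processors and web pages.
const REPLACEMENTS = new Map([
  ["\u2018", "'"],   // left single quotation mark
  ["\u2019", "'"],   // right single quotation mark (curly apostrophe)
  ["\u201C", '"'],   // left double quotation mark
  ["\u201D", '"'],   // right double quotation mark
  ["\u2013", "-"],   // en dash
  ["\u2014", "-"],   // em dash
  ["\u2026", "..."], // horizontal ellipsis
]);

function normalizePunctuation(text) {
  let result = "";
  for (const ch of text) {
    result += REPLACEMENTS.has(ch) ? REPLACEMENTS.get(ch) : ch;
  }
  return result;
}

console.log(normalizePunctuation("we\u2019re able to (The word \u2018precision\u2019)"));
```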

JavaScript equivalent of C#'s Char.IsSymbol

I'm trying to strip all 'Unicode Symbols' from a string. That is, keeping all multilingual characters but removing dingbats, arrows, and all of that stuff.
C# has a very handy function called Char.IsSymbol that can be run on all characters of a string, stripping the character when the function returns true.
I've been searching for a way to do something similar in JavaScript. If it's a regex, then how can I compile a list of all the Unicode ranges of the symbol characters? I looked at XRegExp but couldn't find something that only filters symbols.

XRegExp does have support for what you're looking for - http://xregexp.com/plugins/#unicode
You'd probably match either for \pL or \pS. You can find a nice list of the typical Unicode categories at http://www.regular-expressions.info/unicode.html#category
Overall, Unicode is quite tricky. It gives you plenty of opportunities for trouble, especially with software that isn't fully Unicode compatible (sadly, this includes JavaScript; see https://mathiasbynens.be/notes/javascript-unicode for a nice set of examples). This is further exacerbated by the fact that JS often runs with double encoding (HTML+JS, and there are worse cases as well). Somebody will probably find a way to bypass your checks, but I'm afraid there's no easy way to prevent that. Just be on the lookout :)
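JavaScript engines that support ES2018 Unicode property escapes can match the Symbol category natively with \p{S} and the u flag, without XRegExp. The sketch below assumes such an engine; stripSymbols is a name chosen here. Note that the S category also covers math and currency symbols, so characters like + and $ are removed too.

```javascript
// Remove every character in the Unicode "Symbol" general category
// (math, currency, modifier, and other symbols).
function stripSymbols(text) {
  return text.replace(/\p{S}/gu, "");
}

console.log(stripSymbols("日本語✂→abc"));
```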

How would I create a Text to Html parser? [duplicate]

Closed 10 years ago.
Edit: I recently learned about a project called CommonMark, which
correctly identifies and deals with the ambiguities in the original
Markdown specification. http://commonmark.org/ It has great C# library
support.
You can find the syntax here.
The source that follows with the download is written in Perl, which I have no intentions of honoring. It is riddled with regular expressions, and it relies on MD5 hashes to escape certain characters. Something is just wrong about that!
I'm about to hard-code a parser for Markdown. What is your experience with this?
If you don't have anything meaningful to say about the actual parsing of Markdown, spare me the time. (This might sound harsh, but yes, I'm looking for insight, not a solution, that is, a third-party library).
To help a bit with the answers, regular expressions are meant to identify patterns! NOT to parse an entire grammar. That people consider doing so is foobar.
If you think about Markdown, it's fundamentally based around the concept of paragraphs.
As such, a reasonable approach might be to split the input into paragraphs.
There are many kinds of paragraphs, for example, heading, text, list, blockquote, and code.
The challenge is thus to identify these paragraphs and in what context they occur.
I'll be back with a solution, once I find it's worthy to be shared.
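The paragraph-splitting idea could start out like the sketch below, which cuts the input at blank lines and labels each block by its leading characters. It is deliberately naive and the names splitBlocks and classifyBlock are invented for illustration: underlined headings, nested blocks, and lists that contain blank lines are not handled.

```javascript
// Label a block by looking at how its lines begin.
function classifyBlock(block) {
  const lines = block.split("\n");
  if (lines.every((line) => line.startsWith("    ") || line.startsWith("\t"))) {
    return "code";
  }
  const first = lines[0].trimStart();
  if (first.startsWith("#")) return "heading";
  if (first.startsWith(">")) return "blockquote";
  if (/^(?:[*+-]|\d+\.)\s/.test(first)) return "list";
  return "text";
}

// Split the document at runs of blank lines and classify each piece.
function splitBlocks(markdown) {
  return markdown
    .replace(/\r\n?/g, "\n")
    .split(/\n(?:[ \t]*\n)+/)
    .filter((block) => block.trim() !== "")
    .map((block) => ({ type: classifyBlock(block), text: block }));
}

const doc = "# Title\n\nSome text\nover two lines.\n\n> A quote\n\n- item one\n- item two\n\n    code();";
console.log(splitBlocks(doc).map((block) => block.type));
```

Each block type can then get its own small parser for inline content.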

The only Markdown implementation I know of that uses an actual parser is John MacFarlane's peg-markdown. Its parser is based on a Parsing Expression Grammar parser generator called peg.
EDIT: Mauricio Fernandez recently released his Simple Markup Markdown parser, which he wrote as part of his OcsiBlog Weblog Engine. Because the parser is written in OCaml, it is extremely simple and short (268 SLOC for the parser, 43 SLOC for the HTML emitter), yet blazingly fast (20% faster than discount, which is written in hand-optimized C, and six hundred times faster than BlueCloth, which is written in Ruby), despite the fact that it isn't even optimized for performance yet. Because it is only intended for internal use by Mauricio himself for his weblog, there are a few deviations from the official Markdown specification, but Mauricio has created a branch which reverts most of those changes.

I released a new parser-based Markdown Java implementation last week, called pegdown.
pegdown uses a PEG parser to first build an abstract syntax tree, which is subsequently written out to HTML. As such it is quite clean and much easier to read, maintain and extend than a regex-based approach.
The PEG grammar is based on John MacFarlane's C implementation "peg-markdown".
Maybe something of interest to you...

If I were to try to parse Markdown (and its extension, Markdown Extra), I think I would use a state machine and parse it one char at a time, linking together internal structures representing bits of text as I go along and then, once everything is parsed, generating the output from the objects strung together.
Basically, I'd build a mini-DOM-like tree as I read the input file.
To generate output, I would just traverse the tree and emit HTML or anything else (PS, LaTeX, RTF,...)
Things that can increase complexity:
The fact that you can mix HTML and markdown, although the rule could be easy to implement: just ignore anything that's between two balanced tags and output it verbatim.
URLs and notes can have their reference at the bottom of the text. Using data structures for hyperlinks could simply record something like:
[my text to a link][linkkey]
results in a structure like:
URLStructure:
| InnerText : "my text to a link"
| Key : "linkkey"
| URL : <null>
Headers can be defined with an underline, that could force us to use a simple data structure for a generic paragraph and modify its properties as we read the file:
ParagraphStructure:
| InnerText : the current paragraph text
| (beginning of line until end of line).
| HeadingLevel : <null> or 1-4 when we can assess
| that paragraph heading level, if any.
Anyway, just some thoughts.
I'm sure that there are many small details to take care of and I'm pretty sure that Regexes could become handy during the process.
After all, they were meant to process text.
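The deferred link resolution described above could look something like this sketch: link structures are created with a null URL while reading, and filled in once all the definitions at the bottom of the text have been collected. The function name resolveLinks and the property names are assumptions that echo the URLStructure outline, and regular expressions are used only to recognize the two patterns.

```javascript
function resolveLinks(markdown) {
  const definitions = new Map();
  const bodyLines = [];

  // First pass: pull out "[key]: url" definition lines.
  for (const line of markdown.split("\n")) {
    const definition = /^\s*\[([^\]]+)\]:\s*(\S+)\s*$/.exec(line);
    if (definition) definitions.set(definition[1].toLowerCase(), definition[2]);
    else bodyLines.push(line);
  }

  // Second pass: record each "[text][key]" reference with an unknown URL.
  const links = [];
  for (const match of bodyLines.join("\n").matchAll(/\[([^\]]+)\]\[([^\]]*)\]/g)) {
    links.push({ innerText: match[1], key: match[2] || match[1], url: null });
  }

  // Finally, fill in the URLs now that every definition has been seen.
  for (const link of links) {
    const url = definitions.get(link.key.toLowerCase());
    if (url !== undefined) link.url = url;
  }
  return links;
}

const text = "See [my text to a link][linkkey] for details.\n\n[linkkey]: http://example.com/";
console.log(resolveLinks(text));
```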

I'd probably read the syntax specification enough times to know it, and get a feel for how to parse it.
Reading the existing parser code is of course brilliant, both to see what seems to be the main source of complexity, and if any special clever tricks are being used. The use of MD5 checksumming seems a bit weird, but I haven't studied the code enough to understand why it's being done. A comment in a routine called _EscapeSpecialChars() states:
We're replacing each such character with its corresponding MD5 checksum value;
this is likely overkill, but it should prevent us from colliding with the escape
values by accident.
Replacing a single character by a full MD5 does seem extravagant, but perhaps it really makes sense.
Of course, it'd be clever to consider writing a "true" grammar for a tool such as Flex, to get out of the regex bog.

If Perl isn't your thing, there are Markdown implementations in at least 10 other languages. They probably don't all have 100% compatibility, but tend to be pretty close.

MarkdownPapers is another Java implementation whose parser is defined in a JavaCC grammar.

If you are using a programming language that has more than three other
users, you should be able to find a library to parse it for you. A
quick Google-ing reveals libraries for CL, Haskell, Python,
JavaScript, Ruby, and so on. It is highly unlikely that you will need
to reinvent this wheel.
If you really have to write it from scratch, I recommend writing a
proper parser. With this technique, you won't have to escape things
with MD5 hashes. (I agree that if you have to do something like this,
it's time to reconsider your design.)

There are libraries available in a number of languages, including PHP, Ruby, Java, C#, and JavaScript. I'd suggest looking at some of these for ideas.
The best way to implement it depends on which language you wish to use; there will be idiomatic and non-idiomatic ways to do it.
Regexes work in Perl because Perl and regex are best friends.

Markdown is a JAWL (just another wiki language).
There are plenty of open source wikis out there whose parser code you can examine. Most use regexes.
Check out the ScrewTurn wiki; it has an interesting multi-pass formatter pipeline, a very nice technique. See /core/Formatter.cs and /core/FormatterPipeline.cs.
It is best to use or join an existing project; these sorts of things are always much harder than they appear.

Here you can find a JavaScript implementation of Markdown. It also relies heavily on regular expressions, as this is just the fastest and easiest way to parse the text.
But it spares the MD5 part.
I cannot help directly with the coding of the parsing, but maybe this link can help you one way or another.

Syntax / Logical checker In Javascript?

I'm building a solution for a client which allows them to create very basic code.
I've done some basic syntax validation, but I'm stuck at variable verification.
I know JSLint does this using JavaScript, and I was wondering if anyone knew of a good way to do this.
So for example say the user wrote the code
moose = "barry"
base = 0
if(moose == "barry"){base += 100}
Then I'm trying to find a way to verify that the "if" expression has the correct syntax, that the variable moose has been initialized, and so on,
but I want to do this without scanning character by character.
The code is a mini language built just for this application, so it is very basic and doesn't need to manage memory or anything like that.
I had thought about splitting first by carriage return and then by space, but there is nothing to say the user won't write something like moose="barry" or if(moose=="barry"),
and there is nothing to say the user won't keep the result of a condition inline.
Obviously compilers and interpreters do this on a much more extensive scale, but I'm not sure whether they do it character by character and, if they do, how they have optimized it.
(The other option is to send it back to PHP to process, which would then relieve the browser of responsibility.)
Any suggestions?
Thanks
The use case is limited, and the syntax will never be extended in this case. The language is a simple scripting language that enables the client to create a unique cost based on their users' input. The end result will be processed by PHP regardless, to ensure the calculation can't be adjusted by the end user and to ensure there is some consistency.
So for example, say there is a base cost of £1.00
and there is a field on the form called "Additional Cost". The language will allow them to manipulate the base cost relative to the "additional cost" field.
So
base = 1;
if(additional > 100 && additional < 150){base += 50}
elseif(additional == 150){base *= 150}
else{base += additional;}
This is a basic example of how the language would be used.
Thank you for all your answers.
I've investigated writing a parser, and creating one would be far more complex than is required.
Having run several tests with thousands of lines of code, I found that processing character by character takes only a few seconds, even on a single-core P4 with 512 MB of memory (which is far less than the customer uses).
I've decided to build a PHP-based syntax checker which will check the input and convert the variables etc. into valid PHP code while it's checking (so that it's ready to be called later without recompilation). Using this instead of JavaScript seems more appropriate and will allow for more complex code to arise without hindering the validation process.
It's only taken an hour, and I have code which is able to check the validity of an if statement and isn't confused by nested ifs, spaces, or odd expressions. There is very little left to be checked, whereas a parser and a full-blown scripting language would have taken a lot longer.
You've all given me a lot to think about, and I've rated the relevant answers. Thank you.

If you really want to do this — and by that I mean if you really want your software to work properly and predictably, without a bunch of weird "don't do this" special cases — you're going to have to write a real parser for your language. Once you have that, you can transform any program in your language into a data structure. With that data structure you'll be able to conduct all sorts of analyses of the code, including procedures that at least used to be called use-definition and definition-use chain analysis.
If you concoct a "programming language" that enables some scripting in an application, then no matter how trivial you think it is, somebody will eventually write a shockingly large program with it.
I don't know of any readily-available parser generators that generate JavaScript parsers. Recursive descent parsers are not too hard to write, but they can get ugly to maintain and they make it a little difficult to extend the syntax (esp. if you're not very experienced crafting the original version).

You might want to look at JS/CC, which is a parser generator that generates a parser for a grammar, in JavaScript. You will need to figure out how to describe your language using BNF or EBNF. Also, JS/CC has its own syntax (which is somewhat close to actual BNF/EBNF) for specifying the grammar. Given the grammar, JS/CC will generate a parser for it.
Your other option, as Pointy said, is to write your own lexer and recursive-descent parser from scratch. Once you have a BNF/EBNF, it's not that hard. I recently wrote a parser from an EBNF in JavaScript (the grammar was pretty simple, so it wasn't that hard to write one; YMMV).
To address your comments about it being "client specific". I will also add my own experience here. If you're providing a scripting language and a scripting environment, there is no better route than an actual parser.
Handling special cases through a bunch of if-elses is going to be horribly painful and a maintenance nightmare. When I was a freshman in college, I tried to write my own language. This was before I knew anything about recursive-descent parsers, or just parsers in general. I figured out by myself that code can be broken down into tokens. From there, I wrote an extremely unwieldy parser using a bunch of if-elses, and also splitting the tokens by spaces and other characters (exactly what you described). The end result was terrible.
Once I read about recursive-descent parsers, I wrote a grammar for my language and easily created a parser in a tenth of the time it took me to write my original parser. Seriously, if you want to save yourself a lot of pain, write an actual parser. If you go down your current route, you're going to be fixing issues forever. You're going to have to handle cases where people put the space in the wrong place, or perhaps they have one too many (or one too few) spaces. The only other alternative is to provide an extremely rigid structure (i.e., you must have exactly x number of spaces following this statement), which is liable to make your scripting environment extremely unattractive. An actual parser will automatically fix all these problems.
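The first step of any such parser, breaking the source into tokens, already removes the whitespace problem from the question. The sketch below is an assumption-laden illustration for the mini language shown there: the token set, the keyword list, and the names tokenize and findUndefinedVariables are all invented. The variable check is intentionally crude: it treats a name followed by = as a definition, so it misses cases such as x = x + 1 and ignores which branch an assignment sits in.

```javascript
// Split source text into number, string, name, and operator tokens.
function tokenize(source) {
  const pattern = /\s*(?:(\d+(?:\.\d+)?)|"([^"]*)"|([A-Za-z_]\w*)|(==|!=|<=|>=|&&|\|\||[-+*\/]=|[-+*\/=<>!(){};]))/y;
  const tokens = [];
  const end = source.trimEnd().length;
  while (pattern.lastIndex < end) {
    const start = pattern.lastIndex;
    const match = pattern.exec(source);
    if (!match) throw new SyntaxError("Unexpected input near offset " + start);
    if (match[1] !== undefined) tokens.push({ type: "number", value: match[1] });
    else if (match[2] !== undefined) tokens.push({ type: "string", value: match[2] });
    else if (match[3] !== undefined) tokens.push({ type: "name", value: match[3] });
    else tokens.push({ type: "op", value: match[4] });
  }
  return tokens;
}

const KEYWORDS = new Set(["if", "elseif", "else"]);

// Report names that are used before any "name = ..." assignment.
// predefined lists names supplied from outside, such as form fields.
function findUndefinedVariables(source, predefined = []) {
  const defined = new Set(predefined);
  const problems = [];
  const tokens = tokenize(source);
  tokens.forEach((token, index) => {
    if (token.type !== "name" || KEYWORDS.has(token.value)) return;
    const next = tokens[index + 1];
    if (next && next.type === "op" && next.value === "=") defined.add(token.value);
    else if (!defined.has(token.value)) problems.push(token.value);
  });
  return problems;
}

const program = 'moose="barry"\nbase = 0\nif(moose=="barry"){base += 100}';
console.log(findUndefinedVariables(program));
console.log(findUndefinedVariables('if(mose=="barry"){base+=100}'));
```

A recursive-descent parser would consume the same token list, so this step is not wasted if the checker later grows into a full parser.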

JavaScript has a function called eval.
var code = 'alert(1);';
eval(code);
This shows an alert. You can use eval to execute basic code.
