How to check if the input contains emoji without using a regular expression? - javascript

I'm new to web development and am trying to check whether user input contains emoji, without using regex, for performance reasons.
Is there a way to do this with JavaScript on the front end, or with Java on the backend?

Java does not identify emoji as such
The official Unicode Character Database does not identify emoji characters as such, according to Annex A of Unicode® Technical Standard #51, UNICODE EMOJI.
I suppose that is why we do not see any kind of isEmoji method on the Java 13 Character class.
Roll-your-own
According to that Annex A, there are emoji-data files available describing aspects of emoji characters. If you are sufficiently motivated to reliably identify emoji characters, I suggest reading that Technical Standard and considering importing the data from those files to identify the code points of emoji. There may well be ranges of numbers that the Unicode Consortium uses to cluster the emoji characters.
Keep in mind that the Unicode Consortium in recent years has been frequently adding more and more emoji. So you will be chasing a moving target, needing updates.
You may be able to narrow down your ranges with the named ranges of code points defined in Character.UnicodeBlock.
I am guessing that Character.OTHER_SYMBOL may help, as the emoji I perused are so tagged, according to the handy macOS app, UnicodeChecker.
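In JavaScript, a roll-your-own check along those lines might look like the sketch below; the two ranges are only illustrative, and a real check needs the full (and regularly updated) list from the emoji-data files:
function containsEmoji(text) {
  // Illustrative ranges only; the complete set comes from the
  // Unicode emoji-data files and grows with every release.
  const ranges = [
    [0x1F300, 0x1F5FF], // Miscellaneous Symbols and Pictographs
    [0x1F600, 0x1F64F], // Emoticons
  ];
  for (const ch of text) {            // for..of iterates by code point
    const cp = ch.codePointAt(0);
    if (ranges.some(([lo, hi]) => cp >= lo && cp <= hi)) return true;
  }
  return false;
}
containsEmoji("hello 😀"); // true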
FYI, the Unicode Consortium does publish a list of emoji: Full Emoji List, v12.0.
By the way, the CLDR published by the Unicode Consortium and used by default in recent versions of Java defines how to sort emoji. Yes, emoji have sort-order: human faces before cat faces, and so on. The code points for emoji characters are assigned rather arbitrarily, so do not go by that for sorting.

Instead of trying to blacklist emoji, it'd probably be easier to whitelist the characters you do want to allow. If your site is multilingual, you'd have to add the characters of the languages you want to support. It should be relatively simple to loop over each character of the input and see if it's in the list of valid characters.
You'll want to do your validation on both the frontend and the backend. Validate on the frontend so you can show feedback to the user immediately, and validate on the backend so that people can't game your system by opening their browser's console or getting creative. In general, the server should never trust anything coming from the frontend.
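A minimal sketch of that whitelist loop, assuming (purely for illustration) that only printable ASCII is allowed:
function isAllowed(text) {
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    // Printable ASCII only; add a range per language you support.
    if (cp < 0x20 || cp > 0x7E) return false;
  }
  return true;
}
isAllowed("hello");  // true
isAllowed("hi 😀");  // false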

Related

How can I know how many characters I use with Azure Cognitive Services Speech Synthesis (TTS)?

I use the free tier of Azure Cognitive Services Speech Synthesis. How can I know how many characters I have used, so I can estimate the cost of my project?
With a free account I can use 0.5 million characters of a Neural voice per month. I want to know, during my experiments, how many characters I have used at any given moment.
From my understanding, you are looking to identify the number of characters consumed by the service at a given point in time.
Unfortunately, AFAIK there is no metric to track that from the Azure Portal. However, you can maintain the count locally on your end, or in a central location you can query yourself: add additional logic to your code to maintain the metric (a rough sketch follows the list below).
Characters are counted based on the conditions below (documented here):
Text passed to the text-to-speech service in the SSML body of the request
All markup within the text field of the request body in the SSML format, except for <speak> and <voice> tags
Letters, punctuation, spaces, tabs, markup, and all white-space characters
Every code point defined in Unicode
Exception: for the Chinese, Japanese, and Korean languages, each character is counted as two characters for billing.
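As a rough way to keep that running count yourself, a minimal sketch (my own helper, not an Azure SDK call; it applies the documented rule that everything in the SSML body counts except the <speak> and <voice> tags themselves, and it ignores the CJK double-counting rule):
function estimateBillableChars(ssml) {
  // Drop only the <speak> and <voice> tags; everything else in the
  // request body counts, including other markup.
  const counted = ssml.replace(/<\/?(?:speak|voice)\b[^>]*>/g, "");
  return [...counted].length; // spreading a string counts code points
}
let totalUsed = 0; // persist this wherever you keep your metrics
totalUsed += estimateBillableChars(
  "<speak version='1.0'><voice name='en-US-JennyNeural'>Hello!</voice></speak>"
);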

Is there any way to simulate a "Did you mean" in JavaScript?

So I'm creating a bot with an API, and the list is pretty case-sensitive, only allowing exact matches.
For example, I have the word "ENCHANTED_GLISTERING_MELON". It's all caps, has underscores, and has complicated spelling, and the site does not accept anything that is not an exact match. That is not very user-friendly. Is there any way to make it so that when a user inputs something, it is auto-capitalized, spaces are replaced with underscores, and, most importantly, misspellings are checked and the closest word is chosen? I have a dictionary of what the site accepts.
It is not a simple task to handle disallowed words that contain typos.
To avoid reinventing the wheel, I would recommend using one of the open-source engines like RASA to add natural-language processing to your chat:
https://rasa.com/
However, it's not so easy to use if you're having trouble just parsing a string in JavaScript.
For word similarity, look at the Levenshtein distance algorithm (a hand-rolled sketch follows the links below):
https://www.npmjs.com/package/autocorrect
https://www.npmjs.com/package/string-similarity
Getting the closest string match
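If you'd rather not pull in a package, a hand-rolled sketch of the same idea (the levenshtein and didYouMean names are mine, not from the libraries above):
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                  // deletion
        dp[i][j - 1] + 1,                                  // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

function didYouMean(input, dictionary) {
  const normalized = input.trim().replace(/ /g, "_").toUpperCase();
  let best = null, bestDistance = Infinity;
  for (const word of dictionary) {
    const d = levenshtein(normalized, word);
    if (d < bestDistance) { bestDistance = d; best = word; }
  }
  return best;
}

didYouMean("enchanted glistering mellon", ["ENCHANTED_GLISTERING_MELON"]);
// "ENCHANTED_GLISTERING_MELON"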
For a simpler solution, you can just replace the disallowed words:
How to replace several words in javascript
Also, if it's just a filter for bad words in your chat, you can use an existing library like bad-words:
https://www.npmjs.com/package/bad-words
And you can capitalize everything for your particular case:
'enchanted glistering melon'.trim().replace(/ /g,'_').toLocaleUpperCase()

Find out whether text is Unicode or not using JavaScript

We are designing an SMS send form where users can type any characters they want. The system should determine what type of characters were typed and, based on that, decide the type of the message and charge the user SMS credits. This form is going to be used all over the world.
I am trying this using JavaScript. I count the number of characters and loop through each character. If any of the characters is double-byte (> 255), I determine the text is Unicode; otherwise it is plain ASCII text.
I am not sure whether I am doing this the right way.
Recently one of the users tried the text below and claimed that the system did not charge it as Unicode. I was surprised to find that all of these characters are less than 255, and now I doubt whether my logic is correct.
Sævar Davíðssson. ÆÝÐÞ
Can someone guide me please.
Because of how various sms systems handle characters, you might have to create a whitelist in order to know what people will or will not get charged for.
Some carriers even charge differently depending on whether they're going to other carriers as well, so it can get fairly complex.
And if that wasn't bad enough, some carriers don't use pre-defined standards for their character sets. And several (especially internationally) use different and conflicting standards for character encoding.
Especially with JavaScript, if you don't have the same character encoding as the carrier, you'll run into problems figuring out what's legal to use.
The original ASCII standard only defines 7-bit characters. There are a variety of 8-bit character encodings expanding on ASCII. One of the most popular ones is ISO 8859-1 ("latin-1", also mostly coincident with the windows codepage 1252). This adds a lot of western european language characters to the 7-bit ASCII set, including the ones in your example string.
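To make that concrete, here is a rough classifier in the spirit of the question's loop; note that a real billing check should test against the carrier's actual character set (typically GSM 03.38) rather than raw code point values:
function classify(text) {
  let max = 0;
  for (const ch of text) {
    max = Math.max(max, ch.codePointAt(0));
  }
  if (max <= 127) return "ascii";    // 7-bit ASCII
  if (max <= 255) return "latin-1";  // 8-bit, e.g. ISO 8859-1
  return "unicode";                  // needs a Unicode SMS encoding
}
classify("Sævar Davíðssson. ÆÝÐÞ"); // "latin-1", not plain ASCII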

How to get the character corresponding to a Unicode character name?

I'm developing a Braille-to-text translator, and a nice feature to have is showing an output in Unicode's Braille patterns characters (say, kind of a Unicode Braille generator).
Since I know the dots that are "enabled" in each cell ("Braille character"), it would be trivial to construct the Unicode name of the character I need (the names are of the form BRAILLE PATTERN DOTS-123456 if all dots are enabled, or BRAILLE PATTERN DOTS-14 if only dots 1 and 4 are enabled).
Is there any simple method to get a Unicode character in JavaScript from its Unicode name?
My second try will be doing math with the Unicode values, but constructing the names is pretty much straightforward.
Thanks in advance :)
JavaScript, unlike some other languages, does not have any direct way of getting a character from its Unicode name. In my full Unicode input utility, I have therefore used the brute-force method of taking the Unicode character database as a text block and parsing it. You might find better, more efficient, and more maintainable tools, but if you need just a specific collection of characters, as in this question, an ad hoc approach is better. In this case, you don't even need the Unicode names as such; they would be just an intermediate step from dot patterns to characters.
Clause 15.11 in the Unicode Standard, chapter 15, describes the allocation principles for Braille symbols.
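Those allocation principles make the math easy: the Braille Patterns block starts at U+2800, and dot n sets bit n-1 of the offset, so you can skip the names entirely. A small sketch:
function brailleFromDots(dots) {
  // Dot n corresponds to bit (n - 1) of the offset from U+2800.
  const offset = dots.reduce((mask, dot) => mask | (1 << (dot - 1)), 0);
  return String.fromCodePoint(0x2800 + offset);
}
brailleFromDots([1, 4]);             // "⠉" (BRAILLE PATTERN DOTS-14, U+2809)
brailleFromDots([1, 2, 3, 4, 5, 6]); // "⠿" (BRAILLE PATTERN DOTS-123456)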
Very interesting. In my app, I use a DB lookup as you described and then use JavaScript and the HTML canvas object to dynamically construct the Braille. This has the added benefit that I can create custom ARIA tags if desired. I mention this because ASCII Braille and Unicode aren't readable by several, if not all, screen readers. I know VoiceOver on iOS and macOS won't read it. Something I'm working on is a way to make JS read Braille ASCII fields and Unicode and create ARIA tags, so that a blind user actually knows what's going on on the webpage.

Programming tips with Japanese Language/Characters [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I have an idea for a few web apps I'd like to write to help me, and maybe others, learn Japanese better, since I am studying the language.
My problem is that the site will be mostly in English, so it needs to fluently mix in Japanese characters, usually hiragana and katakana, and later kanji. I am getting closer to accomplishing this; I have figured out that the pages and source files need to be Unicode, with UTF-8 content types.
However, my problem comes in the actual coding. What I need is to manipulate strings of text that are kana. One example is:
けす. I need to take that verb and convert it to the te-form, けして. I would prefer to do this in JavaScript, as it will help down the road with more manipulation, but if I have to, I will just make DB calls and hold everything in a DB.
My question is not only how to do this in JavaScript, but also what are some tips and strategies for doing these kinds of things in other languages, too. I am hoping to get more into language-learning apps, but am lost when it comes to this.
Stick to Unicode and utf-8 everywhere.
Stay away from the native Japanese encodings: euc-jp, shiftjis, iso-2022-jp, but be aware that you'll probably encounter them at some point if you continue.
Get familiar with a segmenter for doing complicated stuff like POS analysis and word segmentation. The standard tools most people doing NLP (natural language processing) work on Japanese use are, in order of popularity/power:
MeCab (originally on SourceForge) is awesome: it allows you to take text like,
「日本語は、とても難しいです。」
and get all sorts of great info back
kettle:~$ echo 日本語は、難しいです | mecab
日本語 名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
、 記号,読点,*,*,*,*,、,、,、
難しい 形容詞,自立,*,*,形容詞・イ段,基本形,難しい,ムズカシイ,ムズカシイ
です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS
which is basically a detailed run-down of the parts-of-speech, readings, pronunciations, etc. It will also do you the favor of analyzing verb tenses,
kettle:~$ echo メキシコ料理が食べたい | mecab
メキシコ 名詞,固有名詞,地域,国,*,*,メキシコ,メキシコ,メキシコ
料理 名詞,サ変接続,*,*,*,*,料理,リョウリ,リョーリ
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
食べ 動詞,自立,*,*,一段,連用形,食べる,タベ,タベ
たい 助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
EOS
However, the documentation is all in Japanese, and it's a bit complicated to set up and figure out how to format the output the way you want. There are packages available for Ubuntu/Debian, and bindings in a bunch of languages including Perl, Python, and Ruby...
Apt-repos for ubuntu:
deb http://cl.naist.jp/~eric-n/ubuntu-nlp intrepid all
deb-src http://cl.naist.jp/~eric-n/ubuntu-nlp intrepid all
Packages to install:
$ apt-get install mecab-ipadic-utf8 mecab python-mecab
should do the trick I think.
The other alternatives to MeCab are ChaSen, which was written years ago by the author of MeCab (who incidentally works at Google now), and Kakasi, which is much less powerful.
I would definitely try to avoid rolling your own conjugation routines. The problem is that it would require tons and tons of work, which others have already done, and covering all the edge cases with rules is, at the end of the day, impossible.
MeCab is statistically driven, and trained on loads of data. It employs a sophisticated machine learning technique called conditional random fields (CRFs) and the results are really quite good.
Have fun with the Japanese. I'm not sure how good your Japanese is, but if you need help with the docs for mecab or whatever feel free to ask about that as well. Kanji can be quite intimidating at the beginning.
My question is not only how to do it in javascript, but what are some tips and strategies for doing these kinds of things in other languages too.
What you want to do is pretty basic string manipulation, apart from the missing word separators, as Barry notes, though that's not a technical problem.
Basically, for a modern Unicode-aware programming language (which JavaScript has been since version 1.3, I believe) there is no real difference between a Japanese kana or kanji, and a latin letter - they're all just characters. And a string is just, well, a string of characters.
Where it gets difficult is when you have to convert between strings and bytes, because then you need to pay attention to what encoding you are using. Unfortunately, many programmers, especially native English speakers, tend to gloss over this problem because ASCII is the de facto standard encoding for latin letters, and other encodings usually try to be compatible with it. If latin letters are all you need, you can get along being blissfully ignorant about character encodings, believe that bytes and characters are basically the same thing, and write programs that mutilate anything that's not ASCII.
So the "secret" of Unicode-aware programming is this: learn to recognize when and where strings/characters are converted to and from bytes, and make sure that in all those places the correct encoding is used, i.e. the same one that will be used for the reverse conversion, and one that can encode all the characters you're using. UTF-8 is slowly becoming the de facto standard and should normally be used wherever you have a choice.
Typical examples (non-exhaustive):
When writing source code with non-ASCII string literals (configure encoding in the editor/IDE)
When compiling or interpreting such source code (compiler/interpreter needs to know the encoding)
When reading/writing strings to a file (encoding must be specified somewhere in the API, or in the file's metadata)
When writing strings to a database (encoding must be specified in the configuration of the DB or the table)
When delivering HTML pages via a webserver (encoding must be specified in the HTTP headers or the page's meta header; forms can be even more tricky)
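In Node.js, for example, the string/byte boundary is explicit: you get raw bytes back unless you name an encoding (the file names here are hypothetical):
const fs = require("fs");
const bytes = fs.readFileSync("greeting.txt");        // a Buffer of raw bytes
const text = fs.readFileSync("greeting.txt", "utf8"); // decoded into a string
fs.writeFileSync("copy.txt", text, "utf8");           // encoded back to bytes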
What you need to do is look at the rules of grammar and have a rule for each conjugation. Let's take the 〜て form for example.
Pseudocode (written here as runnable JavaScript):
function teForm(verb) {
  switch (verb.slice(-1)) {                      // the last kana picks the rule
    case "る": return verb.slice(0, -1) + "て";   // e.g. 見る -> 見て
    case "す": return verb.slice(0, -1) + "して"; // e.g. けす -> けして
    // ...and so on for the other endings
  }
}
etc. Basically, break it down into Type I, II and III verbs.
Your question is not entirely clear to me; however, I have some experience working with the Japanese language, so I'll give my 2 cents.
Since Japanese texts do not feature word separation (e.g. a space character), the most important tool we had to acquire was a dictionary-based word recognizer.
Once you have the text split into words, it's easier to manipulate it with "normal" tools.
There were only two tools that did the above, and as a by-product they also worked as taggers (i.e. noun, verb, etc.).
Edit:
Always use Unicode when working with languages.
If I recall correctly (and I slacked off a lot the year I took Japanese, so I could be wrong), the replacements you want to make are determined by the last symbol or two in the word. Taking your first example, any verb ending in す will always have して when conjugated this way, and similarly む -> んで. You could establish a mapping from the last character(s) to the conjugated form. You might have to account for exceptions, such as anything which conjugates to xxって.
As for portability between languages, you'll have to implement the logic differently based on how they work. This solution would be fairly straightforward to implement for Spanish as well, since the conjugations depend on whether the verb ends in -ar, -er, or -ir (with some verbs requiring exceptions in your logic). Unfortunately, that's the limit of my multi-lingual skills, so I don't know how well it would do beyond those two.
Since most verbs in Japanese follow one of a small set of predictable patterns, the easiest and most extensible way to generate all the forms of a given verb is to have the verb know what conjugation it should follow, then write functions to generate each form depending on the conjugation.
Pseudocode (written here as runnable JavaScript, with each verb object carrying its stem and conjugation class):
function generateDictionaryForm(verb) {
  switch (verb.type) {
    case "ru": return verb.stem + "る";
    case "su": return verb.stem + "す";
    case "ku": return verb.stem + "く";
    // ...etc.
  }
}
function generatePoliteForm(verb) {
  switch (verb.type) {
    case "ru": return verb.stem + "ります";
    case "su": return verb.stem + "します";
    case "ku": return verb.stem + "きます";
    // ...etc.
  }
}
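For example, with the verb from the question (hypothetical verb objects matching the sketch above):
generateDictionaryForm({ type: "su", stem: "け" }); // "けす"
generatePoliteForm({ type: "su", stem: "け" });     // "けします"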
Irregular verbs would of course be special-cased.
Some variant of this would work for any other fairly regular language (i.e. not English).
Try installing my gem (rom2jap). It is in Ruby.
gem install rom2jap
Then open a Ruby console (irb) and type:
require 'rom2jap'
