I'm creating a bot that uses an API, and the API's item list is case-sensitive and only allows exact matches.
For example, it contains the word "ENCHANTED_GLISTERING_MELON". It is all caps, uses underscores, and has complicated spelling, and the site rejects anything that is not an exact match. That is not very user-friendly. Is there any way that, when a user inputs something, it will auto-capitalize, replace spaces with underscores, and, most importantly, check for misspellings and pick the closest word? I have a dictionary of what the site accepts.
It is not a simple task to disallow words with typos.
To avoid reinventing the wheel, I would recommend using one of the open-source engines like Rasa to enable natural language processing in your chat.
https://rasa.com/
However, it's not so easy to use if you are having trouble parsing a string in JavaScript.
For word similarity, look into the Levenshtein distance algorithm and string-matching packages such as:
https://www.npmjs.com/package/autocorrect
https://www.npmjs.com/package/string-similarity
Getting the closest string match
For a simple solution you can just replace your disallowed words:
How to replace several words in javascript
Also, if it's just a filter for bad words in your chat, you can use an existing library like bad-words:
https://www.npmjs.com/package/bad-words
And you can capitalize everything for your particular strange case:
'enchanted glistering melon'.trim().replace(/\s+/g, '_').toUpperCase()
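Putting the normalization and the Levenshtein idea together, a minimal sketch could look like this. The ACCEPTED list, the function names, and the maxDistance threshold of 3 are all assumptions made for illustration; the real dictionary would come from whatever the site accepts.

```javascript
// Hypothetical list of names the site accepts; replace with the real dictionary.
const ACCEPTED = [
  'ENCHANTED_GLISTERING_MELON',
  'ENCHANTED_MELON',
  'GLISTERING_MELON',
  'ENCHANTED_CARROT',
];

// Upper-case the input and turn runs of whitespace into single underscores.
function normalize(input) {
  return input.trim().replace(/\s+/g, '_').toUpperCase();
}

// Dynamic-programming Levenshtein distance (insert, delete, substitute).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Return the closest accepted name, or null if nothing is within maxDistance edits.
function closestAccepted(input, maxDistance = 3) {
  const wanted = normalize(input);
  let best = null;
  let bestDistance = Infinity;
  for (const name of ACCEPTED) {
    const d = levenshtein(wanted, name);
    if (d < bestDistance) {
      best = name;
      bestDistance = d;
    }
  }
  return bestDistance <= maxDistance ? best : null;
}

console.log(closestAccepted('enchanted glistening melon'));
```

Returning null for distant inputs lets the bot ask the user to retry instead of silently guessing a wrong item.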
I'm new to web development and just trying to check if the user input contains emojis without using regex for performance reasons.
Is there a way to do it with JavaScript on the front end or by using Java on the backend?
Java does not identify emoji as such
The official Unicode Character Database does not identify emoji characters as such, according to Annex A of Unicode® Technical Standard #51 UNICODE EMOJI.
I suppose that is why we do not see any kind of isEmoji method on the Java 13 class, Character.
Roll-your-own
According to that Annex A, there are emoji-data data files available describing aspects of emoji characters. If you are sufficiently motivated to reliably identify emoji characters, I suggest reading that Technical Standard and considering importing the data from those files to identify the code points of emoji. There may well be ranges of numbers that the Unicode Consortium uses to cluster the emoji characters.
Keep in mind that the Unicode Consortium in recent years has been frequently adding more and more emoji. So you will be chasing a moving target, needing updates.
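As a rough front-end illustration of the range-table idea without a regex, something like the following could work. The EMOJI_RANGES table here is a small, hand-picked, incomplete subset of Unicode blocks that contain many emoji; it is an assumption for the sketch, not a substitute for a table generated from the emoji-data files. Not every code point in these blocks is an emoji, and many emoji live outside them.

```javascript
// Illustrative, incomplete subset of Unicode blocks that contain emoji.
// A real table would be generated from the Unicode emoji-data files.
const EMOJI_RANGES = [
  [0x1F300, 0x1F5FF], // Miscellaneous Symbols and Pictographs
  [0x1F600, 0x1F64F], // Emoticons
  [0x1F680, 0x1F6FF], // Transport and Map Symbols
  [0x1F900, 0x1F9FF], // Supplemental Symbols and Pictographs
];

function containsEmoji(text) {
  // for...of iterates by code point, so surrogate pairs stay together.
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (EMOJI_RANGES.some(([lo, hi]) => cp >= lo && cp <= hi)) {
      return true;
    }
  }
  return false;
}

console.log(containsEmoji('hello \u{1F600}'));
console.log(containsEmoji('hello'));
```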
You may be able to narrow down your ranges with the named ranges of code points defined in Character.UnicodeBlock.
I am guessing that Character.OTHER_SYMBOL may help, as the emoji I perused are so tagged, according to the handy macOS app, UnicodeChecker.
FYI, the Unicode Consortium does publish a list of emoji: Full Emoji List, v12.0.
By the way, the CLDR published by the Unicode Consortium and used by default in recent versions of Java defines how to sort emoji. Yes, emoji have sort-order: human faces before cat faces, and so on. The code points for emoji characters are assigned rather arbitrarily, so do not go by that for sorting.
Instead of trying to blacklist emojis, it'd probably be easier to whitelist the characters you do want to allow. If your site is multilingual, you'd have to add the characters of the languages you want to support. It should be relatively simple to loop over each character of your input and see if it's in the list of valid characters.
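A minimal sketch of that allow-list loop might look like this. The ALLOWED set is a made-up example covering ASCII letters, digits, space, and a little punctuation; a multilingual site would need a much larger set.

```javascript
// Hypothetical allow-list; extend it with the characters your site supports.
const ALLOWED = new Set(
  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,!?-'"
);

// Return every character of the input that is not on the allow-list.
function findDisallowed(text) {
  const bad = [];
  // for...of walks the string by code point, so an emoji is one item.
  for (const ch of text) {
    if (!ALLOWED.has(ch)) {
      bad.push(ch);
    }
  }
  return bad;
}

console.log(findDisallowed('Hello, world!'));
console.log(findDisallowed('Hi \u{1F600}'));
```

Returning the offending characters, rather than just true or false, makes it easy to show the user exactly what to remove.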
You'll want to do your validation on both the frontend and the backend. You want to do the frontend so you can show feedback to the user immediately, and you have to do validation on the backend so that people can't game your system by opening their browser's console or getting creative. Frontend stuff should never be trusted by the server in general.
I am trying to come up with a regex that will match all code in a web page unless it contains a certain phrase.
I'm testing it on this string:
<html> This is a web page </html>
It should look at the entire string before the word 'is', see that 'is' is present, and return a non-match. The negative lookahead portion of this will be much more specific in my implementation, I just wanted to give a simple example.
The regex I'm trying to use looks like this:
^[\s\S]+(?!is)[\s\S]+$
This consists of the beginning of the string:
(^)
literally anything:
([\s\S]+)
negative lookahead:
((?!is))
another literally anything:
([\s\S]+)
and end of string:
($)
I'm using a scanning tool that takes a Selenium authentication script. When the tool runs the script, it uses a regex to find a value on the web page after authentication to verify that the login script ran correctly. This regex value is different for every single site I scan, but all of the sites it visits use the same authentication method, which always shows the same page if authentication fails. So basically I need to come up with a regex that will fail if this bad login page is displayed, and I'm currently trying to employ a negative lookahead to accomplish this. The scanner is kind of dumb, so the regex is the only way I can interact with this authentication verification process.
The first alternative matches non-whitespace characters, while the second alternative matches any whitespace not followed by 'is' plus a whitespace.
^(\S|\s(?!is\s))*$
Taken together, as above, they should achieve the desired result.
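To see the difference, here is a small JavaScript comparison on the test string from the question. The third pattern, with a single lookahead anchored at the start, is a common alternative that does not come from this thread and is included only for comparison.

```javascript
const page = '<html> This is a web page </html>';

// Regex from the question: the lookahead only has to succeed at ONE position,
// and backtracking always finds one, so the string still matches.
const original = /^[\s\S]+(?!is)[\s\S]+$/;

// Regex from the answer: every whitespace character must not be followed
// by "is" plus another whitespace character.
const tempered = /^(\S|\s(?!is\s))*$/;

// Alternative (not from the thread): one lookahead at the start that scans
// the whole input for the forbidden word.
const anchored = /^(?![\s\S]*\bis\b)[\s\S]+$/;

console.log(original.test(page)); // true: does not reject the page
console.log(tempered.test(page)); // false: page is rejected
console.log(anchored.test(page)); // false: page is rejected
```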
I was able to solve my problem and wanted to share it here in case anyone ever stumbles across this thread. I ended up using a negative lookbehind to achieve the targeted functionality. Since the regex is being executed in multiline mode, I had to make my match string look at the very last bit of text on the page. Then the lookbehind would search everything that came before the match string. Since this was happening at the very end of the page, the engine couldn't advance any further and the result from the last line was retained.
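The post does not include the final regex, so the following is only a guess at what such a lookbehind pattern could look like, written in JavaScript. The failure text is a made-up placeholder, and whether the scanner's own regex engine accepts an unbounded lookbehind is an assumption that would need checking.

```javascript
// Hypothetical text that appears only on the failed-login page.
// The pattern matches the closing tag at the very end of the page, but only
// if the failure text does not occur anywhere before it.
const verifier = /(?<!Invalid username or password[\s\S]*)<\/html>\s*$/;

const goodPage = '<html><body>Welcome back</body></html>';
const badPage = '<html><body>Invalid username or password</body></html>';

console.log(verifier.test(goodPage)); // true: login looks successful
console.log(verifier.test(badPage)); // false: bad login page detected
```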
I oftentimes need to check for certain strings on a webpage, that are not necessarily spelled absolutely the same. For example, sometimes I screen a page for a string like google, then on other pages I want it to match against, let's say: gooogle or Google Inc..
Where to start in terms of pattern-matching and algorithms?
For the theory, look up edit distance:
https://en.wikipedia.org/wiki/Edit_distance
and n-gram:
https://en.wikipedia.org/wiki/N-gram
Here is an actual library that provides this functionality:
fuzzyset.js
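To make the n-gram idea concrete, here is a small self-contained sketch that scores two strings by their shared character bigrams using the Dice coefficient. It is an illustration only and is not how fuzzyset.js is implemented; the function names are made up for the example.

```javascript
// Split a string into overlapping character bigrams, ignoring case.
function bigrams(text) {
  const s = text.toLowerCase();
  const grams = [];
  for (let i = 0; i < s.length - 1; i++) {
    grams.push(s.slice(i, i + 2));
  }
  return grams;
}

// Dice coefficient over bigrams: 2 * shared / (count in a + count in b).
function diceSimilarity(a, b) {
  const gramsA = bigrams(a);
  const gramsB = bigrams(b);
  if (gramsA.length === 0 || gramsB.length === 0) {
    return a.toLowerCase() === b.toLowerCase() ? 1 : 0;
  }
  const counts = new Map();
  for (const g of gramsA) {
    counts.set(g, (counts.get(g) || 0) + 1);
  }
  let shared = 0;
  for (const g of gramsB) {
    const n = counts.get(g) || 0;
    if (n > 0) {
      shared++;
      counts.set(g, n - 1); // each bigram in a can be matched only once
    }
  }
  return (2 * shared) / (gramsA.length + gramsB.length);
}

for (const candidate of ['gooogle', 'Google Inc.', 'yahoo']) {
  console.log(candidate, diceSimilarity('google', candidate).toFixed(3));
}
```

A score threshold, for example treating anything above roughly 0.6 as a match, would have to be tuned on real pages.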
I have a form and I need to require both letters and numbers. All the solutions I have seen simply allow only letters and numbers but do not require both.
I have this regex: /^[0-9a-zA-Z]+$/ which allows one or the other. How can I make both a requirement, meaning the text must contain at least one letter and at least one number?
Thanks my friends.
Guy
To break this down, we're requiring at least 2 characters, a letter and a number. In the code we start with the possibility of an alpha-numeric character. I'm not using \w because it also allows _ characters. In the group we have an or that looks for either a letter before a number, or a number before a letter. Then after the group we're requiring if anything exists that it also be alpha-numeric.
/^[A-Za-z0-9]*([A-Za-z][0-9]|[0-9][A-Za-z])[A-Za-z0-9]*$/i
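A quick way to check the behavior is to run the pattern against a few inputs. The second pattern below, built from two lookaheads, is a common alternative that is not part of this answer; it is shown only for comparison.

```javascript
// Pattern from the answer: somewhere a letter and a digit must be adjacent.
const answerRegex = /^[A-Za-z0-9]*([A-Za-z][0-9]|[0-9][A-Za-z])[A-Za-z0-9]*$/;

// Alternative (not from the thread): one lookahead per requirement,
// then the alphanumeric-only body.
const lookaheadRegex = /^(?=.*[A-Za-z])(?=.*[0-9])[A-Za-z0-9]+$/;

for (const input of ['abc', '123', 'abc123', '1a', 'a_1']) {
  console.log(input, answerRegex.test(input), lookaheadRegex.test(input));
}
```

Any alphanumeric string that contains both a letter and a digit must switch between the two somewhere, so both patterns accept the same inputs.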
As a recommendation, it's always best to use a server-side language as your front-line defense when validating a form instead of a JavaScript-only approach. The reasons:
Someone can disable JavaScript
The server needs to be protected from malicious attacks (SQL injection or XSS)
Someone can bypass your form altogether by directly linking to the handler (if you're not requiring a valid referrer)
Some browsers like Lynx do not use JavaScript, so it's not user-friendly for people who need to use screen reading devices
I'm developing a Braille-to-text translator, and a nice feature to have is showing an output in Unicode's Braille patterns characters (say, kind of a Unicode Braille generator).
Since I know the dots that are "enabled" in each cell ("Braille character"), it would be trivial to construct the Unicode name of the character I need (they are of the form BRAILLE PATTERN DOTS-123456 if they are all enabled, or BRAILLE PATTERN DOTS-14 if only dots 1 and 4 are enabled).
Is there any simple method to get a Unicode character in JavaScript from its Unicode name?
My second try will be math*ing* with the Unicode values, but I think constructing the names is pretty much straightforward.
Thanks in advance :)
JavaScript, unlike some other languages, does not have any direct way of getting a character from its Unicode name. In my full Unicode input utility, I have therefore used the brute-force method of using the Unicode character database as a text block and parsing it. You might find some better, more efficient, and more maintainable tools, but if you need just some specific collections of characters as in the question, an ad hoc approach is better. In this case, you don’t even need the Unicode names as such; they would be just an intermediate step from dot patterns to characters.
Clause 15.11 in the Unicode Standard, chapter 15, describes the allocation principles for Braille symbols.
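Going straight from dot numbers to characters can be sketched as follows. It relies on the layout of the Braille Patterns block, which starts at U+2800 and uses one bit per dot (dot 1 is bit 0, dot 2 is bit 1, and so on up to dot 8). The function name brailleFromDots is made up for this example.

```javascript
// Convert a list of raised dot numbers (1-8) into the matching
// Unicode Braille pattern character.
function brailleFromDots(dots) {
  let mask = 0;
  for (const dot of dots) {
    if (!Number.isInteger(dot) || dot < 1 || dot > 8) {
      throw new RangeError(`Invalid dot number: ${dot}`);
    }
    mask |= 1 << (dot - 1); // dot n sets bit n-1
  }
  return String.fromCodePoint(0x2800 + mask);
}

console.log(brailleFromDots([1, 4])); // BRAILLE PATTERN DOTS-14, U+2809
console.log(brailleFromDots([1, 2, 3, 4, 5, 6])); // BRAILLE PATTERN DOTS-123456, U+283F
```

An empty list yields U+2800, the blank Braille pattern, which is convenient for spaces between cells.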
Very interesting. In my app, I use a DB lookup as you described and then use JavaScript and the HTML canvas object to dynamically construct the Braille. This has the added benefit that I can create custom ARIA tags if desired. I say this because ASCII Braille and Unicode Braille aren't readable formats for several, if not all, screen readers. I know VoiceOver on iOS and Macs won't read it. Something I'm working on is a way to make JS read BRL ASCII fields and Unicode and create ARIA tags so that a blind user actually knows what's going on on the webpage.