JavaScript utf8 encoding or regex pattern with Unicode letters support?

I was recently asked to adapt some of the inputs in a web app so that they accept Unicode letters.
The app already does some validation with regular expressions through the HTML pattern attribute, like so:
<input required="true" pattern="[a-zA-Z0-9_\-]+" type="text" name="name">
Since I have to adapt some inputs to the new requirement, I was wondering which approach would be better:
Encode/decode UTF-8 in JavaScript?
http://ecmanaut.blogspot.ca/2006/07/encoding-decoding-utf8-in-javascript.html
http://laffers.net/blog/2010/12/10/regex-match-unicode-characters/
or
Adapt the regex pattern, as suggested in "PHP Regex for Multiple Unicode Characters"?

JavaScript strings are Unicode by definition (a page served in a non-Unicode encoding is the exception, though the solution may still work there), so just add the letters you need to the regexp. If you need to add them by character code, use the \uXXXX escape with exactly four hex digits, e.g. \u00e9 for é.
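As a rough illustration of that advice, the sketch below widens the question's ASCII-only rule in two ways: by listing extra letters explicitly with \uXXXX escapes, and by using the \p{L} property escape (any Unicode letter), which needs the u flag. The variable names are made up for this example. Current browsers are understood to compile the pattern attribute as a Unicode-aware regular expression, so the same character class should also work there, but that is worth verifying in your target browsers.

```javascript
// Original rule from the question: ASCII letters, digits, "_" and "-" only.
const asciiOnly = /^[a-zA-Z0-9_\-]+$/;

// Variant 1: add specific letters by code point.
// \u00e9 is "é" and \u00f1 is "ñ" (exactly four hex digits each).
const explicitExtras = /^[a-zA-Z0-9_\-\u00e9\u00f1]+$/;

// Variant 2: accept letters from any script.
// \p{L} is only recognized when the RegExp has the u flag.
const unicodeLetters = /^[\p{L}0-9_\-]+$/u;

const name1 = "Jos\u00e9_1";  // "José_1"
const name2 = "\u03a9mega";   // "Ωmega"

console.log(asciiOnly.test(name1));      // false
console.log(explicitExtras.test(name1)); // true
console.log(unicodeLetters.test(name1)); // true
console.log(unicodeLetters.test(name2)); // true
console.log(unicodeLetters.test("a b")); // false (space is not allowed)
```

Variant 1 keeps the rule narrow and predictable; variant 2 is shorter but accepts every script, which may or may not be what the requirement means by "Unicode letters".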

I decided to go with editing the regex, since when my view is processed there is already a module that sets the pattern attribute, with the defined regex, on specific inputs.
So it is better to edit the regex pattern and set it on the inputs when the view is loaded than to do the following:
Load the view and set the pattern attribute on each input
Write JS to analyze specific inputs and see whether they contain decoded Unicode characters
If so, encode those characters while typing or before submit, since the regex pattern doesn't allow them
Essentially, I'm saving myself the time of writing useless code and, most importantly, a lot of browser processing (it would have been too much for what is needed).
I should have gone with this option from the beginning (duh!)

Related

Regex to match certain characters and exclude certain characters but without negative lookahead

I want a regex that matches all emojis (or most of them) but excludes certain characters (such as “|”|‘|’|…|—).
This regex does the job via negative lookahead:
/(?!\u201C|\u201D|\u2018|\u2019|\u2026|\u2014)(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])/
But apparently Google Scripts doesn't support this. Error:
Invalid regular expression pattern
(?!“|”|‘|’|…|—)(©|®|[ -㌀]|?[퀀-?]|?[퀀-?]|?[퀀-?])
Is there another way to achieve my goal (a regex that works with Google Script's findText)?

Option 1
Maybe,
[\u{1f300}-\u{1f5ff}\u{1f900}-\u{1f9ff}\u{1f600}-\u{1f64f}\u{1f680}-\u{1f6ff}\u{2600}-\u{26ff}\u{2700}-\u{27bf}\u{1f1e6}-\u{1f1ff}\u{1f191}-\u{1f251}\u{1f004}\u{1f0cf}\u{1f170}-\u{1f171}\u{1f17e}-\u{1f17f}\u{1f18e}\u{3030}\u{2b50}\u{2b55}\u{2934}-\u{2935}\u{2b05}-\u{2b07}\u{2b1b}-\u{2b1c}\u{3297}\u{3299}\u{303d}\u{00a9}\u{00ae}\u{2122}\u{23f3}\u{24c2}\u{23e9}-\u{23ef}\u{25b6}\u{23f8}-\u{23fa}]
might work for the emojis you want. Note that in JavaScript the \u{...} escapes are only recognized when the regex has the u flag.
Option 2
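Otherwise, you might want to subtract the undesired characters using character-class intersection, such as:
[these unicode ranges &&[^these unicodes]]
which gets pretty complicated and depends on engine support: this && syntax comes from Java-style engines, and JavaScript only offers a comparable operator under the v flag.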
Option 3
With this option you can most likely solve the problem much more simply. I guess the problem is that the undesired punctuation characters fall inside the desired Unicode ranges. Check whether that is the case. For example, in
[\u0100-\u0200]
you might have \u0150 and \u0175 as undesired characters that you want removed from the ranges you already have.
You can then simply split the range around them:
[\u0100-\u014f\u0151-\u0174\u0176-\u0200]
and the problem is solved.
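The range splitting described in Option 3 can be done mechanically. The following is a minimal sketch, not part of the original answer: classWithout is a hypothetical helper that takes a range and a list of code points to drop and returns the character class as text. It assumes code points in the Basic Multilingual Plane (four hex digits), and whether findText accepts this exact escape syntax has not been verified here.

```javascript
// Hypothetical helper: build a character class for [start, end] that
// leaves out the given code points by splitting the range around them.
// Assumes all values are below 0x10000 (four-digit \uXXXX escapes).
function classWithout(start, end, excluded) {
  const hex = (cp) => "\\u" + cp.toString(16).padStart(4, "0");
  const skip = [...new Set(excluded)]
    .filter((cp) => cp >= start && cp <= end)
    .sort((a, b) => a - b);

  const parts = [];
  let lo = start;
  // Walk the excluded points; end + 1 acts as a sentinel closing the last piece.
  for (const cp of [...skip, end + 1]) {
    if (cp > lo) {
      const hi = cp - 1;
      parts.push(hi === lo ? hex(lo) : hex(lo) + "-" + hex(hi));
    }
    lo = cp + 1;
  }
  return "[" + parts.join("") + "]";
}

// The question's [\u2000-\u3300] range minus the six punctuation marks
// it wants to exclude (curly quotes, ellipsis, em dash).
const unwanted = [0x201c, 0x201d, 0x2018, 0x2019, 0x2026, 0x2014];
const symbolClass = classWithout(0x2000, 0x3300, unwanted);
console.log(symbolClass);
// [\u2000-\u2013\u2015-\u2017\u201a-\u201b\u201e-\u2025\u2027-\u3300]

const symbolRegex = new RegExp(symbolClass);
console.log(symbolRegex.test("\u2600")); // true  (U+2600 is inside the range)
console.log(symbolRegex.test("\u2014")); // false (em dash was removed)
```

Because the result is a plain character class with no lookahead, it stays within what simpler regex engines accept.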

Requiring Letters and Numbers in form field with JavaScript

I have a form field that must contain both letters and numbers. All the solutions I have seen simply allow only letters and numbers but do not require both.
I have this regex: /^[0-9a-zA-Z]+$/ which accepts one or the other. How can I make both a requirement, meaning the text must contain at least one letter and at least one number?
Thanks my friends.
Guy

To break this down: we require at least 2 characters, a letter and a number. The pattern starts by allowing any number of alphanumeric characters. I'm not using \w because it also allows the _ character. The group is an alternation that looks for either a letter directly followed by a number or a number directly followed by a letter. After the group, anything that follows must also be alphanumeric.
/^[A-Za-z0-9]*([A-Za-z][0-9]|[0-9][A-Za-z])[A-Za-z0-9]*$/i
As a recommendation, it's always best to use a server-side language as your front-line defense when validating a form instead of a Javascript-only approach. The reasons:
Someone can disable Javascript
The server needs to be protected from malicious attack (SQL or XSS injection)
Someone can bypass your form altogether by directly linking to the handler (if you're not requiring a valid referrer)
Some browsers like Lynx do not use Javascript, so it's not user friendly for people who need to use screen reading devices
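For the client-side part, the regex from this answer can be exercised as below. This is just a small check of the pattern, with a made-up constant name; the i flag is left off because the character classes already list both cases.

```javascript
// Requires at least one letter and one digit, and nothing but letters and digits.
// It works because any alphanumeric string containing both kinds of character
// must have a letter next to a digit somewhere.
const letterAndDigit = /^[A-Za-z0-9]*([A-Za-z][0-9]|[0-9][A-Za-z])[A-Za-z0-9]*$/;

console.log(letterAndDigit.test("abc1")); // true
console.log(letterAndDigit.test("1abc")); // true
console.log(letterAndDigit.test("a1"));   // true
console.log(letterAndDigit.test("abc"));  // false (no digit)
console.log(letterAndDigit.test("123"));  // false (no letter)
console.log(letterAndDigit.test("a_1"));  // false (underscore not allowed)
```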

How to get the character corresponding to a Unicode character name?

I'm developing a Braille-to-text translator, and a nice feature to have is showing an output in Unicode's Braille patterns characters (say, kind of a Unicode Braille generator).
Since I know which dots are "enabled" in each cell ("Braille character"), it would be trivial to construct the Unicode name of the character I need: the names are of the form BRAILLE PATTERN DOTS-123456 if all six dots are enabled, or BRAILLE PATTERN DOTS-14 if only dots 1 and 4 are enabled.
Is there any simple method to get a Unicode character in Javascript from its Unicode name?
My second try will be math*ing* with the Unicode values, but I think constructing the names is pretty much straightforward.
Thanks in advance :)

JavaScript, unlike some other languages, does not have any direct way of getting a character from its Unicode name. In my full Unicode input utility, I have therefore used the brute force method of using the Unicode character data base as a text block and parsing it. You might find some better, more efficient and more maintainable tools, but if you need just some specific collections of characters as in the question, an ad hoc approach is better. In this case, you don’t even need the Unicode names as such; they would be just an intermediate step from dot patterns to characters.
Clause 15.11 in the Unicode Standard, chapter 15, describes the allocation principles for Braille symbols.
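To make the "skip the names" point concrete: the Braille Patterns block starts at U+2800, and each dot number n (1 to 8) corresponds to bit n - 1 of the offset into the block. The sketch below relies on that allocation; brailleFromDots is a hypothetical function name chosen for this example.

```javascript
// Hypothetical helper: map a list of raised dots (1-8) straight to the
// Braille pattern character, without going through the Unicode name.
function brailleFromDots(dots) {
  let mask = 0;
  for (const dot of dots) {
    if (!Number.isInteger(dot) || dot < 1 || dot > 8) {
      throw new RangeError("dot numbers must be integers from 1 to 8");
    }
    mask |= 1 << (dot - 1); // dot n sets bit n - 1
  }
  return String.fromCodePoint(0x2800 + mask);
}

console.log(brailleFromDots([1, 4]));             // "⠉" U+2809, BRAILLE PATTERN DOTS-14
console.log(brailleFromDots([1, 2, 3, 4, 5, 6])); // "⠿" U+283F, BRAILLE PATTERN DOTS-123456
console.log(brailleFromDots([]));                 // "⠀" U+2800, BRAILLE PATTERN BLANK
```

This is the "math" route the question mentions as a second try; it avoids shipping or parsing the character database at all.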

Very interesting. In my app, I use a DB lookup as you described and then use JavaScript and the HTML canvas object to construct the Braille dynamically. This has the added benefit that I can create custom ARIA tags if desired. I say this because ASCII Braille and Unicode Braille aren't readable by several, if not all, screen readers. I know VoiceOver on iOS and Mac won't read them. Something I'm working on is a way to make JS read BRL ASCII fields and Unicode and create ARIA tags so that a blind user actually knows what's going on on the web page.

help making a "universal" regex Javascript compatible

I found a very nice URL regex matcher on this site: http://daringfireball.net/2010/07/improved_regex_for_matching_urls . It states that it's free to use and that it's compatible across languages (including JavaScript). First of all, I have to escape some of the slashes to get it to compile at all. Once I do that, it works fine on Rubular.com (where I generally test regexes), with the strange side effect that each match has 5 fields: the first is the URL, and the other 4 are empty. When I put it in JS, I get the error "Invalid Group". I am using Node.js, if that makes any difference, but I'd like to understand that error. I'd also like to cut back on the unnecessary empty match fields, but I don't even know where to begin diagnosing this beast. This is what I had after escaping:
(?xi)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’] ))

Actually, you don't need the first capturing group either; it's the same as the whole match in this case, and that can always be accessed via $&. You can change all the capturing groups to non-capturing by adding ?: after the opening parens:
/\b(?:(?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i
That "invalid group" error is due to the inline modifiers (i.e., (?xi)), which, as @kirilloid observed, are not supported in JavaScript. John Gruber (the regex's author) was mistaken about that, as he was about JS supporting free-spacing mode.
Just FYI, the reason you had to escape the slashes is because you were using regex-literal notation, the most common form of which uses the forward-slash as the regex delimiter. In other words, it's the language (Ruby or JavaScript) that requires you to escape that particular character, not the regex. Some languages let you choose different regex delimiters, while others don't support regex literals at all.
But these are all language issues, not regex issues; the regex itself appears to work as advertised.
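A quick way to try this in Node.js is sketched below. It uses the fully non-capturing form of the pattern with the g flag added so that match returns every URL as a plain string; urlPattern and findUrls are names invented for this example, and the sample text is made up. The last part shows that a leading (?xi) is rejected when the regex is compiled.

```javascript
// Non-capturing form of the URL pattern; g and i are supplied as flags.
const urlPattern = /\b(?:(?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/gi;

// With the g flag and no capturing groups, match returns just the URLs.
function findUrls(text) {
  return text.match(urlPattern) || [];
}

console.log(findUrls("See http://example.com/path. It is short."));
// [ 'http://example.com/path' ]  (the trailing period is not part of the match)
console.log(findUrls("Try www.example.org/a_(b) instead"));
// [ 'www.example.org/a_(b)' ]    (balanced parentheses are kept)

// The inline modifiers from the original pattern fail at compile time.
let inlineModifiersRejected = false;
try {
  new RegExp("(?xi)a");
} catch (err) {
  inlineModifiersRejected = err instanceof SyntaxError;
}
console.log(inlineModifiersRejected); // true
```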

It seems that you copied it wrong.
http://www.regular-expressions.info/javascript.html
No mode modifiers to set matching options within the regular expression.
No regular expression comments
I.e. the (?xi) at the beginning is useless:
x serves no purpose in a compacted RegExp
i can be replaced with the i flag
All of this results in:
/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i
Tested and working in Google Chrome => should work in Node.js

How to detect what allowed character in current Regular Expression by using JavaScript?

In my web application, I built a small framework that binds model data to controls on the page. Each model property has rules such as string length, not null, and a regular expression. Before the page is submitted, the framework validates every bound control against the defined rules.
So I want to detect which characters are allowed by each regular expression rule, as in the following examples.
"^[0-9]+$" allows only digit characters like 1, 2, 3.
"^[a-zA-Z_][a-zA-Z_\-0-9]+$" allows only letters, digits, - and _ characters
However, this function should not care about grouping or the position of the allowed characters. It should only report the possible characters.
Do you have any idea how to create this function?
PS. I know it is easy to write a specific function, such as a numeric-only one that allows only digits. But I need to share/reuse the same piece of code in both the data tier (which contains all the model validators) and the UI tier without modifying anything.
Thanks

You can't solve this for the general case. Regexps don't generally ‘fail’ at a particular character, they just get to a point where they can't match any more, and have to backtrack to try another method of matching.
One could make a regex implementation that remembered which was the farthest it managed to match before backtracking, but most implementations don't do that, including JavaScript's.
A possible way forward would be to match first against ^pattern$, and if that failed match against ^pattern without the end-anchor. This would be more likely to give you some sort of match of the left hand part of the string, so you could count how many characters were in the match, and say the following character was ‘invalid’. For more complicated regexps this would be misleading, but it would certainly work for the simple cases like [a-zA-Z0-9_]+.
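The two-step check described above might look roughly like this. firstInvalidIndex is a hypothetical name, and the sketch assumes the rule is passed as a pattern body without its ^ and $ anchors so that they can be added here. As the answer warns, the result is only meaningful for simple patterns.

```javascript
// Hypothetical helper: returns -1 when the whole input matches, otherwise the
// index of the first character the pattern could not consume.
// Assumes patternBody has no ^ or $ anchors of its own, e.g. "[0-9]+".
function firstInvalidIndex(patternBody, input) {
  const full = new RegExp("^(?:" + patternBody + ")$");
  if (full.test(input)) {
    return -1; // nothing to report
  }
  // Retry without the end anchor and see how far the match got.
  const prefix = new RegExp("^(?:" + patternBody + ")").exec(input);
  return prefix ? prefix[0].length : 0;
}

console.log(firstInvalidIndex("[0-9]+", "123"));         // -1
console.log(firstInvalidIndex("[0-9]+", "12a4"));        // 2 (the "a")
console.log(firstInvalidIndex("[a-zA-Z0-9_]+", "ab c")); // 2 (the space)
console.log(firstInvalidIndex("[0-9]+", "x1"));          // 0 (fails immediately)
```

The returned index can then be used to highlight or report the offending character in the bound control.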

I must admit that I'm struggling to parse your question.
If you are looking for a regular expression that will match only if a string consists entirely of a certain collection of characters, regardless of their order, then your examples of character classes were quite close already.
For instance, ^[A-Za-z0-9]+$ will only allow strings that consist of letters A through Z (upper and lower case) and numbers, in any order, and of any length.
