Why does my JavaScript RegExp not work as expected


I am writing a password screen. The password must be between 8 and 20 characters long and must contain at least one alphabetic character, at least one numeric character, and at least one special character from [!@#$%^&*].
I cobbled together this regular expression, which appeared to work in C#, but when I started rewriting the code for JavaScript validation, the regular expression flagged what I thought were valid passwords as invalid.
Here is the regular expression as I assign it to RegExp:
var regExPatt = new RegExp('^(?=(?:.*[a-zA-Z]){1})(?=(?:.*\d){1})(?=(?:.*[!@@#$%^&*]){1})(?!.*\s).{8,20}$');
Nota bene: The double @@ is there to get a literal @ symbol into the RegExp; otherwise Razor tries to treat parts of the string as Razor variables and things go sideways fast.
Where did I go wrong with this regular expression? I know it is fairly complicated.
Passwords that work:
freddy1234%
freddy123$5
freddy12#45
freddy1#345
freddy!2345
Passwords that do not work:
test1234%
wilma1234%
Any ideas?

JavaScript developers should have knowledge about
RegExp object description
Regular Expressions chapter in the JavaScript Guide
Developers who want to use positive or negative lookahead should take into account that this requires JavaScript 1.5, as can be read on the page New in JavaScript 1.5. That should be no problem nowadays, as this is a very old version released in November 2000 and all browsers in use today support JavaScript 1.5.
Lookbehind is not yet supported by JavaScript at all (as of JavaScript 1.8.5).
A list of the JavaScript versions and which browser supports which JavaScript version can be found on Wikipedia page about JavaScript.
New in JavaScript contains the links to the pages explaining what was added in which version of JavaScript.
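Beyond the references above, one likely culprit in the question's snippet is the string literal itself. Inside a JavaScript string, \d and \s are not recognized escapes, so the backslash is dropped and the RegExp constructor receives a plain d and s. That would explain why only passwords that contain a "d" and no "s" are accepted. The following is a minimal sketch, written as plain JavaScript (so without the Razor escaping) and with the redundant {1} quantifiers dropped; the variable names are illustrative only:

```javascript
// Pattern passed as a plain string: '\d' collapses to 'd' and '\s' to 's'.
const fromString = new RegExp('^(?=.*[a-zA-Z])(?=.*\d)(?=.*[!@#$%^&*])(?!.*\s).{8,20}$');

// Same pattern with the backslashes doubled so they survive the string literal.
const escaped = new RegExp('^(?=.*[a-zA-Z])(?=.*\\d)(?=.*[!@#$%^&*])(?!.*\\s).{8,20}$');

// A regex literal avoids the string-escaping layer altogether.
const literal = /^(?=.*[a-zA-Z])(?=.*\d)(?=.*[!@#$%^&*])(?!.*\s).{8,20}$/;

console.log(fromString.source);             // contains 'd' and 's' where \d and \s were intended
console.log(fromString.test('test1234%'));  // false: no letter "d", and it contains an "s"
console.log(escaped.test('test1234%'));     // true
console.log(literal.test('wilma1234%'));    // true
```

Printing the source property of a RegExp built from a string is a quick way to check what the engine actually received.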

Related

JavaScript equivalent of C#'s Char.IsSymbol

I'm trying to strip all 'Unicode Symbols' from a string. That is, keeping all multilingual characters but removing dingbats, arrows, and all of that stuff.
C# has a very handy function called Char.IsSymbol that can be run on all characters of a string, stripping the character when the functions returns true.
I've been searching on doing something similar in JavaScript. If it's a regex then how can I compile a list of all the unicode ranges of the symbol characters? I looked at XRegExp but couldn't find something that only filters symbols.

XRegExp does have support for what you're looking for - http://xregexp.com/plugins/#unicode
You'd probably match either for \pL or \pS. You can find a nice list of the typical unicode categories in http://www.regular-expressions.info/unicode.html#category
Overall, Unicode is quite tricky. It gives plenty of opportunities for trouble, especially with software that isn't fully Unicode compatible (sadly, this includes JavaScript; see https://mathiasbynens.be/notes/javascript-unicode for a nice set of examples). This is further exacerbated by the fact that JS often runs with double encoding (HTML+JS, and there are worse cases as well). Somebody will probably find a way to bypass your checks, but I'm afraid there's no easy way to prevent that. Just be on the lookout :)
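In engines that implement ES2018 Unicode property escapes (an assumption about the runtime; older browsers will reject the pattern), a similar filter can be sketched without a library. The helper name stripSymbols is made up for this illustration. \p{S} covers the Unicode Symbol categories (Sm, Sc, Sk, So), which is roughly what Char.IsSymbol checks, so characters such as $ and + are removed too:

```javascript
// Remove every character whose Unicode general category is Symbol (S*).
// The u flag is required for \p{...} to be recognized.
function stripSymbols(text) {
  return text.replace(/\p{S}/gu, '');
}

console.log(stripSymbols('a→b✈c'));  // abc
console.log(stripSymbols('héllo'));  // héllo (letters are kept)
```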

Do browsers support different HTML5 pattern regexp features?

I had a simple RegEx pattern in a customer-facing payment form on our website:
<input type="text" pattern="(|\$)[0-9]*(|\.[0-9]{2})"
title="Please enter a valid number in the amount field" required>
It was added to help quickly notify customers when they fail to enter a valid number, before hitting the server-side validation.
After four customers called in complaining that they were unable to submit the form because their browser continually told them the amount they had entered was incorrect, I did some digging and discovered that IE10+ doesn't like the back of that expression--any amount entered that did not include a decimal point was accepted, anything with a decimal was rejected. The pattern works in my development environment (Chrome 30+) and in Opera 12, but Firefox 27 won't validate it at all.
I read the specs, which just says:
If specified, the attribute's value must match the JavaScript Pattern production. [ECMA262]
And since the only browsers that support pattern are capable of supporting ECMAScript 5, I figure this includes the full support of all Javascript regular expressions.
Where can I learn more about the quirks between pattern support in the different browsers?

The problem seems to be an IE-only bug. Your link to the spec is pretty much dead on; here's the bit IE is missing:
... except that the pattern attribute is matched against the entire value, not just any subset (somewhat as if it implied a ^(?: at the start of the pattern and a )$ at the end)
You can actually fix this bug by doing just that to your own pattern - namely:
^(?:(|\$)[0-9]*(|\.[0-9]{2}))$
This is working for me in IE9 and IE10, as well as Chrome. See updated fiddle
The technical reason this happens is a bit more complex:
If you read the ECMAScript 5.1 spec, section 15.10.2.3 describes how alternations should be evaluated. Basically, each 'part' of the | is evaluated left to right until one is found that matches. That value is assumed unless there is a problem in the 'sequel', in which case the other possibilities in the alternation are evaluated.
What it seems IE is doing is matching the beginning of your string using the empty parts of your alternations, and it works: \$[digits][empty] matches the start of $12.12 up to the decimal point. IE's regex engine (correctly) says that this is a match, because a substring matched, and it's not been told to check to the end of the string.
Once the regex engine (without the anchors to force the whole string to match) returns true, that there was a match, some engineer at Microsoft took a shortcut and told the pattern attribute to also check that the matched part equals the whole string, and there's where the failure comes from. The engine only matched part of the string, even though it could have matched more, so the secondary check fails, thinking there is extraneous input at the end.
This case is subtle, so I'm not too surprised it hasn't been caught before. I have created a bug report https://connect.microsoft.com/IE/feedback/details/836117/regex-bug-in-pattern-validator to see if there is a response from Microsoft.
The reason this relates to the ECMAScript spec is that if the engine had been told to match the whole string, it would have backtracked when it hit the decimal point, tried the second part of the alternation, found and matched (\.[0-9]{2}), and the whole thing would have worked.
Now, for some workarounds:
Add the anchors ^(?: and )$ to your patterns
Don't use empty alternations. Personally, I like using the optional quantifier ? instead for these cases. Your pattern becomes (\$?)[0-9]*(\.[0-9]{2})? and will work because ? is a greedy match, so the engine will consume the whole string if possible, whereas alternation takes the first alternative that matches
Swap the order on your alternations. If the longer string is tested first, it will match first, and be used first. This has come up in other languages - Why order matters in this RegEx with alternation?
PS: Be careful with the * for your digits. Right now, "$" is a valid match because * allows for 0 digits. My recommendation for your full regex would be (\$)?(\d+)(\.\d{2})?
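The anchoring workaround can be sketched as a small helper that mimics what the pattern attribute is specified to do. The function name matchesWhole is hypothetical, and because the patterns are passed as strings, their backslashes are doubled:

```javascript
// Wrap a pattern so that it must match the entire value, as the spec
// describes for the pattern attribute: ^(?: ... )$
function matchesWhole(pattern, value) {
  return new RegExp('^(?:' + pattern + ')$').test(value);
}

const original = '(|\\$)[0-9]*(|\\.[0-9]{2})';
const recommended = '(\\$)?(\\d+)(\\.\\d{2})?';

console.log(matchesWhole(original, '$12.12'));     // true
console.log(matchesWhole(original, '$'));          // true, because * allows zero digits
console.log(matchesWhole(recommended, '$'));       // false
console.log(matchesWhole(recommended, '$12.12'));  // true
console.log(matchesWhole(recommended, '12'));      // true
```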

How to get the character corresponding to a Unicode character name?

I'm developing a Braille-to-text translator, and a nice feature to have is showing an output in Unicode's Braille patterns characters (say, kind of a Unicode Braille generator).
Since I know the dots that are "enabled" in each cell ("Braille character"), it would be trivial to construct the Unicode name of the character I need (they are of the form BRAILLE PATTERN DOTS-123456 if all dots are enabled, or BRAILLE PATTERN DOTS-14 if only dots 1 and 4 are enabled).
Is there any simple method to get a Unicode character in Javascript from its Unicode name?
My second try will be math*ing* with the Unicode values, but I think constructing the names is pretty much straightforward.
Thanks in advance :)

JavaScript, unlike some other languages, does not have any direct way of getting a character from its Unicode name. In my full Unicode input utility, I have therefore used the brute force method of using the Unicode character data base as a text block and parsing it. You might find some better, more efficient and more maintainable tools, but if you need just some specific collections of characters as in the question, an ad hoc approach is better. In this case, you don’t even need the Unicode names as such; they would be just an intermediate step from dot patterns to characters.
Clause 15.11 in the Unicode Standard, chapter 15, describes the allocation principles for Braille symbols.
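The ad hoc approach can be sketched as follows. In the Braille Patterns block, which starts at U+2800, dot n corresponds to bit n-1 of the offset from the start of the block, so the character can be computed directly from the enabled dots without going through the name. The function name brailleFromDots is invented for this example:

```javascript
// Build a Braille pattern character from a list of enabled dots (1 to 8).
function brailleFromDots(dots) {
  let mask = 0;
  for (const dot of dots) {
    if (!Number.isInteger(dot) || dot < 1 || dot > 8) {
      throw new RangeError('Braille dots are numbered 1 to 8, got: ' + dot);
    }
    mask |= 1 << (dot - 1); // dot 1 is bit 0, dot 2 is bit 1, and so on
  }
  return String.fromCharCode(0x2800 + mask);
}

console.log(brailleFromDots([1, 4]));             // U+2809, BRAILLE PATTERN DOTS-14
console.log(brailleFromDots([1, 2, 3, 4, 5, 6])); // U+283F, BRAILLE PATTERN DOTS-123456
```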

Very interesting. In my app, I use a DB lookup as you described and then use JavaScript and the HTML canvas object to dynamically construct the Braille. This has the added benefit that I can create custom ARIA tags if desired. I say this because ASCII Braille and Unicode aren't readable formats for several, if not all, screen readers. I know VoiceOver on iOS and Macs won't read it. Something I'm working on is a way to make JS read BRL ASCII fields and Unicode and create ARIA tags so that a blind user actually knows what's going on on the webpage.

help making a "universal" regex Javascript compatible

I found a very nice URL regex matcher on this site: http://daringfireball.net/2010/07/improved_regex_for_matching_urls . It states that it's free to use and that it's cross language compatible (including Javascript). First of all, I have to escape some of the slashes to get it to compile at all. When I do that, it works fine on Rubular.com (where I generally test regexes), with the strange side effect that each match has 5 fields: 1 is the url, and the extra 4 are empty. When I put this in JS, I get the error "Invalid Group". I am using Node.js if that makes any difference, but I wish I could understand that error. I'd like to cut back on the unnecessary empty match fields, but I don't even know where to begin diagnosing this beast. This is what I had after escaping:
(?xi)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’] ))

Actually, you don't need the first capturing group either; it's the same as the whole match in this case, and that can always be accessed via $&. You can change all the capturing groups to non-capturing by adding ?: after the opening parens:
/\b(?:(?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i
That "invalid group" error is due to the inline modifiers (i.e., (?xi)) which, as @kirilloid observed, are not supported in JavaScript. John Gruber (the regex's author) was mistaken about that, as he was about JS supporting free-spacing mode.
Just FYI, the reason you had to escape the slashes is because you were using regex-literal notation, the most common form of which uses the forward-slash as the regex delimiter. In other words, it's the language (Ruby or JavaScript) that requires you to escape that particular character, not the regex. Some languages let you choose different regex delimiters, while others don't support regex literals at all.
But these are all language issues, not regex issues; the regex itself appears to work as advertised.
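The inline-modifier problem is easy to reproduce in isolation. This is a small sketch with a placeholder pattern (foo stands in for the real URL regex): the (?xi) prefix makes the RegExp constructor throw, while passing i as a flag works.

```javascript
// Inline modifiers such as (?xi) are rejected by the JavaScript regex parser.
let error = null;
try {
  new RegExp('(?xi)foo');
} catch (e) {
  error = e;
}
console.log(error instanceof SyntaxError); // true

// Case-insensitivity is requested through the flags argument (or /.../i) instead.
const caseInsensitive = new RegExp('foo', 'i');
console.log(caseInsensitive.test('FOO')); // true
```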

It seems that you copied it wrong.
http://www.regular-expressions.info/javascript.html
No mode modifiers to set matching options within the regular expression.
No regular expression comments
I.e. the (?xi) at the beginning is useless:
x is not needed at all for a compacted RegExp (one without whitespace or comments)
i can be replaced with the i flag
All these result in:
/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i
Tested and working in Google Chrome => should work in Node.js

Why this regex is not working for german words?

I am trying to break the following sentence in words and wrap them in span.
<p class="german_p big">Das ist ein schönes Armband</p>
I followed this:
How to get a word under cursor using JavaScript?
$('p').each(function() {
var $this = $(this);
$this.html($this.text().replace(/\b(\w+)\b/g, "<span>$1</span>"));
});
The only problem I am facing is that, after wrapping the words in span, the resulting HTML looks like this:
<p class="german_p big"><span>Das</span> <span>ist</span> <span>ein</span> <span>sch</span>ö<span>nes</span> <span>Armband</span>.</p>
So schönes is broken into three parts: sch, ö and nes. Why is this happening? What would be the correct regex for this?

Unicode in Javascript Regexen
Like Java itself, Javascript doesn't support Unicode in its \w, \d, and \b regex shortcuts. This is (arguably) a bug in Java and Javascript. Even if one manages through casuistry or obstinacy to argue that it is not a bug, it's sure a big gotcha. Kinda bites, really.
The problem is that those popular regex shortcuts only apply to 7-bit ASCII whether in Java or in Javascript. This restriction is painfully 1970s‐ish; it makes absolutely no sense in the 21ˢᵗ century. This blog posting from this past March makes a good argument for fixing this problem in Javascript.
It would be really nice if some public-spirited soul would please add Javascript to this Wikipedia page that compares support for regex features in various languages.
This page says that Javascript doesn't support any Unicode properties at all. That same site has a table that's a lot more detailed than the Wikipedia page I mention above. For Javascript features, look under its ECMA column.
However, that table is in some cases at least five years out of date, so I can't completely vouch for it. It's a good start, though.
Unicode Support in Other Languages
Ruby, Python, Perl, and PCRE all offer ways to extend \w to mean what it is supposed to mean, but the two J‐thingies do not.
In Java, however, there is a good workaround available. There, you can use \pL to mean any character that has the Unicode General_Category=Letter property. That means you can always emulate a proper \w using [\pL\p{Nd}_].
Indeed, there's even an advantage to writing it that way, because it keeps you aware that you're adding decimal numbers and the underscore character to the character class. With a simple \w, people sometimes forget this is going on.
I don't believe that this workaround is available in Javascript, though. You can also use Unicode properties like those in Perl and PCRE, and in Ruby 1.9, but not in Python.
The only Unicode properties current Java supports are the one- and two-character general properties like \pN and \p{Lu} and the block properties like \p{InAncientSymbols}, but not scripts like \p{IsGreek}, etc.
The future JDK7 will finally get around to adding scripts. Even then Java still won't support most of the Unicode properties, though, not even critical ones like \p{WhiteSpace} or handy ones like \p{Dash} and \p{Quotation_Mark}.
SIGH! To understand just how limited Java's property support is, merely compare it with Perl. Perl supports 1633 Unicode properties as of 2007's 5.10 release, and 2478 of them as of this year's 5.12 release. I haven't counted them for ancient releases, but Perl started supporting Unicode properties back during the last millennium.
Lame as Java is, it's still better than Javascript, because Javascript doesn't support any Unicode properties whatsoCENSOREDever. I'm afraid that Javascript's paltry 7-bit mindset makes it pretty close to unusable for Unicode. This is a tremendously huge gaping hole in the language that's extremely difficult to account for given its target domain.
Sorry 'bout that. ☹
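For readers on newer engines: ES2018 added Unicode property escapes behind the u flag, so the [\pL\p{Nd}_] emulation described above for Java can be approximated in JavaScript as well. This is a sketch that assumes such an engine is available; the function name wrapWords is hypothetical.

```javascript
// Wrap each run of letters, decimal digits and underscores in a span.
function wrapWords(text) {
  // \p{L} = any letter, \p{Nd} = decimal digit; the u flag enables \p{...}.
  return text.replace(/[\p{L}\p{Nd}_]+/gu, '<span>$&</span>');
}

console.log('schönes'.match(/\w+/g));  // [ 'sch', 'nes' ]: ASCII-only \w splits the word
console.log(wrapWords('Das ist ein schönes Armband'));
// <span>Das</span> <span>ist</span> <span>ein</span> <span>schönes</span> <span>Armband</span>
```

If the text may contain decomposed characters (a base letter followed by a combining mark), adding \p{M} to the class is one way to keep such sequences together.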

You can also use
/\b([äöüÄÖÜß\w]+)\b/g
instead of
/\b(\w+)\b/g
in order to handle the umlauts
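One caveat with this pattern, sketched below with made-up variable names: \b is still ASCII-based, so a word that starts with an umlaut loses its first letter. Dropping the \b anchors and relying on the greedy character class avoids that in this example.

```javascript
const withBoundaries = /\b([äöüÄÖÜß\w]+)\b/g;
const withoutBoundaries = /[äöüÄÖÜß\w]+/g;
const sentence = 'Österreich ist schön';

// \b sees no boundary before "Ö", so the match starts at the "s".
console.log(sentence.replace(withBoundaries, '<span>$1</span>'));
// Ö<span>sterreich</span> <span>ist</span> <span>schön</span>

// Without \b, each maximal run of class characters is matched as one word.
console.log(sentence.replace(withoutBoundaries, '<span>$&</span>'));
// <span>Österreich</span> <span>ist</span> <span>schön</span>
```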

\w only matches A-Z, a-z, 0-9, and _ (underscore).
You could use something like \S+ to match all non-space characters, including non-ASCII characters like ö. This might or might not work depending on how the rest of your string is formatted.
Reference: http://www.javascriptkit.com/javatutors/redev2.shtml
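A quick illustration of the trade-off, using a sample sentence chosen for this sketch: \S+ keeps the umlaut inside the word, but it also keeps any punctuation attached to it.

```javascript
// \S+ matches runs of non-whitespace, so "ö" is fine but the final "." stays attached.
const words = 'Das ist ein schönes Armband.'.match(/\S+/g);
console.log(words); // [ 'Das', 'ist', 'ein', 'schönes', 'Armband.' ]
```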

To include all the Latin 1 Supplement characters like äöüßÒÿ you can use:
[\w\u00C0-\u00ff]
however, there are even more funny characters in the Latin Extended-A and Latin Extended-B unicode blocks like ČŇů . To include that you can use:
[\w\u00C0-\u024f]
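As a rough sketch of how the wider class behaves (the variable name latinWord is illustrative): it keeps accented Latin letters together, but the range U+00C0 to U+00FF also contains the signs × (U+00D7) and ÷ (U+00F7), so those are treated as word characters too.

```javascript
const latinWord = /[\w\u00C0-\u024F]+/g;

console.log('Das ist ein schönes Armband'.match(latinWord));
// [ 'Das', 'ist', 'ein', 'schönes', 'Armband' ]

console.log('ČŇů'.match(latinWord)); // [ 'ČŇů' ]

// Side effect of using a raw code point range: × counts as a "letter".
console.log('3×4'.match(latinWord)); // [ '3×4' ]
```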

\w and \b are not Unicode-aware in JavaScript; they only match ASCII word/boundary characters. If your use cases all allow splitting on whitespace, you can use \s/\S, which are Unicode-aware.

As others note, the \w shortcut is not very useful for non-Latin character sets. If you need to match other text ranges you should use hex* notation (Ref1) (Ref2) for the appropriate range.
* could be hex or octal or unicode, you'll often see these collectively referred as hex notation.

The \b's will also not work correctly. It is possible to use the XRegExp library's \p{L} token for Unicode support; however, there is still no Unicode-aware \b, so you won't be able to find the word boundaries directly. It would be nice to provide \b support by doing lookbehind/lookaheads with \P{L}, using the following implementation:
http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript

While javascript doesn't support Unicode natively, you could use this library to work around it: http://xregexp.com/
