Regex character class inside character range failing intellij's jsLint inspection - javascript

What is the easiest way to rectify my failing inspection? There is no option in intellij (that I can find) to allow character classes inside a character range.

If you want to allow a literal hyphen - in a character class, you need to put it immediately after the opening [ (Refer "Character Classes" section at http://www.regular-expressions.info/reference.html) or immediately prior to the closing ] (I've found works in some languages at least), else it's deemed to signify a range.
Note also this comment by @IanMackinnon on this SO question (although I couldn't find an authoritative source for it after a very brief search): "explicitly escaping hyphens in character classes is another JSLint recommendation." This was written in the context of literal hyphens. Regardless of whether JSLint needs this to pass inspection, it's probably good practice, because it future-proofs the literal hyphen in case a later developer accidentally turns it into a range by putting something between the unescaped hyphen and the opening (or closing) bracket.
I therefore think the section of your regex that currently reads as [\w-+\s] should be rewritten as [\-\w+\s].
And the subsequent [\w-+] as [\-\w+], etc...
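As a rough illustration of the rewrite suggested above, the sketch below compares the two forms. The helper name compiles and the sample strings are made up for this example; the check with the "u" flag is only there to show that stricter parsers reject a character class shorthand used as a range endpoint.

```javascript
// Hypothetical helper: returns true if the pattern source compiles with the given flags.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    return false;
  }
}

// Hyphen escaped and moved to the front: it can only be read as a literal hyphen.
const safeClass = /^[\-\w+\s]+$/;
console.log(safeClass.test("foo-bar + baz")); // hyphen, plus sign and spaces are all in the class
console.log(safeClass.test("foo*bar"));       // "*" is not in the class

// The original form is tolerated in legacy mode but rejected in Unicode ("u") mode,
// because \w cannot be the start of a range.
console.log(compiles("[\\w-+\\s]", ""));
console.log(compiles("[\\w-+\\s]", "u"));
console.log(compiles("[\\-\\w+\\s]", "u"));
```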

Related

Ensure that all lines in a file match a known RegEx pattern, without relying on \A \z [duplicate]

In a multi-line file I want to check that all lines match one or more complex patterns (which may each cover multiple lines).
I can make this work just fine with this RegEx
\A(patternA\n|patternB\n)*\z
(For some languages it works with \Z instead of \z)
It will match:
patternA
patternB
patternA
and will reject
patternA
patternC
patternB
But it does not work in JavaScript (where I need to execute the test) because JavaScript RegEx apparently does not support the anchors \A (start of file) or \z (end of file). And if those anchors are left off then I just get back a set of matches (the first and third lines in my second example above), without the information that there are also non-matches.
At the moment, the only thing I can think of is to run the RegEx check without those two anchors, and then check that the sum of the length of all the matches equals the length of the overall text, but this seems rather clunky.
Is there a simple/elegant way to implement this check in JavaScript RegEx?
I now think the best solution may be to invert the logic: search for anything that does not match the required patterns, and pass the check if no match is found. The following RegEx, running under JavaScript with the multiline (m) flag so that ^ and $ match at line boundaries, matches the second example from my original post and not the first:
^(?!patternA$|patternB$|$)
The last option ($) is needed as otherwise it always matches the (empty) line following the last newline.
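A minimal sketch of this inverted check, using the literal placeholder patterns from the question. The function name hasInvalidLine is an assumption made for this example; the m flag is what makes ^ and $ apply per line.

```javascript
// Matches at the start of any line that is neither patternA, patternB, nor empty.
// No "g" flag, so test() keeps no state between calls.
const invalidLine = /^(?!patternA$|patternB$|$)/m;

// Hypothetical helper: true if at least one line fails to match the allowed patterns.
function hasInvalidLine(text) {
  return invalidLine.test(text);
}

const goodFile = "patternA\npatternB\npatternA\n";
const badFile = "patternA\npatternC\npatternB\n";

console.log(hasInvalidLine(goodFile));
console.log(hasInvalidLine(badFile));
```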
If the individual patterns are complex, it may be both easier on the regex engine and simpler to understand to write imperative code that loops through each of the lines and checks for each of the patterns in order.
This will let each of the patterns stay one regex. They do not have to be baked into the "whole file" pattern when one is updated, added or removed. The different patterns can also use different regex flags and so on.
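The imperative approach could look roughly like the following. This is only a sketch for single-line patterns; the names linePatterns and allLinesMatch are invented for the example, and patterns that span several lines would need extra bookkeeping.

```javascript
// Each pattern stays a separate regex and can carry its own flags.
// Avoid the "g" flag here, since it would make test() stateful.
const linePatterns = [/^patternA$/, /^patternB$/i];

// Hypothetical helper: true if every line matches at least one pattern.
function allLinesMatch(text, patterns) {
  const lines = text.split("\n");
  // A trailing newline produces one empty element at the end; ignore it.
  if (lines[lines.length - 1] === "") {
    lines.pop();
  }
  return lines.every((line) => patterns.some((pattern) => pattern.test(line)));
}

console.log(allLinesMatch("patternA\npatternB\npatternA\n", linePatterns));
console.log(allLinesMatch("patternA\npatternC\npatternB\n", linePatterns));
```

Adding, removing or updating a pattern then only touches the array, not one large combined expression.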

Spaces required between keyword and literal

Looking at the output of UglifyJS2, I noticed that no spaces are required between literals and the in operator (e.g., 'foo'in{foo:'bar'} is valid).
Playing around with Chrome's DevTools, however, I noticed that hex and binary number literals require a space before the in keyword:
Internet Explorer returned true for all three tests, while Firefox 48.0.1 threw a SyntaxError for the first one (1in foo); however, it is okay with string literals ('1'in foo==true).
It seems that there should be no problem parsing JavaScript, allowing for keywords to be next to numeric literals, but I can't find any explicit rule in the ECMAScript specification (any of them).
Further testing shows that statements like for(var i of[1,2,3])... are allowed in both Chrome and Firefox (IE11 doesn't support for..of loops), and typeof"string" works in all three.
Which behavior is correct? Is it, in fact, defined somewhere that I missed, or are all these effects a result of idiosyncrasies of each browser's parser?
Not an expert - I haven't done a JS compiler, but have done others.
ecma-262.pdf is a bit vague, but it's clear that an expression such as 1 in foo should be parsed as 3 input elements, which are all tokens. Each token is a CommonToken (11.5); in this case, we get numericLiteral, identifierName (yes, in is an identifierName), and identifierName. Exactly the same is true when parsing 0b1 in foo (see 11.8.3).
So, what happens when you take out the WS? It's not covered explicitly (as far as I can see), but it's common practice (in other languages) when writing a lexer to scan the longest character sequence that will match something you could potentially be looking for. The introduction to section 11 pretty much says exactly that:
The source text is scanned from left to right, repeatedly taking the
longest possible sequence of code points as the next input element.
So, for 0b1in foo the lexer goes through 0b1, which matches a numeric literal, and reaches i, giving 0b1i, which doesn't match anything. So it passes the longest match (0b1) to the rest of the parser as a token, and starts again at i. It finds n, followed by WS, so passes in as the second token, and so on.
So, basically, and rather bizarrely, it looks like IE is correct.
TL;DR
There would be no change to how code would be interpreted if whitespace weren't required in these circumstances, but it's part of the spec.
Looking at the source code of v8 that handles number literal parsing, it cites ECMA 262 § 7.8.3:
The source character immediately following a NumericLiteral must not be an IdentifierStart or DecimalDigit.
NOTE For example:
3in
is an error and not the two input elements 3 and in.
This section seems to contradict the introduction of section 7. However, it does not seem that there would be any problems with breaking that rule and allowing for 3in to be parsed. There are cases where allowing for no spaces between literals and identifiers would change how the source is parsed, but all cases merely change which errors are generated.
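To see how a particular engine treats these cases, a small probe like the one below can help. The helper name parses is made up for this sketch; it relies on new Function parsing its body without executing it, so foo does not need to exist. The results for the numeric literal cases are deliberately not stated here, since they depend on the engine and version.

```javascript
// Hypothetical helper: true if the expression parses, false if it is a SyntaxError.
function parses(expression) {
  try {
    // The function body is parsed here but never run.
    new Function("return (" + expression + ");");
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) {
      return false;
    }
    throw err;
  }
}

const samples = [
  "'1'in foo",       // string literal directly followed by "in"
  "1 in foo",        // numeric literal with whitespace
  "1in foo",         // decimal literal without whitespace
  "0x1in foo",       // hex literal without whitespace
  "0b1in foo",       // binary literal without whitespace
  'typeof"string"',  // keyword directly followed by a string literal
];

for (const sample of samples) {
  console.log(sample, "->", parses(sample) ? "parses" : "SyntaxError");
}
```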

Regex error in Netbeans not present in other editors

I have the following regular expression that works fine in my application code and other code editors have not reported a problem with it. It is used to validate a password.
/^(?=.*[A-Za-z])+(?=.*[\d])+(?=.*[^A-Za-z\d\s])+.*$/
So in other words:
Must have one letter
Must have one digit
Must have one non-letter, non-digit
Now it seems NetBeans has a fairly decent regex parser, and it has reported that this is an erroneous statement. But as I am new to regex I cannot spot the error. Is it due to using the positive lookahead ?= with the one-or-more + at the end?
When I take out the + the error goes away, but the regex stops performing in my application.
If anyone can tell me what is wrong with my expression that would be great.
The statement is used in a jQuery validation plugin that I use, if that helps. Also, because I am using a plugin, I would prefer not to split this into several smaller (clearly simpler and cleaner) expressions. That would require a great deal of work.
It never makes sense to apply a quantifier to a zero-width assertion such as a lookahead. The whole point of such assertions is that they allow you to assert that some condition is true, without consuming any of the text--that is, advancing the current match position. Some regex flavors treat that as a syntax error, while others effectively ignore the quantifier. Getting rid of those plus signs makes your regex correct:
/^(?=.*[A-Za-z])(?=.*\d)(?=.*[^A-Za-z\d\s]).*$/
If it doesn't work as expected, you may be running into the infamous IE lookahead bug. The usual workaround is to reorder things so the first lookahead is anchored at the end, like so:
/^(?=.{8,15}$)(?=.*[A-Za-z])(?=.*\d)(?=.*[^A-Za-z\d\s]).*/
The (?=.{8,15}$) is just an example; I have no idea what your real requirements are. If you do want to impose minimum and maximum length limits, this is the ideal place to do it.
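For reference, here is a short sketch of the corrected expression (without the + quantifiers and without any length limit) applied to a few made-up sample passwords. The variable and function names are assumptions for this example only.

```javascript
// One letter, one digit, and one character that is not a letter, digit or whitespace.
const passwordPattern = /^(?=.*[A-Za-z])(?=.*\d)(?=.*[^A-Za-z\d\s]).*$/;

// Hypothetical helper wrapping the test.
function isValidPassword(candidate) {
  return passwordPattern.test(candidate);
}

console.log(isValidPassword("abc1!"));  // has a letter, a digit and a symbol
console.log(isValidPassword("abc123")); // no symbol
console.log(isValidPassword("123!"));   // no letter
console.log(isValidPassword("abc!"));   // no digit
```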

help making a "universal" regex Javascript compatible

I found a very nice URL regex matcher on this site: http://daringfireball.net/2010/07/improved_regex_for_matching_urls . It states that it's free to use and that it's cross language compatible (including Javascript). First of all, I have to escape some of the slashes to get it to compile at all. When I do that, it works fine on Rubular.com (where I generally test regexes), with the strange side effect that each match has 5 fields: 1 is the url, and the extra 4 are empty. When I put this in JS, I get the error "Invalid Group". I am using Node.js if that makes any difference, but I wish I could understand that error. I'd like to cut back on the unnecessary empty match fields, but I don't even know where to begin diagnosing this beast. This is what I had after escaping:
(?xi)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’] ))
Actually, you don't need the first capturing group either; it's the same as the whole match in this case, and that can always be accessed via $&. You can change all the capturing groups to non-capturing by adding ?: after the opening parens:
/\b(?:(?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i
That "invalid group" error is due to the inline modifiers (i.e., (?xi)) which, as @kirilloid observed, are not supported in JavaScript. John Gruber (the regex's author) was mistaken about that, as he was about JS supporting free-spacing mode.
Just FYI, the reason you had to escape the slashes is because you were using regex-literal notation, the most common form of which uses the forward-slash as the regex delimiter. In other words, it's the language (Ruby or JavaScript) that requires you to escape that particular character, not the regex. Some languages let you choose different regex delimiters, while others don't support regex literals at all.
But these are all language issues, not regex issues; the regex itself appears to work as advertised.
It seems that you copied it wrong.
http://www.regular-expressions.info/javascript.html
No mode modifiers to set matching options within the regular expression.
No regular expression comments
I.e. (?xi) at the beginning is useless in JavaScript.
x serves no purpose anyway for a compacted RegExp.
i can be replaced with the i flag.
All of this results in:
/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i
Tested and working in Google Chrome => should work in Node.js
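A quick way to check both points in Node.js is sketched below: the leading (?xi) is rejected by the regex parser, while the version with the i flag (here with all groups made non-capturing and a g flag added so that every URL is returned) extracts URLs. The helper name findUrls and the sample sentence are invented for this example.

```javascript
// The inline modifier group (?xi) is not valid JavaScript regex syntax.
let inlineModifierError = null;
try {
  new RegExp("(?xi)\\bfoo");
} catch (err) {
  inlineModifierError = err;
}
console.log(inlineModifierError instanceof SyntaxError);

// Same pattern with flags instead of inline modifiers and only non-capturing groups.
const urlPattern = /\b(?:(?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/gi;

// Hypothetical helper: returns every URL-like substring, or an empty array.
function findUrls(text) {
  return text.match(urlPattern) || [];
}

console.log(findUrls("Visit http://example.com/page. Done"));
```

Because no group captures, each entry in the result is just the matched URL, with no empty extra fields.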

How to detect what allowed character in current Regular Expression by using JavaScript?

In my web application, I created a framework that binds model data to controls on the page. Each model property has rules such as string length, not null, and a regular expression. Before the page is submitted, the framework validates every bound control against the defined rules.
So, I want to detect which characters are allowed by each regular expression rule, as in the following examples.
"^[0-9]+$" allows only digit characters like 1, 2, 3.
"^[a-zA-Z_][a-zA-Z_\-0-9]+$" allows only letters, digits, - and _ characters
However, this function should not care about grouping or the position of the allowed characters. It should only report which characters are possible.
Do you have any ideas for creating this function?
PS. I know it is easy to create a specific function, such as a numeric-only one that allows only digit characters. But I need to share/reuse the same piece of code in both the data tier (which contains all the model validators) and the UI tier without modifying anything.
Thanks
You can't solve this for the general case. Regexps don't generally ‘fail’ at a particular character, they just get to a point where they can't match any more, and have to backtrack to try another method of matching.
One could make a regex implementation that remembered which was the farthest it managed to match before backtracking, but most implementations don't do that, including JavaScript's.
A possible way forward would be to match first against ^pattern$, and if that failed match against ^pattern without the end-anchor. This would be more likely to give you some sort of match of the left hand part of the string, so you could count how many characters were in the match, and say the following character was ‘invalid’. For more complicated regexps this would be misleading, but it would certainly work for the simple cases like [a-zA-Z0-9_]+.
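The two-step idea above might be sketched as follows. The function name firstInvalidIndex is made up for this example, and it expects the pattern body without its own ^ and $ anchors; as noted, the result is only meaningful for simple patterns.

```javascript
// Hypothetical helper: returns -1 if the whole input matches, otherwise the index
// of the first character that the anchored-at-start pattern could not consume.
function firstInvalidIndex(patternBody, input) {
  const full = new RegExp("^(?:" + patternBody + ")$");
  if (full.test(input)) {
    return -1;
  }
  // Retry without the end anchor to find the longest matching left-hand part.
  const prefix = new RegExp("^(?:" + patternBody + ")").exec(input);
  return prefix ? prefix[0].length : 0;
}

console.log(firstInvalidIndex("[a-zA-Z0-9_]+", "abc_123")); // whole string matches
console.log(firstInvalidIndex("[a-zA-Z0-9_]+", "abc$def")); // points at the "$"
console.log(firstInvalidIndex("[0-9]+", "x12"));            // fails on the first character
```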
I must admit that I'm struggling to parse your question.
If you are looking for a regular expression that will match only if a string consists entirely of a certain collection of characters, regardless of their order, then your examples of character classes were quite close already.
For instance, ^[A-Za-z0-9]+$ will only allow strings that consist of letters A through Z (upper and lower case) and numbers, in any order, and of any length.
