Regular expressions - javascript

I know this is a simple thing, but I just can't make it work.
Req: A word which contains at least one number, letters (either case), and at least one symbol (special character).
In C#, (?=.[0-9])(?=.[a-zA-z])(?=.*[!##$%_]) worked, but in JavaScript it's not working.
It seems like it always looks for the number at the beginning, since my condition starts with the number in the regexp.
Can anyone give me a regexp that can be used in JavaScript?
-Rakesh

JavaScript does support lookaheads. However, your groups expect exactly one character before the number and the letter (because they start with just a dot .). Try adding a * to those two dots. Note also that [a-zA-z] matches the characters between Z and a ([, \, ], ^, _ and `) as well, so [a-zA-Z] is used here:
var pattern = /(?=.*[0-9])(?=.*[a-zA-Z])(?=.*[!##$%_])/;
pattern.test('xxx'); // false
pattern.test('111'); // false
pattern.test('!!!'); // false
pattern.test('x1!'); // true
I'm seeing the same problem with this regular expression in C#, too.
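To see why the original pattern can never match: all the lookaheads are evaluated at the same starting position, and a single . consumes exactly one character, so (?=.[0-9]) and (?=.[a-zA-Z]) both inspect the same character, which cannot be a digit and a letter at once. A minimal sketch comparing the two patterns follows. The sample strings and variable names are illustrative, and the leading ^ is an assumption added here (it is not in the answer above; it only stops the engine from retrying the lookaheads at every position).

```javascript
// Original: both ".X" lookaheads look at the same (second) character.
var original = /(?=.[0-9])(?=.[a-zA-Z])(?=.*[!#$%_])/;
// Fixed: ".*" lets each lookahead scan ahead independently.
var fixed = /^(?=.*[0-9])(?=.*[a-zA-Z])(?=.*[!#$%_])/;

// Illustrative inputs, not taken from the question.
var samples = ["x1!", "1x!", "!x1", "abc", "123"];

var results = samples.map(function (s) {
  return { input: s, original: original.test(s), fixed: fixed.test(s) };
});

console.log(results);
```

With these inputs the original pattern rejects every string, while the fixed pattern accepts the three strings that contain a digit, a letter and a symbol in any order.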

Just to cover the obvious answer, given the requirements as stated I would use separate tests.
/[0-9]/.test(string) && /[a-z]/i.test(string) && /[!##$%_]/.test(string)
If you're interested in abstracting this away, one way is to store the tests in an array.
var tests = [ /[0-9]/, /[a-z]/i, /[!##$%_]/ ];
One way to evaluate all the tests without adding variables to the surrounding scope is to wrap the loop in an immediately invoked function:
var passes = (function () {
  for (var i = 0; i < tests.length; i++)
    if (!tests[i].test(string)) return false;
  return true;
})();
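The same idea can be written with Array.prototype.every, which stops at the first failing test. This is only a sketch: the function name passesAll is made up for illustration, and the duplicate # from the original character class is dropped because it has no effect.

```javascript
// One regex per requirement; none uses the g flag, so test() keeps no state.
var tests = [/[0-9]/, /[a-z]/i, /[!#$%_]/];

// Hypothetical helper: true only if every requirement is met.
function passesAll(string) {
  return tests.every(function (re) {
    return re.test(string);
  });
}

console.log(passesAll("x1!")); // true
console.log(passesAll("xxx")); // false
```

Keeping the requirements in an array makes it easy to add another rule later without touching the checking code.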

I don't think javascript supports those lookaheads. Try
/(.*[a-zA-Z].*[0-9].*[!##$%_].*|.*[a-zA-Z].*[!##$%_].*[0-9].*|.*[0-9].*[a-zA-Z].*[!##$%_].*|.*[!##$%_].*[a-zA-Z].*[0-9].*|.*[0-9].*[!##$%_].*[a-zA-Z].*|.*[!##$%_].*[0-9].*[a-zA-Z].*)/
Not expecting any points for elegance...
Edit: As bdukes pointed out, js does support lookaheads. However, this (ugly) expression does work.

You can have a very long regexp, with the three character classes repeated in different orders, or use more than one test:
function teststring(s) {
  return /^[\da-zA-Z!##$%_]+$/.test(s) &&
    /\d/.test(s) && /[a-zA-Z]/.test(s) && /[!##$%_]/.test(s);
}

Related

Javascript regular expression to capture every possible mathematical operation between parenthesis

I am trying to capture mathematical expressions between parentheses in a string with JavaScript. I need to capture parentheses that ONLY include numbers and mathematical operators: [0-9], +, -, *, /, % and the decimal dot. The examples below demonstrate what I am after. I managed to get close to the desired result, but nested parentheses always break my regex, so I need help! I also need it to match globally, not just the first occurrence.
let string = "If(2>1,if(a>100, (-2*(3-5)(8-2)), (1+2)), (3(1+2)) )";
What I want to do if possible is manage to transform this syntax
if(condition, iftrue, iffalse)
to this syntax
if(condition) { iftrue } else { iffalse }
so that it can be evaluated by JavaScript and previewed in the browser. I have done it so far, but if the iftrue or iffalse blocks contain parentheses, everything blows up! So I'm trying to capture those parentheses and calculate them before transforming the syntax. Any advice is appreciated.
The closest I got was this /[\d()+-*/.]/g, which gets what I want, but in this example
(1+2) (1 < 1) sdasdasd (1*(2+3))
instead of dismissing the (1 < 1) group entirely, it matches (1 and 1). My ideal scenario would be
(1+2) (1<1) sdasdasd (1*(2+3))
Another example:
let codeToEval = "if(a>10, 2, 2*(b+4))";
codeToEval is then passed to a function that replaces a and b with the correct values, so it ends up like this.
codeToEvalAfterReplacement = "if(5>10,2,2*(5+4))";
And now I want to transform this into
if(5>10) {
2
} else {
2*(5+4)
}
so it can be evaluated by javascript eval() and eventually previewed to the users.
Your current regex /[\d()+-*/.]/g will match single characters from the class
but multiple times because of the g flag, this is why (1 and 1) are still matched
in (1 < 1).
Based on your pattern requirements I would change it to /\([-+*/%.0-9()]+\)/g.
This will match parentheses containing one or more of the characters you describe within them.
Note that your current pattern has a - in the middle of a class, which can lead to weird behaviour because regex engines will treat +-* within a class as a range (plus through asterisk, which is a strange range; JavaScript engines reject such a reversed range with a SyntaxError, because * sorts before +). Notice I put - at the start of the class in the new pattern so it matches an actual -.
I've assumed there will be no empty parentheses (), if there are you can change + (one or more) after ] to * (zero or more)
The g flag is still added so you match every one of such expressions.
I can't say with 100% certainty that the new regex will allow you to robustly transform the syntax you state, as it depends on the complexity of the 'iftrue' and 'iffalse' code blocks. See if you can make it work with the new pattern, otherwise you may want to look into other solutions for parsing code.
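A short sketch of the suggested pattern in use, run against the example from the question. The helper name findArithmeticGroups is made up for illustration.

```javascript
// Parentheses containing only digits, operators, dots and nested parentheses.
var pattern = /\([-+*/%.0-9()]+\)/g;

// Hypothetical helper: returns every match, or an empty array if there are none.
function findArithmeticGroups(text) {
  return text.match(pattern) || [];
}

var found = findArithmeticGroups("(1+2) (1 < 1) sdasdasd (1*(2+3))");
console.log(found); // ["(1+2)", "(1*(2+3))"]

// The pattern does not check that parentheses are balanced.
var unbalanced = findArithmeticGroups("if(a,(1+2))");
console.log(unbalanced); // ["(1+2))"]
```

The second call shows the limit of this approach: a closing parenthesis that belongs to an outer if( is swallowed into the match, which is one reason a real parser may be needed for nested input.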
Call a function inside the if parentheses and put all the conditions in that function.
if (test()) {
  // if code
} else {
  // else code
}
function test() {
  // check both cases here; case1 and case2 stand in for your own conditions
  var case1 = true;
  var case2 = true;
  return case1 && case2;
}

Combine whitelist and blacklist in javascript regex expression

I am having problems constructing a regex that will allow the full range of UTF-8 characters with the exception of 2 characters: _ and %
So the whitelist is: ^[\u0000-\uFFFF] and the blacklist is: ^[^_%]
I need to combine these into one expression.
I have tried the following code, but it does not work the way I had hoped:
var input = "this%";
var patrn = /[^\u0000-\uFFFF&&[^_%]]/g;
if (input.match(patrn) == "" || input.match(patrn) == null) {
return true;
} else {
return false;
}
input: this%
actual output: true
desired output: false
If I understand correctly, one of these should be enough:
/^[^_%]*$/.test(str);
!/[_%]/.test(str);
Use negative lookahead:
(?!_blacklist_)_whitelist_
In this case:
^(?:(?![_%])[\u0000-\uFFFF])*$
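The three suggestions so far can be compared side by side. This is a minimal sketch; the object name candidates and the helper checkAll are made up for illustration.

```javascript
// Three ways of saying "contains neither _ nor %".
var candidates = {
  negatedClass: function (s) { return /^[^_%]*$/.test(s); },
  searchForBlacklisted: function (s) { return !/[_%]/.test(s); },
  temperedLookahead: function (s) {
    return /^(?:(?![_%])[\u0000-\uFFFF])*$/.test(s);
  }
};

// Hypothetical helper: run one input through all three checks.
function checkAll(s) {
  return Object.keys(candidates).map(function (name) {
    return candidates[name](s);
  });
}

console.log(checkAll("this%")); // [false, false, false]
console.log(checkAll("this"));  // [true, true, true]
```

For a fixed two-character blacklist the simple versions are easier to read. The lookahead form is mainly useful when the whitelist is narrower than "everything".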
Underscore is \u005F and percent is \u0025. You can simply alter the range to exclude these two characters:
^[\u0000-\u0024\u0026-\u005E\u0060-\uFFFF]
This will be just as fast as the original regex.
But I don't think that you are going to get the result you really want this way. A \uXXXX escape in JS only goes up to \uFFFF; any character past that is technically two UTF-16 code units (a surrogate pair).
According to here, the following code returns false:
/^.$/.test('💩')
You need to have a different way to see if you have characters outside that range. This answer gives the following code:
String.prototype.getCodePointLength= function() {
return this.length-this.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g).length+1;
};
Simply put, if the number returned by that is not the same as the value of .length, you have a surrogate pair (and thus you should return false).
If your input passes that test, you can run it up against another regex to avoid all the characters between \u0000-\uFFFF that you want to avoid.
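The two steps can be sketched as follows. This assumes, as the answer does, that any surrogate pair should make the input invalid. The function names are made up for illustration, and the test character is written with \u escapes so the example does not depend on file encoding.

```javascript
// Step 1: detect characters outside \u0000-\uFFFF (stored as surrogate pairs).
function containsSurrogatePair(s) {
  return /[\uD800-\uDBFF][\uDC00-\uDFFF]/.test(s);
}

// Step 2: reject the two blacklisted characters.
function isAllowed(s) {
  return !containsSurrogatePair(s) && !/[_%]/.test(s);
}

console.log(isAllowed("this"));          // true
console.log(isAllowed("this%"));         // false
console.log(isAllowed("\uD83D\uDCA9"));  // false (one character, two code units)
```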

How to look for a pattern that might be missing some characters, but following a certain order?

I am trying to make a validation for "KQkq" <or> "-". In the first case, any of the letters can be missing (except all of them, in which case it should be "-"). The order of the characters is also important.
Some quick examples of legal values are:
-
Kkq
q
This is for a Chess FEN validation. I have validated the first two parts using:
var fen_parts = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1";
fen_parts = fen_parts.split(" ");
if(!fen_parts[0].replace(/[1-8/pnbrqk]/gi,"").length
&& !fen_parts[1].replace(/[wb]/,"").length
&& !fen_parts[2].replace(/[kq-]/gi,"").length /*not working, allows KKKKKQkq to be valid*/
){
//...
}
But simply using /[kq-]/gi to validate the third part allows too many things through. Here are some quick examples of illegal values:
KKKKQkq (there is more than one K)
QK (order is incorrect)
You can do
-|K?Q?k?q?
though you will need to do a second test to ensure that the input is not empty. Alternatively, using only regex:
KQ?k?q?|Qk?q?|kq?|q|-
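A minimal sketch of the regex-only alternative, anchored so that it must match the whole field. The function name isValidCastlingField is made up for illustration, and the ^(?:...)$ wrapper is an addition to the pattern as posted.

```javascript
// Each alternative starts with a different first letter, so at least one
// letter is always required and the order K, Q, k, q is enforced.
var castling = /^(?:KQ?k?q?|Qk?q?|kq?|q|-)$/;

// Hypothetical helper for the third FEN field.
function isValidCastlingField(field) {
  return castling.test(field);
}

console.log(isValidCastlingField("KQkq"));    // true
console.log(isValidCastlingField("Kkq"));     // true
console.log(isValidCastlingField("-"));       // true
console.log(isValidCastlingField("KKKKQkq")); // false
console.log(isValidCastlingField("QK"));      // false
console.log(isValidCastlingField(""));        // false
```

Unlike -|K?Q?k?q?, this form needs no second check for the empty string.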
This seems to work for me...
^(-|(K)?((?!\2)Q)?((?!\2\3)k)?((?!\2\3\4)q)?)$
A .match() returns null if the expression did not match. In that case you can use the logical OR to default to an array with an empty-string (a structure similar to the one returned by .match() on a successful match), which will allow you to check the length of the matched expression. The length will be 0 if the expression did not match, or K?Q?k?q? matched the empty string. If the pattern matches, the length will be > 0. in code:
("KQkq".match(/^(?:K?Q?k?q?|-)$/) || [""])[0].length
Because | has the lowest precedence of any regex operator, ^K?Q?k?q?|-$ would be read as (^K?Q?k?q?) or (-$), so it is necessary to wrap the alternation in a non-capturing group (?:...).
Having answered the question, let's have a look at the rest of your code:
if (!fen_parts[0].replace(/[1-8/pnbrqk]/gi,"").length)
is, from JavaScript's perspective, equivalent to
if (!fen_parts[0].match(/[^1-8/pnbrqk]/gi))
which translates to "false if any character but 1-8/pnbrqk". This notation is not only simpler to read, it also executes faster as there is no unnecessary string mutation (replace) and computation (length) going on.

What Javascript constructs does JsLex incorrectly lex?

JsLex is a Javascript lexer I've written in Python. It does a good job for a day's work (or so), but I'm sure there are cases it gets wrong. In particular, it doesn't understand anything about semicolon insertion, and there are probably ways that's important for lexing. I just don't know what they are.
What Javascript code does JsLex lex incorrectly? I'm especially interested in valid Javascript source where JsLex incorrectly identifies regex literals.
Just to be clear, by "lexing" I mean identifying tokens in a source file. JsLex makes no attempt to parse Javascript, much less execute it. I've written JsLex to do full lexing, though to be honest I would be happy if it merely was able to successfully find all the regex literals.
Interestingly enough, I tried your lexer on the code of my lexer/evaluator written in JS ;) You're right, it does not always do well with regular expressions. Here are some examples:
rexl.re = {
NAME: /^(?!\d)(?:\w)+|^"(?:[^"]|"")+"/,
UNQUOTED_LITERAL: /^#(?:(?!\d)(?:\w|\:)+|^"(?:[^"]|"")+")\[[^\]]+\]/,
QUOTED_LITERAL: /^'(?:[^']|'')*'/,
NUMERIC_LITERAL: /^[0-9]+(?:\.[0-9]*(?:[eE][-+][0-9]+)?)?/,
SYMBOL: /^(?:==|=|<>|<=|<|>=|>|!~~|!~|~~|~|!==|!=|!~=|!~|!|&|\||\.|\:|,|\(|\)|\[|\]|\{|\}|\?|\:|;|#|\^|\/\+|\/|\*|\+|-)/
};
This one is mostly fine: only UNQUOTED_LITERAL is not recognized. But now let's make a minor addition to it:
rexl.re = {
NAME: /^(?!\d)(?:\w)+|^"(?:[^"]|"")+"/,
UNQUOTED_LITERAL: /^#(?:(?!\d)(?:\w|\:)+|^"(?:[^"]|"")+")\[[^\]]+\]/,
QUOTED_LITERAL: /^'(?:[^']|'')*'/,
NUMERIC_LITERAL: /^[0-9]+(?:\.[0-9]*(?:[eE][-+][0-9]+)?)?/,
SYMBOL: /^(?:==|=|<>|<=|<|>=|>|!~~|!~|~~|~|!==|!=|!~=|!~|!|&|\||\.|\:|,|\(|\)|\[|\]|\{|\}|\?|\:|;|#|\^|\/\+|\/|\*|\+|-)/
};
str = '"';
Now everything after NAME's regexp gets messed up: it is lexed as one big string. I think the latter problem is that the String token is too greedy. The former might be a regexp that is too smart for the regex token.
Edit: I think I've fixed the regexp for the regex token. In your code replace lines 146-153 (the whole 'following characters' part) with the following expression:
([^/]|(?<!\\)(?<=\\)/)*
The idea is to allow everything except /, also allow \/, but not allow \\/.
Edit: Another interesting case. It passes after the fix, but might be worth adding as a built-in test case:
case 'UNQUOTED_LITERAL':
case 'QUOTED_LITERAL': {
this._js = "e.str(\"" + this.value.replace(/\\/g, "\\\\").replace(/"/g, "\\\"") + "\")";
break;
}
Edit: Yet another case. It appears to be too greedy about keywords as well. See the case:
var clazz = function() {
if (clazz.__) return delete(clazz.__);
this.constructor = clazz;
if(constructor)
constructor.apply(this, arguments);
};
It lexes it as: (keyword, const), (id, ructor). The same happens for an identifier inherits: in and herits.
Example: The first occurrence of / 2 /i below (the assignment to a) should tokenize as Div, NumericLiteral, Div, Identifier, because it is in an InputElementDiv context. The second occurrence (the assignment to b) should tokenize as RegularExpressionLiteral, because it is in an InputElementRegExp context.
i = 1;
var a = 1 / 2 /i;
console.info(a); // ⇒ 0.5
console.info(typeof a); // number
var b = 1 + / 2 /i;
console.info(b); // ⇒ 1/ 2 /i
console.info(typeof b); // ⇒ string
Source:
There are two goal symbols for the lexical grammar. The InputElementDiv symbol is used in those syntactic grammar contexts where a division (/) or division-assignment (/=) operator is permitted. The InputElementRegExp symbol is used in other syntactic grammar contexts.
Note that contexts exist in the syntactic grammar where both a division and a RegularExpressionLiteral are permitted by the syntactic grammar; however, since the lexical grammar uses the InputElementDiv goal symbol in such cases, the opening slash is not recognised as starting a regular expression literal in such a context. As a workaround, one may enclose the regular expression literal in parentheses.
— Standard ECMA-262 3rd Edition - December 1999, p. 11
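A lexer without a parser often approximates the choice between the two goal symbols by looking at the previous significant token. The sketch below is a deliberately simplified heuristic, not JsLex's actual logic and not the specification's rule. The token shape ({ type, text }) and the function name are assumptions made for illustration.

```javascript
// Hypothetical token shape: { type: string, text: string }, where type is one of
// "ident", "number", "string", "regex", "keyword" or "punct".
function slashStartsRegex(prev) {
  if (prev === null) return true; // start of input: nothing to divide
  switch (prev.type) {
    case "ident":
    case "number":
    case "string":
    case "regex":
      return false; // a value precedes the slash, so treat it as division
    case "keyword":
      // Value-like keywords end an expression; others (return, typeof, ...) do not.
      return ["this", "null", "true", "false", "super"].indexOf(prev.text) === -1;
    case "punct":
      // ) ] } usually end an expression, but not always (see the note below).
      return [")", "]", "}"].indexOf(prev.text) === -1;
    default:
      return true;
  }
}

// var a = 1 / 2 /i;   -> previous token is the number 1: division
console.log(slashStartsRegex({ type: "number", text: "1" })); // false
// var b = 1 + / 2 /i; -> previous token is +: regex literal
console.log(slashStartsRegex({ type: "punct", text: "+" }));  // true
```

This heuristic knowingly gets some cases wrong: a ) that closes an if or while header, a } that closes a block rather than an object literal, and postfix ++ or -- all need more context than a single previous token.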
The simplicity of your solution for handling this hairy problem is very cool, but I noticed that it doesn't quite handle a change to the something.property syntax in ES5, which allows reserved words following a dot. For example, a.if = 'foo'; (function () {a.if /= 3;}); is valid in some recent implementations.
Unless I'm mistaken, there is only one use of . for properties anyway, so the fix could be to add an additional state following the . that only accepts the identifierName token (which is what identifier uses, except that it doesn't reject reserved words). (Obviously the div state follows that as usual.)
I've been thinking about the problems of writing a lexer for JavaScript myself, and I just came across your implementation in my search for good techniques. I found a case where yours doesn't work that I thought I'd share if you're still interested:
var g = 3, x = { valueOf: function() { return 6;} } /2/g;
The slashes should both be parsed as division operators, resulting in x being assigned the numeric value 1. Your lexer thinks that it is a regexp. There is no way to handle all variants of this case correctly without maintaining a stack of grouping contexts to distinguish among the end of a block (expect regexp), the end of a function statement (expect regexp), the end of a function expression (expect division), and the end of an object literal (expect division).
Does it work properly for this code (this shouldn't have a semicolon; it produces an error when lexed properly)?
function square(num) {
var result;
var f = function (x) {
return x * x;
}
(result = f(num));
return result;
}
If it does, does it work properly for this code, that relies on semicolon insertion?
function square(num) {
var f = function (x) {
return x * x;
}
return f(num);
}

Find longest repeating substring in JavaScript using regular expressions

I'd like to find the longest repeating string within a string, implemented in JavaScript and using a regular-expression based approach.
I have a PHP implementation that, when directly ported to JavaScript, doesn't work.
The PHP implementation is taken from an answer to the question "Find longest repeating strings?":
preg_match_all('/(?=((.+)(?:.*?\2)+))/s', $input, $matches, PREG_SET_ORDER);
This will populate $matches[0][X] (where X is the length of $matches[0]) with the longest repeating substring to be found in $input. I have tested this with many input strings and am confident the output is correct.
The closest direct port in JavaScript is:
var matches = /(?=((.+)(?:.*?\2)+))/.exec(input);
This doesn't give correct results
input            Expected result   matches[0][X]
======================================================
inputinput       input             input
7inputinput      input             input
inputinput7      input             input
7inputinput7     input             7
XXinputinputYY   input             XX
I'm not familiar enough with regular expressions to understand what the regular expression used here is doing.
There are certainly algorithms I could implement to find the longest repeating substring. Before I attempt to do that, I'm hoping a different regular expression will produce the correct results in JavaScript.
Can the above regular expression be modified such that the expected output is returned in JavaScript? I accept that this may not be possible in a one-liner.
Javascript matches only return the first match -- you have to loop in order to find multiple results. A little testing shows this gets the expected results:
function maxRepeat(input) {
  var reg = /(?=((.+)(?:.*?\2)+))/g;
  var sub = ""; // somewhere to stick temp results
  var maxstr = ""; // our maximum length repeated string
  reg.lastIndex = 0; // because reg previously existed, we may need to reset this
  sub = reg.exec(input); // find the first repeated string
  while (sub !== null) {
    if (sub[2].length > maxstr.length) {
      maxstr = sub[2];
    }
    sub = reg.exec(input);
    reg.lastIndex++; // start searching from the next position
  }
  return maxstr;
}
// I'm logging to console for convenience
console.log(maxRepeat("aabcd")); //a
console.log(maxRepeat("inputinput")); //input
console.log(maxRepeat("7inputinput")); //input
console.log(maxRepeat("inputinput7")); //input
console.log(maxRepeat("7inputinput7")); //input
console.log(maxRepeat("xxabcdyy")); //x
console.log(maxRepeat("XXinputinputYY")); //input
Note that for "xxabcdyy" you only get "x" back, as it returns the first string of maximum length.
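In environments that support String.prototype.matchAll (ES2020), the manual lastIndex bookkeeping can be left to the iterator, which steps past zero-length matches on its own. This is a sketch of that alternative, not the code from the answer above. The function name longestRepeat is made up, and the s flag is added so that . also matches newlines, mirroring the /s modifier in the PHP version.

```javascript
// Same lookahead pattern as above; group 2 holds the repeated substring.
function longestRepeat(input) {
  var re = /(?=((.+)(?:.*?\2)+))/gs;
  var best = "";
  for (var m of input.matchAll(re)) {
    // Strict > keeps the first candidate when several have the same length.
    if (m[2].length > best.length) {
      best = m[2];
    }
  }
  return best;
}

console.log(longestRepeat("7inputinput7"));   // input
console.log(longestRepeat("XXinputinputYY")); // input
console.log(longestRepeat("xxabcdyy"));       // x
```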
It seems JS regexes are a bit weird. I don't have a complete answer, but here's what I found.
Although I thought they did the same thing, re.exec() and "string".match(re) behave differently. With /g, exec returns only one match per call (along with its capture groups), whereas match returns all the full matches at once.
On the other hand, exec seems to work correctly with ?= in the regex, whereas match returns all empty strings (the lookahead itself consumes no characters, so every full match is empty). Removing the ?= leaves us with
re = /((.+)(?:.*?\2)+)/g
Using that
"XXinputinputYY".match(re);
returns
["XX", "inputinput", "YY"]
whereas
re.exec("XXinputinputYY");
returns
["XX", "XX", "X"]
So at least with match you get inputinput as one of your values. Obviously, this neither pulls out the longest, nor removes the redundancy, but maybe it helps nonetheless.
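The difference between the two calls comes down to how the g flag is used: match collects every full match in one call, while exec returns one match per call and remembers its position in lastIndex. A small sketch with the same pattern (variable names are illustrative):

```javascript
var text = "XXinputinputYY";

// match with /g: all full matches at once, capture groups discarded.
var viaMatch = text.match(/((.+)(?:.*?\2)+)/g);

// exec with /g: call repeatedly until it returns null.
var re = /((.+)(?:.*?\2)+)/g;
var viaExec = [];
var m;
while ((m = re.exec(text)) !== null) {
  viaExec.push(m[0]); // m[1] and m[2] hold the capture groups for this match
}

console.log(viaMatch); // ["XX", "inputinput", "YY"]
console.log(viaExec);  // ["XX", "inputinput", "YY"]
```

This loop is only safe because the pattern without ?= never produces an empty match. With the lookahead version, lastIndex has to be advanced by hand to avoid looping forever.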
One other thing, I tested in firebug's console which threw an error about not supporting $1, so maybe there's something in the $ vars worth looking at.
