I want to validate phone numbers against the following criteria:
- Minimum of 6 digits.
- Can only contain the following symbols: "+", "(", ")", "-".
- Contains no more than n consecutive symbols, but any number of consecutive digits is OK.
Here are some examples of what I consider valid:
07519767576
+447519767576
(02380) 346450
(+44) 7519767576
I have been trying to do this myself for quite a while but keep hitting a brick wall. Here is what I have tried so far:
^(?=.{9,}$)(?=[^0-9]*[0-9])(?:([\d\s\+\(\)\-])\1?(?!\1{5}))+?$
This kind of works, but it's a bit of a hack because it also limits the number of consecutive digits.
I am not able to do this check in PHP; sadly, it has to be done in JS. Is this even possible without needing a degree in regex?
At least one of your requirements is beyond what traditional regular languages can do in general. As pointed out in the comments, counting the number of digits across patterns, groups or regular expressions is not possible in traditional regular languages, whose matchers are essentially Deterministic Finite Automata (also known as DFAs).
PCRE-compatible regular expressions, which most languages (JavaScript and Python, for example) support in some form, add functionality such as backtracking, lookahead matching, grouping, counting for a single group, and so on.
These enhance the set of patterns PCRE regular expressions can match, or more technically the set of languages the expression will accept. But to the best of my knowledge, none of these extensions let one do counting in the way you want to here, at least directly.
It turns out that matching PCRE-compatible regular expressions is NP-complete in theory (because of backreferences), but that doesn't mean it's easy or even feasible to write a regular expression for a given problem.
In most cases one would write a small hand-rolled parser in a Turing-complete programming language, which can do what you need fairly easily.
OP mentioned that doing this is not an option, and thus the problem has come to a standstill.
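For reference, a hand-rolled check of this kind can be quite short in plain JavaScript. The following is only a sketch under a few assumptions that are not in the question: the function name `isValidPhone` is made up, `maxRun` stands in for the "n" of the third criterion, and whitespace is treated as allowed (the examples contain spaces) and as ending a run of symbols.

```javascript
// Hypothetical helper: validates a phone number against the three criteria.
// maxRun is the "n" from the question (maximum consecutive symbols).
function isValidPhone(input, maxRun = 2) {
  // Only digits, whitespace and the four allowed symbols may appear.
  if (!/^[\d\s+()-]+$/.test(input)) return false;

  // At least 6 digits in total, wherever they occur.
  const digitCount = input.replace(/\D/g, "").length;
  if (digitCount < 6) return false;

  // No run of more than maxRun consecutive symbols.
  // Assumption: digits and whitespace both reset the run.
  let run = 0;
  for (const ch of input) {
    run = "+()-".includes(ch) ? run + 1 : 0;
    if (run > maxRun) return false;
  }
  return true;
}

console.log(isValidPhone("(+44) 7519767576"));
console.log(isValidPhone("((+123456"));
```

Splitting the check into three small steps keeps each rule readable and makes the limit on consecutive symbols independent of the number of digits.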
I am looking for a pre-processor for creating my own regular expression syntax, based on RegExp and PCRE syntax, so that it can be compiled down to PCRE syntax. There is an example at the end.
I guess I need a regular expression processor that outputs a tree structure representing the regular expression, so I can traverse the tree, hot-swap some parts, and then compile it back to a regular expression string.
But this processor must allow adding my own syntax parsing/processing.
Is there a processor like this already made by someone? I made one myself some time ago, but I am looking for a more professional solution.
Of course, we are talking about node.js/JavaScript.
Yes, node.js does not support PCRE natively, but there is an npm module for using PCRE with node.js, and it works great!
Why would someone need it?
For example, you can build a big regular expression from smaller ones:
(John (like|love)s every (animal|creature) on earth: (#animals))
(#...) is a hash tag group; it means that another regular expression, containing alternatives for all animals, will be put in its place.
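As a rough illustration of the hash tag group idea, plain string substitution is enough for the simple case. This is only a sketch: `expandHashGroups` and the `definitions` map are hypothetical names, the list of animals is invented, and the built-in `RegExp` is used instead of a PCRE module.

```javascript
// Hypothetical helper: replaces every (#name) with a non-capturing group
// that contains the sub-pattern registered under that name.
function expandHashGroups(pattern, definitions) {
  return pattern.replace(/\(#(\w+)\)/g, (match, name) => {
    if (!definitions.has(name)) {
      throw new Error(`Unknown hash tag group: ${name}`);
    }
    return `(?:${definitions.get(name)})`;
  });
}

// Invented sample definition for the "animals" group.
const definitions = new Map([["animals", "cats|dogs|horses"]]);
const source = expandHashGroups(
  "(John (like|love)s every (animal|creature) on earth: (#animals))",
  definitions
);
const sentence = new RegExp(source);

console.log(source);
console.log(sentence.test("John loves every creature on earth: dogs"));
```

String substitution like this does not understand nesting or escaping, which is where a tree-based processor would start to pay off.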
Another example, you can create more sophisticated kind of groups:
(#(a|x)(b)(c))
A permutation group matches all of its bracketed groups (three here, but there could be fewer or more) in any order:
(a|x)(b)(c)
(a|x)(c)(b)
(b)(a|x)(c)
(b)(c)(a|x)
(c)(a|x)(b)
(c)(b)(a|x)
I have more examples, but I guess I've made my point.
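The permutation group above could be expanded into an ordinary alternation along these lines. This is a sketch, not an existing library: `permutations` and `expandPermutationGroup` are hypothetical names, and the sub-patterns are passed in as an array of strings instead of being parsed out of the (#...) syntax.

```javascript
// Returns every ordering of the given array (n! results, so keep n small).
function permutations(items) {
  if (items.length <= 1) return [items];
  return items.flatMap((item, i) =>
    permutations([...items.slice(0, i), ...items.slice(i + 1)])
      .map(rest => [item, ...rest])
  );
}

// Hypothetical helper: joins all orderings into one non-capturing alternation.
function expandPermutationGroup(groups) {
  const alternatives = permutations(groups).map(order => order.join(""));
  return `(?:${alternatives.join("|")})`;
}

const expanded = expandPermutationGroup(["(a|x)", "(b)", "(c)"]);
const anyOrder = new RegExp(`^${expanded}$`);

console.log(expanded);
console.log(anyOrder.test("cbx"));
```

Note that the capturing groups are duplicated in every alternative, so their numbering in the expanded pattern no longer corresponds to the original three groups.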
I'm trying to strip all 'Unicode Symbols' from a string. That is, keeping all multilingual characters but removing dingbats, arrows, and all of that stuff.
C# has a very handy function called Char.IsSymbol that can be run on all characters of a string, stripping the character when the function returns true.
I've been searching on doing something similar in JavaScript. If it's a regex then how can I compile a list of all the unicode ranges of the symbol characters? I looked at XRegExp but couldn't find something that only filters symbols.
XRegExp does have support for what you're looking for - http://xregexp.com/plugins/#unicode
You'd probably match either for \pL or \pS. You can find a nice list of the typical unicode categories in http://www.regular-expressions.info/unicode.html#category
Overall, Unicode is quite tricky. It gives plenty of opportunities for trouble, especially with software that isn't fully Unicode compatible (sadly, this includes JavaScript - see https://mathiasbynens.be/notes/javascript-unicode for a nice set of examples). This is further exacerbated by the fact that JS often runs with double-encoding (HTML+JS, and there are worse cases as well). Somebody will probably find a way to bypass your checks, but I'm afraid there's no easy way to prevent that. Just be on the lookout :)
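As a sketch of the \pS idea without any library: engines that implement ECMAScript 2018 or later accept Unicode property escapes in native regular expressions when the `u` flag is set. The function name `stripSymbols` is made up, and the example assumes such an engine.

```javascript
// Hypothetical helper: removes every character whose Unicode general
// category is Symbol (Sm, Sc, Sk, So). Requires the "u" flag (ES2018+).
function stripSymbols(text) {
  return text.replace(/\p{S}/gu, "");
}

console.log(stripSymbols("héllo→wörld✂"));
console.log(stripSymbols("日本語€"));
```

Keep in mind that \p{S} also covers ASCII symbols such as + and $, so a stricter filter may need to re-allow some of them.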
Hey I've written a fractal-generating program in JavaScript and HTML5 (here's the link), which was about a 2 year process including all the research I did on Complex math and fractal equations, and I was looking to update the interface, since it is quite intimidating for people to look at. While looking through the code I noticed that some of my old techniques for going about doing things were very inefficient, such as my Complex.parseFunction.
I'm looking for a way to use RegExp to parse components of the expression such as functions, operators, and variables, as well as implementing the proper order of operations for the expression. An example below might demonstrate what I mean:
//the first example parses an expression with two variables and outputs to string
console.log(Complex.parseFunction("i*-sinh(C-Z^2)", ["Z","C"], false))
"Complex.I.mult(Complex.neg(Complex.sinh(C.sub(Z.cPow(new Complex(2,0,2,0))))))"
//the second example parses the same expression but outputs to function
console.log(Complex.parseFunction("i*-sinh(C-Z^2)", ["Z","C"], true))
function(Z,C){
return Complex.I.mult(Complex.neg(Complex.sinh(C.sub(Z.cPow(new Complex(2,0,2,0))))));
}
I know how to handle RegExp using String.prototype.replace and all that, all I need is the RegExp itself. Please note that it should be able to tell the difference between the subtraction operator (e.g. "C-Z^2") and the negative function (e.g. "i*-(Z^2+C)") by noting whether it is directly after a variable or an operator respectively.
While you can use regular expressions as part of an expression parser, for example to break out tokens, regular expressions do not have the computational power to parse properly nested mathematical expressions. That is essentially one of the core results of computing theory (finite state automata vs. push down automata). You probably want to look at something like recursive-descent or LR parsing.
I also wouldn't worry too much about the efficiency of parsing an expression provided you only do it once. Given all of the other math you are doing, I doubt it is material.
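To make the recursive-descent suggestion concrete, here is a minimal sketch. A regular expression is used only for tokenizing; the nesting and the order of operations are handled by mutually recursive functions. The names `tokenize` and `parseExpression`, and the output names `add`, `sub`, `mult`, `div`, `neg` and `pow`, are made up for illustration and are not the Complex API from the question.

```javascript
// Splits the source into numbers, identifiers, operators and parentheses.
function tokenize(src) {
  const tokens = src.match(/\d+(?:\.\d+)?|[A-Za-z_]\w*|[-+*\/^()]/g) || [];
  // Reject input that contains characters the regex silently skipped.
  if (tokens.join("") !== src.replace(/\s+/g, "")) {
    throw new SyntaxError("Unexpected character in expression");
  }
  return tokens;
}

// Grammar (lowest to highest precedence):
//   expr  := term (("+" | "-") term)*
//   term  := unary (("*" | "/") unary)*
//   unary := "-" unary | power
//   power := atom ("^" unary)?
//   atom  := number | name | name "(" expr ")" | "(" expr ")"
function parseExpression(src) {
  const tokens = tokenize(src);
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];
  const expect = (t) => {
    if (next() !== t) throw new SyntaxError(`Expected "${t}"`);
  };

  function expr() {
    let left = term();
    while (peek() === "+" || peek() === "-") {
      // A "-" seen here follows a complete operand, so it is subtraction.
      const op = next() === "+" ? "add" : "sub";
      left = `${op}(${left}, ${term()})`;
    }
    return left;
  }
  function term() {
    let left = unary();
    while (peek() === "*" || peek() === "/") {
      const op = next() === "*" ? "mult" : "div";
      left = `${op}(${left}, ${unary()})`;
    }
    return left;
  }
  function unary() {
    // A "-" seen where an operand is expected is negation.
    if (peek() === "-") { next(); return `neg(${unary()})`; }
    return power();
  }
  function power() {
    const base = atom();
    if (peek() === "^") { next(); return `pow(${base}, ${unary()})`; }
    return base;
  }
  function atom() {
    const t = next();
    if (t === undefined) throw new SyntaxError("Unexpected end of input");
    if (t === "(") { const inner = expr(); expect(")"); return inner; }
    if (/^\d/.test(t)) return t;
    if (/^[A-Za-z_]/.test(t)) {
      if (peek() !== "(") return t;          // variable or constant
      next();                                // function call
      const arg = expr();
      expect(")");
      return `${t}(${arg})`;
    }
    throw new SyntaxError(`Unexpected token "${t}"`);
  }

  const result = expr();
  if (pos < tokens.length) throw new SyntaxError(`Unexpected token "${peek()}"`);
  return result;
}

console.log(parseExpression("i*-sinh(C-Z^2)"));
```

The distinction between subtraction and negation falls out of the grammar: the parser always knows whether it is expecting an operand or an operator, so no lookbehind tricks are needed.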
Is it possible to do client side validation in a localized web application environment?
I've only seen regular expressions written in English, can they be written for other languages? Would the regular expressions have to be changed based on the language chosen by an end user or is it possible to use just 1?
Are there any tools/frameworks to help with this?
The previous answer was good, but it's not clear to me that it answered the question. For that matter, I don't really understand the question. If you're asking whether JavaScript regular expressions are independent of language, then the answer is yes: they are just looking at characters in a string. But obviously the things you're looking for with those regular expressions (words, numbers, phone numbers, dates, etc.) would presumably vary with language and locale. So you may be able to construct a universal regex that validates all phone numbers, for example, but it's unlikely, and in any case there may be cases where a valid number in one context is invalid in another. You're better off creating language-specific regular expressions for validation, just as you would create language-specific strings. Does that answer your question?
No. Please do not confuse validation with well-formedness. The former is a measure of conformity to a grammar definition and the latter is a measure of conformity to a syntax requirement. Even if your regex were so extremely awesome as to account for all well-formedness conditions, it would still lack the context of structured definitions where the structure is recursive and reflective.
I am trying to break the following sentence in words and wrap them in span.
<p class="german_p big">Das ist ein schönes Armband</p>
I followed this:
How to get a word under cursor using JavaScript?
$('p').each(function() {
var $this = $(this);
$this.html($this.text().replace(/\b(\w+)\b/g, "<span>$1</span>"));
});
The only problem I am facing is that, after wrapping the words in span, the resulting HTML looks like this:
<p class="german_p big"><span>Das</span> <span>ist</span> <span>ein</span> <span>sch</span>ö<span>nes</span> <span>Armband</span>.</p>
So schönes is broken into three words: sch, ö and nes. Why is this happening? What would be the correct regex for this?
Unicode in Javascript Regexen
Like Java itself, Javascript doesn't support Unicode in its \w, \d, and \b regex shortcuts. This is (arguably) a bug in Java and Javascript. Even if one manages through casuistry or obstinacy to argue that it is not a bug, it's sure a big gotcha. Kinda bites, really.
The problem is that those popular regex shortcuts only apply to 7-bit ASCII whether in Java or in Javascript. This restriction is painfully 1970s‐ish; it makes absolutely no sense in the 21ˢᵗ century. This blog posting from this past March makes a good argument for fixing this problem in Javascript.
It would be really nice if some public-spirited soul would please add Javascript to this Wikipedia page that compares the support for regex features in various languages.
This page says that Javascript doesn't support any Unicode properties at all. That same site has a table that's a lot more detailed than the Wikipedia page I mention above. For Javascript features, look under its ECMA column.
However, that table is in some cases at least five years out of date, so I can't completely vouch for it. It's a good start, though.
Unicode Support in Other Languages
Ruby, Python, Perl, and PCRE all offer ways to extend \w to mean what it is supposed to mean, but the two J‐thingies do not.
In Java, however, there is a good workaround available. There, you can use \pL to mean any character that has the Unicode General_Category=Letter property. That means you can always emulate a proper \w using [\pL\p{Nd}_].
Indeed, there's even an advantage to writing it that way, because it keeps you aware that you're adding decimal numbers and the underscore character to the character class. With a simple \w, people sometimes forget this is going on.
I don't believe that this workaround is available in Javascript, though. You can also use Unicode properties like those in Perl and PCRE, and in Ruby 1.9, but not in Python.
The only Unicode properties current Java supports are the one- and two-character general properties like \pN and \p{Lu} and the block properties like \p{InAncientSymbols}, but not scripts like \p{IsGreek}, etc.
The future JDK7 will finally get around to adding scripts. Even then Java still won't support most of the Unicode properties, though, not even critical ones like \p{WhiteSpace} or handy ones like \p{Dash} and \p{Quotation_Mark}.
SIGH! To understand just how limited Java's property support is, merely compare it with Perl. Perl supports 1633 Unicode properties as of 2007's 5.10 release, and 2478 of them as of this year's 5.12 release. I haven't counted them for ancient releases, but Perl started supporting Unicode properties back during the last millennium.
Lame as Java is, it's still better than Javascript, because Javascript doesn't support any Unicode properties whatsoCENSOREDever. I'm afraid that Javascript's paltry 7-bit mindset makes it pretty close to unusable for Unicode. This is a tremendously huge gaping hole in the language that's extremely difficult to account for given its target domain.
Sorry 'bout that. ☹
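Newer engines have since closed part of this gap: ECMAScript 2018 added \p{...} property escapes behind the `u` flag, so the [\pL\p{Nd}_] emulation of \w described above can be written in JavaScript as well. A minimal sketch, assuming such an engine (the function name `wrapWords` is made up, and plain strings are used instead of jQuery):

```javascript
// Hypothetical helper: wraps each run of letters, decimal digits and
// underscores in a <span>. Requires property escapes (ES2018+, "u" flag).
function wrapWords(text) {
  return text.replace(/[\p{L}\p{Nd}_]+/gu, "<span>$&</span>");
}

console.log(wrapWords("Das ist ein schönes Armband"));
```

No \b is needed here, because the greedy character class already consumes each word in full. Text that stores ö as o plus a combining diaeresis would additionally need \p{M} in the class.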
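You can also use
/\b([äöüÄÖÜß\w]+)\b/g
instead of
/\b(\w+)\b/g
in order to handle the umlauts. Note that \b itself is still ASCII-only, so a word that starts or ends with an umlaut (Übung, for example) is still not matched in full.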
\w only matches A-Z, a-z, 0-9, and _ (underscore).
You could use something like \S+ to match all non-space characters, including non-ASCII characters like ö. This might or might not work depending on how the rest of your string is formatted.
Reference: http://www.javascriptkit.com/javatutors/redev2.shtml
To include all the Latin 1 Supplement characters like äöüßÒÿ you can use:
[\w\u00C0-\u00ff]
however, there are even more funny characters in the Latin Extended-A and Latin Extended-B unicode blocks like ČŇů . To include that you can use:
[\w\u00C0-\u024f]
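Applied to the sentence from the question, the range-based class might be used like this. It is a sketch with a made-up function name (`wrapLatinWords`), and it drops the \b anchors because the greedy class already matches each word in full.

```javascript
// Hypothetical helper: treats ASCII word characters plus the Latin-1
// Supplement and Latin Extended-A/B ranges (U+00C0 to U+024F) as word characters.
function wrapLatinWords(text) {
  return text.replace(/[\w\u00C0-\u024f]+/g, "<span>$&</span>");
}

console.log(wrapLatinWords("Das ist ein schönes Armband"));
```

The range also contains × (U+00D7) and ÷ (U+00F7), which are not letters, so they would be treated as part of a word too.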
\w and \b are not Unicode-aware in JavaScript; they only match ASCII word/boundary characters. If your use cases all allow splitting on whitespace, you can use \s/\S, which are Unicode-aware.
As others note, the \w shortcut is not very useful for non-Latin character sets. If you need to match other text ranges you should use hex* notation (Ref1) (Ref2) for the appropriate range.
* could be hex or octal or unicode, you'll often see these collectively referred as hex notation.
The \b's will also not work correctly. It is possible to use the XRegExp library's \p{L} token for Unicode support; however, there is still no \b support, so you won't be able to find the word boundaries. It would be nice to provide \b support by doing lookbehinds/lookaheads with \P{L}, along the lines of the following implementation:
http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript
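In engines that support both lookbehind and Unicode property escapes natively (both arrived with ECMAScript 2018), a Unicode-aware stand-in for \b can be sketched with negative lookarounds. The names `escapeRegExp`, `WORD_CHAR` and `hasWholeWord` are made up for this example, and XRegExp is not used.

```javascript
// Escapes regex syntax characters so the word is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Assumption: a "word character" is any letter, any number or underscore.
const WORD_CHAR = "[\\p{L}\\p{N}_]";

// Hypothetical helper: true if `word` occurs in `text` without a word
// character directly before or after it.
function hasWholeWord(text, word) {
  const pattern = `(?<!${WORD_CHAR})${escapeRegExp(word)}(?!${WORD_CHAR})`;
  return new RegExp(pattern, "u").test(text);
}

console.log(hasWholeWord("ein schönes Armband", "nes"));
console.log(/\bnes\b/.test("ein schönes Armband"));
```

The second line shows the ASCII-only \b for comparison: it sees a boundary between ö and n, whereas the lookaround version does not.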
While javascript doesn't support Unicode natively, you could use this library to work around it: http://xregexp.com/