Javascript regex "replace(/[ -_]/g)" deletes numbers? - javascript

I was doing some tests in Javascript with the replace javascript function.
Consider the following examples executed on a node REPL.
It's a replace that deletes spaces, hyphens and underscores from a string.
> "call this 9344 5 66 22".replace(/[ _-]/g, '');
'callthis934456622'
That is what I was expecting: only the spaces are deleted.
However take a look at this:
> "call this 9344 5 66 22".replace(/[ -_]/g, '');
'callthis'
Why, when I put the characters in exactly this order, [ -_] (space, hyphen, underscore), does it delete the numbers in the string?
More tests I did:
[ -] (space, hyphen) does not delete numbers
[ _] (space, underscore) does not delete numbers
[ _-] (space, underscore, hyphen) does not delete numbers
[-_ ] (hyphen, underscore, space) does not delete numbers
[_- ] (underscore, hyphen, space) REPL blocks??
[ -_] (space, hyphen, underscore) does delete numbers

[ -_] means characters from space (ASCII 32) to _ (ASCII 95) which includes, among other things, numbers and capital letters.
What you are looking for is [ \-_]. Escaping the - will make it act like the character instead of the meta-character for ranges.
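To make the range concrete, here is a small sketch (variable names are my own) that lists the characters between code 32 and code 95 and runs the question's input through both forms of the class:

```javascript
// Collect every character that the class [ -_] covers: codes 32 through 95.
const covered = [];
for (let code = 32; code <= 95; code++) {
  covered.push(String.fromCharCode(code));
}
const rangeChars = covered.join('');

const input = 'call this 9344 5 66 22';
const asRange = input.replace(/[ -_]/g, '');    // hyphen acts as a range operator
const asLiteral = input.replace(/[ \-_]/g, ''); // hyphen escaped, so it is a literal

console.log(rangeChars); // includes 0-9 and A-Z, but no lowercase letters
console.log(asRange);    // 'callthis'
console.log(asLiteral);  // 'callthis934456622'
```

The digits sit at codes 48 to 57, inside the 32 to 95 range, which is why they disappear along with the spaces.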

A hyphen that is not at the start or end of a character class needs to be escaped; otherwise it represents a range.
So this regex:
[ -_]
will match anything from space to underscore i.e. ASCII 32-95

The - character has special meaning in character classes. When it appears between two characters, it represents a character range — e.g. [a-z] matches any character with a character code between a and z, inclusive.
However, as you've observed, when it's placed at the beginning or end of the character class, it just represents a literal - character. This can also be accomplished by escaping the - within the character class — i.e. [ \-_].
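A quick way to compare the placements is to check which characters of a sample string each class accepts. This is only a sketch; matchedChars is a hypothetical helper, not a built-in:

```javascript
// Hypothetical helper: returns the characters of `sample` that `pattern` matches.
// The patterns carry no g flag, so test() keeps no state between calls.
function matchedChars(pattern, sample) {
  return sample.split('').filter((ch) => pattern.test(ch)).join('');
}

const sample = 'a-c b_5';
const hyphenFirst = matchedChars(/[-_ ]/, sample);    // hyphen at the start: literal
const hyphenLast = matchedChars(/[ _-]/, sample);     // hyphen at the end: literal
const hyphenEscaped = matchedChars(/[ \-_]/, sample); // escaped hyphen: literal
const hyphenRange = matchedChars(/[ -_]/, sample);    // hyphen in the middle: range

console.log(hyphenFirst, hyphenLast, hyphenEscaped); // all three give '- _'
console.log(hyphenRange);                            // '- _5', the digit is now included
```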

"call this 9344 5 66 22".replace(/(\s|-|_)/g, '');

In a class, the dash - character has special meaning as a range operator ONLY when
it doesn't separate clauses, parsed left to right.
Otherwise it is considered no different than any other literal.
Regular expression parsers have no time to worry about good form.
So you can put the dash anywhere you want as a literal, as long as it separates clauses (i.e. it's not ambiguous).
Most people put it at the end or beginning or escape it so no conceptual errors occur.
Example of clauses and literal dashes (the clauses are a-z, \p{L}, 0-9, \x00-\x09 and \x20; every other dash is a literal):
[-a-z-\p{L}-0-9-\x00-\x09-\x20-]

Related

How to improve this Regular expression validation?

I tried to write form validation for a description textarea in which users describe themselves, for example their education or experience.
I wrote this regex for the textarea, but I have a problem: if the user types an "above comma" it is not allowed. For example, if the user writes "House's", the ' character is rejected.
Which symbols might be needed or expected when users describe themselves?
I used this Regex:
$descriptionValidation = "/^[a-zA-Z0-9\.\-\,\"\(\) ]+[a-zA-Z0-9\.\-\,\"\(\) ]*$/";
To match a whole string and require that the string only consist of alphanumeric characters and: dots, commas, single-quotes (also called apostrophes, but not "above commas"), double-quotes, left parentheses, right parentheses, spaces, and hyphens, use the following expression.
The ^ and $ metacharacters ensure that the characters span the entire length of the string. + means one or more of the any of the characters in the list. The "list" is technically called a "character class". a-z is the full range of letters and \d is the full range of numbers. - does have special meaning inside of a character class but only if it has a non-ranged expression on both sides of it. If you wish to prevent mistakes with hyphens inside of a character class, you can add a backslash to escape it or you can write the hyphen at the start or end of the character class OR you can write it next to a character range.
/^[a-z\d.,'"() -]+$/i
When declaring this pattern in php using single quotes, you will need to escape the single-quote in the character class.
$descriptionValidation = '/^[a-z\d.,\'"() -]+$/i';
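For comparison, here is the same check sketched in JavaScript. The function name isValidDescription is my own, and I'm assuming the pattern carries over unchanged, which it should since it only uses a character class, anchors and the i flag:

```javascript
// Same character class as the PHP pattern above; no quote escaping is needed in a regex literal.
const descriptionPattern = /^[a-z\d.,'"() -]+$/i;

// Hypothetical helper: true when every character of `text` is in the allowed list.
function isValidDescription(text) {
  return descriptionPattern.test(text);
}

console.log(isValidDescription("House's")); // true, the apostrophe is allowed
console.log(isValidDescription('a<b'));     // false, < is not in the list
console.log(isValidDescription(''));        // false, + requires at least one character
```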

regex allows one character (it should not) why?

Hello I am trying to create a regex that recognizes money and numbers being inputted. I have to allow numbers because I am expecting non-formatted numbers to be inputted programmatically and then I will format them myself. For some reason my regex is allowing a one letter character as a possible input.
[\$]?[0-9,]*\.[0-9][0-9]
I understand that my regex accepts the case where multiple commas are added and also needs two digit after the decimal point. I have had an idea of how to fix that already. I have narrowed it down to possibly the *\. as the problem
EDIT
I found the regex expression that worked [\$]?([0-9,])*[\.][0-9]{2} but I still don't know how or why it was failing in the first place
I am using the .formatCurrency() to format the input into a money format. It can be found here but it still allows me to use alpha characters so i have to further masked it using the $(this).inputmask('Regex', { regex: "[\$]?([0-9,])*[\.][0-9]{2}" }); where input mask is found here and $(this) is a reference to a input element of type text. My code would look something like this
<input type="text" id="123" data-Money="true">
//in the script
.find("input").each(function () {
    if ($(this).attr("data-Money") == "true") {
        $(this).inputmask('Regex', { regex: "[\$]?([0-9,])*[\.][0-9]{2}" });
        $(this).on("blur", function () {
            $(this).formatCurrency();
        });
    }
});
I hope this helps. I tried creating a JSFiddle, but I don't know how to add external libraries/plugins/extensions.
The "regular expression" you're using in your example script isn't a RegExp:
$(this).inputmask('Regex', { regex: "[\$]?([0-9,])*[\.][0-9]{2}" });
Rather, it's a String which contains a pattern which at some point is being converted into a true RegExp by your library using something along the lines of
var RE=!(value instanceof RegExp) ? new RegExp(value) : value;
Within Strings a backslash \ is used to represent special characters, like \n to represent a new-line. Adding a backslash to the beginning of a period, i.e. \., does nothing as there is no need to "escape" the period.
Thus, the RegExp being created from your String isn't seeing the backslash at all.
Instead of providing a String as your regular expression, use JavaScript's literal regular expression delimiters.
So rather than:
$(this).inputmask('Regex', { regex: "[\$]?([0-9,])*[\.][0-9]{2}" });
use
$(this).inputmask('Regex', { regex: /[\$]?([0-9,])*[\.][0-9]{2}/ });
And I believe your "regular expression" will perform as you expect.
(Note the use of forward slashes / to delimit your pattern, which JavaScript will use to provide a true RegExp.)
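Here is a minimal sketch of the difference, using the first pattern from the question. I'm assuming the library hands the string to new RegExp as described above; the ^ and $ anchors are my addition so that the whole input has to match:

```javascript
// Written as a string, "\." collapses to "." and "\$" to "$" before RegExp ever sees them.
const fromString = new RegExp('^' + "[\$]?[0-9,]*\.[0-9][0-9]" + '$');

// Written as a regex literal, the backslashes survive.
const fromLiteral = /^[\$]?[0-9,]*\.[0-9][0-9]$/;

console.log(fromString.source);             // ^[$]?[0-9,]*.[0-9][0-9]$
console.log(fromString.test('a12'));        // true, the bare . accepts the letter
console.log(fromLiteral.test('a12'));       // false
console.log(fromLiteral.test('$1,234.56')); // true
```

If a string is unavoidable, doubling the backslashes ("\\.") should give the same pattern as the literal.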
Firstly, you can replace '[0-9]' with '\d'. So we can rewrite your first regex a little more cleanly as
\$?[\d,]*\.\d\d
Breaking this down:
\$? - A literal dollar sign, zero or one
[\d,]* - Either a digit or a comma, zero or more
\. - A literal dot, required
\d - A digit, required
\d - A digit, required
From this, we can see that the minimum legal string is \.\d\d, three characters long. The regex you gave will never validate against any one character string.
Looking at your second regex,
[\$]? - A literal dollar sign, zero or one
([0-9,])* - Either a digit or a comma, subexpression for later use, zero or more
[\.] - A literal dot, required
[0-9]{2} - A digit, twice required
This has the exact same minimum matchable string as above - \.\d\d.
edit: As mentioned, depending on the language you may need to escape the backslashes to ensure they aren't misinterpreted by the language when processing the string.
Also, as an aside, the below regex is probably closer to what you need.
[A-Z]{3} ?(\d{0,3}(?:([,. ])\d{3}(?:\2\d{3})*)?)(?!\2)[,.](\d\d)\b
Explanation:
[A-Z]{3} - Three letters; for an ISO currency code
 ? - A space, zero or one; for readability
( - Capture block; to catch the integer currency amount
\d{0,3} - A digit, between zero and three; for the first digit block
(?: - Non capturing block (NC)
([,. ]) - A comma, dot or space; as a thousands delimiter
\d{3} - A digit, three; the first possible whole thousands
(?: - Non capturing block (NC)
\2 - Match 2; the captured thousands delimiter above
\d{3} - A digit, three
)* - The above group, zero or more, i.e. as many thousands as we want
)? - The above (NC) group, zero or one, i.e. all whole thousands
) - The above group, i.e. everything before the decimal
(?!\2) - Negative lookahead; the decimal delimiter must not be the thousands delimiter
[,.] - A comma or dot, as a decimal delimiter
(\d\d) - Capture, a digit, two; i.e. the decimal portion
\b - A word boundary; to ensure that we don't catch another digit in the wrong place.
The negative lookahead was provided by an answer from John Kugelman in this question.
This correctly matches (matches enclosed in square brackets):
[AUD 1.00]
[USD 1,300,000.00]
[YEN 200 000.00]
I need [USD 1,000,000.00], all in non-sequential bills.
But not:
GBP 1.000
YEN 200,000

regular expression incorrectly matching % and $

I have a regular expression in JavaScript to allow digits and the characters (,.+() -) in a phone field.
My regex is [0-9-,.+() ]
It works for digits as well as the characters above, but it also allows characters like % and $, which are not in the list.
Even though you don't have to, I always make it a point to escape metacharacters (easier to read and less pain):
[0-9\-,\.+\(\) ]
But this won't work like you expect it to because it will only match one valid character while allowing other invalid ones in the string. I imagine you want to match the entire string with at least one valid character:
^[0-9\-,\.\+\(\) ]+$
Your original regex is not actually matching %. What it is doing is matching valid characters, but the problem is that it only matches one of them. So if you had the string 435%, it matches the 4, and so the regex reports that it has a match.
If you try to match it against just one invalid character, it won't match. So your original regex doesn't match the string %:
> /[0-9\-,\.\+\(\) ]/.test("%")
false
> /[0-9\-,\.\+\(\) ]/.test("44%5")
true
> "444%6".match(/[0-9\-,\.+\(\) ]/)
["4"] //notice that the 4 was matched.
Going back to the point about escaping, I find that it is easier to escape it rather than worrying about the different rules where specific metacharacters are valid in a character class. For example, - is only valid in the following cases:
When used in an actual character class with proper-order such as [a-z] (but not [z-a])
When used as the first or last character, or by itself, so [-a], [a-], or [-].
When used after a range like [0-9-,] or [a-d-j] (but keep in mind that [9-,] is invalid and [a-d-j] does not match the letters e through i).
For these reasons, I escape metacharacters to make it clear that I want to match the actual character itself and to remove ambiguities.
You just need to anchor your regex:
^[0-9-,.+() ]+$
In a character class, special characters don't need to be escaped, except ] and -.
But even these characters need not be escaped when:
] is alone in the character class: []]
- is at the beginning [-abc] or at the end [abc-] of the character class, or right after the end of a range [a-c-x]
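The effect of anchoring can be checked with a short sketch (the phone number is made up for illustration):

```javascript
const unanchored = /[0-9-,.+() ]/;   // succeeds as soon as one allowed character is found
const anchored = /^[0-9-,.+() ]+$/;  // every character must be in the class

console.log(unanchored.test('435%'));            // true, it found the 4
console.log(anchored.test('435%'));              // false, % is not allowed
console.log(anchored.test('+1 (555) 123-4567')); // true

// A range whose end comes before its start is rejected when the pattern is compiled.
let outOfOrderError = null;
try {
  new RegExp('[9-,]');
} catch (err) {
  outOfOrderError = err;
}
console.log(outOfOrderError instanceof SyntaxError); // true
```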
Escape characters with special meaning in your RegExp. If you're not sure and it isn't an alphabet character, it usually doesn't hurt to escape it, too.
If the whole string must match, include the start ^ and end $ of the string in your RegExp, too.
/^[\d\-,\.\+\(\) ]*$/

What's the meaning about characterEncoding

I'm reading the Sizzle source code. I'm confused by the regular expression for characterEncoding. In the source code, characterEncoding is defined as below:
characterEncoding = "(?:\\\\.|[\\w-]|[^\\x00-\\xa0])+"
It looks like it tries to match \\. or \w- or ^\x00-\xa0.
I know [\w-] means \ or w or -, and I also know [^\x00-\xa0] means anything not in \x00-\xa0. Can anyone tell me the meaning of \\. and \x00-\xa0?
Thanks
I think I know what it is. The type of characterEncoding is string. So if we assign like below:
characterEncoding = "(?:\\\\.|[\\w-]|[^\\x00-\\xa0])+"
The value of characterEncoding is:
(?:\\.|[\w-]|[^\x00-\xa0])+
So if I build a regular expression like above, it means:
[\w-] // A symbol of Latin alphabet or a digit or an underscore '_' or '-'
[^\x00-\xa0] // ISO 10646 characters U+00A1 and higher
\\. // '\' and '.'
So this time, my question is when will the pattern \\. work?
The variable would be better named css3Identifier or something.
Transforming [\w-]|[^\x00-\xa0] into an equivalent form that matches the spec better:
[a-zA-Z0-9_-]|[\u00A1-\uFFFF]
Consider that A1 is 161, _ is underscore and - is a dash and then
read this:
In CSS3, identifiers (including element names, classes, and IDs in selectors (see [SELECT] [or is this still true])) can contain only the characters [A-Za-z0-9] and ISO 10646 characters 161 and higher, plus the hyphen (-) and the underscore (_)
"and higher" is covered by -\uFFFF
The "\\\\." matches a backslash followed by any single character, e.g. in \7B it would match \7 and then B would be caught by the middle alternative. It also matches \n, \r, \t etc.
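To see the two layers of escaping at work, here is a small sketch. Wrapping the pattern in ^ and $ is my own addition for testing whole strings; Sizzle itself combines the fragment into larger expressions:

```javascript
// The string literal from the Sizzle source: each \\ in the source is one \ in the value.
const characterEncoding = "(?:\\\\.|[\\w-]|[^\\x00-\\xa0])+";
console.log(characterEncoding); // (?:\\.|[\w-]|[^\x00-\xa0])+

// Anchored copy, for testing complete strings only.
const identifier = new RegExp('^' + characterEncoding + '$');

console.log(identifier.test('foo-bar'));   // true, word characters and a hyphen
console.log(identifier.test('B\\&W\\?'));  // true, the string B\&W\? uses backslash escapes
console.log(identifier.test('B&W?'));      // false, & and ? are not allowed unescaped
console.log(identifier.test('caf\u00e9')); // true, U+00E9 is above U+00A0
```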
It is just the valid regex format of CSS identifier, class, tag and attributes. A link is also in the source code comment. Following are the rules, including the possible use of backslashes which might answer your question:
4.1. Characters and case
The following rules always hold:
All CSS style sheets are case-insensitive, except for parts that are not under the control of CSS. For example, the case-sensitivity of values of the HTML attributes "id" and "class", of font names, and of URIs lies outside the scope of this specification. Note in particular that element names are case-insensitive in HTML, but case-sensitive in XML.
In CSS3, identifiers (including element names, classes, and IDs in selectors (see [SELECT] [or is this still true])) can contain only the characters [A-Za-z0-9] and ISO 10646 characters 161 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit or a hyphen followed by a digit. They can also contain escaped characters and any ISO 10646 character as a numeric code (see next item). For instance, the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F". (See [UNICODE310] and [ISO10646].)
In CSS3, a backslash (\) character indicates three types of character escapes.
First, inside a string (see [CSS3VAL]), a backslash followed by a newline is ignored (i.e., the string is deemed not to contain either the backslash or the newline).
Second, it cancels the meaning of special CSS characters. Any character (except a hexadecimal digit) can be escaped with a backslash to remove its special meaning. For example, "\"" is a string consisting of one double quote. Style sheet preprocessors must not remove these backslashes from a style sheet since that would change the style sheet's meaning.
Third, backslash escapes allow authors to refer to characters they can't easily put in a style sheet. In this case, the backslash is followed by at most six hexadecimal digits (0..9A..F), which stand for the ISO 10646 ([ISO10646]) character with that number. If a digit or letter follows the hexadecimal number, the end of the number needs to be made clear. There are two ways to do that:
with a space (or other whitespace character): "\26 B" ("&B"). In this case, user agents should treat a "CR/LF" pair (13/10) as a single whitespace character.
by providing exactly 6 hexadecimal digits: "\000026B" ("&B")
In fact, these two methods may be combined. Only one whitespace character is ignored after a hexadecimal escape. Note that this means that a "real" space after the escape sequence must itself either be escaped or doubled.
Backslash escapes are always considered to be part of an identifier or a string (i.e., "\7B" is not punctuation, even though "{" is, and "\32" is allowed at the start of a class name, even though "2" is not).
http://www.w3.org/TR/css3-syntax/#characters

Please explain some Javascript Regular Expressions

I'm learning Javascript via an online tutorial, but nowhere on that website or any other I googled for was the jumble of symbols explained that makes up a regular expression.
Check if all numbers: /^[0-9]+$/
Check if all letters: /^[a-zA-Z]+$/
And the hardest one:
Validate Email: /^[\w-.+]+\#[a-zA-Z0-9.-]+.[a-zA-z0-9]{2,4}$/
What do all the slashes and dollar signs and brackets mean? Please explain.
(By the way, what languages are required to create a flexible website? I know a bit of Javascript and wanna learn jQuery and PHP. Anything else needed?)
Thanks.
There are already a number of good sites that explain regular expressions so I'll just dive a bit into how each of the specific examples you gave translate.
Check if all numbers: ^ anchors the start of the expression (e.g. start at the beginning of the text). Without it a match could be found anywhere. [0-9] finds the characters in that character class (e.g. the numbers 0-9). The + after the character class just means "one or more". The ending $ anchors the end of the text (e.g. the match should run to the end of the input). So if you put that together, that regular expression would allow for only 1 or more numbers in a string. Note that the anchors are important as without them it might match something like "foo123bar".
Check if all letters: Pretty much the same as above but the character classes are different. In this example the character class [a-zA-Z] represents all lowercase and uppercase characters.
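A short sketch of the first two patterns, with an unanchored variant added by me to show what the anchors buy you:

```javascript
const allDigits = /^[0-9]+$/;
const allLetters = /^[a-zA-Z]+$/;
const digitsAnywhere = /[0-9]+/; // same class and quantifier, but no anchors

console.log(allDigits.test('12345'));          // true
console.log(allDigits.test('foo123bar'));      // false, the anchors force the whole string to be digits
console.log(digitsAnywhere.test('foo123bar')); // true, it is happy with the 123 in the middle
console.log(allLetters.test('Hello'));         // true
console.log(allLetters.test('Hello1'));        // false
```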
The last one actually isn't any more difficult than the other two, it's just longer. This answer is getting quite long so I'll just explain the new symbols. A \w in a character class will match word characters (which are defined per regex implementation but are generally 0-9a-zA-Z_ at least). The backslash before the # escapes the # so that it isn't seen as a token in the regex. Outside a character class a period will match any single character, so the lone . near the end matches any one character (e.g. a, 1, Z); inside the brackets of [\w-.+], the . and + are just literal characters. The last part of the regex ({2,4}) defines an interval expression. This means that it can match a minimum of 2 of the thing that precedes it, and a maximum of 4.
Hope you got something out of the above.
There is an awesome explanation of regular expressions at http://www.regular-expressions.info/ including notes on language and implementation specifics.
Let me explain:
Check if all numbers: /^[0-9]+$/
So, first thing we see is the "/" at the beginning and the end. This is a delimiter, and only serves to show the beginning and end of the regular expression.
Next, we have a "^", this means the beginning of the string. [0-9] means a number from 0-9. + is a modifier, which modifies the term in front of it, in this case, it means you can have one or more of something, so you can have one or more numbers from 0-9.
Finally, we end with "$", which is the opposite of "^", and means the end of the string. So put that all together and it basically makes sure that between the start and end of the string, there are only digits from 0-9, and at least one of them.
Check if all letters: /^[a-zA-Z]+$/
We notice this is very similar, but instead of checking for numbers 0-9, it checks for letters a-z (lowercase) and A-Z (uppercase).
And the hardest one:
Validate Email: /^[\w-.+]+\#[a-zA-Z0-9.-]+.[a-zA-z0-9]{2,4}$/
"\w" means a word character, so in this case we can have any number of letters, numbers or underscores, as well as -, . and + (inside the brackets, the period and plus are literal characters).
The new thing here is escape characters. Many symbols cannot be used without escaping them by placing a backslash in front, as is the case with "\#". This means it is looking directly for the symbol "#".
Now it looks for letters and symbols, then a period (this one seems incorrect, it should be escaping the period too, though it will still work, since an unescaped period will match any symbol). Numbers inside {} give how many times the previous term may repeat, so of the [a-zA-Z0-9], there should be 2-4 characters (this part here is the website domain, such as .com, .ca, or .info). Note there's another error in this one here, the [a-zA-z0-9] should be [a-zA-Z0-9] (capital Z).
Oh, and check out that site listed above, it is a great set of tutorials too.
Regular Expressions is a complex beast and, as already pointed out, there are quite a few guides off of google you can go read.
To answer the OP questions:
Check if all numbers: /^[0-9]+$/
regexps here are all delimited with //, much like strings are quoted with '' or "".
^ means start of string or line (depending on what options you have about multiline matching)
[...] are called character classes. Anything in [] is a list of single matching characters at that position, in this case 0-9. The minus sign has a special meaning of "sequence of characters between". So [0-9] means "one of 0123456789".
+ means "1 or more" of the preceding match (in this case [0-9]), so one or more numbers
$ means end of string/line match.
So in summary: find any string that contains only numbers, i.e. '0123a' will not match, as [0-9]+ fails to match a before $.
Check if all letters: /^[a-zA-Z]+$/
Hopefully [A-Za-z] makes sense now (A-Z = ABCDEF...XYZ and a-z abcdef...xyz)
Validate Email: /^[\w-.+]+\#[a-zA-Z0-9.-]+.[a-zA-z0-9]{2,4}$/
Not all regexp parsers know the \w sequence. JavaScript, Java and Perl, I know, do support it.
I have already covered /^ at the beginning. For this [] match we are looking for
\w, -, . and +. I think that regexp is incorrect: either the minus sign should be escaped with \ or it should be at the end of the [] (i.e. [\w+.-]). But that is an aside; they are basically attempting to allow anything of a-z, A-Z, 0-9, _, -, . and +
so fred.smith-foo+wee#mymail.com will match but fred.smith%foo+wee#mymail.com won't (the % is not matched by [\w.+-]).
\# is the literal separator sign, standing in for the at sign of an email address (it is escaped because Perl expands an at sign as an array variable reference)
[a-zA-Z0-9.-]+ is almost the same as [\w.-]+ (\w also allows _). Very much like the user part of the match, but it does not match +. So this matches foo.com. and google.co. but not my+foo.com or my***domain.co.
. means match any one character. This again is incorrect, as fred#foo%com will match because . matches %*^%$£! etc. This should have been written as \.
The last character class [a-zA-z0-9]{2,4} looks for 2, 3 or 4 of the characters specified in the character class (much like + looks for "1 or more", {2,4} means at least 2 with a maximum of 4 of the preceding match). So 'foo' matches, '11' matches, '11111' does not match and 'information' does not.
The "tweaked" regexp should be:
/^[\w.+-]+\#[a-zA-Z0-9.-]+\.[a-zA-z0-9]{2,4}$/
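As a rough check of what escaping the dot changes, here is a sketch that runs the tweaked pattern next to a copy with the dot left unescaped (the second pattern and the variable names are mine, for comparison only):

```javascript
// The tweaked pattern, exactly as written above.
const tweaked = /^[\w.+-]+\#[a-zA-Z0-9.-]+\.[a-zA-z0-9]{2,4}$/;
// Same pattern with the dot before the final class left unescaped.
const unescapedDot = /^[\w.+-]+\#[a-zA-Z0-9.-]+.[a-zA-z0-9]{2,4}$/;

console.log(tweaked.test('fred.smith-foo+wee#mymail.com')); // true
console.log(tweaked.test('fred.smith%foo+wee#mymail.com')); // false, % is not in the first class
console.log(unescapedDot.test('fred#foo%com'));             // true, the bare . accepts %
console.log(tweaked.test('fred#foo%com'));                  // false, a real dot is required
console.log(tweaked.test('fred#foo.information'));          // false, the last part is longer than 4
```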
I'm not doing a tutorial on RegEx's, that's been done really well already, but here are what your expressions mean.
/^<something>$/ String begins, has something in the middle, and then immediately ends.
/^foo$/.test('foo'); // true
/^foo$/.test('fool'); // false
/^foo$/.test('afoo'); // false
+ One or more of something:
/a+/.test('cot');//false
/a+/.test('cat');//true
/a+/.test('caaaaaaaaaaaat');//true
[<something>] Include any characters found between the brackets (includes ranges like 0-9, a-z, and A-Z, as well as special codes like \w for 0-9a-zA-Z_).
/^[0-9]+/.test('f00')//false
/^[0-9]+/.test('000')//true
{x,y} between X and Y occurrences
/^[0-9]{1,2}$/.test('12');// true
/^[0-9]{1,2}$/.test('1');// true
/^[0-9]{1,2}$/.test('d');// false
/^[0-9]{1,2}$/.test('124');// false
So, that should cover everything, but for good measure:
/^[\w-.+]+\#[a-zA-Z0-9.-]+.[a-zA-z0-9]{2,4}$/
Begins with at least one character from \w, -, +, or .. Followed by a #, followed by at least one in the set a-zA-Z0-9.- followed by one character of anything (. means anything, they meant \.), followed by 2-4 characters of a-zA-z0-9
As a side note, this regular expression to check emails is not only dated, but it is very, very, very incorrect.
