Crockford - Chapter 7 - parse_url - javascript

var parse_url = /^(?:([A-Za-z]+):)?(\/{0,3})([0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?$/;
Why is the dot . in this part
[0-9.\-A-Za-z]+
not escaped by a backslash?

Brackets ([]) specify a character class, which matches a single character from the set listed between [ and ].
Inside a character class, most metacharacters lose their special meaning. The two that matter here are:
backslash \: general escape character.
hyphen -: character range.
Notice, though, it must be between chars to have special meaning:
[0-9] means any digit from 0 to 9, while in [09-] the - is an ordinary literal -, not a range.
That's why, inside [], a . is just a literal dot (it will only match a dot).
Note: The char ] must be escaped to be used inside a character class, as in [a-z\]]; otherwise it closes the class as usual. Finally, a leading ^, as in [^a-z], designates a negated character class, meaning any char that is not one of those listed (in the example, any char that is not a...z).

So it matches a dot.
Except under some circumstances (e.g., escaping the range hyphen when it's not the first character in the character class brackets) you don't need to escape special characters in a class.
You may escape the normal metacharacters inside character classes, but it's noisy and redundant.
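A quick way to check this is to compare a dot inside and outside a class. This is only a small sketch; the names dotInClass, dotOutside and hostPart are made up for illustration, and hostPart is just the host portion of parse_url with anchors added.

```javascript
// Inside a character class the dot is a literal dot.
const dotInClass = /^[.]$/;
// Outside a class the dot matches any character except a line terminator.
const dotOutside = /^.$/;

console.log(dotInClass.test("."));  // true
console.log(dotInClass.test("a"));  // false
console.log(dotOutside.test("a"));  // true

// Host portion of parse_url (anchors added here for the test).
const hostPart = /^[0-9.\-A-Za-z]+$/;
console.log(hostPart.test("www.example.com")); // true
console.log(hostPart.test("www_example_com")); // false: "_" is not in the class
```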

Related

Javascript regex "replace(/[ -_]/g)" deletes numbers?

I was doing some tests in JavaScript with the replace function.
Consider the following examples executed on a node REPL.
It's a replace that deletes spaces, hyphens and underscores from a string.
> "call this 9344 5 66 22".replace(/[ _-]/g, '');
'callthis934456622'
That was what I was expecting: only the spaces are deleted.
However take a look at this:
> "call this 9344 5 66 22".replace(/[ -_]/g, '');
'callthis'
Why, when I write the class exactly as [ -_] (space, hyphen, underscore), does it delete the numbers in the string?
More tests I did:
[ -] (space, hyphen) does not delete numbers
[ _] (space, underscore) does not delete numbers
[ _-] (space, underscore, hyphen) does not delete numbers
[-_ ] (hyphen, underscore, space) does not delete numbers
[_- ] (underscore, hyphen, space) REPL blocks??
[ -_] (space, hyphen, underscore) does delete numbers
[ -_] means characters from space (ASCII 32) to _ (ASCII 95) which includes, among other things, numbers and capital letters.
What you are looking for is [ \-_]. Escaping the - will make it act like the character instead of the meta-character for ranges.
A hyphen that is not at the start or end of a character class needs to be escaped; otherwise it represents a range.
So this regex:
[ -_]
will match anything from space to underscore i.e. ASCII 32-95
The - character has special meaning in character classes. When it appears between two characters, it represents a character range — e.g. [a-z] matches any character with a character code between a and z, inclusive.
However, as you've observed, when it's placed at the beginning or end of the character class, it just represents a literal - character. This can also be accomplished by escaping the - within the character class — i.e. [ \-_].
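The behaviour described in these answers can be reproduced with a few lines. This is a sketch, not anything from the original posts; the variable names are invented for illustration. It also shows that a reversed range such as [_- ] is rejected as a syntax error.

```javascript
const input = "call this 9344 5 66 22";

// [ -_] is a range from space (32) to underscore (95); digits (48-57) fall inside it.
const asRange = /[ -_]/g;
// Escaping the hyphen, or moving it to the end, makes it a literal hyphen.
const escaped = /[ \-_]/g;
const atEnd = /[ _-]/g;

console.log(input.replace(asRange, "")); // 'callthis'
console.log(input.replace(escaped, "")); // 'callthis934456622'
console.log(input.replace(atEnd, ""));   // 'callthis934456622'
console.log(" ".charCodeAt(0), "9".charCodeAt(0), "_".charCodeAt(0)); // 32 57 95

// A range whose start is greater than its end (underscore to space) is a SyntaxError.
let rangeError = null;
try {
  new RegExp("[_- ]");
} catch (err) {
  rangeError = err;
}
console.log(rangeError instanceof SyntaxError); // true
```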
"call this 9344 5 66 22".replace(/(\s|-|_)/g, '');
In a class, the dash - character has special meaning as a range operator ONLY when
it doesn't separate clauses, parsed left to right.
Otherwise it is considered no different than any other literal.
Regular expression parsers have no time to worry about good form.
So you can put the dash anywhere you want as a literal, as long as it separates clauses (i.e. it's not ambiguous).
Most people put it at the end or beginning or escape it so no conceptual errors occur.
Example of clauses and literal dashes:
[-a-z-\p{L}-0-9-\x00-\x09-\x20-]

What's the meaning about characterEncoding

I'm reading the Sizzle source code, and I'm confused by the regular expression for characterEncoding. In the source code, characterEncoding is defined as below:
characterEncoding = "(?:\\\\.|[\\w-]|[^\\x00-\\xa0])+"
It looks like it tries to match \\. or \w- or ^\x00-\xa0.
I know [\w-] means \ or w or -, and I also know [^\x00-\xa0] means anything not in \x00-\xa0. Can anyone tell me the meaning of \\. and \x00-\xa0?
Thanks
I think I know what it is. characterEncoding is a string, so if we assign it as below:
characterEncoding = "(?:\\\\.|[\\w-]|[^\\x00-\\xa0])+"
The value of characterEncoding is:
(?:\\.|[\w-]|[^\x00-\xa0])+
So if I build a regular expression from the value above, it means:
[\w-] // A Latin letter, a digit, an underscore '_' or a hyphen '-'
[^\x00-\xa0] // ISO 10646 characters U+00A1 and higher
\\. // '\' and '.'
So now my question is: when will the pattern \\. match?
The variable would be better named css3Identifier or something.
Transforming [\w-]|[^\x00-\xa0] into an equivalent form that matches the spec better:
[a-zA-Z0-9_-]|[\u00A1-\uFFFF]
Consider that A1 is 161, _ is an underscore and - is a dash, and then read this:
In CSS3, identifiers (including element names, classes, and IDs in selectors (see [SELECT] [or is this still true])) can contain only the characters [A-Za-z0-9] and ISO 10646 characters 161 and higher, plus the hyphen (-) and the underscore (_)
"and higher" is covered by -\uFFFF
The "\\\\." matches a backslash followed by any single character. For example, in \7B it would match \7, and then B would be caught by the middle alternative. It also matches \n, \r, \t, etc.
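To see the three alternatives at work, the string can be turned into a RegExp and tried on a few inputs. This is only a sketch: the anchors and the name identifier are added here for testing and are not how Sizzle itself uses the string.

```javascript
// The string literal from the Sizzle source; each \\ in the literal is one \ in the value.
const characterEncoding = "(?:\\\\.|[\\w-]|[^\\x00-\\xa0])+";
console.log(characterEncoding); // (?:\\.|[\w-]|[^\x00-\xa0])+

// Anchored so the whole input must be consumed (an assumption made for this test).
const identifier = new RegExp("^" + characterEncoding + "$");

console.log(identifier.test("foo-bar"));    // true: word characters and a hyphen
console.log(identifier.test("foo.bar"));    // false: a bare "." fits no alternative
console.log(identifier.test("foo\\.bar"));  // true: the input is foo\.bar, so \\. matches "\."
console.log(identifier.test("\\7B"));       // true: "\7" via \\. and then "B" via [\w-]
console.log(identifier.test("caf\u00e9"));  // true: U+00E9 is above \xa0
```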
It is just a regex for the valid format of CSS identifiers, classes, tags and attributes. A link is also in the source code comment. Following are the rules, including the possible use of backslashes, which might answer your question:
4.1. Characters and case
The following rules always hold:
All CSS style sheets are case-insensitive, except for parts that are not under the control of CSS. For example, the case-sensitivity of values of the HTML attributes "id" and "class", of font names, and of URIs lies outside the scope of this specification. Note in particular that element names are case-insensitive in HTML, but case-sensitive in XML.
In CSS3, identifiers (including element names, classes, and IDs in selectors (see [SELECT] [or is this still true])) can contain only the characters [A-Za-z0-9] and ISO 10646 characters 161 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit or a hyphen followed by a digit. They can also contain escaped characters and any ISO 10646 character as a numeric code (see next item). For instance, the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F". (See [UNICODE310] and [ISO10646].)
In CSS3, a backslash (\) character indicates three types of character escapes.
First, inside a string (see [CSS3VAL]), a backslash followed by a newline is ignored (i.e., the string is deemed not to contain either the backslash or the newline).
Second, it cancels the meaning of special CSS characters. Any character (except a hexadecimal digit) can be escaped with a backslash to remove its special meaning. For example, "\"" is a string consisting of one double quote. Style sheet preprocessors must not remove these backslashes from a style sheet since that would change the style sheet's meaning.
Third, backslash escapes allow authors to refer to characters they can't easily put in a style sheet. In this case, the backslash is followed by at most six hexadecimal digits (0..9A..F), which stand for the ISO 10646 ([ISO10646]) character with that number. If a digit or letter follows the hexadecimal number, the end of the number needs to be made clear. There are two ways to do that:
with a space (or other whitespace character): "\26 B" ("&B"). In this case, user agents should treat a "CR/LF" pair (13/10) as a single whitespace character.
by providing exactly 6 hexadecimal digits: "\000026B" ("&B")
In fact, these two methods may be combined. Only one whitespace character is ignored after a hexadecimal escape. Note that this means that a "real" space after the escape sequence must itself either be escaped or doubled.
Backslash escapes are always considered to be part of an identifier or a string (i.e., "\7B" is not punctuation, even though "{" is, and "\32" is allowed at the start of a class name, even though "2" is not).
http://www.w3.org/TR/css3-syntax/#characters

javascript replace() function strange behaviour with regexp

Am I doing something wrong, or is there a problem with JS replace?
<input type="text" id="a" value="(55) 55-55-55" />
document.write($("#a").val().replace(/()-/g,''));
prints (55) 555555
http://jsfiddle.net/Yb2yV/
How can I replace ( and ) and spaces too?
In a JavaScript regular expression, the ( and ) characters have special meaning. If you want to list them literally, put a backslash (\) in front of them.
If your goal is to get rid of all the (, ), -, and space characters, you could do it with a character class combined with an alternation (i.e., either-or) on \s, which stands for "whitespace":
document.write($("#a").val().replace(/[()\-]|\s/g,''));
(I didn't put backslashes in front of the () because you don't need to within a character class. I did put one in front of the - because within a character class, - has special meaning.)
Alternately, if you want to get rid of anything that isn't a digit, you can use \D:
document.write($("#a").val().replace(/\D/g,''));
\D means "not a digit" (note that it's a capital, \d in lower case is the opposite [any digit]).
More info on the MDN page on regular expressions.
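The three patterns discussed so far can be compared without jQuery or a page. In this sketch a plain string stands in for $("#a").val(); the variable name value is just for illustration.

```javascript
// Plain string standing in for the input's value.
const value = "(55) 55-55-55";

// Original attempt: () is an empty capturing group, so only "-" is matched.
console.log(value.replace(/()-/g, ""));       // '(55) 555555'

// Character class for ( ) - plus \s for whitespace.
console.log(value.replace(/[()\-]|\s/g, "")); // '55555555'

// Remove everything that is not a digit.
console.log(value.replace(/\D/g, ""));        // '55555555'
```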
You need to use a character class
/[-() ]/
Using "-" as the first character solves the ambiguity because a dash is normally used for ranges (e.g. [a-zA-Z0-9]).
document.write($("#a").val().replace(/[\s()-]/g,''));
That will remove all whitespace (\s), parens, and dashes
Use this
.replace(/\(|\)|-| /g,'')
You have to escape the parentheses (i.e. \( instead of (). In your regexp, you want to list the four items: \(, \), '-' and (space), and as you want to replace any of them, not just a string of all four together, you have to use OR | between them.
May be very bad but a very basic approach would be,
document.write($("#a").val().replace(/(\()|(\))|-| /g,''));
| means OR,
\ is used for escaping reserved symbols
You want to match any character in the set, so you should use square brackets to make a character set:
document.write($("#a").val().replace(/[()\- ]/g,''));
Normally, parentheses have a special meaning in regular expressions, so they were being ignored in your regex, leaving just the dash. Normally, to get literal parentheses, you need to escape them with \ (but in a square bracket block, as above, you don't).
The dash above is escaped because it normally indicates a range in a character set, e.g., [a-z].
The parentheses indicate a capturing group in the regexp. You'd need to escape them (/\(\)-/) to match the sequence "()-". Yet I guess you want to use a character class, i.e. an expression that matches "(", ")" or "-"; for whitespace, include the \s shorthand and keep the hyphen last so it is not read as a range:
value.replace(/[()\s-]/g, "");
You might want to read some documentation or tutorial.

Regex not working as expected

What's wrong with this regular expression?
/^[a-zA-Z\d\s&#-\('"]{1,7}$/;
when I enter the following valid input, it fails:
a&'-#"2
It should also check for 2 consecutive spaces within the input.
The dash needs to be either escaped (\-) or placed at the end of the character class, or it will signify a range (as in A-Z), not a literal dash:
/^[A-Z\d\s&#('"-]{1,7}$/i
would be a better regex.
N.B.: [#-\(] would have matched #, $, %, &, ' or (.
To address the added requirement of not allowing two consecutive spaces, use a lookahead assertion:
/^(?!.*\s{2})[A-Z\d\s&#('"-]{1,7}$/i
(?!.*\s{2}) means "Assert that it's impossible to match (from the current position) any string followed by two whitespace characters". One caveat: The dot doesn't match newline characters.
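A short check of the lookahead version against a few inputs; this is a sketch, and the name noDoubleSpace is made up for illustration.

```javascript
// Lookahead rejects any input containing two consecutive whitespace characters.
const noDoubleSpace = /^(?!.*\s{2})[A-Z\d\s&#('"-]{1,7}$/i;

console.log(noDoubleSpace.test("a&'-#\"2")); // true: the 7-character input from the question
console.log(noDoubleSpace.test("a b c"));    // true: single spaces are fine
console.log(noDoubleSpace.test("a  b"));     // false: two consecutive spaces
console.log(noDoubleSpace.test("abcdefgh")); // false: 8 characters, over the {1,7} limit
```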
The - (hyphen) has a special meaning inside a character class, used for specifying ranges. Did you mean to escape it?:
/^[a-zA-Z\d\s&#\-\('"]{1,7}$/;
This RegExp matches your input.
You have an unescaped - in the middle of your character class. This means that you're actually searching for all characters between and including # and ( (which are #, $, %, &, ', and (). Either move it to the end or escape it with a backslash. Your regex should read:
/^[a-zA-Z\d\s&#\('"-]{1,7}$/
or
/^[a-zA-Z\d\s&#\-\('"]{1,7}$/
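Putting the original and the escaped version side by side makes the accidental range visible. This is a sketch with illustrative names (original, fixed, sample).

```javascript
// Original: #-\( is read as the range 0x23-0x28, i.e. # $ % & ' (
const original = /^[a-zA-Z\d\s&#-\('"]{1,7}$/;
// Fixed: the hyphen is escaped, so it is a literal "-".
const fixed = /^[a-zA-Z\d\s&#\-\('"]{1,7}$/;

const sample = "a&'-#\"2"; // the input from the question

console.log(original.test(sample)); // false: "-" (0x2D) lies outside the range
console.log(fixed.test(sample));    // true
console.log(original.test("$%"));   // true: pulled in by the accidental range
console.log(fixed.test("$%"));      // false
```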
Remove the ; at the end and use:
^[a-zA-Z\d\s\&\#\-\(\'\"]+$
Your input does not match the regular expression. The problem here is the hyphen in your regexp. If you move it from its position after the '#' character to the start of the character class, like so:
/^[-a-zA-Z\d\s&#\('"]{1,7}$/;
everything is fine and dandy.
You can always use Rubular for checking your regular expressions. I use it on a regular (no pun intended) basis.

Why this Regex, matches incorrect characters?

I need to match these characters. This quote is from an API documentation (external to our company):
Valid characters: 0-9 A-Z a-z & # - . , ( ) / : ; ' # "
I used this Regex to match characters:
^[0-9a-z&#-\.,()/:;'""#]*$
However, this wrongly matches characters like %, $, and many other characters. What's wrong?
You can test this regular expression online using http://regexhero.net/tester/, and this regular expression is meant to work in both .NET and JavaScript.
You are not escaping the dash -, which is a reserved character. If you replace the dash with \-, the regex no longer matches the characters between # and .
Move the literal - to the front of the character set:
^[-0-9a-z&#\.,()/:;'""#]*$
otherwise it is taken as specifying a range like when you use it in 0-9.
The - sign, when not escaped, has special meaning in square brackets. #-\. is transformed into #-. (BTW, a backslash before the dot is not necessary in square brackets), which means "any character between # (ASCII 0x23) and . (ASCII 0x2E)". The correct notation is
^[0-9a-z&#\-.,()/:;'"#]*$
The special characters in a character class are the closing bracket (]), the backslash (\), the caret (^) and the hyphen (-).
As such, you should either escape them with a backslash (\), or put them in a position where there is no ambiguity and they do not need escaping. In the case of a hyphen, this would be the first or last position.
You also do not need to escape the dot (.).
Your regex thus becomes:
^[-0-9a-z&#.,()/:;'"#]*$
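A side-by-side check in JavaScript, as a sketch. The i flag is an assumption added here because the API's list includes A-Z while the pattern lists only a-z; the names original and fixed are illustrative.

```javascript
// Original class: #-\. is the range 0x23-0x2E, i.e. # $ % & ' ( ) * + , - .
// The i flag is an assumption (the API allows A-Z, the class lists only a-z).
const original = /^[0-9a-z&#-\.,()/:;'""#]*$/i;
// Fixed class: hyphen first, so it is a literal; the dot needs no escape.
const fixed = /^[-0-9a-z&#.,()/:;'"#]*$/i;

console.log(original.test("$%*+"));           // true: matched via the accidental range
console.log(fixed.test("$%*+"));              // false
console.log(fixed.test("Ab-9.,()/:;'#\"&"));  // true: only characters from the API's list
```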
As a side note, there are many available regex evaluators which provide code hinting. This way, you can simply hover your mouse over your regular expression and it can be explained in English words.
One such free one is RegExr.
Typing your original regex in it and hovering over the hyphen shows:
Matches characters in the range '#-\'
Try that
^[0-9a-zA-Z\&\#\-\.\,\(\)\/\:\;\'\"\#]*$
