What is the meaning of characterEncoding - javascript

I'm reading the Sizzle source code, and I'm confused by the regular expression for characterEncoding. In the source code, characterEncoding is defined as below:
characterEncoding = "(?:\\\\.|[\\w-]|[^\\x00-\\xa0])+"
It looks like it tries to match \\. or [\w-] or [^\x00-\xa0].
I think [\w-] means \ or w or -, and I also know [^\x00-\xa0] means anything not in \x00-\xa0. Can anyone tell me what \\. and \x00-\xa0 mean?
Thanks
I think I know what it is. The type of characterEncoding is string. So if we assign like below:
characterEncoding = "(?:\\\\.|[\\w-]|[^\\x00-\\xa0])+"
The value of characterEncoding is:
(?:\\.|[\w-]|[^\x00-\xa0])+
So if I build a regular expression from that value, its parts mean:
[\w-] // A letter of the Latin alphabet, a digit, an underscore '_' or a hyphen '-'
[^\x00-\xa0] // ISO 10646 characters U+00A1 and higher
\\. // '\' and '.'
So this time, my question is: when will the pattern \\. match?

The variable would be better named css3Identifier or something.
Transforming [\w-]|[^\x00-\xa0] into an equivalent form that matches the spec better:
[a-zA-Z0-9_-]|[\u00A1-\uFFFF]
Consider that A1 is 161, _ is an underscore and - is a dash, and then read this:
In CSS3, identifiers (including element names, classes, and IDs in selectors (see [SELECT] [or is this still true])) can contain only the characters [A-Za-z0-9] and ISO 10646 characters 161 and higher, plus the hyphen (-) and the underscore (_)
"and higher" is covered by -\uFFFF
The "\\\\." matches a backslash followed by any single character. For example, in \7B it would match \7, and then B would be caught by the middle alternative. It also matches the two-character sequences \n, \r, \t etc.
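A minimal sketch of how the three alternatives behave. The ^ and $ anchors and the sample strings are additions for illustration only; they are not part of the Sizzle source:

```javascript
// JavaScript unescapes the string literal once, so the RegExp constructor
// receives: (?:\\.|[\w-]|[^\x00-\xa0])+
const characterEncoding = "(?:\\\\.|[\\w-]|[^\\x00-\\xa0])+";

// Anchored copy so that the whole sample must be a valid run (illustrative).
const wholeIdentifier = new RegExp("^" + characterEncoding + "$");

// String.raw keeps the backslashes in the samples literal.
const results = {
  plain: wholeIdentifier.test("foo-bar_1"),          // only [\w-]
  escaped: wholeIdentifier.test(String.raw`B\&W\?`), // \\. covers \& and \?
  hexEscape: wholeIdentifier.test(String.raw`\7B`),  // \\. takes \7, [\w-] takes B
  nonAscii: wholeIdentifier.test("caf\u00e9"),       // U+00E9 is above U+00A0
  unescaped: wholeIdentifier.test("B&W?"),           // & and ? match no alternative
};

console.log(results);
```

Only the last sample fails, because & and ? are neither word characters, nor above U+00A0, nor preceded by a backslash.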

It is simply a regex for the valid format of CSS identifiers (class names, tag names and attribute names). A link to the spec is also in a comment in the source code. Following are the rules, including the possible use of backslashes, which might answer your question:
4.1. Characters and case
The following rules always hold:
All CSS style sheets are case-insensitive, except for parts that are not under the control of CSS. For example, the case-sensitivity of values of the HTML attributes "id" and "class", of font names, and of URIs lies outside the scope of this specification. Note in particular that element names are case-insensitive in HTML, but case-sensitive in XML.
In CSS3, identifiers (including element names, classes, and IDs in selectors (see [SELECT] [or is this still true])) can contain only the characters [A-Za-z0-9] and ISO 10646 characters 161 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit or a hyphen followed by a digit. They can also contain escaped characters and any ISO 10646 character as a numeric code (see next item). For instance, the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F". (See [UNICODE310] and [ISO10646].)
In CSS3, a backslash (\) character indicates three types of character escapes.
First, inside a string (see [CSS3VAL]), a backslash followed by a newline is ignored (i.e., the string is deemed not to contain either the backslash or the newline).
Second, it cancels the meaning of special CSS characters. Any character (except a hexadecimal digit) can be escaped with a backslash to remove its special meaning. For example, "\"" is a string consisting of one double quote. Style sheet preprocessors must not remove these backslashes from a style sheet since that would change the style sheet's meaning.
Third, backslash escapes allow authors to refer to characters they can't easily put in a style sheet. In this case, the backslash is followed by at most six hexadecimal digits (0..9A..F), which stand for the ISO 10646 ([ISO10646]) character with that number. If a digit or letter follows the hexadecimal number, the end of the number needs to be made clear. There are two ways to do that:
with a space (or other whitespace character): "\26 B" ("&B"). In this case, user agents should treat a "CR/LF" pair (13/10) as a single whitespace character.
by providing exactly 6 hexadecimal digits: "\000026B" ("&B")
In fact, these two methods may be combined. Only one whitespace character is ignored after a hexadecimal escape. Note that this means that a "real" space after the escape sequence must itself either be escaped or doubled.
Backslash escapes are always considered to be part of an identifier or a string (i.e., "\7B" is not punctuation, even though "{" is, and "\32" is allowed at the start of a class name, even though "2" is not).
http://www.w3.org/TR/css3-syntax/#characters
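The escape rules quoted above can be sketched as a small decoder. This is a simplified illustration, not a full CSS tokenizer: the function name unescapeCssIdent is hypothetical, and it does not handle code points that are zero or out of range.

```javascript
// Decode backslash escapes in a CSS identifier (simplified sketch).
function unescapeCssIdent(text) {
  // First alternative: 1 to 6 hex digits, then at most one whitespace
  // character (a CR/LF pair counts as one). Second: any other escaped char.
  const escapePattern =
    /\\([0-9a-fA-F]{1,6})(?:\r\n|[ \t\r\n\f])?|\\([^\r\n\f])/g;
  return text.replace(escapePattern, (match, hex, literal) =>
    hex !== undefined ? String.fromCodePoint(parseInt(hex, 16)) : literal
  );
}

// String.raw keeps the backslashes in the samples literal.
const decoded = [
  unescapeCssIdent(String.raw`B\&W\?`),    // character escapes
  unescapeCssIdent(String.raw`B\26 W\3F`), // hex escapes, space ends \26
  unescapeCssIdent(String.raw`\26 B`),     // the space is consumed
  unescapeCssIdent(String.raw`\000026B`),  // six digits need no space
  unescapeCssIdent(String.raw`\7B`),       // "{" as part of an identifier
];

console.log(decoded);
```

The first two samples both decode to B&W?, the next two to &B, and the last to {.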

Related

document.querySelectorAll attribute value with carriage return [duplicate]

I have a link with an href attribute that has carriage returns in its value (HTML cannot change):
<a href="
    http://google.com
">Testing</a>
I originally thought a backslash can be used to escape a carriage return character (U+000D) when used inside of a string, but then read this in the CSS spec:
Any character (except a hexadecimal digit, linefeed, carriage return, or form feed) can be escaped with a backslash to remove its special meaning. For example, "\"" is a string consisting of one double quote.
Is this possible? I've tried using \a and \n as well without any luck.
http://jsfiddle.net/AHuvh/1/
The selector that you have:
a[href="\
    http://google.com\
"]
doesn't work because:
First, inside a string, a backslash followed by a newline is ignored (i.e., the string is deemed not to contain either the backslash or the newline). Outside a string, a backslash followed by a newline stands for itself (i.e., a DELIM followed by a newline).
This is mentioned in the paragraph above the one you quote from section 4.1.3, and why it says that newlines are one of the characters that cannot be escaped with a backslash.
This makes it equivalent to the following selector:
a[href="    http://google.com"]
Which would only match your element if its attribute value did not contain any newlines.
That said, it is in fact possible to match an element by an attribute value containing newlines. However, CSS makes this a little complicated:
To represent a newline in a CSS string, you need to use the \a escape sequence (case insensitive, up to 6 hex digits). This is stated in section 4.3.7, on strings (which are treated the same whether in a property value or an attribute selector):
A string cannot directly contain a newline. To include a newline in a string, use an escape representing the line feed character in ISO-10646 (U+000A), such as "\A" or "\00000a". This character represents the generic notion of "newline" in CSS.
\n has no special meaning in CSS; in fact, it's the same as n.
If a space directly follows an escape sequence such as \a, that space must be doubled so that the space character directly following the escape sequence can be consumed. Refer to section 4.1.3 again, which states:
Only one white space character is ignored after a hexadecimal escape. Note that this means that a "real" space after the escape sequence must be doubled.
This may be why you couldn't get it to work even with \a. Understandably, it's an incredibly obscure rule, especially when working with whitespace characters.
This results in the following selector, with the styles applying correctly:
a[href="\a     http://google.com\a"]
Notice that there are five space characters between the first \a and the URL, whereas in comparison there are only four spaces following the newline in your HTML.
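As a sketch of the same rule in script form, the selector can be built from the raw attribute value. The helper name cssStringLiteral is hypothetical, and the attribute value is assumed to be a newline, four spaces, the URL and a newline, as described above. Every escaped control character is followed by one terminating space, so a "real" space that comes next is automatically doubled:

```javascript
// Turn an arbitrary value into a double-quoted CSS string (simplified sketch).
function cssStringLiteral(value) {
  const body = value.replace(/["\\\n\r\f]/g, (ch) => {
    if (ch === '"' || ch === "\\") {
      return "\\" + ch; // plain character escape
    }
    // Hex escape plus one space that the CSS parser consumes as a terminator.
    return "\\" + ch.charCodeAt(0).toString(16) + " ";
  });
  return '"' + body + '"';
}

// Assumed attribute value: newline, four spaces, URL, newline.
const hrefValue = "\n    http://google.com\n";
const selector = "a[href=" + cssStringLiteral(hrefValue) + "]";

console.log(selector);
// In a browser, the result could be passed to document.querySelectorAll(selector).
```

The trailing \a is followed by a single space before the closing quote; that space is the terminator and is not part of the matched value.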
You can use ~ ([attr~=value]):
Represents an element with an attribute name of attr whose value is a whitespace-separated list of words, one of which is exactly "value".
a[href~="http://google.com"] {
    text-decoration: none;
    padding-left: 50px;
}
jsFiddle example
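A rough model of why [attr~=value] works here, written as plain string handling. The function name matchesWord is hypothetical; it only imitates the whitespace-separated-list rule and is not how a browser implements it:

```javascript
// Imitate [attr~=word]: split the value on whitespace, look for an exact word.
function matchesWord(attrValue, word) {
  // An empty word, or one containing whitespace, never matches.
  if (word === "" || /[ \t\n\r\f]/.test(word)) {
    return false;
  }
  return attrValue.split(/[ \t\n\r\f]+/).filter(Boolean).includes(word);
}

// Assumed attribute value: newline, four spaces, URL, newline.
const href = "\n    http://google.com\n";

const wholeWord = matchesWord(href, "http://google.com"); // true
const partialWord = matchesWord(href, "http://google");   // false

console.log(wholeWord, partialWord);
```

The surrounding newlines and spaces act only as separators, so the URL is the single "word" in the list.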

What does `escape a string` mean in Regex? (Javascript)

I'm trying to understand the backslash (\) and how to use it for escaping in regular expressions.
I've read that when this is done to strings, it's called "escaping a string".
But what does that actually mean?
Many characters in regular expressions have special meanings. For instance, the dot character '.' means "any one character". There are a great many of these specially-defined characters, and sometimes you want to search for one literally rather than use its special meaning.
See this example to search for any filename that contains a '.':
/^[^.]+\..+/
In the example, there are 3 dots, but our description says that we're only looking for one. Let's break it down by the dots:
Dot #1 is used inside a "character class" (the characters inside the square brackets), which tells the regex engine to search for "any one character" that is not a '.', and the "+" says to keep going until there are no more characters or the next character is the '.' that we're looking for.
Dot #2 is preceded by a backslash, which says that we're looking for a literal '.' in the string (without the backslash, it would be using its special meaning, which is looking for "any one character"). This dot is said to be "escaped", because its special meaning is not being used in this context - the backslash immediately before it made that happen.
Dot #3 is simply looking for "any one character" again, and the '+' following it says to keep doing that until it runs out of characters.
So, the backslash is used to "escape" the character immediately following it; as such, it's called the "escape character". That just means that the character's special meaning is taken away in that one place.
Now, escaping a string (in regex terms) means finding all of the characters with special meaning and putting a backslash in front of them, including in front of other backslash characters. When you've done this one time on the string, you have officially "escaped the string".
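A short sketch of both ideas: the filename pattern from above, and a helper that escapes a whole string. The helper name escapeRegExp and the sample strings are illustrative assumptions:

```javascript
// The pattern from the example: some non-dot characters, a literal dot,
// then at least one more character.
const hasExtension = /^[^.]+\..+/;

// Put a backslash in front of every character that is special in a regex,
// including the backslash itself. "$&" stands for the matched character.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const escaped = escapeRegExp("a.b");                // a\.b
const exact = new RegExp("^" + escaped + "$");

console.log(hasExtension.test("report.txt")); // true
console.log(hasExtension.test("README"));     // false
console.log(exact.test("a.b"));               // true
console.log(exact.test("axb"));               // false, the dot is now literal
```

Without the escaping step, "a.b" used as a pattern would also match "axb", because the dot would keep its special meaning.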
Say you try to print out a string, let's say "this\that".
That \ character is recognized as a special character. I'm not sure about regex, but in Java or C, say, \t stands for a tab, so it would print as
this    hat
But a \ also "escapes" the character after it, depriving it of its special meaning, so using "this\\that" instead would result in
this\that
I hope this helped.
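The same behaviour holds for JavaScript string literals. A minimal sketch (the variable names are illustrative):

```javascript
const withTab = "this\that";        // \t is read as a single tab character
const withBackslash = "this\\that"; // \\ is read as one literal backslash

console.log(withTab.length);        // 8: t h i s TAB h a t
console.log(withBackslash.length);  // 9: t h i s \ t h a t
console.log(withBackslash);         // this\that

// String.raw switches escape processing off, so no doubling is needed.
const rawForm = String.raw`this\that`;
console.log(rawForm === withBackslash); // true
```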
Quoting from MSDN:
The backslash (\) in a regular expression indicates one of the following:
The character that follows it is a special character, as shown in the table in the following section. For example, \b is an anchor that indicates that a regular expression match should begin on a word boundary, \t represents a tab, and \x020 represents a space.
A character that otherwise would be interpreted as an unescaped language construct should be interpreted literally. For example, a brace ({) begins the definition of a quantifier, but a backslash followed by a brace (\{) indicates that the regular expression engine should match the brace. Similarly, a single backslash marks the beginning of an escaped language construct, but two backslashes (\\) indicate that the regular expression engine should match the backslash.

Javascript regex "replace(/[ -_]/g)" deletes numbers?

I was doing some tests with JavaScript's replace function.
Consider the following examples, executed in a Node REPL.
This is a replace that deletes spaces, hyphens and underscores from a string.
> "call this 9344 5 66 22".replace(/[ _-]/g, '');
'callthis934456622'
That is what I was expecting: only the spaces are deleted.
However take a look at this:
> "call this 9344 5 66 22".replace(/[ -_]/g, '');
'callthis'
Why does this exact combination, [ -_] (space, hyphen, underscore), delete the numbers in the string?
More tests I did:
[ -] (space, hyphen) does not delete numbers
[ _] (space, underscore) does not delete numbers
[ _-] (space, underscore, hyphen) does not delete numbers
[-_ ] (hyphen, underscore, space) does not delete numbers
[_- ] (underscore, hyphen, space) REPL blocks??
[ -_] (space, hyphen, underscore) does delete numbers
[ -_] means characters from space (ASCII 32) to _ (ASCII 95) which includes, among other things, numbers and capital letters.
What you are looking for is [ \-_]. Escaping the - will make it act like the character instead of the meta-character for ranges.
A hyphen that is not at the start or end of a character class needs to be escaped; otherwise it represents a range.
So this regex:
[ -_]
will match anything from space to underscore i.e. ASCII 32-95
The - character has special meaning in character classes. When it appears between two characters, it represents a character range — e.g. [a-z] matches any character with a character code between a and z, inclusive.
However, as you've observed, when it's placed at the beginning or end of the character class, it just represents a literal - character. This can also be accomplished by escaping the - within the character class — i.e. [ \-_].
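A small sketch that reproduces the three cases and lists what the accidental range covers. The variable names are illustrative:

```javascript
const input = "call this 9344 5 66 22";

const asRange = input.replace(/[ -_]/g, "");        // range: space through _
const hyphenLast = input.replace(/[ _-]/g, "");     // hyphen at the end is literal
const hyphenEscaped = input.replace(/[ \-_]/g, ""); // escaped hyphen is literal

console.log(asRange);       // callthis
console.log(hyphenLast);    // callthis934456622
console.log(hyphenEscaped); // callthis934456622

// Collect every ASCII character that the accidental range matches.
const inRange = [];
for (let code = 0; code < 128; code++) {
  const ch = String.fromCharCode(code);
  if (/[ -_]/.test(ch)) {
    inRange.push(ch);
  }
}
console.log(inRange.join("")); // space, punctuation, 0-9, A-Z ... up to _
```

The range runs from code 32 to code 95, so it includes the digits and the capital letters but not the lowercase letters, which is why only "callthis" survives.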
"call this 9344 5 66 22".replace(/(\s|-|_)/g, '');
In a class, the dash - character has special meaning as a range operator ONLY when it doesn't separate clauses, parsed left to right.
Otherwise it is considered no different from any other literal.
Regular expression parsers do not care about good form.
So you can put the dash anywhere you want as a literal, as long as it separates clauses (i.e. it's not ambiguous).
Most people put it at the beginning or end, or escape it, so that no conceptual errors occur.
Example of clauses separated by literal dashes:
[-a-z-\p{L}-0-9-\x00-\x09-\x20-]

Crockford - Chapter 7 - parse_url

var parse_url = /^(?:([A-Za-z]+):)?(\/{0,3})([0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?$/;
Why is the dot . in this part
[0-9.\-A-Za-z]+
not escaped by a backslash?
Brackets ([]) specify a character class: matching a single character in the string between [].
While inside a character class, only the \ and - have special meaning (are metacharacters):
backslash \: general escape character.
hyphen -: character range.
Notice, though, it must be between chars to have special meaning:
[0-9] means any digit between 0 and 9, while in [09-] the - is just an ordinary -, not a range.
That's why, inside [], a . is just (will only match) a dot.
Note: It is also worth noticing that the char ] must be escaped to be used inside a character class, such as [a-z\]], otherwise it will close it as usual. Finally, using ^, as in [^a-z], designates a negated character class, that means any char that is not one of those (in the example, any char that is not a...z).
So it matches a dot.
Except under some circumstances (e.g., escaping the range hyphen when it's not the first character in the character class brackets) you don't need to escape special characters in a class.
You may escape the normal metacharacters inside character classes, but it's noisy and redundant.
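A quick check of the pattern from the question. The sample URL and the variable names are assumptions added for illustration:

```javascript
const parse_url = /^(?:([A-Za-z]+):)?(\/{0,3})([0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?$/;

// Sample URL chosen for illustration.
const parts = parse_url.exec("http://www.ora.com:80/goodparts?q#fragment");

// Capture group 3 is the host; the unescaped dot in the class matched "."
const host = parts[3];
console.log(host); // www.ora.com

// Inside brackets the dot is literal; outside it means "any character".
const dotInsideClass = /^[.]$/.test("a"); // false
const dotOutsideClass = /^.$/.test("a");  // true
console.log(dotInsideClass, dotOutsideClass);
```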

How to properly escape attribute values in css/js attribute selector [attr=value]?

How should I escape attribute values in the css/js attribute selector [attr=value]?
Specifically, is this correct?
document.querySelector('input[name="test[33]"]')
I'm looking for the "standard way" of doing this, if any, because I don't want Sizzle to use a heavy-to-execute fallback function.
Yes, that is one correct approach. The Selectors Level 3 specification states the following:
Attribute values must be CSS identifiers or strings.
The example in your question uses a string as the attribute value. An "identifier" is defined as follows:
In CSS, identifiers... can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A0 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit, two hyphens, or a hyphen followed by a digit. Identifiers can also contain escaped characters and any ISO 10646 character as a numeric code...
So following that, it is also legal to escape the special characters and omit the quotes:
document.querySelector('input[name=test\\[33\\]]')
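For values that are not known in advance, the escaping can be automated. Browsers provide CSS.escape for this; the function below is only a simplified sketch of the identifier rules quoted above, with a hypothetical name (escapeIdent), and it does not cover every edge case (for example a NUL character or a lone hyphen):

```javascript
// Escape a value so it can be used as an unquoted CSS identifier (sketch).
function escapeIdent(value) {
  // Hex escape followed by one space that terminates the escape.
  const hexEscape = (ch) => "\\" + ch.charCodeAt(0).toString(16) + " ";
  return value
    // Anything outside [A-Za-z0-9_-] and U+00A1 and higher gets escaped:
    // control characters as hex, everything else with a plain backslash.
    .replace(/[^\w\-\u00a1-\uffff]/g, (ch) =>
      /[\x00-\x1f\x7f]/.test(ch) ? hexEscape(ch) : "\\" + ch
    )
    // An identifier cannot start with a digit, or a hyphen plus a digit.
    .replace(/^(-?)(\d)/, (match, dash, digit) => dash + hexEscape(digit));
}

const unquotedSelector = "input[name=" + escapeIdent("test[33]") + "]";
console.log(unquotedSelector); // input[name=test\[33\]]

console.log(escapeIdent("33")); // \33 3
```

The first result is the same selector text as the hand-escaped version above; the doubled backslashes there exist only because it is written inside a JavaScript string literal.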
