I'm really new to JavaScript and I heard about Unicode characters, but I don't know how they work. I did this:
alert("U+00BF");
which is the Unicode code point for an upside-down question mark, but for some reason it just alerts the letters "U+00BF". I've tried using Unicode characters with a format more like this:
alert("/xF3");
and those worked, but I don't know what I'm doing wrong with the first one. Does anyone know?
The "U+00BF" in alert("U+00BF"); is a string of length 6 containing the characters 'U', '+', '0', '0', 'B', 'F'. Hence the string "U+00BF" is echoed out in the alert.
Based on https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Values,_variables,_and_literals#Unicode ,
You can use the Unicode escape sequence in string literals, regular expressions, and identifiers. The escape sequence consists of six ASCII characters: \u and a four-digit hexadecimal number. For example, \u00A9 represents the copyright symbol. Every Unicode escape sequence in JavaScript is interpreted as one character.
Which means, we need to do:
alert("\u00BF");
to see the "upside-down question mark" unicode character.
The notation U+00BF is simply a conventional way of emphasizing that you are mentioning, in text, a Unicode character by its code number 00BF. It is not an escape notation of any kind in JavaScript.
You can use the character as such,
alert("¿")
provided that you handle the character encoding issues, as you should anyway.
If, however, you find this simple approach not applicable due to some external constraints, you can use a classic JavaScript escape notation:
alert("\xBF")
for characters in the range up to U+00FF, or the Unicode-based JavaScript escape notation
alert("\u00BF")
for characters in the range up to U+FFFF. (For characters beyond that, you need a so-called surrogate pair.)
Note that the special character used in these notations is \ U+005C REVERSE SOLIDUS, commonly, and originally, called “backslash”, not / U+002F SOLIDUS, commonly called “slash”, or sometimes (for emphasis) “forward slash”.
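To make the difference between the notations concrete, here is a minimal sketch. The helper name fromUPlus is made up for illustration (it is not a built-in); it only shows that "U+00BF" is plain text until something parses the hex digits out of it.

```javascript
// Hypothetical helper (not part of JavaScript): converts "U+XXXX" text
// notation into the character it names.
function fromUPlus(notation) {
  const match = /^U\+([0-9A-Fa-f]{4,6})$/.exec(notation);
  if (!match) {
    throw new Error("Not U+XXXX notation: " + notation);
  }
  return String.fromCodePoint(parseInt(match[1], 16));
}

const plain = "U+00BF";  // six ordinary characters, no escape involved
const viaX = "\xBF";     // \x escape: exactly two hex digits
const viaU = "\u00BF";   // \u escape: exactly four hex digits
const viaHelper = fromUPlus(plain);

console.log(plain.length);        // 6
console.log(viaX === viaU);       // true
console.log(viaHelper === viaU);  // true
console.log("/xF3".length);       // 4, because a forward slash starts no escape
```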
To detect if the string is composed of ASCII characters, I am using a regex that looks as follows:
"string".match(/^[\x00-\x7F]*$/gm)
This works fine for detecting ASCII characters, but it rejects characters that are similar in meaning to ASCII ones, such as a typographic double quote that falls outside the ASCII set but is part of Unicode. For example:
"see the difference in double quotes“
With the above regex, this string fails the detection test because of “. How could I extend the regex to accept characters like these that are very close in meaning to ASCII characters, for example , [comma], " [double quote], etc.?
Regex doesn't understand the meaning of anything, it only follows its rules to match the sequence of characters.
If you want to match a comma, you need to put a comma in your character set. If you are looking for "similar" characters, you need to identify each and every one of them and put them inside the character set.
[,"]
will match "comma" and "double quote".
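As a rough sketch of "identify each one and put it in the character set", the ASCII test from the question could be extended like this. The EXTRA list is an assumption and certainly incomplete; which look-alike characters belong in it is a decision only you can make.

```javascript
// Assumed, incomplete list of non-ASCII characters to treat as "ASCII-like".
const EXTRA = [
  "\u2018", // left single quotation mark
  "\u2019", // right single quotation mark
  "\u201C", // left double quotation mark
  "\u201D", // right double quotation mark
  "\uFF0C", // fullwidth comma
];

// Same ASCII range as in the question, plus the listed extras.
// No g flag: with test(), a global regex keeps state between calls.
const asciiLike = new RegExp("^[\\x00-\\x7F" + EXTRA.join("") + "]*$");

console.log(asciiLike.test('"see the difference in double quotes\u201C')); // true
console.log(asciiLike.test("caf\u00E9")); // false, U+00E9 is not in the list
```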
I'm writing some code that scans a string for TeX-style Greek character (like \Delta or \alpha), and replaces the string with the Unicode symbol. It works fine for the non-italic Greek characters. The problem is that I want to use mathematical italic for the lower case. These codes are one digit longer. For example, the code for the letter alpha is 1d6fc. When I put \u1d6fc into my string it displays as the character that matches \u1d6f (a lower case m with a superimposed tilde) followed by the letter c. How do I force the "correct" reading of the code?
You have to use UTF-16 surrogate pairs for characters beyond the Basic Multilingual Plane (code points above U+FFFF), because they do not fit in a single 16-bit code unit. In your particular case, you can use 0xD835 0xDEFC:
console.log('\uD835\uDEFC')
Here is a handy pair calculator. If you don't have to worry about Internet Explorer, you can also use String.fromCodePoint(), which will deal with that mess for you. If you do have to worry about Internet Explorer, MDN has a polyfill for that method.
To produce a \u escape sequence with more than 4 hex digits (code point belonging to a so-called astral plane), you can use the Unicode code point escape notation \u{xxxxx}:
console.log ('\u{1d6fc}');
or you can call String.fromCodePoint with the code point value expressed in hexadecimal using the 0x prefix notation:
console.log (String.fromCodePoint (0x1d6fc));
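If you are curious where 0xD835 0xDEFC comes from, the pair can be computed by hand. This is only a sketch of the standard UTF-16 arithmetic; the function name toSurrogatePair is made up for illustration.

```javascript
// Hypothetical helper: splits a code point above U+FFFF into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("Not an astral code point: " + codePoint.toString(16));
  }
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1D6FC);
console.log(high.toString(16), low.toString(16)); // d835 defc

const alpha = String.fromCharCode(high, low);
console.log(alpha === "\u{1D6FC}");  // true
console.log(alpha.length);           // 2 (code units)
console.log([...alpha].length);      // 1 (code point)
```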
I'm reading the Sizzle source code, and I'm confused by the regular expression for characterEncoding. In the source code, characterEncoding is defined as below:
characterEncoding = "(?:\\\\.|[\\w-]|[^\\x00-\\xa0])+"
It looks like it tries to match \\. or \w- or ^\x00-\xa0.
I know [\w-] means \ or w or -, and I also know [^\x00-\xa0] means anything not in \x00-\xa0. Can anyone tell me the meaning of \\. and \x00-\xa0?
Thanks
I think I know what it is. The type of characterEncoding is string. So if we assign like below:
characterEncoding = "(?:\\\\.|[\\w-]|[^\\x00-\\xa0])+"
The value of characterEncoding is:
(?:\\.|[\w-]|[^\x00-\xa0])+
So if I build a regular expression like above, it means:
[\w-] // A symbol of Latin alphabet or a digit or an underscore '_' or '-'
[^\x00-\xa0] // ISO 10646 characters U+00A1 and higher
\\. // '\' and '.'
So this time, my question is when will the pattern \\. work?
The variable would be better named css3Identifier or something.
Transforming [\w-]|[^\x00-\xa0] into an equivalent form that matches the spec better:
[a-zA-Z0-9_-]|[\u00A1-\uFFFF]
Consider that A1 is 161, _ is underscore and - is a dash, and then read this:
In CSS3, identifiers (including element names, classes, and IDs in selectors (see [SELECT] [or is this still true])) can contain only the characters [A-Za-z0-9] and ISO 10646 characters 161 and higher, plus the hyphen (-) and the underscore (_)
"and higher" is covered by -\uFFFF
The "\\\\." matches any single character preceded by a backslash, e.g. \7B would match \7, and then B would be caught by the middle alternative. It also matches \n, \r, \t etc.
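A quick way to convince yourself is to build the regex from the string and try a few inputs. This is just a sketch: the ^ and $ anchors and the sample inputs are my additions, not something Sizzle does in this form.

```javascript
// The string exactly as it appears in the Sizzle source.
const characterEncoding = "(?:\\\\.|[\\w-]|[^\\x00-\\xa0])+";

// After string-literal unescaping, this is what the regex engine sees.
console.log(characterEncoding); // (?:\\.|[\w-]|[^\x00-\xa0])+

// Anchored so the whole input has to be made of identifier characters.
const identifier = new RegExp("^" + characterEncoding + "$");

console.log(identifier.test("main-nav"));   // true: \w and -
console.log(identifier.test("foo:bar"));    // false: a bare : is not allowed
console.log(identifier.test("foo\\:bar"));  // true: the input is foo\:bar, so \: is an escape
console.log(identifier.test("caf\u00E9"));  // true: U+00E9 is above \xa0
console.log(identifier.test("a b"));        // false: space is inside \x00-\xa0
```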
It is just a regex for the valid format of CSS identifiers, classes, tags and attributes. A link is also in the source code comment. Following are the rules, including the possible use of backslashes, which might answer your question:
4.1. Characters and case
The following rules always hold:
All CSS style sheets are case-insensitive, except for parts that are not under the control of CSS. For example, the case-sensitivity of values of the HTML attributes "id" and "class", of font names, and of URIs lies outside the scope of this specification. Note in particular that element names are case-insensitive in HTML, but case-sensitive in XML.
In CSS3, identifiers (including element names, classes, and IDs in selectors (see [SELECT] [or is this still true])) can contain only the characters [A-Za-z0-9] and ISO 10646 characters 161 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit or a hyphen followed by a digit. They can also contain escaped characters and any ISO 10646 character as a numeric code (see next item). For instance, the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F". (See [UNICODE310] and [ISO10646].)
In CSS3, a backslash (\) character indicates three types of character escapes.
First, inside a string (see [CSS3VAL]), a backslash followed by a newline is ignored (i.e., the string is deemed not to contain either the backslash or the newline).
Second, it cancels the meaning of special CSS characters. Any character (except a hexadecimal digit) can be escaped with a backslash to remove its special meaning. For example, "\"" is a string consisting of one double quote. Style sheet preprocessors must not remove these backslashes from a style sheet since that would change the style sheet's meaning.
Third, backslash escapes allow authors to refer to characters they can't easily put in a style sheet. In this case, the backslash is followed by at most six hexadecimal digits (0..9A..F), which stand for the ISO 10646 ([ISO10646]) character with that number. If a digit or letter follows the hexadecimal number, the end of the number needs to be made clear. There are two ways to do that:
with a space (or other whitespace character): "\26 B" ("&B"). In this case, user agents should treat a "CR/LF" pair (13/10) as a single whitespace character.
by providing exactly 6 hexadecimal digits: "\000026B" ("&B")
In fact, these two methods may be combined. Only one whitespace character is ignored after a hexadecimal escape. Note that this means that a "real" space after the escape sequence must itself either be escaped or doubled.
Backslash escapes are always considered to be part of an identifier or a string (i.e., "\7B" is not punctuation, even though "{" is, and "\32" is allowed at the start of a class name, even though "2" is not).
http://www.w3.org/TR/css3-syntax/#characters
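To see the second and third escape types in action, here is a small decoder sketch. The name cssUnescape is made up, it ignores the first rule (backslash-newline inside strings), and using U+FFFD for zero or out-of-range numbers is my assumption rather than something the quoted text prescribes.

```javascript
// Hypothetical helper: decodes CSS backslash escapes in an identifier.
function cssUnescape(text) {
  return text.replace(
    // Either 1 to 6 hex digits plus one optional whitespace character
    // (CR/LF counts as one), or a single escaped non-hex character.
    /\\(?:([0-9a-fA-F]{1,6})(?:\r\n|[ \t\r\n\f])?|([^\r\n\f0-9a-fA-F]))/g,
    (whole, hex, literal) => {
      if (hex === undefined) {
        return literal; // "\&" becomes "&"
      }
      const codePoint = parseInt(hex, 16);
      const valid = codePoint > 0 && codePoint <= 0x10FFFF;
      return valid ? String.fromCodePoint(codePoint) : "\uFFFD";
    }
  );
}

console.log(cssUnescape("B\\&W\\?"));     // B&W?
console.log(cssUnescape("B\\26 W\\3F"));  // B&W?
console.log(cssUnescape("\\26 B"));       // &B
console.log(cssUnescape("\\000026B"));    // &B
```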
I need a regular expression to allow all alphabet characters plus Greek/German alphabet in a string but replace those symbols ?,&,^,". with *
I skipped the list of characters to escape to keep the question simple.
I really want to see how to construct this and afterwards include alphabet sets using ASCII codes.
If you have a finite and short set of elements to replace, you could just use a character class, e.g.
string.replace(/[?\^&]/g, '*');
and add as many symbols as you want to reject. You could also add ranges of Unicode symbols you want to replace (e.g. \u017F-\u036F\u0400-\uFFFF).
Otherwise, use a class to specify which symbols don't need to be replaced, like a-z, accented/diacritic letters and Greek symbols:
string.replace(/[^a-z\u00C0-\u017E\u0370-\u03FF]/gi, '*');
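Here is a small sketch of how the two approaches behave on the same input. The sample string is made up, and the ranges are written with the \u prefix on both ends, which each escape needs. Note that the second approach also replaces spaces, since they are not in the allowed class.

```javascript
// Made-up sample: "Grüße & Ωmega?" written with escapes.
const sample = "Gr\u00FC\u00DFe & \u03A9mega?";

// Approach 1: list the symbols to reject.
const rejected = sample.replace(/[?\^&]/g, "*");

// Approach 2: list what is allowed and replace everything else.
const allowedOnly = sample.replace(/[^a-z\u00C0-\u017E\u0370-\u03FF]/gi, "*");

console.log(rejected);     // Grüße * Ωmega*
console.log(allowedOnly);  // Grüße***Ωmega*
```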
You have to use the XRegexp plugin, along with the Unicode add-on.
Once you have that, you can use modern regexes like /[\p{L}\p{Nl}]/, which necessarily also includes those \p{Greek} code points which are letters or letter-numbers. But you could also match /[\p{Latin}\p{Greek}]/ if you wanted.
Javascript’s own regexes are terrible. Use XRegexp.
So something like: /^[^?&\^"]*$/ (that means the string is composed only of characters outside the five you listed)...
But if you want to have the greek characters and the unicode characters (what are unicode characters? àèéìòù? Japanese?) perhaps you'll have to use http://xregexp.com/ It is a regex library for javascript that includes character classes for the various unicode character classes (I know I'm repeating myself) plus other "commands" for unicode handling.
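For what it's worth, newer JavaScript engines (ES2018 and later) understand \p{...} natively when the regex has the u flag, so on those you may not need a library at all. A minimal sketch, assuming your target runtime supports Unicode property escapes; the sample text is made up:

```javascript
// Anything that is not a letter, a letter-number, or whitespace.
const notLetterOrSpace = /[^\p{L}\p{Nl}\s]/gu;

// Whole string consists of Latin or Greek script characters only.
const latinOrGreekOnly = /^[\p{Script=Latin}\p{Script=Greek}]+$/u;

const text = "Gr\u00FC\u00DFe & \u03A9mega?"; // "Grüße & Ωmega?"

console.log(text.replace(notLetterOrSpace, "*"));  // Grüße * Ωmega*
console.log(latinOrGreekOnly.test("\u03A9mega"));  // true
console.log(latinOrGreekOnly.test("\u042F"));      // false, Cyrillic letter
```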
I encountered this regular expression that detects string literal of Unicode characters in JavaScript.
'"'("\\x"[a-fA-F0-9]{2}|"\\u"[a-fA-F0-9]{4}|"\\"[^xu]|[^"\n\\])*'"'
but I couldn't understand the role and need of
1) "\\x"[a-fA-F0-9]{2}
2) "\\"[^xu]|[^"\n\\]
My guess about 1) is that it is detecting control characters.
"\\x"[a-fA-F0-9]{2}
This is a literal \x followed by two characters from the hex-digit group.
This matches the shorter-form character escapes for the code points 0–255, \x00–\xFF. These are valid in JavaScript string literals but they aren't in JSON, where you have to use \u0000–\u00FF instead.
"\\"[^xu]|[^"{esc}\n]
This matches one of:
backslash followed by one more character, except for x or u. The valid cases for \xNN and \uNNNN were picked up in the previous |-separated clauses, so what this does is avoid matching invalid syntax like \uqX.
anything else, except for the " or newline. It is probably also supposed to be excluding other escape characters, which I'm guessing is what {esc} means. That isn't part of the normal regex syntax, but it may be some extended syntax or templating over the top of regex. Otherwise, [^"{esc}\n] would mean just any character except ", {, e, s, c, } or newline, which would be wrong.
Notably, the last clause, that picks up ‘anything else’, doesn't exclude \ itself, so you can still have \uqX in your string and get a match even though that is invalid in both JSON and JavaScript.
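To try this out, here is my rough translation of the pattern into plain JavaScript regex syntax (the original looks like lex-style notation with quoted literals, so the translation is an assumption on my part). The strict version excludes \ in the last clause; the lax version does not, which shows the problem described above.

```javascript
// Last clause excludes the backslash, so a stray \u cannot slip through.
const strict = /^"(?:\\x[a-fA-F0-9]{2}|\\u[a-fA-F0-9]{4}|\\[^xu]|[^"\n\\])*"$/;

// Last clause allows the backslash: invalid escapes are accepted.
const lax = /^"(?:\\x[a-fA-F0-9]{2}|\\u[a-fA-F0-9]{4}|\\[^xu]|[^"\n])*"$/;

// Each test input below is source text of a string literal, quotes included.
console.log(strict.test('"a\\x41b"'));        // true:  "a\x41b"
console.log(strict.test('"\\u00BF"'));        // true:  "\u00BF"
console.log(strict.test('"say \\"hi\\""'));   // true:  "say \"hi\""
console.log(strict.test('"\\uqX"'));          // false: "\uqX" is invalid
console.log(lax.test('"\\uqX"'));             // true:  the lax version lets it through
```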