JavaScript equivalent of C#'s Char.IsSymbol

I'm trying to strip all 'Unicode Symbols' from a string. That is, keeping all multilingual characters but removing dingbats, arrows, and all of that stuff.
C# has a very handy function, Char.IsSymbol, that can be run on each character of a string, stripping the character when the function returns true.
I've been searching for a way to do something similar in JavaScript. If it's a regex, how would I compile a list of all the Unicode ranges of the symbol characters? I looked at XRegExp but couldn't find something that only filters symbols.

XRegExp does have support for what you're looking for - http://xregexp.com/plugins/#unicode
You'd probably match either \pL or \pS. You can find a nice list of the typical Unicode categories at http://www.regular-expressions.info/unicode.html#category
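For instance, with XRegExp's Unicode addons loaded, stripping everything in the Symbol category (\p{S}, which is close to what Char.IsSymbol tests for: Sm, Sc, Sk, So) could look like this sketch; it assumes the addon script is included on the page:

// Assumes xregexp-all.js (base library plus Unicode addons) is loaded.
var symbols = XRegExp('\\p{S}', 'g'); // \p{S} = the Unicode Symbol categories
var cleaned = XRegExp.replace('Hello → wörld ✈', symbols, '');
console.log(cleaned); // "Hello  wörld " (arrow and airplane removed, umlaut kept)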
Overall, Unicode is quite tricky. It gives plenty of opportunities for trouble, especially with software that isn't fully Unicode compatible (sadly, this includes JavaScript - see https://mathiasbynens.be/notes/javascript-unicode for a nice set of examples). This is further exacerbated by the fact that JS often runs with double encoding (HTML+JS, and there are worse cases as well). Somebody will probably find a way to bypass your checks, but I'm afraid there's no easy way to prevent that. Just be on the lookout :)

Related

JS lexing---multi line string

I am making a JS lexer as part of my studies. In JS, single-line strings start with " or ' and end with the same character, except when that character is preceded by a backslash.
In my current code, I loop through every character and append them to existing tokens based on flags like "string" or "regex". So it feels natural to implement multi-line strings with " or ', because it seems that this would not affect any other part of my lexer.
Is there any practical reason why newlines are not allowed as contents of string literals?
Many languages, but not all, prohibit unescaped newlines in string literals. So JavaScript is certainly not unique here.
But the motivation really has little to do with the ease, difficulty or efficiency of lexical analysis. In fact, for lexical analysis the simplest syntax is to allow any character rather than having to include special-case checks. [Note 1]
There are other considerations, though; notably, the importance of a program to be readable and easy to debug. Long strings put an extra load on someone reading the code, because they may not be aware that a section of program text is actually part of a string literal. (There's a similar problem with multiline comments, which is why it's usually considered good style to mark every line in a long comment in some way, for example with a vertical column of stars at the left-hand margin. No such solution exists for string literals, though.)
Also, unterminated multiline strings can be annoying to correct. If strings cannot span lines, the error will be detected on the line containing the problem. But a runaway multiline string might continue until the beginning of the next string, triggering a syntax error when the contents of the next string are accidentally parsed as program text. Or worse, resulting in a completely incorrect parse of what was supposed to be program text, followed by another incorrect string literal starting where the second literal ends, and continuing from there.
That also makes it hard for developer tools, such as editors and syntax highlighters, to deal with program text as it is being typed.
In the end, you may or may not find these arguments compelling, and a language designer might have other aesthetic preferences as well. I can't really speak for the original designers of the JavaScript language, and neither of us can take a voyage in time to argue with them and maybe change their decision.
For better or worse, languages are designed according to particular subjective judgements, and if the language is successful these judgements become permanent features. They are things you have to accept if you are using a language and they're not usually worth obsessing about. You get used to them, or you find a different language to program in, with its own syntax quirks.
When you design your own language, you will need to resolve a large number of syntactic questions, and you will undoubtedly run into cases where the answer is not clearcut because there is no objectively correct unique solution. Whatever you do, someone will want to argue with you. Perhaps you can refer them to this answer.
Notes:
There is actually a historic reason for not allowing multiline string literals, which is much clearer but has been more or less irrelevant for several decades.
Once Upon A Time, common filesystems considered text files to be linear arrays of fixed-length lines (often 80 character lines, matching a Hollerith card). One advantage of such a filesystem is that it could instantly navigate to a particular line number in a file, since all lines were the same length. But in any case, for systems where programs were entered on punched cards, the fixed length lines were just part of the environment.
To make all lines the same length, lines needed to be filled out with space characters. This would obviously make multiline string literals awkward, and that's why C never allowed multiline string literals, instead relying on a syntactic feature where consecutive string literals are automatically concatenated into a single literal.
In the end, fixed-line-length filesystems proved to be unpopular, and I don't think you're likely to run into one these days. But a careful reading of the C and Posix standards shows that such filesystems must still be usable by conforming implementations, with the consequence that a fully portable program must be prepared to deal with line-length limits on output and trailing whitespace on input.
There is also this syntax, which escapes the newline to continue a string literal onto the next line:
const string =
'line1\
line2\
line3'
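Note that the escaped newline is elided, so this produces "line1line2line3" with no newline characters in it. Since ES2015 you can also use a template literal, which is a true multi-line string that preserves the line breaks:

const string = `line1
line2
line3`;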

Why does my JavaScript RegExp not work as expected

I am writing a password screen, and the requirements for the password security are: somewhere between 8 and 20 characters in length, must contain at least one alphabetic character, at least one numeric character, and at least one special character from [!##$%^&*].
I have cobbled together this regular expression, which appeared to work in C#, but when I started rewriting the code for JavaScript validation, the regular expression flagged what I thought were valid passwords as invalid.
Here is the regular expression as I assign it to RegExp:
var regExPatt = new RegExp('^(?=(?:.*[a-zA-Z]){1})(?=(?:.*\d){1})(?=(?:.*[!###$%^&*]){1})(?!.*\s).{8,20}$');
NOTA BENE: The doubled ## is there to get the # symbol into the RegExp; otherwise it tries to treat partial strings like Razor variables and things go sideways fast.
Where did I go wrong with this regular expression? I know it is fairly complicated.
Passwords that work:
freddy1234%
freddy123$5
freddy12#45
freddy1#345
freddy!2345
Passwords that do not work:
test1234%
wilma1234%
Any ideas?
JavaScript developers should know about:
RegExp object description
Regular Expressions chapter in the JavaScript Guide
Developers who want to use positive or negative lookahead should take into account that this requires JavaScript 1.5, as noted on the New in JavaScript 1.5 page. But that should be no problem nowadays, as this is a very old version, released in November 2000, and all browsers in use today support JavaScript 1.5.
Lookbehind is not supported by JavaScript at all yet (as of JavaScript 1.8.5).
A list of the JavaScript versions and which browser supports which JavaScript version can be found on Wikipedia page about JavaScript.
New in JavaScript contains the links to the pages explaining what was added in which version of JavaScript.
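As a side observation, when a pattern like the one above is built via the RegExp constructor, every regex backslash inside the string literal must be doubled, because the string consumes one level of escaping ('\d' collapses to a plain 'd'). A sketch of the same rules written both ways (the special-character class here is illustrative, not the questioner's exact set):

// Regex literal: backslashes are taken as-is.
var pattern = /^(?=.*[a-zA-Z])(?=.*\d)(?=.*[!#$%^&*])(?!.*\s).{8,20}$/;

// RegExp constructor: the same pattern needs '\\d' and '\\s',
// otherwise '\d' in the string literal becomes a literal 'd'.
var samePattern = new RegExp('^(?=.*[a-zA-Z])(?=.*\\d)(?=.*[!#$%^&*])(?!.*\\s).{8,20}$');

console.log(pattern.test('test1234%'));     // true
console.log(samePattern.test('test1234%')); // true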

How to get the character corresponding to a Unicode character name?

I'm developing a Braille-to-text translator, and a nice feature to have is showing an output in Unicode's Braille patterns characters (say, kind of a Unicode Braille generator).
Since I know the dots that are "enabled" in each cell ("Braille character"), it would be trivial to construct the Unicode name of the character I need (they are of the form BRAILLE PATTERN DOTS-123456 if all dots are enabled, or BRAILLE PATTERN DOTS-14 if only dots 1 and 4 are enabled).
Is there any simple method to get a Unicode character in Javascript from its Unicode name?
My second try will be doing math with the Unicode values, but I think constructing the names is pretty much straightforward.
Thanks in advance :)
JavaScript, unlike some other languages, does not have any direct way of getting a character from its Unicode name. In my full Unicode input utility, I have therefore used the brute force method of using the Unicode character database as a text block and parsing it. You might find some better, more efficient, and more maintainable tools, but if you need just some specific collections of characters, as in the question, an ad hoc approach is better. In this case, you don't even need the Unicode names as such; they would be just an intermediate step from dot patterns to characters.
Clause 15.11 in the Unicode Standard, chapter 15, describes the allocation principles for Braille symbols.
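In fact, the Braille Patterns block (U+2800 to U+28FF) is laid out algorithmically: dot n (1 through 8) simply sets bit n-1 of the offset from U+2800, so a small sketch can go straight from dot numbers to the character without ever touching the names:

// dots is an array of raised dot numbers, e.g. [1, 4].
function brailleFromDots(dots) {
    var offset = 0;
    for (var i = 0; i < dots.length; i++) {
        offset |= 1 << (dots[i] - 1); // dot n sets bit n-1
    }
    return String.fromCharCode(0x2800 + offset);
}

console.log(brailleFromDots([1, 4]));             // "⠉" U+2809 BRAILLE PATTERN DOTS-14
console.log(brailleFromDots([1, 2, 3, 4, 5, 6])); // "⠿" U+283F BRAILLE PATTERN DOTS-123456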
Very interesting. In my app I use a DB lookup as you described, and then use JavaScript and the HTML canvas object to dynamically construct the Braille. This has the added benefit that I can create custom ARIA tags if desired. I say this because ASCII Braille and Unicode aren't formats readable by several if not all screen readers. I know VoiceOver on iOS and Macs won't read it. Something I'm working on is a way to make JS read BRL ASCII fields and Unicode and create ARIA tags so that a blind user actually knows what's going on on the webpage.

Can I depend on the behavior of charCodeAt() and fromCharCode() to remain the same?

I have written a personal web app that uses charCodeAt() to convert text that is input by the user into the relevant character codes (for example ⊇ is converted to 8839 for storage), which is then sent to Perl, which sends them to MySQL. To retrieve the input text, the app uses fromCharCode() to convert the numbers back to text.
I chose to do this because Perl's unicode support is very hard to deal with correctly. So Perl and MySQL only see numbers, which makes life a lot simpler.
My question is: can I depend on fromCharCode() to always convert a number like 8834 to the relevant character? I don't know what standard it uses, but let's say it uses UTF-8; if it is changed to use UTF-16 in the future, this will obviously break my program if there is no backward compatibility.
I know that my ideas about these concepts aren't that clear, therefore please care to clarify if I've shown a misunderstanding.
fromCharCode and charCodeAt deal with Unicode code points, i.e. numbers between 0 and 65535 (0xFFFF), assuming all characters are in the Basic Multilingual Plane (BMP). Unicode and the code points are permanent, so you can trust them to remain the same forever.
Encodings such as UTF-8 and UTF-16 take a stream of code points (numbers) and output a byte stream. JavaScript is somewhat strange in that characters outside the BMP have to be handled as two UTF-16 code units (a surrogate pair), i.e. two calls to charCodeAt, according to UTF-16 rules. However, virtually every character you'll ever encounter (including Chinese, Japanese, etc.) is in the BMP, so your program will work even if you don't handle these cases.
One thing you can do is convert the numbers back into bytes (in big-endian int16 format) and interpret the resulting text as UTF-16. The behavior of fromCharCode and charCodeAt is fixed in current JavaScript implementations and will not ever change.
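A minimal illustration of the surrogate-pair behavior described above, using 𝌆 (U+1D306), which lies outside the BMP:

var astral = String.fromCharCode(0xD834, 0xDF06); // high + low surrogate
console.log(astral);                            // "𝌆"
console.log(astral.length);                     // 2 (two UTF-16 code units)
console.log(astral.charCodeAt(0).toString(16)); // "d834"
console.log(astral.charCodeAt(1).toString(16)); // "df06"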
I chose to do this because Perl's unicode support is very hard to deal with correctly.
This is ɴᴏᴛ true!
Perl has the strongest Unicode support of any major programming language. It is much easier to work with Unicode if you use Perl than if you use any of C, C++, Java, C♯, Python, Ruby, PHP, or Javascript. This is not hyperbole and boosterism from uneducated, blind allegiance; it is a considered appraisal based on more than ten years of professional experience and study.
The problems encountered by naïve users are virtually always because they have deceived themselves about what Unicode is. The number-one worst brain-bug is thinking that Unicode is like ASCII but bigger. This is absolutely and completely wrong. As I wrote elsewhere:
It’s fundamentally and critically not true that Uɴɪᴄᴏᴅᴇ is just some enlarged character set relative to ᴀsᴄɪɪ. At most, that’s true of nothing more than the stultified ɪsᴏ‑10646. Uɴɪᴄᴏᴅᴇ includes much, much more than just the assignment of numbers to glyphs: rules for collation and comparisons, three forms of casing, non-letter casing, multi-codepoint casefolding, both canonical and compatible composed and decomposed normalization forms, serialization forms, grapheme clusters, word- and line-breaking, scripts, numeric equivalents, widths, bidirectionality, mirroring, print widths, logical ordering exclusions, glyph variants, contextual behavior, locales, regexes, multiple forms of combining classes, multiple types of decompositions, hundreds and hundreds of critically useful properties, and much, much more‼
Yes, that’s a lot, but it has nothing to do with Perl. It has to do with Unicode. That Perl allows you to access these things when you work with Unicode is not a bug but a feature. That those other languages do not allow you full access to Unicode can by no means be construed as a point in their favor: rather, those are all major bugs of the highest possible severity, because if you cannot work with Unicode in the 21st century, then that language is primitive, broken, and fundamentally useless for the demanding requirements of modern text processing.
Perl is not. And it is a gazillion times easier to do those things right in Perl than in those other languages; in most of them, you cannot even begin to work around their design flaws. You’re just plain screwed. If a language doesn’t provide full Unicode support, it is not fit for this century; discard it.
Perl makes Unicode infinitely easier than languages that don’t let you use Unicode properly can ever do.
In this answer, you will find at the front, Seven Simple Steps for dealing with Unicode in Perl, and at the bottom of that same answer, you will find some boilerplate code that will help. Understand it, then use it. Do not accept brokenness. You have to learn Unicode before you can use Unicode.
And that is why there is no simple answer. Perl makes it easy to work with Unicode, provided that you understand what Unicode really is. And if you’re dealing with external sources, you are going to have to arrange for that source to use some sort of encoding.
Also read up on all the stuff I said about 𝔸𝕤𝕤𝕦𝕞𝕖 𝔹𝕣𝕠𝕜𝕖𝕟𝕟𝕖𝕤𝕤. Those are things that you truly need to understand. Another brokenness issue that falls out of Rule #49 is that Javascript is broken because it doesn’t treat all valid Unicode code points in exactly the same way irrespective of their plane. Javascript is broken in almost all the other ways, too. It is unsuitable for Unicode work. Just Rule #34 will kill you, since you can’t get Javascript to follow the required standard about what things like \w are defined to do in Unicode regexes.
It’s amazing how many languages are utterly useless for Unicode. But Perl is most definitely not one of those!
In my opinion it won't break.
Read Joel Spolsky's article on Unicode and character encoding. Relevant part of the article is quoted below:
Every letter in every alphabet is assigned a number by the Unicode consortium which is written like this: U+0639. This number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. The English letter A would be U+0041.
It does not matter whether this magical number is encoded in UTF-8 or UTF-16 or any other encoding. The number will still be the same.
As pointed out in other answers, fromCharCode() and charCodeAt() deal with Unicode code points for any code point in the Basic Multilingual Plane (BMP). Strings in JavaScript are UCS-2 encoded, and any code point outside the BMP is represented as two JavaScript characters. None of these things are going to change.
To handle any Unicode character on the JavaScript side, you can use the following function, which will return an array of numbers representing the sequence of Unicode code points for the specified string:
var getStringCodePoints = (function() {
    // Combine a UTF-16 surrogate pair into a single code point.
    function surrogatePairToCodePoint(charCode1, charCode2) {
        return ((charCode1 & 0x3FF) << 10) + (charCode2 & 0x3FF) + 0x10000;
    }

    // Read the string in character by character and create an array of code points
    return function(str) {
        var codePoints = [], i = 0, charCode;
        while (i < str.length) {
            charCode = str.charCodeAt(i);
            if ((charCode & 0xF800) == 0xD800) {
                // Surrogate found; in a well-formed string this is the high half,
                // so consume the following low surrogate as well.
                codePoints.push(surrogatePairToCodePoint(charCode, str.charCodeAt(++i)));
            } else {
                codePoints.push(charCode);
            }
            ++i;
        }
        return codePoints;
    };
})();

var str = "𝌆";
var codePoints = getStringCodePoints(str);

console.log(str.length);                 // 2
console.log(codePoints.length);          // 1
console.log(codePoints[0].toString(16)); // 1d306
JavaScript strings are UTF-16; this isn't something that is going to be changed.
But don't forget that UTF-16 is a variable-length encoding.
In 2018, you can use String.prototype.codePointAt() and String.fromCodePoint().
These methods work even if a character is not in the Basic Multilingual Plane (BMP).
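A quick sketch of the difference, again with 𝌆 (U+1D306):

var str = "𝌆";
console.log(str.charCodeAt(0).toString(16));  // "d834" (just the high surrogate)
console.log(str.codePointAt(0).toString(16)); // "1d306" (the full code point)
console.log(String.fromCodePoint(0x1D306));   // "𝌆"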

Why is this regex not working for German words?

I am trying to break the following sentence into words and wrap each word in a span.
<p class="german_p big">Das ist ein schönes Armband</p>
I followed this:
How to get a word under cursor using JavaScript?
$('p').each(function() {
    var $this = $(this);
    $this.html($this.text().replace(/\b(\w+)\b/g, "<span>$1</span>"));
});
The only problem i am facing is, after wrapping the words in span the resultant html is like this:
<p class="german_p big"><span>Das</span> <span>ist</span> <span>ein</span> <span>sch</span>ö<span>nes</span> <span>Armband</span>.</p>
So schönes is broken into three parts: sch, ö, and nes. Why is this happening? What would be the correct regex for this?
Unicode in Javascript Regexen
Like Java itself, Javascript doesn't support Unicode in its \w, \d, and \b regex shortcuts. This is (arguably) a bug in Java and Javascript. Even if one manages through casuistry or obstinacy to argue that it is not a bug, it's sure a big gotcha. Kinda bites, really.
The problem is that those popular regex shortcuts only apply to 7-bit ASCII whether in Java or in Javascript. This restriction is painfully 1970s‐ish; it makes absolutely no sense in the 21ˢᵗ century. This blog posting from this past March makes a good argument for fixing this problem in Javascript.
It would be really nice if some public-spirited soul would please add Javascript to this Wikipedia page that compares the regex features supported in various languages.
This page says that Javascript doesn't support any Unicode properties at all. That same site has a table that's a lot more detailed than the Wikipedia page I mention above. For Javascript features, look under its ECMA column.
However, that table is in some cases at least five years out of date, so I can't completely vouch for it. It's a good start, though.
Unicode Support in Other Languages
Ruby, Python, Perl, and PCRE all offer ways to extend \w to mean what it is supposed to mean, but the two J‐thingies do not.
In Java, however, there is a good workaround available. There, you can use \pL to mean any character that has the Unicode General_Category=Letter property. That means you can always emulate a proper \w using [\pL\p{Nd}_].
Indeed, there's even an advantage to writing it that way, because it keeps you aware that you're adding decimal numbers and the underscore character to the character class. With a simple \w, people sometimes forget this is going on.
I don't believe that this workaround is available in Javascript, though. You can also use Unicode properties like those in Perl and PCRE, and in Ruby 1.9, but not in Python.
The only Unicode properties current Java supports are the one- and two-character general properties like \pN and \p{Lu} and the block properties like \p{InAncientSymbols}, but not scripts like \p{IsGreek}, etc.
The future JDK7 will finally get around to adding scripts. Even then Java still won't support most of the Unicode properties, though, not even critical ones like \p{WhiteSpace} or handy ones like \p{Dash} and \p{Quotation_Mark}.
SIGH! To understand just how limited Java's property support is, merely compare it with Perl. Perl supports 1633 Unicode properties as of 2007's 5.10 release, and 2478 of them as of this year's 5.12 release. I haven't counted them for ancient releases, but Perl started supporting Unicode properties back during the last millennium.
Lame as Java is, it's still better than Javascript, because Javascript doesn't support any Unicode properties whatsoCENSOREDever. I'm afraid that Javascript's paltry 7-bit mindset makes it pretty close to unusable for Unicode. This is a tremendously huge gaping hole in the language that's extremely difficult to account for given its target domain.
Sorry 'bout that. ☹
You can also use
/\b([äöüÄÖÜß\w]+)\b/g
instead of
/\b(\w+)\b/g
in order to handle the umlauts
\w only matches A-Z, a-z, 0-9, and _ (underscore).
You could use something like \S+ to match all non-space characters, including non-ASCII characters like ö. This might or might not work depending on how the rest of your string is formatted.
Reference: http://www.javascriptkit.com/javatutors/redev2.shtml
To include all the Latin-1 Supplement characters like äöüßÒÿ, you can use:
[\w\u00C0-\u00ff]
however, there are even more funny characters in the Latin Extended-A and Latin Extended-B Unicode blocks, like ČŇů. To include those as well, you can use:
[\w\u00C0-\u024f]
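Applied to the original jQuery loop, that might look like the following sketch; the \b anchors are dropped because, being ASCII-only, they fail next to umlauts, and a greedy + already grabs whole words:

$('p').each(function() {
    var $this = $(this);
    $this.html($this.text().replace(/([\w\u00C0-\u024f]+)/g, "<span>$1</span>"));
});
// "schönes" now comes out in one piece: <span>schönes</span>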
\w and \b are not Unicode-aware in JavaScript; they only match ASCII word/boundary characters. If your use cases all allow splitting on whitespace, you can use \s/\S, which are Unicode-aware.
As others note, the \w shortcut is not very useful for non-Latin character sets. If you need to match other text ranges you should use hex* notation (Ref1) (Ref2) for the appropriate range.
* could be hex or octal or unicode, you'll often see these collectively referred as hex notation.
The \b's will also not work correctly. It is possible to use the XRegExp library's \p{L} token for Unicode support; however, there is still no \b support, so you won't be able to find word boundaries. It would be nice to provide \b support by emulating lookbehind/lookahead with \P{L}, as in the following implementation:
http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript
While JavaScript doesn't support Unicode properties natively, you could use this library to work around it: http://xregexp.com/
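For example, a sketch of the word-wrapping loop using XRegExp (assuming its Unicode addons are loaded), where \p{L} matches any Unicode letter:

// [\p{L}\p{Nd}_]+ approximates a Unicode-aware \w+, so no \b is needed.
var word = XRegExp('([\\p{L}\\p{Nd}_]+)', 'g');
$('p').each(function() {
    var $this = $(this);
    $this.html(XRegExp.replace($this.text(), word, '<span>$1</span>'));
});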
