Why this regex is not working for german words?


I am trying to break the following sentence in words and wrap them in span.
<p class="german_p big">Das ist ein schönes Armband</p>
I followed this:
How to get a word under cursor using JavaScript?
$('p').each(function() {
    var $this = $(this);
    $this.html($this.text().replace(/\b(\w+)\b/g, "<span>$1</span>"));
});
The only problem I am facing is that, after wrapping the words in spans, the resulting HTML looks like this:
<p class="german_p big"><span>Das</span> <span>ist</span> <span>ein</span> <span>sch</span>ö<span>nes</span> <span>Armband</span>.</p>
So schönes is broken into three pieces: sch, ö and nes. Why is this happening? What would be the correct regex for this?

Unicode in Javascript Regexen
Like Java itself, Javascript doesn't support Unicode in its \w, \d, and \b regex shortcuts. This is (arguably) a bug in Java and Javascript. Even if one manages through casuistry or obstinacy to argue that it is not a bug, it's sure a big gotcha. Kinda bites, really.
The problem is that those popular regex shortcuts only apply to 7-bit ASCII whether in Java or in Javascript. This restriction is painfully 1970s‐ish; it makes absolutely no sense in the 21ˢᵗ century. This blog posting from this past March makes a good argument for fixing this problem in Javascript.
It would be really nice if some public-spirited soul would please add Javascript to this Wikipedia page that compares the regex features supported in various languages.
This page says that Javascript doesn't support any Unicode properties at all. That same site has a table that's a lot more detailed than the Wikipedia page I mention above. For Javascript features, look under its ECMA column.
However, that table is in some cases at least five years out of date, so I can't completely vouch for it. It's a good start, though.
Unicode Support in Other Languages
Ruby, Python, Perl, and PCRE all offer ways to extend \w to mean what it is supposed to mean, but the two J‐thingies do not.
In Java, however, there is a good workaround available. There, you can use \pL to mean any character that has the Unicode General_Category=Letter property. That means you can always emulate a proper \w using [\pL\p{Nd}_].
Indeed, there's even an advantage to writing it that way, because it keeps you aware that you're adding decimal numbers and the underscore character to the character class. With a simple \w, people sometimes forget this is going on.
I don't believe that this workaround is available in Javascript, though. You can also use Unicode properties like those in Perl and PCRE, and in Ruby 1.9, but not in Python.
The only Unicode properties current Java supports are the one- and two-character general properties like \pN and \p{Lu} and the block properties like \p{InAncientSymbols}, but not scripts like \p{IsGreek}, etc.
The future JDK7 will finally get around to adding scripts. Even then Java still won't support most of the Unicode properties, though, not even critical ones like \p{WhiteSpace} or handy ones like \p{Dash} and \p{Quotation_Mark}.
SIGH! To understand just how limited Java's property support is, merely compare it with Perl. Perl supports 1633 Unicode properties as of 2007's 5.10 release, and 2478 of them as of this year's 5.12 release. I haven't counted them for ancient releases, but Perl started supporting Unicode properties back during the last millennium.
Lame as Java is, it's still better than Javascript, because Javascript doesn't support any Unicode properties whatsoever. I'm afraid that Javascript's paltry 7-bit mindset makes it pretty close to unusable for Unicode. This is a tremendously huge gaping hole in the language that's extremely difficult to account for given its target domain.
Sorry 'bout that. ☹
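For readers on a newer engine: ECMAScript 2018 added Unicode property escapes, which are enabled by the u flag, so the [\pL\p{Nd}_] emulation described above for Java can now be written in JavaScript as well. The following is only a sketch that assumes an ES2018-capable runtime; wrapWords is a made-up helper name, not something from the original post.

```javascript
// Assumes an ES2018+ engine: the "u" flag turns on \p{...} property escapes.
// wrapWords is a hypothetical helper used only for illustration.
function wrapWords(text) {
  // Letters, decimal digits and underscore: a Unicode-aware stand-in for \w+.
  // No \b is needed because the greedy class already consumes whole runs.
  return text.replace(/[\p{L}\p{Nd}_]+/gu, "<span>$&</span>");
}

console.log(wrapWords("Das ist ein schönes Armband"));
// <span>Das</span> <span>ist</span> <span>ein</span> <span>schönes</span> <span>Armband</span>
```

Text that stores accents as separate combining characters would still be split; adding \p{M} to the class is one way to cover that case.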

You can also use
/\b([äöüÄÖÜß\w]+)\b/g
instead of
/\b(\w+)\b/g
in order to handle the umlauts
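One caveat worth checking before relying on this: \b is still defined in terms of the ASCII-only \w, so a word that begins or ends with an umlaut or ß has no boundary where the pattern expects one. The sketch below shows the effect and one possible way around it (dropping \b, since the greedy class already consumes whole runs). The helper names are invented for the example.

```javascript
// Pattern from the answer above.
var withBoundaries = /\b([äöüÄÖÜß\w]+)\b/g;
// Possible variant: same class, no \b.
var withoutBoundaries = /[äöüÄÖÜß\w]+/g;

function wrapWith(re, text) {
  return text.replace(re, "<span>$&</span>");
}

// Umlaut in the middle of a word: both patterns behave the same.
console.log(wrapWith(withBoundaries, "ein schönes Armband"));
// <span>ein</span> <span>schönes</span> <span>Armband</span>

// Umlaut at the start of a word: \b does not fire before "Ü".
console.log(wrapWith(withBoundaries, "eine Übung"));
// <span>eine</span> Ü<span>bung</span>

console.log(wrapWith(withoutBoundaries, "eine Übung"));
// <span>eine</span> <span>Übung</span>
```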

\w only matches A-Z, a-z, 0-9, and _ (underscore).
You could use something like \S+ to match all non-space characters, including non-ASCII characters like ö. This might or might not work depending on how the rest of your string is formatted.
Reference: http://www.javascriptkit.com/javatutors/redev2.shtml
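A quick sketch of the \S+ approach, to make the "might or might not work" part concrete: every run of non-space characters is wrapped, so punctuation attached to a word ends up inside the span. wrapBySpace is a hypothetical name chosen for this example.

```javascript
// Wrap every run of non-whitespace characters in a span.
// wrapBySpace is a made-up helper name for illustration.
function wrapBySpace(text) {
  return text.replace(/\S+/g, "<span>$&</span>");
}

console.log(wrapBySpace("Das ist ein schönes Armband."));
// <span>Das</span> <span>ist</span> <span>ein</span> <span>schönes</span> <span>Armband.</span>
```

Note the trailing full stop inside the last span; whether that matters depends on what the spans are used for.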

To include all the Latin 1 Supplement characters like äöüßÒÿ you can use:
[\w\u00C0-\u00ff]
However, there are even more funny characters in the Latin Extended-A and Latin Extended-B Unicode blocks, like ČŇů. To include those you can use:
[\w\u00C0-\u024f]
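As a rough illustration of the range-based class, here is how it might be used to pull words out of mixed text. One thing to be aware of: the range U+00C0 to U+00FF is not letters only; it also contains the multiplication sign × (U+00D7) and the division sign ÷ (U+00F7), so those get treated as word characters too. The variable and function names are arbitrary.

```javascript
// \w plus Latin-1 Supplement letters, Latin Extended-A and Latin Extended-B.
var wordRe = /[\w\u00C0-\u024F]+/g;

function extractWords(text) {
  return text.match(wordRe) || [];
}

console.log(extractWords("Čeština ist schön"));
// [ 'Čeština', 'ist', 'schön' ]

// Caveat: × (U+00D7) lies inside the range, so it is not a separator here.
console.log(extractWords("3×4"));
// [ '3×4' ]
```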

\w and \b are not Unicode-aware in JavaScript; they only match ASCII word/boundary characters. If your use cases all allow splitting on whitespace, you can use \s/\S, which are Unicode-aware.

As others note, the \w shortcut is not very useful for non-Latin character sets. If you need to match other text ranges you should use hex* notation (Ref1) (Ref2) for the appropriate range.
* could be hex or octal or unicode, you'll often see these collectively referred as hex notation.

The \b's will also not work correctly. It is possible to use the XRegExp library's \p{L} token for Unicode support; however, there is still no \b support, so you won't be able to find the word boundaries. It would be nice to provide \b support by doing lookbehinds/lookaheads with \P{L}, using the following implementation:
http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript

While javascript doesn't support Unicode natively, you could use this library to work around it: http://xregexp.com/

Related

JavaScript equivalent of C#'s Char.IsSymbol

I'm trying to strip all 'Unicode Symbols' from a string. That is, keeping all multilingual characters but removing dingbats, arrows, and all of that stuff.
C# has a very handy function called Char.IsSymbol that can be run on all characters of a string, stripping the character when the functions returns true.
I've been searching on doing something similar in JavaScript. If it's a regex then how can I compile a list of all the unicode ranges of the symbol characters? I looked at XRegExp but couldn't find something that only filters symbols.

XRegExp does have support for what you're looking for - http://xregexp.com/plugins/#unicode
You'd probably match either for \pL or \pS. You can find a nice list of the typical unicode categories in http://www.regular-expressions.info/unicode.html#category
Overall, Unicode is quite tricky. It gives plenty of opportunities for giving you trouble, especially with software that isn't fully Unicode compatible (sadly, this includes JavaScript - see https://mathiasbynens.be/notes/javascript-unicode for a nice set of examples). This is further exacerbated by the fact that JS often runs with double-encoding (HTML+JS, and there are worse cases as well). Somebody will probably find a way to bypass your checks, but I'm afraid there's no easy way to prevent that. Just be on the lookout :)
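On engines that implement ES2018 property escapes, the Symbol general category is also available without a library as \p{S}. It covers the subcategories Sm, Sc, Sk and So, which are the ones .NET documents for Char.IsSymbol. A minimal sketch, assuming such an engine (stripSymbols is a made-up name):

```javascript
// Requires an ES2018+ engine ("u" flag with \p{...}).
// stripSymbols is a hypothetical helper name.
function stripSymbols(text) {
  // \p{S} = general category Symbol (math, currency, modifier, other).
  return text.replace(/\p{S}/gu, "");
}

// U+2192 is an arrow (Sm), U+2702 is a dingbat (So); letters are kept.
console.log(stripSymbols("Gr\u00FC\u00DFe \u2192 \u65E5\u672C\u8A9E \u2702 ok"));

// Caveat: ordinary ASCII symbols such as + = $ are in the same category.
console.log(stripSymbols("a+b=$5")); // ab5
```

The spaces around a removed symbol are left in place, so a follow-up whitespace cleanup may be wanted.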

RegEx to test if a string contains more than X Unicode words

I saw many solutions that match Latin characters words like this one: /^\W*(\w+\b\W*){80,}$/
I'm looking for the equivalent expression that will support any language with Unicode characters.
The RegEx need to be JavaScript compatible.

EDIT: Javascript sadly doesn't seem to support this solution... You might want to look into XRegExp.
I'll leave this here in case it's of use for anyone in another language more Perl compatible, but this doesn't answer your question, sorry.
For unicode support you can use the \p{...} pattern.
Your pattern would become
/^\P{L}*(\p{L}+\P{L}*){80,}$/
Here \P{L} stands for anything but a letter, \p{L} for any letter (but not a digit or a _, so it's a little bit different from \w)
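Two notes for anyone trying this in JavaScript today. First, ES2018 engines accept \p{L} when the u flag is set. Second, without \b the pattern above lets the engine split a single word across several iterations of the group (\P{L}* may match nothing), so it effectively counts letters rather than words. A simpler route, sketched below under the assumption of an ES2018 engine, is to count letter runs directly; countWords and hasAtLeastWords are invented names.

```javascript
// Requires an ES2018+ engine. Helper names are hypothetical.
function countWords(text) {
  // A "word" here is a run of letters plus combining marks.
  var words = text.match(/[\p{L}\p{M}]+/gu);
  return words ? words.length : 0;
}

function hasAtLeastWords(text, min) {
  return countWords(text) >= min;
}

var sentence = "Das ist ein schönes Armband";
console.log(countWords(sentence));          // 5
console.log(hasAtLeastWords(sentence, 6));  // false

// The anchored pattern without \b over-counts: five words pass a "6 or more" test.
console.log(/^\P{L}*(\p{L}+\P{L}*){6,}$/u.test(sentence)); // true
```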

Javascript Regex + Unicode Diacritic Combining Characters

I want to match this character in the African Yoruba language 'ẹ́'. Usually this is made by combining an 'é' with a '\u0323' under dot diacritic. I found that:
'é\u0323'.match(/[é]\u0323/) works but
'ẹ́'.match(/[é]\u0323/) does not work.
I don't just want to match e. I want to match all combinations. Right now, my solution involves enumerating all combinations. Like so: /[ÁÀĀÉÈĒẸE̩Ẹ́É̩Ẹ̀È̩Ẹ̄Ē̩ÍÌĪÓÒŌỌO̩Ọ́Ó̩Ọ̀Ò̩Ọ̄Ō̩ÚÙŪṢS̩áàāéèēẹe̩ẹ́é̩ẹ̀è̩ẹ̄ē̩íìīóòōọo̩ọ́ó̩ọ̀ò̩ọ̄ō̩úùūṣs̩]/
Could there not be a shorter and thus better way to do this, or does regex matching in javascript of unicode diacritic combining characters not work this easily?
Thank you

Normally the solution would be to use Unicode properties and/or scripts, but JavaScript does not support them natively.
But there exists the lib XRegExp that adds this support. With this lib you can use
\p{L}: to match any kind of letter from any language.
\p{M}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
So your character class would look like this:
[\p{L}\p{M}]+
that would match all possible letters that are in the Unicode table.
If you want to limit it, you can have a look at Unicode scripts and replace \p{L} by a script, they collect all letters from certain languages. e.g. \p{Latin} for all Latin letters or \p{Cyrillic} for all Cyrillic letters.
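The same classes can be tried without a library on engines that implement ES2018, using the u flag. As far as I know the native syntax differs in one detail: scripts must be spelled \p{Script=Latin} (or \p{sc=Latin}), and a bare \p{Latin} is rejected. Combining marks carry the script value Inherited, so \p{M} still has to be listed next to the script. A small sketch with arbitrary sample strings:

```javascript
// Requires an ES2018+ engine.
// U+1EB9 (e with dot below) followed by U+0301 (combining acute accent).
var yoruba = "\u1EB9\u0301";

// Any letter plus any combining mark, in any script.
var anyWord = /[\p{L}\p{M}]+/gu;
console.log((yoruba + " ni").match(anyWord)); // two matches

// Restricted to Latin letters; marks are listed separately.
var latinWord = /[\p{Script=Latin}\p{M}]+/gu;
// The Cyrillic word is skipped, the Latin one is kept with its accent.
console.log((yoruba + " \u043C\u0438\u0440").match(latinWord));
```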

Usually this is made by combining an 'é' with a '\u0323' under dot diacritic
However, that isn't what you have here:
'ẹ́'
that's not U+0065,U+0323 but U+1EB9,U+0301 - combining an ẹ with an acute diacritic.
The usual solution would be to normalise each string (typically to Unicode Normal Form C) before doing the comparison.
I don't just want to match e. I want to match all combinations
Matching without diacriticals is typically done by normalising to Normal Form D and removing all the combining diacritical characters.
Unfortunately normalisation was not available in JS before ES2015 added String.prototype.normalize, so on older engines you would have to drag in code to do it, which would have to include a large Unicode data table. One such effort is unorm. For picking up characters based on Unicode properties like being a combining diacritical, you'd also need a regexp engine with support for the Unicode database, such as XRegExp Unicode Categories.
Server-side languages (eg Python, .NET) typically have native support for Unicode normalisation, so if you can do the processing on the server that would generally be easier.
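On engines that provide String.prototype.normalize (ES2015 and later), both techniques described above can be sketched in a few lines. The diacritic-stripping step additionally assumes ES2018 property escapes for \p{M}; stripMarks is a made-up helper name.

```javascript
// The string from the question: U+1EB9 + U+0301 (e with dot below, then acute).
var fromPage = "\u1EB9\u0301";
// What the pattern expected: U+00E9 + U+0323 (e with acute, then dot below).
var fromPattern = "\u00E9\u0323";

// Different code point sequences, so a plain comparison fails...
console.log(fromPage === fromPattern); // false
// ...but they are canonically equivalent, so their normal forms agree.
console.log(fromPage.normalize("NFC") === fromPattern.normalize("NFC")); // true

// Matching without diacritics: decompose, then drop combining marks.
function stripMarks(text) {
  return text.normalize("NFD").replace(/\p{M}/gu, "");
}

console.log(stripMarks(fromPage));  // e
console.log(stripMarks("schönes")); // schones
```

Normalising both the pattern source and the input to the same form before matching avoids having to enumerate every combination by hand.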

How to get the character corresponding to a Unicode character name?

I'm developing a Braille-to-text translator, and a nice feature to have is showing an output in Unicode's Braille patterns characters (say, kind of a Unicode Braille generator).
Since I know the dots that are "enabled" in each cell ("Braille character"), it would be trivial to construct the Unicode name of the character I need (they are of the form BRAILLE PATTERN DOTS-123456 if all dots are enabled, or BRAILLE PATTERN DOTS-14 if only dots 1 and 4 are enabled).
Is there any simple method to get a Unicode character in Javascript from its Unicode name?
My second try will be math*ing* with the Unicode values, but I think constructing the names is pretty much straightforward.
Thanks in advance :)

JavaScript, unlike some other languages, does not have any direct way of getting a character from its Unicode name. In my full Unicode input utility, I have therefore used the brute force method of using the Unicode character data base as a text block and parsing it. You might find some better, more efficient and more maintainable tools, but if you need just some specific collections of characters as in the question, an ad hoc approach is better. In this case, you don’t even need the Unicode names as such; they would be just an intermediate step from dot patterns to characters.
Clause 15.11 in the Unicode Standard, chapter 15, describes the allocation principles for Braille symbols.
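The allocation makes the direct computation short: the Braille Patterns block starts at U+2800, and dot n contributes bit n-1 to the offset (dot 1 = 0x01, dot 2 = 0x02, up to dot 8 = 0x80). A minimal sketch of going straight from dot numbers to the character, with brailleFromDots as an invented helper name:

```javascript
// brailleFromDots is a hypothetical helper: takes an array of dot numbers (1-8)
// and returns the corresponding character from the Braille Patterns block.
function brailleFromDots(dots) {
  var mask = 0;
  for (var i = 0; i < dots.length; i++) {
    var dot = dots[i];
    if (dot !== Math.floor(dot) || dot < 1 || dot > 8) {
      throw new RangeError("dot numbers must be integers from 1 to 8");
    }
    mask |= 1 << (dot - 1); // dot n sets bit n-1
  }
  return String.fromCharCode(0x2800 + mask);
}

console.log(brailleFromDots([1, 4]));             // U+2809, BRAILLE PATTERN DOTS-14
console.log(brailleFromDots([1, 2, 3, 4, 5, 6])); // U+283F, BRAILLE PATTERN DOTS-123456
console.log(brailleFromDots([]));                 // U+2800, BRAILLE PATTERN BLANK
```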

Very interesting. In my app, I use a DB lookup as you described and then use Javascript and the HTML canvas object to dynamically construct the Braille. This has the added benefit that I can create custom ARIA tags if desired. I say this because ASCII braille and Unicode aren't readable formats for several, if not all, screen readers. I know VoiceOver on iOS and Macs won't read it. Something I'm working on is a way to make JS read BRL ASCII fields and Unicode and create ARIA tags so that a blind user actually knows what's going on on the webpage.

Javascript regex compared to Perl regex

I'm just a noob when it comes to regexp. I know Perl is amazing with regexp and I don't know much Perl. Recently started learning JavaScript and came across regex for validating user inputs... haven't used them much.
How does JavaScript regexp compare with Perl regexp? Similarities and differences?
Can all regexp(s) written in JS be used in Perl and vice-versa?
Similar syntax?

From ECMAScript 2018 onwards, many of JavaScript's regex deficiencies have been fixed.
It now supports lookbehind assertions, even unbounded ones.
Unicode property escapes have been added.
There finally is a DOTALL (/s) flag.
What is still missing:
JavaScript doesn't have a way to prevent backtracking by making matches final (using possessive quantifiers ++/*+/?+ or atomic groups (?>...)).
Recursive/balanced subgroup matching is not supported.
One other (cosmetic) thing is that JavaScript doesn't know verbose regexes, which might make them harder to read.
Other than that, the basic regex syntax is very similar in both flavors.
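The three ES2018 additions listed above can be tried out with a few one-liners. This is only a sketch and assumes an engine that implements ES2018; the sample strings are arbitrary.

```javascript
// 1. Lookbehind, including a variable-length one: cents after "$<digits>."
var cents = "$10.53".match(/(?<=\$\d+\.)\d+/);
console.log(cents[0]); // 53

// 2. Unicode property escapes (need the "u" flag): starts with an uppercase letter?
var startsUpper = /^\p{Lu}/u.test("Übung");
console.log(startsUpper); // true

// 3. dotAll flag: with /s the dot also matches line terminators.
var withoutS = /a.b/.test("a\nb");
var withS = /a.b/s.test("a\nb");
console.log(withoutS, withS); // false true
```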

This comparison will answer all your queries.

Another difference: In JavaScript, there is no s modifier: The dot "." will never match a newline character. As a replacement for ".", the character class [\s\S] can be used in JavaScript, which will work like /./s in Perl.
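A short sketch of the [\s\S] replacement, which is mainly of interest on engines that predate the ES2018 s flag. The sample text is made up.

```javascript
var text = "<p>\nHallo\n</p>";

// "." stops at line terminators, so this does not match across lines.
var dotMatches = /<p>(.*)<\/p>/.test(text);
console.log(dotMatches); // false

// [\s\S] matches any character, newlines included.
var inner = text.match(/<p>([\s\S]*)<\/p>/)[1];
console.log(JSON.stringify(inner)); // "\nHallo\n"
```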

I just ran into an instance where the \d, decimal is not recognized in some versions of JavaScript -- you have to use [0-9].
