Find arabic words using regex - javascript

I am using this pattern to find any word in a string:
\b(\w{1,})
but it can't find Arabic words. How can I change the pattern to find both English and Arabic words?
Thanks

In JavaScript, \w is shorthand for [A-Za-z0-9_] and will not match the Arabic Unicode range. To include characters other than A-Z and a-z you need to specify them, for example
[A-Za-z\u0600-\u065F\u066A-\u06EF\u06FA-\u06FF]+
For an explanation of the character codes, see Match Arabic word with regex that ends with “#”?
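As a rough sketch of the idea above, the snippet below pulls both English and Arabic words out of a string. The sample sentence and the names wordPattern, scriptPattern and findWords are made up for illustration, and the second pattern assumes an engine that supports ES2018 Unicode property escapes.

```javascript
// Explicit ranges: ASCII letters plus the Arabic ranges listed in the answer.
const wordPattern = /[A-Za-z\u0600-\u065F\u066A-\u06EF\u06FA-\u06FF]+/g;

// Alternative (assumes ES2018+): let the engine decide what counts as
// Latin or Arabic script. The "u" flag is required for \p{...}.
const scriptPattern = /[\p{Script=Latin}\p{Script=Arabic}]+/gu;

// Hypothetical helper: returns every match, or an empty array if none.
function findWords(text, pattern = wordPattern) {
  return text.match(pattern) || [];
}

// "hello", the Arabic word for "hello", "world"
const sample = 'hello \u0645\u0631\u062D\u0628\u0627 world';

console.log(findWords(sample));                // three words, Arabic one in the middle
console.log(findWords(sample, scriptPattern)); // same three words
```

The explicit-range version works in older engines; the property-escape version is shorter and also covers accented Latin letters.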

If your text only includes English and Arabic, and you want to sort the results, you could use this:
([^\x00-\x7F ]+) for the Arabic text and this: (\w+) for the English text
The first pattern captures runs of characters outside the ASCII range (the space is excluded as well); the second captures English word characters (letters, digits and _).

Like smirnov said, the regex you're using will only find Latin strings. For Arabic you should use [\u0600-\u06ff]|[\u0750-\u077f]|[\ufb50-\ufbc1]|[\ufbd3-\ufd3f]|[\ufd50-\ufd8f]|[\ufd92-\ufdc7]|[\ufe70-\ufefc]|[\uFDF0-\uFDFD] (which should find all Arabic characters, even weird ones like ؁.)
Depending on what you're trying to do, you might want to split the string into a list and process it that way (that's what I usually end up doing when I'm dealing with mixed-language texts). Then you can identify the language of each word and process it accordingly.
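A minimal sketch of that split-and-classify approach, assuming words are separated by whitespace. The helper name classifyWords and the labels 'arabic', 'latin' and 'other' are invented for this example; the Arabic ranges are the ones listed above.

```javascript
// Arabic ranges from the answer, merged into one character class.
const ARABIC_WORD =
  /^[\u0600-\u06FF\u0750-\u077F\uFB50-\uFBC1\uFBD3-\uFD3F\uFD50-\uFD8F\uFD92-\uFDC7\uFE70-\uFEFC\uFDF0-\uFDFD]+$/;
const LATIN_WORD = /^[A-Za-z]+$/;

// Hypothetical helper: split on whitespace, then label each token.
function classifyWords(text) {
  return text
    .split(/\s+/)
    .filter(Boolean) // drop empty tokens from leading/trailing whitespace
    .map((word) => {
      let language = 'other';
      if (ARABIC_WORD.test(word)) language = 'arabic';
      else if (LATIN_WORD.test(word)) language = 'latin';
      return { word, language };
    });
}

console.log(classifyWords('hello \u0645\u0631\u062D\u0628\u0627 42'));
```

Tokens that mix scripts or contain digits or punctuation fall into 'other' here; a real application would probably strip punctuation first.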

Related

How to cover international lowercase word characters in a regex?

How do I cover all lowercase non-numeric word characters in a regex?
For example, this would cover Müller with German umlauts:
/[A-Z][a-zäöü-]+/g
...but what about French or Spanish word characters? Is it possible to get those covered by a range or something like that?
The regex should NOT match a string which has multiple uppercase characters in a row, like DOntGetMe. But it should match McDonald.
So in the end I came up with
/([A-Z][a-zäöü'-](?:[\w'-]+)?)/g
But it still doesn't cover French/Spanish characters.
https://regex101.com/r/ofl4tj/2
Match
Smith
Müller
McDonald
O'Riley
Lee
Li
Manco-Johnson
Don't match
MIsspelled
ABC
Since you are looking for a letter from the English alphabet at the beginning of a word, use a word boundary \b right before [A-Z]. Using Unicode CLDR or Mother E**, the regex that matches all lowercase letters from all languages (\p{Ll}) would be:
(?:[a-z\xB5\xDF-\xF6\xF8-\xFF\u0101\u0103\u0105\u0107\u0109\u010B\u010D\u010F\u0111\u0113\u0115\u0117\u0119\u011B\u011D\u011F\u0121\u0123\u0125\u0127\u0129\u012B\u012D\u012F\u0131\u0133\u0135\u0137\u0138\u013A\u013C\u013E\u0140\u0142\u0144\u0146\u0148\u0149\u014B\u014D\u014F\u0151\u0153\u0155\u0157\u0159\u015B\u015D\u015F\u0161\u0163\u0165\u0167\u0169\u016B\u016D\u016F\u0171\u0173\u0175\u0177\u017A\u017C\u017E-\u0180\u0183\u0185\u0188\u018C\u018D\u0192\u0195\u0199-\u019B\u019E\u01A1\u01A3\u01A5\u01A8\u01AA\u01AB\u01AD\u01B0\u01B4\u01B6\u01B9\u01BA\u01BD-\u01BF\u01C6\u01C9\u01CC\u01CE\u01D0\u01D2\u01D4\u01D6\u01D8\u01DA\u01DC\u01DD\u01DF\u01E1\u01E3\u01E5\u01E7\u01E9\u01EB\u01ED\u01EF\u01F0\u01F3\u01F5\u01F9\u01FB\u01FD\u01FF\u0201\u0203\u0205\u0207\u0209\u020B\u020D\u020F\u0211\u0213\u0215\u0217\u0219\u021B\u021D\u021F\u0221\u0223\u0225\u0227\u0229\u022B\u022D\u022F\u0231\u0233-\u0239\u023C\u023F\u0240\u0242\u0247\u0249\u024B\u024D\u024F-\u0293\u0295-\u02AF\u0371\u0373\u0377\u037B-\u037D\u0390\u03AC-\u03CE\u03D0\u03D1\u03D5-\u03D7\u03D9\u03DB\u03DD\u03DF\u03E1\u03E3\u03E5\u03E7\u03E9\u03EB\u03ED\u03EF-\u03F3\u03F5\u03F8\u03FB\u03FC\u0430-\u045F\u0461\u0463\u0465\u0467\u0469\u046B\u046D\u046F\u0471\u0473\u0475\u0477\u0479\u047B\u047D\u047F\u0481\u048B\u048D\u048F\u0491\u0493\u0495\u0497\u0499\u049B\u049D\u049F\u04A1\u04A3\u04A5\u04A7\u04A9\u04AB\u04AD\u04AF\u04B1\u04B3\u04B5\u04B7\u04B9\u04BB\u04BD\u04BF\u04C2\u04C4\u04C6\u04C8\u04CA\u04CC\u04CE\u04CF\u04D1\u04D3\u04D5\u04D7\u04D9\u04DB\u04DD\u04DF\u04E1\u04E3\u04E5\u04E7\u04E9\u04EB\u04ED\u04EF\u04F1\u04F3\u04F5\u04F7\u04F9\u04FB\u04FD\u04FF\u0501\u0503\u0505\u0507\u0509\u050B\u050D\u050F\u0511\u0513\u0515\u0517\u0519\u051B\u051D\u051F\u0521\u0523\u0525\u0527\u0529\u052B\u052D\u052F\u0560-\u0588\u10D0-\u10FA\u10FD-\u10FF\u13F8-\u13FD\u1C80-\u1C88\u1D00-\u1D2B\u1D6B-\u1D77\u1D79-\u1D9A\u1E01\u1E03\u1E05\u1E07\u1E09\u1E0B\u1E0D\u1E0F\u1E11\u1E13\u1E15\u1E17\u1E19\u1E1B\u1E1D\u1E1F\u1E21\u1E23\u1E25\u1E27\u1E29\u1E2B\u
1E2D\u1E2F\u1E31\u1E33\u1E35\u1E37\u1E39\u1E3B\u1E3D\u1E3F\u1E41\u1E43\u1E45\u1E47\u1E49\u1E4B\u1E4D\u1E4F\u1E51\u1E53\u1E55\u1E57\u1E59\u1E5B\u1E5D\u1E5F\u1E61\u1E63\u1E65\u1E67\u1E69\u1E6B\u1E6D\u1E6F\u1E71\u1E73\u1E75\u1E77\u1E79\u1E7B\u1E7D\u1E7F\u1E81\u1E83\u1E85\u1E87\u1E89\u1E8B\u1E8D\u1E8F\u1E91\u1E93\u1E95-\u1E9D\u1E9F\u1EA1\u1EA3\u1EA5\u1EA7\u1EA9\u1EAB\u1EAD\u1EAF\u1EB1\u1EB3\u1EB5\u1EB7\u1EB9\u1EBB\u1EBD\u1EBF\u1EC1\u1EC3\u1EC5\u1EC7\u1EC9\u1ECB\u1ECD\u1ECF\u1ED1\u1ED3\u1ED5\u1ED7\u1ED9\u1EDB\u1EDD\u1EDF\u1EE1\u1EE3\u1EE5\u1EE7\u1EE9\u1EEB\u1EED\u1EEF\u1EF1\u1EF3\u1EF5\u1EF7\u1EF9\u1EFB\u1EFD\u1EFF-\u1F07\u1F10-\u1F15\u1F20-\u1F27\u1F30-\u1F37\u1F40-\u1F45\u1F50-\u1F57\u1F60-\u1F67\u1F70-\u1F7D\u1F80-\u1F87\u1F90-\u1F97\u1FA0-\u1FA7\u1FB0-\u1FB4\u1FB6\u1FB7\u1FBE\u1FC2-\u1FC4\u1FC6\u1FC7\u1FD0-\u1FD3\u1FD6\u1FD7\u1FE0-\u1FE7\u1FF2-\u1FF4\u1FF6\u1FF7\u210A\u210E\u210F\u2113\u212F\u2134\u2139\u213C\u213D\u2146-\u2149\u214E\u2184\u2C30-\u2C5E\u2C61\u2C65\u2C66\u2C68\u2C6A\u2C6C\u2C71\u2C73\u2C74\u2C76-\u2C7B\u2C81\u2C83\u2C85\u2C87\u2C89\u2C8B\u2C8D\u2C8F\u2C91\u2C93\u2C95\u2C97\u2C99\u2C9B\u2C9D\u2C9F\u2CA1\u2CA3\u2CA5\u2CA7\u2CA9\u2CAB\u2CAD\u2CAF\u2CB1\u2CB3\u2CB5\u2CB7\u2CB9\u2CBB\u2CBD\u2CBF\u2CC1\u2CC3\u2CC5\u2CC7\u2CC9\u2CCB\u2CCD\u2CCF\u2CD1\u2CD3\u2CD5\u2CD7\u2CD9\u2CDB\u2CDD\u2CDF\u2CE1\u2CE3\u2CE4\u2CEC\u2CEE\u2CF3\u2D00-\u2D25\u2D27\u2D2D\uA641\uA643\uA645\uA647\uA649\uA64B\uA64D\uA64F\uA651\uA653\uA655\uA657\uA659\uA65B\uA65D\uA65F\uA661\uA663\uA665\uA667\uA669\uA66B\uA66D\uA681\uA683\uA685\uA687\uA689\uA68B\uA68D\uA68F\uA691\uA693\uA695\uA697\uA699\uA69B\uA723\uA725\uA727\uA729\uA72B\uA72D\uA72F-\uA731\uA733\uA735\uA737\uA739\uA73B\uA73D\uA73F\uA741\uA743\uA745\uA747\uA749\uA74B\uA74D\uA74F\uA751\uA753\uA755\uA757\uA759\uA75B\uA75D\uA75F\uA761\uA763\uA765\uA767\uA769\uA76B\uA76D\uA76F\uA771-\uA778\uA77A\uA77C\uA77F\uA781\uA783\uA785\uA787\uA78C\uA78E\uA791\uA793-\uA795\uA797\uA799\uA79B\uA79D\uA79F\uA7A1\uA7A3\uA7A5\uA7A7\uA7A9\uA7AF\uA7B5\uA7
B7\uA7B9\uA7FA\uAB30-\uAB5A\uAB60-\uAB65\uAB70-\uABBF\uFB00-\uFB06\uFB13-\uFB17\uFF41-\uFF5A]|\uD801[\uDC28-\uDC4F\uDCD8-\uDCFB]|\uD803[\uDCC0-\uDCF2]|\uD806[\uDCC0-\uDCDF]|\uD81B[\uDE60-\uDE7F]|\uD835[\uDC1A-\uDC33\uDC4E-\uDC54\uDC56-\uDC67\uDC82-\uDC9B\uDCB6-\uDCB9\uDCBB\uDCBD-\uDCC3\uDCC5-\uDCCF\uDCEA-\uDD03\uDD1E-\uDD37\uDD52-\uDD6B\uDD86-\uDD9F\uDDBA-\uDDD3\uDDEE-\uDE07\uDE22-\uDE3B\uDE56-\uDE6F\uDE8A-\uDEA5\uDEC2-\uDEDA\uDEDC-\uDEE1\uDEFC-\uDF14\uDF16-\uDF1B\uDF36-\uDF4E\uDF50-\uDF55\uDF70-\uDF88\uDF8A-\uDF8F\uDFAA-\uDFC2\uDFC4-\uDFC9\uDFCB]|\uD83A[\uDD22-\uDD43])
\b is not Unicode-aware in JS, so you can't rely on it next to non-ASCII letters; I'm not sure you need it anyway. If you do, translate \p{L}\p{D} using the above tools and use it in a negative lookahead (?!...), or simply use (?!\S). The latter is more general, but most of the time it satisfies the need.
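On engines with ES2018 Unicode property escapes, the long generated class can be replaced by \p{Lu} and \p{L}. The sketch below is one possible reading of the requirement, not the asker's final pattern: it assumes a name must start with an uppercase letter, may contain letters, apostrophes and hyphens, and must never have two uppercase letters in a row. It validates a whole string instead of searching inside a longer text, and the name namePattern is made up here.

```javascript
// (?!.*\p{Lu}{2})  reject any two adjacent uppercase letters
// \p{Lu}           first character: an uppercase letter from any script
// [\p{L}'\-]+      then letters, apostrophes or hyphens
const namePattern = /^(?!.*\p{Lu}{2})\p{Lu}[\p{L}'\-]+$/u;

const shouldMatch = ['Smith', 'M\u00FCller', 'McDonald', "O'Riley", 'Lee', 'Li', 'Manco-Johnson'];
const shouldFail = ['MIsspelled', 'ABC', 'DOntGetMe'];

console.log(shouldMatch.every((name) => namePattern.test(name))); // true
console.log(shouldFail.some((name) => namePattern.test(name)));   // false
```

Because \p{L} covers every script, this also accepts French and Spanish names such as "Núñez" without listing individual characters.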

Javascript regex character set restriction related query

I am using a JavaScript RegEx which is mentioned below:
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*([-_.])).+$
This accepts only text that contains at least one uppercase letter, one lowercase letter, one digit and one special symbol from .-_
Now assume I supply User-123 as the user id, which conforms to the above RegEx, and then use the on-screen keyboard to type a character from the Finnish language, which results in User-123Ã.
Since the RegEx is still satisfied, the text is accepted by my JavaScript code, but I want it to accept only alphanumeric input in English, and nothing else.
How should I enhance this RegEx to do so?
The string "User-123Ã" contains the Unicode character "Ã", which is not part of the English alphabet, so your JS code needs a way to identify it:
Code    Glyph  Decimal  HTML      Description                          #
U+00C3  Ã      &#195;   &Atilde;  Latin capital letter A with tilde    0131
Try this link also,
How to find whether a particular string has unicode characters
I am not sure this will solve the issue, but in most cases when you want to restrict the input itself to some characters, your consuming pattern should only match those characters you allow. The lookahead restrictions just require or forbid certain characters to appear certain number of times at certain positions, but what you match in the consuming part is crucial.
.+$ allows any character at all. Replace it with [\w.-]+$ (where \w = [a-zA-Z0-9_]) to restrict the input to the characters you require in the lookaheads.
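Put together, the suggestion might look like the sketch below. The variable name userIdPattern is made up for this example; the lookaheads are the ones from the question, and only the consuming part has changed.

```javascript
// Lookaheads: at least one lowercase letter, one uppercase letter,
// one digit and one of - _ .
// Consuming part: only ASCII word characters, dots and hyphens.
const userIdPattern = /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[-_.])[\w.-]+$/;

console.log(userIdPattern.test('User-123'));       // true
console.log(userIdPattern.test('User-123\u00C3')); // false: "Ã" is not in [\w.-]
console.log(userIdPattern.test('user-123'));       // false: no uppercase letter
```

Without the u and i flags, \w and \d in JavaScript only match ASCII characters, which is what makes this restriction work.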

JavaScript Regex for capitalized letters with accents

In JavaScript, it's easy to match letters and accents with this regex:
text.match(/[a-z\u00E0-\u00FC]+/i);
And only the lowercase letters and accents without the i option:
text.match(/[a-z\u00E0-\u00FC]+/);
But what is the correct regular expression to match only capitalized letters and accents?
EDIT: as the answers below already mention, the regex above also matches some other signs, and misses some accented characters like ý and Ý, ć and Ć and many others.
The range U+00C0 - U+00DC should be the uppercase equivalent for U+00E0 - U+00FC
So this text.match(/[A-Z\u00C0-\u00DC]+/); should be what you are looking for.
A site like graphemica can help you to determine the ranges you need yourself.
EDIT: as the other answers already mention, this also matches some other signs.
Replace a-z with A-Z and \u00E0-\u00FC with \u00C0-\u00DC to match the same letters in uppercase as text.match(/[a-z\u00E0-\u00FC]+/); matches in lowercase.
However!
This is not a proper implementation, neither for lowercase nor for uppercase letters, as, for example, your lowercase match includes ÷ (division sign), which is not at all a letter, and my uppercase string will match × (multiplication sign), which looks like an X, but isn't actually a letter either.
In addition to that, you're missing characters like ý and Ý, ć and Ć and many, many others.
Your first regex doesn't actually match letters and accents: it only matches letters and a specific subset of accented letters, namely the ones between unicode codepoints \u00E0 and \u00FC. This range does not include any capital letters, while it does include e.g. the ÷ sign and some letters not generally regarded as 'accented'.
Depending on what you actually need, this may not be what you want. If you really want to match all capital letters, plain and with the same accents, you need the regex /[A-Z\u00C0-\u00DC]+/, but please use e.g. http://unicode-table.com/en/#basic-latin to check whether it suits your needs.
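The pitfalls mentioned in these answers are easy to reproduce. The sketch below first shows the two signs slipping through, then a tighter pair of classes that skips U+00D7 (×) and U+00F7 (÷) within the Latin-1 block. The variable names are made up for illustration, and the tighter classes still only cover Latin-1, so letters such as ć and Ć remain unmatched.

```javascript
const naiveLower = /^[a-z\u00E0-\u00FC]+$/;
const naiveUpper = /^[A-Z\u00C0-\u00DC]+$/;

console.log(naiveLower.test('\u00F7')); // true: the division sign is accepted
console.log(naiveUpper.test('\u00D7')); // true: the multiplication sign is accepted
console.log(naiveUpper.test('\u00DD')); // false: Ý falls outside U+00C0-U+00DC

// Latin-1 letters only, with the two signs excluded.
const latin1Upper = /^[A-Z\u00C0-\u00D6\u00D8-\u00DE]+$/;
const latin1Lower = /^[a-z\u00DF-\u00F6\u00F8-\u00FF]+$/;

console.log(latin1Upper.test('\u00D7')); // false
console.log(latin1Upper.test('\u00DD')); // true: Ý
console.log(latin1Lower.test('\u00FD')); // true: ý
console.log(latin1Upper.test('\u0106')); // false: Ć is outside Latin-1
```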
To match all capitalized letters, accented or not, you may use the following unicode regex /\p{Lu}+/u. For example, in node repl:
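A minimal sketch of such a session, written as a script; the input string is made up for illustration and uses escapes so the code points are unambiguous.

```javascript
// "Hello Δelta ÉCOLE": U+0394 is Greek capital delta, U+00C9 is É.
const capsSample = 'Hello \u0394elta \u00C9COLE';

// \p{Lu} = any uppercase letter; the "u" flag is required for \p{...}.
const capitals = capsSample.match(/\p{Lu}+/gu);

console.log(capitals); // [ 'H', 'Δ', 'ÉCOLE' ]
```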
Note that this will match non-Latin letters as well, like the capital Greek delta Δ in the example.

Regex: any character that is NOT a letter (but not only English letters)

I want to delete from string all characters that are not letters.
I know that there is something like \W in regex, but it considers non-English characters as not letters. For example my script deletes all Polish letters (like "ą", "ć", "ó"), but I need them.
How to tell regex to do this?
Code:
var text = text.replace(/\W/g, ' ');
You can either use Steve Levithan's XRegExp library (with Unicode plugins), or you have to define the Unicode character range manually, since JavaScript doesn't support Unicode properties.
[^\u0041-\u005A\u0061-\u007A\u00AA\u00B5\u00BA\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02C1\u02C6-\u02D1\u02E0-\u02E4\u02EC\u02EE\u0345\u0370-\u0374\u0376\u0377\u037A-\u037D\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03F5\u03F7-\u0481\u048A-\u0527\u0531-\u0556\u0559\u0561-\u0587\u05B0-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7\u05D0-\u05EA\u05F0-\u05F2\u0610-\u061A\u0620-\u0657\u0659-\u065F\u066E-\u06D3\u06D5-\u06DC\u06E1-\u06E8\u06ED-\u06EF\u06FA-\u06FC\u06FF\u0710-\u073F\u074D-\u07B1\u07CA-\u07EA\u07F4\u07F5\u07FA\u0800-\u0817\u081A-\u082C\u0840-\u0858\u08A0\u08A2-\u08AC\u08E4-\u08E9\u08F0-\u08FE\u0900-\u093B\u093D-\u094C\u094E-\u0950\u0955-\u0963\u0971-\u0977\u0979-\u097F\u0981-\u0983\u0985-\u098C\u098F\u0990\u0993-\u09A8\u09AA-\u09B0\u09B2\u09B6-\u09B9\u09BD-\u09C4\u09C7\u09C8\u09CB\u09CC\u09CE\u09D7\u09DC\u09DD\u09DF-\u09E3\u09F0\u09F1\u0A01-\u0A03\u0A05-\u0A0A\u0A0F\u0A10\u0A13-\u0A28\u0A2A-\u0A30\u0A32\u0A33\u0A35\u0A36\u0A38\u0A39\u0A3E-\u0A42\u0A47\u0A48\u0A4B\u0A4C\u0A51\u0A59-\u0A5C\u0A5E\u0A70-\u0A75\u0A81-\u0A83\u0A85-\u0A8D\u0A8F-\u0A91\u0A93-\u0AA8\u0AAA-\u0AB0\u0AB2\u0AB3\u0AB5-\u0AB9\u0ABD-\u0AC5\u0AC7-\u0AC9\u0ACB\u0ACC\u0AD0\u0AE0-\u0AE3\u0B01-\u0B03\u0B05-\u0B0C\u0B0F\u0B10\u0B13-\u0B28\u0B2A-\u0B30\u0B32\u0B33\u0B35-\u0B39\u0B3D-\u0B44\u0B47\u0B48\u0B4B\u0B4C\u0B56\u0B57\u0B5C\u0B5D\u0B5F-\u0B63\u0B71\u0B82\u0B83\u0B85-\u0B8A\u0B8E-\u0B90\u0B92-\u0B95\u0B99\u0B9A\u0B9C\u0B9E\u0B9F\u0BA3\u0BA4\u0BA8-\u0BAA\u0BAE-\u0BB9\u0BBE-\u0BC2\u0BC6-\u0BC8\u0BCA-\u0BCC\u0BD0\u0BD7\u0C01-\u0C03\u0C05-\u0C0C\u0C0E-\u0C10\u0C12-\u0C28\u0C2A-\u0C33\u0C35-\u0C39\u0C3D-\u0C44\u0C46-\u0C48\u0C4A-\u0C4C\u0C55\u0C56\u0C58\u0C59\u0C60-\u0C63\u0C82\u0C83\u0C85-\u0C8C\u0C8E-\u0C90\u0C92-\u0CA8\u0CAA-\u0CB3\u0CB5-\u0CB9\u0CBD-\u0CC4\u0CC6-\u0CC8\u0CCA-\u0CCC\u0CD5\u0CD6\u0CDE\u0CE0-\u0CE3\u0CF1\u0CF2\u0D02\u0D03\u0D05-\u0D0C\u0D0E-\u0D10\u0D12-\u0D3A\u0D3D-\u0D44\u0D46-\u0D48\u0D4A-\u0D4C\u0D4E\u0D57\u0D60-\u0D63\u0D7A-\u0D7F\u0D82\u0D83\u0D85-\u0D96\u0D9A-\u0DB1\u0DB3-\u0D
BB\u0DBD\u0DC0-\u0DC6\u0DCF-\u0DD4\u0DD6\u0DD8-\u0DDF\u0DF2\u0DF3\u0E01-\u0E3A\u0E40-\u0E46\u0E4D\u0E81\u0E82\u0E84\u0E87\u0E88\u0E8A\u0E8D\u0E94-\u0E97\u0E99-\u0E9F\u0EA1-\u0EA3\u0EA5\u0EA7\u0EAA\u0EAB\u0EAD-\u0EB9\u0EBB-\u0EBD\u0EC0-\u0EC4\u0EC6\u0ECD\u0EDC-\u0EDF\u0F00\u0F40-\u0F47\u0F49-\u0F6C\u0F71-\u0F81\u0F88-\u0F97\u0F99-\u0FBC\u1000-\u1036\u1038\u103B-\u103F\u1050-\u1062\u1065-\u1068\u106E-\u1086\u108E\u109C\u109D\u10A0-\u10C5\u10C7\u10CD\u10D0-\u10FA\u10FC-\u1248\u124A-\u124D\u1250-\u1256\u1258\u125A-\u125D\u1260-\u1288\u128A-\u128D\u1290-\u12B0\u12B2-\u12B5\u12B8-\u12BE\u12C0\u12C2-\u12C5\u12C8-\u12D6\u12D8-\u1310\u1312-\u1315\u1318-\u135A\u135F\u1380-\u138F\u13A0-\u13F4\u1401-\u166C\u166F-\u167F\u1681-\u169A\u16A0-\u16EA\u16EE-\u16F0\u1700-\u170C\u170E-\u1713\u1720-\u1733\u1740-\u1753\u1760-\u176C\u176E-\u1770\u1772\u1773\u1780-\u17B3\u17B6-\u17C8\u17D7\u17DC\u1820-\u1877\u1880-\u18AA\u18B0-\u18F5\u1900-\u191C\u1920-\u192B\u1930-\u1938\u1950-\u196D\u1970-\u1974\u1980-\u19AB\u19B0-\u19C9\u1A00-\u1A1B\u1A20-\u1A5E\u1A61-\u1A74\u1AA7\u1B00-\u1B33\u1B35-\u1B43\u1B45-\u1B4B\u1B80-\u1BA9\u1BAC-\u1BAF\u1BBA-\u1BE5\u1BE7-\u1BF1\u1C00-\u1C35\u1C4D-\u1C4F\u1C5A-\u1C7D\u1CE9-\u1CEC\u1CEE-\u1CF3\u1CF5\u1CF6\u1D00-\u1DBF\u1E00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FBC\u1FBE\u1FC2-\u1FC4\u1FC6-\u1FCC\u1FD0-\u1FD3\u1FD6-\u1FDB\u1FE0-\u1FEC\u1FF2-\u1FF4\u1FF6-\u1FFC\u2071\u207F\u2090-\u209C\u2102\u2107\u210A-\u2113\u2115\u2119-\u211D\u2124\u2126\u2128\u212A-\u212D\u212F-\u2139\u213C-\u213F\u2145-\u2149\u214E\u2160-\u2188\u24B6-\u24E9\u2C00-\u2C2E\u2C30-\u2C5E\u2C60-\u2CE4\u2CEB-\u2CEE\u2CF2\u2CF3\u2D00-\u2D25\u2D27\u2D2D\u2D30-\u2D67\u2D6F\u2D80-\u2D96\u2DA0-\u2DA6\u2DA8-\u2DAE\u2DB0-\u2DB6\u2DB8-\u2DBE\u2DC0-\u2DC6\u2DC8-\u2DCE\u2DD0-\u2DD6\u2DD8-\u2DDE\u2DE0-\u2DFF\u2E2F\u3005-\u3007\u3021-\u3029\u3031-\u3035\u3038-\u303C\u3041-\u3096\u309D-\u309F\u30A1-\u30FA\u30FC-\u30FF\u3105-\u312D\u3131-\u3
18E\u31A0-\u31BA\u31F0-\u31FF\u3400-\u4DB5\u4E00-\u9FCC\uA000-\uA48C\uA4D0-\uA4FD\uA500-\uA60C\uA610-\uA61F\uA62A\uA62B\uA640-\uA66E\uA674-\uA67B\uA67F-\uA697\uA69F-\uA6EF\uA717-\uA71F\uA722-\uA788\uA78B-\uA78E\uA790-\uA793\uA7A0-\uA7AA\uA7F8-\uA801\uA803-\uA805\uA807-\uA80A\uA80C-\uA827\uA840-\uA873\uA880-\uA8C3\uA8F2-\uA8F7\uA8FB\uA90A-\uA92A\uA930-\uA952\uA960-\uA97C\uA980-\uA9B2\uA9B4-\uA9BF\uA9CF\uAA00-\uAA36\uAA40-\uAA4D\uAA60-\uAA76\uAA7A\uAA80-\uAABE\uAAC0\uAAC2\uAADB-\uAADD\uAAE0-\uAAEF\uAAF2-\uAAF5\uAB01-\uAB06\uAB09-\uAB0E\uAB11-\uAB16\uAB20-\uAB26\uAB28-\uAB2E\uABC0-\uABEA\uAC00-\uD7A3\uD7B0-\uD7C6\uD7CB-\uD7FB\uF900-\uFA6D\uFA70-\uFAD9\uFB00-\uFB06\uFB13-\uFB17\uFB1D-\uFB28\uFB2A-\uFB36\uFB38-\uFB3C\uFB3E\uFB40\uFB41\uFB43\uFB44\uFB46-\uFBB1\uFBD3-\uFD3D\uFD50-\uFD8F\uFD92-\uFDC7\uFDF0-\uFDFB\uFE70-\uFE74\uFE76-\uFEFC\uFF21-\uFF3A\uFF41-\uFF5A\uFF66-\uFFBE\uFFC2-\uFFC7\uFFCA-\uFFCF\uFFD2-\uFFD7\uFFDA-\uFFDC]
matches a character that isn't a Unicode letter.
It depends on what engine you are working with. It also depends on how your Unicode characters are encoded — are they encoded as a single character, or as a character+mark combination?
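You can try the following: \p{L} for letters encoded as a single character, and \P{M}\p{M}*+ for character+mark combinations. (Possessive quantifiers such as *+ are not available in JavaScript; there, use \P{M}\p{M}* together with the u flag.)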
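In an engine with ES2018 Unicode property escapes, the original task (replace everything that is not a letter) can be sketched as below. The helper name keepLetters is made up, and keeping \p{M} alongside \p{L} is an assumption made so that letters stored as base character plus combining mark survive too.

```javascript
// Replace each run of non-letter, non-mark characters with one space.
function keepLetters(text) {
  return text.replace(/[^\p{L}\p{M}]+/gu, ' ').trim();
}

// "zażółć gęślą 123!" written with escapes for the Polish letters.
const polish = 'za\u017C\u00F3\u0142\u0107 g\u0119\u015Bl\u0105 123!';

console.log(keepLetters(polish)); // zażółć gęślą
```

Unlike \W, which only treats [A-Za-z0-9_] as word characters, this keeps ą, ć, ó and similar letters intact.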
So, finally I decided to write my own regex, because it seems there isn't any fast and simple way to do this in JavaScript.
I listed all the unnecessary characters that came to mind, that could appear on a typical website and aren't needed to understand a single word (I left the ' character in because it is quite important in English ;) ). If you want, you can edit my answer and add your own.
[:;.,\?!-()~\/"|®##$%^&*+-]
JS (note the g flag, without which only the first match is replaced):
text = text.replace(/[:;\.,\?!\-\(\)~\\\/"|®##$%^&*+-]/g, "");

RegEx with extended latin alphabet (ä ö ü è ß)

I want to do some basic String testing in Node.js. Assume I have a form where users enter their name and I wanna check if it's just rubbish or a real name.
Happily (or sadly for my check) I get users from all around the world, which means that their names contain non-English characters, like ä ö ü ß é. I used to use /[A-Za-z -]{2,}/ but this doesn't match names like "Jan Buschtöns".
Do I have to manually add every possible non-English but Latin character to my RegEx for it to work? I don't want a 100+ character long RegEx like /[A-Za-z -äöüÄÖÜßéÉèÈêÊ...]{2,}/.
Check http://www.regular-expressions.info/unicode.html and http://xregexp.com/plugins/
You would need to use \p{L} to match any letter character if you want to include Unicode.
In Unicode terms, the alternative to \w is then [\p{L}\p{N}_].
Update: As of ES2018, JavaScript supports Unicode property escapes such as \p{L}, which matches anything that Unicode considers to be a letter. All modern browsers support this feature, so that's probably the way to go as long as you don't care about ancient browsers.
Old answer for pre-ES2018 browsers:
The answer depends on exactly what you want to do.
As you have noticed, [A-Za-z] only matches Latin letters without diacritics.
If you only care about German diacritics and the ß ligature, then you can just replace that part with [A-Za-zÄÖÜäöüß], e.g.:
/[A-Za-zÄÖÜäöüß -]{2,}/
But that probably isn’t what you want to do. You probably want to match Latin letters with any diacritics, not just those used in German. Or perhaps you want to match any letters from any alphabet, not just Latin.
Other regular expression dialects have character classes to help you with problems like this, but unfortunately JavaScript’s regular expression dialect has very few character classes and none of them help you here.
(In case you don’t know, a “character class” is an expression that matches any character that is a member of a predefined group of characters. For example, \w is a character class that matches any ASCII letter, or digit, or an underscore, and . is a character class that matches any character.)
This means that you have to list out every range of UTF-16 code units that corresponds to a character that you want to match.
A quick and dirty solution might be to say [a-zA-Z\u0080-\uFFFF], or in full:
/[A-Za-z\u0080-\uFFFF -]{2,}/
This will match any letter in the ASCII range, but will also match any character at all that is outside the ASCII range. This includes all possible alphabetic characters with or without diacritics in any script. However, it also includes a lot of characters that are not letters. Non-letters in the ASCII range are excluded, but non-letters outside the ASCII range are included.
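To make the trade-off concrete, here is a small sketch of the quick-and-dirty range at work. The pattern is anchored with ^ and $ (an addition for this example) so that it validates the whole input, and the variable name quickAndDirty is made up.

```javascript
const quickAndDirty = /^[A-Za-z\u0080-\uFFFF -]{2,}$/;

console.log(quickAndDirty.test('Jan Buscht\u00F6ns'));   // true: the name from the question
console.log(quickAndDirty.test('\u20AC\u00D7\u2603'));   // true: euro sign, times sign, snowman
console.log(quickAndDirty.test('R2D2'));                 // false: ASCII digits are excluded
```

The second line shows the weakness described above: non-letters outside the ASCII range pass the check.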
The above might be good enough for your purposes, but if it isn’t then you will have to figure out which character ranges you need and specify those explicitly.
If you want just Latin letters, including those with less common diacritics like åēį, but excluding e.g. Chinese, Devanagari, and Cyrillic characters, you can use \p{Script=Latin} with the u flag. This feature is called Unicode property escapes, and was introduced in ES2018.
For example, /\p{Script=Latin}+/u will match a word that only contains Latin characters.
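Applied to the name check from the question, this might look like the sketch below. The names latinName and looksLikeLatinName are made up, and normalizing to NFC first is an assumption: combining marks on their own are not classified as Latin script, so input typed as base letter plus combining mark would otherwise fail.

```javascript
// Latin-script letters, spaces and hyphens; at least two characters.
const latinName = /^[\p{Script=Latin} \-]{2,}$/u;

// Hypothetical helper: compose base letter + mark pairs before testing.
function looksLikeLatinName(input) {
  return latinName.test(input.normalize('NFC'));
}

console.log(looksLikeLatinName('Jan Buscht\u00F6ns'));          // true
console.log(looksLikeLatinName('Jan Buschto\u0308ns'));         // true after normalization
console.log(looksLikeLatinName('\u0418\u0432\u0430\u043D'));    // false: Cyrillic
console.log(looksLikeLatinName('X'));                           // false: too short
```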
This is a JavaScript/Node.js question, but I barely see any actual JavaScript code which shows how to do it. It's a bit trickier, because it requires the Unicode "u" flag:
// Result: '_ {_} [_]'
'ulike {adj} [ubøyelig]'.replace(/\p{L}+/gu, '_')
I know this question is old, but I'm working on some NLP software and I needed to match all words in most Latin-script languages, and I did this with the following piece of code.
let myString = "Whatever you want here, ex. Bân-lâm-gú or bokmål or Português or Română or Slovenčina or Slovenščina";
let wordchar = "-A-Za-zᴀⱯɐᵄⱭɑᵅꬰꭤⱰɒᶛʙᴃᴯꞖꞗꞴꞵᴄↃↄꞳꭓꭕꭔÐðᶞꟇꟈꝹꝺᴅᴆꝱẟᴇꬲꬳꬴƎᴲǝⱻƏəₔᵊƐɛᵋEɘꞫɜᶟɞʚᴈᵌɤꝻꝼꜰℲⅎꟻꬵꝽᵹꞬɡᶢꬶɢᵷ⅁ꝾꝿƔɣˠƢƣʜǶƕⱵⱶꟵꟶꜦꜧꭜıꞮɪᶦꟾꟷᴉᵎᵻᶧƖɩᶥᴊKᴋꞰʞʟᶫꝆꝇᴌꬸꬹꬷꭝꝲꞀꞁ⅃ᴍꬺꟽꟿꝳɴᶰᴎᴻꬻꝴŊŋᵑꬼᴏᴑꬽꬾƆɔᵓᴐꬿᴒᴖᵔᴗᵕꞶꞷɷȢȣᴕᴽᴘꟼɸᶲⱷĸꞯꞂꞃƦʀꝚꝛᴙꭆɹʴᴚʁʶꭉꭇꭈꭊꭋꭌꭅꝵꝶꝜꝝſꟉꟊꞄꞅƧƨꜱƩʃᶴꭍƪʅꞆꞇᴛꝷꞱʇᴜᶸᴝᵙᴞꭒꭟꭎꭏꞍɥᶣƜɯꟺᵚᴟƱʊᶷᴠỼỽɅʌᶺᴡꟂꟃʍꭩꭖꭗꭘꭙꭙ̆ʏꭚʎ⅄ƍᴢꝢꝣƷʒᶾᴣƸƹȜȝÞþǷƿꝨꝩꝪꝫꝬꝭꝮꝯꝰꝸꜪꜫꜬꜭꜮꜯƼƽƄƅɁɂʔꜢꜣꞋꞌꞏʕˤᴤᴥᵜꜤꜥʖǀǁǃǂʗʘʬʭꞚꞛꞜꞝꞞꞟẚÀàÁáÂâẦầẤấẪẫẨẩÃãÃ̀ã̀Ã́ã́Ã̂ã̂Ã̌ã̌Ã̍ã̍Ã̎ã̎ĀāĀ̀ā̀Ā́ā́Ā̂ā̂Ā̃ā̃Ā̃́ā̃́Ā̄ā̄Ā̆ā̆Ā̆́ā̆́Ā̈ā̈Ā̊ā̊Ā̌ā̌ĂăẰằẮắẴẵẲẳȦȧȦ́ȧ́ǠǡÄäÄ́ä́Ä̀ä̀Ä̂ä̂Ä̃ä̃ǞǟǞ̆ǟ̆Ä̆ä̆Ä̌ä̌ẢảÅåÅǺǻÅ̂å̂Å̃å̃Å̄å̄Å̄̆å̄̆Å̆å̆A̋a̋ǍǎA̍a̍A̎a̎ȀȁȂȃA̐a̐A̓a̓A̧a̧À̧à̧Á̧á̧Â̧â̧Ǎ̧ǎ̧A̭a̭A̰a̰À̰à̰Á̰á̰Ā̰ā̰Ä̰ä̰Ä̰́ä̰́ĄąĄ̀ą̀Ą́ą́Ą̂ą̂Ą̃ą̃Ą̄ą̄Ą̄̀ą̄̀Ą̄́ą̄́Ą̄̂ą̄̂Ą̄̌ą̄̌Ą̇ą̇Ą̈ą̈Ą̈̀ą̈̀Ą̈́ą̈́Ą̈̂ą̈̂Ą̈̌ą̈̌Ą̈̄ą̈̄Ą̊ą̊Ą̌ą̌Ą̋ą̋Ą̱ą̱Ą̱̀ą̱̀Ą̱́ą̱́A᷎a᷎A̱a̱À̱à̱Á̱á̱Â̱â̱Ã̱ã̱Ā̱ā̱Ā̱̀ā̱̀Ā̱́ā̱́Ā̱̂ā̱̂Ä̱ä̱Ä̱̀ä̱̀Ä̱́ä̱́Ä̱̂ä̱̂Ä̱̌ä̱̌Å̱å̱Ǎ̱ǎ̱A̱̥a̱̥ẠạẠ́ạ́Ạ̀ạ̀ẬậẠ̃ạ̃Ạ̄ạ̄ẶặẠ̈ạ̈Ạ̈̀ạ̈̀Ạ̈́ạ̈́Ạ̈̂ạ̈̂Ạ̈̌ạ̈̌Ạ̌ạ̌Ạ̍ạ̍A̤a̤À̤à̤Á̤á̤Â̤â̤Ä̤ä̤ḀḁḀ̂ḁ̂Ḁ̈ḁ̈A̯a̯A̩a̩À̩à̩Á̩á̩Â̩â̩Ã̩ã̩Ā̩ā̩Ǎ̩ǎ̩A̩̍a̩̍A̩̓a̩̓A͔a͔Ā͔ā͔ȺⱥȺ̀ⱥ̀Ⱥ́ⱥ́ᶏꞺꞻⱭ̀ɑ̀Ɑ́ɑ́Ɑ̂ɑ̂Ɑ̃ɑ̃Ɑ̄ɑ̄Ɑ̆ɑ̆Ɑ̇ɑ̇Ɑ̈ɑ̈Ɑ̊ɑ̊Ɑ̌ɑ̌ᶐB̀b̀B́b́B̂b̂B̃b̃B̄b̄ḂḃB̈b̈B̒b̒B̕b̕ḆḇḆ̂ḇ̂ḄḅB̤b̤B̥b̥B̬b̬ɃƀᵬᶀƁɓƂƃʙ̇ʙ̣C̀c̀ĆćĈĉC̃c̃C̄c̄C̄́c̄́C̆c̆ĊċC̈c̈ČčČ́č́Č͑č͑Č̓č̓Č̕č̕Č̔č̔C̋c̋C̓c̓C̕c̕C̔c̔C͑c͑ÇçḈḉÇ̆ç̆Ç̇ç̇Ç̌ç̌ꞔꟄC̦c̦C̭c̭C̱c̱C̮c̮C̣c̣Ć̣ć̣Č̣č̣C̥c̥C̬c̬C̯c̯C̨c̨ȻȼȻ̓ȼ̓ꞒꞓƇƈɕᶝꜾꜿD́d́D̂d̂D̃d̃D̄d̄ḊḋD̊d̊ĎďD̑d̑D̓d̓D̕d̕ḐḑD̦d̦ḒḓḎḏD̮d̮ḌḍḌ́ḍ́Ḍ̄ḍ̄D̤d̤D̥d̥D̬d̬D̪d̪ĐđĐ̣đ̣Đ̱đ̱ᵭᶁƉɖƊɗᶑƋƌȡꝹ́ꝺ́Ꝺ̇ꝺ̇ᴅ̇ᴅ̣Ð́ð́Ð̣ð̣ÈèÉéÊêỀềẾếỄễÊ̄ê̄Ê̆ê̆Ê̌ê̌ỂểẼẽẼ̀ẽ̀Ẽ́ẽ́Ẽ̂ẽ̂Ẽ̌ẽ̌Ẽ̍ẽ̍Ẽ̎ẽ̎ĒēḔḕḖḗĒ̂ē̂Ē̃ē̃Ē̃́ē̃́Ē̄ē̄Ē̆ē̆Ē̆́ē̆́Ē̌ē̌Ē̑ē̑ĔĕĔ̀ĕ̀Ĕ́ĕ́Ĕ̄ĕ̄ĖėĖ́ė́Ė̃ė̃Ė̄ė̄ËëË̀ë̀Ë́ë́Ë̂ë̂Ë̃ë̃Ë̄ë̄Ë̌ë̌ẺẻE̊e̊E̊̄e̊̄E̋e̋ĚěĚ́ě́Ě̃ě̃Ě̋ě̋Ě̑ě̑E̍e̍E̎e̎ȄȅȆȇE̓e̓E᷎e᷎ȨȩȨ̀ȩ̀Ȩ́ȩ́Ȩ̂ȩ̂ḜḝȨ̌ȩ̌Ẽ̦ẽ̦ĘęĘ̀ę̀Ę́ę́Ę̂ę̂Ę̃ę̃Ę̃́ę̃́Ę̄ę̄Ę̄̀ę̄̀Ę̄́ę̄́Ę̄̂ę̄̂Ę̄̃ę̄̃Ę̄̌ę̄̌Ę̆ę̆Ę̇ę̇Ę̇́ę̇́Ę̈ę̈Ę̈̀ę̈̀Ę̈́ę̈́Ę̈̂ę̈̂Ę̈̌ę̈̌Ę̈̄ę̈̄Ę̋ę̋Ę̌ę̌Ę̑ę̑Ę̱ę̱Ę̱̀ę̱̀Ę̱́ę̱́Ę̣ę̣Ę᷎ę᷎ḘḙḚḛE̱e̱È̱è̱É̱é̱Ê̱ê̱Ẽ̱ẽ̱Ē̱ē̱Ḕ̱ḕ̱Ḗ̱ḗ̱Ē̱̂ē̱̂Ë̱ë̱Ë̱̀ë̱̀Ë̱́ë̱́Ë̱̂ë̱̂Ë̱̌ë̱̌Ě̱ě̱E̮e̮Ē̮ē̮ẸẹẸ̀ẹ̀Ẹ́ẹ́ỆệẸ̃ẹ̃Ẹ̄ẹ̄Ẹ̄̀ẹ̄̀Ẹ̄́ẹ̄́Ẹ̄̃ẹ̄̃Ẹ̆ẹ̆Ẹ̆̀ẹ̆̀Ẹ̆́ẹ̆́Ẹ̈ẹ̈Ẹ̈̀ẹ̈̀Ẹ̈́ẹ̈́Ẹ̈̂ẹ̈̂Ẹ̈̌ẹ̈̌Ẹ̍ẹ̍Ẹ̌ẹ̌Ẹ̑ẹ̑E̤e̤È̤è̤É̤é̤Ê̤ê̤Ë̤ë̤E̥e̥E̯e̯E̩e̩È̩è̩É̩é̩Ê̩ê̩Ẽ̩ẽ̩Ē̩ē̩Ě̩ě̩E̩̍e̩̍E̩̓e̩̓È͕è͕Ê͕ê͕Ẽ͕ẽ͕Ē͕ē͕Ḕ͕ḕ͕E̜e̜E̹e̹È̹è̹É̹é̹Ê̹ê̹Ẽ̹ẽ̹Ē̹ē̹Ḕ̹ḕ̹ɆɇᶒⱸᶕᶓɚᶔɝƐ̀ɛ̀Ɛ́ɛ́Ɛ̂ɛ̂
Ɛ̃ɛ̃Ɛ̃̀ɛ̃̀Ɛ̃́ɛ̃́Ɛ̃̂ɛ̃̂Ɛ̃̌ɛ̃̌Ɛ̃̍ɛ̃̍Ɛ̃̎ɛ̃̎Ɛ̄ɛ̄Ɛ̆ɛ̆Ɛ̇ɛ̇Ɛ̈ɛ̈Ɛ̈̀ɛ̈̀Ɛ̈́ɛ̈́Ɛ̈̂ɛ̈̂Ɛ̈̌ɛ̈̌Ɛ̌ɛ̌Ɛ̍ɛ̍Ɛ̎ɛ̎Ɛ̣ɛ̣Ɛ̣̀ɛ̣̀Ɛ̣́ɛ̣́Ɛ̣̂ɛ̣̂Ɛ̣̃ɛ̣̃Ɛ̣̈ɛ̣̈Ɛ̣̈̀ɛ̣̈̀Ɛ̣̈́ɛ̣̈́Ɛ̣̈̂ɛ̣̈̂Ɛ̣̈̌ɛ̣̈̌Ɛ̣̌ɛ̣̌Ɛ̤ɛ̤Ɛ̤̀ɛ̤̀Ɛ̤́ɛ̤́Ɛ̤̂ɛ̤̂Ɛ̤̈ɛ̤̈Ɛ̧ɛ̧Ɛ̧̀ɛ̧̀Ɛ̧́ɛ̧́Ɛ̧̂ɛ̧̂Ɛ̧̌ɛ̧̌Ɛ̨ɛ̨Ɛ̨̀ɛ̨̀Ɛ̨́ɛ̨́Ɛ̨̂ɛ̨̂Ɛ̨̄ɛ̨̄Ɛ̨̆ɛ̨̆Ɛ̨̈ɛ̨̈Ɛ̨̌ɛ̨̌Ɛ̰ɛ̰Ɛ̰̀ɛ̰̀Ɛ̰́ɛ̰́Ɛ̰̄ɛ̰̄Ɛ̱ɛ̱Ɛ̱̀ɛ̱̀Ɛ̱́ɛ̱́Ɛ̱̂ɛ̱̂Ɛ̱̃ɛ̱̃Ɛ̱̈ɛ̱̈Ɛ̱̈̀ɛ̱̈̀Ɛ̱̈́ɛ̱̈́Ɛ̱̌ɛ̱̌Ə̀ə̀Ə́ə́Ə̂ə̂Ə̄ə̄Ə̌ə̌Ə̏ə̏F̀f̀F́f́F̃f̃F̄f̄ḞḟF̓f̓F̧f̧ᵮᶂƑƒꞘꞙF̱f̱F̣f̣ꜰ̇Ꝼ́ꝼ́Ꝼ̇ꝼ̇Ꝼ̣ꝼ̣G̀g̀ǴǵǴ̄ǵ̄ĜĝG̃g̃G̃́g̃́ḠḡḠ́ḡ́ĞğĠġG̈g̈G̈̇g̈̇G̊g̊G̋g̋ǦǧǦ̈ǧ̈G̑g̑G̒g̒G̓g̓G̕g̕G̔g̔ĢģG̦g̦G̱g̱G̱̓g̱̓G̮g̮G̣g̣G̤g̤G̥g̥G̫g̫ꞠꞡǤǥᶃƓɠɢ̇ɢ̣ʛƔ̓ɣ̓H̀h̀H́h́ĤĥH̄h̄ḢḣḦḧȞȟH̐h̐H̓h̓H̕h̕ḨḩH̨h̨H̭h̭H̱ẖḪḫḤḥḤ̣ḥ̣H̤h̤H̥h̥H̬h̬H̯h̯ĦħꟸĦ̥ħ̥ꞪɦʱⱧⱨꞕh̢ʜ̇ɧÌìÍíÎîÎ́î́ĨĩĨ́ĩ́Ĩ̀ĩ̀Ĩ̂ĩ̂Ĩ̌ĩ̌Ĩ̍ĩ̍Ĩ̎ĩ̎ĪīĪ́ī́Ī̀ī̀Ī̂ī̂Ī̌ī̌Ī̃ī̃Ī̄ī̄Ī̆ī̆Ī̆́ī̆́ĬĭĬ̀ĭ̀Ĭ́ĭ́İiIıİ́i̇́ÏïÏ̀ï̀ḮḯÏ̂ï̂Ï̃ï̃Ï̄ï̄Ï̌ï̌Ï̑ï̑I̊i̊I̋i̋ǏǐỈỉI̍i̍I̎i̎ȈȉI̐i̐ȊȋI᷎i᷎ĮįĮ̀į̀Į́į́į̇́Į̂į̂Į̃į̃į̇̃Į̄į̄Į̄̀į̄̀Į̄́į̄́Į̄̂į̄̂Į̄̆į̄̆Į̄̌į̄̌Į̈į̈Į̈̀į̈̀Į̈́į̈́Į̈̂į̈̂Į̈̌į̈̌Į̈̄į̈̄Į̋į̋Į̌į̌Į̱į̱Į̱́į̱́Į̱̀į̱̀I̓i̓I̧i̧Í̧í̧Ì̧ì̧Î̧î̧I̭i̭Ī̭ī̭ḬḭḬ̀ḭ̀Ḭ́ḭ́Ḭ̄ḭ̄Ḭ̈ḭ̈Ḭ̈́ḭ̈́I̱i̱Ì̱ì̱Í̱í̱Î̱î̱Ǐ̱ǐ̱Ĩ̱ĩ̱Ï̱ï̱Ḯ̱ḯ̱Ï̱̀ï̱̀Ï̱̂ï̱̂Ï̱̌ï̱̌Ī̱ī̱Ī̱́ī̱́Ī̱̀ī̱̀Ī̱̂ī̱̂I̮i̮ỊịỊ̀ị̀Ị́ị́Ị̂ị̂Ị̃ị̃Ị̄ị̄Ị̈ị̈Ị̈̀ị̈̀Ị̈́ị̈́Ị̈̂ị̈̂Ị̈̌ị̈̌Ị̌ị̌Ị̍ị̍I̤i̤Ì̤ì̤Í̤í̤Î̤î̤Ï̤ï̤I̥i̥Í̥í̥Ï̥ï̥I̯i̯Í̯í̯Ĩ̯ĩ̯I̩i̩I͔i͔Ī͔ī͔ƗɨᶤƗ̀ɨ̀Ɨ́ɨ́Ɨ̂ɨ̂Ɨ̌ɨ̌Ɨ̃ɨ̃Ɨ̄ɨ̄Ɨ̈ɨ̈Ɨ̧ɨ̧Ɨ̧̀ɨ̧̀Ɨ̧̂ɨ̧̂Ɨ̧̌ɨ̧̌Ɨ̱ɨ̱Ɨ̱̀ɨ̱̀Ɨ̱́ɨ̱́Ɨ̱̂ɨ̱̂Ɨ̱̈ɨ̱̈Ɨ̱̌ɨ̱̌Ɨ̯ɨ̯ᶖꞼꞽı̣ı̥Ɩ̀ɩ̀Ɩ́ɩ́Ɩ̂ɩ̂Ɩ̃ɩ̃Ɩ̈ɩ̈Ɩ̌ɩ̌ᵼJ́j́ĴĵJ̃j̃j̇̃J̄j̄J̇J̈j̈J̈̇j̈̇J̊j̊J̋j̋J̌ǰJ̌́ǰ́J̑j̑J̓j̓J᷎j᷎J̱j̱J̣j̣J̣̌ǰ̣J̥j̥ɈɉɈ̱ɉ̱ꞲʝᶨȷɟᶡʄK̀k̀ḰḱK̂k̂K̃k̃K̄k̄K̆k̆K̇k̇K̈k̈ǨǩK̑k̑K̓k̓K̕k̕K̔k̔K͑k͑ĶķK̦k̦K̨k̨ḴḵḴ̓ḵ̓ḲḳK̮k̮K̥k̥K̬k̬K̫k̫ᶄƘƙⱩⱪꝀꝁꝂꝃꝄꝅꞢꞣᴋ̇ĿŀL̀l̀ĹĺL̂l̂L̃l̃L̄l̄L̇l̇L̈l̈L̋l̋ĽľL̐l̐L̑l̑L̓l̓L̕l̕ĻļĻ̂ļ̂Ļ̃ļ̃L̦l̦ḼḽḺḻḺ̓ḻ̓L̮l̮ḶḷḶ̀ḷ̀Ḷ́ḷ́ḸḹḸ́ḹ́Ḹ̆ḹ̆Ḷ̓ḷ̓Ḷ̕ḷ̕Ḷ̣ḷ̣L̤l̤L̤̄l̤̄L̥l̥L̥̀l̥̀Ĺ̥ĺ̥L̥̄l̥̄L̥̄́l̥̄́L̥̄̆l̥̄̆L̥̕l̥̕L̩l̩L̩̀l̩̀L̩̓l̩̓L̯l̯ŁłŁ̇ł̇Ł̓ł̓Ł̣ł̣Ł̱ł̱ꝈꝉȽƚⱠⱡⱢɫꭞꞭɬᶅᶪɭᶩꞎȴʟ̇ʟ̣ƛƛ̓λ̴λ̴̓M̀m̀ḾḿM̂m̂M̃m̃M̄m̄M̆m̆ṀṁṀ̇ṁ̇M̈m̈M̋m̋M̍m̍M̌m̌M̐m̐M̑m̑M̓m̓M̕̕m̕M͑m͑ᵯM̧m̧M̨m̨M̦m̦M̱m̱Ḿ̱ḿ̱M̮m̮ṂṃṂ́ṃ́Ṃ̄ṃ̄Ṃ̓ṃ̓M̥m̥Ḿ̥ḿ̥M̥̄m̥̄M̥̄́m̥̄́M̥̄̆m̥̄̆M̬m̬M̩m̩M̩̀m̩̀M̩̓m̩̓M̯m̯ᶆm̢Ɱɱᶬᴍ̇ᴍ̣ǸǹŃńN̂n̂ÑñÑ̈ñ̈N̄n̄N̆n̆ṄṅṄ̇ṅ̇N̈n̈N̋n̋ŇňN̐n̐N̑n̑N̍n̍N̓n̓N̕n̕ꞤꞥᵰŅņŅ̂ņ̂Ņ̃ņ̃N̦n̦N̨n̨ṊṋN̰n̰ṈṉṈ́ṉ́N̮n̮ṆṇṆ́ṇ́Ṇ̄ṇ̄Ṇ̄́ṇ̄́Ṇ̓ṇ̓N̤n̤N̥n̥Ǹ̥ǹ̥Ń̥ń̥Ñ̥ñ̥Ñ̥́ñ̥́N̥̄n̥̄N
̥̄́n̥̄́N̥̄̆n̥̄̆N̥̄̑n̥̄̑Ṅ̥ṅ̥N̥̑n̥̑N̥̑́n̥̑́N̥̑̄n̥̑̄N̯n̯N̩n̩Ǹ̩ǹ̩N̩̓n̩̓N̲n̲ƝɲᶮȠƞꞐꞑŊ̀ŋ̀Ŋ́ŋ́Ŋ̂ŋ̂Ŋ̄ŋ̄Ŋ̈ŋ̈Ŋ̈̇ŋ̈̇Ŋ̊ŋ̊Ŋ̑ŋ̑Ŋ̨ŋ̨Ŋ̣ŋ̣Ŋ̥ŋ̥Ŋ̥́ŋ̥́Ŋ̥̄ŋ̥̄Ŋ̥̄́ŋ̥̄́ᶇɳᶯȵɴ̇ɴ̣ÒòÓóÔôỐốỒồỖỗÔ̆ô̆ỔổÕõÕ̍õ̍Õ̎õ̎Õ̀õ̀ṌṍÕ̂õ̂Õ̌õ̌ṎṏȬȭŌōṒṓṐṑŌ̂ō̂Ō̃ō̃Ō̃́ō̃́Ō̄ō̄Ō̆ō̆Ō̆́ō̆́Ō̈ō̈Ō̌ō̌ŎŏŎ̀ŏ̀Ŏ́ŏ́Ŏ̈ŏ̈ȮȯȮ́ȯ́ȰȱO͘o͘Ó͘ó͘Ò͘ò͘Ō͘ō͘O̍͘o̍͘ÖöÖ́ö́Ö̀ö̀Ö̂ö̂Ö̌ö̌Ö̃ö̃ȪȫȪ̆ȫ̆Ö̆ö̆ỎỏO̊o̊ŐőǑǒO̍o̍O̎o̎ȌȍO̐o̐ȎȏO̓o̓ØøØ̀ø̀ǾǿØ̂ø̂Ø̃ø̃Ø̄ø̄Ø̄́ø̄́Ø̄̆ø̄̆Ø̆ø̆Ø̇ø̇Ø̇́ø̇́Ø̈ø̈Ø̋ø̋Ø̌ø̌Ø᷎ø᷎Ø̨ø̨Ǿ̨ǿ̨Ø̨̄ø̨̄Ø̣ø̣Ø̥ø̥Ø̰ø̰Ǿ̰ǿ̰ظø¸Ǿ¸ǿ¸ƟɵᶱƠơỚớỜờỠỡƠ̆ơ̆ỞởO᷎o᷎Ó᷎ó᷎O̧o̧Ó̧ó̧Ò̧ò̧Ô̧ô̧Ǒ̧ǒ̧ǪǫǪ̀ǫ̀Ǫ́ǫ́Ǫ̂ǫ̂Ǫ̃ǫ̃ǬǭǬ̀ǭ̀Ǭ́ǭ́Ǭ̂ǭ̂Ǭ̃ǭ̃Ǭ̆ǭ̆Ǭ̌ǭ̌Ǫ̆ǫ̆Ǫ̆́ǫ̆́Ǫ̇ǫ̇Ǫ̇́ǫ̇́Ǫ̈ǫ̈Ǫ̈̀ǫ̈̀Ǫ̈́ǫ̈́Ǫ̈̂ǫ̈̂Ǫ̈̃ǫ̈̃Ǫ̈̄ǫ̈̄Ǫ̈̌ǫ̈̌Ǫ̋ǫ̋Ǫ̌ǫ̌Ǫ̑ǫ̑Ǫ̣ǫ̣Ǫ̱ǫ̱Ǫ̱́ǫ̱́Ǫ̱̀ǫ̱̀Ǫ᷎ǫ᷎O̭o̭O̰o̰Ó̰ó̰O̱o̱Ò̱ò̱Ó̱ó̱Ô̱ô̱Ǒ̱ǒ̱Õ̱õ̱Ō̱ō̱Ṓ̱ṓ̱Ṑ̱ṑ̱Ō̱̂ō̱̂Ö̱ö̱Ö̱́ö̱́Ö̱̀ö̱̀Ö̱̂ö̱̂Ö̱̌ö̱̌O̮o̮ỌọỌ̀ọ̀Ọ́ọ́ỘộỌ̃ọ̃Ọ̄ọ̄Ọ̄̀ọ̄̀Ọ̄́ọ̄́Ọ̄̃ọ̄̃Ọ̄̆ọ̄̆Ọ̆ọ̆Ọ̈ọ̈Ọ̈̀ọ̈̀Ọ̈́ọ̈́Ọ̈̂ọ̈̂Ọ̈̄ọ̈̄Ọ̈̌ọ̈̌Ọ̌ọ̌Ọ̑ọ̑ỢợỌọO̤o̤Ò̤ò̤Ó̤ó̤Ô̤ô̤Ö̤ö̤O̥o̥Ō̥ō̥O̬o̬O̯o̯O̩o̩Õ͔õ͔Ō͔ō͔O̜o̜O̹o̹Ó̹ó̹O̲o̲ᴓᶗꝌꝍⱺꝊꝋƆ́ɔ́Ɔ̀ɔ̀Ɔ̂ɔ̂Ɔ̌ɔ̌Ɔ̃ɔ̃Ɔ̃́ɔ̃́Ɔ̃̀ɔ̃̀Ɔ̃̂ɔ̃̂Ɔ̃̌ɔ̃̌Ɔ̃̍ɔ̃̍Ɔ̃̎ɔ̃̎Ɔ̄ɔ̄Ɔ̆ɔ̆Ɔ̇ɔ̇Ɔ̈ɔ̈Ɔ̈̀ɔ̈̀Ɔ̈́ɔ̈́Ɔ̈̂ɔ̈̂Ɔ̈̌ɔ̈̌Ɔ̌ɔ̌Ɔ̍ɔ̍Ɔ̎ɔ̎Ɔ̣ɔ̣Ɔ̣̀ɔ̣̀Ɔ̣́ɔ̣́Ɔ̣̂ɔ̣̂Ɔ̣̃ɔ̣̃Ɔ̣̈ɔ̣̈Ɔ̣̈̀ɔ̣̈̀Ɔ̣̈́ɔ̣̈́Ɔ̣̈̂ɔ̣̈̂Ɔ̣̈̌ɔ̣̈̌Ɔ̣̌ɔ̣̌Ɔ̤ɔ̤Ɔ̤̀ɔ̤̀Ɔ̤́ɔ̤́Ɔ̤̂ɔ̤̂Ɔ̤̈ɔ̤̈Ɔ̱ɔ̱Ɔ̱̀ɔ̱̀Ɔ̱́ɔ̱́Ɔ̱̂ɔ̱̂Ɔ̱̌ɔ̱̌Ɔ̱̃ɔ̱̃Ɔ̱̈ɔ̱̈Ɔ̱̈̀ɔ̱̈̀Ɔ̱̈́ɔ̱̈́Ɔ̧ɔ̧Ɔ̧̀ɔ̧̀Ɔ̧́ɔ̧́Ɔ̧̂ɔ̧̂Ɔ̧̌ɔ̧̌Ɔ̨ɔ̨Ɔ̨́ɔ̨́Ɔ̨̀ɔ̨̀Ɔ̨̂ɔ̨̂Ɔ̨̌ɔ̨̌Ɔ̨̄ɔ̨̄Ɔ̨̆ɔ̨̆Ɔ̨̈ɔ̨̈Ɔ̨̱ɔ̨̱Ɔ̰ɔ̰Ɔ̰̀ɔ̰̀Ɔ̰́ɔ̰́Ɔ̰̄ɔ̰̄P̀p̀ṔṕP̃p̃P̄p̄P̆p̆ṖṗP̈p̈P̋p̋P̑p̑P̓p̓P̕p̕P̔p̔P͑p͑P̱p̱P̣p̣P̤p̤P̬p̬ⱣᵽꝐꝑᵱᶈƤƥꝒꝓꝔꝕᴘ̇Q́q́Q̃q̃Q̄q̄Q̇q̇Q̈q̈Q̋q̋Q̓q̓Q̕q̕Q̧q̧Q̣q̣Q̣̇q̣̇Q̣̈q̣̈Q̱q̱ꝖꝗꝖ̃ꝗ̃ꝘꝙʠɊɋR̀r̀ŔŕR̂r̂R̃r̃R̄r̄R̆r̆ṘṙR̋r̋ŘřR̍r̍ȐȑȒȓR̓r̓R̕r̕ŖŗR̦r̦R̨r̨R̨̄r̨̄ꞦꞧR̭r̭ṞṟṚṛṚ̀ṛ̀Ṛ́ṛ́ṜṝṜ́ṝ́Ṝ̃ṝ̃Ṝ̆ṝ̆R̤r̤R̥r̥R̥̀r̥̀Ŕ̥ŕ̥R̥̂r̥̂R̥̃r̥̃R̥̄r̥̄R̥̄́r̥̄́R̥̄̆r̥̄̆Ř̥ř̥R̬r̬R̩r̩R̯r̯ɌɍᵲꭨɺᶉɻʵⱹɼⱤɽɾᵳɿʀ̇ʀ̣Ꝛ́ꝛ́Ꝛ̣ꝛ̣S̀s̀ŚśŚ̀ś̀ŚśṤṥŜŝS̃s̃S̄s̄S̄̒s̄̒S̆s̆ṠṡṠ̃ṡ̃S̈s̈S̋s̋ŠšŠ̀š̀Š́š́ṦṧŠ̓š̓S̑s̑S̒s̒S̓s̓S̕s̕ŞşȘșS̨s̨Š̨š̨ꞨꞩS̱s̱Ś̱ś̱S̮s̮ṢṣṢ́ṣ́Ṣ̄ṣ̄ṨṩṢ̌ṣ̌Ṣ̕ṣ̕Ṣ̱ṣ̱S̤s̤Š̤š̤S̥s̥Ś̥S̬s̬S̩s̩S̪s̪ꜱ̇ꜱ̣ſ́ẛſ̣ᵴᶊʂᶳꟅⱾȿẜẝᶋᶘʆT̀t̀T́t́T̃t̃T̄t̄T̆t̆T̆̀t̆̀ṪṫT̈ẗŤťT̑t̑T̓t̓T̕t̕T̔t̔T͑t͑ŢţȚțT̨t̨T̗t̗ṰṱT̰t̰ṮṯT̮t̮ṬṭṬ́ṭ́T̤t̤T̥t̥T̬t̬T̯t̯T̪t̪ƾŦŧȾⱦᵵƫᶵƬƭƮʈȶᴛ̇ᴛ̣ÙùÚúÛûŨũŨ̀ũ̀ṸṹŨ̂ũ̂Ũ̊ũ̊Ũ̌ũ̌Ũ̍ũ̍Ũ̎ũ̎ŪūŪ̀ū̀Ū́ū́Ū̂ū̂Ū̌ū̌Ū̃ū̃Ū̄ū̄Ū̆ū̆Ū̆́ū̆́ṺṻŪ̊ū̊ŬŭŬ̀ŭ̀Ŭ́ŭ́U̇u̇U̇́u̇́U̇̄u̇̄ÜüǛǜǗǘÜ̂ü̂Ü̃ü̃ǕǖǕ̆ǖ̆
Ü̆ü̆ǙǚỦủŮůŮ́ů́Ů̃ů̃ŰűǓǔU̍u̍U̎u̎ȔȕȖȗU̓u̓U᷎u᷎ỦủƯưỨứỪừỮữƯ̆ư̆ỬửỰựU̧u̧Ú̧ú̧Ù̧ù̧Û̧û̧Ǔ̧ǔ̧ŲųŲ̀ų̀Ų́ų́Ų̂ų̂Ų̌ų̌Ų̄ų̄Ų̄́ų̄́Ų̄̀ų̄̀Ų̄̂ų̄̂Ų̄̌ų̄̌Ų̄̌ų̄̌Ų̈ų̈Ų̈́ų̈́Ų̈̀ų̈̀Ų̈̂ų̈̂Ų̈̌ų̈̌Ų̈̄ų̈̄Ų̋ų̋Ų̱ų̱Ų̱́ų̱́Ų̱̀ų̱̀ṶṷṴṵṴ̀ṵ̀Ṵ́ṵ́Ṵ̄ṵ̄Ṵ̈ṵ̈U̱u̱Ù̱ù̱Ú̱ú̱Û̱û̱Ũ̱ũ̱Ū̱ū̱Ū̱́ū̱́Ū̱̀ū̱̀Ū̱̂ū̱̂Ü̱ü̱Ǘ̱ǘ̱Ǜ̱ǜ̱Ü̱̂ü̱̂Ǚ̱ǚ̱Ǔ̱ǔ̱ỤụỤ̀ụ̀Ụ́ụ́Ụ̂ụ̂Ụ̃ụ̃Ụ̄ụ̄Ụ̈ụ̈Ụ̈̀ụ̈̀Ụ̈́ụ̈́Ụ̈̂ụ̈̂Ụ̈̌ụ̈̌Ụ̌ụ̌Ụ̍ụ̍ṲṳṲ̀ṳ̀Ṳ́ṳ́Ṳ̂ṳ̂Ṳ̈ṳ̈U̥u̥Ü̥ü̥U̯u̯Ũ̯ũ̯Ü̯ü̯U̩u̩U͔u͔Ũ͔ũ͔Ū͔ū͔ɄʉᶶɄ̀ʉ̀Ʉ́ʉ́Ʉ̂ʉ̂Ʉ̃ʉ̃Ʉ̄ʉ̄Ʉ̈ʉ̈Ʉ̌ʉ̌Ʉ̧ʉ̧Ʉ̰ʉ̰Ʉ̰́ʉ̰́Ʉ̱ʉ̱Ʉ̱́ʉ̱́Ʉ̱̀ʉ̱̀Ʉ̱̂ʉ̱̂Ʉ̱̈ʉ̱̈Ʉ̱̌ʉ̱̌Ʉ̥ʉ̥ꞸꞹᵾᶙꞾꞿʮʯɰᶭƱ̀ʊ̀Ʊ́ʊ́Ʊ̃ʊ̃ᵿV̀v̀V́v́V̂v̂ṼṽṼ̀ṽ̀Ṽ́ṽ́Ṽ̂ṽ̂Ṽ̌ṽ̌V̄v̄V̄̀v̄̀V̄́v̄́V̄̂v̄̂V̄̃v̄̃V̄̄v̄̄V̄̆v̄̆V̄̌v̄̌V̆v̆V̆́v̆́V̇v̇V̈v̈V̈̀v̈̀V̈́v̈́V̈̂v̈̂V̈̄v̈̄V̈̌v̈̌V̊v̊V̋v̋V̌v̌V̍v̍V̏v̏V̐v̐V̓v̓V̧v̧V̨v̨V̨̀v̨̀V̨́v̨́V̨̂v̨̂V̨̌v̨̌V̨̄v̨̄V̨̄́v̨̄́V̨̄̀v̨̄̀V̨̄̂v̨̄̂V̨̄̌v̨̄̌V̨̈v̨̈V̨̈́v̨̈́V̨̈̀v̨̈̀V̨̈̂v̨̈̂V̨̈̌v̨̈̌V̨̈̄v̨̈̄V̨̋v̨̋V̨̱v̨̱V̨̱́v̨̱́V̨̱̀v̨̱̀V̨̱̂v̨̱̂V̨̱̌v̨̱̌V̱v̱V̱̀v̱̀V̱́v̱́V̱̂v̱̂V̱̌v̱̌Ṽ̱ṽ̱V̱̈v̱̈V̱̈́v̱̈́V̱̈̀v̱̈̀V̱̈̂v̱̈̂V̱̈̌v̱̈̌ṾṿV̥v̥ꝞꝟᶌƲʋᶹƲ̀ʋ̀Ʋ́ʋ́Ʋ̂ʋ̂Ʋ̃ʋ̃Ʋ̈ʋ̈Ʋ̌ʋ̌ⱱⱴꝨ́ꝩ́Ꝩ̇ꝩ̇Ꝩ̣ꝩ̣ẀẁẂẃŴŵW̃w̃W̄w̄W̆w̆ẆẇẄẅW̊ẘW̋w̋W̌w̌W̍w̍W̓w̓W̱w̱ẈẉW̥w̥W̬w̬ⱲⱳX̀x̀X́x́X̂x̂X̃x̃X̄x̄X̆x̆X̆́x̆́ẊẋẌẍX̊x̊X̌x̌X̓x̓X̕x̕X̱x̱X̱̓x̱̓X̣x̣X̣̓x̣̓X̥x̥ᶍỲỳÝýŶŷỸỹȲȳȲ̀ȳ̀Ȳ́ȳ́Ȳ̃ȳ̃Ȳ̆ȳ̆Y̆y̆Y̆̀y̆̀Y̆́y̆́ẎẏẎ́ẏ́ŸÿŸ́ÿ́Y̊ẙY̋y̋Y̌y̌Y̍y̍Y̎y̎Y̐y̐Y̓y̓ỶỷY᷎y᷎Y̱y̱ỴỵỴ̣ỵ̣Y̥y̥Y̯y̯ɎɏƳƴỾỿZ̀z̀ŹźẐẑZ̃z̃Z̄z̄ŻżZ̈z̈Z̋z̋ŽžŽ́ž́Ž̏ž̏Z̑z̑Z̓z̓Z̕z̕Z̨z̨Z̗z̗ẔẕZ̮z̮ẒẓẒ́ẓ́Ẓ̌ẓ̌Ẓ̣ẓ̣Z̤z̤Z̥z̥ƵƶᵶᶎꟆȤȥʐᶼʑᶽⱿɀⱫⱬƷ́ʒ́Ʒ̇ʒ̇ǮǯǮ́ǯ́Ʒ̥ʒ̥ᶚƺʓÞ́þ́Þ̣þ̣ꝤꝥꝦꝧƻꜮꜯʡʢꜲꜳꜲ́ꜳ́Ꜳ̋ꜳ̋Ꜳ̇ꜳ̇Ꜳ̈ꜳ̈Ꜳ̣ꜳ̣ÆæᴭÆ̀æ̀ǼǽÆ̂æ̂Æ̌æ̌Æ̃æ̃Æ̃́æ̃́Æ̃̀æ̃̀Æ̃̂æ̃̂Æ̃̌æ̃̌ǢǣǢ́ǣ́Ǣ̂ǣ̂Ǣ̃ǣ̃Ǣ̆ǣ̆Æ̆æ̆Æ̇æ̇Æ̈æ̈Æ̈̀æ̈̀Æ̈́æ̈́Æ̈̂æ̈̂Æ̈̌æ̈̌Æ̊æ̊Æ̋æ̋Æ᷎æ᷎Æ̨æ̨Æ̨̀æ̨̀Ǽ̨ǽ̨Æ̨̂æ̨̂Æ̨̈æ̨̈Ǣ̨ǣ̨Æ̨̌æ̨̌Æ̨̱æ̨̱Æ̱æ̱Æ̱̃æ̱̃Æ̱̈æ̱̈Æ̣æ̣Æ͔̃æ͔̃ᴁᴂᵆꬱꜴꜵꜴ́ꜵ́Ꜵ̋ꜵ̋Ꜵ̣ꜵ̣ꜶꜷꜶ́ꜷ́Ꜷ̣ꜷ̣ꜸꜹꜺꜻꜸ́ꜹ́Ꜹ̋ꜹ̋Ꜹ̨ꜹ̨Ꜹ̣ꜹ̣Ꜻ́ꜻ́ꜼꜽꜼ̇ꜽ̇Ꜽ̣ꜽ̣ȸDZDzdzʣDŽDždžꭦʥʤffffifflfiflʩIJijꭡLJLjljỺỻʪʫɮNJNjnjŒœꟹŒ̀œ̀Œ́œ́Œ̂œ̂Œ̃œ̃Œ̄œ̄Œ̄́œ̄́Œ̄̃œ̄̃Œ̄̆œ̄̆Œ̋œ̋Œ̌œ̌Œ̨œ̨Œ̨̃œ̨̃Œ̣œ̣Œ̯œ̯ɶᴔꭂꭁꭢꝎꝏꝎ́ꝏ́Ꝏ̈ꝏ̈Ꝏ̋ꝏ̋Ꝏ̣ꝏ̣ꭃꭄȹẞßstſtʨᵺʦꭧʧꜨꜩꭀᵫꭐꭑꭣꝠꝡꝠ̈ꝡ̈Ꝡ̋ꝡ̋ꭠ";
let re = new RegExp(`(?<=[^${wordchar}]*)[${wordchar}]+(?=[${wordchar}]*)`, "g");
console.log(myString.match(re)); // ["Whatever", "you", "want", "here", "ex", "Bân-lâm-gú", "or", "bokmål", "or", "Português", "or", "Română", "or", "Slovenčina", "or", "Slovenščina"]
For the Russian and Latin alphabets I've used
[\\wа-яА-Я]
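One caveat with this class: the letters ё (U+0451) and Ё (U+0401) lie outside the а-я and А-Я ranges, so words containing them get cut. The sketch below compares the class as given with a variant that adds both letters; the variable names and sample text are made up for illustration.

```javascript
const asGiven = /[\wа-яА-Я]+/g;
const withYo = /[\wа-яА-ЯёЁ]+/g;

// "Ёлка and ёж" written with escapes for the two ё letters.
const ruSample = '\u0401лка and \u0451ж';

console.log(ruSample.match(asGiven)); // [ 'лка', 'and', 'ж' ]
console.log(ruSample.match(withYo));  // [ 'Ёлка', 'and', 'ёж' ]
```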
