I want to do some basic String testing in Node.js. Assume I have a form where users enter their name and I wanna check if it's just rubbish or a real name.
Happily (or sadly for my check) I get users from all around the world, which means that their names contain non-English characters, like ä ö ü ß é. I used to use /[A-Za-z -]{2,}/ but this doesn't match names like "Jan Buschtöns".
Do I have to manually add every possible non-English but Latin character to my RegEx for it to work? I don't want a 100+ characters long RegEx like /[A-Za-z -äöüÄÖÜßéÉèÈêÊ...]{2,}/.
Check http://www.regular-expressions.info/unicode.html and http://xregexp.com/plugins/
You would need to use \p{L} to match any letter character if you want to include Unicode.
In Unicode terms, the counterpart of \w is then [\p{L}\p{N}_].
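A minimal sketch of that substitution, assuming an engine with ES2018 Unicode property escapes (the u flag is required). The sample string and variable names are made up for illustration:

```javascript
// Unicode-aware counterpart of \w+; \p{...} only works with the "u" flag.
const unicodeWord = /[\p{L}\p{N}_]+/gu;

// Hypothetical sample text, chosen only for illustration.
const sample = "Jan Buschtöns, José_2 & Łukasz";
const words = sample.match(unicodeWord);

console.log(words); // [ 'Jan', 'Buschtöns', 'José_2', 'Łukasz' ]
```

Note that \p{L} matches single letter code points, so this assumes the accented letters are stored in precomposed form.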
Update: As of ES2018, JavaScript supports Unicode property escapes such as \p{L}, which matches anything that Unicode considers to be a letter. All modern browsers support this feature, so that's probably the way to go as long as you don't care about ancient browsers.
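For the name check from the question, a sketch could look like this. The function name is hypothetical, and the ^ and $ anchors plus the NFC normalization are my additions, not part of the original pattern:

```javascript
// Hypothetical helper: accepts letters from any script plus spaces and hyphens,
// at least two characters; the anchors force the whole input to match.
function looksLikeName(input) {
  // NFC folds "o" + combining diaeresis into the single letter "ö",
  // so that \p{L} sees one letter instead of a letter plus a mark.
  return /^[\p{L} -]{2,}$/u.test(input.normalize("NFC"));
}

console.log(looksLikeName("Jan Buschtöns")); // true
console.log(looksLikeName("x1!"));           // false
```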
Old answer for pre-ES2018 browsers:
The answer depends on exactly what you want to do.
As you have noticed, [A-Za-z] only matches Latin letters without diacritics.
If you only care about German diacritics and the ß ligature, then you can just replace that part with [A-Za-zÄÖÜäöüß], e.g.:
/[A-Za-zÄÖÜäöüß -]{2,}/
But that probably isn’t what you want to do. You probably want to match Latin letters with any diacritics, not just those used in German. Or perhaps you want to match any letters from any alphabet, not just Latin.
Other regular expression dialects have character classes to help you with problems like this, but unfortunately JavaScript’s regular expression dialect has very few character classes and none of them help you here.
(In case you don’t know, a “character class” is an expression that matches any character that is a member of a predefined group of characters. For example, \w is a character class that matches any ASCII letter, digit, or underscore, and . is a character class that matches any character except line terminators.)
This means that you have to list out every range of UTF-16 code units that corresponds to a character that you want to match.
A quick and dirty solution might be to say [a-zA-Z\u0080-\uFFFF], or in full:
/[A-Za-z\u0080-\uFFFF -]{2,}/
This will match any letter in the ASCII range, but will also match any character at all that is outside the ASCII range. This includes all possible alphabetic characters with or without diacritics in any script. However, it also includes a lot of characters that are not letters. Non-letters in the ASCII range are excluded, but non-letters outside the ASCII range are included.
The above might be good enough for your purposes, but if it isn’t then you will have to figure out which character ranges you need and specify those explicitly.
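To make the trade-off of the quick and dirty range concrete, here is a small sketch. The ^ and $ anchors are added for the test (an assumption, the pattern above is unanchored), and the sample inputs are invented:

```javascript
// Quick-and-dirty class: ASCII letters plus everything from U+0080 to U+FFFF.
const quickAndDirty = /^[A-Za-z\u0080-\uFFFF -]{2,}$/;

console.log(quickAndDirty.test("Jan Buschtöns")); // true
console.log(quickAndDirty.test("€×÷"));           // true, although none of these are letters
console.log(quickAndDirty.test("a1"));            // false: ASCII digits are still excluded
```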
If you want just Latin letters, including those with less common diacritics like åēį, but excluding e.g. Chinese, Devanagari, and Cyrillic characters, you can use \p{Script=Latin} with the u flag. This feature is called Unicode property escapes, and was introduced in ES2018.
For example, /\p{Script=Latin}+/u will match a word that only contains Latin characters.
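A short sketch of how this behaves on a few invented inputs (the anchors are added here so that the whole string has to be Latin):

```javascript
// Whole string must consist of Latin-script characters only.
const latinOnly = /^\p{Script=Latin}+$/u;

console.log(latinOnly.test("åēį"));    // true
console.log(latinOnly.test("Москва")); // false (Cyrillic)
console.log(latinOnly.test("東京"));    // false (Han)
```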
This is a JavaScript/Node.js question, but I barely see any actual JavaScript code that shows how to do it. It's a bit trickier, because it requires the Unicode "u" flag:
// Result: '_ {_} [_]'
'ulike {adj} [ubøyelig]'.replace(/\p{L}+/gu, '_')
I know this question is old, but I'm working on some NLP software and needed to match all words in most Latin-script languages, which I did with the following piece of code.
let myString = "Whatever you want here, ex. Bân-lâm-gú or bokmål or Português or Română or Slovenčina or Slovenščina";
let wordchar = "-A-Za-zᴀⱯɐᵄⱭɑᵅꬰꭤⱰɒᶛʙᴃᴯꞖꞗꞴꞵᴄↃↄꞳꭓꭕꭔÐðᶞꟇꟈꝹꝺᴅᴆꝱẟᴇꬲꬳꬴƎᴲǝⱻƏəₔᵊƐɛᵋEɘꞫɜᶟɞʚᴈᵌɤꝻꝼꜰℲⅎꟻꬵꝽᵹꞬɡᶢꬶɢᵷ⅁ꝾꝿƔɣˠƢƣʜǶƕⱵⱶꟵꟶꜦꜧꭜıꞮɪᶦꟾꟷᴉᵎᵻᶧƖɩᶥᴊKᴋꞰʞʟᶫꝆꝇᴌꬸꬹꬷꭝꝲꞀꞁ⅃ᴍꬺꟽꟿꝳɴᶰᴎᴻꬻꝴŊŋᵑꬼᴏᴑꬽꬾƆɔᵓᴐꬿᴒᴖᵔᴗᵕꞶꞷɷȢȣᴕᴽᴘꟼɸᶲⱷĸꞯꞂꞃƦʀꝚꝛᴙꭆɹʴᴚʁʶꭉꭇꭈꭊꭋꭌꭅꝵꝶꝜꝝſꟉꟊꞄꞅƧƨꜱƩʃᶴꭍƪʅꞆꞇᴛꝷꞱʇᴜᶸᴝᵙᴞꭒꭟꭎꭏꞍɥᶣƜɯꟺᵚᴟƱʊᶷᴠỼỽɅʌᶺᴡꟂꟃʍꭩꭖꭗꭘꭙꭙ̆ʏꭚʎ⅄ƍᴢꝢꝣƷʒᶾᴣƸƹȜȝÞþǷƿꝨꝩꝪꝫꝬꝭꝮꝯꝰꝸꜪꜫꜬꜭꜮꜯƼƽƄƅɁɂʔꜢꜣꞋꞌꞏʕˤᴤᴥᵜꜤꜥʖǀǁǃǂʗʘʬʭꞚꞛꞜꞝꞞꞟẚÀàÁáÂâẦầẤấẪẫẨẩÃãÃ̀ã̀Ã́ã́Ã̂ã̂Ã̌ã̌Ã̍ã̍Ã̎ã̎ĀāĀ̀ā̀Ā́ā́Ā̂ā̂Ā̃ā̃Ā̃́ā̃́Ā̄ā̄Ā̆ā̆Ā̆́ā̆́Ā̈ā̈Ā̊ā̊Ā̌ā̌ĂăẰằẮắẴẵẲẳȦȧȦ́ȧ́ǠǡÄäÄ́ä́Ä̀ä̀Ä̂ä̂Ä̃ä̃ǞǟǞ̆ǟ̆Ä̆ä̆Ä̌ä̌ẢảÅåÅǺǻÅ̂å̂Å̃å̃Å̄å̄Å̄̆å̄̆Å̆å̆A̋a̋ǍǎA̍a̍A̎a̎ȀȁȂȃA̐a̐A̓a̓A̧a̧À̧à̧Á̧á̧Â̧â̧Ǎ̧ǎ̧A̭a̭A̰a̰À̰à̰Á̰á̰Ā̰ā̰Ä̰ä̰Ä̰́ä̰́ĄąĄ̀ą̀Ą́ą́Ą̂ą̂Ą̃ą̃Ą̄ą̄Ą̄̀ą̄̀Ą̄́ą̄́Ą̄̂ą̄̂Ą̄̌ą̄̌Ą̇ą̇Ą̈ą̈Ą̈̀ą̈̀Ą̈́ą̈́Ą̈̂ą̈̂Ą̈̌ą̈̌Ą̈̄ą̈̄Ą̊ą̊Ą̌ą̌Ą̋ą̋Ą̱ą̱Ą̱̀ą̱̀Ą̱́ą̱́A᷎a᷎A̱a̱À̱à̱Á̱á̱Â̱â̱Ã̱ã̱Ā̱ā̱Ā̱̀ā̱̀Ā̱́ā̱́Ā̱̂ā̱̂Ä̱ä̱Ä̱̀ä̱̀Ä̱́ä̱́Ä̱̂ä̱̂Ä̱̌ä̱̌Å̱å̱Ǎ̱ǎ̱A̱̥a̱̥ẠạẠ́ạ́Ạ̀ạ̀ẬậẠ̃ạ̃Ạ̄ạ̄ẶặẠ̈ạ̈Ạ̈̀ạ̈̀Ạ̈́ạ̈́Ạ̈̂ạ̈̂Ạ̈̌ạ̈̌Ạ̌ạ̌Ạ̍ạ̍A̤a̤À̤à̤Á̤á̤Â̤â̤Ä̤ä̤ḀḁḀ̂ḁ̂Ḁ̈ḁ̈A̯a̯A̩a̩À̩à̩Á̩á̩Â̩â̩Ã̩ã̩Ā̩ā̩Ǎ̩ǎ̩A̩̍a̩̍A̩̓a̩̓A͔a͔Ā͔ā͔ȺⱥȺ̀ⱥ̀Ⱥ́ⱥ́ᶏꞺꞻⱭ̀ɑ̀Ɑ́ɑ́Ɑ̂ɑ̂Ɑ̃ɑ̃Ɑ̄ɑ̄Ɑ̆ɑ̆Ɑ̇ɑ̇Ɑ̈ɑ̈Ɑ̊ɑ̊Ɑ̌ɑ̌ᶐB̀b̀B́b́B̂b̂B̃b̃B̄b̄ḂḃB̈b̈B̒b̒B̕b̕ḆḇḆ̂ḇ̂ḄḅB̤b̤B̥b̥B̬b̬ɃƀᵬᶀƁɓƂƃʙ̇ʙ̣C̀c̀ĆćĈĉC̃c̃C̄c̄C̄́c̄́C̆c̆ĊċC̈c̈ČčČ́č́Č͑č͑Č̓č̓Č̕č̕Č̔č̔C̋c̋C̓c̓C̕c̕C̔c̔C͑c͑ÇçḈḉÇ̆ç̆Ç̇ç̇Ç̌ç̌ꞔꟄC̦c̦C̭c̭C̱c̱C̮c̮C̣c̣Ć̣ć̣Č̣č̣C̥c̥C̬c̬C̯c̯C̨c̨ȻȼȻ̓ȼ̓ꞒꞓƇƈɕᶝꜾꜿD́d́D̂d̂D̃d̃D̄d̄ḊḋD̊d̊ĎďD̑d̑D̓d̓D̕d̕ḐḑD̦d̦ḒḓḎḏD̮d̮ḌḍḌ́ḍ́Ḍ̄ḍ̄D̤d̤D̥d̥D̬d̬D̪d̪ĐđĐ̣đ̣Đ̱đ̱ᵭᶁƉɖƊɗᶑƋƌȡꝹ́ꝺ́Ꝺ̇ꝺ̇ᴅ̇ᴅ̣Ð́ð́Ð̣ð̣ÈèÉéÊêỀềẾếỄễÊ̄ê̄Ê̆ê̆Ê̌ê̌ỂểẼẽẼ̀ẽ̀Ẽ́ẽ́Ẽ̂ẽ̂Ẽ̌ẽ̌Ẽ̍ẽ̍Ẽ̎ẽ̎ĒēḔḕḖḗĒ̂ē̂Ē̃ē̃Ē̃́ē̃́Ē̄ē̄Ē̆ē̆Ē̆́ē̆́Ē̌ē̌Ē̑ē̑ĔĕĔ̀ĕ̀Ĕ́ĕ́Ĕ̄ĕ̄ĖėĖ́ė́Ė̃ė̃Ė̄ė̄ËëË̀ë̀Ë́ë́Ë̂ë̂Ë̃ë̃Ë̄ë̄Ë̌ë̌ẺẻE̊e̊E̊̄e̊̄E̋e̋ĚěĚ́ě́Ě̃ě̃Ě̋ě̋Ě̑ě̑E̍e̍E̎e̎ȄȅȆȇE̓e̓E᷎e᷎ȨȩȨ̀ȩ̀Ȩ́ȩ́Ȩ̂ȩ̂ḜḝȨ̌ȩ̌Ẽ̦ẽ̦ĘęĘ̀ę̀Ę́ę́Ę̂ę̂Ę̃ę̃Ę̃́ę̃́Ę̄ę̄Ę̄̀ę̄̀Ę̄́ę̄́Ę̄̂ę̄̂Ę̄̃ę̄̃Ę̄̌ę̄̌Ę̆ę̆Ę̇ę̇Ę̇́ę̇́Ę̈ę̈Ę̈̀ę̈̀Ę̈́ę̈́Ę̈̂ę̈̂Ę̈̌ę̈̌Ę̈̄ę̈̄Ę̋ę̋Ę̌ę̌Ę̑ę̑Ę̱ę̱Ę̱̀ę̱̀Ę̱́ę̱́Ę̣ę̣Ę᷎ę᷎ḘḙḚḛE̱e̱È̱è̱É̱é̱Ê̱ê̱Ẽ̱ẽ̱Ē̱ē̱Ḕ̱ḕ̱Ḗ̱ḗ̱Ē̱̂ē̱̂Ë̱ë̱Ë̱̀ë̱̀Ë̱́ë̱́Ë̱̂ë̱̂Ë̱̌ë̱̌Ě̱ě̱E̮e̮Ē̮ē̮ẸẹẸ̀ẹ̀Ẹ́ẹ́ỆệẸ̃ẹ̃Ẹ̄ẹ̄Ẹ̄̀ẹ̄̀Ẹ̄́ẹ̄́Ẹ̄̃ẹ̄̃Ẹ̆ẹ̆Ẹ̆̀ẹ̆̀Ẹ̆́ẹ̆́Ẹ̈ẹ̈Ẹ̈̀ẹ̈̀Ẹ̈́ẹ̈́Ẹ̈̂ẹ̈̂Ẹ̈̌ẹ̈̌Ẹ̍ẹ̍Ẹ̌ẹ̌Ẹ̑ẹ̑E̤e̤È̤è̤É̤é̤Ê̤ê̤Ë̤ë̤E̥e̥E̯e̯E̩e̩È̩è̩É̩é̩Ê̩ê̩Ẽ̩ẽ̩Ē̩ē̩Ě̩ě̩E̩̍e̩̍E̩̓e̩̓È͕è͕Ê͕ê͕Ẽ͕ẽ͕Ē͕ē͕Ḕ͕ḕ͕E̜e̜E̹e̹È̹è̹É̹é̹Ê̹ê̹Ẽ̹ẽ̹Ē̹ē̹Ḕ̹ḕ̹ɆɇᶒⱸᶕᶓɚᶔɝƐ̀ɛ̀Ɛ́ɛ́Ɛ̂ɛ̂
Ɛ̃ɛ̃Ɛ̃̀ɛ̃̀Ɛ̃́ɛ̃́Ɛ̃̂ɛ̃̂Ɛ̃̌ɛ̃̌Ɛ̃̍ɛ̃̍Ɛ̃̎ɛ̃̎Ɛ̄ɛ̄Ɛ̆ɛ̆Ɛ̇ɛ̇Ɛ̈ɛ̈Ɛ̈̀ɛ̈̀Ɛ̈́ɛ̈́Ɛ̈̂ɛ̈̂Ɛ̈̌ɛ̈̌Ɛ̌ɛ̌Ɛ̍ɛ̍Ɛ̎ɛ̎Ɛ̣ɛ̣Ɛ̣̀ɛ̣̀Ɛ̣́ɛ̣́Ɛ̣̂ɛ̣̂Ɛ̣̃ɛ̣̃Ɛ̣̈ɛ̣̈Ɛ̣̈̀ɛ̣̈̀Ɛ̣̈́ɛ̣̈́Ɛ̣̈̂ɛ̣̈̂Ɛ̣̈̌ɛ̣̈̌Ɛ̣̌ɛ̣̌Ɛ̤ɛ̤Ɛ̤̀ɛ̤̀Ɛ̤́ɛ̤́Ɛ̤̂ɛ̤̂Ɛ̤̈ɛ̤̈Ɛ̧ɛ̧Ɛ̧̀ɛ̧̀Ɛ̧́ɛ̧́Ɛ̧̂ɛ̧̂Ɛ̧̌ɛ̧̌Ɛ̨ɛ̨Ɛ̨̀ɛ̨̀Ɛ̨́ɛ̨́Ɛ̨̂ɛ̨̂Ɛ̨̄ɛ̨̄Ɛ̨̆ɛ̨̆Ɛ̨̈ɛ̨̈Ɛ̨̌ɛ̨̌Ɛ̰ɛ̰Ɛ̰̀ɛ̰̀Ɛ̰́ɛ̰́Ɛ̰̄ɛ̰̄Ɛ̱ɛ̱Ɛ̱̀ɛ̱̀Ɛ̱́ɛ̱́Ɛ̱̂ɛ̱̂Ɛ̱̃ɛ̱̃Ɛ̱̈ɛ̱̈Ɛ̱̈̀ɛ̱̈̀Ɛ̱̈́ɛ̱̈́Ɛ̱̌ɛ̱̌Ə̀ə̀Ə́ə́Ə̂ə̂Ə̄ə̄Ə̌ə̌Ə̏ə̏F̀f̀F́f́F̃f̃F̄f̄ḞḟF̓f̓F̧f̧ᵮᶂƑƒꞘꞙF̱f̱F̣f̣ꜰ̇Ꝼ́ꝼ́Ꝼ̇ꝼ̇Ꝼ̣ꝼ̣G̀g̀ǴǵǴ̄ǵ̄ĜĝG̃g̃G̃́g̃́ḠḡḠ́ḡ́ĞğĠġG̈g̈G̈̇g̈̇G̊g̊G̋g̋ǦǧǦ̈ǧ̈G̑g̑G̒g̒G̓g̓G̕g̕G̔g̔ĢģG̦g̦G̱g̱G̱̓g̱̓G̮g̮G̣g̣G̤g̤G̥g̥G̫g̫ꞠꞡǤǥᶃƓɠɢ̇ɢ̣ʛƔ̓ɣ̓H̀h̀H́h́ĤĥH̄h̄ḢḣḦḧȞȟH̐h̐H̓h̓H̕h̕ḨḩH̨h̨H̭h̭H̱ẖḪḫḤḥḤ̣ḥ̣H̤h̤H̥h̥H̬h̬H̯h̯ĦħꟸĦ̥ħ̥ꞪɦʱⱧⱨꞕh̢ʜ̇ɧÌìÍíÎîÎ́î́ĨĩĨ́ĩ́Ĩ̀ĩ̀Ĩ̂ĩ̂Ĩ̌ĩ̌Ĩ̍ĩ̍Ĩ̎ĩ̎ĪīĪ́ī́Ī̀ī̀Ī̂ī̂Ī̌ī̌Ī̃ī̃Ī̄ī̄Ī̆ī̆Ī̆́ī̆́ĬĭĬ̀ĭ̀Ĭ́ĭ́İiIıİ́i̇́ÏïÏ̀ï̀ḮḯÏ̂ï̂Ï̃ï̃Ï̄ï̄Ï̌ï̌Ï̑ï̑I̊i̊I̋i̋ǏǐỈỉI̍i̍I̎i̎ȈȉI̐i̐ȊȋI᷎i᷎ĮįĮ̀į̀Į́į́į̇́Į̂į̂Į̃į̃į̇̃Į̄į̄Į̄̀į̄̀Į̄́į̄́Į̄̂į̄̂Į̄̆į̄̆Į̄̌į̄̌Į̈į̈Į̈̀į̈̀Į̈́į̈́Į̈̂į̈̂Į̈̌į̈̌Į̈̄į̈̄Į̋į̋Į̌į̌Į̱į̱Į̱́į̱́Į̱̀į̱̀I̓i̓I̧i̧Í̧í̧Ì̧ì̧Î̧î̧I̭i̭Ī̭ī̭ḬḭḬ̀ḭ̀Ḭ́ḭ́Ḭ̄ḭ̄Ḭ̈ḭ̈Ḭ̈́ḭ̈́I̱i̱Ì̱ì̱Í̱í̱Î̱î̱Ǐ̱ǐ̱Ĩ̱ĩ̱Ï̱ï̱Ḯ̱ḯ̱Ï̱̀ï̱̀Ï̱̂ï̱̂Ï̱̌ï̱̌Ī̱ī̱Ī̱́ī̱́Ī̱̀ī̱̀Ī̱̂ī̱̂I̮i̮ỊịỊ̀ị̀Ị́ị́Ị̂ị̂Ị̃ị̃Ị̄ị̄Ị̈ị̈Ị̈̀ị̈̀Ị̈́ị̈́Ị̈̂ị̈̂Ị̈̌ị̈̌Ị̌ị̌Ị̍ị̍I̤i̤Ì̤ì̤Í̤í̤Î̤î̤Ï̤ï̤I̥i̥Í̥í̥Ï̥ï̥I̯i̯Í̯í̯Ĩ̯ĩ̯I̩i̩I͔i͔Ī͔ī͔ƗɨᶤƗ̀ɨ̀Ɨ́ɨ́Ɨ̂ɨ̂Ɨ̌ɨ̌Ɨ̃ɨ̃Ɨ̄ɨ̄Ɨ̈ɨ̈Ɨ̧ɨ̧Ɨ̧̀ɨ̧̀Ɨ̧̂ɨ̧̂Ɨ̧̌ɨ̧̌Ɨ̱ɨ̱Ɨ̱̀ɨ̱̀Ɨ̱́ɨ̱́Ɨ̱̂ɨ̱̂Ɨ̱̈ɨ̱̈Ɨ̱̌ɨ̱̌Ɨ̯ɨ̯ᶖꞼꞽı̣ı̥Ɩ̀ɩ̀Ɩ́ɩ́Ɩ̂ɩ̂Ɩ̃ɩ̃Ɩ̈ɩ̈Ɩ̌ɩ̌ᵼJ́j́ĴĵJ̃j̃j̇̃J̄j̄J̇J̈j̈J̈̇j̈̇J̊j̊J̋j̋J̌ǰJ̌́ǰ́J̑j̑J̓j̓J᷎j᷎J̱j̱J̣j̣J̣̌ǰ̣J̥j̥ɈɉɈ̱ɉ̱ꞲʝᶨȷɟᶡʄK̀k̀ḰḱK̂k̂K̃k̃K̄k̄K̆k̆K̇k̇K̈k̈ǨǩK̑k̑K̓k̓K̕k̕K̔k̔K͑k͑ĶķK̦k̦K̨k̨ḴḵḴ̓ḵ̓ḲḳK̮k̮K̥k̥K̬k̬K̫k̫ᶄƘƙⱩⱪꝀꝁꝂꝃꝄꝅꞢꞣᴋ̇ĿŀL̀l̀ĹĺL̂l̂L̃l̃L̄l̄L̇l̇L̈l̈L̋l̋ĽľL̐l̐L̑l̑L̓l̓L̕l̕ĻļĻ̂ļ̂Ļ̃ļ̃L̦l̦ḼḽḺḻḺ̓ḻ̓L̮l̮ḶḷḶ̀ḷ̀Ḷ́ḷ́ḸḹḸ́ḹ́Ḹ̆ḹ̆Ḷ̓ḷ̓Ḷ̕ḷ̕Ḷ̣ḷ̣L̤l̤L̤̄l̤̄L̥l̥L̥̀l̥̀Ĺ̥ĺ̥L̥̄l̥̄L̥̄́l̥̄́L̥̄̆l̥̄̆L̥̕l̥̕L̩l̩L̩̀l̩̀L̩̓l̩̓L̯l̯ŁłŁ̇ł̇Ł̓ł̓Ł̣ł̣Ł̱ł̱ꝈꝉȽƚⱠⱡⱢɫꭞꞭɬᶅᶪɭᶩꞎȴʟ̇ʟ̣ƛƛ̓λ̴λ̴̓M̀m̀ḾḿM̂m̂M̃m̃M̄m̄M̆m̆ṀṁṀ̇ṁ̇M̈m̈M̋m̋M̍m̍M̌m̌M̐m̐M̑m̑M̓m̓M̕̕m̕M͑m͑ᵯM̧m̧M̨m̨M̦m̦M̱m̱Ḿ̱ḿ̱M̮m̮ṂṃṂ́ṃ́Ṃ̄ṃ̄Ṃ̓ṃ̓M̥m̥Ḿ̥ḿ̥M̥̄m̥̄M̥̄́m̥̄́M̥̄̆m̥̄̆M̬m̬M̩m̩M̩̀m̩̀M̩̓m̩̓M̯m̯ᶆm̢Ɱɱᶬᴍ̇ᴍ̣ǸǹŃńN̂n̂ÑñÑ̈ñ̈N̄n̄N̆n̆ṄṅṄ̇ṅ̇N̈n̈N̋n̋ŇňN̐n̐N̑n̑N̍n̍N̓n̓N̕n̕ꞤꞥᵰŅņŅ̂ņ̂Ņ̃ņ̃N̦n̦N̨n̨ṊṋN̰n̰ṈṉṈ́ṉ́N̮n̮ṆṇṆ́ṇ́Ṇ̄ṇ̄Ṇ̄́ṇ̄́Ṇ̓ṇ̓N̤n̤N̥n̥Ǹ̥ǹ̥Ń̥ń̥Ñ̥ñ̥Ñ̥́ñ̥́N̥̄n̥̄N
̥̄́n̥̄́N̥̄̆n̥̄̆N̥̄̑n̥̄̑Ṅ̥ṅ̥N̥̑n̥̑N̥̑́n̥̑́N̥̑̄n̥̑̄N̯n̯N̩n̩Ǹ̩ǹ̩N̩̓n̩̓N̲n̲ƝɲᶮȠƞꞐꞑŊ̀ŋ̀Ŋ́ŋ́Ŋ̂ŋ̂Ŋ̄ŋ̄Ŋ̈ŋ̈Ŋ̈̇ŋ̈̇Ŋ̊ŋ̊Ŋ̑ŋ̑Ŋ̨ŋ̨Ŋ̣ŋ̣Ŋ̥ŋ̥Ŋ̥́ŋ̥́Ŋ̥̄ŋ̥̄Ŋ̥̄́ŋ̥̄́ᶇɳᶯȵɴ̇ɴ̣ÒòÓóÔôỐốỒồỖỗÔ̆ô̆ỔổÕõÕ̍õ̍Õ̎õ̎Õ̀õ̀ṌṍÕ̂õ̂Õ̌õ̌ṎṏȬȭŌōṒṓṐṑŌ̂ō̂Ō̃ō̃Ō̃́ō̃́Ō̄ō̄Ō̆ō̆Ō̆́ō̆́Ō̈ō̈Ō̌ō̌ŎŏŎ̀ŏ̀Ŏ́ŏ́Ŏ̈ŏ̈ȮȯȮ́ȯ́ȰȱO͘o͘Ó͘ó͘Ò͘ò͘Ō͘ō͘O̍͘o̍͘ÖöÖ́ö́Ö̀ö̀Ö̂ö̂Ö̌ö̌Ö̃ö̃ȪȫȪ̆ȫ̆Ö̆ö̆ỎỏO̊o̊ŐőǑǒO̍o̍O̎o̎ȌȍO̐o̐ȎȏO̓o̓ØøØ̀ø̀ǾǿØ̂ø̂Ø̃ø̃Ø̄ø̄Ø̄́ø̄́Ø̄̆ø̄̆Ø̆ø̆Ø̇ø̇Ø̇́ø̇́Ø̈ø̈Ø̋ø̋Ø̌ø̌Ø᷎ø᷎Ø̨ø̨Ǿ̨ǿ̨Ø̨̄ø̨̄Ø̣ø̣Ø̥ø̥Ø̰ø̰Ǿ̰ǿ̰ظø¸Ǿ¸ǿ¸ƟɵᶱƠơỚớỜờỠỡƠ̆ơ̆ỞởO᷎o᷎Ó᷎ó᷎O̧o̧Ó̧ó̧Ò̧ò̧Ô̧ô̧Ǒ̧ǒ̧ǪǫǪ̀ǫ̀Ǫ́ǫ́Ǫ̂ǫ̂Ǫ̃ǫ̃ǬǭǬ̀ǭ̀Ǭ́ǭ́Ǭ̂ǭ̂Ǭ̃ǭ̃Ǭ̆ǭ̆Ǭ̌ǭ̌Ǫ̆ǫ̆Ǫ̆́ǫ̆́Ǫ̇ǫ̇Ǫ̇́ǫ̇́Ǫ̈ǫ̈Ǫ̈̀ǫ̈̀Ǫ̈́ǫ̈́Ǫ̈̂ǫ̈̂Ǫ̈̃ǫ̈̃Ǫ̈̄ǫ̈̄Ǫ̈̌ǫ̈̌Ǫ̋ǫ̋Ǫ̌ǫ̌Ǫ̑ǫ̑Ǫ̣ǫ̣Ǫ̱ǫ̱Ǫ̱́ǫ̱́Ǫ̱̀ǫ̱̀Ǫ᷎ǫ᷎O̭o̭O̰o̰Ó̰ó̰O̱o̱Ò̱ò̱Ó̱ó̱Ô̱ô̱Ǒ̱ǒ̱Õ̱õ̱Ō̱ō̱Ṓ̱ṓ̱Ṑ̱ṑ̱Ō̱̂ō̱̂Ö̱ö̱Ö̱́ö̱́Ö̱̀ö̱̀Ö̱̂ö̱̂Ö̱̌ö̱̌O̮o̮ỌọỌ̀ọ̀Ọ́ọ́ỘộỌ̃ọ̃Ọ̄ọ̄Ọ̄̀ọ̄̀Ọ̄́ọ̄́Ọ̄̃ọ̄̃Ọ̄̆ọ̄̆Ọ̆ọ̆Ọ̈ọ̈Ọ̈̀ọ̈̀Ọ̈́ọ̈́Ọ̈̂ọ̈̂Ọ̈̄ọ̈̄Ọ̈̌ọ̈̌Ọ̌ọ̌Ọ̑ọ̑ỢợỌọO̤o̤Ò̤ò̤Ó̤ó̤Ô̤ô̤Ö̤ö̤O̥o̥Ō̥ō̥O̬o̬O̯o̯O̩o̩Õ͔õ͔Ō͔ō͔O̜o̜O̹o̹Ó̹ó̹O̲o̲ᴓᶗꝌꝍⱺꝊꝋƆ́ɔ́Ɔ̀ɔ̀Ɔ̂ɔ̂Ɔ̌ɔ̌Ɔ̃ɔ̃Ɔ̃́ɔ̃́Ɔ̃̀ɔ̃̀Ɔ̃̂ɔ̃̂Ɔ̃̌ɔ̃̌Ɔ̃̍ɔ̃̍Ɔ̃̎ɔ̃̎Ɔ̄ɔ̄Ɔ̆ɔ̆Ɔ̇ɔ̇Ɔ̈ɔ̈Ɔ̈̀ɔ̈̀Ɔ̈́ɔ̈́Ɔ̈̂ɔ̈̂Ɔ̈̌ɔ̈̌Ɔ̌ɔ̌Ɔ̍ɔ̍Ɔ̎ɔ̎Ɔ̣ɔ̣Ɔ̣̀ɔ̣̀Ɔ̣́ɔ̣́Ɔ̣̂ɔ̣̂Ɔ̣̃ɔ̣̃Ɔ̣̈ɔ̣̈Ɔ̣̈̀ɔ̣̈̀Ɔ̣̈́ɔ̣̈́Ɔ̣̈̂ɔ̣̈̂Ɔ̣̈̌ɔ̣̈̌Ɔ̣̌ɔ̣̌Ɔ̤ɔ̤Ɔ̤̀ɔ̤̀Ɔ̤́ɔ̤́Ɔ̤̂ɔ̤̂Ɔ̤̈ɔ̤̈Ɔ̱ɔ̱Ɔ̱̀ɔ̱̀Ɔ̱́ɔ̱́Ɔ̱̂ɔ̱̂Ɔ̱̌ɔ̱̌Ɔ̱̃ɔ̱̃Ɔ̱̈ɔ̱̈Ɔ̱̈̀ɔ̱̈̀Ɔ̱̈́ɔ̱̈́Ɔ̧ɔ̧Ɔ̧̀ɔ̧̀Ɔ̧́ɔ̧́Ɔ̧̂ɔ̧̂Ɔ̧̌ɔ̧̌Ɔ̨ɔ̨Ɔ̨́ɔ̨́Ɔ̨̀ɔ̨̀Ɔ̨̂ɔ̨̂Ɔ̨̌ɔ̨̌Ɔ̨̄ɔ̨̄Ɔ̨̆ɔ̨̆Ɔ̨̈ɔ̨̈Ɔ̨̱ɔ̨̱Ɔ̰ɔ̰Ɔ̰̀ɔ̰̀Ɔ̰́ɔ̰́Ɔ̰̄ɔ̰̄P̀p̀ṔṕP̃p̃P̄p̄P̆p̆ṖṗP̈p̈P̋p̋P̑p̑P̓p̓P̕p̕P̔p̔P͑p͑P̱p̱P̣p̣P̤p̤P̬p̬ⱣᵽꝐꝑᵱᶈƤƥꝒꝓꝔꝕᴘ̇Q́q́Q̃q̃Q̄q̄Q̇q̇Q̈q̈Q̋q̋Q̓q̓Q̕q̕Q̧q̧Q̣q̣Q̣̇q̣̇Q̣̈q̣̈Q̱q̱ꝖꝗꝖ̃ꝗ̃ꝘꝙʠɊɋR̀r̀ŔŕR̂r̂R̃r̃R̄r̄R̆r̆ṘṙR̋r̋ŘřR̍r̍ȐȑȒȓR̓r̓R̕r̕ŖŗR̦r̦R̨r̨R̨̄r̨̄ꞦꞧR̭r̭ṞṟṚṛṚ̀ṛ̀Ṛ́ṛ́ṜṝṜ́ṝ́Ṝ̃ṝ̃Ṝ̆ṝ̆R̤r̤R̥r̥R̥̀r̥̀Ŕ̥ŕ̥R̥̂r̥̂R̥̃r̥̃R̥̄r̥̄R̥̄́r̥̄́R̥̄̆r̥̄̆Ř̥ř̥R̬r̬R̩r̩R̯r̯ɌɍᵲꭨɺᶉɻʵⱹɼⱤɽɾᵳɿʀ̇ʀ̣Ꝛ́ꝛ́Ꝛ̣ꝛ̣S̀s̀ŚśŚ̀ś̀ŚśṤṥŜŝS̃s̃S̄s̄S̄̒s̄̒S̆s̆ṠṡṠ̃ṡ̃S̈s̈S̋s̋ŠšŠ̀š̀Š́š́ṦṧŠ̓š̓S̑s̑S̒s̒S̓s̓S̕s̕ŞşȘșS̨s̨Š̨š̨ꞨꞩS̱s̱Ś̱ś̱S̮s̮ṢṣṢ́ṣ́Ṣ̄ṣ̄ṨṩṢ̌ṣ̌Ṣ̕ṣ̕Ṣ̱ṣ̱S̤s̤Š̤š̤S̥s̥Ś̥S̬s̬S̩s̩S̪s̪ꜱ̇ꜱ̣ſ́ẛſ̣ᵴᶊʂᶳꟅⱾȿẜẝᶋᶘʆT̀t̀T́t́T̃t̃T̄t̄T̆t̆T̆̀t̆̀ṪṫT̈ẗŤťT̑t̑T̓t̓T̕t̕T̔t̔T͑t͑ŢţȚțT̨t̨T̗t̗ṰṱT̰t̰ṮṯT̮t̮ṬṭṬ́ṭ́T̤t̤T̥t̥T̬t̬T̯t̯T̪t̪ƾŦŧȾⱦᵵƫᶵƬƭƮʈȶᴛ̇ᴛ̣ÙùÚúÛûŨũŨ̀ũ̀ṸṹŨ̂ũ̂Ũ̊ũ̊Ũ̌ũ̌Ũ̍ũ̍Ũ̎ũ̎ŪūŪ̀ū̀Ū́ū́Ū̂ū̂Ū̌ū̌Ū̃ū̃Ū̄ū̄Ū̆ū̆Ū̆́ū̆́ṺṻŪ̊ū̊ŬŭŬ̀ŭ̀Ŭ́ŭ́U̇u̇U̇́u̇́U̇̄u̇̄ÜüǛǜǗǘÜ̂ü̂Ü̃ü̃ǕǖǕ̆ǖ̆
Ü̆ü̆ǙǚỦủŮůŮ́ů́Ů̃ů̃ŰűǓǔU̍u̍U̎u̎ȔȕȖȗU̓u̓U᷎u᷎ỦủƯưỨứỪừỮữƯ̆ư̆ỬửỰựU̧u̧Ú̧ú̧Ù̧ù̧Û̧û̧Ǔ̧ǔ̧ŲųŲ̀ų̀Ų́ų́Ų̂ų̂Ų̌ų̌Ų̄ų̄Ų̄́ų̄́Ų̄̀ų̄̀Ų̄̂ų̄̂Ų̄̌ų̄̌Ų̄̌ų̄̌Ų̈ų̈Ų̈́ų̈́Ų̈̀ų̈̀Ų̈̂ų̈̂Ų̈̌ų̈̌Ų̈̄ų̈̄Ų̋ų̋Ų̱ų̱Ų̱́ų̱́Ų̱̀ų̱̀ṶṷṴṵṴ̀ṵ̀Ṵ́ṵ́Ṵ̄ṵ̄Ṵ̈ṵ̈U̱u̱Ù̱ù̱Ú̱ú̱Û̱û̱Ũ̱ũ̱Ū̱ū̱Ū̱́ū̱́Ū̱̀ū̱̀Ū̱̂ū̱̂Ü̱ü̱Ǘ̱ǘ̱Ǜ̱ǜ̱Ü̱̂ü̱̂Ǚ̱ǚ̱Ǔ̱ǔ̱ỤụỤ̀ụ̀Ụ́ụ́Ụ̂ụ̂Ụ̃ụ̃Ụ̄ụ̄Ụ̈ụ̈Ụ̈̀ụ̈̀Ụ̈́ụ̈́Ụ̈̂ụ̈̂Ụ̈̌ụ̈̌Ụ̌ụ̌Ụ̍ụ̍ṲṳṲ̀ṳ̀Ṳ́ṳ́Ṳ̂ṳ̂Ṳ̈ṳ̈U̥u̥Ü̥ü̥U̯u̯Ũ̯ũ̯Ü̯ü̯U̩u̩U͔u͔Ũ͔ũ͔Ū͔ū͔ɄʉᶶɄ̀ʉ̀Ʉ́ʉ́Ʉ̂ʉ̂Ʉ̃ʉ̃Ʉ̄ʉ̄Ʉ̈ʉ̈Ʉ̌ʉ̌Ʉ̧ʉ̧Ʉ̰ʉ̰Ʉ̰́ʉ̰́Ʉ̱ʉ̱Ʉ̱́ʉ̱́Ʉ̱̀ʉ̱̀Ʉ̱̂ʉ̱̂Ʉ̱̈ʉ̱̈Ʉ̱̌ʉ̱̌Ʉ̥ʉ̥ꞸꞹᵾᶙꞾꞿʮʯɰᶭƱ̀ʊ̀Ʊ́ʊ́Ʊ̃ʊ̃ᵿV̀v̀V́v́V̂v̂ṼṽṼ̀ṽ̀Ṽ́ṽ́Ṽ̂ṽ̂Ṽ̌ṽ̌V̄v̄V̄̀v̄̀V̄́v̄́V̄̂v̄̂V̄̃v̄̃V̄̄v̄̄V̄̆v̄̆V̄̌v̄̌V̆v̆V̆́v̆́V̇v̇V̈v̈V̈̀v̈̀V̈́v̈́V̈̂v̈̂V̈̄v̈̄V̈̌v̈̌V̊v̊V̋v̋V̌v̌V̍v̍V̏v̏V̐v̐V̓v̓V̧v̧V̨v̨V̨̀v̨̀V̨́v̨́V̨̂v̨̂V̨̌v̨̌V̨̄v̨̄V̨̄́v̨̄́V̨̄̀v̨̄̀V̨̄̂v̨̄̂V̨̄̌v̨̄̌V̨̈v̨̈V̨̈́v̨̈́V̨̈̀v̨̈̀V̨̈̂v̨̈̂V̨̈̌v̨̈̌V̨̈̄v̨̈̄V̨̋v̨̋V̨̱v̨̱V̨̱́v̨̱́V̨̱̀v̨̱̀V̨̱̂v̨̱̂V̨̱̌v̨̱̌V̱v̱V̱̀v̱̀V̱́v̱́V̱̂v̱̂V̱̌v̱̌Ṽ̱ṽ̱V̱̈v̱̈V̱̈́v̱̈́V̱̈̀v̱̈̀V̱̈̂v̱̈̂V̱̈̌v̱̈̌ṾṿV̥v̥ꝞꝟᶌƲʋᶹƲ̀ʋ̀Ʋ́ʋ́Ʋ̂ʋ̂Ʋ̃ʋ̃Ʋ̈ʋ̈Ʋ̌ʋ̌ⱱⱴꝨ́ꝩ́Ꝩ̇ꝩ̇Ꝩ̣ꝩ̣ẀẁẂẃŴŵW̃w̃W̄w̄W̆w̆ẆẇẄẅW̊ẘW̋w̋W̌w̌W̍w̍W̓w̓W̱w̱ẈẉW̥w̥W̬w̬ⱲⱳX̀x̀X́x́X̂x̂X̃x̃X̄x̄X̆x̆X̆́x̆́ẊẋẌẍX̊x̊X̌x̌X̓x̓X̕x̕X̱x̱X̱̓x̱̓X̣x̣X̣̓x̣̓X̥x̥ᶍỲỳÝýŶŷỸỹȲȳȲ̀ȳ̀Ȳ́ȳ́Ȳ̃ȳ̃Ȳ̆ȳ̆Y̆y̆Y̆̀y̆̀Y̆́y̆́ẎẏẎ́ẏ́ŸÿŸ́ÿ́Y̊ẙY̋y̋Y̌y̌Y̍y̍Y̎y̎Y̐y̐Y̓y̓ỶỷY᷎y᷎Y̱y̱ỴỵỴ̣ỵ̣Y̥y̥Y̯y̯ɎɏƳƴỾỿZ̀z̀ŹźẐẑZ̃z̃Z̄z̄ŻżZ̈z̈Z̋z̋ŽžŽ́ž́Ž̏ž̏Z̑z̑Z̓z̓Z̕z̕Z̨z̨Z̗z̗ẔẕZ̮z̮ẒẓẒ́ẓ́Ẓ̌ẓ̌Ẓ̣ẓ̣Z̤z̤Z̥z̥ƵƶᵶᶎꟆȤȥʐᶼʑᶽⱿɀⱫⱬƷ́ʒ́Ʒ̇ʒ̇ǮǯǮ́ǯ́Ʒ̥ʒ̥ᶚƺʓÞ́þ́Þ̣þ̣ꝤꝥꝦꝧƻꜮꜯʡʢꜲꜳꜲ́ꜳ́Ꜳ̋ꜳ̋Ꜳ̇ꜳ̇Ꜳ̈ꜳ̈Ꜳ̣ꜳ̣ÆæᴭÆ̀æ̀ǼǽÆ̂æ̂Æ̌æ̌Æ̃æ̃Æ̃́æ̃́Æ̃̀æ̃̀Æ̃̂æ̃̂Æ̃̌æ̃̌ǢǣǢ́ǣ́Ǣ̂ǣ̂Ǣ̃ǣ̃Ǣ̆ǣ̆Æ̆æ̆Æ̇æ̇Æ̈æ̈Æ̈̀æ̈̀Æ̈́æ̈́Æ̈̂æ̈̂Æ̈̌æ̈̌Æ̊æ̊Æ̋æ̋Æ᷎æ᷎Æ̨æ̨Æ̨̀æ̨̀Ǽ̨ǽ̨Æ̨̂æ̨̂Æ̨̈æ̨̈Ǣ̨ǣ̨Æ̨̌æ̨̌Æ̨̱æ̨̱Æ̱æ̱Æ̱̃æ̱̃Æ̱̈æ̱̈Æ̣æ̣Æ͔̃æ͔̃ᴁᴂᵆꬱꜴꜵꜴ́ꜵ́Ꜵ̋ꜵ̋Ꜵ̣ꜵ̣ꜶꜷꜶ́ꜷ́Ꜷ̣ꜷ̣ꜸꜹꜺꜻꜸ́ꜹ́Ꜹ̋ꜹ̋Ꜹ̨ꜹ̨Ꜹ̣ꜹ̣Ꜻ́ꜻ́ꜼꜽꜼ̇ꜽ̇Ꜽ̣ꜽ̣ȸDZDzdzʣDŽDždžꭦʥʤffffifflfiflʩIJijꭡLJLjljỺỻʪʫɮNJNjnjŒœꟹŒ̀œ̀Œ́œ́Œ̂œ̂Œ̃œ̃Œ̄œ̄Œ̄́œ̄́Œ̄̃œ̄̃Œ̄̆œ̄̆Œ̋œ̋Œ̌œ̌Œ̨œ̨Œ̨̃œ̨̃Œ̣œ̣Œ̯œ̯ɶᴔꭂꭁꭢꝎꝏꝎ́ꝏ́Ꝏ̈ꝏ̈Ꝏ̋ꝏ̋Ꝏ̣ꝏ̣ꭃꭄȹẞßstſtʨᵺʦꭧʧꜨꜩꭀᵫꭐꭑꭣꝠꝡꝠ̈ꝡ̈Ꝡ̋ꝡ̋ꭠ";
let re = new RegExp(`(?<=[^${wordchar}]*)[${wordchar}]+(?=[${wordchar}]*)`, "g");
console.log(myString.match(re)); // ["Whatever", "you", "want", "here", "ex", "Bân-lâm-gú", "or", "bokmål", "or", "Português", "or", "Română", "or", "Slovenčina", "or", "Slovenščina"]
For the Russian and Latin alphabets I've used
[\wа-яА-Я]
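One caveat worth checking before relying on this class: the letters ё and Ё sit outside the а-я and А-Я ranges in Unicode, so they have to be listed separately. A small sketch with made-up sample text and hypothetical variable names:

```javascript
// The ranges а-я / А-Я cover U+0430..U+044F and U+0410..U+042F only;
// ё (U+0451) and Ё (U+0401) are outside both.
const withoutYo = /[\wа-яА-Я]+/g;
const withYo = /[\wа-яА-ЯёЁ]+/g;

const text = "Ещё one ёжик";
console.log(text.match(withoutYo)); // [ 'Ещ', 'one', 'жик' ]
console.log(text.match(withYo));    // [ 'Ещё', 'one', 'ёжик' ]
```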
Related
I am using this pattern to find any word in the string:
\b(\w{1,})
but this can't find Arabic words. How can I change this pattern to find both English and Arabic words?
Thanks
In JavaScript, \w is shorthand for [A-Za-z0-9_] and will not match the Arabic Unicode range. To include other characters you need to specify them, for example:
[A-Za-z\u0600-\u065F\u066A-\u06EF\u06FA-\u06FF]+
For explanation about character codes see Match Arabic word with regex that ends with “#”?
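A quick sketch of the explicit-range approach on an invented sample string. It spells the Latin part as A-Za-z, because a range written as A-z also covers the six ASCII characters between Z and a, including [ and ]:

```javascript
// ASCII letters plus the Arabic ranges listed above.
const wordRe = /[A-Za-z\u0600-\u065F\u066A-\u06EF\u06FA-\u06FF]+/g;

const text = "price: دينار [ok]";
console.log(text.match(wordRe));    // [ 'price', 'دينار', 'ok' ]
console.log(text.match(/[A-z]+/g)); // [ 'price', '[ok]' ]  (brackets leak in)
```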
If your text only includes English and Arabic, and you want to sort the results, you could use this:
([^\x00-\x7F ]+) for the Arabic text and (\w+) for the English text
The first pattern captures runs of characters that are neither ASCII nor a space; the second captures English word characters (letters, digits and _).
Like smirnov said, that regex you're using will only find Latin strings. For Arabic you should use [\u0600-\u06ff]|[\u0750-\u077f]|[\ufb50-\ufbc1]|[\ufbd3-\ufd3f]|[\ufd50-\ufd8f]|[\ufd92-\ufdc7]|[\ufe70-\ufefc]|[\uFDF0-\uFDFD] (which should find all Arabic characters, even weird ones like .)
Depending on what you're trying to do, you might want to split the string into a list and process it that way (that's what I usually end up doing when I'm dealing with mixed-language texts). Then you can identify the language of each word and process it accordingly.
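The split-and-classify idea could be sketched like this, assuming an ES2018 engine so that script property escapes can be used instead of hand-written ranges. The function name, labels and sample text are hypothetical:

```javascript
// Hypothetical classifier: labels a word by the script of all its characters.
function classify(word) {
  if (/^\p{Script=Arabic}+$/u.test(word)) return "arabic";
  if (/^\p{Script=Latin}+$/u.test(word)) return "latin";
  return "other"; // digits, punctuation, mixed-script words, other scripts
}

const tagged = "hello مرحبا world 123"
  .split(/\s+/)
  .map((word) => [word, classify(word)]);

console.log(tagged.map(([, label]) => label)); // [ 'latin', 'arabic', 'latin', 'other' ]
```

Arabic vowel marks are combining characters, so fully vocalized text would probably need \p{M} added to the Arabic pattern.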
I am using a JavaScript RegEx which is mentioned below:
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*([-_.])).+$
This accepts only that text which has at least 1: uppercase letter, lowercase letter, number & a special symbol from .-_.
Now assume I supply User-123 as the user id, which conforms to the above RegEx, and I use the onscreen keyboard to type in a character from the Finnish language, which results in User-123Ã.
The RegEx being fulfilled, the text is accepted by my JavaScript code, but I want it to only accept Alphanumeric input in English, and nothing else.
How should I enhance this RegEx to do so?
The string "User-123Ã" contains the character "Ã", which is a Unicode letter outside the English alphabet, so your JavaScript code needs a way to detect it:
[Code] [Glyph] [Decimal] [HTML] Description [#]
U+00C3 Ã &#195; &Atilde; Latin capital letter A with tilde 0131
Try this link also,
How to find whether a particular string has unicode characters
I am not sure this will solve the issue, but in most cases when you want to restrict the input itself to some characters, your consuming pattern should only match those characters you allow. The lookahead restrictions just require or forbid certain characters to appear certain number of times at certain positions, but what you match in the consuming part is crucial.
.+$ allows any character. Replace it with [\w.-]+$ (\w = [a-zA-Z0-9_]) to restrict the input to the characters you require in the lookaheads.
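Put together, the adjusted pattern could look like the sketch below. The variable name is made up, and the capturing group around [-_.] is dropped because nothing uses the capture:

```javascript
// Lookaheads enforce "at least one of each"; the consuming part [\w.-]+
// limits what may appear anywhere in the input.
const userId = /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[-_.])[\w.-]+$/;

console.log(userId.test("User-123"));  // true
console.log(userId.test("User-123Ã")); // false: Ã is not in [\w.-]
console.log(userId.test("user-123"));  // false: no uppercase letter
```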
I have a field in my application where users can enter a hashtag.
I want to validate their entry and make sure they enter what would be a proper HashTag.
It can be in any language and it should NOT be preceded by the # sign.
I am writing in JavaScript.
So the following are GOOD examples:
Abcde45454_fgfgfg (good because: only letters, numbers and _)
2014_is-the-year (good because: only letters, numbers, _ and -)
בר_רפאלי (good because: only letters and _)
арбуз (good because: only letters)
And the following are BAD examples:
Dan Brown (Bad because has a space)
OMG!!!!! (Bad because has !)
בר רפ#לי (Bad because has # and a space)
We had a regex that matched only a-zA-Z0-9, we needed to add language support so we changed it to ignore white spaces and forgot to ignore special characters, so here I am.
Some other StackOverflow examples I saw but didn't work for me:
Other languages don't work
Again, English only
[edit]
Added explanation why bad is bad and good is good
I don't want a preceding # character, but if I were to add a # at the beginning, it should be a valid hashtag
Basically I don't want to allow any special characters like !##$%^&*()=+./,[{]};:'"?><
If your disallowed characters list is thorough (!##$%^&*()=+./,[{]};:'"?><), then the regex is:
^#?[^\s!##$%^&*()=+./,\[{\]};:'"?><]+$
Demo
This allows an optional leading # sign: #?. It disallows the special characters using a negative character class. I just added \s to the list (spaces), and also I escaped [ and ].
Unfortunately, before ES2018 you couldn't use constructs like \p{P} (Unicode punctuation) in JavaScript's regexes, so you basically had to blacklist characters or take a different approach if the regex solution wasn't good enough for your needs. Engines that support ES2018 accept \p{P} when the u flag is set.
I don't understand why this question does not get more votes. Hashtag detection for multiple languages is a problem. The only working option I could find is posted by Lucas above (all other ones do not work so well).
It needs a modification though:
#[^\s!##$%^&*()=+.\/,\[{\]};:'"?><]+
DEMO
This detects all the hashtags, not only at the beginning of the string, fixes an unescaped character, and removes the unnecessary $ at the end.
Excluding every symbol is not a practical solution, because the available symbols depend on the keyboard layout and there are hundreds of math symbols and the like. Use a whitelist instead (with the u flag):
[\p{sc=Bengali}\p{L}_\p{N}]+
1. If a language needs extra care, add its script, e.g. \p{sc=Bengali}. Bangla has dependent vowel signs like া, ে, ৌ; these are combining marks rather than letters, so \p{L} alone does not match them and the script has to be included explicitly. (Inside a character class no | is needed; there it would match a literal pipe.)
2. Then use \p{L}, which matches a single code point in the category "letter": a-z, letters like é, ü, ğ, i, ç, and the letters of any other alphabet.
3. _ allows the underscore.
4. \p{N} matches any kind of numeric character in any script. (\d only matches [0-9], so \p{N} is the only option if Unicode digits should be allowed.)
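As a sketch, the whitelist idea can be checked against the examples from the question. Two things here are my assumptions rather than part of the answer: the class uses \p{M} (all combining marks) instead of naming individual scripts, and it also allows "-" and an optional leading "#" because the question asks for both:

```javascript
// Hypothetical validator: optional leading "#", then letters, combining marks,
// digits, "_" or "-" from any script. Requires the "u" flag.
const hashtag = /^#?[\p{L}\p{M}\p{N}_-]+$/u;

const samples = [
  "Abcde45454_fgfgfg", "2014_is-the-year", "בר_רפאלי", "арбуз", // good
  "Dan Brown", "OMG!!!!!", "בר רפ#לי",                          // bad
];

for (const s of samples) {
  console.log(hashtag.test(s), s);
}
```

The first four samples print true and the last three print false.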
I want to delete from string all characters that are not letters.
I know that there is something like \W in regex, but it considers non-English characters as not letters. For example my script deletes all Polish letters (like "ą", "ć", "ó"), but I need them.
How to tell regex to do this?
Code:
var text = text.replace(/\W/g, ' ');
You can either use Steve Levithan's XRegExp library (with Unicode plugins), or you have to define the Unicode character range manually, since JavaScript did not support Unicode property escapes before ES2018.
[^\u0041-\u005A\u0061-\u007A\u00AA\u00B5\u00BA\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02C1\u02C6-\u02D1\u02E0-\u02E4\u02EC\u02EE\u0345\u0370-\u0374\u0376\u0377\u037A-\u037D\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03F5\u03F7-\u0481\u048A-\u0527\u0531-\u0556\u0559\u0561-\u0587\u05B0-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7\u05D0-\u05EA\u05F0-\u05F2\u0610-\u061A\u0620-\u0657\u0659-\u065F\u066E-\u06D3\u06D5-\u06DC\u06E1-\u06E8\u06ED-\u06EF\u06FA-\u06FC\u06FF\u0710-\u073F\u074D-\u07B1\u07CA-\u07EA\u07F4\u07F5\u07FA\u0800-\u0817\u081A-\u082C\u0840-\u0858\u08A0\u08A2-\u08AC\u08E4-\u08E9\u08F0-\u08FE\u0900-\u093B\u093D-\u094C\u094E-\u0950\u0955-\u0963\u0971-\u0977\u0979-\u097F\u0981-\u0983\u0985-\u098C\u098F\u0990\u0993-\u09A8\u09AA-\u09B0\u09B2\u09B6-\u09B9\u09BD-\u09C4\u09C7\u09C8\u09CB\u09CC\u09CE\u09D7\u09DC\u09DD\u09DF-\u09E3\u09F0\u09F1\u0A01-\u0A03\u0A05-\u0A0A\u0A0F\u0A10\u0A13-\u0A28\u0A2A-\u0A30\u0A32\u0A33\u0A35\u0A36\u0A38\u0A39\u0A3E-\u0A42\u0A47\u0A48\u0A4B\u0A4C\u0A51\u0A59-\u0A5C\u0A5E\u0A70-\u0A75\u0A81-\u0A83\u0A85-\u0A8D\u0A8F-\u0A91\u0A93-\u0AA8\u0AAA-\u0AB0\u0AB2\u0AB3\u0AB5-\u0AB9\u0ABD-\u0AC5\u0AC7-\u0AC9\u0ACB\u0ACC\u0AD0\u0AE0-\u0AE3\u0B01-\u0B03\u0B05-\u0B0C\u0B0F\u0B10\u0B13-\u0B28\u0B2A-\u0B30\u0B32\u0B33\u0B35-\u0B39\u0B3D-\u0B44\u0B47\u0B48\u0B4B\u0B4C\u0B56\u0B57\u0B5C\u0B5D\u0B5F-\u0B63\u0B71\u0B82\u0B83\u0B85-\u0B8A\u0B8E-\u0B90\u0B92-\u0B95\u0B99\u0B9A\u0B9C\u0B9E\u0B9F\u0BA3\u0BA4\u0BA8-\u0BAA\u0BAE-\u0BB9\u0BBE-\u0BC2\u0BC6-\u0BC8\u0BCA-\u0BCC\u0BD0\u0BD7\u0C01-\u0C03\u0C05-\u0C0C\u0C0E-\u0C10\u0C12-\u0C28\u0C2A-\u0C33\u0C35-\u0C39\u0C3D-\u0C44\u0C46-\u0C48\u0C4A-\u0C4C\u0C55\u0C56\u0C58\u0C59\u0C60-\u0C63\u0C82\u0C83\u0C85-\u0C8C\u0C8E-\u0C90\u0C92-\u0CA8\u0CAA-\u0CB3\u0CB5-\u0CB9\u0CBD-\u0CC4\u0CC6-\u0CC8\u0CCA-\u0CCC\u0CD5\u0CD6\u0CDE\u0CE0-\u0CE3\u0CF1\u0CF2\u0D02\u0D03\u0D05-\u0D0C\u0D0E-\u0D10\u0D12-\u0D3A\u0D3D-\u0D44\u0D46-\u0D48\u0D4A-\u0D4C\u0D4E\u0D57\u0D60-\u0D63\u0D7A-\u0D7F\u0D82\u0D83\u0D85-\u0D96\u0D9A-\u0DB1\u0DB3-\u0D
BB\u0DBD\u0DC0-\u0DC6\u0DCF-\u0DD4\u0DD6\u0DD8-\u0DDF\u0DF2\u0DF3\u0E01-\u0E3A\u0E40-\u0E46\u0E4D\u0E81\u0E82\u0E84\u0E87\u0E88\u0E8A\u0E8D\u0E94-\u0E97\u0E99-\u0E9F\u0EA1-\u0EA3\u0EA5\u0EA7\u0EAA\u0EAB\u0EAD-\u0EB9\u0EBB-\u0EBD\u0EC0-\u0EC4\u0EC6\u0ECD\u0EDC-\u0EDF\u0F00\u0F40-\u0F47\u0F49-\u0F6C\u0F71-\u0F81\u0F88-\u0F97\u0F99-\u0FBC\u1000-\u1036\u1038\u103B-\u103F\u1050-\u1062\u1065-\u1068\u106E-\u1086\u108E\u109C\u109D\u10A0-\u10C5\u10C7\u10CD\u10D0-\u10FA\u10FC-\u1248\u124A-\u124D\u1250-\u1256\u1258\u125A-\u125D\u1260-\u1288\u128A-\u128D\u1290-\u12B0\u12B2-\u12B5\u12B8-\u12BE\u12C0\u12C2-\u12C5\u12C8-\u12D6\u12D8-\u1310\u1312-\u1315\u1318-\u135A\u135F\u1380-\u138F\u13A0-\u13F4\u1401-\u166C\u166F-\u167F\u1681-\u169A\u16A0-\u16EA\u16EE-\u16F0\u1700-\u170C\u170E-\u1713\u1720-\u1733\u1740-\u1753\u1760-\u176C\u176E-\u1770\u1772\u1773\u1780-\u17B3\u17B6-\u17C8\u17D7\u17DC\u1820-\u1877\u1880-\u18AA\u18B0-\u18F5\u1900-\u191C\u1920-\u192B\u1930-\u1938\u1950-\u196D\u1970-\u1974\u1980-\u19AB\u19B0-\u19C9\u1A00-\u1A1B\u1A20-\u1A5E\u1A61-\u1A74\u1AA7\u1B00-\u1B33\u1B35-\u1B43\u1B45-\u1B4B\u1B80-\u1BA9\u1BAC-\u1BAF\u1BBA-\u1BE5\u1BE7-\u1BF1\u1C00-\u1C35\u1C4D-\u1C4F\u1C5A-\u1C7D\u1CE9-\u1CEC\u1CEE-\u1CF3\u1CF5\u1CF6\u1D00-\u1DBF\u1E00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FBC\u1FBE\u1FC2-\u1FC4\u1FC6-\u1FCC\u1FD0-\u1FD3\u1FD6-\u1FDB\u1FE0-\u1FEC\u1FF2-\u1FF4\u1FF6-\u1FFC\u2071\u207F\u2090-\u209C\u2102\u2107\u210A-\u2113\u2115\u2119-\u211D\u2124\u2126\u2128\u212A-\u212D\u212F-\u2139\u213C-\u213F\u2145-\u2149\u214E\u2160-\u2188\u24B6-\u24E9\u2C00-\u2C2E\u2C30-\u2C5E\u2C60-\u2CE4\u2CEB-\u2CEE\u2CF2\u2CF3\u2D00-\u2D25\u2D27\u2D2D\u2D30-\u2D67\u2D6F\u2D80-\u2D96\u2DA0-\u2DA6\u2DA8-\u2DAE\u2DB0-\u2DB6\u2DB8-\u2DBE\u2DC0-\u2DC6\u2DC8-\u2DCE\u2DD0-\u2DD6\u2DD8-\u2DDE\u2DE0-\u2DFF\u2E2F\u3005-\u3007\u3021-\u3029\u3031-\u3035\u3038-\u303C\u3041-\u3096\u309D-\u309F\u30A1-\u30FA\u30FC-\u30FF\u3105-\u312D\u3131-\u3
18E\u31A0-\u31BA\u31F0-\u31FF\u3400-\u4DB5\u4E00-\u9FCC\uA000-\uA48C\uA4D0-\uA4FD\uA500-\uA60C\uA610-\uA61F\uA62A\uA62B\uA640-\uA66E\uA674-\uA67B\uA67F-\uA697\uA69F-\uA6EF\uA717-\uA71F\uA722-\uA788\uA78B-\uA78E\uA790-\uA793\uA7A0-\uA7AA\uA7F8-\uA801\uA803-\uA805\uA807-\uA80A\uA80C-\uA827\uA840-\uA873\uA880-\uA8C3\uA8F2-\uA8F7\uA8FB\uA90A-\uA92A\uA930-\uA952\uA960-\uA97C\uA980-\uA9B2\uA9B4-\uA9BF\uA9CF\uAA00-\uAA36\uAA40-\uAA4D\uAA60-\uAA76\uAA7A\uAA80-\uAABE\uAAC0\uAAC2\uAADB-\uAADD\uAAE0-\uAAEF\uAAF2-\uAAF5\uAB01-\uAB06\uAB09-\uAB0E\uAB11-\uAB16\uAB20-\uAB26\uAB28-\uAB2E\uABC0-\uABEA\uAC00-\uD7A3\uD7B0-\uD7C6\uD7CB-\uD7FB\uF900-\uFA6D\uFA70-\uFAD9\uFB00-\uFB06\uFB13-\uFB17\uFB1D-\uFB28\uFB2A-\uFB36\uFB38-\uFB3C\uFB3E\uFB40\uFB41\uFB43\uFB44\uFB46-\uFBB1\uFBD3-\uFD3D\uFD50-\uFD8F\uFD92-\uFDC7\uFDF0-\uFDFB\uFE70-\uFE74\uFE76-\uFEFC\uFF21-\uFF3A\uFF41-\uFF5A\uFF66-\uFFBE\uFFC2-\uFFC7\uFFCA-\uFFCF\uFFD2-\uFFD7\uFFDA-\uFFDC]
matches a character that isn't a Unicode letter.
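On engines with ES2018 property escapes, roughly the same effect can be sketched without the hand-written ranges. The class [^\p{L}\p{M}] below is my approximation of the long list above, not an exact equivalent, and the sample text is invented:

```javascript
// Replace every run of characters that are neither letters nor combining
// marks with a single space, then trim the ends.
const text = "Zażółć gęślą jaźń, 100% OK!";
const cleaned = text.replace(/[^\p{L}\p{M}]+/gu, " ").trim();

console.log(cleaned); // "Zażółć gęślą jaźń OK"
```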
It depends on what engine you are working with. It also depends on how your Unicode characters are encoded — are they encoded as a single character, or as a character+mark combination?
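You can try the following: \p{L} for letters encoded as a single code point, and \P{M}\p{M}* for character+mark combinations. (Some engines also allow the possessive form \P{M}\p{M}*+; JavaScript does not support possessive quantifiers.)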
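The difference between the two encodings can be sketched in JavaScript as follows, assuming an ES2018 engine; the variable names are only illustrative:

```javascript
const composed = "\u00F3";    // "ó" as a single code point
const decomposed = "o\u0301"; // "o" followed by COMBINING ACUTE ACCENT

console.log(/^\p{L}$/u.test(composed));           // true
console.log(/^\p{L}$/u.test(decomposed));         // false: two code points
console.log(/^\P{M}\p{M}*$/u.test(decomposed));   // true: base character plus marks
console.log(decomposed.normalize("NFC") === composed); // true
```

Normalizing input to NFC first is often the simpler route when only precomposed letters need to be matched.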
So, finally I decided to write my own regex, because it seems like there isn't any fast and simple way to do that in JavaScript.
I listed all the unnecessary characters that came to my mind: ones that can appear on a typical website and aren't needed to understand a single word (I left the ' character in because it is quite important in English ;) ). If you want, you can edit my answer and add your own.
[:;.,\?!\-()~\\\/"|®##$%^&*+-]
JS:
text = text.replace(/[:;\.,\?!\-\(\)~\\\/"|®##$%^&*+-]/g, "");
I want to validate any string that contains çÇöÖİşŞüÜğĞ chars and is at least 5 chars long. The string to validate can contain spaces. The RegEx must validate something like "asd Çğ ğT i", for example.
Any reply will be helpful.
Thanks.
You can use escape sequences of the form
\uXXXX
where each "X" can be any hex digit. Thus:
\u0020
is the same as a plain space character, and
\u0041
is upper-case "A". Thus you can encode the Unicode values for the characters you're interested in and then include them in a regex character class. To make sure the string is at least five characters long, you can use a quantifier in the regex.
You'll end up with something like:
var regex = /^[A-Za-z\u00nn\u00nn\u00nn]{5,}$/;
where those "00nn" things would be the appropriate values. As to exactly what those values are, you should be able to find them on a reference site like this one or maybe this one. For example I think that "Ö" is \u00D6. (Some of your characters are in the Unicode Latin-1 Supplement, while others are in Latin Extended A.)
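Filled in for the characters listed in the question, the pattern might look like the sketch below. The code points are the standard Unicode values for those letters; adding the space to the class is my assumption, based on the question saying the string may contain spaces:

```javascript
// ç Ç ö Ö İ ş Ş ü Ü ğ Ğ written as \uXXXX escapes, plus ASCII letters and space.
const regex = /^[A-Za-z \u00E7\u00C7\u00F6\u00D6\u0130\u015F\u015E\u00FC\u00DC\u011F\u011E]{5,}$/;

console.log(regex.test("asd Çğ ğT i")); // true
console.log(regex.test("asd"));         // false: shorter than 5 characters
console.log(regex.test("asd Çğ 1"));    // false: digits are not in the class
```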