I need a regular expression to match against all punctuation marks, such as the standard [,!##$%^&*()], but including international marks like the upside-down Spanish question mark, Chinese periods, etc. My google-fu is coming up short. Does anyone have such a regular expression on hand that's compatible with Javascript?
Adding to #stema's answer (https://stackoverflow.com/a/7578937/114140)... here is the regex as a string (so you don't need to bloat your project with XRegExp).
!-#%-\x2A,-/:;\x3F#\x5B-\x5D_\x7B}\u00A1\u00A7\u00AB\u00B6\u00B7\u00BB\u00BF\u037E\u0387\u055A-\u055F\u0589\u058A\u05BE\u05C0\u05C3\u05C6\u05F3\u05F4\u0609\u060A\u060C\u060D\u061B\u061E\u061F\u066A-\u066D\u06D4\u0700-\u070D\u07F7-\u07F9\u0830-\u083E\u085E\u0964\u0965\u0970\u0AF0\u0DF4\u0E4F\u0E5A\u0E5B\u0F04-\u0F12\u0F14\u0F3A-\u0F3D\u0F85\u0FD0-\u0FD4\u0FD9\u0FDA\u104A-\u104F\u10FB\u1360-\u1368\u1400\u166D\u166E\u169B\u169C\u16EB-\u16ED\u1735\u1736\u17D4-\u17D6\u17D8-\u17DA\u1800-\u180A\u1944\u1945\u1A1E\u1A1F\u1AA0-\u1AA6\u1AA8-\u1AAD\u1B5A-\u1B60\u1BFC-\u1BFF\u1C3B-\u1C3F\u1C7E\u1C7F\u1CC0-\u1CC7\u1CD3\u2010-\u2027\u2030-\u2043\u2045-\u2051\u2053-\u205E\u207D\u207E\u208D\u208E\u2329\u232A\u2768-\u2775\u27C5\u27C6\u27E6-\u27EF\u2983-\u2998\u29D8-\u29DB\u29FC\u29FD\u2CF9-\u2CFC\u2CFE\u2CFF\u2D70\u2E00-\u2E2E\u2E30-\u2E3B\u3001-\u3003\u3008-\u3011\u3014-\u301F\u3030\u303D\u30A0\u30FB\uA4FE\uA4FF\uA60D-\uA60F\uA673\uA67E\uA6F2-\uA6F7\uA874-\uA877\uA8CE\uA8CF\uA8F8-\uA8FA\uA92E\uA92F\uA95F\uA9C1-\uA9CD\uA9DE\uA9DF\uAA5C-\uAA5F\uAADE\uAADF\uAAF0\uAAF1\uABEB\uFD3E\uFD3F\uFE10-\uFE19\uFE30-\uFE52\uFE54-\uFE61\uFE63\uFE68\uFE6A\uFE6B\uFF01-\uFF03\uFF05-\uFF0A\uFF0C-\uFF0F\uFF1A\uFF1B\uFF1F\uFF20\uFF3B-\uFF3D\uFF3F\uFF5B\uFF5D\uFF5F-\uFF65
I used this in my own project with some additions...
// any kind of punctuation character (including international e.g. Chinese and Spanish punctuation)
// author: http://www.regular-expressions.info/unicode.html
// source: https://github.com/slevithan/xregexp/blob/41f4cd3fc0a8540c3c71969a0f81d1f00e9056a9/src/addons/unicode/unicode-categories.js#L142
// note: XRegExp unicode output taken from http://jsbin.com/uFiNeDOn/3/edit?js,console (see chrome console.log), then converted back to JS escaped unicode here http://rishida.net/tools/conversion/, then tested on http://regexpal.com/
// suggested by: https://stackoverflow.com/a/7578937
// added: extra characters like "$", "\uFFE5" [yen symbol], "^", "+", "=" which are not consider punctuation in the XRegExp regex (they are currency or mathmatical characters)
// added: \u3000-\u303F Chinese Punctuation for good measure
var regex_characters_to_remove = /[\$\uFFE5\^\+=`~<>{}\[\]|\u3000-\u303F!-#%-\x2A,-/:;\x3F#\x5B-\x5D_\x7B}\u00A1\u00A7\u00AB\u00B6\u00B7\u00BB\u00BF\u037E\u0387\u055A-\u055F\u0589\u058A\u05BE\u05C0\u05C3\u05C6\u05F3\u05F4\u0609\u060A\u060C\u060D\u061B\u061E\u061F\u066A-\u066D\u06D4\u0700-\u070D\u07F7-\u07F9\u0830-\u083E\u085E\u0964\u0965\u0970\u0AF0\u0DF4\u0E4F\u0E5A\u0E5B\u0F04-\u0F12\u0F14\u0F3A-\u0F3D\u0F85\u0FD0-\u0FD4\u0FD9\u0FDA\u104A-\u104F\u10FB\u1360-\u1368\u1400\u166D\u166E\u169B\u169C\u16EB-\u16ED\u1735\u1736\u17D4-\u17D6\u17D8-\u17DA\u1800-\u180A\u1944\u1945\u1A1E\u1A1F\u1AA0-\u1AA6\u1AA8-\u1AAD\u1B5A-\u1B60\u1BFC-\u1BFF\u1C3B-\u1C3F\u1C7E\u1C7F\u1CC0-\u1CC7\u1CD3\u2010-\u2027\u2030-\u2043\u2045-\u2051\u2053-\u205E\u207D\u207E\u208D\u208E\u2329\u232A\u2768-\u2775\u27C5\u27C6\u27E6-\u27EF\u2983-\u2998\u29D8-\u29DB\u29FC\u29FD\u2CF9-\u2CFC\u2CFE\u2CFF\u2D70\u2E00-\u2E2E\u2E30-\u2E3B\u3001-\u3003\u3008-\u3011\u3014-\u301F\u3030\u303D\u30A0\u30FB\uA4FE\uA4FF\uA60D-\uA60F\uA673\uA67E\uA6F2-\uA6F7\uA874-\uA877\uA8CE\uA8CF\uA8F8-\uA8FA\uA92E\uA92F\uA95F\uA9C1-\uA9CD\uA9DE\uA9DF\uAA5C-\uAA5F\uAADE\uAADF\uAAF0\uAAF1\uABEB\uFD3E\uFD3F\uFE10-\uFE19\uFE30-\uFE52\uFE54-\uFE61\uFE63\uFE68\uFE6A\uFE6B\uFF01-\uFF03\uFF05-\uFF0A\uFF0C-\uFF0F\uFF1A\uFF1B\uFF1F\uFF20\uFF3B-\uFF3D\uFF3F\uFF5B\uFF5D\uFF5F-\uFF65]+/g
If it's possible for you to use a plugin, there is a plugin for JavaScript: XRegExp Unicode plugins. That adds support for Unicode categories, scripts, and blocks (I personally have only read about it, I never used it).
With this plugin it should be possible to use Unicode categories like \p{P} as explained at regular-expressions.info.
Update:
OK, I tested it, and it seems to work fine.
You need to get the lib from XRegExp and additionally the Unicode Base and Unicode Category plugins (linked above).
<script src="xregexp.js"></script>
<script src="addons/unicode-base.js"></script>
<script src="addons/unicode-categories.js"></script>
<script>
var unicodePunctuation = XRegExp("^\\p{P}+$");
alert(unicodePunctuation.test("?.,;!¡¿。、·")); // true
</script>
The above alerts true. I included some Spanish and Chinese punctuation in my test string, "?.,;!¡¿。、·".
From ES 2018, Unicode property escapes are supported. You can use \p{Punctuation} or just \p{P} (the same as the XRegExp answer) to match any punctuation character (by the Unicode definition), or \P{Punctuation} to match any non-punctuation character.
If you want to match any "non-word" character, like a Unicode version of \W, you can try something like:
[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]
(as recommended in the proposal for the feature). You might want to remove \p{Connector_Punctuation}, since that includes underscores and similar.
Don't forget to add the u flag to your regular expression to make it Unicode-aware and enable this feature.
Well... idk how extensive it would be, but you could use this:
[^\w\s\n\t]
Your regex would look something like...
/[,!##$%^&*()\u9999]/
Where you replace each \u9999 with the Unicode codepoint for the other punctuation characters.
If you could find a bunch in a range, you could specify that with the - range operand, e.g. \u9990-\u9999.
As far as I know you can't use something like \pP in JavaScript regexes.
For Python this regex to remove from the start and end any type of punctuation marks:
import re
def cleanspecialcharacters(str):
regex = re.compile((
'^[/\"_\(\)&*\$¥\^\+=`~<>\{\}\[\]\|\-!#%\,\:;#¡§«¶·»¿;·՚-՟։֊؉،॥॰෴๏๚๛༄-༒༔༺-༽྅჻፠-፨᐀᙭᙮។-៖៘-៚‧‰-⁃⁅-⁑⁓-⁞⁽⁾₍₎、〃〈-【】〔-〟〰〽゠・﴾﴿︐-︙︰-﹒﹔-﹡﹣﹨﹪﹫!-#%-*,-/:;?@[-]_{}⦅-・〔〕《》]*|'
'([/\"_\(\)&*\$¥\^\+=`~<>\{\}\[\]\|\-!#%\,\:;#¡§«¶·»¿;·՚-՟։֊؉،॥॰෴๏๚๛༄-༒༔༺-༽྅჻፠-፨᐀᙭᙮។-៖៘-៚‧‰-⁃⁅-⁑⁓-⁞⁽⁾₍₎、〃〈-【】〔-〟〰〽゠・﴾﴿︐-︙︰-﹒﹔-﹡﹣﹨﹪﹫!-#%-*,-/:;?@[-]_{}⦅-・〔〕《》])*$'))
str = regex.sub('', str)
return str
Related
i have the following regex that allows only alphabets :
/[a-zA-Z]+/
a = "abcDF"
if (a.match(/[a-zA-Z]+/) == a){
//Match
}else{
//No Match
}
How can I do this using p{L} (universal - any language like german, english etc.. )
What I tried :
a.match(/[p{l}]+/)
a.match(/[\p{l}]+/)
a.match(/p{l}/)
a.match(/\p{l}/)
but all returned null for the letter a = "aB"
Starting with ECMAScript 2018, JavaScript finally supports Unicode property escapes natively.
For older versions, you either need to define all the relevant Unicode ranges yourself. Or you can use Steven Levithan's XRegExp package with Unicode add-ons and utilize its Unicode property shortcuts:
var regex = new XRegExp("^\\p{L}*$")
var a = "abcäöüéèê"
if (regex.test(a)) {
// Match
} else {
// No Match
}
If you are willing to use Babel to build your javascript then there's a babel-plugin I have released which will transform regular expressions like /^\p{L}+$/ or /\p{^White_Space}/ into a regular expression that browsers will understand.
This is the project page: https://github.com/danielberndt/babel-plugin-utf-8-regex
You may use \p{L} with the modern ECMAScript 2018+ compliant JavaScript environments, but you need to remember that the Unicode property classes are only supported when you pass u modifier/flag:
a.match(/\p{L}+/gu)
a.match(/\p{Alphabetic}+/gu)
will match all occurrences of 1 or more Unicode letters in the a string.
NOTE that \p{Alphabetic} (\p{Alpha}) includes all letters matched by \p{L}, plus letter numbers matched by \p{Nl} (e.g. Ⅻ – a character for the roman number 12), plus some other symbols matched with \p{Other_Alphabetic} (\p{OAlpha}).
There are some things to bear in mind though when using u modifier with a regex:
You can use Unicode code point escape sequences such as \u{1F42A} for specifying characters via code points. Normal Unicode escapes such as \u03B1 only have a range of four hexadecimal digits (which equals the basic multilingual plane) (source)
"Characters of 4 bytes are handled correctly: as a single character, not two 2-byte characters" (source)
Escaping requirements to patterns compiled with u flag are more strict: you can't escape any special characters, you can only escape those that can actually behave as special characters. See HTML input pattern not working.
i have the following regex that allows only alphabets :
/[a-zA-Z]+/
a = "abcDF"
if (a.match(/[a-zA-Z]+/) == a){
//Match
}else{
//No Match
}
How can I do this using p{L} (universal - any language like german, english etc.. )
What I tried :
a.match(/[p{l}]+/)
a.match(/[\p{l}]+/)
a.match(/p{l}/)
a.match(/\p{l}/)
but all returned null for the letter a = "aB"
Starting with ECMAScript 2018, JavaScript finally supports Unicode property escapes natively.
For older versions, you either need to define all the relevant Unicode ranges yourself. Or you can use Steven Levithan's XRegExp package with Unicode add-ons and utilize its Unicode property shortcuts:
var regex = new XRegExp("^\\p{L}*$")
var a = "abcäöüéèê"
if (regex.test(a)) {
// Match
} else {
// No Match
}
If you are willing to use Babel to build your javascript then there's a babel-plugin I have released which will transform regular expressions like /^\p{L}+$/ or /\p{^White_Space}/ into a regular expression that browsers will understand.
This is the project page: https://github.com/danielberndt/babel-plugin-utf-8-regex
You may use \p{L} with the modern ECMAScript 2018+ compliant JavaScript environments, but you need to remember that the Unicode property classes are only supported when you pass u modifier/flag:
a.match(/\p{L}+/gu)
a.match(/\p{Alphabetic}+/gu)
will match all occurrences of 1 or more Unicode letters in the a string.
NOTE that \p{Alphabetic} (\p{Alpha}) includes all letters matched by \p{L}, plus letter numbers matched by \p{Nl} (e.g. Ⅻ – a character for the roman number 12), plus some other symbols matched with \p{Other_Alphabetic} (\p{OAlpha}).
There are some things to bear in mind though when using u modifier with a regex:
You can use Unicode code point escape sequences such as \u{1F42A} for specifying characters via code points. Normal Unicode escapes such as \u03B1 only have a range of four hexadecimal digits (which equals the basic multilingual plane) (source)
"Characters of 4 bytes are handled correctly: as a single character, not two 2-byte characters" (source)
Escaping requirements to patterns compiled with u flag are more strict: you can't escape any special characters, you can only escape those that can actually behave as special characters. See HTML input pattern not working.
I have a field in my application where users can enter a hashtag.
I want to validate their entry and make sure they enter what would be a proper HashTag.
It can be in any language and it should NOT precede with the # sign.
I am writing in JavaScript.
So the following are GOOD examples:
Abcde45454_fgfgfg (good because: only letters, numbers and _)
2014_is-the-year (good because: only letters, numbers, _ and -)
בר_רפאלי (good because: only letters and _)
арбуз (good because: only letters)
And the following are BAD examples:
Dan Brown (Bad because has a space)
OMG!!!!! (Bad because has !)
בר רפ#לי (Bad because has # and a space)
We had a regex that matched only a-zA-Z0-9, we needed to add language support so we changed it to ignore white spaces and forgot to ignore special characters, so here I am.
Some other StackOverflow examples I saw but didn't work for me:
Other languges don't work
Again, English only
[edit]
Added explanation why bad is bad and good is good
I don't want a preceding # character, but if I would to add a # in the beginning, it should be a valid hashtag
Basically I don't want to allow any special characters like !##$%^&*()=+./,[{]};:'"?><
If your disallowed characters list is thorough (!##$%^&*()=+./,[{]};:'"?><), then the regex is:
^#?[^\s!##$%^&*()=+./,\[{\]};:'"?><]+$
Demo
This allows an optional leading # sign: #?. It disallows the special characters using a negative character class. I just added \s to the list (spaces), and also I escaped [ and ].
Unfortunately, you can't use constructs like \p{P} (Unicode punctuation) in JavaScript's regexes, so you basically have to blacklist characters or take a different approach if the regex solution isn't good enough for your needs.
I don't understand why this question does not get more votes. Hashtag detection for multiple languages is a problem. The only working option I could find is posted by Lucas above (all other ones do not work so well).
It needs a modification though:
#[^\s!##$%^&*()=+.\/,\[{\]};:'"?><]+
DEMO
this detects all the hashtags, not only in the beginning of the string, fixes an unescaped character, and removes the unnecessary $ in the end.
First if we exclude all symbol it will not a handy solution. Because symbol depends on keyboard layout and there are hundreds of math symbols and so on. So use this..
[\p{sc=Bengali}|\p{L}_\p{N}]+
1. If you think if language need extra care include like \p{sc=Bengali}|\p{sc=Spanish} etc. Suppose bangla has surrogate alphabet like া, ে ৌ etc so codepoint need to recognize Bangla separately first by \p{sc=Bengali}
2. Than use \p{L} that matches anything that is a Unicode letter a-z and letters like é,ü,ğ,i,ç too or normal any alphabet without complex...matches a single code point in the category "letter"
3. _ underscore allowed
4. \p{N} matches any kind of numeric character in any script. (\d matches only a digit (equal to [0-9]) but for allowed Unicode digit \p{N} only option, because its works with any digit codepoint)
I want to delete from string all characters that are not letters.
I know that there is something like \W in regex, but it considers non-English characters as not letters. For example my script deletes all Polish letters (like "ą", "ć", "ó"), but I need them.
How to tell regex to do this?
Code:
var text = text.replace(/\W/g, ' ');
You can either use Steve Levithan's XRegExp library (with Unicode plugins), or you have to define the Unicode character range manually, since JavaScript doesn't support Unicode properties.
[^\u0041-\u005A\u0061-\u007A\u00AA\u00B5\u00BA\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02C1\u02C6-\u02D1\u02E0-\u02E4\u02EC\u02EE\u0345\u0370-\u0374\u0376\u0377\u037A-\u037D\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03F5\u03F7-\u0481\u048A-\u0527\u0531-\u0556\u0559\u0561-\u0587\u05B0-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7\u05D0-\u05EA\u05F0-\u05F2\u0610-\u061A\u0620-\u0657\u0659-\u065F\u066E-\u06D3\u06D5-\u06DC\u06E1-\u06E8\u06ED-\u06EF\u06FA-\u06FC\u06FF\u0710-\u073F\u074D-\u07B1\u07CA-\u07EA\u07F4\u07F5\u07FA\u0800-\u0817\u081A-\u082C\u0840-\u0858\u08A0\u08A2-\u08AC\u08E4-\u08E9\u08F0-\u08FE\u0900-\u093B\u093D-\u094C\u094E-\u0950\u0955-\u0963\u0971-\u0977\u0979-\u097F\u0981-\u0983\u0985-\u098C\u098F\u0990\u0993-\u09A8\u09AA-\u09B0\u09B2\u09B6-\u09B9\u09BD-\u09C4\u09C7\u09C8\u09CB\u09CC\u09CE\u09D7\u09DC\u09DD\u09DF-\u09E3\u09F0\u09F1\u0A01-\u0A03\u0A05-\u0A0A\u0A0F\u0A10\u0A13-\u0A28\u0A2A-\u0A30\u0A32\u0A33\u0A35\u0A36\u0A38\u0A39\u0A3E-\u0A42\u0A47\u0A48\u0A4B\u0A4C\u0A51\u0A59-\u0A5C\u0A5E\u0A70-\u0A75\u0A81-\u0A83\u0A85-\u0A8D\u0A8F-\u0A91\u0A93-\u0AA8\u0AAA-\u0AB0\u0AB2\u0AB3\u0AB5-\u0AB9\u0ABD-\u0AC5\u0AC7-\u0AC9\u0ACB\u0ACC\u0AD0\u0AE0-\u0AE3\u0B01-\u0B03\u0B05-\u0B0C\u0B0F\u0B10\u0B13-\u0B28\u0B2A-\u0B30\u0B32\u0B33\u0B35-\u0B39\u0B3D-\u0B44\u0B47\u0B48\u0B4B\u0B4C\u0B56\u0B57\u0B5C\u0B5D\u0B5F-\u0B63\u0B71\u0B82\u0B83\u0B85-\u0B8A\u0B8E-\u0B90\u0B92-\u0B95\u0B99\u0B9A\u0B9C\u0B9E\u0B9F\u0BA3\u0BA4\u0BA8-\u0BAA\u0BAE-\u0BB9\u0BBE-\u0BC2\u0BC6-\u0BC8\u0BCA-\u0BCC\u0BD0\u0BD7\u0C01-\u0C03\u0C05-\u0C0C\u0C0E-\u0C10\u0C12-\u0C28\u0C2A-\u0C33\u0C35-\u0C39\u0C3D-\u0C44\u0C46-\u0C48\u0C4A-\u0C4C\u0C55\u0C56\u0C58\u0C59\u0C60-\u0C63\u0C82\u0C83\u0C85-\u0C8C\u0C8E-\u0C90\u0C92-\u0CA8\u0CAA-\u0CB3\u0CB5-\u0CB9\u0CBD-\u0CC4\u0CC6-\u0CC8\u0CCA-\u0CCC\u0CD5\u0CD6\u0CDE\u0CE0-\u0CE3\u0CF1\u0CF2\u0D02\u0D03\u0D05-\u0D0C\u0D0E-\u0D10\u0D12-\u0D3A\u0D3D-\u0D44\u0D46-\u0D48\u0D4A-\u0D4C\u0D4E\u0D57\u0D60-\u0D63\u0D7A-\u0D7F\u0D82\u0D83\u0D85-\u0D96\u0D9A-\u0DB1\u0DB3-\u0DBB\u0DBD\u0DC0-\u0DC6\u0DCF-\u0DD4\u0DD6\u0DD8-\u0DDF\u0DF2\u0DF3\u0E01-\u0E3A\u0E40-\u0E46\u0E4D\u0E81\u0E82\u0E84\u0E87\u0E88\u0E8A\u0E8D\u0E94-\u0E97\u0E99-\u0E9F\u0EA1-\u0EA3\u0EA5\u0EA7\u0EAA\u0EAB\u0EAD-\u0EB9\u0EBB-\u0EBD\u0EC0-\u0EC4\u0EC6\u0ECD\u0EDC-\u0EDF\u0F00\u0F40-\u0F47\u0F49-\u0F6C\u0F71-\u0F81\u0F88-\u0F97\u0F99-\u0FBC\u1000-\u1036\u1038\u103B-\u103F\u1050-\u1062\u1065-\u1068\u106E-\u1086\u108E\u109C\u109D\u10A0-\u10C5\u10C7\u10CD\u10D0-\u10FA\u10FC-\u1248\u124A-\u124D\u1250-\u1256\u1258\u125A-\u125D\u1260-\u1288\u128A-\u128D\u1290-\u12B0\u12B2-\u12B5\u12B8-\u12BE\u12C0\u12C2-\u12C5\u12C8-\u12D6\u12D8-\u1310\u1312-\u1315\u1318-\u135A\u135F\u1380-\u138F\u13A0-\u13F4\u1401-\u166C\u166F-\u167F\u1681-\u169A\u16A0-\u16EA\u16EE-\u16F0\u1700-\u170C\u170E-\u1713\u1720-\u1733\u1740-\u1753\u1760-\u176C\u176E-\u1770\u1772\u1773\u1780-\u17B3\u17B6-\u17C8\u17D7\u17DC\u1820-\u1877\u1880-\u18AA\u18B0-\u18F5\u1900-\u191C\u1920-\u192B\u1930-\u1938\u1950-\u196D\u1970-\u1974\u1980-\u19AB\u19B0-\u19C9\u1A00-\u1A1B\u1A20-\u1A5E\u1A61-\u1A74\u1AA7\u1B00-\u1B33\u1B35-\u1B43\u1B45-\u1B4B\u1B80-\u1BA9\u1BAC-\u1BAF\u1BBA-\u1BE5\u1BE7-\u1BF1\u1C00-\u1C35\u1C4D-\u1C4F\u1C5A-\u1C7D\u1CE9-\u1CEC\u1CEE-\u1CF3\u1CF5\u1CF6\u1D00-\u1DBF\u1E00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FBC\u1FBE\u1FC2-\u1FC4\u1FC6-\u1FCC\u1FD0-\u1FD3\u1FD6-\u1FDB\u1FE0-\u1FEC\u1FF2-\u1FF4\u1FF6-\u1FFC\u2071\u207F\u2090-\u209C\u2102\u2107\u210A-\u2113\u2115\u2119-\u211D\u2124\u2126\u2128\u212A-\u212D\u212F-\u2139\u213C-\u213F\u2145-\u2149\u214E\u2160-\u2188\u24B6-\u24E9\u2C00-\u2C2E\u2C30-\u2C5E\u2C60-\u2CE4\u2CEB-\u2CEE\u2CF2\u2CF3\u2D00-\u2D25\u2D27\u2D2D\u2D30-\u2D67\u2D6F\u2D80-\u2D96\u2DA0-\u2DA6\u2DA8-\u2DAE\u2DB0-\u2DB6\u2DB8-\u2DBE\u2DC0-\u2DC6\u2DC8-\u2DCE\u2DD0-\u2DD6\u2DD8-\u2DDE\u2DE0-\u2DFF\u2E2F\u3005-\u3007\u3021-\u3029\u3031-\u3035\u3038-\u303C\u3041-\u3096\u309D-\u309F\u30A1-\u30FA\u30FC-\u30FF\u3105-\u312D\u3131-\u318E\u31A0-\u31BA\u31F0-\u31FF\u3400-\u4DB5\u4E00-\u9FCC\uA000-\uA48C\uA4D0-\uA4FD\uA500-\uA60C\uA610-\uA61F\uA62A\uA62B\uA640-\uA66E\uA674-\uA67B\uA67F-\uA697\uA69F-\uA6EF\uA717-\uA71F\uA722-\uA788\uA78B-\uA78E\uA790-\uA793\uA7A0-\uA7AA\uA7F8-\uA801\uA803-\uA805\uA807-\uA80A\uA80C-\uA827\uA840-\uA873\uA880-\uA8C3\uA8F2-\uA8F7\uA8FB\uA90A-\uA92A\uA930-\uA952\uA960-\uA97C\uA980-\uA9B2\uA9B4-\uA9BF\uA9CF\uAA00-\uAA36\uAA40-\uAA4D\uAA60-\uAA76\uAA7A\uAA80-\uAABE\uAAC0\uAAC2\uAADB-\uAADD\uAAE0-\uAAEF\uAAF2-\uAAF5\uAB01-\uAB06\uAB09-\uAB0E\uAB11-\uAB16\uAB20-\uAB26\uAB28-\uAB2E\uABC0-\uABEA\uAC00-\uD7A3\uD7B0-\uD7C6\uD7CB-\uD7FB\uF900-\uFA6D\uFA70-\uFAD9\uFB00-\uFB06\uFB13-\uFB17\uFB1D-\uFB28\uFB2A-\uFB36\uFB38-\uFB3C\uFB3E\uFB40\uFB41\uFB43\uFB44\uFB46-\uFBB1\uFBD3-\uFD3D\uFD50-\uFD8F\uFD92-\uFDC7\uFDF0-\uFDFB\uFE70-\uFE74\uFE76-\uFEFC\uFF21-\uFF3A\uFF41-\uFF5A\uFF66-\uFFBE\uFFC2-\uFFC7\uFFCA-\uFFCF\uFFD2-\uFFD7\uFFDA-\uFFDC]
matches a character that isn't a Unicode letter.
It depends on what engine you are working with. It also depends on how your Unicode characters are encoded — are they encoded as a single character, or as a character+mark combination?
You can try the following: \p{L} to target character+mark combinations, and \P{M}\p{M}*+ for the single character encodings.
So, finally I decided to write my own regex condition, because it seems like it there isn't any fast&simple way to do that in javascript.
I added here all unnecessary characters that came to my mind, could be in typical website and aren't needed to understand single word (I left ' character because in English it is quite important ;) ). If you want you can edit my answer and add your own ones.
[:;.,\?!-()~\/"|®##$%^&*+-]
JS:
text = text.replace(/[:;\.,\?!\-\(\)~\\\/"|®##$%^&*+-]/, "");
I need a regular expression to allow all alphabet characters plus Greek/German alphabet in a string but replace those symbols ?,&,^,". with *
I skipped the list with characters to escape to made the question simple.
I really want to see how to construct this and afterwards include alphabet sets using ASCII codes.
if you have a finite and short set of elements to replace you could just use a class e.g.
string.replace(/[?\^&]/g, '*');
and add as many symbols as you want to reject. you could also add ranges of unicode symbols you want to replace (e.g. \u017F-\036F\u0400-\uFFFF )
otherwise use a a class to specify what symbols don't need to be replaced, like a-z, accented/diacritic letters and greek symbols
string.replace(/[^a-z\00C0-\017E\u0370-\03FF]/gi, '*');
You have to use the XRegexp plugin, along with the Unicode add-on.
Once you have that, you can use modern regexes like /[\p{L}\p{Nl}]/, which necessarily also includes those \p{Greek} code points which are letters or letter-numbers. But you could also match /[\p{Latin}\p{Greek}]/ if you wanted.
Javascript’s own regexes are terrible. Use XRegexp.
So something like: /^[^?&\^"]*$/ (that means the string is composed only of characters outside the five you listed)...
But if you want to have the greek characters and the unicode characters (what are unicode characters? àèéìòù? Japanese?) perhaps you'll have to use http://xregexp.com/ It is a regex library for javascript that includes character classes for the various unicode character classes (I know I'm repeating myself) plus other "commands" for unicode handling.