Formatting to only alphanumerical including unicode characters - javascript

I have this script
key = val.replace(/[^a-z0-9\s]/gi, '').replace(/[_\s]/g, '+')
It works fine, but it also takes out unicode characters such as აბგდევზთიკლმნოპჟრსტუფქღყშჩცძწჭხჯჰ
My question is: how can I change the given code so that it accepts these characters as well?

JavaScript regular expressions did not support Unicode properties before ES2018, so in older engines you have to add explicit Unicode ranges to your expression. For example, the Georgian block is 10A0-10FF, so to remove everything that is not a word character (\w: ASCII letters, digits, and underscore) or a Georgian character, you need something like
val.replace(/[^\w\u10A0-\u10FF]/g, '')
This tool can help you further.
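As a minimal sketch, the range from this answer can be dropped into the question's original two-step replace. The function names toKey and toKeyUnicode are hypothetical, and the second variant assumes an ES2018+ engine with Unicode property escapes; neither is taken from the original posts.

```javascript
// Hypothetical helper: the question's script with the Georgian block
// (U+10A0-U+10FF) added to the set of characters that are kept.
function toKey(val) {
  return val
    .replace(/[^a-z0-9\s\u10A0-\u10FF]/gi, '') // drop everything else
    .replace(/[_\s]/g, '+');                   // whitespace becomes "+"
}

// Assumption: ES2018+ engine. Keeps letters and numbers of any script.
function toKeyUnicode(val) {
  return val
    .replace(/[^\p{L}\p{N}\s]/gu, '')
    .replace(/[_\s]/g, '+');
}

console.log(toKey('გამარჯობა world!'));
console.log(toKeyUnicode('გამარჯობა world!'));
```

The explicit range works in old engines but only covers the scripts you list; the property-escape version covers every script at the cost of requiring the u flag.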

You can specify allowed characters ranges manually, or you can use some library like XRegExp.


Regex: any character that is NOT a letter (but not only English letters)

I want to delete from string all characters that are not letters.
I know that there is something like \W in regex, but it considers non-English characters as not letters. For example my script deletes all Polish letters (like "ą", "ć", "ó"), but I need them.
How to tell regex to do this?
var text = text.replace(/\W/g, ' ');
You can either use Steve Levithan's XRegExp library (with Unicode plugins), or you have to define the Unicode character range manually, since JavaScript doesn't support Unicode properties.
In XRegExp syntax, \P{L} matches a character that isn't a Unicode letter.
It depends on what engine you are working with. It also depends on how your Unicode characters are encoded — are they encoded as a single character, or as a character+mark combination?
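You can try the following: \p{L} matches a single code point that is a letter, and \P{M}\p{M}* matches a base character followed by any combining marks (the character+mark combinations). The possessive form \P{M}\p{M}*+ only works in engines that support possessive quantifiers, which JavaScript does not.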
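To illustrate the single-character versus character+mark distinction, here is a small sketch that assumes an ES2018+ JavaScript engine (Unicode property escapes with the u flag). The variable names are made up for the example.

```javascript
const precomposed = '\u0105';  // "ą" stored as one code point
const decomposed = 'a\u0328';  // "a" followed by COMBINING OGONEK

// A letter plus any combining marks that follow it.
const letterWithMarks = /\p{L}\p{M}*/u;

// Runs of anything that is neither a letter nor a combining mark.
const nonLetters = /[^\p{L}\p{M}]+/gu;

// Polish letters survive; punctuation and spaces collapse to one space.
const cleaned = 'zażółć, gęślą jaźń!'.replace(nonLetters, ' ').trim();

console.log(precomposed.match(letterWithMarks)[0].length); // one code unit
console.log(decomposed.match(letterWithMarks)[0].length);  // base + mark
console.log(cleaned);
```

Keeping \p{M} in the negated class matters: without it, decomposed input would lose its accents.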
So, finally I decided to write my own regex condition, because it seems like there isn't any fast and simple way to do that in JavaScript.
I added here all unnecessary characters that came to my mind, could be in typical website and aren't needed to understand single word (I left ' character because in English it is quite important ;) ). If you want you can edit my answer and add your own ones.
text = text.replace(/[:;\.,\?!\-\(\)~\\\/"|®@#$%^&*+-]/g, "");

Regular expression to allow all alphabet characters plus unicode characters

I need a regular expression to allow all alphabet characters plus Greek/German alphabet in a string but replace those symbols ?,&,^,". with *
I skipped the list of characters to escape to keep the question simple.
I really want to see how to construct this and afterwards include alphabet sets using ASCII codes.
if you have a finite and short set of elements to replace you could just use a class e.g.
string.replace(/[?\^&]/g, '*');
and add as many symbols as you want to reject. you could also add ranges of unicode symbols you want to replace (e.g. \u017F-\u036F\u0400-\uFFFF )
otherwise use a class to specify what symbols don't need to be replaced, like a-z, accented/diacritic letters and greek symbols
string.replace(/[^a-z\u00C0-\u017E\u0370-\u03FF]/gi, '*');
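A quick sketch of the whitelist approach on a sample string. The sample text and variable names are assumptions for illustration only.

```javascript
// Everything outside a-z, U+00C0-U+017E (accented Latin, including "ß")
// and U+0370-U+03FF (Greek) is replaced with "*".
const disallowed = /[^a-z\u00C0-\u017E\u0370-\u03FF]/gi;

const masked = 'Straße αβγ?&'.replace(disallowed, '*');
console.log(masked);
```

Note that the space is replaced too, because it is not in the allowed set; add \s to the class if whitespace should be kept. The range U+00C0-U+017E also contains the non-letters × (U+00D7) and ÷ (U+00F7).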
You have to use the XRegexp plugin, along with the Unicode add-on.
Once you have that, you can use modern regexes like /[\p{L}\p{Nl}]/, which necessarily also includes those \p{Greek} code points which are letters or letter-numbers. But you could also match /[\p{Latin}\p{Greek}]/ if you wanted.
Javascript’s own regexes are terrible. Use XRegexp.
So something like: /^[^?&\^"]*$/ (that means the string is composed only of characters outside the five you listed)...
But if you want to have the Greek characters and the Unicode characters (what are Unicode characters? àèéìòù? Japanese?) perhaps you'll have to use XRegExp. It is a regex library for JavaScript that includes character classes for the various Unicode character classes (I know I'm repeating myself) plus other "commands" for Unicode handling.

Javascript regular expression for punctuation (international)?

I need a regular expression to match against all punctuation marks, such as the standard [,!@#$%^&*()], but including international marks like the upside-down Spanish question mark, Chinese periods, etc. My google-fu is coming up short. Does anyone have such a regular expression on hand that's compatible with Javascript?
Adding to @stema's answer, here is the regex as a string (so you don't need to bloat your project with XRegExp).
I used this in my own project with some additions...
// any kind of punctuation character (including international e.g. Chinese and Spanish punctuation)
// author:
// source:
// note: XRegExp unicode output taken from,console (see chrome console.log), then converted back to JS escaped unicode here, then tested on
// suggested by:
// added: extra characters like "$", "\uFFE5" [yen symbol], "^", "+", "=" which are not considered punctuation in the XRegExp regex (they are currency or mathematical characters)
// added: \u3000-\u303F Chinese Punctuation for good measure
var regex_characters_to_remove = /[\$\uFFE5\^\+=`~<>{}\[\]|\u3000-\u303F!-#%-\x2A,-/:;\x3F#\x5B-\x5D_\x7B}\u00A1\u00A7\u00AB\u00B6\u00B7\u00BB\u00BF\u037E\u0387\u055A-\u055F\u0589\u058A\u05BE\u05C0\u05C3\u05C6\u05F3\u05F4\u0609\u060A\u060C\u060D\u061B\u061E\u061F\u066A-\u066D\u06D4\u0700-\u070D\u07F7-\u07F9\u0830-\u083E\u085E\u0964\u0965\u0970\u0AF0\u0DF4\u0E4F\u0E5A\u0E5B\u0F04-\u0F12\u0F14\u0F3A-\u0F3D\u0F85\u0FD0-\u0FD4\u0FD9\u0FDA\u104A-\u104F\u10FB\u1360-\u1368\u1400\u166D\u166E\u169B\u169C\u16EB-\u16ED\u1735\u1736\u17D4-\u17D6\u17D8-\u17DA\u1800-\u180A\u1944\u1945\u1A1E\u1A1F\u1AA0-\u1AA6\u1AA8-\u1AAD\u1B5A-\u1B60\u1BFC-\u1BFF\u1C3B-\u1C3F\u1C7E\u1C7F\u1CC0-\u1CC7\u1CD3\u2010-\u2027\u2030-\u2043\u2045-\u2051\u2053-\u205E\u207D\u207E\u208D\u208E\u2329\u232A\u2768-\u2775\u27C5\u27C6\u27E6-\u27EF\u2983-\u2998\u29D8-\u29DB\u29FC\u29FD\u2CF9-\u2CFC\u2CFE\u2CFF\u2D70\u2E00-\u2E2E\u2E30-\u2E3B\u3001-\u3003\u3008-\u3011\u3014-\u301F\u3030\u303D\u30A0\u30FB\uA4FE\uA4FF\uA60D-\uA60F\uA673\uA67E\uA6F2-\uA6F7\uA874-\uA877\uA8CE\uA8CF\uA8F8-\uA8FA\uA92E\uA92F\uA95F\uA9C1-\uA9CD\uA9DE\uA9DF\uAA5C-\uAA5F\uAADE\uAADF\uAAF0\uAAF1\uABEB\uFD3E\uFD3F\uFE10-\uFE19\uFE30-\uFE52\uFE54-\uFE61\uFE63\uFE68\uFE6A\uFE6B\uFF01-\uFF03\uFF05-\uFF0A\uFF0C-\uFF0F\uFF1A\uFF1B\uFF1F\uFF20\uFF3B-\uFF3D\uFF3F\uFF5B\uFF5D\uFF5F-\uFF65]+/g
If it's possible for you to use a plugin, there is a plugin for JavaScript: XRegExp Unicode plugins. That adds support for Unicode categories, scripts, and blocks (I personally have only read about it, I never used it).
With this plugin it should be possible to use Unicode categories like \p{P} as explained at
OK, I tested it, and it seems to work fine.
You need to get the lib from XRegExp and additionally the Unicode Base and Unicode Category plugins (linked above).
<script src="xregexp.js"></script>
<script src="addons/unicode-base.js"></script>
<script src="addons/unicode-categories.js"></script>
var unicodePunctuation = XRegExp("^\\p{P}+$");
alert(unicodePunctuation.test("?.,;!¡¿。、·")); // true
The above alerts true. I included some Spanish and Chinese punctuation in my test string, "?.,;!¡¿。、·".
From ES 2018, Unicode property escapes are supported. You can use \p{Punctuation} or just \p{P} (the same as the XRegExp answer) to match any punctuation character (by the Unicode definition), or \P{Punctuation} to match any non-punctuation character.
If you want to match any "non-word" character, like a Unicode version of \W, you can try something like:
(as recommended in the proposal for the feature). You might want to remove \p{Connector_Punctuation}, since that includes underscores and similar.
Don't forget to add the u flag to your regular expression to make it Unicode-aware and enable this feature.
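As a sketch of what this answer describes, assuming an ES2018+ engine: the first pattern uses \p{P} for punctuation, and the second is the Unicode-aware "non-word" class built from the properties the answer names (the exact pattern in the original post was lost, so this is a reconstruction from the property-escapes proposal, not a quote). Variable names are hypothetical.

```javascript
// Any punctuation character, by the Unicode definition.
const punctuation = /\p{P}/gu;

// Unicode-aware counterpart of \W: anything that is not alphabetic,
// a mark, a decimal digit, connector punctuation (e.g. "_"),
// or a join control character.
const nonWord = /[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/gu;

const noPunct = '¿Qué? 你好。'.replace(punctuation, '');
const wordOnly = 'héllo, wörld_1!'.replace(nonWord, '');

console.log(noPunct);
console.log(wordOnly);
```

Dropping \p{Connector_Punctuation} from the second class would make it strip underscores as well.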
Well... idk how extensive it would be, but you could use this:
Your regex would look something like...
Where you replace each \u9999 with the Unicode codepoint for the other punctuation characters.
If you could find a bunch in a range, you could specify that with the - range operand, e.g. \u9990-\u9999.
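A minimal sketch of the explicit code point approach for engines without property escapes. The particular characters and ranges chosen here are an illustrative assumption, not a complete list of international punctuation.

```javascript
// ASCII punctuation ranges plus a few hand-picked international marks:
// U+00A1 (¡), U+00BF (¿), U+3001 (、), U+3002 (。),
// and the fullwidth block U+FF01-U+FF0F.
const somePunctuation =
  /[\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E\u00A1\u00BF\u3001\u3002\uFF01-\uFF0F]/g;

const stripped = '¡Hola! 你好。'.replace(somePunctuation, '');
console.log(stripped);
```

The trade-off is maintenance: every mark you care about has to be listed by hand, which is why the longer generated regex earlier in this thread exists.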
As far as I know you can't use something like \pP in JavaScript regexes.
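For Python, this function removes any type of punctuation mark from the start and end of a string:
import re
def cleanspecialcharacters(text):
    # The original pattern was cut off in this post. As a stand-in (an
    # assumption), strip runs of non-word characters and underscores from
    # both ends; Python 3's \W is Unicode-aware for str patterns, so it
    # also strips whitespace and symbols, not only punctuation.
    regex = re.compile(r'^[\W_]+|[\W_]+$')
    return regex.sub('', text)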

Validating any string with RegEx

I want to validate any string that may contain the characters çÇöÖİşŞüÜğĞ and is at least 5 characters long. The string to validate can contain spaces. The regex must accept "asd Çğ ğT i", for example.
Any reply will be helpful.
You can use escape sequences of the form
\uXXXX
where each "X" can be any hex digit. Thus:
\u0020
is the same as a plain space character, and
\u0041
is upper-case "A". Thus you can encode the Unicode values for the characters you're interested in and then include them in a regex character class. To make sure the string is at least five characters long, you can use a quantifier in the regex.
You'll end up with something like:
var regex = /^[A-Za-z\u00nn\u00nn\u00nn]{5,}$/;
where those "00nn" things would be the appropriate values. As to exactly what those values are, you should be able to find them on a reference site like this one or maybe this one. For example I think that "Ö" is \u00D6. (Some of your characters are in the Unicode Latin-1 Supplement, while others are in Latin Extended A.)
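Filling in the placeholders for the characters listed in the question gives something like the following sketch. The variable name is hypothetical, and the space inside the class is an addition so that the question's example string with spaces is accepted.

```javascript
// ç U+00E7, Ç U+00C7, ö U+00F6, Ö U+00D6, İ U+0130, ş U+015F,
// Ş U+015E, ü U+00FC, Ü U+00DC, ğ U+011F, Ğ U+011E, plus space.
const turkishText =
  /^[A-Za-z \u00E7\u00C7\u00F6\u00D6\u0130\u015F\u015E\u00FC\u00DC\u011F\u011E]{5,}$/;

console.log(turkishText.test('asd Çğ ğT i')); // example from the question
console.log(turkishText.test('asd'));         // shorter than five characters
```

The dotless ı (U+0131) is not in the question's list, so it is not included here; add it if lower-case Turkish text should pass.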

Regex in javascript working with Cyrillic (Russian) set

Is it possible to work with Russian characters, in javascript's regex?
Maybe the use of \p{Cyrillic}?
If yes, please provide a basic example of usage.
The example:
var str1 = "абв прв фву";
var regexp = new RegExp("[вф]\\b", "g");
alert(str1.replace(regexp, "X"));
I expect to get: абX прX
Here is a good article on JavaScript regular expressions and unicode. Strings in JavaScript are 16 bit, so strings and RegExp objects can contain unicode characters, but most of the special characters like '\b', '\d', '\w' only support ascii. So your regular expression does not work as expected due to the use of '\b'. It seems you'll have to find a different way to detect word boundaries.
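One possible workaround, sketched under the assumption of an ES2018+ engine, is to replace \b with a negative lookahead that checks the next character is not a Unicode letter, digit, or underscore. The names endOfWord and replaced are made up for this example.

```javascript
const str1 = 'абв прв фву';

// "в" or "ф" that is NOT followed by a letter, number, or underscore,
// i.e. it sits at the end of a word. This stands in for [вф]\b.
const endOfWord = /[вф](?![\p{L}\p{N}_])/gu;

const replaced = str1.replace(endOfWord, 'X');
console.log(replaced);
```

In older engines the same idea works with an explicit range in the lookahead, for example (?![\u0400-\u04FF]) for the Cyrillic block.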
It should work if you just save the JavaScript file in UTF8. Then you should be able to enter any character in a string.
Just made a quick example with some Cyrillic characters from Wikipedia:
var cryllic = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюяабвгдеёжзийклмнопрстуфхцчшщъыьэюя';
cryllic.match( 'л.+а' )[0];
// returns as expected: "лмнопрстуфхцчшщъыьэюяа"
According to this:
JavaScript, which does not offer any
Unicode support through its RegExp
class, does support \uFFFF for
matching a single Unicode code point
as part of its string syntax.
so you can at least use code points, but seemingly nothing more (no classes).
Also check out this duplicate of your question.
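For engines that implement ES2018, the \p{Cyrillic} idea from the question can be written with the Script property. This is a sketch with hypothetical variable names; note that JavaScript requires the Script= prefix, so a bare \p{Cyrillic} is a syntax error under the u flag.

```javascript
// One or more consecutive characters from the Cyrillic script.
const cyrillicRun = /\p{Script=Cyrillic}+/gu;

const words = 'абв abc прв'.match(cyrillicRun);
console.log(words); // the two Cyrillic words, Latin "abc" is skipped
```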

