Regex in javascript working with Cyrillic (Russian) set


Is it possible to work with Russian characters, in javascript's regex?
Maybe the use of \p{Cyrillic}?
If yes, please provide a basic example of usage.
The example:
var str1 = "абв прв фву";
var regexp = new RegExp("[вф]\\b", "g");
alert(str1.replace(regexp, "X"));
I expect to get: абX прX

Here is a good article on JavaScript regular expressions and Unicode. Strings in JavaScript are sequences of 16-bit code units, so strings and RegExp objects can contain Unicode characters, but most of the special escapes such as '\b', '\d', and '\w' only recognize ASCII. Your regular expression does not work as expected because of '\b': Cyrillic letters do not count as word characters, so no word boundary is found between 'в' and the space that follows it. You will have to find a different way to detect word boundaries.
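One possible workaround, sketched here under the assumption that "end of word" means "not followed by another Cyrillic letter", is to replace '\b' with a negative lookahead. The variable names and the letter range are illustrative only.

```javascript
// Input from the question.
const str1 = "абв прв фву";

// '\b' is ASCII-only, so emulate the trailing boundary instead:
// match 'в' or 'ф' only when the next character is not a Cyrillic letter.
// The end of the string also satisfies the negative lookahead.
const endOfCyrillicWord = /[вф](?![а-яёА-ЯЁ])/g;

const replaced = str1.replace(endOfCyrillicWord, "X");
console.log(replaced); // "абX прX фву"
```

The third word is left unchanged because it ends in 'у', which is not in the character class.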

It should work if you just save the JavaScript file as UTF-8. Then you should be able to enter any character in a string.
Edit:
I just made a quick example with some Cyrillic characters from Wikipedia:
var cyrillic = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюяабвгдеёжзийклмнопрстуфхцчшщъыьэюя';
cyrillic.match( 'л.+а' )[0];
// returns as expected: "лмнопрстуфхцчшщъыьэюяа"

According to this:
"JavaScript, which does not offer any Unicode support through its RegExp class, does support \uFFFF for matching a single Unicode code point as part of its string syntax."
so you can at least use code points, but seemingly nothing more (no classes).
Also check out this duplicate of your question.
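As a rough illustration of the code point approach, the sketch below matches runs of characters from the basic Cyrillic block using \u escapes. The choice of the range U+0400 to U+04FF and the sample string are assumptions made for the example.

```javascript
// U+0400 to U+04FF is the basic Cyrillic block.
const cyrillicRun = /[\u0400-\u04FF]+/g;

// Collect every run of Cyrillic characters in a mixed string.
const found = "абв hello прв 123".match(cyrillicRun);
console.log(found); // ["абв", "прв"]
```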

Related

Regex special character '{' matches in JS but not in Java

test string: abc{123
regex: \w+{\d+
This matches in JS, but when I try to match it in Java it gives me this error:
Illegal repetition near index 2
\w+{\d+
It works in Java only when I escape the { character like this: \w+\{\d+
I tried it on these two links:
JS Link : http://myregexp.com/index.html
Java Link: http://www.ocpsoft.org/tutorials/regular-expressions/java-visual-regex-tester/
Desired result: If it matches in JS, it should match in Java also.
What is the difference between the regex implementation in Java and JS? How can I make it behave in the same way in Java and in JS?

How can I make it behave in the same way in Java and in JS?
You already know the answer:
It works in Java only when I escape the { character like this: \w+\{\d+
Why? Because JavaScript is a bit more permissive here. Note that in JavaScript \w{3 will match "f{3" but not "f77", while \w{3} will match "f77" but not "f{3}". That is to say, the same character { changes meaning depending on whether a } appears later in the pattern. This permissiveness makes the behaviour less predictable, and Java simply does not allow you to write regular expressions so sloppily.
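The difference can be checked directly in JavaScript. The sketch below also shows that the u flag turns off the permissive parsing, so an unescaped { is rejected there as well; the variable names are made up for the example.

```javascript
// Without a closing brace, { is treated as a literal character.
const literalBrace = /\w{3/;
console.log(literalBrace.test("f{3")); // true
console.log(literalBrace.test("f77")); // false

// With a closing brace, {3} is a quantifier.
const quantifier = /\w{3}/;
console.log(quantifier.test("f77"));  // true
console.log(quantifier.test("f{3}")); // false

// In Unicode mode the unescaped { is a syntax error.
let strictError = null;
try {
  new RegExp("\\w+{\\d+", "u");
} catch (e) {
  strictError = e;
}
console.log(strictError instanceof SyntaxError); // true

// The escaped form is unambiguous and matches the test string.
const portable = /\w+\{\d+/;
console.log(portable.test("abc{123")); // true
```

Writing the escaped form everywhere keeps the same pattern text valid in both languages.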

You have to escape special characters, and since the backslash is itself an escape character in a Java string literal, you have to escape it as well. The regex will look like this in Java source code: \\w+\\{\\d+. If you have problems, feel free to ask. You can generate code in several programming languages here: https://regex101.com/r/D4yz40/1. This example matches your string, and you can then generate the code for Java and JS.

You just need to escape the {. So the regex should look like this:
\w+\{\d+
Your initial regex isn't valid. JavaScript is just more forgiving in this case. But { is one of the characters you want to escape in a regex, since it introduces a quantifier that says how many times to repeat the preceding element: [a-z]{22} would match 22 consecutive characters from a-z.

Regex: any character that is NOT a letter (but not only English letters)

I want to delete from string all characters that are not letters.
I know that there is something like \W in regex, but it considers non-English characters as not letters. For example my script deletes all Polish letters (like "ą", "ć", "ó"), but I need them.
How to tell regex to do this?
Code:
var text = text.replace(/\W/g, ' ');

You can either use Steve Levithan's XRegExp library (with Unicode plugins), or you have to define the Unicode character range manually, since JavaScript regular expressions (before the Unicode property escapes added in ES2018) don't support Unicode properties.
[^\u0041-\u005A\u0061-\u007A\u00AA\u00B5\u00BA\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02C1\u02C6-\u02D1\u02E0-\u02E4\u02EC\u02EE\u0345\u0370-\u0374\u0376\u0377\u037A-\u037D\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03F5\u03F7-\u0481\u048A-\u0527\u0531-\u0556\u0559\u0561-\u0587\u05B0-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7\u05D0-\u05EA\u05F0-\u05F2\u0610-\u061A\u0620-\u0657\u0659-\u065F\u066E-\u06D3\u06D5-\u06DC\u06E1-\u06E8\u06ED-\u06EF\u06FA-\u06FC\u06FF\u0710-\u073F\u074D-\u07B1\u07CA-\u07EA\u07F4\u07F5\u07FA\u0800-\u0817\u081A-\u082C\u0840-\u0858\u08A0\u08A2-\u08AC\u08E4-\u08E9\u08F0-\u08FE\u0900-\u093B\u093D-\u094C\u094E-\u0950\u0955-\u0963\u0971-\u0977\u0979-\u097F\u0981-\u0983\u0985-\u098C\u098F\u0990\u0993-\u09A8\u09AA-\u09B0\u09B2\u09B6-\u09B9\u09BD-\u09C4\u09C7\u09C8\u09CB\u09CC\u09CE\u09D7\u09DC\u09DD\u09DF-\u09E3\u09F0\u09F1\u0A01-\u0A03\u0A05-\u0A0A\u0A0F\u0A10\u0A13-\u0A28\u0A2A-\u0A30\u0A32\u0A33\u0A35\u0A36\u0A38\u0A39\u0A3E-\u0A42\u0A47\u0A48\u0A4B\u0A4C\u0A51\u0A59-\u0A5C\u0A5E\u0A70-\u0A75\u0A81-\u0A83\u0A85-\u0A8D\u0A8F-\u0A91\u0A93-\u0AA8\u0AAA-\u0AB0\u0AB2\u0AB3\u0AB5-\u0AB9\u0ABD-\u0AC5\u0AC7-\u0AC9\u0ACB\u0ACC\u0AD0\u0AE0-\u0AE3\u0B01-\u0B03\u0B05-\u0B0C\u0B0F\u0B10\u0B13-\u0B28\u0B2A-\u0B30\u0B32\u0B33\u0B35-\u0B39\u0B3D-\u0B44\u0B47\u0B48\u0B4B\u0B4C\u0B56\u0B57\u0B5C\u0B5D\u0B5F-\u0B63\u0B71\u0B82\u0B83\u0B85-\u0B8A\u0B8E-\u0B90\u0B92-\u0B95\u0B99\u0B9A\u0B9C\u0B9E\u0B9F\u0BA3\u0BA4\u0BA8-\u0BAA\u0BAE-\u0BB9\u0BBE-\u0BC2\u0BC6-\u0BC8\u0BCA-\u0BCC\u0BD0\u0BD7\u0C01-\u0C03\u0C05-\u0C0C\u0C0E-\u0C10\u0C12-\u0C28\u0C2A-\u0C33\u0C35-\u0C39\u0C3D-\u0C44\u0C46-\u0C48\u0C4A-\u0C4C\u0C55\u0C56\u0C58\u0C59\u0C60-\u0C63\u0C82\u0C83\u0C85-\u0C8C\u0C8E-\u0C90\u0C92-\u0CA8\u0CAA-\u0CB3\u0CB5-\u0CB9\u0CBD-\u0CC4\u0CC6-\u0CC8\u0CCA-\u0CCC\u0CD5\u0CD6\u0CDE\u0CE0-\u0CE3\u0CF1\u0CF2\u0D02\u0D03\u0D05-\u0D0C\u0D0E-\u0D10\u0D12-\u0D3A\u0D3D-\u0D44\u0D46-\u0D48\u0D4A-\u0D4C\u0D4E\u0D57\u0D60-\u0D63\u0D7A-\u0D7F\u0D82\u0D83\u0D85-\u0D96\u0D9A-\u0DB1\u0DB3-\u0DBB\u0DBD\u0DC0-\u0DC6\u0DCF-\u0DD4\u0DD6\u0DD8-\u0DDF\u0DF2\u0DF3\u0E01-\u0E3A\u0E40-\u0E46\u0E4D\u0E81\u0E82\u0E84\u0E87\u0E88\u0E8A\u0E8D\u0E94-\u0E97\u0E99-\u0E9F\u0EA1-\u0EA3\u0EA5\u0EA7\u0EAA\u0EAB\u0EAD-\u0EB9\u0EBB-\u0EBD\u0EC0-\u0EC4\u0EC6\u0ECD\u0EDC-\u0EDF\u0F00\u0F40-\u0F47\u0F49-\u0F6C\u0F71-\u0F81\u0F88-\u0F97\u0F99-\u0FBC\u1000-\u1036\u1038\u103B-\u103F\u1050-\u1062\u1065-\u1068\u106E-\u1086\u108E\u109C\u109D\u10A0-\u10C5\u10C7\u10CD\u10D0-\u10FA\u10FC-\u1248\u124A-\u124D\u1250-\u1256\u1258\u125A-\u125D\u1260-\u1288\u128A-\u128D\u1290-\u12B0\u12B2-\u12B5\u12B8-\u12BE\u12C0\u12C2-\u12C5\u12C8-\u12D6\u12D8-\u1310\u1312-\u1315\u1318-\u135A\u135F\u1380-\u138F\u13A0-\u13F4\u1401-\u166C\u166F-\u167F\u1681-\u169A\u16A0-\u16EA\u16EE-\u16F0\u1700-\u170C\u170E-\u1713\u1720-\u1733\u1740-\u1753\u1760-\u176C\u176E-\u1770\u1772\u1773\u1780-\u17B3\u17B6-\u17C8\u17D7\u17DC\u1820-\u1877\u1880-\u18AA\u18B0-\u18F5\u1900-\u191C\u1920-\u192B\u1930-\u1938\u1950-\u196D\u1970-\u1974\u1980-\u19AB\u19B0-\u19C9\u1A00-\u1A1B\u1A20-\u1A5E\u1A61-\u1A74\u1AA7\u1B00-\u1B33\u1B35-\u1B43\u1B45-\u1B4B\u1B80-\u1BA9\u1BAC-\u1BAF\u1BBA-\u1BE5\u1BE7-\u1BF1\u1C00-\u1C35\u1C4D-\u1C4F\u1C5A-\u1C7D\u1CE9-\u1CEC\u1CEE-\u1CF3\u1CF5\u1CF6\u1D00-\u1DBF\u1E00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FBC\u1FBE\u1FC2-\u1FC4\u1FC6-\u1FCC\u1FD0-\u1FD3\u1FD6-\u1FDB\u1FE0-\u1FEC\u1FF2-\u1FF4\u1FF6-\u1FFC\u2071\u207F\u2090-\u209C\u2102\u2107\u210A-\u2113\u2115\u2119-\u211D\u2124\u2126\u2128\u212A-\u212D\u212F-\u2139\u213C-\u213F\u2145-\u2149\u214E\u2160-\u2188\u24B6-\u24E9\u2C00-\u2C2E\u2C30-\u2C5E\u2C60-\u2CE4\u2CEB-\u2CEE\u2CF2\u2CF3\u2D00-\u2D25\u2D27\u2D2D\u2D30-\u2D67\u2D6F\u2D80-\u2D96\u2DA0-\u2DA6\u2DA8-\u2DAE\u2DB0-\u2DB6\u2DB8-\u2DBE\u2DC0-\u2DC6\u2DC8-\u2DCE\u2DD0-\u2DD6\u2DD8-\u2DDE\u2DE0-\u2DFF\u2E2F\u3005-\u3007\u3021-\u3029\u3031-\u3035\u3038-\u303C\u3041-\u3096\u309D-\u309F\u30A1-\u30FA\u30FC-\u30FF\u3105-\u312D\u3131-\u318E\u31A0-\u31BA\u31F0-\u31FF\u3400-\u4DB5\u4E00-\u9FCC\uA000-\uA48C\uA4D0-\uA4FD\uA500-\uA60C\uA610-\uA61F\uA62A\uA62B\uA640-\uA66E\uA674-\uA67B\uA67F-\uA697\uA69F-\uA6EF\uA717-\uA71F\uA722-\uA788\uA78B-\uA78E\uA790-\uA793\uA7A0-\uA7AA\uA7F8-\uA801\uA803-\uA805\uA807-\uA80A\uA80C-\uA827\uA840-\uA873\uA880-\uA8C3\uA8F2-\uA8F7\uA8FB\uA90A-\uA92A\uA930-\uA952\uA960-\uA97C\uA980-\uA9B2\uA9B4-\uA9BF\uA9CF\uAA00-\uAA36\uAA40-\uAA4D\uAA60-\uAA76\uAA7A\uAA80-\uAABE\uAAC0\uAAC2\uAADB-\uAADD\uAAE0-\uAAEF\uAAF2-\uAAF5\uAB01-\uAB06\uAB09-\uAB0E\uAB11-\uAB16\uAB20-\uAB26\uAB28-\uAB2E\uABC0-\uABEA\uAC00-\uD7A3\uD7B0-\uD7C6\uD7CB-\uD7FB\uF900-\uFA6D\uFA70-\uFAD9\uFB00-\uFB06\uFB13-\uFB17\uFB1D-\uFB28\uFB2A-\uFB36\uFB38-\uFB3C\uFB3E\uFB40\uFB41\uFB43\uFB44\uFB46-\uFBB1\uFBD3-\uFD3D\uFD50-\uFD8F\uFD92-\uFDC7\uFDF0-\uFDFB\uFE70-\uFE74\uFE76-\uFEFC\uFF21-\uFF3A\uFF41-\uFF5A\uFF66-\uFFBE\uFFC2-\uFFC7\uFFCA-\uFFCF\uFFD2-\uFFD7\uFFDA-\uFFDC]
matches a character that isn't a Unicode letter.
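In engines that implement the ES2018 Unicode property escapes (an assumption about the runtime, not something the original answer relies on), the same idea can be written much more compactly with \P{L} and the u flag. The sample text and variable names below are made up for the example.

```javascript
// Sample input with Polish letters; normalize to NFC so that each accented
// letter is a single code point and is not split into letter plus mark.
const input = "zażółć gęślą jaźń, 123!".normalize("NFC");

// \P{L} matches any code point that is NOT a Unicode letter.
// The u flag is required for \p{...} and \P{...} to be recognized.
const lettersOnly = input.replace(/\P{L}/gu, " ");

console.log(lettersOnly.trim()); // "zażółć gęślą jaźń"
```

Every non-letter, including digits and punctuation, becomes a space, so the result keeps the length of the input.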

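It depends on what engine you are working with. It also depends on how your Unicode characters are encoded: as a single precomposed character, or as a base character followed by a combining mark?
You can try the following: \p{L} to target letters encoded as a single code point, and \P{M}\p{M}*+ for base character plus mark combinations. Note that the possessive quantifier *+ is not available in JavaScript, where \P{M}\p{M}* (with the u flag, in engines that support property escapes) is the closest equivalent.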
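The difference between the two encodings can be seen with a single Polish letter. This is a sketch that assumes an engine with ES2018 property escapes; the variable names are illustrative.

```javascript
const precomposed = "\u0105"; // "ą" as one code point
const decomposed = "a\u0328"; // "a" followed by a combining ogonek

// Exactly one letter code point.
const singleLetter = /^\p{L}$/u;
// One non-mark code point followed by any number of combining marks.
const baseWithMarks = /^\P{M}\p{M}*$/u;

console.log(singleLetter.test(precomposed));  // true
console.log(singleLetter.test(decomposed));   // false
console.log(baseWithMarks.test(precomposed)); // true
console.log(baseWithMarks.test(decomposed));  // true

// Normalizing to NFC turns the pair back into the precomposed letter.
console.log(decomposed.normalize("NFC") === precomposed); // true
```

Normalizing input to NFC first is often the simpler route when only letters with precomposed forms are expected.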

So, finally I decided to write my own regex condition, because it seems there isn't any fast and simple way to do that in JavaScript.
I added here all the unnecessary characters that came to my mind, that could appear on a typical website and aren't needed to understand a single word (I left the ' character because in English it is quite important ;) ). If you want, you can edit my answer and add your own.
[:;.,?!\-()~\\\/"|®##$%^&*+-]
JS:
text = text.replace(/[:;\.,\?!\-\(\)~\\\/"|®##$%^&*+-]/g, "");

Regex format from PHP to Javascript

Can you please help me. How can I add this regex (?<=^|\s):d(?=$|\s) in javascript RegExp?
e.g
regex = new RegExp("?????" , 'g');
I want to replace the emoticon :d, but only if it is surrounded by spaces (or at an end of the string).

Firstly, as Some1.Kill.The.DJ mentioned, I recommend you use the literal syntax to create the regular expression:
var pattern = /yourPatternHere/g;
It's shorter, easier to read and you avoid complications with escape sequences.
The reason why the pattern does not work is that JavaScript engines that predate ES2018 do not support lookbehinds ((?<=...)). So you have to find a workaround for that. You won't get around including the preceding character in your pattern:
var pattern = /(?:^|\s):d(?!\S)/g;
Since there is no use in capturing anything in your pattern anyway (because :d is fixed) you are probably only interested in the position of the match. That means, when you find a match, you will have to check whether the first character is a space character (or is not :). If that is the case you have to increment the position by 1. If you know that your input string can never start with a space, you can simply increment any found position if it is not 0.
Note that I simplified your lookahead a bit. That is actually the beauty of lookarounds that you do not have to distinguish between end-of-string and a certain character type. Just use the negative lookahead, and assure that there is no non-space character ahead.
Just for future reference that means you could have simplified your initial pattern to:
(?<!\S):d(?!\S)
(If you were using a regex engine that supports lookbehinds.)
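In JavaScript engines that do implement the ES2018 lookbehind syntax, the simplified pattern can be used directly. A small sketch, assuming such an engine and a made-up replacement string:

```javascript
// Requires lookbehind support (ES2018). The pattern is built with the
// constructor so that an older engine fails at this line only.
const standaloneEmoticon = new RegExp("(?<!\\S):d(?!\\S)", "g");

const sample = ":d hello :d, x:d and :d";
const marked = sample.replace(standaloneEmoticon, "[grin]");

console.log(marked); // "[grin] hello :d, x:d and [grin]"
```

Only the occurrences with whitespace or a string edge on both sides are replaced.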
EDIT:
After your comment on the other answer, it's actually a lot easier to use the workaround. Just write back the captured space-character:
string = string.replace(/(^|\s):d(?!\S)/g, "$1emoticonCode");
Where $1 refers to what was matched with (^|\s). I.e. if the match was at the beginning of the string, $1 will be empty, and if there was a space before :d, then $1 will contain that space character.
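The capture-group workaround can be wrapped in a small helper. This is only a sketch: the function name replaceEmoticon and the replacement text are made up for illustration, and a replacer function is used so that a $ inside the replacement is not interpreted specially.

```javascript
// Replace ":d" only when it stands alone, keeping the whitespace before it.
function replaceEmoticon(input, replacement) {
  // Group 1 is either the empty match at the start or the preceding whitespace.
  return input.replace(/(^|\s):d(?!\S)/g, (match, lead) => lead + replacement);
}

const converted = replaceEmoticon(":d hi :d x:d :dd", "[grin]");
console.log(converted); // "[grin] hi [grin] x:d :dd"
```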

JavaScript doesn't support lookbehind, i.e. (?<=...), in engines that predate ES2018.
It does support lookahead.
Better use:
/(?:^|\s)(:d)(?=$|\s)/g
Group 1 captures the required match.

Issue with custom javascript regex

I have a custom regular expression which I use to detect whole numbers, fractions and floats.
var regEx = new RegExp("^((^[1-9]|(0\.)|(\.))([0-9]+)?((\s|\.)[0-9]+(/[0-9])?)?)$");
var quantity = 'd';
var matched = quantity.match(regEx);
alert(matched);

(The code is also found here: http://jsfiddle.net/aNb3L/ .)
The problem is that it matches a single letter, and I can't figure out why. For more letters it fails (which is good).
Disclaimer: I am new to regular expressions, although in http://gskinner.com/RegExr/ it doesn't match a single letter

It's easier to use straight regular expression syntax:
var regEx = /^((^[1-9]|(0\.)|(\.))([0-9]+)?((\s|\.)[0-9]+(\/[0-9])?)?)$/;
When you use the RegExp constructor, you have to double-up on the backslashes. As it is, your code only has single backslashes, so the \. subexpressions are being treated as . — and that's how single non-digit characters are slipping through.
Thus yours would also work this way:
var regEx = new RegExp("^((^[1-9]|(0\\.)|(\\.))([0-9]+)?((\\s|\\.)[0-9]+(/[0-9])?)?)$");
This happens because the string syntax also uses backslash as a quoting mechanism. When your regular expression is first parsed as a string constant, those backslashes are stripped out if you don't double them. When the string is then passed to the regular expression parser, they're gone.
The only time you really need to use the RegExp constructor is when you're building up the regular expression dynamically or when it's delivered to your code via JSON or something.
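The effect of the missing backslashes is easy to see by inspecting the source property of the resulting RegExp. The sketch below uses a cut-down pattern rather than the full one from the question.

```javascript
// In a string literal, "\." is just ".", so the regex engine never sees a backslash.
const fromString = new RegExp("^(\.)$");
console.log(fromString.source);    // ^(.)$
console.log(fromString.test("d")); // true, the dot matches any character

// Doubling the backslash keeps it in the pattern.
const doubled = new RegExp("^(\\.)$");
console.log(doubled.source);       // ^(\.)$
console.log(doubled.test("d"));    // false
console.log(doubled.test("."));    // true

// The literal syntax needs only a single backslash.
const literal = /^(\.)$/;
console.log(literal.source === doubled.source); // true
```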

Well, for a whole number this would be your regex:
/^(0|[1-9]\d*)$/
Then you have to account for the possibility of a float (note the escaped dot):
/^(0|[1-9]\d*)(\.\d+)?$/
Then you have to account for the possibility of a fraction:
/^(0|[1-9]\d*)((\.\d+)|(\/[1-9]\d*))?$/
To me this regex is much easier to read than your original, but it's up to you of course.
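A quick check of the combined pattern, written here with the dot escaped and the alternation group closed, against a few sample inputs. The sample list is made up, and this sketch does not cover every form the original expression tried to accept, such as a leading ".5" or a mixed number like "1 1/2".

```javascript
// Whole number, optionally followed by a decimal part or a fraction part.
const quantityPattern = /^(0|[1-9]\d*)((\.\d+)|(\/[1-9]\d*))?$/;

const samples = ["12", "0.5", "3/4", "d", "01", "1.", "1x5"];
const accepted = samples.filter((s) => quantityPattern.test(s));

console.log(accepted); // ["12", "0.5", "3/4"]
```

With an unescaped dot, "1x5" would also be accepted, which is why the escape matters.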

JavaScript regex - positive lookahead -- giving me syntax errors

This piece of regex (?<=href\=")[^]+?(?=#_) is supposed to match everything in a href value except the hash value and what follows it within the href URL.
It appears to work fine in regex debuggers/testers such as http://gskinner.com/RegExr/
but in JavaScript it produces a syntax error. If I remove the < from the (?<=) it works; however, that's not the positive lookahead I am looking for.
I am pulling my hair out, as usual, thanks to regex lol
Please help

(?<=...) and (?<!...) are lookbehinds, not lookaheads. Lookbehinds are not supported by JavaScript regular expression engines that predate ES2018.
Resources:
http://www.regular-expressions.info/lookaround.html#lookbehind
http://www.regular-expressions.info/javascript.html

Lookbehinds are not supported, as already mentioned.
I don't know which function you use for your regex, but if you use match(), just add a capture group:
href="(.+?)(?=#)
Which gives you, e.g.:
var str = '<a href="foo#bar">link</a>';
var matches = str.match(/href="(.+?)(?=#)/);
// matches[0] = href="foo
// matches[1] = foo // <-- this is what you want
Additional information:
[^] is a negated character class with nothing in it, so it matches any character at all. That is almost what the dot . does, except that . does not match line terminators while [^] does.
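Put together, a capture group plus a lookahead is enough to pull out the part of the href before the hash, with no lookbehind needed. The sample markup and variable names below are assumptions made for the example, and the character class [^"#] is one possible way to keep the match inside the attribute value.

```javascript
// Hypothetical input; the question did not include the actual markup.
const html = '<a href="page.html#_section">link</a>';

// Group 1 holds the URL part; the lookahead requires "#_" right after it.
const hrefMatch = html.match(/href="([^"#]+?)(?=#_)/);
const urlWithoutHash = hrefMatch ? hrefMatch[1] : null;
console.log(urlWithoutHash); // "page.html"

// The dot skips line terminators, while the empty negated class does not.
console.log(/^.$/.test("\n"));   // false
console.log(/^[^]$/.test("\n")); // true
```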

The reason your regex works in RegExr is that RegExr is a Flash application written in ActionScript, not JavaScript. Although AS advertises itself as being compliant with the ECMAScript standard, the same as JS, it's actually much more powerful. Its regex engine is powered by the PCRE library, so it has the same capabilities as PHP.
To get a more accurate picture of what JavaScript regexes can do, use a tester that's actually powered by JavaScript, like this one.
