Detect Russian / cyrillic in Javascript string? - javascript

I'm trying to detect if a string contains Russian (cyrillic) characters or not. I'm using this code:
term.match(/[\wа-я]+/ig);
but it doesn't work – or in fact it just returns the string back as it is.
Can somebody help with the right code?
Thanks!

Use pattern /[\u0400-\u04FF]/ to cover more cyrillic characters:
// http://jrgraphix.net/r/Unicode/0400-04FF
const cyrillicPattern = /^[\u0400-\u04FF]+$/;
console.log('Привіт:', cyrillicPattern.test('Привіт'));
console.log('Hello:', cyrillicPattern.test('Hello'));
UPDATE:
In some new browsers, you can use Unicode property escapes.
The Cyrillic script uses the same range as described above: U+0400..U+04FF
const cyrillicPattern = /^\p{Script=Cyrillic}+$/u;
console.log('Привіт:', cyrillicPattern.test('Привіт'));
console.log('Hello:', cyrillicPattern.test('Hello'));

Perhaps you meant to use the RegExp test method instead?
/[а-яА-ЯЁё]/.test(term)
Note that JavaScript regexes are not really Unicode-aware, which means the i flag will have no effect on anything that's not ASCII. Hence the need for spelling out lower- and upper-case ranges separately.

Related

How to feed strange characters to javaScript? [duplicate]

I need to insert an Omega (Ω) onto my html page. I am using its HTML escaped code to do that, so I can write Ω and get Ω. That's all fine and well when I put it into a HTML element; however, when I try to put it into my JS, e.g. var Omega = Ω, it parses that code as JS and the whole thing doesn't work. Anyone know how to go about this?
I'm guessing that you actually want Omega to be a string containing an uppercase omega? In that case, you can write:
var Omega = '\u03A9';
(Because Ω is the Unicode character with codepoint U+03A9; that is, 03A9 is 937, except written as four hexadecimal digits.)
Edited to add (in 2022): There now exists an alternative form that better supports codepoints above U+FFFF:
let Omega = '\u{03A9}';
let desertIslandEmoji = '\u{1F3DD}';
Judging from https://caniuse.com/mdn-javascript_builtins_string_unicode_code_point_escapes, most or all browsers added support for it in 2015, so it should be reasonably safe to use.
Although #ruakh gave a good answer, I will add some alternatives for completeness:
You could in fact use even var Omega = 'Ω' in JavaScript, but only if your JavaScript code is:
inside an event attribute, as in onclick="var Omega = '&#937';
alert(Omega)" or
in a script element inside an XHTML (or XHTML + XML) document
served with an XML content type.
In these cases, the code will be first (before getting passed to the JavaScript interpreter) be parsed by an HTML parser so that character references like Ω are recognized. The restrictions make this an impractical approach in most cases.
You can also enter the Ω character as such, as in var Omega = 'Ω', but then the character encoding must allow that, the encoding must be properly declared, and you need software that let you enter such characters. This is a clean solution and quite feasible if you use UTF-8 encoding for everything and are prepared to deal with the issues created by it. Source code will be readable, and reading it, you immediately see the character itself, instead of code notations. On the other hand, it may cause surprises if other people start working with your code.
Using the \u notation, as in var Omega = '\u03A9', works independently of character encoding, and it is in practice almost universal. It can however be as such used only up to U+FFFF, i.e. up to \uffff, but most characters that most people ever heard of fall into that area. (If you need “higher” characters, you need to use either surrogate pairs or one of the two approaches above.)
You can also construct a character using the String.fromCharCode() method, passing as a parameter the Unicode number, in decimal as in var Omega = String.fromCharCode(937) or in hexadecimal as in var Omega = String.fromCharCode(0x3A9). This works up to U+FFFF. This approach can be used even when you have the Unicode number in a variable.
One option is to put the character literally in your script, e.g.:
const omega = 'Ω';
This requires that you let the browser know the correct source encoding, see Unicode in JavaScript
However, if you can't or don't want to do this (e.g. because the character is too exotic and can't be expected to be available in the code editor font), the safest option may be to use new-style string escape or String.fromCodePoint:
const omega = '\u{3a9}';
// or:
const omega = String.fromCodePoint(0x3a9);
This is not restricted to UTF-16 but works for all unicode code points. In comparison, the other approaches mentioned here have the following downsides:
HTML escapes (const omega = '&#937';): only work when rendered unescaped in an HTML element
old style string escapes (const omega = '\u03A9';): restricted to UTF-16
String.fromCharCode: restricted to UTF-16
The answer is correct, but you don't need to declare a variable.
A string can contain your character:
"This string contains omega, that looks like this: \u03A9"
Unfortunately still those codes in ASCII are needed for displaying UTF-8, but I am still waiting (since too many years...) the day when UTF-8 will be same as ASCII was, and ASCII will be just a remembrance of the past.
I found this question when trying to implement a font-awesome style icon system in html. I have an API that provides me with a hex string and I need to convert it to unicode to match with the font-family.
Say I have the string const code = 'f004'; from my API. I can't do simple string concatenation (const unicode = '\u' + code;) since the system needs to recognize that it's unicode and this will in fact cause a syntax error if you try.
#coldfix mentioned using String.fromCodePoint but it takes a number as an argument, not a string.
To finally cross the finish line, just add parseInt and pass 16 (since hex is base 16) to it's second parameter. You'll finally get a unicode string from a simple hex string.
This is what I did:
const code = 'f004';
const toUnicode = code => String.fromCodePoint(parseInt(code, 16));
toUnicode(code);
// => '\uf004'
Try using Function(), like this:
var code = "2710"
var char = Function("return '\\u"+code+"';")()
It works well, just do not add any 's or "s or spaces.
In the example, char is "✐".

RegEx that works in Javascript won't do so in PHP

I will try to make my question short yet understandable, I have a simple RegEx I use in javascript to check for characters that aren't alphanumeric (AKA Symbols). It would be "/[$-/:-?{-~!"^_`[]]/"
In javascript, doing
if(/[$-/:-?{-~!"^_`\[\]]/.test( string ))
just works, if any of those characters are in the string, it will give true, else, it will give false. I tried to do the same in PHP, the following way
if(preg_match('/[$-/:-?{-~!"^_`\[\]]/', $string ))
other regexes work when done this way, but this particular one simply will give false no matter what when ran in PHP.
Is there any reason to this? Am I doing something wrong? Does PHP comprehend regexes in a different way? What should I change to make it work?
Thanks for your time.
Since php uses PCRE, you will get a pattern error using delimiter / as seen here http://regex101.com/r/3ILGgE/1
So, it should be escaped correctly.
Using / as the delimiter, the string is
'/[$-\/:-?{-~!"^_`\[\]]/'
Using ~ as the delimiter, the string is
'~[$-/:-?{-\~!"^_`\[\]]~'
Also, be aware you have a couple of range's in the class $-/ and :-? and {-~
that will include the characters between the from/to range characters as well
and does not include the range character - itself as it is an operator.

Remove a long dash from a string in JavaScript?

I've come across an error in my web app that I'm not sure how to fix.
Text boxes are sending me the long dash as part of their content (you know, the special long dash that MS Word automatically inserts sometimes). However, I can't find a way to replace it; since if I try to copy that character and put it into a JavaScript str.replace statement, it doesn't render right and it breaks the script.
How can I fix this?
The specific character that's killing it is —.
Also, if it helps, I'm passing the value as a GET parameter, and then encoding it in XML and sending it to a server.
This code might help:
text = text.replace(/\u2013|\u2014/g, "-");
It replaces all – (–) and — (—) symbols with simple dashes (-).
DEMO: http://jsfiddle.net/F953H/
That character is call an Em Dash. You can replace it like so:
str.replace('\u2014', '');​​​​​​​​​​
Here is an example Fiddle: http://jsfiddle.net/x67Ph/
The \u2014 is called a unicode escape sequence. These allow to to specify a unicode character by its code. 2014 happens to be the Em Dash.
There are three unicode long-ish dashes you need to worry about: http://en.wikipedia.org/wiki/Dash
You can replace unicode characters directly by using the unicode escape:
'—my string'.replace( /[\u2012\u2013\u2014\u2015]/g, '' )
There may be more characters behaving like this, and you may want to reuse them in html later. A more generic way to to deal with it could be to replace all 'extended characters' with their html encoded equivalent. You could do that Like this:
[yourstring].replace(/[\u0080-\uC350]/g,
function(a) {
return '&#'+a.charCodeAt(0)+';';
}
);
With the ECMAScript 2018 standard, JavaScript RegExp now supports Unicode property (or, category) classes. One of them, \p{Dash}, matches any Unicode character points that are dashes:
/\p{Dash}/gu
In ES5, the equivalent expression is:
/[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD/g
See the Unicode Utilities reference.
Here are some JavaScript examples:
const text = "Dashes: \uFF0D\uFE63\u058A\u1400\u1806\u2010-\u2013\uFE32\u2014\uFE58\uFE31\u2015\u2E3A\u2E3B\u2053\u2E17\u2E40\u2E5D\u301C\u30A0\u2E1A\u05BE\u2212\u207B\u208B\u3030𐺭";
const es5_dash_regex = /[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD/g;
console.log(text.replace(es5_dash_regex, '-')); // Normalize each dash to ASCII hyphen
// => Dashes: ----------------------------
To match one or more dashes and replace with a single char (or remove in one go):
/\p{Dash}+/gu
/(?:[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD)+/g

JavaScript regular expression to catch kanji

I can't get this javascript function to work the way I want...
// matches a String that contains kanji and/or kana character(s)
String.prototype.isKanjiKana = function(){
return !!this.match(/^[\u4E00-\u9FAF|\u3040-\u3096|\u30A1-\u30FA|\uFF66-\uFF9D|\u31F0-\u31FF]+$/);
}
it does return TRUE if the string is made of kanji and/or kana characters, FALSE if alphabet or other chars are present.
I would like it to return if at least 1 kanji and/or kana characters are present instead that if all of them are.
thank you in advance for any help!
The right answer is not to hardcode ranges. Never ever put magic numbers in your code! That is a maintenance nightmare. It is hard to read, hard to write, hard to debug, hard to maintain. How do you know you got the numbers right? What happens when they add new ones? No, do not use magic numbers. Please.
The right answer is to use named Unicode scripts, which are a fundemental aspect of every Unicode code point:
[\p{Han}\p{Hiragana}\p{Katakana}]
That requires the XRegExp plugin for Javascript.
The real problem is that Javascript regexes on their own are too primitive to support Unicode properties — and therefore, to support Unicode. Maybe that was once an acceptable compromise 15 years ago, but today it is nothing less than intolerably negligent, as you yourself have discovered.
You will also miss a few Common code points specified as kana in the new Script Extensions property, but probably no matter. You could just add \p{Common} to the set above.
Now that Unicode property escapes are part of the ES (2018) spec, the following regex can be used natively if the JS engine supports this feature (expanding on #tchrist's answer):
/[\p{Script_Extensions=Han}\p{Script_Extensions=Hiragana}\p{Script_Extensions=Katakana}]/u
If you want to exclude punctuation from being matched:
/(?!\p{Punctuation})[\p{Script_Extensions=Han}\p{Script_Extensions=Hiragana}\p{Script_Extensions=Katakana}]/u
/[\u3000-\u303f]|[\u3040-\u309f]|[\u30a0-\u30ff]|[\uff00-\uffef]|[\u4e00-\u9faf]|[\u3400-\u4dbf]/
Japanese style punctuation: [\u3000-\u303f]
Hiragana: [\u3040-\u309f]
Katakana: [\u30a0-\u30ff]
Roman characters + half-width katakana: [\uff00-\uffef]
Kanji: [\u4e00-\u9faf]|[\u3400-\u4dbf]
String.prototype.isKanjiKana = function(){
return !!this.match(/[\u4E00-\u9FAF\u3040-\u3096\u30A1-\u30FA\uFF66-\uFF9D\u31F0-\u31FF]/);
}
Don't anchor it to beginning and end of string with $^ and the + is useless in this case.
/[\u4E00-\u9FAF|\u3040-\u3096|\u30A1-\u30FA|\uFF66-\uFF9D|\u31F0-\u31FF]/
Why not just this? It will return true when it contains at least one Kanji.
/[一-龯]/.test(str)

How to match non-ASCII (German, Spanish, etc.) letters in regex?

I was unable to find or create a regex which match only letters,spaces, accented letters and spanish and german letters.
I'm using this for now:
var reg = new RegExp("^[a-z _]*$");
I've tried:
^[:alpha: _]*$
^[a-zA-Z0-9äöüÄÖÜ]*$
^[-\p{L}]*$
Any idea? Or the regex supported by javascript engines are limited?
The 2nd to last case looks like it should work, but is missing a " " and "_":
/^[a-zA-Z0-9äöüÄÖÜ]*$/.test("aäöüÄÖÜz") => true in FF 3.6 and IE8
/^[a-zA-Z0-9äöüÄÖÜ]*$/.test("é") => false in FF 3.6 and IE8
I'm am unable to find the other constructs in the ECMAScript specification.
Happy coding.
Edit Also check the page encoding and make sure it is "unicode" (UTF-8 likely). If this can't be ensured, then use the \uXXXX escape sequences in the regular expression (using the escapes can be done anyway and may help with source code editing/control).
I'm parsing a name input field, and this seems to be working for both German and French:
^[a-zA-Z\-ÀàÂâÆæÇçÈèÉéÊêËëÎîÏïÔôŒœÙùÛûÜü]*$
Some folks have names like 'Rölf-Dieter', and this lets them through, while checking for numbers. A little extreme, but it works!

Categories

Resources