How do I cover all lowercase non-numeric word characters in a regex?
For example, this covers Müller by listing the German umlauts:
/[A-Z][a-zäöü-]+/g
...but what about French or Spanish word characters? Is it possible to cover those with a range or something similar?
The regex should NOT match a string which has multiple uppercase characters, like DOntGetMe. But it should match McDonald.
In the end I came up with
/([A-Z][a-zäöü'-](?:[\w'-]+)?)/g
But it still doesn't cover French/Spanish characters.
https://regex101.com/r/ofl4tj/2
Match
Smith
Müller
McDonald
O'Riley
Lee
Li
Manco-Johnson
Don't match
MIsspelled
ABC
Since you are looking for a letter from the English alphabet at the beginning of a word, use a word boundary \b right before [A-Z]. Translating \p{Ll} (all lowercase letters from all languages) with Unicode CLDR or Mother E** gives this regex:
(?:[a-z\xB5\xDF-\xF6\xF8-\xFF\u0101\u0103\u0105\u0107\u0109\u010B\u010D\u010F\u0111\u0113\u0115\u0117\u0119\u011B\u011D\u011F\u0121\u0123\u0125\u0127\u0129\u012B\u012D\u012F\u0131\u0133\u0135\u0137\u0138\u013A\u013C\u013E\u0140\u0142\u0144\u0146\u0148\u0149\u014B\u014D\u014F\u0151\u0153\u0155\u0157\u0159\u015B\u015D\u015F\u0161\u0163\u0165\u0167\u0169\u016B\u016D\u016F\u0171\u0173\u0175\u0177\u017A\u017C\u017E-\u0180\u0183\u0185\u0188\u018C\u018D\u0192\u0195\u0199-\u019B\u019E\u01A1\u01A3\u01A5\u01A8\u01AA\u01AB\u01AD\u01B0\u01B4\u01B6\u01B9\u01BA\u01BD-\u01BF\u01C6\u01C9\u01CC\u01CE\u01D0\u01D2\u01D4\u01D6\u01D8\u01DA\u01DC\u01DD\u01DF\u01E1\u01E3\u01E5\u01E7\u01E9\u01EB\u01ED\u01EF\u01F0\u01F3\u01F5\u01F9\u01FB\u01FD\u01FF\u0201\u0203\u0205\u0207\u0209\u020B\u020D\u020F\u0211\u0213\u0215\u0217\u0219\u021B\u021D\u021F\u0221\u0223\u0225\u0227\u0229\u022B\u022D\u022F\u0231\u0233-\u0239\u023C\u023F\u0240\u0242\u0247\u0249\u024B\u024D\u024F-\u0293\u0295-\u02AF\u0371\u0373\u0377\u037B-\u037D\u0390\u03AC-\u03CE\u03D0\u03D1\u03D5-\u03D7\u03D9\u03DB\u03DD\u03DF\u03E1\u03E3\u03E5\u03E7\u03E9\u03EB\u03ED\u03EF-\u03F3\u03F5\u03F8\u03FB\u03FC\u0430-\u045F\u0461\u0463\u0465\u0467\u0469\u046B\u046D\u046F\u0471\u0473\u0475\u0477\u0479\u047B\u047D\u047F\u0481\u048B\u048D\u048F\u0491\u0493\u0495\u0497\u0499\u049B\u049D\u049F\u04A1\u04A3\u04A5\u04A7\u04A9\u04AB\u04AD\u04AF\u04B1\u04B3\u04B5\u04B7\u04B9\u04BB\u04BD\u04BF\u04C2\u04C4\u04C6\u04C8\u04CA\u04CC\u04CE\u04CF\u04D1\u04D3\u04D5\u04D7\u04D9\u04DB\u04DD\u04DF\u04E1\u04E3\u04E5\u04E7\u04E9\u04EB\u04ED\u04EF\u04F1\u04F3\u04F5\u04F7\u04F9\u04FB\u04FD\u04FF\u0501\u0503\u0505\u0507\u0509\u050B\u050D\u050F\u0511\u0513\u0515\u0517\u0519\u051B\u051D\u051F\u0521\u0523\u0525\u0527\u0529\u052B\u052D\u052F\u0560-\u0588\u10D0-\u10FA\u10FD-\u10FF\u13F8-\u13FD\u1C80-\u1C88\u1D00-\u1D2B\u1D6B-\u1D77\u1D79-\u1D9A\u1E01\u1E03\u1E05\u1E07\u1E09\u1E0B\u1E0D\u1E0F\u1E11\u1E13\u1E15\u1E17\u1E19\u1E1B\u1E1D\u1E1F\u1E21\u1E23\u1E25\u1E27\u1E29\u1E2B\u
1E2D\u1E2F\u1E31\u1E33\u1E35\u1E37\u1E39\u1E3B\u1E3D\u1E3F\u1E41\u1E43\u1E45\u1E47\u1E49\u1E4B\u1E4D\u1E4F\u1E51\u1E53\u1E55\u1E57\u1E59\u1E5B\u1E5D\u1E5F\u1E61\u1E63\u1E65\u1E67\u1E69\u1E6B\u1E6D\u1E6F\u1E71\u1E73\u1E75\u1E77\u1E79\u1E7B\u1E7D\u1E7F\u1E81\u1E83\u1E85\u1E87\u1E89\u1E8B\u1E8D\u1E8F\u1E91\u1E93\u1E95-\u1E9D\u1E9F\u1EA1\u1EA3\u1EA5\u1EA7\u1EA9\u1EAB\u1EAD\u1EAF\u1EB1\u1EB3\u1EB5\u1EB7\u1EB9\u1EBB\u1EBD\u1EBF\u1EC1\u1EC3\u1EC5\u1EC7\u1EC9\u1ECB\u1ECD\u1ECF\u1ED1\u1ED3\u1ED5\u1ED7\u1ED9\u1EDB\u1EDD\u1EDF\u1EE1\u1EE3\u1EE5\u1EE7\u1EE9\u1EEB\u1EED\u1EEF\u1EF1\u1EF3\u1EF5\u1EF7\u1EF9\u1EFB\u1EFD\u1EFF-\u1F07\u1F10-\u1F15\u1F20-\u1F27\u1F30-\u1F37\u1F40-\u1F45\u1F50-\u1F57\u1F60-\u1F67\u1F70-\u1F7D\u1F80-\u1F87\u1F90-\u1F97\u1FA0-\u1FA7\u1FB0-\u1FB4\u1FB6\u1FB7\u1FBE\u1FC2-\u1FC4\u1FC6\u1FC7\u1FD0-\u1FD3\u1FD6\u1FD7\u1FE0-\u1FE7\u1FF2-\u1FF4\u1FF6\u1FF7\u210A\u210E\u210F\u2113\u212F\u2134\u2139\u213C\u213D\u2146-\u2149\u214E\u2184\u2C30-\u2C5E\u2C61\u2C65\u2C66\u2C68\u2C6A\u2C6C\u2C71\u2C73\u2C74\u2C76-\u2C7B\u2C81\u2C83\u2C85\u2C87\u2C89\u2C8B\u2C8D\u2C8F\u2C91\u2C93\u2C95\u2C97\u2C99\u2C9B\u2C9D\u2C9F\u2CA1\u2CA3\u2CA5\u2CA7\u2CA9\u2CAB\u2CAD\u2CAF\u2CB1\u2CB3\u2CB5\u2CB7\u2CB9\u2CBB\u2CBD\u2CBF\u2CC1\u2CC3\u2CC5\u2CC7\u2CC9\u2CCB\u2CCD\u2CCF\u2CD1\u2CD3\u2CD5\u2CD7\u2CD9\u2CDB\u2CDD\u2CDF\u2CE1\u2CE3\u2CE4\u2CEC\u2CEE\u2CF3\u2D00-\u2D25\u2D27\u2D2D\uA641\uA643\uA645\uA647\uA649\uA64B\uA64D\uA64F\uA651\uA653\uA655\uA657\uA659\uA65B\uA65D\uA65F\uA661\uA663\uA665\uA667\uA669\uA66B\uA66D\uA681\uA683\uA685\uA687\uA689\uA68B\uA68D\uA68F\uA691\uA693\uA695\uA697\uA699\uA69B\uA723\uA725\uA727\uA729\uA72B\uA72D\uA72F-\uA731\uA733\uA735\uA737\uA739\uA73B\uA73D\uA73F\uA741\uA743\uA745\uA747\uA749\uA74B\uA74D\uA74F\uA751\uA753\uA755\uA757\uA759\uA75B\uA75D\uA75F\uA761\uA763\uA765\uA767\uA769\uA76B\uA76D\uA76F\uA771-\uA778\uA77A\uA77C\uA77F\uA781\uA783\uA785\uA787\uA78C\uA78E\uA791\uA793-\uA795\uA797\uA799\uA79B\uA79D\uA79F\uA7A1\uA7A3\uA7A5\uA7A7\uA7A9\uA7AF\uA7B5\uA7
B7\uA7B9\uA7FA\uAB30-\uAB5A\uAB60-\uAB65\uAB70-\uABBF\uFB00-\uFB06\uFB13-\uFB17\uFF41-\uFF5A]|\uD801[\uDC28-\uDC4F\uDCD8-\uDCFB]|\uD803[\uDCC0-\uDCF2]|\uD806[\uDCC0-\uDCDF]|\uD81B[\uDE60-\uDE7F]|\uD835[\uDC1A-\uDC33\uDC4E-\uDC54\uDC56-\uDC67\uDC82-\uDC9B\uDCB6-\uDCB9\uDCBB\uDCBD-\uDCC3\uDCC5-\uDCCF\uDCEA-\uDD03\uDD1E-\uDD37\uDD52-\uDD6B\uDD86-\uDD9F\uDDBA-\uDDD3\uDDEE-\uDE07\uDE22-\uDE3B\uDE56-\uDE6F\uDE8A-\uDEA5\uDEC2-\uDEDA\uDEDC-\uDEE1\uDEFC-\uDF14\uDF16-\uDF1B\uDF36-\uDF4E\uDF50-\uDF55\uDF70-\uDF88\uDF8A-\uDF8F\uDFAA-\uDFC2\uDFC4-\uDFC9\uDFCB]|\uD83A[\uDD22-\uDD43])
\b is not Unicode-aware in JS, so you can't use it next to non-ASCII letters, and I'm not sure you need it either. If you do, translate \p{L}\p{D} with the tools above and use the result in a negative lookahead (?!...), or simply use (?!\S). The latter is more general, but it satisfies most needs.
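In engines that support ES2018 Unicode property escapes, the same idea can be written without the translated character list. The following is only a sketch: it checks one whole string at a time, mirrors the question's second pattern (second character must be lowercase, an apostrophe, or a hyphen), and the names NAME and looksLikeName are made up for illustration.

```javascript
// \p{Lu} = any uppercase letter, \p{Ll} = any lowercase letter, \p{L} = any letter.
// Requires the "u" flag; anchored so the whole string must be a name.
const NAME = /^\p{Lu}[\p{Ll}'-][\p{L}'-]*$/u;

// Hypothetical helper, not part of any library.
function looksLikeName(text) {
  // NFC composes "u" + combining diaeresis into a single "ü" where possible,
  // so decomposed input is treated the same as precomposed input.
  return NAME.test(text.normalize('NFC'));
}

for (const s of ['Smith', 'Müller', 'McDonald', "O'Riley", 'Núñez', 'MIsspelled', 'ABC']) {
  console.log(s, looksLikeName(s));
}
```

The last two strings print false because their second character is uppercase. Like the original pattern, this does not reject an uppercase letter later in the word.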
I am trying to match with RegEx any word in this sequence (ex: 1943 The brown Fox Jumped) that is a string that starts with numbers and then after that has words with spaces between them. I have spent hours trying to figure out how to match any word in that sequence that isn't title cased (eg. The Brown Fox Jumped). I have figured out how to match if all words aren't title cased but not if one or two are in the middle of a sentence. How would I go about creating a regular expression to detect if one or more words aren't title cased?
The pattern that I am working with currently is /(?<=^\d+\s)([a-z]+)/g. Here is a Regex101 demo of my last attempt. As mentioned earlier I figured out how to match if all of the words in the string weren't title cased as shown in this Regex101 demo. Any help would be greatly appreciated :)
You can use an infinite-width lookbehind based regex solution in case you must do it with a regex:
/(?<=^\d+\s.*?)\b[a-z]+\b/gs
See the regex demo.
Details
(?<=^\d+\s.*?) - a positive lookbehind that matches a location that is immediately preceded with
^ - start of string
\d+ - 1+ digits
\s - a whitespace
.*? - any 0 or more chars as few as possible
\b[a-z]+\b - a whole word consisting of lowercase ASCII letters.
Note: this regex does not work in IE and older browsers that do not support the ECMAScript 2018+ standard.
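A quick way to try the lookbehind pattern above, assuming an engine with ES2018 lookbehind support (for example a current Node.js). The helper name notTitleCased is made up for this sketch.

```javascript
// Returns every all-lowercase word that appears after the leading number.
// Hypothetical helper; the regex is the one from the answer above.
const notTitleCased = text => text.match(/(?<=^\d+\s.*?)\b[a-z]+\b/gs) || [];

console.log(notTitleCased('1943 The brown Fox Jumped')); // [ 'brown' ]
console.log(notTitleCased('1943 The Brown Fox Jumped')); // []
```

A string without the leading digits returns an empty array, because the lookbehind can never be satisfied.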
Try this
([A-Z])\w+
It matches an uppercase ASCII letter followed by one or more word characters, so each match starts at the first capital letter of a word.
This is actually the default example in RegExr. Test it here:
https://regexr.com/
I know the pattern to detect whether a string consists of Chinese characters, but that's not what I need. I need to check whether certain characters are found in a string.
const words_found = (words, values) =>
  words.some(word =>
    values.match(new RegExp(word + '\\b', 'i'))
  )
words_found(['james'], 'my name is james') // true
but it fails for Chinese characters
words_found(['一个'], '你说到这是一个测试') // false
Read the documentation for word boundaries.
A word boundary matches the position between a word character followed by a non-word character, or between a non-word character followed by a word character.
where "word character" is something that matches \w (basically ASCII letters, digits, and the underscore), and "non-word character" is something that matches \W.
Note that all Chinese characters, in the sense that we usually think of them, are considered "non-word characters" as relates to the definition of word boundaries in JavaScript regular expressions. In other words, there is no word boundary between 一 and 个, because both are non-word characters; similarly, there is no word boundary between 一个 and 测试, because both 个 and 测 are non-word characters.
With regard to Japanese, Chinese, and Korean, which do not generally use spaces, there is not even a single clear definition of what the concept of "word" means, and therefore no concept of "word character" or "word boundary". There are libraries which people have worked on for years, involving machine learning, to try to break text into meaningful word-like segments, and they all do it in a slightly different way. The relevant question here is why you think you want to break the Chinese into what you are thinking of as "words" (or find strings which occur right before "word boundaries"). What is the point of your \\b that is forcing the match to occur right before a word boundary? What case are you trying to exclude?
Using Unicode regexp properties
However, you may be able to use the new Unicode regexp character class escapes in ECMAScript 2018 (http://2ality.com/2017/07/regexp-unicode-property-escapes.html). For instance, to match Chinese strings occurring before something that doesn't look like a Chinese character (or any letter), you could use
new RegExp(`${word}(?=$|\\P{Letter})`, "u")
Roughly speaking, this translates into "find the word, but only if it is followed by (using look-ahead, the (?= part) either end-of-string ($) or a character which does not have the Unicode property "Letter"". The "u" flag enables Unicode processing. Note the doubled backslash: inside a template literal a single \P would lose its backslash before the string reaches the RegExp constructor.
Of course, this will not help you find 一个 as a "word" inside 你说到这是一个测试, because the following character 测 falls into the Unicode class "Letter", and so will not match \P{Letter}.
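To make the behaviour of that look-ahead concrete, here is a small sketch. The helper name is made up, and it assumes the search word contains no regex metacharacters (otherwise it would need escaping first).

```javascript
// True if `word` occurs in `text` and is followed by end-of-string
// or by a character that is not a Unicode Letter.
// Hypothetical helper; `word` is assumed to contain no regex metacharacters.
const foundBeforeNonLetter = (word, text) =>
  new RegExp(`${word}(?=$|\\P{Letter})`, 'u').test(text);

console.log(foundBeforeNonLetter('测试', '你说到这是一个测试')); // true: followed by end-of-string
console.log(foundBeforeNonLetter('一个', '你说到这是一个测试')); // false: followed by the letter 测
console.log(foundBeforeNonLetter('一个', '这是一个。'));         // true: followed by punctuation
```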
By the way, to match any "non-word" symbol in Unicode, you can use:
[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]
\b only matches at the boundary between a word character and a non-word character. In regex engines where \w is Unicode-aware, the entire '你说到这是一个测试' is considered one word, so '一个' followed by \b won't match because '一个' is not at a word boundary of '你说到这是一个测试', while '测试' will match. In JavaScript, \w is ASCII-only, so Chinese characters are all non-word characters, no word boundary exists next to them, and neither pattern matches. For Chinese words, a simple substring match is usually enough.
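A minimal sketch of the substring approach, keeping the case-insensitive behaviour of the original 'i' flag by lowercasing both sides. The name wordsFound is just an illustrative variant of the question's function.

```javascript
// True if any of `words` occurs anywhere in `text` (plain substring test, no regex).
// Hypothetical rewrite of the question's words_found helper.
const wordsFound = (words, text) => {
  const haystack = text.toLowerCase();
  return words.some(word => haystack.includes(word.toLowerCase()));
};

console.log(wordsFound(['james'], 'my name is James'));  // true
console.log(wordsFound(['一个'], '你说到这是一个测试')); // true
```

The trade-off is that this also finds 'james' inside 'jameson', which is exactly what the \b was meant to prevent for space-delimited languages.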
I am using this pattern to find any word in the string:
\b(\w{1,})
but this can't find Arabic words. How can I change this pattern to find both English and Arabic words?
Thanks
Regex \w is an alias for [A-Za-z0-9_] (ASCII letters, digits, and the underscore) and will not match the Arabic Unicode range. To include other characters you need to specify them, for example
[A-Za-z\u0600-\u065F\u066A-\u06EF\u06FA-\u06FF]+
For explanation about character codes see Match Arabic word with regex that ends with “#”?
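As a rough illustration of such a character class in use, with the Latin part written as A-Za-z so that punctuation between Z and a is not included. The sample sentence and the helper name are made up; the Arabic word is written with \u escapes only to keep the source unambiguous.

```javascript
// Runs of ASCII letters or characters from the listed Arabic ranges.
const WORD = /[A-Za-z\u0600-\u065F\u066A-\u06EF\u06FA-\u06FF]+/g;

// Hypothetical helper: returns all matches, or an empty array if there are none.
const findWords = text => text.match(WORD) || [];

// "\u0645\u0631\u062D\u0628\u0627" is the Arabic word for "hello".
console.log(findWords('hello \u0645\u0631\u062D\u0628\u0627 world 123'));
```

This prints the three words and skips the digits, since 0-9 is not part of the class.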
If your text only includes English and Arabic, and you want to sort the results, you could use this:
([^\x00-\x7F ]+) for the Arabic text and this: (\w+) for the English text
The first part captures runs of characters outside the ASCII range (spaces excluded); the second part captures English word characters (letters, digits, and _).
Like smirnov said, that regex you're using will only find Latin strings. For Arabic you should use [\u0600-\u06ff]|[\u0750-\u077f]|[\ufb50-\ufbc1]|[\ufbd3-\ufd3f]|[\ufd50-\ufd8f]|[\ufd92-\ufdc7]|[\ufe70-\ufefc]|[\uFDF0-\uFDFD] (which should find all Arabic characters, even weird ones like .)
Depending on what you're trying to do, you might want to split the string into a list and process it that way (that's what I usually end up doing when I'm dealing with mixed-language texts). Then you can identify the language of each word and process it accordingly.
In JavaScript, it's easy to match letters and accents with this regex:
text.match(/[a-z\u00E0-\u00FC]+/i);
And only the lowercase letters and accents without the i option:
text.match(/[a-z\u00E0-\u00FC]+/);
But what is the correct regular expression to match only capitalized letters and accents?
EDIT: as the answers below already mention, the regex above also matches some other signs and misses some accented characters like ý and Ý, ć and Ć, and many others.
The range U+00C0 - U+00DC should be the uppercase equivalent for U+00E0 - U+00FC
So this text.match(/[A-Z\u00C0-\u00DC]+/); should be what you are looking for.
A site like graphemica can help you to determine the ranges you need yourself.
EDIT: as the other answers already mention, this also matches some other signs.
Replace a-z with A-Z and \u00E0-\u00FC with \u00C0-\u00DC to match the same letters in uppercase as text.match(/[a-z\u00E0-\u00FC]+/); matches in lowercase.
However!
This is not a proper implementation, neither for lowercase nor for uppercase letters, as, for example, your lowercase match includes ÷ (division sign), which is not at all a letter, and my uppercase string will match × (multiplication sign), which looks like an X, but isn't actually a letter either.
In addition to that, you're missing characters like ý and Ý, ć and Ć and many, many others.
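These gaps are easy to check. The sketch below anchors both patterns with ^ and $ (an addition made here so that each test covers a single character) and uses \u escapes for the characters in question.

```javascript
// The two ranges from the question and answer, anchored for single-string tests.
const lowerRange = /^[a-z\u00E0-\u00FC]+$/;
const upperRange = /^[A-Z\u00C0-\u00DC]+$/;

console.log(lowerRange.test('\u00F7')); // true:  ÷ (division sign) is inside the lowercase range
console.log(upperRange.test('\u00D7')); // true:  × (multiplication sign) is inside the uppercase range
console.log(lowerRange.test('\u00FD')); // false: ý lies just past \u00FC
console.log(upperRange.test('\u0106')); // false: Ć is outside the Latin-1 block entirely
```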
Your first regex doesn't actually match letters and accents: it only matches letters and a specific subset of accented letters, namely the ones between unicode codepoints \u00E0 and \u00FC. This range does not include any capital letters, while it does include e.g. the ÷ sign and some letters not generally regarded as 'accented'.
Depending on what you actually need, this may not be what you want. If you really want to match all capital letters and the capital letters with the same accents, you need the regex /[A-Z\u00C0-\u00DC]+/, but please check against e.g. http://unicode-table.com/en/#basic-latin whether it suits your needs.
To match all capitalized letters, accented or not, you may use the following unicode regex /\p{Lu}+/u. For example, in node repl:
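A minimal sketch of such a check, using a made-up sample string and a hypothetical helper name; it assumes an engine with ES2018 property escapes.

```javascript
// Returns every run of uppercase letters (Unicode category Lu) in the text.
// Hypothetical helper; NFC keeps accented capitals as single code points.
const upperRuns = text => text.normalize('NFC').match(/\p{Lu}+/gu) || [];

console.log(upperRuns('ÀÉ abc Δ ÝĆ xyz')); // [ 'ÀÉ', 'Δ', 'ÝĆ' ]
```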
Note that this will match non-Latin letters as well, like the capital Greek delta Δ in the example.
I want to use regex for string replace with Cyrillic characters. I want to use exact match option. My string replace is working with Latin characters and is looking like that:
'Edin'.replace(/\Edin\b/gi, ''); // Output is ""
The same expression is not working with Cyrillic characters
'Един'.replace(/\Един\b/gi, ''); // Output is still 'Един'
The problem here is the \b word boundary assertion, which matches a position at a word boundary. A word boundary is defined as (^\w|\w$|\W\w|\w\W), and the word character \w is in turn the set of ASCII characters [A-Za-z0-9_]. Cyrillic characters don't fall into this set.
For the same reason, the regular expression /\w+/ will not match a Cyrillic string.
As dfsq wrote the problem is with word boundary.
If you remove \b you will get the desired output, but it is quite a different regex: it will also replace Един where it is only part of a word. To avoid that you can use a negative lookahead and define which letters must not appear after it, because they could be part of a word.
'Един'.replace(/\Един(?![A-я])/gi, '');
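If the engine supports ES2018 lookbehind and Unicode property escapes, the same idea can be applied on both sides of the word without listing letter ranges by hand. This is only a sketch: the helper name is made up, and the word is assumed to contain no regex metacharacters.

```javascript
// Removes `word` only where it is not attached to another letter, digit, or underscore.
// Hypothetical helper; the lookarounds act as a Unicode-aware stand-in for \b.
const removeWholeWord = (text, word) =>
  text.replace(
    new RegExp(`(?<![\\p{L}\\p{N}_])${word}(?![\\p{L}\\p{N}_])`, 'giu'),
    ''
  );

console.log(JSON.stringify(removeWholeWord('Един', 'Един')));     // ""
console.log(JSON.stringify(removeWholeWord('Единство', 'Един'))); // "Единство"
console.log(JSON.stringify(removeWholeWord('Edin', 'Edin')));     // ""
```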