I want to use regex for string replace with Cyrillic characters. I want to use exact match option. My string replace is working with Latin characters and is looking like that:
'Edin'.replace(/\Edin\b/gi, ''); // Output is ""
The same expression is not working with Cyrillic characters
'Един'.replace(/\Един\b/gi, ''); // Output is still 'Един'
The problem here is \b word boundary chracter, which matches position at a word boundary. Word boundary is defined as (^\w|\w$|\W\w|\w\W). And in its turn word character \w is a set of ASCII characters [A-Za-z0-9_]. Obviously Cyrillic characters don't fall into this set.
For example, for the same reason /\w+/ regular expression will not match Cyrillyc string.
As dfsq wrote the problem is with word boundary.
If you remove \b you will get desired output, but it is quite different regex. It will replace Един also in cases where it is a part of word. To avoid that you can use negative lookahead and define which letters shouldn't appear behind, because they could be a part of word.
'Един'.replace(/\Един(?![A-я])/gi, '');
Related
I need help putting together a regex that will match word that ends with "Id" with case sensitive match.
Try this regular expression:
\w*Id\b
\w* allows word characters in front of Id and the \b ensures that Id is at the end of the word (\b is word boundary assertion).
Gumbo gets my vote, however, the OP doesn't specify whether just "Id" is an allowable word, which means I'd make a minor modification:
\w+Id\b
1 or more word characters followed by "Id" and a breaking space. The [a-zA-Z] variants don't take into account non-English alphabetic characters. I might also use \s instead of \b as a space rather than a breaking space. It would depend if you need to wrap over multiple lines.
This may do the trick:
\b\p{L}*Id\b
Where \p{L} matches any (Unicode) letter and \b matches a word boundary.
How about \A[a-z]*Id\z? [This makes characters before Id optional. Use \A[a-z]+Id\z if there needs to be one or more characters preceding Id.]
I would use
\b[A-Za-z]*Id\b
The \b matches the beginning and end of a word i.e. space, tab or newline, or the beginning or end of a string.
The [A-Za-z] will match any letter, and the * means that 0+ get matched. Finally there is the Id.
Note that this will match words that have capital letters in the middle such as 'teStId'.
I use http://www.regular-expressions.info/ for regex reference
Regex ids = new Regex(#"\w*Id\b", RegexOptions.None);
\b means "word break" and \w means any word character. So \w*Id\b means "{stuff}Id". By not including RegexOptions.IgnoreCase, it will be case sensitive.
I know the pattern to detect if it's a string is chinese character but that's not what I need. I need to check if the characters is found in a string.
const words_found = (words, values) =>
words.some(word =>
values.match(new RegExp(word + '\\b', 'i'))
)
words_found(['james'], 'my name is james') // true
but failed for chinese character
words_found(['一个'], '你说到这是一个测试') // false
Read the documentation for word boundaries.
A word boundary matches the position between a word character followed by a non-word character, or between a non-word character followed by a word character.
where "word character" is something that matches \w (basically single-byte alphanumerics and the underscore), and "non-word character" is something that matches \W.
Note that all Chinese characters, in the sense that we usually think of them, are considered "non-word characters" as relates to the definition of word boundaries in JavaScript regular expressions. In other words, there is no word boundary between 一 and 个, because both are non-word characters; similarly, there is no word boundary between 一个 and 测试, because both 个 and 测 are non-word characters.
With regard to Japanese, Chinese, and Korean, which do not generally use spaces, there is not even a single clear definition of what the concept of "word" means, and therefore no concept of "word character" or "word boundary". There are libraries which people have worked on for years, involving machine learning, to try to break text into meaningful word-like segments, and they all do it in a slightly different way. The relevant question here is why you think you want to break the Chinese into what you are thinking of as "words" (or find strings which occur right before "word boundaries". What is the point of your \\b that is forcing the match to occur right before a word boundary? What case are you trying to exclude?
Using Unicode regexp properties
However, you may be able to use the new Unicode regexp character class escapes in ECMAScript 2018 (http://2ality.com/2017/07/regexp-unicode-property-escapes.html). For instance, to match Chinese strings occurring before something that doesn't look like a Chinese character (or any letter), you could use
new RegExp(`${word}(?=$|\P{Letter})`, "u")
Roughly speaking, this translates into "find the word, but only it is followed by (using look-ahead, the (?= part) either end-of-string ($) or a a character which does have the Unicode property "Letter". The "u" flag enables Unicode processing.
Of course, this will not help you find 一个 as a "word" inside 你说到这是一个测试, because the following character 测 falls into the Unicode class "Letter", and so will not match \p{Letter}.
By the way, to match any "non-word" symbol in Unicode, you can use:
[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]
\b only works on boundary between words and non-words. In case of Chinese, the entire '你说到这是一个测试' is considered a word, so '一个' won't match '你说到这是一个测试' with your regex pattern with \b since '一个' is not at the word boundary of '你说到这是一个测试'. '测试' on the other hand, will match. For Chinese words, a simple substring match is usually enough.
For the address field, I need first character of every word to be uppercase. I have been using /\b./g which has eventually resulted in a problem where first character after special characters such as !#*&;' and so on are also capitalised. ie. King'S Street instead of King's Street.
Is there a way to adjust that expression to exclude that behaviour or is changing the entire expression more optimal?
replace \b with (^|[ ])
Your regex will be: /(^|[ ])./g
Explanation:
\b by definition: is used to find a match at the beginning or end of a word.
(^|[ ]) will match with the beginning of the string or any space characters
(^|[ ]). will match every space followed by a character and the first character of the string.
Side note:
Use (^|\s) to match every blank spaces.
Your regex will be: /(^|\s)./g
You could use a lookahead:
\b[a-z](?=\w+)
See a demo on regex101.com.
I need is the last match. In the case below the word test without the $ signs or any other special character:
Test String:
$this$ $is$ $a$ $test$
Regex:
\b(\w+)\b
The $ represents the end of the string, so...
\b(\w+)$
However, your test string seems to have dollar sign delimiters, so if those are always there, then you can use that instead of \b.
\$(\w+)\$$
var s = "$this$ $is$ $a$ $test$";
document.body.textContent = /\$(\w+)\$$/.exec(s)[1];
If there could be trailing spaces, then add \s* before the end.
\$(\w+)\$\s*$
And finally, if there could be other non-word stuff at the end, then use \W* instead.
\b(\w+)\W*$
In some cases a word may be proceeded by non-word characters, for example, take the following sentence:
Marvelous Marvin Hagler was a very talented boxer!
If we want to match the word boxer all previous answers will not suffice due the fact we have an exclamation mark character proceeding the word. In order for us to ensure a successful capture the following expression will suffice and in addition take into account extraneous whitespace, newlines and any non-word character.
[a-zA-Z]+?(?=\s*?[^\w]*?$)
https://regex101.com/r/D3bRHW/1
We are informing upon the following:
We are looking for letters only, either uppercase or lowercase.
We will expand only as necessary.
We leverage a positive lookahead.
We exclude any word boundary.
We expand that exclusion,
We assert end of line.
The benefit here are that we do not need to assert any flags or word boundaries, it will take into account non-word characters and we do not need to reach for negate.
var input = "$this$ $is$ $a$ $test$";
If you use var result = input.match("\b(\w+)\b") an array of all the matches will be returned next you can get it by using pop() on the result or by doing: result[result.length]
Your regex will find a word, and since regexes operate left to right it will find the first word.
A \w+ matches as many consecutive alphanumeric character as it can, but it must match at least 1.
A \b matches an alphanumeric character next to a non-alphanumeric character. In your case this matches the '$' characters.
What you need is to anchor your regex to the end of the input which is denoted in a regex by the $ character.
To support an input that may have more than just a '$' character at the end of the line, spaces or a period for instance, you can use \W+ which matches as many non-alphanumeric characters as it can:
\$(\w+)\W+$
Avoid regex - use .split and .pop the result. Use .replace to remove the special characters:
var match = str.split(' ').pop().replace(/[^\w\s]/gi, '');
DEMO
I'm able to match and highlight this Hebrew letter in JS:
var myText = $('#text').html();
var myHilite = myText.replace(/(\u05D0+)/g,"<span class='highlight'>$1</span>");
$('#text').html(myHilite);
fiddle
but can't highlight a word containing that letter at a word boundary:
/(\u05D0)\b/g
fiddle
I know that JS is bad at regex with Unicode (and server side is preferred), but I also know that I'm bad at regex. Is this a limit in JS or an error in my syntax?
I can't read Hebrew... does this regex do what you want?
/(\S*[\u05D0]+\S*)/g
Your first regex, /(\u05D0+)/g matches on only the character you are interested in.
Your second regex, /(\u05D0)\b/g, matches only when the character you are interested in is the last-only (or last-repeated) character before a word boundary...so that doesn't won't match that character in the beginning or middle of a word.
EDIT:
Look at this anwer
utf-8 word boundary regex in javascript
Using the info from that answer, I come up with this regex, is this correct?
/([\u05D0])(?=\s|$)/g
What about using the following regexp which uses all cases of a word in a sentence:
/^u05D0\s|\u05D0$|\u05D0\s|^\u05D0$/
it actually uses 4 regexps with the OR operator ('|').
Either the string starts with your exact word followed by a space
OR your string has space + your word + space
OR your string ends with space + your word
OR your string is the exact word only.