regex words matching for Chinese and Japanese character - javascript

I know the pattern for detecting whether a string consists of Chinese characters, but that's not what I need. I need to check whether certain characters are found in a string.
const words_found = (words, values) =>
words.some(word =>
values.match(new RegExp(word + '\\b', 'i'))
)
words_found(['james'], 'my name is james') // true
but it fails for Chinese characters:
words_found(['一个'], '你说到这是一个测试') // false

Read the documentation for word boundaries.
A word boundary matches the position between a word character followed by a non-word character, or between a non-word character followed by a word character.
where "word character" is something that matches \w (in JavaScript, the ASCII letters, digits, and the underscore), and "non-word character" is something that matches \W.
Note that all Chinese characters, in the sense that we usually think of them, are considered "non-word characters" as relates to the definition of word boundaries in JavaScript regular expressions. In other words, there is no word boundary between 一 and 个, because both are non-word characters; similarly, there is no word boundary between 一个 and 测试, because both 个 and 测 are non-word characters.
With regard to Japanese, Chinese, and Korean, which do not generally use spaces, there is not even a single clear definition of what the concept of "word" means, and therefore no concept of "word character" or "word boundary". There are libraries which people have worked on for years, involving machine learning, to try to break text into meaningful word-like segments, and they all do it in a slightly different way. The relevant question here is why you think you want to break the Chinese into what you are thinking of as "words" (or find strings which occur right before "word boundaries"). What is the point of your \\b that is forcing the match to occur right before a word boundary? What case are you trying to exclude?
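A quick check illustrates the point. This is only a sketch using the sample strings from the question; the results follow from the ASCII-only definition of \w:

```javascript
const chinese = '你说到这是一个测试';

// No character in the string matches \w, so there is no position where
// a word character meets a non-word character.
const hasWordChar = /\w/.test(chinese);
const hasBoundary = /\b/.test(chinese);

// The English example works because "james" consists of ASCII word characters.
const englishMatch = /james\b/i.test('my name is james');

console.log(hasWordChar, hasBoundary, englishMatch); // false false true
```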
Using Unicode regexp properties
However, you may be able to use the new Unicode regexp character class escapes in ECMAScript 2018 (http://2ality.com/2017/07/regexp-unicode-property-escapes.html). For instance, to match Chinese strings occurring before something that doesn't look like a Chinese character (or any letter), you could use
new RegExp(`${word}(?=$|\\P{Letter})`, "u")
Roughly speaking, this translates into "find the word, but only if it is followed by (using look-ahead, the (?= part) either end-of-string ($) or a character which does not have the Unicode property Letter". The "u" flag enables Unicode processing.
Of course, this will not help you find 一个 as a "word" inside 你说到这是一个测试, because the following character 测 falls into the Unicode class "Letter", and so will not match \P{Letter}.
By the way, to match any "non-word" symbol in Unicode, you can use:
[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]
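As a rough sketch of how the look-ahead pattern behaves, assuming a runtime with ES2018 property escapes: the helper name beforeNonLetter is hypothetical, and it assumes the search term contains no regex metacharacters. The backslash is doubled because the pattern is built from a string.

```javascript
// Hypothetical helper: builds the look-ahead pattern for one search term.
// Assumes `word` contains no regex metacharacters.
const beforeNonLetter = word =>
  new RegExp(`${word}(?=$|\\P{Letter})`, 'u');

const text = '你说到这是一个测试';
const atEnd = beforeNonLetter('测试').test(text);             // followed by end-of-string
const inMiddle = beforeNonLetter('一个').test(text);          // followed by 测, a Letter
const beforeComma = beforeNonLetter('一个').test('这是一个,测试'); // followed by a comma

console.log(atEnd, inMiddle, beforeComma); // true false true
```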

\b only matches at a boundary between a word character and a non-word character. In JavaScript, Chinese characters are not word characters, so the entire '你说到这是一个测试' contains no word boundary at all, not even at its start or end. '一个' therefore won't match with your \b pattern, and neither will '测试' at the end of the string. For Chinese words, a simple substring match is usually enough.
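A minimal sketch of the substring approach, written as a hypothetical replacement for words_found (the name wordsFound is not from the question). Lowercasing both sides is an assumption meant to approximate the original 'i' flag:

```javascript
// Hypothetical rewrite of words_found without \b.
// toLowerCase() approximates the case-insensitive 'i' flag for simple cases.
const wordsFound = (words, text) => {
  const haystack = text.toLowerCase();
  return words.some(word => haystack.includes(word.toLowerCase()));
};

console.log(wordsFound(['一个'], '你说到这是一个测试')); // true
console.log(wordsFound(['James'], 'my name is james'));  // true
console.log(wordsFound(['猫'], '你说到这是一个测试'));    // false
```

Note that a plain substring match also finds 'james' inside 'jameson', so it trades away the whole-word check for English input.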

Related

Finding all words ending in "ion" with regex in JavaScript [duplicate]

I need help putting together a regex that will match a word that ends with "Id", with a case-sensitive match.
Try this regular expression:
\w*Id\b
\w* allows word characters in front of Id and the \b ensures that Id is at the end of the word (\b is word boundary assertion).
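In JavaScript, that pattern could be tried out as follows; the sample sentence is made up for illustration:

```javascript
// Case-sensitive: no 'i' flag, so "valid" (lowercase "id") is not matched.
const endsWithId = /\w*Id\b/g;
const sample = 'userId and OrderId but not Identity or valid';

const found = sample.match(endsWithId);
console.log(found); // [ 'userId', 'OrderId' ]

// With \w* the bare word "Id" also matches; \w+ would require a prefix.
const bareStar = /\w*Id\b/.test('Id');
const barePlus = /\w+Id\b/.test('Id');
console.log(bareStar, barePlus); // true false
```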
Gumbo gets my vote, however, the OP doesn't specify whether just "Id" is an allowable word, which means I'd make a minor modification:
\w+Id\b
One or more word characters followed by "Id" and a word boundary. The [a-zA-Z] variants don't take into account non-English alphabetic characters. I might also use \s instead of \b to require an actual whitespace character rather than a word boundary. It would depend on whether you need to wrap over multiple lines.
This may do the trick:
\b\p{L}*Id\b
Where \p{L} matches any (Unicode) letter and \b matches a word boundary.
How about \A[a-z]*Id\z? [This makes characters before Id optional. Use \A[a-z]+Id\z if there needs to be one or more characters preceding Id.]
I would use
\b[A-Za-z]*Id\b
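The \b matches at the beginning and end of a word, i.e. at the position between a word character and a space, tab, newline, or other non-word character, or at the beginning or end of the string when a word character sits there.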
The [A-Za-z] will match any letter, and the * means that 0+ get matched. Finally there is the Id.
Note that this will match words that have capital letters in the middle such as 'teStId'.
I use http://www.regular-expressions.info/ for regex reference
Regex ids = new Regex(@"\w*Id\b", RegexOptions.None);
\b means "word boundary" and \w means any word character. So \w*Id\b means "{stuff}Id". By not including RegexOptions.IgnoreCase, it will be case sensitive.

Custom regex word boundary (javascript)

I'm trying to create a custom word boundary (like \b) that also takes words starting or ending with the unicode characters "ÆØÅæøå" into consideration.
Now the only thing I can come up with is this ugly thing
((?<![\wÆØÅæøå])(?=[\wÆØÅæøå])|(?![\wÆØÅæøå])(?<=[\wÆØÅæøå]))
Is there a more elegant solution to this? Or is this the only way.
You can use:
(?<!\p{L}\p{M}*|[\p{N}_]) // leading word boundary, similar to \<, [[:<:]] or \m in other flavors
(?![\p{L}\p{N}_]) // trailing word boundary, similar to \>, [[:>:]] or \M
Compile the regex with the u modifier to enable Unicode category classes.
The (?<!\p{L}\p{M}*|[\p{N}_]) is a negative lookbehind that matches a location not immediately preceded with a letter followed with zero or more diacritic marks or a digit or an underscore.
The (?![\p{L}\p{N}_]) is a negative lookahead that matches a location not immediately followed with a letter, digit or an underscore.
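A small sketch of how the two lookarounds could be wrapped around a literal word. The names LEADING, TRAILING, escapeRegExp, and wholeWord are hypothetical, and the code assumes a runtime that supports lookbehind and Unicode property escapes (ES2018):

```javascript
// Hypothetical helper names; the two lookarounds are the ones given above.
const LEADING = String.raw`(?<!\p{L}\p{M}*|[\p{N}_])`;
const TRAILING = String.raw`(?![\p{L}\p{N}_])`;

// Escape regex metacharacters so the word is matched literally.
const escapeRegExp = s => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// The 'u' flag is always added because \p{...} requires it.
const wholeWord = (word, flags = '') =>
  new RegExp(LEADING + escapeRegExp(word) + TRAILING, flags + 'u');

const re = wholeWord('øl');
console.log(re.test('en kald øl, takk')); // true
console.log(re.test('fatøl'));            // false
console.log(re.test('ølet'));             // false
```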

String replace exact match in cyrillic

I want to use regex for string replace with Cyrillic characters. I want to use exact match option. My string replace is working with Latin characters and is looking like that:
'Edin'.replace(/\Edin\b/gi, ''); // Output is ""
The same expression is not working with Cyrillic characters
'Един'.replace(/\Един\b/gi, ''); // Output is still 'Един'
The problem here is the \b word boundary assertion, which matches a position at a word boundary. A word boundary is defined as (^\w|\w$|\W\w|\w\W), and the word character \w is in turn the set of ASCII characters [A-Za-z0-9_]. Obviously Cyrillic characters don't fall into this set.
For the same reason, the regular expression /\w+/ will not match a Cyrillic string.
As dfsq wrote, the problem is with the word boundary.
If you remove \b you will get the desired output, but it is quite a different regex: it will also replace Един where it is part of a longer word. To avoid that, you can use a negative lookahead that lists the letters which must not appear directly after the match, because they could be part of a word.
'Един'.replace(/\Един(?![A-я])/gi, '');
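On runtimes with ES2018 lookbehind and Unicode property escapes (an assumption, since the answers above do not rely on them), the same idea can be sketched without listing a letter range by hand. The variable names are hypothetical:

```javascript
// Characters that count as "part of a word" for this sketch:
// any Unicode letter, any Unicode digit, or the underscore.
const wordChar = String.raw`[\p{L}\p{N}_]`;

// Match Един only when it is not glued to another word character on either side.
const exactWord = new RegExp(`(?<!${wordChar})Един(?!${wordChar})`, 'giu');

const alone = 'Един'.replace(exactWord, '');
const insideWord = 'Единство'.replace(exactWord, '');
const mixedCase = 'един и Един'.replace(exactWord, '');

console.log(JSON.stringify([alone, insideWord, mixedCase]));
// ["","Единство"," и "]
```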

Examples and explanation for javascript regular expression (x), decimal point, and word boundary

Can someone give a better explanation for these special characters examples in here? Or provide some clearer examples?
(x)
The '(foo)' and '(bar)' in the pattern /(foo) (bar) \1 \2/ match and
remember the first two words in the string "foo bar foo bar". The \1
and \2 in the pattern match the string's last two words.
decimal point
For example, /.n/ matches 'an' and 'on' in "nay, an apple is on the
tree", but not 'nay'.
Word boundary \b
/\w\b\w/ will never match anything, because a word character can never
be followed by both a non-word and a word character.
non word boundary \B
/\B../ matches 'oo' in "noonday", and /y\B./ matches 'ye' in
"possibly yesterday."
totally having no idea what the above example is showing :(
Much thanks!
Parentheses (aka capture groups)
Parentheses are used to indicate a group of symbols in the regular expression that, when matched, are 'remembered' in the match result. Each matched group is labelled in numbered order, as \1, \2, and so on. In the example /(foo) (bar) \1 \2/ we remember the match foo as \1, and the match bar as \2. This means that the string "foo bar foo bar" matches the regular expression because the third and fourth terms (the \1 and \2) are matching the first and second capture groups (i.e. (foo) and (bar)). You can use capture groups in JavaScript like this:
/id:(\d+)/.exec("the item has id:57") // => ["id:57", "57"]
Note that in the return we get the whole match, and the subsequent groups that were captured.
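Both points can be checked directly. The snippet below is just an illustration; the second test string is made up to show a case where the backreferences fail:

```javascript
// exec returns the whole match at index 0 and each capture group after it.
const m = /id:(\d+)/.exec('the item has id:57');
console.log(m[0], m[1]); // id:57 57

// \1 and \2 must repeat exactly what groups 1 and 2 captured, in that order.
const repeated = /(foo) (bar) \1 \2/;
const sameOrder = repeated.test('foo bar foo bar');
const swapped = repeated.test('foo bar bar foo');
console.log(sameOrder, swapped); // true false
```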
Decimal point (aka wildcard)
A decimal point is used to represent a single character that can have any value. This means that the regular expression /.n/ will match any two-character string where the second character is an 'n'. So /.n/.test("on") // => true, /.n/.test("an") // => true but /.n/.test("or") // => false. DrC brings up a good point in the comments that this won't match a newline character; that only matters when the input contains line breaks, and the s (dotAll) flag makes the dot match them as well.
Word boundaries
A word boundary matches the position between a word character and a non-word character (or the start or end of the string, when a word character sits there). In JavaScript the word characters are the ASCII alphanumerics and the underscore (per MDN); non-word is everything else! The trick for word boundaries is that they are zero-width assertions, which means they don't count as a character. That's why /\w\b\w/ will never match, because you can never have a word boundary between two word characters.
Non-word boundaries
The opposite of a word boundary, instead of matching a point that goes from non-word to word, or word to non-word (i.e. the ends of a word) it will match points where it's moving between the same types of character. So for our examples /\B../ will match the first point in the string that is between two characters of the same type and the next two characters, in this case it's between the first 'n' and 'o', and the next two characters are "oo". In the second example /y\B./ we are looking for the character 'y' followed by a character of matching type (so a word character), and the '.' will match that second character. So "possibly yesterday" won't match on the 'y' at the end of "possibly" because the next character is a space, which is a non word, but it will match the 'y' at the beginning of "yesterday", because it's followed by a word character, which is then included in the match by the '.' in the regular expression.
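The boundary examples from the question can be run as-is to confirm the explanation; the variable names are only for illustration:

```javascript
// \B at index 0 fails (start of string next to 'n' is a boundary),
// so the first non-boundary is between 'n' and 'o', and ".." takes "oo".
const first = 'noonday'.match(/\B../)[0];

// The 'y' in "possibly" is followed by a space (a boundary), so it is skipped;
// the 'y' in "yesterday" is followed by 'e', giving "ye".
const second = 'possibly yesterday.'.match(/y\B./)[0];

// A boundary can never sit between two word characters.
const never = /\w\b\w/.test('any text at all');

console.log(first, second, never); // oo ye false
```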
Overall, regular expressions are popular in many languages and built on a sound theoretical basis, so there's a lot of material on these characters. In general, JavaScript regular expressions are very similar to Perl-compatible (PCRE) regular expressions (but not exactly the same!), so the majority of your questions about JavaScript regular expressions would be answered by any PCRE regex tutorial (of which there are many).
Hope that helps!

How can I make a regular expression which takes accented characters into account?

I have a JavaScript regular expression which basically finds two-letter words. The problem seems to be that it interprets accented characters as word boundaries. Indeed, it seems that
A word boundary ("\b") is a spot
between two characters that has a "\w"
on one side of it and a "\W" on the
other side of it (in either order),
counting the imaginary characters off
the beginning and end of the string as
matching a "\W".
AS3 RegExp to match words with boundry type characters in them
And since
\w matches any alphanumerical
character (word characters) including
underscore (short for [a-zA-Z0-9_]).
\W matches any non-word characters
(short for [^a-zA-Z0-9_])
http://www.javascriptkit.com/javatutors/redev2.shtml
obviously accented characters are not taken into account. This becomes a problem with words like Montréal. If the é is considered a word boundary, then al is a two-letter word. I have tried making my own definition of a word boundary which would allow for accented characters, but seeing as a word boundary isn't even a character, I don't exactly know how to go about finding it.
Any help?
Here is the relevant JavaScript code, which searches userInput and finds two-letter words using the re_state regular expression:
var re_state = new RegExp("\\b([a-z]{2})[,]?\\b", "mi");
var match_state = re_state.exec(userInput);
document.getElementById("state").value = (match_state)?match_state[1]:"";
While JavaScript regexes recognize non-ASCII characters in some cases (like \s), they're hopelessly inadequate when it comes to \w and \b. If you want them to work with anything beyond the ASCII word characters, you'll have to either use a different language, or install Steve Levithan's XRegExp library with the Unicode plugin.
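For comparison, here is a sketch that assumes a newer runtime with ES2018 lookbehind and Unicode property escapes, which this answer does not rely on. It is not the XRegExp approach, and the variable names are hypothetical:

```javascript
const input = 'Montr\u00e9al, QC'; // "Montréal, QC"

// ASCII-only \b treats é as a non-word character, so "al" looks like a word.
const asciiMatch = /\b[a-z]{2}\b/i.exec(input);

// Unicode-aware variant: exactly two letters that are not adjacent to
// another letter, combining mark, digit, or underscore.
const edge = String.raw`[\p{L}\p{M}\p{N}_]`;
const twoLetter = new RegExp(`(?<!${edge})\\p{L}{2}(?!${edge})`, 'u');
const unicodeMatch = twoLetter.exec(input);

console.log(asciiMatch[0], unicodeMatch[0]); // al QC
```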
By the way, there's an error in your regex. You have a \b after the optional trailing comma, but it should be in front:
"\\b([a-z]{2})\\b,?"
I also removed the square brackets; you would only need those if the comma had a special meaning in regexes, which it doesn't. But I suspect you don't need to match the comma at all; \b should be sufficient to make sure you're at the end of the word. And if you don't need the comma, you don't need the capturing group either:
"\\b[a-z]{2}\\b"
Have you set JavaScript to use non-ASCII?
Here is a page that suggests setting JavaScript to use UTF-8:
http://blogs.oracle.com/shankar/entry/how_to_handle_utf_8
It says:
add a charset attribute
(charset="utf-8") to your script tags
in the parent page:
<script type="text/javascript" src="[path]/myscript.js" charset="utf-8"></script>
