Match Hebrew character at word boundary via regex in JavaScript?

I'm able to match and highlight this Hebrew letter in JS:
var myText = $('#text').html();
var myHilite = myText.replace(/(\u05D0+)/g,"<span class='highlight'>$1</span>");
$('#text').html(myHilite);
but can't highlight a word containing that letter at a word boundary:
/(\u05D0)\b/g
I know that JS is bad at regex with Unicode (and server side is preferred), but I also know that I'm bad at regex. Is this a limit in JS or an error in my syntax?

I can't read Hebrew... does this regex do what you want?
/(\S*[\u05D0]+\S*)/g
Your first regex, /(\u05D0+)/g matches on only the character you are interested in.
Your second regex, /(\u05D0)\b/g, matches only when the character you are interested in is the last character (or the end of a run of that character) before a word boundary, so it won't match that character at the beginning or in the middle of a word.
EDIT:
Look at this answer:
utf-8 word boundary regex in javascript
Using the info from that answer, I came up with this regex. Is this correct?
/([\u05D0])(?=\s|$)/g
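A minimal sketch of why \b fails here and how the lookahead above behaves. The sample string is made up for illustration, and the \p{L} variant assumes an engine that supports ES2018 Unicode property escapes.

```javascript
// Hypothetical sample: two Hebrew words, "bet alef" then "alef bet".
// Only the first alef (U+05D0) sits at the end of a word.
const sample = '\u05D1\u05D0 \u05D0\u05D1';

// \b is defined in terms of ASCII \w, so no boundary is seen next to a Hebrew letter.
const withWordBoundary = sample.match(/\u05D0\b/g);

// Lookahead: the letter must be followed by whitespace or the end of the input.
const withLookahead = sample.match(/\u05D0(?=\s|$)/g);

// Unicode-aware variant: the letter must not be followed by another letter.
const withProperty = sample.match(/\u05D0(?!\p{L})/gu);

console.log(withWordBoundary, withLookahead.length, withProperty.length);
```

The lookahead versions do not consume the following character, so the replacement string only has to wrap the captured letter.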

What about the following regexp, which covers every position a word can take in a sentence:
/^\u05D0\s|\s\u05D0\s|\s\u05D0$|^\u05D0$/
It combines four alternatives with the OR operator ('|'):
Either the string starts with your exact word followed by a space
OR your string has space + your word + space
OR your string ends with space + your word
OR your string is the exact word only.
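The four cases listed above can also be folded into one shorter pattern. This is only a sketch: the name standaloneAlef and the test strings are made up for illustration, and the "word" is the single letter U+05D0.

```javascript
// Start-of-string or whitespace before, whitespace or end-of-string after.
// Hypothetical name; no g flag, so test() keeps no state between calls.
const standaloneAlef = /(?:^|\s)\u05D0(?=\s|$)/;

const cases = [
  '\u05D0',                  // the exact word only
  '\u05D0 \u05D1',           // word at the start, followed by a space
  '\u05D1 \u05D0',           // space + word at the end
  '\u05D1 \u05D0 \u05D1',    // space + word + space
  '\u05D1\u05D0\u05D1'       // letter buried inside a longer word
];

const results = cases.map(function (s) { return standaloneAlef.test(s); });
console.log(results); // [true, true, true, true, false]
```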

Related

Match exact word and remove leading space in regular expression

I'm looking for a regular expression.
Requirement:
I need to select a complete word from a string (the word might contain special characters or anything else). I'm pretty close to the solution.
Example:
character-set
Regular expression: (?:^|\s)(cent-er)(?=\s|$)
Result: " character-set" with a leading space.
But I want to remove the leading space from the selected word. The word should match exactly, i.e. if I search for character or character- or -set or set, it should not return any result.
Any help is much appreciated. Thanks in advance.
It is not exactly what you seem to describe (as far as I could understand, that is), but maybe what you are looking for are word boundaries: \b. Try the regex (parentheses optional):
(\b)(cent-er)(\b)
Other than that, if you have to have a space before the word, then you will have to match it (and then use capturing groups to extract the word without the space), because JavaScript's regex had no lookbehinds before ES2018.
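A short sketch of the capturing-group approach, using the pattern from the question. The input string is a made-up example; the last pattern assumes a newer engine with ES2018 lookbehind support.

```javascript
const text = 'a cent-er b'; // hypothetical input
const pattern = /(?:^|\s)(cent-er)(?=\s|$)/;

const match = pattern.exec(text);
const whole = match[0]; // full match, includes the leading space
const word = match[1];  // capture group 1, the word without the space

// The word glued to other characters is rejected.
const partial = pattern.exec('xcent-er');

// With lookbehind, the leading space is checked but never part of the match.
const viaLookbehind = /(?<=^|\s)cent-er(?=\s|$)/.exec(text)[0];

console.log(JSON.stringify(whole), JSON.stringify(word), partial, viaLookbehind);
```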

Why do I have to add double backslash on javascript regex?

When I use a tool like regexpal.com, it lets me write regex the way I am used to. For example, I want to check whether a text contains a word that is at least 3 letters long and is followed by a whitespace character, so it will match 'now ', 'noww ' and so on.
On regexpal.com this regex works \w{3,}\s this matches both the words above.
But on javascript I have to add double backslashes before w and s. Like this:
var regexp = new RegExp('\\w{3,}\\s','i');
or else it does not work. I looked around for answers and searched for double backslash javascript regex but all I got was completely different topics about how to escape backslash and so on. Does someone have an explanation for this?
You could write the regex without double backslashes, but you need to put the regex inside forward slashes as delimiters.
/^\w{3,}\s$/.test('foo ')
The anchors ^ (start of the input) and $ (end of the input) make this an exact match of the whole string. You don't need the i modifier, since \w already matches both upper- and lower-case letters.
Why? Because in a string, "\" quotes the following character so "\w" is seen as "w". It essentially says "treat the next character literally and don't interpret it".
To avoid that, the "\" must be quoted too, so "\\w" is seen by the regular expression parser as "\w".
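The effect is easy to see by inspecting the source property of the resulting RegExp objects. A small sketch; the variable names are made up, and String.raw is an additional option not mentioned above.

```javascript
// Doubled backslashes survive string parsing and reach the regex engine as \w and \s.
const fromString = new RegExp('\\w{3,}\\s', 'i');

// A regex literal is not a string, so single backslashes are enough.
const literal = /\w{3,}\s/i;

// With single backslashes the string parser drops them: the pattern becomes w{3,}s.
const broken = new RegExp('\w{3,}\s', 'i');

// String.raw keeps backslashes as written (ES2015 and later).
const fromRaw = new RegExp(String.raw`\w{3,}\s`, 'i');

console.log(fromString.source, literal.source, broken.source, fromRaw.source);
console.log(fromString.test('now '), broken.test('now '));
```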

String replace exact match in cyrillic

I want to use regex for string replace with Cyrillic characters. I want to use exact match option. My string replace is working with Latin characters and is looking like that:
'Edin'.replace(/\Edin\b/gi, ''); // Output is ""
The same expression is not working with Cyrillic characters
'Един'.replace(/\Един\b/gi, ''); // Output is still 'Един'
The problem here is the \b word boundary assertion, which matches a position at a word boundary. A word boundary is defined as (^\w|\w$|\W\w|\w\W), and the word character \w in turn is the set of ASCII characters [A-Za-z0-9_]. Cyrillic characters don't fall into this set.
For the same reason, the regular expression /\w+/ will not match a Cyrillic string.
As dfsq wrote the problem is with word boundary.
If you remove \b you will get the desired output, but it is quite a different regex: it will also replace Един where it is part of a longer word. To avoid that, you can use a negative lookahead and list the letters that must not follow it, because they could be part of a word.
'Един'.replace(/\Един(?![A-я])/gi, '');
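As an alternative sketch, engines with ES2018 support (Unicode property escapes and lookbehind) can build a Unicode-aware boundary on both sides. This is an assumption about the target environment, not part of the answers above; the word is written with explicit code points so the example does not depend on file encoding.

```javascript
// "Един" as explicit Cyrillic code points.
const word = '\u0415\u0434\u0438\u043D';

// ASCII-only \w finds nothing in a Cyrillic string.
const asciiOnly = /\w+/.test(word);

// No letter, digit, or underscore directly before or after the word.
const wholeWord = new RegExp(
  '(?<![\\p{L}\\p{N}_])' + word + '(?![\\p{L}\\p{N}_])',
  'giu'
);

const alone = word.replace(wholeWord, '');
// Same word with a Cyrillic suffix attached ("Единство"): must stay untouched.
const longer = word + '\u0441\u0442\u0432\u043E';
const inside = longer.replace(wholeWord, '');

console.log(asciiOnly, JSON.stringify(alone), inside === longer);
```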

JS & Regex: how to replace punctuation pattern properly?

Given an input text in which every space has been replaced by one or more _ characters:
Hello_world_?. Hello_other_sentenc3___. World___________.
I want to keep the _ between words, but I want to attach each punctuation mark to the last word of its sentence, without any space between the last word and the punctuation. I want to use the punctuation as the pivot of my regex.
I wrote the following JS-Regex:
str = str.replace(/(_| )*([:punct:])*( |_)/g, "$2$3");
This fails, since it returns :
Hello_world_?. Hello_other_sentenc3_. World_._
Why doesn't it work? How can I delete all "_" between the last word and the punctuation?
http://jsfiddle.net/9c4z5/
Try the following regex, which makes use of a positive lookahead:
str = str.replace(/_+(?=\.)/g, "");
It replaces all underscores which are immediately followed by a punctuation character with the empty string, thus removing them.
If you want to match other punctuation characters than just the period, replace the \. part with an appropriate character class.
JavaScript doesn't have :punct: in its regex implementation. I believe you'd have to list out the punctuation characters you care about, perhaps something like this:
str = str.replace(/(_| )+([.,?])/g, "$2");
That is, replace any group of _ or space that is immediately followed by punctuation with just the punctuation.
Demo: http://jsfiddle.net/9c4z5/2/
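Both answers can be combined into one lookahead with an explicit character class. A sketch run against the input from the question; the set of punctuation characters is an assumption and should be adjusted to the text at hand.

```javascript
const input = 'Hello_world_?. Hello_other_sentenc3___. World___________.';

// Remove runs of underscores or spaces only when punctuation follows directly.
// The punctuation itself is not consumed, so no backreference is needed.
const output = input.replace(/[_ ]+(?=[.,?!:;])/g, '');

console.log(output); // Hello_world?. Hello_other_sentenc3. World.
```

An explicit list is used rather than \p{P} because the underscore is itself classified as punctuation in Unicode, which would make underscores between words match as well.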

Javascript: regex for replace words inside text and not part of the words

I need regex for replace words inside text and not part of the words.
My code that replace 'de' also when it’s part of the word:
str="de degree deep de";
output=str.replace(new RegExp('de','g'),'');
output==" gree ep "
Output that I need: " degree deep "
What should be regex for get proper output?
str.replace(/\bde\b/g, '');
Note that
RegExp('\\bde\\b','g') // regex object constructor (takes a string as input)
and
/\bde\b/g // regex literal notation, does not require \ escaping
are the same thing.
The \b denotes a "word boundary". A word boundary is defined as a position where a word character follows a non-word character, or vice versa. A word character is defined as [a-zA-Z0-9_] in JavaScript.
Start-of-string and end-of-string positions can be word boundaries as well, as long as they are followed or preceded by a word character, respectively.
Be aware that the notion of a word character does not work very well outside the realm of the English language.
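When the word comes from a variable, the constructor form is needed and the word should be escaped first. A minimal sketch; escapeRegExp and removeWholeWord are hypothetical helper names, not built-in functions.

```javascript
// Escape characters that have a special meaning inside a regex.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Remove every whole-word occurrence of `word` (ASCII word characters only).
function removeWholeWord(text, word) {
  const pattern = new RegExp('\\b' + escapeRegExp(word) + '\\b', 'g');
  return text.replace(pattern, '');
}

const output = removeWholeWord('de degree deep de', 'de');
console.log(JSON.stringify(output)); // " degree deep "
```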
str="de degree deep de";
output=str.replace(/\bde\b/g,'');
You can use the regex \bde\b.
The \b escape acts as a word boundary.
You should enclose your search characters between \b:
str="de degree deep de";
output=str.replace(/\bde\b/g,'');
You can use a word boundary as Arun & Tomalak note.
/\bde\b/g
or you can use a space
/de\s/g
http://www.regular-expressions.info/charclass.html
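The two patterns are not interchangeable. A quick comparison on made-up inputs shows that /de\s/g also hits "de" at the end of a longer word, consumes the whitespace, and misses a trailing "de":

```javascript
const sample = 'made de now'; // hypothetical input

// Word boundaries: only the standalone "de" is removed.
const withBoundary = sample.replace(/\bde\b/g, '');

// Trailing whitespace: "de " inside "made " is removed too, along with the spaces.
const withSpace = sample.replace(/de\s/g, '');

// A final "de" has no whitespace after it, so /de\s/g leaves it in place.
const trailing = 'de degree deep de'.replace(/de\s/g, '');

console.log(JSON.stringify(withBoundary), JSON.stringify(withSpace), JSON.stringify(trailing));
```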
