Javascript word boundary unicode space issue - javascript

I want to write a regex pattern that matches full words or phrases, even if they contain Unicode characters, so that I can wrap them in some HTML. So I use this pattern:
var pattern=new RegExp('(^|\\s)'+phrase+'(?=\\s|$)', "gi");
It works perfectly, even on multi-word phrases, except for one issue. If the phrase isn't at the start of the string, the match includes the space before the word, so after I wrap it I lose that space. I only want to wrap the phrase variable and not the spaces.
For example:
var string="This is a nice sentence.";
var phrase="is a nice";
/*OUTPUT: Thisis a nicesentence*//*HTML OUTPUT: This<span>is a nice</span>sentence*/
/*What I want: This <span>is a nice</span> sentence*/
Of course this pattern could work:
var pattern=new RegExp(phrase, "gi");
But I'm not looking for those strings that are substrings of another.
Is it possible to solve my issue with a better regex pattern?

Simply write back what you captured in group 1:
output = string.replace(pattern, '$1<span>' + phrase + '</span>');
If you are not using replace but match or exec and do the replacement manually, you can still access the capturing group in the returned array and insert the space or empty string before your span.
By the way, if you capture the phrase as well, you don't need any string concatenation in the replacement:
var pattern = new RegExp('(^|\\s)('+phrase+')(?=\\s|$)', "gi");
output = string.replace(pattern, '$1<span>$2</span>');
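A minimal sketch of the capture-group approach above, assuming the phrase may contain regex metacharacters and therefore should be escaped before it is concatenated into the pattern. The helper names escapeRegExp and wrapPhrase are hypothetical and not part of the original answer.

```javascript
// Hypothetical helper: escape regex metacharacters so the phrase is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wrap whole-phrase matches in <span>.
// Group 1 holds the leading whitespace (or nothing at the start of the string)
// and is written back unchanged; group 2 holds the phrase as it appears in the input.
function wrapPhrase(input, phrase) {
  var pattern = new RegExp('(^|\\s)(' + escapeRegExp(phrase) + ')(?=\\s|$)', 'gi');
  return input.replace(pattern, '$1<span>$2</span>');
}

console.log(wrapPhrase('This is a nice sentence.', 'is a nice'));
// This <span>is a nice</span> sentence.
```

Because the trailing boundary is a lookahead, the space after the phrase is never consumed, so only the leading whitespace needs to be restored.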

Related

Regex for duplicate words in a string with one of them with a apostrophe in Angular

I want to create a regex for the following requirement: a string contains the same word twice, but one of the occurrences has an apostrophe.
For example: EXCHANGE CORRESPONDENCE WITH BOLNO'S COUNSEL RE SCHEDULING INTERVIEW WITH DAVID BOLNO.
I am using the following regex to validate the string, splitting it on that word:
var splitArray = this.narrative.split(new RegExp("\\b(" + this.misspelledWords[m] + ")\\b"))
But this regex treats BOLNO and BOLNO'S as the same word. I want to write my regex so that it treats BOLNO'S and BOLNO as different words.
Can anyone help me with this one?
You can use
new RegExp("\\b(" + this.misspelledWords[m] + ")(?!['\\w])")
Note that this assumes this.misspelledWords[m] contains no special regex characters.
The (?!['\w]) negative lookahead fails the match if there is a ' or a word char immediately on the right.
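As a small illustration of the difference between the two patterns, here is a sketch that uses the example sentence from the question. The variables narrative and word are stand-ins for this.narrative and this.misspelledWords[m], which are not shown in the original.

```javascript
var narrative = "EXCHANGE CORRESPONDENCE WITH BOLNO'S COUNSEL RE SCHEDULING INTERVIEW WITH DAVID BOLNO.";
var word = "BOLNO"; // stand-in for this.misspelledWords[m]

// Original pattern: \b also matches between "O" and "'", so BOLNO'S counts as a hit.
var plain = new RegExp("\\b(" + word + ")\\b", "g");

// Suggested pattern: reject the match if an apostrophe or word character follows.
var strict = new RegExp("\\b(" + word + ")(?!['\\w])", "g");

console.log(narrative.match(plain).length);  // 2
console.log(narrative.match(strict).length); // 1

// Splitting with the strict pattern keeps BOLNO'S inside the first chunk.
var pieces = narrative.split(strict);
console.log(pieces.length); // 3
```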

Javascript string split with regex

I am trying to split a string using a regular expression for links (urls).
The regex in question is
var regex = new RegExp('(?:^(?:(?:[a-z]+:)?//)(?:\S+(?::\S*)?#)?(?:localhost|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:[/?#]\S*)?$)')
If I do
regex.test("https://google.com"); // returns true
but doing -
"Go to https://google.com".split(regex);
// return ["Go to https://google.com"]
Whereas I expect it to return
["Go to ", "https://google.com"]
Any idea what's going on here?
First of all, you're using a string literal to build your regex, which means that you have to escape your backslashes (since a backslash has a special meaning in strings, used for the line feed char \n for example):
var regex = new RegExp('(?:^(?:(?:[a-z]+:)?//)(?:\\S+(?::\\S*)?#)?(?:localhost|(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,})))(?::\\d{2,5})?(?:[/?#]\\S*)?$)');
Another solution would be to use a regex literal, since JavaScript provides one, but you would then have to escape the forward slashes:
var regex = /(?:^(?:(?:[a-z]+:)?\/\/)(?:\S+(?::\S*)?#)?(?:localhost|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:[\/?#]\S*)?$)/;
Then, your regex will try to match against the entire input due to the ^ and $ anchors. So if you remove them (or better, replace them with word boundaries \b), you'll be able to find URLs in a string for example:
var regex = /(?:\b(?:(?:[a-z]+:)?\/\/)(?:\S+(?::\S*)?#)?(?:localhost|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:[\/?#]\S*)?\b)/;
But the main point is that you're misunderstanding how split works. Given the string "hello world", if you split by space you'll end up with ["hello", "world"]: the space is gone, since it was the separator used to split.
That is, if you split by the URL regex, the output array won't contain the URLs anymore. It seems to me that a lookahead could suit your needs:
var regex = /(?=(?:\b(?:(?:[a-z]+:)?\/\/)(?:\S+(?::\S*)?#)?(?:localhost|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:[\/?#]\S*)?\b))/;
"Go to https://google.com".split(regex) // ["Go to ", "https://google.com"]
The regex explained:
(?=(?:\b(?:(?:[a-z]+:)?//)(?:\S+(?::\S*)?#)?(?:localhost|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:[/?#]\S*)?\b))
By splitting a string with a positive lookahead (?=content_of_lookahead), you'll split by each interchar that is followed by the content of the lookahead.
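To make the three split behaviours concrete, here is a sketch that uses a deliberately simplified URL pattern (https?:\/\/\S+). That pattern is an assumption for illustration only and is far less strict than the one in the answer.

```javascript
var text = "Go to https://google.com now";

// 1. Plain pattern: the URL is the separator, so it disappears from the result.
var plainSplit = text.split(/https?:\/\/\S+/);
console.log(plainSplit); // ["Go to ", " now"]

// 2. Lookahead: splits at the position just before the URL and consumes nothing.
var lookaheadSplit = text.split(/(?=https?:\/\/\S+)/);
console.log(lookaheadSplit); // ["Go to ", "https://google.com now"]

// 3. Capturing group: split() inserts the captured separator into the output array.
var captureSplit = text.split(/(https?:\/\/\S+)/);
console.log(captureSplit); // ["Go to ", "https://google.com", " now"]
```

The capturing-group form may be the more convenient one when text can also follow the URL, since the URL then ends up as its own array element.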
Take a look at 8 Regular Expressions You Should Know.
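To match a URL, you can use the following regex. It is written as a regex literal because split treats a string argument as literal text, and \d inside a string literal loses its backslash. Use match rather than split to extract the URL:
var regex = /(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w# \.-]*)*\/?$/g;
"Go to https://google.com".match(regex);
// returns ["https://google.com"]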
Hope this helps.

Find how many words are starting with given search string - Regex

I am working on regex for searching hotel list. There are names like "testing hotel plaza", "testing2 newhotel plaza", "plaza hotel"....
Basically, my requirement is that if the user types "plaza", all hotels whose names contain "Plaza" should be listed, but if the user types "aza", no results should appear. In short, I need to check whether any word in a given string starts with the string the user entered and, if so, display the result.
Here is the code I am stuck on; it is not working.
var regex = new RegExp("/\b"+searchString, "gi");
if (mainString.match(regex))
{
return true;
}
This works, but it finds every occurrence, even in the middle of a word or at any other position, which I do not want.
var regex = new RegExp(searchString , "gi");
if (mainString.match(regex))
{
return true;
}
When invoking the RegExp constructor like this, the regex is not enclosed in slashes (/.../), but you have a leading forward slash in your string. Also, escape sequence backslashes need to be escaped themselves, so what you should be using is
var regex = new RegExp("\\b"+searchString, "gi");
EDIT:
Yes, since \b is defined relative to [A-Za-z0-9_], this is indeed problematic when it comes to non-ASCII chars. You could probably solve it using more or less complicated lookarounds, but a much easier solution which would most likely do the trick here is to say that searchString should be found either at the beginning or after a whitespace character:
var regex = new RegExp("(?:^|\\s)"+searchString, "gi");
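A short sketch of the "start of string or after whitespace" approach, assuming the user input should be matched literally. The function names escapeRegExp and hasWordStartingWith are hypothetical. The "g" flag is left out on purpose here, because a global regex keeps state in lastIndex between test() calls.

```javascript
// Hypothetical helper: escape regex metacharacters in user input.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: true if some whitespace-delimited word in mainString
// starts with searchString (case-insensitive).
function hasWordStartingWith(mainString, searchString) {
  var regex = new RegExp("(?:^|\\s)" + escapeRegExp(searchString), "i");
  return regex.test(mainString);
}

console.log(hasWordStartingWith("testing hotel plaza", "plaza"));     // true
console.log(hasWordStartingWith("testing hotel plaza", "aza"));       // false
console.log(hasWordStartingWith("testing2 newhotel plaza", "hotel")); // false
```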

regexp problem, the dot selects all text

I use some jQuery to highlight search results. For some reason, if I enter a single dot, all of the text gets selected. I use a regex and replace to wrap the results in a tag to give the found matches a color.
The code that I use:
var pattern = new RegExp('('+$.unique(text.split(" ")).join("|")+")","gi");
How can I prevent the dot from selecting all the text? I want the dot to be treated literally, so that it has no special power.
You may be able to get there by doing this:
var pattern = new RegExp('('+$.unique(text.replace(/\./g, '\\.').split(" ")).join("|")+")","gi");
The idea here is that you're attempting to escape the period, which acts as a wild card in regex.
This will replace all special RegExp characters (except for | since you're using that to join the terms) with their escaped version so you won't get unwanted matches or syntax errors:
var str = $.unique(text.split(" ")).join("|"),
pattern;
str = str.replace(/[\\\.\+\*\?\^\$\[\]\(\)\{\}\/\'\#\:\!\=]/ig, "\\$&");
pattern = new RegExp('('+str+')', 'gi');
The dot is supposed to match all text (almost everything, really). If you want to match a period, you can just escape it as \..
If you have a period in your RegExp it's supposed to match any character besides newline characters. If you don't want that functionality you need to escape the period.
Example RegExp with period escaped /word\./
You need to escape the text you're putting into the regex, so that special characters don't have unwanted meanings. My code is based on some from phpjs.org:
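var words = $.unique(text.split(" "));
// escape regex special chars in each word before joining, so the | separators stay unescaped
words = words.map(function (word) {
return word.replace(/[.\\+*?\[\^\]$(){}=!<>|:\\-]/g, '\\$&');
}).join("|");
var pattern = new RegExp('(' + words + ")","gi");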
This escapes the following characters: .\+*?[^]$(){}=!<>|:- with a backslash \ so you can safely insert them into your new RegExp construction.
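For reference, here is a jQuery-free sketch of the same idea: split the search text into unique words, escape each word, then join them with |. The names escapeRegExp and buildHighlightPattern are hypothetical, and the <b> tag stands in for whatever wrapper the page actually uses.

```javascript
// Hypothetical helper: escape regex metacharacters so each word is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build one alternation pattern from the unique search words.
function buildHighlightPattern(text) {
  var words = text.split(/\s+/).filter(function (word, index, all) {
    // drop empty strings and repeated words
    return word !== "" && all.indexOf(word) === index;
  });
  return new RegExp("(" + words.map(escapeRegExp).join("|") + ")", "gi");
}

var pattern = buildHighlightPattern("v1.0 .");
var highlighted = "version v1.0 and v1x0.".replace(pattern, "<b>$1</b>");
console.log(highlighted);
// version <b>v1.0</b> and v1x0<b>.</b>
```

Escaping happens per word, before the join, so the | characters that separate the alternatives keep their regex meaning.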

Regex from character until end of string

Hey. First question here, probably extremely lame, but I totally suck in regular expressions :(
I want to extract the text from a series of strings that always have only alphabetic characters before and after a hyphen:
string = "some-text"
I need to generate separate strings that include the text before AND after the hyphen. So for the example above I would need string1 = "some" and string2 = "text"
I found this and it works for the text before the hyphen, now I only need the regex for the one after the hyphen.
Thanks.
You don't need regex for that, you can just split it instead.
var myString = "some-text";
var splitWords = myString.split("-");
splitWords[0] would then be "some", and splitWords[1] will be "text".
If you actually have to use regex for whatever reason, though: the $ character marks the end of a string in regex, so -(.*)$ is a regex that will match everything after the first hyphen it finds until the end of the string. That could actually be simplified to just -(.*) too, as the .* will match until the end of the string anyway (assuming the string contains no line breaks).
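Both approaches side by side, as a minimal sketch. The combined pattern ^([^-]*)-(.*)$ is an illustrative variant that captures the text before and after the first hyphen in one go; it does not appear in the answer above.

```javascript
var myString = "some-text";

// Approach 1: split on the hyphen.
var splitWords = myString.split("-");
console.log(splitWords[0]); // some
console.log(splitWords[1]); // text

// Approach 2: one regex with two capturing groups
// (group 1 = before the first hyphen, group 2 = everything after it).
var match = /^([^-]*)-(.*)$/.exec(myString);
console.log(match[1]); // some
console.log(match[2]); // text

// With more than one hyphen the two approaches differ:
var multi = "a-b-c";
console.log(multi.split("-").length);            // 3
console.log(/^([^-]*)-(.*)$/.exec(multi)[2]);    // b-c
```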
