regex and javascript, some matches disappear !


Here is the code :
> var reg = new RegExp(" hel.lo ", 'g');
>
> var str = " helalo helblo helclo heldlo ";
>
> var mat = str.match(reg);
>
> alert(mat);
It alerts "helalo, helclo", but I expected "helalo, helblo, helclo, heldlo".
Only half of them match. I guess that's because each space only counts once, so I tried doubling every space before processing, but in some cases that's not enough.
I'm looking for an explanation and a solution.
Thanks

"␣helalo␣helblo␣helclo␣heldlo␣"
// 11111111------22222222-------
When ␣helalo␣ was matched, the string left is helblo␣... without the leading space. But the regex requires a leading space, so it skips to ␣helclo␣.
To avoid the expression eating up the space, use a lookahead.
var reg = / hel.lo(?= )/g
(Or use \b as a word boundary.)
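As a small sketch of the difference (the variable names here are only illustrative and not from the question), the three patterns can be compared side by side on the same input:

```javascript
// Same input as in the question.
const input = " helalo helblo helclo heldlo ";

// Consuming pattern: the trailing space becomes part of each match,
// so the following word has no leading space left to match against.
const consuming = input.match(/ hel.lo /g);

// Lookahead pattern: the trailing space is checked but not consumed.
const withLookahead = input.match(/ hel.lo(?= )/g);

// Word-boundary pattern: no spaces are part of the match at all.
const withBoundary = input.match(/\bhel.lo\b/g);

console.log(consuming);     // [ ' helalo ', ' helclo ' ]
console.log(withLookahead); // [ ' helalo', ' helblo', ' helclo', ' heldlo' ]
console.log(withBoundary);  // [ 'helalo', 'helblo', 'helclo', 'heldlo' ]
```

Note that the lookahead version still includes the leading space in each match, while the \b version returns the bare words.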

It matches the regex, advances to the next character after the matched string and goes on.
You can use \b to match word boundaries. You can add the whitespaces later, if you want to.
\bhel.lo\b

Related

\b regex special character seems not working for Cyrillic in javascript [duplicate]

I am building a search and I am going to use JavaScript autocomplete with it. I am from Finland (Finnish language), so I have to deal with some special characters like ä, ö and å.
When the user types text into the search input field, I try to match the text to the data.
Here is a simple example that does not work correctly if the user types, for example, "ää". The same goes for "äl".
var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Does not work
var searchterm = "äl";
// does not work
//var searchterm = "ää";
// Works
//var searchterm = "wi";
if ( new RegExp("\\b"+searchterm, "gi").test(title) ) {
    $("#result").html("Match: ("+searchterm+"): "+title);
} else {
    $("#result").html("nothing found with term: "+searchterm);
}
http://jsfiddle.net/7TsxB/
So how can I get those ä,ö and å characters to work with javascript regex?
I think I should use unicode codes but how should I do that? Codes for those characters are:
[\u00C4,\u00E4,\u00C5,\u00E5,\u00D6,\u00F6]
=> äÄåÅöÖ

The problem is the word boundary \b: it is defined in terms of \w, which only covers ASCII word characters, so it does not behave as expected when the search term starts with a non-ASCII character such as ä.
Instead of using \b, try using (?:^|\\s)
var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Now matches
var searchterm = "äl";
// Also matches
//var searchterm = "ää";
// Still works
//var searchterm = "wi";
if ( new RegExp("(?:^|\\s)"+searchterm, "gi").test(title) ) {
    $("#result").html("Match: ("+searchterm+"): "+title);
} else {
    $("#result").html("nothing found with term: "+searchterm);
}
Breakdown:
(?: parentheses () form a capture group in Regex. Parentheses that start with a question mark and colon ?: form a non-capturing group. They just group the terms together
^ the caret symbol matches the beginning of a string
| the bar is the "or" operator.
\s matches whitespace (appears as \\s in the string because we have to escape the backslash)
) closes the group
So instead of using \b, which matches word boundaries and doesn't work for unicode characters, we use a non-capturing group which matches the beginning of a string OR whitespace.

The \b assertion in JavaScript RegEx is really only useful with plain ASCII text. \b is shorthand for the boundary between \w and \W, or between \w and the beginning or end of the string. These character sets only take ASCII "word" characters into account: \w is equal to [a-zA-Z0-9_] and \W is the negation of that class.
This makes these RegEx shorthand classes largely useless for dealing with text in most real languages.
\s should work for what you want to do, provided that search terms are only delimited by whitespace.
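A minimal sketch of that behaviour, using the title string from the question (the variable names are only illustrative):

```javascript
const finnishTitle = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";

// ä is not part of \w, so the engine treats it like punctuation.
const aUmlautIsWordChar = /\w/.test("ä");

// No \b exists between the space and "ä", so this finds nothing.
const boundarySearch = /\bäl/i.test(finnishTitle);

// Start-of-string or whitespace works regardless of the alphabet.
const whitespaceSearch = /(?:^|\s)äl/i.test(finnishTitle);

// Plain ASCII terms still work with \b.
const asciiSearch = /\bwi/i.test(finnishTitle);

console.log(aUmlautIsWordChar); // false
console.log(boundarySearch);    // false
console.log(whitespaceSearch);  // true
console.log(asciiSearch);       // true
```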

This question is old, but I think I found a better solution for word boundaries in regular expressions with Unicode letters.
Using the XRegExp library you can implement a valid \b boundary by expanding this:
XRegExp('(?=^|$|[^\\p{L}])')
The resulting pattern is more than 4000 characters long, but it seems to perform quite well.
Some explanation: (?= ) is a zero-length lookahead that looks for the beginning or end of the string or a non-letter Unicode character. The most important thing is the lookahead, because \b doesn't capture anything: it is simply true or false.

\b is a shortcut for the transition between a letter and a non-letter character, or vice-versa.
Updating and improving on max_masseti's answer:
With the /u flag (introduced in ES2015) and the Unicode property escapes added in ES2018, you can now use \p{L} to represent any Unicode letter, and \P{L} (notice the uppercase P) to represent anything but. Lookbehind was also added in ES2018.
EDIT: Previous version was incomplete.
As such:
const text = 'A Fé, o Império, e as terras viciosas';
text.split(/(?<=\p{L})(?=\P{L})|(?<=\P{L})(?=\p{L})/u);
// ['A', ' ', 'Fé', ', ', 'o', ' ', 'Império', ', ', 'e', ' ', 'as', ' ', 'terras', ' ', 'viciosas']
We're using a lookbehind (?<=...) to find a letter and a lookahead (?=...) to find a non-letter, or vice versa.

I would recommend using XRegExp when you have to work with a specific set of Unicode characters. The author of this library has mapped all kinds of regional character sets, which makes working with different languages easier.

Although this question seems to be 8 years old, I ran into a similar problem (I had to match Cyrillic letters) not long ago. I spent a whole day on it and could not find an appropriate answer here on StackOverflow. So, to save others the effort, I'd like to share my solution.
Yes, \b word boundary works only with Latin letters (Word boundary: \b):
Word boundary \b doesn’t work for non-Latin alphabets
The word boundary test \b checks that there should be \w on the one side from the position and "not \w" – on the other side.
But \w means a Latin letter a-z (or a digit or an underscore), so the test doesn’t work for other characters, e.g. Cyrillic letters or hieroglyphs.
Yes, the \w and \b shorthands in JavaScript's RegExp implementation are not Unicode-aware.
So, I tried implementing my own word boundary feature with support for non-Latin characters. To make the word boundary work just with Cyrillic characters I created this regular expression:
new RegExp(`(?<![\u0400-\u04ff])${cyrillicSearchValue}(?![\u0400-\u04ff])`,'gi')
Where \u0400-\u04ff is a range of Cyrillic characters provided in the table of codes. It is not an ideal solution, however, it works properly in most cases.
To make it work in your case, you just have to pick up an appropriate range of codes from the list of Unicode characters.
To try out my example run the code snippet below.
function getMatchExpression(cyrillicSearchValue) {
    return new RegExp(
        `(?<![\u0400-\u04ff])${cyrillicSearchValue}(?![\u0400-\u04ff])`,
        'gi',
    );
}
const sentence = 'Будь-який текст кирилицею, де необхідно знайти слово з контексту';
console.log(sentence.match(getMatchExpression('текст')));
// expected output: ["текст"]
console.log(sentence.match(getMatchExpression('но')));
// expected output: null

I noticed something really weird with \b when using Unicode:
/\bo/.test("pop"); // false (obviously)
/\bä/.test("päp"); // true (what..?)
/\Bo/.test("pop"); // true
/\Bä/.test("päp"); // false (what..?)
It appears that the meanings of \b and \B are reversed, but only with non-ASCII characters. The reason is that ä is not part of \w, so it counts as a non-word character: in "päp" there is a boundary between p and ä, while there is none between a space and an ä that starts a word.
In any case, the word boundary is the issue, not the Unicode characters themselves. Perhaps you should just replace \b with (^|[\s\\/_&-]), as that seems to work correctly. (The hyphen goes last in the class so that it is not read as a range. Make your list of symbols more comprehensive than mine, though.)

My idea is to search with codes representing the Finnish letters
new RegExp("\\b"+asciiOnly(searchterm), "gi").test(asciiOnly(title))
My original idea was to use plain encodeURI but the % sign seemed to interfere with the regexp.
http://jsfiddle.net/7TsxB/5/
I wrote a crude function using encodeURI to encode every character with code over 128 but removing its % and adding 'QQ' in the beginning. It is not the best marker but I couldn't get non alphanumeric to work.

What you are looking for is the Unicode word boundaries standard:
http://unicode.org/reports/tr29/tr29-9.html#Word_Boundaries
There is a JavaScript implementation here (unicodejs.wordbreak.js)
https://github.com/wikimedia/unicodejs

I had a similar problem, where I was trying to replace all of a particular unicode word with a different unicode word, and I cannot use lookbehind because it's not supported in the JS engine this code will be used in. I ultimately resolved it like this:
const needle = "КАРТОПЛЯ";
const replace = "БАРАБОЛЯ";
const regex = new RegExp(
    String.raw`(^|[^\n\p{L}])`
    + needle
    + String.raw`(?=$|\P{L})`,
    "gimu",
);
const result = (
    'КАРТОПЛЯ сдффКАРТОПЛЯдадф КАРТОПЛЯ КАРТОПЛЯ КАРТОПЛЯ??? !!!КАРТОПЛЯ ;!;!КАРТОПЛЯ/#?#?'
    + '\n\nКАРТОПЛЯ КАРТОПЛЯ - - -КАРТОПЛЯ--'
)
    .replace(regex, function (match, ...args) {
        return args[0] + replace;
    });
console.log(result)
output:
БАРАБОЛЯ сдффКАРТОПЛЯдадф БАРАБОЛЯ БАРАБОЛЯ БАРАБОЛЯ??? !!!БАРАБОЛЯ ;!;!БАРАБОЛЯ/#?#?
БАРАБОЛЯ БАРАБОЛЯ - - -БАРАБОЛЯ--
Breaking it apart
The first regex: (^|[^\n\p{L}])
^| = Start of the line or
[^\n\p{L}] = Any character which is not a letter or a newline
The second regex: (?=$|\P{L})
?= = Lookahead
$| = End of the line or
\P{L} = Any character which is not a letter
The first regex captures the group and is then used via args[0] to put it back into the string during replacement, thereby avoiding a lookbehind. The second regex utilized lookahead.
Note that the second one MUST be a lookahead because if we capture it then overlapping regex matches will not trigger (e.g. КАРТОПЛЯ КАРТОПЛЯ КАРТОПЛЯ would only match on the 1st and 3rd ones).

Trying to find text "myTest":
/(?<![\p{L}\p{N}_])myTest(?![\p{L}\p{N}_])/gu
This is similar to the "whole word" search option in NetBeans or Notepad++: it finds the expression only when it is not preceded or followed by any Unicode letter, number, or underscore (the Unicode counterpart of the \w characters that \b relies on).

I had a similar problem, but I had to replace an array of terms. None of the solutions I found worked when two terms were next to each other in the text (because their boundaries overlapped). So I had to use a slightly modified approach:
var text = "Ještě. že; \"už\" à. Fürs, 'anlässlich' že že že.";
var terms = ["à","anlässlich","Fürs","už","Ještě", "že"];
var replaced = [];
var order = 0;
for (var i = 0; i < terms.length; i++) {
    terms[i] = "(^\|[ \n\r\t.,;'\"\+!?-])(" + terms[i] + ")([ \n\r\t.,;'\"\+!?-]+\|$)";
}
var re = new RegExp(terms.join("|"), "");
while (true) {
    var replacedString = "";
    text = text.replace(re, function replacer(match){
        var beginning = match.match("^[ \n\r\t.,;'\"\+!?-]+");
        if (beginning == null) beginning = "";
        var ending = match.match("[ \n\r\t.,;'\"\+!?-]+$");
        if (ending == null) ending = "";
        replacedString = match.replace(beginning,"");
        replacedString = replacedString.replace(ending,"");
        replaced.push(replacedString);
        return beginning+"{{"+order+"}}"+ending;
    });
    if (replacedString == "") break;
    order += 1;
}
See the code in a fiddle: http://jsfiddle.net/antoninslejska/bvbLpdos/1/
The regular expression is inspired by: http://breakthebit.org/post/3446894238/word-boundaries-in-javascripts-regular
I can't say, that I find the solution elegant...

The correct answer to the question is given by andrefs.
I will only rewrite it more clearly, after putting all required things together.
For ASCII text, you can use \b for matching a word boundary both at the start and the end of a pattern. When using Unicode text, you need to use 2 different patterns for doing the same:
Use (?<=^|\P{L}) for matching the start or a word boundary before the main pattern.
Use (?=\P{L}|$) for matching the end or a word boundary after the main pattern.
Additionally, use the i flag to make all those matchings case-insensitive, together with the u flag, which \P{L} requires. (JavaScript does not accept a leading (?i) inside the pattern.)
So the resulting answer is: /(?<=^|\P{L})xxx(?=\P{L}|$)/iu, where xxx is your main pattern. This would be the equivalent of /\bxxx\b/i for ASCII text.
For your code to work, you now need to do the following:
Assign to your variable "searchterm", the pattern or words you want to find.
Escape the variable's contents. For example, replace '\' with '\\' and also do the same for any reserved special character of regex, like '\^', '\$', '\/', etc. Check here for a question on how to do this.
Insert the variable's contents to the pattern above, in the place of "xxx", by simply using the string.replace() method.
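The steps above could be sketched roughly as follows. The helper names escapeRegExp and unicodeWordRegex are hypothetical and not part of any library, the pattern is built by string concatenation rather than string.replace(), and the sketch assumes an engine that supports lookbehind and Unicode property escapes (ES2018 or later):

```javascript
// Hypothetical helper: escape characters that are special in a regex.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: whole-word match where "word" means a run of
// Unicode letters. The u flag is required for \P{L}.
function unicodeWordRegex(searchterm) {
  return new RegExp(
    "(?<=^|\\P{L})" + escapeRegExp(searchterm) + "(?=\\P{L}|$)",
    "iu"
  );
}

const sampleTitle = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";

console.log(unicodeWordRegex("älkää").test(sampleTitle)); // true
console.log(unicodeWordRegex("TÄMÄ").test(sampleTitle));  // true (case-insensitive)
console.log(unicodeWordRegex("kää").test(sampleTitle));   // false (inside a word)
```

For prefix matching, as in the original autocomplete question, the trailing lookahead can simply be left out.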

bad but working:
var text = " аб аб АБ абвг ";
var ttt = "(аб)"
var p = "(^|$|[^A-Za-zА-Я-а-я0-9()])"; // add other word boundary symbols here
var exp = new RegExp(p+ttt+p,"gi");
text = text.replace(exp, "$1($2)$3").replace(exp, "$1($2)$3");
console.log(text);
result (without quotes):
" (аб) (аб) (АБ) абвг "

I struggled hard with this while working with French accented characters, and I managed to find this solution:
const myString = "MyString";
const regex = new RegExp(
    "(?:[^À-ú]|^)\\b(" + myString + ")\\b(?:[^À-ú]|$)",
    "ig"
);
What it does:
It keeps checking word boundaries with \b before and after "MyString".
In addition, (?:[^À-ú]|^) and (?:[^À-ú]|$) check that MyString is not surrounded by any accented characters.
It will not work with Cyrillic, but it should be possible to find the range of Cyrillic characters and edit [^À-ú] accordingly.
Warning: only the group (MyString) is captured, but the total match contains the previous and next characters.
See example : https://regex101.com/r/5P0ZIe/1
Match examples :
MyString
match : "MyString"
group 1 : "MyString"
Lorem ipsum. MyString dolor sit amet
match : " MyString "
group 1 : "MyString"
(MyString)
match : "(MyString)"
group 1 : "MyString"
BetweenCharactersMyStringIsNotFound
match : Nothing
group 1 : Nothing
éMyStringé
match : Nothing
group 1 : Nothing
ùMyString
match : Nothing
group 1 : Nothing
MyStringÖ
match : Nothing
group 1 : Nothing

Match parts of code

I'm trying to match parts of code with regex. How can I match var, a, =, 2 and ; from
"var a = 2;"
?

I believe you want this regexp: /\S+/g
To break it down: \S selects any non-whitespace character, + makes sure it selects consecutive non-whitespace characters together (i.e. 'var'),
and the 'g' flag makes sure it selects all of the occurrences in the string instead of stopping at the first one, which is the default behavior.
This is a helpful link for playing around until you find the right regexp: https://regex101.com/#javascript

var str = "var a = 2;";
// clean the duplicate whitespaces
var no_duplicate_whitespace = str.replace(new RegExp("\\s+", "g"), " ");
// and split by space
var tokens = no_duplicate_whitespace.split(" ");
Or as kuujinbo pointed out:
str.split(/\s+/);
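Note that splitting on whitespace leaves "2;" as a single token, while the question asks for 2 and ; separately. One possible sketch, which is an assumption about what counts as a token rather than something taken from the answers above, is to match either a run of word characters or a single punctuation character:

```javascript
const source = "var a = 2;";

// Whitespace-based approach from the answers: "2;" stays together.
const bySpace = source.match(/\S+/g);

// \w+ grabs runs of word characters; [^\s\w] grabs each remaining
// non-space character (operators, punctuation) as its own token.
const tokens = source.match(/\w+|[^\s\w]/g);

console.log(bySpace); // [ 'var', 'a', '=', '2;' ]
console.log(tokens);  // [ 'var', 'a', '=', '2', ';' ]
```

Multi-character operators such as == would be split into single characters by this pattern, so a real tokenizer would need more alternatives.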

Regex match Array words with dash

I want to match some keywords in the url
var parentURL = document.referrer;
var greenPictures = /redwoods-are-big.jpg|greendwoods-are-small.jpg/;
var existsGreen = greenPictures.test(parentURL);
existsGreen turns true when it finds greendwoods-are-small.jpg, but also when it finds small.jpg.
What can I do so that it only turns true when there is exactly greendwoods-are-small.jpg?

You can use ^ to match the beginning of a string and $ to match the end:
var greenPictures = /^(redwoods-are-big.jpg|greendwoods-are-small.jpg)$/;
var existsGreen = greenPictures.test(parentURL);
But of course document.referrer is not equal to either redwoods-are-big.jpg or greendwoods-are-small.jpg, so I would match /something.png[END]:
var greenPictures = /\/(redwoods-are-big\.jpg|greendwoods-are-small\.jpg)$/; // <-- See how I escaped the / and the . there? (\/ and \.)
var existsGreen = greenPictures.test(parentURL);
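To see what the escaping and the anchor change, here is a small sketch. The example.com URLs are made up for illustration and stand in for document.referrer:

```javascript
// Original pattern: unescaped dots match any character, no anchoring.
const loosePictures = /redwoods-are-big.jpg|greendwoods-are-small.jpg/;

// Anchored pattern: literal dots, preceded by "/", at the end of the URL.
const strictPictures = /\/(redwoods-are-big\.jpg|greendwoods-are-small\.jpg)$/;

// Hypothetical referrer values.
const exact = "http://example.com/img/greendwoods-are-small.jpg";
const shortName = "http://example.com/img/small.jpg";
const lookalike = "http://example.com/img/greendwoods-are-smallXjpg?x=1";

console.log(strictPictures.test(exact));     // true
console.log(strictPictures.test(shortName)); // false
console.log(loosePictures.test(shortName));  // false
console.log(loosePictures.test(lookalike));  // true
console.log(strictPictures.test(lookalike)); // false
```

In this sketch neither pattern matches a URL that only contains small.jpg, which is consistent with the observation in another answer that something else may be going on in the asker's code.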

Try this regex:
/(redwoods-are-big|greendwoods-are-small)\.jpg/i
I used the i flag for ignoring the character cases in parentURL variable.
Demo
http://regex101.com/r/aI4yJ6

Dashes do not have any special meaning outside character sets, e.g.:
[a-f], [^x-z] etc.
The characters with special meaning in your regexp are | and .
/redwoods-are-big.jpg|greendwoods-are-small.jpg/
| denotes either or.
. matches any character except the newline characters \n \r \u2028 or \u2029.
In other words: There is something else iffy going on in your code.
More on RegExp.
Pages like these can be rather helpful if you struggle with writing regexp's:
regex101 (with sample)
RegexPlanet
RegExr
Debuggex
etc.

Regular expression with asterisk quantifier

This documentation states this about the asterisk quantifier:
Matches the preceding character 0 or more times.
It works in something like this:
var regex = /<[A-Za-z][A-Za-z0-9]*>/;
var str = "<html>";
console.log(str.match(regex));
The result of the above is : <html>
But when tried on the following code to get all the "r"s in the string below, it only returns the first "r". Why is this?
var regex = /r*/;
var str = "rodriguez";
console.log(str.match(regex));
Why, in the first example does it cause "the preceding" character/token to be repeated "0 or more times" but not in the second example?

var regex = /r*/;
var str = "rodriguez";
The regex engine will first try to match r in rodriguez from left to right and since there is a match, it consumes this match.
The regex engine then tries to match another r, but the next character is o, so it stops there.
Without the global flag g (used as so var regex = /r*/g;), the regex engine will stop looking for more matches once the regex is satisfied.
Try using:
var regex = /a*/;
var str = "cabbage";
The match will be an empty string, despite there being a's in the string! This is because at first, the regex engine tries to find a in cabbage from left to right, but the first character is c. Since this doesn't match, the regex tries to match 0 times. The regex is thus satisfied and the matching ends here.
It might be worth pointing out that * alone is greedy, which means it will first try to match as many as possible (the 'or more' part from the description) before trying to match 0 times.
To get all r from rodriguez, you will need the global flag as mentioned earlier:
var regex = /r*/g;
var str = "rodriguez";
You'll get all the r, plus all the empty strings inside, since * also matches 'nothing'.
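A short sketch of these cases (variable names are illustrative only):

```javascript
const surname = "rodriguez";

// Without g: only the first match, found at index 0.
const firstOnly = surname.match(/r*/);

// With g: every position is tried, and r* happily matches "nothing"
// wherever there is no r, so empty strings show up in the result.
const starMatches = surname.match(/r*/g);

// r+ needs at least one r, so no empty strings.
const plusMatches = surname.match(/r+/g);

// a* succeeds immediately at index 0 with an empty match.
const cabbageMatch = "cabbage".match(/a*/);

console.log(firstOnly[0]);                // "r"
console.log(starMatches.filter(Boolean)); // [ 'r', 'r' ]
console.log(starMatches.length);          // 10 (2 r's plus 8 empty strings)
console.log(plusMatches);                 // [ 'r', 'r' ]
console.log(cabbageMatch[0] === "", cabbageMatch.index); // true 0
```

If the goal is simply to collect the r's, /r+/g or /r/g avoids having to filter out the empty matches.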

Use global switch to match 1 or more r anywhere in the string:
var regex = /r+/g;
In your other regex:
var regex = /<[A-Za-z][A-Za-z0-9]*>/;
You're matching a literal < followed by a letter followed by 0 or more letters or digits, and it will match <html> perfectly.
But if your input is <foo>:<bar>:<abc>, it will match only <foo> and not the other segments. To match all segments you need to use /<[A-Za-z][A-Za-z0-9]*>/g with the global switch.

how to regex a string between two tokens in Javascript?

Asked many times, but I can't get it to work...
I have strings like:
"text!../tmp/widgets/tmp_widget_header.html"
and am trying like this to extract widget_header:
var temps[i] = "text!../tmp/widgets/tmp_widget_header.html";
var thisString = temps[i].regexp(/.*tmp_$.*\.*/) )
but that does not work.
Can someone tell me what I'm doing wrong here?
Thanks!

This prints widget_header:
var s = "text!../tmp/widgets/tmp_widget_header.html";
var matches = s.match(/tmp_(.*?)\.html/);
console.log(matches[1]);

var s = "text!../tmp/widgets/tmp_widget_header.html",
re = /\/tmp_([^.]+)\./;
var match = re.exec(s);
if (match)
    alert(match[1]);
This will match:
a / character
the characters tmp_
one or more of any character that is not the . character. These are captured.
a . character
If a match was found, the captured text will be at index 1 of the resulting Array.

In your code:
var temps[i] = "text!../tmp/widgets/tmp_widget_header.html";
var thisString = temps[i].regexp(/.*tmp_$.*\.*/) )
You are saying:
"Match any string that starts with any number of any characters, followed by "tmp_", followed by the end of input, followed by any number of any characters, followed by any number of periods."
.* : Any number of any character (except newline)
tmp_ : Literally "tmp_"
$ : End of input/newline - in this position it can only succeed if the input ends right after "tmp_"
.* : As above
\.* : Any number of periods
Also, strings have no regexp() method: use match() on the string, or exec() on the regex. If you build a regex with the RegExp constructor, you pass the pattern as a string, like var re = new RegExp("ab+c") or var re = new RegExp('ab+c'), not in regex notation using slashes. You also have either an extra or a missing parenthesis, and no characters are actually being captured.
What you want to do is:
"Find a string that starts at the beginning of input, followed by one or more of any character, followed by "tmp_", followed by one or more of any character (capture this part), followed by a single period, followed by one or more of any character, followed by the end of input."
So:
var string = "text!../tmp/widgets/tmp_widget_header.html";
var re = /^.+tmp_(.+)\..+$/; // I use the simpler slash notation
var out = re.exec(string); // execute the regex
console.log(out[1]); // Note that out is an array; the first (here the only) captured string is at index 1
This regex /^.+tmp_(.+)\..+$/ means:
^ : Match beginning of input/line
.+ : One or more of any character (except newline), "+" is one or more
tmp_ : Constant "tmp_"
(.+) : One or more of any character, captured as group 1
\. : A single period
.+ : As above
$ : End of input/line
You could also write this as new RegExp('^.+tmp_(.+)\\..+$'). Note that with new RegExp() we do not use the slash marks; instead we pass the pattern as a string in quote marks (single or double will work), and the backslash has to be doubled inside the string.
Now this would also match var string = "Q%$#^%$^%$^%$^43etmp_ebeb.45t4t#$^g", giving out[1] == 'ebeb'. So depending on the specific use you may wish to replace any "." used to signify any character (except newline) with bracketed "[ ]" character lists, as this may filter out unwanted results. Your mileage may vary.
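As a sketch of that last suggestion: the character list [\w-] and the fixed .html suffix below are assumptions about what a valid widget name and file look like, not something stated in the question.

```javascript
const widgetPath = "text!../tmp/widgets/tmp_widget_header.html";
const noisyInput = "Q%$#^%$^%$^%$^43etmp_ebeb.45t4t#$^g";

// Require "/tmp_", then letters, digits, underscores or hyphens,
// then a literal ".html" at the end of the input.
const strictWidget = /\/tmp_([\w-]+)\.html$/;

const widgetMatch = strictWidget.exec(widgetPath);
console.log(widgetMatch ? widgetMatch[1] : null); // widget_header
console.log(strictWidget.exec(noisyInput));       // null
```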
For more information visit: https://developer.mozilla.org/en-US/docs/JavaScript/Guide/Regular_Expressions
