I want to remove the words int, float, char, bool, main, cin, cout, if, else, else if, for, while, clrscr, getch, do, void (that is, replace each of them with "") wherever they are found in a string.
So if I have the following:
str = "main(){ clrscr(); couts<<"wrong"; cout<<"right"; }"
After replacing, it should be:
str = "(){ (); couts<<"wrong"; <<"right"; }"
So far what I've tried is this (wrong, of course):
str = str.replace(/\s+(?:int|char|bool|main|float)/, "");//summarized
You need the g modifier to perform multiple replacements on the string. You should use the \b regexp at both ends of the regexp to match word boundaries. So it should be:
str = str.replace(/\b(int|char|bool|main|float|...)\b/g, "");
Use the word boundary \b as well as the global flag g.
'main(){ clrscr(); couts<<"wrong"; cout<<"right"; }'.replace(/\b(?:int|float|char|bool|main|cin|cout|if|else|else if|for|while|clrscr|getch|do|void)\b/g, '');
In your regex, \s+ would prevent matching words at the beginning of the string, since the + repetition operator means at least one. You could have replaced the \s+ with (\s+|^), but you would still be left with one problem: the spaces would be part of the match and would get removed as well. Therefore \b, which matches word boundaries, serves you better.
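Putting the two answers together, a minimal sketch might build the pattern from the keyword list instead of typing the alternation by hand. The names KEYWORDS and stripKeywords are made up for illustration, and sorting the list so that "else if" is tried before "else" is an added assumption, not something the answers above require.

```javascript
// Hypothetical helper: KEYWORDS and stripKeywords are illustrative names.
const KEYWORDS = [
  "int", "float", "char", "bool", "main", "cin", "cout", "if", "else",
  "else if", "for", "while", "clrscr", "getch", "do", "void",
];

// Longest alternatives first, so "else if" is tried before "else".
const alternatives = [...KEYWORDS]
  .sort((a, b) => b.length - a.length)
  .join("|");

// \b on both sides keeps "couts" intact; the g flag replaces every occurrence.
const KEYWORD_PATTERN = new RegExp("\\b(?:" + alternatives + ")\\b", "g");

function stripKeywords(source) {
  return source.replace(KEYWORD_PATTERN, "");
}

console.log(stripKeywords('main(){ clrscr(); couts<<"wrong"; cout<<"right"; }'));
// → (){ (); couts<<"wrong"; <<"right"; }
```

The keywords here contain no regex metacharacters, so they are joined without escaping; a list with characters such as + or ( would need escaping first.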
I have a question about the start-of-string regex anchor ^.
I was trying to sanitize a string to check if it's a palindrome and found a solution to use regex but couldn't wrap my head around the explanations I found for the start of string anchor tag:
To my understanding:
^ denotes that whatever expression that follows must match, starting from the beginning of the string.
Question:
Why then is there a difference between the two output below:
1)
let x = 'A man, a plan, a canal: Panama';
const re = new RegExp(/[^a-z]/, 'gi');
console.log(x.replace(re, '*'));
Output: A*man**a*plan**a*canal**Panama
VS.
2)
let x = 'A man, a plan, a canal: Panama';
const re = new RegExp(/[a-z]/, 'gi');
console.log(x.replace(re, '*'));
Output: * ***, * ****, * *****: ******
VS.
3)
let x = 'A man, a plan, a canal: Panama';
const re = new RegExp(/^[a-z]/, 'gi');
console.log(x.replace(re, '*'));
Output: * man, a plan, a canal: Panama
Please let me know if my explanation for each of the case above is off:
1) Confused about this one. If it matches a character class of [a-z] case insensitive + global find, with start of string anchor ^ denoting that it must match at the start of each string, should it not return all the words in the sentence? Since each word is a match of [a-z] insensitive characters that occurs at the start of each string per global find iteration?
(i.e.
finds "A" at the start
then on the next iteration, it should start search on the remaining string " man"
finds a space...and moves on to search "man"?
and so on and so forth...
Q: Why, then, does replace only target the non-alpha characters when I call it? Should I in this case be treating ^ as inverting [a-z]?
2) This seems pretty straightforward: it finds all occurrences of [a-z] and replaces them with the star. Inverse case of 1)??
3) Also confused about this one. I'm not sure how this is different from 1).
/^[a-z]/gi to me means: "starting at the start of the string being looked at, match all alpha characters, case insensitive. Repeat for global find".
Compared to:
1) /[^a-z]/gi to me means: "match all character class that starts each line with alpha character. case insensitive, repeat search for global find."
To me they mean exactly the same #_#. Please let me know how my understanding is off for the above cases.
Your first expression [^a-z] matches anything other than an alphabetic, lower case letter, which is why all the special characters such as whitespace, commas and colons are replaced when you replace with *.
Your second expression [a-z] matches any alphabetic, lower case letter, therefore the special characters mentioned are not replaced by *.
Your third expression ^[a-z] matches an alphabetic, lower case letter at the start of the string, therefore only the first letter is replaced by *.
For the first two expressions, the global flag g ensures that all characters that match the specified pattern, regardless of their position in the string, are replaced. For the third pattern however, since ^ anchors the pattern at the beginning of the string, only the first letter is replaced.
As you mentioned, the i flag ensures case insensitivity, so that all three patterns operate on both lower and upper case alphabetic letters, from a to z and A to Z.
The character ^ therefore has two meanings:
It negates characters in a character set.
It asserts position at the start of string.
^ denotes that whatever expression that follows must match, starting from the beginning of the string.
That's only when it's the first thing in the regex; it has other purposes when used elsewhere:
/[^a-z]/gi
In the above regex, the ^ does not indicate anchoring the match to the beginning of a string; it inverts the rest of the contents of the [] -- so the above regex will match any single character except a-z. Since you're using the g flag it will repeat that match for all characters in the string.
/[a-z]/gi
The above is not inverted, so will match a single instance of any character from a-z (and again because of the g flag will repeat to match all of those instances.)
/^[a-z]/gi
In this last example, the caret anchors the match to the beginning of the string; the bracketed portion will match any single a-z character. The g flag is still in use, so the regex would try to continue matching more characters later in the string -- but none of them except the first one will meet the anchored-to-start requirement, so this will end up matching only the first character (if it's within a-z), exactly as if the g flag were not in use.
(Anywhere else inside a [] group, the ^ is treated as a literal ^. Outside a [] group it is always the start anchor, wherever it appears in the regex, so it has to be escaped as \^ to match a literal caret.)
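The roles of the caret can be checked directly. This is only a small sketch reusing the sentence from the question; the last line uses a made-up input to show the literal case.

```javascript
const sentence = "A man, a plan, a canal: Panama";

// Inside [] and first: negation. Every non-letter is replaced.
console.log(sentence.replace(/[^a-z]/gi, "*")); // A*man**a*plan**a*canal**Panama

// Outside []: start-of-string anchor. Only the leading letter can match.
console.log(sentence.replace(/^[a-z]/gi, "*")); // * man, a plan, a canal: Panama

// Inside [] but not first: a literal caret.
console.log("a^b".replace(/[a^]/g, "*")); // **b
```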
If you're trying to detect palindromes, you'll want to remove everything except letter characters (and will probably want to convert everything to the same letter case, instead of having to detect that "P" == "p":)
const isPalindrome = function(input) {
let str = input.toLowerCase().replace(/[^a-z]/g,'');
return str === str.split('').reverse().join('')
}
console.log(isPalindrome("Able was I, ere I saw Elba!"))
console.log(isPalindrome("No, it never propagates if I set a ”gap“ or prevention."))
console.log(isPalindrome("Are we not pure? “No, sir!” Panama’s moody Noriega brags. “It is garbage!” Irony dooms a man –– a prisoner up to new era."))
console.log(isPalindrome("Taco dog is not a palindrome."))
Example string: George's - super duper (Computer)
Wanted new string: georges-super-duper-computer
Current regex: .replace(/\s+|'|()/g, '-')
It does not work, and when I remove the spaces and there is already a - in between, I get something like george's---super.
tl;dr Your regex is malformed. Also you can't conditionally remove ' and \s ( ) in a single expression.
Your regex is malformed since ( and ) have special meanings. They are used to form groups so you have to escape them as \( and \). You'll also have to place another pipe | in between them, otherwise you're going to match the literal "()", which is not what you want.
The proper expression would look like this: .replace(/\s+|'|\(|\)/g, '-').
However, this is not what you want. Since this would produce George-s---super-duper--Computer-. I would recommend that you use Character Classes, which will also make your expression easier to read:
.replace(/[\s'()-]+/g, '-')
This matches whitespace, ', (, ) and any additional - one or more times and replaces each run with a single -, yielding George-s-super-duper-Computer-.
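The two intermediate results quoted in this answer can be reproduced with a short sketch, which makes the difference between per-character alternation and a character class with + easier to see:

```javascript
const input = "George's - super duper (Computer)";

// Alternation with escaped parentheses: every single match becomes its own "-".
console.log(input.replace(/\s+|'|\(|\)/g, "-")); // George-s---super-duper--Computer-

// Character class with +: a whole run of separators collapses into one "-".
console.log(input.replace(/[\s'()-]+/g, "-")); // George-s-super-duper-Computer-
```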
This is still not quite right, so have this:
var myString = "George's - super duper (Computer)";
var myOtherString = myString
// Remove non-whitespace, non-alphanumeric characters from the string (note: ^ inverses the character class)
// also trim any whitespace from the beginning and end of the string (so we don't end up with hyphens at the start and end of the string)
.replace(/^\s+|[^\s\w]+|\s+$/g, "")
// Replace the remaining whitespace with hyphens
.replace(/\s+/g, "-")
// Finally make all characters lower case
.toLowerCase();
console.log(myString, '=>', myOtherString);
You could do match instead of replace then join result on -. Then you may need a replace to remove single quotes. Regex would be:
[a-z]+('[a-z]+)*
JS code:
var str = "George's - super duper (Computer)";
console.log(
str.match(/[a-z]+('[a-z]+)*/gi).join('-').replace(/'/g, "").toLowerCase()
);
I am building a search and I am going to use JavaScript autocomplete with it. I am from Finland (Finnish language), so I have to deal with some special characters like ä, ö and å.
When the user types text into the search input field, I try to match the text against the data.
Here is a simple example that does not work correctly if the user types, for example, "ää". Same thing with "äl".
var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Does not work
var searchterm = "äl";
// does not work
//var searchterm = "ää";
// Works
//var searchterm = "wi";
if ( new RegExp("\\b"+searchterm, "gi").test(title) ) {
$("#result").html("Match: ("+searchterm+"): "+title);
} else {
$("#result").html("nothing found with term: "+searchterm);
}
http://jsfiddle.net/7TsxB/
So how can I get those ä,ö and å characters to work with javascript regex?
I think I should use unicode codes but how should I do that? Codes for those characters are:
[\u00C4,\u00E4,\u00C5,\u00E5,\u00D6,\u00F6]
=> äÄåÅöÖ
There appears to be a problem with the word boundary \b at the beginning of a word whose first character is outside the ASCII range.
Instead of using \b, try using (?:^|\\s)
var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Did not work with \b, works with (?:^|\\s)
var searchterm = "äl";
// Same for this one
//var searchterm = "ää";
// Worked before and still works
//var searchterm = "wi";
if ( new RegExp("(?:^|\\s)"+searchterm, "gi").test(title) ) {
$("#result").html("Match: ("+searchterm+"): "+title);
} else {
$("#result").html("nothing found with term: "+searchterm);
}
Breakdown:
(?: parentheses () form a capture group in Regex. Parentheses that start with a question mark and colon ?: form a non-capturing group. They just group the terms together
^ the caret symbol matches the beginning of a string
| the bar is the "or" operator.
\s matches whitespace (appears as \\s in the string because we have to escape the backslash)
) closes the group
So instead of using \b, which matches word boundaries and doesn't work for unicode characters, we use a non-capturing group which matches the beginning of a string OR whitespace.
The \b assertion in JavaScript RegEx is really only useful with plain ASCII text. \b is shorthand for the boundary between the \w and \W sets, or between \w and the beginning or end of the string. These character sets only take into account ASCII "word" characters, where \w is equal to [a-zA-Z0-9_] and \W is the negation of that class.
This makes the RegEx character classes largely useless for dealing with any real language.
\s should work for what you want to do, provided that search terms are only delimited by whitespace.
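A quick sketch with the title string from the question shows the difference between the two approaches. The g flag is left out here on purpose, since it is not needed for a single test() call.

```javascript
const title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";

// The space and "ä" are both non-word characters, so there is no \b between them.
console.log(/\bäl/i.test(title)); // false

// Start of string or whitespace does not depend on \w at all.
console.log(/(?:^|\s)äl/i.test(title)); // true
console.log(/(?:^|\s)ää/i.test(title)); // true

// ASCII search terms behave the same either way.
console.log(/\bwi/i.test(title), /(?:^|\s)wi/i.test(title)); // true true
```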
This question is old, but I think I found a better solution for boundaries in regular expressions with Unicode letters.
Using the XRegExp library, you can implement a valid \b boundary by expanding this:
XRegExp('(?=^|$|[^\\p{L}])')
The expanded result is more than 4000 characters long, but it seems to perform quite well.
Some explanation: (?= ) is a zero-length lookahead that looks for the start or end of the string or a non-letter Unicode character. The most important thing is the lookahead, because \b doesn't capture anything: it is simply true or false.
\b is a shortcut for the transition between a letter and a non-letter character, or vice-versa.
Updating and improving on max_masseti's answer:
With the Unicode property escapes added in ES2018, which are available when the regex uses the /u modifier (itself introduced in ES2015), you can now use \p{L} to represent any Unicode letter, and \P{L} (notice the uppercase P) to represent anything but.
EDIT: Previous version was incomplete.
As such:
const text = 'A Fé, o Império, e as terras viciosas';
text.split(/(?<=\p{L})(?=\P{L})|(?<=\P{L})(?=\p{L})/u);
// ['A', ' ', 'Fé', ', ', 'o', ' ', 'Império', ', ', 'e', ' ', 'as', ' ', 'terras', ' ', 'viciosas']
We're using a lookbehind (?<=...) to find a letter and a lookahead (?=...) to find a non-letter, or vice versa.
I would recommend you to use XRegExp when you have to work with a specific set of characters from Unicode, the author of this library mapped all kind of regional sets of characters making the work with different languages easier.
Although this question seems to be 8 years old, I ran into a similar problem (I had to match Cyrillic letters) not long ago. I spent a whole day on it and could not find an appropriate answer here on StackOverflow. So, to save others the effort, I'd like to share my solution.
Yes, \b word boundary works only with Latin letters (Word boundary: \b):
Word boundary \b doesn’t work for non-Latin alphabets
The word boundary test \b checks that there should be \w on the one side from the position and "not \w" – on the other side.
But \w means a Latin letter a-z (or a digit or an underscore), so the test doesn’t work for other characters, e.g. Cyrillic letters or hieroglyphs.
Yes, JavaScript's RegExp shorthands such as \w and \b only cover ASCII, so they barely support Unicode text.
So I tried implementing my own word boundary with support for non-Latin characters. To make the word boundary work just with Cyrillic characters, I created the following regular expression:
new RegExp(`(?<![\u0400-\u04ff])${cyrillicSearchValue}(?![\u0400-\u04ff])`,'gi')
Where \u0400-\u04ff is a range of Cyrillic characters provided in the table of codes. It is not an ideal solution, however, it works properly in most cases.
To make it work in your case, you just have to pick up an appropriate range of codes from the list of Unicode characters.
To try out my example run the code snippet below.
function getMatchExpression(cyrillicSearchValue) {
return new RegExp(
`(?<![\u0400-\u04ff])${cyrillicSearchValue}(?![\u0400-\u04ff])`,
'gi',
);
}
const sentence = 'Будь-який текст кирилицею, де необхідно знайти слово з контексту';
console.log(sentence.match(getMatchExpression('текст')));
// expected output: ["текст"]
console.log(sentence.match(getMatchExpression('но')));
// expected output: null
I noticed something really weird with \b when using Unicode:
/\bo/.test("pop"); // false (obviously)
/\bä/.test("päp"); // true (what..?)
/\Bo/.test("pop"); // true
/\Bä/.test("päp"); // false (what..?)
It appears that meaning of \b and \B are reversed, but only when used with non-ASCII Unicode? There might be something deeper going on here, but I'm not sure what it is.
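What looks like a reversal follows from \w being limited to [A-Za-z0-9_]: "ä" is not a word character, so the boundary sits between "p" and "ä". The sketch below shows this, followed by one possible Unicode-aware replacement that is an assumption on my part (it needs ES2018 lookbehind and the u flag, and the name startsWord is made up).

```javascript
// "ä" is not in \w, while "p" is, so a word boundary sits between them.
console.log(/\w/.test("ä"));    // false
console.log(/\bä/.test("päp")); // true
console.log(/\Bä/.test("päp")); // false

// Assumed alternative: treat any Unicode letter, digit, or underscore
// as a word character and require that none precedes the match.
const startsWord = /(?<![\p{L}\p{N}_])ä/u;
console.log(startsWord.test("päp"));     // false
console.log(startsWord.test("on ääni")); // true
```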
In any case, it seems that the word boundary is the issue, not the Unicode characters themselves. Perhaps you should just replace \b with (^|[\s\\/_&-]), as that seems to work correctly. (The - goes last in the class so that it is a literal hyphen rather than part of a range. Make your list of symbols more comprehensive than mine, though.)
My idea is to search with codes representing the Finnish letters
new RegExp("\\b"+asciiOnly(searchterm), "gi").test(asciiOnly(title))
My original idea was to use plain encodeURI but the % sign seemed to interfere with the regexp.
http://jsfiddle.net/7TsxB/5/
I wrote a crude function using encodeURI to encode every character with code over 128 but removing its % and adding 'QQ' in the beginning. It is not the best marker but I couldn't get non alphanumeric to work.
What you are looking for is the Unicode word boundaries standard:
http://unicode.org/reports/tr29/tr29-9.html#Word_Boundaries
There is a JavaScript implementation here (unicodejs.wordbreak.js)
https://github.com/wikimedia/unicodejs
I had a similar problem, where I was trying to replace all of a particular unicode word with a different unicode word, and I cannot use lookbehind because it's not supported in the JS engine this code will be used in. I ultimately resolved it like this:
const needle = "КАРТОПЛЯ";
const replace = "БАРАБОЛЯ";
const regex = new RegExp(
String.raw`(^|[^\n\p{L}])`
+ needle
+ String.raw`(?=$|\P{L})`,
"gimu",
);
const result = (
'КАРТОПЛЯ сдффКАРТОПЛЯдадф КАРТОПЛЯ КАРТОПЛЯ КАРТОПЛЯ??? !!!КАРТОПЛЯ ;!;!КАРТОПЛЯ/#?#?'
+ '\n\nКАРТОПЛЯ КАРТОПЛЯ - - -КАРТОПЛЯ--'
)
.replace(regex, function (match, ...args) {
return args[0] + replace;
});
console.log(result)
output:
БАРАБОЛЯ сдффКАРТОПЛЯдадф БАРАБОЛЯ БАРАБОЛЯ БАРАБОЛЯ??? !!!БАРАБОЛЯ ;!;!БАРАБОЛЯ/#?#?
БАРАБОЛЯ БАРАБОЛЯ - - -БАРАБОЛЯ--
Breaking it apart
The first regex: (^|[^\n\p{L}])
^| = Start of the line or
[^\n\p{L}] = Any character which is not a letter or a newline
The second regex: (?=$|\P{L})
?= = Lookahead
$| = End of the line or
\P{L} = Any character which is not a letter
The first regex captures the group and is then used via args[0] to put it back into the string during replacement, thereby avoiding a lookbehind. The second regex utilized lookahead.
Note that the second one MUST be a lookahead because if we capture it then overlapping regex matches will not trigger (e.g. КАРТОПЛЯ КАРТОПЛЯ КАРТОПЛЯ would only match on the 1st and 3rd ones).
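The overlap problem is easy to reproduce on a small input. The sketch below uses an ASCII word purely for readability; the patterns follow the same shape as the ones in this answer, and the variable names are illustrative.

```javascript
const text = "cat cat cat";

// Trailing boundary captured: it is consumed, so the next match cannot
// use that same character as its leading boundary.
const consuming = /(^|[^\p{L}])cat([^\p{L}]|$)/gu;
console.log(text.replace(consuming, "$1dog$2")); // dog cat dog

// Trailing boundary in a lookahead: it is only checked, not consumed.
const lookahead = /(^|[^\p{L}])cat(?=[^\p{L}]|$)/gu;
console.log(text.replace(lookahead, "$1dog")); // dog dog dog
```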
Trying to find text "myTest":
/(?<![\p{L}\p{N}_])myTest(?![\p{L}\p{N}_])/gu
This is similar to the whole-word search in NetBeans or Notepad++. It finds the expression only when it is not preceded or followed by a letter, number, or underscore in any Unicode script (the Unicode counterpart of the \w characters that the word boundary \b relies on).
I have had a similar problem, but I had to replace an array of terms. None of the solutions I found worked if two terms were next to each other in the text (because their boundaries overlapped). So I had to use a slightly modified approach:
var text = "Ještě. že; \"už\" à. Fürs, 'anlässlich' že že že.";
var terms = ["à","anlässlich","Fürs","už","Ještě", "že"];
var replaced = [];
var order = 0;
for (var i = 0; i < terms.length; i++) {
terms[i] = "(^\|[ \n\r\t.,;'\"\+!?-])(" + terms[i] + ")([ \n\r\t.,;'\"\+!?-]+\|$)";
}
var re = new RegExp(terms.join("|"), "");
while (true) {
var replacedString = "";
text = text.replace(re, function replacer(match){
var beginning = match.match("^[ \n\r\t.,;'\"\+!?-]+");
if (beginning == null) beginning = "";
var ending = match.match("[ \n\r\t.,;'\"\+!?-]+$");
if (ending == null) ending = "";
replacedString = match.replace(beginning,"");
replacedString = replacedString.replace(ending,"");
replaced.push(replacedString);
return beginning+"{{"+order+"}}"+ending;
});
if (replacedString == "") break;
order += 1;
}
See the code in a fiddle: http://jsfiddle.net/antoninslejska/bvbLpdos/1/
The regular expression is inspired by: http://breakthebit.org/post/3446894238/word-boundaries-in-javascripts-regular
I can't say, that I find the solution elegant...
The correct answer to the question is given by andrefs.
I will only rewrite it more clearly, after putting all required things together.
For ASCII text, you can use \b for matching a word boundary both at the start and the end of a pattern. When using Unicode text, you need to use 2 different patterns for doing the same:
Use (?<=^|\P{L}) for matching the start or a word boundary before the main pattern.
Use (?=\P{L}|$) for matching the end or a word boundary after the main pattern.
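Additionally, make all those matches case-insensitive. Engines with inline modifiers accept (?i) at the beginning of the pattern; JavaScript does not, so pass the i flag instead, together with the u flag that \p{L} and \P{L} require.
So the resulting answer is: (?<=^|\P{L})xxx(?=\P{L}|$) with the flags iu, where xxx is your main pattern. This would be roughly the equivalent of \bxxx\b with the i flag for ASCII text.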
For your code to work, you now need to do the following:
Assign to your variable "searchterm", the pattern or words you want to find.
Escape the variable's contents. For example, replace '\' with '\\' and also do the same for any reserved special character of regex, like '\^', '\$', '\/', etc. Check here for a question on how to do this.
Insert the variable's contents to the pattern above, in the place of "xxx", by simply using the string.replace() method.
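The steps above can be sketched as follows. The helper names escapeRegExp and wholeWordRegExp are hypothetical, the pattern is built by string concatenation rather than string.replace(), and lookbehind plus \P{L} assume an ES2018-capable engine.

```javascript
// Hypothetical helpers; the names are illustrative, not from the thread.
function escapeRegExp(text) {
  // Backslash-escape every character that has a special meaning in a regex.
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function wholeWordRegExp(searchterm) {
  // i: case-insensitive, u: required for \p{L} and \P{L}.
  return new RegExp(
    "(?<=^|\\P{L})" + escapeRegExp(searchterm) + "(?=\\P{L}|$)",
    "iu"
  );
}

const title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
console.log(wholeWordRegExp("älkää").test(title)); // true
console.log(wholeWordRegExp("äl").test(title));    // false: "äl" is only a prefix here
console.log(wholeWordRegExp("TÄMÄ").test(title));  // true
```

For prefix matching, as in the original autocomplete question, the trailing lookahead could simply be left out.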
bad but working:
var text = " аб аб АБ абвг ";
var ttt = "(аб)"
var p = "(^|$|[^A-Za-zА-Я-а-я0-9()])"; // add other word boundary symbols here
var exp = new RegExp(p+ttt+p,"gi");
text = text.replace(exp, "$1($2)$3").replace(exp, "$1($2)$3");
console.log(text);
result (without quotes):
" (аб) (аб) (АБ) абвг "
I struggled hard with this while working with French accented characters, and I managed to find this solution:
const myString = "MyString";
const regex = new RegExp(
"(?:[^À-ú]|^)\\b(" + myString + ")\\b(?:[^À-ú]|$)",
"ig"
);
What it does:
It keeps checking word boundaries with \b before and after "MyString".
In addition to that, (?:[^À-ú]|^) and (?:[^À-ú]|$) check that MyString is not surrounded by any accented characters.
It will not work with Cyrillic, but it should be possible to find the range of Cyrillic characters and edit [^À-ú] accordingly.
Warning: only the group (MyString) is captured, but the total match also contains the previous and next characters.
See example : https://regex101.com/r/5P0ZIe/1
Match examples :
MyString
match : "MyString"
group 1 : "MyString"
Lorem ipsum. MyString dolor sit amet
match : " MyString "
group 1 : "MyString"
(MyString)
match : "(MyString)"
group 1 : "MyString"
BetweenCharactersMyStringIsNotFound
match : Nothing
group 1 : Nothing
éMyStringé
match : Nothing
group 1 : Nothing
ùMyString
match : Nothing
group 1 : Nothing
MyStringÖ
match : Nothing
group 1 : Nothing
Plan A: it's such a simple function... it's ridiculous, really. I'm either totally misunderstanding how RegEx works with string replacement, or I'm making another stupid mistake that I just can't pinpoint.
function returnFloat(str){
console.log(str.replace(/$,)( /g,""));
}
but when I call it:
returnFloat("($ 51,453,042.21)")
>>> ($ 51,453,042.21)
It's my understanding that my regular expression should remove all occurrences of the dollar sign, the comma, and the parentheses. I've read through at least 10 different posts of similar issues (most people had the regex as a string or an invalid regex, but I don't think that applies here) without any changes resolving my issues.
My plan B is ugly:
str = str.replace("$", "");
str = str.replace(",", "");
str = str.replace(",", "");
str = str.replace(" ", "");
str = str.replace("(", "");
str = str.replace(")", "");
console.log(str);
Certain characters are special in RegEx, including $, ( and ). To match them literally you need to escape them, and to match any one of several characters you need a character set (or alternation with |). Otherwise your regex makes no sense to the interpreter.
function toFloat(str){
return str.replace(/[\$,\(\)]/g,'');
}
console.log(toFloat('($1,234,567.90'));
Please note that this does not convert the string to a float: if you tried toFloat('($1,234,567.90)')+10 you would get the string '1234567.9010'. To get a number, you would need to call the parseFloat() function on the result.
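A minimal sketch of that last step, with a hypothetical helper name. It also strips whitespace, which the toFloat version above leaves in place. Whether parentheses should mean a negative amount is not addressed in the thread, so the sketch ignores that.

```javascript
// Hypothetical helper: removes "$", ",", "(", ")" and whitespace, then parses.
function parseAmount(str) {
  const cleaned = str.replace(/[$,()\s]/g, "");
  return parseFloat(cleaned);
}

console.log(parseAmount("($ 51,453,042.21)"));    // 51453042.21
console.log(parseAmount("($1,234,567.90)") + 10); // 1234577.9
```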
the $ character means end of line, try:
console.log(str.replace(/[\$,)( ]/g,""));
You can fix your replacement as .replace(/[$,)( ]/g, "").
However, if you want to remove all characters that are not a digit or a dot,
an easier way exists:
.replace(/[^\d.]/g, "")
Here \d means digit (0 .. 9),
and [^\d.] means "not any of the symbols within the [...]",
in this case not a digit and not a dot.
If I understand correctly, you want to end up with this: 51,453,042.21
What you need are character classes. Inside one, you only have to worry about the ], \ and - characters (and ^ if you place it right after the opening "[" of the character class).
Syntax: [characters], where characters is the list of characters to be dropped (in your case $()).
The g means Global, and causes the replace call to replace all matches, not just the first one.
var myString = '($ 51,453,042.21)';
console.log(myString.replace(/[$()]/g, "")); // " 51,453,042.21" (the space after $ is kept)
If you also want to delete ',':
var myString = '($ 51,453,042.21)';
console.log(myString.replace(/[$(),]/g, "")); // " 51453042.21" (the space after $ is kept)
Given an input text in which every space has been replaced by one or more _:
Hello_world_?. Hello_other_sentenc3___. World___________.
I want to keep the _ between words, but I want to attach each punctuation mark to the last word of its sentence, with no space between the last word and the punctuation. I want to use the punctuation as the pivot of my regex.
I wrote the following JS-Regex:
str = str.replace(/(_| )*([:punct:])*( |_)/g, "$2$3");
This fails, since it returns :
Hello_world_?. Hello_other_sentenc3_. World_._
Why doesn't it work? How can I delete all "_" between the last word and the punctuation?
http://jsfiddle.net/9c4z5/
Try the following regex, which makes use of a positive lookahead:
str = str.replace(/_+(?=\.)/g, "");
It replaces all underscores which are immediately followed by a period with the empty string, thus removing them.
If you want to match other punctuation characters than just the period, replace the \. part with an appropriate character class.
JavaScript doesn't have :punct: in its regex implementation. I believe you'd have to list out the punctuation characters you care about, perhaps something like this:
str = str.replace(/(_| )+([.,?])/g, "$2");
That is, replace any group of _ or space that is immediately followed by punctuation with just the punctuation.
Demo: http://jsfiddle.net/9c4z5/2/
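The two answers can be combined into one small sketch: a lookahead, so the punctuation itself is not consumed, with a character class in place of the single period. The function name and the exact punctuation set are assumptions; extend the class as needed.

```javascript
// Hypothetical helper: drops runs of "_" or spaces that sit directly
// before a punctuation mark. The punctuation set is an assumption.
function attachPunctuation(str) {
  return str.replace(/[_ ]+(?=[.,?!:;])/g, "");
}

const input = "Hello_world_?. Hello_other_sentenc3___. World___________.";
console.log(attachPunctuation(input));
// → Hello_world?. Hello_other_sentenc3. World.
```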