Javascript regular expression to replace word but not within curly brackets - javascript

I have some content, for example:
If you have a question, ask for help on StackOverflow
I have a list of synonyms:
a={one typical|only one|one single|one sole|merely one|just one|one unitary|one small|this solitary|this slight}
ask={question|inquire of|seek information from|put a question to|demand|request|expect|inquire|query|interrogate}
I'm using JavaScript to:
Split each synonym entry on =
Loop through every synonym and, if the word is found in the content, replace it with {...|...}
The output should look like:
If you have {one typical|only one|one single|one sole|merely one|just one|one unitary|one small|this solitary|this slight} question, {question|inquire of|seek information from|put a question to|demand|request|expect|inquire|query|interrogate} for help on StackOverflow
Problem:
Instead of replacing the entire word, it's replacing every character found. My code:
for(syn in allSyn) {
var rtnSyn = allSyn[syn].split("=");
var word = rtnSyn[0];
var synonym = (rtnSyn[1]).trim();
if(word && synonym){
var match = new RegExp(word, "ig");
postProcessContent = preProcessContent.replace(match, synonym);
preProcessContent = postProcessContent;
}
}
It should replace each word in the content with its synonym list, but it should not touch words that are already inside {...|...}.

When you build the regexps, you need to include word boundary anchors at both the beginning and the end to match whole words (beginning and ending with characters from [a-zA-Z0-9_]) only:
var match = new RegExp("\\b" + word + "\\b", "ig");
Depending on the specific replacements you are making, you might want to apply your method to individual words (rather than to the entire text at once) matched using a regexp like /\w+/g to avoid replacing words that themselves are the replacements for others. Something like:
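content = content.replace(/\w+/g, function(word) {
// Look for a synonym entry whose key equals this word (case-insensitive).
for(var i = 0, L = allSyn.length; i < L; ++i) {
var rtnSyn = allSyn[i].split("=");
var synonym = (rtnSyn[1] || "").trim();
if(synonym && rtnSyn[0].toLowerCase() == word.toLowerCase()) return synonym;
}
// No synonym found: keep the word unchanged.
return word;
});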

Regular expressions include something called a "word-boundary", represented by \b. It is a zero-width assertion (it just checks something, it doesn't "eat" input) that says in order to match, certain word boundary conditions have to apply. One example is a space followed by a letter; given the string ' X', this regex would match it: / \bX/. So to make your code work, you just have to add word boundaries to the beginning and end of your word regex, like this:
for(syn in allSyn) {
var rtnSyn = allSyn[syn].split("=");
var word = rtnSyn[0];
var synonym = (rtnSyn[1]).trim();
if(word && synonym){
var match = new RegExp("\\b"+word+"\\b", "ig");
postProcessContent = preProcessContent.replace(match, synonym);
preProcessContent = postProcessContent;
}
}
[Note that there are two backslashes in each of the word boundary matchers because in javascript strings, the backslash is for escape characters -- two backslashes turns into a literal backslash.]
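To make the effect of the word boundaries concrete, here is a minimal sketch using the sentence from the question. The "[A]" replacement text is only a hypothetical stand-in for a real synonym list.

```javascript
// Sample sentence from the question; "[A]" is a hypothetical stand-in replacement.
const content = "If you have a question, ask for help";

// No boundaries: every letter "a" is replaced, even inside other words.
const withoutBoundary = content.replace(new RegExp("a", "ig"), "[A]");

// With \b on both sides: only the standalone word "a" is replaced.
const withBoundary = content.replace(new RegExp("\\ba\\b", "ig"), "[A]");

console.log(withoutBoundary); // If you h[A]ve [A] question, [A]sk for help
console.log(withBoundary);    // If you have [A] question, ask for help
```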

For optimization, don't create a new RegExp on each iteration. Instead, build up one big regex like [^{A-Za-z](a|ask|...)[^}A-Za-z] and a hash with a value for each key specifying what to replace it with. I'm not familiar enough with JavaScript to create the code on the fly.
Note the separator regex which says the match cannot begin with { or end with }. This is not terribly precise, but hopefully acceptable in practice. If you genuinely need to replace words next to { or } then this can certainly be refined, but I'm hoping we won't have to.
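The single-regex idea could be sketched in JavaScript roughly as follows. This is only an illustration under some assumptions: the shortened allSyn entries, the escapeRegExp helper and the spin function are made up for the example, and it uses \b word boundaries instead of the [^{A-Za-z] separators. Because the whole text is processed in one pass, inserted synonym lists are never rescanned, but braces that already exist in the input are not protected.

```javascript
// Hypothetical input: one "word=synonyms" entry per element, shortened from the question.
const allSyn = [
  "a={one typical|only one}",
  "ask={question|inquire of|put a question to}",
];

// Escape characters that are special inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a lookup table from lowercase word to its replacement text.
const replacements = new Map();
for (const entry of allSyn) {
  const idx = entry.indexOf("=");
  if (idx === -1) continue;
  const word = entry.slice(0, idx).trim();
  const synonym = entry.slice(idx + 1).trim();
  if (word && synonym) replacements.set(word.toLowerCase(), synonym);
}

// One alternation of all words, anchored on word boundaries.
const pattern = new RegExp(
  "\\b(" + [...replacements.keys()].map(escapeRegExp).join("|") + ")\\b",
  "gi"
);

// Replace every matching word in a single pass over the content.
function spin(content) {
  return content.replace(pattern, (match) => replacements.get(match.toLowerCase()));
}

const spun = spin("If you have a question, ask for help on StackOverflow");
console.log(spun);
```

Since the "a" inside "put a question to" is inserted during the same pass, it is not replaced again, which is the problem a loop of sequential replacements can run into.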

Related

Using JS to modify user input for REGEXP search

I'm taking user input from a searchbar and modifying it to a regexp. From there I can search a json file for valid values and return them. It works fine with input without quotes, but with them, I'm appending "\Q" and "\E" so I can find the entirety of the string (with spaces and other special characters).
if (searchField.includes('"')){
var tempexpress = searchField.substring(1,searchField.length-1);
var tempexpress = "\\Q" + tempexpress + "\\E";
var expression = new RegExp(tempexpress);
} else {
var tempexpress = searchField.replace('(',"\\(");
var tempexpress = tempexpress.replace(')',"\\)");
var tempexpress = tempexpress.replace(/'/g,"\\'");
var tempexpress = tempexpress.replace('*',"\.");
var expression = new RegExp(tempexpress, "i");
};
if (value.data.label.search(expression) != -1){
console.log('found it');
}
If I input "QTT6" into the search field (with quotes for a literal), then it creates the following regexp: /\QQTT6\E/
In my testing, I found that it doesn't match to QTT6 for some reason and I'm not sure why. Any help is appreciated.
Also I'm very new to JS and Jquery, so sorry if my code isn't very well put together.
Per Kelly's comment:
JavaScript regular expressions do not support \Q and \E. To match the entire string, you need to use ^ and $ instead.
For more information, see the MDN docs on Regex Assertions:
^:
Matches the beginning of input. If the multiline flag is set to true, also matches immediately after a line break character. For example, /^A/ does not match the "A" in "an A", but does match the first "A" in "An A".
Note: This character has a different meaning when it appears at the start of a character class.
$:
Matches the end of input. If the multiline flag is set to true, also matches immediately before a line break character. For example, /t$/ does not match the "t" in "eater", but does match it in "eat".
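If the goal of the quoted form is to treat the input as a literal, one common approach is to escape the special characters yourself. The sketch below is an assumption about what the search box should do, not the original poster's code: escapeRegExp and buildSearchExpression are hypothetical helpers, and * is treated as a single-character wildcard as in the question's code.

```javascript
// Escape characters that are special inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: turn the raw search box text into a RegExp.
function buildSearchExpression(searchField) {
  const quoted =
    searchField.length >= 2 &&
    searchField.startsWith('"') &&
    searchField.endsWith('"');
  if (quoted) {
    // Quoted input: match the text between the quotes literally.
    return new RegExp(escapeRegExp(searchField.slice(1, -1)));
  }
  // Unquoted input: "*" stands for any single character, everything else is literal.
  const pattern = searchField.split("*").map(escapeRegExp).join(".");
  return new RegExp(pattern, "i");
}

// Without the u flag, \Q and \E are just the literal letters Q and E.
console.log(new RegExp("\\QQTT6\\E").test("QTT6"));   // false
console.log(new RegExp("\\QQTT6\\E").test("QQTT6E")); // true

console.log(buildSearchExpression('"QTT6"').test("QTT6")); // true
console.log(buildSearchExpression('"a.b"').test("axb"));   // false
console.log(buildSearchExpression("q*t6").test("QTT6"));   // true
```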

Uppercase for each new word swedish characters and html markup

I was pointed to this post, which does not seem to cover the criteria I have:
Replace a Regex capture group with uppercase in Javascript
I am trying to make a regex that will:
format a string by adding uppercase for the first letter of each word and lower case for the rest of the characters
ignore HTML markup
Accept swedish characters (åäöÅÄÖ)
Say I've got this string:
<b>app</b>le store östersund
Then I want it to be (changes marked by uppercase characters)
<b>App</b>le Store Östersund
I've been playing around with it and the closest I've got is the following:
(?!([^<])*?>)[åäöÅÄÖ]|\s\b\w
Resulted in
<b>app</b>le Store Östersund
Or this
/(?!([^<])*?>)[åäöÅÄÖ]|\S\b\w/g
Resulted in
<B>App</B>Le store Östersund
Here's a fiddle:
http://refiddle.com/refiddles/598aabef75622d4a531b0000
Any help or advice is much appreciated.
It is not possible to do this with regexp alone, since regexp doesn't understand HTML structure. [*] Instead, we need to process each text node, and carry through our logic for what is the beginning of the word in case a word continues across different text nodes. A character is at start of the word if it is preceded by a whitespace, or if it is at the start of the string and it is either the first text node, or the previous text node ended in whitespace.
function htmlToTitlecase(html, letters) {
let div = document.createElement('div');
let re = new RegExp("(^|\\s)([" + letters + "])", "gi");
div.innerHTML = html;
let treeWalker = document.createTreeWalker(div, NodeFilter.SHOW_TEXT);
let startOfWord = true;
while (treeWalker.nextNode()) {
let node = treeWalker.currentNode;
node.data = node.data.replace(re, function(match, space, letter) {
if (space || startOfWord) {
return space + letter.toUpperCase();
} else {
return match;
}
});
startOfWord = node.data.match(/\s$/);
}
return div.innerHTML;
}
console.log(htmlToTitlecase("<b>app</b>le store östersund", "a-zåäö"));
// <b>App</b>le Store Östersund
[*] Maybe possible, but even if so, it would be horribly ugly, since it would need to cover an awful amount of corner cases. Also might need a stronger RegExp engine than JavaScript's, like Ruby's or Perl's.
EDIT:
Even if just specifying really simple html tags? The only ones I am actually in need of covering is <b> and </b> at the moment.
This was not specified in the question. The solution is general enough to work for any markup (including simple tags). But...
function simpleHtmlToTitlecaseSwedish(html) {
return html.replace(/(^|\s)(<\/?b>|)([a-zåäö])/gi, function(match, space, tag, letter) {
return space + tag + letter.toUpperCase();
});
}
console.log(simpleHtmlToTitlecaseSwedish("<b>app</b>le store östersund"));
// <b>App</b>le Store Östersund
I have a solution that uses almost nothing but regex. It may not be the most intuitive way to do it, but it should be effective and I find it fun :)
You have to append to the end of your string every lowercase character followed by its uppercase counterpart, like this (for my regex, it must also be preceded by a space):
aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZåÅäÄöÖ
(I don't know which letters are missing; I know nothing about the Swedish alphabet, sorry. I'm counting on you to correct that!)
Then you can use the following regex:
(?![^<]*>)(\s<[^/]*?>|\s|^)([\wåäö])(?=.*\2(.)\S*$)|[\wåÅäÄöÖ]+$
Replace by :
$1$3
Here is working JavaScript code:
// Initialization
var regex = /(?![^<]*>)(\s<[^/]*?>|\s|^)([\wåäö])(?=.*\2(.)\S*$)|[\wåÅäÄöÖ]+$/g;
var string = "test <b when=\"2>1\">ap<i>p</i></b>le store östersund";
// Processing
var result = string + " aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZåÅäÄöÖ";
result = result.replace(regex, "$1$3");
// Display result
console.log(result);
Edit : I forgot to handle first word of the string, it's corrected :)

find and remove words matching a substring in a sentence

Is it possible to use regex to find all words within a sentence that contain a substring?
Example:
var sentence = "hello my number is 344undefined848 undefinedundefined undefinedcalling whistleundefined";
I need to find all words in this sentence that contain 'undefined' and remove those words.
Output should be "hello my number is ";
FYI - currently I tokenize (javascript) and iterate through all the tokens to find and remove, then merge the final string. I need to use regex. Please help.
Thanks!
You can use:
str = str.replace(/ *\b\S*?undefined\S*\b/g, '');
It certainly is possible.
Something like start of word, zero or more letters, "undefined", zero or more letters, end of word should do it.
A word boundary is \b outside a character class, so:
\b\w*?undefined\w*?\b
using non-greedy repetition to avoid the letter matching trying to consume "undefined" itself and causing lots of backtracking.
Edit: switched [a-zA-Z] to \w because the example includes numbers in the "words".
\S*undefined\S*
Try this simple regex and replace the matches with an empty string. See the demo:
https://www.regex101.com/r/fG5pZ8/5
You can use the str.replace function like this:
str = str.replace(/undefined/g, '');
Since there are enough solutions with regular expressions, here is another one - using arrays and simple function that finds occurrence of a string in a string :)
Even though the code looks "dirtier", it can run faster than a regular expression, so it might be worth considering when dealing with LARGE strings.
var sentence = "hello my number is 344undefined848 undefinedundefined undefinedcalling whistleundefined";
var array = sentence.split(' ');
var sanitizedArray = [];
for (var i = 0; i < array.length; i++) {
if (array[i].indexOf('undefined') == -1) {
sanitizedArray.push(array[i]);
}
}
var sanitizedSentence = sanitizedArray.join(' ');
alert(sanitizedSentence);
Fiddle: http://jsfiddle.net/448bbumh/

Regexp to capture comma separated values

I have a string that can be a comma separated list of \w, such as:
abc123
abc123,def456,ghi789
I am trying to find a JavaScript regexp that will return ['abc123'] (first case) or ['abc123', 'def456', 'ghi789'] (without the comma).
I tried:
^(\w+,?)+$ -- Nope, as only the last repeating pattern will be matched, 789
^(?:(\w+),?)+$ -- Same story. I am using non-capturing bracket. However, the capturing just doesn't seem to happen for the repeated word
Is what I am trying to do even possible with regexp? I tried pretty much every combination of grouping, using capturing and non-capturing brackets, and still not managed to get this happening...
If you want to discard the whole input when there is something wrong, the simplest way is to validate, then split:
if (/^\w+(,\w+)*$/.test(input)) {
var values = input.split(',');
// Process the values here
}
If you want to allow empty value, change \w+ to \w*.
Trying to match and validate at the same time with a single regex requires emulating the \G feature, which asserts the position of the last match. Why is \G required? Because it prevents the engine from retrying the match at the next position and bypassing your validation. Remember that ECMAScript regex (before ES2018) doesn't have look-behind, so you can't differentiate between the position of an invalid character and the character(s) after it:
something,=bad,orisit,cor&rupt
          ^^             ^^
When you can't differentiate between the 2 positions, you can't rely on the engine to do a match-all operation alone. While it is possible to use a while loop with RegExp.exec and assert the position of last match yourself, why would you do so when there is a cleaner option?
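For illustration only, the manual loop could look something like the sketch below. It goes beyond the answer by assuming the sticky flag (y, available since ES2015) as the stand-in for \G, and parseList is a hypothetical name. It returns null as soon as any part of the input is invalid.

```javascript
// Hypothetical helper: parse "abc123,def456" into an array, or return null if invalid.
function parseList(input) {
  // Sticky flag: each match must start exactly at lastIndex, like \G elsewhere.
  const token = /(\w+)(,|$)/y;
  const values = [];
  let pos = 0;
  while (pos < input.length) {
    token.lastIndex = pos;
    const m = token.exec(input);
    if (m === null) return null; // invalid character at this position
    values.push(m[1]);
    pos = token.lastIndex;
    // A comma that ends the input means an empty last value.
    if (m[2] === "," && pos === input.length) return null;
  }
  return values.length > 0 ? values : null;
}

console.log(parseList("abc123"));                         // [ 'abc123' ]
console.log(parseList("abc123,def456,ghi789"));           // [ 'abc123', 'def456', 'ghi789' ]
console.log(parseList("something,=bad,orisit,cor&rupt")); // null
```

Compared with validating first and then calling split, this is noticeably more code for the same result, which is the point the answer makes.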
If you want to salvage whatever is available, torazaburo's answer is a viable option.
Try this regex:
/([^,]+)/g
Alternatively, strings in JavaScript have a split method that can split a string based on a delimiter:
s.split(',')
Split on the comma first, then filter out results that do not match:
str.split(',').filter(function(s) { return /^\w+$/.test(s); })
This regex pattern pulls out the runs of digits from values that contain special characters such as ".", "," or "#":
var val = [1234,1213.1212, 1.3, 1.4]
var re = /[0-9]*[0-9]/gi;
var str = "abc123,def456, asda12, 1a2ass, yy8,ghi789";
var re = /[a-z]{3}\d{3}/g;
var list = str.match(re);
document.write("<BR> list.length: " + list.length);
for(var i=0; i < list.length; i++) {
document.write("<BR>list(" + i + "): " + list[i]);
}
This matches only codes in the "abc123" style (three letters followed by three digits) and nothing else.
Maybe you can use the split function:
var st = "abc123,def456,ghi789";
var res = st.split(',');

How to replace whitespaces using javascript?

I'm trying to remove the whitespace from a textarea. The code below is not appending the text I'm selecting from two dropdowns. Can somebody tell me where I've gone wrong? I'm also trying to remove multiple spaces within the string; will the same approach work for that? I don't know regular expressions well. Please help.
function addToExpressionPreview() {
var reqColumnName = $('#ddlColumnNames')[0].value;
var reqOperator = $('#ddOperator')[0].value;
var expressionTextArea = document.getElementById("expressionPreview");
var txt = document.createTextNode(reqColumnName + reqOperator.toString());
if (expressionTextArea.value.match(/^\s+$/) != null)
{
expressionTextArea.value = (expressionTextArea.value.replace(/^\W+/, '')).replace(/\W+$/, '');
}
expressionTextArea.appendChild(txt);
}
> function addToExpressionPreview() {
> var reqColumnName = $('#ddlColumnNames')[0].value;
> var reqOperator = $('#ddOperator')[0].value;
You might as well use document.getElementById() for each of the above.
> var expressionTextArea = document.getElementById("expressionPreview");
> var txt = document.createTextNode(reqColumnName + reqOperator.toString());
reqOperator is already a string, and in any case, the use of the + operator will coerce it to String unless all expressions or identifiers involved are Numbers.
> if (expressionTextArea.value.match(/^\s+$/) != null) {
There is no need for match here. It seems like you are trying to see if the value is all whitespace, so you can use:
if (/^\s*$/.test(expressionTextArea.value)) {
// value is empty or all whitespace
Since you re-use expressionTextArea.value several times, it would be much more convenient to store it in a variable, preferably with a short name.
> expressionTextArea.value = (expressionTextArea.value.replace(/^\W+/,
> '')).replace(/\W+$/, '');
That will replace one or more non-word characters at the start and at the end of the string with nothing. If you want to trim the ends and replace multiple white space characters anywhere in the string with one, then (note wrapping for posting here):
expressionTextArea.value = expressionTextArea.value.
replace(/^\s+/,'').
replace(/\s+$/, '').
replace(/\s+/g,' ');
Note that \s does not match the same range of 'whitespace' characters in all browsers. However, for simple use for form element values it is probably sufficient.
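The three chained replacements can be collected into a small helper and tried on a sample string. This is just a sketch: normalizeWhitespace is a hypothetical name, and the sample input is made up.

```javascript
// Hypothetical helper: trim both ends and collapse inner whitespace runs to one space.
function normalizeWhitespace(text) {
  return text
    .replace(/^\s+/, "")   // leading whitespace
    .replace(/\s+$/, "")   // trailing whitespace
    .replace(/\s+/g, " "); // any remaining run of whitespace becomes one space
}

console.log(JSON.stringify(normalizeWhitespace("  colA   >= \n 5  "))); // "colA >= 5"
```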
Whitespace is matched by \s, so
expressionTextArea.value.replace(/\s/g, "");
should do the trick for you.
In your sample, ^\W+ will only match leading characters that are not word characters, and ^\s+$ will only match if the entire string is whitespace. To do a global replace (not just the first match) you need to use the g modifier.
This should give you the idea: try .replace(/ /g,"UrReplacement");
Edit: or .split(' ').join('UrReplacement') if you have an aversion to REs
