Regex not detecting swear words inside string? - javascript

I have a script here:
http://jsfiddle.net/d2rcx/
It has an array of badWords to compare input strings with.
This script works fine if the input string matches exactly as the swear word string but it does not pick up any variations where there is more characters in the string e.g. whitespace before the swear word.
Using this site as reference : http://www.zytrax.com/tech/web/regex.htm
It said the following regex would detect a string within a string.
var regex = new RegExp("/" + badWords[i] + "/g");
if (fieldValue.match(regex) == true)
return true;
However that does not seem to be the case.
What do I need to change to the regex to make it work.
Thanks
Also any good links to explain Regex than what google turns up would be appreciated.

Here's a corrected JSFiddle:
http://jsfiddle.net/d2rcx/5/
See the following documentation for RegExp:
https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Global_Objects/RegExp
Note that the second parameter is where you should specify your flags (e.g. 'g' or 'i'). For example:
new RegExp(badWords[i], 'gi');

Please review http://www.w3schools.com/jsref/jsref_match.asp and note that fieldValue.match() will not return a boolean but an array of matches

Rather than using Regex to do this you are probably better off just looping through an array of badwords and looking for instances within a string using [indexOf].(https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Global_Objects/String/indexOf)
Otherwise you could make a regex like...
\badword1|badword2|badword3\
and just check for any match.
A word boundary in Regex is \b so you could say
\\b(badword1|badword2|badword3)\b\
which will match only whole words - ie Scunthorpe will be ok :)
var rx = new RegExp("\\b(donkey|twerp|idiot)\\b","i"); // i = case insenstive option
alert(rx.test('you are a twerp')); //true
alert(rx.test('hello idiotstick')); //fasle -not whole word
alert(rx.test('nice "donkey"')); //true
http://jsfiddle.net/F8svC/

Changing this, which requires a loop:
var regex = new RegExp("/" + badWords[i] + "/g");
for this:
var regex = new RegExp("/" + badWords.join("|") + "/g");
would be a start. This will do all the matches in one go because the array becomes one string with each element separated by pipes.
P.S.
Reference guide for RegEx here. But there isn't a lot of clear information online about what is and isn't possible with respect to certain functions nor what's good code. I've found a couple of books most useful: David Flanagan's latest JavaScript: The Definitive Guide and Douglas Crockford's JavaScript: The Good Parts for the most usable subset of JavaScript, including RegEx. The railroad diagrams by Crockford are especially good but I'm not sure if they're available online anywhere.
EDIT: Here's an online copy of the relevant chapter including some of those railroad diagrams I mentioned, in case it helps.

Related

Javascript RegEx to match everything up until and after a specific pattern

I need a regex to match everything in a string except a sub-string of a given pattern, which may appear several times.
Example text:
A lot of text before my pattern.
Perhaps several lines...
...then [my pattern here]. Then maybe [my pattern here] again and some more text to end.
The pattern in case is anything starting with "000." and followed by however many alphanumeric characters except a space. So, for example, valid tokens would be:
000.a
000.1a
000.SomeLongWordHere!123
Firstly, I started to match the pattern itself, which I managed to with /000\.[^ ]+/ g. Then I tried to negate that, with /(?!000\.[^ ]+)/ g and variations of that, adding things like .+ before, before and after, but none works for what I need.
I looked into several other questions regarding regex (such as this and this), but wasn't lucky (or didn't quite understand how to apply the answers to my need).
I'm using regex101.com to test.
Using the above example text, the desired result is:
A lot of text before my pattern.
Perhaps several lines...
...then . Then maybe again and some more text to end.
Any help is greatly appreciated. Thanks in advance!
As #Doqnach mentioned in the comments, it seems you just want to do a replace
Using your regex:
text.replace(/000\.[^ ]+/g, "");
And here's a working example:
let text = `A lot of text before my pattern.
Perhaps several lines...
...then 000.SomeLongWordHere!123 Then maybe 000.1a again and some more text to end.`
// Easiest way
console.log("=== Using Replace ===\n\n");
console.log(text.replace(/000\.[^ ]+/g, ""));
// Using regex exec
console.log("\n\n=== Regex exec ===\n\n");
let regex = /(?:^|(?:000\.[^ ]+))((?:(?!000\.[^ ]+)[\S\s])*)/igm;
let content = "";
while(result = regex.exec(text)){
//In Group 1 is the content that does not match the desired pattern
content+= result[1];
}
console.log(content);
Second method breakdown in: https://regex101.com/r/nR9tV6/16

What is the function of .source in context of this new RegExp

I ran into the below monster of a regex in the wild today. The regex is meant to validate a url.
function superUrlValidation(url) {
return new RegExp(/^/.source + "((.+):\/\/)?" + /(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$/.source, "i")
.test(url);
}
I've never seen .source used in a regex like this so I looked it up.
The MDN docs for RegExp.prototype.source states:
The source property returns a String containing the source text of the regexp object, and it doesn't contain the two forward slashes on both sides and any flags.
... and gives this example:
var regex = /fooBar/ig;
console.log(regex.source); // "fooBar", doesn't contain /.../ and "ig".
I understand the MDN example (you're getting the source text of the regex object after it is created, makes sense), but I dont understand how this is being used in the superUrlValidation regex above.
How is the source being used before the regex object is completed and what does this accomplish? I cant find any documentation showing .source being used in this way.
Note that .source is used twice in the regex, at the beginning and the end
Use of .source everywhere in your regex seems totally unnecessary, may be just a trick to avoid double escaping. In fact even use of new RegExp is not needed and you can get away with just the regex literal as this:
var re = /^((.+):\/\/)?(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$/i;
/^/ is a regex literal, meaning it's a valid regex object in it's own right. This means that /^/.source === "^".
This seems like an arbitrary example of using the source property as this means the author could have just placed a "^" in it's place, or even just put a ^ at the beginning of the next string, and it would have the same effect.
The .source property returns the content of the regex between the forward slashes as you say. so the result of the above is equivalent to this string:
/^((.+):\/\/)?(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$/i
In JavaScript you can write regexes like this: /matchsomething/ or using the RegExp function/constructor above. It looks like the code you found is the result of someone not know what they were doing. They seem to have taken a few regexes using the literal syntax (i.e /match_here/) and plugged it into the constructor version and stuck them all together.
I can't see any benefit in using the source property this way. I would just use the string version or the constructor version. Or better, find out what the original author intended and write it again or find a respected regex library with the criteria you need.
And, yeah, wow. It's massive.

JavaScript RegEx match unless wrapped with [nocode][/nocode] tags

My current code is:
var user_pattern = this.settings.tag;
user_pattern = user_pattern.replace(/[\-\[\]\/\{\}\(\)\*\+\?\.\\\^\$\|]/g, "\\$&"); // escape regex
var pattern = new RegExp(user_pattern.replace(/%USERNAME%/i, "(\\S+)"), "ig");
Where this.settings.tag is a string such as "[user=%USERNAME%]" or "#%USERNAME%". The code uses pattern.exec(str) to find any username in the corresponding tag and works perfectly fine. For example, if str = "Hello, [user=test]" then pattern.exec(str) will find test.
This works fine, but I want to be able to stop it from matching if the string is wrapped in [nocode][/nocode] tags. For example, if str = "[nocode]Hello, [user=test], how are you?[/nocode]" thenpattern.exec(str)` should not match anything.
I'm not quite sure where to start. I tried using a (?![nocode]) before and after the pattern, but to no avail. Any help would be great.
I would just test if the string starts with [nocode] first:
/^\[nocode\]/.test('[nocode]');
Then simply do not process it.
Maybe filter out [nocode] before trying to find the username(s)?
pattern.exec(str.replace(/\[nocode\](.*)\[\/nocode\]/g,''));
I know this isn't exactly what you asked for because now you have to use two separate regular expressions, however code readability is important too and doing it this way is definitely better in that aspect. Hope this helps 😉
JSFiddle: http://jsfiddle.net/1f485Lda/1/
It's based on this: Regular Expression to get a string between two strings in Javascript

Non-capturing groups in Javascript regex

I am matching a string in Javascript against the following regex:
(?:new\s)(.*)(?:[:])
The string I use the function on is "new Tag:var;"
What it suppod to return is only "Tag" but instead it returns an array containing "new Tag:" and the desired result as well.
I found out that I might need to use a lookbehind instead but since it is not supported in Javascript I am a bit lost.
Thank you in advance!
Well, I don't really get why you make such a complicated regexp for what you want to extract:
(?:new\\s)(.*)(?:[:])
whereas it can be solved using the following:
s = "new Tag:";
var out = s.replace(/new\s([^:]*):.*;/, "$1")
where you got only one capturing group which is the one you're looking for.
\\s (double escaping) is only needed for creating RegExp instance.
Also your regex is using greedy pattern in .* which may be matching more than desired.
Make it non-greedy:
(?:new\s)(.*?)(?:[:])
OR better use negation:
(?:new\s)([^:]*)(?:[:])

Regex appears to ignore multiple piped characters

Apologies for the awkward question title, I have the following JavaScript:
var wordRe = new RegExp('\\b(?:(?![<^>"])fox|hello(?![<\/">]))\\b', 'g'); // Words regex
console.log('<span>hello</span> <hello>fox</hello> fox link hello my name is fox'.replace(wordRe, 'foo'));
What I'm trying to do is replace any word that isn't nested in a HTML tag, or part of a HTML tag itself. I.e I want to only match "plain" text. The expression seems to be ignoring the rule for the first piped match "fox", and replacing it when it shouldn't be.
Can anyone point out why this is? I think I might have organised the expression incorrectly (at least the negative lookahead).
Here is the JSFiddle.
I'd also like to add that I am aware of the implications of using regex with HTML :)
For your regex work, you want lookbehind. However, as of this writing, this feature is not supported in Javascript.
Here is a workaround:
Instead of matching what we want, we will match what we don't want and remove it from our input string. Later, we can perform the replace on the cleaned input string.
var nonWordRe = new RegExp('<([^>]+).*?>[^<]+?</\\1>', 'g');
var test = '<span>hello</span> <hello>fox</hello> fox link hello my name is fox';
var cleanedTest = test.replace(nonWordRe, '');
var final = cleanedTest.replace(/fox|hello/, 'foo'); // once trimmed final=='foo my name is foo'
NOTA:
I have build this workaround based on your sample. But here are some points that may need to be explored if you face them:
you may need to remove self closing tags (<([^>]+).*?/\>) from the test string
you may need to trim the final string (final)
you may need a descent html parser if tags can contain other tags as HTML allow this.
Javascript doesn't, again as of this writing, recursive patterns.
Demo
http://jsfiddle.net/yXd82/2/

Categories

Resources