Regex character sets - and what they contain - javascript

I'm working on a pretty crude sanitizer for string input in Node (Express):
I have glanced at some plugins and libraries, but it seems most of them are either too complex or too heavy. Therefore I decided to write a couple of simple sanitizer functions on my own.
One of them is this one, for hard-sanitizing most strings (not numbers...)
function toSafeString( str ){
    str = str.replace(/[^a-öA-Ö0-9\s]+/g, '');
    return str;
}
I'm from Sweden, so I need the letters åäö. I have noticed that this regex also accepts other characters as well, for example á or é.
Question 1)
Is there some kind of list where I can see WHICH characters are actually accepted by, say, this regex: /[^a-ö]+/g
Question 2)
I'm working in Node and Express. I'm thinking this simple function is going to stop attacks through input fields. Am I wrong?

Question 1: Find out. :)
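var accepted = [];
for (var i = 0; i <= 0xFFFF /* the Unicode BMP */; i++) {
    var s = String.fromCharCode(i);
    // test a single character; no quantifier or g flag is needed
    if (/[a-ö]/.test(s)) accepted.push(s);
}
// print the collected characters (not the loop variable s)
console.log(accepted.join(""));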
outputs
abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³
´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö
on my system.
Question 2: What attacks are you looking to stop? Either way, the answer is "No, probably not".
Instead of mangling user data (I'm sure your, say, French or Japanese customers will have some beef with your validation), escape the data at the point where it leaves your code for another context: HTML escaping when it goes into a page, SQL parameter escaping when it goes into a query, and so on.
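As a rough illustration of escaping on output instead of stripping on input, here is a minimal sketch. The helper name escapeHtml is hypothetical (it is not part of Node or Express), and a real application would more likely rely on its template engine's built-in escaping.

```javascript
// Hypothetical helper: escape the five characters that are special in HTML.
// The ampersand must be replaced first so later entities are not double-escaped.
function escapeHtml(text) {
    return String(text)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;")
        .replace(/'/g, "&#39;");
}

// The user's text survives intact (including é), it just cannot act as markup.
var userInput = '<script>alert("é")</script>';
console.log(escapeHtml(userInput));
// → &lt;script&gt;alert(&quot;é&quot;)&lt;/script&gt;
```

The stored data stays exactly what the user typed; only the rendered copy is escaped.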

[x-y] matches characters whose Unicode character codes lie between that of x and that of y:
function charsBetween(a, b) {
    // use separate variables instead of redeclaring the parameters
    var code = a.charCodeAt(0), end = b.charCodeAt(0), r = "";
    while (code <= end)
        r += String.fromCharCode(code++);
    return r;
}
charsBetween("a", "ö")
> "abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö"
See character tables for the reference.
For your validation, you probably want something like this instead:
[^a-zA-Z0-9ÅÄÖåäö\s]
This matches the ranges of Latin letters and digits, plus the individual characters listed explicitly.
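Plugged into the sanitizer from the question, that character class could look like the sketch below. It assumes the input uses precomposed characters (for example "ö" as a single code unit rather than "o" plus a combining mark).

```javascript
// Same function as in the question, with the explicit Swedish letters
// instead of the a-ö and A-Ö ranges.
function toSafeString(str) {
    return str.replace(/[^a-zA-Z0-9ÅÄÖåäö\s]+/g, '');
}

console.log(toSafeString("Smörgås-bord, café!"));
// → "Smörgåsbord caf"  (the é is now stripped along with the punctuation)
```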

There are a lot of characters that most of us never think about, such as Japanese or Russian letters and many more.
To take them into account we need to use Unicode ranges rather than ASCII ranges in regular expressions.
I came up with this regular expression, which covers almost all written letters in the Unicode table plus a bit more: digits and a few punctuation characters (Chinese punctuation is already included in the Unicode ranges).
It is hard to cover everything, and these ranges probably include too many characters, including "exotic" ones (symbols):
/^[\u0040-\u1FE0\u2C00-\uFFC00-9 ',.?!]+$/i
I was using it this way to test (the string must not be empty):
function validString(str) {
    return str && typeof(str) == 'string' && /^[\u0040-\u1FE0\u2C00-\uFFC00-9 ',.?!]+$/i.test(str);
}
Bear in mind that this is missing characters like:
:*()&#'\-:%
And many others.
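A different technique from the hand-picked ranges above is Unicode property escapes. The sketch below assumes an engine that supports them (ES2018, for example Node.js 10 or later) and swaps the ranges for the letter, mark and number categories; the punctuation list is copied from the regex above.

```javascript
// Assumption: the engine supports \p{...} with the u flag (ES2018).
// \p{L} = letters, \p{M} = combining marks, \p{N} = numbers, in any script.
function validString(str) {
    return typeof str === 'string' && /^[\p{L}\p{M}\p{N} ',.?!]+$/u.test(str);
}

console.log(validString("Привет, мир!")); // true
console.log(validString("日本語 123"));    // true
console.log(validString("<script>"));     // false
console.log(validString(""));             // false, at least one character is required
```

This keeps symbols such as "{" or "×" out without having to enumerate code point ranges by hand.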

Related

Check for Palindromes on freecode camp (don't want solution)

I want to be clear: I'm not looking for solutions. I'm really trying to understand what is being done. With that said, all pointers and recommendations are welcome. I am working through the freecodecamp.com task Check for Palindromes. Below is the description.
Return true if the given string is a palindrome. Otherwise, return
false.
A palindrome is a word or sentence that's spelled the same way both
forward and backward, ignoring punctuation, case, and spacing.
Note You'll need to remove all non-alphanumeric characters
(punctuation, spaces and symbols) and turn everything lower case in
order to check for palindromes.
We'll pass strings with varying formats, such as "racecar", "RaceCar",
and "race CAR" among others.
We'll also pass strings with special symbols, such as "2A3*3a2", "2A3
3a2", and "2_A3*3#A2".
This is what I have for code right now. Again, I'm working through this and using Chrome dev tools to figure out what works and what doesn't.
function palindrome(str) {
    // Good luck!
    str = str.toLowerCase();
    //str = str.replace(/\D\S/i);
    str = str.replace(/\D\s/g, "");
    for (var i = str.length -1; i >= 0; i--)
        str += str[i];
}
palindrome("eye");
What I do not understand is why, when the code below is run in dev tools, the "e" is missing.
str = str.replace(/\D\s/g, "");
"raccar"
So my question is: what part of the regex am I misunderstanding? From my understanding, the regex should only be getting rid of spaces and integers.
/\D\s/g matches any character that is not a digit, followed by a whitespace character, and replaces the pair with "".
So, in "race car", the regex matches "e " and replaces it with "", making the string "raccar".
\D (uppercase) means "not a digit"; for a digit, you need \d (lowercase). I think using an OR would get you what you described. Something like /\d|\s/g matches a digit or a whitespace character.
Hope this helps in some way in your understanding!
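To see the difference between the two patterns side by side, a small experiment along these lines can be pasted into the console. It only shows what each regex matches and is not a solution to the exercise.

```javascript
var spaced = "race car";
var mixed = "2A3 3a2";

// \D\s: one non-digit immediately followed by one whitespace character
console.log(spaced.replace(/\D\s/g, "")); // "raccar"  ("e " was removed)
console.log(mixed.replace(/\D\s/g, ""));  // "2A3 3a2" (the space follows a digit, so nothing matches)

// \d|\s: any single digit, or any single whitespace character
console.log(spaced.replace(/\d|\s/g, "")); // "racecar"
console.log(mixed.replace(/\d|\s/g, ""));  // "Aa"
```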

Faster way to match characters between strings than Regex?

The use case is I want to compare a query string of characters to an array of words, and return the matches. A match is when a word contains all the characters in the query string, order doesn't matter, repeated characters are okay. Regex seems like it may be too powerful (a sledgehammer where only a hammer is needed). I've written a solution that compares the characters by looping through them and using indexOf, but it seems consistently slower. (http://jsperf.com/indexof-vs-regex-inside-a-loop/10) Is Regex the fastest option for this type of operation? Are there ways to make my alternate solution faster?
var query = "word",
    words = ['word', 'wwoorrddss', 'words', 'argument', 'sass', 'sword', 'carp', 'drowns'],
    reStoredMatches = [],
    indexOfMatches = [];
function match(word, query) {
    var len = word.length,
        charMatches = [],
        charMatch,
        char;
    while (len--) {
        char = word[len];
        charMatch = query.indexOf(char);
        if (charMatch !== -1) {
            charMatches.push(char);
        }
    }
    return charMatches.length === query.length;
}
function linearIndexOf(words, query) {
    var wordsLen = words.length,
        wordMatch,
        word;
    while (wordsLen--) {
        word = words[wordsLen];
        wordMatch = match(word, query);
        if (wordMatch) {
            indexOfMatches.push(word);
        }
    }
}
function linearRegexStored(words, query) {
    var wordsLen = words.length,
        re = new RegExp('[' + query + ']', 'g'),
        match,
        word;
    while (wordsLen--) {
        word = words[wordsLen];
        match = word.match(re);
        if (match !== null) {
            if (match.length >= query.length) {
                reStoredMatches.push(word);
            }
        }
    }
}
Note that your regex is wrong, and that's most likely why it runs so fast.
Right now, if your query is "word" (as in your example), the regex is going to be:
/[word]/g
This means: look for one of the characters 'w', 'o', 'r', or 'd'. With the 'g' flag, match() returns an array of every character in the word that belongs to that set, so comparing its length with query.length only counts hits; it never checks that each query character is actually present. For example, "wwww" produces four matches and is accepted. That is less work than a correct indexOf() test, which is most likely why it is faster.
Also, you mention the idea of any number of characters, I suppose as shown here:
'word', 'wwoorrddss'
indexOf() will definitely not handle that properly if you really mean "any number" of each character, because there is an unbounded number of cases to match. As a regex it would be something like:
/w+o+r+d+s+/g
You will certainly have a harder time writing that in plain JavaScript than using a regex. Either way, it is going to be somewhat slow.
From the comment below, all the letters of the query are required, in any order. To test that with a single regex you need 3! (3 factorial) alternatives for a 3-letter query:
/(a.*b.*c)|(a.*c.*b)|(b.*a.*c)|(b.*c.*a)|(c.*a.*b)|(c.*b.*a)/
Obviously, a factorial grows the number of alternatives very quickly and will eat your memory with a super long regex (although you can simplify when a word has the same letter multiple times: you do not have to test that letter more than once).
1! = 1
2! = 2
3! = 6
4! = 24
5! = 120
6! = 720
...
That's probably why your properly written test in plain JavaScript is much slower.
Also, in your case you should store the words nearly as Scrabble dictionaries do: every letter once, in alphabetical order (Scrabble keeps duplicates). So the word "word" would be "dorw" and, as shown in your example, the word "wwoorrddss" would be "dorsw". You can have a backend tool generate your table of words (so you still write them as "word" and "words", and the tool massages them into "dorw" and "dorsw"). If you sort the letters of the query the same way, you do not need a silly factorial for the regex; you can simply do this:
/d.*o.*r.*w/
And that will match the sorted letters of any word that contains all the letters of "word", such as "password" (stored as "adoprsw").
One easy way to sort the letters is to split the word into an array of letters and then sort the array. You will still get duplicates, because the default JavaScript sort does not remove them, so they have to be filtered out separately.
One more detail: if your test is supposed to be case insensitive, then transform your strings to lowercase before running the test. So something like:
query = query.toLowerCase();
early on in your top function.
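A minimal sketch of that sorted-letters idea follows. The helper names sortedUniqueLetters and queryToRegex are made up for illustration, and the code assumes the query contains only letters, so nothing needs escaping inside the regex.

```javascript
// Hypothetical helper: lowercase, drop duplicate letters, sort alphabetically.
function sortedUniqueLetters(word) {
    var seen = {};
    return word.toLowerCase().split("").filter(function (ch) {
        if (seen[ch]) return false;
        seen[ch] = true;
        return true;
    }).sort().join("");
}

// Hypothetical helper: "word" -> /d.*o.*r.*w/
function queryToRegex(query) {
    return new RegExp(sortedUniqueLetters(query).split("").join(".*"));
}

var words = ['word', 'wwoorrddss', 'words', 'argument', 'sass', 'sword', 'carp', 'drowns'];
var re = queryToRegex("word");

// In a real setup the sorted form of each word would be precomputed once.
var matches = words.filter(function (w) {
    return re.test(sortedUniqueLetters(w));
});
console.log(matches); // [ 'word', 'wwoorrddss', 'words', 'sword', 'drowns' ]
```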
You are trying to speed up the algorithm "chars in word are a subset of the chars of query." You can short-circuit this check and avoid some assignments (which are more readable but not strictly needed). Try the following version of match:
function match(word, query) {
    var len = word.length;
    while (len--) {
        if (query.indexOf(word[len]) === -1) { // found a missing char
            return false;
        }
    }
    return true; // couldn't find any missing chars
}
This gives a 4-5x improvement.
Depending on the application, you could also try presorting words, and presorting the letters of each word in words, as another optimization.
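The question defines a match as a word that contains every character of the query, in any order, with repeats allowed. A minimal sketch of that definition, with the same short-circuit idea but looping over the query instead of the word, could look like this (the name containsAllChars is hypothetical):

```javascript
// Returns true when every character of query occurs somewhere in word.
function containsAllChars(word, query) {
    for (var i = 0; i < query.length; i++) {
        if (word.indexOf(query[i]) === -1) {
            return false; // this query character is missing from the word
        }
    }
    return true;
}

var words = ['word', 'wwoorrddss', 'words', 'argument', 'sass', 'sword', 'carp', 'drowns'];
var found = words.filter(function (w) { return containsAllChars(w, "word"); });
console.log(found); // [ 'word', 'wwoorrddss', 'words', 'sword', 'drowns' ]
```

The loop runs at most query.length times per word, which is usually shorter than the word itself.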
A regexp matcher based on a finite state automaton makes its decisions from the current state and the character read, going from left to right. It reads each character once and makes a decision.
For static strings (looking for a fixed string in a text) there are better algorithms, like Knuth-Morris-Pratt, which avoid re-examining characters that have already been matched. Keep in mind that such an algorithm is for matching plain strings, not regular expressions.
If you are interested in Knuth-Morris-Pratt (there are several other algorithms), have a look at Wikipedia: http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
It is also worth investigating whether your regexp engine uses a DFA or an NFA: NFAs occupy less memory and are easier to build, while DFAs match faster at the cost of some compilation time and more memory.
The Knuth-Morris-Pratt algorithm also needs to preprocess the search string before it can run, so it may not pay off if you are only looking for one word in one string.

JavaScript remove ZERO WIDTH SPACE (unicode 8203) from string

I'm writing some javascript that processes website content. My efforts are being thwarted by SharePoint text editor's tendency to put the "zero width space" character in the text when the user presses backspace.
The character's unicode value is 8203, or B200 in hexadecimal. I've tried to use the default "replace" function to get rid of it. I've tried many variants, none of them worked:
var a = "o​m"; //the invisible character is between o and m
var b = a.replace(/\u8203/g,'');
= a.replace(/\uB200/g,'');
= a.replace("\\uB200",'');
and so on and so forth. I've tried quite a few variations on this theme. None of these expressions work (tested in Chrome and Firefox). The only thing that works is typing the actual character in the expression:
var b = a.replace("​",''); //it's there, believe me
This poses potential problems. The character is invisible, so that line in itself doesn't make sense. I can get around that with comments. But if the code is ever reused and the file is saved using a non-Unicode encoding (or when it's deployed to SharePoint, there's no guarantee it won't mess up the encoding), it will stop working. Is there a way to write this using the Unicode notation instead of the character itself?
[My ramblings about the character]
In case you haven't met this character, (and you probably haven't, seeing as it's invisible to the naked eye, unless it broke your code and you discovered it while trying to locate the bug) it's a real a-hole that will cause certain types of pattern matching to malfunction. I've caged the beast for you:
[​] <- careful, don't let it escape.
If you want to see it, copy those brackets into a text editor and then iterate your cursor through them. You'll notice you'll need three steps to pass what seems like 2 characters, and your cursor will skip a step in the middle.
The number in a unicode escape should be in hex, and the hex for 8203 is 200B (which is indeed a Unicode zero-width space), so:
var b = a.replace(/\u200B/g,'');
Live Example:
var a = "o​m"; //the invisible character is between o and m
var b = a.replace(/\u200B/g,'');
console.log("a.length = " + a.length); // 3
console.log("a === 'om'? " + (a === 'om')); // false
console.log("b.length = " + b.length); // 2
console.log("b === 'om'? " + (b === 'om')); // true
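The decimal-to-hex step can also be checked in code. This sketch builds the test string from the decimal value 8203, so the source file itself contains no invisible characters:

```javascript
// 8203 in decimal is 200b in hexadecimal, hence the escape \u200B
var hex = (8203).toString(16);
console.log(hex); // "200b"

// Build the string from the character code instead of pasting the character
var zwsp = String.fromCharCode(8203);
var a = "o" + zwsp + "m";
var b = a.replace(/\u200B/g, "");

console.log(a.length); // 3
console.log(b.length); // 2
console.log(b === "om"); // true
```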
The accepted answer didn't work for my case.
But this one did:
text.replace(/(^[\s\u200b]*|[\s\u200b]*$)/g, '')

Regex with limited use of specific characters (like a scrabble bank of letters)

I would like to be able to do a regex where I can identify sort of a bank of letters like [dgos] for example and use that within my regex... but whenever a letter from that gets used, it takes it away from the bank.
So let's say that somehow \1 is able to stand for that bank of letters (dgos). I could write a regex something like:
^\1{1}o\1{2}$
and it would match basically:
\1{1} = [dgos]{1}
o
\1{2} = [dgos]{2} minus whatever was used in the first one
Matches could include good, dosg, sogd, etc... and would not include sods (because s would have to be used twice) or sooo (because the o would have to be used twice).
I would also like to be able to identify letters that can be used more than once. I started to write this myself but realized I didn't even know where to begin. I've also searched around and haven't found a very elegant way to do this, or one flexible enough that the regex could easily be generated with minimal input.
I have a solution below using a combination of conditions and multiple regexes (feel free to comment on that answer; perhaps it's the way I'll have to do it?), but I would prefer a single-regex solution if possible, and something more elegant and efficient.
Note that the higher level of elegance and the single regex are just my preferences; the most important thing is that it works and the performance is good enough.
Assuming you are using JavaScript 1.5+ (so you can use lookaheads), here is my solution:
^([dogs])(?!.*\1)([dogs])(?!.*\2)([dogs])(?!.*\3)[dogs]$
So after each letter is matched, you perform a negative lookahead to ensure that this matched letter never appears again.
This method wouldn't work (or at least, would need to be made a heck of a lot more complicated!) if you want to allow some letters to be repeated, however. (E.g. if your letters to match are "example".)
EDIT: I just had a little re-think, and here is a much more elegant solution:
^(?:([dogs])(?!.*\1)){4}
After thinking about it some more, I have thought of a way to allow repeated letters, although this solution is pretty ugly! For example, suppose you want to match the letters "goods" (i.e. including TWO "o"s):
^(?=.*g)(?=.*o.*o)(?=.*d)(?=.*s).{5}$
So, I have used positive lookaheads to check that all of these letters are in the text somewhere, then simply checked that there are exactly 5 letters.
Or as another example, suppose we want to match the letters "banana", with a letter "a" in position 2 of the word. Then you could match:
^(?=.*b)(?=.*a.*a.*a)(?=.*n.*n).a.{4}$
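Since these lookahead regexes follow a fixed recipe, they can be generated from the list of letters. Here is a rough sketch; the function name bankRegex is made up, and it assumes the input consists of plain letters only (no characters that need escaping in a regex).

```javascript
// Builds ^(?=.*g)(?=.*o.*o)(?=.*d)(?=.*s).{5}$ from "goods".
function bankRegex(letters) {
    // count how often each letter occurs
    var counts = {};
    letters.split("").forEach(function (ch) {
        counts[ch] = (counts[ch] || 0) + 1;
    });
    // one positive lookahead per distinct letter, repeated count times
    var lookaheads = Object.keys(counts).map(function (ch) {
        var parts = [];
        for (var i = 0; i < counts[ch]; i++) parts.push(ch);
        return "(?=.*" + parts.join(".*") + ")";
    });
    return new RegExp("^" + lookaheads.join("") + ".{" + letters.length + "}$");
}

var re = bankRegex("goods");
console.log(re.source);        // ^(?=.*g)(?=.*o.*o)(?=.*d)(?=.*s).{5}$
console.log(re.test("dogso")); // true, same letters in another order
console.log(re.test("godss")); // false, only one "o"
```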
Building on @TomLord's answer, and taking into account that you don't necessarily need to exhaust the bank of letters, you can use negative lookahead assertions instead of positive ones. For your example D<2 from bank>R<0-5 from bank>, that regex would be
/^(?![^o]*o[^o]*o)(?![^a]*a[^a]*a)(?![^e]*e[^e]*e[^e]*e)(?![^s]*s[^s]*s)d[oaes]{2}r[oaes]{0,5}$/i
Explanation:
^                        # Start of string
(?![^o]*o[^o]*o)         # Assert no more than one o
(?![^a]*a[^a]*a)         # Assert no more than one a
(?![^e]*e[^e]*e[^e]*e)   # Assert no more than two e
(?![^s]*s[^s]*s)         # Assert no more than one s
d                        # Match d
[oaes]{2}                # Match two of the letters in the bank
r                        # Match r
[oaes]{0,5}              # Match 0-5 of the letters in the bank
$                        # End of string
You could also write (?!.*o.*o) instead of (?![^o]*o[^o]*o), but the latter is faster, just harder to read. Another way to write this is (?!(?:[^o]*o){2}) (useful if the number of repetitions increases).
And of course you need to account for the number of letters in your "fixed" part of the string (which in this case (d and r) don't interfere with the bank, but they might do so in other examples).
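The negative lookaheads can be generated from the bank as well. The sketch below uses the compact (?!(?:[^o]*o){2}) form mentioned above; the helper name limitLookaheads is hypothetical, the bank is assumed to contain only letters, and the fixed letters d and r are assumed not to appear in the bank.

```javascript
// For each distinct letter in the bank, forbid one occurrence more than the bank holds.
function limitLookaheads(bank) {
    var counts = {};
    bank.toLowerCase().split("").forEach(function (ch) {
        counts[ch] = (counts[ch] || 0) + 1;
    });
    return Object.keys(counts).map(function (ch) {
        return "(?!(?:[^" + ch + "]*" + ch + "){" + (counts[ch] + 1) + "})";
    }).join("");
}

var bank = "oaees";
var re = new RegExp("^" + limitLookaheads(bank) + "d[oaes]{2}r[oaes]{0,5}$", "i");

console.log(re.test("dear"));   // true
console.log(re.test("deers"));  // true, the bank holds two e and one s
console.log(re.test("door"));   // false, the bank holds only one o
console.log(re.test("deeres")); // false, three e
```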
Something like that would be pretty simple to specify, just very long
gdos|gdso|gods|gosd|....
You'd basically specify every permutation. Just write some code to generate every permutation, combine with the alternate operator, and you're done!
Although... It might pay to actually encode the decision tree used to generate the permutations...
I swear I think I answered this before on stackoverflow. Let me get back to you...
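For what it's worth, generating the permutations and joining them with the alternation operator takes only a few lines. This is a naive sketch (the function name permutations is made up); it grows factorially, so it is only practical for very small banks.

```javascript
// Returns every distinct ordering of the characters in letters.
function permutations(letters) {
    if (letters.length <= 1) return [letters];
    var result = [];
    for (var i = 0; i < letters.length; i++) {
        // remove the i-th letter and permute the rest
        var rest = letters.slice(0, i) + letters.slice(i + 1);
        permutations(rest).forEach(function (p) {
            var candidate = letters[i] + p;
            // skip duplicates caused by repeated letters
            if (result.indexOf(candidate) === -1) result.push(candidate);
        });
    }
    return result;
}

var all = permutations("gdos");
var re = new RegExp("^(?:" + all.join("|") + ")$");

console.log(all.length);      // 24
console.log(re.test("gods")); // true
console.log(re.test("good")); // false
```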
I couldn't think of a way to do this completely in a regex, but I was able to come up with this: http://jsfiddle.net/T2TMd/2/
As jsfiddle is limited to post size, I couldn't do the larger dictionary there. Check here for a better example using a 180k word dictionary.
Main function:
/*
 * filter = regex to filter/split potential matches
 * bank = available letters
 * groups = capture groups that use the bank
 * dict = list of words to search through
 */
var matchFind = function(filter, bank, groups, dict) {
    var matches = [];
    for(var i=0; i < dict.length; i++) {
        if(dict[i].match(filter)){
            var fail = false;
            var b = bank;
            var arr = dict[i].split(filter);
            //loop groups that use letter bank
            for(var k=0; k<groups.length && !fail; k++) {
                var grp = arr[groups[k]] || [];
                //loop characters of that group
                for(var j=0; j<grp.length && !fail; j++) {
                    var regex = new RegExp(b);
                    var currChar = grp.charAt(j);
                    if(currChar.match(regex)) {
                        //match found, remove from bank
                        b = b.replace(currChar,"");
                    } else {
                        fail = true;
                    }
                }
            }
            if(!fail) {
                matches.push(dict[i]);
            }
        }
    }
    return matches;
}
Usage:
$("#go").click( function() {
    var f = new RegExp($("#filter").val());
    var b = "["+$("#bank").val().replace(/[^A-Za-z]+/g,"").toUpperCase()+"]";
    var g = $("#groups").val().replace(/[^0-9,]+/g,"").split(",") || [];
    $("#result").text(matchFind(f,b,g,dict).toString());
});
To make it easier to create scenarios, I created this as well:
$("#build").click( function() {
    var bank = "["+$("#buildBank").val().replace(/[^A-Za-z]+/g,"").toUpperCase()+"]";
    var buildArr = $("#builder").val().split(",");
    var groups = [];
    var build = "^";
    for(var i=0; i<buildArr.length; i++) {
        var part = buildArr[i];
        if(/\</.test(part)) {
            part = "(" + bank + part.replace("<", "{").replace(">", "}").replace("-",",") + ")";
            build = build + part;
            groups.push(i+1);
        } else {
            build = build + "("+part+")";
        }
    }
    build = build + "$";
    $("#filter").val(build);
    $("#bank").val(bank);
    $("#groups").val(groups.toString());
    $("#go").click();
});
This would be useful in Scrabble. Let's say you are in a position where a word must start with a "D", there are two spaces between that "D" and an "R" from a parallel word, and you have OAEES as your letter bank. In the builder I could put D,<2>,R,<0-3>: it must start with a D, then it must have 2 letters from the bank, then there is an R, then I'd have up to 3 letters left to use (since I'm using 2 between D and R).
The builder would use the letter bank and convert D,<2>,R,<0-3> to ^(D)([OAEES]{2})(R)([OAEES]{0,3})$, which is used to filter for possible matches. Then, for each possible match, it looks at the capture groups that use the letter bank, character by character, removing each letter from that regex when it finds it, so that a word won't match if it uses more of a letter than the letter bank contains.
Test the above scenario here.

Regex for validation

Can anyone tell me how to write a regex for the following scenario. The input should only be numbers or - (hyphen) or , (comma). The input can be given as any of the following
23
23,26
1-23
1-23,24
24,25-56,58-40,45
Also when numbers is given in a range, the second number should be greater than the first one. 23-1 should not be allowed. If a number is already entered it should not be allowed again. Like 1-23,23 should not be allowed
I'm not going to quibble with "I think" or "maybe" -- you cannot do this with a regex.
Matching against a regex can validate that the form of the input is correct, and can also be used to extract pieces of the input, but it cannot do value comparisons, duplicate elimination (except in limited, well-defined circumstances), or range checking.
What you have as input I interpret as a comma-separated list of values or ranges of values; in BNFish notation:
value :: number
range :: value '-' value
term :: value | range
list :: term [','term]*
A regex can be built that will match this to verify correct structure, but you'll have to do other validation for the value comparisons and to prevent the duplicate numbers.
The most straightforward regex I can think of (on short notice) is this
([0-9]+|[0-9]+-[0-9]+)(, *([0-9]+|[0-9]+-[0-9]+))*
You have digits or digits-digits, optionally followed by comma[optional space](digits or digits-digits) - repeated zero or more times.
I tested this regex at http://www.fileformat.info/tool/regex.htm with the input 3,4-12,6,2,90-221
Of course you can replace the [0-9] with [\d] for regex dialects that allow it.
var str = "24,25-56,24, 58- 40,a 45",
    trimmed = str.replace(/\s+/g, '');
// test for correct characters
if (/[^,\-\d]/.test(trimmed)) alert("Please use only digits and hyphens, separated by commas.");
// test for duplicates (exact repeats only, not numbers that fall inside a range)
var split = trimmed.split(/-|,/);
split.sort();
for (var i = 0; i < split.length - 1; i++) {
    if (split[i + 1] == split[i]) alert("Please avoid duplicate numbers.");
}
// test for ascending range
split = trimmed.split(/,/);
for (var j = 0; j < split.length; j++) {
    var ends = split[j].split("-");
    // compare the two ends numerically instead of eval()-ing "x-y" as a subtraction
    if (ends.length === 2 && Number(ends[0]) >= Number(ends[1])) alert("Please use an ascending range.");
}
I don't think you will be able to do this with a RegEx. Especially not the part about set logic - number already used, valid sequential range.
My suggestion would be to have a Regex verify the format, at the least -, number, comma. Then use the split method on commas and loop over the input to verify the set. Something like:
var number_ranges = numbers.split(',');
for (var i = 0; i < number_ranges.length; ++i) {
// verify number ranges in set
}
That logic is not exactly trivial.
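To give an idea of what that loop might contain, here is one possible sketch. The function name validateRanges and its return strings are made up for illustration, and expanding each range number by number assumes the ranges are reasonably small.

```javascript
// Returns "ok", or a short description of the first problem found.
function validateRanges(input) {
    var trimmed = input.replace(/\s+/g, "");
    // structure check: term (',' term)* where term is number or number-number
    if (!/^\d+(-\d+)?(,\d+(-\d+)?)*$/.test(trimmed)) return "bad format";

    var seen = {};
    var terms = trimmed.split(",");
    for (var i = 0; i < terms.length; i++) {
        var ends = terms[i].split("-").map(Number);
        var lo = ends[0];
        var hi = ends.length === 2 ? ends[1] : ends[0];
        // the second number of a range must be greater than the first
        if (ends.length === 2 && lo >= hi) return "descending range";
        // mark every number covered by this term, rejecting repeats
        for (var n = lo; n <= hi; n++) {
            if (seen[n]) return "duplicate " + n;
            seen[n] = true;
        }
    }
    return "ok";
}

console.log(validateRanges("1-23,24")); // "ok"
console.log(validateRanges("23-1"));    // "descending range"
console.log(validateRanges("1-23,23")); // "duplicate 23"
```

The regex only checks the shape of the input; the value comparisons happen in plain code, as described above.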
I think with regular expressions it is better to take the time to learn them than to throw someone else's script into yours without knowing exactly what it is doing. There are excellent resources out there to help you.
Try these sites:
regular-expressions.info
w3schools.com
evolt.org
Those are the first three results from a Google search. All are good resources. Good luck. Remember to double-check what your regex is actually matching by outputting it to the screen; don't assume you know (that has bitten me more than once).
