I'm assuming that DoS is a possible issue when matching arbitrary strings against arbitrary regexes with one of JS's regex functions on the backend in Node.js. If the provided regex is simply invalid, the error thrown by the constructor can just be caught -- but could matching the string with the RegExp become a significantly or even completely blocking operation, whether caused deliberately or accidentally by whoever supplies the regex and the string? If so, how exactly would this be caused, and how could it be mitigated?
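For the "simply invalid" case mentioned in the question, catching the constructor error might look roughly like the sketch below. The helper name `tryCompile` is made up for illustration. Note that it only deals with syntax errors and does nothing to bound how long a later match can run.

```javascript
// Hypothetical helper: compile a user-supplied pattern and return null
// instead of throwing when the pattern is not valid regex syntax.
function tryCompile(source, flags) {
  try {
    return new RegExp(source, flags);
  } catch (err) {
    // The RegExp constructor reports invalid patterns as SyntaxError.
    if (err instanceof SyntaxError) return null;
    throw err; // anything else is unexpected, so let it propagate
  }
}

console.log(tryCompile("(") === null);        // unbalanced group is rejected
console.log(tryCompile("a+b").test("caab"));  // a valid pattern is usable
```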
Related
Earlier questions on StackOverflow discuss escaping for JavaScript regular expressions, e.g.:
How to escape regular expression in javascript?
Escape string for use in Javascript regex
An implementation suggested is the following:
RegExp.quote = function(str) {
return str.replace(/([.?*+^$[\]\\(){}|-])/g, "\\$1");
};
Given that regular expressions in the two languages are not identical, is anyone aware of a JavaScript method that properly escapes strings to be used for Java regular expressions?
There's no need for any escaping at all. Those questions are about what needs to be done when the regular expression is being constructed as a string in the source language. Since you're reading the string from an input field, there's no layer of interpretation to worry about.
Just send the string to the server, where it will be discovered to be a valid regex or not.
edit — though I can't think of any, the real thing to worry about might be any sort of "injection" attack that could be conducted through this avenue. Seems to me that if you're just passing a regex to Pattern.compile() there aren't any side-effect channels that could be exploited.
Just out of curiosity, is it possible to parse a string that is totally made out of random but valid regular expressions with a single regular expression?
given the string of regex:
<[^>]*>\xA9
parses to:
<[^>]*>
\xA9
in which the first one matches an HTML tag and the second one matches a copyright symbol.
Edit:
I found a similar question asked at SO claiming that it may be possible. Here, I'm referring to regex in JavaScript ECMA-262 only.
No, it is not possible: the regular expression language allows parenthesized expressions representing capturing and non-capturing groups, lookarounds, etc., where parentheses must be balanced. It is not possible even in theory to write a regular expression that verifies whether parentheses are balanced in a given string. Without the ability to do that, you wouldn't know where one regexp ends and the next one starts.
In general, regex grammar is relatively complex. To get an idea of just how complex it is, take a look at the parser in the source of Java's Pattern class.
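To illustrate why balance checking needs more than a regular expression: it takes a counter that can grow without bound, which ordinary code provides easily. The following is only a sketch with a made-up helper name, `hasBalancedParens`. It assumes that skipping backslash-escaped characters is enough, and it deliberately ignores parentheses inside character classes such as [(], so it is not a regex parser.

```javascript
// Hypothetical helper: track nesting depth of parentheses in a pattern.
// Assumption: a backslash escapes exactly the next character.
// Limitation: parentheses inside character classes are counted too.
function hasBalancedParens(pattern) {
  let depth = 0;
  for (let i = 0; i < pattern.length; i++) {
    const ch = pattern[i];
    if (ch === "\\") {
      i++; // skip the escaped character
      continue;
    }
    if (ch === "(") {
      depth++;
    } else if (ch === ")") {
      depth--;
      if (depth < 0) return false; // closing paren with no opener
    }
  }
  return depth === 0;
}

console.log(hasBalancedParens("(a(b)c)"));
console.log(hasBalancedParens("(a"));
```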
I need to modify the value using javascript, to make it ready to be put as part of a SQL insert query.
Currently I have the following code to handle the single quote ' character.
value = value.replace(/'/g, "\\'");
This code works without an issue. Now I noticed that stand-alone backslashes are causing errors.
How can I remove those stand alone backslashes?
Now I noticed that stand-alone backslashes are causing errors.
Backslashes in the string you're operating on won't have any effect on replacing ' characters whatsoever. If your goal is to replace backslash characters, use this:
value = value.replace(/\\/g, "whatever");
...which will replace all backslashes in the string with "whatever". Note that I've had to write two backslashes rather than just one. That's because in a regular expression literal, the backslash is used to introduce various special characters and character classes, and is also used as an escape — two backslashes together in a regular expression literal (as in a string) represent a single actual backslash in the string.
To change a single backslash into two backslashes, use:
value = value.replace(/\\/g, "\\\\");
Note that, again, to get a literal backslash in the replacement string, we have to escape each of the two — resulting in four in total in the replacement string.
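A quick way to check the backslash counts is to look at string lengths. This is just a sketch, and the sample value is made up:

```javascript
// In source code, "a\\b" is the three-character string: a, backslash, b.
const value = "a\\b";

// The regex literal /\\/ matches one backslash; the replacement string
// "\\\\" is two backslashes once the string literal has been parsed.
const doubled = value.replace(/\\/g, "\\\\");

console.log(value.length);   // 3
console.log(doubled.length); // 4
console.log(doubled === "a\\\\b"); // true
```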
I need to modify the value using javascript, to make it ready to be put as part of a SQL insert query.
You don't want to do this by hand. Any technology that allows you to make database queries and such (JDBC, ODBC, etc.) will provide some form of prepared or parameterized statement, which deals with these sorts of escaping issues for you. Doing it yourself is virtually guaranteed to leave security holes in your software which could be exploited. You want to use the work of a team that's had to think this through, and which updates the resulting code periodically as issues come to light, rather than flying alone. Further, if your JavaScript is running on the client (as most is, but by no means all; I use JavaScript server-side all the time), then nothing you do to escape the string can make it safe, because client requests to the server can be spoofed, completely bypassing your client-side code.
You should use an escape function provided by some kind of database library; rolling your own will only cause trouble.
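As a rough sketch of what a parameterized statement looks like from server-side JavaScript: the code below assumes a `client` object with a `query(text, values)` method in the style of node-postgres. The function name `insertNote`, the table `notes`, and the column `body` are all made up for illustration, and the placeholder syntax differs between database libraries.

```javascript
// Hypothetical helper. `client` is assumed to expose query(text, values),
// as node-postgres clients do; `notes` and `body` are invented names.
async function insertNote(client, value) {
  // $1 is a placeholder. The value is sent separately from the SQL text,
  // so quotes and backslashes in it need no manual escaping.
  return client.query("INSERT INTO notes (body) VALUES ($1)", [value]);
}
```

The value is passed exactly as the user supplied it, with no replace() calls applied first.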
I found a very nice URL regex matcher on this site: http://daringfireball.net/2010/07/improved_regex_for_matching_urls . It states that it's free to use and that it's cross language compatible (including Javascript). First of all, I have to escape some of the slashes to get it to compile at all. When I do that, it works fine on Rubular.com (where I generally test regexes), with the strange side effect that each match has 5 fields: 1 is the url, and the extra 4 are empty. When I put this in JS, I get the error "Invalid Group". I am using Node.js if that makes any difference, but I wish I could understand that error. I'd like to cut back on the unnecessary empty match fields, but I don't even know where to begin diagnosing this beast. This is what I had after escaping:
(?xi)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’] ))
Actually, you don't need the first capturing group either; it's the same as the whole match in this case, and that can always be accessed via $&. You can change all the capturing groups to non-capturing by adding ?: after the opening parens:
/\b(?:(?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i
That "invalid group" error is due to the inline modifiers (i.e., (?xi)) which, as @kirilloid observed, are not supported in JavaScript. John Gruber (the regex's author) was mistaken about that, as he was about JS supporting free-spacing mode.
Just FYI, the reason you had to escape the slashes is that you were using regex-literal notation, the most common form of which uses the forward slash as the regex delimiter. In other words, it's the language (Ruby or JavaScript) that requires you to escape that particular character, not the regex. Some languages let you choose different regex delimiters, while others don't support regex literals at all.
But these are all language issues, not regex issues; the regex itself appears to work as advertised.
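To make the delimiter point concrete, here is a small sketch in JavaScript. The pattern and the sample text are made up. The same regex is written once as a literal, where the slashes must be escaped, and once via the RegExp constructor, where they need no escaping:

```javascript
// Regex literal: "/" is the delimiter, so slashes in the pattern are escaped.
const literal = /https?:\/\//;

// Constructor form: the pattern is an ordinary string, so "/" is not special.
const constructed = new RegExp("https?://");

const sample = "see http://example.com for details";
console.log(literal.test(sample));
console.log(constructed.test(sample));

// The whole match is element 0 of the match result.
console.log(sample.match(constructed)[0]);
```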
Seems that you copied it wrong.
http://www.regular-expressions.info/javascript.html
No mode modifiers to set matching options within the regular expression.
No regular expression comments
I.e., the (?xi) at the beginning does not work in JavaScript.
x is pointless anyway for a RegExp that has already been compacted onto one line.
i can be replaced with the i flag.
All these result in:
/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i
Tested and working in Google Chrome => should work in Node.js
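A small sketch of both points, using a shortened made-up pattern rather than the full URL regex: the inline (?xi) prefix makes the RegExp constructor throw, while the i flag gives the case-insensitive behaviour. The helper name `compileError` is hypothetical.

```javascript
// Hypothetical helper: return the error name if the pattern fails to
// compile, or null if it compiles.
function compileError(source, flags) {
  try {
    new RegExp(source, flags);
    return null;
  } catch (err) {
    return err.name;
  }
}

console.log(compileError("(?xi)www\\d{0,3}[.]"));   // inline modifiers rejected
console.log(compileError("www\\d{0,3}[.]", "i"));   // same pattern with a flag

// Case-insensitive matching via the flag instead of (?i).
console.log(new RegExp("www\\d{0,3}[.]", "i").test("WWW2.example.com"));
```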
I need to create an EBCDIC string within my javascript and save it into an EBCDIC database. A process on the EBCDIC system then uses the data. I haven't had any problems until I came across the character '¬'. In EBCDIC it is hex value of 5F. All of the usual letters and symbols seem to automagically convert with no problem. Any idea how I can create the EBCDIC value for '¬' within javascript so I can store it properly in the EBCDIC db?
Thanks!
If "all of the usual letters and symbols seem to automagically convert", then I very strongly suspect that you do not have to create an EBCDIC string in Javascript. The character codes for Latin letters and digits are completely different in EBCDIC than they are in Unicode, so something in your server code is already converting the strings.
Thus what you need to determine is how that process works, and specifically you need to find out how the translation maps character codes from Unicode source into the EBCDIC equivalents. Once you know that, you'll know what Unicode character to use in your Javascript code.
As a further note: every single time I've been told by an IT organization that their mainframe software requires that data be supplied in EBCDIC, that advice has been dead wrong. The fact that there's some external interface means that something in the pile of iron that makes up the mainframe and its tentacles, something the IT people have forgotten about and probably couldn't find if they needed to, is already mapping "real world" character encodings like Unicode into EBCDIC. How does it work? Well, it may be impossible to figure out.
You might try whether this works: var notSign = "\u00AC";
edit: also: here's a good reference for HTML entities and Unicode glyphs: http://www.elizabethcastro.com/html/extras/entities.html The HTML/XML syntax uses decimal numbers for the character codes. For Javascript, you have to convert those to hex, and the notation in Javascript strings is "\u" followed by a 4-digit hex constant. (That reference isn't complete, but it's pretty easy to read and it's got lots of useful symbols.)
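The decimal-to-hex conversion described above can be sketched like this, using the not sign (decimal entity code 172) as the example. The variable names are made up for illustration.

```javascript
// &#172; is the decimal HTML entity for the not sign.
const decimalCode = 172;

// Convert to the 4-digit hex form used after "\u" in a JavaScript string.
const hex = decimalCode.toString(16).toUpperCase().padStart(4, "0");
console.log("\\u" + hex); // \u00AC

// Building the character from the decimal code gives the same string
// as writing the escape directly.
console.log(String.fromCharCode(decimalCode) === "\u00AC"); // true
```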