How to find a URL within full text using regular expression - javascript

What is wrong with the following regular expression, which works in many online JavaScript regular expression testers (and RegEx Buddy), yet doesn't work in my application?
It is intended to replace URLs with a Hyperlink. The Javascript is found in a javascript file.
var fixed = text.replace(/\b(https?|ftp|file)://[-A-Z0-9+&##/%?=~_|$!:,.;]*[A-Z0-9+&##/%=~_|$]/ig, "<a href='$&' target='blank'>$&</a>");
Chrome, for example, complains that & is not valid (as does IE8). Is there some way to escape the ampersand (or whatever else is wrong), without resorting to the RegEx object?

Those testers let you input the regex in its raw form, but when you use it in source code you have to write it in the form of a string literal or (as is the case here) a regex literal. JavaScript uses forward-slashes for its regex-literal delimiters, so you have to escape any slashes in the regex itself to avoid confusing the interpreter.
Once you escape the slashes it should stop complaining about the ampersand. That was most likely caused by the malformed regex literal.
I recognize that regex, having used it myself the other day; you got it from RegexBuddy's Library, didn't you? If you had used RB's "Use" feature to create a JS-compatible regex, it would have escaped the slashes for you.

This works for me in Chrome
var fixed = text.replace(/(ftp|http|https):\/\/(\w+:{0,1}\w*#)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%#!\-\/]))?/igm, "<a href='$1' target='blank'>$1</a>");

Related

Confusion regarding RegExp matches, HTML tags, and newlines

I am attempting to create a Markdown-to-HTML parser. I am trying to use regex expressions to match an input string that may or may not contain HTML tags and whitespace/newlines. I have encountered an interesting case that I do not at all understand.
My regex expression is regex = /\*([\w\s]+|<.+>)\*/g.
The following works:
'*words\nmorewords*'.match(regex)
'*<b>words</b>*'.match(regex)
However, this does not work:
'*<b>words\nmore words</b>*'.match(regex)
If anyone can help me understand why this is so, I would appreciate it.
Edit: I see my faulty logic, thanks to Ry. The expression regex = /\*(<[a-z]+>)?[\w\s]+(<\/[a-z]+>)?\*/g solves this case.
This should work for your purpose:
\*(<.+>)?([\w\s]+)(<.+>)?\*
The HTML tags can exist or not (<.+>)?. The \n is matched by the \s (whitespace).
I'm also going to link the canonical don't parse HTML with regex answer, because regex is not suitable for (or even capable of) parsing HTML beyond fairly restricted subsets. Have a read, it's informative (and funny)!
Recall the Chomsky Heirarchy. Regular expressions can parse regular languages. HTML is not a regular language (it is the next level up, context sensitive).
There are extensions to some regular expression engines that give it recursive capability. You can probably parse HTML with these but there are better ways, like using a proper HTML parser for example DOMParser.

using same regex both in php and javascript

Problem:-
I am using the following regex to find the special characters in a string
"/[^a-zA-Z0-9\s]/i"
I want to get all the characters that match this pattern and all that is working fine.
The condition is that that I have to use the same expression both in php and javascript.
But the g flag in the above regex is creating problem as preg_match and preg_match_all do not accept this flag and I have to search globally.
Question:-
SO how can I get all the special characters using the same expression both in php and javascript?
You can't use the same regex in both PHP and JavaScript because their regex engines make different assumptions and support different features.
More than just the incompatibility with the g modifier, this regex will fail you if the input contains non-ASCII characters: the input encoding in PHP and JS will be almost certainly different and PHP will not even be Unicode-aware unless you use the u flag (which does not exist in JS because it's Unicode-aware by default).
Just use two different regular expressions.
To match [^a-zA-Z0-9\s] in JavaScript you would have to use:
[\u0000-\u0008\u000F-\u001F\u0022-\u002F\u003B-\u0040\u005C-\u0060\u007C-\u0084\u0087-\u009F\u00A2-\u167F\u1682-\u180D\u1810-\u1FFF\u200C-\u2027\u202B-\u202E\u2031-\u205E\u2061-\u2FFF\u3002-\uFFFF]

JavaScript function to escape Java regular expression string

Earlier questions on StackOverflow discuss escaping for JavaScript regular expressions, e.g.:
How to escape regular expression in javascript?
Escape string for use in Javascript regex
An implementation suggested is the following:
RegExp.quote = function(str) {
return str.replace(/([.?*+^$[\]\\(){}|-])/g, "\\$1");
};
Given that regular expressions in the two languages are not identical, is anyone aware of a JavaScript method that properly escapes strings to be used for Java regular expressions?
There's no need for any escaping at all. Those questions are about what needs to be done when the regular expression is being constructed as a string in the source language. Since you're reading the string from an input field, there's no layer of interpretation to worry about.
Just send the string to the server, where it will be discovered to be a valid regex or not.
edit — though I can't think of any, the real thing to worry about might be any sort of "injection" attack that could be conducted through this avenue. Seems to me that if you're just passing a regex to Pattern.compile() there aren't any side-effect channels that could be exploited.

help making a "universal" regex Javascript compatible

I found a very nice URL regex matcher on this site: http://daringfireball.net/2010/07/improved_regex_for_matching_urls . It states that it's free to use and that it's cross language compatible (including Javascript). First of all, I have to escape some of the slashes to get it to compile at all. When I do that, it works fine on Rubular.com (where I generally test regexes), with the strange side effect that each match has 5 fields: 1 is the url, and the extra 4 are empty. When I put this in JS, I get the error "Invalid Group". I am using Node.js if that makes any difference, but I wish I could understand that error. I'd like to cut back on the unnecessary empty match fields, but I don't even know where to begin diagnosing this beast. This is what I had after escaping:
(?xi)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’] ))
Actually, you don't need the first capturing group either; it's the same as the whole match in this case, and that can always be accessed via $&. You can change all the capturing groups to non-capturing by adding ?: after the opening parens:
/\b(?:(?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\((?:[^\s()<>]+|(\(?:[^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]+|(?:\(?:[^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i
That "invalid group" error is due to the inline modifiers (i.e., (?xi)) which, as #kirilloid observed, are not supported in JavaScript. Jon Gruber (the regex's author) was mistaken about that, as he was about JS supporting free-spacing mode.
Just FYI, the reason you had to escape the slashes is because you were using regex-literal notation, the most common form of which uses the forward-slash as the regex delimiter. In other words, it's the language (Ruby or JavaScript) that requires you to escape that particular character, not the regex. Some languages let you choose different regex delimiters, while others don't support regex literals at all.
But these are all language issues, not regex issues; the regex itself appears to work as advertised.
Seemes, that you copied it wrong.
http://www.regular-expressions.info/javascript.html
No mode modifiers to set matching options within the regular expression.
No regular expression comments
I.e. (?xi) at the beginning is useless.
x is useless at all for compacted RegExp
i can be replaced with flag
All these result in:
/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i
Tested and working in Google Chrome => should work in Node.js

Why aren't double quotes and backslashes allowed in strings in the JSON standard?

If I run this in a JavaScript console in Chrome or Firebug, it works fine.
JSON.parse('"\u0027"') // Escaped single-quote
But if I run either of these 2 lines in a Javascript console, it throws an error.
JSON.parse('"\u0022"') // Escaped double-quote
JSON.parse('"\u005C"') // Escaped backslash
RFC 4627 section 2.5 seems to imply that \ and " are allowed characters as long as they're properly escaped. The 2 browsers I've tried this in don't seem to allow it, however. Is there something I'm doing wrong here or are they really not allowed in strings? I've also tried using \" and \\ in place of \u0022 and \u005C respectively.
I feel like I'm just doing something very wrong, because I find it hard to believe that JSON would not allow these characters in strings, especially since the specification doesn't seem to mention anything that I could find saying they're not allowed.
You need to quote the backslash!
that which we call a rose
By any other name would smell as sweet
A double quote is a double quote, no matter how you express it in the string constant. If you put a backslash before your \u expression within the constant, then the effect is that of a backslash-quoted double-quote, which is in fact what you've got.
The most interesting thing about your question is that it helps me realize that every JavaScript string parser I've ever written is wrong.
JavaScript is interpreting the Unicode escape sequences before they get to the JSON parser. This poses a problem:
'"\u0027"' unquoted the first time (by JavaScript): "'" The second time (by the JSON parser) as valid JSON representing the string: '
'"\u0022"' unquoted the first time (by JavaScript): """ The JSON parser sees this unquoted version """ as invalid JSON (does not expect the last quotation mark).
'"\u005C"' unquoted the first time (by JavaScript): "\" The JSON parser also sees this unquoted version "\" as invalid JSON (second quotation mark is backslash-escaped and so does not terminate the string).
The fix for this is to escape the escape sequence, as \\u.... In reality, however, you would probably just not use the JSON parser. Used in the correct context (ensured by wrapping it within parentheses, all JSON is valid JavaScript.
You are not escaping the backslash and the double-quote, as \u00xx is being interpreted by the javascript parser.
JSON.parse('"\\\u0022"')
JSON.parse('"\\\""')
and
JSON.parse('"\\\u005C"')
JSON.parse('"\\\\"')
work as expected.

Categories

Resources