Regular expression Modification - javascript

I have an ASP.NET application with a text box for a URL, and I am using a regular expression to validate it. My regular expression is: ^(ht|f)tp(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$
Now I have an enhancement: the text box always contains the default text http://. The validation has to ignore this default text (http://). How is this possible? Please help me resolve this issue.

Your expression matches the http:// part, so it "keeps" that part of the match. If the text box you're validating doesn't contain that part at all, simply drop (ht|f)tp(s?)\:\/\/ from your regex.
If it is part of the text box, but you want to ignore it after having matched it, then you can put capturing parentheses around your intended match. Your original regex would then look like this:
^(ht|f)tp(s?)\:\/\/([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?)$
Now the part without http:// or ftp:// etc will be in backreference number 3.
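As a rough sketch of how that third group could be read in JavaScript: the pattern below is the one above written as a regex literal, and the helper name withoutScheme is made up for illustration.

```javascript
// The pattern from above as a regex literal; group 3 wraps everything after the scheme.
const urlPattern = /^(ht|f)tp(s?)\:\/\/([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?)$/;

// Hypothetical helper: returns the URL without its scheme, or null if validation fails.
function withoutScheme(value) {
  const match = urlPattern.exec(value);
  return match ? match[3] : null;
}

console.log(withoutScheme("http://example.com/path")); // "example.com/path"
console.log(withoutScheme("example.com/path"));        // null (no scheme, so no match)
```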
That said, your regex as it stands now is pretty bad and also incorrect: lots of unnecessary escapes, unnecessary parentheses, and a wrongly constructed character class ((0-9) should be [0-9], so URLs with a port number will fail here), and I'm pretty sure that you don't want & in there.
It is not easy to validate URLs with regexes. What are your intentions? What should be valid, what shouldn't be?

You can try this:
Use ((ht|f)tp(s?)\:\/\/)? at the start of your regular expression, which makes http:// or ftp:// optional.
Your complete regex would be:
^((ht|f)tp(s?)\:\/\/)?[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$
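A quick check of the optional-scheme pattern in JavaScript might look like the following. The function name isValidUrlText is only an illustrative assumption.

```javascript
// Same pattern as above, with the scheme wrapped in an optional group.
const optionalSchemePattern = /^((ht|f)tp(s?)\:\/\/)?[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$/;

// Hypothetical helper for the text box validation.
function isValidUrlText(value) {
  return optionalSchemePattern.test(value);
}

console.log(isValidUrlText("http://example.com")); // true
console.log(isValidUrlText("example.com"));        // true
console.log(isValidUrlText("http://"));            // true
```

Note the last line: once the scheme is optional, the bare default text http:// also passes, because the rest of the pattern can read it as the host "http" followed by ":", "/" and "/". If the untouched default must be rejected, a separate comparison against "http://" is probably needed.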

Related

How can I prevent extra characters at the end of my string in Regex?

Background
I'm working on a JavaScript application where users have to use a specific email domain to sign up (either @a.com or @b.com; anything else gets rejected).
I've created a regex that makes sure the user doesn't enter @a.com with nothing in front of it and limits users to only @a.com and @b.com. The last step is to make sure the user doesn't add extra characters to the end of @a.com by doing something like @a.com.gmail.com
This is the regex I currently have:
\b[a-zA-Z0-9\.]*@(a.com|b.com)
Question
What can I use at the end to prevent anything from being added after a.com or b.com? I'm very novice at regex and have no idea where to start.
To solve your problem, add $ to the end of the regex. $ means the match must end at the end of the string.
You can also reduce (a.com|b.com) to [ab]\.com. Note that I've also escaped the dot, so that it matches only a literal dot.
The character class [ab] means that exactly one of its characters must be matched.
Check this demo.
As stated in the comments, if the input contains several lines, be sure to use the (m)ultiline flag; this way $ matches at the end of every line and the regex engine will treat each line as a separate string.
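Putting these suggestions together, a minimal sketch could look like this. It assumes the separator is the usual @ sign and uses + instead of * so that at least one character is required before it; both the function name and the sample addresses are invented for illustration.

```javascript
// $ anchors the match at the end of the string; [ab]\.com replaces (a.com|b.com).
const allowedEmail = /\b[a-zA-Z0-9.]+@[ab]\.com$/;

// Hypothetical helper used by the sign-up form.
function isAllowedEmail(value) {
  return allowedEmail.test(value);
}

console.log(isAllowedEmail("user@a.com"));           // true
console.log(isAllowedEmail("user@a.com.gmail.com")); // false
console.log(isAllowedEmail("@a.com"));               // false

// With the g and m flags, $ matches at the end of every line of a multi-line input.
const perLine = /\b[a-zA-Z0-9.]+@[ab]\.com$/gm;
const lines = "user@a.com\nbad@a.com.gmail.com\nother@b.com";
console.log(lines.match(perLine)); // [ "user@a.com", "other@b.com" ]
```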

Regular Expression problem

I need a regular expression for javascript that will get "jones.com/ca" from "Hello we are jones.com/ca in Tampa". The "jones.com/ca" could be any web url extension (example: .net, .co, .gov, etc), and any name. So the regular expression needs to find all instances of say ".com" and all the text to the last white space or beginning of line and to the last white space or end of line (minus any ending punctuation).
Right now I have as an example line: "jones.com/ca some text", using a javascript regular expression of: "\\(.+?^\\s).com?([^\\s]+)?\\", and all I get is ".com/ca" as the output.
This example will capture the specific domains com, org and gov, followed by a two-letter path:
\b\w+\.(?:com|org|gov)/[a-z]{2}\b
And this will capture almost any domain with a two- or three-letter extension:
\b\w+\.[a-z]{2,3}/[a-z]{2}\b
It uses word boundaries so that it does not capture white space.
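For illustration, here is how the two patterns might be applied in JavaScript. Inside a regex literal the forward slash has to be escaped; the g flag, the helper name and the second sample sentence are assumptions added for this sketch.

```javascript
// Specific extensions only (com, org, gov), each followed by a two-letter path.
const specificDomains = /\b\w+\.(?:com|org|gov)\/[a-z]{2}\b/g;
// Any two- or three-letter extension followed by a two-letter path.
const anyShortDomain = /\b\w+\.[a-z]{2,3}\/[a-z]{2}\b/g;

// Hypothetical helper: returns all matches as an array (empty if none).
function findAll(pattern, text) {
  return text.match(pattern) || [];
}

console.log(findAll(specificDomains, "Hello we are jones.com/ca in Tampa")); // [ "jones.com/ca" ]
console.log(findAll(anyShortDomain, "Visit smith.net/uk today"));            // [ "smith.net/uk" ]
```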
Matching URLs is a bit of a dark art. The following site has a fairly well-designed regex for this purpose: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
A comprehensive regex for this is going to be much more complicated than you think. The list of top-level domains is fairly long (.gov, .info, .edu, .museum, etc.), and there are "special" domains like localhost as well. Also, many domains end in a two-letter country abbreviation (google.com.br for Google Brazil, for example, or del.icio.us).
The easiest thing would be to look for http(s):// or www at the beginning and just assume what comes after is a domain name. If you don't, you're going to either miss a lot, or get a lot of false positives.
You could try the following, but the last option (after the last |) is going to be open to a significant number of false positives:
/https?:\/\/\S+|www\.\S+|([-a-z0-9_]+\.)+(com|org|edu|gov|mil|info|[a-z]{2})(\/\S*)?|([-a-z0-9_]+\.)+[-a-z0-9_]+\/\S*/ig
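A small sketch of this pattern in use is shown below. The helper name and the sample sentence (the question's text with an invented "1.0/beta" appended) are assumptions chosen to show one genuine hit and one false positive from the last alternative.

```javascript
const looseUrlPattern = /https?:\/\/\S+|www\.\S+|([-a-z0-9_]+\.)+(com|org|edu|gov|mil|info|[a-z]{2})(\/\S*)?|([-a-z0-9_]+\.)+[-a-z0-9_]+\/\S*/ig;

// Hypothetical helper: returns every match in the text (empty array if none).
function findLooseUrls(text) {
  return text.match(looseUrlPattern) || [];
}

// "jones.com/ca" is what we want; "1.0/beta" is picked up by the last alternative.
console.log(findLooseUrls("Hello we are jones.com/ca in Tampa running 1.0/beta"));
// [ "jones.com/ca", "1.0/beta" ]
```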

help making a "universal" regex Javascript compatible

I found a very nice URL regex matcher on this site: http://daringfireball.net/2010/07/improved_regex_for_matching_urls . It states that it's free to use and that it's cross language compatible (including Javascript). First of all, I have to escape some of the slashes to get it to compile at all. When I do that, it works fine on Rubular.com (where I generally test regexes), with the strange side effect that each match has 5 fields: 1 is the url, and the extra 4 are empty. When I put this in JS, I get the error "Invalid Group". I am using Node.js if that makes any difference, but I wish I could understand that error. I'd like to cut back on the unnecessary empty match fields, but I don't even know where to begin diagnosing this beast. This is what I had after escaping:
(?xi)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’] ))
Actually, you don't need the first capturing group either; it's the same as the whole match in this case, and that can always be accessed via $&. You can change all the capturing groups to non-capturing by adding ?: after the opening parens:
/\b(?:(?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i
That "invalid group" error is due to the inline modifiers (i.e., (?xi)) which, as @kirilloid observed, are not supported in JavaScript. John Gruber (the regex's author) was mistaken about that, as he was about JS supporting free-spacing mode.
Just FYI, the reason you had to escape the slashes is because you were using regex-literal notation, the most common form of which uses the forward-slash as the regex delimiter. In other words, it's the language (Ruby or JavaScript) that requires you to escape that particular character, not the regex. Some languages let you choose different regex delimiters, while others don't support regex literals at all.
But these are all language issues, not regex issues; the regex itself appears to work as advertised.
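The points above can be checked with a short sketch. It assumes a version of the pattern in which every group is non-capturing, with the i flag supplied outside the pattern; the helper name compiles and the sample sentences are invented for illustration.

```javascript
// Hypothetical helper: reports whether a pattern source compiles in this engine.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    return false; // a leading (?xi) ends up here with a SyntaxError
  }
}

console.log(compiles("(?xi)\\bwww[.]", "")); // false
console.log(compiles("\\bwww[.]", "i"));     // true

// All groups non-capturing, so the match array holds only the whole match.
const urlMatcher = /\b(?:(?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i;

const found = urlMatcher.exec("see http://example.com/path now");
console.log(found[0]);     // "http://example.com/path"
console.log(found.length); // 1

// The trailing full stop of a sentence is left out of the match.
console.log(urlMatcher.exec("Go to http://example.com.")[0]); // "http://example.com"
```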
It seems that you copied it wrong.
http://www.regular-expressions.info/javascript.html
No mode modifiers to set matching options within the regular expression.
No regular expression comments
I.e. (?xi) at the beginning is useless.
x is of no use at all for a compacted RegExp
i can be replaced with the i flag
All these result in:
/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i
Tested and working in Google Chrome => should work in Node.js

Getting parts of a URL in JavaScript

I have to match URLs in a text, linkify them, and then display only the host--domain name or IP address--to the user. How can I proceed with JavaScript?
Thanks.
PS: please don't tell me about this; those regular expressions are so buggy they can't match http://google.com
If you don't want to use regular expressions, then you'll need to use things like indexOf and such instead. For instance, search for "://" in the text of every element and if you find it and the bit in front of it looks like a protocol (or "scheme"), grab it and the following characters that are valid URI characters (RFC2396). If the result ends in a dot or question mark, remove the dot or question mark (it probably ends a sentence). There's not really a lot more to say.
Update: Ah, I see from your edit that you don't have a problem with regular expressions, just the ones in the answers to that question. Fair enough.
This may well be one of those places where trying to do it all with a regular expression is more work than it should be, but using regular expressions as part of the solution is helpful. For instance,
/[a-zA-Z][a-zA-Z0-9+\-.]*:\/\//
...may well be a helpful way to find the beginning of a URL, since the scheme portion must start with an alpha and then can have zero or more alpha, digit, +, -, or . prior to the : (section 3.1).
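One way this could be combined into a host extractor is sketched below. It uses the scheme pattern to locate candidates, takes the following run of non-whitespace characters (a simplification of the RFC 2396 character rules), trims a trailing dot or question mark, and hands the rest to the standard URL constructor. The function name findHosts and the sample text are assumptions for illustration.

```javascript
// Scheme pattern from above, with the g flag so exec() can walk through the text.
const schemePattern = /[a-zA-Z][a-zA-Z0-9+\-.]*:\/\//g;

// Hypothetical helper: returns the host of every URL-like run found in the text.
function findHosts(text) {
  const hosts = [];
  schemePattern.lastIndex = 0;
  let match;
  while ((match = schemePattern.exec(text)) !== null) {
    // Simplification: treat everything up to the next whitespace as the URL.
    const run = text.slice(match.index).match(/^\S+/)[0];
    // A trailing dot or question mark most likely ends the sentence.
    const candidate = run.replace(/[.?]+$/, "");
    try {
      hosts.push(new URL(candidate).hostname);
    } catch (e) {
      // Not parseable as a URL; skip it.
    }
    schemePattern.lastIndex = match.index + run.length;
  }
  return hosts;
}

console.log(findHosts("Try http://google.com. Or https://192.168.0.1:8080/admin?"));
// [ "google.com", "192.168.0.1" ]
```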

How to detect which characters are allowed by a regular expression using JavaScript?

In my web application, I have created a framework that binds model data to controls on the page. Each model property has rules such as string length, not null, and a regular expression. Before the page is submitted, the framework validates every bound control against the defined rules.
So, I want to detect which characters are allowed by each regular expression rule, as in the following examples.
"^[0-9]+$" allows only digit characters like 1, 2, 3.
"^[a-zA-Z_][a-zA-Z_\-0-9]+$" allows only letters, digits, - and _ characters.
However, this function should not care about grouping or the position of the allowed characters. It should just report the possible characters.
Do you have any idea how to create this function?
PS. I know it is easy to create a specific function, such as a numeric-only one that allows only digit characters. But I need to share/reuse the same piece of code in both the data tier (which contains all model validators) and the UI tier without modifying anything.
Thanks
You can't solve this for the general case. Regexps don't generally ‘fail’ at a particular character, they just get to a point where they can't match any more, and have to backtrack to try another method of matching.
One could make a regex implementation that remembered which was the farthest it managed to match before backtracking, but most implementations don't do that, including JavaScript's.
A possible way forward would be to match first against ^pattern$, and if that failed match against ^pattern without the end-anchor. This would be more likely to give you some sort of match of the left hand part of the string, so you could count how many characters were in the match, and say the following character was ‘invalid’. For more complicated regexps this would be misleading, but it would certainly work for the simple cases like [a-zA-Z0-9_]+.
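A minimal sketch of that two-step approach follows. It assumes the rule is stored without its own ^ and $ anchors, and the function name firstInvalidIndex is invented for this example.

```javascript
// Hypothetical helper: returns -1 if the whole value matches, otherwise the index
// of the first character that the anchored-at-start pattern could not consume.
function firstInvalidIndex(patternSource, value) {
  const full = new RegExp("^(?:" + patternSource + ")$");
  if (full.test(value)) {
    return -1;
  }
  const prefix = new RegExp("^(?:" + patternSource + ")").exec(value);
  return prefix ? prefix[0].length : 0;
}

console.log(firstInvalidIndex("[a-zA-Z0-9_]+", "abc_123")); // -1
console.log(firstInvalidIndex("[a-zA-Z0-9_]+", "abc$def")); // 3
console.log(firstInvalidIndex("[0-9]+", "x12"));            // 0
```

As the answer warns, the reported index is only meaningful for simple patterns; with alternation or heavy backtracking it can point at a character that is not really the problem.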
I must admit that I'm struggling to parse your question.
If you are looking for a regular expression that will match only if a string consists entirely of a certain collection of characters, regardless of their order, then your examples of character classes were quite close already.
For instance, ^[A-Za-z0-9]+$ will only allow strings that consist of letters A through Z (upper and lower case) and numbers, in any order, and of any length.
