match 2 urls with localhost - javascript

I'm having the hardest time with javascript regex, can't figure out how to match my url:
http://localhost:11111/#!/quote/18283
and
http://www.myurl.com/#!/quote/23834
with the same regex.
I just don't understand the regex rules that well.

http://[\w\d\.:]+/#!/quote/\d{5} - but obviously that is without any other context.
I don't know what your negative cases are. Which parts of the URLs are important, etc, etc.

One hint I can add is, if you are only looking to match a specific domain.com with localhost you can use alternation (either/or) with the pipe | symbol like (this is just for one portion of the regex:
((www\.)?myurl\.com|localhost:\d+)

Related

Regex (JS): Match URLs without a protocal, ignore URLS that with one

Apologies for yet another regex URL matching question, but I haven't been a able to find a solution in any of the other threads.
I want to run a replace() method on a string, with a pattern that matches all URLs without a protocal (http, https etc) but ignores urls that do have one.
So given this input:
www.google.com www.facebook.com
http://www.google.com
http://www.facebook.com
It would match www.google.com and www.facebook.com on the first line (without any surrounding whitespace), but ignore the other URLs on the second and third line.
I thought about just looking for www and ignoring matches which have // as preceding characters, which led me to this:
https://www.regex101.com/r/Y3rqxy/1
However, as you can see the second match includes the preceding whitespace. As I want to replace the www with http://www this whitespace buggers things up a little.
Any regex mandarins able to help me out on this one?
Mere seconds after posting this, one of my colleagues came up with a solution. It's a little wacky (thanks javascript) but it works! This example assumes you want to add http:// to any URLs that are missing their protocal.
First you have to reverse the string you're running the .replace() method on:
string.split('').reverse().join('')
Then you can run call the replace method with the following regex (note the backwards http://www!):
string.replace(/www(?!\/\/)/gi, 'www//:ptth')
Then you just reverse your string again:
string.split('').reverse().join('')
And any URLS that are missing a protocal in that string will now have them.
It's not going to win any awards for cleanliness, but it works!

Javascript Find a string between two strings, but keep each occurence of the match [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?
Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:
Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negative character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means “everything except a whitespace and this is exactly what you want.
Well, ".*" is a greedy selector. You make it non-greedy by using ".*?" When using the latter construct, the regex engine will, at every step it matches text into the "." attempt to match whatever make come after the ".*?". This means that if for instance nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, #"Project name: (?<name>.*?) J\d+").Groups["name"].Value;
I would also recommend you experiment with regular expressions using "Expresso" - it's a utility a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people unexprienced with regex might not be familiar with, in a way that it would be easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm
(Project name:\s+[A-Z]:(?:\\w+)+.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Adding (?:\\w+)+.[a-zA-Z]+ will be more restrictive instead of .*

Regular Expression problem

I need a regular expression for javascript that will get "jones.com/ca" from "Hello we are jones.com/ca in Tampa". The "jones.com/ca" could be any web url extension (example: .net, .co, .gov, etc), and any name. So the regular expression needs to find all instances of say ".com" and all the text to the last white space or beginning of line and to the last white space or end of line (minus any ending punctuation).
Right now I have as an example line: "jones.com/ca some text", using a javascript regular expression of: "\\(.+?^\\s).com?([^\\s]+)?\\", and all I get is ".com/ca" as the output.
This example will capture specific domains com,org and gov
\b\w+\.(?:com|org|gov)/[a-z]{2}\b
And this will capture almost any domain
\b\w+\.[a-z]{2,3}/[a-z]{2}\b
It uses word boundaries so that it does not capture white space.
Matching URLs is a bit of a dark art. The following site has a fairly well-designed regex for this purpose: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
A comprehensive regex for this is going to be much more complicated than you think. The list of top-level domains is fairly long (.gov, .info, .edu, .museum, etc.), and there are "special" domains like localhost as well. Also, many domains end in a two-letter country abbreviation (google.com.br for Google Brazil, for example, or del.icio.us).
The easiest thing would be to look for http(s):// or www at the beginning and just assume what comes after is a domain name. If you don't, you're going to either miss a lot, or get a lot of false positives.
You could try the following, but the last option (after the last |) is going to be open to a significant number of false positives:
/https?:\/\/\S+|www\.\S+|([-a-z0-9_]+\.)+(com|org|edu|gov|mil|info|[a-z]{2})(\/\S*)?|([-a-z0-9_]+\.)+[-a-z0-9_]+\/\S*/ig

Getting parts of a URL in JavaScript

I have to match URLs in a text, linkify them, and then display only the host--domain name or IP address--to the user. How can I proceed with JavaScript?
Thanks.
PS: please don't tell me about this; those regular expressions are so buggy they can't match http://google.com
If you don't want to use regular expressions, then you'll need to use things like indexOf and such instead. For instance, search for "://" in the text of every element and if you find it and the bit in front of it looks like a protocol (or "scheme"), grab it and the following characters that are valid URI characters (RFC2396). If the result ends in a dot or question mark, remove the dot or question (it probably ends a sentence). There's not really a lot more to say.
Update: Ah, I see from your edit that you don't have a problem with regular expressions, just the ones in the answers to that question. Fair enough.
This may well be one of those places where trying to do it all with a regular expression is more work that it should be, but using regular expressions as part of the solution is helpful. For instance,
/[a-zA-Z][a-zA-Z0-9+\-.]*:\/\//
...may well be a helpful way to find the beginning of a URL, since the scheme portion must start with an alpha and then can have zero or more alpha, digit, +, -, or . prior to the : (section 3.1).

Javascript/Regex for finding just the root domain name without sub domains

I had a search and found lot's of similar regex examples, but not quite what I need.
I want to be able to pass in the following urls and return the results:
www.google.com returns google.com
sub.domains.are.cool.google.com returns google.com
doesntmatterhowlongasubdomainis.idont.wantit.google.com
returns google.com
sub.domain.google.com/no/thanks returns google.com
Hope that makes sense :)
Thanks in advance!-James
You can't do this with a regular expression because you don't know how many blocks are in the suffix.
For example google.com has a suffix of com. To get from subdomain.google.com to google.com you'd have to take the last two blocks - one for the suffix and one for google.
If you apply this logic to subdomain.google.co.uk though you would end up with co.uk.
You will actually need to look up the suffix from a list like http://publicsuffix.org/
Don't use regex, use the .split() method and work from there.
var s = domain.split('.');
If your use case is fairly narrow you could then check the TLDs as needed, and then return the last 2 or 3 segments as appropriate:
return s.slice(-2).join('.');
It'll make your eyes bleed less than any regex solution.
I've not done a lot of testing on this, but if I understand what you're asking for, this should be a decent starting point...
([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-za-z]{2}))\b
EDIT:
To clarify, it's looking for:
one or more alpha-numeric characters or dashes, followed by a literal dot
and then one of three things...
three or more alpha characters (i.e. com/net/mil/coop, etc.)
two alpha characters, followed by a literal dot, followed by two more alphas (i.e. co.uk)
two alpha characters (i.e. us/uk/to, etc)
and at the end of that, a word boundary (\b) meaning the end of the string, a space, or a non-word character (in regex word characters are typically alpha-numerics, and underscore).
As I say, I didn't do much testing, but it seemed a reasonable jumping off point. You'd likely need to try it and tune it some, and even then, it's unlikely that you'll get 100% for all test cases. There are considerations like Unicode domain names and all sorts of technically-valid-but-you'll-likely-not-encounter-in-the-wild things that'll trip up a simple regex like this, but this'll probably get you 90%+ of the way there.
If you have limited subset of data, I suggest to keep the regex simple, e.g.
(([a-z\-]+)(?:\.com|\.fr|\.co.uk))
This will match:
www.google.com --> google.com
www.google.co.uk --> google.co.uk
www.foo-bar.com --> foo-bar.com
In my case, I know that all relevant URLs will be matched using this regex.
Collect a sample dataset and test it against your regex. While prototyping, you can do that using a tool such https://regex101.com/r/aG9uT0/1. In development, automate it using a test script.
([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-za-z]{2}))(?!\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-za-z]{2}))\b
This is an improvement upon theracoonbear's answer.
I did a quick bit of testing and noticed that if you give it a domain where the subdomain has a subdomain, it will fail. I also wanted to point out that the "90%" was definitely not generous. It will be a lot closer to 100% than you think. It works on all subdomains of the top 50 most visited websites which accounts for a huge chunk of worldwide internet activity. The only time it would fail is potentially with unicode domains, etc.
My solution starts off working the same way that theracoonbear's does. Instead of checking for a word boundary, it uses a negative lookahead to check if there is not something that could be a TLD at the end (just copied the TLD checking part over into a negative lookahead).
Without testing the validity of top level domain, I'm using an adaptation of stormsweeper's solution:
domain = 'sub.domains.are.cool.google.com'
s = domain.split('.')
tld = s.slice(-2..-1).join('.')
EDIT: Be careful of issues with three part TLDs like domain.co.uk.

Categories

Resources