Regex - Removing parts of URL path - javascript

I am useless at Regex and I want to remove parts of a URL that are not always consistent.
The URL might be:
www.test.com /en/ restOfPath
or
www.test.com /en/en_gb/ restOfPath
Then depending on the country values might change to:
www.test.com /es/ restOfPath
or
www.test.com /es/es_es/ restOfPath
I am therefore looking to always remove the parts in bold (the /en/ or /en/en_gb/ style segments), so that I can split the remainder of the path and create a logical naming that is language/location agnostic.
I am doing this as a workaround to build out a data layer until the client can implement it properly when they launch their new website. I have managed to build an if/else statement as a workaround, which is a bit clunky, but I would like a cleaner solution.

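This will probably help you:
(?:\/([a-z]{2})(?:\/([a-z]{2}_[A-Z]{2}))?)
This pattern finds the first / followed by two letters, optionally followed by another / and an aa_AA construction. Note that [A-Z]{2} only matches uppercase, so for lowercase locales such as en_gb either add the i flag or change it to [a-z]{2}.
I put code samples for you on regex101.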
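As a rough sketch of how a pattern like this could be applied, the snippet below strips a leading language/locale prefix from a path. The function name stripLocalePrefix is hypothetical, and the pattern is a variation on the one above (anchored to the start of the path, case-insensitive, with a lookahead so that ordinary segments such as /products are left alone), not the answerer's exact regex.

```javascript
// Hypothetical helper: removes a leading /xx or /xx/xx_yy locale prefix.
// Assumes the input is only the path portion of the URL.
function stripLocalePrefix(path) {
  // ^\/[a-z]{2}            -> leading two-letter language segment
  // (?:\/[a-z]{2}_[a-z]{2})? -> optional locale segment such as en_gb
  // (?=\/|$)               -> prefix must end at a slash or end of path
  return path.replace(/^\/[a-z]{2}(?:\/[a-z]{2}_[a-z]{2})?(?=\/|$)/i, '');
}

console.log(stripLocalePrefix('/en/restOfPath'));
console.log(stripLocalePrefix('/es/es_es/restOfPath'));
```

Both calls should leave only /restOfPath, which can then be split on / to build the naming.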

I believe this is what you're after:
\/.*(?=\/.*?)
https://regex101.com/r/OZIseI/4
It uses a positive lookahead to exclude the last / from the match.
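A quick way to see what this pattern does is to run it through replace. The helper name keepLastSegment below is made up for illustration. Because .* is greedy, the match runs from the first / up to (but not including) the last one, so this only gives the desired result when restOfPath is a single segment.

```javascript
// Hypothetical helper: deletes everything from the first "/" up to,
// but not including, the last "/" in the string.
function keepLastSegment(url) {
  return url.replace(/\/.*(?=\/.*?)/, '');
}

console.log(keepLastSegment('www.test.com/en/restOfPath'));
console.log(keepLastSegment('www.test.com/en/en_gb/restOfPath'));
// Caveat: with a multi-segment rest of path, the intermediate segments go too.
console.log(keepLastSegment('www.test.com/en/shop/shoes'));
```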

Related

JavaScript / Using system generated regex for validation

We want to use regex to validate a document structure. For this we simplify the document and the regex. The regex is generated from a schema which is used for the validation.
The application is completely client based and coded in JavaScript.
A simple example is this regex:
regex1 = new RegExp(/~(A{1}B?C?(D*|E*|F*|G*)+){1}~/g)
That means the document structure can have this structure
A
-B
-D
-D
-D
-D
-D
So the document structure is parsed to ~ABDDDDD~
Now I want to validate if I can add "A" to the end which would result in this string: ~ABDDDDDA~
This does not match with the reg ex anymore:
"~ABDDDDDA~".match(regex1)
This works quite well, but the document structure can grow and look like this: ~ABDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD~
A matching value is matched quite fast, but if the value is then:
~ABDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDA~
It takes so long that most of the time I just close the browser and reopen it.
Does anyone have ideas how to solve it?
Thanks!
UPDATE
The RegEx should also cover more; the structure can be quite dynamic. I have not used a RegEx generator. This example is produced by a self-developed script and is just an example.
In this case there is one root element A, then an optional B or C, and then any number of D, E, F, G in no particular order, but at least one!
So it should be valid for:
"~ABDDDDDFEG~"
"~AGGGGGEGGD~"
"~ABCDEFG~"
"~ABCDDDDDDDDDDDDDDDEFGGGGGG~"
Additionally, it is possible that the E is limited to 0-5 occurrences.
As soon as I work with the match either (A | B), there are real performance issues in all browsers (IE, Chrome, Firefox).
Any ideas? Are there any alternatives to "match either (A | B)" with better performance?
The resulting regex should be as close as possible to:
~AB?C?[DEFG]*A?~
There are a lot of simplifications to do in your regex generator to get rid of the following points:
{1}: is literally useless, you can remove it from everywhere
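(A*|B*)+: is strictly equivalent to [AB]*. The nested quantifiers are also what make the engine backtrack catastrophically when the input does not match.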
Here is a Regex101: https://regex101.com/r/Lc6Fx8/1
Also, if you want help fixing your Regex generator, you should post some info about it.
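As a minimal sketch of the simplified form, the function below validates the structure string with a single character class instead of nested alternation. The name isValidStructure is hypothetical. It assumes the rules from the question's update (exactly one A, optional B, optional C, then at least one of D/E/F/G in any order), so it uses [DEFG]+ rather than [DEFG]*, and it checks the 0-5 limit on E with a separate count instead of encoding it in the regex.

```javascript
// Hypothetical validator for the "~A...~" structure strings in the question.
function isValidStructure(doc) {
  // A flat character class avoids the nested quantifiers that cause
  // catastrophic backtracking in (D*|E*|F*|G*)+.
  if (!/^~AB?C?[DEFG]+~$/.test(doc)) {
    return false;
  }
  // Assumed rule from the question: E may occur at most 5 times.
  const eCount = (doc.match(/E/g) || []).length;
  return eCount <= 5;
}

console.log(isValidStructure('~ABDDDDDFEG~'));
console.log(isValidStructure('~AB' + 'D'.repeat(50) + 'A~'));
```

The second call is the kind of input that hung the browser with the generated regex; with the flat character class it fails after a single linear scan.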

match 2 urls with localhost

I'm having the hardest time with javascript regex, can't figure out how to match my url:
http://localhost:11111/#!/quote/18283
and
http://www.myurl.com/#!/quote/23834
with the same regex.
I just don't understand the regex rules that well.
http://[\w\d\.:]+/#!/quote/\d{5} - but obviously that is without any other context.
I don't know what your negative cases are. Which parts of the URLs are important, etc, etc.
One hint I can add: if you are only looking to match a specific domain.com along with localhost, you can use alternation (either/or) with the pipe | symbol, like this (this is just one portion of the regex):
((www\.)?myurl\.com|localhost:\d+)
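Putting the alternation into a full pattern might look like the sketch below. The function name isQuoteUrl is made up, and the pattern assumes that only http, these two hosts, and a purely numeric quote id of any length should be accepted; tighten or loosen it to match your real negative cases.

```javascript
// Hypothetical check for the two quote URLs from the question.
// Host is either (www.)myurl.com or localhost with any port number.
const quoteUrlPattern =
  /^http:\/\/(?:(?:www\.)?myurl\.com|localhost:\d+)\/#!\/quote\/\d+$/;

function isQuoteUrl(url) {
  return quoteUrlPattern.test(url);
}

console.log(isQuoteUrl('http://localhost:11111/#!/quote/18283'));
console.log(isQuoteUrl('http://www.myurl.com/#!/quote/23834'));
```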

Javascript, take action based on page URL

I was helped on here some time ago in writing a regular expression to compare the location of the current page to that regex, and taking action if it's something specific. An example of that code is:
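// Backslashes are doubled because the pattern is passed as a string literal;
// a single \. in a string collapses to a plain dot, which matches any character.
var re1 = new RegExp('^http://([^.]+)\\.domain\\.com/subpage(.*)$');
if(window.location.href.match(re1))
{
    // Do more
}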
In this case, I could write some code that would only execute on that subpage. It's worked beautifully so far. But I've run into a problem where I need further assistance with regular expressions.
Imagine a site like this: http://websitenamepreview.testserver.designcompanyname.com/subpage
How can I adapt this code to work on such a URL? The only thing that will probably change, so long as it's run from this test server, is the subpage.
Using your original code, you could change it to ^http://([^\.]+\.)+domain\.com/subpage(.*)$
Of course, for this particular test data, it wouldn't match, because you don't have "domain.com" as the end of the domain in your test URL.
If you can give us a little more information about what parts are important, what parts are likely to change in the data that you'd be matching, etc., we could probably make it even less complex.
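For the test-server URL in the question, one possible adaptation is sketched below. It assumes that designcompanyname.com is the fixed end of the host name and that any number of subdomain labels may precede it; the function name isOnSubpage is hypothetical. A regex literal is used so the dots can be escaped with a single backslash.

```javascript
// Hypothetical check: one or more subdomain labels, then the fixed
// designcompanyname.com suffix, then a path starting with /subpage.
const subpagePattern =
  /^http:\/\/(?:[^.\/]+\.)+designcompanyname\.com\/subpage(.*)$/;

function isOnSubpage(href) {
  return subpagePattern.test(href);
}

console.log(
  isOnSubpage('http://websitenamepreview.testserver.designcompanyname.com/subpage')
);
```

In the browser this would be called as isOnSubpage(window.location.href).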

Javascript/Regex for finding just the root domain name without sub domains

I had a search and found lots of similar regex examples, but not quite what I need.
I want to be able to pass in the following urls and return the results:
www.google.com returns google.com
sub.domains.are.cool.google.com returns google.com
doesntmatterhowlongasubdomainis.idont.wantit.google.com returns google.com
sub.domain.google.com/no/thanks returns google.com
Hope that makes sense :)
Thanks in advance! -James
You can't do this with a regular expression because you don't know how many blocks are in the suffix.
For example google.com has a suffix of com. To get from subdomain.google.com to google.com you'd have to take the last two blocks - one for the suffix and one for google.
If you apply this logic to subdomain.google.co.uk though you would end up with co.uk.
You will actually need to look up the suffix from a list like http://publicsuffix.org/
Don't use regex, use the .split() method and work from there.
var s = domain.split('.');
If your use case is fairly narrow you could then check the TLDs as needed, and then return the last 2 or 3 segments as appropriate:
return s.slice(-2).join('.');
It'll make your eyes bleed less than any regex solution.
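A minimal sketch of the split approach follows. The TWO_PART_SUFFIXES set is a tiny hand-written stand-in for a real suffix list such as the one at publicsuffix.org, and the function name rootDomain is hypothetical. It assumes the input has no scheme (as in the question's examples) and drops anything after the first slash.

```javascript
// Hypothetical, deliberately incomplete list of two-part suffixes.
// A real implementation would consult the Public Suffix List instead.
const TWO_PART_SUFFIXES = new Set(['co.uk', 'com.au', 'co.jp']);

function rootDomain(input) {
  // Drop any path, then split the host into its dot-separated labels.
  const host = input.split('/')[0].toLowerCase();
  const parts = host.split('.');
  // Keep three labels when the last two form a known two-part suffix.
  const lastTwo = parts.slice(-2).join('.');
  const keep = TWO_PART_SUFFIXES.has(lastTwo) ? 3 : 2;
  return parts.slice(-keep).join('.');
}

console.log(rootDomain('sub.domains.are.cool.google.com'));
console.log(rootDomain('sub.domain.google.com/no/thanks'));
console.log(rootDomain('www.google.co.uk'));
```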
I've not done a lot of testing on this, but if I understand what you're asking for, this should be a decent starting point...
([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b
EDIT:
To clarify, it's looking for:
one or more alpha-numeric characters or dashes, followed by a literal dot
and then one of three things...
three or more alpha characters (e.g. com/net/mil/coop, etc.)
two alpha characters, followed by a literal dot, followed by two more alphas (e.g. co.uk)
two alpha characters (e.g. us/uk/to, etc.)
and at the end of that, a word boundary (\b) meaning the end of the string, a space, or a non-word character (in regex word characters are typically alpha-numerics, and underscore).
As I say, I didn't do much testing, but it seemed a reasonable jumping off point. You'd likely need to try it and tune it some, and even then, it's unlikely that you'll get 100% for all test cases. There are considerations like Unicode domain names and all sorts of technically-valid-but-you'll-likely-not-encounter-in-the-wild things that'll trip up a simple regex like this, but this'll probably get you 90%+ of the way there.
If you have a limited subset of data, I suggest keeping the regex simple, e.g.
(([a-z\-]+)(?:\.com|\.fr|\.co\.uk))
This will match:
www.google.com --> google.com
www.google.co.uk --> google.co.uk
www.foo-bar.com --> foo-bar.com
In my case, I know that all relevant URLs will be matched using this regex.
Collect a sample dataset and test it against your regex. While prototyping, you can do that using a tool such as https://regex101.com/r/aG9uT0/1. In development, automate it using a test script.
([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))(?!\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b
This is an improvement upon theracoonbear's answer.
I did a quick bit of testing and noticed that if you give the original regex a domain where the subdomain has a subdomain, it will fail. I also wanted to point out that the "90%" was definitely not generous. It will be a lot closer to 100% than you think. It works on all subdomains of the top 50 most visited websites, which accounts for a huge chunk of worldwide internet activity. The only time it would fail is potentially with Unicode domains, etc.
My solution starts off working the same way that theracoonbear's does. Instead of checking for a word boundary, it uses a negative lookahead to check if there is not something that could be a TLD at the end (just copied the TLD checking part over into a negative lookahead).
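To try the lookahead version from JavaScript, something along these lines should work. This is only a sketch: the names tldPattern, rootDomainPattern and findRootDomain are invented here, the TLD sub-pattern is built once as a string to avoid repeating it, and non-capturing groups are used in place of the capturing ones.

```javascript
// The "looks like a TLD" part, written once. Backslashes are doubled
// because this is a string that is later passed to the RegExp constructor.
const tldPattern = '(?:[A-Za-z]{3,}|[A-Za-z]{2}\\.[A-Za-z]{2}|[A-Za-z]{2})';

// label + TLD, not followed by another dot + TLD-like block.
const rootDomainPattern = new RegExp(
  '[A-Za-z0-9-]+\\.' + tldPattern + '(?!\\.' + tldPattern + ')\\b'
);

// Hypothetical helper: returns the matched root domain, or null.
function findRootDomain(url) {
  const match = url.match(rootDomainPattern);
  return match ? match[0] : null;
}

console.log(findRootDomain('sub.domains.are.cool.google.com'));
console.log(findRootDomain('sub.domain.google.com/no/thanks'));
```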
Without testing the validity of the top-level domain, I'm using an adaptation of stormsweeper's solution:
var domain = 'sub.domains.are.cool.google.com';
var s = domain.split('.');
// slice(-2) keeps the last two segments, e.g. "google.com"
var tld = s.slice(-2).join('.');
EDIT: Be careful of issues with multi-part TLDs like domain.co.uk.

Parsing Custom JavaScript Annotations

Implementing a large JavaScript application with a lot of scripts, it's become necessary to put together a build script. JavaScript labels being ubiquitous, I've decided to use them as annotations for a custom script collator. So far, I'm just employing the use statement, like this:
use: com.example.Class;
However, I want to support an 'optional quotes' syntax, so the following would be parsed correctly as well
use: 'com.example.Class';
I'm currently using this pattern to parse the first form:
/\s*use:\s*(\S+);\s*/g
The '\S+' gloms all characters between the annotation name declaration and the terminating semicolon. What rule can I write to substitute for \S+ that will return an annotation value without quotes, whether or not it was quoted to begin with? I can do it in two steps, but I want to do it in one.
Thanks - I know I've put this a little awkwardly.
Edit 1.
I've been able to use this, but IMHO it's a mess. Any more elegant solutions? (By the way, this one will parse ALL label names.)
/\s*([a-z]+):\s*(?:['])([a-zA-Z0-9_.]+)(?:['])|([a-zA-Z0-9_.]+);/g
Edit 2.
The logic is the same, but expressed a little more succinctly. However, it poses a problem, as it seems to pull in all sorts of JavaScript code as well.
/\s*([a-z]+):\s*'([\w_\.]+)'|([\w_\.]+);/g
OK, this seemed to do it. Hope someone can improve on it.
/\s*([a-z]+): *('[\w_\/\.]+'|[\w_\/\.]+);/g
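One possible way to get the unquoted value in a single step is to capture the optional opening quote in its own group and require the same character again with a backreference. The sketch below is an assumption-laden variation, not the pattern above: it only handles the use label, allows word characters, dots and slashes in the value, and the name parseUseAnnotations is hypothetical.

```javascript
// Hypothetical parser: returns the values of all `use:` annotations,
// without surrounding quotes, whether or not they were quoted.
function parseUseAnnotations(source) {
  // Group 1: optional opening quote. Group 2: the value itself.
  // \1 requires the same quote (or nothing) before the semicolon.
  const pattern = /\buse:\s*(['"]?)([\w.\/]+)\1\s*;/g;
  return Array.from(source.matchAll(pattern), (match) => match[2]);
}

const sample = "use: com.example.Class;\nuse: 'com.example.Other';";
console.log(parseUseAnnotations(sample));
```

Because the value is always in group 2, no second pass is needed to strip quotes, and a mismatched quote (for example an opening quote with no closing one) simply does not match.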
