Get base url from string with Regex and Javascript - javascript

I'm trying to get the base URL from a string (so no window.location).
It needs to remove the trailing slash.
It needs to use regex (no new URL).
It needs to work with query parameters and anchor links.
In other words, all of the following should return https://apple.com, or http://www.apple.com for the last one.
https://apple.com?query=true&slash=false
https://apple.com#anchor=true&slash=false
http://www.apple.com/#anchor=true&slash=true&whatever=foo
These are just examples, urls can have different subdomains like https://shop.apple.co.uk/?query=foo should return https://shop.apple.co.uk - It could be any url like: https://foo.bar
The closest I got is with:
const baseUrl = url.replace(/^((\w+:)?\/\/[^\/]+\/?).*$/,'$1').replace(/\/$/, ""); // Base Path & Trailing slash
But this doesn't work with anchor links and queries that start right after the host, without a / before them.
Any idea how I can get it to work in all cases?

You could add # and ? to your negated character class. You don't need .* because that will match until the end of the string.
For your example data, you could match:
^https?:\/\/[^#?\/]+
const strings = [
  "https://apple.com?query=true&slash=false",
  "https://apple.com#anchor=true&slash=false",
  "http://www.apple.com/#anchor=true&slash=true&whatever=foo",
  "https://foo.bar/?q=true"
];
strings.forEach(s => {
  // Match the scheme and host, stopping at the first #, ? or /
  console.log(s.match(/^https?:\/\/[^#?\/]+/)[0]);
});
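Note that .match() returns null when the string does not start with http:// or https://, so indexing [0] would throw. A minimal sketch of a null-safe wrapper around the same pattern follows; the helper name getBaseUrl is hypothetical, not part of the answer above.

```javascript
// Hypothetical helper: returns scheme + host, or null when the input
// does not start with http:// or https://.
function getBaseUrl(url) {
  const match = /^https?:\/\/[^#?\/]+/.exec(url);
  return match ? match[0] : null;
}

console.log(getBaseUrl("https://shop.apple.co.uk/?query=foo")); // https://shop.apple.co.uk
console.log(getBaseUrl("not a url")); // null
```

Returning null instead of throwing lets the caller decide how to treat input that is not an absolute http(s) URL.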

You could use Web API's built-in URL for this. URL will also provide you with other parsed properties that are easy to get to, like the query string params, the protocol, etc.
Regex is a painful way to do something that the browser makes otherwise very simple.
I know that you asked about using regex, but in the event that you (or someone coming here in the future) really just cares about getting the information out and isn't committed to using regex, maybe this answer will help.
let one = "https://apple.com?query=true&slash=false"
let two = "https://apple.com#anchor=true&slash=false"
let three = "http://www.apple.com/#anchor=true&slash=true&whatever=foo"
let urlOne = new URL(one)
console.log(urlOne.origin)
let urlTwo = new URL(two)
console.log(urlTwo.origin)
let urlThree = new URL(three)
console.log(urlThree.origin)
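One caveat with this approach is that new URL() throws a TypeError when the string is not a valid absolute URL. A small sketch of a wrapper that handles this, assuming a hypothetical helper name getOrigin:

```javascript
// Hypothetical helper: returns the origin (scheme + host + non-default port),
// or null if the string cannot be parsed as an absolute URL.
function getOrigin(input) {
  try {
    return new URL(input).origin;
  } catch (err) {
    return null; // new URL() throws on unparseable input
  }
}

console.log(getOrigin("http://www.apple.com/#anchor=true&slash=true&whatever=foo")); // http://www.apple.com
console.log(getOrigin("not a url")); // null
```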

const baseUrl = url.replace(/^(.*?:\/\/[^?\/#]*).*$/, '$1'); // lazy scheme match, then stop the host at the first ?, / or #

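This will get you everything up to the .com part, so it only works for .com domains. You will have to append .com once you pull out the first part of the URL.
^http.*?(?=\.com)
Or maybe you could do:
myUrl.replace(/\/?[#?].*$/, "")
This removes everything from the first ? or # onward, plus a slash directly before it. It does not strip a path, or a trailing slash when there is no query or fragment.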

Related

JS RegEx to remove part of a URL?

I am using the GoogleBooks API to search for particular titles by name and retrieve a cover image URL. For example, searching for "The Great Gatsby" will return the following image link:
http://books.google.com/books/content?id=HestSXO362YC&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api
If you look at the following image, you can see that there is a small fold on the bottom right corner. Some image URLs will have the fold and others won't. If you remove edge=curl from the URL link, the fold is removed.
Is there any way to use a regex to find and delete the curled portion?
Further, is there any way to use regex to change the img=1 value to img=2?
You can use the .replace() method:
let imageUrl = "some random URL you have"
console.log(imageUrl.replace('&edge=curl',''))
With a string pattern, this replaces only the first "&edge=curl" it finds with '' (an empty string), which effectively removes it.
You can use the same .replace() method to replace any static URL parameter, such as "img=1":
console.log(imageUrl.replace('img=1','img=2'))
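When .replace() is given a string pattern, only the first occurrence is changed. If the parameter could appear more than once, a regex with the g flag handles every occurrence. The sketch below uses an assumed example link (not the one from the question) with the parameter repeated to show the difference.

```javascript
// Assumed example link, with &edge=curl repeated on purpose.
const link = "http://example.com/cover?img=1&edge=curl&zoom=1&edge=curl";

// String pattern: only the first occurrence is removed.
const firstOnly = link.replace("&edge=curl", "");

// Regex with the g flag: every occurrence is removed.
const allRemoved = link.replace(/&edge=curl/g, "");

console.log(firstOnly);  // http://example.com/cover?img=1&zoom=1&edge=curl
console.log(allRemoved); // http://example.com/cover?img=1&zoom=1
```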
Don't use regex to parse URLs. Use URL object:
var u = new URL("http://books.google.com/books/content?id=HestSXO362YC&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api");
u.searchParams.delete("edge");
u.searchParams.set("img", "2");
console.log(u.href);
To obtain an updated url where the &edge=curl pattern is replaced and the &img= and &zoom= parameters are updated, you could achieve this by chaining multiple .replace() calls as shown below:
const url = "http://books.google.com/books/content?id=HestSXO362YC&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api"
// New values for img and zoom parameters
const img = 300;
const zoom = 22;
console.log(
url
.replace(/&img=(\w+)/,`&img=${img}`)
.replace(/&zoom=\w+/,`&zoom=${zoom}`)
.replace(/&edge=curl/,"")
)
Here the &img= and &zoom= parameters are updated with the regular expressions &img=\w+ and &zoom=\w+, where \w+ matches one or more word characters (letters, digits, or underscores) that appear after the parameter name.
The advantage with this approach (over explicitly specifying img=1 and replacing it with img=2 ) is that you can update those parameter/value substrings of the input url without having to know the actual value of those parameters prior to replacement (ie that img has a value 1).
Note that this approach assumes the parameters being updated are prefixed with & (and not ?).
Hope that helps!
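The chained .replace() calls assume each parameter is preceded by &. A possible sketch that also covers a parameter directly after ? is shown here; the helper name setParam is hypothetical, and it assumes the parameter name contains no regex metacharacters.

```javascript
// Hypothetical helper: replaces the value of `name` whether it follows ? or &.
// Assumes `name` contains no regex metacharacters.
function setParam(url, name, value) {
  const pattern = new RegExp("([?&])" + name + "=[^&#]*");
  // A replacer function avoids special handling of "$" in the new value.
  return url.replace(pattern, (whole, separator) => separator + name + "=" + value);
}

const cover = "http://books.google.com/books/content?id=HestSXO362YC&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api";
console.log(setParam(cover, "img", 2));
console.log(setParam("http://books.google.com/books/content?img=1&zoom=1", "img", 2));
```

Capturing the separator and putting it back keeps the rest of the query string untouched, and requiring ? or & before the name prevents a match inside a longer parameter name.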
Try this:
let url = 'http://books.google.com/books/content?id=HestSXO362YC&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api';
// remove &edge=curl
url = url.replace('&edge=curl', '');
// replace img=1 with img=2
url = url.replace('img=1', 'img=2');

JavaScript RegEx to match url path

I have possible url paths as below
/articles
/payment
/about
/articles?page=1
/articles/hello-world
I would like to match only the main path of the url, expected matches: ['articles', 'payment', 'about', 'articles', 'articles']
So I tried to construct a JavaScript regex, and the closest I got was [a-z].*(?=\/|\?). Unfortunately, it only matches inside the last two strings.
Please guide me.
Thanks, everyone.
https://regex101.com/r/A86hYz/1
/^\/([^?\/]+)/
This regex captures everything between the first / and either the second / or the first ? if they exist. This seems like the pattern you want. If it isn't, let me know and I'll adjust it as needed. Some simple adjustments would be capturing what's between every / as well as capturing the query parameters.
For future reference, when writing regex, try to avoid the lookahead/behind unless you have to as they usually introduce bugs. It's easiest if you stick to using the regular operators.
To access the match, use the regex like this:
var someString = '/articles?page=1';
var extracted = someString.match(/^\/([^?\/]+)/)[1]
or more generally
function getMainPath(str) {
const regex = /^\/([^?\/]+)/;
return str.match(regex)[1];
}
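The function above throws if the string has no leading path segment, because .match() returns null. A null-safe sketch, run against the paths from the question, might look like this (getMainPathSafe is a hypothetical name):

```javascript
// Null-safe variant: returns null when the string has no leading "/segment".
function getMainPathSafe(str) {
  const match = /^\/([^?\/]+)/.exec(str);
  return match ? match[1] : null;
}

const paths = ["/articles", "/payment", "/about", "/articles?page=1", "/articles/hello-world"];
console.log(paths.map(getMainPathSafe));
// [ 'articles', 'payment', 'about', 'articles', 'articles' ]
console.log(getMainPathSafe("/")); // null
```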

Extract characters in URL after certain character up to certain character

I'm trying to extract certain piece of a URL using regex (JavaScript) and having trouble excluding characters after a certain piece. Here's what I have so far:
URL: http://www.somesite.com/state-de
Using url.match(/\/[^\/]+$/)[0] I can extract the state-de like I want.
However, when the URL becomes http://www.somesite.com/state-de?page=r and I use the same regex, it pulls in everything including the "?page=r", which I don't want. I want to extract only the state-de regardless of what's after it (usually a "?" seems to follow it).
This might work:
var arr = url.split("/")
arr[arr.length - 1].split("?")[0]
I'd recommend reading up on regular expressions in general. What you want to do here is make the regular expression stop when it hits the ? in the URL.
Using capturing groups to select which part of the match that you want might also be useful here.
Example:
url.match(/(\/[^\/?]+)(?:\?.*)?$/)[1]
I avoid overly complex RegExs when possible, so I tend to do this in multiple steps (with .replace()):
var stripped = url.replace(/[?#].*/, ''); // Strips anything after ? or #
You can now do the simpler transform to get the state, e.g.:
var state = stripped.split('/').pop()
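The two steps can be combined into one small function. This is only a sketch; the name lastPathSegment is hypothetical.

```javascript
// Hypothetical helper combining both steps.
function lastPathSegment(url) {
  const stripped = url.replace(/[?#].*/, ""); // drop the query string and fragment
  return stripped.split("/").pop();            // take whatever follows the last slash
}

console.log(lastPathSegment("http://www.somesite.com/state-de"));        // state-de
console.log(lastPathSegment("http://www.somesite.com/state-de?page=r")); // state-de
console.log(lastPathSegment("http://www.somesite.com/state-de/"));       // (empty string)
```

Note that a trailing slash yields an empty string, so strip it first if such URLs can occur.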
If you want to do it with a regex, try this one:
url.match(/https?:\/\/(?:[a-z0-9-]+\.)+[a-z]+\/([a-z0-9_-]+)\/?(?:\?.*)?/)[1]
Or you could do it using jQuery:
var url = 'http://www.somesite.com/state-de?page=r#mark4';
// Create a special anchor element, set the URL to it
var a = $('<a>', { href:url } )[0]; // [0] is the underlying DOM element
console.log(a.hostname);
console.log(a.pathname);
console.log(a.search);
console.log(a.hash);

How to pull a unknown URL out of a String

I'm writing a Node/Express app and I have a text string in a JSON object that I need to pull a URL out of. The URL is different every time, and the string itself has two very similar URL's, and I only want to pull out one.
The only thing I do know is that in the string, the url will always be preceded with the same text.
String:
The following new or updated things match your search criteria.
Link I Need
<http://randomurl.com/Junk/Yay/ThisView.aspx?r=164241242186&s=J
WD&t=JWD>
Link I don't Need
<http://randomurl.com/Junk/Yay/ThisView.aspx?r=164241242186&s=J
WD&t=JWD&m=true>
Search was last updated on April 12th, 2013 # 14:43
If you wish to unsubscribe from this update...
Out of this string all I need to pull out is the URL under Link I Need, http://randomurl.com/Junk/Yay/ThisView.aspx?r=164241242186&s=J
WD&t=JWD and nothing else. I'm not quite sure how to go about this, any help would be greatly appreciated!
Something like this should work:
var s = "The following new or updated ...";
var regex = /Link I Need\s*<([^>]*)>/;
var match = s.match(regex);
var theUrl = match && match[1];
The character class [^>]* also matches newlines, so this works even when the URL is split across lines. In that case, strip the embedded whitespace after you find the match:
theUrl = theUrl && theUrl.replace(/\s+/g, '')
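Putting the match and the whitespace cleanup together, a sketch against a shortened version of the message from the question could look like this. The helper name extractNeededUrl is hypothetical, and the message is rebuilt here line by line to reproduce the wrapped URL.

```javascript
// Shortened copy of the message from the question, with the URLs wrapped across lines.
const message = [
  "Link I Need",
  "<http://randomurl.com/Junk/Yay/ThisView.aspx?r=164241242186&s=J",
  "WD&t=JWD>",
  "Link I don't Need",
  "<http://randomurl.com/Junk/Yay/ThisView.aspx?r=164241242186&s=J",
  "WD&t=JWD&m=true>",
].join("\n");

// Hypothetical helper: returns the URL after "Link I Need", or null if absent.
function extractNeededUrl(text) {
  const match = /Link I Need\s*<([^>]*)>/.exec(text);
  // The g flag removes every whitespace run, including the embedded newline.
  return match ? match[1].replace(/\s+/g, "") : null;
}

console.log(extractNeededUrl(message));
// http://randomurl.com/Junk/Yay/ThisView.aspx?r=164241242186&s=JWD&t=JWD
```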

How to use href.replace in extjs

How do I use href.replace in ExtJS?
This is my sample:
'iconCls': 'icon_' + href.replace(/[^.]+\./, '')
href = http://localhost:1649/SFM/Default.aspx#/SFM/config/release_history.png
I want to get the text "release_history.png". How can I get it?
Thanks
If you just want the filename, it's probably easier to do:
var href = "http://localhost:1649/SFM/Default.aspx#/SFM/config/release_history.png";
var iconCls = 'icon_' + href.split('/').pop();
Update
To get the filename without the extension, you can do something similar:
var filename = "release_history.png";
var without_ext = filename.split('.');
// Get rid of the extension
without_ext.pop()
// Join the filename back together, in case
// there were any other periods in the filename
// and to get a string
without_ext = without_ext.join('.')
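An alternative sketch that goes from the full href to the name without its extension in one function, using lastIndexOf instead of split and join (the helper name fileNameWithoutExtension is hypothetical):

```javascript
// Hypothetical helper: takes the last path segment and drops its final extension.
function fileNameWithoutExtension(href) {
  const fileName = href.split("/").pop();
  const dot = fileName.lastIndexOf(".");
  // dot > 0 leaves names without a dot, or starting with a dot, unchanged.
  return dot > 0 ? fileName.slice(0, dot) : fileName;
}

const sampleHref = "http://localhost:1649/SFM/Default.aspx#/SFM/config/release_history.png";
console.log("icon_" + fileNameWithoutExtension(sampleHref)); // icon_release_history
```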
Here are some regex solutions (the regex literals include the / delimiters).
As in your example code, match the start of the URL so it can be dropped:
href.replace(/^.*\//, '')
Or use a regex to match the last part of the URL that you want to keep:
/(?<=\/)[^.\/]+\.[^.]+$/
Update
Or get the icon name without .png (this uses the lookbehind and lookahead features of regex):
(?<=\/)[^.\/]+(?=\.png)
Not all regex flavors support every lookaround feature. JavaScript engines that predate ES2018 support only lookahead, so on those a safer pattern is:
[^.\/]+(?=\.png)
code examples here:
http://www.myregextester.com/?r=6acb5d23
http://www.myregextester.com/?r=b0a88a0a
