How to pull an unknown URL out of a String - javascript

I'm writing a Node/Express app and I have a text string in a JSON object that I need to pull a URL out of. The URL is different every time, and the string itself contains two very similar URLs, of which I only want one.
The only thing I do know is that, in the string, the URL will always be preceded by the same text.
String:
The following new or updated things match your search criteria.
Link I Need
<http://randomurl.com/Junk/Yay/ThisView.aspx?r=164241242186&s=J
WD&t=JWD>
Link I don't Need
<http://randomurl.com/Junk/Yay/ThisView.aspx?r=164241242186&s=J
WD&t=JWD&m=true>
Search was last updated on April 12th, 2013 # 14:43
If you wish to unsubscribe from this update...
Out of this string all I need to pull out is the URL under Link I Need, http://randomurl.com/Junk/Yay/ThisView.aspx?r=164241242186&s=J
WD&t=JWD and nothing else. I'm not quite sure how to go about this, any help would be greatly appreciated!

Something like this should work:
var s = "The following new or updated ...";
var regex = /Link I Need\s*<([^>]*)>/;
var match = s.match(regex);
var theUrl = match && match[1];
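This assumes that the URL is not split across newlines. If it is, then after you find the match, you need to strip all the whitespace out of it (note the g flag, without which only the first run is removed):
theUrl = theUrl && theUrl.replace(/\s+/g, '')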
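Putting the two steps together, here is a minimal sketch that runs against a sample modeled on the string in the question. The helper name extractNeededUrl and the shortened sample text are illustrative assumptions, not part of the original answer.

```javascript
// Sample text modeled on the question; the line break inside each URL is deliberate.
const message = [
  "The following new or updated things match your search criteria.",
  "Link I Need",
  "<http://randomurl.com/Junk/Yay/ThisView.aspx?r=164241242186&s=J",
  "WD&t=JWD>",
  "Link I don't Need",
  "<http://randomurl.com/Junk/Yay/ThisView.aspx?r=164241242186&s=J",
  "WD&t=JWD&m=true>",
].join("\n");

// extractNeededUrl is a hypothetical helper name.
function extractNeededUrl(text) {
  // [^>]* also matches newlines, so a wrapped URL is still captured.
  const match = text.match(/Link I Need\s*<([^>]*)>/);
  // Remove every run of whitespace (g flag) to rejoin a wrapped URL.
  return match ? match[1].replace(/\s+/g, "") : null;
}

console.log(extractNeededUrl(message));
// http://randomurl.com/Junk/Yay/ThisView.aspx?r=164241242186&s=JWD&t=JWD
```

Returning null when the marker text is missing keeps the caller from calling replace on an undefined capture.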

Related

How to validate/block shorten URL in string

I need to detect and block shortened URLs in a string. The string below contains a shortened URL; how can I detect or block it?
Hi #first_name# This is Mondi from Novato Cleaners. May I ask for a favor ? Our google https://bit.ly requires reviews. Could you provide one ?Thank you
To do this, follow these steps:
1- Extract all URLs from the string.
2- Request each URL and read its final (redirected) location. This is explained well here:
How to get domain name from shortened URL with Javascript?
3- Once you have originalUrl, check whether url != originalUrl; if so, it is a shortened URL.
Use a regex to find out whether the string contains a URL. If you simply want to strip URLs out, replace each match with an empty string (or with whatever you want in its place):
/(https?:\/\/[^\s]+)/g
var string = "Hi Vignesh This is Mondi from Novato Cleaners. May I ask for a favor ? Our google https://bit.ly requires reviews. Could you provide one ?Thank you";
var protomatch = /(https?:\/\/[^\s]+)/g;
var b = string.replace(protomatch, '');
console.log(b)
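As a rough sketch of the first step listed above (collecting every URL), combined with an offline check that is not from the answers: instead of following redirects, it compares each hostname against a small hand-maintained list. The list SHORTENER_HOSTS and the helper names findUrls and containsShortUrl are assumptions made for illustration, and such a list will never be complete.

```javascript
// Hypothetical, hand-maintained list of shortener hosts; extend as needed.
const SHORTENER_HOSTS = new Set(["bit.ly", "t.co", "tinyurl.com", "goo.gl"]);

// Step 1: collect every http(s) URL in the text.
function findUrls(text) {
  return text.match(/https?:\/\/[^\s]+/g) || [];
}

// Offline stand-in for steps 2 and 3: flag URLs whose host is a known shortener.
function containsShortUrl(text) {
  return findUrls(text).some((raw) => {
    try {
      return SHORTENER_HOSTS.has(new URL(raw).hostname.toLowerCase());
    } catch (err) {
      return false; // not parseable as a URL, so ignore it
    }
  });
}

console.log(containsShortUrl("Our google https://bit.ly requires reviews.")); // true
console.log(containsShortUrl("See https://example.com/reviews for details.")); // false
```

Following redirects, as the steps describe, catches shorteners that are not on the list, at the cost of one network request per URL.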

Get base url from string with Regex and Javascript

I'm trying to get the base url from a string (So no window.location).
It needs to remove the trailing slash
It needs to be regex (No New URL)
It needs to work with query parameters and anchor links
In other words all the following should return https://apple.com or https://www.apple.com for the last one.
https://apple.com?query=true&slash=false
https://apple.com#anchor=true&slash=false
http://www.apple.com/#anchor=true&slash=true&whatever=foo
These are just examples, urls can have different subdomains like https://shop.apple.co.uk/?query=foo should return https://shop.apple.co.uk - It could be any url like: https://foo.bar
The closest I got is:
const baseUrl = url.replace(/^((\w+:)?\/\/[^\/]+\/?).*$/,'$1').replace(/\/$/, ""); // Base Path & Trailing slash
But this doesn't work with anchor links and queries that start right after the host, without a / before them.
Any idea how I can get it to work on all cases?
You could add # and ? to your negated character class. You don't need .* because that will match until the end of the string.
For your example data, you could match:
^https?:\/\/[^#?\/]+
const strings = [
    "https://apple.com?query=true&slash=false",
    "https://apple.com#anchor=true&slash=false",
    "http://www.apple.com/#anchor=true&slash=true&whatever=foo",
    "https://foo.bar/?q=true"
];
strings.forEach(s => {
    console.log(s.match(/^https?:\/\/[^#?\/]+/)[0]);
});
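If the input might not be a URL at all, match() returns null and indexing it with [0] throws. A small guarded wrapper, sketched here under the assumption that null is an acceptable "no match" value (getBaseUrl is a hypothetical name):

```javascript
// getBaseUrl is a hypothetical helper name; it returns null when nothing matches.
function getBaseUrl(url) {
  const match = /^https?:\/\/[^#?\/]+/.exec(url);
  return match ? match[0] : null;
}

console.log(getBaseUrl("https://shop.apple.co.uk/?query=foo")); // https://shop.apple.co.uk
console.log(getBaseUrl("not a url")); // null
```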
You could use Web API's built-in URL for this. URL will also provide you with other parsed properties that are easy to get to, like the query string params, the protocol, etc.
Regex is a painful way to do something that the browser makes otherwise very simple.
I know that you asked about using regex, but in the event that you (or someone coming here in the future) really just cares about getting the information out and isn't committed to using regex, maybe this answer will help.
let one = "https://apple.com?query=true&slash=false"
let two = "https://apple.com#anchor=true&slash=false"
let three = "http://www.apple.com/#anchor=true&slash=true&whatever=foo"
let urlOne = new URL(one)
console.log(urlOne.origin)
let urlTwo = new URL(two)
console.log(urlTwo.origin)
let urlThree = new URL(three)
console.log(urlThree.origin)
const baseUrl = url.replace(/^(.*?:\/\/[^?\/#]*).*$/, '$1');
This will get you everything up to the .com part. You will have to append .com once you pull out the first part of the url.
^http.*?(?=\.com)
Or maybe you could do:
myUrl.replace(/\/?[#?].*$/, "")
to remove the query string and fragment, along with the slash just before them.

Regex expression to match the First url after a space followed

I want to match the first URL typed into an input box, but only once it is followed by a space.
For example:
if I type www.google.com, it should be matched only after I type a space after the URL,
i.e. www.google.com<SPACE>
Code
$(".site").keyup(function () {
    var site = $(this).val();
    var exp = /^http(s?):\/\/(\w+:{0,1}\w*)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%#!\-\/]))?/;
    var find = site.match(exp);
    var url = find ? find[0] : null;
    if (url === null) {
        exp = /[-\w]+(\.[a-z]{2,})+(\S+)?(\/|\/[\w#!:.?+=&%#!\-\/])?/g;
        find = site.match(exp);
        url = find ? 'http://' + find[0] : null;
    }
});
Please help, Thanks in advance
You should use a better regex to correctly match the query and fragment parts of your URL. Have a look here (What is the best regular expression to check if a string is a valid URL?) for a regex that follows the IRI/URI structure.
But here's a rudimentary version:
var regex = /[-\w]+(\.[a-z]{2,})+(\/?)([^\s]+)/g;
var text = 'test google.com/?q=foo basdasd www.url.com/test?q=asdasd#cheese something else';
console.log(text.match(regex));
Expected Result:
["google.com/?q=foo", "www.url.com/test?q=asdasd#cheese"]
If you really want to check for URLs, make sure you include scheme, port, username & password checks just to be safe.
In the context of what you're trying to achieve, you should really put in some delay so that you don't impact browser performance. Regex tests can be expensive when you use complex rules especially so when running the same rule every time a new character is entered. Just think about what you're trying to achieve and whether or not there's a better solution to get there.
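The delay mentioned above is commonly implemented as a debounce: the handler runs only after the user has stopped typing for a short while. A minimal sketch, where debounce and delayMs are illustrative names rather than anything from the original answer:

```javascript
// debounce is a hypothetical helper: it postpones fn until delayMs
// milliseconds have passed without another call.
function debounce(fn, delayMs) {
  let timer = null;
  return function (...args) {
    clearTimeout(timer); // cancel the previously scheduled run, if any
    timer = setTimeout(() => fn.apply(this, args), delayMs);
  };
}

// Example: three rapid calls result in a single run after the delay.
const logOnce = debounce((value) => console.log("checking", value), 50);
logOnce("w");
logOnce("ww");
logOnce("www.google.com ");
```

With jQuery, the wrapped function can be passed where the plain handler was, for example $(".site").keyup(debounce(handler, 300)), assuming handler is the existing keyup callback.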
With a lookahead:
var exp = /[-\w]+(\.[a-z]{2,})+(\S+)?(\/|\/[\w#!:.?+=&%#!\-\/])?(?= )/g;
I only added this "(?= )" to your regex.
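To see the effect of the lookahead, here is a small sketch that applies the pattern to an input with and without a trailing space. It drops the g flag so that only the first URL is returned; firstUrlFollowedBySpace is a hypothetical helper name.

```javascript
// Same pattern as above, minus the g flag, so match() returns the first hit only.
const urlBeforeSpace = /[-\w]+(\.[a-z]{2,})+(\S+)?(\/|\/[\w#!:.?+=&%#!\-\/])?(?= )/;

// firstUrlFollowedBySpace is a hypothetical helper name.
function firstUrlFollowedBySpace(input) {
  const found = input.match(urlBeforeSpace);
  return found ? found[0] : null;
}

console.log(firstUrlFollowedBySpace("www.google.com"));  // null (no space typed yet)
console.log(firstUrlFollowedBySpace("www.google.com ")); // www.google.com
```

Because the lookahead does not consume the space, the matched text never includes it.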

RegExp - If first part of search string is found then replace with the full search string value

Is there a RegExp to find and replace a value based on the criteria, "if first part of search string is in the target string then replace the part that matches with the search string."
This is a special search and replace because the replacement is also used as the search string.
For example, I have this URL:
http://www.domain.com/path/something/more/something/
Search for any part of the following and replace with the whole:
/path/user/
Since, "/path/" is in both the replacement string and the target string the results would be:
http://www.domain.com/path/user/something/more/something/
NOTE: The search / replacement value can be anything.
I don't know what the replacement and search string is at the time I make a replacement so I can't use something that hard codes the search string. For example, this won't work because the term is hard coded:
s.replace(/(\/path\/)/, "$1value/");
Another example:
Here is the sentence, "Thank you Susan for your order."
Here is the search and replacement, "Susan Summers"
Here is the desired sentence, "Thank you Susan Summers for your order."
Use Case:
Let's say you are given 1 million text documents that are letters to customers, but when the documents were created only each customer's first name was used where the full name was required. Now it's your job to find and replace every occurrence of a first name with the full name. You only have the full name to work with, not the first name.
Just realized this may not work as a RegEx and might require code.
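For the name use case, one possible approach in code is to derive the search term from the first word of the replacement and build the regex at runtime. This is only a sketch under stated assumptions: the first whitespace-separated word is the part to search for, and occurrences already followed by the rest of the name are left alone. The names escapeRegExp and expandToFullName are hypothetical.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// expandToFullName is a hypothetical helper: it replaces the first word of
// fullName with the whole of fullName, unless the rest already follows.
function expandToFullName(sentence, fullName) {
  const name = fullName.trim();
  const [first, ...rest] = name.split(/\s+/);
  if (rest.length === 0) return sentence; // nothing to expand to
  const remainder = " " + rest.join(" ");
  const pattern = new RegExp(
    "\\b" + escapeRegExp(first) + "\\b(?!" + escapeRegExp(remainder) + "\\b)",
    "g"
  );
  // A function replacement avoids any special handling of "$" in the name.
  return sentence.replace(pattern, () => name);
}

console.log(expandToFullName("Thank you Susan for your order.", "Susan Summers"));
// Thank you Susan Summers for your order.
```

The negative lookahead is what makes the replacement safe to run twice over the same document.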
You can use:
s = 'http://www.domain.com/path/something/more/something/';
r = s.replace(/(\/path\/)/, "$1user/");
//=> "http://www.domain.com/path/user/something/more/something/"
You don't need to use regular expression for this case:
var url = 'http://www.domain.com/path/something/more/something/';
url.replace('/path/', '/path/user/');
// => "http://www.domain.com/path/user/something/more/something/"
I'm not quite sure if I understand the problem correctly. The following replaces any part of /path/user/ (-> part 1: 'path', part 2: 'user') with the whole /path/user/:
var url1 = "http://www.domain.com/path/something/more/something/";
var url2 = "http://www.domain.com/user/something/more/something/";
url1.replace(/\/path\/|\/user\//, '/path/user/');
url2.replace(/\/path\/|\/user\//, '/path/user/');
results in:
http://www.domain.com/path/user/something/more/something/
http://www.domain.com/path/user/something/more/something/
I hope this is what you need, otherwise, please add another example.
EDIT:
Here is the regex in action: http://regex101.com/r/jL6tK6
split + join alternative :
url = url.split('/path/').join('/path/user/');
Although your requirements are not clear, here is a guess that raises a few extra questions :
var sub = '/path/user/';
var parts = sub.match(/[^\/]+/g);
url = url.replace(new RegExp(
'\\/(' + [parts.join('\\/')].concat(parts).join('|') + ')\\/'
), sub);
The resulting regular expression is as follows :
/\/(path\/user|path|user)\// // "/path/user/" OR "/path/" OR "/user/"
Let's check some urls assuming we live in the best of worlds :
'http://domain/' -> 'http://domain/'
'http://path/user/' -> 'http://path/user/'
'http://path/' -> 'http://path/user/'
'http://user/' -> 'http://path/user/'
Now, what do you think about the following ones?
'http://path/user' -> 'http://path/user/user'
'http://user/path/' -> 'http://path/user/path/'
'http://path/user/path/' -> 'http://path/user/path/'
The remaining questions are :
Is this what you are looking for?
What to do when there is no trailing slash?
What to do in the reverse order case?
What to do with recurrent parts?
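The cases listed above can be reproduced by wrapping the guess in a function. This is a sketch only: insertPath is a hypothetical name, and the parts are not regex-escaped, so it assumes the search string contains no regex metacharacters.

```javascript
// insertPath is a hypothetical wrapper around the guess above.
// Assumption: sub contains no regex metacharacters (parts are not escaped).
function insertPath(url, sub) {
  const parts = sub.match(/[^\/]+/g); // "/path/user/" -> ["path", "user"]
  const pattern = new RegExp(
    "\\/(" + [parts.join("\\/")].concat(parts).join("|") + ")\\/"
  );
  return url.replace(pattern, sub);
}

console.log(insertPath("http://path/", "/path/user/"));      // http://path/user/
console.log(insertPath("http://user/path/", "/path/user/")); // http://path/user/path/
```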

Regex: Getting content from URL

I want to get "the-game" using regex from URLs like
http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/another-one/another-one/
http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/another-one/
http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/
What parts of the URL could vary and what parts are constant? The following regex will always match whatever is in the slashes following "/en/" - the-game in your example.
(?<=/en/).*?(?=/)
This one will match the contents of the 2nd set of slashes of any URL containing "webdev", assuming the first set of slashes contains a 2 or 3 character language code.
(?<=.*?webdev.*?/.{2,3}/).*?(?=/)
Hopefully you can tweak these examples to accomplish what you're looking for.
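In JavaScript, the first pattern needs its slashes escaped inside a regex literal, and lookbehind is only available in engines that support it (recent Node.js and current browsers do). A minimal sketch, with segmentAfterEn as a hypothetical helper name:

```javascript
// segmentAfterEn is a hypothetical helper name.
// Assumption: the runtime supports lookbehind assertions.
function segmentAfterEn(url) {
  const hit = url.match(/(?<=\/en\/).*?(?=\/)/);
  return hit ? hit[0] : null;
}

console.log(
  segmentAfterEn("http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/")
); // the-game
```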
var myregexp = /^(?:[^\/]*\/){4}([^\/]+)/;
var match = myregexp.exec(subject);
var result;
if (match != null) {
    result = match[1];
} else {
    result = "";
}
matches whatever lies between the fourth and fifth slash and stores the result in the variable result.
You probably should use some kind of url parsing library rather than resorting to using regex.
In Python (3):
from urllib.parse import urlparse
url = urlparse('http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/another-one/another-one/')
print(url.path)
Which would yield:
/en/the-game/another-one/another-one/another-one/
From there, you can do simple things like stripping /en/ from the beginning of the path. Otherwise, you're bound to do something wrong with a regular expression. Don't reinvent the wheel!
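The same idea works in JavaScript with the built-in URL class (available in browsers and in Node.js). A minimal sketch, assuming the wanted value is always the second path segment; secondPathSegment is a hypothetical helper name:

```javascript
// secondPathSegment is a hypothetical helper name.
function secondPathSegment(rawUrl) {
  // "/en/the-game/another-one/" -> ["en", "the-game", "another-one"]
  const segments = new URL(rawUrl).pathname.split("/").filter(Boolean);
  return segments.length > 1 ? segments[1] : null;
}

console.log(
  secondPathSegment("http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/another-one/")
); // the-game
```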
