JavaScript: fetch the last URL without a prefix

I want to detect the last URL in a piece of text using JavaScript or MooTools. The URL can be without a prefix/scheme.
I am working on URL auto-detection like Facebook's, where a user may enter a URL as www.example.com or as http://www.example.com, and either of them should be detected by JavaScript. The approach I found on Stack Overflow detects URLs that include a scheme, but it cannot detect a URL without one. In my case I need both.
Here is some text
'http://www.example.com www.example2.com'
Now I want www.example2.com. It would be better if I got a full array containing both http://www.example.com and www.example2.com.
I searched a lot but couldn't find a solution.
The closest to my requirements were Question about URL Validation with Regex and How do I extract a URL from plain text using jQuery?
Any help greatly appreciated.

By combining the info in these 2 links:
How do I extract a URL from plain text using jQuery?
Detect URLs in text with JavaScript
We can get this:
http://jsfiddle.net/qQwGA/1/
If I understand what you're trying to do, this should cover it.

Given your input string, I think you just want to split it using spaces as separator?
.split(' ') ?

**REGEX**
/([^:\/?# ]+:)?(\/\/[^\/?# ]*)?[^?# ]+(\?[^# ]*)?(#\S*)?/gi
**SAMPLE CODE**
var str = 'http://www.example.com www.example2.com scheme://username:password@domain:port/path?query_string#fragment_id';
var t = str.match(/([^:\/?# ]+:)?(\/\/[^\/?# ]*)?[^?# ]+(\?[^# ]*)?(#\S*)?/gi);
/*
t contains :
[
"http://www.example.com",
"www.example2.com",
"scheme://username:password@domain:port/path?query_string#fragment_id"
]
*/
**DEMO**
http://jsfiddle.net/wvYTd/
**DISCUSSION**
This regex will find any substring that looks like a URL in an input string. It is very permissive: any run of characters other than spaces, ? and # matches, so plain words are picked up as well.
No validation is performed on any URL found. For instance, if the input string is 3aBadScheme://hostname, the regex will detect it as a URL. In this example, 3aBadScheme is invalid since a scheme MUST start with a letter.
Excerpt from RFC 3986:
(...)Scheme names consist of a sequence of characters beginning with a letter and followed by any combination of letters, digits, plus ("+"), period ("."), or hyphen ("-").(...)
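As a rough sketch of how the pieces above could fit together: the snippet below collects every URL-like substring with the regex from this answer, picks the last one, and adds an optional scheme check based on the RFC 3986 rule quoted above. The helper names findUrlLike and hasValidOrNoScheme are made up for illustration and are not part of the original answer.

```javascript
// Permissive pattern from the answer above; the g flag returns every match.
const URL_LIKE = /([^:\/?# ]+:)?(\/\/[^\/?# ]*)?[^?# ]+(\?[^# ]*)?(#\S*)?/gi;

// Hypothetical helper: all URL-like substrings, in order of appearance.
function findUrlLike(text) {
  return text.match(URL_LIKE) || [];
}

// Hypothetical helper: true when the candidate has no scheme at all, or a
// scheme that starts with a letter followed by letters, digits, "+", "." or "-".
function hasValidOrNoScheme(candidate) {
  const m = /^([^:\/?# ]+):/.exec(candidate);
  return m === null || /^[a-z][a-z0-9+.-]*$/i.test(m[1]);
}

const found = findUrlLike('http://www.example.com www.example2.com');
const last = found[found.length - 1];
console.log(found); // both URLs, in order
console.log(last);  // the last one, www.example2.com
console.log(hasValidOrNoScheme('3aBadScheme://hostname')); // false
```

Filtering the array with hasValidOrNoScheme is one way to drop candidates such as 3aBadScheme://hostname; it does not validate the rest of the URL.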

Related

How to grab URLs in JavaScript without harming embedded objects and inline URL

I wrote a RegExp to grab and encode URLs in JavaScript.
This works fine but, it introduced a bug into my app.
I have a span Element which is used to display Emojis like this:
<span style="background:url(http://localhost/res/emo/face/E004.png)"></span>
Now, I'm using this regular expression to grab any URL and convert it into an actual clickable HTML link:
/((https?:\/\/)?[\w-]+(\.[\w-]+)+\.?(:\d+)?(\/\S*)?)/ig
This ended up turning the emoji URL into a clickable link.
Can anyone adjust that code to ignore URLs inside elements or embedded objects?
Please, I need help!
This is the code:
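// Wrapped in a function so that `return` is valid; the name linkify is illustrative.
function linkify(txt) {
  var urlRegex = /((https?:\/\/)?[\w-]+(\.[\w-]+)+\.?(:\d+)?(\/\S*)?)/ig;
  return txt.replace(urlRegex, function (url) {
    var hyperlink = url;
    // Default to http:// when the matched text has no scheme.
    if (!hyperlink.match('^https?:\/\/')) {
      hyperlink = 'http://' + hyperlink;
    }
    // Build the clickable link (the anchor markup was lost from the original post).
    return `<a href="${hyperlink}">${url}</a>`;
  });
}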
I don't want the URLs inside
<span style="background:url(http://localhost/res/emo/face/E004.png)"></span>
to be touched.
You would need to use a negative lookbehind, which has limited support in JavaScript (see https://stackoverflow.com/a/50434875/6853740).
Simply adding a negative lookbehind to your existing regex still doesn't work as expected:
((?<!url\()(https?:\/\/)?[\w-]+(\.[\w-]+)+\.?(:\d+)?(\/\S*)?) still matches "E004.png" in your example. Even other URL regexes from this post (What is the best regular expression to check if a string is a valid URL?) also match that. You may need to consider only looking for links that start with http:// or https://, which may help you craft a regex that matches only full URLs.
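A minimal sketch of that suggestion, assuming a runtime that supports lookbehind (recent Node.js and Chromium-based browsers do): the scheme is made mandatory and the match is rejected when it is immediately preceded by url(. The function name linkifyHttpOnly and the path character class are choices made for this example, not something taken from the question.

```javascript
// Hypothetical helper: only links that start with http:// or https:// are
// converted, and anything directly preceded by "url(" is left alone.
function linkifyHttpOnly(text) {
  const urlRegex =
    /(?<!url\()https?:\/\/[\w-]+(?:\.[\w-]+)*(?::\d+)?(?:\/[^\s"'<>)]*)?/gi;
  return text.replace(urlRegex, (url) => `<a href="${url}">${url}</a>`);
}

const emoji =
  '<span style="background:url(http://localhost/res/emo/face/E004.png)"></span>';
console.log(linkifyHttpOnly(emoji + ' see http://example.com/page'));
// The span is unchanged; only http://example.com/page becomes a link.
```

This only covers the url(...) case from the question. URLs inside other attributes, such as href="...", would still be rewritten, so walking the DOM text nodes is likely a more robust route for real markup.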

Get YouTube video ID from URL with Python and Regex

I would like to use a regex to retrieve the video ID from a YouTube URL that is part of an HTML anchor element like so:
Some text
I have looked around for some solutions. I found a JavaScript solution that takes the video ID from the URL like so:
/https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*/ig
I would like to use this in Python as it supports every variance of YouTube's URLs. I implemented it in my Python script:
string = re.sub(r'https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:[\'"][^<>]*>|<\/a>))[?=&+%\w.-]*', r'\1', string)
And I get no replacements. I removed the leading / and the trailing /ig from the regex, as they only exist in JavaScript; however, I still can't get it to pick up the video ID. Once I am able to pick up the ID, I can easily change the regex around to remove the anchor element.
What have I done wrong with my solution? Thanks.
I use something like the code below, based on "Youtube I.D parsing for new URL formats" and "Python regex convert youtube url to youtube video".
import re
test_links = """
'http://www.youtube.com/watch?v=5Y6HSHwhVlY',
'http://www.youtube.com/watch?/watch?other_param&v=5Y6HSHwhVlY',
'http://www.youtube.com/v/5Y6HSHwhVlY',
'http://youtu.be/5Y6HSHwhVlY',
'http://www.youtube.com/embed/5Y6HSHwhVlY?rel=0" frameborder="0"',
'http://m.youtube.com/v/5Y6HSHwhVlY',
'https://www.youtube-nocookie.com/v/5Y6HSHwhVlY?version=3&hl=en_US',
'http://www.youtube.com/',
'http://www.youtube.com/?feature=ytca'
"""
pattern = r'(?:https?:\/\/)?(?:[0-9A-Z-]+\.)?(?:youtube|youtu|youtube-nocookie)\.(?:com|be)\/(?:watch\?v=|watch\?.+&v=|embed\/|v\/|.+\?v=)?([^&=\n%\?]{11})'
result = re.findall(pattern, test_links, re.MULTILINE | re.IGNORECASE)
print(result)
But I really don't know if I am up to date.
Edit: allow all subdomains.
I don't think this part of the pattern (the ?! marked with ^^ below) is supposed to be a negative lookahead:
(?![?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))
 ^^
I believe it should be a non-capturing group (i.e., ?! should be ?:).
>>> import re
>>> html = 'Some text'
>>> pattern = re.compile(r"""https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?:[?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*""", re.IGNORECASE)
>>> re.search(pattern, html).groups()
('NC2blnl0WTE',)
EDIT: Notice that I also had to use re.IGNORECASE. This is because the regex, as-is, won't match the www in www.youtube.com. You would need [0-9A-Z-] to be [0-9A-Za-z-]. However, it is safer just ignoring the case so you don't have to worry about other text in the URL.
EDIT2: As a negative lookahead, it means you would never get a match when the URL is followed by the rest of your anchor tag (">blah blah blah</a>).
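The effect of that lookahead can be checked in JavaScript, where the pattern originated. The sketch below is illustrative only: the anchor markup in linked is a made-up stand-in (the HTML from the question did not survive), reusing the video ID shown in the output above.

```javascript
// Original pattern: the (?! ... ) lookahead rejects URLs that are followed
// by the rest of an anchor tag.
const original =
  /https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*/i;

// Variant from this answer: (?! replaced by (?: so the anchor tail is consumed.
const relaxed =
  /https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?:[?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*/i;

// Assumed sample inputs, not taken from the question.
const bare = 'http://www.youtube.com/watch?v=NC2blnl0WTE';
const linked = '<a href="http://www.youtube.com/watch?v=NC2blnl0WTE">Some text</a>';

const bareMatch = original.exec(bare);
console.log(bareMatch && bareMatch[1]); // the 11-character ID
console.log(original.exec(linked));     // null: the lookahead blocks linked URLs
const linkedMatch = relaxed.exec(linked);
console.log(linkedMatch && linkedMatch[1]); // the ID is found inside the anchor
```

This suggests the original pattern was written to skip URLs that are already linked, which would explain why it finds nothing when the URL sits inside an anchor element.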

How to remove URL from a string completely in Javascript?

I have a string that may contain several url links (http or https). I need a script that would remove all those URLs from the string completely and return that same string without them.
I tried so far:
var url = "and I said http://fdsadfs.com/dasfsdadf/afsdasf.html";
var protomatch = /(https?|ftp):\/\//; // NB: not '.*'
var b = url.replace(protomatch, '');
console.log(b);
but this only removes the http part and keeps the link.
How do I write a regex that removes everything that follows http and also detects several links in the string?
Thank you so much!
You can use this regex:
var b = url.replace(/(?:https?|ftp):\/\/[\n\S]+/g, '');
//=> and I said
This regex matches and removes any URL that starts with http://, https:// or ftp://, and matches up to the next space character or the end of the input. Because [\n\S]+ also accepts newlines, a match will continue across multiple lines as well.
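A small sketch of the same idea wrapped in a function. Note two deliberate differences from the answer's one-liner: \S+ is used instead of [\n\S]+ so that a match stops at a line break, and leftover double spaces are collapsed. The function name stripUrls is made up for this example.

```javascript
// Hypothetical helper: remove every http://, https:// or ftp:// URL.
function stripUrls(text) {
  return text
    .replace(/(?:https?|ftp):\/\/\S+/g, '') // drop each URL up to the next whitespace
    .replace(/ {2,}/g, ' ')                 // collapse the gaps left behind
    .trim();
}

console.log(stripUrls('and I said http://fdsadfs.com/dasfsdadf/afsdasf.html'));
// "and I said"
console.log(stripUrls('see http://a.example/x and https://b.example/y now'));
// "see and now"
```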
Did you search for a URL parser regex? This question has a few comprehensive answers: Getting parts of a URL (Regex).
That said, if you want something much simpler (and maybe not as perfect), you should remember to capture the entire URL string and not just the protocol.
Something like
/(https?|ftp):\/\/[.a-zA-Z0-9\/\-]+/
should work better. Notice that the added half matches the rest of the URL after the protocol.

Javascript url validation allowing relative and absolute urls

I'm trying to validate a field to allow relative and absolute urls. I'm using the regex from this post but it is allowing spaces in the url.
var urlRegex = new RegExp(/(\/?[\w-]+)(\/[\w-]+)*\/?|(((http|ftp|https):\/\/)?[\w-]+(\.[\w-]+)+([\w.,#?^=%&:\/~+#-]*[\w#?^=%&\/~+#-])?)/gi);
Example:
// this should work
this/will/work.aspx?say=hello
http://www.example.com/this/will/work.aspx?say=hello
// this shouldn't work but does
and/this will also work/even though it shouldn't
and/this-shouldn't/but it does/also
The code below is what I was originally using to validate just absolute urls and it was working perfectly. If I remember properly, I pulled it from the jquery source. If this could be modified to also accept relative urls that would be perfect, but this is out of my league.
var urlRegex = new RegExp(/^(https?|ftp):\/\/(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*#)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|#)|\/|\?)*)?$/i);
I think you just need to anchor the pattern so that it has to match the whole string:
var urlRegex = /^(\/?[\w-]+)(\/[\w-]+)*\/?|(((http|ftp|https):\/\/)?[\w-]+(\.[\w-]+)+([\w.,#?^=%&:\/~+#-]*[\w#?^=%&\/~+#-])?)$/gi;
The leading ^ and trailing $ mean that the pattern has to match the entire string instead of just some part of it.
Edit: that said, the pattern has other problems. First, those HTML entities for & (&amp;) need to be just "&". The slashes don't need to be escaped in [] groups, and we don't need the "g" suffix. That leaves us with:
var urlRegex = /^(?:(\/?[\w-]+)(\/[\w-]+)*\/?|(((http|ftp|https):\/\/)?[\w-]+(\.[\w-]+)*([\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?))$/i;
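Edit again: oops, the whole alternation also needs to be wrapped in a group (as in the pattern above); otherwise ^ applies only to the first alternative and $ only to the second.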
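To illustrate why the wrapping matters, here is a small check against the examples from the question. The variable names are made up; anchoredOnly is the first anchored pattern from this answer without the g flag, and wrapped is the final pattern with the slashes inside [] escaped, which is equivalent.

```javascript
// ^ binds only to the first alternative, $ only to the second.
const anchoredOnly =
  /^(\/?[\w-]+)(\/[\w-]+)*\/?|(((http|ftp|https):\/\/)?[\w-]+(\.[\w-]+)+([\w.,#?^=%&:\/~+#-]*[\w#?^=%&\/~+#-])?)$/i;

// Whole alternation wrapped in (?: ... ), so ^ and $ apply to both branches.
const wrapped =
  /^(?:(\/?[\w-]+)(\/[\w-]+)*\/?|(((http|ftp|https):\/\/)?[\w-]+(\.[\w-]+)*([\w.,#?^=%&:\/~+#-]*[\w#?^=%&\/~+#-])?))$/i;

const good = [
  'this/will/work.aspx?say=hello',
  'http://www.example.com/this/will/work.aspx?say=hello',
];
const bad = [
  "and/this will also work/even though it shouldn't",
  "and/this-shouldn't/but it does/also",
];

console.log(good.map((s) => wrapped.test(s))); // [true, true]
console.log(bad.map((s) => wrapped.test(s)));  // [false, false]
console.log(anchoredOnly.test(bad[0]));        // true: the prefix "and/this" is enough
```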
I wrote an article about URI validation complete with code snippets for all the various URI components as defined by RFC3986 here:
Regular Expression URI Validation
You may find what you are looking for there. Note however that almost any string represents a valid URI - even an empty string!

Regular expression for detecting hyperlinks

I've got this regex pattern from WMD showdown.js file.
/<((https?|ftp|dict):[^'">\s]+)>/gi
and the code is:
text = text.replace(/<((https?|ftp|dict):[^'">\s]+)>/gi,"<a href=\"$1\">$1</a>");
But when I set text to http://www.google.com, it does not anchor it, it returns the original text value as is (http://www.google.com).
P.S: I've tested it with RegexPal and it does not match.
Your code is searching for a url wrapped in <> like: <http://www.google.com>: RegexPal.
Just change it to /((https?|ftp|dict):[^'">\s]+)/gi if you don't want it to search for the <>: RegexPal
As long as you know your URLs start with http:// or https:// or whatever, you can use:
/((https?|s?ftp|dict|www)(:\/\/)?[A-Za-z0-9.\-]+)/gi
The expression matches until it encounters a character that is not allowed by the class, i.e. anything other than A-Z, a-z, 0-9, "." and "-". It will not, however, detect anything of the form google.com, or anything that comes after the domain name, such as parameters or subdirectory paths. If you need those, you can use the same terminating condition as in your regex above.
I know it seems pointless, but it may be useful if you want the display name to be something abbreviated rather than the whole URL in the case of complex URLs.
You could use:
var re = /(http|https|ftp|dict)(:\/\/\S+?)(\.?\s|\.?$)/gi;
with:
el.innerHTML = el.innerHTML.replace(re, '<a href=\'$1$2\'>$1$2<\/a>$3');
to also match URLs at the end of sentences.
But you need to be very careful with this technique, make sure the content of the element is more or less plain text and not complex markup. Regular expressions are not meant for, nor are they good at, processing or parsing HTML.
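For plain-text input, the pattern above can be tried on a string rather than on innerHTML. This is only a sketch: the function name linkifyPlainText is made up for illustration, and the input is assumed to contain no markup.

```javascript
// Hypothetical helper applying the regex from this answer to a plain string.
// $1 and $2 together form the URL; $3 is the trailing period and/or whitespace,
// which is put back after the link.
function linkifyPlainText(text) {
  const re = /(http|https|ftp|dict)(:\/\/\S+?)(\.?\s|\.?$)/gi;
  return text.replace(re, "<a href='$1$2'>$1$2</a>$3");
}

console.log(linkifyPlainText('Visit http://www.google.com. Then search.'));
// Visit <a href='http://www.google.com'>http://www.google.com</a>. Then search.
console.log(linkifyPlainText('see http://example.com/a.'));
// see <a href='http://example.com/a'>http://example.com/a</a>.
```

The lazy \S+? is what keeps the sentence-ending period out of the link: the URL part stops as soon as the remainder can be read as an optional period followed by whitespace or the end of the string.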
