Get YouTube video ID from URL with Python and Regex - javascript

I would like to use regex to retrieve the video ID from a YouTube URL that is part of an HTML anchor element, like so:
Some text
I have looked around for solutions and found a JavaScript one that takes the video ID from the URL like so:
/https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*/ig
I would like to use this in Python as it supports every variant of YouTube's URLs. I implemented it in my Python script:
string = re.sub(r'https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:[\'"][^<>]*>|<\/a>))[?=&+%\w.-]*', r'\1', string)
But I get no replacements. I removed the leading / and the trailing /ig from the regex, as they are JavaScript-only syntax, but I still can't get it to pick up the video ID. Once I am able to pick up the ID, I can easily adjust the regex to remove the anchor element.
What have I done wrong with my solution? Thanks.

I use something like the code below, based on "Youtube I.D parsing for new URL formats" and "Python regex convert youtube url to youtube video".
import re
test_links = """
'http://www.youtube.com/watch?v=5Y6HSHwhVlY',
'http://www.youtube.com/watch?/watch?other_param&v=5Y6HSHwhVlY',
'http://www.youtube.com/v/5Y6HSHwhVlY',
'http://youtu.be/5Y6HSHwhVlY',
'http://www.youtube.com/embed/5Y6HSHwhVlY?rel=0" frameborder="0"',
'http://m.youtube.com/v/5Y6HSHwhVlY',
'https://www.youtube-nocookie.com/v/5Y6HSHwhVlY?version=3&hl=en_US',
'http://www.youtube.com/',
'http://www.youtube.com/?feature=ytca
"""
pattern = r'(?:https?:\/\/)?(?:[0-9A-Z-]+\.)?(?:youtube|youtu|youtube-nocookie)\.(?:com|be)\/(?:watch\?v=|watch\?.+&v=|embed\/|v\/|.+\?v=)?([^&=\n%\?]{11})'
result = re.findall(pattern, test_links, re.MULTILINE | re.IGNORECASE)
print(result)
But I really don't know if it is up to date.
Edit: allow all subdomains.

I don't think the group that begins with (?! (it comes right after (?=[^\w-]|$) near the end of the pattern) is supposed to be a negative lookahead:
https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*
I believe it should be a non-capturing group (i.e., ?! should be ?:).
>>> import re
>>> html = 'Some text'
>>> pattern = re.compile(r"""https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?:[?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*""", re.IGNORECASE)
>>> re.search(pattern, html).groups()
('NC2blnl0WTE',)
EDIT: Notice that I also had to use re.IGNORECASE. This is because the regex, as-is, won't match the www in www.youtube.com; you would need [0-9A-Z-] to be [0-9A-Za-z-]. However, it is safer to just ignore case so you don't have to worry about other text in the URL.
EDIT2: As a negative lookahead, it means you would never get a match when the URL is followed by the end of the opening anchor tag and its closing tag (">blah blah blah</a>).
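The effect of the lookahead can be checked in JavaScript, where the pattern originated. This is only a minimal sketch: the anchor markup, the plain-text sample, and the helper name findId are assumptions made for illustration, since the original HTML is not shown above.

```javascript
// Original pattern: the (?!...) group rejects URLs that sit inside an anchor.
const withLookahead = /https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*/i;
// Same pattern with ?! changed to ?: as suggested in the answer.
const withGroup = /https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?:[?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*/i;

// Hypothetical helper: returns the captured ID, or null when nothing matches.
function findId(pattern, text) {
  const match = pattern.exec(text);
  return match ? match[1] : null;
}

// Assumed sample inputs (not taken from the question).
const anchor = '<a href="http://www.youtube.com/watch?v=NC2blnl0WTE">Some text</a>';
const plain = 'Watch http://www.youtube.com/watch?v=NC2blnl0WTE now';

console.log(findId(withLookahead, anchor)); // null
console.log(findId(withLookahead, plain));  // NC2blnl0WTE
console.log(findId(withGroup, anchor));     // NC2blnl0WTE
```

The first call returns null because the URL is immediately followed by the closing quote and > of the anchor, which is exactly what the negative lookahead forbids.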

Related

How to write this regular expression to only replace url

I have this regex to replace a YouTube link with an iframe.
const regExp = /^.*(youtu.be\/|v\/|u\/\w\/|embed\/|watch\?v=|\&v=)([^#\&\?]*).*/;
It works, but it replaces the whole string. Let's say I have something like this:
let string = "This is a youtube like video but replace just the link... https://www.youtube.com/watch?v=0oPAkkHXYHs&t=541s";
It replaces the whole string variable, but I want it to only replace the
https://www.youtube.com/watch?v=0oPAkkHXYHs&t=541s
Which should give me something like this in the end.
This is a youtube like video but replace just the link... <iframe ...>video</iframe>;
How can I change the regExp to replace just part of the string?
string = string.replace(regExp, function (url) {
  return `<iframe ....></iframe>`;
});
I've modified your regexp and used groups, it should work properly.
const regExp = /(^.*)(http(s)?:\/\/)((w){3}.)?youtu(be|.be)?(\.com)?\/.+/;
let str = `This is a youtube like video but replace just the link... https://www.youtube.com/watch?v=0oPAkkHXYHs&t=541s`;
str = str.replace(regExp, '$1 <iframe ....></iframe>');
console.log(str);
The problem is that you are capturing neither the beginning nor the end of the string; if you put your regex in Regexper you can check that easily.
You should capture whatever comes before and after the Youtube link in capture groups (like you already did with the different parts of the link) to preserve them:
const regExp = /^(.*)(youtu.be\/|v\/|u\/\w\/|embed\/|watch\?v=|\&v=)([^#\&\?]*)(.*)/
Now you should update your replace code to take into account that the first matching group is no longer the Youtube link but whatever your string contained before it.
const sourceString = 'This is a youtube like video but replace just the link... https://www.youtube.com/watch?v=0oPAkkHXYHs&t=541s'
const regExp = /^(.*)(youtu.be\/|v\/|u\/\w\/|embed\/|watch\?v=|\&v=)([^#\&\?]*)(.*)/
const embeddedString = sourceString.replace(regExp, '$1 <iframe ...></iframe> $4')
console.log(embeddedString)
When you play with this a little, you'll notice a couple of issues in the original regular expression: YouTube links containing a timestamp won't work properly, nor will links with HTTPS at the beginning or links to youtube.com instead of youtu.be.
I recommend a couple of tools that are very helpful when working with regular expressions:
Regexper is an online regular expression visualizer which displays a nice graph representing your regex.
Regex101 is an online workbench for regular expressions which lets you check how a regex is executed on your test strings and get immediate results.
The issues above could be solved by using a simpler regex to match URLs and then a different tool to extract the useful parts of each URL, like the built-in URL class or a third-party library.
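A minimal sketch of that idea, assuming a runtime with the global URL class (modern browsers and Node.js). The function names extractYouTubeId and embedYouTubeLinks, the list of accepted hosts, and the iframe markup are assumptions made for illustration, not a complete treatment of every YouTube URL format.

```javascript
// Hypothetical helper: pull the video ID out of a single link, or return null.
function extractYouTubeId(link) {
  let url;
  try {
    url = new URL(link);
  } catch (err) {
    return null; // not a parseable absolute URL
  }
  const host = url.hostname.replace(/^www\./, '');
  if (host === 'youtu.be') {
    return url.pathname.slice(1) || null;
  }
  if (host === 'youtube.com' || host === 'm.youtube.com') {
    if (url.pathname === '/watch') {
      return url.searchParams.get('v'); // null when there is no v parameter
    }
    const match = url.pathname.match(/^\/(?:embed|v)\/([^/]+)/);
    return match ? match[1] : null;
  }
  return null;
}

// Replace only the links that yield an ID; everything else is left untouched.
function embedYouTubeLinks(text) {
  return text.replace(/https?:\/\/\S+/g, (link) => {
    const id = extractYouTubeId(link);
    return id
      ? `<iframe src="https://www.youtube.com/embed/${id}"></iframe>`
      : link;
  });
}

const sample = 'This is a youtube like video but replace just the link... https://www.youtube.com/watch?v=0oPAkkHXYHs&t=541s';
console.log(embedYouTubeLinks(sample));
// This is a youtube like video but replace just the link... <iframe src="https://www.youtube.com/embed/0oPAkkHXYHs"></iframe>
```

Because the URL parser handles the query string, the &t=541s timestamp no longer leaks into the output.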
This could also be addressed (at least partially) by updating the regular expression so it extracts the video ID and ignores everything else:
const sourceString = 'This is a youtube like video but replace just the link... https://www.youtube.com/watch?v=0oPAkkHXYHs&t=541s'
const regExp = /^(.*?)(?:https?:\/\/)?(?:www\.)?(?:youtu\.be|youtube\.com)\/(?:v\/|u\/\w\/|embed\/|watch)(?:(?:(?:\?v=)([^& ]+)*)?)[^ ]*(.*)/
const embeddedString = sourceString.replace(regExp, '$1 <iframe ...></iframe> $3')
console.log(embeddedString)
Note that even though this version works fine with your sample case, it is not production ready, and there will be more edge cases I haven't found while writing it.
If you want to pursue the regex-based approach to this problem, I suggest you try one of the NPM packages that offer a more thoroughly tested regular expression for finding the ID of a YouTube video in a link.
They may not solve your problem directly but are good starting points for writing a more reliable regex.

regex to match all keywords in a string

Being a noob in regex, I need some support from the community.
Let's say I have this string str:
1. www.anysite.com hello demo try this link
2. anysite.com indeed demo link
3. http://www.anysite.com another one
4. www.anysite.com
5. http://anysite.com
Consider lines 1-5 together as the whole string str here.
I want to convert all 'anysite.com' into clickable html links, for which I am using:
str = str.replace(/((http|https|ftp):\/\/[\w?=&.\/-;#~%-]+(?![\w\s?&.\/;#~%"=-]*>))/g, '$1');
This converts all space separated words starting with http/https/ftp into links as
url
So, lines 3 and 5 have been converted correctly. Now, to convert all www.anysite.com into links, I again used
str = str.replace(/(\b^(http|https|ftp)?(www\.)[-A-Z0-9+&##\/%?=~_|!:,.;]*[-A-Z0-9+&##\/%=~_|])/ig, '$1');
However, it only converts www.anysite.com into a link if it is found at the very beginning of str, so it converts line 1 but not line 4.
Note that I have used ^(http|https|ftp)?(www.) to find all www not starting with http/https/ftp, since the http ones have already been converted.
Also, the link on line 2 starts with neither http nor www but ends with .com; what would the regex be for that?
For reference, you can try posting this whole string to your Facebook timeline; it converts all five lines into links.
Thanks for the help. The final regex that helped me is:
//remove all http:// and https://
str = str.replace(/(http|https):\/\//ig, "");
//replace all string ending with .com or .in only into link
str = str.replace( /((www\.)?[-a-zA-Z0-9#:%._\+~#=]{2,256}\.(com|in))/ig, '$1');
I used .com and .in for my specific requirement; otherwise the solution at http://regexr.com/39i0i will work.
There is still an issue, though: it doesn't convert shortened URLs into links perfectly, e.g. http://s.ly/qhdfTyuiOP will only be linked up to s.ly.
Any suggestions?
^(http|https|ftp)?(www\.) does not mean "all www not starting with http/https/ftp" but rather "a string that starts with an optional http/https/ftp followed by www.".
Indeed, ^ in this context isn't a negation but rather an anchor representing the start of the string. I suppose you used it this way because of its meaning when used in a character class ([^...]); it is rather tricky since its meaning changes depending on the context it is found in.
You could just remove it and you should be fine, as I see no point of making sure the string does not start with http/https/ftp (you transformed those occurrences just before, there should be none left).
Edit : I mentioned lookbehind but forgot it's not available in JS...
If you wanted to make some kind of negation, the easiest way would be to use a negative lookbehind :
(?<!http|https|ftp)www\.
This matches "www." only when it's not preceded by http, https nor ftp.
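For engines that implement the ES2018 lookbehind assertions (an assumption about your runtime; older browsers reject the syntax), the idea can be sketched as follows. Note that in http://www. the text immediately before www. is ://, not the scheme name, so this sketch includes the separator inside the lookbehind. The pattern and the name findBareWww are illustrative only.

```javascript
// Match "www." plus the following non-space characters, but only when it is
// not directly preceded by a scheme such as http://, https:// or ftp://.
const bareWww = /(?<!(?:https?|ftp):\/\/)www\.\S+/gi;

// Hypothetical helper: list every scheme-less www link in the text.
function findBareWww(text) {
  return text.match(bareWww) || [];
}

console.log(findBareWww('www.anysite.com hello http://www.anysite.com another www.other.com'));
// [ 'www.anysite.com', 'www.other.com' ]
```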

Only match regex if it doesnt start with a pattern in javascript

I have a bit of a strange one here: I basically have a large chunk of text which may or may not contain links to images.
So let's say it does. I have a pattern which extracts the image URL fine, but once a match is found it is replaced with an <img> element with the link as the src. Now the problem is that there may be multiple matches within the text, and this is where it gets tricky: the URL pattern will now match the URL in the src attribute, which basically just enters an infinite loop.
So is there a way to ONLY match in regex if it doesn't start with a pattern like ="|=' ? Then it would match the URL in something like:
some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6
but not
some image <img src="http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6">
I am not sure if it is possible, but if it is, could someone point me in the right direction? A replace by itself will not suffice in this scenario, as the matched URL needs to be used elsewhere too, so it needs to be available as a capture.
The main scenarios I need to account for are:
Many links in one block of varied text
A single link without any other text
A single link with other varied text
== edit ==
Here is the current regex I am using to match urls:
(\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))
== edit 2 ==
Just so everyone understands why I cannot use the /g flag: here is an answer which explains the issue. If I could use /g like I originally tried, it would make things a lot simpler.
Javascript regex multiple captures again
What you are looking for is a negative lookbehind, but JavaScript doesn't support any kind of lookbehind, so you will either have to use a callback function to check what was matched and make sure it is not preceded by a ' or ", or you can use the following regex:
(?:^|[^"'])(\b(https?|ftp|file):\/\/[-a-zA-Z0-9+&##\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))
This has a single problem: on a successful match it also consumes one extra character, the one right before the (\b(https?|ftp|file) pattern in the input, but I think you can deal with this easily.
Regex101 Demo
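The callback option mentioned in this answer can be sketched as below. It is a rough illustration rather than a tested solution: the function name wrapBareImageUrls and the sample text are assumptions, and the character class is copied from the question.

```javascript
// Image URL pattern from the question, without capture groups so the replace
// callback receives (match, offset, wholeString).
const imageUrl = /\b(?:https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp)/gi;

// Wrap a URL in <img> only when the character before it is not a quote,
// i.e. when it is not already the value of an attribute.
function wrapBareImageUrls(text) {
  return text.replace(imageUrl, (match, offset, whole) => {
    const previous = offset > 0 ? whole[offset - 1] : '';
    if (previous === '"' || previous === "'") {
      return match; // already inside src="..." or src='...'
    }
    return `<img src="${match}">`;
  });
}

const once = wrapBareImageUrls('some image http://cdn.sstatic.net/stackoverflow/img/sprites.png');
console.log(once);
// some image <img src="http://cdn.sstatic.net/stackoverflow/img/sprites.png">
console.log(wrapBareImageUrls(once) === once); // true
```

Running the function a second time leaves the text unchanged, which avoids the loop described in the question.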
Using the /ig flags at the end should work... the g is for global replace and the i is for case-insensitivity, which is necessary as you've only got A-Z instead of a-zA-Z.
Using the following vanilla JS appears to work for me (see jsfiddle)...
var test="some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6";
var re = new RegExp(/(\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/ig);
document.getElementById("output").innerHTML = test.replace(re,"<img src=\"$1\"/>");
Although, what it does highlight is that the query string part of the URL (the ?v=6) is not being picked up by your regex.
For jQuery, it would be (see jsfiddle)...
$(document).ready(function(){
  var test="some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6";
  var re = new RegExp(/(\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/ig);
  $("#output").html(test.replace(re,"<img src=\"$1\"/>"));
});
Update
Just in case my example of using the same image URL in the example doesn't convince you - it also works with different URLs... see this jsfiddle update
var test="http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 http://cdn.sstatic.net/serverfault/img/sprites.png?v=7";
var re = new RegExp(/(\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/ig);
document.getElementById("output").innerHTML = test.replace(re,"<img src=\"$1\"/>");
Couldn't you just check whether there is whitespace in front of the URL, instead of using that word boundary? It seems to work, although you will have to remove the matched whitespace later.
(\s(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))
http://rubular.com/r/9wSc0HNWas
Edit: Damn, too slow :) I'll still leave this here as my regex is shorter ;)
As freefaller said, you might use the /g flag to just find all matches in one go, if exec is not a must.
Otherwise, you can add (="|=')? to the beginning of your regex and check whether $1 is undefined. If it is undefined, the match did not start with a ="|=' pattern.
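A small sketch of the optional-prefix idea, using exec in a loop. The function name findBareImageUrls and the sample input are assumptions made for illustration; the URL character class is the one from the question.

```javascript
// Group 1 is the optional ="  or =' prefix; group 2 is the image URL itself.
function findBareImageUrls(text) {
  const re = /(="|=')?\b((?:https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/gi;
  const found = [];
  let m;
  while ((m = re.exec(text)) !== null) {
    // An undefined prefix group means the URL was not an attribute value.
    if (m[1] === undefined) {
      found.push(m[2]);
    }
  }
  return found;
}

console.log(findBareImageUrls('a http://x.com/a.png b <img src="http://x.com/b.gif">'));
// [ 'http://x.com/a.png' ]
```

Because the prefix is part of the match, URLs inside attributes are consumed and skipped instead of being matched again on the next iteration.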

Regular expression for detecting hyperlinks

I've got this regex pattern from WMD showdown.js file.
/<((https?|ftp|dict):[^'">\s]+)>/gi
and the code is:
text = text.replace(/<((https?|ftp|dict):[^'">\s]+)>/gi,"$1");
But when I set text to http://www.google.com, it does not anchor it, it returns the original text value as is (http://www.google.com).
P.S: I've tested it with RegexPal and it does not match.
Your code is searching for a url wrapped in <> like: <http://www.google.com>: RegexPal.
Just change it to /((https?|ftp|dict):[^'">\s]+)/gi if you don't want it to search for the <>: RegexPal
As long as you know your URLs start with http:// or https:// or whatever, you can use:
/((https?|s?ftp|dict|www)(:\/\/)?[A-Za-z0-9.\-]+)/gi
The expression will match until it encounters a character not allowed by the pattern, i.e. anything other than A-Z, a-z, 0-9, a dot, or a hyphen. It will not, however, detect anything of the form google.com, or anything that comes after the domain name such as parameters or subdirectory paths. If that is your requirement, you can simply use the terminating condition from your own regex above.
I know it seems pointless but it may be useful if you want the display name to be something abbreviated rather than the whole url in case of complex urls.
You could use:
var re = /(http|https|ftp|dict)(:\/\/\S+?)(\.?\s|\.?$)/gi;
with:
el.innerHTML = el.innerHTML.replace(re, '<a href=\'$1$2\'>$1$2<\/a>$3');
to also match URLs at the end of sentences.
But you need to be very careful with this technique, make sure the content of the element is more or less plain text and not complex markup. Regular expressions are not meant for, nor are they good at, processing or parsing HTML.
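To try the last pattern outside the DOM, the replacement can be applied to a plain string. This is only a sketch under that assumption; the function name linkifyPlainText and the sample sentences are made up for illustration.

```javascript
// Pattern and replacement string taken from the answer above.
const urlPattern = /(http|https|ftp|dict)(:\/\/\S+?)(\.?\s|\.?$)/gi;

// Hypothetical helper: turn bare URLs in plain text into anchors.
function linkifyPlainText(text) {
  return text.replace(urlPattern, "<a href='$1$2'>$1$2</a>$3");
}

console.log(linkifyPlainText('Visit http://www.google.com.'));
// Visit <a href='http://www.google.com'>http://www.google.com</a>.
```

The third group keeps the trailing full stop or whitespace outside the link, which is how a URL at the end of a sentence is handled.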

Building a Hashtag in Javascript without matching Anchor Names, BBCode or Escaped Characters

I would like to convert any instances of a hashtag in a String into a linked URL:
#hashtag -> should have "#hashtag" linked.
This is a #hashtag -> should have "#hashtag" linked.
This is a [url=http://www.mysite.com/#name]named anchor[/url] -> should not be linked.
This isn't a pretty way to use quotes -> should not be linked.
Here is my current code:
String.prototype.parseHashtag = function() {
  return this.replace(/[^&][#]+[A-Za-z0-9-_]+(?!])/, function(t) {
    var tag = t.replace("#","")
    return t.link("http://www.mysite.com/tag/"+tag);
  });
};
Currently, this appears to handle escaped characters (by excluding matches preceded by an ampersand) and named anchors, but it doesn't link the #hashtag if it's the first thing in the message, and it seems to include the 1-2 characters prior to the "#" in the link.
Halp!
How about the following:
/(^|[^&])#([A-Za-z0-9_-]+)(?![A-Za-z0-9_\]-])/g
matches the hashtags in your example. Since JavaScript doesn't support lookbehind, it tries to either match the start of the string or any character except & before the hashtag. It captures the latter so it can later be replaced. It also captures the name of the hashtag.
So, for example:
subject.replace(/(^|[^&])#([A-Za-z0-9_-]+)(?![A-Za-z0-9_\]-])/g, "$1http://www.mysite.com/tag/$2");
will transform
#hashtag
This is a #hashtag and this one #too.
This is a [url=http://www.mysite.com/#name]named anchor[/url]
This isn't a pretty way to use quotes
into
http://www.mysite.com/tag/hashtag
This is a http://www.mysite.com/tag/hashtag and this one http://www.mysite.com/tag/too.
This is a [url=http://www.mysite.com/#name]named anchor[/url]
This isn't a pretty way to use quotes
This probably isn't what t.link() (which I don't know) would have returned, but I hope it's a good starting point.
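If an actual anchor is wanted, the same pattern can be combined with an HTML replacement string. This is a sketch only: the function name linkHashtags, the anchor markup, and the escaped-apostrophe sample are assumptions, not something taken from the question.

```javascript
// Same pattern as in the answer: group 1 is the preceding character (or the
// start of the string), group 2 is the tag name.
const hashtagPattern = /(^|[^&])#([A-Za-z0-9_-]+)(?![A-Za-z0-9_\]-])/g;

// Hypothetical helper: wrap each hashtag in a link to its tag page.
function linkHashtags(text) {
  return text.replace(
    hashtagPattern,
    '$1<a href="http://www.mysite.com/tag/$2">#$2</a>'
  );
}

console.log(linkHashtags('This is a #hashtag'));
// This is a <a href="http://www.mysite.com/tag/hashtag">#hashtag</a>
console.log(linkHashtags('This is a [url=http://www.mysite.com/#name]named anchor[/url]'));
// This is a [url=http://www.mysite.com/#name]named anchor[/url]
console.log(linkHashtags('This isn&#39;t a hashtag'));
// This isn&#39;t a hashtag
```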
There is an open-source Ruby gem to do this sort of thing (hashtags and @usernames) called twitter-text. You might get some ideas and regexes from that, or try out this JavaScript port.
Using the JavaScript port, you'll want to just do:
var linked = TwitterText.auto_link_hashtags(text, {hashtag_url_base: "http://www.mysite.com/tag/"});
Tim, your solution was almost perfect. Here's what I ended up using:
subject.replace(/(^| )#([A-Za-z0-9_-]+)(?![A-Za-z0-9_\]-])/g, "$1#$2");
The only change is the first conditional, changed it to match the beginning of the string or a space character. (I tried \s, but that didn't work at all.)
