How to write this regular expression to only replace url - javascript

I have this regex to replace a YouTube link with an iframe.
const regExp = /^.*(youtu.be\/|v\/|u\/\w\/|embed\/|watch\?v=|\&v=)([^#\&\?]*).*/;
It works, but it replaces the whole string. Let's say I have something like this:
let string = `This is a youtube like video but replace just the link...
https://www.youtube.com/watch?v=0oPAkkHXYHs&t=541s`;
It replaces the whole string variable, but I want it to only replace the
https://www.youtube.com/watch?v=0oPAkkHXYHs&t=541s
Which should give me something like this in the end.
This is a youtube like video but replace just the link... <iframe ...>video</iframe>;
How can I change the regExp to replace just part of the string?
string = string.replace(regExp, function (url) {
return `<iframe ....></iframe>`;
});

I've modified your regexp and used groups, it should work properly.
const regExp = /(^.*)(http(s)?:\/\/)((w){3}.)?youtu(be|.be)?(\.com)?\/.+/;
let str = `This is a youtube like video but replace just the link... https://www.youtube.com/watch?v=0oPAkkHXYHs&t=541s`;
str = str.replace(regExp, '$1 <iframe ....></iframe>');
console.log(str);

The problem is that you are capturing neither the beginning nor the end of the string; if you paste your regex into Regexper you can check that easily.
You should capture whatever comes before and after the Youtube link in capture groups (like you already did with the different parts of the link) to preserve them:
const regExp = /^(.*)(youtu.be\/|v\/|u\/\w\/|embed\/|watch\?v=|\&v=)([^#\&\?]*)(.*)/
Now you should update your replace code to take into account that the first matching group is no longer the Youtube link but whatever your string contained before it.
const sourceString = 'This is a youtube like video but replace just the link... https://www.youtube.com/watch?v=0oPAkkHXYHs&t=541s'
const regExp = /^(.*)(youtu.be\/|v\/|u\/\w\/|embed\/|watch\?v=|\&v=)([^#\&\?]*)(.*)/
const embeddedString = sourceString.replace(regExp, '$1 <iframe ...></iframe> $4')
console.log(embeddedString)
When you play with this a little, you'll notice a couple of issues in the original regular expression: YouTube links containing a timestamp won't work properly, and neither will links with https:// at the beginning or links to youtube.com instead of youtu.be.
I recommend a couple of tools that are very helpful when working with regular expressions:
Regexper is an online regular expression visualizer which displays a nice graph representing your regex.
Regex101 is an online workbench for regular expressions which allows you to check how a regex is executed on your test strings and get immediate results.
The issues mentioned above could be solved by using a simpler regex to match URLs and then a separate tool to extract the useful parts of each URL, such as the built-in URL class or a third-party library.
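A minimal sketch of that two-step idea, assuming a deliberately loose URL pattern and the built-in URL class. The helper name embedYouTubeLinks, the hosts it checks, and the iframe markup are illustrative assumptions, not part of the original question:

```javascript
// Hypothetical helper: find anything that looks like a URL, then let the
// URL class (not the regex) decide whether it is a YouTube link.
function embedYouTubeLinks(text) {
  // Intentionally simple: any run of non-space characters after http(s)://
  const urlRe = /https?:\/\/[^\s<>"']+/g;
  return text.replace(urlRe, (rawUrl) => {
    let parsed;
    try {
      parsed = new URL(rawUrl);
    } catch {
      return rawUrl; // not parseable, leave it untouched
    }
    const host = parsed.hostname.replace(/^www\./, '');
    let videoId = null;
    if (host === 'youtu.be') {
      videoId = parsed.pathname.slice(1); // short links keep the ID in the path
    } else if (host === 'youtube.com' && parsed.pathname === '/watch') {
      videoId = parsed.searchParams.get('v'); // ignores &t=541s and other params
    }
    if (!videoId) return rawUrl; // some other URL, leave it as-is
    // Assumed embed markup; adjust attributes to your needs.
    return `<iframe src="https://www.youtube.com/embed/${encodeURIComponent(videoId)}"></iframe>`;
  });
}

const sample = 'This is a youtube like video but replace just the link... https://www.youtube.com/watch?v=0oPAkkHXYHs&t=541s';
console.log(embedYouTubeLinks(sample));
```

Because the regex no longer starts with ^.*, only the URL itself is replaced and the surrounding text is left alone without extra capture groups.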
This could also be addressed (at least partially) by updating the regular expression so it extracts the video ID and ignores everything else:
const sourceString = 'This is a youtube like video but replace just the link... https://www.youtube.com/watch?v=0oPAkkHXYHs&t=541s'
const regExp = /^(.*?)(?:https?:\/\/)?(?:www\.)?(?:youtu\.be|youtube\.com)\/(?:v\/|u\/\w\/|embed\/|watch)(?:\?v=([^& ]+))?[^ ]*(.*)/
const embeddedString = sourceString.replace(regExp, '$1 <iframe ...></iframe> $3')
console.log(embeddedString)
Note that even though this version works fine with your sample case it is not production ready and there will be more edge cases I haven't found while writing it.
If you want to pursue the regex-based approach to this problem, I suggest you try some of these npm packages, which offer a more thoroughly tested regular expression for finding the ID of a YouTube video in a link.
They may not solve your problem directly but are good starting points for writing a more reliable regex.

Related

How to grab URLs in JavaScript without harming embedded objects and inline URL

I wrote a RegExp to grab and encode URLs in JavaScript.
This works fine, but it introduced a bug into my app.
I have a span element which is used to display emojis like this:
<span style="background:url(http://localhost/res/emo/face/E004.png)"></span>
Now, I'm using this regular expression to grab any URL and convert it into an actual clickable HTML link:
/((https?:\/\/)?[\w-]+(\.[\w-]+)+\.?(:\d+)?(\/\S*)?)/ig
This ended up encoding the emoji URL into a clickable link.
Can anyone adjust that Code to Ignore URLs inside Elements or embedded Objects???
Please I need help!
This is the code:
var urlRegex = /((https?:\/\/)?[\w-]+(\.[\w-]+)+\.?(:\d+)?(\/\S*)?)/ig;
return txt.replace(urlRegex, function (url) {
var hyperlink = url;
if(!hyperlink.match('^https?:\/\/')) {
hyperlink = 'http://' + hyperlink;
}
return `<a href="${hyperlink}">${url}</a>`;
});
I don't want the URLs inside
<span style="background:url(http://localhost/res/emo/face/E004.png)"></span>
to be touched.
You would need to use negative lookbehind, which has limited support in JavaScript (see https://stackoverflow.com/a/50434875/6853740).
Simply adding negative lookbehind to your existing regex still doesn't work as expected:
((?<!url\()(https?:\/\/)?[\w-]+(\.[\w-]+)+\.?(:\d+)?(\/\S*)?)
still matches "E004.png" in your example. Even other URL regexes from this post (What is the best regular expression to check if a string is a valid URL?) also match that. You may need to consider only looking for links that start with http:// or https://, which may help you craft a regex that only matches full URLs.
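One way to sketch that suggestion without lookbehind is to require an explicit http:// or https:// scheme and to capture an optional CSS url( prefix, then skip the replacement whenever the prefix is present. The function name linkify and the simplified URL pattern are assumptions for illustration; URLs inside other attributes (src, href) would need the same treatment:

```javascript
// Hypothetical helper: linkify only URLs that are not wrapped in CSS url(...).
function linkify(text) {
  // Group 1: optional "url(" prefix, with an optional quote after it.
  // Group 2: a URL that explicitly starts with http:// or https://.
  const re = /(url\(\s*['"]?)?(https?:\/\/[^\s"'<>()]+)/gi;
  return text.replace(re, (match, cssPrefix, url) =>
    // If the prefix matched, return the whole match unchanged.
    cssPrefix ? match : `<a href="${url}">${url}</a>`);
}

const html = '<span style="background:url(http://localhost/res/emo/face/E004.png)"></span> visit https://example.com/page now';
console.log(linkify(html));
```

The trade-off is that bare hostnames such as example.com are no longer linked, since the scheme is now mandatory.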

Get YouTube video ID from URL with Python and Regex

I would like to retrieve the video ID part of a YouTube URL which is part of a HTML anchor element like so using regex:
Some text
I have looked around for some solutions. I found one from a Javascript solution which took the video ID from the url like so:
/https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*/ig
I would like to use this in Python as it supports every variance of YouTube's URLs. I implemented it in my Python script:
string = re.sub(r'https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:[\'"][^<>]*>|<\/a>))[?=&+%\w.-]*', r'\1', string)
And I get no replacements. I removed the / and /ig from the regex as they are only in Javascript however I still can't get it to pick up the video ID. Once I am able to pick up the ID, I can easily change around the regex to remove the anchor element.
What have I done wrong with my solution? Thanks.
I use something like the code below, based on "Youtube I.D parsing for new URL formats" and "Python regex convert youtube url to youtube video".
import re
test_links = """
'http://www.youtube.com/watch?v=5Y6HSHwhVlY',
'http://www.youtube.com/watch?/watch?other_param&v=5Y6HSHwhVlY',
'http://www.youtube.com/v/5Y6HSHwhVlY',
'http://youtu.be/5Y6HSHwhVlY',
'http://www.youtube.com/embed/5Y6HSHwhVlY?rel=0" frameborder="0"',
'http://m.youtube.com/v/5Y6HSHwhVlY',
'https://www.youtube-nocookie.com/v/5Y6HSHwhVlY?version=3&hl=en_US',
'http://www.youtube.com/',
'http://www.youtube.com/?feature=ytca'
"""
pattern = r'(?:https?:\/\/)?(?:[0-9A-Z-]+\.)?(?:youtube|youtu|youtube-nocookie)\.(?:com|be)\/(?:watch\?v=|watch\?.+&v=|embed\/|v\/|.+\?v=)?([^&=\n%\?]{11})'
result = re.findall(pattern, test_links, re.MULTILINE | re.IGNORECASE)
print(result)
But I really don't know if this is up to date.
Edit: allow all subdomains.
I don't think this (scroll right to see part denoted by ^^) is supposed to be a negative lookahead:
https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*
^^
I believe it should be a non-capturing group (i.e., ?! should be ?:).
>>> import re
>>> html = 'Some text'
>>> pattern = re.compile(r"""https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?:[?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*""", re.IGNORECASE)
>>> re.search(pattern, html).groups()
('NC2blnl0WTE',)
EDIT: Notice that I also had to use re.IGNORECASE. This is because the regex, as-is, won't match the www in www.youtube.com. You would need [0-9A-Z-] to be [0-9A-Za-z-]. However, it is safer just ignoring the case so you don't have to worry about other text in the URL.
EDIT2: As a negative lookahead, it means you would never be able to have a match when the URL is followed by the ending and closing of your anchor tag (">blah blah blah</a>).
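The effect of that negative lookahead can be checked with the original JavaScript pattern. This is only a sketch: the anchor markup below is an assumed reconstruction built around the video ID shown in the output above, and the g flag is dropped so exec does not carry state between calls:

```javascript
// The pattern from the question, unchanged apart from the flags.
const youTubeIdRe = /https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*/i;

// A bare URL in running text: the lookahead finds no closing quote or </a>.
const bareUrl = 'Watch https://www.youtube.com/watch?v=NC2blnl0WTE today';
// The same URL inside an href attribute (assumed markup).
const anchored = '<a href="https://www.youtube.com/watch?v=NC2blnl0WTE">Some text</a>';

const bareMatch = youTubeIdRe.exec(bareUrl);
const anchoredMatch = youTubeIdRe.exec(anchored);

console.log(bareMatch && bareMatch[1]); // NC2blnl0WTE
console.log(anchoredMatch);             // null
```

So the pattern appears to be written to skip URLs that are already linked, which would explain why it finds nothing when the input is an anchor element.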

Regex appears to ignore multiple piped characters

Apologies for the awkward question title, I have the following JavaScript:
var wordRe = new RegExp('\\b(?:(?![<^>"])fox|hello(?![<\/">]))\\b', 'g'); // Words regex
console.log('<span>hello</span> <hello>fox</hello> fox link hello my name is fox'.replace(wordRe, 'foo'));
What I'm trying to do is replace any word that isn't nested in a HTML tag, or part of a HTML tag itself. I.e I want to only match "plain" text. The expression seems to be ignoring the rule for the first piped match "fox", and replacing it when it shouldn't be.
Can anyone point out why this is? I think I might have organised the expression incorrectly (at least the negative lookahead).
Here is the JSFiddle.
I'd also like to add that I am aware of the implications of using regex with HTML :)
For your regex to work, you want lookbehind. However, as of this writing, this feature is not supported in JavaScript.
Here is a workaround:
Instead of matching what we want, we will match what we don't want and remove it from our input string. Later, we can perform the replace on the cleaned input string.
var nonWordRe = new RegExp('<([^>]+).*?>[^<]+?</\\1>', 'g');
var test = '<span>hello</span> <hello>fox</hello> fox link hello my name is fox';
var cleanedTest = test.replace(nonWordRe, '');
var final = cleanedTest.replace(/fox|hello/g, 'foo'); // once trimmed, final == 'foo link foo my name is foo'
Note:
I have built this workaround based on your sample. Here are some points that may need to be explored if you run into them:
you may need to remove self-closing tags (<([^>]+).*?/\>) from the test string
you may need to trim the final string (final)
you may need a decent HTML parser if tags can contain other tags, as HTML allows this.
JavaScript doesn't, again as of this writing, support recursive patterns.
Demo
http://jsfiddle.net/yXd82/2/
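A variation on the same "match what you don't want" idea keeps the markup instead of deleting it: the unwanted parts are matched first in an alternation and returned unchanged from a replace callback, and only the plain-text words are swapped. The function name replacePlainWords is hypothetical, and the sketch assumes flat (non-nested) tags and words made of plain letters:

```javascript
// Hypothetical helper: replace words only when they sit outside any tag
// and outside any simple <tag>...</tag> pair.
function replacePlainWords(input, words, replacement) {
  // Alternatives 1 and 2 swallow whole elements and stray tags;
  // the last alternative (group 2) is the word we actually want.
  const skip = '<(\\w+)[^>]*>[^<]*<\\/\\1>|<[^>]+>';
  const re = new RegExp(`${skip}|\\b(${words.join('|')})\\b`, 'g');
  return input.replace(re, (match, tagName, word) =>
    // word is undefined when a tag or element matched: keep it verbatim.
    word === undefined ? match : replacement);
}

const test = '<span>hello</span> <hello>fox</hello> fox link hello my name is fox';
console.log(replacePlainWords(test, ['fox', 'hello'], 'foo'));
// <span>hello</span> <hello>fox</hello> foo link foo my name is foo
```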

Only match regex if it doesnt start with a pattern in javascript

I have a bit of a strange one here, I basically have a large chunk of text which may or may not contain links to images.
So let's say it does. I have a pattern which extracts the image URL fine; however, once a match is found it is replaced with an img element with the link as the src. The problem is that there may be multiple matches within the text, and this is where it gets tricky: the URL pattern will now also match the URL in the src attribute, which basically results in an infinite loop.
So is there a way to ONLY match in regex if it doesn't start with a pattern like ="|=' ? Then it would match the URL in something like:
some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6
but not
some image <img src="http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6">
I am not sure if it is possible, but if it is could someone point me in the right direction? A replace by itself will not suffice in this scenario as the url matched needs to be used elsewhere too so it needs to be used like a capture.
The main scenarios I need to account for are:
Many links in one block of varied text
A single link without any other text
A single link with other varied text
== edit ==
Here is the current regex I am using to match urls:
(\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))
== edit 2 ==
Just so everyone understands why I cannot use the /g command here is an answer which explains the issue, if I could use this /g like I originally tried then it would make things a lot simpler.
Javascript regex multiple captures again
What you are looking for is a negative look behind, but Javascript doesn't support any kind of look behinds, so you will either have to use a callback function to check what was matched and make sure it is not preceded by a ' or ", or you can use the following regex:
(?:^|[^"'])(\b(https?|ftp|file):\/\/[-a-zA-Z0-9+&##\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))
which has a single problem, that is in the case of a successful match it will catch one more character, the one right before the (\b(https?|ftp|file) pattern in the input, but I think you can deal with this easily.
Regex101 Demo
Using the /ig command at the end should work... the g is for global replace and the i is for case-insensitivity, which is necessary as you've only got A-Z instead of a-zA-Z.
Using the following vanilla JS appears to work for me (see jsfiddle)...
var test="some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6";
var re = new RegExp(/(\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/ig);
document.getElementById("output").innerHTML = test.replace(re,"<img src=\"$1\"/>");
Although, what it does highlight is that the query string part of the URL (the ?v=6) is not being picked up by your regex.
For jQuery, it would be (see jsfiddle)...
$(document).ready(function(){
var test="some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6";
var re = new RegExp(/(\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/ig);
$("#output").html(test.replace(re,"<img src=\"$1\"/>"));
});
Update
Just in case my example of using the same image URL in the example doesn't convince you - it also works with different URLs... see this jsfiddle update
var test="http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 http://cdn.sstatic.net/serverfault/img/sprites.png?v=7";
var re = new RegExp(/(\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/ig);
document.getElementById("output").innerHTML = test.replace(re,"<img src=\"$1\"/>");
Couldn't you just check whether there is whitespace in front of the URL, instead of that word boundary? It seems to work, although you will have to remove the matched whitespace later.
(\s(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))
http://rubular.com/r/9wSc0HNWas
Edit: Damn, too slow :) I'll still leave this here as my regex is shorter ;)
As freefaller said, you might use the /g flag to find all matches in one go, if exec is not a must.
Otherwise, you can add (="|=')? to the beginning of your regex and check whether $1 is undefined. If it is undefined, the match did not start with a ="|=' pattern.
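The optional-prefix check can be combined with a replace callback, roughly as follows. This is a sketch under assumptions: the function name embedImages is made up, and the pattern is a simplified variant of the one in the question that also picks up a trailing query string:

```javascript
// Group 1: optional =" or =' prefix (the URL already sits in an attribute).
// Group 2: the image URL, including an optional ?query part.
const imageUrlRe = /(=["'])?\b((?:https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*\.(?:png|jpe?g|gif|bmp)(?:\?[^\s"'<>]*)?)/gi;

// Hypothetical helper: wrap bare image URLs in <img>, skip ones already in attributes.
function embedImages(text) {
  return text.replace(imageUrlRe, (match, attrPrefix, url) =>
    // attrPrefix is undefined for a bare URL, so only those get wrapped.
    attrPrefix ? match : `<img src="${url}">`);
}

const once = embedImages('some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6');
const twice = embedImages(once);
console.log(once);
console.log(once === twice); // true
```

Running the function a second time leaves the text unchanged, which is what breaks the loop described in the question.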

How to end a regular expression with a forward slash in Javascript

Problem
I am trying to match the hash part of a URL using Javascript. The hash will have the format
/#\/(.*)\//
This is easy to achieve using "new RegExp()" method of creating a JS regular expression, but I can't figure out how to do it using the standard format, because the two forward slashes at the end begin a comment. Is there another way to write this that won't start a comment?
Example
// works
myRegexp = new RegExp ('#\/(.*)\/');
// fails
myRegexp = /#\/(.*)\//
I am trying to match the hash part of a URL using Javascript.
Yeah, don't do that. There's a perfectly good URL parser built into every browser. Set an href on a location object (window.location or a link) and you can read/write URL parts from properties hostname, pathname, search, hash etc.
var a= document.createElement('a');
a.href= 'http://www.example.com/foo#bar#bar';
alert(a.hash); // #bar#bar
If you're putting a path-like /-separated list in the hash, I'd suggest hash.split('/') to follow.
As for the regex, both versions work identically for me. The trailing // does not cause a comment. If you just want to appease some dodgy syntax highlighting, you could potentially escape the / to \x2F.
It is not starting a comment, just like two slashes in a string. Look here: http://jsfiddle.net/Gr2qb/2/
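A quick sketch to check both points: the literal ending in \// parses as a regex (not a comment) and behaves like the constructor form, and the built-in URL class can supply the hash directly. The sample URL is an assumption made up for illustration:

```javascript
// Literal form: the final "/" closes the regex, so "\//" is not a comment.
const fromLiteral = /#\/(.*)\//;
// Constructor form, with backslashes doubled inside the string.
const fromConstructor = new RegExp('#\\/(.*)\\/');

// Assumed sample URL with a path-like hash.
const sampleUrl = 'http://www.example.com/foo#/section/item/';
const hash = new URL(sampleUrl).hash; // '#/section/item/'

console.log(fromLiteral.exec(hash)[1]);     // section/item
console.log(fromConstructor.exec(hash)[1]); // section/item

// Without a regex: drop the leading '#', split on '/', discard empty pieces.
const parts = hash.slice(1).split('/').filter(Boolean);
console.log(parts); // [ 'section', 'item' ]
```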
