RegEx, look for URL's where it does not start with =" - javascript

I am trying to build a function that finds URLs in strings and turns them into links. But I do not want to match URLs that are already inside an HTML tag (such as <A> or <IMG>).
In other words, the RegEx should find these and replace each with a link:
http://www.stackoverflow.com
www.stackoverflow.com
www.stackoverflow.com/logo.gif
But not these URLs (since they are already formatted):
<a href="http://www.stackoverflow.com">http://www.stackoverflow.com</a>
<img src="http://www.stackoverflow.com/logo.gif">
I am using an existing RegEx for this, but it does not check whether the URL is already inside an HTML element. (http://blog.mattheworiordan.com/post/13174566389/url-regular-expression-for-links-with-or-without)
This is the original RegEx:
/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[\-;:&=\+\$,\w]+#)?[A-Za-z0-9\.\-]+|(?:www\.|[\-;:&=\+\$,\w]+#)[A-Za-z0-9\.\-]+)((?:\/[\+~%\/\.\w\-_]*)?\??(?:[\-\+=&;%#\.\w_]*)#?(?:[\.\!\/\\\w]*))?)/
This is the same RegEx with explanations:
(
( // brackets covering match for protocol (optional) and domain
([A-Za-z]{3,9}:(?:\/\/)?) // match protocol, allow in format http:// or mailto:
(?:[\-;:&=\+\$,\w]+#)? // allow something# for email addresses
[A-Za-z0-9\.\-]+ // anything looking at all like a domain, non-unicode domains
| // or instead of above
(?:www\.|[\-;:&=\+\$,\w]+#) // starting with something# or www.
[A-Za-z0-9\.\-]+ // anything looking at all like a domain
)
( // brackets covering match for path, query string and anchor
(?:\/[\+~%\/\.\w\-]*) // allow optional /path
?\??(?:[\-\+=&;%#\.\w]*) // allow optional query string starting with ?
#?(?:[\.\!\/\\\w]*) // allow optional anchor #anchor
)? // make URL suffix optional
)
What I am trying to do is change this so that it checks whether the URL is immediately preceded by exactly =" or >, and if it is, the URL should not be matched. URLs inside <A> and <IMG> elements should have one of these combinations right before them.
I am not great at RegEx, but this is my best attempt so far, and it does not do the trick:
/(((^[^\="|\>])([A-Za-z]{3,9}:(?:\/\/)?)(?:[\-;:&=\+\$,\w]+#)?[A-Za-z0-9\.\-]+|(?:www\.|[\-;:&=\+\$,\w]+#)[A-Za-z0-9\.\-]+)((?:\/[\+~%\/\.\w\-]*)?\??(?:[\-\+=&;%#\.\w]*)#?(?:[\.\!\/\\\w]*))?)/g;
It is this part I have added:
(^[^\="|\>])
This is my fiddle:
http://jsfiddle.net/0w1g4mm9/2/

You could try something like this:
string.replace(
/(<a[^>]*>[^<]*<\/a>)|YOUR_REGEX_HERE/g,
function(match, link, YOUR_CAPTURE_GROUP_1, etc) {
if (link) {
return link
}
return YOUR_DESIRED_REPLACEMENT
}
)
The above matches either already valid <a> tags or the URL-looking strings you
are looking for, whichever comes first. A capturing group is used to detect
which of the two was matched. If a valid link was matched, simply return it
unmodified. Otherwise return your desired replacement.
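To make the template above concrete, here is a minimal sketch. The URL pattern is a deliberately simplified stand-in (an assumption, not the regex from the question), and `linkify` is a name made up for the example. The first alternative also swallows any other tag, so URLs inside attributes such as src="..." are skipped as well.

```javascript
// Hypothetical, simplified URL pattern used only for this sketch.
var SIMPLE_URL = /(?:https?:\/\/|www\.)[^\s<>"]+/.source;

// Group 1: an existing <a>...</a> element, or any other tag.
// Group 2: a bare URL that should become a link.
var LINKIFY = new RegExp(
  '(<a\\b[^>]*>[^<]*<\\/a>|<[^>]+>)|(' + SIMPLE_URL + ')',
  'gi'
);

function linkify(html) {
  return html.replace(LINKIFY, function (match, existing, url) {
    if (existing) {
      return existing; // already markup: return it unmodified
    }
    // Give "www." style URLs a scheme so the href is absolute.
    var href = /^https?:\/\//i.test(url) ? url : 'http://' + url;
    return '<a href="' + href + '">' + url + '</a>';
  });
}

console.log(linkify('see www.stackoverflow.com/logo.gif and <img src="http://www.stackoverflow.com/logo.gif">'));
```

Whichever alternative starts first at a given position wins, which is why the tag branch has to come before the URL branch. Trailing punctuation (for example a full stop right after the URL) would still be swallowed by this simple pattern.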

A different approach, which got kind of ugly. I iterate through all matches, rebuild the source HTML for the non-matching parts, and for each match I check the character at matchIndex - 1 to decide whether to add the link tag.
This has the advantage that the already crazy complicated regexp does not get any more complicated, and you can use full JavaScript to check whether the current string is part of an HTML element.
If you factor out the iteration code it might even end up looking nice.
var urlRegEx = /((([A-Za-z]{3,9}:(?:\/\/)?)(?:[\-;:&=\+\$,\w]+#)?[A-Za-z0-9\.\-]+|(?:www\.|[\-;:&=\+\$,\w]+#)[A-Za-z0-9\.\-]+)((?:\/[\+~%\/\.\w\-]*)?\??(?:[\-\+=&;%#\.\w]*)#?(?:[\.\!\/\\\w]*))?)/g;
var source = $('#source').html();
var dest = "";
var lastMatchEnd = 0;
var match; // declared so it does not leak into the global scope
while ((match = urlRegEx.exec(source)) !== null) {
dest += source.substring(lastMatchEnd, match.index);
var end = match.index + match[0].length;
var lastChar = source.charAt(match.index - 1); // '' when the match is at index 0
if (lastChar === '"' || lastChar === '>') { // already inside markup
dest += match[0];
} else {
dest += '<a href="' + match[0] + '">' + match[0] + '</a>';
}
lastMatchEnd = end;
}
dest += source.substring(lastMatchEnd);
$('#target').html(dest);
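Factoring out the iteration could look roughly like the sketch below. `mapMatches` and `linkifyLoose` are names invented for this example, and the URL pattern is a simplified stand-in rather than the long regex above. It works on plain strings, so it can be tried without jQuery or a DOM.

```javascript
// Walks every match of a global regex and lets `decide` choose the output
// for it; text between matches is copied through unchanged.
function mapMatches(source, regex, decide) {
  if (!regex.global) {
    throw new Error('mapMatches needs a regex with the g flag');
  }
  var out = '';
  var last = 0;
  var m;
  regex.lastIndex = 0; // start from the beginning even if the regex was used before
  while ((m = regex.exec(source)) !== null) {
    if (m[0] === '') { regex.lastIndex++; continue; } // never loop forever on empty matches
    out += source.substring(last, m.index);
    out += decide(m[0], source.charAt(m.index - 1)); // charAt(-1) is ''
    last = m.index + m[0].length;
  }
  return out + source.substring(last);
}

// Hypothetical simplified URL pattern, only for this sketch.
var simpleUrl = /https?:\/\/[^\s<>"]+/g;

function linkifyLoose(source) {
  return mapMatches(source, simpleUrl, function (url, prev) {
    if (prev === '"' || prev === "'" || prev === '>') {
      return url; // looks like it is already inside markup
    }
    return '<a href="' + url + '">' + url + '</a>';
  });
}

console.log(linkifyLoose('see http://a.com <img src="http://a.com/b.gif">'));
```

The check on the preceding character is the same heuristic as in the loop above, so it shares its blind spots (for instance a URL that directly follows some unrelated closing tag).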

Related

JavaScript replace words which aren't in a URL

I'm running a JavaScript which replaces certain words in my browser's text content.
However I do not wish for it to replace the words within url's.
UPDATE:
E.g., if I've replaced X with Y, and I search for X within a search engine, any url links with X in it are replaced with Y - I can't click on them as they don't exist (and/or they are incorrect).
document.body.innerHTML = document.body.innerHTML.replace(/word/gi, "newword");
How can I do this?
This is really hard to do in general (the problem is too broad), but I suggest doing it in these steps:
First, match all URLs and store them in an array (e.g. var urls = [];).
Then replace each URL with a unique character sequence that is certain not to occur in the page content (e.g. ~~~~~).
Next, do your classic replace, like document.body.innerHTML = document.body.innerHTML.replace(/word/gi, "newword");
Finally, find every occurrence of the special sequence (~~~~~) in the replaced content and swap each one back, in the same order, for the URLs stored in your array (urls).
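The steps above can be sketched on a plain string as follows. `replaceOutsideUrls` is a hypothetical helper name, the URL pattern is intentionally naive, and the placeholder uses the \u0000 character on the assumption that it never occurs in the input. Numbering the placeholders avoids having to rely on their order.

```javascript
// Replaces `pattern` in `text` everywhere except inside URLs.
function replaceOutsideUrls(text, pattern, replacement) {
  var urlRegex = /https?:\/\/[^\s"'<>]+/g; // deliberately simple URL pattern
  var urls = [];

  // Steps 1 and 2: stash each URL and leave a numbered placeholder behind.
  var masked = text.replace(urlRegex, function (url) {
    urls.push(url);
    return '\u0000' + (urls.length - 1) + '\u0000';
  });

  // Step 3: the ordinary replacement, which can no longer touch the URLs.
  var replaced = masked.replace(pattern, replacement);

  // Step 4: put the stored URLs back.
  return replaced.replace(/\u0000(\d+)\u0000/g, function (whole, index) {
    return urls[Number(index)];
  });
}

console.log(replaceOutsideUrls('word at http://example.com/word and Word', /word/gi, 'newword'));
```

One caveat: if `pattern` can match digits, it could alter the index inside a placeholder, so this sketch assumes word-like patterns only.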
Matching URLs:
About matching urls you need a good regex that matches urls. This is hard to do. See here, here and here:
...almost anything is a valid URL. There
are some punctuation rules for
splitting it up. Absent any
punctuation, you still have a valid
URL.
Check the RFC carefully and see if you
can construct an "invalid" URL. The
rules are very flexible.
For example ::::: is a valid URL.
The path is ":::::". A pretty
stupid filename, but a valid filename.
Also, ///// is a valid URL. The
netloc ("hostname") is "". The path
is "///". Again, stupid. Also
valid. This URL normalizes to "///"
which is the equivalent.
Something like "bad://///worse/////"
is perfectly valid. Dumb but valid.
Anyway, this answer is not meant to give you the best regex but rather a proof of how to do the string wrapping inside the text, with JavaScript.
OK so lets just use this one: /(https?:\/\/[^\s]+)/g
Again, this is a bad regex. It will have many false positives. However it's good enough for this example.
function urlify(text) {
var urlRegex = /(https?:\/\/[^\s]+)/g;
return text.replace(urlRegex, function(url) {
return '<a href="' + url + '">' + url + '</a>';
})
// or alternatively
// return text.replace(urlRegex, '<a href="$1">$1</a>')
}
var text = "Find me at http://www.example.com and also at http://stackoverflow.com";
var html = urlify(text);
// html now looks like:
// "Find me at <a href="http://www.example.com">http://www.example.com</a> and also at <a href="http://stackoverflow.com">http://stackoverflow.com</a>"
So in sum try:
$$('#pad dl dd').each(function(element) {
element.innerHTML = urlify(element.innerHTML);
});
I hope that it will do at least a little help for you.
Here is a simple solution:
1. Replace all "word"s in URLs with a "tempuniqueflag" (note that word must not be a substring of tempuniqueflag)
var urls = document.querySelectorAll('a');
for (var i = 0; i < urls.length; i++) {
if (typeof urls[i].href === "string") {
urls[i].href = urls[i].href.replace(/word/g, "tempuniqueflag");
}
}
2. Replace your text content as usual
document.body.innerHTML = document.body.innerHTML.replace(/word/gi, "newword");
3. Bring back the original word in the URLs
// Assigning innerHTML rebuilds the DOM, so query the links again
urls = document.querySelectorAll('a');
for (var j = 0; j < urls.length; j++) {
if (typeof urls[j].href === "string") {
urls[j].href = urls[j].href.replace(/tempuniqueflag/g, "word");
}
}

Removing letters located between two specific strings

I want to make sure that the URL I get from window.location does not already contain a specific fragment identifier. If it does, I must remove it. So I must search the URL and find the string that starts with #mp- and continues until the end of the URL or the next # (just in case the URL contains more than one fragment identifier).
Examples of inputs and outputs:
www.site.com/#mp-1 --> www.site.com/
www.site.com#mp-1 --> www.site.com
www.site.com/#mp-1#pic --> www.site.com/#pic
My code:
(that obviously does not work correctly)
var url = window.location;
if(url.toLowerCase().indexOf("#mp-") >= 0){
var imgString = url.substring(url.indexOf('#mp-') + 4,url.indexOf('#'));
console.log(imgString);
}
Any idea how to do it?
Something like this? This uses a regular expression to filter the unwanted string.
var inputs = [
"www.site.com/#mp-1",
"www.site.com#mp-1",
"www.site.com/#mp-1#pic"
];
inputs = inputs.map(function(input) {
return input.replace(/#mp-1?/, '');
});
console.log(inputs);
Output:
["www.site.com/", "www.site.com", "www.site.com/#pic"]
jsfiddle: https://jsfiddle.net/tghuye75/
The regex I used, /#mp-1?/, removes only the strings #mp- and #mp-1. For a string of unknown length up to the next hash, you can use /#mp-[^#]*/, which removes #mp-, #mp-1, and #mp-somelongstring.
Use regular expressions:
var url = window.location.href; // window.location itself is an object, not a string
var imgString = url.replace(/(#mp-[^#\s]+)/, "");
It removes everything from #mp- up to the character before the next # (or the end of the URL).
Regex101 demo
You can use .replace to replace a regular expression matching ("#mp-" followed by 0 or more non-# characters) with the empty string. If it's possible there are multiple segments you want to remove, just add a g flag to the regex.
url = url.replace(/#mp-[^#]*/, '');
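As a quick check against the examples from the question, here is a small sketch. `stripMpFragment` is a name made up for this example, and it takes the URL as a plain string (in a browser that would be window.location.href).

```javascript
// Removes every "#mp-..." fragment, up to the next "#" or the end of the string.
function stripMpFragment(url) {
  return url.replace(/#mp-[^#]*/g, '');
}

var inputs = [
  'www.site.com/#mp-1',
  'www.site.com#mp-1',
  'www.site.com/#mp-1#pic'
];

inputs.forEach(function (input) {
  console.log(input + ' --> ' + stripMpFragment(input));
});
```

The g flag only matters when more than one #mp- fragment can occur; for a single fragment the result is the same without it.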
window.location has a hash property, so you can work on window.location.hash directly.
The most primitive way is to declare
var char_start, char_end
and find the two "#" characters; if there is only one, the end of the input acts as the second.
With those positions you can do what you want. Note that changing window.location.hash will normally update the browser's address bar.
Good luck!

Removing a query string using regex in java script

I need to remove a query parameter that comes with a REST API call. Below are the sample URLs that need to be considered. In each of these URLs, we need to remove the 'key' parameter and its value.
/test/v1?key=keyval&param1=value1&param2=value2
/test/v1?key=keyval
/test/v1?param1=value1&key=keyval
/test/v1?param1=value1&key=keyval&param2=value2
After removing the key parameter, the final URLs should be as follows.
/test/v1?param1=value1&param2=value2
/test/v1?
/test/v1?param1=value1
/test/v1?param1=value1&param2=value2
We used below regex expression to match and replace this query string in php. (https://regex101.com/r/pK0dX3/1)
(?<=[?&;])key=.*?($|[&;])
We couldn't use the same regex in JavaScript; it gives syntax errors there. Can you please help us figure out what is wrong with it? How can we change this regex to match and remove the query parameter as described above?
Lookbehind was not supported in JavaScript when this was written (it was only added in ES2018), hence your regex won't work in engines that lack it.
In Javascript you can use this:
repl = input.replace(/(\?)key=[^&]*(?:&|$)|&key=[^&]*/gmi, '$1');
RegEx Demo
Regex is working on 2 paths using regex alternation:
If this query parameter is right after ? then we grab till & after parameter and place ? back in replacement.
If this query parameter is after & then &key=value is replaced by an empty string.
The regex works in PHP but not in Javascript because Javascript does not support lookbehind.
The easiest fix here would be to replace the lookbehind (?<=[?&;]) with the equivalent characters in a capturing group ([?&;]) and use a backreference ($1) to insert this bit back into the replacement string.
For example:
var path = '/test/v1?key=keyval&param1=value1&param2=value2';
var regex = /([?&;])key=.*?($|[&;])/;
console.log(path.replace(regex, '$1')); // outputs '/test/v1?param1=value1&param2=value2'
Not convinced regex would be the most reliable way of removing a query parameter, but that's a different story :-)
Just in case you want to do it without a regex, here is a function that will do the trick:
var removeQueryString = function (str) {
var qm = str.lastIndexOf('?');
var path = str.substr(0, qm + 1);
var querystr = str.substr(qm + 1);
var params = querystr.split('&');
var keyIndex = -1;
for (var i = 0; i < params.length; i++) {
if (params[i].indexOf("key=") === 0) {
keyIndex = i;
break;
}
}
if (keyIndex != -1) {
params.splice(keyIndex, 1);
}
var result = path + params.join('&');
return result;
};
The lookbehind feature isn't available in JavaScript, so to test the character before the key/value pair, you must match it. To make the pattern work whatever the position in the query part of the URL, you can use an alternation in a non-capturing group and capture the question mark:
url = url.replace(/(?:&|(\?))key=[^&#]*(?:(?!\1).)?/, '$1');
Note: the # is excluded from the character class to prevent the fragment part (if any) of the URL from being matched as part of the key's value.
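To check the lookbehind-free approach against all four URLs from the question, here is a small sketch. `removeKeyParam` is a name invented for this example and the parameter name "key" is hard-coded. It uses the two-branch alternation with a captured "?", not the more compact pattern just above.

```javascript
// Branch 1: "?key=value" plus the "&" that follows (if any); the "?" is kept via $1.
// Branch 2: "&key=value" anywhere later in the query string; $1 is empty here.
function removeKeyParam(url) {
  return url.replace(/(\?)key=[^&]*(?:&|$)|&key=[^&]*/g, '$1');
}

var samples = [
  '/test/v1?key=keyval&param1=value1&param2=value2',
  '/test/v1?key=keyval',
  '/test/v1?param1=value1&key=keyval',
  '/test/v1?param1=value1&key=keyval&param2=value2'
];

samples.forEach(function (sample) {
  console.log(removeKeyParam(sample));
});
```

Because each branch requires "?" or "&" directly in front of "key=", a parameter such as "monkey=1" is left alone. A fragment after the query string is not handled by this sketch.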

Matching hashes using regex, but not when they are part of an url

I am struggling with a regex in javascript that needs the text after # to the first word boundary, but not match it if it is part of an url. So
#test - should match test
sometext#test2 - should match test2
xx moretext#test3 - should match test3
http://test.com#tab1 - should not match tab1
I am replacing the text after the hash with a link (but not the hash character itself). There can be more than one hash in the text, and it should match them all (I guess I should use /g for that).
Matching the part after the hash is quite easy: /#\b(.+?)\b/g, but not matching it if the string itself starts with "http" is something I cannot solve. I should probably use a negative look-around, but I am having problems getting my head around that.
Any help is greatly appreciated!
Try this regex using a negative lookahead instead since JS doesn't support lookbehinds:
/^(?!http:\/\/).*#\b(.+?)\b/
You may want to check for www too, depending on your conditions.
Edit: Then you can do this:
str = str.replace(re.exec(str)[1], 'replaced!');
http://jsfiddle.net/j7c79/2/
Edit 2: Sometimes a regex alone is not the way to go if it gets too complicated. Try a different approach:
var txt = "asdfgh http://asdf#test1 #test2 woot#test3";
function replaceHashWords(str, rep) {
var isUrl = /^http/.test(str), result = [];
!isUrl && str.replace(/#\b(.+?)\b/g, function(a,b){ result.push(b); });
return str.replace((new RegExp('('+ result.join('|') +')','g')), rep);
}
alert(replaceHashWords(txt, 'replaced!'));
// asdfgh http://asdf#replaced! #replaced! woot#replaced!
As regex is often (if not always) quite expensive to use, I'd suggest using basic string and array methods to determine whether a given set of characters represents a URL (though I'm assuming that all URLs will start with the http string):
$('ul li').each(
function() {
var t = $(this).text(),
words = t.split(/\s+/),
foundHashes = [],
word = '';
for (var i = 0, len = words.length; i < len; i++) {
word = words[i];
if (word.indexOf('http') == -1 && word.indexOf('#') !== -1) {
var match = word.substring(word.indexOf('#') + 1);
foundHashes.push(match);
}
}
// the following just shows what, if anything, was found
// and can definitely be safely omitted
if (foundHashes.length) {
var newSpan = $('<span />', {
'class': 'matchedWords'
}).text(foundHashes.join(', ')).appendTo($(this));
}
});
JS Fiddle demo (with some timing information printed to the console).
This would require a lookbehind, something sadly lacking from JavaScript's capabilities.
However, if your subject string is some HTML and those URLs are in href attributes, you can create a document out of it and search for text nodes, only replacing their nodeValues instead of the whole HTML string.
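For plain text, one way to get by without a lookbehind is to let the regex match URLs as well and simply hand them back unchanged. The sketch below assumes URLs start with http:// or https:// and contain no whitespace; `wrapHashWords` and its `wrap` callback are names made up for the example.

```javascript
// Group 1: a whole URL (returned untouched, including any "#fragment").
// Group 2: the word after a "#" outside of a URL.
function wrapHashWords(text, wrap) {
  return text.replace(/(https?:\/\/\S+)|#(\w+)/g, function (match, url, word) {
    if (url) {
      return url; // part of a URL: leave it alone
    }
    return '#' + wrap(word); // keep the hash itself, replace only the word
  });
}

var sample = '#test sometext#test2 xx moretext#test3 http://test.com#tab1';
console.log(wrapHashWords(sample, function (word) {
  return '[' + word + ']';
}));
```

In real use, `wrap` would return the link markup for the word instead of the brackets used here.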

Match filename and file extension from single Regex

I'm sure this must be easy enough, but I'm struggling...
var regexFileName = /[^\\]*$/; // match filename
var regexFileExtension = /(\w+)$/; // match file extension
function displayUpload() {
var path = $el.val(); //This is a file input
var filename = path.match(regexFileName); // returns file name
var extension = filename[0].match(regexFileExtension); // returns extension
console.log("The filename is " + filename[0]);
console.log("The extension is " + extension[0]);
}
The function above works fine, but I'm sure it must be possible to achieve with a single regex, by referencing different parts of the array returned with the .match() method. I've tried combining these regex but without success.
Also, I'm not using a string to test it on in the example, as console.log() escapes the backslashes in a filepath and it was starting to confuse me :)
Assuming that all files do have an extension, you could use
var regexAll = /[^\\]*\.(\w+)$/;
Then you can do
var total = path.match(regexAll);
var filename = total[0];
var extension = total[1];
Use /^.*\/(.*)\.(.*)$/; the first group is your file name and the second group is the extension.
var myString = "filePath/long/path/myfile.even.with.dotes.TXT";
var myRegexp = /^.*\/(.*)\.(.*)$/g;
var match = myRegexp.exec(myString);
alert(match[1]); // myfile.even.with.dotes
alert(match[2]); // TXT
This works even if your filename contains more than one dot. A file name without any dot (no extension) is not handled: exec returns null when there is no dot after a slash.
EDIT:
This is for Linux; for Windows use /^.*\\(.*)\.(.*)$/g (the directory separator is / on Linux and \ on Windows).
You can use groups in your regular expression for this:
var regex = /^([^\\]*)\.(\w+)$/;
var matches = filename.match(regex);
if (matches) {
var filename = matches[1];
var extension = matches[2];
}
I know this is an old question, but here's another solution that can handle multiple dots in the name and also when there's no extension at all (or an extension of just '.'):
/^(.*?)(\.[^.]*)?$/
Taking it a piece at a time:
^
Anchor to the start of the string (to avoid partial matches)
(.*?)
Match any character ., 0 or more times *, lazily ? (don't just grab them all if the later optional extension can match), and put them in the first capture group ( ).
(\.
Start a 2nd capture group for the extension using (. This group starts with the literal . character (which we escape with \ so that . isn't interpreted as "match any character").
[^.]*
Define a character set []. Match characters not in the set by specifying this is an inverted character set ^. Match 0 or more non-. chars to get the rest of the file extension *. We specify it this way so that it doesn't match early on filenames like foo.bar.baz, incorrectly giving an extension with more than one dot in it of .bar.baz instead of just .baz.
Note that . doesn't need to be escaped inside [], since most characters (apart from ^, -, ] and \) are literal in a character set.
)?
End the 2nd capture group ) and indicate that the whole group is optional ?, since it may not have an extension.
$
Anchor to the end of the string (again, to avoid partial matches)
If you're using ES6 you can even use destructuring to grab the results in 1 line:
const [, filename, extension] = /^(.*?)(\.[^.]*)?$/.exec('foo.bar.baz');
which gives the filename as 'foo.bar' and the extension as '.baz'.
'foo' gives 'foo' and undefined (the optional extension group does not participate in the match)
'foo.' gives 'foo' and '.'
'.js' gives '' and '.js'
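Wrapped in a function, this could look like the sketch below. `splitFileName` is a hypothetical name, and it expects a bare file name without directories. An optional group that does not match comes back as undefined from exec, so the sketch normalizes that to an empty string.

```javascript
// Splits a bare file name into base name and extension (including the dot).
function splitFileName(name) {
  var m = /^(.*?)(\.[^.]*)?$/.exec(name);
  if (m === null) {
    // Only happens for input containing line breaks, which "." does not match.
    return { base: name, extension: '' };
  }
  return {
    base: m[1],
    extension: m[2] === undefined ? '' : m[2]
  };
}

['foo.bar.baz', 'foo', 'foo.', '.js'].forEach(function (name) {
  var parts = splitFileName(name);
  console.log(JSON.stringify(name) + ' -> ' + JSON.stringify(parts));
});
```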
This will recognize even /home/someUser/.aaa/.bb.c:
function splitPathFileExtension(path){
var parsed = path.match(/^(.*\/)(.*)\.(.*)$/);
return [parsed[1], parsed[2], parsed[3]];
}
I think this is a better approach, as it matches only valid directory names, file names and extensions, and it also groups the path, file name and file extension. It also works when the path is empty and only a file name is given.
^([\w\/]*?)([\w\.]*)\.(\w+)$
Test cases
the/p0090Aath/fav.min.icon.png
the/p0090Aath/fav.min.icon.html
the/p009_0Aath/fav.m45in.icon.css
fav.m45in.icon.css
favicon.ico
Output
[the/p0090Aath/][fav.min.icon][png]
[the/p0090Aath/][fav.min.icon][html]
[the/p009_0Aath/][fav.m45in.icon][css]
[][fav.m45in.icon][css]
[][favicon][ico]
(?!\w+).(\w+)(\s)
(?!\w+) is a negative lookahead intended to keep the leading word(s) out of the result, . stands for the delimiter (unescaped, so it matches any character), (\w+) captures the first word after it, and (\s) matches the whitespace that follows, so any later words are ignored.
