JavaScript replace words which aren't in a URL - javascript

I'm running a JavaScript snippet that replaces certain words in the text content of pages in my browser.
However, I don't want it to replace those words inside URLs.
UPDATE:
E.g., if I've replaced X with Y and then search for X in a search engine, any links with X in the URL are changed to Y, so I can't follow them because the rewritten URLs don't exist (or point to the wrong place).
document.body.innerHTML = document.body.innerHTML.replace(/word/gi, "newword");
How can I do this?

This is hard to do reliably (the problem as stated is quite broad), but I suggest these steps:
First, match all URLs and store them in an array (e.g. var urls = [];).
Then replace each URL with a unique character sequence that is certain not to appear in the page content (e.g. ~~~~~).
Next, do your classic replace, like document.body.innerHTML = document.body.innerHTML.replace(/word/gi, "newword");
Finally, find each placeholder sequence (~~~~~) in the replaced content and substitute the URLs stored in the array (urls) back in, in the same order.
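The steps above can be sketched on a plain string as follows. This is only an illustration under stated assumptions: the function name replaceOutsideUrls is hypothetical, the URL regex is deliberately naive, and instead of the ~~~~~ sequence it uses an indexed placeholder wrapped in NUL characters (assumed not to occur in the page) so that the restore step does not depend on ordering.

```javascript
// Sketch: replace a word everywhere except inside URLs.
// Assumption: URLs start with http(s):// and end at whitespace, a quote or an angle bracket.
function replaceOutsideUrls(text, wordRegex, replacement) {
  const urlRegex = /https?:\/\/[^\s"'<>]+/g;
  const urls = [];

  // Steps 1 and 2: stash every URL and leave an indexed placeholder behind.
  const masked = text.replace(urlRegex, (url) => {
    urls.push(url);
    return `\u0000${urls.length - 1}\u0000`;
  });

  // Step 3: the ordinary replace, which can no longer see the URLs.
  const replaced = masked.replace(wordRegex, replacement);

  // Step 4: put the stored URLs back by index.
  return replaced.replace(/\u0000(\d+)\u0000/g, (_, i) => urls[Number(i)]);
}

const sample = 'word at <a href="http://example.com/word">Word</a>';
console.log(replaceOutsideUrls(sample, /word/gi, "newword"));
```

In a browser this could be applied to document.body.innerHTML, with the caveat that relative URLs are not protected by this regex.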
Matching URLs:
To match URLs you need a good regex, and that is hard to write. As one answer on the subject explains:
...almost anything is a valid URL. There
are some punctuation rules for
splitting it up. Absent any
punctuation, you still have a valid
URL.
Check the RFC carefully and see if you
can construct an "invalid" URL. The
rules are very flexible.
For example ::::: is a valid URL.
The path is ":::::". A pretty
stupid filename, but a valid filename.
Also, ///// is a valid URL. The
netloc ("hostname") is "". The path
is "///". Again, stupid. Also
valid. This URL normalizes to "///"
which is the equivalent.
Something like "bad://///worse/////"
is perfectly valid. Dumb but valid.
Anyway, this answer is not meant to give you the best regex, but rather to demonstrate how to wrap URLs found inside text with JavaScript.
OK, so let's just use this one: /(https?:\/\/[^\s]+)/g
Again, this is a bad regex. It will produce many false positives. However, it's good enough for this example.
function urlify(text) {
  var urlRegex = /(https?:\/\/[^\s]+)/g;
  // Wrap every match in an anchor that points at the matched URL
  return text.replace(urlRegex, function(url) {
    return '<a href="' + url + '">' + url + '</a>';
  })
  // or alternatively
  // return text.replace(urlRegex, '<a href="$1">$1</a>')
}
var text = "Find me at http://www.example.com and also at http://stackoverflow.com";
var html = urlify(text);
// html now looks like:
// "Find me at <a href="http://www.example.com">http://www.example.com</a> and also at <a href="http://stackoverflow.com">http://stackoverflow.com</a>"
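To illustrate the false positives mentioned above, here is a small sketch showing one of them: trailing punctuation gets swallowed into the match. The helper trimTrailingPunctuation is a hypothetical name, and the set of characters it strips is an assumption; it would also damage the rare URL that legitimately ends with one of them.

```javascript
const urlRegex = /(https?:\/\/[^\s]+)/g;
const sample = "See http://www.example.com, or (http://stackoverflow.com).";

// The naive regex keeps the comma and the closing ")." as part of the URLs.
const matches = sample.match(urlRegex);
console.log(matches);

// One possible mitigation (assumption): strip common trailing punctuation.
function trimTrailingPunctuation(url) {
  return url.replace(/[.,;:!?)\]]+$/, "");
}

const cleaned = matches.map(trimTrailingPunctuation);
console.log(cleaned);
```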
So in sum try:
$$('#pad dl dd').each(function(element) {
element.innerHTML = urlify(element.innerHTML);
});
I hope this helps at least a little.

Here is a simple solution:
1. Replace all "word"s in urls with a "tempuniqueflag" (Note that word is not a substring of tempuniqueflag)
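var urls = document.querySelectorAll('a[href]');
// Use an index loop: for...in over a NodeList also visits "length", "item", etc.
for (var i = 0; i < urls.length; i++) {
  // g flag so every occurrence is hidden; note the original casing is not preserved
  urls[i].href = urls[i].href.replace(/word/gi, "tempuniqueflag");
}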
2. Replace your text content as usual
document.body.innerHTML = document.body.innerHTML.replace(/word/gi, "newword");
3. Bring back the original word in the urls
// Assigning innerHTML rebuilds the DOM, so the links have to be queried again
var newUrls = document.querySelectorAll('a[href]');
for (var j = 0; j < newUrls.length; j++) {
  newUrls[j].href = newUrls[j].href.replace(/tempuniqueflag/g, "word");
}
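A different technique from the one above, sketched here only as an alternative, is to avoid innerHTML altogether and rewrite text nodes only, so attributes such as href are never touched. The function name replaceInTextNodes and the list of skipped tags are assumptions. The function only relies on nodeType, nodeName, nodeValue and childNodes, so it works on real DOM nodes as well as on the plain objects used in the demo.

```javascript
const TEXT_NODE = 3; // same value as Node.TEXT_NODE in browsers
const SKIPPED_TAGS = new Set(["SCRIPT", "STYLE", "TEXTAREA"]); // assumption: leave these alone

// Recursively replace words in text nodes; element attributes are left as they are.
function replaceInTextNodes(node, wordRegex, replacement) {
  if (node.nodeType === TEXT_NODE) {
    node.nodeValue = node.nodeValue.replace(wordRegex, replacement);
    return;
  }
  if (SKIPPED_TAGS.has(node.nodeName)) return;
  for (const child of Array.from(node.childNodes || [])) {
    replaceInTextNodes(child, wordRegex, replacement);
  }
}

// Demo on plain objects that mimic a tiny DOM tree.
const demoLink = {
  nodeType: 1, nodeName: "A", href: "http://example.com/word",
  childNodes: [{ nodeType: 3, nodeName: "#text", nodeValue: "a word" }],
};
const demoRoot = {
  nodeType: 1, nodeName: "BODY",
  childNodes: [{ nodeType: 3, nodeName: "#text", nodeValue: "Word up" }, demoLink],
};
replaceInTextNodes(demoRoot, /word/gi, "newword");
console.log(demoRoot.childNodes[0].nodeValue, "|", demoLink.childNodes[0].nodeValue, "|", demoLink.href);

// In a browser: replaceInTextNodes(document.body, /word/gi, "newword");
```

Link text that happens to display a URL is still rewritten with this approach, but the link target stays intact, so the link remains clickable.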

Related

How Can I Use Regex to Match All the Urls in This String?

I hope to use the following code to get all URLs from a string.
But I only get three URLs: http://www.google.com, https://www.twitter.com and www.msn.com.
I want the result to include every URL, bing.com too. How can I modify the var expression = /(https?:\/\/(?:www\.| ... ?
function openURLs() {
let links = "http://www.google.com Hello https://www.twitter.com The www.msn.com World bing.com";
if (links) {
var expression = /(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})/gi;
var url_array = links.match(expression);
if (url_array != null) {
url_array.forEach((url) => {
const urlOK = url.match(/^https?:/) ? url : '//' + url;
window.open(urlOK)
});
}
}
}
Going off of what you currently have, you can just append |[a-zA-Z0-9]+\.[^\s]{2,} to the end of your expression. The resulting line will look like this:
var expression = /(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})|[a-zA-Z0-9]+\.[^\s]{2,}/gi;
This could be cleaner, but it'll do what you're asking.
Edit:
If you're okay with something slightly more permissive that can pull the same URLs out, you can try this expression:
var expression = /(?:https?:\/\/)?(?:www\.)?[\w.-]+\.\S{2,}/gi;
A permissive regular expression may be the following:
var expression = /(https?:\/\/)?[a-zA-Z0-9]+\.[a-zA-Z0-9]+\S*/g;
This expression is simpler and easier to debug. Furthermore, it will match any website, including ones with query params (example.com?param=value) or with non-ASCII characters (example.com/你好).
On the other hand, it will match things that aren't websites as soon as they contain a dot, so things like foo.bar will be matched. However, there is no reliable way to detect whether strings like foo.bar are actually websites.
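A quick sketch of both behaviours, using the permissive expression with the g flag so that match() returns every hit. The prose sentence is a made-up example.

```javascript
const permissive = /(https?:\/\/)?[a-zA-Z0-9]+\.[a-zA-Z0-9]+\S*/g;

// The string from the question: all four URLs are found, bing.com included.
const links = "http://www.google.com Hello https://www.twitter.com The www.msn.com World bing.com";
const found = links.match(permissive) || [];
console.log(found);

// Anything with a dot in it matches too, whether or not it is a website.
const prose = "Open config.json and call foo.bar now";
const falsePositives = prose.match(permissive) || [];
console.log(falsePositives);
```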

Get base url from string with Regex and Javascript

I'm trying to get the base URL from a string (so no window.location).
It needs to remove the trailing slash
It needs to be regex (no new URL())
It needs to work with query parameters and anchor links
In other words, all of the following should return https://apple.com, or https://www.apple.com for the last one.
https://apple.com?query=true&slash=false
https://apple.com#anchor=true&slash=false
http://www.apple.com/#anchor=true&slash=true&whatever=foo
These are just examples; URLs can have different subdomains, e.g. https://shop.apple.co.uk/?query=foo should return https://shop.apple.co.uk. It could be any URL, like https://foo.bar
The closest I've got is:
const baseUrl = url.replace(/^((\w+:)?\/\/[^\/]+\/?).*$/,'$1').replace(/\/$/, ""); // Base Path & Trailing slash
But this doesn't work with anchor links and queries that start right after the host, without a / before them.
Any idea how I can get it to work in all cases?
You could add # and ? to your negated character class. You don't need .* because that will match until the end of the string.
For your example data, you could match:
^https?:\/\/[^#?\/]+
const strings = [
"https://apple.com?query=true&slash=false",
"https://apple.com#anchor=true&slash=false",
"http://www.apple.com/#anchor=true&slash=true&whatever=foo",
"https://foo.bar/?q=true"
];
strings.forEach(s => {
console.log(s.match(/^https?:\/\/[^#?\/]+/)[0]);
})
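The snippet above throws if a string does not match, because match() returns null. A minimal sketch of a null-safe wrapper around the same regex follows; the name getBaseUrl is hypothetical.

```javascript
// Returns the scheme plus host, or null when the input does not start with http(s)://
function getBaseUrl(url) {
  const m = /^https?:\/\/[^#?\/]+/.exec(url);
  return m ? m[0] : null;
}

console.log(getBaseUrl("https://apple.com?query=true&slash=false"));
console.log(getBaseUrl("https://shop.apple.co.uk/?query=foo"));
console.log(getBaseUrl("not a url"));
```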
You could use Web API's built-in URL for this. URL will also provide you with other parsed properties that are easy to get to, like the query string params, the protocol, etc.
Regex is a painful way to do something that the browser makes otherwise very simple.
I know that you asked about using regex, but in the event that you (or someone coming here in the future) really just cares about getting the information out and isn't committed to using regex, maybe this answer will help.
let one = "https://apple.com?query=true&slash=false"
let two = "https://apple.com#anchor=true&slash=false"
let three = "http://www.apple.com/#anchor=true&slash=true&whatever=foo"
let urlOne = new URL(one)
console.log(urlOne.origin)
let urlTwo = new URL(two)
console.log(urlTwo.origin)
let urlThree = new URL(three)
console.log(urlThree.origin)
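One thing to keep in mind with the URL approach is that the constructor throws on input it cannot parse. A small sketch, assuming a hypothetical helper named safeOrigin, that turns the exception into null:

```javascript
// Returns the origin (scheme + host + optional port), or null for unparseable input.
function safeOrigin(input) {
  try {
    return new URL(input).origin;
  } catch (err) {
    return null; // new URL() throws a TypeError for strings it cannot parse
  }
}

console.log(safeOrigin("http://www.apple.com/#anchor=true&slash=true&whatever=foo"));
console.log(safeOrigin("not a url"));
```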
const baseUrl = url.replace(/(.*:\/\/.*)[\?\/#].*/, '$1');
This will get you everything up to the .com part. You will have to append .com once you pull out the first part of the url.
^http.*?(?=\.com)
Or maybe you could do:
myUrl.replace(/(#|\?|\/#).*$/, "")
To remove everything after the host name.

RegEx, look for URL's where it does not start with ="

I am trying to build a function that finds URLs in strings and turns them into links. But I do not want to match URLs that are already inside an HTML tag (<A> and <IMG>, for example).
In other words, the RegEx should find these and replace them with links:
http://www.stackoverflow.com
www.stackoverflow.com
www.stackoverflow.com/logo.gif
But not these URLs (since they are already formatted):
<a href="http://www.stackoverflow.com">http://www.stackoverflow.com</a>
<img src="http://www.stackoverflow.com/logo.gif">
I am using a RegEx that is already developed for this, but it does not check if the URL is inside a HTML-element already. (http://blog.mattheworiordan.com/post/13174566389/url-regular-expression-for-links-with-or-without)
This is the original RegEx:
/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[\-;:&=\+\$,\w]+#)?[A-Za-z0-9\.\-]+|(?:www\.|[\-;:&=\+\$,\w]+#)[A-Za-z0-9\.\-]+)((?:\/[\+~%\/\.\w\-_]*)?\??(?:[\-\+=&;%#\.\w_]*)#?(?:[\.\!\/\\\w]*))?)/
This is the same RegEx with explanations:
(
( // brackets covering match for protocol (optional) and domain
([A-Za-z]{3,9}:(?:\/\/)?) // match protocol, allow in format http:// or mailto:
(?:[\-;:&=\+\$,\w]+#)? // allow something# for email addresses
[A-Za-z0-9\.\-]+ // anything looking at all like a domain, non-unicode domains
| // or instead of above
(?:www\.|[\-;:&=\+\$,\w]+#) // starting with something# or www.
[A-Za-z0-9\.\-]+ // anything looking at all like a domain
)
( // brackets covering match for path, query string and anchor
(?:\/[\+~%\/\.\w\-]*) // allow optional /path
?\??(?:[\-\+=&;%#\.\w]*) // allow optional query string starting with ?
#?(?:[\.\!\/\\\w]*) // allow optional anchor #anchor
)? // make URL suffix optional
)
What I am trying to do is change this so that it checks whether the URL is immediately preceded by =" or >, and if it is, skips it, since a URL inside an <A> or <IMG> element should have one of these combinations right before it.
I am not the greatest at RegEx, but this is my best attempt so far, and it does not do the trick:
/(((^[^\="|\>])([A-Za-z]{3,9}:(?:\/\/)?)(?:[\-;:&=\+\$,\w]+#)?[A-Za-z0-9\.\-]+|(?:www\.|[\-;:&=\+\$,\w]+#)[A-Za-z0-9\.\-]+)((?:\/[\+~%\/\.\w\-]*)?\??(?:[\-\+=&;%#\.\w]*)#?(?:[\.\!\/\\\w]*))?)/g;
It is this part I have added:
(^[^\="|\>])
This is my fiddle:
http://jsfiddle.net/0w1g4mm9/2/
You could try something like this:
string.replace(
/(<a[^>]*>[^<]*<\/a>)|YOUR_REGEX_HERE/g,
function(match, link, YOUR_CAPTURE_GROUP_1, etc) {
if (link) {
return link
}
return YOUR_DESIRED_REPLACEMENT
}
)
The above matches either already valid <a> tags or the URL-looking strings you
are looking for, whichever comes first. A capturing group is used to detect
which of the two was matched. If a valid link was matched, simply return it
unmodified. Otherwise return your desired replacement.
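As a concrete sketch of this alternation technique, the example below plugs in a deliberately simple URL regex in place of YOUR_REGEX_HERE. The function name linkify and the extra branch that skips any other tag (so that src attributes of <img> are left alone) are assumptions added for illustration.

```javascript
// Group 1: things to leave alone (whole <a>...</a> elements, or any single tag).
// Group 2: a bare URL that should become a link.
function linkify(html) {
  const pattern = /(<a\b[^>]*>[\s\S]*?<\/a>|<[^>]+>)|(https?:\/\/[^\s<]+)/gi;
  return html.replace(pattern, (match, skip, url) =>
    skip ? skip : `<a href="${url}">${url}</a>`
  );
}

const input =
  'Visit http://a.example/x and <a href="http://b.example">http://b.example</a> ' +
  '<img src="http://c.example/logo.gif">';
console.log(linkify(input));
```

Because the regex engine tries the alternatives from left to right at each position, existing markup is consumed before the URL branch ever gets a chance to match inside it.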
A different approach, which got kind of ugly: I iterate through all matches and rebuild the source HTML; text between matches is copied as-is, and for each match I check the character at matchIndex - 1 to decide whether to add the link tag.
The advantage is that the already crazy complicated regexp does not get any more complicated, and you can use full JavaScript to check whether the current string is part of an HTML element.
If you factor out the iteration code it might even end up looking nice.
var urlRegEx = /((([A-Za-z]{3,9}:(?:\/\/)?)(?:[\-;:&=\+\$,\w]+#)?[A-Za-z0-9\.\-]+|(?:www\.|[\-;:&=\+\$,\w]+#)[A-Za-z0-9\.\-]+)((?:\/[\+~%\/\.\w\-]*)?\??(?:[\-\+=&;%#\.\w]*)#?(?:[\.\!\/\\\w]*))?)/g;
var source = $('#source').html();
var dest = "";
var lastMatchEnd = 0;
var match; // declared here so the loop below does not create an implicit global
while ((match = urlRegEx.exec(source)) != null) {
dest += source.substring(lastMatchEnd, match.index);
var end = match.index + match[0].length;
var lastChar = source.charAt(match.index - 1);
if(lastChar == '"' || lastChar == '>') { // inside link
dest += match[0];
} else {
dest += "<a href='" + match[0] + "'>" + match[0] + "</a>";
}
lastMatchEnd = end;
}
dest += source.substring(lastMatchEnd);
$('#target').html(dest);

change domain portion of links with javascript or jquery

Sorry for my original question being unclear, hopefully by rewording I can better explain what I want to do.
Because of this I need a way to use JavaScript (or jQuery) to do the following:
determine domain of the current page being accessed
identify all the links on the page that use the domain www.domain1.com and replace with www.domain2.com
i.e. if the user is accessing www.domain2.com/index then:
Test 1
should be rewritten dynamically on load to
Test 1
Is it even possible to rewrite only a portion of the URL in an href attribute?
Your code will loop over all links on the page. Here's a version that only iterates over URLs that need to be replaced.
var linkRewriter = function(a, b) {
$('a[href*="' + a + '"]').each(function() {
$(this).attr('href', $(this).attr('href').replace(a, b));
});
};
linkRewriter('originalDomain.com', 'rewrittenDomain.com');
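The string replace above also fires when the domain text appears in the path or query of some other site's URL. A sketch of a stricter variant, assuming a hypothetical helper named rewriteHost, parses the href with the built-in URL class and changes only the hostname:

```javascript
// Swap the hostname only when it is exactly fromHost; everything else is returned unchanged.
function rewriteHost(href, fromHost, toHost) {
  let url;
  try {
    url = new URL(href);
  } catch {
    return href; // relative or unparseable hrefs are left alone
  }
  if (url.hostname !== fromHost) return href;
  url.hostname = toHost;
  return url.toString();
}

console.log(rewriteHost("http://www.domain1.com/index?x=1#top", "www.domain1.com", "www.domain2.com"));
console.log(rewriteHost("http://other.com/www.domain1.com", "www.domain1.com", "www.domain2.com"));
```

With jQuery on the page, it could be applied as $('a[href]').each(function () { this.href = rewriteHost(this.href, 'www.domain1.com', 'www.domain2.com'); });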
I figured out how to make this work.
<script type="text/javascript">
// link rewriter
$(document).ready (
function link_rewriter(){
var hostadd = location.host;
var vendor = '999.99.999.9';
var localaccess = 'somesite1.';
if (hostadd == vendor) {
$("a[href]").each(function(){ // skip anchors that have no href attribute
var o = $(this);
var href = o.attr('href');
var newhref;
newhref = href.replace(/somesite1/i, "999.99.999.99");
o.attr('href',newhref);
});
}
}
);
</script>
You'll need to involve Java or something server-side to get the IP address. See this:
http://javascript.about.com/library/blip.htm
Replace urls domains using REGEX
This example will replace all urls using my-domain.com to my-other-domain (both are variables).
You can build dynamic regexes by combining string values and other regex expressions within a raw string template. Using String.raw keeps JavaScript from processing backslash escapes in the template's literal text.
// Strings with some data
const domainStr = 'my-domain.com'
const newDomain = 'my-other-domain.com'
// Make sure your string is regex friendly
// This replaces each dot with an escaped dot ('\.').
const regexUrl = /\./gm;
const substr = `\\\.`;
const domain = domainStr.replace(regexUrl, substr);
// domain is a regex friendly string: 'my-domain\.com'
console.log('Regex expression for domain', domain)
// HERE!!! You can assemble a complex regex using string pieces.
const re = new RegExp( String.raw `([\'|\"]https:\/\/)(${domain})(\S+[\'|\"])`, 'gm');
// now I'll use the regex expression groups to replace the domain
const domainSubst = `$1${newDomain}$3`;
// const page contains all the html text
const result = page.replace(re, domainSubst);
note: Don't forget to use regex101.com to create, test and export REGEX code.
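The example above escapes only dots. If the domain string could contain other regex metacharacters, a more general escape helper may be safer. The sketch below is one way to do it; the names escapeRegExp and replaceDomain are hypothetical, and the lookahead that ends the domain is an assumption about what may follow a host name.

```javascript
// Escape every character that has a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace oldDomain with newDomain in quoted https:// URLs inside an HTML string.
function replaceDomain(page, oldDomain, newDomain) {
  const re = new RegExp(`(["']https://)${escapeRegExp(oldDomain)}(?=[/"'?#:])`, "g");
  // A function replacement avoids surprises if newDomain contains "$".
  return page.replace(re, (match, prefix) => prefix + newDomain);
}

const page =
  `<a href="https://my-domain.com/a">x</a> ` +
  `<a href='https://my-domain.com'>y</a> ` +
  `<a href="https://my-domainXcom/a">z</a>`;
console.log(replaceDomain(page, "my-domain.com", "my-other-domain.com"));
```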

Regex: Getting content from URL

I want to get "the-game" using regex from URLs like
http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/another-one/another-one/
http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/another-one/
http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/
What parts of the URL could vary and what parts are constant? The following regex will always match whatever is in the slashes following "/en/" - the-game in your example.
(?<=/en/).*?(?=/)
This one will match the contents of the 2nd set of slashes of any URL containing "webdev", assuming the first set of slashes contains a 2 or 3 character language code.
(?<=.*?webdev.*?/.{2,3}/).*?(?=/)
Hopefully you can tweak these examples to accomplish what you're looking for.
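For reference, the first pattern can be tried in JavaScript as sketched below. This assumes an engine with lookbehind support (ES2018 or later), and the slashes are escaped because the pattern is written as a regex literal.

```javascript
const gameUrl = "http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/";

// (?<=\/en\/) requires "/en/" right before the match; (?=\/) stops it at the next slash.
const segmentMatch = gameUrl.match(/(?<=\/en\/).*?(?=\/)/);
const segment = segmentMatch ? segmentMatch[0] : null;
console.log(segment);
```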
var myregexp = /^(?:[^\/]*\/){4}([^\/]+)/;
var match = myregexp.exec(subject);
if (match != null) {
result = match[1];
} else {
result = "";
}
This matches whatever lies between the fourth and fifth slash and stores it in the variable result.
You probably should use some kind of url parsing library rather than resorting to using regex.
In Python:
from urllib.parse import urlparse
url = urlparse('http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/another-one/another-one/')
print(url.path)
Which would yield:
/en/the-game/another-one/another-one/another-one/
From there, you can do simple things like stripping /en/ from the beginning of the path. Otherwise, you're bound to do something wrong with a regular expression. Don't reinvent the wheel!
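The same idea can be sketched in JavaScript with the built-in URL class. The helper name pathSegment is hypothetical, and the assumption is that the wanted value is always the second path segment, right after the language code.

```javascript
// Split the path into non-empty segments and return the one at the given index.
function pathSegment(urlString, index) {
  const segments = new URL(urlString).pathname.split("/").filter(Boolean);
  return index < segments.length ? segments[index] : null;
}

const gameUrls = [
  "http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/another-one/another-one/",
  "http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/another-one/",
  "http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/",
];
gameUrls.forEach((u) => console.log(pathSegment(u, 1)));
```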
