Remove all urls in a string using Javascript - javascript

How can I remove all urls within a string regardless of where they appear using Javascript?
For example, for the following tweet-
"...Ready For It?" (#BloodPop ® Remix) out now - https://example.com/rsKdAQzd2q
I would like to get back
"...Ready For It?" (#BloodPop ® Remix) out now -

To remove all urls from the string, you can use regex to identify all the urls that are there in the string and then use String.prototype.replace to replace all the urls with empty characters.
This is John Grubber's Regex which can be used to match all urls.
/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/g
So to replace all the urls just run a replace with the above regex
let originalString = '"...Ready For It?" (#BloodPop ® Remix) out now - https://example.com/rsKdAQzd2q'
let newString = originalString.replace(/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/g,'')
console.log(newString)

If your urls do not contain a literal whitespace, you could use a regex https?.*?(?= |$) to match from http with an optional s to the next whitespace or end of the string:
var str = '...Ready For It?" (#BloodPop ® Remix) out now - https://example.com/rsKdAQzd2q';
str = str.replace(/https?.*?(?= |$)/g, "");
console.log(str);
Or split on a whitespace and check if the parts start with "http" and if so remove them.
var string = "...Ready For It?\" (#BloodPop ® Remix) out now - https://example.com/rsKdAQzd2q";
string = string.split(" ");
for (var i = 0; i < string.length; i++) {
if (string[i].substring(0, 4) === "http") {
string.splice(i, 1);
}
}
console.log(string.join(" "));

First you can split it by white space
var givenText = '...Ready For It?" https://example2.com/rsKdAQzd2q (#BloodPop ® Remix) out now - https://example.com/rsKdAQzd2q'
var allWords = givenText.split(' ');
Than you can filter out non url words using your own implementation for checking URL, here we can check index of :// for simplicity
var allNonUrls = allWords.filter(function(s){ return
s.indexOf('://')===-1 // you can call custom predicate here
});
So you non URL string will be:
var outputText = allNonUrls.join(' ');
// "...Ready For It?" (#BloodPop ® Remix) out now - "

You can use a regular expression replace on the string to do this, however, finding a good expression to match all URLs is awkward. However something like:
str = str.replace(regex, '');
The correct regex to use has been the subject of many StackOverflow questions, it depends on whether you need to match only http(s)://xxx.yyy.zzz or a more general pattern such as www.xxx.yyy.
See this question for regex patterns to use: What is the best regular expression to check if a string is a valid URL?

function removeUrl(input) {
let regex = /http[%\?#\[\]#!\$&'\(\)\*\+,;=:~_\.-:\/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789]*/;
let result = input.replace(regex, '');
return result;
}
let result = removeUrl('abc http://helloWorld" sdfsewr');

Related

Javascript regex to add to every character of a string a backslash

So as the title says I'd like to add to every character of a string a backslash, whether the string has special characters or not. The string should not be considered 'safe'
eg:
let str = 'dj%^3&something';
str = str.replace(x, y);
// str = '\d\j\%\^\3\&\s\o\m\e\t\h\i\n\g'
You could capture every character in the string with (.) and use \\$1 as replacement, I'm not an expert but basically \\ will render to \ and $1 will render to whatever (.) captures.
HIH
EDIT
please refer to Wiktor Stribiżew's comment for an alternative which will require less coding. Changes as follows:
str = str.replace(/(.)/g, '\\$1'); for str = str.replace(/./g, '\\$&');
Also, for future reference I strongly advice you to visit regexr.com when it comes to regular expressions, it's helped ME a lot
let str = 'dj%^3&something';
str = str.replace(/(.)/g, '\\$1');
console.log(str);
If you just want to display a string safely, you should just do:
let str = 'dj%^3&something';
let node = document.createTextNode(str);
let dest = document.querySelector('.whatever');
dest.appendChild(node);
And then you are guaranteed that it will be treated as text, and won't be able to execute a script or anything.
For example: https://jsfiddle.net/s6udj03L/1/
You can split the string to an array, add a \ to each element to the array, then joint the array back to the string that you wanted.
var str = 'dj%^3&something';
var split = str.split(""); // split string into array
for (var i = 0; i < split.length; i++) {
split[i] = '\\' + split[i]; // add backslash to each element in array
}
var joint = split.join('') // joint array to string
console.log(joint);
If you don't care about creating a new string and don't really have to use a regex, why not just iterate over the existing one and place a \ before each char. Notice to you have to put \\ to escape the first \.
To consider it safe, you have to encode it somehow. You could replace typical 'non-safe' characters like in the encode() function below. Notice how & get's replaced by &
let str = 'dj%^3&something';
let out = "";
for(var i = 0; i < str.length; i++) {
out += ("\\" + str[i]);
}
console.log(out);
console.log(encode(out));
function encode(string) {
return String(string).replace(/&/g, '&').replace(/</g, '<').replace(/>/g, '>').replace(/"/g, '"');
}

Regex check both side of match but not include in match string

I want get match with checking both side expropriation of main match.
var str = 1234 word !!! 5678 another *** 000more))) get word and another
console.log(str.match(/(?!\d+\s?)\w+(?=\s?\W+)/g))
>> (3) ["word", "another", "more"]
it check both side but not include in the main match sets.
But in html it not working [not working]
var str = ''; get url, url2 and url3
console.log(str.match(/(?!href=")[^"]+?(?=")/g))
>> (6) ["<a href=", "url3"]
I try to Negative lookarounds using (?!href=") and Positive lookarounds using (?=") to match only the value of its attribute but it return more attributes.
Is there any way to so like this here, Thanks
What you could do for your example data is capture what is between double quotes href="([^"]+) in an captured group and loop through the result:
var str = '';
var pattern = /href="([^"]+)/g;
var match = pattern.exec(str);
while (match != null) {
console.log(match[1]);
match = pattern.exec(str);
}
In other flavors of regex you could have used e.g. positive lookbehind
((?<=href="), but unfortunately Javascript regex does not support
lookbehinds.
A reasonable solution is:
Match href=" as "ordinary" content, to be ignored.
Match the attribute value as a capturing group ((\w+)),
to be "consumed".
Set the boundary of the above group with a *positive lookup"
((?=")), just as you did.
So the whole regex can be:
href="(\w+)(?=")
and read "your" value from group 1.
You can't parse HTML with regex. Because HTML can't be parsed by regex.
Have you tried using the DOM parser that's right at your fingertips?
var str = '';
var div = document.createElement('div');
div.innerHTML = str; // parsing magic!
var links = Array.from(div.getElementsByTagName("a"));
var urls = links.map(function(a) {return a.href;});
// above returns fully-resolved absolute URLs.
// for the literal attribute value, try a.getAttribute("href")
console.log(urls);

Regex match cookie value and remove hyphens

I'm trying to extract out a group of words from a larger string/cookie that are separated by hyphens. I would like to replace the hyphens with a space and set to a variable. Javascript or jQuery.
As an example, the larger string has a name and value like this within it:
facility=34222%7CConner-Department-Store;
(notice the leading "C")
So first, I need to match()/find facility=34222%7CConner-Department-Store; with regex. Then break it down to "Conner Department Store"
var cookie = document.cookie;
var facilityValue = cookie.match( REGEX ); ??
var test = "store=874635%7Csomethingelse;facility=34222%7CConner-Department-Store;store=874635%7Csomethingelse;";
var test2 = test.replace(/^(.*)facility=([^;]+)(.*)$/, function(matchedString, match1, match2, match3){
return decodeURIComponent(match2);
});
console.log( test2 );
console.log( test2.split('|')[1].replace(/[-]/g, ' ') );
If I understood it correctly, you want to make a phrase by getting all the words between hyphens and disallowing two successive Uppercase letters in a word, so I'd prefer using Regex in that case.
This is a Regex solution, that works dynamically with any cookies in the same format and extract the wanted sentence from it:
var matches = str.match(/([A-Z][a-z]+)-?/g);
console.log(matches.map(function(m) {
return m.replace('-', '');
}).join(" "));
Demo:
var str = "facility=34222%7CConner-Department-Store;";
var matches = str.match(/([A-Z][a-z]+)-?/g);
console.log(matches.map(function(m) {
return m.replace('-', '');
}).join(" "));
Explanation:
Use this Regex (/([A-Z][a-z]+)-?/g to match the words between -.
Replace any - occurence in the matched words.
Then just join these matches array with white space.
Ok,
first, you should decode this string as follows:
var str = "facility=34222%7CConner-Department-Store;"
var decoded = decodeURIComponent(str);
// decoded = "facility=34222|Conner-Department-Store;"
Then you have multiple possibilities to split up this string.
The easiest way is to use substring()
var solution1 = decoded.substring(decoded.indexOf('|') + 1, decoded.length)
// solution1 = "Conner-Department-Store;"
solution1 = solution1.replace('-', ' ');
// solution1 = "Conner Department Store;"
As you can see, substring(arg1, arg2) returns the string, starting at index arg1 and ending at index arg2. See Full Documentation here
If you want to cut the last ; just set decoded.length - 1 as arg2 in the snippet above.
decoded.substring(decoded.indexOf('|') + 1, decoded.length - 1)
//returns "Conner-Department-Store"
or all above in just one line:
decoded.substring(decoded.indexOf('|') + 1, decoded.length - 1).replace('-', ' ')
If you want still to use a regular Expression to retrieve (perhaps more) data out of the string, you could use something similar to this snippet:
var solution2 = "";
var regEx= /([A-Za-z]*)=([0-9]*)\|(\S[^:\/?#\[\]\#\;\,']*)/;
if (regEx.test(decoded)) {
solution2 = decoded.match(regEx);
/* returns
[0:"facility=34222|Conner-Department-Store",
1:"facility",
2:"34222",
3:"Conner-Department-Store",
index:0,
input:"facility=34222|Conner-Department-Store;"
length:4] */
solution2 = solution2[3].replace('-', ' ');
// "Conner Department Store"
}
I have applied some rules for the regex to work, feel free to modify them according your needs.
facility can be any Word built with alphabetical characters lower and uppercase (no other chars) at any length
= needs to be the char =
34222 can be any number but no other characters
| needs to be the char |
Conner-Department-Store can be any characters except one of the following (reserved delimiters): :/?#[]#;,'
Hope this helps :)
edit: to find only the part
facility=34222%7CConner-Department-Store; just modify the regex to
match facility= instead of ([A-z]*)=:
/(facility)=([0-9]*)\|(\S[^:\/?#\[\]\#\;\,']*)/
You can use cookies.js, a mini framework from MDN (Mozilla Developer Network).
Simply include the cookies.js file in your application, and write:
docCookies.getItem("Connor Department Store");

Getting each 'word' after every underscore in a string in Javascript using regex

I'm wanting to extract each block of alphanumeric characters that come after underscores in a Javascript string. I currently have it working using a combination of string methods and regex like so:
var string = "ignore_firstMatch_match2_thirdMatch";
var firstValGone = string.substr(string.indexOf('_'));
// returns "_firstMatch_match2_thirdMatch"
var noUnderscore = firstValGone.match(/[^_]+/g);
// returns ["firstMatch", "match2" , "thirdMatch"]
I'm wondering if there's a way to do it purely using regex? Best I've managed is:
var string = "ignore_firstMatch_match2_thirdMatch";
var matchTry = string.match(/_[^_]+/g);
// returns ["_firstMatch", "_match2", "_thirdMatch"]
but that returns the preceding underscore too. Given you can't use lookbehinds in JS I don't know how to match the characters after, but exclude the underscore itself. Is this possible?
You can use a capture group (_([^_]+)) and use RegExp#exec in a loop while pushing the captured values into an array:
var re = /_([^_]+)/g;
var str = 'ignore_firstMatch_match2_thirdMatch';
var res = [];
while ((m = re.exec(str)) !== null) {
res.push(m[1]);
}
document.body.innerHTML = "<pre>" + JSON.stringify(res, 0, 4) + "</pre>";
Note that using a string#match() with a regex defined with a global modifier /g will lose all the captured texts, that's why you cannot just use str.match(/_([^_]+)/g).
Since lookbehind is not supported in JS the only way I can think of is using a group like this.
Regex: _([^_]+) and capture group using \1 or $1.
Regex101 Demo
var myString = "ignore_firstMatch_match2_thirdMatch";
var myRegexp = /_([^_]+)/g;
match = myRegexp.exec(myString);
while (match != null) {
document.getElementById("match").innerHTML += "<br>" + match[0];
match = myRegexp.exec(myString);
}
<div id="match">
</div>
An alternate way using lookahead would be something like this.
But it takes long in JS. Killed my page thrice. Would make a good ReDoS exploit
Regex: (?=_([A-Za-z0-9]+)) and capture groups using \1 or $1.
Regex101 Demo
Why do you assume you need regex? a simple split will do the job:
string str = "ignore_firstMatch_match2_thirdMatch";
IEnumerable<string> matches = str.Split('_').Skip(1);

How to Split string with multiple rules in javascript

I have this string for example:
str = "my name is john#doe oh.yeh";
the end result I am seeking is this Array:
strArr = ['my','name','is','john','&#doe','oh','&yeh'];
which means 2 rules apply:
split after each space " " (I know how)
if there are special characters ("." or "#") then also split but add the characther "&" before the word with the special character.
I know I can strArr = str.split(" ") for the first rule. but how do I do the other trick?
thanks,
Alon
Assuming the result should be '&doe' and not '&#doe', a simple solution would be to just replace all . and # with & split by spaces:
strArr = str.replace(/[.#]/g, ' &').split(/\s+/)
/\s+/ matches consecutive white spaces instead of just one.
If the result should be '&#doe' and '&.yeah' use the same regex and add a capture:
strArr = str.replace(/([.#])/g, ' &$1').split(/\s+/)
You have to use a Regular expression, to match all special characters at once. By "special", I assume that you mean "no letters".
var pattern = /([^ a-z]?)[a-z]+/gi; // Pattern
var str = "my name is john#doe oh.yeh"; // Input string
var strArr = [], match; // output array, temporary var
while ((match = pattern.exec(str)) !== null) { // <-- For each match
strArr.push( (match[1]?'&':'') + match[0]); // <-- Add to array
}
// strArr is now:
// strArr = ['my', 'name', 'is', 'john', '&#doe', 'oh', '&.yeh']
It does not match consecutive special characters. The pattern has to be modified for that. Eg, if you want to include all consecutive characters, use ([^ a-z]+?).
Also, it does nothing include a last special character. If you want to include this one as well, use [a-z]* and remove !== null.
use split() method. That's what you need:
http://www.w3schools.com/jsref/jsref_split.asp
Ok. i saw, you found it, i think:
1) first use split to the whitespaces
2) iterate through your array, split again in array members when you find # or .
3) iterate through your array again and str.replace("#", "&#") and str.replace(".","&.") when you find
I would think a combination of split() and replace() is what you are looking for:
str = "my name is john#doe oh.yeh";
strArr = str.replace('\W',' &');
strArr = strArr.split(' ');
That should be close to what you asked for.
This works:
array = string.replace(/#|\./g, ' &$&').split(' ');
Take a look at demo here: http://jsfiddle.net/M6fQ7/1/

Categories

Resources