Regex to find urls with hashes and exclamation marks #! [duplicate] - javascript

I know this has been asked a thousand times before (apologies), but searching SO/Google etc I am yet to get a conclusive answer.
Basically, I need a JS function which when passed a string, identifies & extracts all URLs based on a regex, returning an array of all found. e.g:
function findUrls(searchText){
var regex=???
result= searchText.match(regex);
if(result){return result;}else{return false;}
}
The function should be able to detect and return any potential URLs. I am aware of the inherent difficulties/issues with this (closing parentheses etc.), so I have a feeling the process needs to be:
Split the string (searchText) into distinct sections, each starting and ending with either nothing, a space or a carriage return on either side of it, resulting in distinct content chunks, e.g. do a split.
For each content chunk that results from the split, see whether it fits the logic for a URL of any construction, namely, does it contain a period immediately followed by text (the one constant rule for qualifying a potential URL).
The regex should see whether the period is immediately followed by other text, of the type allowable for a TLD, directory structure & query string, and preceded by text of the allowable type for a URL.
I am aware false positives may result; however, any returned values will then be checked with a call to the URL itself, so this can be ignored. The other functions I have found often don't return the URL's query string, if present.
From a block of text, the function should thus be able to return any type of URL, even if it means identifying will.i.am as a valid one!
eg. http://www.google.com, google.com, www.google.com, http://google.com,
ftp.google.com, https:// etc...and any derivation thereof with a query string
should be returned...
Many thanks, and apologies again if this exists elsewhere on SO, but my searches haven't returned it.
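The split-then-test process described in the question could be sketched roughly as below. This is only an illustration under stated assumptions: findUrlCandidates is a hypothetical name, and the test (a period with non-space, non-period text on both sides) is deliberately loose, so false positives such as will.i.am are expected.

```javascript
// Hypothetical helper (not from the original post): split the text on
// whitespace, then keep every chunk that contains a period with
// non-space, non-period text immediately before and after it.
function findUrlCandidates(searchText) {
  return searchText
    .split(/\s+/)
    .filter(function (chunk) {
      return /[^\s.]\.[^\s.]/.test(chunk);
    });
}

var found = findUrlCandidates(
  "Try google.com or http://www.google.com/search?q=a today. Done."
);
console.log(found);
// [ 'google.com', 'http://www.google.com/search?q=a' ]
```

Trailing punctuation such as a comma directly after a URL stays attached to the chunk in this sketch, so the candidates would still need trimming before being requested.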

I just use URI.js -- makes it easy.
var source = "Hello www.example.com,\n"
+ "http://google.com is a search engine, like http://www.bing.com\n"
+ "http://exämple.org/foo.html?baz=la#bumm is an IDN URL,\n"
+ "http://123.123.123.123/foo.html is IPv4 and "
+ "http://fe80:0000:0000:0000:0204:61ff:fe9d:f156/foobar.html is IPv6.\n"
+ "links can also be in parens (http://example.org) "
+ "or quotes »http://example.org«.";
var result = URI.withinString(source, function(url) {
return "<a>" + url + "</a>";
});
/* result is:
Hello <a>www.example.com</a>,
<a>http://google.com</a> is a search engine, like <a>http://www.bing.com</a>
<a>http://exämple.org/foo.html?baz=la#bumm</a> is an IDN URL,
<a>http://123.123.123.123/foo.html</a> is IPv4 and <a>http://fe80:0000:0000:0000:0204:61ff:fe9d:f156/foobar.html</a> is IPv6.
links can also be in parens (<a>http://example.org</a>) or quotes »<a>http://example.org</a>«.
*/
https://github.com/medialize/URI.js
http://medialize.github.io/URI.js/

You could use the regex from URI.js:
// gruber revised expression - http://rodneyrehm.de/t/url-regex.html
var uri_pattern = /\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/ig;
String#match and/or String#replace may help…
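As a rough sketch of how that pattern might be plugged into the function from the question: the regex below is copied from this answer, while the findUrls wrapper and its empty-array fallback are assumptions made for illustration.

```javascript
// Pattern copied verbatim from the answer above ("gruber revised expression").
var uri_pattern = /\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/ig;

// Assumed wrapper: with the g flag, String#match returns all full matches,
// or null when nothing matches, so fall back to an empty array.
function findUrls(searchText) {
  return searchText.match(uri_pattern) || [];
}

console.log(findUrls("Visit http://example.org/a?b=c or www.example.com."));
// [ 'http://example.org/a?b=c', 'www.example.com' ]
```

Note that this pattern needs a scheme, a www prefix, or a slash after the domain, so a bare google.com would not be picked up.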

The following regular expression extracts URLs from a string (including the query string) and returns an array:
var url = "asdasdla hakjsdh aaskjdh https://www.google.com/search?q=add+a+element+to+dom+tree&oq=add+a+element+to+dom+tree&aqs=chrome..69i57.7462j1j1&sourceid=chrome&ie=UTF-8 askndajk nakjsdn aksjdnakjsdnkjsn";
var matches = url.match(/\bhttps?::\/\/\S+/gi) || url.match(/\bhttps?:\/\/\S+/gi);
Output:
["https://www.google.com/search?q=format+to+6+digir&…s=chrome..69i57.5983j1j1&sourceid=chrome&ie=UTF-8"]
Note:
This handles both http:// with a single colon and http::// with a double colon in the string, and likewise for https, so it's safe for you to use. :)

try this
var expression = /[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?/gi;
you could use this website to test regexp http://gskinner.com/RegExr/

In UIPath Studio the following built-in regex rule has been defined:
/(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-a-zA-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-a-zA-Z0-9+&@#\/%=~_|$?!:,.])*(?:\([-a-zA-Z0-9+&@#\/%=~_|$?!:,.]*\)|[a-zA-Z0-9+&@#\/%=~_|$])/

Related

javascript regex insert new element into expression

I am passing a URL to a block of code in which I need to insert a new element into the regex. Pretty sure the regex is valid and the code seems right but no matter what I can't seem to execute the match for regex!
//** Incoming url's
//** url e.g. api/223344
//** api/11aa/page/2017
//** Need to match to the following
//** dir/api/12ab/page/1999
//** Hence the need to add dir at the front
var url = req.url;
//** pass in: /^\/api\/([a-zA-Z0-9-_~ %]+)(?:\/page\/([a-zA-Z0-9-_~ %]+))?$/
var re = myregex.toString();
//** Insert dir into regex: /^dir\/api\/([a-zA-Z0-9-_~ %]+)(?:\/page\/([a-zA-Z0-9-_~ %]+))?$/
var regVar = re.substr(0, 2) + 'dir' + re.substr(2);
var matchedData = url.match(regVar);
matchedData === null ? console.log('NO') : console.log('Yay');
I hope I am just missing the obvious but can anyone see why I can't match and always returns NO?
Thanks
Let's break down your regex
^\/api\/ this matches the beginning of the string, followed by exactly the string "/api/"
([a-zA-Z0-9-_~ %]+) this is a capturing group: it captures any of the characters listed inside the square brackets, with the + meaning one or more of them, so for example, this section will match abAB25-_ %
(?:\/page\/([a-zA-Z0-9-_~ %]+)) this groups multiple tokens together as well, but does not create a capturing group like above (the ?: makes it non-capturing). You are first matching a string exactly like "/page/", followed by a group exactly like the one mentioned in the paragraph above (that matches a-z, A-Z, 0-9, etc.)
?$ is at the end: the ? makes the preceding group optional (it matches zero or one time), and the $ matches the end of the string
This regex will match this string, for example: /api/abAB25-_ %/page/abAB25-_ %
You may be able to take advantage of capturing groups, however, and use something like this instead to get similar results: ^\/api\/([a-zA-Z0-9-_~ %]+)\/page\/\1?$. Here, we are using \1 to reference that first capturing group and match exactly the same text it matched. EDIT: actually, this probably won't work, since the text after /api/ and the text after /page/ will most likely be different. Carrying on...
Afterwards, you are adding "dir" to the beginning of your pattern, so you can now match something like this: dir/api/abAB25-_ %/page/abAB25-_ %
You have also now converted the regex to a string, which keeps the leading and trailing slashes of the literal, so like Crayon Violent pointed out in their comment, this will break your expected functionality. You can fix this by using .source on your regex, which returns the pattern text without those slashes: var matchedData = url.match(regVar.source); (this assumes regVar is a RegExp object; .source is undefined on a plain string). https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/source
Now you can properly match a string like this: dir/api/11aa/page/2017. See this example: https://repl.it/Mj8h
As mentioned by Crayon Violent in the comments, it seems you're passing a String rather than a regular expression to the .match() function. Maybe try the following:
url.match(new RegExp(regVar, "i"));
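to convert the string to a regular expression. Note that the string passed to new RegExp must not include the leading and trailing slashes that toString() produces. The "i" is for ignore case; I don't know if that's what you want. Learn more here: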
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp
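Putting the two answers together, one possible way to build the prefixed pattern is sketched below. It assumes myregex is the RegExp object from the question; withDir is a hypothetical name introduced here.

```javascript
// Assumed input: the original pattern from the question, as a RegExp object.
var myregex = /^\/api\/([a-zA-Z0-9-_~ %]+)(?:\/page\/([a-zA-Z0-9-_~ %]+))?$/;

// .source is the pattern text without the surrounding slashes ("^\/api\/...").
// Drop the leading "^", then re-anchor with the "dir" prefix in front.
var withDir = new RegExp('^dir' + myregex.source.slice(1), myregex.flags);

console.log(withDir.test('dir/api/12ab/page/1999')); // true
console.log(withDir.test('dir/api/11aa'));           // true
console.log(withDir.test('api/223344'));             // false
```

Because the result is a real RegExp, it can be passed straight to url.match(withDir) without any string conversion.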

Converting ampersand (&) and blank space to a dash (-) in URLs using regex

With the code below, I have converted the following names into URLs, such as:
Love & Relationships to http://domain.org/love-relationships
Career & Guidance to http://domain.org/career-guidance
filter('ampToDash', function(){
return function(text){
return text ? String(text).replace(/ & /g,'-'): '';
};
}).filter('dashToAmp', function(){
return function(text){
return text ? String(text).replace(/-/g,' & '): '';
};
})
However, I have a new set of names and I can't figure out how to do both at the same time.
Being Human to http://domain.org/being-human
Competitive Exams to http://domain.org/competitive-exams
filter('ampToDash', function(){
return function(text){
return text ? String(text).replace(/ /g,'-'): '';
};
}).filter('dashToAmp', function(){
return function(text){
return text ? String(text).replace(/-/g,' '): '';
};
})
How do I combine both the regex codes so it can work hand in hand?
You may also want to extend your replacement criteria to cover all "non-word" characters, instead of just accounting for the ones you're currently aware of (& and space). This would be more future-proof, and perhaps easier to reason about:
String(text).replace(/\W+/g, '-')
(\W+ means any sequence of non-word characters.)
Example:
'Jack & Jill went up the #$%#! hill'.replace(/\W+/g, '-')
Yields:
Jack-Jill-went-up-the-hill
And because there's loss of information (i.e. you don't know what exactly leads to a '-' by looking at the transformed string), a way you can find the original string is to simply store it and look up by the transformed string. To elaborate: You're probably going to be looking up some document from this new string (a "slug", as others pointed out). Store the slug along with the document and just look up the document (and its original title) from your database.
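A minimal sketch of that store-and-look-up idea follows. The slugify and titlesBySlug names are hypothetical, a plain object stands in for the database, and the lower-casing is an assumption added to match the URLs in the question.

```javascript
// Hypothetical slug helper: collapse every run of non-word characters into
// one hyphen. Lower-casing is an assumption based on the question's URLs.
function slugify(text) {
  return String(text).replace(/\W+/g, '-').toLowerCase();
}

// Stand-in for the database: store each original title under its slug.
var titlesBySlug = {};
['Love & Relationships', 'Career & Guidance', 'Being Human'].forEach(function (title) {
  titlesBySlug[slugify(title)] = title;
});

console.log(slugify('Love & Relationships'));    // love-relationships
console.log(titlesBySlug['love-relationships']); // Love & Relationships
console.log(titlesBySlug['being-human']);        // Being Human
```

With the original title stored next to the slug, no reverse filter is needed at all.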
It looks like you simply want to change any ampersand surrounded by whitespace, or any run of whitespace on its own, to a single hyphen. If so, you could just use the following expression:
// Replace an ampersand surrounded by whitespace, or any run of whitespace, with a hyphen
return text ? String(text).replace(/(\s+&\s+|\s+)/g, '-') : '';
Example
var input = ['Love & Relationships', 'Career & Guidance', 'Being Human', 'Competitive Exams'];
input.forEach(function (phrase) {
console.log(phrase + ' -> ' + phrase.replace(/(\s+&\s+|\s+)/g, '-'));
});
I think you are looking for a lib that converts a string into a slug.
You can do this manually, but you'll probably have hard time covering other edge cases.
I would suggest using something like:
https://github.com/dodo/node-slug
Or check out this gist if you really want to stay with the regex way : https://gist.github.com/mathewbyrne/1280286
You have two separate problems:
how to 'slugify' a string
how to undo / reverse the slugify.
To answer 1: A generic slugify method would be something like: text.replace(/\W+/g, '-')
To answer 2: you can't. You have a function (ampToDash) that can produce the same output given different inputs. i.e. there is NO equivalent of dashToAmp any more.

Regex converting & to &amp;

I am developing a small character encoder generator where the user enters their text and, on the click of a button, it outputs the encoded version.
I've defined an object of the characters that need to be encoded like so:
map = {
'©' : '&copy;',
'&' : '&amp;'
},
And here is the loop that gets the values from the map and replaces them:
Object.keys(map).forEach(function (ico) {
var icoE = ico.replace(/([.?*+^$[\]\\(){}|-])/g, "\\$1");
raw = raw.replace( new RegExp(icoE, 'g'), map[ico] );
});
I am then simply outputting the result to a textarea. This all works fine; however, the problem I'm facing is this.
© is replaced with &copy; however the & symbol at the beginning of this is then converted to &amp; so it ends up being &amp;copy;.
I see why this is happening however I'm not sure how to go about ensuring that & is not replaced within character encoded strings.
Here is a JSFiddle for a live preview of what I mean:
http://jsfiddle.net/4m3nw/1/
Any help would be much appreciated
Prelude: Apart from regex, an idea worth considering is something like this JS function that already handles html entities. Now, on to the regex question.
HTML Special Characters, Negative Lookahead
In HTML, special characters can look not only like &copy; but also like &#8212;, and they can have upper-case characters.
To replace ampersands that are not immediately followed by a hash or word characters and a semicolon, you can use something like this:
&(?!(?:#[0-9]+|[a-z]+);)
See the demo.
Make sure to use the i flag to activate case-insensitive mode
& matches the literal ampersand
The negative lookahead (?!(?:#[0-9]+|[a-z]+);) asserts that it is not followed by...
(?:#[0-9]+|[a-z]+) a hash and digits, | OR letters...
then a semicolon.
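As an illustration, the lookahead pattern could be used in a replace call like this. The escapeBareAmpersands name is hypothetical; the regex is the one from this answer, with the g and i flags added.

```javascript
// Hypothetical helper: encode only those ampersands that do not already
// start an entity such as &copy; or &#8212;.
function escapeBareAmpersands(text) {
  return text.replace(/&(?!(?:#[0-9]+|[a-z]+);)/gi, '&amp;');
}

console.log(escapeBareAmpersands('Tom & Jerry &copy; &#8212; &AMP;'));
// Tom &amp; Jerry &copy; &#8212; &AMP;
```

Only the bare ampersand is rewritten; the named, numeric, and upper-case entities are skipped by the lookahead.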
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind
The problem is that, since you process the same string on every pass, you replace the & in &copy; on a later pass. If you re-order your map then that seemingly solves the problem. However, according to the ECMAScript specifications, this is not a given, so you would be relying on implementation details of the ECMAScript engine used.
What you can do to make sure it will always work is to swap the keys so that & is always processed first:
map = {
'©' : '&copy;',
'&' : '&amp;'
};
var keys = Object.keys(map);
keys[keys.indexOf('&')] = keys[0];
keys[0] = '&';
keys.forEach(function (ico) {
var icoE = ico.replace(/([.?*+^$[\]\\(){}|-])/g, "\\$1");
raw = raw.replace( new RegExp(icoE, 'g'), map[ico] );
});
Obviously you need to add checks for the &'s existence if it isn't always there.
jsFiddle Demo.
Probably the simplest code change is to reorder your map by putting the ampersand on top.
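A different technique from the re-ordering suggested above, offered only as a sketch: do all replacements in a single pass with one alternation regex and a replacer function, so already-encoded output is never scanned again. The escapeRegExp and encode names are hypothetical, and the map values assume the entities the question intended.

```javascript
// Assumed map, with the entity values the question appears to intend.
var map = {
  '©': '&copy;',
  '&': '&amp;'
};

// Escape characters that are special inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// One pattern that matches any key of the map, e.g. /©|&/g.
var pattern = new RegExp(Object.keys(map).map(escapeRegExp).join('|'), 'g');

// Single pass: each matched character is looked up once, and the
// replacement text is not re-examined.
function encode(raw) {
  return raw.replace(pattern, function (ch) {
    return map[ch];
  });
}

console.log(encode('© 2014 Tom & Jerry'));
// &copy; 2014 Tom &amp; Jerry
```

Because the input is scanned only once, the order of keys in the map no longer matters.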

RegExp - If first part of search string is found then replace with the full search string value

Is there a RegExp to find and replace a value based on the criteria, "if first part of search string is in the target string then replace the part that matches with the search string."
This is a special search and replace because the replacement is also used as the search string.
For example, I have this URL:
http://www.domain.com/path/something/more/something/
Search for any part of the following and replace with the whole:
/path/user/
Since, "/path/" is in both the replacement string and the target string the results would be:
http://www.domain.com/path/user/something/more/something/
NOTE: The search / replacement value can be anything.
I don't know what the replacement and search string is at the time I make a replacement so I can't use something that hard codes the search string. For example, this won't work because the term is hard coded:
s.replace(/(\/path\/)/, "$1value/");
Another example:
Here is the sentence, "Thank you Susan for your order."
Here is the search and replacement, "Susan Summers"
Here is the desired sentence, "Thank you Susan Summers for your order."
Use Case:
Let's say you are given 1 million text documents that are letters to customers, but when the documents were created, only the customer's first name was used where the full name should have been. Now it's your job to find and replace every occurrence of the first name with the full name. You only have the full name to work with, not the first name.
Just realized this may not work as a RegEx and might require code.
You can use:
s = 'http://www.domain.com/path/something/more/something/';
r = s.replace(/(\/path\/)/, "$1user/");
//=> "http://www.domain.com/path/user/something/more/something/"
You don't need to use regular expression for this case:
var url = 'http://www.domain.com/path/something/more/something/';
url.replace('/path/', '/path/user/');
// => "http://www.domain.com/path/user/something/more/something/"
I'm not quite sure if I understand the problem correctly. The following replaces any part of /path/user/ (-> part 1: 'path', part 2: 'user') with the whole /path/user/:
var url1 = "http://www.domain.com/path/something/more/something/";
var url2 = "http://www.domain.com/user/something/more/something/";
url1.replace(/\/path\/|\/user\//, '/path/user/');
url2.replace(/\/path\/|\/user\//, '/path/user/');
results in:
http://www.domain.com/path/user/something/more/something/
http://www.domain.com/path/user/something/more/something/
I hope this is what you need, otherwise, please add another example.
EDIT:
Here is the regex in action: http://regex101.com/r/jL6tK6
split + join alternative :
url = url.split('/path/').join('/path/user/');
Although your requirements are not clear, here is a guess that raises a few extra questions :
var sub = '/path/user/';
var parts = sub.match(/[^\/]+/g);
url = url.replace(new RegExp(
'\\/(' + [parts.join('\\/')].concat(parts).join('|') + ')\\/'
), sub);
The resulting regular expression is as follows :
/\/(path\/user|path|user)\// // "/path/user/" OR "/path/" OR "/user/"
Let's check some urls assuming we live in the best of worlds :
'http://domain/' -> 'http://domain/'
'http://path/user/' -> 'http://path/user/'
'http://path/' -> 'http://path/user/'
'http://user/' -> 'http://path/user/'
Now, what do you think about the following ones?
'http://path/user' -> 'http://path/user/user'
'http://user/path/' -> 'http://path/user/path/'
'http://path/user/path/' -> 'http://path/user/path/'
The remaining questions are :
Is this what you are looking for?
What to do when there is no trailing slash?
What to do in the reverse order case?
What to do with recurrent parts?
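For the name example from the question ("Susan" to "Susan Summers"), one possible reading is sketched below. It only covers space-separated words, not the path case, and expandToFull and escapeRegExp are hypothetical names.

```javascript
// Escape characters that are special inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: treat the first word of `full` as the search term and
// replace it with `full`, unless the rest of `full` already follows it.
function expandToFull(text, full) {
  var first = full.split(' ')[0];
  var rest = full.slice(first.length); // e.g. " Summers"
  if (rest === '') {
    return text; // nothing to add for a single-word replacement
  }
  var pattern = new RegExp(
    '\\b' + escapeRegExp(first) + '\\b(?!' + escapeRegExp(rest) + ')',
    'g'
  );
  // A replacer function avoids "$" in `full` being read as a special pattern.
  return text.replace(pattern, function () {
    return full;
  });
}

console.log(expandToFull('Thank you Susan for your order.', 'Susan Summers'));
// Thank you Susan Summers for your order.
```

The negative lookahead keeps the function safe to run twice on the same document: a "Susan" that is already followed by " Summers" is left alone.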

How do I implement this regular expression in Javascript?

How do I make a Javascript regular expression that will take this string (named url_string):
http://localhost:3000/new_note?date1=01-01-2010&date2=03-03-2010
and return it, but with the value of the date1 parameter set to a new date variable, which is called new_date_1?
There are better ways to manipulate URL than regex, but a simple solution like this may work:
after = before.replace(/date1=[\d-]+/, "date1=" + newDate);
[\d-]+ matches a non-empty sequence of digits and/or dashes. If you really need to, you can also be more specific with e.g. \d{2}-\d{2}-\d{4}, or an even more complicated date regex that rejects invalid dates, etc.
Note that since the regex makes the "date1=" prefix part of the match, it is also substituted in as part of the replacement.
url.replace(/date1=[0-9-]{10}/, "date1=" + new_date_1);
It's messy, but:
var url_string = "http://localhost:3000/new_note?date1=01-01-2010&date2=03-03-2010";
var new_date_1 = "01-02-2003";
var new_url_string = url_string.replace(/date1=\d{2}-\d{2}-\d{4}/, "date1="+new_date_1);
/* http://localhost:3000/new_note?date1=01-02-2003&date2=03-03-2010 */
There must be a proper URL parser in JS. Have a Google.
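One such parser is the standard URL class with its searchParams property, available in modern browsers and in Node.js. A short sketch using the variable names from the question:

```javascript
var url_string = "http://localhost:3000/new_note?date1=01-01-2010&date2=03-03-2010";
var new_date_1 = "01-02-2003";

// Parse the URL, overwrite the date1 parameter, and serialize it again.
var url = new URL(url_string);
url.searchParams.set("date1", new_date_1);
var new_url_string = url.toString();

console.log(new_url_string);
// http://localhost:3000/new_note?date1=01-02-2003&date2=03-03-2010
```

Unlike the regex versions, this does not depend on the format of the old date value, and it leaves date2 untouched.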
