#username Regular Expression for social media with JavaScript

I am relatively new to using Regular Expressions with JavaScript and have come across a particular situation with regard to the '#' username pattern that has given me about 3 hours of trouble.
I would like to replicate the #username pattern found in social media text fields on websites such as Facebook and Twitter. (I understand that they parse tokenized macros, but I would like a pure RegEx version.)
I have attached an image of the closest pattern that I have achieved; however, I will also type it out to make it easier for anyone to copy and paste into their own RegEx pattern checker.
As you can see, the # symbol ought to catch all subsequent alpha characters and nothing preceding the # symbol, and the match should be terminated by a space. There is a special case where a URL that contains a # symbol should be ignored (as illustrated in the image). The # symbol could also be used at the very start of the text field, and in this case the handle should be parsed.
Clearly, my existing pattern is also capturing the character preceding the # symbol, which is incorrect. Any help would be fantastic.
RegEx101 example (with highlights)
RegEx that is not working
/(^^|[^\/])(#[A-Za-z0-9_.]{3,25})/gm
Text version for copy+paste convenience
#test testing
#test testing
testing ,#test
https://youtube.com/#test
I tried multiple combinations of patterns, over 3 hours, to try to isolate #handle style tags as seen in popular social networks. I expected to be able to isolate only the portion of the pattern that contains a single #-delimited username. I expected that I could ignore this pattern where it is part of a URL.
My actual results cause the preceding character to be captured and added to the final string, which is incorrect.

It sounds like you want a positive lookbehind
(?<=YOUR_REGEX)
/(?<=^|[^\/])(#[A-Za-z0-9_.]{3,25})/gm

I see you've already accepted an answer, but here's another version that may work for what you're trying to do.
/(?<=[^\/]|^)#[A-Za-z0-9_.]{3,25}/g
This matches anything that...
Is at the start of the line OR is not preceded by a forward slash (/)
Credit to @Samathingamajig for pointing that out.
Starts with a # symbol
Is followed by between 3 and 25 alphanumeric (incl. _ and .) characters.
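As a rough illustration of the lookbehind approach, here is a minimal sketch run against the sample lines from the question. The names handlePattern and sample are made up for this example, and it assumes a JavaScript engine that supports lookbehind assertions (added in ES2018).

```javascript
// Lookbehind: the handle must sit at the start of a line or directly after
// any character other than "/". The lookbehind itself consumes nothing.
const handlePattern = /(?<=^|[^\/])#[A-Za-z0-9_.]{3,25}/gm;

// Sample lines taken from the question.
const sample = [
  "#test testing",
  "testing ,#test",
  "https://youtube.com/#test",
].join("\n");

// match() with the g flag returns every full match, or null when there is none.
const handles = sample.match(handlePattern) || [];
console.log(handles);
```

Because the lookbehind is zero-width, the preceding character is checked but never included in the match, which is the part the capturing-group version in the question got wrong.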

Related

capture group with optional second capture group containing first group pattern [duplicate]

Regular expression to stop at first match
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?

Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:
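To make the difference concrete, here is a small sketch comparing the greedy and non-greedy patterns. It assumes the log from the question is one long line separated by spaces (the real layout may differ), and the variable names are invented for the example.

```javascript
// Assumed single-line version of the log from the question.
const log = String.raw`J0000010: Project name: E:\foo.pf J0000011: Job name: MBiek Direct Mail Test J0000020: Document 1 - Completed successfully`;

// Greedy: .* runs to the end, then backtracks to the LAST "J#######:".
const greedy = log.match(/Project name:\s+(.*)\s+J[0-9]{7}:/)[1];

// Non-greedy: .*? grows one character at a time and stops at the FIRST "J#######:".
const lazy = log.match(/Project name:\s+(.*?)\s+J[0-9]{7}:/)[1];

console.log(greedy);
console.log(lazy);
```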

Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: greedy matches generally go as far as they can (here, until the end of the text!) and then backtrack character by character to try to match the part that comes afterwards.
However, consider using a negated character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means “everything except whitespace”, and this is exactly what you want.
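A quick sketch of the \S variant follows, again assuming a single-line log. The helper name projectPath is hypothetical. One caveat worth seeing in action: since \S cannot cross whitespace, a path that contains a space no longer matches at all.

```javascript
// Extracts the token after "Project name:" using \S* (no whitespace allowed).
function projectPath(line) {
  const match = line.match(/Project name:\s+(\S*)\s+J[0-9]{7}:/);
  return match ? match[1] : null;
}

const plain = projectPath(String.raw`J0000010: Project name: E:\foo.pf J0000011: Job name: Test`);
// A path with a space in it (made up for this example) defeats \S*.
const spaced = projectPath(String.raw`J0000010: Project name: E:\my files\foo.pf J0000011: Job name: Test`);

console.log(plain, spaced);
```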

Well, ".*" is a greedy selector. You make it non-greedy by using ".*?". With the latter construct, each time the regex engine is about to match another character with the ".", it first attempts to match whatever may come after the ".*?". This means that if, for instance, nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, @"Project name: (?<name>.*?) J\d+").Groups["name"].Value;
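For readers working in JavaScript rather than .NET, roughly the same idea can be sketched with a named capture group (supported since ES2018). The sample string is the log from the question flattened onto one line, which is an assumption about its real layout.

```javascript
// Assumed single-line version of the log from the question.
const s = String.raw`J0000010: Project name: E:\foo.pf J0000011: Job name: MBiek Direct Mail Test J0000020: Document 1 - Completed successfully`;

// (?<name>...) is a named group; .*? keeps the capture as short as possible.
const m = s.match(/Project name: (?<name>.*?) J\d+/);

// match() returns null when nothing matches, so guard before reading groups.
const name = m ? m.groups.name : "";
console.log(name);
```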

I would also recommend you experiment with regular expressions using "Expresso" - it's a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people inexperienced with regex might not be familiar with, in a way that makes it easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm

(Project name:\s+[A-Z]:(?:\\\w+)+\.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Using (?:\\\w+)+\.[a-zA-Z]+ instead of .* makes the pattern more restrictive.

How can I prevent extra characters at the end of my string in Regex?

Background
I'm working on a JavaScript application where users have to use a specific email domain to sign up (either @a.com or @b.com; anything else gets rejected).
I've created a regex that makes sure the user doesn't enter @a.com with nothing in front of it and limits users to only @a.com and @b.com. The last step is to make sure the user doesn't add extra characters to the end of @a.com by doing something like @a.com.gmail.com
This is the regex I currently have:
\b[a-zA-Z0-9\.]*@(a.com|b.com)
Question
What can I use at the end to prevent anything from being added after a.com or b.com? I'm very novice at regex and have no idea where to start.

To solve your problem, add $ to the end of the regex. $ means your match must end at the string's end.
You can also reduce (a.com|b.com) to [ab]\.com. Note that I've also escaped the dot.
The character class [ab] means one of its characters should be matched.
Check this demo.
As stated in the comments, be sure to use the (m)ultiline flag if you test several addresses at once; this way the regex engine will treat each line as a separate string.
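A minimal sketch of the anchored pattern, assuming the separator is the usual @ of an email address and that each address is tested on its own (so no multiline flag is needed). The names allowedEmail and isAllowed are invented here, and the * from the question is swapped for + so that an empty local part is rejected.

```javascript
// ^ and $ pin the match to the whole string; [ab]\.com allows only the two domains.
const allowedEmail = /^[a-zA-Z0-9.]+@[ab]\.com$/;

// Returns true only when the entire input is an address at a.com or b.com.
function isAllowed(email) {
  return allowedEmail.test(email);
}

console.log(isAllowed("user@a.com"));           // accepted
console.log(isAllowed("user@a.com.gmail.com")); // rejected: trailing characters
console.log(isAllowed("@a.com"));               // rejected: nothing before the @
```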

Email Regular Expression - Excluded Specified Set

I have been researching a regular expression for the better part of six hours today. For the life of me, I cannot figure it out. I have tried what feels like about a hundred different approaches to no avail. Any help is greatly appreciated!
The basic rules:
1 - Exclude these characters in the address portion (before the @ symbol): "()<>@,;:\[]*&^%$#!{}/"
2 - The address can contain a ".", but not two in a row.
I have an elegant solution to rule number one; however, rule number two is killing me! Here is what I have so far. (I'm only including the portion up to the @ sign to keep it simple.) Also, it is important to note that this regular expression is being used in JavaScript, so no conditional IF is allowed.
/^[^()<>@,;:\\[\]*&^%$#!{}//]+$/

First of all, I would suggest you always choose which characters you want to allow instead of the opposite; you never know what dangerous characters you might miss.
Secondly, this is the regular expression I always use for validating emails, and it has worked well for me. Hope it helps you out.
/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$/i

Rule number 2
/^(?:\.?[^.])+\.?$/
which means one or more sequences of (an optional dot followed by a mandatory non-dot), with an optional dot at the end.
Consider four two-character sequences:
xx matches as two non dot characters.
.x matches as an optional dot followed by a non-dot.
x. matches as a non-dot followed by an optional dot at the end.
.. does not match because there is no non-dot after the first dot.
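The four cases above can be checked directly. This is just a sketch; the names noDoubleDot and results are made up for the example.

```javascript
// One or more (optional dot + non-dot) units, then an optional trailing dot.
const noDoubleDot = /^(?:\.?[^.])+\.?$/;

// Test each two-character sequence from the explanation, plus a longer string.
const inputs = ["xx", ".x", "x.", "..", "a.b.c", "a..b"];
const results = inputs.map((text) => noDoubleDot.test(text));

inputs.forEach((text, i) => console.log(JSON.stringify(text), results[i]));
```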
One thing to remember about email addresses is that dots can appear in tricky places.
"..@"@.example.com
is a valid email address.
The "..@" is a perfectly valid quoted local-part production, and .example.com is just a way of saying example.com but resolved against the root DNS instead of using a host search path. example.com might resolve to example.com.myintranet.com if myintranet.com is on the host search path, but .example.com always resolves to the absolute host example.com.

First of all, to your specifications:
^(?![\s\S]*\.\.)[^()<>@,;:\\[\]*&^%$#!{}/]+@.*$
It's essentially your regex with (?![\s\S]*\.\.) tacked onto the front. That's a negative lookahead, which makes the match fail if there are two consecutive periods anywhere in the string.
Properly matching email addresses is quite a bit harder, however.
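Here is a small sketch of the lookahead idea applied only to the part before the @, as in the question. The name localPartOk is hypothetical, and the forward slash is escaped inside the class so the regex literal is unambiguous.

```javascript
// (?!.*\.\.) rejects any input containing two dots in a row;
// the negated class then rejects every excluded character.
const localPartOk = /^(?!.*\.\.)[^()<>@,;:\\[\]*&^%$#!{}\/]+$/;

console.log(localPartOk.test("john.doe"));  // single dots are fine
console.log(localPartOk.test("john..doe")); // two dots in a row
console.log(localPartOk.test("jo(hn"));     // contains an excluded character
```

The lookahead runs once at the start of the string and consumes nothing, so the character class still has to match the whole input afterwards.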

help making a "universal" regex Javascript compatible

I found a very nice URL regex matcher on this site: http://daringfireball.net/2010/07/improved_regex_for_matching_urls . It states that it's free to use and that it's cross language compatible (including Javascript). First of all, I have to escape some of the slashes to get it to compile at all. When I do that, it works fine on Rubular.com (where I generally test regexes), with the strange side effect that each match has 5 fields: 1 is the url, and the extra 4 are empty. When I put this in JS, I get the error "Invalid Group". I am using Node.js if that makes any difference, but I wish I could understand that error. I'd like to cut back on the unnecessary empty match fields, but I don't even know where to begin diagnosing this beast. This is what I had after escaping:
(?xi)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’] ))

Actually, you don't need the first capturing group either; it's the same as the whole match in this case, and that can always be accessed via $&. You can change all the capturing groups to non-capturing by adding ?: after the opening parens:
/\b(?:(?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i
That "invalid group" error is due to the inline modifiers (i.e., (?xi)) which, as @kirilloid observed, are not supported in JavaScript. John Gruber (the regex's author) was mistaken about that, as he was about JS supporting free-spacing mode.
Just FYI, the reason you had to escape the slashes is because you were using regex-literal notation, the most common form of which uses the forward-slash as the regex delimiter. In other words, it's the language (Ruby or JavaScript) that requires you to escape that particular character, not the regex. Some languages let you choose different regex delimiters, while others don't support regex literals at all.
But these are all language issues, not regex issues; the regex itself appears to work as advertised.
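Both points can be tried out with a much shorter pattern than the full URL regex. The sketch below uses made-up names (compiles, capturing, nonCapturing) and a toy www pattern in place of the real one.

```javascript
// Returns true when the engine accepts the pattern, false on a SyntaxError.
function compiles(source, flags = "") {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// A leading (?xi) is rejected by JavaScript; passing "i" as a flag works.
const withInlineModifiers = compiles("(?xi)www\\d{0,3}[.]");
const withFlag = compiles("www\\d{0,3}[.]", "i");

// Each capturing group adds one slot to the match array; (?:...) adds none.
const capturing = "www.example.com".match(/(www)[.](\w+)[.](com)/i);
const nonCapturing = "www.example.com".match(/(?:www)[.](?:\w+)[.](?:com)/i);

console.log(withInlineModifiers, withFlag, capturing.length, nonCapturing.length);
```

This is also where the extra empty fields in the question come from: groups that did not take part in the match still occupy a slot in the result.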

It seems that you copied it wrong.
http://www.regular-expressions.info/javascript.html
No mode modifiers to set matching options within the regular expression.
No regular expression comments
I.e. (?xi) at the beginning is useless.
x serves no purpose for a compacted RegExp
i can be replaced with the i flag
All of this results in:
/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i
Tested and working in Google Chrome => should work in Node.js

How to detect what allowed character in current Regular Expression by using JavaScript?

In my web application, I have created a framework that binds model data to controls on the page. Each model property has rules such as string length, not null, and a regular expression. Before the page is submitted, the framework validates every bound control against the defined rules.
So, I want to detect which characters are allowed by each regular expression rule, as in the following examples.
"^[0-9]+$" allows only digit characters like 1, 2, 3.
"^[a-zA-Z_][a-zA-Z_\-0-9]+$" allows only letters, digits, - and _ characters.
However, this function should not care about grouping or the position of the allowed characters. It should just report the possible characters.
Do you have any idea how to create this function?
PS. I know it is easy to create a specific function, such as a numeric-only one that allows only digit characters. But I need to share/reuse the same piece of code in both the data tier (which contains all the model validators) and the UI tier without modifying anything.
Thanks

You can't solve this for the general case. Regexps don't generally ‘fail’ at a particular character, they just get to a point where they can't match any more, and have to backtrack to try another method of matching.
One could make a regex implementation that remembered which was the farthest it managed to match before backtracking, but most implementations don't do that, including JavaScript's.
A possible way forward would be to match first against ^pattern$, and if that failed match against ^pattern without the end-anchor. This would be more likely to give you some sort of match of the left hand part of the string, so you could count how many characters were in the match, and say the following character was ‘invalid’. For more complicated regexps this would be misleading, but it would certainly work for the simple cases like [a-zA-Z0-9_]+.
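The two-step idea above might be sketched as follows. The function name firstInvalidIndex is invented, the pattern is passed in without its own ^ and $ anchors, and, as noted, the result is only meaningful for simple patterns.

```javascript
// Returns -1 when the whole input matches, otherwise the index of the first
// character that the anchored-at-start pattern could not consume.
function firstInvalidIndex(patternSource, input) {
  // Step 1: try the full match, ^pattern$.
  if (new RegExp(`^(?:${patternSource})$`).test(input)) return -1;

  // Step 2: drop the end anchor and see how far a prefix match gets.
  const prefix = new RegExp(`^(?:${patternSource})`).exec(input);
  return prefix ? prefix[0].length : 0;
}

console.log(firstInvalidIndex("[a-zA-Z0-9_]+", "abc_123"));
console.log(firstInvalidIndex("[a-zA-Z0-9_]+", "abc$def"));
console.log(firstInvalidIndex("[a-zA-Z0-9_]+", "$abc"));
```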

I must admit that I'm struggling to parse your question.
If you are looking for a regular expression that will match only if a string consists entirely of a certain collection of characters, regardless of their order, then your examples of character classes were quite close already.
For instance, ^[A-Za-z0-9]+$ will only allow strings that consist of letters A through Z (upper and lower case) and numbers, in any order, and of any length.
