"Françoise Lefèvre"@example.com
I'm reading RFC 5321 to try to actually understand what constitutes a valid email address -- and I'm probably making this a lot more difficult than it needs to be -- but this has been bugging me.
i.e., within a quoted string, any
ASCII graphic or space is permitted
without blackslash-quoting except
double-quote and the backslash itself.
Does this mean that ASCII extended character sets are valid within quotes? Or does that imply standard ASCII table only?
EDIT - With the answers in mind, here's a simple jQuery validator that could work as a supplement to the plugin's built-in email validation to check the characters.
jQuery.validator.addMethod("ascii_email", function( value, element ) {
    // In compliance with RFC 5321, this allows all standard printing ASCII characters in quoted text.
    // Unquoted text must be ASCII-US alphanumeric or one of the following: ! # $ % & ' * + - / = ? ^ _ ` { | } ~
    // @ and . get a free pass, as this is meant to be used together with the email validator
    var quoted = /(["])(?:\\\1|.)*?\1/;
    // match() returns null when the address has no quoted part
    var match = value.match(quoted);
    var result = this.optional(element) ||
    (
        /^[\u002a\u002b\u003d\u003f\u0040\u0020-\u0027\u002d-\u002f\u0030-\u0039\u0041-\u005a\u005e-\u007e]+$/.test(value.replace(quoted, "")) &&
        (match === null || /^[\u0020-\u007e]+$/.test(match[0]))
    );
    return result;
}, "Invalid characters");
The plugin's built-in validation appears to be pretty good, except for catching invalid characters. Out of the test cases listed here it only disallows comments, folding whitespace and addresses lacking a TLD (i.e., @localhost, @255.255.255.255) -- all of which I can easily live without.
According to this MSDN page the extended ASCII characters aren't valid, currently, but there is a proposed specification that would change this.
http://msdn.microsoft.com/en-us/library/system.net.mail.mailaddress(VS.90).aspx
The important part is here:
Thomas Lee is correct in that a quoted
local part is valid in an email
address and certain mail addresses may
be invalid if not in a quoted string.
However, the characters that others of
you have mentioned such as the umlaut
and the agave are not in the ASCII
character set, they are extended
ASCII. In RFC 2822 (and subsequent
RFC's 5322 and 3696) the dtext
specification (allowed in quoted local
parts) only allows most ASCII values
(RFC 2822, section 3.4.1) which
includes values in ranges from 33-90
and 94-126. RFC 5335 has been proposed
that would allow non-ascii characters
in the addr-spec, however it is still
labeled as experimental and as such is
not supported in MailAddress.
In this RFC, ASCII means US-ASCII, i.e., no characters with a value greater than 127 are allowed. As a proof, here are some quotes from RFC 5321:
The mail data may contain any of the 128 ASCII character codes, [...]
[...]
Systems MUST NOT define mailboxes in such a way as to require the use in SMTP of non-ASCII characters (octets with the high order bit set to one) or ASCII "control characters" (decimal value 0-31 and 127). These characters MUST NOT be used in MAIL or RCPT commands or other commands that require mailbox names.
These quotes quite clearly imply that characters with a value greater than 127 are considered non-ASCII. Since such characters are explicitly forbidden in MAIL or RCPT commands, they cannot be used in e-mail addresses.
Thus, "Francoise Lefevre"@example.com is a perfectly valid address (according to the RFC), whereas "Françoise Lefèvre"@example.com is not.
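To make the distinction concrete, here is a minimal sketch of the character test implied by the quotes above. The name isSmtpAscii is made up for illustration, and the function only checks the character range (decimal 32-126), not the full address grammar.

```javascript
// Hypothetical helper: true when every character is a printable US-ASCII
// character or space (decimal 32-126), i.e. no control characters and no
// characters above 127.
function isSmtpAscii(address) {
  return /^[\x20-\x7e]*$/.test(address);
}

console.log(isSmtpAscii('"Francoise Lefevre"@example.com')); // true
console.log(isSmtpAscii('"Françoise Lefèvre"@example.com')); // false
```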
Technically yes, but read on:
While the above definition for
Local-part is relatively permissive,
for maximum interoperability, a host
that expects to receive mail SHOULD
avoid defining mailboxes where the
Local-part requires (or uses) the
Quoted-string form or where the
Local-part is case-sensitive.
...
Systems MUST NOT define mailboxes in
such a way as to require the use in
SMTP of non-ASCII characters.
The HTML5 spec has an interesting take on the issue of valid email addresses:
A valid e-mail address is a string that matches the ABNF production 1*( atext / "." ) "@" ldh-str 1*( "." ldh-str ) where atext is defined in RFC 5322 section 3.2.3, and ldh-str is defined in RFC 1034 section 3.5.
The nice thing about this, of course, is that you can then take a look at the open source browser's source code for validating it (look for the IsValidEmailAddress function). Of course it's in C, but not too hard to translate to JS.
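As a rough illustration, the ABNF production quoted above can be transcribed into a JavaScript regular expression. This is only a sketch of that one production, not the browser's actual implementation; the names HTML5_EMAIL and isHtml5Email are invented here.

```javascript
// atext (RFC 5322 section 3.2.3): letters, digits and ! # $ % & ' * + - / = ? ^ _ ` { | } ~
// ldh-str (RFC 1034 section 3.5): one or more letters, digits or hyphens
// Production: 1*( atext / "." ) "@" ldh-str 1*( "." ldh-str )
const HTML5_EMAIL = /^[A-Za-z0-9!#$%&'*+\/=?^_`{|}~.-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+$/;

function isHtml5Email(value) {
  return HTML5_EMAIL.test(value);
}

console.log(isHtml5Email('user@example.com'));      // true
console.log(isHtml5Email('user@localhost'));        // false: at least one "." label is required
console.log(isHtml5Email('"my name"@example.com')); // false: quoted local parts are not covered
```

Note that this production is deliberately simpler than RFC 5322: it rejects quoted strings and single-label domains, and it accepts consecutive dots in the local part.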
I'm trying to construct a URL with something like:
var myUrl = '/path/to/api/' + encodeURIComponent(str);
But if str is .. then your browser automatically lops off a path segment so that the URL becomes /path/to which is not what I want.
I've tried encoding .. as %2E%2E but your browser still re-interprets it before the request is sent. Is there anything I can do to have the path actually come through to my server as /path/to/api/..?
I believe this is not supported because the behaviour would violate RFC 3986.
From Section 2.3. Unreserved Characters (emphasis mine):
Characters that are allowed in a URI but do not have a reserved
purpose are called unreserved. These include uppercase and lowercase
letters, decimal digits, hyphen, period, underscore, and tilde.
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
URIs that differ in the replacement of an unreserved character with
its corresponding percent-encoded US-ASCII octet are equivalent: they
identify the same resource. However, URI comparison implementations
do not always perform normalization prior to comparison (see Section
6). For consistency, percent-encoded octets in the ranges of ALPHA
(%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
underscore (%5F), or tilde (%7E) should not be created by URI
producers and, when found in a URI, should be decoded to their
corresponding unreserved characters by URI normalizers.
From Section 6.2.2.3. Path Segment Normalization (emphasis mine):
The complete path segments "." and ".." are intended only for use
within relative references (Section 4.1) and are removed as part of
the reference resolution process (Section 5.2). However, some
deployed implementations incorrectly assume that reference resolution
is not necessary when the reference is already a URI and thus fail to
remove dot-segments when they occur in non-relative paths. URI
normalizers should remove dot-segments by applying the
remove_dot_segments algorithm to the path, as described in
Section 5.2.4.
I've actually done something similar by double-encoding the text, then decoding it on the server back end. However, mine were query parameters, not part of the path.
PS. This is written on my phone, I'll add an example later.
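A minimal sketch of the double-encoding idea, applied here to a path segment as an assumption (the answer above used it for query parameters). Because encodeURIComponent leaves dots untouched, the dots are percent-encoded by hand first. The function names are invented for illustration.

```javascript
// Client side: percent-encode dots by hand, then encode a second time so the
// browser only ever sees "%252E" and never a dot-segment it could normalize.
function doubleEncodeSegment(str) {
  const once = encodeURIComponent(str).replace(/\./g, '%2E');
  return encodeURIComponent(once);
}

// Receiving side: undo both layers.
function doubleDecodeSegment(encoded) {
  return decodeURIComponent(decodeURIComponent(encoded));
}

const myUrl = '/path/to/api/' + doubleEncodeSegment('..');
console.log(myUrl);                             // /path/to/api/%252E%252E
console.log(doubleDecodeSegment('%252E%252E')); // ..
```

Many server frameworks already decode the path once before handing it to application code, in which case only one explicit decodeURIComponent call would be needed there.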
Seeing as there's no solution, there's not much we can do but error:
export function encodeUriComponent(str) {
if(str === '.' || str === '..') {
throw new Error(`Cannot URI-encode "${str}" per RFC 3986 §6.2.2.3`)
}
return encodeURIComponent(str);
}
I feel that this is a better option than arbitrarily modifying the URL path which is exactly what I was trying to avoid by using encodeURIComponent.
I am using the Node.js library unorm in my ASP.NET application to normalize a password string with its UNorm.normalize function, like this: UNorm.normalize("NFC",strpwd); but it does not give me any output. Tracing it in debug mode, I found that the error occurs in the unorm.js function fromData(next, cp, needFeature), with the message "JavaScript runtime error: unable to get property '0' of undefined or null reference". If I press the Ignore button, the output is shown, but if I click Continue or Break, no output is produced. I got the library code from http://git.io/unorm. My application code is given below:
<script type="text/javascript" src="unorm-master/src/unorm.js"></script>
<script type="text/javascript" language="javascript">
function strNormalize()
{
var nstr;
var strpwd = 'αλφα';
nstr = UNorm.normalize('NFC',strpwd);
document.getElementById("txtNormalize").value = nstr;
}
</script>
Can anyone tell me how to fix this issue in unorm.js, or propose another solution using JavaScript?
Convert to Punycode via https://github.com/bestiejs/punycode.js/ and then hash.
As long as you do the same conversion both times (i.e., when you hash to store and when you hash to verify), you should be good.
e.g. the punycode for your example would be "--mxaa3a7b"
Maybe you are using this library incorrectly. There are some examples in the readme; try:
nstr = UNorm.nfc(strpwd);
Depending on the browsers you are targeting you could just use String.prototype.normalize() (see documentation). To quote the documentation:
The normalize() method returns the Unicode Normalization Form of a
given string (if the value isn't a string, it will be converted to one
first).
Parameters
form One of NFC, NFD, NFKC, or NFKD, specifying the Unicode
Normalization Form. If omitted or undefined, NFC is used.
NFC — Normalization Form Canonical Composition.
NFD — Normalization Form Canonical Decomposition.
NFKC — Normalization Form Compatibility Composition.
NFKD — Normalization Form Compatibility Decomposition.
As an example "\u1E9B\u0323".normalize() will produce a string using NFC form.
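A short sketch of how the built-in method could be used for the password case. The helper name samePassword is hypothetical; the point is only that both the stored and the entered value go through the same normalization form before being compared or hashed.

```javascript
// "e" followed by a combining acute accent, versus the precomposed "é".
const decomposed = 'e\u0301';
const composed = decomposed.normalize('NFC');

console.log(composed === '\u00e9');                    // true
console.log(composed.normalize('NFD') === decomposed); // true

// Hypothetical helper: compare two spellings after normalizing both to NFC.
function samePassword(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

console.log(samePassword('caf\u00e9', 'cafe\u0301')); // true
console.log('αλφα'.normalize('NFC') === 'αλφα');      // true
```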
As of November 2022, the currently relevant authority from IETF is RFC 8265, “Preparation, Enforcement, and Comparison of Internationalized Strings Representing Usernames and Passwords,” October 2017. This document about usernames and passwords is a special case of the more-general PRECIS specification in the still-authoritative RFC 8264, “PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols,” October 2017.
RFC 8265, § 4.1:
This document specifies that a password is a string of Unicode code points [Unicode] that is conformant to the OpaqueString profile (specified below) of the PRECIS FreeformClass defined in Section 4.3 of [RFC8264] and expressed in a standard Unicode Encoding Form (such as UTF-8 [RFC3629]).
RFC 8265, § 4.2 defines the OpaqueString profile, the enforcement of which requires that the following rules be applied in the following order:
The string must be prepared to ensure that it consists only of Unicode code points explicitly allowed by the FreeformClass string class defined in RFC 8264, § 4.3. Code points are classified as:
Valid: traditional letters and numbers, all printable, non-space code points from the 7-bit ASCII range, space code points, symbol code points, punctuation code points, “[a]ny code point that is decomposed and recomposed into something other than itself under Unicode Normalization Form KC, i.e., the HasCompat (‘Q’) category defined under Section 9.17,” and “[l]etters and digits other than the ‘traditional’ letters and digits allowed in IDNs, i.e., the OtherLetterDigits (‘R’) category defined under Section 9.18.”
Invalid: Old Hangul Jamo code points, control code points, and ignorable code points. Further, any currently unassigned code points are considered invalid.
“Contextual Rule Required”: a number of code points from an “Exceptions” category and “joining code points.” (“Contextual Rule Required” means: “Some characteristics of the code point, such as its being invisible in certain contexts or problematic in others, require that it not be used in a string unless specific other code points or properties are present in the string.”)
Width Mapping Rule: Fullwidth and halfwidth code points MUST NOT be mapped to their decomposition mappings.
Additional Mapping Rule: Any instances of non-ASCII space MUST be mapped to SPACE (U+0020).
Unicode Normalization Form C (NFC) MUST be applied to all strings.
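The mapping and normalization rules above can be sketched with built-in JavaScript features. This is a rough approximation under stated assumptions, not a conforming PRECIS implementation: the FreeformClass check is reduced to rejecting control and unassigned code points (as known to the JavaScript engine), and the function name is invented.

```javascript
// Rough sketch of the OpaqueString steps; NOT a complete PRECIS implementation.
function enforceOpaqueStringSketch(input) {
  // Partial stand-in for the FreeformClass preparation step (assumption):
  // only control (Cc) and unassigned (Cn) code points are rejected here.
  if (/[\p{Cc}\p{Cn}]/u.test(input)) {
    throw new Error('disallowed code point');
  }
  // Width Mapping Rule: nothing to do, fullwidth/halfwidth forms are kept as-is.
  // Additional Mapping Rule: every space separator (Zs) becomes U+0020.
  const mapped = input.replace(/\p{Zs}/gu, ' ');
  // Normalization Rule: apply NFC.
  const normalized = mapped.normalize('NFC');
  // RFC 8265 also does not allow a zero-length password.
  if (normalized.length === 0) {
    throw new Error('empty password');
  }
  return normalized;
}

console.log(enforceOpaqueStringSketch('pass\u00A0word') === 'pass word'); // true
console.log(enforceOpaqueStringSketch('e\u0301') === '\u00e9');           // true
```

For real use, a library that implements the full PRECIS tables is the safer choice; the sketch only shows the order and the effect of the individual rules.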
There are JavaScript implementations of PRECIS, e.g., PRECIS-JS, though I haven’t used it and therefore can’t vouch for it. Its documentation gives this simple implementation example:
precis = require('precis-js');
profile = new precis.profile.UsernameCaseMappedProfile();
try {
result = precis.enforce(profile, string);
} catch (e) {
// handle error
}
I found Paweł Krawczyk’s “PRECIS, the next step in Unicode validation” a very helpful introduction.
I've debugged for a few hours now and have hit a wall - regex has never been my strong suit. I have been able to alter the following regex to restrict the domain to 255 characters just fine; however, in trying to restrict the local/username portion of an email address I'm running into issues implementing a 64-character limit. I've gone through regex101 replacing +s and *s and attempting to understand what each pass is doing - however, even when I add a check against all non-whitespace characters with a limit of 64, it seems like the other checks pass and take precedence - although I'm not sure. Below is my current regex, without any of the 64-character checks that I've broken it with:
var emailCheck = new RegExp(/^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.{0,1}([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))#((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]){1,255}([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]){1,255}([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.*$/i);
What I have so far can be seen at http://jsfiddle.net/mtqx0tz1/ , I've made other slight alterations (e.g. not allowing consecutive dots) but for the most part this regex comes from another stack post without the character limits.
Lastly, I'm aware this isn't the 'standard' so to speak and emails are checked server-side; however, I would like to be more safe than sorry...as well as work on some of my regex. Sorry if this question isn't worthy of an actual post - I'm just simply not seeing where in my passes {1,64} is failing. At this point I'm thinking about just sub-stringing the portion of the string up to the @ sign and checking length that way...but it would be nice to include it in this statement since all the checks are done here to begin with.
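The substring idea mentioned above could look roughly like this. It is a sketch under the assumption that the last @ separates local part and domain; the names hasValidLengths and LOCAL_PART_LIMIT are invented for illustration.

```javascript
// Length check done outside the big pattern: split at the last "@" and
// compare the two parts against the 64 / 255 character limits.
function hasValidLengths(address) {
  const at = address.lastIndexOf('@');
  if (at < 1) return false; // no "@", or nothing in front of it
  const local = address.slice(0, at);
  const domain = address.slice(at + 1);
  return local.length <= 64 && domain.length >= 1 && domain.length <= 255;
}

// Alternative inside a pattern: a lookahead at the very start. It assumes the
// local part itself contains no "@" (so it does not cover quoted local parts).
const LOCAL_PART_LIMIT = /^(?=[^@]{1,64}@)/;

console.log(hasValidLengths('a'.repeat(64) + '@example.com')); // true
console.log(hasValidLengths('a'.repeat(65) + '@example.com')); // false
```

A quantifier such as {1,64} placed on an inner group limits repetitions of that group, not the total length of the local part, which is why a separate length check or a leading lookahead tends to be easier to reason about.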
I have used this regex validation and it works well.
The e-mail address is in the variable strIn.
try
{
return Regex.IsMatch(strIn,
@"^(?("")("".+?(?<!\\)""@)|(([0-9a-z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-z])@))" +
@"(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9][\-a-z0-9]{0,22}[a-z0-9]))$",
RegexOptions.IgnoreCase, TimeSpan.FromMilliseconds(250));
}
catch (RegexMatchTimeoutException)
{
return false;
}
I just tested this function to validate an email. However, it does not validate the presence or absence of the dot in the email. Why is this happening, and how can I fix it?
<script type="text/javascript">
function isEmail(textbox){
var reg4 =/^(\w+)@(\w+).(\w+)$/;
if(reg4.test(textbox.value)){
return true;
}
return false;
}
</script>
No, it's insufficient.
In particular, it doesn't handle all these (legal) variants:
quoted user parts, e.g. "my name"@example.com
Internationalised domain names (IDNs)
Domains with more than two labels
Domains with only one label (yes, those are legal)
Hyphens in domain names
user@[1.2.3.4]
Validating an e-mail address via a regexp is very hard. See section 3.4.1 of RFC 5322 for the formal syntax definition.
Do not use a regular expression to validate email addresses. Email addresses can include a lot of things you wouldn't imagine (like spaces), and it's more likely that the user will accidentally enter a wrong but well-formed email address than a malformed one. Just check that there's an @, then send a confirmation email if you want.
From an answer to a related question:
There is no good (and realistic, see the fully RFC 822 compliant
regex) regular
expression for this problem. The grammar (specified in RFC 5322) is
too complicated for that. Use a real parser or, better, validate by
trying (to send a message).
The dot has a special meaning in a regex.
Use [\.] instead of .
You need to escape the dot, which normally matches any character.
var reg4 =/^(\w+)@(\w+)\.(\w+)$/;
Escape the dot with a backslash: \.. However, your regex would still not be very good afterwards, as it will not accept multiple dots in the post-@ part, such as the domain foo.co.uk.
All in all, I'd advise against using regexes to validate emails as they tend to get overcomplicated ;-) Simply check for the presence of @ followed by a dot, or use a similarly lenient algorithm.
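A lenient check of that kind might be sketched as follows. The name looksLikeEmail is made up, and the check is intentionally loose: it only looks for something, an @, something, a dot, and something more.

```javascript
// Lenient sanity check: "<anything>@<anything>.<anything>".
// It is meant to catch obvious typos, not to prove that an address is valid.
function looksLikeEmail(value) {
  return /^.+@.+\..+$/.test(value);
}

console.log(looksLikeEmail('user@foo.co.uk'));  // true
console.log(looksLikeEmail('user@example'));    // false: no dot after the @
console.log(looksLikeEmail('userexample.com')); // false: no @
```

This also turns away single-label domains such as user@localhost, which are technically legal, so it trades a little strictness for simplicity; a confirmation email remains the real test.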
I have a signup form where users can enter their subdomain of choice when creating an account.
http://_________.ourapp.com
I want them to be able to enter valid characters on the ____________________ part above only. I'm using a text field for that.
Is there a function or some sort of pattern that exists for such situations? Spaces should be filtered, I guess many or all special characters (except numbers, dash and letters) as well?
You can use regular expressions to achieve what you need.
Try something like this:
<input id="username" type="text" onblur="validSubdomain()" />
function validSubdomain() {
var re = /[^a-zA-Z0-9\-]/;
var val = document.getElementById("username").value;
if(re.test(val)) {
alert("invalid");
}
}
Try if(subdomainName.match(/^[a-z0-9][a-z0-9\-]*[a-z0-9]$/)) {...what to do if valid here...} else {...invalid handling here...} - I reckon that ought to work.
Javascript 1.2 and later supports regular expressions. That's practically every browser these days.
Using your example of "numbers dashes and letters" as being acceptable subdomains, you could do something similar to the following, probably run when the "submit" button on the form is clicked (and if the match fails, then cancel the submission).
entry.match(/^[a-zA-Z0-9\-]+$/)
Without more concrete information I really can't give you a full example, but this should get you where you need to go. Of course, keep in mind that javascript validation is not complete for a robust website. You need to re-check this on the server side to protect against people that have javascript disabled (or, in the worst case, malicious users).
For jQuery validation, follow these steps.
First, add a method for the new rule:
$.validator.addMethod("subdomainV", function(value, element) {
var regex = new RegExp("^[a-zA-Z]+[a-zA-Z0-9\\-]*$");
return regex.test(value);
}, "Please provide proper subdomain name");
Then apply the added method to the required field:
subdomain : {
required: true,
subdomainV: true /*** New Rule Applied */
}
So your question is, "What rules do a valid internet domain name follow?"
The answer to that is:
it can only contain:
the 26 letters of the English alphabet (case-insensitive)
numbers (0-9)
hyphen/minus sign (-)
it must start and end with a letter or number, not a hyphen;
the labels must be between 1 and 63 characters long;
the entire hostname cannot exceed 255 characters
A domain name is comprised of multiple labels, each separated by a period. A direct subdomain of ourapp.com would be ben.ourapp.com, where ben, ourapp and com are each labels. But you may also optionally allow users to include periods inside of their subdomain, e.g.:
ben.franklin.ourapp.com
i.have.a.clever.vho.st
In those cases, the user's child domain can be longer than 63 characters in total: each label is still limited to 63 characters, and the child domain as a whole to 244 characters (.ourapp.com takes up 11 of the 255 allowed).
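The rules listed above can be put together in a small check. This is a sketch under the assumptions of the answer (ASCII letters, digits and hyphens only, 63 characters per label, 255 for the whole name); the names LABEL and isValidSubdomain are invented, and ourapp.com is the parent domain from the question.

```javascript
// One label: starts and ends with a letter or digit, hyphens only inside,
// 1 to 63 characters long.
const LABEL = /^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$/i;

// Hypothetical helper: validates what the user typed in front of the parent domain.
function isValidSubdomain(sub, parent = 'ourapp.com') {
  const full = sub + '.' + parent;
  if (full.length > 255) return false;
  // Periods are allowed, so every label is checked on its own.
  return sub.split('.').every((label) => LABEL.test(label));
}

console.log(isValidSubdomain('ben'));          // true
console.log(isValidSubdomain('ben.franklin')); // true
console.log(isValidSubdomain('-ben'));         // false
console.log(isValidSubdomain('my site'));      // false
```

The same check should be repeated on the server, since client-side validation can be bypassed.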
See this Wikipedia article for more info on valid hostnames.
Edit: If you want to support internationalized domain names, things get a bit more complex, though still manageable.