How to normalize a password before hashing using JavaScript?

I am using the unorm Node.js library in my ASP.NET application to normalize the password string with its UNorm.normalize function, like this: UNorm.normalize("NFC", strpwd); but it does not give me any output. To trace it, I executed it in debug mode and found that the error occurs in the unorm.js file, in the function fromData(next, cp, needFeature), and it says "JavaScript runtime error: unable to get property '0' of undefined or null reference". If I press the Ignore button it shows me the output, but if I click the Continue or Break button then no output is produced. I got the library code from http://git.io/unorm. My application code is given below:
<script type="text/javascript" src="unorm-master/src/unorm.js"></script>
<script type="text/javascript">
  function strNormalize() {
    var nstr;
    var strpwd = 'αλφα';
    nstr = UNorm.normalize('NFC', strpwd);
    document.getElementById("txtNormalize").value = nstr;
  }
</script>
Can anyone tell me how to fix this issue in the unorm.js file? Or propose any other solution using JavaScript?

Convert to Punycode via https://github.com/bestiejs/punycode.js/ and then hash.
As long as you do the same conversion both ways (e.g. when you do the first hash to store and when you do the hash for verification), you should be good.
E.g. the punycode for your example would be "--mxaa3a7b".
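A minimal sketch of that approach (assuming the punycode.js library has been loaded and is available as a global punycode object; the hashing step itself is out of scope here):

// punycode.encode converts a string of Unicode symbols to an ASCII-only string
function normalizeForHash(password) {
  return punycode.encode(password);
}

// Use the same conversion when storing the first hash and when hashing for verification:
// hash(normalizeForHash(strpwd))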

Maybe you are using this library incorrectly; there are some examples in the README. Try:
nstr = UNorm.nfc(strpwd);

Depending on the browsers you are targeting, you could just use String.prototype.normalize() (see the documentation). To quote the documentation:
The normalize() method returns the Unicode Normalization Form of a given string (if the value isn't a string, it will be converted to one first).
Parameters
form: One of NFC, NFD, NFKC, or NFKD, specifying the Unicode Normalization Form. If omitted or undefined, NFC is used.
NFC — Normalization Form Canonical Composition.
NFD — Normalization Form Canonical Decomposition.
NFKC — Normalization Form Compatibility Composition.
NFKD — Normalization Form Compatibility Decomposition.
As an example "\u1E9B\u0323".normalize() will produce a string using NFC form.
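Applied to the code from the question, a minimal sketch (assuming the target browsers support String.prototype.normalize) would be:

function strNormalize() {
  var strpwd = 'αλφα';
  var nstr = strpwd.normalize('NFC'); // NFC is also the default form
  document.getElementById("txtNormalize").value = nstr;
}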

As of November 2022, the currently relevant authority from IETF is RFC 8265, “Preparation, Enforcement, and Comparison of Internationalized Strings Representing Usernames and Passwords,” October 2017. This document about usernames and passwords is a special case of the more-general PRECIS specification in the still-authoritative RFC 8264, “PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols,” October 2017.
RFC 8265, § 4.1:
This document specifies that a password is a string of Unicode code points [Unicode] that is conformant to the OpaqueString profile (specified below) of the PRECIS FreeformClass defined in Section 4.3 of [RFC8264] and expressed in a standard Unicode Encoding Form (such as UTF-8 [RFC3629]).
RFC 8265, § 4.2 defines the OpaqueString profile, the enforcement of which requires that the following rules be applied in the following order:
the string must be prepared to ensure that it consists only of Unicode code points explicitly allowed by the FreeformClass string class defined in RFC 8264, § 4.3. Certain characters are specified as:
Valid: traditional letters and numbers, all printable, non-space code points from the 7-bit ASCII range, space code points, symbol code points, punctuation code points, “[a]ny code point that is decomposed and recomposed into something other than itself under Unicode Normalization Form KC, i.e., the HasCompat (‘Q’) category defined under Section 9.17,” and “[l]etters and digits other than the ‘traditional’ letters and digits allowed in IDNs, i.e., the OtherLetterDigits (‘R’) category defined under Section 9.18.”
Invalid: Old Hangul Jamo code points, control code points, and ignorable code points. Further, any currently unassigned code points are considered invalid.
“Contextual Rule Required”: a number of code points from an “Exceptions” category and “joining code points.” (“Contextual Rule Required” means: “Some characteristics of the code point, such as its being invisible in certain contexts or problematic in others, require that it not be used in a string unless specific other code points or properties are present in the string.”)
Width Mapping Rule: Fullwidth and halfwidth code points MUST NOT be mapped to their decomposition mappings.
Additional Mapping Rule: Any instances of non-ASCII space MUST be mapped to SPACE (U+0020).
Unicode Normalization Form C (NFC) MUST be applied to all strings.
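As a rough sketch of just the last two rules above (the mapping and normalization steps only; this is nowhere near a full PRECIS implementation, and the space-matching character class is an approximation of the non-ASCII space code points):

function opaqueStringPrepare(password) {
  // Additional Mapping Rule: map non-ASCII space code points to U+0020 SPACE
  var mapped = password.replace(/[\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]/g, ' ');
  // Normalization Rule: apply Unicode Normalization Form C
  return mapped.normalize('NFC');
}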
There is a JavaScript implementation of PRECIS, e.g., PRECIS-JS, though I haven’t used it and therefore can’t vouch for it. Its documentation gives this simple implementation example:
precis = require('precis-js');
profile = new precis.profile.UsernameCaseMappedProfile();
 
try {
    result = precis.enforce(profile, string);
} catch (e) {
    // handle error
}
I found Paweł Krawczyk’s “PRECIS, the next step in Unicode validation” a very helpful introduction.

Related

TextEncoder / TextDecoder not round tripping

I'm definitely missing something about the TextEncoder and TextDecoder behavior. It seems to me like the following code should round-trip, but it doesn't seem to:
new TextDecoder().decode(new TextEncoder().encode(String.fromCharCode(55296))).charCodeAt(0);
Since I'm just encoding and decoding the string, the char code seems like it should be the same, but this returns 65533 instead of 55296. What am I missing?
Based on some spelunking, the TextEncoder.encode() method appears to take an argument of type USVString, where USV stands for Unicode Scalar Value. According to this page, a USV cannot be a high-surrogate or low-surrogate code point.
Also, according to MDN:
A USVString is a sequence of Unicode scalar values. This definition
differs from that of DOMString or the JavaScript String type in that
it always represents a valid sequence suitable for text processing,
while the latter can contain surrogate code points.
So, my guess is your String argument to encode() is getting converted to a USVString (either implicitly or within encode()). Based on this page, it looks like to convert from String to USVString, it first converts it to a DOMString, and then follows this procedure, which includes replacing all unpaired surrogates with U+FFFD, which is the code point you see, 65533, the "Replacement Character".
The reason String.fromCharCode(55296).charCodeAt(0) works, I believe, is that it doesn't need to do this String -> USVString conversion.
As to why TextEncoder.encode() was designed this way, I don't understand the Unicode details well enough to attempt an explanation, but I suspect it's to simplify implementation, since the only output encoding it supports seems to be UTF-8, in a Uint8Array. I'm guessing that requiring a USVString argument without surrogates (instead of a native UTF-16 String possibly containing lone surrogates) simplifies the encoding to UTF-8, or maybe makes some encoding/decoding use cases simpler.
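A quick way to see that conversion in action:

// A lone high surrogate survives in a plain String...
console.log(String.fromCharCode(55296).charCodeAt(0)); // 55296
// ...but the USVString conversion done for encode() replaces it with U+FFFD
const bytes = new TextEncoder().encode(String.fromCharCode(55296));
console.log(bytes); // Uint8Array [239, 191, 189], the UTF-8 bytes of U+FFFD
console.log(new TextDecoder().decode(bytes).charCodeAt(0)); // 65533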
For those (like me) who aren't sure what "unicode surrogates" are:
The problem
The character code 55296 is not a valid character by itself. So this part of the code is already a problem:
String.fromCharCode(55296)
Since there is no valid character at that char code on its own, encoding and then decoding it yields the replacement character "�" instead, which happens to have the code 65533.
Codes like 55296 are only valid as the first element of a pair of codes. Pairs of codes are used to represent the characters that didn't fit in Unicode's Basic Multilingual Plane. (There are a lot of characters outside the Basic Multilingual Plane, so they need two 16-bit numbers to encode them.)
For example, here is a valid use of the code 55296:
console.log(String.fromCharCode(55296, 57091));
It returns the character "𐌃", from the ancient Etruscan alphabet.
The solution
This code will round-trip correctly:
const code = new TextEncoder().encode(String.fromCharCode(55296, 57091));
console.log(new TextDecoder().decode(code).charCodeAt(0)); // Returns 55296
But beware: .charCodeAt only returns the first part of the pair. A safer option might be to use String.codePointAt to convert the character into a single 32-bit code:
const code = new TextEncoder().encode(String.fromCharCode(55296, 57091));
console.log(new TextDecoder().decode(code).codePointAt(0)); // Returns 66307

Partial matching a string against a regex

Suppose that I have this regular expression: /abcd/
Suppose that I want to check the user input against that regex and disallow entering invalid characters in the input. When the user inputs "ab", it fails as a match for the regex, but I can't disallow entering "a" and then "b", as the user can't enter all 4 characters at once (except for copy/paste). So what I need here is a partial match, which checks whether an incomplete string could potentially become a match for the regex.
Java has something for this purpose: .hitEnd() (described here: http://glaforge.appspot.com/article/incomplete-string-regex-matching). Python doesn't do it natively, but has a package that does the job: https://pypi.python.org/pypi/regex.
I didn't find any solution for it in JS. It has been asked years ago: Javascript RegEx partial match
and even before that: Check if string is a prefix of a Javascript RegExp
P.S. The regex is custom; suppose that the user enters the regex herself and then tries to enter a text that matches that regex. The solution should be a general one that works for regexes entered at runtime.
Looks like you're lucky: I've already implemented that stuff in JS (it works for most patterns, which may be enough for you). See my answer here. You'll also find a working demo there.
There's no need to duplicate the full code here, I'll just state the overall process:
Parse the input regex, and perform some replacements. There's no need for error handling as you can't have an invalid pattern in a RegExp object in JS.
Replace abc with (?:a|$)(?:b|$)(?:c|$)
Do the same for any "atoms". For instance, a character group [a-c] would become (?:[a-c]|$)
Keep anchors as-is
Keep negative lookaheads as-is
Had JavaScript had more advanced regex features, this transformation might not have been possible. But with its limited feature set, it can handle most input regexes. It will yield incorrect results on regexes with backreferences, though, if your input string ends in the middle of a backreference match (like matching ^(\w+)\s+\1$ against hello hel).
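A minimal sketch of the idea, restricted to literal-only patterns (no real regex parsing; toPartialRegex is just an illustrative name): "abcd" becomes /^(?:a|$)(?:b|$)(?:c|$)(?:d|$)$/ so that every prefix still matches.

function toPartialRegex(literalPattern) {
  const parts = literalPattern.split('').map(ch =>
    '(?:' + ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&') + '|$)'); // escape regex metacharacters
  return new RegExp('^' + parts.join('') + '$');
}

const partial = toPartialRegex('abcd');
console.log(partial.test('ab'));  // true  -- could still become a full match
console.log(partial.test('abx')); // false -- can never become "abcd"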
As many have stated, there is no standard library for this; fortunately, I have written a JavaScript implementation that does exactly what you require. With some minor limitations, it works for the regular expressions supported by JavaScript.
see: incr-regex-package.
Further, there is also a React component that uses this capability to provide some useful features:
Check input as you type
Auto complete where possible
Make suggestions for possible input values
There is a demo of the capabilities and a demo of its use.
I think you have to have two regexes: one for typing, /a?b?c?d?/, and one for testing at the end, when pasting or leaving the input, /abcd/.
This will test for a valid phone number:
const input = document.getElementById('input')
let oldVal = ''
input.addEventListener('keyup', e => {
  if (/^\d{0,3}-?\d{0,3}-?\d{0,3}$/.test(e.target.value)) {
    oldVal = e.target.value
  } else {
    e.target.value = oldVal
  }
})
input.addEventListener('blur', e => {
  console.log(/^\d{3}-?\d{3}-?\d{3}-?$/.test(e.target.value) ? 'valid' : 'not valid')
})
<input id="input">
And this is the case for first name and surname:
const input = document.getElementById('input')
let oldVal = ''
input.addEventListener('keyup', e => {
  if (/^[A-Z]?[a-z]*\s*[A-Z]?[a-z]*$/.test(e.target.value)) {
    oldVal = e.target.value
  } else {
    e.target.value = oldVal
  }
})
input.addEventListener('blur', e => {
  console.log(/^[A-Z][a-z]+\s+[A-Z][a-z]+$/.test(e.target.value) ? 'valid' : 'not valid')
})
<input id="input">
This is the hard solution for those who think there's no solution at all: implement the Python version (https://bitbucket.org/mrabarnett/mrab-regex/src/4600a157989dc1671e4415ebe57aac53cfda2d8a/regex_3/regex/_regex.c?at=default&fileviewer=file-view-default) in JS. So it is possible. If someone has a simpler answer, they'll win the bounty.
Example using the Python regex module (a regular expression with a backreference):
$ pip install regex
$ python
>>> import regex
>>> regex.Regex(r'^(\w+)\s+\1$').fullmatch('abcd ab',partial=True)
<regex.Match object; span=(0, 7), match='abcd ab', partial=True>
You guys would probably find this page of interest:
(https://github.com/desertnet/pcre)
It was a valiant effort: a WebAssembly implementation that supports PCRE. I'm still playing with it, but I suspect it's not practical. The WebAssembly binary weighs in at ~300K, and if your JS terminates unexpectedly, you can end up not destroying the module and consequently leaking significant memory.
The bottom line is: this is clearly something the ECMAScript people should be formalizing, and browser manufacturers should be furnishing (kudos to the WebAssembly developer for possibly shaming them into getting on the stick...).
I recently tried using the "pattern" attribute of an input[type='text'] element. Like so many others, I found it a letdown that it would not validate until the form was submitted. So a person could waste their time typing (or pasting...) numerous characters and jumping on to other fields, only to find out after a form submit that they had entered that field wrong. Ideally, I wanted it to validate the field input immediately, as the user types each key (or at the time of a paste...).
The trick to doing a partial regex match (until the ECMAScript people and browser makers get it together with PCRE...) is to specify not only a pattern regex, but also associated template value(s) as a data attribute. If your field input is shorter than the pattern (or input.maxLength...), the validator can use them as a suffix for validation purposes. YES, this will not be practical for regexes with complex case outcomes; but for fixed-position template pattern matching, which is USUALLY what is needed, it's fine (if you happen to need something more complex, you can build on the methods shown in my code...).
The example is for a bitcoin address [ Do I have your attention now? -OK, not the people who don't believe in digital currency tech... ] The key JS function that gets this done is validatePattern. The input element in the HTML markup would be specified like this:
<input id="forward_address"
name="forward_address"
type="text"
maxlength="90"
pattern="^(bc(0([ac-hj-np-z02-9]{39}|[ac-hj-np-z02-9]{59})|1[ac-hj-np-z02-9]{8,87})|[13][a-km-zA-HJ-NP-Z1-9]{25,34})$"
data-entry-templates="['bc099999999999999999999999999999999999999999999999999999999999','bc1999999999999999999999999999999999999999999999999999999999999999999999999999999999999999','19999999999999999999999999999999999']"
onkeydown="return validatePattern(event)"
onpaste="return validatePattern(event)"
required
/>
[Credit goes to this post: "RegEx to match Bitcoin addresses?" Note to old-school bitcoin zealots who will decry the use of a zero in the regex here: it's just an example for accomplishing PRELIMINARY validation; the server accepting the address passed off by the browser can do an RPC call after the form post to validate it much more rigorously. Adjust your regex to suit.]
The exact choice of characters in the data-entry-template was a bit arbitrary, but they had to be ones such that, if the input typed or pasted by the user is still incomplete in length, the validator can use them as an optimistic stand-in and the input so far will still be considered valid. For the last of the data-entry-templates ('19999999999999999999999999999999999'), that is a "1" followed by enough nines to reach the maximum length the regex spec "{25,34}" allows for the second character span/group. Because there were two forms to expect, the "bc" prefix and the older "1"/"3" prefix, I furnished a few stand-in templates for the validator to try (if the input passes just one of them, it validates...). In each template case, I furnished the longest possible pattern, so as to ensure the most permissive possibility in terms of length.
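To illustrate just the suffix idea in isolation (a hypothetical sketch, not the actual validatePattern from the repository linked further down; partialMatchesPattern and its argument names are made up here):

function partialMatchesPattern(value, patternSource, templates) {
  var re = new RegExp(patternSource); // the pattern attribute above is already anchored with ^...$
  return templates.some(function (tpl) {
    if (value.length > tpl.length) return false;
    // borrow the tail of a template as an optimistic stand-in for the not-yet-typed part
    return re.test(value + tpl.slice(value.length));
  });
}
// As the user types or pastes, reject the change whenever this returns false.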
If you were generating this markup on a dynamic web content server, an example with template variables (a la django...) would be:
<input id="forward_address"
name="forward_address"
type="text"
maxlength="{{MAX_BTC_ADDRESS_LENGTH}}"
pattern="{{BTC_ADDRESS_REGEX}}" {# base58... #}
data-entry-templates="{{BTC_ADDRESS_TEMPLATES}}" {# base58... #}
onkeydown="return validatePattern(event)"
onpaste="return validatePattern(event)"
required
/>
[Keep in mind: I went to the deeper end of the pool here. You could just as well use this for simpler patterns of validation.]
And if you prefer to not use event attributes, but to transparently hook the function to the element's events at document load -knock yourself out.
You will note that we need to hook validatePattern on the following events:
The keydown, to intercept delete and backspace keys.
The paste (the clipboard is pasted into the field's value, and if it works, it accepts it as valid; if not, the paste does not transpire...)
Of course, I also took into account when text is partially selected in the field, dictating that a key entry or pasted text will replace the selected text.
And here's a link to the [dependency-free] code that does the magic:
https://gitlab.com/osfda/validatepattern.js
(If it happens to generate interest, I'll integrate constructive and practical suggestions and give it a better readme...)
PS: The incremental-regex package posted above by Lucas Trzesniewski:
Appears not to have been updated? (I saw signs that it was undergoing modification??)
Is not browserified. (I tried doing that, to kick the tires on it; it was a module mess. I welcome anyone else here to post a browserified version for testing; if it works, I'll integrate it with my input validation hooks and offer it as an alternative solution. If you succeed in getting it browserified, sharing the exact steps that were needed would also edify everyone on this post. I tried using the esm package to fix the version incompatibilities faced by browserify, but it was no go...)
I strongly suspect (although I'm not 100% sure) that the general case of this problem has no solution, in the same way as Turing's famous "halting problem" (see undecidable problem). And even if there is a solution, it will most probably not be what users actually want, and thus, depending on your strictness, will result in a bad-to-horrible UX.
Example:
Assume "target RegEx" is [a,b]*c[a,b]* also assume that you produced a reasonable at first glance "test RegEx" [a,b]*c?[a,b]* (obviously two c in the string is invalid, yeah?) and assume that the current user input is aabcbb but there is a typo because what the user actually wanted is aacbbb. There are many possible ways to fix this typo:
remove c and add it before first b - will work OK
remove first b and add after c - will work OK
add c before first b and then remove the old one - Oops, we prohibit this input as invalid and the user will go crazy because no normal human can understand such a logic.
Note also that your hitEnd will have the same problem here unless you prohibit user to enter characters in the middle of the input box that will be another way to make a horrible UI.
In the real life there would be many much more complicated examples that any of your smart heuristics will not be able to account for properly and thus will upset users.
So what to do? I think the only thing you can do and still get reasonable UX is the simplest thing you can do i.e. just analyze your "target RegEx" for set of allowed characters and make your "test RegEx" [set of allowed chars]*. And yes, if the "target RegEx" contains . wildcart, you will not be able to do any reasonable filtering at all.
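A minimal sketch of that fallback, assuming a hypothetical helper (allowedCharsTest is a made-up name) that only handles targets built from plain literals and simple character classes, shown here with an anchored, comma-free variant of the example pattern:

function allowedCharsTest(targetSource) {
  // e.g. for a target like '^[ab]*c[ab]*$', collect the literal characters it can ever use
  const chars = new Set(targetSource.replace(/[\^\$\*\+\?\(\)\[\]\{\}\|\\]/g, ''));
  const cls = [...chars].join('').replace(/[-\\\]]/g, '\\$&'); // escape class metacharacters
  return new RegExp('^[' + cls + ']*$');
}

const test = allowedCharsTest('^[ab]*c[ab]*$');
console.log(test.test('aabc')); // true  -- only characters the target can ever use
console.log(test.test('aabx')); // false -- 'x' can never appear in a match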

Define allowed characters in text objects HTML

Is there any way I can define the encoding in text areas using HTML and pure JS?
I want them not to permit special Unicode characters (such as ♣♦♠).
The valid character range (for my purpose) is from Unicode code point U+0000 to U+00FF.
It is OK to silently replace invalid characters with an empty string upon form-submission (without warning to the user).
So, as you have clarified in your comments: you want to replace the characters you deem illegal with empty strings on form-submission without warning.
Given the following example html (body content):
<form action="demo_form.asp">
  First name: <input type="text" name="fname" /><br>
  Last name: <input type="text" name="lname" /><br>
  Likes: <textarea name="txt_a"></textarea><br>
  Dislikes: <textarea name="txt_b"></textarea><br>
  <input type="submit" value="Submit">
</form>
Here is a basic concept javascript:
function demo(){
for( var elms=this.getElementsByTagName('textarea')
, L=elms.length
; L--
; elms[L].value=elms[L].value.replace(/[^\u0000-\u00FF]/g,'')
);
}
window.onload=function(){
document.forms[0].onsubmit=demo; //hook form's onsubmit use any method you like
};
The basic idea is to force the browser's regex engine to match on Unicode (not local charset) using the \uXXXX notation.
Then we simply make a range: [\u0000-\u00FF] and finally specify we want to match on everything outside that range: [^\u0000-\u00FF].
Everything that matches those criteria will be replaced by '' (an empty string) on form-submission. No warning no nothing.
You can/should freely expand this concept to incorporate this into your code (in a way that fits your code-flow) (and where needed, apply it to input type="text" etc), depending on your further requirements.
This should get you started!
EDIT:
Note that your current valid-range specification (\u0000-\u00FF) will effectively dis-allow all such 'pesky' special characters like:
fancy quotes ‘ ’ “ ”
(that's a great feature for people copying from Word etc.),
€ ™ Œ œ, etc.
But it will also 'nicely' include the full C1 control-block (all 32 control characters). On the other hand, that is at least consistent with including the full C0 control-block.
Effectively, this is now your (what you requested) valid char-set: http://en.wikipedia.org/wiki/ISO/IEC_8859-1
As you can now see, there is a lot more to this. That is why sane applications (finally) are starting to use Unicode (usually encoded for the web as UTF-8) and just accept what the users provide (within (extremely clearly specified) reason)!
Most common validation questions are (in the real world) nothing more than a high-school-class example of the concept of validating (and, even more to the point, of explaining the basics of regular expressions with what are considered easily understandable examples, like name/email/address). Sadly they are widely applied, even by some government identity systems (up to passports etc.), to people's names, addresses, etc. In fact, even the full current Unicode cannot represent every (currently living) person's name in its native writing!! Real-world example: try entering and leaving a commercial flight when your boarding pass has different credentials than your passport (regardless of which one is wrong). 'Just' a missing umlaut is going to be a problem somewhere; for a worse example, imagine a woman with a German first name and a Thai last name, married to a man with a Mandarin last name..
Source: xkcd.com/1171/
Finally: Please do realize that in most cases this whole exercise is useless (if you do it silently without warning), because:
you may never just accept user input on the server side without proper cleanup, so you are already (silently, without the user knowing it) cleaning up your input to the form that you require (to a novice programmer, who forgets to think about, for example, users with JavaScript disabled, this sometimes feels like repeating work already done in JavaScript on the client side)...
Usually, the only use of replicating the server-side behavior on the client side (usually using JavaScript) is so that the user dynamically knows what would be disallowed by the server (without sending data back and forth) and can adapt accordingly!
You can use the form attribute accept-charset:
The accept-charset attribute specifies the character encodings that are to be used for the form submission.
The default value is the reserved string "UNKNOWN" (indicates that the encoding equals the encoding of the document containing the <form> element).
See this documentation http://www.w3schools.com/tags/att_form_accept_charset.asp
I cannot say if this will protect the text field but at least it controls what character set is submitted by the form.
Actually, this issue has already been answered:
javascript to prevent writing into form elements after n utf 8 characters

What kind of encoding is this?

I've got some data from DBpedia using Jena, and since Jena's output is based on XML, there are some circumstances where XML characters need to be treated differently, like the following:
Guns n &#039; Roses
I just want to know what kind of encoding this is.
I want to decode/encode my input based on the above encoding with the help of JavaScript and send it back to a servlet.
(Edited post: if you remove the space between & and amp you will get the correct character; since I couldn't find a way to show that on Stack Overflow, I decided to write it like that!)
Seems to be XML entity encoding, and a numeric character reference (decimal).
A numeric character reference refers to a character by its Universal Character Set/Unicode code point, and uses the format &#nnnn; (decimal) or &#xhhhh; (hexadecimal).
You can get some info here: List of XML and HTML character entity references on Wikipedia.
Your character is number 39, being the apostrophe: ', which can also be referenced with a character entity reference: &apos;.
To decode this using Javascript, you could use for example php.js, which has an html_entity_decode() function (note that it depends on get_html_translation_table()).
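If you'd rather not pull in a library, a minimal sketch for plain decimal references like the one above (decodeNumericRefs is just an illustrative name) could be:

function decodeNumericRefs(str) {
  // replace each decimal numeric character reference with the code point it names
  return str.replace(/&#(\d+);/g, function (_, dec) {
    return String.fromCodePoint(parseInt(dec, 10));
  });
}

console.log(decodeNumericRefs('Guns n &#039; Roses')); // "Guns n ' Roses"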
UPDATE, in reply to your edit: basically that is the same; the only difference is that it was encoded twice (possibly by mistake). &amp; is the ampersand: &.
This is an SGML/HTML/XML numeric character entity reference.
In this case for an apostrophe '.

JavaScript/HTML/Unicode accents: á != á

I want to check if a user submitted string is the same as the string in my answer key. Sometimes the words involve Spanish accents (like in sábado), and that makes the condition always false.
I have Firebug log $('#answer').val() and it shows up as sábado. (The á comes from a button that inserts the value á, if that matters.) Logging the answer from the answer key, however, shows s&aacute;bado (how I wrote it in the actual answer key).
I have tried replacing the &aacute; in the answer key with a normal á, but it still doesn't work, and it results in a Unicode diamond-question-mark. When I do that and also replace the value of the button that makes the user-submitted á, the condition works correctly, but then the button, the user string, and the answer string all show the weird Unicode diamond-question-mark.
I have also tried using á in both places and it's no different from using á. Both my HTML and Javascript are using charset="utf-8".
How can I fix this?
If you're consistently using UTF-8, there's no need for HTML entities except to encode syntax (i.e. <, >, & and, within attributes, ").
For anything else, use the proper characters, and your problems should go away, until you run into Unicode normalization issues, i.e. the difference between 'a\u0301' and '\u00E1'...
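A quick sketch of that normalization issue, and of comparing after normalizing both strings to NFC:

// Two visually identical strings that compare as different...
const decomposed  = 'sa\u0301bado'; // "sábado" written as a + combining acute accent
const precomposed = 's\u00E1bado';  // "sábado" written with the precomposed character
console.log(decomposed === precomposed);                                   // false
// ...but equal once both are normalized to the same form
console.log(decomposed.normalize('NFC') === precomposed.normalize('NFC')); // true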
The issue is that you're not using the real UTF-8 characters in both strings (the entered answer and the key). You should NOT be supplying "a button that inserts the value á" (re: "if that matters": it does!).
The characters should be added by the keyboard input system. And your comparison string should also contain only UTF-8 characters; it should NOT contain character entities.
