Below is a Base64 image-encoding function that I got from Philippe Tenenhaus (http://www.philten.com/us-xmlhttprequest-image/).
I find it very confusing, but I'd love to understand it.
I think I understand the bitwise & and | operators, and shifting bits with << and >>.
I'm especially confused by these lines:
((byte1 & 3) << 4) | (byte2 >> 4);
((byte2 & 15) << 2) | (byte3 >> 6);
I also don't understand why it still uses byte1 for enc2 and byte2 for enc3,
or the purpose of enc4 = byte3 & 63; ...
Could someone explain this function?
function base64Encode(inputStr)
{
var b64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";
var outputStr = "";
var i = 0;
while (i < inputStr.length)
{
//all three "& 0xff" added below are there to fix a known bug
//with bytes returned by xhr.responseText
var byte1 = inputStr.charCodeAt(i++) & 0xff;
var byte2 = inputStr.charCodeAt(i++) & 0xff;
var byte3 = inputStr.charCodeAt(i++) & 0xff;
var enc1 = byte1 >> 2;
var enc2 = ((byte1 & 3) << 4) | (byte2 >> 4);
var enc3, enc4;
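// NaN & 0xff evaluates to 0, so isNaN() on the masked bytes is never true;
// compare the index against the input length to detect missing bytes
if (i - 2 >= inputStr.length)
{
enc3 = enc4 = 64;
}
else
{
enc3 = ((byte2 & 15) << 2) | (byte3 >> 6);
if (i > inputStr.length)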
{
enc4 = 64;
}
else
{
enc4 = byte3 & 63;
}
}
outputStr += b64.charAt(enc1) + b64.charAt(enc2) + b64.charAt(enc3) + b64.charAt(enc4);
}
return outputStr;
}
It probably helps to understand what Base64 encoding does. It takes 24 bits held as three 8-bit bytes and regroups them into four 6-bit values. (http://en.wikipedia.org/wiki/Base64)
So enc1 is the first 6 bits, which are the first 6 bits of the first byte.
enc2 is the next 6 bits: the last 2 bits of the first byte and the first 4 bits of the second byte. The bitwise AND operation byte1 & 3 targets the last 2 bits of the first byte.
So,
XXXXXXXX & 00000011 = 000000XX
It is then shifted to the left 4 bits.
000000XX << 4 = 00XX0000.
The byte2 >> 4 performs a right bit shift, isolating the first 4 bits of the second Byte, shown below
YYYYXXXX >> 4 = 0000YYYY
So, ((byte1 & 3) << 4) | (byte2 >> 4) combines the results with a bitwise or
00XX0000 | 0000YYYY = 00XXYYYY
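enc3 is the last 4 bits of the second byte and the first 2 bits of the third byte, built the same way: byte2 & 15 keeps the low 4 bits, << 2 makes room for 2 more, and byte3 >> 6 supplies the top 2 bits of the third byte.
enc4 is the last 6 bits of the third byte, which byte3 & 63 (00111111) masks out.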
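As a concrete check of the steps above, here is a minimal sketch that applies the same four expressions to the three bytes of "Man". The helper name encodeTriple is made up for this illustration; the alphabet is the standard Base64 one.

```javascript
// Standard Base64 alphabet (indices 0-63), without the '=' padding character.
var B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: turns exactly three bytes (0-255) into four 6-bit indices.
function encodeTriple(byte1, byte2, byte3) {
  var enc1 = byte1 >> 2;                          // top 6 bits of byte1
  var enc2 = ((byte1 & 3) << 4) | (byte2 >> 4);   // low 2 bits of byte1 + top 4 bits of byte2
  var enc3 = ((byte2 & 15) << 2) | (byte3 >> 6);  // low 4 bits of byte2 + top 2 bits of byte3
  var enc4 = byte3 & 63;                          // low 6 bits of byte3
  return [enc1, enc2, enc3, enc4];
}

// "Man" is 77, 97, 110 = 01001101 01100001 01101110
var indices = encodeTriple(77, 97, 110);          // [19, 22, 5, 46]
var encoded = indices.map(function (n) { return B64.charAt(n); }).join("");
console.log(indices, encoded);                    // encoded is "TWFu"
```

Each output index borrows bits from at most two neighbouring bytes, which is why byte1 shows up again in enc2 and byte2 again in enc3.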
charCodeAt returns a UTF-16 code unit, which is a 16-bit value, so the & 0xff masks assume that the relevant information is only in the low 8 bits. If the input can contain characters above U+00FF whose high byte matters, that information is silently lost, so this assumption makes me wonder whether there is still a bug in the code.
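Two small experiments make this concern concrete. This is only a sketch, and the sample strings are arbitrary.

```javascript
// charCodeAt can return values above 255; masking keeps only the low byte.
var code = "\u0434".charCodeAt(0);    // Cyrillic "д": 0x0434 = 1076
var masked = code & 0xff;             // 0x34 = 52, the high byte 0x04 is gone

// Reading past the end of a string yields NaN, but the bitwise AND
// converts NaN to 0, so an isNaN() test made after masking never fires.
var pastEnd = "A".charCodeAt(1);      // NaN
var pastEndMasked = pastEnd & 0xff;   // 0

console.log(code, masked, isNaN(pastEnd), isNaN(pastEndMasked)); // 1076 52 true false
```

The second experiment matters for padding: any check for a missing byte has to look at the raw charCodeAt result or at the index, not at the masked value.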
For encoding, JavaScript pulls from the standard ASCII table for mapping characters. I found the function below, which correctly encodes to Ascii85/Base85. But I want to encode to the Z85 variation because it contains the set of symbols that I require. My understanding is that Z85 should work exactly the same as Ascii85/Base85, except that Z85 maps the values in a different order from the ASCII standard and uses a different combination of symbols from the standard Ascii85 mapping. So the character set is the only difference:
Ascii85 uses the 85 characters with codes 33 through 117 (reference):
"!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstu
Z85 uses a custom set of 85 characters (reference):
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#
My question is, is there any way to redefine the character set that charCodeAt and fromCharCode refer to in this function so that it would then encode in Z85?
// By Steve Hanov. Released to the public domain.
function encodeAscii85(input) {
// Adobe standard prefix "<~" left out; output must still be declared
var output = "";
var chr1, chr2, chr3, chr4, chr, enc1, enc2, enc3, enc4, enc5;
var i = 0;
while (i < input.length) {
// Access past the end of the string is intentional.
chr1 = input.charCodeAt(i++);
chr2 = input.charCodeAt(i++);
chr3 = input.charCodeAt(i++);
chr4 = input.charCodeAt(i++);
chr = ((chr1 << 24) | (chr2 << 16) | (chr3 << 8) | chr4) >>> 0;
enc1 = (chr / (85 * 85 * 85 * 85) | 0) % 85 + 33;
enc2 = (chr / (85 * 85 * 85) | 0) % 85 + 33;
enc3 = (chr / (85 * 85) | 0 ) % 85 + 33;
enc4 = (chr / 85 | 0) % 85 + 33;
enc5 = chr % 85 + 33;
output += String.fromCharCode(enc1) +
String.fromCharCode(enc2);
if (!isNaN(chr2)) {
output += String.fromCharCode(enc3);
if (!isNaN(chr3)) {
output += String.fromCharCode(enc4);
if (!isNaN(chr4)) {
output += String.fromCharCode(enc5);
}
}
}
}
// Remove Adobe standard suffix
// output += "~>";
return output;
}
Extra notes:
Alternatively, I thought I could use something like the following function, but the problem is that it doesn't encode Ascii85 properly in the first place. If it were correct, Hello world! should encode to 87cURD]j7BEbo80, but this function encodes it to RZ!iCB=*gD0D5_+ (reference).
I don't understand the algorithm well enough to know what is wrong with the mapping here. Ideally, if it encoded correctly, I would be able to update this function to use the Z85 character set:
// Adapted from: Ascii85 JavaScript implementation, 2012.10.16 Jim Herrero
// Original: https://jsfiddle.net/nderscore/bbKS4/
var Ascii85 = {
// Ascii85 mapping
_alphabet: "!\"#$%&'()*+,-./0123456789:;<=>?@"+
"ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`"+
"abcdefghijklmnopqrstu"+
"y"+ // short form 4 spaces (optional)
"z", // short form 4 nulls (optional)
// functions
encode: function(input) {
var alphabet = Ascii85._alphabet,
useShort = alphabet.length > 85,
output = "", buffer, val, i, j, l;
for (i = 0, l = input.length; i < l;) {
buffer = [0,0,0,0];
for (j = 0; j < 4; j++)
if(input[i])
buffer[j] = input.charCodeAt(i++);
for (val = buffer[3], j = 2; j >= 0; j--)
val = val*256+buffer[j];
if (useShort && !val)
output += alphabet[86];
else if (useShort && val == 0x20202020)
output += alphabet[85];
else {
for (j = 0; j < 5; j++) {
output += alphabet[val%85];
val = Math.floor(val/85);
}
}
}
return output;
}
};
Character codes are character codes. You can't change the behavior of String.fromCharCode() or String.charCodeAt().
However, you can store your custom character set in an array and use array indexing and Array.indexOf() to look up entries.
Updating this function to work with Z85 will be tricky, though, because String.fromCharCode() and String.charCodeAt() are used in two different contexts -- they're sometimes used to access the unencoded string (which doesn't need to change), and sometimes for the encoded string (which does). You will need to take care to not confuse the two.
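One way to keep the two contexts apart is to take the input as plain byte values and to look up output characters by indexing an alphabet string, so that fromCharCode never touches the encoded side. The following is a minimal sketch under some assumptions: encodeBase85 and toBytes are names invented here, the Z85 alphabet is the one from the ZeroMQ Z85 specification, and a short final group is handled the Ascii85 way (Z85 itself expects the input length to be a multiple of 4).

```javascript
// Ascii85 alphabet: the 85 characters with codes 33 ('!') through 117 ('u').
var ASCII85 = "";
for (var code = 33; code <= 117; code++) ASCII85 += String.fromCharCode(code);

// Z85 alphabet (85 characters).
var Z85 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#";

// Hypothetical helper: bytes is an array of numbers 0-255, alphabet an 85-character string.
function encodeBase85(bytes, alphabet) {
  var output = "";
  for (var i = 0; i < bytes.length; i += 4) {
    var count = Math.min(4, bytes.length - i);   // real bytes in this group
    var value = 0;
    for (var j = 0; j < 4; j++) {
      // Big-endian 32-bit value; missing bytes are treated as 0.
      value = value * 256 + (j < count ? bytes[i + j] : 0);
    }
    var digits = "";
    for (var k = 0; k < 5; k++) {
      // Prepend so that the most significant base-85 digit comes first.
      digits = alphabet.charAt(value % 85) + digits;
      value = Math.floor(value / 85);
    }
    // Ascii85 convention for a short final group: keep count + 1 characters.
    output += digits.slice(0, count + 1);
  }
  return output;
}

// Hypothetical helper: treat each character of a string as one byte (codes 0-255 only).
function toBytes(str) {
  return str.split("").map(function (ch) { return ch.charCodeAt(0) & 0xff; });
}

console.log(encodeBase85(toBytes("Hello world!"), ASCII85));                      // 87cURD]j7BEbo80
console.log(encodeBase85([0x86, 0x4F, 0xD2, 0x6F, 0xB5, 0x59, 0xF7, 0x5B], Z85)); // HelloWorld
```

Switching between the two encodings is then only a matter of which alphabet string is passed in. A decoder would do the reverse lookup with alphabet.indexOf().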
I'm writing a program in JavaScript that needs to convert text to 8-bit binary, which I accomplish with a loop that uses "exampleVariable.charCodeAt(i).toString(2)", then appends "0"s to the front until the length of the binary representation of each character is 8 bits. However, when Russian characters are passed into the function, each character is converted to an 11-bit binary representation, when it should actually be 16 bits. For example, "д" converts to 10000110100, when, in actuality, it should convert to "1101000010110100". Any ideas on how to fix this?
It looks like you are trying to get the binary representation of the UTF-8 encoding of the character. JavaScript uses UTF-16 internally, so you will have to do some work to translate. There are various libraries out there, but we'd need to know more about the environment to recommend the right tools. If you wanted to code it up yourself, it would be roughly:
function codepointToUTF_8(code) {
if (code < 0x80) { // 0x00-0x7f fit in a single byte
return [code];
} else if (code < 0x800) {
var byte1 = 0xc0 | (code >> 6 );
var byte2 = 0x80 | (code & 0x3f);
return [ byte1, byte2 ];
} else if (code < 0x10000) {
var byte1 = 0xe0 | ( code >> 12 );
var byte2 = 0x80 | ((code >> 6 ) & 0x3f);
var byte3 = 0x80 | ( code & 0x3f);
return [ byte1, byte2, byte3 ];
} else {
var byte1 = 0xf0 | ( code >> 18 );
var byte2 = 0x80 | ((code >> 12) & 0x3f);
var byte3 = 0x80 | ((code >> 6) & 0x3f);
var byte4 = 0x80 | ( code & 0x3f);
return [ byte1, byte2, byte3, byte4 ];
}
}
function strToUTF_8 (str) {
var result = [];
for (var i = 0; i < str.length; i++) {
// NOTE: this will not handle anything beyond the BMP
result.push(codepointToUTF_8(str.charCodeAt(i)));
}
console.log('result = ', result);
return [].concat.apply([], result);
}
function byteToBinary (b) {
var str = b.toString(2);
while (str.length < 8) {
str = '0' + str;
}
return str;
}
function toBinaryUTF_8 (str) {
return strToUTF_8(str).map(byteToBinary).join(' ');
}
console.log("абвгд => '" + toBinaryUTF_8("абвгд") + "'");
When I execute this I get:
абвгд => '11010000 10110000 11010000 10110001 11010000 10110010 11010000 10110011 11010000 10110100'
I haven't tested this thoroughly, but it should handle the Russian characters OK. It produces an array of UTF-8 byte values; if you convert each of those to 8 binary digits, as you were trying to do before, you should be fine.
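In environments that provide the standard TextEncoder API (current browsers and Node.js), the UTF-8 conversion can be delegated to it, which also covers characters outside the BMP. A minimal sketch, where toBinaryUtf8 is a name made up for this example:

```javascript
// Hypothetical helper: UTF-8 encode a string and render each byte as 8 binary digits.
function toBinaryUtf8(str) {
  var bytes = new TextEncoder().encode(str);   // Uint8Array of UTF-8 bytes
  return Array.from(bytes, function (b) {
    return b.toString(2).padStart(8, "0");     // left-pad each byte to 8 bits
  }).join(" ");
}

console.log(toBinaryUtf8("\u0434"));   // "д" => 11010000 10110100
```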
In JavaScript I am trying to convert Unicode characters into byte-based hex escape sequences that are compatible with C:
e.g. 😄
becomes: \xF0\x9F\x98\x84 (correct)
NOT JavaScript surrogates, i.e. not \uD83D\uDE04 (wrong)
I cannot figure out the mathematical relationship between the four bytes C wants and the two surrogates JavaScript uses. I suspect the algorithm is far more complex than my feeble attempts.
Thanks for any tips.
encodeURIComponent does this work:
var input = "\uD83D\uDE04";
var result = encodeURIComponent(input).replace(/%/g, "\\x"); // \xF0\x9F\x98\x84
Update: Actually, C strings can contain digits and letters without escaping, but if you really need to escape them:
function escape(s, escapeEverything) {
if (escapeEverything) {
s = s.replace(/[\x10-\x7f]/g, function (s) {
return "-x" + s.charCodeAt(0).toString(16).toUpperCase();
});
}
s = encodeURIComponent(s).replace(/%/g, "\\x");
if (escapeEverything) {
s = s.replace(/\-/g, "\\");
}
return s;
}
Found a solution here: http://jonisalonen.com/2012/from-utf-16-to-utf-8-in-javascript/
I would never have figured out THAT math, wow.
Somewhat minified:
function UTF8seq(s) {
var i,c,u=[];
for (i=0; i < s.length; i++) {
c = s.charCodeAt(i);
if (c < 0x80) { u.push(c); }
else if (c < 0x800) { u.push(0xc0 | (c >> 6), 0x80 | (c & 0x3f)); }
else if (c < 0xd800 || c >= 0xe000) { u.push(0xe0 | (c >> 12), 0x80 | ((c>>6) & 0x3f), 0x80 | (c & 0x3f)); }
else { i++; c = 0x10000 + (((c & 0x3ff)<<10) | (s.charCodeAt(i) & 0x3ff));
u.push(0xf0 | (c >>18), 0x80 | ((c>>12) & 0x3f), 0x80 | ((c>>6) & 0x3f), 0x80 | (c & 0x3f)); }
}
for (i=0; i < u.length; i++) { u[i]=u[i].toString(16); }
return '\\x'+u.join('\\x');
}
Your C code expects a UTF-8 string (the symbol is represented as 4 bytes). The JS representation you see, however, is UTF-16 (the symbol is represented as 2 uint16s, a surrogate pair).
You will first need to get the (Unicode) code point for your symbol (from the UTF-16 JS string), then build the UTF-8 representation for it from that.
Since ES6 you can use the codePointAt method for the first part; where it is not supported natively, I would recommend using a shim. I guess you don't want to decode surrogate pairs yourself :-)
For the rest, I don't think there's a library method, but you can write it yourself according to the spec:
function hex(x) {
x = x.toString(16);
return (x.length > 2 ? "\\u0000" : "\\x00").slice(0,-x.length)+x.toUpperCase();
}
var c = "😄";
console.log(c.length, hex(c.charCodeAt(0))+hex(c.charCodeAt(1))); // 2, "\uD83D\uDE04"
var cp = c.codePointAt(0);
var bytes = new Uint8Array(4);
bytes[3] = 0x80 | cp & 0x3F;
bytes[2] = 0x80 | (cp >>>= 6) & 0x3F;
bytes[1] = 0x80 | (cp >>>= 6) & 0x3F;
bytes[0] = 0xF0 | (cp >>>= 6) & 0x07; // the lead byte of a 4-byte sequence carries only 3 payload bits
console.log(Array.prototype.map.call(bytes, hex).join("")) // "\xF0\x9F\x98\x84"
(tested in Chrome)
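The relationship between the two surrogates and the four bytes goes through the code point: the surrogate pair is first combined into one number, and that number is then split into UTF-8 bytes. The following is a sketch that handles all sequence lengths rather than only the 4-byte case; the function names are invented for this example, and lone surrogates are not treated specially.

```javascript
// Combine a UTF-16 surrogate pair into a single code point.
function surrogatesToCodePoint(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

// Hypothetical helper: UTF-8 encode one code point into an array of bytes.
function codePointToUtf8(cp) {
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  if (cp < 0x10000) return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  return [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
}

// Hypothetical helper: escape every byte of the UTF-8 form as \xHH.
function toCEscapes(str) {
  var out = "";
  for (var i = 0; i < str.length; i++) {
    var cp = str.codePointAt(i);
    if (cp > 0xFFFF) i++;                 // skip the low surrogate that was just consumed
    codePointToUtf8(cp).forEach(function (b) {
      out += "\\x" + (b < 16 ? "0" : "") + b.toString(16).toUpperCase();
    });
  }
  return out;
}

console.log(surrogatesToCodePoint(0xD83D, 0xDE04).toString(16)); // 1f604
console.log(toCEscapes("\uD83D\uDE04"));                         // \xF0\x9F\x98\x84
```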
I found this algorithm on the net, but I'm having a bit of trouble understanding exactly how it works. It encodes a Uint8Array to Base64. I would especially like to understand the sections under the comments "Combine the three bytes into a single integer" and "Use bitmasks to extract 6-bit segments from the triplet". I understand the concept of bit shifting used there, but I can't see what its purpose is in those two sections.
function base64ArrayBuffer(bytes) {
var base64 = ''
var encodings = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
var byteLength = bytes.byteLength
var byteRemainder = byteLength % 3
var mainLength = byteLength - byteRemainder
var a, b, c, d
var chunk
// Main loop deals with bytes in chunks of 3
for (var i = 0; i < mainLength; i = i + 3) {
// Combine the three bytes into a single integer
chunk = (bytes[i] << 16) | (bytes[i + 1] << 8) | bytes[i + 2]
// Use bitmasks to extract 6-bit segments from the triplet
a = (chunk & 16515072) >> 18 // 16515072 = (2^6 - 1) << 18
b = (chunk & 258048) >> 12 // 258048 = (2^6 - 1) << 12
c = (chunk & 4032) >> 6 // 4032 = (2^6 - 1) << 6
d = chunk & 63 // 63 = 2^6 - 1
// Convert the raw binary segments to the appropriate ASCII encoding
base64 += encodings[a] + encodings[b] + encodings[c] + encodings[d]
}
// Deal with the remaining bytes and padding
if (byteRemainder == 1) {
chunk = bytes[mainLength]
a = (chunk & 252) >> 2 // 252 = (2^6 - 1) << 2
// Set the 4 least significant bits to zero
b = (chunk & 3) << 4 // 3 = 2^2 - 1
base64 += encodings[a] + encodings[b] + '=='
} else if (byteRemainder == 2) {
chunk = (bytes[mainLength] << 8) | bytes[mainLength + 1]
a = (chunk & 64512) >> 10 // 64512 = (2^6 - 1) << 10
b = (chunk & 1008) >> 4 // 1008 = (2^6 - 1) << 4
// Set the 2 least significant bits to zero
c = (chunk & 15) << 2 // 15 = 2^4 - 1
base64 += encodings[a] + encodings[b] + encodings[c] + '='
}
return base64
}
The first step takes each group of 3 bytes in the input and combines them into a 24-bit number. If we call them x = bytes[i], y = bytes[i+1], and z = bytes[i+2], it uses bit-shifting and bit-OR to create a 24-bit integer whose bits are:
xxxxxxxxyyyyyyyyzzzzzzzz
Then it extracts these bits in groups of 6 to get 4 numbers. The bits of a, b, c, and d correspond this way:
xxxxxxxxyyyyyyyyzzzzzzzz
aaaaaabbbbbbccccccdddddd
Then for each of these 6-bit numbers, it indexes the encodings string to get a corresponding character, and concatenates them into the base64 result string.
At the end there are some special cases to deal with the last 1 or 2 bytes in the input if it wasn't a multiple of 3 bytes long.
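To see the combine-and-mask steps with real numbers, here is a small sketch for the three bytes of "Cat". It writes each mask as 63 shifted into place instead of the decimal constants, which is the same value (for example 63 << 18 is 16515072).

```javascript
var encodings = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
var bytes = new Uint8Array([67, 97, 116]);  // "Cat" = 01000011 01100001 01110100

// Shift each byte into its own 8-bit slot of a 24-bit integer.
var chunk = (bytes[0] << 16) | (bytes[1] << 8) | bytes[2];

// Each mask is 63 (six 1-bits) moved over the group being extracted;
// the shift right then brings that group down to the range 0-63.
var a = (chunk & (63 << 18)) >> 18;  // 010000 = 16
var b = (chunk & (63 << 12)) >> 12;  // 110110 = 54
var c = (chunk & (63 << 6)) >> 6;    // 000101 = 5
var d = chunk & 63;                  // 110100 = 52

var result = encodings[a] + encodings[b] + encodings[c] + encodings[d];
console.log(chunk.toString(2).padStart(24, "0"), result);  // 010000110110000101110100 Q2F0
```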
I'm setting a cookie with some query string values for a page I've built, so that when you revisit the page the options will already be set for you.
So if the URL is http://mysite.com/index.php?setting1=blue&orientation=horizontal&background=paper the cookie will store the setting1=blue, orientation=horizontal, and background=paper values to be read back on the next visit.
It seems like most people advise json encoding these values prior to storing in the cookie. However, I'm getting way bigger cookie sizes (like 4-5x bigger!) when json encoding vs. just saving these values in a standard query string format and parsing them later.
Any best practice for this situation?
Query string format is fine, if it's easy for you to parse them back.
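For reference, parsing them back takes only a few lines without any library. A minimal sketch, where the helper names toQueryString and parseQueryString are made up here and values are assumed to be plain strings:

```javascript
// Hypothetical helper: serialize a flat object of string settings as key=value pairs.
function toQueryString(settings) {
  return Object.keys(settings).map(function (key) {
    return encodeURIComponent(key) + "=" + encodeURIComponent(settings[key]);
  }).join("&");
}

// Hypothetical helper: parse the stored string back into an object.
function parseQueryString(str) {
  var settings = {};
  if (!str) return settings;
  str.split("&").forEach(function (pair) {
    var idx = pair.indexOf("=");
    var key = idx === -1 ? pair : pair.slice(0, idx);
    var value = idx === -1 ? "" : pair.slice(idx + 1);
    settings[decodeURIComponent(key)] = decodeURIComponent(value);
  });
  return settings;
}

var stored = toQueryString({ setting1: "blue", orientation: "horizontal", background: "paper" });
console.log(stored);                    // setting1=blue&orientation=horizontal&background=paper
console.log(parseQueryString(stored));  // object with the same three settings
```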
Well, if you're using MooTools, simply use Hash.Cookie; it's nifty and will rid you of your headaches by abstracting this stupid cookie storage stuff :)
If you want to convert a query string to an object take a look at
myQueryString.parseQueryString() // returns object of key value pairs
Requires mooTools More Strings: http://mootools.net/docs/more/Types/String.QueryString
However, I like the idea of Base64 more! See below.
Credit goes to Ryan Florence for this, but this is what I use:
var cookieData = DATATOENCODE.toBase64() // base64 encodes the data
cookieData.decodeBase64() // to decode it
The magic:
/*
---
script: Base64.js
description: String methods for encoding and decoding Base64 data
license: MIT-style license.
authors: Ryan Florence (http://ryanflorence.com), webtoolkit.info
requires:
- core:1.2.4: [String]
provides: [String.toBase64, String.decodeBase64]
...
*/
(function(){
// Base64 string methods taken from http://www.webtoolkit.info/
var Base64 = {
_keyStr : "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=",
encode : function (input) {
var output = "";
var chr1, chr2, chr3, enc1, enc2, enc3, enc4;
var i = 0;
input = Base64._utf8_encode(input);
while (i < input.length) {
chr1 = input.charCodeAt(i++);
chr2 = input.charCodeAt(i++);
chr3 = input.charCodeAt(i++);
enc1 = chr1 >> 2;
enc2 = ((chr1 & 3) << 4) | (chr2 >> 4);
enc3 = ((chr2 & 15) << 2) | (chr3 >> 6);
enc4 = chr3 & 63;
if (isNaN(chr2)) {
enc3 = enc4 = 64;
} else if (isNaN(chr3)) {
enc4 = 64;
};
output = output +
this._keyStr.charAt(enc1) + this._keyStr.charAt(enc2) +
this._keyStr.charAt(enc3) + this._keyStr.charAt(enc4);
};
return output;
},
decode : function (input) {
var output = "";
var chr1, chr2, chr3;
var enc1, enc2, enc3, enc4;
var i = 0;
input = input.replace(/[^A-Za-z0-9\+\/\=]/g, "");
while (i < input.length) {
enc1 = this._keyStr.indexOf(input.charAt(i++));
enc2 = this._keyStr.indexOf(input.charAt(i++));
enc3 = this._keyStr.indexOf(input.charAt(i++));
enc4 = this._keyStr.indexOf(input.charAt(i++));
chr1 = (enc1 << 2) | (enc2 >> 4);
chr2 = ((enc2 & 15) << 4) | (enc3 >> 2);
chr3 = ((enc3 & 3) << 6) | enc4;
output = output + String.fromCharCode(chr1);
if (enc3 != 64) {
output = output + String.fromCharCode(chr2);
};
if (enc4 != 64) {
output = output + String.fromCharCode(chr3);
};
};
output = Base64._utf8_decode(output);
return output;
},
// private method for UTF-8 encoding
_utf8_encode : function (string) {
string = string.replace(/\r\n/g,"\n");
var utftext = "";
for (var n = 0; n < string.length; n++) {
var c = string.charCodeAt(n);
if (c < 128) {
utftext += String.fromCharCode(c);
} else if((c > 127) && (c < 2048)) {
utftext += String.fromCharCode((c >> 6) | 192);
utftext += String.fromCharCode((c & 63) | 128);
} else {
utftext += String.fromCharCode((c >> 12) | 224);
utftext += String.fromCharCode(((c >> 6) & 63) | 128);
utftext += String.fromCharCode((c & 63) | 128);
};
};
return utftext;
},
_utf8_decode : function (utftext) {
var string = "";
var i = 0;
var c = 0, c2 = 0, c3 = 0; // declared locally to avoid implicit globals
while ( i < utftext.length ) {
c = utftext.charCodeAt(i);
if (c < 128) {
string += String.fromCharCode(c);
i++;
} else if((c > 191) && (c < 224)) {
c2 = utftext.charCodeAt(i+1);
string += String.fromCharCode(((c & 31) << 6) | (c2 & 63));
i += 2;
} else {
c2 = utftext.charCodeAt(i+1);
c3 = utftext.charCodeAt(i+2);
string += String.fromCharCode(((c & 15) << 12) | ((c2 & 63) << 6) | (c3 & 63));
i += 3;
};
};
return string;
}
};
String.implement({
toBase64: function(){
return Base64.encode(this);
},
decodeBase64: function(){
return Base64.decode(this);
}
});
})();
I don't know if there is a best practice for this, but I would advise using the query string format since it's smaller.
Cookies are transferred with every page request. Less data transferred is almost always better.
If they're that much bigger, you're probably doing something wrong. setting1=blue can be represented as {"setting1":"blue"} - adding orientation=horizontal gives {"setting1":"blue","orientation":"horizontal"} - it definitely takes more space in the cookie but not that much.
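One possible explanation for a much larger cookie, offered as a guess rather than a diagnosis: if the JSON text is URL-encoded before being written to the cookie, every quote, brace, colon and comma becomes a three-character %XX sequence. The sketch below compares the three sizes for the settings from the question.

```javascript
var settings = { setting1: "blue", orientation: "horizontal", background: "paper" };

var queryString = "setting1=blue&orientation=horizontal&background=paper";
var json = JSON.stringify(settings);
var encodedJson = encodeURIComponent(json);   // quotes, braces, colons and commas become %XX

// Prints the length of each representation for comparison.
console.log(queryString.length, json.length, encodedJson.length);
```

The raw JSON is only modestly longer than the query string; the URL-encoded JSON is where most of the growth comes from.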
Also, I'd personally recommend against using JSON. It's too easy for an attacker from a different website to set a cookie on your domain, which could then get executed if the JSON is evaluated as code (for example with eval() rather than JSON.parse()). The NVP "encoding" is more effective if you're only doing key/value storage.