Encoding issues for UTF8 CSV file when opening Excel and TextEdit - javascript

I recently added a CSV-download button that takes data from the database (Postgres) as an array from the server (Ruby on Rails) and turns it into a CSV file on the client side (JavaScript, HTML5). I'm currently testing the CSV file and I am coming across some encoding issues.
When I view the CSV file via 'less', the file appears fine. But when I open the file in Excel OR TextEdit, I start seeing weird characters like
—, â€, “
appear in the text. Basically, I see the characters that are described here: http://digwp.com/2011/07/clean-up-weird-characters-in-database/
I read that this sort of issue can arise when the database encoding is set incorrectly. But the database I am using is set to UTF8 encoding, and when I step through the JS code that creates the CSV file, the text appears normal. (This could be down to how Chrome and less render the text.)
I'm feeling frustrated because the only thing I am learning from my online search is that there could be many reasons why encoding is not working, I'm not sure which part is at fault (so excuse me as I initially tag numerous things), and nothing I tried has shed new light on my problem.
For reference, here's the JavaScript snippet that creates the CSV file!
$(document).ready(function() {
var csvData = <%= raw to_csv(@view_scope, clicks_post).as_json %>;
var csvContent = "data:text/csv;charset=utf-8,";
csvData.forEach(function(infoArray, index){
var dataString = infoArray.join(",");
csvContent += dataString+ "\n";
});
var encodedUri = encodeURI(csvContent);
var button = $('<a>');
button.text('Download CSV');
button.addClass("button right");
button.attr('href', encodedUri);
button.attr('target','_blank');
button.attr('download','<%=title%>_25_posts.csv');
$("#<%=title%>_download_action").append(button);
});

As @jlarson updated with information that Mac was the biggest culprit, we might get somewhat further. Office for Mac has, at least in 2011 and earlier versions, rather poor support for reading Unicode formats when importing files.
Support for UTF-8 seems to be close to non-existent; I have read a few comments about it working, whilst the majority say it does not. Unfortunately I do not have a Mac to test on. So again: the files themselves should be OK as UTF-8, but the import is where the process fails.
I wrote up a quick test in JavaScript for exporting percent-escaped UTF-16, little and big endian, with or without BOM, etc.
The code should probably be refactored, but should be OK for testing. It might work better than UTF-8. Of course this usually also means bigger data transfers, as every character takes two or four bytes.
You can find a fiddle here:
Unicode export sample Fiddle
Note that it does not handle CSV in any particular way. It is mainly meant for pure conversion to a data URL having UTF-8, UTF-16 big/little endian and +/- BOM. There is one option in the fiddle to replace commas with tabs, but I believe that would be a rather hackish and fragile solution, if it works at all.
Typically use like:
// Initiate
encoder = new DataEnc({
mime : 'text/csv',
charset: 'UTF-16BE',
bom : true
});
// Convert data to percent escaped text
encoder.enc(data);
// Get result
var result = encoder.pay();
There are two result properties of the object:
1.) encoder.lead
This is the mime-type, charset etc. for data URL. Built from options passed to initializer, or one can also say .config({ ... new conf ...}).intro() to re-build.
data:[<MIME-type>][;charset=<encoding>][;base64]
You can specify base64, but there is no base64 conversion (at least not so far).
2.) encoder.buf
This is a string with the percent escaped data.
The .pay() function simply returns 1.) and 2.) as one.
Main code:
function DataEnc(a) {
this.config(a);
this.intro();
}
/*
* http://www.iana.org/assignments/character-sets/character-sets.xhtml
* */
DataEnc._enctype = {
u8 : ['u8', 'utf8'],
// RFC-2781, Big endian should be presumed if none given
u16be : ['u16', 'u16be', 'utf16', 'utf16be', 'ucs2', 'ucs2be'],
u16le : ['u16le', 'utf16le', 'ucs2le']
};
DataEnc._BOM = {
'none' : '',
'UTF-8' : '%ef%bb%bf', // Discouraged
'UTF-16BE' : '%fe%ff',
'UTF-16LE' : '%ff%fe'
};
DataEnc.prototype = {
// Basic setup
config : function(a) {
var opt = {
charset: 'u8',
mime : 'text/csv',
base64 : 0,
bom : 0
};
a = a || {};
this.charset = typeof a.charset !== 'undefined' ?
a.charset : opt.charset;
this.base64 = typeof a.base64 !== 'undefined' ? a.base64 : opt.base64;
this.mime = typeof a.mime !== 'undefined' ? a.mime : opt.mime;
this.bom = typeof a.bom !== 'undefined' ? a.bom : opt.bom;
this.enc = this.utf8;
this.buf = '';
this.lead = '';
return this;
},
// Create lead based on config
// data:[<MIME-type>][;charset=<encoding>][;base64],<data>
intro : function() {
var
g = [],
c = this.charset || '',
b = 'none'
;
if (this.mime && this.mime !== '')
g.push(this.mime);
if (c !== '') {
c = c.replace(/[-\s]/g, '').toLowerCase();
if (DataEnc._enctype.u8.indexOf(c) > -1) {
c = 'UTF-8';
if (this.bom)
b = c;
this.enc = this.utf8;
} else if (DataEnc._enctype.u16be.indexOf(c) > -1) {
c = 'UTF-16BE';
if (this.bom)
b = c;
this.enc = this.utf16be;
} else if (DataEnc._enctype.u16le.indexOf(c) > -1) {
c = 'UTF-16LE';
if (this.bom)
b = c;
this.enc = this.utf16le;
} else {
if (c === 'copy')
c = '';
this.enc = this.copy;
}
}
if (c !== '')
g.push('charset=' + c);
if (this.base64)
g.push('base64');
this.lead = 'data:' + g.join(';') + ',' + DataEnc._BOM[b];
return this;
},
// Deliver
pay : function() {
return this.lead + this.buf;
},
// UTF-16BE
utf16be : function(t) { // U+0500 => %05%00
var i, c, buf = [];
for (i = 0; i < t.length; ++i) {
if ((c = t.charCodeAt(i)) > 0xff) {
buf.push(('00' + (c >> 0x08).toString(16)).substr(-2));
buf.push(('00' + (c & 0xff).toString(16)).substr(-2));
} else {
buf.push('00');
buf.push(('00' + (c & 0xff).toString(16)).substr(-2));
}
}
this.buf += '%' + buf.join('%');
// Note the hex array is returned, not string with '%'
// Might be useful if one want to loop over the data.
return buf;
},
// UTF-16LE
utf16le : function(t) { // U+0500 => %00%05
var i, c, buf = [];
for (i = 0; i < t.length; ++i) {
if ((c = t.charCodeAt(i)) > 0xff) {
buf.push(('00' + (c & 0xff).toString(16)).substr(-2));
buf.push(('00' + (c >> 0x08).toString(16)).substr(-2));
} else {
buf.push(('00' + (c & 0xff).toString(16)).substr(-2));
buf.push('00');
}
}
this.buf += '%' + buf.join('%');
// Note the hex array is returned, not string with '%'
// Might be useful if one want to loop over the data.
return buf;
},
// UTF-8
utf8 : function(t) {
this.buf += encodeURIComponent(t);
return this;
},
// Direct copy
copy : function(t) {
this.buf += t;
return this;
}
};
Previous answer:
I do not have any setup to replicate yours, but if your case is the same as @jlarson's then the resulting file should be correct.
This answer became somewhat long (fun topic, you say?), but it discusses various aspects around the question, what is (likely) happening, and how to actually check what is going on in various ways.
TL;DR:
The text is likely imported as ISO-8859-1, Windows-1252, or the like, and not as UTF-8. Force application to read file as UTF-8 by using import or other means.
PS: The UniSearcher is a nice tool to have available on this journey.
The long way around
The "easiest" way to be 100% sure what we are looking at is to use a hex-editor on the result. Alternatively use hexdump, xxd or the like from command line to view the file. In this case the byte sequence should be that of UTF-8 as delivered from the script.
As an example if we take the script of jlarson it takes the data Array:
data = [['name', 'city', 'state'],
['\u0500\u05E1\u0E01\u1054', 'seattle', 'washington']]
This one is merged into the string:
name,city,state<newline>
\u0500\u05E1\u0E01\u1054,seattle,washington<newline>
which translates by Unicode to:
name,city,state<newline>
Ԁסกၔ,seattle,washington<newline>
As UTF-8 uses ASCII as base (bytes with highest bit not set are the same as in ASCII) the only special sequence in the test data is "Ԁסกၔ" which in turn, is:
Code-point Glyph UTF-8
----------------------------
U+0500 Ԁ d4 80
U+05E1 ס d7 a1
U+0E01 ก e0 b8 81
U+1054 ၔ e1 81 94
Looking at the hex-dump of the downloaded file:
0000000: 6e61 6d65 2c63 6974 792c 7374 6174 650a name,city,state.
0000010: d480 d7a1 e0b8 81e1 8194 2c73 6561 7474 ..........,seatt
0000020: 6c65 2c77 6173 6869 6e67 746f 6e0a le,washington.
On second line we find d480 d7a1 e0b8 81e1 8194 which match up with the above:
d4 80 => Ԁ
d7 a1 => ס
e0 b8 81 => ก
e1 81 94 => ၔ
2c => ,
73 65 61 74 74 => s e a t t
None of the other characters is mangled either.
Do similar tests if you want. The result should be similar.
By sample provided —, â€, “
We can also have a look at the sample provided in the question. It is reasonable to assume that the text is interpreted in Excel / TextEdit as code page 1252.
To quote Wikipedia on Windows-1252:
Windows-1252 or CP-1252 is a character encoding of the Latin alphabet, used by
default in the legacy components of Microsoft Windows in English and some other
Western languages. It is one version within the group of Windows code pages.
In LaTeX packages, it is referred to as "ansinew".
Retrieving the original bytes
To translate it back into its original form we can look at the code page layout, from which we get:
Character: <â> <€>  <”>  <,> < > <â> <€>  < > <,> < > <â> <€>  <œ>
U.Hex    : e2  20ac 201d 2c  20  e2  20ac 9d  2c  20  e2  20ac 153
T.Hex    : e2  80   94   2c  20  e2  80   9d* 2c  20  e2  80   9c
U is short for Unicode
T is short for Translated
For example:
â => Unicode 0xe2 => CP-1252 0xe2
” => Unicode 0x201d => CP-1252 0x94
€ => Unicode 0x20ac => CP-1252 0x80
Special cases like 9d do not have a corresponding code point in CP-1252; these we simply copy directly.
Note: If one looks at the mangled string by copying the text to a file and doing a hex-dump, save the file with, for example, UTF-16 encoding to get the Unicode values as represented in the table. E.g. in Vim:
set fenc=utf-16
# Or
set fenc=ucs-2
Bytes to UTF-8
We then combine the result, the T.Hex line, into UTF-8. In UTF-8 sequences the leading byte tells us how many bytes in total make up the code point. For example, if a byte has the binary value 110x xxxx we know that this byte and the next represent one code point. A total of two. 1110 xxxx tells us it is three, and so on. ASCII values do not have the high bit set, so any byte matching 0xxx xxxx stands alone. A total of one byte.
0xe2 = 1110 0010bin => 3 bytes => 0xe28094 (em-dash) —
0x2c = 0010 1100bin => 1 byte => 0x2c (comma) ,
0x20 = 0010 0000bin => 1 byte => 0x20 (space)
0xe2 = 1110 0010bin => 3 bytes => 0xe2809d (right-dq) ”
0x2c = 0010 1100bin => 1 byte => 0x2c (comma) ,
0x20 = 0010 0000bin => 1 byte => 0x20 (space)
0xe2 = 1110 0010bin => 3 bytes => 0xe2809c (left-dq) “
Conclusion: the original UTF-8 string was:
—, ”, “
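The leading-byte rule described above can be sketched in JavaScript. This is only an illustration and not part of the original answer; the function name utf8SequenceLength is made up for this example.

```javascript
// Returns how many bytes a UTF-8 sequence occupies, judged from its first byte.
// Returns 0 for a continuation byte (10xx xxxx) or an invalid leading byte.
function utf8SequenceLength(leadByte) {
  if ((leadByte & 0x80) === 0x00) return 1; // 0xxx xxxx: ASCII, stands alone
  if ((leadByte & 0xe0) === 0xc0) return 2; // 110x xxxx
  if ((leadByte & 0xf0) === 0xe0) return 3; // 1110 xxxx
  if ((leadByte & 0xf8) === 0xf0) return 4; // 1111 0xxx
  return 0;
}

console.log(utf8SequenceLength(0xe2)); // leading byte of the em-dash sequence
console.log(utf8SequenceLength(0x2c)); // comma
```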
Mangling it back
We can also do the reverse. The original string as bytes:
UTF-8: e2 80 94 2c 20 e2 80 9d 2c 20 e2 80 9c
Corresponding values in cp-1252:
e2 => â
80 => €
94 => ”
2c => ,
20 => <space>
...
and so on, result:
—, â€, “
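Both directions can be reproduced with a short script. The following is a minimal sketch, assuming an environment with TextEncoder and TextDecoder (browsers, recent Node.js); the names CP1252_HIGH, mangle and unmangle are invented for this example. Only the 0x80 to 0x9f range differs between CP-1252 and the first 256 Unicode code points, so that is the only range the table covers, and undefined slots such as 9d are copied directly, as described above.

```javascript
// CP-1252 code points for bytes 0x80..0x9f; undefined slots keep the byte value.
const CP1252_HIGH = [
  0x20ac, 0x81,   0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x8d,   0x017d, 0x8f,
  0x90,   0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x9d,   0x017e, 0x0178
];

// Encode as UTF-8, then read every byte as if it were one CP-1252 character.
function mangle(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, b =>
    String.fromCharCode(b >= 0x80 && b <= 0x9f ? CP1252_HIGH[b - 0x80] : b)
  ).join('');
}

// Map every character back to its CP-1252 byte, then decode the bytes as UTF-8.
// Assumes the input really is CP-1252 mojibake (no characters outside that set).
function unmangle(text) {
  const bytes = Uint8Array.from(text, ch => {
    const code = ch.charCodeAt(0);
    const index = CP1252_HIGH.indexOf(code);
    return index !== -1 ? 0x80 + index : code;
  });
  return new TextDecoder('utf-8').decode(bytes);
}

const original = '\u2014, \u201d, \u201c'; // em-dash, right and left double quotes
console.log(mangle(original));
console.log(unmangle(mangle(original)) === original);
```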
Importing to MS Excel
In other words: The issue at hand could be how to import UTF-8 text files into MS Excel, and some other applications. In Excel this can be done in various ways.
Method one:
Do not save the file with an extension recognized by the application, like .csv, or .txt, but omit it completely or make something up.
As an example save the file as "testfile", with no extension. Then in Excel open the file, confirm that we actually want to open this file, and voilà we get served with the encoding option. Select UTF-8, and file should be correctly read.
Method two:
Use import data instead of open file. Something like:
Data -> Import External Data -> Import Data
Select encoding and proceed.
Check that Excel and selected font actually supports the glyph
We can also test the font support for the Unicode characters by using the, sometimes, friendlier clipboard. For example, copy text from this page into Excel:
page with code points U+0E00 to U+0EFF
If support for the code points exists, the text should render fine.
Linux
On Linux, which is primarily UTF-8 in userland this should not be an issue. Using Libre Office Calc, Vim, etc. show the files correctly rendered.
Why it works (or should)
encodeURI from the spec states, (also read sec-15.1.3):
The encodeURI function computes a new version of a URI in which each instance of certain characters is replaced by one, two, three, or four escape sequences representing the UTF-8 encoding of the character.
We can simply test this in our console by, for example saying:
>> encodeURI('Ԁסกၔ,seattle,washington')
<< "%D4%80%D7%A1%E0%B8%81%E1%81%94,seattle,washington"
As we can see, the escape sequences are equal to the ones in the hex dump above:
%D4%80%D7%A1%E0%B8%81%E1%81%94 (encodeURI in log)
d4 80 d7 a1 e0 b8 81 e1 81 94 (hex-dump of file)
or, testing a 4-byte code:
>> encodeURI('󱀁')
<< "%F3%B1%80%81"
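The same comparison can be made programmatically. A small sketch, assuming TextEncoder is available; percentEncodeUtf8 is a helper name invented here. It percent-escapes every UTF-8 byte, so the result only matches encodeURI for text that consists entirely of non-ASCII characters.

```javascript
// Percent-escape each UTF-8 byte of the text, upper-case hex like encodeURI.
function percentEncodeUtf8(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, b =>
    '%' + b.toString(16).toUpperCase().padStart(2, '0')
  ).join('');
}

const sample = '\u0500\u05E1\u0E01\u1054'; // the four test glyphs from above
console.log(percentEncodeUtf8(sample));
console.log(percentEncodeUtf8(sample) === encodeURI(sample));
```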
If this does not apply
If none of this applies, it could help if you added:
Sample of expected input vs mangled output, (copy paste).
Sample hex-dump of original data vs result file.

I ran into exactly this yesterday. I was developing a button that exports the contents of an HTML table as a CSV download. The functionality of the button itself is almost identical to yours – on click I read the text from the table and create a data URI with the CSV content.
When I tried to open the resulting file in Excel it was clear that the "£" symbol was getting read incorrectly. The 2-byte UTF-8 representation was being processed as a single-byte encoding, resulting in an unwanted garbage character. Some Googling indicated this was a known issue with Excel.
I tried adding the byte order mark at the start of the string, but Excel just read the BOM bytes as ordinary single-byte characters. I then tried various things to convert the UTF-8 string to a single-byte encoding (such as csvData.replace('\u00a3', '\xa3')), but I found that any time the data is coerced to a JavaScript string it ends up being written out as UTF-8 again. The trick is to convert it to binary and then Base64 encode it without converting back to a string along the way.
I already had CryptoJS in my app (used for HMAC authentication against a REST API) and I was able to use that to create a Latin-1 (ISO-8859-1) encoded byte sequence from the original string, then Base64 encode it and create a data URI. This worked, and the resulting file does not display any unwanted characters when opened in Excel.
The essential bit of code that does the conversion is:
var csvHeader = 'data:text/csv;charset=iso-8859-1;base64,'
var encodedCsv = CryptoJS.enc.Latin1.parse(csvData).toString(CryptoJS.enc.Base64)
var dataURI = csvHeader + encodedCsv
Where csvData is your CSV string.
There are probably ways to do the same thing without CryptoJS if you don't want to bring in that library but this at least shows it is possible.
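One CryptoJS-free variant might look like the sketch below. It assumes btoa is available (browsers, recent Node.js) and that replacing characters outside Latin-1 with '?' is acceptable; latin1DataUri is a name invented for this example. btoa treats each character code from 0 to 255 as one byte, which is exactly the Latin-1 mapping.

```javascript
// Build a Base64 data URI whose payload is the Latin-1 encoding of csvData.
function latin1DataUri(csvData) {
  // Characters above U+00FF have no Latin-1 byte; substitute '?' for them.
  const latin1 = Array.from(csvData, ch =>
    ch.codePointAt(0) <= 0xff ? ch : '?'
  ).join('');
  return 'data:text/csv;charset=iso-8859-1;base64,' + btoa(latin1);
}

console.log(latin1DataUri('price,\u00a35\n'));
```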

Excel likes Unicode in UTF-16 LE with BOM encoding. Output the correct BOM (FF FE), then convert all your data from UTF-8 to UTF-16 LE.
Windows uses UTF-16 LE internally, so some applications work better with UTF-16 than with UTF-8.
I haven't tried to do that in JS, but there are various scripts on the web to convert UTF-8 to UTF-16. Conversion between UTF variations is pretty easy and takes about a dozen lines.
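In JavaScript the conversion is short because strings are already exposed as UTF-16 code units. A minimal sketch (the function name toUtf16LeWithBom is made up here) that writes the BOM FF FE followed by each code unit, low byte first:

```javascript
// Serialise a JavaScript string as UTF-16 LE bytes, prefixed with the BOM FF FE.
function toUtf16LeWithBom(text) {
  const bytes = new Uint8Array(2 + text.length * 2);
  bytes[0] = 0xff;
  bytes[1] = 0xfe;
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i); // UTF-16 code unit, surrogates included
    bytes[2 + i * 2] = unit & 0xff;  // low byte first (little endian)
    bytes[3 + i * 2] = unit >> 8;    // then high byte
  }
  return bytes;
}

console.log(toUtf16LeWithBom('A\u00a3'));
```

In a browser the resulting Uint8Array could be wrapped in a Blob and offered for download instead of building a data URI.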

I was having a similar issue with data that was pulled into JavaScript from a SharePoint list. It turned out to be something called a "Zero Width Space" character, and it was being displayed as â€‹ when it was brought into Excel. Apparently, SharePoint inserts these sometimes when a user hits 'backspace'.
I replaced them with this quickfix:
var mystring = myString.replace(/\u200B/g,'');
It looks like you may have other hidden characters in there. I found the codepoint for the zero-width character in mine by looking at the output string in the Chrome inspector. The inspector couldn't render the character so it replaced it with a red dot. When you hover your mouse over that red dot, it gives you the codepoint (eg. \u200B) and you can just sub in the various codepoints to the invisible characters and remove them that way.
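Instead of hovering over each red dot, the hidden characters can also be listed with a small script. This is a sketch only; the set of code points in INVISIBLE is an assumption chosen for illustration (zero-width space, non-joiner, joiner, word joiner and BOM) and may need extending for other data.

```javascript
// A few commonly encountered invisible code points (illustrative, not exhaustive).
const INVISIBLE = /[\u200B-\u200D\u2060\uFEFF]/g;

// Report the position and code point of every invisible character found.
function findInvisible(text) {
  const hits = [];
  for (const match of text.matchAll(INVISIBLE)) {
    const hex = match[0].codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
    hits.push({ index: match.index, codePoint: 'U+' + hex });
  }
  return hits;
}

// Remove them all in one pass.
function stripInvisible(text) {
  return text.replace(INVISIBLE, '');
}

console.log(findInvisible('a\u200Bb'));
console.log(stripInvisible('a\u200Bb'));
```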

button.href = 'data:' + mimeType + ';charset=UTF-8,%ef%bb%bf' + encodedUri;
This should do the trick.

It could be a problem in your server encoding.
You could try (assuming locale english US) if you are running Linux:
sudo locale-gen en_US en_US.UTF-8
dpkg-reconfigure locales

These three rules should be applied when writing a multibyte CSV file so that it can be readable on Excel across different OS platforms (Windows, Linux, MacOS)
The tab character \t is used to separate between fields instead of comma (,)
The content must be encoded in UTF-16 little endian (UTF16-LE)
The content must be prefixed with the UTF16-LE byte order mark (BOM), which is the byte sequence FF FE (code point U+FEFF)
Here is an article that shows how to reproduce the encoding issue and walks through the solution. NodeJS is used to create the CSV file.
As a side note, UTF16-LE BOM has to be explicitly set when writing a file using NodeJS fs module. Refer to this github issue for more detailed discussion.
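The three rules can be combined in a few lines of Node.js. This is a sketch under assumptions: the helper name toExcelFriendlyBuffer is invented here, and the rows are assumed to contain no tabs or newlines that would need quoting. Prepending U+FEFF to the string makes Buffer.from emit the bytes FF FE when the 'utf16le' encoding is used.

```javascript
// Join fields with tabs, rows with newlines, then encode as UTF-16 LE with a BOM.
function toExcelFriendlyBuffer(rows) {
  const text = rows.map(row => row.join('\t')).join('\n');
  return Buffer.from('\ufeff' + text, 'utf16le');
}

const buffer = toExcelFriendlyBuffer([['name', 'city'], ['\u0500\u05E1', 'seattle']]);
console.log(buffer.subarray(0, 2)); // the BOM bytes
// The buffer could then be written with fs.writeFileSync(path, buffer).
```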

Related

How to generate a Shift_JIS(SJIS) percent encoded string in JavaScript

I'm new to both JavaScript and Google Apps Script and having a problem to convert texts written in a cell to the Shift-JIS (SJIS) encoded letters.
For example, the Japanese string "あいう" should be encoded as "%82%A0%82%A2%82%A4" not as "%E3%81%82%E3%81%84%E3%81%86" which is UTF-8 encoded.
I tried EncodingJS and the built-in urlencode() function, but both return the UTF-8 encoded one.
Would any one tell me how to get the SJIS-encoded letters properly in GAS? Thank you.
You want to do the URL encode from あいう to %82%A0%82%A2%82%A4 as Shift-JIS of the character set.
%E3%81%82%E3%81%84%E3%81%86 is the result converted as UTF-8.
You want to achieve this using Google Apps Script.
If my understanding is correct, how about this answer? Please think of this as just one of several possible answers.
Points of this answer:
In order to use the Shift-JIS character set in Google Apps Script, the data must be handled as binary data. When a Shift-JIS value is retrieved as a string by Google Apps Script, the character set is automatically changed to UTF-8. Please be careful about this.
Sample script 1:
In order to convert from あいう to %82%A0%82%A2%82%A4, how about the following script? In this case, this script can be used for HIRAGANA characters.
function muFunction() {
var str = "あいう";
var bytes = Utilities.newBlob("").setDataFromString(str, "Shift_JIS").getBytes();
var res = bytes.map(function(byte) {return "%" + ("0" + (byte & 0xFF).toString(16)).slice(-2)}).join("").toUpperCase();
Logger.log(res)
}
Result:
You can see the following result at the log.
%82%A0%82%A2%82%A4
Sample script 2:
If you want to convert the values including the KANJI characters, how about the following script? In this case, 本日は晴天なり is converted to %96%7B%93%FA%82%CD%90%B0%93V%82%C8%82%E8.
function muFunction() {
var str = "本日は晴天なり";
var conv = Utilities.newBlob("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz*-.@_").getBytes().map(function(e) {return ("0" + (e & 0xFF).toString(16)).slice(-2)});
var bytes = Utilities.newBlob("").setDataFromString(str, "Shift_JIS").getBytes();
var res = bytes.map(function(byte) {
var n = ("0" + (byte & 0xFF).toString(16)).slice(-2);
return conv.indexOf(n) != -1 ? String.fromCharCode(parseInt(n[0], 16).toString(2).length == 4 ? parseInt(n, 16) - 256 : parseInt(n, 16)) : ("%" + n).toUpperCase();
}).join("");
Logger.log(res)
}
Result:
You can see the following result at the log.
%96%7B%93%FA%82%CD%90%B0%93V%82%C8%82%E8
When 本日は晴天なり is converted with the sample script 1, it becomes %96%7B%93%FA%82%CD%90%B0%93%56%82%C8%82%E8. This can also be decoded. But it seems that the result value converted with the sample script 2 is the form generally used.
Flow:
The flow of this script is as follows.
Create new blob as the empty data.
Put the text value of あいう to the blob. At that time, the text value is put as Shift-JIS of the character set.
In this case, even when blob.getDataAsString("Shift_JIS") is used, the result becomes UTF-8. So the blob is required to be used as the binary data without converting to the string data. This is the important point in this answer.
Convert the blob to the byte array.
Convert the byte array from signed values to unsigned hexadecimal.
In Google Apps Script, the byte array is handled as signed values, so it is required to convert them to unsigned hexadecimal.
For KANJI characters, when one of the 2 bytes of a character can be represented as an ASCII character, that character is used as-is instead of being percent-encoded. The script of "Sample script 2" can be used for this situation.
At above sample, 天 becomes %93V.
Add % to the top character of each byte.
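The signed-to-unsigned step in the flow above is independent of Apps Script and can be tried in any JavaScript environment. A small sketch, with the Shift-JIS bytes of あいう written out by hand as the signed values getBytes() would return (the helper name signedBytesToPercent is invented here):

```javascript
// Mask each signed byte to 0..255, format as two hex digits, prefix with '%'.
function signedBytesToPercent(signedBytes) {
  return signedBytes
    .map(b => '%' + ('0' + (b & 0xff).toString(16)).slice(-2))
    .join('')
    .toUpperCase();
}

// 0x82 0xA0 0x82 0xA2 0x82 0xA4 as signed 8-bit integers
const signedSjis = [-126, -96, -126, -94, -126, -92];
console.log(signedBytesToPercent(signedSjis));
```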
References:
newBlob(data)
setDataFromString(string, charset)
getBytes()
map()
If I misunderstood your question and this was not the direction you want, I apologize.
Let libraries do the hard work! EncodingJS, which you mentioned, can produce URL-encoded Shift-JIS strings from ordinary String objects.
Loading the library in Apps Script is a bit tricky, but nonetheless possible as demonstrated in this answer:
/**
* Specific to Apps Script. See:
* https://stackoverflow.com/a/33315754/13301046
*
* You can instead use <script>, import or require()
* depending on the environment the code runs in.
*/
eval(UrlFetchApp.fetch('https://cdnjs.cloudflare.com/ajax/libs/encoding-japanese/2.0.0/encoding.js').getContentText());
URL encoding is achieved as follows:
function muFunction() {
const utfString = '本日は晴天なり';
const sjisArray = Encoding.convert(utfString, {
to: 'SJIS',
from: 'UNICODE'
})
const sjisUrlEncoded = Encoding.urlEncode(sjisArray)
Logger.log(sjisUrlEncoded)
}
This emits a URL-encoded Shift-JIS string to the log:
'%96%7B%93%FA%82%CD%90%B0%93V%82%C8%82%E8'

Given the length of an unencoded string, what single formula reveals the length of that string after base-64 encoding?

I am trying to ascertain if there is a standard arithmetical formula which, given the length of an unencoded string, will reveal the length of that string when it has been base-64 encoded.
Here is a list of strings and their base-64 encodings:
A : QQ==
AB : QUI=
ABC : QUJD
ABCD : QUJDRA==
ABCDE : QUJDREU=
ABCDEF : QUJDREVG
ABCDEFG : QUJDREVGRw==
ABCDEFGH : QUJDREVGR0g=
ABCDEFGHI : QUJDREVGR0hJ
ABCDEFGHIJ : QUJDREVGR0hJSg==
ABCDEFGHIJK : QUJDREVGR0hJSks=
ABCDEFGHIJKL : QUJDREVGR0hJSktM
Here are the string lengths of the original strings and the lengths of their base-64 encoded strings (not including the = signs sometimes appended to the end of the encoding):
1 : 2
2 : 3
3 : 4
4 : 6
5 : 7
6 : 8
7 : 10
8 : 11
9 : 12
10 : 14
11 : 15
12 : 16
What single formula, when applied to the numbers on the left, results in the numbers on the right?
Function https://stackoverflow.com/a/57945696/230983 does exactly what Rounin needs. But if you want to support Unicode characters you cannot rely on the length method, so you need something else to count the number of bytes. A simple way to solve this is to use blobs:
/**
* Guess the number of Base64 characters required by specified string
*
* @param {String} str
* @returns {Number}
*/
function detectB64CharsLength(str) {
const blob = new Blob([str]);
return Math.ceil(blob.size * (4 / 3))
}
/**
* A dirty hack for encoding Unicode characters to Base64
*
* @link https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/Base64_encoding_and_decoding#The_Unicode_Problem
* @param {String} data
* @returns {String}
*/
function utoa(data) {
return btoa(unescape(encodeURIComponent(data)));
}
// Run some tests and make sure everything is ok
['a', 'ab', 'ββ', '😀'].map(v => {
console.log(v, detectB64CharsLength(v), utoa(v));
});
Your question is muddled, because of the part where you say "not including the = signs sometimes appended to the end of the encoding".
I'm not saying the length of the non-= portion of a base64 encoding result is uninteresting -- perhaps you have valid reasons for wanting to know that.
But if you are trying to calculate, say, the storage needed for a base64 encoding result, you need to include storage for the = signs; a base64 result cannot be decoded without them. Observe:
$ echo -n 'ABCDE' | base64
QUJDREU=
$ echo -n 'QUJDREU=' | base64 --decode | od -c
0000000 A B C D E
$ echo -n 'QUJDREU' | base64 --decode | od -c
0000000 A B C
NOTE #1 : It is possible to not store the =-signs, because it is possible to calculate when they are missing from a given base64 result; they don't strictly speaking need to be stored, but they do need to be supplied for the decoding operation. But then you'd need a custom decoding operation that first looks to see if the padding is missing. I wager that storing at worst 2 extra bytes is far less expensive than the hassle / complexity / unexpectedness of a custom base64 decoding function.
NOTE #2 : As per follow-up comments, some libraries have base64 functions that support missing padding. Treatment of padding is implementation-specific. In some contexts, padding is mandatory (per the relevant specs). Each of the following is a reasonable treatment of padding for any specific library:
implicit padding : assume padding characters for inputs whose length is one or two bytes short of a multiple of 4 bytes (note: 3 bytes short is still invalid, since base64 encoding can only be 0, 1, or 2 bytes short)
best-effort decoding : decode the longest portion of the input that is divisible by 4 bytes
assume truncation : reject as invalid an input whose length is not divisible by 4 bytes, on the assumption that this indicates an incomplete transmission
Again, which of these is most correct will depend upon the context in which the code in question is operating, and different library authors will make different determinations on this.
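The "implicit padding" treatment can be sketched as a small pre-processing step before a strict decoder. This is an illustration only; restorePadding is a name invented here, and atob is assumed to be available (browsers, recent Node.js).

```javascript
// Re-append the '=' characters that were stripped from a base64 string.
// A length of 1 modulo 4 can never be produced by base64 encoding, so reject it.
function restorePadding(b64) {
  const remainder = b64.length % 4;
  if (remainder === 1) {
    throw new Error('Invalid base64 length');
  }
  return remainder === 0 ? b64 : b64 + '='.repeat(4 - remainder);
}

console.log(restorePadding('QUJDREU'));
console.log(atob(restorePadding('QUJDREU')));
```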
The answer from @Victor is the best answer; it is the most germane to the context of the question (JavaScript), and considers the crucial bytes-vs-characters issue as well.
As I was finishing typing out the question above, I realised (I think) what the formula is.
Divide the original string length by 3.
Round up that new number
Add the rounded up new number to the original string length
Like this:
const getLengthOfStringAfterBase64Encoding = (string) => {
const stringLength = string.length;
const base64EncodedStringLength = stringLength + Math.ceil(stringLength / 3);
return base64EncodedStringLength;
}
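The formula n + ceil(n / 3) is the same as ceil(4n / 3), and the padded length is 4 * ceil(n / 3). A quick sketch that checks both against btoa for the twelve cases listed in the question (base64Lengths is a name invented here; n counts bytes, which equals string length only for single-byte characters such as the ASCII test strings):

```javascript
// Unpadded and padded base64 lengths for a given number of input bytes.
function base64Lengths(byteCount) {
  return {
    unpadded: Math.ceil((byteCount * 4) / 3),
    padded: 4 * Math.ceil(byteCount / 3)
  };
}

for (let n = 1; n <= 12; n++) {
  const encoded = btoa('ABCDEFGHIJKL'.slice(0, n));
  const { unpadded, padded } = base64Lengths(n);
  console.log(n, unpadded, padded, encoded);
}
```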

Counting the byte size of a file encoded in ISO 8859-7 in JavaScript

Background
I am writing an esoteric language called Jolf. It is used on the lovely site codegolf SE. If you don't already know, a lot of challenges are scored in bytes. People have made lots of languages that utilize either their own encoding or a pre-existing encoding.
On the interpreter for my language, I have a byte counter. As you might expect, it counts the number of bytes in the code. Until now, I've been using a UTF-8 en/decoder (utf8.js). I am now using the ISO 8859-7 encoding, which has Greek characters, and the text upload does not actually work with it. I need to count the actual bytes contained within an uploaded file. Also, is there a way to read the contents of said encoded file?
Question
Given a file encoded in ISO 8859-7 obtained from an <input> element on the page, is there any way to obtain the number of bytes contained in that file? And, given "plaintext" (i.e. text put directly into a <textarea>), how might I count the bytes in that as if it was encoded in ISO 8859-7?
What I've tried
The input element is called isogreek. The file resides in the <input> element. The content is ΦX族, a Greek character, a latin character (each of which should be a byte) and a Chinese character, which should be more than one byte (?).
isogreek.files[0].size; // is 3; should be more.
var reader = new FileReader();
reader.readAsBinaryString(isogreek.files[0]); // corrupts the string to `ÖX?`
reader.readAsText(isogreek.files[0]); // �X?
reader.readAsText(isogreek.files[0],"ISO 8859-7"); // �X?
Extended from this comment.
As @pvg mentioned in the comments, the string resulting from readAsBinaryString would be correct, but is corrupted for two reasons:
A. The result is encoded in ISO-8859-1. You can use a function to fix this:
function convertFrom1to7(text) {
// charset is the set of chars in the ISO-8859-7 encoding from 0xA0 and up, encoded with this format:
// - If the character is in the same position as in ISO-8859-1/Unicode, use a "!".
// - If the character is a Greek char with 720 subtracted from its char code, use a ".".
// - Otherwise, use \uXXXX format.
var charset = "!\u2018\u2019!\u20AC\u20AF!!!!.!!!!\u2015!!!!...!...!.!....................!............................................!";
var newtext = "", newchar = "";
for (var i = 0; i < text.length; i++) {
var char = text[i];
newchar = char;
if (char.charCodeAt(0) >= 160) {
newchar = charset[char.charCodeAt(0) - 160];
if (newchar === "!") newchar = char;
if (newchar === ".") newchar = String.fromCharCode(char.charCodeAt(0) + 720);
}
newtext += newchar;
}
return newtext;
}
B. The Chinese character isn't a part of the ISO-8859-7 charset (because the charset supports up to 256 unique chars, as the table shows). If you want to include arbitrary Unicode characters in a program, you will probably need to do one of these two things:
Count the bytes of that program in, e.g., UTF-8 or UTF-16. This can be done pretty easily with the library you linked. However, if you want this to be done automatically, you'll need a function that checks if the content of the textarea is a valid ISO-8859-7 file, like this:
function isValidISO_8859_7(text) {
var charset = /[\u0000-\u00A0\u2018\u2019\u00A3\u20AC\u20AF\u00A6-\u00A9\u037A\u00AB-\u00AD\u2015\u00B0-\u00B3\u0384-\u0386\u00B7\u0388-\u038A\u00BB\u038C\u00BD\u038E-\u03CE]/;
var valid = true;
for (var i = 0; i < text.length; i++) {
valid = valid && charset.test(text[i]);
}
return valid;
}
Create your own, custom variant of ISO-8859-7 that uses a specific byte (or more than one) to signify that the next 2 or 3 bytes belong to a single Unicode char. This can be pretty much as simple or complex as you like, from one char signifying a 2-byte char and one signifying a 3-byter to everything between 80 and 9F setting up for the next few. Here's a basic example that uses 80 as the 2-byter and 81 as the 3-byter (assumes the text is encoded in ISO-8859-1):
function reUnicode(text) {
var newtext = "";
for (var i = 0; i < text.length; i++) {
if (text.charCodeAt(i) === 0x80) {
newtext += String.fromCharCode((text.charCodeAt(++i) << 8) + text.charCodeAt(++i));
} else if (text.charCodeAt(i) === 0x81) {
var charcode = (text.charCodeAt(++i) << 16) + (text.charCodeAt(++i) << 8) + text.charCodeAt(++i) - 65536;
newtext += String.fromCharCode(0xD800 + (charcode >> 10), 0xDC00 + (charcode & 1023)); // Convert into a UTF-16 surrogate pair
} else {
newtext += convertFrom1to7(text[i]);
}
}
return newtext;
}
I can go into either method in more detail if you desire.
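For the textarea case, the first option can be sketched as a byte counter that falls back to UTF-8. This is an illustration under assumptions: countBytes is a name invented here, TextEncoder is assumed to be available instead of the utf8.js library, and the character class is copied from isValidISO_8859_7 above without further verification.

```javascript
// Same character class as isValidISO_8859_7 above, anchored to the whole string.
const ISO_8859_7_ONLY = /^[\u0000-\u00A0\u2018\u2019\u00A3\u20AC\u20AF\u00A6-\u00A9\u037A\u00AB-\u00AD\u2015\u00B0-\u00B3\u0384-\u0386\u00B7\u0388-\u038A\u00BB\u038C\u00BD\u038E-\u03CE]*$/;

// One byte per character if everything fits ISO-8859-7, otherwise count UTF-8 bytes.
function countBytes(text) {
  if (ISO_8859_7_ONLY.test(text)) {
    return { encoding: 'ISO-8859-7', bytes: text.length };
  }
  return { encoding: 'UTF-8', bytes: new TextEncoder().encode(text).length };
}

console.log(countBytes('\u03A6X'));       // Greek capital phi and X
console.log(countBytes('\u03A6X\u65CF')); // with the Chinese character added
```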
The three characters you gave as an example are encoded in 6 bytes in UTF-8: ce a6 58 e6 97 8f (0x58 = X). Also: JavaScript works with UTF-16, which results in some funny things like ("abc".length === "ΦX族".length) being true.
You most probably need to go to the full length and check every single character for its length by its code-value. You may also need to check two characters in some cases (utf-32 to utf-16). A BOM needs to be placed and checked, too, if necessary (always necessary if you work with files of unknown sources).
EDIT: added on request:
The encoding of characters in JavaScript is always UTF-16, a two-byte representation of the character. That was all well and nice until they suddenly (ha!) found out that two bytes are not really sufficient for all of the alphabets of the world, so they expanded the Unicode range to four bytes: UTF-32.
Well, the Unicode consortium did so but the ECMA committee did not.
It cannot be said that hell broke loose but it is quite close in some circumstances, and one of those is your case because you want to mix one-byte encodings with multiple-byte encodings, different ones even.
One byte fits well in two bytes, but three or more bytes do not fit well in two bytes, so the so-called surrogates were invented. These surrogates are also the reason why it is not so simple to reverse a string in JavaScript.
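To make the string-reversal problem concrete, here is a small sketch comparing two approaches. The function names are hypothetical. The second variant relies on Array.from iterating a string by code point (ES6), so it keeps surrogate pairs together; it still does not handle combining marks, which is a separate problem.

```javascript
// Naive reversal: works on 16-bit code units, so it swaps the two
// halves of every surrogate pair and produces an invalid sequence.
function reverseByCodeUnit(str) {
  return str.split("").reverse().join("");
}

// Code-point-aware reversal: Array.from walks the string by code point,
// so each surrogate pair stays intact.
function reverseByCodePoint(str) {
  return Array.from(str).reverse().join("");
}

var sample = "a😂b";
var broken = reverseByCodeUnit(sample);  // "b\uDE02\uD83Da" (halves swapped)
var intact = reverseByCodePoint(sample); // "b😂a"
```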
As I said: a large can of worms.

What is the significance of the number 93 in Unicode?

Since there is currently no universal way to read live data from an audio track in JavaScript I'm using a small library/API to read volume data from a text file that I converted from an MP3 offline.
The string looks like this
!!!!!!!!!!!!!!!!!!!!!!!!!!###"~{~||ysvgfiw`gXg}i}|mbnTaac[Wb~v|xqsfSeYiV`R
][\Z^RdZ\XX`Ihb\O`3Z1W*I'D'H&J&J'O&M&O%O&I&M&S&R&R%U&W&T&V&m%\%n%[%Y%I&O'P'G
'L(V'X&I'F(O&a&h'[&W'P&C'](I&R&Y'\)\'Y'G(O'X'b'f&N&S&U'N&P&J'N)O'R)K'T(f|`|d
//etc...
and the idea is basically that at a given point in the song the Unicode number of the character at the corresponding point in the text file yields a nominal value to represent volume.
The library translates the data (in this case, a stereo track) with the following (simplified here):
getVolume = function(sampleIndex,o) {
o.left = Math.min(1,(this.data.charCodeAt(sampleIndex*2|0)-33)/93);
o.right = Math.min(1,(this.data.charCodeAt(sampleIndex*2+1|0)-33)/93);
}
I'd like some insight into how the file was encoded in the first place, and how I'm making use of it here.
What is the significance of 93 and 33?
What is the purpose of the bitwise |?
Is this a common means of porting information (ie, does it have a name), or is there a better way to do it?
It looks like the range of the characters in that file is from ! to ~. ! has an ASCII code of 33 and ~ has an ASCII code of 126. 126-33 = 93.
33 and 93 are used for normalizing values between ! and ~.
var data = '!';
Math.min(1,(data.charCodeAt(0*2)-33)/93); // will yield 0
var data = '~';
Math.min(1,(data.charCodeAt(0*2)-33)/93); // will yield 1
var data = '"';
Math.min(1,(data.charCodeAt(0*2)-33)/93); // will yield 0.010752688172043012
var data = '#';
Math.min(1,(data.charCodeAt(0*2)-33)/93); // will yield 0.021505376344086023
// ... and so on
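To see how such a file could have been produced, here is a minimal sketch of the inverse mapping. It assumes the offline converter simply scaled each volume sample from the range 0..1 onto the printable ASCII characters from ! (33) to ~ (126); the tool actually used for the file is not known, and encodeVolume and decodeVolume are hypothetical names.

```javascript
// Assumption: volumes are numbers between 0 and 1 and each one becomes
// a single printable ASCII character between "!" (33) and "~" (126).
function encodeVolume(volume) {
  var clamped = Math.max(0, Math.min(1, volume));
  return String.fromCharCode(33 + Math.round(clamped * 93));
}

// Same arithmetic as the library code in the question, for one character.
function decodeVolume(ch) {
  return Math.min(1, (ch.charCodeAt(0) - 33) / 93);
}

encodeVolume(0); // "!"
encodeVolume(1); // "~"
```

With only 94 distinct characters the volume is quantized to steps of 1/93, which is usually fine for driving a visualization.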
The |0 is there due to the fact that sampleIndex*2 or sampleIndex*2+1 will yield a non-integer value when being passed a non-integer sampleIndex. |0 truncates the decimal part just in case someone sends in an incorrectly formatted sampleIndex (i.e. non-integer).
Doing a bitwise OR with zero will truncate the number on the LHS to an integer. Not sure about the rest of your question though, sorry.
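A short sketch of what the |0 trick does in practice. The wrapper name truncateToInt32 is made up for this example. Note that the result is a 32-bit signed integer, so the trick truncates toward zero (unlike Math.floor) and wraps around for very large values.

```javascript
// Bitwise operators convert their operands to 32-bit signed integers,
// so OR-ing with 0 drops the fractional part without changing the bits.
function truncateToInt32(n) {
  return n | 0;
}

truncateToInt32(2.7);          // 2
truncateToInt32(-2.7);         // -2 (Math.floor(-2.7) would give -3)
truncateToInt32(NaN);          // 0
truncateToInt32(4294967301.5); // 5 (wraps modulo 2^32)
```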
93 and 33 are ASCII codes (not unicode) for the characters "]" and "!" respectively. Hope that helps a bit.
This will help you forever:
http://www.asciitable.com/
ASCII codes for everything.
Enjoy!

How many bytes in a JavaScript string?

I have a javascript string which is about 500K when being sent from the server in UTF-8. How can I tell its size in JavaScript?
I know that JavaScript uses UCS-2, so does that mean 2 bytes per character? Or does it depend on the JavaScript implementation? Or on the page encoding or maybe content-type?
You can use the Blob to get the string size in bytes.
Examples:
console.info(
new Blob(['😂']).size, // 4
new Blob(['👍']).size, // 4
new Blob(['😂👍']).size, // 8
new Blob(['👍😂']).size, // 8
new Blob(['I\'m a string']).size, // 12
// from Premasagar correction of Lauri's answer for
// strings containing lone characters in the surrogate pair range:
// https://stackoverflow.com/a/39488643/6225838
new Blob([String.fromCharCode(55555)]).size, // 3
new Blob([String.fromCharCode(55555, 57000)]).size // 4 (not 6)
);
This function will return the byte size of any UTF-8 string you pass to it.
function byteCount(s) {
return encodeURI(s).split(/%..|./).length - 1;
}
Source
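To show why the split trick above counts bytes, here is a more verbose sketch of the same idea. encodeURI writes every non-ASCII byte of the UTF-8 form as a %XX escape and leaves the remaining characters as single ASCII characters, so each token is exactly one byte. The name utf8Tokens is hypothetical.

```javascript
// Split the percent-encoded form into tokens: either one "%XX" escape
// (one UTF-8 byte) or one unescaped ASCII character (also one byte).
function utf8Tokens(str) {
  return encodeURI(str).match(/%..|./g) || [];
}

var tokens = utf8Tokens("héllo");
// ["h", "%C3", "%A9", "l", "l", "o"], i.e. 6 bytes for 5 characters
var byteLength = tokens.length;
```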
JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16, but whatever choice they made, it’s just an implementation detail that won’t affect the language’s characteristics.
The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.
Source
If you're using node.js, there is a simpler solution using buffers :
function getBinarySize(string) {
return Buffer.byteLength(string, 'utf8');
}
There is an npm lib for that: https://www.npmjs.org/package/utf8-binary-cutter (from yours faithfully)
String values are not implementation dependent. According to the ECMA-262 3rd Edition Specification, each character represents a single 16-bit unit of UTF-16 text:
4.3.16 String Value
A string value is a member of the type String and is a finite ordered sequence of zero or more 16-bit unsigned integer values.
NOTE Although each value usually represents a single 16-bit unit of UTF-16 text, the language does not place any restrictions or requirements on the values except that they be 16-bit unsigned integers.
These are 3 ways I use:
TextEncoder
new TextEncoder().encode("myString").length
Blob
new Blob(["myString"]).size
Buffer
Buffer.byteLength("myString", 'utf8')
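The three options can be combined into one helper that uses whichever API the current environment provides. This is only a sketch: the function name utf8ByteLength is made up, and the order of the checks is an arbitrary choice.

```javascript
// Return the UTF-8 byte length of a string using the first available API.
function utf8ByteLength(str) {
  if (typeof TextEncoder !== "undefined") {
    return new TextEncoder().encode(str).length; // browsers and recent Node
  }
  if (typeof Blob !== "undefined") {
    return new Blob([str]).size;                 // browsers
  }
  if (typeof Buffer !== "undefined") {
    return Buffer.byteLength(str, "utf8");       // Node
  }
  throw new Error("No UTF-8 encoder available in this environment");
}

utf8ByteLength("myString"); // 8
utf8ByteLength("\u2620");   // 3
```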
Try this combination using the unescape JS function:
const byteAmount = unescape(encodeURIComponent(yourString)).length
Full encode process example:
const s = "1 a ф № # ®"; // length is 11
const s2 = encodeURIComponent(s); // length is 41
const s3 = unescape(s2); // length is 15 [1-1,a-1,ф-2,№-3,#-1,®-2]
const s4 = escape(s3); // length is 39
const s5 = decodeURIComponent(s4); // length is 11
Note that if you're targeting node.js you can use Buffer.from(string).length:
var str = "\u2620"; // => "☠"
str.length; // => 1 (character)
Buffer.from(str).length // => 3 (bytes)
The size of a JavaScript string is
Pre-ES6: 2 bytes per character
ES6 and later: 2 bytes per character,
or 5 or more bytes per character
Pre-ES6
Always 2 bytes per character. UTF-16 is not allowed because the spec says "values must be 16-bit unsigned integers". Since UTF-16 strings can use 3 or 4 byte characters, it would violate 2 byte requirement. Crucially, while UTF-16 cannot be fully supported, the standard does require that the two byte characters used are valid UTF-16 characters. In other words, Pre-ES6 JavaScript strings support a subset of UTF-16 characters.
ES6 and later
2 bytes per character, or 5 or more bytes per character. The additional sizes come into play because ES6 (ECMAScript 6) adds support for Unicode code point escapes. Using a unicode escape looks like this: \u{1D306}
Practical notes
This doesn't relate to the internal implementation of a particular engine. For example, some engines use data structures and libraries with full UTF-16 support, but what they provide externally doesn't have to be full UTF-16 support. Also an engine may provide external UTF-16 support as well but is not mandated to do so.
For ES6, practically speaking characters will never be more than 5 bytes long (2 bytes for the escape point + 3 bytes for the Unicode code point) because the latest version of Unicode only has 136,755 possible characters, which fits easily into 3 bytes. However this is technically not limited by the standard so in principle a single character could use say, 4 bytes for the code point and 6 bytes total.
Most of the code examples here for calculating byte size don't seem to take into account ES6 Unicode code point escapes, so the results could be incorrect in some cases.
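One way to check how a code point escape behaves at run time, assuming an engine that supports ES6 syntax: the escape is a way of writing the character in source code, and the resulting string value can be inspected like any other. The sketch below only reports what the string contains; describeString is a hypothetical name.

```javascript
// Report how a string value is made up of 16-bit code units.
function describeString(str) {
  return {
    codeUnits: str.length,              // number of 16-bit elements
    utf16Bytes: str.length * 2,         // 2 bytes per code unit
    codePoints: Array.from(str).length  // number of Unicode code points
  };
}

var viaEscape = "\u{1D306}";        // ES6 code point escape
var viaSurrogates = "\uD834\uDF06"; // the same character as a surrogate pair

describeString(viaEscape); // { codeUnits: 2, utf16Bytes: 4, codePoints: 1 }
viaEscape === viaSurrogates; // true
```

In other words, the value produced by the escape is the same surrogate pair that could already be written before ES6, so byte-counting code that works on code units sees nothing new.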
UTF-8 encodes characters using 1 to 4 bytes per code point. As CMS pointed out in the accepted answer, JavaScript will store each character internally using 16 bits (2 bytes).
If you parse each character in the string via a loop and count the number of bytes used per code point, and then multiply the total count by 2, you should have JavaScript's memory usage in bytes for that UTF-8 encoded string. Perhaps something like this:
getStringMemorySize = function( _string ) {
"use strict";
var codePoint
, accum = 0
;
for( var stringIndex = 0, endOfString = _string.length; stringIndex < endOfString; stringIndex++ ) {
codePoint = _string.charCodeAt( stringIndex );
if( codePoint < 0x100 ) {
accum += 1;
continue;
}
if( codePoint < 0x10000 ) {
accum += 2;
continue;
}
if( codePoint < 0x1000000 ) {
accum += 3;
} else {
accum += 4;
}
}
return accum * 2;
}
Examples:
getStringMemorySize( 'I' ); // 2
getStringMemorySize( '❤' ); // 4
getStringMemorySize( '𠀰' ); // 8
getStringMemorySize( 'I❤𠀰' ); // 14
The answer from Lauri Oherd works well for most strings seen in the wild, but will fail if the string contains lone characters in the surrogate pair range, 0xD800 to 0xDFFF. E.g.
byteCount(String.fromCharCode(55555))
// URIError: URI malformed
This longer function should handle all strings:
function bytes (str) {
var bytes=0, len=str.length, codePoint, next, i;
for (i=0; i < len; i++) {
codePoint = str.charCodeAt(i);
// Lone surrogates cannot be passed to encodeURI
if (codePoint >= 0xD800 && codePoint < 0xE000) {
if (codePoint < 0xDC00 && i + 1 < len) {
next = str.charCodeAt(i + 1);
if (next >= 0xDC00 && next < 0xE000) {
bytes += 4;
i++;
continue;
}
}
}
bytes += (codePoint < 0x80 ? 1 : (codePoint < 0x800 ? 2 : 3));
}
return bytes;
}
E.g.
bytes(String.fromCharCode(55555))
// 3
It will correctly calculate the size for strings containing surrogate pairs:
bytes(String.fromCharCode(55555, 57000))
// 4 (not 6)
The results can be compared with Node's built-in function Buffer.byteLength:
Buffer.byteLength(String.fromCharCode(55555), 'utf8')
// 3
Buffer.byteLength(String.fromCharCode(55555, 57000), 'utf8')
// 4 (not 6)
A single element in a JavaScript String is considered to be a single UTF-16 code unit. That is to say, string characters are stored in 16 bits (1 code unit), and 16 bits equal 2 bytes (8 bits = 1 byte).
The charCodeAt() method can be used to return an integer between 0 and 65535 representing the UTF-16 code unit at the given index.
The codePointAt() method can be used to return the entire code point value for Unicode characters, i.e. the value that UTF-32 would store.
When a character can't be represented in a single 16-bit code unit, it is encoded as a surrogate pair and therefore uses two code units (2 x 16-bit = 4 bytes).
See Unicode encodings for different encodings and their code ranges.
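The difference between the two methods can be seen by listing both views of the same string. This is a small sketch with hypothetical helper names; it assumes ES6 for codePointAt and Array.from.

```javascript
// All 16-bit code units of a string, as numbers.
function codeUnits(str) {
  var units = [];
  for (var i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i));
  }
  return units;
}

// All Unicode code points of a string, as numbers.
// Array.from iterates by code point, so surrogate pairs are not split.
function codePoints(str) {
  return Array.from(str, function (ch) {
    return ch.codePointAt(0);
  });
}

codeUnits("a😂");  // [0x61, 0xD83D, 0xDE02], 3 code units = 6 bytes
codePoints("a😂"); // [0x61, 0x1F602], 2 code points
```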
The Blob interface's size property returns the size of the Blob or File in bytes.
const getStringSize = (s) => new Blob([s]).size;
I'm working with an embedded version of the V8 engine.
I've tested a single string, growing it by 1000 characters each step. UTF-8.
First test with the single-byte (8-bit, ANSI) character "A" (hex: 41).
Second test with the two-byte (16-bit) character "Ω" (hex: CE A9) and the
third test with the three-byte (24-bit) character "☺" (hex: E2 98 BA).
In all three cases the device runs out of memory at
888 000 characters, using ca. 26 348 KB of RAM.
Result: The characters are not stored dynamically, and not with only 16 bits. OK, perhaps only in my case (embedded 128 MB RAM device, V8 engine, C++/Qt). The character encoding has nothing to do with the size in RAM inside the JavaScript engine. E.g. encodeURI etc. is only useful for high-level data transmission and storage.
Embedded or not, the fact is that the characters are not stored in just 16 bits.
Unfortunately I have no 100% answer as to what JavaScript does at the low level.
Btw. I've run the same test (first test above) with an array of the character "A".
Pushed 1000 items every step. (Exactly the same test, just replacing the string with an array.) And the system ran out of memory (as intended) after using 10 416 KB, at an array length of 1 337 000.
So the JavaScript engine is not simply restricted. It's somewhat more complex.
You can try this:
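function getByteLength(str) {
var b = str.match(/[^\x00-\xff]/g); // characters outside the single-byte range
return (str.length + (!b ? 0 : b.length)); // count each of those as 2 bytes
}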
It worked for me.
