Is there any way to encode a file of, say, 2GB without "chopping" it into chunks? Files larger than 2GB throw an error that the file is too large for fs, and splitting into smaller chunks doesn't work either because of encoding/decoding problems. Thanks for any help :)
Base64 isn't a good solution for large file transfer.
It's simple and easy to work with, but it will increase your file size. See MDN's article about this. I would recommend looking into best practices for data transfer in JS. MDN has another article on this that breaks down the DataTransfer API.
Encoded size increase
Each Base64 digit represents exactly 6 bits of data. So, three 8-bit bytes of the input string/binary file (3×8 bits = 24 bits) can be represented by four 6-bit Base64 digits (4×6 = 24 bits).
This means that the Base64 version of a string or file will be at least 133% the size of its source (a ~33% increase). The increase may be larger if the encoded data is small. For example, the string "a" with length === 1 gets encoded to "YQ==" with length === 4, a 300% increase.
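To see that overhead concretely, here's a quick sketch you can paste into a browser console (btoa is the standard base64 helper; this assumes the data fits comfortably in memory, which it won't for a 2GB file):

const data = "a".repeat(3000);   // stand-in for 3000 bytes of binary data
const encoded = btoa(data);      // base64-encode it
console.log(data.length);        // 3000
console.log(encoded.length);     // 4000 -> 4/3 of the input, a ~33% increase
console.log(btoa("a"));          // "YQ==" -> a single byte becomes 4 characters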
Additionally
Could you share what you're trying to do, and add an MRE? There are so many different ways to tackle this problem that it's hard to narrow it down without knowing any of the requirements.
In a Firefox addon I am caching lengthy strings to disk. I would like to be able to give users some idea of how much disk space in bytes these strings are taking up.
I understand that JavaScript stores strings as UTF-16. If a UTF-8 string is saved in a variable, it is converted to UTF-16, so UTF-8 methods of determining string size will not do here.
From this reference:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length#Description
It states that the value of string.length is actually the number of UTF-16 code units, and not the number of characters.
From this I infer that the disk space in bytes would simply be string.length * 2. I am looking for confirmation as to whether my assumption is correct.
EDIT:
(Several edits made to the title and original text. Also, the following:)
It was suggested that this is a duplicate of How many bytes in a JavaScript string?. However, that does not address my question: it covers methods of getting the size of UTF-8 strings, whereas JavaScript converts UTF-8 strings to UTF-16 when it stores them. For example, a UTF-8 character that takes up 3 bytes may only use 2 bytes (1 UTF-16 code unit) when converted to UTF-16.
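If it helps to sanity-check the assumption, a quick console experiment (using the standard TextEncoder API) shows the difference between the UTF-16 estimate and the UTF-8 byte count:

const s = "€";                                   // U+20AC: 1 UTF-16 code unit, but 3 bytes in UTF-8
console.log(s.length * 2);                       // 2 -> estimated in-memory size in bytes (UTF-16)
console.log(new TextEncoder().encode(s).length); // 3 -> size the same text would take stored as UTF-8

Note that characters outside the BMP (emoji, for instance) count as 2 code units each, so string.length * 2 still reflects their 4-byte UTF-16 cost.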
I recently found that base32, base64 and base128 are the most efficient forms of base-n encoding, and that while base58, Ascii85, base91, base92 et al do provide some efficiency improvements over the ubiquitous base64 due to their use of more characters, there are some mapping losses; for example, there happen to be 272 indices per character-pair in base92 that are impossible to map to from base-10 powers of 2 and are thus completely wasted. (Base91 encoding only has a similar loss of 89 characters (as found by the script in the link above) but it's patented.)
It would be great if it were viable to use base128 in modern-day real-world scenarios.
There are 92 characters available within 0x21 (33) to 0x7E (126) sans \ and ", which make for a great start to creating JSONifiable strings with the most characters possible.
Here are a few ways I envisage the rest of the characters could be found. This is the question I'm asking.
Just dumbly use Unicode
Two-byte Unicode characters could be used to fill in the remaining 36 required indices. Highly suboptimal; I wouldn't be surprised if this was worse than base64 on the wire. Would only be useful for Unicode character counting scenarios like tweet length. Not exactly what I'm going for.
Select 36 non-Unicode characters from within the upper (>128) ASCII range
JavaScript was built with the expectation that character encoding configuration will occasionally go horribly wrong. So the language (and web browsers) handle printing arbitrary and unprintable binary data just fine. So why not just use the upper ASCII range? It's there to be used, right?
One very real problem could be data going over HTTP and falling through one or more proxies on the way between my browser and the server. How badly could this go? I'm aware that WebSockets over HTTP caused some real pain a couple of years ago, and potentially even today.
Kind of use UTF-8 in interesting ways
UTF-8 defines 1- to 4-byte long sequences to encapsulate Unicode codepoints. Bytes 2 to 4 always start with 10xxxxxx. There are 64 characters within that range. If I pass through a naïve proxy that filters characters outside the Unicode range on a character-by-character basis, using bytes within this range might mean my data would get through unscathed!
Determine 36 magic bytes that will work for various esoteric reasons
Maybe there are some high ASCII characters that will successfully traverse >99% of the Internet infrastructure for various historical or implementational reasons. What characters might these be?
Base64 is ubiquitous and has wound up being used everywhere, and it's easy to understand why: it was defined in 1987 to use a carefully-chosen, very restricted alphabet of A-Z, a-z, 0-9, + and / that was (and remains) difficult for most environments (such as mainframes using non-ASCII encoding) to have problems with.
EBCDIC mainframes and MIME email are still very much out there, but today base64 has also wound up as a heavily-used pipe within JavaScript to handle the case of "something in this data path might choke on binary", and the collective overhead it adds is nontrivial.
There's currently only one other question on SO regarding the general viability of base128 encoding, and literally every single answer has one or more issues. The accepted answer suggests that base128 must exactly use the first 128 characters of ASCII, and the only answer that acknowledges that the encoded alphabet can use any characters proceeds to claim that base128 is not in use because the encoded characters must be easily retypeable (which base58 is optimized for, FWIW). All the others have various problems (which I can explain further if desired).
This question is an attempt to re-ask the above with some additional unambiguous subject clarification, in the hope that a concrete go/no-go can be determined.
It's viable in the sense of being technically possible, but it's not viable in the sense of being able to achieve a result better than a much simpler alternative: using HTTP gzip compression. In practice if compression is enabled, the Huffman encoding of the strings will negate the 1/3 increase in size from base64 encoding because each character in the base64 string has only 6 bits of entropy.
As a test, I generated a 1 MB file of random data using a utility like Dummy File Creator, then base64-encoded it and gzipped the resulting file with 7-Zip.
Original data: 1,048,576 bytes
Base64 encoded data: 1,398,104 bytes
Gzipped base64 encoded data: 1,060,329 bytes
That's only a 1.12% increase in size (and the overhead of encoding -> compressing -> decompressing -> decoding).
Base128 encoding would take 1,198,373 bytes, so you'd have to compress it too if you wanted comparable file size. Gzip compression is a standard feature in all modern browsers so what's the case for base128 and all the extra complexity that would entail?
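If you want to reproduce that experiment without hunting for a file-generator utility, here's a minimal sketch (assuming Node.js, using only its built-in crypto and zlib modules):

const crypto = require("crypto");
const zlib = require("zlib");

const raw = crypto.randomBytes(1024 * 1024);          // 1 MiB of incompressible random data
const b64 = raw.toString("base64");                   // base64-encode it
const gz = zlib.gzipSync(Buffer.from(b64, "ascii"));  // gzip the base64 text, as HTTP compression would

console.log(raw.length);   // 1,048,576
console.log(b64.length);   // 1,398,104 -> 4 * ceil(n / 3)
console.log(gz.length);    // back close to the original size, since each base64 character carries only 6 bits of entropy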
Select 36 non-Unicode characters from within the upper (>128) ASCII range
Base128 is not effective because you must use characters with codes greater than 128. For characters with codes >= 128, Chrome sends two bytes... (so a string with 1MB of these characters will grow to 2MB when sent... so you lose all the profit). For base64 strings this phenomenon doesn't appear (so we only lose ~33%). More details here in the "update" section.
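The two-bytes-on-the-wire point is easy to check (assuming the request goes out as UTF-8, which is the usual case): every code point above 0x7F needs at least two bytes in UTF-8.

const enc = new TextEncoder();
console.log(enc.encode("A").length);      // 1 byte  (U+0041)
console.log(enc.encode("\u00FF").length); // 2 bytes (U+00FF, "ÿ")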
The reason base64 is used a lot is that it uses English letters and numbers to encode a binary stream.
Technically we can use higher bases, but the problem is that they need to fit into some character set.
UTF-8 is one of the most widely used charsets, and if you are using XML or JSON to transmit data, you can very well use a Base256 encoding like the one below:
https://github.com/bharatmicrosystems/base256
Kind of use UTF-8 in interesting ways
UTF-8 defines 1- to 4-byte long sequences to encapsulate Unicode codepoints. Bytes 2 to 4 always start with 10xxxxxx. There are 64 characters within that range. If I pass through a naïve proxy that filters characters outside the Unicode range on a character-by-character basis, using bytes within this range might mean my data would get through unscathed!
This is actually quite viable and has been used in base-122. Despite the name, it's in fact base-128, because the 6 invalid values (128 – 122) are encoded specially so that a series of 14 bits can always be represented with at most 2 bytes, exactly like base-128 where 7 bits are encoded in 1 byte. In reality it can even be optimized to be more efficient than base-128.
Base-122 encoding takes chunks of seven bits of input data at a time. If the chunk maps to a legal character, it is encoded with the single byte UTF-8 character: 0xxxxxxx. If the chunk would map to an illegal character, we instead use the two-byte UTF-8 character: 110xxxxx 10xxxxxx. Since there are only six illegal code points, we can distinguish them with only three bits. Denoting these bits as sss gives us the format: 110sssxx 10xxxxxx. The remaining eight bits could seemingly encode more input data. Unfortunately, two-byte UTF-8 characters representing code points less than 0x80 are invalid. Browsers will parse invalid UTF-8 characters into error characters. A simple way of enforcing code points greater than 0x80 is to use the format 110sss1x 10xxxxxx, equivalent to a bitwise OR with 0x80 (this can likely be improved, see §4). Figure 3 summarizes the complete base-122 encoding.
§2.2 Base-122 Encoding
You can find the implementation on GitHub.
The accepted answer suggests that base128 must exactly use the first 128 characters of ASCII, ...
Base-122 doesn't exactly use the first 128 ASCII characters, so it can be stored normally in a null-terminated string. And as for
... and the only answer that acknowledges that the encoded alphabet can use any characters proceeds to claim that that base128 is not in use because the encoded characters must be easily retypeable (which base58 is optimized for, FWIW)
Encodings that use non-printable characters are generally not meant for typing by hand but for transmission. For example, base-122 is optimized for storing binary data in JavaScript strings in a UTF-8 HTML file, which probably works best for your use case.
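For intuition, here's a toy sketch of just the core idea (packing 7 input bits into each output byte); it is not the base-122 implementation, which additionally re-maps the six illegal values into two-byte UTF-8 sequences as described in the quote above:

function encode7(bytes) {                     // Uint8Array -> array of 7-bit values (0-127)
  const out = [];
  let buffer = 0, bitCount = 0;
  for (const b of bytes) {
    buffer = (buffer << 8) | b;               // append 8 more bits
    bitCount += 8;
    while (bitCount >= 7) {
      bitCount -= 7;
      out.push((buffer >> bitCount) & 0x7f);  // emit the top 7 bits
      buffer &= (1 << bitCount) - 1;          // drop the bits just emitted
    }
  }
  if (bitCount > 0) out.push((buffer << (7 - bitCount)) & 0x7f);  // pad the final partial group
  return out;
}

function decode7(values, byteLength) {        // inverse: 7-bit values -> Uint8Array
  const out = new Uint8Array(byteLength);
  let buffer = 0, bitCount = 0, i = 0;
  for (const v of values) {
    buffer = (buffer << 7) | (v & 0x7f);
    bitCount += 7;
    if (bitCount >= 8) {
      bitCount -= 8;
      if (i < byteLength) out[i++] = (buffer >> bitCount) & 0xff;
      buffer &= (1 << bitCount) - 1;
    }
  }
  return out;
}

Packing 7 bits per output byte expands n input bytes to ceil(8n/7) output bytes, which is where the 1,198,373-byte figure for a 1 MiB input in the answer above comes from.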
I have a 200Kb jpg and I want to convert it into a base64 string.
How long will that base64 string be approximately ?
The reason I'm asking is that I'd like to store that image as a base64 string in a "container" that only allows strings with a maximum length of 65,000 characters.
I tried to find out for myself using Chrome's console, but the browser keeps freezing up due to the length of the generated base64 string as soon as I assign it to a variable and then do:
x = 'base64.....'; // ridiculously long string
x.length;
The resulting string will be approximately 135% of the original size due to the expansion that takes place (according to my NetBSD manpage for uuencode). To encode n bytes you need 4*ceil(n/3) bytes, plus additional line breaks.
As already stated in the comments, there's no way to give an exact number, as it depends on how the data is packed. It will, however, be larger than the source file. A ballpark figure is around 270,000 characters.
An easy way to check this is to upload a few images to an online converting service such as https://www.base64-image.de/
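If you just want a quick estimate rather than uploading files, the 4*ceil(n/3) formula mentioned above is enough (line breaks, if any, would add a little more):

const fileBytes = 200 * 1024;                       // 204,800 bytes
const base64Length = 4 * Math.ceil(fileBytes / 3);  // 273,068 characters
console.log(base64Length > 65000);                  // true -> it won't fit in a 65,000-character field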
I've seen quite a few compression methods for JS, but in most cases the compressed data was a string containing text. I need to compress an array of fewer than 10^7 floats in the range 0-1.
As precision is not really important, I can save it as a string containing only the digits 0-9 (keeping only the first 2 digits after the decimal of each float). What method would be best for data like this? I'd like the smallest possible output, but compressing or decompressing the string shouldn't take more than ~10 sec; it's up to about 10,000,000 characters when saving 2 digits per float.
The data contains records of a sound waveform for visualization on archaic browsers that don't support the Web Audio API. The waveform is recorded at 20 fps on a Chrome client, compressed, and stored in the server DB, then sent back to IE or Firefox on request to draw the visualization. So I need lossy compression, and it can be really lossy, to reach a size that can be sent along with the song-metadata request. I hope compression on the level of wav -> 64k mp3 would be possible (like 200:1 or something); no one will notice that the waveform isn't perfect in the visualization. I also thought about saving these floats as 0-9a-z, which gives 36 instead of 100 steps but reduces the record of one frequency to 1 character. But what next, what compression should I use on this string of 0-z characters to achieve the best result? Would LZMA be suitable for a string like this? Compression/decompression would run in a web worker, so it doesn't need to be really instant: decompression around 10 sec; compression doesn't matter much, just less than one song, so about 2 min.
Taking a shot in the dark: if you truly can rely on only the first two digits after the decimal (i.e. there are no 0.00045s in the array), and you only need two digits, the easiest thing to do would be to multiply by 256 and take the integer part as a byte:
encoded = Math.min(255, Math.floor(floatValue * 256))  // clamp so floatValue === 1.0 still fits in a byte
decoded = encoded / 256.0
This comes out to a 4:1 compression ratio. However, if you know more about your data, you can squeeze more entropy out of your values.
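A minimal sketch of that byte-packing idea, assuming the values really are in the range [0, 1] (the names here are just illustrative):

function packFloats(floats) {                 // quantize each float in [0, 1] to one byte
  const out = new Uint8Array(floats.length);
  for (let i = 0; i < floats.length; i++) {
    out[i] = Math.min(255, Math.floor(floats[i] * 256));  // clamp so 1.0 still fits in a byte
  }
  return out;
}

function unpackFloats(bytes) {                // restore approximations, each within 1/256 of the original
  const out = new Float32Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) {
    out[i] = bytes[i] / 256.0;
  }
  return out;
}

The resulting Uint8Array is a compact binary blob that you can then run through a general-purpose compressor (deflate, LZMA, etc.) in a web worker if you need the size reduced further.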