I am trying to develop a paint brush application through Processing.js.
This API has a loadPixels() function that loads the RGB values into an array.
Now I want to store that array in the server's database.
The problem is the size of the array: when I convert it to a string, it is about 5 MB.
Is the best solution to compress it at the JavaScript level? How would I do that?
See http://rosettacode.org/wiki/LZW_compression#JavaScript for an LZW compression example. It works best on longer strings with repeated patterns.
From the Wikipedia article on LZW:
A dictionary is initialized to contain the single-character strings corresponding to all the possible input characters (and nothing else except the clear and stop codes if they're being used). The algorithm works by scanning through the input string for successively longer substrings until it finds one that is not in the dictionary. When such a string is found, the index for the string less the last character (i.e., the longest substring that is in the dictionary) is retrieved from the dictionary and sent to output, and the new string (including the last character) is added to the dictionary with the next available code. The last input character is then used as the next starting point to scan for substrings.
In this way, successively longer strings are registered in the dictionary and made available for subsequent encoding as single output values. The algorithm works best on data with repeated patterns, so the initial parts of a message will see little compression. As the message grows, however, the compression ratio tends asymptotically to the maximum.
JavaScript implementation of Gzip has a couple of answers that are relevant.
Also, Javascript LZW and Huffman Coding with PHP and JavaScript are other implementations I found.
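If you want to see the mechanics concretely, here is a minimal LZW sketch along the lines of the quoted description. It is an illustration rather than the Rosetta Code version: it assumes input limited to 8-bit character values (which serialized pixel data would satisfy) and emits an array of integer codes instead of a packed string.

function lzwEncode(input) {
  // Dictionary starts with all single-character strings for byte values 0-255.
  const dict = new Map();
  for (let i = 0; i < 256; i++) dict.set(String.fromCharCode(i), i);
  let next = 256;
  let w = "";
  const codes = [];
  for (const c of input) {
    const wc = w + c;
    if (dict.has(wc)) {
      w = wc;                  // keep extending the current match
    } else {
      codes.push(dict.get(w)); // emit the longest match found so far
      dict.set(wc, next++);    // register the new, longer string
      w = c;                   // restart the scan at the last character
    }
  }
  if (w !== "") codes.push(dict.get(w));
  return codes;
}

function lzwDecode(codes) {
  if (codes.length === 0) return "";
  const dict = new Map();
  for (let i = 0; i < 256; i++) dict.set(i, String.fromCharCode(i));
  let next = 256;
  let w = dict.get(codes[0]);
  let out = w;
  for (let i = 1; i < codes.length; i++) {
    // A code equal to `next` is the classic "cScSc" edge case.
    const entry = dict.has(codes[i]) ? dict.get(codes[i]) : w + w[0];
    out += entry;
    dict.set(next++, w + entry[0]);
    w = entry;
  }
  return out;
}

lzwDecode(lzwEncode(s)) === s holds for any s built from 8-bit characters; to actually shrink the 5 MB payload you would still need to pack the codes into fewer bits (or a denser string encoding) before sending them to the server.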
Very simple question: how much data (in bytes) do strings take up? Do they take up 1 byte per character?
I tried searching, but W3Schools doesn't say...
I want to know this to reduce bandwidth in my web app.
Also, for anyone who knows: does socket.io automatically JSON-stringify when using socket.emit()?
A string is a character array, so it will take up roughly sizeof(char) * numberOfCharacters, ignoring the other fields of the String class for now. A character can be 1 byte or 2 bytes depending on the system and the type of characters being represented (Unicode etc.).
However, from your question, you are more interested in the data being transported over the network. Note that data is always exchanged in bytes (byte[]), so the string will be converted into a byte[] representation first and then sent over the wire.
To limit bandwidth usage, you can enable compression and choose an interoperable serialization technique (protobuf, Smile, Fast Infoset, etc.).
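To see the character-vs-byte distinction directly, modern browsers and Node expose the standard TextEncoder API; a quick sketch:

const s = "héllo"; // 5 characters
const bytes = new TextEncoder().encode(s); // the UTF-8 bytes actually sent on the wire
console.log(s.length);     // 5 (UTF-16 code units)
console.log(bytes.length); // 6 ("é" needs 2 bytes in UTF-8)

So "1 byte per character" only holds for pure ASCII; anything outside it costs more once encoded.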
Suppose we have a huge string, named str1, say 5 million characters long, and then str2 = str1.substr(5555, 100), so that str2 is 100 characters long and is a substring of str1 starting at position 5555 (or any other randomly selected position).
How does JavaScript store str2 internally? Is the string content copied, or is the new string a sort of virtual reference to the original, storing only the position and size?
I know this is implementation-dependent; the ECMAScript standard (probably) does not define what's under the hood of the string implementation. But I want to hear from some expert who knows V8 or SpiderMonkey from the inside well enough to clarify this.
Thank you
AFAIK V8 has four string representations:
ASCII
UTF-16
concatenation of multiple strings
slice of another string
Adventures in the land of substrings and RegExps has great explanations and illustrations.
Thus, it does not have to copy the string; it just stores beginning and ending markers that point into the other string.
SpiderMonkey does the same thing. (See Large substrings ~9000x faster in Firefox than Chrome: why? ... though the answer for Chrome is outdated.)
This can give real speed boosts, but it is sometimes undesirable, since it can cause a small string to hold onto the memory of its much larger parent string (V8 bug report).
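A hypothetical illustration of that retention hazard, plus a folk workaround for forcing a flat copy (engine behaviour varies, so treat this as a sketch, not a guarantee):

let huge = "x".repeat(5000000);
let tiny = huge.substr(5555, 100); // likely a sliced string pointing into huge
huge = null; // the 5-million-char buffer can still be kept alive by tiny

// Folk idiom to flatten just the 100 characters we actually need;
// engines are free to optimize differently.
tiny = (" " + tiny).slice(1);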
This old blog post of mine explains it, as well as some other string representation forms: https://web.archive.org/web/20170607033600/http://blog.cdleary.com:80/2012/01/string-representation-in-spidermonkey/
Search for "dependent string". I think I know what you might be getting at with the question: they can be problematic things at times, because if there are no other references to the original, you can keep a giant string around just to keep a bitty little substring that's actually semantically reachable. There are things an implementation could do to mitigate that problem, like record information on a GC-generation basis to see if such one-dependent-string entities exist and collapse them to their minimal size, but last I knew, that was not being done. (Essentially, with that kind of approach you're recovering runtime_refcount == 1 style information at GC-sweep time.)
I'm working on a client side app. Users can select a few widgets on the page and share their selection with friends by sending them the URL of the page. I'm planning on saving the user's widget selections via a query string. I'd like the URL to be as small as possible so that it's easier for people to share.
Now to my question. I have a string of characters (8) that I'd like to encode so that output of the encoding is significantly smaller. I realize that 8 characters isn't very big but it's got potential to get larger in the future.
// using hex encoding results in a saving of only 1 character
(98765432).toString(16) //"5e30a78"
example.com?q=98765432 vs example.com?q=5e30a78
Ideally I'd like the new string to be 4 characters or less. What are my options for encoding a string that will be used in URLs?
I've looked at this question: How can I quickly encode and then compress a short string containing numbers in c# but the encoded string is still too long.
Short tale about compression:
Let's say you have an alphabet A and the set of words W(A) over that alphabet. Consider a function
f: W(A) -> W(A)
which takes a word w and maps it to a word f(w) in the same alphabet.
It can be shown that if this function is invertible and there is a word w1 such that
length(f(w1)) < length(w1)
(i.e. we've compressed the word), then there exists a word w2 for which the opposite holds:
length(f(w2)) > length(w2)
This is just the pigeonhole principle: there are fewer words shorter than a given length than words of that length, so an invertible map cannot shrink them all.
So every compression method you've ever heard of is, in a sense, an illusion: for every method there is a file that will be larger after compression. Compression works because methods make assumptions about their input files, for example that they are words written in natural language. They are optimized for such cases and fail for other inputs, such as white noise.
Back to your problem. If you wish to compress [a-zA-Z0-9] words into the same alphabet, and all inputs are possible, then you are doomed.
But there are at least two things you can think about:
Find the most common [a-zA-Z0-9] words and map them to short codes. For example, suppose you find that example.com?q=98765432 is the most common query among your users; then you map it to example.com?c=1 (note the parameter change). You will need a dictionary for such mappings. Of course, for some rare cases you will end up with a larger URL, e.g. example.com?q=abcd might be mapped to example.com?c=abcdefgh, unfortunately.
Restrict your input alphabet and enlarge your output alphabet. The bigger the difference, the bigger the real compression that is possible. Note that, unfortunately, there is a fairly low upper limit on the alphabet usable in URLs, namely a subset of the 128 ASCII characters. For example, if you have alphabet A = {1, 2} and B = {1, 2, 3, 4, 5, 6}, then you can map 1~1, 2~2, 11~3, 12~4, 21~5, 22~6, which means every word over A can be written over B in a way that roughly halves its size. A sketch of this idea applied to your numeric IDs follows.
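For instance, treating the ID as a number over a 10-symbol input alphabet (decimal digits) and re-writing it over the 64 URL-safe characters is a direct application of point 2. A minimal sketch; the particular alphabet here is just one URL-safe choice:

const ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

function encodeId(n) {
  let out = "";
  do {
    out = ALPHABET[n % 64] + out; // peel off base-64 digits, least significant first
    n = Math.floor(n / 64);
  } while (n > 0);
  return out;
}

function decodeId(s) {
  let n = 0;
  for (const c of s) n = n * 64 + ALPHABET.indexOf(c);
  return n;
}

encodeId(98765432); // "F4wp4" -- 5 characters instead of 8

Note this still falls short of your 4-character target: four such characters can only distinguish 64^4 = 16,777,216 values, while eight decimal digits span 100 million, so below that bound only a dictionary (point 1) can help.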
I have strings (about 1-5 KB) of the form:
FF,A3V,X7Y,aA4,....
LZW compresses these really nicely, but its output includes Turkish characters. The compressed strings are then submitted to a MySQL database.
Sometimes MySQL can 'play up' and not store these properly, putting question marks '?' in place of the Turkish characters. It can do this even when you have your text areas properly defined. Exporting and re-importing the table can sort this out. This is fine for my test database, but not something I am happy with when this goes live.
Consequently I am looking for an alternative to LZW which compresses using only normal letters/numbers etc.
Does anyone know of a PUBLIC DOMAIN compression method that avoids Turkish characters (and any other non-standard characters)? Can anyone point me to some code in JavaScript (or C++ or C#, which I can convert)?
To expand a bit on what's been said in the comments... Storing strings of bytes, such as the output a compression algorithm typically produces, in a VARCHAR or CHAR or TEXT column is not valid usage.
These column types are not for byte strings; they are for strings of valid characters only. Not every string of bytes is a valid string of characters in a given character set, and MySQL isn't going to allow invalid characters (note that, for some character sets, the correspondence between "character" and "byte" isn't 1:1).
In the good ol' days™, the two were interchangeable, but this is not the case any more (and hasn't been, to one degree or another, for a while).
If your column type were instead BINARY or VARBINARY or BLOB, the issue should disappear, because those data types are for binary data.
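If changing the column type isn't an option, another route (a sketch, assuming the compressor yields an array of byte values 0-255) is to re-encode the compressed bytes as Base64, which is pure ASCII and survives any text column:

function bytesToBase64(bytes) {
  let bin = "";
  for (const b of bytes) bin += String.fromCharCode(b);
  return btoa(bin); // browser built-in; output uses only A-Z, a-z, 0-9, +, / and =
}

function base64ToBytes(s) {
  const bin = atob(s);
  return Array.from(bin, (c) => c.charCodeAt(0));
}

Base64 inflates the data by about a third, so it gives back some of the compression, but it removes the character-set problem entirely.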
(Similar questions to this have been asked on Stack Overflow, but not exactly this one. The nearest is probably "javascript how to convert unicode string to ascii", where there is already the remark "this has to be a dup[licate]". I have read some similar posts, but they don't answer my specific question. I've looked on the very good W3Schools site, and have also Googled it, but not found the answer that way either. So any hints here would be very much appreciated.)
I have an array of bytes being passed to a piece of JavaScript. In the JavaScript the data arrives in a string. I do not know the mechanism of transfer, as it's from a 3rd-party application. I do not know even whether the string is "wide" or "narrow".
In my JavaScript, I have some code like b = str.charCodeAt(pos);.
My problem is that a byte value such as 0x86 = 134 is coming through as character 0x2020 = 8224. This seems to be because my original byte is being interpreted as a Windows-1252 'dagger' character and then translated to the equivalent Unicode code point. (The problem may or may not be JavaScript's 'fault'.) Similar problems occur with other values: the ranges 0x00..0x7F and 0xA0..0xFF seem to be fine, but most values from 0x80..0x9F are affected, and in each case the result is the Unicode equivalent of the Windows-1252 character.
Another observation is that the length of the string is what I'd expect for a narrow string, if the length is measured in bytes. (On the other hand, if length returns a count of abstract characters, this doesn't tell me anything.)
So, in JavaScript, is there a way of getting at the 'raw' bytes in a string, or of getting a Latin-1 or ASCII character code directly, or of converting between character encodings, or of defining the default encoding?
I could write my own mapping, but I'd rather not. I expect that's what I'll end up doing, but it has the feel of a kludge on a kludge.
I'm also looking into whether there's anything I can adjust in the calling application (as it could be passing the data as a wide string, although I doubt it).
Either way, though, I'd be interested in whether there is a simple JavaScript solution, or to understand why there isn't.
(If the incoming data were character data, having Unicode handled so automatically would be great. But it's not; it's just a binary data stream.)
Thanks.
There is no such thing as the raw bytes in a String. The ECMAScript spec defines a string as a sequence of UTF-16 code units. That is the most fine-grained representation exposed by any interpreter I have ever encountered.
In the browser there are historically no encoding libraries (though modern browsers now expose TextEncoder and TextDecoder). You have to roll your own if you are trying to represent a byte array as a string and want to re-encode it.
If your string already happens to be valid ASCII, then you can get the numeric value of a code unit by using the charCodeAt method.
"\n".charCodeAt(0) === 10
Start with the JavaScript (ECMAScript) specs: http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-262.pdf. It says:
8.4 The String Type
The String type is the set of all finite ordered sequences of zero or more 16-bit unsigned integer values (“elements”). The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a code unit value (see Clause 6). Each element is regarded as occupying a position within the sequence. These positions are indexed with nonnegative integers. The first element (if any) is at position 0, the next element (if any) at position 1, and so on. The length of a String is the number of elements (i.e., 16-bit values) within it. The empty String has length zero and therefore contains no elements.
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.
NOTE The rationale behind this design was to keep the implementation of Strings as simple and high-performing as possible. The intent is that textual data coming into the execution environment from outside (e.g., user input, text read from a file or received over the network, etc.) be converted to Unicode Normalised Form C before the running program sees it. Usually this would occur at the same time incoming text is converted from its original character encoding to Unicode (and would impose no additional overhead). Since it is recommended that ECMAScript source code be in Normalised Form C, string literals are guaranteed to be normalised (if source text is guaranteed to be normalised), as long as they do not contain any Unicode escape sequences.
What charCodeAt(p) gives you is the UTF-16 value (a 16-bit number) of the character at index p in the string. Since UTF-16 directly represents Unicode's Basic Multilingual Plane (that would be code points U+0000–U+D7FF and U+E000–U+FFFF), your Latin-1 characters should have the values you expect them to have.
The fact that they do not suggests to me that you have an encoding problem with the inbound third-party octet stream: if the conversion to UTF-16 gets the encoding of the inbound octet stream wrong, you will get odd results.
Perhaps it is being treated as vanilla ASCII when in fact it is UTF-8 (or vice versa). UTF-8 represents code points above 0x7F as 2-, 3-, or 4-octet sequences.
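To see what such a mix-up looks like concretely, here is a small sketch using the standard TextDecoder API (the two octets chosen are the UTF-8 encoding of "é"):

const octets = new Uint8Array([0xC3, 0xA9]);    // UTF-8 for "é"
new TextDecoder("utf-8").decode(octets);        // "é" (1 character)
new TextDecoder("windows-1252").decode(octets); // "é" (2 characters)

Decoding with the wrong assumption silently produces a different, longer string, which is exactly the class of corruption described in the question.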