I want to send game result data as binary, partly for efficiency (sending 6 bytes per item instead of 13 more than halves the total amount of data to send, and as there can be a few hundred of these items the result is a huge saving), and partly for obfuscation (people monitoring network activity would see seemingly random bytes instead of distinguishable data).
My "code" (not in use yet, just a prototype) is as follows:
String.fromCharCode.apply(null,somevar.toString(16).split(/(?=(?:..)+$)/).map(function(a) {return parseInt(a,16);}))
This will convert any integer value into a binary string value.
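For example (a quick check of what the prototype produces, assuming somevar holds a plain integer):

var somevar = 123456;
// toString(16) gives "1e240"; the regex splits it into "1", "e2", "40",
// and each piece is parsed back into a byte value
String.fromCharCode.apply(null, somevar.toString(16).split(/(?=(?:..)+$)/).map(function (a) {
  return parseInt(a, 16);
}));
// -> a 3-character string with char codes 1, 226 and 64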
However, I seem to recall that AJAX and binary data don't mix. I'd like to know what range of values is safe to use. Should I stick to the range 32-255, or go even safer and stick to 32-127? In the case of 32-255, I can use 15 as the base in the above code and add 32 to all the numbers, so that'd work for me.
But really I'm more interested in the character range question, and if there is any cross-browser (among browsers that support Canvas) way to transfer binary data?
AJAX and binary data do not conflict with each other. What happens is, when you make an AJAX call, the data is posted as form data, and form data is usually encoded as application/x-www-form-urlencoded. The encoded data contains only letters, numbers and certain special characters; everything else is percent-encoded (a space becomes %20, for example). For this reason, it may not save you any space at all to convert your "normal" letters to binary, because eventually everything has to be encoded again.
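As a rough illustration (a minimal sketch, assuming the data goes out as a URL-encoded form field), here is how a handful of "binary" characters grow once they are percent-encoded:

// A short "binary" string built from arbitrary byte values
var binary = String.fromCharCode(7, 200, 31, 65, 150);

// Percent-encoding turns everything outside the unreserved set into %XX escapes
// (and bytes above 127 become two escapes via UTF-8), so the payload grows
var encoded = encodeURIComponent(binary);
console.log(binary.length, encoded.length); // 5 vs. 19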
Related
Very simple question: how much data (in bytes) do strings take up? Do they take up 1 byte per character?
I tried searching it up, but w3schools doesn't say...
I want to know this to reduce bandwidth in my web app.
Also, for anyone that knows, does socket.io automatically json stringify when using socket.emit();?
A string is a character array, so it will take up roughly sizeof(char) * noOfCharacters, ignoring the other fields of the String class for now. A character can be 1 byte or 2 bytes depending on the system and the type of characters being represented (Unicode etc.).
However, from your question you are more interested in the data being transported over the network. Note that data is always exchanged as bytes (byte[]), so the string will be converted into a byte[] representation first and then sent over.
To limit bandwidth usage, you can enable compression or choose an interoperable serialisation technique (Protocol Buffers, Smile, Fast Infoset, etc.).
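To see the actual number of bytes a given string will occupy on the wire (a minimal sketch, assuming Node.js and a UTF-8 wire encoding, which is the common case):

// Character count vs. encoded byte count
var text = 'héllo';                           // 5 characters
console.log(text.length);                     // 5
console.log(Buffer.byteLength(text, 'utf8')); // 6 - 'é' takes two bytes in UTF-8

// In the browser, TextEncoder gives the same answer
console.log(new TextEncoder().encode(text).length); // 6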
In my Angular app I am making an $http.get() request to a URL that responds with a JSON object. This object contains a value that is occasionally a very large number (e.g. 9106524608436223400). Looking at the network profiler in Chrome I can see that the number is coming down properly, but by the time the $http.get() callback fires the number has been corrupted somewhat. I assume this is because the number is very large and not a string. Is there any way to get Angular to handle this response correctly, or do I need to wrap my server's output as a string? Thanks.
Numbers in JavaScript are double precision floating point numbers. This means that they can only handle integers with full precision up to 53 bits (up to 2^53, roughly 9×10^15).
Any code that parses the JSON and represents the number as a regular JavaScript number will be unable to give you the value unchanged.
The JSON standard doesn't specify any limitation for the range or precision for numbers. However, as JSON is based on a subset of the JavaScript syntax, one could argue that the format doesn't support numbers outside of what could be represented in JavaScript.
To safely get the value unchanged, you would need to put it as a string in the JSON, or split it up into two or more smaller numbers.
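You can see the effect directly (a minimal sketch; the exact corrupted digits depend on the value):

// The 19-digit value from the question does not fit exactly in a double
console.log(9106524608436223400);                            // 9106524608436223000
console.log(JSON.parse('{"id": 9106524608436223400}').id);   // same corrupted value

// Sending the value as a string preserves it exactly
console.log(JSON.parse('{"id": "9106524608436223400"}').id); // "9106524608436223400"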
I have strings (about 1-5 KB) of the form:
FF,A3V,X7Y,aA4,....
LZW compresses these really nicely, but the output includes Turkish characters. The compressed strings are then submitted to a MySQL database.
Sometimes MySQL can 'play up' and not store these properly, putting question marks '?' in place of the Turkish characters. It can do this even when you have your text columns properly defined. Exporting and re-importing the table can sort this out. This is fine for my test database, but not something I am happy with when this goes live.
Consequently I am looking for an alternative to lzw, which will compress but only using normal letters/numbers etc.
Does anyone know of a PUBLIC DOMAIN compression method that avoids Turkish characters (and any other non-standard characters)? Can anyone point me to some code in JavaScript (or C++ or C#, which I can convert)?
To expand a bit on what's been said in the comments... Storing strings of bytes, such as the output a compression algorithm typically produces, in a VARCHAR, CHAR or TEXT column is not valid usage.
These column types are not for byte strings, they are for strings of valid characters only. Not every string of bytes is a valid string of characters in a given character set, and MySQL isn't going to allow invalid characters (note that for some character sets the correlation between "character" and "byte" isn't 1:1).
In the good ol' days™, the two were interchangeable but this is not the case any more (and hasn't been, to one degree or another, for a while).
If your column type, instead, were BINARY or VARBINARY or BLOB, the issue should disappear, because those data types are for binary data.
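For illustration (a minimal sketch, assuming the Node.js mysql package and a table with a VARBINARY or BLOB column called data; the table and column names are made up):

var mysql = require('mysql');
var connection = mysql.createConnection({ host: 'localhost', user: 'me', password: 'secret', database: 'test' });

// Compressed output is raw bytes; a Buffer goes into a binary column untouched,
// with no character-set conversion applied on the way in or out
var compressed = Buffer.from([0x1f, 0x8b, 0xe2, 0x80, 0x99]);
connection.query('INSERT INTO results (data) VALUES (?)', [compressed], function (err) {
  if (err) throw err;
  connection.end();
});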
I am developing a PhoneGap application in HTML5/JavaScript. I have a string of around 1000 characters comprising GUIDs in the format below:
1=0a0a8907-40b9-4e81-8c4d-d01af26efb78;2=cd4713339;3=Cjdnd;4=19120581-21e5-42b9-b85f-3b8c5b1206d9;5=hdhsfsdfsd;6=30a21580-48f3-40e8-87a3-fa6e39e6412f; ...............
I have to write this particular string into a QR code. Is there any working technique to compress this string and store it in a QR code? The QR code generated from this string is too complex and is not easily read by the QR scanners of mobile phones. Please suggest an approach to reduce the size of the string to around 200-250 characters so that it can be easily read.
Any help is appreciated.
In your question you have the following sample data:
1=0a0a8907-40b9-4e81-8c4d-d01af26efb78;2=cd4713339;3=Cjdnd;
4=19120581-21e5-42b9-b85f-3b8c5b1206d9;5=hdhsfsdfsd;6=30a21
580-48f3-40e8-87a3-fa6e39e6412f; ..............
Where 1, 4 & 6 look like version 4 UUIDs as described here. I suspect that 2, 3 and 5 might also actually be UUIDs?!
The binary representation of a UUID is 128 bits long, and it should be fairly simple to convert to this representation by just reading the hex digits of the UUID and converting them to binary. This gives 16 bytes per UUID.
However, as the UUIDs are version 4, they are based on random data, which in effect counters further compression (apart from the few bits representing the UUID version). So apart from getting rid of the counters (1=, 2=) and the separator ;, no further compression seems to be possible.
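A minimal sketch of the hex-to-bytes step (plain JavaScript; the resulting bytes could then go into the QR code using its byte mode):

// Convert one UUID string to its 16-byte binary representation
function uuidToBytes(uuid) {
  var hex = uuid.replace(/-/g, '');            // strip the dashes, 32 hex digits remain
  var bytes = new Uint8Array(16);
  for (var i = 0; i < 16; i++) {
    bytes[i] = parseInt(hex.substr(i * 2, 2), 16);
  }
  return bytes;
}

console.log(uuidToBytes('0a0a8907-40b9-4e81-8c4d-d01af26efb78'));
// Uint8Array [ 10, 10, 137, 7, 64, 185, ... ] - 16 bytes instead of 36 characters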
QR codes encode data using different character sets depending on the range of characters being used. In other words, if you use just ASCII digits it will use an encoding that doesn't need 8 bits per digit. See the Wikipedia page on QR codes.
Because of the characters in your example (e.g. lower case), you'll be using the binary encoding, which is way overkill for your actual information content.
Presuming you have control over the decoder, you could use any compression library to take your ASCII data and compress it before encoding, encode/decode the binary result, and then decompress it in the decoder. There are a world of techniques for trying to get the most out of the compression. You can also start with a non-ASCII encoding and eliminate redundant information like the #= parts.
Couldn't say, though, how much this will buy you.
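As an illustration only (assuming the pako library for deflate/inflate on both sides; any compression library with a JavaScript decoder would do):

// Compress the string before generating the QR code...
var input = '1=0a0a8907-40b9-4e81-8c4d-d01af26efb78;2=cd4713339;3=Cjdnd;';
var compressed = pako.deflate(new TextEncoder().encode(input)); // Uint8Array of raw bytes

// ...put `compressed` into the QR code using byte mode, and on the scanning side:
var restored = pako.inflate(compressed, { to: 'string' });
console.log(restored === input); // true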
If you have access to a database already, can you create a table to support this? If so, archive the value and use an ID for QR.
1) Simple schema: ID = BIGINT with Identity(1000,1), set as the primary key; Value = NVARCHAR(MAX). Yes, this is a bit overkill, so modify to taste.
2) Create a function to add your string value to the table and get the ID back as a string for the QR code.
3) Create another function to return the string value when passed a valid ID number.
This stays below the 200-character limit for a very long time.
You don't need the whole GUID; a full GUID can single out one record out of 2^128 (enough to address every bit of digital information on earth many times over).
How many records do you need to distinguish? Probably a lot less than 4 billion, right? That's 2^32, so just take the first quarter of each GUID and there's your 1000 characters down to 250.
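For instance (a minimal sketch of the truncation idea; whether 32 bits leaves enough collision margin depends on your data):

// Keep only the first 8 hex digits (32 bits) of each GUID value
var item = '1=0a0a8907-40b9-4e81-8c4d-d01af26efb78';
var shortened = item.replace(/=([0-9a-f]{8})[0-9a-f-]{28}/gi, '=$1');
console.log(shortened); // '1=0a0a8907'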
I am trying to piece together the mysterious string of characters â?? that I am seeing quite a bit of in our database. I am fairly sure this is the result of a conversion between character encodings, but I am not completely positive.
The users are able to enter text (or cut and paste) into an Ext JS rich text editor. The data is posted to a servlet which persists it to the database, and when I view it in the database I see those strange characters...
Is there any way to decode these back to their original meaning if I were able to discover the correct encoding, or is there a loss of bits or bytes that has occurred through the conversion process?
Users are cutting and pasting from multiple versions of MS Word and PDF. Does the encoding follow where the user copied from?
Thank you
The website is UTF-8.
We are using MS SQL Server 2005.
SELECT serverproperty('Collation') -- Server default collation.
Latin1_General_CI_AS
SELECT databasepropertyex('xxxx', 'Collation') -- Database default
SQL_Latin1_General_CP1_CI_AS
and the column:
Column_name Type Computed Length Prec Scale Nullable TrimTrailingBlanks FixedLenNullInSource Collation
text varchar no -1 yes no yes SQL_Latin1_General_CP1_CI_AS
The non-Unicode equivalents of the nchar, nvarchar, and ntext data types in SQL Server 2000 are listed below. When Unicode data is inserted into one of these non-Unicode data type columns through a command string (otherwise known as a "language event"), SQL Server converts the data to the data type using the code page associated with the collation of the column. When a character cannot be represented on a code page, it is replaced by a question mark (?), indicating the data has been lost. Appearance of unexpected characters or question marks in your data indicates your data has been converted from Unicode to non-Unicode at some layer, and this conversion resulted in lost characters.
So this may be the root cause of the problem... and not an easy one to solve on our end.
â is encoded as 0xE2 in ISO-8859-1 and windows-1252. 0xE2 is also a lead byte for a three-byte sequence in UTF-8. (Specifically, for the range U+2000 to U+2FFF, which includes the windows-1252 characters –—‘’‚“”„†‡•…‰‹›€™).
So it looks like you have text encoded in UTF-8 that's getting misinterpreted as being in windows-1252, and displays as a â followed by two unprintable characters.
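For what it's worth, you can reproduce the pattern (a minimal sketch, assuming Node.js and using the Word-style right single quote as an example):

// The "smart quote" U+2019 becomes three bytes in UTF-8
var bytes = Buffer.from('\u2019', 'utf8');
console.log(bytes);                    // <Buffer e2 80 99>

// Read back as Latin-1/windows-1252, the first byte is 'â' and the other two have
// no printable equivalent - a non-Unicode column stores them as '?', which is
// exactly the â?? pattern seen in the database
console.log(bytes.toString('latin1')); // 'â' followed by two unprintable characters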
This is something of an educated guess, but you're probably just experiencing a naive conversion of Word/PDF documents to HTML (windows-1252 to UTF-8, most likely). If that's the case, probably 2/3 of the mysterious characters from Word documents are "smart quotes" and most of the rest are the result of its other "smart" editing features: ellipses, em dashes, etc. PDFs probably have similar features.
I would also guess that if the formatting after pasting into the ExtJS editor looks OK, then the encoding is getting passed along. Depending on the resulting use of the text, you may not need to convert.
If I'm still on base, and we're not talking about internationalization issues, then I can add that there are Word-to-HTML converters out there, but I don't know the details of how they operate, and I had mixed success when evaluating them. There is almost certainly some small information loss or error involved with such converters, since they need to make guesses about the original source of the "smart" characters. In my isolated case it was easier to just go back to the users and have them turn off the "smart" features.
The issue is clear: if the browser is good enough, a form in a web page can accept any Unicode character you can type or paste. If the character belongs to the page's charset, it will be sent as is; if it doesn't, it'll get converted to an HTML entity. SQL Server will then perform the appropriate conversion and silently corrupt your data whenever a character does not have an equivalent in the column's code page.
There's not much you can do to fully fix it, but you can make a workaround: let your servlet perform the conversion. That way you have full control over it. You can, for instance, compile a list of the most common non-Latin1 characters users paste (smart quotes, Unicode spaces...), which should be fairly easy to identify from context, and replace them with something better than ?. Or you can use a library that does this for you.
Or you can switch your DB to Unicode :)
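A minimal sketch of the replacement-table approach mentioned above (illustrative only; the character list and the ASCII stand-ins are assumptions, and a real servlet would apply this server-side):

// Map the most common "smart" characters to plain ASCII stand-ins
var replacements = {
  '\u2018': "'", '\u2019': "'",   // left/right single quotes
  '\u201C': '"', '\u201D': '"',   // left/right double quotes
  '\u2013': '-', '\u2014': '--',  // en dash, em dash
  '\u2026': '...',                // ellipsis
  '\u00A0': ' '                   // non-breaking space
};

function toLatin1Safe(text) {
  return text.replace(/[\u2018\u2019\u201C\u201D\u2013\u2014\u2026\u00A0]/g, function (ch) {
    return replacements[ch];
  });
}

console.log(toLatin1Safe('\u201Csmart\u201D quotes\u2026')); // '"smart" quotes...'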
You're storing Unicode data that uses 2 bytes per character in varchar columns that use 1 byte per character. Any text that needs 2 bytes per character will lose data when stored in the db.
All you need to do is change the varchar column to nvarchar.
And then change the SQL parameters you're using in code, of course.