What is the significance of the number 93 in Unicode? - javascript

Since there is currently no universal way to read live data from an audio track in JavaScript I'm using a small library/API to read volume data from a text file that I converted from an MP3 offline.
The string looks like this
!!!!!!!!!!!!!!!!!!!!!!!!!!###"~{~||ysvgfiw`gXg}i}|mbnTaac[Wb~v|xqsfSeYiV`R
][\Z^RdZ\XX`Ihb\O`3Z1W*I'D'H&J&J'O&M&O%O&I&M&S&R&R%U&W&T&V&m%\%n%[%Y%I&O'P'G
'L(V'X&I'F(O&a&h'[&W'P&C'](I&R&Y'\)\'Y'G(O'X'b'f&N&S&U'N&P&J'N)O'R)K'T(f|`|d
//etc...
and the idea is basically that at a given point in the song the Unicode number of the character at the corresponding point in the text file yields a nominal value to represent volume.
The library translates the data (in this case, a stereo track) with the following (simplified here):
getVolume = function(sampleIndex, o) {
    // two characters per sample: left channel first, then right
    o.left = Math.min(1, (this.data.charCodeAt(sampleIndex * 2 | 0) - 33) / 93);
    o.right = Math.min(1, (this.data.charCodeAt(sampleIndex * 2 + 1 | 0) - 33) / 93);
}
I'd like some insight into how the file was encoded in the first place, and how I'm making use of it here.
What is the significance of 93 and 33?
What is the purpose of the bitwise |?
Is this a common means of porting information (i.e., does it have a name), or is there a better way to do it?

It looks like the range of the characters in that file is from ! to ~. ! has an ASCII code of 33 and ~ has an ASCII code of 126. 126 - 33 = 93.

33 and 93 are used for normalizing values between ! and ~.
var data = '!';
Math.min(1,(data.charCodeAt(0*2)-33)/93); // will yield 0
var data = '~';
Math.min(1,(data.charCodeAt(0*2)-33)/93); // will yield 1
var data = '"';
Math.min(1,(data.charCodeAt(0*2)-33)/93); // will yield 0.010752688172043012
var data = '#';
Math.min(1,(data.charCodeAt(0*2)-33)/93); // will yield 0.021505376344086023
// ... and so on
The |0 is there because sampleIndex*2 or sampleIndex*2+1 will yield a non-integer value when passed a non-integer sampleIndex. Since * and + bind more tightly than |, the |0 truncates the decimal part of the whole index expression, just in case someone sends in an incorrectly formatted (i.e. non-integer) sampleIndex.
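For the other half of the question (how the file was encoded in the first place), here is a minimal sketch of the reverse mapping, assuming the offline tool had per-channel volume samples already normalized to the 0–1 range. The helper names are hypothetical, not from the library:
// Hypothetical encoder sketch: map a 0..1 volume to one printable character
// in the range '!' (33) .. '~' (126), i.e. 33 + round(v * 93).
function encodeSample(v) {
    return String.fromCharCode(33 + Math.round(Math.min(1, Math.max(0, v)) * 93));
}
// Interleave left/right samples into the text format the library reads.
function encodeStereo(leftSamples, rightSamples) {
    var out = "";
    for (var i = 0; i < leftSamples.length; i++) {
        out += encodeSample(leftSamples[i]) + encodeSample(rightSamples[i]);
    }
    return out;
}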

Doing a bitwise OR with zero will truncate the number on the LHS to an integer. Not sure about the rest of your question though, sorry.
93 and 33 are ASCII codes (not Unicode) for the characters "]" and "!" respectively. Hope that helps a bit.
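A quick illustration of that truncation (sketch):
console.log(5.9 | 0);     // 5 -- fractional part dropped
console.log(3.2 * 2 | 0); // 6 -- * binds tighter than |, so the product is truncated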

This will help you forever:
http://www.asciitable.com/
ASCII codes for everything.
Enjoy!

Related

Javascript: Convert Unicode Character to hex string [duplicate]

I'm using a barcode scanner to read a barcode on my website (the website is made in OpenUI5).
The scanner works like a keyboard that types the characters it reads. At the end and the beginning of the typing it uses a special character. These characters are different for every type of scanner.
Some possible characters are:
█
▄
–
—
In my code I use if (oModelScanner.oData.scanning && oEvent.key == "\u2584") to check if the input from the scanner is ▄.
Is there any way to get the code from that character in the \uHHHH style? (with the HHHH being the hexadecimal code for the character)
I tried the charCodeAt but this returns the decimal code.
The examples with codePointAt turn the code I need into a decimal number, so I need the reverse of that.
JavaScript strings have a codePointAt method which gives you the integer representing the Unicode code point value. You need the base-16 (hexadecimal) representation of that number if you wish to format the integer as a four-hexadecimal-digit sequence (as in the response of Nikolay Spasov).
var hex = "▄".codePointAt(0).toString(16);
var result = "\\u" + "0000".substring(0, 4 - hex.length) + hex;
However, it would probably be easier for you to check directly whether your key's code point integer matches the expected code point:
oEvent.key.codePointAt(0) === '▄'.codePointAt(0);
Note that "symbol equality" can actually be trickier: some symbols are defined by surrogate pairs (you can see it as the combination of two halves defined as four hexadecimal digits sequence).
For this reason I would recommend to use a specialized library.
you'll find more details in the very relevant article by Mathias Bynens
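To illustrate the surrogate-pair caveat, a small sketch (nothing specific to OpenUI5):
"▄".length                        // 1 -- fits in one UTF-16 code unit
"🤦".length                       // 2 -- stored as a surrogate pair
"🤦".charCodeAt(0).toString(16)   // "d83e" (high surrogate)
"🤦".charCodeAt(1).toString(16)   // "dd26" (low surrogate)
"🤦".codePointAt(0).toString(16)  // "1f926" (the actual code point)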
var hex = "▄".charCodeAt(0).toString(16);
var result = "\\u" + "0000".substring(0, 4 - hex.length) + hex;
If you want to print the multiple code points of a character, e.g., an emoji, you can do this:
const facepalm = "🤦🏼‍♂️";
const codePoints = Array.from(facepalm)
.map((v) => v.codePointAt(0).toString(16))
.map((hex) => "\\u{" + hex + "}");
console.log(codePoints);
["\u{1f926}", "\u{1f3fc}", "\u{200d}", "\u{2642}", "\u{fe0f}"]
If you are wondering about the components and the length of 🤦🏼‍♂️, check out this article.

Is there an equivalent to C's *(unsigned int*)(char) = 123 in Javascript?

I'm dealing with some C source code I'm trying to convert over to Javascript, I've hit a snag at this line
char ddata[512];
*(unsigned int*)(ddata+0)=123;
*(unsigned int*)(ddata+4)=add-8;
memset(ddata+8,0,add-8);
I'm not sure exactly what is happening here, I understand they're casting the char to an unsigned int, but what is the ddata+0 and stuff doing here? Thanks.
You can't say.
That's because the behaviour on casting a char* to an unsigned* is undefined unless the pointer started off as an unsigned*, which, in your case, it didn't.
ddata + 0 is equivalent to ddata.
ddata + 4 is equivalent to &ddata[4], i.e. the address of the 5th element of the array.
For what it's worth, it looks like the C programmer is attempting to serialise a couple of unsigned literals into a byte array. But the code is a mess; aside from what I've already said they appear to be assuming that an unsigned occupies 4 bytes, which is not necessarily the case.
The code fragment is storing a record id (123) as 4 byte integer in the first 4 bytes of a char buffer ddata. It then stores a length (add-8) in the following 4 bytes and finally initializes the following add-8 bytes to 0.
Translating this to JavaScript can be done in different ways, but probably not by constructing a string with the same contents. The reason is that strings are not byte buffers in JavaScript; they contain Unicode code points, so writing the string to storage might perform some unwanted conversions.
The best solution depends on your actual target platform, where byte arrays may be available to more closely match the intended semantics of your C code.
Note that the above code is not portable and has undefined behavior for various reasons, notably because ddata might not be properly aligned to be used as the address to store an unsigned int via the cast *(unsigned int*)ddata = 123;, because it assumes int to be 4 bytes and because it relies on unspecified byte ordering.
On a Red Hat Linux box it probably works as expected, and the same C code would probably perform correctly on macOS, which uses the same Intel architecture with little-endian ordering. How best to translate this to JavaScript requires more context and specifications.
In the meantime, the code would best be rewritten this way:
unsigned char ddata[512];
if (add <= 512) {
    ddata[0] = 123;
    ddata[1] = 0;
    ddata[2] = 0;
    ddata[3] = 0;
    ddata[4] = ((add - 8) >> 0) & 255;
    ddata[5] = ((add - 8) >> 8) & 255;
    ddata[6] = ((add - 8) >> 16) & 255;
    ddata[7] = ((add - 8) >> 24) & 255;
    memset(ddata + 8, 0, add - 8);
}
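Back on the JavaScript side, a hedged sketch of the same serialization using a Uint8Array and a DataView, assuming add has already been validated and little-endian layout as in the rewritten C above (the function name is made up):
function buildRecord(add) {
    var ddata = new Uint8Array(512);   // zero-filled, so no explicit memset is needed
    var view = new DataView(ddata.buffer);
    view.setUint32(0, 123, true);      // record id, 4 bytes, little-endian
    view.setUint32(4, add - 8, true);  // length field, 4 bytes, little-endian
    // bytes 8 .. add-1 are already zero
    return ddata;
}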

Buffer to integer. Having trouble understanding this line of code

I'm looking for help in understanding this line of code in the npm module hash-index.
The purpose of this module is to be a function which returns the SHA-1 hash of an input, mod the second argument you pass.
The specific function in this module that I don't understand is this one that takes a Buffer as input and returns an integer:
var toNumber = function (buf) {
    return buf.readUInt16BE(0) * 0xffffffff + buf.readUInt32BE(2)
}
I can't seem to figure out why those specific offsets of the buffer are chosen and what the purpose of multiplying by 0xffffffff is.
This module is really interesting to me and any help in understanding how it's converting buffers to integers would be greatly appreciated!
It builds a number from the first six bytes of the buffer.
First, it reads the first two bytes (UINT16) of the buffer, using big endian, then it multiplies that by 0xFFFFFFFF.
Then, it reads the next four bytes (UINT32) in the buffer, and adds that to the multiplied number - resulting in a number constructed from the first 6 bytes of the buffer.
Example: Consider [Buffer BB AA CC CC DD EE ... ]
0xbbaa * 0xffffffff = 0xbba9ffff4456
0xbba9ffff4456 + 0xccccddee = 0xbbaacccc2244
And regarding the offsets, it chose that way:
First time, it reads from byte 0 to byte 1 (converts to type UINT16)
second time, it reads from byte 2 to byte 5 (converts to type - UINT32)
So to sum it up, it constructs a number from the first 6 bytes of the buffer using big endian notation, and returns it to the calling function.
Hope that answers your question.
Wikipedia's Big Endian entry
EDIT
As someone pointed out in the comments, I was totally wrong about 0xFFFFFFFF being a left shift of 32; it's just a number multiplication. I'm assuming it's some kind of inner convention to calculate a correct, legal value that complies with what the module expects.
EDIT 2
After looking on the function in the original context, I've come to this conclusion:
This function is a part of a hashing flow, and it works in that manner:
The main flow receives a string input and a maximum number for the hash output; it then takes the string input and plugs it into the SHA-1 hashing function.
SHA-1 hashing returns a Buffer; the module takes that Buffer and applies the indexing to it, as can be seen in the following code excerpt:
return toNumber(crypto.createHash('sha1').update(input).digest()) % max
It also uses a modulo to make sure the hash index returned doesn't exceed the maximum allowed value.
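A quick Node.js check of the arithmetic above, using a made-up 6-byte buffer:
var buf = Buffer.from([0xbb, 0xaa, 0xcc, 0xcc, 0xdd, 0xee]);
var n = buf.readUInt16BE(0) * 0xffffffff + buf.readUInt32BE(2);
console.log(n.toString(16)); // 'bbaacccc2244'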
Multiplication by 2 is equivalent to shifting the bits left by 1, so multiplying by 2^16 is equivalent to shifting the bits left 16 places.
Here is a similar question already answered:
Bitwise Logic in C
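For example, for small non-negative integers (within the 32-bit signed range) the two are interchangeable in JavaScript:
console.log(5 * 2, 5 << 1);      // 10 10
console.log(3 * 65536, 3 << 16); // 196608 196608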

Reassembling negative Python marshal int's into Javascript numbers

I'm writing a client-side Python bytecode interpreter in Javascript (specifically Typescript) for a class project. Parsing the bytecode was going fine until I tried out a negative number.
In Python, marshal.dumps(2) gives 'i\x02\x00\x00\x00' and marshal.dumps(-2) gives 'i\xfe\xff\xff\xff'. This makes sense as Python represents integers using two's complement with at least 32 bits of precision.
In my Typescript code, I use the equivalent of Node.js's Buffer class (via a library called BrowserFS, instead of ArrayBuffers and etc.) to read the data. When I see the character 'i' (i.e. buffer.readUInt8(offset) == 105, signalling that the next thing is an int), I then call readInt32LE on the next offset to read a little-endian signed long (4 bytes). This works fine for positive numbers but not for negative numbers: for 1 I get '1', but for '-1' I get something like '-272777233'.
I guess that Javascript represents numbers in 64-bit (floating point?). So, it seems like the following should work:
var longval = buffer.readInt32LE(offset); // reads a 4-byte long, gives -272777233
var low32Bits = longval & 0xffff0000; //take the little endian 'most significant' 32 bits
var newval = ~low32Bits + 1; //invert the bits and add 1 to negate the original value
//but now newval = 272826368 instead of -2
I've tried a lot of different things and I've been stuck on this for days. I can't figure out how to recover the original value of the Python integer from the binary marshal string using Javascript/Typescript. Also I think I deeply misunderstand how bits work. Any thoughts would be appreciated here.
Some more specific questions might be:
Why would buffer.readInt32LE work for positive ints but not negative?
Am I using the correct method to get the 'most significant' or 'lowest' 32 bits (i.e. does & 0xffff0000 work how I think it does?)
Separate but related: in an actual 'long' number (i.e. longer than '-2'), I think there is a sign bit and a magnitude, and I think this information is stored in the 'highest' 2 bits of the number (i.e. at number & 0x000000ff?) -- is this the correct way of thinking about this?
The sequence ef bf bd is the UTF-8 encoding of the "Unicode replacement character" (U+FFFD), which decoders emit when they encounter bytes that are not valid UTF-8.
It sounds like whatever method you're using to download the data is getting accidentally run through a UTF-8 decoder and corrupting the raw datastream. Be sure you're using blob instead of text, or whatever the equivalent is for the way you're downloading the bytecode.
This got messed up only for negative values because positive values are within the normal mapping space of UTF-8 and thus get translated 1:1 from the original byte stream.
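Once the raw bytes arrive uncorrupted, a minimal sketch of decoding the 'i' record with a DataView (illustrative only; it uses the bytes from the question rather than BrowserFS):
// marshal.dumps(-2) == b'i\xfe\xff\xff\xff'
var bytes = new Uint8Array([0x69, 0xfe, 0xff, 0xff, 0xff]);
if (bytes[0] === 0x69) {                 // 'i' marker: a 32-bit int follows
    var view = new DataView(bytes.buffer);
    console.log(view.getInt32(1, true)); // -2 (little-endian, signed)
}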

Any way to reliably compress a short string?

I have a string exactly 53 characters long that contains a limited set of possible characters.
[A-Za-z0-9\.\-~_+]{53}
I need to reduce this to length 50 without loss of information and using the same set of characters.
I think it should be possible to compress most strings down to 50 length, but is it possible for all possible length 53 strings? We know that in the worst case 14 characters from the possible set will be unused. Can we use this information at all?
Thanks for reading.
If, as you stated, your output strings have to use the same set of characters as the input string, and if you don't know anything special about the requirements of the input string, then no, it's not possible to compress every possible 53-character string down to 50 characters. This is a simple application of the pigeonhole principle.
Your input strings can be represented as a 53-digit number in base 67, i.e., an integer from 0 to 67^53 - 1 ≈ 6×10^96.
You want to map those numbers to an integer from 0 to 67^50 - 1 ≈ 2×10^91.
So by the pigeonhole principle, you're guaranteed that 67^3 = 300,763 different inputs will map to each possible output -- which means that, when you go to decompress, you have no way to know which of those 300,763 originals you're supposed to map back to.
To make this work, you have to change your requirements. You could use a larger set of characters to encode the output (you could get it down to 50 characters if each one had 87 possible values, instead of the 67 in the input). Or you could identify redundancy in the input -- perhaps the first character can only be a '3' or a '5', the nineteenth and twentieth are a state abbreviation that can only have 62 different possible values, that sort of thing.
If you can't do either of those things, you'll have to use a compression algorithm, like Huffman coding, and accept the fact that some strings will be compressible (and get shorter) and others will not (and will get longer).
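The same conclusion in terms of information content, as a quick back-of-the-envelope check (assuming all 67 characters are equally likely):
var bitsPerChar = Math.log2(67); // ≈ 6.07 bits of information per character
console.log(53 * bitsPerChar);   // ≈ 321.5 bits needed for the input
console.log(50 * bitsPerChar);   // ≈ 303.3 bits available in the output
// The input can carry more information than 50 characters can hold,
// so some inputs must collide after "compression".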
What you ask is not possible in the most general case, which can be proven very simply.
Say it was possible to encode an arbitrary 53 character string to 50 chars in the same set. Do that, then add three random characters to the encoded string. Then you have another arbitrary, 53 character string. How do you compress that?
So what you want can not be guaranteed to work for any possible data. However, it is possible that all your real data has low enough entropy that you can devise a scheme that will work.
In that case, you will probably want to do some variant of Huffman coding, which basically allocates variable-bit-length encodings for the characters in your set, using the shortest encodings for the most commonly used characters. You can analyze all your data to come up with a set of encodings. After Huffman coding, your string will be a (hopefully shorter) bitstream, which you encode to your character set at 6 bits per character. It may be short enough for all your real data.
A library-based encoding like Smaz (referenced in another answer) may work as well. Again, it is impossible to guarantee that it will work for all possible data.
One byte (character) can encode 256 values (0-255) but your set of valid characters uses only 67 values, which can be represented in 7 bits (alas, 6 bits gets you only 64) and none of your characters uses the high bit of the byte.
Given that, you can throw away the high bit and store only 7 bits, running the initial bits of the next character into the "spare" space of the first character. This would require only 47 bytes of space to store. (53 × 7 = 371 bits; 371 / 8 = 46.375, which rounds up to 47.)
This is not really considered compression, but rather a change in encoding.
For example "ABC" is 0x41 0x42 0x43
0x41 0x42 0x43 // hex values
0100 0001 0100 0010 0100 0011 // binary
100 0001 100 0010 100 0011 // drop high bit
// run it all together
100000110000101000011
// split as 8 bits (and pad to 8)
10000011 00001010 00011[000]
0x83 0x0A 0x18
As an example these 3 characters won't save any space, but your 53 characters will always come out as 47, guaranteed.
Note, however, that the output will not be in your original character set, if that is important to you.
The process becomes:
original-text --> encode --> store output-text (in database?)
retrieve --> decode --> original-text restored
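A minimal sketch of the encode step described above (hypothetical helper; decode is the mirror image):
// Pack the low 7 bits of each character into bytes (53 chars -> 47 bytes).
function pack7(str) {
    var bits = "";
    for (var i = 0; i < str.length; i++) {
        bits += str.charCodeAt(i).toString(2).padStart(7, "0");
    }
    while (bits.length % 8 !== 0) bits += "0"; // pad to a whole byte
    var out = new Uint8Array(bits.length / 8);
    for (var j = 0; j < out.length; j++) {
        out[j] = parseInt(bits.slice(j * 8, j * 8 + 8), 2);
    }
    return out;
}

console.log(pack7("ABC")); // Uint8Array [ 131, 10, 24 ], i.e. 0x83 0x0A 0x18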
If I remember correctly, Huffman coding is going to be the most compact way to store the data. It has been too long since I used it to write out the algorithm quickly, but the general idea (see the sketch after this list) is:
get the count for each character that is used
prioritize them based on how frequently they occurred
build a tree based off the prioritization
get the compressed bit representation of each character by traversing the tree (start at the root, left = 0 right = 1)
replace each character with the bits from the tree
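A compact sketch of those steps (illustrative only; a real implementation would also need to serialize the tree and the resulting bitstream):
function huffmanCodes(text) {
    // 1-2. count character frequencies (the "priority" is just the count)
    var freq = {};
    for (var i = 0; i < text.length; i++) {
        freq[text[i]] = (freq[text[i]] || 0) + 1;
    }
    // 3. build the tree: repeatedly merge the two least frequent nodes
    var nodes = Object.keys(freq).map(function (ch) { return { ch: ch, f: freq[ch] }; });
    while (nodes.length > 1) {
        nodes.sort(function (a, b) { return a.f - b.f; });
        var left = nodes.shift(), right = nodes.shift();
        nodes.push({ f: left.f + right.f, left: left, right: right });
    }
    // 4-5. walk the tree to assign bit strings: left = 0, right = 1
    var codes = {};
    (function walk(node, prefix) {
        if (node.ch !== undefined) { codes[node.ch] = prefix || "0"; return; }
        walk(node.left, prefix + "0");
        walk(node.right, prefix + "1");
    })(nodes[0], "");
    return codes;
}

console.log(huffmanCodes("aaabbc")); // e.g. { a: '0', c: '10', b: '11' }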
Smaz is a simple compression library suitable for compressing very short strings.

Categories

Resources