JS base n string to base64 - javascript

I have a string (with numbers under 128) separated by a comma:
"127,25,34,52,46,2,34,4,6,1"
Because there are 10 digits plus the comma, the string only ever uses 11 distinct characters. How can I convert this string from "base 11" to "base 64"? I would like to compress this string using Base64. I tried window.btoa, but it produces a larger output, because the browser doesn't know that the string draws on only 11 distinct characters.
Thanks in advance.

Base64 encoding never produces shorter strings. It is not intended as a compression tool, but as a means of reducing the character set to 64 readable characters, taking into account that the input may use a larger character set (even if not all of those characters are used).
Given the format of your string, why not take those numbers, use them as ASCII character codes, and then apply Base64 encoding to that?
Demo:
let s = "127,25,34,52,46,2,34,4,6,1";
console.log(s);
let encoded = btoa(String.fromCharCode(...s.match(/\d+/g)));
console.log(encoded);
let decoded = Array.from(atob(encoded), c => c.charCodeAt()).join();
console.log(decoded);
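For this input the gain is concrete: the original string is 26 characters, while the ten extracted bytes Base64-encode to ⌈10/3⌉ × 4 = 16 characters, "==" padding included.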

Related

How can I convert from a Javascript Buffer to a String only containing 0 and 1

So I'm currently trying to implement the Huffman algorithm, and it works fine for encoding and decoding. However, I store the encoded data as follows.
The result of the encoding function is a list containing many strings made up of 0s and 1s, all of varying lengths.
If I saved them in a normal txt file, it would take up more space. If I stored them as-is in a binary file, a character like 'e' with the code 101 would be stored in a full 8 bits, looking like '00000101', which is wasteful and won't take up less storage than the original txt file. So I took all the strings in the list, put them into one string, and split it into equal parts of length 8 to store them more efficiently.
However, if I want to read the data now, instead of 0s and 1s I get UTF-8 characters, even some escape characters.
I'm reading the file with fs.readFileSync("./encoded.bin", "binary"); but JavaScript then treats it as a buffer, converts it to a string, and it gets all weird... Any solutions or ideas for converting it back to 0s and 1s?
I also tried switching the "binary" in fs.readFileSync("./encoded.bin", "binary"); to "utf-8", which helped with not crashing my terminal, but the output is still "#��C��Ʃ��Ԧ�y�Kf�g��<�e�t"
To clarify, my goal in the end is to read out the massive string of binary data, which would look like "00011001000101001010", and actually get it into a string...
You can convert a string of 1s and 0s to the numeric value of a byte using Number.parseInt(str, 2), and convert it back with nr.toString(2).
The entire process will look something like this:
const fs = require('fs');

const original = '0000010100000111';
// Split the string into 8-character substrings
const stringBytes = original.match(/.{8}/g);
// Convert each 8-character string to its numeric byte value
const numBytes = stringBytes.map((s) => Number.parseInt(s, 2));
// Pack the numbers into a typed array
const buffer = Uint8Array.from(numBytes);
// Write the raw bytes to file
fs.writeFileSync('./binary.bin', buffer);
// Read from file and reverse the process
const decoded = [...buffer].map((b) => b.toString(2).padStart(8, '0')).join('');
console.log('original', original, 'decoded', decoded, 'same', original === decoded);

const binary = fs.readFileSync('./binary.bin');
const bits = [...binary].map((b) => b.toString(2).padStart(8, '0')).join('');
console.log(bits); // e.g. 0000010100000111
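One caveat the snippet above glosses over: /.{8}/g silently drops any trailing bits that don't fill a whole byte. A minimal sketch of one way to handle that, assuming you can store the original bit length alongside the bytes (the helper names here are made up for illustration):
// Hypothetical helpers, not part of the answer above
function packBits(bits) {
  const padded = bits.padEnd(Math.ceil(bits.length / 8) * 8, '0'); // pad to a byte boundary
  const bytes = padded.match(/.{8}/g).map((s) => Number.parseInt(s, 2));
  return { buffer: Uint8Array.from(bytes), bitLength: bits.length };
}
function unpackBits({ buffer, bitLength }) {
  return [...buffer]
    .map((b) => b.toString(2).padStart(8, '0'))
    .join('')
    .slice(0, bitLength); // strip the padding again
}
const packed = packBits('00011001000101001010'); // 20 bits, not a multiple of 8
console.log(unpackBits(packed)); // "00011001000101001010"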

Javascript: Convert Unicode Character to hex string [duplicate]

I'm using a barcode scanner to read a barcode on my website (the website is made in OpenUI5).
The scanner works like a keyboard, typing the characters it reads. At the beginning and end of the typed input it sends a special character; these characters differ for every type of scanner.
Some possible characters are:
█
▄
–
—
In my code I use if (oModelScanner.oData.scanning && oEvent.key == "\u2584") to check if the input from the scanner is ▄.
Is there any way to get the code from that character in the \uHHHH style? (with the HHHH being the hexadecimal code for the character)
I tried charCodeAt, but this returns the decimal code.
The codePointAt examples likewise produce a decimal code, so I need the reverse of this.
JavaScript strings have a method codePointAt which gives you the integer representing the Unicode code point value. You then need the base-16 (hexadecimal) representation of that number if you wish to format it as a four-hexadecimal-digit sequence (as in the response of Nikolay Spasov).
var hex = "▄".codePointAt(0).toString(16);
var result = "\\u" + "0000".substring(0, 4 - hex.length) + hex;
However, it would probably be easier for you to check directly whether your key's code point integer matches the expected code point:
oEvent.key.codePointAt(0) === '▄'.codePointAt(0);
Note that "symbol equality" can actually be trickier: some symbols are defined by surrogate pairs (you can see it as the combination of two halves defined as four hexadecimal digits sequence).
For this reason I would recommend to use a specialized library.
you'll find more details in the very relevant article by Mathias Bynens
var hex = "▄".charCodeAt(0).toString(16);
var result = "\\u" + "0000".substring(0, 4 - hex.length) + hex;
If you want to print the multiple code points of a character, e.g., an emoji, you can do this:
const facepalm = "🤦🏼‍♂️";
const codePoints = Array.from(facepalm)
.map((v) => v.codePointAt(0).toString(16))
.map((hex) => "\\u{" + hex + "}");
console.log(codePoints);
["\u{1f926}", "\u{1f3fc}", "\u{200d}", "\u{2642}", "\u{fe0f}"]
If you are wondering about the components and the length of 🤦🏼‍♂️, check out this article.

What is this type of string called?

In Python, we can do something like print("some random string".encode().decode('utf-16')), which will output: 潳敭爠湡潤瑳楲杮.
I feel like that is UTF-16, but I'm not really sure, because I can't reproduce it in any other language. My goal is to create a function that will do exactly this, but in JavaScript. The problem is that I can't find out what this type of string is called...
Does someone know what this is called and/or how I could reproduce it in JS?
A string is a sequence of runes. Unicode is a standard for assigning numeric values to those runes. UTF-8 or UTF-16 are standards for encoding a sequence of runes, as represented by their unicode numeric values, as a sequence of bytes.
What you did there is use encode with the default encoding, which is UTF-8, to get a sequence of bytes, which you then tried to decode back to runes as if the bytes had come from a UTF-16 encoding. Basically (because your input string fits in a 1-byte encoding for UTF-8), you're taking pairs of characters from the input, jamming their bytes together, and hoping that the resulting value is a legal UTF-16 encoding of something (which in general you cannot count on being true). You'll also run into issues if the UTF-8 encoding is not an even number of bytes, of course.
If you really need to do this thing in JavaScript, you could do something like this:
const str = "some random string";
// One byte per input character (each is < 128, so each fits in a single byte).
var buf = new ArrayBuffer(str.length);
// Reinterpret the sequence of bytes as a sequence of byte pairs.
var bufView = new Uint16Array(buf);
for (var i = 0, strLen = str.length; i < strLen - 1; i += 2) {
  var c1 = str.charCodeAt(i);
  var c2 = str.charCodeAt(i + 1);
  if (c1 > 127 || c2 > 127) {
    // This will be a problem. How you handle it is up to you.
  }
  // First character becomes the low byte, second the high byte (little-endian),
  // matching the byte order Python's UTF-16 decode used above.
  bufView[i / 2] = (c2 << 8) | c1;
}
console.log(String.fromCharCode.apply(String, bufView));
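For what it's worth, a shorter route (assuming an environment that provides TextEncoder/TextDecoder, i.e. modern browsers or Node) is to encode the string to UTF-8 bytes and decode those bytes as UTF-16LE, mirroring the Python one-liner directly:
const bytes = new TextEncoder().encode("some random string"); // UTF-8 bytes
console.log(new TextDecoder("utf-16le").decode(bytes)); // 潳敭爠湡潤瑳楲杮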

What is a surrogate pair?

I came across this code in a javascript open source project.
validator.isLength = function (str, min, max) {
  // match surrogate pairs in string or declare an empty array if none found in string
  var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
  // subtract the surrogate pairs string length from main string length
  var len = str.length - surrogatePairs.length;
  // now compare string length with min and max ... also make sure max is defined (in other words, the max param is optional)
  return len >= min && (typeof max === 'undefined' || len <= max);
};
As far as I understand, the above code checks the length of the string while counting each surrogate pair as a single character. So:
Is my understanding of the code correct?
What are surrogate pairs?
I have thus far only figured out that this is related to encoding.
Yes, your understanding is correct: the function returns the length of the string in Unicode code points.
JavaScript uses UTF-16 to encode its strings. Most Unicode code points fit in a single 16-bit (two-byte) code unit.
However, some characters in Unicode (like the emoji) have code points too high to be stored in 16 bits, so they are encoded as two UTF-16 code units (4 bytes). These two units are called a surrogate pair.
Try this
var len = "😀".length // There is an emoji in the string (if you don’t see it)
vs
var str = "😀"
var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
var len = str.length - surrogatePairs.length;
In the first example, len will be 2, because the emoji consists of two UTF-16 code units. In the second example, len will be 1.
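A modern alternative (assuming an ES2015+ environment) is to let the string iterator split on code points, which keeps surrogate pairs together without the regex:
var str = "😀";
console.log(str.length); // 2 — UTF-16 code units
console.log([...str].length); // 1 — code points; the iterator keeps the pair together
console.log(Array.from(str).length); // 1 — equivalent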
You might want to read
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
by Joel Spolsky
For your second question:
1. What is a "surrogate pair" in Java?
The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme.
In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF.
Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can only contain the range of characters from 0x0 to 0xFFFF, some additional complexity is used to store values above this range (0x10000 to 0x10FFFF). This is done using pairs of code units known as surrogates.
The surrogate code units are in two ranges known as "low surrogates" and "high surrogates", depending on whether they are allowed at the start or end of the two code unit sequence.
https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396
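Concretely, the scheme described above maps a code point C in the range 0x10000 to 0x10FFFF onto two code units like this (a sketch of the standard UTF-16 rule, not code from the quoted answer):
function toSurrogatePair(codePoint) {
  var v = codePoint - 0x10000;    // 20 significant bits remain
  var high = 0xD800 + (v >> 10);  // top 10 bits -> high surrogate
  var low = 0xDC00 + (v & 0x3FF); // bottom 10 bits -> low surrogate
  return [high, low];
}
console.log(toSurrogatePair(0x1F600).map(function (n) { return n.toString(16); })); // ["d83d", "de00"]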
Hope this helps.
Did you try to just google it?
The best description is http://unicodebook.readthedocs.io/unicode_encodings.html#surrogates
In UTF-16, some characters are stored in a single 16-bit code unit and others in two (32 bits in total).
A surrogate pair is the two-code-unit representation of such a character.
Certain code unit ranges are reserved for the first (high) and second (low) halves of these pairs.
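You can see those reserved ranges directly by inspecting the two halves of a character outside the Basic Multilingual Plane:
var s = "𤭢"; // U+24B62
console.log(s.charCodeAt(0).toString(16)); // d852 — falls in the high-surrogate range 0xD800–0xDBFF
console.log(s.charCodeAt(1).toString(16)); // df62 — falls in the low-surrogate range 0xDC00–0xDFFF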

Javascript encoding breaking & combining multibyte characters?

I'm planning to use a client-side AES encryption for my web-app.
Right now, I've been looking for ways to break multibyte characters into one-byte "non-characters", encrypt them (so the encrypted text keeps the same length),
decrypt them, and convert those one-byte "non-characters" back into multibyte characters.
I've seen the wiki pages for UTF-8 (the supposed default encoding for JS?) and UTF-16, but I can't figure out how to detect "fragmented" multibyte characters and how I can combine them back.
Thanks : )
JavaScript strings are UTF-16 stored in 16-bit "characters". For Unicode characters ("code points") that require more than 16 bits (some code points require 32 bits in UTF-16), each JavaScript "character" is actually only half of the code point.
So to "break" a JavaScript character into bytes, you just take the character code and split off the high byte and the low byte:
var code = str.charCodeAt(0); // The first character, obviously you'll have a loop
var lowbyte = code & 0xFF;
var highbyte = (code & 0xFF00) >> 8;
(Even though JavaScript's numbers are floating point, the bitwise operators work in terms of 32-bit integers, and of course in our case only 16 of those bits are relevant.)
You'll never have an odd number of bytes, because again this is UTF-16.
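Fleshing out that loop, a minimal sketch of the full split-and-recombine round trip might look like this (the helper names are mine, not from the answer):
function toBytes(str) {
  var bytes = [];
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    bytes.push((code & 0xFF00) >> 8, code & 0xFF); // high byte, then low byte
  }
  return bytes;
}
function fromBytes(bytes) {
  var str = "";
  for (var i = 0; i < bytes.length; i += 2) {
    str += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
  }
  return str;
}
console.log(fromBytes(toBytes("𤭢 multibyte ✓")) === "𤭢 multibyte ✓"); // true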
You could simply convert to UTF-8, for example by using this trick (note that escape/unescape are deprecated, though still widely implemented):
function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}
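For example, each "character" of the encoded result is one UTF-8 byte:
console.log(encode_utf8("𤭢").length); // 4 — the four UTF-8 bytes of U+24B62
console.log(decode_utf8(encode_utf8("𤭢")) === "𤭢"); // true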
Considering you are using crypto-js, you can use its methods to convert to UTF-8 and back to a string. See here:
var words = CryptoJS.enc.Utf8.parse('𤭢');
var utf8 = CryptoJS.enc.Utf8.stringify(words);
The 𤭢 is probably a botched example of a UTF-8 character.
Looking at the other examples (see the Latin1 example), I'd say that with parse you convert a string to UTF-8 (technically, you convert it to UTF-8 inside a special crypto-js array type called a WordArray), and the result can be passed to the AES encryption algorithm; with stringify you convert a WordArray (for example, one obtained from the decryption algorithm) back to a UTF-8 string.
JsFiddle example: http://jsfiddle.net/UpJRm/
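Tying it back to the question: a typical crypto-js round trip handles the UTF-8 conversion for you (a sketch assuming crypto-js is loaded; the passphrase is made up):
var ciphertext = CryptoJS.AES.encrypt("𤭢 some multibyte text", "secret passphrase").toString();
var plaintext = CryptoJS.AES.decrypt(ciphertext, "secret passphrase").toString(CryptoJS.enc.Utf8);
console.log(plaintext); // "𤭢 some multibyte text"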
