Saving an HTML Blob file produces weird text inside - javascript

I have a comma-delimited file that I'm saving to a Blob.
I'm using the latest Chrome-based Edge browser.
This particular code (TypeScript) has not changed for many months.
But suddenly I noticed that if I save the file with a particular datetime string in it, I get weird output: I see weird text instead of the datetime string.
Here is the datetime string I'm saving (and fully expect to see in the saved file):
‎9‎/‎26‎/‎2020‎ ‎7‎:‎00‎:‎00‎ ‎AM
Here is the weird text that appears instead:
â€Ž9â€Ž/â€Ž26â€Ž/â€Ž2020â€Ž â€Ž7â€Ž:â€Ž00â€Ž:â€Ž00â€Ž â€ŽAM
Judging by the fact that I couldn't simply copy and paste this weird string into this edit window (it thinks I'm trying to paste an image), I'm guessing it is binary. That is probably a huge hint, but it's not ringing any bells for me.
So the question is: why is this binary when I'm certain I'm writing out a string?
After some digging around, I determined that there seems to be an encoding issue, though I'm still not sure why. On closer inspection of the weird string, the date is actually in there. It just looks strange because each component is padded with the weird string "â€Ž".

Your string is full of Unicode Character 'LEFT-TO-RIGHT MARK' (U+200E).
const text = `‎9‎/‎26‎/‎2020‎ ‎7‎:‎00‎:‎00‎ ‎AM`;
console.log( text.replace( /\u200e/g, "[LTR]" ) );
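If the marks are not wanted in the saved file at all, one option is to strip them before building the Blob. The sketch below assumes the datetime string is available as plain text; `stripDirectionalMarks` is a hypothetical helper name, not something from the question's code.

```javascript
// Hypothetical helper: removes LEFT-TO-RIGHT MARK (U+200E) and
// RIGHT-TO-LEFT MARK (U+200F) characters from a string.
function stripDirectionalMarks(text) {
  return text.replace(/[\u200e\u200f]/g, "");
}

const raw = "\u200e9\u200e/\u200e26\u200e/\u200e2020\u200e \u200e7\u200e:\u200e00\u200e:\u200e00\u200e \u200eAM";
const cleaned = stripDirectionalMarks(raw);

console.log(raw.length, cleaned.length); // the cleaned string is shorter
console.log(cleaned);
```

Whether stripping is appropriate depends on where the string comes from; if the marks are meaningful for display, keeping them and fixing the encoding on the reading side is the safer choice.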
Somehow, you are reading your file as Windows-1252 (you don't say how you are reading it, so it's hard to tell what went wrong, but note that it is the default encoding when opening a text file directly in most browsers). When the reader finds the UTF-8 sequence 0xE2 0x80 0x8E, it does not decode it as a single character; each byte is mapped to its own Windows-1252 character (unlike ASCII characters, whose bytes are the same in both encodings), so this mark gets read as â€Ž:
const text = "\u200e9\u200e/\u200e26\u200e/\u200e2020\u200e \u200e7\u200e:\u200e00\u200e:\u200e00\u200e \u200eAM";
const blob = new Blob( [ text ] ); // here 'text' is encoded as UTF-8
const reader = new FileReader();
reader.onload = (evt) => {
  console.log( reader.result );
  const OPs_result = "â€Ž9â€Ž/â€Ž26â€Ž/â€Ž2020â€Ž â€Ž7â€Ž:â€Ž00â€Ž:â€Ž00â€Ž â€ŽAM";
  console.log( "is same as OP's result?", OPs_result === reader.result );
};
reader.readAsText( blob, "Windows-1252" );
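To see where the three odd characters come from, it can help to look at the bytes themselves. The sketch below encodes a single LEFT-TO-RIGHT MARK as UTF-8 and then maps each byte to its Windows-1252 character; the three-entry `cp1252` table is hand-written for this illustration and covers only these bytes.

```javascript
// UTF-8 encode one LEFT-TO-RIGHT MARK (U+200E).
const lrmBytes = new TextEncoder().encode("\u200e");
const hex = Array.from(lrmBytes, (b) => b.toString(16).padStart(2, "0"));
console.log(hex.join(" ")); // e2 80 8e

// Hand-written excerpt of the Windows-1252 table (assumption: only the
// three bytes needed here are listed).
const cp1252 = { 0xe2: "\u00e2", 0x80: "\u20ac", 0x8e: "\u017d" };
const misread = Array.from(lrmBytes, (b) => cp1252[b]).join("");
console.log(misread); // â€Ž
```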
However, reading this same file as UTF-8 would render these characters correctly:
const text = "\u200e9\u200e/\u200e26\u200e/\u200e2020\u200e \u200e7\u200e:\u200e00\u200e:\u200e00\u200e \u200eAM";
const blob = new Blob( [ text ] ); // here 'text' is encoded as UTF-8
blob.text() // reads as UTF-8
.then( console.log );
And if you want to help your browser to open this text file as UTF-8 instead of the default Windows-1252, you can prepend a BOM to this file, as demonstrated in this answer:
const text = "\u200e9\u200e/\u200e26\u200e/\u200e2020\u200e \u200e7\u200e:\u200e00\u200e:\u200e00\u200e \u200eAM";
const without_BOM = new Blob( [ text ] );
const BOM = new Uint8Array([0xEF,0xBB,0xBF]);
const with_BOM = new Blob( [ BOM, text ] );
document.getElementById( "without_BOM" ).href = URL.createObjectURL( without_BOM );
document.getElementById( "with_BOM" ).href = URL.createObjectURL( with_BOM );
<a id="without_BOM">Open the file without BOM</a><br>
<a id="with_BOM">Open the file with BOM</a>
And if you wish to encode your csv files as Windows-1252, then you can check this answer.

Related

Javascript OCR tesseract.js Error in copying number after recognition

I'm working on a project where you give the program an image and, using OCR in JavaScript, the program detects (recognizes) a string or word, for example 'رقم العداد',
and copies the number (integer) that follows the string, separated by spaces, like this:
7038842 رقم العداد
I'm using Tesseract.js (Tesseract.recognize) to recognize the string, but at first I hit an error:
Uncaught (in promise)
After some digging, it turned out that Tesseract fails to detect some Arabic letters correctly. I printed all the text detected from the image, and the string 'نقطة الخدمة' is recognized as 'ننطة الخدمة' and 'رقم العداد' as 'رم العداد'. So I'm using the
string.match method
to copy the number after the word. The number returned for 'رم العداد' was correct and clear, but for some reason the code does not copy the number written after 'ننطة الخدمة'. I tried adding spaces and tabs, but the problem remains. What am I missing?
The code:
<script>
Tesseract.recognize(
  'form.png',
  'ara',
  { logger: m => console.log(m) }
).then(({ data: { text } }) => {
  console.log(text);
  const info = ['ننطة الخدمة','رم العداد','القراءة'];
  for(k=0;k<info.length;k++){
    var result = text.match(new RegExp(info[0] + '\\s+(\\w+)'))[1]; /* info[0] the index of ['نقطة الخدمة']*/
    alert(result);
  }
})
</script>
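The `Uncaught (in promise)` error is consistent with `text.match(...)` returning `null` when a label is not found, since indexing `null` with `[1]` throws. A minimal sketch of a null-safe lookup follows; `numberAfterLabel` is a hypothetical helper, and it assumes the number follows the label in the recognized text and is written with ASCII or Arabic-Indic digits.

```javascript
// Hypothetical helper: returns the digits that follow `label` in `text`,
// or null when the label (or a number after it) is not found.
function numberAfterLabel(text, label) {
  // Escape regex metacharacters so the label is matched literally.
  const escaped = label.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Accept ASCII digits and Arabic-Indic digits (U+0660 to U+0669).
  const pattern = new RegExp(escaped + "\\s+([0-9\\u0660-\\u0669]+)");
  const match = text.match(pattern);
  return match ? match[1] : null;
}

// Sample text made up for illustration (not real OCR output).
const sampleText = "رم العداد 7038842\nننطة الخدمة 12345";
const labels = ["ننطة الخدمة", "رم العداد", "القراءة"];
for (const label of labels) {
  console.log(label, "=>", numberAfterLabel(sampleText, label));
}
```

Logging the raw recognized text next to each `null` result should make it easier to see which characters actually sit between a label and its number.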
the project image:-

Why won't window.btoa work on – ” characters in Javascript?

So I'm converting a string to BASE64 as shown in the code below...
var str = "Hello World";
var enc = window.btoa(str);
This yields SGVsbG8gV29ybGQ=. However if I add these characters – ” such as the code shown below, the conversion doesn't happen. What is the reason behind this? Thank you so much.
var str = "Hello – World”";
var enc = window.btoa(str);
btoa is an exotic function in that it requires a "binary string", which is an 8-bit clean string format. It doesn't work with Unicode values above char code 255, such as those used by your en dash and "fancy" quote symbol.
You'll either have to turn your string into a new string that conforms to single byte packing (and then manually reconstitute the result of the associated atob), or you can uri encode the data first, making it safe:
> var str = `Hello – World`;
> window.btoa(encodeURIComponent(str));
"SGVsbG8lMjAlRTIlODAlOTMlMjBXb3JsZA=="
And then remember to decode it again when unpacking:
> var base64= "SGVsbG8lMjAlRTIlODAlOTMlMjBXb3JsZA==";
> decodeURIComponent(window.atob(base64));
"Hello – World"
The problem is that the characters – and ” lie outside the Latin-1 range.
For this you can use unescape (now deprecated):
var str = "Hello – World”";
var enc = btoa(unescape(encodeURIComponent(str)));
alert(enc);
And to decode:
var encStr = "SGVsbG8g4oCTIFdvcmxk4oCd";
var dec = decodeURIComponent(escape(window.atob(encStr)))
alert(dec);
This ultimately owes to a deficiency in the JavaScript type system.
JavaScript strings are strings of 16-bit code units, which are customarily interpreted as UTF-16. The Base64 encoding is a method of transforming an 8-bit byte stream into a string of digits, by taking each three bytes and mapping them into four digits, each covering 6 bits: 3 × 8 = 4 × 6. As we see, this is crucially dependent on the bit width of each symbol.
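The 3 × 8 = 4 × 6 regrouping can be made concrete with a small sketch that packs three bytes into four Base64 digits by hand. `encodeTriple` is a hypothetical name, and the function handles only a full group of three bytes (no padding).

```javascript
const ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: encodes exactly three bytes as four Base64 digits.
function encodeTriple(b0, b1, b2) {
  // Concatenate the three 8-bit values into one 24-bit number.
  const bits = (b0 << 16) | (b1 << 8) | b2;
  // Slice that number into four 6-bit indices into the alphabet.
  return (
    ALPHABET[(bits >> 18) & 63] +
    ALPHABET[(bits >> 12) & 63] +
    ALPHABET[(bits >> 6) & 63] +
    ALPHABET[bits & 63]
  );
}

console.log(encodeTriple(0x48, 0x65, 0x6c)); // "Hel" -> "SGVs"
```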
At the time the btoa function was defined, JavaScript had no type for 8-bit byte streams, so the API was defined to take the ordinary 16-bit string type as input, with the restriction that each code unit was supposed to be confined to the range [U+0000, U+00FF]; when encoded into ISO-8859-1, such a string would reproduce the intended byte stream exactly.
The character – is U+2013, while ” is U+201D; neither of those characters fits into the above-mentioned range, so the function rejects it.
If you want to convert Unicode text into Base64, you need to pick a character encoding and convert it into a byte string first, and encode that. Asking for a Base64 encoding of a Unicode string itself is meaningless.
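Following that reasoning, one way to do it is to pick UTF-8, turn the resulting bytes into a binary string, and only then call btoa. This is a sketch rather than a definitive implementation; `utf8ToBase64` and `base64ToUtf8` are hypothetical helper names.

```javascript
// Hypothetical helper: Unicode text -> UTF-8 bytes -> binary string -> Base64.
function utf8ToBase64(text) {
  const bytes = new TextEncoder().encode(text);
  let binary = "";
  for (const byte of bytes) {
    binary += String.fromCharCode(byte); // every code unit is now <= 0xFF
  }
  return btoa(binary);
}

// Hypothetical helper: Base64 -> binary string -> UTF-8 bytes -> Unicode text.
function base64ToUtf8(base64) {
  const binary = atob(base64);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

const encoded = utf8ToBase64("Hello \u2013 World\u201d");
console.log(encoded); // SGVsbG8g4oCTIFdvcmxk4oCd
console.log(base64ToUtf8(encoded));
```

The receiving side has to know that UTF-8 was the chosen encoding, since the Base64 text itself carries no such information.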
The most bulletproof way is to work on binary data directly.
For this, you can encode your string to an ArrayBuffer object representing the UTF-8 version of your string.
Then a FileReader instance will be able to give you the base64 quite easily.
var str = "Hello – World”";
var buf = new TextEncoder().encode( str );
var reader = new FileReader();
reader.onload = evt => { console.log( reader.result.split(',')[1] ); };
reader.readAsDataURL( new Blob([buf]) );
And since the Blob() constructor does automagically encode DOMString instances to UTF-8, we could even get rid of the TextEncoder object:
var str = "Hello – World”";
var reader = new FileReader();
reader.onload = evt => { console.log( reader.result.split(',')[1] ); };
reader.readAsDataURL( new Blob([str]) );

Save binary file from base64 data in Javascript [duplicate]

I am trying to download xlsx spreadsheet with javascript. I have tested base64 data. I decode it like so:
var data = atob(validBase64Data);
After that, I do:
save(name, data, type) {
  const blob = new Blob([data], {type: type});
  let objectURL = window.URL.createObjectURL(blob);
  let anchor = document.createElement('a');
  anchor.href = objectURL;
  anchor.download = name;
  anchor.click();
  URL.revokeObjectURL(objectURL);
}
Where name is a filename.xlsx, data is the decoded data and type is a mime-type string.
The Excel file is downloaded but does not open in Excel. The data is corrupted somehow.
In addition, I tested the same data with a Unix terminal command that base64-decodes it and writes the xlsx directly to a file, and that produced a working file. The test was done like so:
I saved the base64 data to test_excel.txt
I ran the command base64 -D -i test_excel.txt -o test_excel.xlsx
test_excel.xlsx is recognized by Excel.
What am I doing wrong with the code?
Here is the code that solved it:
export default {
  save(name, data, type, isBinary) {
    if (isBinary) {
      var bytes = new Array(data.length);
      for (var i = 0; i < data.length; i++) {
        bytes[i] = data.charCodeAt(i);
      }
      data = new Uint8Array(bytes);
    }
    var blob = new Blob([data], {type: type});
    let objectURL = window.URL.createObjectURL(blob);
    let anchor = document.createElement('a');
    anchor.href = objectURL;
    anchor.download = name;
    anchor.click();
    URL.revokeObjectURL(objectURL);
  }
}
Thanks to everyone who participated in resolving.
Also, credits to: Creating a Blob from a base64 string in JavaScript
Okay, so let's clarify a few things before anyone tries to "explain" the problem incorrectly.
The original .xlsx is a binary-encoded file, meaning that the data will contain bytes in the full range of 0x00 to 0xFF.
In the question, it is assumed that this string has been successfully encoded into a valid base64 string, with no extraneous characters (as indicated by the success of the test using base64 without the -i flag), and stored to validBase64Data.
The problem is that atob(validBase64Data) returns a "binary string": a JavaScript string in which each character's code unit holds one byte of the original data. As I said before, the original data contains non-ASCII bytes in the range 0x80 to 0xFF. When that string is passed to the Blob constructor, it is encoded as UTF-8, and in UTF-8 these code points are stored as two bytes instead of one, which corrupts the file. The solution, as described in Creating a Blob from a base64 string in JavaScript, is to convert the code unit of each character in the string data into a byte stored in a Uint8Array, and then construct a Blob from that.
A naive solution might look like this, though please refer to Creating a Blob from a base64 string in JavaScript for more performant solutions:
const blob = new Blob([Uint8Array.from(data, c => c.charCodeAt(0))], { type });
//...
This uses TypedArray.from(iterable, mapFn).
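To see the corruption concretely, the sketch below compares the byte count of a decoded binary string with the byte count after UTF-8 encoding, which is what the Blob constructor applies to string parts. `base64ToBytes` is a hypothetical helper name and the four sample bytes are made up for illustration.

```javascript
// Hypothetical helper: decodes Base64 into raw bytes.
function base64ToBytes(base64) {
  const binary = atob(base64); // one character per byte, codes 0x00..0xFF
  return Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
}

// Made-up sample: four bytes, two of them above 0x7F.
const sample = btoa(String.fromCharCode(0x50, 0x4b, 0xff, 0x80));

const bytes = base64ToBytes(sample);
console.log(bytes.length); // 4

// Encoding the binary string as UTF-8 (as Blob does for string parts)
// turns each of the two high bytes into two bytes.
const asUtf8 = new TextEncoder().encode(atob(sample));
console.log(asUtf8.length); // 6
```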

Convert a file to Base64 using JavaScript and convert it back to a file using C#

I am trying to convert PDF and image files to Base64 using JavaScript and convert them back to files using C# in Web API.
Javascript
var filesSelected = document.getElementById("inputFileToLoad").files;
if (filesSelected.length > 0)
{
  var fileToLoad = filesSelected[0];
  var fileReader = new FileReader();
  fileReader.onload = function(fileLoadedEvent)
  {
    var textAreaFileContents = document.getElementById("textAreaFileContents");
    textAreaFileContents.innerHTML = fileLoadedEvent.target.result;
  };
  fileReader.readAsDataURL(fileToLoad);
}
C#
Byte[] bytes = Convert.FromBase64String(dd[0].Image_base64Url);
File.WriteAllBytes(actualSavePath,bytes);
But in the API I'm getting this exception: {"The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or an illegal character among the padding characters. "}
Please tell me how to proceed with this...
Thanks
According to MDN (FileReader.readAsDataURL), the generated URLs are prefixed with something like data:image/jpeg;base64,. Have a look at your generated string: look for the occurrence of base64, and take the Base64 data that starts after this prefix.
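The prefix can also be removed on the JavaScript side before the value is sent to the API. A minimal sketch, assuming the string came from readAsDataURL; `stripDataUrlPrefix` is a hypothetical helper name and the sample data URL is made up.

```javascript
// Hypothetical helper: returns only the Base64 payload of a data URL.
function stripDataUrlPrefix(dataUrl) {
  const marker = ";base64,";
  const index = dataUrl.indexOf(marker);
  // If the marker is missing, return the input unchanged.
  return index === -1 ? dataUrl : dataUrl.slice(index + marker.length);
}

// Made-up sample; "SGVsbG8=" is the Base64 encoding of "Hello".
const sampleDataUrl = "data:text/plain;base64,SGVsbG8=";
console.log(stripDataUrlPrefix(sampleDataUrl)); // SGVsbG8=
```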
Because the FileReader.readAsDataURL() produces a string that is prefixed with extra metadata (the "URL" portion), you need to strip it off on the C# side. Here's some sample code:
// Sample string from FileReader.readAsDataURL()
var base64 = "";
// Some known piece of information that will be in the above string
const string identifier = ";base64,";
// Find where it exists in the input string
var dataIndex = base64.IndexOf(identifier);
// Take the portion after this identifier; that's the real base-64 portion
var cleaned = base64.Substring(dataIndex + identifier.Length);
// Get the bytes
var bytes = Convert.FromBase64String(cleaned);
You could condense this down if it's too verbose; I just wanted to explain it step by step.
var bytes = Convert.FromBase64String(base64.Substring(base64.IndexOf(";base64,") + 8));

How to parse out the escaped single quote &#39; from a JSON string in JavaScript (Google Translate)

I am calling the Google Translate API and getting back strings that are not fully decoded. In particular, I am seeing &#39; where single quotes should be.
For example:
{
  "q": "det är fullt",
  "target": "en"
}
Returns
{
  "data": {
    "translations": [
      {
        "translatedText": "It&#39;s full",
        "detectedSourceLanguage": "sv"
      }
    ]
  }
}
I would have expected JSON.parse to take care of this, but it does not. Is there some other native function I need to be calling? My current fix is a regex, .replace(/&#39;/g, "'");, but is there a better way to decode this type of thing using JavaScript?
Aha! The issue is caused because the response is HTML encoded.
If I were to put the translation onto the page directly, the quote renders just fine. However, I am putting the result in a textarea to give users a chance to edit that translation. As a result, the browser is not automatically reading the string as HTML since it is not rendered directly as HTML.
The solution I am now using is to decode the string using DOMParser as described on this stackoverflow thread:
var encodedStr = 'hello &amp; world';
var parser = new DOMParser();
var dom = parser.parseFromString(
  '<!doctype html><body>' + encodedStr,
  'text/html');
var decodedString = dom.body.textContent;
console.log(decodedString);
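Where no DOM is available (for example outside the browser), numeric character references such as &#39; can be decoded with a small replace callback. This is a sketch under that assumption: `decodeNumericEntities` is a hypothetical helper, and it does not handle named entities such as &amp;.

```javascript
// Hypothetical helper: decodes decimal (&#39;) and hexadecimal (&#x27;)
// character references. Named entities are left untouched.
function decodeNumericEntities(text) {
  return text.replace(/&#(x[0-9a-f]+|[0-9]+);/gi, (whole, code) => {
    const isHex = code[0] === "x" || code[0] === "X";
    const codePoint = parseInt(isHex ? code.slice(1) : code, isHex ? 16 : 10);
    // Leave out-of-range references as they were instead of throwing.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : whole;
  });
}

console.log(decodeNumericEntities("It&#39;s full")); // It's full
```

In the browser, the DOMParser approach remains the more complete option because it also covers named entities.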
