What I need to do is best described by example.
Previously, I had the following code:
content = u'<?xml version="1.0" encoding="windows-1251"?>\n' + ... #
with open(file_name, 'w') as f:
    f.write(content.encode('cp1251'))
Now I want to change the architecture of my app: send the string that is supposed to become the file content to the client via JSON, and generate the file in JavaScript.
So, now my code looks something like this:
response_data = {}
response_data['file_content'] = content.encode('cp1251')
response_data['file_name'] = file_name
return JsonResponse({'content':json.dumps(response_data, ensure_ascii=False)}) # error generated
The problem is that I get UnicodeDecodeError: 'ascii' codec can't decode byte 0xd4 in position 53: ordinal not in range(128)
I also tried a second option, this way:
response_data = {}
response_data['file_content'] = content
response_data['file_name'] = file_name
return JsonResponse({'content':json.dumps(response_data, ensure_ascii=False).encode('utf8')}) # error generated
Then, on the client, I try to convert UTF-8 to windows-1251.
$.post('/my_url/', data, function(response) {
    var file_content = JSON.parse(response.content).file_content;
    file_content = UnicodeToWin1251(file_content);
...but I get distorted symbols.
I know I am doing something terribly wrong here and am probably messing up the encoding, but I have spent an entire day on this and still can't solve it. Could someone give a hint where my mistake is?
Both XML and JSON contain data that is Unicode text. The XML declaration merely tells your XML parser how to decode the XML serialisation of that data. You wrote the serialisation by hand, so to match the XML declaration you had to encode to CP-1251.
The JSON standard states that all JSON should be encoded in UTF-8, UTF-16 or UTF-32, with UTF-8 the default; again, this is just the encoding of the serialisation.
Leave your data as Unicode and encode it to JSON with the json library; the library takes care of giving you UTF-8 data (in Python 2), or gives you Unicode text (Python 3) that can be encoded to UTF-8 later. Your JavaScript code then decodes the JSON, at which point you have Unicode text again:
response_data = {}
response_data['file_content'] = content
response_data['file_name'] = file_name
return JsonResponse({'content':json.dumps(response_data, ensure_ascii=False)})
There is no need whatsoever to send binary data over JSON here; you are sending text. If your JavaScript code then generates the file, it is responsible for encoding to CP-1251, not your Python code.
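For what it's worth, the browser's built-in TextEncoder only emits UTF-8, so producing a windows-1251 file on the client usually means mapping characters to byte values yourself. A rough sketch, assuming a hypothetical toCp1251Bytes() helper (for example, your UnicodeToWin1251 adapted to return a Uint8Array of byte values):
// Sketch only: toCp1251Bytes() is a hypothetical helper that maps each
// character of a JS string to its windows-1251 byte value (0-255).
function downloadAsCp1251(text, fileName) {
    var bytes = toCp1251Bytes(text);            // Uint8Array of cp1251 bytes
    var blob = new Blob([bytes], { type: 'application/xml' });
    var link = document.createElement('a');
    link.href = URL.createObjectURL(blob);
    link.download = fileName;                   // suggested name for the saved file
    link.click();                               // trigger the download
    URL.revokeObjectURL(link.href);             // release the object URL
}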
If you must put binary data in a JSON payload, you'll need to encode that payload to some form of text. Binary data (and CP-1251-encoded text is binary data) can be encoded as text with Base64:
import base64
response_data = {}
response_data['file_content'] = base64.encodestring(content.encode('cp1251')).decode('ascii')
response_data['file_name'] = file_name
return JsonResponse({'content':json.dumps(response_data, ensure_ascii=False)})
Base64 data is encoded to a bytestring containing only ASCII data, so decode it as ASCII for the JSON library, which expects Unicode text.
Now you are sending binary data, wrapped in a Base64 text encoding, to the Javascript client, which now has to decode the Base64 if you need the binary payload there.
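A minimal sketch of that client-side decode, assuming the response shape from the question (atob() tolerates the newlines that base64.encodestring inserts):
// Sketch: turn the Base64 payload back into the raw cp1251 bytes on the client.
var payload = JSON.parse(response.content);
var binary = atob(payload.file_content);        // Base64 -> binary string
var bytes = new Uint8Array(binary.length);
for (var i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);            // one byte per character
}
// 'bytes' now holds the cp1251-encoded file content.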
Related
I'm parsing a Uint8 array that is an HTML document. It contains a script tag which in turn contains JSON data that I would like to parse.
I first converted the array to text:
data = Buffer.from(str).toString('utf8')
I then searched for the script tag, and extracted the string containing the JSON:
... {\"phrase\":\"Go to \"California\"\",\"color\":\"red\",\"html\":\"<div class=\"myclass\">Ok</div>\"} ...
I then did a replace to clean it up.
data = data.replace(/\\"/g, "\"").replace(/\\/g, "");
{"phrase":"Go to "California"","color":"red","html":"<div class="myclass">Ok</div>"}
I tried to parse it using JSON.parse() and got an error because the attribute values contain quotes. Is there a way to process this further using a regex? Or perhaps a library? I am working with Cheerio, so I can use that if helpful.
The escape characters are necessary if you want to parse the JSON. The embedded quotes would need to be double-escaped, so the extracted text isn't even valid JSON; it would have to look like this:
"{\"phrase\":\"Go to \\\"California\\\"\",\"color\":\"red\",\"html\":\"<div class=\\\"myclass\\\">Ok</div>\"}"
or, using single quotes:
'{"phrase":"Go to \\"California\\"","color":"red","html":"<div class=\\"myclass\\">Ok</div>"}'
Thanks.
After some more tinkering, I realized that I should have encoded the data to Uint8 at the source (a Lambda function) before transmitting it for further processing. So now the flow is:
1. Text
2. Encode the text to Uint8
3. Return from the Lambda function
4. Decode from Uint8 back to text
5. Process it readily, since there are no escape characters
Before, I was skipping step 2, so Lambda was encoding the text however it does by default.
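A minimal sketch of steps 2 and 4 above, assuming the JSON text is in a jsonText variable and the Lambda returns the Uint8Array as-is:
// Step 2 (inside the Lambda): encode the text to UTF-8 bytes explicitly.
const encoded = new TextEncoder().encode(jsonText);       // Uint8Array of UTF-8 bytes
// Step 4 (after the Lambda returns): decode the bytes back to text and parse.
const decoded = Buffer.from(encoded).toString('utf8');    // or new TextDecoder().decode(encoded)
const parsed = JSON.parse(decoded);                       // no stray escape characters to clean up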
I'm getting input data in Apps Script in the form of binary data (docx file data).
I need this data converted into a Base64 string. I tried using the Utilities class to encode it to Base64, but it returns a string of only a few characters, which is invalid. Is there any way to convert this form of data in Apps Script?
The current script is below:
function run() {
    var inputData = Eventbus.get('encodedData');             // this is received as binary data
    var convertedData = Utilities.base64Encode(inputData);   // need to encode to Base64, but doesn't work
    Eventbus.set('decodedData', convertedData);
}
Thanks
Saurabh
I am building a web app where I can upload a JSON file, update it, then download it. The output JSON is not valid because some characters change during the process. I don't know where I'm going wrong, because even when I only do upload => download, without any updates, the JSON is still not valid.
This is how I read the uploaded JSON:
readFile: function () {
    var reader = new FileReader();
    reader.onload = function(event) {
        this.json = JSON.parse(event.target.result);
    }.bind(this);
    reader.readAsText(this.file);
}
Then I can edit (or not) the json object. Then I can download it with JSON.stringify(json).
When I try to read or validate the output JSON I get errors signaling invalid characters, for example:
Invalid characters in string. Control characters must be escaped on some lines in my editor.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position X: invalid start byte when I try to load it in Python with open('output.json') as json_file: data = json.load(json_file)
Does using JSON.parse then JSON.stringify modify the encoding or structure of the JSON? How can I avoid this effect?
UPDATE:
The original file can have characters like \u2013, \u2014, \u201d, \u00e7, but those characters are transformed into things like � or into invisible characters in the output JSON, which I guess makes it invalid.
Try adding 'UTF-8' as the second parameter to the readAsText function, as follows:
reader.readAsText(this.file,'UTF-8');
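If the breakage happens on the download side instead, the same idea applies: serialise explicitly as UTF-8. A rough sketch, assuming the download is built from a Blob (the exact download code isn't shown in the question):
// Sketch: build the download payload explicitly as UTF-8 JSON.
var blob = new Blob([JSON.stringify(this.json, null, 2)],
                    { type: 'application/json;charset=utf-8' });
// Hand 'blob' to whatever download mechanism the app already uses,
// e.g. URL.createObjectURL(blob) on a link with a download attribute.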
When using a library to request some non-ASCII/UTF8 data, we often get back a string full of nonsense. Example:
const got = require("got");
got("http://twemoji.maxcdn.com/16x16/1f525.png")
.then(response => console.log(response.body))
This is the output:
�PNG
IHD��aaIDAT8�c`��L�fEb��?��8�-���#���5�!� ���|bQ\�$�� �ׁX�y�xT
���y#< �c�i��6$�K$
L÷���w��������_��Ϡ���d��?�j��2��� ��hX��cn������e"L����x�3�
��Y�f�N���
mt:����2e�f��N���~{'̀x�ȿ �;�m
�
�PIEND�B`� �vZ�]�dX<R�\�Y:������`�A�A��ӂƟ}�����#A�\�n����|�A� u83����,�{������#�#4��#��D�
Curiously, that is the same thing we see when downloading the image and running:
cat 1f525.png
What exactly is that string, why does it look like this, and how do we convert it to a proper Buffer object?
That's a PNG image, which is not text data but plain binary data. It doesn't make sense to interpret it as a string.
got will return a string, buffer, readable stream, or object. console.log is converting your stream into a string, which is not what you want. cat deals with text, not binary data.
Also, the response from http://twemoji.maxcdn.com/16x16/1f525.png does not include a Content-Type header, which could be throwing off the got library.
It's not really strange that you see the same output from cat; that is how a PNG image looks when interpreted as a string.
According to the got documentation, it should return a buffer when you specify the encoding as null. Perhaps console.log is converting the buffer to a string, or you could try setting the encoding to image/png.
Did you actually try saving the image to a file? Perhaps it'll just work.
By default, got fetches you a string. It assumes you want UTF-8 text, since that's probably the most common case: people fetching HTML documents. From the documentation:
encoding
Type: string, null
Default: 'utf8'
Encoding to be used on setEncoding of the response data. If null, the body is returned as a Buffer.
If you want binary data instead, specify {encoding: null}:
const got = require("got");
got("http://twemoji.maxcdn.com/16x16/1f525.png", {encoding:null})
.then(response => console.log(response.body))
Then response.body will be a Buffer.
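From there you can, for example, write the buffer straight to disk (a quick sketch):
const fs = require("fs");
const got = require("got");

got("http://twemoji.maxcdn.com/16x16/1f525.png", {encoding: null})
    .then(response => {
        // response.body is a Buffer holding the raw PNG bytes
        fs.writeFileSync("1f525.png", response.body);
    });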
I have a dict that contains some Unicode strings (among other objects). I'd like to save this dict as a JSON file, and then display the content of this via AJAX.
If final_res is the dict, I usually do this:
json.dumps(final_res, ensure_ascii=True)
In the result, I see strings like:
"l\\u00a0m\\u00fcdale"
I imagine these are Unicode-escaped characters. But when I try to display them in JavaScript, they get printed with the slashes instead of the decoded Unicode letters.
Is there something I am doing wrong in JavaScript for displaying these properly? Or should I decode these into ASCII in Python before outputting the JSON?
UPDATE:
Based on the discussion in the comments below with @spectra, I realized that json.dumps should not be outputting double slashes. When I parse this in the browser, it prints as a literal single slash.
I am trying to figure out a way to fix this with the json module; I am not sure why it's happening.
The solution for me was to save the "single-slashed" version of the json.dumps result to the database. I did it by calling print on the result of json.dumps and then copy-pasting that into the database.
You can encode the JSON as UTF-8 instead of using escaped characters:
json.dumps(final_res, ensure_ascii=False).encode('utf8')
For example
print json.dumps({'name':u'l\u00a0m\u00fcdale'},ensure_ascii=False).encode('utf8')
# {"name": "l müdale"}
Then, in your client-side AJAX code, set the encoding to UTF-8:
How to set encoding in .getJSON JQuery
$.ajax({ contentType: "application/json; charset=utf-8", ... })
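For example, a minimal sketch of the request, assuming the JSON is served from a hypothetical /my_data/ endpoint:
// Sketch: request the UTF-8 JSON and let jQuery parse it.
$.ajax({
    url: '/my_data/',                           // hypothetical endpoint returning the JSON
    contentType: 'application/json; charset=utf-8',
    dataType: 'json',                           // jQuery parses the response body
    success: function(data) {
        console.log(data.name);                 // "l müdale"
    }
});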