I am building a web app where I can upload a JSON file, update it, then download it. The output JSON is not valid because some characters are changed along the way. I can't tell where I'm going wrong, because even when I only do upload => download, without any updates, the JSON is still not valid...
This is how I read the uploaded JSON:
readFile: function () {
  var reader = new FileReader();
  reader.onload = function (event) {
    this.json = JSON.parse(event.target.result);
  }.bind(this);
  reader.readAsText(this.file);
}
Then I can edit (or not) the json object. Then I can download it with JSON.stringify(json).
When I try to read or validate the output JSON I get errors signaling invalid characters, for example:
"Invalid characters in string. Control characters must be escaped" for some lines in my editor.
"UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position X: invalid start byte" when I try to load it in Python with open('output.json') as json_file: data = json.load(json_file)
Does using JSON.parse then JSON.stringify modify the encoding or structure of the JSON? How can I avoid this effect?
UPDATE:
The original file can contain characters like \u2013, \u2014, \u201d, \u00e7, but those characters are transformed into things like � or into invisible characters in the output JSON, which I guess makes it invalid.
Try adding 'UTF-8' as a second parameter to the readAsText function, as follows:
reader.readAsText(this.file,'UTF-8');
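If the problem persists, the download side is worth checking too: the Blob used for the download can declare a UTF-8 charset explicitly. Below is a minimal sketch of that idea; the function name downloadJson and the anchor-click pattern are illustrative, not taken from the question:

function downloadJson(json, filename) {
  // declare the charset on the Blob so the saved file is unambiguously UTF-8
  var blob = new Blob([JSON.stringify(json, null, 2)],
                      { type: 'application/json;charset=utf-8' });
  var a = document.createElement('a');
  a.href = URL.createObjectURL(blob);
  a.download = filename;
  a.click();
  URL.revokeObjectURL(a.href);
}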
I have the following code in my frontend:
let reader = new FileReader();
reader.onload = function (e) {
  _this.imageBase64 = e.target.result;
  alert(_this.imageBase64);
  ...
}
The alert shows that imageBase64 is a base64 string, which starts with data:image/jpeg;base64.
The problem is: is there any elegant way to get the base64 string of the image without this prefix? (I don't want to use a substring-like function.)
The server-side code will read this string on the assumption that it contains only the base64 representation of the image.
Maybe base64Str.split(',').pop() is my best choice, as string.slice(start, stop) and string.substring(start, stop) require the exact index.
It seems we are getting a data URL (which has the leading metadata prefix) by using the approach I mentioned. The advantage is that such a URL can be used directly in a src attribute, which is why the prefix exists in the front-end world.
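If avoiding string surgery altogether is preferred, one alternative is to read the file as an ArrayBuffer and base64-encode the raw bytes yourself, so no data-URL prefix is ever produced. A sketch of that idea (the function name toBase64 is illustrative):

function toBase64(file, callback) {
  var reader = new FileReader();
  reader.onload = function (e) {
    var bytes = new Uint8Array(e.target.result);
    var binary = '';
    for (var i = 0; i < bytes.length; i++) {
      binary += String.fromCharCode(bytes[i]);
    }
    // btoa yields plain base64, with no "data:image/jpeg;base64," prefix
    callback(btoa(binary));
  };
  reader.readAsArrayBuffer(file);
}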
I'm connecting to an external websocket api using the node ws library (node 10.8.0 on Ubuntu 16.04). I've got a listener which simply parses the json and passes it to the callback:
this.ws.on('message', (rawdata) => {
  let data = null;
  try {
    data = JSON.parse(rawdata);
  } catch (e) {
    console.log('Failed parsing the following string as json: ' + rawdata);
    return;
  }
  mycallback(data);
});
I now receive errors in which the rawdata looks as follows (I formatted it and removed irrelevant contents):
�~A
{
"id": 1,
etc..
}�~�
{
"id": 2,
etc..
I then wondered: what are these characters? Seeing the structure, I initially thought that the first weird sign must be an opening bracket of an array ([) and the second one a comma (,), so that together they create an array of objects.
I then investigated the problem further by writing the rawdata to a file whenever a JSON parsing error is encountered. In an hour or so it saved about 1500 of these error files, meaning this happens a lot. I cat'ed a couple of these files in the terminal, of which I uploaded an example below:
A few things are interesting here:
The files always start with one of these weird signs.
The files appear to consist of multiple messages that should have been received separately. The weird signs separate those individual messages.
The files always end with an unfinished json object.
The files are of varying lengths. They are not always the same size and are thus not cut off on a specific length.
I'm not very experienced with websockets, but could it be that my websocket somehow receives a stream of messages that get concatenated together, with these weird signs as separators, and that the last message is then randomly cut off? Maybe because I'm getting a constant, very fast stream of messages?
Or could it be because of an error (or functionality) server side in that it combines those individual messages?
Does anybody know what's going on here? All tips are welcome!
[EDIT]
@bendataclear suggested interpreting it as UTF-8. So I did, and I pasted a screenshot of the results below. The first print is as-is, and the second one is interpreted as UTF-8. To me this doesn't look like anything. I could of course convert to UTF-8 and then split on those characters. Although the last message is always cut off, this would at least make some of the messages readable. Other ideas are still welcome though.
My assumption is that you're working only with English/ASCII characters and something has corrupted the stream. (Note: I am assuming there are no special characters.) If that's the case, I suggest you pass the entire JSON string through this function:
function cleanString(input) {
  var output = "";
  for (var i = 0; i < input.length; i++) {
    // keep only ASCII characters (code points 0-127)
    if (input.charCodeAt(i) <= 127) {
      output += input.charAt(i);
    }
  }
  return output;
}

// example
console.log(cleanString("�~�"));
You can refer to How to remove invalid UTF-8 characters from a JavaScript string?
EDIT
From an article by the Internet Engineering Task Force (IETF):
A common class of security problems arises when sending text data using the wrong encoding. This protocol specifies that messages with a Text data type (as opposed to Binary or other types) contain UTF-8-encoded data. Although the length is still indicated and applications implementing this protocol should use the length to determine where the frame actually ends, sending data in an improper […]
The "Payload data" is text data encoded as UTF-8. Note that a particular text frame might include a partial UTF-8 sequence; however, the whole message MUST contain valid UTF-8. Invalid UTF-8 in reassembled messages is handled as described in Handling Errors in UTF-8-Encoded Data, which states that When an endpoint is to interpret a byte stream as UTF-8 but finds that the byte stream is not, in fact, a valid UTF-8 stream, that endpoint MUST Fail the WebSocket Connection. This rule applies both during the opening handshake and during subsequent data exchange.
I really believe that your error (or functionality) is coming from the server side, which combines your individual messages. So I suggest you come up with logic that ensures all your characters are encoded as valid UTF-8 before the messages are sent. You might also want to install utf-8-validate (npm install --save-optional utf-8-validate) to efficiently check whether a message contains valid UTF-8, as required by the spec.
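For illustration, a minimal sketch of that check (the utf-8-validate package exports a single validation function; the surrounding handler mirrors the code from the question):

const isValidUTF8 = require('utf-8-validate');

this.ws.on('message', (rawdata) => {
  const buf = Buffer.from(rawdata);
  if (!isValidUTF8(buf)) {
    // drop frames whose bytes are not valid UTF-8
    console.log('Received a frame with invalid UTF-8, skipping');
    return;
  }
  try {
    mycallback(JSON.parse(buf.toString('utf8')));
  } catch (e) {
    console.log('Valid UTF-8 but not valid JSON: ' + buf.toString('utf8'));
  }
});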
You might also want to add a condition so that only text frames are processed (in the ws versions of that era, text frames arrive as strings and binary frames as Buffers):

this.ws.on('message', (rawdata) => {
  if (typeof rawdata !== 'string') {
    return; // accept only text frames
  }
  // ...
});
I hope this helps.
The problem you have is that one side sends the JSON in a different encoding than the one the other side interprets it in.
Try to solve this problem with the following code:
const { StringDecoder } = require('string_decoder');

this.ws.on('message', (rawdata) => {
  const decoder = new StringDecoder('utf8');
  const buffer = Buffer.from(rawdata);
  console.log(decoder.write(buffer));
});
Or with UTF-16 (note that Node's StringDecoder calls this encoding 'utf16le'):

const { StringDecoder } = require('string_decoder');

this.ws.on('message', (rawdata) => {
  const decoder = new StringDecoder('utf16le');
  const buffer = Buffer.from(rawdata);
  console.log(decoder.write(buffer));
});
Please read: String Decoder Documentation
It seems your output contains some stray characters. If you have spaces or special characters, use their Unicode escape sequences to represent them.
Here is the list of Unicode characters
This might help, I think.
Those characters are known as the "REPLACEMENT CHARACTER", used to replace an unknown, unrecognized or unrepresentable character.
From: https://en.wikipedia.org/wiki/Specials_(Unicode_block)
The replacement character � (often a black diamond with a white question mark or an empty square box) is a symbol found in the Unicode standard at code point U+FFFD in the Specials table. It is used to indicate problems when a system is unable to render a stream of data to a correct symbol. It is usually seen when the data is invalid and does not match any character
Checking section 8, Error Handling, of the WebSocket protocol:
8.1. Handling Errors in UTF-8 from the Server
When a client is to interpret a byte stream as UTF-8 but finds that the byte stream is not in fact a valid UTF-8 stream, then any bytes or sequences of bytes that are not valid UTF-8 sequences MUST be interpreted as a U+FFFD REPLACEMENT CHARACTER.
8.2. Handling Errors in UTF-8 from the Client
When a server is to interpret a byte stream as UTF-8 but finds that the byte stream is not in fact a valid UTF-8 stream, behavior is undefined. A server could close the connection, convert invalid byte sequences to U+FFFD REPLACEMENT CHARACTERs, store the data verbatim, or perform application-specific processing. Subprotocols layered on the WebSocket protocol might define specific behavior for servers.
How to deal with this depends on the implementation or library in use; for example, from the post Implementing Web Socket servers with Node.js:
socket.ondata = function (d, start, end) {
  //var data = d.toString('utf8', start, end);
  var original_data = d.toString('utf8', start, end);
  var data = original_data.split('\ufffd')[0].slice(1);
  if (data == "kill") {
    socket.end();
  } else {
    sys.puts(data);
    socket.write("\u0000", "binary");
    socket.write(data, "utf8");
    socket.write("\uffff", "binary");
  }
};
In this case, if a � is found it will do:
var data = original_data.split('\ufffd')[0].slice(1);
if (data == "kill") {
socket.end();
}
Another thing you could do is update Node to the latest stable version; from the post OpenSSL and Breaking UTF-8 Change (fixed in Node v0.8.27 and v0.10.29):
As of these releases, if you try and pass a string with an unmatched surrogate pair, Node will replace that character with the unknown unicode character (U+FFFD). To preserve the old behavior set the environment variable NODE_INVALID_UTF8 to anything (even nothing). If the environment variable is present at all it will revert to the old behavior.
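As an aside, the splitting idea mentioned in the question's edit could look like the sketch below. It assumes the separator bytes decode to U+FFFD and that only the final fragment may be truncated:

this.ws.on('message', (rawdata) => {
  const text = rawdata.toString('utf8');
  for (const chunk of text.split('\ufffd')) {
    if (!chunk.trim()) continue; // skip empty fragments between separators
    try {
      mycallback(JSON.parse(chunk));
    } catch (e) {
      // most likely the truncated final message; ignore or buffer it
    }
  }
});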
When using a library to request some non-ASCII/UTF8 data, we often get back a string full of nonsense. Example:
const got = require("got");
got("http://twemoji.maxcdn.com/16x16/1f525.png")
.then(response => console.log(response.body))
This is the output:
�PNG
IHD��aaIDAT8�c`��L�fEb��?��8�-���#���5�!� ���|bQ\�$�� �ׁX�y�xT
���y#< �c�i��6$�K$
L÷���w��������_��Ϡ���d��?�j��2��� ��hX��cn������e"L����x�3�
��Y�f�N���
mt:����2e�f��N���~{'̀x�ȿ �;�m
�
�PIEND�B`� �vZ�]�dX<R�\�Y:������`�A�A��ӂƟ}�����#A�\�n����|�A� u83����,�{������#�#4��#��D�
Curiously, that is the same thing we see when downloading the image and using:
cat 1f525.png
What, exactly, is that string, why it looks like this, and how do we convert it to a proper Buffer object?
That's a PNG image, which is not text but plain binary data; it doesn't make sense to interpret it as a string.
got will return a string, buffer, readable stream, or object. console.log is converting your binary body into a string, which is not what you want. cat deals with text, not binary data.
Also, the response from http://twemoji.maxcdn.com/16x16/1f525.png does not include the Content-Type header, which could be throwing off the got library.
It's not really strange that you see the same output from cat; that is how a PNG image looks when interpreted as text.
According to the got documentation, it should return a buffer when you specify the encoding as null. Perhaps console.log is converting the buffer to a string.
Did you actually try to save the image to a file? Perhaps it'll just work.
By default, got fetches you a string. It assumes you want UTF-8 text data, since that's probably the most common case: people fetching HTML documents. From the documentation:
encoding
Type: string, null
Default: 'utf8'
Encoding to be used on setEncoding of the response data. If null, the body is returned as a Buffer.
If you want binary data instead, specify { encoding: null }:

const got = require("got");
got("http://twemoji.maxcdn.com/16x16/1f525.png", { encoding: null })
  .then(response => console.log(response.body))
Then response.body will be a Buffer.
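For example, a small usage sketch that writes the buffer straight to disk (the output filename is illustrative):

const fs = require("fs");
const got = require("got");

// with encoding: null, response.body is a Buffer and can be written as-is
got("http://twemoji.maxcdn.com/16x16/1f525.png", { encoding: null })
  .then(response => fs.writeFileSync("1f525.png", response.body));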
What I need to do is best described by example.
Previously, I had the following code:
content = u'<?xml version="1.0" encoding="windows-1251"?>\n' + ... #
with open(file_name, 'w') as f:
    f.write(content.encode('cp1251'))
    # the with block closes the file automatically
Now I want to modify the architecture of my entire app and send the string which is supposed to be the file content to client via JSON and to generate the file via javascript.
So, now my code looks something like this:
response_data = {}
response_data['file_content'] = content.encode('cp1251')
response_data['file_name'] = file_name
return JsonResponse({'content':json.dumps(response_data, ensure_ascii=False)}) # error generated
The problem is that I get UnicodeDecodeError: 'ascii' codec can't decode byte 0xd4 in position 53: ordinal not in range(128)
I also tried the second option this way:
response_data = {}
response_data['file_content'] = content
response_data['file_name'] = file_name
return JsonResponse({'content':json.dumps(response_data, ensure_ascii=False).encode('utf8')}) # error generated
Then, on the client, I try to convert UTF-8 to windows-1251.
$.post('/my_url/', data, function (response) {
  var file_content = JSON.parse(response.content).file_content;
  file_content = UnicodeToWin1251(file_content);
  ...
});

...but I get distorted symbols.
I know I am doing something terribly wrong here and am likely messing things up with the encoding, but I've spent an entire day on this issue and couldn't solve it. Could someone give a hint where my mistake is?
Both XML and JSON contain data that is Unicode text. The XML declaration merely tells your XML parser how to decode the XML serialisation of that data. You wrote the serialisation by hand, so to match the XML header you had to encode to CP-1251.
The JSON standard states that all JSON should be encoded in UTF-8, UTF-16 or UTF-32, with UTF-8 the default; again, this is just the encoding for the serialisation.
Leave your data as Unicode, then encode that data to JSON with the json library; the library takes care of ensuring you get UTF-8 data (in Python 2), or gives you Unicode text (Python 3) that can be encoded to UTF-8 later. Your Javascript code will then decode the JSON again at which point you have Unicode text again:
response_data = {}
response_data['file_content'] = content
response_data['file_name'] = file_name
return JsonResponse({'content':json.dumps(response_data, ensure_ascii=False)})
There is no need whatsoever to send binary data over JSON here; you are sending text. If your Javascript code then generates the file, it is responsible for encoding to CP-1251, not your Python code.
If you must put binary data in a JSON payload, you'll need to encode that payload as text. Binary data (and CP-1251-encoded text is binary data) can be encoded as text using Base64:
import base64

response_data = {}
# b64encode produces ASCII-only bytes; decode so the JSON library gets text
response_data['file_content'] = base64.b64encode(content.encode('cp1251')).decode('ascii')
response_data['file_name'] = file_name
return JsonResponse({'content': json.dumps(response_data, ensure_ascii=False)})
Base64 data is encoded to a bytestring containing only ASCII data, so decode it as ASCII for the JSON library, which expects text to be Unicode text.
Now you are sending binary data, wrapped in a Base64 text encoding, to the Javascript client, which now has to decode the Base64 if you need the binary payload there.
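To sketch what the client side of that Base64 route might look like (variable names are illustrative; atob is the browser's built-in Base64 decoder):

var payload = JSON.parse(response.content);
var binary = atob(payload.file_content); // Base64 -> binary string
var bytes = new Uint8Array(binary.length);
for (var i = 0; i < binary.length; i++) {
  bytes[i] = binary.charCodeAt(i); // each char code is one CP-1251 byte
}
// bytes now holds the CP-1251-encoded file content, ready for e.g. a Blob:
var blob = new Blob([bytes], { type: 'application/octet-stream' });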
I'm having problems using split to parse a text file.
The text file looks like this:
123.0 321.02
342.1 234.03
425.3 326.33
etc. etc.
When I read the file with a FileReader() and call readAsText on it, the contents appear as a string like this:
"123.0 321.02\r\n342.1 234.03\r\n ..." (How it appears in Firebug)
Currently I'm trying to split it like this:
var reader = new FileReader();
reader.readAsText(f);
alert(reader.result);
var readInStrings = reader.result.split(/|\s|\n|\r|/);
but when I do this, the resulting array has values as shown:
["123.0", "321.02", "", "342.1", "234.03", "" etc....]
Can anyone explain where the "" values in the array are coming from, and how to correctly split such a file so that only the number strings end up as values?
Any help would be greatly appreciated, thanks!
Note: currently doing this in JavaScript.
This is likely due to splitting on each newline and carriage-return character rather than on each run of such characters. To prevent this, you can cluster them in the regular expression, e.g. /\s+/ or something similar.
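A minimal sketch of that fix against the sample data from the question:

// trim first so no leading/trailing empties appear, then split on
// runs of whitespace
var readInStrings = reader.result.trim().split(/\s+/);
// -> ["123.0", "321.02", "342.1", "234.03", "425.3", "326.33"]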