json.loads in python containing non standard characters

What I'm trying to do is grab some text in my web application (string literals containing non-English characters such as ă) with JavaScript, put it in an object, call JSON.stringify() on the object, and then pass the result to a Python script.
The Python script is intended to load the JSON data and eventually print the text on a POS printer, so the final data has to be in ASCII hex format of a specific code page. That means I'll have to perform some character conversion either before or after the Python script gets the data.
Basically something like this:
someObject.arrayOfTextInputs.push("how can I handle the ă character");
-- run a Python script and pass to it: JSON.stringify(someObject) --
In Python:
import json
import sys
jsonStuff = sys.argv[1]
myObject = json.loads(jsonStuff)
Now if I simply pass the strings as is, the Python script hangs upon json.loads because of the ă character. If I replace the character prior to stringify with an '\xNN' representation matching a value from the code page I need, json.loads still hangs.
Same thing for using '\uNNNN'.
I also printed out the JSON I get before handing it over to json.loads(), and normally it just prints some weird hex(?) character/image instead of ă.
However, replacing it with its UTF-8 representation (in JavaScript), \xc4\x83, makes the print in Python display the character properly (although it creates problems in the next steps).
The same thing happens when replacing ă with \xC7, which is the matching character in code page 852 (Latin-2), and then calling jsonStuff.decode('cp852').
What are my options here?
Edit: Thanks for the welcome!
I am using Python 2.7, which from what I've gathered uses standard ASCII encoding by default.
If I skip any conversion I get the exception: ValueError: Invalid control character at some character/byte.
If I convert the character in JavaScript (before calling stringify() on the object) to the escape for the same character from a Unicode table, "\u0103", I get the same exception.
If I convert the character to some Unicode character that falls into the standard ASCII character set ("\u0045"), it loads fine. I guess the decoder can automatically map the "regular" Unicode characters to their ASCII equivalents.
Same thing for converting to, say, "\x45".
If I add the strict=False argument to the loads() function, I can load any escaped character, but then I'm not sure how to handle it in my Python script.
I have to admit, the stringify and loads() part really makes me lose track, since I'm starting out with UTF-8 in my IDE, using escape characters from a different encoding, then calling stringify (is only UTF-8 encoded content valid JSON?) and passing it into Python, which can't handle UTF-8 past the standard character set. And I have to end up with '\xNN' of a specific code page (say Latin-2) just before the print.
Should I try to pass anything using strict=False and handle it from there within Python, or is it possible to send everything encoded using some code page?
I'll add some code in a bit.
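One possible shape for the Python side, as a rough sketch rather than a confirmed fix: it assumes the payload reaches the script as ASCII-only JSON in which ă is written as the escape \u0103, and it is written for Python 3. The payload below is built with json.dumps only as a stand-in for the JavaScript side, and the function name to_printer_bytes is hypothetical.

```python
import json

# Stand-in for the JavaScript side (assumption): ASCII-only JSON text in which
# every non-ASCII character is written as a \uNNNN escape.
payload = json.dumps(
    {"arrayOfTextInputs": ["how can I handle the ă character"]},
    ensure_ascii=True,
)

def to_printer_bytes(json_text, code_page="cp852"):
    """Parse the JSON text, then encode each string for the printer's code page."""
    data = json.loads(json_text)
    # errors="replace" substitutes "?" for characters the code page lacks.
    return [s.encode(code_page, errors="replace")
            for s in data["arrayOfTextInputs"]]

lines = to_printer_bytes(payload)
print(payload)
print(lines)
```

The idea is to keep the text as Unicode until the very last step and convert to the code page only when producing bytes for the printer, instead of injecting '\xNN' values on the JavaScript side.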

Related

Writing special characters to excel file using xlsx-populate

I am using xlsx-populate to write to Excel files. Whenever the text contains special characters such as smart quotes or bullet points, they get printed as weird characters like â€™. However, if I replace them with their Unicode escapes (Right Single Quotation Mark with \u2019), they render correctly in the Excel sheet.
I tried JS libraries like slugify to convert the special characters into an ASCII version, but all of them convert every constituent byte of the special character individually rather than the character as a whole.
Currently, I replace the special characters with their Unicode escapes using a regex, but there will always be some character that I miss. Is there a better way to handle this problem?

I used iconv-lite library to first encode the string into 'win1252' and then decoded it with 'utf-8'. The resulting string renders correctly when written to excel.
iconv.decode(iconv.encode(string, 'win1252'), 'utf-8')
Thanks @JosefZ for suggesting this path.
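The same round trip can be sketched in Python to show why it works, assuming the garbled text is UTF-8 bytes that were at some point decoded as Windows-1252. The helper name fix_mojibake is hypothetical.

```python
def fix_mojibake(text):
    """Re-encode the mis-decoded text as Windows-1252 to recover the original
    bytes, then decode those bytes as the UTF-8 they really were."""
    return text.encode("cp1252").decode("utf-8")

garbled = "\u00e2\u20ac\u2122"   # the three characters â€™
repaired = fix_mojibake(garbled)
print(repaired)
```

This only helps when every character of the input really is such debris; text that was decoded correctly in the first place may raise an error or come out changed.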

JavaScript/NodeJS RTF CJK Conversions

I'm working on a Node module that parses RTF files and does some find and replace. I have already come up with a solution for special characters expressed as escaped Unicode, but have run into a wall when it comes to CJK characters. Is there an easy way to do these conversions in JavaScript, either with a library or built in?
Example:
An RTF file viewed in plain text contains:
Now testing symbols {鈴:200638d}
When parsed in NodeJS, this part of the file looks like:
Now testing symbols \{
\f1 \'e2\'8f
\f0 :200638d\}\
I understand that \f1 and \f0 denote font changes, and the \'e2\'8f block is the actual character... but how can I take \'e2\'8f and convert it back to 鈴, or conversely, convert 鈴 to \'e2\'8f?
I have tried looking up the character in different encodings and am not seeing anything that remotely resembles \'e2\'8f. I understand that the RTF control \'hh is "a hexadecimal value, based on the specified character set (may be used to identify 8-bit values)", or maybe the better definition comes from the Microsoft RTF spec: "%xHH (OCTET with the hexadecimal value of HH)", but I have no idea what to do with that information to get conversions going.

I was able to parse your sample file using my RTF parser and retrieve the correct character.
The key thing is the \fonttbl command, which, as the name suggests, defines the fonts used in the document. As part of the definition of each font, the \fcharset command determines the character set to be used with that font. You need to use this to correctly interpret the character data.
My parser maps the argument of \fcharset to a codeset name, which is then translated to a character set name that can be used to retrieve the correct Java Charset. Your character set handling will obviously be different as you are working in JavaScript, but hopefully this information will help you move forward.
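The byte-level step described above can be sketched as follows, in Python for brevity. This is an illustration, not the answerer's parser: the function names are hypothetical, and the charset passed in ("shift_jis" or "gbk" here) is an assumption that in a real document has to be derived from the font's \fcharset value.

```python
import re

def decode_rtf_hex(run, charset):
    """Collect the bytes from a run of \\'hh escapes and decode them
    with the character set of the active font."""
    raw = bytes(int(h, 16) for h in re.findall(r"\\'([0-9a-fA-F]{2})", run))
    return raw.decode(charset)

def encode_rtf_hex(text, charset):
    """Encode text in the font's character set and write each byte as \\'hh."""
    return "".join("\\'%02x" % b for b in text.encode(charset))

# Round trip with an assumed charset; the real one comes from \fcharset.
escaped = encode_rtf_hex("あ", "shift_jis")
print(escaped, decode_rtf_hex(escaped, "shift_jis"))
```

A two-byte sequence only makes sense when both \'hh escapes are decoded together, which is why the run is turned into bytes first and decoded once.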

Unexpected token error while json parsing for string containing special characters

On trying to parse the following string in Titanium Studio for a mobile app project, I get the error:
Unexpected token at profileSkills":"Analysis
des='[{"jobId":0,"jobPositionName":"NA","companyId":0,"companyDisplayName":"NA","profileSkills":"Analysis\r\nAnalysis\r\nQuality Assurance\r\nProject Management\r\nProgrammer Analyst\r\n"}]';
desjson=JSON.parse(des);
Can anyone tell me whether I can parse strings containing escape characters using JSON?
If not, could you tell me how to do it?

You need to encode the special characters with double-backslashes, because the JSON parser will expect them to be escaped.
var des='[{"jobId":0,"jobPositionName":"NA","companyId":0,"companyDisplayName":"NA","profileSkills":"Analysis\\r\\nAnalysis\\r\\nQuality Assurance\\r\\nProject Management\\r\\nProgrammer Analyst\\r\\n"}]';
If you are actually declaring the JSON string as a JavaScript string literal, then you have to account for the fact that when the JavaScript parser sees those escaped characters, it'll build a string with the real carriage return and line feed characters. The JSON parser coming along after that won't like them.
If, on the other hand, your JSON is really coming from a server, then the JSON "on the wire" should not have doubled backslashes.
I should also note that there's rarely any reason to put a JSON string as a literal in JavaScript code. It might as well be a JavaScript object literal, in most cases. (I acknowledge that there might be some reason for it, of course.)
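The rule can be checked outside the browser as well. The sketch below uses Python's json module, which by default rejects raw control characters inside strings in the same way; the sample strings are shortened, hypothetical versions of the one in the question.

```python
import json

# Real CR and LF characters inside the JSON string value: not valid JSON.
raw_newlines = '{"profileSkills": "Analysis\r\nQuality Assurance"}'
# Backslash followed by r and n: the escaped form a JSON parser expects.
escaped = '{"profileSkills": "Analysis\\r\\nQuality Assurance"}'

def parses(text):
    """Return True if the text is accepted by the strict JSON parser."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

print(parses(raw_newlines), parses(escaped))
# strict=False tells the parser to tolerate raw control characters.
print(json.loads(raw_newlines, strict=False))
```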

You have two \r\ in the string, that should be \r\n. Change those, and it validates as correct JSON.

Writing Javascript using UTF-16 character encoding

Here is what I am trying but am not sure how to get this working or if it is even possible -
I have an HTML page MyHTMLPage.htm and I want to src a Javascript from this HTML file. This is pretty straightforward. I plan to include a <script src = "MyJavascript.js"></script> tag in my HTML file and that should take care of it.
However, I want to create my Javascript file using UTF-16 encoding. So, I plan to use the following tag <script charset="UTF-16" src="MyJavascript.js"></script> in my HTML file to take care of that
Now the problem I am really stuck at is how do I create the Javascript using UTF-16 encoding - E.g. let's say my Javascript code is alert(1); I created my Javascript file with the contents as \u0061\u006c\u0065\u0072\u0074\u0028\u0031\u0029\u003b but that does not seem to execute as valid Javascript at runtime.
To summarize, here is what I have -
MyHTMLPage.html
...
...
...
<script charset="UTF-16" src="MyJavascript.js"></script>
...
...
...
MyJavascript.js
\u0061\u006c\u0065\u0072\u0074\u0028\u0031\u0029\u003b
When I open the HTML page in Firefox, I get the error - "Syntax error - Illegal character" right at the beginning of the MyJavascript.js file. I have also tried adding the BOM character "\ufeff" at the beginning of the above Javascript but I still get the same error.
I know I could write my JavaScript file as "alert(1);" and save it with UTF-16 encoding in a text editor, and the browser then runs it fine. However, is there a way I could use "\u" notation (or an alternate escape character) and still get the JavaScript to execute?
Thanks,

You are misunderstanding character encoding. Character encoding is a scheme of how characters are represented as bits behind the scenes.
You would not write \u004a in your file to "make it utf-16" as that is literally a sequence of 6 characters:
\, u, 0, 0, 4, a
And if you saved the above as utf-16, it would be represented as the following bytes (shown in hex):
005C0075
00300030
00340061
Had you saved it as utf-8 it would be:
5C753030
3461
Which takes 50% of the space and bandwidth. It takes even less to write that character literally ("J"): just a byte (4A) in utf-8.
The "\u"-notation is a way to reference any BMP character by just using a small set of ascii characters. If you were working with a text editor with no unicode support, you could write "\u2665", instead of literally writing "♥" and the browser would show it properly.
If you for some weird reason still want to use utf-16, simply write the code normally, save the file as utf-16 and serve it with the proper charset header.
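The byte counts above can be reproduced with a few lines of Python. This is only a check of the arithmetic; big-endian UTF-16 without a BOM is assumed so that the output matches the hex shown in the answer.

```python
escape_text = r"\u004a"   # six literal characters: \ u 0 0 4 a

utf16_hex = escape_text.encode("utf-16-be").hex()   # two bytes per character
utf8_hex = escape_text.encode("utf-8").hex()        # one byte per character
literal_hex = "J".encode("utf-8").hex()             # the character itself

print(len(escape_text), utf16_hex, utf8_hex, literal_hex)
```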

UTF-8 in HTML input added by JavaScript

I just don't get it.
My case is that my application sends all the needed GUI text as JSON at page startup from my PHP server. On my PHP server I have all special characters in the text written in UTF-8. Example: F&uuml;r
So on the client side I have exactly the same value, and it gets displayed nicely everywhere except in input fields. When I do this with JavaScript:
document.getElementById('myInputField').value = "F&Ouml;r";
Then it is written exactly like that without any transformation into the special character.
Did I understand something wrong in UTF-8 concepts?
Thanks for any hints.

The notation &uuml; has nothing particular to do with UTF-8. The use of character references is a common way of avoiding the need to use UTF-8; they can be used with any encoding, but if you use UTF-8, you don't need them.
The notation &uuml; is an HTML notation, not JavaScript. Whether it gets interpreted by HTML rules when it appears inside your JavaScript code depends on the context (like JavaScript inside an HTML document vs. a separate JavaScript file). This problem is best avoided by using either the characters as such or JavaScript notations for characters.
For example, &uuml; means the same as &#xfc;, i.e. U+00FC, ü (u with diaeresis). The JavaScript notation for this, for use inside string literals, is \u00fc (\u followed by exactly four hexadecimal digits). E.g., the following sets the value to "Für":
document.getElementById('myInputField').value = "F\u00fcr";

You're using what are called HTML entities to encode characters, which is not the same as UTF-8, but of course a UTF-8 string can include HTML entities.
I think the problem is that a value assigned to the input from JavaScript is not parsed as HTML, so entities in it are not decoded, and you have to use some other encoding when assigning the text input value. I think you have two options:
Decode the HTML entity on the client side. A quite ugly solution is to piggyback on the decoder available in the browser (I'm using jQuery in the example, but you probably get the point).
inputElement.value = $("<p/>").html("F&Ouml;r").text();
Another option, which I think is nicer, is to not send HTML entities in the server response but instead use proper UTF-8 encoding for all characters, which should work fine when put into text nodes or tag attributes. This assumes the HTML page uses UTF-8 encoding, of course.
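The second option can be sketched on the server side like this. The question's server is PHP; Python is used here only to illustrate the idea, and the field name "label" and the sample text are hypothetical.

```python
import html
import json

stored_text = "F&uuml;r"   # text kept with an HTML character reference

# Decode character references once on the server, so the client receives
# the real character instead of HTML notation.
decoded = html.unescape(stored_text)

# ensure_ascii=False keeps ü as a literal character (to be sent as UTF-8);
# the default would send the equally valid escape \u00fc instead.
as_utf8_json = json.dumps({"label": decoded}, ensure_ascii=False)
as_ascii_json = json.dumps({"label": decoded})

print(as_utf8_json)
print(as_ascii_json)
```

Either JSON form yields the same string once parsed on the client, so it can be assigned to the input's value directly.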
