I am using xlsx-populate to write to Excel files. Whenever the text contains special characters such as smart quotes or bullet points, they are printed as garbled sequences like ’. However, if I replace them with their Unicode escapes (for example, the right single quotation mark with \u2019), they render correctly in the Excel sheet.
I tried JS libraries like slugify to convert the special characters to their ASCII equivalents, but all of them converted each constituent byte of a special character individually rather than treating the character as a whole.
Currently, I replace each special character with its Unicode escape using a regex, but there will always be some character that I miss. Is there a better way to handle this problem?
I used the iconv-lite library to first encode the string as 'win1252' and then decode the resulting bytes as 'utf-8'. The resulting string renders correctly when written to Excel.
const iconv = require('iconv-lite');
const fixed = iconv.decode(iconv.encode(string, 'win1252'), 'utf-8');
Thanks @JosefZ for suggesting this path.
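If adding a dependency is not an option, the same round trip can be sketched with Node's built-in Buffer. This is only an illustration of why the encode/decode trick works, not how iconv-lite does it: the name repairMojibake and the lookup table are made up for this example, and it assumes the garbling came from UTF-8 bytes that were read as Windows-1252.

```javascript
// Windows-1252 differs from Latin-1 only in the 0x80-0x9F range.
// Map each of those characters back to the single byte it stands for.
const CP1252_HIGH = new Map([
  [0x20ac, 0x80], [0x201a, 0x82], [0x0192, 0x83], [0x201e, 0x84],
  [0x2026, 0x85], [0x2020, 0x86], [0x2021, 0x87], [0x02c6, 0x88],
  [0x2030, 0x89], [0x0160, 0x8a], [0x2039, 0x8b], [0x0152, 0x8c],
  [0x017d, 0x8e], [0x2018, 0x91], [0x2019, 0x92], [0x201c, 0x93],
  [0x201d, 0x94], [0x2022, 0x95], [0x2013, 0x96], [0x2014, 0x97],
  [0x02dc, 0x98], [0x2122, 0x99], [0x0161, 0x9a], [0x203a, 0x9b],
  [0x0153, 0x9c], [0x017e, 0x9e], [0x0178, 0x9f],
]);

// Hypothetical helper: re-encode as Windows-1252 bytes, then decode as UTF-8.
function repairMojibake(text) {
  const bytes = [];
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (CP1252_HIGH.has(cp)) {
      bytes.push(CP1252_HIGH.get(cp));
    } else if (cp <= 0xff) {
      bytes.push(cp); // same value in Latin-1 and Windows-1252
    } else {
      return text; // contains a character Windows-1252 cannot hold: leave as-is
    }
  }
  const decoded = Buffer.from(bytes).toString('utf8');
  // U+FFFD means the bytes were not valid UTF-8, so the text was not garbled this way.
  return decoded.includes('\ufffd') ? text : decoded;
}

console.log(repairMojibake('It\u00e2\u20ac\u2122s')); // It’s
```

Text that is already correct (plain ASCII, or a lone accented letter such as é) is returned unchanged, because re-decoding it would not produce valid UTF-8.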
I'm working on a Node module that parses RTF files and does some find and replace. I have already come up with a solution for special characters expressed in escaped Unicode here, but have run into a wall when it comes to CJK characters. Is there an easy way to do these conversions in JavaScript, either with a library or with something built in?
Example:
An RTF file viewed in plain text contains:
Now testing symbols {鈴:200638d}
When parsed in NodeJS, this part of the file looks like:
Now testing symbols \{
\f1 \'e2\'8f
\f0 :200638d\}\
I understand that \f1 and \f0 denote font changes, and the \'e2\'8f block is the actual character... but how can I take \'e2\'8f and convert it back to 鈴, or conversely, convert 鈴 to \'e2\'8f?
I have tried looking up the character in different encodings and am not seeing anything that remotely resembles \'e2\'8f. I understand that the RTF control \'hh is "a hexadecimal value, based on the specified character set (may be used to identify 8-bit values)" (source). The Microsoft RTF spec may define it better: "%xHH (OCTET with the hexadecimal value of HH)" (download). But I have no idea how to use that information to get these conversions working.
I was able to parse your sample file using my RTF parser and retrieve the correct character.
The key is the \fonttbl command, which, as the name suggests, defines the fonts used in the document. As part of each font's definition, the \fcharset command determines the character set to be used with that font. You need this to interpret the character data correctly.
My parser maps the argument of \fcharset to a codeset name here; this is then translated to a character set name, which can be used to retrieve the correct Java Charset here. Your character set handling will obviously be different as you are working in JavaScript, but hopefully this information will help you move forward.
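In JavaScript, the same idea could look roughly like the sketch below. It is not taken from the parser mentioned above: the table FCHARSET_TO_ENCODING and the function decodeRtfHexRun are hypothetical names, the table covers only a handful of \fcharset values, and it relies on Node's global TextDecoder being built with full ICU data (the default in current releases). Which entry applies to \f1 depends on the font table of the actual file, which is not shown in the question.

```javascript
// A partial mapping from RTF \fcharset values to WHATWG encoding labels.
const FCHARSET_TO_ENCODING = {
  0: 'windows-1252',   // ANSI
  128: 'shift_jis',    // Japanese
  129: 'euc-kr',       // Korean (Hangul)
  134: 'gbk',          // Simplified Chinese (GB2312)
  136: 'big5',         // Traditional Chinese
  161: 'windows-1253', // Greek
  204: 'windows-1251', // Russian
  238: 'windows-1250', // Eastern European
};

// Collect every \'hh escape in the run as a byte, then decode the bytes
// with the character set that the current font's \fcharset selects.
function decodeRtfHexRun(run, fcharset) {
  const encoding = FCHARSET_TO_ENCODING[fcharset];
  if (!encoding) {
    throw new Error(`No mapping for \\fcharset${fcharset}`);
  }
  const bytes = [];
  const pattern = /\\'([0-9a-fA-F]{2})/g;
  let match;
  while ((match = pattern.exec(run)) !== null) {
    bytes.push(parseInt(match[1], 16));
  }
  return new TextDecoder(encoding).decode(Uint8Array.from(bytes));
}

// Two bytes under Shift JIS form one character; one byte under Greek forms one.
console.log(decodeRtfHexRun("\\'82\\'a0", 128)); // あ
console.log(decodeRtfHexRun("\\'e1", 161));      // α
```

Going the other way (character to \'hh) needs an encoder for the legacy character set, which TextEncoder does not provide, so a library such as iconv-lite would be needed for that direction.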
I am using CKEditor (version 4) and users enter Greek text. Most characters are converted into HTML entities, but the accented characters seem to be ignored.
On my page and within my database the text is stored and can be displayed, but when I create an email with the entered text the characters are not shown correctly.
Is there a config setting that I can use so that all Greek characters are converted? Or do I have to try to change these characters manually?
This sounds like an encoding issue rather than a matter of converting the characters into entities. Presumably the text is being stored as UTF-8 in the database, and the web pages are also in the same encoding, so the characters get displayed correctly.
When it comes to the email, I suspect that it's being encoded in another encoding like ASCII or ISO-8859-1 (Latin characters only) so the Greek text will appear as a mix of letters and punctuation rather than as it should look.
You don't mention what you're using to send the email, but I'd probably look there first to see if there's a way of setting the character encoding to the same encoding as the data is stored in the database.
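As an illustration of what "setting the character encoding" means at the message level, here is a minimal sketch that assembles a raw message by hand in Node. The function buildUtf8HtmlMail and its parameter names are invented for this example; a real mail library will normally have its own option for this, and the sketch ignores details such as splitting very long subjects into several encoded words.

```javascript
// Build a raw HTML email whose headers declare UTF-8 explicitly.
function buildUtf8HtmlMail({ from, to, subject, html }) {
  // RFC 2047 encoded-word, so non-ASCII subject text survives in a header.
  const encodedSubject =
    `=?UTF-8?B?${Buffer.from(subject, 'utf8').toString('base64')}?=`;
  // Base64 keeps the body 7-bit clean; wrap it at 76 characters per line.
  const base64Body = Buffer.from(html, 'utf8').toString('base64');
  const wrappedBody = (base64Body.match(/.{1,76}/g) || []).join('\r\n');
  return [
    `From: ${from}`,
    `To: ${to}`,
    `Subject: ${encodedSubject}`,
    'MIME-Version: 1.0',
    'Content-Type: text/html; charset=UTF-8',
    'Content-Transfer-Encoding: base64',
    '',
    wrappedBody,
  ].join('\r\n');
}

const message = buildUtf8HtmlMail({
  from: 'sender@example.com',
  to: 'recipient@example.com',
  subject: 'Καλημέρα',
  html: '<p>Καλημέρα, τι κάνετε;</p>',
});
console.log(message);
```

The important part is the Content-Type header: if it says charset=UTF-8 and the body really is UTF-8, the accented Greek characters need no entity conversion at all.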
What I'm trying to do is grab some text in my web application (string literals containing non-English characters such as ă) with JavaScript, put it in an object, call JSON.stringify() on the object, and then pass the result to a Python script.
The Python script is intended to load the JSON data and eventually print the text on a POS printer, so the final data has to be in ASCII hex format for a specific code page. That means I'll have to perform some character conversion either before or after the Python script gets the data.
Basically something like this:
someObject.arrayOfTextInputs.push("how can I handle the ă character");
-- run a Python script and pass to it: JSON.stringify(someObject) --
In Python:
import json
import sys
jsonStuff = sys.argv[1]
myObject = json.loads(jsonStuff)
Now if I simply pass the strings as is, the Python script hangs upon json.loads because of the ă character. If I replace the character prior to stringify with a '\xNN' representation matching a value from the code page I need, json.loads still hangs.
Same thing for using '\uNNNN'.
I also printed out the JSON I get before handing it over to json.loads(), and normally it just prints some weird hex-looking character or image instead of ă.
However, replacing it with its UTF-8 representation (in JavaScript), \xc4\x83, makes the print in Python display the character properly (although it creates problems in the next steps).
The same thing happens when replacing ă with \xC7, which is the matching character in code page 852 (Latin-2), and then calling jsonStuff.decode('cp852').
What are my options here?
Edit: Thanks for the welcome!
I am using Python 2.7, which from what I've gathered uses standard ASCII encoding by default.
If I skip any conversion I get the exception: ValueError: Invalid control character at some character/byte.
If I convert the character in JavaScript (before calling stringify() on the object) to an escape matching the same character from a Unicode table, "\u0103", I get the same exception.
If I convert the character to some Unicode character that falls into the standard ASCII character set ("\u0045"), it loads fine. I guess the decoder can automatically map the "regular" Unicode characters to their ASCII representatives.
Same thing for converting to, say, "\x45".
If I add the strict=False argument to the loads() function, I can load any escaped character, but then I'm not sure how to handle it in my Python script.
I have to admit, the stringify and loads() part really makes me lose track: I'm starting out with UTF-8 in my IDE, using escape characters from a different encoding, then calling stringify (is only UTF-8 encoded stuff valid JSON?) and passing the result into Python, which can't handle UTF-8 past the standard character set. And I have to end up with '\xNN' of a specific code page (say Latin-2) just before the print.
Should I try to pass anything using strict=False and handle it from there within Python, or is it possible to send everything encoded using some code page?
I'll add some code in a bit.
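One option worth trying on the JavaScript side is to make the JSON text itself pure ASCII, so that nothing outside 7-bit ASCII ever crosses the command line. The sketch below is an assumption about what would help, not a confirmed fix for this setup; toAsciiJson is a made-up helper name. Note that writing "\u0103" inside a JavaScript string literal produces the character ă itself, so the escaping has to happen after JSON.stringify, on the JSON text.

```javascript
// Serialize to JSON, then replace every non-ASCII UTF-16 code unit with a
// literal backslash-u escape. The result is valid JSON made of ASCII only.
function toAsciiJson(value) {
  return JSON.stringify(value).replace(/[\u0080-\uffff]/g, (ch) =>
    '\\u' + ch.charCodeAt(0).toString(16).padStart(4, '0'));
}

const someObject = { arrayOfTextInputs: [] };
someObject.arrayOfTextInputs.push('how can I handle the ă character');

const payload = toAsciiJson(someObject);
console.log(payload);
// {"arrayOfTextInputs":["how can I handle the \u0103 character"]}

// Any JSON parser turns the escape back into the original character.
console.log(JSON.parse(payload).arrayOfTextInputs[0]);
// how can I handle the ă character
```

On the Python side, json.loads on such a payload returns unicode strings, and calling .encode('cp852') on one of them should then give the code page bytes the printer expects (\xC7 for ă, per the table mentioned above).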
Here is what I am trying but am not sure how to get this working or if it is even possible -
I have an HTML page MyHTMLPage.htm and I want to src a Javascript from this HTML file. This is pretty straightforward. I plan to include a <script src = "MyJavascript.js"></script> tag in my HTML file and that should take care of it.
However, I want to create my JavaScript file using UTF-16 encoding. So, I plan to use the following tag in my HTML file to take care of that: <script charset="UTF-16" src="MyJavascript.js"></script>
Now the problem I am really stuck at is how do I create the Javascript using UTF-16 encoding - E.g. let's say my Javascript code is alert(1); I created my Javascript file with the contents as \u0061\u006c\u0065\u0072\u0074\u0028\u0031\u0029\u003b but that does not seem to execute as valid Javascript at runtime.
To summarize, here is what I have -
MyHTMLPage.html
...
...
...
<script charset="UTF-16" src="MyJavascript.js"></script>
...
...
...
MyJavascript.js
\u0061\u006c\u0065\u0072\u0074\u0028\u0031\u0029\u003b
When I open the HTML page in Firefox, I get the error "Syntax error - Illegal character" right at the beginning of the MyJavascript.js file. I have also tried adding the BOM character "\ufeff" at the beginning of the above JavaScript, but I still get the same error.
I know I could create my JavaScript file as "alert(1);", save it with UTF-16 encoding in a text editor, and the browser would run it fine. However, is there a way I could use "\u" notation (or an alternate escape character) and still get the JavaScript to execute?
Thanks,
You are misunderstanding character encoding. Character encoding is a scheme of how characters are represented as bits behind the scenes.
You would not write \u004a in your file to "make it utf-16" as that is literally a sequence of 6 characters:
\, u, 0, 0, 4, a
And if you saved the above as utf-16, it would be represented as the following bytes (shown in hex, big-endian):
005C0075
00300030
00340061
Had you saved it as utf-8 it would be:
5C753030
3461
Which takes 50% of the space and bandwidth. It takes even less to write that character literally ("J"): just one byte (4A) in utf-8.
The "\u" notation is a way to reference any BMP character inside a string literal using only a small set of ASCII characters. If you were working with a text editor with no Unicode support, you could write "\u2665" instead of literally writing "♥", and the browser would show it properly.
If you for some weird reason still want to use utf-16, simply write the code normally, save the file as utf-16 and serve it with the proper charset header.
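The byte counts above can be checked in Node, and the same Buffer API can produce a UTF-16 file from normally written source. This is a small sketch rather than a recommended build step; encodeUtf16WithBom is a hypothetical helper. Node's Buffer only offers little-endian UTF-16 ('utf16le'), so the bytes come out swapped relative to the big-endian listing above.

```javascript
// The six characters backslash, u, 0, 0, 4, a (not the single character "J").
const escaped = '\\u004a';

const byteCounts = {
  escapedUtf16: Buffer.from(escaped, 'utf16le').length,
  escapedUtf8: Buffer.from(escaped, 'utf8').length,
  literalUtf8: Buffer.from('J', 'utf8').length,
};
console.log(byteCounts); // { escapedUtf16: 12, escapedUtf8: 6, literalUtf8: 1 }

// Inside a string literal the parser turns the escape into the character.
console.log('\u004a' === 'J'); // true

// Encode ordinary source text as UTF-16LE with a leading byte order mark.
function encodeUtf16WithBom(source) {
  return Buffer.from('\ufeff' + source, 'utf16le');
}

const fileBytes = encodeUtf16WithBom('alert(1);');
console.log(fileBytes.subarray(0, 4).toString('hex')); // fffe6100
```

Passing fileBytes to fs.writeFileSync would give a MyJavascript.js that contains plain alert(1); stored as UTF-16, which is what the charset="UTF-16" attribute expects.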
I need to create an EBCDIC string within my JavaScript and save it into an EBCDIC database. A process on the EBCDIC system then uses the data. I hadn't had any problems until I came across the character '¬'. In EBCDIC it has the hex value 5F. All of the usual letters and symbols seem to automagically convert with no problem. Any idea how I can create the EBCDIC value for '¬' within JavaScript so I can store it properly in the EBCDIC db?
Thanks!
If "all of the usual letters and symbols seem to automagically convert", then I very strongly suspect that you do not have to create an EBCDIC string in Javascript. The character codes for Latin letters and digits are completely different in EBCDIC than they are in Unicode, so something in your server code is already converting the strings.
Thus what you need to determine is how that process works, and specifically you need to find out how the translation maps character codes from Unicode source into the EBCDIC equivalents. Once you know that, you'll know what Unicode character to use in your Javascript code.
As a further note: every single time I've been told by an IT organization that their mainframe software requires that data be supplied in EBCDIC, that advice has been dead wrong. The fact that there's some external interface means that something in the pile of iron that makes up the mainframe and its tentacles, something the IT people have forgotten about and probably couldn't find if they needed to, is already mapping "real world" character encodings like Unicode into EBCDIC. How does it work? Well, it may be impossible to figure out.
You might try whether this works: var notSign = "\u00AC";
edit: also: here's a good reference for HTML entities and Unicode glyphs: http://www.elizabethcastro.com/html/extras/entities.html The HTML/XML syntax uses decimal numbers for the character codes. For Javascript, you have to convert those to hex, and the notation in Javascript strings is "\u" followed by a 4-digit hex constant. (That reference isn't complete, but it's pretty easy to read and it's got lots of useful symbols.)
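The decimal-to-hex conversion described above is easy to automate. The helper below is a hypothetical illustration (entityToJsEscape is not a built-in), and it only handles decimal numeric entities inside the BMP.

```javascript
// Turn a decimal HTML entity such as "&#172;" into the matching
// JavaScript "\uXXXX" escape sequence, returned as text.
function entityToJsEscape(entity) {
  const match = /^&#(\d+);$/.exec(entity);
  if (!match) {
    throw new Error(`Not a decimal numeric entity: ${entity}`);
  }
  const codePoint = Number(match[1]);
  if (codePoint > 0xffff) {
    throw new Error('Outside the BMP; a single \\uXXXX escape cannot express it');
  }
  return '\\u' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

console.log(entityToJsEscape('&#172;'));            // \u00AC
console.log(String.fromCharCode(172) === '\u00AC'); // true
```

Decimal 172 is hex AC, which is why "\u00AC" is the escape for the not sign.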