Automatic superscript chars like © in page with JS? - javascript

My client has a problem: their CMS doesn't properly handle symbols like the TM or copyright symbols, and they have no way to format a symbol as superscript.
So I was thinking of solving it client-side with JS.
What would be the best practice?
How can I detect a specific char like © and make it small and top-aligned next to the word it is associated with?

If the CMS won't handle the copyright symbol (U+00a9), there is something quite wrong with it. Assuming that problem cannot be fixed, you have to "encode" such characters somehow. Under this solution, anyone writing anything or reading anything from the CMS is going to have to be conscious of this encoding, and do the appropriate encoding on the way in and decoding on the way out. This is not a happy path.
For instance, assume the CMS has an editor. How is the user going to input the "special" characters in the editor? Is the editor going to be modified to handle the necessary encoding and decoding?
Anyway, assuming you do decide to go the encoding route, which encoding should you choose? Others have suggested encoding using HTML entities such as &copy;. This is probably not the best solution. First, it assumes the content is always going to be output in an HTML environment. Possibly more importantly, it cannot handle characters which do not have HTML entity encodings. Therefore, using an encoding such as the JS string encoding ("Bad Idea\u00a9") is probably your best bet. If the API to the CMS uses JSON, everything should pretty much work.
Alternative encodings you might consider are URI encoding or BASE64 encoding, but neither seems like a wonderful idea.
Having said that, you seem to be fuzzy on the distinction between character encodings and formatting. You say
> how to detect a specific char like © and make it small and align top
If you already have a real copyright symbol, then you don't need to do anything, because all fonts will already display it correctly. For instance, if you have encoded the copyright symbol in your database as \u00a9 and are sending that down in JSON, it will already be a copyright symbol, and will be displayed correctly.
Or are you proposing to store the copyright symbol in the CMS as the three characters "(c)", and treat that as a copyright symbol for formatting/display purposes? In that case, yes, you would need to detect such sequences and wrap them in a bit of HTML which applies the relevant CSS properties.
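The detect-and-wrap step described above could look roughly like the sketch below. The function names and the "sym" class are made up for illustration, and the code assumes the symbols appear only in text content, not inside tag attributes.

```javascript
// Hypothetical helper: wraps each ©, ® or ™ in a <sup> element.
// Assumes `html` contains these symbols only in text, not in attributes.
function superscriptSymbols(html) {
  // $& in the replacement string stands for the matched character.
  return html.replace(/[\u00A9\u00AE\u2122]/g, '<sup class="sym">$&</sup>');
}

// Hypothetical helper for the "(c)" convention: turn the three-character
// sequence into a real copyright symbol before wrapping.
function normalizeCopyright(text) {
  return text.replace(/\(c\)/gi, '\u00A9');
}

const wrapped = superscriptSymbols(normalizeCopyright('Acme(c) 2020'));
```

A CSS rule along the lines of sup.sym { font-size: 0.7em; vertical-align: super; } would then control the size and alignment. On a live page it is usually safer to walk the text nodes than to rewrite a whole innerHTML string, since the latter discards event handlers attached to the replaced elements.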

Related

JavaScript/NodeJS RTF CJK Conversions

I'm working on a Node module that parses RTF files and does some find and replace. I have already come up with a solution for special characters expressed in escaped Unicode, but I have run into a wall when it comes to CJK characters. Is there an easy way to do these conversions in JavaScript, either with a library or built in?
Example:
An RTF file viewed in plain text contains:
Now testing symbols {鈴:200638d}
When parsed in NodeJS, this part of the file looks like:
Now testing symbols \{
\f1 \'e2\'8f
\f0 :200638d\}\
I understand that \f1 and \f0 denote font changes, and the \'e2\'8f block is the actual character... but how can I take \'e2\'8f and convert it back to 鈴, or conversely, convert 鈴 to \'e2\'8f?
I have tried looking up the character in different encodings and am not seeing anything that remotely resembles \'e2\'8f. I understand that the RTF control \'hh is "a hexadecimal value, based on the specified character set (may be used to identify 8-bit values)" or, to use what may be the better definition from the Microsoft RTF spec, "%xHH (OCTET with the hexadecimal value of HH)", but I have no idea what to do with that information to get these conversions working.
I was able to parse your sample file using my RTF parser and retrieve the correct character.
The key thing is the \fonttbl command, which, as the name suggests, defines the fonts used in the document. As part of the definition of each font, the \fcharset command determines the character set to be used with that font. You need to use this to correctly interpret the character data.
My parser maps the argument of \fcharset to a codeset name, which is then translated to a character set name that can be used to retrieve the correct Java Charset. Your character set handling will obviously be different as you are working in JavaScript, but hopefully this information will help you move forward.
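In JavaScript, the same idea might be sketched with the built-in TextDecoder, as below. The lookup table is an illustrative subset, the helper name is hypothetical, and treating \fcharset0 as windows-1252 is an assumption (the document's \ansicpg value may say otherwise). Decoding the legacy CJK labels in Node.js requires a build with full ICU data.

```javascript
// Illustrative subset of \fcharset values mapped to encoding labels
// that TextDecoder understands. Extend as needed for your documents.
const FCHARSET_TO_LABEL = {
  0: 'windows-1252', // ANSI (assumption: depends on \ansicpg)
  128: 'shift_jis',  // Japanese
  129: 'euc-kr',     // Korean
  134: 'gbk',        // Simplified Chinese (GB2312)
  136: 'big5',       // Traditional Chinese
};

// Hypothetical helper: collects the bytes of a run of \'hh escapes and
// decodes them together, since one CJK character spans several bytes.
function decodeRtfHexRun(run, label) {
  const bytes = [];
  const pattern = /\\'([0-9a-fA-F]{2})/g;
  let match;
  while ((match = pattern.exec(run)) !== null) {
    bytes.push(parseInt(match[1], 16));
  }
  return new TextDecoder(label).decode(Uint8Array.from(bytes));
}
```

For the sample in the question, the call would be something like decodeRtfHexRun("\\'e2\\'8f", FCHARSET_TO_LABEL[n]), where n is the \fcharset value declared for font f1 in the file's \fonttbl.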

JavaScript - Is filtering '<' good enough to secure HTML before displaying?

In JavaScript, is there any known string that can cause mischief if we filter out all 'less than' ('<') characters then display the result as HTML?
var str = GetDangerousString().toString();
var secure = str.replace(/</g, '');
$('#safe').html(secure); // or
document.getElementById('safe').innerHTML = secure;
This question addresses sanitizing ID's in particular. I'm looking for a general HTML string. Ideal answer is the simplest working example of a string that would inject code or other potentially dangerous elements.
That's not enough for sure... You need to HTML encode any HTML you embed in your pages that you want to be editable by an end user. Otherwise, you need to sanitize it.
You can find out more at the OWASP site.
EDIT: In response to your comment, I'm not 100% sure. It sounds like double encoding will get you in some cases if you're not careful.
https://www.owasp.org/index.php/Double_Encoding
For example, this string from that page is supposed to demonstrate an exploit that hides the "<" character:
%253Cscript%253Ealert('XSS')%253C%252Fscript%253E
Also, the character "<" can be encoded lots of different ways in HTML, as suggested by this table:
https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet#Character_escape_sequences
So to me, that's the thing to be careful of - the fact that there may be exploitable cases that are hard to understand, but may leave you open.
But back to your original question - can you give me an example of HTML that renders as HTML that doesn't contain the "<" character? I'm trying to understand what HTML you want users to be able to use that would be in an "id".
Also, if your site is small, if you're open to rewriting parts of it (specifically how you use javascript in your pages), then you could consider using Content Security Policies to protect your users from XSS. This works in most modern browsers, and would protect lots of your users from XSS attacks if you were to take this step.
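As a rough illustration of the HTML-encoding step recommended earlier in this answer, here is a minimal sketch. The function name is hypothetical, and it is only suitable for text placed in element content or quoted attribute values, not inside script blocks, URLs or CSS.

```javascript
// Hypothetical helper: encodes the five characters that are significant
// in HTML text and quoted attribute values. The ampersand must be
// replaced first so the other entities are not double-encoded.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

const safe = escapeHtml('<img src=x onerror="alert(1)">');
```

In a browser, assigning the untrusted string to an element's textContent instead of innerHTML avoids the need for manual encoding in the simple case.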

using unicode in Javascript

In JavaScript we can use the line of code below (which uses a Unicode escape) to display the copyright symbol:
var x = "\u00A9 RPeripherals";
Why can't we type the copyright symbol directly using its Alt code (Alt+0169), like below:
var x = "© RPeripherals" ;
What is the difference between these two methods?
> Why can't we type the copyright symbol directly using ALT code (alt+0169) like below :
Who says so? Of course you can. Just configure your code editor to use UTF-8 encoding for source files. You should never use anything else to begin with...
> What is the difference between these two methods?
The difference is that using the \uXXXX scheme you are transmitting at best 2 and at worst 5 extra bytes on the wire. This kind of spelling may help if you need to embed characters in your source code, which your font cannot display properly. For example, I don't have traditional Chinese characters in the font I'm using for programming, so if I type Chinese characters into my code editor, I'll see a bunch of question marks or rectangles with Unicode codepoint digits instead of actual characters. But someone who has Chinese glyphs in the font wouldn't have that problem.
If that person and I want to share our source code, it would be preferable for them to use the \uXXXX scheme, as I would then be able to verify which character it is by looking it up in the Unicode table. That's about all the difference.
EDIT
The ECMAScript standard (ECMA-262, edition 5.1) says specifically that
> A conforming implementation of this Standard shall interpret
> characters in conformance with the Unicode Standard, Version 3.0 or
> later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted
> encoding form, implementation level 3. If the adopted ISO/IEC 10646-1
> subset is not otherwise specified, it is presumed to be the BMP
> subset, collection 300. If the adopted encoding form is not otherwise
> specified, it presumed to be the UTF-16 encoding form.
So the standard guarantees that the character encoding is Unicode and specifies UTF-16 as the encoding form (that's strange, I thought it was UTF-8), but I don't think that this is what happens in practice... I believe that browsers use UTF-8 as the default. Perhaps this has changed in later editions, but this is the last one universally accepted.
> Why can't we directly type the copyright symbol
You can, because JavaScript engines are capable of parsing UTF-8 encoded source files.
> What is the difference between these two methods?
One is short, requires the source file be encoded in an encoding that supports the character, and requires that you type a character that isn't printed on the keyboard's buttons.
The other is (comparatively) long, can be expressed entirely in ASCII, and can be typed with characters printed on the buttons of a standard keyboard.
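A quick way to check both points (a sketch, assuming the file containing it is saved and served as UTF-8; the variable names are arbitrary):

```javascript
// Both spellings produce the same string value at runtime.
const escaped = '\u00A9 RPeripherals';
const literal = '© RPeripherals';
const sameString = escaped === literal; // true

// Size of each spelling in the source file, measured as UTF-8 bytes.
const encoder = new TextEncoder();
const literalBytes = encoder.encode('©').length;      // 2 bytes
const escapeBytes = encoder.encode('\\u00A9').length; // 6 bytes (backslash, u, 4 hex digits)
```

So for this particular character the escape costs four extra bytes in the source, and nothing at all once the string has been parsed.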

How do I keep my UTF-8 characters from becoming junk?

I'm creating a simple JavaScript multiple choice game. Here is a sample question:
p ∧ q ≡ q ∧ p by which rule?
When I run it on localhost, it works fine, it prints out those special characters. However, when I upload it to my school's server, it prints out garbage:
p ∨ q ≡ q ∨ p by which rule?
I have this at the top of my HTML:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
I can't use PHP in my assignment, or I'd use header('Content-Type: text/xml, charset=utf-8');
If you want, I can give a link... but I'd rather not because then everyone can see my really bad educational game...
How can I keep my UTF-8 characters?
Edit: I found out that if I Filezilla my files up to the server and download them from the server, the characters become little squares. I don't know if that's useful information.
> Edit: I found out that if I Filezilla my files up to the server and download them from the server, the characters become little squares. I don't know if that's useful information.
Yes, FileZilla is corrupting your files in transit. Make sure FileZilla transfers your files in binary mode so that the text doesn't get corrupted. If it's transferring in ASCII mode, it will try to fix newlines and unrecognized characters.
If you cannot easily fix the HTTP headers, escape from the problem by using “character escapes.” If e.g. “∧” occurs in HTML content, use &#x2227; for it. If it occurs in a JavaScript string literal, use \u2227 for it.
To check out the codes for other characters, consult e.g.
http://www.alanwood.net/unicode/mathematical_operators.html
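Rather than looking up each code by hand, the character escapes suggested above can be generated with a small script. This is only a sketch, and both function names are hypothetical.

```javascript
// Hypothetical helper: rewrites every non-ASCII UTF-16 code unit as a
// \uXXXX escape, for pasting into a JavaScript string literal.
function toJsEscapes(text) {
  return text.replace(/[^\x00-\x7F]/g, (ch) =>
    '\\u' + ch.charCodeAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

// Hypothetical helper: rewrites every non-ASCII character as a
// hexadecimal character reference, for pasting into HTML content.
function toHtmlCharRefs(text) {
  return Array.from(text, (ch) => {
    const codePoint = ch.codePointAt(0);
    return codePoint > 0x7f
      ? '&#x' + codePoint.toString(16).toUpperCase() + ';'
      : ch;
  }).join('');
}

const jsForm = toJsEscapes('p ∧ q');
const htmlForm = toHtmlCharRefs('p ∧ q');
```

The resulting files contain only ASCII, so they survive a wrong charset header or an ASCII-mode transfer unchanged.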
Copying and pasting the questions into Notepad, or any other app that allows you to save as UTF-8, might work if that is a viable option.
I think you could also use a regex to identify the hex values and replace them with the corresponding values that would work in UTF-8.
Also, if you're using a specialized font, that could cause the problem: are the questions styled with a particular font, or with a set of fallbacks? You may need an @font-face import, but I suspect there's another option. With the symbols you're trying to use, LaTeX might be an option; I believe there are a few options out there for JavaScript, fonts, etc.
This article may also be useful: http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html

Google AJAX Feed API, Dynamic Feed Control and the Japanese Language

English is fine, but the Japanese feeds show invalid characters.
Why am I getting invalid characters in the Japanese feeds?
http://acsjapan.jp/j/index.html
And why not in the English ones?
http://acsjapan.jp/
Please help me fix the Japanese feeds.
This is an encoding issue.
You are using (implicit) ISO-8859-1 encoding on your web page. Your AJAX feed serves UTF-8 characters.
This is tricky: I don't think you can make the Google Service deliver its data in the ISO-8859-1 character set. The best way would be to switch your site to UTF-8 - but that may have deeper consequences, and require other changes, especially if you are using a CMS.
Mandatory basic reading: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
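The mismatch described in this answer can be reproduced outside the browser. The sketch below assumes Node.js (it relies on the Buffer API), and the sample string is just a stand-in for feed content.

```javascript
// UTF-8 bytes, as the feed would deliver them.
const original = '日本語';
const utf8Bytes = Buffer.from(original, 'utf8');

// Interpreting those bytes as ISO-8859-1 yields one junk character per
// byte, which is what a page declared as ISO-8859-1 ends up showing.
const garbled = utf8Bytes.toString('latin1');

// The bytes themselves are intact, so decoding them as UTF-8 again
// recovers the text. This is why switching the page to UTF-8 fixes it.
const recovered = Buffer.from(garbled, 'latin1').toString('utf8');
```

Each of the three characters occupies three bytes in UTF-8, so the garbled string is nine characters long.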
