what kind of encoding is this? - javascript

I got some data from DBpedia using Jena. Since Jena's output is XML-based, some characters are encoded specially, like the following:
Guns n & amp;#39; Roses
I just want to know what kind of encoding this is.
I want to decode/encode my input with the same encoding using JavaScript and send it back to a servlet.
(Edited post: if you remove the space between & and amp you get the correct sequence. I couldn't find a way to show that on Stack Overflow, so I wrote it like that.)

This seems to be XML entity encoding, specifically a (decimal) numeric character reference.
A numeric character reference refers to a character by its Universal
Character Set/Unicode code point, and uses the format &#nnnn; or &#xhhhh;
You can get some info here: List of XML and HTML character entity references on Wikipedia.
Your character is number 39, the apostrophe: ', which can also be referenced with a character entity reference: &apos;.
To decode this using Javascript, you could use for example php.js, which has an html_entity_decode() function (note that it depends on get_html_translation_table()).
UPDATE, in reply to your edit: basically it is the same; the only difference is that it was encoded twice (possibly by mistake). &amp; is the ampersand: &.
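As a rough illustration of the decoding described above, here is a minimal sketch in plain JavaScript that does not depend on php.js. The function name decodeXmlEntities and the entity table are assumptions made for this example; it only handles numeric character references and the five predefined XML entities, not the full HTML entity list.

```javascript
// Hypothetical helper (not part of php.js or any library).
// Only the five predefined XML entities are known by name.
const XML_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };

function decodeXmlEntities(text) {
  return text.replace(/&(#x[0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] === '#') {
      // Numeric character reference: decimal (&#39;) or hexadecimal (&#x27;).
      const isHex = body[1] === 'x';
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    // Named entity: leave unknown names untouched.
    return Object.prototype.hasOwnProperty.call(XML_ENTITIES, body)
      ? XML_ENTITIES[body]
      : match;
  });
}

// A doubly encoded value needs two passes.
const once = decodeXmlEntities('Guns n &amp;#39; Roses');
const twice = decodeXmlEntities(once);
console.log(once);  // Guns n &#39; Roses
console.log(twice); // Guns n ' Roses
```

Each call undoes exactly one level of encoding, which is why the doubly encoded string has to go through the function twice.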

This is an SGML/HTML/XML numeric character entity reference.
In this case for an apostrophe '.

Related

How to convert this text to correct HTML characters using Javascript

How to convert this text to correct HTML characters using Javascript:
'PingAsyncTask - Token v\ufffdlido'
Put in your console:
console.log('PingAsyncTask - Token v\ufffdlido');
I have already tried all the common functions:
https://gist.github.com/chrisveness/bcb00eb717e6382c5608
http://monsur.hossa.in/2012/07/20/utf-8-in-javascript.html
http://jsfromhell.com/geral/utf-8
Can anyone help me?
If your document is already UTF-8 you don't need to do anything special. The string is already encoded correctly in JavaScript, so when you write it into the document it'll show up correctly. You can see it in this fiddle: https://jsfiddle.net/baar4ew8/
P.S. The character in your code (\ufffd) is U+FFFD, the Unicode replacement character. Most fonts render it as a black diamond with a question mark inside, or just an empty box. Here's how Stack Overflow renders it:
�
If you're seeing that in your output, your string is being rendered correctly.
If you think you should be seeing some other character, then your problem isn't in the HTML or JavaScript; it's with the source of your data, whatever that might be. When a program converts text from a non-Unicode encoding to a Unicode encoding like UTF-8, characters that don't exist in Unicode are replaced with U+FFFD (�), hence "replacement character." If you're expecting some character that does exist in Unicode but you're getting U+FFFD, then it might be that the program converting the text to UTF-8 doesn't know what encoding it was originally in and so converted it incorrectly. This happens, for example, if you store text with encoding X in a database table with encoding Y without first converting it to encoding Y.
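To make the wrong-encoding scenario concrete, here is a small sketch using the standard TextDecoder API. The byte values and the choice of windows-1252 are assumptions for illustration only; nothing in the question says what the original encoding really was.

```javascript
// Assumed input: the word "válido" stored as windows-1252 bytes (á = 0xE1).
const bytes = new Uint8Array([0x76, 0xe1, 0x6c, 0x69, 0x64, 0x6f]);

// Decoding with the wrong encoding: 0xE1 is not a valid UTF-8 sequence here,
// so the decoder substitutes U+FFFD for it.
const wrong = new TextDecoder('utf-8').decode(bytes);

// Decoding with the encoding the bytes were actually written in.
const right = new TextDecoder('windows-1252').decode(bytes);

console.log(wrong === 'v\ufffdlido'); // true
console.log(right === 'v\u00e1lido'); // true
```

Once the replacement has happened, the original byte is gone, so the fix has to be applied where the bytes are first decoded, not in the page that displays the string.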

What is the best way to serialize a JavaScript object into something that can be used as a fragment identifier (url#hash)?

My page state can be described by a JavaScript object that can be serialized into JSON. But I don't think a JSON string is suitable for use in a fragment ID due to, for example, the spaces and double-quotes.
Would encoding the JSON string into a base64 string be sensible, or is there a better way? My goal is to allow the user to bookmark the page and then upon returning to that bookmark, have a piece of JavaScript read window.location.hash and change state accordingly.
I think you are on the right track. Let's write down the requirements:
The encoded string must be usable as hash, i.e. only letters and numbers.
The original value must be possible to restore, i.e. hashing (md5, sha1) is not an option.
It shouldn't be too long, to remain usable.
There should be an implementation in JavaScript, so it can be generated in the browser.
Base64 would be a great solution for that. Only problem: base64 also contains characters like /, + and =, so you gain nothing compared to simply attaching a JSON string (which would also have to be URL-encoded).
BUT: Luckily, there's a variant of base64 called base64url which is exactly what you need. It is specifically designed for the type of problem you're describing.
However, I was not able to find a JS implementation; maybe you have to write one yourself, or do a bit more research than my half-assed 15 seconds scanning the first 5 Google results.
EDIT: On second thought, I think you don't need to write your own implementation. Use a normal implementation, and simply replace the "forbidden" characters with something you find appropriate for your URLs.
Base64 is an excellent way to store binary data in text. It uses just 33% more characters/bytes than the original data and mostly uses 0-9, a-z, and A-Z. It also has three other characters that would need to be encoded to be stored in the URL: /, =, and +. If you simply used URL encoding on binary data, it could take up to 300% (3x) of the original size.
If you're only storing the characters in the fragment of the URL, base64-encoded text doesn't need to be re-encoded and will not change. But if you want to send the data as part of the actual URL to visit, then it matters.
As referenced by lxg, there is a base64url variant for that. It is a modified version of base64 that replaces the characters that are unsafe in a URL. Here is how to encode it:
function tobase64url(s) {
  // btoa expects a "binary string" (every character code <= 0xFF)
  return btoa(s).replace(/\+/g, '-').replace(/\//g, '_').replace(/=/g, '');
}
console.log(tobase64url('\x00\xff\xff\xf1\xf1\xf1\xff\xff\xfe'));
// Logs "AP__8fHx___-" instead of "AP//8fHx///+"
And to decode a base64 string from the URL:
function frombase64url(s) {
  // Map the URL-safe alphabet back to standard base64 before decoding
  return atob(s.replace(/-/g, '+').replace(/_/g, '/'));
}
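Applied to the original question, the base64url approach might look like the sketch below. The names stateToHash and hashToState are hypothetical, and the UTF-8 step is an addition not present in the answers above: btoa only accepts characters up to 0xFF, so JSON containing other characters is first converted to bytes.

```javascript
// Hypothetical helpers for illustration; not taken from any library.
function stateToHash(state) {
  // JSON -> UTF-8 bytes -> binary string, so btoa never sees code units > 0xFF.
  const bytes = new TextEncoder().encode(JSON.stringify(state));
  let binary = '';
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary).replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
}

function hashToState(hash) {
  // Accept the value with or without the leading '#'.
  let b64 = hash.replace(/^#/, '').replace(/-/g, '+').replace(/_/g, '/');
  // Restore the padding that was stripped when encoding.
  b64 += '='.repeat((4 - (b64.length % 4)) % 4);
  const bytes = Uint8Array.from(atob(b64), (c) => c.charCodeAt(0));
  return JSON.parse(new TextDecoder().decode(bytes));
}

const exampleState = { tab: 'settings', query: 'grüßen', page: 3 };
const exampleHash = stateToHash(exampleState);
console.log(exampleHash);              // only A-Z, a-z, 0-9, - and _
console.log(hashToState(exampleHash)); // same contents as exampleState
```

In a page, the result would be assigned to window.location.hash when the state changes and read back from it on load.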
Use encodeURIComponent and decodeURIComponent to serialize data for the fragment (aka hash) part of the URL.
This is safe because the character set output by encodeURIComponent is a subset of the character set allowed in the fragment. Specifically, encodeURIComponent escapes all characters except:
A - Z
a - z
0 - 9
- . _ ~ ! ' ( ) *
So the output includes the above characters, plus escaped characters, which are % followed by hexadecimal digits.
The set of allowed characters in the fragment is:
A - Z
a - z
0 - 9
? / : @ - . _ ~ ! $ & ' ( ) * + , ; =
percent-encoded characters (a % followed by hexadecimal digits)
This set of allowed characters includes all the characters output by encodeURIComponent, plus a few other characters.
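A minimal sketch of this approach follows. The function names toFragment and fromFragment and the sample state object are made up for the example.

```javascript
// Hypothetical helpers: serialize a JSON-compatible state object for use after '#'.
function toFragment(state) {
  return encodeURIComponent(JSON.stringify(state));
}

function fromFragment(fragment) {
  // Tolerate a leading '#', as returned by window.location.hash.
  return JSON.parse(decodeURIComponent(fragment.replace(/^#/, '')));
}

const pageState = { view: 'list', filter: 'a b "c"', city: 'Köln' };
const fragment = toFragment(pageState);
console.log(fragment);               // spaces, quotes and braces appear as %XX escapes
console.log(fromFragment(fragment)); // same contents as pageState
```

The output is longer than base64url for heavily escaped input, but it stays human-readable for values that are mostly letters and digits.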

encoding issue on form.serialize(); Some specials character displaying as ASCII code

I am having a problem with special characters in JavaScript.
I have a form with an input text that has the following string:
10/10/2010
after a form.serialize(); I get this string as
10%2F10%2F2010
The '/' character is converted to its ASCII code %2F.
I would be able to convert that using String.fromCharCode(ascii_code), but I have many inputs in my form, so the string is something like:
var=14&var=10%2F10%2F2010&var=10%2F10%2F2010&var=10%2F10%2F2010
Just an example to state that I would have to go through this string ("manually") and find those value and convert it.
Is there any easy way to perform that conversion?
Strange thing because I did not have that problem before, I am not sure why this is happening now.
It happens that way because that's how it's meant to be:
The .serialize() method creates a text string in standard URL-encoded
notation. It operates on a jQuery object representing a set of form
elements.
As far as I know, there's no native jQuery function to unserialize, but your post suggests you already have that part and are only stuck on the URL-encoded strings:
decodeURIComponent(encodedURI): Decodes a Uniform Resource Identifier (URI) component previously created by encodeURIComponent or
by a similar routine.
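A short sketch of how decodeURIComponent could be applied to a whole serialized string. The function name parseSerialized is hypothetical. One detail beyond what the answer states: jQuery's .serialize() writes spaces as +, which decodeURIComponent does not undo, so the sketch converts them first.

```javascript
// Hypothetical helper: turn "a=1&b=10%2F10%2F2010" into [name, value] pairs.
function parseSerialized(serialized) {
  const decode = (s) => decodeURIComponent(s.replace(/\+/g, ' '));
  return serialized
    .split('&')
    .filter(Boolean) // ignore empty segments such as a trailing '&'
    .map((pair) => {
      const [name, value = ''] = pair.split('=');
      return [decode(name), decode(value)];
    });
}

const pairs = parseSerialized('var=14&var=10%2F10%2F2010&note=hello+world');
console.log(pairs);
// [ [ 'var', '14' ], [ 'var', '10/10/2010' ], [ 'note', 'hello world' ] ]
```

Pairs are kept in an array rather than an object because, as in the question, the same field name can occur more than once.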

JavaScript automatically converts some special characters

I need to extract a HTML-Substring with JS which is position dependent. I store special characters HTML-encoded.
For example:
HTML
<div id="test"><p>l&ouml;sen &amp; gr&uuml;&szlig;en</p></div>
Text
lösen & grüßen
My problem lies in the JS part, for example when I try to extract the fragment
l&ouml;, which has the HTML-dependent starting position of 3 and the end position of 9 inside the <div> block. JS seems to convert some special characters internally, so that the range from 3 to 9 is wrongly interpreted as "lösen " and not "l&ouml;". Other special characters like the &amp; are not affected by this.
So my question is whether someone knows why JS behaves that way. Characters like &auml; or &ouml; are converted, while characters like &amp; stay encoded. Is there any way to avoid this conversion?
I've set up a fiddle to demonstrate this: JSFiddle
Thanks for any help!
EDIT:
Maybe my explanation was a bit confusing, sorry for that. What I want is the HTML:
<p>l&ouml;sen &amp; gr&uuml;&szlig;en</p> .
Every special character should stay unconverted, except the HTML tags, like in the HTML above.
But JS converts the &ouml; or &uuml; into ö or ü automatically, which I need to avoid.
That's because the browser (and not JavaScript) turns entities that don't need to be escaped in HTML into their respective Unicode characters (e.g. it skips &amp;, &lt; and &gt;).
So by the time you inspect .innerHTML, it no longer contains exactly what was in the original page source; you could reverse this process, but it involves the full map of character <-> entity pairs which is just not practical.
If I understand you correctly, then try using innerHTML, or .html('your html code') with jQuery, on the target element.

JavaScript/HTML/Unicode accents: á != &aacute;

I want to check if a user submitted string is the same as the string in my answer key. Sometimes the words involve Spanish accents (like in sábado), and that makes the condition always false.
I have Firebug log $('#answer').val() and it shows up as sábado (the á comes from a button that inserts the value á, if that matters), whereas logging the answer from the answer key shows s&aacute;bado (how I wrote it in the actual answer key).
I have tried replacing the &aacute; in the answer key with a normal á, but it still doesn't work, and results in a Unicode diamond-question-mark. When I do that and also replace the value of the button that makes the user-submitted á, the condition works correctly, but then the button, the user string, and the answer string all have the weird Unicode diamond-question-mark.
I have also tried using &#225; in both places and it's no different from using &aacute;. Both my HTML and JavaScript are using charset="utf-8".
How can I fix this?
If you're consistently using UTF-8, there's no need for HTML entities except to encode syntax (i.e. <, >, & and, within attributes, ").
For anything else, use the proper characters, and your problems should go away - until you run into unicode normalization issues, ie the difference between 'a\u0301' and '\u00E1'...
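The normalization issue mentioned above can be handled with the standard String.prototype.normalize method. The sketch below is only an illustration; the helper name sameText is made up.

```javascript
// Two ways to write "sábado": 'a' + combining acute accent vs. precomposed á.
const decomposed = 'sa\u0301bado';
const precomposed = 's\u00e1bado';

console.log(decomposed === precomposed); // false: different code unit sequences

// Hypothetical helper: compare after normalizing both sides to the same form (NFC).
function sameText(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

console.log(sameText(decomposed, precomposed)); // true
```

Normalizing both the user input and the answer key before comparing means it no longer matters which form the keyboard or button produced.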
The issue is that you're not using the real UTF-8 characters in both strings (entered answer and the key). You should NOT be supplying "a button that inserts the value á" -- Re: "if that matters" it does!
The characters should be added by the keyboard input system. And your comparison string should also be only utf-8 characters. It should NOT be character entities.
