JavaScript automatically converts some special characters - javascript

I need to extract a HTML-Substring with JS which is position dependent. I store special characters HTML-encoded.
For example:
HTML
<div id="test"><p>lösen & grüßen</p></div>​
Text
lösen & grüßen
My problem lies in the JS-part, for example when I try to extract the fragment
lö, which has the HTML-dependent starting position of 3 and the end position of 9 inside the <div> block. JS seems to convert some special characters internally so that the count from 3 to 9 is wrongly interpreted as "lösen " and not "lö". Other special characters like the & are not affected by this.
So my question is, if someone knows why JS is behaving in that way? Characters like ä or ö are being converted while characters like & or are plain. Is there any possibility to avoid this conversion?
I've set up a fiddle to demonstrate this: JSFiddle
Thanks for any help!
EDIT:
Maybe I've explained it a bit confusing, sorry for that. What I want is the HTML:
<p>lösen & grüßen</p> .
Every special character should be unconverted, except the HTML-Tags. Like in the HTML above.
But JS converts the ö or ü into ö or ü automatically, what I need to avoid.

That's because the browser (and not JavaScript) turns entities that don't need to be escaped in HTML into their respective Unicode characters (e.g. it skips &, < and >).
So by the time you inspect .innerHTML, it no longer contains exactly what was in the original page source; you could reverse this process, but it involves the full map of character <-> entity pairs which is just not practical.

If i understand you correctly, then try use innerHTML or .html('your html code') for jQuery on the target element

Related

How to sanitize abused unicoded chars like 𝑨𝒏𝒅𝒓𝒆 (YayText style) back to regular chars like "Andre" [duplicate]

User copy paste and send data in following format: "𝕛𝕠𝕧𝕪 𝕕𝕖𝕓𝕓𝕚𝕖"
I need to convert it into plain txt (we can say ascii chars) like 'jovy debbie'
It comes in different font and format:
ex:
'𝑱𝒆𝒏𝒊𝒄𝒂 𝑫𝒖𝒈𝒐𝒔'
'𝙶𝚎𝚟𝚒𝚎𝚕𝚢𝚗 𝙽𝚒𝚌𝚘𝚕𝚎 𝙻𝚞𝚖𝚋𝚊𝚐'
Any Help will be Appreciated, I already refer other stack overflow question but no luck :(
Those letters are from the Mathematical Alphanumeric Symbols block.
Since they have a fixed offset to their ASCII counterparts, you could use tr to map them, e.g.:
"𝕛𝕠𝕧𝕪 𝕕𝕖𝕓𝕓𝕚𝕖".tr("𝕒-𝕫", "a-z")
#=> "jovy debbie"
The same approach can be used for the other styles, e.g.
"𝑱𝒆𝒏𝒊𝒄𝒂 𝑫𝒖𝒈𝒐𝒔".tr("𝒂-𝒛𝑨-𝒁", "a-zA-Z")
#=> "Jenica Dugos"
This gives you full control over the character mapping.
Alternatively, you could try Unicode normalization. The NFKC / NFKD forms should remove most formatting and seem to work for your examples:
"𝕛𝕠𝕧𝕪 𝕕𝕖𝕓𝕓𝕚𝕖".unicode_normalize(:nfkc)
#=> "jovy debbie"
"𝑱𝒆𝒏𝒊𝒄𝒂 𝑫𝒖𝒈𝒐𝒔".unicode_normalize(:nfkc)
#=> "Jenica Dugos"

Converting special characters in c# string to html special characters

I am using .Net:
fulltext = File.ReadAllText(#location);
to read text anyfile content at given locatin.
I got result as:
fulltext="# vdk10.syx\t1.1 - 3/15/94\r# #(#)Copyright (C) 1987-1993 Verity, Inc.\r#\r# Synonym-list Database Descriptor\r#\r$control: 1\rdescriptor:\r{\r data-table: syu\r {\r worm:\tTHDSTAMP\t\tdate\r worm:\tQPARSER\t\t\ttext\r\t/_hexdata = yes\r varwidth:\tWORD\t\tsyv\r fixwidth:\tEXPLEN\t\t\t2 unsigned-integer\r varwidth:\tEXPLIST\t\tsyx\r\t/_hexdata = yes\r }\r data-table: syw\r {\r varwidth:\tSYNONYMS\tsyz\r }\r}\r\r ";
Now, I want this fulltext to be displayed in html page so that special characters are recognized in html properly. For examples: \r should be replaced by line break tag
so that they are properly formatted in html page display.
Is there any .net class to do this? I am looking for universal method since i am reading file and I can have any special characters. Thanks in advance for help or just direction.
You're trying to solve two problems:
Ensure special characters are properly encoded
Pretty-print your text
Solve them in this order:
First, encode the text, by importing the System.Web namespace and using HttpUtility (asked on StackOverflow). Use the result in step 2.
Pretty-printing is trickier, depending on the amount of pretty-printing that you want. Here are a few approaches, in increasing order of difficulty:
Put the text in a pre element. This should preserve newlines, tabs, spaces. You can still adjust the font used using CSS if you first slap a CSS class on the pre.
Replace all \r, \r\n and remaining \n with <br/>.
Study the structure of your text, parse it according to this structure, and provide specific tags in specific contexts. For example, the tab characters in your example may be indicative of a list of items. HTML provides the ol and ul elements for lists. Similarly, consecutive line breaks may indicate paragraphs, for which HTML provides the well known p element.
Thanks Everyone here for your valuable comment. I solved my formatting problem in client side with following code.
document.getElementById('textView').innerText = fulltext;
Here textview is the div where i want to display my fulltext . I don't think i need to replace special characters in string fulltext. I output as shown in the figure.

what kind of encoding is this?

I've got some data from dbpedia using jena and since jena's output is based on xml so there are some circumstances that xml characters need to be treated differently like following :
Guns n &#039; Roses
I just want to know what kind of econding is this?
I want decode/encode my input based on above encode(r) with the help of javascript and send it back to a servlet.
(edited post if you remove the space between & and amp you will get the correct character since in stackoverflow I couldn't find a way to do that I decided to put like that!)
Seems to be XML entity encoding, and a numeric character reference (decimal).
A numeric character reference refers to a character by its Universal
Character Set/Unicode code point, and uses the format
You can get some info here: List of XML and HTML character entity references on Wikipedia.
Your character is number 39, being the apostrophe: ', which can also be referenced with a character entity reference: &apos;.
To decode this using Javascript, you could use for example php.js, which has an html_entity_decode() function (note that it depends on get_html_translation_table()).
UPDATE: in reply to your edit: Basically that is the same, the only difference is that it was encoded twice (possibly by mistake). & is the ampersand: &.
This is an SGML/HTML/XML numeric character entity reference.
In this case for an apostrophe '.

Remove a long dash from a string in JavaScript?

I've come across an error in my web app that I'm not sure how to fix.
Text boxes are sending me the long dash as part of their content (you know, the special long dash that MS Word automatically inserts sometimes). However, I can't find a way to replace it; since if I try to copy that character and put it into a JavaScript str.replace statement, it doesn't render right and it breaks the script.
How can I fix this?
The specific character that's killing it is —.
Also, if it helps, I'm passing the value as a GET parameter, and then encoding it in XML and sending it to a server.
This code might help:
text = text.replace(/\u2013|\u2014/g, "-");
It replaces all – (–) and — (—) symbols with simple dashes (-).
DEMO: http://jsfiddle.net/F953H/
That character is call an Em Dash. You can replace it like so:
str.replace('\u2014', '');​​​​​​​​​​
Here is an example Fiddle: http://jsfiddle.net/x67Ph/
The \u2014 is called a unicode escape sequence. These allow to to specify a unicode character by its code. 2014 happens to be the Em Dash.
There are three unicode long-ish dashes you need to worry about: http://en.wikipedia.org/wiki/Dash
You can replace unicode characters directly by using the unicode escape:
'—my string'.replace( /[\u2012\u2013\u2014\u2015]/g, '' )
There may be more characters behaving like this, and you may want to reuse them in html later. A more generic way to to deal with it could be to replace all 'extended characters' with their html encoded equivalent. You could do that Like this:
[yourstring].replace(/[\u0080-\uC350]/g,
function(a) {
return '&#'+a.charCodeAt(0)+';';
}
);
With the ECMAScript 2018 standard, JavaScript RegExp now supports Unicode property (or, category) classes. One of them, \p{Dash}, matches any Unicode character points that are dashes:
/\p{Dash}/gu
In ES5, the equivalent expression is:
/[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD/g
See the Unicode Utilities reference.
Here are some JavaScript examples:
const text = "Dashes: \uFF0D\uFE63\u058A\u1400\u1806\u2010-\u2013\uFE32\u2014\uFE58\uFE31\u2015\u2E3A\u2E3B\u2053\u2E17\u2E40\u2E5D\u301C\u30A0\u2E1A\u05BE\u2212\u207B\u208B\u3030𐺭";
const es5_dash_regex = /[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD/g;
console.log(text.replace(es5_dash_regex, '-')); // Normalize each dash to ASCII hyphen
// => Dashes: ----------------------------
To match one or more dashes and replace with a single char (or remove in one go):
/\p{Dash}+/gu
/(?:[-\u058A\u05BE\u1400\u1806\u2010-\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u2E5D\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]|\uD803\uDEAD)+/g

JavaScript/HTML/Unicode accents: á != á

I want to check if a user submitted string is the same as the string in my answer key. Sometimes the words involve Spanish accents (like in sábado), and that makes the condition always false.
I have Firebug log $('#answer').val() and it shows up as sábado. (The á comes from a button that inserts the value á, if that matters) whereas logging the answer from the answer key shows sábado (how I wrote it in the actual answer key).
I have tried replacing the &aacute in the answer key with a normal á, but it still doesn't work, and results in a Unicode diamond-question-mark. When I do that and also replace the value of the button that makes the user-submitted á, the condition works correctly, but then the button, the user string, and the answer string all have the weird Unicode diamond-question-mark.
I have also tried using á in both places and it's no different from using á. Both my HTML and Javascript are using charset="utf-8".
How can I fix this?
If you're consistently using UTF-8, there's no need for HTML entities except to encode syntax (ie <, >, & and - within attributes - ").
For anything else, use the proper characters, and your problems should go away - until you run into unicode normalization issues, ie the difference between 'a\u0301' and '\u00E1'...
The issue is that you're not using the real UTF-8 characters in both strings (entered answer and the key). You should NOT be supplying "a button that inserts the value á" -- Re: "if that matters" it does!
The characters should be added by the keyboard input system. And your comparison string should also be only utf-8 characters. It should NOT be character entities.

Categories

Resources