In my web application (JSP, jQuery...) there is a form which, along with other fields, has a textarea where the user can input notes freely. The value is saved to the database as is.
The problem happens when the value has newline characters and is loaded back into the textarea; it sometimes "breaks" the jQuery code. Explaining further:
The value is loaded into the textarea using jQuery:
$('#p_notas').text("value_from_db");
When the user hits Enter to insert a new paragraph, the resulting value will include one or more newline characters. These characters are the problem, as they vary from browser to browser, and I haven't found out which one is causing the issue.
The error I get is a console error: SyntaxError: unterminated string literal. The page doesn't load correctly.
I'm not able to reproduce the problem. I tried with Chrome, Firefox, and IE/Edge (with several combinations of user agent and document mode).
We advise our users to use IE8+, Firefox, or Chrome, but we can't control it.
What I wanted to know is which character is causing the problem and how I can solve it.
Thanks
EDIT: Summing up: what are the differences in newline characters between the different browsers? Can I do anything to make them uniform?
EDIT 2: Looking at the page in the debugger, what I get is:
Case 1 (No problem)
$('#p_notas').text("This is the text I inserted \r\n More text");
Case 2 (Problem)
$('#p_notas').text("This is the text I inserted
More text");
In case 2 I get the JavaScript error "SyntaxError: unterminated string literal" because the string is split across two lines of code.
EDIT 3: @m02ph3u5 I tried using '\r', '\n', '\r\n', and '\n\r', and I couldn't reproduce the problem.
EDIT 4: I'm going to try and replace all line breaks with '\n\r'
EDIT 5: In case it is of interest, what I did was treat the value before it was saved:
value.replace(/(?:\r\n|\r(?=\n)|\n(?=\r))/g, '\n\r')
The problem isn't the browser but the operating system. Quoting from this post:
So, using \r\n will ensure linebreaks on all major operating systems without issue.
Here's a nice read on the why: why do operating systems implement line breaks differently?
The problem you might be experiencing is saving the value of the textarea and then returning that value including any newlines. What you could do is "normalize" the value before saving, so that you don't have to change the output. In other words: get the value from the textarea, do a find-and-replace, and replace every possible occurrence of a newline (\r, \n) with a value that works on all OSes: \r\n. Then, when you get the value from the database later on, it'll always be correct.
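For example, a minimal sketch of that normalization on the client (the element id is from the question; how you send the value to the server is up to you):
// Normalize every newline variant to \r\n before saving.
// \r\n is matched first so existing pairs aren't converted twice.
var raw = $('#p_notas').val();
var normalized = raw.replace(/\r\n|\r|\n/g, '\r\n');
// send `normalized` to the server instead of the raw value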
I suspect your problem is actually that any newline in the entered input is causing an issue. It looks like on the server you have a templated page, something like:
$('#p_notas').text("<%=db.value%>");
So what you end up with client side is:
$('#p_notas').text("some notes that
were entered by the user");
or some other characters that break the JS. Embedded quotes would do it too.
You need to escape the user-entered values somehow. The preferred "modern" way is to return the info as an AJAX response. If you are embedding the value within a template, what I might do is:
<div style="display:none" id="userdata"><%=db.value%></div>
<script>$('#p_notas').text($("#userdata").text());</script>
Of course, if it were exactly this, you could just embed the data in the textarea: <textarea><%=db.value%></textarea>
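If you go the AJAX route mentioned above, the JSON encoding does the escaping for you. A minimal sketch (the endpoint URL and field name are hypothetical):
$.getJSON('/notes/123', function (data) {
    // jQuery parses the JSON response; .text() then sets the content safely
    $('#p_notas').text(data.notes);
});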
When you output data to the response, you always need to encode it using the appropriate encoding for the context it appears in.
You haven't mentioned which server-side technology you're using. In ASP.NET, for example, the HttpUtility class contains various encoding methods for different contexts:
HtmlEncode for general HTML output;
HtmlAttributeEncode for HTML attributes;
JavaScriptStringEncode for JavaScript strings;
UrlEncode for values passed in the query string of a URL.
In some cases, you might need to encode the value more than once. For example, if you're passing a value in a URL via a JavaScript string, you'd need to UrlEncode the raw value, then JavaScriptStringEncode the result.
Assuming that you're using ASP.NET, and your code currently looks something like this:
$('#p_notas').text("<%# Eval("SomeField") %>");
change it to:
$('#p_notas').text("<%# HttpUtility.JavaScriptStringEncode(Eval("SomeField", "{0}")) %>");
Related
Long story short, I'm trying to "fix" my system so that I'm using the same regular expressions on the back end as on the front end (validating both sides for obvious security reasons). I've got my regex working just fine server-side, but getting it down to the client is a pain. My quickest thought was to simply store it in a data attribute on a tag, grab it, and then validate against it.
Well, think again! JS is throwing me for a loop because apparently RegExp interprets the string differently depending on how it's pulled in. Can anyone shed some light on what is happening here, or on how I might go about resolving this issue?
HTML
<span data-regex="(^\\d{5}$)|(^\\d{5}-\\d{4}$)"></span>
Javascript
new RegExp($0.dataset.regex)
//returns /(^\\d{5}$)|(^\\d{5}-\\d{4}$)/
new RegExp($($0).data('regex'))
//returns /(^\\d{5}$)|(^\\d{5}-\\d{4}$)/
new RegExp("(^\\d{5}$)|(^\\d{5}-\\d{4}$)");
//returns /(^\d{5}$)|(^\d{5}-\d{4}$)/
Note in the first two how, if I pull the value from the data attribute dynamically, the RegExp constructor for some reason doesn't interpret the double backslash correctly. If, however, I copy and paste the value as a string and call RegExp on it, it correctly interprets the double backslash and returns the right pattern.
I've also attempted simply not escaping the \d by double-backslashing on the server side, but as you might (or might not) have guessed, the opposite happens. When pulled from the attributes/dataset, the \ is completely removed, leading the regex to think I'm looking for the "d" character rather than digits. I'm at a loss for understanding what JS is thinking here. Please send help, Internet.
Your data attribute has redundant backslashes. There's no need to escape backslashes in HTML attributes, so you'll actually get a double-backslash where you don't want one. When writing regular expressions as strings in JavaScript you have to escape backslashes, of course.
So you don't actually have the same string on both sides, simply because escaping works differently.
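For example, with single backslashes in the markup, the attribute already holds the exact pattern string RegExp expects. A quick sketch (the id is hypothetical):
// <span id="zip" data-regex="(^\d{5}$)|(^\d{5}-\d{4}$)"></span>
var pattern = document.getElementById('zip').dataset.regex;
var re = new RegExp(pattern);       // /(^\d{5}$)|(^\d{5}-\d{4}$)/
console.log(re.test('12345'));      // true
console.log(re.test('12345-6789')); // true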
My intention is to store books and other large blobs of formatted text (100 to thousands of words per chapter) to be displayed with their formatting in an application built with the Aurelia framework. I would prefer using JSON, but I could try other alternatives. The text has been written using Google Docs.
So far, trying to use JSON, Visual Studio Code says "Unexpected end of string" at the first carriage return, and the application gives me an error in the console:
Unhandled rejection SyntaxError: Unexpected token in JSON at position 780
Is there any way to indicate to JSON that something is formatted text, or any decent alternative?
Your JSON has characters in it that aren't properly escaped. Most likely these are quote (") characters, which need a \ before them; raw newlines inside a JSON string are invalid too and must be escaped as \n. Unless you have a particularly robust workflow set up to handle transcribing, you're going to run into this problem a lot with large documents, especially ones coming from a word processor.
Instead, why not simply store the material as HTML? It is specifically designed to store and markup documents. It has headings, paragraphs, lists, etc. Browsers are already equipped to display it without doing any processing and it can be easily injected into your application by simply appending it to any element on the page.
Additionally, Google Docs should be able to save the document as HTML directly, so you don't have to do any manual markup.
You need to escape special characters. This discussion may help. Note that you will probably have your own list of escaped characters, which depends on your source string.
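If the JSON is being produced programmatically rather than by hand, the serializer escapes quotes and newlines for you. A minimal sketch (the field name is made up):
var chapter = 'He said "wait"\nand a new paragraph began.';
var json = JSON.stringify({ text: chapter });
// json is valid JSON: the newline became \n and the quotes became \"
console.log(JSON.parse(json).text === chapter); // true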
I have several HTML anchor tags generated programmatically in ASP.NET, with a JavaScript function taking a long parameter in the href. One of them has over 20K characters when it gets assigned in the backend, but I am seeing that the actual link has only 5,239 characters on the browser side, and the JavaScript function call is not closed. So the link never works. I am thinking about workarounds for this implementation, since it's not a good idea to put this much data in links, but now I'm just curious about the cause of the issue.
Examples of the code assigning values to the link:
HtmlAnchor.HRef = "javascript:doSomething('Import','" + strHeader_LineIds + "');"
In this case the variable strHeader_LineIds carries a string of over 20K characters.
Example of what I'm actually seeing in client side:
<a id=anchor1 class=class1 href="javascript:doSomething('Import', 'blahblahblahblah....">Link Text</a>
Please note the JavaScript function has no closing here. But when I'm debugging in the backend, I do see the closing of the function.
I guess this issue may have something to do with the browser's URL limit? I am using IE, and I learned from here that IE has a maximum URL length limit of 2,083 characters. But how can the link show up with 5,239 characters?
I've had a similar issue with JavaScript-like dynamic functions created in code and then called. I found that I had to play with swapping out single quotes in the JavaScript function for double quotes, or escaping the quotes.
Then again, just reading your post, it could be a limit issue.
Have you tried assigning the long value to an element in the background and then referencing that as part of the JavaScript? I know IE gets funny with spaces in passed-in parameters.
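Something along these lines, sketched with made-up ids (store the long value in the page instead of the URL and read it in a click handler):
<span id="lineIds" style="display:none"><!-- the 20K-character value goes here --></span>
<a id="anchor1" href="#">Link Text</a>
<script>
    document.getElementById('anchor1').onclick = function () {
        doSomething('Import', document.getElementById('lineIds').textContent);
        return false; // keep the href from navigating
    };
</script>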
I think I found an answer to the issue, though. According to this article:
JavaScript URIs
The JavaScript protocol is used for bookmarklets (aka favlets), a lightweight form of extensibility that permits a user to click a button and run some stored JavaScript on the currently loaded page. In IE9, the team did some work to relax the length limit (from ~260 characters, if I recall correctly) to something significantly larger (~5kb, if I recall correctly).
So I just hit the ~5kb limit.
There is a page where I have certain special characters, and when retrieving the values of these via JavaScript I am getting an odd conversion. The character 'Œ' is coming back as 'R' and its lowercase version 'œ' is coming back as 'S'. Is this a limitation of JavaScript, or could it possibly be the browser? This is from testing in Firefox. Also, this is being retrieved via a REPL client (Jssh/MozRepl), so it could be an issue with these clients themselves rather than the browser.
You likely have an encoding problem somewhere. There are many opportunities to mis-handle the encoding of text. If you post some code, we might be able to help you find it.
Output streams aren't scriptably safe for non-ASCII characters, so you will need to wrap the stream in an nsIBinaryOutputStream, an nsIUnicharOutputStream, or an nsIConverterOutputStream.
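For what it's worth, the classic converter-stream pattern looked roughly like this (a sketch from the old XPCOM extension days; assumes a chrome-privileged context and an existing raw output stream foStream):
var converter = Components.classes["@mozilla.org/intl/converter-output-stream;1"]
    .createInstance(Components.interfaces.nsIConverterOutputStream);
converter.init(foStream, "UTF-8", 0, 0); // wrap the raw stream, write UTF-8
converter.writeString(text);             // non-ASCII characters survive intact
converter.close();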
I am trying to piece together the mysterious string of characters "â??" that I am seeing quite a bit of in our database. I am fairly sure this is the result of a conversion between character encodings, but I am not completely positive.
The users are able to enter text (or cut and paste) into an Ext JS rich text editor. The data is posted to a servlet which persists it to the database, and when I view it in the database I see those strange characters...
Is there any way to decode these back to their original meaning, if I was able to discover the correct encoding? Or has there been a loss of bits or bytes through the conversion process?
Users are cutting and pasting from multiple versions of MS Word and PDF. Does the encoding follow where the user copied from?
Thank you
The website is UTF-8.
We are using MS SQL Server 2005:
SELECT serverproperty('Collation') -- Server default collation.
Latin1_General_CI_AS
SELECT databasepropertyex('xxxx', 'Collation') -- Database default
SQL_Latin1_General_CP1_CI_AS
and the column:
Column_name Type Computed Length Prec Scale Nullable TrimTrailingBlanks FixedLenNullInSource Collation
text varchar no -1 yes no yes SQL_Latin1_General_CP1_CI_AS
The non-Unicode equivalents of the nchar, nvarchar, and ntext data types in SQL Server 2000 are listed below. When Unicode data is inserted into one of these non-Unicode data type columns through a command string (otherwise known as a "language event"), SQL Server converts the data to the data type using the code page associated with the collation of the column. When a character cannot be represented on a code page, it is replaced by a question mark (?), indicating the data has been lost. Appearance of unexpected characters or question marks in your data indicates your data has been converted from Unicode to non-Unicode at some layer, and this conversion resulted in lost characters.
So this may be the root cause of the problem... and not an easy one to solve on our end.
â is encoded as 0xE2 in ISO-8859-1 and windows-1252. 0xE2 is also a lead byte for a three-byte sequence in UTF-8. (Specifically, for the range U+2000 to U+2FFF, which includes the windows-1252 characters –—‘’‚“”„†‡•…‰‹›€™).
So it looks like you have text encoded in UTF-8 that's getting misinterpreted as being in windows-1252, and displays as a â followed by two unprintable characters.
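The mechanism is easy to reproduce. A quick Node.js illustration (latin1 here stands in for the single-byte code page):
// The right single quote U+2019 is E2 80 99 in UTF-8
const bytes = Buffer.from('\u2019', 'utf8');
console.log(bytes);                    // <Buffer e2 80 99>
// Decoded as a single-byte encoding, 0xE2 becomes 'â' and the two
// continuation bytes become characters the column's code page can't
// represent, hence the trailing '??'
console.log(bytes.toString('latin1')); // 'â' plus two control characters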
This is something of an educated guess, but you're probably just experiencing a naive conversion of Word/PDF documents to HTML (windows-1252 to UTF-8, most likely). If that's the case, probably 2/3 of the mysterious characters from Word documents are "smart quotes" and most of the rest are a result of the other "smart" editing features: ellipses, em dashes, etc. PDFs probably have similar features.
I would also guess that if the formatting after pasting into the Ext JS editor looks OK, then the encoding is getting passed along. Depending on the resulting use of the text, you may not need to convert.
If I'm still on base, and we're not talking about internationalization issues, then I can add that there are Word to HTML converters out there, but I don't know the details of how they operate, and I had mixed success when evaluating them. There is almost certainly some small information loss/error involved with such converters, since they need to make guesses about the original source of the "smart" characters. In my isolated case it was easier to just go back to the users and have them turn off the "smart" features.
The issue is clear: if the browser is good enough, a form in a web page can accept any Unicode character you can type or paste. If the character belongs to the HTML charset, it will be sent as is. If it doesn't, it'll get converted to an HTML entity. SQL Server will perform the appropriate conversion and silently corrupt your data when a character does not have an equivalent.
There's not much you can do to fully fix it, but you can work around it: let your servlet perform the conversion. This way you have full control over it. You can, for instance, compile a list of the most common non-Latin1 characters users paste (smart quotes, Unicode spaces...), which should be fairly easy to identify from context, and replace them with something better than ?. Or you can use a library that does this for you.
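The replacement idea, sketched in JavaScript (the servlet itself would do this in Java; the character list is illustrative, not exhaustive):
var replacements = {
    '\u2018': "'", '\u2019': "'",  // smart single quotes
    '\u201C': '"', '\u201D': '"',  // smart double quotes
    '\u2013': '-', '\u2014': '-',  // en and em dashes
    '\u2026': '...'                // ellipsis
};
function toLatin1Safe(s) {
    return s.replace(/[\u2018\u2019\u201C\u201D\u2013\u2014\u2026]/g,
        function (ch) { return replacements[ch]; });
}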
Or you can switch your DB to Unicode :)
You're storing Unicode data, which uses 2 bytes per character, into varchar columns, which use 1 byte per character. Any text that needs 2 bytes per character will lose data when stored in the DB.
All you need to do is change the varchar column to nvarchar,
and then change the SQL parameters you're using in code, of course.