I have an Oracle database which uses the character set NLS_CHARACTERSET WE8ISO8859P1. A web application using UTF-8 is used to enter data in the Oracle DB. When characters outside of the set, such as Chinese characters, are entered in the text fields it creates a database error.
I am looking for a regex to prevent submitting non-WE8ISO8859P1 characters. Searching the topic I cannot find any such regex. Can anyone point me in the right direction? I would like to validate at the application level before hitting the DB.
I can validate for other character encodings, like UTF-8, with something like this...
var regex = /^[A-Za-z\xAA\xB5\xBA\xC0-\xD6\xD8-\xF6\xF8-\u02C1\u02C6-\u02D1\u02E0-\u02E4\u02EC\u02EE\u0370-\u0374\u0376\u0377\u037A-\u037D\u037F\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03F5\u03F7-\u0481\u048A-\u052F\u0531-\u0556\u0559\u0561-\u0587\u05D0-\u05EA\u05F0-\u05F2\u0620-\u064A\u066E\u066F\u0671-\u06D3\u06D5\u06E5\u06E6\u06EE\u06EF\u06FA-\u06FC\u06FF\u0710\u0712-\u072F\u074D-\u07A5\u07B1\u07CA-\u07EA\u07F4\u07F5\u07FA\u0800-\u0815\u081A\u0824\u0828\u0840-\u0858\u08A0-\u08B4\u0904-\u0939\u093D\u0950\u0958-\u0961\u0971-\u0980\u0985-\u098C\u098F\u0990\u0993-\u09A8\u09AA-\u09B0\u09B2\u09B6-\u09B9\u09BD\u09CE\u09DC\u09DD\u09DF-\u09E1\u09F0\u09F1\u0A05-\u0A0A\u0A0F\u0A10\u0A13-\u0A28\u0A2A-\u0A30\u0A32\u0A33\u0A35\u0A36\u0A38\u0A39\u0A59-\u0A5C\u0A5E\u0A72-\u0A74\u0A85-\u0A8D\u0A8F-\u0A91\u0A93-\u0AA8\u0AAA-\u0AB0\u0AB2\u0AB3\u0AB5-\u0AB9\u0ABD\u0AD0\u0AE0\u0AE1\u0AF9\u0B05-\u0B0C\u0B0F\u0B10\u0B13-\u0B28\u0B2A-\u0B30\u0B32\u0B33\u0B35-\u0B39\u0B3D\u0B5C\u0B5D\u0B5F-\u0B61\u0B71\u0B83\u0B85-\u0B8A\u0B8E-\u0B90\u0B92-\u0B95\u0B99\u0B9A\u0B9C\u0B9E\u0B9F\u0BA3\u0BA4\u0BA8-\u0BAA\u0BAE-\u0BB9\u0BD0\u0C05-\u0C0C\u0C0E-\u0C10\u0C12-\u0C28\u0C2A-\u0C39\u0C3D\u0C58-\u0C5A\u0C60\u0C61\u0C85-\u0C8C\u0C8E-\u0C90\u0C92-\u0CA8\u0CAA-\u0CB3\u0CB5-\u0CB9\u0CBD\u0CDE\u0CE0\u0CE1\u0CF1\u0CF2\u0D05-\u0D0C\u0D0E-\u0D10\u0D12-\u0D3A\u0D3D\u0D4E\u0D5F-\u0D61\u0D7A-\u0D7F\u0D85-\u0D96\u0D9A-\u0DB1\u0DB3-\u0DBB\u0DBD\u0DC0-\u0DC6\u0E01-\u0E30\u0E32\u0E33\u0E40-\u0E46\u0E81\u0E82\u0E84\u0E87\u0E88\u0E8A\u0E8D\u0E94-\u0E97\u0E99-\u0E9F\u0EA1-\u0EA3\u0EA5\u0EA7\u0EAA\u0EAB\u0EAD-\u0EB0\u0EB2\u0EB3\u0EBD\u0EC0-\u0EC4\u0EC6\u0EDC-\u0EDF\u0F00\u0F40-\u0F47\u0F49-\u0F6C\u0F88-\u0F8C\u1000-\u102A\u103F\u1050-\u1055\u105A-\u105D\u1061\u1065\u1066\u106E-\u1070\u1075-\u1081\u108E\u10A0-\u10C5\u10C7\u10CD\u10D0-\u10FA\u10FC-\u1248\u124A-\u124D\u1250-\u1256\u1258\u125A-\u125D\u1260-\u1288\u128A-\u128D\u1290-\u12B0\u12B2-\u12B5\u12B8-\u12BE\u12C0\u12C2-\u12C5\u12C8-\u12D6\u12D8-\u1310\u1312-\u1315\u1318-\u135A\u1380-\u138F\u13A0-\u13F5\u13F8-\u13FD\u1401-\u166C\u166F-\u167F\u1681-\u169A\u16A0-\u16EA\u16F1-\u16F8\u1700-\u170C\u170E-\u1711\u1720-\u1731\u1740-\u1751\u1760-\u176C\u176E-\u1770\u1780-\u17B3\u17D7\u17DC\u1820-\u1877\u1880-\u18A8\u18AA\u18B0-\u18F5\u1900-\u191E\u1950-\u196D\u1970-\u1974\u1980-\u19AB\u19B0-\u19C9\u1A00-\u1A16\u1A20-\u1A54\u1AA7\u1B05-\u1B33\u1B45-\u1B4B\u1B83-\u1BA0\u1BAE\u1BAF\u1BBA-\u1BE5\u1C00-\u1C23\u1C4D-\u1C4F\u1C5A-\u1C7D\u1CE9-\u1CEC\u1CEE-\u1CF1\u1CF5\u1CF6\u1D00-\u1DBF\u1E00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FBC\u1FBE\u1FC2-\u1FC4\u1FC6-\u1FCC\u1FD0-\u1FD3\u1FD6-\u1FDB\u1FE0-\u1FEC\u1FF2-\u1FF4\u1FF6-\u1FFC\u2071\u207F\u2090-\u209C\u2102\u2107\u210A-\u2113\u2115\u2119-\u211D\u2124\u2126\u2128\u212A-\u212D\u212F-\u2139\u213C-\u213F\u2145-\u2149\u214E\u2183\u2184\u2C00-\u2C2E\u2C30-\u2C5E\u2C60-\u2CE4\u2CEB-\u2CEE\u2CF2\u2CF3\u2D00-\u2D25\u2D27\u2D2D\u2D30-\u2D67\u2D6F\u2D80-\u2D96\u2DA0-\u2DA6\u2DA8-\u2DAE\u2DB0-\u2DB6\u2DB8-\u2DBE\u2DC0-\u2DC6\u2DC8-\u2DCE\u2DD0-\u2DD6\u2DD8-\u2DDE\u2E2F\u3005\u3006\u3031-\u3035\u303B\u303C\u3041-\u3096\u309D-\u309F\u30A1-\u30FA\u30FC-\u30FF\u3105-\u312D\u3131-\u318E\u31A0-\u31BA\u31F0-\u31FF\u3400-\u4DB5\u4E00-\u9FD5\uA000-\uA48C\uA4D0-\uA4FD\uA500-\uA60C\uA610-\uA61F\uA62A\uA62B\uA640-\uA66E\uA67F-\uA69D\uA6A0-\uA6E5\uA717-\uA71F\uA722-\uA788\uA78B-\uA7AD\uA7B0-\uA7B7\uA7F7-\uA801\uA803-\uA805\uA807-\uA80A\uA80C-\uA822\uA840-\uA873\uA882-\uA8B3\uA8F2-\uA8F7\uA8FB\uA8FD\uA90A-\uA925\uA930-\uA946\uA960-\uA97C\uA984-\uA9B2\uA9CF\uA9E0-\uA9E4\uA9E6-\uA9EF\uA9FA-\uA9FE\uAA00-\uAA28\uAA40-\uAA42\uAA44-\uAA4B\uAA60-\uAA76\uAA7A\uAA7E-\uAAAF\uAAB1\uAAB5\uAAB6\uAAB9-\uAABD\uAAC0\uAAC2\uAADB-\uAADD\uAAE0-\uAAEA\uAAF2-\uAAF4\uAB01-\uAB06\uAB09-\uAB0E\uAB11-\uAB16\uAB20-\uAB26\uAB28-\uAB2E\uAB30-\uAB5A\uAB5C-\uAB65\uAB70-\uABE2\uAC00-\uD7A3\uD7B0-\uD7C6\uD7CB-\uD7FB\uF900-\uFA6D\uFA70-\uFAD9\uFB00-\uFB06\uFB13-\uFB17\uFB1D\uFB1F-\uFB28\uFB2A-\uFB36\uFB38-\uFB3C\uFB3E\uFB40\uFB41\uFB43\uFB44\uFB46-\uFBB1\uFBD3-\uFD3D\uFD50-\uFD8F\uFD92-\uFDC7\uFDF0-\uFDFB\uFE70-\uFE74\uFE76-\uFEFC\uFF21-\uFF3A\uFF41-\uFF5A\uFF66-\uFFBE\uFFC2-\uFFC7\uFFCA-\uFFCF\uFFD2-\uFFD7\uFFDA-\uFFDC]$/u
But will it fit into the Windows (WE8ISO8859P1) character set?
Related
I am currently working on a personal project, where I have to deal with some chemical formulas;
I have a form with JavaScript where I enter these formulas; The formulas are entered in a LaTeX-like style for super- en subscript.
An example formula can be found below:
Fe^{3+}
When I use JavaScript to read the form and console.log(); the formula is working as expected.
However if I send the formula to the back-end (Python with CGI), the + character seems to have disappeared and been replaced with a space.
I thought it had something to do with escaping the character, since parts of the formula look a lot like regex's, but after looking around, I couldn't find anything that would suggest that I had to escape the + character.
And now I have absolutely no idea how to resolve this... I could use a different character and replace it on the back-end but that seems like it is not the optimal solution...
Most important question: How did you invoke the CGI script?
With HTTP GET or HTTP POST?
If you're using HTTP POST and the data was being transferred in the HTTP Data portion, then you don't need to escape the "+" sign.
But if you're using HTTP GET, then the "+" sign will first be translated according to URL encoding standard (thus, "+" becomes a space), before transferred to the CGI script.
So in the latter scenario, you need to escape the "+" sign (and other 'special' characters such as "%" and "?").
I am using the ck editor (version 4) and users enter greek text. Most characters are converted into html entities but the accented characters seem to be ignored.
On my page and within my database the text is stored an can be displayed but when I create an email with the entered text the characters are not shown correctly.
Is there a config setting that I can use so that all greek characters are converted? Or do I have to try and change these characters manually?
This sounds like an encoding issue rather than converting the characters into entities. Presumably the text is being stored as UTF-8 in the database, and the webs pages are also in the same encoding, so the characters get displayed correctly.
When it comes to the email, I suspect that it's being encoded in another encoding like ASCII or ISO-8859-1 (Latin characters only) so the Greek text will appear as a mix of letters and punctuation rather than as it should look.
You don't mention what you're using to send the email, but I'd probably look there first to see if there's a way of setting the character encoding to the same encoding as the data is stored in the database.
I am sending a parameter, from Javascript to .NET. The parameter could contain multiple spaces like 'John [3 spaces here, stackoverflow not showing them] Smith', I need the spaces to stay. However, it looks like the spaces disappear in .NET. In an attempt to fix this, I made sure to encode the URI on client side, and decode it on server side. The code (in VB.NET) looks like this:
<AjaxPro.AjaxMethod()> _
Public Function GetSearch(ByVal strValue As String) As String
strValue = HttpUtility.UrlDecode(strValue)
...
End Function
Before the UrlDecode, strValue looks like John%20%20%20Smith'. Afterwards it looks like John Smith. Can anyone tell me how to fix this?
I am using .NET 2.0 Web Forms.
EDIT: following one of the suggestions below I replaced all the spaces with (ampersand)nbsp;. I can see all the spaces when I debug, however, my database is SQL server, for some reason it views regular spaces to be different from (ampersand)nbsp; spaces, and as a result the query does not return the right values. It took me some time to figure this out because I could not see the difference with the naked eye.
Ok, I was able to resolve the issue in the front-end by replacing all spaces with Non-breaking spaces. I then had to replace the Non-breaking spaces back with regular spaces in my SQL stored procedure (otherwise the query wouldn't work).
I have been recently requested, to adapt an app's input, to support Unicode letters, on some of the inputs within the web app.
That app, already does some validation with regex, with the pattern html attribute. Like so:
<input required="true" pattern="[a-zA-Z0-9_\-]+" type="text" name="name">
Now, since I have to adapt some inputs to the new requirements, I was wondering what would be better to do?
Do Encoding / decoding UTF8 in javascript?
http://ecmanaut.blogspot.ca/2006/07/encoding-decoding-utf8-in-javascript.html
http://laffers.net/blog/2010/12/10/regex-match-unicode-characters/
or
Addapt the regex pattern just like suggested here: PHP Regex for Multiple Unicode Characters ?
Javascript is by definition completely in unicode (Except websites with non unicode encoding, but there the solution may still work), so just add letters you need to regexp. If you need to add them by charcode use \x0000
I had decided to go for editing the regex option, since when my view starts being processed, there is a module that will set the pattern attribute, with the defined regex, on specific inputs.
So, it is better to edit the regex pattern and then set in on inputs when the view is loaded, then doing:
Load view and set pattern attributes for each input
Write js to analyze specific inputs, to see if they contain decoded Unicode characters
If so, encode those characters while typing/before submit, since I have a regex pattern that doesn't allow such characters
Essentially, I'm saving me time on writing useless code, and most important, a lot of browser processing (it was going to be to much, for what it is needed).
I should've gone for this option since the beginning (duh!)
I am trying to piece together the mysterious string of characters â?? I am seeing quite a bit of in our database - I am fairly sure this is a result of conversion between character encodings, but I am not completely positive.
The users are able to enter text (or cut and paste) into a Ext-Js rich text editor. The data is posted to a severlet which persists it to the database, and when I view it in the database i see those strange characters...
is there any way to decode these back to their original meaning, if I was able to discover the correct encoding - or is there a loss of bits or bytes that has occured through the conversion process?
Users are cutting and pasting from multiple versions of MS Word and PDF. Does the encoding follow where the user copied from?
Thank you
website is UTF-8
We are using ms sql server 2005;
SELECT serverproperty('Collation') -- Server default collation.
Latin1_General_CI_AS
SELECT databasepropertyex('xxxx', 'Collation') -- Database default
SQL_Latin1_General_CP1_CI_AS
and the column:
Column_name Type Computed Length Prec Scale Nullable TrimTrailingBlanks FixedLenNullInSource Collation
text varchar no -1 yes no yes SQL_Latin1_General_CP1_CI_AS
The non-Unicode equivalents of the
nchar, nvarchar, and ntext data types
in SQL Server 2000 are listed below.
When Unicode data is inserted into one
of these non-Unicode data type columns
through a command string (otherwise
known as a "language event"), SQL
Server converts the data to the data
type using the code page associated
with the collation of the column. When
a character cannot be represented on a
code page, it is replaced by a
question mark (?), indicating the data
has been lost. Appearance of
unexpected characters or question
marks in your data indicates your data
has been converted from Unicode to
non-Unicode at some layer, and this
conversion resulted in lost
characters.
So this may be the root cause of the problem... and not an easy one to solve on our end.
â is encoded as 0xE2 in ISO-8859-1 and windows-1252. 0xE2 is also a lead byte for a three-byte sequence in UTF-8. (Specifically, for the range U+2000 to U+2FFF, which includes the windows-1252 characters –—‘’‚“”„†‡•…‰‹›€™).
So it looks like you have text encoded in UTF-8 that's getting misinterpreted as being in windows-1252, and displays as a â followed by two unprintable characters.
This is an something of an educated guess that you're just experiencing a naive conversion of Word/PDF documents to HTML. (windows-1252 to utf8 most likely) If that's the case probably 2/3 of the mysterious characters from Word documents are "smart quotes" and most of the rest are a result of their other "smart" editing features, elipsis, em dashes, etc. PDF's probably have similar features.
I would also guess that if the formatting after pasting into the ExtJS editor looks OK, then the encoding is getting passed along. Depending on the resulting use of the text, you may not need to convert.
If I'm still on base, and we're not talking about internationalization issues, then I can add that there are Word to HTML converters out there, but I don't know the details of how they operate, and I had mixed success when evaluating them. There is almost certainly some small information loss/error involved with such converters, since they need to make guesses about the original source of the "smart" characters. In my isolated case it was easier to just go back to the users and have them turn off the "smart" features.
The issue is clear: if the browser is good enough, a form in a web page can accept any Unicode character you can type or paste. If the character belongs to the HTML charset, it will be sent as is. If it doesn't, it'll get converted to an HTML entity. SQL Server will perform the appropriate conversion and silently corrupt your data when a character does not have an equivalent.
There's not much you can do to fully fix it but you can make a workaround: let your servlet perform the conversion. This way you have full control about it. You can, for instance, compile a list of the most common non-Latin1 characters users paste (smart quotes, unicode spaces...), which should be fairly easy to identify from context, and replace them with something else better that ?. Or you use a library that makes this for you.
Or you can switch your DB to Unicode :)
you're storing unicode data that uses 2 bytes per charcter into a varchar type columns that uses 1 byte per character. any text that uses 2 bytes per chars will have 1 byte lost when stored in the db.
all you need to do is change varchar column to nvarchar.
and then change sql parameters you're using in code of course.