I am trying to do a substr on a UTF-8 string like हिन्दी.
The problem is that it becomes totally screwed up=> with some weird box in the end (does not show here, although i copy pasted) (its something like [00 02]): हिन...
okay this is how it appers after using substr function:
alt text http://img27.imageshack.us/img27/765/capturexv.png
Wondering if there is some function to solve this problem? Atleast I want to remove that funny box.
Thank you for your time.
JavaScript encodes strings with UTF-16, meaning characters outside the basic multilingual plane have to be represented as a surrogate pair. Splitting a string in the middle of such a pair might explain your results.
As I understand the wikipedia article, you'll have to check if your last character lies in the range 0xD800–0xDBFF and, if so, either drop it or add the following character (which should be in range 0xDC00-0xDFFF) to the substring.
I believe that the box is the font's representation of the UTF-8 values that the substring created. Try to remove the character at the box's position and it should be removed.
Try avoiding to put UTF-8 byte sequences into JavaScript string objects. Instead, rely on the Unicode support of JavaScript, and use a proper Unicode string (instead of an UTF-8 string).
My guess is that you managed to slice the string in the middle of a character, so that the result is an incomplete character. Browser then try to render it anyway, leading to moji-bake.
Related
This is very similar to
Regular expression to find unescaped double quotes in CSV file
However, the solutions presented don't work with Node.js's regex engine. Given a CSV string where columns are quoted with double quotes, but some columns have unescaped double quotes in them, what regex could be used to match these unescaped quotes and just remove them.
Example rows
"123","","SDFDS SDFSDF EEE "S"","asdfas","b","lll"
"123","","SDFDS SDFSDF EEE "S"","asdfas","b","lll"
So the two double quotes surrounding the S in the third column would get matched and removed. Needs to work in Node.js (14.16.1)
I have tried (?m)""(?![ \t]*(,|$)) but get a Invalid regular expression: /(?m)""(?![ \t]*(,|$))/: Invalid group exception
I don't know much about node.js, but assuming it is like the JavaScript flavor of regex then I have the following comments about the example you took from the prior answer:
I think your example is choking on the first element, (?m) which is unsupported in Javascript. However, that part is not essential to your task. It only turns on multiline processing and you don't need that if you feed the regex engine each line individually. If you find you still want to feed it a multiline string, then you can still turn on multiline in JavaScript - you do it with the "m" flag after the final delimiter, "/myregex/m". All of the other elements, including the negative lookahead are supported by JavaScript and probably by your engine as well. So, drop the (?m) part of your expression and try it again.
Even after you get it to work, the example row you provided will not be parsed according to your expectations by the sample regular expression. Its function is to identify all occurrences of two double-quotes that are not followed by a comma (or end of string). The ONLY two occurrences of doubled quotes in your example each have a comma after, so you will get no matches on this regex in your example.
It seems like you want some context-sensitive scanning to match and remove the inner pairs of double quotes while leaving the outer ones in place and handling commas inside your strings and possibly correctly quoted double quotes. Regular expression engines are really bad at this kind of processing and I don't think you are going to get satisfactory results whatever you come up with.
You can get an approximate solution to your problem by using regex once to parse the individual elements of the .csv stripping the outer quotes as you go and then running a second regex against each parsed element to either remove single occurrences of double quote or adding a second double-quote, where necessary. Then you can reassemble the string under program control.
This still will break if someone embeds a "", sequence in a data field string, so it's not perfect but it might be good enough for you.
The regex for splitting the .csv and stripping the double quotes is:
/(("(.*?)")|([^,]*))(,|$)/gm
This will accept either a "anything", OR a anything, repeatedly until the source is exhausted. Because of the capturing groups, the parsed text will either by in $3 (if the field was quoted) or $4 (if it was not quoted) but not both.
Here's a regexpReplace of your string with $3&$4 and a semicolon after each iteration (I took the liberty of adding a numeric field without the quotes so you could see that it handles both cases):
"123","","SDFDS SDFSDF EEE "S"",456,"asdfas","b","lll"
RegexpReplace(<above>,"((""(.*?)"")|([^,]*))(,|$)","$3$4;")
=> 123;;SDFDS SDFSDF EEE "S";456;asdfas;b;lll;;
See how the outer quotes have been stripped away. Now it's a simple thing to go through all the matches to remove all the remaining quotes, and then you can reconstruct the string from the array of matches.
I'm writing some code that scans a string for TeX-style Greek character (like \Delta or \alpha), and replaces the string with the Unicode symbol. It works fine for the non-italic Greek characters. The problem is that I want to use mathematical italic for the lower case. These codes are one digit longer. For example, the code for the letter alpha is 1d6fc. When I put \u1d6fc into my string it displays as the character that matches \u1d6f (a lower case m with a superimposed tilde) followed by the letter c. How do I force the "correct" reading of the code?
You have to use UTF-16 surrogate pairs for characters beyond the UTF-16 range. In your particular case, you can use 0xD835 0xDEFC:
console.log('\uD835\uDEFC')
Here is a handy pair calculator. If you don't have to worry about Internet Explorer, you can also use String.fromCodePoint(), which will deal with that mess for you. If you do have to worry about Internet Explorer, MDN has a polyfill for that method.
To produce a \u escape sequence with more than 4 hex digits (code point belonging to a so-called astral plane), you can use the Unicode code point escape notation \u{xxxxx}:
console.log ('\u{1d6fc}');
or you can call String.fromCodePoint with the code point value expressed in hexadecimal using the 0x prefix notation:
console.log (String.fromCodePoint (0x1d6fc));
Is it possible to create an invalid UTF8 string using Javascript?
Every solution I've found relies String.fromCharCode which generates undefined rather than an invalid string. I've seen mention of errors being generated by ill-formed UTF8 string (i.e. https://developer.mozilla.org/en-US/docs/Web/API/WebSocket#send()) but I can't figure out how you would actually create one.
One way to generate an invalid UTF-8 string with JavaScript is to take an emoji and remove the last byte.
For example, this will be an invalid UTF-8 string:
const invalidUtf8 = '🐶🐶🐶'.substr(0,5);
A string in JavaScript is a counted sequence of UTF-16 code units. There is an implicit contract that the code units represent Unicode codepoints. Even so, it is possible to represent any sequence of UTF-16 code units—even unpaired surrogates.
I find String.fromCharCode(0xd801) returns the replacement character, which seems quite reasonable (rather than undefined). Any text function might do that but, for efficiency reasons, I'm sure that many text manipulations would just pass invalid sequences through unless the manipulation required interpreting them as codepoints.
The easiest way to create such a string is with a string literal. For example, "\uD83D \uDEB2" or "\uD83D" or "\uDEB2" instead of the valid "\uD83D\uDEB2".
"\uD83D \uDEB2".replace(" ","") actually does return "\uD83D\uDEB2" ("🚲") but I don't think you should count on anything good coming from a string that isn't a valid UTF-16 encoding of Unicode codepoints.
I have a pretty simple question, but a few simple googling and stachexchange queries were not able to answer it, so i guess i'm missing something here.
Here are my simplified parameters:
I'm using Javascript.
I have a text that needs to get URLEncoded and the text have more than 1 line.
My question is: What is the character for newline before the text get encoded? (I know that after the encoding the newline will be encoded into %0A)
I guess asking "What char is decoded when decoding %0A" will be the same.
Those codes consist of a percent sign, followed by a two character hexadecimal number representing a byte value.
So in this case, the byte value is 0A, representing the ASCII newline character. This is commonly written as \n inside strings in JavaScript (and others, like PHP).
But I think your question suggests you want to do some search and replace for this character. I would not do that, since there can be other characters too that need encoding. Instead, use the function encodeURIComponent, which can encode the entire string for you. There is encodeURI as well, but in your case, I think the first is more appropriate.
This example shows how special characters (newline, space, and others) are encoded to an url-friendly format. Note that the diacritic é translates to the two bytes of its UTF-8 representation.
document.write(encodeURIComponent("Normal text\nEéy, check the specials: /, + and \t!"));
I am trying to create some random unicode strings within javascript and was wondering if there was an easy way. I tried doing something like this...
var username = "David Perry" + "/u4589";
But it just appends /u4589 to the end which is to be expected since it's just a string. What I WANT it to do is convert that into the unicode character in the string (AS IF I typed ALT 4589 on the keypad). I'm trying to build the string within javascript because I wanna test my form with various symbols and stuff and I'm tired of trying ALT codes to see what weird characters there are... so I thought.. I would loop through ALL unicode characters for FUN and populate my form and submit it automatically...
I was going to start at /u0000 and go up to /uffff and see which codes break my website when outputting them :)
I know there are different functions in JS but I can't seem to figure out why I can't build a string of unicode characters. lol.
If it's too complicated don't worry about it. It's just something I wanted to tinker with.
Try "\u4589" instead of "/u4589":
>>> "/u4589"
"/u4589"
>>> "\u4589"
"䖉"
the forward slash (/) is just a forward slash in a string, however the backslash (\) is an escape character.
If you wish to generate random characters or loop through a range of characters, then you could use String.fromCharCode(), which gives you the character with the Unicode number passed as argument, e.g. String.fromCharCode(0x4589) or String.fromCharCode(i) where i is a variable with an integer value.
Both the \uxxxx notation and the String.fromCharCode() work up to 0xFFFF only, i.e. for Basic Multilingual Plane characters. This may well suffice, but if you need non-BMP characters, check out e.g. the MDN page on fromCharCode.