Dealing with the Cyrillic encoding in Node.Js / Express App

Dealing with the Cyrillic encoding in Node.Js / Express App - javascript

In my app a user submits text through a form's textarea and this text is passed on to the app and is then processed by jsesc library, which escapes javascript strings.
The problem is that when I type in a text in Russian, such as
нам #интересны наши #идеи
what i get is
'\u043D\u0430\u043C #\u0438\u043D\u0442\u0435\u0440\u0435\u0441\u043D\u044B \u043D\u0430\u0448\u0438 #\u0438\u0434\u0435\u0438'
I then need to pass this data through FlowDock to extract hashtags and FlockDock just does not recognize it.
Can someone please tell me
1) What is the need for converting it into that representation;
2) If it makes sense to convert it back to cyrillic encoding for FlowDock and for the database, or shall I keep it in Unicode and try to make FlowDock work with it?
Thanks!
UPDATE
The complete script is:
result = getField(req, field);
result = S(result).trim().collapseWhitespace().s;
// at this point result = "нам #интересны наши #идеи"
result = jsesc(result, {
'quotes': 'double'
});
// now i end up with Unicode as above above (\u....)
var hashtags = FlowdockText.extractHashtags(result);
FlowDock receives the result which is
\u043D\u0430\u043C #\u0438\u043D\u0442\u0435\u0440\u0435\u0441\u043D\u044B \u043D\u0430\u0448\u0438 #\u0438\u0434\u0435\u0438
And doesn't extract hashtags from it...

These are 2 representations of the same string:
'нам #интересны наши #идеи' === '\u043D\u0430\u043C #\u0438\u043D\u0442\u0435\u0440\u0435\u0441\u043D\u044B \u043D\u0430\u0448\u0438 #\u0438\u0434\u0435\u0438'
looks like flowdock-text doesn't work well with non-ASCII characters
UPD: Tried, actually works well:
fdt.extractHashtags('\u043D\u0430\u043C #\u0438\u043D\u0442\u0435\u0440\u0435\u0441\u043D\u044B \u043D\u0430\u0448\u0438 #\u0438\u0434\u0435\u0438');
You shouldn't have used escaping in the first place, it gives you string literal representation (suits for eval, etc), not a string.
UPD2: I've reduced you code to the following:
var jsesc = require('jsesc');
var fdt = require('flowdock-text');
var result = 'нам #интересны наши #идеи';
result = jsesc(result, {
'quotes': 'double'
});
var hashtags = fdt.extractHashtags(result);
console.log(hashtags);
As I said, the problem is with jsesc: you don't need it. It returns javascript-encoded string. You need when you are doing eval with concatenation to protect from code injection, or something like this. For example if you add result = eval('"' + result + '"');, it will work.

What is the need for converting it into that representation?
jsesc is a JavaScript library for escaping JavaScript strings while generating the shortest possible valid ASCII-only output. Here’s an online demo.
This can be used to avoid mojibake and other encoding issues, or even to avoid errors when passing JSON-formatted data (which may contain U+2028 LINE SEPARATOR, U+2029 PARAGRAPH SEPARATOR, or lone surrogates) to a JavaScript parser or an UTF-8 encoder, respectively.
Sounds like in this case you don’t intend to use jsesc at all.

Try this:
decodeURIComponent("\u043D\u0430\u043C #\u0438\u043D\u0442\u0435\u0440\u0435\u0441\u043D\u044B \u043D\u0430\u0448\u0438 #\u0438\u0434\u0435\u0438");

Related

Matching ouput for HttpServerUtility.UrlTokenEncode in NodeJS Javascript

I am looking at an example in dotnet which looks like the following: https://dotnetfiddle.net/t0y8yD.
The output for the HttpServerUtility.UrlTokenEncode method is:
Pn55YBwEH2S2BEM5qlNrq-LMNE8BDdHYwbWKFEHiPZo1
When I try to complete the same in NodeJS with encodeURI, encodeURIComponent or any other attempt I get the following:
Pn55YBwEH2S2BEM5qlNrq+LMNE8BDdHYwbWKFEHiPZo=
As you can see from the above the '-' should be a '+' and the last character part is different. The hash is created the same and outputs the same buffer.
var hmac = crypto.createHmac("sha256", buf);
hmac.update("9644873");
var hash = hmac.digest("base64");
How can I get the two to match? One other important note is that this is one use case and I am unsure if there are other chars that do the same.
I am unsure if the dotnet variant is incorrect or the NodeJS version is. However, the comparison will be done on the dotnet side, so I need node to match that.

The difference of the two results is caused by the use of Base64URL encoding in the C# code vs. Base64 encoding in node.js.
Base64URL and Base64 are almost identical, but Base64 encoding uses the characters +, / and =, which have a special meaning in URLs and thus have to be avoided. In Base64URL encoding + is replaced with -, / with _ and = (the padding character on the end) is either replaced with %20 or simply omitted.
In your code you're calculating a HMAC-SHA256 hash, so you get a 256 bit result, which can be encoded in 32 bytes. In Base64/Base64URL every character represents 6 bits, therefore you would need 256/6 = 42,66 => 43 Base64 characters. With 43 characters you would have 2 'lonesome' bits on the end, therefore a padding char (=) is added.
The question now is why HttpServerUtility.UrlTokenEncode adds a 1 as a replacement for the padding char on the end. I didn't find anything in the documentation. But you you should keep in mind that it's insignificant anyway.
To to get the same in node.js, you can use the package base64url, or just use simple replace statements on the base64 encoded hash.
With base64url package:
const base64url = require('base64url');
var hmacB64 = "Pn55YBwEH2S2BEM5qlNrq+LMNE8BDdHYwbWKFEHiPZo="
var hmacB64url = base64url.fromBase64(hmacb64)
console.log(hmacB64url)
The result is:
Pn55YBwEH2S2BEM5qlNrq-LMNE8BDdHYwbWKFEHiPZo
as you can see, this library just omits the padding char.
With replace, also replacing the padding = with 1:
var hmacB64 = "Pn55YBwEH2S2BEM5qlNrq+LMNE8BDdHYwbWKFEHiPZo="
console.log(hmacb64.replace(/\//g,'_').replace(/\+/g,'-').replace(/\=+$/m,'1'))
The result is:
Pn55YBwEH2S2BEM5qlNrq-LMNE8BDdHYwbWKFEHiPZo1
I tried the C# code with different data and always got '1' on the end, so to replace = with 1 seems to be ok, though it doesn't seem to be conform to the RFC.
The other alternative, if this is an option for you, is to change the C# code. Use normal base64 encoding plus string replace to get base64url output instead of using HttpServerUtility.UrlTokenEncode
A possible solution for that is described here

I'm new here so I can't comment (need 50 reputation), but I would like to add to #jqs answer that if the string ends with two "=", the replace needs to be done with "2". So my replace looks like:
hmacb64.replace(///g,'_').replace(/+/g,'-').replace(/\=\=$/m,'2').replace(/\=$/m,'1')

How to feed strange characters to javaScript? [duplicate]

I need to insert an Omega (Ω) onto my html page. I am using its HTML escaped code to do that, so I can write Ω and get Ω. That's all fine and well when I put it into a HTML element; however, when I try to put it into my JS, e.g. var Omega = Ω, it parses that code as JS and the whole thing doesn't work. Anyone know how to go about this?

I'm guessing that you actually want Omega to be a string containing an uppercase omega? In that case, you can write:
var Omega = '\u03A9';
(Because Ω is the Unicode character with codepoint U+03A9; that is, 03A9 is 937, except written as four hexadecimal digits.)
Edited to add (in 2022): There now exists an alternative form that better supports codepoints above U+FFFF:
let Omega = '\u{03A9}';
let desertIslandEmoji = '\u{1F3DD}';
Judging from https://caniuse.com/mdn-javascript_builtins_string_unicode_code_point_escapes, most or all browsers added support for it in 2015, so it should be reasonably safe to use.

Although #ruakh gave a good answer, I will add some alternatives for completeness:
You could in fact use even var Omega = 'Ω' in JavaScript, but only if your JavaScript code is:
inside an event attribute, as in onclick="var Omega = '&#937';
alert(Omega)" or
in a script element inside an XHTML (or XHTML + XML) document
served with an XML content type.
In these cases, the code will be first (before getting passed to the JavaScript interpreter) be parsed by an HTML parser so that character references like Ω are recognized. The restrictions make this an impractical approach in most cases.
You can also enter the Ω character as such, as in var Omega = 'Ω', but then the character encoding must allow that, the encoding must be properly declared, and you need software that let you enter such characters. This is a clean solution and quite feasible if you use UTF-8 encoding for everything and are prepared to deal with the issues created by it. Source code will be readable, and reading it, you immediately see the character itself, instead of code notations. On the other hand, it may cause surprises if other people start working with your code.
Using the \u notation, as in var Omega = '\u03A9', works independently of character encoding, and it is in practice almost universal. It can however be as such used only up to U+FFFF, i.e. up to \uffff, but most characters that most people ever heard of fall into that area. (If you need “higher” characters, you need to use either surrogate pairs or one of the two approaches above.)
You can also construct a character using the String.fromCharCode() method, passing as a parameter the Unicode number, in decimal as in var Omega = String.fromCharCode(937) or in hexadecimal as in var Omega = String.fromCharCode(0x3A9). This works up to U+FFFF. This approach can be used even when you have the Unicode number in a variable.

One option is to put the character literally in your script, e.g.:
const omega = 'Ω';
This requires that you let the browser know the correct source encoding, see Unicode in JavaScript
However, if you can't or don't want to do this (e.g. because the character is too exotic and can't be expected to be available in the code editor font), the safest option may be to use new-style string escape or String.fromCodePoint:
const omega = '\u{3a9}';
// or:
const omega = String.fromCodePoint(0x3a9);
This is not restricted to UTF-16 but works for all unicode code points. In comparison, the other approaches mentioned here have the following downsides:
HTML escapes (const omega = '&#937';): only work when rendered unescaped in an HTML element
old style string escapes (const omega = '\u03A9';): restricted to UTF-16
String.fromCharCode: restricted to UTF-16

The answer is correct, but you don't need to declare a variable.
A string can contain your character:
"This string contains omega, that looks like this: \u03A9"
Unfortunately still those codes in ASCII are needed for displaying UTF-8, but I am still waiting (since too many years...) the day when UTF-8 will be same as ASCII was, and ASCII will be just a remembrance of the past.

I found this question when trying to implement a font-awesome style icon system in html. I have an API that provides me with a hex string and I need to convert it to unicode to match with the font-family.
Say I have the string const code = 'f004'; from my API. I can't do simple string concatenation (const unicode = '\u' + code;) since the system needs to recognize that it's unicode and this will in fact cause a syntax error if you try.
#coldfix mentioned using String.fromCodePoint but it takes a number as an argument, not a string.
To finally cross the finish line, just add parseInt and pass 16 (since hex is base 16) to it's second parameter. You'll finally get a unicode string from a simple hex string.
This is what I did:
const code = 'f004';
const toUnicode = code => String.fromCodePoint(parseInt(code, 16));
toUnicode(code);
// => '\uf004'

Try using Function(), like this:
var code = "2710"
var char = Function("return '\\u"+code+"';")()
It works well, just do not add any 's or "s or spaces.
In the example, char is "✐".

Parsing malformed JSON in JavaScript

Thanks for looking!
BACKGROUND
I am writing some front-end code that consumes a JSON service which is returning malformed JSON. Specifically, the keys are not surrounded with quotes:
{foo: "bar"}
I have NO CONTROL over the service, so I am correcting this like so:
var scrubbedJson = dirtyJson.replace(/(['"])?([a-zA-Z0-9_]+)(['"])?:/g, '"$2": ');
This gives me well formed JSON:
{"foo": "bar"}
Problem
However, when I call JSON.parse(scrubbedJson), I still get an error. I suspect it may be because the entire JSON string is surrounded in double quotes but I am not sure.
UPDATE
This has been solved--the above code works fine. I had a rogue single quote in the body of the JSON that was returned. I got that out of there and everything now parses. Thanks.
Any help would be appreciated.

You can avoid using a regexp altogether and still output a JavaScript object from a malformed JSON string (keys without quotes, single quotes, etc), using this simple trick:
var jsonify = (function(div){
return function(json){
div.setAttribute('onclick', 'this.__json__ = ' + json);
div.click();
return div.__json__;
}
})(document.createElement('div'));
// Let's say you had a string like '{ one: 1 }' (malformed, a key without quotes)
// jsonify('{ one: 1 }') will output a good ol' JS object ;)
Here's a demo: http://codepen.io/csuwldcat/pen/dfzsu (open your console)

something like this may help to repair the json ..
$str='{foo:"bar"}';
echo preg_replace('/({)([a-zA-Z0-9]+)(:)/','$1"$2"${3}',$str);
Output:
{"foo":"bar"}
EDIT:
var str='{foo:"bar"}';
str.replace(/({)([a-zA-Z0-9]+)(:)/,'$1"$2"$3')

There is a project that takes care of all kinds of invalid cases in JSON https://github.com/freethenation/durable-json-lint

I was trying to solve the same problem using a regEx in Javascript. I have an app written for Node.js to parse incoming JSON, but wanted a "relaxed" version of the parser (see following comments), since it is inconvenient to put quotes around every key (name). Here is my solution:
var objKeysRegex = /({|,)(?:\s*)(?:')?([A-Za-z_$\.][A-Za-z0-9_ \-\.$]*)(?:')?(?:\s*):/g;// look for object names
var newQuotedKeysString = originalString.replace(objKeysRegex, "$1\"$2\":");// all object names should be double quoted
var newObject = JSON.parse(newQuotedKeysString);
Here's a breakdown of the regEx:
({|,) looks for the beginning of the object, a { for flat objects or , for embedded objects.
(?:\s*) finds but does not remember white space
(?:')? finds but does not remember a single quote (to be replaced by a double quote later). There will be either zero or one of these.
([A-Za-z_$\.][A-Za-z0-9_ \-\.$]*) is the name (or key). Starts with any letter, underscore, $, or dot, followed by zero or more alpha-numeric characters or underscores or dashes or dots or $.
the last character : is what delimits the name of the object from the value.
Now we can use replace() with some dressing to get our newly quoted keys:
originalString.replace(objKeysRegex, "$1\"$2\":")
where the $1 is either { or , depending on whether the object was embedded in another object. \" adds a double quote. $2 is the name. \" another double quote. and finally : finishes it off.
Test it out with
{keyOne: "value1", $keyTwo: "value 2", key-3:{key4:18.34}}
output:
{"keyOne": "value1","$keyTwo": "value 2","key-3":{"key4":18.34}}
Some comments:
I have not tested this method for speed, but from what I gather by reading some of these entries is that using a regex is faster than eval()
For my application, I'm limiting the characters that names are allowed to have with ([A-Za-z_$\.][A-Za-z0-9_ \-\.$]*) for my 'relaxed' version JSON parser. If you wanted to allow more characters in names (you can do that and still be valid), you could instead use ([^'":]+) to mean anything other than double or single quotes or a colon. You can have all sorts of stuff in here with this expression, so be careful.
One shortcoming is that this method actually changes the original incoming data (but I think that's what you wanted?). You could program around that to mitigate this issue - depends on your needs and resources available.
Hope this helps.
-John L.

How about?
function fixJson(json) {
var tempString, tempJson, output;
tempString = JSON.stringify(json);
tempJson = JSON.parse(tempString);
output = JSON.stringify(tempJson);
return output;
}

Insert Unicode character into JavaScript

I need to insert an Omega (Ω) onto my html page. I am using its HTML escaped code to do that, so I can write Ω and get Ω. That's all fine and well when I put it into a HTML element; however, when I try to put it into my JS, e.g. var Omega = Ω, it parses that code as JS and the whole thing doesn't work. Anyone know how to go about this?

I'm guessing that you actually want Omega to be a string containing an uppercase omega? In that case, you can write:
var Omega = '\u03A9';
(Because Ω is the Unicode character with codepoint U+03A9; that is, 03A9 is 937, except written as four hexadecimal digits.)
Edited to add (in 2022): There now exists an alternative form that better supports codepoints above U+FFFF:
let Omega = '\u{03A9}';
let desertIslandEmoji = '\u{1F3DD}';
Judging from https://caniuse.com/mdn-javascript_builtins_string_unicode_code_point_escapes, most or all browsers added support for it in 2015, so it should be reasonably safe to use.

Although #ruakh gave a good answer, I will add some alternatives for completeness:
You could in fact use even var Omega = 'Ω' in JavaScript, but only if your JavaScript code is:
inside an event attribute, as in onclick="var Omega = '&#937';
alert(Omega)" or
in a script element inside an XHTML (or XHTML + XML) document
served with an XML content type.
In these cases, the code will be first (before getting passed to the JavaScript interpreter) be parsed by an HTML parser so that character references like Ω are recognized. The restrictions make this an impractical approach in most cases.
You can also enter the Ω character as such, as in var Omega = 'Ω', but then the character encoding must allow that, the encoding must be properly declared, and you need software that let you enter such characters. This is a clean solution and quite feasible if you use UTF-8 encoding for everything and are prepared to deal with the issues created by it. Source code will be readable, and reading it, you immediately see the character itself, instead of code notations. On the other hand, it may cause surprises if other people start working with your code.
Using the \u notation, as in var Omega = '\u03A9', works independently of character encoding, and it is in practice almost universal. It can however be as such used only up to U+FFFF, i.e. up to \uffff, but most characters that most people ever heard of fall into that area. (If you need “higher” characters, you need to use either surrogate pairs or one of the two approaches above.)
You can also construct a character using the String.fromCharCode() method, passing as a parameter the Unicode number, in decimal as in var Omega = String.fromCharCode(937) or in hexadecimal as in var Omega = String.fromCharCode(0x3A9). This works up to U+FFFF. This approach can be used even when you have the Unicode number in a variable.

One option is to put the character literally in your script, e.g.:
const omega = 'Ω';
This requires that you let the browser know the correct source encoding, see Unicode in JavaScript
However, if you can't or don't want to do this (e.g. because the character is too exotic and can't be expected to be available in the code editor font), the safest option may be to use new-style string escape or String.fromCodePoint:
const omega = '\u{3a9}';
// or:
const omega = String.fromCodePoint(0x3a9);
This is not restricted to UTF-16 but works for all unicode code points. In comparison, the other approaches mentioned here have the following downsides:
HTML escapes (const omega = '&#937';): only work when rendered unescaped in an HTML element
old style string escapes (const omega = '\u03A9';): restricted to UTF-16
String.fromCharCode: restricted to UTF-16

The answer is correct, but you don't need to declare a variable.
A string can contain your character:
"This string contains omega, that looks like this: \u03A9"
Unfortunately still those codes in ASCII are needed for displaying UTF-8, but I am still waiting (since too many years...) the day when UTF-8 will be same as ASCII was, and ASCII will be just a remembrance of the past.

I found this question when trying to implement a font-awesome style icon system in html. I have an API that provides me with a hex string and I need to convert it to unicode to match with the font-family.
Say I have the string const code = 'f004'; from my API. I can't do simple string concatenation (const unicode = '\u' + code;) since the system needs to recognize that it's unicode and this will in fact cause a syntax error if you try.
#coldfix mentioned using String.fromCodePoint but it takes a number as an argument, not a string.
To finally cross the finish line, just add parseInt and pass 16 (since hex is base 16) to it's second parameter. You'll finally get a unicode string from a simple hex string.
This is what I did:
const code = 'f004';
const toUnicode = code => String.fromCodePoint(parseInt(code, 16));
toUnicode(code);
// => '\uf004'

Try using Function(), like this:
var code = "2710"
var char = Function("return '\\u"+code+"';")()
It works well, just do not add any 's or "s or spaces.
In the example, char is "✐".

How to process a javascript function call that returns a string with quotes in it

I'm having an issue with a javascript call to an api function in NetSuite that returns a string with quotes in it. An error is thrown each time the call is made.
var selling_point_1 = "<%=getCurrentAttribute('item','custitemsellingpoint1')%>";
when looking in the debugger, this evaluates to:
var selling_point_1 = "Product Dimensions: H:14" W:24"";
Any string function (like .length or charAt(0) ) on this also throws an error. I have no control over what the function call returns, so i need to know how to handle embedded quotes.
Any help would be greatly appreciated, John

Although not the most robust method you could use:
var selling_point_1 = escape("<%=getCurrentAttribute('item','custitemsellingpoint1')%>");
This is actually for URI escaping, but will get rid of the pesky double quotes, plus you can use unescape to get the original format back. As suggested
var selling_point_1 = '<%=getCurrentAttribute(\'item\',\'custitemsellingpoint1\')%>';
Should also work in your case.

See this thread for someone dealing with roughly the same issue. The short answer is that you need to run some kind of escape function in the server-side code (i.e., within the <%=...%> block) so that only escaped values get inserted into the client-side code. All of the solutions below can handle an unlimited number of single and double quotes.
My first suggestion is to try:
var selling_point_1 = decodeURI("<%=Server.URLEncode(getCurrentAttribute('item','custitemsellingpoint1'))%>");
This will produce server-side JS that looks like:
var selling_point_1 = decodeURI("Product Dimensions: H:14%22 W:24%22");
The decodeURI JavaScript function will convert the %22 back into quotes and the correct string will be stored in selling_point_1.
If that fails, you might also try something like:
var selling_point_1 = unescape("<%=HttpServerUtility.HtmlEncode(getCurrentAttribute('item','custitemsellingpoint1'))%>");
which operates similarly, but tuuns your quotes into \" sequences, which will be converted back into ordinary quotes by JavaScript's unescape.

Develop Reference

JavaScript is the programming language of the Web.

Dealing with the Cyrillic encoding in Node.Js / Express App - javascript

Try this: decodeURIComponent("\u043D\u0430\u043C #\u0438\u043D\u0442\u0435\u0440\u0435\u0441\u043D\u044B \u043D\u0430\u0448\u0438 #\u0438\u0434\u0435\u0438");

Related

Matching ouput for HttpServerUtility.UrlTokenEncode in NodeJS Javascript

How to feed strange characters to javaScript? [duplicate]

Parsing malformed JSON in JavaScript

Insert Unicode character into JavaScript

How to process a javascript function call that returns a string with quotes in it

Categories

Resources