Getting question marks when trying to get data from API - javascript

I'm using node-webkit to build an app that alerts me every time there is an alarm in my country (we are currently in a war). There is a website that supplies a JSON file that contains info about current alarms. When I try to access that page and check whether there are alarms, the result is a lot of question marks. I can't use that, and when I try to JSON.parse the data it says that it cannot parse question marks. What do I do?
url: "http://www.oref.org.il/WarningMessages/alerts.json",
checkAlert: function(callback) {
request({
uri: this.url,
json: true,
encoding: 'utf-8'
}, function(err, res, json) {
if (err)
return console.log(err);
json = JSON.parse(json);
var data = json.data;
console.log('just checked. json.data: ' + data);
if (data.length != 0) // if array is not empty
callback(true);
else
callback(false);
});
}
Here's what the file looks like:
{
    "id" : "1405751634717",
    "title" : "something in hebrew ",
    "data" : []
}
Thanks a lot!

That API returns a JSON response encoded in UTF-16-LE, so you'll have to tell request to use that encoding instead.
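For example, a minimal sketch (assuming the response may carry a UTF-16LE byte-order mark, which is stripped before parsing):
var request = require('request');

request({
    uri: 'http://www.oref.org.il/WarningMessages/alerts.json',
    encoding: null // null tells request to hand back the body as a raw Buffer
}, function(err, res, body) {
    if (err)
        return console.log(err);
    // decode as UTF-16LE and drop a leading BOM, if present
    var text = body.toString('utf16le').replace(/^\uFEFF/, '');
    var json = JSON.parse(text);
    console.log(json.data);
});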
However, since you're trying to query Pikud Haoref's alerts API, check out pikud-haoref-api on npm to do the heavy lifting for you:
https://www.npmjs.com/package/pikud-haoref-api
(Disclaimer: I created this package)

Have a look here: jQuery doesn't display Hebrew
And be totally sure first that your JSON files are actually encoded in UTF-8.
You might want to check how your server is serving those JSON files and what encoding they use.
Check also this link: http://dougal.gunters.org/blog/2012/03/14/dealing-with-utf-in-node-js/
Quick overview:
“V8 currently only accepts characters in the BMP as input, using UCS-2
as internal representation (the same representation as JavaScript
strings).” Basically, this means that JavaScript uses the UCS-2
character encoding internally, which is strictly a 16-bit format,
which in turn means that it can only support the first 65,536
code-points of Unicode characters. Any characters that fall outside
that range are apparently truncated in the conversion from UTF-8 to
UCS-2, mangling the character stream. In my case (as with many others
I found in my research) this surfaces when the system attempts to
serialize/deserialize these strings as JSON objects. In the
conversion, you can end up with character sequences which are invalid
UTF-8. When browsers see these broken strings come in, they promptly
drop the connection mid-stream, apparently as a security measure. (I
sort-of understand this, but would have a hard time explaining it,
because these character-encoding discussions give me a headache).

Related

Weird characters in a stringified buffer in javascript

I have a particular context in which data is transformed a lot as it gets transferred across the network. At the end, when I try to get this data back, I have unwanted characters at the beginning of the string.
First, I get the data from a db and it returns it to me as bytes (<Array.<byte>>), fully readable with .toString(). The result is:
{\"company\":\"xxx\",\"email\":\"xxx\",\"firstName\":\"xxx\",\"lastName\":\"xxx\",\"providerId\":\"xxx\",\"role\":\"xxx\",\"status\":\"xxx\"}
This data is passed to another "environment" by a function (not developed by me, and which I cannot change) that returns the data in a format I don't really recognize.
I can decode it with the following piece of code:
jsonIdentity = JSON.stringify(bufferIdentity);
Buffer.from(JSON.parse(jsonIdentity).payload.buffer.data).toString('utf-8')
However, at the beginning of the string, I have the following:
"\u0008\u0006\u001a�\u0001\u0008�\u0001\u001a{{\"company\":\"xxx\",\"email\":\"xxx\",\"firstName\":\"xxx\",\"lastName\":\"xxx\",\"providerId\":\"xxx\",\"role\":\"xxx\",\"status\":\"xxx\"}
Also represented like that in my logs:
��{{"company":"xxx","email":"xxx","firstName":"xxx","lastName":"xxx","providerId":"xxx","role":"xxx","status":"xxx"
How can I remove it, or prevent it from getting into my result? It stops me from using the JSON.
Update: here is the buffer I get:
{"status":200,"message":"","payload":{"buffer":{"type":"Buffer","data":[8,6,26,128,1,8,200,1,26,123,123,34,99,111,109,112,97,110,121,34,58,34,105,98,109,34,44,34,101,109,97,105,108,34,58,34,102,64,105,98,109,46,99,111,109,34,44,34,102,105,114,115,116,78,97,109,101,34,58,34,102,108,111,114,105,97,110,34,44,34,108,97,115,116,78,97,109,101,34,58,34,99,97,115,116,34,44,34,112,114,111,118,105,100,101,114,73,100,34,58,34,102,99,34,44,34,114,111,108,101,34,58,34,117,115,101,114,34,44,34,115,116,97,116,117,115,34,58,34,111,107,34,125,34,64,98,54,57,51,50,51,53,100,49,52,97,49,98,102,57,57,56,100,50,99,97,102,53,53,52,52,100,97,49,50,50,51,55,101,97,55,99,50,56,55,50,49,56,97,101,55,51,100,55,97,50,53,101,52,55,48,48,51,56,52,100,54,53,54,58,14,100,101,102,97,117,108,116,99,104,97,110,110,101,108]},"offset":10,"markedOffset":-1,"limit":133,"littleEndian":true,"noAssert":false}}
Can you try this out:
const yourString = JSON.parse(jsonIdentity).payload.buffer.data;
console.log(Buffer.from(yourString, 'base64').toString('utf-8'))
An ugly solution would be to just trim or replace those characters in your result.
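A rough sketch of that trim idea (fragile by design: it assumes the payload embeds exactly one flat JSON object, with no nested braces):
const raw = Buffer.from(JSON.parse(jsonIdentity).payload.buffer.data).toString('utf-8');
// grab one flat JSON object: from '{"' up to the first closing brace
const match = raw.match(/\{"[^{}]*\}/);
const identity = match ? JSON.parse(match[0]) : null;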
Problem fixed. Solution is available on Jira here: https://jira.hyperledger.org/browse/FAB-14785?focusedCommentId=58680&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-58680

How to handle weirdly combined websocket messages?

I'm connecting to an external websocket api using the node ws library (node 10.8.0 on Ubuntu 16.04). I've got a listener which simply parses the json and passes it to the callback:
this.ws.on('message', (rawdata) => {
    let data = null;
    try {
        data = JSON.parse(rawdata);
    } catch (e) {
        console.log('Failed parsing the following string as json: ' + rawdata);
        return;
    }
    mycallback(data);
});
I now receive errors in which the rawdata looks as follows (I formatted it and removed irrelevant contents):
�~A
{
"id": 1,
etc..
}�~�
{
"id": 2,
etc..
I then wondered; what are these characters? Seeing the structure I initially thought that the first weird sign must be an opening bracket of an array ([) and the second one a comma (,) so that it creates an array of objects.
I then investigated the problem further by writing the rawdata to a file whenever a JSON parsing error occurs. In an hour or so it saved about 1500 of these error files, meaning this happens a lot. I cat'ed a couple of these files in the terminal; an example looks like the snippet above.
A few things are interesting here:
The files always start with one of these weird signs.
The files appear to consist of multiple messages that should have been received separately. The weird signs separate those individual messages.
The files always end with an unfinished json object.
The files are of varying lengths. They are not always the same size and are thus not cut off on a specific length.
I'm not very experienced with websockets, but could it be that my websocket somehow receives a stream of messages that it concatenates together, with these weird signs as separators, and then randomly cuts off the last message? Maybe because I'm getting a constant, very fast stream of messages?
Or could it be because of an error (or functionality) server side in that it combines those individual messages?
Does anybody know what's going on here? All tips are welcome!
[EDIT]
@bendataclear suggested interpreting it as utf8. So I did. Printed as-is and then interpreted as utf8, the output still doesn't look like anything to me. I could of course convert to utf8 and then split by those characters; although the last message is always cut off, this would at least make some of the messages readable. Other ideas are still welcome though.
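A rough sketch of that split idea (lossy on purpose: a trailing cut-off fragment simply fails to parse and gets skipped; the indexOf('{') step drops whatever residue follows each separator):
this.ws.on('message', (rawdata) => {
    const parts = rawdata.toString('utf8').split('\ufffd');
    for (const part of parts) {
        const start = part.indexOf('{');
        if (start === -1) continue; // no JSON in this fragment
        try {
            mycallback(JSON.parse(part.slice(start)));
        } catch (e) {
            // truncated message at the end of the buffer: skip it
        }
    }
});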
My assumption is that you're working only with English/ASCII characters and that something has messed up the stream. (NOTE: I am assuming there are no special characters.) If that's so, then I suggest you pass the entire JSON string into this function:
function cleanString(input) {
    var output = "";
    for (var i = 0; i < input.length; i++) {
        // keep only plain ASCII characters (code points 0-127)
        if (input.charCodeAt(i) <= 127) {
            output += input.charAt(i);
        }
    }
    return output;
}
// example
console.log(cleanString("�~�")); // prints "~"
You can make reference to How to remove invalid UTF-8 characters from a JavaScript string?
EDIT
From an article by the Internet Engineering Task Force (IETF):
A common class of security problems arises when sending text data using the wrong encoding. This protocol specifies that messages with a Text data type (as opposed to Binary or other types) contain UTF-8-encoded data. Although the length is still indicated and applications implementing this protocol should use the length to determine where the frame actually ends, sending data in an improper encoding may still break assumptions that the application making use of the protocol may hold, leading to anything from misinterpretation of data to loss of data or potential security bugs.
The "Payload data" is text data encoded as UTF-8. Note that a particular text frame might include a partial UTF-8 sequence; however, the whole message MUST contain valid UTF-8. Invalid UTF-8 in reassembled messages is handled as described in Handling Errors in UTF-8-Encoded Data, which states that When an endpoint is to interpret a byte stream as UTF-8 but finds that the byte stream is not, in fact, a valid UTF-8 stream, that endpoint MUST Fail the WebSocket Connection. This rule applies both during the opening handshake and during subsequent data exchange.
I really believe that your error (or functionality) is coming from the server side, which combines your individual messages, so I suggest coming up with logic that ensures all your characters are converted from Unicode to ASCII by first encoding them as UTF-8. You might also want to install npm install --save-optional utf-8-validate to efficiently check whether a message contains valid UTF-8, as required by the spec.
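A minimal sketch of that check (assuming rawdata arrives as a Buffer or string; utf-8-validate exports a single predicate that takes a Buffer):
const isValidUTF8 = require('utf-8-validate');

this.ws.on('message', (rawdata) => {
    const buf = Buffer.isBuffer(rawdata) ? rawdata : Buffer.from(rawdata);
    if (!isValidUTF8(buf)) {
        console.log('Skipping a frame that is not valid UTF-8');
        return;
    }
    let data;
    try {
        data = JSON.parse(buf.toString('utf8'));
    } catch (e) {
        return; // valid UTF-8, but still not parseable JSON
    }
    mycallback(data);
});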
You might also want to add an if condition to help you do some checks:
this.ws.on('message', (rawdata) => {
    if (typeof rawdata === 'string') { // accept only text frames
        // ...
    }
});
I hope this gets to help.
The problem you have is that one side sends the JSON in a different encoding than the one the other side uses to interpret it.
Try to solve this problem with the following code:
const { StringDecoder } = require('string_decoder');
this.ws.on('message', (rawdata) => {
    const decoder = new StringDecoder('utf8');
    const buffer = Buffer.from(rawdata);
    console.log(decoder.write(buffer));
});
Or with UTF-16LE:
const { StringDecoder } = require('string_decoder');
this.ws.on('message', (rawdata) => {
    const decoder = new StringDecoder('utf16le');
    const buffer = Buffer.from(rawdata);
    console.log(decoder.write(buffer));
});
Please read: String Decoder Documentation
It seems your output contains some spaces. If you have any spaces, or if you find any special characters, use Unicode to fill them in.
Here is the list of Unicode characters.
This might help, I think.
Those characters are known as "REPLACEMENT CHARACTER" - used to replace an unknown, unrecognized or unrepresentable character.
From: https://en.wikipedia.org/wiki/Specials_(Unicode_block)
The replacement character � (often a black diamond with a white question mark or an empty square box) is a symbol found in the Unicode standard at code point U+FFFD in the Specials table. It is used to indicate problems when a system is unable to render a stream of data to a correct symbol. It is usually seen when the data is invalid and does not match any character
Checking the section 8 of the WebSocket protocol Error Handling:
8.1. Handling Errors in UTF-8 from the Server
When a client is to interpret a byte stream as UTF-8 but finds that the byte stream is not in fact a valid UTF-8 stream, then any bytes or sequences of bytes that are not valid UTF-8 sequences MUST be interpreted as a U+FFFD REPLACEMENT CHARACTER.
8.2. Handling Errors in UTF-8 from the Client
When a server is to interpret a byte stream as UTF-8 but finds that the byte stream is not in fact a valid UTF-8 stream, behavior is undefined. A server could close the connection, convert invalid byte sequences to U+FFFD REPLACEMENT CHARACTERs, store the data verbatim, or perform application-specific processing. Subprotocols layered on the WebSocket protocol might define specific behavior for servers.
How to deal with this depends on the implementation or library in use. For example, from this post Implementing Web Socket servers with Node.js:
socket.ondata = function(d, start, end) {
    //var data = d.toString('utf8', start, end);
    var original_data = d.toString('utf8', start, end);
    var data = original_data.split('\ufffd')[0].slice(1);
    if (data == "kill") {
        socket.end();
    } else {
        sys.puts(data);
        socket.write("\u0000", "binary");
        socket.write(data, "utf8");
        socket.write("\uffff", "binary");
    }
};
In this case, if a � is found it will do:
var data = original_data.split('\ufffd')[0].slice(1);
if (data == "kill") {
    socket.end();
}
Another thing that you could do is to update node to the latest stable, from this post OpenSSL and Breaking UTF-8 Change (fixed in Node v0.8.27 and v0.10.29):
As of these releases, if you try and pass a string with an unmatched surrogate pair, Node will replace that character with the unknown unicode character (U+FFFD). To preserve the old behavior set the environment variable NODE_INVALID_UTF8 to anything (even nothing). If the environment variable is present at all it will revert to the old behavior.

iron-ajax trying to transmit by xhr multiline string or array subexpression

I have a multiline string coming from a paper-textarea element that I am trying to transmit to the server.
My first attempt was to send it just like that to the server, but iron-ajax cuts out the newline characters, presumably because of json encoding issues.
My second attempt involves splitting the string lines into entries of an array, so let's see how that goes.
<iron-ajax
    ...
    params="{{ajax_new_tag_and_entry}}"
    ...>
</iron-ajax>
This is the function that changes ajax_new_tag_and_entry:
tap_submit_entry: function() {
    this.ajax_new_tag_and_entry = {
        tag: this.journal_tags[this.the_journal_tag].tag,
        entry: this.the_journal_entry.split("\n")
    };
    console.log(this.the_journal_entry);
    console.log(this.the_journal_entry.split("\n"));
}
When I do 'console.log(this.the_journal_entry);' I get:
One message
to rule
them all.
When I do 'console.log(this.the_journal_entry.split("\n"));' I get:
Array [ "One message", "to rule", "them all." ]
But the Firefox developer tools tell me that these are the parameters sent to the server:
tag:"general_message"
entry:"One message"
entry:"to rule"
entry:"them all."
This obviously means that entry was split up into three separate entry parameters (all with the same name) in the XHR, instead of being one single entry array holding the three lines of the message.
I would appreciate it if anyone has any thoughts on how I could fix this issue.
If newlines are to be preserved, the server (or the data receiver) ultimately needs to recover the newlines using an agreed-upon format.
Splitting the multi-line string into an array (as you've done) is one way of doing it. The server/receiver would then join the array with newlines as deserialization. Note that it's perfectly acceptable to have multiple query parameters of the same name in a URL, which the receiver could merge into an array (as seen in this Flask test app).
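For instance, a minimal Express sketch of the server/receiver side (the endpoint and port are hypothetical): repeated entry parameters are parsed into an array, which gets re-joined with newlines.
const express = require('express');
const app = express();

// e.g. GET /entry?tag=x&entry=One%20message&entry=to%20rule&entry=them%20all.
app.get('/entry', (req, res) => {
    const entry = Array.isArray(req.query.entry)
        ? req.query.entry.join('\n') // restore the original multi-line string
        : req.query.entry;
    res.send({ tag: req.query.tag, entry: entry });
});

app.listen(3000);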
Alternatively, you could encode (i.e., replace \n with %0A) or escape (i.e., replace \n with \\n) the newlines. Then, the server/receiver has to decode/unescape them to restore the original message.
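A tiny sketch of the encoding alternative on the client side:
// encodeURIComponent percent-encodes '\n' as '%0A'
var encoded = encodeURIComponent(this.the_journal_entry);
// send `encoded` as the parameter value; the server URL-decodes it
// to recover the original multi-line string.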

Decode JSON string in Mojolicious that was encoded with JSON.stringify

I am trying to send a JavaScript variable as a JSON string to Mojolicious, and I am having problems decoding it on the Perl side. My page uses utf-8 encoding.
The JSON string (the value of $self->param('routes_jsonstr')) seems to have the correct value, but Mojo::JSON can't decode it. The code works well when there are no utf-8 characters. What am I doing wrong?
Javascript code:
var routes = [{
    addr1: 'Škofja Loka', // string with a utf-8 character
    addr2: 'Kranj'
}];
var routes_jsonstr = JSON.stringify(routes);
$.get(url.on_route_change, {
    routes_jsonstr: routes_jsonstr
});
Perl code:
sub on_route_change {
    my $self = shift;
    my $routes = j( $self->param('routes_jsonstr') );
    warn $self->param('routes_jsonstr');
    warn Dumper $routes;
}
Server output
Wide character in warn at /opt/mojo/routes/script/../lib/Routes/Homepage.pm line 76.
[{"addr1":"Škofja Loka","addr2":"Kranj"}] at /opt/mojo/routes/script/../lib/Routes/Homepage.pm line 76.
$VAR1 = undef;
The last line above shows that decoding the JSON string didn't work. When there are no utf-8 characters to decode on the Perl side, everything works fine and $routes contains the expected data.
Mojolicious style solution can be found here:
http://showmetheco.de/articles/2010/10/how-to-avoid-unicode-pitfalls-in-mojolicious.html
In Javascript I only changed $.get() to $.post().
Updated and working Perl code now looks like this:
use Mojo::ByteStream 'b';

sub on_route_change {
    my $self = shift;
    my $routes = j( b( $self->param('routes_jsonstr') )->encode('UTF-8') );
}
Tested with many different utf8 strings.
Wide character warnings happen when you print. This is not due to how you decode your Unicode but to your STDOUT encoding. Try utf8::all, available from CPAN, which will set all your IO handles to utf8. Avoiding decoding probably isn't fixing the problem but making it worse; the only reason it appears to work is that your terminal is fixing things up for you.
You can take away at least some of the pain by escaping the problematic characters; see https://stackoverflow.com/a/4901205/17389.

Executing Queries in MongoDB with Greek Characters using Javascript Returns No Results

I am building an HTML5 app combining the AngularJS framework and MongoDB. The setup is similar to the ‘Wire up a backend’ demo on the AngularJS home page. So far, I have managed to save a large number of documents in a single, simply structured MongoDB collection (hosted on MongoLab). These documents contain keys with Latin characters and values with Greek characters or numeric ones:
{ "name": "Νίκος", "value": 1.35}
I am pretty sure these documents are utf-8 encoded. The problem is that when I try to query the database with JS, passing strings containing Greek characters, I get zero results.
var query_string = "{\"name\": \"Νίκος\"}";
$scope.query_results = Project.query({q: query_string}, null, $scope.query_success);
The same queries using PHP return the correct results. All other queries with numeric values or Latin characters execute successfully (whether from PHP or JS). So the only problem is when I query the db through JS using Greek characters.
I have checked that the JS files are utf-8 encoded and I have set the HTML meta charset attribute to utf-8. I have also tried encoding the query string to utf-8 before querying the database, though with no success.
Any ideas?
Thanks.
Works for me from the shell (I copied your example document to insert, and then copied from the query for name), so at least you're not having one of those issues where the utf-8 characters look the same but are slightly different:
> db.test.insert({ "name": "Νίκος", "value": 1.35});
> db.test.find({name: "Νίκος"});
{ "_id" : ObjectId("4f9b1642c26c79dac82740c5"), "name" : "Νίκος", "value" : 1.35 }
Double check your file encoding on the js file? Although, I'm sure in your real program, you have that search value coming from a URL encoded form through GET or POST, so the encoding on the js file wouldn't matter.
You might try setting accept-charset="utf-8" in your form. If it's AJAX or posted through JS via the angular bindings, make sure that the character encoding is set before you send it as well. Something like this? http://groups.google.com/group/angular/browse_thread/thread/e6701e749d4bc8ed
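A small sketch of that last point (the module name myApp is hypothetical; this sets an explicit charset on requests POSTed through Angular's $http):
angular.module('myApp').config(['$httpProvider', function($httpProvider) {
    // make the character encoding explicit on outgoing POST bodies
    $httpProvider.defaults.headers.post['Content-Type'] =
        'application/json;charset=utf-8';
}]);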
