Request returning unicode replacement character - javascript

Using the request module to load a webpage, I notice that for the UK pound symbol £ I sometimes get back the Unicode replacement character \uFFFD.
An example URL that I'm parsing is this Amazon UK page: http://www.amazon.co.uk/gp/product/B00R3P1NSI/ref=s9_newr_gw_d38_g351_i2?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=0Q529EEEZWKPCVQBRHT9&pf_rd_t=101&pf_rd_p=455333147&pf_rd_i=468294
I'm also using the iconv-lite module to decode using the charset returned in the response header:
request(urlEntry.url, function(err, response, html) {
  const contType = response.headers['content-type'];
  const charset = contType.substring(contType.indexOf('charset=') + 8, contType.length);
  const encBody = iconv.decode(html, charset);
  // ...
But this doesn't seem to be helping. I've also tried decoding the response HTML as UTF-8.
How can I avoid this Unicode replacement char?

Firstly, the Amazon webpage is encoded in ISO-8859-1, not UTF-8. This is what causes the Unicode replacement character. You can check this in the response headers. I used curl -i.
Secondly, the README for request says:
encoding - Encoding to be used on setEncoding of response data. If null, the body is returned as a Buffer. Anything else (including the default value of undefined) will be passed as the encoding parameter to toString() (meaning this is effectively utf8 by default).
It is UTF-8 by default... and (after a little experimentation) we find that, sadly, it doesn't support ISO-8859-1. However, if we set the encoding to null we can then decode the resulting Buffer using iconv-lite.
Here is a sample program.
var request = require('request');
var iconvlite = require('iconv-lite');

var url = "http://www.amazon.co.uk/gp/product/B00R3P1NSI/ref=s9_newr_gw_d38_g351_i2?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=0Q529EEEZWKPCVQBRHT9&pf_rd_t=101&pf_rd_p=455333147&pf_rd_i=468294";

// encoding: null makes request hand back the raw bytes as a Buffer
request({url: url, encoding: null}, function (error, response, body) {
  if (!error && response.statusCode == 200) {
    var encoding = 'ISO-8859-1';
    var content = iconvlite.decode(body, encoding);
    console.log(content);
  }
});
This question is somewhat related, and I used it whilst figuring this out:
http.get and ISO-8859-1 encoded responses
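For comparison, here is a minimal sketch of the same approach with Node's built-in http module instead of request: collect the raw Buffer chunks and let iconv-lite do the charset conversion (the URL and charset are the ones from the question):

var http = require('http');
var iconv = require('iconv-lite');

var url = 'http://www.amazon.co.uk/gp/product/B00R3P1NSI/ref=s9_newr_gw_d38_g351_i2?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=0Q529EEEZWKPCVQBRHT9&pf_rd_t=101&pf_rd_p=455333147&pf_rd_i=468294';

http.get(url, function (res) {
  var chunks = [];
  res.on('data', function (chunk) { chunks.push(chunk); });  // chunks arrive as Buffers
  res.on('end', function () {
    var body = Buffer.concat(chunks);                        // raw bytes, not decoded yet
    console.log(iconv.decode(body, 'ISO-8859-1'));           // decode with the page's charset
  });
});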

Related

JavaScript Fetch: characters with encoding issues

I'm trying to use Fetch to bring some data into the screen, however some of the characters are showing a weird � sign which I believe has something to do with converting special chars.
When debugging on the server side or if I call the servlet on my browser, the problem doesn't happen, so I believe the issue is with my JavaScript. See the code below:
var myHeaders = new Headers();
myHeaders.append('Content-Type', 'text/plain; charset=UTF-8');

fetch('getrastreiojadlog?cod=10082551688295', myHeaders)
  .then(function (response) {
    return response.text();
  })
  .then(function (resp) {
    console.log(resp);
  });
I think it is probably some small detail, but I haven't managed to find out what is happening, so any tips are welcome. Thanks!
The response's text() function always decodes the payload as UTF-8.
If you want the text in another charset, you can use TextDecoder to convert the response buffer (NOT the text) into text decoded with the charset of your choice.
Using your example it should be:
var myHeaders = new Headers();
myHeaders.append('Content-Type', 'text/plain; charset=UTF-8');

fetch('getrastreiojadlog?cod=10082551688295', { headers: myHeaders })
  .then(function (response) {
    // read the raw bytes instead of letting text() assume UTF-8
    return response.arrayBuffer();
  })
  .then(function (buffer) {
    const decoder = new TextDecoder('iso-8859-1');
    const text = decoder.decode(buffer);
    console.log(text);
  });
Notice that I'm using iso-8859-1 as decoder.
Credits: Schneide Blog
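If the charset isn't known in advance, one possible refinement (a sketch, not taken from the blog post above) is to read it from the response's Content-Type header and fall back to UTF-8:

fetch('getrastreiojadlog?cod=10082551688295')
  .then(function (response) {
    // e.g. "text/plain; charset=iso-8859-1"; default to utf-8 if no charset is sent
    var contentType = response.headers.get('content-type') || '';
    var match = /charset=([^;]+)/i.exec(contentType);
    var charset = match ? match[1].replace(/"/g, '').trim() : 'utf-8';
    return response.arrayBuffer().then(function (buffer) {
      return new TextDecoder(charset).decode(buffer);
    });
  })
  .then(function (text) {
    console.log(text);
  });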
Maybe your server isn't returning a UTF-8 encoded response; try to find out which charset is used and then set it in the call headers.
Maybe ISO-8859-1:
myHeaders.append('Content-Type','text/plain; charset=ISO-8859-1');
As it turns out, the problem was that the servlet was serving the data without explicitly declaring the encoding on the response.
By adding the following line in the Java servlet:
response.setContentType("text/html;charset=UTF-8");
it was possible to get the characters in the right format.
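For completeness (not part of the original answer), one quick way to confirm the new header from the browser is a small sketch like this, reusing the endpoint from the question:

fetch('getrastreiojadlog?cod=10082551688295')
  .then(function (response) {
    // should now log something like "text/html;charset=UTF-8"
    console.log(response.headers.get('content-type'));
    return response.text();   // safe once the server declares UTF-8
  })
  .then(function (text) {
    console.log(text);
  });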

Need to add a value into my URL while doing an HTTP POST request using a Google Cloud Function

I'm creating an OTP-type registration for my React Native based mobile app, using a Google Cloud Function to generate the OTP and send an HTTP POST request to my SMS provider.
The problem I am facing is that whenever I try to add the random code to my SMS provider URL with ${code}, the message simply displays the literal ${code}, not the randomly generated code.
In other words, I don't know how to interpolate the code into my URL (I am a newbie).
Here is my code for the random number:
const code = Math.floor((Math.random() * 8999 + 1000));
And my request, using the request package, is as follows:
const options = {
  method: 'POST',
  uri: 'http://smsprovider.com/numbers=${numbers}&route=2&message=Your OTP is ${code}',
  body: {
    numbers: phone,
    code: code
  },
  json: true
};
So whenever I get a message, it says "Your OTP is ${code}", but what I actually need is the random number generated by the Math.floor call. Expected: "Your OTP is 5748".
Kindly guide.
For string interpolation in JavaScript, be sure to use the backtick character ` instead of the single quote '.
Try this instead:
const options = {
  method: 'POST',
  uri: `http://smsprovider.com/numbers=${numbers}&route=2&message=Your OTP is ${code}`,
  body: {
    numbers: phone,
    code: code
  },
  json: true
};
String interpolation and URL encoding are two distinct concerns; one doesn't replace the other.
String interpolation allows you to dynamically insert a variable's content into a string with ${}. For this to work you must enclose your string in back quotes, as @Ben Beck instructed. Some interpreters are forgiving and will parse the interpolation even with single quotes, but not all of them do, and it is bad practice to rely on that. Make sure you format these correctly.
URL component encoding converts URL parameters containing special characters into a valid URI component with encodeURIComponent(). This is how you get rid of spaces and other special characters. It might not be needed here, as most browsers do that for you (use Chrome to be sure), but it is good practice to write fully portable code, so I recommend encoding any parameter featuring special characters.
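A quick illustration of both points (the value 5748 is just the expected example from the question):

const code = 5748;
const message = `Your OTP is ${code}`;     // back quotes: the interpolation happens
console.log(message);                      // "Your OTP is 5748"
console.log('Your OTP is ${code}');        // single quotes: the literal text ${code}
console.log(encodeURIComponent(message));  // "Your%20OTP%20is%205748"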
The fact that your Postman test failed is most certainly due to a faulty request. Check this screenshot for a working Postman POST request based on your case, leveraging Pre-request Script.
While testing with your code directly (not through Postman), if you keep getting the literal ${code} in place of the actual value, it likely means that the definition const code = Math.floor((Math.random() * 8999 + 1000)) is not in the same scope as the interpolation call. Check below for an example of a working script using both string interpolation and URL encoding based on your case:
const request = require('request');

const code = Math.floor((Math.random() * 8999 + 1000));
var p1 = encodeURIComponent(`Your OTP is ${code}`);
var uri = `http://smsprovider.com/?message=${p1}`;

const options = {
  method: 'POST',
  url: uri,
  json: true
};

function callback(error, response, body) {
  if (!error && response.statusCode == 200) {
    console.log(body);
  } else {
    console.log(error);
  }
}

request(options, callback);
The same, but without URL encoding and with the message parameter embedded in the body element (it is sent with the same request(options, callback) call, shown below):
var uri = `http://smsprovider.com/`;

const options = {
  method: 'POST',
  url: uri,
  body: {
    message: `Your OTP is ${code}`,
  },
  json: true
};
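As in the first example, the options are then passed to request together with the callback defined above:

request(options, callback);   // same callback as in the first snippet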

Dealing with multiple encoding schemes while downloading the XML feed

I am trying to read the feed at the following URL:
http://www.chinanews.com/rss/scroll-news.xml
using the request module. But I get stuff that has ���� ʷ����)������(�й�)���޹�.
On reviewing the XML I see that the encoding is being set as <?xml version="1.0" encoding="gb2312"?>
But on trying to set the encoding to gb2312, I get an unknown encoding error.
request({
  url: "http://www.chinanews.com/rss/scroll-news.xml",
  method: "GET",
  headers: {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Host": "www.chinanews.com",
    "Accept-Language": "en-GB,en-US;q=0.8,en;q=0.6"
  },
  "gzip": true,
  "encoding": "utf8"
}, (err, resp, data) => {
  console.log(data);
});
Is there a way I could get the data irrespective of the encoding it has? How should I approach this?
You missed the concept of character encoding.
var iconv = require('iconv-lite');
var request = require('request');

request({
  url: "http://www.chinanews.com/rss/scroll-news.xml",
  method: "GET",
  headers: {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Host": "www.chinanews.com",
    "Accept-Language": "" // languages accepted by the client, e.g. "zh-CN, zh"
  },
  gzip: true,
  encoding: null // return the body as a raw Buffer instead of a decoded string
}, (err, resp, body) => {
  console.log(iconv.decode(body, 'gb2312')); // body is a Buffer; decode it with the feed's charset
});
The body here is a Buffer instance in Node.js. According to the official documentation, the only character encodings natively supported by Node.js are:
'ascii' - For 7-bit ASCII data only. This encoding is fast and will strip the high bit if set.
'utf8' - Multibyte encoded Unicode characters. Many web pages and other document formats use UTF-8.
'utf16le' - 2 or 4 bytes, little-endian encoded Unicode characters. Surrogate pairs (U+10000 to U+10FFFF) are supported.
'ucs2' - Alias of 'utf16le'.
'base64' - Base64 encoding. When creating a Buffer from a string, this encoding will also correctly accept "URL and Filename Safe Alphabet" as specified in RFC4648, Section 5.
'latin1' - A way of encoding the Buffer into a one-byte encoded string (as defined by the IANA in RFC1345, page 63, to be the Latin-1 supplement block and C0/C1 control codes).
'binary' - Alias for 'latin1'.
'hex' - Encode each byte as two hexadecimal characters.
To use encodings not natively supported by Node.js, use iconv, iconv-lite or other libraries that provide the character mapping tables. This is very similar to this answer.
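For example, a small sketch with iconv-lite (the sample string is arbitrary; any GB2312-encodable text works):

var iconv = require('iconv-lite');

console.log(iconv.encodingExists('gb2312'));    // true - the GBK/GB2312 mapping tables ship with iconv-lite

// round-trip a sample string through GB2312 to show the mapping tables at work
var gbBytes = iconv.encode('中新网', 'gb2312');  // Buffer of GB2312-encoded bytes
console.log(iconv.decode(gbBytes, 'gb2312'));   // prints the original string again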
The Accept-Language header lists the languages accepted by the client. en-GB represents English (United Kingdom), not Chinese; the Chinese ones are zh-CN and zh, according to RFC 7231.
The tricky part is to pass encoding as null to get a Buffer instead of a string.
encoding - encoding to be used on setEncoding of response data. If null, the body is returned as a Buffer.
— request
var request = require('request');
var legacy = require('legacy-encoding');

var requestSettings = {
  method: 'GET',
  url: 'http://www.chinanews.com/rss/scroll-news.xml',
  encoding: null,
};

request(requestSettings, function(error, response, body) {
  var text = legacy.decode(body, 'gb2312');
  console.log(text);
});
Again, in the context of the follow-up question, "Is there a way I could detect encoding?":
By "detect" I hope you mean find the declaration (as opposed to guessing; if you have to guess, you have a failed communication). The HTTP response header Content-Type is the primary way to communicate the encoding (where applicable to the MIME type). Some MIME types also allow the encoding to be declared within the content, and servers quite rightly defer to that.
In the case of your RSS response, the server sends Content-Type: text/xml, which carries no encoding override, and the content's XML declaration is <?xml version="1.0" encoding="gb2312"?>. The XML specification has procedures for finding such a declaration; it basically amounts to reading with different encodings until the XML declaration becomes intelligible, then re-reading with the declared encoding.
var request = require('request');
var legacy = require('legacy-encoding');
var convert = require('xml-js');

// specials listed here: https://www.w3.org/Protocols/rfc1341/4_Content-Type.html
var charsetFromContentTypeRegex = /charset=([^()<>#,;:\"/[\]?.=\s]*)/i;

var requestSettings = {
  method: 'GET',
  url: 'http://www.chinanews.com/rss/scroll-news.xml',
  encoding: null,
};

request(requestSettings, function(error, response, body) {
  // 1. try the charset parameter of the Content-Type header
  var contentType = charsetFromContentTypeRegex.exec(response.headers['content-type']);
  var encodingFromHeader = contentType && contentType.length > 1 ? contentType[1] : null;
  // 2. fall back to the encoding attribute of the XML declaration
  var doc = convert.xml2js(body);
  var encoding = doc.declaration.attributes.encoding;
  doc = convert.xml2js(
    legacy.decode(body, encodingFromHeader ? encodingFromHeader : encoding));
  // xpath /rss/channel/title
  console.log(doc.elements[1].elements[0].elements[0].elements[0].text);
});

Requesting XML returns a string with many carriage returns

var request = require('request');

request('http://www.behindthename.com/api/lookup.php?name=olga&key=li758582', function (error, response, body) {
  if (!error && response.statusCode == 200) {
    console.log(body);
  }
});
I'm trying to fetch an XML file from an API using Node.js.
When I fetch it using the above code with the request module, I get XML with many unnecessary "\r" like so:
\r<response>\r<name_detail>\r<name>Oľga</name>\r<gender>f</gender>
When I go to the URL with my browser it has no extra line returns at all.
Is it possible to fetch just the source XML with Node.js? Also, is there a smarter way to fetch XML?
Solved. This article has an excellent guide on how exactly to do it:
http://antrikshy.com/blog/fetch-xml-url-convert-to-json-nodejs
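The gist of that approach is roughly the following sketch (assuming the xml2js package; the \r characters are just Windows-style line breaks in the raw text and stop mattering once the XML is parsed into an object):

var request = require('request');
var xml2js = require('xml2js');

request('http://www.behindthename.com/api/lookup.php?name=olga&key=li758582', function (error, response, body) {
  if (!error && response.statusCode == 200) {
    // parseString turns the raw XML text (carriage returns and all) into a plain JS object
    xml2js.parseString(body, function (parseErr, result) {
      if (!parseErr) {
        console.log(JSON.stringify(result, null, 2));
      }
    });
  }
});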

base64 encode image host url or server file path

When I encode image data to a base64 string I use the server file path to get the image data with fs.readFile(). I have a question: does this mean other people can decode the base64 string and then get the server path from the encoded data, as below?
// ...
fs.readFile(destinationFilePath, function(error, data) {
  fulfill(data.toString('base64'));
});
I don't want to leak my server path, so I also tried encoding the host URL as in the code below. I'm not sure this is the correct way to use base64? I don't get any error, but I also get no response - did I miss something?
var base64EncodeData = function(destinationFilePath) {
  return new Promise(function (fulfill, reject) {
    var request = require('request').defaults({ encoding: null }); // encoding: null -> body is a Buffer
    request.get(destinationFilePath, function (error, response, body) {
      if (!error && response.statusCode == 200) {
        var data = "data:" + response.headers["content-type"] + ";base64," + new Buffer(body).toString('base64');
        console.log(data);
        fulfill(data);
      }
    });
  });
};
No, you don't leak your server path by base64 encoding images. The base64 you are generating only contains a representation of the binary image data. Indeed, by base64 encoding them you remove any use of a path when you display them within an HTML page, for example:
<img alt="base64 image" src="data:image/png;base64,isdRw0KGgot5AAANdSsDIA..." />
The src attribute contains the data: prefix indicating that inline data is being provided, the file MIME type image/png, the encoding base64, and the encoded image data isdRw0KGgot5AAANdSsDIA....
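For reference, a small sketch of building such a data URI directly from a local file (the path and the image/png MIME type here are hypothetical); only the file's bytes end up in the string, never the path:

var fs = require('fs');

var destinationFilePath = '/path/to/image.png';  // hypothetical local path, never exposed

fs.readFile(destinationFilePath, function (error, data) {
  if (error) return console.error(error);
  // only the file's bytes are encoded - nothing of the path survives in the output
  var dataUri = 'data:image/png;base64,' + data.toString('base64');
  console.log(dataUri);
});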
