Convert Unicode Javascript string to PHP utf8 string

I made a form with a text input.
<input type="text" id="input" value=""/>
I receive a UTF-8 string from the web like this (using JavaScript and jQuery):
var str = '\u306e\u7c21\u5358\u306a\u8aac\u660e';
str is 'の簡単な説明'.
I set the input field's value to str:
$('#input').val(str);
The input strips all of the '\' escape characters and ends up with a value like this:
<input type="text" id="input" value="u306eu7c21u5358u306au8aacu660e"/>
There is no problem at this point; the characters display correctly.
But when I save this string to my database with PHP,
PHP stores the value as the unescaped string 'u306eu7c21u5358u306au8aacu660e'.
The next time I render the field with
<input type="text" id="input" value="<?=$str?>">
the browser displays the raw value,
just 'u306eu7c21u5358u306au8aacu660e',
not 'の簡単な説明'.
I don't know what is wrong.
I've tried
$str = json_decode("\"".$str."\"");
html_entity_decode(...);
mb_convert_encoding(...);
but none of them work correctly.
How can I convert this unescaped string to a regular UTF-8 string?

You must have Multibyte String (mbstring) support. With some extra work, here is what you need:
<?php
$str = 'u306eu7c21u5358u306au8aacu660e';
function converter($sequence) {
return mb_convert_encoding(pack('H*', $sequence), 'UTF-8', 'UCS-2BE');
}
# array_filter is not essential here; it just removes the empty strings that explode() produces
$converted = array_map('converter', array_filter(explode('u', $str)));
$converted = join('', $converted);
print $converted;
As a side note, you ought to find a better strategy for splitting the Unicode sequences. Exploding the string on the u character is somewhat naive.
Also, I strongly advise you to read the excellent blog post by Armin Ronacher, UCS vs UTF-8 as Internal String Encoding.
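One way to avoid splitting on every u is to match each u followed by exactly four hex digits with a regular expression. The sketch below shows the idea in JavaScript rather than PHP; the function name decodeBareEscapes is hypothetical, and it assumes the whole input is made of stripped \uXXXX escapes. Ordinary text that happens to look like an escape (for example "ubeef") would be converted too, so treat this as an illustration, not a general decoder. The same pattern could be used with preg_replace_callback in PHP.

```javascript
// Hypothetical helper: turns "u306eu7c21" style text back into characters.
// Assumes every "u" followed by 4 hex digits is a stripped \uXXXX escape.
function decodeBareEscapes(text) {
  return text.replace(/u([0-9a-fA-F]{4})/g, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

const stored = 'u306eu7c21u5358u306au8aacu660e';
console.log(decodeBareEscapes(stored));
```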

Related

"Literally" insert string into HTML input

I want to insert a string "literally" into an HTML input field using jQuery.
The problem is that the string contains escaped characters and character codes, and all of those need to still be there when the string is inserted.
My problem seems to be with escaped characters (thanks for the comments that pointed that out). I can't figure out how I can insert the string without the escaped characters and codes being translated.
The literal strings come from a file data.txt. To clarify, this is just an exemplary string that is used to demonstrate that there can be escaped quotes and character codes etc. in the strings.
TEST\"/**\x3e
They are loaded (in node.js) from the file into an array of strings.
Wrapper code (Node.js) visits the page using the Chrome dev tools.
Here, for each string a script is prepared that is injected and executed on the page.
Therefore the inputString is inserted into the script, before it is injected.
So here is my problem with string escaping. I have the strings in literal format as data and I currently inject them as dynamically generated JavaScript code which is where escaping problems occur.
Injected Code
// this was (currently incorrectly) injected into the page before
// from the array of input strings that was loaded from data
let insertString = "TEST\"/**\x3e"; // <-
let form = $("form").first();
let inputs = form.find(":input").not(":input[type=submit]");
let input = inputs.first();
input.focus().val(insertString);
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<!-- Exemplary form code on the page -->
<form action="post" method="post">
<label for="name">Name: </label>
<input id="name" type="text" name="input">
<input type="submit" value="Submit">
</form>
[Screenshots omitted: "What we got" and "What I want"]
The string is not inserted as is.
For example the character code \x3e is translated to >.
Also the escaped \" is translated to ".
It needs to be inserted just as it would be when copying and pasting from the data file.
Thoughts on a potential (manual) solution
So one potential solution is to rework the data.txt file and escape the strings correctly. The first line might then be TEST\\\"/**\\x3e, as @Jamiec and @Barmar have proposed.
// injected before
let insertString = "TEST\\\"/**\\x3e"; // <- manually escaped
let form = $("form").first();
let inputs = form.find(":input").not(":input[type=submit]");
let input = inputs.first();
input.focus().val(insertString);
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<!-- Exemplary form code on the page -->
<form action="post" method="post">
<label for="name">Name: </label>
<input id="name" type="text" name="input">
<input type="submit" value="Submit">
</form>
The string will then be inserted as intended, but the solution is still not satisfying, because it would be better for me to not touch the input data.
It would be best to have the input strings in the data.txt file exactly as they will look when they are inserted into the page.
This would require an additional step between loading the input data and inserting each string into the script (which is then injected into the page). Potentially this preprocessing can be done with regexp replacements.

You need to escape all the backslashes and quotes in the string. You can do this using a regular expression.
function escape_string(string) {
return string.replace(/[\\"']/g, "\\$&");
}
console.log('let str = "' + escape_string(prompt("Type a string")) + '";');
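An alternative to a hand-written regular expression is to let JSON.stringify produce the string literal, since its output is also a valid JavaScript string literal. The sketch below is only an illustration: the variable names are hypothetical, and String.raw stands in for a line read from data.txt.

```javascript
// Stand-in for one line loaded from data.txt, exactly as it appears in the file.
const rawInput = String.raw`TEST\"/**\x3e`;

// JSON.stringify adds the surrounding quotes and escapes backslashes and quotes.
const generatedScript = 'let insertString = ' + JSON.stringify(rawInput) + ';';
console.log(generatedScript);

// Evaluating the generated literal gives back the original text unchanged.
const roundTripped = new Function('return ' + JSON.stringify(rawInput) + ';')();
console.log(roundTripped === rawInput);
```

If the generated script were embedded in an inline <script> element instead of being injected through the dev tools, sequences such as </script> would need extra handling.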

This has nothing to do with encoding, nor input fields - it is simply string escapes - so can be demonstrated using the console (or any other way of displaying a string).
In order to see the literal escape character \ in a string you must escape the escape character with \\ - see below:
var text1 = "TEST\"/**\x3e";
console.log(text1)
var text2 = "TEST\\\"/**\\x3e";
console.log(text2)
As you can see, the first output is your exact problem, whereas the second escapes the escape character, so you get what you expect in the output.

Why is my search() unable to find special characters?

I think this is a character encoding problem. My JavaScript search fails to identify a string whenever the string contains certain special characters such as parentheses, asterisks, and numerals.
The javascript is fairly simple. I search a string (str) for the value (val):
n = str.search(val);
The string I search was written to a txt file using PHP...
$scomment = $_POST['order'];
$data = stripslashes("$scomment\n");
$fh = fopen("comments.txt", "a");
fwrite($fh, $data);
fclose($fh);
...then retrieved with PHP...
$search = $_GET["name"];
$comments = file_get_contents('comments.txt');
$array = explode("~",$comments);
foreach($array as $item){if(strstr($item, $search)){
echo $item; } }
...and put into my HTML using AJAX...
xmlhttp.open("GET","focus.php?name="+str,true);
xmlhttp.send();
My HTML is encoded as UTF-8.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
The problem is that when I search() for strings, special characters are not recognized.
I didn't really see an option to encode the PHP created text, so I assumed it was UTF-8 too. Is that assumption incorrect?
If this is an encoding problem, how do I change the encoding of the txt file written with PHP? If it is caused by the above code, where is the problem?

The search() function takes a regex. If a string is given, it is converted into one. Because of this, any special characters have special meanings, and it will not work as expected. You need to escape special characters.
Syntax
str.search(regexp)
Parameters
regexp
A regular expression object. If a non-RegExp object obj is passed, it is implicitly converted to a RegExp by using new RegExp(obj).
If you want to search for text, use str.indexOf() instead, unless you NEED a regex. If so, see here on how to escape the string: Is there a RegExp.escape function in Javascript?
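To make the difference concrete, here is a small sketch comparing indexOf() with search() on a string that contains regex metacharacters. The sample strings are made up for illustration, and escapeRegExp is a commonly used helper, not a built-in function.

```javascript
// Commonly used helper (not built in): backslash-escapes regex metacharacters.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

const haystack = 'Order (2*3) shipped';
const needle = '(2*3)';

// indexOf treats the needle as plain text.
console.log(haystack.indexOf(needle));

// search converts the needle to a RegExp, so ( ) and * act as operators.
console.log(haystack.search(needle));

// Escaping first makes search behave like a plain-text lookup.
console.log(haystack.search(escapeRegExp(needle)));
```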

Include this function in your code:
function isValid(str){
return !/[~`!#$%\^&*+=\-\[\]\\';,/{}|\\":<>\?]/g.test(str);
}
Hope it helps

Why are endline characters illegal in HTML string sent over ajax?

Within HTML, it is okay to have endline characters. But when I try to send HTML strings that have endline characters over AJAX to have them operated with JavaScript/jQuery, it returns an error that says that endline characters are illegal. For example, if I have a Ruby string:
"<div>Hello</div>"
and jsonify it with Ruby by to_json, and send it over ajax, parse it within JavaScript by JSON.parse, and insert that in jQuery like:
$('body').append('<div>Hello</div>');
then it does not return an error, but if I do a similar thing with a string like
"<div>Hello\n</div>"
it returns an error. Why are they legal in HTML and illegal in AJAX? Are there any other differences between a legal HTML string loaded as a page and legal HTML string sent over ajax?

String literals can contain line breaks; they just need to be escaped with a backslash, like so:
var string = "hello\
world!";
However, this does not create a line break in the string; for that you need an explicit \n escape sequence. The example above technically becomes helloworld!. Doing
var string = "hello"
+ "world"
would be much cleaner.

Specify the type of the ajax call as 'html'. jQuery will try to infer the type when parsing the response.
If the response is json, newlines should be escaped.
I'd recommend using a library to serialize json. You're unlikely to handle all the edge cases if you roll your own.
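As a rough illustration of why a serializer helps, the sketch below uses JavaScript's own JSON.stringify in place of Ruby's to_json (an assumption made only to keep the example in one language). It shows that a properly serialized newline round-trips, while a raw newline inside a JSON string is rejected.

```javascript
const html = '<div>Hello\n</div>';

// A serializer escapes the newline as the two characters backslash and n.
const serialized = JSON.stringify(html);
console.log(serialized);

// Parsing the serialized form restores the original string, newline included.
console.log(JSON.parse(serialized) === html);

// Hand-built JSON with a raw newline inside the quotes is not valid JSON.
let rejected = false;
try {
  JSON.parse('"' + html + '"');
} catch (err) {
  rejected = err instanceof SyntaxError;
}
console.log(rejected);
```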

String literals in JavaScript MUST appear on a single line, unless the line break itself is escaped:
var str = "abc \
def";
However note that the newline is escaped and will not appear in the string itself.
The best option is \n, but note that if it is already going through something that parses \n then you will need to double-escape it as \\n.

Seeing how you're already escaping the JSON properly by using to_json in Ruby, I do believe the bug is in jQuery; when there are newlines in the string it has trouble determining whether you meant to create a single element or a document fragment. This would work just fine:
var str = "<div>Hello\n</div>";
var wrapper = document.createElement('div');
wrapper.innerHTML = str;
$('body').append(wrapper);

What is the correct way to encodeURIComponent non-UTF-8 characters and decode them accordingly?

I have a JavaScript bookmarklet that uses encodeURIComponent to pass the URL of the current page to the server, which then uses urldecode to get the characters back.
The problem is that when the encoded characters are not in UTF-8 (in my case it's GB2312, but it could be something else), the characters decoded by urldecode on the server become squares, which obviously isn't what they looked like before encoding.
It's a bookmarklet and the input could be anything, so I can't just hard-code "encode as GB2312" in the JS or "decode as GB2312" in the PHP scripts.
So, is there a correct way to use encodeURIComponent that passes the character encoding along with the contents, so that the decoder can pick the right encoding?

For how browsers handle encoding, especially for the GB2312 charset, check the following articles (in Chinese) first:
http://ued.taobao.com/blog/2011/08/26/encode-war/
http://www.ruanyifeng.com/blog/2010/02/url_encoding.html
In your case, %C8%B7%B6%A8 is actually generated from the GB2312 form of '\u786e\u5b9a'. This normally happens on (legacy?) versions of IE and FF when the user types Chinese characters directly into the location bar,
or when you follow a non-standard link in page content that does not perform IRI-to-URI encoding at all and just renders a binary string like '/tag/\xc8\xb7\xb6\xa8' (douban.com used to do this for tags; now they use correct URI encoding in UTF-8). I'm not quite sure about this, because I cannot reproduce it in Chrome; it may be worth testing in FF and IE. The part about douban is true.
Actually, the correct output of encodeURIComponent should be
> encodeURIComponent('%C8%B7%B6%A8')
"%25C8%25B7%25B6%25A8"
Thus, on the server side, when an unquoted string contains non-ASCII bytes, you had better leave the string as it is, here '%C8%B7%B6%A8'.
Also, you could check on the client side and apply encodeURIComponent again to a value that contains %XX where XX is larger than 0x7F. I'm not quite sure whether this goes against RFC 2396, though.
Writing in English is tiring, but when in Rome, do as the Romans do.
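The behavior described above can be checked with a short sketch. encodeURIComponent always emits UTF-8 bytes, and decodeURIComponent throws a URIError when the percent-encoded bytes are not valid UTF-8, which gives a rough way to spot values that were encoded in some other charset. The helper name isUtf8PercentEncoded is hypothetical, and the check is only a heuristic: some GB2312 byte sequences also happen to be valid UTF-8.

```javascript
// Hypothetical helper: true when the %XX sequences in the value decode as UTF-8.
function isUtf8PercentEncoded(value) {
  try {
    decodeURIComponent(value);
    return true;
  } catch (err) {
    if (err instanceof URIError) return false;
    throw err;
  }
}

// encodeURIComponent works on the string's characters and emits UTF-8 bytes.
const utf8Form = encodeURIComponent('\u786e\u5b9a');
console.log(utf8Form);

// The GB2312-derived form from the answer is not valid UTF-8.
const gb2312Form = '%C8%B7%B6%A8';
console.log(isUtf8PercentEncoded(utf8Form));
console.log(isUtf8PercentEncoded(gb2312Form));
```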

Use escape() and then translate the characters to numeric character references before sending them to the server.
From MDN escape() reference:
The hexadecimal form for characters, whose code unit value is 0xFF or
less, is a two-digit escape sequence: %xx. For characters with a
greater code unit, the four-digit format %uxxxx is used.
Thus, it's easy to translate the output of escape() to numeric character reference by using a simple replace() statement:
escape(input_value).replace(/%u([0-9a-fA-F]{4})/g, '&#x$1;');
Or, if your server-side language only supports decimal entities, use:
escape(input_value).replace(/%u([0-9a-fA-F]{4})/g, function(m0, m1) {
return '&#' + parseInt(m1, 16) + ';';
});
Example code in PHP
client.html (file encoding: GB2312):
<html>
<head>
<meta charset="gb2312">
<script>
function processForm(form) {
console.log('BEFORE:', form.test.value);
form.test.value = escape(form.test.value).replace(/%u(\w{4})/g, function(m0, m1) {
return '&#' + parseInt(m1, 16) + ';';
});
console.log('AFTER:', form.test.value);
return true;
}
</script>
</head>
<body>
<form method="post" action="server.php" onsubmit="return processForm(this);">
<input type="text" name="test" value="确定">
<input type="submit">
</form>
</body>
</html>
server.php:
<?php
echo '<script>console.log("',
$_REQUEST['test'], ' --> ',
mb_decode_numericentity($_REQUEST['test'], array(0x80, 0xffff, 0, 0xffff), 'UTF-8'),
'");</script>';
?>
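Since escape() is deprecated, the same conversion can be sketched without it by reading code points directly. The function name toNumericEntities is hypothetical; unlike the escape()-based version above, this variant also converts characters in the 0x80 to 0xFF range and handles characters outside the BMP as a single reference.

```javascript
// Hypothetical helper: replaces every non-ASCII character with a decimal
// numeric character reference such as &#30830;
function toNumericEntities(text) {
  return Array.from(text, (ch) => {
    const codePoint = ch.codePointAt(0);
    return codePoint > 0x7f ? '&#' + codePoint + ';' : ch;
  }).join('');
}

console.log(toNumericEntities('\u786e\u5b9a'));
console.log(toNumericEntities('plain ASCII stays as is'));
```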

POST data issues

I have an issue with submitting POST data. I have a form with a couple of text fields, and when a button is pressed to submit the data, it is run through custom form validation (JS). Then I construct a query string like
title=test&content=some content
which is then submitted to the server. The problem is that when an '&' (e.g. &nbsp) is entered into one of the inputs, it breaks up the query string. E.g.:
title=test&content=some content &nbsp
How do I get around this?
Thanks in advance,
Harry.

Run encodeURIComponent over each key and value.
var title = "test";
var content = "some content &nbsp ";
var data = encodeURIComponent('title') + /* You don't actually need to encode this as it is a string that only contains safe characters, but you would if you weren't sure about the data */
'=' + encodeURIComponent(title) +
'&' + encodeURIComponent('content') +
'=' + encodeURIComponent(content);
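In environments that provide URLSearchParams (modern browsers and Node.js), the same encoding can be done without concatenating by hand. This is a sketch of that alternative, not part of the original answer; note that URLSearchParams writes spaces as + rather than %20, which form-style decoders such as PHP's accept.

```javascript
const fields = { title: 'test', content: 'some content &nbsp ' };

// URLSearchParams percent-encodes reserved characters such as & for each pair.
const queryString = new URLSearchParams(fields).toString();
console.log(queryString);

// Parsing the result gives back the original values, ampersand included.
const parsed = new URLSearchParams(queryString);
console.log(parsed.get('content') === fields.content);
```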

Encode the string. When you want to put special characters in a query string, you need to encode them. An ampersand is encoded like this:
title=test&content=some content %26
Basically, any character in a query string can be replaced by its ASCII hex equivalent with a % as the prefix:
Space = %20
A = %41
B = %42
C = %43
...

You need to encode your query to make it URL-safe. You can refer to the following links on how to do that in JS:
http://xkr.us/articles/javascript/encode-compare/
http://www.webtoolkit.info/javascript-url-decode-encode.html

You said:
...and when a button is pressed to submit the data, it is run through a custom form validation (JS), then I construct a query string...
In the section where you are building the query string, you should also run the value of each input through encodeURIComponent(), as David Dorward suggested.
As you do, be careful to assign the new value only to your processed query string and NOT to the form element's value; otherwise your users will think their input was somehow corrupted and potentially freak out.
[EDIT]
I just re-read your question and realized something important: you're encoding an &nbsp; character. This is probably a more complicated issue than other posters here have read into. If you want that character, and other &code; type characters, to transfer over, you'll need to realize that they are codes. The characters &, n, b, s, p and ; are not themselves the same as " ", which is a non-breaking space character.
You'll have to add another step of encoding/decoding. You can place this step either before or after the data is sent (or "POSTed").
Before:
(Using this question's answers)
var data = formElement.value;
data = rhtmlspecialchars(data, 0);
This is intended to replace your "special" characters like &nbsp; with " " so that they are then properly encoded by encodeURIComponent(data).
Or after:
(using standard PHP functions)
<?PHP
$your_field_name = htmlspecialchars_decode(urldecode($_POST['your_field_name']));
?>
This assumes that you escaped the & in your POST with %26
If you replaced it with some function other than encodeURIComponent() you'll have to find a different way to decode it in PHP.

This should solve your problem:
encodeURIComponent(name)+'='+encodeURIComponent(value)+'&'+encodeURIComponent(name2)+'='+encodeURIComponent(value2)
You need to escape each value (and name if you want to be on the safe side) before concatenating them when you're building your query.
The JavaScript global function encodeURIComponent() does the escaping.
The global function escape() also does this for you in a browser, although people say it does not escape Unicode characters well. Anyway, if you're only concerned about '&', then this would solve your problem.
