MySQL字符集的内部转换
我们知道set names utf8;改变的character_set_client、character_set_connection、character_set_result三个值,MySQL字符集的内部转换到底是哪个阶段呢,下面通过控制变量法来验证。
表准备
建立gbk字符集的表
show create table t\G
*************************** 1. row ***************************
Table: t
Create Table: CREATE TABLE `t` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`data` varchar(100) DEFAULT NULL,
`comment` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_data` (`data`)
) ENGINE=InnoDB DEFAULT CHARSET=gbk
1 row in set (0.00 sec)
character_set_client、character_set_connection、character_set_result为gbk,SecureCRT为utf-8
show variables like 'character%';
+--------------------------+-----------------------------------+
| Variable_name | Value |
+--------------------------+-----------------------------------+
| character_set_client | gbk |
| character_set_connection | gbk |
| character_set_database | gbk |
| character_set_filesystem | binary |
| character_set_results | gbk |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| character_sets_dir | /home/mysql/mysql/share/charsets/ |
+--------------------------+-----------------------------------+
8 rows in set (0.00 sec)
insert into T values(null,'小明','3 gbk');
SELECT LENGTH(data), CHAR_LENGTH(data),data,hex(data),hex('小明'),comment from t;
+--------------+-------------------+--------+--------------+---------------+---------+
| LENGTH(data) | CHAR_LENGTH(data) | data | hex(data) | hex('小明') | comment |
+--------------+-------------------+--------+--------------+---------------+---------+
| 6 | 3 | 小明 | E5B08FE6988E | E5B08FE6988E | 3 gbk |
+--------------+-------------------+--------+--------------+---------------+---------+
1 row in set (0.00 sec)
这里在SecureCRT明文输入hex(‘小明’)
可以看到查询出的二进制流为E5B08FE6988E
由于character_set_client、character_set_connection、character_set_result和表字符集均为GBK,不涉及MySQL内部编码转换。因此,表字符集虽然为GBK,但“小明”的编码并非为GBK编码的二进制流,而是UTF8的二进制流,两个汉字占用了6个字节,而读取是一个逆向的过程,不涉及到编码转换,查询依然能正确返回“小明”。
只更改character_set_client为utf8
set character_set_client=utf8;
Query OK, 0 rows affected (0.00 sec)
show variables like 'character%';
+--------------------------+-----------------------------------+
| Variable_name | Value |
+--------------------------+-----------------------------------+
| character_set_client | utf8 |
| character_set_connection | gbk |
| character_set_database | gbk |
| character_set_filesystem | binary |
| character_set_results | gbk |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| character_sets_dir | /home/mysql/mysql/share/charsets/ |
+--------------------------+-----------------------------------+
8 rows in set (0.00 sec)
insert into T values(null,'小明','character_set_client utf8');
Query OK, 1 row affected (0.01 sec)
SELECT id,LENGTH(data), CHAR_LENGTH(data),data,hex(data),hex('小明'),comment from t order by id ;
+----+--------------+-------------------+--------+--------------+-------------+---------------------------+
| id | LENGTH(data) | CHAR_LENGTH(data) | data | hex(data) | hex('С) | comment |
+----+--------------+-------------------+--------+--------------+-------------+---------------------------+
| 1 | 6 | 3 | 小明 | E5B08FE6988E | D0A1C3F7 | 3 gbk |
| 2 | 4 | 2 | С | D0A1C3F7 | D0A1C3F7 | character_set_client utf8 |
+----+--------------+-------------------+--------+--------------+-------------+---------------------------+
2 rows in set (0.00 sec)
可以看到,数据库里存储的字符为utf8的二进制流E5B08FE6988E
MySQL把SecureCRT明文输入的“小明”转换成了gbk的二进制流D0A1C3F7
把SecureCRT字符集设置成gbk用unhex反向查看确认D0A1C3F7是不是gbk的二进制流
select unhex('D0A1C3F7');
+-------------------+
| unhex('D0A1C3F7') |
+-------------------+
| 小明 |
+-------------------+
1 row in set (0.00 sec)
所以D0A1C3F7为小明gbk的二进制流
而E5B08FE6988E呢
select unhex('E5B08FE6988E');
+-----------------------+
| unhex('E5B08FE6988E') |
+-----------------------+
| 灏忔槑 |
+-----------------------+
1 row in set (0.00 sec)
由于SecureCRT是gbk字符集,把utf-8二进制流E5B08FE6988E按gbk解码成了gbk对应的汉字“灏忔槑”,就是日常所说的乱码。把SecureCRT设置成utf-8后,即恢复正常
select unhex('E5B08FE6988E');
+-----------------------+
| unhex('E5B08FE6988E') |
+-----------------------+
| 小明 |
+-----------------------+
1 row in set (0.00 sec)
所以E5B08FE6988E为小明utf8的二进制流
SecureCRT(utf8 E5B08FE6988E)–> character_set_client(utf8 E5B08FE6988E)–>character_set_connection(gbk D0A1C3F7)–>database(gbk D0A1C3F7)–>character_set_results(gbk D0A1C3F7)–>SecureCRT(utf8 D0A1C3F7)
client(utf8)–>connection(gbk)作了转化
connection把utf8的二进制流转换为gbk的二进制流,由于gbk不包含utf8,需要先用原utf8解码再转码为gbk,用Python演示过程为
a = u"小明"
a_utf_8 = a.encode('utf-8')
print(a_utf_8)
a_unicode = a_utf_8.decode('utf-8')
assert (a_unicode == a)
a_gb2312 = a_unicode.encode('gb2312')
print(a_gb2312)
结果为
b'\xe5\xb0\x8f\xe6\x98\x8e'
b'\xd0\xa1\xc3\xf7'
只设置character_set_connection为utf8
set names gbk;
Query OK, 0 rows affected (0.00 sec)
set character_set_connection=utf8;
Query OK, 0 rows affected (0.00 sec)
show variables like 'character%';
+--------------------------+-----------------------------------+
| Variable_name | Value |
+--------------------------+-----------------------------------+
| character_set_client | gbk |
| character_set_connection | utf8 |
| character_set_database | gbk |
| character_set_filesystem | binary |
| character_set_results | gbk |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| character_sets_dir | /home/mysql/mysql/share/charsets/ |
+--------------------------+-----------------------------------+
8 rows in set (0.00 sec)
insert into T values(null,'小明','character_set_connection utf8');
Query OK, 1 row affected (0.01 sec)
guo@ha 12:25:58>SELECT id,LENGTH(data), CHAR_LENGTH(data),data,hex(data),hex('小明'),comment from t order by id ;
+----+--------------+-------------------+--------+--------------+--------------------+-------------------------------+
| id | LENGTH(data) | CHAR_LENGTH(data) | data | hex(data) | hex('小明') | comment |
+----+--------------+-------------------+--------+--------------+--------------------+-------------------------------+
| 1 | 6 | 3 | 小明 | E5B08FE6988E | E7818FE5BF94E6A791 | 3 gbk |
| 2 | 4 | 2 | С | D0A1C3F7 | E7818FE5BF94E6A791 | character_set_client utf8 |
| 3 | 6 | 3 | 小明 | E5B08FE6988E | E7818FE5BF94E6A791 | character_set_connection utf8 |
+----+--------------+-------------------+--------+--------------+--------------------+-------------------------------+
3 rows in set (0.00 sec)
SecureCRT(utf8 E5B08FE6988E)–> character_set_client(gbk E5B08FE6988E)–>character_set_connection(utf8 E5B08FE6988E)–>database(gbk E5B08FE6988E)–>character_set_results(gbk E5B08FE6988E)–>SecureCRT(utf8 E5B08FE6988E)
数据库存储的二进制流E5B08FE6988E不变
SecureCRT输入的utf-8字符集‘小明’在这里变成了E7818FE5BF94E6A791
client(gbk)–>connection(utf8)未作转化
只设置character_set_results为utf8
set names gbk;
Query OK, 0 rows affected (0.00 sec)
set character_set_results=utf8;
Query OK, 0 rows affected (0.00 sec)
show variables like 'character%';
+--------------------------+-----------------------------------+
| Variable_name | Value |
+--------------------------+-----------------------------------+
| character_set_client | gbk |
| character_set_connection | gbk |
| character_set_database | gbk |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| character_sets_dir | /home/mysql/mysql/share/charsets/ |
+--------------------------+-----------------------------------+
8 rows in set (0.00 sec)
insert into T values(null,'小明','character_set_results utf8');
Query OK, 1 row affected (0.00 sec)
guo@ha 12:27:06>SELECT id,LENGTH(data), CHAR_LENGTH(data),data,hex(data),hex('小明'),comment from t order by id ;
+----+--------------+-------------------+-----------+--------------+------------------+-------------------------------+
| id | LENGTH(data) | CHAR_LENGTH(data) | data | hex(data) | hex('灏忔槑') | comment |
+----+--------------+-------------------+-----------+--------------+------------------+-------------------------------+
| 1 | 6 | 3 | 灏忔槑 | E5B08FE6988E | E5B08FE6988E | 3 gbk |
| 2 | 4 | 2 | 小明 | D0A1C3F7 | E5B08FE6988E | character_set_client utf8 |
| 3 | 6 | 3 | 灏忔槑 | E5B08FE6988E | E5B08FE6988E | character_set_connection utf8 |
| 4 | 6 | 3 | 灏忔槑 | E5B08FE6988E | E5B08FE6988E | character_set_results utf8 |
+----+--------------+-------------------+-----------+--------------+------------------+-------------------------------+
4 rows in set (0.00 sec)
SecureCRT(utf8 E5B08FE6988E)–> character_set_client(gbk E5B08FE6988E)–>character_set_connection(gbk E5B08FE6988E)–>database(gbk E5B08FE6988E)–>character_set_results(utf8 E7818FE5BF94E6A791 灏忔槑)–>SecureCRT(utf8 E7818FE5BF94E6A791 灏忔槑)
client(gbk)–>connection(gbk)未作转化
这里只是显示出了问题,数据库存储的二进制流E5B08FE6988E不变,gbk存储的“小明”也能正常显示
SecureCRT输入的utf-8字符集‘小明’在这里也是E5B08FE6988E,但是“小明”显示成了“灏忔槑”对应二进制流E7818FE5BF94E6A791,results并没有转换而是直接把E5B08FE6988E当作gbk的字符处理成E7818FE5BF94E6A791
select unhex('E7818FE5BF94E6A791');
+-----------------------------+
| unhex('E7818FE5BF94E6A791') |
+-----------------------------+
| 灏忔槑 |
+-----------------------------+
1 row in set (0.00 sec)
更改三者为utf8
改变character_set_client、character_set_connection、character_set_results的设置为utf8,查询插入的值。
set names utf8;
Query OK, 0 rows affected (0.00 sec)
show variables like 'character%';
+--------------------------+-----------------------------------+
| Variable_name | Value |
+--------------------------+-----------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | gbk |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| character_sets_dir | /home/mysql/mysql/share/charsets/ |
+--------------------------+-----------------------------------+
8 rows in set (0.00 sec)
insert into T values(null,'小明','3 utf8');
Query OK, 1 row affected (0.00 sec)
guo@ha 12:28:10>SELECT id,LENGTH(data), CHAR_LENGTH(data),data,hex(data),hex('小明'),comment from t order by id ;
+----+--------------+-------------------+-----------+--------------+---------------+-------------------------------+
| id | LENGTH(data) | CHAR_LENGTH(data) | data | hex(data) | hex('小明') | comment |
+----+--------------+-------------------+-----------+--------------+---------------+-------------------------------+
| 1 | 6 | 3 | 灏忔槑 | E5B08FE6988E | E5B08FE6988E | 3 gbk |
| 2 | 4 | 2 | 小明 | D0A1C3F7 | E5B08FE6988E | character_set_client utf8 |
| 3 | 6 | 3 | 灏忔槑 | E5B08FE6988E | E5B08FE6988E | character_set_connection utf8 |
| 4 | 6 | 3 | 灏忔槑 | E5B08FE6988E | E5B08FE6988E | character_set_results utf8 |
| 5 | 4 | 2 | 小明 | D0A1C3F7 | E5B08FE6988E | 3 utf8 |
+----+--------------+-------------------+-----------+--------------+---------------+-------------------------------+
5 rows in set (0.00 sec)
SecureCRT(utf8 E5B08FE6988E)–> character_set_client(utf8 E5B08FE6988E)–>character_set_connection(utf8 E5B08FE6988E)–>database(gbk D0A1C3F7)–>character_set_results(utf8 E5B08FE6988E)–>SecureCRT(utf8 E5B08FE6988E)
character_set_connection(utf8)–>database(gbk)作了转化
综上,MySQL内部二进制流重编码转化条件
- 只在connection和database阶段进行
- connection、database的前一字符集是当前字符集的超集
设置SecureCRT的字符集为GBK,看看SecureCRT字符集设置对结果影响
set names gbk;
Query OK, 0 rows affected (0.00 sec)
show variables like '%char%';
+--------------------------+-----------------------------------+
| Variable_name | Value |
+--------------------------+-----------------------------------+
| character_set_client | gbk |
| character_set_connection | gbk |
| character_set_database | gbk |
| character_set_filesystem | binary |
| character_set_results | gbk |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| character_sets_dir | /home/mysql/mysql/share/charsets/ |
+--------------------------+-----------------------------------+
8 rows in set (0.00 sec)
SELECT id,LENGTH(data), CHAR_LENGTH(data),data,hex(data),comment from t order by id ;
+----+--------------+-------------------+--------+----------------+-------------------------------+
| id | LENGTH(data) | CHAR_LENGTH(data) | data | hex(data) | comment |
+----+--------------+-------------------+--------+----------------+-------------------------------+
| 1 | 6 | 3 | 灏忔槑 | E5B08FE6988E | 3 gbk |
| 2 | 4 | 2 | 小明 | D0A1C3F7 | character_set_client utf8 |
| 3 | 6 | 3 | 灏忔槑 | E5B08FE6988E | character_set_connection utf8 |
| 4 | 6 | 3 | 灏忔槑 | E5B08FE6988E | character_set_results utf8 |
| 5 | 4 | 2 | 小明 | D0A1C3F7 | 3 utf8 |
+----+--------------+-------------------+--------+----------------+-------------------------------+
5 rows in set (0.00 sec)
set names utf8;
Query OK, 0 rows affected (0.00 sec)
show variables like '%char%';
+--------------------------+-----------------------------------+
| Variable_name | Value |
+--------------------------+-----------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | gbk |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| character_sets_dir | /home/mysql/mysql/share/charsets/ |
+--------------------------+-----------------------------------+
8 rows in set (0.00 sec)
SELECT id,LENGTH(data), CHAR_LENGTH(data),data,hex(data),comment from t order by id ;
+----+--------------+-------------------+-----------+--------------+-------------------------------+
| id | LENGTH(data) | CHAR_LENGTH(data) | data | hex(data) | comment |
+----+--------------+-------------------+-----------+--------------+-------------------------------+
| 1 | 6 | 3 | 鐏忓繑妲 B08FE6988E | 3 gbk |
| 2 | 4 | 2 | 灏忔槑 |D0A1C3F7 | character_set_client utf8 |
| 3 | 6 | 3 | 鐏忓繑妲 B08FE6988E | character_set_connection utf8 |
| 4 | 6 | 3 | 鐏忓繑妲 B08FE6988E | character_set_results utf8 |
| 5 | 4 | 2 | 灏忔槑 | D0A1C3F7 | 3 utf8 |
+----+--------------+-------------------+-----------+--------------+-------------------------------+
5 rows in set (0.00 sec)