Environment for the code in this article: MySQL 5.1.26-rc-community on Windows 2003.

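I stumbled on a file named ip-to-country.csv in the eMule installation directory. Opening it showed that it maps IP address ranges to countries. The format is as follows: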


A great find! One of my projects happened to need exactly this, so I set out to import the data into a MySQL database. First, create the table in MySQL:

use testdb;

create table ip_to_country
(
    ip1 int unsigned not null
   ,ip2 int unsigned not null
   ,cname1 varchar(10) not null
   ,cname2 varchar(10) not null
   ,cname3 varchar(50) not null
) engine=innodb default charset=utf8;
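
The ip1 and ip2 columns appear to hold the start and end of each range as unsigned 32-bit integers rather than dotted addresses. As a minimal sketch, assuming that interpretation, the conversion can be done in Python with the standard ipaddress module. The helper names ip_to_int and int_to_ip are hypothetical and are not part of the original setup:

```python
import ipaddress


def ip_to_int(dotted: str) -> int:
    """Return the unsigned 32-bit integer form of a dotted IPv4 address."""
    return int(ipaddress.IPv4Address(dotted))


def int_to_ip(value: int) -> str:
    """Return the dotted form of an unsigned 32-bit IPv4 integer."""
    return str(ipaddress.IPv4Address(value))


if __name__ == "__main__":
    print(ip_to_int("3.0.0.0"))
    print(int_to_ip(33996344))
```

Inside MySQL, the built-in functions INET_ATON() and INET_NTOA() perform the same conversion.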

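I planned to use MySQL's import tool, mysqlimport, for the job. First copy the data file ip-to-country.csv to d:/ and rename it to ip_to_country.csv so that it matches the table name in MySQL. Then, based on the format of the data file, write and run the following mysqlimport command: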

mysqlimport --local

Note: the mysqlimport arguments above must be written on a single line; I split them across lines only for readability. After running the command, warnings were reported:

testdb.ip_to_country: Records: 65290  Deleted: 0  Skipped: 0  Warnings: 15

The data was in, so I set the warnings aside for the moment and checked the result in MySQL. First set character_set_results=gb2312, then query 10 rows:

mysql> set character_set_results=gb2312;

mysql> show variables like '%char%';
| Variable_name | Value |
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | gb2312 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | D:/MySQL/share/charsets/ |
mysql> select * from ip_to_country limit 10;

| ip1 | ip2 | cname1 | cname2 | cname3 |
| 33996344 | 33996351 | GB | GBR | ? |
| 50331648 | 69956103 | US | USA | |
| 69956104 | 69956111 | BM | BMU | |
| 69956112 | 83886079 | US | USA | |
| 94585424 | 94585439 | SE | SWE | |
| 100663296 | 121195295 | US | USA | |
| 121195296 | 121195327 | IT | ITA | |
| 121195328 | 152305663 | US | USA | |
| 152305664 | 152338431 | GB | GBR | ? |
| 152338432 | 167772159 | US | USA | |

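The Chinese country names all came out garbled. Strange: mysqlimport's default-character-set option was already set to gb2312, so why the garbling? As a last resort, I changed the character set of table ip_to_country to gb2312 in MySQL: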

mysql> alter table ip_to_country default character set gb2312;
mysql> alter table ip_to_country convert to character set gb2312;

Then I reran the mysqlimport command. This time the garbling was gone and the Chinese country names displayed correctly:

mysql> select * from ip_to_country limit 10;

| ip1 | ip2 | cname1 | cname2 | cname3 |
| 33996344 | 33996351 | GB | GBR | 英國 |
| 50331648 | 69956103 | US | USA | 美國 |
| 69956104 | 69956111 | BM | BMU | 百慕達群島 |
| 69956112 | 83886079 | US | USA | 美國 |
| 94585424 | 94585439 | SE | SWE | 瑞典 |
| 100663296 | 121195295 | US | USA | 美國 |
| 121195296 | 121195327 | IT | ITA | 義大利 |
| 121195328 | 152305663 | US | USA | 美國 |
| 152305664 | 152338431 | GB | GBR | 英國 |
| 152338432 | 167772159 | US | USA | 美國 |
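
With the ranges loaded, mapping an address to a country is a range lookup on ip1 and ip2. The following is only a sketch of that idea in Python, using a few rows copied from the output above and assuming the ranges are sorted by ip1 and do not overlap. The names RANGES, STARTS and lookup are hypothetical:

```python
import bisect

# A few ranges copied from the query output above: (ip1, ip2, cname1).
RANGES = [
    (33996344, 33996351, "GB"),
    (50331648, 69956103, "US"),
    (69956104, 69956111, "BM"),
    (69956112, 83886079, "US"),
    (94585424, 94585439, "SE"),
]
STARTS = [start for start, _end, _code in RANGES]


def lookup(ip_value):
    """Return the country code whose range contains ip_value, or None."""
    # Find the last range whose start is <= ip_value.
    index = bisect.bisect_right(STARTS, ip_value) - 1
    if index < 0:
        return None
    start, end, code = RANGES[index]
    # The address may fall in a gap between two ranges.
    return code if start <= ip_value <= end else None
```

The same check in SQL would be a query on ip_to_country with a condition such as ip1 <= x and ip2 >= x for the integer address x.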

One question remains: can mysqlimport convert the gb2312 characters in a text file to utf8 while importing them into MySQL?
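
This does not answer the question about mysqlimport itself, but one possible workaround is to convert the file to UTF-8 before importing, so that no conversion is needed during the import. A minimal sketch in Python, assuming the source file really is gb2312 encoded; the function name reencode and the output file name in the usage note are hypothetical:

```python
def reencode(src_path, dst_path, src_encoding="gb2312", dst_encoding="utf-8"):
    """Copy a text file, converting it from src_encoding to dst_encoding."""
    # newline="" keeps the original \r\n line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)
```

For example, reencode("d:/ip_to_country.csv", "d:/ip_to_country_utf8.csv") would write a UTF-8 copy next to the original file. If the file contains bytes that are not valid gb2312, open() raises UnicodeDecodeError while reading, which at least makes the encoding problem visible.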

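Although the problem looked solved, I still wanted to try MySQL's load data command. mysqlimport had imported the data, but those 15 warnings kept nagging at me. I had hoped mysqlimport itself could show what the warnings were about, but the manual offered nothing. MySQL's show warnings gave me a glimmer of hope: run load data inside MySQL first, then use show warnings to pinpoint the problem.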

mysql> truncate table ip_to_country;

mysql> load data infile "d:/ip_to_country.csv"
replace into table ip_to_country
character set gb2312
fields terminated by "," enclosed by ""
lines terminated by "\r\n";

ERROR 1262 (01000): Row 6737 was truncated; it contained more data than there were input columns

Another roadblock: the truncation of Row 6737 is reported as ERROR 1262 and the load is aborted. The cause turned out to be sql_mode: with strict_trans_tables enabled, this condition is raised as an error rather than a warning.

mysql> show variables like '%sql_mode%';

| Variable_name | Value |
| sql_mode | strict_trans_tables,no_auto_create_user,no_engine_substitution |

mysql> set sql_mode='no_auto_create_user,no_engine_substitution';

With strict_trans_tables removed from sql_mode, run MySQL load data again:

mysql>  load data infile "d:/ip_to_country.csv"
replace into table ip_to_country
character set gb2312
fields terminated by "," enclosed by ""
lines terminated by "\r\n";

Query OK, 65290 rows affected, 15 warnings (0.63 sec)
Records: 65290 Deleted: 0 Skipped: 0 Warnings: 15

Next, use MySQL's show warnings command to get the details of the warnings:

mysql> show warnings;

| Level | Code | Message |
| Warning | 1262 | Row 6737 was truncated; it contained more data than there were input columns |
| Warning | 1262 | Row 6817 was truncated; it contained more data than there were input columns |
| Warning | 1262 | Row 6914 was truncated; it contained more data than there were input columns |
| Warning | 1262 | Row 6916 was truncated; it contained more data than there were input columns |
| Warning | 1262 | Row 6918 was truncated; it contained more data than there were input columns |
| Warning | 1262 | Row 6988 was truncated; it contained more data than there were input columns |
| Warning | 1262 | Row 7028 was truncated; it contained more data than there were input columns |
| Warning | 1262 | Row 7226 was truncated; it contained more data than there were input columns |
| Warning | 1262 | Row 7569 was truncated; it contained more data than there were input columns |
| Warning | 1262 | Row 7791 was truncated; it contained more data than there were input columns |
| Warning | 1262 | Row 47856 was truncated; it contained more data than there were input columns |
| Warning | 1262 | Row 47885 was truncated; it contained more data than there were input columns |
| Warning | 1262 | Row 49331 was truncated; it contained more data than there were input columns |
| Warning | 1262 | Row 49539 was truncated; it contained more data than there were input columns |
| Warning | 1262 | Row 49547 was truncated; it contained more data than there were input columns |

Looking up Row 6737 in ip_to_country.csv confirmed the problem: the Chinese country name contains an extra comma (",") in the middle.
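
The same rows can also be located without going through MySQL. The sketch below scans the file with Python's csv module and reports every line that has more than the expected 5 fields. It assumes the file is gb2312 encoded and uses no quoting, matching the enclosed by "" setting used above. The function name rows_with_extra_fields is hypothetical:

```python
import csv


def rows_with_extra_fields(path, expected=5, encoding="gb2312"):
    """Yield (line_number, fields) for rows with more than `expected` fields."""
    with open(path, "r", encoding=encoding, newline="") as handle:
        # QUOTE_NONE: treat quote characters as ordinary data, split on commas only.
        reader = csv.reader(handle, delimiter=",", quoting=csv.QUOTE_NONE)
        for line_number, fields in enumerate(reader, start=1):
            if len(fields) > expected:
                yield line_number, fields
```

Calling list(rows_with_extra_fields("d:/ip_to_country.csv")) should then list the same lines that show warnings reports, provided the row numbers in the warnings correspond to line numbers in the file.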


So table ip_to_country needs one more column to hold the surplus field. Alter the table:

mysql> alter table ip_to_country add column cname4 varchar(50) null;

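Run MySQL load data again and the data imports cleanly. There are still warnings, but now they arise because most rows in the file have only 5 fields while the table has 6 columns, so MySQL warns about each short row.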

Now change the character set of table ip_to_country back to utf8 and see whether the text comes out garbled:

truncate table ip_to_country;
alter table ip_to_country default character set utf8;
alter table ip_to_country convert to character set utf8;

Run the MySQL load data command once more:

mysql>  load data infile "d:/ip_to_country.csv"
replace into table ip_to_country
character set gb2312
fields terminated by "," enclosed by ""
lines terminated by "\r\n";

Query OK, 65290 rows affected, 65275 warnings (0.64 sec)
Records: 65290 Deleted: 0 Skipped: 0 Warnings: 65275


mysql> select * from ip_to_country where cname4 is not null limit 10;

| ip1 | ip2 | cname1 | cname2 | cname3 | cname4 |
| 1089579216 | 1089579223 | VI | VIR | 維京群島 | 美國 |
| 1093062144 | 1093062399 | VI | VIR | 維京群島 | 美國 |
| 1097896192 | 1097897215 | VI | VIR | 維京群島 | 美國 |
| 1097947136 | 1097949183 | VI | VIR | 維京群島 | 美國 |
| 1097951232 | 1097953279 | VI | VIR | 維京群島 | 美國 |
| 1101625344 | 1101625407 | VI | VIR | 維京群島 | 美國 |
| 1101971072 | 1101971079 | VI | VIR | 維京群島 | 美國 |
| 1113864768 | 1113864783 | VI | VIR | 維京群島 | 美國 |
| 1119428608 | 1119432703 | VI | VIR | 維京群島 | 美國 |
| 1123590144 | 1123594239 | VI | VIR | 維京群島 | 美國 |

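So the MySQL load data infile statement can convert between character sets: the gb2312 file was loaded into a utf8 table and the names display correctly.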


show warnings; is a nice approach.


SET character_set_database=gbk is an even nicer approach.



