
Transferring data between MySQL and HDFS with Sqoop

2012-07-19

Sqoop is an Apache tool for moving data between an RDBMS and HDFS. This document is a worked example of using Sqoop: transferring data between MySQL and HDFS in both directions, and importing data from MySQL into HBase.

Download:

http://www.apache.org/dyn/closer.cgi/sqoop/

[zhouhh@Hadoop48 ~]$ wget http://labs.renren.com/apache-mirror/sqoop/1.4.1-incubating/sqoop-1.4.1-incubating__hadoop-1.0.0.tar.gz

Latest user guide: http://sqoop.apache.org/docs/1.4.1-incubating/SqoopUserGuide.html

1. Exporting from HBase directly to MySQL?

At first I wanted to export directly from HBase to MySQL. Create a database and a table in MySQL:

mysql> create database toplists;
Query OK, 1 row affected (0.06 sec)
mysql> use toplists
Database changed
mysql> create table t1(id int not null primary key, name varchar(255),value int);
Query OK, 0 rows affected (0.10 sec)

hbase(main):011:0> scan 't1'
ROW COLUMN+CELL
1001 column=info:count, timestamp=1340265059531, value=724988
1009 column=info:count, timestamp=1340265059533, value=108051
...
total column=info:count, timestamp=1340265059534, value=833039
total_user_count column=info:, timestamp=1340266656307, value=154516
11 row(s) in 0.0420 seconds

[zhouhh@Hadoop48 ~]$ sqoop list-tables --connect jdbc:mysql://localhost/toplists --username root
java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
    at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:657)
    at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:473)
    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:496)
    at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:194)
    at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:178)
    at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:114)
    at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1235)
    at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1060)
    at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:82)
    at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:64)
    at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:97)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
    at com.cloudera.sqoop.Sqoop.main(Sqoop.java:57)

The MySQL JDBC Connector library is missing: download it and copy it into $SQOOP_HOME/lib.

Download address:

http://www.mysql.com/downloads/connector/j/

[zhouhh@Hadoop48 ~]$ wget http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.21.tar.gz/from/http://cdn.mysql.com/
[zhouhh@Hadoop48 mysql-connector-java-5.1.21]$ cp mysql-connector-java-5.1.21-bin.jar ../sqoop/lib/.
[zhouhh@Hadoop48 ~]$ sqoop list-tables --connect jdbc:mysql://localhost/toplists --username root
t1

[zhouhh@Hadoop48 ~]$ sqoop-export --connect jdbc:mysql://localhost/toplists --username root --table t1 --export-dir /hbase

java.io.IOException: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
at org.apache.sqoop.mapreduce.ExportOutputFormat.getRecordWriter(ExportOutputFormat.java:79)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:628)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:753)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

This may be caused by the JDBC driver version; switch to 5.1.18.
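
A minimal sketch of swapping the connector jar, assuming the same download URL pattern as above and that Sqoop is installed in ~/sqoop; the exact mirror URL and paths are illustrative:

wget http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.18.tar.gz/from/http://cdn.mysql.com/
tar xzf mysql-connector-java-5.1.18.tar.gz
rm ~/sqoop/lib/mysql-connector-java-5.1.21-bin.jar     # remove the 5.1.21 jar so only one driver version is on the classpath
cp mysql-connector-java-5.1.18/mysql-connector-java-5.1.18-bin.jar ~/sqoop/lib/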

[zhouhh@Hadoop48 ~]$ sqoop-export --connect jdbc:mysql://localhost:3306/toplists --username root --table t1 --export-dir /hbase

Error initializing attempt_201206271529_0006_r_000000_0:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for ttprivate/taskTracker/zhouhh/jobcache/job_201206271529_0006/jobToken
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
at org.apache.hadoop.mapred.TaskTracker.localizeJobTokenFile(TaskTracker.java:4271)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1177)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1118)
at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2430)
at java.lang.Thread.run(Thread.java:722)

This DiskErrorException took a while to track down: it turned out that another machine's disk was full, which raises this exception while the MapReduce job runs.

[zhouhh@Hadoop47 ~]$ df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda3 28337624 26877184 0 100% /
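
When a job fails this way it is worth checking disk space on every node, not just the machine the job was launched from. A quick sketch, with hypothetical hostnames:

for h in Hadoop46 Hadoop47 Hadoop48; do
    echo "== $h =="
    ssh $h df -h       # look for any filesystem at or near 100%
done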

[zhouhh@Hadoop48 ~]$ sqoop-export --connect jdbc:mysql://192.168.10.48:3306/toplists --username root --table t1 --export-dir /hbase
Caused by: java.sql.SQLException: null, message from server: "Host 'Hadoop47' is not allowed to connect to this MySQL server"

This is a permissions problem; grant access:

mysql> GRANT ALL PRIVILEGES ON *.* TO '%'@'%'; # let any user from any host read and modify every database; otherwise other IPs cannot connect to this MySQL server
Query OK, 0 rows affected (0.06 sec)

This is a test setup, so the grant is unrestricted. In a real working environment, grant privileges carefully.
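
For production, a narrower grant would be safer; a sketch where the sqoop user, password, and subnet are all hypothetical:

mysql> CREATE USER 'sqoop'@'192.168.10.%' IDENTIFIED BY 'secret';
mysql> GRANT SELECT, INSERT, UPDATE ON toplists.* TO 'sqoop'@'192.168.10.%';
mysql> FLUSH PRIVILEGES;

Imports only need SELECT; exports need INSERT, plus UPDATE when --update-key is used.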

[zhouhh@Hadoop48 ~]$ sqoop-export --connect jdbc:mysql://192.168.10.48:3306/toplists --username root --table t1 --export-dir /hbase
Note: /tmp/sqoop-zhouhh/compile/fa1d1c042030b0ec8537c7a4cd02aab3/t1.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
java.lang.NumberFormatException: For input string: "7"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:481)
at java.lang.Integer.valueOf(Integer.java:582)
at t1.__loadFromFields(t1.java:218)
at t1.parse(t1.java:170)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:77)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:36)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:183)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

The error occurs because /hbase is HBase's own storage directory, not in any format that can be exported, hence the failure.
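
Listing the directory makes this obvious; a sketch, assuming the default HBase root (the exact entries vary by version):

[zhouhh@Hadoop48 ~]$ hadoop fs -ls /hbase

The listing shows HBase internals such as -ROOT-, .META., .logs, and per-table directories of binary HFiles, none of which is the delimited text that sqoop-export expects.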

[zhouhh@Hadoop48 ~]$ sqoop-export --connect jdbc:mysql://192.168.10.48:3306/toplists --username root --table t1 --export-dir /hbase/t1
[zhouhh@Hadoop48 ~]$ sqoop-export --verbose --connect jdbc:mysql://192.168.10.48:3306/toplists --username root --table t1 --update-key id --input-fields-terminated-by '\t' --export-dir /hbase/t1
Note: /tmp/sqoop-zhouhh/compile/8ce6556eb13b3000550a9c864eaa6820/t1.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
[zhouhh@Hadoop48 ~]$

Pointing the export directory at the /hbase/t1 table makes the export run without errors, yet no data appears in MySQL. I only learned later that Sqoop has no way to export a table directly from HBase to MySQL: you must first dump the HBase table to a flat file, or export it into Hive, before Sqoop can export the data to MySQL.

2. Importing from MySQL to HDFS

Create a MySQL table and import it into HDFS:

mysql> create table test(id int not null primary key auto_increment,name varchar(64) not null,price decimal(10,2), cdate date,version int,comment varchar(255));
Query OK, 0 rows affected (0.10 sec)
mysql> insert into test values(null,'iphone',3900.00,'2012-7-18',1,'8g');
Query OK, 1 row affected (0.04 sec)
mysql> insert into test values(null,'ipad',3200.00,'2012-7-16',2,'16g');
Query OK, 1 row affected (0.00 sec)
mysql> select * from test;
+----+--------+---------+------------+---------+---------+
| id | name | price | cdate | version | comment |
+----+--------+---------+------------+---------+---------+
| 1 | iphone | 3900.00 | 2012-07-18 | 1 | 8g |
| 2 | ipad | 3200.00 | 2012-07-16 | 2 | 16g |
+----+--------+---------+------------+---------+---------+
2 rows in set (0.00 sec)

Import:

[zhouhh@Hadoop48 ~]$ sqoop import --connect jdbc:mysql://Hadoop48/toplists --table test -m 1
java.lang.RuntimeException: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Access denied for user ''@'Hadoop48' to database 'toplists'
at org.apache.sqoop.manager.CatalogQueryManager.getColumnNames(CatalogQueryManager.java:162)

Grant privileges to the anonymous user:

mysql> GRANT ALL PRIVILEGES ON *.* TO ''@'%';

[zhouhh@Hadoop48 ~]$ sqoop import --connect jdbc:mysql://Hadoop48/toplists --username root --table test -m 1

12/07/18 11:10:16 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
12/07/18 11:10:16 INFO tool.CodeGenTool: Beginning code generation
12/07/18 11:10:16 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `index_mapping` AS t LIMIT 1
12/07/18 11:10:16 INFO orm.CompilationManager: HADOOP_HOME is /home/zhoulei/hadoop-1.0.0/libexec/..
Note: /tmp/sqoop-zhoulei/compile/2b04bdabb7043e4f75b215d72f65388e/index_mapping.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
12/07/18 11:10:18 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-zhoulei/compile/2b04bdabb7043e4f75b215d72f65388e/index_mapping.jar
12/07/18 11:10:18 WARN manager.MySQLManager: It looks like you are importing from mysql.
12/07/18 11:10:18 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
12/07/18 11:10:18 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
12/07/18 11:10:18 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
12/07/18 11:10:25 INFO mapreduce.ImportJobBase: Beginning import of index_mapping
12/07/18 11:10:26 INFO mapred.JobClient: Running job: job_201207101344_0519
12/07/18 11:10:27 INFO mapred.JobClient: map 0% reduce 0%
12/07/18 11:10:40 INFO mapred.JobClient: map 100% reduce 0%
12/07/18 11:10:45 INFO mapred.JobClient: Job complete: job_201207101344_0519
12/07/18 11:10:45 INFO mapred.JobClient: Counters: 18
12/07/18 11:10:45 INFO mapred.JobClient: Job Counters
12/07/18 11:10:45 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12083
12/07/18 11:10:45 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/07/18 11:10:45 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/07/18 11:10:45 INFO mapred.JobClient: Launched map tasks=1
12/07/18 11:10:45 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
12/07/18 11:10:45 INFO mapred.JobClient: File Output Format Counters
12/07/18 11:10:45 INFO mapred.JobClient: Bytes Written=28
12/07/18 11:10:45 INFO mapred.JobClient: FileSystemCounters
12/07/18 11:10:45 INFO mapred.JobClient: HDFS_BYTES_READ=87
12/07/18 11:10:45 INFO mapred.JobClient: FILE_BYTES_WRITTEN=30396
12/07/18 11:10:45 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=28
12/07/18 11:10:45 INFO mapred.JobClient: File Input Format Counters
12/07/18 11:10:45 INFO mapred.JobClient: Bytes Read=0
12/07/18 11:10:45 INFO mapred.JobClient: Map-Reduce Framework
12/07/18 11:10:45 INFO mapred.JobClient: Map input records=2
12/07/18 11:10:45 INFO mapred.JobClient: Physical memory (bytes) snapshot=79167488
12/07/18 11:10:45 INFO mapred.JobClient: Spilled Records=0
12/07/18 11:10:45 INFO mapred.JobClient: CPU time spent (ms)=340
12/07/18 11:10:45 INFO mapred.JobClient: Total committed heap usage (bytes)=56623104
12/07/18 11:10:45 INFO mapred.JobClient: Virtual memory (bytes) snapshot=955785216
12/07/18 11:10:45 INFO mapred.JobClient: Map output records=2
12/07/18 11:10:45 INFO mapred.JobClient: SPLIT_RAW_BYTES=87
12/07/18 11:10:45 INFO mapreduce.ImportJobBase: Transferred 28 bytes in 20.2612 seconds (1.382 bytes/sec)
12/07/18 11:10:45 INFO mapreduce.ImportJobBase: Retrieved 2 records.

Check that the data was imported:

[zhouhh@Hadoop48 ~]$ hadoop fs -cat /user/zhouhh/test/part-m-00000
1,iphone,3900.00,2012-07-18,1,8g
2,ipad,3200.00,2012-07-16,2,16g

[zhouhh@Hadoop48 ~]$ hadoop fs -cat test/part-m-00000
1,iphone,3900.00,2012-07-18,1,8g
2,ipad,3200.00,2012-07-16,2,16g

3. Exporting from HDFS to MySQL

Empty the table:

mysql> delete from test;
Query OK, 2 rows affected (0.00 sec)

mysql> select * from test;
Empty set (0.00 sec)

Export:

[zhouhh@Hadoop48 ~]$ sqoop-export --connect jdbc:mysql://192.168.10.48:3306/toplists --username root --table test --export-dir test
Note: /tmp/sqoop-zhouhh/compile/7adaaa7ffe5f49ed9d794b1be8a9a983/test.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

For an export, --connect, --table, and --export-dir are required. Here toplists is the database name, and --table names the table inside that database. --export-dir is the location of the HDFS flat file to export; a relative path is resolved against /user/username, so test here means /user/zhouhh/test.
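
A sketch of the same export with the delimiters spelled out and the path made absolute, for clarity; Sqoop's import default is comma-separated fields and newline-terminated records, so these options match what section 2 wrote:

sqoop-export --connect jdbc:mysql://192.168.10.48:3306/toplists --username root \
    --table test --export-dir /user/zhouhh/test \
    --input-fields-terminated-by ',' --input-lines-terminated-by '\n'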

Check the MySQL table:

mysql> select * from test;
+----+--------+---------+------------+---------+---------+
| id | name | price | cdate | version | comment |
+----+--------+---------+------------+---------+---------+
| 1 | iphone | 3900.00 | 2012-07-18 | 1 | 8g |
| 2 | ipad | 3200.00 | 2012-07-16 | 2 | 16g |
+----+--------+---------+------------+---------+---------+
2 rows in set (0.00 sec)

The export succeeded.

4. Generating import code without running MapReduce

[zhouhh@Hadoop48 ~]$ sqoop codegen --connect jdbc:mysql://192.168.10.48:3306/toplists --username root --table test --class-name Mycodegen
Note: /tmp/sqoop-zhouhh/compile/104b871487669b89dcd5b9b2c61f905f/Mycodegen.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

[zhouhh@Hadoop48 ~]$ sqoop help codegen
usage: sqoop codegen [GENERIC-ARGS] [TOOL-ARGS]

When importing, Sqoop can apply a selection statement to filter rows or combine several tables, via --query; or apply just a condition, via --where, so you need not import the whole table every time, e.g. --where 'id > 1000'. An example that joins two tables (the literal token $CONDITIONS is required in the WHERE clause; Sqoop substitutes its split conditions there):

sqoop import --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id == b.id) WHERE $CONDITIONS' -m 1 --target-dir /usr/foo/joinresults
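
For instance, a conditional import of the test table from section 2 might look like the following; the target directory name is illustrative:

sqoop import --connect jdbc:mysql://Hadoop48/toplists --username root \
    --table test --where 'id > 1' --target-dir /user/zhouhh/test_where -m 1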

5. Importing a MySQL table into HBase

Although Sqoop currently has no way to export HBase directly to MySQL, importing MySQL directly into HBase does work. Specify the target table with --hbase-table, pass --hbase-create-table to create the table in HBase automatically, --column-family for the column family name, and --hbase-row-key for the MySQL key column that becomes the row key.

[zhouhh@Hadoop48 ~]$ sqoop import --connect jdbc:mysql://Hadoop48/toplists --table test --hbase-table a --column-family name --hbase-row-key id --hbase-create-table --username 'root'

Check the imported table in HBase:

hbase(main):002:0> scan 'a'
ROW COLUMN+CELL
1 column=name:cdate, timestamp=1342601695952, value=2012-07-18
1 column=name:comment, timestamp=1342601695952, value=8g
1 column=name:name, timestamp=1342601695952, value=iphone
1 column=name:price, timestamp=1342601695952, value=3900.00
1 column=name:version, timestamp=1342601695952, value=1
2 column=name:cdate, timestamp=1342601695952, value=2012-07-16
2 column=name:comment, timestamp=1342601695952, value=16g
2 column=name:name, timestamp=1342601695952, value=ipad
2 column=name:price, timestamp=1342601695952, value=3200.00
2 column=name:version, timestamp=1342601695952, value=2
2 row(s) in 0.2370 seconds

On import consistency: stop writes to the MySQL table before importing to HDFS or Hive; otherwise the MapReduce job may miss newly added rows. On efficiency: MySQL direct mode (--direct) imports are fast, but do not support large-object columns (CLOB or BLOB types). JDBC mode is slower, but has a dedicated API that supports CLOB and BLOB.
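
A sketch of the same import in direct mode; this assumes the mysqldump client is installed on every task node, since --direct shells out to it:

sqoop import --connect jdbc:mysql://Hadoop48/toplists --username root --table test --direct -m 1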

6. Exporting data from HBase to MySQL

There is currently no direct export command, but there are two ways to get HBase data into MySQL.

First, dump the HBase table to an HDFS flat file, then export that to MySQL. Second, export the HBase data into Hive, then export from Hive to MySQL; see the follow-up article "Exporting data from Hive to MySQL".
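
A minimal sketch of the second route for the t1 table from section 1, assuming Hive was built with HBase integration (the storage-handler jars on its classpath); the Hive table name, export directory, and the mapping onto MySQL's (id, value) columns are illustrative:

hive> CREATE EXTERNAL TABLE hbase_t1(id string, cnt int)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:count")
    > TBLPROPERTIES ("hbase.table.name" = "t1");
hive> INSERT OVERWRITE DIRECTORY '/user/zhouhh/t1_flat'
    > SELECT id, cnt FROM hbase_t1;

[zhouhh@Hadoop48 ~]$ sqoop-export --connect jdbc:mysql://192.168.10.48:3306/toplists --username root \
    --table t1 --columns id,value --export-dir /user/zhouhh/t1_flat \
    --input-fields-terminated-by '\001'

INSERT OVERWRITE DIRECTORY writes ^A-delimited text, which the '\001' option tells sqoop-export to expect.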
