
Four Ways to Export Data from a Hive Table, and How to Customize the Export Column Separator

Guiding questions:

1. What are the four ways to export data from a Hive table?

2. What does LOCAL do in the export command, and what difference does its presence or absence make?

3. Can INTO be used in the export command, the way it is in the import command?

4. How do you customize the column separator of the exported file?

5. What do hive's -e and -f options do, and how can they be used to export data?

6. What does the source command do in the hive shell?

We will use table test1 to demonstrate several ways of exporting data. Its contents are as follows:

hive> select * from test1;
OK
Tom     24.0    NanJing Nanjing University
Jack    29.0    NanJing Southeast China University
Mary Kake       21.0    SuZhou  Suzhou University
John Doe        24.0    YangZhou        YangZhou University
Bill King       23.0    XuZhou  Xuzhou Normal University
Time taken: 0.064 seconds, Fetched: 5 row(s)

1. Exporting data to the local file system

(1) Export the data

hive> insert overwrite local directory "/home/hadoopUser/data"
    > select name,age,address
    > from test1;

/home/hadoopUser/data is a directory; one or more files will be written under it, with the exact number depending on how many reducers are invoked. LOCAL here means the data is exported to the local file system; without LOCAL, it is exported to the distributed file system (HDFS).
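
For example (a quick sanity check, assuming the job ran as above), listing the target directory shows the file(s) that were written; here a single map task produced one file:

[hadoopUser@secondmgt ~]$ ls /home/hadoopUser/data
000000_0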

(2) View the exported data

Tom^A24.0^ANanJing
Jack^A29.0^ANanJing
Mary Kake^A21.0^ASuZhou
John Doe^A24.0^AYangZhou
Bill King^A23.0^AXuZhou

A file named 000000_0 was generated in the local directory /home/hadoopUser/data, with the contents shown above. Columns are separated by ^A (octal \001).
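
Because ^A is a non-printing control character, a plain cat will not display it. One way to make the separators visible (a sketch, assuming a GNU/Linux shell) is cat -v, which renders \001 as ^A:

[hadoopUser@secondmgt ~]$ cat -v /home/hadoopUser/data/000000_0
Tom^A24.0^ANanJing
Jack^A29.0^ANanJing
...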

2. Exporting data to the distributed file system

(1) Export the data

hive> insert overwrite directory "/output"
    > select name,age,address
    > from test1;

The only difference from the first method is the missing LOCAL keyword. The directory /output is on the distributed file system; it can also be written out in full, e.g. hdfs://master-server/output.

(2) View the exported data

The result is the same as with the first method.
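
Since the output now lives on HDFS, inspect it with the hadoop fs commands instead (a sketch, assuming the same single output file name as before):

[hadoopUser@secondmgt ~]$ hadoop fs -ls /output
[hadoopUser@secondmgt ~]$ hadoop fs -cat /output/000000_0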

Note:

What happens if we change OVERWRITE to INTO, just as we would when importing data into a table?

hive> insert into local directory "/home/hadoopUser/data"
    > select name,age,address
    > from test1;
NoViableAltException(…@[184:1: tableName : (db= identifier DOT tab= identifier -> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME $tab) );])
        at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
        at org.antlr.runtime.DFA.predict(DFA.java:116)
        at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.tableName(HiveParser_FromClauseParser.java:4945)
        at org.apache.hadoop.hive.ql.parse.HiveParser.tableName(HiveParser.java:40208)
        at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.tableOrPartition(HiveParser_IdentifiersParser.java:10233)
        at org.apache.hadoop.hive.ql.parse.HiveParser.tableOrPartition(HiveParser.java:40210)
        at org.apache.hadoop.hive.ql.parse.HiveParser.insertClause(HiveParser.java:39685)
        at org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:37647)
        at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:36898)
        at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:36774)
        at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1338)
        at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1036)
        at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
        at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:404)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:322)
        at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:975)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1040)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:792)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
FAILED: ParseException line 1:12 missing TABLE at 'local' near 'into' in table name

Unlike importing data into Hive, exporting table data with an INSERT INTO command is not possible.

3. Exporting data from one table into another table

This method is similar to what I covered in another blog post, so I will not belabor it here; a minimal sketch follows below. It is also one way of importing data into a table, namely inserting query results into a table; link: Inserting data into a table via a query statement
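
For completeness, a minimal sketch of this method (the target table test2 is hypothetical; CREATE TABLE ... LIKE copies the schema of test1, and the SELECT list must match that schema):

hive> create table test2 like test1;
hive> insert overwrite table test2
    > select * from test1;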

4. Using hive -e and -f to export data

        (1) Exporting data with hive -e

        Sometimes you want to run one or more queries (separated by semicolons) and have the hive CLI exit as soon as they finish. The -e option lets hive execute one or more statements directly from the terminal; the statements are enclosed in quotes, and the query results can be redirected to a file.

        Note: this is not run inside the Hive shell; execute it directly from the terminal.

       You must first switch to the corresponding database with "use hive" (here the database is named hive), otherwise an error occurs. For example:

[hadoopUser@secondmgt ~]$ hive -e "use hive;select * from test1" > /home/hadoopUser/data/myquery
14/12/29 18:53:18 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/12/29 18:53:18 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
14/12/29 18:53:18 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
14/12/29 18:53:18 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
14/12/29 18:53:18 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/12/29 18:53:18 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
14/12/29 18:53:18 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/12/29 18:53:18 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed

Logging initialized using configuration in file:/home/hadoopUser/cloud/hive/apache-hive-0.13.1-bin/conf/hive-log4j.properties
OK
Time taken: 0.597 seconds
OK
Time taken: 0.66 seconds, Fetched: 5 row(s)

        View the exported file:

Tom     24.0    NanJing Nanjing University
Jack    29.0    NanJing Southeast China University
Mary Kake       21.0    SuZhou  Suzhou University
John Doe        24.0    YangZhou        YangZhou University
Bill King       23.0    XuZhou  Xuzhou Normal University
         Note: the default column separator is \t.
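
Building on this, the tab-separated -e output can be piped through standard tools to produce other formats. For example, a sketch that converts the result to CSV (assumes GNU sed; the output path is arbitrary):

[hadoopUser@secondmgt ~]$ hive -e "use hive;select * from test1" | sed 's/\t/,/g' > /home/hadoopUser/data/test1.csv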

        (2) Running a query file with hive -f

        Sometimes we put multiple query statements into one file; hive -f executes the statement or statements in that file. By convention, such Hive query files are saved with a .q or .hql extension.

        Edit query.hql with the following contents:

use hive;
select name,age from test1;

        The command and its output are as follows:

[hadoopUser@secondmgt ~]$ hive -f data/query.hql
14/12/29 19:07:38 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/12/29 19:07:38 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
14/12/29 19:07:38 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
14/12/29 19:07:38 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
14/12/29 19:07:38 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/12/29 19:07:38 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
14/12/29 19:07:38 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/12/29 19:07:38 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed

Logging initialized using configuration in file:/home/hadoopUser/cloud/hive/apache-hive-0.13.1-bin/conf/hive-log4j.properties
OK
Time taken: 0.564 seconds
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1419317102229_0038, Tracking URL = http://secondmgt:8088/proxy/application_1419317102229_0038/
Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job  -kill job_1419317102229_0038
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-12-29 19:07:55,168 Stage-1 map = 0%,  reduce = 0%
2014-12-29 19:08:06,671 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.93 sec
MapReduce Total cumulative CPU time: 2 seconds 930 msec
Ended Job = job_1419317102229_0038
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 2.93 sec   HDFS Read: 415 HDFS Write: 63 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 930 msec
OK
Tom     24.0
Jack    29.0
Mary Kake       21.0
John Doe        24.0
Bill King       23.0
Time taken: 27.17 seconds, Fetched: 5 row(s)
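
As with -e, the output of hive -f can be redirected to a file if you want to keep the result (the file name here is arbitrary):

[hadoopUser@secondmgt ~]$ hive -f data/query.hql > /home/hadoopUser/data/query_result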

       (3) Running a script file with the source command in the Hive shell

hive> source /home/hadoopUser/data/query.hql;
OK
Time taken: 0.394 seconds
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1419317102229_0039, Tracking URL = http://secondmgt:8088/proxy/application_1419317102229_0039/
Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job  -kill job_1419317102229_0039
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-12-29 19:13:03,718 Stage-1 map = 0%,  reduce = 0%
2014-12-29 19:13:15,459 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.7 sec
MapReduce Total cumulative CPU time: 2 seconds 700 msec
Ended Job = job_1419317102229_0039
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 2.7 sec   HDFS Read: 415 HDFS Write: 63 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 700 msec
OK
Tom     24.0
Jack    29.0
Mary Kake       21.0
John Doe        24.0
Bill King       23.0
Time taken: 27.424 seconds, Fetched: 5 row(s)

Appendix:

Customizing the column separator of the export file

        By default, files exported from a Hive table use ^A as the column separator, which is not always easy to read. For this reason Hive allows you to specify the column separator. We again use the first method above, exporting the table's data, as the example.

(1) Export the table data

hive> insert overwrite local directory "/home/hadoopUser/data"
    > row format delimited
    > fields terminated by '\t'
    > select * from test1;

Here \t is used as the column separator.

(2) View the exported result

Tom     24.0    NanJing Nanjing University
Jack    29.0    NanJing Southeast China University
Mary Kake       21.0    SuZhou  Suzhou University
John Doe        24.0    YangZhou        YangZhou University
Bill King       23.0    XuZhou  Xuzhou Normal University

Compared with the first method's default ^A-separated export, this is more readable.
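
Likewise, any single-character separator can be used. For example, a sketch that writes comma-separated output to a hypothetical directory /home/hadoopUser/data_csv:

hive> insert overwrite local directory "/home/hadoopUser/data_csv"
    > row format delimited
    > fields terminated by ','
    > select * from test1;

One caveat: INSERT OVERWRITE ... DIRECTORY replaces the contents of the target directory, so point it at a directory that contains nothing you want to keep.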