hive top n (order by與sort by區別)
我想說的SELECT TOP N是取最大前N條或者最小前N條。
Hive提供了limit關鍵字,再配合order by可以很容易地實現SELECT
TOP N。但是在Hive中order by只能使用1個reduce,如果表的資料量很大,那麼order
by就會力不從心。例如我們執行SQL:select a from ljntest01 order by a limit
10;
控制檯會打印出:Number of reduce tasks determined at compile time: 1
說明啟動的reduce數量是編譯時確定的。檢視該SQL的執行計劃,該SQL只啟動1個JOB。
假設資料表有
select a from ljntest01 sort by a limit 10;
首先執行該SQL控制檯打印出:Number of reduce tasks not specified. Estimated from input data size: 1
說明reduce數不是編譯時確定的,而是根據輸入檔案大小動態確定的。此外檢視該SQL的執行計劃:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
ljntest01
TableScan
alias: ljntest01
Select Operator
expressions:
expr: a
type: int
outputColumnNames: _col0
Reduce Output Operator
key expressions:
expr: _col0
type: int
sort order: +
tag: -1
value expressions:
expr: _col0
type: int
Reduce Operator Tree:
Extract
Limit
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
hdfs://hdpnn:9000/group/alidw-cbu/tmp/hive-admin/hive_2012-12-16_01-19-42_893_2878471909568139281/-mr-10002
Reduce Output Operator
key expressions:
expr: _col0
type: int
sort order: +
tag: -1
value expressions:
expr: _col0
type: int
Reduce Operator Tree:
Extract
Limit
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: 10
sort by可以啟動多個reduce,每個reduce做區域性排序,但是這對於sort by limit N已經夠用了。從執行計劃中可以看出sort by limit N啟動了兩個JOB。第一個JOB是在每個reduce中做區域性排序,然後分別取TOP N。假設啟動了M個reduce,第二個JOB再對M個reduce分別區域性排好序的總計M * N條資料做全域性排序,取TOP N,從而得到想要的結果。這樣就可以大大提高SELECT TOP N的效率。