hive 底層模組實現-distinct
阿新 • • 發佈:2019-01-22
準備資料
語句
SELECT COUNT, COUNT(DISTINCT uid) FROM logs GROUP BY COUNT;
hive> SELECT * FROM logs;
OK
a 蘋果 3
a 橙子 3
a 燒雞 1
b 燒雞 3
hive> SELECT COUNT, COUNT(DISTINCT uid) FROM logs GROUP BY COUNT;
根據count分組,計算獨立使用者數。
計算過程
預設設定了hive.map.aggr=true,所以會在mapper端先group by一次,最後再把結果merge起來,為了減少reducer處理的資料量。注意看explain的mode是不一樣的。mapper是hash,reducer是mergepartial。如果把hive.map.aggr=false,那將groupby放到reducer才做,他的mode是complete.
Operator
Explain
hive> explain SELECT uid, sum(count) FROM logs group by uid;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME logs))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL uid)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_TABLE_OR_COL count)))) (TOK_GROUPBY (TOK_TABLE_OR_COL uid))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
logs
TableScan // 掃描表
alias: logs
Select Operator //選擇欄位
expressions:
expr: uid
type: string
expr: count
type: int
outputColumnNames: uid, count
Group By Operator //這裡是因為預設設定了hive.map.aggr=true,會在mapper先做一次聚合,減少reduce需要處理的資料
aggregations:
expr: sum(count) //聚集函式
bucketGroup: false
keys: //鍵
expr: uid
type: string
mode: hash //hash方式,processHashAggr()
outputColumnNames: _col0, _col1
Reduce Output Operator //輸出key,value給reducer
key expressions:
expr: _col0
type: string
sort order: +
Map-reduce partition columns:
expr: _col0
type: string
tag: -1
value expressions:
expr: _col1
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: sum(VALUE._col0)
//聚合
bucketGroup: false
keys:
expr: KEY._col0
type: string
mode: mergepartial //合併值
outputColumnNames: _col0, _col1
Select Operator //選擇欄位
expressions:
expr: _col0
type: string
expr: _col1
type: bigint
outputColumnNames: _col0, _col1
File Output Operator //輸出到檔案
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1