Hive low-level module implementation: DISTINCT

Preparing the data

Statement

SELECT COUNT, COUNT(DISTINCT uid) FROM logs GROUP BY COUNT;
hive> SELECT * FROM logs;
OK
a   蘋果  3
a   橙子  3
a   燒雞  1
b   燒雞  3

hive> SELECT COUNT, COUNT(DISTINCT uid) FROM logs GROUP BY COUNT;

Group the rows by count and compute the number of distinct users (uid) in each group.
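To make the semantics of the query concrete, here is a minimal Python sketch that reproduces the same grouping over the four sample rows. It only emulates what the statement computes and is not how Hive executes it. The function name is made up for illustration, and the middle column is unnamed in the listing, so "item" is a placeholder.

```python
from collections import defaultdict

# Sample rows from the logs table above, as (uid, item, count) tuples.
# The middle column's name is not given in the post; "item" is a placeholder.
ROWS = [
    ("a", "蘋果", 3),
    ("a", "橙子", 3),
    ("a", "燒雞", 1),
    ("b", "燒雞", 3),
]

def count_distinct_uid_by_count(rows):
    """Emulate SELECT count, COUNT(DISTINCT uid) ... GROUP BY count."""
    uids_per_group = defaultdict(set)
    for uid, _item, count in rows:
        # A set keeps each uid once per group, which is what DISTINCT asks for.
        uids_per_group[count].add(uid)
    return {count: len(uids) for count, uids in uids_per_group.items()}

result = count_distinct_uid_by_count(ROWS)
print(sorted(result.items()))  # [(1, 1), (3, 2)]
```

For the sample data, the group count=3 contains users a and b (two distinct users), and the group count=1 contains only user a.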

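Computation process

[Figure: Hive DISTINCT calculation]

hive.map.aggr=true is set by default, so a GROUP BY is first performed on the mapper side and the partial results are merged afterwards, which reduces the amount of data the reducers have to process. Note that the mode shown by EXPLAIN differs between the two sides: the mapper uses hash and the reducer uses mergepartial. If you set hive.map.aggr=false, the GROUP BY is performed only in the reducer, and its mode is complete.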
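The difference between the three modes can be sketched in plain Python. This is a simulation under simplifying assumptions, not Hive's actual code: it uses the sum(count) GROUP BY uid query from the EXPLAIN section, a single reducer, and a hypothetical split of the four sample rows across two mappers. All function names are invented for illustration.

```python
from collections import defaultdict

# (uid, count) pairs taken from the sample logs rows.
ROWS = [("a", 3), ("a", 3), ("a", 1), ("b", 3)]

def map_side_hash(split):
    """mode: hash -- partial sum per uid inside one mapper's split."""
    partial = defaultdict(int)
    for uid, count in split:
        partial[uid] += count
    return dict(partial)

def reduce_merge_partial(partials):
    """mode: mergepartial -- merge the partial sums emitted by the mappers."""
    merged = defaultdict(int)
    for partial in partials:
        for uid, subtotal in partial.items():
            merged[uid] += subtotal
    return dict(merged)

def reduce_complete(rows):
    """mode: complete -- no map-side aggregation; the reducer sums raw rows."""
    total = defaultdict(int)
    for uid, count in rows:
        total[uid] += count
    return dict(total)

# Hypothetical split of the input across two mappers.
splits = [ROWS[:2], ROWS[2:]]
partials = [map_side_hash(s) for s in splits]
two_phase = reduce_merge_partial(partials)
single_phase = reduce_complete(ROWS)

print(partials)      # [{'a': 6}, {'a': 1, 'b': 3}]
print(two_phase)     # {'a': 7, 'b': 3}
print(single_phase)  # {'a': 7, 'b': 3}
```

Both paths give the same totals. With map-side aggregation the reducer receives three partial records instead of four raw rows, and the saving grows as more rows share a key within a split.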

Operator

[Figure: Hive DISTINCT operators]

Explain

hive> explain SELECT uid, sum(count) FROM logs group by uid;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME logs))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL uid)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_TABLE_OR_COL count)))) (TOK_GROUPBY (TOK_TABLE_OR_COL uid))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        logs
          TableScan // scan the table
            alias: logs
            Select Operator // select the columns
              expressions:
                    expr: uid
                    type: string
                    expr: count
                    type: int
              outputColumnNames: uid, count
              Group By Operator // present because hive.map.aggr=true by default: aggregate once in the mapper to reduce the data the reducers must process
                aggregations:
                      expr: sum(count) // aggregate function
                bucketGroup: false
                keys: // keys
                      expr: uid
                      type: string
                mode: hash // hash mode, processHashAggr()
                outputColumnNames: _col0, _col1
                Reduce Output Operator // emit key, value to the reducer
                  key expressions:
                        expr: _col0
                        type: string
                  sort order: +
                  Map-reduce partition columns:
                        expr: _col0
                        type: string
                  tag: -1
                  value expressions:
                        expr: _col1
                        type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(VALUE._col0) // aggregate
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial // merge the partial values
          outputColumnNames: _col0, _col1
          Select Operator // select the columns
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator // write to file
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1