Hive's MR and basic MapReduce design patterns
阿新 • Published: 2017-08-24
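(Original article; please do not repost.)
In Hive, prefix a select query with explain or explain extended to see a brief description of how the MapReduce job will execute. The explain output looks like this: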
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:   --map tree
          TableScan
            alias: testtb
            Statistics: Num rows: 0 Data size: 86 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: zd1 (type: string), zd2 (type: string), zd3 (type: string)
              outputColumnNames: zd1, zd2, zd3
              Statistics: Num rows: 0 Data size: 86 Basic stats: PARTIAL Column stats: NONE
              Group By Operator
                aggregations: sum(zd3)
                keys: zd1 (type: string), zd2 (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 0 Data size: 86 Basic stats: PARTIAL Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string)
                  sort order: ++
                  Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
                  Statistics: Num rows: 0 Data size: 86 Basic stats: PARTIAL Column stats: NONE
                  value expressions: _col2 (type: double)
      Reduce Operator Tree:   --reduce tree
        Group By Operator
          aggregations: sum(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: string), _col2 (type: double)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
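This output can be used to analyze the MapReduce process. The plan above is the MR process for grouping by zd1 and zd2 and computing sum(zd3):
The group by fields are used directly as the key. By default Hive first aggregates on the map side (hive.map.aggr=true), where the Group By Operator runs in mode hash; the reduce side then aggregates again, this time in mode mergepartial. If map-side aggregation is turned off with set hive.map.aggr=false, the reduce-side mode is complete.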
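The difference between the modes can be imitated outside Hive. The following is a minimal Python sketch, not Hive's implementation: the function names, the two input splits and their rows are all made up for illustration. Each map task keeps a hash table of partial sums per (zd1, zd2) key, the partial sums are shuffled, and the reducer merges them. With map-side aggregation off, the raw rows reach the reducer and the whole aggregation happens there.

```python
from collections import defaultdict

def map_side_hash_aggregate(rows):
    """Imitates 'mode: hash': one map task builds partial sums in a hash table."""
    partial = defaultdict(float)
    for zd1, zd2, zd3 in rows:
        partial[(zd1, zd2)] += float(zd3)  # zd3 is a string column, summed as double
    return list(partial.items())

def reduce_side_merge(partials):
    """Imitates 'mode: mergepartial': merges partial sums that share a key."""
    merged = defaultdict(float)
    for key, partial_sum in partials:
        merged[key] += partial_sum
    return dict(merged)

def reduce_side_complete(rows):
    """Imitates 'mode: complete': raw rows arrive and are aggregated in one step."""
    total = defaultdict(float)
    for zd1, zd2, zd3 in rows:
        total[(zd1, zd2)] += float(zd3)
    return dict(total)

# Hypothetical input splits, one per map task.
split_a = [("a", "x", "1"), ("a", "x", "2"), ("b", "y", "5")]
split_b = [("a", "x", "4"), ("b", "y", "0.5")]

# hive.map.aggr=true: only the partial sums are shuffled to the reducer.
shuffled = map_side_hash_aggregate(split_a) + map_side_hash_aggregate(split_b)
with_map_aggr = reduce_side_merge(shuffled)

# hive.map.aggr=false: every raw row is shuffled to the reducer.
without_map_aggr = reduce_side_complete(split_a + split_b)

print(len(shuffled), "partial records shuffled instead of", len(split_a + split_b))
print(with_map_aggr)
```

Both paths give the same sums; in this toy input the map-side pass only reduces the number of records that cross the shuffle from 5 to 4.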
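Basic MapReduce design patterns (reference: MapReduce Design Patterns, by Donald Miner and Adam Shook):
1. Grouped numerical aggregation. The map side emits the group by fields as the keys, and the values carry the data that is needed. The reduce side computes f(values) for each group of keys to produce the result.
2. Join. The map side emits the join fields as the keys and each record as the output value, tagging the record with a flag that identifies its source table. For each group of keys, the reduce side puts the records into one list per flag, then joins the records across the lists and outputs the result. An inner join requires every list to be non-empty; left, right and similar joins also produce output when the list for the other table is empty.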
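Both patterns can be tried on a small in-memory stand-in for the map, shuffle and reduce phases. This is a hedged sketch rather than Hadoop or Hive code: run_mapreduce, the mapper and reducer names, the "L"/"R" flags and the sample records are all assumptions introduced for illustration.

```python
from collections import defaultdict
from itertools import product

def run_mapreduce(records, mapper, reducer):
    """Tiny in-memory stand-in for map -> shuffle (group by key) -> reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

# Pattern 1: grouped numerical aggregation, with f(values) = sum(values).
def sum_mapper(record):
    zd1, zd2, zd3 = record
    yield (zd1, zd2), float(zd3)

def sum_reducer(key, values):
    yield key, sum(values)

# Pattern 2: reduce-side join on the hypothetical field "id".
def join_mapper(tagged_record):
    flag, record = tagged_record          # flag says which table the record is from
    yield record["id"], (flag, record)

def make_join_reducer(how="inner"):
    def join_reducer(key, values):
        left = [rec for flag, rec in values if flag == "L"]
        right = [rec for flag, rec in values if flag == "R"]
        if how == "left" and not right:
            right = [None]                # keep unmatched left records
        if how == "right" and not left:
            left = [None]                 # keep unmatched right records
        # If either list is still empty, product() yields nothing (inner join rule).
        for l_rec, r_rec in product(left, right):
            yield key, l_rec, r_rec
    return join_reducer

rows = [("a", "x", "1"), ("a", "x", "2"), ("b", "y", "5")]
sums = run_mapreduce(rows, sum_mapper, sum_reducer)

tagged = [
    ("L", {"id": 1, "item": "pen"}),
    ("L", {"id": 2, "item": "ink"}),
    ("R", {"id": 1, "name": "ann"}),
    ("R", {"id": 3, "name": "bob"}),
]
inner = run_mapreduce(tagged, join_mapper, make_join_reducer("inner"))
left = run_mapreduce(tagged, join_mapper, make_join_reducer("left"))

print(sums)
print(inner)
print(left)
```

In this sample the inner join emits only key 1, while the left join also emits key 2 with None on the right side; key 3 is dropped in both because it has no left-side record.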
(To be continued...)