Chapter 6: Queries [How Sorting Works]
阿新 • Published: 2022-01-29
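1. order by (global sort)
   1. Global sort: no matter how many reducers are configured, only one reducer is produced.
   2. On large datasets a global sort is very inefficient, because a single reducer has to sort all of the data.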
    -- 1. Set the number of partitions (reducers) to 3
    set mapreduce.job.reduces=3;

    -- 2. Run the SQL
    select name, num
    from (
        select '張飛' as name, 1 as num union all
        select '張飛' as name, 2 as num union all
        select '張飛' as name, 3 as num union all
        select '趙雲' as name, 1 as num union all
        select '趙雲' as name, 2 as num union all
        select '趙雲' as name, 3 as num
    ) as t1
    order by num desc;

    -- Result (whatever the reducer count is set to, only one reducer is produced)
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

    name  num
    趙雲  3
    張飛  3
    趙雲  2
    張飛  2
    趙雲  1
    張飛  1
2. sort by (sort within each partition)
   1. Sorts the output file produced by each reducer.
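   2. When distribute by is not specified, no column controls the partitioning: Hive spreads the rows across the numReduces reducers on its own (roughly evenly, for load balance), so rows with the same value may end up in different files.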
    -- Steps
    -- 1. Set the number of reducers
    set mapreduce.job.reduces=3;

    -- 2. Run the query and write the result to a local directory
    insert overwrite local directory '/root/sortby'
    select name, num
    from (
        select '張飛' as name, 1 as num union all
        select '張飛' as name, 2 as num union all
        select '張飛' as name, 3 as num union all
        select '趙雲' as name, 1 as num union all
        select '趙雲' as name, 2 as num union all
        select '趙雲' as name, 3 as num
    ) as t1
    sort by num desc;

    -- Result
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
    -- 3 reducers are produced; no partition column was specified

    -- Inspect the generated files (all three partition files contain data)
    -rw-r--r-- 1 root root 18 1月 29 11:06 000000_1
    -rw-r--r-- 1 root root 27 1月 29 11:06 000001_2
    -rw-r--r-- 1 root root  9 1月 29 11:06 000002_1

    name  num
    趙雲  3
    趙雲  2
    張飛  3
    張飛  2
    趙雲  1
    張飛  1
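The behaviour of sort by can be mimicked outside Hive. The Python sketch below is only an illustration, not Hive's implementation: the helper `simulate_sort_by` is hypothetical, and rows are assigned to reducers with a seeded `random.Random`, which is an assumption standing in for however Hive spreads rows when no distribute by is given. It shows that each reducer's output is sorted on its own, while the files taken together need not be in order.

```python
import random

def simulate_sort_by(rows, num_reducers, key, reverse=False, seed=0):
    """Hypothetical model of SORT BY: scatter rows, then sort inside each reducer."""
    rng = random.Random(seed)  # assumption: stand-in for Hive's row distribution
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[rng.randrange(num_reducers)].append(row)
    # Each reducer sorts only the rows it received.
    return [sorted(part, key=key, reverse=reverse) for part in reducers]

rows = [(name, num) for name in ("張飛", "趙雲") for num in (1, 2, 3)]
files = simulate_sort_by(rows, num_reducers=3, key=lambda r: r[1], reverse=True)
for index, part in enumerate(files):
    print(f"reducer {index}: {part}")
```

Each printed list corresponds to one output file. Reading the lists one after another generally does not give a globally sorted result, which is the difference from order by.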
3. distribute by (specify the partition column)
   1. Partition number = hash(partition column) % reduceNum
    insert overwrite local directory '/root/sortbykey'
    select name, num
    from (
        select '張飛' as name, 1 as num union all
        select '張飛' as name, 2 as num union all
        select '張飛' as name, 3 as num union all
        select '趙雲' as name, 1 as num union all
        select '趙雲' as name, 2 as num union all
        select '趙雲' as name, 3 as num
    ) as t1
    distribute by name sort by num desc;

    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3

    -rw-r--r-- 1 root root 27 1月 29 11:12 000000_1
    -rw-r--r-- 1 root root  0 1月 29 11:12 000001_2 (empty file)
    -rw-r--r-- 1 root root 27 1月 29 11:12 000002_0

    趙雲  3
    趙雲  2
    趙雲  1
    張飛  3
    張飛  2
    張飛  1
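The partition formula above can be tried out in a small Python sketch. It is an illustration under stated assumptions: `partition_of` and `simulate_distribute_by` are hypothetical helpers, and `zlib.crc32` is only a deterministic stand-in for Hive's own hash function, so the partition numbers will not necessarily match the file names in the listing. What carries over is the structure: all rows with the same name go to one reducer, and with two distinct names and three reducers at least one output file stays empty.

```python
import zlib

def partition_of(value, num_reducers):
    """Partition number = hash(partition column) % number of reducers."""
    # Assumption: CRC32 replaces Hive's hash function for illustration only.
    return zlib.crc32(str(value).encode("utf-8")) % num_reducers

def simulate_distribute_by(rows, num_reducers, dist_key, sort_key, reverse=False):
    """Hypothetical model of DISTRIBUTE BY ... SORT BY ..."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[partition_of(dist_key(row), num_reducers)].append(row)
    # As with sort by, every reducer sorts only its own rows.
    return [sorted(part, key=sort_key, reverse=reverse) for part in reducers]

rows = [(name, num) for name in ("張飛", "趙雲") for num in (1, 2, 3)]
files = simulate_distribute_by(
    rows, 3, dist_key=lambda r: r[0], sort_key=lambda r: r[1], reverse=True
)
for index, part in enumerate(files):
    print(f"reducer {index}: {part}")
```

Two different names can still hash to the same reducer, so a single file may hold both names. The guarantee only runs the other way: one name never spans several files.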
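4. cluster by (sort within each partition)
   1. When the distribute by and sort by columns are the same, cluster by can be used instead. It always sorts in ascending order; a sort direction (desc or asc) cannot be specified.
   2. cluster by name is equivalent to distribute by name sort by name asc;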
    insert overwrite local directory '/root/clusterby2'
    select name, num
    from (
        select '張飛' as name, 1 as num union all
        select '張飛' as name, 2 as num union all
        select '張飛' as name, 3 as num union all
        select '趙雲' as name, 1 as num union all
        select '趙雲' as name, 2 as num union all
        select '趙雲' as name, 3 as num
    ) as t1
    cluster by name;

    name  num
    趙雲  3
    趙雲  2
    趙雲  1
    張飛  3
    張飛  2
    張飛  1
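The relationship between cluster by and the distribute by plus sort by pair can also be written down as a small Python sketch. The helpers `distribute_and_sort` and `cluster_by` are hypothetical, and `zlib.crc32` is again an assumed stand-in for Hive's hash function. The sketch only models the idea that one column serves both as the partition key and as the ascending sort key.

```python
import zlib

def distribute_and_sort(rows, num_reducers, dist_key, sort_key):
    """Hypothetical model of DISTRIBUTE BY dist_key SORT BY sort_key ASC."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: CRC32 stands in for Hive's hash function.
        digest = zlib.crc32(str(dist_key(row)).encode("utf-8"))
        reducers[digest % num_reducers].append(row)
    return [sorted(part, key=sort_key) for part in reducers]

def cluster_by(rows, num_reducers, key):
    """CLUSTER BY key: one column both distributes and sorts, ascending only."""
    return distribute_and_sort(rows, num_reducers, dist_key=key, sort_key=key)

rows = [(name, num) for name in ("張飛", "趙雲") for num in (3, 1, 2)]
by_name = lambda r: r[0]
clustered = cluster_by(rows, 3, key=by_name)
explicit = distribute_and_sort(rows, 3, dist_key=by_name, sort_key=by_name)
print(clustered == explicit)  # True: both calls run the same steps
```

Only name is sorted here, so the order of num within one name is simply whatever order the rows arrived in. Python's `sorted` happens to be stable; that should not be read as a guarantee about Hive's output.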
5. Link on MapReduce sorting: https://www.cnblogs.com/bajiaotai/p/15734910.html