Chapter 6: Queries [How Sorting Works]

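1. order by (global sort)
1. A global sort: no matter how many reducers are configured, only one reducer is used.
2. On large data sets, a global sort is very inefficient.
-- 1. Set the number of reducers (partitions) to 3
set mapreduce.job.reduces=3;

-- 2. Run the SQL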
select name
      ,num
from (
    select '張飛' as name,1 as num
    union all
    select '張飛' as name,2 as num
    union all
    select '張飛' as name,3 as num
    union all
    
    select '趙雲' as name,1 as num
    union all
    select '趙雲' as name,2 as num
    union all
    select '趙雲' as name,3 as num
) as t1
order by num desc ;

-- Result (no matter what the reducer count is set to, only one reducer is used)
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

name    num
趙雲    3
張飛    3
趙雲    2
張飛    2
趙雲    1
張飛    1
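2. sort by (sort within each partition)
1. Sorts the rows within the file produced by each reducer; the result is not globally ordered.
2. When distribute by is not specified, rows are partitioned by (hash of all fields in the row) % numReduces.
 -- Steps
 -- 1. Set the number of reducers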
 set mapreduce.job.reduces=3;

insert overwrite local directory '/root/sortby'
select name
      ,num
from (
    select '張飛' as name,1 as num
    union all
    select '張飛' as name,2 as num
    union all
    select '張飛' as name,3 as num
    union all
    select '趙雲' as name,1 as num
    union all
    select '趙雲' as name,2 as num
    union all
    select '趙雲' as name,3 as num
) as t1
sort by num desc ;

-- Result
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
-- 3 reducers are used, and rows are partitioned by all fields of the row

-- Inspect the generated files (all three partition files contain data)
-rw-r--r-- 1 root root 18 1月  29 11:06 000000_1
-rw-r--r-- 1 root root 27 1月  29 11:06 000001_2
-rw-r--r-- 1 root root  9 1月  29 11:06 000002_1

name    num
趙雲    3
趙雲    2
張飛    3
張飛    2
趙雲    1
張飛    1
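
The sort by behaviour can be sketched outside Hive. The Python snippet below is only a simulation under stated assumptions: it follows the partitioning rule as this post describes it (hash of all fields modulo the reducer count), and it uses `zlib.crc32` as a stand-in hash, which is not Hive's hash function. The helper names `partition_of` and `sort_by_num_desc` are hypothetical, and the partition each row lands in will not match the files listed above.

```python
import zlib


def partition_of(row, num_reducers):
    """Pick a partition from a stand-in hash over all fields of the row.

    Assumption: crc32 replaces Hive's real hash, so the partition numbers
    are illustrative only.
    """
    key = "\x01".join(str(field) for field in row)
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def sort_by_num_desc(rows, num_reducers):
    """Simulate `sort by num desc`: each reducer sorts only its own rows."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        partitions[partition_of(row, num_reducers)].append(row)
    return [sorted(p, key=lambda r: r[1], reverse=True) for p in partitions]


rows = [(name, num) for name in ("張飛", "趙雲") for num in (1, 2, 3)]
for reducer_id, part in enumerate(sort_by_num_desc(rows, 3)):
    print(reducer_id, part)
```

Each printed list is ordered by num descending, but reading the partitions one after another generally does not give a globally sorted result, which is the difference from order by.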
3. distribute by (specify the partitioning field)
1. Partition number = hash(partitioning field) % reduceNum
insert overwrite local directory '/root/sortbykey'
select name
      ,num
from (
    select '張飛' as name,1 as num
    union all
    select '張飛' as name,2 as num
    union all
    select '張飛' as name,3 as num
    union all
    select '趙雲' as name,1 as num
    union all
    select '趙雲' as name,2 as num
    union all
    select '趙雲' as name,3 as num
) as t1
distribute by name
sort by num desc ;

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3


-rw-r--r-- 1 root root 27 1月  29 11:12 000000_1
-rw-r--r-- 1 root root  0 1月  29 11:12 000001_2 (empty file)
-rw-r--r-- 1 root root 27 1月  29 11:12 000002_0

趙雲3
趙雲2
趙雲1
張飛3
張飛2
張飛1
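
The rule "partition number = hash(field) % reduceNum" also explains the empty file: with only two distinct names and three reducers, at least one reducer receives no rows. Below is a minimal Python sketch of that rule, assuming `zlib.crc32` as a stand-in for Hive's hash function (so the actual partition numbers will differ from the listing above). The helper names `partition_id` and `distribute_and_sort` are hypothetical.

```python
import zlib


def partition_id(key, num_reducers):
    """Stand-in for hash(key) % reduceNum; crc32 is NOT Hive's hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def distribute_and_sort(rows, num_reducers):
    """Simulate `distribute by name sort by num desc`."""
    partitions = {i: [] for i in range(num_reducers)}
    for name, num in rows:
        # Rows with the same name always go to the same partition.
        partitions[partition_id(name, num_reducers)].append((name, num))
    for part in partitions.values():
        # Each reducer sorts its own rows by num, descending.
        part.sort(key=lambda r: r[1], reverse=True)
    return partitions


rows = [(name, num) for name in ("張飛", "趙雲") for num in (1, 2, 3)]
for reducer_id, part in distribute_and_sort(rows, 3).items():
    print(reducer_id, part)
```

Whichever partitions the two names hash to, every row for a given name ends up in one place, and at least one of the three partitions stays empty.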
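4. cluster by (sort within each partition)
1. When the distribute by and sort by fields are the same, cluster by can be used instead.
The sort is ascending by default; a sort direction (desc or asc) cannot be specified.
2. cluster by name is equivalent to distribute by name sort by name asc ;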
insert overwrite local directory '/root/clusterby2'
select name
      ,num
from (
    select '張飛' as name,1 as num
    union all
    select '張飛' as name,2 as num
    union all
    select '張飛' as name,3 as num
    union all
    select '趙雲' as name,1 as num
    union all
    select '趙雲' as name,2 as num
    union all
    select '趙雲' as name,3 as num
) as t1
cluster by name ;

name    num
趙雲    3
趙雲    2
趙雲    1
張飛    3
張飛    2
張飛    1
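
The equivalence between cluster by and distribute by plus sort by on the same field can be illustrated with a small Python simulation. This is a sketch under assumptions, not Hive's implementation: `zlib.crc32` stands in for Hive's hash, and the helpers `distribute_by`, `sort_by`, and `cluster_by` are hypothetical names that mirror the clauses.

```python
import zlib


def partition_id(key, num_reducers):
    """Stand-in for hash(key) % reduceNum; crc32 is NOT Hive's hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def distribute_by(rows, key_fn, num_reducers):
    """Route each row to a partition chosen from its key."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        partitions[partition_id(key_fn(row), num_reducers)].append(row)
    return partitions


def sort_by(partitions, key_fn):
    """Sort every partition independently, ascending by key."""
    return [sorted(p, key=key_fn) for p in partitions]


def cluster_by(rows, key_fn, num_reducers):
    """cluster by k behaves like distribute by k followed by sort by k asc."""
    return sort_by(distribute_by(rows, key_fn, num_reducers), key_fn)


def by_name(row):
    return row[0]


rows = [("趙雲", 1), ("張飛", 1), ("趙雲", 2), ("張飛", 2), ("趙雲", 3), ("張飛", 3)]
for reducer_id, part in enumerate(cluster_by(rows, by_name, 3)):
    print(reducer_id, part)
```

Only the clustering key is ordered inside a partition. The order of num within one name is not something cluster by controls, so the 3, 2, 1 ordering in the result above should not be relied on.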
5. Link on MapReduce sorting: https://www.cnblogs.com/bajiaotai/p/15734910.html