1. 程式人生 > >hive新功能 Cube, Rollup介紹

hive新功能 Cube, Rollup介紹

說明:hive之cube、rollup,還有視窗函式,在傳統關係型資料(oracle、sqlserver)中都是有的,用法都很相似。

GROUPING SETS

GROUPING SETS作為GROUP BY的子句,允許開發人員在GROUP BY語句後面指定多個統計選項,可以簡單理解為多條group by語句通過union all把查詢結果聚合起來結合起來,下面是幾個例項可以幫助我們瞭解,

以acorn_3g.test_xinyan_reg為例:

[[email protected] xjob]$ hive -e "use acorn_3g;desc test_xinyan_reg;"
user_id                 bigint                  None                 device_id               int                     None   手機,平板              os_id                   int                     None   作業系統型別              app_id                  int                     None   手機app_id              client_version          string                  None   客戶端版本              from_id                 int
                    None  四級渠道

幾個demo幫助大家瞭解:

grouping sets語句 等價hive語句
select device_id,os_id,app_id,count(user_id) from  test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id)) 
SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id
select device_id,os_id,app_id,count(user_id) from  test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id)) SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id
select device_id,os_id,app_id,count(user_id) from  test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id,os_id),(device_id)) SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id 
UNION ALL 
SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id
select device_id,os_id,app_id,count(user_id) from  test_xinyan_reg group by device_id,os_id,app_id grouping sets((device_id),(os_id),(device_id,os_id),()) SELECT device_id,null,null,count(user_id) FROM test_xinyan_reg group by device_id 
UNION ALL 
SELECT null,os_id,null,count(user_id) FROM test_xinyan_reg group by os_id 
UNION ALL 
SELECT device_id,os_id,null,count(user_id) FROM test_xinyan_reg group by device_id,os_id  
UNION ALL 
SELECT null,null,null,count(user_id) FROM test_xinyan_reg

CUBE函式

cube簡稱資料魔方,可以實現hive多個任意維度的查詢,cube(a,b,c)則首先會對(a,b,c)進行group by,然後依次是(a,b),(a,c),(a),(b,c),(b),(c),最後在對全表進行group by,他會統計所選列中值的所有組合的聚合

select device_id,os_id,app_id,client_version,from_id,count(user_id) 
from test_xinyan_reg 
group by device_id,os_id,app_id,client_version,from_id with cube;

手工實現需要寫的hql語句(寫個程式自己生成的,手寫累死):

SELECT device_id,null,null,null,null ,count(user_id) FROM test_xinyan_reg group by device_id
UNION ALL
SELECT null,os_id,null,null,null ,count(user_id) FROM test_xinyan_reg group by os_id
UNION ALL
SELECT device_id,os_id,null,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id
UNION ALL
SELECT null,null,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by app_id
UNION ALL
SELECT device_id,null,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,app_id
UNION ALL
SELECT null,os_id,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by os_id,app_id
UNION ALL
SELECT device_id,os_id,app_id,null,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id
UNION ALL
SELECT null,null,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by client_version
UNION ALL
SELECT device_id,null,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,client_version
UNION ALL
SELECT null,os_id,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by os_id,client_version
UNION ALL
SELECT device_id,os_id,null,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,client_version
UNION ALL
SELECT null,null,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by app_id,client_version
UNION ALL
SELECT device_id,null,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,client_version
UNION ALL
SELECT null,os_id,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,client_version
UNION ALL
SELECT device_id,os_id,app_id,client_version,null ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,client_version
UNION ALL
SELECT null,null,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by from_id
UNION ALL
SELECT device_id,null,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,from_id
UNION ALL
SELECT null,os_id,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,from_id
UNION ALL
SELECT device_id,os_id,null,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,from_id
UNION ALL
SELECT null,null,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by app_id,from_id
UNION ALL
SELECT device_id,null,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,from_id
UNION ALL
SELECT null,os_id,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,from_id
UNION ALL
SELECT device_id,os_id,app_id,null,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,from_id
UNION ALL
SELECT null,null,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by client_version,from_id
UNION ALL
SELECT device_id,null,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,client_version,from_id
UNION ALL
SELECT null,os_id,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,client_version,from_id
UNION ALL
SELECT device_id,os_id,null,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,client_version,from_id
UNION ALL
SELECT null,null,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by app_id,client_version,from_id
UNION ALL
SELECT device_id,null,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,app_id,client_version,from_id
UNION ALL
SELECT null,os_id,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by os_id,app_id,client_version,from_id
UNION ALL
SELECT device_id,os_id,app_id,client_version,from_id ,count(user_id) FROM test_xinyan_reg group by device_id,os_id,app_id,client_version,from_id
UNION ALL
SELECT null,null,null,null,null ,count(user_id) FROM test_xinyan_reg

看著很蛋疼是不是,體會到cube的強大了嗎!(低版本hive可以通過union all方式解決,算是沒有辦法的辦法)

ROLL UP函式

rollup可以實現從右到做遞減多級的統計,顯示統計某一層次結構的聚合。

 select device_id,os_id,app_id,client_version,from_id,count(user_id) 
from test_xinyan_reg 
group by device_id,os_id,app_id,client_version,from_id with rollup;

等價以下sql語句:

 select device_id,os_id,app_id,client_version,from_id,count(user_id) 
from test_xinyan_reg 
group by device_id,os_id,app_id,client_version,from_id 
grouping sets ((device_id,os_id,app_id,client_version,from_id),(device_id,os_id,app_id,client_version),(device_id,os_id,app_id),(device_id,os_id),(device_id),());

Grouping_ID函式

當我們沒有統計某一列時,它的值顯示為null,這可能與列本身就有null值衝突,這就需要一種方法區分是沒有統計還是值本來就是null。(寫一個排列組合的演算法,就馬上理解了,grouping_id其實就是所統計各列二進位制和)

直接拿官方文件一個例子,O(∩_∩)O哈哈~

Column1 (key) Column2 (value)
1 NULL
1
1
2
2
3
3
3
NULL
4
5

hql統計:

  SELECT key, value, GROUPING__ID, count(*) from T1 GROUP BY key, value WITH ROLLUP

統計結果如下:

NULL NULL 0     00 6
1 NULL 1     10 2
1 NULL 3     11 1
1 1 3     11 1
2 NULL 1     10 1
2 2 3     11 1
3 NULL 1     10 2
3 NULL 3     11 1
3 3 3     11 1
4 NULL 1     10 1
4 5 3     11 1

GROUPING__ID轉變為二進位制,如果對應位上有值為null,說明這列本身值就是null。(通過類DataFilterNull.py 掃描,可以篩選過濾掉列中null、“”統計結果),

視窗函式

hive視窗函式,感覺大部分都是在模仿oracle,有對oracle熟悉的,應該看下就知道怎麼用。

主要圍繞..over( partitoin by ..) ..

3g業務求新增啟用時候,有的一部手機,可能註冊多個渠道,這時候就要按時間順序求第一個:

select f.udid,f.from_id,f.ins_date 
from (select /* +MAPJOIN(u) */ u.device_id as udid ,g.device_id as gdid,u.from_id,u.ins_date,row_number() over (partition by u.device_id order by u.ins_date asc) as row_number 
        from user_device_info u  
        left outer join  (select device_id from 3g_device_id where log_date<'2013-07-25') g  on ( u.device_id = g.device_id ) 
        where u.log_date='2013-07-25' and u.device_id is not null and u.device_id <> '') f  
where f.gdid is null and row_number=1

參考資料