Hive Window Functions (with Hands-On Examples)
阿新 • Published: 2018-12-04
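Overview and summary of window functions:
1. When should you use a window function? Window functions are usually combined with aggregate functions. Aggregation normally returns fewer rows than it reads, but sometimes we want to show the original detail rows and the aggregated value side by side. That is the case window functions handle, as in the following output: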
+-------+-------------+-------+---------------+--+
| name  | orderdate   | cost  | sum_window_0  |
+-------+-------------+-------+---------------+--+
| jack  | 2017-01-01  | 10    | 205           |
| jack  | 2017-01-08  | 55    | 205           |
| tony  | 2017-01-07  | 50    | 205           |
| jack  | 2017-01-05  | 46    | 205           |
| tony  | 2017-01-04  | 29    | 205           |
| tony  | 2017-01-02  | 15    | 205           |
| jack  | 2017-02-03  | 23    | 23            |
| mart  | 2017-04-13  | 94    | 341           |
| jack  | 2017-04-06  | 42    | 341           |
| mart  | 2017-04-11  | 75    | 341           |
| mart  | 2017-04-09  | 68    | 341           |
| mart  | 2017-04-08  | 62    | 341           |
| neil  | 2017-05-10  | 12    | 12            |
| neil  | 2017-06-12  | 80    | 80            |
+-------+-------------+-------+---------------+--+
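The difference between collapsing rows and keeping them can be reproduced outside Hive. The following is a minimal sketch, assuming Python's built-in sqlite3 module with SQLite 3.25 or later (the first version with window functions) as a stand-in for Hive; `substr(orderdate, 6, 2)` is used in place of Hive's `month()`, and the variable names `grouped` and `windowed` are illustrative only.

```python
import sqlite3

# Sample rows copied from the business.txt file used in this article.
ROWS = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15),
    ("jack", "2017-02-03", 23), ("tony", "2017-01-04", 29),
    ("jack", "2017-01-05", 46), ("jack", "2017-04-06", 42),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55),
    ("mart", "2017-04-08", 62), ("mart", "2017-04-09", 68),
    ("neil", "2017-05-10", 12), ("mart", "2017-04-11", 75),
    ("neil", "2017-06-12", 80), ("mart", "2017-04-13", 94),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO business VALUES (?, ?, ?)", ROWS)

# GROUP BY: one output row per month, the detail rows are gone.
grouped = conn.execute(
    "SELECT substr(orderdate, 6, 2) AS m, SUM(cost) "
    "FROM business GROUP BY m ORDER BY m"
).fetchall()

# Window function: every detail row is kept and carries its month total.
windowed = conn.execute(
    "SELECT name, orderdate, cost, "
    "SUM(cost) OVER (PARTITION BY substr(orderdate, 6, 2)) AS month_total "
    "FROM business ORDER BY orderdate"
).fetchall()

print(len(grouped), "rows after GROUP BY")
print(len(windowed), "rows with the window function")
for row in windowed:
    print(row)
```

The grouped query returns one row per month, while the windowed query returns all 14 input rows, each with the total for its month attached.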
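2. Window function syntax:
UDAF() over (PARTITION BY col1, col2 ORDER BY col3 window_clause (ROWS BETWEEN .. AND ..)) AS column_alias
Note: both PARTITION BY and ORDER BY can be followed by one or more columns.
PARTITION BY clause: once PARTITION BY is specified, the aggregate function operates only on the rows of the current partition, which is somewhat similar to GROUP BY.
ORDER BY clause: sorts the rows within each partition. If ORDER BY is not followed by a ROWS BETWEEN .. AND .. clause, the aggregate runs from the start of the partition to the current row. A ROWS clause after ORDER BY further restricts the range the aggregate covers.
Window clause:
CURRENT ROW: the current row
n PRECEDING: the n rows before the current row
n FOLLOWING: the n rows after the current row
UNBOUNDED: the partition boundary; UNBOUNDED PRECEDING means from the first row, UNBOUNDED FOLLOWING means to the last row
The window clause dynamically narrows the range the aggregate covers. When it is omitted and ORDER BY is present, the default is from the start to the current row (Hive uses RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows that tie on the ORDER BY value are aggregated together). When ORDER BY is omitted as well, the window is the whole partition.
Notes:
(1) ORDER BY must come after PARTITION BY;
(2) ROWS must come after the ORDER BY clause;
(3) (partition by .. order by ..) can be replaced with (distribute by .. sort by ..)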
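To make the effect of the three window shapes concrete, here is a small sketch that compares them on jack's orders. It assumes Python's sqlite3 module (SQLite 3.25 or later) instead of Hive; for this query the two engines are expected to behave the same, but that is an assumption, and the column aliases `whole_partition`, `running_total`, and `moving_sum` are made up for illustration.

```python
import sqlite3

# Sample rows copied from the business.txt file used in this article.
ROWS = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15),
    ("jack", "2017-02-03", 23), ("tony", "2017-01-04", 29),
    ("jack", "2017-01-05", 46), ("jack", "2017-04-06", 42),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55),
    ("mart", "2017-04-08", 62), ("mart", "2017-04-09", 68),
    ("neil", "2017-05-10", 12), ("mart", "2017-04-11", 75),
    ("neil", "2017-06-12", 80), ("mart", "2017-04-13", 94),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO business VALUES (?, ?, ?)", ROWS)

QUERY = """
SELECT orderdate, cost,
       -- no ORDER BY, no window clause: the whole partition
       SUM(cost) OVER (PARTITION BY name) AS whole_partition,
       -- ORDER BY only: from the start of the partition to the current row
       SUM(cost) OVER (PARTITION BY name ORDER BY orderdate) AS running_total,
       -- explicit window clause: previous row, current row, next row
       SUM(cost) OVER (PARTITION BY name ORDER BY orderdate
                       ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS moving_sum
FROM business
WHERE name = 'jack'
ORDER BY orderdate
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The first and last rows have only one neighbour, so their `moving_sum` covers two rows instead of three.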
Hands-on example:
Preparing the data:
$ cat business.txt
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

Requirements:
(1) Find the customers who made a purchase in April 2017, and the total number of such customers
(2) Show each customer's purchase details together with the monthly purchase total
(3) In the same scenario, accumulate each customer's cost by date
(4) Find each customer's previous purchase date
(5) Find the orders in the first 20% of the time range

Create the Hive table and load the data:
create table business(name string,orderdate string,cost int) row format delimited fields terminated by ',';
load data local inpath '/opt/module/datas/business.txt' into table business;
0: jdbc:hive2://hadoop108:10000> select * from business;
+----------------+---------------------+----------------+--+
| business.name  | business.orderdate  | business.cost  |
+----------------+---------------------+----------------+--+
| jack           | 2017-01-01          | 10             |
| tony           | 2017-01-02          | 15             |
| jack           | 2017-02-03          | 23             |
| tony           | 2017-01-04          | 29             |
| jack           | 2017-01-05          | 46             |
| jack           | 2017-04-06          | 42             |
| tony           | 2017-01-07          | 50             |
| jack           | 2017-01-08          | 55             |
| mart           | 2017-04-08          | 62             |
| mart           | 2017-04-09          | 68             |
| neil           | 2017-05-10          | 12             |
| mart           | 2017-04-11          | 75             |
| neil           | 2017-06-12          | 80             |
| mart           | 2017-04-13          | 94             |
+----------------+---------------------+----------------+--+

(1) Find the customers who made a purchase in April 2017, and the total number of such customers:
Analysis: the April rows are:
| jack | 2017-04-06 | 42 |
| mart | 2017-04-08 | 62 |
| mart | 2017-04-09 | 68 |
| mart | 2017-04-11 | 75 |
| mart | 2017-04-13 | 94 |
The final output should look like this:
jack 2
mart 2
Let's first review aggregate functions. The aggregate below takes every row of the business table as its input:
select count(*) from business;
+------+--+
| _c0  |
+------+--+
| 14   |
+------+--+
The next aggregate works on grouped data. Its unit of input is the group: rows that belong to the same group are fed to the aggregate together.
select name,count(*) from business where orderdate like '2017-04%' group by name;
+-------+------+--+
| name  | _c1  |
+-------+------+--+
| jack  | 1    |
| mart  | 4    |
+-------+------+--+
This is clearly not the result we want. Each group is fed to the aggregate separately, so count(*) returns the number of April rows for each name. What we want is for these two result rows themselves to be the input of the aggregate. A window function can do this:
select name,count(*) over() from business where orderdate like '2017-04%' group by name;
+-------+-----------------+--+
| name  | count_window_0  |
+-------+-----------------+--+
| mart  | 2               |
| jack  | 2               |
+-------+-----------------+--+
Window functions are evaluated after GROUP BY, so count(*) over() counts the two grouped rows rather than the five April rows.

(2) Show each customer's purchase details together with the monthly purchase total
The total is per month, so the rows should be partitioned by month and rows from the same month fed to the aggregate together. Because we need both the original rows and the aggregated value, we use a window function. (All sample rows are from 2017; data spanning several years would also need the year in the partition key.)
select *,sum(cost) over(partition by month(orderdate)) from business;
+----------------+---------------------+----------------+---------------+--+
| business.name  | business.orderdate  | business.cost  | sum_window_0  |
+----------------+---------------------+----------------+---------------+--+
| jack           | 2017-01-01          | 10             | 205           |
| jack           | 2017-01-08          | 55             | 205           |
| tony           | 2017-01-07          | 50             | 205           |
| jack           | 2017-01-05          | 46             | 205           |
| tony           | 2017-01-04          | 29             | 205           |
| tony           | 2017-01-02          | 15             | 205           |
| jack           | 2017-02-03          | 23             | 23            |
| mart           | 2017-04-13          | 94             | 341           |
| jack           | 2017-04-06          | 42             | 341           |
| mart           | 2017-04-11          | 75             | 341           |
| mart           | 2017-04-09          | 68             | 341           |
| mart           | 2017-04-08          | 62             | 341           |
| neil           | 2017-05-10          | 12             | 12            |
| neil           | 2017-06-12          | 80             | 80            |
+----------------+---------------------+----------------+---------------+--+

(3) Accumulate each customer's cost by date
Each customer's cost has to be accumulated separately, so partition by name. To accumulate by date, sort by date so the costs can be added up row by row.
select *,sum(cost) over(partition by name order by orderdate) from business;
+----------------+---------------------+----------------+---------------+--+
| business.name  | business.orderdate  | business.cost  | sum_window_0  |
+----------------+---------------------+----------------+---------------+--+
| jack           | 2017-01-01          | 10             | 10            |
| jack           | 2017-01-05          | 46             | 56            |
| jack           | 2017-01-08          | 55             | 111           |
| jack           | 2017-02-03          | 23             | 134           |
| jack           | 2017-04-06          | 42             | 176           |
| mart           | 2017-04-08          | 62             | 62            |
| mart           | 2017-04-09          | 68             | 130           |
| mart           | 2017-04-11          | 75             | 205           |
| mart           | 2017-04-13          | 94             | 299           |
| neil           | 2017-05-10          | 12             | 12            |
| neil           | 2017-06-12          | 80             | 92            |
| tony           | 2017-01-02          | 15             | 15            |
| tony           | 2017-01-04          | 29             | 44            |
| tony           | 2017-01-07          | 50             | 94            |
+----------------+---------------------+----------------+---------------+--+
Besides the form above, the same result can be written with an explicit window clause:
select *,sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW) from business;
+----------------+---------------------+----------------+---------------+--+
| business.name  | business.orderdate  | business.cost  | sum_window_0  |
+----------------+---------------------+----------------+---------------+--+
| jack           | 2017-01-01          | 10             | 10            |
| jack           | 2017-01-05          | 46             | 56            |
| jack           | 2017-01-08          | 55             | 111           |
| jack           | 2017-02-03          | 23             | 134           |
| jack           | 2017-04-06          | 42             | 176           |
| mart           | 2017-04-08          | 62             | 62            |
| mart           | 2017-04-09          | 68             | 130           |
| mart           | 2017-04-11          | 75             | 205           |
| mart           | 2017-04-13          | 94             | 299           |
| neil           | 2017-05-10          | 12             | 12            |
| neil           | 2017-06-12          | 80             | 92            |
| tony           | 2017-01-02          | 15             | 15            |
| tony           | 2017-01-04          | 29             | 44            |
| tony           | 2017-01-07          | 50             | 94            |
+----------------+---------------------+----------------+---------------+--+

(4) Find each customer's previous purchase date
Partition by name and sort by date. The result should look roughly like this:
+----------------+---------------------+----------------+-------------+--+
| business.name  | business.orderdate  | business.cost  | date        |
+----------------+---------------------+----------------+-------------+--+
| jack           | 2017-01-01          | 10             | null        |
| jack           | 2017-01-05          | 46             | 2017-01-01  |
| jack           | 2017-01-08          | 55             | 2017-01-05  |
| jack           | 2017-02-03          | 23             | 2017-01-08  |
| jack           | 2017-04-06          | 42             | 2017-02-03  |
| tony           | 2017-01-02          | 15             | null        |
| tony           | 2017-01-04          | 29             | 2017-01-02  |
The query below uses lag; its third argument, '-1', is the default value returned when a row has no previous row, so -1 appears where the sketch above shows null.
select *,lag(orderdate,1,'-1') over(partition by name order by orderdate) from business;
+----------------+---------------------+----------------+---------------+--+
| business.name  | business.orderdate  | business.cost  | lag_window_0  |
+----------------+---------------------+----------------+---------------+--+
| jack           | 2017-01-01          | 10             | -1            |
| jack           | 2017-01-05          | 46             | 2017-01-01    |
| jack           | 2017-01-08          | 55             | 2017-01-05    |
| jack           | 2017-02-03          | 23             | 2017-01-08    |
| jack           | 2017-04-06          | 42             | 2017-02-03    |
| mart           | 2017-04-08          | 62             | -1            |
| mart           | 2017-04-09          | 68             | 2017-04-08    |
| mart           | 2017-04-11          | 75             | 2017-04-09    |
| mart           | 2017-04-13          | 94             | 2017-04-11    |
| neil           | 2017-05-10          | 12             | -1            |
| neil           | 2017-06-12          | 80             | 2017-05-10    |
| tony           | 2017-01-02          | 15             | -1            |
| tony           | 2017-01-04          | 29             | 2017-01-02    |
| tony           | 2017-01-07          | 50             | 2017-01-04    |
+----------------+---------------------+----------------+---------------+--+

(5) Find the orders in the first 20% of the time range
To take the first 20%, sort all rows by date and split them into five buckets with the NTILE window function; bucket 1 is then the first 20%.
t1:
select *,NTILE(5) over(order by orderdate) num from business ;
+----------------+---------------------+----------------+------+--+
| business.name  | business.orderdate  | business.cost  | num  |
+----------------+---------------------+----------------+------+--+
| jack           | 2017-01-01          | 10             | 1    |
| tony           | 2017-01-02          | 15             | 1    |
| tony           | 2017-01-04          | 29             | 1    |
| jack           | 2017-01-05          | 46             | 2    |
| tony           | 2017-01-07          | 50             | 2    |
| jack           | 2017-01-08          | 55             | 2    |
| jack           | 2017-02-03          | 23             | 3    |
| jack           | 2017-04-06          | 42             | 3    |
| mart           | 2017-04-08          | 62             | 3    |
| mart           | 2017-04-09          | 68             | 4    |
| mart           | 2017-04-11          | 75             | 4    |
| mart           | 2017-04-13          | 94             | 4    |
| neil           | 2017-05-10          | 12             | 5    |
| neil           | 2017-06-12          | 80             | 5    |
+----------------+---------------------+----------------+------+--+
select * from (select *,NTILE(5) over(order by orderdate) num from business ) t1 where num = 1;
+----------+---------------+----------+---------+--+
| t1.name  | t1.orderdate  | t1.cost  | t1.num  |
+----------+---------------+----------+---------+--+
| jack     | 2017-01-01    | 10       | 1       |
| tony     | 2017-01-02    | 15       | 1       |
| tony     | 2017-01-04    | 29       | 1       |
+----------+---------------+----------+---------+--+
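The bucket sizes in the NTILE(5) output (3, 3, 3, 3, 2) follow from how 14 rows are spread over 5 buckets. The sketch below is a plain Python re-implementation for illustration, not Hive's code; it assumes the usual rule that the leftover rows go to the earliest buckets, which is consistent with the output shown above. The function name `ntile` and its parameters are hypothetical.

```python
def ntile(n_rows, n_buckets):
    """Return the 1-based bucket number for each of n_rows ordered rows.

    Assumed rule: every bucket gets n_rows // n_buckets rows, and the
    first n_rows % n_buckets buckets get one extra row each.
    """
    base, extra = divmod(n_rows, n_buckets)
    labels = []
    for bucket in range(1, n_buckets + 1):
        size = base + (1 if bucket <= extra else 0)
        labels.extend([bucket] * size)
    return labels


# 14 orders split into 5 buckets, as in NTILE(5) over(order by orderdate).
labels = ntile(14, 5)
print(labels)

# Share of rows that "num = 1" actually selects.
first_bucket_share = labels.count(1) / len(labels)
print(round(first_bucket_share, 3))
```

Because 14 is not divisible by 5, filtering on num = 1 returns 3 of 14 rows, slightly more than 20%; NTILE gives an approximate split whenever the row count is not a multiple of the bucket count.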
Summary:
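① Understanding window functions requires a solid understanding of aggregate functions, and in particular of the set of rows an aggregate operates on. An aggregate with no qualifier operates on all rows. With GROUP BY, it aggregates the rows of each group. With PARTITION BY, it likewise works on the rows within each partition. A window clause then narrows that range further.
② To show both the pre-aggregation rows and the aggregated values, use a window function. With GROUP BY, the select list may contain only aggregate functions and the GROUP BY columns, so any other column cannot be shown; a window function removes that restriction.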
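Point ① can be checked on the sample data by running the same count(*) under different scopes. This is a sketch that assumes Python's sqlite3 module (SQLite 3.25 or later) as a stand-in for Hive; the variable names are illustrative, and the results are expected to match Hive for these simple queries, though that is an assumption rather than something verified against a Hive cluster.

```python
import sqlite3

# Sample rows copied from the business.txt file used in this article.
ROWS = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15),
    ("jack", "2017-02-03", 23), ("tony", "2017-01-04", 29),
    ("jack", "2017-01-05", 46), ("jack", "2017-04-06", 42),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55),
    ("mart", "2017-04-08", 62), ("mart", "2017-04-09", 68),
    ("neil", "2017-05-10", 12), ("mart", "2017-04-11", 75),
    ("neil", "2017-06-12", 80), ("mart", "2017-04-13", 94),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO business VALUES (?, ?, ?)", ROWS)

# Scope 1: no qualifier, the aggregate sees every row.
total = conn.execute("SELECT COUNT(*) FROM business").fetchone()[0]

# Scope 2: GROUP BY, the aggregate sees one group at a time.
per_group = conn.execute(
    "SELECT name, COUNT(*) FROM business "
    "WHERE orderdate LIKE '2017-04%' GROUP BY name ORDER BY name"
).fetchall()

# Scope 3: OVER () applied after GROUP BY, the window sees the grouped rows.
over_groups = conn.execute(
    "SELECT name, COUNT(*) OVER () FROM business "
    "WHERE orderdate LIKE '2017-04%' GROUP BY name ORDER BY name"
).fetchall()

# Scope 4: PARTITION BY, every detail row keeps its partition's count.
per_partition = conn.execute(
    "SELECT name, orderdate, COUNT(*) OVER (PARTITION BY name) "
    "FROM business ORDER BY name, orderdate"
).fetchall()

print(total)
print(per_group)
print(over_groups)
print(per_partition[0])
```

Scopes 2 and 3 differ only by the OVER () clause, yet the first counts rows inside each group while the second counts the groups themselves.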