Hive Window Functions (with Hands-On Examples)

Contents

Overview of window functions:

Hands-on examples:

Summary:


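Overview of window functions:

1. When should you use a window function? Window functions are usually combined with aggregate functions. An aggregation normally returns fewer rows than it reads, but sometimes we want to show the original detail rows alongside the aggregated value. Window functions exist for exactly this case. For example, the following result keeps every order row and adds the monthly total as an extra column: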

+-------+-------------+-------+---------------+--+
| name  |  orderdate  | cost  | sum_window_0  |
+-------+-------------+-------+---------------+--+
| jack  | 2017-01-01  | 10    | 205           |
| jack  | 2017-01-08  | 55    | 205           |
| tony  | 2017-01-07  | 50    | 205           |
| jack  | 2017-01-05  | 46    | 205           |
| tony  | 2017-01-04  | 29    | 205           |
| tony  | 2017-01-02  | 15    | 205           |
| jack  | 2017-02-03  | 23    | 23            |
| mart  | 2017-04-13  | 94    | 341           |
| jack  | 2017-04-06  | 42    | 341           |
| mart  | 2017-04-11  | 75    | 341           |
| mart  | 2017-04-09  | 68    | 341           |
| mart  | 2017-04-08  | 62    | 341           |
| neil  | 2017-05-10  | 12    | 12            |
| neil  | 2017-06-12  | 80    | 80            |
+-------+-------------+-------+---------------+--+

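2. Window function syntax:

UDAF() over (PARTITION BY col1, col2 ORDER BY col3 window_clause (rows between .. and ..)) AS column_alias

Note: both PARTITION BY and ORDER BY can be followed by more than one column.

partition by clause:
Once a partition by clause is specified, the aggregate function operates on each partition separately, which is somewhat similar to group by.

order by clause:
The order by clause sorts the rows within each partition. If order by is not followed by rows between ** and **, the aggregate covers the rows from the start of the partition through the current row (Hive's default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows that tie on the sort key are aggregated together). A rows clause after order by further restricts the range the aggregate function operates on.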

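Window clause:
CURRENT ROW: the current row
n PRECEDING: the n rows before the current row
n FOLLOWING: the n rows after the current row
UNBOUNDED: the boundary of the partition. UNBOUNDED PRECEDING means starting from the first row, and UNBOUNDED FOLLOWING means up to the last row
The window clause further narrows the range of rows that the aggregate function covers, and this range moves with the current row. When order by is present but no window clause is given, the default is to aggregate from the start through the current row; when neither is given, the aggregate covers the whole partition.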
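
To make the moving frame concrete, here is a minimal sketch of a three-row sliding sum (1 PRECEDING through 1 FOLLOWING). It is not run on Hive: it uses Python's built-in sqlite3 module as a stand-in, assuming SQLite 3.25 or later, which accepts the same window syntax for this query. The table holds only jack's five orders from the sample data in this article.

```python
import sqlite3

# jack's five orders from the sample data in this article.
jack_orders = [
    ("jack", "2017-01-01", 10),
    ("jack", "2017-01-05", 46),
    ("jack", "2017-01-08", 55),
    ("jack", "2017-02-03", 23),
    ("jack", "2017-04-06", 42),
]

# Assumption: an in-memory SQLite database stands in for Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO business VALUES (?, ?, ?)", jack_orders)

# For each row, sum the previous row, the current row and the next row.
moving = conn.execute(
    """
    SELECT orderdate, cost,
           SUM(cost) OVER (
               PARTITION BY name
               ORDER BY orderdate
               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
           ) AS moving_sum
    FROM business
    ORDER BY orderdate
    """
).fetchall()

for row in moving:
    print(row)
```

The first and last rows only have one neighbour, so their frames contain two rows; the sums come out as 56, 111, 124, 120 and 65.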

Notes:
(1) order by must come after partition by;
(2) rows must come after the order by clause;
(3) (partition by .. order by ..) can be replaced with (distribute by .. sort by ..)

Hands-on examples:

Preparing the data:
$ cat business.txt
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

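Requirements:
(1) Find the customers who made a purchase in April 2017, together with the total number of such customers
(2) Show each customer's purchase details together with the monthly purchase total
(3) In the scenario above, accumulate each customer's cost by date
(4) Show the date of each customer's previous purchase
(5) Show the orders in the earliest 20% by date

Create the Hive table and load the data: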

create table business(name string,orderdate string,cost int)
row format delimited
fields terminated by ',';

load data local inpath '/opt/module/datas/business.txt' into table business;

0: jdbc:hive2://hadoop108:10000> select * from business;
+----------------+---------------------+----------------+--+
| business.name  | business.orderdate  | business.cost  |
+----------------+---------------------+----------------+--+
| jack           | 2017-01-01          | 10             |
| tony           | 2017-01-02          | 15             |
| jack           | 2017-02-03          | 23             |
| tony           | 2017-01-04          | 29             |
| jack           | 2017-01-05          | 46             |
| jack           | 2017-04-06          | 42             |
| tony           | 2017-01-07          | 50             |
| jack           | 2017-01-08          | 55             |
| mart           | 2017-04-08          | 62             |
| mart           | 2017-04-09          | 68             |
| neil           | 2017-05-10          | 12             |
| mart           | 2017-04-11          | 75             |
| neil           | 2017-06-12          | 80             |
| mart           | 2017-04-13          | 94             |
+----------------+---------------------+----------------+--+


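(1) Find the customers who made a purchase in April 2017, together with the total number of such customers:
Analysis: the April rows are:
| jack           | 2017-04-06          | 42             |
| mart           | 2017-04-08          | 62             |
| mart           | 2017-04-09          | 68             |
| mart           | 2017-04-11          | 75             |
| mart           | 2017-04-13          | 94             |

The final output should look like this:
jack 2
mart 2

First, let's make sure we understand aggregate functions. The aggregate below takes every row of the business table as its input: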
select count(*) from business;
+------+--+
| _c0  |
+------+--+
| 14   |
+------+--+


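The next aggregate operates on grouped data, so its unit of work is the group: the rows that belong to the same group are passed to the aggregate function together.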
select name,count(*)
from business
where orderdate like '2017-04%'
group by name;

+-------+------+--+
| name  | _c1  |
+-------+------+--+
| jack  | 1    |
| mart  | 4    |
+-------+------+--+
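This is clearly not the result we want. Each group is passed to the aggregate function separately, so the query counts the rows (orders) within each name. What we want is to treat the two rows of the grouped result as the input of the aggregate. A window function does this, because over() is evaluated after group by, on the grouped rows: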

select name,count(*) over()
from business
where orderdate like '2017-04%'
group by name;

+-------+-----------------+--+
| name  | count_window_0  |
+-------+-----------------+--+
| mart  | 2               |
| jack  | 2               |
+-------+-----------------+--+
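
The evaluation order can be spelled out with a subquery: group first, then count the grouped rows with over(). The sketch below is an equivalent rewrite rather than the query used above, and it runs on Python's built-in sqlite3 module as a stand-in for Hive (an assumption; SQLite 3.25 or later is needed for window functions). The subquery alias april_customers is a name chosen for illustration.

```python
import sqlite3

# The 14 sample rows from business.txt.
orders = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15),
    ("jack", "2017-02-03", 23), ("tony", "2017-01-04", 29),
    ("jack", "2017-01-05", 46), ("jack", "2017-04-06", 42),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55),
    ("mart", "2017-04-08", 62), ("mart", "2017-04-09", 68),
    ("neil", "2017-05-10", 12), ("mart", "2017-04-11", 75),
    ("neil", "2017-06-12", 80), ("mart", "2017-04-13", 94),
]

# Assumption: an in-memory SQLite database stands in for Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO business VALUES (?, ?, ?)", orders)

# Inner query: one row per customer who bought in April 2017.
# Outer query: COUNT(*) OVER () sees all of those rows as one window.
april = conn.execute(
    """
    SELECT name, COUNT(*) OVER () AS customers
    FROM (
        SELECT name
        FROM business
        WHERE orderdate LIKE '2017-04%'
        GROUP BY name
    ) AS april_customers
    ORDER BY name
    """
).fetchall()

print(april)
```

Both customers are listed with the same count of 2, because the window spans the two grouped rows and not the five underlying April orders.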


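(2) Show each customer's purchase details together with the monthly purchase total
We need a monthly total, so the rows should be partitioned by month, with rows of the same month going into the aggregate together. Because we must show both the original rows and the aggregated value, we use a window function.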

select *,sum(cost) over(partition by month(orderdate)) 
from business;

+----------------+---------------------+----------------+---------------+--+
| business.name  | business.orderdate  | business.cost  | sum_window_0  |
+----------------+---------------------+----------------+---------------+--+
| jack           | 2017-01-01          | 10             | 205           |
| jack           | 2017-01-08          | 55             | 205           |
| tony           | 2017-01-07          | 50             | 205           |
| jack           | 2017-01-05          | 46             | 205           |
| tony           | 2017-01-04          | 29             | 205           |
| tony           | 2017-01-02          | 15             | 205           |
| jack           | 2017-02-03          | 23             | 23            |
| mart           | 2017-04-13          | 94             | 341           |
| jack           | 2017-04-06          | 42             | 341           |
| mart           | 2017-04-11          | 75             | 341           |
| mart           | 2017-04-09          | 68             | 341           |
| mart           | 2017-04-08          | 62             | 341           |
| neil           | 2017-05-10          | 12             | 12            |
| neil           | 2017-06-12          | 80             | 80            |
+----------------+---------------------+----------------+---------------+--+
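
One caveat: month(orderdate) yields only the month number, so it works here because all sample orders fall in 2017. If the data spanned several years, January 2017 and January 2018 would land in the same partition. A possible variant partitions by the yyyy-MM prefix, for example substr(orderdate, 1, 7). The sketch below illustrates the difference with Python's built-in sqlite3 module as a stand-in for Hive (an assumption; SQLite has no month(), so strftime('%m', ...) plays that role). The 2018 row is hypothetical and is not part of business.txt.

```python
import sqlite3

# The six January 2017 rows from the sample data, plus one
# HYPOTHETICAL January 2018 row added only for this illustration.
orders = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15),
    ("tony", "2017-01-04", 29), ("jack", "2017-01-05", 46),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55),
    ("jack", "2018-01-02", 100),  # hypothetical row
]

# Assumption: an in-memory SQLite database stands in for Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO business VALUES (?, ?, ?)", orders)

totals = conn.execute(
    """
    SELECT orderdate,
           -- month number only: both Januaries share one partition
           SUM(cost) OVER (PARTITION BY strftime('%m', orderdate)) AS by_month_number,
           -- year and month: each January gets its own partition
           SUM(cost) OVER (PARTITION BY substr(orderdate, 1, 7)) AS by_year_month
    FROM business
    ORDER BY orderdate
    """
).fetchall()

for row in totals:
    print(row)
```

Partitioning by month number mixes the two years into a single total of 305, while partitioning by the year-month prefix keeps 205 for 2017-01 and 100 for 2018-01.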

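(3) Accumulate each customer's cost by date
Each customer's cost is accumulated separately, so we partition by name. To accumulate by date, we sort by date so that the costs can be added up row by row.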

select *,sum(cost) over(partition by name  order by orderdate)
from business;

+----------------+---------------------+----------------+---------------+--+
| business.name  | business.orderdate  | business.cost  | sum_window_0  |
+----------------+---------------------+----------------+---------------+--+
| jack           | 2017-01-01          | 10             | 10            |
| jack           | 2017-01-05          | 46             | 56            |
| jack           | 2017-01-08          | 55             | 111           |
| jack           | 2017-02-03          | 23             | 134           |
| jack           | 2017-04-06          | 42             | 176           |
| mart           | 2017-04-08          | 62             | 62            |
| mart           | 2017-04-09          | 68             | 130           |
| mart           | 2017-04-11          | 75             | 205           |
| mart           | 2017-04-13          | 94             | 299           |
| neil           | 2017-05-10          | 12             | 12            |
| neil           | 2017-06-12          | 80             | 92            |
| tony           | 2017-01-02          | 15             | 15            |
| tony           | 2017-01-04          | 29             | 44            |
| tony           | 2017-01-07          | 50             | 94            |
+----------------+---------------------+----------------+---------------+--+

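Besides the approach above, we can write the window clause explicitly. The result is the same here because no customer has two orders on the same date: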
select *,sum(cost) over(partition by name order by orderdate
rows between UNBOUNDED PRECEDING and CURRENT ROW)
from business;
+----------------+---------------------+----------------+---------------+--+
| business.name  | business.orderdate  | business.cost  | sum_window_0  |
+----------------+---------------------+----------------+---------------+--+
| jack           | 2017-01-01          | 10             | 10            |
| jack           | 2017-01-05          | 46             | 56            |
| jack           | 2017-01-08          | 55             | 111           |
| jack           | 2017-02-03          | 23             | 134           |
| jack           | 2017-04-06          | 42             | 176           |
| mart           | 2017-04-08          | 62             | 62            |
| mart           | 2017-04-09          | 68             | 130           |
| mart           | 2017-04-11          | 75             | 205           |
| mart           | 2017-04-13          | 94             | 299           |
| neil           | 2017-05-10          | 12             | 12            |
| neil           | 2017-06-12          | 80             | 92            |
| tony           | 2017-01-02          | 15             | 15            |
| tony           | 2017-01-04          | 29             | 44            |
| tony           | 2017-01-07          | 50             | 94            |
+----------------+---------------------+----------------+---------------+--+
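
The two forms can differ once a customer has two orders on the same date. With order by and no window clause the default frame is a RANGE frame, which treats rows with equal sort keys as peers and gives them the same running total; an explicit ROWS frame advances one physical row at a time. The sketch below shows this with Python's built-in sqlite3 module as a stand-in for Hive (an assumption; SQLite 3.25 or later uses the same RANGE default). The duplicate-date row is hypothetical and is not part of business.txt.

```python
import sqlite3

# Two of jack's real orders plus one HYPOTHETICAL order that
# shares the date 2017-01-05, added only to create a tie.
orders = [
    ("jack", "2017-01-01", 10),
    ("jack", "2017-01-05", 46),
    ("jack", "2017-01-05", 4),  # hypothetical row
]

# Assumption: an in-memory SQLite database stands in for Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO business VALUES (?, ?, ?)", orders)

running = conn.execute(
    """
    SELECT orderdate, cost,
           SUM(cost) OVER (PARTITION BY name ORDER BY orderdate) AS default_frame,
           SUM(cost) OVER (PARTITION BY name ORDER BY orderdate
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_frame
    FROM business
    """
).fetchall()

default_totals = sorted(r[2] for r in running)
rows_totals = sorted(r[3] for r in running)
print(default_totals)
print(rows_totals)
```

With the default frame both 2017-01-05 rows report 60. With the ROWS frame the totals grow row by row, and which of the tied rows comes first is not defined, so the middle value may be 14 or 56. Adding a tie-breaking column to order by makes the ROWS result deterministic.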


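(4) Show the date of each customer's previous purchase
Partition by name and sort by date. The result should look like this (first rows only; a customer's first order has no previous purchase). The query below passes '-1' as the third argument of lag, so it prints -1 instead of null for those rows:
+----------------+---------------------+----------------+-------------+
| business.name  | business.orderdate  | business.cost  | date        |
+----------------+---------------------+----------------+-------------+
| jack           | 2017-01-01          | 10             | null        |
| jack           | 2017-01-05          | 46             | 2017-01-01  |
| jack           | 2017-01-08          | 55             | 2017-01-05  |
| jack           | 2017-02-03          | 23             | 2017-01-08  |
| jack           | 2017-04-06          | 42             | 2017-02-03  |
| tony           | 2017-01-02          | 15             | null        |
| tony           | 2017-01-04          | 29             | 2017-01-02  |
+----------------+---------------------+----------------+-------------+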

select *,lag(orderdate,1,'-1') over(partition by name order by orderdate)
from business;
+----------------+---------------------+----------------+---------------+--+
| business.name  | business.orderdate  | business.cost  | lag_window_0  |
+----------------+---------------------+----------------+---------------+--+
| jack           | 2017-01-01          | 10             | -1            |
| jack           | 2017-01-05          | 46             | 2017-01-01    |
| jack           | 2017-01-08          | 55             | 2017-01-05    |
| jack           | 2017-02-03          | 23             | 2017-01-08    |
| jack           | 2017-04-06          | 42             | 2017-02-03    |
| mart           | 2017-04-08          | 62             | -1            |
| mart           | 2017-04-09          | 68             | 2017-04-08    |
| mart           | 2017-04-11          | 75             | 2017-04-09    |
| mart           | 2017-04-13          | 94             | 2017-04-11    |
| neil           | 2017-05-10          | 12             | -1            |
| neil           | 2017-06-12          | 80             | 2017-05-10    |
| tony           | 2017-01-02          | 15             | -1            |
| tony           | 2017-01-04          | 29             | 2017-01-02    |
| tony           | 2017-01-07          | 50             | 2017-01-04    |
+----------------+---------------------+----------------+---------------+--+
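
lag has a mirror image, lead, which looks forward instead of backward and takes the same three arguments (column, offset, default). As a minimal sketch, the query below returns both the previous and the next purchase date for every order. It runs on Python's built-in sqlite3 module as a stand-in for Hive (an assumption; SQLite 3.25 or later is needed); the column aliases prev_order and next_order are names chosen for illustration.

```python
import sqlite3

# The 14 sample rows from business.txt.
orders = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15),
    ("jack", "2017-02-03", 23), ("tony", "2017-01-04", 29),
    ("jack", "2017-01-05", 46), ("jack", "2017-04-06", 42),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55),
    ("mart", "2017-04-08", 62), ("mart", "2017-04-09", 68),
    ("neil", "2017-05-10", 12), ("mart", "2017-04-11", 75),
    ("neil", "2017-06-12", 80), ("mart", "2017-04-13", 94),
]

# Assumption: an in-memory SQLite database stands in for Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO business VALUES (?, ?, ?)", orders)

# '-1' is returned when there is no row at the requested offset.
neighbours = conn.execute(
    """
    SELECT name, orderdate,
           LAG(orderdate, 1, '-1')  OVER (PARTITION BY name ORDER BY orderdate) AS prev_order,
           LEAD(orderdate, 1, '-1') OVER (PARTITION BY name ORDER BY orderdate) AS next_order
    FROM business
    ORDER BY name, orderdate
    """
).fetchall()

for row in neighbours:
    print(row)
```

Each customer's first order gets -1 as its previous date and each customer's last order gets -1 as its next date.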


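(5) Show the orders in the earliest 20% by date
To get the first 20%, sort all rows by date and split them into 5 buckets of roughly equal size with the NTILE window function; bucket 1 is then the earliest 20% (here 3 of the 14 rows, since 14 is not divisible by 5).

t1: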
select *,NTILE(5) over(order by orderdate) num
from business ; 
+----------------+---------------------+----------------+------+--+
| business.name  | business.orderdate  | business.cost  | num  |
+----------------+---------------------+----------------+------+--+
| jack           | 2017-01-01          | 10             | 1    |
| tony           | 2017-01-02          | 15             | 1    |
| tony           | 2017-01-04          | 29             | 1    |
| jack           | 2017-01-05          | 46             | 2    |
| tony           | 2017-01-07          | 50             | 2    |
| jack           | 2017-01-08          | 55             | 2    |
| jack           | 2017-02-03          | 23             | 3    |
| jack           | 2017-04-06          | 42             | 3    |
| mart           | 2017-04-08          | 62             | 3    |
| mart           | 2017-04-09          | 68             | 4    |
| mart           | 2017-04-11          | 75             | 4    |
| mart           | 2017-04-13          | 94             | 4    |
| neil           | 2017-05-10          | 12             | 5    |
| neil           | 2017-06-12          | 80             | 5    |
+----------------+---------------------+----------------+------+--+


select * from
(select *,NTILE(5) over(order by orderdate) num
from business ) t1
where num = 1;
+----------+---------------+----------+---------+--+
| t1.name  | t1.orderdate  | t1.cost  | t1.num  |
+----------+---------------+----------+---------+--+
| jack     | 2017-01-01    | 10       | 1       |
| tony     | 2017-01-02    | 15       | 1       |
| tony     | 2017-01-04    | 29       | 1       |
+----------+---------------+----------+---------+--+
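
Because 14 rows do not divide evenly into 5 buckets, the buckets are not all the same size: 14 / 5 is 2 with remainder 4, so the first four buckets receive 3 rows each and the last receives 2. The sketch below counts the rows per bucket, using Python's built-in sqlite3 module as a stand-in for Hive (an assumption; SQLite 3.25 or later is needed, and its NTILE also places the larger buckets first).

```python
import sqlite3
from collections import Counter

# The 14 sample rows from business.txt.
orders = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15),
    ("jack", "2017-02-03", 23), ("tony", "2017-01-04", 29),
    ("jack", "2017-01-05", 46), ("jack", "2017-04-06", 42),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55),
    ("mart", "2017-04-08", 62), ("mart", "2017-04-09", 68),
    ("neil", "2017-05-10", 12), ("mart", "2017-04-11", 75),
    ("neil", "2017-06-12", 80), ("mart", "2017-04-13", 94),
]

# Assumption: an in-memory SQLite database stands in for Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO business VALUES (?, ?, ?)", orders)

# Assign every order to one of 5 date-ordered buckets.
buckets = conn.execute(
    "SELECT NTILE(5) OVER (ORDER BY orderdate) AS num FROM business"
).fetchall()

# Count how many rows landed in each bucket.
sizes = Counter(num for (num,) in buckets)
bucket_sizes = [sizes[n] for n in range(1, 6)]
print(bucket_sizes)
```

Bucket 1 therefore holds 3 of 14 rows, about 21%, which is the closest NTILE(5) can get to 20% on this data.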

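Summary:

① Understanding window functions starts with a solid understanding of aggregate functions, and in particular of the range of rows an aggregate operates on. An aggregate with no modifier operates on all rows. With group by, it aggregates the rows of each group. With partition by, the range is likewise the rows within each partition. A window clause then restricts the range further.
② When you want to show both the pre-aggregation rows and the aggregated values, use a window function. In a group by query, the select list may contain only aggregate functions and the group by columns, so group by cannot show other columns; a window function can.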