1. 程式人生 > >Hive視窗函式

Hive視窗函式

@[toc] # 官方文件地址 [Hive官網,點我就進](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics) oracle,sqlserver都提供了視窗函式,但是在mysql5.5和5.6都沒有提供視窗函式! **視窗函式: 視窗+函式** - 視窗: 函式執行時計算的資料集的範圍 - 函式: 執行的函式! **僅僅支援以下函式:** # Windowing functions - LEAD (scalar_expression [,offset] [,default]): 返回當前行以下N行的指定列的列值!如果找不到,就採用預設值 - LAG (scalar_expression [,offset] [,default]): 返回當前行以上N行的指定列的列值!如果找不到,就採用預設值 - FIRST_VALUE(列名,[false(預設)]):返回當前視窗指定列的第一個值,第二個引數如果為true,代表加入第一個值為null,跳過空值,繼續尋找! - LAST_VALUE(列名,[false(預設)]):返回當前視窗指定列的最後一個值,第二個引數如果為true,代表加入第一個值為null,跳過空值,繼續尋找! # 統計類的函式(一般都需要結合over使用):min,max,avg,sum,count 排名分析: - RANK - ROW_NUMBER - DENSE_RANK - CUME_DIST - PERCENT_RANK - NTILE ==注意:不是所有的函式在執行都是可以通過改變視窗的大小,來控制計算的資料集的範圍!所有的排名函式和LAG,LEAD,支援使用over(),但是在over()中不能定義 window_clause== **格式: 函式 over( partition by 欄位 ,order by 欄位 window_clause )** # 視窗的大小可以通過windows_clause來指定: ```sql (rows | range) between (unbounded | [num]) preceding and ([num] preceding | current row | (unbounded | [num]) following) (rows | range) between current row and (current row | (unbounded | [num]) following) (rows | range) between [num] following and (unbounded | [num]) following ``` # 特殊情況: - ①在over()中既沒有出現windows_clause,也沒有出現order by,視窗預設為rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING - ②在over()中(沒有出現windows_clause),指定了order by,視窗預設為rows between UNBOUNDED PRECEDING and CURRENT ROW # 視窗函式和分組有什麼區別? - ①如果是分組操作,select後只能寫分組後的欄位 - ②如果是視窗函式,視窗函式是在指定的視窗內,對每條記錄都執行一次函式 - ③如果是分組操作,有去重效果,而partition不去重! # 練習 (9) 查詢前20%時間的訂單資訊 精確演算法: ```sql select * from (select name,orderdate,cost,cume_dist() over(order by orderdate ) cdnum from business) tmp where cdnum<=0.2 ``` 不精確計算: ```sql select * from (select name,orderdate,cost,ntile(5) over(order by orderdate ) cdnum from business) tmp where cdnum=1 ``` (8)查詢顧客的購買明細及顧客最近三次cost花費 最近三次: **當前和之前兩次** 或 **當前+前一次+後一次** 當前和之前兩次: ```sql select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between 2 PRECEDING and CURRENT row) from business ``` 當前+前一次+後一次: ```sql select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and 1 FOLLOWING) from business ``` 或 ```sql select name,orderdate,cost,cost+ lag(cost,1,0) over(partition by name order by orderdate )+ lead(cost,1,0) over(partition by name order by orderdate ) from business ``` (7) 查詢顧客的購買明細及顧客本月最後一次購買的時間 ```sql select name,orderdate,cost,LAST_VALUE(orderdate,true) over(partition by name,substring(orderdate,1,7) order by orderdate rows between CURRENT row and UNBOUNDED FOLLOWING) from business ``` (6) 查詢顧客的購買明細及顧客本月第一次購買的時間 ```sql select name,orderdate,cost,FIRST_VALUE(orderdate,true) over(partition by name,substring(orderdate,1,7) order by orderdate ) from business ``` (5) 查詢顧客的購買明細及顧客下次的購買時間 ```sql select name,orderdate,cost,lead(orderdate,1,'無資料') over(partition by name order by orderdate ) from business ``` (4)查詢顧客的購買明細及顧客上次的購買時間 ```sql select name,orderdate,cost,lag(orderdate,1,'無資料') over(partition by name order by orderdate ) from business ``` (3)查詢顧客的購買明細要將cost按照日期進行累加 ```sql select name,orderdate,cost,sum(cost) over(partition by name order by orderdate ) from business ``` (2)查詢顧客的購買明細及月購買總額 ```sql select name,orderdate,cost,sum(cost) over(partition by name,substring(orderdate,1,7) ) from business ``` (1)查詢在2017年4月份購買過的顧客及總人數 ```sql select name,count(*) over(rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING) from business where substring(orderdate,1,7)='2017-04' group by name ``` 等價於 ```sql select name,count(*) over() from business where substring(orderdate,1,7)='2017-04' group by n