Using the Hive Window Functions LAG, LEAD, FIRST_VALUE, and LAST_VALUE
1. Create the table:
create table windows_ss
(
polno string,
eff_date string,
userno string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;
Prepare the data:
P066666666666,2016-04-02 09:00:02,user01
P066666666666,2016-04-02 09:00:00,user02
P066666666666,2016-04-02 09:03:04,user11
P066666666666,2016-04-02 09:50:05,user03
P066666666666,2016-04-02 10:00:00,user51
P066666666666,2016-04-02 09:10:00,user09
P066666666666,2016-04-02 09:50:01,user32
P088888888888,2016-04-02 09:00:02,user41
P088888888888,2016-04-02 09:00:00,user55
P088888888888,2016-04-02 09:03:04,user23
P088888888888,2016-04-02 09:50:05,user80
P088888888888,2016-04-02 10:00:00,user08
P088888888888,2016-04-02 09:10:00,user22
P088888888888,2016-04-02 09:50:01,user31
Load the data into the Hive table:
LOAD DATA LOCAL INPATH '/home/hadoop/testhivedata/windows_ss.txt' OVERWRITE INTO TABLE windows_ss;
LAG
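LAG(col, n, DEFAULT) returns the value of col from the n-th row above the current row within the window.
The first argument is the column name. The second is the number of rows to look back (optional, default 1). The third is the default value, returned when there is no n-th row above the current row in the partition (if omitted, NULL is returned).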
SELECT
polno,
eff_date,
userno,
ROW_NUMBER() OVER(PARTITION BY polno ORDER BY eff_date) AS rn,
LAG(eff_date,1,'1970-01-01 00:00:00') OVER(PARTITION BY polno ORDER BY eff_date) AS last_1_time,
LAG(eff_date,2) OVER(PARTITION BY polno ORDER BY eff_date) AS last_2_time
FROM windows_ss;
Result:
polno eff_date userno rn last_1_time last_2_time
P066666666666 2016-04-02 09:00:00 user02 1 1970-01-01 00:00:00 NULL
P066666666666 2016-04-02 09:00:02 user01 2 2016-04-02 09:00:00 NULL
P066666666666 2016-04-02 09:03:04 user11 3 2016-04-02 09:00:02 2016-04-02 09:00:00
P066666666666 2016-04-02 09:10:00 user09 4 2016-04-02 09:03:04 2016-04-02 09:00:02
P066666666666 2016-04-02 09:50:01 user32 5 2016-04-02 09:10:00 2016-04-02 09:03:04
P066666666666 2016-04-02 09:50:05 user03 6 2016-04-02 09:50:01 2016-04-02 09:10:00
P066666666666 2016-04-02 10:00:00 user51 7 2016-04-02 09:50:05 2016-04-02 09:50:01
P088888888888 2016-04-02 09:00:00 user55 1 1970-01-01 00:00:00 NULL
P088888888888 2016-04-02 09:00:02 user41 2 2016-04-02 09:00:00 NULL
P088888888888 2016-04-02 09:03:04 user23 3 2016-04-02 09:00:02 2016-04-02 09:00:00
P088888888888 2016-04-02 09:10:00 user22 4 2016-04-02 09:03:04 2016-04-02 09:00:02
P088888888888 2016-04-02 09:50:01 user31 5 2016-04-02 09:10:00 2016-04-02 09:03:04
P088888888888 2016-04-02 09:50:05 user80 6 2016-04-02 09:50:01 2016-04-02 09:10:00
P088888888888 2016-04-02 10:00:00 user08 7 2016-04-02 09:50:05 2016-04-02 09:50:01
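Analysis:
last_1_time: takes the value from 1 row above, with the default '1970-01-01 00:00:00'
P066666666666, row 1: there is no row above it, so the default 1970-01-01 00:00:00 is used
P066666666666, row 3: the value 1 row above is row 2's value, 2016-04-02 09:00:02
P066666666666, row 6: the value 1 row above is row 5's value, 2016-04-02 09:50:01
last_2_time: takes the value from 2 rows above, with no default specified
P088888888888, row 1: there is no row 2 rows above, so the result is NULL
P088888888888, row 2: there is no row 2 rows above, so the result is NULL
P088888888888, row 4: the value 2 rows above is row 2's value, 2016-04-02 09:00:02
P088888888888, row 7: the value 2 rows above is row 5's value, 2016-04-02 09:50:01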
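A common use of LAG is measuring the gap between consecutive records. The following is a minimal sketch against the windows_ss table above. The aliases prev_time and gap_seconds and the subquery alias t are illustrative names, and the sketch assumes every eff_date uses the yyyy-MM-dd HH:mm:ss format so that unix_timestamp can parse it.

```sql
-- Seconds elapsed since the previous record of the same policy.
-- prev_time, gap_seconds and t are illustrative names, not part of the original example.
SELECT
  polno,
  eff_date,
  userno,
  prev_time,
  -- NULL for the first row of each policy, because prev_time is NULL there
  unix_timestamp(eff_date) - unix_timestamp(prev_time) AS gap_seconds
FROM (
  SELECT
    polno,
    eff_date,
    userno,
    -- no default given, so the first row of each partition gets NULL
    LAG(eff_date, 1) OVER (PARTITION BY polno ORDER BY eff_date) AS prev_time
  FROM windows_ss
) t;
```

With the sample data, the second row of P066666666666 (09:00:02, preceded by 09:00:00) should come out with a gap of 2 seconds. Leaving out the default here is deliberate: a placeholder such as 1970-01-01 00:00:00 would produce a huge, misleading gap on the first row.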
LEAD
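The opposite of LAG.
LEAD(col, n, DEFAULT) returns the value of col from the n-th row below the current row within the window.
The first argument is the column name. The second is the number of rows to look ahead (optional, default 1). The third is the default value, returned when there is no n-th row below the current row in the partition (if omitted, NULL is returned).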
SELECT
polno,
eff_date,
userno,
ROW_NUMBER() OVER(PARTITION BY polno ORDER BY eff_date) AS rn,
LEAD(eff_date,1,'1970-01-01 00:00:00') OVER(PARTITION BY polno ORDER BY eff_date) AS next_1_time,
LEAD(eff_date,2) OVER(PARTITION BY polno ORDER BY eff_date) AS next_2_time
FROM windows_ss;
Result:
polno eff_date userno rn next_1_time next_2_time
P066666666666 2016-04-02 09:00:00 user02 1 2016-04-02 09:00:02 2016-04-02 09:03:04
P066666666666 2016-04-02 09:00:02 user01 2 2016-04-02 09:03:04 2016-04-02 09:10:00
P066666666666 2016-04-02 09:03:04 user11 3 2016-04-02 09:10:00 2016-04-02 09:50:01
P066666666666 2016-04-02 09:10:00 user09 4 2016-04-02 09:50:01 2016-04-02 09:50:05
P066666666666 2016-04-02 09:50:01 user32 5 2016-04-02 09:50:05 2016-04-02 10:00:00
P066666666666 2016-04-02 09:50:05 user03 6 2016-04-02 10:00:00 NULL
P066666666666 2016-04-02 10:00:00 user51 7 1970-01-01 00:00:00 NULL
P088888888888 2016-04-02 09:00:00 user55 1 2016-04-02 09:00:02 2016-04-02 09:03:04
P088888888888 2016-04-02 09:00:02 user41 2 2016-04-02 09:03:04 2016-04-02 09:10:00
P088888888888 2016-04-02 09:03:04 user23 3 2016-04-02 09:10:00 2016-04-02 09:50:01
P088888888888 2016-04-02 09:10:00 user22 4 2016-04-02 09:50:01 2016-04-02 09:50:05
P088888888888 2016-04-02 09:50:01 user31 5 2016-04-02 09:50:05 2016-04-02 10:00:00
P088888888888 2016-04-02 09:50:05 user80 6 2016-04-02 10:00:00 NULL
P088888888888 2016-04-02 10:00:00 user08 7 1970-01-01 00:00:00 NULL
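Analysis:
The logic is the same as for LAG, except that LAG looks upward (at earlier rows) and LEAD looks downward (at later rows). For example, row 7 of each policy has no row below it, so next_1_time takes the default 1970-01-01 00:00:00 and next_2_time is NULL.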
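LEAD is often used to turn a list of effective timestamps into validity intervals. The sketch below treats each record as valid until the next record of the same policy begins. The aliases valid_from and valid_to and the open-ended marker '9999-12-31 23:59:59' are assumptions made for illustration only.

```sql
-- Each record is considered valid until the next record of the same policy starts.
-- '9999-12-31 23:59:59' is an arbitrary "still open" marker chosen for this sketch.
SELECT
  polno,
  userno,
  eff_date AS valid_from,
  LEAD(eff_date, 1, '9999-12-31 23:59:59')
    OVER (PARTITION BY polno ORDER BY eff_date) AS valid_to
FROM windows_ss;
```

Only the latest record of each policy has no following row, so only that row receives the marker value.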
FIRST_VALUE
Returns the first value in the partition after sorting, considering the rows up to the current row.
SELECT
polno,
eff_date,
userno,
ROW_NUMBER() OVER(PARTITION BY polno ORDER BY eff_date) AS rn,
FIRST_VALUE(userno) OVER(PARTITION BY polno ORDER BY eff_date) AS first1
FROM windows_ss;
Result:
polno eff_date userno rn first1
P066666666666 2016-04-02 09:00:00 user02 1 user02
P066666666666 2016-04-02 09:00:02 user01 2 user02
P066666666666 2016-04-02 09:03:04 user11 3 user02
P066666666666 2016-04-02 09:10:00 user09 4 user02
P066666666666 2016-04-02 09:50:01 user32 5 user02
P066666666666 2016-04-02 09:50:05 user03 6 user02
P066666666666 2016-04-02 10:00:00 user51 7 user02
P088888888888 2016-04-02 09:00:00 user55 1 user55
P088888888888 2016-04-02 09:00:02 user41 2 user55
P088888888888 2016-04-02 09:03:04 user23 3 user55
P088888888888 2016-04-02 09:10:00 user22 4 user55
P088888888888 2016-04-02 09:50:01 user31 5 user55
P088888888888 2016-04-02 09:50:05 user80 6 user55
P088888888888 2016-04-02 10:00:00 user08 7 user55
LAST_VALUE
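Returns the last value in the partition after sorting, considering the rows up to the current row. Because the window ends at the current row, the result here is simply the current row's own value.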
SELECT
polno,
eff_date,
userno,
ROW_NUMBER() OVER(PARTITION BY polno ORDER BY eff_date) AS rn,
LAST_VALUE(userno) OVER(PARTITION BY polno ORDER BY eff_date) AS last1
FROM windows_ss;
Result:
polno eff_date userno rn last1
P066666666666 2016-04-02 09:00:00 user02 1 user02
P066666666666 2016-04-02 09:00:02 user01 2 user01
P066666666666 2016-04-02 09:03:04 user11 3 user11
P066666666666 2016-04-02 09:10:00 user09 4 user09
P066666666666 2016-04-02 09:50:01 user32 5 user32
P066666666666 2016-04-02 09:50:05 user03 6 user03
P066666666666 2016-04-02 10:00:00 user51 7 user51
P088888888888 2016-04-02 09:00:00 user55 1 user55
P088888888888 2016-04-02 09:00:02 user41 2 user41
P088888888888 2016-04-02 09:03:04 user23 3 user23
P088888888888 2016-04-02 09:10:00 user22 4 user22
P088888888888 2016-04-02 09:50:01 user31 5 user31
P088888888888 2016-04-02 09:50:05 user80 6 user80
P088888888888 2016-04-02 10:00:00 user08 7 user08
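The "up to the current row" behavior comes from the default window frame. According to the Hive windowing documentation, when ORDER BY is given without a frame clause, the frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. The sketch below writes that frame out explicitly. The alias last1_explicit is an illustrative name, and the two columns are expected to match on this data.

```sql
-- Compare the implicit default frame with the same frame written out explicitly.
-- last1_explicit is an illustrative alias.
SELECT
  polno,
  eff_date,
  userno,
  LAST_VALUE(userno) OVER (PARTITION BY polno ORDER BY eff_date) AS last1,
  LAST_VALUE(userno) OVER (
    PARTITION BY polno
    ORDER BY eff_date
    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS last1_explicit
FROM windows_ss;
```

With a RANGE frame, rows that share the same eff_date count as peers of the current row. The sample data has no duplicate eff_date within a policy, so that case does not arise here.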
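If ORDER BY is not specified, the window covers the whole partition, but the order of rows within it is not defined. In the runs below the rows follow the order in which they appear in the data file, which is not guaranteed, so the results are not the intended ones.
FIRST_VALUE without ORDER BY: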
SELECT
polno,
eff_date,
userno,
FIRST_VALUE(userno) OVER(PARTITION BY polno) AS first2
FROM windows_ss;
Result:
polno eff_date userno first2
P066666666666 2016-04-02 09:00:02 user01 user01
P066666666666 2016-04-02 09:00:00 user02 user01
P066666666666 2016-04-02 09:03:04 user11 user01
P066666666666 2016-04-02 09:50:05 user03 user01
P066666666666 2016-04-02 10:00:00 user51 user01
P066666666666 2016-04-02 09:10:00 user09 user01
P066666666666 2016-04-02 09:50:01 user32 user01
P088888888888 2016-04-02 09:00:02 user41 user41
P088888888888 2016-04-02 09:00:00 user55 user41
P088888888888 2016-04-02 09:03:04 user23 user41
P088888888888 2016-04-02 09:50:05 user80 user41
P088888888888 2016-04-02 10:00:00 user08 user41
P088888888888 2016-04-02 09:10:00 user22 user41
P088888888888 2016-04-02 09:50:01 user31 user41
LAST_VALUE without ORDER BY:
SELECT
polno,
eff_date,
userno,
LAST_VALUE(userno) OVER(PARTITION BY polno) AS last2
FROM windows_ss;
Result:
polno eff_date userno last2
P066666666666 2016-04-02 09:00:02 user01 user32
P066666666666 2016-04-02 09:00:00 user02 user32
P066666666666 2016-04-02 09:03:04 user11 user32
P066666666666 2016-04-02 09:50:05 user03 user32
P066666666666 2016-04-02 10:00:00 user51 user32
P066666666666 2016-04-02 09:10:00 user09 user32
P066666666666 2016-04-02 09:50:01 user32 user32
P088888888888 2016-04-02 09:00:02 user41 user31
P088888888888 2016-04-02 09:00:00 user55 user31
P088888888888 2016-04-02 09:03:04 user23 user31
P088888888888 2016-04-02 09:50:05 user80 user31
P088888888888 2016-04-02 10:00:00 user08 user31
P088888888888 2016-04-02 09:10:00 user22 user31
P088888888888 2016-04-02 09:50:01 user31 user31
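To get the last value of the whole partition after sorting, a workaround is needed: sort in descending order and take FIRST_VALUE.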
SELECT
polno,
eff_date,
userno,
ROW_NUMBER() OVER(PARTITION BY polno ORDER BY eff_date) AS rn,
LAST_VALUE(userno) OVER(PARTITION BY polno ORDER BY eff_date) AS last1,
FIRST_VALUE(userno) OVER(PARTITION BY polno ORDER BY eff_date DESC) AS last2
FROM windows_ss ORDER BY polno,eff_date;
Result:
polno eff_date userno rn last1 last2
P066666666666 2016-04-02 09:00:00 user02 1 user02 user51
P066666666666 2016-04-02 09:00:02 user01 2 user01 user51
P066666666666 2016-04-02 09:03:04 user11 3 user11 user51
P066666666666 2016-04-02 09:10:00 user09 4 user09 user51
P066666666666 2016-04-02 09:50:01 user32 5 user32 user51
P066666666666 2016-04-02 09:50:05 user03 6 user03 user51
P066666666666 2016-04-02 10:00:00 user51 7 user51 user51
P088888888888 2016-04-02 09:00:00 user55 1 user55 user08
P088888888888 2016-04-02 09:00:02 user41 2 user41 user08
P088888888888 2016-04-02 09:03:04 user23 3 user23 user08
P088888888888 2016-04-02 09:10:00 user22 4 user22 user08
P088888888888 2016-04-02 09:50:01 user31 5 user31 user08
P088888888888 2016-04-02 09:50:05 user80 6 user80 user08
P088888888888 2016-04-02 10:00:00 user08 7 user08 user08
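Another option, instead of reversing the sort order, is to keep ORDER BY eff_date ascending and widen the window frame so that it spans the whole partition. This is a sketch under the assumption that your Hive version supports explicit frame clauses. The alias last3 is an illustrative name.

```sql
-- Last value of the whole partition, using an explicit full-partition frame.
-- last3 is an illustrative alias; it is expected to match last2 in the output above.
SELECT
  polno,
  eff_date,
  userno,
  LAST_VALUE(userno) OVER (
    PARTITION BY polno
    ORDER BY eff_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS last3
FROM windows_ss
ORDER BY polno, eff_date;
```

On the sample data this should give user51 for every row of P066666666666 and user08 for every row of P088888888888. Stating the frame explicitly keeps the sort direction consistent with the other window columns in the same query.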
Note:
When using analytic functions, pay close attention to the ORDER BY clause. If it is used incorrectly, the results will not be what you expect.