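Hive Analytic Window Functions: SUM, AVG, MIN, and MAX
Hive provides many analytic functions for complex statistical analysis.
This post introduces four of them first: SUM, AVG, MIN, and MAX.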
Environment:
Hive version: apache-hive-0.14.0-bin
Hadoop version: hadoop-2.6.0
Tez version: tez-0.7.0
Sample data:
P088888888888,2016-02-10,1
P088888888888,2016-02-11,3
P088888888888,2016-02-12,1
P088888888888,2016-02-13,9
P088888888888,2016-02-14,3
P088888888888,2016-02-15,12
P088888888888,2016-02-16,3
Create the table:
hive (hiveinaction)> create table windows_func
> (
> polno string,
> createtime string,
> pnum int
> )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> stored as textfile;
Load the data into the table:
load data local inpath '/home/hadoop/testhivedata/windows_func.txt' into table windows_func;
Test:
SELECT polno,
createtime,
pnum,
SUM(pnum) OVER(PARTITION BY polno ORDER BY createtime) AS pnum1, -- default frame: from the start of the partition to the current row
SUM(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pnum2, -- from the start of the partition to the current row
SUM(pnum) OVER(PARTITION BY polno) AS pnum3, -- all rows in the partition
SUM(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pnum4, -- current row + 3 preceding rows
SUM(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pnum5, -- current row + 3 preceding rows + 1 following row
SUM(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pnum6 -- current row + all following rows
FROM windows_func;
Result:
polno         | createtime | pnum | pnum1 | pnum2 | pnum3 | pnum4 | pnum5 | pnum6 |
P088888888888 | 2016/2/10  | 1    | 1     | 1     | 32    | 1     | 4     | 32    |
P088888888888 | 2016/2/11  | 3    | 4     | 4     | 32    | 4     | 5     | 31    |
P088888888888 | 2016/2/12  | 1    | 5     | 5     | 32    | 5     | 14    | 28    |
P088888888888 | 2016/2/13  | 9    | 14    | 14    | 32    | 14    | 17    | 27    |
P088888888888 | 2016/2/14  | 3    | 17    | 17    | 32    | 16    | 28    | 18    |
P088888888888 | 2016/2/15  | 12   | 29    | 29    | 32    | 25    | 28    | 15    |
P088888888888 | 2016/2/16  | 3    | 32    | 32    | 32    | 27    | 27    | 3     |
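Notes:
1. If ORDER BY is specified but ROWS BETWEEN is not, the frame defaults to everything from the start of the partition through the current row. Strictly, Hive's default in this case is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which gives the same result as the ROWS form whenever the ORDER BY values are unique, as they are in this data;
2. If ORDER BY is not specified, all values in the partition are aggregated;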
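The difference between the default RANGE frame and an explicit ROWS frame only shows up when several rows share the same ORDER BY value. The following is a minimal pure-Python sketch of that difference, not Hive code: the four rows are hypothetical (two of them share a createtime), and the function names `running_sum_rows` and `running_sum_range` are made up for illustration.

```python
# Pure-Python sketch (not Hive code) contrasting ROWS and RANGE frames
# for a running SUM. The rows are hypothetical: two share a createtime.

rows = [
    ("2016-02-10", 1),
    ("2016-02-11", 3),
    ("2016-02-11", 1),
    ("2016-02-12", 9),
]
rows.sort(key=lambda r: r[0])  # ORDER BY createtime

def running_sum_rows(rows):
    # ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:
    # physical rows from the first one up to and including row i
    return [sum(v for _, v in rows[:i + 1]) for i in range(len(rows))]

def running_sum_range(rows):
    # RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:
    # every row whose ORDER BY value is <= the current row's value,
    # so rows tied with the current row are included as well
    return [sum(v for k, v in rows if k <= key) for key, _ in rows]

print(running_sum_rows(rows))   # [1, 4, 5, 14]
print(running_sum_range(rows))  # [1, 5, 5, 14]
```

With ties, the ROWS result for the tied rows depends on the order in which they happen to arrive, while the RANGE result is the same for every row in the tie.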
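Understanding ROWS BETWEEN, also called the WINDOW clause:
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: the partition boundary; UNBOUNDED PRECEDING means starting from the first row of the partition, and UNBOUNDED FOLLOWING means running through its last row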
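To make the frame keywords concrete, here is a small pure-Python sketch that mimics how a ROWS BETWEEN frame selects rows around the current row. It is an illustration under simplifying assumptions (a single partition, values already sorted by createtime), not how Hive implements windowing; `window_agg` and its parameters are hypothetical names.

```python
# Pure-Python sketch of ROWS BETWEEN frames; not Hive code.
# window_agg and its parameters are hypothetical names for illustration.

def window_agg(values, preceding, following, agg=sum):
    """Apply agg to each row's frame.

    preceding / following: how many rows before / after the current row
    the frame includes. None means UNBOUNDED; 0 means the frame edge is
    CURRENT ROW. values must already be in ORDER BY order.
    """
    last = len(values) - 1
    result = []
    for i in range(len(values)):
        # Clamp the frame to the partition boundaries
        start = 0 if preceding is None else max(0, i - preceding)
        end = last if following is None else min(last, i + following)
        result.append(agg(values[start:end + 1]))
    return result

# pnum values from the sample data, ordered by createtime
pnum = [1, 3, 1, 9, 3, 12, 3]

pnum2 = window_agg(pnum, None, 0)     # UNBOUNDED PRECEDING AND CURRENT ROW
pnum3 = window_agg(pnum, None, None)  # whole partition
pnum4 = window_agg(pnum, 3, 0)        # 3 PRECEDING AND CURRENT ROW
pnum5 = window_agg(pnum, 3, 1)        # 3 PRECEDING AND 1 FOLLOWING
pnum6 = window_agg(pnum, 0, None)     # CURRENT ROW AND UNBOUNDED FOLLOWING

print(pnum4)  # [1, 4, 5, 14, 16, 25, 27]
print(pnum5)  # [4, 5, 14, 17, 28, 28, 27]
print(pnum6)  # [32, 31, 28, 27, 18, 15, 3]
```

The printed lists match the pnum4, pnum5, and pnum6 columns of the SUM result above. Near the partition edges the frame simply shrinks, which is why the first row of pnum5 sums only two values.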
AVG, MIN, and MAX are used the same way as SUM.
AVG example:
SELECT polno,
createtime,
pnum,
AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime) AS pnum1, -- default frame: from the start of the partition to the current row
AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pnum2, -- from the start of the partition to the current row
AVG(pnum) OVER(PARTITION BY polno) AS pnum3, -- all rows in the partition
AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pnum4, -- current row + 3 preceding rows
AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pnum5, -- current row + 3 preceding rows + 1 following row
AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pnum6 -- current row + all following rows
FROM windows_func;
Result:
polno         | createtime | pnum | pnum1   | pnum2  | pnum3      | pnum4    | pnum5    | pnum6     |
P088888888888 | 2016/2/10  | 1    | 1       | 1      | 4.57142857 | 1        | 2        | 4.5714286 |
P088888888888 | 2016/2/11  | 3    | 2       | 2      | 4.57142857 | 2        | 1.666667 | 5.1666667 |
P088888888888 | 2016/2/12  | 1    | 1.66667 | 1.6667 | 4.57142857 | 1.666667 | 3.5      | 5.6       |
P088888888888 | 2016/2/13  | 9    | 3.5     | 3.5    | 4.57142857 | 3.5      | 3.4      | 6.75      |
P088888888888 | 2016/2/14  | 3    | 3.4     | 3.4    | 4.57142857 | 4        | 5.6      | 6         |
P088888888888 | 2016/2/15  | 12   | 4.83333 | 4.8333 | 4.57142857 | 6.25     | 5.6      | 7.5       |
P088888888888 | 2016/2/16  | 3    | 4.57143 | 4.5714 | 4.57142857 | 6.75     | 6.75     | 3         |
Examples for the other, similar functions are omitted.
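For readers who want to see what MIN and MAX would return over the same frames, the following pure-Python sketch computes them on the sample data. These values come from the sketch, not from a Hive run; it assumes a single partition already sorted by createtime, and `window_agg` is a hypothetical helper name.

```python
# Pure-Python sketch, not Hive output. window_agg is a hypothetical helper.

def window_agg(values, preceding, following, agg):
    """Aggregate each row's ROWS frame; None means UNBOUNDED."""
    last = len(values) - 1
    result = []
    for i in range(len(values)):
        start = 0 if preceding is None else max(0, i - preceding)
        end = last if following is None else min(last, i + following)
        result.append(agg(values[start:end + 1]))
    return result

pnum = [1, 3, 1, 9, 3, 12, 3]  # sample data ordered by createtime

# MIN / MAX(pnum) OVER (... ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
min4 = window_agg(pnum, 3, 0, min)
max4 = window_agg(pnum, 3, 0, max)
# MIN(pnum) OVER (... ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
min6 = window_agg(pnum, 0, None, min)

print(min4)  # [1, 1, 1, 1, 1, 1, 3]
print(max4)  # [1, 3, 3, 9, 9, 12, 12]
print(min6)  # [1, 1, 1, 3, 3, 3, 3]
```

In HiveQL, the corresponding queries would replace SUM with MIN or MAX in the statements above and leave the OVER clauses unchanged.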