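Hive window functions: sum, avg, min, max
Window functions are among the most frequently used and most important tools for statistical analysis in Hive.
This post covers sum, avg, min, and max as window functions in Hive; other commonly used ones will follow in later posts.
First, create a mock call-record table with three columns: calling number, call time, and call duration (in minutes).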
> create table `call_test` (
`pone_number` string,
`createtime` string, --day
`call_minute` int
);
OK
Time taken: 0.369 seconds
Check the table structure:
> desc call_test;
OK
pone_number string
createtime string
call_minute int
Time taken: 0.149 seconds, Fetched: 3 row(s)
Insert the mock data:
insert into call_test values('18600000000', '2018-12-10 13:00:00', 1);
insert into call_test values('18600000000', '2018-12-11 13:00:00', 6);
insert into call_test values('18600000000', '2018-12-12 13:00:00', 8);
insert into call_test values('18600000000', '2018-12-13 13:00:00', 4);
insert into call_test values('18600000000', '2018-12-14 13:00:00', 7);
insert into call_test values('18600000000', '2018-12-15 13:00:00', 1);
insert into call_test values('18600000000', '2018-12-16 13:00:00', 6);
insert into call_test values('18600000000', '2018-12-17 13:00:00', 8);
insert into call_test values('18600000000', '2018-12-18 13:00:00', 2);
insert into call_test values('18600000000', '2018-12-19 13:00:00', 4);
insert into call_test values('18600000000', '2018-12-20 13:00:00', 7);
insert into call_test values('18600000000', '2018-12-21 13:00:00', 1);
insert into call_test values('18600000000', '2018-12-22 13:00:00', 6);
insert into call_test values('18600000000', '2018-12-23 13:00:00', 8);
insert into call_test values('15600000000', '2018-12-10 13:00:00', 2);
insert into call_test values('15600000000', '2018-12-11 13:00:00', 4);
insert into call_test values('15600000000', '2018-12-12 13:00:00', 7);
insert into call_test values('15600000000', '2018-12-13 13:00:00', 1);
insert into call_test values('15600000000', '2018-12-14 13:00:00', 6);
insert into call_test values('15600000000', '2018-12-15 13:00:00', 8);
insert into call_test values('15600000000', '2018-12-16 13:00:00', 2);
insert into call_test values('15600000000', '2018-12-17 13:00:00', 4);
insert into call_test values('15600000000', '2018-12-18 13:00:00', 7);
SUM: note that the result depends on ORDER BY, which sorts in ascending order by default.
> select pone_number,
createtime,
call_minute,
sum(call_minute) OVER(partition by pone_number order by createtime) as call_minute1, -- no frame given: defaults to partition start through current row
sum(call_minute) OVER(partition by pone_number order by createtime rows between unbounded preceding and current row) as call_minute2, -- partition start through current row; same result as call_minute1
sum(call_minute) OVER(partition by pone_number) as call_minute3, -- all rows in the partition
sum(call_minute) OVER(partition by pone_number order by createtime rows between 3 preceding and current row) as call_minute4, -- current row plus the 3 preceding rows
sum(call_minute) OVER(partition by pone_number order by createtime rows between 3 preceding and 1 following) as call_minute5, -- current row plus the 3 preceding rows and 1 following row
sum(call_minute) OVER(partition by pone_number order by createtime rows between current row and unbounded following) as call_minute6 -- current row plus all following rows
FROM call_test;
Query ID = hdfs_20181211000153_8870b5b2-ecaf-46aa-90f2-49a73e9e4ddf
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1541064601030_38864)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0
Reducer 3 ...... container SUCCEEDED 1 1 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 03/03 [==========================>>] 100% ELAPSED TIME: 0.66 s
----------------------------------------------------------------------------------------------
OK
+--------------+----------------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
| pone_number | createtime | call_minute | call_minute1 | call_minute2 | call_minute3 | call_minute4 | call_minute5 | call_minute6 |
+--------------+----------------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
| 15600000000 | 2018-12-14 13:00:00 | 6 | 20 | 20 | 41 | 18 | 26 | 27 |
| 15600000000 | 2018-12-13 13:00:00 | 1 | 14 | 14 | 41 | 14 | 20 | 28 |
| 15600000000 | 2018-12-12 13:00:00 | 7 | 13 | 13 | 41 | 13 | 14 | 35 |
| 15600000000 | 2018-12-11 13:00:00 | 4 | 6 | 6 | 41 | 6 | 13 | 39 |
| 15600000000 | 2018-12-10 13:00:00 | 2 | 2 | 2 | 41 | 2 | 6 | 41 |
| 15600000000 | 2018-12-18 13:00:00 | 7 | 41 | 41 | 41 | 21 | 21 | 7 |
| 15600000000 | 2018-12-17 13:00:00 | 4 | 34 | 34 | 41 | 20 | 27 | 11 |
| 15600000000 | 2018-12-16 13:00:00 | 2 | 30 | 30 | 41 | 17 | 21 | 13 |
| 15600000000 | 2018-12-15 13:00:00 | 8 | 28 | 28 | 41 | 22 | 24 | 21 |
| 18600000000 | 2018-12-23 13:00:00 | 8 | 69 | 69 | 69 | 22 | 22 | 8 |
| 18600000000 | 2018-12-22 13:00:00 | 6 | 61 | 61 | 69 | 18 | 26 | 14 |
| 18600000000 | 2018-12-21 13:00:00 | 1 | 55 | 55 | 69 | 14 | 20 | 15 |
| 18600000000 | 2018-12-20 13:00:00 | 7 | 54 | 54 | 69 | 21 | 22 | 22 |
| 18600000000 | 2018-12-19 13:00:00 | 4 | 47 | 47 | 69 | 20 | 27 | 26 |
| 18600000000 | 2018-12-18 13:00:00 | 2 | 43 | 43 | 69 | 17 | 21 | 28 |
| 18600000000 | 2018-12-17 13:00:00 | 8 | 41 | 41 | 69 | 22 | 24 | 36 |
| 18600000000 | 2018-12-16 13:00:00 | 6 | 33 | 33 | 69 | 18 | 26 | 42 |
| 18600000000 | 2018-12-15 13:00:00 | 1 | 27 | 27 | 69 | 20 | 26 | 43 |
| 18600000000 | 2018-12-14 13:00:00 | 7 | 26 | 26 | 69 | 25 | 26 | 50 |
| 18600000000 | 2018-12-13 13:00:00 | 4 | 19 | 19 | 69 | 19 | 26 | 54 |
| 18600000000 | 2018-12-11 13:00:00 | 6 | 7 | 7 | 69 | 7 | 15 | 68 |
| 18600000000 | 2018-12-10 13:00:00 | 1 | 1 | 1 | 69 | 1 | 7 | 69 |
| 18600000000 | 2018-12-12 13:00:00 | 8 | 15 | 15 | 69 | 15 | 19 | 62 |
+--------------+----------------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
Time taken: 1.14 seconds, Fetched: 23 row(s)
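Explanation:
call_minute1: running total of call_minute from the start of the partition to the current row. For example, the value for the 11th = 10th + 11th, and the value for the 12th = 10th + 11th + 12th.
call_minute2: same as call_minute1.
call_minute3: the sum of all call_minute values in the partition (that is, per pone_number).
call_minute4: current row plus the 3 preceding rows in the partition. For example, 11th = 10th + 11th, 12th = 10th + 11th + 12th, 13th = 10th + 11th + 12th + 13th, 14th = 11th + 12th + 13th + 14th.
call_minute5: current row plus the 3 preceding rows and 1 following row in the partition. For example, 14th = 11th + 12th + 13th + 14th + 15th.
call_minute6: current row plus all following rows in the partition. For example, for 15600000000, whose data ends on the 18th, 13th = 13th + 14th + 15th + 16th + 17th + 18th, and 14th = 14th + 15th + 16th + 17th + 18th.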
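The arithmetic above can be checked outside Hive. The following is a minimal Python sketch, not Hive code: it hard-codes the call_minute values of 15600000000 in createtime order (taken from the insert statements) and emulates each frame with list slicing. The helper name `window_sums` is hypothetical, and the sketch assumes createtime is unique within the partition, as it is in this data.

```python
# call_minute values for 15600000000, ordered by createtime (2018-12-10 .. 2018-12-18)
minutes = [2, 4, 7, 1, 6, 8, 2, 4, 7]

def window_sums(values):
    """Emulate the six sum(...) OVER(...) columns for a single partition."""
    rows = []
    for i in range(len(values)):
        rows.append({
            "call_minute1": sum(values[:i + 1]),               # start .. current row
            "call_minute2": sum(values[:i + 1]),               # same frame, written explicitly
            "call_minute3": sum(values),                       # whole partition
            "call_minute4": sum(values[max(0, i - 3):i + 1]),  # 3 preceding .. current row
            "call_minute5": sum(values[max(0, i - 3):i + 2]),  # 3 preceding .. 1 following
            "call_minute6": sum(values[i:]),                   # current row .. end
        })
    return rows

result = window_sums(minutes)
print(result[4])  # the row for 2018-12-14
```

The row printed for 2018-12-14 carries the values 20, 20, 41, 18, 26, 27, which agrees with the corresponding row of the query output.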
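If rows between is omitted but order by is present, the frame defaults to partition start through current row. Strictly, Hive's default in this case is range between unbounded preceding and current row, which gives the same result as the rows form whenever the order by values are unique within the partition, as createtime is here;
If order by is omitted as well, all values in the partition are summed;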
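One caveat on the default frame: per the Hive documentation, a window with order by but no explicit frame uses range between unbounded preceding and current row rather than the rows form. The two only differ when several rows share the same order by value. The Python sketch below illustrates that difference on hypothetical data with a tie; the function names and the sample rows are assumptions for illustration and do not come from the table above.

```python
def running_sum_rows(pairs):
    """ROWS frame: each row sums itself and the rows physically before it."""
    total, out = 0, []
    for _, value in pairs:
        total += value
        out.append(total)
    return out

def running_sum_range(pairs):
    """RANGE frame: each row also includes every peer row with the same sort key."""
    return [sum(v for k, v in pairs if k <= key) for key, _ in pairs]

# Hypothetical (sort key, call_minute) rows, already sorted; two rows share "12-11".
tied = [("12-10", 1), ("12-11", 6), ("12-11", 8), ("12-12", 4)]

print(running_sum_rows(tied))   # [1, 7, 15, 19]
print(running_sum_range(tied))  # [1, 15, 15, 19]
```

With the rows form, the result for tied rows also depends on which of them happens to come first, so writing the frame explicitly is the safer habit when the sort key may repeat.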
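The key is understanding rows between, also called the window clause:
preceding: rows before the current row
following: rows after the current row
current row: the current row
unbounded: no limit in that direction; unbounded preceding means starting from the first row of the partition, and unbounded following means running through the last row of the partition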
avg, min, and max are used the same way as sum.
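To make the point concrete that only the aggregate changes while the frame stays the same, here is a small Python sketch of a generic rows between frame. It is an emulation for checking results by hand, not how Hive evaluates windows; `UNBOUNDED`, `frame`, and `over` are hypothetical names, and current row is expressed as 0 preceding or 0 following.

```python
from statistics import mean

UNBOUNDED = None  # stands in for "unbounded preceding" / "unbounded following"

def frame(values, i, preceding, following):
    """Rows covered by `rows between <preceding> preceding and <following> following` at row i."""
    start = 0 if preceding is UNBOUNDED else max(0, i - preceding)
    end = len(values) if following is UNBOUNDED else min(len(values), i + following + 1)
    return values[start:end]

def over(agg, values, preceding, following):
    """Apply an aggregate to the frame of every row in one ordered partition."""
    return [agg(frame(values, i, preceding, following)) for i in range(len(values))]

# call_minute values for 15600000000, ordered by createtime
minutes = [2, 4, 7, 1, 6, 8, 2, 4, 7]

avg4 = [round(x, 2) for x in over(mean, minutes, 3, 0)]  # 3 preceding .. current row
min4 = over(min, minutes, 3, 0)                          # same frame, different aggregate
max5 = over(max, minutes, 3, 1)                          # 3 preceding .. 1 following

print(avg4[4])  # 4.5, the avg over 3 preceding .. current row for 2018-12-14
print(min4)     # [2, 2, 2, 1, 1, 1, 1, 2, 2]
print(max5)     # [4, 7, 7, 7, 8, 8, 8, 8, 8]
```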
AVG
> select pone_number,
createtime,
call_minute,
round(avg(call_minute) OVER(partition by pone_number order by createtime), 2) as call_minute1, -- no frame given: defaults to partition start through current row
round(avg(call_minute) OVER(partition by pone_number order by createtime rows between unbounded preceding and current row), 2) as call_minute2, -- partition start through current row; same result as call_minute1
round(avg(call_minute) OVER(partition by pone_number), 2) as call_minute3, -- all rows in the partition
round(avg(call_minute) OVER(partition by pone_number order by createtime rows between 3 preceding and current row), 2) as call_minute4, -- current row plus the 3 preceding rows
round(avg(call_minute) OVER(partition by pone_number order by createtime rows between 3 preceding and 1 following), 2) as call_minute5, -- current row plus the 3 preceding rows and 1 following row
round(avg(call_minute) OVER(partition by pone_number order by createtime rows between current row and unbounded following), 2) as call_minute6 -- current row plus all following rows
FROM call_test;
Query ID = hdfs_20181211000203_53ab6fb6-628c-4ac8-81aa-244c73b701f0
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1541064601030_38864)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0
Reducer 3 ...... container SUCCEEDED 1 1 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 03/03 [==========================>>] 100% ELAPSED TIME: 4.04 s
----------------------------------------------------------------------------------------------
OK
+--------------+----------------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
| pone_number | createtime | call_minute | call_minute1 | call_minute2 | call_minute3 | call_minute4 | call_minute5 | call_minute6 |
+--------------+----------------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
| 15600000000 | 2018-12-14 13:00:00 | 6 | 4.0 | 4.0 | 4.56 | 4.5 | 5.2 | 5.4 |
| 15600000000 | 2018-12-13 13:00:00 | 1 | 3.5 | 3.5 | 4.56 | 3.5 | 4.0 | 4.67 |
| 15600000000 | 2018-12-12 13:00:00 | 7 | 4.33 | 4.33 | 4.56 | 4.33 | 3.5 | 5.0 |
| 15600000000 | 2018-12-11 13:00:00 | 4