1. 程式人生 > 資料庫 >數倉開發需要了解的5大SQL分析函式

數倉開發需要了解的5大SQL分析函式

公眾號『大資料技術與數倉』,回覆『資料』領取大資料資料包

基本語法

analytic_function_name([argument_list])
OVER (
[PARTITION BY partition_expression,…]
[ORDER BY sort_expression, … [ASC|DESC]])
  • analytic_function_name: 函式名稱 — 比如 RANK(), SUM(), FIRST()等等
  • partition_expression: 分割槽列
  • sort_expression: 排序列

案例

資料準備

CREATE TABLE `orders` (
    `order_num` String COMMENT '訂單號',
    `order_amount` DECIMAL ( 12, 2 ) COMMENT '訂單金額',
    `advance_amount` DECIMAL ( 12, 2 ) COMMENT '預付款',
    `order_date` string COMMENT '訂單日期',
    `cust_code` string COMMENT '客戶',
    `agent_code` string COMMENT '代理商' 
);
INSERT INTO orders VALUES('200100', '1000.00', '600.00', '2020-08-01', 'C00013', 'A003');
INSERT INTO orders VALUES('200110', '3000.00', '500.00', '2020-04-15', 'C00019', 'A010');
INSERT INTO orders VALUES('200107', '4500.00', '900.00', '2020-08-30', 'C00007', 'A010');
INSERT INTO orders VALUES('200112', '2000.00', '400.00', '2020-05-30', 'C00016', 'A007'); 
INSERT INTO orders VALUES('200113', '4000.00', '600.00', '2020-06-10', 'C00022', 'A002');
INSERT INTO orders VALUES('200102', '2000.00', '300.00', '2020-05-25', 'C00012', 'A012');
INSERT INTO orders VALUES('200114', '3500.00', '2000.00', '2020-08-15', 'C00002','A008');
INSERT INTO orders VALUES('200122', '2500.00', '400.00', '2020-09-16', 'C00003', 'A004');
INSERT INTO orders VALUES('200118', '500.00', '100.00', '2020-07-20', 'C00023', 'A006');
INSERT INTO orders VALUES('200119', '4000.00', '700.00', '2020-09-16', 'C00007', 'A010');
INSERT INTO orders VALUES('200121', '1500.00', '600.00', '2020-09-23', 'C00008', 'A004');
INSERT INTO orders VALUES('200130', '2500.00', '400.00', '2020-07-30', 'C00025', 'A011');
INSERT INTO orders VALUES('200134', '4200.00', '1800.00', '2020-09-25', 'C00004','A005');
INSERT INTO orders VALUES('200108', '4000.00', '600.00', '2020-02-15', 'C00008', 'A004');
INSERT INTO orders VALUES('200103', '1500.00', '700.00', '2020-05-15', 'C00021', 'A005');
INSERT INTO orders VALUES('200105', '2500.00', '500.00', '2020-07-18', 'C00025', 'A011');
INSERT INTO orders VALUES('200109', '3500.00', '800.00', '2020-07-30', 'C00011', 'A010');
INSERT INTO orders VALUES('200101', '3000.00', '1000.00', '2020-07-15', 'C00001','A008');
INSERT INTO orders VALUES('200111', '1000.00', '300.00', '2020-07-10', 'C00020', 'A008');
INSERT INTO orders VALUES('200104', '1500.00', '500.00', '2020-03-13', 'C00006', 'A004');
INSERT INTO orders VALUES('200106', '2500.00', '700.00', '2020-04-20', 'C00005', 'A002');
INSERT INTO orders VALUES('200125', '2000.00', '600.00', '2020-10-01', 'C00018', 'A005');
INSERT INTO orders VALUES('200117', '800.00', '200.00', '2020-10-20', 'C00014', 'A001');
INSERT INTO orders VALUES('200123', '500.00', '100.00', '2020-09-16', 'C00022', 'A002');
INSERT INTO orders VALUES('200120', '500.00', '100.00', '2020-07-20', 'C00009', 'A002');
INSERT INTO orders VALUES('200116', '500.00', '100.00', '2020-07-13', 'C00010', 'A009');
INSERT INTO orders VALUES('200124', '500.00', '100.00', '2020-06-20', 'C00017', 'A007'); 
INSERT INTO orders VALUES('200126', '500.00', '100.00', '2020-06-24', 'C00022', 'A002');
INSERT INTO orders VALUES('200129', '2500.00', '500.00', '2020-07-20', 'C00024', 'A006');
INSERT INTO orders VALUES('200127', '2500.00', '400.00', '2020-07-20', 'C00015', 'A003');
INSERT INTO orders VALUES('200128', '3500.00', '1500.00', '2020-07-20', 'C00009','A002');
INSERT INTO orders VALUES('200135', '2000.00', '800.00', '2020-09-16', 'C00007', 'A010');
INSERT INTO orders VALUES('200131', '900.00', '150.00', '2020-08-26', 'C00012', 'A012');
INSERT INTO orders VALUES('200133', '1200.00', '400.00', '2020-06-29', 'C00009', 'A002');

AVG() 和SUM()

需求描述:

第三季度每個代理商的移動平均收入和總收入

SELECT
    agent_code,
    order_date,
    AVG( order_amount ) OVER ( PARTITION BY agent_code ORDER BY order_date)  avg_rev,
    SUM( order_amount ) OVER ( PARTITION BY agent_code ORDER BY order_date ) total_rev 
FROM
orders 
WHERE
order_date >= '2020-07-01' 
AND order_date <= '2020-09-30';

結果輸出

A002    2020-07-20      2000    4000
A002    2020-07-20      2000    4000
A002    2020-09-16      1500    4500
A003    2020-07-20      2500    2500
A003    2020-08-01      1750    3500
A004    2020-09-16      2500    2500
A004    2020-09-23      2000    4000
A005    2020-09-25      4200    4200
A006    2020-07-20      1500    3000
A006    2020-07-20      1500    3000
A008    2020-07-10      1000    1000
A008    2020-07-15      2000    4000
A008    2020-08-15      2500    7500
A009    2020-07-13      500     500
A010    2020-07-30      3500    3500
A010    2020-08-30      4000    8000
A010    2020-09-16      3500    14000
A010    2020-09-16      3500    14000
A011    2020-07-18      2500    2500
A011    2020-07-30      2500    5000
A012    2020-08-26      900     900

FIRST_VALUE()和 LAST_VALUE()

  • first_value: 取分組內排序後,截止到當前行,第一個值
  • last_value: 取分組內排序後,截止到當前行,最後一個值

需求描述

客戶首次購買後多少天才進行下一次購買

SELECT
    cust_code,
    order_date,
    datediff(order_date,FIRST_VALUE ( order_date ) OVER ( PARTITION BY cust_code ORDER BY order_date )) next_order_gap 
FROM
orders 
order by cust_code,next_order_gap

結果輸出

C00001  2020-07-15      0
C00002  2020-08-15      0
C00003  2020-09-16      0
C00004  2020-09-25      0
C00005  2020-04-20      0
C00006  2020-03-13      0
C00007  2020-08-30      0
C00007  2020-09-16      17
C00007  2020-09-16      17
C00008  2020-02-15      0
C00008  2020-09-23      221
C00009  2020-06-29      0
C00009  2020-07-20      21
C00009  2020-07-20      21
C00010  2020-07-13      0
C00011  2020-07-30      0
C00012  2020-05-25      0
C00012  2020-08-26      93
C00013  2020-08-01      0
C00014  2020-10-20      0
C00015  2020-07-20      0
C00016  2020-05-30      0
C00017  2020-06-20      0
C00018  2020-10-01      0
C00019  2020-04-15      0
C00020  2020-07-10      0
C00021  2020-05-15      0
C00022  2020-06-10      0
C00022  2020-06-24      14
C00022  2020-09-16      98
C00023  2020-07-20      0
C00024  2020-07-20      0
C00025  2020-07-18      0
C00025  2020-07-30      12

LEAD() 和 LAG()

  • lead(value_expr[,offset[,default]]):用於統計視窗內往下第n行值。第一個引數為列名,第二個引數為往下第n行(可選,預設為1),第三個引數為預設值(當往下第n行為NULL時候,取預設值,如不指定,則為NULL
  • lag(value_expr[,offset[,default]]): 與lead相反,用於統計視窗內往上第n行值。第一個引數為列名,第二個引數為往上第n行(可選,預設為1),第三個引數為預設值(當往上第n行為NULL時候,取預設值,如不指定,則為NULL)

需求描述

代理商最近一次出售的最高訂單金額是多少?

SELECT
	agent_code,
	order_amount,
	LAG ( order_amount, 1 ) OVER ( PARTITION BY agent_code ORDER BY order_amount DESC ) last_highest_amount 
FROM
	orders 
ORDER BY
	agent_code,
	order_amount DESC;

結果輸出

A001    800     NULL
A002    4000    NULL
A002    3500    4000
A002    2500    3500
A002    1200    2500
A002    500     1200
A002    500     500
A002    500     500
A003    2500    NULL
A003    1000    2500
A004    4000    NULL
A004    2500    4000
A004    1500    2500
A004    1500    1500
A005    4200    NULL
A005    2000    4200
A005    1500    2000
A006    2500    NULL
A006    500     2500
A007    2000    NULL
A007    500     2000
A008    3500    NULL
A008    3000    3500
A008    1000    3000
A009    500     NULL
A010    4500    NULL
A010    4000    4500
A010    3500    4000
A010    3000    3500
A010    2000    3000
A011    2500    NULL
A011    2500    2500
A012    2000    NULL
A012    900     2000

RANK() 和DENSE_RANK()

**rank:**對組中的資料進行排名,如果名次相同,則排名也相同,但是下一個名次的排名序號會出現不連續。比如查詢具體條件的topN行。RANK() 排序為 (1,2,2,4)

**dense_rank:**dense_rank函式的功能與rank函式類似,dense_rank函式在生成序號時是連續的,而rank函式生成的序號有可能不連續。當出現名次相同時,則排名序號也相同。而下一個排名的序號與上一個排名序號是連續的。

DENSE_RANK() 排序為 (1,2,2,3)

需求描述

每月第二高的訂單金額是多少?

SELECT
	order_num,
	order_date,
	order_amount,
	order_month 
FROM
	(
SELECT
	order_num,
	order_date,
	order_amount,
	DATE_FORMAT( order_date, 'YYYY-MM' ) AS order_month,
	DENSE_RANK ( ) OVER ( PARTITION BY DATE_FORMAT( order_date, 'YYYY-MM' ) ORDER BY order_amount DESC ) order_rank 
FROM
	orders 
	) t 
WHERE
	order_rank = 2 
ORDER BY
	order_date;

結果輸出

200106  2020-04-20      2500    2020-04
200103  2020-05-15      1500    2020-05
200133  2020-06-29      1200    2020-06
200101  2020-07-15      3000    2020-07
200114  2020-08-15      3500    2020-08
200119  2020-09-16      4000    2020-09
200117  2020-10-20      800     2020-10

CUME_DIST()

cume_dist:如果按升序排列,則統計:小於等於當前值的行數/總行數(number of rows ≤ current row)/(total number of rows)。如果是降序排列,則統計:大於等於當前值的行數/總行數。比如,統計小於等於當前工資的人數佔總人數的比例 ,用於累計統計。

需求描述

8月和9月每個訂單的收入百分比

先檢視一下8月和9月的資料,按訂單金額排序

SELECT
	order_num,
	order_amount,
	order_date,
	agent_code 
FROM
	orders 
WHERE
	order_date >= '2020-08-01' 
	AND order_date <= '2020-09-30' 
ORDER BY
	date_format( order_date, "YYYY-MM" ),
	order_amount;

其結果為:

SELECT
	DATE_FORMAT( order_date, 'YYYY-MM' ) AS order_month,
	agent_code,
	order_amount,
	CUME_DIST ( ) OVER ( PARTITION BY DATE_FORMAT( order_date, 'YYYY-MM' ) ORDER BY order_amount ) 
FROM
	orders 
WHERE
	order_date >= '2020-08-01' 
	AND order_date <= '2020-09-30';

結果輸出

2020-08 A012    900     0.25
2020-08 A003    1000    0.5
2020-08 A008    3500    0.75
2020-08 A010    4500    1.0
2020-09 A002    500     0.16666666666666666
2020-09 A004    1500    0.3333333333333333
2020-09 A010    2000    0.5
2020-09 A004    2500    0.6666666666666666
2020-09 A010    4000    0.8333333333333334
2020-09 A005    4200    1.0

公眾號『大資料技術與數倉』,回覆『資料』領取大資料資料包