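row_number() over(partition by) and first_value over(partition by): practical examples and the difference between the two. Getting the top row of each group in MySQL and Hive
Requirements: 1: There are 5 id groups; find the person with the highest salary in each group
2: If several people in a group share the highest salary, take the oldest of them
Data sources: MySQL, Hive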
id | name | age | salary |
---|---|---|---|
1 | a1 | 10 | 80 |
1 | a2 | 11 | 65 |
1 | a3 | 5 | 90 |
2 | b1 | 12 | 130 |
2 | b2 | 13 | 45 |
2 | b3 | 14 | 80 |
3 | c1 | 14 | 300 |
3 | c2 | 15 | 900 |
3 | c3 | 16 | 900 |
4 | d1 | 16 | 500 |
4 | d2 | 16 | 600 |
4 | d3 | 17 | 300 |
5 | e1 | 20 | 200 |
5 | e2 | 20 | 200 |
5 | e3 | 19 | 100 |
I. Data generation
1: Create the table
CREATE TABLE `test` (
  `id` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `name` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `age` int(255) NULL DEFAULT NULL,
  `salary` int(10) NULL DEFAULT NULL
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;
2: Insert the data
insert into test(id,name,age,salary) values(1,'a1',10,80);
insert into test(id,name,age,salary) values(1,'a2',11,65);
insert into test(id,name,age,salary) values(1,'a3',05,90);
insert into test(id,name,age,salary) values(2,'b1',12,130);
insert into test(id,name,age,salary) values(2,'b2',13,45);
insert into test(id,name,age,salary) values(2,'b3',14,80);
insert into test(id,name,age,salary) values(3,'c1',14,300);
insert into test(id,name,age,salary) values(3,'c2',15,900);
insert into test(id,name,age,salary) values(3,'c3',16,900);
insert into test(id,name,age,salary) values(4,'d1',16,500);
insert into test(id,name,age,salary) values(4,'d2',16,600);
insert into test(id,name,age,salary) values(4,'d3',17,300);
insert into test(id,name,age,salary) values(5,'e1',20,200);
insert into test(id,name,age,salary) values(5,'e2',20,200);
insert into test(id,name,age,salary) values(5,'e3',19,100);
II. Implementing the requirements (MySQL)
Finding the highest salary in each group with MySQL
SELECT * FROM test WHERE (id,salary) IN (
SELECT id,MAX(salary) FROM test GROUP BY id
)
id | name | age | salary |
---|---|---|---|
1 | a3 | 5 | 90 |
2 | b1 | 12 | 130 |
3 | c2 | 15 | 900 |
3 | c3 | 16 | 900 |
4 | d2 | 16 | 600 |
5 | e1 | 20 | 200 |
5 | e2 | 20 | 200 |
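As you can see, the result is correct so far, but groups 3 and 5 each return two rows. In group 3 the two top rows have the same salary but different ages; in group 5 the two top rows have the same salary and the same age.
So now we implement the second requirement: when salaries are equal, take the row with the largest age; when salary and age are both equal, take any one row from the group. Group 3 should therefore return the row with name c3, and group 5 any one of its tied rows.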
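To make the tie behaviour easy to reproduce, here is a minimal sketch that runs the same `(id, salary) IN (...)` query against an in-memory SQLite database. SQLite is only a stand-in for MySQL here, and the `load_test_table` and `top_salary_rows` helpers, the integer `id` column, and the added `ORDER BY` are assumptions made for the illustration.

```python
import sqlite3

# Same 15 rows as the sample data: (id, name, age, salary).
ROWS = [
    (1, "a1", 10, 80), (1, "a2", 11, 65), (1, "a3", 5, 90),
    (2, "b1", 12, 130), (2, "b2", 13, 45), (2, "b3", 14, 80),
    (3, "c1", 14, 300), (3, "c2", 15, 900), (3, "c3", 16, 900),
    (4, "d1", 16, 500), (4, "d2", 16, 600), (4, "d3", 17, 300),
    (5, "e1", 20, 200), (5, "e2", 20, 200), (5, "e3", 19, 100),
]


def load_test_table():
    """Hypothetical helper: build the `test` table in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test (id INTEGER, name TEXT, age INTEGER, salary INTEGER)")
    conn.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", ROWS)
    return conn


def top_salary_rows(conn):
    """Return every row whose salary equals the maximum salary of its id group."""
    sql = """
        SELECT id, name, age, salary
        FROM test
        WHERE (id, salary) IN (SELECT id, MAX(salary) FROM test GROUP BY id)
        ORDER BY id, name  -- added only to make the output order stable
    """
    return conn.execute(sql).fetchall()


rows = top_salary_rows(load_test_table())
for row in rows:
    print(row)
```

Seven rows come back rather than five, because the filter keeps every row that ties for the group maximum.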
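-- Innermost query: highest salary per id.
-- Middle query: among those rows, the highest age per id.
-- Outer query: if salary and age both tie, MIN(name) keeps a single row.
-- Every selected column is grouped or aggregated, so this also runs with ONLY_FULL_GROUP_BY enabled.
SELECT
    id,
    MIN( name ) AS name,
    age,
    salary
FROM
    test
WHERE
    ( id, age, salary ) IN (
        SELECT
            T.id,
            MAX( T.age ),
            T.salary
        FROM
            test T
        WHERE
            ( T.id, T.salary ) IN ( SELECT T1.id, MAX( T1.salary ) FROM test T1 GROUP BY T1.id )
        GROUP BY
            T.id,
            T.salary
    )
GROUP BY
    id,
    age,
    salary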
id | name | age | salary |
---|---|---|---|
1 | a3 | 5 | 90 |
2 | b1 | 12 | 130 |
3 | c3 | 16 | 900 |
4 | d2 | 16 | 600 |
5 | e1 | 20 | 200 |
This result now meets the requirements.
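One way to sanity-check this result is to restate the two requirements as a single rule: within each id, pick the row with the largest (salary, age) pair. The sketch below applies that rule in plain Python to the sample data; the `best_per_group` helper is hypothetical and is not part of the original solution.

```python
from itertools import groupby
from operator import itemgetter

# Same 15 rows as the sample data: (id, name, age, salary).
ROWS = [
    (1, "a1", 10, 80), (1, "a2", 11, 65), (1, "a3", 5, 90),
    (2, "b1", 12, 130), (2, "b2", 13, 45), (2, "b3", 14, 80),
    (3, "c1", 14, 300), (3, "c2", 15, 900), (3, "c3", 16, 900),
    (4, "d1", 16, 500), (4, "d2", 16, 600), (4, "d3", 17, 300),
    (5, "e1", 20, 200), (5, "e2", 20, 200), (5, "e3", 19, 100),
]


def best_per_group(rows):
    """Return one row per id: highest salary first, then highest age.

    When salary and age both tie, max() keeps the first such row it sees,
    which matches the "any one row" rule from the requirements.
    """
    by_id = sorted(rows, key=itemgetter(0))  # stable sort keeps input order inside each id
    return [
        max(group, key=lambda r: (r[3], r[2]))  # compare (salary, age) pairs
        for _, group in groupby(by_id, key=itemgetter(0))
    ]


for row in best_per_group(ROWS):
    print(row)
```

The same (salary, age) ordering is what an `ORDER BY salary DESC, age DESC` clause expresses inside a window function.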
III. Implementing the requirements (Hive)
Before using window functions, it helps to first understand what a window function is.
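As a rough illustration of the idea, the sketch below contrasts `GROUP BY` with a window function: `GROUP BY` collapses each id to one row, while `MAX(salary) OVER (PARTITION BY id)` keeps all rows and attaches the group maximum to each. It runs on SQLite (3.25 or later is assumed for window function support) as a stand-in for Hive; `load_test_table` and the `group_max` alias are hypothetical names.

```python
import sqlite3

# Same 15 rows as the sample data: (id, name, age, salary).
ROWS = [
    (1, "a1", 10, 80), (1, "a2", 11, 65), (1, "a3", 5, 90),
    (2, "b1", 12, 130), (2, "b2", 13, 45), (2, "b3", 14, 80),
    (3, "c1", 14, 300), (3, "c2", 15, 900), (3, "c3", 16, 900),
    (4, "d1", 16, 500), (4, "d2", 16, 600), (4, "d3", 17, 300),
    (5, "e1", 20, 200), (5, "e2", 20, 200), (5, "e3", 19, 100),
]


def load_test_table():
    """Hypothetical helper: build the `test` table in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test (id INTEGER, name TEXT, age INTEGER, salary INTEGER)")
    conn.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", ROWS)
    return conn


conn = load_test_table()

# GROUP BY: one output row per id, the other columns are lost.
grouped = conn.execute(
    "SELECT id, MAX(salary) FROM test GROUP BY id ORDER BY id"
).fetchall()

# Window function: all 15 rows survive, each carrying its group's maximum.
windowed = conn.execute(
    """
    SELECT id, name, salary,
           MAX(salary) OVER (PARTITION BY id) AS group_max
    FROM test
    ORDER BY id, name
    """
).fetchall()

print(grouped)
for row in windowed:
    print(row)
```

Because the detail rows are preserved, a window function can rank or compare rows inside a group without a self-join or nested `IN` subquery.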
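With window functions understood, we can return to the requirements. This time we need one of two functions, row_number() over(partition by) or first_value over(partition by). The implementation is as follows: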
-- age desc breaks salary ties, so the oldest of the top earners gets rn = 1
SELECT id,name,age,salary,row_number() over (PARTITION BY id ORDER BY salary desc, age desc) rn from salarytest
id | name | age | salary | rn |
---|---|---|---|---|
1 | a3 | 5 | 90 | 1 |
1 | a1 | 10 | 80 | 2 |
1 | a2 | 11 | 65 | 3 |
2 | b1 | 12 | 130 | 1 |
2 | b3 | 14 | 80 | 2 |
2 | b2 | 13 | 45 | 3 |
3 | c3 | 16 | 900 | 1 |
3 | c2 | 15 | 900 | 2 |
3 | c1 | 14 | 300 | 3 |
4 | d2 | 16 | 600 | 1 |
4 | d1 | 16 | 500 | 2 |
4 | d3 | 17 | 300 | 3 |
5 | e1 | 20 | 200 | 1 |
5 | e2 | 20 | 200 | 2 |
5 | e3 | 19 | 100 | 3 |
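The numbered rows above still have to be reduced to one row per group. A minimal sketch of that last step follows, again on SQLite (3.25 or later assumed) as a stand-in for Hive and against a `test` table rather than `salarytest`. It shows both functions side by side: `row_number()` gives each row in a partition a distinct number, so an outer `WHERE rn = 1` keeps the whole top row, while `first_value(col)` returns one column of the partition's first row on every row, so `DISTINCT` is needed to collapse the duplicates. The extra `name` sort key, the `ranked` and `top_name` aliases, and the `load_test_table` helper are assumptions added so the output is deterministic.

```python
import sqlite3

# Same 15 rows as the sample data: (id, name, age, salary).
ROWS = [
    (1, "a1", 10, 80), (1, "a2", 11, 65), (1, "a3", 5, 90),
    (2, "b1", 12, 130), (2, "b2", 13, 45), (2, "b3", 14, 80),
    (3, "c1", 14, 300), (3, "c2", 15, 900), (3, "c3", 16, 900),
    (4, "d1", 16, 500), (4, "d2", 16, 600), (4, "d3", 17, 300),
    (5, "e1", 20, 200), (5, "e2", 20, 200), (5, "e3", 19, 100),
]


def load_test_table():
    """Hypothetical helper: build the `test` table in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test (id INTEGER, name TEXT, age INTEGER, salary INTEGER)")
    conn.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", ROWS)
    return conn


# row_number(): number the rows inside each id, then keep only rn = 1.
ROW_NUMBER_SQL = """
    SELECT id, name, age, salary
    FROM (
        SELECT id, name, age, salary,
               ROW_NUMBER() OVER (
                   PARTITION BY id
                   ORDER BY salary DESC, age DESC, name  -- name: assumed final tie-breaker
               ) AS rn
        FROM test
    ) AS ranked
    WHERE rn = 1
    ORDER BY id
"""

# first_value(): every row of a partition gets the name of that partition's
# first row, so DISTINCT collapses each id to a single (id, top_name) pair.
FIRST_VALUE_SQL = """
    SELECT DISTINCT id,
           FIRST_VALUE(name) OVER (
               PARTITION BY id
               ORDER BY salary DESC, age DESC, name
           ) AS top_name
    FROM test
    ORDER BY id
"""

conn = load_test_table()
top_rows = conn.execute(ROW_NUMBER_SQL).fetchall()
top_names = conn.execute(FIRST_VALUE_SQL).fetchall()

for row in top_rows:
    print(row)
print(top_names)
```

In this sketch `row_number()` is the more direct fit when the whole row is wanted, whereas `first_value` would need one call per column to rebuild it.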