Hive Data Processing: Cumulative Totals in a Report
阿新 • Published: 2018-11-24
Data:
+----------+---------+--------+
| username | month | salary |
+----------+---------+--------+
| A | 2015-01 | 5 |
| A | 2015-01 | 15 |
| B | 2015-01 | 6 |
| A | 2015-01 | 8 |
| B | 2015-01 | 25 |
| A | 2015-02 | 20 |
| B | 2015-02 | 15 |
| B | 2015-02 | 10 |
| A | 2015-02 | 7 |
| A | 2015-02 | 9 |
| B | 2015-02 | 6 |
+----------+---------+--------+
The table above is the raw data behind the report.
SQL file (a MySQL dump that creates and fills the table):
CREATE DATABASE /*!32312 IF NOT EXISTS*/`test` /*!40100 DEFAULT CHARACTER SET utf8 */;
USE `test`;

/*Table structure for table `t_access_times` */
DROP TABLE IF EXISTS `t_access_times`;
CREATE TABLE `t_access_times` (
  `username` varchar(10) DEFAULT NULL,
  `month` varchar(20) DEFAULT NULL,
  `salary` int(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

/*Data for the table `t_access_times` */
insert into `t_access_times`(`username`,`month`,`salary`) values
('A','2015-01',5),
('A','2015-01',15),
('B','2015-01',6),
('A','2015-01',8),
('B','2015-01',25),
('A','2015-02',20),
('B','2015-02',15),
('B','2015-02',10),
('A','2015-02',7),
('A','2015-02',9),
('B','2015-02',6);
The result we ultimately want from the query looks like this:
+----------+---------+--------+------------+
| username | MONTH | salary | accumulate |
+----------+---------+--------+------------+
| A | 2015-01 | 28 | 28 |
| A | 2015-02 | 36 | 64 |
| B | 2015-01 | 31 | 31 |
| B | 2015-02 | 31 | 62 |
+----------+---------+--------+------------+
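Analysis:
The first row is A's total salary for the first month. The last column is the cumulative salary: for each later month, it adds that month's salary to the salary of every earlier month for the same user.
Let's go through it step by step.
First, we preprocess the raw data.
Compute the total salary of each employee for each month: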
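To make the target concrete before turning to SQL, here is a minimal Python sketch of the same computation. The function name `cumulative_report` is a hypothetical helper for illustration, not something from the original post; the rows are copied from the data table.

```python
from collections import defaultdict
from itertools import accumulate

# Sample rows copied from the report data: (username, month, salary).
ROWS = [
    ("A", "2015-01", 5), ("A", "2015-01", 15), ("B", "2015-01", 6),
    ("A", "2015-01", 8), ("B", "2015-01", 25), ("A", "2015-02", 20),
    ("B", "2015-02", 15), ("B", "2015-02", 10), ("A", "2015-02", 7),
    ("A", "2015-02", 9), ("B", "2015-02", 6),
]


def cumulative_report(rows):
    """Return (username, month, monthly_total, running_total) tuples."""
    # Step 1: total salary per (username, month).
    monthly = defaultdict(int)
    for username, month, salary in rows:
        monthly[(username, month)] += salary

    # Step 2: within each user, accumulate the monthly totals in month order.
    report = []
    users = sorted({username for username, _ in monthly})
    for user in users:
        months = sorted(m for u, m in monthly if u == user)
        totals = [monthly[(user, m)] for m in months]
        for month, total, running in zip(months, totals, accumulate(totals)):
            report.append((user, month, total, running))
    return report


for row in cumulative_report(ROWS):
    print(row)
```

The two steps in the sketch correspond to the two stages of the SQL: aggregate per user and month first, then accumulate across months within each user.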
SELECT username,MONTH,SUM(salary) AS salary
FROM t_access_times
GROUP BY username,MONTH
+----------+---------+--------+
| username | MONTH | salary |
+----------+---------+--------+
| A | 2015-01 | 28 |
| A | 2015-02 | 36 |
| B | 2015-01 | 31 |
| B | 2015-02 | 31 |
+----------+---------+--------+
This gives the total salary of each employee for each month.
Next, we join two copies of this result with each other:
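If no Hive or MySQL instance is at hand, the aggregation step can be tried locally. The sketch below assumes Python's built-in `sqlite3` module as a stand-in engine; `monthly_totals` is a hypothetical helper name, and SQLite's type names differ slightly from the MySQL dump.

```python
import sqlite3

# Sample rows copied from the report data: (username, month, salary).
ROWS = [
    ("A", "2015-01", 5), ("A", "2015-01", 15), ("B", "2015-01", 6),
    ("A", "2015-01", 8), ("B", "2015-01", 25), ("A", "2015-02", 20),
    ("B", "2015-02", 15), ("B", "2015-02", 10), ("A", "2015-02", 7),
    ("A", "2015-02", 9), ("B", "2015-02", 6),
]


def monthly_totals(rows):
    """Load rows into an in-memory table and sum salary per user and month."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t_access_times (username TEXT, month TEXT, salary INTEGER)"
    )
    conn.executemany("INSERT INTO t_access_times VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT username, month, SUM(salary) AS salary "
        "FROM t_access_times "
        "GROUP BY username, month "
        "ORDER BY username, month"  # ORDER BY added only to make output stable
    ).fetchall()
    conn.close()
    return result


print(monthly_totals(ROWS))
```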
SELECT A.*,B.*
FROM
(SELECT username,MONTH,SUM(salary) AS salary FROM t_access_times GROUP BY username,MONTH)
A
INNER JOIN
(SELECT username,MONTH,SUM(salary) AS salary FROM t_access_times GROUP BY username,MONTH)
B
The two tables combined:
+----------+---------+--------+----------+---------+--------+
| username | MONTH | salary | username | MONTH | salary |
+----------+---------+--------+----------+---------+--------+
| A | 2015-01 | 28 | A | 2015-01 | 28 |
| A | 2015-02 | 36 | A | 2015-01 | 28 |
| B | 2015-01 | 31 | A | 2015-01 | 28 |
| B | 2015-02 | 31 | A | 2015-01 | 28 |
| A | 2015-01 | 28 | A | 2015-02 | 36 |
| A | 2015-02 | 36 | A | 2015-02 | 36 |
| B | 2015-01 | 31 | A | 2015-02 | 36 |
| B | 2015-02 | 31 | A | 2015-02 | 36 |
| A | 2015-01 | 28 | B | 2015-01 | 31 |
| A | 2015-02 | 36 | B | 2015-01 | 31 |
| B | 2015-01 | 31 | B | 2015-01 | 31 |
| B | 2015-02 | 31 | B | 2015-01 | 31 |
| A | 2015-01 | 28 | B | 2015-02 | 31 |
| A | 2015-02 | 36 | B | 2015-02 | 31 |
| B | 2015-01 | 31 | B | 2015-02 | 31 |
| B | 2015-02 | 31 | B | 2015-02 | 31 |
+----------+---------+--------+----------+---------+--------+
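Without a join condition this is a Cartesian product, so the rows that do not belong together have to be removed. Rows for B must not be paired with rows for A, so we add the join condition: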
ON A.username = B.username
SELECT A.*,B.*
FROM
(SELECT username,MONTH,SUM(salary) AS salary FROM t_access_times GROUP BY username,MONTH)
A
INNER JOIN
(SELECT username,MONTH,SUM(salary) AS salary FROM t_access_times GROUP BY username,MONTH)
B
ON A.username = B.username
Query result:
A 2015-01 28 A 2015-01 28
A 2015-01 28 A 2015-02 36
A 2015-02 36 A 2015-01 28
A 2015-02 36 A 2015-02 36
B 2015-01 31 B 2015-01 31
B 2015-01 31 B 2015-02 31
B 2015-02 31 B 2015-01 31
B 2015-02 31 B 2015-02 31
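The effect of the join condition can be checked by counting rows. This is a rough sketch using Python's built-in `sqlite3` as a stand-in engine; `count_pairs` and the `MONTHLY` string are hypothetical names introduced here for illustration.

```python
import sqlite3

# Sample rows copied from the report data: (username, month, salary).
ROWS = [
    ("A", "2015-01", 5), ("A", "2015-01", 15), ("B", "2015-01", 6),
    ("A", "2015-01", 8), ("B", "2015-01", 25), ("A", "2015-02", 20),
    ("B", "2015-02", 15), ("B", "2015-02", 10), ("A", "2015-02", 7),
    ("A", "2015-02", 9), ("B", "2015-02", 6),
]

# The per-user, per-month subquery that is joined with itself.
MONTHLY = (
    "(SELECT username, month, SUM(salary) AS salary "
    "FROM t_access_times GROUP BY username, month)"
)


def count_pairs(join_condition=""):
    """Count the rows produced by joining the monthly totals with themselves."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t_access_times (username TEXT, month TEXT, salary INTEGER)"
    )
    conn.executemany("INSERT INTO t_access_times VALUES (?, ?, ?)", ROWS)
    sql = f"SELECT COUNT(*) FROM {MONTHLY} A INNER JOIN {MONTHLY} B {join_condition}"
    (count,) = conn.execute(sql).fetchone()
    conn.close()
    return count


print(count_pairs())                              # no condition: 4 x 4 pairs
print(count_pairs("ON A.username = B.username"))  # same-user pairs only
```

With four monthly rows on each side, the unconditioned join yields 16 pairs, and restricting to matching usernames leaves 8, which matches the two result listings above.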
Now group by user and month, so that all rows for user A in January end up in one group, and likewise all rows for user B in January:
SELECT A.*,B.*
FROM
(SELECT username,MONTH,SUM(salary) AS salary FROM t_access_times GROUP BY username,MONTH)
A
INNER JOIN
(SELECT username,MONTH,SUM(salary) AS salary FROM t_access_times GROUP BY username,MONTH)
B
ON A.username = B.username
GROUP BY A.username,A.month
A 2015-01 28 A 2015-01 28
A 2015-02 36 A 2015-01 28
B 2015-01 31 B 2015-01 31
B 2015-02 31 B 2015-01 31
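Here we are selecting B.*. After grouping, each group contains two rows from B, but only one of them is displayed. (This only runs in MySQL with ONLY_FULL_GROUP_BY disabled; Hive rejects non-aggregated columns that are missing from GROUP BY.) Instead of picking one row, we can sum the salaries of B's two rows.
Each group also contains two rows from A, but those two rows are identical.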
-- A's columns are listed explicitly and all appear in GROUP BY, so the query is also valid in Hive
SELECT A.username,A.month,A.salary,SUM(B.salary)
FROM
(SELECT username,MONTH,SUM(salary) AS salary FROM t_access_times GROUP BY username,MONTH)
A
INNER JOIN
(SELECT username,MONTH,SUM(salary) AS salary FROM t_access_times GROUP BY username,MONTH)
B
ON A.username = B.username
GROUP BY A.username,A.month,A.salary
Result:
A 2015-01 28 (36+28) = 64
A 2015-02 36 (36+28) = 64
B 2015-01 31 (31+31) = 62
B 2015-02 31 (31+31) = 62
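The last column now sums the user's salary over all months (only two here), which is not what we want: each row should include only its own month and the months before it. So we keep only the pairs where B's month is not later than A's month: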
-- A's columns are listed explicitly and all appear in GROUP BY, so the query is also valid in Hive
SELECT A.username,A.month,A.salary,SUM(B.salary)
FROM
(SELECT username,MONTH,SUM(salary) AS salary FROM t_access_times GROUP BY username,MONTH)
A
INNER JOIN
(SELECT username,MONTH,SUM(salary) AS salary FROM t_access_times GROUP BY username,MONTH)
B
ON A.username = B.username
WHERE A.month >= B.month
GROUP BY A.username,A.month,A.salary
This gives:
A 2015-01 28 28
A 2015-02 36 (36+28) = 64
B 2015-01 31 31
B 2015-02 31 (31+31) = 62
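To see why the month filter produces a running total, it helps to list the (A, B) pairs that survive it before they are summed. The sketch below again assumes Python's built-in `sqlite3` as a stand-in engine; `surviving_pairs` is a hypothetical helper. Note that the comparison works because months are stored as zero-padded `YYYY-MM` strings, which sort lexicographically in chronological order.

```python
import sqlite3

# Sample rows copied from the report data: (username, month, salary).
ROWS = [
    ("A", "2015-01", 5), ("A", "2015-01", 15), ("B", "2015-01", 6),
    ("A", "2015-01", 8), ("B", "2015-01", 25), ("A", "2015-02", 20),
    ("B", "2015-02", 15), ("B", "2015-02", 10), ("A", "2015-02", 7),
    ("A", "2015-02", 9), ("B", "2015-02", 6),
]

# The per-user, per-month subquery that is joined with itself.
MONTHLY = (
    "(SELECT username, month, SUM(salary) AS salary "
    "FROM t_access_times GROUP BY username, month)"
)


def surviving_pairs():
    """Return (username, A.month, B.month, B.salary) for pairs kept by the filter."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t_access_times (username TEXT, month TEXT, salary INTEGER)"
    )
    conn.executemany("INSERT INTO t_access_times VALUES (?, ?, ?)", ROWS)
    sql = (
        "SELECT A.username, A.month, B.month, B.salary "
        f"FROM {MONTHLY} A INNER JOIN {MONTHLY} B "
        "ON A.username = B.username "
        "WHERE A.month >= B.month "  # keep B's month only if it is not later than A's
        "ORDER BY A.username, A.month, B.month"
    )
    pairs = conn.execute(sql).fetchall()
    conn.close()
    return pairs


for pair in surviving_pairs():
    print(pair)
```

Each A row for 2015-01 keeps a single B row (its own month), while each A row for 2015-02 keeps two, so summing B.salary per A row yields 28 and 28+36 for A, and 31 and 31+31 for B.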
Finally, name the new column and sort the result:
-- A's columns are listed explicitly and all appear in GROUP BY, so the query is also valid in Hive
SELECT A.username,A.month,A.salary,SUM(B.salary) AS accumulate
FROM
(SELECT username,MONTH,SUM(salary) AS salary FROM t_access_times GROUP BY username,MONTH)
A
INNER JOIN
(SELECT username,MONTH,SUM(salary) AS salary FROM t_access_times GROUP BY username,MONTH)
B
ON A.username = B.username
WHERE A.month >= B.month
GROUP BY A.username,A.month,A.salary
ORDER BY A.username,A.month;
Final result:
+----------+---------+--------+------------+
| username | MONTH | salary | accumulate |
+----------+---------+--------+------------+
| A | 2015-01 | 28 | 28 |
| A | 2015-02 | 36 | 64 |
| B | 2015-01 | 31 | 31 |
| B | 2015-02 | 31 | 62 |
+----------+---------+--------+------------+
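As a cross-check, the same table can be produced with a window function instead of a self-join. This is a different technique from the one walked through above, shown only as a sketch: it assumes an engine with window-function support (Hive has had windowing since 0.11; the bundled SQLite must be 3.25 or newer), and `running_totals` is a hypothetical helper name.

```python
import sqlite3

# Sample rows copied from the report data: (username, month, salary).
ROWS = [
    ("A", "2015-01", 5), ("A", "2015-01", 15), ("B", "2015-01", 6),
    ("A", "2015-01", 8), ("B", "2015-01", 25), ("A", "2015-02", 20),
    ("B", "2015-02", 15), ("B", "2015-02", 10), ("A", "2015-02", 7),
    ("A", "2015-02", 9), ("B", "2015-02", 6),
]


def running_totals(rows):
    """Monthly totals plus a per-user running total via SUM(...) OVER (...)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t_access_times (username TEXT, month TEXT, salary INTEGER)"
    )
    conn.executemany("INSERT INTO t_access_times VALUES (?, ?, ?)", rows)
    sql = (
        "SELECT username, month, salary, "
        # With ORDER BY and no explicit frame, the window covers the start of
        # the partition up to the current row, i.e. a running total per user.
        "SUM(salary) OVER (PARTITION BY username ORDER BY month) AS accumulate "
        "FROM (SELECT username, month, SUM(salary) AS salary "
        "FROM t_access_times GROUP BY username, month) m "
        "ORDER BY username, month"
    )
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


for row in running_totals(ROWS):
    print(row)
```

The window version reads each monthly row once, whereas the self-join compares every month of a user with every other month of that user, so the self-join's intermediate result grows quadratically with the number of months per user.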