Hive Data Analysis in Practice
Source: 企鵝號 - 程式猿的修身養性
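1. Preparation
Hive is built on MapReduce for distributed computation and HDFS for distributed storage, so Hadoop must be running before you work with data in Hive. If you have already set up Hadoop in pseudo-distributed mode, run the command start-all.sh and wait for Hadoop to finish starting.
Data analysis with Hive naturally requires the Hive data warehouse tool to be installed and configured. Installation and configuration are not covered here; see the earlier articles on the subject. This article uses Hive in local mode (with the metadata stored in an external MySQL database). Run the command hive and wait for Hive to finish starting. As shown in the figure below, you can then carry out data analysis in the Hive shell.
Before starting the actual analysis, here are a few basic Hive commands.
Create a database
create database mytest;
Switch to a given database
use mytest;
View information about a given database
describe database mytest;
View detailed information about a given table
desc formatted special1;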
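2. The SUM, AVG, MIN and MAX functions
A. Data preparation
Create a file named special1 and enter the test data into it, as shown in the figure below:
Then copy the special1 file to the chosen directory (/root/temp is used here) and run the following command to create the corresponding external table: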
CREATE EXTERNAL TABLE special1 (
cookieid string,createtime string,pv INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as textfile location '/root/temp/special1/';
Finally, run the following command to load the data from the local file special1 into the table special1:
load data local inpath '/root/temp/special1' into table special1;
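B. Using the SUM function
Purpose: computes totals and running (cumulative) totals within a partition. Note that the result depends on ORDER BY, which sorts in ascending order by default. The command is as follows:
SELECT cookieid,createtime,pv,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1,-- default frame: from the start of the partition to the current row
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2,-- from the start of the partition to the current row; same result as pv1
SUM(pv) OVER(PARTITION BY cookieid) AS pv3,-- all rows in the partition; in the author's run this also left the final output in descending order
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,-- current row + the 3 preceding rows
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,-- current row + the 3 preceding rows + the 1 following row
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6-- current row + all following rows
FROM special1;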
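The result is shown in the figure below:
Explanation:
pv1: the running total of pv from the start of the partition to the current row. For example, pv1 for the 11th is the pv of the 10th plus the pv of the 11th, and pv1 for the 12th is the pv of the 10th plus the 11th plus the 12th;
pv2: computed the same way as pv1;
pv3: the sum of all pv values in the partition (cookie1);
pv4: the current row plus the 3 preceding rows within the partition. For example, 11th = 10th + 11th; 12th = 10th + 11th + 12th; 13th = 10th + 11th + 12th + 13th; 14th = 11th + 12th + 13th + 14th;
pv5: the current row plus the 3 preceding rows plus the 1 following row. For example, 14th = 11th + 12th + 13th + 14th + 15th = 5+7+3+2+4 = 21;
pv6: the current row plus all following rows. For example, 13th = 13th + 14th + 15th + 16th = 3+2+4+4 = 13, and 14th = 14th + 15th + 16th = 2+4+4 = 10;
If ROWS BETWEEN is omitted while ORDER BY is present, the frame defaults to running from the start of the partition to the current row;
If ORDER BY is omitted, all values in the partition are summed;
The key is to understand ROWS BETWEEN, also called the WINDOW clause:
PRECEDING: rows before; FOLLOWING: rows after; CURRENT ROW: the current row
UNBOUNDED: no limit. UNBOUNDED PRECEDING means starting from the first row of the partition, and UNBOUNDED FOLLOWING means ending at the last row of the partition
The other functions, AVG, MIN and MAX, are used in the same way as SUM.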
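To make the frame arithmetic above easier to check by hand, here is a minimal Python sketch that emulates the ROWS BETWEEN frames of pv1, pv3, pv4, pv5 and pv6 for a single partition. It is not Hive code. The pv values for the 11th to the 16th are taken from the worked sums above; the value for the 10th (1) and the helper name frame_sum are assumptions made for illustration.

```python
def frame_sum(values, i, preceding=None, following=0):
    """Sum the rows of the frame around row i.

    preceding / following are row counts; None stands for UNBOUNDED.
    """
    start = 0 if preceding is None else max(0, i - preceding)
    last = len(values) - 1
    end = last if following is None else min(last, i + following)
    return sum(values[start:end + 1])


# One partition (cookie1), already ordered by createtime.
days = [10, 11, 12, 13, 14, 15, 16]
pv = [1, 5, 7, 3, 2, 4, 4]  # the first value (for the 10th) is assumed

results = {}
for i, day in enumerate(days):
    results[day] = {
        "pv1": frame_sum(pv, i),                               # UNBOUNDED PRECEDING .. CURRENT ROW
        "pv3": sum(pv),                                        # whole partition
        "pv4": frame_sum(pv, i, preceding=3),                  # 3 PRECEDING .. CURRENT ROW
        "pv5": frame_sum(pv, i, preceding=3, following=1),     # 3 PRECEDING .. 1 FOLLOWING
        "pv6": frame_sum(pv, i, preceding=0, following=None),  # CURRENT ROW .. UNBOUNDED FOLLOWING
    }

for day in days:
    print(day, results[day])
```

The entries for the 13th and 14th reproduce the sums worked out in the explanation (pv5 = 21 for the 14th, pv6 = 13 and 10 for the 13th and 14th).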
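C. Using the AVG function
Purpose: computes the average of pv over the rows of the window frame within a partition. The command is as follows:
SELECT cookieid,createtime,pv,
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1,-- default frame: from the start of the partition to the current row
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2,-- from the start of the partition to the current row; same result as pv1
AVG(pv) OVER(PARTITION BY cookieid) AS pv3,-- all rows in the partition; in the author's run this also left the final output in descending order
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,-- current row + the 3 preceding rows
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,-- current row + the 3 preceding rows + the 1 following row
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6-- current row + all following rows
FROM special1;
The result is shown in the figure below: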
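D. Using the MIN function
Purpose: computes the minimum of pv over the rows of the window frame within a partition. The command is as follows:
SELECT cookieid,createtime,pv,
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1,-- default frame: from the start of the partition to the current row
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2,-- from the start of the partition to the current row; same result as pv1
MIN(pv) OVER(PARTITION BY cookieid) AS pv3,-- all rows in the partition; in the author's run this also left the final output in descending order
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,-- current row + the 3 preceding rows
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,-- current row + the 3 preceding rows + the 1 following row
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6-- current row + all following rows
FROM special1;
The result is shown in the figure below: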
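E. Using the MAX function
Purpose: computes the maximum of pv over the rows of the window frame within a partition. The command is as follows:
SELECT cookieid,createtime,pv,
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1,-- default frame: from the start of the partition to the current row
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2,-- from the start of the partition to the current row; same result as pv1
MAX(pv) OVER(PARTITION BY cookieid) AS pv3,-- all rows in the partition
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,-- current row + the 3 preceding rows
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,-- current row + the 3 preceding rows + the 1 following row
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6-- current row + all following rows
FROM special1;
The result is shown in the figure below: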
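Since AVG, MIN and MAX reuse the same frames as SUM, only the aggregate applied to each frame changes. The Python sketch below illustrates this for the pv4 frame (ROWS BETWEEN 3 PRECEDING AND CURRENT ROW). It emulates the semantics rather than running Hive; the pv value for the first row (1) and the helper name frame are assumptions, while the other values come from the sums in the SUM explanation.

```python
from statistics import mean


def frame(values, i, preceding, following):
    """Rows of the frame [i - preceding, i + following], clipped to the partition."""
    start = max(0, i - preceding)
    end = min(len(values) - 1, i + following)
    return values[start:end + 1]


# One partition ordered by createtime (10th .. 16th); the first value is assumed.
pv = [1, 5, 7, 3, 2, 4, 4]

# Same frame as pv4, three different aggregates.
pv4 = []
for i in range(len(pv)):
    window = frame(pv, i, preceding=3, following=0)
    pv4.append({"avg": mean(window), "min": min(window), "max": max(window)})

for row in pv4:
    print(row)
```

For the 14th (index 4) the frame is 5, 7, 3, 2, so the sketch yields an average of 4.25, a minimum of 2 and a maximum of 7.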
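3. The NTILE, ROW_NUMBER, RANK and DENSE_RANK functions
A. Data preparation
Create a file named special2 and enter the test data into it, as shown in the figure below:
Then copy the special2 file to the chosen directory (/root/temp is used here) and run the following command to create the corresponding external table: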
CREATE EXTERNAL TABLE special2 (
cookieid string,createtime string,pv INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as textfile location '/root/temp/special2/';
Finally, run the following command to load the data from the local file special2 into the table special2:
load data local inpath '/root/temp/special2' into table special2;
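B. Using the NTILE function
Purpose: NTILE(n) splits the ordered rows of a partition into n buckets and returns the bucket number of the current row.
NTILE does not support ROWS BETWEEN; for example, NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) is not allowed. If the rows cannot be divided evenly, the extra rows go to the first bucket by default. The command is as follows:
SELECT cookieid,createtime,pv,
NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1,-- split the rows of each partition into 2 buckets
NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2,-- split the rows of each partition into 3 buckets
NTILE(4) OVER(ORDER BY createtime) AS rn3-- split all rows into 4 buckets
FROM special2
ORDER BY cookieid,createtime;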
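The result is shown in the figure below:
As another example, to find for each cookie the top third of its days by pv (the rows with rn = 1), use the following command:
SELECT cookieid,createtime,pv,
NTILE(3) OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn
FROM special2;
The result is shown in the figure below: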
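The bucket assignment can be sketched in a few lines of Python. This is an emulation, not Hive's implementation: it assumes the usual SQL rule that, when the rows do not divide evenly, the leading buckets each receive one extra row, which matches the statement above whenever the remainder is 1. The function name ntile and the row count of 7 are chosen for illustration.

```python
def ntile(n, row_count):
    """Bucket number (1..n) for each of row_count ordered rows."""
    base, extra = divmod(row_count, n)
    buckets = []
    for bucket in range(1, n + 1):
        # The first `extra` buckets take one additional row (assumed rule).
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets


# A hypothetical partition of 7 rows, already sorted.
print(ntile(2, 7))  # rows 1-4 in bucket 1, rows 5-7 in bucket 2
print(ntile(3, 7))  # bucket sizes 3, 2, 2
```

With ORDER BY pv DESC, the rows that land in bucket 1 of ntile(3, ...) are the top third by pv, which is what the second query above selects on.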
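C. Using the ROW_NUMBER function
Purpose: numbers the rows of a partition sequentially, starting from 1, in the specified order. For example, ordering by pv descending gives each day's pv rank within the partition. ROW_NUMBER() has many uses, such as fetching the top-ranked record of each partition or the first referer of a session. The command is as follows: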
SELECT cookieid,createtime,pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn
FROM special2;
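The result is shown in the figure below:
D. Using the RANK and DENSE_RANK functions
Purpose: RANK() returns the rank of each row within its partition and leaves gaps in the numbering after ties; DENSE_RANK() returns the rank of each row within its partition without leaving gaps after ties. The command is as follows: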
SELECT cookieid,createtime,pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
FROM special2
WHERE cookieid = 'cookie1';
The result is shown in the figure below:
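The difference between the three numbering functions only shows up when there are ties. The following Python sketch emulates RANK, DENSE_RANK and ROW_NUMBER over one partition that is already sorted by pv descending. The pv values are hypothetical and include one tie on purpose; the function name rankings is an illustration.

```python
def rankings(sorted_values):
    """Return (value, rank, dense_rank, row_number) for pre-sorted values."""
    out = []
    rank = dense = 0
    previous = object()  # sentinel that equals no real value
    for row_number, value in enumerate(sorted_values, start=1):
        if value != previous:
            rank = row_number  # RANK jumps to the current position after a tie
            dense += 1         # DENSE_RANK advances by exactly one
            previous = value
        out.append((value, rank, dense, row_number))
    return out


# Hypothetical pv values for one cookieid, sorted descending, with a tie on 4.
for row in rankings([7, 5, 4, 4, 3, 2, 1]):
    print(row)
```

Both rows with pv = 4 get rank 3; the next row is ranked 5 by RANK (a gap) and 4 by DENSE_RANK, while ROW_NUMBER simply counts 1 to 7.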
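4. The CUME_DIST and PERCENT_RANK functions
A. Data preparation
Create a file named special3 and enter the test data into it, as shown in the figure below:
Then copy the special3 file to the chosen directory (/root/temp is used here) and run the following command to create the corresponding external table: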
CREATE EXTERNAL TABLE special3 (
dept STRING,userid string,sal INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as textfile location '/root/temp/special3/';
Finally, run the following command to load the data from the local file special3 into the table special3:
load data local inpath '/root/temp/special3' into table special3;
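B. Using the CUME_DIST function
Purpose: returns (number of rows whose value is less than or equal to the current value) / (total number of rows in the partition). For example, it gives the proportion of all employees whose salary is less than or equal to the current salary. The command is as follows: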
SELECT dept,userid,sal,
CUME_DIST() OVER(ORDER BY sal) AS rn1,
CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM special3;
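The result is shown in the figure below:
C. Using the PERCENT_RANK function
Purpose: returns (RANK of the current row within the partition - 1) / (total number of rows in the partition - 1). This function is rather specialized, and the author is not familiar with its typical use cases. The command is as follows:
SELECT dept,userid,sal,
PERCENT_RANK() OVER(ORDER BY sal) AS rn1,-- PERCENT_RANK with the whole table as one partition
RANK() OVER(ORDER BY sal) AS rn11,-- RANK value within the partition
SUM(1) OVER(PARTITION BY NULL) AS rn12,-- total number of rows in the partition
PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM special3;
The result is shown in the figure below: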
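Both formulas are easy to verify on a small list. The Python sketch below applies the definitions given above to a single partition; the salary figures are hypothetical, and the function names cume_dist and percent_rank simply mirror the Hive function names.

```python
def cume_dist(values):
    """(rows with value <= current value) / (rows in partition), per row."""
    n = len(values)
    return [sum(1 for x in values if x <= v) / n for v in values]


def percent_rank(values):
    """(RANK - 1) / (rows in partition - 1), per row; 0 for a single-row partition."""
    n = len(values)
    if n == 1:
        return [0.0]
    # RANK of v is 1 + the number of rows strictly smaller than v.
    return [sum(1 for x in values if x < v) / (n - 1) for v in values]


# Hypothetical salaries of one department, with one tie.
sal = [1000, 2000, 3000, 3000, 4000]
print(cume_dist(sal))
print(percent_rank(sal))
```

For the two tied salaries of 3000, four of the five rows are less than or equal to 3000, so CUME_DIST is 0.8, while their RANK is 3, so PERCENT_RANK is (3 - 1) / (5 - 1) = 0.5.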
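5. The LAG, LEAD, FIRST_VALUE and LAST_VALUE functions
A. Data preparation
Create a file named special4 and enter the test data into it, as shown in the figure below:
Then copy the special4 file to the chosen directory (/root/temp is used here) and run the following command to create the corresponding external table: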
CREATE EXTERNAL TABLE special4 (
cookieid string,createtime string,url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as textfile location '/root/temp/special4/';
Finally, run the following command to load the data from the local file special4 into the table special4:
load data local inpath '/root/temp/special4' into table special4;
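B. Using the LAG function
Purpose: LAG(col,n,DEFAULT) returns the value of col from the row n rows before the current row within the window. The first argument is the column name; the second is the offset n (optional, default 1); the third is the default value, returned when there is no row n rows back (NULL if not specified). The command is as follows: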
SELECT cookieid,createtime,url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM special4;
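The result is shown in the figure below:
C. Using the LEAD function
Purpose: the opposite of LAG. LEAD(col,n,DEFAULT) returns the value of col from the row n rows after the current row within the window. The first argument is the column name; the second is the offset n (optional, default 1); the third is the default value, returned when there is no row n rows ahead (NULL if not specified). The command is as follows: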
SELECT cookieid,createtime,url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
FROM special4;
The result is shown in the figure below:
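LAG and LEAD amount to shifting a column within a partition and filling the positions that fall off the edge with the default. The following Python sketch emulates that behaviour for one partition sorted by createtime. The timestamps are hypothetical, and the functions lag and lead are illustrations that mirror the Hive signatures, with None standing in for NULL.

```python
def lag(values, n=1, default=None):
    """Value n rows before each row, or default when no such row exists."""
    return [values[i - n] if i - n >= 0 else default for i in range(len(values))]


def lead(values, n=1, default=None):
    """Value n rows after each row, or default when no such row exists."""
    size = len(values)
    return [values[i + n] if i + n < size else default for i in range(size)]


# Hypothetical createtime values of one cookieid, sorted ascending.
times = ["10:00:00", "10:00:02", "10:03:04", "10:10:00"]

last_1_time = lag(times, 1, "1970-01-01 00:00:00")
last_2_time = lag(times, 2)
next_1_time = lead(times, 1, "1970-01-01 00:00:00")
next_2_time = lead(times, 2)

for row in zip(times, last_1_time, last_2_time, next_1_time, next_2_time):
    print(row)
```

A typical use of this pairing is computing the time a visitor spent on a page, by subtracting each row's createtime from its next_1_time.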
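D. Using the FIRST_VALUE function
Purpose: returns the first value in the sorted partition, counting up to the current row. The command is as follows: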
SELECT cookieid,createtime,url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM special4;
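The result is shown in the figure below:
E. Using the LAST_VALUE function
Purpose: returns the last value in the sorted partition, counting up to the current row; when the createtime values are distinct, this is simply the current row's own value. The command is as follows: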
SELECT cookieid,createtime,url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM special4;
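The result is shown in the figure below:
If ORDER BY is omitted, the rows are by default taken in the order of their offset in the file, although SQL itself does not guarantee any particular order in that case. The command is as follows: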
SELECT cookieid,createtime,url,
FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2
FROM special4;
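The result is shown in the figure below:
To get the last value of the whole sorted partition, a workaround is needed: sort in descending order and take FIRST_VALUE. The command is as follows: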
SELECT cookieid,createtime,url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2
FROM special4
ORDER BY cookieid,createtime;
The result is shown in the figure below:
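The reason LAST_VALUE needs a workaround is that the frame ends at the current row. The Python sketch below emulates this for one partition, assuming distinct createtime values so that the frame for row i is exactly rows 0 to i. The url values and the function names are hypothetical.

```python
def first_value(ordered):
    """First element of the frame rows[0..i], for every row i."""
    return [ordered[: i + 1][0] for i in range(len(ordered))]


def last_value(ordered):
    """Last element of the frame rows[0..i], for every row i."""
    return [ordered[: i + 1][-1] for i in range(len(ordered))]


# Hypothetical urls of one cookieid, sorted by createtime ascending.
urls = ["url1", "url6", "url3", "url4"]

first1 = first_value(urls)  # always the first url of the partition
last1 = last_value(urls)    # just the current row's own url
# Workaround: FIRST_VALUE over the descending order, mapped back to ascending row order.
last2 = first_value(urls[::-1])[::-1]

print(first1)
print(last1)
print(last2)
```

Here last1 merely repeats the url column, whereas last2 holds the url of the latest createtime on every row, which is what the FIRST_VALUE ... ORDER BY createtime DESC column in the query above returns.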
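6. The GROUPING SETS, GROUPING__ID, CUBE and ROLLUP functions
A. Data preparation
Create a file named special5 and enter the test data into it, as shown in the figure below:
Then copy the special5 file to the chosen directory (/root/temp is used here) and run the following command to create the corresponding external table: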
CREATE EXTERNAL TABLE special5 (
month STRING,day STRING,cookieid STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as textfile location '/root/temp/special5/';
Finally, run the following command to load the data from the local file special5 into the table special5:
load data local inpath '/root/temp/special5' into table special5;
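B. Using GROUPING SETS
Purpose: aggregates over several combinations of dimensions within a single GROUP BY query; it is equivalent to a UNION ALL of the GROUP BY result sets for the individual dimension combinations. The command is as follows: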
SELECT month,day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM special5
GROUP BY month,day
GROUPING SETS (month,day)
ORDER BY GROUPING__ID;
The result is shown in the figure below:
The statement above is equivalent to:
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM special5 GROUP BY month
UNION ALL
SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM special5 GROUP BY day
As another example, consider the following command:
SELECT month,day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM special5
GROUP BY month,day
GROUPING SETS (month,day,(month,day))
ORDER BY GROUPING__ID;
The result is shown in the figure below:
The command above is equivalent to:
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM special5 GROUP BY month
UNION ALL
SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM special5 GROUP BY day
UNION ALL
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM special5 GROUP BY month,day
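Here GROUPING__ID indicates which grouping set a result row belongs to. The values 1, 2 and 3 shown above are those produced by the Hive version used in this article; the GROUPING__ID numbering changed in Hive 2.3.0, so newer releases may report different values.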
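The UNION ALL equivalence can be mimicked outside Hive. The Python sketch below computes COUNT(DISTINCT cookieid) once per grouping set and merges the results, with None standing in for the NULL of a dimension that is not part of the set. The rows and the helper names uv_by and grouping_sets are hypothetical, and GROUPING__ID is deliberately left out.

```python
from collections import defaultdict

# Hypothetical (month, day, cookieid) rows.
rows = [
    ("2015-03", "2015-03-10", "cookie1"),
    ("2015-03", "2015-03-10", "cookie5"),
    ("2015-03", "2015-03-12", "cookie7"),
    ("2015-04", "2015-04-12", "cookie3"),
    ("2015-04", "2015-04-13", "cookie2"),
    ("2015-04", "2015-04-13", "cookie4"),
    ("2015-04", "2015-04-16", "cookie4"),
]


def uv_by(rows, keep_month, keep_day):
    """Distinct cookie count per group; a dropped dimension becomes None."""
    groups = defaultdict(set)
    for month, day, cookie in rows:
        key = (month if keep_month else None, day if keep_day else None)
        groups[key].add(cookie)
    return {key: len(cookies) for key, cookies in groups.items()}


def grouping_sets(rows, sets):
    """Merge the per-set results, like a UNION ALL of separate GROUP BY queries."""
    result = {}
    for keep_month, keep_day in sets:
        result.update(uv_by(rows, keep_month, keep_day))
    return result


# GROUPING SETS (month, day)
by_month_and_day = grouping_sets(rows, [(True, False), (False, True)])
for key, uv in sorted(by_month_and_day.items(), key=str):
    print(key, uv)
```

Adding (True, True) to the list of sets gives the counterpart of GROUPING SETS (month,day,(month,day)).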
C. Using CUBE
Purpose: aggregates over every combination of the GROUP BY dimensions. The command is as follows:
SELECT month,day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM special5
GROUP BY month,day
WITH CUBE
ORDER BY GROUPING__ID;
The result is shown in the figure below:
The command above is equivalent to:
SELECT NULL,NULL,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM special5
UNION ALL
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM special5 GROUP BY month
UNION ALL
SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM special5 GROUP BY day
UNION ALL
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM special5 GROUP BY month,day
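Viewed as grouping sets, WITH CUBE over two dimensions expands to all four subsets of (month, day), from the grand total to the full grouping. The Python sketch below enumerates those subsets and computes the distinct cookie count for each. The rows and the helper names cube_sets and distinct_count are hypothetical; None stands in for NULL.

```python
from itertools import combinations

# Hypothetical rows of the special5 table.
rows = [
    {"month": "2015-03", "day": "2015-03-10", "cookieid": "cookie1"},
    {"month": "2015-03", "day": "2015-03-10", "cookieid": "cookie5"},
    {"month": "2015-04", "day": "2015-04-12", "cookieid": "cookie1"},
    {"month": "2015-04", "day": "2015-04-13", "cookieid": "cookie3"},
]


def cube_sets(dims):
    """Every subset of dims, from the empty set (grand total) to all dims."""
    return [subset for r in range(len(dims) + 1) for subset in combinations(dims, r)]


def distinct_count(rows, dims, kept):
    """COUNT(DISTINCT cookieid) grouped by the kept dims; other dims become None."""
    groups = {}
    for row in rows:
        key = tuple(row[d] if d in kept else None for d in dims)
        groups.setdefault(key, set()).add(row["cookieid"])
    return {key: len(cookies) for key, cookies in groups.items()}


dims = ("month", "day")
cube = {}
for kept in cube_sets(dims):
    cube.update(distinct_count(rows, dims, kept))

print(cube_sets(dims))
for key, uv in cube.items():
    print(key, uv)
```

The four subsets correspond one to one to the four SELECT statements joined by UNION ALL above.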
D. Using ROLLUP
Purpose: a subset of CUBE; it aggregates hierarchically, starting from the leftmost dimension. The command is as follows:
SELECT month,day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM special5
GROUP BY month,day
WITH ROLLUP
ORDER BY GROUPING__ID;
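The result is shown in the figure below:
This implements the following roll-up path: uv per month and day -> uv per month -> total uv. If month and day are swapped, the hierarchical aggregation follows the day dimension instead. The command is as follows: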
SELECT day,month,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM special5
GROUP BY day,month
WITH ROLLUP
ORDER BY GROUPING__ID;
The result is shown in the figure below:
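ROLLUP keeps only the prefixes of the GROUP BY list, which is why the order of the dimensions matters. The Python sketch below lists the grouping sets for both orderings and computes the distinct cookie count at each level. The rows and the helper names rollup_sets and distinct_count are hypothetical; None stands in for NULL.

```python
def rollup_sets(dims):
    """Grouping sets of ROLLUP: every prefix of dims, longest first."""
    return [tuple(dims[:n]) for n in range(len(dims), -1, -1)]


def distinct_count(rows, dims, kept):
    """COUNT(DISTINCT cookieid) grouped by the kept dims; other dims become None."""
    groups = {}
    for row in rows:
        key = tuple(row[d] if d in kept else None for d in dims)
        groups.setdefault(key, set()).add(row["cookieid"])
    return {key: len(cookies) for key, cookies in groups.items()}


# Hypothetical rows of the special5 table.
rows = [
    {"month": "2015-03", "day": "2015-03-10", "cookieid": "cookie1"},
    {"month": "2015-03", "day": "2015-03-10", "cookieid": "cookie5"},
    {"month": "2015-04", "day": "2015-04-12", "cookieid": "cookie1"},
    {"month": "2015-04", "day": "2015-04-13", "cookieid": "cookie3"},
]

dims = ("month", "day")
levels = {kept: distinct_count(rows, dims, kept) for kept in rollup_sets(dims)}

print(rollup_sets(("month", "day")))  # month+day -> month -> total
print(rollup_sets(("day", "month")))  # day+month -> day -> total
for kept, result in levels.items():
    print(kept, result)
```

Compared with CUBE over the same two dimensions, GROUP BY month,day WITH ROLLUP omits the day-only grouping, and GROUP BY day,month WITH ROLLUP omits the month-only grouping.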