Hive: Using Functions
- Using Hive functions
Tip: to try out a function, you can set up a dedicated dual table:
create table dual(x string);
insert into table dual values('');
In practice, you can simply test a function against a constant:
select substr("abcdefg",1,3);
Note that substr, as in other databases, uses 1-based positions.
The full Hive function reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inTable-GeneratingFunctions(UDTF)
- Common built-in functions
- Type conversion functions
select cast("5" as int) from dual;
select cast("2017-08-03" as date) ;
select cast(current_timestamp as date);
- current_timestamp is Hive's current timestamp
- A string can be cast to date only when it is in year-month-day form (yyyy-MM-dd)
Example:
1,1995-05-05 13:30:59,1200.3
2,1994-04-05 13:30:59,2200
3,1996-06-01 12:20:30,80000.5
create table t_fun(id string,birthday string,salary string)
row format delimited fields terminated by ',';
select id,to_date(birthday) as bir,cast(salary as float) from t_fun;
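The difference between cast(... as date) and to_date() on strings that carry a time part can be sketched with constants. This is a minimal sketch using only literals, so no table is assumed; how cast treats a string with a time part may differ between Hive versions, which is why no results are shown.

```sql
-- A date-only string in yyyy-MM-dd form can be cast directly.
select cast('1995-05-05' as date);

-- to_date() keeps the date part of a 'yyyy-MM-dd HH:mm:ss' string.
select to_date('1995-05-05 13:30:59');

-- Cutting the string down to its first 10 characters before casting
-- avoids relying on how cast handles the time part.
select cast(substr('1995-05-05 13:30:59', 1, 10) as date);
```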
- Math functions
select round(5.4) from dual; -- 5
select round(5.1345,3) from dual; -- 5.135
select ceil(5.4) from dual; -- 6; ceiling(5.4) is equivalent
select floor(5.4) from dual; -- 5
select abs(-5.4) from dual; -- 5.4
select greatest(3,5,6) from dual; -- 6
select least(3,5,6) from dual; -- 3
Example:
Given a table t_fun2 with columns s1, s2 and s3:
select greatest(cast(s1 as double),cast(s2 as double),cast(s3 as double)) from t_fun2;
Result:
+---------+--+
| _c0 |
+---------+--+
| 2000.0 |
| 9800.0 |
+---------+--+
select max(age) from t_person; -- aggregate function
select min(age) from t_person; -- aggregate function
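greatest()/least() compare values within one row, while max()/min() aggregate one column across rows. The contrast can be sketched on the same t_fun2 table; the column aliases are made up for illustration.

```sql
-- Row-wise: one result per row, comparing the three columns of that row.
select greatest(cast(s1 as double), cast(s2 as double), cast(s3 as double)) as row_max
from t_fun2;

-- Aggregate: one result for the whole table, looking at a single column.
select max(cast(s1 as double)) as s1_max
from t_fun2;

-- Combined: the largest value found in any of the three columns of any row.
select max(greatest(cast(s1 as double), cast(s2 as double), cast(s3 as double))) as overall_max
from t_fun2;
```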
- String functions
substr(string, int start) -- extract a substring
substring(string, int start)
Example: select substr("abcdefg",2) from dual;
substr(string, int start, int len)
substring(string, int start, int len)
Example: select substr("abcdefg",2,3) from dual;
concat(string A, string B...) -- concatenate strings
concat_ws(string SEP, string A, string B...)
Example: select concat("ab","xy") from dual;
select concat_ws(".","192","168","33","44") from dual;
length(string A)
Example: select length("192.168.33.44") from dual;
split(string str, string pat)
Example: select split("192.168.33.44",".") from dual; -- wrong: "." is a regex metacharacter and must be escaped
select split("192.168.33.44","\\.") from dual;
upper(string str) -- convert to uppercase
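A few constant-only queries make the 1-based indexing of substr and the regex behaviour of split easier to see. This is a small sketch; the character class [.] is just another way to match a literal dot and is not taken from the notes above.

```sql
-- Positions start at 1, so this takes 3 characters starting from 'b'.
select substr('abcdefg', 2, 3);                -- bcd

-- A negative start counts from the end of the string.
select substr('abcdefg', -3);                  -- efg

-- [.] matches a literal dot without needing backslashes.
select split('192.168.33.44', '[.]');

-- split returns an array, so elements can be indexed (from 0) and counted.
select split('192.168.33.44', '[.]')[0];       -- 192
select size(split('192.168.33.44', '[.]'));    -- 4
```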
- Date and time functions
select current_timestamp;
select current_date;
-- current Unix timestamp, in seconds
select unix_timestamp();
-- Unix timestamp to string
from_unixtime(bigint unixtime[, string format])
Example: select from_unixtime(unix_timestamp());
select from_unixtime(unix_timestamp(),"yyyy/MM/dd HH:mm:ss");
-- string to Unix timestamp
unix_timestamp(string date, string pattern)
Example: select unix_timestamp("2017-08-10 17:50:30");
select unix_timestamp("2017/08/10 17:50:30","yyyy/MM/dd HH:mm:ss");
-- convert a string to a date
select to_date("2017-09-17 16:58:32");
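unix_timestamp() and from_unixtime() are often chained to change the format of a time string: parse with one pattern, print with another. A minimal sketch with a constant input:

```sql
-- Parse a slash-separated time string, then print it in the default dashed form.
select from_unixtime(
         unix_timestamp('2017/08/10 17:50:30', 'yyyy/MM/dd HH:mm:ss'),
         'yyyy-MM-dd HH:mm:ss');               -- 2017-08-10 17:50:30

-- The same round trip, keeping only the date part.
select to_date(from_unixtime(
         unix_timestamp('2017/08/10 17:50:30', 'yyyy/MM/dd HH:mm:ss')));
```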
- Table-generating functions
- Row-expanding function: explode()
Suppose we have the following data:
1,zhangsan,化學:物理:數學:語文
2,lisi,化學:數學:生物:生理:衛生
3,wangwu,化學:語文:英語:體育:生物
Map it to a table:
create table t_stu_subject(id int,name string,subjects array<string>)
row format delimited fields terminated by ','
collection items terminated by ':';
Use explode() to "explode" the array column, producing one output row per array element.
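The explode step on its own would look roughly like this; it is the same inner query that the notes reuse when computing distinct subjects.

```sql
-- Each element of the subjects array becomes its own row in a column named sub.
select explode(subjects) as sub
from t_stu_subject;
```

With the three sample rows above this yields 14 rows, one per listed subject, with duplicates kept.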
Then use the explode result to find the distinct subjects:
select distinct tmp.sub from (select explode(subjects) as sub from t_stu_subject) tmp;
- lateral view with table-generating functions
select id,name,tmp.sub
from t_stu_subject lateral view explode(subjects) tmp as sub;
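How to read this: lateral view behaves like a join between two tables.
Left table: the original table.
Right table: the table produced by explode() on a collection column.
The join only happens between data that came from the same source row.
This makes further queries straightforward.
For example, find the students who take biology (生物):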
select a.id,a.name,a.sub from
(select id,name,tmp.sub as sub from t_stu_subject lateral view explode(subjects) tmp as sub) a
where sub='生物';
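Because lateral view keeps id and name next to each exploded subject, aggregations per subject become ordinary group by queries. A small sketch on the same table; the aliases subject and student_count are made up for illustration.

```sql
-- Count how many students take each subject.
select tmp.sub as subject, count(*) as student_count
from t_stu_subject lateral view explode(subjects) tmp as sub
group by tmp.sub;
```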
- Collection functions
array_contains(Array<T>, value) -- returns a boolean
Example:
select moive_name,array_contains(actors,'吳剛') from t_movie;
select array_contains(array('a','b','c'),'c') from dual;
sort_array(Array<T>) -- returns the sorted array
Example:
select sort_array(array('c','b','a')) from dual;
select 'haha',sort_array(array('c','b','a')) as xx from (select 0) tmp;
size(Array<T>) -- returns an int
Example:
select moive_name,size(actors) as actor_number from t_movie;
size(Map<K,V>) -- returns an int
map_keys(Map<K,V>) -- returns an array
map_values(Map<K,V>) -- returns an array
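The map functions have no example above, so here is a minimal sketch using a map literal built with map(); the keys and values are arbitrary.

```sql
-- Number of entries in the map.
select size(map('a', 1, 'b', 2));              -- 2

-- Keys and values come back as arrays; do not rely on their order.
select map_keys(map('a', 1, 'b', 2));
select map_values(map('a', 1, 'b', 2));

-- The key array can be fed to the array functions, e.g. a key-existence check.
select array_contains(map_keys(map('a', 1, 'b', 2)), 'b');   -- true
```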
- Conditional functions
- case when
Syntax:
CASE [ expression ]
WHEN condition1 THEN result1
WHEN condition2 THEN result2
...
WHEN conditionn THEN resultn
ELSE result
END
Example:
select id,name,
case
when age<28 then 'youngth'
when age>27 and age<40 then 'zhongnian'
else 'old'
end
from t_user;
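The example above uses the "searched" form, where every WHEN holds a full condition. The syntax also allows an expression right after CASE, which is then compared to each WHEN value. A sketch of that form, assuming the t_rownumber table (id, age, name, sex) used later in these notes; the output labels are made up.

```sql
-- Simple form: sex is compared for equality against each WHEN value.
select id, name,
       case sex
         when 'male'   then 'M'
         when 'female' then 'F'
         else 'unknown'
       end as sex_code
from t_rownumber;
```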
- IF
select id,if(age>25,'working','worked') from t_user;
select moive_name,if(array_contains(actors,'吳剛'),'good movie','bad movie') from t_movie;
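- JSON parsing function (a table-generating function)
json_tuple: given the keys of a JSON object, it returns the corresponding values.
It only handles simple, flat JSON; multi-level or nested JSON calls for a custom JSON parsing function.
Example: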
select json_tuple(json,'movie','rate','timeStamp','uid') as(movie,rate,ts,uid) from t_rating_json;
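Two related built-in options are worth sketching here. json_tuple is commonly combined with lateral view so that other columns can be selected alongside it, and get_json_object accepts a path expression, which can reach one level into a nested object. The JSON literal below is invented for illustration; t_rating_json and its json column are the ones used above.

```sql
-- json_tuple through lateral view: the extracted fields behave like joined columns.
select t.uid, t.movie, t.rate
from t_rating_json
lateral view json_tuple(json, 'uid', 'movie', 'rate') t as uid, movie, rate;

-- get_json_object with a path into a nested object (made-up JSON literal).
select get_json_object('{"movie":"1193","detail":{"rate":"5"}}', '$.detail.rate');   -- 5
```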
Using json_tuple to ETL a detail table out of the raw JSON table:
create table t_rate as
select uid, movie, rate,
year(from_unixtime(cast(ts as bigint))) as year,
month(from_unixtime(cast(ts as bigint))) as month,
day(from_unixtime(cast(ts as bigint))) as day,
hour(from_unixtime(cast(ts as bigint))) as hour,
minute(from_unixtime(cast(ts as bigint))) as minute,
from_unixtime(cast(ts as bigint)) as ts
from (select json_tuple(rateinfo,'movie','rate','timeStamp','uid') as(movie,rate,ts,uid) from t_json) tmp;
- Analytic function: row_number() over(), top N per group
- Requirement
Given the following data:
1,18,a,male
2,19,b,male
3,22,c,female
4,16,d,female
5,30,e,male
6,26,f,female
We need the two oldest records for each sex.
- Implementation:
Use row_number() to partition the rows by sex, sort each partition by age in descending order, and number the rows.
HQL:
select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rank
from t_rownumber;
Then filter that result on rank<=2 to get the final answer:
select id,age,name,sex
from
(select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rank
from t_rownumber) tmp
where rank<=2;
Exercise: from the movie rating data, find each user's top N highest-rated records.
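One possible shape for this exercise follows the same pattern as the query above. It is a sketch, assuming the t_rate table (uid, movie, rate) built earlier with json_tuple and an arbitrary N of 3; rate is cast to int because json_tuple returns strings.

```sql
select uid, movie, rate
from
(select uid, movie, rate,
        -- number each user's ratings from highest to lowest
        row_number() over(partition by uid order by cast(rate as int) desc) as rn
 from t_rate) tmp
where rn <= 3;   -- N = 3 is an arbitrary choice
```

Ties are broken arbitrarily by row_number(), so which of several equally rated movies lands inside the top N is not defined.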
- Hive window analytic functions
0: jdbc:hive2://localhost:10000> select * from t_access;
+----------------+---------------------------------+-----------------------+--------------+--+
| t_access.ip | t_access.url | t_access.access_time | t_access.dt |
+----------------+---------------------------------+-----------------------+--------------+--+
| 192.168.33.3 | http://www.edu360.cn/stu | 2017-08-04 15:30:20 | 20170804 |
| 192.168.33.3 | http://www.edu360.cn/teach | 2017-08-04 15:35:20 | 20170804 |
| 192.168.33.4 | http://www.edu360.cn/stu | 2017-08-04 15:30:20 | 20170804 |
| 192.168.33.4 | http://www.edu360.cn/job | 2017-08-04 16:30:20 | 20170804 |
| 192.168.33.5 | http://www.edu360.cn/job | 2017-08-04 15:40:20 | 20170804 |
| 192.168.33.3 | http://www.edu360.cn/stu | 2017-08-05 15:30:20 | 20170805 |
| 192.168.44.3 | http://www.edu360.cn/teach | 2017-08-05 15:35:20 | 20170805 |
| 192.168.33.44 | http://www.edu360.cn/stu | 2017-08-05 15:30:20 | 20170805 |
| 192.168.33.46 | http://www.edu360.cn/job | 2017-08-05 16:30:20 | 20170805 |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-05 15:40:20 | 20170805 |
| 192.168.133.3 | http://www.edu360.cn/register | 2017-08-06 15:30:20 | 20170806 |
| 192.168.111.3 | http://www.edu360.cn/register | 2017-08-06 15:35:20 | 20170806 |
| 192.168.34.44 | http://www.edu360.cn/pay | 2017-08-06 15:30:20 | 20170806 |
| 192.168.33.46 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 20170806 |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 20170806 |
| 192.168.33.46 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 20170806 |
| 192.168.33.25 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 20170806 |
| 192.168.33.36 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 20170806 |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 20170806 |
+----------------+---------------------------------+-----------------------+--------------+--+
-- LAG function
select ip,url,access_time,
row_number() over(partition by ip order by access_time) as rn,
lag(access_time,1,0) over(partition by ip order by access_time)as last_access_time
from t_access;
+----------------+---------------------------------+----------------------+-----+----------------------+--+
| ip | url | access_time | rn | last_access_time |
+----------------+---------------------------------+----------------------+-----+----------------------+--+
| 192.168.111.3 | http://www.edu360.cn/register | 2017-08-06 15:35:20 | 1 | 0 |
| 192.168.133.3 | http://www.edu360.cn/register | 2017-08-06 15:30:20 | 1 | 0 |
| 192.168.33.25 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 1 | 0 |
| 192.168.33.3 | http://www.edu360.cn/stu | 2017-08-04 15:30:20 | 1 | 0 |
| 192.168.33.3 | http://www.edu360.cn/teach | 2017-08-04 15:35:20 | 2 | 2017-08-04 15:30:20 |
| 192.168.33.3 | http://www.edu360.cn/stu | 2017-08-05 15:30:20 | 3 | 2017-08-04 15:35:20 |
| 192.168.33.36 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 1 | 0 |
| 192.168.33.4 | http://www.edu360.cn/stu | 2017-08-04 15:30:20 | 1 | 0 |
| 192.168.33.4 | http://www.edu360.cn/job | 2017-08-04 16:30:20 | 2 | 2017-08-04 15:30:20 |
| 192.168.33.44 | http://www.edu360.cn/stu | 2017-08-05 15:30:20 | 1 | 0 |
| 192.168.33.46 | http://www.edu360.cn/job | 2017-08-05 16:30:20 | 1 | 0 |
| 192.168.33.46 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 2 | 2017-08-05 16:30:20 |
| 192.168.33.46 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 3 | 2017-08-06 16:30:20 |
| 192.168.33.5 | http://www.edu360.cn/job | 2017-08-04 15:40:20 | 1 | 0 |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-05 15:40:20 | 1 | 0 |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 2 | 2017-08-05 15:40:20 |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 3 | 2017-08-06 15:40:20 |
| 192.168.34.44 | http://www.edu360.cn/pay | 2017-08-06 15:30:20 | 1 | 0 |
| 192.168.44.3 | http://www.edu360.cn/teach | 2017-08-05 15:35:20 | 1 | 0 |
+----------------+---------------------------------+----------------------+-----+----------------------+--+
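A typical use of lag() is measuring the gap between consecutive visits. The sketch below assumes access_time is a string in 'yyyy-MM-dd HH:mm:ss' form, as the sample rows suggest, and omits the default value so that the first visit of each ip gets NULL rather than 0; the aliases are made up.

```sql
select ip, url, access_time,
       -- seconds elapsed since this ip's previous visit; NULL for its first visit
       unix_timestamp(access_time) - unix_timestamp(prev_access_time) as seconds_since_prev
from
(select ip, url, access_time,
        lag(access_time) over(partition by ip order by access_time) as prev_access_time
 from t_access) tmp;
```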
-- LEAD function
select ip,url,access_time,
row_number() over(partition by ip order by access_time) as rn,
lead(access_time,1,0) over(partition by ip order by access_time) as next_access_time
from t_access;
+----------------+---------------------------------+----------------------+-----+----------------------+--+
| ip | url | access_time | rn | next_access_time |
+----------------+---------------------------------+----------------------+-----+----------------------+--+
| 192.168.111.3 | http://www.edu360.cn/register | 2017-08-06 15:35:20 | 1 | 0 |
| 192.168.133.3 | http://www.edu360.cn/register | 2017-08-06 15:30:20 | 1 | 0 |
| 192.168.33.25 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 1 | 0 |
| 192.168.33.3 | http://www.edu360.cn/stu | 2017-08-04 15:30:20 | 1 | 2017-08-04 15:35:20 |
| 192.168.33.3 | http://www.edu360.cn/teach | 2017-08-04 15:35:20 | 2 | 2017-08-05 15:30:20 |
| 192.168.33.3 | http://www.edu360.cn/stu | 2017-08-05 15:30:20 | 3 | 0 |
| 192.168.33.36 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 1 | 0 |
| 192.168.33.4 | http://www.edu360.cn/stu | 2017-08-04 15:30:20 | 1 | 2017-08-04 16:30:20 |
| 192.168.33.4 | http://www.edu360.cn/job | 2017-08-04 16:30:20 | 2 | 0 |
| 192.168.33.44 | http://www.edu360.cn/stu | 2017-08-05 15:30:20 | 1 | 0 |
| 192.168.33.46 | http://www.edu360.cn/job | 2017-08-05 16:30:20 | 1 | 2017-08-06 16:30:20 |
| 192.168.33.46 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 2 | 2017-08-06 16:30:20 |
| 192.168.33.46 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 3 | 0 |
| 192.168.33.5 | http://www.edu360.cn/job | 2017-08-04 15:40:20 | 1 | 0 |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-05 15:40:20 | 1 | 2017-08-06 15:40:20 |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 2 | 2017-08-06 15:40:20 |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 3 | 0 |
| 192.168.34.44 | http://www.edu360.cn/pay | 2017-08-06 15:30:20 | 1 | 0 |
| 192.168.44.3 | http://www.edu360.cn/teach | 2017-08-05 15:35:20 | 1 | 0 |
+----------------+---------------------------------+----------------------+-----+----------------------+--+
-- FIRST_VALUE function
Example: get the first page each user (ip) visited
select ip,url,access_time,
row_number() over(partition by ip order by access_time) as rn,
first_value(url) over(partition by ip order by access_time rows between unbounded preceding and unbounded following) as first_url
from t_access;
+----------------+---------------------------------+----------------------+-----+---------------------------------+--+
| ip | url | access_time | rn | first_url |
+----------------+---------------------------------+----------------------+-----+---------------------------------+--+
| 192.168.111.3 | http://www.edu360.cn/register | 2017-08-06 15:35:20 | 1 | http://www.edu360.cn/register |
| 192.168.133.3 | http://www.edu360.cn/register | 2017-08-06 15:30:20 | 1 | http://www.edu360.cn/register |
| 192.168.33.25 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 1 | http://www.edu360.cn/job |
| 192.168.33.3 | http://www.edu360.cn/stu | 2017-08-04 15:30:20 | 1 | http://www.edu360.cn/stu |
| 192.168.33.3 | http://www.edu360.cn/teach | 2017-08-04 15:35:20 | 2 | http://www.edu360.cn/stu |
| 192.168.33.3 | http://www.edu360.cn/stu | 2017-08-05 15:30:20 | 3 | http://www.edu360.cn/stu |
| 192.168.33.36 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 1 | http://www.edu360.cn/excersize |
| 192.168.33.4 | http://www.edu360.cn/stu | 2017-08-04 15:30:20 | 1 | http://www.edu360.cn/stu |
| 192.168.33.4 | http://www.edu360.cn/job | 2017-08-04 16:30:20 | 2 | http://www.edu360.cn/stu |
| 192.168.33.44 | http://www.edu360.cn/stu | 2017-08-05 15:30:20 | 1 | http://www.edu360.cn/stu |
| 192.168.33.46 | http://www.edu360.cn/job | 2017-08-05 16:30:20 | 1 | http://www.edu360.cn/job |
| 192.168.33.46 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 2 | http://www.edu360.cn/job |
| 192.168.33.46 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 3 | http://www.edu360.cn/job |
| 192.168.33.5 | http://www.edu360.cn/job | 2017-08-04 15:40:20 | 1 | http://www.edu360.cn/job |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-05 15:40:20 | 1 | http://www.edu360.cn/job |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 2 | http://www.edu360.cn/job |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 3 | http://www.edu360.cn/job |
| 192.168.34.44 | http://www.edu360.cn/pay | 2017-08-06 15:30:20 | 1 | http://www.edu360.cn/pay |
| 192.168.44.3 | http://www.edu360.cn/teach | 2017-08-05 15:35:20 | 1 | http://www.edu360.cn/teach |
+----------------+---------------------------------+----------------------+-----+---------------------------------+--+
-- LAST_VALUE function
Example: get the last page each user (ip) visited
select ip,url,access_time,
row_number() over(partition by ip order by access_time) as rn,
last_value(url) over(partition by ip order by access_time rows between unbounded preceding and unbounded following) as last_url
from t_access;
+----------------+---------------------------------+----------------------+-----+---------------------------------+--+
| ip | url | access_time | rn | last_url |
+----------------+---------------------------------+----------------------+-----+---------------------------------+--+
| 192.168.111.3 | http://www.edu360.cn/register | 2017-08-06 15:35:20 | 1 | http://www.edu360.cn/register |
| 192.168.133.3 | http://www.edu360.cn/register | 2017-08-06 15:30:20 | 1 | http://www.edu360.cn/register |
| 192.168.33.25 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 1 | http://www.edu360.cn/job |
| 192.168.33.3 | http://www.edu360.cn/stu | 2017-08-04 15:30:20 | 1 | http://www.edu360.cn/stu |
| 192.168.33.3 | http://www.edu360.cn/teach | 2017-08-04 15:35:20 | 2 | http://www.edu360.cn/stu |
| 192.168.33.3 | http://www.edu360.cn/stu | 2017-08-05 15:30:20 | 3 | http://www.edu360.cn/stu |
| 192.168.33.36 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 1 | http://www.edu360.cn/excersize |
| 192.168.33.4 | http://www.edu360.cn/stu | 2017-08-04 15:30:20 | 1 | http://www.edu360.cn/job |
| 192.168.33.4 | http://www.edu360.cn/job | 2017-08-04 16:30:20 | 2 | http://www.edu360.cn/job |
| 192.168.33.44 | http://www.edu360.cn/stu | 2017-08-05 15:30:20 | 1 | http://www.edu360.cn/stu |
| 192.168.33.46 | http://www.edu360.cn/job | 2017-08-05 16:30:20 | 1 | http://www.edu360.cn/excersize |
| 192.168.33.46 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 2 | http://www.edu360.cn/excersize |
| 192.168.33.46 | http://www.edu360.cn/excersize | 2017-08-06 16:30:20 | 3 | http://www.edu360.cn/excersize |
| 192.168.33.5 | http://www.edu360.cn/job | 2017-08-04 15:40:20 | 1 | http://www.edu360.cn/job |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-05 15:40:20 | 1 | http://www.edu360.cn/job |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 2 | http://www.edu360.cn/job |
| 192.168.33.55 | http://www.edu360.cn/job | 2017-08-06 15:40:20 | 3 | http://www.edu360.cn/job |
| 192.168.34.44 | http://www.edu360.cn/pay | 2017-08-06 15:30:20 | 1 | http://www.edu360.cn/pay |
| 192.168.44.3 | http://www.edu360.cn/teach | 2017-08-05 15:35:20 | 1 | http://www.edu360.cn/teach |
+----------------+---------------------------------+----------------------+-----+---------------------------------+--+
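The explicit "rows between unbounded preceding and unbounded following" matters mostly for last_value(). When an order by is given without a frame, the window by default ends at the current row, so last_value() would not look ahead to later rows of the partition. The sketch below puts both variants side by side; the two column aliases are made up for illustration.

```sql
select ip, url, access_time,
       -- default frame: ends at the current row, so it does not see later visits
       last_value(url) over(partition by ip order by access_time) as last_url_default_frame,
       -- explicit full-partition frame: the truly last page of each ip
       last_value(url) over(partition by ip order by access_time
                            rows between unbounded preceding and unbounded following) as last_url_full_partition
from t_access;
```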
/*
Cumulative report, implemented with analytic functions
*/
-- sum() over()
select id
,month
,sum(amount) over(partition by id order by month rows between unbounded preceding and current row)
from
(select id,month,
sum(fee) as amount
from t_test
group by id,month) tmp;
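When reading a cumulative report it usually helps to see the monthly amount next to the running total. A variation of the query above, assuming the same t_test table (id, month, fee); the alias running_total is made up.

```sql
select id, month, amount,
       -- running total of amount per id, from the first month up to the current one
       sum(amount) over(partition by id order by month
                        rows between unbounded preceding and current row) as running_total
from
(select id, month, sum(fee) as amount
 from t_test
 group by id, month) tmp;
```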
- User-defined functions (UDF)
- Requirement:
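Write a custom function for the JSON data in the raw JSON table: it takes a JSON string and returns an array of the values it contains.
The raw JSON table then goes through an ETL step that turns the JSON data into ordinary table rows and inserts them into another table.
- Implementation steps: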
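1. Write the Java UDF class
package cn.edu360.hdp.hive;

import org.apache.hadoop.hive.ql.exec.UDF;

public class ParseJson extends UDF {
    // evaluate() may be overloaded: the return type and the number and types
    // of parameters are entirely up to the author.
    // Here: take a JSON string and return an array of its four values.
    public String[] evaluate(String json) {
        // Splitting on double quotes leaves the values at positions 3, 7, 11, 15
        // (this relies on four quoted string values in a fixed key order).
        String[] split = json.split("\"");
        String[] res = new String[]{split[3],split[7],split[11],split[15]};
        return res;
    }
}
2. Build the jar
In Eclipse, use Export.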
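3. Upload the jar to the Linux machine where Hive runs
4. Create a temporary function in Hive:
At the hive prompt, add the jar:
hive> add jar /root/jsonparse.jar;
Then, still at the hive prompt, create the temporary function:
hive> CREATE TEMPORARY FUNCTION jsonp AS 'cn.edu360.hdp.hive.ParseJson';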
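5. Write the HQL statement that uses the UDF to extract data from the raw table and insert it into the new table
insert into table t_rate
select jsonp(json)[0],
cast(jsonp(json)[1] as int),
cast(jsonp(json)[2] as bigint),
cast(jsonp(json)[3] as int)
from t_rating_json;
Note: a temporary function is only valid within one Hive session and is gone once the session is restarted.
If you need the function regularly, consider creating a permanent function:
Copy the jar into Hive's classpath: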
cp wc.jar apps/hive-1.2.1/lib/
Create the function:
create function pfuncx as 'com.doit.hive.udf.UserInfoParser';
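As an alternative to copying the jar into Hive's lib directory, newer Hive versions (0.13 and later) let a permanent function point at a jar stored on HDFS. The sketch below reuses the ParseJson class from the steps above; the function name jsonp_perm and the HDFS path are hypothetical and only serve as an illustration.

```sql
-- Register a permanent function whose jar is loaded from HDFS on demand.
-- (jsonp_perm and the jar path are hypothetical; adjust them to your cluster.)
create function jsonp_perm as 'cn.edu360.hdp.hive.ParseJson'
using jar 'hdfs:///user/hive/udf/jsonparse.jar';

-- Check that the function is known to Hive.
describe function jsonp_perm;
```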
Drop a function:
DROP TEMPORARY FUNCTION [IF EXISTS] function_name;
DROP FUNCTION [IF EXISTS] function_name;
Source: these notes were compiled from 小牛課堂.