Hive Notes (Part 3)
阿新 · Published 2018-03-23
Hive Functions
Hive Function Categories
As in Java and MySQL, Hive functions fall into three categories (a minimal sketch follows this list).
UDF (User-Defined Function)
One row in, one row out
e.g. sin(30°) = 1/2
UDAF (User-Defined Aggregation Function)
Many rows in, one row out
e.g. max, min, count, sum, avg
UDTF (User-Defined Table-Generating Function)
One row in, many rows out
e.g. explode
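A minimal sketch of the three categories, assuming a hypothetical table t(line string, amount int):
select length(line) from t;                   -- UDF: one row in, one row out
select sum(amount) from t;                    -- UDAF: many rows in, one row out
select explode(split(line, ' ')) from t;      -- UDTF: one row in, many rows out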
Common Functions
show functions;              -- list the functions available in Hive
desc function func_name;     -- show the help text for a function
case when      ----> works like switch / if-else-if / the ternary operator
explode        ----> turns the elements of an array into multiple rows
                     a = [1, 2, 3, 4], explode(a) ===> 1 2 3 4
split          ----> the same split function as on strings
array          ----> builds an array from the given values
collect_set / collect_list
concat_ws      ----> joins elements with the given separator string
row_number     ----> grouped ranking or secondary sort
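A quick sketch of collect_set and concat_ws together, assuming a hypothetical table orders(user_id string, item string):
select user_id, concat_ws(',', collect_set(item)) as items
from orders
group by user_id;
-- collect_set gathers each user's items into an array (dropping duplicates; collect_list keeps them),
-- and concat_ws joins that array into one comma-separated string.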
Function Examples
wordcount
Analysis. Input lines:
hello you
hello me
hello he
With MapReduce:
step 1 -----> split("\t") ---> ["hello", "you"] ["hello", "me"] ["hello", "he"]
step 2 -----> iterate over each array and emit every value as a key with value 1, i.e. <key, 1>:
<"hello", 1> <"you", 1> <"hello", 1> <"me", 1> <"hello", 1> <"he", 1>
step 3, shuffle ---> <"hello", [1, 1, 1]> <"you", 1> <"me", 1> <"he", 1>
step 4, reduce ====> reduceByKey
With HQL:
step 1
(mydb1)> select split(line, "\t") from test;
["hello","you"]
["hello","he"]
["hello","me"]
step 2: turn each array into multiple rows
(mydb1)> select explode(split(line, "\t")) from test;
hello
you
hello
he
hello
me
step 3: group by on top of step 2
select w.word, count(w.word) as count
from (select explode(split(line, "\t")) word from test) w
group by w.word
order by count desc;
case when
Use case when to display the department name corresponding to each of the following ids:
1 ---> 學工組, 2 ---> 行政組, 3 ---> 銷售組, 4 ---> 研發組, 5 ---> 其它
hive (mydb1)> select * from t1;
1
2
3
4
5
select id,
  case id
    when 1 then "學工組"
    when 2 then "行政組"
    when 3 then "銷售組"
    when 4 then "研發組"
    else "其它"
  end
from t1;
The result, grouped by category:
1 學工組
2 行政組
3 銷售組
4 研發組
5 其它
row_number (secondary sort); a minimal sketch follows, and the join exercises below use it for grouped top-N queries.
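A minimal sketch of row_number for grouped ranking, assuming a hypothetical table scores(class string, student string, score int):
select class, student, score,
       row_number() over(partition by class order by score desc) as rank
from scores;
-- rows are numbered 1, 2, 3, ... within each class, ordered by score descending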
Three Join Types
Cross join
cross join; it produces a Cartesian product, so it is generally avoided
Inner join (equi-join)
inner join
outputs only the rows that match between the left table and the right table (a minimal sketch follows this list)
Outer join
outer join
left outer join
right outer join
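A minimal sketch of the difference between inner and left outer join, using the t_dept and t_employee tables from the exercise below (column names as used there):
-- inner join: only departments that have at least one matching employee
select d.name, e.name
from t_dept d
inner join t_employee e on d.id = e.deptid;
-- left outer join: every department; employee columns are NULL where there is no match
select d.name, e.name
from t_dept d
left outer join t_employee e on d.id = e.deptid;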
Exercise: given the three tables employee, department, and salary:
1. For each department, show its employees' information (department name, employee name, employee gender [male|female], employee salary), sorted within each department by salary in descending order.
select
e.name, if(sex == 0, '女', '男') as gender, d.name, s.salary,
row_number() over(partition by e.deptid order by s.salary desc) rank
from t_dept d
left join t_employee e on d.id = e.deptid
left join t_salary s on e.id = s.empid
where s.salary is not null;
2. Show the top-2 employees by salary in each department.
select
tmp.*
from
(select
e.name, if(sex == 0, '女', '男') as gender, d.name, s.salary,
row_number() over(partition by e.deptid order by s.salary desc) rank
from t_dept d
left join t_employee e on d.id = e.deptid
left join t_salary s on e.id = s.empid
where s.salary is not null) tmp
where tmp.rank < 3;
If the query were against a single table, the subquery could be skipped and HAVING used instead (having rank < 3).
Hive User-Defined Functions
Steps to create a custom function
Creating a custom function follows these 6 steps (a compact sketch of steps 4°-6° follows the list):
1°. Write a Java class that extends the UDF class.
2°. Override its evaluate() method; Hive calls it for you.
3°. Package the program into a jar and upload it to the server.
4°. Load the jar from step 3° into Hive's classpath:
    run add jar jar_path; in the Hive terminal
5°. Give the custom function a temporary name, i.e. create a temporary function:
    create temporary function <function name> as '<fully-qualified name of the class containing evaluate>';
6°. When you are done, you can drop the temporary function manually, or simply leave it: it is destroyed automatically when the current session ends.
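A compact sketch of steps 4°-6° in the Hive terminal (the jar path, function name, and class name below are placeholders):
add jar /path/to/my-udf.jar;
create temporary function my_func as 'com.example.MyUDF';
select my_func(some_column) from some_table;
drop temporary function my_func;   -- optional; dropped automatically when the session ends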
UDF example: given a user's birthday, return the corresponding Chinese zodiac sign and constellation.
The program code is as follows:
package com.uplooking.bigdata.hive.udf;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
@Description(name = "z_c",
value = "_FUNC_(param1, param2) - returns the Chinese zodiac sign or constellation for the given date",
extended = "param1 and param2 can be the following:\n"
+ "1. param1 is a string in the format of 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'.\n"
+ "2. param1 date value\n"
+ "3. param1 timestamp value\n"
+ "4. param2 0 or 1, 0 means constellation, 1 means zodiac\n"
+ "Example:\n "
+ " > SELECT _FUNC_('2009-07-30', 0) FROM src LIMIT 1;\n" + " 獅子座")
public class ZodicaAndConstellationUDF extends UDF {
public Text evaluate(java.sql.Date date, int type) {
if (type == 0) { // constellation
return new Text(getConstellation(new Date(date.getTime())));
} else if (type == 1) { // zodiac sign
return new Text(getZodica(new Date(date.getTime())));
}
return null;
}
public String[] zodiacArr = { "猴", "雞", "狗", "豬", "鼠", "牛", "虎", "兔", "龍", "蛇", "馬", "羊" };
public String[] constellationArr = { "水瓶座", "雙魚座", "白羊座", "金牛座", "雙子座", "巨蟹座", "獅子座", "處女座", "天秤座", "天蠍座", "射手座", "魔羯座" };
public int[] constellationEdgeDay = { 20, 19, 21, 21, 21, 22, 23, 23, 23, 23, 22, 22 };
/**
* Get the Chinese zodiac sign for the given date
* @return the zodiac sign name
*/
public String getZodica(Date date) {
Calendar cal = Calendar.getInstance();
cal.setTime(date);
return zodiacArr[cal.get(Calendar.YEAR) % 12];
}
/**
* Get the constellation for the given date
* @return the constellation name
*/
public String getConstellation(Date date) {
if (date == null) {
return "";
}
Calendar cal = Calendar.getInstance();
cal.setTime(date);
int month = cal.get(Calendar.MONTH);
int day = cal.get(Calendar.DAY_OF_MONTH);
if (day < constellationEdgeDay[month]) {
month = month - 1;
}
if (month >= 0) {
return constellationArr[month];
}
// default: Capricorn (魔羯座)
return constellationArr[11];
}
}
Note: the required Maven dependencies are listed at the end of these notes.
After uploading the jar to the server, load it into Hive's classpath from the Hive terminal:
add jar /home/uplooking/jars/hive/udf-zc.jar;
Register the custom function:
create temporary function zc as 'com.uplooking.bigdata.hive.udf.ZodicaAndConstellationUDF';
Create a temporary table for testing:
hive (mydb1)>
> create temporary table tmp(
> birthday date);
Insert test data:
hive (mydb1)> insert into tmp values('1994-06-21');
Use the function in queries:
hive (mydb1)> select zc(birthday,0) from tmp;
OK
c0
雙子座
Time taken: 0.084 seconds, Fetched: 1 row(s)
hive (mydb1)> select zc(birthday,1) from tmp;
OK
c0
狗
Time taken: 0.044 seconds, Fetched: 1 row(s)
Hive via JDBC
Besides the CLI interface described earlier, Hive also provides a JDBC interface. To use it, the hiveserver2 service must be started first. Once it is running, you can keep operating Hive in CLI style through beeline, which Hive ships with (note that beeline now goes through the JDBC interface), or you can write Java code by hand.
Start the hiveserver2 service
[uplooking@uplooking01 ~]$ hiveserver2
Connect to hiveserver2 with beeline
[uplooking@uplooking01 hive]$ beeline
which: no hbase in (/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/jdk/bin:/home/uplooking/bin:/home/uplooking/app/zookeeper/bin:/home/uplooking/app/hadoop/bin:/home/uplooking/app/hadoop/sbin:/home/uplooking/app/hive/bin)
ls: 無法訪問/home/uplooking/app/hive/lib/hive-jdbc-*-standalone.jar: 沒有那個文件或目錄
Beeline version 2.1.0 by Apache Hive
beeline> !connect jdbc:hive2://uplooking01:10000/mydb1
Connecting to jdbc:hive2://uplooking01:10000/mydb1
Enter username for jdbc:hive2://uplooking01:10000/mydb1: uplooking
Enter password for jdbc:hive2://uplooking01:10000/mydb1: *********
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/uplooking/app/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/uplooking/app/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Error: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: uplooking is not allowed to impersonate uplooking (state=,code=0)
As shown, an error occurs. The solution is as follows:
When going through JDBC, the remote Hive ThriftServer service cannot be accessed.
The reported error: user uplooking is not allowed to impersonate uplooking.
This comes from a security policy introduced in newer versions: the uplooking user must be configured manually so that the uplooking user in Hadoop and the uplooking user in Hive are linked. Add the following properties to $HADOOP_HOME/etc/hadoop/core-site.xml:
<property>
<name>hadoop.proxyuser.uplooking.hosts</name>
<value>*</value>
<description>hosts from which the uplooking user is allowed to act as a proxy</description>
</property>
<property>
<name>hadoop.proxyuser.uplooking.groups</name>
<value>root</value>
<description>groups of users that the uplooking proxy user may impersonate</description>
</property>
After the configuration is in place, it must be synced to every node in the cluster.
For the cluster to pick up the new configuration, at least HDFS needs to be restarted.
After that, beeline works normally through Hive's JDBC interface:
beeline> !connect jdbc:hive2://uplooking01:10000/mydb1
Connecting to jdbc:hive2://uplooking01:10000/mydb1
Enter username for jdbc:hive2://uplooking01:10000/mydb1: uplooking
Enter password for jdbc:hive2://uplooking01:10000/mydb1: *********
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/uplooking/app/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/uplooking/app/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connected to: Apache Hive (version 2.1.0)
Driver: Hive JDBC (version 2.1.0)
18/03/23 08:00:15 [main]: WARN jdbc.HiveConnection: Request to set autoCommit to false; Hive does not support autoCommit=false.
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://uplooking01:10000/mydb1> show databases;
+----------------+--+
| database_name |
+----------------+--+
| default |
| mydb1 |
+----------------+--+
2 rows selected (2.164 seconds)
0: jdbc:hive2://uplooking01:10000/mydb1> show tables;
+-----------+--+
| tab_name |
+-----------+--+
| t1 |
| t2 |
+-----------+--+
2 rows selected (0.118 seconds)
0: jdbc:hive2://uplooking01:10000/mydb1> select * from t1;
+------------+--+
| t1.line |
+------------+--+
| hello you |
| hello he |
| hello me |
+------------+--+
3 rows selected (2.143 seconds)
0: jdbc:hive2://uplooking01:10000/mydb1>
Connect to hiveserver2 from Java code
The program code is as follows:
package com.uplooking.bigdata.hive.jdbc;
import java.sql.*;
public class HiveJDBC {
public static void main(String[] args) throws Exception {
// register the Hive JDBC driver
Class.forName("org.apache.hive.jdbc.HiveDriver");
// connect to hiveserver2 (database mydb1) as user uplooking
Connection conn = DriverManager.getConnection("jdbc:hive2://uplooking01:10000/mydb1", "uplooking", "uplooking");
// word count over table t1, implemented with explode + group by
String sql = "select t.word, count(t.word) as count from (select explode(split(line, ' ')) as word from t1) t group by t.word";
PreparedStatement ps = conn.prepareStatement(sql);
ResultSet rs = ps.executeQuery();
while (rs.next()) {
String word = rs.getString("word");
int count = rs.getInt("count");
System.out.println(word + "\t" + count);
}
rs.close();
ps.close();
conn.close();
}
}
The program output is as follows:
18/03/23 00:48:16 INFO jdbc.Utils: Supplied authorities: uplooking01:10000
18/03/23 00:48:16 INFO jdbc.Utils: Resolved authority: uplooking01:10000
he 1
hello 3
me 1
you 1
While this runs, note the output in the hiveserver2 terminal:
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = uplooking_20180323084825_63044683-393d-4625-a3c3-b440109c3d70
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1521765850571_0002, Tracking URL = http://uplooking02:8088/proxy/application_1521765850571_0002/
Kill Command = /home/uplooking/app/hadoop/bin/hadoop job -kill job_1521765850571_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-03-23 08:48:33,427 Stage-1 map = 0%, reduce = 0%
2018-03-23 08:48:40,864 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.54 sec
2018-03-23 08:48:48,294 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.84 sec
MapReduce Total cumulative CPU time: 6 seconds 840 msec
Ended Job = job_1521765850571_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.84 sec HDFS Read: 8870 HDFS Write: 159 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 840 msec
OK
Fixing Garbled Chinese Comments in Hive
If Chinese comments show up garbled, try the following solution.
Run the following script against Hive's metastore database:
ALTER TABLE COLUMNS_V2 MODIFY COLUMN COMMENT VARCHAR(256) CHARACTER SET utf8;
ALTER TABLE TABLE_PARAMS MODIFY COLUMN PARAM_VALUE VARCHAR(4000) CHARACTER SET utf8;
ALTER TABLE PARTITION_PARAMS MODIFY COLUMN PARAM_VALUE VARCHAR(4000) CHARACTER SET utf8;
ALTER TABLE PARTITION_KEYS MODIFY COLUMN PKEY_COMMENT VARCHAR(4000) CHARACTER SET utf8;
ALTER TABLE INDEX_PARAMS MODIFY COLUMN PARAM_VALUE VARCHAR(4000) CHARACTER SET utf8;
Also append the UTF-8 parameters to the metastore connection URL (inside the XML file, & must be escaped as &amp;):
&useUnicode=true&characterEncoding=UTF-8
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://uplooking01:3306/hive?createDatabaseIfNotExist=true&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
</property>
Hive Maven Dependencies
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<hive-api.version>2.1.0</hive-api.version>
<hadoop-api.version>2.6.4</hadoop-api.version>
<hadoop-core.version>1.2.1</hadoop-core.version>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>${hadoop-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>${hadoop-core.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-serde</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-service</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-metastore</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-common</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-cli</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>${hive-api.version}</version>
</dependency>
<dependency>
<groupId>org.apache.thrift</groupId>
<artifactId>libfb303</artifactId>
<version>0.9.0</version>
</dependency>
</dependencies>