Hive Notes

阿新 • Published: 2018-11-15

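Hive Views

1. Features of views:

① Materialized views are not supported (in the Hive 1.2.1 release used in these notes)
② Views can only be queried; you cannot load data into them with load data into
③ Creating a view only stores metadata; the underlying subquery runs when the view is queried
④ If the view definition contains ORDER BY/LIMIT and the query against the view also uses ORDER BY/LIMIT, the clauses in the view definition take precedence and are evaluated first
⑤ Views can be nested: a view can be defined on top of another view
⑥ Once created, a view cannot be edited in place; to change its definition, drop and recreate it (or use ALTER VIEW ... AS SELECT on Hive 0.11 and later)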

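2. Why create a view?

select a.name,b.age from table1 a join table2 b on(a.id=b.id) => view
If this query is run often, typing it out every time is tedious.
A long SQL statement can be mapped to a view, so that querying the view runs the long statement.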

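3. View operations

-- Create a view
CREATE VIEW  IF NOT EXISTS  view1 AS SELECT * FROM logtbl order by age;
-- List existing views (they appear alongside tables)
show tables;
-- Drop a view
drop view view1;

Creating a view does not start a MapReduce job.
select * from view1;
Querying the view does start a MapReduce job, because only the metadata was stored at creation time and the underlying subquery runs at query time.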

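Hive Indexes

Indexes

Indexes optimize query performance.
Suppose you run select * from table where age = 10; on a very large table made up of 10 blocks.
Without an index every block has to be scanned, so the query is slow.
How can performance be improved?
Index 1 (age > 10): block1(100,200) block2(200,389)
Index 2 (age = 10): block1(101,220) block2(200,389)
An index works like a table of contents: it records where the matching rows are.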

Create the index table, which stores the index

create index t2_index on table psnbucket_partition(age) 
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild 
in table t2_index_table;

At this point the index table holds only metadata, such as which table and which column the index is built on.

alter index t2_index on psnbucket_partition rebuild;

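This step actually builds the index data and stores it in the index table. If new data is added to the base table, run the same statement again to rebuild the index.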

Contents of the index table:

66 hdfs://zfg/user/hive_remote/warehouse/psnbucket_partition/height=188.0/000000_0 [0,30,60,90,120] 188.0
77 hdfs://zfg/user/hive_remote/warehouse/psnbucket_partition/height=188.0/000000_0 [8,38,68,98,128] 188.0
88 hdfs://zfg/user/hive_remote/warehouse/psnbucket_partition/height=188.0/000000_0 [19,49,79,109,139] 188.0
11 hdfs://zfg/user/hive_remote/warehouse/psnbucket_partition/height=189.0/000000_0 [0,48,96,144,192] 189.0
22 hdfs://zfg/user/hive_remote/warehouse/psnbucket_partition/height=189.0/000000_0 [9,57,105,153,201] 189.0
Show the indexes on a table:
show index on psnbucket_partition;

Drop an index:
drop index t2_index on psnbucket_partition;
Dropping the index also drops its index table.
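
To make the "table of contents" idea concrete, here is a small plain-Java model of the index table shown above: each indexed value maps to a file and the offsets inside that file. This is only an illustration, not how Hive implements CompactIndexHandler; the class and method names (CompactIndexSketch, add, lookup) are made up for the example, and the file paths are shortened.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model of a compact index table; not Hive's implementation.
class CompactIndexSketch {

    // One index row: the file that holds the value and the offsets inside it.
    static final class Entry {
        final String file;
        final long[] offsets;

        Entry(String file, long... offsets) {
            this.file = file;
            this.offsets = offsets;
        }
    }

    // Indexed column value (age) -> rows of the index table for that value.
    private final Map<Integer, List<Entry>> rows = new LinkedHashMap<>();

    void add(int age, String file, long... offsets) {
        rows.computeIfAbsent(age, k -> new ArrayList<>()).add(new Entry(file, offsets));
    }

    // Returns the locations to read for "where age = ?"; empty if the value is absent.
    List<Entry> lookup(int age) {
        return rows.getOrDefault(age, new ArrayList<>());
    }

    public static void main(String[] args) {
        CompactIndexSketch index = new CompactIndexSketch();
        index.add(66, "height=188.0/000000_0", 0, 30, 60, 90, 120);
        index.add(11, "height=189.0/000000_0", 0, 48, 96, 144, 192);

        // Only the listed file and offsets need to be read, not every block.
        for (Entry e : index.lookup(66)) {
            System.out.println(e.file + " -> " + e.offsets.length + " offsets");
        }
    }
}
```

A lookup for a value that was never indexed returns an empty list, which corresponds to a query that matches no rows.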

Data Read Rules

The data loaded into Hive so far was well formed: fields were cleanly delimited, no field contained dirty data, and every field was meaningful.
Real-world data is rarely that tidy.
Tomcat access log:
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-upper.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-nav.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /asf-logo.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-middle.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217

 CREATE TABLE logtbl (
    host STRING,
    identity STRING,
    t_user STRING,
    time STRING,
    request STRING,
    referer STRING,
    agent STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
  WITH SERDEPROPERTIES (
    "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(.*)\\] \"(.*)\" (-|[0-9]*) (-|[0-9]*)"
  )
  STORED AS TEXTFILE;

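load data local inpath '/root/lo' into table logtbl;
The raw dirty data is not changed. It is only cleaned up at read time, before the rows are displayed.

Save the line 192.168.57.4 - - 123 to a file named loerr, then load that file into the table created above:
load data local inpath '/root/loerr' into table logtbl;
This step succeeds, because load simply copies the file into the table's warehouse directory.
Now run select * from logtbl; the new row cannot be parsed. Rows are read against the regular-expression template, and a line that does not match it comes back as NULL columns.

Summary:

Hive checks data when it is read (schema on read),
not when it is written (schema on write).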
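
The read-time check can be reproduced outside Hive with java.util.regex. The sketch below applies the same expression as the table's input.regex to one good line and to the bad line from loerr. It assumes, as RegexSerDe is documented to do, that the whole line must match and that a non-matching line yields NULL for every column; the class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of schema-on-read: the pattern is applied when a row is read, not when it is loaded.
class SchemaOnReadSketch {

    // Same expression as the table's "input.regex" property.
    static final Pattern INPUT_REGEX = Pattern.compile(
            "([^ ]*) ([^ ]*) ([^ ]*) \\[(.*)\\] \"(.*)\" (-|[0-9]*) (-|[0-9]*)");

    // Returns one value per capture group, or all nulls when the line does not match.
    static String[] readRow(String line) {
        String[] columns = new String[7];
        Matcher m = INPUT_REGEX.matcher(line);
        if (m.matches()) {
            for (int i = 0; i < columns.length; i++) {
                columns[i] = m.group(i + 1);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        String good = "192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] \"GET / HTTP/1.1\" 200 11217";
        String bad = "192.168.57.4 - - 123";
        // prints [192.168.57.4, -, -, 29/Feb/2016:18:14:36 +0800, GET / HTTP/1.1, 200, 11217]
        System.out.println(Arrays.toString(readRow(good)));
        // prints [null, null, null, null, null, null, null]
        System.out.println(Arrays.toString(readRow(bad)));
    }
}
```

The bad line is stored as-is in both cases; only the attempt to read it through the pattern fails.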

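beeline

So far we worked with Hive by running the hive command and doing analysis in the Hive CLI. That approach is neither secure nor standard practice.
beeline is a newer CLI client that connects in the JDBC/ODBC style. It addresses these problems and decouples the client from the cluster.
The hive client connects directly to HDFS and YARN.
beeline first connects to the Thrift server (hiveserver2), which can authenticate clients securely and reliably and supports more concurrent clients.

By default, beeline connects to hiveserver2 without a user name or password, which is also insecure. We can configure a user name and password for hiveserver2.
Steps to set the user name and password:
Add the following to hive-site.xml:

    <property>
            <name>hive.server2.authentication</name>
            <value>CUSTOM</value>
    </property>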

    <property>
            <name>hive.jdbc_passwd.auth.zhangsan</name>
            <value>123456789</value>
    </property>
    <property>
            <name>hive.server2.custom.authentication.class</name>
            <value>com.hoe.hive.authoriz.UserPasswdAuth</value>
    </property>
Write the authentication class:
	package com.hoe.hive.authoriz;
	import javax.security.sasl.AuthenticationException;
	import org.apache.hadoop.conf.Configuration;
	import org.apache.hadoop.hive.conf.HiveConf;
	import org.apache.hive.service.auth.PasswdAuthenticationProvider;
	import org.slf4j.Logger;
	import org.slf4j.LoggerFactory;

	public class UserPasswdAuth implements PasswdAuthenticationProvider {
		Logger logger = LoggerFactory.getLogger(UserPasswdAuth.class);
		private static final String USER_PASSWD_AUTH_PREFIX = "hive.jdbc_passwd.auth.%s";
		private Configuration conf = null;
		@Override
		public void Authenticate(String userName, String passwd) throws AuthenticationException {
			logger.info("user: " + userName + " try login.");
			String passwdConf = getConf().get(String.format(USER_PASSWD_AUTH_PREFIX, userName));
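			// No password is configured for this user in hive-site.xml.
			if (passwdConf == null) {
				String message = "No password found for user " + userName;
				logger.info(message);
				throw new AuthenticationException(message);
			}
			// Compare from the configured value so a null passwd cannot cause a NullPointerException.
			if (!passwdConf.equals(passwd)) {
				String message = "User name and password do not match for " + userName;
				throw new AuthenticationException(message);
			}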
		}
		public Configuration getConf() {
			if (conf == null) {
				this.conf = new Configuration(new HiveConf());
			}
			return conf;
		}
		public void setConf(Configuration conf) {
			this.conf = conf;
		}
	}

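Option 1: ./beeline -u jdbc:hive2://node01:10000/test -n zhangsan -p 123456789
Option 2:
./beeline
!connect jdbc:hive2://node01:10000/test
Enter the user name
Enter the password

Connecting over JDBC

A JDBC client also connects to the hiveserver2 service, and it can only work with Hive once that connection succeeds.
So a JDBC connection needs the user name and password as well.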

package com.hoe.hive.jdbc;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ConnectHive {

	public static String driverName = "org.apache.hive.jdbc.HiveDriver";

	public static void main(String[] args) {

		// Load the Hive JDBC driver; stop early if it is not on the classpath.
		try {
			Class.forName(driverName);
		} catch (ClassNotFoundException e) {
			e.printStackTrace();
			return;
		}

		String url = "jdbc:hive2://node01:10000";
		String userName = "zhangsan";
		String passwd = "123456789";
		String sql = "select * from test.logtbl limit 10";
		// try-with-resources closes the result set, statement and connection.
		try (Connection conn = DriverManager.getConnection(url, userName, passwd);
				Statement statement = conn.createStatement();
				ResultSet resultSet = statement.executeQuery(sql)) {
			while (resultSet.next()) {
				// Print the first two columns (host and identity) of each row.
				System.out.println(resultSet.getString(1) + "-" + resultSet.getString(2));
			}
		} catch (SQLException e) {
			e.printStackTrace();
		}

	}
}

Hive Built-in Functions

Word count with Hive

Create the source table

create table docs(line string);

Create the result table

create table wc(word string, totalword int);

Load the data

load data local inpath '/tmp/wc' into table docs;

Count the words and insert the totals into the result table

from (select explode(split(line, ' ')) as word from docs) w 
insert into table wc 
 select word, count(1) as totalword 
 group by word 
 order by word;

Query the result

select * from wc;
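
For readers who find the HiveQL compact, the same computation can be sketched in plain Java: split each line on a space and emit one word per row (split plus explode), then count per word (group by plus count(1)), with the words sorted (order by word). The class name and sample lines are made up for the example. Edge cases such as repeated or trailing spaces may be handled differently by Java's String.split than by Hive's split.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the computation the HiveQL word count performs.
class WordCountSketch {

    // split(line, ' ') + explode -> one word per row; group by word + count(1) -> totals.
    static Map<String, Integer> count(List<String> lines) {
        // TreeMap keeps the words sorted, like "order by word".
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                totals.merge(word, 1, Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("hello hive", "hello hadoop");
        // prints hadoop 1, hello 2, hive 1 (tab separated, one per line)
        count(docs).forEach((word, total) -> System.out.println(word + "\t" + total));
    }
}
```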

Custom UDF

add jar /opt/software/hive/hive-1.2.1/lib/FormatTimeUDF.jar;
CREATE TEMPORARY FUNCTION convertTime AS 'com.hoe.hive.userdefinedfunction.FormatTimeUDF';
select convertTime(time) from logtbl;
DROP TEMPORARY FUNCTION convertTime;

Custom UDAF

add jar /opt/software/hive/hive-1.2.1/lib/ReduceTimeByResponseNumUDAF.jar;
CREATE TEMPORARY FUNCTION rrd AS 'com.hoe.hive.userdefinedfunction.ReduceTimeByResponseNumUDAF';
select referer,rrd(host) from logtbl group by referer;
DROP TEMPORARY FUNCTION rrd;

Custom UDTF

add jar /opt/software/hive/hive-1.2.1/lib/UserGenericUDTF.jar;
CREATE TEMPORARY FUNCTION exp AS 'com.hoe.hive.userdefinedfunction.UserGenericUDTF';
select exp(line) from udtfc;

Permanent functions

Upload the corresponding jar files to HDFS first, then register the functions:
create function formatTime AS 'com.hoe.hive.userdefinedfunction.FormatTimeUDF' using jar 'hdfs://zfg/test/FormatTimeUDF.jar';
create function exp AS 'com.hoe.hive.userdefinedfunction.UserGenericUDTF' using jar 'hdfs://zfg/test/UserGenericUDTF.jar';
create function rrd AS 'com.hoe.hive.userdefinedfunction.ReduceTimeByResponseNumUDAF' using jar 'hdfs://zfg/test/ReduceTimeByResponseNumUDAF.jar';

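Ways to run Hive scripts:

Running outside the Hive shell
hive -e "SQL statement": runs the statement, prints the result, then exits
hive -e "SQL statement" > file: writes the result to a file (use >> to append instead of overwrite)
hive -S -e "SQL statement": silent mode; prints only the query result, not the execution progress
hive -f file: runs an HQL (conforming to the 98-03 standard) script
hive -i "HQL script file": runs the HQL statements in the script first, then starts the interactive Hive shell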