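A Detailed Introduction to Hive

阿新 • Published: 2018-12-10

Introduction

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like query capabilities.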

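Installing Hive

Install Hive alongside Hadoop, in the same parent directory:
tar -zxvf apache-hive-2.3.3-bin.tar.gz -C /home/hadoop/apps/

Configure the environment variables:
vi ~/.bash_profile
Add the following, setting HIVE_HOME to the Hive installation directory:
export HIVE_HOME=
export PATH=$JAVA_HOME/bin:$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin
Reload the profile:
source ~/.bash_profile

Edit the configuration file hive-site.xml:
vi /home/hadoop/apps/hive/conf/hive-site.xml
Add: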

<configuration>
<property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop1:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
</property>
<property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
</property>
<property>
        <name>javax.jdo.option.ConnectionPassword</name> 
         <value>Lousen??1234</value>
</property>
</configuration>

Copy the MySQL connector JAR into Hive's lib directory:
cp mysql-connector-java-5.1.38.jar /home/hadoop/apps/hive/lib/

Log in to MySQL and create the corresponding user:
mysql -uroot -p
create user 'hive'@'hadoop1' identified by 'Lousen??1234';
grant all privileges on *.* to 'hive'@'hadoop1' identified by 'Lousen??1234' with grant option;
flush privileges;

Restart the MySQL service:
sudo systemctl restart mysqld

Initialize the metastore schema:
schematool -dbType mysql -initSchema

Hive in Practice

Creating Tables

Managed (internal) table: create a product table

create table t_product(id int,name string,price double,category string)
row format delimited
fields terminated by ','
stored as textfile;

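Load data (from the local file system)
load data local inpath '/home/hadoop/product_data' into table t_product;

Load data (from HDFS; the source file is moved into the table directory)
load data inpath '/data/hive/test/product_data' into table t_product;

View the table data
select * from t_product;

Drop the table (later examples still query t_product, so skip this step when following along)
drop table t_product;

External table: create a phone table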

create external table t_phone(id int,name string,price double)
row format delimited
fields terminated by ','
stored as textfile
location '/hive/test/'; 
Note: the table is created at the specified location on HDFS.

Load data
load data local inpath '/home/hadoop/phone_data' into table t_phone;

Create a table from a subquery

create table t_product_back
as
select * from t_product;

Partitioned table

Create the table (partitioned)
Partition by month
create table t_order(id int,name string,cost double)
partitioned by (month string)
row format delimited 
fields terminated by ',';

Load data into partition 6
load data local inpath '/home/hadoop/phone_data' into table t_order
partition(month='6');

View all partitions of the order table
show partitions t_order;
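
Each partition of t_order is stored as its own subdirectory of the table directory, which is why the month column does not appear in the data file. The Java sketch below only builds the path one would expect for a partition. It assumes the default database and the default warehouse directory /user/hive/warehouse (the usual value of hive.metastore.warehouse.dir unless it has been overridden); the class and method names are illustrative.

```java
// Sketch: where Hive places the files of one partition.
// Assumes the default database and the default warehouse directory.
class PartitionLayout {
    static final String WAREHOUSE = "/user/hive/warehouse";

    // Builds the HDFS directory of one partition, e.g. month=6 of t_order.
    static String partitionDir(String table, String column, String value) {
        return WAREHOUSE + "/" + table + "/" + column + "=" + value;
    }

    public static void main(String[] args) {
        System.out.println(partitionDir("t_order", "month", "6"));
    }
}
```

A query that filters on the partition column, such as select * from t_order where month='6';, only needs to read that one directory.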

Bucketed table

Create the table (bucketed)
create table t_product_bucket(id int,name string ,price string,category string)
clustered by(id) into 3 buckets
row format delimited 
fields terminated by ',';

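Rows must be inserted into a bucketed table by selecting from another table, so that Hive can distribute them across the buckets (the setting below is needed on Hive 1.x; Hive 2.x always enforces bucketing)
set hive.enforce.bucketing=true;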
insert into table t_product_bucket select * from t_product;

Query the data in bucket 2
select * from t_product_bucket tablesample(bucket 2 out of 3 on id);
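
To make the bucket arithmetic concrete, here is a minimal Java sketch of how an int key maps to a bucket and which rows bucket 2 out of 3 selects. It assumes the hashing used by Hive 2.x for int columns, where the value serves as its own hash code; other versions or column types may hash differently. The ids 1 to 8 are made up for illustration, since the contents of t_product are not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how rows are assigned to buckets when clustering by an int column.
// Assumption: the int value is used as its own hash code (Hive 2.x behaviour).
class BucketDemo {
    // Zero-based bucket index of an int key.
    static int bucketIndex(int id, int numBuckets) {
        return (id & Integer.MAX_VALUE) % numBuckets;
    }

    // Ids that tablesample(bucket x out of y on id) would return; x is one-based.
    static List<Integer> sample(int[] ids, int x, int y) {
        List<Integer> hits = new ArrayList<>();
        for (int id : ids) {
            if (bucketIndex(id, y) == x - 1) {
                hits.add(id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] ids = {1, 2, 3, 4, 5, 6, 7, 8}; // hypothetical product ids
        System.out.println(sample(ids, 2, 3)); // ids whose hash modulo 3 equals 1
    }
}
```

Under these assumptions, bucket 2 holds the ids that leave a remainder of 1 when divided by 3.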

Array

Create the table (array)
create table tab_array (a array<int>,b array<string>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';

Sample data (the two columns are separated by a tab)
1,2,3   hello,world,briup

Load data
load data local inpath '/home/hadoop/array_data' into table tab_array;

Query data
select a[2],b[1] from tab_array;
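
The following Java sketch mimics how the delimiters declared for tab_array split one line of the data file. It is only an illustration of the row format, not Hive's actual parser; the class name is hypothetical, and the items of column a are kept as strings for brevity.

```java
import java.util.Arrays;

// Sketch: how the delimiters declared for tab_array split one line of the data file.
class ArrayRowDemo {
    // Splits a raw line into columns ('\t'), then each column into items (',').
    static String[][] parse(String line) {
        String[] fields = line.split("\t");
        String[][] columns = new String[fields.length][];
        for (int i = 0; i < fields.length; i++) {
            columns[i] = fields[i].split(",");
        }
        return columns;
    }

    public static void main(String[] args) {
        String[][] row = parse("1,2,3\thello,world,briup");
        // Counterpart of: select a[2],b[1] from tab_array;  (indexes start at 0)
        System.out.println(row[0][2] + " " + row[1][1]);
        System.out.println(Arrays.toString(row[1]));
    }
}
```

Because array indexes start at 0, a[2] is the third item of the first column and b[1] is the second item of the second column.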

Map

Create the table (map)
create table tab_map (name string,info map<string,string>)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';

Sample data (the two columns are separated by a tab)
zhangsan    name:zhangsan,age:18,gender:male

Load data
load data local inpath '/home/hadoop/map_data' into table tab_map;

Query data
select  info['name'],info['gender'] from tab_map;
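
As a rough illustration of the three delimiters declared for tab_map, the Java sketch below turns the sample line into a map and looks up two keys. It is not how Hive parses rows internally, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: how the delimiters declared for tab_map turn one line into a map column.
class MapRowDemo {
    // Parses the second column: entries separated by ',', key and value by ':'.
    static Map<String, String> parseInfo(String line) {
        String infoField = line.split("\t")[1];
        Map<String, String> info = new LinkedHashMap<>();
        for (String entry : infoField.split(",")) {
            String[] kv = entry.split(":", 2);
            info.put(kv[0], kv[1]);
        }
        return info;
    }

    public static void main(String[] args) {
        Map<String, String> info = parseInfo("zhangsan\tname:zhangsan,age:18,gender:male");
        // Counterpart of: select info['name'],info['gender'] from tab_map;
        System.out.println(info.get("name") + " " + info.get("gender"));
    }
}
```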

Struct

Create the table (struct)
create table tab_struct(name string,info struct<age:int,tel:string,salary:double>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';
 
Sample data (the two columns are separated by a tab)
zhangsan    18,18945883365,22.3

Load data
load data local inpath '/home/hadoop/struct_data' into table tab_struct;

Query data
select info.age,info.tel from tab_struct;

Querying Data

Preparing the example

Example table big_data
create table big_data(id int,point double)
row format delimited
fields terminated by ','
stored as textfile;

Example data
big_data
1,80.0
4,50.0
3,60.0
8,40.0
6,85.0
2,100.0
5,80.0
7,60.0

Load data
load data local inpath '/home/hadoop/big_data' into table big_data;

View data
select * from big_data;

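Sorting

order by id asc: global sort
ex:
set mapred.reduce.tasks=2;
select * from big_data order by id;
Note: two reduce tasks are set only to make the difference from sort by visible; order by still returns a single, globally sorted result.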

sort by id desc: local sort, each reducer sorts only the rows it receives
ex:
set mapred.reduce.tasks=2;
select * from big_data sort by id;
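
The difference between a global and a local sort can be sketched in plain Java. The ids are those of big_data; how the rows are split between the two reducers is an assumption made for illustration, since that split is decided by Hive at run time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch: global sort (order by) versus per-reducer sort (sort by).
// Assumption: the split of rows between the two reducers below is arbitrary.
class SortDemo {
    // order by: all rows pass through one final sort.
    static List<Integer> orderBy(List<Integer> ids) {
        List<Integer> out = new ArrayList<>(ids);
        Collections.sort(out);
        return out;
    }

    // sort by: each reducer sorts only the rows it received.
    static List<List<Integer>> sortBy(List<List<Integer>> reducers) {
        List<List<Integer>> out = new ArrayList<>();
        for (List<Integer> rows : reducers) {
            out.add(orderBy(rows));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(1, 4, 3, 8, 6, 2, 5, 7); // ids of big_data
        System.out.println(orderBy(ids));
        List<List<Integer>> reducers = Arrays.asList(
                Arrays.asList(1, 3, 6, 5),   // rows that reached reducer 0
                Arrays.asList(4, 8, 2, 7));  // rows that reached reducer 1
        System.out.println(sortBy(reducers));
    }
}
```

Each reducer's output is ordered on its own, but concatenating the two outputs does not give a globally ordered list.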

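Partitioning: distribute by
Divides the rows by the given column or expression and sends each group to the corresponding reducer or output file.
ex:
set mapred.reduce.tasks=2;
insert overwrite local directory '/home/hadoop/data' select id from big_data distribute by id;
Note: be very careful with overwrite; do not point it at your home directory, or its contents will be overwritten.

Partitioning plus sorting: cluster by
Combines the distribution of distribute by with the sorting of sort by.
ex:
set mapred.reduce.tasks=2;
insert overwrite local directory '/home/hadoop/data' select id from big_data cluster by id;

Deduplication: group by and distinct
group by
select point from big_data group by point;
distinct
select distinct point from big_data;
Note: on large data sets distinct tends to be less efficient, so group by is generally recommended.

Virtual columns
INPUT__FILE__NAME: the name of the HDFS file the row comes from.
BLOCK__OFFSET__INSIDE__FILE: the offset of the row within the file.
ex:
select id,INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE from big_data;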

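Join

Joins in Hive fall into two kinds: Common Join (the join is completed in the reduce phase) and Map Join (the join is completed in the map phase). If Map Join is not requested, or its conditions are not met (hive.auto.convert.join=true enables map join for small tables; hive.mapjoin.smalltable.filesize=25M sets the small-table threshold), the Hive parser converts the join into a Common Join, that is, the join is completed in the reduce phase.

Besides the inner, left, right, and full joins found in traditional databases, Hive also supports LEFT SEMI JOIN and CROSS JOIN, although both can be expressed with the former types.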

Preparing the example

Example data
user_name_data
1   zhangsan
2   lisi
3   wangwu

Example table user_name
create table user_name(id int,name string)
row format delimited
fields terminated by '\t'
stored as textfile;

Load data
load data local inpath '/home/hadoop/user_name_data' into table user_name;

View data
select * from user_name;

Example data
user_age_data
1   30
2   29
4   21

Example table user_age
create table user_age(id int,age int)
row format delimited
fields terminated by '\t'
stored as textfile;

Load data
load data local inpath '/home/hadoop/user_age_data' into table user_age;

View data
select * from user_age;

Inner join

SELECT a.id,
a.name,
b.age
FROM user_name a
inner join user_age b
ON (a.id = b.id);

Left outer join

SELECT a.id,
a.name,
b.age
FROM user_name a
left join user_age b
ON (a.id = b.id);

Right outer join

SELECT a.id,
a.name,
b.age
FROM user_name a
RIGHT OUTER JOIN user_age b
ON (a.id = b.id);

Full outer join

SELECT a.id,
a.name,
b.age
FROM user_name a
FULL OUTER JOIN user_age b
ON (a.id = b.id);

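Semi join

LEFT SEMI JOIN
The table before the LEFT SEMI JOIN keyword is the primary table;
rows of the primary table whose key also exists in the secondary table are returned.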
SELECT a.id,
a.name
FROM user_name a
LEFT SEMI JOIN user_age b
ON (a.id = b.id);
-- equivalent to:
SELECT a.id,
a.name
FROM user_name a
WHERE a.id IN (SELECT id FROM user_age);

Cartesian product (CROSS JOIN)

SELECT a.id,
a.name,
b.age
FROM user_name a
CROSS JOIN user_age b;
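
To see which rows each join type keeps, the sample data of user_name and user_age can be replayed in plain Java. This is a sketch of the join semantics only, assuming each id occurs at most once per table as in the sample files; the class and method names are illustrative, and the row order Hive prints may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Sketch: rows returned by each join type for the user_name / user_age sample data.
// Assumes each id occurs at most once per table.
class JoinDemo {
    static final Map<Integer, String> NAMES = new LinkedHashMap<>();
    static final Map<Integer, Integer> AGES = new LinkedHashMap<>();
    static {
        NAMES.put(1, "zhangsan"); NAMES.put(2, "lisi"); NAMES.put(3, "wangwu");
        AGES.put(1, 30); AGES.put(2, 29); AGES.put(4, 21);
    }

    // Returns rows formatted as "a.id,a.name,b.age"; a missing side shows as null.
    static List<String> join(String type) {
        TreeSet<Integer> ids = new TreeSet<>();
        switch (type) {
            case "inner": ids.addAll(NAMES.keySet()); ids.retainAll(AGES.keySet()); break;
            case "left":  ids.addAll(NAMES.keySet()); break;
            case "right": ids.addAll(AGES.keySet()); break;
            case "full":  ids.addAll(NAMES.keySet()); ids.addAll(AGES.keySet()); break;
            default: throw new IllegalArgumentException(type);
        }
        List<String> rows = new ArrayList<>();
        for (int id : ids) {
            // a.id is null when the row exists only in user_age
            String aId = NAMES.containsKey(id) ? String.valueOf(id) : "null";
            rows.add(aId + "," + NAMES.get(id) + "," + AGES.get(id));
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String type : new String[]{"inner", "left", "right", "full"}) {
            System.out.println(type + ": " + join(type));
        }
    }
}
```

LEFT SEMI JOIN keeps the same ids as the inner join but returns only the columns of user_name, while CROSS JOIN pairs every row of one table with every row of the other (3 x 3 = 9 rows here).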

Built-in Functions (UDFs) and Built-in Operators

Random number: rand()
select rand() from t_product;
Factorial of a: factorial(INT a)
select factorial(10) from t_product;
Largest value: greatest(T v1, T v2, ...)
select greatest(10,123,53,34,1,23,502,120) from t_product;
Smallest value: least(T v1, T v2, ...)
select least(10,123,53,34,1,23,502,120) from t_product;
Mathematical constant e
select e() from t_product;
Mathematical constant pi
select pi() from t_product;
Current date
select current_date from t_product;
Return a default when the value is null: nvl(T value, T default_value)
select id,nvl(name, 'anonymous') from t_product;
Choose a value depending on the input: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
Returns c when a = b, e when a = d, and f otherwise. For example, CASE 4 WHEN 5 THEN 5 WHEN 4 THEN 4 ELSE 3 END returns 4.
select CASE id WHEN 1 THEN 'genuine' ELSE 'knockoff' END,name from t_product;
Check whether a file contains a line equal to the given string: in_file(string str, string filename)
select in_file('start','/home/briup/test') from t_product;
Split a string by a pattern (pat is a regular expression): split(string str, string pat)
select split('hello,world,briup', ',') from t_product;
Extract a substring: substr(string|binary A, int start, int len)
select substr('ceo-larry', 0, 3) from t_product;
Position of the first occurrence of a substring, counting from 1: instr(string str, string substr)
select instr('ceo-larry', 'la') from t_product;
Replace each character of the input that appears in the from string with the character at the same position in the to string: translate(string|char|varchar input, string|char|varchar from, string|char|varchar to)
select translate('hello briup', 'briup', 'imut') from t_product;
Levenshtein (edit) distance between two strings: levenshtein(string A, string B)
select levenshtein('hello', 'worldd') from t_product;
Join the elements of an array with a separator: concat_ws(string SEP, array<string>)
select concat_ws('#', split('hello,world,briup', ',')) from t_product;
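
Two of these functions are easy to misread, so here is a plain-Java re-implementation of their semantics as I understand them from the Hive documentation: translate works character by character rather than on whole substrings, and levenshtein counts single-character edits. This is a sketch for checking expectations, not Hive's own code.

```java
// Sketch of the semantics of two built-in string functions, re-implemented in plain Java.
class StringFunctionsDemo {
    // translate: maps each character found in 'from' to the character at the same
    // position in 'to'; characters of 'from' without a counterpart are dropped.
    static String translate(String input, String from, String to) {
        StringBuilder out = new StringBuilder();
        for (char c : input.toCharArray()) {
            int pos = from.indexOf(c);
            if (pos < 0) {
                out.append(c);               // not listed in 'from': keep as is
            } else if (pos < to.length()) {
                out.append(to.charAt(pos));  // replace with its counterpart
            }                                // otherwise: no counterpart, drop it
        }
        return out.toString();
    }

    // levenshtein: minimum number of single-character insertions, deletions,
    // and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(translate("hello briup", "briup", "imut"));
        System.out.println(levenshtein("hello", "worldd"));
    }
}
```

With the arguments from the queries above, the sketch yields hello imut (the trailing p has no counterpart and is dropped) and an edit distance of 5.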

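User-Defined Functions (UDFs)

Write a Java class that extends UDF and implements the desired logic.
Package it as a JAR and upload it to the node where Hive runs.
Create a function in Hive and associate it with the custom class in the JAR:
- Add the JAR: add jar /home/hadoop/udf.jar;
- Create the association: create temporary function getArea as 'com.briup.udf.MyUDF';
- List the functions: show functions;
- Use the custom function: select id,name,tel,getArea(tel) from t_student_udf;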
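
The class behind getArea is not shown, so the following is only a guess at its shape. In a real project the class would live in the package com.briup.udf and declare extends UDF (org.apache.hadoop.hive.ql.exec.UDF from the hive-exec library); both are left out here so that the sketch compiles on its own. The mapping from phone prefix to area is invented for illustration.

```java
// Sketch of a simple UDF class. For Hive, the declaration would read
// "public class MyUDF extends UDF" with hive-exec on the classpath;
// that dependency is omitted so the sketch is self-contained.
class MyUDF {
    // Hive looks for a method named evaluate and calls it once per row.
    // The prefix-to-area mapping below is made up for illustration.
    public String evaluate(String tel) {
        if (tel == null || tel.length() < 3) {
            return "unknown";
        }
        switch (tel.substring(0, 3)) {
            case "138": return "beijing";
            case "189": return "shanghai";
            default:    return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println(new MyUDF().evaluate("18945883365"));
    }
}
```

Handling null input explicitly matters because Hive passes NULL column values straight to evaluate.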

Connecting to Hive over JDBC

Edit the configuration file hdfs-site.xml

<property>  
<name>dfs.webhdfs.enabled</name>  
<value>true</value>  
</property>

Edit the configuration file core-site.xml

<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
</property>

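Note: the third segment of the two property names above (hadoop) is the user name on the machine where Hive runs.

Start HiveServer2 and keep it running:
hive --service hiveserver2 &

Write code to test the connection: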

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class JDBCTest {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Connect to database db1803 on hadoop1 as user hadoop
        Connection conn = DriverManager.getConnection("jdbc:hive2://hadoop1:10000/db1803","hadoop","hadoop");
        System.out.println(conn);
        Statement stat = conn.createStatement();
        // Create a comma-delimited text table through the connection
        String hql = "create table t_student_jdbc(id int,name string,tel string)" +
                " row format delimited" +
                " fields terminated by ','" +
                " stored as textfile";
        stat.execute(hql);
        stat.close();
        conn.close();
    }
}
