Have You Forgotten These Hive Basics?

阿新 • Published: 2020-01-19

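Hive is really a client, similar to Navicat or PL/SQL Developer. The difference is that Hive reads data stored on HDFS and is used for offline queries. Offline means slow: a single job can take several hours or even longer.

Hive is widely used in day-to-day development, mostly for statistics work. In my own experience, about 80% of that work can be done with the basic parts of Hive; only rarely do you need complex queries or tuning.

This article picks out basics that are easy to overlook. It is fairly long, so consider bookmarking it and reading it in several sittings. The document available from the account backend explains each point in detail; if you want to dig deeper into Hive, you can get it at the end of the article.

The article mainly covers: how Hive queries work, internal and external tables, partitioned tables, bucketed tables, data import, complex data types, the four "by" clauses, and a real-world join that uses partitions.

Before getting started, here is a quick look at how Hive works: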

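01 How Hive Queries Work

Hive actually converts HQL into MapReduce (MR) programs and runs them. We won't dig into how the conversion works underneath; let's just look at the Hive query process: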

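1 Create a Hive table that matches the format of the data on HDFS

2 Load the HDFS data into the table through the mapping

3 The metadata of the Hive table is then recorded in MySQL. Metadata does not mean the data on HDFS; it means information about the Hive table itself, such as its parameters.

4 When writing a select statement, write the query according to the mapping between the table and the data

5 When a query runs, Hive first looks up the file location of the table in the metastore database, then its parser, compiler, optimizer, and executor convert the SQL statement into an MR program that runs on YARN and finally produces the result.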

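PS: Hive can be queried in three ways: bin/hive (the client), JDBC, and the web UI. JDBC is the most common in practice. (The backend document has detailed instructions; to make it easier to run the SQL that follows, set up an environment first.)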

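02 Internal vs. External Tables

Unlike tables in a conventional database, Hive tables are divided into internal tables and external tables. The two behave differently when a table is created and when it is dropped.

1. When a table is created and data is loaded:

An internal table moves the data files into its own location, by default a directory under /user/hive/warehouse/

An external table does not move the data; it stays wherever it is

2. When a table is dropped:

Dropping an internal table deletes the data together with the table

Dropping an external table does not delete the data

So the difference is clear. In practice, external tables are generally used to map existing data, while statistical results are mostly stored in internal tables, because an internal table is only used to hold results or for joins and is not tied to the source data on HDFS.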
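The drop behaviour described above can be pictured with a small model. This is only a sketch: the two dicts stand in for HDFS and the metastore, and the table names, paths, and the drop_table helper are made up for illustration rather than being Hive APIs.

```python
# Toy model of DROP TABLE semantics; the dicts stand in for HDFS and the metastore.
hdfs = {
    "/user/hive/warehouse/result_tbl": ["part-00000"],  # data owned by an internal table
    "/data/logs": ["2019-12.log"],                      # data an external table points at
}
metastore = {
    "result_tbl": {"external": False, "location": "/user/hive/warehouse/result_tbl"},
    "logs_ext": {"external": True, "location": "/data/logs"},
}

def drop_table(name):
    """Remove the table's metadata; delete its files only for an internal table."""
    meta = metastore.pop(name)
    if not meta["external"]:
        hdfs.pop(meta["location"], None)

drop_table("result_tbl")
drop_table("logs_ext")
print(sorted(hdfs))  # only the directory behind the external table is left
```

In both cases the metadata is gone after the drop; the only difference is whether the files survive.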

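So how do you tell whether a table is external or internal?

For a table that already exists, run:

desc formatted table_name; and check the Table Type field (MANAGED_TABLE or EXTERNAL_TABLE).

For a new table:

The create statement itself tells you: a statement with EXTERNAL creates an external table; one without it creates an internal table.

The create table syntax is as follows: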

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name

[(col_name data_type [COMMENT col_comment], ...)] -- column names and types

[COMMENT table_comment]   -- table comment

[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] -- partition columns

[CLUSTERED BY (col_name, col_name, ...) -- bucketing columns

[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] -- sort columns and bucket count

[ROW FORMAT row_format]  -- e.g. row format delimited fields terminated by 'delimiter'

[STORED AS file_format] -- storage file format

[LOCATION hdfs_path] -- HDFS path of the table data

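03 Hive Partitioned Tables

Partitioned tables are used almost everywhere, usually partitioned by calendar year and month, which makes the data easier to manage. A query can also be restricted to specific partitions.

Query without a partition filter:

select a1,a2 .. from table1;

This scans the whole table. If the data volume is large, you could be waiting forever for the result.

Query with a partition filter:

select a1,a2 .. from table1 where (year = '2019' and month='12');

This only reads the data for December 2019. Making good use of partitions greatly improves query efficiency.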
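Why the partition filter helps can be sketched with a toy model. Hive stores each partition in its own directory named column=value; the rows and the scan helper below are invented for illustration, and real pruning happens inside Hive's planner rather than in user code.

```python
# Hypothetical partition directories of table1, named the way Hive lays them out.
partitions = {
    "year=2019/month=11": ["row-a", "row-b"],
    "year=2019/month=12": ["row-c"],
    "year=2020/month=01": ["row-d", "row-e"],
}

def scan(filters=None):
    """Return (rows, directories read); filters maps partition column -> value."""
    rows, dirs_read = [], 0
    for path, data in partitions.items():
        spec = dict(part.split("=") for part in path.split("/"))
        if filters and any(spec.get(col) != val for col, val in filters.items()):
            continue  # pruned: this directory is never opened
        dirs_read += 1
        rows.extend(data)
    return rows, dirs_read

full = scan()                                    # no partition filter
pruned = scan({"year": "2019", "month": "12"})   # where year = '2019' and month = '12'
print(full[1], pruned[1])
```

The filtered scan opens one directory instead of three, which is the whole saving: the work grows with the partitions touched, not with the size of the table.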

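So how do you create partitions?

Just add the partition line to the create table statement:

[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] -- partition columns

It's that simple.

As an example, skipping the case of a single partition column, here is the SQL for a table with several partition columns: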

create table student (id string,name string, age int)

partitioned by (year string,month string,day string)

row format delimited fields terminated by '\t';

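Types of partitions:

Partitions are either static or dynamic.

A static partition has to be specified by hand, with an explicit value for the partition column. Example SQL:

1 Create the partitioned table: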

create table order_partition(

order_number string,

order_price  double,

order_time string

)

partitioned BY(month string)

row format delimited fields terminated by '\t';

2 Prepare the data; order.txt contains the following:

10001    100 2019-03-02

10002    200 2019-03-02

10003    300 2019-03-02

10004    400 2019-03-03

10005    500 2019-03-03

10006    600 2019-03-03

10007    700 2019-03-04

10008    800 2019-03-04

10009    900 2019-03-04

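3 Load the local file into the table

load data local inpath '/bigdata/install/hivedatas/order.txt' overwrite into table order_partition partition(month='2019-03');

The partition is specified at the end as 2019-03, so all of the rows above land in the 2019-03 partition. Partitions can also be added and dropped manually.

4 Query the result

select * from order_partition where month='2019-03';

The result is:

                                month (partition column)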

10001   100.0   2019-03-02      2019-03

10002   200.0   2019-03-02      2019-03

10003   300.0   2019-03-02      2019-03

10004   400.0   2019-03-03      2019-03

10005   500.0   2019-03-03      2019-03

10006   600.0   2019-03-03      2019-03

10007   700.0   2019-03-04      2019-03

10008   800.0   2019-03-04      2019-03

10009   900.0   2019-03-04      2019-03

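A dynamic partition routes rows into the table's partitions automatically. Unlike a static partition, you only name the partition column; you do not give it an explicit value.

For example:

1 Create the tables:

-- create a plain table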

create table t_order(

    order_number string,

    order_price  double,

    order_time   string

)row format delimited fields terminated by '\t';

-- create the target partitioned table

create table order_dynamic_partition(

    order_number string,

    order_price  double   

)partitioned BY(order_time string)

row format delimited fields terminated by '\t';

2 Prepare the data in order_created.txt; the contents are the same as in the static partition example

10001    100 2019-03-02

10002    200 2019-03-02

10003    300 2019-03-02

10004    400 2019-03-03

10005    500 2019-03-03

10006    600 2019-03-03

10007    700 2019-03-04

10008    800 2019-03-04

10009    900 2019-03-04

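3 Load the data into the plain table t_order

load data local inpath '/bigdata/install/hivedatas/order_created.txt' overwrite into table t_order;

In the dynamic load that follows, no partition value is given by hand; Hive works out which partition each row belongs to from the value of the partition column.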

4 Load the data into the partitioned table dynamically

Dynamic partitioning requires these settings:

-- enable dynamic partitioning

hive> set hive.exec.dynamic.partition=true;

-- switch dynamic partitioning to nonstrict mode

hive> set hive.exec.dynamic.partition.mode=nonstrict;

-- load the data

hive> insert into table order_dynamic_partition partition(order_time) select order_number,order_price,order_time from t_order;

5 View the partitions

hive>  show partitions order_dynamic_partition;

order_time=2019-03-02

order_time=2019-03-03

order_time=2019-03-04
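What the dynamic insert does with the nine sample rows can be sketched as a grouping step. This is a simplified model, not Hive's implementation: the dynamic_insert helper is hypothetical, and it only shows that the last selected column decides the partition while the remaining columns are stored as the row.

```python
from collections import defaultdict

# The nine sample rows of t_order: (order_number, order_price, order_time).
rows = [
    (str(10000 + i), 100.0 * i, f"2019-03-0{2 + (i - 1) // 3}")
    for i in range(1, 10)
]

def dynamic_insert(rows):
    """Route each row to the partition named after its order_time value."""
    partitions = defaultdict(list)
    for order_number, order_price, order_time in rows:
        # the partition column is not stored in the row itself, only in the directory name
        partitions[f"order_time={order_time}"].append((order_number, order_price))
    return dict(partitions)

result = dynamic_insert(rows)
print(sorted(result))
```

Three partitions come out of a single insert, one per distinct order_time, which is what show partitions reports.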

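04 Hive Bucketed Tables

Bucketed tables are generally only used with very large data sets. Bucketing splits the data by the hash of a column's value: rows whose hash maps to the same bucket go into the same file, so data that would have been a single file ends up spread across several files.

For example:

Set a few parameters before creating the bucketed table:

1 Enable bucketing (needed on Hive 1.x; on Hive 2.0 and later bucketing is always enforced)

set hive.enforce.bucketing = true;

2 Set the number of reducers to match the number of buckets

set mapreduce.job.reduces = 4;

Create the bucketed table: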

-- 1 create the bucketed table

create table user_bucket_demo(id int,name string)

clustered by (id)

into 4 buckets

row format delimited fields terminated by '\t';

-- 2 create a plain table

create table user_demo(id int,name string)

row format delimited fields terminated by '\t';

-- 3 load local data into the plain table

load data local inpath '/home/hadoop/data/02/user_bucket.txt' into table user_demo;

 

 

Note:

-- loading into the bucketed table this way does not bucket the data

load data local inpath '/home/hadoop/data/02/user_bucket.txt' into table user_bucket_demo;

-- 4 the correct way to load a bucketed table:

insert into user_bucket_demo select * from user_demo;

-- 5 view the result

select * from user_bucket_demo tablesample(bucket 1 out of 2);

 

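-- number of buckets read = 4/2 = 2

-- first take the data from bucket 1

-- then take the data from bucket 1+2 = 3

The tablesample(bucket x out of y) clause:

- x is the bucket to start reading from

- y must be a factor or multiple of the table's bucket count; when it is a factor, data is read from (bucket count / y) buckets in total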
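The bucket arithmetic can be checked with a short sketch. It assumes the classic rule in which an int key hashes to its own value and the bucket is that hash modulo the bucket count; newer Hive versions may hash differently, and the ids and helper names here are made up for illustration.

```python
NUM_BUCKETS = 4

def bucket_of(user_id, num_buckets=NUM_BUCKETS):
    """1-based bucket number, assuming an int key hashes to its own value."""
    return (user_id & 0x7FFFFFFF) % num_buckets + 1

def sampled_buckets(x, y, num_buckets=NUM_BUCKETS):
    """Buckets read by tablesample(bucket x out of y) when y divides num_buckets."""
    return list(range(x, num_buckets + 1, y))

ids = range(1, 9)  # hypothetical user ids 1..8
layout = {b: [i for i in ids if bucket_of(i) == b] for b in range(1, NUM_BUCKETS + 1)}

picked = sampled_buckets(1, 2)                     # bucket 1, then bucket 1 + 2 = 3
sample = [i for b in picked for i in layout[b]]
print(layout, picked, sample)
```

With 4 buckets and y = 2 the sample reads 4/2 = 2 buckets, starting at bucket 1 and stepping by 2, which matches the comments in the example above.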

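05 Importing Data into Hive

Importing data is usually initialization work. Once a table has been mapped to an HDFS path, data that later arrives under that path is visible through the table (new partition directories still have to be registered, for example with msck repair table). So in general this isn't used much, mostly in your own tests.

The ways to import data are:

Loading data with load

This was already used in the partitioned table section.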

load data [local] inpath 'dataPath' [overwrite ] into table student [partition (partcol1=val1,…)];

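Adding local loads from the local file system; without it the path is on HDFS

Adding overwrite replaces the existing data in the table; without overwrite the data is appended

Adding partition loads the data into a specific partition

Loading data from a query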

insert overwrite table yourTableName partition(month = '201806') select column1,column2 from otherTable;

Creating a table and loading data from a query

create table yourTableName as select * from otherTable;

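Specifying the data path with location (commonly used)

1 Create the table and specify its path on HDFS

create external table score (s_id string,c_id string,s_score int) row format delimited fields terminated by '\t' location '/myscore';

2 Upload the data to HDFS; HDFS can be operated from the Hive client with dfs commands

-- create the HDFS path

dfs -mkdir -p /myscore;

-- upload the data to HDFS (the test data is at the end of the article)

dfs -put /bigdata/install/hivedatas/score.csv /myscore;

3 View the result

select * from score;

Note:

If the query returns no data, you can use:

msck repair table score;

to repair the table. Put simply, this rebuilds the mapping between the table and its data files. It matters mainly for partitioned tables, where it registers partition directories that exist on HDFS but are missing from the metastore.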

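06 Creating Tables with Complex Data Types

Hive has three complex data types: Array, Map, and Struct.

Array is an array: a list of values of the same type

Map is a mapping of key-value pairs

Struct holds a group of values whose types can differ

Besides the field delimiter for each row (row format), a table with complex types also has to specify the delimiters for those types.

Create statement for complex types: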

create table tablename (id string,name string,...)

row format delimited fields terminated by ' '

collection items terminated by '\t' -- element delimiter for Array and Struct, and between Map entries

map keys terminated by ':' -- delimiter between a Map key and its value

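Notes on the statement

Creating the table:

Array, Struct, and Map all use collection items terminated by '' for their element delimiter

For a Map with several entries, the KV pairs are separated by the collection items delimiter, e.g. collection items terminated by '\t'

Inside one KV pair, the key and value are separated by map keys terminated by ':'

Querying: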

array -- select locations[0]

map -- info['name']

struct -- info.name info.age

Test cases:

Array

Prepare the test data file t_array.txt; the array elements are joined with ","

Data:

1 zhangsan beijing,shanghai

2 lisi shanghai,tianjin

Create the table:

 

create table t_array(

id string,

name string,

locations array<string>

) 

row format delimited fields terminated by ' ' collection items terminated by ',';

 

Load the data into the table

 

load data local inpath '/home/hadoop/data/01/t_array.txt' into table t_array;

 

Test queries:

1 Simple query:

select id,locations[0],locations[1] from t_array;

2 Number of elements in the array

select size(locations) from t_array;

3 Rows whose locations contain beijing

select * from t_array

where array_contains(locations,'beijing');

Map

Prepare the test data file t_map.txt

Data:

1 name:zhangsan#age:30

2 name:lisi#age:40

Create the table:

 

create table t_map

(id string,info map<string,string>)

row format delimited fields terminated by ' '

-- '#' separates the KV pairs
collection items terminated by '#'

-- ':' separates a key from its value
map keys terminated by ':';

 

Load the data:

 

load data local inpath '/home/hadoop/data/01/t_map.txt' into table t_map;

 

Query results:

1 Simple query:

select id,info['name'],info['age'] from t_map;

2 All keys of the map:

select map_keys(info) from t_map;

3 All values of the map:

select map_values(info) from t_map;

Struct

Prepare the test data file t_struct.txt

Data:

1 zhangsan:30:beijing

2 lisi:40:shanghai

Create the table:

 

create table t_struct(id string,info struct<name:string,age:int,address:string>)

-- ' ' separates the fields
row format delimited fields terminated by ' '

-- ':' separates the members of the struct
collection items terminated by ':';

 

 

Load the data:

 

load data local inpath '' into table t_struct;

 

Query result:

 

select id,info.name,info.age,info.address from t_struct;
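How the three delimiter levels carve up a line of text can be sketched in plain Python. This only mimics the splitting for the sample rows above; the parse_* helpers are hypothetical, and Hive's own text SerDe handles escaping, nulls, and type conversion that the sketch leaves out.

```python
def parse_array(line):
    """t_array row: fields split on ' ', array elements on ','."""
    row_id, name, locations = line.split(" ")
    return row_id, name, locations.split(",")

def parse_map(line):
    """t_map row: fields on ' ', entries on '#', key and value on ':'."""
    row_id, info = line.split(" ")
    return row_id, dict(entry.split(":") for entry in info.split("#"))

def parse_struct(line):
    """t_struct row: fields on ' ', struct members on ':' in declared order."""
    row_id, info = line.split(" ")
    name, age, address = info.split(":")
    return row_id, {"name": name, "age": int(age), "address": address}

array_row = parse_array("1 zhangsan beijing,shanghai")
map_row = parse_map("1 name:zhangsan#age:30")
struct_row = parse_struct("1 zhangsan:30:beijing")

# locations[0], info['name'], info.age in the queries correspond to:
print(array_row[2][0], map_row[1]["name"], struct_row[1]["age"])
```

A struct has no keys in the data file; the member names come from the table definition, so only the order of the values matters.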

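07 The Four "by" Clauses in Hive

order by: global sort. The result is globally ordered no matter how many reducers are configured, because all rows go through a single reducer

sort by: sorts within each reducer. With one reducer the result is globally ordered, the same as order by; with more than one reducer, only each reducer's output is ordered

distribute by + sort by: partitioned sort. Unlike sort by alone, you choose the column that rows are distributed on: map output rows with the same hash of that column go to the same reducer, and each reducer's output is sorted

cluster by: when distribute by and sort by use the same column, the pair can be replaced by cluster by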
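The difference between the clauses shows up clearly in a small simulation. The rows, the reducer count, and the use of ord() as a stand-in for Hive's hash are all assumptions made so the example is deterministic; real reducer assignment depends on Hive's own hash function.

```python
# Hypothetical rows of (dept, salary); two reducers chosen for illustration.
rows = [("a", 3), ("b", 1), ("a", 1), ("c", 2), ("b", 5), ("c", 4)]
NUM_REDUCERS = 2

def order_by(rows, key):
    """One reducer sees every row, so the output is globally ordered."""
    return sorted(rows, key=key)

def distribute_sort(rows, dist_key, sort_key, n=NUM_REDUCERS):
    """distribute by dist_key, then sort by sort_key inside each reducer."""
    reducers = [[] for _ in range(n)]
    for row in rows:
        # ord() stands in for Hive's hash of the distribute column
        reducers[ord(dist_key(row)) % n].append(row)
    return [sorted(part, key=sort_key) for part in reducers]

global_sorted = order_by(rows, key=lambda r: r[1])
# distribute by dept sort by salary
per_reducer = distribute_sort(rows, dist_key=lambda r: r[0], sort_key=lambda r: r[1])
# cluster by dept (same column for both steps)
clustered = distribute_sort(rows, dist_key=lambda r: r[0], sort_key=lambda r: r[0])
print(global_sorted, per_reducer, clustered, sep="\n")
```

Each reducer's list is ordered on its own, but concatenating the lists does not give a globally ordered result; only order by guarantees that.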

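08 A Real Requirement: Using Partitions in a Join

Joins in Hive are used the same way as in a conventional database, with the same keywords such as inner join and left join. Here is a requirement from real work.

The requirement:

Hive has a table that stores articles.

Columns:

title -- title

content -- content

pubtime -- publication time

serviceId -- article type

partition columns -- year, month

Find the articles published from the 11th to the 18th of November 2019 whose title is identical to the content, whose title is longer than 30 characters, and whose article type is between 1 and 5

The result uses subqueries plus a self join to find the matching articles

Note: always filter on the partitions, otherwise the job will hang.

The resulting SQL: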

select t1.id, t1.title,t1.content, t1.pubtime,t1.serviceId 

from  (select id, title,content, pubtime,serviceId from article_info where (year = '2019' and month = '11')) t1

inner join  (select id, url, content, pubtime,serviceId from article_info where (year = '2019' and month = '11')) t2

on t1.id = t2.id 

where t1.pubtime >= '2019-11-11 00:00:00' and t1.pubtime <='2019-11-18 23:59:59'

and length(t1.title) > 30 and t1.serviceId in (1,2,3,4,5) and t1.title = t2.content;

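The requirement is not hard. Hive really just takes practice, and real requirements at work are where you get to apply it.

09 Getting the Materials

Follow the WeChat official account "大資料江湖" and reply “Hive學習文件” in the backend to get the detailed materials.

ps: I have also put together some commonly used Hive functions; click to view them.

Hive syntax is similar to ordinary SQL. Some complex queries need functions, and the common ones are summarized under "Read the original". SQL is something that comes with practice; with enough of it, I'm sure we can all handle it comfortably at work.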

--- The End
