彷徨 | Hive SQL: DDL Operations in Detail

阿新 • Published: 2019-02-09

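Three ways to use Hive:

Method 1: bin/hive for interactive queries

Method 2: start Hive's network service, then connect to it with the beeline client to run queries:

Start the service: bin/hiveserver2

Start the client and connect to the Hive service: bin/beeline -u jdbc:hive2://hadoop01:10000 -n root

Method 3: run queries from a shell script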

#!/bin/bash
HIVE_HOME=/root/apps/hive-1.2.2
$HIVE_HOME/bin/hive -e 'insert into table t_avg select skuid,avg(amount) from t_2 group by skuid'
$HIVE_HOME/bin/hive -e 'create table t_result as select skuid,sum(amount) from t_2 group by skuid'

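Supplement 1: bin/hive -f /root/etl.sql (put the SQL statements in a dedicated file). Hive then executes every statement in that SQL file.

Create a test.sql file containing two SQL statements.

Then run the command hive -f /root/test.sql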
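The contents of test.sql are not shown in the original. A minimal sketch of such a file, assuming it queries the t_access table used elsewhere in this article (both statements are illustrative assumptions, not the author's actual file):

```sql
-- /root/test.sql (hypothetical contents)
-- hive -f runs the statements in order, one after the other.
select * from t_access;
select count(*) from t_access;
```

Each statement must end with a semicolon so that Hive can tell where one statement stops and the next begins.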

Supplement 2: from the Linux shell, hive -e 'SQL statement' also executes a SQL statement directly

hive -e 'select * from t_access'

As you can see, the query result is the same.

Supplement 3: SQL is a set-oriented programming language

1 Starting Hive as a service:

nohup hiveserver2 >/dev/null 2>&1 &

2 Connecting from a client:

Method 1:

beeline
!connect jdbc:hive2://hadoop01:10000
root
Method 2:

beeline -u jdbc:hive2://hadoop01:10000 -n root

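Check whether the port is being listened on: netstat -nltp

If port 10000 is being listened on, the Hive service has started.

3 Basic statements

3.1 Creating a managed (internal) table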

create table t_user(id string,name string)
row format delimited
fields terminated by ',';

3.2 Creating an external table

create external table t_access(ip String,url String,access_time String)
row format delimited
fields terminated by ','
location "/data/acc";

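For a managed (internal) table, uploading a data file directly to the table's HDFS directory (under /user/hive/warehouse) is enough to associate the data with the table. The difference between the two kinds of table: a managed table lives under the warehouse directory, and dropping it also deletes its data; an external table uses a directory that you specify yourself, and dropping it does not delete the data.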
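One way to check which kind of table you are dealing with, and where its data lives, is describe formatted. A minimal sketch using the two tables created above; the exact layout of the output may differ between Hive versions:

```sql
-- Look for the "Table Type" and "Location" rows in the output.
-- t_user is expected to be reported as a managed table under the warehouse directory.
describe formatted t_user;

-- t_access is expected to be reported as an external table located at /data/acc.
describe formatted t_access;
```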

3.3 Viewing the table structure

desc tablename

3.4 Dropping a table

drop table t_order;

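Dropping a table has the following effects:

Hive removes the information about the table from the metastore;

for a managed table, Hive also deletes the table's directory from HDFS.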

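3.5 Partitioned tables

3.5.1 Creating a table with one partition column

Note: the partition column must not appear among the table's regular columns. Data for different partitions is stored in different directories.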
create table t_access(id string,url string,access_time string)
partitioned by(dt string)
row format delimited fields terminated by ',';

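Log files: /root/access.log.2018-08-29.log and /root/access.log.2018-08-30.log

Load the data into the partitions:

load data local inpath '/root/access.log.2018-08-29.log' into table t_access partition(dt='20180829');

load data local inpath '/root/access.log.2018-08-30.log' into table t_access partition(dt='20180830');

Query by partition:

A: Count the rows for August 30. In essence, the partition column is used like an ordinary table column, so a where clause can select the partition.

select count(*) from t_access where dt='20180830';

B: Count all rows in the table. In essence, simply leave out the partition condition.

select count(*) from t_access;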
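To see that each partition really is a separate directory, you can list the partitions and the table directory from inside the Hive shell. A minimal sketch, assuming the table lives in the default database under the default warehouse path /user/hive/warehouse (adjust the path if your configuration differs):

```sql
-- One output line per partition, in the form dt=<value>.
show partitions t_access;

-- The Hive CLI can run HDFS commands through "dfs".
-- Each partition should show up as a subdirectory named dt=<value>.
dfs -ls /user/hive/warehouse/t_access;
```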

3.5.2 Creating a table with multiple partition columns

3.5.2.1 Partitioned managed table

Create the table:

CREATE TABLE t_2(id int,skuid string,price float,amount int)

partitioned by (day string,city string)

row format delimited fields terminated by ',';

Load the data:

t2.1 data: 2018-04-15, Beijing

t2.2 data: 2018-04-15, Shanghai

t2.3 data: 2018-04-16, Beijing

LOAD DATA LOCAL INPATH '/root/t2.1' into TABLE t_2 PARTITION(day='2018-04-15',city='beijing');

LOAD DATA LOCAL INPATH '/root/t2.2' into TABLE t_2 PARTITION(day='2018-04-15',city='shanghai');

LOAD DATA LOCAL INPATH '/root/t2.3' into TABLE t_2 PARTITION(day='2018-04-16',city='beijing');

Queries:

1: select * from t_2;

2: select sum(price*amount) from t_2;

3: select sum(price*amount) from t_2 where day = "2018-04-15" and city = "beijing";

3.5.2.2 Partitioned external table

Create the table. Note: when creating an external table, end the statement with a directory clause, location '/xx/yy';
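The create statement for t_2_ex is not shown in the original. A minimal sketch, assuming it has the same columns as t_2 and a single partition column day (which matches the load statement that follows); the column list is an assumption:

```sql
-- Hypothetical definition of t_2_ex; the columns are borrowed from t_2.
create external table t_2_ex(id int,skuid string,price float,amount int)
partitioned by (day string)
row format delimited fields terminated by ','
location '/xx/yy';
```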

Load the data:

LOAD DATA LOCAL INPATH '/root/t2.1' into TABLE t_2_ex PARTITION(day='2018-04-15');

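Query:

select * from t_2_ex;

Note: adding a partition to an external table

Suppose a directory already exists but is not under the external table's directory. We can alter the table and add that directory to it, that is, use an existing folder as one of the table's partitions.

At this point the HDFS root directory contains a folder named 2018-04-16, which holds a file named t2.1

We register that folder as a partition of the external table so that it is queried together with the data under /xx/yy. Hive only records where the file is; it does not copy or move the file into the external table's directory.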

alter table t_2_ex add partition (day = '2018-04-16') location '/2018-04-16';

Query:

select * from t_2_ex;

As you can see, the external table's directory /xx/yy contains only the 2018-04-15 folder, not the 2018-04-16 folder we just added, yet the query still returns the contents of that folder.

This approach also works for managed tables: an existing folder can be used as a partition of a managed table
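To confirm where Hive thinks each partition lives, you can ask the metastore directly. A minimal sketch for the t_2_ex example above; the output format may vary between Hive versions:

```sql
-- Both partitions should be listed, even though only one sits under /xx/yy.
show partitions t_2_ex;

-- The "Location" row shows the directory recorded for this partition,
-- which should be the /2018-04-16 folder given in the alter statement.
describe formatted t_2_ex partition (day='2018-04-16');
```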

3.6 CTAS table-creation syntax

3.6.1 Creating a table from an existing table:

create table t_user_2 like t_user;

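The new table t_user_2 has the same structure as the source table t_user, but it contains no data

Viewing the table's data confirms that it is empty

3.6.2 Creating a table and inserting data at the same time

create table t_user_3 as

select id,name from t_user;

t_user_3 is created with the columns of the select query, and the query result is inserted into the new table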
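A CTAS statement can also rename columns and set the storage layout of the new table. A minimal sketch; the table name t_user_4 and the aliases user_id and user_name are made up for illustration:

```sql
-- Hypothetical example: the aliases become the column names of the new table,
-- and the row format clause sets its field delimiter. Without that clause the
-- new table uses Hive's default delimiter, not the source table's comma.
create table t_user_4
row format delimited fields terminated by ','
as
select id as user_id, name as user_name from t_user;
```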

Query the new table's data:

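3.7 Importing and exporting data

3.7.1 Loading data files into a Hive table

Method 1: manually put the file into the table directory with hdfs commands;

Method 2: in Hive's interactive shell, use a Hive command to load local data into the table directory (load a local file into a Hive table)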

hive>load data local inpath '/root/order.data.2' into table t_order;

Method 3: use a Hive command to load a data file that is already in HDFS into the table directory (load an HDFS file into Hive)

hive>load data inpath '/access.log.2017-08-06.log' into table t_access partition(dt='20170806');

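Note the difference between loading a local file and loading an HDFS file:

Loading a local file into a table: the file is copied

Loading an HDFS file into a table: the file is moved

Note: Hive does not check or constrain the data being loaded; you can load whatever data you like. If the fields do not match the table definition, though, queries run into problems (values that cannot be parsed typically come back as NULL).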
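By default load data appends files to the table directory. If you want the load to replace what is already there, the overwrite keyword does that. A minimal sketch reusing the file and table from Method 2:

```sql
-- Replaces the current contents of t_order with the local file.
load data local inpath '/root/order.data.2' overwrite into table t_order;

-- Because nothing is validated at load time, a quick look at a few rows
-- is a cheap way to spot fields that do not match the table definition.
select * from t_order limit 10;
```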

3.7.2 Exporting Hive table data to files at a specified path

Export Hive table data to files in HDFS

insert overwrite directory '/root/access-data'

row format delimited fields terminated by ','

select * from t_access;

Export Hive table data to files on the local disk

insert overwrite local directory '/root/access-data'

row format delimited fields terminated by ','

select * from t_access limit 100000;

3.7.3 Hive file formats

Hive supports many file formats: SEQUENCE FILE | TEXT FILE | PARQUET FILE | RC FILE

create table t_text(movie string,rate int) stored as textfile;

create table t_seq(movie string,rate int) stored as sequencefile;

create table t_pq(movie string,rate int) stored as parquetfile;

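3.8 Date and time types

TIMESTAMP (a timestamp that holds year, month, day, hour, minute, and second)

DATE (a date that holds year, month, and day only)

Example: suppose we have the following data file: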

1,zhangsan,1985-06-30

2,lisi,1986-07-10

3,wangwu,1985-08-09

Then we can create a table to map the data

create table t_customer(id int,name string,birthday date)

row format delimited fields terminated by ',';

Then load the data

load data local inpath '/root/customer.dat' into table t_customer;

After that, queries on the table return the correct values
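Once birthday is typed as a date, Hive's date functions can be applied to it directly. A minimal sketch on t_customer; the aliases birth_year and birth_month are illustrative:

```sql
-- year() and month() extract parts of the date column.
-- The cast makes the comparison an explicit date-to-date comparison.
select id, name, year(birthday) as birth_year, month(birthday) as birth_month
from t_customer
where birthday < cast('1986-01-01' as date);
```

With the three sample rows above, only the customers born in 1985 satisfy the where condition.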

3.9 Complex types

3.9.1 The array type

Example: using the array type

Suppose the following data needs to be mapped by a Hive table:

戰狼2,吳京:吳剛:龍母,2017-08-16

三生三世十里桃花,劉亦菲:癢癢,2017-08-20

Idea: it is convenient to map the lead actors to an array

Create the table:

create table t_movie(movie_name string,actors array<string>,first_show date)

row format delimited fields terminated by ','

collection items terminated by ':';

Load the data:

load data local inpath '/root/movie.dat' into table t_movie;

Queries:

select * from t_movie;

(The output columns are not aligned; I am not sure why, which is a little embarrassing.)

select movie_name,actors[0] from t_movie;

select movie_name,actors from t_movie where array_contains(actors,'吳剛');

select movie_name,size(actors) from t_movie;
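Arrays can also be flattened into one row per element with explode and lateral view, which makes per-actor aggregation possible. A minimal sketch on t_movie; the aliases tmp, actor, and movies are illustrative:

```sql
-- explode(actors) emits one row per array element;
-- lateral view joins those rows back to the row they came from.
select actor, count(*) as movies
from t_movie
lateral view explode(actors) tmp as actor
group by actor;
```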

3.9.2 The map type

maps: MAP<primitive_type, data_type>

Suppose we have the following data:

1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28

2,lisi,father:mayun#mother:huangyi#brother:guanyu,22

3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29

4,mayun,father:mayongzhen#mother:angelababy,26

A map type can describe the family members in the data above

Create-table statement:

create table t_person(id int,name string,family_members map<string,string>,age int)

row format delimited fields terminated by ','

collection items terminated by '#'

map keys terminated by ':';

Load the data:

load data local inpath '/root/person.dat' into table t_person;

Queries

select * from t_person;

Get the value of a given key in the map column

select id,name,family_members['father'] as father from t_person;

Note: as father sets a column alias

Get all keys of the map column

select id,name,map_keys(family_members) as relation from t_person;

Get all values of the map column

select id,name,map_values(family_members) from t_person;

select id,name,map_values(family_members)[0] from t_person;

Combined example: find the users who have a brother

select id,name,brother
from
(select id,name,family_members['brother'] as brother from t_person) tmp
where brother is not null;
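The same result can be had without a subquery by testing the map lookup directly, and a map can be flattened into key/value rows with explode. A minimal sketch on t_person; the aliases tmp, relation, and member are illustrative:

```sql
-- Looking up a missing key yields NULL, so this keeps only users with a brother.
select id, name, family_members['brother'] as brother
from t_person
where family_members['brother'] is not null;

-- explode on a map emits one row per entry, with the key and value as two columns.
select id, name, relation, member
from t_person
lateral view explode(family_members) tmp as relation, member;
```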

3.9.3 The struct type

structs: STRUCT<col_name : data_type, ...>

Suppose we have the following data:

1,zhangsan,18:male:beijing

2,lisi,28:female:shanghai

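The user information consists of: age (integer), sex (string), and address (string)

Idea: describe the whole user information with a single column, using a struct

Create the table: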

create table t_person_struct(id int,name string,info struct<age:int,sex:string,addr:string>)

row format delimited fields terminated by ','

collection items terminated by ':';

Load the data:

load data local inpath '/root/person_struct.dat' into table t_person_struct;

Queries

select * from t_person_struct;

select id,name,info.age from t_person_struct;

select id,name,info.sex from t_person_struct;
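Struct fields can be used anywhere an ordinary column can, including in a where clause. A minimal sketch on t_person_struct; the threshold of 20 is arbitrary:

```sql
-- Filter on one struct field and project another.
select id, name, info.addr
from t_person_struct
where info.age > 20;
```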

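3.10 Altering table definitions

These statements change only Hive's metadata and do not touch the table's data; it is up to the user to make sure the actual data layout matches the metadata definition.

Rename a table: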

ALTER TABLE table_name RENAME TO new_table_name

Example: alter table t_zhang rename to t_junjie;

Rename a partition:

alter table t_partition partition(department='xiangsheng',sex='male',howold=20)
rename to partition(department='1',sex='1',howold=20);

Add a partition:

alter table t_partition add partition (department='2',sex='0',howold=40);

Drop a partition:

alter table t_partition drop partition (department='2',sex='2',howold=24);

Change the table's file format definition:

ALTER TABLE table_name [PARTITION partitionSpec] SET FILEFORMAT file_format

alter table t_partition partition(department='2',sex='0',howold=40 ) set fileformat sequencefile;

Change a column definition:

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|(AFTER column_name)]

alter table t_user change price jiage float first;

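price is the old column name, jiage is the new column name, and float is the column type. first is optional; if it is given, the column is moved to the first position after the change.

Add/replace columns:

ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)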

alter table t_user add columns (sex string,addr string);

This adds columns; several can be added at once. (sex,addr) are the newly added columns, and each one must be followed by its type

alter table t_user replace columns (id string,age int,price float);

This replaces columns: the existing column list is replaced outright by the new one

alter table t_junjie replace columns (id int,sex string);
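The partition statements in this section operate on a table t_partition whose definition is not shown. A minimal sketch of a table they would work against, assuming three partition columns with the names and types implied by the statements; the regular columns id and name are invented for illustration:

```sql
-- Hypothetical definition; only the partition columns are implied by the text.
create table t_partition(id int, name string)
partitioned by (department string, sex string, howold int)
row format delimited fields terminated by ',';

-- After adding, renaming, or dropping partitions, list what the metastore holds.
show partitions t_partition;
```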

3.11 Hive query syntax

SQL is a set-oriented programming language;

select 1;

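Tip: when testing queries on small amounts of data, you can have Hive submit the MapReduce job to the local runner by setting the following parameter in the Hive session: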

hive> set hive.exec.mode.local.auto=true;

Basic query examples

select * from t_access;

select count(*) from t_access;

select max(ip) from t_access;

Conditional queries

select * from t_access where access_time<'2017-08-06 15:30:20';

select * from t_access where access_time<'2017-08-06 16:30:20' and ip>'192.168.33.3';

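3.12 Join query examples

Note: in the Hive version used here (1.2.2), join conditions support only equality (equi-joins), not non-equi joins; other SQL dialects do support non-equi joins.

Suppose there is a file a.txt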

a,1
b,2
c,3
d,4

Suppose there is a file b.txt

a,aa
b,bb
d,cc
e,dd

Create tables t_a and t_b:

Load the data:
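Neither the table definitions nor the load statements are shown in the original. A minimal sketch that matches the join queries below, which only require a name column on each side; the second column names (numb, nick) and the /root paths are assumptions:

```sql
-- Hypothetical definitions; only the "name" column is implied by the queries.
create table t_a(name string, numb int)
row format delimited fields terminated by ',';

create table t_b(name string, nick string)
row format delimited fields terminated by ',';

-- Assumes a.txt and b.txt were saved under /root on the machine running Hive.
load data local inpath '/root/a.txt' into table t_a;
load data local inpath '/root/b.txt' into table t_b;
```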

3.12.1 Inner join

select * from t_a a join t_b b on a.name = b.name;

Result:

3.12.2 Left outer join

select * from t_a a left join t_b b on a.name = b.name;

Result:

3.12.3 Right outer join

select * from t_a a right join t_b b on a.name = b.name;

Result:

3.12.4 Full outer join

select * from t_a a full join t_b b on a.name = b.name;

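Result:

3.12.5 Left semi join

Left semi join: returns the left-table half of the rows that joining the two tables would produce, that is, only those left-table rows that have a match in the right table

Note: the select clause of a left semi join must not contain columns of the right table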

select * from t_a a left semi join t_b b on a.name = b.name;

Result:
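A left semi join answers the question "which left rows have a match on the right?", which is what an IN subquery expresses. A minimal sketch of the equivalent form, assuming a Hive version that supports IN subqueries in the where clause:

```sql
-- Same rows as the left semi join above: only columns of t_a are available,
-- and each t_a row appears at most once however many matches t_b holds.
select a.*
from t_a a
where a.name in (select b.name from t_b b);
```

With the sample files above, both forms keep the t_a rows whose name is a, b, or d, since c has no counterpart in b.txt.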

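3.13 group by aggregation

Note: once a query has a group by clause, the select clause may contain nothing other than grouping columns and aggregate functions

(Put simply, after grouping you can select only the grouping columns and values that aggregate to a single answer per group, such as a maximum, minimum, sum, or average. If you group by sex, there is one result row for male. You can ask for the maximum score, the average age, or the minimum age of that group, but not for the name: the male group is a single row, while it contains two names, which would need two rows.)

Why must where be written before group by, and why can a condition after group by only use having?

Because where filters the rows before the actual query logic runs,

while having filters the results again after the group by aggregation;

Given the following data:

Create a table: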

create table t_user(id int,name string,age int,score int,sex string)

row format delimited fields terminated by ',';

Load the data:

load data local inpath '/root/user.txt' into table t_user;

Grouped queries:

1 Group by sex, and return each sex with its maximum age

select max(age),sex from t_user group by sex ;

2 Compute the average score for each sex, but leave out any sex whose average age is above 25

select sex,avg(score) from t_user group by sex having avg(age)<=25;

3 Compute the average score for each sex, but leave out any sex whose average age is above 25 or whose average score is below 85

select sex,avg(score) from t_user group by sex having avg(age)<=25 and avg(score)>=85;

4 Compute the average score for each sex, not counting scores below 82, and remove from the final result any sex whose average age is above 25;

select sex,avg(score) from t_user where score>=82 group by sex having avg(age)<=25;

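Execution logic of the statement above:

where filters out the rows that do not satisfy its condition

the aggregate functions and group by then aggregate the remaining rows into one result per group

having filters out the aggregated results that do not satisfy its condition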
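The three stages can be made visible by writing query 4 as nested subqueries, one per stage. This is only a sketch for illustration, following the task as described (scores below 82 excluded, average age above 25 removed); the aliases filtered, grouped, avg_score, and avg_age are made up:

```sql
select sex, avg_score
from (
  -- Stage 2: aggregate the surviving rows per sex.
  select sex, avg(score) as avg_score, avg(age) as avg_age
  from (
    -- Stage 1: what "where" does, dropping rows before any aggregation.
    select sex, score, age from t_user where score >= 82
  ) filtered
  group by sex
) grouped
-- Stage 3: what "having" does, filtering on the aggregated values.
where avg_age <= 25;
```

Note that the average age here is computed over the rows that survive stage 1, not over the whole table, which is also how the where/having version behaves.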
