Hive Partitioned Tables
Creating a partitioned table
create table dept_partition(
deptno int,
dname string,
loc string
)
PARTITIONED BY (day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Compared with an ordinary CREATE TABLE, the only addition is PARTITIONED BY (col_name data_type [COMMENT col_comment], …).
Loading data
LOAD DATA LOCAL INPATH '/home/hadoop/data/dept.txt' OVERWRITE INTO TABLE dept_partition PARTITION (day=20201215);
Let's look at the table structure:
hive (ddl_create)> desc formatted dept_partition;
# Partition Information
# col_name              data_type           comment
day                     string
Location:               hdfs://hadoop100:9000/user/hive/warehouse/ddl_create.db/dept_partition
Querying a partitioned table
hive (ddl_create)> select * from dept_partition where day='20201215';
dept_partition.deptno dept_partition.dname dept_partition.loc dept_partition.day
10 ACCOUNTING NEW YORK 20201215
20 RESEARCH DALLAS 20201215
30 SALES CHICAGO 20201215
40 OPERATIONS BOSTON 20201215
Note: in big-data scenarios, always query a partitioned table with a filter on the partition column (where partition_column = '...'), so that only the matching partition is scanned.
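To double-check that a filter on the partition column actually prunes partitions, you can inspect the query plan (a sketch; the EXPLAIN output format varies across Hive versions):

```sql
-- With the filter on day, the plan should only read the
-- day=20201215 directory instead of the whole table.
EXPLAIN select * from dept_partition where day='20201215';
```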
Notice the extra day column in the result. Where is the data for this column actually stored?
Let's first check whether it is on HDFS:
[hadoop@hadoop100 data]$ hadoop fs -ls /user/hive/warehouse/ddl_create.db/dept_partition/
drwxrwxr-x - hadoop supergroup 0 2020-12-15 21:06 /user/hive/warehouse/ddl_create.db/dept_partition/day=20201215
[hadoop@hadoop100 data]$ hadoop fs -text /user/hive/warehouse/ddl_create.db/dept_partition/day=20201215/*
10 ACCOUNTING NEW YORK
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS BOSTON
[hadoop@hadoop100 data]$
Looking at HDFS, there is no trace of the partition column day. Since it is not on HDFS, let's check the Hive metastore:
It turns out the partition values are stored in the PARTITIONS table of the Hive metastore.
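As a sketch, assuming a MySQL-backed metastore (connect to your own metastore database, not to Hive; PARTITIONS and TBLS are standard metastore schema tables), the registered partitions can be listed like this:

```sql
-- List the partitions of dept_partition recorded in the metastore.
SELECT p.PART_NAME
FROM PARTITIONS p
JOIN TBLS t ON p.TBL_ID = t.TBL_ID
WHERE t.TBL_NAME = 'dept_partition';
```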
Importing data into partitions
Importing data from HDFS into Hive
By default, a partitioned table's data is stored under
/user/hive/warehouse/database_name/table_name/partition_column=value
Now let's create a day=20201217 directory under that same path, upload dept.txt into it, and see whether Hive can query that partition:
[hadoop@hadoop100 data]$ hadoop fs -mkdir /user/hive/warehouse/ddl_create.db/dept_partition/day=20201217
[hadoop@hadoop100 data]$ hadoop fs -put dept.txt /user/hive/warehouse/ddl_create.db/dept_partition/day=20201217/
Now let's see whether the day=20201217 data shows up:
hive (ddl_create)> select * from dept_partition;
dept_partition.deptno dept_partition.dname dept_partition.loc dept_partition.day
10 ACCOUNTING NEW YORK 20201215
20 RESEARCH DALLAS 20201215
30 SALES CHICAGO 20201215
40 OPERATIONS BOSTON 20201215
10 ACCOUNTING NEW YORK 20201216
20 RESEARCH DALLAS 20201216
30 SALES CHICAGO 20201216
40 OPERATIONS BOSTON 20201216
The data we just uploaded is not there. What happened?
The reason is that the day=20201215 partition was loaded with LOAD DATA, so Hive registered it in the metastore automatically, whereas the data we just added was written straight to HDFS, so the metastore has no entry for it.
Checking the Hive metastore:
Sure enough, it is not there!
So how do we solve this?
The official way is
ALTER TABLE table_name ADD/DROP PARTITION
hive (ddl_create)> alter table dept_partition add partition(day='20201217');
OK
Time taken: 0.101 seconds
hive (ddl_create)> select * from dept_partition where day='20201217';
OK
dept_partition.deptno dept_partition.dname dept_partition.loc dept_partition.day
10 ACCOUNTING NEW YORK 20201217
20 RESEARCH DALLAS 20201217
30 SALES CHICAGO 20201217
40 OPERATIONS BOSTON 20201217
Time taken: 0.072 seconds, Fetched: 4 row(s)
Checking the Hive metastore again, the partition entry is now there.
Note: there is another option: MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS]; Use it with caution, because it rescans and refreshes the table's entire partition metadata!
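For completeness, repairing our example table would look like the sketch below; MSCK compares the directories under the table's HDFS location with the metastore and, by default, adds the missing partitions:

```sql
-- Registers every day=... directory that exists on HDFS but is
-- missing from the metastore; on a table with many partitions
-- this scan can be expensive.
MSCK REPAIR TABLE dept_partition;
```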
Similarly, partitions can be removed with
ALTER TABLE table_name DROP PARTITION
Here we drop the 20201216 and 20201217 partitions:
hive (ddl_create)> alter table dept_partition drop partition(day='20201217'),partition(day='20201216');
Dropped the partition day=20201216
Dropped the partition day=20201217
OK
Time taken: 0.199 seconds
Checking both the Hive metastore and HDFS, the partition metadata and the data files are gone.
Importing via insert into table … select
hive (ddl_create)> insert into table dept_partition partition(day='20201218')
> select * from dept;
hive (ddl_create)> select * from dept_partition where day='20201218';
dept_partition.deptno dept_partition.dname dept_partition.loc dept_partition.day
50 test chongqing 20201218
60 test chongqing 20201218
70 test2 chongqing 20201218
70 test3 chongqing 20201218
80 test4 chongqing 20201218
10 ACCOUNTING NEW YORK 20201218
20 RESEARCH DALLAS 20201218
30 SALES CHICAGO 20201218
40 OPERATIONS BOSTON 20201218
Time taken: 0.062 seconds, Fetched: 9 row(s)
Multi-level partitions
Using the same data, let's create a table with two partition levels, day and hour:
hive (ddl_create)> create table dept_partition_d_h(
> deptno int,
> dname string,
> loc string
> )
> PARTITIONED BY (day string, hour string)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive (ddl_create)> LOAD DATA LOCAL INPATH '/home/hadoop/data/dept.txt' INTO TABLE dept_partition_d_h PARTITION (day=20201216,hour=12);
Loading data to table ddl_create.dept_partition_d_h partition (day=20201216, hour=12)
Partition ddl_create.dept_partition_d_h{day=20201216, hour=12} stats: [numFiles=1, numRows=0, totalSize=80, rawDataSize=0]
OK
Time taken: 0.254 seconds
hive (ddl_create)> select * from dept_partition_d_h where day='20201216' and hour='12';
OK
dept_partition_d_h.deptno dept_partition_d_h.dname dept_partition_d_h.loc dept_partition_d_h.day dept_partition_d_h.hour
10 ACCOUNTING NEW YORK 20201216 12
20 RESEARCH DALLAS 20201216 12
30 SALES CHICAGO 20201216 12
40 OPERATIONS BOSTON 20201216 12
Time taken: 0.074 seconds, Fetched: 4 row(s)
On HDFS, the data sits in the directory of the second-level partition.
You can inspect the table structure yourself with desc formatted table_name, and the metadata in the corresponding MySQL database.
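Following the path pattern shown earlier, the two-level layout should look roughly like this (a sketch; the exact listing depends on your cluster):

```shell
hadoop fs -ls /user/hive/warehouse/ddl_create.db/dept_partition_d_h/day=20201216/
# the hour=12 subdirectory appears here; the data files live one level deeper
```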
Dynamic partitions
Suppose we have a requirement: insert the rows of the employee table (emp) into a partitioned table, partitioned by department number (deptno). How should we do that? Surely not one row at a time?
This is where dynamic partitioning comes in.
# First create the partitioned table
hive (ddl_create)> create table emp_dynamic_partition(
> `empno` int,
> `ename` string,
> `job` string,
> `mgr` int,
> `hiredate` string,
> `sal` double,
> `comm` double)
> partitioned by(deptno int)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.05 seconds
# Then load the data from emp
hive (ddl_create)> insert into table emp_dynamic_partition partition(deptno)
> select empno,ename,job,mgr,hiredate,sal,comm, deptno from emp;
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
The insert fails. Following the hint in the error message, we fix it with
set hive.exec.dynamic.partition.mode=nonstrict
hive (ddl_create)> set hive.exec.dynamic.partition.mode=nonstrict;
hive (ddl_create)> insert into table emp_dynamic_partition partition(deptno)
> select empno,ename,job,mgr,hiredate,sal,comm, deptno from emp;
# Check the result
hive (ddl_create)> select * from emp_dynamic_partition;
emp_dynamic_partition.empno emp_dynamic_partition.ename emp_dynamic_partition.job emp_dynamic_partition.mgr emp_dynamic_partition.hiredate emp_dynamic_partition.sal emp_dynamic_partition.comm emp_dynamic_partition.deptno
7782 CLARK MANAGER 7839 1981-06-09 00:00:00.0 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981-11-17 00:00:00.0 5000.0 NULL 10
7934 MILLER CLERK 7782 1982-01-23 00:00:00.0 1300.0 NULL 10
7369 SMITH CLERK 7902 1980-12-17 00:00:00.0 800.0 NULL 20
7566 JONES MANAGER 7839 1981-04-02 00:00:00.0 2975.0 NULL 20
7788 SCOTT ANALYST 7566 1982-12-09 00:00:00.0 3000.0 NULL 20
7876 ADAMS CLERK 7788 1983-01-12 00:00:00.0 1100.0 NULL 20
7902 FORD ANALYST 7566 1981-12-03 00:00:00.0 3000.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-02-20 00:00:00.0 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-02-22 00:00:00.0 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-09-28 00:00:00.0 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-05-01 00:00:00.0 2850.0 NULL 30
7844 TURNER SALESMAN 7698 1981-09-08 00:00:00.0 1500.0 0.0 30
7900 JAMES CLERK 7698 1981-12-03 00:00:00.0 950.0 NULL 30
Time taken: 0.044 seconds, Fetched: 14 row(s)
Dynamic partition syntax notes:
insert into table emp_dynamic_partition partition(deptno)
select empno,ename,job,mgr,hiredate,sal,comm, deptno from emp;
partition(deptno) names the partition column without giving it a value; Hive matches it to the last column of the SELECT list (deptno) and routes each row into the corresponding partition automatically.
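Relatedly, strict mode only requires at least one static partition, so a mixed static/dynamic form works without changing hive.exec.dynamic.partition.mode. A sketch against the two-level table from earlier (the source table staging_dept with an hour column is hypothetical):

```sql
-- day is static, hour is dynamic: strict mode is satisfied.
-- Dynamic partition columns must come last in the SELECT list,
-- in the same order as in the PARTITION clause.
insert into table dept_partition_d_h partition(day='20201216', hour)
select deptno, dname, loc, hour from staging_dept;
```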
Wrapping up
That's it for partitioned tables; I hope it helps!