Hive Partitioned Tables
Creating a partitioned table
create table dept_partition(
deptno int,
dname string,
loc string
)
PARTITIONED BY (day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Compared with an ordinary CREATE TABLE, the only addition is PARTITIONED BY (col_name data_type [COMMENT col_comment], …).
Loading data
LOAD DATA LOCAL INPATH '/home/hadoop/data/dept.txt' OVERWRITE INTO TABLE dept_partition PARTITION (day=20201215);
Let's look at the table structure:
hive (ddl_create)> desc formatted dept_partition;
# Partition Information
# col_name              data_type           comment
day                     string
Location:               hdfs://hadoop100:9000/user/hive/warehouse/ddl_create.db/dept_partition
Querying a partitioned table
hive (ddl_create)> select * from dept_partition where day='20201215';
dept_partition.deptno dept_partition.dname dept_partition.loc dept_partition.day
10 ACCOUNTING NEW YORK 20201215
20 RESEARCH DALLAS 20201215
30 SALES CHICAGO 20201215
40 OPERATIONS BOSTON 20201215
Note: in big-data scenarios, always query a partitioned table with a filter on the partition column (where partition_column = '...'), so that only the matching partition is scanned.
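To double-check that a filter on the partition column actually prunes partitions, you can inspect the query plan (a sketch; the EXPLAIN output format varies across Hive versions):

```sql
-- With the filter on day, the plan should only read the
-- day=20201215 directory instead of the whole table.
EXPLAIN select * from dept_partition where day='20201215';
```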
Notice the extra day column in the result. Where is the data for this column actually stored?
Let's first check whether it is on HDFS:
[hadoop@hadoop100 data]$ hadoop fs -ls /user/hive/warehouse/ddl_create.db/dept_partition/
drwxrwxr-x - hadoop supergroup 0 2020-12-15 21:06 /user/hive/warehouse/ddl_create.db/dept_partition/day=20201215
[hadoop@hadoop100 data]$ hadoop fs -text /user/hive/warehouse/ddl_create.db/dept_partition/day=20201215/*
10 ACCOUNTING NEW YORK
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS BOSTON
[hadoop@hadoop100 data]$
Looking at HDFS, there is no trace of the partition column day. Since it is not on HDFS, let's check the Hive metastore:
It turns out the partition values are stored in the PARTITIONS table of the Hive metastore.
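As a sketch, assuming a MySQL-backed metastore (connect to your own metastore database, not to Hive; PARTITIONS and TBLS are standard metastore schema tables), the registered partitions can be listed like this:

```sql
-- List the partitions of dept_partition recorded in the metastore.
SELECT p.PART_NAME
FROM PARTITIONS p
JOIN TBLS t ON p.TBL_ID = t.TBL_ID
WHERE t.TBL_NAME = 'dept_partition';
```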
Importing data into partitions
Importing data from HDFS into Hive
By default, a partitioned table's data is stored under
/user/hive/warehouse/database_name/table_name/partition_column=value
Now let's create a day=20201217 directory under that same path, upload dept.txt into it, and see whether Hive can query that partition:
[hadoop@hadoop100 data]$ hadoop fs -mkdir /user/hive/warehouse/ddl_create.db/dept_partition/day=20201217
[hadoop@hadoop100 data]$ hadoop fs -put dept.txt /user/hive/warehouse/ddl_create.db/dept_partition/day=20201217/
Now let's see whether the day=20201217 data shows up:
hive (ddl_create)> select * from dept_partition;
dept_partition.deptno dept_partition.dname dept_partition.loc dept_partition.day
10 ACCOUNTING NEW YORK 20201215
20 RESEARCH DALLAS 20201215
30 SALES CHICAGO 20201215
40 OPERATIONS BOSTON 20201215
10 ACCOUNTING NEW YORK 20201216
20 RESEARCH DALLAS 20201216
30 SALES CHICAGO 20201216
40 OPERATIONS BOSTON 20201216
The data we just uploaded is not there. What happened?
The reason is that the day=20201215 partition was loaded with LOAD DATA, so Hive registered it in the metastore automatically, whereas the data we just added was written straight to HDFS, so the metastore has no entry for it.
Checking the Hive metastore:
Sure enough, it is not there!
So how do we solve this?
The official way is
ALTER TABLE table_name ADD/DROP PARTITION
hive (ddl_create)> alter table dept_partition add partition(day='20201217');
OK
Time taken: 0.101 seconds
hive (ddl_create)> select * from dept_partition where day='20201217';
OK
dept_partition.deptno dept_partition.dname dept_partition.loc dept_partition.day
10 ACCOUNTING NEW YORK 20201217
20 RESEARCH DALLAS 20201217
30 SALES CHICAGO 20201217
40 OPERATIONS BOSTON 20201217
Time taken: 0.072 seconds, Fetched: 4 row(s)
Checking the Hive metastore again, the partition entry is now there.
Note: there is another option: MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS]; Use it with caution, because it rescans and refreshes the table's entire partition metadata!
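For completeness, repairing our example table would look like the sketch below; MSCK compares the directories under the table's HDFS location with the metastore and, by default, adds the missing partitions:

```sql
-- Registers every day=... directory that exists on HDFS but is
-- missing from the metastore; on a table with many partitions
-- this scan can be expensive.
MSCK REPAIR TABLE dept_partition;
```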
Similarly, partitions can be removed with
ALTER TABLE table_name DROP PARTITION
Here we drop the 20201216 and 20201217 partitions:
hive (ddl_create)> alter table dept_partition drop partition(day='20201217'),partition(day='20201216');
Dropped the partition day=20201216
Dropped the partition day=20201217
OK
Time taken: 0.199 seconds
Checking both the Hive metastore and HDFS, the partition metadata and the data files are gone.
Importing via insert into table … select
hive (ddl_create)> insert into table dept_partition partition(day='20201218')
> select * from dept;
hive (ddl_create)> select * from dept_partition where day='20201218';
dept_partition.deptno dept_partition.dname dept_partition.loc dept_partition.day
50 test chongqing 20201218
60 test chongqing 20201218
70 test2 chongqing 20201218
70 test3 chongqing 20201218
80 test4 chongqing 20201218
10 ACCOUNTING NEW YORK 20201218
20 RESEARCH DALLAS 20201218
30 SALES CHICAGO 20201218
40 OPERATIONS BOSTON 20201218
Time taken: 0.062 seconds, Fetched: 9 row(s)
Multi-level partitions
Using the same data, let's create a table with two partition levels, day and hour:
hive (ddl_create)> create table dept_partition_d_h(
> deptno int,
> dname string,
> loc string
> )
> PARTITIONED BY (day string, hour string)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive (ddl_create)> LOAD DATA LOCAL INPATH '/home/hadoop/data/dept.txt' INTO TABLE dept_partition_d_h PARTITION (day=20201216,hour=12);
Loading data to table ddl_create.dept_partition_d_h partition (day=20201216, hour=12)
Partition ddl_create.dept_partition_d_h{day=20201216, hour=12} stats: [numFiles=1, numRows=0, totalSize=80, rawDataSize=0]
OK
Time taken: 0.254 seconds
hive (ddl_create)> select * from dept_partition_d_h where day='20201216' and hour='12';
OK
dept_partition_d_h.deptno dept_partition_d_h.dname dept_partition_d_h.loc dept_partition_d_h.day dept_partition_d_h.hour
10 ACCOUNTING NEW YORK 20201216 12
20 RESEARCH DALLAS 20201216 12
30 SALES CHICAGO 20201216 12
40 OPERATIONS BOSTON 20201216 12
Time taken: 0.074 seconds, Fetched: 4 row(s)
On HDFS, the data sits in the directory of the second-level partition.
You can inspect the table structure yourself with desc formatted table_name, and the metadata in the corresponding MySQL database.
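Following the path pattern shown earlier, the two-level layout should look roughly like this (a sketch; the exact listing depends on your cluster):

```shell
hadoop fs -ls /user/hive/warehouse/ddl_create.db/dept_partition_d_h/day=20201216/
# the hour=12 subdirectory appears here; the data files live one level deeper
```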
Dynamic partitions
Suppose we have a requirement: insert the rows of the employee table (emp) into a partitioned table, partitioned by department number (deptno). How should we do that? Surely not one row at a time?
This is where dynamic partitioning comes in.
# First create the partitioned table
hive (ddl_create)> create table emp_dynamic_partition(
> `empno` int,
> `ename` string,
> `job` string,
> `mgr` int,
> `hiredate` string,
> `sal` double,
> `comm` double)
> partitioned by(deptno int)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.05 seconds
# Then load the data from emp
hive (ddl_create)> insert into table emp_dynamic_partition partition(deptno)
> select empno,ename,job,mgr,hiredate,sal,comm, deptno from emp;
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
The insert fails. Following the hint in the error message, we fix it with
set hive.exec.dynamic.partition.mode=nonstrict
hive (ddl_create)> set hive.exec.dynamic.partition.mode=nonstrict;
hive (ddl_create)> insert into table emp_dynamic_partition partition(deptno)
> select empno,ename,job,mgr,hiredate,sal,comm, deptno from emp;
# Check the result
hive (ddl_create)> select * from emp_dynamic_partition;
emp_dynamic_partition.empno emp_dynamic_partition.ename emp_dynamic_partition.job emp_dynamic_partition.mgr emp_dynamic_partition.hiredate emp_dynamic_partition.sal emp_dynamic_partition.comm emp_dynamic_partition.deptno
7782 CLARK MANAGER 7839 1981-06-09 00:00:00.0 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981-11-17 00:00:00.0 5000.0 NULL 10
7934 MILLER CLERK 7782 1982-01-23 00:00:00.0 1300.0 NULL 10
7369 SMITH CLERK 7902 1980-12-17 00:00:00.0 800.0 NULL 20
7566 JONES MANAGER 7839 1981-04-02 00:00:00.0 2975.0 NULL 20
7788 SCOTT ANALYST 7566 1982-12-09 00:00:00.0 3000.0 NULL 20
7876 ADAMS CLERK 7788 1983-01-12 00:00:00.0 1100.0 NULL 20
7902 FORD ANALYST 7566 1981-12-03 00:00:00.0 3000.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-02-20 00:00:00.0 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-02-22 00:00:00.0 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-09-28 00:00:00.0 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-05-01 00:00:00.0 2850.0 NULL 30
7844 TURNER SALESMAN 7698 1981-09-08 00:00:00.0 1500.0 0.0 30
7900 JAMES CLERK 7698 1981-12-03 00:00:00.0 950.0 NULL 30
Time taken: 0.044 seconds, Fetched: 14 row(s)
Dynamic partition syntax notes:
insert into table emp_dynamic_partition partition(deptno)
select empno,ename,job,mgr,hiredate,sal,comm, deptno from emp;
partition(deptno) names the partition column without giving it a value; Hive matches it to the last column of the SELECT list (deptno) and routes each row into the corresponding partition automatically.
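Relatedly, strict mode only requires at least one static partition, so a mixed static/dynamic form works without changing hive.exec.dynamic.partition.mode. A sketch against the two-level table from earlier (the source table staging_dept with an hour column is hypothetical):

```sql
-- day is static, hour is dynamic: strict mode is satisfied.
-- Dynamic partition columns must come last in the SELECT list,
-- in the same order as in the PARTITION clause.
insert into table dept_partition_d_h partition(day='20201216', hour)
select deptno, dname, loc, hour from staging_dept;
```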
Wrapping up
That's it for partitioned tables; I hope it helps!