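06 - Hive Table Property Operations
Hello everyone! Here we are again, and again, and again. Let me reintroduce myself: I'm rather handsome, the handsome-enough-to-make-you-cry kind.
Heh, just kidding. In this world, only beauty and big data must never be let down. Study big data technology well; you can never have too many skills. The more you learn, the more handsome you get, right? Back to business: today's lab is Hive table property operations.
What table properties are there? The table name, adding columns, and changing columns!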
Rename a table: alter table table_name rename to new_table_name;
Rename a column: alter table tablename change column c1 c2 int comment 'xxxxx' after severity;
-- the column can be placed after a specified column, or moved to the first position with 'first'
Add columns: alter table tablename add columns(c1 string comment 'xxxx', c2 bigint comment 'yyyyyy');
1 Let's get some hands-on practice. Type along with this code and you'll have most of it down:
hive> create table testchange(
> name string,value string
> );
OK
Time taken: 0.085 seconds
hive> alter table testchange rename to test;
OK
Time taken: 0.478 seconds
hive>
Now let's check the result:
hive> desc test;
OK
name string
value string
Time taken: 0.381 seconds
hive>
Add two columns, type and col, to the test table:
hive> alter table test add columns(type string,col int comment 'xielaoshi');
OK
Time taken: 0.13 seconds
hive> desc test;
OK
name string
value string
type string
col int xielaoshi
Time taken: 0.235 seconds
hive>
You can also do this: move the type column so it comes right after name:
hive> alter table test change column type type string after name;
OK
Time taken: 0.161 seconds
hive> desc test;
OK
name string
type string
value string
col int xielaoshi
Time taken: 0.156 seconds
hive>
Or this: move type to the first column:
hive> alter table test change column type type string first;
OK
Time taken: 0.113 seconds
hive> desc test;
OK
type string
name string
value string
col int xielaoshi
Time taken: 0.149 seconds
hive> desc formatted test;
Fun, isn't it? Learning technology is really about keeping yourself amused; that's what makes it interesting!
2 Modifying tblproperties
hive> alter table test set tblproperties('comment'='xxxxx');
OK
Time taken: 0.122 seconds
hive> desc formatted test;
OK
# col_name data_type comment
type string None
name string None
value string None
col int xielaoshi
# Detailed Table Information
Database: default
Owner: root
CreateTime: Thu Jun 02 06:20:43 PDT 2016
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://hadoop1:9000/user/hive/warehouse/test
Table Type: MANAGED_TABLE
Table Parameters:
comment xxxxx
last_modified_by root
last_modified_time 1464874178
transient_lastDdlTime 1464874178
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.14 seconds
hive>
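All this does is change the table's description (its comment). So easy!
3 Modifying serdeproperties. SerDe is short for serializer/deserializer. So what exactly is serialization, and what is deserialization?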
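Side note: serialization means converting a structured object (an instance) into a byte stream (a byte array). Deserialization is the reverse process, turning a byte stream back into a structured object. So if you want to save a "live" object to a file, you just store that string of bytes; if you want to send a "live" object to a remote host, you just send that string of bytes. When you need the object again, deserialize it and the object is "revived".
Serializing an object and saving it to a file is known as "persistence". Serializing an object and sending it to a remote computer is known as "data communication".
Not hard to grasp, is it? Let me think of a good analogy. It's like phoning your parents: your voice is converted into an electrical signal (serialization), and at your parents' end it is converted back into sound (deserialization).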
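If the phone call analogy still feels abstract, here is a minimal sketch of the round trip in Python. It is only an illustration, not what Hive or Hadoop do internally: the `City` class, the `serialize` and `deserialize` helpers, and the choice of JSON as the byte format are all my own assumptions, and the field values are borrowed from the sample data used later in this post.

```python
import json


class City:
    """Hypothetical record type, made up for this illustration."""

    def __init__(self, time, country, province, city):
        self.time = time
        self.country = country
        self.province = province
        self.city = city

    def __eq__(self, other):
        return isinstance(other, City) and vars(self) == vars(other)


def serialize(obj):
    """Structured object -> byte stream."""
    return json.dumps(vars(obj)).encode("utf-8")


def deserialize(data):
    """Byte stream -> structured object (the reverse process)."""
    return City(**json.loads(data.decode("utf-8")))


original = City("20130829234535", "china", "henan", "nanyang")
payload = serialize(original)   # these bytes could go to a file or a socket
revived = deserialize(payload)  # "revive" the object when it is needed

print(type(payload).__name__)   # bytes
print(revived == original)      # True
```

Writing `payload` to a file would be the "persistence" case, and sending it over a network connection would be the "data communication" case; the deserializing side is the same either way.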
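When Hadoop communicates between nodes in a cluster or makes RPC calls, it needs serialization, and that serialization has to be fast, small, and light on bandwidth. That is why you need to understand Hadoop's serialization mechanism.
Hive runs on top of Hadoop, and serialization matters a great deal to Hadoop. How much?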
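Serialization and deserialization come up in two areas of distributed data processing: interprocess communication and persistent storage. In Hadoop, communication between nodes is implemented with remote procedure calls (RPC), and an RPC serialization format should have the following properties:
- Compact: a compact format makes the best use of network bandwidth, which is the scarcest resource in a data center;
- Fast: interprocess communication forms the backbone of a distributed system, so it is essential to keep the performance overhead of serialization and deserialization as low as possible;
- Extensible: protocols change over time to meet new requirements, so it should be possible to evolve the protocol in a controlled way for both clients and servers, with the existing serialization format able to carry the new protocol messages;
- Interoperable: clients and servers written in different languages should be able to talk to each other.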
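To get a feel for the "compact" property, here is a small Python sketch that encodes the same record as readable JSON text and as a fixed binary layout. Everything in it is an assumption for illustration: the record fields (node id, block id, size), the `>HQI` layout, and the helper names. It is not Hadoop's actual wire format.

```python
import json
import struct

# Big-endian layout with no padding: 2-byte unsigned short,
# 8-byte unsigned long long, 4-byte unsigned int (14 bytes in total).
LAYOUT = ">HQI"


def encode_text(record):
    """Encode the record as human-readable JSON text."""
    node, block, size = record
    return json.dumps({"node": node, "block": block, "size": size}).encode("utf-8")


def encode_binary(record):
    """Encode the record using the fixed binary layout."""
    return struct.pack(LAYOUT, *record)


def decode_binary(data):
    """Decode bytes produced by encode_binary back into a tuple."""
    return struct.unpack(LAYOUT, data)


record = (7, 123456789, 4096)  # hypothetical (node id, block id, size)
print(len(encode_binary(record)))                              # 14
print(len(encode_binary(record)) < len(encode_text(record)))   # True
print(decode_binary(encode_binary(record)) == record)          # True
```

The binary form is smaller because it carries no field names or punctuation, but both sides must agree on the layout in advance, which is exactly where the "extensible" and "interoperable" requirements start to bite.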
OK, enough rambling. How you modify serdeproperties depends on whether the table is partitioned.
Without partitions:
alter table table_name set serdeproperties('field.delim'='\t');
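And with partitions?
alter table test partition(dt='xxxxx') set serdeproperties('field.delim'='\t');
Why make this change? The reason is simple: sometimes the data's format doesn't match what the table expects for your analysis, and deleting the data is out of the question, so the only option is to change the table's properties.
Heh, let's look at something practical: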
hive> create table city(
> time string,
> country string,
> province string,
> city string)
> row format delimited fields terminated by '#'
> lines terminated by '\n'
> stored as textfile;
OK
Time taken: 0.111 seconds
hive>
Load some data to play with:
hive> load data local inpath '/usr/host/city' into table city;
Copying data from file:/usr/host/city
Copying file: file:/usr/host/city
Loading data to table default.city
OK
Time taken: 0.426 seconds
hive> select * from city;
OK
20130829234535 china henan nanyang NULL NULL NULL
20130829234536 china henan xinyang NULL NULL NULL
20130829234537 china beijing beijing NULL NULL NULL
20130829234538 china jiang susuzhou NULL NULL NULL
20130829234539 china hubei wuhan NULL NULL NULL
20130829234540 china sandong weizhi NULL NULL NULL
20130829234541 china hebei shijiazhuang NULL NULL NULL
20130829234542 china neimeng eeduosi NULL NULL NULL
20130829234543 china beijing beijing NULL NULL NULL
20130829234544 china jilin jilin NULL NULL NULL
Time taken: 0.105 seconds
hive>
Wondering what's actually in my city file? Let me show you!
[root@hadoop1 data3]# cd /usr/host/
[root@hadoop1 host]# cat city
20130829234535 china henan nanyang
20130829234536 china henan xinyang
20130829234537 china beijing beijing
20130829234538 china jiang susuzhou
20130829234539 china hubei wuhan
20130829234540 china sandong weizhi
20130829234541 china hebei shijiazhuang
20130829234542 china neimeng eeduosi
20130829234543 china beijing beijing
20130829234544 china jilin jilin
[root@hadoop1 host]#
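Neat, right? But here's the question: the file contents printed by cat look perfectly tidy, so why did the select output earlier contain so many NULLs? Don't rush, dear reader; let me take a sip of water first.
OK. The fields in the data file are separated by the tab character '\t', while the table was defined with '#' as the field delimiter. Hive looks for '#', doesn't find one, and so treats the entire line as the first field and fills the remaining three fields with NULL. Got it? If not, add me on WeChat (xiehuadong1) and we can talk it through by voice.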
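The NULL padding can be imitated in a few lines of Python. This is a simplified sketch of the behavior described above, not the real LazySimpleSerDe code; the `split_row` helper is hypothetical, and the rule of dropping extra fields is my simplifying assumption.

```python
def split_row(line, delimiter, num_columns):
    """Split one text line into exactly num_columns fields.

    Missing fields are padded with None (standing in for NULL);
    extra fields are dropped.
    """
    fields = line.rstrip("\n").split(delimiter)
    fields = fields[:num_columns]
    fields += [None] * (num_columns - len(fields))
    return fields


line = "20130829234535\tchina\thenan\tnanyang\n"

# Table says '#', file uses tabs: the whole line lands in the first column.
print(split_row(line, "#", 4))
# ['20130829234535\tchina\thenan\tnanyang', None, None, None]

# Delimiter matches the file: four clean fields.
print(split_row(line, "\t", 4))
# ['20130829234535', 'china', 'henan', 'nanyang']
```

The first call mirrors the select output with three NULLs per row; the second is what we want, and what changing field.delim is meant to achieve.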
Don't believe me? Take a look:
hive> desc formatted city;
OK
# col_name data_type comment
time string None
country string None
province string None
city string None
# Detailed Table Information
Database: default
Owner: root
CreateTime: Thu Jun 02 06:43:57 PDT 2016
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://hadoop1:9000/user/hive/warehouse/city
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1464875117
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim #
line.delim \n
serialization.format #
Time taken: 0.132 seconds
hive>
See "field.delim #"? Let's fix it by changing the delimiter from # to \t:
hive> alter table city set serdeproperties('field.delim'='\t');
OK
Time taken: 0.118 seconds
hive> select * from city;
OK
20130829234535 china henan nanyang
20130829234536 china henan xinyang
20130829234537 china beijing beijing
20130829234538 china jiang susuzhou
20130829234539 china hubei wuhan
20130829234540 china sandong weizhi
20130829234541 china hebei shijiazhuang
20130829234542 china neimeng eeduosi
20130829234543 china beijing beijing
20130829234544 china jilin jilin
Time taken: 0.125 seconds
hive>
Next, let's create a partitioned table, city1:
hive> create table city1(
> time string,
> country string,
> province string,
> city string) partitioned by (dt string)
> row format delimited fields terminated by '#'
> lines terminated by '\n'
> stored as textfile;
OK
Time taken: 0.046 seconds
hive>
Load the data, query it, and look at the table description:
hive> load data local inpath '/usr/host/city' into table city1 partition(dt='20160519');
Copying data from file:/usr/host/city
Copying file: file:/usr/host/city
Loading data to table default.city1 partition (dt=20160519)
OK
Time taken: 0.189 seconds
hive> select * from city1;
OK
20130829234535 china henan nanyang NULL NULL NULL 20160519
20130829234536 china henan xinyang NULL NULL NULL 20160519
20130829234537 china beijing beijing NULL NULL NULL 20160519
20130829234538 china jiang susuzhou NULL NULL NULL 20160519
20130829234539 china hubei wuhan NULL NULL NULL 20160519
20130829234540 china sandong weizhi NULL NULL NULL 20160519
20130829234541 china hebei shijiazhuang NULL NULL NULL 20160519
20130829234542 china neimeng eeduosi NULL NULL NULL 20160519
20130829234543 china beijing beijing NULL NULL NULL 20160519
20130829234544 china jilin jilin NULL NULL NULL 20160519
Time taken: 0.083 seconds
hive> desc formatted city1;
OK
# col_name data_type comment
time string None
country string None
province string None
city string None
# Partition Information
# col_name data_type comment
dt string None
# Detailed Table Information
Database: default
Owner: root
CreateTime: Thu Jun 02 06:55:56 PDT 2016
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://hadoop1:9000/user/hive/warehouse/city1
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1464875756
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim #
line.delim \n
serialization.format #
Time taken: 0.076 seconds
hive>
Change the delimiter from # to \t:
hive> alter table city1 set serdeproperties('field.delim'='\t');
OK
Time taken: 0.103 seconds
hive>
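The other way is to modify the partition itself. An existing partition generally keeps its own SerDe settings, so this is the statement that matters for data already loaded into dt='20160519':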
hive> alter table city1 partition(dt='20160519') set serdeproperties('field.delim'='\t');
OK
Time taken: 0.141 seconds
hive> select * from city1;
OK
20130829234535 china henan nanyang 20160519
20130829234536 china henan xinyang 20160519
20130829234537 china beijing beijing 20160519
20130829234538 china jiang susuzhou 20160519
20130829234539 china hubei wuhan 20160519
20130829234540 china sandong weizhi 20160519
20130829234541 china hebei shijiazhuang 20160519
20130829234542 china neimeng eeduosi 20160519
20130829234543 china beijing beijing 20160519
20130829234544 china jilin jilin 20160519
Time taken: 0.086 seconds
hive>
Play around a bit more:
hive> select * from city1 where dt='20160519';
OK
20130829234535 china henan nanyang 20160519
20130829234536 china henan xinyang 20160519
20130829234537 china beijing beijing 20160519
20130829234538 china jiang susuzhou 20160519
20130829234539 china hubei wuhan 20160519
20130829234540 china sandong weizhi 20160519
20130829234541 china hebei shijiazhuang 20160519
20130829234542 china neimeng eeduosi 20160519
20130829234543 china beijing beijing 20160519
20130829234544 china jilin jilin 20160519
Time taken: 0.579 seconds
hive>
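3 Modifying location
This post is getting long; bear with me, we're almost done.
alter table table_name [partition()] set location 'path';
alter table table_name set TBLPROPERTIES('EXTERNAL'='TRUE'); -- convert a managed (internal) table to an external table
alter table table_name set TBLPROPERTIES('EXTERNAL'='FALSE'); -- convert an external table to a managed (internal) table
For example: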
[root@hadoop1 host]# hadoop fs -mkdir /location
[root@hadoop1 host]# hadoop fs -put /usr/host/city /location
[root@hadoop1 host]# hadoop fs -ls /location
Found 1 items
-rw-r--r-- 1 root supergroup 359 2016-06-02 07:04 /location/city
[root@hadoop1 host]#
Next I'll change the location of city from hdfs://hadoop1:9000/user/hive/warehouse/city to hdfs://hadoop1:9000/location
hive> alter table city set location 'hdfs://hadoop1:9000/location';
OK
Time taken: 0.112 seconds
Don't believe it? Check with desc formatted city; (shown after the query below):
hive> select * from city;
OK
20130829234535 china henan nanyang
20130829234536 china henan xinyang
20130829234537 china beijing beijing
20130829234538 china jiang susuzhou
20130829234539 china hubei wuhan
20130829234540 china sandong weizhi
20130829234541 china hebei shijiazhuang
20130829234542 china neimeng eeduosi
20130829234543 china beijing beijing
20130829234544 china jilin jilin
Time taken: 0.112 seconds
hive> desc formatted city;
OK
# col_name data_type comment
time string None
country string None
province string None
city string None
# Detailed Table Information
Database: default
Owner: root
CreateTime: Thu Jun 02 06:43:57 PDT 2016
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://hadoop1:9000/location
Table Type: MANAGED_TABLE
Table Parameters:
last_modified_by root
last_modified_time 1464876384
transient_lastDdlTime 1464876384
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
line.delim \n
serialization.format #
Time taken: 0.078 seconds
hive>
The Location above has changed, right? Now let's see what happens when we drop this table:
hive> drop table city;
OK
Time taken: 0.122 seconds
hive>
Once you drop city, the data at the location recorded in the metastore is gone as well, because city is a MANAGED_TABLE (an internal table).
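As a toy model of that drop behavior, here is a short Python sketch. It is not Hive's API: the `ToyWarehouse` class, its two dictionaries standing in for the metastore and the file system, and the `/external_data` path are all invented for illustration. It assumes the usual rule that dropping an external table removes only the metadata and leaves the files alone.

```python
class ToyWarehouse:
    """Toy stand-in for a metastore plus a file system (not Hive's API)."""

    def __init__(self):
        self.tables = {}  # table name -> {"location": str, "external": bool}
        self.files = {}   # location -> list of rows stored there

    def create_table(self, name, location, external=False):
        self.tables[name] = {"location": location, "external": external}
        self.files.setdefault(location, [])

    def drop_table(self, name):
        meta = self.tables.pop(name)  # the metadata always goes away
        if not meta["external"]:
            # Managed table: the data at its location is deleted too.
            self.files.pop(meta["location"], None)


wh = ToyWarehouse()
wh.create_table("city", "/location")  # managed, like city above
wh.files["/location"].append("20130829234535\tchina\thenan\tnanyang")
wh.create_table("city_ext", "/external_data", external=True)
wh.files["/external_data"].append("some row")

wh.drop_table("city")
wh.drop_table("city_ext")
print("/location" in wh.files)       # False
print("/external_data" in wh.files)  # True
```

In this model, that is why pointing a managed table at a shared directory such as /location and then dropping it is risky: the files go with the table.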
So what is the difference between external and internal tables? Readers are welcome to leave a comment!
4 Recreate city:
hive> create table city(
> time string,
> country string,
> province string,
> city string)
> row format delimited fields terminated by '#'
> lines terminated by '\n'
> stored as textfile;
OK
Time taken: 1.042 seconds
hive>
Of course, at this point it is a managed (internal) table. Let's convert city to an external table:
hive>
> alter table city set tblproperties('EXTERNAL'='TRUE');
OK
Time taken: 0.184 seconds
hive> desc formatted city;
OK
# col_name data_type comment
time string None
country string None
province string None
city string None
# Detailed Table Information
Database: default
Owner: root
CreateTime: Thu Jun 02 07:15:20 PDT 2016
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://hadoop1:9000/user/hive/warehouse/city
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
last_modified_by root
last_modified_time 1464876990
transient_lastDdlTime 1464876990
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim #
line.delim \n
serialization.format #
Time taken: 0.296 seconds
hive>
So how do you convert an external table back into a managed table?
Run:
alter table city set tblproperties('EXTERNAL'='FALSE');
desc formatted city;
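That's all for today; I'm a bit tired and need a rest. If you're reading this and want to learn more or get in touch, follow my WeChat official account: 五十年後
See you again!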