Hive Notes: Day 2

阿新 • Published: 2021-07-25

Creating tables in Hive

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  -- column names and column types
  [(col_name data_type [COMMENT col_comment], ...)]
  -- comment on the table
  [COMMENT table_comment]
  -- partition columns
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  -- bucketing columns
  [CLUSTERED BY (col_name, col_name, ...)
  -- sort columns within each bucket, ascending or descending
  [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [
   -- row and field delimiters
   [ROW FORMAT row_format]
   -- storage format: textFile, rcFile, SequenceFile, ...; the default is textFile
   [STORED AS file_format]
   | STORED BY 'storage.handler.class.name' [ WITH SERDEPROPERTIES (...) ]  (Note:  only available starting with 0.6.0)
  ]
  -- storage location on HDFS
  [LOCATION hdfs_path]
  -- table properties; often used with external tables, e.g. to map an HBase table so that HBase data can be queried with HQL (although such queries are slow)
  [TBLPROPERTIES (property_name=property_value, ...)]  (Note:  only available starting with 0.6.0)
  [AS select_statement]  (Note: this feature is only available starting with 0.5.0.)

Table 1: rely on the defaults throughout

create table students
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
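ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; -- sets the column delimiter; if omitted, Hive uses its default delimiter \001 (Ctrl-A), which would not match comma-separated data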

Table 2: specify a LOCATION (also fairly common)

create table students2
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
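LOCATION '/input1';
-- LOCATION sets where the table's data is stored on HDFS. It is typically used when the data has already been uploaded to HDFS and should be queried in place.
-- LOCATION usually goes together with external tables; managed (internal) tables normally keep the default location.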

Table 3: specify the storage format

create table students3
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
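STORED AS rcfile;
-- Stores the table as RCFile (inputFormat: RCFileInputFormat, outputFormat: RCFileOutputFormat); without STORED AS the default is textfile.
-- Note: LOAD DATA does not convert files, so plain-text data cannot be loaded directly into a table stored in any format other than textfile. Populate such a table by inserting from another table.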
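
To make the "insert from another table" route concrete, here is a minimal sketch. It assumes the textfile table students from Table 1 already holds data; the column list simply mirrors the definitions above.

```sql
-- Assumption: `students` is a TEXTFILE table that has already been loaded.
-- Hive rewrites the rows in RCFile format as part of the insert,
-- which a plain LOAD DATA of a text file would not do.
insert overwrite table students3
select id, name, age, gender, clazz
from students;
```

The same pattern should apply to other non-text formats: keep a text staging table for raw files and let a query do the format conversion.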

Table 4: create table xxxx as select_statement (CTAS; a commonly used approach)

create table students4 as select * from students2;

Table 5: create table xxxx like table_name, for when only the table structure is wanted and no data should be loaded

create table students5 like students;
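
A quick way to check what LIKE produced, sketched under the assumption that students5 was just created by the statement above:

```sql
-- Print the DDL Hive recorded for the new table; the columns and
-- row format should match those of `students`.
show create table students5;

-- LIKE copies only the definition, so the new table starts out empty.
select count(*) from students5;
```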

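Loading data into Hive

1. Use `hdfs dfs -put '<local data>' '<HDFS directory of the Hive table>'`

2. Use load data inpath

Run the following commands in the Hive shell.

-- Move /input1/students.txt on HDFS into the HDFS directory of the students table. Note that the file is moved, not copied.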
load data inpath '/input1/students.txt' into table students;

-- Empty the table
truncate table students;
-- With the local keyword, a file on the local Linux filesystem is uploaded to the table's HDFS directory; the original file is not deleted
load data local inpath '/usr/local/soft/data/students.txt' into table students;
-- overwrite replaces the data already in the table
load data local inpath '/usr/local/soft/data/students.txt' overwrite into table students;

3. create table xxx as <SQL statement>

4. insert into table xxxx <SQL statement> (without as)

-- Insert the rows of students into students2. This is a copy, not a move, so the data in students is not lost
insert into table students2 select * from students;
-- Overwriting insert: replace into with overwrite
insert overwrite table students2 select * from students;

Hive managed (internal) tables vs external tables

Create the tables:

-- managed (internal) table
create table students_internal
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/input2';

-- external table
create external table students_external
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/input3';

Load the data:

hive> dfs -mkdir /input2;
hive> dfs -mkdir /input3;
hive> dfs -put /usr/local/soft/data/students.txt /input2/;
hive> dfs -put /usr/local/soft/data/students.txt /input3/;

Drop the tables:

hive> drop table students_internal;
Moved: 'hdfs://master:9000/input2' to trash at: hdfs://master:9000/user/root/.Trash/Current
OK
Time taken: 0.474 seconds
hive> drop table students_external;
OK
Time taken: 0.09 seconds
hive>

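As the output shows, dropping a managed table deletes the table's data (the files on HDFS) together with its metadata.

Dropping an external table deletes only the metadata; the data (the files on HDFS) is left in place.

In practice, companies tend to use external tables more, because the data may need to be shared by several programs and this guards against accidental deletion. External tables are usually combined with LOCATION.

External tables can also map data from other sources into Hive, for example HBase or ElasticSearch.

The original motivation for external tables is to decouple a table's metadata from its data.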

Managed tables are Hive owned tables where the entire lifecycle of the tables’ data are managed and controlled by Hive. External tables are tables where Hive has loose coupling with the data.

All the write operations to the Managed tables are performed using Hive SQL commands. If a Managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. The transactional semantics (ACID) are also supported only on Managed tables.
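
To check which kind a table is, or to switch a managed table to external before a risky drop, something like the following can be used. It is a sketch that assumes the two tables from this section have been recreated; the exact layout of the describe output varies between Hive versions.

```sql
-- The "Table Type" row shows MANAGED_TABLE or EXTERNAL_TABLE,
-- and the "Location" row shows the HDFS directory backing the table.
describe formatted students_external;

-- Turn a managed table into an external one so that a later DROP
-- leaves its files in place. Some versions expect the value in
-- upper case, so 'TRUE' is the safer spelling.
alter table students_internal set tblproperties('EXTERNAL'='TRUE');
```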

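Hive partitions

A partitioned table is simply a table whose directory contains subdirectories, one per partition, named after the partition.

Purpose: partition pruning avoids full table scans and reduces the amount of data MapReduce has to process, which improves efficiency.

In a typical company Hive deployment nearly every table is partitioned, usually by date or by region.

When querying a partitioned table, remember to filter on the partition column.

More partition levels are not always better; stay within about 3 levels and weigh the choice against the actual business needs.

Create a partitioned table:

create table students_pt
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(pt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Add a partition: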

alter table students_pt add partition(pt='20210622');
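
To see the subdirectory that the new partition maps to, the partition can be described from the Hive shell. This is a sketch; the dfs path in the last command assumes the default warehouse directory /user/hive/warehouse, which may differ in a given installation.

```sql
-- The "Location" row shows the partition's HDFS directory,
-- which ends in .../students_pt/pt=20210622
describe formatted students_pt partition(pt='20210622');

-- Assumption: default warehouse path. Lists one pt=... directory per partition.
dfs -ls /user/hive/warehouse/students_pt;
```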

Drop a partition:

alter table students_pt drop partition(pt='20210112');

List all partitions of a table:

show partitions students_pt; -- recommended (reads the partition list directly from the metadata)
select distinct pt from students_pt; -- not recommended

Insert data into a partition:

insert into table students_pt partition(pt='20210101') select * from students;
load data local inpath '/usr/local/soft/data/students.txt' into table students_pt partition(pt='20210111');

Query the data of a partition:

-- full table scan; not recommended, inefficient
select count(*) from students_pt;
-- the where condition prunes partitions, avoiding a full table scan; efficient
select count(*) from students_pt where pt='20210101';
-- range (non-equality) predicates also work in the where condition
select count(*) from students_pt where pt<='20210112' and pt>='20210110';
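
One way to confirm that pruning actually happens is EXPLAIN DEPENDENCY, which reports the tables and partitions a query would read without running it. A minimal sketch, assuming the partitions inserted above exist:

```sql
-- Only the partition pt=20210101 should be listed under the input partitions.
explain dependency
select count(*) from students_pt where pt = '20210101';

-- Without a partition filter, every partition of students_pt is listed.
explain dependency
select count(*) from students_pt;
```

The output is a JSON document; its exact shape depends on the Hive version, so treat it as a diagnostic aid rather than a stable format.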

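Hive dynamic partitions

Sometimes the raw table already contains a date column dt, and we want to split the rows into partitions by the distinct values of dt, turning the raw table into a partitioned table.

Hive's defaults restrict dynamic partitioning: hive.exec.dynamic.partition was off by default before Hive 0.9.0, and the partition mode defaults to strict, so the settings below are needed.

Dynamic partitioning: partitions are derived from the distinct values of one or more columns in the data.

Enable dynamic partition support in Hive

-- enable dynamic partitioning
hive> set hive.exec.dynamic.partition=true;
-- dynamic partition mode: strict (at least one static partition must be given) or nonstrict
-- strict example: insert into table students_pt partition(dt='anhui',pt) select ......,pt from students;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
-- maximum number of dynamic partitions that each mapper or reducer node may create; adjust to the workload
hive> set hive.exec.max.dynamic.partitions.pernode=1000;

Create the raw table and load data

create table students_dt
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string,
    dt string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Create the partitioned table and load data

create table students_dt_p
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Insert data using dynamic partitions

-- Partition columns must come last in the select list (in order, if there are several): they are matched by position, not by name
insert into table students_dt_p partition(dt) select id,name,age,gender,clazz,dt from students_dt;
-- For example, the statement below uses age as the partition value rather than the dt column of students_dt
insert into table students_dt_p partition(dt) select id,name,age,gender,dt,age from students_dt;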

Multi-level partitions

create table students_year_month
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string,
    year string,
    month string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

create table students_year_month_pt
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(year string,month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

insert into table students_year_month_pt partition(year,month) select id,name,age,gender,clazz,year,month from students_year_month;
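
Under strict mode the same tables can still be loaded if the leading partition is given statically and only the trailing one is left dynamic. A minimal sketch; the value '2021' is a hypothetical year chosen for illustration.

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=strict;

-- year is static, month is dynamic. Static partition columns must come
-- before dynamic ones, and only the dynamic column (month) appears at
-- the end of the select list.
insert into table students_year_month_pt partition(year='2021', month)
select id, name, age, gender, clazz, month
from students_year_month
where year = '2021';
```

Fixing the first level this way also limits how many partitions a single statement can create, which is the point of strict mode.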

Try multi-level partitioning yourself.

Further reading on partitions: https://developer.aliyun.com/article/81775

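Hive bucketing

Bucketing is a further split of a table's files (data).

Bucketing is not enforced by default in Hive versions before 2.0; from 2.0 on it is always enforced and the switch shown below was removed.

Purpose: when rows are inserted into a bucketed table, Hive hashes the columns named in clustered by and takes the result modulo the number of buckets, splitting the data into that many files. This spreads the data evenly, makes sampling convenient, and speeds up map joins.

Choose the bucketing columns according to the business; a good choice can mitigate data skew.

Turn on bucketing enforcement

hive> set hive.enforce.bucketing=true;

Create a bucketed table

create table students_buks
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
CLUSTERED BY (clazz) into 12 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Insert data into the bucketed table

-- load data alone does not distribute the rows across buckets
load data local inpath '/usr/local/soft/data/students.txt' into table students_buks;
-- insert through a query, as below, for the bucketing to take effect
insert into students_buks select * from students;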
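
Once the table has been populated through an insert, the buckets can be sampled with TABLESAMPLE. A sketch, assuming students_buks holds the 12 buckets defined above:

```sql
-- Read only the first of the 12 buckets (roughly one twelfth of the rows,
-- depending on how evenly clazz hashes).
select * from students_buks tablesample(bucket 1 out of 12 on clazz);

-- With "out of 4" on a 12-bucket table, every 4th bucket starting from
-- the 1st is read (the 1st, 5th and 9th).
select * from students_buks tablesample(bucket 1 out of 4 on clazz);
```

Because the sample column matches the clustered by column, Hive can read just the selected bucket files instead of scanning the whole table.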

https://zhuanlan.zhihu.com/p/93728864 Use cases, advantages, and drawbacks of Hive bucketed tables
