
005-Hive Tutorial

Hive Tutorial

Concepts

What Is Hive

Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware.

Hive is designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data. It provides SQL, which enables users to do ad-hoc querying, summarization, and data analysis easily. At the same time, Hive's SQL gives users multiple places to integrate their own functionality for custom analysis, such as user-defined functions (UDFs).

What Hive Is NOT

Hive is not designed for online transaction processing. It is best used for traditional data warehousing tasks.

Data Units

In order of granularity, Hive data is organized into: Databases, Tables, Partitions, and Buckets (or Clusters).

Partitions: each table can have one or more partition keys that determine how the data is stored.
      Partitions, apart from being storage units, also allow the user to efficiently identify the rows that satisfy a certain partition criterion.
Buckets (or Clusters): data in each partition may in turn be divided into buckets, based on the value of a hash function of some column of the table. (A minimal sketch follows.)
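A minimal sketch of a table that uses both units (the table and column names here are hypothetical, not from the original tutorial):

CREATE TABLE visits (userid BIGINT, url STRING)    -- data columns
PARTITIONED BY (ds STRING)                         -- partition key: one directory per ds value
CLUSTERED BY (userid) INTO 16 BUCKETS;             -- a hash of userid decides the bucket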

Type System

Hive supports primitive and complex data types.
1. Primitive types

Integers
    TINYINT — 1 byte integer
    SMALLINT — 2 byte integer
    INT — 4 byte integer
    BIGINT — 8 byte integer
Boolean type
    BOOLEAN — TRUE/FALSE
Floating point numbers
    FLOAT — single precision
    DOUBLE — Double precision
Fixed point numbers
    DECIMAL — a fixed point value of user defined scale and precision
String types
    STRING — sequence of characters in a specified character set
    VARCHAR — sequence of characters in a specified character set with a maximum length
    CHAR — sequence of characters in a specified character set with a defined length
Date and time types
    TIMESTAMP — a specific point in time, up to nanosecond precision
    DATE — a date
Binary types
    BINARY — a sequence of bytes

2. Complex types
Complex types can be built up from primitive types and other composite types.

Structs: the elements within the type can be accessed using the . (dot) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a.
Maps (key-value tuples): elements are accessed using ['element_name'] notation.
Arrays (indexable lists): the elements in an array have to be of the same type; they are accessed using [index] notation. (A short sketch follows this list.)
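A minimal sketch tying the three access notations together (the table and field names are hypothetical):

CREATE TABLE complex_demo (
  c STRUCT<a: INT, b: INT>,    -- struct: accessed as c.a, c.b
  m MAP<STRING, STRING>,       -- map: accessed as m['key']
  l ARRAY<INT>                 -- array: accessed as l[0], l[1], ...
);

SELECT c.a, m['color'], l[0]
FROM complex_demo;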

Built In Operators and Functions

In Beeline or the Hive CLI, the latest documentation can be displayed with the following commands (a usage example follows below):
  SHOW FUNCTIONS;
  DESCRIBE FUNCTION <function_name>;
  DESCRIBE FUNCTION EXTENDED <function_name>;

Note: Hive keywords, operator names, and function names are case-insensitive.
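For example, to inspect the built-in concat function:

DESCRIBE FUNCTION concat;
DESCRIBE FUNCTION EXTENDED concat;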

1. Operators
⑴ Relational operators: return TRUE or FALSE.

=   !=  <   <=  >   >=   IS NULL    IS NOT NULL     LIKE    RLIKE   REGEXP

A = B (all primitive types): TRUE if expression A is equivalent to expression B; otherwise FALSE.
A != B (all primitive types): TRUE if expression A is not equivalent to expression B; otherwise FALSE.
A < B (all primitive types): TRUE if expression A is less than expression B; otherwise FALSE.
A <= B (all primitive types): TRUE if expression A is less than or equal to expression B; otherwise FALSE.
A > B (all primitive types): TRUE if expression A is greater than expression B; otherwise FALSE.
A >= B (all primitive types): TRUE if expression A is greater than or equal to expression B; otherwise FALSE.
A IS NULL (all types): TRUE if expression A evaluates to NULL; otherwise FALSE.
A IS NOT NULL (all types): FALSE if expression A evaluates to NULL; otherwise TRUE.
A LIKE B (strings): TRUE if string A matches the SQL simple regular expression B; otherwise FALSE. The comparison is done character by character. The _ character in B matches any character in A (similar to . in POSIX regular expressions), and the % character in B matches an arbitrary number of characters in A (similar to .* in POSIX regular expressions). For example, 'foobar' LIKE 'foo' evaluates to FALSE, whereas 'foobar' LIKE 'foo___' evaluates to TRUE and so does 'foobar' LIKE 'foo%'. To match a literal %, escape it as \%. If the data contains a semicolon and you want to search for it, it needs to be escaped: columnValue LIKE 'a\;b'.
A RLIKE B (strings): NULL if A or B is NULL; TRUE if any (possibly empty) substring of A matches the Java regular expression B (see the Java regular expressions syntax); otherwise FALSE. For example, 'foobar' RLIKE 'foo' evaluates to TRUE and so does 'foobar' RLIKE '^f.*r$'.
A REGEXP B (strings): Same as RLIKE.

Notes:
A LIKE B: both operands are strings. Returns TRUE if string A matches the SQL simple regular expression B, otherwise FALSE.
    _ matches any single character (like . in a regular expression);
    % matches any number of characters (like .* in a regular expression).
        Examples: 'foobar' LIKE 'foo'     -->  false
                  'foobar' LIKE 'foo___'  -->  true
                  'foobar' LIKE 'foo%'    -->  true
        Use \% to match a literal %.
A RLIKE B: both operands are strings. Returns NULL if A or B is NULL; TRUE if any substring of A matches the Java regular expression B, otherwise FALSE.
        Examples: 'foobar' rlike 'foo'      -->  true
                  'foobar' rlike '^f.*r$'   -->  true
A REGEXP B: both operands are strings; same as RLIKE.
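A short sketch of these operators in a WHERE clause (the page_view table is defined later in this tutorial; the patterns are illustrative):

SELECT page_url
FROM page_view
WHERE page_url LIKE '%xyz.com%'                      -- SQL simple pattern: % matches any run of characters
   OR page_url RLIKE '^https?://.*\\.xyz\\.com';     -- Java regular expression; \\. escapes a literal dot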

⑵ Arithmetic operators: all return number types.

+   -   *   /   %   &    |  ^   ~

A + B (all number types): Gives the result of adding A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands; for example, since every integer is a float, float is a containing type of integer, so the + operator on a float and an int results in a float.
A - B (all number types): Gives the result of subtracting B from A. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A * B (all number types): Gives the result of multiplying A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. Note that if the multiplication causes overflow, you will have to cast one of the operands to a type higher in the type hierarchy.
A / B (all number types): Gives the result of dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. If the operands are integer types, then the result is the quotient of the division.
A % B (all number types): Gives the remainder resulting from dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A & B (all number types): Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A | B (all number types): Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A ^ B (all number types): Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
~A (all number types): Gives the result of bitwise NOT of A. The type of the result is the same as the type of A.
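A few of these operators evaluated against literals (a minimal sketch; the expected results follow from the descriptions above; SELECT without FROM requires Hive 0.13 or later, hence the LIMIT 1 over an existing table):

SELECT 2 + 3,     -- 5
       10 % 3,    -- 1 (remainder)
       5 & 3,     -- 1 (bitwise AND)
       5 | 3,     -- 7 (bitwise OR)
       5 ^ 3,     -- 6 (bitwise XOR)
       ~0         -- -1 (bitwise NOT)
FROM page_view LIMIT 1;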

⑶ Logical operators: all return boolean values.

AND     &&      OR      ||      NOT     !

A AND B (boolean): TRUE if both A and B are TRUE; otherwise FALSE.
A && B (boolean): Same as A AND B.
A OR B (boolean): TRUE if either A or B or both are TRUE; otherwise FALSE.
A || B (boolean): Same as A OR B.
NOT A (boolean): TRUE if A is FALSE; otherwise FALSE.
!A (boolean): Same as NOT A.

Notes:
  All operand types are boolean.
  AND is the same as &&, OR is the same as ||, and NOT is the same as !.
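Combined in a WHERE clause (a minimal sketch reusing the page_view table from the examples later in this tutorial):

SELECT page_url
FROM page_view
WHERE (country = 'US' AND dt >= '2008-06-01')    -- AND binds tighter than OR
   OR NOT (referrer_url IS NULL);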

⑷ Operators on complex types

A[n] (A is an Array and n is an int): Returns the nth element in the array A. The first element has index 0; for example, if A is an array comprising ['foo', 'bar'], then A[0] returns 'foo' and A[1] returns 'bar'.
M[key] (M is a Map<K, V> and key has type K): Returns the value corresponding to the key in the map; for example, if M is a map comprising {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'}, then M['all'] returns 'foobar'.
S.x (S is a struct): Returns the x field of S; for example, for struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field of the struct.


2. Built-in functions

BIGINT round(double a): returns the rounded BIGINT value of the double.
BIGINT floor(double a): returns the maximum BIGINT value that is equal to or less than the double.
BIGINT ceil(double a): returns the minimum BIGINT value that is equal to or greater than the double.
double rand(), rand(int seed): returns a random number (that changes from row to row). Specifying the seed makes the generated random number sequence deterministic.
string concat(string A, string B, ...): returns the string resulting from concatenating B after A; for example, concat('foo', 'bar') results in 'foobar'. This function accepts an arbitrary number of arguments and returns the concatenation of all of them.
string substr(string A, int start): returns the substring of A starting from the start position till the end of string A; for example, substr('foobar', 4) results in 'bar'.
string substr(string A, int start, int length): returns the substring of A starting from the start position with the given length; for example, substr('foobar', 4, 2) results in 'ba'.
string upper(string A): returns the string resulting from converting all characters of A to upper case; for example, upper('fOoBaR') results in 'FOOBAR'.
string ucase(string A): same as upper.
string lower(string A): returns the string resulting from converting all characters of A to lower case; for example, lower('fOoBaR') results in 'foobar'.
string lcase(string A): same as lower.
string trim(string A): returns the string resulting from trimming spaces from both ends of A; for example, trim(' foobar ') results in 'foobar'.
string ltrim(string A): returns the string resulting from trimming spaces from the beginning (left hand side) of A; for example, ltrim(' foobar ') results in 'foobar '.
string rtrim(string A): returns the string resulting from trimming spaces from the end (right hand side) of A; for example, rtrim(' foobar ') results in ' foobar'.
string regexp_replace(string A, string B, string C): returns the string resulting from replacing all substrings in A that match the Java regular expression B with C; for example, regexp_replace('foobar', 'oo|ar', '') returns 'fb'.
int size(Map<K.V>): returns the number of elements in the map type.
int size(Array<T>): returns the number of elements in the array type.
value of <type> cast(<expr> as <type>): converts the result of the expression expr to <type>; for example, cast('1' as BIGINT) converts the string '1' to its integral representation. NULL is returned if the conversion does not succeed.
string from_unixtime(int unixtime): converts the number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".
string to_date(string timestamp): returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01".
int year(string date): returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970.
int month(string date): returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11.
int day(string date): returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.
string get_json_object(string json_string, string path): extracts a JSON object from a JSON string based on the specified JSON path, and returns the JSON string of the extracted object. Returns NULL if the input JSON string is invalid.
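Several of these functions combined in one query (a minimal sketch; the literals are illustrative and page_view is defined later in this tutorial):

SELECT round(3.7),                           -- 4
       concat('foo', 'bar'),                 -- 'foobar'
       substr('foobar', 4),                  -- 'bar'
       upper('fOoBaR'),                      -- 'FOOBAR'
       cast('1' as BIGINT),                  -- 1
       to_date('1970-01-01 00:00:00'),       -- '1970-01-01'
       get_json_object('{"a": 1}', '$.a')    -- '1'
FROM page_view LIMIT 1;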

  • The following built-in aggregate functions are supported in Hive:

BIGINT count(*), count(expr), count(DISTINCT expr[, expr_.]): count(*) returns the total number of retrieved rows, including rows containing NULL values; count(expr) returns the number of rows for which the supplied expression is non-NULL; count(DISTINCT expr[, expr]) returns the number of rows for which the supplied expression(s) are unique and non-NULL.
DOUBLE sum(col), sum(DISTINCT col): returns the sum of the elements in the group or the sum of the distinct values of the column in the group.
DOUBLE avg(col), avg(DISTINCT col): returns the average of the elements in the group or the average of the distinct values of the column in the group.
DOUBLE min(col): returns the minimum value of the column in the group.
DOUBLE max(col): returns the maximum value of the column in the group.
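Applied to the page_view table (a minimal sketch; page_view and its columns are defined in the examples later in this tutorial):

SELECT country,
       count(*),                  -- all rows per group, including NULLs
       count(DISTINCT userid),    -- distinct non-NULL userids per group
       min(viewTime),
       max(viewTime)
FROM page_view
GROUP BY country;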

Language Capabilities

Hive's SQL provides the basic SQL operations. These operations work on tables or partitions. They are:

  • Ability to filter rows from a table using a WHERE clause.
  • Ability to select certain columns from the table using a SELECT clause.
  • Ability to do equi-joins between two tables.
  • Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table.
  • Ability to store the results of a query into another table.
  • Ability to download the contents of a table to a local (for example, nfs) directory.
  • Ability to store the results of a query in a Hadoop DFS directory.
  • Ability to manage tables and partitions (create, drop, and alter).
  • Ability to plug in custom scripts in the language of choice for custom map/reduce jobs.

Usage and Examples

Note: many of the examples below are out of date. More up-to-date information can be found in the LanguageManual.

1. Creating, Showing, Altering, and Dropping Tables

For details on creating, showing, altering, and dropping tables, see Hive Data Definition Language.

⑴ Creating Tables

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
# The PARTITIONED BY clause defines partitioning columns, which are distinct from the data columns.
# The data in the files is assumed to be delimited with ASCII 001 (ctrl-A) as the field separator and newline as the row separator.

A custom field delimiter can be specified with ROW FORMAT DELIMITED FIELDS TERMINATED BY:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '1'
STORED AS SEQUENCEFILE;

The row delimiter currently cannot be changed, since it is determined by Hadoop rather than by Hive.

It is also a good idea to bucket the table on certain columns so that efficient sampling queries can be run against the dataset. Without bucketing, random sampling can still be done on the table, but the query has to scan the entire table, which makes it inefficient.
The following example illustrates bucketing the page_view table on the userid column:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '1'
        COLLECTION ITEMS TERMINATED BY '2'
        MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;
# The table is clustered into 32 buckets by a hash function over userid.
# Within each bucket, the data is sorted in ascending order of viewTime.
# This organization allows the user to do efficient sampling on the clustered column, in this case userid.

⑵ Showing Tables and Partitions

List all tables: SHOW TABLES;
List all tables matching a pattern: SHOW TABLES 'page.*';
List the partitions of a table: SHOW PARTITIONS table_name;
List the columns of a table and their types: DESCRIBE table_name;
List the columns of a table and all other properties: DESCRIBE EXTENDED table_name;
List the columns and all other properties of one partition: DESCRIBE EXTENDED table_name PARTITION (partition_col=val);

⑶ Altering Tables

Rename an existing table. If the new name is the name of another existing table, an error is returned:
    ALTER TABLE old_name RENAME TO new_name;
Rename the columns of an existing table. Be sure to use the same column types and include an entry for every pre-existing column (note: even when only renaming columns, the types and the other column names must be repeated exactly; any entry that is omitted or wrong changes that column accordingly):
    ALTER TABLE table_name REPLACE COLUMNS (col1 type, ...);
Add columns:
    ALTER TABLE table_name ADD COLUMNS (c1 INT COMMENT 'a new int column', c2 STRING DEFAULT 'def val');

⑷ Dropping Tables and Partitions

Dropping a table implicitly drops any indexes built on the table:
    DROP TABLE table_name;
Dropping a partition does not remove the partitioning column; it only deletes the data of the partition with the specified value (note that .* drops the data of all partitions):
    ALTER TABLE table_name DROP PARTITION (ds='2008-08-08');

2. Loading Data

There are multiple ways to load data into Hive tables. The user can create an external table that points to a specified location within HDFS. The user can copy a file into the specified location using the HDFS put or copy commands, then create a table pointing to this location along with all the relevant row format information. Once this is done, the user can transform the data and insert it into any other Hive table.

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User',
                country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';

hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';

3. Querying and Inserting Data

⑴ Simple Query

INSERT OVERWRITE TABLE user_active
SELECT user.*
FROM user
WHERE user.active = 1;

SELECT user.*
FROM user
WHERE user.active = 1;

⑵ Partition Based Query

INSERT OVERWRITE TABLE xyz_com_page_views
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND
      page_views.referrer_url like '%xyz.com';
# The partitions were defined when the table was created: PARTITIONED BY(date DATETIME, country STRING);

⑶ Joins
Hive supports only equi-joins:

# ------ join ------
INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

# ------ LEFT OUTER, RIGHT OUTER or FULL OUTER ------
INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

# ------ LEFT SEMI JOIN ------
# Checks for the existence of keys in another table
INSERT OVERWRITE TABLE pv_users
SELECT u.*
FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

# ------ multiple JOINs ------
INSERT OVERWRITE TABLE pv_friends
SELECT pv.*, u.gender, u.age, f.friends
FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON (u.id = f.uid)
WHERE pv.date = '2008-03-03';

⑷ Aggregations

INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

⑸ Multi Table/File Inserts

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.gender

INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
    SELECT pv_users.age, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.age;

⑹ Dynamic-Partition Insert

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'CA'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'UK';

The statement above is a rather bad example: we have to know all the country values in advance, and whenever dt changes we have to add a new INSERT clause, for example when another country='DC' or dt='2008-09-10' shows up.
Dynamic-partition insert was designed to solve this problem, so the dynamic-partition version is simply:

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country

Notes:
 - In the statement above, dt is a static partition column (its value is always 2008-06-08 and never changes) and country is a dynamic partition column.
 - The values of the dynamic partition columns come from the input columns.
 - Currently, dynamic partition columns are only allowed as the last column(s) in the partition clause, because the partition column order indicates their hierarchical order; a partition clause such as (dt, country='US') is therefore not allowed.
 - The additional pvs.country column in the SELECT statement is the input column corresponding to the dynamic partition column. Note that you do not need to add an input column for the static partition column, because its value is already known from the PARTITION clause.

Note that the dynamic partition values are selected by ordering, not by name, and they are taken as the last columns of the SELECT clause (that is, the values of the dynamic partition columns come from the last columns of the SELECT clause; they are not matched by name).

Semantics of the dynamic-partition insert statement:
  - When a non-empty partition already exists for a dynamic partition column value (for example, country='CA' exists under some ds root partition), it is overwritten if the dynamic partition insert sees the same value (say 'CA') in the input data.
  - Because a Hive partition corresponds to a directory in HDFS, the partition value has to conform to the HDFS path format. Any character that has a special meaning in a URI (for example, '%', ':', '/', '#') is escaped with '%' followed by its 2-byte ASCII value.
  - If the input column has a non-string type, its value is first converted to a string to be used to construct the HDFS path.
  - If the input column value is NULL or the empty string, the row is put into a special partition whose name is controlled by the Hive parameter hive.exec.default.partition.name. The default value is __HIVE_DEFAULT_PARTITION__. Basically this partition contains all the "bad" rows whose values are not valid partition names. The caveat of this approach is that the bad value is lost and replaced by __HIVE_DEFAULT_PARTITION__ if you select it. JIRA HIVE-1309 proposes a solution to let the user specify a "bad file" to retain the input partition column values.
  - Dynamic partition insert can be a resource hog, since it may generate a large number of partitions in a short time. To keep this in check, three parameters are defined:
  - hive.exec.max.dynamic.partitions.pernode (default value 100) is the maximum number of dynamic partitions that can be created by each mapper or reducer. If one mapper or reducer creates more than this threshold, a fatal error is raised from the mapper/reducer (through a counter) and the whole job is killed.
  - hive.exec.max.dynamic.partitions (default value 1000) is the total number of dynamic partitions that can be created by one DML statement. If each mapper/reducer stays below the per-node limit but the total number of dynamic partitions exceeds this value, an exception is raised at the end of the job, before the intermediate data is moved to its final destination.
  - hive.exec.max.created.files (default value 100000) is the maximum total number of files created by all mappers and reducers. Each mapper/reducer updates a Hadoop counter whenever it creates a new file. If the total count exceeds hive.exec.max.created.files, a fatal error is thrown and the job is killed.
  - Another situation we want to protect against is a user accidentally specifying all partitions as dynamic partitions without declaring one static partition, when the original intention was only to overwrite the sub-partitions of one root partition. The parameter hive.exec.dynamic.partition.mode=strict guards against this all-dynamic-partition case: in strict mode you have to specify at least one static partition, and strict is the default mode. In addition, the parameter hive.exec.dynamic.partition=true/false controls whether dynamic partitioning is allowed at all; the default is false prior to Hive 0.9.0 and true in Hive 0.9.0 and later.
  - In Hive 0.6, dynamic partition insert does not work with hive.merge.mapfiles=true or hive.merge.mapredfiles=true, so it internally turns off the merge parameters. Merging files in dynamic partition inserts is supported in Hive 0.7 (see JIRA HIVE-1307 for details).
These settings are shown in use in the sketch below.
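A minimal sketch of these settings in action (the SET values are illustrative; the all-dynamic INSERT assumes the staging table also carries a dt column, which the page_view_stg definition above does not include):

-- Enable dynamic partitioning (defaults to false before Hive 0.9.0).
SET hive.exec.dynamic.partition=true;
-- Allow all partition columns to be dynamic (the default mode is strict).
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=100;

-- Both dt and country are dynamic: their values are taken from the
-- last two columns of the SELECT clause, in that order.
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url,
              null, null, pvs.ip, pvs.dt, pvs.country;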

⑺ Inserting into Local Files

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'
SELECT pv_gender_sum.*
FROM pv_gender_sum;

⑻ Sampling
https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions
Choose the third bucket out of the 32 buckets of the pv_gender_sum table:

INSERT OVERWRITE TABLE pv_gender_sum_sample
SELECT pv_gender_sum.*
FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);

⑼ Union All

INSERT OVERWRITE TABLE actions_users
SELECT u.id, actions.date
FROM (
    SELECT av.uid AS uid
    FROM action_video av
    WHERE av.date = '2008-06-03'

    UNION ALL

    SELECT ac.uid AS uid
    FROM action_comment ac
    WHERE ac.date = '2008-06-03'
    ) actions JOIN users u ON(u.id = actions.uid);

https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-ArrayOperations
Further topics covered in the tutorial linked above:
Array Operations
Map (Associative Arrays) Operations
Custom Map/Reduce Scripts
Co-Groups