
005-Hive Tutorial

Hive Tutorial

Concepts

What Is Hive

Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware.

Hive is designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data. It provides SQL, which enables users to do ad-hoc querying, summarization, and data analysis easily. At the same time, Hive's SQL gives users multiple places to integrate their own functionality for custom analysis, such as user-defined functions (UDFs).

What Hive Is NOT

Hive is not designed for online transaction processing. It is best used for traditional data warehousing tasks.

Data Units

In order of granularity, Hive data is organized into: Databases, Tables, Partitions, and Buckets (or Clusters).

Partitions: each table can have one or more partition keys that determine how the data is stored.
      Partitions, apart from being storage units, also allow the user to efficiently identify the rows that satisfy a certain partition criterion.
Buckets (or Clusters): data in each partition may in turn be divided into buckets, based on the value of a hash function of some column of the table. (A minimal sketch follows.)
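A minimal sketch of a table that uses both units (the table and column names here are hypothetical, not from the original tutorial):

CREATE TABLE visits (userid BIGINT, url STRING)    -- data columns
PARTITIONED BY (ds STRING)                         -- partition key: one directory per ds value
CLUSTERED BY (userid) INTO 16 BUCKETS;             -- a hash of userid decides the bucket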

Type System

Hive supports primitive and complex data types.
1. Primitive types

Integers
    TINYINT — 1 byte integer
    SMALLINT — 2 byte integer
    INT — 4 byte integer
    BIGINT — 8 byte integer
Boolean type
    BOOLEAN — TRUE/FALSE
Floating point numbers
    FLOAT — single precision
    DOUBLE — Double precision
Fixed point numbers
    DECIMAL — a fixed point value of user defined scale and precision
String types
    STRING — sequence of characters in a specified character set
    VARCHAR — sequence of characters in a specified character set with a maximum length
    CHAR — sequence of characters in a specified character set with a defined length
Date and time types
    TIMESTAMP — a specific point in time, up to nanosecond precision
    DATE — a date
Binary types
    BINARY — a sequence of bytes

2. Complex types
Complex types can be built up from primitive types and other composite types.

Structs: the elements within the type can be accessed using the . (dot) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a.
Maps (key-value tuples): elements are accessed using ['element_name'] notation.
Arrays (indexable lists): the elements in an array have to be of the same type; they are accessed using [index] notation. (A short sketch follows this list.)
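A minimal sketch tying the three access notations together (the table and field names are hypothetical):

CREATE TABLE complex_demo (
  c STRUCT<a: INT, b: INT>,    -- struct: accessed as c.a, c.b
  m MAP<STRING, STRING>,       -- map: accessed as m['key']
  l ARRAY<INT>                 -- array: accessed as l[0], l[1], ...
);

SELECT c.a, m['color'], l[0]
FROM complex_demo;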

Built In Operators and Functions

In Beeline or the Hive CLI, the latest documentation can be displayed with the following commands (a usage example follows below):
  SHOW FUNCTIONS;
  DESCRIBE FUNCTION <function_name>;
  DESCRIBE FUNCTION EXTENDED <function_name>;

Note: Hive keywords, operator names, and function names are case-insensitive.
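For example, to inspect the built-in concat function:

DESCRIBE FUNCTION concat;
DESCRIBE FUNCTION EXTENDED concat;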

1. Operators
⑴ Relational operators: return TRUE or FALSE.

=   !=  <   <=  >   >=   IS NULL    IS NOT NULL     LIKE    RLIKE   REGEXP

A = B (all primitive types): TRUE if expression A is equivalent to expression B; otherwise FALSE.
A != B (all primitive types): TRUE if expression A is not equivalent to expression B; otherwise FALSE.
A < B (all primitive types): TRUE if expression A is less than expression B; otherwise FALSE.
A <= B (all primitive types): TRUE if expression A is less than or equal to expression B; otherwise FALSE.
A > B (all primitive types): TRUE if expression A is greater than expression B; otherwise FALSE.
A >= B (all primitive types): TRUE if expression A is greater than or equal to expression B; otherwise FALSE.
A IS NULL (all types): TRUE if expression A evaluates to NULL; otherwise FALSE.
A IS NOT NULL (all types): FALSE if expression A evaluates to NULL; otherwise TRUE.
A LIKE B (strings): TRUE if string A matches the SQL simple regular expression B; otherwise FALSE. The comparison is done character by character. The _ character in B matches any character in A (similar to . in POSIX regular expressions), and the % character in B matches an arbitrary number of characters in A (similar to .* in POSIX regular expressions). For example, 'foobar' LIKE 'foo' evaluates to FALSE, whereas 'foobar' LIKE 'foo___' evaluates to TRUE and so does 'foobar' LIKE 'foo%'. To match a literal %, escape it as \%. If the data contains a semicolon and you want to search for it, it needs to be escaped: columnValue LIKE 'a\;b'.
A RLIKE B (strings): NULL if A or B is NULL; TRUE if any (possibly empty) substring of A matches the Java regular expression B (see the Java regular expressions syntax); otherwise FALSE. For example, 'foobar' RLIKE 'foo' evaluates to TRUE and so does 'foobar' RLIKE '^f.*r$'.
A REGEXP B (strings): Same as RLIKE.

Notes:
A LIKE B: both operands are strings. Returns TRUE if string A matches the SQL simple regular expression B, otherwise FALSE.
    _ matches any single character (like . in a regular expression);
    % matches any number of characters (like .* in a regular expression).
        Examples: 'foobar' LIKE 'foo'     -->  false
                  'foobar' LIKE 'foo___'  -->  true
                  'foobar' LIKE 'foo%'    -->  true
        Use \% to match a literal %.
A RLIKE B: both operands are strings. Returns NULL if A or B is NULL; TRUE if any substring of A matches the Java regular expression B, otherwise FALSE.
        Examples: 'foobar' rlike 'foo'      -->  true
                  'foobar' rlike '^f.*r$'   -->  true
A REGEXP B: both operands are strings; same as RLIKE.
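A short sketch of these operators in a WHERE clause (the page_view table is defined later in this tutorial; the patterns are illustrative):

SELECT page_url
FROM page_view
WHERE page_url LIKE '%xyz.com%'                      -- SQL simple pattern: % matches any run of characters
   OR page_url RLIKE '^https?://.*\\.xyz\\.com';     -- Java regular expression; \\. escapes a literal dot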

⑵ Arithmetic operators: all return number types.

+   -   *   /   %   &    |  ^   ~

A + B (all number types): Gives the result of adding A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands; for example, since every integer is a float, float is a containing type of integer, so the + operator on a float and an int results in a float.
A - B (all number types): Gives the result of subtracting B from A. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A * B (all number types): Gives the result of multiplying A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. Note that if the multiplication causes overflow, you will have to cast one of the operands to a type higher in the type hierarchy.
A / B (all number types): Gives the result of dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. If the operands are integer types, then the result is the quotient of the division.
A % B (all number types): Gives the remainder resulting from dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A & B (all number types): Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A | B (all number types): Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A ^ B (all number types): Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
~A (all number types): Gives the result of bitwise NOT of A. The type of the result is the same as the type of A.
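A few of these operators evaluated against literals (a minimal sketch; the expected results follow from the descriptions above; SELECT without FROM requires Hive 0.13 or later, hence the LIMIT 1 over an existing table):

SELECT 2 + 3,     -- 5
       10 % 3,    -- 1 (remainder)
       5 & 3,     -- 1 (bitwise AND)
       5 | 3,     -- 7 (bitwise OR)
       5 ^ 3,     -- 6 (bitwise XOR)
       ~0         -- -1 (bitwise NOT)
FROM page_view LIMIT 1;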

⑶ Logical operators: all return boolean values.

AND     &&      OR      ||      NOT     !

A AND B (boolean): TRUE if both A and B are TRUE; otherwise FALSE.
A && B (boolean): Same as A AND B.
A OR B (boolean): TRUE if either A or B or both are TRUE; otherwise FALSE.
A || B (boolean): Same as A OR B.
NOT A (boolean): TRUE if A is FALSE; otherwise FALSE.
!A (boolean): Same as NOT A.

Notes:
  All operand types are boolean.
  AND is the same as &&, OR is the same as ||, and NOT is the same as !.
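Combined in a WHERE clause (a minimal sketch reusing the page_view table from the examples later in this tutorial):

SELECT page_url
FROM page_view
WHERE (country = 'US' AND dt >= '2008-06-01')    -- AND binds tighter than OR
   OR NOT (referrer_url IS NULL);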

⑷ Operators on complex types

A[n] (A is an Array and n is an int): Returns the nth element in the array A. The first element has index 0; for example, if A is an array comprising ['foo', 'bar'], then A[0] returns 'foo' and A[1] returns 'bar'.
M[key] (M is a Map<K, V> and key has type K): Returns the value corresponding to the key in the map; for example, if M is a map comprising {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'}, then M['all'] returns 'foobar'.
S.x (S is a struct): Returns the x field of S; for example, for struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field of the struct.


2. Built-in functions

BIGINT round(double a): returns the rounded BIGINT value of the double.
BIGINT floor(double a): returns the maximum BIGINT value that is equal to or less than the double.
BIGINT ceil(double a): returns the minimum BIGINT value that is equal to or greater than the double.
double rand(), rand(int seed): returns a random number (that changes from row to row). Specifying the seed makes the generated random number sequence deterministic.
string concat(string A, string B, ...): returns the string resulting from concatenating B after A; for example, concat('foo', 'bar') results in 'foobar'. This function accepts an arbitrary number of arguments and returns the concatenation of all of them.
string substr(string A, int start): returns the substring of A starting from the start position till the end of string A; for example, substr('foobar', 4) results in 'bar'.
string substr(string A, int start, int length): returns the substring of A starting from the start position with the given length; for example, substr('foobar', 4, 2) results in 'ba'.
string upper(string A): returns the string resulting from converting all characters of A to upper case; for example, upper('fOoBaR') results in 'FOOBAR'.
string ucase(string A): same as upper.
string lower(string A): returns the string resulting from converting all characters of A to lower case; for example, lower('fOoBaR') results in 'foobar'.
string lcase(string A): same as lower.
string trim(string A): returns the string resulting from trimming spaces from both ends of A; for example, trim(' foobar ') results in 'foobar'.
string ltrim(string A): returns the string resulting from trimming spaces from the beginning (left hand side) of A; for example, ltrim(' foobar ') results in 'foobar '.
string rtrim(string A): returns the string resulting from trimming spaces from the end (right hand side) of A; for example, rtrim(' foobar ') results in ' foobar'.
string regexp_replace(string A, string B, string C): returns the string resulting from replacing all substrings in A that match the Java regular expression B with C; for example, regexp_replace('foobar', 'oo|ar', '') returns 'fb'.
int size(Map<K.V>): returns the number of elements in the map type.
int size(Array<T>): returns the number of elements in the array type.
value of <type> cast(<expr> as <type>): converts the result of the expression expr to <type>; for example, cast('1' as BIGINT) converts the string '1' to its integral representation. NULL is returned if the conversion does not succeed.
string from_unixtime(int unixtime): converts the number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".
string to_date(string timestamp): returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01".
int year(string date): returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970.
int month(string date): returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11.
int day(string date): returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.
string get_json_object(string json_string, string path): extracts a JSON object from a JSON string based on the specified JSON path, and returns the JSON string of the extracted object. Returns NULL if the input JSON string is invalid.
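Several of these functions combined in one query (a minimal sketch; the literals are illustrative and page_view is defined later in this tutorial):

SELECT round(3.7),                           -- 4
       concat('foo', 'bar'),                 -- 'foobar'
       substr('foobar', 4),                  -- 'bar'
       upper('fOoBaR'),                      -- 'FOOBAR'
       cast('1' as BIGINT),                  -- 1
       to_date('1970-01-01 00:00:00'),       -- '1970-01-01'
       get_json_object('{"a": 1}', '$.a')    -- '1'
FROM page_view LIMIT 1;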

  • The following built-in aggregate functions are supported in Hive:

BIGINT count(*), count(expr), count(DISTINCT expr[, expr_.]): count(*) returns the total number of retrieved rows, including rows containing NULL values; count(expr) returns the number of rows for which the supplied expression is non-NULL; count(DISTINCT expr[, expr]) returns the number of rows for which the supplied expression(s) are unique and non-NULL.
DOUBLE sum(col), sum(DISTINCT col): returns the sum of the elements in the group or the sum of the distinct values of the column in the group.
DOUBLE avg(col), avg(DISTINCT col): returns the average of the elements in the group or the average of the distinct values of the column in the group.
DOUBLE min(col): returns the minimum value of the column in the group.
DOUBLE max(col): returns the maximum value of the column in the group.
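Applied to the page_view table (a minimal sketch; page_view and its columns are defined in the examples later in this tutorial):

SELECT country,
       count(*),                  -- all rows per group, including NULLs
       count(DISTINCT userid),    -- distinct non-NULL userids per group
       min(viewTime),
       max(viewTime)
FROM page_view
GROUP BY country;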

Language Capabilities

Hive's SQL provides the basic SQL operations. These operations work on tables or partitions. They are:

  • Ability to filter rows from a table using a WHERE clause.
  • Ability to select certain columns from the table using a SELECT clause.
  • Ability to do equi-joins between two tables.
  • Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table.
  • Ability to store the results of a query into another table.
  • Ability to download the contents of a table to a local (for example, nfs) directory.
  • Ability to store the results of a query in a Hadoop DFS directory.
  • Ability to manage tables and partitions (create, drop, and alter).
  • Ability to plug in custom scripts in the language of choice for custom map/reduce jobs.

Usage and Examples

Note: many of the examples below are out of date. More up-to-date information can be found in the LanguageManual.

1. Creating, Showing, Altering, and Dropping Tables

For details on creating, showing, altering, and dropping tables, see Hive Data Definition Language.

⑴ Creating Tables

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
# The PARTITIONED BY clause defines partitioning columns, which are distinct from the data columns.
# The data in the files is assumed to be delimited with ASCII 001 (ctrl-A) as the field separator and newline as the row separator.

A custom field delimiter can be specified with ROW FORMAT DELIMITED FIELDS TERMINATED BY:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '1'
STORED AS SEQUENCEFILE;

The row delimiter currently cannot be changed, since it is determined by Hadoop rather than by Hive.

It is also a good idea to bucket the table on certain columns so that efficient sampling queries can be run against the dataset. Without bucketing, random sampling can still be done on the table, but the query has to scan the entire table, which makes it inefficient.
The following example illustrates bucketing the page_view table on the userid column:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '1'
        COLLECTION ITEMS TERMINATED BY '2'
        MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;
# The table is clustered into 32 buckets by a hash function over userid.
# Within each bucket, the data is sorted in ascending order of viewTime.
# This organization allows the user to do efficient sampling on the clustered column, in this case userid.

⑵ Showing Tables and Partitions

List all tables: SHOW TABLES;
List all tables matching a pattern: SHOW TABLES 'page.*';
List the partitions of a table: SHOW PARTITIONS table_name;
List the columns of a table and their types: DESCRIBE table_name;
List the columns of a table and all other properties: DESCRIBE EXTENDED table_name;
List the columns and all other properties of one partition: DESCRIBE EXTENDED table_name PARTITION (partition_col=val);

⑶ Altering Tables

Rename an existing table. If the new name is the name of another existing table, an error is returned:
    ALTER TABLE old_name RENAME TO new_name;
Rename the columns of an existing table. Be sure to use the same column types and include an entry for every pre-existing column (note: even when only renaming columns, the types and the other column names must be repeated exactly; any entry that is omitted or wrong changes that column accordingly):
    ALTER TABLE table_name REPLACE COLUMNS (col1 type, ...);
Add columns:
    ALTER TABLE table_name ADD COLUMNS (c1 INT COMMENT 'a new int column', c2 STRING DEFAULT 'def val');

⑷ Dropping Tables and Partitions

Dropping a table implicitly drops any indexes built on the table:
    DROP TABLE table_name;
Dropping a partition does not remove the partitioning column; it only deletes the data of the partition with the specified value (note that .* drops the data of all partitions):
    ALTER TABLE table_name DROP PARTITION (ds='2008-08-08');

2. Loading Data

There are multiple ways to load data into Hive tables. The user can create an external table that points to a specified location within HDFS. The user can copy a file into the specified location using the HDFS put or copy commands, then create a table pointing to this location along with all the relevant row format information. Once this is done, the user can transform the data and insert it into any other Hive table.

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User',
                country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';

hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';

3. Querying and Inserting Data

⑴ Simple Query

INSERT OVERWRITE TABLE user_active
SELECT user.*
FROM user
WHERE user.active = 1;

SELECT user.*
FROM user
WHERE user.active = 1;

⑵ Partition Based Query

INSERT OVERWRITE TABLE xyz_com_page_views
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND
      page_views.referrer_url like '%xyz.com';
# The partitions were defined when the table was created: PARTITIONED BY(date DATETIME, country STRING);

⑶ Joins
Hive supports only equi-joins:

# ------ join ------
INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

# ------ LEFT OUTER, RIGHT OUTER or FULL OUTER ------
INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

# ------ LEFT SEMI JOIN ------
# Checks for the existence of keys in another table
INSERT OVERWRITE TABLE pv_users
SELECT u.*
FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

# ------ multiple JOINs ------
INSERT OVERWRITE TABLE pv_friends
SELECT pv.*, u.gender, u.age, f.friends
FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON (u.id = f.uid)
WHERE pv.date = '2008-03-03';

⑷ Aggregations

INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

⑸ Multi Table/File Inserts

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.gender

INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
    SELECT pv_users.age, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.age;

⑹ Dynamic-Partition Insert

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'CA'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'UK';

The statement above is a rather bad example: we have to know all the country values in advance, and whenever dt changes we have to add a new INSERT clause, for example when another country='DC' or dt='2008-09-10' shows up.
Dynamic-partition insert was designed to solve this problem, so the dynamic-partition version is simply:

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country

Notes:
 - In the statement above, dt is a static partition column (its value is always 2008-06-08 and never changes) and country is a dynamic partition column.
 - The values of the dynamic partition columns come from the input columns.
 - Currently, dynamic partition columns are only allowed as the last column(s) in the partition clause, because the partition column order indicates their hierarchical order; a partition clause such as (dt, country='US') is therefore not allowed.
 - The additional pvs.country column in the SELECT statement is the input column corresponding to the dynamic partition column. Note that you do not need to add an input column for the static partition column, because its value is already known from the PARTITION clause.

Note that the dynamic partition values are selected by ordering, not by name, and they are taken as the last columns of the SELECT clause (that is, the values of the dynamic partition columns come from the last columns of the SELECT clause; they are not matched by name).

Semantics of the dynamic-partition insert statement:
  - When a non-empty partition already exists for a dynamic partition column value (for example, country='CA' exists under some ds root partition), it is overwritten if the dynamic partition insert sees the same value (say 'CA') in the input data.
  - Because a Hive partition corresponds to a directory in HDFS, the partition value has to conform to the HDFS path format. Any character that has a special meaning in a URI (for example, '%', ':', '/', '#') is escaped with '%' followed by its 2-byte ASCII value.
  - If the input column has a non-string type, its value is first converted to a string to be used to construct the HDFS path.
  - If the input column value is NULL or the empty string, the row is put into a special partition whose name is controlled by the Hive parameter hive.exec.default.partition.name. The default value is __HIVE_DEFAULT_PARTITION__. Basically this partition contains all the "bad" rows whose values are not valid partition names. The caveat of this approach is that the bad value is lost and replaced by __HIVE_DEFAULT_PARTITION__ if you select it. JIRA HIVE-1309 proposes a solution to let the user specify a "bad file" to retain the input partition column values.
  - Dynamic partition insert can be a resource hog, since it may generate a large number of partitions in a short time. To keep this in check, three parameters are defined:
  - hive.exec.max.dynamic.partitions.pernode (default value 100) is the maximum number of dynamic partitions that can be created by each mapper or reducer. If one mapper or reducer creates more than this threshold, a fatal error is raised from the mapper/reducer (through a counter) and the whole job is killed.
  - hive.exec.max.dynamic.partitions (default value 1000) is the total number of dynamic partitions that can be created by one DML statement. If each mapper/reducer stays below the per-node limit but the total number of dynamic partitions exceeds this value, an exception is raised at the end of the job, before the intermediate data is moved to its final destination.
  - hive.exec.max.created.files (default value 100000) is the maximum total number of files created by all mappers and reducers. Each mapper/reducer updates a Hadoop counter whenever it creates a new file. If the total count exceeds hive.exec.max.created.files, a fatal error is thrown and the job is killed.
  - Another situation we want to protect against is a user accidentally specifying all partitions as dynamic partitions without declaring one static partition, when the original intention was only to overwrite the sub-partitions of one root partition. The parameter hive.exec.dynamic.partition.mode=strict guards against this all-dynamic-partition case: in strict mode you have to specify at least one static partition, and strict is the default mode. In addition, the parameter hive.exec.dynamic.partition=true/false controls whether dynamic partitioning is allowed at all; the default is false prior to Hive 0.9.0 and true in Hive 0.9.0 and later.
  - In Hive 0.6, dynamic partition insert does not work with hive.merge.mapfiles=true or hive.merge.mapredfiles=true, so it internally turns off the merge parameters. Merging files in dynamic partition inserts is supported in Hive 0.7 (see JIRA HIVE-1307 for details).
These settings are shown in use in the sketch below.
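A minimal sketch of these settings in action (the SET values are illustrative; the all-dynamic INSERT assumes the staging table also carries a dt column, which the page_view_stg definition above does not include):

-- Enable dynamic partitioning (defaults to false before Hive 0.9.0).
SET hive.exec.dynamic.partition=true;
-- Allow all partition columns to be dynamic (the default mode is strict).
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=100;

-- Both dt and country are dynamic: their values are taken from the
-- last two columns of the SELECT clause, in that order.
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url,
              null, null, pvs.ip, pvs.dt, pvs.country;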

⑺ Inserting into Local Files

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'
SELECT pv_gender_sum.*
FROM pv_gender_sum;

⑻ Sampling
https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions
Choose the third bucket out of the 32 buckets of the pv_gender_sum table:

INSERT OVERWRITE TABLE pv_gender_sum_sample
SELECT pv_gender_sum.*
FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);

⑼ Union All

INSERT OVERWRITE TABLE actions_users
SELECT u.id, actions.date
FROM (
    SELECT av.uid AS uid
    FROM action_video av
    WHERE av.date = '2008-06-03'

    UNION ALL

    SELECT ac.uid AS uid
    FROM action_comment ac
    WHERE ac.date = '2008-06-03'
    ) actions JOIN users u ON(u.id = actions.uid);

https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-ArrayOperations
Further topics covered in the tutorial linked above:
Array Operations
Map (Associative Arrays) Operations
Custom Map/Reduce Scripts
Co-Groups