An Introduction to Composite Data Structures in Hive and Usage Notes for Some Functions

阿新 • Published: 2022-04-28

Hive currently supports the following composite data types, listed here with their constructor functions:

map(key1, value1, key2, value2, ...): Creates a map with the given key/value pairs.
struct(val1, val2, val3, ...): Creates a struct with the given field values. Struct field names will be col1, col2, ...
named_struct(name1, val1, name2, val2, ...): Creates a struct with the given field names and values. (As of Hive 0.8.0.)
array(val1, val2, ...): Creates an array with the given elements.
create_union(tag, val1, val2, ...): Creates a union type with the value that is being pointed to by the tag parameter.

I. Using map, struct, and array

1. Using Array

Create a table that uses array as a column type
create table person(name string,work_locations array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';
Data (the name and the location list are separated by a tab)
biansutao beijing,shanghai,tianjin,hangzhou
linan changchu,chengdu,wuhan
Load the data
LOAD DATA LOCAL INPATH '/home/hadoop/person.txt' OVERWRITE INTO TABLE person;
Query
hive> select * from person;
biansutao       ["beijing","shanghai","tianjin","hangzhou"]
linan   ["changchu","chengdu","wuhan"]
Time taken: 0.355 seconds
hive> select name from person;
linan
biansutao
Time taken: 12.397 seconds
hive> select work_locations[0] from person;
changchu
beijing
Time taken: 13.214 seconds
hive> select work_locations from person;   
["changchu","chengdu","wuhan"]
["beijing","shanghai","tianjin","hangzhou"]
Time taken: 13.755 seconds
hive> select work_locations[3] from person;
NULL
hangzhou
Time taken: 12.722 seconds
hive> select work_locations[4] from person;
NULL
NULL
Time taken: 15.958 seconds
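The session above shows that indexing past the end of an array yields NULL rather than an error. The following is a minimal Python sketch of that behavior, not Hive code: `parse_person_line` and `array_get` are hypothetical helper names, and the input lines assume the tab and comma delimiters declared in the DDL.

```python
def parse_person_line(line, field_sep="\t", item_sep=","):
    """Split one line of person.txt into (name, work_locations)."""
    name, raw_locations = line.split(field_sep)
    return name, raw_locations.split(item_sep)


def array_get(values, index):
    """Model of arr[index] in Hive: None (NULL) when the index is out of range."""
    if 0 <= index < len(values):
        return values[index]
    return None


lines = [
    "biansutao\tbeijing,shanghai,tianjin,hangzhou",
    "linan\tchangchu,chengdu,wuhan",
]
table = [parse_person_line(line) for line in lines]

first = [array_get(locations, 0) for _, locations in table]
fourth = [array_get(locations, 3) for _, locations in table]
fifth = [array_get(locations, 4) for _, locations in table]
print(first, fourth, fifth)
```

Here `fourth` holds a value only for biansutao, and `fifth` is NULL (None) for both rows, matching the work_locations[3] and work_locations[4] queries.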

2. Using Map

Create the table
create table score(name string, score map<string,int>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';
Data to load (the name and the map are separated by a tab)
biansutao 數學:80,語文:89,英語:95
jobs 語文:60,數學:80,英語:99
Load the data
LOAD DATA LOCAL INPATH '/home/hadoop/score.txt' OVERWRITE INTO TABLE score;
Query
hive> select * from score;
biansutao       {"數學":80,"語文":89,"英語":95}
jobs    {"語文":60,"數學":80,"英語":99}
Time taken: 0.665 seconds
hive> select name from score;
jobs
biansutao
Time taken: 19.778 seconds
hive> select t.score from score t;
{"語文":60,"數學":80,"英語":99}
{"數學":80,"語文":89,"英語":95}
Time taken: 19.353 seconds
hive> select t.score['語文'] from score t;
60
89
Time taken: 13.054 seconds
hive> select t.score['英語'] from score t;
99
95
Time taken: 13.769 seconds
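To make the role of the two delimiters concrete, here is a small Python sketch (a model, not Hive's parser) of how a map field is split: first on the collection-item delimiter, then on the map-key delimiter. `parse_map` is a hypothetical helper, and the sketch assumes the keys are stored without surrounding quotes.

```python
def parse_map(raw, item_sep=",", key_sep=":"):
    """Split 'k1:v1,k2:v2' into a dict with integer values."""
    result = {}
    for item in raw.split(item_sep):
        key, value = item.split(key_sep, 1)
        result[key] = int(value)
    return result


jobs_scores = parse_map("語文:60,數學:80,英語:99")

# Looking up a key mirrors score['語文'] in the query above;
# a missing key gives None, just as Hive returns NULL.
chinese = jobs_scores.get("語文")
missing = jobs_scores.get("物理")
print(chinese, missing)
```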

3. Using Struct

Create the table
CREATE TABLE test(id int,course struct<course:string,score:int>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';
Data (the id and the struct are separated by a tab)
1 english,80
2 math,89
3 chinese,95
Load the data
LOAD DATA LOCAL INPATH '/home/hadoop/test.txt' OVERWRITE INTO TABLE test;
Query
hive> select * from test;
OK
1       {"course":"english","score":80}
2       {"course":"math","score":89}
3       {"course":"chinese","score":95}
Time taken: 0.275 seconds
hive> select course from test;
{"course":"english","score":80}
{"course":"math","score":89}
{"course":"chinese","score":95}
Time taken: 44.968 seconds
hive> select t.course.course from test t;
english
math
chinese
Time taken: 15.827 seconds
hive> select t.course.score from test t;
80
89
95
Time taken: 13.235 seconds

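4. Combining types (nested complex types are not supported)

The table below nests an array inside a map. ROW FORMAT DELIMITED lets you declare only one collection-item delimiter and one map-key delimiter, so the nested arrays in this data do not load as intended.
create table test1(id int,a MAP<STRING,ARRAY<STRING>>)
row format delimited fields terminated by '\t'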
collection items terminated by ','
MAP KEYS TERMINATED BY ':';
1 english:80,90,70
2 math:89,78,86
3 chinese:99,100,82
LOAD DATA LOCAL INPATH '/home/hadoop/test1.txt' OVERWRITE INTO TABLE test1;
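One way to see why this layout is ambiguous is to apply the declared delimiters in order. The Python sketch below is a simplified model and an assumption about the splitting order, not Hive's actual parser: the map field is split on ',' first and then each piece on ':', so the commas meant for the inner array are consumed as map-entry separators. `split_map_of_arrays` is a hypothetical name.

```python
def split_map_of_arrays(raw, item_sep=",", key_sep=":"):
    """Model: split map entries on item_sep, then key/value on key_sep.

    No delimiter is left over for the inner array, so each value ends up
    as a one-element list, or None when the piece has no key separator.
    """
    result = {}
    for piece in raw.split(item_sep):
        if key_sep in piece:
            key, value = piece.split(key_sep, 1)
            result[key] = [value]
        else:
            result[piece] = None
    return result


parsed = split_map_of_arrays("english:80,90,70")
print(parsed)
```

Under this model the intended value english -> [80, 90, 70] is never produced; "90" and "70" are treated as keys of their own.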

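II. Some less common Hive functions

Common functions need no explanation since they resemble standard SQL. The functions discussed below are mostly specific to HQL.

Hive functions fall roughly into these groups: Built-in, Misc., UDF, UDTF, UDAF

We will pick a few that standard SQL lacks but that are often used in Hive SQL for statistical analysis.

1. array_contains (Collection Functions)

This is a built-in function that operates on collections. In the example below, the second query uses array_contains to express the same id filter as the chain of != comparisons in the first:

create EXTERNAL table IF NOT EXISTS userInfo (id int,sex string, age int, name string, email string,sd string, ed string)  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' location '/hive/dw';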

select * from userinfo where sex='male' and (id!=1 and id !=2 and id!=3 and id!=4 and id!=5) and age < 30;
select * from (select * from userinfo where sex='male' and !array_contains(split('1,2,3,4,5',','),cast(id as string))) tb1 where tb1.age < 30;
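The equivalence of the two queries can be checked with a small Python model. This is only a sketch: the `users` rows are made-up sample data, and `array_contains` here is a plain Python stand-in for the Hive function of the same name.

```python
def array_contains(values, item):
    """Stand-in for Hive's array_contains: True if item is in values."""
    return item in values


# Hypothetical sample rows; the real table is generated by the linked script.
users = [{"id": i, "sex": "male", "age": 20 + i} for i in range(1, 9)]

# First query: a chain of != comparisons.
by_comparisons = [
    u for u in users
    if u["sex"] == "male"
    and u["id"] != 1 and u["id"] != 2 and u["id"] != 3
    and u["id"] != 4 and u["id"] != 5
    and u["age"] < 30
]

# Second query: split('1,2,3,4,5', ',') plus cast(id as string).
excluded = "1,2,3,4,5".split(",")
by_array_contains = [
    u for u in users
    if u["sex"] == "male"
    and not array_contains(excluded, str(u["id"]))
    and u["age"] < 30
]

remaining_ids = [u["id"] for u in by_array_contains]
print(remaining_ids)
```

The cast to string matters because split returns an array of strings, so the id has to be compared as a string as well.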

The test data for this table can be generated automatically with the script at the following link:

http://my.oschina.net/leejun2005/blog/76631

2. get_json_object (Misc. Functions)

Test data:

first {"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} third
first {"store":{"fruit":[{"weight":9,"type":"apple"},{"weight":91,"type":"pear"}],"bicycle":{"price":19.952,"color":"red2"}},"email":"amy@only_for_json_udf_test.net","owner":"amy2"} third
first {"store":{"fruit":[{"weight":10,"type":"apple"},{"weight":911,"type":"pear"}],"bicycle":{"price":19.953,"color":"red3"}},"email":"amy@only_for_json_udf_test.net","owner":"amy3"} third

create external table if not exists t_json(f1 string, f2 string, f3 string) row format delimited fields TERMINATED BY ' ' location '/test/json';
select get_json_object(t_json.f2, '$.owner') from t_json;
SELECT * from t_json where get_json_object(t_json.f2, '$.store.fruit[0].weight') = 9;
SELECT get_json_object(t_json.f2, '$.non_exist_key') FROM t_json;
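The three queries above rely on a small path syntax: `$` for the root, `.key` for an object member, and `[n]` for an array element. The Python sketch below models only that subset, as an illustration rather than Hive's implementation. Note one deliberate difference: Hive returns the extracted value as a string, while this model returns the parsed Python value.

```python
import json
import re

# One path step: either .key or [index]
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def get_json_object(json_text, path):
    """Model of get_json_object for paths like '$.store.fruit[0].weight'.

    Returns None (NULL) for invalid JSON, an invalid path, or a missing key.
    """
    if not path.startswith("$"):
        return None
    try:
        node = json.loads(json_text)
    except ValueError:
        return None
    pos = 1
    while pos < len(path):
        match = _STEP.match(path, pos)
        if match is None:
            return None
        key, index = match.group(1), match.group(2)
        if key is not None:
            if not isinstance(node, dict) or key not in node:
                return None
            node = node[key]
        else:
            if not isinstance(node, list) or int(index) >= len(node):
                return None
            node = node[int(index)]
        pos = match.end()
    return node


row = '{"store":{"fruit":[{"weight":9,"type":"apple"}]},"owner":"amy2"}'
owner = get_json_object(row, "$.owner")
weight = get_json_object(row, "$.store.fruit[0].weight")
absent = get_json_object(row, "$.non_exist_key")
print(owner, weight, absent)
```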

Pay particular attention here to the UDTF alternative; the official documentation explains:

json_tuple A new json_tuple() UDTF is introduced in hive 0.7. It takes a set of names (keys) and a JSON string, and returns a tuple of values using one function. This is much more efficient than calling GET_JSON_OBJECT to retrieve more than one key from a single JSON string. In any case where a single JSON string would be parsed more than once, your query will be more efficient if you parse it once, which is what JSON_TUPLE is for. As JSON_TUPLE is a UDTF, you will need to use the LATERAL VIEW syntax in order to achieve the same goal. For example,

select a.timestamp, get_json_object(a.appevents, '$.eventid'), get_json_object(a.appevents, '$.eventname') from log a;

should be changed to

select a.timestamp, b.*
from log a lateral view json_tuple(a.appevents, 'eventid', 'eventname') b as f1, f2;
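The efficiency argument in the quoted documentation comes down to parsing each JSON string once instead of once per key. A rough Python model of that idea follows; the `log` rows are invented sample data, and this `json_tuple` is a stand-in that handles top-level keys only.

```python
import json


def json_tuple(json_text, *keys):
    """Parse the JSON string once and return one value per requested key."""
    try:
        doc = json.loads(json_text)
    except ValueError:
        doc = None
    if not isinstance(doc, dict):
        return tuple(None for _ in keys)
    return tuple(doc.get(key) for key in keys)


# Hypothetical sample rows for the log table.
log = [
    {"timestamp": 1, "appevents": '{"eventid": 7, "eventname": "login"}'},
    {"timestamp": 2, "appevents": '{"eventid": 8}'},
]

# Equivalent in spirit to: select a.timestamp, b.* from log a lateral view json_tuple(...)
rows = [
    (record["timestamp"],) + json_tuple(record["appevents"], "eventid", "eventname")
    for record in log
]
print(rows)
```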

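UDTFs (User-Defined Table-Generating Functions) address the need to turn one input row into multiple output rows (one-to-many mapping).

Lateral view makes it easy to combine the rows produced by a UDTF with other columns. Using a UDTF directly in SELECT is restricted: the SELECT list may contain only that single expression. Not only are multiple UDTFs disallowed, even a single UDTF next to other columns is rejected, and Hive reports that only a single expression is supported with a UDTF. For example:
hive> select my_test("abcef:aa") as qq,'abcd' from sunwg01;
FAILED: Error in semantic analysis: Only a single expression in the SELECT clause is supported with UDTF's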

Lateral view can meet the requirement above. Its syntax is:
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*
hive> create table sunwg ( a array<int>, b array<string> )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > COLLECTION ITEMS TERMINATED BY ',';
OK
Time taken: 1.145 seconds
hive> load data local inpath '/home/hjl/sunwg/sunwg.txt' overwrite into table sunwg;
Copying data from file:/home/hjl/sunwg/sunwg.txt
Loading data to table sunwg
OK
Time taken: 0.162 seconds
hive> select * from sunwg;
OK
[10,11] ["tom","mary"]
[20,21] ["kate","tim"]
Time taken: 0.069 seconds
hive> SELECT a, name
    > FROM sunwg LATERAL VIEW explode(b) r1 AS name;
OK
[10,11] tom
[10,11] mary
[20,21] kate
[20,21] tim
Time taken: 8.497 seconds
hive> SELECT id, name
    > FROM sunwg LATERAL VIEW explode(a) r1 AS id
    > LATERAL VIEW explode(b) r2 AS name;
OK
10 tom
10 mary
11 tom
11 mary
20 kate
20 tim
21 kate
21 tim
Time taken: 9.687 seconds
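The two queries in this session can be reproduced with a short Python model. It is a sketch of the semantics only: `lateral_view_explode` is a hypothetical helper that joins each input row with the elements of one of its own array columns.

```python
def lateral_view_explode(rows, column, alias):
    """For each row, emit one copy per element of row[column], adding alias."""
    output = []
    for row in rows:
        for element in row[column]:
            joined = dict(row)
            joined[alias] = element
            output.append(joined)
    return output


sunwg = [
    {"a": [10, 11], "b": ["tom", "mary"]},
    {"a": [20, 21], "b": ["kate", "tim"]},
]

# SELECT a, name FROM sunwg LATERAL VIEW explode(b) r1 AS name;
single = [(r["a"], r["name"]) for r in lateral_view_explode(sunwg, "b", "name")]

# SELECT id, name FROM sunwg LATERAL VIEW explode(a) r1 AS id
#                            LATERAL VIEW explode(b) r2 AS name;
step1 = lateral_view_explode(sunwg, "a", "id")
double = [(r["id"], r["name"]) for r in lateral_view_explode(step1, "b", "name")]
print(single)
print(double)
```

Chaining the helper twice gives a per-row cross product of the two arrays, which is why the second query returns eight rows from two input rows.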

3. parse_url_tuple

Test data:

url1 http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1
url2 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-getjsonobject
url3 https://www.google.com.hk/#hl=zh-CN&newwindow=1&safe=strict&q=hive+translate+example&oq=hive+translate+example&gs_l=serp.3...10174.11861.6.12051.8.8.0.0.0.0.132.883.0j7.7.0...0.0...1c.1j4.8.serp.0B9C1T_n0Hs&bav=on.2,or.&bvm=bv.44770516,d.aGc&fp=e13e41a6b9dab3f6&biw=1241&bih=589

create external table if not exists t_url(f1 string, f2 string) row format delimited fields TERMINATED BY ' ' location '/test/url';
SELECT f1, b.* FROM t_url LATERAL VIEW parse_url_tuple(f2, 'HOST', 'PATH', 'QUERY', 'QUERY:k1') b as host, path, query, query_id;

Result:

url1 facebook.com /path1/p.php k1=v1&k2=v2 v1
url2 cwiki.apache.org /confluence/display/Hive/LanguageManual+UDF NULL NULL
url3 www.google.com.hk / NULL NULL
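The NULLs in the result follow from how a URL is split: everything after `#` is the fragment, so url2 and url3 have no query string at all. A Python approximation using the standard `urllib.parse` module shows the same split. This `parse_url_tuple` is a stand-in covering only the four parts used above; it is not Hive's implementation.

```python
from urllib.parse import parse_qs, urlsplit


def parse_url_tuple(url, *parts):
    """Return the requested URL parts; None (NULL) when a part is absent."""
    pieces = urlsplit(url)
    values = []
    for part in parts:
        if part == "HOST":
            values.append(pieces.hostname)
        elif part == "PATH":
            values.append(pieces.path)
        elif part == "QUERY":
            values.append(pieces.query or None)
        elif part.startswith("QUERY:"):
            found = parse_qs(pieces.query).get(part[len("QUERY:"):])
            values.append(found[0] if found else None)
        else:
            values.append(None)
    return tuple(values)


wanted = ("HOST", "PATH", "QUERY", "QUERY:k1")
url1 = parse_url_tuple("http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1", *wanted)
url2 = parse_url_tuple(
    "https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF"
    "#LanguageManualUDF-getjsonobject",
    *wanted,
)
print(url1)
print(url2)
```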

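4. explode

explode is a built-in Hive table-generating function (Built-in Table-Generating Functions, UDTF). It solves the 1-to-N problem: it splits one input row into multiple rows, for example emitting each element of an array as its own row, and exposes the result as a virtual table. Note the following limitations: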

Using the syntax "SELECT udtf(col) AS colAlias..." has a few limitations:
No other expressions are allowed in SELECT
    SELECT pageid, explode(adid_list) AS myCol... is not supported
UDTF's can't be nested
    SELECT explode(explode(adid_list)) AS myCol... is not supported
GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY is not supported
    SELECT explode(adid_list) AS myCol ... GROUP BY myCol is not supported

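From the rules and syntax above:

a udtf cannot be mixed with other non-udtf columns in the select list,
udtfs cannot be nested,
GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY are not supported,
and a udtf that appears in the select list must be given column aliases, otherwise an error is raised: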

SELECT explode(myCol) AS myNewCol FROM myTable;
SELECT explode(myMap) AS (myMapKey, myMapValue) FROM myMapTable;
SELECT posexplode(myCol) AS pos, myNewCol FROM myTable;
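The three statements above differ in the shape of the rows they produce, which is why they need one, two, and two column aliases respectively. A small Python model makes the shapes visible; the function names mirror the Hive functions, but the code is only an illustrative sketch with made-up inputs.

```python
def explode(value):
    """Array -> one single-column row per element; map -> (key, value) rows."""
    if isinstance(value, dict):
        return [(key, item) for key, item in value.items()]
    return [(element,) for element in value]


def posexplode(values):
    """Array -> (position, element) rows, with positions starting at 0."""
    return [(position, element) for position, element in enumerate(values)]


array_rows = explode([1, 2, 3])               # AS myNewCol
map_rows = explode({"k1": "v1", "k2": "v2"})  # AS (myMapKey, myMapValue)
pos_rows = posexplode(["x", "y"])             # AS pos, myNewCol
print(array_rows, map_rows, pos_rows)
```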

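5. lateral view

lateral view is the construct Hive provides for use with UDTFs. It solves the problem that a UDTF cannot be accompanied by additional select columns. Suppose we split one column of a Hive table and want to turn the result into a 1-to-N form, that is, one row into multiple rows. Hive does not allow other select expressions alongside the UDTF.

For example, suppose the ids of the users who logged in to a game are stored in a single field user_ids, and we want to apply a UDTF to each row to output multiple rows.

select game_id, explode(split(user_ids,'\[\[\[')) as user_id   from login_game_log  where dt='2014-05-15' ;
FAILED: Error in semantic analysis: UDTF's are not supported outside the SELECT clause, nor nested in expressions

Hive reports a semantic analysis error: the UDTF cannot be combined with other select expressions. How can we make this work? This is where Lateral View comes in.

Lateral view is designed to be used together with UDTFs such as explode. It places the rows generated by the UDTF into a virtual table, and this virtual table (1 to N) is joined with the input row, here each game_id, so that select columns outside the UDTF can be used alongside it (each source row is joined directly with its own N generated rows). This is also why LATERAL VIEW udtf(expression) must be followed by a table alias and column aliases.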

Lateral View Syntax

lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*

fromClause: FROM baseTable (lateralView)*

As the syntax shows, Lateral view appears in two positions:

before the udtf
after FROM baseTable

For example, given the table pageAds:

pageid         adid_list
front_page     [1, 2, 3]
contact_page   [3, 4, 5]

SELECT pageid, adid
FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid;

pageid         adid
front_page     1
front_page     2
front_page     3
contact_page   3
contact_page   4
contact_page   5

A FROM clause can be followed by multiple Lateral Views.

A FROM clause can have multiple LATERAL VIEW clauses. Subsequent LATERAL VIEWS can reference columns from any of the tables appearing to the left of the LATERAL VIEW.

Given the data:

Array<int> col1    Array<string> col2
[1, 2]             ["a", "b", "c"]
[3, 4]             ["d", "e", "f"]

Goal:

Split the first and second columns at the same time, similar to a Cartesian product within each row.

We can write:

SELECT myCol1, myCol2 FROM baseTable
LATERAL VIEW explode(col1) myTable1 AS myCol1
LATERAL VIEW explode(col2) myTable2 AS myCol2;

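One more case: what if the array passed to the UDTF is empty?

Hive 0.12 adds support for the OUTER keyword. By default, if the UDTF produces no rows, the input row is omitted from the output.

With the OUTER keyword, the behavior resembles a left outer join: the selected columns are still output, and the columns produced by the UDTF are NULL.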
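The difference between the default behavior and OUTER can be modeled in a few lines of Python. This is a sketch of the semantics only: `lateral_view_explode` is a hypothetical helper, and the `empty_page` row is invented to show the empty-array case.

```python
def lateral_view_explode(rows, column, alias, outer=False):
    """Join each row with the elements of row[column].

    With outer=True, a row whose array is empty is kept once, with alias
    set to None, instead of being dropped.
    """
    output = []
    for row in rows:
        elements = list(row[column] or [])
        if not elements and outer:
            elements = [None]
        for element in elements:
            output.append({**row, alias: element})
    return output


page_ads = [
    {"pageid": "front_page", "adid_list": [1, 2]},
    {"pageid": "empty_page", "adid_list": []},  # hypothetical row with no ads
]

inner = [(r["pageid"], r["adid"]) for r in lateral_view_explode(page_ads, "adid_list", "adid")]
outer = [
    (r["pageid"], r["adid"])
    for r in lateral_view_explode(page_ads, "adid_list", "adid", outer=True)
]
print(inner)
print(outer)
```

Without OUTER the empty_page row disappears from the result; with it, the row is preserved with a NULL adid.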

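Summary:

Lateral View usually appears together with a UDTF, to work around the restriction that a UDTF cannot be combined with other select columns.
Multiple Lateral Views can produce a result similar to a Cartesian product.
The OUTER keyword outputs NULL for a UDTF that returns no rows, so that the input row is not lost.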

