Hive內部函數簡介及查詢語法
阿新 • • 發佈:2018-07-03
backward 系統 group by 內部 desc middle href word 語法 1.Hive內置函數:
在Hive中 系統給我們內置了很多函數 具體參考官方地址
- 看下官網給我們的介紹:
SHOW FUNCTIONS; --查看所有內置函數
DESCRIBE FUNCTION <function_name>; --查看某個函數的描述
DESCRIBE FUNCTION EXTENDED <function_name>; --查看某個函數的具體使用方法
hive> DESCRIBE FUNCTION case; OK CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END - When a = b, returns c; when a = d, return e; else return f Time taken: 0.006 seconds, Fetched: 1 row(s) hive> DESCRIBE FUNCTION EXTENDED case; OK CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END - When a = b, returns c; when a = d, return e; else return f Example: SELECT CASE deptno WHEN 1 THEN Engineering WHEN 2 THEN Finance ELSE admin END, CASE zone WHEN 7 THEN Americas ELSE Asia-Pac END FROM emp_details Time taken: 0.008 seconds, Fetched: 13 row(s) # DESCRIBE 可簡寫為desc hive> desc FUNCTION EXTENDED case; OK CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END - When a = b, returns c; when a = d, return e; else return f Example: SELECT CASE deptno WHEN 1 THEN Engineering WHEN 2 THEN Finance ELSE admin END, CASE zone WHEN 7 THEN Americas ELSE Asia-Pac END FROM emp_details Time taken: 0.009 seconds, Fetched: 13 row(s)
下面我們了解下常用函數的使用方法:
# 為了方便測試 我們創建常用的dual表 hive> create table dual(x string); OK Time taken: 0.11 seconds hive> insert into table dual values(‘‘); Query ID = hadoop_20180702100505_f0566585-06b2-4c53-910a-b6a58791fc2d Total jobs = 3 Launching Job 1 out of 3 ... OK Time taken: 29.535 seconds hive> select * from dual; OK Time taken: 0.147 seconds, Fetched: 1 row(s) # 測試當前時間 hive> select current_date from dual; OK 2018-07-02 Time taken: 0.111 seconds, Fetched: 1 row(s) # 測試當前時間戳 hive> select current_timestamp from dual; OK 2018-07-02 15:03:28.919 Time taken: 0.117 seconds, Fetched: 1 row(s) # 測試substr函數 用於截取字符串 hive> desc function extended substr; OK substr(str, pos[, len]) - returns the substring of str that starts at pos and is of length len orsubstr(bin, pos[, len]) - returns the slice of byte array that starts at pos and is of length len Synonyms: substring pos is a 1-based index. If pos<0 the starting position is determined by counting backwards from the end of str. Example: > SELECT substr(‘Facebook‘, 5) FROM src LIMIT 1; ‘book‘ > SELECT substr(‘Facebook‘, -5) FROM src LIMIT 1; ‘ebook‘ > SELECT substr(‘Facebook‘, 5, 1) FROM src LIMIT 1; ‘b‘ Time taken: 0.016 seconds, Fetched: 10 row(s) hive> SELECT substr(‘helloworld‘,-5) FROM dual; OK world Time taken: 0.171 seconds, Fetched: 1 row(s) hive> SELECT substr(‘helloworld‘,5) FROM dual; OK oworld Time taken: 0.12 seconds, Fetched: 1 row(s) hive> SELECT substr(‘helloworld‘,5,3) FROM dual; OK owo Time taken: 0.142 seconds, Fetched: 1 row(s) # 測試函數concat 用於將字符連接起來 hive> desc function extended concat_ws; OK concat_ws(separator, [string | array(string)]+) - returns the concatenation of the strings separated by the separator. Example: > SELECT concat_ws(‘.‘, ‘www‘, array(‘facebook‘, ‘com‘)) FROM src LIMIT 1; ‘www.facebook.com‘ Time taken: 0.019 seconds, Fetched: 4 row(s) hive> select concat_ws(".","192","168","199","151") from dual; OK 192.168.199.151 Time taken: 0.152 seconds, Fetched: 1 row(s) # 測試函數split 用於拆分 hive> desc function extended split; OK split(str, regex) - Splits str around occurances that match regex Example: > SELECT split(‘oneAtwoBthreeC‘, ‘[ABC]‘) FROM src LIMIT 1; ["one", "two", "three"] Time taken: 0.021 seconds, Fetched: 4 row(s) hive> select split("192.168.199.151","\\.") from dual; OK ["192","168","199","151"] Time taken: 0.169 seconds, Fetched: 1 row(s)
2.Hive查詢語法:
- 簡單select語法:
# 簡單select語法 hive> select * from emp where deptno=10; OK 7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10 7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10 7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10 Time taken: 0.899 seconds, Fetched: 3 row(s) hive> select * from emp where empno <= 7800; OK 7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20 7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30 7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30 7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20 7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30 7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30 7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10 7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20 Time taken: 0.277 seconds, Fetched: 8 row(s) hive> select * from emp where salary between 1000 and 1500; OK 7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30 7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30 7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30 7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20 7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10 Time taken: 0.187 seconds, Fetched: 5 row(s) hive> select * from emp limit 5; OK 7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20 7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30 7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30 7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20 7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30 Time taken: 0.154 seconds, Fetched: 5 row(s) hive> select * from emp where empno in(7566,7499); OK 7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30 7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20 Time taken: 0.153 seconds, Fetched: 2 row(s) hive> select * from emp where comm is not null; OK 7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30 7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30 7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30 7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30 Time taken: 0.291 seconds, Fetched: 4 row(s)
- 聚合函數及分組函數:
# 聚合函數及分組函數
# max/min/count/sum/avg 特點:多進一出,進來很多條記錄出去只有一條記錄
# 查詢部門編號為10的有多少條記錄
hive> select count(1) from emp where deptno=10;
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
3
Time taken: 38.951 seconds, Fetched: 1 row(s)
# 求最大工資,最小工資,平均工資,工資的和
hive> select max(salary),min(salary),avg(salary),sum(salary) from emp;
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
5000.0 800.0 2073.214285714286 29025.0
Time taken: 23.748 seconds, Fetched: 1 row(s)
# 分組函數 group by
# 求部門的平均工資
# 註:select中出現的字段,如果沒有出現在組函數/聚合函數中,必須出現在group by裏面,否則就會產生報錯
hive> select deptno,avg(salary) from emp group by deptno;
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
10 2916.6666666666665
20 2175.0
30 1566.6666666666667
Time taken: 36.502 seconds, Fetched: 3 row(s)
# 求每個部門(deptno)、工作崗位(job)的最高工資(salary)
hive> select deptno,job,max(salary) from emp group by deptno,job;
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
10 CLERK 1300.0
10 MANAGER 2450.0
10 PRESIDENT 5000.0
20 ANALYST 3000.0
20 CLERK 1100.0
20 MANAGER 2975.0
30 CLERK 950.0
30 MANAGER 2850.0
30 SALESMAN 1600.0
Time taken: 36.096 seconds, Fetched: 9 row(s)
# 查詢平均工資大於2000的部門(使用having子句限定分組查詢)
hive> select deptno,avg(salary) from emp group by deptno having avg(salary) >2000;
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
10 2916.6666666666665
20 2175.0
Time taken: 24.71 seconds, Fetched: 2 row(s)
# case when then end(不會跑mr)
hive> select ename, salary,
> case
> when salary > 1 and salary <= 1000 then ‘LOWER‘
> when salary > 1000 and salary <= 2000 then ‘MIDDLE‘
> when salary > 2000 and salary <= 4000 then ‘HIGH‘
> ELSE ‘HIGHEST‘
> end
> from emp;
OK
SMITH 800.0 LOWER
ALLEN 1600.0 MIDDLE
WARD 1250.0 MIDDLE
JONES 2975.0 HIGH
MARTIN 1250.0 MIDDLE
BLAKE 2850.0 HIGH
CLARK 2450.0 HIGH
SCOTT 3000.0 HIGH
KING 5000.0 HIGHEST
TURNER 1500.0 MIDDLE
ADAMS 1100.0 MIDDLE
JAMES 950.0 LOWER
FORD 3000.0 HIGH
MILLER 1300.0 MIDDLE
Time taken: 0.096 seconds, Fetched: 14 row(s)
- 多表join查詢:
# 創建測試表
hive> create table a(
> id int, name string
> ) row format delimited fields terminated by ‘\t‘;
OK
Time taken: 0.311 seconds
hive> create table b(
> id int, age int
> ) row format delimited fields terminated by ‘\t‘;
OK
Time taken: 0.142 seconds
# insert或load數據 最後表數據如下
hive> select * from a;
OK
1 zhangsan
2 lisi
3 wangwu
hive> select * from b;
OK
1 20
2 30
4 40
Time taken: 0.2 seconds, Fetched: 3 row(s)
# 內連接 inner join = join 僅列出表1和表2符合連接條件的數據
hive> select a.id,a.name,b.age from a join b on a.id=b.id;
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
1 zhangsan 20
2 lisi 30
Time taken: 24.415 seconds, Fetched: 2 row(s)
# 左外連接(left join) 以左邊的為基準,左邊的數據全部數據全部出現,如果沒有出現就賦null值
hive> select a.id,a.name,b.age from a left join b on a.id=b.id;
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
1 zhangsan 20
2 lisi 30
3 wangwu NULL
Time taken: 26.218 seconds, Fetched: 3 row(s)
# 右外連接(right join) 以右表為基準
hive> select a.id,a.name,b.age from a right join b on a.id=b.id;
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
1 zhangsan 20
2 lisi 30
NULL NULL 40
Time taken: 24.027 seconds, Fetched: 3 row(s)
# 全連接(full join)相當於表1和表2的數據都顯示,如果沒有對應的數據,則顯示Null.
hive> select a.id,a.name,b.age from a full join b on a.id=b.id;
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
1 zhangsan 20
2 lisi 30
3 wangwu NULL
NULL NULL 40
Time taken: 32.94 seconds, Fetched: 4 row(s)
# 笛卡爾積(cross join) 沒有連接條件 會針對表1和表2的每條數據做連接
hive> select a.id,a.name,b.age from a cross join b;
Warning: Map Join MAPJOIN[7][bigTable=a] in task ‘Stage-3:MAPRED‘ is a cross product
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
1 zhangsan 20
1 zhangsan 30
1 zhangsan 40
2 lisi 20
2 lisi 30
2 lisi 40
3 wangwu 20
3 wangwu 30
3 wangwu 40
Time taken: 29.825 seconds, Fetched: 9 row(s)
3.利用Hive sql實現wordcount:
# 創建表 加載測試數據
hive> create table hive_wc(sentence string);
OK
Time taken: 0.149 seconds
[hadoop@hadoop000 ~]$ cat hive-wc.txt
hello,world,welcome
hello,welcome
hive> load data local inpath ‘/home/hadoop/hive-wc.txt‘ into table hive_wc;
Loading data to table default.hive_wc
Table default.hive_wc stats: [numFiles=1, totalSize=34]
OK
Time taken: 0.729 seconds
hive> select * from hive_wc;
OK
hello,world,welcome
hello,welcome
Time taken: 0.13 seconds, Fetched: 2 row(s)
# 獲取每個單詞 利用split分割
hive> select split(sentence,",") from hive_wc;
OK
["hello","world","welcome"]
["hello","welcome"]
Time taken: 0.163 seconds, Fetched: 2 row(s)
# explode把數組轉成多行 結合split使用如下
hive> select explode(split(sentence,",")) from hive_wc;
OK
hello
world
welcome
hello
welcome
Time taken: 0.068 seconds, Fetched: 5 row(s)
# 做group by操作 一條語句即可實現wordcount統計
hive> select word, count(1) as c
> from (select explode(split(sentence,",")) as word from hive_wc) t
> group by word ;
Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
Total jobs = 1
...
OK
hello 2
welcome 2
world 1
Time taken: 34.168 seconds, Fetched: 3 row(s)
Hive內部函數簡介及查詢語法