Usage of DISTINCT in Hive
阿新 • Published: 2019-01-05
In Hive, DISTINCT removes duplicate rows, and in some situations it behaves the same as GROUP BY.
Let's test some of DISTINCT's behavior. First, create a test table:
create table test.trip_tmp(
    id int,
    user_id int,
    salesman_id int,
    huose_id int
);
Insert some sample data:
insert into test.trip_tmp values(1, 2, 3, 3);
insert into test.trip_tmp values(1, 2, 3, 3);
insert into test.trip_tmp values(2, 2, 3, 3);
insert into test.trip_tmp values(3, 2, 3, 3);
insert into test.trip_tmp values(4, 2, 5, 3);
insert into test.trip_tmp values(6, 3, 3, 3);
insert into test.trip_tmp values(5, 4, 2, 3);
insert into test.trip_tmp values(5, 2, 3, 3);
insert into test.trip_tmp values(6, 2, 5, 3);
insert into test.trip_tmp values(5, 2, 3, 3);
insert into test.trip_tmp values(5, 2, 5, 3);
View all the rows in the table:
select * from test.trip_tmp;
OK
1 2 3 3
1 2 3 3
5 2 5 3
2 2 3 3
3 2 3 3
4 2 5 3
6 3 3 3
5 4 2 3
5 2 3 3
6 2 5 3
5 2 3 3
Time taken: 0.277 seconds, Fetched: 11 row(s)
Deduplicate on all columns of the table:
select distinct id, user_id, salesman_id, huose_id from test.trip_tmp;
OK
1 2 3 3
2 2 3 3
3 2 3 3
4 2 5 3
5 2 3 3
5 2 5 3
5 4 2 3
6 2 5 3
6 3 3 3
Time taken: 13.142 seconds, Fetched: 9 row(s)
As shown, DISTINCT removed the rows that were duplicated across all of the selected columns.
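When every selected column should be deduplicated, the same result can be obtained with GROUP BY. A minimal equivalent sketch:

```sql
-- Full-row deduplication via GROUP BY; returns the same 9 rows
-- as the DISTINCT query above
select id, user_id, salesman_id, huose_id
from test.trip_tmp
group by id, user_id, salesman_id, huose_id;
```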
In Hive, DISTINCT must appear at the very beginning of the SELECT list; a column name cannot come before it, or the query fails to parse:
select huose_id, distinct id, user_id, salesman_id from test.trip_tmp;
NoViableAltException(96@[80:1: selectItem : ( ( tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( KW_AS )? identifier ) | ( KW_AS
LPAREN identifier ( COMMA identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );]) at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA13.specialStateTransition(HiveParser_SelectClauseParser.java:4625)
at org.antlr.runtime.DFA.predict(DFA.java:80)
at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:1616)
at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1177)
at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:951)
at org.apache.hadoop.hive.ql.parse.HiveParser.selectClause(HiveParser.java:42192)
at org.apache.hadoop.hive.ql.parse.HiveParser.atomSelectStatement(HiveParser.java:36852)
at org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:37119)
at org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:36765)
at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:35954)
at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:35842)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:2285)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1334)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:208)
at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:77)
at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:70)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:468)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
FAILED: ParseException line 1:17 cannot recognize input near 'distinct' 'id' ',' in selection target
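To fix the query, move DISTINCT to the front of the SELECT list, where it applies to the combination of all the listed columns (the columns are simply reordered here for illustration):

```sql
-- DISTINCT must lead the select list; it then deduplicates over
-- the combination of all listed columns
select distinct huose_id, id, user_id, salesman_id from test.trip_tmp;
```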
DISTINCT can also be written like this, but the result is roughly the same as listing all the columns individually:
select distinct (id, user_id, huose_id), salesman_id from test.trip_tmp;
OK
{"col1":1,"col2":2,"col3":3} 3
{"col1":2,"col2":2,"col3":3} 3
{"col1":3,"col2":2,"col3":3} 3
{"col1":4,"col2":2,"col3":3} 5
{"col1":5,"col2":2,"col3":3} 3
{"col1":5,"col2":2,"col3":3} 5
{"col1":5,"col2":4,"col3":3} 2
{"col1":6,"col2":2,"col3":3} 5
{"col1":6,"col2":3,"col3":3} 3
Time taken: 9.201 seconds, Fetched: 9 row(s)
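The parenthesized column list appears to be interpreted as a struct, which is why the output is rendered as JSON objects with auto-generated field names `col1`/`col2`/`col3`. Assuming that reading is correct, the query should be equivalent to constructing the struct explicitly:

```sql
-- Presumed equivalent: build the struct explicitly with struct();
-- deduplication is then over (struct, salesman_id)
select distinct struct(id, user_id, huose_id), salesman_id
from test.trip_tmp;
```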
DISTINCT cannot be combined with an aggregate function in the same SELECT list; doing so raises an error:
select distinct id, user_id, salesman_id, count(huose_id) from test.trip_tmp;
FAILED: SemanticException [Error 10128]: Line 1:42 Not yet supported place for UDAF 'count'
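To combine deduplication with aggregation, use GROUP BY instead. A sketch that counts huose_id per distinct (id, user_id, salesman_id) combination:

```sql
-- GROUP BY replaces the unsupported "distinct + UDAF" select list
select id, user_id, salesman_id, count(huose_id)
from test.trip_tmp
group by id, user_id, salesman_id;
```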
However, DISTINCT can be used inside an aggregate function:
select count(distinct id) from test.trip_tmp;
OK
6
Time taken: 4.775 seconds, Fetched: 1 row(s)
Finally, prefer GROUP BY over DISTINCT whenever you can: GROUP BY generally performs better, and the difference becomes noticeable as the data volume grows.
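For example, `count(distinct id)` can be rewritten with GROUP BY in a subquery; on large tables this rewrite can avoid funneling all data through a single reducer:

```sql
-- GROUP BY rewrite of: select count(distinct id) from test.trip_tmp;
-- should also return 6
select count(*)
from (select id from test.trip_tmp group by id) t;
```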