Usage of DISTINCT in Hive
阿新 • Published: 2019-01-05
In Hive, DISTINCT removes duplicate rows, and in some situations it behaves the same as GROUP BY.
Let's test some of DISTINCT's behavior. First, create a test table:
create table test.trip_tmp(
    id int,
    user_id int,
    salesman_id int,
    huose_id int
);
Insert some sample data:
insert into test.trip_tmp values(1, 2, 3, 3);
insert into test.trip_tmp values(1, 2, 3, 3);
insert into test.trip_tmp values(2, 2, 3, 3);
insert into test.trip_tmp values(3, 2, 3, 3);
insert into test.trip_tmp values(4, 2, 5, 3);
insert into test.trip_tmp values(6, 3, 3, 3);
insert into test.trip_tmp values(5, 4, 2, 3);
insert into test.trip_tmp values(5, 2, 3, 3);
insert into test.trip_tmp values(6, 2, 5, 3);
insert into test.trip_tmp values(5, 2, 3, 3);
insert into test.trip_tmp values(5, 2, 5, 3);
View all the rows in the table:
select * from test.trip_tmp;
OK
1 2 3 3
1 2 3 3
5 2 5 3
2 2 3 3
3 2 3 3
4 2 5 3
6 3 3 3
5 4 2 3
5 2 3 3
6 2 5 3
5 2 3 3
Time taken: 0.277 seconds, Fetched: 11 row(s)
Deduplicate on all columns of the table:
select distinct id, user_id, salesman_id, huose_id from test.trip_tmp;
OK
1 2 3 3
2 2 3 3
3 2 3 3
4 2 5 3
5 2 3 3
5 2 5 3
5 4 2 3
6 2 5 3
6 3 3 3
Time taken: 13.142 seconds, Fetched: 9 row(s)
As shown, DISTINCT removed the rows that were duplicated across all of the selected columns.
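When every selected column should be deduplicated, the same result can be obtained with GROUP BY. A minimal equivalent sketch:

```sql
-- Full-row deduplication via GROUP BY; returns the same 9 rows
-- as the DISTINCT query above
select id, user_id, salesman_id, huose_id
from test.trip_tmp
group by id, user_id, salesman_id, huose_id;
```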
In Hive, DISTINCT must appear at the very beginning of the SELECT list; a column name cannot come before it, or the query fails to parse:
select huose_id, distinct id, user_id, salesman_id from test.trip_tmp;
NoViableAltException(96@[80:1: selectItem : ( ( tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( KW_AS )? identifier ) | ( KW_AS
LPAREN identifier ( COMMA identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );]) at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA13.specialStateTransition(HiveParser_SelectClauseParser.java:4625)
at org.antlr.runtime.DFA.predict(DFA.java:80)
at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:1616)
at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1177)
at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:951)
at org.apache.hadoop.hive.ql.parse.HiveParser.selectClause(HiveParser.java:42192)
at org.apache.hadoop.hive.ql.parse.HiveParser.atomSelectStatement(HiveParser.java:36852)
at org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:37119)
at org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:36765)
at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:35954)
at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:35842)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:2285)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1334)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:208)
at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:77)
at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:70)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:468)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
FAILED: ParseException line 1:17 cannot recognize input near 'distinct' 'id' ',' in selection target
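To fix the query, move DISTINCT to the front of the SELECT list, where it applies to the combination of all the listed columns (the columns are simply reordered here for illustration):

```sql
-- DISTINCT must lead the select list; it then deduplicates over
-- the combination of all listed columns
select distinct huose_id, id, user_id, salesman_id from test.trip_tmp;
```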
DISTINCT can also be written like this, but the result is roughly the same as listing all the columns individually:
select distinct (id, user_id, huose_id), salesman_id from test.trip_tmp;
OK
{"col1":1,"col2":2,"col3":3} 3
{"col1":2,"col2":2,"col3":3} 3
{"col1":3,"col2":2,"col3":3} 3
{"col1":4,"col2":2,"col3":3} 5
{"col1":5,"col2":2,"col3":3} 3
{"col1":5,"col2":2,"col3":3} 5
{"col1":5,"col2":4,"col3":3} 2
{"col1":6,"col2":2,"col3":3} 5
{"col1":6,"col2":3,"col3":3} 3
Time taken: 9.201 seconds, Fetched: 9 row(s)
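The parenthesized column list appears to be interpreted as a struct, which is why the output is rendered as JSON objects with auto-generated field names `col1`/`col2`/`col3`. Assuming that reading is correct, the query should be equivalent to constructing the struct explicitly:

```sql
-- Presumed equivalent: build the struct explicitly with struct();
-- deduplication is then over (struct, salesman_id)
select distinct struct(id, user_id, huose_id), salesman_id
from test.trip_tmp;
```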
DISTINCT cannot be combined with an aggregate function in the same SELECT list; doing so raises an error:
select distinct id, user_id, salesman_id, count(huose_id) from test.trip_tmp;
FAILED: SemanticException [Error 10128]: Line 1:42 Not yet supported place for UDAF 'count'
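To combine deduplication with aggregation, use GROUP BY instead. A sketch that counts huose_id per distinct (id, user_id, salesman_id) combination:

```sql
-- GROUP BY replaces the unsupported "distinct + UDAF" select list
select id, user_id, salesman_id, count(huose_id)
from test.trip_tmp
group by id, user_id, salesman_id;
```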
However, DISTINCT can be used inside an aggregate function:
select count(distinct id) from test.trip_tmp;
OK
6
Time taken: 4.775 seconds, Fetched: 1 row(s)
Finally, prefer GROUP BY over DISTINCT whenever you can: GROUP BY generally performs better, and the difference becomes noticeable as the data volume grows.
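For example, `count(distinct id)` can be rewritten with GROUP BY in a subquery; on large tables this rewrite can avoid funneling all data through a single reducer:

```sql
-- GROUP BY rewrite of: select count(distinct id) from test.trip_tmp;
-- should also return 6
select count(*)
from (select id from test.trip_tmp group by id) t;
```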