hive count distinct

阿新 • Published: 2019-01-25

select count(distinct user_id) from dm_user where ds=20150701;

With count(distinct ...), all of the data is shuffled to a single reducer, which causes severe data skew on that reducer.

The optimized version is:

set mapred.reduce.tasks=50;

select count(*) from

(select user_id from dm_user where ds=20150701 group by user_id)t;
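To see why the rewrite helps, here is a toy Python model of the shuffle, not Hive itself. It assumes rows are routed to reducers by a hash of the key and ignores map-side aggregation; the data set, `NUM_REDUCERS`, and the function names are made up for illustration.

```python
from collections import Counter

NUM_REDUCERS = 50  # mirrors set mapred.reduce.tasks=50

def reducer_for(user_id: int) -> int:
    """Toy partitioner: route a key to a reducer (assumed hash partitioning)."""
    return hash(user_id) % NUM_REDUCERS

def count_distinct_single_reducer(user_ids):
    """Model of count(distinct user_id): one reducer receives every row."""
    rows_seen = len(user_ids)
    return len(set(user_ids)), rows_seen

def count_via_group_by(user_ids):
    """Model of the rewrite: group by spreads keys over many reducers,
    then the outer count(*) only adds up the per-reducer group counts."""
    load = Counter()                            # rows received per reducer
    groups = [set() for _ in range(NUM_REDUCERS)]
    for uid in user_ids:
        r = reducer_for(uid)
        load[r] += 1
        groups[r].add(uid)
    total = sum(len(g) for g in groups)         # the outer count(*)
    return total, max(load.values())

# Hypothetical data: 100,000 rows covering 10,000 distinct users.
user_ids = [i % 10_000 for i in range(100_000)]

distinct_a, heaviest_a = count_distinct_single_reducer(user_ids)
distinct_b, heaviest_b = count_via_group_by(user_ids)
print(distinct_a, heaviest_a)   # distinct count, rows on the busiest reducer
print(distinct_b, heaviest_b)
```

Both approaches return the same distinct count, but in this model the busiest reducer of the group by version handles only a fraction of the rows that the single count(distinct) reducer has to process.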

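order by performs a global sort, so it runs on a single reducer.

sort by sorts within each reducer; distribute by splits rows across reducers according to the given column(s).

With distribute by, rows are first divided among the reducers and then sorted inside each reducer.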

-- One reducer: very slow on large data (from clause omitted)
select year, temperature
order by year asc, temperature desc
limit 100;

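Replace it with the following, keeping in mind that the output is then ordered within each reducer rather than globally: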

-- Multiple reducers: fast on large data (from clause omitted)
select year, temperature
distribute by year
sort by year asc, temperature desc
limit 100;

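order by (global sort)

order by sorts the entire input globally, so it uses only one reducer (multiple reducers cannot guarantee a global order).

Because there is only one reducer, the computation takes a long time when the input is large.

When hive.mapred.mode=strict, order by must be accompanied by a limit clause; the purpose is to reduce the amount of data the reducer has to handle.

For example, with limit 100 and 50 map tasks, the reducer receives at most 100*50 = 5,000 rows.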
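The arithmetic can be checked with a small Python sketch. It is only a model of the idea that each map task forwards its own top 100 rows; the random input, the split sizes, and the helper name `map_side_top_n` are assumptions for illustration.

```python
import heapq
import random

LIMIT = 100
NUM_MAPPERS = 50

def map_side_top_n(rows, n=LIMIT):
    """Each map task forwards only its own n smallest rows."""
    return heapq.nsmallest(n, rows)

random.seed(0)  # fixed seed so the run is repeatable
# Hypothetical input: 50 map tasks with 10,000 integer rows each.
splits = [[random.randrange(1_000_000) for _ in range(10_000)]
          for _ in range(NUM_MAPPERS)]

# The single reducer only sees what the map tasks forwarded.
reducer_input = [row for split in splits for row in map_side_top_n(split)]
result = heapq.nsmallest(LIMIT, reducer_input)

all_rows = [row for split in splits for row in split]
print(len(reducer_input))                    # rows reaching the reducer
print(result == sorted(all_rows)[:LIMIT])    # same answer as a full sort
```

The reducer handles 5,000 rows instead of 500,000, and the final 100 rows still match a full global sort because every row in the global top 100 is also in its own map task's top 100.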

distribute by (similar to bucketing)

Rows are partitioned by the column(s) named in distribute by and written to different reducer output files.
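A minimal Python sketch of this partitioning follows. It assumes a simple hash-modulo partitioner, which is a stand-in and not necessarily the exact function Hive uses; the rows and the name `distribute_by` are hypothetical.

```python
from collections import defaultdict

NUM_REDUCERS = 3

def distribute_by(rows, key, num_reducers=NUM_REDUCERS):
    """Assign each row to a reducer from the hash of its key column."""
    outputs = defaultdict(list)
    for row in rows:
        outputs[hash(row[key]) % num_reducers].append(row)
    return dict(outputs)

# Hypothetical (year, temperature) rows.
rows = [
    {"year": 1949, "temperature": 111},
    {"year": 1950, "temperature": 22},
    {"year": 1949, "temperature": 78},
    {"year": 1950, "temperature": 0},
    {"year": 1951, "temperature": -11},
]

outputs = distribute_by(rows, "year")
for reducer, part in sorted(outputs.items()):
    print(reducer, [(r["year"], r["temperature"]) for r in part])
```

All rows that share a year land in the same reducer output, but the rows inside each output keep their arrival order; nothing is sorted yet.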

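sort by (similar to sorting within a bucket)

sort by is not a global sort; the sorting is completed before the data enters the reducer.

Therefore, if you sort with sort by and set mapred.reduce.tasks>1, sort by only guarantees that each reducer's output is sorted; it does not guarantee a global order.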
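The difference between per-reducer order and global order can be illustrated with a short Python sketch. The round-robin split below is only a stand-in for however rows end up on reducers, and the values and helper names are hypothetical.

```python
NUM_REDUCERS = 2

def sort_by(rows, num_reducers=NUM_REDUCERS):
    """Spread rows over reducers (round-robin as a stand-in) and
    sort inside each reducer independently."""
    parts = [rows[i::num_reducers] for i in range(num_reducers)]
    return [sorted(part) for part in parts]

def is_sorted(values):
    """True when the values are in non-decreasing order."""
    return all(a <= b for a, b in zip(values, values[1:]))

temperatures = [78, 111, 0, 22, -11, 35]   # hypothetical values
parts = sort_by(temperatures)
combined = [t for part in parts for t in part]   # outputs read back to back

print(parts)
print([is_sorted(p) for p in parts], is_sorted(combined))
```

Each reducer's output is sorted on its own, yet reading the outputs one after another does not give a globally sorted sequence.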

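cluster by

cluster by provides the functionality of distribute by and also that of sort by.

However, the sort is always ascending; you cannot specify asc or desc.

For this reason, cluster by is commonly regarded as distribute by + sort by on the same column(s).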
