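The IN pitfall in Hive
Question: why does the result set come back empty when I use NOT IN in Hive?
Note: this is an original post. Please credit the source when reposting. Thanks!
Straight into the lab >>
hive> select * from a;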
OK
1 a1
2 a2
3 a3
Time taken: 0.063 seconds, Fetched: 3 row(s)
hive> select * from b;
OK
1 b1
2 b2
NULL b3
Time taken: 0.063 seconds, Fetched: 3 row(s)
# Match the two tables on id and compute A - B, using a left join
hive> select t1.id,t1.name,t2.name from a t1
> left join b t2 on t1.id = t2.id
> where t2.name is null;
OK
3 a3 NULL
Time taken: 34.123 seconds, Fetched: 1 row(s)
# Match the two tables on id and compute A - B, using NOT IN
hive> select * from a where id not in ( select id from b );
OK
Time taken: 34.123 seconds, Fetched: 0 row(s)
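Now this is strange. Why is the result set empty? That can't be right.
Cause:
In an RDBMS, t1.id IN (select t2.id from b t2) behaves like a semi-join: t1 join b t2 on t1.id = t2.id and t1.id is not null, with each t1 row returned at most once. A NULL in the subquery simply never matches, so IN is safe.
NOT IN is different. id not in (1, 2, NULL) expands to id <> 1 and id <> 2 and id <> NULL, and any comparison with NULL evaluates to NULL (unknown) rather than true. The condition can therefore be false or NULL but never true, and WHERE keeps only rows where it is true. Our Hive is already at 2.0.0, but this is not something a newer version fixes: Hive applies the standard SQL NULL semantics here and does not strip NULLs from the subquery for you. As soon as the subquery returns a single NULL, every row is filtered out. The degenerate case id in (null) likewise never returns anything.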
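The three-valued logic described above can be checked without a cluster. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in, on the assumption that SQLite and Hive both follow the standard SQL NULL rules for NOT IN; the helper name evaluate is made up for this illustration.

```python
import sqlite3

# In-memory SQLite database; scalar expressions need no tables.
# Assumption: SQLite stands in for Hive here, since both apply the
# standard SQL NULL rules to IN / NOT IN.
conn = sqlite3.connect(":memory:")


def evaluate(expr):
    """Evaluate a scalar SQL expression; SQL NULL comes back as None."""
    return conn.execute("SELECT " + expr).fetchone()[0]


no_null = evaluate("3 NOT IN (1, 2)")         # true: 3 is not in the list
matched = evaluate("1 NOT IN (1, 2, NULL)")   # false: 1 is in the list
unknown = evaluate("3 NOT IN (1, 2, NULL)")   # NULL: 3 <> NULL is unknown
only_null = evaluate("3 NOT IN (NULL)")       # NULL as well

print(no_null, matched, unknown, only_null)   # prints: 1 0 None None
```

A WHERE clause keeps a row only when its condition is true, so the two None cases are dropped just like the false one.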
Correct usage:
# Match the two tables on id and compute A - B, using NOT IN with NULLs filtered out of the subquery
hive> select * from a where id not in ( select id from b where id is not null );
OK
3 a3
Time taken: 34.123 seconds, Fetched: 1 row(s)
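To reproduce the whole experiment locally, here is a small sketch that rebuilds tables a and b in an in-memory SQLite database (an assumption: SQLite is used only as a stand-in for Hive) and runs the naive query, the fixed query, and a NOT EXISTS variant. NOT EXISTS is not part of the original experiment; it is included as a commonly used alternative that is not affected by NULLs in b, and it is worth confirming against your own Hive version before relying on it.

```python
import sqlite3

# Assumption: SQLite stands in for Hive; the data mirrors tables a and b above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, name TEXT);
    CREATE TABLE b (id INTEGER, name TEXT);
    INSERT INTO a VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO b VALUES (1, 'b1'), (2, 'b2'), (NULL, 'b3');
""")


def rows(sql):
    """Run a query and return all result rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# Naive NOT IN: the NULL id in b makes the condition unknown for every row.
naive = rows("SELECT * FROM a WHERE id NOT IN (SELECT id FROM b)")

# The fix from this post: filter NULLs out of the subquery.
filtered = rows(
    "SELECT * FROM a WHERE id NOT IN (SELECT id FROM b WHERE id IS NOT NULL)"
)

# Alternative (not from the original post): a correlated NOT EXISTS.
not_exists = rows(
    "SELECT * FROM a t1 WHERE NOT EXISTS (SELECT 1 FROM b t2 WHERE t2.id = t1.id)"
)

print(naive)       # []
print(filtered)    # [(3, 'a3')]
print(not_exists)  # [(3, 'a3')]
```

The filtered NOT IN and the NOT EXISTS form agree with the left join result from earlier: only row 3 of a has no match in b.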
Feel free to try this experiment yourself:
-- returns no rows
hive> select * from a where id not in (null);
OK
Time taken: 3.603 seconds