Handling Abnormal Values in Hive
阿新 • Published: 2019-01-10
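NULL values
COUNT(col_name) skips every row where col_name is NULL, so to count all log records use COUNT(1).
To operate only on the non-null values of a column, filter with col_name IS NOT NULL rather than a test such as LENGTH(col_name) > 1, because LENGTH(NULL) returns NULL rather than a number.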
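The difference between the two counts can be reproduced on a tiny table. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption: both follow the standard SQL rule that aggregates skip NULL); the table name logs and its rows are hypothetical.

```python
import sqlite3

# SQLite stands in for Hive here (an assumption, not the author's setup).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (col_name TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO logs VALUES (?)",
    [("a",), (None,), ("bc",), (None,)],  # two of the four rows are NULL
)

count_all, count_col, count_not_null = conn.execute(
    """
    SELECT COUNT(1),          -- counts every row
           COUNT(col_name),   -- skips rows where col_name is NULL
           SUM(CASE WHEN col_name IS NOT NULL THEN 1 ELSE 0 END)
    FROM logs
    """
).fetchone()

# LENGTH(NULL) is NULL, which Python receives as None.
length_of_null = conn.execute("SELECT LENGTH(NULL)").fetchone()[0]

print(count_all, count_col, count_not_null, length_of_null)
```

With these four rows, COUNT(1) sees all of them, while COUNT(col_name) and the IS NOT NULL filter both see only the two non-null rows.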
--------------------------------------------------------- Update 2017-06-08 ---------------------------------------------------------
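Beyond the issue above, NULL values also cause trouble in logical statistics. I ran into the following problem in practice.
Problem: given conditions A, B and C, count the rows that satisfy all three, and the rows that fail each individual condition.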
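The original query was:
SELECT
COUNT(1) AS all_user
, SUM(CASE WHEN A AND B AND C THEN 1 ELSE 0 END) AS ok_user
, SUM(CASE WHEN A AND B AND C THEN 0 ELSE 1 END) AS case_user
, SUM(CASE WHEN !A THEN 1 ELSE 0 END) AS not_A
, SUM(CASE WHEN !B THEN 1 ELSE 0 END) AS not_B
, SUM(CASE WHEN !C THEN 1 ELSE 0 END) AS not_C
FROM tb
In theory, not_A + not_B + not_C should be no smaller than case_user, but the total I got was smaller. After some confusion, a more experienced colleague pointed out the cause: NULL values again. When A is NULL, !A is also NULL, so the CASE falls through to ELSE and the row is left out of not_A, even though the same row is counted in case_user.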
The corrected query:
SELECT
COUNT(1) AS all_user
, SUM(CASE WHEN A AND B AND C THEN 1 ELSE 0 END) AS ok_user
, SUM(CASE WHEN A AND B AND C THEN 0 ELSE 1 END) AS case_user
, SUM(CASE WHEN !A OR A IS NULL THEN 1 ELSE 0 END) AS not_A
, SUM(CASE WHEN !B OR B IS NULL THEN 1 ELSE 0 END) AS not_B
, SUM(CASE WHEN !C OR C IS NULL THEN 1 ELSE 0 END) AS not_C
FROM tb
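To see how a NULL condition slips past a plain negation, here is a small sketch run through Python's sqlite3 module. SQLite is only a stand-in for Hive (an assumption), so NOT replaces Hive's ! operator, and the 0/1/NULL rows in tb are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tb (A INTEGER, B INTEGER, C INTEGER)")
conn.executemany(
    "INSERT INTO tb VALUES (?, ?, ?)",
    [
        (1, 1, 1),     # satisfies A AND B AND C
        (None, 1, 1),  # A is NULL
        (0, 1, 1),     # A is false
        (1, None, 1),  # B is NULL
    ],
)

case_user, naive_not_a, fixed_not_a = conn.execute(
    """
    SELECT SUM(CASE WHEN A AND B AND C THEN 0 ELSE 1 END),
           -- NOT NULL evaluates to NULL, so the NULL row lands in ELSE
           SUM(CASE WHEN NOT A THEN 1 ELSE 0 END),
           -- the explicit IS NULL test catches that row
           SUM(CASE WHEN (NOT A) OR A IS NULL THEN 1 ELSE 0 END)
    FROM tb
    """
).fetchone()

print(case_user, naive_not_a, fixed_not_a)
```

Three rows fail the combined condition, yet the naive negation reports only one row for not_A; adding the IS NULL test raises it to two, and the remaining failing row is the one where B is NULL.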
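The queries above only produce a global count over records and say nothing about individual users. If a user has several records, some satisfying A AND B AND C and some not, how do we count users under each condition? Use MIN and MAX: the MIN of the 0/1 flag is 1 only when every record of that user satisfies all three conditions, and the MAX is 1 as soon as one record fails the given condition.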
SELECT
COUNT(1) AS all_user
, SUM(ok) AS all_ok
, SUM(not_A) AS all_not_A
, SUM(not_B) AS all_not_B
, SUM(not_C) AS all_not_C
FROM(
SELECT
userid
, MIN(CASE WHEN A AND B AND C THEN 1 ELSE 0 END) AS ok
, MAX(CASE WHEN !A OR A IS NULL THEN 1 ELSE 0 END) AS not_A
, MAX(CASE WHEN !B OR B IS NULL THEN 1 ELSE 0 END) AS not_B
, MAX(CASE WHEN !C OR C IS NULL THEN 1 ELSE 0 END) AS not_C
FROM tb
GROUP BY userid
) a
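The per-user aggregation can be checked on a few made-up rows. This is a sketch under the same assumption as before: Python's sqlite3 module stands in for Hive, NOT replaces !, and the userid values and flags are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tb (userid INTEGER, A INTEGER, B INTEGER, C INTEGER)")
conn.executemany(
    "INSERT INTO tb VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, 1),     # user 1: one good record ...
        (1, 0, 1, 1),     # ... and one record failing A
        (2, 1, 1, 1),     # user 2: every record is good
        (2, 1, 1, 1),
        (3, 1, None, 1),  # user 3: B is NULL
    ],
)

totals = conn.execute(
    """
    SELECT COUNT(1), SUM(ok), SUM(not_A), SUM(not_B), SUM(not_C)
    FROM (
        SELECT userid,
               -- 1 only if all of the user's records pass
               MIN(CASE WHEN A AND B AND C THEN 1 ELSE 0 END) AS ok,
               -- 1 if at least one record fails the condition
               MAX(CASE WHEN (NOT A) OR A IS NULL THEN 1 ELSE 0 END) AS not_A,
               MAX(CASE WHEN (NOT B) OR B IS NULL THEN 1 ELSE 0 END) AS not_B,
               MAX(CASE WHEN (NOT C) OR C IS NULL THEN 1 ELSE 0 END) AS not_C
        FROM tb
        GROUP BY userid
    ) a
    """
).fetchone()

print(totals)
```

Here only user 2 counts as fully ok, user 1 is flagged under not_A, and user 3 is flagged under not_B because of the NULL.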
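NaN values
Data processing in SQL sometimes yields abnormal results. For example, when a denominator is 0, the query does not raise an error, but the result can come back as NaN. Such rows can be filtered out with the following query:
select col_1, col_2, col_with_nan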
from my_table
where some_conditions
and cast(col_with_nan as String) <> 'NaN';
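One reason a string comparison is a workable filter is that, under IEEE 754, NaN does not compare equal to anything, including itself. The sketch below shows this in plain Python rather than Hive (so it illustrates the floating-point rule, not Hive's own behavior); the rows are hypothetical, and note that Python renders NaN as 'nan' where the query above compares against 'NaN'.

```python
import math

# Hypothetical rows: (col_1, col_2, col_with_nan)
rows = [("a", 1, 0.5), ("b", 2, float("nan")), ("c", 3, 2.0)]

nan = float("nan")

# An equality test never matches NaN, so it cannot be used to find it.
equality_matches = [r for r in rows if r[2] == nan]

# Comparing the string form works, mirroring the cast in the query above.
kept_by_string = [r for r in rows if str(r[2]) != "nan"]

# A dedicated NaN check gives the same result.
kept_by_isnan = [r for r in rows if not math.isnan(r[2])]

print(equality_matches, kept_by_string, kept_by_isnan)
```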