Data Cleaning: Handling Outliers
阿新 • Published: 2021-07-07
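1. Methods for handling outliers:
1). 3σ rule: a value is an outlier if it deviates from the mean by more than 3 standard deviations
2). Box-plot method: a value is an outlier if it is > upper quartile + 1.5 × IQR or < lower quartile - 1.5 × IQR, where IQR = upper quartile - lower quartile
3). Business knowledge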
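Before turning to the real dataset, the first two rules can be tried on a handful of numbers. The following is a minimal sketch using only the standard library so that the arithmetic stays visible; the function names and the `prices` list are hypothetical and chosen purely for illustration.

```python
import statistics


def three_sigma_bounds(values):
    """Return (lower, upper) = mean -/+ 3 population standard deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std, like np.std's default (ddof=0)
    return mean - 3 * std, mean + 3 * std


def iqr_bounds(values):
    """Return (lower, upper) = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR."""
    # method='inclusive' interpolates linearly between data points,
    # which matches the default behaviour of pandas' quantile()
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outliers(values, bounds):
    """Return the values that fall outside the (lower, upper) bounds."""
    lower, upper = bounds
    return [v for v in values if v < lower or v > upper]


# Hypothetical unit prices: mostly small values plus two suspicious ones
prices = [1.25, 1.65, 1.95, 2.10, 2.55, 2.95, 3.75, 4.25, 4.95, 12.50, 649.50]

print('3-sigma bounds:', three_sigma_bounds(prices))
print('3-sigma outliers:', outliers(prices, three_sigma_bounds(prices)))
print('IQR bounds:', iqr_bounds(prices))
print('IQR outliers:', outliers(prices, iqr_bounds(prices)))
```

On this toy list the 3σ rule flags only 649.50, because that single extreme value inflates the standard deviation and widens the bounds, while the box-plot rule also flags 12.50. Quartiles are far less sensitive to a few extreme values, which is one reason to check both rules.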
# Standard-deviation (3σ) rule and box-plot method
import numpy as np
import pandas as pd

# Read InvoiceNo and StockCode as strings so the length and prefix checks below work on every row
online_retail_pd = pd.read_csv(r'online_retail.csv', encoding='ISO-8859-1',
                               dtype={'InvoiceNo': str, 'StockCode': str})
# Drop fully duplicated rows
online_retail_pd.drop_duplicates(inplace=True)
# Drop rows with a missing customer ID
online_retail_pd.dropna(subset=['CustomerID'], inplace=True)

# np.std uses the population standard deviation (ddof=0) by default
price_mean = np.mean(online_retail_pd['UnitPrice'])
price_std = np.std(online_retail_pd['UnitPrice'])

# 3σ rule:
print(f'price mean: {price_mean}, price_std: {price_std}')
price_low_bound = price_mean - 3 * price_std
price_upper_bound = price_mean + 3 * price_std
print(f'price low bound: {price_low_bound}, price upper bound: {price_upper_bound}')
# price mean: 3.47406363979831, price_std: 69.76394820732074
# price low bound: -205.8177809821639, price upper bound: 212.76590826176053
# The mean is small while the standard deviation is very large.
# By the 3σ rule, values below about -205.82 or above about 212.77 are outliers.
# A price cannot be negative, so the lower bound carries no useful information here.

# Box-plot method:
# Upper quartile
price_qu = online_retail_pd['UnitPrice'].quantile(q=0.75)
# Lower quartile
price_qr = online_retail_pd['UnitPrice'].quantile(q=0.25)
print(price_qu, price_qr)
# Interquartile range
price_iqr = price_qu - price_qr
price_max_bound = price_qu + 1.5 * price_iqr
price_min_bound = price_qr - 1.5 * price_iqr
print(f'price low bound: {price_min_bound}, price upper bound: {price_max_bound}')

# Convert CustomerID from float to int
online_retail_pd['CustomerID'] = online_retail_pd['CustomerID'].astype(int)
# Add three helper columns holding the string length of each identifier
online_retail_pd['InvoiceNo_Len'] = online_retail_pd['InvoiceNo'].str.len()
online_retail_pd['StockCode_Len'] = online_retail_pd['StockCode'].str.len()
online_retail_pd['CustomerID_Len'] = online_retail_pd['CustomerID'].astype(str).str.len()
# print(online_retail_pd.groupby('InvoiceNo_Len').size())
# print(online_retail_pd.groupby('StockCode_Len').size())
# print(online_retail_pd.groupby('CustomerID_Len').size())

# 3) Business knowledge: abnormal Quantity / UnitPrice values and cancelled orders
print(online_retail_pd[online_retail_pd['Quantity'] <= 0])
# print(online_retail_pd[online_retail_pd['UnitPrice'] <= 0])
# An invoice number starting with 'C' marks a cancelled order
online_retail_pd['Is_Cancel'] = online_retail_pd['InvoiceNo'].str.startswith('C').astype(int)
print(online_retail_pd.groupby('Is_Cancel').size())
print(len(online_retail_pd))
# Keep only non-cancelled orders with positive quantity and price;
# .copy() avoids SettingWithCopyWarning when the result is modified below
online_retail_pd = online_retail_pd[(online_retail_pd['Is_Cancel'] == 0)
                                    & (online_retail_pd['Quantity'] > 0)
                                    & (online_retail_pd['UnitPrice'] > 0)].copy()
print(len(online_retail_pd))

# Data summary
# Drop the helper columns that are no longer needed
online_retail_pd.drop(['InvoiceNo_Len', 'StockCode_Len', 'CustomerID_Len', 'Is_Cancel'], axis=1, inplace=True)
# Add the sales amount
online_retail_pd['Sale_Amount'] = online_retail_pd['Quantity'] * online_retail_pd['UnitPrice']
online_retail_pd['InvoiceDate'] = pd.to_datetime(online_retail_pd['InvoiceDate'])
# Reset the index: rows were removed above, so the index has gaps.
# reset_index returns a new DataFrame, so the result must be assigned back.
online_retail_pd = online_retail_pd.reset_index(drop=True)
# info() prints its summary directly and returns None, so it is not wrapped in print()
online_retail_pd.info()
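The code above computes the box-plot bounds for UnitPrice but only prints them; it never applies them to the rows. One possible way to use such bounds is sketched below, under the assumption that outliers should first be flagged and then optionally dropped. The helper `iqr_outlier_mask`, its `floor` parameter, and the small `demo` table are hypothetical stand-ins and not part of the original script.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5, floor=None):
    """Return a boolean Series that is True where a value lies outside the IQR fences.

    floor: optional minimum for the lower fence, e.g. 0 for prices,
    which encodes the business rule that a price cannot be negative.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    if floor is not None:
        lower = max(lower, floor)
    return (series < lower) | (series > upper)


# Hypothetical miniature stand-in for the cleaned retail table
demo = pd.DataFrame({
    'StockCode': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'UnitPrice': [0.85, 1.25, 1.65, 2.10, 2.55, 2.95, 3.75, 295.00],
})

# Flag first, so the suspicious rows can be inspected before anything is deleted
demo['Price_Outlier'] = iqr_outlier_mask(demo['UnitPrice'], floor=0)
flagged = demo[demo['Price_Outlier']]
print(flagged)

# Drop the flagged rows and rebuild a gap-free index
kept = demo[~demo['Price_Outlier']].reset_index(drop=True)
print(len(demo), len(kept))
```

Whether flagged rows should actually be removed is a business decision: in retail data a very high unit price may be a legitimate item, a postage charge, or a manual adjustment, so inspecting `flagged` before dropping anything is usually the safer order.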