Data Cleaning: Handling Outliers

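1. Methods for identifying outliers:
1). 3σ rule: a value is an outlier if it deviates from the mean by more than 3 standard deviations
2). Box-plot method: a value is an outlier if it is greater than Q3 + 1.5 × IQR or less than Q1 - 1.5 × IQR, where Q3 is the upper quartile, Q1 is the lower quartile, and IQR = Q3 - Q1
3). Business knowledge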
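
The first two rules can be sketched on a small made-up sample before applying them to real data. This is only an illustration: the helper names `sigma_bounds` and `iqr_bounds` and the `prices` values are hypothetical and not part of the original script, and the population standard deviation (`ddof=0`) is assumed so that the result matches the default of `np.std`.

```python
import pandas as pd


def sigma_bounds(values, k=3.0):
    """Return (lower, upper) = mean -/+ k population standard deviations."""
    mean = values.mean()
    std = values.std(ddof=0)  # ddof=0 matches np.std's default
    return mean - k * std, mean + k * std


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) = (Q1 - k * IQR, Q3 + k * IQR)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical toy data: nine ordinary prices and one extreme value.
prices = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 100.0])

sigma_low, sigma_high = sigma_bounds(prices)
iqr_low, iqr_high = iqr_bounds(prices)

sigma_outliers = prices[(prices < sigma_low) | (prices > sigma_high)]
iqr_outliers = prices[(prices < iqr_low) | (prices > iqr_high)]

print(f'3-sigma bounds: {sigma_low}, {sigma_high}')
print(f'IQR bounds: {iqr_low}, {iqr_high}')
print(iqr_outliers.tolist())  # [100.0]
```

On this toy sample the IQR rule flags 100.0, while the 3σ bounds flag nothing, because the extreme value itself inflates the standard deviation. That is one reason to compare both rules rather than rely on a single one.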

# Standard-deviation rule and box-plot method
import numpy as np
import pandas as pd

online_retail_pd = pd.read_csv(r'online_retail.csv', encoding='ISO-8859-1')
# Drop rows that are exact duplicates
online_retail_pd.drop_duplicates(inplace=True)
# Drop rows where the customer ID is missing
online_retail_pd.dropna(subset=['CustomerID'], inplace=True)
price_mean = np.mean(online_retail_pd['UnitPrice'])
price_std = np.std(online_retail_pd['UnitPrice'])

# Standard-deviation (3σ) rule:
print(f'price mean: {price_mean}, price_std: {price_std}')
price_low_bound = price_mean - 3 * price_std
price_upper_bound = price_mean + 3 * price_std
print(f'price low bound: {price_low_bound}, price upper bound: {price_upper_bound}')
# price mean: 3.47406363979831, price_std: 69.76394820732074
# price low bound: -205.8177809821639, price upper bound: 212.76590826176053
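# The mean is small while the standard deviation is very large.
# By the 3σ rule, values below about -205 or above about 212 are outliers.
# In practice a price cannot be negative, so the lower bound carries no useful information here.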


# Box-plot (IQR) method:
# Upper quartile (Q3)
price_qu = online_retail_pd['UnitPrice'].quantile(q=0.75)
# Lower quartile (Q1)
price_qr = online_retail_pd['UnitPrice'].quantile(q=0.25)
print(price_qu, price_qr)
# Interquartile range (IQR)
price_iqr = price_qu - price_qr
price_max_bound = price_qu + 1.5 * price_iqr
price_min_bound = price_qr - 1.5 * price_iqr
print(f'price low bound: {price_min_bound}, price upper bound: {price_max_bound}')

# Convert CustomerID from float to int
online_retail_pd['CustomerID'] = online_retail_pd['CustomerID'].astype(int)

# Add three helper columns holding the string length of each identifier
online_retail_pd['InvoiceNo_Len'] = online_retail_pd['InvoiceNo'].map(len)
online_retail_pd['StockCode_Len'] = online_retail_pd['StockCode'].map(len)
online_retail_pd['CustomerID_Len'] = online_retail_pd['CustomerID'].map(lambda x: len(str(x)))

# print(online_retail_pd.groupby('InvoiceNo_Len').size())
#
# print(online_retail_pd.groupby('StockCode_Len').size())
#
# print(online_retail_pd.groupby('CustomerID_Len').size())

# 3) Business knowledge: abnormal Quantity and UnitPrice values, and cancelled orders

print(online_retail_pd[online_retail_pd['Quantity'] <= 0])
# print(online_retail_pd[online_retail_pd['UnitPrice'] <= 0])
online_retail_pd['Is_Cancel'] = online_retail_pd['InvoiceNo'].apply(lambda x: 1 if x[0] == 'C' else 0)
print(online_retail_pd.groupby('Is_Cancel').size())

print(len(online_retail_pd))
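# Keep only non-cancelled orders with positive quantity and price.
# .copy() avoids SettingWithCopyWarning when columns are modified later.
online_retail_pd = online_retail_pd[(online_retail_pd['Is_Cancel'] == 0)
                                    & (online_retail_pd['Quantity'] > 0)
                                    & (online_retail_pd['UnitPrice'] > 0)].copy()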
print(len(online_retail_pd))

# Data aggregation
# Drop helper columns that are no longer needed
online_retail_pd.drop(['InvoiceNo_Len', 'StockCode_Len', 'CustomerID_Len', 'Is_Cancel'], axis=1, inplace=True)

# Add the sales amount
online_retail_pd['Sale_Amount'] = online_retail_pd['Quantity'] * online_retail_pd['UnitPrice']
online_retail_pd['InvoiceDate'] = pd.to_datetime(online_retail_pd['InvoiceDate'])

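# Reset the index: rows were removed above, so the index now has gaps.
# reset_index returns a new DataFrame, so the result must be assigned back.
online_retail_pd = online_retail_pd.reset_index(drop=True)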
# info() prints its summary directly and returns None, so no print() is needed
online_retail_pd.info()
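
Why the index needs resetting, and why the result of `reset_index` has to be kept, can be seen in a minimal sketch. The toy frame `df` and the names `filtered` and `reindexed` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for the retail data.
df = pd.DataFrame({'Quantity': [5, -1, 3, 0, 2]})

# Filtering keeps the original index labels, leaving gaps.
filtered = df[df['Quantity'] > 0]
print(list(filtered.index))  # [0, 2, 4]

# reset_index returns a new DataFrame; `filtered` itself is not modified.
reindexed = filtered.reset_index(drop=True)
print(list(filtered.index))   # [0, 2, 4]
print(list(reindexed.index))  # [0, 1, 2]
```

With `drop=True` the old index is discarded instead of being kept as an extra column.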