
Machine Learning: Handling Missing Values

Handling Missing Values

In Python, a missing value is generally represented as NaN, which is short for "not a number".
The following code prints how many missing values the data contains; here the data is stored in a pandas DataFrame:

print(data.isnull().sum())
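
For instance, on a small toy DataFrame (hypothetical data, shown only to illustrate the output format), the call looks like this:

import numpy as np
import pandas as pd

# A toy DataFrame with a few missing values (illustrative only)
data = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                     'b': [np.nan, np.nan, 2.0]})

print(data.isnull().sum())
# a    1
# b    2
# dtype: int64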

There are several ways to handle missing values:

1. Drop columns that contain missing values

data_without_missing_values = original_data.dropna(axis=1)

In most cases, we must drop the same columns from both the training dataset and the test dataset.

cols_with_missing = [col for col in original_data.columns
                     if original_data[col].isnull().any()]
reduced_original_data = original_data.drop(cols_with_missing, axis=1)
reduced_test_data = test_data.drop(cols_with_missing, axis=1)

This approach is suitable when a column contains too many missing values.
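
If you want to make "too many" explicit, a small variation (the 0.5 cutoff below is just an assumption, not a fixed rule) drops only the columns whose fraction of missing values exceeds a threshold:

threshold = 0.5
cols_mostly_missing = [col for col in original_data.columns
                       if original_data[col].isnull().mean() > threshold]
reduced_original_data = original_data.drop(cols_mostly_missing, axis=1)
reduced_test_data = test_data.drop(cols_mostly_missing, axis=1)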

2. Impute the missing values

This approach is usually better than dropping columns outright and tends to produce better models.

from sklearn.preprocessing import Imputer

my_imputer = Imputer()
data_with_imputed_values = my_imputer.fit_transform(original_data)

The default imputation strategy is to fill with the column mean.
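
Imputer also accepts other strategies; for example, a median fill can be more robust when a column has outliers (a minimal sketch, the right strategy depends on the data):

from sklearn.preprocessing import Imputer

# Fill missing values with the column median instead of the mean
my_imputer = Imputer(strategy='median')
data_with_imputed_values = my_imputer.fit_transform(original_data)

Note that in scikit-learn 0.22 and later, Imputer has been removed; its replacement is SimpleImputer in sklearn.impute, which takes the same strategy argument.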

3. An extension of imputation

If the missing values themselves carry important feature information, we need to preserve which values were originally missing by storing that information in boolean columns.

# First make a copy of the original data
new_data = original_data.copy()
# Create new columns that record which entries were missing in each affected column
cols_with_missing = [col for col in new_data.columns if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + '_was_missing'] = new_data[col].isnull()
# Impute
my_imputer = Imputer()
new_data = my_imputer.fit_transform(new_data)
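
One detail worth knowing: fit_transform returns a NumPy array, so the column names (including the new _was_missing columns) are lost. If you prefer to keep a DataFrame, the last line above can be replaced with a small wrapper (an optional convenience, not part of the original code):

import pandas as pd

new_data = pd.DataFrame(my_imputer.fit_transform(new_data),
                        columns=new_data.columns)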

Example

Below is a house-price prediction example that compares the three approaches to handling missing values described above.

import pandas as pd

# Load the data
melb_data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

melb_target = melb_data.Price
melb_predictors = melb_data.drop(['Price'], axis=1)

# For simplicity, train the model using only the numeric columns
melb_numeric_predictors = melb_predictors.select_dtypes(exclude=['object'])
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(melb_numeric_predictors, 
                                                    melb_target,
                                                    train_size=0.7, 
                                                    test_size=0.3, 
                                                    random_state=0)
# Define a function that measures the model's mean absolute error (MAE)
def score_dataset(X_train, X_test, y_train, y_test):
    model = RandomForestRegressor()  # a random forest model is used here
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)
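
RandomForestRegressor is randomized, so the exact MAE numbers reported below will vary slightly from run to run. For a deterministic comparison, a variant like the following can be used instead (the function name and random_state value are my own additions, not part of the original example):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset_reproducible(X_train, X_test, y_train, y_test):
    # Same as score_dataset above, but with a fixed random seed so repeated
    # runs of the three methods give identical scores
    model = RandomForestRegressor(random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)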

Test the first approach: drop the columns that contain missing values.

cols_with_missing = [col for col in X_train.columns 
                                 if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test  = X_test.drop(cols_with_missing, axis=1)
print("Mean Absolute Error from dropping columns with Missing Values:")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))

Mean Absolute Error from dropping columns with Missing Values:
347871.8471099837

The second approach: impute with the column mean.

from sklearn.preprocessing import Imputer

my_imputer = Imputer()
imputed_X_train = my_imputer.fit_transform(X_train)
imputed_X_test = my_imputer.transform(X_test)
print("Mean Absolute Error from Imputation:")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))

Mean Absolute Error from Imputation:
201753.99398441747

The third approach: add extra columns that record which values were missing.

# Copy the original data
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()
# Get the names of the columns that contain missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
# Add new columns that record the missing-value information; each new column holds booleans such as (True, False, True, False)
for col in cols_with_missing:
    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()

# Imputation
my_imputer = Imputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)

print("Mean Absolute Error from Imputation while Track What Was Imputed:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))

Mean Absolute Error from Imputation while Tracking What Was Imputed:
200147.29626743973

Summary

In the example above, methods 2 and 3 perform about the same, but in some cases the difference can be substantial.