Machine Learning with sklearn (5): Data Processing (2): Handling Missing Values

阿新 • Published: 2021-06-17

Source: https://www.cnblogs.com/B-Hanan/articles/12774433.html

1 Univariate imputation

import numpy as np
from sklearn.impute import SimpleImputer

help(SimpleImputer):

class SimpleImputer(_BaseImputer): Imputation transformer for completing missing values.

Parameters

missing_values : number, string, np.nan (default) or None

The placeholder for the missing values. All occurrences of missing_values will be imputed.

strategy : string, default='mean'

The imputation strategy.

If "mean", then replace missing values using the mean along each column. Can only be used with numeric data.
If "median", then replace missing values using the median along each column. Can only be used with numeric data.
If "most_frequent", then replace missing using the most frequent value along each column. Can be used with strings or numeric data.
If "constant", then replace missing values with fill_value. Can be used with strings or numeric data. Use strategy="constant" for fixed-value imputation.

fill_value : string or numerical value, default=None

When strategy == "constant", fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and "missing_value" for strings or object data types.
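As a quick illustration of the strategy and fill_value parameters described above, here is a minimal sketch comparing "median" and "constant". The array X_num and the fill value -1.0 are made up for this example and are not from the original post.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data (an assumption for this sketch): 100.0 is an outlier
# that would pull a mean upward but leaves the median untouched.
X_num = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [7.0, np.nan],
                  [4.0, 100.0]])

# "median": each NaN is replaced by the median of the observed values in its column.
median_imp = SimpleImputer(strategy="median")
X_median = median_imp.fit_transform(X_num)

# "constant": every NaN is replaced by the same fixed fill_value.
const_imp = SimpleImputer(strategy="constant", fill_value=-1.0)
X_const = const_imp.fit_transform(X_num)

print(X_median)
print(X_const)
```

With an outlier like the one in the second column, the median is usually a safer choice than the mean, while a constant sentinel keeps the imputed cells easy to spot later.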

imp=SimpleImputer(missing_values=np.nan,strategy='mean')
imp.fit([[1,2],[np.nan,3],[7,6]])

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)
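The call above only fits the imputer. A short sketch of what happens next, assuming a small new array X_new (chosen here for illustration): the learned column means are stored in statistics_ and are reused by transform.

```python
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

# Column means learned from the observed training values:
# column 0 -> (1 + 7) / 2, column 1 -> (2 + 3 + 6) / 3
print(imp.statistics_)

# New data (illustrative) is filled with the training means, not its own means.
X_new = [[np.nan, 2], [6, np.nan], [7, 6]]
X_filled = imp.transform(X_new)
print(X_filled)
```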

## The SimpleImputer class also supports sparse matrices
import scipy.sparse as sp
X=sp.csc_matrix([[1,2],[0,-1],[8,4]])
imp=SimpleImputer(missing_values=-1,strategy='mean')
imp.fit(X)

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=-1, strategy='mean', verbose=0)

X_test=sp.csc_matrix([[-1,2],[6,-1],[7,6]])
print(imp.transform(X_test))

  (0, 0)    3.0
  (1, 0)    6.0
  (2, 0)    7.0
  (0, 1)    2.0
  (1, 1)    3.0
  (2, 1)    6.0

print(imp.transform(X_test).toarray())
[[3. 2.]
 [6. 3.]
 [7. 6.]]

import pandas as pd
df=pd.DataFrame([['a','x'],
                [np.nan,'y'],
                ['a',np.nan],
                ['b','y']],dtype='category')

df

	0	1
0	a	x
1	NaN	y
2	a	NaN
3	b	y

imp=SimpleImputer(strategy='most_frequent')
print(imp.fit_transform(df))

[['a' 'x']
 ['a' 'y']
 ['a' 'y']
 ['b' 'y']]

2 Multivariate feature imputation

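The IterativeImputer class models each feature with missing values as a function of the other features and uses that estimate for imputation.
It works in an iterated round-robin fashion.
At each step, one feature column is designated as the output y and the other feature columns are treated as the inputs X; a regressor is fit on (X, y) for the rows where y is known and is then used to predict the missing values of y.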
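To make the round-robin idea concrete, here is a rough hand-written sketch of the loop. It is not IterativeImputer's actual implementation: it assumes a plain LinearRegression as the per-feature regressor, a column-mean starting fill, and a fixed number of 10 rounds with no convergence check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]], dtype=float)
missing = np.isnan(X)

# Step 0: start from a simple column-mean fill.
filled = np.where(missing, np.nanmean(X, axis=0), X)

for _ in range(10):                      # fixed number of rounds (assumption)
    for j in range(X.shape[1]):          # visit each feature in turn
        if not missing[:, j].any():
            continue
        others = np.delete(filled, j, axis=1)   # current values of the other features
        observed = ~missing[:, j]
        # Fit on rows where feature j was actually observed ...
        model = LinearRegression().fit(others[observed], filled[observed, j])
        # ... and overwrite only the originally missing entries of feature j.
        filled[missing[:, j], j] = model.predict(others[missing[:, j]])

print(filled)
```

Observed entries are never modified; only the cells that were NaN in X are re-estimated on each pass.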

import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imp=IterativeImputer(max_iter=10,random_state=0)
imp.fit([[1,2],[3,6],[4,8],[np.nan,3],[7,np.nan]])

IterativeImputer(add_indicator=False, estimator=None,
                 imputation_order='ascending', initial_strategy='mean',
                 max_iter=10, max_value=None, min_value=None,
                 missing_values=nan, n_nearest_features=None, random_state=0,
                 sample_posterior=False, skip_complete=False, tol=0.001,
                 verbose=0)

imp.transform([[1,2],[3,6],[4,8],[np.nan,3],[7,np.nan]])

array([[ 1.        ,  2.        ],
       [ 3.        ,  6.        ],
       [ 4.        ,  8.        ],
       [ 1.50004509,  3.        ],
       [ 7.        , 14.00004135]])

X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
print(imp.transform(X_test))

[[ 1.00007297  2.        ]
 [ 6.         12.00002754]
 [ 2.99996145  6.        ]]

3 K-nearest neighbors imputation

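The KNNImputer class fills in missing values using the k-nearest neighbors approach. By default, a Euclidean distance metric that supports missing values, nan_euclidean_distances, is used to find the nearest neighbors. Each missing feature is imputed using values from the n_neighbors nearest neighbors that have a value for that feature.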

from sklearn.impute import KNNImputer

help(KNNImputer):

Imputation for completing missing values using k-Nearest Neighbors.

Each sample's missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close.

Parameters:

missing_values : number, string, np.nan or None, default=np.nan

The placeholder for the missing values. All occurrences of missing_values will be imputed.

n_neighbors : int, default=5

Number of neighboring samples to use for imputation.

weights : {'uniform', 'distance'} or callable, default='uniform'

Weight function used in prediction.
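A small sketch of how the weights parameter changes the result. The data is the same toy matrix used in this section; the variable names uniform and distance are just labels for this example.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]

# 'uniform': the two nearest donors contribute equally.
uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)

# 'distance': closer donors get a larger weight (inverse of their distance).
distance = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)

print(uniform[0, 2], distance[0, 2])
```

For the first row, the closer of the two donors has the value 3 and the farther one has 5, so the distance-weighted estimate should fall below the uniform average of 4.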

import numpy as np
from sklearn.impute import KNNImputer
nan = np.nan
X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
print(X)
imputer = KNNImputer(n_neighbors=2, weights="uniform")
imputer.fit_transform(X)

[[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]

array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
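The imputed 4.0 in the first row can be reproduced by hand. The following sketch uses nan_euclidean_distances directly and then averages the two nearest donors; the helper variables (row, col, donors, nearest, manual) are illustrative names, and the procedure is a simplified reading of what KNNImputer does with uniform weights.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]], dtype=float)

# Pairwise distances that skip missing coordinates and rescale by the
# fraction of coordinates that were actually compared.
D = nan_euclidean_distances(X)

row, col = 0, 2                                   # the cell to impute
donors = np.flatnonzero(~np.isnan(X[:, col]))     # rows that have a value for this feature
nearest = donors[np.argsort(D[row, donors])[:2]]  # the 2 closest donors
manual = X[nearest, col].mean()                   # uniform-weight average

print(nearest, manual)
```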

4 Marking imputed values

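The MissingIndicator transformer converts a dataset into a corresponding binary matrix that indicates where missing values are present. This transformation is useful in combination with imputation: when imputing, keeping track of which values were missing can itself be informative.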

from sklearn.impute import MissingIndicator

help(MissingIndicator):

class MissingIndicator(sklearn.base.TransformerMixin, sklearn.base.BaseEstimator)

Binary indicators for missing values.

MissingIndicator(missing_values=nan, features='missing-only', sparse='auto', error_on_new=True)

X = np.array([[-1, -1, 1, 3],
              [4, -1, 0, -1],
              [8, -1, 1, 0]])
indicator = MissingIndicator(missing_values=-1)
mask_missing_values_only = indicator.fit_transform(X)
mask_missing_values_only

array([[ True,  True, False],
       [False,  True,  True],
       [False,  True, False]])

# features_ holds the indices of the columns that contain missing values
indicator.features_

array([0, 1, 3])

# Set the features parameter to 'all' to return indicators for every feature, whether or not it contains missing values
indicator = MissingIndicator(missing_values=-1, features="all")
mask_all = indicator.fit_transform(X)
mask_all

array([[ True,  True, False, False],
       [False,  True, False,  True],
       [False,  True, False, False]])

indicator.features_
# column indices of the returned features

array([0, 1, 2, 3])
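One way to combine imputation with missing-value marking is the add_indicator option that appears in the SimpleImputer repr earlier. A minimal sketch, assuming a small illustrative array X: the output contains the imputed columns followed by one indicator column per feature that had missing values during fit.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data: both columns contain one missing value.
X = np.array([[1, np.nan], [np.nan, 3], [7, 6]], dtype=float)

imp = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imp.fit_transform(X)

# First two columns: mean-imputed values.
# Last two columns: 1.0 where the original value was missing, else 0.0.
print(X_out)

# The fitted MissingIndicator is available as indicator_.
print(imp.indicator_.features_)
```

A downstream model can then use the indicator columns to learn whether "this value was missing" carries signal of its own.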
