sklearn Machine Learning in Practice: KNN
KNN Classification
KNN is a lazy learning model, also known as an instance-based learning model: it does essentially no work at training time and defers the computation to prediction time.
Simple linear regression, by contrast, is an eager learning model: the training phase consumes compute, but prediction is cheap.
The first step is to binarize the labels (for a multi-class task, consider one-hot encoding instead).
from sklearn.preprocessing import LabelBinarizer
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

X_train = np.array([[158,64],[170,86],[183,84],[191,80],[155,49],[163,59],[180,67],[158,54],[170,67]])
y_train = np.array(['male','male','male','male','femal','femal','femal','femal','femal'])
# print(y_train.shape)
lb = LabelBinarizer()
y_train_binarized = lb.fit_transform(y_train)
# print(y_train_binarized.shape)
print(y_train_binarized)

[[1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]]
Note that sklearn.preprocessing is the data preprocessing toolkit; here LabelBinarizer is used to binarize the labels. The snippet below uses the same data as above except that the last label has been modified:
from sklearn.preprocessing import LabelBinarizer
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

X_train = np.array([[158,64],[170,86],[183,84],[191,80],[155,49],[163,59],[180,67],[158,54],[170,67]])
y_train = np.array(['male','male','male','male','femal','femal','femal','femal','femaal'])
# print(y_train.shape)
lb = LabelBinarizer()
y_train_binarized = lb.fit_transform(y_train)
# print(y_train_binarized.shape)
print(y_train_binarized)

[[0 0 1]
 [0 0 1]
 [0 0 1]
 [0 0 1]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [1 0 0]]
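With three distinct labels, LabelBinarizer switches to a one-hot layout. Why that column order? A quick check (continuing with the lb object fitted above) shows that the columns follow lb.classes_, which is sorted alphabetically:

# Column order follows lb.classes_ (sorted alphabetically)
print(lb.classes_)  # ['femaal' 'femal' 'male']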
A note here: for a scaler, fit computes the statistics used for feature scaling, such as the maximum and minimum; in the earlier simple linear regression, by contrast, fit was a method of LinearRegression that trained the model. The usual pattern is to call fit on the training set only, then call transform separately on the training and test sets.
You can think of it as: fit learns, transform applies the learned conversion.
Typical usage:
from sklearn.preprocessing import StandardScaler

scaler_ss = StandardScaler()
# training set: fit the scaler and transform in one step
new_train_x = scaler_ss.fit_transform(train_x)
# test set: transform only, reusing the statistics learned from the training set
new_test_x = scaler_ss.transform(test_x)
If you call fit_transform on both the training and test sets, the test set's statistics (its max and min, say) will likely differ from the training set's, so the two end up on inconsistent scales.
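To make that concrete, here is a minimal sketch with made-up numbers (using MinMaxScaler, since the max/min effect is easy to see there):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_x = np.array([[0.0], [10.0]])  # training range: 0 to 10
test_x = np.array([[5.0], [20.0]])   # test range: 5 to 20

scaler = MinMaxScaler().fit(train_x)
print(scaler.transform(test_x))              # [[0.5] [2. ]] -- scaled with the training min/max
print(MinMaxScaler().fit_transform(test_x))  # [[0.] [1.]]   -- a different, incompatible scale

Back to the KNN classifier; the full snippet: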
from sklearn.preprocessing import LabelBinarizer
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

X_train = np.array([[158,64],[170,86],[183,84],[191,80],[155,49],[163,59],[180,67],[158,54],[170,67]])
y_train = np.array(['male','male','male','male','femal','femal','femal','femal','femal'])
# binarize the string labels to 0/1
lb = LabelBinarizer()
y_train_binarized = lb.fit_transform(y_train)
# fit a KNN classifier with k=3
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train_binarized.reshape(-1))
# predict for a single sample; reshape(1,-1) because predict expects a 2D array
prediction = clf.predict(np.array([155,70]).reshape(1,-1))[0]
# map the 0/1 prediction back to the original string label
predict_label = lb.inverse_transform(prediction)
array(['femal'], dtype='<U5')
Accuracy is usually measured with accuracy_score, which is easy to understand: it simply checks, position by position, whether each prediction matches the ground truth.
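As a minimal sketch (toy values):

from sklearn.metrics import accuracy_score

y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
print(accuracy_score(y_true, y_pred))  # 0.75 -- 3 of the 4 positions match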
So how should recall and precision be understood?
from sklearn.metrics import recall_score
x_true = [1,1,1,1,1,1,1,1]
x_predict = [1,1,1,1,1,1,0,0]
print(recall_score(x_true,x_predict))
0.75
from sklearn.metrics import precision_score
x_true = [1,1,1,1,1,1,1,1]
x_predict = [1,1,1,1,1,1,0,0]
print(precision_score(x_true,x_predict))
1.0
In both cases you count the positions where the two sequences are both 1, then divide by the total number of 1s on one side. The difference: precision divides by the 1s in the predictions (x_predict, the bottom line), while recall divides by the 1s in the ground truth (x_true, the top line). That matches the intuition: precision asks "of everything I predicted positive, how much was right?", while recall asks "of everything actually positive, how much did I find?".
This corresponds to the precision and recall formulas in the watermelon book (Zhou Zhihua's Machine Learning):
$P = \frac{TP}{TP+FP}$
$R = \frac{TP}{TP+FN}$
Here TP (True Positive) is the count of positions where both sequences are 1; FP (False Positive) is the count of positions predicted positive but actually negative; FN (False Negative) is the count of positions actually positive but predicted negative.
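To connect the formulas with the toy example above, a quick sketch using sklearn's confusion_matrix (rows are truth, columns are prediction):

from sklearn.metrics import confusion_matrix

x_true = [1, 1, 1, 1, 1, 1, 1, 1]
x_predict = [1, 1, 1, 1, 1, 1, 0, 0]
tn, fp, fn, tp = confusion_matrix(x_true, x_predict, labels=[0, 1]).ravel()
print(tp, fp, fn)      # 6 0 2
print(tp / (tp + fp))  # precision P = 1.0
print(tp / (tp + fn))  # recall    R = 0.75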
KNN Regression
Does anyone actually use KNN for regression, though?
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# features: [height, gender], target: weight
X_train = np.array([[158,1],[170,1],[183,1],[191,1],[155,0],[163,0],[180,0],[158,0],[170,0]])
y_train = [64,86,84,80,49,59,67,54,67]
X_test = np.array([[168,1],[180,1],[160,0],[169,0]])
y_test = [65,96,52,67]

# KNN regression: predict the mean target of the k=3 nearest neighbors
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_train, y_train)
predictions = reg.predict(X_test)
print("R2_score:")
print(r2_score(y_test, predictions))
print("MAE:")
print(mean_absolute_error(y_test, predictions))
print("MSE:")
print(mean_squared_error(y_test, predictions))
R2_score:
0.6290565226735438
MAE:
8.333333333333336
MSE:
95.8888888888889
R² doesn't feel all that commonly used either; I looked at the formula but it didn't really stick.
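For reference, the formulas behind the three metrics printed above (matching sklearn's defaults):
$MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$
$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$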
StandardScaler will come up later, so it is not demonstrated here.
This section mainly used LabelBinarizer for binary label conversion; if you mistakenly feed it three distinct values, it still runs, but the result may not be what you expect.
We have now seen the main workflow (a minimal end-to-end sketch follows the list):
- Choose a model, e.g. LinearRegression
- Possibly use preprocessing tools such as MinMaxScaler to prepare the data
- Instantiate the model, then fit, then possibly transform
- Then, typically, predict
- Finally pick a suitable metric to judge the fit, e.g. accuracy_score
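As a minimal sketch tying these steps together (reusing this section's toy data; chaining via a Pipeline is one common way to do it, not the only one):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X_train = np.array([[158,64],[170,86],[183,84],[191,80],[155,49],[163,59],[180,67],[158,54],[170,67]])
y_train = np.array(['male','male','male','male','femal','femal','femal','femal','femal'])

# preprocessing + model combined into one estimator
pipe = Pipeline([('scale', MinMaxScaler()),
                 ('knn', KNeighborsClassifier(n_neighbors=3))])
pipe.fit(X_train, y_train)  # fit scales the data, then trains the classifier
print(accuracy_score(y_train, pipe.predict(X_train)))  # accuracy on the training data

Note that KNeighborsClassifier can consume string labels directly, so the explicit LabelBinarizer step is optional in this sketch.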