Machine Learning: Classification 3-1 (KNN Algorithm)
阿新 • Published: 2022-03-15
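Predicting whether a customer will buy a new car model with KNN
Main steps:
- 1. Import packages
- 2. Import the dataset
- 3. Data preprocessing
- 3.1 Check for missing values
- 3.2 Create the independent and dependent variables
- 3.3 Check whether the classes are balanced
- 3.4 Split the data into training and test sets
- 3.5 Feature scaling
- 4. Build KNN models with different parameters
- 4.1 Model 1: build and train a KNN model
- 4.1.1 Build and train the KNN model
- 4.1.2 Predict on the test set
- 4.1.3 Generate the confusion matrix
- 4.1.4 Visualize the test-set predictions
- 4.1.5 Evaluate model performance
- 4.2 Model 2: build and train a KNN model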
1. Import packages
In [2]:# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
2. Import the dataset
In [3]:# Import the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
dataset
Out[3]:
| | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|
| 0 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 15804002 | Male | 19 | 76000 | 0 |
| ... | ... | ... | ... | ... | ... |
| 395 | 15691863 | Female | 46 | 41000 | 1 |
| 396 | 15706071 | Male | 51 | 23000 | 1 |
| 397 | 15654296 | Female | 50 | 20000 | 1 |
| 398 | 15755018 | Male | 36 | 33000 | 0 |
| 399 | 15594041 | Female | 49 | 36000 | 1 |
400 rows × 5 columns
3. Data preprocessing
3.1 Check for missing values
In [4]:# Check for missing values
null_df = dataset.isnull().sum()
null_df
Out[4]:
User ID 0
Gender 0
Age 0
EstimatedSalary 0
Purchased 0
dtype: int64
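3.2 Create the independent and dependent variables
So that the classification result can be plotted in two dimensions, only the Age and EstimatedSalary columns are used as independent variables.
In [5]:# Create the independent and dependent variables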
X = dataset.iloc[:, [2, 3]].values
X[:5, :]
Out[5]:
array([[ 19, 19000],
[ 35, 20000],
[ 26, 43000],
[ 27, 57000],
[ 19, 76000]], dtype=int64)
In [6]:
y = dataset.iloc[:, 4].values
y[:5]
Out[6]:
array([0, 0, 0, 0, 0], dtype=int64)
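3.3 Check whether the classes are balanced
In [7]:# Check whether the classes are balanced
sample_0 = sum(dataset['Purchased']==0)
sample_1 = sum(dataset['Purchased']==1)
print('Share of samples that did not buy: %.2f' %(sample_0/(sample_0 + sample_1)))
Share of samples that did not buy: 0.64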
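The same balance check can be written more compactly with pandas. The sketch below is only an illustration: the `purchased` Series is a hypothetical toy stand-in for `dataset['Purchased']`, not the real data.

```python
import pandas as pd

# Hypothetical toy labels standing in for dataset['Purchased'] (not the real data).
purchased = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0], name='Purchased')

# normalize=True returns class proportions instead of raw counts.
class_share = purchased.value_counts(normalize=True).sort_index()
print(class_share)

# Share of the largest class; values far above 0.5 indicate imbalance.
majority_share = class_share.max()
print('Majority-class share: %.2f' % majority_share)
```

One call reports every class at once, which helps when there are more than two labels.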
3.4 Split the data into training and test sets
In [8]:# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(300, 2)
(100, 2)
(300,)
(100,)
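Because the classes are not perfectly balanced, one optional variation is a stratified split, which keeps the class ratio the same in both subsets. This is a sketch on hypothetical toy arrays (`X_toy`, `y_toy`), not the split used in this article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 2 features, 64% class 0 (not the real dataset).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = np.array([0] * 64 + [1] * 36)

# stratify=y_toy asks for the same class proportions in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy)

print(X_tr.shape, X_te.shape)
print('class-0 share in train: %.2f, in test: %.2f'
      % ((y_tr == 0).mean(), (y_te == 0).mean()))
```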
3.5 Feature scaling
In [9]:# Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
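Scaling matters for KNN because distances would otherwise be dominated by the salary column, whose values are thousands of times larger than age. The sketch below, on hypothetical toy rows, shows what `fit` and `transform` compute: the mean and standard deviation come from the training set only and are then reused for the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy [Age, EstimatedSalary] rows (illustrative values only).
train = np.array([[20.0, 20000.0], [30.0, 40000.0], [40.0, 60000.0]])
test = np.array([[30.0, 80000.0]])

scaler = StandardScaler().fit(train)

# Manual equivalent: z = (x - mean) / std, using training statistics only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler uses
manual_test = (test - mean) / std

print(scaler.transform(test))
print(manual_test)
```

Fitting the scaler on the test set as well would leak test-set statistics into preprocessing, which is why only `transform` is applied to `X_test`.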
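4. Build KNN models with different parameters
4.1 Model 1: build and train a KNN model
4.1.1 Build and train the KNN model
In [10]:# Build KNN models with different parameters
# Model 1: build and train a KNN model (n_neighbors = 5, weights='uniform', metric = 'minkowski', p = 2)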
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, weights='uniform', metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)
Out[10]:
KNeighborsClassifier()
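To make the roles of `n_neighbors` and `p` concrete, here is a minimal from-scratch sketch of KNN prediction with uniform weights. The helper `knn_predict` and the toy points are hypothetical and only illustrate the idea; they are not how scikit-learn implements it internally.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, n_neighbors=5, p=2):
    """Predict one label by uniform majority vote among the nearest neighbours."""
    # Minkowski distance: (sum |a - b|^p)^(1/p); p=2 is Euclidean, p=1 is Manhattan.
    dists = (np.abs(X_train - x_query) ** p).sum(axis=1) ** (1.0 / p)
    # Indices of the n_neighbors closest training points.
    nearest = np.argsort(dists)[:n_neighbors]
    # Count the labels among those neighbours and return the most frequent one.
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

# Hypothetical toy points in scaled feature space (not the real dataset).
X_toy = np.array([[-1.0, -1.0], [-0.8, -1.2], [-1.1, -0.7],
                  [1.0, 1.0], [0.9, 1.3], [1.2, 0.8]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([-0.9, -0.9]), n_neighbors=3)
pred_b = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), n_neighbors=3)
print(pred_a, pred_b)
```

With `weights='distance'`, each neighbour's vote would instead be weighted by the inverse of its distance, so closer points count more.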
4.1.2 Predict on the test set
In [11]:# Predict on the test set
y_pred = classifier.predict(X_test)
y_pred[:5]
Out[11]:
array([0, 0, 0, 0, 0], dtype=int64)
4.1.3 Generate the confusion matrix
In [12]:# Generate the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
[[64 4]
[ 3 29]]
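In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes. As a short sketch, the matrix printed above can be unpacked into its four cells and turned into the usual metrics; the variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predicted classes.
cm = np.array([[64, 4],
               [3, 29]])

# For a binary problem, ravel() yields the cells in this order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted buyers, how many actually bought
recall = tp / (tp + fn)      # of the actual buyers, how many were found

print('accuracy=%.2f precision=%.2f recall=%.2f' % (accuracy, precision, recall))
```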
4.1.4 Visualize the test-set predictions
In [13]:# Visualize the test-set predictions
from matplotlib.colors import ListedColormap
plt.figure()
X_set, y_set = X_test, y_test
# Build a dense grid over the (scaled) feature space and colour it by predicted class
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('pink', 'limegreen')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# Overlay the test points, one colour per true class
for i, j in enumerate([0,1]):
    print(str(i)+"da"+str(j))
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j,1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('KNN (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
0da0
1da1
In [14]:
X_set[y_set == 0,1]
Out[14]:
array([ 0.50496393, -0.5677824 , 0.1570462 , 0.27301877, -0.5677824 ,
-1.43757673, -1.58254245, -0.04590581, -0.77073441, -0.59677555,
-0.42281668, -0.42281668, 0.21503249, 0.47597078, 1.37475825,
0.21503249, 0.44697764, -1.37959044, -0.65476184, -0.53878926,
-1.20563157, 0.50496393, 0.30201192, -0.21986468, 0.47597078,
0.53395707, -0.48080297, -0.33583725, -0.50979612, 0.33100506,
-0.77073441, -1.03167271, 0.53395707, -0.50979612, 0.41798449,
-1.43757673, -0.33583725, 0.30201192, -1.14764529, -1.29261101,
-0.3648304 , 1.31677196, 0.38899135, 0.30201192, -1.43757673,
-1.49556302, 0.18603934, -1.26361786, 0.56295021, -0.33583725,
-0.65476184, 0.01208048, 0.21503249, -0.19087153, 0.56295021,
0.35999821, 0.27301877, -0.27785096, 0.38899135, -0.42281668,
-1.00267957, 0.1570462 , -0.27785096, -0.16187839, -0.62576869,
-1.06066585, 0.41798449, -0.19087153])
In [15]:
np.unique(y_set)
Out[15]:
array([0, 1], dtype=int64)
4.1.5 Evaluate model performance
In [16]:# Evaluate model performance
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
0.93
In [17]:
(cm[0][0]+cm[1][1])/(cm[0][0]+cm[0][1]+cm[1][0]+cm[1][1])
Out[17]:
0.93
4.2 Model 2: build and train a KNN model
In [1]:# Model 2: build and train a KNN model (n_neighbors = 100, weights='distance', metric = 'minkowski', p = 1)
classifier = KNeighborsClassifier(n_neighbors = 100, weights='distance', metric = 'minkowski', p = 1)
classifier.fit(X_train, y_train)
In [19]:
# Predict on the test set
y_pred = classifier.predict(X_test)
y_pred[:5]
Out[19]:
array([0, 0, 0, 0, 0], dtype=int64)
In [20]:
# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
[[63 5]
[ 4 28]]
In [21]:
# Visualize the test-set predictions
plt.figure()
X_set, y_set = X_test, y_test
# Build a dense grid over the (scaled) feature space and colour it by predicted class
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('pink', 'limegreen')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# Overlay the test points, one colour per true class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('KNN (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
In [22]:
# Evaluate model performance
print(accuracy_score(y_test, y_pred))
0.91
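Conclusion:
- As the two models above show, the choice of hyperparameters affects KNN performance: on the same test set, Model 1 reaches an accuracy of 0.93 and Model 2 an accuracy of 0.91.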
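Rather than comparing hyperparameter settings by hand, a cross-validated grid search can try the combinations systematically. The following is a sketch under stated assumptions: it uses synthetic data from `make_classification` in place of Social_Network_Ads.csv, and the candidate values in `param_grid` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 400 samples, 2 informative features.
X_syn, y_syn = make_classification(n_samples=400, n_features=2, n_informative=2,
                                   n_redundant=0, random_state=0)

# Scaling sits inside the pipeline so each CV fold is scaled with its own training statistics.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Candidate values are illustrative; make_pipeline names the step 'kneighborsclassifier'.
param_grid = {
    'kneighborsclassifier__n_neighbors': [3, 5, 11],
    'kneighborsclassifier__weights': ['uniform', 'distance'],
    'kneighborsclassifier__p': [1, 2],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_syn, y_syn)

print(search.best_params_)
print('best CV accuracy: %.3f' % search.best_score_)
```

Selecting hyperparameters by cross-validation on the training data keeps the test set untouched for the final evaluation.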