
Machine Learning: Classification 3-1 (the KNN Algorithm)

Predicting whether a customer will buy a new car model with KNN

Main steps:

  • 1. Import packages
  • 2. Import the dataset
  • 3. Data preprocessing
    • 3.1 Check for missing values
    • 3.2 Create the independent and dependent variables
    • 3.3 Check whether the classes are balanced
    • 3.4 Split the data into training and test sets
    • 3.5 Feature scaling
  • 4. Build KNN models with different parameters
    • 4.1 Model 1: build and train a KNN model
      • 4.1.1 Build and train the KNN model
      • 4.1.2 Predict on the test set
      • 4.1.3 Generate the confusion matrix
      • 4.1.4 Visualize predictions on the test set
      • 4.1.5 Evaluate model performance
    • 4.2 Model 2: build and train a KNN model
  Dataset link: https://www.heywhale.com/mw/dataset/622f4f9774c3750018981fee/file

1. Import packages

In [2]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

 

2. Import the dataset

In [3]:
# Import the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
dataset
Out[3]:
      User ID  Gender  Age  EstimatedSalary  Purchased
0    15624510    Male   19            19000          0
1    15810944    Male   35            20000          0
2    15668575  Female   26            43000          0
3    15603246  Female   27            57000          0
4    15804002    Male   19            76000          0
..        ...     ...  ...              ...        ...
395  15691863  Female   46            41000          1
396  15706071    Male   51            23000          1
397  15654296  Female   50            20000          1
398  15755018    Male   36            33000          0
399  15594041  Female   49            36000          1

400 rows × 5 columns

 

3. Data preprocessing

3.1 Check for missing values

In [4]:
# Check for missing values
null_df = dataset.isnull().sum()
null_df
Out[4]:
User ID            0
Gender             0
Age                0
EstimatedSalary    0
Purchased          0
dtype: int64

3.2 Create the independent and dependent variables

To keep the classification results easy to visualize, only the Age and EstimatedSalary columns are used as independent variables.

In [5]:
# Create the independent and dependent variables
X = dataset.iloc[:, [2, 3]].values
X[:5, :]
Out[5]:
array([[   19, 19000],
       [   35, 20000],
       [   26, 43000],
       [   27, 57000],
       [   19, 76000]], dtype=int64)
In [6]:
y = dataset.iloc[:, 4].values
y[:5]
Out[6]:
array([0, 0, 0, 0, 0], dtype=int64)

3.3 Check whether the classes are balanced

In [7]:
# Check whether the classes are balanced
sample_0 = sum(dataset['Purchased']==0)
sample_1 = sum(dataset['Purchased']==1)
print('Non-buyers make up %.2f of all samples' % (sample_0/(sample_0 + sample_1)))
Non-buyers make up 0.64 of all samples
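The same ratio can be read off in one line with pandas; a minimal sketch, assuming the same dataset object as above:

# Share of each class (0 = did not buy, 1 = bought)
print(dataset['Purchased'].value_counts(normalize=True))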

3.4 Split the data into training and test sets

In [8]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(300, 2)
(100, 2)
(300,)
(100,)
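Since the classes are imbalanced (about 64/36, see 3.3), it can be worth preserving that ratio in both splits. A hedged variant using train_test_split's optional stratify argument, which the original notebook does not use:

# Optional: a stratified split keeps the 64/36 class ratio in both sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size = 0.25, random_state = 0, stratify = y)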

3.5 Feature scaling

In [9]:
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
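Scaling matters here because KNN is distance-based: unscaled, EstimatedSalary (tens of thousands) would dominate Age (tens) in every distance computation. StandardScaler transforms each feature as z = (x - mean) / std, with the statistics learned from the training set only (hence fit_transform on X_train but plain transform on X_test). A small sketch to inspect the learned statistics:

# Per-feature statistics learned from X_train (column order: Age, EstimatedSalary)
print(sc.mean_)   # feature means
print(sc.scale_)  # feature standard deviations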

 

4. Build KNN models with different parameters

4.1 Model 1: build and train a KNN model

4.1.1 Build and train the KNN model

In [10]:
# Build KNN models with different parameters
# Model 1: build and train a KNN model (n_neighbors = 5, weights='uniform', metric = 'minkowski', p = 2)
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, weights='uniform', metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)
Out[10]:
KNeighborsClassifier()
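With metric = 'minkowski' and p = 2 the model measures Euclidean distance, d(a, b) = sqrt(sum_i (a_i - b_i)^2); p = 1 would give Manhattan distance. A small consistency check, recomputing by hand the distances that the fitted estimator's kneighbors method reports for the first test sample:

# Distances and indices of the 5 nearest training neighbors of the first test point
dist, idx = classifier.kneighbors(X_test[:1], n_neighbors = 5)
# Recompute the same Euclidean distances manually
manual = np.sqrt(((X_train[idx[0]] - X_test[0]) ** 2).sum(axis = 1))
print(dist[0])
print(manual)  # should match dist[0]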

4.1.2 Predict on the test set

In [11]:
# Predict on the test set
y_pred = classifier.predict(X_test)
y_pred[:5]
Out[11]:
array([0, 0, 0, 0, 0], dtype=int64)
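With weights='uniform' each prediction is a majority vote among the 5 nearest neighbors, so the class probabilities are just vote fractions (multiples of 0.2). A quick check:

# Vote fractions among the 5 nearest neighbors, first five test samples
print(classifier.predict_proba(X_test)[:5])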

4.1.3 Generate the confusion matrix

In [12]:
# Generate the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
[[64  4]
 [ 3 29]]
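In scikit-learn's convention the rows of the confusion matrix are the true classes and the columns the predicted ones, so here TN = 64, FP = 4, FN = 3 and TP = 29. A small sketch unpacking it:

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 64 4 3 29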

4.1.4 Visualize predictions on the test set

In [13]:
# Visualize predictions on the test set
from matplotlib.colors import ListedColormap
plt.figure()
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('pink', 'limegreen')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('KNN (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
In [14]:
# Inspect the scaled EstimatedSalary values of the non-buyers in the test set
X_set[y_set == 0, 1]
Out[14]:
array([ 0.50496393, -0.5677824 ,  0.1570462 ,  0.27301877, -0.5677824 ,
       -1.43757673, -1.58254245, -0.04590581, -0.77073441, -0.59677555,
       -0.42281668, -0.42281668,  0.21503249,  0.47597078,  1.37475825,
        0.21503249,  0.44697764, -1.37959044, -0.65476184, -0.53878926,
       -1.20563157,  0.50496393,  0.30201192, -0.21986468,  0.47597078,
        0.53395707, -0.48080297, -0.33583725, -0.50979612,  0.33100506,
       -0.77073441, -1.03167271,  0.53395707, -0.50979612,  0.41798449,
       -1.43757673, -0.33583725,  0.30201192, -1.14764529, -1.29261101,
       -0.3648304 ,  1.31677196,  0.38899135,  0.30201192, -1.43757673,
       -1.49556302,  0.18603934, -1.26361786,  0.56295021, -0.33583725,
       -0.65476184,  0.01208048,  0.21503249, -0.19087153,  0.56295021,
        0.35999821,  0.27301877, -0.27785096,  0.38899135, -0.42281668,
       -1.00267957,  0.1570462 , -0.27785096, -0.16187839, -0.62576869,
       -1.06066585,  0.41798449, -0.19087153])
In [15]:
# The class labels present in the test set
np.unique(y_set)
Out[15]:
array([0, 1], dtype=int64)

4.1.5 Evaluate model performance

In [16]:
# Evaluate model performance
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
0.93
In [17]:
# The same accuracy computed directly from the confusion matrix
(cm[0][0]+cm[1][1])/(cm[0][0]+cm[0][1]+cm[1][0]+cm[1][1])
Out[17]:
0.93
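Accuracy alone can mislead on imbalanced data: a model that always predicts 0 would already score 0.64 here. Per-class precision and recall give a fuller picture; a minimal sketch:

# Precision, recall and F1 per class complement the single accuracy number
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))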

4.2 Model 2: build and train a KNN model

In [1]:
# Model 2: build and train a KNN model (n_neighbors = 100, weights='distance', metric = 'minkowski', p = 1)
classifier = KNeighborsClassifier(n_neighbors = 100, weights='distance', metric = 'minkowski', p = 1)
classifier.fit(X_train, y_train)
In [19]:
# Predict on the test set
y_pred = classifier.predict(X_test)
y_pred[:5]
Out[19]:
array([0, 0, 0, 0, 0], dtype=int64)
In [20]:
# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
[[63  5]
 [ 4 28]]
In [21]:
# Visualize predictions on the test set
plt.figure()
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('pink', 'limegreen')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('KNN (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

In [22]:
# Evaluate model performance
print(accuracy_score(y_test, y_pred))
0.91

Conclusion:

  1. As the two models above show, hyperparameter choices matter for KNN: model 1 (n_neighbors=5, uniform weights, p=2, i.e. Euclidean distance) reaches 0.93 test accuracy, while model 2 (n_neighbors=100, distance weights, p=1, i.e. Manhattan distance) drops to 0.91. A sketch of a more systematic search follows.
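Rather than comparing a handful of hand-picked settings, the hyperparameters can be searched with cross-validation. Below is a minimal sketch using scikit-learn's GridSearchCV; the grid values are illustrative assumptions, not taken from the original notebook:

# Cross-validated search over a small, illustrative hyperparameter grid
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {'n_neighbors': [3, 5, 7, 11, 15],
              'weights': ['uniform', 'distance'],
              'p': [1, 2]}
grid = GridSearchCV(KNeighborsClassifier(metric='minkowski'),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_)  # best hyperparameter combination found
print(grid.best_score_)   # its mean cross-validated accuracy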