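Machine Learning: Classification 3-4 (Logistic Regression and ROC)
阿新 • Published: 2022-03-15
ROC curve for a logistic regression model that predicts whether a customer will buy a new car model
Main steps: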
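- 1. Import packages
- 2. Import the dataset
- 3. Data preprocessing
  - 3.1 Check for missing values
  - 3.2 Create the independent and dependent variables
  - 3.3 Check whether the classes are balanced
  - 3.4 Split the data into training and test sets
  - 3.5 Feature scaling
- 4. Build the logistic regression model
- 5. Plot the ROC curve by hand
- 6. Plot the ROC curve with the library
- 7. Compute the AUC score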
1. Import packages
In [1]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
2. Import the dataset
In [2]:
# Import the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
dataset
Out[2]:
| | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|
| 0 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 15804002 | Male | 19 | 76000 | 0 |
| ... | ... | ... | ... | ... | ... |
| 395 | 15691863 | Female | 46 | 41000 | 1 |
| 396 | 15706071 | Male | 51 | 23000 | 1 |
| 397 | 15654296 | Female | 50 | 20000 | 1 |
| 398 | 15755018 | Male | 36 | 33000 | 0 |
| 399 | 15594041 | Female | 49 | 36000 | 1 |
400 rows × 5 columns
3. Data preprocessing
3.1 Check for missing values
In [3]:
# Check for missing values
null_df = dataset.isnull().sum()
null_df
Out[3]:
User ID 0
Gender 0
Age 0
EstimatedSalary 0
Purchased 0
dtype: int64
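3.2 Create the independent and dependent variables
To make the classification results easy to visualize, only the two columns Age and EstimatedSalary are used as independent variables.
In [4]:
# Create the independent and dependent variables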
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
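3.3 Check whether the classes are balanced
In [5]:
# Check whether the classes are balanced
sample_0 = sum(dataset['Purchased']==0)
sample_1 = sum(dataset['Purchased']==1)
print('Share of non-buyers among all samples: %.2f' %(sample_0/(sample_0 + sample_1)))
Share of non-buyers among all samples: 0.64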
3.4 Split the data into training and test sets
In [6]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(300, 2)
(100, 2)
(300,)
(100,)
3.5 Feature scaling
In [7]:
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
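As a rough illustration of what the scaling step does, the sketch below standardizes a single feature by hand with the Python standard library. The ages are made-up values, not rows from Social_Network_Ads.csv, and the helper name `standardize` is hypothetical. It assumes the StandardScaler defaults, which subtract the mean and divide by the population standard deviation. Both statistics are learned from the training data only and then reused for the test data, mirroring `fit_transform` followed by `transform`.

```python
from statistics import fmean, pstdev

# Hypothetical ages, for illustration only (not taken from the dataset)
train_ages = [19.0, 35.0, 26.0, 27.0, 46.0, 51.0]
test_ages = [36.0, 49.0]

# "fit": learn the mean and population standard deviation from the training data
mu = fmean(train_ages)
sigma = pstdev(train_ages)

def standardize(values, mean, std):
    """Return z-scores: (x - mean) / std for every x."""
    return [(x - mean) / std for x in values]

# "transform": the same training statistics are applied to both splits
train_scaled = standardize(train_ages, mu, sigma)
test_scaled = standardize(test_ages, mu, sigma)

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics for the test split keeps information about the test data out of the preprocessing step.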
4. Build the logistic regression model
In [8]:
# Build and train the logistic regression model
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(penalty='l2', C=1, class_weight='balanced', random_state = 0)
classifier.fit(X_train, y_train)
Out[8]:
LogisticRegression(C=1, class_weight='balanced', random_state=0)
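The `class_weight='balanced'` option is meant to compensate for the roughly 64/36 class split found in step 3.3. According to the scikit-learn documentation, the balanced weight of a class is n_samples / (n_classes * number of samples in that class). The sketch below reproduces that formula in plain Python on a hypothetical label list with a 64/36 split. The helper name `balanced_class_weights` is an assumption for illustration, and the actual weights in this notebook depend on the labels in `y_train`.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical labels with roughly the same 64/36 split as the dataset
labels = [0] * 64 + [1] * 36
weights = balanced_class_weights(labels)
print(weights)
```

The minority class (buyers) receives the larger weight, so its misclassifications cost more during training.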
In [9]:
# Predict on the test set (class labels)
y_pred = classifier.predict(X_test)
print(y_pred[:10])
[0 0 0 0 0 0 0 1 0 1]
In [10]:
# Predict on the test set (probability of class 1)
y_pred_proba = classifier.predict_proba(X_test)[:,1]
print(y_pred_proba[:10])
[0.15981512 0.23237193 0.27326099 0.12555781 0.13506028 0.00922329
0.01852589 0.84130917 0.00685852 0.63294873]
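For a binary logistic regression, `predict` effectively applies a 0.5 cutoff to the positive-class probability returned by `predict_proba`. The sketch below checks this against the ten probabilities printed above; the variable names are chosen for illustration only.

```python
# First ten positive-class probabilities, copied from the output of In [10]
proba_head = [0.15981512, 0.23237193, 0.27326099, 0.12555781, 0.13506028,
              0.00922329, 0.01852589, 0.84130917, 0.00685852, 0.63294873]

# Label a sample 1 when its probability exceeds the 0.5 cutoff
labels_head = [1 if p > 0.5 else 0 for p in proba_head]
print(labels_head)  # [0, 0, 0, 0, 0, 0, 0, 1, 0, 1], matching the output of In [9]
```

The ROC curve in the following steps comes from replacing this single 0.5 cutoff with a range of thresholds.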
5. Plot the ROC curve by hand
In [11]:
# Plot the ROC curve by hand
from sklearn.metrics import confusion_matrix
threshold_list = []
tpr_list = []
fpr_list = []
for i in range(11):
    threshold = i / 10  # thresholds: 0, 0.1, 0.2, ..., 0.9, 1
    # Turn probabilities into 0/1 predictions at this threshold
    new_y_pred_proba = []
    for j in y_pred_proba:
        if j >= threshold:
            new_y_pred_proba.append(1)
        else:
            new_y_pred_proba.append(0)
    # Confusion matrix; fixing the label order keeps it 2x2 at every threshold
    cm = confusion_matrix(y_test, new_y_pred_proba, labels=[0, 1])
    tp = cm[1, 1]
    fp = cm[0, 1]
    fn = cm[1, 0]
    tn = cm[0, 0]
    tpr_value = tp / (tp + fn)  # true positive rate
    fpr_value = fp / (fp + tn)  # false positive rate
    threshold_list.append(threshold)
    tpr_list.append(tpr_value)
    fpr_list.append(fpr_value)
plt.figure()
plt.plot(fpr_list, tpr_list)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC Curve (hand written)')
plt.show()
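The hand-written loop relies on two ratios: TPR = TP / (TP + FN) and FPR = FP / (FP + TN). As a minimal sketch, the function below computes one ROC point without scikit-learn. The labels and scores are toy values invented for illustration, and `roc_point` is a hypothetical helper name.

```python
def roc_point(y_true, scores, threshold):
    """Return (fpr, tpr) when scores >= threshold are predicted as class 1."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return fp / (fp + tn), tp / (tp + fn)

# Toy data (hypothetical): 3 positives and 3 negatives
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]

for threshold in (0.0, 0.5, 1.0):
    print(threshold, roc_point(y_true, scores, threshold))
```

A threshold of 0 predicts every sample as positive, giving the point (1, 1), while a threshold above every score gives (0, 0); the thresholds in between trace the curve.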
6. Plot the ROC curve with the library
In [12]:
# Compute the values needed for the ROC curve
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
In [13]:
# Show the values used for the ROC curve
roc_df = pd.DataFrame()
roc_df['fpr'] = fpr
roc_df['tpr'] = tpr
roc_df['thresholds'] = thresholds
roc_df
Out[13]:
| | fpr | tpr | thresholds |
|---|---|---|---|
| 0 | 0.000000 | 0.00000 | 1.998532 |
| 1 | 0.000000 | 0.03125 | 0.998532 |
| 2 | 0.000000 | 0.18750 | 0.991629 |
| 3 | 0.014706 | 0.18750 | 0.990037 |
| 4 | 0.014706 | 0.75000 | 0.677026 |
| 5 | 0.044118 | 0.75000 | 0.632949 |
| 6 | 0.044118 | 0.81250 | 0.602053 |
| 7 | 0.058824 | 0.81250 | 0.596452 |
| 8 | 0.058824 | 0.84375 | 0.563143 |
| 9 | 0.102941 | 0.84375 | 0.531246 |
| 10 | 0.102941 | 0.87500 | 0.521528 |
| 11 | 0.147059 | 0.87500 | 0.454756 |
| 12 | 0.147059 | 0.90625 | 0.429694 |
| 13 | 0.161765 | 0.90625 | 0.410702 |
| 14 | 0.161765 | 0.93750 | 0.402222 |
| 15 | 0.220588 | 0.93750 | 0.378048 |
| 16 | 0.220588 | 0.96875 | 0.346464 |
| 17 | 0.411765 | 0.96875 | 0.123208 |
| 18 | 0.411765 | 1.00000 | 0.118657 |
| 19 | 0.647059 | 1.00000 | 0.035503 |
| 20 | 0.676471 | 1.00000 | 0.034192 |
| 21 | 1.000000 | 1.00000 | 0.001963 |
In [14]:
# Plot the ROC curve
plt.figure()
plt.plot(fpr, tpr)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC Curve (call library)')
plt.show()
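The hand-drawn ROC curve agrees with the one produced by the library, which suggests that our understanding of ROC is correct. The hand-drawn version evaluates only 11 thresholds, so it is a coarser approximation of the same curve.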
7. Compute the AUC score
In [15]:
# Compute the AUC score
from sklearn.metrics import roc_auc_score
auc_score = roc_auc_score(y_test, y_pred_proba)
print('AUC score: %.2f' %(auc_score))
AUC score: 0.95
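Conclusion: the AUC score is 0.95 on the 100-sample test set, which indicates that the model separates buyers from non-buyers very well.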
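As a cross-check, AUC is the area under the ROC curve, so it can be approximated from the (fpr, tpr) pairs in the table of Out[13] with the trapezoidal rule. The sketch below does this in plain Python. The values are copied from that table, so the result is limited by their printed precision, and the helper name `trapezoid_area` is an assumption for illustration.

```python
# (fpr, tpr) pairs copied from the ROC table printed in Out[13]
fpr = [0.000000, 0.000000, 0.000000, 0.014706, 0.014706, 0.044118,
       0.044118, 0.058824, 0.058824, 0.102941, 0.102941, 0.147059,
       0.147059, 0.161765, 0.161765, 0.220588, 0.220588, 0.411765,
       0.411765, 0.647059, 0.676471, 1.000000]
tpr = [0.00000, 0.03125, 0.18750, 0.18750, 0.75000, 0.75000,
       0.81250, 0.81250, 0.84375, 0.84375, 0.87500, 0.87500,
       0.90625, 0.90625, 0.93750, 0.93750, 0.96875, 0.96875,
       1.00000, 1.00000, 1.00000, 1.00000]

def trapezoid_area(xs, ys):
    """Area under the piecewise-linear curve through the points (xs, ys)."""
    area = 0.0
    for i in range(1, len(xs)):
        area += (xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2
    return area

auc_from_table = trapezoid_area(fpr, tpr)
print('%.2f' % auc_from_table)  # 0.95
```

The result agrees with the value reported by `roc_auc_score` to two decimal places.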