ML Day 6: Logistic Regression

阿新 • Published: 2019-01-11

100 Days of Machine Learning: a little progress every day. Follow along with the GitHub project!

English project: https://github.com/Avik-Jain/100-Days-Of-ML-Code

Chinese project: https://github.com/MLEveryday/100-Days-Of-ML-Code

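Logistic regression is used for classification problems. It measures the relationship between the dependent variable and one or more independent variables by estimating probabilities with the underlying logistic function, also known as the sigmoid function. To make an actual prediction, these probabilities must be converted into binary values: a threshold classifier maps each value in the range (0, 1) to either 0 or 1 to represent the outcome.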

The sigmoid function, σ(z) = 1 / (1 + e^(-z)), is an S-shaped curve that maps any real value into the range between 0 and 1.
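
As a minimal sketch of the two pieces described above, the snippet below implements the sigmoid with NumPy and then applies a threshold classifier. The function names `sigmoid` and `threshold_classify`, the sample scores, and the cutoff of 0.5 are illustrative assumptions, not something fixed by the original project.

```python
import numpy as np

def sigmoid(z):
    """Map any real value (or array of values) into the open interval (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def threshold_classify(probabilities, threshold=0.5):
    """Turn probabilities into 0/1 labels; 0.5 is an assumed default cutoff."""
    return (np.asarray(probabilities) >= threshold).astype(int)

# Hypothetical raw scores, chosen only to show both sides of the curve.
scores = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(scores)
labels = threshold_classify(probs)
print(probs)
print(labels)
```

A score of 0 lands exactly on probability 0.5, so whether it is labeled 0 or 1 depends on how the cutoff comparison is written; here `>=` sends it to class 1.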

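Logistic regression vs. linear regression:

Logistic regression gives a discrete output, whereas linear regression gives a continuous output.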
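
The difference is easy to see on a toy problem. The sketch below fits both scikit-learn models to the same hypothetical one-feature data (the numbers are made up for illustration) and compares what `predict` returns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: one feature, binary target.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

linear = LinearRegression().fit(X, y)
logistic = LogisticRegression().fit(X, y)

X_new = np.array([[1.5], [2.5], [4.5], [5.5]])
linear_out = linear.predict(X_new)      # continuous values, not restricted to 0 or 1
logistic_out = logistic.predict(X_new)  # discrete class labels drawn from {0, 1}
print(linear_out)
print(logistic_out)
```

The linear model returns arbitrary real numbers, while the logistic model returns only the class labels it saw during training.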

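Dataset | Social Network

This dataset contains information about users of a social network: user ID, gender, age, and estimated salary. A car company has just launched a new luxury SUV, and we want to predict which users will buy it. The last column indicates whether the user purchased the SUV. We will build a model that predicts the purchase from two variables, age and estimated salary, so our feature matrix consists of just these two columns. The goal is to find a relationship between a user's age and estimated salary on the one hand and the decision to buy the SUV on the other.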
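
To make the column selection concrete, here is a small sketch on a hand-made DataFrame. The column names and all four rows are assumptions for illustration only; they are not copied from the real CSV file. Only the column order (ID, gender, age, salary, purchased) follows the description above.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the real dataset.
dataset = pd.DataFrame({
    "User ID": [1001, 1002, 1003, 1004],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 35, 26, 47],
    "EstimatedSalary": [19000, 20000, 43000, 25000],
    "Purchased": [0, 0, 0, 1],
})

# Feature matrix: age and estimated salary (columns 2 and 3 by position).
X = dataset.iloc[:, [2, 3]].values
# Target vector: the last column, whether the user purchased.
y = dataset.iloc[:, 4].values

print(X.shape)
print(y)
```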

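Steps in logistic regression:

1 Data preprocessing

Import the libraries, import the dataset, split the dataset into a training set and a test set, and apply feature scaling

2 Logistic regression model

Fit logistic regression to the training set

3 Prediction

Predict the test set results

4 Evaluating the predictions

Build the confusion matrix and visualize the results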
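
The feature-scaling part of step 1 is worth seeing by hand. The sketch below reproduces standardization with plain NumPy on hypothetical age and salary values: the mean and standard deviation come from the training set only, and the same statistics are then reused for the test set. This mirrors what `StandardScaler` does with `fit_transform` followed by `transform`, but it is a simplified illustration rather than the library's implementation.

```python
import numpy as np

# Hypothetical age / salary rows, used only to illustrate the arithmetic.
X_train = np.array([[25.0, 30000.0], [35.0, 60000.0], [45.0, 90000.0]])
X_test = np.array([[35.0, 45000.0]])

# Learn the per-column mean and (population) standard deviation from the training set only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# Apply the same training statistics to both sets so the test set stays unseen.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0))  # close to 0 for each column
print(X_train_scaled.std(axis=0))   # close to 1 for each column
print(X_test_scaled)
```

Without this step, the salary column would dominate the age column simply because its numbers are thousands of times larger.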

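A confusion matrix, also known as an error matrix, is a special kind of contingency table that shows how well an algorithm performs, typically in supervised learning. In scikit-learn's confusion_matrix, each row corresponds to an actual class and each column to a predicted class.

The classification_report function prints the precision, recall, and F1 score for each class.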
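
For intuition, a binary confusion matrix can be counted by hand. The sketch below uses a hypothetical helper, `confusion_matrix_2x2`, and made-up labels; it follows the same layout as scikit-learn, with rows for the actual class and columns for the predicted class.

```python
import numpy as np

def confusion_matrix_2x2(y_true, y_pred):
    """Count outcomes for binary labels; rows are true classes, columns are predicted classes."""
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Hypothetical labels for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]
cm = confusion_matrix_2x2(y_true, y_pred)
print(cm)
```

The diagonal holds the correct predictions; the off-diagonal cells hold the false positives (top right) and false negatives (bottom left).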

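Precision / recall / F1 score

As shown in the figure below, suppose we have a set of pictures, 12 of which show dogs and the rest show cats. A program is used to pick out the dog pictures; of the 8 pictures it identifies as dogs, 5 really are dogs and 3 are cats.

In the figure, solid dots represent dog pictures, hollow dots represent cat pictures, and the circular region marks the pictures the program identified. The program's precision is therefore 5/8 and its recall is 5/12.

The F1 score is the harmonic mean of precision and recall. It is high only when both precision and recall are high; its best value is 1 and its worst value is 0.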
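
The dog and cat example can be worked through in a few lines. The counts below come straight from the text; only the variable names are made up for this sketch.

```python
# Counts taken from the dog/cat example in the text.
true_positives = 5        # dog pictures correctly identified as dogs
false_positives = 3       # cat pictures wrongly identified as dogs
false_negatives = 12 - 5  # dog pictures the program missed

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(precision)  # 0.625
print(recall)
print(f1)
```

For this example the F1 score comes out at about 0.5, sitting between the recall of 5/12 and the precision of 5/8.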

Code:

# Importing the Libraries
import numpy as np  # numerical computing functions
import matplotlib.pyplot as plt  # plotting
import pandas as pd  # importing and managing datasets

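# Importing the dataset
dataset = pd.read_csv('../datasets/Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values  # iloc selects rows and columns by integer position: age and estimated salary
y = dataset.iloc[:, 4].values  # last column: whether the user purchased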

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

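# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()  # standardizes each feature to zero mean and unit standard deviation (it does not make the data normally distributed)
X_train = sc.fit_transform(X_train)  # learn mean and standard deviation from the training set, then scale it
X_test = sc.transform(X_test)  # reuse the training statistics on the test set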

# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
cm = confusion_matrix(y_test, y_pred)  # build the confusion matrix
print(cm)  # print the confusion matrix
print(classification_report(y_test, y_pred))  # print precision, recall, and F1 score for each class

# Visualization (training set)
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
# np.meshgrid(x, y) takes 1-D arrays of grid coordinates and returns the coordinate matrices X, Y
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
# contourf draws filled contours; here it colors the region predicted for each class
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    # scatter draws the data points, one color per class
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color=ListedColormap(('red', 'green'))(i), label=j)

plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

# Visualization (test set)
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))

plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color=ListedColormap(('red', 'green'))(i), label=j)

plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Results: