Machine Learning Exercises in Python, Part 3: Logistic Regression

阿新 • Published: 2022-05-02

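In this post we shift our goal from predicting a continuous value (regression) to sorting examples into two or more discrete categories (classification), and apply it to a student admissions problem. Suppose you are a university administrator and want to estimate each applicant's chance of admission from their results on two exams. Historical data from previous applicants is available as a training set: for each training example you have the applicant's two exam scores and the admissions decision. To accomplish this, we will build a classification model on the exam scores, using a method called logistic regression to estimate the probability of admission.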

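Logistic Regression

Logistic regression is actually a classification algorithm. I suspect it is named this way because it is very similar to linear regression in its learning approach, although the cost and gradient functions are formulated differently. In particular, logistic regression passes its output through a sigmoid activation function (loosely referred to as "logit"; strictly, the sigmoid is the inverse of the logit) instead of producing the continuous output of linear regression.

Let's start by importing and examining the dataset we'll be working with.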

import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# the exercise file is expected in a "data" folder under the working directory
path = os.path.join(os.getcwd(), 'data', 'ex2data1.txt')
data = pd.read_csv(path, header=None, names=['Exam 1', 'Exam 2', 'Admitted'])
data.head()

	Exam 1	Exam 2	Admitted
0	34.623660	78.024693	0
1	30.286711	43.894998	0
2	35.847409	72.902198	0
3	60.182599	86.308552	1
4	79.032736	75.344376	1

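The data has two continuous independent variables, "Exam 1" and "Exam 2". Our prediction target is the "Admitted" label: a value of 1 means the student was admitted and 0 means the student was not admitted. Let's look at a scatter plot of the two scores, using color coding to show whether each example is positive or negative.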

positive = data[data['Admitted'].isin([1])]
negative = data[data['Admitted'].isin([0])]

fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(positive['Exam 1'], positive['Exam 2'], s=50, c='b', marker='o', label='Admitted')
ax.scatter(negative['Exam 1'], negative['Exam 2'], s=50, c='r', marker='x', label='Not Admitted')
ax.legend()
ax.set_xlabel('Exam 1 Score')
ax.set_ylabel('Exam 2 Score')

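From this plot we can see that there is a nearly linear decision boundary. It curves a bit, so we can't classify every example correctly with a straight line, but we should be able to get pretty close. Now we need to implement logistic regression so that we can train a model to find the optimal decision boundary and make class predictions. The first step is to implement the sigmoid function.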

def sigmoid(z): 
    return 1 / (1 + np.exp(-z))

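This function is the "activation" function for the output of logistic regression. It converts a continuous input into a value between 0 and 1. This value can be interpreted as the class probability, that is, the likelihood that the input example should be classified as positive. Combining this probability with a threshold value gives a discrete label prediction. It helps to visualize the function's output to see what it is really doing.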
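To make the thresholding step concrete, here is a minimal sketch. The helper `to_label` and the sample inputs are illustrative names and values, not part of the exercise, and the 0.5 threshold is simply the conventional choice.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real input into the open interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

def to_label(probability, threshold=0.5):
    """Turn probabilities into 0/1 labels (hypothetical helper for illustration)."""
    return (np.asarray(probability) >= threshold).astype(int)

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
p = sigmoid(z)          # probabilities rise from near 0 to near 1
labels = to_label(p)    # an input of 0 maps to exactly 0.5, which counts as positive
print(p)
print(labels)
```

Raising the threshold makes the classifier more conservative about predicting the positive class; lowering it does the opposite.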

nums = np.arange(-10, 10, step=1)

fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(nums, sigmoid(nums), 'r')

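Our next step is to write the cost function. The cost function evaluates the performance of the model on the training data given a set of model parameters. Here is the cost function for logistic regression.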

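def cost(theta, X, y):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    # contribution of the positive examples: -y * log(h)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    # contribution of the negative examples: (1 - y) * log(1 - h)
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    # average over all training examples
    return np.sum(first - second) / len(X)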

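Note that we reduce the output down to a single scalar value: the sum of the "error" over all examples, divided by the number of examples, where the error is quantified as a function of the difference between the class probability assigned by the model and the true label of the example. The implementation is completely vectorized: it computes the model's predictions for the whole dataset in one statement (sigmoid(X * theta.T)).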
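One practical caveat with this formulation: for very large scores the sigmoid rounds to exactly 1.0, so `np.log(1 - sigmoid(...))` becomes `-inf` and the product `0 * -inf` turns the cost into `nan`. The sketch below is an alternative formulation, not the one used in this exercise. It assumes plain NumPy arrays with a flat label vector, and the name `stable_cost` and the demo values are made up for illustration.

```python
import numpy as np

def stable_cost(theta, X, y):
    """Mean cross-entropy cost computed without taking log() of a sigmoid output."""
    z = X @ theta                      # linear scores, shape (m,)
    y = np.ravel(y)
    # identities: -log(sigmoid(z)) = logaddexp(0, -z)
    #             -log(1 - sigmoid(z)) = logaddexp(0, z)
    per_example = y * np.logaddexp(0, -z) + (1 - y) * np.logaddexp(0, z)
    return per_example.mean()

# the third example has a score of 800, large enough to break the direct formula
X_demo = np.array([[1.0, 2.0], [1.0, -3.0], [1.0, 800.0]])
y_demo = np.array([1, 0, 1])
theta_demo = np.array([0.0, 1.0])
print(stable_cost(theta_demo, X_demo, y_demo))
```

For moderate scores both formulations agree, so this only matters once the parameters or features become large.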

Let's test the cost function to make sure it works. First we need to do some setup.

# add a ones column - this makes the matrix multiplication work out easier
data.insert(0, 'Ones', 1)

# set X (training data) and y (target variable)
cols = data.shape[1]
X = data.iloc[:, 0:cols-1]
y = data.iloc[:, cols-1:cols]

# convert to numpy arrays and initialize the parameter array theta
X = np.array(X.values)
y = np.array(y.values)
theta = np.zeros(3)

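Let's check the shapes of our data structures to make sure they are sensible. This technique is very useful when implementing matrix multiplication.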

X.shape, theta.shape, y.shape

((100, 3), (3,), (100, 1))

Now let's compute the cost for our initial solution, with the model parameters "theta" set to zero.

cost(theta, X, y)

0.69314718055994529
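The value 0.693... is ln 2, and it does not depend on the data: with all parameters at zero, every example gets a predicted probability of 0.5, so each example contributes -log(0.5) to the cost. The following sketch checks this on randomly generated data (the arrays here are synthetic stand-ins, not the exercise dataset).

```python
import numpy as np

rng = np.random.default_rng(0)
X_any = rng.normal(size=(50, 3))          # arbitrary features
y_any = rng.integers(0, 2, size=50)       # arbitrary 0/1 labels

# with theta = 0 every score is 0, so every probability is exactly 0.5
p = 1 / (1 + np.exp(-(X_any @ np.zeros(3))))
cost0 = np.mean(-y_any * np.log(p) - (1 - y_any) * np.log(1 - p))
print(cost0, np.log(2))
```

This makes ln 2 a handy sanity check for any logistic regression cost implementation.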

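Now that we have a working cost function, the next step is to write a function that computes the gradient of the cost with respect to the model parameters, which tells us how to change the parameters to improve the model's fit to the training data. With gradient descent we don't just randomly jiggle the parameter values and see what works best. Instead, at each training iteration we update the parameters by moving them in the direction that reduces the training error (i.e. the "cost"). We can do this because the cost function is differentiable.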

def gradient(theta, X, y):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)

    parameters = int(theta.ravel().shape[1])
    grad = np.zeros(parameters)

    # difference between predicted probability and true label, per example
    error = sigmoid(X * theta.T) - y

    for i in range(parameters):
        term = np.multiply(error, X[:, i])
        grad[i] = np.sum(term) / len(X)

    return grad
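A common way to gain confidence in a hand-written gradient is to compare it against finite differences of the cost. The sketch below does this with a vectorized form of the same gradient. It assumes plain NumPy arrays with a flat label vector (rather than the `np.matrix` objects used above), and all names and the random data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_vec(theta, X, y):
    p = sigmoid(X @ theta)
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

def gradient_vec(theta, X, y):
    """Same quantity as the loop version: X^T (sigmoid(X theta) - y) / m."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def numerical_gradient(f, theta, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
X_chk = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
y_chk = rng.integers(0, 2, size=20).astype(float)
theta_chk = rng.normal(size=3)

analytic = gradient_vec(theta_chk, X_chk, y_chk)
numeric = numerical_gradient(lambda t: cost_vec(t, X_chk, y_chk), theta_chk)
print(np.max(np.abs(analytic - numeric)))   # should be tiny if the gradient is right
```

If the two disagree by more than rounding noise, the analytic gradient (or the cost) most likely has a bug.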

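Note that we don't actually perform gradient descent in this function; we only compute a single gradient step. In the exercise, an Octave function called "fminunc" is used to optimize the parameters, given functions that compute the cost and the gradient. Since we're using Python, we can use SciPy's optimization API to do the same thing.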

import scipy.optimize as opt
result = opt.fmin_tnc(func=cost, x0=theta, fprime=gradient, args=(X, y))
cost(result[0], X, y)

0.20357134412164668
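SciPy also exposes the same truncated Newton solver through the more general `scipy.optimize.minimize` interface, which returns a result object instead of a tuple. The sketch below is one possible way to use it, not the approach taken in this exercise. It runs on synthetic data, and the cost and gradient helpers are simplified array-based versions with illustrative names.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_vec(theta, X, y):
    # logaddexp form of the cross-entropy, which avoids log(0) during the search
    z = X @ theta
    return np.mean(y * np.logaddexp(0, -z) + (1 - y) * np.logaddexp(0, z))

def gradient_vec(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# synthetic data: labels are sampled from a known logistic model
rng = np.random.default_rng(2)
m = 200
X_syn = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 2))])
true_theta = np.array([-0.5, 2.0, -1.0])
y_syn = (rng.random(m) < sigmoid(X_syn @ true_theta)).astype(float)

res = minimize(cost_vec, x0=np.zeros(3), args=(X_syn, y_syn),
               jac=gradient_vec, method='TNC')
print(res.x)        # fitted parameters
print(res.fun)      # final cost, below the ln 2 starting value
print(res.message)  # solver status text
```

The named fields (`res.x`, `res.fun`, `res.message`) can be easier to read than indexing into the tuple returned by `fmin_tnc`.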

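Now that we have the optimal model parameters for our dataset, we next write a function that outputs predictions for a dataset X using our learned parameters theta, and then use this function to score the training accuracy of our classifier.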

def predict(theta, X):
    probability = sigmoid(X * theta.T)
    return [1 if x >= 0.5 else 0 for x in probability]

theta_min = np.matrix(result[0])
predictions = predict(theta_min, X)
correct = [1 if ((a == 1 and b == 1) or (a == 0 and b == 0)) else 0 for (a, b) in zip(predictions, y)]
# percentage of correct predictions (division, not the modulo operator %)
accuracy = 100 * sum(correct) / len(correct)
print('accuracy = {0:.0f}%'.format(accuracy))

accuracy = 89%

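Our logistic regression classifier correctly predicted whether a student was admitted 89% of the time. Keep in mind that this is accuracy on the training set. We didn't keep a hold-out set or use cross-validation to get a true approximation of the accuracy, so this number is likely higher than the real-world performance.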
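A hold-out evaluation can be set up with a few lines of NumPy. The following is a rough sketch under assumed names (`train_test_split_indices`, `accuracy_percent`) and an arbitrary 70/30 split; it only produces the index sets and the scoring function, and leaves the fitting step to the code above.

```python
import numpy as np

def train_test_split_indices(n_examples, test_fraction=0.3, seed=0):
    """Shuffle row indices and split them into (train, test) index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_examples)
    n_test = int(round(n_examples * test_fraction))
    return order[n_test:], order[:n_test]

def accuracy_percent(predictions, labels):
    """Percentage of predictions that match the labels."""
    predictions = np.ravel(predictions)
    labels = np.ravel(labels)
    return 100.0 * np.mean(predictions == labels)

train_idx, test_idx = train_test_split_indices(100)
print(len(train_idx), len(test_idx))
print(accuracy_percent([1, 0, 1, 1], [1, 0, 0, 1]))
```

The idea would be to fit theta on `X[train_idx]` and `y[train_idx]` only, then report `accuracy_percent` on the predictions for `X[test_idx]`, which the model has never seen.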

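Regularized Logistic Regression

Now that we have a working implementation of logistic regression, we'll improve the algorithm by adding regularization. Regularization is a term in the cost function that causes the algorithm to prefer simpler models (in this case, models with smaller coefficients). The idea is that this helps reduce overfitting and improves the model's ability to generalize. We will apply the regularized version of logistic regression to a slightly more challenging problem. Suppose you are the product manager of a factory and you have the results of two different tests on some microchips. From these two tests you would like to decide whether each microchip should be accepted or rejected. To help you make the decision, you have a dataset of test results on past microchips, from which you can build a logistic regression model.

Let's start by visualizing the data.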

# the exercise file is expected in a "data" folder under the working directory
path = os.path.join(os.getcwd(), 'data', 'ex2data2.txt')
data2 = pd.read_csv(path, header=None, names=['Test 1', 'Test 2', 'Accepted'])

positive = data2[data2['Accepted'].isin([1])]
negative = data2[data2['Accepted'].isin([0])]

fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(positive['Test 1'], positive['Test 2'], s=50, c='b', marker='o', label='Accepted')
ax.scatter(negative['Test 1'], negative['Test 2'], s=50, c='r', marker='x', label='Rejected')
ax.legend()
ax.set_xlabel('Test 1 Score')
ax.set_ylabel('Test 2 Score')

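This data looks a bit more complicated than the previous example. In particular, you'll notice that there is no linear decision boundary that would perform well on it. One way to handle this while still using a linear technique like logistic regression is to construct features derived from polynomials of the original features. We can try creating a bunch of polynomial features to feed into the classifier.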

degree = 5
x1 = data2['Test 1']
x2 = data2['Test 2']

data2.insert(3, 'Ones', 1)

# build the polynomial terms x1^(i-j) * x2^j and name them F<i><j>
for i in range(1, degree):
    for j in range(0, i):
        data2['F' + str(i) + str(j)] = np.power(x1, i-j) * np.power(x2, j)

data2.drop('Test 1', axis=1, inplace=True)
data2.drop('Test 2', axis=1, inplace=True)

data2.head()

	Accepted	Ones	F10	F20	F21	F30	F31	F32
0	1	1	0.051267	0.002628	0.035864	0.000135	0.001839	0.025089
1	1	1	-0.092742	0.008601	-0.063523	-0.000798	0.005891	-0.043509
2	1	1	-0.213710	0.045672	-0.147941	-0.009761	0.031616	-0.102412
3	1	1	-0.375000	0.140625	-0.188321	-0.052734	0.070620	-0.094573
4	1	1	-0.513250	0.263426	-0.238990	-0.135203	0.122661	-0.111283
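Note that the loop above stops at `degree - 1` and lets `j` run only up to `i - 1`, so it yields ten polynomial columns plus the ones column, and it never generates a pure power of the second test score. For comparison, here is a sketch of a fuller expansion that includes every monomial up to the chosen degree. This is a variant offered for illustration, not the feature set used in this article; the function name `map_features` is hypothetical, and using it would change the number of parameters.

```python
import numpy as np

def map_features(x1, x2, degree):
    """All monomials x1^(i-j) * x2^j with 0 <= j <= i <= degree, ones column first."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    columns = [np.ones_like(x1)]
    for i in range(1, degree + 1):
        for j in range(i + 1):
            columns.append(np.power(x1, i - j) * np.power(x2, j))
    return np.column_stack(columns)

# degree 2 gives the columns: 1, x1, x2, x1^2, x1*x2, x2^2
features = map_features([0.5, -1.0], [2.0, 3.0], degree=2)
print(features.shape)
print(features[0])
```

More features give the model a more flexible boundary, which is exactly the situation where regularization becomes useful.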

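Now we need to modify the cost and gradient functions to include the regularization term. In each case, the regularizer is added on to the previous calculation. Here is the updated cost function.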

def costReg(theta, X, y, learningRate):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    # penalty on every parameter except the first; the parentheses around
    # 2 * len(X) matter, since learningRate / 2 * len(X) would multiply by len(X)
    reg = (learningRate / (2 * len(X))) * np.sum(np.power(theta[:, 1:theta.shape[1]], 2))
    return np.sum(first - second) / len(X) + reg

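We added a new variable called "reg" that is a function of the parameter values. As the parameters grow larger, the penalty added to the cost function grows as well. We also added a new "learning rate" parameter to the function, which is part of the regularization term in the equation. Despite its name, it plays the role of the regularization strength (usually written as lambda) rather than an optimizer step size. It gives us a new hyperparameter that we can use to tune how much weight the regularization carries in the cost function.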
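To see how the penalty behaves in isolation, the sketch below evaluates it for a small and a large parameter vector at several strengths. The function name `l2_penalty`, the parameter values, and the example count of 100 are all made up for illustration.

```python
import numpy as np

def l2_penalty(theta, n_examples, strength):
    """strength / (2 m) * sum of squared parameters, skipping theta[0] (the intercept)."""
    theta = np.ravel(theta)
    return strength / (2 * n_examples) * np.sum(theta[1:] ** 2)

small = np.array([5.0, 0.1, -0.2])     # same intercept, small coefficients
large = np.array([5.0, 10.0, -20.0])   # same intercept, large coefficients

for strength in (0.0, 1.0, 10.0):
    print(strength, l2_penalty(small, 100, strength), l2_penalty(large, 100, strength))
```

The intercept is identical in both vectors and never contributes, while the penalty scales linearly with the strength and quadratically with the size of the other coefficients.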

Next we'll add regularization to the gradient function.

def gradientReg(theta, X, y, learningRate):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)

    parameters = int(theta.ravel().shape[1])
    grad = np.zeros(parameters)

    error = sigmoid(X * theta.T) - y

    for i in range(parameters):
        term = np.multiply(error, X[:, i])

        if i == 0:
            # the intercept term is not regularized
            grad[i] = np.sum(term) / len(X)
        else:
            grad[i] = (np.sum(term) / len(X)) + ((learningRate / len(X)) * theta[0, i])

    return grad

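Just as with the cost function, the regularization term is added on to the original calculation. The cost function skips the first parameter by slicing it out of the sum; here we include explicit logic to make sure the first parameter is not regularized. The intuition behind this decision is that the first parameter is considered the "bias" or "intercept" of the model and shouldn't be penalized.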
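The same gradient can be written without a loop by adding a penalty vector whose first entry is forced to zero. This is a sketch of an equivalent vectorized form, assuming plain NumPy arrays with a flat label vector; the name `gradient_reg_vec` and the random data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_reg_vec(theta, X, y, strength):
    """Unregularized gradient plus (strength / m) * theta, with no penalty on theta[0]."""
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    penalty = strength / m * theta    # new array, theta itself is left untouched
    penalty[0] = 0.0                  # the intercept is not regularized
    return grad + penalty

rng = np.random.default_rng(3)
X_reg = np.hstack([np.ones((30, 1)), rng.normal(size=(30, 2))])
y_reg = rng.integers(0, 2, size=30).astype(float)
theta_reg = np.array([0.5, -1.5, 2.0])

plain = gradient_reg_vec(theta_reg, X_reg, y_reg, strength=0.0)
regularized = gradient_reg_vec(theta_reg, X_reg, y_reg, strength=1.0)
print(regularized - plain)   # zero for the intercept, theta / m for the rest
```

The difference between the two gradients isolates the effect of regularization: the intercept component is unchanged, and each remaining component is shifted by its own parameter value divided by the number of examples.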

Let's test out the new functions just as we did before.

# set X and y (remember from above that we moved the label to column 0)
cols = data2.shape[1]
X2 = data2.iloc[:, 1:cols]
y2 = data2.iloc[:, 0:1]

# convert to numpy arrays and initialize the parameter array theta
X2 = np.array(X2.values)
y2 = np.array(y2.values)
theta2 = np.zeros(11)

learningRate = 1

costReg(theta2, X2, y2, learningRate)

0.6931471805599454

We can also reuse the optimization code from earlier to find the optimal model parameters.

result2 = opt.fmin_tnc(func=costReg, x0=theta2, fprime=gradientReg, args=(X2, y2, learningRate))
result2

(array([ 0.35872309, -3.22200653, 18.97106363, -4.25297831, 18.23053189, 20.36386672, 8.94114455, -43.77439015, -17.93440473, -50.75071857, -2.84162964]), 110, 1)

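Finally, we can use the same method we applied earlier to create label predictions for the training data and evaluate the model's performance.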

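theta_min = np.matrix(result2[0])
predictions = predict(theta_min, X2)
correct = [1 if ((a == 1 and b == 1) or (a == 0 and b == 0)) else 0 for (a, b) in zip(predictions, y2)]
# percentage of correct predictions; this needs division, because the modulo
# operator % returns a remainder, which equals the percentage only when there
# are exactly 100 examples (the figure shown below is the one reported by the
# source article)
accuracy = 100 * sum(correct) / len(correct)
print('accuracy = {0:.0f}%'.format(accuracy))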

accuracy = 91%

This article is a translation. The original was written by John Wittenauer and is available at

http://www.johnwittenauer.net/machine-learning-exercises-in-python-part-3/
