Building an SVM in Python
SVM is where machine learning really begins. The earlier KNN was little more than applying a formula; with no loss function and no weight optimization, it can hardly be called learning.
SVM involves several main components: the score function, the loss function, regularization, and gradient descent.
The overall process (a minimal sketch follows this list):
1. Initialize the weights and multiply the training data by them to obtain the score matrix;
2. Compute the loss from the score matrix with the loss function, and add a regularization term so that large weights do not dominate the loss;
3. Run gradient descent on the loss, iteratively adjusting the weights toward the optimum.
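To make these three steps concrete, here is a minimal NumPy sketch of the loop. The shapes and hyperparameters here are made up purely for illustration; the real loss and training code appear later in this post.

import numpy as np

np.random.seed(0)
X = np.random.randn(5, 4)           # toy training data: N=5 samples, D=4 features
y = np.array([0, 2, 1, 1, 0])       # correct class index of each sample
W = 0.001 * np.random.randn(4, 3)   # step 1: initialize the weights (D x C, C=3 classes)

for it in range(100):
    scores = X.dot(W)                                # step 1: score matrix, N x C
    correct = scores[np.arange(5), y][:, None]       # correct-class score of each sample, N x 1
    margins = np.maximum(0, scores - correct + 1.0)  # step 2: hinge-loss terms
    margins[np.arange(5), y] = 0.0                   # the correct class contributes no loss
    loss = margins.sum() / 5 + 0.5 * 1e-3 * np.sum(W * W)  # data loss + L2 regularization

    # step 3: gradient of the loss w.r.t. W, followed by a gradient-descent step
    indicator = (margins > 0).astype(float)
    indicator[np.arange(5), y] = -indicator.sum(axis=1)
    dW = X.T.dot(indicator) / 5 + 1e-3 * W
    W -= 1e-2 * dW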
1. Linear classifier
The score function is f(x_i, W, b) = W x_i + b, where W is called the weights and b the bias vector.
Example: x_i is a 32*32*3 image flattened into a [3072 x 1] column vector, and W is [10 x 3072]; each row of W is the classifier for one class.
With the bias trick, x_i becomes [3073 x 1] instead of [3072 x 1], the extra dimension holding the constant 1, and W becomes [10 x 3073]; the extra column of W holds the bias values b.
By folding the weights and the bias into a single matrix (Figure 1), we only need to learn one weight matrix instead of separate weight and bias matrices; a small sketch follows the figure.
Figure 1 (the bias trick: W and b combined into one matrix)
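A small sketch of the bias trick; the sizes match the example above, and the data here is random, purely for illustration:

import numpy as np

D, C = 3072, 10
W = 0.001 * np.random.randn(C, D)   # weights: one row per class
b = np.zeros((C, 1))                # bias vector
x = np.random.randn(D, 1)           # one image as a [3072 x 1] column vector

scores = W.dot(x) + b               # separate weights and bias: [10 x 1] class scores

W_ext = np.hstack([W, b])                 # append b as the last column: [10 x 3073]
x_ext = np.vstack([x, np.ones((1, 1))])   # append a constant 1: [3073 x 1]
scores_ext = W_ext.dot(x_ext)             # a single matrix multiply gives the same scores

assert np.allclose(scores, scores_ext)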
Multiplying the data by W produces the desired score matrix (see Figure 2 for a concrete example).
Figure 2 (example score matrix for several images)
Each row of W acts as the classifier for one class; the first row, for instance, is the cat classifier. In the Figure 2 example, the correct-class score is 3.2 for the first image (a cat) and 4.9 for the second.
2. Loss function
Multiclass Support Vector Machine loss
The i-th example consists of the image pixels x_i and a label y_i giving the index of the correct class. The score function takes the pixels and computes the vector of class scores f(x_i, W), abbreviated s; the score of the j-th class is the j-th element, s_j = f(x_i, W)_j, and s_{y_i} is the score of the correct class. The multiclass SVM loss for the i-th example is then defined as:

L_i = ∑_{j≠y_i} max(0, s_j − s_{y_i} + Δ)
About the threshold at zero: the max(0, −) function is often called the hinge loss. You will sometimes hear about the squared hinge loss SVM (L2-SVM), which uses max(0, −)² and penalizes violated margins more strongly (quadratically rather than linearly).
The unsquared version is the more standard one, but on some datasets the squared hinge loss works better; cross-validation can decide which to use. A tiny numeric comparison is sketched below.
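As a tiny illustration (the scores here are made up), the two losses treat the same margin violation differently:

def hinge(s_j, s_yi, delta=1.0):
    return max(0.0, s_j - s_yi + delta)

def squared_hinge(s_j, s_yi, delta=1.0):
    return max(0.0, s_j - s_yi + delta) ** 2

print(hinge(5.1, 4.1), squared_hinge(5.1, 4.1))   # 2.0 vs 4.0: the violation is penalized quadratically
print(hinge(1.0, 4.1), squared_hinge(1.0, 4.1))   # 0.0 vs 0.0: a satisfied margin costs nothing either way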
In the example above, the correct-class scores s_{y_i} are 3.2, 4.9, and −3.1 (with Δ = 1):
L_1 = max(0, 5.1 − 3.2 + 1) + max(0, −1.7 − 3.2 + 1) = 2.9 + 0 = 2.9
L_2 = max(0, 1.3 − 4.9 + 1) + max(0, 2.0 − 4.9 + 1) = 0
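The same computation in NumPy, using the three score vectors from this example (they also appear in the code comments further below); the class names are assumed to be cat, car, frog:

import numpy as np

scores = np.array([[3.2, 5.1, -1.7],    # image 1, correct class: cat (index 0)
                   [1.3, 4.9,  2.0],    # image 2, correct class: car (index 1)
                   [2.2, 2.5, -3.1]])   # image 3, correct class: frog (index 2)
y = np.array([0, 1, 2])

correct = scores[np.arange(3), y][:, None]       # [[3.2], [4.9], [-3.1]]
margins = np.maximum(0, scores - correct + 1.0)  # Delta = 1
margins[np.arange(3), y] = 0.0                   # the correct class is skipped

print(margins.sum(axis=1))   # per-image losses: [2.9, 0.0, 12.9]
print(margins.sum() / 3)     # average data loss, about 5.27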
For example, in CIFAR-10 we have a training set of N = 50,000 images, each with D = 32x32x3 = 3072 pixels, and K = 10, since the images fall into 10 different classes (dog, cat, car, and so on). The score function is defined as f(x_i, W) = W x_i, a mapping from the raw image pixels to the class scores.
Loss function (Softmax classifier):
The Softmax classifier can be seen as the generalization of the logistic regression classifier to multiple classes. The SVM treats the outputs f(x_i, W) as scores for each class, which are uncalibrated and hard to interpret directly. The Softmax classifier instead outputs normalized class probabilities, which are more intuitive and have a probabilistic interpretation.
In the Softmax classifier the mapping f(x_i, W) = W x_i stays the same, but the scores are now interpreted as unnormalized log-probabilities of the classes, and the hinge loss is replaced by the cross-entropy loss. The loss for the i-th example is:

L_i = −log( e^{f_{y_i}} / ∑_j e^{f_j} )

The term e^{f_j} / ∑_k e^{f_k} is the normalization (the softmax function); it makes the class probabilities sum to 1.
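A minimal sketch of this computation for a single example (the scores are made up; subtracting the maximum first is the usual numerical-stability trick and does not change the result):

import numpy as np

f = np.array([3.2, 5.1, -1.7])     # unnormalized class scores for one example
y = 0                              # index of the correct class

f = f - np.max(f)                  # shift so the highest score is 0 (numerical stability)
p = np.exp(f) / np.sum(np.exp(f))  # normalized class probabilities; p.sum() == 1
loss = -np.log(p[y])               # cross-entropy loss for this example
print(p, loss)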
In practice, the SVM and Softmax classifiers are often comparable: the performance difference between them is usually small, and people hold different opinions about which works better. Compared to the Softmax classifier, the SVM has a more "local objective", which can be seen either as a feature or as a drawback. Consider an example with scores [10, -2, 3] where the first class is correct. An SVM (with Δ = 1) sees that the correct class already beats the incorrect classes by more than the margin, so the loss is 0. The SVM does not care about the individual values: scores of [10, -100, -100] or [10, 9, 9] make no difference to it; as long as the margin of 1 is satisfied, the loss is 0.
3. Regularization
L1 regularization: R(W) = ∑_k ∑_l |W_{k,l}|
L2 regularization: R(W) = ∑_k ∑_l W_{k,l}²
The most commonly used regularization penalty is the L2 norm, which discourages large weights through an elementwise quadratic penalty over all parameters.
The full multiclass SVM loss:

L = (1/N) ∑_i L_i + λ R(W)

It consists of two parts: the data loss, which is the average loss L_i over all examples, and the regularization loss. Written out in full:

L = (1/N) ∑_i ∑_{j≠y_i} max(0, f(x_i; W)_j − f(x_i; W)_{y_i} + Δ) + λ ∑_k ∑_l W_{k,l}²

With the L2 penalty, the SVM gains the appealing max-margin property.
Its most attractive effect is that penalizing large weights improves generalization, because no single input dimension can by itself have an outsized influence on the scores.
In the vast majority of cases it is safe to simply set Δ = 1.0.
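A short sketch putting the pieces together; the weight matrix and the data-loss value here are hypothetical, and the real computation is done by svm_loss_vectorized below:

import numpy as np

W = 0.001 * np.random.randn(10, 3073)   # hypothetical weight matrix
data_loss = 5.27                        # e.g. the average hinge loss from the worked example
reg = 5e4                               # regularization strength (lambda)

R_l1 = np.sum(np.abs(W))                # L1 penalty: sum of absolute values
R_l2 = np.sum(W * W)                    # L2 penalty: sum of squares
full_loss = data_loss + 0.5 * reg * R_l2   # the 0.5 factor matches the code below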
4. Data preprocessing
mean_image=np.mean(X_train,axis=0) # mean of every feature column, 32*32*3 features in total
X_train-=mean_image
X_dev-=mean_image
X_test-=mean_image
X_val-=mean_image
X_train=np.hstack([X_train,np.ones((X_train.shape[0],1))])
X_dev=np.hstack([X_dev,np.ones((X_dev.shape[0],1))])
X_test=np.hstack([X_test,np.ones((X_test.shape[0],1))])
X_val=np.hstack([X_val,np.ones((X_val.shape[0],1))])
It is important to center the data by subtracting the mean of every feature. Here this means computing a mean image over the training set and subtracting it from every image, so that the pixel values roughly span [-127, 127]. A further common step is to scale each feature so that its values lie in [-1, 1]; of these, zero-mean centering is the more important one.
The hstack lines implement the bias trick from the linear-classifier section: x_i becomes [3073 x 1] instead of [3072 x 1], with the extra dimension holding the constant 1. np.hstack stacks the arrays column-wise, appending a column of ones to the data.
5. Summary
We defined a score function that maps image pixels to class scores.
In this section the score function is a linear function of the weights W and the bias b.
Unlike the kNN classifier, this parametric approach has the advantage that once the parameters are learned, the training data can be discarded. Predicting a new test image is also fast, since it only takes a single matrix multiplication with the weights W.
We introduced the bias trick, which folds the bias vector into the weight matrix so that only one matrix has to be tracked.
We defined loss functions (the two most common ones for linear classifiers: SVM and Softmax) that measure how well a given set of parameters agrees with the ground-truth labels of the training data. The loss is defined so that making good predictions on the training data is equivalent to having a small loss.
6. Selected code with explanations
The example from Figure 2 is used below to explain the vectorized computation of the loss and the weight gradient.
def svm_loss_vectorized(W, X, y, reg):
    loss = 0.0
    dW = np.zeros(W.shape)  # initialize the gradient as zero
    scores = X.dot(W)
    # scores = [[3.2, 5.1, -1.7], [1.3, 4.9, 2.0], [2.2, 2.5, -3.1]]
    num_train = X.shape[0]      # = 3
    num_classes = W.shape[1]    # = 3
    scores_correct = scores[np.arange(num_train), y]
    # np.arange(num_train) = [0, 1, 2], y = [0, 1, 2], scores_correct = [3.2, 4.9, -3.1]
    scores_correct = np.reshape(scores_correct, (num_train, 1))
    # scores_correct is now 3x1: [[3.2], [4.9], [-3.1]]
    margins = scores - scores_correct + 1.0
    # = [[1, 2.9, -3.9], [-2.6, 1, -1.9], [6.3, 6.6, 1]]
    margins[np.arange(num_train), y] = 0.0
    # margins[[0, 1, 2], [0, 1, 2]] = 0.0, so margins = [[0, 2.9, -3.9], [-2.6, 0, -1.9], [6.3, 6.6, 0]]
    margins[margins <= 0] = 0.0   # margins = [[0, 2.9, 0], [0, 0, 0], [6.3, 6.6, 0]]
    loss += np.sum(margins) / num_train   # loss = (2.9 + 6.3 + 6.6) / 3 ≈ 5.27
    loss += 0.5 * reg * np.sum(W * W)     # L2 regularization; reg is the regularization strength
    margins[margins > 0] = 1.0    # margins = [[0, 1, 0], [0, 0, 0], [1, 1, 0]]
    row_sum = np.sum(margins, axis=1)     # row_sum = [1, 0, 2]
    margins[np.arange(num_train), y] = -row_sum
    # margins = [[-1, 1, 0], [0, 0, 0], [1, 1, -2]]
    dW += np.dot(X.T, margins) / num_train + reg * W
    return loss, dW
learning_rates=[1.75e-7,1.25e-7,2e-7,1.5e-7,0.75e-7]
regularization_strengths = [3e4,3.5e4,4e4,4.5e4,5e4,3.25e4,3.75e4,4.25e4,4.75e4]
The learning rate and regularization strength are swept over these values: every pair is tried, and the pair with the highest validation accuracy is used to predict on the test set.
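A condensed sketch of that loop (it assumes the MulticlassSupportVector class and the data splits defined later in this post; the full version, with the same variable names, appears in SVMexercise.py at the end):

results = {}
best_val, best_svm = -1.0, None
for lr in learning_rates:
    for reg in regularization_strengths:
        svm = SVM.MulticlassSupportVector()
        svm.train(X_train, y_train, learning_rate=lr, reg=reg, num_iters=1500)
        train_acc = np.mean(svm.predict(X_train) == y_train)
        val_acc = np.mean(svm.predict(X_val) == y_val)
        results[(lr, reg)] = (train_acc, val_acc)
        if val_acc > best_val:
            best_val, best_svm = val_acc, svm   # keep the model that does best on validation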
# -*- coding: utf-8 -*-
import pickle as p
import numpy as np
import os
def load_CIFAR_batch(filename):
    """ Load a single batch of the CIFAR dataset """
    with open(filename, 'rb') as f:
        datadict = p.load(f, encoding='latin1')
        X = datadict['data']
        Y = datadict['labels']
        X = X.reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype("float")
        Y = np.array(Y)
        return X, Y

def load_CIFAR10(ROOT):
    """ Load all of the CIFAR data """
    xs = []
    ys = []
    for b in range(1, 6):
        f = os.path.join(ROOT, 'data_batch_%d' % (b,))
        X, Y = load_CIFAR_batch(f)
        xs.append(X)   # collect all batches
        ys.append(Y)
    Xtr = np.concatenate(xs)   # stack the batches; Xtr has shape (50000, 32, 32, 3)
    Ytr = np.concatenate(ys)
    del X, Y
    Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch'))
    return Xtr, Ytr, Xte, Yte
import numpy as np
import matplotlib.pyplot as plt
# Load the CIFAR-10 dataset
cifar10_dir = 'datasets/cifar-10-batches-py'
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
# Check the shapes of the loaded data
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
num_train=40000
num_validation=1000
num_test=1000
num_dev=50
For this experiment all 50,000 training images and 10,000 test images were loaded, but only 40,000 of the training images were used for training, 1,000 (from the training set) for validation, and 1,000 (from the test set) for testing. In theory an 8:1:1 split would be more standard, but a somewhat smaller test set is not a big problem here. Looking at the final results, the accuracy seems reasonable; with that, the SVM is essentially understood, although the gradient computation still needs more study.
lr 7.500000e-08 reg 3.000000e+04 train accuracy: 0.376450 val accracy: 0.358400
lr 7.500000e-08 reg 3.250000e+04 train accuracy: 0.381950 val accracy: 0.361400
lr 7.500000e-08 reg 3.500000e+04 train accuracy: 0.382025 val accracy: 0.365800
lr 7.500000e-08 reg 3.750000e+04 train accuracy: 0.374625 val accracy: 0.355600
lr 7.500000e-08 reg 4.000000e+04 train accuracy: 0.371850 val accracy: 0.355000
lr 7.500000e-08 reg 4.250000e+04 train accuracy: 0.381125 val accracy: 0.364800
lr 7.500000e-08 reg 4.500000e+04 train accuracy: 0.373275 val accracy: 0.354400
lr 7.500000e-08 reg 4.750000e+04 train accuracy: 0.379900 val accracy: 0.362600
lr 7.500000e-08 reg 5.000000e+04 train accuracy: 0.376075 val accracy: 0.357800
lr 1.250000e-07 reg 3.000000e+04 train accuracy: 0.377375 val accracy: 0.362000
lr 1.250000e-07 reg 3.250000e+04 train accuracy: 0.375225 val accracy: 0.360800
lr 1.250000e-07 reg 3.500000e+04 train accuracy: 0.375675 val accracy: 0.359200
lr 1.250000e-07 reg 3.750000e+04 train accuracy: 0.372925 val accracy: 0.354600
lr 1.250000e-07 reg 4.000000e+04 train accuracy: 0.373200 val accracy: 0.355800
lr 1.250000e-07 reg 4.250000e+04 train accuracy: 0.375825 val accracy: 0.355600
lr 1.250000e-07 reg 4.500000e+04 train accuracy: 0.374375 val accracy: 0.352800
lr 1.250000e-07 reg 4.750000e+04 train accuracy: 0.370075 val accracy: 0.351200
lr 1.250000e-07 reg 5.000000e+04 train accuracy: 0.366825 val accracy: 0.351800
lr 1.500000e-07 reg 3.000000e+04 train accuracy: 0.376625 val accracy: 0.358200
lr 1.500000e-07 reg 3.250000e+04 train accuracy: 0.372375 val accracy: 0.359800
lr 1.500000e-07 reg 3.500000e+04 train accuracy: 0.371750 val accracy: 0.357800
lr 1.500000e-07 reg 3.750000e+04 train accuracy: 0.372925 val accracy: 0.356600
lr 1.500000e-07 reg 4.000000e+04 train accuracy: 0.375975 val accracy: 0.360800
lr 1.500000e-07 reg 4.250000e+04 train accuracy: 0.377725 val accracy: 0.363200
lr 1.500000e-07 reg 4.500000e+04 train accuracy: 0.363225 val accracy: 0.348400
lr 1.500000e-07 reg 4.750000e+04 train accuracy: 0.373600 val accracy: 0.354400
lr 1.500000e-07 reg 5.000000e+04 train accuracy: 0.371425 val accracy: 0.352200
lr 1.750000e-07 reg 3.000000e+04 train accuracy: 0.377325 val accracy: 0.363600
lr 1.750000e-07 reg 3.250000e+04 train accuracy: 0.375100 val accracy: 0.357400
lr 1.750000e-07 reg 3.500000e+04 train accuracy: 0.371375 val accracy: 0.352600
lr 1.750000e-07 reg 3.750000e+04 train accuracy: 0.371575 val accracy: 0.355000
lr 1.750000e-07 reg 4.000000e+04 train accuracy: 0.370800 val accracy: 0.349200
lr 1.750000e-07 reg 4.250000e+04 train accuracy: 0.372600 val accracy: 0.361400
lr 1.750000e-07 reg 4.500000e+04 train accuracy: 0.354825 val accracy: 0.341600
lr 1.750000e-07 reg 4.750000e+04 train accuracy: 0.356025 val accracy: 0.340000
lr 1.750000e-07 reg 5.000000e+04 train accuracy: 0.364875 val accracy: 0.351400
lr 2.000000e-07 reg 3.000000e+04 train accuracy: 0.373575 val accracy: 0.354600
lr 2.000000e-07 reg 3.250000e+04 train accuracy: 0.381775 val accracy: 0.365000
lr 2.000000e-07 reg 3.500000e+04 train accuracy: 0.364675 val accracy: 0.353400
lr 2.000000e-07 reg 3.750000e+04 train accuracy: 0.363400 val accracy: 0.349800
lr 2.000000e-07 reg 4.000000e+04 train accuracy: 0.367050 val accracy: 0.340800
lr 2.000000e-07 reg 4.250000e+04 train accuracy: 0.359325 val accracy: 0.340000
lr 2.000000e-07 reg 4.500000e+04 train accuracy: 0.368375 val accracy: 0.347800
lr 2.000000e-07 reg 4.750000e+04 train accuracy: 0.365100 val accracy: 0.352400
lr 2.000000e-07 reg 5.000000e+04 train accuracy: 0.364225 val accracy: 0.349000
best_val_accuracy: 0.365800
test_accuracy: 0.383400
plt.plot(loss_history)
plt.title(u"curve of loss")
plt.xlabel('num_iters')
plt.ylabel('loss')
plt.show()
7. Full code
SVM.py
import numpy as np
class MulticlassSupportVector(object):
    def __init__(self):
        self.W = None
        self.shape = None

    def train(self, X, y, learning_rate=1e-3, reg=1e-5, num_iters=200,
              batch_size=200, verbose=True):  # note the default hyperparameters used here
        num_train, dim = X.shape
        # assume y takes values 0...K-1 where K is number of classes
        num_classes = np.max(y) + 1
        if self.W is None:
            # lazily initialize W
            self.W = 0.001 * np.random.randn(dim, num_classes)
        # Run stochastic gradient descent (mini-batch) to optimize W
        loss_history = []
        for it in range(num_iters):  # each iteration samples a random batch for gradient descent
            # sample a minibatch of training data without replacement
            sample_index = np.random.choice(num_train, batch_size, replace=False)
            X_batch = X[sample_index, :]   # batch_size by D
            y_batch = y[sample_index]      # batch_size labels
            # evaluate loss and gradient
            loss, grad = svm_loss_vectorized(self.W, X_batch, y_batch, reg)
            loss_history.append(loss)
            # perform parameter update
            self.W += -learning_rate * grad
            if verbose and it % 100 == 0:
                print('Iteration %d / %d: loss %f' % (it, num_iters, loss))
        return loss_history

    def predict(self, X):
        # the predicted class is simply the one with the highest score
        y_pred = np.argmax(X.dot(self.W), axis=1)
        return y_pred
def svm_loss_vectorized(W, X, y, reg):
    loss = 0.0
    dW = np.zeros(W.shape)  # initialize the gradient as zero
    scores = X.dot(W)  # N by C
    num_train = X.shape[0]
    num_classes = W.shape[1]
    scores_correct = scores[np.arange(num_train), y]  # 1 by N
    scores_correct = np.reshape(scores_correct, (num_train, 1))  # N by 1
    margins = scores - scores_correct + 1.0  # N by C
    margins[np.arange(num_train), y] = 0.0
    margins[margins <= 0] = 0.0
    loss += np.sum(margins) / num_train
    loss += 0.5 * reg * np.sum(W * W)
    # compute the gradient
    margins[margins > 0] = 1.0
    row_sum = np.sum(margins, axis=1)  # 1 by N
    margins[np.arange(num_train), y] = -row_sum
    dW += np.dot(X.T, margins) / num_train + reg * W  # D by C
    return loss, dW
def svm_loss_naive(W, X, y, reg):
    """
    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength
    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    dW = np.zeros(W.shape)  # initialize the gradient as zero
    # compute the loss and the gradient
    num_classes = W.shape[1]
    num_train = X.shape[0]
    loss = 0.0
    for i in range(num_train):
        scores = X[i].dot(W)
        correct_class_score = scores[y[i]]
        for j in range(num_classes):
            if j == y[i]:  # by the formula, the correct class is skipped
                continue
            margin = scores[j] - correct_class_score + 1  # note delta = 1
            if margin > 0:
                loss += margin
                dW[:, y[i]] += -X[i, :]  # from ∇_{W_{y_i}} L_i = -(∑_{j≠y_i} 1(x_i W_j - x_i W_{y_i} + 1 > 0)) x_i^T
                dW[:, j] += X[i, :]      # from ∇_{W_j} L_i = 1(x_i W_j - x_i W_{y_i} + 1 > 0) x_i^T, for j ≠ y_i
    # Right now the loss is a sum over all training examples, but we want it
    # to be an average instead so we divide by num_train.
    loss /= num_train
    dW /= num_train
    # Add regularization to the loss.
    loss += 0.5 * reg * np.sum(W * W)
    dW += reg * W
    return loss, dW
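As noted above, the gradient computation is the trickiest part. A common sanity check (not part of the original SVM.py; sketched here under the assumption that svm_loss_naive and a small data batch such as X_dev, y_dev are available) is to compare the analytic gradient with a centered numerical difference at a few random positions of W:

import numpy as np

def grad_check_sparse(f, W, analytic_grad, num_checks=5, h=1e-5):
    """Compare analytic_grad with a numerical gradient of f at a few random entries of W."""
    for _ in range(num_checks):
        ix = tuple(np.random.randint(n) for n in W.shape)
        old_value = W[ix]
        W[ix] = old_value + h
        fxph = f(W)                      # loss at W + h
        W[ix] = old_value - h
        fxmh = f(W)                      # loss at W - h
        W[ix] = old_value                # restore the original value
        grad_numerical = (fxph - fxmh) / (2 * h)
        grad_analytic = analytic_grad[ix]
        rel_error = abs(grad_numerical - grad_analytic) / \
                    (abs(grad_numerical) + abs(grad_analytic) + 1e-12)
        print('numerical: %f analytic: %f, relative error: %e'
              % (grad_numerical, grad_analytic, rel_error))

# Hypothetical usage on a small random problem:
# W = 0.001 * np.random.randn(3073, 10)
# loss, dW = svm_loss_naive(W, X_dev, y_dev, 0.0)
# grad_check_sparse(lambda w: svm_loss_naive(w, X_dev, y_dev, 0.0)[0], W, dW)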
SVMexercise.py
# -*- coding: utf-8 -*-
import pickle as p
import numpy as np
import os
def load_CIFAR_batch(filename):
    """ Load a single batch of the CIFAR dataset """
    with open(filename, 'rb') as f:
        datadict = p.load(f, encoding='latin1')
        X = datadict['data']
        Y = datadict['labels']
        X = X.reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype("float")
        Y = np.array(Y)
        return X, Y

def load_CIFAR10(ROOT):
    """ Load all of the CIFAR data """
    xs = []
    ys = []
    for b in range(1, 6):
        f = os.path.join(ROOT, 'data_batch_%d' % (b,))
        X, Y = load_CIFAR_batch(f)
        xs.append(X)   # collect all batches
        ys.append(Y)
    Xtr = np.concatenate(xs)   # stack the batches; Xtr has shape (50000, 32, 32, 3)
    Ytr = np.concatenate(ys)
    del X, Y
    Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch'))
    return Xtr, Ytr, Xte, Yte
import numpy as np
import matplotlib.pyplot as plt
# Load the CIFAR-10 dataset
cifar10_dir = 'datasets/cifar-10-batches-py'
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
# Check the shapes of the loaded data
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
num_train=40000
num_validation=1000
num_test=1000
num_dev=50
mask=range(num_train,num_train+num_validation)
X_val=X_train[mask]
y_val=y_train[mask]
mask=range(num_train)
X_train=X_train[mask]
y_train=y_train[mask]
mask=np.random.choice(num_train,num_dev,replace=False)
X_dev=X_train[mask]
y_dev=y_train[mask]
mask=range(num_test)
X_test=X_test[mask]
y_test=y_test[mask]
X_train=np.reshape(X_train,(X_train.shape[0],-1))
X_dev=np.reshape(X_dev,(X_dev.shape[0],-1))
X_val=np.reshape(X_val,(X_val.shape[0],-1))
X_test=np.reshape(X_test,(X_test.shape[0],-1))
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
# First, compute the mean image over the training data
mean_image=np.mean(X_train,axis=0) # mean of every feature column, 32*32*3 features in total
print(mean_image[:10]) # inspect the first few feature means
plt.figure(figsize=(4,4)) # set the figure size
plt.imshow(mean_image.reshape((32,32,3)).astype('uint8')) # visualize the mean image
#plt.show()
X_train-=mean_image
X_dev-=mean_image
X_test-=mean_image
X_val-=mean_image
X_train=np.hstack([X_train,np.ones((X_train.shape[0],1))])
X_dev=np.hstack([X_dev,np.ones((X_dev.shape[0],1))])
X_test=np.hstack([X_test,np.ones((X_test.shape[0],1))])
X_val=np.hstack([X_val,np.ones((X_val.shape[0],1))])
import SVM
learning_rates=[1.75e-7,1.25e-7,2e-7,1.5e-7,0.75e-7]
regularization_strengths = [3e4,3.5e4,4e4,4.5e4,5e4,3.25e4,3.75e4,4.25e4,4.75e4]
results={}
best_val=-1
best_svm = None
for rate in learning_rates:
    for regular in regularization_strengths:
        svm = SVM.MulticlassSupportVector()
        svm.train(X_train, y_train, learning_rate=rate, reg=regular, num_iters=1500)
        y_train_pred = svm.predict(X_train)
        accuracy_train = np.mean(y_train == y_train_pred)
        y_val_pred = svm.predict(X_val)
        accuracy_val = np.mean(y_val == y_val_pred)
        results[(rate, regular)] = (accuracy_train, accuracy_val)
        if best_val < accuracy_val:
            best_val = accuracy_val
            best_svm = svm
for lr, reg in sorted(results):
    train_accuracy, val_accuracy = results[(lr, reg)]
    print('lr %e reg %e train accuracy: %f val accracy: %f' % (lr, reg, train_accuracy, val_accuracy))
print('best_val_accuracy: %f'%(best_val))
y_test_pred = best_svm.predict(X_test)
acc_test = np.mean(y_test == y_test_pred)
print('test_accuracy: %f'%(acc_test))