Machine Learning (Zhou Zhihua) - Personal Exercise 13.4
阿新 • Published: 2019-01-05
13.4 Download or implement the TSVM algorithm yourself. Choose two UCI datasets, use 30% of the samples as test samples, 10% as labelled samples, and 60% as unlabelled samples. Train a TSVM that exploits the unlabelled samples and an SVM that uses only the labelled samples, and compare their performance.
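The 30/10/60 split required by the exercise can also be produced with scikit-learn's `train_test_split` instead of manual slicing. A minimal sketch (the `random_state` values here are arbitrary assumptions, not from the original post):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
# keep classes 2 and 3 only (100 samples) and map their labels to {-1, +1}
data, label = iris.data[50:], iris.target[50:] * 2 - 3

# first carve off 30 stratified test samples, leaving 70 for training
train_d, test_d, train_c, test_c = train_test_split(
    data, label, test_size=30, stratify=label, random_state=0)
# then split the remaining 70 into 10 labelled / 60 unlabelled samples
l_d, u_d, l_c, u_c = train_test_split(
    train_d, train_c, train_size=10, stratify=train_c, random_state=0)
```

Stratifying both splits keeps the two classes balanced in every subset, which is what the manual slicing in the listing below achieves by hand.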
I chose the most commonly used iris dataset and built a TSVM on top of scikit-learn's SVM implementation. To make the effect easier to show, only the second and third classes of iris are used, with class labels mapped to -1 and 1, and the final training result is visualized using two of the attributes. To allow a direct comparison with the book, a linear hyperplane is used to separate the classes, so that the weight coefficients and slack variables can be obtained directly. The code is as follows:
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
import sklearn.svm as svm
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
# balanced data: each class contributes the same number of samples
iris = datasets.load_iris()
# data, label = iris.data[50:, [0, 3]], iris.target[50:] * 2 - 3  # labels mapped to -1, 1
data, label = iris.data[50:, :], iris.target[50:] * 2 - 3  # all 4 attributes; labels become -1, 1
# standardizing
sc = StandardScaler()
sc.fit(data)
data = sc.transform(data)
test_d, test_c = np.concatenate((data[:15], data[50:65])), np.concatenate((label[:15], label[50:65]))  # 30 test samples
l_d, l_c = np.concatenate((data[45:50], data[95:])), np.concatenate((label[45:50], label[95:]))  # 10 labelled samples
u_d = np.concatenate((data[15:45], data[65:95]))  # 60 unlabelled samples
lu_d = np.concatenate((l_d, u_d))
n = len(l_d)+len(u_d)
clf1 = svm.SVC(C=1, kernel='linear')
clf1.fit(l_d, l_c)
clf0 = svm.SVC(C=1, kernel='linear')
clf0.fit(l_d, l_c)
lu_c_0 = clf0.predict(lu_d)
u_c_new = clf1.predict(u_d) # the pseudo label for unlabelled samples
cu, cl = 0.001, 1
sample_weight = np.ones(n)
sample_weight[len(l_c):] = cu
id_set = np.arange(len(u_d))
while cu < cl:
    lu_c = np.concatenate((l_c, u_c_new))  # 70 labels: 10 true + 60 pseudo
    clf1.fit(lu_d, lu_c, sample_weight=sample_weight)
    while True:
        u_c_new = clf1.predict(u_d)  # pseudo labels for the unlabelled samples
        u_dist = clf1.decision_function(u_d)  # functional margin w.x + b of each sample
        # slack xi_i = 1 - y_i * (w.x_i + b); decision_function already returns
        # the functional margin, so no extra scaling by ||w|| is needed
        epsilon = 1 - u_dist * u_c_new
        plus_set, plus_id = epsilon[u_c_new > 0], id_set[u_c_new > 0]  # pseudo-positive samples
        minus_set, minus_id = epsilon[u_c_new < 0], id_set[u_c_new < 0]  # pseudo-negative samples
        plus_max_id, minus_max_id = plus_id[np.argmax(plus_set)], minus_id[np.argmax(minus_set)]
        a, b = epsilon[plus_max_id], epsilon[minus_max_id]
        if a > 0 and b > 0 and a + b > 2:
            # swap the pair of most-violating pseudo labels and refit
            u_c_new[plus_max_id], u_c_new[minus_max_id] = -u_c_new[plus_max_id], -u_c_new[minus_max_id]
            lu_c = np.concatenate((l_c, u_c_new))
            clf1.fit(lu_d, lu_c, sample_weight=sample_weight)
        else:
            break
    cu = min(cu * 2, cl)
    sample_weight[len(l_c):] = cu
lu_c = np.concatenate((l_c, u_c_new))
test_c1 = clf0.predict(test_d)
test_c2 = clf1.predict(test_d)
score1 = clf0.score(test_d,test_c)
score2 = clf1.score(test_d,test_c)
fig = plt.figure(figsize=(16,4))
ax = fig.add_subplot(131)
ax.scatter(test_d[:,0],test_d[:,2],c=test_c,marker='o',cmap=plt.cm.coolwarm)
ax.set_title('True Labels for test samples',fontsize=16)
ax1 = fig.add_subplot(132)
ax1.scatter(test_d[:,0],test_d[:,2],c=test_c1,marker='o',cmap=plt.cm.coolwarm)
ax1.scatter(lu_d[:,0], lu_d[:,2], c=lu_c_0, marker='o',s=10,cmap=plt.cm.coolwarm,alpha=.6)
ax1.set_title('SVM, score: {0:.2f}%'.format(score1*100),fontsize=16)
ax2 = fig.add_subplot(133)
ax2.scatter(test_d[:,0],test_d[:,2],c=test_c2,marker='o',cmap=plt.cm.coolwarm)
ax2.scatter(lu_d[:,0], lu_d[:,2], c=lu_c, marker='o',s=10,cmap=plt.cm.coolwarm,alpha=.6)
ax2.set_title('TSVM, score: {0:.2f}%'.format(score2*100),fontsize=16)
for a in [ax,ax1,ax2]:
a.set_xlabel(iris.feature_names[0])
a.set_ylabel(iris.feature_names[2])
plt.show()
The output of the code above is shown below. As the figure shows, on the iris dataset TSVM improves the final classification accuracy by exploiting the unlabelled data, from 96.67% for the SVM to 100% for the TSVM, whose predicted labels match the true labels of the test set exactly. In addition, testing shows that on the iris dataset, if a nonlinear kernel such as RBF is used, TSVM gives no performance improvement over the plain SVM.
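The closing remark about the RBF kernel deserves one implementation note: with a nonlinear kernel, `clf.coef_` no longer exists, so a kernelized TSVM must compute the slack directly from `decision_function`, which already returns the functional margin f(x), giving ξ_i = 1 - y_i f(x_i). A minimal sketch of that step (the 5-per-class labelled split here is an illustrative assumption, not the split used above):

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
data, label = iris.data[50:], iris.target[50:] * 2 - 3  # classes 2, 3 -> -1, +1
data = StandardScaler().fit_transform(data)

l_idx = np.r_[0:5, 50:55]    # 5 labelled samples per class
u_idx = np.r_[5:50, 55:100]  # remaining 90 samples treated as unlabelled

clf = svm.SVC(C=1, kernel='rbf')
clf.fit(data[l_idx], label[l_idx])

u_c = clf.predict(data[u_idx])                      # pseudo labels
xi = 1 - u_c * clf.decision_function(data[u_idx])   # slack, no clf.coef_ needed
```

The rest of the label-swapping loop carries over unchanged, since it only depends on the slack values and the pseudo labels, not on the weight vector.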