Handwritten Digit Recognition with SVM

1 Import the data and inspect the features

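import pandas as pd
from matplotlib import pyplot as plt
from sklearn import svm
from sklearn.model_selection import train_test_split
# In train.csv the first column is the label and the remaining columns are pixels
train = pd.read_csv('Digit/train.csv')
test = pd.read_csv('Digit/test.csv')
# Show sample i as a 28x28 grayscale image, with its label in the title
i = 2
img = train.iloc[i, 1:].to_numpy()  # as_matrix() was removed in pandas 1.0
img = img.reshape(28, 28)
plt.imshow(img, cmap='gray')
plt.title('digit is :' + str(train.iloc[i, 0]))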

plt.hist(train.iloc[i,1:])

The features are grayscale intensities, with values ranging from 0 to 255.

2 Split features and labels, and train a first model

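# Use the first 5000 rows: pixel columns as features, first column as label
img = train.iloc[:5000, 1:]
label = train.iloc[:5000, 0]
train_data, test_data, train_label, test_label = train_test_split(
    img, label, test_size=0.2, random_state=0)
# Train
# gamma='auto' (1/n_features) was the SVC default before scikit-learn 0.22;
# it is set explicitly here so that newer versions behave the same way
svc = svm.SVC(C=3, gamma='auto')
svc.fit(train_data, train_label)
print(svc.score(train_data, train_label))  # accuracy on the training split
print(svc.score(test_data, test_label))    # accuracy on the held-out split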

1.0
0.1

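The held-out accuracy is no better than random guessing over ten classes, while the training accuracy is 1.0, so the model has simply memorized the training split. SVC with its default kernel settings is not usable here, and the hyperparameters need tuning.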
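One plausible reason for this result can be checked by hand. The sketch below is an illustration rather than the author's code: it builds two synthetic 784-pixel "images" (random stand-ins, not MNIST rows) and evaluates the RBF kernel between them with gamma = 1/784, which is what gamma='auto' resolves to. The helper name rbf_kernel is made up for this example.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


random.seed(0)
n_features = 28 * 28

# Two synthetic images with integer pixel values in 0..255 (assumption: random
# pixels are only a stand-in for real digit images).
img_a = [random.randint(0, 255) for _ in range(n_features)]
img_b = [random.randint(0, 255) for _ in range(n_features)]

gamma = 1.0 / n_features  # value used by gamma='auto'

# Kernel on raw 0-255 pixels: the squared distance is huge, so exp() underflows.
k_raw = rbf_kernel(img_a, img_b, gamma)

# Same pair after dividing every pixel by 255.
k_scaled = rbf_kernel([p / 255.0 for p in img_a],
                      [p / 255.0 for p in img_b], gamma)

print("raw pixels:   ", k_raw)
print("scaled pixels:", k_scaled)
```

On raw pixels the kernel between two different images is effectively zero, so every training point looks similar only to itself. That is consistent with a perfect training score and chance-level accuracy on unseen data.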

3 Hyperparameter tuning

from sklearn.model_selection import GridSearchCV
svc_param = {'C': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
             'kernel': ['linear', 'rbf'],
             'gamma': [0.5, 0.2, 0.1, 0.001, 0.0001]}
def grid(model, data, label, param):
    # 5-fold cross-validated grid search; returns the best parameters and score
    search = GridSearchCV(model, param, cv=5, scoring='accuracy')
    search.fit(data, label)
    return search.best_params_, search.best_score_

This is extremely slow: SVM training gets slow when there are many features.
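Part of the cost comes from the size of the grid itself. The sketch below counts how many models the search has to fit; count_candidates is a helper made up for this example. It also counts a split grid, passed as a list of dicts (a form GridSearchCV accepts), that drops gamma for the linear kernel, since SVC only uses gamma for the rbf, poly and sigmoid kernels.

```python
from itertools import product


def count_candidates(grid):
    """Number of parameter combinations in one grid (a dict of value lists)."""
    return len(list(product(*grid.values())))


c_values = list(range(1, 11))
gamma_values = [0.5, 0.2, 0.1, 0.001, 0.0001]
cv = 5

# Single grid as used above: every gamma is also tried with the linear kernel.
svc_param = {'C': c_values, 'kernel': ['linear', 'rbf'], 'gamma': gamma_values}
n_single = count_candidates(svc_param)
fits_single = n_single * cv

# Split grid: gamma is only searched for the rbf kernel.
split_param = [
    {'kernel': ['linear'], 'C': c_values},
    {'kernel': ['rbf'], 'C': c_values, 'gamma': gamma_values},
]
n_split = sum(count_candidates(g) for g in split_param)
fits_split = n_split * cv

print(n_single, fits_single)  # candidates and cross-validation fits, single grid
print(n_split, fits_split)    # candidates and cross-validation fits, split grid
```

With the default refit=True, GridSearchCV performs one more fit on the full data after the search. The split grid returns the same best score as the single grid, because the extra linear candidates only repeat the same model with an unused gamma.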

best_params, best_score = grid(svc, train_data, train_label, svc_param)
print(best_params)
print(best_score)

{'kernel': 'linear', 'C': 1, 'gamma': 0.5}
0.91475

# Refit on the training split with the best parameters found
svc = svm.SVC(**best_params)
svc.fit(train_data, train_label)
print(svc.score(train_data, train_label))
print(svc.score(test_data, test_label))

1.0
0.91

With the best parameters, held-out accuracy improves markedly over the first model (from 0.1 to 0.91).

4 Predict and submit

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28000 entries, 0 to 27999
Columns: 784 entries, pixel0 to pixel783
dtypes: int64(784)
memory usage: 167.5 MB

pred = svc.predict(test)
# Submission file: 1-based ImageId plus the predicted Label
sub_svc = pd.DataFrame({'ImageId': list(range(1, len(pred) + 1)), 'Label': pred})
sub_svc.to_csv('sub_svc.csv', header=True, index=False)

5 Improvements

5.1 Increase the number of training samples (only 5000 were used at first)

number = 20000
img = train.iloc[:number, 1:]
label = train.iloc[:number, 0]
train_data, test_data, train_label, test_label = train_test_split(
    img, label, test_size=0.2, random_state=0)
# Reuse the best parameters found on the 5000-sample search
svc = svm.SVC(**best_params)
svc.fit(train_data, train_label)
print(svc.score(train_data, train_label))
print(svc.score(test_data, test_label))

1.0
0.90375

pred2 = svc.predict(test)
sub_svc2 = pd.DataFrame({'ImageId': list(range(1, len(pred2) + 1)), 'Label': pred2})
sub_svc2.to_csv('sub_svc2.csv', header=True, index=False)

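Increasing the number of training samples gives only a slight improvement. (The held-out accuracy above, 0.90375, comes from a different and larger split, so it is not directly comparable with the earlier 0.91.)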

5.2 Feature scaling

Scale the grayscale values from 0-255 to the range 0-1.

# Scale the training pixels to [0, 1]; the test set is reloaded and scaled
# the same way before prediction further down
number = 5000
img = train.iloc[:number, 1:] / 255.0
label = train.iloc[:number, 0]
train_data, test_data, train_label, test_label = train_test_split(
    img, label, test_size=0.2, random_state=0)
svc_param = {'C': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
             'kernel': ['linear', 'rbf'],
             'gamma': [0.5, 0.2, 0.01, 0.001, 0.0001]}
def grid2(model, data, label, param):
    # 2-fold cross-validated grid search (fewer folds than before, to save time)
    search = GridSearchCV(model, param, cv=2, scoring='accuracy')
    search.fit(data, label)
    return search.best_params_, search.best_score_
best_params, best_score = grid2(svc, train_data, train_label, svc_param)
print(best_params)
print(best_score)

{'kernel': 'rbf', 'C': 4, 'gamma': 0.01}
0.94325

svc = svm.SVC(**best_params)
svc.fit(train_data, train_label)
print(svc.score(train_data, train_label))
print(svc.score(test_data, test_label))

0.99825
0.946

test = pd.read_csv('Digit/test.csv')
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28000 entries, 0 to 27999
Columns: 784 entries, pixel0 to pixel783
dtypes: int64(784)
memory usage: 167.5 MB

# Apply the same scaling to the test set before predicting
test = test / 255.0
pred3 = svc.predict(test)
sub_svc3 = pd.DataFrame({'ImageId': list(range(1, len(pred3) + 1)), 'Label': pred3})
sub_svc3.to_csv('sub_svc3.csv', header=True, index=False)

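The improvement is clear (held-out accuracy 0.946 versus 0.91 before scaling), which shows that feature scaling matters a great deal for SVM.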
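Scaling by hand means remembering to apply the same transformation to every dataset that reaches predict. One way to avoid that, sketched below, is to bundle the scaler and the classifier in a scikit-learn pipeline. This is not the author's code, and it rests on two assumptions: it uses scikit-learn's bundled 8x8 digits (pixel values 0 to 16) as a stand-in for the CSV files, and MinMaxScaler with default SVC settings in place of the manual division by 255 and the tuned parameters.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stand-in data: 1797 images of 8x8 pixels, flattened to 64 features.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The scaler is fitted on the training split only and then reused,
# unchanged, whenever the pipeline scores or predicts.
model = make_pipeline(MinMaxScaler(), SVC(kernel='rbf'))
model.fit(X_train, y_train)

acc = model.score(X_test, y_test)
print("held-out accuracy:", acc)
```

A pipeline like this can also be passed to GridSearchCV as the model, with parameter names prefixed by the step name (for example svc__C and svc__gamma), so the scaling is redone inside every cross-validation fold.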

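6 Summary:

1 SVM can classify handwritten digits, but because every pixel is a feature, the 28*28 = 784 features make training slow and hyperparameter tuning slower still.

2 Increasing the number of training samples helps accuracy, but not significantly.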

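3 Feature scaling is essential for SVM! Accuracy improves clearly after scaling, and the best hyperparameters change: on the raw 0-255 pixels the best kernel was linear, while on the scaled 0-1 pixels the best kernel is rbf, and held-out accuracy rises from 0.91 to 0.946.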