Feature Enhancement: Feature Selection
阿新 • Published: 2019-01-26
A good combination of data features does not need to be large for a model to perform well. Redundant features may not hurt model accuracy, but they waste CPU cycles on useless computation. PCA, for example, is mainly used to remove redundant, linearly correlated feature combinations, since such features contribute nothing extra to model training. Bad features, naturally, degrade model accuracy.
Feature selection differs somewhat from methods like PCA that rebuild features out of principal components: the features reconstructed by PCA are often impossible to interpret, whereas feature selection never modifies the feature values themselves, and instead focuses on finding the small set of features that contribute most to model performance.
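To make the distinction concrete, here is a minimal sketch on synthetic data (the toy feature matrix and target rule are invented purely for illustration): chi-squared selection reports which of the original columns it kept, while each PCA component blends all of the columns together.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.RandomState(0)
X = rng.rand(100, 5)                       # five hypothetical non-negative features
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)  # toy target driven by columns 0 and 3

selected = SelectKBest(chi2, k=2).fit(X, y)
print(selected.get_support(indices=True))  # indices of the original columns kept

components = PCA(n_components=2).fit(X)
print(components.components_)              # each new feature mixes all five columns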
Below we continue with the Titanic dataset, using feature selection to look for the best feature combination and to improve prediction accuracy.
Python source code:
# coding=utf-8
import pandas as pd
import numpy as np
import pylab as pl
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn import feature_selection

# ------------- download the data
titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')

# ------------- separate features and target
y = titanic['survived']
X = titanic.drop(['row.names', 'name', 'survived'], axis=1)

# ------------- fill missing ages with the mean, everything else with a placeholder
X['age'].fillna(X['age'].mean(), inplace=True)
X.fillna('UNKNOWN', inplace=True)

# ------------- split the data, holding out 25% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

# ------------- feature vectorization
vec = DictVectorizer()
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
X_test = vec.transform(X_test.to_dict(orient='records'))
print('Dimensions of handled vector:', len(vec.feature_names_))

# ------------- train a decision tree on all features and measure test accuracy
dt = DecisionTreeClassifier(criterion='entropy')
dt.fit(X_train, y_train)
print(dt.score(X_test, y_test))

# ------------- keep only the top 20% of features ranked by the chi-squared test,
# then retrain a decision tree with the same configuration and measure test accuracy
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=20)
X_train_fs = fs.fit_transform(X_train, y_train)
dt.fit(X_train_fs, y_train)
X_test_fs = fs.transform(X_test)
print(dt.score(X_test_fs, y_test))

# ------------- sweep feature percentiles at a fixed interval and score each
# candidate subset by 5-fold cross-validation on the training data
percentiles = range(1, 100, 2)
results = []
for i in percentiles:
    fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=i)
    X_train_fs = fs.fit_transform(X_train, y_train)
    scores = cross_val_score(dt, X_train_fs, y_train, cv=5)
    results = np.append(results, scores.mean())
print(results)

# ------------- find the percentile with the best cross-validation score;
# np.where returns an array, so take its first element instead of calling int()
# on it, which raises "TypeError: only integer scalar arrays can be converted
# to a scalar index"
opt = np.where(results == results.max())[0][0]
print('Optimal percentile of features:', percentiles[opt])

# ------------- retrain with the top 7% of features (the optimum found above)
# and measure performance on the test data
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=7)
X_train_fs = fs.fit_transform(X_train, y_train)
dt.fit(X_train_fs, y_train)
X_test_fs = fs.transform(X_test)
print(dt.score(X_test_fs, y_test))

# ------------- plot cross-validation accuracy against the feature percentile
pl.plot(percentiles, results)
pl.xlabel('percentiles of features')
pl.ylabel('accuracy')
pl.show()
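As an illustrative follow-up (not part of the original script), and assuming the vec and fs objects built above are still in scope, the chi-squared filter's boolean mask can be mapped back onto the DictVectorizer's column names; this is exactly the interpretability that PCA's reconstructed features lack:

selected_names = np.array(vec.feature_names_)[fs.get_support()]
print(len(selected_names), 'features kept, for example:', selected_names[:10])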
Analysis:
1. After the preliminary feature processing, both the training and test data end up with 474 feature dimensions.
2. Training the decision tree model directly on all 474 feature dimensions gives a test-set accuracy of about 81.76%.
3. Selecting only the top 20% of features and predicting with the same model configuration gives a test-set accuracy of about 82.37%.
4. Training and testing with different feature percentages at a fixed interval, as shown in the plot, the cross-validated accuracy fluctuates considerably; the model performs best when the top 7% of features are selected (a Pipeline-based variant of this sweep is sketched after this list).
5. Using only the top 7% of features, the final decision tree model reaches 85.71% accuracy on the test set, nearly 4 percentage points higher than the initial model trained on all features.
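As a side note, the manual percentile sweep above can also be expressed as a scikit-learn Pipeline tuned with GridSearchCV, which takes over the cross-validation loop and the best-index bookkeeping. This is a sketch of an alternative, not the post's original method; it reuses X_train, y_train, X_test, y_test from the script:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectPercentile, chi2

pipe = Pipeline([
    ('fs', SelectPercentile(chi2)),
    ('dt', DecisionTreeClassifier(criterion='entropy')),
])
search = GridSearchCV(pipe, {'fs__percentile': list(range(1, 100, 2))}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
print(search.score(X_test, y_test))  # refit on the best percentile, scored on test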