Splitting a Validation Set in Keras
Reposted from: https://www.cnblogs.com/bymo/p/9026198.html
When training a deep learning model, the dataset is usually split into a training set and a validation set. Keras supports the following ways of evaluating model performance:
1. Automatic splitting
In Keras, a portion of the dataset can be split off as a validation set, and the model's performance is evaluated on it at the end of every epoch.
Specifically, when calling model.fit() to train the model, the validation_split argument specifies the fraction of the data to hold out as the validation set.
# MLP with automatic validation set
from keras.models import Sequential
from keras.layers import Dense
import numpy
# fix random seed for reproducibility
numpy.random.seed(7)
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10)
validation_split is a float between 0 and 1 that specifies the fraction of the training data to set aside as a validation set. The validation samples take no part in training; at the end of each epoch the model's metrics, such as loss and accuracy, are computed on them.
Note that the validation_split slice is taken before any shuffling, so if your data is ordered you need to shuffle it manually before setting validation_split; otherwise the validation set may contain an unrepresentative sample.
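A minimal sketch of that manual shuffle, using assumed toy data: because Keras slices the validation set off the end of the arrays without shuffling first, an ordered label vector would otherwise end up entirely in the validation split.

```python
# Shuffle an ordered dataset by hand before relying on validation_split.
import numpy

numpy.random.seed(7)
X = numpy.arange(20).reshape(10, 2).astype(float)  # toy features, ordered
Y = numpy.array([0] * 5 + [1] * 5)                 # labels sorted by class

perm = numpy.random.permutation(len(X))  # one random permutation of indices
X, Y = X[perm], Y[perm]                  # shuffle features and labels together
# model.fit(X, Y, validation_split=0.33, ...) now draws a mixed validation set
```

The same index permutation is applied to both arrays so features stay paired with their labels.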
2. Manual splitting
Keras also allows you to specify the validation set explicitly when training a model.
For example, split the dataset with train_test_split() from sklearn, then pass the held-out portion to model.fit() via the validation_data argument.
# MLP with manual validation set
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split
import numpy
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# split into 67% for train and 33% for test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=seed)
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=150, batch_size=10)
3. K-fold cross validation
Split the dataset into k parts; in each round, train on (k-1) parts and validate on the remaining one. Running k rounds this way yields k models, and the average of the k performance scores is taken as the overall performance of the algorithm. k is typically 5 or 10.
- Advantage: gives a fairly robust estimate of model performance on unseen data.
- Disadvantage: computationally expensive. It may therefore be impractical when the dataset is large, the model is complex, or compute is limited, especially when training deep learning models.
sklearn.model_selection provides KFold as well as variants such as RepeatedKFold, LeaveOneOut, LeavePOut, ShuffleSplit, StratifiedKFold, GroupKFold, and TimeSeriesSplit.
The example below uses StratifiedKFold, which performs stratified sampling: each fold preserves the same class proportions as the original dataset.
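As a quick sanity check of that stratification property, on an assumed toy imbalanced label vector, every test fold should carry the same class ratio as the whole dataset:

```python
# StratifiedKFold keeps the 80/20 class ratio identical in every test fold.
import numpy
from sklearn.model_selection import StratifiedKFold

y = numpy.array([0] * 80 + [1] * 20)  # 80/20 class imbalance
X = numpy.zeros((100, 1))             # dummy features; only y drives the split
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
fold_positives = [int((y[test] == 1).sum()) for _, test in skf.split(X, y)]
# each 20-sample test fold contains exactly 4 positives, i.e. the original 20%
```

A plain KFold on the same sorted labels would give folds with wildly varying class counts, which is why stratification matters for classification problems.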
# MLP for Pima Indians Dataset with 10-fold cross validation
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import StratifiedKFold
import numpy
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
cvscores = []
for train, test in kfold.split(X, Y):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # Fit the model
    model.fit(X[train], Y[train], epochs=150, batch_size=10, verbose=0)
    # evaluate the model
    scores = model.evaluate(X[test], Y[test], verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))
References:
Evaluate the Performance Of Deep Learning Models in Keras
3.1. Cross-validation: evaluating estimator performance — scikit-learn 0.19.1 documentation