Udacity Machine Learning: Cross-Validation
阿新 • Published: 2019-01-31
Exercise: Train/Test Split in sklearn
#!/usr/bin/python
"""
PLEASE NOTE:
The train_test_split API moved from sklearn.cross_validation to
sklearn.model_selection (in the update from version 0.17 to 0.18).
The correct documentation for this quiz is here:
http://scikit-learn.org/0.17/modules/cross_validation.html
"""
from sklearn import datasets
from sklearn.svm import SVC
iris = datasets.load_iris()
features = iris.data
labels = iris.target
###############################################################
### YOUR CODE HERE
###############################################################
### import the relevant code and make your train/test split
### name the output datasets features_train, features_test,
### labels_train, and labels_test
# PLEASE NOTE: The import here changes depending on your version of sklearn
from sklearn import cross_validation # for version 0.17
# For version 0.18
# from sklearn.model_selection import train_test_split
### set the random_state to 0 and the test_size to 0.4 so
### we can exactly check your result
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(
    features, labels, test_size=0.4, random_state=0)
###############################################################
# DON'T CHANGE ANYTHING HERE
clf = SVC(kernel="linear", C=1.)
clf.fit(features_train, labels_train)
print(clf.score(features_test, labels_test))
##############################################################
def submitAcc():
    return clf.score(features_test, labels_test)
K-Fold Cross-Validation
- Without shuffling, a fold's test set may end up containing only a single class
- GridSearchCV determines its parameters via cross-validation
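The "all one class" pitfall is easy to reproduce on the iris data, whose labels are stored sorted by class; a minimal sketch of the problem and its stratified fix (assuming sklearn ≥ 0.18 for sklearn.model_selection):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)  # 150 samples: all 0s, then all 1s, then all 2s

# Plain KFold keeps the original order, so each test fold holds one class only.
kf = KFold(n_splits=3)
for _, test_idx in kf.split(X):
    print("KFold test classes:", np.unique(y[test_idx]).tolist())  # [0], then [1], then [2]

# StratifiedKFold preserves the class ratio inside every fold instead.
skf = StratifiedKFold(n_splits=3)
for _, test_idx in skf.split(X, y):
    print("StratifiedKFold test classes:", np.unique(y[test_idx]).tolist())  # [0, 1, 2] each time
```

A classifier trained on folds like the first kind never sees the held-out class at all, which is why shuffling or stratification matters before cross-validating.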
Note: the meaning of Udacity's own targetFeatureSplit function
labels, features = targetFeatureSplit(data)
data is a 2-D array, for example:
[
[1,12.1],
[0,14.1],
[1,13.1],
[1,15.2]
]
By default, the function's first return value is the first column, used as the labels:
labels = [1,0,1,1]
The second return value is the remaining columns, used as the training features:
features=
[
[12.1],
[14.1],
[13.1],
[15.2]
]
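The behavior described above can be sketched as a minimal reimplementation (this is not Udacity's original source in tools/feature_format.py, just a stand-in matching the documented split):

```python
def targetFeatureSplit(data):
    """Split a 2-D dataset: column 0 becomes the labels,
    the remaining columns become the feature vectors."""
    labels = []
    features = []
    for row in data:
        labels.append(row[0])
        features.append(row[1:])
    return labels, features

data = [
    [1, 12.1],
    [0, 14.1],
    [1, 13.1],
    [1, 15.2],
]
labels, features = targetFeatureSplit(data)
print(labels)    # [1, 0, 1, 1]
print(features)  # [[12.1], [14.1], [13.1], [15.2]]
```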
Exercise: Your First (Overfit) POI Identifier
Answer: 0.989473684211
- validate_poi.py
#!/usr/bin/python
"""
Starter code for the validation mini-project.
The first step toward building your POI identifier!
Start by loading/formatting the data
After that, it's not our code anymore--it's yours!
"""
import pickle
import sys
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
data_dict = pickle.load(open("../final_project/final_project_dataset.pkl", "rb"))
### first element is our labels, any added elements are predictor
### features. Keep this the same for the mini-project, but you'll
### have a different feature list when you do the final project.
features_list = ["poi", "salary"]
data = featureFormat(data_dict, features_list)
labels, features = targetFeatureSplit(data)
### it's all yours from here forward!
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(features, labels)
print(clf.score(features, labels))
Exercise: Deploying a Train/Test Regime
Answer: 0.724137931034
- validate_poi.py
#!/usr/bin/python
"""
Starter code for the validation mini-project.
The first step toward building your POI identifier!
Start by loading/formatting the data
After that, it's not our code anymore--it's yours!
"""
import pickle
import sys
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
data_dict = pickle.load(open("../final_project/final_project_dataset.pkl", "rb"))
### first element is our labels, any added elements are predictor
### features. Keep this the same for the mini-project, but you'll
### have a different feature list when you do the final project.
features_list = ["poi", "salary"]
data = featureFormat(data_dict, features_list)
labels, features = targetFeatureSplit(data)
### it's all yours from here forward!
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, random_state=42)
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(features_train, labels_train)
result = clf.predict(features_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(labels_test, result))
# print(clf.score(features_test, labels_test))
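The GridSearchCV bullet above can be demonstrated on the same iris data used in the first exercise; a minimal sketch, with a parameter grid chosen only for illustration (assuming sklearn ≥ 0.18):

```python
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# GridSearchCV fits one model per parameter combination per fold,
# then keeps the combination with the best mean cross-validation score.
param_grid = {"kernel": ["linear", "rbf"], "C": [1.0, 10.0]}
clf = GridSearchCV(SVC(), param_grid, cv=5)
clf.fit(iris.data, iris.target)
print(clf.best_params_)
print(clf.best_score_)
```

Because every score comes from held-out folds, the selected parameters are far less likely to be overfit than the single in-sample score of the first POI identifier above.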