Python Assignment: sklearn
Scikit-Learn: Machine Learning in Python
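Learning goals:
Learn the sklearn library for Python and get comfortable with three classification methods: Naive Bayes, SVM, and Random Forest. Complete the assignment, compare the results of the three methods, and briefly summarize what the training produced.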
Assignment:
In the second ML assignment you have to compare the performance of three different classification algorithms, namely Naive Bayes, SVM, and Random Forest. For this assignment you need to generate a random binary classification problem, and then train and test (using 10-fold cross validation) the three algorithms. For some algorithms, inner cross validation (5-fold) is needed to choose the parameters. Then, show the classification performance (per-fold and averaged) in the report, and briefly discuss the results.
Note:
The report must also contain a short description of the methodology used to obtain the results.
Steps:
1 Create a classification dataset (n samples >=1000, n features >=10)
2 Split the dataset using 10-fold cross validation
3 Train the algorithms
GaussianNB
SVC (possible C values [1e-02, 1e-01, 1e00, 1e01, 1e02], RBF kernel)
RandomForestClassifier (possible n estimators values [10, 100, 1000])
4 Evaluate the cross-validated performance
Accuracy
F1-score
5 Write a short report summarizing the methodology and the results
Step1:
Create a classification dataset (n samples >=1000, n features >= 10)
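For classification, scikit-learn already bundles the Iris dataset. The binary problem for this assignment is generated with make_classification.
# Return values:
# X: array of shape [n_samples, n_features], the generated samples
# y: array of shape [n_samples], the integer class label of each sample
from sklearn import datasets
iris = datasets.load_iris()  # bundled example dataset, not used in the steps below
# Artificial data generator: unpack the returned (X, y) tuple
X, y = datasets.make_classification(n_samples=1000, n_features=10,
    n_informative=2, n_redundant=2, n_repeated=0, n_classes=2)
print(X)
print(y)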
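A quick sanity check on the generated data can confirm that it meets the assignment's size requirements. The sketch below is illustrative only: random_state=0 is an assumption added so that the same data comes out on every run, and the class counts are only approximately equal.

```python
import numpy as np
from sklearn.datasets import make_classification

# random_state=0 is an assumption added here for reproducibility.
X, y = make_classification(
    n_samples=1000, n_features=10,
    n_informative=2, n_redundant=2, n_repeated=0,
    n_classes=2, random_state=0,
)

print(X.shape)         # (1000, 10)
print(y.shape)         # (1000,)
print(np.unique(y))    # the distinct class labels
print(np.bincount(y))  # number of samples per class
```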
Step2:
Split the dataset using 10-fold cross validation
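# sklearn.cross_validation was removed in scikit-learn 0.20; KFold now lives in sklearn.model_selection
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, shuffle=True)
for train_index, test_index in kf.split(X):
    # each iteration yields the row indices of one train/test split
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]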
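To see what the 10-fold split actually produces, the sketch below prints the size of every fold and checks that each sample lands in exactly one test fold. The placeholder array and random_state=0 are assumptions used only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(2000).reshape(1000, 2)  # placeholder data with 1000 rows
kf = KFold(n_splits=10, shuffle=True, random_state=0)

seen = []
for fold, (train_index, test_index) in enumerate(kf.split(X), start=1):
    # with 1000 samples and 10 folds: 900 training rows, 100 test rows
    print(fold, len(train_index), len(test_index))
    seen.extend(test_index.tolist())

# every row index should appear in exactly one test fold
covered_once = sorted(seen) == list(range(len(X)))
print(covered_once)
```

Note that the plain loop in this step keeps only the last split in X_train and X_test once it finishes, so any training that should happen per fold has to go inside the loop body.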
Step3:
Train the algorithms
GaussianNB
SVC (possible C values [1e-02, 1e-01, 1e00, 1e01, 1e02], RBF kernel)
RandomForestClassifier (possible n estimators values [10, 100, 1000])
GaussianNB:
from sklearn.naive_bayes import GaussianNB
model1 = GaussianNB()
model1.fit(X_train, y_train)
predict = model1.predict(X_test)
print(predict)
SVC:
from sklearn.svm import SVC
for num in [1e-02, 1e-01, 1e00, 1e01, 1e02]:
    # C must be passed by keyword in current scikit-learn
    model2 = SVC(C=num, kernel='rbf', gamma=0.1)
    model2.fit(X_train, y_train)
    predict2 = model2.predict(X_test)
    print(predict2)
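The assignment asks for an inner 5-fold cross validation to choose C, rather than comparing C values on the test fold. One possible way to do that is GridSearchCV, sketched below on a single train/test split. The split made with train_test_split stands in for one outer fold, and random_state=0 and scoring="accuracy" are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_redundant=2, random_state=0)
# stand-in for one outer fold: 90% training data, 10% test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

param_grid = {"C": [1e-02, 1e-01, 1e00, 1e01, 1e02]}
search = GridSearchCV(SVC(kernel="rbf", gamma=0.1), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)  # inner 5-fold CV runs on the training data only

print(search.best_params_)           # C chosen by the inner cross validation
print(search.best_score_)            # mean inner-CV accuracy for that C
print(search.score(X_test, y_test))  # accuracy of the refitted model on the held-out data
```

Because the test data plays no part in choosing C, the final score is not biased by the parameter search.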
RandomForestClassifier:
from sklearn.ensemble import RandomForestClassifier
for n_estimators in [10, 100, 1000]:
    # pass the loop variable so each candidate forest size is actually trained
    model3 = RandomForestClassifier(n_estimators=n_estimators)
    model3.fit(X_train, y_train)
    predict3 = model3.predict(X_test)
    print(predict3)
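The candidate n_estimators values only take effect if the loop variable is passed to the constructor. A simple check, sketched here with random_state=0 as an added assumption, is to compare the number of fitted trees with the requested value.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_redundant=2, random_state=0)

tree_counts = {}
for n_estimators in [10, 100, 1000]:
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    model.fit(X, y)
    # estimators_ holds the fitted decision trees of the forest
    tree_counts[n_estimators] = len(model.estimators_)

print(tree_counts)
```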
Step4:
Evaluate the cross-validated performance
Accuracy
F1-score
AUC ROC
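As a reminder of what these metrics measure, the sketch below computes them by hand from the confusion counts on a made-up toy label vector (not data from this assignment) and compares them with sklearn.metrics. It also illustrates that when roc_auc_score is given hard 0/1 predictions instead of continuous scores, the value reduces to the mean of the true positive rate and the true negative rate.

```python
from sklearn import metrics

# toy labels, invented for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)                      # true positive rate
f1 = 2 * precision * recall / (precision + recall)
auc_hard = (recall + tn / (tn + fp)) / 2     # mean of TPR and TNR

print(accuracy, metrics.accuracy_score(y_true, y_pred))
print(f1, metrics.f1_score(y_true, y_pred))
print(auc_hard, metrics.roc_auc_score(y_true, y_pred))
```

For a ranking-based AUC, roc_auc_score is usually given the output of decision_function or predict_proba rather than predict.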
GaussianNB:
from sklearn import metrics
accuracy = metrics.accuracy_score(y_test, predict)
print(accuracy)
F1_score = metrics.f1_score(y_test, predict)
print(F1_score)
auc_roc = metrics.roc_auc_score(y_test, predict)
print(auc_roc)
SVC:
for num in [1e-02, 1e-01, 1e00, 1e01, 1e02]:
    model2 = SVC(C=num, kernel='rbf', gamma=0.1)
    model2.fit(X_train, y_train)
    predict2 = model2.predict(X_test)
    # report the three metrics for this value of C
    accuracy = metrics.accuracy_score(y_test, predict2)
    print(accuracy)
    F1_score = metrics.f1_score(y_test, predict2)
    print(F1_score)
    auc_roc = metrics.roc_auc_score(y_test, predict2)
    print(auc_roc)
RandomForestClassifier:
for n_estimators in [10, 100, 1000]:
    model3 = RandomForestClassifier(n_estimators=n_estimators)
    model3.fit(X_train, y_train)
    predict3 = model3.predict(X_test)
    # report the three metrics for this forest size
    accuracy = metrics.accuracy_score(y_test, predict3)
    print(accuracy)
    F1_score = metrics.f1_score(y_test, predict3)
    print(F1_score)
    auc_roc = metrics.roc_auc_score(y_test, predict3)
    print(auc_roc)
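The assignment asks for per-fold and averaged scores over the 10 outer folds, with parameters chosen by an inner 5-fold search. One way this could be put together is sketched below with cross_validate and GridSearchCV. This is an illustration under stated assumptions, not the code that produced the results in this post: the random_state values are added for reproducibility, and the Random Forest grid is cut to [10, 100] to keep the run short (1000 can be added back for the full assignment).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_redundant=2, random_state=0)

inner = KFold(n_splits=5, shuffle=True, random_state=0)   # parameter selection
outer = KFold(n_splits=10, shuffle=True, random_state=1)  # performance estimate

models = {
    "GaussianNB": GaussianNB(),  # no parameters to tune
    "SVC": GridSearchCV(SVC(kernel="rbf", gamma=0.1),
                        {"C": [1e-02, 1e-01, 1e00, 1e01, 1e02]}, cv=inner),
    # grid shortened for speed (assumption); the assignment also lists 1000
    "RandomForest": GridSearchCV(RandomForestClassifier(random_state=0),
                                 {"n_estimators": [10, 100]}, cv=inner),
}

scoring = ["accuracy", "f1", "roc_auc"]
results = {}
for name, model in models.items():
    # each outer training fold runs its own inner grid search
    scores = cross_validate(model, X, y, cv=outer, scoring=scoring)
    results[name] = {m: scores["test_" + m] for m in scoring}
    for m in scoring:
        per_fold = results[name][m]
        print(name, m, np.round(per_fold, 3), "mean=%.3f" % per_fold.mean())
```

Here the "roc_auc" scorer works from continuous scores (decision_function or predict_proba), so its values can differ from an AUC computed on hard predictions.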
Step5:
Write a short report summarizing the methodology and the results
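Conclusion 1: Ranked from worst to best, the three models perform as GaussianNB < SVC < RandomForestClassifier.
Conclusion 2: For SVC, C = 1e00 gives the best result.
Conclusion 3: For RandomForestClassifier, smaller n_estimators values gave better results.
(Most of the time spent on this assignment went into sklearn failing to import under Anaconda (in Spyder), while importing it directly in IPython worked fine.)
sklearn provides many datasets and training methods that are worth studying further.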