
Introduction to ML with Python: Notes

Supervised Machine Learning Algorithms

K-Nearest Neighbors

  • Classification
import mglearn
from sklearn.model_selection import train_test_split
X, y = mglearn.datasets.make_forge()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors = 3)
clf.fit(X_train,y_train)
clf.score(X_test,y_test)

  • Regression
from sklearn.neighbors import KNeighborsRegressor
X,y = mglearn.datasets.make_wave(n_samples=40)
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 0)
reg = KNeighborsRegressor(n_neighbors = 3)
reg.fit(X_train,y_train)
reg.score(X_test,y_test)

Linear models

  • Linear Regression
from sklearn.linear_model import LinearRegression
X,y = mglearn.datasets.make_wave(n_samples=60)
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 42)
lr = LinearRegression().fit(X_train,y_train) 
lr.coef_  ### the coefficients w: an array of numbers
lr.intercept_  ### the intercept b: a single number
lr.score(X_train,y_train)
lr.score(X_test,y_test)
  • Ridge Regression (less likely to overfit)
     Has a tunable parameter alpha
from sklearn.linear_model import Ridge 
X,y = mglearn.datasets.load_extended_boston()
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 42)
ridge = Ridge().fit(X_train, y_train)
ridge.score(X_train,y_train)
ridge.score(X_test,y_test)
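
Since alpha is the main knob, here is a minimal sketch of how different values might be compared (reusing the split above; these values are illustrative, not recommendations):

for alpha in [0.1, 1, 10]:  # larger alpha -> stronger regularization -> simpler model
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    print(alpha, ridge.score(X_train, y_train), ridge.score(X_test, y_test))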
  • Lasso
from sklearn.linear_model import Lasso 

The rest follows the same pattern as above.

Compared with Ridge Regression, both apply regularization; the difference is that Lasso uses L1 regularization while Ridge uses L2. Both expose a tunable parameter alpha.
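
Because L1 regularization drives some coefficients exactly to zero, Lasso also acts as a form of feature selection. A minimal sketch (reusing the extended Boston split from the Ridge example; max_iter is raised because small alpha values may need more iterations to converge):

import numpy as np
lasso = Lasso(alpha=0.01, max_iter=100000).fit(X_train, y_train)
print(lasso.score(X_test, y_test))
print("features used:", np.sum(lasso.coef_ != 0))  # nonzero coefficients = features kept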

  • Linear models for classification

 -------logistic regression and linear support vector machine 

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
  • Linear models for multiclass classification
 -------logistic regression and linear support vector machine

Summary of parameter tuning for the models above: the main parameters of linear models come from regularization, namely alpha in the regression models and C in LinearSVC and LogisticRegression. In general, a large alpha or a small C means a simpler (lower-complexity) model. The other choice to make is between L1 and L2 regularization: if only a few features are expected to be important, use L1; otherwise use L2.
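
A minimal sketch of the C trade-off with LogisticRegression on the breast cancer data (values are illustrative; the "liblinear" solver is chosen here because it supports both L1 and L2 penalties):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=42)
for C in [0.01, 1, 100]:  # smaller C -> stronger regularization -> simpler model
    logreg = LogisticRegression(C=C, penalty="l2", solver="liblinear").fit(X_train, y_train)
    print(C, logreg.score(X_test, y_test))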

Naive Bayes Classification

Naive Bayes classifiers are similar to the linear models discussed above, but they train even faster. sklearn provides three kinds of naive Bayes classifiers: ①GaussianNB, ②BernoulliNB, and ③MultinomialNB. GaussianNB can be applied to any continuous data, BernoulliNB assumes binary data, and MultinomialNB assumes count data; BernoulliNB and MultinomialNB are mostly used for classifying text data.
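
A minimal sketch with GaussianNB on continuous data (the breast cancer features, following the same split-and-score pattern as above):

from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
nb = GaussianNB().fit(X_train, y_train)
nb.score(X_test, y_test)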

Decision Trees

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)
tree = DecisionTreeClassifier(random_state = 0)
tree.fit(X_train,y_train)
tree.score(X_train,y_train)
tree.score(X_test,y_test)

Without any parameter tuning, the tree above easily overfits, so we can limit the depth of the decision tree, e.g. max_depth=4, as shown below:

tree = DecisionTreeClassifier(max_depth=4,random_state = 0)
  • Analyzing decision trees

The export_graphviz function can be called to visualize the tree model:

from sklearn.tree import export_graphviz
export_graphviz(tree, out_file="tree.dot", class_names=["malignant", "benign"],
                feature_names=cancer.feature_names, impurity=False, filled=True)
import graphviz
with open("tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
  • Feature importance in trees 
Feature importances measure how much each feature contributes to the tree's decisions. Each is a number between 0 and 1; the closer to 1, the more that feature helps make the correct decision.
tree.feature_importances_
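
For readability, each importance can be paired with its feature name, e.g. (cancer and tree as defined above):

for name, importance in zip(cancer.feature_names, tree.feature_importances_):
    print(name, importance)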

Ensembles of Decision Trees

  •  Random Forests
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons
X,y = make_moons(n_samples = 100,noise=0.25,random_state = 3)
X_train,X_test,y_train,y_test = train_test_split(X, y,stratify = y, random_state = 42)
forest = RandomForestClassifier(n_estimators=5, random_state=2)
forest.fit(X_train,y_train)
forest.score(X_train,y_train)
forest.score(X_test,y_test)

Random forests do not tend to perform well on high-dimensional, sparse data such as text; for that kind of data, a linear model may be more effective. RF is mostly used when the dataset is large, but training speed then depends more heavily on physical resources such as CPU and memory.

The main RF parameters to tune are n_estimators, max_features, and pre-pruning options such as max_depth. ① For n_estimators, larger is generally better: more trees make the model more robust, but also demand more memory. ② max_features determines how random the model is, and a smaller max_features helps reduce overfitting. In general, using the defaults will not go far wrong: max_features = sqrt(n_features) for classification, and max_features = log2(n_features) for regression.
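
A minimal sketch of these knobs on the two-moons split above (values are illustrative, not tuned recommendations):

forest = RandomForestClassifier(n_estimators=100, max_features=1, max_depth=3,
                                random_state=0, n_jobs=-1)  # n_jobs=-1 uses all CPU cores
forest.fit(X_train, y_train)
forest.score(X_test, y_test)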

  • Gradient boosted regression trees(gradient boosting machine)

Despite "regression" in the name, this model can be used for both classification and regression. Compared with the RF above, GBRT involves no randomness; instead it relies on strong pre-pruning. With well-chosen parameters (n_estimators, learning_rate, max_depth), it can deliver higher accuracy than RF.

from sklearn.ensemble import GradientBoostingClassifier
X_train,X_test,y_train,y_test = train_test_split(cancer.data, cancer.target, random_state = 0) 
gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
gbrt.fit(X_train, y_train)
gbrt.score(X_train, y_train)
gbrt.score(X_test, y_test)
plot_feature_importances_cancer(gbrt)  # helper defined in the book's accompanying notebook, not part of sklearn
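
Lowering learning_rate is another way to apply stronger pre-pruning-style regularization, since each tree can then make only weaker corrections. A minimal sketch on the same split:

gbrt = GradientBoostingClassifier(random_state=0, learning_rate=0.01)  # default is 0.1
gbrt.fit(X_train, y_train)
gbrt.score(X_train, y_train)
gbrt.score(X_test, y_test)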

Kernelized Support Vector Machines


Neural Network (Deep Learning)

Decision_function

Predict_proba

To be continued.