Introduction to ML with Python: Notes
Supervised Machine Learning Algorithms
K-Nearest Neighbors
- Classification
import mglearn
from sklearn.model_selection import train_test_split
X, y = mglearn.datasets.make_forge()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)   ### test-set accuracy
- Regression
from sklearn.neighbors import KNeighborsRegressor
X, y = mglearn.datasets.make_wave(n_samples=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_train, y_train)
reg.score(X_test, y_test)   ### for regressors, score returns R^2
Linear models
- Linear Regression
from sklearn.linear_model import LinearRegression
X, y = mglearn.datasets.make_wave(n_samples=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
lr = LinearRegression().fit(X_train, y_train)
lr.coef_        ### the coefficients w, an array of numbers
lr.intercept_   ### the intercept b, a single number
lr.score(X_train, y_train)
lr.score(X_test, y_test)
- Ridge Regression (less likely to overfit)
from sklearn.linear_model import Ridge
X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
ridge = Ridge().fit(X_train, y_train)
ridge.score(X_train, y_train)
ridge.score(X_test, y_test)
- Lasso
from sklearn.linear_model import Lasso
Everything else is the same as above.
Compared with Ridge Regression: both apply regularization, but Lasso uses L1 regularization while Ridge uses L2. In both models the regularization strength is tuned through the parameter alpha, as the sketch below shows.
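A minimal sketch of tuning alpha with Lasso, reusing the extended Boston split from the Ridge example above (alpha=0.01 and max_iter=100000 are illustrative values, not recommendations):
from sklearn.linear_model import Lasso
import numpy as np
lasso = Lasso(alpha=0.01, max_iter=100000).fit(X_train, y_train)
lasso.score(X_test, y_test)
np.sum(lasso.coef_ != 0)   ### L1 regularization drives some coefficients to exactly 0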
- Linear models for classification
-- logistic regression and linear support vector machine
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
- Linear models for multiclass classification
Tuning summary for the linear models above: their main parameters come from the regularization term, namely alpha in the regression models and C in LinearSVC and LogisticRegression. In general, a large alpha or a small C means a simpler (lower-complexity) model. The other choice to make is between L1 and L2 regularization: if only a small number of features are expected to be important, use L1; otherwise choose L2. A sketch of both knobs follows.
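A minimal sketch of the C and penalty knobs on the breast cancer data (C=100, penalty="l1", and the liblinear solver are illustrative choices; load_breast_cancer is reused in the decision tree section below):
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=42)
logreg = LogisticRegression(C=100, penalty="l1", solver="liblinear").fit(X_train, y_train)   ### larger C = weaker regularization; L1 gives sparse coefficients
logreg.score(X_test, y_test)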
Naive Bayes Classification
Naive Bayes classifiers are similar to the linear models discussed above but are even faster to train. sklearn provides three naive Bayes classifiers: ①GaussianNB, ②BernoulliNB, and ③MultinomialNB. ① can be applied to any continuous data, ② assumes binary data, and ③ works on count data; ② and ③ are mostly used for classifying text data.
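A minimal GaussianNB sketch on the forge data from the KNN section (BernoulliNB or MultinomialNB can be swapped in when the data type calls for them):
from sklearn.naive_bayes import GaussianNB
X, y = mglearn.datasets.make_forge()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
gnb = GaussianNB().fit(X_train, y_train)
gnb.score(X_test, y_test)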
Decision Trees
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)
tree = DecisionTreeClassifier(random_state = 0)
tree.fit(X_train,y_train)
tree.score(X_train,y_train)
tree.score(X_test,y_test)
Without parameter tuning, the tree above easily overfits, so we can pre-prune it by limiting its depth, e.g. max_depth=4, as shown below:
tree = DecisionTreeClassifier(max_depth=4,random_state = 0)
- analyzing decision tree
The export_graphviz function can be called to visualize the tree model:
from sklearn.tree import export_graphviz
export_graphviz(tree, out_file="tree.dot", class_names=["malignant", "benign"], feature_names=cancer.feature_names, impurity=False, filled=True)
import graphviz
with open("tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
- Feature importance in trees
tree.feature_importances_   ### one value per feature; the values sum to 1
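The plot_feature_importances_cancer helper called in the gradient boosting section below comes from the book, not from sklearn; a minimal sketch, assuming matplotlib and the cancer dataset loaded above:
import numpy as np
import matplotlib.pyplot as plt
def plot_feature_importances_cancer(model):
    n_features = cancer.data.shape[1]
    plt.barh(range(n_features), model.feature_importances_, align="center")
    plt.yticks(np.arange(n_features), cancer.feature_names)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
plot_feature_importances_cancer(tree)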
Ensembles of Decision Trees
- Random Forests
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons
X,y = make_moons(n_samples = 100,noise=0.25,random_state = 3)
X_train,X_test,y_train,y_test = train_test_split(X, y,stratify = y, random_state = 42)
forest = RandomForestClassifier(n_estimators=5, random_state=2)
forest.fit(X_train,y_train)
forest.score(X_train,y_train)
forest.score(X_test,y_test)
Random forests do not handle high-dimensional, sparse data such as text very well; for that kind of data, linear models may be more effective. RF is mostly used when the dataset is large, but then training speed depends to a larger degree on physical factors such as CPU and memory.
The main RF tuning parameters are n_estimators, max_features, and pre-pruning options such as max_depth. ① For n_estimators, larger is generally better: more trees make the model more robust, at the cost of memory and training time. ② max_features determines how random the model is, and a smaller max_features helps reduce overfitting. In short, the defaults are rarely far off: max_features=sqrt(n_features) for classification and max_features=log2(n_features) for regression. A sketch with these knobs set explicitly follows.
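In this sketch the values n_estimators=100, max_features=2, and n_jobs=-1 are illustrative, not recommendations:
forest = RandomForestClassifier(n_estimators=100, max_features=2, n_jobs=-1, random_state=2)   ### n_jobs=-1 uses all CPU cores
forest.fit(X_train, y_train)
forest.score(X_test, y_test)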
- Gradient boosted regression trees(gradient boosting machine)
Despite the "regression" in its name, this model can be used for both classification and regression. In contrast to RF above, GBRT involves no randomness; instead it relies on strong pre-pruning. With well-chosen parameters (n_estimators, learning_rate, max_depth), it can deliver higher accuracy than RF.
from sklearn.ensemble import GradientBoostingClassifier
X_train,X_test,y_train,y_test = train_test_split(cancer.data, cancer.target, random_state = 0)
grbt = GradientBoostingClassifier(random_state=0, max_depth=1)
grbt.fit(X_train, y_train)
grbt.score(X_train, y_train)
grbt.score(X_test, y_test)
plot_feature_importances_cancer(grbt)
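Besides max_depth, the other main lever against overfitting is lowering learning_rate; a sketch with an illustrative value of 0.01:
grbt = GradientBoostingClassifier(random_state=0, learning_rate=0.01)
grbt.fit(X_train, y_train)
grbt.score(X_test, y_test)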
Kernelized Support Vector Machines
Neural Network (Deep Learning)
decision_function
predict_proba
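Both sections are still stubs; as a placeholder, a minimal sketch of the two uncertainty interfaces on the grbt classifier above (most sklearn classifiers expose one or both):
grbt.decision_function(X_test)[:5]   ### signed scores; the sign encodes the predicted class (binary case)
grbt.predict_proba(X_test)[:5]       ### class probabilities, one column per class, rows sum to 1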
To be continued.