
Introduction to ML with Python: Notes

Supervised Machine Learning Algorithms

K-Nearest Neighbors

  • Classification
import mglearn
from sklearn.model_selection import train_test_split
X, y = mglearn.datasets.make_forge()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors = 3)
clf.fit(X_train,y_train)
clf.score(X_test,y_test)

  • Regression
from sklearn.neighbors import KNeighborsRegressor
X,y = mglearn.datasets.make_wave(n_samples=40)
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 0)
reg = KNeighborsRegressor(n_neighbors = 3)
reg.fit(X_train,y_train)
reg.score(X_test,y_test)

Linear models

  • Linear Regression
from sklearn.linear_model import LinearRegression
X,y = mglearn.datasets.make_wave(n_samples=60)
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 42)
lr = LinearRegression().fit(X_train,y_train) 
lr.coef_  ### the coefficients w: an array of numbers
lr.intercept_  ### the intercept b: a single number
lr.score(X_train,y_train)
lr.score(X_test,y_test)
  • Ridge Regression (less likely to overfit)
     Has a tunable parameter alpha
from sklearn.linear_model import Ridge 
X,y = mglearn.datasets.load_extended_boston()
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 42)
ridge = Ridge().fit(X_train, y_train)
ridge.score(X_train,y_train)
ridge.score(X_test,y_test)
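
Since alpha is the main knob, here is a minimal sketch of how different values might be compared (reusing the split above; these values are illustrative, not recommendations):

for alpha in [0.1, 1, 10]:  # larger alpha -> stronger regularization -> simpler model
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    print(alpha, ridge.score(X_train, y_train), ridge.score(X_test, y_test))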
  • Lasso
from sklearn.linear_model import Lasso 

The rest follows the same pattern as above.

Compared with Ridge Regression, both apply regularization; the difference is that Lasso uses L1 regularization while Ridge uses L2. Both expose a tunable parameter alpha.
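
Because L1 regularization drives some coefficients exactly to zero, Lasso also acts as a form of feature selection. A minimal sketch (reusing the extended Boston split from the Ridge example; max_iter is raised because small alpha values may need more iterations to converge):

import numpy as np
lasso = Lasso(alpha=0.01, max_iter=100000).fit(X_train, y_train)
print(lasso.score(X_test, y_test))
print("features used:", np.sum(lasso.coef_ != 0))  # nonzero coefficients = features kept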

  • Linear models for classification

 -------logistic regression and linear support vector machine 

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
  • Linear models for multiclass classification
 -------logistic regression and linear support vector machine

Summary of parameter tuning for the models above: the main parameters of linear models come from regularization, namely alpha in the regression models and C in LinearSVC and LogisticRegression. In general, a large alpha or a small C means a simpler (lower-complexity) model. The other choice to make is between L1 and L2 regularization: if only a few features are expected to be important, use L1; otherwise use L2.
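
A minimal sketch of the C trade-off with LogisticRegression on the breast cancer data (values are illustrative; the "liblinear" solver is chosen here because it supports both L1 and L2 penalties):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=42)
for C in [0.01, 1, 100]:  # smaller C -> stronger regularization -> simpler model
    logreg = LogisticRegression(C=C, penalty="l2", solver="liblinear").fit(X_train, y_train)
    print(C, logreg.score(X_test, y_test))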

Naive Bayes Classification

Naive Bayes classifiers are similar to the linear models discussed above, but they train even faster. sklearn provides three kinds of naive Bayes classifiers: ①GaussianNB, ②BernoulliNB, and ③MultinomialNB. GaussianNB can be applied to any continuous data, BernoulliNB assumes binary data, and MultinomialNB assumes count data; BernoulliNB and MultinomialNB are mostly used for classifying text data.
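
A minimal sketch with GaussianNB on continuous data (the breast cancer features, following the same split-and-score pattern as above):

from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
nb = GaussianNB().fit(X_train, y_train)
nb.score(X_test, y_test)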

Decision Trees

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)
tree = DecisionTreeClassifier(random_state = 0)
tree.fit(X_train,y_train)
tree.score(X_train,y_train)
tree.score(X_test,y_test)

Without any parameter tuning, the tree above easily overfits, so we can limit the depth of the decision tree, e.g. max_depth=4, as shown below:

tree = DecisionTreeClassifier(max_depth=4,random_state = 0)
  • Analyzing decision trees

The export_graphviz function can be called to visualize the tree model:

from sklearn.tree import export_graphviz
export_graphviz(tree, out_file="tree.dot", class_names=["malignant", "benign"],
                feature_names=cancer.feature_names, impurity=False, filled=True)
import graphviz
with open("tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
  • Feature importance in trees 
Feature importances measure how much each feature contributes to the tree's decisions. Each is a number between 0 and 1; the closer to 1, the more that feature helps make the correct decision.
tree.feature_importances_
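
For readability, each importance can be paired with its feature name, e.g. (cancer and tree as defined above):

for name, importance in zip(cancer.feature_names, tree.feature_importances_):
    print(name, importance)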

Ensembles of Decision Trees

  •  Random Forests
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons
X,y = make_moons(n_samples = 100,noise=0.25,random_state = 3)
X_train,X_test,y_train,y_test = train_test_split(X, y,stratify = y, random_state = 42)
forest = RandomForestClassifier(n_estimators=5, random_state=2)
forest.fit(X_train,y_train)
forest.score(X_train,y_train)
forest.score(X_test,y_test)

Random forests do not tend to perform well on high-dimensional, sparse data such as text; for that kind of data, a linear model may be more effective. RF is mostly used when the dataset is large, but training speed then depends more heavily on physical resources such as CPU and memory.

The main RF parameters to tune are n_estimators, max_features, and pre-pruning options such as max_depth. ① For n_estimators, larger is generally better: more trees make the model more robust, but also demand more memory. ② max_features determines how random the model is, and a smaller max_features helps reduce overfitting. In general, using the defaults will not go far wrong: max_features = sqrt(n_features) for classification, and max_features = log2(n_features) for regression.
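
A minimal sketch of these knobs on the two-moons split above (values are illustrative, not tuned recommendations):

forest = RandomForestClassifier(n_estimators=100, max_features=1, max_depth=3,
                                random_state=0, n_jobs=-1)  # n_jobs=-1 uses all CPU cores
forest.fit(X_train, y_train)
forest.score(X_test, y_test)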

  • Gradient boosted regression trees(gradient boosting machine)

Despite "regression" in the name, this model can be used for both classification and regression. Compared with the RF above, GBRT involves no randomness; instead it relies on strong pre-pruning. With well-chosen parameters (n_estimators, learning_rate, max_depth), it can deliver higher accuracy than RF.

from sklearn.ensemble import GradientBoostingClassifier
X_train,X_test,y_train,y_test = train_test_split(cancer.data, cancer.target, random_state = 0) 
gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
gbrt.fit(X_train, y_train)
gbrt.score(X_train, y_train)
gbrt.score(X_test, y_test)
plot_feature_importances_cancer(gbrt)  # helper defined in the book's accompanying notebook, not part of sklearn
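
Lowering learning_rate is another way to apply stronger pre-pruning-style regularization, since each tree can then make only weaker corrections. A minimal sketch on the same split:

gbrt = GradientBoostingClassifier(random_state=0, learning_rate=0.01)  # default is 0.1
gbrt.fit(X_train, y_train)
gbrt.score(X_train, y_train)
gbrt.score(X_test, y_test)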

Kernelized Support Vector Machines


Neural Network (Deep Learning)

Decision_function

Predict_proba

To be continued.