[Machine Learning in Action] -- Titanic Dataset (3) -- Logistic Regression
1. Preface:
This post belongs to the hands-on part of the series and focuses on applying the algorithm in a real project. For a detailed treatment of logistic regression itself, the following references were a great help to me while learning:
Li Hang, Statistical Learning Methods (統計學習方法)
A summary of logistic regression principles: https://www.cnblogs.com/pinard/p/6029432.html
2. Dataset:
Dataset: https://www.kaggle.com/c/titanic
The Titanic dataset is one of the most popular competitions on Kaggle. The data are small and simple, which makes them a good starting point for beginners and for comparing machine learning algorithms in depth.
The dataset contains 11 feature variables: PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked. The task is to use these features to predict whether a passenger survived the Titanic disaster.
3. Algorithm Overview:
This section briefly introduces logistic regression. Although the name contains "regression", logistic regression is in fact a classic classification algorithm and belongs to the family of log-linear models.
3.1 Binary logistic regression model
Suppose $x\in R^{n}$, $y \in \{0,1\}$, $w$ is the weight vector (with the bias term absorbed into it), and $h_{w}(x)$ is the model output. Then:
$h_{w}(x) = \frac{1}{1+e^{-w \cdot x}}$,
$h_{w}(x)$ is the logistic (sigmoid) distribution function, an S-shaped curve symmetric about the point $(0, \frac{1}{2})$. When the model output $h_{w}(x) > 0.5$, the predicted class is 1; when $h_{w}(x) < 0.5$, the predicted class is 0.
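The sigmoid and its 0.5 decision rule can be sketched in a few lines of numpy (a minimal illustration; the weight and feature vectors below are made up, with the last entry playing the role of the bias):

```python
import numpy as np

def sigmoid(z):
    # logistic function: maps any real score to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, x):
    # classify as 1 when h_w(x) > 0.5, i.e. when w . x > 0
    return int(sigmoid(np.dot(w, x)) > 0.5)

w = np.array([2.0, -1.0, 0.5])   # last entry acts as the bias
x = np.array([1.0, 3.0, 1.0])    # last entry fixed to 1
print(sigmoid(0.0))              # 0.5, the curve's symmetry point
print(predict(w, x))
```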
The conditional probabilities are:
$P(Y=1|x) = \frac{e^{w \cdot x }}{1 + e^{w \cdot x }}$
$P(Y=0|x) = \frac{1}{1 + e^{w \cdot x }}$
Introducing the notion of odds (the odds of an event is the ratio of the probability that it occurs to the probability that it does not; the log-odds is written as logit):
$\mathrm{logit}(p) = \log\frac{P(Y=1|x)}{P(Y=0|x)} = w \cdot x$
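The log-odds identity above can be checked numerically (a small sketch; the weight and feature vectors are made up for illustration):

```python
import numpy as np

w = np.array([0.8, -0.3, 0.1])
x = np.array([1.5, 2.0, 1.0])

z = np.dot(w, x)                  # w . x
p1 = np.exp(z) / (1 + np.exp(z))  # P(Y=1|x)
p0 = 1 / (1 + np.exp(z))          # P(Y=0|x)

# log(P(Y=1|x) / P(Y=0|x)) recovers the linear score w . x
print(np.log(p1 / p0))
print(z)
```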
3.2 Multinomial logistic regression model
Suppose $x\in R^{n}$, $y \in \{1,2,...,K\}$, $w_{k}$ are the weight vectors (with bias terms absorbed), and $h_{w_{k}}(x)$ is the model output. Then:
$h_{w_{k}}(x) = P(Y=k|x) = \frac{e^{w_{k} \cdot x}}{1+\sum_{j=1}^{K-1}e^{w_{j} \cdot x}}$, for $k=1,2,...,K-1$
$h_{w_{K}}(x) = P(Y=K|x) = \frac{1}{1+\sum_{j=1}^{K-1}e^{w_{j} \cdot x}}$
Moreover, the log-odds relative to the reference class $K$ satisfy:
$\ln\frac{P(Y=1|x, w)}{P(Y=K|x, w)} = w_{1} \cdot x$
...
$\ln\frac{P(Y=K-1|x, w)}{P(Y=K|x, w)} = w_{K-1} \cdot x$
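The K class probabilities can be computed directly from the formulas above. A minimal numpy sketch with made-up weights, where class K serves as the reference class with an implicit weight of zero:

```python
import numpy as np

def multinomial_probs(W, x):
    # W: (K-1, n) weight matrix for classes 1..K-1; class K is the reference
    scores = np.exp(W @ x)                          # e^{w_k . x}, k = 1..K-1
    denom = 1.0 + scores.sum()
    probs = np.append(scores / denom, 1.0 / denom)  # P(Y=k|x), k = 1..K
    return probs

W = np.array([[0.5, -1.0], [1.0, 0.2]])  # K = 3 classes, n = 2 features
x = np.array([1.0, 2.0])
p = multinomial_probs(W, x)
print(p, p.sum())  # the K probabilities sum to 1
```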
4. Practice:
1. Sklearn provides two main classes for logistic regression: LogisticRegression and LogisticRegressionCV. As its name suggests, LogisticRegressionCV has cross-validation built in. According to the Sklearn documentation, "The advantage of using a cross-validation estimator over the canonical estimator class along with grid search is that they can take advantage of warm-starting by reusing precomputed results in the previous steps of the cross-validation process." In other words, this kind of estimator can warm-start from results computed in earlier cross-validation steps, which usually speeds up the computation. In my own experience, however, its built-in grid search can only tune a single hyperparameter, the regularization strength Cs. Other hyperparameters, such as the regularization type penalty or the class weights class_weight, cannot be searched over, so in practice I used LogisticRegression together with GridSearchCV for hyperparameter selection.
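To illustrate the limitation described above, the sketch below fits a LogisticRegressionCV where only the Cs grid is searched; penalty and solver must be fixed at construction time. The synthetic data from make_classification is my own stand-in, chosen only so the example is self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Only the regularization strength is searched (the Cs grid);
# penalty and class_weight cannot be part of the search.
clf = LogisticRegressionCV(Cs=[0.01, 0.1, 1, 10], cv=5,
                           penalty='l2', solver='lbfgs', max_iter=1000)
clf.fit(X, y)
print(clf.C_)  # the C selected by cross-validation
```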
2. In param_grid, penalty = 'l1' and 'l2' are listed as separate grids, because the best range of the regularization coefficient 'C' may differ by orders of magnitude between the two penalties; splitting them lets us give each its own suitable range during tuning. In this example, with penalty='l1' the model clearly underfits once C<=0.01, whereas with penalty='l2' it only clearly underfits once C<0.0001. This can be verified by inspecting the search results in grid_search.cv_results_.
3. At first glance, the cv_results_ attribute of GridSearchCV looks rather messy, but it can be converted to a dataframe with pd.DataFrame(), which is much easier to read; you can even sort it by columns of interest, such as 'mean_test_score', which is very convenient.
The code and its output are given below:
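The DataFrame trick can be shown with a toy stand-in for cv_results_ (the dict below merely mimics its structure and values are invented; in real use it comes from grid_search.cv_results_):

```python
import pandas as pd

# A toy dict shaped like GridSearchCV.cv_results_
cv_results = {
    'param_C': [0.01, 0.1, 1, 10],
    'param_penalty': ['l2', 'l2', 'l2', 'l2'],
    'mean_test_score': [0.71, 0.78, 0.81, 0.80],
    'std_test_score': [0.03, 0.02, 0.02, 0.02],
}

df = pd.DataFrame(cv_results)
# Sort by the column of interest, best score first
df_sorted = df.sort_values('mean_test_score', ascending=False)
print(df_sorted.head())
```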
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import cross_val_score, GridSearchCV, ParameterGrid, StratifiedKFold, ShuffleSplit
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.base import TransformerMixin, BaseEstimator


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns as a numpy array."""
    def __init__(self, attribute_name):
        self.attribute_name = attribute_name

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return x[self.attribute_name].values


# Load data
data_train = pd.read_csv('train.csv')

train_x = data_train.drop('Survived', axis=1)
train_y = data_train['Survived']

# Data cleaning
cat_attribs = ['Pclass', 'Sex', 'Embarked']
dis_attribs = ['SibSp', 'Parch']
con_attribs = ['Age', 'Fare']

# encoder: OneHotEncoder() / OrdinalEncoder()
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder()),
])

# impute first, then scale, so the scaler always sees complete data
dis_pipeline = Pipeline([
    ('selector', DataFrameSelector(dis_attribs)),
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('scaler', StandardScaler()),
])

con_pipeline = Pipeline([
    ('selector', DataFrameSelector(con_attribs)),
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

full_pipeline = FeatureUnion(
    transformer_list=[
        ('con_pipeline', con_pipeline),
        ('dis_pipeline', dis_pipeline),
        ('cat_pipeline', cat_pipeline),
    ]
)

train_x_cleaned = full_pipeline.fit_transform(train_x)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
clf1 = LogisticRegression(tol=1e-4, max_iter=1000, solver='liblinear')

param_grid = [{'penalty': ['l1'],
               'C': [1e-4, 1e-3, 1e-2, 1e-1, 1, 10],
               'class_weight': ['balanced', None],
               },
              {'penalty': ['l2'],
               'C': [1e-4, 1e-3, 1e-2, 1e-1, 1, 10],
               'class_weight': ['balanced', None],
               },
              {'penalty': ['none'],  # liblinear does not support an unpenalized fit, hence lbfgs
               'class_weight': ['balanced', None],
               'solver': ['lbfgs']
               }
              ]

grid_search = GridSearchCV(clf1, param_grid=param_grid, cv=cv, scoring='accuracy', n_jobs=-1, return_train_score=True)

grid_search.fit(train_x_cleaned, train_y)
predicted_y = grid_search.predict(train_x_cleaned)

df_cv_results = pd.DataFrame(grid_search.cv_results_)
print(accuracy_score(train_y, predicted_y))
print(precision_score(train_y, predicted_y))
print(recall_score(train_y, predicted_y))
Output:
0.8069584736251403
0.7814569536423841
0.6900584795321637
As we can see, with the same data-cleaning steps, logistic regression (a log-linear model) and the perceptron (a linear model, covered earlier in this series) achieve nearly identical classification accuracy.