Decision Trees (4): Tuning Decision Tree Parameters
Introduction
In this post, we explore the most important parameters of the decision tree model and how they prevent overfitting and underfitting, while doing as little feature engineering as possible. We will use the Titanic data from Kaggle.
Importing the data
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

data = pd.read_csv(r'F:\wd.jupyter\datasets\train.csv')
data.shape
data.head()
Checking for missing values
# Checking for missing data
NAs = pd.concat([data.isnull().sum()], axis=1, keys=['Train'])
NAs[NAs.sum(axis=1) > 0]
Drop 'Cabin', 'Name', and 'Ticket', fill in the missing values, and handle the categorical variables.
# At this point we will drop the Cabin feature since it is missing a lot of the data
data.pop('Cabin')
data.pop('Name')
data.pop('Ticket')

# Filling missing Age values with the mean
data['Age'] = data['Age'].fillna(data['Age'].mean())

# Filling missing Embarked values with the most common value
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])

# 'Pclass' is a categorical feature, so we convert its values to strings
data['Pclass'] = data['Pclass'].apply(str)

# Basic one-hot encoding of all remaining categorical features
for col in data.dtypes[data.dtypes == 'object'].index:
    for_dummy = data.pop(col)
    data = pd.concat([data, pd.get_dummies(for_dummy, prefix=col)], axis=1)
We use 25% of the data as the test set and the rest as the training set, then fit a baseline decision tree model.
# Prepare data for training models
labels = data.pop('Survived')

# For testing, we split the data into 75% train and 25% test
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.25)

# First fit a decision tree with default parameters to get a baseline idea of the performance
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)
Evaluate the model using AUC as the metric.
# Our target is binary, so this is a binary classification problem.
# We will use AUC (Area Under the ROC Curve) as the evaluation metric,
# which is a good choice for this type of problem.
y_pred = dt.predict(x_test)

from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc
# Output: 0.7552447552447552
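Note that dt.predict returns hard 0/1 labels, so the curve above is built from a single operating point. A fuller AUC estimate would typically score predicted probabilities instead; a minimal sketch, reusing the fitted model above:

# Sketch: score class probabilities rather than hard labels
from sklearn.metrics import roc_auc_score
y_proba = dt.predict_proba(x_test)[:, 1]  # probability of the positive class
roc_auc_score(y_test, y_proba)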
max_depth
The first parameter to tune is max_depth, the maximum depth of the tree. The deeper the tree, the more it splits and the more information it captures about the data. We fit decision trees with depths ranging from 1 to 32 and plot the training and test AUC scores.
max_depths = np.linspace(1, 32, 32, endpoint=True)
train_results = []
test_results = []
for max_depth in max_depths:
    # max_depth must be an integer, so cast the float produced by linspace
    dt = DecisionTreeClassifier(max_depth=int(max_depth))
    dt.fit(x_train, y_train)
    train_pred = dt.predict(x_train)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    # Add auc score to previous train results
    train_results.append(roc_auc)
    y_pred = dt.predict(x_test)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    # Add auc score to previous test results
    test_results.append(roc_auc)

from matplotlib.legend_handler import HandlerLine2D
line1, = plt.plot(max_depths, train_results, 'b', label='Train AUC')
line2, = plt.plot(max_depths, test_results, 'r', label='Test AUC')
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel('AUC score')
plt.xlabel('Tree depth')
plt.show()
We can see that the model overfits at large depth values: the tree predicts all of the training data perfectly, but it fails to generalize to the test data.
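One simple way to choose a depth from this sweep is to take the value with the highest test AUC; a minimal sketch reusing the max_depths and test_results arrays above (cross-validation would be a more robust choice in practice):

# Pick the depth with the highest test AUC from the sweep above
best_idx = int(np.argmax(test_results))
print(int(max_depths[best_idx]), test_results[best_idx])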
min_samples_split
min_samples_split is the minimum number of samples required to split an internal node. It can range from requiring at least one sample at each node to requiring all of the samples at each node. Increasing this parameter constrains the tree more, since it must consider more samples at each node before splitting. Here we vary the parameter from 10% to 100% of the samples.
min_samples_splits = np.linspace(0.1, 1.0, 10, endpoint=True)
train_results = []
test_results = []
for min_samples_split in min_samples_splits:
    dt = DecisionTreeClassifier(min_samples_split=min_samples_split)
    dt.fit(x_train, y_train)
    train_pred = dt.predict(x_train)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    train_results.append(roc_auc)
    y_pred = dt.predict(x_test)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    test_results.append(roc_auc)

from matplotlib.legend_handler import HandlerLine2D
line1, = plt.plot(min_samples_splits, train_results, 'b', label='Train AUC')
line2, = plt.plot(min_samples_splits, test_results, 'r', label='Test AUC')
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel('AUC score')
plt.xlabel('min samples split')
plt.show()
We can clearly see that when the tree is required to consider 100% of the samples at each node, the model underfits and cannot explain the data adequately.
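As an aside, min_samples_split accepts either a float (a fraction of the training set, as used above) or an int (an absolute sample count). A minimal sketch of the equivalence, assuming the x_train from above:

import math

# A fraction f is converted internally to ceil(f * n_samples)
n_samples = x_train.shape[0]
equivalent_count = math.ceil(0.1 * n_samples)
dt_frac = DecisionTreeClassifier(min_samples_split=0.1).fit(x_train, y_train)
dt_int = DecisionTreeClassifier(min_samples_split=equivalent_count).fit(x_train, y_train)
# Both trees are grown under the same splitting constraint
print(dt_frac.get_depth(), dt_int.get_depth())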
min_samples_leaf
min_samples_leaf is the minimum number of samples required at a leaf node. This parameter is similar to min_samples_split, but it describes the minimum number of samples at the leaves, the bottom of the tree.
min_samples_leafs = np.linspace(0.1, 0.5, 5, endpoint=True)
train_results = []
test_results = []
for min_samples_leaf in min_samples_leafs:
    dt = DecisionTreeClassifier(min_samples_leaf=min_samples_leaf)
    dt.fit(x_train, y_train)
    train_pred = dt.predict(x_train)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    train_results.append(roc_auc)
    y_pred = dt.predict(x_test)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    test_results.append(roc_auc)

from matplotlib.legend_handler import HandlerLine2D
line1, = plt.plot(min_samples_leafs, train_results, 'b', label='Train AUC')
line2, = plt.plot(min_samples_leafs, test_results, 'r', label='Test AUC')
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel('AUC score')
plt.xlabel('min samples leaf')
plt.show()
Same conclusion as for the previous parameter: increasing this value can cause the model to underfit.
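To see how strongly the leaf constraint prunes the tree, one can compare the size of a constrained tree with the unconstrained baseline; a minimal sketch (min_samples_leaf=5 is an arbitrary illustrative value):

# Compare an unconstrained tree with a lightly constrained one
dt_free = DecisionTreeClassifier().fit(x_train, y_train)
dt_leaf5 = DecisionTreeClassifier(min_samples_leaf=5).fit(x_train, y_train)
print(dt_free.get_depth(), dt_free.get_n_leaves())
print(dt_leaf5.get_depth(), dt_leaf5.get_n_leaves())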
max_features
max_features is the maximum number of features to consider when looking for the best split.
max_features = list(range(1, data.shape[1]))
train_results = []
test_results = []
for max_feature in max_features:
    dt = DecisionTreeClassifier(max_features=max_feature)
    dt.fit(x_train, y_train)
    train_pred = dt.predict(x_train)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    train_results.append(roc_auc)
    y_pred = dt.predict(x_test)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    test_results.append(roc_auc)

from matplotlib.legend_handler import HandlerLine2D
line1, = plt.plot(max_features, train_results, 'b', label='Train AUC')
line2, = plt.plot(max_features, test_results, 'r', label='Test AUC')
plt.legend(handler_map={line1: HandlerLine2D(numpoints=2)})
plt.ylabel('AUC score')
plt.xlabel('max features')
plt.show()
This is another case of overfitting. Note that, according to the sklearn documentation for decision trees, the search for a split does not stop until at least one valid partition of the node samples is found, even if that requires inspecting more than max_features features.
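Finally, rather than sweeping one parameter at a time as above, the four parameters can be tuned jointly with cross-validation. A minimal sketch using sklearn's GridSearchCV (the grid values below are illustrative assumptions, not recommendations):

from sklearn.model_selection import GridSearchCV

# Illustrative grid; a real search would be guided by the single-parameter sweeps above
param_grid = {
    'max_depth': [3, 5, 8, None],
    'min_samples_split': [2, 0.05, 0.1],
    'min_samples_leaf': [1, 0.05, 0.1],
    'max_features': [None, 'sqrt', 10],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid, scoring='roc_auc', cv=5)
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_)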
This post examined how these model parameters affect overfitting and underfitting. I hope you found it helpful.