Credit Card Model (Part 3)
We already have two earlier versions, both of which relied on WOE transformations. This time let's try an XGBoost-style version, which does not require any WOE transformation.
import numpy as np                    # linear algebra
import pandas as pd                   # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import preprocessing
from sklearn import metrics
from sklearn import model_selection
from sklearn import ensemble
from sklearn import tree
from sklearn import linear_model
import os, datetime, sys, random, time
import seaborn as sns
import xgboost as xgs
import lightgbm as lgb
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
#from mlxtend import classifier
plt.style.use('fivethirtyeight')
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
from scipy import stats, special
#import shap
import catboost as ctb

trainingData = pd.read_csv('F:\\python\\Give-me-some-credit-master\\data\\cs-training.csv')
testData = pd.read_csv('F:\\python\\Give-me-some-credit-master\\data\\cs-test.csv')
Exploratory Analysis
trainingData.head()
trainingData.info()
trainingData.describe()
print(trainingData.shape)
print(testData.shape)

testData.head()
testData.info()
testData.describe()
Copy the original data
finalTrain = trainingData.copy()
finalTest = testData.copy()
Since we need to predict the probability of delinquency for the test data, we first need to drop that extra column from it.
finalTest.drop('SeriousDlqin2yrs', axis=1, inplace = True)
Likewise, as mentioned above, let's take the ID column, i.e. Unnamed: 0, and store it in a separate variable.
trainID = finalTrain['Unnamed: 0']
testID = finalTest['Unnamed: 0']
finalTrain.drop('Unnamed: 0', axis=1, inplace=True)
finalTest.drop('Unnamed: 0', axis=1, inplace=True)
Distribution of the target (y)
Since we have 150,000 records, this is very likely an imbalanced dataset, so let's check the ratio of delinquent to non-delinquent customers.
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
finalTrain['SeriousDlqin2yrs'].value_counts().plot.pie(explode=[0, 0.1], autopct='%1.1f%%', ax=axes[0])
axes[0].set_title('SeriousDlqin2yrs')
#axes[0].set_ylabel('')
sns.countplot('SeriousDlqin2yrs', data=finalTrain, ax=axes[1])
axes[1].set_title('SeriousDlqin2yrs')
plt.show()
The ratio of non-delinquent to delinquent customers is 93.3% to 6.7%, roughly 14:1, so our dataset is highly imbalanced.
We therefore cannot rely on accuracy alone to judge the model's success; several other evaluation metrics will be considered here, but more on that later.
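For reference, the roughly 14:1 ratio mentioned above can be computed directly, and it is also a common heuristic starting point for the scale_pos_weight parameter used in the LightGBM/XGBoost parameter grids later in this notebook. A minimal sketch, assuming finalTrain is already loaded:

counts = finalTrain['SeriousDlqin2yrs'].value_counts()
# negative (0) count divided by positive (1) count; roughly 14 for this dataset
print(round(counts[0] / counts[1], 1))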
EDA
Now let's move on to the outlier-analysis part of the EDA, where we remove potential outliers that could hurt the predictive model.
Outlier Analysis
fig = plt.figure(figsize=[30, 30])
for col, i in zip(finalTrain.columns, range(1, 13)):
    axes = fig.add_subplot(7, 2, i)
    sns.regplot(finalTrain[col], finalTrain.SeriousDlqin2yrs, ax=axes)
plt.show()
From the plots above we can observe:
In the NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate columns, we can see past-due counts above 90, and this pattern is common to all three features.
There are some unusually high DebtRatio and RevolvingUtilizationOfUnsecuredLines values.
Step 1: Fix the NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate columns
print("Unique values in '30-59 Days' values that are more than or equal to 90:",np.unique(finalTrain[finalTrain['NumberOfTime30-59DaysPastDueNotWorse']>=90] ['NumberOfTime30-59DaysPastDueNotWorse'])) print("Unique values in '60-89 Days' when '30-59 Days' values are more than or equal to 90:",np.unique(finalTrain[finalTrain['NumberOfTime30-59DaysPastDueNotWorse']>=90] ['NumberOfTime60-89DaysPastDueNotWorse'])) print("Unique values in '90 Days' when '30-59 Days' values are more than or equal to 90:",np.unique(finalTrain[finalTrain['NumberOfTime30-59DaysPastDueNotWorse']>=90] ['NumberOfTimes90DaysLate'])) print("Unique values in '60-89 Days' when '30-59 Days' values are less than 90:",np.unique(finalTrain[finalTrain['NumberOfTime30-59DaysPastDueNotWorse']<90] ['NumberOfTime60-89DaysPastDueNotWorse'])) print("Unique values in '90 Days' when '30-59 Days' values are less than 90:",np.unique(finalTrain[finalTrain['NumberOfTime30-59DaysPastDueNotWorse']<90] ['NumberOfTimes90DaysLate'])) print("Proportion of positive class with special 96/98 values:", round(finalTrain[finalTrain['NumberOfTime30-59DaysPastDueNotWorse']>=90]['SeriousDlqin2yrs'].sum()*100/ len(finalTrain[finalTrain['NumberOfTime30-59DaysPastDueNotWorse']>=90]['SeriousDlqin2yrs']),2),'%')
Unique values in '30-59 Days' values that are more than or equal to 90: [96 98]
Unique values in '60-89 Days' when '30-59 Days' values are more than or equal to 90: [96 98]
Unique values in '90 Days' when '30-59 Days' values are more than or equal to 90: [96 98]
Unique values in '60-89 Days' when '30-59 Days' values are less than 90: [ 0  1  2  3  4  5  6  7  8  9 11]
Unique values in '90 Days' when '30-59 Days' values are less than 90: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 17]
Proportion of positive class with special 96/98 values: 54.65 %
From the output above we can see that whenever NumberOfTime30-59DaysPastDueNotWorse exceeds 90, the other past-due-count columns take the same value. We treat these as special labels, because the proportion of positives among them is unusually high at 54.65%.
The 96 and 98 values can be regarded as data-entry errors, so we will replace them with the largest legitimate value in each column (i.e. the maximum below 90): 13, 11 and 17 respectively.
finalTrain.loc[finalTrain['NumberOfTime30-59DaysPastDueNotWorse'] >= 90, 'NumberOfTime30-59DaysPastDueNotWorse'] = 13
finalTrain.loc[finalTrain['NumberOfTime60-89DaysPastDueNotWorse'] >= 90, 'NumberOfTime60-89DaysPastDueNotWorse'] = 11
finalTrain.loc[finalTrain['NumberOfTimes90DaysLate'] >= 90, 'NumberOfTimes90DaysLate'] = 17

print("Unique values in 30-59Days", np.unique(finalTrain['NumberOfTime30-59DaysPastDueNotWorse']))
print("Unique values in 60-89Days", np.unique(finalTrain['NumberOfTime60-89DaysPastDueNotWorse']))
print("Unique values in 90Days", np.unique(finalTrain['NumberOfTimes90DaysLate']))
Unique values in 30-59Days [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13]
Unique values in 60-89Days [ 0  1  2  3  4  5  6  7  8  9 11]
Unique values in 90Days [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 17]
print("Unique values in '30-59 Days' values that are more than or equal to 90:",np.unique(finalTest[finalTest['NumberOfTime30-59DaysPastDueNotWorse']>=90] ['NumberOfTime30-59DaysPastDueNotWorse'])) print("Unique values in '60-89 Days' when '30-59 Days' values are more than or equal to 90:",np.unique(finalTest[finalTest['NumberOfTime30-59DaysPastDueNotWorse']>=90] ['NumberOfTime60-89DaysPastDueNotWorse'])) print("Unique values in '90 Days' when '30-59 Days' values are more than or equal to 90:",np.unique(finalTest[finalTest['NumberOfTime30-59DaysPastDueNotWorse']>=90] ['NumberOfTimes90DaysLate'])) print("Unique values in '60-89 Days' when '30-59 Days' values are less than 90:",np.unique(finalTest[finalTest['NumberOfTime30-59DaysPastDueNotWorse']<90] ['NumberOfTime60-89DaysPastDueNotWorse'])) print("Unique values in '90 Days' when '30-59 Days' values are less than 90:",np.unique(finalTest[finalTest['NumberOfTime30-59DaysPastDueNotWorse']<90] ['NumberOfTimes90DaysLate']))
Unique values in '30-59 Days' values that are more than or equal to 90: [96 98]
Unique values in '60-89 Days' when '30-59 Days' values are more than or equal to 90: [96 98]
Unique values in '90 Days' when '30-59 Days' values are more than or equal to 90: [96 98]
Unique values in '60-89 Days' when '30-59 Days' values are less than 90: [0 1 2 3 4 5 6 7 8 9]
Unique values in '90 Days' when '30-59 Days' values are less than 90: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 16 17 18]
finalTest.loc[finalTest['NumberOfTime30-59DaysPastDueNotWorse'] >= 90, 'NumberOfTime30-59DaysPastDueNotWorse'] = 19
finalTest.loc[finalTest['NumberOfTime60-89DaysPastDueNotWorse'] >= 90, 'NumberOfTime60-89DaysPastDueNotWorse'] = 9
finalTest.loc[finalTest['NumberOfTimes90DaysLate'] >= 90, 'NumberOfTimes90DaysLate'] = 18

print("Unique values in 30-59Days", np.unique(finalTest['NumberOfTime30-59DaysPastDueNotWorse']))
print("Unique values in 60-89Days", np.unique(finalTest['NumberOfTime60-89DaysPastDueNotWorse']))
print("Unique values in 90Days", np.unique(finalTest['NumberOfTimes90DaysLate']))
Unique values in 30-59Days [ 0  1  2  3  4  5  6  7  8  9 10 11 12 19]
Unique values in 60-89Days [0 1 2 3 4 5 6 7 8 9]
Unique values in 90Days [ 0  1  2  3  4  5  6  7  8  9 10 11 12 16 17 18]
Step 2: Examine DebtRatio and RevolvingUtilizationOfUnsecuredLines
print('Debt Ratio: \n', finalTrain['DebtRatio'].describe())
print('\nRevolving Utilization of Unsecured Lines: \n', finalTrain['RevolvingUtilizationOfUnsecuredLines'].describe())
Debt Ratio:
count    150000.000000
mean        353.005076
std        2037.818523
min           0.000000
25%           0.175074
50%           0.366508
75%           0.868254
max      329664.000000
Name: DebtRatio, dtype: float64

Revolving Utilization of Unsecured Lines:
count    150000.000000
mean          6.048438
std         249.755371
min           0.000000
25%           0.029867
50%           0.154181
75%           0.559046
max       50708.000000
Name: RevolvingUtilizationOfUnsecuredLines, dtype: float64
Here you can see an enormous gap between the 75th percentile and the maximum. Let's dig deeper.
quantiles = [0.75, 0.8, 0.81, 0.85, 0.9, 0.95, 0.975, 0.99]
for i in quantiles:
    print(i*100, '% quantile of debt ratio is: ', finalTrain.DebtRatio.quantile(i))
75.0 % quantile of debt ratio is:  0.86825377325
80.0 % quantile of debt ratio is:  4.0
81.0 % quantile of debt ratio is:  14.0
85.0 % quantile of debt ratio is:  269.1499999999942
90.0 % quantile of debt ratio is:  1267.0
95.0 % quantile of debt ratio is:  2449.0
97.5 % quantile of debt ratio is:  3489.024999999994
99.0 % quantile of debt ratio is:  4979.040000000037
As you can see, the quantiles jump dramatically after the 81st percentile, so our main goal is to examine the potential outliers beyond that point. However, since we have about 150,000 rows, let's use the 95% and 97.5% quantiles for further analysis.
finalTrain[finalTrain['DebtRatio'] >= finalTrain['DebtRatio'].quantile(0.95)][['SeriousDlqin2yrs','MonthlyIncome']].describe()
SeriousDlqin2yrs MonthlyIncome
count 7501.000000 379.000000
mean 0.055193 0.084433
std 0.228371 0.278403
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 0.000000 0.000000
max 1.000000 1.000000
We can observe the following:
Of the 7,501 customers whose debt ratio exceeds the 95th percentile (i.e. whose debt is many times their income), only 379 have a MonthlyIncome value.
The maximum MonthlyIncome among them is 1 and the minimum is 0, which makes us suspect data-entry errors. Let's check whether SeriousDlqin2yrs and MonthlyIncome are equal for these rows.
finalTrain[(finalTrain["DebtRatio"] > finalTrain["DebtRatio"].quantile(0.95)) & (finalTrain['SeriousDlqin2yrs'] == finalTrain['MonthlyIncome'])]
So our suspicion holds: in 331 of the 379 rows, MonthlyIncome is equal to SeriousDlqin2yrs. We will drop these 331 outliers from the analysis, because their current values are useless for predictive modelling and would only add bias and variance.
The reasoning is that these 331 rows show a debt ratio that is enormous relative to the customer's income yet no serious delinquency, which points to a data-entry error.
# Use ~ (bitwise NOT) to negate the boolean mask
finalTrain = finalTrain[~((finalTrain["DebtRatio"] > finalTrain["DebtRatio"].quantile(0.95)) &
                          (finalTrain['SeriousDlqin2yrs'] == finalTrain['MonthlyIncome']))]
finalTrain
RevolvingUtilizationOfUnsecuredLines
This field is essentially the ratio of the amount a customer owes to their total credit limit. A ratio above 1 would already be considered a serious warning sign.
A ratio of 10 still seems plausible in practice, so let's see how many customers have RevolvingUtilizationOfUnsecuredLines greater than 10.
finalTrain[finalTrain['RevolvingUtilizationOfUnsecuredLines']>10].describe()
Here, if you look at the gap between the 50th and 75th percentiles, you'll see a huge jump from about 13 to 1891.25.
Since 13 still looks like a plausible (if very high) ratio, let's check how many records lie above 13.
finalTrain[finalTrain['RevolvingUtilizationOfUnsecuredLines']>13].describe()
Despite owing thousands of times their credit limit, these 238 customers show no delinquency at all, which suggests another error. Even if it is not an error, these values would add enormous bias and variance to the final predictions, so the best decision is to drop them.
finalTrain = finalTrain[finalTrain['RevolvingUtilizationOfUnsecuredLines'] <= 13]
finalTrain
That takes care of the outliers. Next we move on to the missing data: as we observed at the start of this notebook, MonthlyIncome and NumberOfDependents contain nulls.
Handling Nulls
Since MonthlyIncome is a numeric value, we will replace its nulls with the median.
If a customer's NumberOfDependents is missing, we assume they have no dependents, so we fill those nulls with zero.
def MissingHandler(df):
    DataMissing = df.isnull().sum()*100/len(df)
    DataMissingByColumn = pd.DataFrame({'Percentage Nulls': DataMissing})
    DataMissingByColumn.sort_values(by='Percentage Nulls', ascending=False, inplace=True)
    return DataMissingByColumn[DataMissingByColumn['Percentage Nulls'] > 0]

MissingHandler(finalTrain)
Percentage Nulls
MonthlyIncome 19.850633
NumberOfDependents 2.617261
So MonthlyIncome and NumberOfDependents have roughly 19.85% and 2.62% nulls respectively.
finalTrain['MonthlyIncome'].fillna(finalTrain['MonthlyIncome'].median(), inplace=True)
finalTrain['NumberOfDependents'].fillna(0, inplace=True)
Re-check the nulls
MissingHandler(finalTrain)
Perform the same analysis on the test data
MissingHandler(finalTest)
'''
                    Percentage Nulls
MonthlyIncome              19.805326
NumberOfDependents          2.587116

Similar to the training data, we have roughly 19.81% and 2.59% nulls for
MonthlyIncome and NumberOfDependents respectively.
'''
finalTest['MonthlyIncome'].fillna(finalTrain['MonthlyIncome'].median(), inplace=True)
finalTest['NumberOfDependents'].fillna(0, inplace=True)

# Re-check the nulls
MissingHandler(finalTest)

print(finalTrain.shape)
print(finalTest.shape)
'''
(149431, 11)
(101503, 10)
'''
Additional EDA
Let's explore a few more aspects of the dataset to become more familiar with it.
Correlation Matrix
fig = plt.figure(figsize=[15, 10])
mask = np.zeros_like(finalTrain.corr(), dtype=bool)   # np.bool is deprecated; use the built-in bool
mask[np.triu_indices_from(mask)] = True
sns.heatmap(finalTrain.corr(),
            cmap=sns.diverging_palette(150, 275, s=80, l=55, n=9),
            mask=mask, annot=True, center=0)
plt.title("Correlation Matrix (HeatMap)", fontsize=15)
From the correlation heatmap above, the most strongly correlated features are NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate.
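If you want the exact numbers behind that observation, here is a minimal sketch (assuming finalTrain is the cleaned training frame from above) that lists the strongest pairwise correlations in the heatmap:

corr = finalTrain.corr().abs()
corr = corr.mask(np.triu(np.ones(corr.shape, dtype=bool)))   # keep each pair once, drop the diagonal
print(corr.unstack().dropna().sort_values(ascending=False).head(5))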
Now let's move on to the feature-engineering part of the notebook.
Feature Engineering
Let's first combine the training and test sets so we can add features to both and analyse them together; we will split them again before model training.
SeriousDlqIn2Yrs = finalTrain['SeriousDlqin2yrs']
finalTrain.drop('SeriousDlqin2yrs', axis=1, inplace=True)

finalData = pd.concat([finalTrain, finalTest])
finalData.shape
#(250934, 10)
Add some new features:
MonthlyIncomePerPerson: MonthlyIncome divided by the household size (NumberOfDependents + 1)
MonthlyDebt: MonthlyIncome multiplied by DebtRatio
isRetired: customers older than 65 (the assumed retirement age)
RevolvingLines: NumberOfOpenCreditLinesAndLoans minus NumberRealEstateLoansOrLines
hasRevolvingLines: 1 if RevolvingLines is positive, otherwise 0
hasMultipleRealEstates: 1 if the number of real estate loans or lines is 2 or more
incomeDivByThousand: MonthlyIncome divided by 1,000. For such customers the likelihood of fraud is higher, or it may mean the person has just started a new job and has not yet had a raise; both groups indicate higher risk.
finalData['MonthlyIncomePerPerson'] = finalData['MonthlyIncome']/(finalData['NumberOfDependents']+1)
finalData['MonthlyIncomePerPerson'].fillna(0, inplace=True)

finalData['MonthlyDebt'] = finalData['MonthlyIncome']*finalData['DebtRatio']
finalData['MonthlyDebt'].fillna(finalData['DebtRatio'], inplace=True)
finalData['MonthlyDebt'] = np.where(finalData['MonthlyDebt']==0, finalData['DebtRatio'], finalData['MonthlyDebt'])

finalData['isRetired'] = np.where((finalData['age'] > 65), 1, 0)

finalData['RevolvingLines'] = finalData['NumberOfOpenCreditLinesAndLoans']-finalData['NumberRealEstateLoansOrLines']
finalData['hasRevolvingLines'] = np.where((finalData['RevolvingLines']>0), 1, 0)

finalData['hasMultipleRealEstates'] = np.where((finalData['NumberRealEstateLoansOrLines']>=2), 1, 0)

finalData['incomeDivByThousand'] = finalData['MonthlyIncome']/1000

finalData.shape
#(250934, 17)

MissingHandler(finalData)
#Percentage Nulls
We have now added the new features to the dataset. Next, we check for skewness by looking at the distribution of each column, and apply a Box-Cox transformation to reduce it.
Skewness Check and Box-Cox Transformation
Let's first check the distribution of each column
columnList = list(finalData.columns)
columnList

fig = plt.figure(figsize=[20, 20])
for col, i in zip(columnList, range(1, 19)):
    axes = fig.add_subplot(6, 3, i)
    sns.distplot(finalData[col], ax=axes, kde_kws={'bw': 1.5}, color='purple')
plt.show()
The distribution plots above show that most of our features are skewed in one direction or the other; only age is close to normally distributed. Let's check the skewness value of each column.
def SkewMeasure(df):
    nonObjectColList = df.dtypes[df.dtypes != 'object'].index
    skewM = df[nonObjectColList].apply(lambda x: stats.skew(x.dropna())).sort_values(ascending=False)
    skewM = pd.DataFrame({'skew': skewM})
    return skewM[abs(skewM) > 0.5].dropna()

skewM = SkewMeasure(finalData)
skewM
skew
MonthlyIncome 218.270205
incomeDivByThousand 218.270205
MonthlyIncomePerPerson 206.221804
MonthlyDebt 98.604981
DebtRatio 92.819627
RevolvingUtilizationOfUnsecuredLines 91.721780
NumberOfTimes90DaysLate 15.097509
NumberOfTime60-89DaysPastDueNotWorse 13.509677
NumberOfTime30-59DaysPastDueNotWorse 9.773995
NumberRealEstateLoansOrLines 3.217055
NumberOfDependents 1.829982
isRetired 1.564456
RevolvingLines 1.364633
NumberOfOpenCreditLinesAndLoans 1.219429
hasMultipleRealEstates 1.008475
hasRevolvingLines -8.106007
The skewness of all these columns is very high. We will apply a Box-Cox transformation with λ = 0.15 to reduce it.
for i in skewM.index:
    finalData[i] = special.boxcox1p(finalData[i], 0.15)   # lambda = 0.15

SkewMeasure(finalData)
skew
RevolvingUtilizationOfUnsecuredLines 23.234640
NumberOfTimes90DaysLate 6.787000
NumberOfTime60-89DaysPastDueNotWorse 6.602180
NumberOfTime30-59DaysPastDueNotWorse 3.212010
DebtRatio 1.958314
MonthlyDebt 1.817649
isRetired 1.564456
hasMultipleRealEstates 1.008475
NumberOfDependents 0.947591
incomeDivByThousand 0.708168
MonthlyIncomePerPerson -1.558107
MonthlyIncome -2.152376
hasRevolvingLines -8.106007
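For reference, the boxcox1p call used above computes ((1 + x)^λ − 1) / λ for λ ≠ 0 (and log1p(x) for λ = 0), i.e. a Box-Cox transform of 1 + x. A minimal sketch checking that against the library call; the sample values here are made up purely for illustration:

lam = 0.15
x = np.array([0.0, 1.0, 10.0, 1000.0])
manual = ((1 + x)**lam - 1) / lam
print(np.allclose(manual, special.boxcox1p(x, lam)))   # True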
Thanks to the Box-Cox transformation, the skewness of the most extreme columns has been reduced considerably. Let's look at the distribution plots again:
fig = plt.figure(figsize=[20, 20])
for col, i in zip(columnList, range(1, 19)):
    axes = fig.add_subplot(6, 3, i)
    sns.distplot(finalData[col], ax=axes, kde_kws={'bw': 1.5}, color='purple')
plt.show()
Model Training
Train-Validation Split
We will split the training data into train and validation sets in a 70:30 proportion.
trainDF = finalData[:len(finalTrain)]
testDF = finalData[len(finalTrain):]
print(trainDF.shape)
print(testDF.shape)
'''
(149431, 17)
(101503, 17)
'''
xTrain, xTest, yTrain, yTest = model_selection.train_test_split(trainDF.to_numpy(), SeriousDlqIn2Yrs.to_numpy(),
                                                                test_size=0.3, random_state=2020)
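Given the roughly 14:1 class imbalance, a stratified split keeps the positive rate consistent between the train and validation folds. This is a minor variation on the split above, not what the notebook originally does:

xTrain, xTest, yTrain, yTest = model_selection.train_test_split(
    trainDF.to_numpy(), SeriousDlqIn2Yrs.to_numpy(),
    test_size=0.3, random_state=2020,
    stratify=SeriousDlqIn2Yrs.to_numpy())   # preserve the class ratio in both folds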
LightGBM
Hyperparameter Tuning
lgbAttributes = lgb.LGBMClassifier(objective='binary', n_jobs=-1, random_state=2020, importance_type='gain')

lgbParameters = {
    'max_depth': [2, 3, 4, 5],
    'learning_rate': [0.05, 0.1, 0.125, 0.15],
    'colsample_bytree': [0.2, 0.4, 0.6, 0.8, 1],
    'n_estimators': [400, 500, 600, 700, 800, 900],
    'min_split_gain': [0.15, 0.20, 0.25, 0.3, 0.35],   # equivalent to gamma in XGBoost
    'subsample': [0.6, 0.7, 0.8, 0.9, 1],
    'min_child_weight': [6, 7, 8, 9, 10],
    'scale_pos_weight': [10, 15, 20],
    'min_data_in_leaf': [100, 200, 300, 400, 500, 600, 700, 800, 900],
    'num_leaves': [20, 30, 40, 50, 60, 70, 80, 90, 100]
}

lgbModel = model_selection.RandomizedSearchCV(lgbAttributes, param_distributions=lgbParameters, cv=5, random_state=2020)
lgbModel.fit(xTrain, yTrain.flatten(), feature_name=trainDF.columns.to_list())
RandomizedSearchCV(cv=5,
                   estimator=LGBMClassifier(importance_type='gain', objective='binary', random_state=2020),
                   param_distributions={'colsample_bytree': [0.2, 0.4, 0.6, 0.8, 1],
                                        'learning_rate': [0.05, 0.1, 0.125, 0.15],
                                        'max_depth': [2, 3, 4, 5],
                                        'min_child_weight': [6, 7, 8, 9, 10],
                                        'min_data_in_leaf': [100, 200, 300, 400, 500, 600, 700, 800, 900],
                                        'min_split_gain': [0.15, 0.2, 0.25, 0.3, 0.35],
                                        'n_estimators': [400, 500, 600, 700, 800, 900],
                                        'num_leaves': [20, 30, 40, 50, 60, 70, 80, 90, 100],
                                        'scale_pos_weight': [10, 15, 20],
                                        'subsample': [0.6, 0.7, 0.8, 0.9, 1]},
                   random_state=2020)
bestEstimatorLGB = lgbModel.best_estimator_
bestEstimatorLGB
LGBMClassifier(colsample_bytree=0.4, importance_type='gain', max_depth=5, min_child_weight=6, min_data_in_leaf=600, min_split_gain=0.25, n_estimators=900, num_leaves=50, objective='binary', random_state=2020, scale_pos_weight=10, subsample=0.9)
Save the best estimator from RandomizedSearchCV
bestEstimatorLGB = lgb.LGBMClassifier(colsample_bytree=0.4, importance_type='gain', max_depth=5,
                                      min_child_weight=6, min_data_in_leaf=600, min_split_gain=0.25,
                                      n_estimators=900, num_leaves=50, objective='binary',
                                      random_state=2020, scale_pos_weight=10,
                                      subsample=0.9).fit(xTrain, yTrain.flatten(),
                                                         feature_name=trainDF.columns.to_list())

yPredLGB = bestEstimatorLGB.predict_proba(xTest)
yPredLGB = yPredLGB[:, 1]
yTestPredLGB = bestEstimatorLGB.predict(xTest)
print(metrics.classification_report(yTest, yTestPredLGB))
              precision    recall  f1-score   support

           0       0.97      0.86      0.92     41956
           1       0.26      0.68      0.37      2973

    accuracy                           0.85     44929
   macro avg       0.62      0.77      0.64     44929
weighted avg       0.93      0.85      0.88     44929
metrics.confusion_matrix(yTest,yTestPredLGB)
array([[36214,  5742],
       [  965,  2008]], dtype=int64)
LGBMMetrics = pd.DataFrame({'Model': 'LightGBM',
                            'MSE': round(metrics.mean_squared_error(yTest, yTestPredLGB)*100, 2),
                            'RMSE': round(np.sqrt(metrics.mean_squared_error(yTest, yTestPredLGB)*100), 2),
                            'MAE': round(metrics.mean_absolute_error(yTest, yTestPredLGB)*100, 2),
                            'MSLE': round(metrics.mean_squared_log_error(yTest, yTestPredLGB)*100, 2),
                            'RMSLE': round(np.sqrt(metrics.mean_squared_log_error(yTest, yTestPredLGB)*100), 2),
                            'Accuracy Train': round(bestEstimatorLGB.score(xTrain, yTrain) * 100, 2),
                            'Accuracy Test': round(bestEstimatorLGB.score(xTest, yTest) * 100, 2),
                            'F-Beta Score (β=2)': round(metrics.fbeta_score(yTest, yTestPredLGB, beta=2)*100, 2)},
                           index=[1])
LGBMMetrics
Model MSE RMSE MAE MSLE RMSLE Accuracy Train Accuracy Test F-Beta Score (β=2)
1 LightGBM 14.49 3.81 14.49 6.96 2.64 86.55 85.51 51.35
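The F-Beta score with β = 2 weights recall more heavily than precision, which suits this problem: missing a delinquent customer is costlier than a false alarm. A minimal sketch of the formula behind the fbeta_score call, using the same predictions as above:

precision = metrics.precision_score(yTest, yTestPredLGB)
recall = metrics.recall_score(yTest, yTestPredLGB)
beta = 2
# F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)
fbeta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
print(round(fbeta * 100, 2))   # should reproduce the F-Beta column in the table above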
ROC AUC
fpr, tpr, _ = metrics.roc_curve(yTest, yPredLGB)
rocAuc = metrics.auc(fpr, tpr)

plt.figure(figsize=(12, 6))
plt.title('ROC Curve')
sns.lineplot(x=fpr, y=tpr, label='AUC for LightGBM Model = %0.2f' % rocAuc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
Feature Importance
lgb.plot_importance(bestEstimatorLGB, importance_type='gain')
Feature Importance with SHAP
import shap
X = pd.DataFrame(xTrain, columns=trainDF.columns.to_list())
explainer = shap.TreeExplainer(bestEstimatorLGB)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values[1], X)
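If you prefer a single ranked bar chart of mean absolute SHAP values instead of the beeswarm plot, summary_plot also accepts plot_type='bar'. Note that the shap_values[1] indexing assumes the shap behaviour where a binary LightGBM classifier returns one array per class; some shap versions return a single array, in which case drop the [1]:

shap.summary_plot(shap_values[1], X, plot_type='bar')   # mean |SHAP value| per feature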
XGBoost
Hyperparameter Tuning
xgbAttribute = xgs.XGBClassifier(tree_method='gpu_hist', n_jobs=-1, gpu_id=0)

xgbParameters = {
    'max_depth': [2, 3, 4, 5, 6, 7, 8],
    'learning_rate': [0.05, 0.1, 0.125, 0.15],
    'colsample_bytree': [0.2, 0.4, 0.6, 0.8, 1],
    'n_estimators': [400, 500, 600, 700, 800, 900],
    'gamma': [0.15, 0.20, 0.25, 0.3, 0.35],
    'subsample': [0.6, 0.7, 0.8, 0.9, 1],
    'min_child_weight': [6, 7, 8, 9, 10],
    'scale_pos_weight': [10, 15, 20]
}

xgbModel = model_selection.RandomizedSearchCV(xgbAttribute, param_distributions=xgbParameters, cv=5, random_state=2020)
xgbModel.fit(xTrain, yTrain.flatten())