Machine Learning: Gradient Boosting Machine (GBM) in Python, Study Notes 1: Data Analysis
阿新 • Published: 2018-12-11
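Original article: Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python by Aarshay Jain
Translation source: http://blog.csdn.net/han_xiaoyang/article/details/52663170
I have been reading a blog post on the GBM algorithm translated by the expert 寒小陽 ([email protected]); the links to the original and to the translation are given above. So far I have studied the data analysis part in detail. The original article only mentions it in passing, so I found the source code, got it running, added comments, and am sharing what I learned.
Data analysis (code + comments):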
# coding: utf-8

# In[2]:

import pandas as pd
import numpy as np
get_ipython().run_line_magic('matplotlib', 'inline')  # IPython/Jupyter only

# In[6]:

# Load data:
train = pd.read_csv('Train_nyOWmfK.csv')
test = pd.read_csv('Test_bCtAN1w.csv')

# In[7]:

train.shape, test.shape

# In[8]:

train.dtypes  # check the type of each column

# In[15]:

# Combine into data:
train['source'] = 'train'
test['source'] = 'test'
# Concatenate train and test. ignore_index=True discards the original indexes
# and gives the combined frame one fresh, continuous index.
data = pd.concat([train, test], ignore_index=True)
print(data.shape)
print(train.dtypes)

# ## Check missing:

# In[6]:

# A lambda is a single expression, not a block of statements, so it can only
# hold limited logic. It works as shorthand for defining a small function
# inline. Here it counts the null values in each column of data.
data.apply(lambda x: sum(x.isnull()))

# ## Look at categories of all object variables:

# In[21]:

var = ['Gender', 'Salary_Account', 'Mobile_Verified', 'Var1', 'Filled_Form',
       'Device_Type', 'Var2', 'Source']
for v in var:
    print('\nDistinct values and their counts in column %s\n' % v)
    print(data[v].value_counts())

# ## Handle Individual Variables:

# ### City Variable:

# In[17]:

# Drop the 'City' column because it has too many distinct values.
# axis=0 drops rows (by index label); axis=1 drops columns.
# inplace=True modifies data itself and returns None; inplace=False leaves
# data unchanged and returns a new DataFrame.
len(data['City'].unique())
data.drop('City', axis=1, inplace=True)

# ### Determine Age from DOB

# In[18]:

data['DOB'].head()

# In[44]:

# DOB is the exact date of birth. The exact date is of little use, but the age
# group may be, so create an Age field from it.
# Assumption: DOB is a string ending in a two-digit year (e.g. '23-May-78'),
# so the last two characters are the year and 115 - yy is the age in 2015.
data['Age'] = data['DOB'].apply(lambda x: 115 - int(x[-2:]))
data['Age'].head()

# In[41]:

# Drop the original field
data.drop('DOB', axis=1, inplace=True)

# ### EMI_Loan_Submitted

# In[55]:

data.boxplot(column=['EMI_Loan_Submitted'], return_type='axes')  # draw a box plot

# In[46]:

# Create EMI_Loan_Submitted_Missing: 1 when EMI_Loan_Submitted is missing,
# 0 otherwise. The original EMI_Loan_Submitted is then dropped.
data['EMI_Loan_Submitted_Missing'] = data['EMI_Loan_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
data[['EMI_Loan_Submitted', 'EMI_Loan_Submitted_Missing']].head(10)

# In[56]:

# Drop original variables:
data.drop('EMI_Loan_Submitted', axis=1, inplace=True)

# ### Employer Name

# In[57]:

len(data['Employer_Name'].value_counts())

# In[59]:

# Employer_Name also has too many distinct values, so drop it as well
data.drop('Employer_Name', axis=1, inplace=True)

# ### Existing EMI

# In[60]:

# Missing values of Existing_EMI are filled with 0 (the median), because only
# 111 values are missing
data.boxplot(column='Existing_EMI', return_type='axes')

# In[61]:

data['Existing_EMI'].describe()

# In[19]:

# Impute by median (0) because just 111 missing.
# Assigning the result back avoids calling fillna(inplace=True) on a column
# selection, which newer pandas versions warn about.
data['Existing_EMI'] = data['Existing_EMI'].fillna(0)

# ### Interest Rate:

# In[63]:

# Majority of values missing, so create a new variable stating whether this
# is missing or not:
data['Interest_Rate_Missing'] = data['Interest_Rate'].apply(lambda x: 1 if pd.isnull(x) else 0)
print(data[['Interest_Rate', 'Interest_Rate_Missing']].head(10))

# In[62]:

data.drop('Interest_Rate', axis=1, inplace=True)

# ### Lead Creation Date:

# In[64]:

# Drop this variable because it doesn't appear to affect much intuitively
data.drop('Lead_Creation_Date', axis=1, inplace=True)

# ### Loan Amount and Tenure applied:

# In[65]:

# Impute with median because only 111 missing:
data['Loan_Amount_Applied'] = data['Loan_Amount_Applied'].fillna(data['Loan_Amount_Applied'].median())
data['Loan_Tenure_Applied'] = data['Loan_Tenure_Applied'].fillna(data['Loan_Tenure_Applied'].median())

# ### Loan Amount and Tenure selected

# In[68]:

# High proportion missing, so create a new variable stating whether the value
# is present or not
data['Loan_Amount_Submitted_Missing'] = data['Loan_Amount_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
data['Loan_Tenure_Submitted_Missing'] = data['Loan_Tenure_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)

# In[69]:

# Loan_Amount_Submitted_Missing is 1 when Loan_Amount_Submitted is missing and
# 0 otherwise; Loan_Tenure_Submitted_Missing is built the same way from
# Loan_Tenure_Submitted. Both original variables are dropped.
data.drop(['Loan_Amount_Submitted', 'Loan_Tenure_Submitted'], axis=1, inplace=True)

# ### Remove logged-in

# In[26]:

# Drop LoggedIn here; Salary_Account is dropped in the next cell
data.drop('LoggedIn', axis=1, inplace=True)

# ### Remove salary account

# In[27]:

# Salary account has many banks which would have to be grouped manually
data.drop('Salary_Account', axis=1, inplace=True)

# ### Processing_Fee

# In[28]:

# High proportion missing, so create a new variable stating whether the value
# is present or not
data['Processing_Fee_Missing'] = data['Processing_Fee'].apply(lambda x: 1 if pd.isnull(x) else 0)
# Drop the old variable
data.drop('Processing_Fee', axis=1, inplace=True)

# ### Source

# In[78]:

# Keep the top 2 sources as they are and combine all the others into a
# separate 'others' category
data['Source'] = data['Source'].apply(lambda x: 'others' if x not in ['S122', 'S133'] else x)
data['Source'].value_counts()
print(data['Source'])

# ## Final Data:

# In[30]:

data.apply(lambda x: sum(x.isnull()))

# In[31]:

data.dtypes

# ### Numerical Coding:

# In[80]:

# Give each category its own integer code so the values can be told apart
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
var_to_encode = ['Device_Type', 'Filled_Form', 'Gender', 'Var1', 'Var2', 'Mobile_Verified', 'Source']
for col in var_to_encode:
    data[col] = le.fit_transform(data[col])

# ### One-Hot Coding

# In[81]:

# get_dummies is the pandas way of doing one-hot encoding.
data = pd.get_dummies(data, columns=var_to_encode)
print(data)

# ### Separate train & test:

# In[77]:

print(data['source'])
# .copy() makes each part an independent DataFrame, so the in-place drops
# below do not trigger SettingWithCopyWarning.
train = data.loc[data['source'] == 'train'].copy()
test = data.loc[data['source'] == 'test'].copy()
#print(train.source)
#print(test.source)

# In[35]:

train.drop('source', axis=1, inplace=True)
test.drop(['source', 'Disbursed'], axis=1, inplace=True)

# In[36]:

train.to_csv('train_modified.csv', index=False)
test.to_csv('test_modified.csv', index=False)
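The recurring steps in the code above (replacing a mostly-missing column with a 0/1 flag, collapsing rare categories into 'others', one-hot encoding, and splitting back on the source column) can be tried on a small made-up frame without the CSV files. What follows is only a sketch under stated assumptions: the helper names add_missing_flag and keep_top_categories and the toy values are hypothetical and not part of the original notebook, and it uses vectorized pandas calls (isnull().astype(int), Series.where) in place of apply with a lambda.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names mirror the notes, values are made up.
toy = pd.DataFrame({
    'Interest_Rate': [13.5, np.nan, np.nan, 18.0],
    'Source': ['S122', 'S133', 'S159', 'S143'],
    'source': ['train', 'train', 'test', 'test'],
})


def add_missing_flag(df, col):
    """Return a copy where col is replaced by '<col>_Missing' (1 if null, else 0)."""
    out = df.copy()
    out[col + '_Missing'] = out[col].isnull().astype(int)
    return out.drop(col, axis=1)


def keep_top_categories(series, keep, other='others'):
    """Replace every value that is not listed in keep with the label other."""
    return series.where(series.isin(keep), other)


prepared = add_missing_flag(toy, 'Interest_Rate')
prepared['Source'] = keep_top_categories(prepared['Source'], ['S122', 'S133'])
prepared = pd.get_dummies(prepared, columns=['Source'])

# Split back into train and test; .copy() makes each part an independent frame.
train_part = prepared.loc[prepared['source'] == 'train'].copy()
test_part = prepared.loc[prepared['source'] == 'test'].copy()
train_part = train_part.drop('source', axis=1)
test_part = test_part.drop('source', axis=1)

print(prepared.columns.tolist())
```

Both helpers return new objects instead of modifying their input, so toy stays unchanged and each step can be inspected separately.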
So far I have only studied the data analysis part; I will work through model building and parameter tuning as soon as I can.