Box-Cox transformation and methods for estimating the transformation parameter lambda
阿新 • Published: 2018-12-04
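We transform data because most statistical theory and most parametric tests are derived under the assumption of a normal distribution; the main exception is small samples, where nonparametric methods can be considered instead.
For the basics of the Box-Cox transformation, see: BoxCox-變換方法及其實現運用.pptx
For background on maximum likelihood estimation, see: 極大似然估計思想的最簡單解釋 (the simplest explanation of the idea behind maximum likelihood estimation)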
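For reference, the one-parameter Box-Cox transform and the log-likelihood that is maximized over lambda can be written as below. The notation is chosen here for illustration (x is the raw positive data, y the transformed data, N the sample size); the log-likelihood is the form given in the scipy.stats.boxcox_llf documentation.

```latex
y_i = \begin{cases}
  \dfrac{x_i^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
  \ln x_i, & \lambda = 0
\end{cases}
\qquad
\mathrm{llf}(\lambda) = (\lambda - 1) \sum_{i=1}^{N} \ln x_i
  - \frac{N}{2} \ln\!\left( \frac{1}{N} \sum_{i=1}^{N} (y_i - \bar{y})^2 \right)
```

The maximum likelihood estimate of lambda is the value that makes llf largest, which is why the transform requires x > 0: both ln x and non-integer powers of x are undefined otherwise.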
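From the material above, we know that:
- In the boxcox1p transform, the +c in y+c (c = 1 for boxcox1p) ensures that (y+c) > 0, because the Box-Cox transform requires y > 0.
- Python code:
- y_boxcox = special.boxcox1p(y, lam_best), where the optimized lambda (lam_best) is obtained either by maximizing the log-likelihood (llf) or by calling boxcox_normmax(x).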
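The points above can be checked on a small example. This is a minimal sketch on made-up data with an arbitrary lambda (0.3 is not an optimized value): it shows that boxcox1p(y, lam) is the same as applying the Box-Cox formula to 1 + y, and that inv_boxcox1p recovers the original values.

```python
import numpy as np
from scipy import special

# Made-up data that include a zero; stats.boxcox would reject this input
# because it requires strictly positive data.
y = np.array([0.0, 0.5, 1.0, 10.0, 100.0])
lam = 0.3  # arbitrary illustrative lambda, not an optimized value

# boxcox1p(y, lam) equals boxcox(1 + y, lam) = ((1 + y)**lam - 1) / lam for lam != 0
y_1p = special.boxcox1p(y, lam)
y_shifted = special.boxcox(1.0 + y, lam)
manual = ((1.0 + y) ** lam - 1.0) / lam

# inv_boxcox1p undoes the transform
y_back = special.inv_boxcox1p(y_1p, lam)

print(y_1p)
print(np.allclose(y_1p, y_shifted), np.allclose(y_1p, manual), np.allclose(y_back, y))
```

Because of the shift by 1, a lambda estimated on y is strictly speaking not the optimal lambda for 1 + y, although the difference is negligible when the values of y are large.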
Description of boxcox_normmax(x); for details see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox_normmax.html

scipy.stats.boxcox_normmax(x, brack=(-2.0, 2.0), method='pearsonr')

    Compute optimal Box-Cox transform parameter for input data.

    Parameters:
        x : array_like
            Input array.
        brack : 2-tuple, optional
            The starting interval for a downhill bracket search with optimize.brent. Note that this is in most cases not critical; the final result is allowed to be outside this bracket.
        method : str, optional
            The method to determine the optimal transform parameter (boxcox lmbda parameter). Options are:
            'pearsonr' (default)
                Maximizes the Pearson correlation coefficient between y = boxcox(x) and the expected values for y if x would be normally-distributed.
            'mle'
                Maximizes the log-likelihood boxcox_llf. This is the method used in boxcox.
            'all'
                Use all optimization methods available, and return all results. Useful to compare different methods.

    Returns:
        maxlog : float or ndarray
            The optimal transform parameter found. An array instead of a scalar for method='all'.
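A short sketch of how the three method options behave, run on synthetic data rather than on any dataset from this post. The log-normal sample and the seed are assumptions made for illustration; since the log of a log-normal variable is normal, the estimated lambda should come out close to 0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic log-normal sample: strictly positive and right-skewed
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

lam_pearson = stats.boxcox_normmax(x)             # default method='pearsonr'
lam_mle = stats.boxcox_normmax(x, method='mle')   # maximizes boxcox_llf
lam_all = stats.boxcox_normmax(x, method='all')   # array: [pearsonr, mle]

# stats.boxcox with lmbda left unset also estimates lambda by maximum likelihood
_, lam_boxcox = stats.boxcox(x)

print(lam_pearson, lam_mle, lam_all, lam_boxcox)
```

The 'pearsonr' and 'mle' estimates usually differ slightly because they optimize different criteria; method='all' is a convenient way to see both at once.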
Next, let's practice on the dataset from the Kaggle competition House Prices: Advanced Regression Techniques.
For usage of scipy.stats.boxcox_llf, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox_llf.html

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy import special, stats

    train = pd.read_csv('./data/train.csv')
    y = train['SalePrice']
    print(y.shape)

    # Candidate lambdas (np.linspace defaults to num=50; 100 points are used here)
    lam_range = np.linspace(-2, 5, 100)
    llf = np.zeros(lam_range.shape, dtype=float)

    # Evaluate the Box-Cox log-likelihood for each candidate lambda
    for i, lam in enumerate(lam_range):
        llf[i] = stats.boxcox_llf(lam, y)  # y must be > 0

    # Pick the lambda with the largest log-likelihood (llf)
    lam_best = lam_range[llf.argmax()]
    print('Suitable lam is: ', round(lam_best, 2))
    print('Max llf is: ', round(llf.max(), 2))

    plt.figure()
    plt.axvline(round(lam_best, 2), ls="--", color="r")
    plt.plot(lam_range, llf)
    plt.savefig('boxcox.jpg')  # save before show(); saving afterwards can write an empty figure
    plt.show()

    # Box-Cox transform. Note: lam_best was estimated for y, while boxcox1p transforms 1 + y.
    print('before convert: ', '\n', y.head())
    # y_boxcox = stats.boxcox(y, lam_best)
    y_boxcox = special.boxcox1p(y, lam_best)
    print('after convert: ', '\n', pd.DataFrame(y_boxcox).head())

    # Inverse Box-Cox transform
    y_invboxcox = special.inv_boxcox1p(y_boxcox, lam_best)
    print('after inverse: ', '\n', pd.DataFrame(y_invboxcox).head())

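The results are as follows (the log-likelihood curve is saved to boxcox.jpg):
In addition, lambda can also be determined with scipy.stats.boxcox_normplot; for details see http://scipy.github.io/devdocs/generated/scipy.stats.boxcox_normplot.html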
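A minimal sketch of the boxcox_normplot approach, again on synthetic log-normal data (an assumption for illustration, not the House Prices data). boxcox_normplot evaluates the probability plot correlation coefficient (PPCC) on a grid of lambdas between la and lb; the lambda with the highest PPCC should agree, up to the grid spacing, with boxcox_normmax using method='pearsonr'.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic log-normal sample: strictly positive and right-skewed
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# PPCC for N lambdas evenly spaced between la=-2 and lb=2 (spacing 0.05)
lmbdas, ppcc = stats.boxcox_normplot(x, -2, 2, N=81)
lam_plot = lmbdas[np.argmax(ppcc)]

# boxcox_normmax with 'pearsonr' maximizes the same correlation without a grid
lam_max = stats.boxcox_normmax(x, method='pearsonr')

print(lam_plot, lam_max)
```

Passing plot=plt (matplotlib.pyplot) or an Axes object to boxcox_normplot draws the PPCC curve, which makes it easy to see how sharply the optimum is defined.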