
Box-Cox transformation and how to estimate the transformation parameter lambda

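The reason we transform data: apart from small samples, where nonparametric methods can be considered, most statistical theory and parametric tests are derived under the assumption of a normal distribution.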

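For the basics of the Box-Cox transformation, see: BoxCox-變換方法及其實現運用.pptx (Box-Cox transformation methods and their implementation)

For background on maximum likelihood estimation, see: 極大似然估計思想的最簡單解釋 (the simplest explanation of the idea behind maximum likelihood estimation)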

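From the material above, we know the following:

  • In the boxcox1p transform, the +c in y+c (c = 1 for boxcox1p) ensures that (y+c) > 0, because the Box-Cox transform requires a strictly positive input.
  • Python code: y_boxcox = special.boxcox1p(y, lam_best)
  • The optimized lambda (lam_best) can be obtained either from the log-likelihood function (llf) or from boxcox_normmax(x).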
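The relationship between boxcox1p and boxcox can be checked directly. The following is a minimal sketch on a small made-up array; the values of y and lam are illustrative assumptions, not taken from the dataset used later.

```python
import numpy as np
from scipy import special

# Illustrative data that contains a zero (assumed values, not real data)
y = np.array([0.0, 1.0, 10.0, 100.0])
lam = 0.3  # hypothetical lambda, chosen only for the demonstration

# boxcox1p(y, lam) applies the Box-Cox transform to 1 + y
shifted = special.boxcox1p(y, lam)
direct = special.boxcox(1.0 + y, lam)

# The inverse transform recovers the original values
restored = special.inv_boxcox1p(shifted, lam)

# Without the shift, a zero breaks the lambda = 0 (log) case
unshifted_zero = special.boxcox(0.0, 0.0)

print(np.allclose(shifted, direct))
print(np.allclose(restored, y))
print(np.isfinite(unshifted_zero))
```

Because the shift is part of the transform, the inverse must be inv_boxcox1p rather than inv_boxcox.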

Description of boxcox_normmax(x); for details see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox_normmax.html

scipy.stats.boxcox_normmax(x, brack=(-2.0, 2.0), method='pearsonr')
Compute optimal Box-Cox transform parameter for input data.

Parameters:	
x : array_like 	Input array.
brack : 2-tuple, optional
	The starting interval for a downhill bracket search with optimize.brent. Note that this is in most cases not critical; the final result is allowed to be outside this bracket.
method : str, optional
	The method to determine the optimal transform parameter (boxcox lmbda parameter). Options are:
		‘pearsonr’ (default)
		Maximizes the Pearson correlation coefficient between y = boxcox(x) and the expected values for y if x would be normally-distributed.
		‘mle’
		Maximizes the log-likelihood boxcox_llf. This is the method used in boxcox.
		‘all’
		Use all optimization methods available, and return all results. Useful to compare different methods.
Returns:	
maxlog : float or ndarray
	The optimal transform parameter found. An array instead of a scalar for method='all'.
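As a quick illustration of the method options described above, here is a minimal sketch on synthetic log-normal data. The data, the seed, and the variable names are assumptions made for the example. For log-normal input, the estimated lambda should land near 0, which corresponds to a log transform.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive data (assumed for illustration only)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.5, size=2000)

# Lambda from each method
lam_pearson = stats.boxcox_normmax(x, method='pearsonr')
lam_mle = stats.boxcox_normmax(x, method='mle')

# method='all' returns both estimates in one array: [pearsonr, mle]
lam_all = stats.boxcox_normmax(x, method='all')

# stats.boxcox without an explicit lmbda estimates lambda by MLE as well
_, lam_boxcox = stats.boxcox(x)

print('pearsonr:', lam_pearson)
print('mle:     ', lam_mle)
print('all:     ', lam_all)
print('boxcox:  ', lam_boxcox)
```

The two methods usually give similar but not identical values, which is why method='all' is handy for comparison.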

Next, let's practice on the dataset from the Kaggle competition House Prices: Advanced Regression Techniques.

For the usage of scipy.stats.boxcox_llf, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox_llf.html
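To make clear what boxcox_llf evaluates, the sketch below recomputes it by hand using the formula given in the SciPy documentation, llf = (lambda - 1) * sum(log(x_i)) - N/2 * log(sum((y_i - mean(y))^2) / N), where y is the Box-Cox transform of x. The helper name manual_boxcox_llf and the synthetic data are assumptions for illustration, not part of SciPy.

```python
import numpy as np
from scipy import stats, special


def manual_boxcox_llf(lam, x):
    """Hand-written Box-Cox log-likelihood (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    y = special.boxcox(x, lam)  # equals log(x) when lam == 0
    # np.var uses ddof=0 by default, i.e. sum((y - mean)^2) / N
    return (lam - 1.0) * np.sum(np.log(x)) - n / 2.0 * np.log(np.var(y))


# Synthetic, strictly positive data (assumed for illustration only)
rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=0.5, size=500)

lams = [-1.0, 0.0, 0.5, 2.0]
manual = np.array([manual_boxcox_llf(lam, x) for lam in lams])
scipy_llf = np.array([stats.boxcox_llf(lam, x) for lam in lams])

print(manual)
print(scipy_llf)
```

The lambda that maximizes this quantity is the maximum likelihood estimate, which is what the grid search in the script looks for.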

import pandas as pd
import numpy as np
from scipy import stats,special
import matplotlib.pyplot as plt

train = pd.read_csv('./data/train.csv')
y = train['SalePrice']
print(y.shape)

lam_range = np.linspace(-2,5,100)  # 100 candidate lambdas (np.linspace defaults to num=50)
llf = np.zeros(lam_range.shape, dtype=float)

# lambda estimate:
for i,lam in enumerate(lam_range):
    llf[i] = stats.boxcox_llf(lam, y)		# y must be > 0

# find the index of the max log-likelihood (llf) and pick the corresponding lambda
lam_best = lam_range[llf.argmax()]
print('Suitable lam is: ',round(lam_best,2))
print('Max llf is: ', round(llf.max(),2))

plt.figure()
plt.axvline(round(lam_best,2),ls="--",color="r")
plt.plot(lam_range,llf)
# save before show(): after show() returns, the figure may already be cleared
plt.savefig('boxcox.jpg')
plt.show()

# boxcox convert:
print('before convert: ','\n', y.head())
#y_boxcox = stats.boxcox(y, lam_best)
# note: boxcox1p transforms 1 + y, whereas lam_best was estimated on y itself
y_boxcox = special.boxcox1p(y, lam_best)
print('after convert: ','\n',  pd.DataFrame(y_boxcox).head())

# inverse boxcox convert:
y_invboxcox = special.inv_boxcox1p(y_boxcox, lam_best)
print('after inverse: ', '\n', pd.DataFrame(y_invboxcox).head())

The results are as follows:

 

In addition, lambda can also be determined with scipy.stats.boxcox_normplot; for details see http://scipy.github.io/devdocs/generated/scipy.stats.boxcox_normplot.html
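A minimal sketch of that approach follows, again on synthetic log-normal data (the data, seed, and search range of -2 to 2 are assumptions for the example). boxcox_normplot returns the probability plot correlation coefficient for each candidate lambda, so the lambda with the largest coefficient can be read off the result and compared with boxcox_normmax.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive data (assumed for illustration only)
rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=0.5, size=1000)

# Correlation coefficients for N lambdas between la=-2 and lb=2.
# With plot=None (the default) nothing is drawn; pass a matplotlib
# Axes or pyplot module as `plot` to get the Box-Cox normality plot.
lmbdas, ppcc = stats.boxcox_normplot(x, -2, 2, N=81)

# Lambda with the highest correlation on the grid
lam_plot = lmbdas[np.argmax(ppcc)]

# Continuous optimum of the same criterion, for comparison
lam_pearson = stats.boxcox_normmax(x, method='pearsonr')

print('grid lambda:    ', lam_plot)
print('normmax lambda: ', lam_pearson)
```

The grid value is limited by the spacing of the candidate lambdas, so it should agree with boxcox_normmax only to within roughly one grid step.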