
Regression Analysis (3): Solving Nonlinear Problems with Polynomial Regression

【Turning a Linear Regression Model into a Curve: Polynomial Regression】

So far we have assumed that the relationship between the explanatory variables and the target is linear. When that assumption does not hold, we can add polynomial terms to the feature matrix and fit a polynomial regression instead: the model stays linear in its coefficients, but the fitted curve can bend with the data.
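For intuition, a degree-2 polynomial expansion turns each single-feature sample x into [1, x, x^2], so ordinary least squares can then fit w0 + w1*x + w2*x^2 while remaining linear in the weights. A minimal sketch with a made-up toy array (not the data used below):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# toy single-feature column, purely for illustration
x_toy = np.array([[1.0], [2.0], [3.0]])

# degree-2 expansion: each row [x] becomes [1, x, x^2]
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(x_toy))
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]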


【Polynomial Regression with sklearn】

1. Quadratic regression with PolynomialFeatures

# quadratic regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.array([258.0, 270.0, 294.0,
              320.0, 342.0, 368.0,
              396.0, 446.0, 480.0, 586.0])\
             [:, np.newaxis]

y = np.array([236.4, 234.4, 252.8,
              298.6, 314.2, 342.2,
              360.8, 368.0, 391.2,
              390.8])

lr = LinearRegression()
pr = LinearRegression()
quadratic = PolynomialFeatures(degree=2)  # degree-2 polynomial features
X_quad = quadratic.fit_transform(X)

2. Fit a plain linear regression model for comparison

# fit linear features
lr.fit(X, y)
X_fit = np.arange(250, 600, 10)[:, np.newaxis]
y_lin_fit = lr.predict(X_fit)

3. Fit a multiple regression model on the polynomial-transformed features

# fit quadratic features
pr.fit(X_quad, y)
y_quad_fit = pr.predict(quadratic.fit_transform(X_fit))
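A small aside: quadratic was already fitted on X in step 1, and for PolynomialFeatures the fit step only records the number of input features, so calling transform alone is enough here and makes the intent clearer:

y_quad_fit = pr.predict(quadratic.transform(X_fit))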

4. Plot the results

# plot
plt.scatter(X, y, label='training data')
plt.plot(X_fit, y_lin_fit, label='linear fit', linestyle='--')
plt.plot(X_fit, y_quad_fit, label='quadratic fit')
plt.legend(loc='upper left')
plt.show()

5. Model evaluation

# MSE and R^2 on the training data
from sklearn.metrics import mean_squared_error, r2_score

y_lin_pred = lr.predict(X)
y_quad_pred = pr.predict(X_quad)

print('Training MSE linear: %.3f, quadratic: %.3f' % (
        mean_squared_error(y, y_lin_pred),
        mean_squared_error(y, y_quad_pred)))
print('Training R^2 linear: %.3f, quadratic: %.3f' % (
        r2_score(y, y_lin_pred),
        r2_score(y, y_quad_pred)))
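As a side note, the transform-then-fit flow above can also be packaged into a single sklearn Pipeline, which keeps the polynomial expansion and the regression step together; a minimal sketch reusing the X and y arrays from step 1:

from sklearn.pipeline import make_pipeline

# chain the degree-2 expansion and the linear model into one estimator
quad_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
quad_model.fit(X, y)
print('Training R^2: %.3f' % quad_model.score(X, y))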

【Building a Nonlinear Model on the Housing Data】

# modeling the nonlinear relationship in the housing dataset
# df is assumed to be the housing DataFrame loaded earlier in this series
X = df[['LSTAT']].values
y = df['MEDV'].values
regr = LinearRegression()

# create quadratic and cubic features
quadratic = PolynomialFeatures(degree=2)
cubic = PolynomialFeatures(degree=3)
X_quad = quadratic.fit_transform(X)
X_cubic = cubic.fit_transform(X)

# evaluation grid for plotting the fitted curves
X_fit = np.arange(X.min(), X.max(), 1)[:, np.newaxis]

# linear fit
regr = regr.fit(X, y)
y_lin_fit = regr.predict(X_fit)
linear_r2 = r2_score(y, regr.predict(X))

# quadratic fit
regr = regr.fit(X_quad, y)
y_quad_fit = regr.predict(quadratic.transform(X_fit))
quadratic_r2 = r2_score(y, regr.predict(X_quad))

# cubic fit
regr = regr.fit(X_cubic, y)
y_cubic_fit = regr.predict(cubic.transform(X_fit))
cubic_r2 = r2_score(y, regr.predict(X_cubic))
# plot results
plt.scatter(X, y, label='training points', color='lightgray')

plt.plot(X_fit, y_lin_fit, 
         label='linear (d=1), $R^2=%.2f$' % linear_r2, 
         color='blue', 
         lw=2, 
         linestyle=':')

plt.plot(X_fit, y_quad_fit, 
         label='quadratic (d=2), $R^2=%.2f$' % quadratic_r2,
         color='red', 
         lw=2,
         linestyle='-')

plt.plot(X_fit, y_cubic_fit, 
         label='cubic (d=3), $R^2=%.2f$' % cubic_r2,
         color='green', 
         lw=2, 
         linestyle='--')

plt.xlabel('% lower status of the population [LSTAT]')
plt.ylabel('Price in $1000s [MEDV]')
plt.legend(loc='upper right')
plt.show()

The plot shows that the cubic model fits the data better than the quadratic and linear models. However, the cubic fit adds model complexity, which makes overfitting more likely.
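To check whether the extra degree actually generalizes rather than just memorizing the training data, cross-validated R^2 is a common yardstick; a minimal sketch (not part of the original walkthrough), reusing the housing X and y from above:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# compare mean cross-validated R^2 across polynomial degrees 1-3
for degree in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print('degree %d: mean CV R^2 = %.3f' % (degree, scores.mean()))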

For many nonlinear problems it is also worth trying a log transformation, which can turn a nonlinear relationship into a linear one.
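The idea is that a suitable transformation can straighten the relationship. For example, if y ≈ a·exp(b·x), then log(y) = log(a) + b·x is linear in x. A quick synthetic sketch (toy data, not the housing set):

# toy exponential data: y = 2 * exp(0.5 * x) with mild noise
rng = np.random.RandomState(0)
x_toy = np.linspace(1, 5, 20)[:, np.newaxis]
y_toy = 2.0 * np.exp(0.5 * x_toy.ravel()) * rng.uniform(0.95, 1.05, 20)

# after taking log(y), a plain linear model recovers the slope b
lin = LinearRegression().fit(x_toy, np.log(y_toy))
print('slope %.2f (true 0.5), R^2 %.3f'
      % (lin.coef_[0], lin.score(x_toy, np.log(y_toy))))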

      

Let's try the log transformation on the same problem:

# log transform on X, square-root transform on y
X_log = np.log(X)
y_sqrt = np.sqrt(y)

# evaluation grid for plotting the fitted line
X_fit = np.arange(X_log.min() - 1, X_log.max() + 1, 1)[:, np.newaxis]

# fit a linear model on the transformed features
regr = regr.fit(X_log, y_sqrt)
y_log_fit = regr.predict(X_fit)
linear_r2 = r2_score(y_sqrt, regr.predict(X_log))
# plot results
plt.scatter(X_log, y_sqrt, label='training points', color='lightgray')

plt.plot(X_fit, y_log_fit, 
         label='linear (d=1), $R^2=%.2f$' % linear_r2, 
         color='blue', 
         lw=2)

plt.xlabel('log(% lower status of the population [LSTAT])')
plt.ylabel(r'$\sqrt{Price \; in \; \$1000s \; [MEDV]}$')
plt.legend(loc='lower left')

plt.tight_layout()
#plt.savefig('images/10_12.png', dpi=300)
plt.show()

The transformed model achieves R^2 = 0.69, more accurate than any of the three polynomial fits above.