Regression Analysis (3): Solving Nonlinear Problems with Polynomial Regression
阿新 • Published: 2019-02-11
【Turning a linear regression model into a curve: polynomial regression】
So far we have assumed a linear relationship between the explanatory variables and the target. When that assumption does not hold, we can add polynomial terms and switch to polynomial regression.
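To make the transformation concrete: with degree=2, PolynomialFeatures expands each input value x into the feature vector [1, x, x^2], so an ordinary linear model fit on these columns becomes a quadratic curve in x. A minimal sketch with made-up values:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0], [3.0]])
quadratic = PolynomialFeatures(degree=2)
X_quad = quadratic.fit_transform(X)
print(X_quad)  # each row is [1, x, x^2]
# [[1. 2. 4.]
#  [1. 3. 9.]]
```

The added bias column of ones corresponds to the intercept term of the linear model.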
【Implementing polynomial regression with sklearn】
1、Quadratic regression with PolynomialFeatures
# quadratic regression
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
X = np.array([258.0, 270.0, 294.0, 320.0, 342.0,
              368.0, 396.0, 446.0, 480.0, 586.0])[:, np.newaxis]
y = np.array([236.4, 234.4, 252.8, 298.6, 314.2,
              342.2, 360.8, 368.0, 391.2, 390.8])
lr = LinearRegression()
pr = LinearRegression()
quadratic = PolynomialFeatures(degree=2)  # adds the x^2 term
X_quad = quadratic.fit_transform(X)
2、Fit a plain linear regression model for comparison
# fit linear features
lr.fit(X, y)
X_fit = np.arange(250, 600, 10)[:, np.newaxis]
y_lin_fit = lr.predict(X_fit)
3、Fit a multiple regression model on the transformed polynomial features
# fit quadratic features
pr.fit(X_quad, y)
y_quad_fit = pr.predict(quadratic.fit_transform(X_fit))
4、Plot
# plot
import matplotlib.pyplot as plt
plt.scatter(X, y, label='training data')
plt.plot(X_fit, y_lin_fit, label='linear fit', linestyle='--')
plt.plot(X_fit, y_quad_fit, label='quadratic fit')
plt.legend(loc='upper left')
plt.show()
5、Model evaluation
# MSE and R^2
from sklearn.metrics import mean_squared_error, r2_score
y_lin_pred = lr.predict(X)
y_quad_pred = pr.predict(X_quad)
print('Training MSE linear: %.3f, quadratic: %.3f' % (
      mean_squared_error(y, y_lin_pred),
      mean_squared_error(y, y_quad_pred)))
print('Training R^2 linear: %.3f, quadratic: %.3f' % (
      r2_score(y, y_lin_pred),
      r2_score(y, y_quad_pred)))
【Building nonlinear models on the housing data】
# modeling nonlinear relationships in the housing dataset
X = df[['LSTAT']].values
y = df[['MEDV']].values
regr = LinearRegression()
#create quadratic and cubic features
quadratic = PolynomialFeatures(degree=2)
cubic = PolynomialFeatures(degree=3)
X_quad = quadratic.fit_transform(X)
X_cubic = cubic.fit_transform(X)
#fit features
X_fit = np.arange(X.min(), X.max(), 1)[:, np.newaxis]
#linear
regr = regr.fit(X, y)
y_lin_fit = regr.predict(X_fit)
linear_r2 = r2_score(y, regr.predict(X))
#quadratic
regr = regr.fit(X_quad, y)
y_quad_fit = regr.predict(quadratic.fit_transform(X_fit))
quadratic_r2 = r2_score(y, regr.predict(X_quad))
#cubic
regr = regr.fit(X_cubic, y)
y_cubic_fit = regr.predict(cubic.fit_transform(X_fit))
cubic_r2 = r2_score(y, regr.predict(X_cubic))
# plot results
plt.scatter(X, y, label='training points', color='lightgray')
plt.plot(X_fit, y_lin_fit,
label='linear (d=1), $R^2=%.2f$' % linear_r2,
color='blue',
lw=2,
linestyle=':')
plt.plot(X_fit, y_quad_fit,
label='quadratic (d=2), $R^2=%.2f$' % quadratic_r2,
color='red',
lw=2,
linestyle='-')
plt.plot(X_fit, y_cubic_fit,
label='cubic (d=3), $R^2=%.2f$' % cubic_r2,
color='green',
lw=2,
linestyle='--')
plt.xlabel('% lower status of the population [LSTAT]')
plt.ylabel('Price in $1000s [MEDV]')
plt.legend(loc='upper right')
plt.show()
The plot shows that the cubic model clearly outperforms the quadratic and linear models, but the extra degree also increases model complexity, which makes overfitting more likely.
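A standard way to check for that kind of overfitting is to hold out part of the data and compare the training R² with the test R²: a model that only memorizes noise scores well on training data but poorly on held-out data. A minimal sketch on synthetic data (all names and numbers here are invented for illustration, not from the housing dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# synthetic data with a quadratic trend plus noise
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, 60)[:, np.newaxis]
y = 0.5 * X.ravel() ** 2 - 3 * X.ravel() + rng.normal(scale=4.0, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scores = {}
for degree in (1, 2, 3):
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    scores[degree] = (
        r2_score(y_train, model.predict(poly.transform(X_train))),
        r2_score(y_test, model.predict(poly.transform(X_test))),
    )
    print('degree=%d  train R^2=%.3f  test R^2=%.3f'
          % (degree, scores[degree][0], scores[degree][1]))
```

Training R² can only go up as the degree grows (each model nests the previous one), so the test R² is the number that actually tells you whether the extra terms helped.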
For many nonlinear problems, it is also worth trying a log transformation, which can turn a nonlinear relationship into a linear one.
Let's try a log transformation on the same problem:
# log
#transform features
X_log = np.log(X)
y_sqrt = np.sqrt(y)
#fit features
X_fit = np.arange(X_log.min()-1, X_log.max()+1, 1)[:, np.newaxis]
#regr
regr = regr.fit(X_log, y_sqrt)
y_log_fit = regr.predict(X_fit)
linear_r2 = r2_score(y_sqrt, regr.predict(X_log))
# plot results
plt.scatter(X_log, y_sqrt, label='training points', color='lightgray')
plt.plot(X_fit, y_log_fit,
label='linear (d=1), $R^2=%.2f$' % linear_r2,
color='blue',
lw=2)
plt.xlabel('log(% lower status of the population [LSTAT])')
plt.ylabel(r'$\sqrt{Price \; in \; \$1000s \; [MEDV]}$')
plt.legend(loc='lower left')
plt.tight_layout()
#plt.savefig('images/10_12.png', dpi=300)
plt.show()
R^2 = 0.69, more accurate than all three of the earlier regression models~~~