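ISLR Chapter 3, Linear Regression: Applied Exercise Solutions (Part 1)
阿新 • Published: 2019-01-09
ISLR; R; machine learning; linear regression
I learned some of the technical terms only in English, so the terminology here may not always be standard; please be kind.
8. Simple linear regression on the Auto data set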
library(MASS)
library(ISLR)
library(car)
Auto=read.csv("Auto.csv",header=T,na.strings="?")
Auto=na.omit(Auto)
attach(Auto)
summary(Auto)
Output:
      mpg          cylinders      displacement     horsepower
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0
 1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0
 Median :22.75   Median :4.000   Median :151.0   Median : 93.5
 Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0
 Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0
     weight      acceleration        year           origin
 Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000
 1st Qu.:2225   1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000
 Median :2804   Median :15.50   Median :76.00   Median :1.000
 Mean   :2978   Mean   :15.54   Mean   :75.98   Mean   :1.577
 3rd Qu.:3615   3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000
 Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000
                 name
 amc matador       :  5
 ford pinto        :  5
 toyota corolla    :  5
 amc gremlin       :  4
 amc hornet        :  4
 chevrolet chevette:  4
 (Other)           :365
Linear regression:
lm.fit=lm(mpg~horsepower)
summary(lm.fit)
Output:
Call:
lm(formula = mpg ~ horsepower)
Residuals:
     Min       1Q   Median       3Q      Max
-13.5710  -3.2592  -0.3435   2.7630  16.9240
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.935861   0.717499   55.66   <2e-16 ***
horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.906 on 390 degrees of freedom
Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
a)
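- Null hypothesis H0: β_horsepower = 0, i.e. horsepower has no linear association with mpg. Since the F-statistic is far greater than 1 and the p-value is close to 0, we reject the null hypothesis: there is a statistically significant relationship between horsepower and mpg.
- The mean of mpg is 23.45 and the RSE of the fit is 4.906, a relative error of about 20.92%. The R-squared is 0.6059, so about 60.59% of the variance in mpg is explained by horsepower.
- The slope is negative, so the relationship between mpg and horsepower is negative: higher horsepower goes with lower mpg.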
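The quantities quoted in a) can be read off a fitted model directly. The following is a minimal sketch, assuming the built-in mtcars data set as a stand-in for Auto (with hp playing the role of horsepower), so its numbers differ from those above; the last step repeats the relative-error arithmetic with the rounded Auto values.

```r
# Stand-in data: built-in mtcars (an assumption; the post uses Auto.csv)
fit <- lm(mpg ~ hp, data = mtcars)
s <- summary(fit)

rse <- s$sigma                       # residual standard error (RSE)
rel_err <- rse / mean(mtcars$mpg)    # RSE relative to the mean response
r2 <- s$r.squared                    # share of variance in mpg explained by hp

# Same ratio using the rounded Auto values quoted in the text
auto_rel_err <- 4.906 / 23.45

cat(sprintf("mtcars: RSE = %.3f, relative error = %.1f%%, R^2 = %.3f\n",
            rse, 100 * rel_err, r2))
cat(sprintf("Auto (rounded inputs): relative error = %.2f%%\n",
            100 * auto_rel_err))
```

With the rounded inputs the ratio comes out near 20.92%; the post's 20.9248% uses the unrounded mean.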
Predicting mpg
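predict(lm.fit,data.frame(mpg=c(98)),interval="prediction")
Warning message:
'newdata' had 1 row but variables found have 392 rows
Fix: the warning arises because the new data frame has no column named horsepower (the predictor in the formula), so predict falls back on the 392 attached values. The workaround here refits the model under new variable names and passes the new value under the matching name. Note that the names are swapped relative to their roles: predictor holds mpg (the response) and response holds horsepower (the predictor).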
predictor=mpg
response=horsepower
lm.fit2=lm(predictor~response)
predict(lm.fit2,data.frame(response=c(98)),interval="confidence")
fit lwr upr
1 24.47 23.97 24.96
predict(lm.fit2,data.frame(response=c(98)),interval="prediction")
fit lwr upr
1 24.46708 14.8094 34.12476
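A more direct route is to keep the original fit and name the new-data column after the predictor in the formula. This is a sketch on the built-in mtcars data (an assumption, standing in for Auto, with hp in place of horsepower), so the intervals are not comparable to the ones printed above.

```r
# Stand-in data: built-in mtcars (an assumption; the post uses Auto)
fit <- lm(mpg ~ hp, data = mtcars)

# The column name must match the predictor used in the formula
new_car <- data.frame(hp = 98)

conf_int <- predict(fit, new_car, interval = "confidence")  # interval for the mean response
pred_int <- predict(fit, new_car, interval = "prediction")  # interval for a single new car

print(conf_int)
print(pred_int)
```

Both calls return the same point estimate; the prediction interval is wider because it also accounts for the irreducible error of an individual observation.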
b) Scatterplot of mpg against horsepower with the least squares line
plot(response,predictor)
abline(lm.fit2,lwd=3,col="red")
c) Diagnostic plots for the least squares fit
par(mfrow=c(2,2))
plot(lm.fit2)
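The diagnostic plots, notably the curved pattern in the residuals versus fitted values, give clear evidence that the relationship between mpg and horsepower is non-linear.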
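One way to back up a visual impression of curvature is to add a quadratic term and compare the two nested fits with an F-test. This is a sketch under assumptions: it uses the built-in mtcars data with hp instead of Auto, and the quadratic term is just one possible non-linear alternative.

```r
# Stand-in data: built-in mtcars (an assumption; the post uses Auto)
lin  <- lm(mpg ~ hp, data = mtcars)             # straight-line fit
quad <- lm(mpg ~ hp + I(hp^2), data = mtcars)   # adds a squared term

# F-test for the nested models: does the squared term reduce the RSS significantly?
cmp <- anova(lin, quad)
p_value <- cmp[["Pr(>F)"]][2]

print(cmp)
cat("R^2 linear:", summary(lin)$r.squared,
    " R^2 quadratic:", summary(quad)$r.squared, "\n")
```

A small p-value here would support the non-linearity suggested by the residual plot.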
9. Multiple linear regression on the Auto data set
a) Scatterplot matrix
pairs(Auto)
b) Correlation matrix
cor(subset(Auto,select=-name))
mpg cylinders displacement horsepower weight
mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
acceleration year origin
mpg 0.4233285 0.5805410 0.5652088
cylinders -0.5046834 -0.3456474 -0.5689316
displacement -0.5438005 -0.3698552 -0.6145351
horsepower -0.6891955 -0.4163615 -0.4551715
weight -0.4168392 -0.3091199 -0.5850054
acceleration 1.0000000 0.2903161 0.2127458
year 0.2903161 1.0000000 0.1815277
origin 0.2127458 0.1815277 1.0000000
c) Multiple linear regression:
lm.fit3=lm(mpg~.-name,data=Auto)
summary(lm.fit3)
Call:
lm(formula = mpg ~ . - name, data = Auto)
Residuals:
Min 1Q Median 3Q Max
-9.5903 -2.1565 -0.1169 1.8690 13.0604
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.218435 4.644294 -3.707 0.00024 ***
cylinders -0.493376 0.323282 -1.526 0.12780
displacement 0.019896 0.007515 2.647 0.00844 **
horsepower -0.016951 0.013787 -1.230 0.21963
weight -0.006474 0.000652 -9.929 < 2e-16 ***
acceleration 0.080576 0.098845 0.815 0.41548
year 0.750773 0.050973 14.729 < 2e-16 ***
origin 1.426141 0.278136 5.127 4.67e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
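- Null hypothesis: all slope coefficients are zero, i.e. mpg is unrelated to every predictor. Since the F-statistic is far greater than 1 and its p-value is close to 0, we reject the null hypothesis: at least one predictor has a statistically significant relationship with mpg.
- Judging by the individual p-values, displacement, weight, year and origin are statistically significant.
- The coefficient of year is positive (0.75): with the other predictors held fixed, mpg increases by about 0.75 per model year, so fuel efficiency improves over time.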
d)
par(mfrow=c(2,2))
plot(lm.fit3)
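The residuals still show a clear curved pattern, which suggests that a purely linear multiple regression is not an adequate fit.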
plot(predict(lm.fit3), rstudent(lm.fit3))
From the leverage plot (Residuals vs Leverage), observation 14 has very high leverage even though its residual is not large.
e)
lm.fit4=lm(mpg~displacement*weight+year*origin)
summary(lm.fit4)
Output:
Call:
lm(formula = mpg ~ displacement * weight + year * origin)
Residuals:
Min 1Q Median 3Q Max
-9.5758 -1.6211 -0.0537 1.3264 13.3266
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.793e+01 8.044e+00 2.229 0.026394 *
displacement -7.519e-02 9.091e-03 -8.271 2.19e-15 ***
weight -1.035e-02 6.450e-04 -16.053 < 2e-16 ***
year 4.864e-01 1.017e-01 4.782 2.47e-06 ***
origin -1.503e+01 4.232e+00 -3.551 0.000432 ***
displacement:weight 2.098e-05 2.179e-06 9.625 < 2e-16 ***
year:origin 1.980e-01 5.436e-02 3.642 0.000308 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.969 on 385 degrees of freedom
Multiple R-squared: 0.8575, Adjusted R-squared: 0.8553
F-statistic: 386.2 on 6 and 385 DF, p-value: < 2.2e-16
Both interaction terms are statistically significant, and the residual standard error drops from 3.328 in c) to 2.969.
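In an R formula, `a*b` is shorthand for `a + b + a:b`, which is why the main effects appear in the output alongside each interaction. A small sketch on the built-in mtcars data (an assumption, standing in for Auto; disp and wt are mtcars columns) makes the expansion visible:

```r
# Stand-in data: built-in mtcars (an assumption; the post uses Auto)
fit_add <- lm(mpg ~ disp + wt, data = mtcars)   # main effects only
fit_int <- lm(mpg ~ disp * wt, data = mtcars)   # main effects plus disp:wt

# The interaction model carries both main effects and the product term
print(names(coef(fit_int)))

# Compare residual standard errors of the two fits
rse <- c(additive = summary(fit_add)$sigma,
         interaction = summary(fit_int)$sigma)
print(rse)
```

Whether the interaction lowers the RSE depends on the data; the comparison above is the same check the post makes between c) and e).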
f)
lm.fit5 = lm(mpg~log(horsepower)+sqrt(horsepower)+horsepower+I(horsepower^2))
summary(lm.fit5)
Output:
Call:
lm(formula = mpg ~ log(horsepower) + sqrt(horsepower) + horsepower +
I(horsepower^2))
Residuals:
Min 1Q Median 3Q Max
-15.3450 -2.4725 -0.1594 2.1068 16.2564
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.839e+02 2.439e+02 -2.804 0.00530 **
log(horsepower) 6.515e+02 2.111e+02 3.085 0.00218 **
sqrt(horsepower) -3.385e+02 1.092e+02 -3.101 0.00207 **
horsepower 1.165e+01 3.898e+00 2.988 0.00299 **
I(horsepower^2) -7.425e-03 2.796e-03 -2.655 0.00825 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.331 on 387 degrees of freedom
Multiple R-squared: 0.6952, Adjusted R-squared: 0.692
F-statistic: 220.6 on 4 and 387 DF, p-value: < 2.2e-16
Diagnostic plots:
par(mfrow=c(2,2))
plot(lm.fit5)
10. The Carseats data set
a)
summary(Carseats)
Output:
Sales CompPrice Income Advertising
Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
Median : 7.490 Median :125 Median : 69.00 Median : 5.000
Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
Population Price ShelveLoc Age Education
Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0
1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75 1st Qu.:12.0
Median :272.0 Median :117.0 Medium:219 Median :54.50 Median :14.0
Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
Urban US
No :118 No :142
Yes:282 Yes:258
Multiple linear regression:
attach(Carseats)
lm.fit=lm(Sales~Price+Urban+US)
summary(lm.fit)
Output:
Call:
lm(formula = Sales ~ Price + Urban + US)
Residuals:
Min 1Q Median 3Q Max
-6.9206 -1.6220 -0.0564 1.5786 7.0581
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
Price -0.054459 0.005242 -10.389 < 2e-16 ***
UrbanYes -0.021916 0.271650 -0.081 0.936
USYes 1.200573 0.259042 4.635 4.86e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
b)
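Sales fall as Price rises.
Whether the store is in an urban location has no statistically significant association with sales.
Stores in the US sell more than stores outside the US.
c) Sales = 13.04 - 0.05 Price - 0.02 UrbanYes + 1.20 USYes, where UrbanYes and USYes are 1 if the answer is Yes and 0 otherwise.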
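To see how the dummy variables enter the equation, the fitted coefficients from the summary output can be plugged in by hand. This is a sketch: `predict_sales` is a hypothetical helper, and the store with Price 100 is an invented example, not a row from the data.

```r
# Coefficients copied from the summary(lm.fit) output
coefs <- c(intercept = 13.043469, Price = -0.054459,
           UrbanYes = -0.021916, USYes = 1.200573)

# Hypothetical helper: urban and us are logical, converted to 0/1 dummies
predict_sales <- function(price, urban, us) {
  unname(coefs["intercept"] +
         coefs["Price"]    * price +
         coefs["UrbanYes"] * as.numeric(urban) +
         coefs["USYes"]    * as.numeric(us))
}

# Invented example: an urban US store charging 100
print(predict_sales(100, TRUE, TRUE))

# Switching US from No to Yes shifts the prediction by exactly the USYes coefficient
print(predict_sales(100, TRUE, TRUE) - predict_sales(100, TRUE, FALSE))
```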
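d) The null hypothesis can be rejected for Price and USYes: both have p-values close to 0, and the overall F-statistic is also highly significant. It cannot be rejected for UrbanYes (p = 0.936).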
e)
lm.fit2=lm(Sales~Price+US)
summary(lm.fit2)
Output:
Call:
lm(formula = Sales ~ Price + US)
Residuals:
Min 1Q Median 3Q Max
-6.9269 -1.6286 -0.0574 1.5766 7.0515
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
Price -0.05448 0.00523 -10.416 < 2e-16 ***
USYes 1.19964 0.25846 4.641 4.71e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
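f) The models in a) and e) have similar RSE (2.472 vs 2.469), with e) slightly better. Both have an R-squared of 0.2393, so each explains only about 24% of the variance in Sales.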
g)
confint(lm.fit2)
Output:
2.5 % 97.5 %
(Intercept) 11.79032020 14.27126531
Price -0.06475984 -0.04419543
USYes 0.69151957 1.70776632
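These intervals are the usual estimate ± t-quantile × standard error. The sketch below recomputes them from the rounded values printed in summary(lm.fit2), so the results agree with confint only up to rounding.

```r
# Estimates and standard errors copied (rounded) from summary(lm.fit2)
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price =  0.00523, USYes = 0.25846)
df  <- 397                       # residual degrees of freedom from the output

t_crit <- qt(0.975, df)          # two-sided 95% critical value
ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
print(ci)
```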
h)
plot(predict(lm.fit2),rstudent(lm.fit2))
All studentized residuals lie between -3 and 3, so there are no obvious outliers.
par(mfrow=c(2,2))
plot(lm.fit2)
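The average leverage is (p+1)/n = 3/400 = 0.0075. Leverage values always average to exactly (p+1)/n, so some points necessarily exceed it; the ones to watch are those lying far above this value in the Residuals vs Leverage panel, which should be treated as high-leverage observations.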
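The (p+1)/n benchmark can be checked numerically with hatvalues. This is a sketch on simulated data (an assumption, not Carseats), with one extreme predictor value planted on purpose; the cutoff of three times the average is a common rule of thumb, not something the post specifies.

```r
set.seed(42)                      # arbitrary seed for reproducibility
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x1[1] <- 8                        # planted extreme predictor value
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
h   <- hatvalues(fit)             # leverage of each observation
p   <- length(coef(fit)) - 1      # number of predictors
avg_lev <- (p + 1) / n            # leverages always average to this value

# Rule-of-thumb cutoff (an assumption): more than 3x the average leverage
flagged <- which(h > 3 * avg_lev)

cat("mean leverage:", mean(h), " (p+1)/n:", avg_lev, "\n")
print(flagged)
```

Because the mean leverage equals (p+1)/n exactly, "exceeds the average" alone flags many ordinary points; a multiple of the average is a more useful screen.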
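11. Investigating the t-statistic
a) Regression of y onto x without an intercept. In the exercise, x is drawn with rnorm(100) and y is generated as 2*x+rnorm(100); the seed behind the output shown here is not given.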
lm.fit=lm(y~x+0)
summary(lm.fit)
Output:
Call:
lm(formula = y ~ x + 0)
Residuals:
Min 1Q Median 3Q Max
-2.92110 -0.43210 0.04155 0.67849 2.64495
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x 1.9454 0.1083 17.96 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.033 on 99 degrees of freedom
Multiple R-squared: 0.7651, Adjusted R-squared: 0.7627
F-statistic: 322.4 on 1 and 99 DF, p-value: < 2.2e-16
The p-value is close to 0, so we reject the null hypothesis that the coefficient is zero.
b)
lm.fit2=lm(x~y+0)
summary(lm.fit2)
Output:
Call:
lm(formula = x ~ y + 0)
Residuals:
Min 1Q Median 3Q Max
-1.05835 -0.30952 -0.01945 0.34313 1.15854
Coefficients:
Estimate Std. Error t value Pr(>|t|)
y 0.3933 0.0219 17.96 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4646 on 99 degrees of freedom
Multiple R-squared: 0.7651, Adjusted R-squared: 0.7627
F-statistic: 322.4 on 1 and 99 DF, p-value: < 2.2e-16
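Again the p-value is close to 0, so we reject the null hypothesis.
c) a) and b) describe the same underlying relationship and share the same t-statistic (17.96), R-squared and F-statistic. The two fitted slopes are not exact reciprocals, though (1/0.3933 ≈ 2.54 rather than 1.9454), because each regression minimizes errors in a different direction.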
d)
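Part d) has no content here. The following is a derivation sketch, assuming the standard no-intercept formulas from the exercise, $\hat\beta = \sum_i x_i y_i / \sum_i x_i^2$ and $\mathrm{SE}(\hat\beta) = \sqrt{\sum_i (y_i - x_i\hat\beta)^2 / \big((n-1)\sum_i x_i^2\big)}$:

```latex
\begin{align*}
t = \frac{\hat\beta}{\mathrm{SE}(\hat\beta)}
  &= \frac{\sum_i x_i y_i}{\sum_i x_i^2}
     \cdot \frac{\sqrt{(n-1)\sum_i x_i^2}}{\sqrt{\sum_i (y_i - x_i\hat\beta)^2}}
   = \frac{\sqrt{n-1}\,\sum_i x_i y_i}
          {\sqrt{\sum_i x_i^2 \,\sum_i (y_i - x_i\hat\beta)^2}}, \\[4pt]
\sum_i (y_i - x_i\hat\beta)^2
  &= \sum_i y_i^2 - 2\hat\beta \sum_i x_i y_i + \hat\beta^2 \sum_i x_i^2
   = \sum_i y_i^2 - \frac{\big(\sum_i x_i y_i\big)^2}{\sum_i x_i^2}, \\[4pt]
t &= \frac{\sqrt{n-1}\,\sum_i x_i y_i}
          {\sqrt{\sum_i x_i^2 \,\sum_i y_i^2 - \big(\sum_i x_i y_i\big)^2}} .
\end{align*}
```

The final expression is unchanged when x and y are exchanged.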
e) x and y enter the t-statistic formula symmetrically, so swapping x and y leaves the t-statistic unchanged.
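The symmetry can be checked numerically against the closed-form expression for the no-intercept t-statistic, t = sqrt(n-1) Σxy / sqrt(Σx² Σy² - (Σxy)²). This sketch simulates data as in the exercise; the seed is an assumption, so the values will not match the outputs printed in a) and b).

```r
set.seed(1)                       # assumed seed; the post's seed is not stated
x <- rnorm(100)
y <- 2 * x + rnorm(100)
n <- length(x)

# Closed-form t-statistic for regression through the origin
t_formula <- sqrt(n - 1) * sum(x * y) /
  sqrt(sum(x^2) * sum(y^2) - sum(x * y)^2)

# t-statistics reported by lm in both directions (no intercept)
t_yx <- summary(lm(y ~ x + 0))$coefficients["x", "t value"]
t_xy <- summary(lm(x ~ y + 0))$coefficients["y", "t value"]

print(c(formula = t_formula, y_on_x = t_yx, x_on_y = t_xy))
```

All three values agree up to floating-point error, since the formula treats x and y identically.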
f)
lm.fit3=lm(x~y)
summary(lm.fit3)
Output:
Call:
lm(formula = x ~ y)
Residuals:
Min 1Q Median 3Q Max
-1.0381 -0.2899 0.0005 0.3628 1.1782
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.01975 0.04667 -0.423 0.673
y 0.39308 0.02200 17.868 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4666 on 98 degrees of freedom
Multiple R-squared: 0.7651, Adjusted R-squared: 0.7627
F-statistic: 319.3 on 1 and 98 DF, p-value: < 2.2e-16
Regression of y onto x:
lm.fit4=lm(y~x)
summary(lm.fit4)
Output:
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.94807 -0.46147 0.01291 0.65020 2.61739
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02765 0.10391 0.266 0.791
x 1.94651 0.10894 17.868 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.038 on 98 degrees of freedom
Multiple R-squared: 0.7651, Adjusted R-squared: 0.7627
F-statistic: 319.3 on 1 and 98 DF, p-value: < 2.2e-16
The t-statistic is unchanged: the slope's t value is 17.868 in both regressions.