模型中各變數對模型的解釋程度
在建立一個模型後,我們會關心這個模型對於因變數的解釋程度,甚至想知道各個自變數分別對模型的貢獻有多少。解決這個問題要分為兩種情況來看:線性模型與非線性模型。
多變數線性迴歸模型
決定係數
一般採用ANOVA,求得
其中SS是sums of squares的縮寫,SSR表示來自Regression的變異,SSE表示隨機變異(未能解釋的變異),SSTO表示總變異,SSTO=SSR+SSE。則表示迴歸模型對因變數的解釋程度。並且等於(r為相關係數)。
由於隨著變數增加,也會變大,有可能導致一個變數少但實際解釋能力較好的模型 和 變數特別多但實際解釋能力一般的模型,在使用
各個變數的重要性
這部分的內容,Yi-Chun E. Chao 等人寫了篇paper總結,詳細內容見文末參考文獻。
各個變數相對重要性的評價,令Ij表示xj的relative importance,理想的Ij應該滿足:
(1)對於所有的xj,其Ij均為非負數;
(2)所有的Ij之和等於迴歸模型的總;
(3)Ij值與xj進入模型的順序無關。
下面探討幾個可能可以度量Ij的指標。
單變數
各個變數自己單獨建立迴歸模型(或作相關分析),可以求得各個變數的,一般表示為:
但是僅當各個變數完全不相關時,這個式子才成立:
Type III SS與 Type I SS
這部分詳細內容建議參考:Sequential (or Extra) Sums of Squares
Type III SS在軟體裡一般顯示為Adjust SS,指的是,將p個變數納入迴歸模型後,各個變數的額外貢獻度(獨立貢獻度),一般來說,各個變數的SS之和是小於SSR的,僅當各個變數完全不相關時,各個變數的SS的和才等於SSR。相應地,可以求出Type III ,即:
Type I SS在軟體裡一般顯示為Sequential SS,指的是,在之前p-1個變數的基礎上,再加入當前變數,SSR的增加量。因此各個變數的SS之和是等於SSR的。但是這個SS依賴於進入模型的順序(先進入模型的佔便宜)。相應地,有Type III
偏 (Partial R-squared)
這部分詳細內容請參考:Partial R-squared
偏又叫偏決定係數。這個概念也是基於變數加入的順序,表示的是,在之前p-1個變數的模型不能解釋的變異中,新加入的變數能解釋的比例。也就是這個式子:
比如:在含有x1的模型的基礎上,新增變數x2和x3,則
這個概念一般用於檢驗新加入的變數有沒有價值。
Pratt’s Index
-這個指標首先由Pratt等人提出。Pratt指數是一個乘積:,Bj是迴歸係數,r是單變數建模時的相關係數。一般來說,這個指標評價各個變數的相對重要程度,較前面幾個指標更好,運用較為廣泛。
=, 用表示xj的解釋能力,則據此可求出各變數的解釋比例。
但是存在一個問題就是,有時候Pratt指數可能是負數值。
Dj和εj
其他方法包括:General Dominance Index Dj 和Johnson’s Relative Weight εj。
Dj這個指標首先由Budescu等人提出。之前說過Type III 與當前變數的加入順序有關,那麼列舉所有可能的順序都求出一個,然後求平均數,這就是Dj的思想。具體參考Yi-Chun E. Chao的論文。另外εj這裡也不敘述了,也請參照Yi-Chun E. Chao的論文。
非線性模型
這裡的非線性模型包括Logistic迴歸和Cox迴歸。
偽決定係數
由於的計算時基於最小二乘法(OLS)及F統計量的ANOVA,而Logistic迴歸等模型採用最大似然法(MLE),因此難以求出,這時候衍生出了廣義的,即偽。偽的公式請參考相應資料:維基、Logistic Regression。
各變數的相對重要性
這個我目前沒找到很好的指標,曾經見過一篇腸癌免疫評分的文章採用迴歸模型Wald檢驗的卡方值的proportion來評價各個變數的重要性。感覺不是很嚴謹,找不到論證這個方法的論文。這個方法可能來自Frank Harrell大神,於是閱讀了他的rms軟體包和《Regression Modeling Strategies》這本神書,發現可以Logistic迴歸等非線性模型可以用Wald統計量的ANOVA分析,並且可以用卡方值類比SS,也就是說,Wald卡方值的大小可以衡量變數的重要性。文中有這麼一句話“This is a very powerful model (ROC area = c = 0.88); the survival patterns are easy to detect. The Wald ANOVA in Table 12.2 indicates especially strong sex and pclass effects (χ2 = 199 and 109, respectively).” 但是具體的如何計算各個變數的相對貢獻比例,並沒有看到相應的說明。
書中和rms包文件裡提到“The plot.anova function draws a dot chart showing the relative contribution (χ2, χ2 minus d.f., AIC, partial R2, P -value, etc.) of each factor in the model”,anova函式的margin引數“set to a vector of character strings to write text for selected statistics in the right margin of the dot chart. The character strings can be any combination of "chisq", "d.f.", "P", "partial R2", "proportion R2", and "proportion chisq" ”也提示proportion χ2的價值。
為了模仿線性模型中的和偏,Harrell大神在書中演示了LR(likelihood ratio)體系中的和偏的構建,但是他又說“Since such likelihood ratio statistics are tedious to compute, the 1 d.f. Wald χ2 can be substituted for the LR statistic (keeping in mind that difficulties with the Wald statistic can arise)”。他在接下來的段落中描述了其他統計學家構建的兩個指標:
感覺只是描述了下偽的來歷。
另外,文中有提到另一個指標D,“D is the same as R2 discussed above when p = 1 (indicating only one reestimated parameter, γ), the penalized proportion of explainable log likelihood that was explained by the model. Because of the remark of Schemper,546 all of these indexes may unfortunately be functions of the censoring pattern. ” 但是沒太懂。。。
不過漲了些見識,ANOVA除了傳統的基於F統計量的,還可以用Wald統計量和LR(likelihood ratio
)統計量來分析(ANOVA)。關於Wald與t檢驗、Wald與F檢驗的關係,以後可以再研究下,比如:Are t test and one-way ANOVA both Wald tests?
此外,D. Roland Tomas等人基於Pratt’s Index定義了Logistic迴歸中的類似的指標,此處不做探討。但是他在論文中提到“when the question relates to explanatory variables in logistic regression, the usual recommendation is to inspect the relative magnitudes of the Wald statistics for individual explanatory variables (or their square roots which can be interpreted as large sample z-statistics). The problem with this and related approaches can be easily explained with reference to the governance example. For the explanatory variable DISP, its Wald statistic (or its square root z-statistic) shown in Table 3 is a measure of the contribution of DISP to the logistic regression, over and above the contribution of explanatory variables SUPP and INDEP. Similarly, the Wald statistic for variable SUPP measures its contribution over and above variables DISP and INDEP. Clearly, it is not appropriate to use these two Wald statistics as measures of the relative contribution of DISP and SUPP because the reference set of variables is different in both cases (SUPP and INDEP in the first case, and DISP and INDEP in the second case). The equivalent problem occurs in linear regression, i.e., the t-statistics (or corresponding p-values) for individual variables are not appropriate for assessing relative importance. ” 先忽略那幾個大寫的縮寫是什麼意思以及忽略Table3是什麼內容,總之他的話裡可以得到兩個結論:1、Wald統計量用來評估變數貢獻是經常被推薦的;2、他認為用Wald統計量來衡量變數貢獻度不嚴謹。
M Schemper於1993年在論文中也提到了一種用於Cox模型的評價變數貢獻度的指標PVE,詳見參考文獻。
我在stackexchange上看到了兩個關於relative importance的問題,兩個都是Harrell大神的答疑。
第一個問題是rms包中採用“proportion chisq”衡量relative importance是否可行:Relative importance of variables in Cox regression
這是他們的對話:
-----------------------------------------------------
Adam Robinsson:
I've understood that relative importance of predictors is a tricky question. Suggested methods range from very complex models to very simple variable transformations. I've understood that the brightest still debate which way to go on this matter. I'm looking for an easy but still appealing method to approach this in survival analysis (Cox regression).
My aim is to answer the question: which predictor is the most important one (in terms of predicting the outcome). The reason is simple: clinicians want to know which risk factor to adress first. I understand that "important" in clinical setting is not equal to "important" in the regression-world, but there is a link.
Should I compute the proportion of explainable log-likelihood that is explained by each variable (see Frank Harrell post), by using:
library(survival); library(rms)
data(lung)
S <- Surv(lung$time, lung$status)
f <- cph(S ~ rcs(age,4) + sex, x=TRUE, y=TRUE, data=lung)
plot(anova(f), what='proportion chisq')
As I understand it, its only possible to use the 'proportion chisq' for Cox models and this should suffice to convey some sense of each variables relative importance. Or should I perhaps use the default plot(anova()), which displays Wald χ2 statistic minus its degrees of freedom for assessing the partial effect of each variable?
I would appreciate some guidance if anyone has any experience on this matter.
===========
Frank Harrell:
Thanks for trying those functions. I believe that both metrics you mentioned are excellent in this context. This is useful for any model that gives rise to Wald statistics (which is virtually all models) although likelihood ratio χ2 statistics would be even better (but more tedious to compute).
You can use the bootstrap to get confidence intervals for the ranks of variables computed these ways. For the example code type ?anova.rms
.
All this is related to the "adequacy index". Two papers using the approach that have appeared in the medical literature are http://www.citeulike.org/user/harrelfe/article/13265566 and http://www.citeulike.org/user/harrelfe/article/13263849 .
===========
Adam Robinsson:
Many thanks for your time prof Harrell. I was delighted to find this function in the rms package, among the wealth of other useful functions. Considering the abovementioned approach, there was virtually no difference between the two measures. Thus, this appears to be an appealing approach, we'll see what the reviewers say.
I recently submitted a paper using your method professor Harrell. Most reviewers liked it but one reviewer claimed that Heller's method would be superior to the abovementioned method. Heller's method is explained here: ncbi.nlm.nih.gov/pmc/articles/PMC3297826 I did try Heller's method but it yields odd results (as far as I'm concerned). Have You, professor Harrell, compared the two methods and come to any conclusion as to which one is to prefer?
===========
Frank Harrell:
I like the Heller approach; I had not known about it before. I like the Kent and O'Quigley index a bit more (I'm not sure the +1 in the denominator is correct in Heller's description of it). But I still like measures that are functions of the gold standard log likelihood, such as the adequacy index, which is the easiest to compute
-----------------------------------------------------
另一個問題是問大家更喜歡哪種方式的relative importance:Which variable relative importance method to use
大致上問的是線性模型的relative importance,圖中列出了一堆方法:
哈哈,又看到Pratt方法(有負數值哦)。
對此,Harrell大神的回答是:
-----------------------------------------------------
I prefer to compute the proportion of explainable log-likelihood that is explained by each variable. For OLS models the rms
package makes this easy:
f <- ols(y ~ x1 + x2 + pol(x3, 2) + rcs(x4, 5) + ...)
plot(anova(f), what='proportion chisq')
# also try what='proportion R2'
The default for plot(anova())
is to display the Wald χ2 statistic minus its degrees of freedom for assessing the partial effect of each variable. Even though this is not scaled [0,1][0,1] it is probably the best method in general because it penalizes a variable requiring a large number of parameters to achieve the χ2. For example, a categorical predictor with 5 levels will have 4 d.f. and a continuous predictor modeled as a restricted cubic spline function with 5 knots will have 4 d.f.
If a predictor interacts with any other predictor(s), the χ2 and partial R2 measures combine the appropriate interaction effects with main effects. For example if the model was y ~ pol(age,2) * sex
the statistic for sex is the combined effects of sex as a main effect plus the effect modification that sex provides for the age effect. This is an assessment of whether there is a difference between the sexes for any age.
Methods such as random forests, which do not favor additive effects, are not likelihood based, and use multiple trees, require a different notion of variable importance.
-----------------------------------------------------
此外,相關連結中看到了一些有趣的討論:
For linear classifiers, do larger coefficients imply more important features?
Approaches to compare differences in means with differences in proportions?
Different prediction plot from survival coxph and rms cph
有關這個問題的其他資料:
Contribution of each Variables in Logistic Regression
Kenneth P. Burnham, Understanding AIC relative variable importance values
效應大小(Effect size)
這部分內容和之前的內容有相似之處,拓展了更多指標。除了相關係數(Pearson r or correlation coefficient)、決定係數(Coefficient of determination),還包括:Eta-squared (η2)、Omega-squared (ω2)、Cohen's ƒ2和Cohen's q等,具體請參考:維基Effect_size。
參考文獻
Regression Modeling Strategies
Yi-Chun E. Chao et al, Quantifying the Relative Importance of Predictors in Multiple Linear Regression Analyses for Public Health Studies, Journal of Occupational and Environmental Hygiene, 5:8, 519-529, DOI: 10.1080/15459620802225481
Tomas, D. Roland, et al. "On Measuring the Relative Importance of Explanatory Variables in a Logistic Regression ," Journal of Modern Applied Statistical Methods: Vol. 7 : Iss. 1 , Article 4. DOI: 10.22237/jmasm/1209614580
M Schemper, The relative importance of prognostic factors in studies of survival