練習1 循環擬合與AUC
R需要包:mice包;AUC包或pROC包
需要知識:logistic回歸;SVM;GAM;缺失值多重插補法;擬合效果ROC、AUC
In cancer studies, a question of critical interest is the progression of cancer. Here we consider a study on cancer progression.
The dataset contains the following variables:
Response: cancer development, which is a binary variable indicating whether there is any new cancer development;
Covariates: we have six covariates: size, stage, age, treatment type, dose I, dose II. Among them, stage and treatment type are categorical; the reminders are continuous.
Our question: in clinical practice, whether it is possible to predict cancer development using the six covariates we measure.
數據示例如下
Cancer development: Yes (coded as 1) or No (coded as 0)
size: tumor size (which is measured with the diameter of tumor)
stage: tumor stage (categorial)
age: age (continuous)
treatment type: different type of treatment
Dose I: dose of drug A
Dose II: dose of drug B
".": missing measurements. You can assume MAR.
-
讀取數據
cancer <- read.csv("cancer-pred.csv", stringsAsFactors = FALSE) cancer <- cancer[ , 1:7]
str(cancer)
查看數據結構,200個觀測值和7個變量,並且存在缺失值。
由於數據集中缺失值由點號代替,故讀取後變量為字符型,需將其轉化為數值型。
無法通過設置read.csv中參數na.strings="."實現跳過轉化步驟,因為最後兩個變量含有小數點。
cancer <- sapply(cancer, as.numeric) #警告信息是由於點號轉化為NA值 cancer <- as.data.frame(cancer) #轉化為數據框更適用於建模 colnames(cancer) <- c("develop", "size", "stage", "age", "type", "Dose.I", "Dose.II")
-
探索性分析
變量箱線圖、柱狀圖、連續變量相關圖,圖片僅展示柱狀圖
old.par <- par(mfrow=c(3, 3)) apply(cancer, 2, boxplot) par(old.par) par(mfrow=c(3, 3), mar=c(4, 4, 2, 0.5)) for (j in 1:ncol(cancer)) { hist(cancer[ , j], xlab=colnames(cancer)[j], main=paste("Histogram of", colnames(cancer)[j]), col="lightblue", breaks=20) } par(mfrow=c(1, 1)) pairs(~size+age+Dose.I+Dose.II, data=cancer)
-
補全缺失值
對於缺失值的處理方法有幾種,參考《R語言實戰》
此處處理方法有三種:刪除缺失值,平均值眾數替代法,多重插補法。
# #缺失值刪除法 # cancer <- na.omit(cancer) # #平均數眾數替代法 # cancer.mean <- sapply(cancer[c(2, 4, 6, 7)], mean, na.rm=TRUE) # cancer.table <- sapply(cancer[ , c(1, 3, 5)], table) # cancer[which(is.na(cancer[ , 2])), 2] <- round(cancer.mean[1], 3) # cancer[which(is.na(cancer[ , 4])), 4] <- round(cancer.mean[2], 1) # cancer[which(is.na(cancer[ , 6])), 6] <- round(cancer.mean[3], 3) # cancer[which(is.na(cancer[ , 7])), 7] <- round(cancer.mean[4], 3) # for(i in c(1, 3, 5)){ # cancer[which(is.na(cancer[ , i])), i] <- # as.numeric(rownames(cancer.table)[which.max(cancer.table[1])]) # } # #多重插補法 imp <- mice(cancer, seed=1) fit <- with(imp,glm(develop~., family=binomial, data = cancer)) pooled <- pool(fit) cancer <- complete(imp,action=3) cancer[ , c(3, 5)] <- sapply(cancer[ , c(3, 5)], as.factor) #部分變量轉化為因子
-
抽樣擬合計算AUC
1. logistic回歸
將數據分為訓練集(150個樣本)和測試集(50個樣本),用訓練集擬合logistic模型,用得到的模型對測試集的因變量進行預測
將預測值與真實值比較得到錯判矩陣、tpr、fpr,針對不同的cutoffs得到roc曲線並計算AUC值。
將以上步驟循環3000次處理,對3000個AUC取平均並進行t檢驗。
cancer.auc <- c(length=3000) #生成空向量以填充 for (i in 1:3000){ set.seed(i) sample.i <- sample(1:nrow(cancer), 3/4*nrow(cancer)) cancer1 <- cancer[sample.i, ]; cancer2 <- cancer[-sample.i, ] glm.can <- glm(develop~., data=cancer1) # glm.new <- step(glm.can, trace=0) #trace=0不返回結果 cancer.pred <- predict(glm.can, family=binomial, newdata = cancer2, type = "response") cancer.roc <- roc(cancer.pred, as.factor(cancer2$develop)) cancer.auc[i] <- auc(cancer.roc) } mean(cancer.auc) #3000次AUC均值 t.test(cancer.auc, alternative = "greater", mu=0.5) hist(cancer.auc, freq = F, breaks = 40) lines(density(cancer.auc, bw=0.05), col="red", lwd=2)
因變量為是否型,雖然t檢驗通過,但是預測效果並不好
2. 支持向量機(SVM)
下面使用支持向量機的方法
#SVM cancer.auc <- c(length=3000) for (i in 1:3000){ set.seed(i) sample.i <- sample(1:nrow(cancer), 3/4*nrow(cancer)) cancer1 <- cancer[sample.i, ]; cancer2 <- cancer[-sample.i, ] cancer.svm <- svm(develop~., data=cancer1kernel= "linear") cancer.pred <- predict(cancer.svm, newdata = cancer2, type="response") cancer.roc <- roc(cancer.pred, as.factor(cancer2$develop)) cancer.auc[i] <- auc(cancer.roc) } mean(cancer.auc) t.test(cancer.auc, alternative = "greater", mu=0.5)
嘗試SVM四種核方法,即kernel參數“linear”、“polynomial”、“radial”、“sigmoid”
最後線性方法得到AUC均值最大,但也只有0.5837262,與logistic模型相比並不具有顯著優勢
3. 廣義加性模型(GAM)
#GAM cancer.auc <- c(length=300) for (i in 1:300){ set.seed(10*i) # set.seed(3) sample.i <- sample(1:nrow(cancer), 3/4*nrow(cancer)) cancer1 <- cancer[sample.i, ]; cancer2 <- cancer[-sample.i, ] cancer.gam <- gam(develop ~ s(size, bs="cr") + stage + s(age, bs="cr") + type + s(Dose.I, bs="cr") + s(Dose.II, bs="cr"), family = binomial, data=cancer1, trace=TRUE) cancer.pred <- predict(cancer.gam, newdata = cancer2, type="response") cancer.roc <- roc(cancer.pred, as.factor(cancer2$develop)) # auc(cancer.roc) cancer.auc[i] <- auc(cancer.roc) } mean(cancer.auc) t.test(cancer.auc, alternative = "greater", mu=0.5)
GAM函數得到300次AUC均值為0.5694343,且GAM循環3000次較GLM運算速度相比很慢
Tips:
問題一:隨機抽樣,以下得到的試驗集與測試集不是互補關系,會導致AUC上升。
set.seed(1) cancer1 <- cancer[sample(1:nrow(cancer), 3/4*nrow(cancer)), ] cancer2 <- cancer[-sample(1:nrow(cancer), 3/4*nrow(cancer)), ]
問題二:三種處理缺失值方法(缺失值剔除法、均值眾數替代法、多重插補法)結果相差不大,差距在0.01左右。
問題三:DoseII數據出現小於0的值如果當作異常值處理刪去結果為0.5770816,不處理結果為0.5873808,AUC降低了0.01,故放棄。
問題四:循環中使用step函數篩選不顯著變量,AUC降低0.3左右,故放棄。
問題五:計算AUC值有兩個包“AUC”和“pROC”,兩個包roc函數參數輸入方式不同。設立了隨機數種子,但是二者得到結果測試集差0.01,試驗集到小數點後20位才不一致,未確定是用AUC計算方式不同解釋還是測試集出現問題解釋。
若需原始數據試驗可回復聯系
練習1 循環擬合與AUC