比較不同的對單細胞轉錄組資料尋找差異基因的方法

阿新 • • 發佈：2022-05-03

背景介紹

如果是bulk RNA-seq，那麼現在最流行的就是DESeq2 和 edgeR啦，而且有很多經過了RT-qPCR 驗證過的真實測序資料可以來評價不同的差異基因演算法的表現。

對單細胞測序資料來說，通常需要先聚類之後把細胞群體進行分組，然後來比較不同的組的差異表達情況。當然，也有不少單細胞測序實驗設計本身就有時間點，不同個體來源，不同培養條件這樣的分組！

同時還有不少方法是不需要預先分類的，因為分類本身就會引入偏差。

跟bulk RNA-seq不一樣的地方是，scRNA-seq通常涉及到的樣本數量更多。這時候可以使用非參檢驗演算法，比如Kolmogorov-Smirnov test (KS-test) 等等。

下面用一個測試資料來評價一下不同的演算法的表現。處理同樣的表達矩陣得到差異結果跟已知的差異結果進行比較看看overlap怎麼樣。評價指標主要是：

召回率，True positive rate (TPR), TP/(TP + FN)
準確率，False positive rate (FPR), FP/(FP+TP)
receiver-operating-characteristic curve (ROC)
area under this curve (AUC)

所以需要安裝並且載入一些包,安裝程式碼如下；

install.packages('ROCR')
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("MAST") 
biocLite("scde") 
install.packages("devtools")
library("devtools")
install_github("BPSC","nghiavtr")
library("BPSC")

載入程式碼如下：

library(ROCR)
library(edgeR)
library(DESeq2)

library(scde)
library(BPSC)
library(MAST)
library(monocle)

載入測試資料

這裡選取的是芝加哥大學Yoav Gilad lab實驗的Tung et al 2017的單細胞測序文章的資料

## 讀取tung文章的資料，生成測試資料，這個程式碼不需要執行。
if(F){

  DE <- read.table("tung/TPs.txt")
  notDE <- read.table("tung/TNs.txt")
  GroundTruth <- list(DE=as.character(unlist(DE)), notDE=as.character(unlist(notDE)))

  molecules <- read.table("tung/molecules.txt", sep = "t")
  anno <- read.table("tung/annotation.txt", sep = "t", header = TRUE)
  keep <- anno[,1] == "NA19101" | anno[,1] == "NA19239"
  data <- molecules[,keep]
  group <- anno[keep,1]
  batch <- anno[keep,4]
  # remove genes that aren't expressed in at least 6 cells
  gkeep <- rowSums(data > 0) > 5;
  counts <- data[gkeep,]
  # Library size normalization
  lib_size = colSums(counts)
  norm <- t(t(counts)/lib_size*median(lib_size)) 
  # Variant of CPM for datasets with library sizes of fewer than 1 mil molecules
  rm(molecules)
  rm(data)
  save.image(file = 'scRNAseq_DEG_input.Rdata')
}
load(file = 'scRNAseq_DEG_input.Rdata')
# 我已經把測試資料儲存為rdata資料格式了，直接載入。
dim(counts);

## [1] 16026   576

dim(norm);

## [1] 16026   576

dim(DE);

## [1] 1083    1

dim(notDE);

## [1] 10897     1

table(group)

## group
## NA19098 NA19101 NA19239 
##       0     288     288

可以看到這裡需要選擇的測試資料來源於2個人，每個人都有288個細胞的表達資料。

就是要對它們進行差異比較，而已知的1083個基因是確定顯著差異的，另外10897個基因是確定不顯著的。(首先，我們要假定這個是金標準！！！)

但是總共卻是16026個基因，所以有一些基因是不確定顯著與否的。

差異分析方法大全

Kolmogorov-Smirnov test

KS檢驗有兩個弊端，首先是它假設基因表達量是連續的，如果有很多細胞表達量一致，比如都是0，表現就很差。其次它對大樣本量太敏感了，可能其實差異並不大，但是樣本數量很多，也會被認為是顯著差異。

pVals <- apply(norm, 1, function(x) {
        ks.test(x[group =="NA19101"], 
                x[group=="NA19239"])$p.value
         })
# multiple testing correction
pVals <- p.adjust(pVals, method = "fdr")
sigDE <- names(pVals)[pVals < 0.05]
length(sigDE)

## [1] 5095

# Number of KS-DE genes
sum(GroundTruth$DE %in% sigDE)

## [1] 792

# Number of KS-DE genes that are true DE genes
sum(GroundTruth$notDE %in% sigDE)

## [1] 3190

tp <- sum(GroundTruth$DE %in% sigDE)
fp <- sum(GroundTruth$notDE %in% sigDE)
tn <- sum(GroundTruth$notDE %in% names(pVals)[pVals >= 0.05])
fn <- sum(GroundTruth$DE %in% names(pVals)[pVals >= 0.05])
tpr <- tp/(tp + fn)
fpr <- fp/(fp + tn)
cat(c(tpr, fpr))

## 0.7346939 0.2944706

ks_pVals=pVals

可以看到KS檢驗判斷的顯著差異基因實在是太多了，高達5095個。所以它能找回來792個真正的差異基因。但是卻找到了3190個假陽性。所以計算得到召回率73.46%，但是準確率只有29.44%，這個表現不佳。

再看看ROC和RUC

# Only consider genes for which we know the ground truth
pVals <- pVals[names(pVals) %in% GroundTruth$DE | 
               names(pVals) %in% GroundTruth$notDE] 
truth <- rep(1, times = length(pVals));
truth[names(pVals) %in% GroundTruth$DE] = 0;
pred <- ROCR::prediction(pVals, truth)
perf <- ROCR::performance(pred, "tpr", "fpr")
ROCR::plot(perf)

aucObj <- ROCR::performance(pred, "auc")
[email protected][[1]] # AUC

## [1] 0.7954796

把這兩個評價分析包裝成函式，後面可以直接使用！

DE_Quality_AUC <- function(pVals) {
        pVals <- pVals[names(pVals) %in% GroundTruth$DE | 
                       names(pVals) %in% GroundTruth$notDE]
        truth <- rep(1, times = length(pVals));
        truth[names(pVals) %in% GroundTruth$DE] = 0;
        pred <- ROCR::prediction(pVals, truth)
        perf <- ROCR::performance(pred, "tpr", "fpr")
        ROCR::plot(perf)
        aucObj <- ROCR::performance(pred, "auc")
        return([email protected][[1]])
}
DE_Quality_rate <- function(sigDE) {
  (length(sigDE) )
  # Number of KS-DE genes
  ( sum(GroundTruth$DE %in% sigDE) )
  # Number of KS-DE genes that are true DE genes
  (sum(GroundTruth$notDE %in% sigDE))
  tp <- sum(GroundTruth$DE %in% sigDE)
  fp <- sum(GroundTruth$notDE %in% sigDE)
  tn <- sum(GroundTruth$notDE %in% names(pVals)[pVals >= 0.05])
  fn <- sum(GroundTruth$DE %in% names(pVals)[pVals >= 0.05])
  tpr <- tp/(tp + fn)
  fpr <- fp/(fp + tn)
  cat(c(tpr, fpr))
}

Wilcox/Mann-Whitney-U Test

也是一種非參檢驗，通常比較兩個組資料的median的差異。

pVals <- apply(norm, 1, function(x) {
        wilcox.test(x[group =="NA19101"], 
                x[group=="NA19239"])$p.value
        })
# multiple testing correction
pVals <- p.adjust(pVals, method = "fdr")
sigDE <- names(pVals)[pVals < 0.05]
Wilcox_pVals=pVals
DE_Quality_rate(sigDE)

## 0.8376623 0.3729346

DE_Quality_AUC(pVals)

## [1] 0.8320326

召回率是81.9%，準確率是31.9%，這個表現不佳。

edgeR

edgeR包在bulk RNA-seq測序領域應用很廣泛，基於負二項分佈模型，應用了 generalized linear model (GLM) 演算法

library(edgeR)
dge <- DGEList(counts=counts, norm.factors = rep(1, length(counts[1,])), group=group)
group_edgeR <- factor(group)
design <- model.matrix(~group_edgeR)
dge <- estimateDisp(dge, design = design, trend.method="none")
fit <- glmFit(dge, design)
res <- glmLRT(fit)
pVals <- res$table[,4]
names(pVals) <- rownames(res$table)
pVals <- p.adjust(pVals, method = "fdr")
sigDE <- names(pVals)[pVals < 0.05]
edgeR_pVals=pVals
DE_Quality_rate(sigDE)

## 0.8692022 0.3948121

DE_Quality_AUC(pVals)

## [1] 0.8477189

召回率是86.9%，準確率是39.4%，表現越來越好了。

Monocle

monocle不僅僅針對於基於read counts的表達矩陣，還可以是已經被各種normalization的表達矩陣，比如基於RPKM/TPM等等，它會把被normalization的表達矩陣用normal/gaussian model (gaussianff()) 演算法處理一下。差異分析的時候同樣也是基於負二項分佈模型，應用了 generalized linear model (GLM) 演算法

library(monocle)
pd <- data.frame(group=group, batch=batch)
rownames(pd) <- colnames(counts)
pd <- new("AnnotatedDataFrame", data = pd)
## 針對於基於read counts的表達矩陣
Obj <- newCellDataSet(as.matrix(counts), phenoData=pd, 
        expressionFamily=negbinomial.size()) 
Obj <- estimateSizeFactors(Obj)
Obj <- estimateDispersions(Obj)
res <- differentialGeneTest(Obj,fullModelFormulaStr="~group")
pVals <- res[,3]
names(pVals) <- rownames(res)
pVals <- p.adjust(pVals, method = "fdr")
sigDE <- names(pVals)[pVals < 0.05]
monocle_pVals=pVals
DE_Quality_rate(sigDE)
DE_Quality_AUC(pVals)

monocle做差異分析的耗時非常誇張，召回率是84.2%，準確率是38.1%

MAST

MAST基於 zero-inflated negative binomial 分佈模型

library(MAST)
log_counts <- log(counts+1)/log(2)
fData = data.frame(names=rownames(log_counts))
rownames(fData) = rownames(log_counts);
cData = data.frame(cond=group)
rownames(cData) = colnames(log_counts)
obj <- FromMatrix(as.matrix(log_counts), cData, fData)
colData(obj)$cngeneson <- scale(colSums(assay(obj)>0))
cond <- factor(colData(obj)$cond)
# Model expression as function of condition & number of detected genes
zlmCond <- zlm.SingleCellAssay(~cond + cngeneson, obj) 
summaryCond <- summary(zlmCond, doLRT="condNA19101")
summaryDt <- summaryCond$datatable
summaryDt <- as.data.frame(summaryDt)
pVals <- unlist(summaryDt[summaryDt$component == "H",4]) # H = hurdle model
names(pVals) <- unlist(summaryDt[summaryDt$component == "H",1])
pVals <- p.adjust(pVals, method = "fdr")
sigDE <- names(pVals)[pVals < 0.05]
MAST_pVals=pVals
DE_Quality_rate(sigDE)
DE_Quality_AUC(pVals)

召回率是82.8%，準確率是34.9.%

BPSC

這個用的是 Poisson-Beta 分佈模型

library(BPSC)
bpsc_data <- norm[,batch=="NA19101.r1" | batch=="NA19239.r1"]
bpsc_group = group[batch=="NA19101.r1" | batch=="NA19239.r1"]
control_cells <- which(bpsc_group == "NA19101")
design <- model.matrix(~bpsc_group)
coef=2 # group label
res=BPglm(data=bpsc_data, controlIds=control_cells, design=design, coef=coef, 
                estIntPar=FALSE, useParallel = FALSE)
pVals = res$PVAL
pVals <- p.adjust(pVals, method = "fdr")
sigDE <- names(pVals)[pVals < 0.05]
BPSC_pVals=pVals
DE_Quality_rate(sigDE)
DE_Quality_AUC(pVals)

召回率是64.8%，準確率是30.7.%

SCDE

SCDE是第一個特意針對單細胞轉錄組測序資料的差異分析而設計的，用貝葉斯統計方法把表達矩陣擬合到 zero-inflated negative binomial 分佈模型裡面。

library(scde)
cnts <- apply(
    counts,
    2,
    function(x) {
        storage.mode(x) <- 'integer'
        return(x)
    }
)
names(group) <- 1:length(group)
colnames(cnts) <- 1:length(group)
o.ifm <- scde::scde.error.models(
    counts = cnts,
    groups = group,
    n.cores = 1,
    threshold.segmentation = TRUE,
    save.crossfit.plots = FALSE,
    save.model.plots = FALSE,
    verbose = 0,
    min.size.entries = 2
)
priors <- scde::scde.expression.prior(
    models = o.ifm,
    counts = cnts,
    length.out = 400,
    show.plot = FALSE
)
resSCDE <- scde::scde.expression.difference(
    o.ifm,
    cnts,
    priors,
    groups = group,
    n.randomizations = 100,
    n.cores = 1,
    verbose = 0
)
# Convert Z-scores into 2-tailed p-values
pVals <- pnorm(abs(resSCDE$cZ), lower.tail = FALSE) * 2
pVals <- p.adjust(pVals, method = "fdr")
sigDE <- names(pVals)[pVals < 0.05]
SCDE_pVals=pVals
DE_Quality_rate(sigDE)
DE_Quality_AUC(pVals)

save(SCDE_pVals,BPSC_pVals,MAST_pVals,monocle_pVals,edgeR_pVals,Wilcox_pVals,ks_pVals,file = 'DEG_results.Rdata')

統計學基礎

負二項分佈 negative binomial model

set.seed(1)
hist(rnbinom(1000, mu=10, size=100), col="grey50", xlab="Read Counts", main="Negative Binomial")

這個是被應用的最廣泛的轉錄組表達資料分佈模型。但是對單細胞轉錄組測序資料來說，因為有很高的dropout情況，導致模型失準，所以就提出來了zero-inflated negative binomial models

zero-inflated negative binomial models

d = 0.5;
counts <- rnbinom(1000, mu=10, size=100);
counts[runif(1000) < d] = 0;
hist(counts, col="grey50", xlab="Read Counts", main="Zero-inflated NB")

就是在原始的負二項分佈資料裡面隨機挑選一些低表達量基因，給它們人為賦值為0表達量值。

Poisson-Beta distribution

a = 0.1
b = 0.1
g = 100
lambdas = rbeta(1000, a, b)
counts = sapply(g*lambdas, function(l) {rpois(1, lambda=l)})
hist(counts, col="grey50", xlab="Read Counts", main="Poisson-Beta")

文中提到的測試資料：

http://www.biotrainee.com/jmzeng/scRNA/DESeq_table.rds

http://www.biotrainee.com/jmzeng/scRNA/pollen.rds

http://www.biotrainee.com/jmzeng/scRNA/tung_umi.rds

http://www.biotrainee.com/jmzeng/scRNA/usoskin1.rds

比較不同的對單細胞轉錄組資料尋找差異基因的方法

背景介紹

載入測試資料

差異分析方法大全

Kolmogorov-Smirnov test

Wilcox/Mann-Whitney-U Test

edgeR

Monocle

MAST

BPSC

SCDE

統計學基礎

負二項分佈 negative binomial model

zero-inflated negative binomial models

Poisson-Beta distribution

比較不同的對單細胞轉錄組資料尋找差異基因的方法

比較不同的對單細胞轉錄組資料聚類的方法

比較不同的對單細胞轉錄組資料normalization方法

比較不同單細胞轉錄組資料尋找features方法

SC2disease：人類疾病的單細胞轉錄組的人工收集資料庫

轉錄組資料拼接之應用篇

單細胞轉錄組3大R包之monocle2

單細胞轉錄組3大R包之Seurat

單細胞轉錄組3大R包之scater

高階轉錄組分析和R語言資料視覺化第十二期（線上線下同時開課）

一個植物轉錄組專案的實戰

MySql查詢兩張相同表，合併成一組資料，並區分資料的不同

doccano標註後的序列標註任務資料轉錄為BIO形式

flask/django 動態查詢表結構相同表名不同資料的Model實現方法

js實現樹形資料轉成扁平資料的方法示例

python 如何判斷一組資料是否符合正態分佈

efcore 跨表查詢，實現一個介面內查詢兩個不同資料庫裡各自的表資料

Django def clean()函式對錶單中的資料進行驗證操作

基於python實現計算兩組資料P值

Oracle 對現有的已經含有資料表新增自動遞增的ID（ID欄位已經存在，只是ID沒有資料）

比較不同的對單細胞轉錄組資料尋找差異基因的方法

背景介紹

載入測試資料

差異分析方法大全

Kolmogorov-Smirnov test

Wilcox/Mann-Whitney-U Test

edgeR

Monocle

MAST

BPSC

SCDE

統計學基礎

負二項分佈 negative binomial model

zero-inflated negative binomial models

Poisson-Beta distribution

相關推薦