比較不同的對單細胞轉錄組資料聚類的方法
阿新 • • 發佈:2022-05-03
背景介紹
聚類之前必須要對錶達矩陣進行normalization,而且要去除一些批次效應等外部因素。通過對錶達矩陣的聚類,可以把細胞群體分成不同的狀態,解釋為什麼會有不同的群體。不過從計算的角度來說,聚類還是蠻複雜的,各個細胞並沒有預先標記好,而且也沒辦法事先知道可以聚多少類。尤其是在單細胞轉錄組資料裡面有很高的噪音,基因非常多,意味著的維度很高。
對這樣的高維資料,需要首先進行降維,可以選擇PCA或者t-SNE方法。聚類的話,一般都是無監督聚類方法,比如:hierarchical clustering, k-means clustering and graph-based clustering。演算法略微有一點複雜,略過吧。
這裡主要比較6個常見的單細胞轉錄組資料的聚類包:
- SINCERA
- pcaReduce
- SC3
- tSNE + k-means
- SEURAT
- SNN-Cliq
所以需要安裝並且載入一些包,安裝程式碼如下;
install.packages('pcaReduce') ## try http:// if https:// URLs are not supported source("https://bioconductor.org/biocLite.R") biocLite("SC3") biocLite("Seurat") install.packages("devtools") library("devtools") install_github("BPSC","nghiavtr") install_github("hemberg-lab/scRNA.seq.funcs") devtools::install_github("JustinaZ/pcaReduce")
載入程式碼如下:
library(pcaMethods)
library(pcaReduce)
library(SC3)
library(scater)
library(pheatmap)
set.seed(1234567)
載入測試資料
這裡選取的是資料,載入了這個scater包的SCESet物件,包含著一個23730 features, 301 samples 的表達矩陣。
供11已知的種細胞型別,這樣聚類的時候就可以跟這個已知資訊做對比,看看聚類效果如何。
可以直接用plotPCA來簡單PCA並且視覺化。
pollen <- readRDS("../pollen/pollen.rds") pollen
## SCESet (storageMode: lockedEnvironment)
## assayData: 23730 features, 301 samples
## element names: exprs, is_exprs, tpm
## protocolData: none
## phenoData
## rowNames: Hi_2338_1 Hi_2338_2 ... Hi_GW16_26 (301 total)
## varLabels: cell_type1 cell_type2 ... is_cell_control (33 total)
## varMetadata: labelDescription
## featureData
## featureNames: A1BG A1BG-AS1 ... ZZZ3 (23730 total)
## fvarLabels: mean_exprs exprs_rank ... feature_symbol (11 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
head(fData(pollen))
## mean_exprs exprs_rank n_cells_exprs total_feature_exprs
## A1BG 0.56418762 12460 79 169.82048
## A1BG-AS1 0.31265010 10621 37 94.10768
## A1CF 0.05453986 6796 59 16.41650
## A2LD1 0.22572953 9781 28 67.94459
## A2M 0.15087563 8855 21 45.41356
## A2M-AS1 0.02428046 5366 3 7.30842
## pct_total_exprs pct_dropout total_feature_tpm
## A1BG 1.841606e-03 73.75415 481.37
## A1BG-AS1 1.020544e-03 87.70764 538.18
## A1CF 1.780276e-04 80.39867 13.99
## A2LD1 7.368203e-04 90.69767 350.65
## A2M 4.924842e-04 93.02326 1356.63
## A2M-AS1 7.925564e-05 99.00332 88.61
## log10_total_feature_tpm pct_total_tpm is_feature_control
## A1BG 2.683380 1.599256e-04 FALSE
## A1BG-AS1 2.731734 1.787996e-04 FALSE
## A1CF 1.175802 4.647900e-06 FALSE
## A2LD1 2.546111 1.164965e-04 FALSE
## A2M 3.132781 4.507134e-04 FALSE
## A2M-AS1 1.952356 2.943891e-05 FALSE
## feature_symbol
## A1BG A1BG
## A1BG-AS1 A1BG-AS1
## A1CF A1CF
## A2LD1 A2LD1
## A2M A2M
## A2M-AS1 A2M-AS1
table(pData(pollen)$cell_type1)
##
## 2338 2339 BJ GW16 GW21 GW21+3 hiPSC HL60 K562 Kera
## 22 17 37 26 7 17 24 54 42 40
## NPC
## 15
plotPCA(pollen, colour_by = "cell_type1")
可以看到簡單的PCA也是可以區分部分細胞型別的,只不過在某些細胞相似性很高的群體區分力度不夠,所以需要開發新的演算法來解決這個聚類的問題。
SC聚類
pollen <- sc3_prepare(pollen, ks = 2:5)
pollen <- sc3_estimate_k(pollen)
pollen@sc3$k_estimation
## [1] 11
## 準備 SCESet物件 資料給 SC3方法,先預測能聚多少個類,發現恰好是11個。
## 這裡是平行計算,所以速度還可以
pollen <- sc3(pollen, ks = 11, biology = TRUE)
pollen
## SCESet (storageMode: lockedEnvironment)
## assayData: 23730 features, 301 samples
## element names: exprs, is_exprs, tpm
## protocolData: none
## phenoData
## rowNames: Hi_2338_1 Hi_2338_2 ... Hi_GW16_26 (301 total)
## varLabels: cell_type1 cell_type2 ... sc3_11_log2_outlier_score
## (35 total)
## varMetadata: labelDescription
## featureData
## featureNames: A1BG A1BG-AS1 ... ZZZ3 (23730 total)
## fvarLabels: mean_exprs exprs_rank ... sc3_11_de_padj (16 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
head(fData(pollen))
## mean_exprs exprs_rank n_cells_exprs total_feature_exprs
## A1BG 0.56418762 12460 79 169.82048
## A1BG-AS1 0.31265010 10621 37 94.10768
## A1CF 0.05453986 6796 59 16.41650
## A2LD1 0.22572953 9781 28 67.94459
## A2M 0.15087563 8855 21 45.41356
## A2M-AS1 0.02428046 5366 3 7.30842
## pct_total_exprs pct_dropout total_feature_tpm
## A1BG 1.841606e-03 73.75415 481.37
## A1BG-AS1 1.020544e-03 87.70764 538.18
## A1CF 1.780276e-04 80.39867 13.99
## A2LD1 7.368203e-04 90.69767 350.65
## A2M 4.924842e-04 93.02326 1356.63
## A2M-AS1 7.925564e-05 99.00332 88.61
## log10_total_feature_tpm pct_total_tpm is_feature_control
## A1BG 2.683380 1.599256e-04 FALSE
## A1BG-AS1 2.731734 1.787996e-04 FALSE
## A1CF 1.175802 4.647900e-06 FALSE
## A2LD1 2.546111 1.164965e-04 FALSE
## A2M 3.132781 4.507134e-04 FALSE
## A2M-AS1 1.952356 2.943891e-05 FALSE
## feature_symbol sc3_gene_filter sc3_11_markers_clusts
## A1BG A1BG TRUE 5
## A1BG-AS1 A1BG-AS1 TRUE 4
## A1CF A1CF TRUE 2
## A2LD1 A2LD1 FALSE NA
## A2M A2M FALSE NA
## A2M-AS1 A2M-AS1 FALSE NA
## sc3_11_markers_padj sc3_11_markers_auroc sc3_11_de_padj
## A1BG 7.740802e-10 0.8554452 1.648352e-18
## A1BG-AS1 1.120284e-03 0.6507985 5.575777e-03
## A1CF 5.007946e-23 0.8592113 1.162843e-17
## A2LD1 NA NA NA
## A2M NA NA NA
## A2M-AS1 NA NA NA
## 可以看到SC3方法處理後的SCESet物件的基因資訊增加了5列,比較重要的是sc3_gene_filter資訊,決定著該基因是否拿去聚類,因為基因太多了,需要挑選
table(fData(pollen)$sc3_gene_filter)
##
## FALSE TRUE
## 11902 11828
### 只有一半的基因被挑選去聚類了
## 後面是一些視覺化
sc3_plot_consensus(pollen, k = 11, show_pdata = "cell_type1")
sc3_plot_silhouette(pollen, k = 11)
sc3_plot_expression(pollen, k = 11, show_pdata = "cell_type1")
sc3_plot_markers(pollen, k = 11, show_pdata = "cell_type1")
plotPCA(pollen, colour_by = "sc3_11_clusters")
## 還支援shiny的互動式聚類,暫時不顯示
# sc3_interactive(pollen)
很明顯可以看到SC3聚類的效果要好於普通的PCA
pcaReduce
# use the same gene filter as in SC3
input <- exprs(pollen[fData(pollen)$sc3_gene_filter, ])
# run pcaReduce 1 time creating hierarchies from 1 to 30 clusters
pca.red <- PCAreduce(t(input), nbt = 1, q = 30, method = 'S')[[1]]
## 這裡對2~30種類別的情況都分別對樣本進行分組。
## 我們這裡取只有11組的時候,這些樣本是如何分組的資訊來視覺化。
pData(pollen)$pcaReduce <- as.character(pca.red[,32 - 11])
plotPCA(pollen, colour_by = "pcaReduce")
tSNE + kmeans
scater包包裝了 Rtsne 和 ggplot2 來做tSNE並且視覺化。
pollen <- plotTSNE(pollen, rand_seed = 1, return_SCESet = TRUE)
## 上面的tSNE的結果,下面用kmeans的方法進行聚類,假定是8類細胞型別。
pData(pollen)$tSNE_kmeans <- as.character(kmeans(pollen@reducedDimension, centers = 8)$clust)
plotTSNE(pollen, rand_seed = 1, colour_by = "tSNE_kmeans")
SNN-Cliq
這個有一點難用,算了吧。
distan <- "euclidean"
par.k <- 3
par.r <- 0.7
par.m <- 0.5
# construct a graph
scRNA.seq.funcs::SNN(
data = t(input),
outfile = "snn-cliq.txt",
k = par.k,
distance = distan
)
# find clusters in the graph
snn.res <-
system(
paste0(
"python snn-cliq/Cliq.py ",
"-i snn-cliq.txt ",
"-o res-snn-cliq.txt ",
"-r ", par.r,
" -m ", par.m
),
intern = TRUE
)
cat(paste(snn.res, collapse = "n"))
snn.res <- read.table("res-snn-cliq.txt")
# remove files that were created during the analysis
system("rm snn-cliq.txt res-snn-cliq.txt")
pData(pollen)$SNNCliq <- as.character(snn.res[,1])
plotPCA(pollen, colour_by = "SNNCliq")
SINCERA
至少是在這個資料集上面表現不咋地
# perform gene-by-gene per-sample z-score transformation
dat <- apply(input, 1, function(y) scRNA.seq.funcs::z.transform.helper(y))
# hierarchical clustering
dd <- as.dist((1 - cor(t(dat), method = "pearson"))/2)
hc <- hclust(dd, method = "average")
num.singleton <- 0
kk <- 1
for (i in 2:dim(dat)[2]) {
clusters <- cutree(hc, k = i)
clustersizes <- as.data.frame(table(clusters))
singleton.clusters <- which(clustersizes$Freq < 2)
if (length(singleton.clusters) <= num.singleton) {
kk <- i
} else {
break;
}
}
cat(kk)
## 14
pheatmap(
t(dat),
cluster_cols = hc,
cutree_cols = 14,
kmeans_k = 100,
show_rownames = FALSE
)
SEURAT
library(Seurat)
pollen_seurat <- new("seurat", raw.data = get_exprs(pollen, exprs_values = "tpm"))
pollen_seurat <- Setup(pollen_seurat, project = "Pollen")
pollen_seurat <- MeanVarPlot(pollen_seurat)
pollen_seurat <- RegressOut(pollen_seurat, latent.vars = c("nUMI"),
genes.regress = [email protected])
pollen_seurat <- PCAFast(pollen_seurat)
pollen_seurat <- RunTSNE(pollen_seurat)
pollen_seurat <- FindClusters(pollen_seurat)
TSNEPlot(pollen_seurat, do.label = T)
pData(pollen)$SEURAT <- as.character(pollen_seurat@ident)
sc3_plot_expression(pollen, k = 11, show_pdata = "SEURAT")
markers <- FindMarkers(pollen_seurat, 2)
FeaturePlot(pollen_seurat,
head(rownames(markers)),
cols.use = c("lightgrey", "blue"),
nCol = 3)