Single-cell sequencing: scImpute
An accurate and robust imputation method scImpute for single-cell RNA-seq data
http://jsb.ucla.edu/sites/default/files/publications/NC_scImpute.pdf
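A methods paper from UCLA, published in Nature Communications (NC) in 2018. I have tried the software and it runs fine, but on my own data it showed no obvious advantage, so more testing is needed.
Still, the principle is worth understanding thoroughly so that it can inform later analyses.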
The workflow in the imputation step of the scImpute method: scImpute first learns each gene's dropout probability in each cell by fitting a mixture model. Next, scImpute imputes the (highly probable) dropout values in cell j (gene set A_j) by borrowing information of the same gene in other similar cells, which are selected based on gene set B_j (not severely affected by dropout events).
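How the selecting matrix works, illustrated:
Comparison with other tools, against both a no-imputation baseline and competing methods: scImpute, MAGIC, SAVER
One more dimension-reduction plot:
Clustering:
Those are the results up front, and they do look good. So how are the algorithm and the statistical model actually built?
Step-by-step breakdown:
Establishing the normalized count matrix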
1. PCA is performed on matrix X for dimension reduction, and the resulting matrix is denoted as Z, where columns represent cells and rows represent principal components (PCs). The purpose of dimension reduction is to reduce the impact of large portions of dropout values. The PCs are selected such that at least 40% of the variance in the data can be explained.
2. Based on the PCA-transformed data Z, the J × J distance matrix D between the cells can be calculated. For each cell j, we denote its distance to the nearest neighbor as l_j. For the set L = {l_1, …, l_J}, we denote its first quartile as Q1 and its third quartile as Q3. The outlier cells are those cells which do not have close neighbors:
equation1: $O = \{\, j : l_j > Q_3 + 1.5\,(Q_3 - Q_1) \,\}$
For each outlier cell, we set its candidate neighbor set N_j = ∅. Please note that the outlier cells could be a result of experimental/technical errors or biases, but they may also represent real biological variation as rare cell types. scImpute would not impute gene expression values in outlier cells, nor use them to impute gene expression values in other cells.
3. The remaining cells {1, …, J}\O are clustered into K groups by spectral clustering [23]. We denote g_j = k if cell j is assigned to cluster k (k = 1, …, K). Hence, cell j has the candidate neighbor set N_j = {j′ : g_j′ = g_j, j′ ≠ j}.
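In other words: PCA is applied to the normalized matrix from the previous step, giving a PCA-transformed matrix called Z. From Z a cell-to-cell distance matrix D (J × J) is computed, and outliers are identified and filtered out based on each cell's nearest-neighbor distance in that matrix, according to equation1.
The remaining cells (those not in the outlier set O) are then clustered into K groups, and these K groups can be regarded as the subpopulations of the cell population.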
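The outlier filter and the candidate-neighbor sets can be sketched in plain Python. This is only an illustration under stated assumptions: the cutoff is taken to be the usual boxplot rule Q3 + 1.5 × (Q3 − Q1) on nearest-neighbor distances, the quartiles use the "inclusive" method of the standard library (the quartile convention in the scImpute package may differ), the PCA coordinates are made-up toy values, and the cluster labels are supplied by hand instead of coming from spectral clustering. The function names are hypothetical.

```python
import math
import statistics


def nearest_neighbor_distances(points):
    """l_j: Euclidean distance from each cell to its nearest other cell."""
    return [
        min(math.dist(p, q) for k, q in enumerate(points) if k != j)
        for j, p in enumerate(points)
    ]


def find_outlier_cells(points):
    """Cells whose nearest-neighbor distance exceeds Q3 + 1.5 * (Q3 - Q1)."""
    l = nearest_neighbor_distances(points)
    q1, _, q3 = statistics.quantiles(l, n=4, method="inclusive")
    cutoff = q3 + 1.5 * (q3 - q1)
    return [j for j, lj in enumerate(l) if lj > cutoff]


def candidate_neighbors(labels, outliers):
    """N_j: other non-outlier cells with the same cluster label; empty for outliers."""
    skip = set(outliers)
    neighbors = {}
    for j, g in enumerate(labels):
        if j in skip:
            neighbors[j] = []
        else:
            neighbors[j] = [k for k, h in enumerate(labels)
                            if k != j and k not in skip and h == g]
    return neighbors


# Toy PCA coordinates (hypothetical): eight cells close together, one far away.
cells = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (6.0, 0.0), (7.0, 0.0),
         (9.0, 0.0), (12.0, 0.0), (13.0, 0.0), (50.0, 0.0)]
outliers = find_outlier_cells(cells)

# Hand-written cluster labels g_j standing in for the spectral clustering step.
labels = [0, 0, 0, 1, 1, 1, 2, 2, None]
N = candidate_neighbors(labels, outliers)
print(outliers, N[3], N[8])
```

The far-away cell ends up with an empty neighbor set, so it is neither imputed nor used to impute other cells.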
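############### With the question of which cells are usable settled, we need a statistically meaningful description of how expression is distributed across this large number of single cells: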
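For each gene i, its expression in cell subpopulation k is modeled as a random variable X_i^(k) with density function
equation2: $f_{X_i^{(k)}}(x) = \lambda_i^{(k)}\,\mathrm{Gamma}\big(x;\ \alpha_i^{(k)}, \beta_i^{(k)}\big) + \big(1 - \lambda_i^{(k)}\big)\,\mathrm{Normal}\big(x;\ \mu_i^{(k)}, \sigma_i^{(k)}\big)$
where λ is the dropout rate of gene i in cell subpopulation k, α and β are the shape and rate parameters of gene i's Gamma distribution, and μ and σ are the mean and standard deviation of gene i's Normal distribution. All of these parameters are estimated with the EM algorithm, which yields maximum likelihood estimates.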
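The main point of this model is that, whatever the expression level of a gene, it gives a better way to judge whether an observed value is a dropout or reflects real biological variation.
equation3: $d_{ij} = \dfrac{\hat\lambda_i^{(k)}\,\mathrm{Gamma}\big(X_{ij};\ \hat\alpha_i^{(k)}, \hat\beta_i^{(k)}\big)}{\hat\lambda_i^{(k)}\,\mathrm{Gamma}\big(X_{ij};\ \hat\alpha_i^{(k)}, \hat\beta_i^{(k)}\big) + \big(1 - \hat\lambda_i^{(k)}\big)\,\mathrm{Normal}\big(X_{ij};\ \hat\mu_i^{(k)}, \hat\sigma_i^{(k)}\big)}$
Equation 3 is the formula for the dropout probability d_ij of gene i in cell j, where cell j belongs to subpopulation k.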
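To make the dropout probability concrete, here is a small Python sketch. It assumes that d_ij is the posterior weight of the Gamma (dropout) component under the fitted Gamma-Normal mixture, that the Gamma is parameterized by shape and rate, and that the parameters have already been estimated. The parameter values below are invented for illustration and do not come from the paper, and the EM fitting itself is not shown.

```python
import math


def gamma_pdf(x, shape, rate):
    """Gamma density with shape alpha and rate beta, valid for x > 0."""
    log_pdf = (shape * math.log(rate) + (shape - 1) * math.log(x)
               - rate * x - math.lgamma(shape))
    return math.exp(log_pdf)


def normal_pdf(x, mean, sd):
    """Normal density with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def dropout_probability(x, lam, shape, rate, mean, sd):
    """Posterior probability that value x was generated by the Gamma component."""
    drop = lam * gamma_pdf(x, shape, rate)
    real = (1 - lam) * normal_pdf(x, mean, sd)
    return drop / (drop + real)


# Hypothetical fitted parameters for one gene in one subpopulation.
params = dict(lam=0.4, shape=1.0, rate=5.0, mean=2.0, sd=0.5)

d_low = dropout_probability(0.01, **params)   # value near zero
d_high = dropout_probability(2.0, **params)   # value near the Normal mean
print(d_low, d_high)
```

With these made-up parameters, a near-zero value is assigned almost entirely to the dropout component, while a value near the Normal mean is treated as real expression. That is exactly the distinction the threshold t later acts on.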
Now for the core of the paper: how to impute the dropout values identified above:
Imputation of dropout values. Now, we impute the gene expressions cell by cell. For each cell j, we select a gene set A_j in need of imputation based on the genes' dropout probabilities in cell j: A_j = {i : d_ij ≥ t}, where t is a threshold on dropout probabilities. We also have a gene set B_j = {i : d_ij < t} that have accurate gene expression with high confidence and do not need imputation. We learn cells' similarities through the gene set B_j. Then we impute the expression of genes in the set A_j by borrowing information from the same gene's expression in other similar cells learned from B_j. Supplementary Figs. 19 and 20c give some real data distributions of genes' zero count proportions across cells and genes' dropout probabilities, showing that it is reasonable to divide genes into two sets. To learn the cells similar to cell j from B_j, we use the non-negative least squares (NNLS) regression:
equation4: $\hat\beta^{(j)} = \arg\min_{\beta^{(j)}} \big\| X_{B_j, j} - X_{B_j, N_j}\,\beta^{(j)} \big\|_2^2 \quad \text{subject to } \beta^{(j)} \ge 0$
Recall that N_j represents the indices of cells that are candidate neighbors of cell j. The response X_{B_j, j} is a vector representing the B_j rows in the j-th column of X, the design matrix X_{B_j, N_j} is a sub-matrix of X with dimensions |B_j| × |N_j|, and the coefficients β(j) is a vector of length |N_j|. Note that NNLS itself has the property of leading to a sparse estimate β̂(j), whose components may have exact zeros [39], so NNLS can be used to select similar cells of cell j from its neighbors N_j. Finally, the estimated coefficients β̂(j) from the set B_j are used to impute the expression of genes in the set A_j in cell j:
equation5: $\hat X_{ij} = \begin{cases} X_{ij}, & i \in B_j \\ X_{i, N_j}\,\hat\beta^{(j)}, & i \in A_j \end{cases}$
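After all that, the core idea is this: We learn cells' similarities through the gene set B_j. Then we impute the expression of genes in the set A_j by borrowing information from the same gene's expression in other similar cells learned from B_j. NNLS stands for non-negative least squares regression, and it is what picks out the cells similar to cell j among its candidate neighbors. β̂(j) is the coefficient vector estimated on gene set B_j, and it is then used to impute the dropout-affected expression values of the genes in A_j. Here A_j is the set of genes that need imputation, while B_j is the set of genes measured accurately enough not to need it; the split comes from comparing d_ij with the threshold t. The resulting β̂(j) is a sparse estimate: some of its components may be exactly zero ("whose components may have exact zeros"), so neighbors with zero weight are effectively dropped.
At this point, each entry X_ij of the imputed matrix falls into one of two cases, depending on whether gene i belongs to A_j or to B_j.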
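The per-cell imputation can be sketched in plain Python as follows. This is a toy illustration and not the scImpute implementation: the NNLS problem is solved with a simple coordinate-descent loop (a stand-in solver chosen to keep the example dependency-free), the expression matrix and dropout probabilities are invented, and the function names and the default t = 0.5 are assumptions made for the example.

```python
def nnls_coordinate_descent(columns, y, sweeps=200):
    """Minimise ||y - sum_k beta_k * columns[k]||^2 subject to beta_k >= 0."""
    beta = [0.0] * len(columns)
    residual = list(y)
    for _ in range(sweeps):
        for k, col in enumerate(columns):
            norm2 = sum(c * c for c in col)
            if norm2 == 0.0:
                continue
            # Best unconstrained move along coordinate k, then clip at zero.
            step = sum(c * r for c, r in zip(col, residual)) / norm2
            new_beta = max(0.0, beta[k] + step)
            delta = new_beta - beta[k]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                beta[k] = new_beta
    return beta


def impute_cell(X, j, neighbors, d_j, t=0.5):
    """Impute cell j. X[i][c] is gene i in cell c; d_j[i] is the dropout probability."""
    A = [i for i, d in enumerate(d_j) if d >= t]   # genes needing imputation
    B = [i for i, d in enumerate(d_j) if d < t]    # genes trusted as measured
    # Learn neighbor weights from the trusted genes only.
    columns = [[X[i][n] for i in B] for n in neighbors]
    y = [X[i][j] for i in B]
    beta = nnls_coordinate_descent(columns, y)
    imputed = [row[j] for row in X]
    for i in A:
        # Borrow the same gene's expression from the weighted neighbors.
        imputed[i] = sum(b * X[i][n] for b, n in zip(beta, neighbors))
    return imputed, beta


# Toy matrix (hypothetical): 4 genes x 3 cells; cell 0 is imputed from cells 1 and 2.
X = [[2.0, 2.0, 0.0],
     [1.0, 1.0, 3.0],
     [3.0, 3.0, 1.0],
     [0.0, 2.5, 0.5]]
d_0 = [0.0, 0.1, 0.0, 0.9]   # only gene 3 looks like a dropout in cell 0
imputed, beta = impute_cell(X, j=0, neighbors=[1, 2], d_j=d_0)
print(imputed, beta)
```

In this toy case cell 0 matches cell 1 on the trusted genes, so all the weight goes to cell 1 and the dissimilar cell 2 gets weight zero, which mirrors how the sparse NNLS estimate selects neighbors.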
We construct a separate regression model for each cell to impute the expression of genes with high dropout probabilities. The whole scImpute procedure needs only two user-set parameters: K, the number of groups the cells are clustered into, and t, the threshold on dropout probabilities.
Advantages of scImpute according to the article: scImpute simultaneously determines the values that need imputation, and would not introduce biases to the high expression values of accurately measured genes. In addition, scImpute's imputation is relatively conservative: it does not over-impute, and the result is not overly sparse either.
############validation step
Generation of simulated scRNA-seq data. ## see the paper for details
Four evaluation measures of clustering results
(adjusted Rand index, Jaccard index, normalized mutual information (nmi), and purity)
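adjusted Rand index (ARI): a classic and widely used measure for evaluating clusterings; it penalizes both false-positive and false-negative assignments.
Jaccard index (JI): similar to ARI, but it does not take true negatives into account.
NMI: measures the similarity between the true and the inferred clusterings from an information-theoretic point of view.
purity: the percentage of samples that are assigned to the correct cluster, i.e. that belong to the majority true class of their inferred cluster.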
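The four measures can be written out in a few lines of Python for anyone who wants to check them by hand. This is a sketch with assumptions: ARI and the Jaccard index are computed from pair counts, NMI is normalized by the geometric mean of the two entropies (other normalizations exist, and the paper's exact choice is not stated here), and the label vectors are toy data.

```python
import math
from collections import Counter


def pair_counts(truth, pred):
    """Pairs grouped together in both, in truth, in pred, and all pairs."""
    both = sum(math.comb(v, 2) for v in Counter(zip(truth, pred)).values())
    in_truth = sum(math.comb(v, 2) for v in Counter(truth).values())
    in_pred = sum(math.comb(v, 2) for v in Counter(pred).values())
    return both, in_truth, in_pred, math.comb(len(truth), 2)


def adjusted_rand_index(truth, pred):
    # Undefined (division by zero) when both partitions are trivial.
    both, in_truth, in_pred, total = pair_counts(truth, pred)
    expected = in_truth * in_pred / total
    return (both - expected) / ((in_truth + in_pred) / 2 - expected)


def jaccard_index(truth, pred):
    # Ignores pairs separated in both partitions (the true negatives).
    both, in_truth, in_pred, _ = pair_counts(truth, pred)
    return both / (in_truth + in_pred - both)


def purity(truth, pred):
    best = Counter()
    for (p, _), v in Counter(zip(pred, truth)).items():
        best[p] = max(best[p], v)   # size of the majority true class per cluster
    return sum(best.values()) / len(truth)


def entropy(counts, n):
    return -sum(v / n * math.log(v / n) for v in counts.values())


def normalized_mutual_information(truth, pred):
    n = len(truth)
    ct, cp = Counter(truth), Counter(pred)
    mi = sum(v / n * math.log(v * n / (ct[t] * cp[p]))
             for (t, p), v in Counter(zip(truth, pred)).items())
    return mi / math.sqrt(entropy(ct, n) * entropy(cp, n))


truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]   # one cell assigned to the wrong cluster
ari = adjusted_rand_index(truth, pred)
ji = jaccard_index(truth, pred)
pur = purity(truth, pred)
nmi = normalized_mutual_information(truth, pred)
print(ari, ji, pur, nmi)
```

All four measures equal 1 for a perfect clustering; ARI is additionally corrected for chance agreement, which is why it drops faster than purity when a cell is misassigned.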