單細胞數據初步處理 | drop-seq | QC | 質控 | 正則化 normalization
比對
The raw Drop-seq data was processed with the standard pipeline (Drop-seq tools version 1.12 from McCarroll laboratory). Reads were aligned to the ENSEMBL release 84 Mus musculus genome.
10x Genomics data was processed using the same pipeline as for Drop-seq data, adjusting the barcode locations accordingly
我還沒有深入接觸10x和drop-seq的數據,目前的10x數據都是用官網cellranger跑出來的。
質控
We selected cells for downstream processing in each Drop-seq run, using the quality control metrics output by the Drop-seq tools package9, as well as metrics derived from the UMI matrix.
1) We first removed cells with a low number (<700) of unique detected genes. From the remaining cells, we filtered additional outliers.
2) We removed cells for which the overall alignment rate was less than the mean minus three standard deviations.
3) We removed cells for which the total number of reads (after log10 transformation) was not within three standard deviations of the mean.
4) We removed cells for which the total number of unique molecules (UMIs, after log10 transformation) was not within three standard deviations of the mean.
5) We removed cells for which the transcriptomic alignment rate (defined by PCT_USABLE_BASES) was not within three standard deviations of the mean.
6) We removed cells that showed an unusually high or low number of UMIs given their number of reads by fitting a loess curve (span= 0.5, degree= 2) to the number of UMIs with number of reads as predictor (both after log10 transformation). Cells with a residual more than three standard deviations away from the mean were removed.
7) With the same criteria, we removed cells that showed an unusually high or low number of genes given their number of UMIs. Of these filter steps, step 1 removed the majority of cells.
Steps 2 to 7 removed only a small number of additional cells from each eminence (2% to 4%), and these cells did not exhibit unique or biologically informative patterns of gene expression.
1. 過濾掉基因數量太少的細胞;
2. 過濾基因組比對太差的細胞;
3. 過濾掉總reads數太少的細胞;
4. 過濾掉UMI太少的細胞;
5. 過濾掉轉錄本比對太少的細胞;
6. 根據統計分析,過濾reads過多或過少的細胞;
7. 根據統計分析,過濾UMI過低或過高的細胞;
註:連過濾都有點統計的門檻,其實也簡單,應該是默認為正態分布,去掉了左右極端值。
還有一個就是簡單的擬合回歸,LOESS Curve Fitting (Local Polynomial Regression)
正則化
The raw data per Drop-seq run is a UMI count matrix with genes as rows and cells as columns. The values represent the number of UMIs that were detected. The aim of normalization is to make these numbers comparable between cells by removing the effect of sequencing depth and biological sources of heterogeneity that may confound the signal of interest, in our case cell cycle stage.
目前有很多正則化的方法,但是作者還是自己開發了一個。
正則化就是去掉一些影響因素,使得我們的數據之間可以相互比較。這裏就提到了兩個最主要的因素:測序深度和細胞周期。
A common approach to correct for sequencing depth is to create a new normalized expression matrix x with (see Fig), in which ci,j is the molecule count of gene i in cell j and mj is the sum of all molecule counts for cell j. This approach assumes that ci,j increases linearly with mj, which is true only when the set of genes detected in each cell is roughly the same.
可以看到常規的正則化方法是不適合的,
However, for Drop-seq, in which the number of UMIs is low per cell compared to the number of genes present, the set of genes detected per cell can be quite different. Hence, we normalize the expression of each gene separately by modelling the UMI counts as coming from a generalized linear model with negative binomial distribution, the mean of which can be dependent on technical factors related to sequencing depth. Specifically, for every gene we model the expected value of UMI counts as a function of the total number of reads assigned to that cell, and the number of UMIs per detected gene (sum of UMI divided by number of unique detected genes).
這個就有些門檻了,用了廣義線性回歸模型來做正則化。
To solve the regression problem, we use a generalized linear model (glm function of base R package) with a regularized overdispersion parameter theta. Regularizing theta helps us to avoid overfitting which could occur for genes whose variability is mostly driven by biological processes rather than sampling noise and dropout events. To learn a regularized theta for every gene, we perform the following procedure.
1) For every gene, obtain an empirical theta using the maximum likelihood model (theta.ml function of the MASS R package) and the estimated mean vector that is obtained by a generalized linear model with Poisson error distribution.
2) Fit a line (loess, span = 0.33, degree = 2) through the variance–mean UMI count relationship (both log10 transformed) and predict regularized theta using the fit. The relationship between variance and theta and mean is given by variance= mean + (mean2/theta).
Normalized expression is then defined as the Pearson residual of the regression model, which can be interpreted as the number of standard deviations by which an observed UMI count was higher or lower than its expected value. Unless stated otherwise, we clip expression to the range [-30, 30] to prevent outliers from dominating downstream analyses.
單細胞數據初步處理 | drop-seq | QC | 質控 | 正則化 normalization