單細胞轉錄組3大R包之scater
scater 這個R包很強大,是McCarthy et al. 2017 發表的,包含的功能有:
- Automated computation of QC metrics
- Transcript quantification from read data with pseudo-alignment
- Data format standardisation
- Rich visualizations for exploratory analysis
- Seamless integration into the Bioconductor universe
- Simple normalisation methods
R包工作流程圖
S4物件
主要是基於 SCESet 物件來進行下游分析,跟ExpressionSet物件類似,也是常見的3個組成:
- exprs, a numeric matrix of expression values, where rows are features, and columns are cells
- phenoData, an AnnotatedDataFrame object, where rows are cells, and columns are cell attributes (such as cell type, culture condition, day captured, etc.)
- featureData, an AnnotatedDataFrame object, where rows are features (e.g. genes), and columns are feature attributes, such as biotype, gc content, etc.
主要就是讀取scRNA上游分析處理得到的表達矩陣,加上每個樣本的描述資訊,形成矩陣之後。對樣本進行過濾,然後對基因進行過濾。針對過濾後的表達矩陣進行各種分類的視覺化。
最新的文件如下:
HTML |
R Script |
An introduction to the scater package |
---|---|---|
HTML |
R Script |
Data visualisation methods in scater |
HTML |
R Script |
Expression quantification and import |
HTML |
R Script |
Quality control with scater |
HTML |
R Script |
Transition from SCESet to SingleCellExperiment |
其GitHub上面專門開設了介紹它用法的課程:http://hemberg-lab.github.io/scRNA.seq.course/
測試資料
suppressPackageStartupMessages(library(scater))
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
assays = list(counts = sc_example_counts),
colData = sc_example_cell_info
)
exprs(example_sce) <- log2(
calculateCPM(example_sce, use.size.factors = FALSE) + 1
)
keep_feature <- rowSums(exprs(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]
example_sce <- calculateQCMetrics(example_sce,
feature_controls = list(eg = 1:40))
#scater_gui(example_sce)
但是真的非常好用,所有的視覺化都集中在了 scater_gui
這個函式產生的shiny
網頁裡面:
-
plotScater
: a plot method exists forSingleCellExperiment
objects, which gives an overview of expression across cells. -
plotQC
: various methods are available for producing QC diagnostic plots. -
plotPCA
: produce a principal components plot for the cells. -
plotTSNE
: produce a t-distributed stochastic neighbour embedding (reduced dimension) plot for the cells. -
plotDiffusionMap
: produce a diffusion map (reduced dimension) plot for the cells. -
plotMDS
: produce a multi-dimensional scaling plot for the cells. -
plotReducedDim
: plot a reduced-dimension representation of the cells. -
plotExpression
: plot expression levels for a defined set of features. -
plotPlatePosition
: plot cells in their position on a plate, coloured by cell metadata and QC metrics or feature expression level. -
plotColData
: plot cell metadata and QC metrics. -
plotRowData
: plot feature metadata and QC metrics.
可以充分的探索自己的資料,隨便看一個視覺化函式的結果:
## ----plot-expression, eval=TRUE--------------------------------------------
plotExpression(example_sce, rownames(example_sce)[1:6],
x = "Mutation_Status", exprs_values = "exprs",
colour = "Treatment")
詳細的QC
做QC要結合上面的視覺化步驟,所有沒辦法自動化,只能先視覺化,肉眼分辨一下哪些樣本或者基因資料是需要捨棄的。
library(knitr)
opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png')
library(ggplot2)
theme_set(theme_bw(12))
## ----quickstart-load-data, message=FALSE, warning=FALSE--------------------
suppressPackageStartupMessages(library(scater))
data("sc_example_counts")
data("sc_example_cell_info")
## ----quickstart-make-sce, results='hide'-----------------------------------
gene_df <- DataFrame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts),
colData = sc_example_cell_info,
rowData = gene_df)
example_sce <- normalise(example_sce)
## Warning in .local(object, ...): using library sizes as size factors
## ----quickstart-add-exprs, results='hide'----------------------------------
exprs(example_sce) <- log2(
calculateCPM(example_sce, use.size.factors = FALSE) + 1)
## ----filter-no-exprs-------------------------------------------------------
keep_feature <- rowSums(exprs(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]
example_sceset <- calculateQCMetrics(example_sce, feature_controls = list(eg = 1:40))
colnames(colData(example_sceset))
## [1] "Cell"
## [2] "Mutation_Status"
## [3] "Cell_Cycle"
## [4] "Treatment"
## [5] "total_features"
## [6] "log10_total_features"
## [7] "total_counts"
## [8] "log10_total_counts"
## [9] "pct_counts_top_50_features"
## [10] "pct_counts_top_100_features"
## [11] "pct_counts_top_200_features"
## [12] "pct_counts_top_500_features"
## [13] "total_features_endogenous"
## [14] "log10_total_features_endogenous"
## [15] "total_counts_endogenous"
## [16] "log10_total_counts_endogenous"
## [17] "pct_counts_endogenous"
## [18] "pct_counts_top_50_features_endogenous"
## [19] "pct_counts_top_100_features_endogenous"
## [20] "pct_counts_top_200_features_endogenous"
## [21] "pct_counts_top_500_features_endogenous"
## [22] "total_features_feature_control"
## [23] "log10_total_features_feature_control"
## [24] "total_counts_feature_control"
## [25] "log10_total_counts_feature_control"
## [26] "pct_counts_feature_control"
## [27] "total_features_eg"
## [28] "log10_total_features_eg"
## [29] "total_counts_eg"
## [30] "log10_total_counts_eg"
## [31] "pct_counts_eg"
## [32] "is_cell_control"
colnames(rowData(example_sceset))
## [1] "Gene" "is_feature_control"
## [3] "is_feature_control_eg" "mean_counts"
## [5] "log10_mean_counts" "rank_counts"
## [7] "n_cells_counts" "pct_dropout_counts"
## [9] "total_counts" "log10_total_counts"
首先是基於樣本的過濾,用 colData(object)
可以檢視各個樣本統計情況
-
total_counts
: total number of counts for the cell (aka ‘library size’) -
log10_total_counts
: total_counts on the log10-scale -
total_features
: the number of features for the cell that have expression above the detection limit (default detection limit is zero) -
filter_on_total_counts
: would this cell be filtered out based on its log10-total_counts being (by default) more than 5 median absolute deviations from the median log10-total_counts for the dataset? -
filter_on_total_features
: would this cell be filtered out based on its total_features being (by default) more than 5 median absolute deviations from the median total_features for the dataset? -
counts_feature_controls
: total number of counts for the cell that come from (a set of user-defined) control features. Defaults to zero if no control features are indicated. -
counts_endogenous_features
: total number of counts for the cell that come from endogenous features (i.e. not control features). Defaults tototal_counts
if no control features are indicated. -
log10_counts_feature_controls
: total number of counts from control features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated. -
log10_counts_endogenous_features
: total number of counts from endogenous features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated. -
n_detected_feature_controls
: number of defined feature controls that have expression greater than the threshold defined in the object. *pct_counts_feature_controls
: percentage of all counts that come from the defined control features. Defaults to zero if no control features are defined.
然後是基於基因的過濾,用 rowData(object)
可以檢視各個基因統計情況
-
mean_exprs
: the mean expression level of the gene/feature. -
exprs_rank
: the rank of the feature’s expression level in the cell. -
total_feature_counts
: the total number of counts mapped to that feature across all cells. -
log10_total_feature_counts
: total feature counts on the log10-scale. -
pct_total_counts
: the percentage of all counts that are accounted for by the counts mapping to the feature. -
is_feature_control
: is the feature a control feature? Default isFALSE
unless control features are defined by the user. -
n_cells_exprs
: the number of cells for which the expression level of the feature is above the detection limit (default detection limit is zero).
scater一站式過濾低質量樣本
scater包自己提供了一個基於PCA的QC標準,不需要自己根據文庫大小,覆蓋的基因數量,外源的ERCC spike-ins 含量以及線粒體DNA含量來進行人工過濾。
預設的篩選條件如下:
- pct_counts_top100features
- total_features
- pct_counts_feature_controls
- n_detected_feature_controls
- log10_counts_endogenous_features
- log10_counts_feature_controls
一站式QC函式如下:
dat_pca <- scater::plotPCA(dat_qc,
size_by = "total_features",
shape_by = "use",
pca_data_input = "pdata",
detect_outliers = TRUE,
return_SCESet = TRUE)
還有更詳細的教程,需要看
- https://www.bioconductor.org/help/workflows/simpleSingleCell/
- http://hemberg-lab.github.io/scRNA.seq.course/index.html
sessionInfo()
過濾只是它最基本的工具,它作為單細胞轉錄組3大R包,功能肯定是非常全面的,比如前面我們講解的normalization,DEG, features selection,cluster,它都手到擒來,只不過是包裝的是其它R包的函式。