The Three Major R Packages for Single-Cell Transcriptomics: scater

阿新 • Published: 2022-05-03

scater is a powerful R package published by McCarthy et al. (2017). Its features include:

Automated computation of QC metrics
Transcript quantification from read data with pseudo-alignment
Data format standardisation
Rich visualizations for exploratory analysis
Seamless integration into the Bioconductor universe
Simple normalisation methods

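R package workflow diagram

S4 objects

Downstream analysis is built mainly on the SCESet object (newer versions of scater use SingleCellExperiment instead, as in the code further down). Like an ExpressionSet object, it has the usual three components: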

exprs, a numeric matrix of expression values, where rows are features, and columns are cells
phenoData, an AnnotatedDataFrame object, where rows are cells, and columns are cell attributes (such as cell type, culture condition, day captured, etc.)
featureData, an AnnotatedDataFrame object, where rows are features (e.g. genes), and columns are feature attributes, such as biotype, GC content, etc.

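The basic workflow is to read the expression matrix produced by upstream scRNA-seq processing, attach the descriptive information for each sample, and build the object from the two. Samples are filtered first, then genes. The filtered expression matrix is then visualised under various groupings.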
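As a rough illustration of this workflow, the sketch below uses base R only, on a made-up count matrix of 4 genes by 3 cells. The object names (`counts`, `cell_info`) and the library-size threshold are assumptions chosen for the example, not values taken from scater.

```r
# Toy expression matrix: rows are genes, columns are cells (made-up numbers).
counts <- matrix(
  c(10, 0, 3,
     0, 0, 0,
     5, 1, 8,
     2, 0, 7),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)

# Per-sample description: one row per cell, in the same order as the columns.
cell_info <- data.frame(
  Cell = colnames(counts),
  Treatment = c("treat1", "treat2", "treat1"),
  row.names = colnames(counts)
)
stopifnot(identical(rownames(cell_info), colnames(counts)))

# Step 1: filter samples, here by a hypothetical minimum library size of 5.
keep_cell <- colSums(counts) >= 5
counts <- counts[, keep_cell, drop = FALSE]
cell_info <- cell_info[keep_cell, , drop = FALSE]

# Step 2: filter genes, keeping those detected in at least one remaining cell.
keep_gene <- rowSums(counts > 0) > 0
counts <- counts[keep_gene, , drop = FALSE]

dim(counts)
```

Filtering samples before genes matters here: a gene that is only detected in a discarded cell is removed in the second step.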

Package vignettes (each available as HTML and as an R script):

An introduction to the scater package
Data visualisation methods in scater
Expression quantification and import
Quality control with scater
Transition from SCESet to SingleCellExperiment

Test data

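suppressPackageStartupMessages(library(scater))
# Example count matrix and per-cell annotation shipped with scater
data("sc_example_counts")
data("sc_example_cell_info")

# Build the SingleCellExperiment object from the counts plus cell metadata
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info
)

# Store log2(CPM + 1) as the expression values
exprs(example_sce) <- log2(
    calculateCPM(example_sce, use.size.factors = FALSE) + 1
)

# Keep only features expressed in at least one cell
keep_feature <- rowSums(exprs(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]

# Compute QC metrics, treating the first 40 features as a control set named "eg"
example_sce <- calculateQCMetrics(example_sce,
                                  feature_controls = list(eg = 1:40))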

#scater_gui(example_sce)
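The `exprs` values set in the code above are log2(CPM + 1). As a minimal sketch of what that transformation does when each cell's total count serves as its library size (an assumption in the spirit of `use.size.factors = FALSE`), here is the same calculation in plain base R on a made-up matrix. This is not scater's implementation.

```r
# Made-up counts: 4 genes by 2 cells (matrix() fills column by column).
counts <- matrix(c(10, 0, 30, 60,
                    5, 5,  0, 90),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("cellA", "cellB")))

# Library size = total counts per cell.
lib_size <- colSums(counts)

# Counts per million: divide each column by its library size, scale to 1e6.
cpm <- sweep(counts, 2, lib_size, "/") * 1e6

# The +1 offset keeps zero counts at zero after the log transform.
log_cpm <- log2(cpm + 1)

round(log_cpm, 2)
```

After this step every cell sums to one million on the CPM scale, so differences in sequencing depth no longer drive the comparison between cells.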

It really is very convenient: all the visualisations are gathered in the Shiny web page produced by the scater_gui function:

plotScater: a plot method exists for SingleCellExperiment objects, which gives an overview of expression across cells.
plotQC: various methods are available for producing QC diagnostic plots.
plotPCA: produce a principal components plot for the cells.
plotTSNE: produce a t-distributed stochastic neighbour embedding (reduced dimension) plot for the cells.
plotDiffusionMap: produce a diffusion map (reduced dimension) plot for the cells.
plotMDS: produce a multi-dimensional scaling plot for the cells.
plotReducedDim: plot a reduced-dimension representation of the cells.
plotExpression: plot expression levels for a defined set of features.
plotPlatePosition: plot cells in their position on a plate, coloured by cell metadata and QC metrics or feature expression level.
plotColData: plot cell metadata and QC metrics.
plotRowData: plot feature metadata and QC metrics.

These let you explore your data thoroughly. Here is the call for one of the visualisation functions, picked at random:

## ----plot-expression, eval=TRUE--------------------------------------------
plotExpression(example_sce, rownames(example_sce)[1:6],
               x = "Mutation_Status", exprs_values = "exprs", 
               colour = "Treatment")

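Detailed QC

QC has to be combined with the visualisation steps above, so it cannot be fully automated. You have to visualise first and then judge by eye which samples or genes should be discarded.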

library(knitr)
opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png')
library(ggplot2)
theme_set(theme_bw(12))

## ----quickstart-load-data, message=FALSE, warning=FALSE--------------------
suppressPackageStartupMessages(library(scater))
data("sc_example_counts")
data("sc_example_cell_info")

## ----quickstart-make-sce, results='hide'-----------------------------------
gene_df <- DataFrame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts), 
                                    colData = sc_example_cell_info, 
                                    rowData = gene_df)

example_sce <- normalise(example_sce)

## Warning in .local(object, ...): using library sizes as size factors

## ----quickstart-add-exprs, results='hide'----------------------------------
exprs(example_sce) <- log2(
    calculateCPM(example_sce, use.size.factors = FALSE) + 1)

## ----filter-no-exprs-------------------------------------------------------
keep_feature <- rowSums(exprs(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]

example_sceset <- calculateQCMetrics(example_sce, feature_controls = list(eg = 1:40)) 


colnames(colData(example_sceset))

##  [1] "Cell"                                  
##  [2] "Mutation_Status"                       
##  [3] "Cell_Cycle"                            
##  [4] "Treatment"                             
##  [5] "total_features"                        
##  [6] "log10_total_features"                  
##  [7] "total_counts"                          
##  [8] "log10_total_counts"                    
##  [9] "pct_counts_top_50_features"            
## [10] "pct_counts_top_100_features"           
## [11] "pct_counts_top_200_features"           
## [12] "pct_counts_top_500_features"           
## [13] "total_features_endogenous"             
## [14] "log10_total_features_endogenous"       
## [15] "total_counts_endogenous"               
## [16] "log10_total_counts_endogenous"         
## [17] "pct_counts_endogenous"                 
## [18] "pct_counts_top_50_features_endogenous" 
## [19] "pct_counts_top_100_features_endogenous"
## [20] "pct_counts_top_200_features_endogenous"
## [21] "pct_counts_top_500_features_endogenous"
## [22] "total_features_feature_control"        
## [23] "log10_total_features_feature_control"  
## [24] "total_counts_feature_control"          
## [25] "log10_total_counts_feature_control"    
## [26] "pct_counts_feature_control"            
## [27] "total_features_eg"                     
## [28] "log10_total_features_eg"               
## [29] "total_counts_eg"                       
## [30] "log10_total_counts_eg"                 
## [31] "pct_counts_eg"                         
## [32] "is_cell_control"

colnames(rowData(example_sceset))

##  [1] "Gene"                  "is_feature_control"   
##  [3] "is_feature_control_eg" "mean_counts"          
##  [5] "log10_mean_counts"     "rank_counts"          
##  [7] "n_cells_counts"        "pct_dropout_counts"   
##  [9] "total_counts"          "log10_total_counts"

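The first step is sample-based filtering; colData(object) shows the statistics for each sample. The names below follow the older scater documentation and differ slightly from the column names printed above (for example, counts_feature_controls versus total_counts_feature_control).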

total_counts: total number of counts for the cell (aka ‘library size’)
log10_total_counts: total_counts on the log10-scale
total_features: the number of features for the cell that have expression above the detection limit (default detection limit is zero)
filter_on_total_counts: would this cell be filtered out based on its log10-total_counts being (by default) more than 5 median absolute deviations from the median log10-total_counts for the dataset?
filter_on_total_features: would this cell be filtered out based on its total_features being (by default) more than 5 median absolute deviations from the median total_features for the dataset?
counts_feature_controls: total number of counts for the cell that come from (a set of user-defined) control features. Defaults to zero if no control features are indicated.
counts_endogenous_features: total number of counts for the cell that come from endogenous features (i.e. not control features). Defaults to total_counts if no control features are indicated.
log10_counts_feature_controls: total number of counts from control features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.
log10_counts_endogenous_features: total number of counts from endogenous features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.
n_detected_feature_controls: number of defined feature controls that have expression greater than the threshold defined in the object.
pct_counts_feature_controls: percentage of all counts that come from the defined control features. Defaults to zero if no control features are defined.
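The `filter_on_total_counts` rule described above (more than 5 median absolute deviations from the median log10 total counts) can be sketched in base R as follows. The library sizes are made up, the `+ 1` offset and the variable names are assumptions, and `mad()` applies its default scaling constant, which may or may not match what scater uses internally.

```r
# Made-up library sizes for 8 cells; the last one is an obviously failed cell.
total_counts <- c(95000, 102000, 88000, 110000, 99000, 105000, 91000, 50)
log10_total_counts <- log10(total_counts + 1)

# Centre and spread estimated robustly, so the failed cell barely shifts them.
med <- median(log10_total_counts)
dev <- mad(log10_total_counts)

# Flag cells lying more than 5 MADs from the median on the log10 scale.
filter_on_total_counts <- abs(log10_total_counts - med) > 5 * dev

which(filter_on_total_counts)
```

Working on the log10 scale keeps the rule symmetric for cells with unusually small and unusually large libraries.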

The second step is gene-based filtering; rowData(object) shows the statistics for each gene:

mean_exprs: the mean expression level of the gene/feature.
exprs_rank: the rank of the feature’s expression level in the cell.
total_feature_counts: the total number of counts mapped to that feature across all cells.
log10_total_feature_counts: total feature counts on the log10-scale.
pct_total_counts: the percentage of all counts that are accounted for by the counts mapping to the feature.
is_feature_control: is the feature a control feature? Default is FALSE unless control features are defined by the user.
n_cells_exprs: the number of cells for which the expression level of the feature is above the detection limit (default detection limit is zero).
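To make these per-gene statistics concrete, the following base-R sketch computes several of them by hand on a made-up count matrix. Two assumptions: `mean_exprs` is taken on raw counts here (scater computes it on the expression values), and the log10 column uses a `+ 1` offset.

```r
# Made-up counts: 3 genes by 3 cells.
counts <- matrix(c(4, 0, 6,
                   0, 0, 0,
                   1, 9, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:3), paste0("cell", 1:3)))

gene_qc <- data.frame(
  mean_exprs = rowMeans(counts),
  total_feature_counts = rowSums(counts),
  n_cells_exprs = rowSums(counts > 0),   # detection limit of zero
  pct_total_counts = 100 * rowSums(counts) / sum(counts),
  row.names = rownames(counts)
)
gene_qc$log10_total_feature_counts <- log10(gene_qc$total_feature_counts + 1)

# Gene-based filter: keep genes detected in at least one cell.
gene_qc[gene_qc$n_cells_exprs > 0, ]
```

Here gene2 is never detected, so the filter on `n_cells_exprs` removes it.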

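One-stop filtering of low-quality samples with scater

scater provides its own PCA-based QC criterion, so you do not have to filter manually on library size, number of detected genes, proportion of exogenous ERCC spike-ins, and mitochondrial content.

The default variables used for the PCA are: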

pct_counts_top_100_features
total_features
pct_counts_feature_controls
n_detected_feature_controls
log10_counts_endogenous_features
log10_counts_feature_controls

The one-stop QC call looks like this:

dat_pca <- scater::plotPCA(dat_qc,
                  size_by = "total_features", 
                  shape_by = "use",
                  pca_data_input = "pdata",
                  detect_outliers = TRUE,
                  return_SCESet = TRUE)
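To give a feel for what PCA-based outlier detection on QC metrics involves, here is a base-R sketch on simulated per-cell metrics. It is not scater's implementation: the metric values are made up, and a simple Mahalanobis-distance rule on the first two principal components stands in for whatever outlier detector `detect_outliers = TRUE` applies internally.

```r
set.seed(42)
n <- 30
# Simulated QC metrics for 30 ordinary cells (column names borrowed from the
# default variable list above; the values are invented).
qc <- data.frame(
  total_features = rnorm(n, mean = 5000, sd = 300),
  pct_counts_feature_controls = rnorm(n, mean = 5, sd = 1),
  log10_counts_endogenous_features = rnorm(n, mean = 5.5, sd = 0.1)
)
# Add one low-quality cell: few genes, mostly control counts, small library.
qc <- rbind(qc, data.frame(total_features = 500,
                           pct_counts_feature_controls = 60,
                           log10_counts_endogenous_features = 3))

# PCA on centred, scaled metrics so that no single metric dominates.
pca <- prcomp(qc, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# Squared Mahalanobis distance of each cell in PC1/PC2 space.
d2 <- mahalanobis(scores, colMeans(scores), cov(scores))

# Stand-in rule (an assumption): flag cells beyond the 99% chi-squared quantile.
outlier <- d2 > qchisq(0.99, df = 2)

which(outlier)
```

The appeal of this approach is that it looks at all the QC metrics jointly, rather than applying a separate hand-picked cutoff to each one.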

More detailed tutorials are available at:

https://www.bioconductor.org/help/workflows/simpleSingleCell/
http://hemberg-lab.github.io/scRNA.seq.course/index.html

sessionInfo()

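Filtering is only its most basic tool. As one of the three major R packages for single-cell transcriptomics, scater is comprehensive: the normalization, DEG analysis, feature selection and clustering covered in earlier posts are all within its reach, although for these it mostly wraps functions from other R packages.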
