1. 程式人生 > 其它 >單細胞轉錄組3大R包之Seurat

單細胞轉錄組3大R包之Seurat

牛津大學的Rahul Satija等開發的Seurat,最早公佈在Nature biotechnology, 2015,文章是; Spatial reconstruction of single-cell gene expression data , 在2017年進行了非常大的改動,所以重新在biorxiv發表了文章在 Integrated analysis of single cell transcriptomic data across conditions, technologies, and species 。 功能涵蓋了scRNA-seq的QC、過濾、標準化、批次效應、PCA、tNSE亞群聚類分析、差異基因分析、亞群特異性標誌物鑑定等等等。

其GitHub地址是:http://satijalab.org/seurat/

給初學者提供了一個2,700 PBMC scRNA-seq dataset from 10X genomics的資料實戰指導,非常容易學會: http://satijalab.org/seurat/pbmc3k_tutorial.html 資料在: https://personal.broadinstitute.org/rahuls/seurat/seurat_files_nbt.zip

同時還提供兩個公共資料的實戰演練教程:

  • https://www.dropbox.com/s/4d00eyd84qscyd2/IntegratedAnalysis_Examples.zip?dl=1
  • http://bit.ly/IAexpmat

下載後如下所示:

IntegratedAnalysis_Examples
├── [ 211]  INSTALL
├── [1.4K]  README.md
├── [ 170]  data
│   ├── [ 83K]  Supplementary_Table_MarrowCellData.tsv
│   ├── [547K]  Supplementary_Table_PancreasCellData.tsv
│   └── [ 561]  regev_lab_cell_cycle_genes.txt
├── [ 170]  examples
│   ├── [2.9K]  marrow_commandList.R
│   └── [2.5K]  pancreas_commandList.R
└── [ 170]  tutorial
    ├── [8.2K]  Seurat_AlignmentTutorial.Rmd
    └── [7.6M]  Seurat_AlignmentTutorial.pdf
IntegratedAnalysis_ExpressionMatrices
├── [102M]  marrow_mars.expressionMatrix.txt
├── [ 66M]  marrow_ss2.expressionMatrix.txt
├── [330M]  pancreas_human.expressionMatrix.txt
├── [ 54M]  pancreas_mouse.expressionMatrix.txt
├── [165M]  pbmc_10X.expressionMatrix.txt
└── [101M]  pbmc_SeqWell.expressionMatrix.txt

seurat的用法

這裡的測試資料是經由Illumina NextSeq 500測到的2,700 single cells 表達矩陣,下載地址;

根據表達矩陣構建seurat物件

需要準備好3個輸入檔案

library(Seurat)
library(dplyr)
library(Matrix)
## https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz 
## 下載整個壓縮包解壓即可重現整個流程
# Load the PBMC dataset
list.files("~/Downloads/filtered_gene_bc_matrices/hg19/")
## [1] "barcodes.tsv" "genes.tsv"    "matrix.mtx"
pbmc.data <- Read10X(data.dir = "~/Downloads/filtered_gene_bc_matrices/hg19/")

# Examine the memory savings between regular and sparse matrices
dense.size <- object.size(x = as.matrix(x = pbmc.data))
dense.size
## 709264728 bytes
sparse.size <- object.size(x = pbmc.data)
sparse.size
## 38715120 bytes
dense.size / sparse.size
## 18.3 bytes
# Initialize the Seurat object with the raw (non-normalized data).  Keep all
# genes expressed in >= 3 cells (~0.1% of the data). Keep all cells with at
# least 200 detected genes
pbmc <- CreateSeuratObject(raw.data = pbmc.data, min.cells = 3, min.genes = 200, 
    project = "10X_PBMC")
pbmc
## An object of class seurat in project 10X_PBMC 
##  13714 genes across 2700 samples.

進行一系列的QC步驟

mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc@data), value = TRUE)
percent.mito <- Matrix::colSums([email protected][mito.genes, ]) / Matrix::colSums([email protected])

# AddMetaData adds columns to [email protected], and is a great place to stash QC stats
pbmc <- AddMetaData(object = pbmc, metadata = percent.mito, col.name = "percent.mito")
VlnPlot(object = pbmc, features.plot = c("nGene", "nUMI", "percent.mito"), nCol = 3)
# GenePlot is typically used to visualize gene-gene relationships, but can be used for anything 
# calculated by the object, i.e. columns in [email protected], PC scores etc.
# Since there is a rare subset of cells with an outlier level of high mitochondrial percentage
# and also low UMI content, we filter these as well
par(mfrow = c(1, 2))
GenePlot(object = pbmc, gene1 = "nUMI", gene2 = "percent.mito")
GenePlot(object = pbmc, gene1 = "nUMI", gene2 = "nGene")
# We filter out cells that have unique gene counts over 2,500 or less than 200
# Note that low.thresholds and high.thresholds are used to define a 'gate'
# -Inf and Inf should be used if you don't want a lower or upper threshold.
pbmc <- FilterCells(object = pbmc, subset.names = c("nGene", "percent.mito"), low.thresholds = c(200, -Inf), high.thresholds = c(2500, 0.05))

可以看到這裡選擇的QC標準是 200~2500基因範圍內,以及線粒體基因表達佔比小於5%的才保留。

normalization

這裡預設根據細胞測序文庫大小進行normalization,簡單的做一個log轉換即可。

 summary([email protected][,1])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1764  0.0000 76.0000
pbmc <- NormalizeData(object = pbmc, normalization.method = "LogNormalize", 
    scale.factor = 10000)

summary(pbmc@data[,1])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1171  0.0000  5.7531

rDetection of variable genes across the single cells

pbmc <- FindVariableGenes(object = pbmc, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.0125, x.high.cutoff = 3, y.cutoff = 0.5)
length(x = [email protected])
## [1] 1838

Scaling the data and removing unwanted sources of variation

需要去除那些technical noise,batch effects, or even biological sources of variation (cell cycle stage)

pbmc <- ScaleData(object = pbmc, vars.to.regress = c("nUMI", "percent.mito"))
summary([email protected][,1])

PCA分析

pbmc <- RunPCA(object = pbmc, pc.genes = [email protected], do.print = TRUE, pcs.print = 1:5, genes.print = 5)
## [1] "PC1"
## [1] "CST3"   "TYROBP" "FCN1"   "LST1"   "AIF1"  
## [1] ""
## [1] "PTPRCAP" "IL32"    "LTB"     "CD2"     "CTSW"   
## [1] ""
## [1] ""
## [1] "PC2"
## [1] "NKG7" "GZMB" "PRF1" "CST7" "GZMA"
## [1] ""
## [1] "CD79A"    "MS4A1"    "HLA-DQA1" "TCL1A"    "HLA-DQB1"
## [1] ""
## [1] ""
## [1] "PC3"
## [1] "PF4"   "PPBP"  "SDPR"  "SPARC" "GNG11"
## [1] ""
## [1] "CYBA"     "HLA-DPA1" "HLA-DPB1" "HLA-DRB1" "CD37"    
## [1] ""
## [1] ""
## [1] "PC4"
## [1] "IL32"   "GIMAP7" "AQP3"   "FYB"    "MAL"   
## [1] ""
## [1] "CD79A"    "HLA-DQA1" "CD79B"    "MS4A1"    "HLA-DQB1"
## [1] ""
## [1] ""
## [1] "PC5"
## [1] "FCER1A"  "LGALS2"  "MS4A6A"  "S100A8"  "CLEC10A"
## [1] ""
## [1] "FCGR3A"        "CTD-2006K23.1" "IFITM3"        "ABI3"         
## [5] "CEBPB"        
## [1] ""
## [1] ""

對PCA分析結果可以進行一系列的視覺化: PrintPCA, VizPCA, PCAPlot, and PCHeatmap

# Examine and visualize PCA results a few different ways
PrintPCA(object = pbmc, pcs.print = 1:5, genes.print = 5, use.full = FALSE)
## [1] "PC1"
## [1] "CST3"   "TYROBP" "FCN1"   "LST1"   "AIF1"  
## [1] ""
## [1] "PTPRCAP" "IL32"    "LTB"     "CD2"     "CTSW"   
## [1] ""
## [1] ""
## [1] "PC2"
## [1] "NKG7" "GZMB" "PRF1" "CST7" "GZMA"
## [1] ""
## [1] "CD79A"    "MS4A1"    "HLA-DQA1" "TCL1A"    "HLA-DQB1"
## [1] ""
## [1] ""
## [1] "PC3"
## [1] "PF4"   "PPBP"  "SDPR"  "SPARC" "GNG11"
## [1] ""
## [1] "CYBA"     "HLA-DPA1" "HLA-DPB1" "HLA-DRB1" "CD37"    
## [1] ""
## [1] ""
## [1] "PC4"
## [1] "IL32"   "GIMAP7" "AQP3"   "FYB"    "MAL"   
## [1] ""
## [1] "CD79A"    "HLA-DQA1" "CD79B"    "MS4A1"    "HLA-DQB1"
## [1] ""
## [1] ""
## [1] "PC5"
## [1] "FCER1A"  "LGALS2"  "MS4A6A"  "S100A8"  "CLEC10A"
## [1] ""
## [1] "FCGR3A"        "CTD-2006K23.1" "IFITM3"        "ABI3"         
## [5] "CEBPB"        
## [1] ""
## [1] ""
VizPCA(object = pbmc, pcs.use = 1:2)
PCAPlot(object = pbmc, dim.1 = 1, dim.2 = 2)
# ProjectPCA scores each gene in the dataset (including genes not included in the PCA) based on their correlation 
# with the calculated components. Though we don't use this further here, it can be used to identify markers that 
# are strongly correlated with cellular heterogeneity, but may not have passed through variable gene selection. 
# The results of the projected PCA can be explored by setting use.full=T in the functions above
pbmc <- ProjectPCA(object = pbmc, do.print = FALSE)

最重要的就是 PCHeatmap 函數了

PCHeatmap(object = pbmc, pc.use = 1, cells.use = 500, do.balanced = TRUE, label.columns = FALSE)
PCHeatmap(object = pbmc, pc.use = 1:12, cells.use = 500, do.balanced = TRUE, label.columns = FALSE, use.full = FALSE)

找到有統計學顯著性的主成分

主成分分析結束後需要確定哪些主成分所代表的基因可以進入下游分析,這裡可以使用JackStraw做重抽樣分析。可以用JackStrawPlot視覺化看看哪些主成分可以進行下游分析。

pbmc <- JackStraw(object = pbmc, num.replicate = 100, do.print = FALSE) 
JackStrawPlot(object = pbmc, PCs = 1:12)

當然,也可以用最經典的碎石圖來確定主成分。

PCElbowPlot(object = pbmc)

這個確定主成分是非常有挑戰性的: - The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example. - The second implements a statistical test based on a random null model, but is time-consuming for large datasets, and may not return a clear PC cutoff. - The third is a heuristic that is commonly used, and can be calculated instantly.

在本例子裡面,3種方法結果差異不大,可以在PC7~10直接挑選。

Cluster the cells

# save.SNN = T saves the SNN so that the clustering algorithm can be rerun using the same graph
# but with a different resolution value (see docs for full details)
pbmc <- FindClusters(object = pbmc, reduction.type = "pca", dims.use = 1:10, resolution = 0.6, print.output = 0, save.SNN = TRUE)

A useful feature in Seurat v2.0 is the ability to recall the parameters that were used in the latest function calls for commonly used functions. For FindClusters, we provide the function PrintFindClustersParams to print a nicely formatted formatted summary of the parameters that were chosen.

PrintFindClustersParams(object = pbmc)
## Parameters used in latest FindClusters calculation run on: 2018-01-22 07:43:31
## =============================================================================
## Resolution: 0.6
## -----------------------------------------------------------------------------
## Modularity Function    Algorithm         n.start         n.iter
##      1                   1                 100             10
## -----------------------------------------------------------------------------
## Reduction used          k.param          k.scale          prune.SNN
##      pca                 30                25              0.0667
## -----------------------------------------------------------------------------
## Dims used in calculation
## =============================================================================
## 1 2 3 4 5 6 7 8 9 10
# While we do provide function-specific printing functions, the more general function to 
# print calculation parameters is PrintCalcParams(). 

Run Non-linear dimensional reduction (tSNE)

同樣也是一個函式,這個結果也可以像PCA分析一下挑選合適的PC進行下游分析。

pbmc <- RunTSNE(object = pbmc, dims.use = 1:10, do.fast = TRUE)
# note that you can set do.label=T to help label individual clusters
TSNEPlot(object = pbmc)

這一步很耗時,可以儲存該物件,便於重複,以及分享交流

save(pbmc, file = "pbmc3k.rData")

Finding differentially expressed genes (cluster biomarkers)

差異分析在seurat包裡面被封裝成了函式:FindMarkers,有一系列引數可以選擇,然後又4種找差異基因的演算法:

  • ROC test (“roc”)
  • t-test (“t”)
  • LRT test based on zero-inflated data (“bimod”, default)
  • LRT test based on tobit-censoring models (“tobit”)
# find all markers of cluster 1
cluster1.markers <- FindMarkers(object = pbmc, ident.1 = 1, min.pct = 0.25)
print(x = head(x = cluster1.markers, n = 5))
##                p_val avg_logFC pct.1 pct.2    p_val_adj
## S100A9  0.000000e+00  3.827593 0.996 0.216  0.00000e+00
## S100A8  0.000000e+00  3.786535 0.973 0.123  0.00000e+00
## LGALS2  0.000000e+00  2.634722 0.908 0.060  0.00000e+00
## FCN1    0.000000e+00  2.369524 0.956 0.150  0.00000e+00
## CD14   8.129864e-290  1.949317 0.663 0.029 1.11493e-285
# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(object = pbmc, ident.1 = 5, ident.2 = c(0,3), min.pct = 0.25)
print(x = head(x = cluster5.markers, n = 5))
##                p_val avg_logFC pct.1 pct.2     p_val_adj
## GZMB   3.854665e-190  3.195021 0.955 0.084 5.286288e-186
## IGFBP7 2.967797e-155  2.175917 0.542 0.010 4.070037e-151
## GNLY   7.492111e-155  3.514718 0.961 0.143 1.027468e-150
## FGFBP2 2.334109e-150  2.559484 0.852 0.085 3.200998e-146
## FCER1G 4.819154e-141  2.280724 0.839 0.100 6.608987e-137
# find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(object = pbmc, only.pos = TRUE, min.pct = 0.25, thresh.use = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(2, avg_logFC)
## # A tibble: 16 x 7
## # Groups:   cluster [8]
##            p_val avg_logFC pct.1 pct.2     p_val_adj cluster     gene
##            <dbl>     <dbl> <dbl> <dbl>         <dbl>  <fctr>    <chr>
##  1 1.315805e-234  1.149058 0.924 0.483 1.804495e-230       0     LDHB
##  2 3.311687e-129  1.068122 0.662 0.202 4.541648e-125       0     IL7R
##  3  0.000000e+00  3.827593 0.996 0.216  0.000000e+00       1   S100A9
##  4  0.000000e+00  3.786535 0.973 0.123  0.000000e+00       1   S100A8
##  5  0.000000e+00  2.977399 0.936 0.042  0.000000e+00       2    CD79A
##  6 1.038405e-271  2.492236 0.624 0.022 1.424068e-267       2    TCL1A
##  7 8.029765e-207  2.158812 0.974 0.230 1.101202e-202       3     CCL5
##  8 1.118949e-181  2.113428 0.588 0.050 1.534527e-177       3     GZMK
##  9 1.066599e-173  2.275509 0.962 0.137 1.462733e-169       4   FCGR3A
## 10 1.996623e-123  2.151881 1.000 0.316 2.738169e-119       4     LST1
## 11 9.120707e-265  3.334634 0.955 0.068 1.250814e-260       5     GZMB
## 12 6.251673e-192  3.763928 0.961 0.131 8.573544e-188       5     GNLY
## 13 2.510362e-238  2.729243 0.844 0.011 3.442711e-234       6   FCER1A
## 14  7.037034e-21  1.965168 1.000 0.513  9.650588e-17       6 HLA-DPB1
## 15 2.592342e-186  4.952160 0.933 0.010 3.555138e-182       7      PF4
## 16 7.813553e-118  5.889503 1.000 0.023 1.071551e-113       7     PPBP

值得注意的是: The ROC test returns the ‘classification power’ for any individual marker (ranging from 0 - random, to 1 - perfect).

cluster1.markers <- FindMarkers(object = pbmc, ident.1 = 0, thresh.use = 0.25, test.use = "roc", only.pos = TRUE)

同時,該包提供了一系列視覺化方法來檢查差異分析的結果的可靠性:

  • VlnPlot (shows expression probability distributions across clusters)
  • FeaturePlot (visualizes gene expression on a tSNE or PCA plot) are our most commonly used visualizations
  • JoyPlot, CellPlot, and DotPlot
VlnPlot(object = pbmc, features.plot = c("MS4A1", "CD79A"))
# you can plot raw UMI counts as well
VlnPlot(object = pbmc, features.plot = c("NKG7", "PF4"), use.raw = TRUE, y.log = TRUE)
FeaturePlot(object = pbmc, features.plot = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"), cols.use = c("grey", "blue"), reduction.use = "tsne")

DoHeatmap generates an expression heatmap for given cells and genes. In this case, we are plotting the top 20 markers (or all markers if less than 20) for each cluster.

pbmc.markers %>% group_by(cluster) %>% top_n(10, avg_logFC) -> top10
# setting slim.col.label to TRUE will print just the cluster IDS instead of every cell name
DoHeatmap(object = pbmc, genes.use = top10$gene, slim.col.label = TRUE, remove.key = TRUE)

Assigning cell type identity to clusters

這個主要取決於生物學背景知識:

Cluster ID

Markers

Cell Type

0

IL7R

CD4 T cells

1

CD14, LYZ

CD14+ Monocytes

2

MS4A1

B cells

3

CD8A

CD8 T cells

4

FCGR3A, MS4A7

FCGR3A+ Monocytes

5

GNLY, NKG7

NK cells

6

FCER1A, CST3

Dendritic Cells

7

PPBP

Megakaryocytes

current.cluster.ids <- c(0, 1, 2, 3, 4, 5, 6, 7)
new.cluster.ids <- c("CD4 T cells", "CD14+ Monocytes", "B cells", "CD8 T cells", "FCGR3A+ Monocytes", "NK cells", "Dendritic cells", "Megakaryocytes")
pbmc@ident <- plyr::mapvalues(x = pbmc@ident, from = current.cluster.ids, to = new.cluster.ids)
TSNEPlot(object = pbmc, do.label = TRUE, pt.size = 0.5)

Further subdivisions within cell types

# First lets stash our identities for later
pbmc <- StashIdent(object = pbmc, save.name = "ClusterNames_0.6")

# Note that if you set save.snn=T above, you don't need to recalculate the SNN, and can simply put: 
# pbmc <- FindClusters(pbmc,resolution = 0.8)
pbmc <- FindClusters(object = pbmc, reduction.type = "pca", dims.use = 1:10, resolution = 0.8, print.output = FALSE)

# Demonstration of how to plot two tSNE plots side by side, and how to color points based on different criteria
plot1 <- TSNEPlot(object = pbmc, do.return = TRUE, no.legend = TRUE, do.label = TRUE)
plot2 <- TSNEPlot(object = pbmc, do.return = TRUE, group.by = "ClusterNames_0.6", no.legend = TRUE, do.label = TRUE)
plot_grid(plot1, plot2)
# Find discriminating markers
tcell.markers <- FindMarkers(object = pbmc, ident.1 = 0, ident.2 = 1)

# Most of the markers tend to be expressed in C1 (i.e. S100A4). However, we can see that CCR7 is upregulated in 
# C0, strongly indicating that we can differentiate memory from naive CD4 cells.
# cols.use demarcates the color palette from low to high expression
FeaturePlot(object = pbmc, features.plot = c("S100A4", "CCR7"), cols.use = c("green", "blue"))

The memory/naive split is bit weak, and we would probably benefit from looking at more cells to see if this becomes more convincing. In the meantime, we can restore our old cluster identities for downstream processing.

還有一個非常給力的用法,限於篇幅,就不介紹了,大家可以自行探索。

後面還有一個10X的單細胞實戰,用的就是這個包,敬請期待。