lncRNA實戰專案-第六步-WGCNA相關性分析

阿新 • • 發佈：2022-05-02

WGCNA將lncRNA分成18個模組（3635個lncRNA），空間模組中lncRNA表達呈現明顯的組織區域特異性，如：CB (M1, 794個lncRNAs),DG/CA1 (M2, 443個lncRNAs), CA1 (M4, 369個lncRNAs),neocortex (M7, 123個lncRNAs)和OC (M10,57個lncRNAs)。時間模組中lncRNA表達與年齡有關，而與組織區域不明顯；性別模組中lncRNA表達與性別和年齡都相關。每個模組就必須做pathway/go等資料庫的註釋分析咯！

資料收集：

google搜尋或在生信技能樹和生信菜鳥團搜尋WGCNA ，能找到很多教程，下面列出幾個中文教程和英文教程，強烈推薦中文教程1和英文教程3。

一文學會WGCNA分析
學習WGCNA總結
Tutorials for the WGCNA package

WGCNA Background and glossary
Data input and cleaning
Network construction and module detection
Relating modules to external information and identifying important genes
Interfacing network analysis with other data such as functional annotation and gene ontology

Network visualization using WGCNA functions
Exporting a gene network to external visualization software

背景知識

基本概念:

WGCNA(weighted correlation network analysis)加權基因共表達網路分析, 用於提取與性狀或臨床特徵相關的基因模組，解析基礎代謝途徑，轉錄調控途徑、翻譯水平調控等生物學過程。WGCNA適合於複雜的資料模式，推薦5組以上的資料，如：不同器官、組織型別發育調控；同一組織不同時期發育調控；非生物脅迫不同時間點應答；病原物侵染後不同時間點應答。

基本步驟：

WGCNA分為表達量聚類分析和表型關聯兩部分，具體步驟包括基因之間相關係數的計算，共表達網路的構建，篩選特定模組，模組與性狀關聯，核心基因的篩選。

Overview of a typical WGCNA analysis.

術語：

Co-expression weighted network: 是一個無向有權重（undirected, weighted）的網路。“無權重（unweighted network）”,基因與基因之間的相關度只能是0或者1,0表示兩個基因沒有聯絡，而1表示有。“有權重（weighted network）”基因之間不僅僅是相關與否，還記錄著它們的相關性數值，數值就是基因之間的聯絡的權重(相關性)。 Module:（模組）指表達模式相似的基因聚為一類，這樣的一類基因稱為模組。 Connectivity（連線度）:連線度是指某個基因與該網路中其他基因的相關程度，常為“相關度”之和。 Eigengene（eigen- +‎ gene）：基因和樣本構成的矩陣。 Hub gene: 模組中的高度相關的核心基因。 Gene signicance ，GS:基因顯著性引數，為非負數字資訊，比如基於樣本的臨床資料（clinical trails）和基於每個基因的-log(p-value)等。

分析流程

WGCNA輸入檔案需要一個表達矩陣，最好是RPKM或其他歸一化好的表達量，還需要一個矩陣關於臨床資訊或者其它關於樣本屬性的資訊。

STEP1: 輸入資料的準備

表達矩陣可以從作者GitHub上下載（https://github.com/DChenABLife/RhesusLncRNA），我這裡只嘗試對lncRNA的表達矩陣(https://github.com/DChenABLife/RhesusLncRNA/blob/master/All_sample_LncRNA_exp_RPKM.xlsx)做後續分析。下載的表達矩陣檔案是Excel格式的，需要轉為csv格式方便後續用R處理，可以直接開啟這個excel檔案，然後另存為csv格式即可。

讀入原始表達資料

原始資料包含64個樣本，9904個lncRNA表達量，這時的矩陣行為基因，列為樣本資訊，其中第一列是lncRNA ID，(這裡的lncRNA ID是cufflinks 組裝時給的自由的ID, 是需要和已有的ID對應, 對於新的轉錄本再通過nr/nt等資料庫註釋），第66列是作者給出的註釋（我查了幾個註釋有的也查不出來是什麼意思）。

setwd("G:/My_exercise/WGCNA/")
lncRNAexpr <- read.csv("All_sample_LncRNA_exp_RPKM.csv",sep=",",header = T)
head(lncRNAexpr)
dim(lncRNAexpr)
#[1] 9904   66

rawdata

重新命名資料列表，行名和列名

##去掉Annotation這列
fpkm <- lncRNAexpr[,-66]
head(fpkm)
##重新命名行名和列名
rownames(fpkm)=fpkm[,1]
fpkm=fpkm[,-1]
fpkm[1:4,1:4]

fpkm

準備表型資訊

這裡有64個樣本，包含獼猴腦不同空間區域，不同發育時期，以及性別，因為每個樣本都交叉包含著三種不同的資訊，如果選擇全部表型資訊，我試了試，後續的模組和性狀完全看不清關係，所以我這裡僅選擇腦不同區域的表型資訊,包括CB、DG、PFC、PCC、CA1、OC、PC、TC。

##Sample Info
subname=sapply(colnames(fpkm),function(x) strsplit(x,"_")[[1]][1])
datTraits = data.frame(gsm=names(fpkm),
                       subtype=subname)
rownames(datTraits)=datTraits[,1]
head(datTraits)

sample.Info

下載並載入WGCNA包

source("http://bioconductor.org/biocLite.R")
#biocLite(c("AnnotationDbi", "impute", "GO.db", "preprocessCore")) 
##如果上面的包已經下載過了，就不用下載
biocLite("WGCNA")
library(WGCNA)

行列轉置

WGCNA針對的是基因進行聚類，而一般我們的聚類是針對樣本用hclust即可，也就是說要轉置為行表示樣本, 列表示基因。

RNAseq_voom <- fpkm 
WGCNA_matrix = t(RNAseq_voom[order(apply(RNAseq_voom,1,mad), decreasing = T)[1:5000],])
datExpr <- WGCNA_matrix  ## top 5000 mad genes
datExpr[1:4,1:4]

datExpr

確定臨床表型與樣本名字

sampleNames = rownames(datExpr);
traitRows = match(sampleNames, datTraits$gsm)  
rownames(datTraits) = datTraits[traitRows, 1]

datExpr和datTraits準備好後，接下來就是構建基因網路，鑑定模組。網路構建有三種方法：1）一步法構建網路；2）多步法構建網路；3）block-wise構建網路（主要針對大資料集）。下面的介紹的步驟是一步法構建網路。

STEP2：確定最佳soft-thresholding power

選擇合適的“軟閥值（soft thresholding power）”beta, 用到的函式是pickSoftThreshold，pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)，powerVector可以是一系列數值，從而選擇最優值。這個函式返回一個列表，第一項是powerEstimate是估計的最優power；第二項是fitIndices是詳細的矩陣資料，其中第五列是mean.k表示平均“連線度（connectivity）”。

Constructing a weighted gene network entails the choice of the soft thresholding power to which co-expression similarity is raised to calculate adjacency.

# Choose a set of soft-thresholding powers
powers = c(c(1:10), seq(from = 12, to=20, by=2))
# Call the network topology analysis function
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
# Plot the results:
par(mfrow = c(1,2));
cex1 = 0.9;
# Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
     main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels=powers,cex=cex1,col="red");
# this line corresponds to using an R^2 cut-off of h
abline(h=0.90,col="green")
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
     main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")
# 選擇beta值
best_beta=sft$powerEstimate
#> best_beta
[1] 3

最佳beta值是3。

Soft Threshold

STEP3: 一步法構建共表達矩陣

一步法構建網路，使用函式blockwiseModules(), 這個函式包含很多引數，其中power=sft$powerEstimate=3 即上一步得到的最佳軟閾值；maxBlockSize 預設為5000，表示在這個數值內的基因將整體被計算，如果調大需要更多的記憶體；numerricLabels 預設為返回顏色，設定為TRUE則返回數字；mergeCutHeight是合併模組閾值的一個引數。

net = blockwiseModules(datExpr, power = sft$powerEstimate,
                       maxBlockSize = 6000,TOMType = "unsigned", 
                       minModuleSize = 30,reassignThreshold = 0, mergeCutHeight = 0.25,
                       numericLabels = TRUE, pamRespectsDendro = FALSE,
                       saveTOMs = TRUE,
                       saveTOMFileBase = "AS-green-FPKM-TOM",
                       verbose = 3)

STEP4:模組鑑定及視覺化

模組鑑定

上一步的返回結果是一個列表，可以用table()函式檢視，0表示沒有任何module接受。table(net$colors) 可以看總共有多少模組，每個模組的大小，這裡共有9個模組，從1-9每個模組的大小是遞減的，從2254-115，0表示這些基因不在所有模組內。

table(net$colors)

視覺化

dendrograms表示在一個block中所有基因的進化樹圖, 使用函式plotDendroAndColors()檢視系統發生樹；blockGenes是一個block中所有的基因。

# Convert labels to colors for plotting
mergedColors = labels2colors(net$colors)
table(mergedColors)
# Plot the dendrogram and the module colors underneath
plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]],
                    "Module colors",
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05)
## assign all of the gene to their corresponding module 
## hclust for the genes.

cluster Dendrogram

#明確樣本數和基因數
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
#首先針對樣本做個系統聚類樹
datExpr_tree<-hclust(dist(datExpr), method = "average")
par(mar = c(0,5,2,0))
plot(datExpr_tree, main = "Sample clustering", sub="", xlab="", cex.lab = 2, 
     cex.axis = 1, cex.main = 1,cex.lab=1)
## 如果這個時候樣本是有性狀，或者臨床表型的，可以加進去看看是否聚類合理
#針對前面構造的樣品矩陣新增對應顏色
sample_colors <- numbers2colors(as.numeric(factor(datTraits$subtype)), 
                                colors = c("grey","blue","red","green"),signed = FALSE)
## 這個給樣品新增對應顏色的程式碼需要自行修改以適應自己的資料分析專案。
#  sample_colors <- numbers2colors( datTraits ,signed = FALSE)
## 如果樣品有多種分類情況，而且 datTraits 裡面都是分類資訊，那麼可以直接用上面程式碼，
##當然，這樣給的顏色不明顯，意義不大。
#構造10個樣品的系統聚類樹及性狀熱圖
par(mar = c(1,4,3,1),cex=0.8)
plotDendroAndColors(datExpr_tree, sample_colors,
                    groupLabels = colnames(sample),
                    cex.dendroLabels = 0.8,
                    marAll = c(1, 4, 3, 1),
                    cex.rowText = 0.01,
                    main = "Sample dendrogram and trait heatmap")

Sample dendrogram and trait heatmap

STEP5:模組和性狀的關係

MEs是一個關於modules的特徵量矩陣，行數等於篩選的modules數，列數等於樣本數；

design=model.matrix(~0+ datTraits$subtype)
colnames(design)=levels(datTraits$subtype)
moduleColors <- labels2colors(net$colors)
# Recalculate MEs with color labels
MEs0 = moduleEigengenes(datExpr, moduleColors)$eigengenes
MEs = orderMEs(MEs0); ##不同顏色的模組的ME值矩陣(樣本vs模組)
moduleTraitCor = cor(MEs, design , use = "p");
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples)

sizeGrWindow(10,6)
# Will display correlations and their p-values
textMatrix = paste(signif(moduleTraitCor, 2), "n(",
                   signif(moduleTraitPvalue, 1), ")", sep = "");
dim(textMatrix) = dim(moduleTraitCor)
par(mar = c(6, 8.5, 3, 3));
# Display the correlation values within a heatmap plot
labeledHeatmap(Matrix = moduleTraitCor,
               xLabels = names(design),
               yLabels = names(MEs),
               ySymbols = names(MEs),
               colorLabels = FALSE,
               colors = greenWhiteRed(50),
               textMatrix = textMatrix,
               setStdMargins = FALSE,
               cex.text = 0.5,
               zlim = c(-1,1),
               main = paste("Module-trait relationships"))

圖中第二列第五行，即CB/turquoise相關性最強（0.97），pvalue=1e-41,後續分析可以挑選這個模組。每一列對應的樣本特徵可以從design這裡檢視。

STEP6:感興趣性狀的模組的具體基因分析

選定CB/turquoise這個模組後，接下來看看模組與樣本特徵和基因的相關性，1. 首先計算全模組與基因的相關性矩陣，2. 再計算性狀與基因的相關性矩陣，3. 使用verboseScatterplot()繪製相關性圖：

首先計算模組與基因的相關性矩陣

# names (colors) of the modules
modNames = substring(names(MEs), 3)
geneModuleMembership = as.data.frame(cor(datExpr, MEs, use = "p"));
## 算出每個模組跟基因的皮爾森相關係數矩陣
## MEs是每個模組在每個樣本里面的值
## datExpr是每個基因在每個樣本的表達量
MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples));
names(geneModuleMembership) = paste("MM", modNames, sep="");
names(MMPvalue) = paste("p.MM", modNames, sep="");

再計算性狀與基因的相關性矩陣

## 只有連續型性狀才能只有計算
## 這裡把是否屬於 CB 表型這個變數用0,1進行數值化。
CB = as.data.frame(design[,2]);
names(CB) = "CB"
geneTraitSignificance = as.data.frame(cor(datExpr, CB, use = "p"));
GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples));
names(geneTraitSignificance) = paste("GS.", names(CB), sep="");
names(GSPvalue) = paste("p.GS.", names(CB), sep="")

最後把兩個相關性矩陣聯合起來,指定感興趣模組進行分析

module = "turquoise"
column = match(module, modNames);
moduleGenes = moduleColors==module;
sizeGrWindow(7, 7);
par(mfrow = c(1,1));
verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
                   abs(geneTraitSignificance[moduleGenes, 1]),
                   xlab = paste("Module Membership in", module, "module"),
                   ylab = "Gene significance for CB",
                   main = paste("Module membership vs. gene significancen"),
                   cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)

上圖可以看出基因跟其對應的性狀高度相關，可以匯出做個GO/KEGG註釋，看看這些基因的具體功能。

STEP7:網路的視覺化

Topological Overlap Matrix (TOM)圖是指基因和篩選模型的視覺化表示。WGCNA認為基因之間的簡單的相關性不足以計算共表達，所以它利用鄰近矩陣，又計算了一個新的鄰近矩陣。一般來說，TOM就是WGCNA分析的最終結果，後續的只是對TOM的下游註釋。

生成TOM圖的步驟：1. 生成全基因不相似TOM矩陣，比如dissTOM=1-TOMsimilarityFromExpr(datExpr, power = 6)，可以把得到的不相關矩陣加冪，這樣繪製的TOM圖色彩差異會比較明顯。2. 使用TOMplot()繪製TOM圖。

#首先針對所有基因畫熱圖
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
geneTree = net$dendrograms[[1]]; 
#生成全基因不相似TOM矩陣
dissTOM = 1-TOMsimilarityFromExpr(datExpr, power = 6); 
plotTOM = dissTOM^7; 
diag(plotTOM) = NA; 
#TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes")

#然後隨機選取部分基因作圖
nSelect = 400
# For reproducibility, we set the random seed
set.seed(10);
select = sample(nGenes, size = nSelect);
selectTOM = dissTOM[select, select];
# There’s no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
selectTree = hclust(as.dist(selectTOM), method = "average")
selectColors = moduleColors[select];
# Open a graphical window
sizeGrWindow(9,9)
# Taking the dissimilarity to a power, say 10, makes the plot more informative by effectively changing
# the color palette; setting the diagonal to NA also improves the clarity of the plot
plotDiss = selectTOM^7;
diag(plotDiss) = NA;
TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes")

圖上和圖左是全基因系統發育樹，不同顏色亮帶表示不同的module，每一個亮點表示每個基因與其他基因的相關性強弱（越亮表示相關性越強）。

#最後畫模組和性狀的關係
# Recalculate module eigengenes
MEs = moduleEigengenes(datExpr, moduleColors)$eigengenes
## 只有連續型性狀才能只有計算
## 這裡把是否屬於 Luminal 表型這個變數用0,1進行數值化。
CB = as.data.frame(design[,2]);
names(CB) = "CB"
# Add the weight to existing module eigengenes
MET = orderMEs(cbind(MEs, CB))
# Plot the relationships among the eigengenes and the trait
sizeGrWindow(5,7.5);
par(cex = 0.9)
plotEigengeneNetworks(MET, "", marDendro = c(0,4,1,2), marHeatmap = c(3,4,1,2), cex.lab = 0.8,                         xLabelsAngle= 90)
# Plot the dendrogram
sizeGrWindow(6,6);
par(cex = 1.0)
## 模組的聚類圖
plotEigengeneNetworks(MET, "Eigengene dendrogram", marDendro = c(0,4,2,0),
                      plotHeatmaps = FALSE)
# Plot the heatmap matrix (note: this plot will overwrite the dendrogram plot)
par(cex = 1.0)
## 性狀與模組熱圖
plotEigengeneNetworks(MET, "Eigengene adjacency heatmap", marHeatmap = c(3,4,2,2),
                      plotDendrograms = FALSE, xLabelsAngle = 90)

STEP8:提取指定模組的基因名

提取基因資訊,可以做GO/KEGG等分析，進而解釋這些module的生物學意義。這裡的lncRNA ID轉換著有點麻煩，這一步先略過，之後再看看。

# Select module
module = "turquoise";
# Select module probes
probes = colnames(datExpr) ## 我們例子裡面的probe就是基因名
inModule = (moduleColors==module);
modProbes = probes[inModule];

STEP9: 模組的匯出

主要模組裡面的基因直接的相互作用關係資訊可以匯出到cytoscape,VisANT等網路視覺化軟體。

# Recalculate topological overlap
TOM = TOMsimilarityFromExpr(datExpr, power = 6); 
# Select module
module = "turquoise";
# Select module probes
probes = colnames(datExpr) ## 我們例子裡面的probe就是基因名
inModule = (moduleColors==module);
modProbes = probes[inModule]; 
## 也是提取指定模組的基因名
# Select the corresponding Topological Overlap
modTOM = TOM[inModule, inModule];
dimnames(modTOM) = list(modProbes, modProbes)

首先是匯出到VisANT

vis = exportNetworkToVisANT(modTOM,
                            file = paste("VisANTInput-", module, ".txt", sep=""),
                            weighted = TRUE,
                            threshold = 0)

也可以輸出模組裡的部分基因：

nTop = 30;
IMConn = softConnectivity(datExpr[, modProbes]);
top = (rank(-IMConn) <= nTop)
vis = exportNetworkToVisANT(modTOM[top, top],
       file = paste("VisANTInput-", module, "-top30.txt", sep=""),
       weighted = TRUE,
       threshold = 0,
       probeToGene = data.frame(annot$substanceBXH, annot$gene_symbol) )

然後是匯出到cytoscape

cyt = exportNetworkToCytoscape(
  modTOM,
  edgeFile = paste("CytoscapeInput-edges-", paste(module, collapse="-"), ".txt", sep=""),
  nodeFile = paste("CytoscapeInput-nodes-", paste(module, collapse="-"), ".txt", sep=""),
  weighted = TRUE,
  threshold = 0.02,
  nodeNames = modProbes, 
  nodeAttr = moduleColors[inModule]
  );

編輯：jimmy