1. 程式人生 > >RRBS甲基化分析流程

RRBS甲基化分析流程

RRBS甲基化流程

分析流程

和普通的測序分析一致,首先fastqc質量檢測,接著對序列進行修剪,修剪後再質量檢測;如果質量檢測通過,則進行序列回帖,然後去除重複,計算甲基化程度,以及一些後續分析,本次後續分析使用R包methlykit以及edmr,還有其他一些甲基化分析軟體可以參見附錄。
整體的分析流程圖

流程命令

比對流程

首先進行fastqc質量鑑定,如果需要的話,再去除接頭,PS:去除接頭後需要重新進行質量鑑定
接下來就是Bismark序列比對,Bismark會通過呼叫bowtie1或者bowtie2進行比對,(生信菜鳥團有人介紹使用bsmap進行序列比對)
下面以呼叫bowtie1為例,

1、準備基因組

bismark_genome_preparation --bowtie1 /home/huangml/WangYunpeng/RRBS/

--bowtie1 選定bowtie1,如果bowtie不在環境變數路徑內,還要另外指明

/home/huangml/WangYunpeng/RRBS/ 選定參考基因組位置

2、測序檔案比對

bismark -bowtie1 /home/huangml/WangYunpeng/RRBS -1 P11m414_LiNew_R1.fq.gz -2 P11m414_LiNew_R2.fq.gz

-1和-2 分別選定雙端測序檔案

/home/huangml/WangYunpeng/RRBS/ 選定參考基因組位置

3、去除重複序列(消除PCR過度擴增的影響)

deduplicate_bismark --bam P3m76_LiNew_R1_bismark_pe.bam

--bam 保證輸出為bam格式

4、抽提出甲基化的統計資訊

bismark_methylation_extractor --bedGraph --gzip P3m76_LiNew_R1_bismark_pe.deduplicated.bam

--gzip 結果輸出為GZIP壓縮檔案

--bedGraph 結果輸出格式為bedGraph

5、甲基化的視覺化 (此步驟需要用Bismark在GitHub上的版本,官網版本存在bug,圖片會無法顯示)

bismark2report

需要用root許可權在上一步統計資訊的資料夾下面執行,生成HTML格式的圖形統計結果

詳細版的使用Bismark使用引數在https://rawgit.com/FelixKrueger/Bismark/master/Docs/Bismark_User_Guide.html
該軟體的作者還有相關練習在http://www.bioinformatics.bbsrc.ac.uk/training.html
後續分析:
如果是使用methlykit包進行分析,可以將Bismark比對生成的bam檔案轉成sam,並且排序(此處的bam檔案最好是不要經過deduplicate_bismark 去除重複序列的。如果去除重複序列,methlykit聚類圖效果不好,可以兩個都試一試),methlykit包中有提供processBismarkAln函式,對其進行預處理,即可使用methlykit包進行分析
事實上,bismark不僅僅針對RRBS測序,還可以供其他建庫方式比對使用。
下面是用不同建庫方式的時候,使用Bismark時候的建議
各種建庫方式的建議

視覺化:

Bismark作者提供了SeqMonk工具進行視覺化,同樣的,有相關練習http://www.bioinformatics.bbsrc.ac.uk/training.html

差異甲基化的分析方法####

methylKit包,本次使用的就是這個,滑動視窗選取差異甲基化區域
eDMR包(可以在GitHub上搜索,與methlykit作者相同)比起methylKit滑動視窗選取差異甲基化區域,該方法用“雙峰正態模型”將鄰近的一些CpG劃分為一片甲基化區域
PS:如果是多組樣本做overlap取甲基化區域的交集,eDMR分析結果不同組樣本比較的時候,區域起始與結束位置並不一致,overlap難以選取,methylKit結果可以稍微容易一點選取overlap,但是做實驗的時候,更希望選出差異甲基化的promoter區域,因此多輪篩選出來結果也不是很好。如果要針對promoter區域分析差異甲基化,可以使用GRanges物件,自己編寫程式統計,附錄有demo版本的程式碼。
還有一些其他軟體包如下:
其他一些甲基化分析軟體

methylKit 相關統計分析

首先,將Bismark比對結果的sam檔案,先使用samtools進行排序,然後用processBismarkAln函式讀入R,預處理生成類似如下的檔案:
- P11m414_CpG.txt
- P11m415_CpG.txt
- P11m416_CpG.txt

library(methylKit)
setwd("/home/huangml/WangYunpeng/RRBS/result")
mySaveFolder="/home/huangml/WangYunpeng/RRBS/result/2-methylKit-result"
由sam檔案建立可以讀入的檔案
my.methRaw=processBismarkAln(location = "P11m414_LiNew/P11m414_LiNew_R1_bismark_pe.sorted.sam",
                              sample.id="P11m414", assembly="M.musculus", 
                              read.context="CpG",save.folder=mySaveFolder)
my.methRaw=processBismarkAln(location = "P11m415_LiNew/P11m415_LiNew_R1_bismark_pe.sorted.sam",
                             sample.id="P11m415", assembly="M.musculus", 
                             read.context="CpG",save.folder=mySaveFolder)
 my.methRaw=processBismarkAln(location = "P11m416_LiNew/P11m416_LiNew_R1_bismark_pe.sorted.sam",
                              sample.id="P11m416", assembly="M.musculus", 
                              read.context="CpG",save.folder=mySaveFolder)
 my.methRaw=processBismarkAln(location = "P3m76_LiNew/P3m76_LiNew_R1_bismark_pe.sorted.sam",
                              sample.id="P3m76", assembly="M.musculus", 
                              read.context="CpG",save.folder=mySaveFolder)
 my.methRaw=processBismarkAln(location = "P3m77_LiNew/P3m77_LiNew_R1_bismark_pe.sorted.sam",
                              sample.id="P3m77", assembly="M.musculus", 
                              read.context="CpG",save.folder=mySaveFolder)
 my.methRaw=processBismarkAln(location = "P3m78_LiNew/P3m78_LiNew_R1_bismark_pe.sorted.sam",
                              sample.id="P3m78", assembly="M.musculus", 
                              read.context="CpG",save.folder=mySaveFolder)

檔案具體內容如下:

chrBase chr base    strand  coverage    freqC   freqT
chr1.3020813    chr1    3020813 F   10  100.00  0.00
chr1.3020841    chr1    3020841 F   19  0.00    100.00
chr1.3020876    chr1    3020876 F   47  82.98   17.02
chr1.3020890    chr1    3020890 F   38  97.37   2.63
chr1.3020944    chr1    3020944 F   49  34.69   65.31
chr1.3020970    chr1    3020970 F   22  63.64   36.36
chr1.3020986    chr1    3020986 F   22  81.82   18.18
chr1.3021011    chr1    3021011 F   13  30.77   69.23
chr1.3020877    chr1    3020877 R   196 17.86   82.14

因為bowtie選取的基因組,除了Chr1,Chr2這些染色體外,還有JH584304.1這種染色體,為了後續分析的方便,使用shell指令碼將這些行去除

cat P11m414_CpG.txt|grep chr >P11m414_CpG_filter.txt
cat P11m415_CpG.txt|grep chr >P11m415_CpG_filter.txt
cat P11m416_CpG.txt|grep chr >P11m416_CpG_filter.txt
cat P3m76_CpG.txt|grep chr >P3m76_CpG_filter.txt
cat P3m77_CpG.txt|grep chr >P3m77_CpG_filter.txt
cat P3m78_CpG.txt|grep chr >P3m78_CpG_filter.txt

此處strand列:
R在後面的格式中為”-“,表示CpG在負鏈上;F在後面的格式中為”+”,表示CpG在正鏈上
freqC、freqT為C、T出現的頻率,coverage列指的是:覆蓋度
例如:coverage為12,freqC為25,freqT為75,即所有的reads中,有12X25%=3個在此處為C;有12X75%=9個在此處為T。
因為亞硫酸氫鈉處理後, 正常的C會轉成T,甲基化的C不改變,
所以,在這裡,每個位點甲基化的概率就可以求出來,就是這裡的freqC
注意:輸入檔案此處base的位置,是從0開始,C的座標

預處理完成後就可以愉快地進行分析了
每個位點甲基化程度密度圖
在這幅圖中,橫座標表示甲基化的概率,縱座標表示此概率中位點個數
甲基化的概率是由每個位點上,測到的C的數量÷此位點測到的(C+T)數量(即coverage覆蓋度 ),就是freqC
比如有23.5%的位點是0-10%概率被甲基化
由於一個位點,要麼完全甲基化,要麼完全沒有甲基化,所以這個圖應該呈“兩頭高,中間低 ”
然後就是每個CpG覆蓋度的圖,橫座標表示覆蓋度,縱座標表示該覆蓋度的CpG的比例。按理來說,應該是隨著覆蓋度的增加,CpG佔得比例越低,也就是如下圖,只有一個峰
沒有PCR擴增
我這邊實際做的時候有的樣本在右側出現了第二個峰,即疑似收到PCR擴增影響
疑似受到PCR擴增

還有樣本相關度圖
樣本相關度圖
樣本聚類圖
樣本聚類圖
PCA碎石圖(反應PCA各個主成分佔有的資訊量)
PCA碎石圖
PCA圖
PCA圖

接下來就是差異甲基化分析,以及一些註釋,
註釋檔案可以在UCSC Table Browser上面下載
http://genome.ucsc.edu/cgi-bin/hgTables
基因註釋
CpG註釋
差異甲基化分析也就是呼叫一個函式就可以完成,然後也能繪製高低甲基化的CpG在各個染色體上面的分佈
CpG在各個染色體上面的分佈
也能做出來差異甲基化位點所處區域的註釋餅圖
差異甲基化位點所處區域的註釋餅圖
差異甲基化位點所處區域的註釋餅圖

計算差異甲基化區域的時候,用getMethylDiff選取閥值,為了後續方便,difference選0,qvalue選1,把所有的差異甲基化資訊都取出來,方便後續做overlap

# get hyper methylated bases
myDiff25p.hyper=getMethylDiff(myDiff,difference=25,qvalue=0.01,type="hyper")
## get hypo methylated bases
myDiff25p.hypo=getMethylDiff(myDiff,difference=25,qvalue=0.01,type="hypo")
### get all differentially methylated bases
myDiff25p=getMethylDiff(myDiff,difference=25,qvalue=0.01)

最後可以做出來一張表
dist.to.feature表示距離對應轉錄本的距離,此處的甲基化程度是計算了加權平均,即:
[(樣本1的coverage X 樣本1的甲基化程度) + (樣本2的coverage X 樣本2的甲基化程度) + (樣本3的coverage X 樣本3的甲基化程度)] / 全部樣本的coverage
而且p值計算也是跟coverage有關,一般coverage越高,p值越小,因此最後篩選的時候,需要選取一個甲基化差異大小的cutoff值才能篩出較好的結果。

target.row  dist.to.feature feature.name    feature.strand  prom    exon    intron  CpGi    shores  chr start   end strand  pvalue  qvalue  meth.diff   chr.1   start.1 end.1   strand.1    coverage1   numCs1  numTs1  coverage2   numCs2  numTs2  coverage3   numCs3  numTs3  coverage4   numCs4  numTs4  coverage5   numCs5  numTs5
1   -35253  ENSMUST00000193812.1    +   0   0   0   0   0   chr1    3037501 3038000 *   0.130978182587743   0.368847571732731   2.65453921631286    chr1    3037501 3038000 *   589 450 139 766 584 182 467 331 136 11  9   2   619 468 151
2   -27753  ENSMUST00000193812.1    +   0   0   0   0   0   chr1    3045001 3045500 *   0.205092438212255   0.464935651021862   6.46069192334451    chr1    3045001 3045500 *   18  16  2   79  59  20  108 75  33  160 111 49  58  45  13
3   19487   ENSMUST00000082908.1    +   0   0   0   0   0   chr1    3121501 3122000 *   0.473963234904003   0.671055466685657   1.75762409327408    chr1    3121501 3122000 *   405 135 270 381 124 257 399 123 276 28  11  17  243 75  168
4   42487   ENSMUST00000082908.1    +   0   0   0   0   0   chr1    3144501 3145000 *   0.000500421870756168    0.00770695458755044 34.2622950819672    chr1    3144501 3145000 *   20  17  3   10  10  0   21  19  2   30  7   23  10  8   2
5   47987   ENSMUST00000082908.1    +   0   0   0   0   0   chr1    3150001 3150500 *   0.378680177157077   0.614693649805443   3.04931554931556    chr1    3150001 3150500 *   97  69  28  199 138 61  139 90  49  132 90  42  191 129 62
6   23550   ENSMUST00000195335.1    -   0   0   1   0   0   chr1    3344501 3345000 *   0.384323438187859   0.618361247561513   0.888609129944173   chr1    3344501 3345000 *   1546    1188    358 1969    1505    464 1558    1185    373 423 323 100 1463    1100    363
7   -16995  ENSMUST00000192973.1    -   0   0   1   1   1   chr1    3531501 3532000 *   0.321950144928481   0.573991640642017   4.68823747978991    chr1    3531501 3532000 *   104 47  57  10  2   8   542 152 390 630 240 390 575 277 298
8   499 ENSMUST00000070533.4    -   1   1   1   1   1   chr1    3670501 3671000 *   0.0400319523622198  0.181718957464176   1.20385138561919    chr1    3670501 3671000 *   564 21  543 875 37  838 774 27  747 793 17  776 1157    33  1124
9   0   ENSMUST00000070533.4    -   1   1   0   1   1   chr1    3671001 3671500 *   0.235280154140887   0.496408529219308   0.329122071544707   chr1    3671001 3671500 *   1440    34  1406    2316    25  2291    911 6   905 1347    14  1333    1205    23  1182
10  10848   ENSMUST00000193244.1    +   0   0   0   0   0   chr1    3691001 3691500 *   0.0627064903720938  0.24089809047917    12.4561403508772    chr1    3691001 3691500 *   25  4   21  35  14  21  41  7   34  41  1   40  32  12  20
11  -11010  ENSMUST00000194454.1    +   0   0   0   0   0   chr1    3740501 3741000 *   0.554821487335487   0.70972185690377    2.88107656528709    chr1    3740501 3741000 *   67  44  23  104 56  48  75  41  34  79  46  33  105 57  48
12  13993   ENSMUST00000194454.1    +   0   0   0   0   0   chr1    3766001 3766500 *   0.314915932386795   0.56842668204026    5.08922670191673    chr1    3766001 3766500 *   55  31  24  98  53  45  71  35  36  119 59  60  77  39  38
13  -35069  ENSMUST00000157708.2    -   0   0   0   0   0   chr1    3819001 3819500 *   0.492413307552576   0.680133325397598   1.23209686005095    chr1    3819001 3819500 *   806 486 320 675 404 271 874 513 361 31  15  16  607 362 245

edmr 相關統計分析

edmr是同一作者開發的分析軟體包,需要匯入methylKit的結果物件,結果也能做出來一張表

X.seqnames  X.start X.end   X.width X.strand    X.mean.meth.diff    X.num.CpGs  X.num.DMCs  X.DMR.pvalue    X.DMR.qvalue    mean.meth.diff  num.CpGs    num.DMCs    DMR.pvalue  DMR.qvalue  gene_id
chr1    3144713 3144714 2   *   34.2622950819672    1   1   0.0271046807426123  0.0317195618915124  34.2622950819672    1   1   0.0271046807426123  0.0317195618915124  ENSMUSG00000064842
chr1    4440613 4440614 2   *   35.1841820151679    1   1   0.00632050671943642 0.00994437732842689 35.1841820151679    1   1   0.00632050671943642 0.00994437732842689 ENSMUSG00000025902
chr1    4708929 4708930 2   *   42.707672796449 1   1   0.000274707245588477    0.000838466762644724    42.707672796449 1   1   0.000274707245588477    0.000838466762644724    ENSMUSG00000088000
chr1    7335729 7335730 2   *   27.027027027027 1   1   0.0110216692220077  0.0155357692771628  27.027027027027 1   1   0.0110216692220077  0.0155357692771628  ENSMUSG00000097797
chr1    7961488 7961489 2   *   42.5840474620962    1   1   0.000921281050334225    0.00222516022281138 42.5840474620962    1   1   0.000921281050334225    0.00222516022281138 ENSMUSG00000025909
chr1    11266003    11266118    116 *   13.3487227902122    4   1   0.0997579489181488  0.100014616025142   13.3487227902122    4   1   0.0997579489181488  0.100014616025142   ENSMUSG00000048960
chr1    14034006    14034091    86  *   8.68367639534687    2   1   0.0105262967680302  0.0149942009607968  8.68367639534687    2   1   0.0105262967680302  0.0149942009607968  ENSMUSG00000089358
chr1    14482427    14482428    2   *   -38.6322188449848   1   1   0.000588236522659868    0.00153378846466034 -38.6322188449848   1   1   0.000588236522659868    0.00153378846466034 ENSMUSG00000025932
chr1    16881507    16881509    3   *   -15.5697317965766   2   1   0.000998453592196275    0.00236273734671548 -15.5697317965766   2   1   0.000998453592196275    0.00236273734671548 ENSMUSG00000091020
chr1    17947597    17947599    3   *   -8.78092667566352   2   1   0.0219766732675353  0.0268544227000858  -8.78092667566352   2   1   0.0219766732675353  0.0268544227000858  ENSMUSG00000025774
chr1    19283718    19283719    2   *   45.9718969555035    1   1   5.65448522173639e-05    0.000230050227825624    45.9718969555035    1   1   5.65448522173639e-05    0.000230050227825624    ENSMUSG00000025927
chr1    22078450    22078451    2   *   -26.024011299435    1   1   0.0212495775275785  0.0261849151051362  -26.024011299435    1   1   0.0212495775275785  0.0261849151051362  ENSMUSG00000097109
chr1    25933225    25933273    49  *   7.14092324883044    3   1   0.00102059992978815 0.00239574561831797 7.14092324883044    3   1   0.00102059992978815 0.00239574561831797 ENSMUSG00000052558
chr1    33305755    33305756    2   *   31.25   1   1   0.00926663084236262 0.0134958780582083  31.25   1   1   0.00926663084236262 0.0134958780582083  ENSMUSG00000065223

做完註釋以後的表

gene_id X.seqnames  X.start X.end   X.width X.strand    mean.meth.diff  num.CpGs    num.DMCs    DMR.pvalue  DMR.qvalue  external_gene_name  description
ENSMUSG00000000078  chr13   5904326 5904437 112 *   22.3150215456186    2   1   0.00525297729046551 0.00859482460523443 Klf6    Kruppel-like factor 6 [Source:MGI Symbol;Acc:MGI:1346318]
ENSMUSG00000000103  chrY    1825395 1825676 282 *   -18.9428561030264   2   1   0.0582084675097316  0.0603777271684794  Zfy2    zinc finger protein 2, Y-linked [Source:MGI Symbol;Acc:MGI:99213]
ENSMUSG00000000142  chr11   108934666   108934880   215 *   -10.9727382510561   3   1   2.78572193216407e-05    0.000132378412142268    Axin2   axin 2 [Source:MGI Symbol;Acc:MGI:1270862]
ENSMUSG00000000168  chr9    50646206    50646207    2   *   -32.1727549467276   1   1   0.000364708428440863    0.00104527776398199 Dlat    dihydrolipoamide S-acetyltransferase (E2 component of pyruvate dehydrogenase complex) [Source:MGI Symbol;Acc:MGI:2385311]
ENSMUSG00000000184  chr6    127123258   127123259   2   *   -25.9238677410392   1   1   0.0498456787199272  0.0522285017241664  Ccnd2   cyclin D2 [Source:MGI Symbol;Acc:MGI:88314]
ENSMUSG00000000214  chr7    142896455   142896594   140 *   19.7056100872936    4   1   0.00416596289719983 0.00717937684544462 Th  tyrosine hydroxylase [Source:MGI Symbol;Acc:MGI:98735]
ENSMUSG00000000282  chr11   74842332    74843170    839 *   -5.32093188731543   12  3   2.05083914787464e-06    1.44423552040088e-05    Mnt max binding protein [Source:MGI Symbol;Acc:MGI:109150]
ENSMUSG00000000305  chr2    179539423   179539682   260 *   11.883850699595 3   1   0.00253251875295475 0.00488793676306709 Cdh4    cadherin 4 [Source:MGI Symbol;Acc:MGI:99218]
ENSMUSG00000000305  chr2    179735469   179735470   2   *   25.2322880371661    1   1   0.033753452531274   0.0379645150824816  Cdh4    cadherin 4 [Source:MGI Symbol;Acc:MGI:99218]
ENSMUSG00000000320  chr11   70277150    70277152    3   *   -6.357836313962 2   1   0.00184192025034553 0.00379085347298226 Alox12  arachidonate 12-lipoxygenase [Source:MGI Symbol;Acc:MGI:87998]
ENSMUSG00000000532  chr15   101204648   101204783   136 *   12.0686133893593    5   1   0.0404981742974062  0.0439984811837062  Acvr1b  activin A receptor, type 1B [Source:MGI Symbol;Acc:MGI:1338944]
ENSMUSG00000000617  chr11   50859302    50859426    125 *   -15.4346198344394   4   1   0.00203672422170811 0.00408860437637684 Grm6    glutamate receptor, metabotropic 6 [Source:MGI Symbol;Acc:MGI:1351343]
ENSMUSG00000000628  chr6    82824502    82824503    2   *   35.8208955223881    1   1   0.00426469215430179 0.00731716016885342 Hk2 hexokinase 2 [Source:MGI Symbol;Acc:MGI:1315197]
ENSMUSG00000000631  chr11   77844743    77844744    2   *   41.6293810589113    1   1   1.44892991507138e-06    1.06752042272171e-05    Myo18a  myosin XVIIIA [Source:MGI Symbol;Acc:MGI:2667185]
ENSMUSG00000000631  chr11   77850325    77850371    47  *   8.51951370771973    2   1   0.062903399303469   0.0646358691348567  Myo18a  myosin XVIIIA [Source:MGI Symbol;Acc:MGI:2667185]
ENSMUSG00000000673  chr17   83846617    83846720    104 *   13.7457563735597    5   1   0.000305531979488168    0.000914249703127139    Haao    3-hydroxyanthranilate 3,4-dioxygenase [Source:MGI Symbol;Acc:MGI:1349444]
ENSMUSG00000000794  chr3    89678068    89678150    83  *   7.53771410606525    4   1   0.00138514401187844 0.00300786380783757 Kcnn3   potassium intermediate/small conductance calcium-activated channel, subfamily N, member 3 [Source:MGI Symbol;Acc:MGI:2153183]
ENSMUSG00000000794  chr3    89539229    89539352    124 *   9.2559368065671 3   1   0.00160395436916688 0.00337436591041887 Kcnn3   potassium intermediate/small conductance calcium-activated channel, subfamily N, member 3 [Source:MGI Symbol;Acc:MGI:2153183]
ENSMUSG00000000805  chr11   84959957    84960110    154 *   11.5381727783586    7   1   0.00162022659053642 0.00340247584012648 Car4    carbonic anhydrase 4 [Source:MGI Symbol;Acc:MGI:1096574]
ENSMUSG00000000823  chr2    181589875   181589993   119 *   -14.4753926701571   2   1   3.06269548409202e-05    0.000142075040512046    Zfp512b zinc finger protein 512B [Source:MGI Symbol;Acc:MGI:2685478]

主要內容就是那幾個txt表,生成的都是統計模型相關的圖片
表各列說明:seqnames,start,end,width,strand指定差異甲基化區域的位置
mean.meth.diff甲基化差異程度
num.CpGs該區域內CpG的個數
num.DMCs該區域內CpG中差異甲基化的個數
DMR.pvalue和DMR.qvalue是差異甲基化的p值和q值
篩選條件DMC.qvalue = 0.05,num.DMCs = 1,num.CpGs = 1,DMR.methdiff = 5
註釋最近基因名是根據”轉錄起始位點”,所以會有在A基因裡,註釋名稱為B的情況
文獻連結:http://europepmc.org/articles/PMC3622633

Promoter甲基化程度分析

對於實驗人員來說,其實上述兩張表在進行多次overlap之後,再選取P值cut off後,已經只有很少一部分,然而這些交集部分,又往往會出現在exon區,intron區,甚至距離基因十萬八千里的基因間區域,這就要求我們自行編寫程式對promoter區域的CpG甲基化程度進行分析。
首先,取出methlykit預處理過後生成的,每個樣本CpG甲基化程度檔案。(就是那個Bismark比對sam檔案,使用samtools進行sort之後,使用methlykit包中函式processBismarkAln生成的txt檔案,注意,最好先用grep去除JH584304.1這種染色體,方便後續的分析)

chrBase chr base    strand  coverage    freqC   freqT
chr1.3020841    chr1    3020841 F   15  0.00    100.00
chr1.3020876    chr1    3020876 F   55  83.64   16.36
chr1.3020890    chr1    3020890 F   48  87.50   12.50
chr1.3020944    chr1    3020944 F   51  15.69   84.31
chr1.3020970    chr1    3020970 F   10  70.00   30.00
chr1.3020986    chr1    3020986 F   10  80.00   20.00
chr1.3020877    chr1    3020877 R   174 15.52   84.48
chr1.3020891    chr1    3020891 R   167 85.63   14.37
chr1.3020945    chr1    3020945 R   221 56.56   43.44

寫了一些程式碼進行了初步的統計,程式碼參考附錄,取了每個轉錄本轉錄起始位點上游200到下游1500bp範圍內的CpG,根據Coverage計算出加權平均值,結果檔案如下,value值為該轉錄本promoter區甲基化程度。如果為NA,則該promoter區沒有測出CpG。

seqnames    start   end width   strand  value   transUp$tx_id	transUp$tx_name
chr1    3052733 3054433 1701    +   NA  1   ENSMUST00000160944
chr1    3100516 3102216 1701    +   NA  2   ENSMUST00000082908
chr1    3465087 3466787 1701    +   NA  3   ENSMUST00000161581
chr1    4527517 4529217 1701    +   NA  4   ENSMUST00000180019
chr1    4806288 4807988 1701    +   1.95143414634146    5   ENSMUST00000134384
chr1    4806323 4808023 1701    +   1.91739263803681    6   ENSMUST00000027036
chr1    4806392 4808092 1701    +   1.49543174143753    7   ENSMUST00000155020
chr1    4806396 4808096 1701    +   1.45132490636704    8   ENSMUST00000119612
chr1    4806398 4808098 1701    +   1.43786178107607    9   ENSMUST00000137887
chr1    4806411 4808111 1701    +   1.41219329214475    10  ENSMUST00000115529
chr1    4806418 4808118 1701    +   1.41219329214475    11  ENSMUST00000150971
chr1    4806737 4808437 1701    +   1.46171741778319    12  ENSMUST00000131119
chr1    4835405 4837105 1701    +   NA  13  ENSMUST00000141278
chr1    4856314 4858014 1701    +   1.58979249448124    14  ENSMUST00000081551
chr1    4856538 4858238 1701    +   1.73958654120331    15  ENSMUST00000165720
chr1    4969357 4971057 1701    +   NA  16  ENSMUST00000144339

附錄 相關程式碼

MethylKit

MethylKit Part1:

library(methylKit)
setwd("/home/huangml/WangYunpeng/RRBS/result/TilingWindows")
#與4.0相比,依舊沒去除PCR影響,僅僅是視窗化而已(去除PCR影響聚類效果沒有明顯變好,考慮到後面的資料提取,沒有多此一舉)
#讀入檔案
myobj=methRead(list("/home/huangml/WangYunpeng/RRBS/result/P11m414_CpG_filter.txt",
                    "/home/huangml/WangYunpeng/RRBS/result/P11m415_CpG_filter.txt",
                    "/home/huangml/WangYunpeng/RRBS/result/P11m416_CpG_filter.txt",
                    "/home/huangml/WangYunpeng/RRBS/result/P3m76_CpG_filter.txt",
                    "/home/huangml/WangYunpeng/RRBS/result/P3m77_CpG_filter.txt",
                    "/home/huangml/WangYunpeng/RRBS/result/P3m78_CpG_filter.txt"),
                 sample.id=list("P11m414","P11m415","P11m416","P3m76","P3m77","P3m78"),
                 assembly="M.musculus",
                 treatment=c(1,1,1,0,0,0),
                 context="CpG"
)


#畫出樣本甲基化狀態圖
pdf("P11m414_MethylationStats.pdf")
getMethylationStats(myobj[[1]],plot=TRUE,both.strands=FALSE)
dev.off()
pdf("P11m415_MethylationStats.pdf")
getMethylationStats(myobj[[2]],plot=TRUE,both.strands=FALSE)
dev.off()
pdf("P11m416_MethylationStats.pdf")
getMethylationStats(myobj[[3]],plot=TRUE,both.strands=FALSE)
dev.off()
pdf("P3m76_MethylationStats.pdf")
getMethylationStats(myobj[[4]],plot=TRUE,both.strands=FALSE)
dev.off()
pdf("P3m77_MethylationStats.pdf")
getMethylationStats(myobj[[5]],plot=TRUE,both.strands=FALSE)
dev.off()
pdf("P3m78_MethylationStats.pdf")
getMethylationStats(myobj[[6]],plot=TRUE,both.strands=FALSE)
dev.off()

#畫出樣本甲基化覆蓋度圖
pdf("P11m414_CoverageStats.pdf")
getCoverageStats(myobj[[1]],plot=TRUE,both.strands=FALSE)
dev.off()
pdf("P11m415_CoverageStats.pdf")
getCoverageStats(myobj[[2]],plot=TRUE,both.strands=FALSE)
dev.off()
pdf("P11m416_CoverageStats.pdf")
getCoverageStats(myobj[[3]],plot=TRUE,both.strands=FALSE)
dev.off()
pdf("P3m76_CoverageStats.pdf")
getCoverageStats(myobj[[4]],plot=TRUE,both.strands=FALSE)
dev.off()
pdf("P3m77_CoverageStats.pdf")
getCoverageStats(myobj[[5]],plot=TRUE,both.strands=FALSE)
dev.off()
pdf("P3m78_CoverageStats.pdf")
getCoverageStats(myobj[[6]],plot=TRUE,both.strands=FALSE)
dev.off()


#過濾資料,減少PCR影響
filtered.myobj=filterByCoverage(myobj,lo.count=10,lo.perc=NULL, hi.count=NULL,hi.perc=99.9)

#將樣本聚合在一起
meth=unite(myobj, destrand=FALSE)

#畫出樣本關聯度的圖
pdf("P11m-P3mCorrelation.pdf")
getCorrelation(meth,plot=TRUE)
dev.off()

#畫出樣本聚類的圖
pdf("P11m-P3mclusterSamples.pdf")
clusterSamples(meth, dist="correlation", method="ward", plot=TRUE)
dev.off()

#畫出樣本主成分分析圖
pdf("P11m-P3mPCAscreeplot.pdf")
PCASamples(meth, screeplot=TRUE)
dev.off()

#畫出樣本主成分分析後的圖
pdf("P11m-P3mPCAcluster.pdf")
PCASamples(meth)
dev.off()

tiles=tileMethylCounts(myobj,win.size=500,step.size=500)
meth=unite(tiles, destrand=FALSE)

myDiff=calculateDiffMeth(meth,num.cores=12)

# get hyper methylated bases
print("get hyper methylated bases")
myDiff25p.hyper=getMethylDiff(myDiff,difference=25,qvalue=0.01,type="hyper")
#
# get hypo methylated bases
print("get hypo methylated bases")
myDiff25p.hypo=getMethylDiff(myDiff,difference=25,qvalue=0.01,type="hypo")
#
# get all differentially methylated bases
print("get all differentially methylated bases")
myDiff25p=getMethylDiff(myDiff,difference=25,qvalue=0.01)

#visualize the distribution of hypo/hyper-methylated bases/regions per chromosome
print("visualize the distribution of hypo/hyper-methylated bases/regions per chromosome")
diffMethPerChr(myDiff,plot=FALSE,qvalue.cutoff=0.01, meth.cutoff=25)
pdf("P11m-P3mhypo-hyper-methylatedPerChromosome.pdf")
diffMethPerChr(myDiff,plot=TRUE,qvalue.cutoff=0.01, meth.cutoff=25)
dev.off()

save.image("P11m-P3m.RData")

MethylKit Part2:

library(methylKit)
library(genomation)
setwd("/home/huangml/WangYunpeng/RRBS/result/TilingWindows")
load("P11m-P3m.RData")
#先是讀取基因註釋資訊,用這個註釋資訊對差異甲基化區域進行註釋
gene.obj=readTranscriptFeatures("/home/huangml/WangYunpeng/RRBS/annotation.bed.txt")
diffAnn=annotateWithGeneParts(as(myDiff25p,"GRanges"),gene.obj)
#將差異甲基化最近的基因名輸出到檔案裡
write.csv(getAssociationWithTSS(diffAnn),"P11m-P3m.csv")

getTargetAnnotationStats(diffAnn,percentage=TRUE,precedence=TRUE)

pdf("P11m-P3mAnnotation.pdf")
plotTargetAnnotation(diffAnn,precedence=TRUE, main="differential methylation annotation")
dev.off()


#讀取CpG island的註釋資訊,用這些註釋資訊來註釋我們差異甲基化的區域
cpg.obj=readFeatureFlank("/home/huangml/WangYunpeng/RRBS/cpgi.bed.txt",feature.flank.name=c("CpGi","shores"))

diffCpGann=annotateWithFeatureFlank(as(myDiff25p,"GRanges"),
                                    cpg.obj$CpGi,cpg.obj$shores,
                         feature.name="CpGi",flank.name="shores")

pdf("P11m-P3mCpGisland.pdf")
plotTargetAnnotation(diffCpGann,col=c("green","gray","white"), main="differential methylation annotation")
dev.off()

# 根據之前的註釋資訊,可以得到起始子區域/CpG island區域的位置,
# 然後下面的方法可以總結這些區域的甲基化資訊                      
# promoters=regionCounts(myobj,gene.obj$promoters)
# head(promoters[[1]])

getFeatsWithTargetsStats(diffAnn,percentage=TRUE)

save.image("P11m-P3m-diff.RData")

MethylKit Part3:

library(methylKit)
library(genomation)
setwd("/home/huangml/WangYunpeng/RRBS/result/TilingWindows")
load("P11m-P3m-diff.RData")
# promoters=regionCounts(meth,gene.obj$promoters)
# exons=regionCounts(meth,gene.obj$exons)
# introns=regionCounts(meth,gene.obj$introns)
# TSSes=regionCounts(meth,gene.obj$TSSes)

# write.csv(promoters,"promoters_regionCounts.csv")
# write.csv(exons,"exons_regionCounts.csv")
# write.csv(introns,"introns_regionCounts.csv")
# write.csv(TSSes,"TSSes_regionCounts.csv")

write.csv(gene.obj$promoters,"promoters.csv")
write.csv(gene.obj$exons,"exons.csv")
write.csv(gene.obj$introns,"introns.csv")
write.csv(gene.obj$TSSes,"TSSes.csv")

#write.csv(meth,"P11m-P3m-all.csv")

#write.csv(myDiff25p,"P11m-P3m-myDiff25p.csv")

Diffdataframe=getData(myDiff25p)
index=row.names(Diffdataframe)#meth[index]取回原來每個樣本的資訊
regionType=data.frame(getMembers(diffAnn))
merge_result=c(getAssociationWithTSS(diffAnn),regionType,myDiff25p,meth[index])
write.csv(merge_result,"P11m-P3m-result.csv")


library(edmr)
library(GenomicRanges)
library(IRanges)
library(mixtools)
library(data.table)
library(methylKit)
setwd("/home/huangml/WangYunpeng/RRBSedmr/liver")
load("P0m-P3m.RData")
setwd("/home/huangml/WangYunpeng/RRBSedmr/liver/edmr")
myDiffFrame<-getData(myDiff)

# fitting the bimodal normal distribution to CpGs distribution
pdf("P0m-P3mDensityCurves.pdf")
myMixmdl=myDiff.to.mixmdl(myDiffFrame, plot=T, main="Density Curves")
dev.off()

# plot cost function and the determined distance cutoff
pdf("P0m-P3mCostfunction.pdf")
plotCost(myMixmdl, main="cost function")
dev.off()

# calculate all DMRs candidate
mydmr=edmr(myDiff, DMC.qvalue = 0.05,num.DMCs = 1,num.CpGs = 1,DMR.methdiff = 5)

# mydmr=edmr(myDiff, mode=1, ACF=TRUE)
result<-DataFrame(mydmr)


library(TxDb.Mmusculus.UCSC.mm10.ensGene)
mm10genes<- genes(TxDb.Mmusculus.UCSC.mm10.ensGene)
geneID<- nearest(mydmr,mm10genes)
geneName<- mm10genes[geneID]@elementMetadata
result<-DataFrame(mydmr,geneName)
write.table(result,file = "P0m-P3m-Edmr-result.txt",quote = FALSE,sep = "\t",row.names = FALSE)

edmr

edmr Part1:

library(methylKit)
setwd("/home/huangml/WangYunpeng/RRBSedmr/liver/")
#進行差異點計算,從而為edmr提供資料
#讀入檔案
myobj=methRead(list("/home/huangml/WangYunpeng/RRBSnew/result/data/P0m14liver_CpG_filter.txt",
                    "/home/huangml/WangYunpeng/RRBSnew/result/data/P0m15liver_CpG_filter.txt",
                    "/home/huangml/WangYunpeng/RRBSnew/result/data/P0m16liver_CpG_filter.txt",
                    "/home/huangml/WangYunpeng/RRBS/result/P3m76_CpG_filter.txt",
                    "/home/huangml/WangYunpeng/RRBS/result/P3m77_CpG_filter.txt",
                    "/home/huangml/WangYunpeng/RRBS/result/P3m78_CpG_filter.txt"),
                 sample.id=list("P0m14liver","P0m15liver","P0m16liver","P3m76","P3m77","P3m78"),
                 assembly="M.musculus",
                 treatment=c(1,1,1,0,0,0),
                 context="CpG"
)

#將樣本聚合在一起
meth=unite(myobj, destrand=FALSE)

myDiff=calculateDiffMeth(meth,num.cores=12)

# get hyper methylated bases
print("get hyper methylated bases")
myDiff25p.hyper=getMethylDiff(myDiff,difference=25,qvalue=0.01,type="hyper")
#
# get hypo methylated bases
print("get hypo methylated bases")
myDiff25p.hypo=getMethylDiff(myDiff,difference=25,qvalue=0.01,type="hypo")
#
# get all differentially methylated bases
print("get all differentially methylated bases")
myDiff25p=getMethylDiff(myDiff,difference=25,qvalue=0.01)

#visualize the distribution of hypo/hyper-methylated bases/regions per chromosome
print("visualize the distribution of hypo/hyper-methylated bases/regions per chromosome")
diffMethPerChr(myDiff,plot=FALSE,qvalue.cutoff=0.01, meth.cutoff=25)
pdf("P0m-P3mhypo-hyper-methylatedPerChromosome.pdf")
diffMethPerChr(myDiff,plot=TRUE,qvalue.cutoff=0.01, meth.cutoff=25)
dev.off()

save.image("P0m-P3m.RData")

library(methylKit)
library(genomation)

#先是讀取基因註釋資訊,用這個註釋資訊對差異甲基化區域進行註釋
gene.obj=readTranscriptFeatures("/home/huangml/WangYunpeng/RRBS/annotation.bed.txt")
diffAnn=annotateWithGeneParts(as(myDiff25p,"GRanges"),gene.obj)

getTargetAnnotationStats(diffAnn,percentage=TRUE,precedence=TRUE)

pdf("P0m-P3mAnnotation.pdf")
plotTargetAnnotation(diffAnn,precedence=TRUE, main="differential methylation annotation")
dev.off()


#讀取CpG island的註釋資訊,用這些註釋資訊來註釋我們差異甲基化的區域
cpg.obj=readFeatureFlank("/home/huangml/WangYunpeng/RRBS/cpgi.bed.txt",feature.flank.name=c("CpGi","shores"))

diffCpGann=annotateWithFeatureFlank(as(myDiff25p,"GRanges"),
                                    cpg.obj$CpGi,cpg.obj$shores,
                         feature.name="CpGi",flank.name="shores")

pdf("P0m-P3mCpGisland.pdf")
plotTargetAnnotation(diffCpGann,col=c("green","gray","white"), main="differential methylation annotation")
dev.off()

# 根據之前的註釋資訊,可以得到起始子區域/CpG island區域的位置,
# 然後下面的方法可以總結這些區域的甲基化資訊                      
# promoters=regionCounts(myobj,gene.obj$promoters)
# head(promoters[[1]])

getFeatsWithTargetsStats(diffAnn,percentage=TRUE)

library(methylKit)
library(genomation)

setwd("/home/huangml/WangYunpeng/RRBSedmr/liver/Meth0")

# 比起原來只有gene.obj註釋,多了CpG註釋資訊
myDiff0p=getMethylDiff(myDiff,difference=0,qvalue=1)
myDiff0p.hyper=getMethylDiff(myDiff,difference=0,qvalue=1,type="hyper")
myDiff0p.hypo=getMethylDiff(myDiff,difference=0,qvalue=1,type="hypo")

#先是讀取基因註釋資訊,用這個註釋資訊對差異甲基化區域進行註釋
gene.obj=readTranscriptFeatures("/home/huangml/WangYunpeng/RRBS/annotation.bed.txt")
diffAnn=annotateWithGeneParts(as(myDiff0p.hyper,"GRanges"),gene.obj)

#讀取CpG island的註釋資訊,用這些註釋資訊來註釋我們差異甲基化的區域
cpg.obj=readFeatureFlank("/home/huangml/WangYunpeng/RRBS/cpgi.bed.txt",feature.flank.name=c("CpGi","shores"))

diffCpGann=annotateWithFeatureFlank(as(myDiff0p.hyper,"GRanges"),
                                    cpg.obj$CpGi,cpg.obj$shores,
                         feature.name="CpGi",flank.name="shores")

CpGType<-data.frame([email protected])

Diffdataframe<-getData(myDiff0p.hyper)
index<-row.names(Diffdataframe)#meth[index]取回原來每個樣本的資訊
regionType<-data.frame([email protected])
dist.to.TSS<-data.frame(getAssociationWithTSS(diffAnn))
merge_result<-data.frame(dist.to.TSS,regionType,CpGType,myDiff0p.hyper,meth[index])

write.table(merge_result,file = "P0m-P3mhyper-result.txt",quote = FALSE,row.names = FALSE)

# -------------------------------------------------------------------------------

#先是讀取基因註釋資訊,用這個註釋資訊對差異甲基化區域進行註釋
gene.obj=readTranscriptFeatures("/home/huangml/WangYunpeng/RRBS/annotation.bed.txt")
diffAnn=annotateWithGeneParts(as(myDiff0p.hypo,"GRanges"),gene.obj)

#讀取CpG island的註釋資訊,用這些註釋資訊來註釋我們差異甲基化的區域
cpg.obj=readFeatureFlank("/home/huangml/WangYunpeng/RRBS/cpgi.bed.txt",feature.flank.name=c("CpGi","shores"))

diffCpGann=annotateWithFeatureFlank(as(myDiff0p.hypo,"GRanges"),
                                    cpg.obj$CpGi,cpg.obj$shores,
                         feature.name="CpGi",flank.name="shores")

CpGType<-data.frame([email protected])

Diffdataframe<-getData(myDiff0p.hypo)
index<-row.names(Diffdataframe)#meth[index]取回原來每個樣本的資訊
regionType<-data.frame([email protected])
dist.to.TSS<-data.frame(getAssociationWithTSS(diffAnn))
merge_result<-data.frame(dist.to.TSS,regionType,CpGType,myDiff0p.hypo,meth[index])

write.table(merge_result,file = "P0m-P3mhypo-result.txt",quote = FALSE,row.names = FALSE)

edmr Part2:

library(edmr)
library(GenomicRanges)
library(IRanges)
library(mixtools)
library(data.table)
library(methylKit)
setwd("/home/huangml/WangYunpeng/RRBSedmr/liver")
load("P0m-P3m.RData")
setwd("/home/huangml/WangYunpeng/RRBSedmr/liver/edmr")
myDiffFrame<-getData(myDiff)

# fitting the bimodal normal distribution to CpGs distribution
pdf("P0m-P3mDensityCurves.pdf")
myMixmdl=myDiff.to.mixmdl(myDiffFrame, plot=T, main="Density Curves")
dev.off()

# plot cost function and the determined distance cutoff
pdf("P0m-P3mCostfunction.pdf")
plotCost(myMixmdl, main="cost function")
dev.off()

# calculate all DMRs candidate
mydmr=edmr(myDiff, DMC.qvalue = 0.05,num.DMCs = 1,num.CpGs = 1,DMR.methdiff = 5)

# mydmr=edmr(myDiff, mode=1, ACF=TRUE)
result<-DataFrame(mydmr)


library(TxDb.Mmusculus.UCSC.mm10.ensGene)
mm10genes<- genes(TxDb.Mmusculus.UCSC.mm10.ensGene)
geneID<- nearest(mydmr,mm10genes)
geneName<- mm10genes[geneID]@elementMetadata
result<-DataFrame(mydmr,geneName)
write.table(result,file = "P0m-P3m-Edmr-result.txt",quote = FALSE,sep = "\t",row.names = FALSE)

Promoter甲基化程度分析

Promoter Demo版本

library(TxDb.Mmusculus.UCSC.mm10.ensGene)
txdb_mm10 <- TxDb.Mmusculus.UCSC.mm10.ensGene
trans <- as.data.frame(transcripts(txdb_mm10))
# 取出正鏈上的基因
transUp <- trans[trans$strand=="+",]
# 取出Tss上游200到下游1500bp內的範圍
grPromoterUp <- GRanges(seqnames = transUp$seqnames,strand = "+",
                 ranges = IRanges(start = transUp$start-1500,end = transUp$start+200))
# # 取出負鏈上的基因
# transDown <- trans[trans$strand=="-",]
# # 取出Tss上游200到下游1500bp內的範圍
# grPromoterDown <- GRanges(seqnames = transDown$seqnames,strand = "-",
#                       ranges = IRanges(start = transDown$end-200,end = transDown$end+1500))
grPromoterUp$value <- NaN #value儲存每個promoter區域的甲基化程度
wt01Cpg <- read.table("C:/Users/f/Desktop/wt01_CpG_filter.txt",header = TRUE)
grCpG <- GRanges(seqnames = wt01Cpg$chr,
                 ranges = IRanges(start = wt01Cpg$base, width = 1))
grCpG$meth <- wt01Cpg$freqC
grCpG$coverage <- wt01Cpg$coverage
hitObj<- findOverlaps(grPromoterUp,grCpG)
promoterid<- unique([email protected])
for(i in 1:length(promoterid)){
  CpGid<- hitObj[[email protected]==promoterid[i],]@to
  promoterCpG <- grCpG[CpGid]
  grPromoterUp[promoterid[i]]$value<- weighted.mean(promoterCpG$meth,promoterCpG$coverage) #加權平均值
}
grPromoterUp <- as.data.frame(grPromoterUp)
grPromoterUp <- cbind(grPromoterUp,transUp$tx_id,transUp$tx_name)
write.table(grPromoterUp,"wt01UpStrandPromoterMeth.txt",sep = "\t",quote = FALSE,row.names = FALSE)
#Further :把負鏈上的基因也做一遍