轉錄組分析流程
阿新 • • 發佈:2019-01-01
文章目錄
- 分析流程概述
- 下載測試資料
- 資料質量控制
- Tophat –> Cufflink –> Cuffdiff
- Subread -> featureCounts -> DESeq2
- 流程程式碼
- DESeq2差異分析
- 讀取資料,提取表達矩陣
- 表型資料讀取,樣本分組資訊
- 視覺化樣本間的相似性
- 構建DESeqDataSet(dds)物件
- 使用rlog-transformed 與 正常 區別
- 熱圖呈現樣品間的距離
- 差異分析及結果提取
- edgeR差異分析
- HISAT2 ->StringTie -> Ballgown
- 結果視覺化
分析流程概述
參考文獻:hppRNA—a Snakemake-based handy parameter-free pipeline for RNA-Seq analysis of numerous samples
六種主要流程示意圖:
下載測試資料
wget ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz
如果文章中的資料不是fastq格式,而是給GSExxxx,這是需要下載SRA資料
需要安裝sratoolkit,SRR_Acc_List.txt檔案儲存SRR號碼
cat SRR_Acc_List.txt | while read id; do (prefetch ${id} &);done
# 批量轉換sra到fq格式
ls /public/project/RNA/airway/sra/* | while read id; do ( nohup fastq-dump --gzip --split-3 -O ./ ${id} & ); done
下載的資料:
[sunchengquan 15:45:09 /data/Data_base/test_tmp/RNA_seq_practice/chrX_data/samples]
$ ll
總用量 1.8G
-rwxr-xr-x 1 sunchengquan sunchengquan 88M 1月 15 2016 ERR188044_chrX_1.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 88M 1月 15 2016 ERR188044_chrX_2.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 84M 1月 15 2016 ERR188104_chrX_1.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 85M 1月 15 2016 ERR188104_chrX_2.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 108M 1月 15 2016 ERR188234_chrX_1.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 109M 1月 15 2016 ERR188234_chrX_2.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 59M 1月 15 2016 ERR188245_chrX_1.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 60M 1月 15 2016 ERR188245_chrX_2.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 65M 1月 15 2016 ERR188257_chrX_1.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 66M 1月 15 2016 ERR188257_chrX_2.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 39M 1月 15 2016 ERR188273_chrX_1.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 39M 1月 15 2016 ERR188273_chrX_2.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 88M 1月 15 2016 ERR188337_chrX_1.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 88M 1月 15 2016 ERR188337_chrX_2.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 63M 1月 15 2016 ERR188383_chrX_1.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 63M 1月 15 2016 ERR188383_chrX_2.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 86M 1月 15 2016 ERR188401_chrX_1.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 87M 1月 15 2016 ERR188401_chrX_2.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 56M 1月 15 2016 ERR188428_chrX_1.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 56M 1月 15 2016 ERR188428_chrX_2.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 70M 1月 15 2016 ERR188454_chrX_1.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 70M 1月 15 2016 ERR188454_chrX_2.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 75M 1月 15 2016 ERR204916_chrX_1.fastq.gz
-rwxr-xr-x 1 sunchengquan sunchengquan 75M 1月 15 2016 ERR204916_chrX_2.fastq.gz
資料質量控制
reads質量評估軟體:fastqc生成質控報告,multiqc將各個樣本的質控報告整合為一個。
reads質量控制軟體:prinseq,cutadapt,trimmomatic,trim_galore
#!/usr/bin/env bash
set -e
settings(){
samples=/data/Data_base/test_tmp/RNA_seq_practice/chrX_data/samples
if test -w $samples;then
mkdir -p {$samples/qc,$samples/cleandata/qc}
else
echo "沒有寫入許可權"
fi
}
thread(){
tmp_fifofile="/tmp/$$.fifo" #指令碼執行的當前程序ID號作為檔名
mkfifo "$tmp_fifofile"
exec 6<>"$tmp_fifofile" #將fd6指向fifo型別
rm $tmp_fifofile
thread_num=$1 # 此處定義執行緒數
for((i=0;i<$thread_num;i++));do
echo
done >&6 # 事實上就是在fd6中放置了$thread個回車符
$2 6 $3
exec 6>&- # 關閉df6
}
qc(){
source activate RNA
printf "[%s %s %s %s %s %s]::資料質量評估\n" $(echo `date`)
start=$(date +%s.%N)
list=$(find $2 -name *q\.gz)
file_num=`ls -l $2/qc|wc -l`
if [ $file_num -lt 2 ];then
for i in $list;do
read -u$1
{
name=`awk -v each=$i 'BEGIN{split(each,arr,"/");l=length(arr);print arr[l]}' `
fastqc $i -o $2/qc &>> $2/qc/qc.log
printf "[%s %s %s %s %s %s]::%s質量評估完成\n" $(echo `date`) $name
echo >&$1
} &
done && wait
multiqc -d $2/qc -o $2/qc
dur=$(echo "($(date +%s.%N) - $start)/60" | bc -l)
printf "[%s %s %s %s %s %s]::資料質量評估耗時%.2f分鐘\n" $(echo `date`) $dur
fi
source deactivate RNA
}
trim_qc(){
printf "[%s %s %s %s %s %s]::資料質量控制\n" $(echo `date`)
source activate RNA
start=$(date +%s.%N)
dir=$samples/cleandata
find $samples -name *1?f*q?gz|sort >$dir/1
find $samples -name *2?f*q?gz|sort >$dir/2
paste -d ":" $dir/1 $dir/2 > $dir/config && rm $dir/1 $dir/2
file_num=`ls -l $dir|wc -l`
if [ $file_num -lt 3 ];then
for id in `cat $dir/config`;do
read -u$1
fq1=$(echo $id|cut -d":" -f1)
fq2=$(echo $id |cut -d":" -f2)
base_name=$(basename $fq1)
name=`awk -v each=$base_name 'BEGIN{split(each,arr,"_");print arr[1]}' `
{
trim_galore -q 25 --phred33 --length 25 --stringency 3 --paired -o $dir $fq1 $fq2 &> $dir/trim.log
printf "[%s %s %s %s %s %s]::%s質量控制完成\n" $(echo `date`) $name
echo >&$1
} &
done && wait
dur=$(echo "($(date +%s.%N) - $start)/60" | bc -l)
printf "[%s %s %s %s %s %s]::資料質量控制耗時%.2f分鐘\n" $(echo `date`) $dur
fi
source deactivate RNA
}
settings
thread 6 qc $samples
thread 3 trim_qc
thread 6 qc $samples/cleandata
Tophat –> Cufflink –> Cuffdiff
流程圖:
手動安裝相關軟體
我們已經使用bioconda安裝相關的軟體,現在手動安裝一下,本流程所需要的軟體
下載並安裝比對軟體bowtie2
cd ~/local/app
curl -OL http://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.2.4/bowtie2-2.2.4-linux-x86_64.zip
unzip bowtie2-2.2.4-linux-x86_64.zip
把比對軟體以及相關程式連結到bin資料夾
ln -s ~/local/app/bowtie2-2.2.4/bowtie2 ~/bin/
ln -s ~/local/app/bowtie2-2.2.4/bowtie2-align* ~/bin/
ln -s ~/local/app/bowtie2-2.2.4/bowtie2-build ~/bin/
安裝tophat2
cd ~/local/app/
curl -OL http://ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz
tar zxvf tophat-2.0.13.Linux_x86_64.tar.gz
cd ~/bin/
vi tophat
#!/usr/bin/env bash
python2 ~/local/app/tophat-2.1.1.Linux_x86_64/tophat [email protected]
chmod 755 tophat
儲存退出
#註釋檔案
cd /data/Data_base/test_tmp/RNA_seq_practice/chrX_data/genes
curl -L ftp://ftp.ensembl.org/pub/release-77/gtf/homo_sapiens/Homo_sapiens.GRCh38.77.gtf.gz > hg38.gtf.gz
gunzip *.gz
cat hg38.gtf | awk ' $1 =="X" { print $0 }' > chr_X.gtf
安裝cufflinks
cd ~/local/app
curl -OL http://cole-trapnell-lab.github.io/cufflinks/assets/downloads/cufflinks-2.1.1.Linux_x86_64.tar.gz
tar zxvf cufflinks-2.1.1.Linux_x86_64.tar.gz
ln -fs ~/local/app/cufflinks-2.1.1.Linux_x86_64/cufflinks ~/bin
ln -fs ~/local/app/cufflinks-2.1.1.Linux_x86_64/cuffdiff ~/bin
ln -fs ~/local/app/cufflinks-2.1.1.Linux_x86_64/gtf_to_sam ~/bin
ln -fs ~/local/app/cufflinks-2.1.1.Linux_x86_64/cuffcompare ~/bin
cd ~/bin
vi cuffmerge
#!/usr/bin/env bash
python2 ~/local/app/cufflinks-2.1.1.Linux_x86_64/cuffmerge [email protected]
chmod 755 cuffmerge
流程程式碼
#!/usr/bin/env bash
set -ue
settings(){
samples=/data/Data_base/test_tmp/RNA_seq_practice/chrX_data/samples
index=/data/Data_base/test_tmp/RNA_seq_practice/chrX_data/genome/index
output=/data/Data_base/test_tmp/RNA_seq_practice/chrX_data/tophat+cuff
if test -w $(dirname $output) && test -w $(dirname index);then
mkdir -p {$index/bowtie,$output/1_tophat,$output/2_cufflinks,$output/3_cuffdiff}
fi
cuffdiff=$output/3_cuffdiff
indexes=$index/bowtie/chrX
genome=/data/Data_base/test_tmp/RNA_seq_practice/chrX_data/genome/chrX.fa
gene=/data/Data_base/test_tmp/RNA_seq_practice/chrX_data/genes/chrX.gtf
}
thread(){
tmp_fifofile="/tmp/$$.fifo"
mkfifo "$tmp_fifofile"
exec 6<>"$tmp_fifofile"
rm $tmp_fifofile
thread_num=$1
for((i=0;i<$thread_num;i++));do
echo
done >&6
$2 6
exec 6>&-
}
index(){
printf "[%s %s %s %s %s %s]::建立索引bowtie2-build\n" $(echo `date`)
start=$(date +%s.%N)
file_num=`ls -l $index/bowtie|wc -l`
source activate RNA
base_name=$(basename $genome)
name=`awk -v each=$base_name 'BEGIN{split(each,arr,".");print arr[1]}' `
if [ $file_num -lt 2 ];then
bowtie2-build -f $genome $index/bowtie/$name &> $index/bowtie/index.log
ln -s $genome $index/bowtie/$basename
dur=$(echo "($(date +%s.%N) - $start)/60" | bc -l)
printf "[%s %s %s %s %s %s]::建立索引耗時%.2f分鐘\n" $(echo `date`) $dur
fi
source deactivate RNA
}
mapping(){
printf "[%s %s %s %s %s %s]::與參考基因組比對tophat\n" $(echo `date`)
start=$(date +%s.%N)
dir=$output/1_tophat
find $samples/cleandata -name *1?f*q.gz|sort > $dir/1
find $samples/cleandata -name *2?f*q.gz|sort > $dir/2
paste -d ":" $dir/1 $dir/2 > $dir/config && rm $dir/1 $dir/2
file_num=`ls -l $dir|wc -l`
source activate RNA
if [ $file_num -lt 3 ];then
for id in $(cat $dir/config);do
fq1=$(echo $id|cut -d":" -f1)
fq2=$(echo $id |cut -d":" -f2)
name=`awk -v each=$(basename $fq1) 'BEGIN{split(each,arr,"_");print arr[1]}' `
read -u$1
{
tophat -p 8 -G $gene -o $dir/$name $indexes $fq1 $fq2 &>> $dir/mapping.log
printf "[%s %s %s %s %s %s]::%s比對完成\n" $(echo `date`) $name
echo >&$1
} &
done && wait
dur=$(echo "($(date +%s.%N) - $start)/60" | bc -l)
printf "[%s %s %s %s %s %s]::比對耗時%.2f分鐘\n" $(echo `date`) $dur
fi
source deactivate RNA
}
assemble(){
printf "[%s %s %s %s %s %s]::轉錄本組裝和定量cufflinks\n" $(echo `date`)
start=$(date +%s.%N)
dir=$output/2_cufflinks
file_num=`ls -l $dir|wc -l`
source activate RNA
if [ $file_num -lt 3 ];then
for id in $(cat $output/1_tophat/config);do
fq1=$(echo $id|cut -d":" -f1)
name=`awk -v each=$(basename $fq1) 'BEGIN{split(each,arr,"_");print arr[1]}' `
read -u$1
{
cufflinks -p 8 -g $gene -o $dir/$name $output/1_tophat/$name/accepted_hits.bam &> $dir/$name.log
printf "[%s %s %s %s %s %s]::%s轉錄本組裝完成\n" $(echo `date`) $name
echo >&$1
} &
done && wait
dur=$(echo "($(date +%s.%N) - $start)/60" | bc -l)
printf "[%s %s %s %s %s %s]::轉錄本組裝耗時%.2f分鐘\n" $(echo `date`) $dur
fi
source deactivate RNA
}
merge(){
printf "[%s %s %s %s %s %s]::轉錄本合併cuffmerge\n" $(echo `date`)
start=$(date +%s.%N)
dir=$output/2_cufflinks
find $dir -name *transcripts?gtf|sort > $dir/assemblies.txt
source activate RNA
if [ ! -d $dir/merged_asm ];then
cuffmerge -p 8 -o $dir/merged_asm -g $gene -s $genome $dir/assemblies.txt &> $dir/cuffmerge.log
dur=$(echo "($(date +%s.%N) - $start)/60" | bc -l)
printf "[%s %s %s %s %s %s]::轉錄本合併耗時%.2f分鐘\n" $(echo `date`) $dur
fi
source deactivate RNA
}
diff(){
printf "[%s %s %s %s %s %s]::差異分析cuffdiff\n" $(echo `date`)
start=$(date +%s.%N)
dir=$output/3_cuffdiff
S1=$output/1_tophat/ERR188245/accepted_hits.bam;S2=$output/1_tophat/ERR188428/accepted_hits.bam;S3=$output/1_tophat/ERR188337/accepted_hits.bam
S4=$output/1_tophat/ERR204916/accepted_hits.bam;S5=$output/1_tophat/ERR188234/accepted_hits.bam;S6=$output/1_tophat/ERR188273/accepted_hits.bam
S7=$output/1_tophat/ERR188401/accepted_hits.bam;S8=$output/1_tophat/ERR188257/accepted_hits.bam;S9=$output/1_tophat/ERR188383/accepted_hits.bam
S10=$output/1_tophat/ERR188454/accepted_hits.bam;S11=$output/1_tophat/ERR188104/accepted_hits.bam;S12=$output/1_tophat/ERR188044/accepted_hits.bam
file_num=`ls -l $dir|wc -l`
source activate RNA
if [ $file_num -lt 3 ];then
cuffdiff -p 8 -b $genome -o $dir -L Female,Male -u $output/2_cufflinks/merged_asm/merged.gtf $S1,$S2,$S3,$S4,$S5,$S6 $S7,$S8,$S9,$S10,$S11,$S12 &> $dir/cuffdiff.log
dur=$(echo "($(date +%s.%N) - $start)/60" | bc -l)
printf "[%s %s %s %s %s %s]::差異分析cuffdiff耗時%.2f分鐘\n" $(echo `date`) $dur
fi
source deactivate RNA
}
expression_matrix(){
dir=$output/3_cuffdiff
expr=$dir/gene_exp.diff
##篩選出下調的基因(log2_fold_change < -2 & pvalue < 0.001)
awk '{if(($10<-2)&&($11<0.001))print $3"\t"$8"\t"$9"\t"$10}' $dir/gene_exp.diff | grep -v 'inf' > $dir/down.txt
## 篩選出上調的基因(log2_fold_change > 2 & pvalue < 0.001
awk '{if(($10>2)&&($11<0.001))print $3"\t"$8"\t"$9"\t"$10}' $dir/gene_exp.diff | grep -v 'inf' > $dir/up.txt
}
settings
index
thread 4 mapping
thread 4 assemble
merge
diff
Subread -> featureCounts -> DESeq2
流程程式碼
#!/usr/bin/env bash
set -ue
settings(){
samples=/data/Data_base/test_tmp/RNA_seq_practice/chrX_data/samples
index=/data/Data_base/test_tmp/RNA_seq_practice/chrX_data/genome/index
output=/data/Data_base/test_tmp/RNA_seq_practice/chrX_data/subread+featurecounts
if test -w $(dirname $output) && test -w $(dirname index);then
mkdir -p {$index/subread,$output/1_subjunc,$output/2_featurecounts}
else
echo "沒有寫入許可權"
fi
genome=/data/Data_base/test_tmp/RNA_seq_practice/chrX_data/genome/chrX.fa
gene=/data/Data_base/test_tmp/RNA_seq_practice/chrX_data/genes/chr_X.gtf
}
thread(){
tmp_fifofile="/tmp/$$.fifo"
mkfifo "$tmp_fifofile"
exec 6<>"$tmp_fifofile"
rm $tmp_fifofile
thread_num=$1
for((i=0;i<$thread_num;i++));do
echo
done >&6
$2 6
exec 6>&-
}
index(){
printf "[%s %s %s %s %s %s]::建立索引subread-buildindex\n" $(echo `date`)
start=$(date +%s.%N)
file_num=`ls -l $index/subread|wc -l`
source activate RNA
base_name=$(basename $genome)
name=`awk -v each=$base_name 'BEGIN{split(each,arr,".");print arr[1]}' `
if [ $file_num -lt 2 ];then
subread-buildindex -o $index/subread/$name $genome &> $index/subread/index.log
fi
dur=$(echo "($(date +%s.%N) - $start)/60" | bc -l)
printf "[%s %s %s %s %s %s]::建立索引耗時%.2f分鐘\n" $(echo `date`) $dur
source deactivate RNA
}
mapping(){
printf "[%s %s %s %s %s %s]::與參考基因組比對subjunc\n" $(echo `date`)
start=$(date +%s.%N)
dir=$output/1_subjunc
find $samples/cleandata -name *1?f*q.gz|sort > $dir/1
find $samples/cleandata -name *2?f*q.gz|sort > $dir/2
paste -d ":" $dir/1 $dir/2 > $dir/config && rm $dir/1 $dir/2
file_num=`ls -l $dir|wc -l`
index_prefix=`awk -v each=$(basename $genome) 'BEGIN{split(each,arr,".");print arr[1]}' `
source activate RNA
if [ $file_num -lt 3 ];then
for id in $(cat $dir/config);do
fq1=$(echo $id|cut -d":" -f1)
fq2=$(echo $id |cut -d":" -f2