轉錄組分析---Hisat2+StringTie+Ballgown使用
阿新 • • 發佈:2018-11-09
轉錄組分析---Hisat2+StringTie+Ballgown使用
(2016-10-10 08:14:45) 轉載▼標籤: 生物資訊學轉錄組 |
First, using the python scripts included in the HISAT2 package, extract splice-site and exon information from the gene annotation file: $ extract_splice_sites.py gemome.gtf >genome.ss#得到剪接位點資訊
其中: -G reference annotation to use for guiding the assembly process (GTF/GFF3) -l name prefix for output transcripts (default: STRG) -f minimum isoform fraction (default: 0.1) -m minimum assembled transcript length (default: 200) -o output path/file name for the assembled transcripts GTF (default: stdout) -a minimum anchor length for junctions (default: 10) -j minimum junction coverage (default: 1) -t disable trimming of predicted transcripts based on coverage (default: coverage trimming is enabled) -c minimum reads per bp coverage to consider for transcript assembly (default: 2.5) -v verbose (log bundle processing details) -g gap between read mappings triggering a new bundle (default: 50) -C output a file with reference transcripts that are covered by reads -M fraction of bundle allowed to be covered by multi-hit reads (default:0.95) -p number of threads (CPUs) to use (default: 1) -A gene abundance estimation output file -B enable output of Ballgown table files which will be created in the same directory as the output GTF (requires -G, -o recommended) -b enable output of Ballgown table files but these files will be created under the directory path given as -e only estimate the abundance of given reference transcripts (requires -G) -x do not assemble any transcripts on the given reference sequence(s) -h print this usage message and exit 5. 合併所有樣本的gtf檔案 $ stringtie --merge -p 8 -G genome.gtf -o stringtie_merged.gtf mergelist.txt 6. 新轉錄本的註釋(lncRNA必備,普通轉錄組忽略) gffcompare –r genomegtf –G –o merged stringtie_merged.gtf 備註:gffcompare 是獨立軟體,下載地址http://ccb.jhu.edu/software/stringtie/gff.shtml,結果如下; = Predicted transcript has exactly the same introns as the reference transcript c Predicted transcript is contained within the reference transcript j Predicted transcript is a potential novel isoform that shares at least one splice junction with a reference transcript e Predicted single-exon transcript overlaps a reference exon plus at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment i Predicted transcript falls entirely within a reference intron o Exon of predicted transcript overlaps a reference transcript p Predicted transcript lies within 2 kb of a reference transcript (possible polymerase run-on fragment) r Predicted transcript has >50% of its bases overlapping a soft-masked (repetitive) reference sequence u Predicted transcript is intergenic in comparison with known reference transcripts x Exon of predicted transcript overlaps reference but lies on the opposite strand s Intron of predicted transcript overlaps a reference intron on the opposite strand 7. 轉錄本定量和下游ballgown軟體原始檔案構建: $ stringtie –e –B -p 8 -G stringtie_merged.gtf -o ballgown/file1/file1.gtf file1.bam $ stringtie –e –B -p 8 -G stringtie_merged.gtf -o ballgown/file2/file2.gtf file2.bam 8. Ballgown差異表達分析: >library(ballgown) >library(RSkittleBrewer) >library(genefilter) >library(dplyr) >library(devtools) >pheno_data = read.csv("geuvadis_phenodata.csv")#讀取表型資料 >bg_chrX = ballgown(dataDir = "ballgown", samplePattern = "file", pData=pheno_data)#讀取表達量 >bg_chrX_filt = subset(bg_chrX,"rowVars(texpr(bg_chrX)) >1",genomesubset=TRUE)#過濾低表達量基因 >results_transcripts = stattest(bg_chrX_filt,feature="transcript",covariate="sex",adjustvars =c("population"), getFC=TRUE, meas="FPKM")#差異表達分析,運用的是一般線性模型,比較組sex,影響因素:population >results_genes = stattest(bg_chrX_filt, feature="gene",covariate="sex", adjustvars = c("population"), getFC=TRUE,meas="FPKM")#基因差異表達 >results_transcripts=data.frame(geneNames=ballgown::geneNames(bg_chrX_filt),geneIDs=ballgown::geneIDs(bg_chrX_filt), results_transcripts)#增加基因名字,id >results_transcripts = arrange(results_transcripts,pval)#按pval sort >results_genes = arrange(results_genes,pval) >write.csv(results_transcripts, "chrX_transcript_results.csv", row.names=FALSE) >write.csv(results_genes, "chrX_gene_results.csv", row.names=FALSE) >subset(results_transcripts,results_transcripts$qval<0.05) >subset(results_genes,results_genes$qval<0.05) 9. 結果視覺化: >tropical= c('darkorange', 'dodgerblue', 'hotpink', 'limegreen', 'yellow') >palette(tropical) >fpkm = texpr(bg_chrX,meas="FPKM") >fpkm = log2(fpkm+1) >boxplot(fpkm,col=as.numeric(pheno_data$sex),las=2,ylab='log2(FPKM+1)') >ballgown::transcriptNames(bg_chrX)[12] ## 12 ## "NM_012227" >ballgown::geneNames(bg_chrX)[12] ## 12 ## "GTPBP6" >plot(fpkm[12,] ~ pheno_data$sex, border=c(1,2), main=paste(ballgown::geneNames(bg_chrX)[12],' : ', ballgown::transcriptNames(bg_chrX)[12]),pch=19, xlab="Sex", ylab='log2(FPKM+1)') >points(fpkm[12,] ~ jitter(as.numeric(pheno_data$sex)), col=as.numeric(pheno_data$sex)) >plotTranscripts(ballgown::geneIDs(bg_chrX)[1729], bg_chrX, main=c('Gene XIST in sample ERR188234'), sample=c('ERR188234')) >plotMeans('MSTRG.56', bg_chrX_filt,groupvar="sex",legend=FALSE)