轉錄組分析---Hisat2+StringTie+Ballgown使用

阿新 • • 發佈：2018-11-09

轉錄組分析---Hisat2+StringTie+Ballgown使用

(2016-10-10 08:14:45)

標籤：

1.Hisat2建立基因組索引：
First, using the python scripts included in the HISAT2 package, extract splice-site and exon information from the gene annotation file: $ extract_splice_sites.py gemome.gtf >genome.ss#得到剪接位點資訊

$ extract_exons.py genome.gtf >genome.exon#得到外顯子資訊 Second, build a HISAT2 index: $ hisat2-build --ss genome.ss --exon genome.exon genome.fa genome 備註extract_splice_sites.py 和 extract_exons.py 在hisat2軟體包中涵蓋了，這兩步不是必須的，只是為了發現剪下位點，也可以直接： $ hisat2-build

genome.fa genome 2. 利用hisat2比對到基因組： hisat2 -p 8 --dta -x genome -1 file1_1.fastq.gz -2 file1_2.fastq.gz -S file1.sam hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 file2_1.fastq.gz -2 file2_2.fastq.gz -S file2.sam 備註：--dta:輸出轉錄組型的報告檔案 -x：基因組索引 -S : 輸出sam檔案 -p: 執行緒數其他引數： Input: -q query input files are FASTQ .fq/.fastq (default) --qseq query input files are in Illumina's qseq format -f query input files are (multi-)FASTA .fa/.mfa -r query input files are raw one-sequence-per-line -c , , are sequences themselves, not files -s/--skip skip the first reads/pairs in the input (none) -u/--upto stop after first reads/pairs (no limit) -5/--trim5 trim bases from 5'/left end of reads (0) -3/--trim3 trim bases from 3'/right end of reads (0) --phred33 qualities are Phred+33 (default) --phred64 qualities are Phred+64 --int-quals qualities encoded as space-delimited integers Alignment: -N max # mismatches in seed alignment; can be 0 or 1 (0) -L length of seed substrings; must be >3, <32 (22) -i interval between seed substrings w/r/t read len (S,1,1.15) --n-ceil func for max # non-A/C/G/Ts permitted in aln (L,0,0.15) --dpad include extra ref chars on sides of DP table (15) --gbar disallow gaps within nucs of read extremes (4) --ignore-quals treat all quality values as 30 on Phred scale (off) --nofw do not align forward (original) version of read (off) --norc do not align reverse-complement version of read (off) Spliced Alignment: --pen-cansplice penalty for a canonical splice site (0) --pen-noncansplice penalty for a non-canonical splice site (12) --pen-canintronlen penalty for long introns (G,-8,1) with canonical splice sites --pen-noncanintronlen penalty for long introns (G,-8,1) with noncanonical splice sites --min-intronlen minimum intron length (20) --max-intronlen maximum intron length (500000) --known-splicesite-infile provide a list of known splice sites --novel-splicesite-outfile report a list of splice sites --novel-splicesite-infile provide a list of novel splice sites --no-temp-splicesite disable the use of splice sites found --no-spliced-alignment disable spliced alignment --rna-strandness Specify strand-specific information (unstranded) --tmo Reports only those alignments within known transcriptome --dta Reports alignments tailored for transcript assemblers --dta-cufflinks Reports alignments tailored specifically for cufflinks Scoring: --ma match bonus (0 for --end-to-end, 2 for --local) --mp , max and min penalties for mismatch; lower qual = lower penalty <2,6> --sp , max and min penalties for soft-clipping; lower qual = lower penalty <1,2> --np penalty for non-A/C/G/Ts in read/ref (1) --rdg , read gap open, extend penalties (5,3) --rfg , reference gap open, extend penalties (5,3) --score-min min acceptable alignment score w/r/t read length (L,0.0,-0.2) Reporting: (default) look for multiple alignments, report best, with MAPQ OR -k report up to alns per read; MAPQ not meaningful OR -a/--all report all alignments; very slow, MAPQ not meaningful Effort: -D give up extending after failed extends in a row (15) -R for reads w/ repetitive seeds, try sets of seeds (2) Paired-end: --fr/--rf/--ff -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr) --no-mixed suppress unpaired alignments for paired reads --no-discordant suppress discordant alignments for paired reads Output: -t/--time print wall-clock time taken by search phases --un write unpaired reads that didn't align to --al write unpaired reads that aligned at least once to --un-conc write pairs that didn't align concordantly to --al-conc write pairs that aligned concordantly at least once to (Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g. --un-gz , to gzip compress output, or add '-bz2' to bzip2 compress output.) --quiet print nothing to stderr except serious errors --met-file send metrics to file at (off) --met-stderr send metrics to stderr (off) --met report internal counters & metrics every secs (1) --no-head supppress header lines, i.e. lines starting with @ --no-sq supppress @SQ header lines --rg-id set read group id, reflected in @RG line and RG:Z: opt field --rg add ("lab:value") to @RG line of SAM header. Note: @RG line only printed when --rg-id is set. --omit-sec-seq put '*' in SEQ and QUAL fields for secondary alignments. Performance: -o/--offrate override offrate of index; must be >= index's offrate -p/--threads number of alignment threads to launch (1) --reorder force SAM output order to match order of input reads --mm use memory-mapped I/O for index; many 'bowtie's can share Other: --qc-filter filter out reads that are bad according to QSEQ filter --seed seed for random number generator (0) --non-deterministic seed rand. gen. arbitrarily instead of using read attributes --version print version information and quit -h/--help print this usage message 3. 將sam檔案sort並轉化成bam:

$ samtools sort [email protected] 8 -o file1.bam file1.sam $ samtools sort [email protected] 8 -o file2.bam file2.sam 4. 組裝轉錄本： $ stringtie -p 8 -G genome.gtf -o file1.gtf –l file1 file1.bam $ stringtie -p 8 -G genome.gtf -o file2.gtf –l file2 file2.bam lncRNA (-f 0.01 -a 10 -j 1 -c 0.01)
其中： -G reference annotation to use for guiding the assembly process (GTF/GFF3) -l name prefix for output transcripts (default: STRG) -f minimum isoform fraction (default: 0.1) -m minimum assembled transcript length (default: 200) -o output path/file name for the assembled transcripts GTF (default: stdout) -a minimum anchor length for junctions (default: 10) -j minimum junction coverage (default: 1) -t disable trimming of predicted transcripts based on coverage (default: coverage trimming is enabled) -c minimum reads per bp coverage to consider for transcript assembly (default: 2.5) -v verbose (log bundle processing details) -g gap between read mappings triggering a new bundle (default: 50) -C output a file with reference transcripts that are covered by reads -M fraction of bundle allowed to be covered by multi-hit reads (default:0.95) -p number of threads (CPUs) to use (default: 1) -A gene abundance estimation output file -B enable output of Ballgown table files which will be created in the same directory as the output GTF (requires -G, -o recommended) -b enable output of Ballgown table files but these files will be created under the directory path given as -e only estimate the abundance of given reference transcripts (requires -G) -x do not assemble any transcripts on the given reference sequence(s) -h print this usage message and exit 5. 合併所有樣本的gtf檔案 $ stringtie --merge -p 8 -G genome.gtf -o stringtie_merged.gtf mergelist.txt 6. 新轉錄本的註釋（lncRNA必備，普通轉錄組忽略） gffcompare –r genomegtf –G –o merged stringtie_merged.gtf 備註：gffcompare 是獨立軟體，下載地址http://ccb.jhu.edu/software/stringtie/gff.shtml，結果如下; = Predicted transcript has exactly the same introns as the reference transcript c Predicted transcript is contained within the reference transcript j Predicted transcript is a potential novel isoform that shares at least one splice junction with a reference transcript e Predicted single-exon transcript overlaps a reference exon plus at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment i Predicted transcript falls entirely within a reference intron o Exon of predicted transcript overlaps a reference transcript p Predicted transcript lies within 2 kb of a reference transcript (possible polymerase run-on fragment) r Predicted transcript has >50% of its bases overlapping a soft-masked (repetitive) reference sequence u Predicted transcript is intergenic in comparison with known reference transcripts x Exon of predicted transcript overlaps reference but lies on the opposite strand s Intron of predicted transcript overlaps a reference intron on the opposite strand 7. 轉錄本定量和下游ballgown軟體原始檔案構建： $ stringtie –e –B -p 8 -G stringtie_merged.gtf -o ballgown/file1/file1.gtf file1.bam $ stringtie –e –B -p 8 -G stringtie_merged.gtf -o ballgown/file2/file2.gtf file2.bam 8. Ballgown差異表達分析： >library(ballgown) >library(RSkittleBrewer) >library(genefilter) >library(dplyr) >library(devtools) >pheno_data = read.csv("geuvadis_phenodata.csv")#讀取表型資料 >bg_chrX = ballgown(dataDir = "ballgown", samplePattern = "file", pData=pheno_data)#讀取表達量 >bg_chrX_filt = subset(bg_chrX,"rowVars(texpr(bg_chrX)) >1",genomesubset=TRUE)#過濾低表達量基因 >results_transcripts = stattest(bg_chrX_filt,feature="transcript",covariate="sex",adjustvars =c("population"), getFC=TRUE, meas="FPKM")#差異表達分析，運用的是一般線性模型，比較組sex，影響因素：population >results_genes = stattest(bg_chrX_filt, feature="gene",covariate="sex", adjustvars = c("population"), getFC=TRUE,meas="FPKM")#基因差異表達 >results_transcripts=data.frame(geneNames=ballgown::geneNames(bg_chrX_filt),geneIDs=ballgown::geneIDs(bg_chrX_filt), results_transcripts)#增加基因名字，id >results_transcripts = arrange(results_transcripts,pval)#按pval sort >results_genes = arrange(results_genes,pval) >write.csv(results_transcripts, "chrX_transcript_results.csv", row.names=FALSE) >write.csv(results_genes, "chrX_gene_results.csv", row.names=FALSE) >subset(results_transcripts,results_transcripts$qval<0.05) >subset(results_genes,results_genes$qval<0.05) 9. 結果視覺化： >tropical= c('darkorange', 'dodgerblue', 'hotpink', 'limegreen', 'yellow') >palette(tropical) >fpkm = texpr(bg_chrX,meas="FPKM") >fpkm = log2(fpkm+1) >boxplot(fpkm,col=as.numeric(pheno_data$sex),las=2,ylab='log2(FPKM+1)') >ballgown::transcriptNames(bg_chrX)[12] ## 12 ## "NM_012227" >ballgown::geneNames(bg_chrX)[12] ## 12 ## "GTPBP6" >plot(fpkm[12,] ~ pheno_data$sex, border=c(1,2), main=paste(ballgown::geneNames(bg_chrX)[12],' : ', ballgown::transcriptNames(bg_chrX)[12]),pch=19, xlab="Sex", ylab='log2(FPKM+1)') >points(fpkm[12,] ~ jitter(as.numeric(pheno_data$sex)), col=as.numeric(pheno_data$sex)) >plotTranscripts(ballgown::geneIDs(bg_chrX)[1729], bg_chrX, main=c('Gene XIST in sample ERR188234'), sample=c('ERR188234')) >plotMeans('MSTRG.56', bg_chrX_filt,groupvar="sex",legend=FALSE)

轉錄組分析---Hisat2+StringTie+Ballgown使用