
A Guided Introduction to Eukaryotic Genome Annotation

Introduction

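With second-generation sequencing and, more recently, third-generation single-molecule sequencing in wide use, obtaining a high-quality genome is becoming easier and easier, yet genome annotation still faces many challenges. One challenge is gene finding itself: training gene models and choosing gene prediction and annotation software. Another is updating and merging gene annotations produced by different approaches. Neither has a perfect solution yet, but the now widely available RNA-seq data can go a long way toward correcting gene models. Genome annotation is not something that can be done with a few mouse clicks, but many tools now exist to help us annotate genomes better.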

Genome assemblies

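Before annotating a genome, the quality of the assembly needs to be assessed to check whether it is suitable for annotation and can therefore yield reliable results. Three metrics can be used to gauge assembly quality: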
* Scaffold and contig N50s
* Percent gaps
* Percent coverage
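CEGMA offers another way to assess an assembly. CEGMA curates a set of highly conserved single-copy genes (genes that can be assumed to be present in every eukaryotic species), so the completeness of an assembly can be gauged by counting how many of these genes are present in the current assembly version.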
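
The N50 and percent-gap metrics listed above are simple to compute from the assembled sequences. The following is a minimal sketch, assuming the scaffolds are already loaded as plain strings and that gaps are written as runs of N; the function names are illustrative and do not come from any particular tool.

```python
import re


def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def split_into_contigs(scaffold):
    """Split a scaffold into contigs at every run of gap characters (N/n)."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


def percent_gaps(scaffolds):
    """Percentage of all scaffold bases that are gap characters (N/n)."""
    total = sum(len(s) for s in scaffolds)
    if total == 0:
        return 0.0
    gaps = sum(s.upper().count("N") for s in scaffolds)
    return 100.0 * gaps / total


# Toy assembly: three scaffolds, the first containing a 4-base gap.
scaffolds = ["ACGTNNNNACGTACGT", "ACGTAC", "AC"]
contigs = [c for s in scaffolds for c in split_into_contigs(s)]

scaffold_n50 = n50([len(s) for s in scaffolds])
contig_n50 = n50([len(c) for c in contigs])
gap_percent = percent_gaps(scaffolds)
```

A scaffold N50 that is much larger than the contig N50, together with a high gap percentage, suggests that the scaffolds are held together by many unresolved gaps.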

Genome annotation

A brief aside: how gene annotation relates to gene prediction

  • Gene predictors find the single most likely coding sequence (CDS) of a gene and do not report untranslated regions (UTRs) or alternatively spliced variants. Gene prediction is therefore a somewhat misleading term. A more accurate description might be ‘canonical CDS prediction’.

  • Gene annotations, conversely, generally include UTRs, alternative splice isoforms and have attributes such as evidence trails.
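
To make the distinction concrete, the sketch below splits the exons of one transcript into 5′ UTR, CDS and 3′ UTR segments. It is only an illustration: the function name, the coordinate convention (1-based, inclusive, as in GFF) and the example coordinates are assumptions, not taken from any tool mentioned here. A gene predictor would typically report only the `cds` part, whereas an annotation records all three parts for every isoform.

```python
def split_transcript(exons, cds_start, cds_end, strand="+"):
    """Split exons into UTR and CDS segments.

    exons: list of (start, end) pairs, 1-based inclusive genomic coordinates.
    cds_start, cds_end: lowest and highest genomic coordinate of the CDS.
    """
    upstream, coding, downstream = [], [], []
    for start, end in sorted(exons):
        if start < cds_start:
            # Part of the exon lies before the CDS on the genome.
            upstream.append((start, min(end, cds_start - 1)))
        if end >= cds_start and start <= cds_end:
            # Part of the exon overlaps the CDS.
            coding.append((max(start, cds_start), min(end, cds_end)))
        if end > cds_end:
            # Part of the exon lies after the CDS on the genome.
            downstream.append((max(start, cds_end + 1), end))
    if strand == "-":
        # On the minus strand the genomic "after" side is the 5' end.
        upstream, downstream = downstream, upstream
    return {
        "five_prime_utr": upstream,
        "cds": coding,
        "three_prime_utr": downstream,
    }


# Hypothetical three-exon transcript on the plus strand.
parts = split_transcript([(100, 200), (300, 400), (500, 600)], 150, 550)
```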

The figure shows a genome annotation and its associated evidence. Terms in parentheses are the names of commonly used software tools for assembling particular types of evidence. Note that the gene annotation (shown in blue) captures both alternatively spliced forms and the 5′ and 3′ UTRs suggested by the evidence. By contrast, the gene prediction that is generated by SNAP (shown in green) is incorrect as regards the gene’s 5′ exons and start-of-translation site and, like most gene-predictors, it predicts only a single transcript with no UTR.

Returning to genome annotation: the first step is to identify and mask repeats.

Repeat identification

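Eukaryotic genomes contain large amounts of repetitive sequence; in wheat, repeats make up as much as 85% of the genome. Repeats fall into two groups: low-complexity sequences and transposable elements, such as viruses, long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs). Repeats are often nested inside other repeats, and their sheer number makes annotation considerably harder. Left unmasked, repeats can seed millions of spurious BLAST alignments, producing false evidence for gene annotations. In addition, many transposon open reading frames (ORFs) look like true host genes to gene predictors, causing portions of transposon ORFs to be added as additional exons to gene predictions, completely corrupting the final gene annotations. Identifying repeats as the first step is therefore very important.

Repeats are poorly conserved, so accurate detection usually requires building a repeat library for the species in question. Repeat identification tools fall into two classes by approach: homology-based tools, which identify known repeat elements, and de novo tools, which can identify new ones. For comparison against a repeat database, the most commonly used program is RepeatMasker. Repeats are generally masked either by replacing them with N (hard masking) or by converting them to lowercase acgt (soft masking).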
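
The two masking styles can be illustrated with a short sketch. It assumes the repeat intervals have already been identified (for example by a repeat finder) and are given as 0-based, end-exclusive pairs; the function names are illustrative and this is not how RepeatMasker itself is implemented.

```python
def soft_mask(sequence, repeats):
    """Lowercase the bases inside each repeat interval.

    repeats: iterable of (start, end) pairs, 0-based and end-exclusive.
    """
    bases = list(sequence)
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower()
    return "".join(bases)


def hard_mask(soft_masked):
    """Replace every lowercase (soft-masked) base with N."""
    return "".join("N" if base.islower() else base for base in soft_masked)


def masked_fraction(masked):
    """Fraction of bases that are lowercase or N.

    Note that assembly gaps written as N are counted as masked too.
    """
    if not masked:
        return 0.0
    hidden = sum(1 for base in masked if base.islower() or base == "N")
    return hidden / len(masked)


# Toy example: one repeat covering positions 2..4 of a 10-base sequence.
soft = soft_mask("ACGTACGTAC", [(2, 5)])
hard = hard_mask(soft)
fraction = masked_fraction(soft)
```

Soft masking keeps the underlying bases, so downstream tools that understand lowercase masking can still extend alignments through a repeat, whereas hard masking discards that sequence information.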

Evidence alignment

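Once repeats have been identified, the next step is to align proteins, ESTs and RNA-seq data to the genome. Many tools are available for this step; rather than describing them one by one, they are listed in the table below. Choose whichever suits your situation.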

| Software | Description | Refs |
| --- | --- | --- |
| BLAST | Suite of rapid database search tools that uses Karlin–Altschul statistics | 31,32,33 |
| BLAT | Faster than BLAST but has fewer features | 42 |
| Splign | Splice-aware tool designed to align cDNA to genomic sequence | 44 |
| Spidey | mRNA-to-DNA alignment tool that is designed to account for possible paralogous alignments | 45 |
| Prosplign | Global alignment tool that uses BLAST hits to align in a splice-site- and paralogy-aware manner | 140 |
| sim4 | Splice-aware cDNA-to-DNA alignment tool | 46 |
| Exonerate | Splice-site-aware alignment algorithm that can align both protein and EST sequences to a genome | 43 |
| Cufflinks | Extension to TopHat. Uses TopHat outputs to create transcript models | 54 |
| Trinity | High-quality de novo transcriptome assembler | 50 |
| MapSplice | Spliced aligner that does not use a model of canonical splice junctions | 141 |
| TopHat | Transcriptome aligner that aligns RNA sequencing (RNA-seq) reads to a reference genome using Bowtie to identify splice sites | 51 |
| GSNAP | A fast short-read aligner | 52 |

For the numbered references, see the original paper.

Ab initio gene prediction and evidence-driven gene prediction

Commonly used software for this step:

| Software | Description | Refs |
| --- | --- | --- |
| Augustus | Accepts expressed sequence tag (EST)-based and protein-based evidence hints. Highly accurate | 66,67 |
| mGene | Support vector machine (SVM)-based discriminative gene predictor. Directly predicts 5′ and 3′ untranslated regions (UTRs) and poly(A) sites | 133 |
| SNAP | Accepts EST and protein-based evidence hints. Easily trained | 62 |
| FGENESH | Training files are constructed by SoftBerry and supplied to users | 72 |
| Geneid | First published in 1992 and revised in 2000. Accepts external hints from EST and protein-based evidence | 134 |
| Genemark | A self-training gene finder | 69,70 |
| Twinscan | Extension of the popular Genscan algorithm that can use homology between two genomes to guide gene prediction | 71 |
| GAZE | Highly configurable gene predictor | 74 |
| GenomeScan | Extension of the popular Genscan algorithm that can use BLASTX searches to guide gene prediction | 135 |
| Conrad | Discriminative gene predictor that uses conditional random fields (CRFs) | 136 |
| Contrast | Discriminative gene predictor that uses both SVMs and CRFs | 137 |
| CRAIG | Discriminative gene predictor that uses CRFs | 138 |
| Gnomon | Hidden Markov model (HMM) tool based on Genscan that uses EST and protein alignments to guide gene prediction | 73 |
| GeneSeqer | A tool for identifying potential exon–intron structure in precursor mRNAs (pre-mRNAs) by splice site prediction and spliced alignment | 139 |

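After running several different predictors, their results need to be integrated: redundant models are removed, alternatively spliced isoforms are identified, and so on.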
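
As a rough illustration of the integration step, the sketch below removes exact duplicate gene models produced by different predictors and then groups overlapping models on the same strand into loci, so that distinct structures within a locus can be reviewed as candidate isoforms or conflicting predictions. This is a deliberately simple sketch, not the algorithm used by any of the combiner tools; the `GeneModel` record and the function names are hypothetical, and the coordinates are assumed to be 1-based and inclusive.

```python
from collections import namedtuple

# Hypothetical record for one predicted gene model; exons are (start, end)
# pairs, 1-based and inclusive.
GeneModel = namedtuple("GeneModel", ["source", "seqid", "strand", "exons"])


def deduplicate(models):
    """Collapse models with identical sequence, strand and exon structure.

    Returns a dict mapping each unique structure to the list of predictors
    (sources) that produced it.
    """
    merged = {}
    for model in models:
        key = (model.seqid, model.strand, tuple(sorted(model.exons)))
        merged.setdefault(key, []).append(model.source)
    return merged


def span(key):
    """Genomic extent (lowest start, highest end) of a structure key."""
    exons = key[2]
    return min(s for s, _ in exons), max(e for _, e in exons)


def group_into_loci(structures):
    """Cluster structures whose spans overlap on the same sequence and strand."""
    loci = []
    for key in sorted(structures, key=lambda k: (k[0], k[1], span(k))):
        start, end = span(key)
        last = loci[-1] if loci else None
        if (last is not None and last["seqid"] == key[0]
                and last["strand"] == key[1] and start <= last["end"]):
            # Overlaps the current locus: extend it and add the structure.
            last["end"] = max(last["end"], end)
            last["members"].append(key)
        else:
            loci.append({"seqid": key[0], "strand": key[1],
                         "start": start, "end": end, "members": [key]})
    return loci


# Made-up predictions for illustration only.
models = [
    GeneModel("Augustus", "chr1", "+", [(100, 200), (300, 400)]),
    GeneModel("SNAP", "chr1", "+", [(100, 200), (300, 400)]),
    GeneModel("SNAP", "chr1", "+", [(100, 200), (350, 400)]),
    GeneModel("Augustus", "chr1", "-", [(150, 250)]),
    GeneModel("GeneMark", "chr1", "+", [(1000, 1200)]),
]
unique = deduplicate(models)
loci = group_into_loci(unique)
```

Structures supported by several predictors are natural candidates to keep, while a locus containing more than one distinct structure needs a decision on whether the differences reflect real alternative splicing or a prediction error.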

The annotation phase

This step generally requires manual curation and correction, although software can also help. Three commonly used tools are JIGSAW, EVidenceModeler (EVM) and GLEAN (and its successor, Evigan). In a recent gene prediction competition, the combiners nearly always improved on the underlying gene prediction models, and JIGSAW, EVM and Evigan performed similarly.

Other tools instead use the evidence to correct gene models during prediction itself. This is the process used by PASA, Gnomon and MAKER.

To be continued.