Using the EggNOG functional annotation database online and locally

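Introduction to COG

COG (Clusters of Orthologous Groups of proteins): the proteins in each COG are assumed to derive from a single ancestral protein and are therefore either orthologs or paralogs.
COGs are delineated by comparing, one by one, all proteins encoded in complete genomes against each other. For a protein from a given genome, this comparison yields its most similar protein in each of the other genomes (which is why complete genomes are needed to define COGs), and each of those proteins is then considered in turn. If a set of reciprocal best matches is found among these proteins (or a subset of them), those reciprocal best hits form a COG. Members of a COG are thus more similar to each other than to any other protein in the genomes compared.

Homepage: https://www.ncbi.nlm.nih.gov/COG/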

For the COG one-letter codes, see http://www.sbg.bio.ic.ac.uk/~phunkee/html/old/COG_classes.html

COG one letter code descriptions

INFORMATION STORAGE AND PROCESSING

  • [J] Translation, ribosomal structure and biogenesis
  • [A] RNA processing and modification
  • [K] Transcription
  • [L] Replication, recombination and repair
  • [B] Chromatin structure and dynamics

CELLULAR PROCESSES AND SIGNALING

  • [D] Cell cycle control, cell division, chromosome partitioning
  • [Y] Nuclear structure
  • [V] Defense mechanisms
  • [T] Signal transduction mechanisms
  • [M] Cell wall/membrane/envelope biogenesis
  • [N] Cell motility
  • [Z] Cytoskeleton
  • [W] Extracellular structures
  • [U] Intracellular trafficking, secretion, and vesicular transport
  • [O] Posttranslational modification, protein turnover, chaperones

METABOLISM

  • [C] Energy production and conversion
  • [G] Carbohydrate transport and metabolism
  • [E] Amino acid transport and metabolism
  • [F] Nucleotide transport and metabolism
  • [H] Coenzyme transport and metabolism
  • [I] Lipid transport and metabolism
  • [P] Inorganic ion transport and metabolism
  • [Q] Secondary metabolites biosynthesis, transport and catabolism

POORLY CHARACTERIZED

  • [R] General function prediction only
  • [S] Function unknown
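The one-letter codes above are easier to work with as a lookup table. The following is a minimal sketch: the dictionary is copied from the list above, while the helper name `describe_cog` is made up for illustration. It assumes that a category field may hold several letters in a row (for example "KL"), which should be checked against your own output files.

```python
# One-letter COG functional categories, copied from the list above.
COG_CATEGORIES = {
    # Information storage and processing
    "J": "Translation, ribosomal structure and biogenesis",
    "A": "RNA processing and modification",
    "K": "Transcription",
    "L": "Replication, recombination and repair",
    "B": "Chromatin structure and dynamics",
    # Cellular processes and signaling
    "D": "Cell cycle control, cell division, chromosome partitioning",
    "Y": "Nuclear structure",
    "V": "Defense mechanisms",
    "T": "Signal transduction mechanisms",
    "M": "Cell wall/membrane/envelope biogenesis",
    "N": "Cell motility",
    "Z": "Cytoskeleton",
    "W": "Extracellular structures",
    "U": "Intracellular trafficking, secretion, and vesicular transport",
    "O": "Posttranslational modification, protein turnover, chaperones",
    # Metabolism
    "C": "Energy production and conversion",
    "G": "Carbohydrate transport and metabolism",
    "E": "Amino acid transport and metabolism",
    "F": "Nucleotide transport and metabolism",
    "H": "Coenzyme transport and metabolism",
    "I": "Lipid transport and metabolism",
    "P": "Inorganic ion transport and metabolism",
    "Q": "Secondary metabolites biosynthesis, transport and catabolism",
    # Poorly characterized
    "R": "General function prediction only",
    "S": "Function unknown",
}


def describe_cog(letters):
    """Expand a string of COG letters into (letter, description) pairs.

    Non-letter characters (commas, spaces) are ignored; letters that are
    not in the table are reported as "Unknown category".
    """
    return [
        (c, COG_CATEGORIES.get(c, "Unknown category"))
        for c in letters
        if c.isalpha()
    ]


for letter, description in describe_cog("KL"):
    print(letter, description)
```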

Introduction to eggNOG

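Principles and interpretation of eggNOG annotation

Unknown sequences are functionally annotated by comparison with known proteins;

The number of proteins assigned to a given eggNOG ID, and whether that ID is present or absent, can be used to infer whether a particular metabolic pathway exists;

Each eggNOG ID represents one class of proteins. A multiple sequence alignment of the query sequence with the proteins of the matched eggNOG ID can identify conserved sites and be used to analyze evolutionary relationships.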
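The presence/absence reasoning described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are made up, the query-to-ID assignments are taken as already extracted from an annotation result, and the IDs in the example are placeholders rather than a real pathway definition.

```python
from collections import Counter


def og_counts(assignments):
    """Count how many query proteins were assigned to each eggNOG ID.

    `assignments` is an iterable of (query_name, eggnog_id) pairs.
    """
    return Counter(og for _query, og in assignments)


def pathway_status(counts, required_ogs):
    """Map each eggNOG ID required by a pathway to True (present) or False."""
    return {og: counts.get(og, 0) > 0 for og in required_ogs}


# Placeholder data: query names and IDs are hypothetical.
assignments = [("q1", "OG_A"), ("q2", "OG_A"), ("q3", "OG_B")]
counts = og_counts(assignments)
status = pathway_status(counts, ["OG_A", "OG_B", "OG_C"])
missing = [og for og, present in status.items() if not present]
print("protein counts:", dict(counts))
print("missing from pathway:", missing)
```

A pathway would be called complete only when `missing` is empty; how strict to be about partially present pathways is a separate decision.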

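eggNOG-mapper online version

eggNOG-mapper is the dedicated tool for searching sequences against the eggNOG database and annotating them.

Online analysis with eggNOG-mapper takes only three steps of mouse clicks.

1. Open the online tool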

http://eggnogdb.embl.de/#/app/emapper

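2. Set the parameters

The main tasks are to select the protein sequence file and enter an email address. The other settings can usually be left at their defaults.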

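Note on the choice of method: DIAMOND is relatively slow with few sequences but relatively fast with many. HMMER has a higher success rate for distantly related sequences, but takes a long time on large datasets; the online tool accepts at most 5,000 sequences per job.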
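Before uploading, it can help to check whether a FASTA file fits within the 5,000-sequence limit mentioned above. The sketch below counts header lines (those starting with ">"); the function names are made up for illustration, and the limit is simply the figure quoted in this article.

```python
ONLINE_LIMIT = 5000  # per-job sequence limit quoted above


def count_fasta_records(lines):
    """Count FASTA records, i.e. lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def fits_online(lines, limit=ONLINE_LIMIT):
    """Return True if the number of records does not exceed the limit."""
    return count_fasta_records(lines) <= limit


# Works on any iterable of lines, including an open file handle:
#     with open("proteins.faa") as handle:   # hypothetical file name
#         print(fits_online(handle))
example = [">p1\n", "MKV\n", "LLA\n", ">p2\n", "MST\n"]
print(count_fasta_records(example), fits_online(example))
```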

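3. Submit the job

Click the Run button to submit the job. A page with the job status and a list of citations then appears. Note that online analysis both limits the number of sequences and requires queuing; when many people are using it, the wait can be long.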

eggNOG-mapper local version

Installing with conda is recommended, since it takes care of dependencies and environment variables

conda install eggnog-mapper

Manual download and installation

cd ~/software
wget https://github.com/jhcepas/eggnog-mapper/archive/1.0.3.tar.gz
tar xvzf 1.0.3.tar.gz
cd eggnog-mapper-1.0.3

Software description

less README.md

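eggNOG-mapper uses the eggNOG database to functionally annotate novel gene and protein sequences. It is commonly applied to gene sets from new genomes, transcriptomes, and metagenomes. Function prediction based on orthology is considered more accurate than a traditional homology search, because it avoids transferring annotations directly from paralogs (gene duplication has a high chance of leading to functional divergence).

Documentation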

https://github.com/jhcepas/eggnog-mapper/wiki

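Installation notes

The software depends on Python 2.7, wget, HMMER3, and DIAMOND.

Disk space requirements:

Memory requirements:

HMMER3 annotation is very fast when enough memory is available. The memory needed is:

  • Eukaryote database euk: ~90 GB
  • Bacteria database bact: ~32 GB
  • Archaea database arch: ~10 GB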

Software installation

Besides the conda and wget routes above, the software can also be obtained with git

git clone https://github.com/jhcepas/eggnog-mapper.git

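Database download

  • eggNOG provides HMM databases for 107 taxonomic levels, three optimized databases (eukaryotes euk, bacteria bact, and archaea arch), and one virus-specific database, viruses
  • Each of the three optimized databases includes all the corresponding HMMs.
  • For the 107 data subsets, see http://eggnogdb.embl.de/#/app/downloads

Show the program help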

python eggnog-mapper/download_eggnog_data.py -h

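Download the four commonly used databases into the data directory.
--data_dir sets the target directory, -y answers yes to prompts automatically, and -f forces the download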

mkdir -p eggnog
python eggnog-mapper/download_eggnog_data.py --data_dir eggnog -y -f euk bact arch viruses

Basic usage

cd eggnog-mapper

HMMER method

Search the bacterial database locally (disk-based search on the optimized bacterial database)
-i is the input, --output the output file prefix, -d the database, and --data_dir the database location

python emapper.py -i test/polb.fa --output polb_bact -d bact --data_dir ~/data/db/eggnog

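DIAMOND method

-m selects the search method; here it is set to diamond, while the default is hmmer. DIAMOND only shows its speed advantage with more than about a thousand sequences. With few sequences it feels very slow, and its results are also less accurate than HMMER's, especially for distantly related sequences.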

python emapper.py -i test/polb.fa --output diamond_bact_ -d bact --data_dir ~/data/db/eggnog -m diamond

This takes a long time, more than an hour

Interpreting the results

https://github.com/jhcepas/eggnog-mapper/wiki/Results-Interpretation

The results consist of three files

polb_bact.emapper.annotations
polb_bact.emapper.hmm_hits
polb_bact.emapper.seed_orthologs

The annotations file is the one to focus on; it contains the GO, KEGG, and COG descriptions for each gene

[project_name].emapper.hmm_hits file: list of HMM search hits

For each query sequence, a list of significant hits to eggNOG Orthologous Groups (OGs) is reported. Each line in the file represents a hit, where evalue, bit-score, query-coverage and the sequence coordinates of the match are reported. If multiple hits exist for a given query, results are sorted by e-value.

[project_name].emapper.seed_orthologs file: list of best matches

Each line in the file provides the best match of each query within the best Orthologous Group (OG) reported in the [project].hmm_hits file, obtained by running PHMMER against all sequences within the best OG. The seed ortholog is used to fetch fine-grained orthology relationships from eggNOG. If using the diamond search mode, seed orthologs are directly obtained from the best matching sequences by running DIAMOND against the whole eggNOG protein space.

[project_name].emapper.annotations file: the consolidated annotation results, and the file that matters most.
This file provides the final annotations of each query. It is tab-delimited, with the following 13 columns:

  1. query_name: query sequence name
  2. seed_eggNOG_ortholog: best protein match in eggNOG
  3. seed_ortholog_evalue: best protein match (e-value)
  4. seed_ortholog_score: best protein match (bit-score)
  5. predicted_gene_name: Predicted gene name for query sequences
  6. GO_terms: Comma delimited list of predicted Gene Ontology terms
  7. KEGG_KO: Comma delimited list of predicted KEGG KOs
  8. BiGG_Reactions: Comma delimited list of predicted BiGG metabolic reactions
  9. Annotation_tax_scope: The taxonomic scope used to annotate this query sequence
  10. Matching_OGs: Comma delimited list of matching eggNOG Orthologous Groups
  11. best_OG|evalue|score: Best matching Orthologous Groups (only in HMM mode)
  12. COG functional categories: COG functional category inferred from best matching OG
  13. eggNOG_HMM_model_annotation: eggNOG functional description inferred from best matching OG
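The 13-column layout above can be read with Python's standard csv module. This is a minimal sketch, assuming the columns appear in exactly the order listed and that any header or statistics lines start with "#". The reader function name and the sample row are made up for illustration; the sample values are not real output.

```python
import csv
import io

# Column names in the order listed above.
COLUMNS = [
    "query_name", "seed_eggNOG_ortholog", "seed_ortholog_evalue",
    "seed_ortholog_score", "predicted_gene_name", "GO_terms", "KEGG_KO",
    "BiGG_Reactions", "Annotation_tax_scope", "Matching_OGs",
    "best_OG|evalue|score", "COG_functional_categories",
    "eggNOG_HMM_model_annotation",
]


def read_annotations(handle):
    """Yield one dict per annotated query, skipping '#' comment lines."""
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # blank line
        # Pad short rows so that every column name is present.
        row += [""] * (len(COLUMNS) - len(row))
        yield dict(zip(COLUMNS, row))


# Hypothetical one-row sample, for illustration only.
SAMPLE = (
    "# comment line\n"
    "q1\t511145.b0060\t1e-50\t200.0\tpolB\tGO:0003677,GO:0006260\tK02336"
    "\t\tbactNOG[38]\t05C3Z@bactNOG\tNA|NA|NA\tL\tDNA polymerase\n"
)

for record in read_annotations(io.StringIO(SAMPLE)):
    go_terms = [t for t in record["GO_terms"].split(",") if t]
    print(record["query_name"], record["COG_functional_categories"], go_terms)
```

For a real file, pass an open handle instead of the `io.StringIO` sample, for example `open("polb_bact.emapper.annotations")`.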

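Advanced usage

https://github.com/jhcepas/eggnog-mapper/wiki/Advanced-usage-and-tips

Speeding up with large memory and multithreading

--usemem loads the entire database into memory so that searches run from memory, --cpu sets the number of threads, and --override forces existing results to be overwritten; without it the run aborts when result files already exist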

python emapper.py -i test/polb.fa --output polb_bact --database bact --data_dir ~/data/db/eggnog --usemem --cpu 10 --override
# Total time: 11.8659 secs

Server mode with shared memory

First load the bacterial database. Only one database can be specified per server

python emapper.py --database bact --data_dir ~/data/db/eggnog --cpu 10 --servermode

Loading the data takes some time

Waiting for server to become ready... localhost 51500

Wait until the following is shown:

Server ready listening at localhost:51500 and using 10 CPU cores
Use `emapper.py -d bact:localhost:51500 (...)` to search against this server
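When the server is started from a script, the client command should not be launched until the port is accepting connections. The sketch below polls a TCP port until a connection succeeds. It is an assumption-laden illustration: the function name is made up, the host and port are the ones shown in the log above, and it assumes that a bare connect followed by an immediate close does not disturb the server, which is worth verifying on your own setup. A successful connect only shows that the port is open, not that the database has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout=600.0, interval=2.0):
    """Poll host:port until a TCP connection succeeds or `timeout` elapses.

    Returns True as soon as a connection is accepted, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or unreachable): wait and retry.
            time.sleep(interval)
    return False


# Example call, using the address from the server log above:
#     if wait_for_port("localhost", 51500):
#         ...launch emapper.py with -d bact:localhost:51500...
```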

Then start the analysis command

--usemem loads the entire database into memory, and --cpu sets the number of threads

python emapper.py -i test/polb.fa --output polb_bact --database bact:localhost:51500 --data_dir ~/data/db/eggnog --usemem --cpu 10 --override
# Total time: 9.77332 secs

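Large-scale mode for metagenomes

https://github.com/jhcepas/eggnog-mapper/wiki/Setting-up-large-scale-analyses

For annotating large genomes and metagenomic data (>100M proteins).

The analysis has two main steps: homology search, which is compute-intensive, and functional annotation, which is I/O-intensive. Splitting the data improves efficiency.

Homology search

1. Split the sequences

Prepare the file and convert it to single-line FASTA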

cp /mnt/bai/yongxin/test/meta1809/temp/23prokka_all/mg.faa input_file0.faa 
format_fasta_1line.pl -i input_file0.faa -o input_file.faa

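Split into files of 2 million lines each, i.e. 1 million sequences. For this test, 10,000 lines (5,000 sequences) are used.

# -l splits by line count; -a sets the suffix width to 3 (default 2); -d uses numeric suffixes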
split -l 10000 -a 3 -d input_file.faa input_file.chunk_
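Splitting by line count only works because the FASTA was first converted to one sequence line per record. As an alternative sketch, the split can be done per record in Python, which also handles multi-line FASTA. The function names are made up for illustration; the output names follow the same prefix plus three-digit numeric suffix as the split command above.

```python
def split_fasta(lines, records_per_chunk):
    """Group FASTA lines into chunks of at most `records_per_chunk` records."""
    chunk, n_records = [], 0
    for line in lines:
        if line.startswith(">"):
            if n_records == records_per_chunk:
                yield chunk
                chunk, n_records = [], 0
            n_records += 1
        chunk.append(line)
    if chunk:
        yield chunk


def write_chunks(in_path, prefix, records_per_chunk):
    """Write chunks to files named prefix000, prefix001, ... and return the paths."""
    paths = []
    with open(in_path) as handle:
        for i, chunk in enumerate(split_fasta(handle, records_per_chunk)):
            path = "%s%03d" % (prefix, i)
            with open(path, "w") as out:
                out.writelines(chunk)
            paths.append(path)
    return paths


# Example call matching the test settings above (5,000 sequences per file):
#     write_chunks("input_file.faa", "input_file.chunk_", 5000)
```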

2. Parallel search

Method 1. Generate commands for a cluster

for f in *.chunk_*; do
echo ./emapper.py -m diamond --no_annot --no_file_comments --cpu 16 -i $f -o $f; 
done

Method 2. Parallel execution

time parallel -j 3 --xapply \
  'python emapper.py -m diamond --no_annot --no_file_comments --data_dir ~/data/db/eggnog --cpu 16 -i {1} -o {1}' \
 ::: input_file.chunk*

Elapsed time: real 14m45.579s
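Where GNU parallel is not available, the same fan-out can be sketched with Python's standard library. This is a substitute technique, not the method used above: a thread pool launches one emapper.py process per chunk, using only the flags from the parallel command. The function names are made up for illustration, and it assumes that `python emapper.py` resolves correctly from the current working directory.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def search_cmd(chunk, data_dir, cpu=16):
    """Build the DIAMOND search command for one chunk (no annotation step)."""
    return [
        "python", "emapper.py", "-m", "diamond",
        "--no_annot", "--no_file_comments",
        "--data_dir", os.path.expanduser(data_dir),
        "--cpu", str(cpu),
        "-i", chunk, "-o", chunk,
    ]


def run_all(chunks, data_dir, jobs=3, runner=subprocess.call):
    """Run one search per chunk, at most `jobs` at a time.

    Returns a dict mapping each chunk to the exit code reported by `runner`.
    """
    cmds = [search_cmd(chunk, data_dir) for chunk in chunks]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        codes = list(pool.map(runner, cmds))
    return dict(zip(chunks, codes))


# Example call (chunk names as produced by the split step):
#     import glob
#     results = run_all(sorted(glob.glob("input_file.chunk_*")), "~/data/db/eggnog")
#     failed = [c for c, code in results.items() if code != 0]
```

Threads are sufficient here because the work happens in the child processes. Note that `jobs` times `--cpu` should not exceed the number of available cores.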

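Functional annotation

This step is disk-intensive; storing eggnog.db on an SSD or in the in-memory directory /dev/shm is recommended

3. Merge the search results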

cat *.chunk_*.emapper.seed_orthologs > input_file.emapper.seed_orthologs
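The cat command is all that is needed when every chunk was produced with --no_file_comments. As a slightly more defensive sketch, the merge below also drops blank lines and any "#" comment lines, and reports how many hits were written. The function name is made up for illustration.

```python
import glob
import os


def merge_seed_orthologs(pattern, out_path):
    """Concatenate the files matching `pattern` into `out_path`.

    Blank lines and '#' comment lines are skipped. Returns the number of
    hit lines written.
    """
    written = 0
    out_abs = os.path.abspath(out_path)
    # Resolve the inputs first so the output file is never read as an input.
    inputs = [p for p in sorted(glob.glob(pattern))
              if os.path.abspath(p) != out_abs]
    with open(out_path, "w") as out:
        for path in inputs:
            with open(path) as handle:
                for line in handle:
                    if line.startswith("#") or not line.strip():
                        continue
                    out.write(line if line.endswith("\n") else line + "\n")
                    written += 1
    return written


# Example call, mirroring the cat command above:
#     merge_seed_orthologs("*.chunk_*.emapper.seed_orthologs",
#                          "input_file.emapper.seed_orthologs")
```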

4. Annotate

To speed things up, copy the database into memory (about 21 s)

cp ~/data/db/eggnog/eggnog.db /dev/shm

time emapper.py --annotate_hits_table input_file.emapper.seed_orthologs --no_file_comments -o output_file --cpu 20 --data_dir /dev/shm --override

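With the database in memory, processing 10,000 sequences takes about 15 s

We now have the annotation list for all genes. Combined with a gene abundance matrix, it can be used for all kinds of summaries, differential comparisons, and functional descriptions.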
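One way to combine the annotation list with a gene abundance matrix is to sum abundances per COG category. The sketch below is an illustration under stated assumptions: the function name is made up, the abundance matrix is taken as a dict of per-sample values for each gene, a gene with several category letters contributes its full abundance to each of them, and genes without a category are collected under "-". The example values are invented.

```python
def abundance_by_cog(gene_to_cog, abundance):
    """Sum per-sample gene abundances by COG category letter.

    gene_to_cog: {gene_name: category letters, e.g. "L" or "KL"}
    abundance:   {gene_name: [value for each sample]}
    Returns {category_letter: [summed value for each sample]}.
    """
    totals = {}
    for gene, values in abundance.items():
        letters = [c for c in gene_to_cog.get(gene, "") if c.isalpha()] or ["-"]
        for letter in letters:
            row = totals.setdefault(letter, [0.0] * len(values))
            for i, value in enumerate(values):
                row[i] += value
    return totals


# Invented example: three genes, two samples.
gene_to_cog = {"g1": "L", "g2": "KL"}
abundance = {"g1": [1.0, 2.0], "g2": [3.0, 0.0], "g3": [5.0, 5.0]}
for letter, row in sorted(abundance_by_cog(gene_to_cog, abundance).items()):
    print(letter, row)
```

Counting a multi-category gene once per letter inflates the column totals; splitting its abundance evenly among its letters is an equally defensible choice.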

Appendix 1. emapper.py parameters in detail

python emapper.py -h
usage: emapper.py [-h] [--guessdb] [--database] [--dbtype {hmmdb,seqdb}]
                  [--data_dir] [--qtype {hmm,seq}] [--tax_scope]
                  [--target_orthologs {one2one,many2one,one2many,many2many,all}]
                  [--excluded_taxa]
                  [--go_evidence {experimental,non-electronic}]
                  [--hmm_maxhits] [--hmm_evalue] [--hmm_score]
                  [--hmm_maxseqlen] [--hmm_qcov] [--Z] [--dmnd_db DMND_DB]
                  [--matrix {BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30}]
                  [--gapopen GAPOPEN] [--gapextend GAPEXTEND]
                  [--seed_ortholog_evalue] [--seed_ortholog_score] [--output]
                  [--resume] [--override] [--no_refine] [--no_annot]
                  [--no_search] [--report_orthologs] [--scratch_dir]
                  [--output_dir] [--temp_dir] [--no_file_comments]
                  [--keep_mapping_files] [-m {hmmer,diamond}] [-i]
                  [--translate] [--servermode] [--usemem] [--cpu]
                  [--annotate_hits_table] [--version]

optional arguments:
  -h, --help            show this help message and exit
  --version             show the version number

Target HMM Database Options:
  --guessdb             guess eggnog db based on the provided taxid
  --database , -d       specify the target database for sequence searches
                        (only one database can be given). Choose among:
                        euk,bact,arch, host:port, or a local hmmpressed
                        database
  --dbtype {hmmdb,seqdb}
                        database type
  --data_dir            Directory to use for DATA_PATH.
  --qtype {hmm,seq}     search type: use hmm for few sequences, seq for many

Annotation Options:
  --tax_scope           Fix the taxonomic scope used for annotation, so only
                        orthologs from a particular clade are used for
                        functional transfer. By default, this is
                        automatically adjusted for every query sequence.
  --target_orthologs {one2one,many2one,one2many,many2many,all}
                        defines what type of orthologs should be used for
                        functional transfer
  --excluded_taxa       (for debugging and benchmark purposes)
  --go_evidence {experimental,non-electronic}
                        Defines what type of GO terms should be used for
                        annotation: experimental = Use only terms inferred
                        from experimental evidence; non-electronic = Use
                        only non-electronically curated terms

HMM search_options:
  --hmm_maxhits         Max number of hits to report. Default=1
  --hmm_evalue          E-value threshold. Default=0.001
  --hmm_score           Bit score threshold. Default=20
  --hmm_maxseqlen       Ignore query sequences larger than `maxseqlen`.
                        Default=5000
  --hmm_qcov            min query coverage (from 0 to 1). Default=(disabled)
  --Z                   Fixed database size used in phmmer/hmmscan (allows
                        comparing e-values among databases).
                        Default=40,000,000

diamond search_options:
  --dmnd_db DMND_DB     Path to DIAMOND-compatible database
  --matrix {BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30}
                        Scoring matrix
  --gapopen GAPOPEN     Gap open penalty
  --gapextend GAPEXTEND
                        Gap extend penalty

Seed ortholog search option:
  --seed_ortholog_evalue 
                        Min E-value expected when searching for seed eggNOG
                        ortholog. Applies to phmmer/diamond searches. Queries
                        not having a significant seed orthologs will not be
                        annotated. Default=0.001
  --seed_ortholog_score 
                        Min bit score expected when searching for seed eggNOG
                        ortholog. Applies to phmmer/diamond searches. Queries
                        not having a significant seed orthologs will not be
                        annotated. Default=60

Output options:
  --output , -o         base name for output files
  --resume              Resumes a previous execution skipping reported hits in
                        the output file.
  --override            Overwrites output files if they exist.
  --no_refine           Skip hit refinement, reporting only HMM hits.
  --no_annot            Skip functional annotation, reporting only hits
  --no_search           Skip HMM search mapping. Use existing hits file
  --report_orthologs    The list of orthologs used for functional transferred
                        are dumped into a separate file
  --scratch_dir         Write output files in a temporary scratch dir, move
                        them to final the final output dir when finished.
                        Speed up large computations using network file
                        systems.
  --output_dir          Where output files should be written
  --temp_dir            Where temporary files are created. Better if this is a
                        local disk.
  --no_file_comments    No header lines nor stats are included in the output
                        files
  --keep_mapping_files  Do not delete temporary mapping files used for
                        annotation (i.e. HMMER and DIAMOND search outputs)

Execution options:
  -m {hmmer,diamond}    search mode. Default: hmmer
  -i                    Input FASTA file containing query sequences
  --translate           Assume sequences are genes instead of proteins
                        (nucleotide input is translated to protein)
  --servermode          Loads target database in memory and keeps running in
                        server mode, so another instance of eggnog-mapper can
                        connect to this server (convenient for repeated
                        use). Auto turns on the --usemem flag
  --usemem              If a local hmmpressed database is provided as target
                        using --db, this flag will allocate the whole database
                        in memory using hmmpgmd. Database will be unloaded
                        after execution.
  --cpu                 number of threads
  --annotate_hits_table 
                        Annotate TSV formatted table of query->hits. 4
                        fields required: query, hit, evalue, score. Implies
                        --no_search and --no_refine.

Reference

https://github.com/jhcepas/eggnog-mapper/wiki

[1] Fast genome-wide functional annotation through orthology assignment by
eggNOG-mapper. Jaime Huerta-Cepas, Kristoffer Forslund, Luis Pedro Coelho,
Damian Szklarczyk, Lars Juhl Jensen, Christian von Mering and Peer Bork.
Mol Biol Evol (2017). doi:
10.1093/molbev/msx148

[2] eggNOG 4.5: a hierarchical orthology framework with improved functional
annotations for eukaryotic, prokaryotic and viral sequences. Jaime
Huerta-Cepas, Damian Szklarczyk, Kristoffer Forslund, Helen Cook, Davide
Heller, Mathias C. Walter, Thomas Rattei, Daniel R. Mende, Shinichi
Sunagawa, Michael Kuhn, Lars Juhl Jensen, Christian von Mering, and Peer
Bork. Nucl. Acids Res. (04 January 2016) 44 (D1): D286-D293. doi:
10.1093/nar/gkv1248
