Using the eggNOG functional annotation database online and locally
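Introduction to COG
COG stands for Clusters of Orthologous Groups of proteins. The proteins in each COG are assumed to derive from a single ancestral protein, so they are either orthologs or paralogs.
COGs are identified by comparing the proteins encoded in all complete genomes against one another, one by one. For a protein from a given genome, this comparison yields the most similar protein in each of the other genomes (which is why complete genomes are needed to define COGs), and each of those proteins is then considered in turn. If a reciprocal best-hit relationship is found among these proteins (or a subset of them), the reciprocal best hits form a COG. As a result, the members of a COG are more similar to one another than to any other protein in the compared genomes.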
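The reciprocal best-hit idea described above can be sketched in a few lines of Python. This is a simplified illustration limited to two genomes, not the actual COG construction pipeline; the protein names and similarity scores are made up for the example.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` is a dict {(query, subject): similarity_score}.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores between proteins of genome A (a1, a2) and genome B (b1, b2).
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 90.0,
          ("a2", "b1"): 120.0, ("a2", "b2"): 80.0}
b_vs_a = {("b1", "a1"): 245.0, ("b1", "a2"): 118.0,
          ("b2", "a1"): 95.0, ("b2", "a2"): 70.0}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)  # only a1 and b1 pick each other as best hit
```

Here a2 also has b1 as its best hit, but b1 prefers a1, so only the pair (a1, b1) is reciprocal.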
Home page: https://www.ncbi.nlm.nih.gov/COG/
For the COG one-letter codes, see http://www.sbg.bio.ic.ac.uk/~phunkee/html/old/COG_classes.html
COG one letter code descriptions
INFORMATION STORAGE AND PROCESSING
- [J] Translation, ribosomal structure and biogenesis
- [A] RNA processing and modification
- [K] Transcription
- [L] Replication, recombination and repair
- [B] Chromatin structure and dynamics
CELLULAR PROCESSES AND SIGNALING
- [D] Cell cycle control, cell division, chromosome partitioning
- [Y] Nuclear structure
- [V] Defense mechanisms
- [T] Signal transduction mechanisms
- [M] Cell wall/membrane/envelope biogenesis
- [N] Cell motility
- [Z] Cytoskeleton
- [W] Extracellular structures
- [U] Intracellular trafficking, secretion, and vesicular transport
- [O] Posttranslational modification, protein turnover, chaperones
METABOLISM
- [C] Energy production and conversion
- [G] Carbohydrate transport and metabolism
- [E] Amino acid transport and metabolism
- [F] Nucleotide transport and metabolism
- [H] Coenzyme transport and metabolism
- [I] Lipid transport and metabolism
- [P] Inorganic ion transport and metabolism
- [Q] Secondary metabolites biosynthesis, transport and catabolism
POORLY CHARACTERIZED
- [R] General function prediction only
- [S] Function unknown
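For downstream scripts it can be convenient to hold the one-letter codes listed above in a lookup table. The sketch below copies the descriptions from that list; the function name `describe_cog` is only an illustrative choice, and it assumes a category field may contain several letters (for example "KL").

```python
COG_CATEGORIES = {
    # Information storage and processing
    "J": "Translation, ribosomal structure and biogenesis",
    "A": "RNA processing and modification",
    "K": "Transcription",
    "L": "Replication, recombination and repair",
    "B": "Chromatin structure and dynamics",
    # Cellular processes and signaling
    "D": "Cell cycle control, cell division, chromosome partitioning",
    "Y": "Nuclear structure",
    "V": "Defense mechanisms",
    "T": "Signal transduction mechanisms",
    "M": "Cell wall/membrane/envelope biogenesis",
    "N": "Cell motility",
    "Z": "Cytoskeleton",
    "W": "Extracellular structures",
    "U": "Intracellular trafficking, secretion, and vesicular transport",
    "O": "Posttranslational modification, protein turnover, chaperones",
    # Metabolism
    "C": "Energy production and conversion",
    "G": "Carbohydrate transport and metabolism",
    "E": "Amino acid transport and metabolism",
    "F": "Nucleotide transport and metabolism",
    "H": "Coenzyme transport and metabolism",
    "I": "Lipid transport and metabolism",
    "P": "Inorganic ion transport and metabolism",
    "Q": "Secondary metabolites biosynthesis, transport and catabolism",
    # Poorly characterized
    "R": "General function prediction only",
    "S": "Function unknown",
}


def describe_cog(categories):
    """Expand a string of COG letters into (letter, description) pairs.

    Non-letter characters such as commas or spaces are ignored.
    """
    return [(c, COG_CATEGORIES.get(c, "Unknown category"))
            for c in categories if c.isalpha()]


print(describe_cog("KL"))
```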
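Introduction to eggNOG
Principles and interpretation of eggNOG annotation
Unknown sequences are functionally annotated on the basis of known proteins;
By checking how many proteins map to given eggNOG IDs, and which IDs are present or absent, one can infer whether a particular metabolic pathway is present;
Each eggNOG ID represents a family of proteins. A multiple sequence alignment of the query with the proteins of the matched eggNOG ID reveals conserved sites and allows their evolutionary relationships to be analysed.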
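The presence/absence reasoning above can be illustrated with a small helper. This is a minimal sketch under stated assumptions: the group identifiers (`OG_A`, `OG_B`, `OG_C`) and the idea that a pathway is defined by a fixed list of required groups are hypothetical placeholders, not real eggNOG IDs or an eggNOG feature.

```python
def pathway_completeness(required_groups, observed_counts):
    """Report which required orthologous groups were observed.

    required_groups: iterable of group IDs that define the pathway.
    observed_counts: dict {group ID: number of annotated proteins}.
    Returns (present, missing, fraction_present).
    """
    required = list(required_groups)
    present = sorted(g for g in required if observed_counts.get(g, 0) > 0)
    missing = sorted(g for g in required if observed_counts.get(g, 0) == 0)
    fraction = len(present) / len(required) if required else 0.0
    return present, missing, fraction


# Hypothetical pathway needing three groups; two were found in the sample.
present, missing, fraction = pathway_completeness(
    ["OG_A", "OG_B", "OG_C"], {"OG_A": 4, "OG_C": 1, "OG_X": 7})
print(present, missing, round(fraction, 2))
```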
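eggNOG-mapper online
eggNOG-mapper is the dedicated tool for searching sequences against the eggNOG database and annotating them.
Online analysis with eggNOG-mapper takes only three steps, all done with mouse clicks.
1. Open the online tool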
http://eggnogdb.embl.de/#/app/emapper
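2. Set the parameters
Mainly, select the protein sequence file and enter an email address. The other settings can usually be left at their defaults.
Note the choice of search method: diamond is relatively slow for a small number of sequences but relatively fast for a large number. HMMER is more successful at predicting distantly related sequences, but takes a long time on large data sets, and the online version accepts at most 5,000 sequences at a time.
3. Submit the job
Click the Run button to submit the job.
A page showing the job status and a list of citations then appears. Note that the online analysis both limits the number of sequences and queues jobs, so when many people are using it the wait can be long.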
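Before uploading, it may help to check whether a FASTA file stays within the 5,000-sequence limit mentioned above. The following is a small sketch; the function names are illustrative, and the demo input is a made-up two-protein file held in memory.

```python
import io

ONLINE_LIMIT = 5000  # per-job sequence limit quoted for the online service


def count_fasta_records(handle):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


def fits_online_limit(handle, limit=ONLINE_LIMIT):
    """Return True if the file has no more than `limit` sequences."""
    return count_fasta_records(handle) <= limit


demo = io.StringIO(">p1\nMKV\n>p2\nMST\n")
print(fits_online_limit(demo))
```

For a real file, pass an open file handle instead of the `io.StringIO` object, for example `open("proteins.faa")` (a hypothetical file name).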
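eggNOG-mapper local installation
Installing with conda is recommended, because it takes care of dependencies and environment variables (the package is distributed through the bioconda channel)
conda install -c bioconda eggnog-mapper
Manual download and installation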
cd ~/software
wget https://github.com/jhcepas/eggnog-mapper/archive/1.0.3.tar.gz
tar xvzf 1.0.3.tar.gz
cd eggnog-mapper-1.0.3
Software description
less README.md
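eggNOG-mapper uses the eggNOG database to functionally annotate novel gene and protein sequences. It is commonly applied to gene sets from new genomes, transcriptomes and metagenomes. Function prediction based on orthology is considered more accurate than traditional homology searches, because it avoids transferring annotations directly from paralogs (gene duplication carries a high chance of functional divergence).
Documentation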
https://github.com/jhcepas/eggnog-mapper/wiki
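Installation notes
The software depends on python2.7, wget, hmmer3 and diamond.
Disk space requirements:
- eggNOG annotation database: ~20GB
- eggNOG sequence FASTA files: ~20GB
- eggNOG databases (euk, bact, arch): ~130GB, plus 1-35GB for each taxon-specific HMM database; there is no need to download them all, only the ones you need.
For the size of each HMM database, see http://beta-eggnogdb.embl.de/download/eggnog_4.5/hmmdb_levels/
Memory requirements:
HMMER3 annotation is very fast when the database is held in memory. The memory needed is:
- Eukaryote database euk: ~90GB
- Bacteria database bact: ~32GB
- Archaea database arch: ~10GB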
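As a quick sanity check, the memory figures above can be compared against the RAM of a given machine. This sketch simply encodes the approximate values from the list; the function name and the example memory sizes are illustrative assumptions.

```python
# Approximate memory (GB) needed to hold each optimized database in RAM,
# taken from the list above.
REQUIRED_MEMORY_GB = {"euk": 90, "bact": 32, "arch": 10}


def databases_that_fit(available_gb, requirements=REQUIRED_MEMORY_GB):
    """Return the databases whose in-memory size does not exceed available_gb."""
    return sorted(name for name, need in requirements.items()
                  if need <= available_gb)


# Example: a hypothetical server with 64 GB of free memory.
print(databases_that_fit(64))
```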
Software installation
Besides the conda and wget installations shown above, the software can also be obtained with git
git clone https://github.com/jhcepas/eggnog-mapper.git
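Database download
- eggNOG provides HMM databases for 107 taxonomic levels, three optimized databases (eukaryotes euk, bacteria bact and archaea arch) and one virus-specific database, viruses
- Each of the three optimized databases includes all of its corresponding HMMs.
- For the 107 individual data subsets, see http://eggnogdb.embl.de/#/app/downloads
Show the program help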
python eggnog-mapper/download_eggnog_data.py -h
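Download the four commonly used databases into the eggnog directory. The search examples further on read the data from ~/data/db/eggnog, so adjust --data_dir to wherever your copy is stored.
--data_dir sets the download directory, -y answers yes to all prompts and -f forces the download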
mkdir -p eggnog
python eggnog-mapper/download_eggnog_data.py --data_dir eggnog -y -f euk bact arch viruses
Basic usage
cd eggnog-mapper
HMMER method
Disk-based local search against the optimized bacterial database
-i is the input file, --output the output file prefix, -d the database to search, and --data_dir the database location
python emapper.py -i test/polb.fa --output polb_bact -d bact --data_dir ~/data/db/eggnog
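diamond method
-m selects the diamond method; the default is hmmer. diamond only shows its speed advantage with more than about a thousand sequences. With few sequences it feels very slow, and its results are also less accurate than those of hmmer, especially for distantly related sequences.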
python emapper.py -i test/polb.fa --output diamond_bact_ -d bact --data_dir ~/data/db/eggnog -m diamond
This takes a long time, more than an hour
Interpreting the results
https://github.com/jhcepas/eggnog-mapper/wiki/Results-Interpretation
Three result files are produced
polb_bact.emapper.annotations
polb_bact.emapper.hmm_hits
polb_bact.emapper.seed_orthologs
Focus mainly on the annotations file, which contains the GO, KEGG and COG descriptions for each gene
[project_name].emapper.hmm_hits
File: list of HMM search hits
For each query sequence, a list of significant hits to eggNOG Orthologous Groups (OGs) is reported. Each line in the file represents a hit, where evalue, bit-score, query-coverage and the sequence coordinates of the match are reported. If multiple hits exist for a given query, results are sorted by e-value.
[project_name].emapper.seed_orthologs
File: list of best hits (seed orthologs)
each line in the file provides the best match of each query within the best Orthologous Group (OG) reported in the [project].hmm_hits file, obtained running PHMMER against all sequences within the best OG. The seed ortholog is used to fetch fine-grained orthology relationships from eggNOG. If using the diamond search mode, seed orthologs are directly obtained from the best matching sequences by running DIAMOND against the whole eggNOG protein space.
[project_name].emapper.annotations
File: the consolidated annotation results; this is the one that matters.
This file provides final annotations of each query. It has 13 tab-delimited columns:
- query_name: query sequence name
- seed_eggNOG_ortholog: best protein match in eggNOG
- seed_ortholog_evalue: best protein match (e-value)
- seed_ortholog_score: best protein match (bit-score)
- predicted_gene_name: Predicted gene name for query sequences
- GO_terms: Comma delimited list of predicted Gene Ontology terms
- KEGG_KO: Comma delimited list of predicted KEGG KOs
- BiGG_Reactions: Comma delimited list of predicted BiGG metabolic reactions
- Annotation_tax_scope: The taxonomic scope used to annotate this query sequence
- Matching_OGs: Comma delimited list of matching eggNOG Orthologous Groups
- best_OG|evalue|score: Best matching Orthologous Groups (only in HMM mode)
- COG functional categories: COG functional category inferred from best matching OG
- eggNOG_HMM_model_annotation: eggNOG functional description inferred from best matching OG
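The 13-column layout described above can be read with a few lines of Python. This is a minimal sketch, assuming the column order listed above and that comment lines start with `#`; the dictionary keys `best_OG` and `COG_cat` are shortened names chosen here for convenience, and the two demo rows are invented values rather than real emapper output.

```python
import io
from collections import Counter

COLUMNS = [
    "query_name", "seed_eggNOG_ortholog", "seed_ortholog_evalue",
    "seed_ortholog_score", "predicted_gene_name", "GO_terms", "KEGG_KO",
    "BiGG_Reactions", "Annotation_tax_scope", "Matching_OGs",
    "best_OG", "COG_cat", "eggNOG_HMM_model_annotation",
]


def read_annotations(handle):
    """Parse an .emapper.annotations file into a list of dicts."""
    rows = []
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header/statistics comments and blank lines
        fields = line.rstrip("\n").split("\t")
        fields += [""] * (len(COLUMNS) - len(fields))  # pad short rows
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def count_cog_categories(rows):
    """Count how many queries fall into each COG one-letter category."""
    counts = Counter()
    for row in rows:
        for letter in row["COG_cat"]:
            if letter.isalpha():
                counts[letter] += 1
    return counts


# Two made-up annotation rows preceded by a comment line.
demo = io.StringIO(
    "# header line (skipped)\n"
    + "\t".join(["q1", "seedA", "1e-50", "200.0", "geneA", "GO:0000001",
                 "K00001", "", "bactNOG[38]", "OG_1", "OG_1|1e-40|150.0",
                 "L", "example description"]) + "\n"
    + "\t".join(["q2", "seedB", "1e-20", "90.0", "", "", "", "",
                 "bactNOG[38]", "OG_2", "OG_2|1e-15|80.0",
                 "KL", "another description"]) + "\n"
)
rows = read_annotations(demo)
counts = count_cog_categories(rows)
print(counts.most_common())
```

For real data, replace `demo` with an open handle on the annotations file, for example `open("polb_bact.emapper.annotations")`.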
Advanced usage
https://github.com/jhcepas/eggnog-mapper/wiki/Advanced-usage-and-tips
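Speeding up with large memory and multiple threads
--usemem loads the whole database into memory before searching, --cpu sets the number of threads, and --override forces existing results to be overwritten; without it the run aborts if result files already exist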
python emapper.py -i test/polb.fa --output polb_bact --database bact --data_dir ~/data/db/eggnog --usemem --cpu 10 --override
# Total time: 11.8659 secs
Server mode with shared memory
First load the bacterial database. Only one database can be selected per server.
python emapper.py --database bact --data_dir ~/data/db/eggnog --cpu 10 --servermode
Loading the data takes a while
Waiting for server to become ready... localhost 51500
until the following is shown:
Server ready listening at localhost:51500 and using 10 CPU cores
Use `emapper.py -d bact:localhost:51500 (...)` to search against this server
Then start the analysis command
--usemem loads the whole database into memory, and --cpu sets the number of threads
python emapper.py -i test/polb.fa --output polb_bact --database bact:localhost:51500 --data_dir ~/data/db/eggnog --usemem --cpu 10 --override
# Total time: 9.77332 secs
Large-scale mode for metagenomes
https://github.com/jhcepas/eggnog-mapper/wiki/Setting-up-large-scale-analyses
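Annotation of large genomes and metagenomic data (>100M proteins).
The analysis has two main steps: homology search, which is compute-intensive, and functional annotation, which is I/O-intensive. Splitting the data improves efficiency.
Homology search
1. Split the sequences
Prepare the file and convert it to single-line FASTA (each sequence on one line)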
cp /mnt/bai/yongxin/test/meta1809/temp/23prokka_all/mg.faa input_file0.faa
format_fasta_1line.pl -i input_file0.faa -o input_file.faa
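Split into files of 2 million lines, i.e. 1 million sequences each. For this test we use 10,000 lines, i.e. 5,000 sequences.
# -l splits by line count; -a sets the suffix length to 3 (default 2); -d uses numeric suffixes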
split -l 10000 -a 3 -d input_file.faa input_file.chunk_
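The script format_fasta_1line.pl used above is not a standard tool, so here is a hedged Python sketch of the same two steps: joining wrapped sequence lines and splitting the records into fixed-size chunks. The function names are illustrative, and the chunk file naming merely mimics `split -a 3 -d`.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, parts = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line, []
        else:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_records(records, per_chunk):
    """Group records into lists holding at most per_chunk records each."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == per_chunk:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def write_chunks(fasta_path, prefix, per_chunk=5000):
    """Write single-line FASTA chunks named prefix000, prefix001, ..."""
    paths = []
    with open(fasta_path) as handle:
        chunks = split_records(read_fasta(handle), per_chunk)
        for index, chunk in enumerate(chunks):
            path = "%s%03d" % (prefix, index)
            with open(path, "w") as out:
                for header, sequence in chunk:
                    out.write(header + "\n" + sequence + "\n")
            paths.append(path)
    return paths


# In-memory demonstration with three short made-up proteins.
records = list(read_fasta(io.StringIO(">a\nMK\nVL\n>b\nMS\n>c\nMT\n")))
chunk_sizes = [len(c) for c in split_records(records, 2)]
print(records, chunk_sizes)
```

A call such as `write_chunks("input_file0.faa", "input_file.chunk_", 5000)` would then produce files equivalent to the 10,000-line chunks created by the split command.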
2. Parallel search
Method 1. Generate commands for a cluster
for f in *.chunk_*; do
echo ./emapper.py -m diamond --no_annot --no_file_comments --cpu 16 -i $f -o $f;
done
Method 2. Run in parallel
time parallel -j 3 --xapply \
'python emapper.py -m diamond --no_annot --no_file_comments --data_dir ~/data/db/eggnog --cpu 16 -i {1} -o {1}' \
::: input_file.chunk*
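Elapsed time: real 14m45.579s
Functional annotation
This step is disk-intensive. It is recommended to store eggnog.db on an SSD or in the in-memory directory /dev/shm
3. Merge the search results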
cat *.chunk_*.emapper.seed_orthologs > input_file.emapper.seed_orthologs
4. Annotate
To speed things up, copy the database into memory (takes 21 s)
cp ~/data/db/eggnog/eggnog.db /dev/shm
time emapper.py --annotate_hits_table input_file.emapper.seed_orthologs --no_file_comments -o output_file --cpu 20 --data_dir /dev/shm --override
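With the database in memory, processing 10,000 sequences takes about 15 s
We now have the annotation table for all genes. Combined with a gene abundance matrix, it can be used for all kinds of summaries, differential comparisons and functional descriptions.
Appendix 1. emapper.py parameters in detail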
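Combining annotations with an abundance matrix usually means summing gene abundances per function. The sketch below shows one simple way to do this; the gene names, KO identifiers and abundance values are invented, and the choice to count a gene fully toward every function it carries is an assumption rather than a fixed convention.

```python
from collections import defaultdict


def sum_abundance_by_function(gene_to_functions, abundance):
    """Sum per-sample gene abundances for each function.

    gene_to_functions: dict {gene: [function IDs, e.g. KEGG KOs]}.
    abundance: dict {gene: {sample: value}}.
    A gene with several functions contributes fully to each of them.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for gene, functions in gene_to_functions.items():
        for sample, value in abundance.get(gene, {}).items():
            for function in functions:
                totals[function][sample] += value
    return {function: dict(samples) for function, samples in totals.items()}


# Hypothetical annotations and abundances for three genes in two samples.
gene_to_functions = {"g1": ["K00001"], "g2": ["K00001", "K00002"], "g3": []}
abundance = {
    "g1": {"S1": 10.0, "S2": 0.0},
    "g2": {"S1": 5.0, "S2": 2.0},
    "g3": {"S1": 1.0, "S2": 1.0},
}
table = sum_abundance_by_function(gene_to_functions, abundance)
print(table)
```

Unannotated genes such as g3 drop out of the function table, which is worth keeping in mind when comparing totals across samples.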
python emapper.py -h
usage: emapper.py [-h] [--guessdb] [--database] [--dbtype {hmmdb,seqdb}]
[--data_dir] [--qtype {hmm,seq}] [--tax_scope]
[--target_orthologs {one2one,many2one,one2many,many2many,all}]
[--excluded_taxa]
[--go_evidence {experimental,non-electronic}]
[--hmm_maxhits] [--hmm_evalue] [--hmm_score]
[--hmm_maxseqlen] [--hmm_qcov] [--Z] [--dmnd_db DMND_DB]
[--matrix {BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30}]
[--gapopen GAPOPEN] [--gapextend GAPEXTEND]
[--seed_ortholog_evalue] [--seed_ortholog_score] [--output]
[--resume] [--override] [--no_refine] [--no_annot]
[--no_search] [--report_orthologs] [--scratch_dir]
[--output_dir] [--temp_dir] [--no_file_comments]
[--keep_mapping_files] [-m {hmmer,diamond}] [-i]
[--translate] [--servermode] [--usemem] [--cpu]
[--annotate_hits_table] [--version]
optional arguments:
-h, --help show this help message and exit
--version show the version number
Target HMM Database Options:
--guessdb guess eggnog db based on the provided taxid
--database , -d specify the target database for sequence searches (only one database at a time). Choose among: euk,bact,arch, host:port, or a local hmmpressed database
--dbtype {hmmdb,seqdb} database type
--data_dir Directory to use for DATA_PATH.
--qtype {hmm,seq} query type: hmm for HMM profile queries, seq for sequence queries
Annotation Options:
--tax_scope Fix the taxonomic scope used for annotation, so only orthologs from a particular clade are used for functional transfer. By default, this is automatically adjusted for every query sequence.
--target_orthologs {one2one,many2one,one2many,many2many,all}
defines what type of orthologs should be used for functional transfer
--excluded_taxa (for debugging and benchmark purposes)
--go_evidence {experimental,non-electronic}
Defines what type of GO terms should be used for
annotation: experimental = Use only terms inferred from
experimental evidence; non-electronic = Use only
non-electronically curated terms
HMM search_options:
--hmm_maxhits Max number of hits to report. Default=1
--hmm_evalue E-value threshold. Default=0.001
--hmm_score Bit score threshold. Default=20
--hmm_maxseqlen Ignore query sequences larger than `maxseqlen`.
Default=5000
--hmm_qcov min query coverage (from 0 to 1). Default=(disabled)
--Z Fixed database size used in phmmer/hmmscan (allows
comparing e-values among databases).
Default=40,000,000
diamond search_options:
--dmnd_db DMND_DB Path to DIAMOND-compatible database
--matrix {BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30}
Scoring matrix
--gapopen GAPOPEN Gap open penalty
--gapextend GAPEXTEND
Gap extend penalty
Seed ortholog search option:
--seed_ortholog_evalue
Min E-value expected when searching for seed eggNOG
ortholog. Applies to phmmer/diamond searches. Queries
not having a significant seed orthologs will not be
annotated. Default=0.001
--seed_ortholog_score
Min bit score expected when searching for seed eggNOG
ortholog. Applies to phmmer/diamond searches. Queries
not having a significant seed orthologs will not be
annotated. Default=60
Output options:
--output , -o base name for output files
--resume Resumes a previous execution skipping reported hits in
the output file.
--override Overwrites output files if they exist.
--no_refine Skip hit refinement, reporting only HMM hits.
--no_annot Skip functional annotation, reporting only hits
--no_search Skip HMM search mapping. Use existing hits file
--report_orthologs The list of orthologs used for functional transferred
are dumped into a separate file
--scratch_dir Write output files in a temporary scratch dir, move
them to final the final output dir when finished.
Speed up large computations using network file
systems.
--output_dir Where output files should be written
--temp_dir Where temporary files are created. Better if this is a
local disk.
--no_file_comments No header lines nor stats are included in the output
files
--keep_mapping_files Do not delete temporary mapping files used for
annotation (i.e. HMMER and DIAMOND search outputs)
Execution options:
-m {hmmer,diamond} search method. Default: hmmer
-i Input FASTA file containing query sequences
--translate Assume sequences are genes instead of proteins
--servermode Loads target database in memory and keeps running in
server mode, so another instance of eggnog-mapper can
connect to this server. Auto turns on the --usemem flag
--usemem If a local hmmpressed database is provided as target
using --db, this flag will allocate the whole database
in memory using hmmpgmd. Database will be unloaded
after execution.
--cpu number of CPU threads to use
--annotate_hits_table
Annotate TSV formatted table of query->hits. 4
fields required: query, hit, evalue, score. Implies
--no_search and --no_refine.
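The help text above states that --annotate_hits_table expects four fields (query, hit, evalue, score) and that the seed ortholog defaults are an e-value of 0.001 and a bit score of 60. As an illustration, such a table can be pre-filtered in Python before annotation. This is a sketch under those assumptions; the function name and the demo lines are invented, and lines with fewer than four tab-separated fields are not handled.

```python
def filter_hits(lines, max_evalue=0.001, min_score=60.0):
    """Keep (query, hit, evalue, score) rows that pass both thresholds.

    Comment lines starting with '#' and blank lines are skipped.
    """
    kept = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        query, hit, evalue, score = line.rstrip("\n").split("\t")[:4]
        if float(evalue) <= max_evalue and float(score) >= min_score:
            kept.append((query, hit, float(evalue), float(score)))
    return kept


# Made-up seed ortholog lines: the second fails the score threshold,
# the third fails the e-value threshold.
demo_lines = [
    "# comment\n",
    "q1\tseedA\t1e-30\t120.5\n",
    "q2\tseedB\t1e-5\t45.0\n",
    "q3\tseedC\t0.5\t80.0\n",
]
kept = filter_hits(demo_lines)
print(kept)
```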
Reference
https://github.com/jhcepas/eggnog-mapper/wiki
[1] Fast genome-wide functional annotation through orthology assignment by
eggNOG-mapper. Jaime Huerta-Cepas, Kristoffer Forslund, Luis Pedro Coelho,
Damian Szklarczyk, Lars Juhl Jensen, Christian von Mering and Peer Bork.
Mol Biol Evol (2017). doi:
10.1093/molbev/msx148
[2] eggNOG 4.5: a hierarchical orthology framework with improved functional
annotations for eukaryotic, prokaryotic and viral sequences. Jaime
Huerta-Cepas, Damian Szklarczyk, Kristoffer Forslund, Helen Cook, Davide
Heller, Mathias C. Walter, Thomas Rattei, Daniel R. Mende, Shinichi
Sunagawa, Michael Kuhn, Lars Juhl Jensen, Christian von Mering, and Peer
Bork. Nucl. Acids Res. (04 January 2016) 44 (D1): D286-D293. doi:
10.1093/nar/gkv1248