這或許是我寫的最全的BLAST教程

阿新 • • 發佈：2018-12-18

Basic local alignment search tool (BLAST)

包括：blastn, blastp, blastx, tblastn, tblastx等. 使用conda安裝即可。

conda install -c bioconda blast
# blast安裝perl模組的方法
conda isntall perl-digest-md5

BLAST的主要理念

Search may take place in nucleotide and/or protein space or translated spaces where nucleotides are translated into proteins.

Searches may implement search “strategies”: optimizations to a certain task. Different search strategies will return different alignments.
Searches use alignments that rely on scoring matrices
Searches may be customized with many additional parameters. BLAST has many subtle functions that most users never need.

本地BLAST的基本步驟

用makeblastdb為BLAST提供資料庫
選擇blast工具，blastn,blastp
執行工具，有必要的還可以對輸出結果進行修飾

第一步：建立檢索所需資料庫

BLAST資料庫分為兩類，核酸資料庫和氨基酸資料庫，可以用makeblastbd建立。可以用help引數簡單看下說明。

$ makeblastdb -help
USAGE
  makeblastdb [-h][-help][-in input_file][-input_type type]
    -dbtype molecule_type [-title database_title] 
[-parse_seqids][-hash_index][-mask_data mask_data_files][-mask_id mask_algo_ids][-mask_desc mask_algo_descriptions][-gi_mask][-gi_mask_name gi_based_mask_names][-out database_name][-max_file_sz number_of_bytes][-logfile File_Name][-taxid TaxID][-taxid_map TaxIDMapFile][-version]
-dbtype <String, `nucl', `prot'>

具體以擬南芥基因組作為案例，介紹使用方法：注：擬南芥的基因組可以在TAIR上下在，也可在ensemblgenomes下載。後者還可以下載其他植物的基因組

# 下載基因組
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-36/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
gzip -d Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
# 構建核酸BLAST資料庫
makeblastdb -in Arabidopsis_thaliana.TAIR10.dna.toplevel.fa -dbtype nucl -out TAIR10 -parse_seqids

# 下載擬南芥protein資料
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-36/fasta/arabidopsis_thaliana/pep/Arabidopsis_thaliana.TAIR10.pep.all.fa.gz
# 構建蛋白BLAST資料庫
gzip -dArabidopsis_thaliana.TAIR10.pep.all.fa.gz
makeblastdb -in  Arabidopsis_thaliana.TAIR10.pep.all.fa -dbtype prot -out TAIR10 -parse_seqids

如果你從NCBI或者其他渠道下載了格式化過的資料庫，那麼可以用blastdbcmd去檢索blast資料庫，引數很多，常用就如下幾個

db string : string表示資料庫所在路徑
dbtype string,: string在（guess, nucl, prot)中選擇一個
檢索相關引數
- -entry all 或 555, AC147927 或 gnl|dbname|tag’
- -entry_batch 提供一個包含多個檢索關鍵字的檔案
- -info 資料庫基本資訊
輸出格式 -outfmt %f %s %a %g …預設是%f
out 輸出檔案
show_blastdb_search_path: blast檢索資料庫路徑

使用案例

# 檢視資訊
blastdbcmd -db TAIR10 -dbtype nucl -info
# 所有資料
blastdbcmd -db TAIR10 -dbtype nucl -entry all | head
# 具體關鍵字，如GI號
blastdbcmd -db TAIR10 -dbtype nucl -entry 3 | head

還有其他有意思的引數，可以看幫助檔案瞭解

可選：BLAST安裝和更新nr和nt庫

安裝nt/nr庫需要先進行環境變數配置，在家目錄下新建一個.ncbirc配置檔案，然後新增如下內容

; 開始配置BLAST
[BLAST]
; 宣告BLAST資料庫安裝位置
BLASTDB=/home/xzg/Database/blast
; Specifies the data sources to use for automatic resolution
; for sequence identifiers
DATA_LOADERS=blastdb
; 蛋白序列資料庫存本地位置
BLASTDB_PROT_DATA_LOADER=/home/xzg/Database/blast/nr
; 核酸資料庫本地存放位置
BLASTDB_NUCL_DATA_LOADER=/home/xzg/Database/blast/nt
[WINDOW_MASKER]
WINDOW_MASKER_PATH=/home/xzg/Database/blast/windowmasker

配置好之後，使用BLAST+自帶的update_blastdb.pl指令碼下載nr和nt等庫檔案（不建議下載序列檔案，一是因為後者檔案更大，二是因為可以從庫檔案中提取序列blastdbcmd -db nr -dbtype prot -entry all -outfmt “%f” -out nr.fa ，最主要是建庫需要花費很長時間），直接執行下列命令即可自動下載。

nohup time update_blastdb.pl nt nr > log &

如果你不像通過update_blastdb.pl下載nr和nt等庫檔案，也可以是從ncbi上直接下載一系列nt/nr.xx.tar.gz檔案，然後解壓縮即可，後續還可以用update_blastdb.pl進行資料更新。

報錯：使用update_blastdb.pl更新和下載資料庫時候出現模組未安裝的問題。解決方法，首先用conda安裝對應的模組，然後修改update_blastdb.pl的第一行，即shebang部分，以conda的perl替換，或者按照如下方法執行。

perl `which update_blastdb.pl`

下載過程中請確保網路狀態良好，否則會出現Downloading nt.00.tar.gz...Unable to close datastream報錯。

第二步：選擇blast工具

根據不同的需求，比如說你用的序列是氨基酸還是核苷酸，你要查詢的資料是核甘酸還是氨基酸，選擇合適的blast工具。不同需求的對應關係可以見下圖（來自biostars handbook）

BLAST工具

不同工具的應用範圍雖然不同，但是基本引數都是一致的，學會blastn，基本上其他諸如blastp,blastx也都會了。

blastn的使用引數很多 blastn [-h] ，但是比較常用是如下幾個

-db : 資料庫在本地的位置，或者是NCBI上資料庫的型別,如 -db nr

BLAST庫

-query: 檢索檔案
-query_loc : 指定檢索的位置
-strand：搜尋正義鏈還是反義鏈，還是都要
out : 輸出檔案
-remote：可以用NCBI的遠端資料庫，一般與 -db nr
-evalue 科學計數法，比如說1e3，定義期望值閾值。E值表明在隨機的情況下，其它序列與目標序列相似度要大於這條顯示的序列的可能性。與S值有關，S值表示兩序列的同源性，分值越高表明它們之間相似的程度越大

E值總結： 1.E值適合於有一定長度，而且複雜度不能太低的序列。2. 當E值小於10-5時，表明兩序列有較高的同源性，而不是因為計算錯誤。3. 當E值小於10-6時，表時兩序列的同源性非常高，幾乎沒有必要再做確認。
與聯配計分相關引數： -gapopen，開啟gap的代價；-gapextend， gap延伸的代價；-penalty：核算錯配的懲罰； -reward，核酸正確匹配的獎勵；
結果過濾：-perc_identity, 根據相似度

注 BLAST還提供一個task引數，感覺很有用的樣子，好像會針對任務進行優化速度。

第三步：執行blast，調整輸出格式。

我隨機找了一段序列進行檢索

echo '>test' > query.fa
echo TGAAAGCAAGAAGAGCGTTTGGTGGTTTCTTAACAAATCATTGCAACTCCACAAGGCGCCTGTAATAGACAGCTTGTGCATGGAACTTGGTCCACAGTGCCCTACCACTGATGATGTTGATATCGGAAAGTGGGTTGCAAAAGCTGTTGATTGTTTGGTGATGACGCTAACAATCAAGCTCCTCTGGT >> query.fa

用的是blastn 檢索核酸資料庫。最簡單的用法就是提供資料庫所在位置和需要檢索的序列檔案

blastn -db BLAST/TAIR10 -query query.fa
# 還可以指定檢索序列的位置
blastn -db BLAST/TAIR10 -query query.fa  -query_loc 20-100
# 或者使用遠端資料庫
blastn -db nr -remote -query query.fa
blastn -db nt -remote -query query.fa

以上是預設輸出，blast的-outfmt選項提供個性化的選擇。一共有18個選擇，預設是0。 0 = Pairwise, 1 = Query-anchored showing identities, 2 = Query-anchored no identities, 3 = Flat query-anchored showing identities, 4 = Flat query-anchored no identities, 5 = BLAST XML, 6 = Tabular, 7 = Tabular with comment lines, 8 = Seqalign (Text ASN.1), 9 = Seqalign (Binary ASN.1), 10 = Comma-separated values, 11 = BLAST archive (ASN.1), 12 = Seqalign (JSON), 13 = Multiple-file BLAST JSON, 14 = Multiple-file BLAST XML2, 15 = Single-file BLAST JSON, 16 = Single-file BLAST XML2, 17 = Sequence Alignment/Map (SAM), 18 = Organism Report 其中6，7，10，17可以自定輸出格式。預設是

qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore

簡寫	含義
qaccver	查詢的AC版本（與此類似的還有qseqid,qgi,qacc,與序列命名有關）
saccver	目標的AC版本（於此類似的還有sseqid,sallseqid,sgi,sacc,sallacc,也是序列命名相關)
pident	完全匹配百分比（響應的nident則是匹配數）
length	聯配長度(另外slen表示查詢序列總長度，qlen表示目標序列總長度)
mismatch	錯配數目
gapopen	gap的數目
qstart	查詢序列起始
qstart	查詢序列結束
sstart	目標序列起始
send	目標序列結束
evalue	期望值
bitscore	Bit得分
score	原始得分
AC：	accession

以格式7為例項進行輸出，並且對線上資料庫進行檢索

blastn -task blastn -remote -db nr -query query.fa  -outfmt 7 -out query.txt
head -n 15 query.txt

BLAST BEST HIT

如果想輸入序列，增加對應的格式qseq, sseq

blastn -query query.fa -db nr -outfmt ' 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send qseq sseq evalue bitscore'

對已有序列進行註釋時常見的best hit only模式命令列

blastn -query gene.fa -out gene.blast.txt -task megablast -db nt -num_threads 12 -evalue 1e-10 -best_hit_score_edge 0.05 -best_hit_overhang 0.25 -outfmt "7 std stitle" -perc_identity 50 -max_target_seqs 1
# 引數詳解
-task megablast : 任務執行模式，可選有'blastn' 'blastn-short' 'dc-megablast' 'megablast' 'rmblastn'
-best_hit_score_edge 0.05 :  Best Hit 演算法的邊界值，取值範圍為0到0.5，系統推薦0.1
-best_hit_overhang 0.25 : Best Hit 演算法的閾值，取值範圍為0到0.5，系統推薦0.1
-perc_identity 50 : 相似度大於50
-max_target_seqs 1 : 最多保留多少個聯配

僅僅看引數依舊無用，還需要知道BLAST的Best-Hits的過濾演算法。假設一個序列存在兩個match結果,A和B，無論A還是B，他們的HSP（High-scoring Segment Pair, 沒有gap時的最高聯配得分）一定要高於best_hit_overhang，否則被過濾。如果滿足下列條件則保留A

evalue(A) >= evalue (B)
score(A)/length(A) < (1.0-score_edge)*score(B)/length(B)

搭建網頁BLAST

曾經的BLAST安裝後提供wwwblast用於構建本地的BLAST網頁工具，但是BLAST+沒有提供這個工具，好在BLAST足夠出名，也就有人給它開發網頁版工具。如viroBLAST和Sequenceserver, 目前來看似乎後者更受人歡迎。

有root安裝起來非常的容易

sudo apt install ruby ncbi-blast+ ruby-dev rubygems-integration npm
sudo gem install sequenceserver

資料庫準備,前面步驟已經下載了擬南芥基因組的FASTA格式資料

sequenceserver -d /到/之前/建立/BLAST/檔案路徑

然後就可以開啟瀏覽器輸入IP:埠號使用了。

Sequenceserver高階用法

開機啟動：

新建一個/etc/systemd/system/sequenceserver.service檔案，新增如下內容。注意修改ExecStart.

[Unit]
Description=SequenceServer server daemon
Documentation="file://sequenceserver --help" "http://sequenceserver.com/doc"
After=network.target

[Service]
Type=simple
User=seqservuser
ExecStart=/path/to/bin/sequenceserver -c /path/to/sequenceserver.conf
KillMode=process
Restart=on-failure
RestartSec=42s
RestartPreventExitStatus=255

[Install]
WantedBy=multi-user.target

然後重新載入systemctl

## let systemd know about changed files
sudo systemctl daemon-reload
## enable service for automatic start on boot
systemctl enable sequenceserver.service
## start service immediately
systemctl start sequenceserver.service

nginx反向代理：我承認沒有基本的nginx的知識根本搞不定這一步，所以我建議組內使用就不要折騰這個。簡單的說就是在nginx的配置檔案的server部分新增如下內容即可。

location /sequenceserver/ {
    root /home/priyam/sequenceserver/public/dist;
    proxy_pass http://localhost:4567/;
    proxy_intercept_errors on;
    proxy_connect_timeout 8;
    proxy_read_timeout 180;
}

這或許是我寫的最全的BLAST教程

Basic local alignment search tool (BLAST)

BLAST的主要理念

本地BLAST的基本步驟

第一步：建立檢索所需資料庫

可選：BLAST安裝和更新nr和nt庫

第二步：選擇blast工具

第三步：執行blast，調整輸出格式。

搭建網頁BLAST

Sequenceserver高階用法

術語列表

參考資料

瀏覽器緩存清理最全圖文教程

最全Pycharm教程（10）——Pycharm調試器總篇

最全Pycharm教程（24）——Pycharm編輯器功能之宏定義

最全Pycharm教程（11）——Pycharm調試器之斷點篇

最全Pycharm教程（26）——Pycharm搜索導航之文件名、符號名搜索（轉）

Python的高級特效是什麽？有什麽用呢？史上最全的教程！

史上最全SVN 教程

最全PyCharm教程（1-15）【轉】

halcon教程從0基礎到精通快速瞭解操作並精通視覺專案最全視訊教程資料彙總 halcon下載

Go語言連接Oracle（就我這個最全）

從零開始搭建Detox自動化測試框架測試React Native （IOS/Andriod）也許是全網最全的教程持續更新中

192G最全PS教程+素材+外掛+推薦收藏

這可能是目前最全的Redis高可用技術解決方案總結

CentOS下Solr7.4.0安裝部署到Tomcat最全詳細教程

最全Pycharm教程（41）——Pycharm擴充套件功能之便籤註釋

最全Pycharm教程（29）——再探IDE，速成手冊

最全Pycharm教程（26）——Pycharm搜尋導航之檔名、符號名搜尋

最全Pycharm教程（4）——有關Python直譯器的相關配置

最全Pycharm教程（35）——Pycharm中使用Vagrant

最全Pycharm教程（13）——Pycharm部署

這或許是我寫的最全的BLAST教程

Basic local alignment search tool (BLAST)

BLAST的主要理念

本地BLAST的基本步驟

第一步：建立檢索所需資料庫

可選：BLAST安裝和更新nr和nt庫

第二步：選擇blast工具

第三步：執行blast，調整輸出格式。

搭建網頁BLAST

Sequenceserver高階用法

術語列表

參考資料

相關推薦