1. 程式人生 > >python實現根據序列ID從提取fasta檔案序列

python實現根據序列ID從提取fasta檔案序列

當序列少的時候,我習慣用 grep -A 1 -f seq.lst seq.fas | sed ‘/^–$/d’ > out.fas提取,但是這次遇到了一個大檔案,用grep就太費時了,然後又試了一下TBtools的提取序列功能,發現時間也很長,所以就寫了個python。提取將近100萬條reads耗時也就需要10s左右

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Usage: python script.py lstFile seqFile outfile
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("lstFile", help="input a list file", type=str)
parser.add_argument("seqFile", help="input a sequence file", type=str)
parser.add_argument("outfile", help="output a file", type=str)
args = parser.parse_args()
lst = args.lstFile
seq = args.seqFile
out = args.outfile

frLst = open(lst, 'r')
frSeq = open(seq, 'r')
fwOut = open(out, 'w')

query = []
database = {}
for i in frLst.readlines():
    query.append(i.strip())
for i in frSeq.readlines():
    if i.startswith('>'):
        keys = i.lstrip('>').strip()
        database[keys] = []
    else:
        database[keys].append(i.strip())
for line in query:
    fwOut.write('>' + str(line) + '\n' + str(''.join(database[line])) + '\n')
fwOut.close()