Converting a read-count file to FASTA format (redundant reads)

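Much of the sequencing data downloaded from NCBI has already had its adapters removed and has been collapsed into a read-count format: on each line, the first column is the read sequence and the second column is the number of times that read occurs (comma-separated in the file used here). We need to convert this into FASTA format so that every individual read is written as its own sequence record; a read with a count of 4, for example, is written four times.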
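The expansion rule just described can be sketched on a couple of in-memory rows before touching the real file. This is only an illustration: the function name `expand_reads` and the two-row `example` list are hypothetical, and the header scheme `>row_copy` is assumed to match the one used by the script later in this post.

```python
def expand_reads(rows):
    """Yield (header, sequence) pairs, one per individual read.

    rows: iterable of 'SEQUENCE,COUNT' strings (assumed comma-separated).
    """
    for row_number, row in enumerate(rows, start=1):
        row = row.strip()
        if not row:
            continue  # ignore blank lines
        seq, count = row.split(',')
        # Emit the same sequence once per copy, numbering copies from 1.
        for copy in range(1, int(count) + 1):
            yield f'>{row_number}_{copy}', seq


# Hypothetical two-row input: one read seen once, one read seen twice.
example = ['GATCTTTCGAACTTTCCCAAC,1', 'TTTTGTCATTGATTTTTGTA,2']
records = list(expand_reads(example))
for header, seq in records:
    print(header)
    print(seq)
```

The second row has a count of 2, so it produces two records (`>2_1` and `>2_2`) carrying the same sequence.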

Original file:

cat GSM3124755_WTB_PARE.csv | head
GATCTTTCGAACTTTCCCAAC,1
ACTCTCTGCACTAAACAAAA,1
TTTTGTCATTGATTTTTGTA,4
GCAATCGAAATTCTCTGACG,1
GTAGTGACGAAAGCTGGCTCC,1
ATTACAGCTTCTGATGTCTT,4
CATCTTGGTCATGTCTTTGA,1
CATACAATATGGAGATGAAG,1
CCGACTTTGAGGGAGTTCGT,1
TACATTGGTGTTGGTACTGT,1

Python script:

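# Expand each "sequence,count" row into `count` FASTA records.
with open('GSM3124755_WTB_PARE.csv', 'r') as fr, \
        open('GSM3124755_WTB_PARE.fas', 'w') as fw:
    # s is the 1-based row number; it becomes the first part of each header.
    for s, line in enumerate(fr, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. a trailing empty line at end of file
        seq, count = line.split(',')
        # Write one record per copy, named >row_copy (e.g. >3_2).
        for i in range(int(count)):
            fw.write('>' + str(s) + '_' + str(i + 1) + '\n' + seq + '\n')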

Output:
cat GSM3124755_WTB_PARE.fas | head -n 32

>1_1
GATCTTTCGAACTTTCCCAAC
>2_1
ACTCTCTGCACTAAACAAAA
>3_1
TTTTGTCATTGATTTTTGTA
>3_2
TTTTGTCATTGATTTTTGTA
>3_3
TTTTGTCATTGATTTTTGTA
>3_4
TTTTGTCATTGATTTTTGTA
>4_1
GCAATCGAAATTCTCTGACG
>5_1
GTAGTGACGAAAGCTGGCTCC
>6_1
ATTACAGCTTCTGATGTCTT
>6_2
ATTACAGCTTCTGATGTCTT
>6_3
ATTACAGCTTCTGATGTCTT
>6_4
ATTACAGCTTCTGATGTCTT
>7_1
CATCTTGGTCATGTCTTTGA
>8_1
CATACAATATGGAGATGAAG
>9_1
CCGACTTTGAGGGAGTTCGT
>10_1
TACATTGGTGTTGGTACTGT
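
A quick sanity check after the conversion is that the number of FASTA records equals the sum of the count column. The sketch below is one way to do that; the helper names `total_reads` and `fasta_records` are hypothetical, and the demo runs on small temporary files that stand in for the real data, so the real file names would need to be substituted.

```python
import os
import tempfile


def total_reads(csv_path):
    """Sum the count column of a 'SEQUENCE,COUNT' file."""
    total = 0
    with open(csv_path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                total += int(line.split(',')[1])
    return total


def fasta_records(fasta_path):
    """Count header lines (those starting with '>') in a FASTA file."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith('>'))


# Self-contained demo on temporary stand-in files (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'demo.csv')
    fasta_path = os.path.join(tmp, 'demo.fas')
    with open(csv_path, 'w') as handle:
        handle.write('GATCTTTCGAACTTTCCCAAC,1\nTTTTGTCATTGATTTTTGTA,4\n')
    with open(fasta_path, 'w') as handle:
        handle.write('>1_1\nGATCTTTCGAACTTTCCCAAC\n')
        for copy in range(1, 5):
            handle.write(f'>2_{copy}\nTTTTGTCATTGATTTTTGTA\n')
    expected = total_reads(csv_path)
    found = fasta_records(fasta_path)

print(expected, found)
```

If the two numbers differ on the real files, the most likely causes are a header row or malformed line in the CSV, or an interrupted write of the FASTA file.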