Converting a read-count file to a FASTA file (redundant reads)
阿新 • Published: 2018-11-21
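Much of the sequencing data downloaded from NCBI has already had its adapters trimmed and has been collapsed into a read-count format: on each line the first column is the read sequence and the second column is the number of times that read occurs. We need to convert it to FASTA format, writing every copy of each read as its own sequence record.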
Original file:
cat GSM3124755_WTB_PARE.csv | head
GATCTTTCGAACTTTCCCAAC,1
ACTCTCTGCACTAAACAAAA,1
TTTTGTCATTGATTTTTGTA,4
GCAATCGAAATTCTCTGACG,1
GTAGTGACGAAAGCTGGCTCC,1
ATTACAGCTTCTGATGTCTT,4
CATCTTGGTCATGTCTTTGA,1
CATACAATATGGAGATGAAG,1
CCGACTTTGAGGGAGTTCGT,1
TACATTGGTGTTGGTACTGT,1
Python script:
# Expand each "sequence,count" line into `count` FASTA records.
with open('GSM3124755_WTB_PARE.csv', 'r') as fr, open('GSM3124755_WTB_PARE.fas', 'w') as fw:
    # s is the 1-based line number of the input file
    for s, line in enumerate(fr, start=1):
        seq, count = line.strip().split(',')[:2]
        # Header format: >lineNumber_copyNumber
        for i in range(int(count)):
            fw.write('>' + str(s) + '_' + str(i + 1) + '\n' + seq + '\n')
Output:
cat GSM3124755_WTB_PARE.fas | head -n 32
>1_1
GATCTTTCGAACTTTCCCAAC
>2_1
ACTCTCTGCACTAAACAAAA
>3_1
TTTTGTCATTGATTTTTGTA
>3_2
TTTTGTCATTGATTTTTGTA
>3_3
TTTTGTCATTGATTTTTGTA
>3_4
TTTTGTCATTGATTTTTGTA
>4_1
GCAATCGAAATTCTCTGACG
>5_1
GTAGTGACGAAAGCTGGCTCC
>6_1
ATTACAGCTTCTGATGTCTT
>6_2
ATTACAGCTTCTGATGTCTT
>6_3
ATTACAGCTTCTGATGTCTT
>6_4
ATTACAGCTTCTGATGTCTT
>7_1
CATCTTGGTCATGTCTTTGA
>8_1
CATACAATATGGAGATGAAG
>9_1
CCGACTTTGAGGGAGTTCGT
>10_1
TACATTGGTGTTGGTACTGT
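If the same conversion is needed for several files, the logic can be wrapped in small functions. The following is only a sketch, not part of the original script: the names `expand_counts` and `write_fasta` are hypothetical, the input is assumed to be comma-separated with no header row, and blank lines are skipped without being numbered. It runs on an in-memory sample so that the result can be checked without touching any files.

```python
import io


def expand_counts(lines):
    """Yield (header, sequence) pairs, one pair per copy of each read.

    Assumes each non-blank line looks like "SEQUENCE,COUNT".
    """
    record = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored, not numbered
        seq, count = line.split(',')[:2]
        record += 1
        # Header format follows the script above: >recordNumber_copyNumber
        for copy in range(1, int(count) + 1):
            yield f'>{record}_{copy}', seq


def write_fasta(lines, out):
    """Write the expanded reads to a file-like object and return the record count."""
    written = 0
    for header, seq in expand_counts(lines):
        out.write(f'{header}\n{seq}\n')
        written += 1
    return written


# First three lines of the example file, used as in-memory test data.
sample = [
    'GATCTTTCGAACTTTCCCAAC,1\n',
    'ACTCTCTGCACTAAACAAAA,1\n',
    'TTTTGTCATTGATTTTTGTA,4\n',
]
buffer = io.StringIO()
total = write_fasta(sample, buffer)
print(total)  # number of FASTA records written
print(buffer.getvalue(), end='')
```

The value returned by `write_fasta` should equal the sum of the count column, which gives a quick sanity check on the output. For real files, pass open file handles in place of `sample` and `buffer`.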