生物資訊資料格式:fastq格式
文章目錄
- 格式說明
- 例項演練
- 判斷fastq序列編碼是Phred33(Illumina1.8+) or Phred64(Illumina1.3+)
- fastq轉換fasta格式
- Linux 操作fastq
- 獲取資料
- 統計reads_1.fq檔案中共有多少條序列資訊
- 輸出reads_1.fq檔案中的識別符號(即以@開頭的那一行)
- 輸出reads_1.fq檔案中所有序列的資訊(即每個序列的第二行)
- 輸出reads_1.fq檔案中‘+’及其後面的描述資訊(即每個序列的第三行)
- 輸出reads_1.fq檔案中質量值資訊(即每個序列的第四行)
- 計算reads_1.fq檔案含有N鹼基的reads個數
- 計算reads_1.fq檔案裡面的序列鹼基總數
- 計算reads_1.fq檔案中所有reads的N鹼基總數
- 計算reads_1.fq中測序鹼基質量值恰好為Q20的個數
- 計算reads_1.fq中測序鹼基質量值恰好為Q30的個數
- 計算reads_1.fq中所以序列的第一位鹼基的ATCGNactg分佈情況
- 將reads_1.fq轉為reads_1.fa檔案(即將fastq轉化為fasta)
- 統計上述reads_1.fa檔案中共有多少條序列
- 計算reads_1.fa檔案中總的鹼基序列的GC數量
- 刪除reads_1.fa檔案中每條序列的N鹼基
- 刪除reads_1.fa檔案中含有N鹼基的序列
- 刪除reads_1.fa檔案中短於65bp的鹼基序列
- 刪除reads_1.fa檔案每條序列的前後五個鹼基
- 刪除reads_1.fa檔案中的長於125bp的序列
- 檢視reads_1.fq中每條序列的第一位鹼基的質量值的平均值
格式說明
FASTQ檔案中每個序列通常有四行:
1.第一行:必須以“@”開頭,後面跟著唯一的序列ID識別符號,然後跟著可選的序列描述內容,識別符號與描述內容用空格分開;
2.第二行:序列字元(核酸為[AGCTN]+,蛋白為氨基酸字元);
3.第三行:必須以“+”開頭,後面跟著可選的ID識別符號和可選的描述內容,如果“+”後面有內容,該內容必須與第一行“@”後的內容相同;
4.第四行:鹼基質量字元,每個字元對應第二行相應位置鹼基或氨基酸的質量,該字元可以按一定規則轉換為鹼基質量得分,鹼基質量得分可以反映該鹼基的錯誤率。這一行的字元數與第二行中的字元數必須相同。
檢視fatsq檔案
!cat ./data/test1.fq
@HWI-ST1223:80:D1FMTACXX:2:1101:1243:2213 1:N:0:AGTCAA
TCTGTGTAGCCNTGGCTGTCCTGGAACTCACTTTGTAGACCAGGCTGGCATGCACCACCACNNNCGGCTCATTTGTCTTTNNTTTTTGTTTTGTTCTGTA
+
BCCFFFFFFHH#4AFHIJJJJJJJJJJJJJJJJJIJIJJJJJGHIJJJJJJJJJJJJJIIJ###--5ABECFFDDEEEEE##,[email protected]?CDD<AD>C:@>
@HWI-ST1223:80:D1FMTACXX:2:1101:1375:2060 1:N:0:AGTCAA
NTGCTGAGCCACGACAAGGATCCCAGAGGGCCNAGCCCTGCATCTTGTATGGACCAGTTACNCATCAAAAGAGACTACTGTAGGCACCATCAATCAGATC
+
#1:DDDD;?CFFHDFEEIGIIIIIIG;DHFGG#)0?BFBDHBFF<FCFEFD;@[email protected]=7?E#,,,;=(>3;=;;C>[email protected]
@HWI-ST1223:80:D1FMTACXX:2:1101:1383:2091 1:N:0:AGTCAA
NGTTCGTGTGGAACCTGGCGCTAAACCATTCGTAGACGACCTGCTTCTGGGTCGGGGTTTCGTACGTAGCAGAGCAGCTCCCTCGCTGCGATCTATTGAA
+
#1=DDFDFHHHHHJGJJJJJJJJJJJJJJJIJIGDHIHIGIJJJJJJJIIIGHHFDD3>BDDBDDDDDDDDDDBDCCBDDDDDDDDDDDBBDDDDEEACD
@HWI-ST1223:80:D1FMTACXX:2:1101:1452:2138 1:N:0:AGTCAA
NTCTAGGAGGTCTAGAAAGCCCAGGCCACCGGTACAAACATCAAGGGTGTTACGGATGTGCCGCTCTGAACCTCCAGGACGACTTTGATTTCAACTACAA
+
#[email protected]DDDDDDDDDDCDDDDDDDDDDDCCCEDEDDDDDDD
每個鹼基的質量值與錯誤率之間的關係:
其中P代表該鹼基被測序錯誤的概率,如果該鹼基測序出錯的概率為0.001,則Q應該為30,那麼30+33=63,那麼63對應的ASCii碼為“?”,則在第四行中該鹼基對應的質量代表值即為“?”
print(ord("?"))
63
一般地,鹼基質量從0-40,既ASCii碼為從 “!”(0+33)到“I”(40+33)。以上是sanger中心採用記錄read測序質量的方法
對於每個鹼基的質量編碼標示,不同的軟體採用不同的方案,目前有5種方案:
-
Sanger,Phred quality score,值的範圍從0到92,對應的ASCII碼從33到126,但是對於測序資料(raw read data)質量得分通常小於60,序列拼接或者mapping可能用到更大的分數。
-
Solexa/Illumina 1.0, Solexa/Illumina quality score,值的範圍從-5到63,對應的ASCII碼從59到126,對於測序資料,得分一般在-5到40之間;
-
Illumina 1.3+,Phred quality score,值的範圍從0到62對應的ASCII碼從64到126,低於測序資料,得分在0到40之間;
-
Illumina 1.5+,Phred quality score,但是0到2作為另外的標示,詳見http://solexaqa.sourceforge.net/questions.htm#illumina
-
Illumina 1.8+
ASCII值小於等於58(相應的質量得分小於等於25)對應的字元只有在Phred+33的編碼中被使用,所有Phred+64所使用的字元的ASCII值都大於等於59。在通常情況下,ASCII值大於等於74的字元只出現在Phred+64中。利用這些資訊即可在程式中進行判斷
例項演練
判斷fastq序列編碼是Phred33(Illumina1.8+) or Phred64(Illumina1.3+)
def fq_phred_check(inputfile):
"""判斷fastq序列編碼"""
fastq_file = open(inputfile,'r',encoding='utf-8')
count = 1
for line in fastq_file:
line_strip = line.strip()
if count % 4 == 0:
for each in line_strip:
ASCII_each = ord(each)
if ASCII_each > 75:
phred_value = 64
break
elif ASCII_each < 58:
phred_value = 33
break
count += 1
fastq_file.close()
return phred_value
inputfile = './data/test1.fq'
phred_value = fq_phred_check(inputfile)
print(phred_value)
33
fastq轉換fasta格式
def fastq_fasta(inputfile):
"""將fastq轉換成fasta序列格式"""
fastq_file = open(inputfile,'r',encoding='utf-8')
out_file_name = inputfile.strip().split('/')[-1].split('.')[0] + '.fasta'
output_fasta = open(out_file_name,'w',encoding='utf-8')
i = 0
for line in fastq_file:
if i % 4 == 0:
line_new = line.strip().split('\n')[0].replace('@','>')
output_fasta.write(line_new + '\n')
elif (i - 1) % 4 == 0:
output_fasta.write(line)
i = i + 1
print("fastq transform fasta success", out_file_name)
fastq_file.close()
output_fasta.close()
def main():
inputfile = './data/test1.fq'
fastq_fasta(inputfile)
if __name__ == '__main__':
main()
fastq transform fasta success test1.fasta
Linux 操作fastq
獲取資料
mkdir -p ~/biosoft
cd ~/biosoft
wget https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.3.4.3/bowtie2-2.3.4.3-linux-x86_64.zip
unzip bowtie2-2.3.4.3-linux-x86_64.zip
cd ~/biosoft/bowtie2-2.3.4.3-linux-x86_64/example/reads
統計reads_1.fq檔案中共有多少條序列資訊
[sunchengquan 08:07:20 ~/local/app/bowtie2-2.2.4/example/reads]
$ ll
總用量 8.4M
-rwxrwx--- 1 sunchengquan sunchengquan 4.0M 10月 23 2014 longreads.fq
-rwxrwx--- 1 sunchengquan sunchengquan 2.2M 10月 23 2014 reads_1.fq
-rwxrwx--- 1 sunchengquan sunchengquan 2.2M 10月 23 2014 reads_2.fq
-rwxrwx--- 1 sunchengquan sunchengquan 9.1K 10月 23 2014 simulate.pl
[sunchengquan 08:07:27 ~/local/app/bowtie2-2.2.4/example/reads]
$ wc reads_1.fq
40000 40000 2285692 reads_1.fq
[sunchengquan 08:07:42 ~/local/app/bowtie2-2.2.4/example/reads]
$ wc -l reads_1.fq
40000 reads_1.fq
[sunchengquan 09:06:02 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo $[`wc -l reads_1.fq |cut -d' ' -f1`/4]
10000
[sunchengquan 09:06:52 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo $((`wc -l reads_1.fq |cut -d' ' -f1`/4))
10000
[sunchengquan 09:07:10 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo $(expr `wc -l reads_1.fq |cut -d' ' -f1`/4)
40000/4
[sunchengquan 09:07:35 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo $(expr `wc -l reads_1.fq |cut -d' ' -f1` / 4)
10000
[sunchengquan 09:12:17 ~/local/app/bowtie2-2.2.4/example/reads]
$ let c=`wc -l reads_1.fq |cut -d' ' -f1`/4
[sunchengquan 09:12:36 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo $c
10000
[sunchengquan 09:16:11 ~/local/app/bowtie2-2.2.4/example/reads]
$ a=`wc -l reads_1.fq |cut -d' ' -f1`
[sunchengquan 09:16:23 ~/local/app/bowtie2-2.2.4/example/reads]
$ b=4
[sunchengquan 09:16:27 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo `expr $a / $b`
10000
[sunchengquan 09:18:03 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo `expr $a / 4`
10000
輸出reads_1.fq檔案中的識別符號(即以@開頭的那一行)
[sunchengquan 09:18:22 ~/local/app/bowtie2-2.2.4/example/reads]
$ grep '^@' reads_1.fq |head -3
@r1
@r2
@r3
[sunchengquan 09:21:59 ~/local/app/bowtie2-2.2.4/example/reads]
$ grep '^@' reads_1.fq |cut -f1 |cut -c1|head -3
@
@
@
[sunchengquan 09:22:38 ~/local/app/bowtie2-2.2.4/example/reads]
$ cat reads_1.fq |paste - - - - |wc
10000 40000 2285692
[sunchengquan 09:21:51 ~/local/app/bowtie2-2.2.4/example/reads]
$ cat reads_1.fq |paste - - - - |cut -f1|cut -c1|head -3
@
@
@
[sunchengquan 09:27:19 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==1) {print $0}}' reads_1.fq|head -3
@r1
@r2
@r3
[sunchengquan 09:28:47 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==1) print $0}' reads_1.fq|head -3
@r1
@r2
@r3
[sunchengquan 09:31:38 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==1){print}}' reads_1.fq|head -3
@r1
@r2
@r3
[sunchengquan 09:32:28 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{print NR}' reads_1.fq|head -3
1
2
3
輸出reads_1.fq檔案中所有序列的資訊(即每個序列的第二行)
[sunchengquan 09:34:51 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print $0 }}' reads_1.fq|head -1
TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG
輸出reads_1.fq檔案中‘+’及其後面的描述資訊(即每個序列的第三行)
[sunchengquan 09:35:06 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==3){print $0 }}' reads_1.fq|head -3
+
+
+
輸出reads_1.fq檔案中質量值資訊(即每個序列的第四行)
[sunchengquan 09:36:24 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==0){print $0 }}' reads_1.fq|head -1
+"@6<:27(F&5)9)"B:%B+A-%5A?2$HCB0B+0=D<7E/<.03#[email protected]==?C"7>;))%;,3-$.A06+<-1/@@?,26">=?*@'0;$:;??G+:#+(A?9+10!8!?()?7C>
計算reads_1.fq檔案含有N鹼基的reads個數
[sunchengquan 09:39:13 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print $0 }}' reads_1.fq|grep N|wc
6429 6429 782897
[sunchengquan 09:41:03 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print $0 }}' reads_1.fq|grep -c N
6429
計算reads_1.fq檔案裡面的序列鹼基總數
[sunchengquan 12:46:57 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print}}' reads_1.fq |grep -o '[ATGCN]' |wc
1088399 1088399 2176798
[sunchengquan 12:47:12 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print length}}' reads_1.fq |head -1
122
[sunchengquan 12:49:36 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print length($0)}}' reads_1.fq |head -1
122
[sunchengquan 12:58:27 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print length($0)}}' reads_1.fq |paste -s -d +|bc
1088399
計算reads_1.fq檔案中所有reads的N鹼基總數
[sunchengquan 12:58:50 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print}}' reads_1.fq |grep -o '[N]' |wc
26001 26001 52002
計算reads_1.fq中測序鹼基質量值恰好為Q20的個數
[sunchengquan 17:14:00 ~/local/app/bowtie2-2.2.4/example/reads]
$ python -c 'print (chr(33+20))'
5
[sunchengquan 17:15:32 ~/local/app/bowtie2-2.2.4/example/reads]
$ python -c 'print (chr(33+30))'
?
[sunchengquan 17:15:43 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==0){print}}' reads_1.fq |grep -o '[5]'|wc -l
21369
[sunchengquan 17:17:18 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==0){print}}' reads_1.fq |grep -o '5'|wc -l
21369
計算reads_1.fq中測序鹼基質量值恰好為Q30的個數
[sunchengquan 17:17:27 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==0){print}}' reads_1.fq |grep -o '?'|wc -l
21574
計算reads_1.fq中所以序列的第一位鹼基的ATCGNactg分佈情況
[sunchengquan 17:26:36 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print}}' reads_1.fq |cut -c1|sort |uniq -c
2184 A
2203 C
2219 G
1141 N
2253 T
將reads_1.fq轉為reads_1.fa檔案(即將fastq轉化為fasta)
[sunchengquan 17:33:13 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==1){print ">"$0} else if(NR%4==2){print}}' reads_1.fq |head -2
>@r1
TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG
[sunchengquan 17:45:57 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==1){print ">"substr($0,2)} else if(NR%4==2){print}}' reads_1.fq |head -2
>r1
TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG
[sunchengquan 17:46:36 ~/local/app/bowtie2-2.2.4/example/reads]
$ cat reads_1.fq |paste - - - - |cut -f1,2|tr '\t' '\n'|head -2
@r1
TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG
[sunchengquan 17:48:14 ~/local/app/bowtie2-2.2.4/example/reads]
$ cat reads_1.fq |paste - - - - |cut -f1,2|tr '\t' '\n'|tr '@' '>'|head -2
>r1
TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG
統計上述reads_1.fa檔案中共有多少條序列
[sunchengquan 17:50:29 ~/local/app/bowtie2-2.2.4/example/reads]
$