1. 程式人生 > >基因組測序模擬

基因組測序模擬

基因組測序模擬

一、摘要

通過熟悉已有的基因組測序模擬和評估程式,加深全基因組鳥槍法測序原理的理解,並且能夠編寫程式模擬全基因組鳥槍法測序,理解覆蓋度、測序深度、拷貝數等概念,設定測序相關引數,生成單端/雙端測序結果檔案

二、材料和方法

1、硬體平臺

處理器:Intel(R) Core(TM)i7-4710MQ CPU @ 2.50GHz
安裝記憶體(RAM):16.0GB

2、系統平臺

Windows 8.1,Ubuntu

3、軟體平臺

  1. art_454
  2. GenomeABC http://crdd.osdd.net/raghava/genomeabc/
  3. Python3.5
  4. Biopython

4、資料庫資源

NCBI資料庫:https://www.ncbi.nlm.nih.gov/

5、研究物件

酵母基因組Saccharomyces cerevisiae S288c (assembly R64)
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz

6、方法

  1. art_454的使用
    首先至art系列軟體的官網,下載軟體,在ubuntu系統安裝,然後閱讀相關引數設定的幫助文件,執行程式。
  2. GenomeABC
    進入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),輸入引數,獲得模擬測序結果。
  3. 程式設計模擬測序
    下載安裝python,並且安裝biopython擴充套件模組,編寫程式,模擬單端/雙端測序。

三、結果

1、art_454的執行結果

無引數art_454執行,閱讀幫助文件
無引數art_454執行
圖表 1無引數art_454執行
對酵母基因組進行基因組單端測序模擬,FOLD_COVERAGE設為20,即覆蓋度為20.
下圖為模擬單端測序,程式執行過程及結果
art454單端測序
圖表 2 art454單端測序
art454單端模擬結果
圖表 3 art454單端模擬結果
雙端測序模擬,FOLD_COVERAGE設為20,即覆蓋度為20;MEAN_FRAG_LEN設為1500,即平均片段長度為1500;STD_DEV設為20,即長度的標準差為20
下圖為模擬雙端測序,程式執行過程及結果
art454雙端測序


圖表 4 art454雙端測序
art454雙端模擬結果
圖表 5 art454雙端模擬結果
2、GenomeABC
下圖為設定引數頁面
設定引數頁面
下圖為結果下載頁面
結果下載頁面
圖表 6 結果下載頁面
3、程式設計模擬測序結果
拷貝數是這裡的N值;覆蓋度是m,測序深度是巨集觀的量,在這裡與覆蓋度意思相同,就是測序儀10X,20X。
單端測序
程式模擬單端測序
圖表 7 程式模擬單端測序
雙端測序
程式模擬雙端測序
圖表 8 程式模擬雙端測序
測序結果
結果檔案
圖表 9 結果檔案

因為期望片段長度是600bp,在片段長度區間200-1000bp內,所以大部分的片段都沒有刪除。
測序結果統計表

測序方式 基因組大小(bp) 片段長度區間 (bp) N值 期望片段長度 克隆保留率 片段數量 Reads長度範圍(bp) Reads總數量 Reads總長度 覆蓋度(m值) 理論丟失率(e-m) 覆蓋率(1-e-m)
單端 12157kb 200-1000 10 600 0.95 107378 50-100 101968 7645.541kb 0.62889 0.53318 0.46682
單端 12157kb 200-1000 20 600 0.95 213722 50-100 202996 15227.882kb 1.25259 0.28576 0.71424
雙端 12157kb 200-1000 10 600 0.95 106704 50-100 202770 15212.662kb 1.25134 0.28612 0.71388
雙端 12157kb 200-1000 20 600 0.95 214212 50-100 407186 30534.265kb 2.51164 0.08114 0.91886

四、討論和結論

程式執行方法

在類的構造方法init()中,調整引數。
Averagefragmentlength為片段平均的長度;
minfragmentlength和maxfragmentlength是保留片段的範圍;
cloneRetainprobability是克隆的保留率;
minreadslength和maxreadslength是測序reads的長度範圍

模擬測序的諸多方法都封裝成了Sequencing類,只需要建立類,並呼叫singlereadsequencing()和pairreadsequencing()方法,傳入檔名的引數即可。

附錄

from Bio import SeqIO
from math import exp
import random

class Sequencing:
    # N代表拷貝份數
    def __init__(self)
        self.fragmentList = []
        self.readsID = 1
        self.readsList = []
        self.averagefragmentlength = 650
        self.minfragmentlength = 500
        self.maxfragmentlength = 800
        self.cloneRetainprobability = 1
        self.minreadslength = 50
        self.maxreadslength = 150
        self.N = 10
        self.genomeLength = 0
        self.allreadslength = 0

    # 生成斷裂點
    def generatebreakpoint(self, seqlen, averageLength):
        # 假設平均每500bp 產生一個斷裂點(averageLength = 500),通過隨機函式生成seqlen/500個隨機斷裂點(1到seqlen之間的隨機整數)
        breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
        breakpoint.append(seqlen)
        breakpoint.append(0)
        # 把隨機斷裂點從小到大排序
        breakpoint.sort()
        return breakpoint

    # 沿斷裂點打斷基因組,並刪除不符合長度要求的序列片段,定義片段範圍:200-1000bp
    def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
        for i in range(len(breakpoint) - 1):
            fragment = seq[breakpoint[i]:breakpoint[i + 1]]
            if maxfragmentlength > len(fragment) > minfragmentlength:
                self.fragmentList.append(fragment)
        return self.fragmentList

    # 模擬克隆時的隨機丟失情況,random.random()生成0-1的保留概率
    def clonefragment(self, fragmentList, cloneRetainprobability):
        clonedfragmentList = []
        Lossprobability = [random.random() for _ in range(len(fragmentList))]
        for i in range(len(fragmentList)):
            if Lossprobability[i] <= cloneRetainprobability:
                clonedfragmentList.append(fragmentList[i])
        return clonedfragmentList

    # 模擬單端測序,並修改reads的ID號
    def singleread(self, clonedfragmentList):
        for fragment in clonedfragmentList:
            fragment.id = ""
            fragment.name = ""
            fragment.description = fragment.description[12:].split(",")[0]
            fragment.description = str(self.readsID) + "." + fragment.description
            self.readsID += 1
            readslength = random.randint(self.minreadslength, self.maxreadslength)
            self.allreadslength += readslength
            self.readsList.append(fragment[:readslength])

    def singlereadsequencing(self, genomedata, sequencingResult):
        for seq_record in SeqIO.parse(genomedata, "fasta"):
            seqlen = len(seq_record)
            self.genomeLength += seqlen
            for i in range(self.N):
                # 生成斷裂點
                breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
                # 沿斷裂點打斷基因組
                self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
        # 模擬克隆時的隨機丟失情況
        clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
        # 模擬單端測序
        self.singleread(clonedfragmentList)
        SeqIO.write(self.readsList, sequencingResult, "fasta")

    def pairread(self, clonedfragmentList):
        for fragment in clonedfragmentList:
            fragment.id = ""
            fragment.name = ""
            description = fragment.description[12:].split(",")[0]
            fragment.description = str(self.readsID) + "." + description
            readslength = random.randint(self.minreadslength, self.maxreadslength)
            self.allreadslength += readslength
            self.readsList.append(fragment[:readslength])

            readslength = random.randint(self.minreadslength, self.maxreadslength)
            self.allreadslength += readslength

            fragmentcomplement = fragment.reverse_complement()
            fragmentcomplement.id = ""
            fragmentcomplement.name = ""
            fragmentcomplement.description = str(self.readsID) + "." + description
            self.readsList.append(fragmentcomplement[:readslength])

            self.readsID += 1

    def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
        for seq_record in SeqIO.parse(genomedata, "fasta"):
            seqlen = len(seq_record)
            self.genomeLength += seqlen
            for i in range(self.N):
                # 生成斷裂點
                breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
                # 沿斷裂點打斷基因組
                self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
        # 模擬克隆時的隨機丟失情況
        clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
        # 模擬雙端測序
        self.pairread(clonedfragmentList)
        readsList_1 = [self.readsList[i] for i in range(len(self.readsList)) if i % 2 == 0]
        readsList_2 = [self.readsList[i] for i in range(len(self.readsList)) if i % 2 == 1]
        SeqIO.write(readsList_1, sequencingResult_1, "fasta")
        SeqIO.write(readsList_2, sequencingResult_2, "fasta")

    def resultsummary(self):
        print("基因組長度:" + str(self.genomeLength / 1000) + "kb")
        print("片段長度區間:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
        print("N值:" + str(self.N))
        print("期望片段長度:" + str(self.averagefragmentlength))
        print("克隆保留率:" + str(self.cloneRetainprobability))
        print("片段數量:" + str(len(self.fragmentList)))
        print("reads長度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
        print("reads總數量:" + str(len(self.readsList)))
        print("reads總長度:" + str(self.allreadslength / 1000) + "kb")
        m = self.allreadslength / self.genomeLength
        print("覆蓋度(m值):" + str(round(m, 5)))
        print("理論丟失率(e^-m):" + str(round(exp(-m), 5)))
        print("覆蓋率(1-e^-m):" + str(round(1 - exp(-m), 5)))
# -------------------------------------------主程式-------------------------------------------
# 模擬單端測序
sequencingObj = Sequencing()
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
sequencingObj.resultsummary()

# 模擬雙端測序
sequencingObj = Sequencing()
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
sequencingObj.resultsummary()
from Bio import SeqIO
from math import exp
import random

class Sequencing:
    # N代表拷貝份數
    def __init__(self):
        self.fragmentList = []
        self.readsID = 1
        self.readsList = []
        self.averagefragmentlength = 650
        self.minfragmentlength = 500
        self.maxfragmentlength = 800
        self.cloneRetainprobability = 1
        self.minreadslength = 50
        self.maxreadslength = 150
        self.N = 10
        self.genomeLength = 0
        self.allreadslength = 0

    # 生成斷裂點
    def generatebreakpoint(self, seqlen, averageLength):
        # 假設平均每500bp 產生一個斷裂點(averageLength = 500),通過隨機函式生成seqlen/500個隨機斷裂點(1到seqlen之間的隨機整數)
        breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
        breakpoint.append(seqlen)
        breakpoint.append(0)
        # 把隨機斷裂點從小到大排序
        breakpoint.sort()
        return breakpoint

    # 沿斷裂點打斷基因組,並刪除不符合長度要求的序列片段,定義片段範圍:200-1000bp
    def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
        for i in range(len(breakpoint) - 1):
            fragment = seq[breakpoint[i]:breakpoint[i + 1]]
            if maxfragmentlength > len(fragment) > minfragmentlength:
                self.fragmentList.append(fragment)
        return self.fragmentList

    # 模擬克隆時的隨機丟失情況,random.random()生成0-1的保留概率
    def clonefragment(self, fragmentList, cloneRetainprobability):
        clonedfragmentList = []
        Lossprobability = [random.random() for _ in range(len(fragmentList))]
        for i in range(len(fragmentList)):
            if Lossprobability[i] <= cloneRetainprobability:
                clonedfragmentList.append(fragmentList[i])
        return clonedfragmentList

    # 模擬單端測序,並修改reads的ID號
    def singleread(self, clonedfragmentList):
        for fragment in clonedfragmentList:
            fragment.id = ""
            fragment.name = ""
            fragment.description = fragment.description[12:].split(",")[0]
            fragment.description = str(self.readsID) + "." + fragment.description
            self.readsID += 1
            readslength = random.randint(self.minreadslength, self.maxreadslength)
            self.allreadslength += readslength
            self.readsList.append(fragment[:readslength])

    def singlereadsequencing(self, genomedata, sequencingResult):
        for seq_record in SeqIO.parse(genomedata, "fasta"):
            seqlen = len(seq_record)
            self.genomeLength += seqlen
            for i in range(self.N):
                # 生成斷裂點
                breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
                # 沿斷裂點打斷基因組
                self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
        # 模擬克隆時的隨機丟失情況
        clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
        # 模擬單端測序
        self.singleread(clonedfragmentList)
        SeqIO.write(self.readsList, sequencingResult, "fasta")

    def pairread(self, clonedfragmentList):
        for fragment in clonedfragmentList:
            fragment.id = ""
            fragment.name = ""
            description = fragment.description[12:].split(",")[0]
            fragment.description = str(self.readsID) + "." + description
            readslength = random.randint(self.minreadslength, self.maxreadslength)
            self.allreadslength += readslength
            self.readsList.append(fragment[:readslength])

            readslength = random.randint(self.minreadslength, self.maxreadslength)
            self.allreadslength += readslength

            fragmentcomplement = fragment.reverse_complement()
            fragmentcomplement.id = ""
            fragmentcomplement.name = ""
            fragmentcomplement.description = str(self.readsID) + "." + description
            self.readsList.append(fragmentcomplement[:readslength])

            self.readsID += 1

    def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
        for seq_record in SeqIO.parse(genomedata, "fasta"):
            seqlen = len(seq_record)
            self.genomeLength += seqlen
            for i in range(self.N):
                # 生成斷裂點
                breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
                # 沿斷裂點打斷基因組
                self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
        # 模擬克隆時的隨機丟失情況
        clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
        # 模擬雙端測序
        self.pairread(clonedfragmentList)
        readsList_1 = [self.readsList[i] for i in range(len(self.readsList)) if i % 2 == 0]
        readsList_2 = [self.readsList[i] for i in range(len(self.readsList)) if i % 2 == 1]
        SeqIO.write(readsList_1, sequencingResult_1, "fasta")
        SeqIO.write(readsList_2, sequencingResult_2, "fasta")

    def resultsummary(self):
        print("基因組長度:" + str(self.genomeLength / 1000) + "kb")
        print("片段長度區間:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
        print("N值:" + str(self.N))
        print("期望片段長度:" + str(self.averagefragmentlength))
        print("克隆保留率:" + str(self.cloneRetainprobability))
        print("片段數量:" + str(len(self.fragmentList)))
        print("reads長度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
        print("reads總數量:" + str(len(self.readsList)))
        print("reads總長度:" + str(self.allreadslength / 1000) + "kb")
        m = self.allreadslength / self.genomeLength
        print("覆蓋度(m值):" + str(round(m, 5)))
        print("理論丟失率(e^-m):" + str(round(exp(-m), 5)))
        print("覆蓋率(1-e^-m):" + str(round(1 - exp(-m), 5)))
# -------------------------------------------主程式-------------------------------------------
# 模擬單端測序
sequencingObj = Sequencing()
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
sequencingObj.resultsummary()

# 模擬雙端測序
sequencingObj = Sequencing()
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
sequencingObj.resultsummary()