
How much data should you sequence? How many G? How many reads? How do you convert between them?

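Key points:

lncRNAs are expressed at low levels, so tracking changes in lncRNA expression takes more sequencing than ordinary RNA-seq.

To also cover SNPs and low-abundance lncRNAs, you need to sequence deeper still.

So how much data do you actually need?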

 

Let's look at how the authoritative ENCODE guidelines treat sequencing depth for RNA-seq:

Standards, Guidelines and Best Practices for RNA-Seq V1.0 (June 2011)

The ENCODE Consortium

 

Sequencing depth.

The amount of sequencing needed for a given sample is determined by the goals of the experiment and the nature of the RNA sample. Experiments whose purpose is to evaluate the similarity between the transcriptional profiles of two polyA+ samples may require only modest depths of sequencing (e.g. 30M pair-end reads of length > 30NT, of which 20-25M are mappable to the genome or known transcriptome). Experiments whose purpose is discovery of novel transcribed elements and strong quantification of known transcript isoforms requires more extensive sequencing.

 

The ability to detect reliably low copy number transcripts/isoforms depends upon the depth of sequencing and on a sufficiently complex library. For experiments from a typical mammalian tissue or in which sensitivity of detection is important, a minimum depth of 100-200 M 2 x 76 bp or longer reads is currently recommended.

[Specialized studies in which the prevalence of different RNAs has been intentionally altered (e.g. “normalizing” using DSN) as part of sample preparation need more than the read amounts (>30M paired end reads) used for simple comparison (see above). Reasons for this include:

(1) overamplification of inserts as a result of an additional round of PCR after DSN and

(2) much more broad coverage given the nature of A(-) and low abundance transcripts.]

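In plain terms, the guidance above comes down to this:

Sequencing depth depends on the goal of the study.

Goal 1: the library is built by polyA selection (only transcripts with a polyA tail are sequenced, most of them protein-coding genes), and the aim is to compare transcriptional profiles between samples. About 30M paired-end reads longer than 30 nt are enough, of which 20-25M should map to the genome or the known transcriptome.

Goal 2: discovering novel transcripts and quantifying known isoforms (alternative splicing of a single gene produces several isoforms), while still detecting low-abundance transcripts and isoforms. This calls for 100-200M paired-end reads of 2 x 76 bp or longer.

lncRNA-seq falls into this second category.

Note: ENCODE covers human and mouse; other species are outside the scope of these recommendations.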

 

Separately, miRNA sequencing needs only about 10M single-end reads of 50 bp each.

ChIP-seq needs about 20M single-end reads of 50 bp each.

 

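Sales reps quote only a number of G and never the read count, so how do you convert reads into G?

It depends on the read length.

PE150, or 2*150, means paired-end sequencing with each read 150 bp long.

150 bp x 2 ends x number of reads = data volume

For example, for 50M reads: 150 bp x 2 ends x 50M reads = 15,000 Mb = 15 Gb

Note: in paired-end sequencing, one RNA fragment (which is what "read" counts in the formula above) is sequenced from both ends and yields 2 sequences.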

 

SE50, or 1*50, means single-end sequencing with each read 50 bp long.

50 bp x 1 end x number of reads = data volume

For example, for 20M reads: 50 bp x 1 end x 20M reads = 1,000 Mb = 1 Gb
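
The two conversions above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names `data_volume_gb` and `fragments_needed_m` are made up for this example, and the read count is taken in millions of fragments (so a paired-end fragment counts once, as in the note above).

```python
def data_volume_gb(fragments_m: float, read_length_bp: int, ends: int = 2) -> float:
    """Sequenced bases in Gb for a fragment count given in millions.

    ends=2 means paired-end (PE), ends=1 means single-end (SE).
    """
    if ends not in (1, 2):
        raise ValueError("ends must be 1 (single-end) or 2 (paired-end)")
    # millions of fragments x bp per read x ends = megabases; / 1000 = gigabases
    return fragments_m * read_length_bp * ends / 1000.0


def fragments_needed_m(target_gb: float, read_length_bp: int, ends: int = 2) -> float:
    """Inverse conversion: millions of fragments behind a quoted number of Gb."""
    if ends not in (1, 2):
        raise ValueError("ends must be 1 (single-end) or 2 (paired-end)")
    return target_gb * 1000.0 / (read_length_bp * ends)


print(data_volume_gb(50, 150, ends=2))      # PE150, 50M reads -> 15.0
print(data_volume_gb(20, 50, ends=1))       # SE50, 20M reads  -> 1.0
print(fragments_needed_m(15, 150, ends=2))  # 15 Gb of PE150   -> 50.0
```

The inverse function is the one to reach for when a quote arrives in G only: divide by the bases produced per fragment to see how many reads you are really buying.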

 

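One more thing: the G here counts bases (gigabases, Gb), which is not the same as the file size you see on disk (gigabytes, GB).

Sequencing companies usually deliver compressed FASTQ files, which contain a read ID, the bases, and a quality score for each base.

Xiaoha took one look at the file size and felt the data volume was too low. That is only a guess based on experience; to know exactly how much data was sequenced, run FastQC or RSeQC.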
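
If neither tool is at hand, a rough count can be obtained directly from the FASTQ file. The sketch below is not a replacement for FastQC or RSeQC; it assumes the common 4-lines-per-record FASTQ layout (no wrapped sequence lines), and `fastq_stats` is a hypothetical helper name chosen for this example.

```python
import gzip


def fastq_stats(path: str) -> tuple[int, int]:
    """Return (number of reads, number of bases) in a FASTQ or FASTQ.gz file.

    Assumes exactly 4 lines per record: ID, sequence, '+', qualities.
    """
    opener = gzip.open if path.endswith(".gz") else open
    n_reads = 0
    n_bases = 0
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the second line of each record is the sequence
                n_reads += 1
                n_bases += len(line.strip())
    return n_reads, n_bases


# Example usage (the file name is a placeholder):
# reads, bases = fastq_stats("sample_R1.fastq.gz")
# print(reads, "reads,", bases / 1e9, "Gb")
```

For paired-end data, run it on both the R1 and R2 files and add the base counts; the number of fragments is the read count of either file.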