Sequencing Data Simulation: The ART Sequence Generator

阿新 • Published: 2019-02-10

Introduction to the ART sequence generator

Official website 1.

Software download 2.

Using ART

Software setup

tar zxvf artbingreatsmokymountains041716linux64tgz.tgz 
cd art_bin_GreatSmokyMountains/
#art_illumina=~/congmin/software/art_bin_GreatSmokyMountains/art_illumina

Parameter settings

~/congmin/software/art_bin_GreatSmokyMountains$ art_illumina

    ====================ART====================
             ART_Illumina (2008-2016)          
          Q Version 2.5.1 (Apr 17, 2016)       
     Contact: Weichun Huang <whduke@gmail.com> 
    -------------------------------------------

===== USAGE =====

art_illumina [options] -ss <sequencing_system> -sam -i <seq_ref_file> -l <read_length> -f <fold_coverage> -o <outfile_prefix>
art_illumina [options] -ss <sequencing_system> -sam -i <seq_ref_file> -l <read_length> -c <num_reads_per_sequence> -o <outfile_prefix>
art_illumina [options] -ss <sequencing_system> -sam -i <seq_ref_file> -l <read_length> -f <fold_coverage> -m <mean_fragsize> -s <std_fragsize> -o <outfile_prefix>
art_illumina [options] -ss <sequencing_system> -sam -i <seq_ref_file> -l <read_length> -c <num_reads_per_sequence> -m <mean_fragsize> -s <std_fragsize> -o <outfile_prefix>

===== PARAMETERS =====

  -1   --qprof1   the first-read quality profile
  -2   --qprof2   the second-read quality profile
  -amp --amplicon amplicon sequencing simulation
  -c   --rcount   number of reads/read pairs to be generated per sequence/amplicon (not be used together with -f/--fcov)
  -d   --id       the prefix identification tag for read ID
  -ef  --errfree  indicate to generate the zero sequencing errors SAM file as well the regular one
                  NOTE: the reads in the zero-error SAM file have the same alignment positions
                  as those in the regular SAM file, but have no sequencing errors
  -f   --fcov     the fold of read coverage to be simulated or number of reads/read pairs generated for each amplicon
  -h   --help     print out usage information
  -i   --in       the filename of input DNA/RNA reference
  -ir  --insRate  the first-read insertion rate (default: 0.00009)
  -ir2 --insRate2 the second-read insertion rate (default: 0.00015)
  -dr  --delRate  the first-read deletion rate (default:  0.00011)
  -dr2 --delRate2 the second-read deletion rate (default: 0.00023)
  -l   --len      the length of reads to be simulated
  -m   --mflen    the mean size of DNA/RNA fragments for paired-end simulations
  -mp  --matepair indicate a mate-pair read simulation
  -M  --cigarM    indicate to use CIGAR 'M' instead of '=/X' for alignment match/mismatch
  -nf  --maskN    the cutoff frequency of 'N' in a window size of the read length for masking genomic regions
                  NOTE: default: '-nf 1' to mask all regions with 'N'. Use '-nf 0' to turn off masking
  -na  --noALN    do not output ALN alignment file
  -o   --out      the prefix of output filename
  -p   --paired   indicate a paired-end read simulation or to generate reads from both ends of amplicons
                  NOTE: art will automatically switch to a mate-pair simulation if the given mean fragment size >= 2000
  -q   --quiet    turn off end of run summary
  -qL  --minQ     the minimum base quality score
  -qU  --maxQ     the maxiumum base quality score
  -qs  --qShift   the amount to shift every first-read quality score by 
  -qs2 --qShift2  the amount to shift every second-read quality score by
                  NOTE: For -qs/-qs2 option, a positive number will shift up quality scores (the max is 93) 
                  that reduce substitution sequencing errors and a negative number will shift down 
                  quality scores that increase sequencing errors. If shifting scores by x, the error
                  rate will be 1/(10^(x/10)) of the default profile.
  -rs  --rndSeed  the seed for random number generator (default: system time in second)
                  NOTE: using a fixed seed to generate two identical datasets from different runs
  -s   --sdev     the standard deviation of DNA/RNA fragment size for paired-end simulations.
  -sam --samout   indicate to generate SAM alignment file
  -sp  --sepProf  indicate to use separate quality profiles for different bases (ATGC)
  -ss  --seqSys   The name of Illumina sequencing system of the built-in profile used for simulation
       NOTE: sequencing system ID names are:
            GA1 - GenomeAnalyzer I (36bp,44bp), GA2 - GenomeAnalyzer II (50bp, 75bp)
           HS10 - HiSeq 1000 (100bp),          HS20 - HiSeq 2000 (100bp),      HS25 - HiSeq 2500 (125bp, 150bp)
           HSXn - HiSeqX PCR free (150bp),     HSXt - HiSeqX TruSeq (150bp),   MinS - MiniSeq TruSeq (50bp)
           MSv1 - MiSeq v1 (250bp),            MSv3 - MiSeq v3 (250bp),        NS50 - NextSeq500 v2 (75bp)
===== NOTES =====

* ART by default selects a built-in quality score profile according to the read length specified for the run.

* For single-end simulation, ART requires input sequence file, outputfile prefix, read length, and read count/fold coverage.

* For paired-end simulation (except for amplicon sequencing), ART also requires the parameter values of
  the mean and standard deviation of DNA/RNA fragment lengths

===== EXAMPLES =====

 1) single-end read simulation
    art_illumina -ss HS25 -sam -i reference.fa -l 150 -f 10 -o single_dat

 2) paired-end read simulation
       art_illumina -ss HS25 -sam -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_dat

 3) mate-pair read simulation
       art_illumina -ss HS10 -sam -i reference.fa -mp -l 100 -f 20 -m 2500 -s 50 -o matepair_dat

 4) amplicon sequencing simulation with 5' end single-end reads 
    art_illumina -ss GA2 -amp -sam -na -i amp_reference.fa -l 50 -f 10 -o amplicon_5end_dat

 5) amplicon sequencing simulation with paired-end reads
       art_illumina -ss GA2 -amp -p -sam -na -i amp_reference.fa -l 50 -f 10 -o amplicon_pair_dat

 6) amplicon sequencing simulation with matepair reads
       art_illumina -ss MSv1 -amp -mp -sam -na -i amp_reference.fa -l 150 -f 10 -o amplicon_mate_dat

 7) generate an extra SAM file with zero-sequencing errors for a paired-end read simulation
       art_illumina -ss HSXn -ef -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_twosam_dat

 8) reduce the substitution error rate to one 10th of the default profile
       art_illumina -i reference.fa -qs 10 -qs2 10 -l 50 -f 10 -p -m 500 -s 10 -sam -o reduce_error

 9) turn off the masking of genomic regions with unknown nucleotides 'N'
       art_illumina -ss HS20 -nf 0  -sam -i reference.fa -p -l 100 -f 20 -m 200 -s 10 -o paired_nomask

 10) masking genomic regions with >=5 'N's within the read length 50
       art_illumina -ss HSXt -nf 5 -sam -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_maskN5
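
The -qs/-qs2 note in the help text says that shifting every quality score by x scales the substitution error rate to 1/(10^(x/10)) of the default profile. As a quick sanity check of that formula, the shell sketch below evaluates the factor with awk. The function name `qshift_factor` is made up for illustration and is not part of ART.

```shell
#!/bin/sh
# Hypothetical helper (not part of ART): print the factor by which the
# substitution error rate is scaled when quality scores are shifted by x,
# using the formula quoted in the help text: 1 / (10^(x/10)).
qshift_factor() {
    awk -v x="$1" 'BEGIN { printf "%.4f\n", 1 / (10 ^ (x / 10)) }'
}

qshift_factor 10    # shift up by 10: one tenth of the default error rate
qshift_factor -10   # shift down by 10: ten times the default error rate
```

This agrees with example 8 in the help text, where `-qs 10 -qs2 10` is described as reducing the substitution error rate to one tenth of the default profile.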

Usage example

Command

zhoukr@bsn001:~/congmin/software/art_bin_GreatSmokyMountains$ art_illumina -ss HS20 -i GRCH38chr1L3556522.fna -l 100 -f 20 -o G38L100F20Nhs20
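
For a rough sense of how much data the command above asks for: under the usual definition of fold coverage, the number of single-end reads is approximately coverage × reference length / read length. The sketch below applies that with integer shell arithmetic. The function name `estimate_reads` and the 1,000,000 bp reference length are assumptions for illustration only, and ART's actual read count may differ (for instance, by default it masks regions containing 'N').

```shell
#!/bin/sh
# Hypothetical helper (not part of ART): estimate the number of single-end
# reads needed to reach a given fold coverage.
#   $1 = reference length in bp, $2 = read length (-l), $3 = fold coverage (-f)
estimate_reads() {
    echo $(( $1 * $3 / $2 ))
}

# Illustrative numbers only: a 1,000,000 bp reference, 100 bp reads, 20x coverage.
estimate_reads 1000000 100 20
```

The division is integer division, so the result is rounded down; that is accurate enough for estimating run time and disk usage.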

Output

zhoukr@bsn001:~/congmin/software/art_bin_GreatSmokyMountains$ art_illumina -ss HS20 -i GRCH38chr1L3556522.fna -l 100 -f 20 -o G38L100F20Nhs20

    ====================ART====================
             ART_Illumina (2008-2016)          
          Q Version 2.5.1 (Apr 17, 2016)       
     Contact: Weichun Huang <whduke@gmail.com> 
    -------------------------------------------
(still running)

Output files

~/congmin/software/art_bin_GreatSmokyMountains$ ll
total 9443836
drwxrwxr-x 2 hadoop hadoop       4096 Jun  2 23:10 ./
drwxrwxr-x 6 hadoop hadoop       4096 Jun  2 22:59 ../
-rw-rw-r-- 1 hadoop hadoop 4635232124 Jun  2 23:11 G38L100F20Nhs20.aln
-rw-rw-r-- 1 hadoop hadoop 4347022003 Jun  2 23:11 G38L100F20Nhs20.fq
-rw-r--r-- 1 hadoop hadoop  252513055 Jun  2 23:00 GRCH38chr1L3556522.fna
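
Once a run has finished, one way to check it is to count the reads in the .fq file. The sketch below assumes the common FASTQ layout of exactly four lines per record; that is an assumption about the output format, not something shown in the listing above. `count_fastq_reads` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count reads in a FASTQ file, assuming every record
# occupies exactly four lines (header, sequence, '+', qualities).
count_fastq_reads() {
    awk 'END { printf "%d\n", NR / 4 }' "$1"
}

# Usage against the output file from the run above:
# count_fastq_reads G38L100F20Nhs20.fq
```

Since the listing was captured while the run was still in progress, a count taken at that point would only reflect the reads written so far.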
