084 - [Bioinformatics Software] - ANNOVAR Help Documentation
阿新 • Published: 2018-12-14
Installation
You will receive the software package by email:
annovar.latest.tar/
Perl scripts included in the package:
[email protected] /opt/script/tool/annovar Sun Oct 07 16:42 forstart
$ tree -L 1
.
├── annotate_variation.pl
├── coding_change.pl
├── convert2annovar.pl
├── example
├── humandb
├── retrieve_seq_from_fasta.pl
├── table_annovar.pl
└── variants_reduction.pl
humandb
The ANNOVAR package ships with some commonly used databases in the humandb/ directory:
[email protected] /opt/script/tool/annovar/humandb Sun Oct 07 16:43 forstart
$ tree -L 1
.
├── genometrax-sample-files-gff
├── GRCh37_MT_ensGeneMrna.fa
├── GRCh37_MT_ensGene.txt
├── hg19_example_db_generic.txt
├── hg19_example_db_gff3.txt
├── hg19_MT_ensGeneMrna.fa
├── hg19_MT_ensGene.txt
├── hg19_refGeneMrna.fa
├── hg19_refGene.txt
├── hg19_refGeneVersion.txt
├── hg19_refGeneWithVerMrna.fa
└── hg19_refGeneWithVer.txt
GFF files
[email protected] /opt/script/tool/annovar/humandb/genometrax-sample-files-gff Sun Oct 07 16:44 forstart
$ tree -L 1
.
├── list
├── sample_chip_featuretype_hg19.gff
├── sample_common_snp_featuretype_hg19.gff
├── sample_cosmic_featuretype_hg19.gff
├── sample_cpg_islands_featuretype_hg19.gff
├── sample_dbnsfp_featuretype_hg19.gff
├── sample_disease_featuretype_hg19.gff
├── sample_dnase_featuretype_hg19.gff
├── sample_drug_featuretype_hg19.gff
├── sample_evs_featuretype_hg19.gff
├── sample_gwas_featuretype_hg19.gff
├── sample_hgmd_common_snp_featuretype_hg19.gff
├── sample_hgmd_disease_genes_featuretype_hg19.gff
├── sample_hgmd_featuretype_hg19.gff
├── sample_hgmdimputed_featuretype_hg19.gff
├── sample_microsatellites_featuretype_hg19.gff
├── sample_miRNA_featuretype_hg19.gff
├── sample_omim_featuretype_hg19.gff
├── sample_pathway_featuretype_hg19.gff
├── sample_pgx_featuretype_hg19.gff
├── sample_ptms_featuretype_hg19.gff
├── sample_snps_dbsnp_featuretype_hg19.gff
├── sample_snps_ensembl_featuretype_hg19.gff
├── sample_transfac_sites_featuretype_hg19.gff
└── sample_tss_featuretype_hg19.gff

0 directories, 25 files
example
[email protected]:/mnt/e/software/linux/ANNOVAR/annovar.latest.tar/annovar/example$ ll
total 20152
drwxrwxrwx 1 toucan toucan 4096 Sep 26 22:47 ./
drwxrwxrwx 1 toucan toucan 4096 Sep 26 22:47 ../
-rwxrwxrwx 1 toucan toucan 1940 Apr 17 03:41 README*
-rwxrwxrwx 1 toucan toucan 1831 Apr 17 03:41 ex1.avinput*
-rwxrwxrwx 1 toucan toucan 1706 Apr 17 03:41 ex2.vcf*
-rwxrwxrwx 1 toucan toucan 44 Apr 17 03:41 example.simple_region*
-rwxrwxrwx 1 toucan toucan 44 Apr 17 03:41 example.tab_region*
-rwxrwxrwx 1 toucan toucan 20317115 Apr 17 03:41 gene_fullxref.txt*
-rwxrwxrwx 1 toucan toucan 295664 Apr 17 03:41 gene_xref.txt*
-rwxrwxrwx 1 toucan toucan 1436 Apr 17 03:41 grantham.matrix*
-rwxrwxrwx 1 toucan toucan 43 Apr 17 03:41 snplist.txt*
README contents
[email protected]:/mnt/e/software/linux/ANNOVAR/annovar.latest.tar/annovar/example$ cat README
Visit the ANNOVAR website at http://www.openbioinformatics.org/annovar for more examples.
Please cite ANNOVAR if you use it in your research (Wang K, Li M, Hakonarson H. ANNOVAR: Functional annotation of genetic variants from next-generation sequencing data, Nucleic Acids Research, 38:e164, 2010). I spent tremendous amount of time and effort to maintain this tool, and your citation really means a lot to me.
ex1.avinput: a simple ANNOVAR input example with a few variants (in hg19 coordinate)
ex2.vcf: a simple VCF file with genotype information for 3 samples
gene_xref.txt: an example gene cross-reference file to be used on 'gx' operation in table_annovar.pl
example.simple_region: a file containing a list of genomic regions in simple format (for use in retrieve_seq_from_fasta.pl)
example.tab_region: a file containing a list of genomic regions in tab-delimited format (for use in retrieve_seq_from_fasta.pl)
snplist.txt: a text file listing several dbSNP rs identifiers, one per line
humandb/hg19_example_db_generic.txt: an example file for generic database
humandb/hg19_example_db_gff3.txt: an example file for GFF3 database
grantham.matrix: a matrix file containing GRANTHAM scores for gene-based annotation
humandb/genometrax-sample-files-gff: a directory containing several "sample" GFF files provided by BioBase
humandb/hg19_MT_ensGene.txt and humandb/hg19_MT_ensGeneMrna.fa: mitochondria sequence for the NC_001807 contig used by UCSC Genome Browser. Even if you align your sequence data with reference sequences from UCSC, you should still use these files, not the Ensembl file, for mitochondria annotation, because the Ensembl annotations have some errors.
humandb/GRCh37_MT_ensGene.txt and humandb/GRCh37_MT_ensGeneMrna.fa: mitochondria sequence for the NC_012920 contig. If you align your sequence data using the 1000 Genomes Project reference FASTA file, then you should use these files for annotating mitochondria variants.
For other annotations, use the -downdb argument to download the database into the 'humandb/' directory:
# Download the 1000g2015aug database
$ perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar 1000g2015aug humandb/
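Software help documentation
(ANNOVAR program structure
│ annotate_variation.pl #main program; downloads databases and performs the three types of annotation
│ coding_change.pl #can be used to infer protein sequences
│ convert2annovar.pl #converts various formats to .avinput
│ retrieve_seq_from_fasta.pl #used to build transcript sequences for other species
│ table_annovar.pl #annotation program that runs all three annotation types in one pass
│ variants_reduction.pl #for more flexibly customizing the filter and annotation workflow
│
├─example #example files
│
└─humandb #human annotation databases)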
annotate_variation.pl
$cat mam/annotate_variation.txt
SYNOPSIS
annotate_variation.pl [arguments] <query-file|table-name> <database-location>
Optional arguments:
-h, --help print help message
-m, --man print complete documentation
-v, --verbose use verbose output
Arguments to download databases or perform annotations
--downdb download annotation database
--geneanno annotate variants by gene-based annotation (infer functional consequence on genes)
--regionanno annotate variants by region-based annotation (find overlapped regions in database)
--filter annotate variants by filter-based annotation (find identical variants in database)
Arguments to control input and output
--outfile <file> output file prefix
--webfrom <string> specify the source of database (ucsc or annovar or URL) (downdb operation)
--dbtype <string> specify database type
--buildver <string> specify genome build version (default: hg18 for human)
--time print out local time during program run
--comment print out comment line (those starting with #) in output files
--exonsort sort the exon number in output line (gene-based annotation)
--transcript_function use transcript name rather than gene name (gene-based annotation)
--hgvs use HGVS format for exonic annotation (c.122C>T rather than c.C122T)(gene-based annotation)
--separate separately print out all functions of a variant in several lines (gene-based annotation)
--seq_padding create a new file with cDNA sequence padded by this much either side(gene-based annotation)
--(no)firstcodondel treat first codon deletion as wholegene deletion (default: ON) (gene-based annotation)
--aamatrix <file> specify an amino acid substitution matrix file (gene-based annotation)
--colsWanted <string> specify which columns to output by comma-delimited numbers (region-based annotation)
--scorecolumn <int> the column with scores in DB file (region-based annotation)
--poscolumn <string> the comma-delimited column with position information in DB file (region-based annotation)
--gff3dbfile <file> specify a DB file in GFF3 format (region-based annotation)
--gff3attribute output all fields in GFF3 attribute (default: ID and score only)
--bedfile <file> specify a DB file in BED format file (region-based annotation)
--genericdbfile <file> specify a DB file in generic format (filter-based annotation)
--vcfdbfile <file> specify a DB file in VCF format (filter-based annotation)
--otherinfo print out additional columns in database file (filter-based annotation)
--infoasscore use INFO field in VCF file as score in output (filter-based annotation)
--idasscore use ID field in VCF file as score in output (filter-based annotation)
--infosep use # rather than , to separate fields when -otherinfo is used
Arguments to fine-tune the annotation procedure
--batchsize <int> batch size for processing variants per batch (default: 5m)
--genomebinsize <int> bin size to speed up search (default: 100k for -geneanno, 10k for -regionanno)
--expandbin <int> check nearby bin to find neighboring genes (default: 2m/genomebinsize)
--neargene <int> distance threshold to define upstream/downstream of a gene
--exonicsplicing report exonic variants near exon/intron boundary as 'exonic;splicing' variants
--score_threshold <float> minimum score of DB regions to use in annotation
--normscore_threshold <float> minimum normalized score of DB regions to use in annotation
--reverse reverse directionality to compare to score_threshold
--rawscore output includes the raw score (not normalized score) in UCSC BrowserTrack
--minqueryfrac <float> minimum percentage of query overlap to define match to DB (default: 0)
--splicing_threshold <int> distance between splicing variants and exon/intron boundary (default: 2)
--indel_splicing_threshold <int> if set, use this value for allowed indel size for splicing variants (default: --splicing_threshold)
--maf_threshold <float> filter 1000G variants with MAF above this threshold (default: 0)
--sift_threshold <float> SIFT threshold for deleterious prediction for -dbtype avsift (default: 0.05)
--precedence <string> comma-delimited to specify precedence of variant function (default: exonic>intronic...)
--indexfilter_threshold <float> controls whether filter-based annotation use index if this fraction of bins need to be scanned (default: 0.9)
--thread <int> use multiple threads for filter-based annotation
--maxgenethread <int> max number of threads for gene-based annotation (default: 6)
--mingenelinecount <int> min line counts to enable threaded gene-based annotation (default: 1000000)
Arguments to control memory usage
--memfree <int> ensure minimum amount of free system memory (default: 0)
--memtotal <int> limit total amount of memory used by ANNOVAR (default: 0, unlimited,in the order of kb)
--chromosome <string> examine these specific chromosomes in database file
Function: annotate a list of genetic variants against genome annotation
databases stored at local disk.
# Examples
Example: #download annotation databases from ANNOVAR or UCSC and save to humandb/ directory
annotate_variation.pl -downdb -webfrom annovar refGene humandb/
annotate_variation.pl -buildver mm9 -downdb refGene mousedb/
annotate_variation.pl -downdb -webfrom annovar esp6500siv2_all humandb/
#gene-based annotation of variants in the varlist file (by default --geneanno is ON)
annotate_variation.pl -buildver hg19 ex1.avinput humandb/
#region-based annotate variants
annotate_variation.pl -regionanno -buildver hg19 -dbtype cytoBand ex1.avinput humandb/
annotate_variation.pl -regionanno -buildver hg19 -dbtype gff3 -gff3dbfile tfbs.gff3 ex1.avinput humandb/
#filter rare or unreported variants (in 1000G/dbSNP) or predicted deleterious variants
annotate_variation.pl -filter -dbtype 1000g2015aug_all -maf 0.01 ex1.avinput humandb/
annotate_variation.pl -filter -buildver hg19 -dbtype snp138 ex1.avinput humandb/
annotate_variation.pl -filter -dbtype dbnsfp30a -otherinfo ex1.avinput humandb/
Version: $Date: 2018-04-16 00:43:31 -0400 (Mon, 16 Apr 2018) $
OPTIONS
--help print a brief usage message and detailed explanation of options.
--man print the complete manual of the program.
--verbose
use verbose output.
--downdb
download annotation databases from UCSC Genome Browser, Ensembl,
1000 Genomes Project, ANNOVAR website or other resources. The
annotation databases are required for functional annotation of
genetic variants.
--geneanno
perform gene-based annotation. For each variant, examine whether
it hits an exon, intron, intergenic region, or is close to a transcript,
or hits a non-coding RNA gene, or is located in an untranslated
region (see *.variant_function output file). In addition, for an
exonic variant, determine whether it causes splicing change,
non-synonymous amino acid change, synonymous amino acid change or
frameshift changes (see *.exonic_variant_function output file).
--regionanno
perform region-based annotation. For each variant, examine whether
its genomic region (one or multiple base pairs) overlaps with a
specific genomic region, such as the most conserved elements, the
predicted transcription factor binding sites, the specific
cytogenetic bands, the evolutionarily conserved RNA secondary
structures.
--filter
perform filter-based annotation. For each variant, filter it
against a variation database, such as the 1000 Genomes Project
database, to identify whether it has been reported in the database.
Exact match of nucleotide position and nucleotide composition are
required.
--outfile
specify the output file prefix. Several output files will be
generated using this prefix and different suffixes. A directory
name can also be specified as part of the argument, so that the
output files can be written to a different directory than the
current directory.
--webfrom
specify the source of database (ucsc or annovar or URL) in the
downdb operation. By default, files from UCSC Genome Browser
annotation database will be downloaded.
--dbtype
specify the database type to be used in gene-based, region-based
or filter-based annotations. For gene-based annotation, by default
refGene annotations from the UCSC Genome Browser will be used for
annotating variants. However, users can switch to use Ensembl
annotations, or use the UCSC Gene annotations, or the GENCODE Gene
annotations, or other types of gene annotations. For region-based
annotations, users can select any UCSC annotation databases (by
providing the database name), or alternatively select a Generic
Feature Format version 3 (GFF3) formatted file for annotation (by
providing 'gff3' as the --dbtype and providing the --gff3dbfile
argument), or select a BED file (by providing '--dbtype bed' and
--bedfile arguments). For filter-based annotations, users can
select a dbSNP file, a 1000G file, a generic format file (with
simple columns including chr, start, end, reference, observed,
score), a VCF format file (which is a widely used format for
variants exchange), or many other types of formats.
--buildver
genome build version to use. By default, the hg18 build for human
genome is used. The build version will be used by ANNOVAR to
identify corresponding database files automatically, for example,
when gene-based annotation is used for hg18 build, ANNOVAR will
search for the hg18_refGene.txt file, but if the hg19 is used as
--buildver, ANNOVAR will examine hg19_refGene.txt instead.
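The file lookup described for --buildver can be pictured with a small sketch. It assumes the naming pattern `<buildver>_<dbtype>.txt`, which is inferred only from the hg18_refGene.txt and hg19_refGene.txt examples above; the helper name `gene_db_filename` is hypothetical and is not part of ANNOVAR.

```python
def gene_db_filename(buildver: str = "hg18", dbtype: str = "refGene") -> str:
    """Build the expected gene annotation file name.

    Assumption: files follow the '<buildver>_<dbtype>.txt' pattern seen in
    the help text (hg18_refGene.txt, hg19_refGene.txt).
    """
    return f"{buildver}_{dbtype}.txt"


# Defaults mirror the documented defaults: hg18 build, refGene annotation.
print(gene_db_filename())         # hg18_refGene.txt
print(gene_db_filename("hg19"))   # hg19_refGene.txt
```

In practice this is why passing -buildver hg19 matters when the humandb/ directory only contains hg19_* files.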
--time print out the local time during execution of the program
--comment
specify that the program should include comment lines in the
output files. Comment lines are defined as any line starting with
#. By default, these lines are not recognized as valid ANNOVAR
input and are therefore written to the INVALID_INPUT file. This
argument can be very useful to keep column headers in the output
file, if the input file uses a comment line to flag the column
headers (usually the first line in the input file).
--exonsort
sort the exon number in output line in the exonic_variant_function
file during gene-based annotation. If a mutation affects multiple
transcripts, the ones with the smaller exon number will be printed
before the transcript with larger exon number in the output.
--transcript_function
use transcript name rather than gene name in output, for
gene-based annotation
--hgvs use HGVS format for exonic annotation (c.122C>T rather than
c.C122T) for gene-based annotation
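The two notations mentioned for --hgvs differ only in where the position is placed. The sketch below formats a single-nucleotide cDNA change both ways; the function name `format_cdna_change` is hypothetical and it only covers simple substitutions, not indels.

```python
def format_cdna_change(pos: int, ref: str, alt: str, hgvs: bool = False) -> str:
    """Format a single-nucleotide cDNA change.

    hgvs=False gives the default style from the help text (c.C122T);
    hgvs=True gives the HGVS style (c.122C>T).
    """
    if hgvs:
        return f"c.{pos}{ref}>{alt}"
    return f"c.{ref}{pos}{alt}"


print(format_cdna_change(122, "C", "T"))             # c.C122T
print(format_cdna_change(122, "C", "T", hgvs=True))  # c.122C>T
```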
--separate
for gene-based annotation, separate the effects of each variant,
so that each effect (intronic, exonic, splicing) is printed in one
output line. By default, all effects are printed in the same line,
in the comma-separated form of 'UTR3,UTR5' or 'exonic,splicing'.
--seq_padding
create a new file with cDNA sequence padded by this much either
side (gene-based annotation)
--firstcodondel
if the first codon of a gene is deleted, then the whole gene will
be treated as deleted in gene-based annotation. By default, this
option is ON.
--aamatrixfile
specify an amino acid substitution matrix, so that the scores are
printed in the exonic_variant_function file in gene-based
annotation. The matrix file is tab-delimited, and an example is
included in the ANNOVAR package.
--colsWanted
specify which columns are desired in the output for -regionanno.
By default, ANNOVAR intelligently selects the columns based on the
DB type. However, users can use a list of comma-delimited numbers,
or use 'all', or use 'none', to request custom output columns.
--scorecolumn
specify the column with desired output scores in UCSC database
file (for region-based annotation). The default usually works
okay.
--poscolumn
the comma-delimited column with position information in DB file
(region-based annotation). The default usually works okay.
--gff3dbfile
specify the GFF3-formatted database file used in the region-based
annotation. Please consult
http://www.sequenceontology.org/resources/gff3.html for detailed
description on this file format. Note that GFF3 is generally not
compatible with previous versions of GFF.
--gff3attribute
output should contain all fields in GFF3 file attribute column
(the 9th column). By default, only the ID in the attribute and the
scores for the GFF3 file will be printed.
--bedfile
specify a DB file in BED format file in region-based annotation.
Please consult http://genome.ucsc.edu/FAQ/FAQformat.html#format1
for detailed descriptions on this format.
--genericdbfile
specify the generic format database file used in the filter-based
annotation.
--vcfdbfile
specify the database file in VCF format in the filter-based
annotation. VCF has been a popular format for summarizing SNP and
indel calls in a population of samples, and has been adopted by
1000 Genomes Project in their most recent data release.
--otherinfo
print out additional columns in database file in filter-based
annotation. This argument is useful when the annotation database
contains more than one annotation columns, so that all columns
will be printed out and separated by comma (by default).
--idasscore
when annotating against a VCF file, treat the ID field in VCF file
as the score to be printed in the output, in filter-based
annotation. By default the score is the allele frequency inferred
from VCF file.
--infoasscore
when annotating against a VCF file, treat the INFO field in VCF
file as the score to be printed in the output, in filter-based
annotation. By default the score is allele frequency inferred from
VCF file.
--infosep
use '#' rather than ',' to separate multiple fields when
-otherinfo is used in annotation. This argument is useful when the
annotation string itself contains comma, to help users clearly
separate different annotation fields.
--batchsize
this argument specifies the batch size for processing variants by
gene-based annotation. Normally 5 million variants (usually one
human genome will have about 3-5 million variants depending on
ethnicity) are annotated as a batch, to reduce the amounts of
memory. The users can adjust the parameters: larger values make
the program slightly faster, at the expense of slightly larger
memory requirements. In a 64bit computer, the default settings
usually take 1GB memory for gene-based annotation for human genome
for a typical query file, but this depends on the complexity of
the query (note that the query has a few required fields, but may
have many optional fields and those fields need to be read and
kept in memory).
--genomebinsize
the bin size of genome to speed up search. By default 100kb is
used for gene-based annotation, so that variant annotation
focused on specific bins only (based on the start-end site of a
given variant), rather than searching the entire chromosomes for
each variant. By default 10kb is used for region-based annotation.
The filter-based annotations look for variants directly so no bin
is used.
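To make the binning idea concrete, here is a minimal sketch of how a variant could be mapped to genome bins. It assumes a bin index is simply the coordinate divided by the bin size (floor division); ANNOVAR's internal indexing may differ, and `bins_for_variant` is a hypothetical helper.

```python
from typing import List


def bins_for_variant(start: int, end: int, binsize: int = 100_000) -> List[int]:
    """Return the indices of all bins touched by the interval [start, end].

    Assumption: bin index = coordinate // binsize. Only records stored in
    these bins need to be compared against the variant.
    """
    first = start // binsize
    last = end // binsize
    return list(range(first, last + 1))


# A SNV from the example input falls into a single 100 kb bin.
print(bins_for_variant(49303427, 49303427))                   # [493]
# With the 10 kb default of region-based annotation the bin is finer.
print(bins_for_variant(49303427, 49303427, binsize=10_000))   # [4930]
# A variant that crosses a bin boundary touches two bins.
print(bins_for_variant(199_999, 200_001))                     # [1, 2]
```

Smaller bins mean fewer candidate records per lookup but more bins to index, which is the trade-off the two defaults reflect.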
--expandbin
expand bin to both sides to find neighboring genes/regions. For
gene-based annotation, ANNOVAR tries to find nearby genes for any
intergenic variant, with a maximum number of nearby bins to
search. By default, ANNOVAR will automatically set this argument
to search 2 megabases to the left and right of the variant in
genome.
--neargene
the distance threshold to define whether a variant is in the
upstream or downstream region of a gene. By default 1 kilobase
from the start or end site of a transcript is defined as upstream
or downstream, respectively. This is useful, for example, when one
wants to identify variants that are located in the promoter
regions of genes across the genome.
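The upstream/downstream rule can be sketched as a distance check against the transcript boundaries. This is an illustration under stated assumptions: coordinates are 1-based, a distance exactly equal to the threshold still counts, and strand handling (upstream is before the start on the plus strand and after the end on the minus strand) is the usual convention rather than something stated in the help text. `near_gene_label` is a hypothetical name.

```python
from typing import Optional


def near_gene_label(pos: int, tx_start: int, tx_end: int,
                    strand: str = "+", neargene: int = 1000) -> Optional[str]:
    """Label a position relative to one transcript.

    Returns None for positions inside the transcript (those are handled by
    the exonic/intronic/UTR categories), otherwise 'upstream', 'downstream'
    or 'intergenic'.
    """
    if tx_start <= pos <= tx_end:
        return None
    if pos < tx_start and tx_start - pos <= neargene:
        return "upstream" if strand == "+" else "downstream"
    if pos > tx_end and pos - tx_end <= neargene:
        return "downstream" if strand == "+" else "upstream"
    return "intergenic"


print(near_gene_label(9_500, 10_000, 20_000))              # upstream
print(near_gene_label(9_500, 10_000, 20_000, strand="-"))  # downstream
print(near_gene_label(8_000, 10_000, 20_000))              # intergenic
```

Raising `neargene` widens the window, which is how the option helps when scanning for promoter-region variants.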
--exonicsplicing
report exonic variants near exon/intron boundary as
'exonic;splicing' variants. These variants are technically exonic
variants, but there are some literature reports that some of them
may also affect splicing so a keyword is preserved specifically
for them.
--score_threshold
the minimum score to consider when examining region-based
annotations on UCSC Genome Browser tables. Some tables do not have
such scores and this argument will not be effective.
--normscore_threshold
the minimum normalized score to consider when examining
region-based annotations on UCSC Genome Browser tables. The
normalized score is calculated by UCSC, ranging from 0 to 1000, to
make visualization easier. Some tables do not have such scores and
this argument will not be effective.
--reverse
reverse the criteria for --score_threshold and
--normscore_threshold. So the minimum score becomes maximum score
for a result to be printed.
--rawscore
for region-based annotation, print out raw scores from UCSC Genome
Browser tables, rather than normalized scores. By default,
normalized scores are printed in the output files. Normalized
scores are compiled by UCSC Genome Browser for each track, and
they usually range from 0 to 1000, but there are some exceptions.
--minqueryfrac
The minimum fraction of overlap between a query and a database
record to decide on their match. By default, any overlap is
regarded as a match, but this may not work best when the query consists
of large copy number variants.
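A short sketch of the overlap fraction behind --minqueryfrac follows. It assumes 1-based inclusive coordinates and defines the fraction as overlapping bases divided by query length; the function names are hypothetical and the exact comparison ANNOVAR uses may differ.

```python
def query_overlap_fraction(q_start: int, q_end: int,
                           db_start: int, db_end: int) -> float:
    """Fraction of the query interval covered by the database record."""
    overlap = min(q_end, db_end) - max(q_start, db_start) + 1
    if overlap <= 0:
        return 0.0
    return overlap / (q_end - q_start + 1)


def is_match(q_start: int, q_end: int, db_start: int, db_end: int,
             minqueryfrac: float = 0.0) -> bool:
    """With the default of 0, any overlap at all counts as a match."""
    frac = query_overlap_fraction(q_start, q_end, db_start, db_end)
    return frac > 0 and frac >= minqueryfrac


# A 1000 bp CNV query that overlaps a database region by only 100 bp.
print(is_match(1001, 2000, 1901, 3000))                    # True
print(is_match(1001, 2000, 1901, 3000, minqueryfrac=0.5))  # False
```

For large copy number variants, a non-zero threshold avoids reporting regions that only graze the edge of the query.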
--splicing_threshold
distance between splicing variants and exon/intron boundary, to
claim that a variant is a splicing variant. By default, 2bp is
used. ANNOVAR is relatively more stringent than some other
software to claim variant as regulating splicing. In addition, if
a variant is an exonic variant, it will not be reported as
splicing variant even if it is within 2bp to an exon/intron
boundary.
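The splicing rule above can be sketched for a single exon. This is a simplification: it looks at one exon only, treats everything outside it as intron, and ignores indel size (see --indel_splicing_threshold) and the --exonicsplicing option. `classify_near_boundary` is a hypothetical helper.

```python
def classify_near_boundary(pos: int, exon_start: int, exon_end: int,
                           splicing_threshold: int = 2) -> str:
    """Classify a position relative to one exon (1-based, inclusive).

    Exonic positions stay 'exonic' even when they are next to the
    boundary; intronic positions within the threshold become 'splicing'.
    """
    if exon_start <= pos <= exon_end:
        return "exonic"
    distance = exon_start - pos if pos < exon_start else pos - exon_end
    return "splicing" if distance <= splicing_threshold else "intronic"


for p in (997, 998, 999, 1000, 1100, 1101, 1103):
    print(p, classify_near_boundary(p, 1000, 1100))
```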
--indel_splicing_threshold
If set, max size of indel allowed to be called a splicing variant
(if boundary within --splicing_threshold bases of an intron/exon
junction.) If not set, this is equal to the --splicing_threshold,
as per original behavior.
--maf_threshold
the minor allele frequency (MAF) threshold to be used in the
filter-based annotation for the 1000 Genomes Project databases. By
default, any variant annotated in the 1000G will be used in
filtering.
--sift_threshold
the default SIFT threshold for deleterious prediction for -dbtype
avsift (default: 0.05). This argument is obsolete, since the
recommended database for SIFT annotation is LJB database now,
rather than avsift database.
--thread
specify the number of threads to use in filter-based annotation.
Perl and all components in the system need to support
multi-threaded analysis to use this feature. It is recommended
when your database is stored at a SSD drive, which results in
nearly linear speed up of annotation for large genome files.
--maxgenethread
specify the maximum number of threads for gene-based annotation
(default: 6). Generally speaking, too many threads for gene-based
annotation will negatively impact the performance.
--mingenelinecount
specify the minimum line counts to enable threaded gene-based
annotation (default: 1000000). For input files with fewer lines,
the threaded annotation will not be used, since it actually costs
more time than non-threaded annotation.
--memfree
the minimum amount of free system memory that ANNOVAR should
ensure to have.
--memtotal
the total amount of memory that ANNOVAR should use at most. By
default, this value is zero, meaning that there is no limit on
that. Decreasing this threshold reduces the memory requirement by
ANNOVAR, but may increase the execution time.
--chromosome
examine these specific chromosomes in database file. The argument
takes comma-delimited values, and the dash can be correctly
recognized. For example, 5-10,X represent chromosome 5 through
chromosome 10 plus chromosome X.
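The --chromosome syntax can be illustrated with a small parser. This is a sketch of the notation only, not ANNOVAR's own parsing code; `expand_chromosomes` is a hypothetical name and it assumes that ranges are numeric.

```python
from typing import List


def expand_chromosomes(spec: str) -> List[str]:
    """Expand a spec such as '5-10,X' into individual chromosome names."""
    chroms: List[str] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            chroms.extend(str(i) for i in range(int(low), int(high) + 1))
        elif part:
            chroms.append(part)
    return chroms


print(expand_chromosomes("5-10,X"))  # ['5', '6', '7', '8', '9', '10', 'X']
```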
DESCRIPTION
ANNOVAR is a software tool that can be used to functionally annotate a
list of genetic variants, possibly generated from next-generation
sequencing experiments. For example, given a whole-genome resequencing
data set for a human with specific diseases, typically around 3 million
SNPs and around half million insertions/deletions will be identified.
Given this massive amount of data (and candidate disease-causing
variants), it is necessary to have a fast algorithm that scans the data
and identifies a prioritized subset of variants that are most likely
functional for follow-up Sanger sequencing studies and functional assays.
Currently, these various types of functional annotations produced by
ANNOVAR can be (1) gene-based annotations (the default behavior), such as
exonic variants, intronic variants, intergenic variants, downstream
variants, UTR variants, splicing site variants, etc. For exonic variants,
ANNOVAR will try to predict whether each of the variants is non-synonymous
SNV, synonymous SNV, frameshifting change, nonframeshifting change. (2)
region-based annotation, to identify whether a given variant overlaps with
a specific type of genomic region, for example, predicted transcription
factor binding site or predicted microRNAs. (3) filter-based annotation, to
filter a list of variants so that only those not observed in variation
databases (such as 1000 Genomes Project and dbSNP) are printed out.
Detailed documentation for ANNOVAR should be viewed in ANNOVAR website
(http://annovar.openbioinformatics.org/). Below is description on commonly
encountered file formats when using ANNOVAR software.
* variant file format
A sample variant file contains one variant per line, with the
fields being chr, start, end, reference allele, observed allele,
other information. The other information can be anything (for
example, it may contain sample identifiers for the corresponding
variant.) An example is shown below:
16 49303427 49303427 C T rs2066844 R702W (NOD2)
16 49314041 49314041 G C rs2066845 G908R (NOD2)
16 49321279 49321279 - C rs2066847 c.3016_3017insC (NOD2)
16 49290897 49290897 C T rs9999999 intronic (NOD2)
16 49288500 49288500 A T rs8888888 intergenic (NOD2)
16 49288552 49288552 T - rs7777777 UTR5 (NOD2)
18 56190256 56190256 C T rs2229616 V103I (MC4R)
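The variant file layout above maps naturally onto a small record type. The sketch below splits on whitespace, takes the first five fields as chr, start, end, reference allele and observed allele, and keeps the rest as free text. The names `Variant` and `parse_avinput_line` are hypothetical.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    start: int
    end: int
    ref: str
    alt: str
    other: str  # everything after the five required fields


def parse_avinput_line(line: str) -> Variant:
    """Parse one whitespace-delimited variant line."""
    fields = line.split()
    chrom, start, end, ref, alt = fields[:5]
    return Variant(chrom, int(start), int(end), ref, alt, " ".join(fields[5:]))


snv = parse_avinput_line("16 49303427 49303427 C T rs2066844 R702W (NOD2)")
ins = parse_avinput_line("16 49321279 49321279 - C rs2066847 c.3016_3017insC (NOD2)")
print(snv)
# In the example, an insertion uses '-' as the reference allele.
print(ins.ref, ins.alt)  # - C
```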
* database file format: UCSC Genome Browser annotation database
Most but not all of the gene annotation databases are directly
downloaded from UCSC Genome Browser, so the file format is
identical to what was used by the genome browser. The users can
check Table Browser (for example, human hg18 table browser is at
http://www.genome.ucsc.edu/cgi-bin/hgTables?org=Human&db=hg18) to
see what fields are available in the annotation file. Note that
even for the same species (such as humans), the file format might
be different between different genome builds (such as between
hg16, hg17 and hg18). ANNOVAR will try to be smart about guessing
file format, based on the combination of the --buildver argument
and the number of columns in the input file. In general, the
database file format should not be something that users need to
worry about.
* database file format: GFF3 format for gene-based annotations
As of June 2010, ANNOVAR cannot perform gene-based annotations
using GFF3 input files, and any annotations on GFF3 is
region-based. I suggest that users download gff3ToGenePred tool
from UCSC and convert GFF3-based gene annotation to UCSC format,
so that ANNOVAR can perform gene-based annotation for your species
of interests.
* database file format: GFF3 format for region-based
annotations
Currently, region-based annotations can support the Generic
Feature Format version 3 (GFF3) formatted files. The GFF3 has
become the de facto gold standard for many model organism
databases, such that many users may want to take a custom
annotation database and run ANNOVAR on them, and it would be the
most convenient if the custom file is made with GFF3 format.
* database file format: generic format for filter-based
annotations
The 'generic' format is designed for filter-based annotation that
looks for exact variants. The format is almost identical to the
ANNOVAR input format, with chr, start, end, reference allele,
observed allele and scores (higher scores are regarded as better).
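Exact-match lookup against a generic-format database can be sketched with a dictionary keyed on the five variant fields. The score 0.05 in the usage lines is a made-up illustration value, and the helper names are hypothetical; real databases are far larger and ANNOVAR does not necessarily load them this way.

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, int, int, str, str]  # chr, start, end, ref, observed


def load_generic_db(lines: Iterable[str]) -> Dict[Key, float]:
    """Index generic-format lines (chr start end ref obs score) by variant."""
    db: Dict[Key, float] = {}
    for line in lines:
        chrom, start, end, ref, alt, score = line.split()[:6]
        db[(chrom, int(start), int(end), ref, alt)] = float(score)
    return db


def annotate_by_filter(variants: Iterable[Key], db: Dict[Key, float]):
    """Split variants into (matched with score, unmatched).

    A match requires identical position and identical alleles.
    """
    matched: List[Tuple[Key, float]] = []
    unmatched: List[Key] = []
    for v in variants:
        if v in db:
            matched.append((v, db[v]))
        else:
            unmatched.append(v)
    return matched, unmatched


# Hypothetical one-line database; the score is illustrative only.
db = load_generic_db(["16 49303427 49303427 C T 0.05"])
query = [("16", 49303427, 49303427, "C", "T"),
         ("16", 49303427, 49303427, "C", "G")]  # same site, different allele
matched, unmatched = annotate_by_filter(query, db)
print(matched)
print(unmatched)
```

The second query shares the position but not the observed allele, so it is not treated as present in the database.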
* database file format: VCF format for filter-based annotations
ANNOVAR can directly interrogate VCF files as database files. A
VCF file may contain summary information for variants (for
example, this variant has MAF of 5% in this population), or it may
contain the actual variant calls for each individual in a specific
population.
* sequence file format
ANNOVAR can directly examine FASTA-formatted sequence files. For
mRNA sequences, the names of the sequences are the mRNA identifiers.
For genomic sequences, the name of the sequences in the files are
usually chr1, chr2, chr3, etc, so that ANNOVAR knows which
sequence corresponds to which chromosome. Unfortunately, UCSC uses
things like chr6_random to annotate un-assembled sequences, as
opposed to using the actual contig identifiers. This causes some
issues (depending on how read alignment algorithms work), but in
general should not be something that users need to worry about. If
the users absolutely care about the exact contigs rather than
chr*_random, then they will need to re-align the short reads at
chr*_random to a different FASTA file that contains the contigs
(such as the GRCh36/37/38), and then execute ANNOVAR on the newly
identified variants.
* invalid input
If the query file contains input lines with invalid format,
ANNOVAR will skip such lines and continue with the annotation on
next lines. These invalid input lines will be written to a file
with suffix invalid_input. Users should manually examine this file
and identify sources of error.
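The invalid-input behavior can be pictured as a pre-check that separates usable lines from the rest. The validity rules below (at least five fields, integer start and end, start not greater than end) are assumptions for illustration; ANNOVAR applies its own, stricter checks. `split_valid_invalid` is a hypothetical helper.

```python
from typing import Iterable, List, Tuple


def split_valid_invalid(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate lines that look like variant records from those that do not."""
    valid: List[str] = []
    invalid: List[str] = []
    for line in lines:
        fields = line.split()
        ok = (
            len(fields) >= 5
            and fields[1].isdigit()
            and fields[2].isdigit()
            and int(fields[1]) <= int(fields[2])
        )
        (valid if ok else invalid).append(line)
    return valid, invalid


lines = [
    "16 49303427 49303427 C T rs2066844",
    "#chr start end ref obs",        # comment/header line
    "16 abc 49303427 C T",           # non-numeric start
]
valid, invalid = split_valid_invalid(lines)
print(len(valid), len(invalid))  # 1 2
```

Note that the header line lands in the invalid list, which matches the default handling of comment lines described under --comment.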
--------------------------------------------------------------------------------
ANNOVAR is free for academic, personal and non-profit use.
For questions or comments, please contact $Author: kaichop
<[email protected]> $.
$cat mam/convert2annovar.txt
SYNOPSIS
convert2annovar.pl [arguments] <variantfile>
Optional arguments:
-h, --help print help message
-m, --man print complete documentation
-v, --verbose use verbose output
--format <string> input format (default: pileup)
--includeinfo include supporting information in output
--outfile <file> output file name (default: STDOUT)
--snpqual <float> quality score threshold in pileup file (default: 20)
--snppvalue <float> SNP P-value threshold in GFF3-SOLiD file (default: 1)
--coverage <int> read coverage threshold in pileup file (default: 0)
--maxcoverage <int> maximum coverage threshold (default: none)
--chr <string> specify the chromosome (for CASAVA format)
--chrmt <string> chr identifier for mitochondria (default: M)
--fraction <float> minimum allelic fraction to claim a mutation (for pileup format)
--altcov <int> alternative allele coverage threshold (for pileup format)
--allelicfrac print out allelic fraction rather than het/hom status (for pileup format)
--species <string> if human, convert chr23/24/25 to X/Y/M (for gff3-solid format)
--filter <string> output variants with this filter (case insensitive, for vcf4 format)
--confraction <float> minimal fraction for two indel calls as a 0-1 value (for vcf4old format)
--allallele print all alleles rather than first one (for vcf4old format)
--withzyg print zygosity/coverage/quality when -includeinfo is used (for vcf4 format)
--comment keep comment line in output (for vcf4 format)
--allsample process all samples in file with separate output files (for vcf4 format)
--genoqual <float> genotype quality score threshold (for vcf4 format)
--varqual <float> variant quality score threshold (for vcf4 format)
--dbsnpfile <file> dbSNP file in UCSC format (for rsid format)
--withfreq for --allsample, print frequency information instead (for vcf4 format)
--withfilter print filter information in output (for vcf4 format)
--seqdir <string> directory with FASTA sequences (for region format)
--inssize <int> insertion size (for region format)
--delsize <int> deletion size (for region format)
--subsize <int> substitution size (default: 1, for region format)
--genefile <file> specify the gene file from UCSC (for transcript format)
--splicing_threshold <int> the splicing threshold (for transcript format)
--context <int> print context nucleotide for indels (for casava format)
--avsnpfile <file> specify the avSNP file (for rsid format)
--keepindelref keep Ref/Alt alleles for indels (for vcf4 format)
Function: convert variant call file generated from various software programs
into ANNOVAR input format
Example: convert2annovar.pl -format pileup -outfile variant.query variant.pileup
convert2annovar.pl -format cg -outfile variant.query variant.cg
convert2annovar.pl -format cgmastervar variant.masterVar.txt
convert2annovar.pl -format gff3-solid -outfile variant.query variant.snp.gff
convert2annovar.pl -format soap variant.snp > variant.avinput
convert2annovar.pl -format maq variant.snp > variant.avinput
convert2annovar.pl -format casava -chr 1 variant.snp > variant.avinput
convert2annovar.pl -format vcf4 variantfile > variant.avinput
convert2annovar.pl -format vcf4 -filter pass variantfile -allsample -outfile variant
convert2annovar.pl -format vcf4old input.vcf > output.avinput
convert2annovar.pl -format rsid snplist.txt -dbsnpfile snp138.txt > output.avinput
convert2annovar.pl -format region -seqdir humandb/hg19_seq/ chr1:2000001-2000003 -inssize 1 -delsize 2
convert2annovar.pl -format transcript NM_022162 -gene humandb/hg19_refGene.txt -seqdir humandb/hg19_seq/
Version: $Date: 2018-04-16 00:48:00 -0400 (Mon, 16 Apr 2018) $
OPTIONS
--help print a brief usage message and detailed explanation of options.
--man print the complete manual of the program.
--verbose
use verbose output.
--format
the format of the input files. Currently supported formats include
pileup, cg, cgmastervar, gff3-solid, soap, maq, casava, vcf4,
vcf4old, rsid. In August 2013, the VCF file processing subroutine
was changed (multiple samples in a VCF file can now be processed in
a genotype-aware manner), but users can use vcf4old to reproduce
the old behavior.
--outfile
specify the output file name. By default, output is written to
STDOUT.
--snpqual
quality score threshold in the pileup file, such that variant
calls with lower quality scores will not be printed out in the
output file.
--snppvalue
SNP p-value threshold in the GFF3-SOLiD file, such that variant calls
with higher values will not be printed out in the output file.
--coverage
read coverage threshold in the pileup file, such that variant
calls generated with lower coverage will not be printed in the
output file.
--maxcoverage
maximum read coverage threshold in the pileup file, such that
variant calls generated with higher coverage will not be printed
in the output file.
--includeinfo
specify that the output should contain additional information in
the input line. By default, only the chr, start, end, reference
allele, observed allele and homozygosity status are included in
output files.
--chr specify the chromosome for CASAVA format
--chrmt specify the name of mitochondria chromosome (default is M)
--altcov
the minimum coverage of the alternative (mutated) allele to be
printed out in output
--allelicfrac
print out allelic fraction rather than het/hom status (for pileup
format). This is useful when processing mitochondria variants.
--fraction
specify the minimum fraction of the alternative allele required to
print out the mutation. For example, if a site has 10 reads and 3
of them support the alternative allele (a fraction of 0.3), a
-fraction of 0.4 will prevent the mutation from being printed out.
--species
specify the species from which the sequencing data is obtained.
For the GFF3-SOLiD format, when species is human, chromosomes
23, 24 and 25 will be converted to X, Y and M, respectively.
--filter
for VCF4 file, only print out variant calls with this filter
annotated. For example, if using GATK VariantFiltration walker,
you will see PASS, GATKStandard, HARD_TO_VALIDATE, etc in the
filter field. Using 'pass' as a filter is recommended in this
case.
--allsample
for multi-sample VCF4 file, the --allsample argument will process
all samples in the file and generate separate output files for
each sample. By default, only the first sample in VCF4 file will
be processed.
--withzyg
for VCF4 format, print out zygosity information, coverage
information and genotype quality information when -includeinfo is
used. By default, this information is printed out if
-includeinfo is not used.
--genoqual
minimum genotype quality for the variant in this sample, to be
printed out. The genotype quality is typically denoted as GQ in
the SAMPLE column
--varqual
minimum variant quality (the QUAL column in the VCF file) to
handle the variant in VCF file.
--comment
include VCF4 header comment lines in the output file
--genoqual
specify the genotype quality score to be included in the output
file
--varqual
specify the variant quality score to be included in the output
file
--dbsnpfile
specify the dbSNP file to query (for rsid format)
--withfreq
include frequency information in the output (for VCF format with
multiple samples)
--withfilter
include filter information in the output file (for VCF format)
--seqdir
specify the directory for sequence file (for region format)
--inssize
specify the insertion size when generating all mutations (for
region format)
--delsize
specify the deletion size when generating all mutations (for
region format)
--subsize
specify the substitution size when generating all mutations (for
region format)
--genefile
specify the gene file from UCSC, which can be refGene, knownGene
or ensGene (for transcript format)
--splicing_threshold
specify the splicing threshold (for transcript format)
--context
print context for indels which is useful to convert to VCF files
(for CASAVA format)
--avsnpfile
specify the avsnpfile that will be queried when using rsid as the
input file format
--keepindelref
do not alter the Ref and Alt alleles for indels in the VCF file
(by default the program automatically changes and shortens the Ref
and Alt allele)
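Two of the options above reduce to simple arithmetic or a lookup, sketched here in Python. The function names are hypothetical, and the code only mirrors the behaviour as the option text describes it; it is not taken from convert2annovar.pl. In particular, treating a fraction exactly equal to the threshold as passing is an assumption.

```python
def passes_fraction(alt_reads, total_reads, min_fraction):
    """Return True when the alternative-allele fraction reaches the threshold.

    Assumption: a fraction equal to the threshold passes (>=).
    """
    if total_reads == 0:
        return False
    return alt_reads / total_reads >= min_fraction


# Mapping described for --species human with GFF3-SOLiD input.
HUMAN_CHR_MAP = {"23": "X", "24": "Y", "25": "M"}


def convert_human_chr(chrom):
    """Convert numeric chromosome 23/24/25 to X/Y/M; leave others unchanged."""
    return HUMAN_CHR_MAP.get(chrom, chrom)


# The example from the --fraction text: 10 reads, 3 support the alternative allele.
print(passes_fraction(3, 10, 0.4))   # 3/10 = 0.3 is below 0.4
print(passes_fraction(3, 10, 0.25))  # 0.3 reaches 0.25
print([convert_human_chr(c) for c in ["1", "23", "24", "25"]])
```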
DESCRIPTION
This program is used to convert variant call files generated by various
software programs into the ANNOVAR input format. Currently, the program can
handle the Samtools genotype-calling pileup format, SOLiD GFF3 format,
Complete Genomics variant format, SOAP format, MAQ format, CASAVA format
and VCF format. These formats are described below.
* pileup format
The pileup format can be produced by the Samtools genotype-calling
subroutine. Note that the phrase pileup format can be used
in several senses, and here I am only referring to the pileup
files that contain the actual genotype calls.
Using SamTools, given an alignment file in BAM format, a pileup
file with genotype calls can be produced by the commands below:
samtools pileup -vcf ref.fa aln.bam > raw.pileup
samtools.pl varFilter raw.pileup > final.pileup
ANNOVAR will automatically filter the pileup file so that only
SNPs reaching a quality threshold are printed out (default is 20,
use the --snpqual argument to change this). Most likely, users will
also want to apply a coverage threshold, such that SNP calls
supported by only a few reads are not considered. This can be
achieved using the -coverage argument (default value is 0).
An example of pileup files for SNPs is shown below:
chr1 556674 G G 54 0 60 16 a,.....,...,.... (B%A+%7B;0;%=B<:
chr1 556675 C C 55 0 60 16 ,,..A..,...,.... CB%%5%,A/+,%....
chr1 556676 C C 59 0 60 16 g,.....,...,.... .B%%.%.?.=/%...1
chr1 556677 G G 75 0 60 16 ,$,.....,...,.... .B%%9%5A6?)%;?:<
chr1 556678 G K 60 60 60 24 ,$.....,...,....^~t^~t^~t^~t^~t^~t^~t^~t^~t B%%B%<A;AA%??<=??;BA%B89
chr1 556679 C C 61 0 60 23 .....a...a....,,,,,,,,, %%1%&?*:2%*&)(89/[email protected]@@
chr1 556680 G K 88 93 60 23 ..A..,..A,....ttttttttt %%)%7B:B0%55:7=>>[email protected]?B;
chr1 556681 C C 102 0 60 25 .$....,...,....,,,,,,,,,^~,^~. %%3%.B*4.%.34.6./[email protected]@>5.
chr1 556682 A A 70 0 60 24 ...C,...,....,,,,,,,,,,. %:%(B:A4%7A?;A><<999=<<
chr1 556683 G G 99 0 60 24 ....,...,....,,,,,,,,,,. %A%[email protected]%?%[email protected]/./-1A7?
The columns are chromosome, 1-based coordinate, reference base,
consensus base, consensus quality, SNP quality, maximum mapping
quality of the reads covering the sites, the number of reads
covering the site, read bases and base qualities.
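As a rough illustration of the column layout just described, the Python sketch below parses SNP pileup lines and applies quality and coverage thresholds. The names `PileupSite`, `parse_pileup_line` and `keep_site` are hypothetical, and the filter rule (consensus differs from reference, SNP quality at least `snpqual`, depth at least `coverage`) is an assumption about how the thresholds behave, not a copy of convert2annovar.pl. Indel lines have a different layout and are not handled here.

```python
from collections import namedtuple

PileupSite = namedtuple(
    "PileupSite",
    "chrom pos ref consensus cons_qual snp_qual map_qual depth bases quals",
)


def parse_pileup_line(line):
    """Split one SNP pileup line into the ten columns listed above."""
    f = line.split()
    if len(f) != 10:
        raise ValueError("expected 10 columns, got %d" % len(f))
    return PileupSite(
        f[0], int(f[1]), f[2], f[3],
        int(f[4]), int(f[5]), int(f[6]), int(f[7]),
        f[8], f[9],
    )


def keep_site(site, snpqual=20, coverage=0):
    """Assumed filter: a non-reference call that reaches both thresholds."""
    return (
        site.consensus.upper() != site.ref.upper()
        and site.snp_qual >= snpqual
        and site.depth >= coverage
    )


# Two lines taken from the example above.
lines = [
    "chr1 556677 G G 75 0 60 16 ,$,.....,...,.... .B%%9%5A6?)%;?:<",
    "chr1 556678 G K 60 60 60 24 ,$.....,...,....^~t^~t^~t^~t^~t^~t^~t^~t^~t B%%B%<A;AA%??<=??;BA%B89",
]
sites = [parse_pileup_line(l) for l in lines]
kept = [s for s in sites if keep_site(s)]
print([(s.chrom, s.pos, s.ref, s.consensus) for s in kept])
```

Only position 556678 survives: its consensus base K (IUPAC for G or T) differs from the reference G and its SNP quality is 60, whereas position 556677 is a reference call with SNP quality 0.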
An example of pileup files for indels is shown below:
seq2 156 * +AG/+AG 71 252 99 11 +AG * 3 8 0
ANNOVAR automatically recognizes both SNPs and indels in the pileup
file, and processes them correctly.
* GFF3-SOLiD format
The SOLiD platform provides a GFF3-compatible format for SNPs, indels
and structural variants. A typical example file is given below:
##gff-version 3
##solid-gff-version 0.3
##source-version 2
##type DNA
##date 2009-03-13
##time 0:0:0
##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
##reference-file
##input-files Yoruban_snp_10x.txt
##run-path
chr_name AB_SOLiD SNP caller SNP coord coord 1 . . coverage=# cov;ref_base=ref;ref_score=score;ref_confi=confi;ref_single=Single;ref_paired=Paired;consen_base=consen;consen_score=score;consen_confi=conf;consen_single=Single;consen_paired=Paired;rs_id=rs_id,dbSNP129
1 AB_SOLiD SNP caller SNP 997 997 1 . . coverage=3;ref_base=A;ref_score=0.3284;ref_confi=0.9142;ref_single=0/0;ref_paired=1/1;consen_base=G;consen_score=0.6716;consen_confi=0.9349;consen_single=0/0;consen_paired=2/2
1 AB_SOLiD SNP caller SNP 2061 2061 1 . . coverage=2;ref_base=G;ref_score=0.0000;ref_confi=0.0000;ref_single=0/0;ref_paired=0/0;consen_base=C;consen_score=1.0000;consen_confi=0.8985;consen_single=0/0;consen_paired=2/2
1 AB_SOLiD SNP caller SNP 4770 4770 1 . . coverage=2;ref_base=A;ref_score=0.0000;ref_confi=0.0000;ref_single=0/0;ref_paired=0/0;consen_base=G;consen_score=1.0000;consen_confi=0.8854;consen_single=0/0;consen_paired=2/2
1 AB_SOLiD SNP caller SNP 4793 4793 1 . . coverage=14;ref_base=A;ref_score=0.0723;ref_confi=0.8746;ref_single=0/0;ref_paired=1/1;consen_base=G;consen_score=0.6549;consen_confi=0.8798;consen_single=0/0;consen_paired=9/9
1 AB_SOLiD SNP caller SNP 6241 6241 1 . . coverage=2;ref_base=T;ref_score=0.0000;ref_confi=0.0000;ref_single=0/0;ref_paired=0/0;consen_base=C;consen_score=1.0000;consen_confi=0.7839;consen_single=0/0;consen_paired=2/2
Newer versions of ABI BioScope now use the diBayes caller, and an
example of its output file is given below:
##gff-version 3
##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
##List of SNPs. Date Sat Dec 18 10:30:45 2010 Stringency: medium Mate Pair: 1 Read Length: 50 Polymorphism Rate: 0.003000 Bayes Coverage: 60 Bayes_Single_SNP: 1 Filter_Single_SNP: 1 Quick_P_Threshold: 0.997000 Bayes_P_Threshold: 0.040000 Minimum_Allele_Ratio: 0.150000 Minimum_Allele_Ratio_Multiple_of_Dicolor_Error: 100
##1 chr1
##2 chr2
##3 chr3
##4 chr4
##5 chr5
##6 chr6
##7 chr7
##8 chr8
##9 chr9
##10 chr10
##11 chr11
##12 chr12
##13 chr13
##14 chr14
##15 chr15
##16 chr16
##17 chr17
##18 chr18
##19 chr19
##20 chr20
##21 chr21
##22 chr22
##23 chrX
##24 chrY
##25 chrM
# source-version SOLiD BioScope diBayes(SNP caller)
#Chr Source Type Pos_Start Pos_End Score Strand Phase Attributes
chr1 SOLiD_diBayes SNP 221367 221367 0.091151 . . genotype=R;reference=G;coverage=3;refAlleleCounts=1;refAlleleStarts=1;refAlleleMeanQV=29;novelAlleleCounts=2;novelAlleleStarts=2;novelAlleleMeanQV=27;diColor1=11;diColor2=33;het=1;flag=
chr1 SOLiD_diBayes SNP 555317 555317 0.095188 . . genotype=Y;reference=T;coverage=13;refAlleleCounts=11;refAlleleStarts=10;refAlleleMeanQV=23;novelAlleleCounts=2;novelAlleleStarts=2;novelAlleleMeanQV=29;diColor1=00;diColor2=22;het=1;flag=
chr1 SOLiD_diBayes SNP 555327 555327 0.037582 . . genotype=Y;reference=T;coverage=12;refAlleleCounts=6;refAlleleStarts=6;refAlleleMeanQV=19;novelAlleleCounts=2;novelAlleleStarts=2;novelAlleleMeanQV=29;diColor1=12;diColor2=30;het=1;flag=
chr1 SOLiD_diBayes SNP 559817 559817 0.094413 . . genotype=Y;reference=T;coverage=9;refAlleleCounts=5;refAlleleStarts=4;refAlleleMeanQV=23;novelAlleleCounts=2;novelAlleleStarts=2;novelAlleleMeanQV=14;diColor1=11;diColor2=33;het=1;flag=
chr1 SOLiD_diBayes SNP 714068 714068 0.000000 . . genotype=M;reference=C;coverage=13;refAlleleCounts=7;refAlleleStarts=6;refAlleleMeanQV=25;novelAlleleCounts=6;novelAlleleStarts=4;novelAlleleMeanQV=22;diColor1=00;diColor2=11;het=1;flag=
The file conforms to standard GFF3 specifications, but the last column is
SOLiD-specific and it gives certain parameters for the SNP calls.
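The SOLiD-specific last column is a semicolon-separated list of key=value pairs, which can be unpacked as in the Python sketch below. The helper names are hypothetical, and the example splits on whitespace only because the sample lines are shown here with spaces; real GFF3 files are tab-delimited. Decoding the genotype through IUPAC ambiguity codes is an assumption about how the `genotype` attribute should be read.

```python
# IUPAC ambiguity codes for two-allele genotypes.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}


def parse_attributes(column9):
    """Turn 'key=value;key=value' into a dict; empty values (e.g. 'flag=') become ''."""
    attrs = {}
    for item in column9.split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        attrs[key] = value
    return attrs


def alt_alleles(genotype, reference):
    """List the alleles encoded by the genotype that differ from the reference base."""
    alleles = IUPAC.get(genotype, genotype)
    return [a for a in alleles if a != reference]


# One diBayes record from the example above.
line = (
    "chr1 SOLiD_diBayes SNP 555317 555317 0.095188 . . "
    "genotype=Y;reference=T;coverage=13;refAlleleCounts=11;refAlleleStarts=10;"
    "refAlleleMeanQV=23;novelAlleleCounts=2;novelAlleleStarts=2;"
    "novelAlleleMeanQV=29;diColor1=00;diColor2=22;het=1;flag="
)
fields = line.split()
attrs = parse_attributes(fields[8])
alts = alt_alleles(attrs["genotype"], attrs["reference"])
print(fields[0], fields[3], attrs["reference"], alts)
```

For this record the genotype Y stands for C or T, so with reference T the only non-reference allele is C.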
An example of the short indel format by GFF3-SOLiD is given below:
##gff-version 3
##solid-gff-version 0.3
##source-version SOLiD Corona Lite v.4.0r2.0, find-small-indels.pl v 1.0.1, process-small-indels v 0.2.2, 2009-01-12 12:28:49
##type DNA
##date 2009-01-26
##time 18:33:20
##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
##reference-file
##input-files ../../mp-results/JOAN_20080104_1.pas,../../mp-results/BARB_20071114_1.pas,../../mp-results/BARB_20080227_2.pas
##run-path /data/results2/Yoruban-frag-indel/try.01.06/mp-w2x25-2x-4x-8x-10x/2x
##Filter-settings: max-ave-read-pos=none,min-ave-from-end-pos=9.1,max-nonreds-4filt=2,min-insertion-size=none,min-deletion-size=none,max-insertion-size=none,max-deletion-size=none,require-called-indel-size?=T
chr1 AB_SOLiD Small Indel Tool deletion 824501 824501 1 . . del_len=1;tight_chrom_pos=824501-824502;loose_chrom_pos=824501-824502;no_nonred_reads=2;no_mismatches=1,0;read_pos=4,6;from_end_pos=21,19;strands=+,-;tags=R3,F3;indel_sizes=-1,-1;read_seqs=G3021212231123203300032223,T3321132212120222323222101;dbSNP=rs34941678,chr1:824502-824502(-),EXACT,1,/GG
chr1 AB_SOLiD Small Indel Tool insertion_site 1118641 1118641 1 . . ins_len=3;tight_chrom_pos=1118641-1118642;loose_chrom_pos=1118641-1118642;no_nonred_reads=2;no_mismatches=0,1;read_pos=17,6;from_end_pos=8,19;strands=+,+;tags=F3,R3;indel_sizes=3,3;read_seqs=T0033001100022331122033112,G3233112203311220000001002
The keyword deletion or insertion_site is used in the fourth
column to indicate that file format.
An example of the medium CNV format by GFF3-SOLiD is given below:
##gff-version 3
##solid-gff-version 0.3
##source-version SOLiD Corona Lite v.4.0r2.0, find-small-indels.pl v 1.0.1, process-small-indels v 0.2.2, 2009-01-12 12:28:49
##type DNA
##date 2009-01-27
##time 15:54:36
##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
##reference-file
##input-files big_d20e5-del12n_up-ConsGrp-2nonred.pas.sum
##run-path /data/results2/Yoruban-frag-indel/try.01.06/mp-results-lmp-e5/big_d20e5-indel_950_2050
chr1 AB_SOLiD Small Indel Tool deletion 3087770 3087831 1 . . del_len=62;tight_chrom_pos=none;loose_chrom_pos=3087768-3087773;no_nonred_reads=2;no_mismatches=2,2;read_pos=27,24;from_end_pos=23,26;strands=-,+;tags=F3,F3;indel_sizes=-62,-62;read_seqs=T11113022103331111130221213201111302212132011113022,T02203111102312122031111023121220311111333012203111
chr1 AB_SOLiD Small Indel Tool deletion 4104535 4104584 1 . . del_len=50;tight_chrom_pos=4104534-4104537;loose_chrom_pos=4104528-4104545;no_nonred_reads=3;no_mismatches=0,4,4;read_pos=19,19,27;from_end_pos=31,31,23;strands=+,+,-;tags=F3,R3,R3;indel_sizes=-50,-50,-50;read_seqs=T31011011013211110130332130332132110110132020312332,G21031011013211112130332130332132110132132020312332,G20321302023001101123123303103303101113231011011011
chr1 AB_SOLiD Small Indel Tool insertion_site 2044888 2044888 1 . . ins_len=18;tight_chrom_pos=2044887-2044888;loose_chrom_pos=2044887-2044889;no_nonred_reads=2;bead_ids=1217_1811_209,1316_908_1346;no_mismatches=0,2;read_pos=13,15;from_end_pos=37,35;strands=-,-;tags=F3,F3;indel_sizes=18,18;read_seqs=T31002301231011013121000101233323031121002301231011,T11121002301231011013121000101233323031121000101231;non_indel_no_mismatches=3,1;non_indel_seqs=NIL,NIL
chr1 AB_SOLiD Small Indel Tool insertion_site 74832565 74832565 1 . . ins_len=16;tight_chrom_pos=74832545-74832565;loose_chrom_pos=74832545-74832565;no_nonred_reads=2;bead_ids=1795_181_514,1651_740_519;no_mismatches=0,2;read_pos=13,13;from_end_pos=37,37;strands=-,-;tags=F3,R3;indel_sizes=16,16;read_seqs=T33311111111111111111111111111111111111111111111111,G23311111111111111111111111111111111111111311011111;non_indel_no_mismatches=1,0;non_indel_seqs=NIL,NIL
An example of the large indel format by GFF3-SOLiD is given below:
##gff-version 3
##solid-gff-version 0.3
##source-version ???
##type DNA
##date 2009-03-13
##time 0:0:0
##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
##reference-file
##input-files /data/results5/yoruban_strikes_back_large_indels/LMP/five_mm_unique_hits_no_rescue/5_point_6x_del_lib_1/results/NA18507_inter_read_indels_5_point_6x.dat
##run-path
chr1 AB_SOLiD Large Indel Tool insertion_site 1307279 1307791 1 . . deviation=-742;stddev=7.18;ref_clones=-;dev_clones=4
chr1 AB_SOLiD Large Indel Tool insertion_site 2042742 2042861 1 . . deviation=-933;stddev=8.14;ref_clones=-;dev_clones=3
chr1 AB_SOLiD Large Indel Tool insertion_site 2443482 2444342 1 . . deviation=-547;stddev=11.36;ref_clones=-;dev_clones=17
chr1 AB_SOLiD Large Indel Tool insertion_site 2932046 2932984 1 . . deviation=-329;stddev=6.07;ref_clones=-;dev_clones=14
chr1 AB_SOLiD Large Indel Tool insertion_site 3166925 3167584 1 . . deviation=-752;stddev=13.81;ref_clones=-;dev_clones=14
An example of the CNV format by GFF3-SOLiD is given below:
##gff-version 3
##solid-gff-version 0.3
##source-version ???
##type DNA
##date 2009-03-13
##time 0:0:0
##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
##reference-file
##input-files Yoruban_cnv.coords
##run-path
chr1 AB_CNV_PIPELINE repeat_region 1062939 1066829 . . . fraction_mappable=51.400002;logratio=-1.039300;copynum=1;numwindows=1
chr1 AB_CNV_PIPELINE repeat_region 1073630 1078667 . . . fraction_mappable=81.000000;logratio=-1.409500;copynum=1;numwindows=2
chr1 AB_CNV_PIPELINE repeat_region 2148325 2150352 . . . fraction_mappable=98.699997;logratio=-1.055000;copynum=1;numwindows=1
chr1 AB_CNV_PIPELINE repeat_region 2245558 2248109 . .