084 - [Bioinformatics Software] - ANNOVAR Help Documentation

Installation

You will receive the software package by email:

annovar.latest.tar/

Perl scripts included in the package:

/opt/script/tool/annovar  Sun Oct 07 16:42  forstart
$tree -L 1
.
├── annotate_variation.pl
├── coding_change.pl
├── convert2annovar.pl
├── example
├── humandb
├── retrieve_seq_from_fasta.pl
├── table_annovar.pl
└── variants_reduction.pl

humandb

The ANNOVAR package ships with a few commonly used databases in the humandb/ directory:

/opt/script/tool/annovar/humandb  Sun Oct 07 16:43  forstart
$tree -L 1
.
├── genometrax-sample-files-gff
├── GRCh37_MT_ensGeneMrna.fa
├── GRCh37_MT_ensGene.txt
├── hg19_example_db_generic.txt
├── hg19_example_db_gff3.txt
├── hg19_MT_ensGeneMrna.fa
├── hg19_MT_ensGene.txt
├── hg19_refGeneMrna.fa
├── hg19_refGene.txt
├── hg19_refGeneVersion.txt
├── hg19_refGeneWithVerMrna.fa
└── hg19_refGeneWithVer.txt

GFF files

/opt/script/tool/annovar/humandb/genometrax-sample-files-gff  Sun Oct 07 16:44  forstart
$tree -L 1
.
├── list
├── sample_chip_featuretype_hg19.gff
├── sample_common_snp_featuretype_hg19.gff
├── sample_cosmic_featuretype_hg19.gff
├── sample_cpg_islands_featuretype_hg19.gff
├── sample_dbnsfp_featuretype_hg19.gff
├── sample_disease_featuretype_hg19.gff
├── sample_dnase_featuretype_hg19.gff
├── sample_drug_featuretype_hg19.gff
├── sample_evs_featuretype_hg19.gff
├── sample_gwas_featuretype_hg19.gff
├── sample_hgmd_common_snp_featuretype_hg19.gff
├── sample_hgmd_disease_genes_featuretype_hg19.gff
├── sample_hgmd_featuretype_hg19.gff
├── sample_hgmdimputed_featuretype_hg19.gff
├── sample_microsatellites_featuretype_hg19.gff
├── sample_miRNA_featuretype_hg19.gff
├── sample_omim_featuretype_hg19.gff
├── sample_pathway_featuretype_hg19.gff
├── sample_pgx_featuretype_hg19.gff
├── sample_ptms_featuretype_hg19.gff
├── sample_snps_dbsnp_featuretype_hg19.gff
├── sample_snps_ensembl_featuretype_hg19.gff
├── sample_transfac_sites_featuretype_hg19.gff
└── sample_tss_featuretype_hg19.gff

0 directories, 25 files

example

/mnt/e/software/linux/ANNOVAR/annovar.latest.tar/annovar/example$ ll
total 20152
drwxrwxrwx 1 toucan toucan     4096 Sep 26 22:47 ./
drwxrwxrwx 1 toucan toucan     4096 Sep 26 22:47 ../
-rwxrwxrwx 1 toucan toucan     1940 Apr 17 03:41 README*
-rwxrwxrwx 1 toucan toucan     1831 Apr 17 03:41 ex1.avinput*
-rwxrwxrwx 1 toucan toucan     1706 Apr 17 03:41 ex2.vcf*
-rwxrwxrwx 1 toucan toucan       44 Apr 17 03:41 example.simple_region*
-rwxrwxrwx 1 toucan toucan       44 Apr 17 03:41 example.tab_region*
-rwxrwxrwx 1 toucan toucan 20317115 Apr 17 03:41 gene_fullxref.txt*
-rwxrwxrwx 1 toucan toucan   295664 Apr 17 03:41 gene_xref.txt*
-rwxrwxrwx 1 toucan toucan     1436 Apr 17 03:41 grantham.matrix*
-rwxrwxrwx 1 toucan toucan       43 Apr 17 03:41 snplist.txt*

README contents

/mnt/e/software/linux/ANNOVAR/annovar.latest.tar/annovar/example$ cat README
visit ANNOVAR website at http://www.openbioinformatics.org/annovar for more examples.

Please cite ANNOVAR if you use it in your research (Wang K, Li M, Hakonarson H. ANNOVAR: Functional annotation of genetic variants from next-generation sequencing data, Nucleic Acids Research, 38:e164, 2010). I spent tremendous amount of time and effort to maintain this tool, and your citation really means a lot to me.

ex1.avinput: a simple ANNOVAR input example with a few variants (in hg19 coordinate)

ex2.vcf: a simple VCF file with genotype information for 3 samples

gene_xref.txt: an example gene cross-reference file to be used on 'gx' operation in table_annovar.pl

example.simple_region: a file containing a list of genomic regions in sample format (for use in retrieve_seq_from_fasta.pl)

example.tab_region: a file containing a list of genomic regions in tab-delimited format (for use in retrieve_seq_from_fasta.pl)

snplist.txt: a text file listing several dbSNP rs identifiers, one per line

humandb/hg19_example_db_generic.txt: an example file for generic database

humandb/hg19_example_db_gff3.txt: an example file for GFF3 database

grantham.matrix: a matrix file containing GRANTHAM scores for gene-based annotation

humandb/genometrax-sample-files-gff: a directory containing several "sample" GFF files provided by BioBase

humandb/hg19_MT_ensGene.txt and humandb/hg19_MT_ensGene.fa: mitochondria sequence for the NC_001807 contig used by UCSC Genome Browser. Even if you align your sequence data with reference sequences from UCSC, you should still use these files, not the ENSEMBLE file, for mitochondria annotation, because the ENSEMBLE annotations have some errors.

humandb/GRCh37_MT_ensGene.txt and humandb/GRCh37_MT_ensGene.fa: mitochondria sequence for the NC_012920 contig. If you align your sequence data using 1000 Genomes Project reference FASTA file, then you should use this file for annotating mitochondria variants.

For other annotations, use the -downdb option to download the required database into the 'humandb/' directory:

# Download the 1000g2015aug database
$perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar 1000g2015aug humandb/

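Software help documentation

(ANNOVAR program structure
│ annotate_variation.pl #main program; downloads databases and performs the three types of annotation
│ coding_change.pl #infers the resulting protein sequence
│ convert2annovar.pl #converts various formats to .avinput
│ retrieve_seq_from_fasta.pl #used to build transcript sequences for other species
│ table_annovar.pl #annotation wrapper; runs all three annotation types in one pass
│ variants_reduction.pl #for more flexible, customized filtering and annotation pipelines
│
├─example #example files
│
└─humandb #human annotation databases)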

annotate_variation.pl

$cat mam/annotate_variation.txt
SYNOPSIS
     annotate_variation.pl [arguments] <query-file|table-name> <database-location>

     Optional arguments:
            -h, --help                      print help message
            -m, --man                       print complete documentation
            -v, --verbose                   use verbose output

            Arguments to download databases or perform annotations
                --downdb                    download annotation database
                --geneanno                  annotate variants by gene-based annotation (infer functional consequence on genes)
                --regionanno                annotate variants by region-based annotation (find overlapped regions in database)
                --filter                    annotate variants by filter-based annotation (find identical variants in database)

            Arguments to control input and output
                --outfile <file>            output file prefix
                --webfrom <string>          specify the source of database (ucsc or annovar or URL) (downdb operation)
                --dbtype <string>           specify database type
                --buildver <string>         specify genome build version (default: hg18 for human)
                --time                      print out local time during program run
                --comment                   print out comment line (those starting with #) in output files
                --exonsort                  sort the exon number in output line (gene-based annotation)
                --transcript_function       use transcript name rather than gene name (gene-based annotation)
                --hgvs                      use HGVS format for exonic annotation (c.122C>T rather than c.C122T)(gene-based annotation)
                --separate                  separately print out all functions of a variant in several lines (gene-based annotation)
                --seq_padding               create a new file with cDNA sequence padded by this much either side(gene-based annotation)
                --(no)firstcodondel         treat first codon deletion as wholegene deletion (default: ON) (gene-based annotation)
                --aamatrix <file>           specify an amino acid substitution matrix file (gene-based annotation)
                --colsWanted <string>       specify which columns to output by comma-delimited numbers (region-based annotation)
                --scorecolumn <int>         the column with scores in DB file (region-based annotation)
                --poscolumn <string>        the comma-delimited column with position information in DB file (region-based annotation)
                --gff3dbfile <file>         specify a DB file in GFF3 format (region-based annotation)
                --gff3attribute             output all fields in GFF3 attribute (default: ID and score only)
                --bedfile <file>            specify a DB file in BED format file (region-based annotation)
                --genericdbfile <file>      specify a DB file in generic format (filter-based annotation)
                --vcfdbfile <file>          specify a DB file in VCF format (filter-based annotation)
                --otherinfo                 print out additional columns in database file (filter-based annotation)
                --infoasscore               use INFO field in VCF file as score in output (filter-based annotation)
                --idasscore                 use ID field in VCF file as score in output (filter-based annotation)
                --infosep                   use # rather than , to separate fields when -otherinfo is used


            Arguments to fine-tune the annotation procedure
                --batchsize <int>           batch size for processing variants per batch (default: 5m)
                --genomebinsize <int>       bin size to speed up search (default: 100k for -geneanno, 10k for -regionanno)
                --expandbin <int>           check nearby bin to find neighboring genes (default: 2m/genomebinsize)
                --neargene <int>            distance threshold to define upstream/downstream of a gene
                --exonicsplicing            report exonic variants near exon/intron boundary as 'exonic;splicing' variants
                --score_threshold <float>   minimum score of DB regions to use in annotation
                --normscore_threshold <float> minimum normalized score of DB regions to use in annotation
                --reverse                   reverse directionality to compare to score_threshold
                --rawscore                  output includes the raw score (not normalized score) in UCSC BrowserTrack
                --minqueryfrac <float>      minimum percentage of query overlap to define match to DB (default: 0)
                --splicing_threshold <int>  distance between splicing variants and exon/intron boundary (default: 2)
                --indel_splicing_threshold <int>    if set, use this value for allowed indel size for splicing variants (default: --splicing_threshold)
                --maf_threshold <float>     filter 1000G variants with MAF above this threshold (default: 0)
                --sift_threshold <float>    SIFT threshold for deleterious prediction for -dbtype avsift (default: 0.05)
                --precedence <string>       comma-delimited to specify precedence of variant function (default: exonic>intronic...)
                --indexfilter_threshold <float>     controls whether filter-based annotation use index if this fraction of bins need to be scanned (default: 0.9)
                --thread <int>              use multiple threads for filter-based annotation
                --maxgenethread <int>       max number of threads for gene-based annotation (default: 6)
                --mingenelinecount <int>    min line counts to enable threaded gene-based annotation (default: 1000000)

           Arguments to control memory usage
                --memfree <int>             ensure minimum amount of free system memory (default: 0)
                --memtotal <int>            limit total amount of memory used by ANNOVAR (default: 0, unlimited,in the order of kb)
                --chromosome <string>       examine these specific chromosomes in database file


     Function: annotate a list of genetic variants against genome annotation
     databases stored at local disk.
# Examples
     Example: #download annotation databases from ANNOVAR or UCSC and save to humandb/ directory
              annotate_variation.pl -downdb -webfrom annovar refGene humandb/
              annotate_variation.pl -buildver mm9 -downdb refGene mousedb/
              annotate_variation.pl -downdb -webfrom annovar esp6500siv2_all humandb/

              #gene-based annotation of variants in the varlist file (by default --geneanno is ON)
              annotate_variation.pl -buildver hg19 ex1.avinput humandb/

              #region-based annotate variants
              annotate_variation.pl -regionanno -buildver hg19 -dbtype cytoBand ex1.avinput humandb/
              annotate_variation.pl -regionanno -buildver hg19 -dbtype gff3 -gff3dbfile tfbs.gff3 ex1.avinput humandb/

              #filter rare or unreported variants (in 1000G/dbSNP) or predicted deleterious variants
              annotate_variation.pl -filter -dbtype 1000g2015aug_all -maf 0.01 ex1.avinput humandb/
              annotate_variation.pl -filter -buildver hg19 -dbtype snp138 ex1.avinput humandb/
              annotate_variation.pl -filter -dbtype dbnsfp30a -otherinfo ex1.avinput humandb/

     Version: $Date: 2018-04-16 00:43:31 -0400 (Mon, 16 Apr 2018) $

OPTIONS
    --help  print a brief usage message and detailed explanation of options.

    --man   print the complete manual of the program.

    --verbose
            use verbose output.

    --downdb
            download annotation databases from UCSC Genome Browser, Ensembl,
            1000 Genomes Project, ANNOVAR website or other resources. The
            annotation databases are required for functional annotation of
            genetic variants.

    --geneanno
            perform gene-based annotation. For each variant, examine whether
            it hit exon, intron, intergenic region, or close to a transcript,
            or hit a non-coding RNA gene, or is located in an untranslated
            region (see *.variant_function output file). In addition, for an
            exonic variant, determine whether it causes splicing change,
            non-synonymous amino acid change, synonymous amino acid change or
            frameshift changes (see *.exonic_variant_function output file).

    --regionanno
            perform region-based annotation. For each variant, examine whether
            its genomic region (one or multiple base pairs) overlaps with a
            specific genomic region, such as the most conserved elements, the
            predicted transcription factor binding sites, the specific
            cytogenetic bands, the evolutionarily conserved RNA secondary
            structures.

    --filter
            perform filter-based annotation. For each variant, filter it
            against a variation database, such as the 1000 Genomes Project
            database, to identify whether it has been reported in the database.
            Exact match of nucleotide position and nucleotide composition are
            required.

    --outfile
            specify the output file prefix. Several output files will be
            generated using this prefix and different suffixes. A directory
            name can also be specified as part of the argument, so that the
            output files can be written to a different directory than the
            current directory.

    --webfrom
            specify the source of database (ucsc or annovar or URL) in the
            downdb operation. By default, files from UCSC Genome Browser
            annotation database will be downloaded.

    --dbtype
            specify the database type to be used in gene-based, region-based
            or filter-based annotations. For gene-based annotation, by default
            refGene annotations from the UCSC Genome Browser will be used for
            annotating variants. However, users can switch to use Ensembl
            annotations, or use the UCSC Gene annotations, or the GENCODE Gene
            annotations, or other types of gene annotations. For region-based
            annotations, users can select any UCSC annotation databases (by
            providing the database name), or alternatively select a Generic
            Feature Format version 3 (GFF3) formatted file for annotation (by
            providing 'gff3' as the --dbtype and providing the --gff3dbfile
            argument), or select a BED file (by providing '--dbtype bed' and
            --bedfile arguments). For filter-based annotations, users can
            select a dbSNP file, a 1000G file, a generic format file (with
            simple columns including chr, start, end, reference, observed,
            score), a VCF format file (which is a widely used format for
            variants exchange), or many other types of formats.

    --buildver
            genome build version to use. By default, the hg18 build for human
            genome is used. The build version will be used by ANNOVAR to
            identify corresponding database files automatically, for example,
            when gene-based annotation is used for hg18 build, ANNOVAR will
            search for the hg18_refGene.txt file, but if the hg19 is used as
            --buildver, ANNOVAR will examine hg19_refGene.txt instead.
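
The lookup described for --buildver can be illustrated with a short Python sketch. It is not ANNOVAR's own (Perl) code: the helper name `db_filename` is hypothetical, and the only thing assumed is the `<buildver>_<dbtype>.txt` naming pattern visible in the text (hg18_refGene.txt, hg19_refGene.txt).

```python
from pathlib import Path


def db_filename(dbdir: str, buildver: str = "hg18", dbtype: str = "refGene") -> Path:
    """Build the expected database path, e.g. humandb/hg19_refGene.txt.

    Hypothetical helper: it mirrors the naming pattern from the help text,
    with hg18 and refGene as the documented defaults.
    """
    return Path(dbdir) / f"{buildver}_{dbtype}.txt"


if __name__ == "__main__":
    print(db_filename("humandb"))          # default build (hg18)
    print(db_filename("humandb", "hg19"))  # after passing -buildver hg19
```

If the file is missing from the database directory, the -downdb command shown earlier in this post is the way to fetch it.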

    --time  print out the local time during execution of the program

    --comment
            specify that the program should include comment lines in the
            output files. Comment lines are defined as any line starting with
            #. By default, these lines are not recognized as valid ANNOVAR
            input and are therefore written to the INVALID_INPUT file. This
            argument can be very useful to keep columns headers in the output
            file, if the input file use comment line to flag the column
            headers (usually the first line in the input file).

    --exonsort
            sort the exon number in output line in the exonic_variant_function
            file during gene-based annotation. If a mutation affects multiple
            transcripts, the ones with the smaller exon number will be printed
            before the transcript with larger exon number in the output.

    --transcript_function
            use transcript name rather than gene name in output, for
            gene-based annotation

    --hgvs  use HGVS format for exonic annotation (c.122C>T rather than
            c.C122T) for gene-based annotation
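
To make the difference between the two notations concrete, here is a minimal Python sketch that rewrites the default style (c.C122T) into the HGVS style (c.122C>T). The function name `to_hgvs` is hypothetical, and the sketch covers only single-nucleotide substitutions; it does not reproduce how ANNOVAR handles indels or protein-level changes.

```python
import re

# Default-style cDNA substitution: c.<ref><position><alt>, e.g. c.C122T
_DEFAULT_SNV = re.compile(r"^c\.([ACGT])(\d+)([ACGT])$")


def to_hgvs(change: str) -> str:
    """Rewrite c.C122T as c.122C>T; return other strings unchanged."""
    m = _DEFAULT_SNV.match(change)
    if m is None:
        return change  # not a simple substitution; leave as-is
    ref, pos, alt = m.groups()
    return f"c.{pos}{ref}>{alt}"


if __name__ == "__main__":
    print(to_hgvs("c.C122T"))
```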

    --separate
            for gene-based annotation, separate the effects of each variant,
            so that each effect (intronic, exonic, splicing) is printed in one
            output line. By default, all effects are printed in the same line,
            in the comma-separated form of 'UTR3,UTR5' or 'exonic,splicing'.

    --seq_padding
            create a new file with cDNA sequence padded by this much either
            side (gene-based annotation)

    --firstcodondel
            if the first codon of a gene is deleted, then the whole gene will
            be treated as deleted in gene-based annotation. By default, this
            option is ON.

    --aamatrixfile
            specify an amino acid substitution matrix, so that the scores are
            printed in the exonic_variant_function file in gene-based
            annotation. The matrix file is tab-delimited, and an example is
            included in the ANNOVAR package.

    --colsWanted
            specify which columns are desired in the output for -regionanno.
            By default, ANNOVAR intelligently selects the columns based on the
            DB type. However, users can use a list of comma-delimited numbers,
            or use 'all', or use 'none', to request custom output columns.

    --scorecolumn
            specify the column with desired output scores in UCSC database
            file (for region-based annotation). The default usually works
            okay.

    --poscolumn
            the comma-delimited column with position information in DB file
            (region-based annotation). The default usually works okay.

    --gff3dbfile
            specify the GFF3-formatted database file used in the region-based
            annotation. Please consult
            http://www.sequenceontology.org/resources/gff3.html for detailed
            description on this file format. Note that GFF3 is generally not
            compatible with previous versions of GFF.

    --gff3attribute
            output should contain all fields in GFF3 file attribute column
            (the 9th column). By default, only the ID in the attribute and the
            scores for the GFF3 file will be printed.

    --bedfile
            specify a DB file in BED format file in region-based annotation.
            Please consult http://genome.ucsc.edu/FAQ/FAQformat.html#format1
            for detailed descriptions on this format.

    --genericdbfile
            specify the generic format database file used in the filter-based
            annotation.

    --vcfdbfile
            specify the database file in VCF format in the filter-based
            annotation. VCF has been a popular format for summarizing SNP and
            indel calls in a population of samples, and has been adopted by
            1000 Genomes Project in their most recent data release.

    --otherinfo
            print out additional columns in database file in filter-based
            annotation. This argument is useful when the annotation database
            contains more than one annotation columns, so that all columns
            will be printed out and separated by comma (by default).

    --idasscore
            when annotating against a VCF file, treat the ID field in VCF file
            as the score to be printed in the output, in filter-based
            annotation. By default the score is the allele frequency inferred
            from VCF file.

    --infoasscore
            when annotating against a VCF file, treat the INFO field in VCF
            file as the score to be printed in the output, in filter-based
            annotation. By default the score is allele frequency inferred from
            VCF file.

    --infosep
            use '#' rather than ',' to separate multiple fields when
            -otherinfo is used in annotation. This argument is useful when the
            annotation string itself contains comma, to help users clearly
            separate different annotation fields.

    --batchsize
            this argument specifies the batch size for processing variants by
            gene-based annotation. Normally 5 million variants (usually one
            human genome will have about 3-5 million variants depending on
            ethnicity) are annotated as a batch, to reduce the amounts of
            memory. The users can adjust the parameters: larger values make
            the program slightly faster, at the expense of slightly larger
            memory requirements. In a 64bit computer, the default settings
            usually take 1GB memory for gene-based annotation for human genome
            for a typical query file, but this depends on the complexity of
            the query (note that the query has a few required fields, but may
            have many optional fields and those fields need to be read and
            kept in memory).

    --genomebinsize
            the bin size of genome to speed up search. By default 100kb is
            used for gene-based annotation, so that variant annotation
            focused on specific bins only (based on the start-end site of a
            given variant), rather than searching the entire chromosomes for
            each variant. By default 10kb is used for region-based annotation.
            The filter-based annotations look for variants directly so no bin
            is used.
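
The binning idea can be sketched in a few lines of Python. This is an illustration of the general technique only: the function name `bins_for_variant` is hypothetical, and using integer division of the coordinate by the bin size is an assumption, not a statement about ANNOVAR's internal indexing.

```python
from typing import List


def bins_for_variant(start: int, end: int, binsize: int = 100_000) -> List[int]:
    """Return the indices of all bins touched by the interval [start, end].

    Assumption: bin index = coordinate // binsize. Only database records
    stored under these bins need to be compared with the variant.
    """
    return list(range(start // binsize, end // binsize + 1))


if __name__ == "__main__":
    # A SNV falls into exactly one 100 kb bin
    print(bins_for_variant(49303427, 49303427))
    # A deletion spanning a bin boundary touches two bins
    print(bins_for_variant(199_999, 200_001))
```

A smaller bin size means fewer records per bin but more bins per large variant, which is one plausible reason the defaults differ between gene-based and region-based annotation.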

    --expandbin
            expand bin to both sides to find neighboring genes/regions. For
            gene-based annotation, ANNOVAR tries to find nearby genes for any
            intergenic variant, with a maximum number of nearby bins to
            search. By default, ANNOVAR will automatically set this argument
            to search 2 megabases to the left and right of the variant in
            genome.

    --neargene
            the distance threshold to define whether a variant is in the
            upstream or downstream region of a gene. By default 1 kilobase
            from the start or end site of a transcript is defined as upstream
            or downstream, respectively. This is useful, for example, when one
            wants to identify variants that are located in the promoter
            regions of genes across the genome.
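
A rough Python sketch of the distance rule follows. The function name `near_gene_label`, the inclusive boundary, and the strand handling (upstream is before the transcript start on the + strand, after the transcript end on the - strand) are assumptions made for illustration; the help text only states the 1 kb default.

```python
from typing import Optional


def near_gene_label(pos: int, tx_start: int, tx_end: int,
                    strand: str, neargene: int = 1000) -> Optional[str]:
    """Label a position as 'upstream' or 'downstream' of a transcript.

    Returns None when the position lies inside the transcript or is
    farther away than `neargene` bases (i.e. it would be intergenic).
    """
    if tx_start <= pos <= tx_end:
        return None
    if pos < tx_start and tx_start - pos <= neargene:
        return "upstream" if strand == "+" else "downstream"
    if pos > tx_end and pos - tx_end <= neargene:
        return "downstream" if strand == "+" else "upstream"
    return None


if __name__ == "__main__":
    print(near_gene_label(9_500, 10_000, 20_000, "+"))
    print(near_gene_label(9_500, 10_000, 20_000, "-"))
```

Raising `neargene` widens the window, which is how one would capture more of a putative promoter region.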

    --exonicsplicing
            report exonic variants near exon/intron boundary as
            'exonic;splicing' variants. These variants are technically exonic
            variants, but there are some literature reports that some of them
            may also affect splicing so a keyword is preserved specifically
            for them.

    --score_threshold
            the minimum score to consider when examining region-based
            annotations on UCSC Genome Browser tables. Some tables do not have
            such scores and this argument will not be effective.

    --normscore_threshold
            the minimum normalized score to consider when examining
            region-based annotations on UCSC Genome Browser tables. The
            normalized score is calculated by UCSC, ranging from 0 to 1000, to
            make visualization easier. Some tables do not have such scores and
            this argument will not be effective.

    --reverse
            reverse the criteria for --score_threshold and
            --normscore_threshold. So the minimum score becomes maximum score
            for a result to be printed.

    --rawscore
            for region-based annotation, print out raw scores from UCSC Genome
            Browser tables, rather than normalized scores. By default,
            normalized scores are printed in the output files. Normalized
            scores are compiled by UCSC Genome Browser for each track, and
            they usually range from 0 to 1000, but there are some exceptions.

    --minqueryfrac
            The minimum fraction of overlap between a query and a database
            record to decide on their match. By default, any overlap is
            regarded as a match, but this may not work best when the query consists
            of large copy number variants.
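
The overlap criterion can be written out as a small Python sketch. Both function names are hypothetical, and 1-based inclusive coordinates are assumed; the sketch shows the arithmetic behind a "fraction of the query covered" test, not ANNOVAR's implementation.

```python
def query_overlap_fraction(q_start: int, q_end: int,
                           db_start: int, db_end: int) -> float:
    """Fraction of the query interval covered by the database record."""
    overlap = min(q_end, db_end) - max(q_start, db_start) + 1
    return max(overlap, 0) / (q_end - q_start + 1)


def is_match(q_start: int, q_end: int, db_start: int, db_end: int,
             minqueryfrac: float = 0.0) -> bool:
    """With the default of 0, any overlap at all counts as a match."""
    frac = query_overlap_fraction(q_start, q_end, db_start, db_end)
    return frac > 0 and frac >= minqueryfrac


if __name__ == "__main__":
    # Query covers 100 bases, 50 of which overlap the database record
    print(query_overlap_fraction(100, 199, 150, 300))
    print(is_match(100, 199, 150, 300))                    # default threshold
    print(is_match(100, 199, 150, 300, minqueryfrac=0.6))  # stricter threshold
```

For a large copy number variant, a tiny database region inside it would match under the default, which is the situation the option is meant to address.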

    --splicing_threshold
            distance between splicing variants and exon/intron boundary, to
            claim that a variant is a splicing variant. By default, 2bp is
            used. ANNOVAR is relatively more stringent than some other
            software to claim variant as regulating splicing. In addition, if
            a variant is an exonic variant, it will not be reported as
            splicing variant even if it is within 2bp to an exon/intron
            boundary.
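
The rule above can be sketched for a single position next to a single exon. The function name `classify` is hypothetical, and the sketch assumes the position lies within the transcript so that anything outside the exon is intronic. It encodes the two statements from the text: the 2 bp default, and that exonic variants are not reported as splicing.

```python
def classify(pos: int, exon_start: int, exon_end: int,
             splicing_threshold: int = 2) -> str:
    """Classify a position relative to one exon (1-based inclusive coords)."""
    if exon_start <= pos <= exon_end:
        # Exonic variants stay 'exonic' even right at the boundary
        return "exonic"
    # Distance from the nearest exon/intron boundary, measured in the intron
    dist = exon_start - pos if pos < exon_start else pos - exon_end
    return "splicing" if dist <= splicing_threshold else "intronic"


if __name__ == "__main__":
    for p in (97, 98, 100, 202, 203):
        print(p, classify(p, 100, 200))
```

The --exonicsplicing option described earlier changes how boundary-adjacent exonic variants are reported; that behavior is not modeled here.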

    --indel_splicing_threshold
            If set, max size of indel allowed to be called a splicing variant
            (if boundary within --splicing_threshold bases of an intron/exon
            junction.) If not set, this is equal to the --splicing_threshold,
            as per original behavior.

    --maf_threshold
            the minor allele frequency (MAF) threshold to be used in the
            filter-based annotation for the 1000 Genomes Project databases. By
            default, any variant annotated in the 1000G will be used in
            filtering.

    --sift_threshold
            the default SIFT threshold for deleterious prediction for -dbtype
            avsift (default: 0.05). This argument is obsolete, since the
            recommended database for SIFT annotation is LJB database now,
            rather than avsift database.

    --thread
            specify the number of threads to use in filter-based annotation.
            Perl and all components in the system need to support
            multi-threaded analysis to use this feature. It is recommended
            when your database is stored at a SSD drive, which results in
            nearly linear speed up of annotation for large genome files.

    --maxgenethread
            specify the maximum number of threads for gene-based annotation
            (default: 6). Generally speaking, too many threads for gene-based
            annotation will negatively impact the performance.

    --mingenelinecount
            specify the minimum line counts to enable threaded gene-based
            annotation (default: 1000000). For input files with fewer lines,
            the threaded annotation will not be used, since it actually costs
            more time than non-threaded annotation.

    --memfree
            the minimum amount of free system memory that ANNOVAR should
            ensure to have.

    --memtotal
            the total amount of memory that ANNOVAR should use at most. By
            default, this value is zero, meaning that there is no limit on
            that. Decreasing this threshold reduces the memory requirement by
            ANNOVAR, but may increase the execution time.

    --chromosome
            examine these specific chromosomes in database file. The argument
            takes comma-delimited values, and the dash can be correctly
            recognized. For example, 5-10,X represent chromosome 5 through
            chromosome 10 plus chromosome X.
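
The `5-10,X` syntax can be expanded with a few lines of Python. This is a sketch of the described behavior only: the function name `expand_chromosomes` is hypothetical, and treating every dash as a numeric range is an assumption.

```python
from typing import List


def expand_chromosomes(spec: str) -> List[str]:
    """Expand a spec such as '5-10,X' into individual chromosome names."""
    chroms: List[str] = []
    for item in spec.split(","):
        item = item.strip()
        if "-" in item:
            # Assumed: a dash always denotes an inclusive numeric range
            lo, hi = item.split("-", 1)
            chroms.extend(str(n) for n in range(int(lo), int(hi) + 1))
        elif item:
            chroms.append(item)
    return chroms


if __name__ == "__main__":
    print(expand_chromosomes("5-10,X"))
```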

DESCRIPTION
    ANNOVAR is a software tool that can be used to functionally annotate a
    list of genetic variants, possibly generated from next-generation
    sequencing experiments. For example, given a whole-genome resequencing
    data set for a human with specific diseases, typically around 3 million
    SNPs and around half million insertions/deletions will be identified.
    Given this massive amount of data (and candidate disease-causing
    variants), it is necessary to have a fast algorithm that scans the data
    and identifies a prioritized subset of variants that are most likely
    functional for follow-up Sanger sequencing studies and functional assays.

    Currently, these various types of functional annotations produced by
    ANNOVAR can be (1) gene-based annotations (the default behavior), such as
    exonic variants, intronic variants, intergenic variants, downstream
    variants, UTR variants, splicing site variants, etc. For exonic variants,
    ANNOVAR will try to predict whether each of the variants is non-synonymous
    SNV, synonymous SNV, frameshifting change, nonframeshifting change. (2)
    region-based annotation, to identify whether a given variant overlaps with
    a specific type of genomic region, for example, predicted transcription
    factor binding site or predicted microRNAs.(3) filter-based annotation, to
    filter a list of variants so that only those not observed in variation
    databases (such as 1000 Genomes Project and dbSNP) are printed out.

    Detailed documentation for ANNOVAR should be viewed in ANNOVAR website
    (http://annovar.openbioinformatics.org/). Below is description on commonly
    encountered file formats when using ANNOVAR software.

    *       variant file format

            A sample variant file contains one variant per line, with the
            fields being chr, start, end, reference allele, observed allele,
            other information. The other information can be anything (for
            example, it may contain sample identifiers for the corresponding
            variant.) An example is shown below:

                    16      49303427        49303427        C       T       rs2066844       R702W (NOD2)
                    16      49314041        49314041        G       C       rs2066845       G908R (NOD2)
                    16      49321279        49321279        -       C       rs2066847       c.3016_3017insC (NOD2)
                    16      49290897        49290897        C       T       rs9999999       intronic (NOD2)
                    16      49288500        49288500        A       T       rs8888888       intergenic (NOD2)
                    16      49288552        49288552        T       -       rs7777777       UTR5 (NOD2)
                    18      56190256        56190256        C       T       rs2229616       V103I (MC4R)
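
A line in this format can be read with a short Python sketch. The names `Variant` and `parse_avinput_line` are hypothetical, and the sketch splits on any whitespace and keeps everything after the fifth field as one free-text string, which matches the description that the other information "can be anything".

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    start: int
    end: int
    ref: str    # reference allele, '-' for an insertion
    obs: str    # observed allele, '-' for a deletion
    other: str  # free-text remainder of the line (may be empty)


def parse_avinput_line(line: str) -> Variant:
    """Parse one variant line: chr, start, end, ref, obs, other information."""
    fields = line.strip().split(None, 5)
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}")
    chrom, start, end, ref, obs = fields[:5]
    other = fields[5] if len(fields) > 5 else ""
    return Variant(chrom, int(start), int(end), ref, obs, other)


if __name__ == "__main__":
    v = parse_avinput_line(
        "16\t49321279\t49321279\t-\tC\trs2066847\tc.3016_3017insC (NOD2)"
    )
    print(v.chrom, v.start, v.end, v.ref, v.obs)
```

In the example, the `-` in the reference column marks an insertion of C, consistent with the `c.3016_3017insC` note on that line.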

    *       database file format: UCSC Genome Browser annotation database

            Most but not all of the gene annotation databases are directly
            downloaded from UCSC Genome Browser, so the file format is
            identical to what was used by the genome browser. The users can
            check Table Browser (for example, human hg18 table browser is at
            http://www.genome.ucsc.edu/cgi-bin/hgTables?org=Human&db=hg18) to
            see what fields are available in the annotation file. Note that
            even for the same species (such as humans), the file format might
            be different between different genome builds (such as between
            hg16, hg17 and hg18). ANNOVAR will try to be smart about guessing
            file format, based on the combination of the --buildver argument
            and the number of columns in the input file. In general, the
            database file format should not be something that users need to
            worry about.

    *       database file format: GFF3 format for gene-based annotations

            As of June 2010, ANNOVAR cannot perform gene-based annotations
            using GFF3 input files, and any annotations on GFF3 are
            region-based. I suggest that users download the gff3ToGenePred
            tool from UCSC and convert GFF3-based gene annotation to UCSC
            format, so that ANNOVAR can perform gene-based annotation for your
            species of interest.

    *       database file format: GFF3 format for region-based
            annotations

            Currently, region-based annotation supports files in Generic
            Feature Format version 3 (GFF3). GFF3 has become the de facto
            standard for many model organism databases, so many users may
            want to run ANNOVAR on a custom annotation database; this is
            most convenient if the custom file is in GFF3 format.

    *       database file format: generic format for filter-based
            annotations

            The 'generic' format is designed for filter-based annotation that
            looks for exact variants. The format is almost identical to the
            ANNOVAR input format, with chr, start, end, reference allele,
            observed allele and scores (higher scores are regarded as better).

    *       database file format: VCF format for filter-based annotations

            ANNOVAR can directly interrogate VCF files as database files. A
            VCF file may contain summary information for variants (for
            example, this variant has MAF of 5% in this population), or it may
            contain the actual variant calls for each individual in a specific
            population.

    *       sequence file format

            ANNOVAR can directly examine FASTA-formatted sequence files. For
            mRNA sequences, the sequence names are the mRNA identifiers.
            For genomic sequences, the sequence names in the files are
            usually chr1, chr2, chr3, etc., so that ANNOVAR knows which
            sequence corresponds to which chromosome. Unfortunately, UCSC uses
            names like chr6_random to annotate unassembled sequences, as
            opposed to using the actual contig identifiers. This causes some
            issues (depending on how read alignment algorithms work), but in
            general it should not be something that users need to worry about. If
            the users absolutely care about the exact contigs rather than
            chr*_random, then they will need to re-align the short reads at
            chr*_random to a different FASTA file that contains the contigs
            (such as the GRCh36/37/38), and then execute ANNOVAR on the newly
            identified variants.

    *       invalid input

            If the query file contains input lines with an invalid format,
            ANNOVAR will skip such lines and continue annotating the
            remaining lines. These invalid input lines are written to a file
            with the suffix invalid_input. Users should manually examine this
            file and identify the sources of error.
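
The variant file format and the invalid-input behaviour described above can be illustrated with a short Python sketch. It is not ANNOVAR's own parser: the function name `parse_variant_line`, the `Variant` tuple and the validity checks (at least five columns, integer coordinates, start not after end) are assumptions made for illustration, and ANNOVAR's real checks may be stricter or different.

```python
from typing import List, NamedTuple, Optional


class Variant(NamedTuple):
    chrom: str
    start: int
    end: int
    ref: str          # reference allele, "-" for an insertion
    alt: str          # observed allele, "-" for a deletion
    extra: List[str]  # any remaining columns, kept as-is


def parse_variant_line(line: str) -> Optional[Variant]:
    """Return a Variant, or None if the line does not fit the layout.

    Hypothetical helper: splits on any whitespace and applies only
    basic sanity checks.
    """
    fields = line.split()
    if len(fields) < 5:
        return None
    chrom, start, end, ref, alt = fields[:5]
    if not (start.isdigit() and end.isdigit()):
        return None
    if int(start) > int(end):
        return None
    return Variant(chrom, int(start), int(end), ref, alt, fields[5:])


lines = [
    "16\t49303427\t49303427\tC\tT\trs2066844\tR702W (NOD2)",
    "16\t49321279\t49321279\t-\tC\trs2066847\tc.3016_3017insC (NOD2)",
    "16\tnot_a_number\t49321279\tC\tT",
]

valid = []    # parsed variants
invalid = []  # raw lines that failed the checks
for line in lines:
    variant = parse_variant_line(line)
    if variant is None:
        invalid.append(line)
    else:
        valid.append(variant)

print(len(valid), "valid,", len(invalid), "invalid")
```

Collecting the rejected lines in a separate list mirrors the idea of the invalid_input file: the run continues, and the skipped lines stay available for inspection.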

    --------------------------------------------------------------------------------

    ANNOVAR is free for academic, personal and non-profit use.

    For questions or comments, please contact $Author: kaichop
    <[email protected]> $.


$cat mam/convert2annovar.txt
SYNOPSIS
     convert2annovar.pl [arguments] <variantfile>

     Optional arguments:
            -h, --help                      print help message
            -m, --man                       print complete documentation
            -v, --verbose                   use verbose output
                --format <string>           input format (default: pileup)
                --includeinfo               include supporting information in output
                --outfile <file>            output file name (default: STDOUT)
                --snpqual <float>           quality score threshold in pileup file (default: 20)
                --snppvalue <float>         SNP P-value threshold in GFF3-SOLiD file (default: 1)
                --coverage <int>            read coverage threshold in pileup file (default: 0)
                --maxcoverage <int>         maximum coverage threshold (default: none)
                --chr <string>              specify the chromosome (for CASAVA format)
                --chrmt <string>            chr identifier for mitochondria (default: M)
                --fraction <float>          minimum allelic fraction to claim a mutation (for pileup format)
                --altcov <int>              alternative allele coverage threshold (for pileup format)
                --allelicfrac               print out allelic fraction rather than het/hom status (for pileup format)
                --species <string>          if human, convert chr23/24/25 to X/Y/M (for gff3-solid format)
                --filter <string>           output variants with this filter (case insensitive, for vcf4 format)
                --confraction <float>       minimal fraction for two indel calls as a 0-1 value (for vcf4old format)
                --allallele                 print all alleles rather than first one (for vcf4old format)
                --withzyg                   print zygosity/coverage/quality when -includeinfo is used (for vcf4 format)
                --comment                   keep comment line in output (for vcf4 format)
                --allsample                 process all samples in file with separate output files (for vcf4 format)
                --genoqual <float>          genotype quality score threshold (for vcf4 format)
                --varqual <float>           variant quality score threshold (for vcf4 format)
                --dbsnpfile <file>          dbSNP file in UCSC format (for rsid format)
                --withfreq                  for --allsample, print frequency information instead (for vcf4 format)
                --withfilter                print filter information in output (for vcf4 format)
                --seqdir <string>           directory with FASTA sequences (for region format)
                --inssize <int>             insertion size (for region format)
                --delsize <int>             deletion size (for region format)
                --subsize <int>             substitution size (default: 1, for region format)
                --genefile <file>           specify the gene file from UCSC (for transcript format)
                --splicing_threshold <int>  the splicing threshold (for transcript format)
                --context <int>             print context nucleotide for indels (for casava format)
                --avsnpfile <file>          specify the avSNP file (for rsid format)
                --keepindelref              keep Ref/Alt alleles for indels (for vcf4 format)

     Function: convert variant call file generated from various software programs
     into ANNOVAR input format

     Example: convert2annovar.pl -format pileup -outfile variant.query variant.pileup
              convert2annovar.pl -format cg -outfile variant.query variant.cg
              convert2annovar.pl -format cgmastervar variant.masterVar.txt
              convert2annovar.pl -format gff3-solid -outfile variant.query variant.snp.gff
              convert2annovar.pl -format soap variant.snp > variant.avinput
              convert2annovar.pl -format maq variant.snp > variant.avinput
              convert2annovar.pl -format casava -chr 1 variant.snp > variant.avinput
              convert2annovar.pl -format vcf4 variantfile > variant.avinput
              convert2annovar.pl -format vcf4 -filter pass variantfile -allsample -outfile variant
              convert2annovar.pl -format vcf4old input.vcf > output.avinput
              convert2annovar.pl -format rsid snplist.txt -dbsnpfile snp138.txt > output.avinput
              convert2annovar.pl -format region -seqdir humandb/hg19_seq/ chr1:2000001-2000003 -inssize 1 -delsize 2
              convert2annovar.pl -format transcript NM_022162 -gene humandb/hg19_refGene.txt -seqdir humandb/hg19_seq/

     Version: $Date: 2018-04-16 00:48:00 -0400 (Mon, 16 Apr 2018) $

OPTIONS
    --help  print a brief usage message and detailed explanation of options.

    --man   print the complete manual of the program.

    --verbose
            use verbose output.

    --format
            the format of the input files. Currently supported formats include
            pileup, cg, cgmastervar, gff3-solid, soap, maq, casava, vcf4,
            vcf4old, rsid. In August 2013, the VCF file processing subroutine
            was changed (multiple samples in a VCF file can now be processed
            in a genotype-aware manner), but users can use vcf4old to obtain
            results identical to the old behavior.

    --outfile
            specify the output file name. By default, output is written to
            STDOUT.

    --snpqual
            quality score threshold in the pileup file, such that variant
            calls with lower quality scores will not be printed out in the
            output file.

    --snppvalue
            SNP p-value threshold in the GFF3-SOLiD file, such that variant
            calls with higher values will not be printed out in the output
            file.

    --coverage
            read coverage threshold in the pileup file, such that variant
            calls generated with lower coverage will not be printed in the
            output file.

    --maxcoverage
            maximum read coverage threshold in the pileup file, such that
            variant calls generated with higher coverage will not be printed
            in the output file.

    --includeinfo
            specify that the output should contain additional information in
            the input line. By default, only the chr, start, end, reference
            allele, observed allele and homozygosity status are included in
            output files.

    --chr   specify the chromosome for CASAVA format

    --chrmt specify the name of mitochondria chromosome (default is M)

    --altcov
            the minimum coverage of the alternative (mutated) allele to be
            printed out in output

    --allelicfrac
            print out allelic fraction rather than het/hom status (for pileup
            format). This is useful when processing mitochondria variants.

    --fraction
            specify the minimum fraction of the alternative allele required to
            print out the mutation. For example, if a site has 10 reads and 3
            support the alternative allele, a -fraction of 0.4 will prevent
            the mutation from being printed out.

    --species
            specify the species from which the sequencing data is obtained.
            For the GFF3-SOLiD format, when species is human, the chromosome
            23, 24 and 25 will be converted to X, Y and M, respectively.

    --filter
            for VCF4 file, only print out variant calls with this filter
            annotated. For example, if using GATK VariantFiltration walker,
            you will see PASS, GATKStandard, HARD_TO_VALIDATE, etc in the
            filter field. Using 'pass' as a filter is recommended in this
            case.

    --allsample
            for multi-sample VCF4 file, the --allsample argument will process
            all samples in the file and generate separate output files for
            each sample. By default, only the first sample in VCF4 file will
            be processed.

    --withzyg
            for VCF4 format, print out zygosity information, coverage
            information and genotype quality information when -includeinfo is
            used. By default, this information is printed out if
            -includeinfo is not used.

    --genoqual
            minimum genotype quality for the variant in this sample, to be
            printed out. The genotype quality is typically denoted as GQ in
            the SAMPLE column.

    --varqual
            minimum variant quality (the QUAL column in the VCF file) to
            handle the variant in VCF file.

    --comment
            include VCF4 header comment lines in the output file

    --genoqual
            specify the genotype quality score to be included in the output
            file

    --varqual
            specify the variant quality score to be included in the output
            file

    --dbsnpfile
            specify the dbSNP file to query (for rsid format)

    --withfreq
            include frequency information in the output (for VCF format with
            multiple samples)

    --withfilter
            include filter information in the output file (for VCF format)

    --seqdir
            specify the directory for sequence file (for region format)

    --inssize
            specify the insertion size when generating all mutations (for
            region format)

    --delsize
            specify the deletion size when generating all mutations (for
            region format)

    --subsize
            specify the substitution size when generating all mutations (for
            region format)

    --genefile
            specify the gene file from UCSC, which can be refGene, knownGene
            or ensGene (for transcript format)

    --splicing_threshold
            specify the splicing threshold (for transcript format)

    --context
            print context for indels which is useful to convert to VCF files
            (for CASAVA format)

    --avsnpfile
            specify the avsnpfile that will be queried when using rsid as the
            input file format

    --keepindelref
            do not alter the Ref and Alt alleles for indels in the VCF file
            (by default the program automatically changes and shortens the Ref
            and Alt allele)
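
Two of the options above, --fraction and --species, come down to simple rules that a few lines of Python can illustrate. This is a sketch under stated assumptions, not the program's code: the names `passes_fraction`, `normalize_chrom` and `HUMAN_CHR_MAP` are hypothetical, the inclusive comparison at the threshold is assumed, and so is the handling of an optional `chr` prefix.

```python
# Mapping described for --species human with the GFF3-SOLiD format.
HUMAN_CHR_MAP = {"23": "X", "24": "Y", "25": "M"}


def passes_fraction(alt_reads: int, total_reads: int, min_fraction: float) -> bool:
    """True if the alternative-allele fraction reaches min_fraction.

    Assumption: a site exactly at the threshold is kept.
    """
    if total_reads <= 0:
        return False
    return alt_reads / total_reads >= min_fraction


def normalize_chrom(chrom: str, species: str = "human") -> str:
    """Convert 23/24/25 (with or without a 'chr' prefix) to X/Y/M for human."""
    if species != "human":
        return chrom
    bare = chrom[3:] if chrom.startswith("chr") else chrom
    # Names outside the map are returned unchanged.
    return HUMAN_CHR_MAP.get(bare, chrom)


# The example from the --fraction entry: 3 of 10 reads support the
# alternative allele, so the fraction is 0.3 and a threshold of 0.4 rejects it.
print(passes_fraction(3, 10, 0.4))
print(normalize_chrom("chr23"), normalize_chrom("24"), normalize_chrom("25"))
```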

DESCRIPTION
    This program is used to convert variant call file generated from various
    software programs into ANNOVAR input format. Currently, the program can
    handle Samtools genotype-calling pileup format, SOLiD GFF format, Complete
    Genomics variant format, SOAP format, MAQ format, CASAVA format, VCF
    format. These formats are described below.

    *       pileup format

            The pileup format can be produced by the Samtools genotype-calling
            subroutine. Note that the phrase pileup format can be used
            in several contexts, and here I am only referring to the pileup
            files that contain the actual genotype calls.

            Using SamTools, given an alignment file in BAM format, a pileup
            file with genotype calls can be produced by the command below:

                    samtools pileup -vcf ref.fa aln.bam > raw.pileup
                    samtools.pl varFilter raw.pileup > final.pileup

            ANNOVAR will automatically filter the pileup file so that only
            SNPs reaching a quality threshold are printed out (default is 20,
            use --snpqual argument to change this). Most likely, users may
            want to also apply a coverage threshold, such that SNP calls from
            only a few reads are not considered. This can be achieved using
            the -coverage argument (default value is 0).

            An example of pileup files for SNPs is shown below:

                    chr1 556674 G G 54 0 60 16 a,.....,...,.... (B%A+%7B;0;%=B<:
                    chr1 556675 C C 55 0 60 16 ,,..A..,...,.... CB%%5%,A/+,%....
                    chr1 556676 C C 59 0 60 16 g,.....,...,.... .B%%.%.?.=/%...1
                    chr1 556677 G G 75 0 60 16 ,$,.....,...,.... .B%%9%5A6?)%;?:<
                    chr1 556678 G K 60 60 60 24 ,$.....,...,....^~t^~t^~t^~t^~t^~t^~t^~t^~t B%%B%<A;AA%??<=??;BA%B89
                    chr1 556679 C C 61 0 60 23 .....a...a....,,,,,,,,, %%1%&?*:2%*&)(89/[email protected]@@
                    chr1 556680 G K 88 93 60 23 ..A..,..A,....ttttttttt %%)%7B:B0%55:7=>>[email protected]?B;
                    chr1 556681 C C 102 0 60 25 .$....,...,....,,,,,,,,,^~,^~. %%3%.B*4.%.34.6./[email protected]@>5.
                    chr1 556682 A A 70 0 60 24 ...C,...,....,,,,,,,,,,. %:%(B:A4%7A?;A><<999=<<
                    chr1 556683 G G 99 0 60 24 ....,...,....,,,,,,,,,,. %A%[email protected]%?%[email protected]/./-1A7?

            The columns are chromosome, 1-based coordinate, reference base,
            consensus base, consensus quality, SNP quality, maximum mapping
            quality of the reads covering the sites, the number of reads
            covering the site, read bases and base qualities.

            An example of pileup files for indels is shown below:

                    seq2  156 *  +AG/+AG  71  252  99  11  +AG  *  3  8  0

            ANNOVAR automatically recognizes both SNPs and indels in pileup
            files and processes them correctly.
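
The ten SNP columns listed above, together with the --snpqual and --coverage thresholds, can be sketched in Python as follows. The names `PileupSite`, `parse_pileup_snp` and `keep_site` are hypothetical, the sketch handles only SNP lines (the indel line shown above has a different layout), and it only approximates the filtering described in the text rather than reproducing the converter's code.

```python
from typing import NamedTuple


class PileupSite(NamedTuple):
    chrom: str
    pos: int              # 1-based coordinate
    ref_base: str
    consensus_base: str   # IUPAC code, e.g. K for a G/T heterozygote
    consensus_qual: int
    snp_qual: int
    max_map_qual: int
    depth: int            # number of reads covering the site
    read_bases: str
    base_quals: str


def parse_pileup_snp(line: str) -> PileupSite:
    """Split one SNP pileup line into its ten whitespace-separated columns."""
    f = line.split()
    if len(f) != 10:
        raise ValueError("expected 10 columns, got %d" % len(f))
    return PileupSite(f[0], int(f[1]), f[2], f[3], int(f[4]),
                      int(f[5]), int(f[6]), int(f[7]), f[8], f[9])


def keep_site(site: PileupSite, snpqual: int = 20, coverage: int = 0) -> bool:
    """Keep sites whose SNP quality and read depth reach the thresholds."""
    return site.snp_qual >= snpqual and site.depth >= coverage


# Two lines taken from the example above.
ref_site = parse_pileup_snp(
    "chr1 556674 G G 54 0 60 16 a,.....,...,.... (B%A+%7B;0;%=B<:")
het_site = parse_pileup_snp(
    "chr1 556678 G K 60 60 60 24 "
    ",$.....,...,....^~t^~t^~t^~t^~t^~t^~t^~t^~t B%%B%<A;AA%??<=??;BA%B89")

for site in (ref_site, het_site):
    print(site.chrom, site.pos, site.consensus_base, keep_site(site))
```

With the default thresholds the first site (SNP quality 0) is dropped and the second (SNP quality 60, 24 reads) is kept; raising `coverage` above 24 would drop the second as well.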

    *       GFF3-SOLiD format

            The SOLiD provides a GFF3-compatible format for SNPs, indels and
            structural variants. A typical example file is given below:

                    ##gff-version 3
                    ##solid-gff-version 0.3
                    ##source-version 2
                    ##type DNA
                    ##date 2009-03-13
                    ##time 0:0:0
                    ##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
                    ##reference-file
                    ##input-files Yoruban_snp_10x.txt
                    ##run-path
                    chr_name        AB_SOLiD SNP caller     SNP     coord   coord   1       .       .       coverage=# cov;ref_base=ref;ref_score=score;ref_confi=confi;ref_single=Single;ref_paired=Paired;consen_base=consen;consen_score=score;consen_confi=conf;consen_single=Single;consen_paired=Paired;rs_id=rs_id,dbSNP129
                    1       AB_SOLiD SNP caller     SNP     997     997     1       .       .       coverage=3;ref_base=A;ref_score=0.3284;ref_confi=0.9142;ref_single=0/0;ref_paired=1/1;consen_base=G;consen_score=0.6716;consen_confi=0.9349;consen_single=0/0;consen_paired=2/2
                    1       AB_SOLiD SNP caller     SNP     2061    2061    1       .       .       coverage=2;ref_base=G;ref_score=0.0000;ref_confi=0.0000;ref_single=0/0;ref_paired=0/0;consen_base=C;consen_score=1.0000;consen_confi=0.8985;consen_single=0/0;consen_paired=2/2
                    1       AB_SOLiD SNP caller     SNP     4770    4770    1       .       .       coverage=2;ref_base=A;ref_score=0.0000;ref_confi=0.0000;ref_single=0/0;ref_paired=0/0;consen_base=G;consen_score=1.0000;consen_confi=0.8854;consen_single=0/0;consen_paired=2/2
                    1       AB_SOLiD SNP caller     SNP     4793    4793    1       .       .       coverage=14;ref_base=A;ref_score=0.0723;ref_confi=0.8746;ref_single=0/0;ref_paired=1/1;consen_base=G;consen_score=0.6549;consen_confi=0.8798;consen_single=0/0;consen_paired=9/9
                    1       AB_SOLiD SNP caller     SNP     6241    6241    1       .       .       coverage=2;ref_base=T;ref_score=0.0000;ref_confi=0.0000;ref_single=0/0;ref_paired=0/0;consen_base=C;consen_score=1.0000;consen_confi=0.7839;consen_single=0/0;consen_paired=2/2

            Newer versions of ABI BioScope use the diBayes caller; an example
            output file is given below:

                    ##gff-version 3
                    ##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
                    ##List of SNPs. Date Sat Dec 18 10:30:45 2010    Stringency: medium Mate Pair: 1 Read Length: 50 Polymorphism Rate: 0.003000 Bayes Coverage: 60 Bayes_Single_SNP: 1 Filter_Single_SNP: 1 Quick_P_Threshold: 0.997000 Bayes_P_Threshold: 0.040000 Minimum_Allele_Ratio: 0.150000 Minimum_Allele_Ratio_Multiple_of_Dicolor_Error: 100
                    ##1     chr1
                    ##2     chr2
                    ##3     chr3
                    ##4     chr4
                    ##5     chr5
                    ##6     chr6
                    ##7     chr7
                    ##8     chr8
                    ##9     chr9
                    ##10    chr10
                    ##11    chr11
                    ##12    chr12
                    ##13    chr13
                    ##14    chr14
                    ##15    chr15
                    ##16    chr16
                    ##17    chr17
                    ##18    chr18
                    ##19    chr19
                    ##20    chr20
                    ##21    chr21
                    ##22    chr22
                    ##23    chrX
                    ##24    chrY
                    ##25    chrM
                    # source-version SOLiD BioScope diBayes(SNP caller)
                    #Chr    Source  Type    Pos_Start       Pos_End Score   Strand  Phase   Attributes
                    chr1    SOLiD_diBayes   SNP     221367  221367  0.091151        .       .       genotype=R;reference=G;coverage=3;refAlleleCounts=1;refAlleleStarts=1;refAlleleMeanQV=29;novelAlleleCounts=2;novelAlleleStarts=2;novelAlleleMeanQV=27;diColor1=11;diColor2=33;het=1;flag=
                    chr1    SOLiD_diBayes   SNP     555317  555317  0.095188        .       .       genotype=Y;reference=T;coverage=13;refAlleleCounts=11;refAlleleStarts=10;refAlleleMeanQV=23;novelAlleleCounts=2;novelAlleleStarts=2;novelAlleleMeanQV=29;diColor1=00;diColor2=22;het=1;flag=
                    chr1    SOLiD_diBayes   SNP     555327  555327  0.037582        .       .       genotype=Y;reference=T;coverage=12;refAlleleCounts=6;refAlleleStarts=6;refAlleleMeanQV=19;novelAlleleCounts=2;novelAlleleStarts=2;novelAlleleMeanQV=29;diColor1=12;diColor2=30;het=1;flag=
                    chr1    SOLiD_diBayes   SNP     559817  559817  0.094413        .       .       genotype=Y;reference=T;coverage=9;refAlleleCounts=5;refAlleleStarts=4;refAlleleMeanQV=23;novelAlleleCounts=2;novelAlleleStarts=2;novelAlleleMeanQV=14;diColor1=11;diColor2=33;het=1;flag=
                    chr1    SOLiD_diBayes   SNP     714068  714068  0.000000        .       .       genotype=M;reference=C;coverage=13;refAlleleCounts=7;refAlleleStarts=6;refAlleleMeanQV=25;novelAlleleCounts=6;novelAlleleStarts=4;novelAlleleMeanQV=22;diColor1=00;diColor2=11;het=1;flag=

            The file conforms to the standard GFF3 specification, but the last
            column is SOLiD-specific and gives certain parameters for the SNP
            calls.
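
As a rough illustration of that last column, the Python sketch below splits the key=value attributes of one diBayes line from the example above and derives the non-reference allele from the IUPAC genotype code. The function names `parse_gff3_attributes` and `alt_alleles` are hypothetical, and how the converter itself interprets these fields is not shown here; only the GFF3 layout and the standard IUPAC ambiguity codes are relied on.

```python
# Standard IUPAC nucleotide codes for one or two bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
}


def parse_gff3_attributes(column9: str) -> dict:
    """Turn 'key=value;key=value' into a dict; empty values are kept as ''."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attrs[key] = value
    return attrs


def alt_alleles(genotype: str, reference: str) -> list:
    """Bases implied by the genotype code that differ from the reference."""
    return [base for base in IUPAC.get(genotype, "") if base != reference]


line = (
    "chr1\tSOLiD_diBayes\tSNP\t221367\t221367\t0.091151\t.\t.\t"
    "genotype=R;reference=G;coverage=3;refAlleleCounts=1;refAlleleStarts=1;"
    "refAlleleMeanQV=29;novelAlleleCounts=2;novelAlleleStarts=2;"
    "novelAlleleMeanQV=27;diColor1=11;diColor2=33;het=1;flag="
)

fields = line.split("\t")        # GFF3 columns are tab-separated
attrs = parse_gff3_attributes(fields[8])
print(fields[0], fields[3], attrs["reference"],
      alt_alleles(attrs["genotype"], attrs["reference"]))
```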

            An example of the short indel format by GFF3-SOLiD is given below:

                    ##gff-version 3
                    ##solid-gff-version 0.3
                    ##source-version SOLiD Corona Lite v.4.0r2.0, find-small-indels.pl v 1.0.1, process-small-indels v 0.2.2, 2009-01-12 12:28:49
                    ##type DNA
                    ##date 2009-01-26
                    ##time 18:33:20
                    ##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
                    ##reference-file
                    ##input-files ../../mp-results/JOAN_20080104_1.pas,../../mp-results/BARB_20071114_1.pas,../../mp-results/BARB_20080227_2.pas
                    ##run-path /data/results2/Yoruban-frag-indel/try.01.06/mp-w2x25-2x-4x-8x-10x/2x
                    ##Filter-settings: max-ave-read-pos=none,min-ave-from-end-pos=9.1,max-nonreds-4filt=2,min-insertion-size=none,min-deletion-size=none,max-insertion-size=none,max-deletion-size=none,require-called-indel-size?=T
                    chr1    AB_SOLiD Small Indel Tool       deletion        824501  824501  1       .       .   del_len=1;tight_chrom_pos=824501-824502;loose_chrom_pos=824501-824502;no_nonred_reads=2;no_mismatches=1,0;read_pos=4,6;from_end_pos=21,19;strands=+,-;tags=R3,F3;indel_sizes=-1,-1;read_seqs=G3021212231123203300032223,T3321132212120222323222101;dbSNP=rs34941678,chr1:824502-824502(-),EXACT,1,/GG
                    chr1    AB_SOLiD Small Indel Tool       insertion_site  1118641 1118641 1       .       .   ins_len=3;tight_chrom_pos=1118641-1118642;loose_chrom_pos=1118641-1118642;no_nonred_reads=2;no_mismatches=0,1;read_pos=17,6;from_end_pos=8,19;strands=+,+;tags=F3,R3;indel_sizes=3,3;read_seqs=T0033001100022331122033112,G3233112203311220000001002

            The keyword deletion or insertion_site is used in the third
            column (the GFF3 type field) to indicate this file format.

            An example of the medium CNV format by GFF3-SOLiD is given below:

                    ##gff-version 3
                    ##solid-gff-version 0.3
                    ##source-version SOLiD Corona Lite v.4.0r2.0, find-small-indels.pl v 1.0.1, process-small-indels v 0.2.2, 2009-01-12 12:28:49
                    ##type DNA
                    ##date 2009-01-27
                    ##time 15:54:36
                    ##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
                    ##reference-file
                    ##input-files big_d20e5-del12n_up-ConsGrp-2nonred.pas.sum
                    ##run-path /data/results2/Yoruban-frag-indel/try.01.06/mp-results-lmp-e5/big_d20e5-indel_950_2050
                    chr1    AB_SOLiD Small Indel Tool       deletion        3087770 3087831 1       .       .   del_len=62;tight_chrom_pos=none;loose_chrom_pos=3087768-3087773;no_nonred_reads=2;no_mismatches=2,2;read_pos=27,24;from_end_pos=23,26;strands=-,+;tags=F3,F3;indel_sizes=-62,-62;read_seqs=T11113022103331111130221213201111302212132011113022,T02203111102312122031111023121220311111333012203111
                    chr1    AB_SOLiD Small Indel Tool       deletion        4104535 4104584 1       .       .   del_len=50;tight_chrom_pos=4104534-4104537;loose_chrom_pos=4104528-4104545;no_nonred_reads=3;no_mismatches=0,4,4;read_pos=19,19,27;from_end_pos=31,31,23;strands=+,+,-;tags=F3,R3,R3;indel_sizes=-50,-50,-50;read_seqs=T31011011013211110130332130332132110110132020312332,G21031011013211112130332130332132110132132020312332,G20321302023001101123123303103303101113231011011011
                    chr1    AB_SOLiD Small Indel Tool       insertion_site  2044888 2044888 1       .       .   ins_len=18;tight_chrom_pos=2044887-2044888;loose_chrom_pos=2044887-2044889;no_nonred_reads=2;bead_ids=1217_1811_209,1316_908_1346;no_mismatches=0,2;read_pos=13,15;from_end_pos=37,35;strands=-,-;tags=F3,F3;indel_sizes=18,18;read_seqs=T31002301231011013121000101233323031121002301231011,T11121002301231011013121000101233323031121000101231;non_indel_no_mismatches=3,1;non_indel_seqs=NIL,NIL
                    chr1    AB_SOLiD Small Indel Tool       insertion_site  74832565        74832565        1   .       .       ins_len=16;tight_chrom_pos=74832545-74832565;loose_chrom_pos=74832545-74832565;no_nonred_reads=2;bead_ids=1795_181_514,1651_740_519;no_mismatches=0,2;read_pos=13,13;from_end_pos=37,37;strands=-,-;tags=F3,R3;indel_sizes=16,16;read_seqs=T33311111111111111111111111111111111111111111111111,G23311111111111111111111111111111111111111311011111;non_indel_no_mismatches=1,0;non_indel_seqs=NIL,NIL

            An example of the large indel format by GFF3-SOLiD is given below:

                    ##gff-version 3
                    ##solid-gff-version 0.3
                    ##source-version ???
                    ##type DNA
                    ##date 2009-03-13
                    ##time 0:0:0
                    ##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
                    ##reference-file
                    ##input-files /data/results5/yoruban_strikes_back_large_indels/LMP/five_mm_unique_hits_no_rescue/5_point_6x_del_lib_1/results/NA18507_inter_read_indels_5_point_6x.dat
                    ##run-path
                    chr1    AB_SOLiD Large Indel Tool       insertion_site  1307279 1307791 1       .       .   deviation=-742;stddev=7.18;ref_clones=-;dev_clones=4
                    chr1    AB_SOLiD Large Indel Tool       insertion_site  2042742 2042861 1       .       .   deviation=-933;stddev=8.14;ref_clones=-;dev_clones=3
                    chr1    AB_SOLiD Large Indel Tool       insertion_site  2443482 2444342 1       .       .   deviation=-547;stddev=11.36;ref_clones=-;dev_clones=17
                    chr1    AB_SOLiD Large Indel Tool       insertion_site  2932046 2932984 1       .       .   deviation=-329;stddev=6.07;ref_clones=-;dev_clones=14
                    chr1    AB_SOLiD Large Indel Tool       insertion_site  3166925 3167584 1       .       .   deviation=-752;stddev=13.81;ref_clones=-;dev_clones=14

            An example of the CNV format by GFF3-SOLiD is given below:

                    ##gff-version 3
                    ##solid-gff-version 0.3
                    ##source-version ???
                    ##type DNA
                    ##date 2009-03-13
                    ##time 0:0:0
                    ##feature-ontology http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.141
                    ##reference-file
                    ##input-files Yoruban_cnv.coords
                    ##run-path
                    chr1    AB_CNV_PIPELINE repeat_region   1062939 1066829 .       .       .       fraction_mappable=51.400002;logratio=-1.039300;copynum=1;numwindows=1
                    chr1    AB_CNV_PIPELINE repeat_region   1073630 1078667 .       .       .       fraction_mappable=81.000000;logratio=-1.409500;copynum=1;numwindows=2
                    chr1    AB_CNV_PIPELINE repeat_region   2148325 2150352 .       .       .       fraction_mappable=98.699997;logratio=-1.055000;copynum=1;numwindows=1
                    chr1    AB_CNV_PIPELINE repeat_region   2245558 2248109 .       .