This page provides examples of how to use GVF to annotate a variety of feature styles. It is a community effort and a work in progress, so you are welcome and encouraged to help us correct errors and add more examples. We will also document sources of GVF files in actual use here so that you can refer to those files as additional examples.
GVF Format In Use
GVF Features in 2 minutes
GVF is a text file format with nine tab-delimited columns and is an extension of GFF3.
Remember from the GFF3 spec that the columns are:
- seqid - The name of the sequence that this feature is on. This will usually be the chromosome name or contig name and should correspond to the name given for this sequence in the corresponding FASTA sequence file.
- source - A description of where/how this feature annotation was generated. This might be the name of a variant caller (MAQ), an organization (NCBI), or a particular individual (NA_18507). The source column is currently not constrained to an ontology and its meaning must be described by the data provider.
- type - A SO term describing this feature. In GVF this term must be sequence_alteration or one of its children, or gap.
- start - The beginning of the feature relative to the landmark feature (the feature referenced by seqid above) in one-based coordinates. Start and end are always given in the 5' to 3' direction relative to the landmark and thus start will always be less than or equal to end.
- end - The end of the feature relative to the landmark feature.
- score - A score for the feature. This
- strand - The strand that this feature is annotated on, represented as either '+' or '-'. Note that the sequences given in the Variant_seq and Reference_seq attributes are affected by this value and both should be given as the reverse strand sequence if the strand is '-'.
- phase - The phase column is unused in GVF, but is maintained for compatibility with GFF3.
- attributes - Tag-value pairs that describe specific attributes of the feature.
  - GFF3 attributes that are relevant to GVF include:
    - ID - An ID for this feature (required). This has to be unique within this file, but doesn't have to have meaning outside the file.
    - Name - A descriptive name for this feature (optional).
    - Alias - A secondary name for the feature (optional).
    - Parent - Indicates the parent of the feature. Optional unless a part-of relationship is appropriate.
    - Dbxref - A database cross reference in the format DB_Name:DB_ID. See the spec for more details.
  - GVF specific attributes:
    - Reference_seq - The reference sequence.
    - Variant_seq - The variant sequence.
    - Variant_freq - The frequency of the variant sequence in a particular population.
    - Variant_reads - The number of reads covering this feature that support the variant.
    - Total_reads - The total number of reads that cover this sequence.
    - Genotype - Heterozygous, homozygous or hemizygous.
    - Variant_copy_number - For CNV features the number of copies of this region in the variant.
    - Reference_copy_number - For CNV features the number of copies of this region in the reference.
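The column layout above can be illustrated with a small parser. This is only a minimal sketch, not a reference implementation: the function name parse_gvf_line and the COLUMNS tuple are hypothetical, attribute values are kept as raw strings, and GFF3 percent-encoding, comment lines and pragmas are not handled.

```python
# Names of the first eight columns, in the order listed above.
COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")


def parse_gvf_line(line):
    """Parse one tab-delimited GVF feature line into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-delimited columns, got {len(fields)}")

    record = dict(zip(COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    if record["start"] > record["end"]:
        raise ValueError("start must not be greater than end")

    # Column 9 holds ';'-separated tag=value pairs; a trailing ';' is tolerated.
    attributes = {}
    for pair in fields[8].split(";"):
        pair = pair.strip()
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        attributes[tag.strip()] = value.strip()
    record["attributes"] = attributes
    return record


line = (
    "chr1\tSOAPsnp\tSNV\t15883\t15883\t36.5\t+\t.\t"
    "ID=chr1:SOAP:SNV:15883;Variant_seq=G,C;Reference_seq=C;"
    "Genotype=heterozygous;Variant_reads=17,16;Total_reads=33;"
)
record = parse_gvf_line(line)
print(record["type"], record["start"], record["attributes"]["Variant_seq"])
```

Multi-valued attributes such as Variant_seq are left as comma-joined strings here so that callers can decide how to split each one.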
GVF Feature Examples
GVF features are described on one line with columns delimited by tabs. To allow for easier viewing in a browser the following examples will have the first 8 columns on one line and then each tag-value attribute pair from column 9 on a separate line below that.
A Heterozygous SNV
A heterozygous SNV. Notice that the feature is on chromosome 1 and was called by SOAPsnp. It is typed by the SO term SNV and lies at position 15883 on chromosome 1. SOAPsnp assigned it a score of 36.5. It is described on the '+' strand. It is given an ID of chr1:SOAP:SNV:15883, which is simply the first columns joined by a ':'. In the individual described in this file both a G and a C were found at this locus. The reference genome has a C, which means that this individual is heterozygous. Of 33 total reads, 17 support the G and 16 support the C.
chr1 SOAPsnp SNV 15883 15883 36.5 + .
ID=chr1:SOAP:SNV:15883;
Variant_seq=G,C;
Reference_seq=C;
Genotype=heterozygous;
Variant_reads=17,16;
Total_reads=33;
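The pairing between Variant_seq and Variant_reads in this example can be made explicit in code. The sketch below assumes that the two comma-separated lists are in the same order, as the description of the example implies; the helper name allele_support is hypothetical.

```python
def allele_support(variant_seq, variant_reads, total_reads):
    """Map each allele to (supporting reads, fraction of total reads)."""
    alleles = variant_seq.split(",")
    counts = [int(n) for n in variant_reads.split(",")]
    if len(alleles) != len(counts):
        raise ValueError("Variant_seq and Variant_reads have different lengths")
    total = int(total_reads)
    return {allele: (count, count / total) for allele, count in zip(alleles, counts)}


# Attribute values taken from the heterozygous SNV example.
support = allele_support("G,C", "17,16", "33")
for allele, (count, fraction) in support.items():
    print(f"{allele}: {count} reads ({fraction:.2f} of total)")
```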
A Homozygous SNV
The same SNV as above, but homozygous this time and with Variant_effect added. Notice that Total_reads here is 33 but Variant_reads is 32. This means that one read supported another sequence, but that this wasn't enough for the variant caller to make a heterozygous call here. This SNV falls within the CDS of the RefSeq mRNAs NM_012345 and NM_543210, and the variant sequence creates an allele of this gene with a non-conservative substitution. In the heterozygous example above, the second variant sequence 'C' is the same as the reference; here only the variant sequence 'G' is listed.
chr1 SOAP SNV 15883 15883 36.5 + .
ID=chr1:SOAP:SNV:15883;
Variant_seq=G;
Reference_seq=C;
Genotype=homozygous;
Variant_reads=32;
Total_reads=33;
Variant_effect=nonsynonymous_codon 0 mRNA NM_012345,NM_543210;
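The Variant_effect value in this example has four parts: the effect term, a number, the type of the affected feature, and the affected feature IDs. A minimal sketch for splitting it follows. It assumes that the number is a 0-based index into Variant_seq (an assumption, as this page does not define it) and that feature IDs are separated by commas as shown here; the VariantEffect class and parse_variant_effect function are hypothetical names.

```python
from typing import List, NamedTuple


class VariantEffect(NamedTuple):
    effect: str            # SO term for the effect, e.g. nonsynonymous_codon
    variant_index: int     # assumed: 0-based index into Variant_seq
    feature_type: str      # type of the affected feature, e.g. mRNA
    feature_ids: List[str]


def parse_variant_effect(value):
    """Split a Variant_effect attribute value into its four parts."""
    parts = value.split()  # also drops stray leading/trailing whitespace
    if len(parts) < 4:
        raise ValueError(f"unexpected Variant_effect value: {value!r}")
    effect, index, feature_type = parts[:3]
    feature_ids = []
    for chunk in parts[3:]:
        feature_ids.extend(fid for fid in chunk.split(",") if fid)
    return VariantEffect(effect, int(index), feature_type, feature_ids)


effect = parse_variant_effect("nonsynonymous_codon 0 mRNA NM_012345,NM_543210")
print(effect.effect, effect.feature_type, effect.feature_ids)
```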
Another Coding SNV with Functional Annotation
chr16 samtools SNV 49291141 49291141 . + .
ID=ID_1;
Reference_seq=G;
Variant_seq=A,G;
Genotype=heterozygous;
Variant_effect=synonymous_codon 0 mRNA uc002egm.1,uc010cbk.1,uc010cbj.1;
A short deletion
A longer deletion
A homozygous deletion in the individual genome relative to the reference genome. The region deleted is longer than 50 nucleotides and thus the GVF simply has a '~' as the Reference_seq value. On average, 27 reads span this region, of which 26 support the deletion.
chr1 Celera nucleotide_deletion 8834426 8834497 . + .
ID=ABC_98765;
Variant_seq=-;
Reference_seq=~;
Variant_reads=26;
Total_reads=27;
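Two quick checks follow from this example: the length of the deleted region in one-based, inclusive coordinates, and the requirement that the reads supporting a variant cannot outnumber the reads covering it. The sketch below uses the coordinates from the record and the read counts from the description (27 total, 26 supporting); the function names and the LONG_SEQ_THRESHOLD constant are hypothetical, with the value 50 taken from the text above.

```python
LONG_SEQ_THRESHOLD = 50  # per the text: longer sequences are written as '~'


def feature_length(start, end):
    """Length of a feature in one-based, inclusive coordinates."""
    if start > end:
        raise ValueError("start must not be greater than end")
    return end - start + 1


def reads_are_consistent(variant_reads, total_reads):
    """Supporting reads can never exceed the total reads covering the feature."""
    return 0 <= variant_reads <= total_reads


length = feature_length(8834426, 8834497)
print(length, length > LONG_SEQ_THRESHOLD)   # 72 True
print(reads_are_consistent(26, 27))          # True
print(reads_are_consistent(27, 26))          # False: the two values are swapped
```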
A short insertion
A longer insertion
A copy number variant created by expansion of a segmental duplication that was already present in the reference genome.
chr10 PennCNV copy_number_variation 3922747 3923761 . + .
ID=CNV193;
Variant_seq=~;
Reference_seq=~;
Variant_copy_number=7;
Reference_copy_number=5;
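Comparing Variant_copy_number with Reference_copy_number tells you whether the individual has gained or lost copies of the region. A minimal sketch follows; the function name and the "gain", "loss" and "neutral" labels are illustrative choices, not GVF terms.

```python
def copy_number_change(variant_copy_number, reference_copy_number):
    """Return (difference, label) for a CNV feature."""
    delta = variant_copy_number - reference_copy_number
    if delta > 0:
        label = "gain"
    elif delta < 0:
        label = "loss"
    else:
        label = "neutral"
    return delta, label


# Values from the CNV193 example: 7 copies in the variant, 5 in the reference.
change = copy_number_change(7, 5)
print(change)
```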
A complex example of overlapping variants
A complex example where a t(8;21)(q22;q22.3) translocation, an inversion and an SNV overlap. These are actually three separate variants, each mapping to a common region of the reference genome, and thus they are described as three separate records in the GVF file. Each is described relative to its location on the reference genome, not relative to any of the other features.
chr21 BreakDancer translocation 41400000 46944323 . + .
ID=t8-21_q22-q22.3;
Dbxref=PMID:2052570
chr21 DGV inversion 42061144 42083169 . + .
ID=Variation_37237;
chr21 SAMTools SNV 42071394 42071394 . + .
ID=rs2989342;
Variant_seq=C,T;
Reference_seq=T
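Because each record is given in reference coordinates, overlaps between features can be found by comparing their intervals directly. The sketch below does this for the three records above, using a simple tuple layout (ID, seqid, start, end) chosen for illustration; the overlaps helper is a hypothetical name.

```python
from itertools import combinations

# (ID, seqid, start, end) taken from the three records above.
features = [
    ("t8-21_q22-q22.3", "chr21", 41400000, 46944323),
    ("Variation_37237", "chr21", 42061144, 42083169),
    ("rs2989342", "chr21", 42071394, 42071394),
]


def overlaps(a, b):
    """True if two features share a seqid and their inclusive intervals intersect."""
    return a[1] == b[1] and a[2] <= b[3] and b[2] <= a[3]


overlapping_pairs = [
    (a[0], b[0]) for a, b in combinations(features, 2) if overlaps(a, b)
]
for first, second in overlapping_pairs:
    print(f"{first} overlaps {second}")
```

All three pairs are reported here, since the SNV lies inside the inversion and the inversion lies inside the translocation region.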