GFF3 Development

From SO Wiki

Suggestions for changes to the GFF3 specification should be considered in light of the following:

  1. GFF3 is actively used by dozens of projects to annotate hundreds of genomes. Backwards compatibility with the community-supported tools and datasets generated and used by these groups is a critical consideration.
  2. A big part of the success of GFF3 is its simplicity. GFF3 is simple enough to manipulate manually with command line tools, yet structured enough to support robust, validating parsers. Maintaining this balance between simplicity and structure is critical for future success.

Unresolved GFF3 Issues

  1. Partial Features
  2. Discontinuous features
    1. Can discontinuous features occur across landmark features (e.g. a gene split across contigs)?
    2. Can discontinuous features cross origin of circular genomes?
  3. Landmark features
    1. How are landmark features identified?
    2. How does a landmark feature relate to the ##sequence-region directive?
    3. Is a landmark feature required for each implied landmark as specified in column 1?
    4. Is the SO type of a landmark feature constrained?
  4. GFF3 directives
    1. Can we require that all ## directives appear at the top of the file, with the exception of ###, which needs to be interspersed to perform its function, and ##FASTA, which should be required at the bottom?
    2. Should we have a class of directives with more explicit syntax (##directive-name key1=value1,value2;key2=value3,value4;), such as the GVF structured directives?
    3. GVF directives to include?
      1. ##reference-fasta
      2. ##file-version
      3. ##file-date
    4. Other directives to include
      1. ##genome-build-accession
      2. ##genome-build-file
      3. ##annotation-source
  5. Which validator do we use?
  6. Clarification of the GAP attribute
  7. Negative start and end coordinates
  8. Is_circular attribute
    1. What features are allowed/disallowed to have this attribute?
    2. Should features be split when they cross the origin or have their END mapped forward to be longer than the sequence?
    3. When a child (an exon) lies completely past the origin but its parent (a transcript) spans the origin, how are the exon START and END calculated? Do you use the same coordinate system that the parent is using, or the actual coordinates?
  9. Can the ##feature-ontology directive be used to extend the terms allowed in column 3?
  10. Does GFF3/SO support annotating RNA and protein sequences directly?
  11. Additional terms in SO are needed
    1. All EMBL/GenBank/DDBJ Feature table terms should be supported/mapped
    2. Support/map terms used by EMBOSS
  12. The FT_SO.txt mapping file needs to be updated
  13. Should we version the GFF3 file like this: ##gff-version 3.1.21?
  14. Best Practices Pages
    1. Trans-spliced genes
    2. Partial features
    3. Discontinuous features
      1. Discontinuous features across contigs
      2. Discontinuous features across origin of circular genomes
    4. Link to Best Practices from the spec
    5. Create Best Practices page for transposable elements and their associated proteins
  15. Set up wiki accounts for GFF3 discussion participants to edit
  16. What is the best way to do GenBank to GFF3 conversion?
  17. Dbxref file updates
  18. ID characters - What characters are allowed/disallowed for ID?
  19. New attributes
    1. Relationship
    2. Part/Order?
    3. Start_range End_range
    4. Partial
  20. Set up a GFF3 FAQ
  21. Provide links to GFF, GFF2, GTF, UCSC GTF, GVF, etc. spec pages
  22. Clarify how to represent transposable elements and their associated proteins in the GFF3 spec.
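
The structured-directive syntax proposed in item 4.2 (##directive-name key1=value1,value2;key2=value3,value4;) can be sketched as a small parser. This is only an illustration of the shape shown in that item: the function name parse_structured_directive is hypothetical, and the handling of whitespace and of the trailing semicolon is an assumption, since the syntax has not been settled.

```python
def parse_structured_directive(line):
    """Parse '##name key1=v1,v2;key2=v3;' into (name, {key: [values]}).

    Hypothetical helper: follows only the syntax sketched in item 4.2.
    """
    if not line.startswith("##"):
        raise ValueError("not a directive line: %r" % line)

    # Split the directive name from the key=value section at the first space.
    name, _, rest = line[2:].strip().partition(" ")

    attributes = {}
    for pair in rest.split(";"):
        pair = pair.strip()
        if not pair:
            # Assumption: an empty segment (e.g. after a trailing ';') is ignored.
            continue
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % pair)
        # Multiple values for one key are comma-separated.
        attributes[key] = value.split(",")
    return name, attributes


name, attrs = parse_structured_directive(
    "##directive-name key1=value1,value2;key2=value3,value4;"
)
print(name, attrs)
```

An existing unstructured directive such as ##gff-version 3 would be rejected by this sketch, which is one reason a separate class of structured directives is being discussed.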

Resolved GFF3 Issues

Version 1.21

  1. gff-version directive: Is the ##gff-version directive only allowed once per file?
  2. Attribute value quoting: Should double quoting of attribute values be allowed or disallowed?
  3. GFF3 Fasta Sections
    1. Are FASTA sequences required to be all together at the end of the file?
    2. Are FASTA sequences allowed to be interspersed with feature lines?
  4. GFF3 character encoding: Does GFF3 specify a character encoding and, if so, which one?
  5. Hex code escapes (see GFF3 character encoding)
    1. Refer to RC??? instead of URL encoded
    2. Clean up wording on explicitly required escapes
    3. Clean up wording on explicitly disallowed escapes
    4. Be sure MAKER is in line with the allowed/disallowed escapes
    5. Required escapes - all columns (tab, newline, carriage return, percent sign, control characters)
    6. Required escapes - column 9 (semicolon, equals, ampersand (why is this escaped?), comma)
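
The required escapes listed in items 5.5 and 5.6 can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name escape_field and its column9 flag are hypothetical, "control characters" is assumed to mean code points below 0x20 plus 0x7F, and the escape form is assumed to be a percent sign followed by two uppercase hex digits.

```python
# Characters that must be escaped in every column (item 5.5).
ALL_COLUMNS = {"\t", "\n", "\r", "%"}
# Additional characters that must be escaped in column 9 only (item 5.6).
COLUMN9_EXTRA = {";", "=", "&", ","}


def escape_field(text, column9=False):
    """Percent-escape the characters named in items 5.5 and 5.6.

    Hypothetical helper; all other characters pass through unchanged.
    """
    escaped = []
    for ch in text:
        code = ord(ch)
        # Assumption: control characters are 0x00-0x1F and 0x7F.
        is_control = code < 0x20 or code == 0x7F
        if ch in ALL_COLUMNS or is_control or (column9 and ch in COLUMN9_EXTRA):
            escaped.append("%{:02X}".format(code))
        else:
            escaped.append(ch)
    return "".join(escaped)


print(escape_field("a;b=c", column9=True))  # a%3Bb%3Dc
print(escape_field("50%\tdone"))            # 50%25%09done
```

Non-ASCII characters are left untouched here, since how they are handled depends on the character encoding question in item 4.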