{"payload":{"allShortcutsEnabled":false,"fileTree":{"":{"items":[{"name":"img","path":"img","contentType":"directory"},{"name":"gff3.md","path":"gff3.md","contentType":"file"},{"name":"gvf.md","path":"gvf.md","contentType":"file"}],"totalCount":3}},"fileTreeProcessingTime":1.723661,"foldersToFetch":[],"repo":{"id":57918881,"defaultBranch":"master","name":"Specifications","ownerLogin":"The-Sequence-Ontology","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2016-05-02T20:18:24.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/11649563?v=4","public":true,"private":false,"isOrgOwned":true},"symbolsExpanded":false,"treeExpanded":true,"refInfo":{"name":"master","listCacheKey":"v0:1466736068.0","canEdit":false,"refType":"branch","currentOid":"fe73505276dd324bf6a55773f3413fe2bed47af4"},"path":"gff3.md","currentUser":null,"blob":{"rawLines":null,"stylingDirectives":null,"colorizedLines":null,"csv":null,"csvError":null,"dependabotInfo":{"showConfigurationBanner":false,"configFilePath":null,"networkDependabotPath":"/The-Sequence-Ontology/Specifications/network/updates","dismissConfigurationNoticePath":"/settings/dismiss-notice/dependabot_configuration_notice","configurationNoticeDismissed":null},"displayName":"gff3.md","displayUrl":"https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md?raw=true","headerInfo":{"blobSize":"59.1 KB","deleteTooltip":"You must be signed in to make or propose changes","editTooltip":"You must be signed in to make or propose changes","ghDesktopPath":"https://desktop.github.com","isGitLfs":false,"onBranch":true,"shortPath":"640ea30","siteNavLoginPath":"/login?return_to=https%3A%2F%2Fgithub.com%2FThe-Sequence-Ontology%2FSpecifications%2Fblob%2Fmaster%2Fgff3.md","isCSV":false,"isRichtext":true,"toc":[{"level":3,"text":"Generic Feature Format Version 3 (GFF3)","anchor":"generic-feature-format-version-3-gff3","htmlText":"Generic Feature Format Version 3 
(GFF3)"},{"level":4,"text":"Summary","anchor":"summary","htmlText":"Summary"},{"level":4,"text":"GFF3 Validator","anchor":"gff3-validator","htmlText":"GFF3 Validator"},{"level":4,"text":"Description of the Format","anchor":"description-of-the-format","htmlText":"Description of the Format"},{"level":4,"text":"The Canonical Gene","anchor":"the-canonical-gene","htmlText":"The Canonical Gene"},{"level":4,"text":"Circular Genomes","anchor":"circular-genomes","htmlText":"Circular Genomes"},{"level":4,"text":"Representing Spliced Non-Coding Transcripts","anchor":"representing-spliced-non-coding-transcripts","htmlText":"Representing Spliced Non-Coding Transcripts"},{"level":4,"text":"Parent (part_of) Relationships","anchor":"parent-part_of-relationships","htmlText":"Parent (part_of) Relationships"},{"level":4,"text":"The Gap Attribute","anchor":"the-gap-attribute","htmlText":"The Gap Attribute"},{"level":4,"text":"Alignments","anchor":"alignments","htmlText":"Alignments"},{"level":4,"text":"Transcript-Relative Alignments","anchor":"transcript-relative-alignments","htmlText":"Transcript-Relative Alignments"},{"level":5,"text":"Case #1: alignment to a + strand transcript","anchor":"case-1-alignment-to-a--strand-transcript","htmlText":"Case #1: alignment to a + strand transcript"},{"level":5,"text":"Case #2: alignment to a - strand transcript","anchor":"case-2-alignment-to-a---strand-transcript","htmlText":"Case #2: alignment to a - strand transcript"},{"level":4,"text":"Ontology Associations and DB Cross References","anchor":"ontology-associations-and-db-cross-references","htmlText":"Ontology Associations and DB Cross References"},{"level":4,"text":"Other Syntax","anchor":"other-syntax","htmlText":"Other Syntax"},{"level":4,"text":"Pathological Cases","anchor":"pathological-cases","htmlText":"Pathological Cases"},{"level":4,"text":"Change Log","anchor":"change-log","htmlText":"Change 
Log"}],"lineInfo":{"truncatedLoc":"870","truncatedSloc":"750"},"mode":"file"},"image":false,"isCodeownersFile":null,"isPlain":false,"isValidLegacyIssueTemplate":false,"issueTemplate":null,"discussionTemplate":null,"language":"Markdown","languageID":222,"large":false,"planSupportInfo":{"repoIsFork":null,"repoOwnedByCurrentUser":null,"requestFullPath":"/The-Sequence-Ontology/Specifications/blob/master/gff3.md","showFreeOrgGatedFeatureMessage":null,"showPlanSupportBanner":null,"upgradeDataAttributes":null,"upgradePath":null},"publishBannersInfo":{"dismissActionNoticePath":"/settings/dismiss-notice/publish_action_from_dockerfile","releasePath":"/The-Sequence-Ontology/Specifications/releases/new?marketplace=true","showPublishActionBanner":false},"rawBlobUrl":"https://github.com/The-Sequence-Ontology/Specifications/raw/master/gff3.md","renderImageOrRaw":false,"richText":"

Generic Feature Format Version 3 (GFF3)

\n

Summary

\n

Author: Lincoln Stein
\nDate: 18 August 2020
\nVersion: 1.26

\n

Although there are many richer ways of representing genomic features via XML and in relational database schemas, the stubborn persistence of a variety of ad-hoc tab-delimited flat file formats declares the bioinformatics community's need for a simple format that can be modified with a text editor and processed with shell tools like grep. The GFF format, although widely used, has fragmented into multiple incompatible dialects. When asked why they have modified the published Sanger specification, bioinformaticists frequently answer that the format was insufficient for their needs, and they needed to extend it. The proposed GFF3 format addresses the most common extensions to GFF, while preserving backward compatibility with previous formats. The new format:

\n
  1. Adds a mechanism for representing more than one level of hierarchical grouping of features and subfeatures.
  2. Separates the ideas of group membership and feature name/id.
  3. Constrains the feature type field to be taken from a controlled vocabulary.
  4. Allows a single feature, such as an exon, to belong to more than one group at a time.
  5. Provides an explicit convention for pairwise alignments.
  6. Provides an explicit convention for features that occupy disjunct regions.
\n

GFF3 Validator

\n

GFF3 validation tools are available at modENCODE-DCC

\n

Description of the Format

\n

GFF3 files are nine-column, tab-delimited, plain text files. Literal use of tab, newline, carriage return, the percent (%) sign, and control characters must be encoded using RFC 3986 Percent-Encoding; no other characters may be encoded. Backslash and other ad-hoc escaping conventions that have been added to the GFF format are not allowed. The file contents may include any character in the set supported by the operating environment, although for portability with other systems, use of UTF-8 is recommended.

\n\n

In addition, the following characters have reserved meanings in column 9 and must be escaped when used in other contexts:

  * ; semicolon (%3B)
  * = equals (%3D)
  * & ampersand (%26)
  * , comma (%2C)

Note that unescaped spaces are allowed within fields, meaning that parsers must split on tabs, not spaces. Use of the \"+\" (plus) character to encode spaces is deprecated from early versions of the spec and is no longer allowed.

\n

Undefined fields are replaced with the \".\" character, as described in the original GFF spec.
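The rules above (split on tabs only, decode percent-escapes, treat \".\" as undefined) can be sketched in Python as follows. This is a minimal illustration, not a reference parser: the function name `split_feature_line`, the `COLUMNS` tuple, and the choice to map \".\" to `None` are assumptions made for the example. Column 9 is deliberately left undecoded, because its separators have to be split before any percent-decoding takes place.

```python
from urllib.parse import unquote

# Column names follow the spec; the tuple itself is a convenience for this sketch.
COLUMNS = ('seqid', 'source', 'type', 'start', 'end',
           'score', 'strand', 'phase', 'attributes')


def split_feature_line(line):
    """Split one GFF3 feature line into a dict of its nine columns."""
    # Split on tabs only: unescaped spaces are legal inside fields.
    fields = line.rstrip('\r\n').split('\t')
    if len(fields) != 9:
        raise ValueError(f'expected 9 tab-separated columns, got {len(fields)}')
    record = dict(zip(COLUMNS, fields))

    # '.' marks an undefined field (assumption: represent it as None).
    # Strand is left alone because '.' there means "not stranded".
    for name in ('source', 'score', 'phase'):
        if record[name] == '.':
            record[name] = None

    # Percent-decode the free-text columns. unquote() leaves '+' untouched,
    # which matches the rule that '+' no longer encodes a space.
    for name in ('seqid', 'source', 'type'):
        if record[name] is not None:
            record[name] = unquote(record[name])

    record['start'] = int(record['start'])
    record['end'] = int(record['end'])
    return record
```

Calling `split_feature_line` on a space-separated line raises `ValueError`, which is usually the first symptom of a file that has been edited with a tool that converts tabs to spaces.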

\n
\n
Column 1: \"seqid\"
\n
The ID of the landmark used to establish the coordinate system for the current feature. IDs may contain any characters, but must escape any characters not in the set [a-zA-Z0-9.:^*$@!+_?-|]. In particular, IDs may not contain unescaped whitespace and must not begin with an unescaped \">\".
\n
Column 2: \"source\"
\n
The source is a free text qualifier intended to describe the algorithm or operating procedure that generated this feature. Typically this is the name of a piece of software, such as \"Genescan\" or a database name, such as \"Genbank.\" In effect, the source is used to extend the feature ontology by adding a qualifier to the type creating a new composite type that is a subclass of the type in the type column.
\n
Column 3: \"type\"
\n
The type of the feature (previously called the \"method\"). This is constrained to be either a term from the Sequence Ontology or an SO accession number. The latter alternative is distinguished using the syntax SO:000000. In either case, it must be sequence_feature (SO:0000110) or an is_a child of it.
\n
Columns 4 & 5: \"start\" and \"end\"
\n
\n

The start and end coordinates of the feature are given in positive 1-based integer coordinates, relative to the landmark given in column one. Start is always less than or equal to end. For features that cross the origin of a circular feature (e.g. most bacterial genomes, plasmids, and some viral genomes), the requirement for start to be less than or equal to end is satisfied by making end = the position of the end + the length of the landmark feature.

\n

For zero-length features, such as insertion sites, start equals end and the implied site is to the right of the indicated base in the direction of the landmark.

\n
\n
Column 6: \"score\"
\n
The score of the feature, a floating point number. As in earlier versions of the format, the semantics of the score are ill-defined. It is strongly recommended that E-values be used for sequence similarity features, and that P-values be used for ab initio gene prediction features.
\n
Column 7: \"strand\"
\n
The strand of the feature. + for positive strand (relative to the landmark), - for minus strand, and . for features that are not stranded. In addition, ? can be used for features whose strandedness is relevant, but unknown.
\n
Column 8: \"phase\"
\n
\n

For features of type \"CDS\", the phase indicates where the next codon begins relative to the 5' end (where the 5' end of the CDS is relative to the strand of the CDS feature) of the current CDS feature. For clarification, the 5' end for CDS features on the plus strand is the feature's start and the 5' end for CDS features on the minus strand is the feature's end. The phase is one of the integers 0, 1, or 2, indicating the number of bases forward from the start of the current CDS feature the next codon begins. A phase of \"0\" indicates that a codon begins on the first nucleotide of the CDS feature (i.e. 0 bases forward), a phase of \"1\" indicates that the codon begins at the second nucleotide of this CDS feature and a phase of \"2\" indicates that the codon begins at the third nucleotide of this region. Note that ‘Phase’ in the context of a GFF3 CDS feature should not be confused with the similar concept of frame that is also a common concept in bioinformatics. Frame is generally calculated as a value for a given base relative to the start of the complete open reading frame (ORF) or the codon (e.g. modulo 3) while CDS phase describes the start of the next codon relative to a given CDS feature.

\n

The phase is REQUIRED for all CDS features.
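As an illustration of the phase definition, the sketch below derives the phase of each segment of a multi-segment CDS from the segment lengths. It assumes that the phase of the 5'-most segment is known (0 for a CDS that begins with a complete start codon); the function name `cds_phases` and its parameters are hypothetical and not part of the format.

```python
def cds_phases(segments, strand, first_phase=0):
    """Return the phase of each CDS segment, in the order the segments were given.

    segments    -- list of (start, end) pairs, 1-based and inclusive
    strand      -- '+' or '-'
    first_phase -- phase of the 5'-most segment (assumed known)
    """
    # Walk the segments 5' to 3': ascending on '+', descending on '-'.
    ordered = sorted(segments, reverse=(strand == '-'))
    phases = {}
    phase = first_phase
    for start, end in ordered:
        phases[(start, end)] = phase
        length = end - start + 1
        # After skipping `phase` bases, (length - phase) % 3 bases are left
        # over at the 3' end; the next segment must supply the rest of
        # that codon before a new codon can begin.
        phase = (3 - (length - phase) % 3) % 3
    return [phases[segment] for segment in segments]
```

For a plus-strand CDS with segments 3301-3902, 5000-5500 and 7000-7600, the first segment is 602 bases long, which leaves two bases of an unfinished codon, so the second segment has phase 1; the same reasoning gives phase 1 for the third.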

\n
\n
Column 9: \"attributes\"
\n
\n

A list of feature attributes in the format tag=value. Multiple tag=value pairs are separated by semicolons. URL escaping rules are used for tags or values containing the following characters: \",=;\". Spaces are allowed in this field, but tabs must be replaced with the %09 URL escape. Attribute values do not need to be and should not be quoted. The quotes should be included as part of the value by parsers and not stripped.

\n

These tags have predefined meanings:

\n
\n
ID
\n
Indicates the ID of the feature. The ID attribute is required for features that have children (e.g. gene and mRNAs), or for those that span multiple lines, but is optional for other features. IDs for each feature must be unique within the scope of the GFF file. In the case of discontinuous features (i.e. a single feature that exists over multiple genomic locations) the same ID may appear on multiple lines. All lines that share an ID must collectively represent a single feature.
\n
Name
\n
Display name for the feature. This is the name to be displayed to the user. Unlike IDs, there is no requirement that the Name be unique within the file.
\n
Alias
\n
A secondary name for the feature. It is suggested that this tag be used whenever a secondary identifier for the feature is needed, such as locus names and accession numbers. Unlike ID, there is no requirement that Alias be unique within the file.
\n
Parent
\n
Indicates the parent of the feature. A parent ID can be used to group exons into transcripts, transcripts into genes, and so forth. A feature may have multiple parents. Parent can only be used to indicate a part_of relationship.
\n
Target
\n
Indicates the target of a nucleotide-to-nucleotide or protein-to-nucleotide alignment. The format of the value is \"target_id start end [strand]\", where strand is optional and may be \"+\" or \"-\". If the target_id contains spaces, they must be escaped as hex escape %20.
\n
Gap
\n
The alignment of the feature to the target if the two are not collinear (e.g. contain gaps). The alignment format is inspired by the CIGAR format described in the Exonerate documentation.
\n
Derives_from
\n
Used to disambiguate the relationship between one feature and another when the relationship is a temporal one rather than a purely structural \"part of\" one. This is needed for polycistronic genes. See \"PATHOLOGICAL CASES\" for further discussion.
\n
Note
\n
A free text note.
\n
Dbxref
\n
A database cross reference. See the section \"Ontology Associations and DB Cross References\" for details on the format.
\n
Ontology_term
\n
A cross reference to an ontology term. See the section \"Ontology Associations and DB Cross References\" for details.
\n
Is_circular
\n
A flag to indicate whether a feature is circular. See extended discussion below.
\n
\n

Multiple attributes of the same type are indicated by separating the values with the comma \",\" character, as in:

\n
Parent=AF2312,AB2812,abc-3
\n

In addition to Parent, the Alias, Note, Dbxref and Ontology_term attributes can have multiple values.

\n

Note that attribute names are case sensitive. \"Parent\" is not the same as \"parent\".

\n

All attributes that begin with an uppercase letter are reserved for later use. Attributes that begin with a lowercase letter can be used freely by applications.
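The column 9 rules can be sketched as a small Python function: split on semicolons, split each pair on the first equals sign, split the value on commas, and only then percent-decode. The name `parse_attributes` and the decision to return every value as a list are assumptions of this example; a stricter parser might allow multiple values only for Parent, Alias, Note, Dbxref and Ontology_term.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Parse a GFF3 attributes column into a dict of tag -> list of values."""
    attributes = {}
    if column9 in ('', '.'):
        return attributes
    for pair in column9.split(';'):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, separator, value = pair.partition('=')
        if not separator:
            raise ValueError(f'attribute without an equals sign: {pair!r}')
        # Split on literal commas first, then decode, so that an escaped
        # comma (%2C) stays inside its value. Tag names are case sensitive
        # and are kept exactly as written.
        values = [unquote(item) for item in value.split(',')]
        attributes.setdefault(unquote(tag), []).extend(values)
    return attributes
```

Decoding after splitting is the important design choice here: decoding first would turn an escaped `%3B` or `%2C` back into a separator and corrupt the value.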

\n
\n
\n

The Canonical Gene

\n

\"Figure1\"
\nFIGURE 1

\n

This section describes the representation of a protein-coding gene in GFF3. To illustrate how a canonical gene is represented, consider Figure 1 (figure1.png). This indicates a gene named EDEN extending from position 1000 to position 9000. It encodes three alternatively-spliced transcripts named EDEN.1, EDEN.2 and EDEN.3, the last of which has two alternative translational start sites leading to the generation of two protein coding sequences.

\n

There is also an identified transcriptional factor binding site located 50 bp upstream from the transcriptional start site of EDEN.1 and EDEN.2.

\n

Here is how this gene should be described using GFF3:

\n
 0  ##gff-version 3.1.26
 1  ##sequence-region ctg123 1 1497228
 2  ctg123 . gene            1000  9000  .  +  .  ID=gene00001;Name=EDEN
 3  ctg123 . TF_binding_site 1000  1012  .  +  .  ID=tfbs00001;Parent=gene00001
 4  ctg123 . mRNA            1050  9000  .  +  .  ID=mRNA00001;Parent=gene00001;Name=EDEN.1
 5  ctg123 . mRNA            1050  9000  .  +  .  ID=mRNA00002;Parent=gene00001;Name=EDEN.2
 6  ctg123 . mRNA            1300  9000  .  +  .  ID=mRNA00003;Parent=gene00001;Name=EDEN.3
 7  ctg123 . exon            1300  1500  .  +  .  ID=exon00001;Parent=mRNA00003
 8  ctg123 . exon            1050  1500  .  +  .  ID=exon00002;Parent=mRNA00001,mRNA00002
 9  ctg123 . exon            3000  3902  .  +  .  ID=exon00003;Parent=mRNA00001,mRNA00003
10  ctg123 . exon            5000  5500  .  +  .  ID=exon00004;Parent=mRNA00001,mRNA00002,mRNA00003
11  ctg123 . exon            7000  9000  .  +  .  ID=exon00005;Parent=mRNA00001,mRNA00002,mRNA00003
12  ctg123 . CDS             1201  1500  .  +  0  ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
13  ctg123 . CDS             3000  3902  .  +  0  ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
14  ctg123 . CDS             5000  5500  .  +  0  ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
15  ctg123 . CDS             7000  7600  .  +  0  ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
16  ctg123 . CDS             1201  1500  .  +  0  ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
17  ctg123 . CDS             5000  5500  .  +  0  ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
18  ctg123 . CDS             7000  7600  .  +  0  ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
19  ctg123 . CDS             3301  3902  .  +  0  ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
20  ctg123 . CDS             5000  5500  .  +  1  ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
21  ctg123 . CDS             7000  7600  .  +  1  ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
22  ctg123 . CDS             3391  3902  .  +  0  ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
23  ctg123 . CDS             5000  5500  .  +  1  ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
24  ctg123 . CDS             7000  7600  .  +  1  ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
\n

Lines beginning with '##' are directives (sometimes called pragmas or meta-data) and provide meta-information about the document as a whole. Blank lines should be ignored by parsers and lines beginning with a single '#' are used for human-readable comments and can be ignored by parsers. End-of-line comments (comments preceded by # at the end of and on the same line as a feature or directive line) are not allowed.

\n

Line 0 gives the GFF version using the ##gff-version pragma. Line 1 indicates the boundaries of the region being annotated (a 1,497,228 bp region named \"ctg123\") using the ##sequence-region pragma.
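The distinction between directives, comments, blank lines and feature lines can be sketched as below. The function names `classify_line` and `parse_sequence_region` are hypothetical, and the sketch assumes that the fields of a ##sequence-region directive are separated by whitespace, as in the example file.

```python
def classify_line(line):
    """Return 'blank', 'directive', 'comment', or 'feature' for one line."""
    text = line.rstrip('\r\n')
    if not text.strip():
        return 'blank'
    # Test for '##' before '#', since every directive also starts with '#'.
    if text.startswith('##'):
        return 'directive'
    if text.startswith('#'):
        return 'comment'
    return 'feature'


def parse_sequence_region(line):
    """Parse '##sequence-region seqid start end' into (seqid, start, end)."""
    parts = line.split()
    if len(parts) != 4 or parts[0] != '##sequence-region':
        raise ValueError(f'not a sequence-region directive: {line!r}')
    return parts[1], int(parts[2]), int(parts[3])
```

Because end-of-line comments are not allowed, a `#` that appears anywhere other than the start of a line is ordinary data and needs no special handling.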

\n

Line 2 defines the boundaries of the gene. Column 9 of this line assigns the gene an ID of gene00001, and a human-readable name of EDEN. Because the gene is not part of a larger feature, it has no Parent.

\n

Line 3 annotates the transcriptional factor binding site. Since it is logically part of the gene, its Parent attribute is gene00001.

\n

Lines 4-6 define this gene's three spliced transcripts, one line for the full extent of each of the mRNAs. These features are necessary to act as parents for the four CDSs which derive from them, as well as the structural parents of the five exons in the alternative splicing set.

\n

Lines 7-11 identify the five exons. The Parent attributes indicate which mRNAs the exons belong to. Notice that several of the exons share the same parents, using the comma symbol to indicate multiple parentage.

\n

Lines 12-24 denote this gene's four CDSs. Each CDS belongs to one of the mRNAs. cds00003 and cds00004, which correspond to alternative start codons, belong to the same mRNA.

\n

Note that several of the features, including the gene, its mRNAs and the CDSs, all have Name attributes. This attribute assigns those features a public name, but is not mandatory. The ID attributes are only mandatory for those features that have children (the gene and mRNAs), or for those that span multiple lines. The IDs are not required to have meaning outside the file in which they reside. Hence, a slightly simplified version of this file would look like this:

\n
##gff-version 3.1.26\n##sequence-region ctg123 1 1497228\nctg123 . gene            1000 9000  .  +  .  ID=gene00001;Name=EDEN\nctg123 . TF_binding_site 1000 1012  .  +  .  Parent=gene00001\nctg123 . mRNA            1050 9000  .  +  .  ID=mRNA00001;Parent=gene00001\nctg123 . mRNA            1050 9000  .  +  .  ID=mRNA00002;Parent=gene00001\nctg123 . mRNA            1300 9000  .  +  .  ID=mRNA00003;Parent=gene00001\nctg123 . exon            1300 1500  .  +  .  Parent=mRNA00003\nctg123 . exon            1050 1500  .  +  .  Parent=mRNA00001,mRNA00002\nctg123 . exon            3000 3902  .  +  .  Parent=mRNA00001,mRNA00003\nctg123 . exon            5000 5500  .  +  .  Parent=mRNA00001,mRNA00002,mRNA00003\nctg123 . exon            7000 9000  .  +  .  Parent=mRNA00001,mRNA00002,mRNA00003\nctg123 . CDS             1201 1500  .  +  0  ID=cds00001;Parent=mRNA00001\nctg123 . CDS             3000 3902  .  +  0  ID=cds00001;Parent=mRNA00001\nctg123 . CDS             5000 5500  .  +  0  ID=cds00001;Parent=mRNA00001\nctg123 . CDS             7000 7600  .  +  0  ID=cds00001;Parent=mRNA00001\nctg123 . CDS             1201 1500  .  +  0  ID=cds00002;Parent=mRNA00002\nctg123 . CDS             5000 5500  .  +  0  ID=cds00002;Parent=mRNA00002\nctg123 . CDS             7000 7600  .  +  0  ID=cds00002;Parent=mRNA00002\nctg123 . CDS             3301 3902  .  +  0  ID=cds00003;Parent=mRNA00003\nctg123 . CDS             5000 5500  .  +  1  ID=cds00003;Parent=mRNA00003\nctg123 . CDS             7000 7600  .  +  1  ID=cds00003;Parent=mRNA00003\nctg123 . CDS             3391 3902  .  +  0  ID=cds00004;Parent=mRNA00003\nctg123 . CDS             5000 5500  .  +  1  ID=cds00004;Parent=mRNA00003\nctg123 . CDS             7000 7600  .  +  1  ID=cds00004;Parent=mRNA00003\n
\n
\n
NOTE 1
\n
\n

SO or SOFA IDs: If using the SO (or SOFA) IDs rather than the short names (\"mRNA\", etc.), use the following mappings:

| Short name | SO ID |
|---|---|
| gene | SO:0000704 |
| mRNA | SO:0000234 |
| exon | SO:0000147 |
| CDS | SO:0000316 |
\n

Other mRNA parts that you might wish to use are:

| Short name | SO ID | Comment |
|---|---|---|
| intron | SO:0000188 | redundant with exon |
| polyA_sequence | SO:0000610 | part of the three_prime_UTR |
| polyA_site | SO:0000553 | part of the gene |
| five_prime_UTR | SO:0000204 | |
| three_prime_UTR | SO:0000205 | |
\n
\n
NOTE 2
\n
\"Orphan\" exons, CDSs, and other features. Ab initio gene prediction programs call hypothetical exons and CDSs that are attached to the genomic sequence and not necessarily to a known transcript. To handle these features, you may either (1) create a placeholder mRNA and use it as the parent for the exon and CDS subfeatures; or (2) attach the exons and CDSs directly to the gene. This is allowed by SO because of the transitive nature of the part_of relationship.
\n
NOTE 3
\n
UTRs, splice sites and translational start and stop sites. These are implied by the combination of exon and CDS and do not need to be explicitly annotated as part of the canonical gene. In the case of annotating predicted splice or translational start/stop sites independently of a particular gene, it is suggested that they be attached directly to the genomic sequence and not to a gene or a subpart of a gene.
\n
NOTE 4
\n
CDS features MUST have a defined phase field. Otherwise it is not possible to infer the correct polypeptides corresponding to partially annotated genes.
\n
NOTE 5
\n
The START and STOP codons are included in the CDS. That is, if the locations of the start and stop codons are known, the first three base pairs of the CDS should correspond to the start codon and the last three should correspond to the stop codon.
\n
\n

Circular Genomes

\n

For a circular genome, the landmark feature should include Is_circular=true in column 9. In the example below, from bacteriophage f1, gene II extends across the origin from positions 6006-831. The feature end is given as the length of the landmark feature, J02448, plus the distance from the origin to the end of gene II (6407 + 831 = 7238).

\n
##gff-version 3.1.26\n# organism Enterobacteria phage f1\n# Note Bacteriophage f1, complete genome.\nJ02448  GenBank region  1      6407    .       +       .       ID=J02448;Name=J02448;Is_circular=true;\nJ02448  GenBank CDS     6006   7238    .       +       0       ID=geneII;Name=II;Note=protein II;\n
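The arithmetic for origin-spanning features can be sketched as two small helpers. The names `circular_end` and `wrap_position` are hypothetical; the sketch assumes 1-based inclusive coordinates and a feature that crosses the origin at most once.

```python
def circular_end(start, end, landmark_length):
    """Return the GFF3 end coordinate for a feature on a circular landmark.

    If the biological end lies before the start, the feature crosses the
    origin, so the landmark length is added to keep start <= end.
    """
    if end < start:
        return end + landmark_length
    return end


def wrap_position(position, landmark_length):
    """Map a possibly extended coordinate back onto 1..landmark_length."""
    return (position - 1) % landmark_length + 1
```

For gene II of bacteriophage f1, `circular_end(6006, 831, 6407)` gives 7238, and `wrap_position(7238, 6407)` recovers the biological end position 831.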
\n

Representing Spliced Non-Coding Transcripts

\n

For spliced non-coding transcripts, such as those produced by some processed snRNAs and viruses, use a parent feature of \"noncoding_transcript\" and a child of \"exon.\"

\n

Parent (part_of) Relationships

\n

The reserved Parent attribute can be used to establish a part-of relationship between two features. A feature that has the Parent attribute set is interpreted as asserting that it is a part of the specified Parent feature.

\n

Features must respect the Sequence Ontology Part-Of relationships. A Parent relationship between two features that is not one of the Part-Of relationships listed in SO should trigger a parse exception. Similarly, a set of Parent relationships that would cause a cycle should also trigger an exception.

\n

The GFF3 format does not enforce a rule in which features must be wholly contained within the location of their parents, since some elements of the Sequence Ontology (e.g. enhancers in genes) allow for distant cis relationships.
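One possible way to detect a cycle among Parent relationships is a depth-first walk from each feature towards its parents, sketched below. The function name `find_parent_cycle` and the input shape (a dict from feature ID to a list of parent IDs) are assumptions of this example. The sketch covers only the cycle check; validating each relationship against the Part-Of relationships in SO would need the ontology itself and is not attempted here.

```python
def find_parent_cycle(parents):
    """Return a list of IDs forming a Parent cycle, or None if there is none.

    parents -- dict mapping a feature ID to the list of its Parent IDs
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def visit(node, path):
        state[node] = IN_PROGRESS
        path.append(node)
        for parent in parents.get(node, ()):
            status = state.get(parent, UNSEEN)
            if status == IN_PROGRESS:
                # The parent is already on the current path: a cycle.
                return path[path.index(parent):] + [parent]
            if status == UNSEEN:
                found = visit(parent, path)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in parents:
        if state.get(node, UNSEEN) == UNSEEN:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None
```

A feature with several parents, such as an exon shared by two mRNAs of the same gene, is not a cycle and is reported as `None`.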

\n

The Gap Attribute

\n

Protein and nucleotide alignment features typically consist of two sequences, the reference sequence and the \"target\", and are not always colinear. For example, consider the following alignment between an EST (\"EST23\") and a segment of the genome (\"chr3\"):

\n
chr3  (reference)  1 CAAGACCTAAACTGGAT-TCCAAT  23\nEST23 (target)     1 CAAGACCT---CTGGATATCCAAT  21\n
\n

Previous versions of the GFF format would represent this alignment as three colinear segments, but this made it difficult to reconstruct the gapped alignment. GFF3 recommends representing gapped alignments explicitly with the \"Gap\" attribute. The Gap attribute's format consists of a series of (operation,length) pairs separated by space characters, for example \"M8 D3 M6\". Each operation is a single-letter code:

| Code | Operation |
|---|---|
| M | match |
| I | insert a gap into the reference sequence |
| D | insert a gap into the target (delete from reference) |
| F | frameshift forward in the reference sequence |
| R | frameshift reverse in the reference sequence |
\n

In the alignment between EST23 and chr3 shown above, chr3 is the reference sequence referred to in the first column of the GFF3 file, and EST23 is the sequence referred to by the Target attribute. This gives a Gap string of \"M8 D3 M6 I1 M6\". The full GFF match line will read:

\n
chr3 . Match 1 23 . . . ID=Match1;Target=EST23 1 21;Gap=M8 D3 M6 I1 M6\n
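A Gap string can be checked against the coordinates on its line by adding up how many reference and target positions it consumes. The sketch below does this for nucleotide-to-nucleotide alignments only; the names `parse_gap` and `gap_lengths` are hypothetical, and the frameshift codes are rejected here because they apply to protein alignments.

```python
import re

# One operation: a single-letter code followed by a positive length.
GAP_OPERATION = re.compile(r'([MIDFR])(\d+)')


def parse_gap(gap):
    """Split a Gap attribute value into a list of (code, length) pairs."""
    operations = []
    for token in gap.split():
        match = GAP_OPERATION.fullmatch(token)
        if match is None:
            raise ValueError(f'malformed Gap operation: {token!r}')
        operations.append((match.group(1), int(match.group(2))))
    return operations


def gap_lengths(gap):
    """Return (reference_length, target_length) for a nucleotide alignment."""
    reference = target = 0
    for code, length in parse_gap(gap):
        if code == 'M':      # match: consumes both sequences
            reference += length
            target += length
        elif code == 'D':    # gap in the target: consumes reference only
            reference += length
        elif code == 'I':    # gap in the reference: consumes target only
            target += length
        else:
            raise ValueError('F and R are not handled in this nucleotide sketch')
    return reference, target
```

For \"M8 D3 M6 I1 M6\" this yields 23 reference bases and 21 target bases, which agrees with columns 4 and 5 (1 to 23) and with the Target range (1 to 21).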
\n

For protein to nucleotide matches, the M, I and D operations apply to amino acid residues in the target and nucleotide base pairs in the reference in a 1:3 ratio. That is, \"M2\" means to match two amino acid residues in the target to six base pairs in the reference. Hence this alignment:

\n
100 atgaaggag---gttattgcgaatgtcggcggt\n  1 M..K..E..V..V..I..-..N..V..G..G..\n
\n

Corresponds to this GFF3 Line:

\n
ctg123 . nucleotide_to_protein 100 129 . + . ID=match008;Target=p101 1 10;Gap=M3 I1 M2 D1 M4\n
\n

In addition, the Gap attribute provides <F>orward and <R>everse frameshift operators to allow for frameshifts in the alignment. These are in nucleotide coordinates: a forward frameshift skips forward the indicated number of base pairs, while a reverse frameshift moves backwards. Examples:

\n
100 atgaaggag---gttattgaatgtcggcggt     Gap=M3 I1 M2 F1 M4\n  1 M..K..E..V..V..I...\n                        N..V..G..G\n\n100 atgaaggag---gttataatgtcggcggt       Gap=M3 I1 M2 R1 M4\n  1 M..K..E..V..V..I.\n                      N..V..G..G\n
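For protein-to-nucleotide alignments the same bookkeeping applies with a factor of three for M and D, while F and R move the reference position by single base pairs. The sketch below is one reading of the rules above; the function name `protein_gap_lengths` is hypothetical.

```python
def protein_gap_lengths(gap):
    """Return (reference_bases, target_residues) for a protein-to-nucleotide Gap."""
    reference_bases = 0
    target_residues = 0
    for token in gap.split():
        code, length = token[0], int(token[1:])
        if code == 'M':      # each matched residue covers three base pairs
            reference_bases += 3 * length
            target_residues += length
        elif code == 'I':    # residues with no counterpart in the reference
            target_residues += length
        elif code == 'D':    # codons in the reference with no residue
            reference_bases += 3 * length
        elif code == 'F':    # forward frameshift, counted in base pairs
            reference_bases += length
        elif code == 'R':    # reverse frameshift, counted in base pairs
            reference_bases -= length
        else:
            raise ValueError(f'unknown Gap operation: {token!r}')
    return reference_bases, target_residues
```

Applied to the examples in this section, \"M3 I1 M2 D1 M4\" spans 30 base pairs (100 to 129) and 10 residues, \"M3 I1 M2 F1 M4\" spans 28 base pairs, and \"M3 I1 M2 R1 M4\" spans 26.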
\n

Alignments

\n

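In the SO, an alignment between the reference sequence and another sequence is called a \"match\". In addition to the generic \"match\" type, there are the subclasses \"cDNA_match\", \"EST_match\", \"translated_nucleotide_match\", \"nucleotide_to_protein_match\", and \"nucleotide_motif\".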

Matches typically contain gaps; matches broken up by large gaps are usually called \"HSPs\" (high-scoring segment pair), and previous incarnations of GFF have handled gapped alignments by breaking up the alignment into a series of ungapped HSPs.

\n

The SO does not have an HSP type. Instead, gapped matches are represented as a single feature that occupies a discontinuous location on the reference sequence. Figure 2 shows the same gene as before, but with a new track added showing an alignment of a sequenced cDNA to the genome. For the purposes of illustration, we have shown the regions of alignment to be exact across the three exons of the second spliced transcript (EDEN.2).

\n

\"Figure2\"
\nFIGURE 2

\n

The recommended way to represent this alignment is with a single feature of type \"cDNA_match\" and a Gap attribute that indicates that the alignment is in three segments:

\n
ctg123 . cDNA_match 1050  9000  6.2e-45  +  .    ID=match00001;Target=cdna0123 12 2964;Gap=M451 D3499 M501 D1499 M2001\n
\n

Parsed out, the Target attribute indicates that the sequence named \"cdna0123\" between bases 12 and 2964 (in cdna coordinates) aligns to bases 1050 to 9000 of ctg123. The Gap attribute is easier to read when spaces are inserted:

| Operation | Meaning |
|---|---|
| M451 | match 451 bases |
| D3499 | skip 3499 bases in the reference ctg123 sequence |
| M501 | match the next 501 bases |
| D1499 | skip 1499 bases in the reference ctg123 |
| M2001 | match the next 2001 bases |
\n

Note that the matched region is 2953 bases, which corresponds exactly to the matching subsequence [12,2964] of the target. Extra bases in the cDNA which would cause gaps in the reference sequence would be indicated using the CIGAR \"I\" notation.

\n

Another important item to note is that the ID corresponds to the Match and not to the target sequence. This avoids the confusion that has occurred in previous incarnations of GFF which made it impossible to distinguish between a particular alignment of a target sequence to the genome and all alignments of a target sequence to the genome.

\n

A limitation of the Gap representation is that the entire alignment shares the same score (column 6). To give each component of the match a separate score, it can be broken across multiple lines as shown here:

\n
ctg123 . cDNA_match 1050  1500  5.8e-42 +  . ID=match00001;Target=cdna0123 12  462\nctg123 . cDNA_match 5000  5500  8.1e-43 +  . ID=match00001;Target=cdna0123 463 963\nctg123 . cDNA_match 7000  9000  1.4e-40 +  . ID=match00001;Target=cdna0123 964 2964\n
\n

Notice that the ID is the same across each of the three lines, indicating that these lines all refer to a single feature, the Match. Each aligning segment, however, has a distinct score and Target region.
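The relationship between the single-line Gap form and the multi-line form can be sketched by walking the Gap string and emitting one segment per match run. The function name `gap_to_segments` is hypothetical, and the sketch assumes a nucleotide alignment on the + strand of the reference with no frameshift operations.

```python
def gap_to_segments(reference_start, target_start, gap):
    """Return (ref_start, ref_end, target_start, target_end) for each M run.

    All coordinates are 1-based and inclusive.
    """
    segments = []
    reference, target = reference_start, target_start
    for token in gap.split():
        code, length = token[0], int(token[1:])
        if code == 'M':
            segments.append((reference, reference + length - 1,
                             target, target + length - 1))
            reference += length
            target += length
        elif code == 'D':    # skip bases in the reference
            reference += length
        elif code == 'I':    # skip bases in the target
            target += length
        else:
            raise ValueError(f'operation not handled in this sketch: {token!r}')
    return segments
```

With a reference start of 1050, a target start of 12 and the Gap value \"M451 D3499 M501 D1499 M2001\", the three segments returned have the same start, end and Target coordinates as the three cDNA_match lines above. Per-segment scores cannot be recovered this way, since the Gap form carries only one score.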

\n

The two types of representations can be mixed, allowing large aligned segments to have their own GFF line and score, while small gaps within them are represented using a Gap attribute.

\n

Matches can align to either the + or the - strand of the reference sequence. This should be denoted in the seventh column of the GFF line and not by changing the order of the start and end positions in the Target attribute. To illustrate this, Figure 3 adds an EST pair to the annotation. The two ESTs, mjm1123.5 and mjm1123.3, correspond to 5' and 3' EST reads from the same cDNA clone. The following GFF3 lines describe them:

\n
ctg123 . EST_match 1200  3200  2.2e-30  +  .    ID=match00002;Target=mjm1123.5 5 506;Gap=M301 D1499 M201\nctg123 . EST_match 7000  9000  7.4e-32  -  .    ID=match00003;Target=mjm1123.3 1 502;Gap=M101 D1499 M401\n
\n

Please note that the subsequence indicated by the Target always uses the coordinate system of the EST, regardless of the direction of the alignment. For the 3' EST, the seventh column contains a \"-\" to indicate that the match is to the reverse complement of ctg123. The Gap attribute does not change as a consequence of this reverse complementation, and is read from left to right in the usual manner.

\n

An application may wish to group the EST pair into a single feature. This can be accomplished by creating an implied cDNA_match that extends from the left end of the first EST to the right end of the last EST, and indicating that this cDNA match is the Parent of the two ESTs. The parts of the match use the SO \"match_part\" term. A match_part can be used as a subpart of any type of match.

\n
ctg123 . cDNA_match  1200  9000  .        .  .    ID=cDNA00001\nctg123 . match_part  1200  3200  2.2e-30  +  .    ID=match00002;Parent=cDNA00001;Target=mjm1123.5 5 506;Gap=M301 D1499 M201\nctg123 . match_part  7000  9000  7.4e-32  -  .    ID=match00003;Parent=cDNA00001;Target=mjm1123.3 1 502;Gap=M101 D1499 M401\n
\n
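The span of the implied parent follows directly from its parts. A minimal Python sketch, assuming all parts lie on the same reference sequence and are given as a list of (start, end) pairs; the helper name implied_parent_span is hypothetical:

```python
def implied_parent_span(parts):
    # parts: list of (start, end) pairs, one-based and inclusive,
    # all on the same reference sequence.
    if not parts:
        raise ValueError('at least one match_part is required')
    starts = [start for start, _ in parts]
    ends = [end for _, end in parts]
    # Left end of the leftmost part, right end of the rightmost part.
    return min(starts), max(ends)


est_parts = [(1200, 3200), (7000, 9000)]
print(implied_parent_span(est_parts))
```

For the two match_part lines above this yields the 1200 to 9000 range used on the cDNA_match line.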

\"Figure3\"
\nFIGURE 3

\n

Transcript-Relative Alignments

\n

The representation of strandedness in nucleotide-to-nucleotide and protein-to-nucleotide alignments is a common source of confusion in GFF files. This section will attempt to explain it.

\n
Case #1: alignment to a + strand transcript
\n

Consider a pair of EST matches to the genome:

\n
=============================  genome\n    ------------------->       transcript\n    ------>        <----\n     EST_A (5')    EST_B (3')\n
\n

EST_A is a 5' EST and its sequence (as represented in a FASTA file, for example) is on the same strand as the genomic sequence. It is represented as:

\n
ctg123 . EST_match 1000 1500 . + . ID=match001;Target=EST_A 1 500 +\n
\n

The strand field in column #7 is \"+\" indicating that the match is to the forward strand of the genome. The optional strand field in the Target attribute is also +, indicating that the alignment is to the plus strand of the implied underlying transcript.

\n

Let us now consider EST_B, which is a 3' EST. Its sequence as represented in the FASTA file aligns to the reverse complement of the genomic sequence. It is represented as:

\n
ctg123 . EST_match 2000 2500 . + . ID=match002;Target=EST_B 1 500 -\n
\n

The strand field in column #7 is \"+\" indicating that the match is to a transcript feature on the forward strand of the genome. The strand field in the Target attribute is -, indicating that the EST sequence should be reverse complemented in order to align to the underlying transcript.
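Reading the optional strand out of a Target value is a matter of splitting on whitespace. The sketch below is one possible way to do it in Python; the Target class and parse_target function are illustrative names, and the sketch assumes any spaces inside the target ID have been percent-escaped so that plain splitting is safe:

```python
from typing import NamedTuple, Optional


class Target(NamedTuple):
    seqid: str
    start: int
    end: int
    strand: Optional[str]   # None when the optional strand is absent


def parse_target(value):
    # Expected layout: target_id start end [strand]
    fields = value.split()
    if len(fields) not in (3, 4):
        raise ValueError('expected: target_id start end [strand]')
    strand = fields[3] if len(fields) == 4 else None
    if strand not in (None, '+', '-'):
        raise ValueError('strand must be + or -')
    return Target(fields[0], int(fields[1]), int(fields[2]), strand)


print(parse_target('EST_A 1 500 +'))
print(parse_target('EST_B 1 500 -'))
```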

\n
Case #2: alignment to a - strand transcript
\n

Here is the opposite case:

\n
=============================  genome\n<--------------------          transcript\n ------>        <----\n EST_D (3')  EST_C (5')\n
\n

In this case, the 5' EST_C aligns to the reverse complement of the forward strand of the genome, while the 3' EST_D aligns to the forward strand directly. These are represented as follows:

\n
ctg123 . EST_match  2000 2500 . - . ID=match001;Target=EST_C 1 500 +\nctg123 . EST_match  1000 1500 . - . ID=match002;Target=EST_D 1 500 -\n
\n

The first line indicates that the transcript is on the - strand of the genome, and that EST_C aligns to the transcript's forward strand. The second line uses - in the 7th column to indicate that the transcript is on the minus strand, and - in the Target field to indicate that EST_D aligns to the minus strand of the transcript.

\n

Confused? Just remember that for purposes of display, the source and target strands will be multiplied together. A +/+ or -/- alignment indicates that the reference sequence and the target sequence can be aligned directly. A +/- or -/+ alignment indicates that the target must be reverse complemented in order to align to the plus strand of the reference sequence.
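The multiplication rule can be written down in a few lines. This is only a sketch in Python with hypothetical function names; it takes the column 7 strand and the Target strand as the characters + and -:

```python
SIGN = {'+': 1, '-': -1}


def strand_product(ref_strand, target_strand):
    # Multiply the two strands: like signs give +, unlike signs give -.
    return '+' if SIGN[ref_strand] * SIGN[target_strand] > 0 else '-'


def needs_reverse_complement(ref_strand, target_strand):
    # True when the target must be reverse complemented to align to
    # the plus strand of the reference sequence.
    return strand_product(ref_strand, target_strand) == '-'


for name, ref, tgt in [('EST_A', '+', '+'), ('EST_B', '+', '-'),
                       ('EST_C', '-', '+'), ('EST_D', '-', '-')]:
    print(name, needs_reverse_complement(ref, tgt))
```

Applied to the four ESTs of Case #1 and Case #2, only EST_B and EST_C come out as needing reverse complementation, in agreement with the diagrams.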

\n

A similar rule applies to TBLASTX alignments, which rely on matching the six-frame translation of the source to the six-frame translation of the target. Consider the case of two genomes that align together in the forward direction, whose alignment is supported by translations of genes A and B, one of which is on the plus strand, and the other on the minus strand:

\n
=============================>  genome X\n     ------>        <----\n     gene A          gene B\n=============================> genome Y\n
\n

These two alignments will be represented as:

\n
X TBLASTX translated_nucleotide_match 1000 1500 . + . ID=matchA;Target=Y 500  1000 +\nX TBLASTX translated_nucleotide_match 2000 2500 . - . ID=matchB;Target=Y 1500 2000 -\n
\n

Note that the first alignment is +/+ and the second is -/-. Both indicate that the sequences of genomes X and Y can be aligned directly.

\n

Now we look at the case of two genomes that align in the antiparallel direction:

\n
=============================> genome X\n     ------>        <----\n     gene A          gene B\n<============================= genome Y\n
\n

These two alignments will be represented as:

\n
X TBLASTX translated_nucleotide_match 1000 1500 . + . ID=matchA;Target=Y 500  1000 -\nX TBLASTX translated_nucleotide_match 2000 2500 . - . ID=matchB;Target=Y 1500 2000 +\n
\n

The first match indicates that a plus strand feature of genome X aligns to a minus strand feature of genome Y. The second match indicates that a minus strand feature of genome X aligns to a plus strand feature of genome Y. In both cases, the result is to align the plus strand of genome X to the minus strand of genome Y.

\n

Ontology Associations and DB Cross References

\n

Two reserved attributes, Ontology_term and Dbxref, can be used to establish links between a GFF3 feature and a data record contained in another database. Ontology_term is reserved for associations to ontologies, such as the Gene Ontology. Dbxref is used for all other cross references. While there is no firm boundary line between these two concepts, curators tend to treat ontology associations differently and hence ontology terms have been given their own reserved attribute label.

\n

The value of both Ontology_term and Dbxref is the ID of the cross referenced object in the form \"DBTAG:ID\". The DBTAG indicates which database the referenced object can be found in, and ID indicates the identifier of the object within that database. IDs can contain unescaped colons but DBTAGs cannot, so parsing code should split on the first colon encountered in the attribute value.
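Splitting on the first colon is a one-liner in most languages. A minimal Python sketch for a single cross-reference value; the function name split_xref is hypothetical, and the MGI identifier in the last call is a made-up value used only to show an ID that itself contains a colon:

```python
def split_xref(value):
    # Split 'DBTAG:ID' on the first colon only, so that colons inside
    # the ID are preserved.
    dbtag, sep, ident = value.partition(':')
    if not sep or not dbtag or not ident:
        raise ValueError('expected DBTAG:ID, got ' + repr(value))
    return dbtag, ident


print(split_xref('EMBL:AA816246'))
print(split_xref('GO:0046703'))
print(split_xref('MGI:MGI:12345'))   # made-up ID containing a colon
```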

\n

The format of each type of ID varies from database to database. An authoritative list of databases, their DBTAGs, and the URL transformation rules that can be used to fetch the objects given their IDs can be found at this location: ftp://ftp.geneontology.org/pub/go/doc/GO.xrf_abbs

\n

Further details can be found here: ftp://ftp.geneontology.org/pub/go/doc/GO.xrf_abbs_spec

\n

Here are some common examples:

\n
\n
a dbxref to an EMBL sequence accession number:
\n
Dbxref=\"EMBL:AA816246\"
\n
a dbxref to an NCBI gi number:
\n
Dbxref=\"NCBI_gi:10727410\"
\n
an Ontology_term referring to a GO association
\n
Ontology_term=\"GO:0046703\"
\n
\n

Other Syntax

\n

Comment lines begin with the '#' symbol. End-of-line comments (comments preceded by '#' at the end of and on the same line as a feature or directive line) are not allowed. Directive lines (sometimes referred to as pragmas or meta-data) are preceded by '##'. Application specific directives are allowed, but are not required to be supported by parsers. The following directives are specified:

\n
\n
##gff-version 3.1.26
\n
The GFF version follows the format of 3.#.# in this spec. This directive must be present and must be the topmost line of the file. The version number always begins with 3; the second and third numbers are optional and indicate a major revision and a minor revision respectively.
\n
##sequence-region seqid start end
\n
The sequence segment referred to by this file, in the format \"seqid start end\". This element is optional, but strongly encouraged because it allows parsers to perform bounds checking on features. There may be multiple ##sequence-region directives, each corresponding to one of the reference sequences referred to in the body of the file; however, only one ##sequence-region directive may be given for any given seqid. While a ##sequence-region pragma is not required for any or all landmark features, when one is given, all features on that landmark feature (having that seqid) must be contained within the range defined by that ##sequence-region directive. An exception to this rule is allowed when a landmark feature is marked with the Is_circular attribute. In that case the features contained on that landmark may extend their coordinates beyond the boundary as described above.
\n
##feature-ontology URI
\n
\n

This directive indicates that the GFF3 file uses the ontology of feature types located at the indicated URI or URL. Multiple URIs may be added, in which case they are merged (or raise an exception if they cannot be merged). The URIs for the released sequence ontologies are:

\n \n
\n
##attribute-ontology URI
\n
This directive indicates that the GFF3 file uses the ontology of attribute names located at the indicated URI or URL. This directive may appear multiple times to load multiple URIs, in which case they are merged (or raise an exception if merging is not possible). Currently no formal attribute ontologies exist, so this directive is reserved for future extension.
\n
##source-ontology URI
\n
This directive indicates that the GFF3 file uses the ontology of source names located at the indicated URI or URL. This directive may appear multiple times to load multiple URIs, in which case they are merged (or raise an exception if merging is not possible). Currently no formal source ontologies exist, so this directive is reserved for future extension.
\n
##species NCBI_Taxonomy_URI
\n
\n This directive indicates the species that the annotations apply to. The preferred format is an NCBI URL that points to the relevant species page in either of the following formats:\n \n
\n
##genome-build source buildName
\n
\n

The genome assembly build name used for the coordinates given in the file. Please specify the source of the assembly as well as its name. Examples (the parentheses are comments):

\n
##genome-build NCBI B36           (human)\n##genome-build WormBase ws110     (worm)\n##genome-build FlyBase r4.1       (drosophila)
\n
\n
###
\n
This directive (three # signs in a row) indicates that all forward references to feature IDs that have been seen to this point have been resolved. After seeing this directive, a program that is processing the file serially can close off any open objects that it has created and return them, thereby allowing iterative access to the file. Otherwise, software cannot know that a feature has been fully populated by its subfeatures until the end of the file has been reached. It is recommended that complex features, such as the canonical gene, be terminated with the ### notation.
\n
##FASTA
\n
\n

This notation indicates that the annotation portion of the file is at an end and that the remainder of the file contains one or more sequences (nucleotide or protein) in FASTA format. This allows features and sequences to be bundled together. All FASTA sequences included in the file must be included together at the end of the file and may not be interspersed with the feature lines. Once a ##FASTA section is encountered, no other content beyond valid FASTA sequence is allowed.

\n

Example:

\n
##gff-version 3.1.26\n##sequence-region ctg123 1 1497228\nctg123 . gene               1000  9000  .  +  .  ID=gene00001;Name=EDEN\nctg123 . TF_binding_site    1000  1012  .  +  .  ID=tfbs00001;Parent=gene00001\nctg123 . mRNA               1050  9000  .  +  .  ID=mRNA00001;Parent=gene00001;Name=EDEN.1\nctg123 . five_prime_UTR     1050  1200  .  +  .  Parent=mRNA00001\nctg123 . CDS                1201  1500  .  +  0  ID=cds00001;Parent=mRNA00001\nctg123 . CDS                3000  3902  .  +  0  ID=cds00001;Parent=mRNA00001\nctg123 . CDS                5000  5500  .  +  0  ID=cds00001;Parent=mRNA00001\nctg123 . CDS                7000  7600  .  +  0  ID=cds00001;Parent=mRNA00001\nctg123 . three_prime_UTR    7601  9000  .  +  .  Parent=mRNA00001\nctg123 . cDNA_match         1050  1500  5.8e-42  +  . ID=match00001;Target=cdna0123 12 462\nctg123 . cDNA_match         5000  5500  8.1e-43  +  . ID=match00001;Target=cdna0123 463 963\nctg123 . cDNA_match         7000  9000  1.4e-40  +  . ID=match00001;Target=cdna0123 964 2964\n##FASTA\n>ctg123\ncttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg\ntgttcattgctgcctgcatgttcattgtctacctcggctacgtgtggcta\ntctttcctcggtgccctcgtgcacggagtcgagaaaccaaagaacaaaaa\naagaaattaaaatatttattttgctgtggtttttgatgtgtgttttttat\naatgatttttgatgtgaccaattgtacttttcctttaaatgaaatgtaat\ncttaaatgtatttccgacgaattcgaggcctgaaaagtgtgacgccattc\ngtatttgatttgggtttactatcgaataatgagaattttcaggcttaggc\nttaggcttaggcttaggcttaggcttaggcttaggcttaggcttaggctt\naggcttaggcttaggcttaggcttaggcttaggcttaggcttaggcttag\naatctagctagctatccgaaattcgaggcctgaaaagtgtgacgccattc\n...\n>cdna0123\nttcaagtgctcagtcaatgtgattcacagtatgtcaccaaatattttggc\nagctttctcaagggatcaaaattatggatcattatggaatacctcggtgg\naggctcagcgctcgatttaactaaaagtggaaagctggacgaaagtcata\ntcgctgtgattcttcgcgaaattttgaaaggtctcgagtatctgcatagt\ngaaagaaaaatccacagagatattaaaggagccaacgttttgttggaccg\ntcaaacagcggctgtaaaaatttgtgattatggttaaagg
\n

For backward-compatibility with the GFF version output by the Artemis tool, a GFF line that begins with the character > creates an implied ##FASTA directive.
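The line-level rules in this section (comments, directives, the ### marker, ##FASTA and the implied FASTA start) lend themselves to a small streaming reader. The Python sketch below is one possible arrangement, not a reference implementation: the generator name iter_feature_blocks is hypothetical, directives other than ### and ##FASTA are simply skipped, and feature lines are returned as raw strings without column parsing.

```python
def iter_feature_blocks(lines):
    # Yield lists of feature lines. A block is closed by '###', by the
    # start of the FASTA section, or by the end of the input.
    block = []
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue
        if line.startswith('##FASTA') or line.startswith('>'):
            break                   # the remainder is sequence data
        if line == '###':
            if block:
                yield block         # all forward references resolved
            block = []
        elif line.startswith('#'):
            continue                # other directives and comments
        else:
            block.append(line)
    if block:
        yield block


sample = [
    '##gff-version 3',
    'ctg123 . gene 1000 9000 . + . ID=gene00001',
    'ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001',
    '###',
    '# a free-standing comment',
    'ctg123 . gene 12000 15000 . + . ID=gene00002',
    '##FASTA',
    '>ctg123',
    'cttctgggcgtacccgattc',
]
for block in iter_feature_blocks(sample):
    print(len(block), 'feature line(s)')
```

Because each block is yielded as soon as ### is seen, a caller can build and release feature objects incrementally instead of holding the whole file in memory.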

\n
\n
\n

Pathological Cases

\n

The following section discusses how to represent \"pathological\" cases that arise in prokaryotic and eukaryotic genetics. Most of these have to do with organisms' endlessly creative ways of processing transcripts.

\n
\n
Single exon genes
\n
\n

This is the case in which a single unspliced transcript encodes a single CDS.

\n
----->XXXXXXX*------>
\n

The preferred representation is to create a gene, a transcript, an exon and a CDS:

\n
chrX  . gene XXXX YYYY  .  +  . ID=gene01;name=resA\nchrX  . mRNA XXXX YYYY  .  +  . ID=tran01;Parent=gene01\nchrX  . exon XXXX YYYY  .  +  . Parent=tran01\nchrX  . CDS  XXXX YYYY  .  +  . Parent=tran01
\n

Some groups will find this redundant. A valid alternative is to omit the exon feature:

\n
chrX  . gene XXXX YYYY  .  +  . ID=gene01;name=resA\nchrX  . mRNA XXXX YYYY  .  +  . ID=tran01;Parent=gene01\nchrX  . CDS  XXXX YYYY  .  +  . Parent=tran01
\n

It is not recommended to parent the CDS directly onto the gene, because this will make it impossible to determine the UTRs (since the gene may validly include untranscribed regulatory regions).

\n

Also note that mixing the two styles, as in the case of an organism with both spliced and unspliced transcripts, is likely to confuse people working with the GFF3 file.

\n
\n
Polycistronic transcripts
\n
\n

This is the case in which a single (possibly spliced) transcript encodes multiple open reading frames that generate independent protein products.

\n
----->XXXXXXX*-->BBBBBB*--->ZZZZ*-->AAAAAA*-----
\n

Since the single transcript corresponds to multiple genes that can be identified by genetic analysis, the recommended solution here is to create four \"gene\" objects and make them the parents of a single transcript. The transcript will contain a single exon (in the unspliced case) and four separate CDSs:

\n
chrX  . gene XXXX YYYY  .  +  . ID=gene01;name=resA\nchrX  . gene XXXX YYYY  .  +  . ID=gene02;name=resB\nchrX  . gene XXXX YYYY  .  +  . ID=gene03;name=resX\nchrX  . gene XXXX YYYY  .  +  . ID=gene04;name=resZ\nchrX  . mRNA XXXX YYYY  .  +  . ID=tran01;Parent=gene01,gene02,gene03,gene04\nchrX  . exon XXXX YYYY  .  +  . ID=exon00001;Parent=tran01\nchrX  . CDS  XXXX YYYY  .  +  . Parent=tran01;Derives_from=gene01\nchrX  . CDS  XXXX YYYY  .  +  . Parent=tran01;Derives_from=gene02\nchrX  . CDS  XXXX YYYY  .  +  . Parent=tran01;Derives_from=gene03\nchrX  . CDS  XXXX YYYY  .  +  . Parent=tran01;Derives_from=gene04
\n

To make explicit which gene encodes which CDS, you may use the Derives_from relationship.
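A parser sees the multiple parents as a comma-separated Parent value. The following Python sketch shows one way to read column 9 into a dictionary of lists; the function name parse_attributes is hypothetical, and percent-decoding of escaped characters is left out to keep the example short:

```python
def parse_attributes(column9):
    # Return {tag: [value, ...]} for a column 9 string.
    # Assumption: percent-escapes are not decoded in this sketch.
    attrs = {}
    for pair in column9.strip().split(';'):
        if not pair:
            continue
        tag, _, value = pair.partition('=')
        attrs[tag] = value.split(',')
    return attrs


mrna = parse_attributes('ID=tran01;Parent=gene01,gene02,gene03,gene04')
cds = parse_attributes('Parent=tran01;Derives_from=gene02')
print(mrna['Parent'])
print(cds['Derives_from'])
```

With the attributes in this form, the gene for each CDS can be looked up through Derives_from, while Parent still ties every CDS to the shared transcript.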

\n
\n
Gene containing an intein
\n
\n

An intein occurs when a portion of the protein is spliced out and the two polypeptide fragments are rejoined to become a functional protein. The portion that is spliced out is called the \"intein,\" and it may itself have intrinsic molecular activity:

\n
----->XXXXXXyyyyyyyyyyXXXXXXX*-------\n(yyyyyy is the intein)
\n

The preferred representation is to create one gene, one transcript, one exon, and one CDS. The CDS produces a pre-polypeptide using the \"Derives_from\" tag, and this polypeptide in turn gives rise to two mature_polypeptides, one each for the intein and the flanking protein:

\n
chrX  . gene               XXXX YYYY  .  +  . ID=gene01;name=resA\nchrX  . mRNA               XXXX YYYY  .  +  . ID=tran01;Parent=gene01\nchrX  . exon               XXXX YYYY  .  +  . Parent=tran01\nchrX  . CDS                XXXX YYYY  .  +  . ID=cds01;Parent=tran01\nchrX  . polypeptide        XXXX YYYY  .  +  . ID=poly01;Derives_from=cds01\nchrX  . mature_polypeptide XXXX YYYY  .  +  . ID=poly02;Parent=poly01\nchrX  . mature_polypeptide XXXX YYYY  .  +  . ID=poly02;Parent=poly01\nchrX  . intein             XXXX YYYY  .  +  . ID=poly03;Parent=poly01
\n

Because the flanking mature_polypeptide has discontinuous coordinates on the genome, it appears twice with the same ID.
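Software that reads such a file has to reassemble the feature from the lines that share an ID. A minimal Python sketch of that grouping step; the function name segments_by_id is hypothetical, and the coordinates are made-up numbers standing in for the XXXX and YYYY placeholders:

```python
from collections import defaultdict


def segments_by_id(rows):
    # rows: iterable of (feature_id, start, end) tuples.
    # Returns {feature_id: [(start, end), ...]} with segments sorted
    # by position.
    grouped = defaultdict(list)
    for feature_id, start, end in rows:
        grouped[feature_id].append((start, end))
    return {fid: sorted(segs) for fid, segs in grouped.items()}


rows = [
    ('poly02', 100, 399),   # flanking polypeptide, first segment
    ('poly03', 400, 699),   # intein
    ('poly02', 700, 999),   # flanking polypeptide, second segment
]
print(segments_by_id(rows))
```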

\n

If the intein is immediately degraded, you may not wish to annotate it explicitly, and its line would be deleted from the example. However, if it has molecular activity, it may correspond to a gene, in which case:

\n
chrX  . gene               XXXX YYYY  .  +  . ID=gene01;name=resA\nchrX  . gene               XXXX YYYY  .  +  . ID=gene02;name=inteinA\nchrX  . mRNA               XXXX YYYY  .  +  . ID=tran01;Parent=gene01,gene02\nchrX  . exon               XXXX YYYY  .  +  . Parent=tran01\nchrX  . CDS                XXXX YYYY  .  +  . ID=cds01;Parent=tran01\nchrX  . polypeptide        XXXX YYYY  .  +  . ID=poly01;Derives_from=cds01\nchrX  . mature_polypeptide XXXX YYYY  .  +  . ID=poly02;Parent=poly01;Derives_from=gene01\nchrX  . mature_polypeptide XXXX YYYY  .  +  . ID=poly02;Parent=poly01;Derives_from=gene01\nchrX  . intein             XXXX YYYY  .  +  . ID=poly03;Parent=poly01;Derives_from=gene02
\n

The term \"polypeptide\" is part of SO. The terms \"mature_polypeptide\" and \"intein\" are slated to be added in a pending release.

\n
\n
Trans-spliced transcript
\n
\n

This occurs when two genes contribute to a processed transcript via a trans-splicing reaction:

\n
spliced\nleader\n=======>----->XXXXXXX*------>
\n

The simplest way to represent this is to show the mRNA as being split across two discontinuous genomic locations:

\n
chrX  . gene               XXXX YYYY  .  +  . ID=gene01;name=my_gene\nchrX  . mRNA               XXXX YYYY  .  +  . ID=tran01;Parent=gene01\nchrX  . mRNA               XXXX YYYY  .  +  . ID=tran01;Parent=gene01\nchrX  . exon               XXXX YYYY  .  +  . Parent=tran01\nchrX  . CDS                XXXX YYYY  .  +  . ID=cds01;Parent=tran01
\n

However, this does not indicate which part of the transcript comes from the spliced leader. A preferred representation explicitly adds features for the spliced leader gene, the primary_transcript and the spliced_leader_RNA:

\n
chrX  . gene               XXXX YYYY  .  +  . ID=gene01;name=my_gene\nchrX  . gene               XXXX YYYY  .  +  . ID=gene02;name=leader_gene\nchrX  . mRNA               XXXX YYYY  .  +  . ID=tran01;Parent=gene01,gene02\nchrX  . mRNA               XXXX YYYY  .  +  . ID=tran01;Parent=gene01,gene02\nchrX  . primary_transcript XXXX YYYY  .  +  . ID=pt01;Parent=tran01;Derives_from=gene01\nchrX  . spliced_leader_RNA XXXX YYYY  .  +  . ID=sl01;Parent=tran01;Derives_from=gene02\nchrX  . exon               XXXX YYYY  .  +  . Parent=tran01\nchrX  . CDS                XXXX YYYY  .  +  . ID=cds01;Parent=tran01
\n

As shown here, the mRNA derives from two genes (\"my_gene\" and the leader gene) and occupies disjunct coordinates on the genome. The primary_transcript, which encodes the body of the mRNA, is part of (has as its Parent) this mRNA. The same relationship applies to the spliced leader RNA. The Derives_from relationship is used to indicate which genes produced the primary transcript and spliced leader respectively.

\n

The exon and CDS features follow in the normal fashion.

\n
\n
Programmed frameshift
\n
\n

This event occurs when the ribosome performs a programmed frameshift during translation in order to skip over an in-frame stop codon. The frameshift may occur forward or backward.

\n
-------------------------> mRNA\n==========\n          ============*  CDS
\n

The representation of this is to make the CDS discontinuous:

\n
chrX  . gene               XXXX   YYYY .  +  . ID=gene01;name=my_gene\nchrX  . mRNA               XXXX   YYYY .  +  . ID=tran01;Parent=gene01;Ontology_term=SO:1000069\nchrX  . exon               XXXX   YYYY .  +  . Parent=tran01\nchrX  . CDS                XXXX   YYYY .  +  0 ID=cds01;Parent=tran01\nchrX  . CDS                YYYY-1 ZZZZ .  +  0 ID=cds01;Parent=tran01
\n

The CDS segment that represents the new reading frame will always have a phase of 0, since the ribosome is moving and thus redefining the codon.

\n

It is suggested that the mRNA be tagged with the appropriate SO transcript attributes such as \"minus_1_translational_frameshift\" (SO:1000069). This will allow all such programmed frameshift mRNAs to be recovered with a query. The accession for \"plus_1_translational_frameshift\" is SO:1001263.
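Such a query can be as simple as a filter on the Ontology_term attribute. The sketch below, in Python, checks a column 9 string for either of the two accessions named above; the names FRAMESHIFT_TERMS and has_frameshift_tag are hypothetical:

```python
# Accessions quoted in the text for the -1 and +1 frameshift terms.
FRAMESHIFT_TERMS = {'SO:1000069', 'SO:1001263'}


def has_frameshift_tag(column9):
    # True when Ontology_term carries one of the frameshift accessions.
    for pair in column9.split(';'):
        tag, _, value = pair.partition('=')
        if tag == 'Ontology_term' and FRAMESHIFT_TERMS & set(value.split(',')):
            return True
    return False


print(has_frameshift_tag('ID=tran01;Parent=gene01;Ontology_term=SO:1000069'))
print(has_frameshift_tag('ID=tran02;Parent=gene02'))
```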

\n
\n
An operon
\n
\n

A classic operon occurs when the genes in a polycistronic transcript are co-regulated by cis-regulatory element(s):

\n
regulatory element\n* ================================================> operon\n----->XXXXXXX*-->BBBBBB*--->ZZZZ*-->AAAAAA*-----
\n

It can be indicated in GFF3 in this way:

\n
chrX  . operon   XXXX YYYY  .  +  . ID=operon01;name=my_operon\nchrX  . promoter XXXX YYYY  .  +  . Parent=operon01\nchrX  . gene     XXXX YYYY  .  +  . ID=gene01;Parent=operon01;name=resA\nchrX  . gene     XXXX YYYY  .  +  . ID=gene02;Parent=operon01;name=resB\nchrX  . gene     XXXX YYYY  .  +  . ID=gene03;Parent=operon01;name=resX\nchrX  . gene     XXXX YYYY  .  +  . ID=gene04;Parent=operon01;name=resZ\nchrX  . mRNA     XXXX YYYY  .  +  . ID=tran01;Parent=gene01,gene02,gene03,gene04\nchrX  . exon     XXXX YYYY  .  +  . ID=exon00001;Parent=tran01\nchrX  . CDS      XXXX YYYY  .  +  . Parent=tran01;Derives_from=gene01\nchrX  . CDS      XXXX YYYY  .  +  . Parent=tran01;Derives_from=gene02\nchrX  . CDS      XXXX YYYY  .  +  . Parent=tran01;Derives_from=gene03\nchrX  . CDS      XXXX YYYY  .  +  . Parent=tran01;Derives_from=gene04
\n

The regulatory element (\"promoter\" in this example) is part of the operon via the Parent tag. The four genes are part of the operon, and the resulting mRNA is multiply-parented by the four genes, as in the earlier example.

\n

At the time of this writing, promoters and other cis-regulatory elements cannot be part_of an operon, but this restriction is being reconsidered.

\n
\n
miRNA extension
\n
\n

The mirGFF3 format is adapted from the GFF3 definition to hold miRNA/isomiR information from miRNA-seq data. The main difference is in the Attributes column, where these fields are mandatory: Variant, Cigar, Hits, Expression and Filter. To learn more about each one, please visit the main repository https://github.com/miRTop/mirGFF3

\n
\n
\n
\n

Change Log

\n
\n
1.26 Tue 18 Aug 2020
\n
\n
    \n
  • More internal links (thanks to Juke34).
  • Switched to the actual GFF3 version number in examples.
  • Standardized date format in changelog.
  • Fixed typos (thanks to lbergelson) and formatting.
  • UTF-8 is now the only recommended character encoding.
\n
\n
1.25 Tue 24 Sep 2019
\n
\n
    \n
  • Added clarifications to CDS phase based on discussions with jbethune.
\n
\n
1.24 Mon 15 Jul 2019
\n
\n
    \n
  • Added miRNA extension to the pathological cases.
\n
\n
1.23 Fri 3 Oct 2016
\n
\n
    \n
  • Added SO:0000110 sequence_feature as allowable under Column 3: \"type\".
\n
\n
1.22 Mon 2 May 2016
\n
\n
    \n
  • Converted from HTML to Markdown.
\n
\n
1.21 Tue 26 Feb 2013
\n
\n
    \n
  • Clarification of escaping conventions.
  • Explicit requirement that the value of start and end be one-based positive integers.
  • Clarification to the use of quotes in attribute values.
  • Clarification of lines beginning with # and exclusion of inline comments.
  • Clarification that the ##gff-version pragma only appears once in a file.
  • Clarification to the ##sequence-region pragma.
\n
\n
1.20 Wed 15 Dec 2010
\n
\n
    \n
  • Added language to the description of the ID attribute to clarify that discontinuous features can exist on multiple lines and share the same ID.
\n
\n
1.19 Tue 6 Jul 2010
\n
\n
    \n
  • Fixed coordinate errors in the EST_match and match_part examples in the 'Alignments' section.
  • Constrained multiple attribute values to the Parent, Alias, Note, Dbxref and Ontology_term attributes.
\n
\n
1.18 Thu 24 Jun 2010
\n
\n
    \n
  • Added the sections regarding circular genomes to the spec.
\n
\n
1.17 Wed 2 Jun 2010
\n
\n
    \n
  • Changed the spec to include Sequence Ontology (SO) sequence_feature terms in column 3 as well as SOFA terms. (SOFA is a subset of SO).
\n
\n
1.16 Tue 25 May 2010
\n
\n
    \n
  • Fixed more incorrect CDS phases throughout.
  • Changed (three|five)_prime_utr to (three|five)_prime_UTR throughout.
  • Changed (3'|5')-UTR to (three|five)_prime_UTR throughout.
  • Added ID attributes to CDS features (required for multiline features) in the FASTA pragma example.
\n
\n
1.15 Mon 31 Aug 2009
\n
\n
    \n
  • Fixed incorrect CDS phases in the canonical gene example.
\n
\n
1.14 Mon 25 Aug 2008
\n
\n
    \n
  • Add meta-directives for species and build number.
\n
\n
1.13 Wed 23 May 2007
\n
\n
    \n
  • Insist that CDS include the start and end codon.
\n
\n
1.12 Thu 5 Apr 2007
\n
\n
    \n
  • Use \"match_part\" as the subpart of cDNA_match in the paired EST example.
  • Phase is required for all CDS features.
\n
\n
1.11 Fri 1 Dec 2006
\n
\n
    \n
  • Clarified definition of phase relative to reverse strand features.
\n
\n
1.10 Thu 14 Sep 2006
\n
\n
    \n
  • Reformatted for new SO web site.
\n
\n
1.09 Wed 6 Sep 2006
\n
\n
    \n
  • Information about the GFF3 validator.
\n
\n
1.08 Tue 18 Jul 2006
\n
\n
    \n
  • Added URLs for SO releases.
\n
\n
1.07 Wed 24 May 2006
\n
\n
    \n
  • Fixed description of phase (temporarily lost due to CVS glitches).
\n
\n
1.06 Wed 24 May 2006
\n
\n
    \n
  • Relaxed escaping rules.
  • Fixed typos found by Gordon Gremme.
\n
\n
1.05 Tue 23 May 2006
\n
\n
    \n
  • Fixed all IDs in the examples to make them internally consistent. Previously, some examples did not validate because of inconsistent numbers of zeroes in the identifiers (mRNA00001 vs mRNA0001).
\n
\n
\n
","renderedFileInfo":null,"shortPath":null,"symbolsEnabled":true,"tabSize":8,"topBannersInfo":{"overridingGlobalFundingFile":false,"globalPreferredFundingPath":null,"showInvalidCitationWarning":false,"citationHelpUrl":"https://docs.github.com/github/creating-cloning-and-archiving-repositories/creating-a-repository-on-github/about-citation-files","actionsOnboardingTip":null},"truncated":false,"viewable":true,"workflowRedirectUrl":null,"symbols":{"timed_out":false,"not_analyzed":false,"symbols":[{"name":"Generic Feature Format Version 3 (GFF3)","kind":"section_3","ident_start":4,"ident_end":43,"extent_start":0,"extent_end":60468,"fully_qualified_name":"Generic Feature Format Version 3 (GFF3)","ident_utf16":{"start":{"line_number":0,"utf16_col":4},"end":{"line_number":0,"utf16_col":43}},"extent_utf16":{"start":{"line_number":0,"utf16_col":0},"end":{"line_number":870,"utf16_col":0}}},{"name":"Summary","kind":"section_4","ident_start":50,"ident_end":57,"extent_start":45,"extent_end":1381,"fully_qualified_name":"Summary","ident_utf16":{"start":{"line_number":2,"utf16_col":5},"end":{"line_number":2,"utf16_col":12}},"extent_utf16":{"start":{"line_number":2,"utf16_col":0},"end":{"line_number":17,"utf16_col":0}}},{"name":"GFF3 Validator","kind":"section_4","ident_start":1386,"ident_end":1400,"extent_start":1381,"extent_end":1502,"fully_qualified_name":"GFF3 Validator","ident_utf16":{"start":{"line_number":17,"utf16_col":5},"end":{"line_number":17,"utf16_col":19}},"extent_utf16":{"start":{"line_number":17,"utf16_col":0},"end":{"line_number":21,"utf16_col":0}}},{"name":"Description of the Format","kind":"section_4","ident_start":1507,"ident_end":1532,"extent_start":1502,"extent_end":10705,"fully_qualified_name":"Description of the Format","ident_utf16":{"start":{"line_number":21,"utf16_col":5},"end":{"line_number":21,"utf16_col":30}},"extent_utf16":{"start":{"line_number":21,"utf16_col":0},"end":{"line_number":99,"utf16_col":0}}},{"name":"The Canonical 
Gene","kind":"section_4","ident_start":10710,"ident_end":10728,"extent_start":10705,"extent_end":20423,"fully_qualified_name":"The Canonical Gene","ident_utf16":{"start":{"line_number":99,"utf16_col":5},"end":{"line_number":99,"utf16_col":23}},"extent_utf16":{"start":{"line_number":99,"utf16_col":0},"end":{"line_number":235,"utf16_col":0}}},{"name":"Circular Genomes","kind":"section_4","ident_start":20428,"ident_end":20444,"extent_start":20423,"extent_end":21105,"fully_qualified_name":"Circular Genomes","ident_utf16":{"start":{"line_number":235,"utf16_col":5},"end":{"line_number":235,"utf16_col":21}},"extent_utf16":{"start":{"line_number":235,"utf16_col":0},"end":{"line_number":245,"utf16_col":0}}},{"name":"Representing Spliced Non-Coding Transcripts","kind":"section_4","ident_start":21110,"ident_end":21153,"extent_start":21105,"extent_end":21323,"fully_qualified_name":"Representing Spliced Non-Coding Transcripts","ident_utf16":{"start":{"line_number":245,"utf16_col":5},"end":{"line_number":245,"utf16_col":48}},"extent_utf16":{"start":{"line_number":245,"utf16_col":0},"end":{"line_number":249,"utf16_col":0}}},{"name":"Parent (part_of) Relationships","kind":"section_4","ident_start":21328,"ident_end":21358,"extent_start":21323,"extent_end":22116,"fully_qualified_name":"Parent (part_of) Relationships","ident_utf16":{"start":{"line_number":249,"utf16_col":5},"end":{"line_number":249,"utf16_col":35}},"extent_utf16":{"start":{"line_number":249,"utf16_col":0},"end":{"line_number":257,"utf16_col":0}}},{"name":"The Gap Attribute","kind":"section_4","ident_start":22121,"ident_end":22138,"extent_start":22116,"extent_end":24787,"fully_qualified_name":"The Gap 
Attribute","ident_utf16":{"start":{"line_number":257,"utf16_col":5},"end":{"line_number":257,"utf16_col":22}},"extent_utf16":{"start":{"line_number":257,"utf16_col":0},"end":{"line_number":297,"utf16_col":0}}},{"name":"Alignments","kind":"section_4","ident_start":24792,"ident_end":24802,"extent_start":24787,"extent_end":30056,"fully_qualified_name":"Alignments","ident_utf16":{"start":{"line_number":297,"utf16_col":5},"end":{"line_number":297,"utf16_col":15}},"extent_utf16":{"start":{"line_number":297,"utf16_col":0},"end":{"line_number":373,"utf16_col":0}}},{"name":"Transcript-Relative Alignments","kind":"section_4","ident_start":30061,"ident_end":30091,"extent_start":30056,"extent_end":34455,"fully_qualified_name":"Transcript-Relative Alignments","ident_utf16":{"start":{"line_number":373,"utf16_col":5},"end":{"line_number":373,"utf16_col":35}},"extent_utf16":{"start":{"line_number":373,"utf16_col":0},"end":{"line_number":444,"utf16_col":0}}},{"name":"Case #1: alignment to a + strand transcript","kind":"section_5","ident_start":30287,"ident_end":30330,"extent_start":30281,"extent_end":31536,"fully_qualified_name":"Case #1: alignment to a + strand transcript","ident_utf16":{"start":{"line_number":377,"utf16_col":6},"end":{"line_number":377,"utf16_col":49}},"extent_utf16":{"start":{"line_number":377,"utf16_col":0},"end":{"line_number":398,"utf16_col":0}}},{"name":"Case #2: alignment to a - strand transcript","kind":"section_5","ident_start":31542,"ident_end":31585,"extent_start":31536,"extent_end":34455,"fully_qualified_name":"Case #2: alignment to a - strand transcript","ident_utf16":{"start":{"line_number":398,"utf16_col":6},"end":{"line_number":398,"utf16_col":49}},"extent_utf16":{"start":{"line_number":398,"utf16_col":0},"end":{"line_number":444,"utf16_col":0}}},{"name":"Ontology Associations and DB Cross References","kind":"section_4","ident_start":34460,"ident_end":34505,"extent_start":34455,"extent_end":36077,"fully_qualified_name":"Ontology Associations and DB 
Cross References","ident_utf16":{"start":{"line_number":444,"utf16_col":5},"end":{"line_number":444,"utf16_col":50}},"extent_utf16":{"start":{"line_number":444,"utf16_col":0},"end":{"line_number":465,"utf16_col":0}}},{"name":"Other Syntax","kind":"section_4","ident_start":36082,"ident_end":36094,"extent_start":36077,"extent_end":44731,"fully_qualified_name":"Other Syntax","ident_utf16":{"start":{"line_number":465,"utf16_col":5},"end":{"line_number":465,"utf16_col":17}},"extent_utf16":{"start":{"line_number":465,"utf16_col":0},"end":{"line_number":572,"utf16_col":0}}},{"name":"Pathological Cases","kind":"section_4","ident_start":44736,"ident_end":44754,"extent_start":44731,"extent_end":55510,"fully_qualified_name":"Pathological Cases","ident_utf16":{"start":{"line_number":572,"utf16_col":5},"end":{"line_number":572,"utf16_col":23}},"extent_utf16":{"start":{"line_number":572,"utf16_col":0},"end":{"line_number":719,"utf16_col":0}}},{"name":"Change Log","kind":"section_4","ident_start":55515,"ident_end":55525,"extent_start":55510,"extent_end":60468,"fully_qualified_name":"Change Log","ident_utf16":{"start":{"line_number":719,"utf16_col":5},"end":{"line_number":719,"utf16_col":15}},"extent_utf16":{"start":{"line_number":719,"utf16_col":0},"end":{"line_number":870,"utf16_col":0}}}]}},"copilotInfo":null,"copilotAccessAllowed":false,"csrf_tokens":{"/The-Sequence-Ontology/Specifications/branches":{"post":"OKlCu2x9Y5r1q2L5taX6rLiV-5X8jZ_iPK1AzoXOZNQrUci94imGXYwc6c4b3vfDWCD4Qq7e0OLB9NSUbQIpSA"},"/repos/preferences":{"post":"jccB4niculxIYLGMZP0G5mLP-bNkmMlbj20bcfJWgjCbBhSGHe4KW7RcoGVZr17MFH6-IwXtiNCjvm-nXzUsSQ"}}},"title":"Specifications/gff3.md at master · The-Sequence-Ontology/Specifications"}