SO USAGE GUIDE

[[ DRAFT!! IN PROGRESS!! ]]

Preliminary guide to annotation with SO

This document deals with how genomic database curators, database programmers, and programmers writing format converters or object models should implement SO.

The document is aimed at developers; however, if developers follow the rules laid down here, then downstream tools and databases will automatically guide biologist end-users to use SO consistently.

For the purposes of this document we assume that we are assigning SO to objects in a relational database, end-user tool, object model, flatfile or XML format that has a schema/specification that meets a set of minimal requirements.

I call any such schema/format that meets these requirements a "generic feature graph" model.

The requirements are:

1. features are generic entities and can be typed via a DAG-style ontology

2. features can be arranged hierarchically with respect to one another (for example, gene entities can contain transcript entities)

3. neither this typing nor this hierarchy is directly constrained or limited by the model

Some models meeting these requirements are

  Model       Model Type       How Feature Graphs are Implemented
  =====       ==========       ==================================
  Chado       (relational)     via feature_relationship table
  ChadoXML    (XML)            via feature_relationship element
  ChaosXML    (XML)            via feature_relationship element
  GFF3        (tab-delimited)  via ID and Parent tags
  BioPerl     (Object)         via FeatureHolderI interface
  BioJava     (Object)         via ???
  CGL         (Object)         need to check this...

Some that don't:

Ensembl features cannot be arranged in arbitrary hierarchies; the feature graph is directly constrained by the schema, so it does not meet requirement 3.

GAME XML 1.0 does not allow arbitrary hierarchies - the level of nesting is constrained by the DTD, although this is much more flexible and loosely typed than the Ensembl relational schema.

Note that SO can of course be used by these models; it is just that they probably require more specific SO usage documentation tailored to each model. For example, Ensembl might have a table "noncoding_transcript" (it doesn't, but this is for the purposes of illustration). There may be no tables for more specific types; more detailed typing would be done by SO. Obviously it would only make sense to use subtypes of the relevant SO type here. Other examples may be more subtle.

Similarly, GAME would need a GAME-specific usage guide that says which SO types can go at the annotation element level, which can go at the feature_set level, and so on.

[need to check if Apollo allows hierarchies of arbitrary depth - I think this is the intention for future versions]

The rest of this documentation applies to generic feature graph models only. There are differences within generic feature graph models; for example, both Chaos and Chado allow the arc labels on the feature graph to be typed, which is not possible in GFF3. We will have documentation relating to these cases later on.

A lot of what is covered here is also covered in the GFF3 spec: gff3-jan04.shtml

Some of the stuff in that document is specific to GFF3; for example GFF3 allows split locations so explicit annotation of exons is optional (exons can be implied from a split mRNA location). This document attempts to generalise some of that.

BASIC SO COMPLIANCE

Basic SO compliance means that all feature types come from SO. This one is a no-brainer. This can easily be checked automatically.
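
As a rough illustration of such an automatic check, here is a minimal Python sketch. The SO_TERMS set is a small hand-picked subset used only for this example (a real check would load the term names from the SO ontology file), and the feature dictionaries and the function name non_so_types are hypothetical, not part of any existing tool.

```python
# Hypothetical, hand-picked subset of SO term names for illustration only;
# a real check would load every term name from the ontology itself.
SO_TERMS = {"gene", "mRNA", "exon", "CDS", "polypeptide", "TF_binding_site"}


def non_so_types(features, so_terms=SO_TERMS):
    """Return the sorted list of feature types that are not SO terms."""
    return sorted({f["type"] for f in features} - set(so_terms))


# Assumed in-memory representation: one dict per feature.
features = [
    {"id": "EDEN", "type": "gene"},
    {"id": "EDEN.1", "type": "mRNA"},
    {"id": "x1", "type": "misc_feature"},  # not an SO term
]

print(non_so_types(features))  # prints ['misc_feature']
```

A data set is basic SO compliant under this sketch when the returned list is empty.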

CENTRAL DOGMA GENE MODEL

aka Basic Canonical Gene

[this recapitulates some of what is in the chado docs as well as the GFF3 docs]

To illustrate how a canonical gene should be represented, consider Figure 1. This indicates a gene named EDEN extending from position 1000 to position 9000. It encodes three alternatively-spliced transcripts named EDEN.1, EDEN.2 and EDEN.3. It also has an identified transcription factor binding site located 50 bp upstream from the transcriptional start site of EDEN.1 and EDEN.2.

FIGURE 1

The first thing to be noted is that there are a number of different ways of representing this gene, depending on which parts are explicit and which are implicit.

The first example in the GFF3 documentation provides a way in which all the exons are implicit (because GFF3 allows split locations). We will assume that we are not using split locations.

Without split locations we will need a minimum of

exon

mRNA

gene

TF_binding_site

We also need a way of representing the protein-coding portion. The GFF3 docs opt for a CDS. Another option is polypeptide.

SO says

polypeptide :derived_from CDS :part_of mRNA

Does this mean we need all 3 entities? Not necessarily, because CDS or polypeptide can 'masquerade' as the other one.

In GFF3 it is more "natural" to use CDS, because GFF3 is more concerned with genomic sequences and it is fine to leave details of the polypeptide/protein implicit.

Chado is more concerned with the "bigger picture" than just genomic sequence, so it uses polypeptide instead of CDS, even though it perhaps feels "unnatural" to locate a polypeptide on genomic sequence. The advantage is that various protein type data can be attached to the polypeptide entity without introducing superfluous entities.

[Note that chado and chaosxml currently use "protein" erroneously here - this will change]

To represent the gene model in figure 1 we need

exon, mRNA, gene, TF_binding_site AND one or both of {CDS,polypeptide}

If we remove the TF_binding_site from the picture we have

exon, mRNA, gene, AND one or both of {CDS,polypeptide}

This is referred to as the "SO minimal gene model form"; the 3 possible variations are referred to as CDS, polypeptide or CDS+polypeptide.

[perhaps we need an ontology of usages!]

Software tools should be conversant in the 3 variations of the SO minimal gene model form; roundtripping may be a problem, as the software may convert to an internal canonical form and output in only one of the forms.

Feature Graph

Depending on which of the forms is used, the feature containment hierarchy should look like this:

 gene
  mRNA
   exon
   CDS

 gene
  mRNA
   exon
   polypeptide

 gene
  mRNA
   exon
   CDS
    polypeptide

This conforms to the SO partonomy.

SO will eventually include cardinality constraints; at the moment these are unrestricted. Once cardinality constraints are introduced, the cardinality between CDS and polypeptide must be 1:1 (because they are essentially the same entity). An mRNA must have either a CDS or polypeptide. Usually it has one, but if it is polycistronic it can have more.
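
Until SO carries these constraints itself, a tool could check them directly. The following Python sketch is one possible way to do so; the function name check_cardinality and the two dictionaries used to hold the feature graph are assumptions made for this example. Because a CDS has no polypeptide child at all in the CDS-only form, the sketch only flags a CDS with more than one polypeptide.

```python
from collections import defaultdict


def check_cardinality(types, parents):
    """Check the mRNA and CDS cardinality rules on a feature graph.

    types:   feature id -> SO type name
    parents: child feature id -> parent feature id
    Returns a list of human-readable problems (empty if none were found).
    """
    children = defaultdict(list)
    for child, parent in parents.items():
        children[parent].append(child)

    problems = []
    for fid, ftype in types.items():
        child_types = [types[c] for c in children[fid]]
        # An mRNA must have at least one CDS or polypeptide
        # (more than one is allowed for polycistronic transcripts).
        if ftype == "mRNA" and not any(
            t in ("CDS", "polypeptide") for t in child_types
        ):
            problems.append(f"{fid}: mRNA has no CDS or polypeptide")
        # CDS to polypeptide is 1:1, so a CDS may carry at most one.
        if ftype == "CDS" and child_types.count("polypeptide") > 1:
            problems.append(f"{fid}: CDS has more than one polypeptide")
    return problems


# Hypothetical example graph: m2 has an exon but no CDS or polypeptide.
types = {"g": "gene", "m1": "mRNA", "m2": "mRNA", "c1": "CDS", "e1": "exon"}
parents = {"m1": "g", "m2": "g", "c1": "m1", "e1": "m2"}

print(check_cardinality(types, parents))
# prints ['m2: mRNA has no CDS or polypeptide']
```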

If the model allows labeling of the relationship types, then these labels should correspond to the SO relationship type. This is always part_of, with the exception of the relationship between polypeptide and its parent/subject, which should be derived_from.

The BioPerl module Bio::SeqFeature::Tools::TypeMapper provides a method get_relationship_type_by_parent_child() for verifying this.
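
For models outside BioPerl, the labelling rule stated above is simple enough to sketch in a few lines of Python. This is not the BioPerl implementation and makes no claim about how that method works internally; the function name relationship_type is hypothetical, and the parent type is accepted only so that the call reads naturally for a parent/child pair.

```python
def relationship_type(parent_type, child_type):
    """Return the SO relationship label for a parent/child feature pair.

    Follows the rule in the text: always part_of, except that a
    polypeptide is derived_from its parent. parent_type is not needed
    by this simple rule and is kept only for readability at call sites.
    """
    return "derived_from" if child_type == "polypeptide" else "part_of"


# Label the arcs of the CDS+polypeptide form of the minimal gene model.
edges = [
    ("gene", "mRNA"),
    ("mRNA", "exon"),
    ("mRNA", "CDS"),
    ("CDS", "polypeptide"),
]
labelled = [(child, relationship_type(parent, child), parent)
            for parent, child in edges]

for child, rel, parent in labelled:
    print(child, rel, parent)
```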

Non-minimal forms

Non-minimal forms contain entities that can be derived from entities within the minimal form. For example

  intron
  UTR

A non-minimal form may also include subclasses of the classes used in the minimal form, for example "five_prime_exon" (inferable by spatial coordinates) or "polycistronic_transcript" (inferable from cardinality of CDS to mRNA). A non-minimal form can always be reduced to the corresponding minimal model with no loss of information. A minimal form can always be expanded to a non-minimal form using unambiguous formal computable rules. We intend to supply these rules in some rule language (such as Prolog or KIF or RuleML) as part of SO.

For now those rules exist as non-computable human readable definitions which serve as a spec for software implementors.

It is up to individual software implementors how they wish to deal with the implicit parts of non-minimal forms. They may choose to ignore them when importing SO compliant data. Some pieces of software may decide to infer them. For example, end-user software such as a database query interface or a genome viewer must allow users to treat implicit entities such as introns in just the same way they treat explicit ones.
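
To make the idea of inferring implicit entities concrete, here is a minimal Python sketch that derives introns from the explicit exons of a single transcript. It assumes 1-based inclusive coordinates, and the exon coordinates are made up for illustration; the function name infer_introns is hypothetical.

```python
def infer_introns(exons):
    """Derive intron locations from the exons of one transcript.

    exons: list of (start, end) pairs, 1-based and inclusive.
    Returns the gaps between consecutive exons as (start, end) pairs.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _next_end) in zip(ordered, ordered[1:]):
        # Only a real gap between neighbouring exons counts as an intron.
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Made-up exon coordinates for a four-exon transcript.
exons = [(1050, 1500), (3000, 3902), (5000, 5500), (7000, 9000)]
print(infer_introns(exons))
# prints [(1501, 2999), (3903, 4999), (5501, 6999)]
```

Going the other way (dropping explicit introns to get back to the minimal form) loses nothing, since the same function can regenerate them.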

Alternative minimal forms

Other minimal forms than the "SO minimal gene model form" are possible. For example, one could specify UTR instead of CDS. The UTR would be explicit and the CDS implicit.

Other examples are less silly. There is a convincing argument to be made for making splice_sites explicit and exons and introns implicit. This leads to a very natural system for generating intron IDs that are guaranteed to be unique (generating intron IDs from the IDs of the surrounding exons is problematic because two exons may share the same donor or acceptor splice site).
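
The ID problem can be seen in a small Python sketch. Everything here is an assumption made for illustration: the ID formats, the function names, the coordinates, and the simplification that on the forward strand the donor site is the end of the upstream exon and the acceptor site is the start of the downstream exon. SO does not prescribe any such naming scheme.

```python
def intron_id_from_splice_sites(seqid, donor_pos, acceptor_pos):
    """Hypothetical ID scheme: name an intron by its two splice sites."""
    return f"intron:{seqid}:{donor_pos}-{acceptor_pos}"


def intron_id_from_exons(upstream_exon_id, downstream_exon_id):
    """Hypothetical ID scheme: name an intron by its flanking exon IDs."""
    return f"intron:{upstream_exon_id}:{downstream_exon_id}"


# Two alternative upstream exons share the same donor site (position 200)
# and are spliced to the same downstream exon, so the intron is the same.
exons = {"exonA": (100, 200), "exonA_alt": (150, 200), "exonB": (300, 400)}
pairs = [("exonA", "exonB"), ("exonA_alt", "exonB")]

by_sites = {
    intron_id_from_splice_sites("chr1", exons[up][1], exons[down][0])
    for up, down in pairs
}
by_exons = {intron_id_from_exons(up, down) for up, down in pairs}

print(len(by_sites))  # prints 1: one intron, one ID
print(len(by_exons))  # prints 2: the same intron gets two different IDs
```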

However, the alternative minimal forms differ from conventional usage so they must be regarded as non-standard. Software should not be required to understand SO compliant formats that use alternative minimal forms.

If a piece of software insists on using alternative minimal forms as an internal representation then there should be software for converting between standard and alternative minimal forms. Ideally this would be done using the computable rules that will eventually become part of SO.

If an alternative minimal form is common enough then we can document it here.

Forms with intermediate entities

An mRNA does not come into existence by itself - there is first of all a primary_transcript that gets turned into some kind of processed_transcript. Representing two entities is superfluous the vast majority of the time. Really it is just one entity at different points in time.

There are occasions when it is useful to record distinct entities, if there is biological data which only applies to one, or if the question of identity is not quite so simple.

An example is trans-splicing in Drosophila mod(mdg4). Here we have two primary_transcripts and one processed_transcript. Perhaps the easiest thing to do is hack this as a single transcript. We still want to allow the option of representing intermediate entities.

This also applies to things such as post translational modification.

Again the rule is simply to conform to the allowed SO relationship types.

Most software will not be able to deal with intermediate entities; it is our hope that some will because this is necessary to really use the power of SO for detailed curation of biological oddities.

  gene
   primary_transcript
    mRNA
     CDS
      polypeptide
       protein
   exon

Noncoding genes

We have seen how to deal with protein coding genes - i.e. genes for which at least one transcript is an mRNA.

The SO type "gene" covers all kinds of genes - basically anything that has a transcript. "exon" also covers the exon from any kind of RNA.

Noncoding genes are typed by the type of their transcript. The feature graph will look like this:

  gene
   fooRNA
   exon

Again this must conform to SO - mRNA is the only RNA type that is allowed to have CDS and (by transitivity) polypeptide.

Since most noncoding RNAs are single-exon, there is an argument to be made for making the exon optional. This is left at the implementor's discretion. A similar argument can be made for making the gene element optional, since if there is only one exon there is only one possible spliceform! Noncoding genes can usually be represented by a single entity rather than introducing two superfluous ones.

This can lead to problems further down the line. A query for "how many genes" will have to jump through some hoops if there is no explicit gene entity for ncRNAs.

However, this is left at the data provider's discretion.

"SO Chado form" (which is subsumed by "SO minimal gene model form") chooses to always have an explicit gene and exon entity, even if they are slightly superfluous.

[need a name for alternative form - superminimal?]

Some ncRNAs may have further downstream "products", similar to how an mRNA has a polypeptide product. An example is miRNAs; this is dealt with in a separate section.

Other gene parts

We have outlined the entities that are necessary in the specification of a gene model; there are other parts that are not necessary, but neither are they inferable. For example, regulatory regions.

We saw in figure 1 a gene with a TF_binding_site.

The rules for these are quite simple - the SO partonomy must be adhered to. For example, a TF_binding_site must be part of a gene; e.g.

 gene
  TF_binding_site
  mRNA
   exon
   {CDS,polypeptide}

If non-chado forms are used for ncRNAs then we have a problem here - the TF_binding_site cannot be part of an RNA!

PSEUDOGENES

Tricky - we also have pseudoexons... how should these all fit together...

In chado we currently have graphs like this

  gene
   pseudogene
    exon

i.e. they are treated like ncRNAs - this makes things a bit simpler for certain pieces of software, but it's not right, I fear.

TRANSPOSONS

Generally we just want to mark an entire region as being a transposable element insertion, but we may also want to do in-depth curation of the LTRs, the gag, pol and env genes, or whatever.

MATCHES

SEQUENCE_VARIANTS

SPATIAL LOCATIONS OF SO ENTITIES

This is very much down to the individual models, as each has its own location model. For example, GFF3 and BioPerl allow split locations, whereas chado consciously avoided this.

However, there are some general rules that can be laid down.

Some of these rules should become computable rules that are part of SO - for example, a CDS must be spatially contained by an mRNA. These are mostly obvious but it is still good to have them.

Some locations are derivable - for example, the location of a transcript is defined by the location of its component exons.
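
Until such rules are part of SO, the two examples just given can be sketched in Python as follows. The sketch assumes simple (start, end) locations with 1-based inclusive coordinates on a single strand; the function names and coordinates are hypothetical.

```python
def transcript_span(exons):
    """Derive a transcript's outermost location from its exons.

    exons: non-empty list of (start, end) pairs, 1-based and inclusive.
    """
    return (min(start for start, _ in exons), max(end for _, end in exons))


def spatially_contains(outer, inner):
    """True if the inner location lies entirely within the outer one."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


# Made-up coordinates for one mRNA and two candidate CDS locations.
exons = [(1050, 1500), (3000, 3902)]
mrna = transcript_span(exons)

print(mrna)                                    # prints (1050, 3902)
print(spatially_contains(mrna, (1201, 3500)))  # prints True
print(spatially_contains(mrna, (900, 3500)))   # prints False
```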

Note that in chado/chaos a location always represents the outermost boundaries of a feature. Thus a CDS can be represented by a begin and end coordinate only. If there are cds_introns, then these are implicit and derivable from the locations of the exons of the sibling mRNA.

Matches will always have paired locations. sequence_variants are also represented as paired locations in chado/chaos, even though the hit location coordinates will probably be null.

PROPERTIES

MODEL SPECIFIC NOTES

BioPerl

Bio::SeqFeature::Tools::Unflattener and Bio::SeqFeature::Tools::TypeMapper will convert GenBank files to SO compliant BioPerl feature graphs. These classes also have some useful methods for adding or removing implicit features derivable from explicit features.

All SO features are represented with a Bio::SeqFeatureI; the type is represented with $sf->primary_tag().

SO match features should use Bio::SeqFeature::SimilarityPair.