What is the Sequence Ontology?

Eilbeck K1, Lewis S1, Mungall CJ2, Yandell M2, Ashburner M3

1. Department of Molecular and Cellular Biology, Life Sciences Addition, University of California, Berkeley, CA 94729-3200. 2. HHMI, Department of Molecular and Cellular Biology, Life Sciences Addition, University of California, Berkeley, CA 94729-3200. 3. Department of Genetics, University of Cambridge, Downing Street, Cambridge, UK CB2 3EH

Aims of the Sequence Ontology

The Sequence Ontology (SO) is a structured controlled vocabulary for describing biological sequences. It is provided as a common resource to the bioinformatics community in an effort to unite the multiple genome annotation groups with shared semantics. The purpose of SO is to standardize and define the terms used to describe sequences, so that model organism databases can be queried robustly, rapidly, and in a uniform way regardless of the original source of the data.

Structure of the Sequence Ontology

At its simplest, an ontology is a description of the concepts in a domain of interest, in this case biological sequence, and of the relationships that exist between those concepts. The concepts comprise a controlled vocabulary used by the community to describe sequence features. The relationships give context to the concepts by allowing us to describe specific knowledge about the domain, for example 'a five prime cap is part of a processed transcript'. The ontology allows the various genome annotation groups to share a common understanding, and to share this knowledge with software. Figure 1 shows a portion of the Sequence Ontology.

Figure 1. Browsing the Sequence Ontology. A selection of SO is shown here, with the kinds of region expanded in the first panel. The isa relationships are denoted with the 'i' icon and the part_of relationships with the 'P' icon. The concept 'gene' is expanded to show its parts, then 'regulatory_region' is expanded to show the kinds of regulatory region, and a reference is shown for the concept 'exonic_splice_enhancer'.

The concepts in SO allow us to describe exactly what a region of sequence is, and locate that meaning on the sequence in base coordinates. Each of the concepts will have a definition and a reference to explicitly identify its meaning. Currently about 70% of concepts have definitions. Examples of these locatable sequence features are contig, BAC, gene, transcript, miRNA, and transposable_element.

The relationships allow us to define what something is a kind of and what something is a component of.

  • A promoter isa regulatory region
  • A regulatory_region is part_of a gene
  • An exonic_splice_enhancer isa splice_enhancer isa regulatory_region is part_of a gene.

These relationships allow us to locate and then describe the structure of a sequence with regard to its constituent parts in its biological context, e.g. where the exons of a transcript are located on a sequenced BAC.
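
The isa and part_of relationships listed above can be sketched as a small graph. This is only an illustration: the dictionaries are hand-copied from the three examples in the text rather than loaded from the SO file, and the function names (ancestors, is_a, is_part_of) are invented for this sketch.

```python
# Toy fragment of SO, hand-copied from the examples in the text.
# It is NOT loaded from the real ontology file.
IS_A = {
    "promoter": ["regulatory_region"],
    "exonic_splice_enhancer": ["splice_enhancer"],
    "splice_enhancer": ["regulatory_region"],
}
PART_OF = {
    "regulatory_region": ["gene"],
}


def ancestors(term):
    """Return every term reachable from `term` by following isa links."""
    seen = []
    stack = list(IS_A.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            stack.extend(IS_A.get(parent, []))
    return seen


def is_a(term, kind):
    """True if `term` is `kind` or a (transitive) kind of it."""
    return term == kind or kind in ancestors(term)


def is_part_of(term, whole):
    """True if `term`, or anything it is a kind of, is part_of `whole`."""
    for t in [term] + ancestors(term):
        if whole in PART_OF.get(t, []):
            return True
    return False


print(is_a("exonic_splice_enhancer", "regulatory_region"))  # True
print(is_part_of("exonic_splice_enhancer", "gene"))         # True
print(is_part_of("gene", "promoter"))                       # False
```

The point of the sketch is that part_of is inherited along isa: an exonic_splice_enhancer is never declared to be part of a gene directly, yet the answer follows from the two kinds of relationship combined.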

The concepts can be located on the sequence in base coordinates, and can therefore be used in formats such as GFF3 and Chado XML, and in model organism databases. Figure 2 demonstrates how the terms in SO are used to mark up sequence using Chado XML, thus rendering the knowledge into a format suitable for computational analysis. This format is particularly appropriate because the relationships are naturally modeled.

Figure 2. Chado XML markup of part of a Drosophila melanogaster annotation using the Sequence Ontology. Each of the features is assigned a type from SO such as 'exon' to denote the kind of concept it is and a unique 'feature_id', used to identify it within the document. The relationships are expressed using these feature_ids with the feature_relationship tag and the type of relationship. In this document there is a part_of relationship described between exon2 (id 480262) and the transcript (id 168257).
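
The same typed features and part_of relationship can also be written out in GFF3, the other format mentioned above, where the type column holds the SO term and the Parent attribute carries the part_of link. The following is a minimal sketch rather than a full GFF3 writer: the two feature ids are taken from the Figure 2 caption, but the Feature class, the to_gff3 helper, the sequence name, the source name, and all coordinates are assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    feature_id: str
    so_type: str          # a term from SO, e.g. 'exon'
    start: int            # 1-based, inclusive (GFF3 convention)
    end: int
    strand: str = "+"
    parents: list = field(default_factory=list)  # ids this feature is part_of


def to_gff3(feature, seqid="demo_seq", source="example"):
    """Format one feature as a nine-column, tab-separated GFF3 line."""
    attrs = [f"ID={feature.feature_id}"]
    if feature.parents:
        # In GFF3 the Parent attribute expresses a part_of relationship.
        attrs.append("Parent=" + ",".join(feature.parents))
    columns = [
        seqid, source, feature.so_type,
        str(feature.start), str(feature.end),
        ".",                # score: not used here
        feature.strand,
        ".",                # phase: only meaningful for CDS features
        ";".join(attrs),
    ]
    return "\t".join(columns)


# Ids from the Figure 2 caption; coordinates are invented.
transcript = Feature("168257", "transcript", 1000, 5000)
exon2 = Feature("480262", "exon", 2000, 2300, parents=["168257"])

print("##gff-version 3")
for f in (transcript, exon2):
    print(to_gff3(f))
```

Keeping the SO term and the parent ids on the feature record itself means the same in-memory object could be serialized to either format without losing the relationship.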

What does the Sequence Ontology add to annotations?

  1. Comparison: Comparisons between different organisms are simplified if all of the annotations are marked up with the same terms.
  2. Validation: Annotations can be validated against the Sequence Ontology, so that mistakes are easily located by recourse to the ontology. For example, software can know that exons are legitimate parts of transcripts, and also that a non_coding exon appearing in the CDS or protein does not make sense.
  3. Retrieval: SO can be used to query a sequence database about the kinds of sequences archived within it and to return the annotations of interest, for example 'give me all of the non-coding exons of this transcript' or 'give me all of the alternate transcripts of this gene that contain this exon'.
  4. Analysis: The Sequence Ontology can be used to interpret the results of sequence similarity tools such as BLAST. Sequence annotations marked up with SO enable software to use sequence alignments in ways not previously possible - in effect they allow us to BLAST annotations rather than sequences against one another, providing information about conservation of gene structure, even from an un-annotated genome. Figure 3 shows an example of using a sequence annotation with SO to infer gene structures onto an un-annotated genome.
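
The validation and retrieval points above can be illustrated with a short sketch. Everything in it is hypothetical: the rule table, the feature ids, and the helper names (kinds, validate, find) are assumptions for the example and are not taken from SO or from any annotation database.

```python
# Hypothetical rule table: (part type, whole type) pairs treated as legal.
ALLOWED_PART_OF = {
    ("exon", "transcript"),
    ("transcript", "gene"),
    ("regulatory_region", "gene"),
}

# Hypothetical isa links, child -> parent kind.
IS_A = {"non_coding_exon": "exon"}

# Hypothetical annotations: (feature_id, so_type, parent_id or None).
features = [
    ("g1", "gene", None),
    ("t1", "transcript", "g1"),
    ("e1", "exon", "t1"),
    ("e2", "non_coding_exon", "t1"),
    ("e3", "exon", "g1"),  # deliberate error: exon attached straight to a gene
]


def kinds(so_type):
    """Yield so_type followed by everything it is a kind of."""
    while so_type is not None:
        yield so_type
        so_type = IS_A.get(so_type)


def validate(features):
    """Return ids of features whose part_of link is not in the rule table."""
    type_of = {fid: so_type for fid, so_type, _ in features}
    errors = []
    for fid, so_type, parent in features:
        if parent is None:
            continue
        legal = any(
            (k, pk) in ALLOWED_PART_OF
            for k in kinds(so_type)
            for pk in kinds(type_of[parent])
        )
        if not legal:
            errors.append(fid)
    return errors


def find(features, so_type, parent):
    """Return ids of features of the given kind that are part_of `parent`."""
    return [fid for fid, t, p in features if p == parent and so_type in kinds(t)]


print(validate(features))                       # ['e3']
print(find(features, "non_coding_exon", "t1"))  # ['e2']
print(find(features, "exon", "t1"))             # ['e1', 'e2']
```

Because both checks walk the isa links, a query for 'exon' also returns the non-coding exon, and a rule written for 'exon' also covers its more specific kinds.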

Figure 3. Comparative Genomics: Using D. melanogaster annotations marked up with the Sequence Ontology to infer gene structures onto the un-annotated D. pseudoobscura genome, and then to retrieve and compare the equivalent exons and introns.

Sequence Ontology details:

Where to find SO: http://song.sourceforge.net - cvs directory

How to view SO:

SO is best viewed using OBO-Edit (by John Richter and Suzi Lewis) http://oboedit.org/

How to contact SO:

Developers mailing list: song-devel@sourceforge.net

Groups involved in SO:

BDGP, FlyBase, WormBase, MGI and the Sanger Institute. Compatibility has been established between SO and the MGED ontology.