Ensembl Gene Set


Background

Ensembl is a system that provides automated genome annotation and subsequent visualisation of annotated genomes. The Ensembl analysis and annotation pipeline is based on a set of heuristic rules that a human annotator would use (Curwen et al., 2004). Because genomes are now sequenced on an industrial scale, labour-intensive manual curation can no longer keep pace with the amount of information generated. The Ensembl genome annotation pipeline was therefore conceived to annotate genome sequences in a timely fashion.

All Ensembl gene predictions are based on experimental evidence, which is imported from manually curated UniProt/Swiss-Prot, partially manually curated NCBI RefSeq and automatically annotated UniProt/TrEMBL records. Untranslated regions (UTRs) are annotated to the extent supported by EMBL mRNA records. As there is no guarantee that UTR sequences in EMBL records are complete, there is likewise no guarantee that the Ensembl genome analysis and annotation pipeline has enough biological evidence to predict complete UTRs. Promoter regions are currently not annotated by Ensembl, as the set of well-characterised promoters is still small and no algorithm currently yields reliable results on a genomic scale.

Sources of Biological Evidence

EMBL is the European part of the International Nucleotide Sequence Database Collaboration (INSD) and is maintained at the EBI. All records are synchronised with GenBank at the NCBI (North America) and the DNA Data Bank of Japan (DDBJ).
UniProt/Swiss-Prot is a manually curated database, so both protein sequences and annotation are assumed to be of the highest quality and accuracy.
RefSeq is a partially manually curated dataset maintained at the NCBI. RefSeq accession numbers prefixed with 'NM_' are curated mRNAs, while 'NP_' accession numbers represent curated proteins. Accession numbers prefixed with 'XM_' and 'XP_' are predicted mRNAs and predicted proteins, respectively. Since Ensembl bases its predictions solely on experimental evidence, neither predicted RefSeq dataset is included in the Ensembl database.
UniProt/TrEMBL is automatically annotated and contains the translations of EMBL/GenBank/DDBJ coding sequence (CDS) features.
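
The RefSeq prefix convention described above can be illustrated with a minimal Python sketch. This is not part of the Ensembl pipeline: the function names and the accession strings used below are hypothetical, and only the four prefixes named in the text are handled.

```python
# Prefix meanings as listed in the text above. The names REFSEQ_PREFIXES,
# classify_refseq and curated_only are illustrative, not Ensembl code.
REFSEQ_PREFIXES = {
    "NM_": ("mRNA", "curated"),
    "NP_": ("protein", "curated"),
    "XM_": ("mRNA", "predicted"),
    "XP_": ("protein", "predicted"),
}


def classify_refseq(accession):
    """Return (molecule, status) for a RefSeq accession.

    Returns None when the prefix is not one of the four listed above.
    """
    return REFSEQ_PREFIXES.get(accession[:3].upper())


def curated_only(accessions):
    """Keep only accessions whose prefix marks them as curated."""
    kept = []
    for accession in accessions:
        info = classify_refseq(accession)
        if info is not None and info[1] == "curated":
            kept.append(accession)
    return kept
```

Given a list of placeholder accessions such as 'NM_000001', 'XM_000002', 'NP_000003' and 'XP_000004', curated_only would keep only the 'NM_' and 'NP_' entries, mirroring the exclusion of predicted RefSeq records described above.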

Gene-Build Procedure

Ensembl gene builds are complex and involve two main steps (Curwen et al., 2004). The initial targeted build aligns all species-specific protein and mRNA information to the genome sequence. An additional similarity build is based on information from closely related species and aims to broaden the spectrum of transcript predictions. This second step is especially important for less well-studied organisms, for which much less direct, species-specific protein and mRNA evidence is available.

External References Mapping Procedure

Naming of transcripts occurs at a later step, after the gene build is completed. If the transcript or protein models can be mapped to species-specific UniProt/Swiss-Prot, RefSeq or UniProt/TrEMBL entries, Ensembl refers to them as known genes. If not (e.g. genes predicted on the basis of evidence from closely related species), they are called novel genes.

The distinction thus reflects whether the evidence comes directly from a species-specific entry in a manually curated database (UniProt/Swiss-Prot), a partially manually curated database (RefSeq) or an automatically annotated database (UniProt/TrEMBL), or whether the gene model is inferred from a closely related species. Consequently, known genes dominate in all established model organisms, while less well-studied organisms show a significantly larger fraction of novel genes. In both cases, the Ensembl gene predictions are based on experimental evidence.
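
The known/novel rule described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Ensembl mapping code: the classes ExternalMatch and TranscriptModel, the function gene_status and the species names in the usage note are all hypothetical.

```python
from dataclasses import dataclass, field

# The three external sources named in the text above.
MAPPING_SOURCES = {"UniProt/Swiss-Prot", "RefSeq", "UniProt/TrEMBL"}


@dataclass
class ExternalMatch:
    """One external database entry that a model maps to (hypothetical type)."""
    source: str
    species: str


@dataclass
class TranscriptModel:
    """A predicted transcript or protein model (hypothetical type)."""
    species: str
    matches: list = field(default_factory=list)


def gene_status(model):
    """Return 'known' if the model maps to a species-specific entry in one
    of the three sources, otherwise 'novel'."""
    for match in model.matches:
        if match.source in MAPPING_SOURCES and match.species == model.species:
            return "known"
    return "novel"
```

For example, a model for 'Ashbya gossypii' whose only match is a UniProt/Swiss-Prot entry from 'Saccharomyces cerevisiae' would be reported as novel, whereas a match to an 'Ashbya gossypii' UniProt/TrEMBL entry would make it known.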

Supporting Evidence

All sequence records that the Ensembl analysis and annotation pipeline used to annotate a particular transcript model are available on a per-exon basis in the 'Supporting Evidence' section of the corresponding 'ExonView' pages. These pages are linked from 'GeneView', 'TransView' and 'ProteinView' pages via the [Exon Information] links.

While Ensembl is a browser providing automatically annotated genomes, the Vertebrate Genome Annotation Browser (Vega) is its counterpart for manually curated genome annotation. Since manual curation is very labour-intensive, it is currently limited to certain chromosomes of certain species.

References

Val Curwen, Eduardo Eyras, T. Daniel Andrews, Laura Clarke, Emmanuel Mongin, Steven M.J. Searle, and Michele Clamp
The Ensembl Automatic Gene Annotation System
Genome Res. 2004 May; 14(5):942-950.

The above research article was published in an 'Ensembl Special' and provides detailed information on the Ensembl gene-building algorithm. Additional references can be found in the Ensembl scientific publications document.

Ashbya Genome Database .