FASTA Database Files

Introduction

Ensembl provides sequence databases of transcript and translation models predicted by the Ensembl analysis and annotation pipeline, as well as by ab initio methods. The database files, in FASTA format, are available from the corresponding 'fasta' directories on the ftp.ensembl.org site. This document describes the current naming convention and the sequence header line format used by Ensembl. Similar descriptions are also available from the README files in the FTP site directories.

Ensembl also provides a more general description of the FTP site structure. The 'fasta' directories contain the following sub-directories:

dna: assembled genomic DNA sequences
cdna: gene prediction transcripts (cDNAs, mRNAs)
pep: gene prediction translations (peptides, proteins)
rna: non-coding RNA gene predictions

To facilitate storage and download, all databases are compressed with GNU Zip (gzip, *.gz).
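Because the files are gzip-compressed, they can be read as a stream without unpacking them first. The following is a minimal sketch using only the Python standard library. The function name read_fasta_gz is an illustrative assumption and is not part of any Ensembl tool.

```python
import gzip


def read_fasta_gz(path):
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file.

    Hypothetical helper for illustration; the header is returned
    without its leading '>' and the sequence lines are joined.
    """
    header, chunks = None, []
    # "rt" opens the compressed file in text mode.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)
```

Reading record by record keeps memory use proportional to the longest single sequence rather than to the whole file, which matters for chromosome-sized entries.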

FASTA Database File Names

All files deposited in these directories obey a common naming scheme:

species.version.month.sequence type.[status].[id type].[id].fa.gz

species - This is the systematic name of the species.
version - The version number of the genome sequence assembly build, which is the central sequence version for all Ensembl sequence type exports. Although the annotation may change between assembly builds, the underlying genome sequence is altered only when a new build arrives.
month - The month of the corresponding Ensembl release represents the version of genome annotation data. Although gene-builds are generally performed on new genome sequence assemblies, the corresponding annotation may be updated between gene-builds.
sequence type
- dna - This stores unmasked genomic DNA sequences.
- dna_rm - This is the masked genomic DNA. Interspersed repeats and low complexity regions are detected with the RepeatMasker tool and masked by replacing repeats with equal numbers of 'N's.
- cdna - Transcript (cDNA) sequences resulting from Ensembl and ab initio gene predictions.
- pep - Translation (peptide) sequences resulting from Ensembl and ab initio gene predictions.
- rna - Transcript (RNA) sequences of non-coding RNA predictions.
[status] - The status of transcript (cDNA) or translation (peptide) sequence database files indicates the prediction method behind a particular sub-set. For more information about these classes of Ensembl gene and transcript predictions see the Ensembl Gene Predictions section.
- known_ccds - The Consensus Coding Sequence (CCDS) project is a collaborative effort to identify a core set of human protein coding regions that are consistently annotated and of high quality. Initial results from the CCDS project are now available through the appropriate Ensembl gene pages and from the CCDS project page at NCBI. More information is available from the Ensembl CCDS page.
- known - The sub-set of all transcripts or translations resulting from Ensembl known gene predictions. If Ensembl gene predictions can be mapped to species-specific UniProt/Swiss-Prot, RefSeq or UniProt/TrEMBL entries then Ensembl refers to them as known genes.
- novel - The sub-set of all transcripts or translations resulting from Ensembl novel gene predictions. If Ensembl gene predictions cannot be mapped to species-specific entries, but can be mapped to UniProt/Swiss-Prot, RefSeq or UniProt/TrEMBL entries from closely related species, then Ensembl refers to them as novel genes.
- pseudo - All transcripts from Ensembl pseudogene predictions.
- abinitio - The sub-set of transcripts or translations resulting from ab initio gene prediction algorithms such as the Semi-HMM-based Nucleic Acid Parser (SNAP) or GENSCAN. Generally, all ab initio predictions are based solely on the genomic sequence and do not exploit any other experimental evidence. Because not all GENSCAN or SNAP transcript or translation predictions represent biologically real molecules, these predictions should be used with care.
[id type] - The sequence identifier type specifies the genome coordinate system in use. Generally, Ensembl provides database files for all sequence-level entities that have been used to determine the genome sequences, as well as for top-level entities that have been assembled from sequence-level entities.
- Top-level entities represent the largest contiguous sequences that have been assembled and are annotated by the Ensembl analysis and annotation pipeline.
  - chromosome - The top-level coordinate system for most species in Ensembl.
  - nonchromosomal - Contains mitochondrial DNA or DNA sequences (supercontigs) that could not be assigned with confidence to a chromosome based on the current amount of sequence information.
- Sequence-level entities result from the genome sequencing (BAC clones, contigs) or result from the assembly build and specify the assembly of top-level entities.
  - scaffold - Larger sequence contigs that result from the assembly of shorter (often whole genome shotgun, WGS) sequencing reads that could not yet be assembled into chromosomes. Generally, more genome sequencing, narrowing of gaps and establishment of a tiling path are necessary before a chromosome assembly can be established.
  - chunk - While contig sequences can be assembled into larger entities, they sometimes need to be artificially broken down into so-called 'chunks'. This is due to limitations in the annotation pipeline and the finite record size imposed by the relational database management system (RDBMS) MySQL, which stores the genome sequence and annotation information.
  - clone - In general, this is the smallest sequence entity. It is often identical to the sequence of one BAC clone, or the part of its sequence in the tiling path.
[id] - The actual sequence identifier. Depending on the [id type], the [id] could represent the name of a chromosome, a scaffold, a contig, a clone, and so on.
fa - All files in these directories represent sequence database files in FASTA format.
gz - All files are compressed with GNU Zip to save storage space and speed up network transfer.
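The naming scheme above can be taken apart mechanically. The sketch below splits a file name into its fields, using the vocabularies listed in this section to decide whether the optional [status] and [id type] fields are present. It is an illustration under stated assumptions, not an Ensembl utility: the function name parse_fasta_file_name is hypothetical, the 'contig' id type is taken from the file name examples in the DNA Directories section rather than from the list above, and the species and version fields are assumed to contain no dots.

```python
# Vocabularies taken from the naming scheme described in this document.
SEQUENCE_TYPES = {"dna", "dna_rm", "cdna", "pep", "rna"}
STATUSES = {"known_ccds", "known", "novel", "pseudo", "abinitio"}
# "contig" is an assumption based on the file name examples.
ID_TYPES = {"chromosome", "nonchromosomal", "scaffold", "chunk", "clone", "contig"}


def parse_fasta_file_name(file_name):
    """Split a file name into the fields of the naming scheme (hypothetical helper)."""
    suffix = ".fa.gz"
    if not file_name.endswith(suffix):
        raise ValueError(f"not a gzip-compressed FASTA file name: {file_name!r}")
    parts = file_name[: -len(suffix)].split(".")
    if len(parts) < 4:
        raise ValueError(f"too few fields in {file_name!r}")
    species, version, month, sequence_type = parts[:4]
    if sequence_type not in SEQUENCE_TYPES:
        raise ValueError(f"unknown sequence type: {sequence_type!r}")
    fields = {
        "species": species,
        "version": version,
        "month": month,
        "sequence_type": sequence_type,
        "status": None,
        "id_type": None,
        "id": None,
    }
    # The remaining fields are optional and appear in a fixed order.
    rest = parts[4:]
    if rest and rest[0] in STATUSES:
        fields["status"] = rest.pop(0)
    if rest and rest[0] in ID_TYPES:
        fields["id_type"] = rest.pop(0)
    if rest:
        fields["id"] = ".".join(rest)
    return fields
```

For example, parse_fasta_file_name("Homo_sapiens.NCBI34.may.dna.chromosome.1.fa.gz") yields the id type 'chromosome' and the id '1', with the status left empty because DNA files carry no status field.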

FASTA Sequence Header Lines

The FASTA format sequence header lines are designed to be consistent across all types of Ensembl sequences, giving enough information for the sequence to be identified outside the context of the sequence database file.

>ID SEQTYPE:IDTYPE LOCATION

The following sequence header line is an example for the simple Ensembl cDNA header format:

>ENST00000289823 cdna:known chromosome:NCBI34:8:21922367:21927699:1 
 ^               ^    ^     ^
 ID              |    |     LOCATION 
                 |    IDTYPE 
                 SEQTYPE

Further transcript-specific metadata can be appended to the header, giving the extended format:

>ID SEQTYPE:IDTYPE LOCATION META

An example for the extended sequence header format is the following line:

>ENST00000289823 cdna:known chromosome:NCBI34:8:21922367:21927699:1 gene:ENSG00000158815:HUGO:FGF17
 ^               ^    ^     ^                                       ^ 
 ID              |    |     LOCATION                                META
                 |    IDTYPE 
                 SEQTYPE
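A header line in either format can be split on whitespace and then on colons. The following Python sketch does this for the simple and the extended format. The function name parse_header is hypothetical, and the interpretation of the six colon-separated LOCATION fields (coordinate system, assembly version, sequence name, start, end, strand) is an assumption inferred from the example above rather than something this document states.

```python
def parse_header(line):
    """Parse '>ID SEQTYPE:IDTYPE LOCATION [META ...]' (hypothetical helper)."""
    if not line.startswith(">"):
        raise ValueError("FASTA header lines start with '>'")
    fields = line[1:].split()
    if len(fields) < 3:
        raise ValueError("expected at least ID, SEQTYPE:IDTYPE and LOCATION")
    seq_id, type_field, location = fields[:3]
    seqtype, _, idtype = type_field.partition(":")
    # Assumed LOCATION layout; raises ValueError if there are not six fields.
    coord_system, assembly, name, start, end, strand = location.split(":")
    return {
        "id": seq_id,
        "seqtype": seqtype,
        "idtype": idtype,
        "location": {
            "coord_system": coord_system,
            "assembly": assembly,
            "name": name,
            "start": int(start),
            "end": int(end),
            "strand": int(strand),
        },
        # Any further whitespace-separated fields are kept verbatim as META.
        "meta": fields[3:],
    }
```

Applied to the extended example above, the sketch returns the ID ENST00000289823, the sequence type cdna, and a single META entry, gene:ENSG00000158815:HUGO:FGF17. For the simple format the META list is empty.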

DNA Directories

Top Level

These files contain the full sequence of the assembly in FASTA format. They contain one chromosome per file.

species.version.month.sequence type.id type.id.fa.gz

Examples

The genomic sequence of human chromosome 1:

Homo_sapiens.NCBI34.may.dna.chromosome.1.fa.gz

The masked version of the genome sequence on human chromosome 1 contains '_rm' in the name:

Homo_sapiens.NCBI34.may.dna_rm.chromosome.1.fa.gz

Non-chromosomal assembly sequences (e.g. mitochondrial genome, sequence contigs not yet mapped on chromosomes):

Homo_sapiens.NCBI34.may.dna.nonchromosomal.fa.gz 
Homo_sapiens.NCBI34.may.dna_rm.nonchromosomal.fa.gz

Sequence Level

These files represent dumps of the assembly at the sequence level in FASTA format.

species.version.month.sequence type.id type.fa.gz

Examples

Unmasked sequence file name examples:

Homo_sapiens.NCBI34.may.dna.contig.fa.gz 
Anopheles_gambiae.MOZ2a.may.dna.chunk.fa.gz 
Fugu_rubripes.FUGU2.may.dna.scaffold.fa.gz

Repeat masked files contain '_rm' in the file name:

Homo_sapiens.NCBI34.may.dna_rm.contig.fa.gz 
Anopheles_gambiae.MOZ2a.may.dna_rm.chunk.fa.gz 
Fugu_rubripes.FUGU2.may.dna_rm.scaffold.fa.gz

Note that the sequence 'id type' varies between species: contigs in human, chunks in Anopheles gambiae, scaffolds in Takifugu rubripes (Fugu_rubripes in the file names).
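The top-level and sequence-level DNA file names differ only in whether an id follows the id type and in whether the sequence type is 'dna' or 'dna_rm'. As a minimal sketch, the helper below composes such names from their fields. The function dna_file_name and its parameter names are hypothetical and serve only to illustrate the patterns shown above.

```python
def dna_file_name(species, version, month, id_type, seq_id=None, masked=False):
    """Compose a DNA file name following the patterns in this section.

    Hypothetical helper: seq_id is given for files that hold a single
    named sequence (e.g. one chromosome) and omitted otherwise.
    """
    # Repeat-masked files use the 'dna_rm' sequence type.
    sequence_type = "dna_rm" if masked else "dna"
    parts = [species, version, month, sequence_type, id_type]
    if seq_id is not None:
        parts.append(str(seq_id))
    return ".".join(parts) + ".fa.gz"
```

Called as dna_file_name("Homo_sapiens", "NCBI34", "may", "chromosome", 1), it reproduces the human chromosome 1 example; passing masked=True gives the corresponding '_rm' name.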

README

Each directory on ftp.ensembl.org contains an auto-generated README file, explaining the filenames and FASTA format header line conventions in use.
