#### README ####

IMPORTANT: Please note you can download subsets of data via the BioMart
data mining tool. See https://www.ensembl.org/info/data/biomart/ for more
information.

##################
Fasta cDNA dumps
#################

These files hold the cDNA sequences corresponding to Ensembl genes,
excluding ncRNA genes, which are in a separate 'ncrna' Fasta file. cDNA
consists of transcript sequences for actual and possible genes, including
pseudogenes, NMD and the like. See the file names explanation below for
different subsets of both known and predicted transcripts.

------------
FILE NAMES
------------

The files are consistently named following this pattern:

  <species>.<assembly>.<sequence type>.<status>.fa.gz

<species>: The systematic name of the species.
<assembly>: The assembly build name.
<sequence type>: cdna for cDNA sequences
<status>
  * 'cdna.all' - all transcripts of Ensembl genes, excluding ncRNA.
  * 'cdna.abinitio' - transcripts resulting from 'ab initio' gene
    prediction algorithms such as SNAP and GENSCAN. In general all
    'ab initio' predictions are solely based on the genomic sequence and
    do not use other experimental evidence. Therefore, not all GENSCAN
    or SNAP cDNA predictions represent biologically real cDNAs.
    Consequently, these predictions should be used with care.

EXAMPLES  (Note: Not all species have 'cdna.abinitio' data)
  for Human:
    Homo_sapiens.NCBI36.cdna.all.fa.gz
      cDNA sequences for all transcripts
    Homo_sapiens.NCBI36.cdna.abinitio.fa.gz
      cDNA sequences for 'ab initio' prediction transcripts.

------------------------------
FASTA Sequence Header Lines
------------------------------

The FASTA sequence header lines are designed to be consistent across
all types of Ensembl FASTA sequences.

Stable IDs for genes and transcripts are suffixed with a version if they
have been generated by Ensembl (this is typical for vertebrate species,
but not for non-vertebrates). All ab initio data is unversioned.
General format:

>TRANSCRIPT_ID SEQTYPE LOCATION GENE_ID GENE_BIOTYPE TRANSCRIPT_BIOTYPE

Example of an Ensembl cDNA header:

>ENST00000289823.1 cdna chromosome:NCBI35:8:21922367:21927699:1 gene:ENSG00000158815.1 gene_biotype:protein_coding transcript_biotype:protein_coding
 ^                 ^    ^                                       ^                      ^                           ^
 TRANSCRIPT_ID     |    LOCATION                                GENE_ID                GENE_BIOTYPE                TRANSCRIPT_BIOTYPE
                SEQTYPE
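
As an illustration of the header layout above, the following Python sketch
splits one header line into its fields. It is not part of the Ensembl
tooling: the function name parse_cdna_header and the dictionary keys are
hypothetical, and it assumes the fields are separated by whitespace and
that GENE_ID, GENE_BIOTYPE and TRANSCRIPT_BIOTYPE carry the 'gene:',
'gene_biotype:' and 'transcript_biotype:' prefixes seen in the example.
Any other text on the line is ignored.

```python
# Prefixes of the key:value fields shown in the example header.
KNOWN_KEYS = ("gene", "gene_biotype", "transcript_biotype")


def parse_cdna_header(line):
    """Split one cDNA FASTA header line into a dict of named fields.

    Hypothetical helper: assumes whitespace-separated fields in the order
    TRANSCRIPT_ID SEQTYPE LOCATION, followed by key:value fields.
    """
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line: %r" % line)
    tokens = line[1:].split()
    if len(tokens) < 3:
        raise ValueError("expected TRANSCRIPT_ID SEQTYPE LOCATION")

    record = {
        "transcript_id": tokens[0],  # versioned or unversioned stable ID
        "seqtype": tokens[1],        # 'cdna' in these files
        "location": tokens[2],       # kept as one unsplit string
    }
    # Pick up only the key:value fields named in the general format.
    for token in tokens[3:]:
        key, sep, value = token.partition(":")
        if sep and key in KNOWN_KEYS:
            record[key] = value
    return record


header = (
    ">ENST00000289823.1 cdna chromosome:NCBI35:8:21922367:21927699:1 "
    "gene:ENSG00000158815.1 gene_biotype:protein_coding "
    "transcript_biotype:protein_coding"
)
record = parse_cdna_header(header)
print(record["transcript_id"])  # ENST00000289823.1
print(record["gene"])           # ENSG00000158815.1
```

The LOCATION field is deliberately left as a single string, since this
README does not define its internal layout.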
