Bioinformatics. 2011 Jul 1;27(13):i275-82.
doi: 10.1093/bioinformatics/btr209.

PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions

Michael F Lin et al. Bioinformatics.

Abstract

Motivation: As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multispecies nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models.

Results: We show that PhyloCSF's classification performance in 12-species Drosophila genome alignments exceeds all other methods we compared in a previous study. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE, and as interest grows in long non-coding RNAs, often initially recognized by their lack of protein coding potential rather than conserved RNA secondary structures.

Availability and implementation: The Objective Caml source code and executables for GNU/Linux and Mac OS X are freely available at http://compbio.mit.edu/PhyloCSF

Contact: mlin@mit.edu; manoli@mit.edu

Figures

Fig. 1.
PhyloCSF method overview. (A) PhyloCSF uses phylogenetic codon models estimated from genome-wide training data based on known coding and non-coding regions. These models include a phylogenetic tree and codon substitution rate matrices Q_C and Q_N for coding and non-coding regions, respectively, shown here for 12 Drosophila species. Q_C captures the characteristic evolutionary signatures of codon substitutions in conserved coding regions, while Q_N captures the typical evolutionary rates of triplet sites in non-coding regions. (B) PhyloCSF applied to a short region from the first exon of the D.melanogaster homeobox gene Dfd. The alignment of this region shows only synonymous substitutions compared with the inferred ancestral sequence (green). Using the maximum likelihood estimate of a scale factor ρ applied to the assumed branch lengths, the alignment has higher probability under the coding model than the non-coding model, resulting in a positive log-likelihood ratio Λ. (C) PhyloCSF applied to a conserved region within a Dfd intron. In contrast to the exonic alignment, this region shows many non-synonymous substitutions (red), nonsense substitutions (blue, purple) and frameshifts (orange). The alignment has lower probability under the coding model, resulting in a negative score.
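The likelihood comparison described in panels B and C can be illustrated with a toy sketch. This is not the PhyloCSF implementation. It assumes a pairwise alignment instead of a 12-species tree, hand-picked symmetric rate matrices instead of trained coding and non-coding matrices, a uniform codon equilibrium distribution, a hypothetical branch length of 0.1, and a coarse grid search over the scale factor ρ in place of a true maximum likelihood fit. All function names, rate values and example sequences are invented for illustration.

```python
import math
from itertools import product
from operator import mul

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
AA = dict(zip(CODONS, AMINO))
INDEX = {c: i for i, c in enumerate(CODONS)}


def rate_matrix(syn, nonsyn, stop):
    """Toy 64x64 rate matrix; only single-nucleotide changes get a nonzero rate."""
    n = len(CODONS)
    q = [[0.0] * n for _ in range(n)]
    for a in CODONS:
        for b in CODONS:
            if sum(x != y for x, y in zip(a, b)) != 1:
                continue
            if "*" in (AA[a], AA[b]):
                rate = stop          # change to or from a stop codon
            elif AA[a] == AA[b]:
                rate = syn           # synonymous change
            else:
                rate = nonsyn        # non-synonymous change
            q[INDEX[a]][INDEX[b]] = rate
    for i in range(n):
        q[i][i] = -sum(q[i])         # rows of a rate matrix sum to zero
    return q


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(map(mul, row, col)) for col in cols] for row in a]


def transition_matrices(q, tau0, steps, terms=10):
    """Return [P(tau0), P(2*tau0), ..., P(steps*tau0)], where P(tau) = exp(Q*tau).

    P(tau0) comes from a truncated Taylor series (adequate because Q*tau0 is
    small here); the remaining matrices are successive products with P(tau0).
    """
    n = len(q)
    ident = [[float(i == j) for j in range(n)] for i in range(n)]
    qt = [[x * tau0 for x in row] for row in q]
    p0, term = ident, ident
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, qt)]
        p0 = [[x + y for x, y in zip(r, s)] for r, s in zip(p0, term)]
    out = [p0]
    for _ in range(steps - 1):
        out.append(matmul(out[-1], p0))
    return out


T = 0.1           # assumed branch length (hypothetical)
RHO_STEP = 0.25   # grid over the scale factor: rho = 0.25, 0.5, ..., 4.0
RHO_STEPS = 16
# Assumed toy rates: the coding model penalizes non-synonymous and nonsense changes.
P_CODING = transition_matrices(rate_matrix(1.0, 0.2, 0.01), RHO_STEP * T, RHO_STEPS)
P_NONCODING = transition_matrices(rate_matrix(0.5, 0.5, 0.5), RHO_STEP * T, RHO_STEPS)


def max_log_likelihood(seq1, seq2, matrices):
    """Best log-likelihood of an ungapped pairwise codon alignment over the rho grid."""
    cols = [(INDEX[seq1[i:i + 3]], INDEX[seq2[i:i + 3]])
            for i in range(0, len(seq1), 3)]
    # Uniform equilibrium frequency 1/64; it cancels in the ratio below.
    return max(sum(math.log(p[a][b] / 64.0) for a, b in cols) for p in matrices)


def log_likelihood_ratio(seq1, seq2):
    return (max_log_likelihood(seq1, seq2, P_CODING)
            - max_log_likelihood(seq1, seq2, P_NONCODING))


REF = "ATG GAT GCT AAA CTG GGT TTC GAA CCG TCT AAC GTT".replace(" ", "")
# Five synonymous changes relative to REF.
EXON_LIKE = "ATG GAC GCC AAA CTC GGA TTC GAA CCA TCT AAC GTT".replace(" ", "")
# Four non-synonymous changes and one nonsense change (AAA -> TAA).
INTRON_LIKE = "ATA GGT GCT TAA CTG GGT TCC GCA CCG TCT AAC GTT".replace(" ", "")

lam_exon = log_likelihood_ratio(REF, EXON_LIKE)
lam_intron = log_likelihood_ratio(REF, INTRON_LIKE)
print("exon-like:", lam_exon, "intron-like:", lam_intron)
```

Under these assumed rates, the alignment with only synonymous changes scores positive and the alignment with non-synonymous and nonsense changes scores negative, mirroring the sign pattern in panels B and C. Because the scale factor is fitted separately for each model, the overall magnitude of each toy rate matrix does not matter, only the relative rates within it. Frameshifts and gaps are not modeled in this sketch.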
Fig. 2.
PhyloCSF performance benchmarks. ROC plots and error measures for several methods to distinguish known protein-coding and randomly selected non-coding regions in D.melanogaster. The top row of plots shows the results for our full dataset of ~50 000 regions matching the fly exon length distribution, while the bottom row of plots is based on the 37% of these regions between 30 and 180 nt in length. The left-hand plots show the performance of the methods applied to multiple alignments of 12 fly genomes, while the right-hand plots use pairwise alignments between D.melanogaster and D.ananassae. PhyloCSF effectively dominates the other methods.
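As a reminder of how ROC curves of this kind are built, the minimal sketch below sweeps a threshold over classifier scores and integrates the resulting curve with the trapezoid rule. The scores and labels are made-up toy values, not data from the paper, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points.

    labels are 1 for coding and 0 for non-coding; higher scores mean
    "more likely coding". Tied scores are handled as a single threshold.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Consume every example that shares the current score.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy log-likelihood-ratio-style scores (invented for illustration).
toy_scores = [3.2, 1.5, 0.4, -0.7, -2.1, -3.0]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auc = auc(roc_points(toy_scores, toy_labels))
print(toy_auc)
```

An area of 1.0 corresponds to perfect separation of coding from non-coding regions, and 0.5 to a score that carries no information.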
Fig. 3.
Exploiting non-independence of codon sites. The PhyloCSF log-likelihood ratio Λ is transformed based on alignment length into a new log-likelihood ratio score Ψ (see main text). Ψ provides a superior discriminant in the full 12 fly dataset.
Fig. 4.
A novel human coding gene found using mRNA-Seq and PhyloCSF. Transcriptome reconstruction by Scripture (Guttman et al., 2010) based on brain mRNA-Seq data provided by Illumina, Inc. produced two alternative transcript models lying antisense to an intron of GTF2E2, a known protein-coding gene. PhyloCSF identified a 95-codon ORF in the third exon of this transcript, highly conserved across placental mammals. The color schematic illustrates the genome alignment of 29 placental mammals for this ORF, indicating conservation (white), synonymous and conservative codon substitutions (green), other non-synonymous codon substitutions (red), stop codons (blue/magenta/yellow) and frame-shifted regions (orange). Despite its unmistakable protein-coding evolutionary signatures, the ORF's translation shows no sequence similarity to known proteins.
