PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions
- PMID: 21685081
- PMCID: PMC3117341
- DOI: 10.1093/bioinformatics/btr209
Abstract
Motivation: As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multispecies nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models.
Results: We show that PhyloCSF's classification performance in 12-species Drosophila genome alignments exceeds that of all other methods we compared in a previous study. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE, and as interest grows in long non-coding RNAs, which are often initially recognized by their lack of protein-coding potential rather than by conserved RNA secondary structures.
Availability and implementation: The Objective Caml source code and executables for GNU/Linux and Mac OS X are freely available at http://compbio.mit.edu/PhyloCSF
Contact: mlin@mit.edu; manoli@mit.edu
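The abstract describes the method as a formal statistical comparison of phylogenetic codon models. As a rough illustration of what such a comparison can look like, the Python sketch below scores a region with a log-likelihood ratio between two competing models. It is a toy under stated assumptions, not the PhyloCSF implementation: the function names `llr_score` and `classify`, the zero threshold, the assumption that codon sites are independent, and the per-site log-likelihood values are all hypothetical, and the phylogenetic likelihood computation itself is not shown.

```python
import math
from typing import Sequence


def llr_score(loglik_coding: Sequence[float],
              loglik_noncoding: Sequence[float]) -> float:
    """Log-likelihood ratio of a 'coding' model versus a 'non-coding' model.

    Each input holds one natural-log likelihood per codon site of the
    alignment (hypothetical values; computing them from a tree and a
    substitution model is outside this sketch). Sites are assumed
    independent, so per-site log-likelihoods are summed.
    """
    if len(loglik_coding) != len(loglik_noncoding):
        raise ValueError("both models must be evaluated on the same sites")
    return math.fsum(loglik_coding) - math.fsum(loglik_noncoding)


def classify(score: float, threshold: float = 0.0) -> str:
    """Label a region from its score; the threshold is an illustrative choice."""
    return "coding" if score > threshold else "non-coding"


# Hypothetical per-site log-likelihoods for a three-codon alignment.
coding_site_logliks = [-2.0, -1.5, -3.0]
noncoding_site_logliks = [-2.5, -2.5, -3.5]

score = llr_score(coding_site_logliks, noncoding_site_logliks)
print(score, classify(score))
```

A positive score means the data are more probable under the coding model than under the non-coding model. In practice the decision threshold would be calibrated on regions of known status rather than fixed at zero.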
Figures
![Fig. 1.](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/3117341/bin/btr209f1.gif)
![Fig. 2.](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/3117341/bin/btr209f2.gif)
![Fig. 3.](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/3117341/bin/btr209f3.gif)
![Fig. 4.](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/3117341/bin/btr209f4.gif)