Improving the specificity of high-throughput ortholog prediction
- PMID: 16729895
- PMCID: PMC1524997
- DOI: 10.1186/1471-2105-7-270
Abstract
Background: Orthologs (genes that have diverged after a speciation event) tend to have similar function, and so their prediction has become an important component of comparative genomics and genome annotation. The gold standard phylogenetic analysis approach of comparing available organismal phylogeny to gene phylogeny is not easily automated for genome-wide analysis; therefore, ortholog prediction for large genome-scale datasets is typically performed using a reciprocal-best-BLAST-hits (RBH) approach. One problem with RBH is that it will incorrectly predict a paralog as an ortholog when incomplete genome sequences or gene loss is involved. In addition, there is an increasing interest in identifying orthologs most likely to have retained similar function.
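To make the RBH failure mode concrete, the following is a minimal sketch of reciprocal-best-hit pairing, not the pipeline used in the paper. The gene names, the similarity scores, and the helper functions `best_hits` and `reciprocal_best_hits` are all hypothetical; in practice the scores would come from BLAST output.

```python
from typing import Dict, List, Tuple

# Hypothetical pairwise similarity scores (stand-ins for BLAST bit scores).
# True orthologs in this toy example: a1-b1 and a2-b2; a1/a2 and b1/b2 are paralogs.
SCORES: Dict[Tuple[str, str], float] = {
    ("a1", "b1"): 900.0,
    ("a1", "b2"): 400.0,
    ("a2", "b1"): 380.0,
    ("a2", "b2"): 850.0,
}


def best_hits(scores: Dict[Tuple[str, str], float],
              queries: List[str], targets: List[str]) -> Dict[str, str]:
    """Map each query gene to its highest-scoring target gene, if it has any hit."""
    hits: Dict[str, str] = {}
    for q in queries:
        candidates = [(scores[(q, t)], t) for t in targets if (q, t) in scores]
        if candidates:
            hits[q] = max(candidates)[1]
    return hits


def reciprocal_best_hits(scores: Dict[Tuple[str, str], float],
                         genes_a: List[str],
                         genes_b: List[str]) -> List[Tuple[str, str]]:
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    # Treat scores as symmetric so lookups work in both directions.
    symmetric = dict(scores)
    symmetric.update({(b, a): s for (a, b), s in scores.items()})
    a_to_b = best_hits(symmetric, genes_a, genes_b)
    b_to_a = best_hits(symmetric, genes_b, genes_a)
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)


# Complete genomes: both true ortholog pairs are recovered.
complete = reciprocal_best_hits(SCORES, ["a1", "a2"], ["b1", "b2"])

# Gene loss or incomplete sequence: a2 is missing from A and b1 from B,
# so the only remaining reciprocal best hit is the paralog pair a1-b2.
with_loss = reciprocal_best_hits(SCORES, ["a1"], ["b2"])
```

In this toy case `with_loss` contains only the pair `("a1", "b2")`, which RBH would report as orthologs even though the two genes are paralogs.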
Results: To address these issues, we present here a high-throughput computational method named Ortholuge that further evaluates previously predicted orthologs (including those predicted using an RBH-based approach), identifying which orthologs most closely reflect species divergence and may more likely have similar function. Ortholuge analyzes phylogenetic distance ratios involving two comparison species and an outgroup species, noting cases where relative gene divergence is atypical. It also identifies some cases of gene duplication after species divergence. Through simulations of incomplete genome data/gene loss, we show that the vast majority of genes falsely predicted as orthologs by an RBH-based method can be identified. Ortholuge was then used to estimate the number of false-positives (predominantly paralogs) in selected RBH-predicted ortholog datasets, identifying approximately 10% paralogs in a eukaryotic dataset (mouse-rat comparison) and 5% in a bacterial dataset (Pseudomonas putida - Pseudomonas syringae species comparison). Higher quality (more precise) datasets of orthologs, which we term "ssd-orthologs" (supporting-species-divergence-orthologs), were also constructed. These datasets, as well as Ortholuge software that may be used to characterize other species' datasets, are available at http://www.pathogenomics.ca/ortholuge/ (software under GNU General Public License).
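The idea of a distance ratio over two comparison species and an outgroup can be illustrated with a short sketch. This is not the Ortholuge implementation: the abstract does not give the ratio definitions or cutoffs, so the two ratios below, the `Triple` record, the `cutoff` parameter, and all distance values are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Triple:
    """Hypothetical record: pairwise phylogenetic distances for one predicted ortholog group."""
    name: str
    d_in1_in2: float  # distance between the two comparison-species genes
    d_in1_out: float  # distance from the species-1 gene to the outgroup gene
    d_in2_out: float  # distance from the species-2 gene to the outgroup gene


def distance_ratios(t: Triple) -> Tuple[float, float]:
    """Ingroup distance relative to each ingroup-outgroup distance (an assumed definition)."""
    if t.d_in1_out <= 0 or t.d_in2_out <= 0:
        raise ValueError(f"non-positive outgroup distance for {t.name}")
    return (t.d_in1_in2 / t.d_in1_out, t.d_in1_in2 / t.d_in2_out)


def flag_atypical(triples: List[Triple], cutoff: float) -> List[str]:
    """Names of groups whose larger ratio exceeds a user-chosen cutoff (assumed rule)."""
    return [t.name for t in triples if max(distance_ratios(t)) > cutoff]


# Invented distances. For geneA and geneB the comparison-species genes are much
# closer to each other than to the outgroup, as expected if they diverged at
# speciation. For geneC the ingroup distance is about as large as the outgroup
# distances, which is the kind of atypical relative divergence a paralog would show.
TRIPLES = [
    Triple("geneA", 0.10, 0.40, 0.42),
    Triple("geneB", 0.12, 0.45, 0.44),
    Triple("geneC", 0.50, 0.48, 0.52),
]

flagged = flag_atypical(TRIPLES, cutoff=0.6)
```

With these invented values only `geneC` is flagged. A fixed cutoff is the simplest possible rule; a real analysis would more plausibly compare each ratio against the distribution of ratios across the whole dataset.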
Conclusion: The Ortholuge method reported here appears to significantly improve the specificity (precision) of high-throughput ortholog prediction for both bacterial and eukaryotic species. This method, and its associated software, will aid those performing various comparative genomics-based analyses, such as the prediction of conserved regulatory elements upstream of orthologous genes.
Figures
![Figure 1](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/1524997/bin/1471-2105-7-270-1.gif)
![Figure 2](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/1524997/bin/1471-2105-7-270-2.gif)
![Figure 3](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/1524997/bin/1471-2105-7-270-3.gif)
![Figure 4](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/1524997/bin/1471-2105-7-270-4.gif)
![Figure 5](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/1524997/bin/1471-2105-7-270-5.gif)
![Figure 6](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/1524997/bin/1471-2105-7-270-6.gif)
![Figure 7](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/1524997/bin/1471-2105-7-270-7.gif)
![Figure 8](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/1524997/bin/1471-2105-7-270-8.gif)
Similar articles
- OrtholugeDB: a bacterial and archaeal orthology resource for improved comparative genomic analysis. Nucleic Acids Res. 2013 Jan;41(Database issue):D366-76. doi: 10.1093/nar/gks1241. Epub 2012 Nov 29. PMID: 23203876. Free PMC article.
- OrthoFocus: program for identification of orthologs in multiple genomes in family-focused studies. J Bioinform Comput Biol. 2008 Aug;6(4):811-24. doi: 10.1142/s0219720008003692. PMID: 18763744.
- Choosing BLAST options for better detection of orthologs as reciprocal best hits. Bioinformatics. 2008 Feb 1;24(3):319-24. doi: 10.1093/bioinformatics/btm585. Epub 2007 Nov 26. PMID: 18042555.
- Homology assessment and molecular sequence alignment. J Biomed Inform. 2006 Feb;39(1):18-33. doi: 10.1016/j.jbi.2005.11.005. Epub 2005 Dec 9. PMID: 16380300. Review.
- Cross-species sequence comparisons: a review of methods and available resources. Genome Res. 2003 Jan;13(1):1-12. doi: 10.1101/gr.222003. PMID: 12529301. Free PMC article. Review.
Cited by
- OrthoRefine: automated enhancement of prior ortholog identification via synteny. BMC Bioinformatics. 2024 Apr 25;25(1):163. doi: 10.1186/s12859-024-05786-7. PMID: 38664637. Free PMC article.
- Elucidating the Mesocarp Drupe Transcriptome of Açai (Euterpe oleracea Mart.): An Amazonian Tree Palm Producer of Bioactive Compounds. Int J Mol Sci. 2023 May 26;24(11):9315. doi: 10.3390/ijms24119315. PMID: 37298279. Free PMC article.
- A Mycobacterial Systems Resource for the Research Community. mBio. 2021 Mar 2;12(2):e02401-20. doi: 10.1128/mBio.02401-20. PMID: 33653882. Free PMC article.
- Genome-Scale Mapping Reveals Complex Regulatory Activities of RpoN in Yersinia pseudotuberculosis. mSystems. 2020 Nov 10;5(6):e01006-20. doi: 10.1128/mSystems.01006-20. PMID: 33172972. Free PMC article.
- Comparing time series transcriptome data between plants using a network module finding algorithm. Plant Methods. 2019 Jun 1;15:61. doi: 10.1186/s13007-019-0440-x. eCollection 2019. PMID: 31164912. Free PMC article.