BMC Bioinformatics. 2006 May 28;7:270.
doi: 10.1186/1471-2105-7-270.

Improving the specificity of high-throughput ortholog prediction

Debra L Fulton et al. BMC Bioinformatics. 2006.

Abstract

Background: Orthologs (genes that have diverged after a speciation event) tend to have similar function, and so their prediction has become an important component of comparative genomics and genome annotation. The gold standard phylogenetic analysis approach of comparing available organismal phylogeny to gene phylogeny is not easily automated for genome-wide analysis; therefore, ortholog prediction for large genome-scale datasets is typically performed using a reciprocal-best-BLAST-hits (RBH) approach. One problem with RBH is that it will incorrectly predict a paralog as an ortholog when incomplete genome sequences or gene loss is involved. In addition, there is an increasing interest in identifying orthologs most likely to have retained similar function.
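The RBH idea and its failure mode on incomplete data can be illustrated with a minimal sketch. The input format (a dictionary of pairwise similarity scores), the gene names, and the function names are hypothetical and chosen only for illustration; a real pipeline would parse BLAST output instead.

```python
from typing import Dict, Set, Tuple

# Hypothetical input format: (query_gene, subject_gene) -> similarity score.
Scores = Dict[Tuple[str, str], float]


def best_hits(scores: Scores) -> Dict[str, str]:
    """Map each query gene to its highest-scoring subject gene."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Scores, b_vs_a: Scores) -> Set[Tuple[str, str]]:
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return {(a, b) for a, b in a_best.items() if b_best.get(b) == a}


# Toy data (invented): cowA is the true ortholog of humA, cowA2 is a paralog.
human_vs_cow = {("humA", "cowA"): 500.0, ("humA", "cowA2"): 300.0}
cow_vs_human = {("cowA", "humA"): 500.0, ("cowA2", "humA"): 300.0}
complete = reciprocal_best_hits(human_vs_cow, cow_vs_human)

# Simulate an incomplete genome by dropping every hit that involves cowA.
human_vs_cow_partial = {k: v for k, v in human_vs_cow.items() if k[1] != "cowA"}
cow_vs_human_partial = {k: v for k, v in cow_vs_human.items() if k[0] != "cowA"}
incomplete = reciprocal_best_hits(human_vs_cow_partial, cow_vs_human_partial)

print(complete)    # pairing found with the full gene set
print(incomplete)  # pairing found once the true ortholog is missing
```

With the full toy gene set the true ortholog is paired; once it is removed, the paralog becomes the reciprocal best hit, which is the kind of false positive the Background describes.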

Results: To address these issues, we present here a high-throughput computational method named Ortholuge that further evaluates previously predicted orthologs (including those predicted using an RBH-based approach) - identifying which orthologs most closely reflect species divergence and may more likely have similar function. Ortholuge analyzes phylogenetic distance ratios involving two comparison species and an outgroup species, noting cases where relative gene divergence is atypical. It also identifies some cases of gene duplication after species divergence. Through simulations of incomplete genome data/gene loss, we show that the vast majority of genes falsely predicted as orthologs by an RBH-based method can be identified. Ortholuge was then used to estimate the number of false-positives (predominantly paralogs) in selected RBH-predicted ortholog datasets, identifying approximately 10% paralogs in a eukaryotic data set (mouse-rat comparison) and 5% in a bacterial data set (Pseudomonas putida - Pseudomonas syringae species comparison). Higher quality (more precise) datasets of orthologs, which we term "ssd-orthologs" (supporting-species-divergence-orthologs), were also constructed. These datasets, as well as Ortholuge software that may be used to characterize other species' datasets, are available at http://www.pathogenomics.ca/ortholuge/ (software under GNU General Public License).

Conclusion: The Ortholuge method reported here appears to significantly improve the specificity (precision) of high-throughput ortholog prediction for both bacterial and eukaryotic species. This method, and its associated software, will aid those performing various comparative genomics-based analyses, such as the prediction of conserved regulatory elements upstream of orthologous genes.

Figures

Figure 1
An example of how RBH analysis may falsely identify a paralog as an ortholog. Illustrated is a hypothetical species tree and gene tree for the human, cattle, and mouse species, where human and cattle orthologs (unshaded genes) are being identified. If the true cattle ortholog has not yet been sequenced because of an incomplete bovine genome project, it will not be present in the gene dataset used for analysis (cattle gene crossed out with an X), and the best reciprocal BLAST hit for the human gene will be a cattle paralog (shaded gene). However, Ortholuge will detect this case as a potential paralog, because it examines the relative phylogenetic distance between genes and identifies how well their relative distances match expected species divergence.
Figure 2
An overview of the Ortholuge method. (A) Flow-chart outlining the main steps of the method. (B) The three ratios computed by Ortholuge. The phylogenetic distances in the numerator (dark line) and denominator (dashed line) for each ratio are shown, overlaid on the phylogenetic tree (gray line) that relates the ingroups and outgroup. Note that the three ratios are related such that Ratio2 = Ratio1 × Ratio3. Therefore, ratio data are presented both as frequency histograms for all three ratios (see Fig. 4) and as Ratio1 × Ratio2 plots (see Fig. 5) for just two of the three ratios; the latter is simply another convenient way to visualize the data.
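The caption states the identity Ratio2 = Ratio1 × Ratio3 but does not spell out the formulas on this page. The sketch below assumes Ratio1 = d(ingroup1, ingroup2) / d(ingroup1, outgroup), Ratio2 = d(ingroup1, ingroup2) / d(ingroup2, outgroup), and Ratio3 = d(ingroup1, outgroup) / d(ingroup2, outgroup). These definitions are an assumption chosen to satisfy the stated identity; the function name and the example distances are likewise hypothetical.

```python
import math
from typing import NamedTuple


class Ratios(NamedTuple):
    ratio1: float
    ratio2: float
    ratio3: float


def distance_ratios(d_in1_in2: float, d_in1_out: float, d_in2_out: float) -> Ratios:
    """Compute three distance ratios from pairwise phylogenetic distances.

    The formulas are assumed definitions (see the lead-in), not taken
    verbatim from the source.
    """
    if d_in1_out <= 0 or d_in2_out <= 0:
        raise ValueError("ingroup-outgroup distances must be positive")
    return Ratios(
        ratio1=d_in1_in2 / d_in1_out,
        ratio2=d_in1_in2 / d_in2_out,
        ratio3=d_in1_out / d_in2_out,
    )


# Invented distances: ingroups close to each other, both far from the outgroup.
r = distance_ratios(d_in1_in2=0.1, d_in1_out=0.4, d_in2_out=0.5)

# Under the assumed definitions, the identity from the caption holds.
identity_holds = math.isclose(r.ratio2, r.ratio1 * r.ratio3)
print(r, identity_holds)
```

Under these assumed definitions, closely related ingroups with a distant outgroup give Ratio1 and Ratio2 well below 1 and Ratio3 near 1, while an ingroup paralog inflates the ingroup-ingroup distance and pushes Ratio1 and Ratio2 upward.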
Figure 3
Ratio 1 (R1) ratio distribution curves for selected alignment characteristics. Higher quality mouse-rat-human ortholog sequence sets were analyzed to devise the gap-masking and sequence trimming approaches. These methods were evaluated for the introduction of ratio distribution biases for selected alignment characteristics such as identity and gap length. Ratio distribution curves were plotted for several characteristics. No obvious bias was observed through the introduction of our gap masking approach or alignment trimming.
Figure 4
Histogram illustrating the distribution of RBH-predicted (i.e. putative) orthologous groups across the three Ortholuge distance ratios. The results for predicted mouse-rat-human RBH ortholog sets (EGO RBH data set; 19,200 ortholog groups) are shown. Each of the three ratios forms its own distribution: Ratio1 and Ratio2 are generally located at ratio values lower than 1, and Ratio3 is generally located around a ratio value of 1, reflecting the relative distances between ingroups and between each ingroup and the outgroup. A similar ratio analysis was performed on a RefSeq RBH dataset (see Figure 3 of [Additional file 1]).
Figure 5
Ortholuge R1 × R2 plots (Ratio1 versus Ratio2) for selected eukaryotic data, where each point represents one putative ortholog group. (A) Putative orthologous groups identified using RBH for mouse-rat-human (Figure 4 shows the corresponding histogram). (B) Putative orthologous groups for mouse-rat-human from a higher quality (more precise) dataset (see Methods). It is expected that this more precise data set comprises primarily true orthologs. (C) A lower quality data set of RBH-predicted orthologous groups for cattle-human-mouse, where cattle genes have been identified from an incomplete genome sequence. (D), (E), (F) are zoomed-in versions of (A), (B), (C), respectively, with axes shown from 0 to 2 instead of 0 to 30. Note that most orthologous groups exhibit low Ratio1 and Ratio2 values in all three data sets. For example, in panels A and D, about 86% of orthologs have Ratio1 and Ratio2 values less than 1. However, the higher quality data set (panels B and E) contains fewer points at higher ratio values than the RBH-predicted data set. The lower quality data set contains more points with very high Ratio2 values (i.e. only 73% of points have Ratio1 and Ratio2 values less than 1), potentially reflecting the increased occurrence of probable cattle paralogs (i.e. paralogs being misidentified as orthologs by an RBH analysis with an incomplete cattle genome).
Figure 6
Ortholuge R1 × R2 plots for the prokaryotic data, illustrating two ortholog data sets and a true-negative data set. (A) Putative orthologous groups from an RBH-predicted data set. (B) Probable true orthologs from a higher quality (more precise) data set. (C) True-negative orthologs (i.e. true paralogs) from the "gene-loss simulation" data set. Darker dots represent putative orthologous groups which have had an ingroup1 true-negative (paralog) introduced into the group. Lighter dots represent putative orthologous groups which have had an ingroup2 true-negative (paralog) introduced into the group. (D), (E), (F) are zoomed-in versions of (A), (B), (C), respectively, with axes shown from 0 to 2 instead of 0 to 10. Most putative ortholog groups (particularly for the high quality data set) exhibit low Ratio1 and Ratio2 values (for example, all values are less than 1 for the points in the high quality data set plot), whereas most true-negative groups exhibit higher Ratio1 and Ratio2 values (i.e. only 9% of ingroup1 true negative introductions, and 6% of ingroup2 true negative introductions, have points with Ratio1 and Ratio2 values less than 1).
Figure 7
R1 × R2 plots for the prokaryotic data, illustrating the effect of introducing outgroup paralogs (outgroup ortholog true-negatives) in the analysis. Unlike the other R1 × R2 plots in the paper, only ratio ranges from 0 to 2 are shown for each axis. (A) RBH-predicted orthologous groups. (B) Outgroup paralogs from a true-negative data set where all possible outgroups were replaced with next best RBH paralogs. They cannot be well distinguished from other orthologs. However, this is actually promising, since Ortholuge is in essence identifying orthologs between the ingroups only. This analysis shows that an outgroup paralog does not interfere greatly with the identification of true orthologs shared between the ingroups.
Figure 8
Example of the generation of cut-offs for classification of ssd-orthologs and probable paralogs, based on an iterative-true-negative analysis (i.e. based on an introduction of random sets of true-negatives). The particular analysis illustrated here is a Ratio1 analysis for the mouse, rat, human RefSeq RBH dataset, with true-negatives introduced into the mouse (ingroup1) set. In panel A, the number of putative orthologous groups in each ratio range for the true-negative-transformed data set is shown for the whole data set (light shaded bars) and for the introduced true-negatives only (dark shaded bars). Note how the distribution of the data set differs from that of the true-negatives (i.e. introduced paralogs). In panel B, the proportion of randomly introduced true-negatives at 0.5 ratio range intervals is used to formulate cut-offs (denoted by dashed lines) for classifying ssd-orthologs and probable paralogs for the analysis. For the ssd-orthologs cut-off (left-most dashed line), no more than 10% true-negatives are permitted in any ratio range within the ssd-orthologs range. For the probable paralogs cut-off (right-most dashed line), the proportion of true-negatives is at or above 50%. The middle region bounded by these two cut-off points establishes the "uncertain" orthology class ratio range. Dashed lines denoting these cut-offs are also shown in panel A for reference. This approach to true-negative analysis and cut-off generation is also performed for Ratio2 [Additional file 1], and the combination of cut-offs for Ratio1 and Ratio2 is used to classify putative orthologous groups from another data set (such as an RBH-predicted data set) into the three classification levels of "probable ssd-ortholog", "uncertain" and "probable paralogs". Panel C schematically shows the areas of an R1 × R2 plot that would be classified in this way, with the cut-off numbers in this particular example matching the RefSeq RBH-based mouse-rat-human analysis (see Table 2 for how these ranges are numerically determined).
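The cut-off generation described for panel B can be sketched roughly as follows. This is one possible reading of the caption, not the authors' exact procedure (Table 2 of the paper gives the actual numerical rules): ratios are binned at 0.5 intervals, the ssd-ortholog cut-off is taken as the upper edge of the last bin, scanning from the left, whose true-negative proportion stays at or below 10%, and the probable-paralog cut-off is the lower edge of the first bin at or above 50%. The function names and the toy data are invented for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def true_negative_proportions(
    ratios: Iterable[float], is_true_negative: Iterable[bool], width: float = 0.5
) -> Dict[int, float]:
    """Proportion of introduced true-negatives in each ratio bin of size `width`."""
    totals: Dict[int, int] = defaultdict(int)
    negatives: Dict[int, int] = defaultdict(int)
    for ratio, flag in zip(ratios, is_true_negative):
        index = math.floor(ratio / width)
        totals[index] += 1
        negatives[index] += int(flag)
    return {index: negatives[index] / totals[index] for index in sorted(totals)}


def derive_cutoffs(
    proportions: Dict[int, float],
    width: float = 0.5,
    ssd_max: float = 0.10,
    paralog_min: float = 0.50,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (ssd_cutoff, paralog_cutoff); empty bins are simply skipped."""
    ssd_cutoff = None
    for index in sorted(proportions):
        if proportions[index] > ssd_max:
            break  # first bin that exceeds the allowed true-negative share
        ssd_cutoff = (index + 1) * width
    paralog_cutoff = None
    for index in sorted(proportions):
        if proportions[index] >= paralog_min:
            paralog_cutoff = index * width
            break
    return ssd_cutoff, paralog_cutoff


# Invented data: four occupied bins with 0%, 10%, 25% and 50% true-negatives.
ratios = [0.2] * 10 + [0.7] * 10 + [1.2] * 4 + [1.7] * 2
flags = [False] * 10 + [True] + [False] * 9 + [True] + [False] * 3 + [True, False]

cutoffs = derive_cutoffs(true_negative_proportions(ratios, flags))
print(cutoffs)
```

In this toy case, ratios below the first cut-off would fall in the ssd-ortholog range, ratios at or above the second in the probable-paralog range, and the region in between would be "uncertain". The same procedure would be repeated for Ratio2 before combining the two sets of cut-offs.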
