2013 May 20;8(2):360-74.
doi: 10.4056/sigs.3446951. eCollection 2013.

Phylogeny-driven target selection for large-scale genome-sequencing (and other) projects

Markus Göker et al. Stand Genomic Sci.

Abstract

Despite the steadily decreasing costs of genome sequencing, prioritizing organisms for sequencing remains important in large-scale projects. Phylogeny-based selection is of interest for identifying those organisms whose genomes can be expected to differ most from those that have already been sequenced. Here, we describe a method that infers a phylogenetic scoring independent of which set of organisms has previously been targeted, and which is computationally simple and easy to apply in practice. The scoring itself, as well as pre- and post-processing of the data, is illustrated using two real-world examples in which the method has already been applied for selecting targets for genome sequencing. These projects are the JGI CSP Genomic Encyclopedia of Bacteria and Archaea phase I, targeting 1,000 type strains, and, on a smaller scale, the phylogenomics of the Roseobacter clade. Potential artifacts of the method are discussed and compared to a selection approach based on the taxonomic classification.

Keywords: 16S rRNA; Genomic Encyclopedia; Roseobacter clade; genomics; phylogenetic diversity; taxon selection; tree of life.

Figures

Figure 1
Hypothetical example phylogeny. The numbers above the branches indicate the branch lengths; internal edge labels derived from the names of the leaves of the corresponding subtrees have been added to ease the navigation.
Figure 2
Scatterplot showing the relationship between the two examined variants of the phylogenetic scoring, bRPD (x-axis) and uRPD (y-axis). The overall correlation between the two measures is high (see also Table 2), and the distribution of both variants is highly right-skewed; that is, a few strains with high scores are accompanied by a bulk of strains that contribute only little to the overall sum of the scores.
Figure 3
Saturation plot for the bRPD measure. X-axis, index of the decreasingly sorted bRPD values; y-axis, cumulative bRPD sum in percent. The right-skewed distribution of the bRPD values (see Figure 2) manifests itself in the fact that only about 2,000 strains (vertical line) are necessary to reach 50% of the overall phylogenetic diversity (horizontal line) as estimated using this measure.
Figure 4
Phylogenetic tree of the members of the Roseobacter clade (known at the time of target selection) rooted with Labrenzia spp. The branches are scaled in terms of the expected number of substitutions per site (see size bar). Bootstrap support values [43] were calculated but have been omitted for clarity because they are not relevant to the scoring. The organisms with the ten highest bRPD scores are marked in blue. The organisms with the ten next highest bRPD scores (ranks 11 to 20) are marked in green.
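
The saturation plot described for Figure 3 (values sorted in decreasing order, cumulative sum expressed in percent of the total) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `saturation_curve` and `items_needed` are hypothetical, and `toy_scores` is made-up data standing in for real bRPD values, which are not reproduced here.

```python
from itertools import accumulate


def saturation_curve(scores):
    """Cumulative share (in percent) of the total, for scores sorted decreasingly."""
    ordered = sorted(scores, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [100.0 * running / total for running in accumulate(ordered)]


def items_needed(scores, threshold_percent=50.0):
    """Smallest number of top-scoring items whose cumulative share reaches the threshold."""
    for count, percent in enumerate(saturation_curve(scores), start=1):
        if percent >= threshold_percent:
            return count
    return len(scores)


# Hypothetical, right-skewed toy scores (NOT data from the paper).
toy_scores = [40.0, 25.0, 10.0, 10.0, 5.0, 5.0, 3.0, 2.0]

curve = saturation_curve(toy_scores)
half = items_needed(toy_scores, 50.0)
print(curve)
print(half)
```

With a right-skewed input such as the toy list, the threshold is reached after only a small fraction of the items, which is the pattern the vertical and horizontal lines in Figure 3 mark.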

References

    1. Markowitz VM, Mavromatis K, Ivanova NN, Chen IA, Chu K, Kyrpides NC. IMG-ER: a system for microbial genome annotation expert review and curation. Bioinformatics 2009; 25:2271-2278. doi: 10.1093/bioinformatics/btp393
    2. Field D, Garrity G, Gray T, Morrison N, Selengut J, Sterk P, Tatusova T, Thomson N, Allen MJ, Angiuoli SV, et al. The minimum information about a genome sequence (MIGS) specification. Nat Biotechnol 2008; 26:541-547. doi: 10.1038/nbt1360
    3. Wiley EO, Lieberman BS. Phylogenetics. Theory and practice of phylogenetic systematics. Wiley-Blackwell, Hoboken (NJ), 2011.
    4. Klenk HP, Göker M. En route to a genome-based classification of Archaea and Bacteria? Syst Appl Microbiol 2010; 33:175-182. doi: 10.1016/j.syapm.2010.03.003
    5. Tindall BJ. Misunderstanding the Bacteriological Code. Int J Syst Bacteriol 1999; 49:1313-1316. doi: 10.1099/00207713-49-3-1313
