CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes

Genome Res. 2015 Jul;25(7):1043-55. doi: 10.1101/gr.186072.114. Epub 2015 May 14.

Donovan H Parks¹, Michael Imelfort¹, Connor T Skennerton¹, Philip Hugenholtz², Gene W Tyson³

Affiliations

¹ Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, St. Lucia, QLD 4072, Queensland, Australia;
² Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, St. Lucia, QLD 4072, Queensland, Australia; Institute for Molecular Bioscience, The University of Queensland, St. Lucia, QLD 4072, Queensland, Australia;
³ Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, St. Lucia, QLD 4072, Queensland, Australia; Advanced Water Management Centre, The University of Queensland, St. Lucia, QLD 4072, Queensland, Australia.

PMID: 25977477
PMCID: PMC4484387
DOI: 10.1101/gr.186072.114

Donovan H Parks et al. Genome Res. 2015 Jul.

Abstract

Large-scale recovery of genomes from isolates, single cells, and metagenomic data has been made possible by advances in computational methods and substantial reductions in sequencing costs. Although this increasing breadth of draft genomes is providing key information regarding the evolutionary and functional diversity of microbial life, it has become impractical to finish all available reference genomes. Making robust biological inferences from draft genomes requires accurate estimates of their completeness and contamination. Current methods for assessing genome quality are ad hoc and generally make use of a limited number of "marker" genes conserved across all bacterial or archaeal genomes. Here we introduce CheckM, an automated method for assessing the quality of a genome using a broader set of marker genes specific to the position of a genome within a reference genome tree and information about the collocation of these genes. We demonstrate the effectiveness of CheckM using synthetic data and a wide range of isolate-, single-cell-, and metagenome-derived genomes. CheckM is shown to provide accurate estimates of genome completeness and contamination and to outperform existing approaches. Using CheckM, we identify a diverse range of errors currently impacting publicly available isolate genomes and demonstrate that genomes obtained from single cells and metagenomic data vary substantially in quality. In order to facilitate the use of draft genomes, we propose an objective measure of genome quality that can be used to select genomes suitable for specific gene- and genome-centric analyses of microbial communities.

Figures

**Figure 1.**
Error in completeness and contamination estimates on simulated genomes with 50%, 70%, 80%, or 90% completeness (comp.) and 5%, 10%, or 15% contamination (cont.). Quality estimates were determined using domain-level marker genes treated as individual markers (IM) or organized into collocated marker sets (MS). Simulated genomes were generated under the random fragment model from 3324 draft genomes spanning 39 classes (20 phyla) with each draft genome being used to generate 20 simulated genomes. A systematic bias in the estimates results in completeness being overestimated on average (median value to the *right* of zero) and contamination being underestimated on average (median value to the *left* of zero). Results are summarized using box-and-whisker plots showing the 1st (99th), 5th (95th), 25th (75th), and 50th percentiles.
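
The two scoring schemes named in this caption can be illustrated with a small sketch. It assumes a hypothetical input (a mapping from marker gene to the number of copies found in a genome), scores completeness from markers that are present and contamination from extra copies, and, for the MS variant, gives every collocated set equal weight. That weighting and all names below are assumptions made for illustration, not details taken from the caption.

```python
from typing import Dict, List, Set, Tuple


def estimate_individual(counts: Dict[str, int], markers: Set[str]) -> Tuple[float, float]:
    """IM: every marker gene carries equal weight."""
    present = sum(1 for m in markers if counts.get(m, 0) >= 1)
    extra = sum(max(counts.get(m, 0) - 1, 0) for m in markers)
    return 100.0 * present / len(markers), 100.0 * extra / len(markers)


def estimate_marker_sets(counts: Dict[str, int],
                         marker_sets: List[Set[str]]) -> Tuple[float, float]:
    """MS: every collocated set carries equal weight (assumed weighting)."""
    comp = sum(sum(1 for m in s if counts.get(m, 0) >= 1) / len(s) for s in marker_sets)
    cont = sum(sum(max(counts.get(m, 0) - 1, 0) for m in s) / len(s) for s in marker_sets)
    return 100.0 * comp / len(marker_sets), 100.0 * cont / len(marker_sets)


# Hypothetical genome: one four-gene collocated block is missing entirely,
# gene "e" is present once, and gene "f" is present twice.
marker_sets = [{"a", "b", "c", "d"}, {"e"}, {"f"}]
counts = {"e": 1, "f": 2}

im = estimate_individual(counts, set().union(*marker_sets))
ms = estimate_marker_sets(counts, marker_sets)
print(f"IM: completeness {im[0]:.1f}%, contamination {im[1]:.1f}%")
print(f"MS: completeness {ms[0]:.1f}%, contamination {ms[1]:.1f}%")
```

In this toy case the missing four-gene block counts as four independent losses under IM but as a single loss under MS, which is why the two schemes can give different estimates for the same genome.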

**Figure 2.**
CheckM consists of a workflow for precomputing lineage-specific marker genes for each branch within a reference genome tree (*top* box) and an online workflow for inferring the quality of putative genomes (*bottom* box). Starting with a set of annotated reference genomes, the quality of these genomes is assessed in order to produce a set of near-complete genomes suitable for inferring marker genes. These genomes form the basis of a reference genome tree. A simulation framework is then used to associate each branch in the reference genome tree with a lineage-specific marker set suitable for robustly estimating the quality of genomes placed along a given branch (Fig. 3). To determine the quality of a putative genome, its position within the reference genome tree is inferred in order to establish the set of marker genes suitable for assessing its quality. These marker genes are identified within the putative genome and the presence/absence of these genes used to estimate its completeness and contamination.

**Figure 3.**
Overview of simulation framework for selecting lineage-specific marker genes. (A) To evaluate a genome, G, it is placed into a reference genome tree. Each parental node from the point of insertion to the root of the tree defines a lineage-specific marker set which can be used to estimate the completeness and contamination of this genome. (B) To select a suitable set of lineage-specific marker genes for evaluating G, the genomes in the child lineage of G with the fewest genomes were used as proxies for G. (C) Genomes at different levels of completeness and contamination were simulated from these proxy genomes by subsampling and duplicating fixed sized genomic fragments. (D) Each parental marker set was then used to estimate the completeness and contamination of these simulated genomes, and the marker set resulting in the best average performance over all simulated genomes was identified. This marker set is used to assess the quality of any genomes subsequently inserted along this branch.
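
Steps (C) and (D) of this caption can be sketched in a few lines. The sketch below is a toy model under stated assumptions: a proxy genome is represented as a list of fixed-size fragments, each already annotated with the marker genes it carries; a single proxy genome stands in for the proxy set; duplicated fragments are drawn from the whole proxy genome; and "best average performance" is taken to mean the lowest summed absolute error. The function names, the error metric, and the example data are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, Sequence, Set, Tuple

Fragment = Tuple[str, ...]  # marker genes found on one fixed-size fragment


def simulate(fragments: Sequence[Fragment], completeness: float,
             contamination: float, rng: random.Random) -> Counter:
    """Subsample fragments to the target completeness, then duplicate
    randomly chosen fragments to add the target contamination."""
    n = len(fragments)
    kept = rng.sample(list(fragments), round(completeness * n))
    extra = rng.choices(list(fragments), k=round(contamination * n))
    return Counter(gene for frag in kept + extra for gene in frag)


def estimate(counts: Counter, markers: Set[str]) -> Tuple[float, float]:
    """Completeness and contamination as fractions of the marker set."""
    present = sum(1 for m in markers if counts[m] >= 1)
    extra = sum(max(counts[m] - 1, 0) for m in markers)
    return present / len(markers), extra / len(markers)


def select_marker_set(fragments: Sequence[Fragment],
                      candidates: Dict[str, Set[str]],
                      levels: Sequence[Tuple[float, float]],
                      replicates: int = 20, seed: int = 0) -> str:
    """Return the candidate marker set with the lowest total absolute error."""
    rng = random.Random(seed)
    errors = {name: 0.0 for name in candidates}
    for comp, cont in levels:
        for _ in range(replicates):
            counts = simulate(fragments, comp, cont, rng)
            for name, markers in candidates.items():
                est_comp, est_cont = estimate(counts, markers)
                errors[name] += abs(est_comp - comp) + abs(est_cont - cont)
    return min(errors, key=errors.get)


# Hypothetical proxy genome: ten fragments, one single-copy marker on each.
proxy = [(f"g{i}",) for i in range(10)]
candidates = {
    # markers that all occur once in the proxy genome
    "lineage": {f"g{i}" for i in range(10)},
    # half of these markers do not occur in the proxy genome at all
    "domain": {f"g{i}" for i in range(5)} | {f"x{i}" for i in range(5)},
}
best = select_marker_set(proxy, candidates, levels=[(0.7, 0.1)])
print(best)
```

Here the candidate set containing markers absent from the proxy genome always underestimates completeness, so the set made up of markers the proxy genome actually carries is selected.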

**Figure 4.**
Error in completeness and contamination estimates on simulated genomes with 50%, 70%, 80%, or 90% completeness and 5%, 10%, or 15% contamination. Quality estimates were determined using (1) domain: marker sets inferred across all archaeal or bacterial genomes; (2) selected: marker sets inferred from genomes within the lineage selected by CheckM; and (3) best: marker sets inferred from genomes within the lineage producing the most accurate estimates. Marker genes were organized into collocated marker sets in all cases. Simulated genomes were generated under the random contig model from 2430 draft genomes spanning 31 classes (18 phyla) with each draft genome being used to generate 20 simulated genomes.

**Figure 5.**
Lineage-specific completeness and contamination estimates for 262 isolates annotated as finished in IMG (A), 2019 isolates annotated as draft in IMG (B), 632 genomes recovered using single-cell genomics (C), and 146 population genomes recovered from metagenomic data (D). Dashed lines indicate the criteria required for a genome to be considered a near-complete genome with low contamination. *Insets* give a more detailed view of the quality of the isolate genomes. The 2281 isolate genomes were obtained from IMG and sequenced as part of the GEBA, GEBA-KMG, GEBA-PCC, GEBA-RNB, or HMP initiatives.

Cited by

Roseateles caseinilyticus sp. nov. and Roseateles cellulosilyticus sp. nov., isolated from rice paddy field soil.
So Y, Chhetri G, Kim I, Park S, Jung Y, Seo T. Antonie Van Leeuwenhoek. 2024 Jun 4;117(1):87. doi: 10.1007/s10482-024-01988-4. PMID: 38833203
Gut microbiome shifts in people with type 1 diabetes are associated with glycaemic control: an INNODIA study.
Vatanen T, de Beaufort C, Marcovecchio ML, Overbergh L, Brunak S, Peakman M, Mathieu C, Knip M; INNODIA consortium. Diabetologia. 2024 Jun 4. doi: 10.1007/s00125-024-06192-7. Online ahead of print. PMID: 38832971
Lacticaseibacillus salsurivasis sp. nov. and Companilactobacillus muriivasis sp. nov., Isolated from Traditional Chinese Pickle.
Zhang HX, Li CY, Gu CT. Curr Microbiol. 2024 Jun 3;81(7):203. doi: 10.1007/s00284-024-03738-1. PMID: 38831185
Metabolic relationships between marine red algae and algae-associated bacteria.
Kim KH, Kim JM, Baek JH, Jeong SE, Kim H, Yoon HS, Jeon CO. Mar Life Sci Technol. 2024 May 8;6(2):298-314. doi: 10.1007/s42995-024-00227-z. eCollection 2024 May. PMID: 38827136. Free PMC article.
Exploring high-quality microbial genomes by assembling short-reads with long-range connectivity.
Zhang Z, Xiao J, Wang H, Yang C, Huang Y, Yue Z, Chen Y, Han L, Yin K, Lyu A, Fang X, Zhang L. Nat Commun. 2024 May 31;15(1):4631. doi: 10.1038/s41467-024-49060-z. PMID: 38821971. Free PMC article.
