Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2's q2-feature-classifier plugin

doi:10.1186/s40168-018-0470-z

Microbiome. 2018 May 17;6(1):90.

Nicholas A Bokulich¹, Benjamin D Kaehler², Jai Ram Rideout³, Matthew Dillon³, Evan Bolyen³, Rob Knight⁴, Gavin A Huttley⁵, J Gregory Caporaso⁶ ⁷

Affiliations

¹ The Pathogen and Microbiome Institute, Northern Arizona University, PO Box 4073, Flagstaff, AZ, 86011-4073, USA. nicholas.bokulich@nau.edu.
² Research School of Biology, Australian National University, 46 Sullivans Creek Road, Acton ACT, 2601, Australia. benjamin.kaehler@anu.edu.au.
³ The Pathogen and Microbiome Institute, Northern Arizona University, PO Box 4073, Flagstaff, AZ, 86011-4073, USA.
⁴ Departments of Pediatrics and Computer Science and Engineering, and Center for Microbiome Innovation, University of California San Diego, La Jolla, CA, USA.
⁵ Research School of Biology, Australian National University, 46 Sullivans Creek Road, Acton ACT, 2601, Australia. gavin.huttley@anu.edu.au.
⁶ The Pathogen and Microbiome Institute, Northern Arizona University, PO Box 4073, Flagstaff, AZ, 86011-4073, USA. gregcaporaso@gmail.com.
⁷ Department of Biological Sciences, Northern Arizona University, 1298 S Knoles Drive, Building 56, 3rd Floor, Flagstaff, AZ, USA. gregcaporaso@gmail.com.

PMID: 29773078
PMCID: PMC5956843
DOI: 10.1186/s40168-018-0470-z

Nicholas A Bokulich et al. Microbiome. 2018.

Abstract

Background: Taxonomic classification of marker-gene sequences is an important step in microbiome analysis.

Results: We present q2-feature-classifier (https://github.com/qiime2/q2-feature-classifier), a QIIME 2 plugin containing several novel machine-learning and alignment-based methods for taxonomy classification. We evaluated and optimized several commonly used classification methods implemented in QIIME 1 (RDP, BLAST, UCLUST, and SortMeRNA) and several new methods implemented in QIIME 2 (a scikit-learn naive Bayes machine-learning classifier, and alignment-based taxonomy consensus methods based on VSEARCH and BLAST+) for classification of bacterial 16S rRNA and fungal ITS marker-gene amplicon sequence data. The naive-Bayes, BLAST+-based, and VSEARCH-based classifiers implemented in QIIME 2 meet or exceed the species-level accuracy of other commonly used methods designed for classification of marker gene sequences that were evaluated in this work. These evaluations, based on 19 mock communities and error-free sequence simulations, including classification of simulated "novel" marker-gene sequences, are available in our extensible benchmarking framework, tax-credit (https://github.com/caporaso-lab/tax-credit-data).

Conclusions: Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make recommendations regarding parameter choices for these classifiers under a range of standard operating conditions. q2-feature-classifier and tax-credit are both free, open-source, BSD-licensed packages available on GitHub.
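
As an illustration of what an alignment-based taxonomy consensus step might look like, the following is a minimal sketch in plain Python. It is not the q2-feature-classifier implementation: the function name `consensus_taxonomy`, the `min_consensus` parameter, its 0.51 default, and the example lineages are all assumptions made for this example. Given the taxonomy strings of the top alignment hits for one query, it keeps the deepest lineage on which at least a chosen fraction of hits agree.

```python
from collections import Counter


def consensus_taxonomy(hit_taxonomies, min_consensus=0.51):
    """Return the deepest lineage shared by >= min_consensus of the hits.

    hit_taxonomies: list of semicolon-delimited lineage strings, one per hit.
    min_consensus: hypothetical threshold (fraction of hits that must agree).
    """
    if not hit_taxonomies:
        return ["Unassigned"]

    # Split each lineage into its ranks and trim surrounding whitespace.
    lineages = [[rank.strip() for rank in t.split(";")] for t in hit_taxonomies]
    n_hits = len(lineages)

    consensus = []
    depth = 0
    while True:
        # Count every lineage prefix that reaches the current depth.
        prefixes = Counter(
            tuple(lin[: depth + 1]) for lin in lineages if len(lin) > depth
        )
        if not prefixes:
            break
        best, count = prefixes.most_common(1)[0]
        # Stop when agreement drops below the threshold, or when the best
        # prefix no longer extends the consensus built so far.
        if count / n_hits < min_consensus or list(best[:depth]) != consensus:
            break
        consensus = list(best)
        depth += 1

    return consensus or ["Unassigned"]


# Hypothetical top hits for a single query sequence.
hits = [
    "k__Bacteria; p__Firmicutes; g__Lactobacillus; s__brevis",
    "k__Bacteria; p__Firmicutes; g__Lactobacillus; s__casei",
    "k__Bacteria; p__Firmicutes; g__Lactobacillus; s__brevis",
]
strict = consensus_taxonomy(hits, min_consensus=0.75)
lenient = consensus_taxonomy(hits, min_consensus=0.51)
```

With these made-up hits, the stricter threshold stops at genus because only two of three hits agree on species, while the more lenient threshold keeps the species label. This is the kind of trade-off between classification depth and confidence that parameter tuning has to balance.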

Conflict of interest statement

Ethics approval and consent to participate

Not applicable

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

**Fig. 1**
Classifier performance on mock community datasets for 16S rRNA gene sequences (left column) and fungal ITS sequences (right column). a Average F-measure for each taxonomy classification method (averaged across all configurations and all mock community datasets) from class to species level. Error bars = 95% confidence intervals. b Average F-measure for each optimized classifier (averaged across all mock communities) at species level. c Average taxon accuracy rate for each optimized classifier (averaged across all mock communities) at species level. d Average Bray-Curtis distance between the expected mock community composition and its composition as predicted by each optimized classifier (averaged across all mock communities) at species level. Violin plots show median (white point), quartiles (black bars), and kernel density estimation (violin) for each score distribution. Violins with different lower-case letters have significantly different means (paired t test, false discovery rate-corrected P < 0.05).
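
The three scores named in this caption can be sketched in a few lines of plain Python. This is a minimal illustration using textbook definitions, not the tax-credit code: the function names and toy inputs are assumptions, and the framework's own scoring may differ in details such as abundance weighting.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def taxon_accuracy_rate(expected_taxa, observed_taxa):
    """Fraction of observed taxa that are also in the expected set."""
    observed = set(observed_taxa)
    if not observed:
        return 0.0
    return len(observed & set(expected_taxa)) / len(observed)


def bray_curtis(expected, observed):
    """Bray-Curtis dissimilarity between two {taxon: abundance} profiles."""
    taxa = set(expected) | set(observed)
    numerator = sum(abs(expected.get(t, 0.0) - observed.get(t, 0.0)) for t in taxa)
    denominator = sum(expected.get(t, 0.0) + observed.get(t, 0.0) for t in taxa)
    return numerator / denominator if denominator else 0.0


# Toy mock community: two expected species, one of which is misassigned.
expected_profile = {"s__A": 0.5, "s__B": 0.5}
observed_profile = {"s__A": 0.5, "s__C": 0.5}

f_score = f_measure(tp=8, fp=2, fn=2)
tar = taxon_accuracy_rate(expected_profile, observed_profile)
distance = bray_curtis(expected_profile, observed_profile)
```

In this toy case, precision and recall are both 0.8, so the F-measure is 0.8. Half of the observed taxa are expected, and half of the total abundance is misplaced, so the taxon accuracy rate and the Bray-Curtis dissimilarity are both 0.5.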

**Fig. 2**
Classifier performance on cross-validated sequence datasets. Classification accuracy of 16S rRNA gene V4 subdomain (first row), V1–3 subdomain (second row), full-length 16S rRNA gene (third row), and fungal ITS sequences (fourth row). a Average F-measure for each taxonomy classification method (averaged across all configurations and all cross-validated sequence datasets) from class to species level. Error bars = 95% confidence intervals. b Average F-measure for each optimized classifier (averaged across all cross-validated sequence datasets) at species level. Violins with different lower-case letters have significantly different means (paired t test, false discovery rate-corrected P < 0.05). c Correlation between F-measure performance for each method/configuration classification of V4 subdomain (x axis), V1–3 subdomain (y axis), and full-length 16S rRNA gene sequences (z axis). Inset lists the Pearson R² value for each pairwise correlation; each correlation is significant (P < 0.001).

**Fig. 3**
Classifier performance on novel-taxa simulated sequence datasets for 16S rRNA gene sequences (left column) and fungal ITS sequences (right column). a–f, Average F-measure (a), precision (b), recall (c), overclassification (d), underclassification (e), and misclassification (f) for each taxonomy classification method (averaged across all configurations and all novel taxa sequence datasets) from phylum to species level. Error bars = 95% confidence intervals. b Average F-measure for each optimized classifier (averaged across all novel taxa sequence datasets) at species level. Violins with different lower-case letters have significantly different means (paired t test, false discovery rate-corrected P < 0.05).

**Fig. 4**
Classification accuracy comparison between mock community, cross-validated, and novel taxa evaluations. Scatterplots show mean F-measure scores for each method configuration, averaged across all samples, for classification of 16S rRNA genes at genus level (a) and species level (b), and fungal ITS sequences at genus level (c) and species level (d)

**Fig. 5**
Runtime performance comparison of taxonomy classifiers. Runtime (s) for each taxonomy classifier when either varying the number of query sequences with a constant 10,000 reference sequences (a) or varying the number of reference sequences with a constant 1 query sequence (b).
