Review
NPJ Precis Oncol. 2024 May 30;8(1):123.
doi: 10.1038/s41698-024-00617-7.

A review of machine learning methods for cancer characterization from microbiome data

Marco Teixeira et al. NPJ Precis Oncol.

Abstract

Recent studies have shown that the microbiome can impact cancer development, progression, and response to therapies, suggesting microbiome-based approaches for cancer characterization. As cancer-related signatures are complex and implicate many taxa, their discovery often requires Machine Learning (ML) approaches. This review discusses ML methods for cancer characterization from microbiome data. It focuses on the implications of choices made during sample collection, feature selection, and pre-processing. It also discusses ML model selection, offering guidance on how to choose an ML model, and model validation. Finally, it enumerates current limitations and how these may be overcome. Proposed methods, often based on Random Forests, show promising results, but these remain insufficient for widespread clinical usage. Studies often report conflicting results, mainly due to ML models with poor generalizability. We expect that evaluating models with expanded, hold-out datasets, removing technical artifacts, exploring representations of the microbiome other than taxonomical profiles, leveraging advances in deep learning, and developing ML models better adapted to the characteristics of microbiome data will improve the performance and generalizability of models and enable their usage in the clinic.
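
The hold-out evaluation mentioned in the abstract can be illustrated with a leave-one-study-out split, in which every sample from one study is withheld for testing while the remaining studies are used for training. This is a minimal sketch and not the authors' procedure: the function name `leave_one_study_out`, the record fields (`study`, `abundances`, `label`), and the sample values are all hypothetical.

```python
def leave_one_study_out(samples):
    """Yield (held_out_study, train, test), holding out each study once.

    `samples` is a list of dicts with a hypothetical "study" key that
    identifies the cohort each sample came from.
    """
    studies = sorted({s["study"] for s in samples})
    for held_out in studies:
        # Train on every study except the held-out one.
        train = [s for s in samples if s["study"] != held_out]
        # Test only on the held-out study, which the model never saw.
        test = [s for s in samples if s["study"] == held_out]
        yield held_out, train, test


# Hypothetical samples: relative abundances of two taxa plus a cancer label.
samples = [
    {"study": "A", "abundances": [0.7, 0.3], "label": 1},
    {"study": "A", "abundances": [0.4, 0.6], "label": 0},
    {"study": "B", "abundances": [0.8, 0.2], "label": 1},
    {"study": "B", "abundances": [0.5, 0.5], "label": 0},
    {"study": "C", "abundances": [0.6, 0.4], "label": 1},
]

splits = list(leave_one_study_out(samples))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Splitting by study rather than by sample keeps study-specific technical artifacts out of the training data for the held-out cohort, which is one way to probe the generalizability the abstract highlights.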

Conflict of interest statement

R.M.F. and C.F. own patent WO/2018/169423 on microbiome markers for gastric cancer. The remaining authors declare no competing interests.

Figures

Fig. 1. Steps and decisions undertaken when developing an ML model for cancer characterization with microbiome data.
The blocks on the left show the factors conditioning each step (represented in the middle blocks). Some of the decisions are also affected by choices made upstream; for instance, the characteristics of the chosen ML model affect how the microbiome data should be pre-processed. These dependencies are represented by the connectors on the left. The blocks on the right represent the contribution of each step to the overall goal of model development and validation. Each step is discussed in detail in this review.
Fig. 2. Pipeline for ML-based cancer identification from microbial abundance profiles.
The nucleic acid in the collected samples is sequenced and the resulting reads are assigned taxonomic classifications. Likely contaminants and batch effects are removed. Feature selection methods can be used to select the most relevant features, which are used to train an ML model. Feature selection approaches are discussed in section “Methods for dimensionality reduction”. The model should be evaluated using an independent test set and cross-study validations, as discussed in section “Model validation”.
Fig. 3. Diagram of an MLP for cancer prediction from microbial abundance profiles.
The input layer contains the relative abundance of each taxon or OTU. These act as inputs (x) for the activation units in the hidden layers, which apply a function ψ. The output layer returns the probability of the presence and absence of cancer.
Fig. 4. Overview of needed improvements in microbiome-based identification of cancer.
Improvements in ML models can be achieved through improvements in their accuracy and generalizability. Some of the future perspectives discussed in this review and shown in this figure can aid in improving generalizability (left), model accuracy (right), or both (center).
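
The MLP described in the Fig. 3 caption can be sketched as a single forward pass: relative abundances enter the input layer, hidden units apply an activation function ψ to a weighted sum of their inputs, and the output layer returns two probabilities. The caption does not specify ψ or the output function, so this sketch assumes ReLU for ψ and a softmax output. The weights, biases, and abundance values are hypothetical and untrained.

```python
import math


def relu(z):
    """Assumed choice for the activation function psi."""
    return max(0.0, z)


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: activation(w . x + b) for each unit."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs


def mlp_forward(abundances, hidden_w, hidden_b, out_w, out_b):
    """Return [P(cancer present), P(cancer absent)] for one sample."""
    hidden = dense(abundances, hidden_w, hidden_b, activation=relu)
    scores = dense(hidden, out_w, out_b)
    return softmax(scores)


# Hypothetical relative abundances of three taxa (they sum to 1).
x = [0.6, 0.3, 0.1]
# Hypothetical, untrained parameters: two hidden units, two output units.
hidden_w = [[1.0, -1.0, 0.5], [-0.5, 2.0, 1.0]]
hidden_b = [0.0, 0.1]
out_w = [[1.0, -1.0], [-1.0, 1.0]]
out_b = [0.0, 0.0]

probs = mlp_forward(x, hidden_w, hidden_b, out_w, out_b)
print(probs)
```

In practice the weights would be learned from labeled samples. The sketch only shows how an abundance profile flows through the layers to yield the two output probabilities.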
