A review of machine learning methods for cancer characterization from microbiome data

doi:10.1038/s41698-024-00617-7

Review

NPJ Precis Oncol. 2024 May 30;8(1):123.

Marco Teixeira^{1,2}, Francisco Silva^{3,4}, Rui M Ferreira^{5,6}, Tania Pereira^{3,7}, Ceu Figueiredo^{5,6,8}, Hélder P Oliveira^{3,4}

Affiliations

¹ Institute for Systems and Computer Engineering, Technology and Science, Porto, Portugal. marco.a.teixeira@inesctec.pt.
² Faculty of Engineering, University of Porto, Porto, Portugal. marco.a.teixeira@inesctec.pt.
³ Institute for Systems and Computer Engineering, Technology and Science, Porto, Portugal.
⁴ Faculty of Science, University of Porto, Porto, Portugal.
⁵ Ipatimup - Institute of Molecular Pathology and Immunology of the University of Porto, Porto, Portugal.
⁶ Instituto de Investigação e Inovação em Saúde, University of Porto, Porto, Portugal.
⁷ Faculty of Sciences and Technology, University of Coimbra, Coimbra, Portugal.
⁸ Faculty of Medicine, University of Porto, Porto, Portugal.

PMID: 38816569
PMCID: PMC11139966
DOI: 10.1038/s41698-024-00617-7

Marco Teixeira et al. NPJ Precis Oncol. 2024.

Abstract

Recent studies have shown that the microbiome can impact cancer development, progression, and response to therapies, suggesting microbiome-based approaches for cancer characterization. As cancer-related signatures are complex and implicate many taxa, their discovery often requires Machine Learning (ML) approaches. This review discusses ML methods for cancer characterization from microbiome data. It focuses on the implications of choices made during sample collection, feature selection, and pre-processing. It also discusses ML model selection, offering guidance on how to choose an ML model, and model validation. Finally, it enumerates current limitations and how these may be overcome. Proposed methods, often based on Random Forests, show promising results, but these remain insufficient for widespread clinical usage. Studies often report conflicting results, mainly due to ML models with poor generalizability. We expect that evaluating models with expanded, hold-out datasets, removing technical artifacts, exploring representations of the microbiome other than taxonomical profiles, leveraging advances in deep learning, and developing ML models better adapted to the characteristics of microbiome data will improve the performance and generalizability of models and enable their usage in the clinic.

Conflict of interest statement

R.M.F. and C.F. own patent WO/2018/169423 on microbiome markers for gastric cancer. The remaining authors declare no competing interests.

Figures

**Fig. 1. Steps and decisions undertaken when developing an ML model for cancer characterization with microbiome data.**
The blocks on the left show the factors conditioning each step (represented in the middle blocks). Some of the decisions are also affected by choices made upstream - for instance, the characteristics of the ML model chosen affect how the microbiome data should be pre-processed; these dependencies are represented by the connectors on the left. The blocks on the right represent the contribution of each step to the overall goal of model development and validation. Each step is discussed in detail in this review.

**Fig. 2. Pipeline for ML-based cancer identification from microbial abundance profiles.**
The nucleic acid in the collected samples is sequenced and the resulting reads are assigned taxonomic classifications. Likely contaminants and batch effects are removed. Feature selection methods can be used to select the most relevant features, which are used to train an ML model. Feature selection approaches are discussed in section “Methods for dimensionality reduction”. The model should be evaluated using an independent test set and cross-study validations, as discussed in section “Model validation”.
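
The cross-study validation mentioned in the Fig. 2 caption can be illustrated with a small sketch. It assumes a leave-one-study-out scheme, which is only one possible reading of "cross-study validation", and the helper names (`to_relative_abundance`, `leave_one_study_out`) and the toy study labels are hypothetical, not taken from the review.

```python
from collections import defaultdict


def to_relative_abundance(counts):
    """Convert raw per-taxon read counts into proportions that sum to 1."""
    total = sum(counts)
    if total == 0:
        # A sample with no reads has no defined profile; return zeros.
        return [0.0] * len(counts)
    return [c / total for c in counts]


def leave_one_study_out(study_ids):
    """Yield (held_out_study, train_indices, test_indices).

    Every study is held out once, so the model is always evaluated on
    samples from a cohort it has not seen during training.
    """
    by_study = defaultdict(list)
    for index, study in enumerate(study_ids):
        by_study[study].append(index)
    for held_out in sorted(by_study):
        test_idx = by_study[held_out]
        train_idx = sorted(
            i for study, idx in by_study.items() if study != held_out for i in idx
        )
        yield held_out, train_idx, test_idx


# Toy example (assumed data): five samples drawn from three studies.
study_ids = ["A", "A", "B", "B", "C"]
splits = list(leave_one_study_out(study_ids))
profile = to_relative_abundance([2, 2, 0])
```

Feature selection and any batch-effect correction would need to be fitted on the training indices of each split only; otherwise information from the held-out study leaks into the model.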

**Fig. 3. Diagram of an MLP for cancer prediction from microbial abundance profiles.**
The input layer contains the relative abundance of each taxon or OTU. These act as inputs (x) for the activation units in the hidden layers, which apply a function ψ. The output layer returns the probability of the presence and absence of cancer.
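
A minimal forward pass for the kind of MLP described in the Fig. 3 caption might look like the sketch below. The caption does not say which function ψ is, so ReLU is assumed here. The layer sizes, the random untrained weights, and the softmax output layer are also assumptions made for illustration.

```python
import math
import random


def relu(z):
    """Assumed choice for the activation function psi."""
    return max(0.0, z)


def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: activation(W x + b)."""
    outputs = []
    for row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs


def init_layer(n_in, n_out, rng):
    """Random, untrained parameters; a real model would learn these."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mlp_forward(abundances, layers):
    """Return [P(cancer absent), P(cancer present)] for one sample."""
    hidden = abundances
    for weights, biases in layers[:-1]:
        hidden = dense(hidden, weights, biases, activation=relu)
    out_weights, out_biases = layers[-1]
    return softmax(dense(hidden, out_weights, out_biases))


rng = random.Random(0)
counts = [120, 30, 0, 50]             # toy read counts for four taxa
x = [c / sum(counts) for c in counts]  # relative abundances (input layer)
layers = [init_layer(4, 3, rng), init_layer(3, 2, rng)]
probs = mlp_forward(x, layers)
```

Because the weights are untrained, the returned probabilities carry no meaning; the sketch only shows how abundances flow through the hidden and output layers.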

**Fig. 4. Overview of needed improvements in microbiome-based identification of cancer.**
Improvements in ML models can be achieved through improvements in their accuracy and generalizability. Some of the future perspectives discussed in this review and shown in this figure can aid in improving generalizability (left), model accuracy (right), or both (center).
