Predicting the animal hosts of coronaviruses from compositional biases of spike protein and whole genome sequences through machine learning
- PMID: 33878118
- PMCID: PMC8087038
- DOI: 10.1371/journal.ppat.1009149
Abstract
The COVID-19 pandemic has demonstrated the serious potential for novel zoonotic coronaviruses to emerge and cause major outbreaks. The immediate animal origin of the causative virus, SARS-CoV-2, remains unknown; identifying such origins is a notoriously challenging task for emerging disease investigations. Coevolution with hosts leads to specific evolutionary signatures within viral genomes that can inform likely animal origins. We obtained a set of 650 spike protein and 511 whole genome nucleotide sequences from 222 and 185 viruses, respectively, belonging to the family Coronaviridae. We then trained random forest models independently on genome composition biases of spike protein and whole genome sequences, including dinucleotide and codon usage biases, in order to predict animal host (of nine possible categories, including human). In hold-one-out cross-validation, predictive accuracy on unseen coronaviruses consistently reached ~73%, indicating that the evolutionary signal in spike proteins is just as informative as that in whole genome sequences. However, different composition biases were informative in each case. Applying optimised random forest models to classify human sequences of MERS-CoV and SARS-CoV revealed evolutionary signatures consistent with their recognised intermediate hosts (camelids, carnivores), while human sequences of SARS-CoV-2 were predicted as having bat hosts (suborder Yinpterochiroptera), supporting bats as the suspected origins of the current pandemic. In addition to phylogeny, variation in genome composition can act as an informative approach to predict emerging virus traits as soon as sequences are available. More widely, this work demonstrates the potential in combining genetic resources with machine learning algorithms to address long-standing challenges in emerging infectious diseases.
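To make the idea of a "dinucleotide bias" feature concrete, the sketch below computes, for each of the 16 dinucleotides, the ratio of its observed frequency to the frequency expected from single-base composition. This observed/expected ratio is one common way to define the bias; the abstract does not state the exact formula the authors used, so the definition here, the function name `dinucleotide_bias`, and the handling of ambiguous bases are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"  # RNA input is mapped onto the DNA alphabet (U -> T)


def dinucleotide_bias(seq: str) -> dict[str, float]:
    """Observed/expected ratio for each of the 16 dinucleotides.

    Hypothetical helper: an assumed, commonly used definition of
    dinucleotide bias, not necessarily the one used in the paper.
    """
    seq = seq.upper().replace("U", "T")

    # Single-base frequencies, ignoring ambiguous symbols such as N.
    mono = Counter(b for b in seq if b in BASES)
    n_mono = sum(mono.values())

    # Overlapping dinucleotide counts, skipping pairs with ambiguous bases.
    pairs = (seq[i:i + 2] for i in range(len(seq) - 1))
    di = Counter(p for p in pairs if p[0] in BASES and p[1] in BASES)
    n_di = sum(di.values())

    features = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono) if n_mono else 0.0
        observed = di[x + y] / n_di if n_di else 0.0
        # A ratio of 0.0 is reported when the expectation is undefined.
        features[x + y] = observed / expected if expected > 0 else 0.0
    return features


if __name__ == "__main__":
    for name, ratio in dinucleotide_bias("AACC").items():
        print(name, round(ratio, 3))
```

Each sequence yields a fixed-length vector of 16 values, which is the kind of input a random forest classifier can take alongside codon usage features. Values below 1 indicate an under-represented dinucleotide and values above 1 an over-represented one.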
Conflict of interest statement
The authors have declared that no competing interests exist.
Figures
![Fig 1](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/8087038/bin/ppat.1009149.g001.gif)
![Fig 2](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/8087038/bin/ppat.1009149.g002.gif)
![Fig 3](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/8087038/bin/ppat.1009149.g003.gif)
![Fig 4](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/8087038/bin/ppat.1009149.g004.gif)
![Fig 5](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/8087038/bin/ppat.1009149.g005.gif)
Similar articles
- Molecular evolution and phylogenetic analysis of SARS-CoV-2 and hosts ACE2 protein suggest Malayan pangolin as intermediary host. Braz J Microbiol. 2020 Dec;51(4):1593-1599. doi: 10.1007/s42770-020-00321-1. Epub 2020 Jun 26. PMID: 32592038. Free PMC article.
- [Source of the COVID-19 pandemic: ecology and genetics of coronaviruses (Betacoronavirus: Coronaviridae) SARS-CoV, SARS-CoV-2 (subgenus Sarbecovirus), and MERS-CoV (subgenus Merbecovirus).] Vopr Virusol. 2020;65(2):62-70. doi: 10.36233/0507-4088-2020-65-2-62-70. PMID: 32515561. Review. Russian.
- Properties of Coronavirus and SARS-CoV-2. Malays J Pathol. 2020 Apr;42(1):3-11. PMID: 32342926. Review.
- Further Evidence for Bats as the Evolutionary Source of Middle East Respiratory Syndrome Coronavirus. mBio. 2017 Apr 4;8(2):e00373-17. doi: 10.1128/mBio.00373-17. PMID: 28377531. Free PMC article.
- Evidence for an Ancestral Association of Human Coronavirus 229E with Bats. J Virol. 2015 Dec;89(23):11858-70. doi: 10.1128/JVI.01755-15. Epub 2015 Sep 16. PMID: 26378164. Free PMC article.
Cited by
- Compositional features analysis by machine learning in genome represents linear adaptation of monkeypox virus. Front Genet. 2024 Mar 1;15:1361952. doi: 10.3389/fgene.2024.1361952. eCollection 2024. PMID: 38495668. Free PMC article.
- Viral genomic features predict orthopoxvirus reservoir hosts. bioRxiv [Preprint]. 2023 Oct 27:2023.10.26.564211. doi: 10.1101/2023.10.26.564211. PMID: 37961540. Free PMC article. Preprint.
- Evidence for an ancient aquatic origin of the RNA viral order Articulavirales. Proc Natl Acad Sci U S A. 2023 Nov 7;120(45):e2310529120. doi: 10.1073/pnas.2310529120. Epub 2023 Oct 31. PMID: 37906647. Free PMC article.
- Risk Assessment of the Possible Intermediate Host Role of Pigs for Coronaviruses with a Deep Learning Predictor. Viruses. 2023 Jul 15;15(7):1556. doi: 10.3390/v15071556. PMID: 37515242. Free PMC article.
- Classification of group A rotavirus VP7 and VP4 genotypes using random forest. Front Genet. 2023 May 30;14:1029185. doi: 10.3389/fgene.2023.1029185. eCollection 2023. PMID: 37323680. Free PMC article.