PLoS Pathog. 2021 Apr 20;17(4):e1009149.
doi: 10.1371/journal.ppat.1009149. eCollection 2021 Apr.

Predicting the animal hosts of coronaviruses from compositional biases of spike protein and whole genome sequences through machine learning

Liam Brierley et al. PLoS Pathog. 2021.

Abstract

The COVID-19 pandemic has demonstrated the serious potential for novel zoonotic coronaviruses to emerge and cause major outbreaks. The immediate animal origin of the causative virus, SARS-CoV-2, remains unknown, and identifying such origins is a notoriously challenging task in emerging disease investigations. Coevolution with hosts leads to specific evolutionary signatures within viral genomes that can inform likely animal origins. We obtained a set of 650 spike protein and 511 whole genome nucleotide sequences from 222 and 185 viruses, respectively, belonging to the family Coronaviridae. We then trained random forest models independently on genome composition biases of spike protein and whole genome sequences, including dinucleotide and codon usage biases, in order to predict animal host (one of nine possible categories, including human). In hold-one-out cross-validation, predictive accuracy on unseen coronaviruses consistently reached ~73%, indicating that evolutionary signal in spike proteins is just as informative as that in whole genome sequences. However, different composition biases were informative in each case. Applying optimised random forest models to classify human sequences of MERS-CoV and SARS-CoV revealed evolutionary signatures consistent with their recognised intermediate hosts (camelids, carnivores), while human sequences of SARS-CoV-2 were predicted as having bat hosts (suborder Yinpterochiroptera), supporting bats as the suspected origin of the current pandemic. In addition to phylogeny, variation in genome composition can act as an informative approach to predict emerging virus traits as soon as sequences are available. More widely, this work demonstrates the potential in combining genetic resources with machine learning algorithms to address long-standing challenges in emerging infectious diseases.
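As an illustration of one kind of composition feature named in the abstract, the sketch below computes a simple dinucleotide bias: the observed frequency of each dinucleotide divided by the frequency expected from its two constituent nucleotides. This is a minimal sketch under stated assumptions and may differ from the exact definition the authors used; the function name `dinucleotide_bias` and the handling of ambiguous symbols are illustrative choices, not taken from the paper.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

def dinucleotide_bias(seq):
    """Return the observed/expected ratio for each of the 16 dinucleotides.

    Assumption: expected frequency of XY is freq(X) * freq(Y), and any
    symbol outside A, C, G, T (e.g. N) is skipped.
    """
    seq = seq.upper().replace("U", "T")
    # Single-nucleotide counts over unambiguous positions only.
    mono = Counter(b for b in seq if b in BASES)
    # Overlapping dinucleotide counts, dropping pairs with ambiguous symbols.
    pairs = (seq[i:i + 2] for i in range(len(seq) - 1))
    di = Counter(p for p in pairs if p[0] in BASES and p[1] in BASES)
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    bias = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono) if n_mono else 0.0
        observed = di[x + y] / n_di if n_di else 0.0
        # A ratio of 1 means no bias; undefined when a base is absent.
        bias[x + y] = observed / expected if expected > 0 else float("nan")
    return bias

bias = dinucleotide_bias("ACGTACGTACGT")
```

Ratios above 1 indicate overrepresentation and ratios below 1 indicate underrepresentation of a dinucleotide relative to the base composition of the sequence.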

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Codon biases (RSCU) across coronavirus spike protein sequences examined.
Heatmaps of coronavirus codon usage bias (RSCU) associated with each codon in each spike protein sequence (n = 650). Main colour scale denotes RSCU value, a null value of 1 (black) indicating no difference in codon usage from expectation, with blue and red representing under- or overrepresentation respectively. Dendrogram colour bar denotes taxonomic genus.
Fig 2. Codon biases (RSCU) across coronavirus whole genome sequences examined.
Heatmaps of coronavirus codon usage bias (RSCU) associated with each codon in each whole genome sequence (n = 511). Main colour scale denotes RSCU value, a null value of 1 (black) indicating no difference in codon usage from expectation, with blue and red representing under- or overrepresentation respectively. Dendrogram colour bar denotes taxonomic genus.
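The RSCU values shown in these heatmaps can be illustrated with a short sketch. It uses the standard definition of relative synonymous codon usage: a codon's count divided by the mean count of all codons encoding the same amino acid, so a value of 1 means no deviation from equal usage. The function name `rscu`, the use of the standard genetic code, and the treatment of absent amino acids as undefined are assumptions for illustration, not details confirmed by the source.

```python
from collections import Counter, defaultdict

# Standard genetic code in TCAG order ("*" marks stop codons).
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in _BASES for b in _BASES for c in _BASES), _AMINO)
)

def rscu(cds):
    """Return RSCU for each of the 61 sense codons of an in-frame sequence."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    # Group sense codons into synonymous families by amino acid.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        for c in family:
            # count / (total / family size); undefined if amino acid is absent.
            values[c] = counts[c] * len(family) / total if total else float("nan")
    return values

values = rscu("GCTGCTGCTGCC")  # four alanine codons: GCT x3, GCC x1
```

In this toy input, GCT is used three times as often as equal usage among the four alanine codons would predict, so its RSCU is 3, while the unused GCA and GCG codons score 0.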
Fig 3. Random forest host predictions based on coronavirus genome composition.
Stacked bar plots of predicted probabilities of each host category for coronavirus sequences. Predictions were obtained from ensemble random forest models trained on A) spike protein and B) whole genome composition features. Panels depict sequences from each metadata-derived host category and colour coding denotes model-predicted host category. Stacks represent individual coronavirus sequences, ordered from largest to smallest probability of the correct host, i.e., greater panel area matching the correct host category indicates better overall model performance. Non-zoonotic coronavirus sequences originating from humans (human coronaviruses HKU1, NL63, OC43, 229E) are labelled for clarity. Versions stratified by genera and species are provided as S4 and S5 Figs.
Fig 4. Random forest predictions based on zoonotic coronavirus genome composition.
Stacked bar plots of predicted probabilities of each host category for zoonotic coronavirus sequences sampled from humans. Predictions were obtained from ensemble random forest models trained on A) spike protein and B) whole genome composition features. Colour coding denotes model-predicted host category. Stacks represent individual coronavirus sequences.
Fig 5. Variable importance of genomic features.
Variable importance of genome composition features in ensemble random forest models predicting coronavirus host category from whole genome sequences (x axis) and spike protein sequences (y axis), with labelling of top ten most informative features from both analyses. Points denote mean values of relative decrease in Gini impurity associated with each feature across A) m = 222 and B) m = 185 random forests during hold-one-out cross-validation. Colour key denotes genomic feature type.
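The quantity underlying this importance measure can be sketched for a single split. The example below computes the decrease in Gini impurity when one feature is thresholded at one tree node; a random forest importance score typically aggregates such decreases over every node that splits on the feature and over all trees. The function names, the toy feature values, and the host labels are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(feature_values, labels, threshold):
    """Impurity of the parent node minus the size-weighted child impurity."""
    left = [l for v, l in zip(feature_values, labels) if v <= threshold]
    right = [l for v, l in zip(feature_values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted

# Hypothetical feature values (e.g. one composition bias) and host labels.
feature = [0.5, 0.6, 1.4, 1.5]
hosts = ["bat", "bat", "human", "human"]
perfect = gini_decrease(feature, hosts, threshold=1.0)
partial = gini_decrease(feature, hosts, threshold=0.55)
```

A threshold that separates the two host classes cleanly removes all impurity from the parent node, whereas a poorly placed threshold yields a smaller decrease.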
