A data-driven approach to predicting diabetes and cardiovascular disease with machine learning
- PMID: 31694707
- PMCID: PMC6836338
- DOI: 10.1186/s12911-019-0918-5
Abstract
Background: Diabetes and cardiovascular disease are two of the main causes of death in the United States. Identifying and predicting these diseases in patients is the first step towards stopping their progression. We evaluate the capabilities of machine learning models in detecting at-risk patients using survey data (and laboratory results), and identify key variables within the data contributing to these diseases among the patients.
Methods: Our research explores data-driven approaches that use supervised machine learning models to identify patients with such diseases. Using the National Health and Nutrition Examination Survey (NHANES) dataset, we conduct an exhaustive search of all available feature variables within the data to develop models for cardiovascular disease, prediabetes, and diabetes detection. Using different time-frames and feature sets for the data (based on laboratory data), multiple machine learning models (logistic regression, support vector machines, random forest, and gradient boosting) were evaluated on their classification performance. The models were then combined to develop a weighted ensemble model, capable of leveraging the performance of the disparate models to improve detection accuracy. Information gain of tree-based models was used to identify the key variables within the patient data that contributed to the detection of at-risk patients in each of the disease classes by the data-learned models.
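The weighted ensemble idea described in the Methods can be illustrated with a minimal sketch. It assumes a soft-voting scheme in which each model's predicted probability of disease is averaged using per-model weights; the abstract does not state how the authors combined or weighted their models, so the function name `weighted_ensemble`, the example probabilities, and the weights below are all hypothetical.

```python
from typing import List, Sequence


def weighted_ensemble(
    prob_lists: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted average of per-model positive-class probabilities.

    prob_lists[m][i] is model m's predicted probability for patient i.
    This is an assumed soft-voting scheme, not the paper's exact method.
    """
    if len(prob_lists) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_patients = len(prob_lists[0])
    if any(len(p) != n_patients for p in prob_lists):
        raise ValueError("all models must score the same patients")
    return [
        sum(w * p[i] for w, p in zip(weights, prob_lists)) / total
        for i in range(n_patients)
    ]


# Hypothetical probabilities from four models for three patients.
logistic_regression = [0.20, 0.70, 0.40]
svm = [0.30, 0.60, 0.50]
random_forest = [0.10, 0.80, 0.45]
gradient_boosting = [0.15, 0.90, 0.55]

# Hypothetical weights, e.g. proportional to each model's validation score.
weights = [0.8, 0.8, 0.9, 1.0]

ensemble_probs = weighted_ensemble(
    [logistic_regression, svm, random_forest, gradient_boosting], weights
)
# Flag a patient as at-risk when the combined probability reaches 0.5.
at_risk = [p >= 0.5 for p in ensemble_probs]
```

Averaging probabilities rather than hard votes lets a confident model outweigh several uncertain ones, which is one common reason to prefer soft voting when the base models differ in quality.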
Results: The developed ensemble model for cardiovascular disease (based on 131 variables) achieved an Area Under the Receiver Operating Characteristic curve (AU-ROC) score of 83.1% using no laboratory results, and 83.9% accuracy with laboratory results. In diabetes classification (based on 123 variables), the eXtreme Gradient Boosting (XGBoost) model achieved an AU-ROC score of 86.2% (without laboratory data) and 95.7% (with laboratory data). For pre-diabetic patients, the ensemble model had the top AU-ROC score of 73.7% (without laboratory data), and for laboratory-based data XGBoost performed the best at 84.4%. The top five predictors in diabetes patients were 1) waist size, 2) age, 3) self-reported weight, 4) leg length, and 5) sodium intake. For cardiovascular diseases the models identified 1) age, 2) systolic blood pressure, 3) self-reported weight, 4) occurrence of chest pain, and 5) diastolic blood pressure as key contributors.
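For readers unfamiliar with the AU-ROC metric reported above, the following sketch computes it from first principles. It uses the standard pairwise interpretation: AU-ROC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `au_roc` and the toy labels and scores are illustrative assumptions, not data from the study.

```python
from typing import Sequence


def au_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AU-ROC as the fraction of (positive, negative) pairs ranked correctly.

    labels are 1 (disease) or 0 (no disease); scores are model outputs
    where a higher value means higher predicted risk.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0   # positive ranked above negative
            elif p == n:
                correct += 0.5   # a tie counts as half
    return correct / (len(positives) * len(negatives))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = au_roc(toy_labels, toy_scores)  # 0.75
```

The double loop is quadratic in the number of patients, which is fine for illustration; a rank-based formulation gives the same value more efficiently on large survey datasets.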
Conclusion: We conclude that machine-learned models based on survey questionnaires can provide an automated identification mechanism for patients at risk of diabetes and cardiovascular diseases. We also identify key contributors to the prediction, which can be further explored for their implications on electronic health records.
Keywords: Ensemble learning; Feature learning; Health analytics; Machine learning.
Conflict of interest statement
The authors declare that they have no competing interests.
Figures
![Fig. 1](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6836338/bin/12911_2019_918_Fig1_HTML.gif)
![Fig. 2](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6836338/bin/12911_2019_918_Fig2_HTML.gif)
![Fig. 3](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6836338/bin/12911_2019_918_Fig3_HTML.gif)
![Fig. 4](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6836338/bin/12911_2019_918_Fig4_HTML.gif)
![Fig. 5](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6836338/bin/12911_2019_918_Fig5_HTML.gif)
![Fig. 6](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6836338/bin/12911_2019_918_Fig6_HTML.gif)
![Fig. 7](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6836338/bin/12911_2019_918_Fig7_HTML.gif)
![Fig. 8](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6836338/bin/12911_2019_918_Fig8_HTML.gif)