AMIA Jt Summits Transl Sci Proc. 2024; 2024: 509–514.
Published online 2024 May 31.
PMCID: PMC11141860
PMID: 38827084

Large Language Models for Efficient Medical Information Extraction

Navya Bhagat,(1) Olivia Mackey, M.S.,(1) and Adam Wilcox, PhD(1)

Abstract

Extracting valuable insights from unstructured clinical narrative reports is a challenging yet crucial task in the healthcare domain, as it allows healthcare workers to treat patients more efficiently and improves the overall standard of care. We employ ChatGPT, a large language model (LLM), and compare its performance to manual reviewers. The review focuses on four key conditions: family history of heart disease, depression, heavy smoking, and cancer. The evaluation of a diverse sample of History and Physical (H&P) notes demonstrates ChatGPT’s remarkable capabilities. Notably, it exhibits exemplary sensitivity for depression and heavy smoking and exemplary specificity for cancer. We also identify areas for improvement, particularly in capturing nuanced semantic information related to family history of heart disease and cancer. With further investigation, ChatGPT holds substantial potential for advancements in medical information extraction.

Introduction

Importance of Medical Text

In healthcare settings, medical text offers a way to uncover information that is hidden or not readily available; extracting it, however, remains a challenging task. Over the years, significant steps have been taken to use electronic health records (EHRs) in meaningful ways for data analytics and decision support. Achieving this would assist physicians and other healthcare workers in treating patients more efficiently and improving the overall standard of care. A persistent challenge, however, is the effective extraction of this text: studies have shown that text reports contain important information that is not machine-interpretable. Incorporating natural language processing (NLP)-derived information significantly enhances clinical data tools, especially for tasks like case detection. The introduction of large language models (LLMs) presents an opportunity to accelerate the use of computerized language processing.

NLP Background

For more than two decades, NLP, which employs computational techniques for the purpose of learning, understanding, and producing human language content (2), has demonstrated significant capacity to extract crucial information from medical text, with recent years showcasing immense potential in healthcare settings. Despite its promise, integrating NLP into clinical settings proves difficult. NLP tools need to be used more broadly and with greater diversity of application; however, current approaches appear limited in both scope and institution (8).

Recent NLP/LLM Advancements

The effectiveness of NLP in healthcare has been underscored by various studies. One recent successful use of NLP to extract data from clinical notes is the MedTagger framework, which demonstrated a specificity of 0.96 and a sensitivity of 0.98 in identifying certain risk factors of heart disease (7). Another significant tool is the open-source clinical Text Analysis and Knowledge Extraction System, or “cTAKES,” from Mayo Clinic. This system not only accelerates medical research by facilitating the identification of trends and correlations within vast patient datasets but also empowers clinicians with rapid access to patient histories and relevant information, fostering better-informed decision-making at the point of care (10).

ChatGPT and its Progression in Comparison to Other LLMs

Applications of LLMs have been studied for years with moderate success, but more recently, ChatGPT has achieved major recognition and advancement. Notably, the Generative Pre-trained Transformer (GPT) architecture, exemplified by ChatGPT, stands as a pinnacle of this progress. The key advancement lies in OpenAI’s fine-tuning approach, which prioritizes the alignment of language models with human-defined objectives and produced the InstructGPT model. Unlike its predecessors, ChatGPT exhibits an innate understanding of medical concepts, allowing it to effectively extract medical information. This empowers it to accurately identify conditions of interest, encompassing a spectrum from cancer types to postpartum depression, and enables it to meticulously parse extensive notes for pertinent information (8).

Our Goal

In this study we aim to use GPT-3.5-Turbo, an LLM, to unlock structured data from clinical text and demonstrate advancements in medical language understanding and extraction. We modeled our study on previous evaluations of NLP systems with medical text (4). Our objective is to assess the extent of its success and challenges in mastering NLP semantics and syntax intricacies and in simplifying clinical rules. By evaluating its performance on conditions such as family history of heart disease, cancer, depression, and heavy smoking, we aim to gain a comprehensive understanding of the model’s potential in the healthcare domain. Analyzing ChatGPT’s performance against manual review will illuminate its strengths and weaknesses, guiding future improvements and fostering responsible, informed usage in clinical settings. In essence, this approach has the potential to transform the healthcare landscape by bridging the gap between raw EHR data and actionable clinical knowledge, ultimately improving patient outcomes and advancing the frontiers of medical science.

Methods

Study Design/Data

Hospitals affiliated with Washington University in St. Louis provided the dataset for this study. The History and Physical (H&P) Notes were selected as they contain crucial information about a patient’s medical history, physical examination findings, and initial assessments. These notes serve as a source of textual data, enabling a comprehensive analysis of ChatGPT’s capabilities in extracting relevant medical information. The study encompassed a total of 100 H&P Notes (n=100), representing a diverse set of patients. To ensure a representative sample, these notes were randomly selected from the pool of patients who received medical care at WashU Hospitals throughout the year 2022. Random selection ensured a fair distribution of cases across various medical conditions and demographics, minimizing potential biases and enhancing the generalizability of the findings.

Condition Selection

In collaboration with a clinical informatician, a comprehensive list of medical conditions was evaluated to identify the most relevant and clinically significant conditions for this study. The final selection included four key conditions: family history of heart disease, cancer (including remission and any personal history of cancer), depression, and heavy smoking.

Family history of heart disease is a crucial risk factor in assessing an individual’s predisposition to cardiovascular issues. Accurate identification of this condition from patient notes can aid healthcare professionals in making informed decisions regarding preventive measures and treatment strategies (13).

Depression is a prevalent mental health disorder with complex symptomatology. Efficiently detecting depression-related cues in medical notes can support early intervention and appropriate treatment plans.

Cancer remains a significant global health concern, and the detection of any personal history or remission status is vital for understanding a patient’s medical background. NLP techniques capable of accurately identifying cancer-related information from clinical records can aid in cancer surveillance and treatment planning.

The decision to focus on heavy smokers, rather than regular smokers, was driven by the recognition that at a certain level of smoking, individuals are at significantly higher risk for developing lung cancer and other serious respiratory conditions.

Manual Review of Data

The manual review was conducted by two proficient students, both native English speakers: one with a technical background and the other with a clinical background. In cases of disagreement between the reviewers, a third-party adjudicator, a clinical informatician, was engaged to provide an objective evaluation and facilitate consensus. To ensure the accuracy and credibility of the manual review, the findings were verified by a licensed physician. Under the annotation guidelines, the reviewers were instructed to mark a condition as “present” if any indication of its presence was identified within the clinical notes.
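The disagreement-resolution rule described above can be sketched as a small function (a hypothetical simplification for illustration; the actual adjudication was performed by a human clinical informatician, not by code):

```python
def consensus(reviewer_a, reviewer_b, adjudicator):
    """Resolve two independent annotations for one condition in one note.

    Each argument is True if that person marked the condition "present".
    Agreement between the two reviewers is final; otherwise the
    adjudicator's judgment decides.
    """
    if reviewer_a == reviewer_b:
        return reviewer_a
    # Disagreement: defer to the clinical-informatician adjudicator.
    return adjudicator
```

Under the study’s annotation guideline, a reviewer’s input here would be True whenever any indication of the condition appears in the note.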

Automated Review of Data

To facilitate the automated review process, we employed the Azure OpenAI API, a secure and efficient interface for interacting with the GPT-3.5-Turbo engine. The patient notes were iteratively fed into the API using a defined prompt designed to elicit binary responses. This prompt guided the model to assess whether the text indicated the presence of each medical condition under consideration. A temperature of 0 was applied to ensure deterministic responses, minimizing randomness and enhancing consistency in the model’s output. The top-p value was set at 0.95 to influence the model to generate concise yet contextually relevant responses. Both frequency and presence penalties were set to 0, avoiding any artificial constraints on the generation process.
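A minimal sketch of this loop using the Azure OpenAI Python client is shown below. The prompt wording and deployment name are assumptions (the paper does not publish its exact prompt); the decoding parameters match those reported above. The network call is shown commented out so the sketch stands alone:

```python
def build_prompt(note_text: str, condition: str) -> str:
    """Prompt designed to elicit a binary yes/no answer for one condition."""
    return (
        f"Does the following clinical note indicate {condition}? "
        f"Answer only 'yes' or 'no'.\n\nNote:\n{note_text}"
    )

def parse_binary(reply: str) -> bool:
    """Map the model's free-text reply onto present (True) / absent (False)."""
    return reply.strip().lower().startswith("yes")

# Hypothetical call with the parameters reported in the paper:
# from openai import AzureOpenAI
# client = AzureOpenAI(api_key="...", api_version="2023-05-15",
#                      azure_endpoint="https://<resource>.openai.azure.com")
# resp = client.chat.completions.create(
#     model="gpt-35-turbo",   # deployment name is an assumption
#     messages=[{"role": "user",
#                "content": build_prompt(note, "a family history of heart disease")}],
#     temperature=0,          # deterministic output
#     top_p=0.95,
#     frequency_penalty=0,
#     presence_penalty=0,
# )
# present = parse_binary(resp.choices[0].message.content)
```

Iterating this call over the 100 notes, once per condition, yields the binary labels that were compared against the manual review.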

Statistical Analysis

Sensitivity, specificity, and F-score measures were calculated to assess the accuracy and overall effectiveness of ChatGPT’s predictions. Additionally, the sample of clinical notes was run through ChatGPT 3 times to test for robustness.
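These measures follow the standard definitions over a 2x2 confusion matrix. The sketch below is illustrative; the example counts are hypothetical (chosen to be consistent with the depression row of Table 2, not counts the paper publishes):

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int):
    """Sensitivity, specificity, and F-score from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall: fraction of true cases found
    specificity = tn / (tn + fp)   # fraction of non-cases correctly excluded
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, f_score

# Hypothetical counts for depression: all 16 true cases detected,
# 4 false positives among the 84 notes without depression.
sens, spec, f = binary_metrics(tp=16, fp=4, fn=0, tn=80)
```

With these counts, sensitivity is 1.00, specificity is about 0.95, and the F-score is about 0.89, consistent with the reported depression results.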

Results

Sample Representativeness

Table 1 shows the sample representativeness for this study. According to the manual review, out of the 100 notes, 33 notes indicated a family history of heart disease, 16 indicated depression, 5 indicated a current heavy smoker, and 27 indicated cancer or a personal history of cancer. Table 1 illustrates additional characteristics of the sample of patients and clinical text.

Table 1:

Demographics of sample and description of clinical notes

Characteristic                      Total (n=100)
Male                                51 (51%)
Age, years, mean (SD)               57.68 (23.87)
Age 0-17 years                      9 (3%)
Age 18-40 years                     10 (10%)
Age 41-60 years                     24 (24%)
Age 61-80 years                     47 (47%)
Age 81+ years                       7 (7%)
Family history of heart disease     33 (33%)
Personal history of cancer          27 (27%)
Depression                          16 (16%)
Current heavy smoker                5 (5%)
Average text length (word count)    1405.38

*Note that 3 age values are missing.

Performance

ChatGPT demonstrates exceptional sensitivity for depression (1.00) and heavy smokers (1.00). Additionally, it achieved perfect specificity for cancer (1.00). Table 2 presents the sensitivity, specificity, and F-score for each condition.

Robustness

In each of the 3 trials, ChatGPT yielded virtually identical outputs, producing the same results. Thus, for this study, ChatGPT demonstrated impressive robustness.

Discussion

The performance of ChatGPT in this study, characterized by its accuracy in identifying medical conditions from clinical notes, reaffirms its effectiveness as a versatile language model. While its proficiency has been demonstrated in diverse applications, including successful performance on medical school and bar exams in prior studies (5, 6), our study emphasizes a distinctive realm of utility: efficient extraction of essential information from clinical narratives. The significance of ChatGPT’s performance lies in its practicality within healthcare informatics. The ability to effectively extract valuable data from clinical notes holds tangible implications for patient care: it addresses a critical need by enabling healthcare professionals to swiftly access pertinent medical insights from intricate text, which holds the potential to enhance the quality of patient treatment. Despite the model’s success, the results show room for improvement. Discrepancies and errors in ChatGPT’s responses can be categorized into two main domains: semantics and syntax.

Sensitivity Analysis

False negatives, reflected by lower sensitivity scores, are associated with semantic understanding. The sensitivity scores indicate ChatGPT’s ability to correctly identify instances of the studied conditions. Notably, ChatGPT demonstrated high sensitivity in detecting depression (1.00) and heavy smokers (1.00), signifying its adeptness in recognizing textual cues associated with these conditions. However, the sensitivity scores for family history of heart disease (0.58) and cancer (0.85) were comparatively lower, suggesting that ChatGPT encountered challenges in capturing nuanced semantic information related to these conditions.

Specificity Analysis

False positives, reflected by lower specificity scores, are attributed to syntax-related errors. ChatGPT exhibited impressive specificity, demonstrating its capability to accurately recognize cases where the specified conditions were absent. High specificity scores were observed for all conditions: family history of heart disease (0.97), depression (0.95), heavy smoker (0.94), and cancer (1.00), indicating ChatGPT’s proficiency in discerning negation across various conditions.

An intriguing aspect of this study pertains to the utilization of ChatGPT within the framework of the HIPAA rules and regulations. The use of advanced language models like ChatGPT in healthcare settings necessitates careful consideration of compliance with these regulations.

We employed a local instance of ChatGPT hosted on the Azure OpenAI platform. This localized deployment enabled us to operate within the institution’s secure firewall, thereby ensuring the confidentiality and privacy of patient data.

Limitations

This study has several limitations. First, our analysis focused exclusively on four distinct medical conditions, chosen to encompass a diverse spectrum of NLP challenges within clinical notes; the generalizability of our findings to other medical conditions or contexts may therefore be constrained. Second, our study centered on a single language model, GPT-3.5. It is possible that subsequent versions, such as GPT-4, could exhibit enhanced performance and refined capabilities. Furthermore, this study was primarily exploratory, aiming to assess the initial capabilities of ChatGPT in the healthcare domain. Consequently, the sample size of 100 clinical notes may not fully capture the complete range of language complexities found in medical records. A more extensive and targeted investigation with a larger dataset and diverse patient histories could provide a more nuanced evaluation of ChatGPT’s abilities in a clinical context.

Conclusion

This study holds significance as it not only serves as a rigorous assessment of the clinical applicability of LLMs, specifically ChatGPT, but also underscores its distinct advantages over conventional NLP approaches, which typically require task-specific training and can be more time-consuming to use. ChatGPT exhibited proficiency with minimal prerequisites, forgoing the need for extensive training or development. Its ease of integration, enhanced by a straightforward interfacing program, suggests the potential for widespread use by medical institutions. With further investigation into fine-tuning the model and constructing a comprehensive set of rules for extracting information from unstructured clinical narrative text, ChatGPT holds the ability to advance the domain of medical informatics.

Financial Acknowledgments

This research is partially supported by the Institute for Informatics, Data Science & Biostatistics at Washington University School of Medicine in St. Louis and the National Library of Medicine of the National Institutes of Health under Award Number R25LM014224.

Figures & Table


Schematic of the manual review, automated review, and input of the H&P notes

Table 2:

Overview of the sensitivity, specificity, and F-score by condition.

Condition                          Sensitivity   Specificity   F-score
Family history of heart disease    0.58          0.97          0.70
Depression                         1.00          0.95          0.89
Heavy smoker                       1.00          0.94          0.62
Cancer                             0.85          1.00          0.92

References

1. Adamidi ES, Mitsis K, Nikita KS. Artificial intelligence in clinical care amidst COVID-19 pandemic: A systematic review. Comput Struct Biotechnol J. 2021;19:2833–2850. doi: 10.1016/j.csbj.2021.05.010. Epub 2021 May 7. PMID: 34025952; PMCID: PMC8123783. [PMC free article] [PubMed] [Google Scholar]
2. Ford E, Carroll JA, Smith HE, Scott D, Cassell JA. Extracting information from the text of electronic medical records to improve case detection: a systematic review. J Am Med Inform Assoc. 2016 Sep;23(5):1007–15. doi: 10.1093/jamia/ocv180. Epub 2016 Feb 5. PMID: 26911811; PMCID: PMC4997034. [PMC free article] [PubMed] [Google Scholar]
3. Hirschberg J, Manning CD. Advances in natural language processing. Science. 2015 Jul 17;349(6245):261–6. doi: 10.1126/science.aaa8685. PMID: 26185244. [PubMed] [Google Scholar]
4. Hripcsak G, Friedman C, Alderson PO, DuMouchel W, Johnson SB, Clayton PD. Unlocking clinical data from narrative reports: a study of natural language processing. Ann Intern Med. 1995 May 1;122(9):681–8. doi: 10.7326/0003-4819-122-9-199505010-00007. PMID: 7702231. [PubMed] [Google Scholar]
5. Katz Daniel Martin, Bommarito Michael James, Gao Shang, Arredondo Pablo. GPT-4 Passes the Bar Exam. 2023. [PMC free article] [PubMed]
6. Kung Tiffany H., Cheatham Morgan, Medenilla Arielle, Sillos Czarina, Leon Lorie De, Elepaño Camille, Madriaga Maria, et al. “Performance of ChatGPT on USMLE: Potential for AI-Assisted Medical Education Using Large Language Models.” PLOS Digital Health. 2023;2(2):e0000198. [PMC free article] [PubMed] [Google Scholar]
7. Milintsevich K., Sirts K., Dias G. “Towards Automatic Text-Based Estimation of Depression through Symptom Prediction.” Brain Informatics. 2023;10(1) doi: 10.1186/s40708-023-00185-9. [PMC free article] [PubMed] [CrossRef] [Google Scholar]
8. Moon Sungrim, Liu Sijia, Scott Christopher G., Samudrala Sujith, Abidian Mohamed M., Geske Jeffrey B., Noseworthy Peter A., et al. “Automated Extraction of Sudden Cardiac Death Risk Factors in Hypertrophic Cardiomyopathy Patients by Natural Language Processing.” International Journal of Medical Informatics. 2019;128(August):32. [PMC free article] [PubMed] [Google Scholar]
9. Ouyang Long, Wu Jeff, Jiang Xu, Almeida Diogo, Wainwright Carroll L., Mishkin Pamela, Zhang Chong, et al. “Training Language Models to Follow Instructions with Human Feedback.” 2022. [CrossRef]
10. Savova GK, Masanz JJ, Ogren PV, Zheng J, Sohn S, Kipper-Schuler KC, Chute CG. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010 Sep-Oct;17(5):507–13. doi: 10.1136/jamia.2009.001560. PMID: 20819853; PMCID: PMC2995668. [PMC free article] [PubMed] [Google Scholar]
11. Scheitel Marianne R., et al. “Effect of a novel clinical decision support tool on the efficiency and accuracy of treatment recommendations for cholesterol management.” Applied clinical informatics 26.01. 2017. pp. 124–136. [PMC free article] [PubMed]
12. Wang L., Ruan X., Yang P., Liu H. “Comparison of Three Information Sources for Smoking Information in Electronic Health Records.” Cancer Informatics. 2016;15(December) doi: 10.4137/CIN.S40604. [PMC free article] [PubMed] [CrossRef] [Google Scholar]
13. Yoon Paula W., Maren Maren T., Peterson-Oehlke Kris L., Gwinn Marta, Faucett Andrew, Khoury Muin J. “Can Family History Be Used as a Tool for Public Health and Preventive Medicine?” Genetics in Medicine: Official Journal of the American College of Medical Genetics. 2002;4(4):304–10. [PubMed] [Google Scholar]
14. Yoo Sooyoung, Yoon Eunsil, Boo Dachung, Kim Borham, Kim Seok, Paeng Jin Chul, Yoo Ie Ryung, et al. “Transforming Thyroid Cancer Diagnosis and Staging Information from Unstructured Reports to the Observational Medical Outcome Partnership Common Data Model.” Applied Clinical Informatics. 2022;13(3):521. [PMC free article] [PubMed] [Google Scholar]

Articles from AMIA Summits on Translational Science Proceedings are provided here courtesy of American Medical Informatics Association
