Pre-test Prediction of Non-ischemic Cardiomyopathies using Time-Series EHR Data
Abstract
Clinical imaging is an important tool for diagnosing non-ischemic cardiomyopathies (NICM). However, accurate interpretation of imaging studies often requires readers to review patient histories, a time-consuming and tedious task. We propose using time-series analysis of longitudinal electronic health record (EHR) data to predict the most likely NICMs, serving as a concise proxy for the full patient record. Time-series-formatted EHR data preserves temporality information important for accurate prediction of disease. Specifically, we leverage ICD-10 codes and various recurrent neural network architectures for predictive modeling. We trained our models on a large cohort of NICM patients who underwent cardiac magnetic resonance imaging (CMR) and a smaller cohort who underwent echocardiography. Across all models, the proposed technique achieved good mean micro-averaged area under the curve (0.8357), F1 score (0.5708), and precision at 3 (0.8078) for CMR, but only moderate performance for transthoracic echocardiography (TTE), at 0.6938, 0.4399, and 0.5864, respectively. We show that our model has the potential to provide an accurate pre-test differential diagnosis, thereby potentially reducing the clerical burden on physicians.
Introduction
Non-ischemic cardiomyopathies (NICM) are a serious set of diseases afflicting the heart(1). The presentation and disease course are highly varied, even within a single etiology(2). Common to all NICMs are a high risk of heart failure and the potential need for heart transplantation. Early detection through clinical imaging is critical for effective patient management(3). However, accurate interpretation of imaging studies is at least partly dependent on having succinct and relevant patient history available at the time of interpretation(4). Assembling such information can be difficult given the long and potentially varied symptom histories associated with these diseases. Furthermore, the amount of data available in the electronic health record (EHR) is increasing with time. Therefore, reviewing patient history for such important information adds both clerical and standard-of-care responsibilities for readers, who often already face increasing workloads.
One method to push patient information to readers is summarization of clinical notes. Several groups have proposed methods to automatically produce discharge summaries(5,6), as discharge summaries provide a natural dataset from which to learn pertinent information. Alsentzer and Kim proposed an extractive model using a long short-term memory (LSTM) network to identify relevant entities to include in discharge notes, achieving a high F1 score of 0.88(5). More recently, Searle et al combined extractive summarization with abstractive summarization (producing free text) using pre-trained large language models to produce full notes(6). Unfortunately, the results fell well short of summarization in general-domain data, achieving F1 scores well below 0.5; this poor result signifies the difficulty of clinical free-text summarization. Similar techniques have also been applied to radiological ordering. For instance, Kalra et al recently used machine learning models to classify term frequency-inverse document frequency (TF-IDF) features of free-text orders into unique imaging protocols(7), achieving good accuracies as high as 0.84 for certain classes of protocols in their focused task. However, such automated ordering does not use the full clinical narrative and therefore cannot inform readers, at the point of interpretation, of information potentially pertinent to diagnosis. Although summarization of patient histories would be an ideal tool to solve this issue, current methods do not yet achieve satisfactory performance(8).
We propose to simplify this task by providing the reader a pre-test probability of disease as a concise proxy for the full clinical history using time-series models. Casting the problem this way offers three primary advantages: 1) it alleviates the need to create and annotate a dataset of relevant patient history, 2) it leverages a large amount of patient history information efficiently, and 3) it implicitly allows us to integrate temporal information into our models. Therefore, we use time-series EHR data of patient problem lists, encoded as ICD-10 codes, to predict the diagnosis rendered by the imaging study.
Time-series modeling of EHR data is a well-established field. Rahimian et al demonstrated that the use of machine learning and temporal information increased performance over traditional models for risk prediction using EHR data(9). Hidden Markov models have been widely applied to time-series modeling, although under the assumption that the probability of a change in the hidden state depends on the time between observations(10). Bayesian networks have also been used to model time series, under the assumption that the graphical model represents the conditional dependencies between a set of inputs(11). However, neither of these statistical frameworks is robust to the irregularly spaced events common in EHR data.
Recently, recurrent neural networks (RNNs) have been applied to a variety of temporal data analytics tasks in healthcare(12). Lipton et al first cast irregular data as a missing-data problem(13). They found that simple RNN architectures generalize well to time-series predictive tasks even without complex imputation strategies. Subsequent innovations with RNNs include architectural modifications to the RNN cells that explicitly learn the impact of missingness(14,15). Transformer frameworks have also been applied to time-series prediction; for example, Zerveas et al proposed a transformer-based framework that took first place on 12 popular datasets at the time of publication(16). Such approaches are particularly well suited to EHR data analysis due to their ability to capture deep hierarchical features and long-range dependencies.
In this work, we leverage various deep learning architectures to learn from sequential ICD-10 diagnosis codes in order to predict the final disease diagnosis from a cardiac imaging study. Our objective is to develop a model that informs readers of clinical images of the most likely disease diagnoses at the time of the imaging study. Such a deep learning architecture has the potential to aggregate a large amount of temporally sparse information while mitigating the temporal uncertainty associated with diagnosis codes. We demonstrate the ability of our proposed method to provide an accurate differential diagnosis, as abstracted from patient charts, in a cohort of patients undergoing transthoracic echocardiogram (TTE) and/or cardiac magnetic resonance imaging (CMR), both clinical imaging staples for diagnosing and prognosticating NICM.
Data and Methods
Data
This study was approved by the Cleveland Clinic institutional review board. Our data were drawn from multiple sites within a single hospital system. The overall distribution of sites is heavily skewed towards a single campus; therefore, we combined all data into a single comprehensive bucket, with the knowledge that clinical practice and coding standards may differ dramatically between sites. The dataset was constructed via convenience sampling. The NICM cohort was drawn from another study comprising adult patients who underwent a CMR exam between 2002 and 2021 at a Cleveland Clinic site.
All patients were reviewed for definitive diagnosis through chart review by a clinical research fellow using the relevant guidelines(17–22). Accuracy of the annotations was then confirmed by a level 3 board-certified cardiologist. Specifically, cardiac amyloidosis was determined through a characteristic pattern of late gadolinium enhancement on CMR, with a large subset of patients also having positive confirmatory testing(23,24). Hypertrophic cardiomyopathy (HCM) was determined through the CMR biomarkers of left ventricular wall thickness >15mm, absence of abnormal loading conditions, and absence of infiltrative cardiomyopathies(17). Diagnosis of ischemic cardiomyopathy (ICM) was determined by review of patient history for revascularization, myocardial infarction, or multi-vessel disease, together with ejection fraction <40%(25). Non-differentiated NICM was determined to be any patient suffering from heart failure without a specific etiology(26). Definitive diagnosis of cardiac sarcoidosis was determined by either positive histopathology for granulomatous inflammation and/or electrocardiographic abnormalities combined with reduced systolic function(21,27). Cases of suspected myocarditis were identified from CMR showing myocardial dysfunction and diffuse late gadolinium enhancement; all cases were then validated using endomyocardial biopsy(28). Dilated cardiomyopathy (DCM) was determined by a left ventricular end-diastolic volume index or diameter >2 and ejection fraction <50%(22). A patient could have multiple diagnoses at the same time (e.g. DCM stemming from ICM).
A total of 1738 CMR studies were included in this dataset: 756 NICM, 318 ICM, 231 cardiac amyloidosis (AMYL), 79 HCM, 238 sarcoidosis, 239 myocarditis, and 131 DCM. The mean age at the time of CMR was 56.57±15.40 years. Of the 1,742 patients, 574 were female and 1,168 were male.
In addition, we evaluated the applicability of this methodology to other clinical imaging modalities, given that the distribution of available longitudinal data will be very different; therefore, we also investigated TTE. We identified all patients in this cohort with an echocardiogram performed at one of the sites in the hospital system within 3 years of the CMR, under the assumption that the final diagnosis and disease severity would not have significantly changed in this time. This dataset includes a total of 330 TTE studies: 122 NICM, 64 ICM, 40 AMYL, 19 HCM, 52 sarcoidosis, 60 myocarditis, and 13 DCM.
For longitudinal analysis, we pulled all ICD-10 diagnosis codes associated with each patient/event. This resulted in a dataset of approximately 1.3 million individual codes across 2707 different diagnoses. Unsurprisingly, this data source has a long tail of rare codes, which greatly increases the sparsity of our feature space. Therefore, we removed all codes appearing in less than 1% of patients, reducing the number of unique code features to 186.
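This prevalence filter can be sketched as follows (a minimal stdlib-only sketch; function and variable names are ours, not the study's code):

```python
from collections import defaultdict

def filter_rare_codes(patient_codes, min_fraction=0.01):
    """Keep only ICD-10 codes appearing in at least `min_fraction` of patients.

    patient_codes: dict mapping patient_id -> set of ICD-10 codes.
    Returns the set of retained codes.
    """
    n_patients = len(patient_codes)
    counts = defaultdict(int)
    for codes in patient_codes.values():
        for code in set(codes):   # count each patient at most once per code
            counts[code] += 1
    return {c for c, n in counts.items() if n / n_patients >= min_fraction}
```

Applying this to the full cohort reduces the feature space from 2707 codes to the 186 retained here.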
Preprocessing
We formalize our problem as follows. A patient observation refers to a pair (∆ti, xi), in which i dictates the observation's temporal order, ∆ti is the difference in time between the observation and the index event, and xi ∈ {0,1}^p is a multi-hot vector over all p possible diagnosis codes in our dataset such that xi,j represents the presence of diagnosis j during observation i. Then, the vector Xl = [(∆t1, x1), ..., (∆tnl, xnl)] represents the time-series EHR of patient l, where nl is the number of observations for patient l. Associated with this is the multi-hot vector yl = [c1, c2, ..., c7] representing the cardiomyopathies at the index event for patient l.
Therefore, an imaging event is defined as an index event for which there is confirmation of at least one of the seven cardiomyopathies. The diagnosis codes occur at highly asynchronous points depending on the patient. We address this issue by normalizing the relevant time window and the time intervals represented by each observation point in the time-series. First, we restricted to diagnosis codes recorded within the 182 days preceding the given patient's index event. Second, we binned the observation interval into 7-day periods. For patients with data shorter than the 182-day period, we zero-padded (or NaN-padded, depending on the model architecture) the time-series. Lastly, we recognize that although each diagnosis code is recorded at a single discrete time, it often represents an ongoing disease state; therefore, last observation carried forward (LOCF) imputation was applied. The imputation also helps to mitigate sparsity. A schematic of the setup of our data is shown in Figure 1.
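The 182-day windowing, weekly binning, and LOCF steps can be sketched as follows (function and variable names are illustrative, not the study's code; we assume the retained codes have been mapped to feature columns):

```python
import numpy as np

def build_sequence(events, code_index, window_days=182, bin_days=7):
    """Bin diagnosis events into weekly multi-hot vectors with LOCF.

    events: list of (days_before_index, icd10_code) tuples.
    code_index: dict mapping each retained ICD-10 code to a feature column.
    Returns an array of shape (window_days // bin_days, len(code_index)).
    """
    n_bins = window_days // bin_days  # 26 weekly observation points
    seq = np.zeros((n_bins, len(code_index)), dtype=np.float32)
    for days_before, code in events:
        if code not in code_index or not (0 <= days_before < window_days):
            continue
        # bin 0 is the oldest week; the last bin is closest to the index event
        b = n_bins - 1 - int(days_before // bin_days)
        seq[b, code_index[code]] = 1.0
    # last observation carried forward: once recorded, a code persists
    seq = np.maximum.accumulate(seq, axis=0)
    return seq
```

Weeks with no events remain zero-padded; a NaN-padded variant would substitute NaN for the untouched bins before the LOCF pass.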
Our sequence data are summarized in Table 1. The average patient had a modest number of diagnoses, but these diagnoses were spread over a long time window, further incentivizing a time-series approach. The largest population in our dataset was patients with undifferentiated non-ischemic cardiomyopathy, serving as our control. While all classes are minority classes relative to the output space, none are infrequent enough to require further data manipulation.
Table 1:
Distribution of diseases, sequence length, and number of diagnosis per patient.
| Modality | Disease | Studies (before) | Avg. encounters (before) | Avg. unique codes (before) | Studies (after) | Avg. encounters (after) | Avg. unique codes (after) |
|---|---|---|---|---|---|---|---|
| CMR | Totals | 1870 | 5.0 | 9.43 | 1738 | 4.8 | 5.97 |
| | NICM | 795 | 4.8 | 9.00 | 756 | 3.9 | 5.56 |
| | ICM | 345 | 5.1 | 10.27 | 318 | 4.4 | 7.00 |
| | AMYL | 247 | 6.5 | 11.79 | 231 | 5.7 | 7.44 |
| | HCM | 83 | 4.0 | 7.36 | 79 | 3.3 | 5.11 |
| | Sarcoidosis | 249 | 5.3 | 9.92 | 238 | 4.3 | 5.88 |
| | Myocarditis | 277 | 4.4 | 8.59 | 239 | 3.8 | 5.51 |
| | DCM | 141 | 4.4 | 9.60 | 131 | 3.8 | 5.60 |
| Echo | Totals | 439 | 5.1 | 9.48 | 330 | 4.8 | 6.28 |
| | NICM | 169 | 4.5 | 8.43 | 122 | 3.8 | 5.42 |
| | ICM | 82 | 5.4 | 11.26 | 64 | 4.9 | 7.54 |
| | AMYL | 51 | 6.7 | 10.86 | 40 | 5.4 | 6.47 |
| | HCM | 20 | 3.0 | 5.95 | 19 | 3.0 | 4.63 |
| | Sarcoidosis | 64 | 6.3 | 11.67 | 52 | 5.2 | 7.24 |
| | Myocarditis | 86 | 4.4 | 8.20 | 60 | 3.8 | 5.51 |
| | DCM | 20 | 3.7 | 6.50 | 13 | 3.0 | 4.33 |
Models
We explored several time-series deep learning models, including variants of recurrent neural networks (RNNs) and transformer models. Specifically, we investigated simple RNNs, LSTMs, bidirectional GRUs, and transformers, all of which have been shown to be suitable for time-series EHR prediction(29,30). In contrast to traditional feedforward networks, these models have intrinsic structures for harnessing antecedent temporal data: simple RNNs process inputs sequentially, LSTMs and GRUs introduce gating mechanisms to regulate information flow, and transformers forego sequential processing in favor of parallel attention mechanisms. These primary architectures are referenced herein as RNN, LSTM, GRU, and TransformerModel, respectively. Our exploration also incorporated two additional model categories. 1) CellAttention models use their particular cell for feature abstraction, subsequently channeling the extracted features into a transformer encoder. This category comprises the TST, RNNAttention, LSTMAttention, and GRUAttention models; notably, the TST model's cell is linear, making it a PyTorch implementation of the work by Zerveas et al(16). 2) Conversely, TransformerCell models take the reverse approach: a transformer encoder feeds forward into the cell layers. We note that TransformerModel is effectively the TransformerCell analogue of the TST, routing the transformer encoder's outputs into a linear layer, which is the standard architecture of an attention-based transformer as originally introduced. The implementations of these models were sourced from tsai(31).
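As a concrete illustration, a CellAttention-style model of the kind described above might look like the following PyTorch sketch (layer sizes and head counts are illustrative only; the tuned hyperparameters and the tsai implementations are not reproduced here):

```python
import torch
import torch.nn as nn

class GRUAttention(nn.Module):
    """Sketch of a CellAttention-style model: a recurrent cell extracts
    per-step features, which are then passed through a transformer encoder.
    Hyperparameter values here are illustrative, not the tuned ones."""

    def __init__(self, n_codes=186, hidden=64, n_classes=7, n_heads=4):
        super().__init__()
        self.gru = nn.GRU(n_codes, hidden, batch_first=True)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):            # x: (batch, 26 weeks, n_codes)
        h, _ = self.gru(x)           # sequential feature abstraction
        h = self.encoder(h)          # attention over the 26 time steps
        return self.head(h[:, -1])   # multi-label logits at the index event
```

A TransformerCell model would simply reverse the order of the encoder and the recurrent cell.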
We compared the time-series models to a single-time-point random forest baseline using the last time point in the LOCF dataset as features. The randomForestSRC package was used specifically for its 'imbalanced' function, which handles the two-class imbalanced problem using a cost-weighted Bayes classifier(32). This was especially pertinent considering that each cardiomyopathy was characterized by a sparse number of positive instances. Consequently, a distinct univariate model was trained for each cardiomyopathy, and performance was measured globally over the aggregated multi-label predictions.
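A minimal Python analogue of this baseline substitutes scikit-learn's RandomForestClassifier with balanced class weights for the R randomForestSRC 'imbalanced' function (the two are not identical; this sketches only the one-forest-per-class setup):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_baseline(X_last, Y, n_classes=7):
    """Fit one cost-weighted binary forest per cardiomyopathy.

    X_last: (n_patients, n_codes) multi-hot features at the last LOCF
            time point (a single-time-point vector per patient).
    Y:      (n_patients, n_classes) multi-hot cardiomyopathy labels.
    """
    models = []
    for c in range(n_classes):
        rf = RandomForestClassifier(
            n_estimators=100, class_weight="balanced", random_state=0)
        rf.fit(X_last, Y[:, c])   # univariate model for cardiomyopathy c
        models.append(rf)
    return models
```

The multi-label prediction for a patient is then the concatenation of the seven per-class outputs.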
All experiments were conducted using a training/validation/testing split of 70/15/15. The models were exposed to the testing set only once hyperparameter tuning was finished. Hyperparameters were tuned using a grid search. Following training, probabilities were calibrated by fitting isotonic regression on the training data, and a classification threshold was determined for each cardiomyopathy by scanning 100 evenly spaced values from 0 to 1 and maximizing the F1 score over the training data. These thresholds were validated to improve metrics over the validation set, and were then used to obtain the final model results over the test set. For evaluation metrics, we focused on micro-averaged area under the receiver operating characteristic curve (AUC), micro-averaged F1 score, and precision at 3 (P@3). The P@3 metric reflects whether the expected classes fall within the three highest-probability classes. Finally, we evaluated the importance of LOCF with respect to our data structure by measuring the impact of this preprocessing step on the F1 score of the models for the CMR dataset.
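The per-class threshold search described above can be sketched as follows (a simplified single-class version with illustrative names; the isotonic-regression calibration step is omitted):

```python
import numpy as np

def best_threshold(y_true, y_prob, n_splits=100):
    """Pick the classification threshold maximizing F1 on training data,
    scanning `n_splits` evenly spaced candidates in [0, 1]."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.linspace(0, 1, n_splits):
        pred = (y_prob >= t).astype(int)
        tp = int(((pred == 1) & (y_true == 1)).sum())
        fp = int(((pred == 1) & (y_true == 0)).sum())
        fn = int(((pred == 0) & (y_true == 1)).sum())
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```

In the full pipeline this search is run once per cardiomyopathy on the calibrated training probabilities, and the seven resulting thresholds are checked against the validation set before test-set evaluation.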
Data preprocessing was done in R using the data.table 1.13 library(33). All experiments were coded in Python 3.9 using PyTorch 1.12(34). Each model was trained using the '1cycle' policy as implemented in tsai(35). All models were trained on a 32GB NVIDIA V100 GPU. The code will be made public on GitHub once accepted for publication.
Results
RNN models can provide pre-test probability of disease in CMR
Table 2 shows the classification outcomes of all models trained using CMR index events. The overall results were promising, with the best-performing model achieving an AUC of 0.8446, F1 of 0.5873, and P@3 of 0.8739. There was no clear best model, as all time-series models achieved AUCs between 0.8214 and 0.8463 and F1 scores between 0.5593 and 0.5873. The P@3 had a wider spread, between 0.7644 and 0.8739. The standard RNN model performed comparably to more complex models, although on average models with some form of attention mechanism achieved higher performance. For comparison, the random forest model produced a marginally higher AUC than the time-series models but posted much lower retrieval metrics.
Table 2:
Predictive performance for pre-CMR disease prediction.
Model | AUC | F1 | Recall | Prec | P@3 |
---|---|---|---|---|---|
RNN | 0.8287 | 0.5709 | 0.5066 | 0.6538 | 0.8198 |
LSTM | 0.8214 | 0.5620 | 0.5099 | 0.6260 | 0.7798 |
GRU | 0.8315 | 0.5694 | 0.5298 | 0.6154 | 0.7642 |
TransformerRNN | 0.8396 | 0.5736 | 0.5033 | 0.6667 | 0.8222 |
TransformerLSTM | 0.8453 | 0.5593 | 0.5000 | 0.6345 | 0.8011 |
TransformerGRU | 0.8446 | 0.5873 | 0.5066 | 0.6986 | 0.8739 |
Transformer | 0.8295 | 0.5719 | 0.5464 | 0.6000 | 0.7644 |
TST | 0.8383 | 0.5853 | 0.5397 | 0.6392 | 0.7778 |
RNNAttention | 0.8463 | 0.5651 | 0.5033 | 0.6441 | 0.8018 |
LSTMAttention | 0.8376 | 0.5693 | 0.4967 | 0.6667 | 0.8417 |
GRUAttention | 0.8302 | 0.5645 | 0.5000 | 0.6481 | 0.8391 |
Random Forest | 0.8559 | 0.4730 | 0.7119 | 0.3542 | 0.5342 |
Performance differs heavily by class
The average predictive ability of our longitudinal approach for predicting the broad spectrum of NICMs is moderate. However, there were significant differences in discriminative ability across individual etiologies, as presented in Table 3. There was a difference of 0.2267 in AUC between the most accurate disease class (ICM) and the least accurate (myocarditis). This vast difference is also reflected in the F1 scores, with a difference of 0.6130 between AMYL and DCM. The recall for myocarditis and DCM is also extremely low, which suggests difficulty predicting these diseases from the given data sources. The results roughly trend with the number of studies in each class (Table 1).
Table 3:
Median metrics by class for CMR.
Class | AUC | F1 | Recall | Prec |
---|---|---|---|---|
NICM | 0.7676 | 0.6694 | 0.6694 | 0.6747 |
ICM | 0.8617 | 0.6585 | 0.6098 | 0.7205 |
AMYL | 0.8216 | 0.6832 | 0.6023 | 0.7690 |
HCM | 0.8377 | 0.5433 | 0.4815 | 0.6583 |
Sarcoidosis | 0.8008 | 0.5167 | 0.4444 | 0.6125 |
Myocarditis | 0.6350 | 0.1177 | 0.0909 | 0.1835 |
DCM | 0.6791 | 0.0702 | 0.0417 | 0.2250 |
Results for TTEs are comparatively worse
We also developed models to provide pre-test disease predictions for TTEs. The overall model performance and median metrics by class are shown in Table 4 and Table 5, respectively. The models on average achieved 0.1419 lower AUC and 0.1309 lower F1 score than their CMR counterparts. The median metrics by class reflect this lower performance, with the models almost failing to identify sarcoidosis, HCM, myocarditis, and DCM.
Table 4:
Predictive performance for pre-echocardiogram disease prediction.
Model | AUC | F1 | Recall | Prec | P@3 |
---|---|---|---|---|---|
RNN | 0.6605 | 0.3659 | 0.3333 | 0.4054 | 0.5631 |
LSTM | 0.6792 | 0.4222 | 0.4222 | 0.4222 | 0.5931 |
GRU | 0.7140 | 0.4471 | 0.4222 | 0.4750 | 0.6323 |
TransformerRNN | 0.7598 | 0.5435 | 0.5556 | 0.5319 | 0.6671 |
TransformerLSTM | 0.7207 | 0.4396 | 0.4444 | 0.4348 | 0.5691 |
TransformerGRU | 0.7246 | 0.5376 | 0.5556 | 0.5208 | 0.6117 |
Transformer | 0.7005 | 0.4742 | 0.5111 | 0.4423 | 0.5974 |
TST | 0.7050 | 0.3810 | 0.3556 | 0.4103 | 0.5777 |
RNNAttention | 0.6416 | 0.3913 | 0.4000 | 0.3830 | 0.5111 |
LSTMAttention | 0.7015 | 0.4706 | 0.4444 | 0.5000 | 0.6477 |
GRUAttention | 0.6246 | 0.3656 | 0.3778 | 0.3542 | 0.4801 |
Random Forest | 0.7533 | 0.4878 | 0.4444 | 0.5405 | 0.6882 |
Table 5:
Median metrics by class for echocardiogram.
Class | AUC | F1 | Recall | Prec |
---|---|---|---|---|
NICM | 0.7143 | 0.5881 | 0.6786 | 0.5278 |
ICM | 0.6939 | 0.4143 | 0.5000 | 0.3542 |
AMYL | 0.6944 | 0.5714 | 0.4444 | 0.8000 |
HCM | 0.7518 | 0.3333 | 0.5000 | 0.2500 |
Sarcoidosis | 0.9662 | 0.2823 | 1.0000 | 0.1833 |
Myocarditis | 0.5268 | 0.2500 | 0.2222 | 0.2857 |
DCM | 0.5000 | 0.0000 | 0.0000 | 0.0000 |
Last observation carried forward is important for model accuracy
Additionally, model training was repeated without carrying observations forward. The evaluated metrics for the CMR cohort are displayed in Figure 2, which shows a consistent decrease in performance. The models utilizing attention heads for feature extraction fare better, as they naturally partition the feature space and reduce sparsity. In contrast, the models that rely on sequential feature extraction show greatly reduced results. The biggest differences occurred for the RNN, LSTM, and GRU, all of which had F1 scores under 0.500. This diminished performance may be attributed to compromised learning efficacy during training: their inability to efficiently navigate the feature space leads to overfitting prior to nearing global minima. Regardless, all models trained without LOCF were outperformed by their counterparts in Table 2.
Discussion
In this work, we demonstrated a deep learning-based time-series modeling paradigm for delivering pre-test disease predictions for CMR and echocardiography. Radiology exams are most useful when answering a specific clinical question(4); however, the quality of requisitions is often lacking(36). The overall accuracy was not perfect for any specific model, but the results are encouraging given that a clinician would not be expected to know the definitive diagnosis prior to the ordered imaging study. Rather, clinical guidelines leverage imaging to provide more definitive evidence of a specific diagnosis in each of these diseases. The relatively high P@3 suggests that this model could be used to augment clinical histories on radiological requisitions for the radiologists or cardiologists interpreting CMR or echocardiography studies.
Despite the good mean AUC, F1, and P@3 metrics of the models, there was a wide distribution of disease-specific measurements. ICM achieved the highest AUC among the disease classes, while myocarditis achieved the lowest. ICM often has a specific clinical course reflecting ischemic disease; there are several important clinical events, including myocardial infarction or stroke, which could be unique identifiers for this disease within this cohort. Similarly, patients with AMYL often have long diagnostic pathways(18,24). This results in a uniquely lengthy and densely filled data dimension compared to the other classes, as shown in Table 1.
On the other hand, our models had difficulty accurately detecting myocarditis and DCM. Myocarditis is often an acute event, meaning there is naturally less data associated with each case. Although some presentations are chronic or produce chronic problems, the volume of prior diagnostic codes for myocarditis is small compared to the other classes. The relative number of DCM patients in our dataset is even smaller, with only the HCM cohort of comparable size. The lower number of events and the acuity make detecting patterns difficult. However, these are not the only reasons for poor performance, as evidenced by HCM, which has fewer CMR studies than the other classes. We hypothesize that its good performance in CMR reflects the fact that HCM is often already suspected via other imaging tests; CMR is usually ordered as a confirmatory test(17). Conversely, for the other five cardiomyopathies, the models yielded commendable results with impressive AUC scores and a balanced trade-off between recall and precision. The elevated precision metrics underscore the propensity of these models to make accurate positive predictions.
Also reflecting the issue of limited data, disease prediction for echocardiography was significantly less accurate than for CMR. First, the TTE cohort was substantially smaller, comprising just 330 studies, which makes learning disease-associated patterns harder. Second, TTE is often one of the first cardiac imaging tests ordered when cardiac disease is suspected; by comparison, 79.5% of CMR cases had a cardiovascular-related ICD-10 code versus 73.8% for TTE. Therefore, there is often less information that can be leveraged for any kind of prediction task, and the approach may not be useful in emergent or outpatient referral situations. Rather, current clinical practice relies on nurses or technicians at the point of care to record useful clinical history. Changes to the way we record patient histories, whether via patient-provided information or more extensive documentation at the point of care, may be needed to better inform decision making through AI models.
Deep learning-based time-series models seem capable of learning the uncertainty of variables through time(37), unlike conventional single-time-point models, as evidenced by the poor predictive power of the random forest model. Regardless of clinical imaging modality, the sparsity of diagnostic codes is a consistent feature of healthcare datasets not often seen in other time-series prediction tasks. Consequently, both zero imputation and LOCF can and do introduce extensive errors into our time-series data.
The inherent sparsity of diagnostic codes presents a unique challenge in healthcare datasets, setting them apart from other time-series prediction tasks. One key reason is the temporal limitation of diagnostic codes: they are typically entered at a single point in time, with little or no follow-up information on whether conditions have resolved or continue to affect the patient. This lack of temporal information puts the reliability of diagnostic codes in question when they are used for time-series analysis.
Moreover, the very nature of healthcare practice adds another layer of complexity. Diagnosis codes are often generated only when a patient encounter is dedicated to a particular set of symptoms. For instance, a cardiologist might only record codes relevant to heart issues, even if the patient has multiple comorbidities. This means that the absence of a diagnostic code does not necessarily indicate the absence of disease. Additionally, our healthcare institution is a quaternary care center dealing primarily with severe illness. On one hand, this means we have a unique population of seriously ill patients; on the other, many patients do not have an established history of primary care here, reducing the available time-series data.
Incorporating EHR free text is one way to accommodate the complexity of healthcare. Generating diagnosis codes for billing purposes tends to minimize clinical nuance, often poorly reflecting the true disease state. Codes also introduce considerable variability in documentation practice between institutions, such as differing time lags, which may impact the timeliness of decision support in an ambulatory setting. Therefore, methods to augment codes with free text will be necessary to mitigate such variability.
Limitations
This study contains several limitations. First, our models were trained only on a cohort of patients with cardiomyopathies. Although clinical imaging is strongly recommended for patients with suspected cardiomyopathies, they represent only a small portion of the possible cardiac disease spectrum. This subset of diseases also tends to have long diagnostic pathways, which may influence the accuracy of our model. Further studies inclusive of additional cardiac diseases are warranted. Second, this cohort represents only positively identified patients, introducing a selection bias because our cohort was built from patients with a CMR and a positive diagnosis of one of several diseases. Future work should include patients who are suspected of disease but have negative findings. Finally, not all disease labels here have equally certain disease states. The AMYL and HCM labels were assigned only after tertiary testing such as myocardial biopsy, whereas the DCM label cannot be established as precisely because there are no strict guidelines regarding its diagnosis.
Conclusion
The evolving landscape of clinical diagnostics has brought with it an exponential increase in available data. Extracting actionable insights remains challenging, and the necessary tools remain immature. This study set out to leverage time-series analysis methodologies to improve clinical inference. We demonstrated that deep learning time-series models have the potential to greatly reduce reliance on often imprecise orders or manual review of patient history when interpreting clinical imaging studies, by providing the most likely diseases prior to the imaging study. However, such an approach is still limited by the realities of medical practice, where evidence of disease is highly variable due to disease acuity and the fragmented nature of medical data. Methods to incorporate free text and improve access to patient data would be beneficial for better AI decision support models for image interpretation.