Radiol Artif Intell. 2023 May; 5(3): e220160.
Published online 2023 Apr 19. doi: 10.1148/ryai.220160
PMCID: PMC10245178
PMID: 37293347

DeePSC: A Deep Learning Model for Automated Diagnosis of Primary Sclerosing Cholangitis at Two-dimensional MR Cholangiopancreatography

Abstract

Purpose

To develop, train, and validate a multiview deep convolutional neural network (DeePSC) for the automated diagnosis of primary sclerosing cholangitis (PSC) on two-dimensional MR cholangiopancreatography (MRCP) images.

Materials and Methods

This retrospective study included two-dimensional MRCP datasets of 342 patients (45 years ± 14 [SD]; 207 male patients) with confirmed diagnosis of PSC and 264 controls (51 years ± 16; 150 male patients). MRCP images were separated into 3-T (n = 361) and 1.5-T (n = 398) datasets, of which 39 samples each were randomly chosen as unseen test sets. Additionally, 37 MRCP images obtained with a 3-T MRI scanner from a different manufacturer were included for external testing. A multiview convolutional neural network was developed, specialized in simultaneously processing the seven images taken at different rotational angles per MRCP examination. The final model, DeePSC, derived its classification per patient from the instance expressing the highest confidence in an ensemble of 20 individually trained multiview convolutional neural networks. Predictive performance on both test sets was compared with that of four licensed radiologists using the Welch t test.

Results

DeePSC achieved an accuracy of 80.5% ± 1.3 (sensitivity, 80.0% ± 1.9; specificity, 81.1% ± 2.7) on the 3-T and 82.6% ± 3.0 (sensitivity, 83.6% ± 1.8; specificity, 80.0% ± 8.9) on the 1.5-T test set and scored even higher on the external test set (accuracy, 92.4% ± 1.1; sensitivity, 100.0% ± 0.0; specificity, 83.5% ± 2.4). DeePSC outperformed radiologists in average prediction accuracy by 5.5 (P = .34, 3 T) and 10.1 (P = .13, 1.5 T) percentage points.

Conclusion

Automated classification of PSC-compatible findings based on two-dimensional MRCP was achievable and demonstrated high accuracy on internal and external test sets.

Keywords: Neural Networks, Deep Learning, Liver Disease, MRI, Primary Sclerosing Cholangitis, MR Cholangiopancreatography

Supplemental material is available for this article.

© RSNA, 2023

Summary

DeePSC, a deep neural network for classification of primary sclerosing cholangitis on two-dimensional MR cholangiopancreatography images, outperformed four experienced radiologists in average prediction accuracy and generalizability on multivendor MRI data.

Key Points

  • The proposed deep convolutional neural network, DeePSC, automatically classified primary sclerosing cholangitis–compatible findings on 3-T and 1.5-T MR cholangiopancreatography images with high accuracy (3 T: 80.5% ± 1.3; 1.5 T: 82.6% ± 3.0).
  • Generalizability was demonstrated after testing the model on external 3-T MR cholangiopancreatography data obtained from an MRI scanner of a different vendor (accuracy, 92.4% ± 1.1).
  • The algorithm surpassed average benchmarks of primary sclerosing cholangitis classification accuracy set by four experienced human readers by 5.5 (3 T) and 10.1 (1.5 T) percentage points.

Introduction

Primary sclerosing cholangitis (PSC) is a chronic cholestatic liver disease characterized by progressive multifocal bile duct strictures due to biliary inflammation and fibrosis (1). This rare liver disorder is often associated with inflammatory bowel disease and is considered a premalignant condition, as patients are typically at high risk of developing colorectal and hepatobiliary malignancies (1,2). Liver transplantation remains the only curative therapy, with a disease recurrence rate of up to 25% (2,3).

MR cholangiopancreatography (MRCP) has been established as the noninvasive imaging modality of choice for the detection of PSC-compatible bile duct changes and disease-related complications (4). However, reading MRCP scans is subjective, requires sufficient expertise and experience, and is often time-consuming, particularly for subtle findings. Furthermore, MRCP interpretation varies even among PSC experts and experienced radiologists and often shows only poor interreader agreement between serial follow-up examinations of patients with PSC (5,6). This highlights the need and opportunity for clinical decision support through automated evaluation of MRCP images to improve PSC diagnosis.

Deep learning approaches, such as convolutional neural networks (CNNs), are substantially advancing medical imaging (7,8). Traditionally, these models are trained and validated on single images. However, recent research shows that the predictive accuracy of classification tasks can be significantly improved by creating a joint decision from multiple images of the same object of interest (9). Since MRCP scans usually include several images (10), each taken from different projections covering the intra- and extrahepatic bile duct system, a multiview classification network that aggregates information of all views per patient is promising.

We aimed to verify the aptitude of an artificial intelligence–based clinical decision support system for the automated classification of PSC-compatible cholangiographic findings on two-dimensional (2D) MRCP images by developing a deep learning model, measuring its performance across different datasets, and putting the results into clinical context. To the best of our knowledge, this is the first study to apply deep learning methods to 2D MRCP data; therefore, we were particularly interested in whether a highly accurate, automated diagnosis on this type of data was possible. Specifically, we developed an end-to-end deep multiview CNN ensemble model (DeePSC) and evaluated its performance on 2D MRCP datasets obtained at magnetic field strengths of 1.5 T and 3 T with different MRI scanners by different manufacturers to ensure generalizability. Furthermore, we compared the performance of our classification network with that of four radiologists with varying levels of experience in reading MRCP images. Last, we applied explainability measures to visualize and verify the model's predictions.

Materials and Methods

Study Design, Patients, and Controls

This retrospective study was approved by the University Medical Center Hamburg-Eppendorf institutional review board (no. 2021–100723-B0-ff). The requirement for written informed consent was waived due to the retrospective nature of the study. Patients with confirmed diagnosis of large-duct PSC were included. The reference standard for diagnosis was established noninvasively according to the guidelines of the European Association for the Study of the Liver, which are defined as follows: (a) elevated serum markers of cholestasis (eg, serum alkaline phosphatase, γ-glutamyl transferase) not otherwise explained, (b) characteristic bile duct changes with multifocal strictures and segmental dilatations visualized with MRCP and/or endoscopic retrograde cholangiopancreatography, and (c) exclusion of causes of secondary sclerosing cholangitis and other cholestatic disorders (2). Patients without characteristic bile duct alterations and patients with indefinable diagnosis of PSC were excluded. The corresponding control group consists of patients without any history of immune-mediated liver or bile duct disease and without any visible bile duct alterations on MRCP images or on images from any other of the performed MRI sequences (for indications of MRI in controls, see Table S1). All patients underwent clinical MRI scans including 2D MRCP at our medical university center between 2002 and 2022. Patients and controls were identified via an in-house picture archiving and communication system query. Figure 1 depicts the data flow in more detail.

Overall data flow. A database of 860 patients was split into a primary sclerosing cholangitis (PSC) and non-PSC group. Some patients and their respective MR cholangiopancreatography (MRCP) data were excluded due to the denoted reasons. MRCP data of both groups were separated into 3-T and 1.5-T datasets depending on the magnetic field strength used during acquisition and partitioned into respective training and test datasets. Some patients underwent examinations at both 3 T and 1.5 T during clinical follow-up and can therefore have MRCP images in both datasets, hence the observed overlap.

MRI Protocol

All MRI examinations included in this study were performed with either a 1.5-T scanner (Achieva, Ingenia, or Intera; Philips Healthcare, or MAGNETOM Symphony; Siemens Healthineers) or a 3-T scanner (Achieva, Ingenia, or Intera; Philips Healthcare) according to routine clinical protocol. An additional independent validation cohort, which we refer to as the different vendor test dataset, was obtained with a 3-T MAGNETOM Vida machine (Siemens Healthineers). Distribution of patients to the different MRI machines was based on availability only and had no relation to medical history or diagnosis. Patients were advised to fast for 4 hours prior to the study to reduce fluid secretion within the gastrointestinal system. In patients without contraindication, 20 mg of scopolamine butylbromide (Buscopan; Sanofi-Aventis) was additionally injected intravenously to minimize gastrointestinal motility and, thus, motion artifacts during imaging. Each MRCP examination consisted of seven radial images from different angular points of view.

The detailed MRI protocol and imaging parameters are provided in Appendix S1.

MRCP Evaluation by Algorithm and Certified Radiologists

As shown in Figure 1, two separate internal datasets of MRCP images obtained at 3 T and 1.5 T were collected; 39 samples from each set were randomly assigned (with stratification by demographic information) as test datasets. The remaining samples were used for training the deep learning model. Additionally, four radiologists with varying levels of experience in reading MRCP images (2, 3, 4, and 9 years’ experience for reviewers R1, R2, R3, and R4, respectively) independently and blindly analyzed both internal test sets and binarily classified them according to previously published PSC-compatible findings based on the 2D MRCP sequence only (4).

Data Preprocessing

Maximum gray values varied greatly across samples (see Fig S1), but linear rescaling of the maximum and minimum value per image to 1 and 0, respectively, did not provide a biologically meaningful homogenization, as similar structures (eg, bile ducts) did not share gray values of a similar range after this preprocessing. This may be due to the presence of artifacts (eg, the very bright leftover liquid in the stomach present on some MRCP images). To mitigate this issue, we applied contrast-limited adaptive histogram equalization (11) to each MR image to locally enhance the contrast and then mapped the 95th percentile of each image's gray value histogram to 1 and the 5th percentile to 0 (see Fig 2). This smooths out the influence of outlier pixels and areas and provides a robust preprocessing procedure across all used datasets (see ablation study in Fig S2).
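The percentile mapping step can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the contrast-limited adaptive histogram equalization step is omitted (in practice it would come from an image library, for example `skimage.exposure.equalize_adapthist` with its `clip_limit` argument, which is an assumption about tooling), and the function names `percentile` and `percentile_normalize` are hypothetical.

```python
def percentile(sorted_values, q):
    """Linearly interpolated q-th percentile (0 <= q <= 100) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("empty input")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def percentile_normalize(image, low_q=5.0, high_q=95.0):
    """Map the low_q percentile of an image to 0 and the high_q percentile to 1.

    `image` is a list of rows of gray values. Values are rescaled linearly and
    NOT clipped, so outliers end up below 0 or above 1.
    """
    flat = sorted(v for row in image for v in row)
    p_low = percentile(flat, low_q)
    p_high = percentile(flat, high_q)
    if p_high == p_low:
        raise ValueError("no contrast between the chosen percentiles")
    scale = p_high - p_low
    return [[(v - p_low) / scale for v in row] for row in image]


# Toy one-row "image" with gray values 0..100 (illustrative data only).
toy = [[float(v) for v in range(101)]]
normalized = percentile_normalize(toy)
```

Because the output is not clipped, the brightest pixels map to values above 1 and the darkest to values below 0, which matches the description of the adapted images in Figure 2.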

Original MR cholangiopancreatography (MRCP) images of different patients and respective histograms (left two columns) and preprocessed images and histograms after applying contrast-limited histogram equalization with a contrast-clip limit of 0.015 (right two columns) (11). The substantially different maximum gray values of the MRCP images are notable. Before input to the network, all images were normalized to 1 on the 95th percentile of their histogram (dashed orange line) and to 0 on the fifth percentile (dashed blue line). This was found to provide an appropriate dataset homogenization, where similar biologic structures share similar gray values across patient samples. This is made visible on the adapted images where the bile ducts share pixel values greater than 1 (highlighted in orange), and pixel values less than 0 (highlighted in blue) belong to irrelevant structures of the background.

Deep Learning Architecture

The proposed deep learning model builds on three increasing levels of complexity (Fig 3). First, a single-view CNN (SVCNN) is trained on all individual images (MRCP views) of all patients per dataset until convergence on the task of binary classification of PSC versus non-PSC. An ImageNet-pretrained (12) SqueezeNet (13) serves as the backbone architecture for feature extraction. This conventional approach of training and predicting on individual images serves as our baseline.

Overall structure of the deep learning convolutional neural network (CNN) for the automated diagnosis of primary sclerosing cholangitis (PSC) model (DeePSC). (A) The baseline single-view CNN (SVCNN) is trained with a sequential input of all MR cholangiopancreatography (MRCP) views of all patients until convergence. The SVCNN is based on the SqueezeNet architecture (13) and consists of the convolutional feature extractor (FE) and the fully connected (FC) classification layer. (B) The trained SVCNN is extended to the multiview CNN (MVCNN) by multiplying the FE by the number of MRCP views per patient and introducing the attention-based view-fusion (AVF) layer between the FE and classification layer. Here, all views of a single patient are processed in parallel and aggregated in the AVF, while the classification layer derives a single prediction per patient. (C) Twenty individually trained MVCNNs are combined in the highest confidence ensemble (HCE) to form the final DeePSC model, where only the prediction with the highest-class probability across all 20 MVCNNs is forwarded as the final classification per patient. Conv = convolutional, Max = maximum.

In the second level, the trained SVCNN is extended to form the multiview CNN (MVCNN) architecture (14). For this, the feature extraction backbone is duplicated to enable parallel input and processing of all seven MRCP views per patient. All branches of the feature extractor in the MVCNN share the same weights, which are adopted from the trained SVCNN and then frozen. Additionally, we introduce a trainable attention-based view-fusion layer between the output of the seven feature extractors and the final classification layer to form a weighted average of the incoming latent representations (8,15).

The classifier then processes the combined latent representation and infers a single class probability. The weights of the classification and attention-based view-fusion layers are again trained until convergence on the same datasets as the SVCNN.
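The idea of fusing the seven per-view representations into one weighted average can be sketched as follows. This is a simplified illustration under stated assumptions, not the published layer: each view is scored by a dot product with a single learned vector (`attention_vector`, a hypothetical parameter), the scores are turned into weights with a softmax, and the feature vectors are averaged with those weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(view_features, attention_vector):
    """Fuse per-view feature vectors by attention-weighted averaging.

    view_features: one feature vector per view, all of equal length.
    attention_vector: hypothetical learned scoring vector of the same length.
    Returns (fused_vector, weights), where the weights sum to 1.
    """
    scores = [sum(f * a for f, a in zip(feat, attention_vector))
              for feat in view_features]
    weights = softmax(scores)
    dim = len(attention_vector)
    fused = [sum(w * feat[i] for w, feat in zip(weights, view_features))
             for i in range(dim)]
    return fused, weights


# Two toy views with 2-dimensional features (illustrative values only).
views = [[1.0, 0.0], [3.0, 2.0]]
fused_uniform, w_uniform = attention_fuse(views, [0.0, 0.0])
fused_biased, w_biased = attention_fuse(views, [1.0, 1.0])
```

With a zero scoring vector every view receives the same weight and the fusion reduces to a plain mean; a nonzero vector shifts weight toward the views it scores highest, which is what makes the fusion sample dependent.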

In the third level, an ensemble of 20 individual instances of the MVCNN is trained on the same data with varying random seeds. The seed influences weight initialization, random augmentations, and sample order during training, which can lead to convergence to a different local minimum of the loss landscape and therefore to different results. To increase predictive robustness of the full model, only the prediction of the MVCNN instance that expresses the highest class probability (and therefore the highest confidence) is considered for the final prediction of the ensemble model. This is further denoted as highest confidence ensemble and represents our final model, DeePSC.
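The selection rule of the highest confidence ensemble can be written down in a few lines. The sketch below assumes that each ensemble member outputs a probability for the PSC class and that 0.5 is the decision threshold; both the function name and the threshold are illustrative assumptions rather than details taken from the published code.

```python
def highest_confidence_prediction(psc_probabilities):
    """Pick the ensemble member that is most confident about either class.

    psc_probabilities: P(PSC) from each ensemble member for one patient.
    Returns (label, confidence, member_index).
    """
    if not psc_probabilities:
        raise ValueError("at least one ensemble member is required")
    # For a binary problem, confidence is the probability of the predicted class.
    confidences = [max(p, 1.0 - p) for p in psc_probabilities]
    best = max(range(len(confidences)), key=confidences.__getitem__)
    label = "PSC" if psc_probabilities[best] >= 0.5 else "non-PSC"
    return label, confidences[best], best


# Three hypothetical members: two lean weakly toward PSC, one is confident in non-PSC.
label, confidence, member = highest_confidence_prediction([0.6, 0.55, 0.1])
```

In this toy case a majority vote would return PSC, whereas the highest confidence rule follows the single member that is most certain and returns non-PSC.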

Hyperparameters were optimized by searching a predefined parameter space using fivefold cross-validation on the 3-T training set. Using the best hyperparameters found, 3-T and 1.5-T versions of DeePSC were retrained on their respective full training sets and evaluated on the corresponding unseen test sets. Final retraining and evaluation were repeated five times with different random seeds to assess the stability of our method.

Detailed information about data preprocessing and training parameters is provided in Appendix S1. All models were implemented using PyTorch in Python 3.6 (Python Software Foundation). Conceptual code can be found at github.com/imsb-uke/DeePSC.

Different Vendor Validation

To verify whether DeePSC could robustly classify an independent group of patients who underwent examination with an MRI scanner from a different manufacturer, the different vendor test set consisting of 37 MRCP images obtained at 3 T on a Siemens scanner (MAGNETOM Vida, Siemens Healthineers) was included. We used the DeePSC model trained on Philips 3-T data and evaluated its performance on the novel Siemens data. All MRCP images in the different vendor test set stem from new patients unknown to the model (ie, patients who were not part of the training set).

Explainability

To understand which visual features underlie the decision-making process of the model, gradient-weighted class activation maps were calculated on the last convolutional layer of DeePSC for every MRCP in the 3-T test set (16). These class-related heatmaps provide a visual cue of where the network looks to make its prediction.
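For readers unfamiliar with gradient-weighted class activation maps, the core computation is compact: each feature map of the chosen convolutional layer is weighted by the spatial mean of the class-score gradient with respect to that map, the weighted maps are summed, and negative values are set to zero. The sketch below shows this arithmetic on nested lists; in a real pipeline the activations and gradients would be obtained from the network (for example, with framework hooks), which is not shown and is outside what the text specifies.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap from K feature maps and their gradients.

    activations, gradients: lists of K maps, each an H x W list of lists.
    Returns an H x W heatmap: ReLU(sum_k alpha_k * A_k), where alpha_k is the
    spatial mean of the gradient of the class score with respect to map k.
    """
    if not activations or len(activations) != len(gradients):
        raise ValueError("need the same non-zero number of maps and gradients")
    heatmap = [[0.0] * len(row) for row in activations[0]]
    for fmap, grad in zip(activations, gradients):
        n_positions = sum(len(row) for row in grad)
        alpha = sum(sum(row) for row in grad) / n_positions
        for i, row in enumerate(fmap):
            for j, value in enumerate(row):
                heatmap[i][j] += alpha * value
    # ReLU keeps only regions that support the class of interest.
    return [[max(0.0, v) for v in row] for row in heatmap]


# Two toy 2 x 2 feature maps: the first supports the class, the second opposes it.
acts = [[[1.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]],
         [[-1.0, -1.0], [-1.0, -1.0]]]
heat = grad_cam(acts, grads)
```

The resulting low-resolution map is typically upsampled to the input image size and overlaid as a color heatmap.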

Statistical Analysis

Statistical analysis of the demographic data was performed using GraphPad Prism (version 9.1.1, GraphPad Software). Continuous and categorical data were compared using the Mann-Whitney U test and Fisher exact test, respectively. Predictive performances of the deep learning models as well as the radiologists were compared using the Welch t test provided by the SciPy package for Python 3.10 (version 1.9.2, SciPy). Interreader reliability was assessed using Fleiss κ. Statistical significance was defined as P < .05.
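As a reminder of what the Welch t test computes, the following sketch derives the test statistic and the Welch-Satterthwaite degrees of freedom using only the Python standard library. It does not compute a P value; with SciPy, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the full test. The two samples are made-up numbers for illustration and are not study data.

```python
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return the Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard errors of the two sample means (unequal variances allowed).
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical samples with clearly different variances.
t_stat, dof = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
```

Unlike the Student t test, this variant does not assume equal variances, which suits a comparison between a small group of human readers and repeated model runs.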

Results

Patient Characteristics and Metadata

The initial database consisted of 860 patients, of whom 566 had an established diagnosis of PSC and 294 were controls without any history of immune-mediated liver or bile duct disease or any visible bile duct alterations on 2D MRCP images. Of these 566 patients with PSC, 27 with a diagnosis of small duct PSC and 67 with indefinable PSC were excluded. Due to incomplete or qualitatively insufficient clinical and MRI data, 115 patients with PSC and two controls had to be retrospectively excluded. With this, 649 patients (357 with PSC and 292 controls) were subsequently divided into MRI examinations performed with a magnetic field strength of 3 T and 1.5 T. To reduce complexity, these examinations were further filtered to only include MRCP scans that follow the clinic's standard protocol of exactly seven MR images taken from different rotational angles with an original image size of 512 × 512 pixels. Finally, a total of 606 patients (342 with PSC, 45 years ± 14 [SD]; 207 male patients; 264 controls, 51 years ± 16; 150 male controls) were included in this study, resulting in 361 MRCP images taken at 3 T (189 with PSC and 172 controls) and 398 MRCP images taken at 1.5 T (283 patients with PSC and 115 controls). The observed overlap is due to the 153 patients who underwent examinations at both magnetic field strengths during clinical follow-up (Fig 1). If a patient underwent multiple examinations at the same magnetic field strength, only the most recent MRCP examination fulfilling the quality criteria was included in the respective dataset, such that all MRCP images per dataset stem from unique patients.

We found no evidence of differences with respect to age or sex when comparing patients with PSC and controls. Baseline demographics are presented in Table 1 and Table 2. A more detailed breakdown of demographic and meta-information of the datasets can be found in Figure S1.

Table 1:

Metadata Characteristics of Study Datasets

Table 2:

Demographic Characteristics of Study Datasets

Classification Results

The final DeePSC ensemble model achieved an accuracy of 80.5% ± 1.3 on the unseen test set of the 3-T dataset and 82.6% ± 3.0 on the 1.5-T dataset, outperforming the average results of the four radiologists by 5.5 (P = .34) percentage points and 10.1 (P = .13) percentage points, respectively. However, these differences did not reach statistical significance (Table 3, Fig S3). Compared with individual radiologist predictions, the DeePSC model performed slightly better than the best human reader, R1, on the 1.5-T test set and outperformed three of four human readers on the 3-T test set. Notably, on both datasets, DeePSC outperformed the most experienced radiologist, R4, with 9 years of experience in reading MRCP images.

Table 3:

Predictive Performance of Model and Radiologists on Unseen Test Sets of 3-T and 1.5-T MR Cholangiopancreatography Images

Predictive performance of the proposed algorithm improved among all metrics on both datasets with higher levels of model complexity. Compared with processing individual MRCP images in the SVCNN, using the combined information of all views per patient in the MVCNN improved average accuracy by 1.4 (P < .001) and 3.5 (P < .001) percentage points on the 3-T and 1.5-T test set, respectively. This was further enhanced by 2.2 (P = .02) and 2.4 (P = .19) percentage points after applying the highest confidence ensemble of the DeePSC model.

Interreader Reliability

Taking individual sample predictions into account, interreader reliability among radiologists was low, with a Fleiss κ of 0.384 (95% CI: 0.223, 0.548) on the 3-T test set and 0.410 (95% CI: 0.254, 0.583) on the 1.5-T test set. This shows considerable disagreement among radiologists when asked to base their classification of PSC solely on the MRCP image data. The five DeePSC models trained on different random seeds achieved a substantially higher Fleiss κ of 0.948 (95% CI: 0.867, 1) and 0.868 (95% CI: 0.736, 0.969) on the 3-T and 1.5-T test sets, respectively. When comparing the final predictions of the best performing DeePSC model with those of the best performing radiologist, R1, Cohen κ values of 0.584 (95% CI: 0.328, 0.840) on the 3-T data and 0.592 (95% CI: 0.304, 0.840) on the 1.5-T data were achieved.
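To make the agreement statistic concrete, the following sketch computes Fleiss κ from a table of per-case category counts. It is a generic textbook implementation, not the analysis code used in the study, and the example ratings of four readers on four cases are invented for illustration.

```python
def fleiss_kappa(ratings):
    """Fleiss kappa for a table of category counts.

    ratings: one row per rated case; each row holds how many raters chose each
    category (for example [n_PSC, n_non_PSC]). Every row must sum to the same
    number of raters. Undefined (division by zero) if all ratings fall into a
    single category.
    """
    n_cases = len(ratings)
    n_raters = sum(ratings[0])
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("every case must be rated by the same number of raters")
    n_categories = len(ratings[0])
    # Share of all assignments that went to each category.
    p_cat = [sum(row[j] for row in ratings) / (n_cases * n_raters)
             for j in range(n_categories)]
    # Observed agreement per case, then averaged over cases.
    p_case = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in ratings]
    p_observed = sum(p_case) / n_cases
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1.0 - p_expected)


# Four hypothetical readers, four cases, columns = [PSC votes, non-PSC votes].
toy_ratings = [[4, 0], [2, 2], [0, 4], [3, 1]]
kappa_toy = fleiss_kappa(toy_ratings)
kappa_perfect = fleiss_kappa([[4, 0], [0, 4]])
```

A value of 1 indicates complete agreement, whereas values near 0 indicate agreement no better than chance.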

Different Vendor Validation

On the different vendor test dataset, the DeePSC ensemble models trained on the 3-T internal dataset achieved 92.4% ± 1.1, 93.5% ± 0.9, 100.0% ± 0.0, and 83.5% ± 2.4 for accuracy, F1 score, sensitivity, and specificity, respectively (see Table 4, Fig S4 for receiver operating characteristic curves). The best performing model misclassified only two of the 17 samples in the control group (see Fig S5). Of note, we found substantial improvement in average accuracy by 11.9 (P < .001) percentage points between SVCNN and MVCNN. Applying the highest confidence ensemble of DeePSC reduced specificity by 0.2 (P = .86) percentage points and improved sensitivity by 1.9 (P < .001) percentage points.
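The relationship between these metrics can be checked with a small worked example. The counts below are inferred rather than reported: 37 samples with 17 controls imply 20 patients with PSC, two misclassified controls give 15 true negatives, and zero missed PSC cases are assumed for the best performing model because sensitivity is reported as 100.0% ± 0.0. The resulting values are therefore derived arithmetic for a single model, not figures from Table 4.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, and F1 score from confusion counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Inferred counts for the best performing model on the different vendor test set.
best_model = binary_metrics(tp=20, fp=2, tn=15, fn=0)
```

Under these assumptions, accuracy is 35/37, specificity is 15/17, and the F1 score is 40/42, which is consistent with the best model scoring somewhat above the reported averages across the five runs.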

Table 4:

Predictive Performance of Model Trained on Internal 3-T Training Set and Evaluated on Unseen 3-T Siemens Test Set

Explainability

The gradient-weighted class activation heatmaps of the best performing DeePSC model revealed high activity values in the anatomic region of the biliary tree on 33 of 39 MRCP images in the 3-T test set, strongly suggesting that the model learned to base its classification on biologically relevant features. While on the remaining six MRCP images high activity in the biliary tree could not be observed (see Table 5), five of the six MRCP images were still classified correctly. Besides inherent model uncertainty, data uncertainty, such as gastrointestinal fluid and image artifacts, can lead to distraction and, thus, misclassification of the network (Fig 4).

Table 5:

Distribution of Correct and Incorrect Classifications (PSC and Non-PSC) of the DeePSC Model

Class-related activations in the last convolutional layer of the feature extractor for six patients in the 3-T MR cholangiopancreatography (MRCP) image dataset, calculated with gradient-weighted class activation maps (16). High activation values, on the red end of the spectrum, indicate a high correlation of those features with the model's prediction. (A, B) MRCP images that were correctly classified to the primary sclerosing cholangitis group. (C, D) MRCP images that were correctly classified to the control group with high activation in the area of the biliary tree. (E) MRCP image was correctly classified as belonging to the control group, but the class activation map shows that the model wrongly derived its prediction from the areas of the gastric corpus and the gallbladder. (F) The model incorrectly classified the MRCP image to the primary sclerosing cholangitis group based on irrelevant features of the colon.

Discussion

In the present study, we developed a multiview deep CNN ensemble for the automated classification of PSC-compatible bile duct alterations on 2D MRCP images taken at both 1.5 T and 3 T. To the best of our knowledge, this is the first study to use such a deep learning approach for the assessment of PSC on radial 2D MRCP datasets and contains the largest 2D MRCP dataset of patients with PSC and controls published to date. We compared the diagnostic performance of our model to that of four diagnostic radiologists at different levels of training and with varying experience in reading MRCP images, thus reflecting clinical practice. Our results show that the proposed model identifies patients with PSC based on 2D MRCP images both at 1.5 T and 3 T with high reliability, achieving higher accuracy, sensitivity, and specificity than the radiologists on average. Furthermore, the proposed model reached over 90% accuracy on an unseen test dataset obtained with an MRI machine from a different vendor than the training dataset, proving generalizability.

Combining the information of multiple MRCP views per patient with the attention-based view-fusion layer in the MVCNN consistently improved classification performance among all metrics and datasets compared with the baseline SVCNN. Similarly, besides a minor reduction in specificity on the different vendor test set, MVCNN performance consistently improved by applying the highest confidence ensemble of DeePSC on 20 individually trained MVCNNs. Given our main goal of providing robust and reliable performance across different datasets, the mentioned reduction in specificity of 0.24 percentage points on a single dataset is negligible when considering the gain in other metrics and datasets.

Since PSC is a rare disease with specific and sometimes subtle imaging features, high expertise and thorough knowledge are essential to reliably interpret the characteristic findings. The implementation of deep learning algorithms is expected to add substantial value in the decision-making process, given the high and robust performance of our proposed algorithm and the low interreader agreement of the radiologists observed in this study.

Most studies on the performance of deep learning algorithms in medical imaging suffer from substantial methodologic shortcomings, limiting their translation into the clinical setting (17). Many of the investigated studies did not compare the performance of their model with that of a human expert. Also, very few studies included an additional independent test dataset. This might result in incorrectly high values of accuracy due to overfitting and low generalizability of the proposed model (17). Both shortcomings were addressed in our work.

Ringe et al (18) showed that automated classification of PSC was feasible on three-dimensional MRCP images using majority voting on maximum intensity projections. Our DeePSC model represents an important extension to their work, as we provide realistic benchmarks for our dataset by comparing our method to the predictions of four radiologists. Furthermore, we demonstrate generalizability of our model through high performance on an independent test dataset collected with an MRI scanner from a different manufacturer than that used during model training.

While Ringe et al also improved their performance by combining information of different maximum intensity projections per patient by majority voting on the individual predictions of their network, our trainable attention-based view-fusion layer allows for more flexible and sample-related view-fusion and therefore more accurate predictions. Finally, we show that by employing an ensemble of multiple individually trained models, performance can be substantially improved compared with a single model instance.

Our study had some limitations. First, patient data were limited due to the low number of patients with this rare disease. Network performance could likely be improved by including more patient data. Second, we did not quantify the potential influence of decision support from DeePSC by analyzing the diagnostic performance of human readers paired with the proposed model in a clinical setting. However, this may be investigated in further studies. Third, the specificity of our model and of the radiologists might have been overestimated, because controls were required to be without any visible bile duct alterations on the same MRCP images used to evaluate predictive performance, and because we did not include patients with other forms of sclerosing cholangitis or biliary malignancy, or healthy individuals with nonspecific bile duct alterations. Fourth, with only four participating radiologists, the statistical power to detect significant differences between their performance and that of DeePSC was limited. Further studies should try to mitigate this issue by increasing the number of human evaluators.
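
The fourth limitation can be illustrated with the Welch t test, which was used for the comparison with the radiologists. The following minimal Python sketch computes the Welch statistic and the Welch-Satterthwaite degrees of freedom from first principles. The function name and the accuracy values are hypothetical placeholders and are not data from this study.

```python
import math

def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance t test on two samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    # Unbiased sample variances (divide by n - 1).
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (n_b - 1)
    # Squared standard errors of the two means.
    se2_a = var_a / n_a
    se2_b = var_b / n_b
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Made-up accuracies for illustration only (not results from this study).
reader_accuracy = [0.72, 0.77, 0.69, 0.82]        # four hypothetical readers
model_accuracy = [0.80, 0.82, 0.79, 0.81, 0.83]   # hypothetical repeated model runs

t_stat, dof = welch_t(model_accuracy, reader_accuracy)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

With only four values in one group, the degrees of freedom remain small, so even a sizeable difference in mean accuracy may not reach statistical significance.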

In conclusion, our study demonstrates that automated classification of PSC-compatible findings based on 2D MRCP images using multiview deep learning algorithms is achievable with high accuracy for both 1.5-T and 3-T MRI scans. The proposed DeePSC model outperformed four radiologists in mean classification accuracy by 5.5 (3 T) and 10.1 (1.5 T) percentage points, although these differences were not statistically significant (P = .34 and P = .13, respectively). It also demonstrated high performance in the classification of previously unseen data from an MRI scanner of a different manufacturer. After further training of the network using controls with other liver and bile duct diseases, DeePSC may in the future provide valuable clinical decision support to radiologists, reduce interreader variability, and thereby contribute to the early and precise diagnosis of PSC based on 2D MRCP images.

Acknowledgments:

We thank Esther Dietrich, MSc, for her helpful comments and suggestions, as well as Sergio Oller Moreno, PhD, Sven Heins, MSc, Ines Hiller, and Katja Füssel for their technical support.

*H.R. and F.W. contributed equally to this work.

**S.B. and C.S. are co–senior authors.

F.W. supported by the University Hospital of Hamburg-Eppendorf (UKE) M3I grant. S.B. supported by the UKE R3 reduction of animal testing grant. A.E., P.F., S.B., and C.S. supported by the Landesforschungsförderung Hamburg (LFF-FV 78) and the German Research Foundation (DFG)–funded CRU 306 grants. M.Z. supported by SFB 1192 project B8. C.S. supported by the YAEL Foundation and the Hannelore and Helmut Greve Foundation.

Disclosures of conflicts of interest: H.R. No relevant relationships. F.W. M3I grant from University Hospital of Hamburg-Eppendorf made to the Institute of Medical Systems Biology for funding of salary. A.E. Support from the LFF-FV 78 and DFG-funded CRU 306 grants. J.Y. No relevant relationships. P.F. Support from the LFF-FV 78 and DFG-funded CRU 306 grants. M.Z. SFB 1192 B8 funding from the German Research Foundation to the Institute of Medical Systems Biology, University Medical Center Hamburg-Eppendorf. J.S. No relevant relationships. F.S. No relevant relationships. C.Ö. No relevant relationships. A.W. No relevant relationships. G.A. No relevant relationships. S.B. Funding from UKE R3 reduction of animal testing grant, LFF-FV 78 grant, and DFG-funded CRU 306 grant made to the Institute of Medical Systems Biology at the University Hospital of Hamburg-Eppendorf. C.S. Support from the YAEL Foundation, the Hannelore and Helmut Greve Foundation, and the LFF-FV 78 and DFG-funded CRU 306 grants.

Abbreviations:

CNN = convolutional neural network
MRCP = MR cholangiopancreatography
MVCNN = multiview CNN
PSC = primary sclerosing cholangitis
SVCNN = single-view CNN
2D = two-dimensional

References

1. Karlsen TH, Folseraas T, Thorburn D, Vesterhus M. Primary sclerosing cholangitis - a comprehensive review. J Hepatol 2017;67(6):1298–1323.
2. European Association for the Study of the Liver. EASL Clinical Practice Guidelines: management of cholestatic liver diseases. J Hepatol 2009;51(2):237–267.
3. Hildebrand T, Pannicke N, Dechene A, et al. Biliary strictures and recurrence after liver transplantation for primary sclerosing cholangitis: A retrospective multicenter analysis. Liver Transpl 2016;22(1):42–52.
4. Schramm C, Eaton J, Ringe KI, Venkatesh S, Yamamura J; MRI working group of the IPSCSG. Recommendations on the use of magnetic resonance imaging in PSC-A position statement from the International PSC Study Group. Hepatology 2017;66(5):1675–1688.
5. Zenouzi R, Liwinski T, Yamamura J, et al. Follow-up magnetic resonance imaging/3D-magnetic resonance cholangiopancreatography in patients with primary sclerosing cholangitis: challenging for experts to interpret. Aliment Pharmacol Ther 2018;48(2):169–178.
6. Grigoriadis A, Morsbach F, Voulgarakis N, Said K, Bergquist A, Kartalis N. Inter-reader agreement of interpretation of radiological course of bile duct changes between serial follow-up magnetic resonance imaging/3D magnetic resonance cholangiopancreatography of patients with primary sclerosing cholangitis. Scand J Gastroenterol 2020;55(2):228–235.
7. Liu X, Faes L, Kale AU, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health 2019;1(6):e271–e297.
8. Dietrich E, Fuhlert P, Ernst A, et al. Towards Explainable End-to-End Prostate Cancer Relapse Prediction from H&E Images Combining Self-Attention Multiple Instance Learning with a Recurrent Neural Network. arXiv 2111.13439 [preprint] https://arxiv.org/abs/2111.13439. Posted November 26, 2021. Accessed June 2022.
9. Seeland M, Mäder P. Multi-view classification with convolutional neural networks. PLoS One 2021;16(1):e0245230. [Published correction appears in PLoS One 2021;16(4):e0250190.]
10. Dave M, Elmunzer BJ, Dwamena BA, Higgins PDR. Primary sclerosing cholangitis: meta-analysis of diagnostic performance of MR cholangiopancreatography. Radiology 2010;256(2):387–396.
11. van der Walt S, Schönberger JL, Nunez-Iglesias J, et al. scikit-image: image processing in Python. PeerJ 2014;2:e453.
12. Deng J, Dong W, Socher R, Li LJ, Li K, Li FF. ImageNet: a Large-Scale Hierarchical Image Database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, June 20–25, 2009. IEEE, 2009; 248–255.
13. Iandola FN, Han S, Moskewicz MW, Ashraf K, Dally WJ, Keutzer K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 0.5MB model size. arXiv 1602.07360 [preprint] https://arxiv.org/abs/1602.07360. Posted February 24, 2016. Accessed June 2022.
14. Su H, Maji S, Kalogerakis E, Learned-Miller E. Multi-view Convolutional Neural Networks for 3D Shape Recognition. arXiv 1505.00880 [preprint] https://arxiv.org/abs/1505.00880. Posted May 5, 2015. Accessed June 2022.
15. Ilse M, Tomczak JM, Welling M. Attention-based Deep Multiple Instance Learning. arXiv 1802.04712 [preprint] https://arxiv.org/abs/1802.04712. Posted February 13, 2018. Accessed June 2022.
16. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. Int J Comput Vis 2020;128(2):336–359.
17. Kim DW, Jang HY, Kim KW, Shin Y, Park SH. Design Characteristics of Studies Reporting the Performance of Artificial Intelligence Algorithms for Diagnostic Analysis of Medical Images: Results from Recently Published Papers. Korean J Radiol 2019;20(3):405–410.
18. Ringe KI, Vo Chieu VD, Wacker F, et al. Fully automated detection of primary sclerosing cholangitis (PSC)-compatible bile duct changes based on 3D magnetic resonance cholangiopancreatography using machine learning. Eur Radiol 2021;31(4):2482–2489.
19. Loshchilov I, Hutter F. Decoupled Weight Decay Regularization. arXiv 1711.05101 [preprint] https://arxiv.org/abs/1711.05101. Posted November 14, 2017. Accessed June 2022.
20. The MONAI Consortium. Project MONAI. Zenodo; 2020.
21. Billot B, Greve D, Van Leemput K, Fischl B, Iglesias JE, Dalca AV. A Learning Strategy for Contrast-agnostic MRI Segmentation. arXiv 2003.01995 [preprint] http://arxiv.org/abs/2003.01995. Posted March 4, 2020. Accessed June 2022.
22. Geras KJ, Wolfson S, Shen Y, et al. High-Resolution Breast Cancer Screening with Multi-View Deep Convolutional Neural Networks. arXiv 1703.07047 [preprint] https://arxiv.org/abs/1703.07047. Posted March 21, 2017. Accessed June 2022.

Articles from Radiology: Artificial Intelligence are provided here courtesy of Radiological Society of North America
