NPJ Digit Med. 2019 Nov 18:2:111.
doi: 10.1038/s41746-019-0189-7. eCollection 2019.

Human-machine partnership with artificial intelligence for chest radiograph diagnosis

Bhavik N Patel et al. NPJ Digit Med.

Abstract

Human-in-the-loop (HITL) AI may enable an ideal symbiosis of human experts and AI models, harnessing the advantages of both while overcoming their respective limitations. The purpose of this study was to investigate a novel collective intelligence technology designed to amplify the diagnostic accuracy of networked human groups by forming real-time systems modeled on biological swarms. Using small groups of radiologists, the swarm-based technology was applied to the diagnosis of pneumonia on chest radiographs and compared against human experts alone, as well as two state-of-the-art deep learning AI models. Our work demonstrates that both the swarm-based technology and the deep-learning technology achieved higher diagnostic accuracy than the human experts alone. Our work further demonstrates that, when used in combination, the swarm-based technology and deep-learning technology outperformed either method alone. The superior diagnostic accuracy of the combined HITL AI solution compared to radiologists and AI alone has broad implications for the surging deployment of clinical AI and for implementation strategies in future practice.

Keywords: Computer science; Radiography.

Conflict of interest statement

Competing interests: The authors had control of the data and the information submitted for publication. Four authors (L.R., D.B., G.W. and M.L.) are employees of Unanimous AI, which developed the swarm platform used in this study. All other authors are not employees of or consultants for Unanimous AI and had control of the study methodology, data analysis, and results. There was no industry support specifically for this study. This study was supported in part by NSF through Award ID 1840937.

Figures

Fig. 1
Bootstrapped average AUC curves. AUC curves show that the swarms (blue bars) outperform the radiologists (orange bars) in diagnosing pneumonia for group A (left image), group B (middle image), and the combined group (right image). The swarm also outperforms CheXNet (green bars).
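As a rough illustration of what a bootstrapped AUC distribution involves, the following is a minimal sketch using only the Python standard library. It is not the authors' analysis code: the function names `auc` and `bootstrap_auc`, the number of resamples, and the seed are all assumptions made for this example. AUC is computed here as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case.

```python
import random

def auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc(labels, scores, n_resamples=1000, seed=0):
    """Resample cases with replacement and return one AUC per resample.

    n_resamples and seed are illustrative defaults, not values from the study.
    """
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            aucs.append(auc(ys, [scores[i] for i in idx]))
    return aucs
```

Plotting a histogram of the returned list for each diagnostic method (swarm, individual radiologists, model) would give bars comparable in spirit to those described in the caption.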
Fig. 2
Scatterplot of swarm vs. CheXMax probabilistic diagnoses, with cases colored by ground truth. The scatterplots show that CheXMax and human swarms assign very different probabilities to each case (left image). The gray “Augmented Cases” range shows cases that were sent from CheXMax to the Swarm for augmentation. CheXMax has a high incidence of True Positives (blue-colored cases below the horizontal CheXMax Threshold line), but when CheXMax gives a weak positive diagnosis (between 0.04008 and 0.055 on the y-axis), it is less reliable (11 out of 15 cases correct, an accuracy of 73%). Using a human swarm to re-classify these weak positive cases results in correctly labeling 14 out of 15 of the cases (93%), an accuracy improvement of 20 percentage points. The cases on which the two diagnostic methods disagreed are more clearly visualized in the scatterplot of diagnostic disagreement (right image).
Fig. 3
Case examples. Each of the three rows (a–c) represents a different patient. The grayscale image is on the left, with the corresponding class activation map to its right. The top row (a) shows a patient with pneumonia in the left lung, correctly predicted by CheXMax but incorrectly by the swarm. The middle row (b) shows a patient with metastatic disease but without pneumonia, correctly predicted by the swarm and incorrectly by CheXMax. The bottom row (c) shows an augmented case, where CheXMax provided a low-confidence positive prediction (p = 0.41) but the case was correctly predicted as negative by the swarm.
Fig. 4
Sensitivity analysis of augmented model accuracy. The shape of the average accuracy line shows a consistent increase in the accuracy of the augmented model when the 0–14% lowest-confidence cases are sent to the swarm, from 82% correct for CheXMax alone (sending 0% of cases) to 90% correct when sending the 14% of lowest-confidence positive and negative cases to the swarm. The model performs similarly when 16–32% of cases are sent to the swarm, achieving between 88% and 92% accuracy across this range. If more than 32% of cases are sent to the swarm, the accuracy of the system decreases until the limit of sending all diagnoses to the swarm is reached (100% of cases swarmed), where the accuracy returns to the swarm score of 84%.
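The routing idea behind this sensitivity analysis can be sketched as follows. This is a hedged illustration only: the caption does not define "confidence", so the sketch assumes it is the distance between the model probability and its decision threshold, and the function name `augment`, the `swarm_threshold` default of 0.5, and the rounding of the routed fraction are all assumptions rather than details taken from the study.

```python
def augment(model_probs, swarm_probs, threshold, fraction, swarm_threshold=0.5):
    """Return one boolean diagnosis per case.

    The `fraction` of cases whose model probability lies closest to the
    model's decision `threshold` (assumed definition of lowest confidence)
    is decided by the swarm; all other cases keep the model's decision.
    """
    n = len(model_probs)
    k = round(fraction * n)  # number of cases handed to the swarm
    by_confidence = sorted(range(n), key=lambda i: abs(model_probs[i] - threshold))
    routed = set(by_confidence[:k])
    return [
        swarm_probs[i] >= swarm_threshold if i in routed else model_probs[i] >= threshold
        for i in range(n)
    ]
```

With `fraction=0.0` the output is the model's diagnosis alone, and with `fraction=1.0` it is the swarm's diagnosis alone, which corresponds to the two endpoints described in the caption. Sweeping `fraction` between those limits and scoring each output against ground truth would trace an accuracy curve of the kind the figure describes.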
Fig. 5
Sensitivity analysis of accuracy increase relative to CheXMax. The sensitivity analysis shows a band between 6% and 34% in which the 90% confidence interval lies entirely above 0%. This indicates that when between 6% and 34% of the lowest-confidence cases are sent to the swarm using this method, there is high confidence that the augmented model would diagnose the cases more accurately than CheXMax alone. If the range is limited to between 14% and 28%, the average improvement in accuracy is 7.75 percentage points.
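One way to obtain a confidence interval on an accuracy gain of this kind is a paired bootstrap over cases, sketched below. This is an assumption about the general technique, not a description of the authors' procedure; the function name `accuracy_gain_ci`, the percentile method, the resample count, and the seed are all hypothetical choices for illustration.

```python
import random

def accuracy_gain_ci(truth, base_pred, aug_pred, n_resamples=2000, level=0.90, seed=0):
    """Percentile bootstrap interval for (augmented accuracy - baseline accuracy).

    Cases are resampled with replacement, and both methods are scored on the
    same resample so that the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(truth)
    gains = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        base = sum(base_pred[i] == truth[i] for i in idx) / n
        aug = sum(aug_pred[i] == truth[i] for i in idx) / n
        gains.append(aug - base)
    gains.sort()
    lo = gains[int((1 - level) / 2 * n_resamples)]
    hi = gains[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi
```

If the lower bound returned for a given routing fraction is above zero, the augmented predictions beat the baseline in at least roughly 95% of resamples, which is the sense in which the caption speaks of the interval lying above 0%.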
Fig. 6
Bootstrapped average specificity and sensitivity of aggregate diagnostic methods. The bootstrapped specificity histograms show that the swarms in the combined group (blue bars) outperform CheXMax (green bars) in terms of specificity (left image), but CheXMax outperforms the swarms in terms of sensitivity (right image). The HITL Combined model combines the best of both the CheXMax and swarm diagnostic methods, by attaining swarm-level specificity and CheXMax-level sensitivity.
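For readers less familiar with the two metrics compared in this figure, the standard definitions can be written as a short sketch. The function name `sensitivity_specificity` is chosen for this example; labels and predictions are assumed to be encoded as 1 for pneumonia and 0 for no pneumonia.

```python
def sensitivity_specificity(truth, pred):
    """Return (sensitivity, specificity) for binary labels and predictions.

    sensitivity = TP / (TP + FN): share of true pneumonia cases that were flagged.
    specificity = TN / (TN + FP): share of non-pneumonia cases that were cleared.
    """
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)
```

A method that rarely raises false alarms scores high on specificity, while a method that rarely misses disease scores high on sensitivity, which is why a combined approach can in principle draw on the stronger side of each.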
Fig. 7
Swarm platform. A system diagram (left image) of the Swarm platform shows the connection of networked human users. A Swarm engine algorithm receives continuous input from the humans as they are making their decision and provides real-time collaborative feedback to them, creating a dynamic feedback loop. The Swarm platform is positioned next to a second screen for viewing the radiograph (middle image). A snapshot (right image) of the real-time swarm of six radiologists (group B) shows small magnets controlled by the radiologists pulling on the circular puck in the process of collectively converging towards a probability of pneumonia. To view a video of the above question being answered in the Swarm platform, visit the following link: https://unanimous.ai/wp-content/uploads/2019/05/Radiology-Swarm.gif.
Fig. 8
Support density visualization. In this support density visualization corresponding to the swarm in Fig. 1, the puck’s trajectory is shown as a white dotted line, and the distribution of force over the hex is plotted as a Gaussian kernel density heatmap. Notice that this swarm was split between the “5–25%” and “0–5%” bins, with more force directed towards the “5–25%” bin. This aggregate behavior is reflected in the swarm’s interpolated diagnosis of 11.1%.
