Utilizing a Digital Swarm Intelligence Platform to Improve Consensus Among Radiologists and Exploring Its Applications

Rutwik Shah et al. J Digit Imaging. 2023 Apr;36(2):401-413.
doi: 10.1007/s10278-022-00662-3. Epub 2022 Nov 22.

Abstract

Radiologists today play a central role in making diagnostic decisions and labeling images for training and benchmarking artificial intelligence (AI) algorithms. A key concern is the low inter-reader reliability (IRR) seen between experts when interpreting challenging cases. While team-based decisions are known to outperform individual decisions, interpersonal biases often creep into group interactions and limit nondominant participants from expressing their true opinions. To overcome the dual problems of low consensus and interpersonal bias, we explored a solution modeled on bee swarms. Two separate cohorts, three board-certified radiologists (cohort 1) and five radiology residents (cohort 2), collaborated on a digital swarm platform in real time and in a blinded fashion, grading meniscal lesions on knee MR exams. These consensus votes were benchmarked against clinical (arthroscopy) and radiological (senior-most radiologist) standards of reference using Cohen's kappa. The IRR of the consensus votes was then compared to the IRR of the majority and most confident votes of the two cohorts. IRR was also calculated for predictions from a meniscal lesion-detecting AI algorithm. The attending cohort saw an improvement of 23% in the IRR of swarm votes (k = 0.34) over the majority vote (k = 0.11). A similar improvement of 23% in IRR (k = 0.25) was observed for 3-resident swarm votes over the majority vote (k = 0.02). The 5-resident swarm showed an even higher improvement of 30% in IRR (k = 0.37) over the majority vote (k = 0.07). The swarm consensus votes outperformed individual and majority vote decisions in both the attending and resident cohorts. The attending and resident swarms also outperformed predictions from a state-of-the-art AI algorithm.
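
For illustration, the vote aggregation and agreement analysis described above could be reproduced along the following lines. This is a minimal sketch, not the authors' code: the grade, confidence, and reference arrays are hypothetical placeholders, and Cohen's kappa is taken from scikit-learn.

    import numpy as np
    from scipy import stats
    from sklearn.metrics import cohen_kappa_score

    rng = np.random.default_rng(0)
    n_exams, n_readers = 36, 3

    # Hypothetical per-reader meniscal grades (e.g. 0 = intact, 1-3 = increasing lesion grade)
    grades = rng.integers(0, 4, size=(n_exams, n_readers))
    confidence = rng.random(size=(n_exams, n_readers))   # hypothetical self-reported confidence
    reference = rng.integers(0, 4, size=n_exams)         # clinical (arthroscopy) standard of reference

    # Majority vote: the most frequent grade across readers for each exam
    majority = stats.mode(grades, axis=1, keepdims=False).mode

    # "Most confident" vote: the grade given by the reader with the highest confidence
    most_confident = grades[np.arange(n_exams), confidence.argmax(axis=1)]

    # Inter-rater reliability against the standard of reference (Cohen's kappa)
    print("majority vote vs reference      :", cohen_kappa_score(majority, reference))
    print("most confident vote vs reference:", cohen_kappa_score(most_confident, reference))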

Keywords: Artificial intelligence; Consensus decisions; Inter-rater reliability; Swarm intelligence; Workflow tools.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
A Sagittal sequence of a knee MR exam (arrow pointing to the ambiguous meniscal lesion) evaluated by multiple subspecialty-trained musculoskeletal radiologists, who had discordant impressions of the presence and grade of the lesion. B The swarm platform was used to derive a consensus on lesion location, which matched the arthroscopic findings used as the standard of reference
Fig. 2
Flowchart of the various steps in the study. In total, 36 anonymized knee scans (sagittal CUBE sequences) were reviewed by a cohort of three MSK-trained radiologists and another cohort of five radiology residents, independently at first and then in swarm sessions. A deep learning model trained to evaluate meniscal lesions was also run on the same 36 knee scans to obtain AI predictions for comparison
Fig. 3
A Schematic of the swarm platform. Multiple remote users are connected to each other in real time via the web application. Inputs from users (blue arrows) are sent to the cloud server running the swarm algorithm, which then sends back a continuous stream of output (green arrows) to users in a closed-loop system. B Setup of the swarm session: participants accessed the knee exams on a PACS workstation and logged into swarm sessions via a separate device. C Early time point in a session: multiple users pull the central puck in opposing directions using virtual magnets, as seen in the graphical interface. D Late time point in the same session: users converge onto a single answer choice after some negotiation and opinion switching
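
The commercial swarm algorithm itself is proprietary, but the closed-loop behaviour described in Fig. 3 (panels A, C, D) can be approximated with a toy model in which each user continuously pulls a shared puck toward their preferred answer and the puck's motion is integrated until it reaches one of the choices. The sketch below is written under those assumptions and is not the platform's actual implementation.

    import numpy as np

    def run_swarm(choice_positions, user_pulls, steps=200, lr=0.05, reach=0.1):
        # choice_positions: (n_choices, 2) coordinates of the answer choices on the board
        # user_pulls: callable(step, puck) -> (n_users, n_choices) pull weights, so users
        #             can switch opinions as the puck moves (the "negotiation" phase)
        puck = np.zeros(2)                        # the puck starts at the centre
        for step in range(steps):
            weights = user_pulls(step, puck)      # each user's pull toward each choice
            directions = choice_positions - puck  # vectors from puck to each choice
            directions /= np.linalg.norm(directions, axis=1, keepdims=True) + 1e-9
            force = (weights.sum(axis=0)[:, None] * directions).sum(axis=0)
            puck = puck + lr * force / max(weights.sum(), 1e-9)
            dists = np.linalg.norm(choice_positions - puck, axis=1)
            if dists.min() < reach:               # consensus: puck reached a choice
                return int(dists.argmin())
        return None                               # no consensus within the time limit

    # Example: 5 users and 4 answer choices placed on a circle; users 1, 2 and 4 favour
    # choice 0, user 3 favours choice 1 and user 5 favours choice 2 (fixed preferences)
    angles = np.linspace(0, 2 * np.pi, 4, endpoint=False)
    choices = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    pulls = lambda step, puck: np.eye(4)[[0, 0, 1, 0, 2]]
    print(run_swarm(choices, pulls))              # converges on choice 0
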
Fig. 4
Attendings' grading compared to the clinical standard of reference. A Confusion matrix (CM) for 3-attending majority vote (kappa: 0.11). B CM for 3-attending most confident vote (kappa: 0.19). C CM for 3-attending swarm vote (kappa: 0.34)
Fig. 5
Residents' grading compared to the clinical standard of reference. A Confusion matrix (CM) for 3-resident majority vote (kappa: 0.02). B CM for 3-resident most confident vote (kappa: 0.08). C CM for 3-resident swarm vote (kappa: 0.25). D CM for 5-resident majority vote (kappa: 0.07). E CM for 5-resident most confident vote (kappa: 0.12). F CM for 5-resident swarm vote (kappa: 0.37). Note: the 5-resident swarm was unable to reach a consensus on one exam; this exam was excluded from the inter-rater reliability comparisons of the 5-resident majority and most confident votes for parity
Fig. 6
Residents' responses compared to the radiological standard of reference. A Confusion matrix (CM) for 3-resident majority vote (kappa: 0.27). B CM for 3-resident most confident vote (kappa: 0.15). C CM for 3-resident swarm vote (kappa: 0.36). D CM for 5-resident majority vote (kappa: 0.31). E CM for 5-resident most confident vote (kappa: 0.14). F CM for 5-resident swarm vote (kappa: 0.39). Note: the 5-resident swarm was unable to reach a consensus on one exam; this exam was excluded from the inter-rater reliability comparisons of the 5-resident majority and most confident votes for parity
Fig. 7
AI prediction comparisons. A Confusion matrix for AI predictions compared to clinical standard of reference (kappa: 0.10). B Confusion matrix for AI predictions compared to radiological standard of reference (kappa: 0.15). Swarm votes of residents outperform AI in both sets of comparisons
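
As a rough illustration of how the comparisons in Figs. 4-7 could be tabulated, the confusion matrices and kappa values follow directly from scikit-learn. This is a sketch with hypothetical label arrays, not the authors' analysis code.

    import numpy as np
    from sklearn.metrics import confusion_matrix, cohen_kappa_score

    rng = np.random.default_rng(1)
    ai_pred = rng.integers(0, 4, size=36)       # hypothetical AI grades for the 36 exams
    arthroscopy = rng.integers(0, 4, size=36)   # hypothetical clinical standard of reference
    senior_read = rng.integers(0, 4, size=36)   # hypothetical radiological standard of reference

    for name, ref in [("clinical", arthroscopy), ("radiological", senior_read)]:
        print(f"AI predictions vs {name} standard of reference")
        print(confusion_matrix(ref, ai_pred))
        print("kappa:", round(cohen_kappa_score(ref, ai_pred), 2))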
