Faculty, Staff and Student Publications

Language

English

Publication Date

7-1-2025

Journal

The Journal of Allergy and Clinical Immunology

DOI

10.1016/j.jaci.2025.02.004

PMID

39956279

PMCID

PMC12229761

PubMedCentral® Posted Date

7-7-2025

PubMedCentral® Full Text Version

Author MSS

Abstract

Background: Generative artificial intelligence (GAI) is transforming health care in a variety of ways; however, the present utility of GAI for supporting clinicians who treat rare diseases such as primary immune disorders (PIs) is not well studied. We evaluated the ability of 6 state-of-the-art large language models (LLMs) to provide clinical guidance about PIs.

Objective: To quantitatively and qualitatively measure the utility of current, open-source LLMs for diagnosing and providing helpful clinical decision support about PIs.

Methods: Five expert clinical immunologists each provided 5 real-world, anonymized PI case vignettes via multi-turn prompting to 6 LLMs (OpenAI GPT-4o, Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct, Mistral-7B-Instruct-v0.3, Mistral-Large-Instruct-2407, Mixtral-8x7B-Instruct-v0.1). We assessed the diagnostic accuracy of the LLMs and the quality of their clinical reasoning using the Revised-IDEA (R-IDEA) score. Qualitative assessment of the LLMs was based on immunologist narratives.

Results: Diagnostic accuracy (>88%) and R-IDEA scores (≥8) were superior for 3 models (GPT-4o, Llama-3.1-70B-Instruct, Mistral-Large-Instruct-2407), with GPT-4o achieving the highest diagnostic accuracy (96.2%). Conversely, the remaining 3 models fell below acceptable accuracy, with rates near 60% or lower, and had poor R-IDEA scores (≤0.55); Mistral-7B-Instruct-v0.3 had the worst diagnostic accuracy (42.3%). Compared with the 3 best-performing LLMs, the 3 worst-performing LLMs had a substantially lower median R-IDEA score (P < .001). The intraclass correlation coefficient for R-IDEA score assignments varied substantially by LLM, ranging from good to poor agreement, and did not appear to correlate with either diagnostic accuracy or median R-IDEA score. Qualitatively, immunologists identified several themes (eg, correctness, differential diagnosis appropriateness, relative conciseness of explanations) of relevance to PIs.

Conclusions: LLMs can support diagnosis and management of PIs; however, further tuning is needed to optimize LLMs for best practice recommendations.

Keywords

Humans, Artificial Intelligence, Immune System Diseases, Male, Language, Female, Large Language Models, Primary immune disorders, inborn errors of immunity, large language models, generative AI, health care AI, medical chatbot

Published Open-Access

yes
