Faculty, Staff and Student Publications
Language
English
Publication Date
7-1-2025
Journal
The Journal of Allergy and Clinical Immunology
DOI
10.1016/j.jaci.2025.02.004
PMID
39956279
PMCID
PMC12229761
PubMedCentral® Posted Date
7-7-2025
PubMedCentral® Full Text Version
Author MSS
Abstract
Background: Generative artificial intelligence (GAI) is transforming health care in a variety of ways; however, the present utility of GAI for supporting clinicians who treat rare diseases such as primary immune disorders (PIs) is not well studied. We evaluated the ability of 6 state-of-the-art large language models (LLMs) to provide clinical guidance about PIs.
Objective: To quantitatively and qualitatively measure the utility of current, open-source LLMs for diagnosing PIs and providing helpful clinical decision support.
Methods: Five expert clinical immunologists each provided 5 real-world, anonymized PI case vignettes via multi-turn prompting to 6 LLMs (OpenAI GPT-4o, Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct, Mistral-7B-Instruct-v0.3, Mistral-Large-Instruct-2407, Mixtral-8x7B-Instruct-v0.1). We assessed the diagnostic accuracy of the LLMs and the quality of their clinical reasoning using the Revised-IDEA (R-IDEA) score. Qualitative assessment of the LLMs was based on immunologist narratives.
Results: Performance accuracy (>88%) and R-IDEA scores (≥8) were superior for 3 models (GPT-4o, Llama-3.1-70B-Instruct, Mistral-Large-Instruct-2407), with GPT-4o achieving the highest diagnostic accuracy (96.2%). Conversely, the remaining 3 models performed at unacceptable accuracy rates (near 60% or lower) and had poor R-IDEA scores (≤0.55), with Mistral-7B-Instruct-v0.3 attaining the worst diagnostic accuracy (42.3%). Compared with the 3 best-performing LLMs, the 3 worst-performing LLMs had a substantially lower median R-IDEA score (P < .001). Intraclass correlation coefficients for R-IDEA score assignments varied substantially by LLM, ranging from good to poor agreement, and did not appear to correlate with either diagnostic accuracy or median R-IDEA score. Qualitatively, immunologists identified several themes (eg, correctness, differential diagnosis appropriateness, relative conciseness of explanations) of relevance to PIs.
Conclusions: LLMs can support the diagnosis and management of PIs; however, further tuning is needed to optimize LLMs for best-practice recommendations.
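As a minimal illustrative sketch of the kind of analysis summarized above (per-model diagnostic accuracy and comparison of R-IDEA scores between model groups), the following Python snippet shows one possible approach. The column names, placeholder values, and the use of a Mann-Whitney U test are assumptions for illustration only and are not taken from the study.

```python
# Illustrative sketch only: the study's data, grading rubric, and exact
# statistical tests are not reproduced here; values below are placeholders.
import pandas as pd
from scipy.stats import mannwhitneyu

# One row per (model, vignette): a correctness flag and an R-IDEA score (0-10).
# In practice these would come from clinician grading of LLM responses.
df = pd.DataFrame({
    "model":   ["gpt-4o", "gpt-4o", "mistral-7b", "mistral-7b"],
    "correct": [1, 1, 0, 1],
    "r_idea":  [9, 8, 1, 2],
})

# Per-model diagnostic accuracy (fraction of vignettes answered correctly).
accuracy = df.groupby("model")["correct"].mean()
print(accuracy)

# Compare R-IDEA scores between a high- and a low-performing model.
high = df.loc[df["model"] == "gpt-4o", "r_idea"]
low = df.loc[df["model"] == "mistral-7b", "r_idea"]
stat, p = mannwhitneyu(high, low, alternative="two-sided")
print(f"Mann-Whitney U = {stat}, p = {p:.3f}")
```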
Keywords
Humans, Artificial Intelligence, Immune System Diseases, Male, Language, Female, Large Language Models, Primary immune disorders, inborn errors of immunity, large language models, generative AI, health care AI, medical chatbot
Published Open-Access
yes
Recommended Citation
Rider, Nicholas L; Li, Yingya; Chin, Aaron T; et al., "Evaluating Large Language Model Performance To Support the Diagnosis and Management of Patients With Primary Immune Disorders" (2025). Faculty, Staff and Student Publications. 680.
https://digitalcommons.library.tmc.edu/uthshis_docs/680