Faculty, Staff and Student Publications

Language

English

Publication Date

12-27-2025

Journal

npj Digital Medicine

DOI

10.1038/s41746-025-02253-2

PMID

41455812

PMCID

PMC12827943

PubMedCentral® Posted Date

12-27-2025

PubMedCentral® Full Text Version

Post-print

Abstract

Current discussion surrounding the clinical capabilities of generative language models (GLMs) predominantly centers on multiple-choice question-answer (MCQA) benchmarks derived from clinical licensing examinations. While accepted for human examinees, characteristics unique to GLMs call into question the validity of such benchmarks. Here, we validate five benchmarks using eight GLMs, ablating for parameter size and reasoning capabilities, and test via prompt permutation three key assumptions that underpin the generalizability of MCQA-based assessments: that knowledge is applied rather than memorized, that semantically equivalent prompts lead to consistent answers, and that situations with no correct answer can be recognized. While large models are more resilient to our perturbations than small models, we globally invalidate these assumptions, with implications for reasoning models. Additionally, despite retaining the knowledge, small models are prone to memorization. All models exhibit significant failure in null-answer scenarios. We then suggest several adaptations for more robust benchmark designs that better reflect real-world conditions.
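
For illustration only, a minimal sketch of the prompt-permutation idea described above: shuffle the order of an MCQA item's options and check whether the model keeps selecting the same underlying option text rather than the same letter. This is not the authors' code; `query_model` is a hypothetical callable standing in for whatever GLM interface is used.

```python
# Minimal sketch (assumed, not from the paper): permutation consistency for one MCQA item.
import random
from typing import Callable, List, Optional


def permutation_consistency(
    stem: str,
    options: List[str],
    query_model: Callable[[str], str],  # hypothetical: prompt -> chosen letter, e.g. "B"
    n_permutations: int = 5,
    seed: int = 0,
) -> float:
    """Return the fraction of option-order permutations on which the model
    picks the same option text (not the same letter position)."""
    rng = random.Random(seed)
    chosen_texts: List[Optional[str]] = []
    for _ in range(n_permutations):
        shuffled = options[:]
        rng.shuffle(shuffled)
        letters = [chr(ord("A") + i) for i in range(len(shuffled))]
        prompt = stem + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in zip(letters, shuffled)
        )
        answer = query_model(prompt).strip().upper()[:1]
        # Map the chosen letter back to its option text; unparsable answers count as inconsistent.
        chosen_texts.append(shuffled[letters.index(answer)] if answer in letters else None)
    most_common = max(set(chosen_texts), key=chosen_texts.count)
    return chosen_texts.count(most_common) / n_permutations
```

A semantically consistent model would score near 1.0 under this check; sensitivity to option order alone would lower the score.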

Keywords

Computational biology and bioinformatics, Mathematics and computing, Medical research, Psychology, Scientific community

Published Open-Access

yes
