Publication Date
10-6-2025
Journal
Shoulder & Elbow
DOI
10.1177/17585732251365178
PMID
41064043
PMCID
PMC12500603
PubMedCentral® Posted Date
10-6-2025
PubMedCentral® Full Text Version
Post-print
Abstract
Hypothesis: Large language models (LLMs) such as ChatGPT are increasingly used as online resources by patients with orthopedic conditions, yet there is a paucity of information assessing the ability of LLMs to answer patient questions accurately and completely. The present study comparatively assessed ChatGPT 3.5 and GPT-4 responses to frequently asked questions on common elbow pathologies, scoring each response for accuracy and completeness. It was hypothesized that ChatGPT 3.5 and GPT-4 would demonstrate high accuracy for the specific query asked but that some responses would lack completeness, and that GPT-4 would yield more accurate and complete responses than ChatGPT 3.5.
Methods: ChatGPT was queried to identify the five most common elbow pathologies (lateral epicondylitis, medial epicondylitis, cubital tunnel syndrome, distal biceps rupture, and elbow arthritis), and then queried for the five most frequently asked patient questions for each pathology. These 25 questions were then individually posed to ChatGPT 3.5 and GPT-4. Responses were recorded and scored on a 6-point Likert scale for accuracy and a 3-point Likert scale for completeness by three fellowship-trained upper extremity orthopedic surgeons. ChatGPT 3.5 and GPT-4 responses were compared for each pathology using two-tailed t-tests.
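For readers who want to see the shape of the statistical comparison described above, a minimal Python/SciPy sketch follows. The scores and variable names are illustrative placeholders, not study data, and the abstract does not state whether the t-tests were paired or independent; because the same 25 questions were posed to both models, a paired test is assumed here.

    # Minimal sketch of the Methods' t-test comparison.
    # scores_gpt35 and scores_gpt4 are hypothetical question-level
    # accuracy scores (6-point Likert), NOT the study's actual data.
    from scipy import stats

    # One averaged rater score per question, 25 questions per model.
    scores_gpt35 = [5, 4, 5, 5, 4, 5, 5, 4, 5, 5, 4, 5, 5,
                    5, 4, 5, 5, 4, 5, 5, 4, 5, 5, 5, 4]
    scores_gpt4 = [5, 5, 5, 5, 5, 5, 5, 4, 5, 5, 5, 5, 5,
                   5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5]

    # Two-tailed paired t-test across the question-level scores
    # (assumption: paired, since both models answered the same questions).
    t_stat, p_value = stats.ttest_rel(scores_gpt35, scores_gpt4)
    print(f"t = {t_stat:.2f}, two-tailed p = {p_value:.3f}")

The same call would be repeated per pathology (five questions each) and per outcome (accuracy and completeness) to reproduce the pathology-level comparisons reported in the Results.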
Results: Average accuracy scores for ChatGPT 3.5 ranged from 4.80 to 4.87. Average GPT-4 accuracy scores ranged from 4.80 to 5.13. Average completeness scores for ChatGPT 3.5 ranged from 2.13 to 2.47, and average completeness scores for GPT-4 ranged from 2.47 to 2.80. Total average accuracy for ChatGPT 3.5 was 4.83, and total average accuracy for GPT-4 was 5.00 (p = 0.05). Total average completeness for ChatGPT 3.5 was 2.35, and total average completeness for GPT-4 was 2.66 (p = 0.01).
Conclusion: ChatGPT 3.5 and GPT-4 are capable of providing accurate and complete responses to frequently asked patient questions, with GPT-4 providing superior responses. Large language models such as ChatGPT have the potential to serve as a reliable online resource for patients with elbow conditions.
Keywords
epicondylitis, cubital tunnel, distal biceps rupture, elbow arthritis, ChatGPT, large language model
Published Open-Access
yes
Recommended Citation
Fiedler, Benjamin; Ghilzai, Umar; Ghali, Abdullah; et al., "A Supplement, Not a Substitute: Accuracy and Completeness of ChatGPT Responses for Common Elbow Pathology" (2025). Faculty and Staff Publications. 5532.
https://digitalcommons.library.tmc.edu/baylor_docs/5532
Included in
Medical Sciences Commons, Musculoskeletal Diseases Commons, Orthopedics Commons, Surgery Commons