Publication Date

10-6-2025

Journal

Shoulder & Elbow

DOI

10.1177/17585732251365178

PMID

41064043

PMCID

PMC12500603

PubMedCentral® Posted Date

10-6-2025

PubMedCentral® Full Text Version

Post-print

Abstract

Hypothesis: Large language models (LLMs) like ChatGPT have increasingly been used as online resources for patients with orthopedic conditions. Yet there is a paucity of information assessing the ability of LLMs to accurately and completely answer patient questions. The present study comparatively assessed both ChatGPT 3.5 and GPT-4 responses to frequently asked questions on common elbow pathologies, scoring for accuracy and completeness. It was hypothesized that ChatGPT 3.5 and GPT-4 would demonstrate high levels of accuracy for the specific query asked, but some responses would lack completeness, and GPT-4 would yield more accurate and complete responses than ChatGPT 3.5.

Methods: ChatGPT was queried to identify the five most common elbow pathologies (lateral epicondylitis, medial epicondylitis, cubital tunnel syndrome, distal biceps rupture, elbow arthritis). ChatGPT was then queried for the five most frequently asked questions for each elbow pathology. These 25 total questions were then individually asked of ChatGPT 3.5 and GPT-4. Responses were recorded and scored on a 6-point Likert scale for accuracy and a 3-point Likert scale for completeness by three fellowship-trained upper extremity orthopedic surgeons. ChatGPT 3.5 and GPT-4 responses were compared for each pathology using two-tailed t-tests, as illustrated in the sketch below.
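
For illustration of the statistical comparison described above: the raw scores and analysis software are not published in the abstract, and it does not state whether paired or independent tests were used. The following minimal Python sketch assumes paired per-question mean reviewer scores for the two models and uses invented numbers; variable names and values are hypothetical.

    # Illustrative sketch only; scores below are made up, not study data.
    from scipy import stats

    # Hypothetical mean reviewer accuracy scores per question (25 questions),
    # one list per model, same question order for both.
    gpt35_accuracy = [5, 5, 4, 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 5, 5,
                      5, 4, 5, 5, 5, 4, 5, 5, 5, 5]
    gpt4_accuracy  = [5, 5, 5, 5, 6, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6,
                      5, 5, 5, 5, 5, 4, 5, 5, 5, 5]

    # Paired, two-tailed t-test across the shared questions.
    t_stat, p_value = stats.ttest_rel(gpt4_accuracy, gpt35_accuracy)
    print(f"t = {t_stat:.2f}, two-tailed p = {p_value:.3f}")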

Results: Average accuracy scores for ChatGPT 3.5 ranged from 4.80 to 4.87. Average GPT-4 accuracy scores ranged from 4.80 to 5.13. Average completeness scores for ChatGPT 3.5 ranged from 2.13 to 2.47, and average completeness scores for GPT-4 ranged from 2.47 to 2.80. Total average accuracy for ChatGPT 3.5 was 4.83, and total average accuracy for GPT-4 was 5.0 (p = 0.05). Total average completeness for ChatGPT 3.5 was 2.35, and total average completeness for GPT-4 was 2.66 (p = 0.01).

Conclusion: ChatGPT 3.5 and GPT-4 are capable of providing accurate and complete responses to frequently asked patient questions, with GPT-4 providing superior responses. Large language models like ChatGPT have the potential to serve as a reliable online resource for patients with elbow conditions.

Keywords

epicondylitis, cubital tunnel, distal biceps rupture, elbow arthritis, chatGPT, large language model

Published Open-Access

yes
