Faculty, Staff and Student Publications

Language

English

Publication Date

4-6-2025

Journal

Nature Communications

DOI

10.1038/s41467-025-56989-2

PMID

40188094

PMCID

PMC11972378

PubMed Central® Posted Date

4-6-2025

PubMed Central® Full Text Version

Post-print

Abstract

The rapid growth of biomedical literature poses challenges for manual knowledge curation and synthesis. Biomedical Natural Language Processing (BioNLP) automates this process. While Large Language Models (LLMs) have shown promise in general domains, their effectiveness in BioNLP tasks remains unclear due to limited benchmarks and practical guidelines. We perform a systematic evaluation of four LLMs (GPT and LLaMA representatives) on 12 BioNLP benchmarks across six applications. We compare their zero-shot, few-shot, and fine-tuning performance with the traditional fine-tuning of BERT or BART models. We examine inconsistencies, missing information, and hallucinations, and perform a cost analysis. Here, we show that traditional fine-tuning outperforms zero- or few-shot LLMs in most tasks. However, closed-source LLMs like GPT-4 excel in reasoning-related tasks such as medical question answering. Open-source LLMs still require fine-tuning to close performance gaps. We find issues such as missing information and hallucinations in LLM outputs. These results offer practical insights for applying LLMs in BioNLP.
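To make the zero-shot versus few-shot distinction concrete, the sketch below contrasts the two prompting styles for a biomedical named entity recognition task. This is an illustrative sketch, not the paper's actual pipeline: the `complete` function is a placeholder for whichever LLM API is used (GPT, LLaMA, etc.), and the sentences and labels are invented examples rather than items from the paper's benchmarks.

```python
# Minimal sketch contrasting zero-shot and few-shot prompting for a
# biomedical NER task (drug name extraction). Hypothetical placeholder
# only; swap `complete` for a real LLM client.

def complete(prompt: str) -> str:
    """Placeholder LLM call. Returns a canned answer so the sketch runs;
    replace with an actual GPT or LLaMA inference call."""
    return "dexamethasone (Drug)"

TASK = "Extract all drug names from the sentence below.\n"

def zero_shot(sentence: str) -> str:
    # Zero-shot: task instruction plus the input, with no demonstrations.
    return complete(TASK + f"Sentence: {sentence}\nEntities:")

def few_shot(sentence: str, demos: list[tuple[str, str]]) -> str:
    # Few-shot: labeled demonstrations are prepended before the query.
    shots = "".join(f"Sentence: {s}\nEntities: {e}\n\n" for s, e in demos)
    return complete(TASK + shots + f"Sentence: {sentence}\nEntities:")

demos = [("The patient received aspirin daily.", "aspirin (Drug)")]
query = "Dexamethasone was administered for inflammation."
print(zero_shot(query))
print(few_shot(query, demos))
```

The fine-tuning baseline in the paper works differently: a BERT or BART model is trained directly on labeled task data rather than prompted, which is why it outperforms prompting on most extraction-style tasks.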

Keywords

Natural Language Processing, Benchmarking, Humans, Large Language Models, Data mining, Health care

Published Open-Access

yes
