Faculty, Staff and Student Publications

Ensemble Pretrained Language Models to Extract Biomedical Knowledge From Literature

Zhao Li
Qiang Wei
Liang-Chin Huang
Jianfu Li
Yan Hu
Yao-Shun Chuang
Jianping He
Avisha Das
Vipina Kuttichi Keloth
Yuntao Yang
Chiamaka S Diala
Kirk E Roberts
Cui Tao
Xiaoqian Jiang
W Jim Zheng
Hua Xu

Language

English

Publication Date

9-1-2024

Journal

Journal of the American Medical Informatics Association

DOI

10.1093/jamia/ocae061

PMID

38520725

PMCID

PMC11339500

PubMedCentral® Posted Date

3-23-2024

PubMedCentral® Full Text Version

Post-print

Abstract

OBJECTIVES: The rapid expansion of biomedical literature necessitates automated techniques to discern relationships between biomedical concepts from extensive free text. Such techniques facilitate the development of detailed knowledge bases and highlight research deficiencies. The LitCoin Natural Language Processing (NLP) challenge, organized by the National Center for Advancing Translational Science, aims to evaluate such potential and provides a manually annotated corpus for methodology development and benchmarking.

MATERIALS AND METHODS: For the named entity recognition (NER) task, we used ensemble learning to merge predictions from three domain-specific models (BioBERT, PubMedBERT, and BioM-ELECTRA), devised a rule-driven detection method for cell line and taxonomy names, and annotated 70 more abstracts as an additional corpus. For relation extraction (RE), we further finetuned the T0pp model, with 11 billion parameters, to boost performance and leveraged entities' location information (eg, title, background) to enhance novelty prediction.
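The abstract does not state how the three models' predictions were merged. As a minimal sketch of one common option, the Python below applies a per-token majority vote over BIO tags. The function name `majority_vote`, the fallback to the "O" tag, and the example labels are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]], fallback: str = "O") -> List[str]:
    """Merge per-token labels from several models by strict majority vote."""
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all models must label the same number of tokens")
    merged = []
    for labels in zip(*predictions):
        label, count = Counter(labels).most_common(1)[0]
        # Keep a label only if more than half of the models agree on it;
        # otherwise use the fallback tag (an assumption of this sketch).
        merged.append(label if count > len(labels) / 2 else fallback)
    return merged


# Hypothetical token-level outputs standing in for the three models.
model_a = ["B-Gene", "I-Gene", "O", "B-Disease"]
model_b = ["B-Gene", "O", "O", "B-Disease"]
model_c = ["B-Gene", "I-Gene", "O", "O"]

merged = majority_vote([model_a, model_b, model_c])
print(merged)
```

Voting per token can produce label sequences that break BIO constraints (for example, an "I-" tag with no preceding "B-" tag), so a real system would likely vote at the entity-span level or repair the sequence afterward.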

RESULTS: Our pioneering NLP system designed for this challenge secured first place in Phase I (NER) and second place in Phase II (relation extraction and novelty prediction), outpacing over 200 teams. We tested OpenAI ChatGPT 3.5 and ChatGPT 4 in a zero-shot setting using the same test set, revealing that our finetuned model considerably surpasses these broad-spectrum large language models.

DISCUSSION AND CONCLUSION: Our outcomes depict a robust NLP system excelling in NER and RE across various biomedical entities, emphasizing that task-specific models remain superior to generic large ones. Such insights are valuable for endeavors like knowledge graph development and hypothesis formulation in biomedical research.

Keywords

Natural Language Processing, Data Mining, Machine Learning, Humans, named entity recognition, relation extraction, large language model, ensemble learning, knowledge base

Published Open-Access

yes

Recommended Citation

Li, Zhao; Wei, Qiang; Huang, Liang-Chin; et al., "Ensemble Pretrained Language Models to Extract Biomedical Knowledge From Literature" (2024). Faculty, Staff and Student Publications. 495.
https://digitalcommons.library.tmc.edu/uthshis_docs/495

Included in

Bioinformatics Commons, Biomedical Informatics Commons, Data Science Commons
