Faculty, Staff and Students Publications

Development of Retrieval-Augmented Generation-Based Large Language Model for Drug-Induced Liver Injury Using Livertox Data

Language

English

Publication Date

3-1-2026

Journal

Hepatology Communications

DOI

10.1097/HC9.0000000000000895

PMID

41678290

Abstract

Background: Idiosyncratic DILI is a complex clinical challenge requiring timely and accurate decision support. LiverTox, curated by the National Institutes of Health (NIH), offers a comprehensive DILI evidence base, but its encyclopedia-like format hinders point-of-care use. Health care providers increasingly use general large language models (LLMs) for clinical care, raising safety concerns due to LLM hallucinations or misinformation. We hypothesize that retrieval-augmented generation (RAG) integration, which grounds LLM responses in LiverTox content, would enable accurate DILI decision support.

Methods: We processed 1343 LiverTox drug monographs into 8759 indexed segments using BioBERT embeddings. We developed a RAG pipeline that employs drug-specific prioritization, section-aware weighting, and semantic search to retrieve the most relevant content per query. Twenty-five DILI questions were evaluated across 6 models: 4 RAG-LLMs (Mistral-7B, Claude-3-Haiku, Claude-3-Opus, and GPT-4o) and 2 non-RAG GPT-4o variants (unconstrained; soft constrained with a prompt to reference LiverTox). Three hepatologists, blinded to the model, evaluated responses for accuracy, completeness, and conciseness using 5-point Likert scales. Analyses included pairwise comparisons and effect size estimation.
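The retrieval step described in the Methods can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how drug-specific prioritization, section-aware weighting, and semantic search are combined, so the scoring rule, the `SECTION_WEIGHTS` values, the `DRUG_MATCH_BOOST` constant, and the `Segment` type below are all hypothetical. Embeddings are passed in as plain vectors; in the study they would come from BioBERT.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    drug: str               # monograph the segment belongs to
    section: str            # monograph section name
    text: str
    embedding: List[float]  # precomputed vector (BioBERT in the study)


# Hypothetical values: the abstract does not report the actual weights.
SECTION_WEIGHTS: Dict[str, float] = {"hepatotoxicity": 1.2}
DRUG_MATCH_BOOST = 0.5


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score(query_text: str, query_embedding: List[float], seg: Segment) -> float:
    """Semantic similarity, scaled by section weight, plus a drug-name boost."""
    similarity = cosine(query_embedding, seg.embedding)
    weighted = similarity * SECTION_WEIGHTS.get(seg.section.lower(), 1.0)
    # Assumed form of drug-specific prioritization: a flat boost when the
    # segment's drug name appears in the query text.
    if seg.drug.lower() in query_text.lower():
        weighted += DRUG_MATCH_BOOST
    return weighted


def retrieve(query_text: str, query_embedding: List[float],
             segments: List[Segment], k: int = 3) -> List[Segment]:
    """Return the k highest-scoring segments for the query."""
    ranked = sorted(segments,
                    key=lambda s: score(query_text, query_embedding, s),
                    reverse=True)
    return ranked[:k]
```

In a pipeline of this kind, the text of the retrieved segments would be placed in the LLM prompt so that the answer is grounded in LiverTox content rather than in the model's parametric memory.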

Results: One hundred fifty model responses were evaluated with good inter-rater reliability. GPT-4o (RAG) achieved the highest overall scores (4.47±0.10). RAG-LLMs outperformed non-RAG GPT-4o variants in accuracy (p<0.001) and completeness (p<0.01). Moderate to large effect sizes in accuracy (d=0.778) and completeness (d=0.526) were noted with RAG. No hallucinations were observed in RAG-LLM outputs, while both non-RAG GPT-4o variants produced several hallucinated responses. There were no significant differences in scoring or hallucinated response rate between the 2 non-RAG variants.
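For readers unfamiliar with the effect sizes reported above, the following sketch computes Cohen's d for two groups of ratings. The abstract does not state which variant of d was used; the pooled-standard-deviation form shown here is an assumption, and the function name `cohens_d` is illustrative.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d: difference in means divided by the pooled standard deviation.

    Assumes the pooled-SD form with sample variances (n - 1 denominator);
    each group needs at least two values and the pooled variance must be > 0.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)
```

By the usual convention, values of d near 0.5 are read as moderate and values near 0.8 as large, which matches the abstract's description of the accuracy and completeness effects.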

Conclusions: We developed an RAG-LLM integrated with LiverTox for evidence-based DILI management. RAG-LLM systems outperformed non-RAG variants and produced responses without observed hallucinations in this evaluation. Our LiverTox RAG-LLM enables reliable answers to drug hepatotoxicity questions at the point of care.

Keywords

Chemical and Drug Induced Liver Injury, Large Language Models, Humans, Decision Support Systems, Clinical, DILI, artificial intelligence, clinical decision support, hepatotoxicity

Published Open-Access

yes

Recommended Citation

Rao, Ashwin; Cholankeril, George; Flores, Avegail; et al., "Development of Retrieval-Augmented Generation-Based Large Language Model for Drug-Induced Liver Injury Using Livertox Data" (2026). Faculty, Staff and Students Publications. 6624.
https://digitalcommons.library.tmc.edu/baylor_docs/6624

Included in

Medical Sciences Commons
