Language
English
Publication Date
3-1-2026
Journal
Hepatology Communications
DOI
10.1097/HC9.0000000000000895
PMID
41678290
Abstract
Background: Idiosyncratic DILI is a complex clinical challenge requiring timely and accurate decision support. LiverTox, curated by the National Institutes of Health (NIH), offers a comprehensive DILI evidence base, but its encyclopedia-like format hinders point-of-care use. Health care providers increasingly use general large language models (LLMs) for clinical care, raising safety concerns due to LLM hallucinations or misinformation. We hypothesize that retrieval-augmented generation (RAG) integration, which grounds LLM responses in LiverTox content, would enable accurate DILI decision support.
Methods: We processed 1343 LiverTox drug monographs into 8759 indexed segments using BioBERT embeddings. We developed a RAG pipeline that employs drug-specific prioritization, section-aware weighting, and semantic search to retrieve the most relevant content per query. Twenty-five DILI questions were evaluated across 6 models: 4 RAG-LLMs (Mistral-7B, Claude-3-Haiku, Claude-3-Opus, and GPT-4o) and 2 non-RAG GPT-4o variants (one unconstrained, one soft-constrained with a prompt to reference LiverTox). Three hepatologists, blinded to the model, evaluated responses for accuracy, completeness, and conciseness using 5-point Likert scales. Analyses included pairwise comparisons and effect size estimation.
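The retrieval step described in the Methods can be pictured with a minimal sketch. The abstract does not give the actual scoring rule, so everything below is an assumption for illustration: the `Segment` fields, the section names and values in `SECTION_WEIGHTS`, the additive `DRUG_BOOST`, and the toy 2-dimensional vectors that stand in for BioBERT embeddings are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    drug: str               # drug the monograph segment belongs to
    section: str            # monograph section the segment came from
    embedding: List[float]  # stand-in for a precomputed BioBERT vector
    text: str


# Hypothetical values; the source does not report its weights.
SECTION_WEIGHTS = {"hepatotoxicity": 1.2, "outcome and management": 1.1}
DRUG_BOOST = 0.5


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score(query_text: str, query_vec: List[float], seg: Segment) -> float:
    """Semantic similarity, scaled by section weight, plus a drug-match boost."""
    similarity = cosine(query_vec, seg.embedding)
    weighted = similarity * SECTION_WEIGHTS.get(seg.section, 1.0)
    boost = DRUG_BOOST if seg.drug.lower() in query_text.lower() else 0.0
    return weighted + boost


def retrieve(query_text: str, query_vec: List[float],
             segments: List[Segment], k: int = 3) -> List[Segment]:
    """Return the k highest-scoring segments for the query."""
    ranked = sorted(segments, key=lambda s: score(query_text, query_vec, s),
                    reverse=True)
    return ranked[:k]


segments = [
    Segment("amoxicillin", "hepatotoxicity", [1.0, 0.0], "..."),
    Segment("isoniazid", "hepatotoxicity", [1.0, 0.0], "..."),
    Segment("isoniazid", "background", [0.9, 0.1], "..."),
]
top = retrieve("Is isoniazid hepatotoxic?", [1.0, 0.0], segments, k=2)
```

In this toy setup the drug-match boost moves both isoniazid segments ahead of the amoxicillin segment even though one of them is a weaker semantic match, and the section weight decides the order between the two isoniazid segments.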
Results: One hundred fifty model responses were evaluated with good inter-rater reliability. GPT-4o (RAG) achieved the highest overall scores (4.47±0.10). RAG-LLMs outperformed non-RAG GPT-4o variants in accuracy (p<0.001) and completeness (p<0.01). Moderate to large effect sizes in accuracy (d=0.778) and completeness (d=0.526) were noted with RAG. No hallucinations were observed in RAG-LLM outputs, while both non-RAG GPT-4o variants produced several hallucinated responses. There were no significant differences in scoring or hallucinated response rate between the 2 non-RAG variants.
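For readers unfamiliar with the effect sizes reported above, the following is a minimal sketch of Cohen's d using the pooled standard deviation. The abstract does not state which variant of d the authors computed, so the pooled form is an assumption, and the Likert scores below are invented toy values, not study data.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = ((n_a - 1) * variance(group_a) +
                  (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Toy 5-point Likert scores, invented for illustration only.
rag_scores = [4, 5, 5, 4]
non_rag_scores = [3, 4, 3, 4]
d = cohens_d(rag_scores, non_rag_scores)
```

By the usual convention, values near 0.5 are read as moderate and values near 0.8 as large, which matches the abstract's description of d=0.526 and d=0.778.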
Conclusions: We developed an RAG-LLM integrated with LiverTox for evidence-based DILI management. RAG-LLM systems outperformed non-RAG variants and produced responses without observed hallucinations in this evaluation. Our LiverTox RAG-LLM enables reliable answers to drug hepatotoxicity questions at the point of care.
Keywords
Chemical and Drug Induced Liver Injury, Large Language Models, Humans, Decision Support Systems, Clinical, DILI, artificial intelligence, clinical decision support, hepatotoxicity
Published Open-Access
yes
Recommended Citation
Rao, Ashwin; Cholankeril, George; Flores, Avegail; et al., "Development of Retrieval-Augmented Generation-Based Large Language Model for Drug-Induced Liver Injury Using Livertox Data" (2026). Faculty, Staff and Students Publications. 6624.
https://digitalcommons.library.tmc.edu/baylor_docs/6624