Language

English

Publication Date

3-1-2026

Journal

Hepatology Communications

DOI

10.1097/HC9.0000000000000895

PMID

41678290

Abstract

Background: Idiosyncratic drug-induced liver injury (DILI) is a complex clinical challenge requiring timely and accurate decision support. LiverTox, curated by the National Institutes of Health (NIH), offers a comprehensive DILI evidence base, but its encyclopedia-like format hinders point-of-care use. Health care providers increasingly use general large language models (LLMs) for clinical care, raising safety concerns due to LLM hallucinations or misinformation. We hypothesize that retrieval-augmented generation (RAG) integration, which grounds LLM responses in LiverTox content, would enable accurate DILI decision support.

Methods: We processed 1343 LiverTox drug monographs into 8759 indexed segments using BioBERT embeddings. We developed a RAG pipeline that employs drug-specific prioritization, section-aware weighting, and semantic search to retrieve the most relevant content per query. Twenty-five DILI questions were evaluated across 6 models: 4 RAG-LLMs (Mistral-7B, Claude-3-Haiku, Claude-3-Opus, and GPT-4o) and 2 non-RAG GPT-4o variants (unconstrained; soft constrained with a prompt to reference LiverTox). Three hepatologists, blinded to the model, evaluated responses for accuracy, completeness, and conciseness using 5-point Likert scales. Analyses included pairwise comparisons and effect size estimation.
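As an illustration only, the retrieval step described in the Methods (semantic search combined with drug-specific prioritization and section-aware weighting) might be sketched as below. The abstract does not give the actual scoring formula, so the additive combination, the bonus values, the section names, and all identifiers (`Segment`, `SECTION_BONUS`, `DRUG_MATCH_BONUS`, `score`, `retrieve`) are assumptions, and the short toy vectors stand in for BioBERT embeddings.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    drug: str               # drug the monograph segment belongs to
    section: str            # monograph section label (hypothetical names)
    embedding: list[float]  # stand-in for a BioBERT embedding
    text: str


# Hypothetical values: the abstract does not report the real weights.
SECTION_BONUS = {"hepatotoxicity": 0.2, "mechanism": 0.1}
DRUG_MATCH_BONUS = 0.5


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score(query_vec: list[float], query_drug: str, seg: Segment) -> float:
    """Semantic similarity plus section and drug-match bonuses (assumed form)."""
    s = cosine(query_vec, seg.embedding)
    s += SECTION_BONUS.get(seg.section, 0.0)
    if query_drug and seg.drug.lower() == query_drug.lower():
        s += DRUG_MATCH_BONUS
    return s


def retrieve(query_vec, query_drug, segments, k=3):
    """Return the k highest-scoring segments for a query."""
    ranked = sorted(
        segments,
        key=lambda seg: score(query_vec, query_drug, seg),
        reverse=True,
    )
    return ranked[:k]
```

In a pipeline of this shape, the text of the returned segments would be placed in the LLM prompt as grounding context. An additive bonus is used here so that a drug match can only raise a segment's rank; a multiplicative weight is an equally plausible reading of the abstract.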

Results: One hundred fifty model responses were evaluated with good inter-rater reliability. GPT-4o (RAG) achieved the highest overall scores (4.47±0.10). RAG-LLMs outperformed non-RAG GPT-4o variants in accuracy (p<0.001) and completeness (p<0.01). Moderate to large effect sizes in accuracy (d=0.778) and completeness (d=0.526) were noted with RAG. No hallucinations were observed in RAG-LLM outputs, while both non-RAG GPT-4o variants produced several hallucinated responses. There were no significant differences in scoring or hallucinated response rate between the 2 non-RAG variants.
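For readers unfamiliar with the effect sizes reported in the Results, the following is a minimal sketch of Cohen's d for two independent groups using the pooled standard deviation. The abstract does not state which variant of d was computed, so the pooled-SD form is an assumption, and the ratings in the usage lines are made-up toy values, not study data.

```python
import math
from statistics import mean, variance


def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Cohen's d with pooled standard deviation (independent groups)."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Toy Likert-style ratings, invented purely for illustration.
rag_scores = [2, 4, 6]
non_rag_scores = [1, 3, 5]
print(cohens_d(rag_scores, non_rag_scores))
```

By Cohen's conventional benchmarks (0.2 small, 0.5 medium, 0.8 large), d=0.778 for accuracy sits near the large threshold and d=0.526 for completeness is roughly medium, which matches the "moderate to large" description.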

Conclusions: We developed an RAG-LLM integrated with LiverTox for evidence-based DILI management. RAG-LLM systems outperformed non-RAG variants and produced responses without observed hallucinations in this evaluation. Our LiverTox RAG-LLM enables reliable answers to drug hepatotoxicity questions at the point of care.

Keywords

Chemical and Drug Induced Liver Injury, Large Language Models, Humans, Decision Support Systems, Clinical, DILI, artificial intelligence, clinical decision support, hepatotoxicity

Published Open-Access

yes
