Faculty, Staff and Student Publications

Language

English

Publication Date

10-15-2025

Journal

Communications Medicine

DOI

10.1038/s43856-025-01116-x

PMID

41093969

PMCID

PMC12528503

PubMedCentral® Posted Date

10-15-2025

PubMedCentral® Full Text Version

Post-print

Abstract

Background: The vast amount of natural language clinical notes about patients with cancer presents a challenge for efficient information extraction, standardization, and structuring. Traditional NLP methods require extensive annotation by domain experts for each type of named entity and necessitate model training, highlighting the need for an efficient and accurate extraction method.

Methods: This study introduces a tool based on the Large Language Model (LLM) for zero-shot information extraction from cancer-related clinical notes into structured data aligned with the minimal Common Oncology Data Elements (mCODE™) structure. We utilize the zero-shot learning capabilities of LLMs for information extraction, eliminating the need for data annotated by domain experts for training. Our methodology employs advanced hierarchical prompt engineering strategies to overcome common LLM limitations like token hallucination and accuracy issues. We tested the approach on 1,000 synthetic clinical notes representing various cancer types, comparing its performance to a traditional single-step prompting method.

Results: Our hierarchical prompt engineering strategy (accuracy = 94%, misidentification, and misplacement rate = 5%) outperforms the traditional prompt strategy (accuracy = 87%, misidentification, and misplacement rate = 10%) in information extraction. By unifying staging systems (e.g., TNM, FIGO) and specific stage details (e.g., Stage II) into a standardized framework, our approach achieves improved accuracy in extracting cancer stage information.

Conclusions: Our approach demonstrates that LLMs, when guided by structured prompting, can accurately extract complex clinical information without the need for expert-labeled data. This method has the potential to harness unstructured data for advancing cancer research.

Keywords

Cancer, Health care

Published Open-Access

yes

Share

COinS
 
 

To view the content in your browser, please download Adobe Reader or, alternately,
you may Download the file to your hard drive.

NOTE: The latest versions of Adobe Reader do not support viewing PDF files within Firefox on Mac OS and if you are using a modern (Intel) Mac, there is no official plugin for viewing PDF files within the browser window.