Faculty, Staff and Student Publications
Language
English
Publication Date
10-15-2025
Journal
Communications Medicine
DOI
10.1038/s43856-025-01116-x
PMID
41093969
PMCID
PMC12528503
PubMedCentral® Posted Date
10-15-2025
PubMedCentral® Full Text Version
Post-print
Abstract
Background: The vast amount of natural language clinical notes about patients with cancer presents a challenge for efficient information extraction, standardization, and structuring. Traditional NLP methods require extensive annotation by domain experts for each type of named entity and necessitate model training, highlighting the need for an efficient and accurate extraction method.
Methods: This study introduces a large language model (LLM)-based tool for zero-shot information extraction from cancer-related clinical notes into structured data aligned with the minimal Common Oncology Data Elements (mCODE™) structure. We use the zero-shot learning capabilities of LLMs for information extraction, eliminating the need for training data annotated by domain experts. Our methodology employs hierarchical prompt engineering strategies to mitigate common LLM limitations such as token hallucination and accuracy issues. We tested the approach on 1,000 synthetic clinical notes representing various cancer types, comparing its performance to a traditional single-step prompting method.
Results: Our hierarchical prompt engineering strategy (accuracy = 94%; misidentification and misplacement rate = 5%) outperforms the traditional prompt strategy (accuracy = 87%; misidentification and misplacement rate = 10%) in information extraction. By unifying staging systems (e.g., TNM, FIGO) and specific stage details (e.g., Stage II) into a standardized framework, our approach achieves improved accuracy in extracting cancer stage information.
Conclusions: Our approach demonstrates that LLMs, when guided by structured prompting, can accurately extract complex clinical information without the need for expert-labeled data. This method has the potential to harness unstructured data for advancing cancer research.
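Illustrative sketch (not from the paper): the abstract does not give the actual prompts or schema, so the following Python is only one way the contrast between hierarchical and single-step prompting might look. The two-level `SCHEMA`, the prompt wording, the function `extract_hierarchical`, and the stand-in model `fake_llm` are all assumptions made for illustration. The idea shown is to ask about each group of data elements first, and to query individual fields only within groups the model reports as present.

```python
from typing import Callable, Dict, List

# Hypothetical two-level schema for illustration; not the full mCODE specification.
SCHEMA: Dict[str, List[str]] = {
    "Patient": ["birth_date", "gender"],
    "Disease": ["primary_cancer", "staging_system", "stage"],
    "Treatment": ["medication", "procedure"],
}

LLM = Callable[[str], str]  # any function mapping a prompt to a text answer


def extract_hierarchical(note: str, llm: LLM) -> Dict[str, Dict[str, str]]:
    """Query each group first, then each field inside the groups found."""
    record: Dict[str, Dict[str, str]] = {}
    for group, fields in SCHEMA.items():
        # Level 1: does the note mention this group at all?
        gate = llm(
            f"GROUP [{group}]: Does the note contain {group} information? "
            f"Answer yes or no.\nNote: {note}"
        )
        if gate.strip().lower() != "yes":
            continue
        # Level 2: one narrow question per field in a relevant group.
        values: Dict[str, str] = {}
        for field in fields:
            answer = llm(
                f"FIELD [{group}.{field}]: Quote the value of {field} "
                f"from the note, or answer unknown.\nNote: {note}"
            ).strip()
            if answer.lower() != "unknown":
                values[field] = answer
        if values:
            record[group] = values
    return record


def fake_llm(prompt: str) -> str:
    """Toy stand-in for a real model so the sketch runs offline."""
    header = prompt.split(":", 1)[0]
    canned = {
        "GROUP [Disease]": "yes",
        "FIELD [Disease.staging_system]": "FIGO",
        "FIELD [Disease.stage]": "Stage II",
    }
    if header in canned:
        return canned[header]
    return "no" if header.startswith("GROUP") else "unknown"


note = "55-year-old with cervical cancer, FIGO Stage II."
record = extract_hierarchical(note, fake_llm)
print(record)
```

In this sketch the staging system and the stage are stored as separate fields of the same group, loosely mirroring the unified staging framework mentioned in Results. A single-step baseline would instead send every field name in one prompt; swapping `fake_llm` for a real model client is left to the reader.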
Keywords
Cancer, Health care
Published Open-Access
yes
Recommended Citation
Zhang, Kai; Huang, Tongtong; Malin, Bradley A; et al., "Introducing mCODEGPT as a Zero-Shot Information Extraction From Clinical Free Text Data Tool for Cancer Research" (2025). Faculty, Staff and Student Publications. 739.
https://digitalcommons.library.tmc.edu/uthshis_docs/739