Faculty, Staff and Student Publications

Language

English

Publication Date

1-1-2024

PMID

41726483

Abstract

The rapid growth of unstructured clinical text in electronic health records necessitates robust information extraction systems, yet their development is hindered by the scarcity of high-quality annotated data. This study explores the potential of large language models to generate synthetic data for clinical named entity recognition and examines its impact on model performance. We propose a novel framework that integrates self-verified synthetic data generation with domain-specific semantic mapping using SNOMED-CT. By leveraging GPT-4o-mini for synthetic data creation and refining its quality through iterative verification and anomaly detection, we systematically evaluate the influence of synthetic data quality and quantity on fine-tuning LLaMA-3-8B. Experimental results across four datasets (MTSamples, UTP, MIMIC-III, and i2b2) demonstrate that self-verification and semantic mapping significantly enhance synthetic data utility, improving model generalizability. Our findings highlight the importance of balancing human-annotated and synthetic data, with a 1:1 ratio emerging as the optimal configuration for performance gains. This study advances clinical NLP by providing a scalable approach to mitigating annotation challenges while improving model performance.
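The abstract's key configuration finding is the 1:1 balance of human-annotated and synthetic examples. As a minimal sketch of what such a mixing step could look like (the function name, arguments, and sampling strategy are illustrative assumptions, not the paper's implementation):

```python
import random

def mix_training_data(human_examples, synthetic_examples, ratio=1.0, seed=0):
    """Combine human-annotated and synthetic NER examples for fine-tuning.

    `ratio` is the number of synthetic examples taken per human example;
    ratio=1.0 corresponds to the 1:1 configuration the abstract reports
    as optimal. Sampling is capped by the synthetic pool size.
    """
    rng = random.Random(seed)  # fixed seed for reproducible mixing
    n_synth = min(len(synthetic_examples), int(len(human_examples) * ratio))
    mixed = list(human_examples) + rng.sample(synthetic_examples, n_synth)
    rng.shuffle(mixed)  # interleave the two sources before training
    return mixed
```

In practice each example would be a tokenized sentence with entity labels; here the elements are opaque, since the paper's exact data format is not given.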

Keywords

Electronic Health Records, Natural Language Processing, Humans, Semantics, Systematized Nomenclature of Medicine, Data Mining, Information Storage and Retrieval, Biological Ontologies, Large Language Models

Published Open-Access

yes
