Faculty, Staff and Student Publications
Language
English
Publication Date
1-1-2024
PMID
41726483
Abstract
The rapid growth of unstructured clinical text in electronic health records necessitates robust information extraction systems, yet their development is hindered by the scarcity of high-quality annotated data. This study explores the potential of large language models to generate synthetic data for clinical named entity recognition and examines its impact on model performance. We propose a novel framework that integrates self-verified synthetic data generation with domain-specific semantic mapping using SNOMED-CT. By leveraging GPT-4o-mini for synthetic data creation and refining its quality through iterative verification and anomaly detection, we systematically evaluate the influence of synthetic data quality and quantity on fine-tuning LLaMA-3-8B. Experimental results across four datasets (MTSamples, UTP, MIMIC-III, and i2b2) demonstrate that self-verification and semantic mapping significantly enhance synthetic data utility, improving model generalizability. Our findings highlight the importance of balancing human-annotated and synthetic data, with a 1:1 ratio emerging as the optimal configuration for performance gains. This study advances clinical NLP by providing a scalable approach to mitigating annotation challenges while improving model performance.
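The 1:1 balance between human-annotated and synthetic data mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `mix_at_ratio`, its parameters, and the example records are hypothetical, and it assumes the synthetic examples have already passed whatever verification step is used upstream.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_at_ratio(
    human: Sequence[T],
    synthetic: Sequence[T],
    synthetic_per_human: float = 1.0,  # 1.0 corresponds to a 1:1 ratio
    seed: int = 0,
) -> List[T]:
    """Combine human-annotated and synthetic examples at a fixed ratio.

    Hypothetical helper for illustration only. Synthetic examples are
    sampled without replacement; if fewer are available than the ratio
    requests, all of them are used.
    """
    rng = random.Random(seed)
    # Number of synthetic examples requested, capped at what is available.
    n_synthetic = min(len(synthetic), round(len(human) * synthetic_per_human))
    sampled = rng.sample(list(synthetic), n_synthetic)
    mixed = list(human) + sampled
    # Shuffle so the two sources are interleaved in the training set.
    rng.shuffle(mixed)
    return mixed


# Toy records tagged by source; real examples would be annotated sentences.
human_examples = [("human", i) for i in range(4)]
synthetic_examples = [("synthetic", i) for i in range(10)]
training_set = mix_at_ratio(human_examples, synthetic_examples)
```

With four human-annotated examples and the default ratio, the resulting training set holds eight examples, four from each source. Other ratios can be explored by changing `synthetic_per_human`.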
Keywords
Electronic Health Records, Natural Language Processing, Humans, Semantics, Systematized Nomenclature of Medicine, Data Mining, Information Storage and Retrieval, Biological Ontologies, Large Language Models
Published Open-Access
yes
Recommended Citation
Hu, Yan; He, Huan; Chen, Qingyu; et al., "Facilitating Clinical Information Extraction with Synthetic Data and Ontology using Large Language Models" (2024). Faculty, Staff and Student Publications. 5915.
https://digitalcommons.library.tmc.edu/uthgsbs_docs/5915
Included in
Bioinformatics Commons, Biomedical Informatics Commons, Genetic Phenomena Commons, Medical Genetics Commons, Oncology Commons