Faculty, Staff and Student Publications

Language

English

Publication Date

1-1-2024

PMID

41726483

Abstract

The rapid growth of unstructured clinical text in electronic health records necessitates robust information extraction systems, yet their development is hindered by the scarcity of high-quality annotated data. This study explores the potential of large language models to generate synthetic data for clinical named entity recognition and examines its impact on model performance. We propose a novel framework that integrates self-verified synthetic data generation with domain-specific semantic mapping using SNOMED-CT. By leveraging GPT-4o-mini for synthetic data creation and refining its quality through iterative verification and anomaly detection, we systematically evaluate the influence of synthetic data quality and quantity on fine-tuning LLaMA-3-8B. Experimental results across four datasets (MTSamples, UTP, MIMIC-III, and i2b2) demonstrate that self-verification and semantic mapping significantly enhance synthetic data utility, improving model generalizability. Our findings highlight the importance of balancing human-annotated and synthetic data, with a 1:1 ratio emerging as the optimal configuration for performance gains. This study advances clinical NLP by providing a scalable approach to mitigating annotation challenges while improving model performance.
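The abstract's key configuration finding is the 1:1 balance of human-annotated and synthetic examples. As a minimal sketch of what such a mixing step could look like (the function name, arguments, and sampling strategy are illustrative assumptions, not the paper's implementation):

```python
import random

def mix_training_data(human_examples, synthetic_examples, ratio=1.0, seed=0):
    """Combine human-annotated and synthetic NER examples for fine-tuning.

    `ratio` is the number of synthetic examples taken per human example;
    ratio=1.0 corresponds to the 1:1 configuration the abstract reports
    as optimal. Sampling is capped by the synthetic pool size.
    """
    rng = random.Random(seed)  # fixed seed for reproducible mixing
    n_synth = min(len(synthetic_examples), int(len(human_examples) * ratio))
    mixed = list(human_examples) + rng.sample(synthetic_examples, n_synth)
    rng.shuffle(mixed)  # interleave the two sources before training
    return mixed
```

In practice each example would be a tokenized sentence with entity labels; here the elements are opaque, since the paper's exact data format is not given.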

Keywords

Electronic Health Records, Natural Language Processing, Humans, Semantics, Systematized Nomenclature of Medicine, Data Mining, Information Storage and Retrieval, Biological Ontologies, Large Language Models

Published Open-Access

yes
