Faculty, Staff and Student Publications
Language
English
Publication Date
7-1-2025
Journal
Journal of the American Medical Informatics Association
DOI
10.1093/jamia/ocaf064
PMID
40332956
PMCID
PMC12202029
PubMedCentral® Posted Date
5-7-2025
PubMedCentral® Full Text Version
Post-print
Abstract
Objective: Common Data Elements (CDEs) standardize data collection and sharing across studies, enhancing data interoperability and improving research reproducibility. However, implementing CDEs presents challenges due to the broad range and variety of data elements. This study aims to develop a CDE mapping tool to bridge the gap between local data elements and National Institutes of Health (NIH) CDEs.
Methods: We propose CDEMapper, a large language model (LLM)-powered tool designed to assist in mapping local data elements to NIH CDEs. CDEMapper has 3 core modules: (1) CDE indexing and embeddings, in which NIH CDEs are indexed and embedded to support semantic search; (2) CDE recommendations, in which the tool combines Elasticsearch (BM25) with GPT services to recommend candidate CDEs and their permissible values; and (3) human review, in which users review and select the best match for their data elements and value sets. We evaluate the tool's recommendation accuracy against manual annotations and its usability through user testing.
Results: CDEMapper offers a publicly available, LLM-powered, and intuitive user interface that consolidates essential and advanced mapping services into a streamlined pipeline. The evaluation showed that BM25 augmented with GPT embeddings and a GPT ranker achieved the best overall performance. The usability test also highlighted the effectiveness and efficiency of the tool.
Discussions and conclusions: This work demonstrates the potential of LLMs to assist with CDE mapping when aligning local data elements with NIH CDEs. It also helps researchers better understand the gaps between their data elements and NIH CDEs while promoting CDE reusability.
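To illustrate the lexical retrieval step mentioned in the Methods, the following is a minimal sketch of BM25 candidate ranking in plain Python. It is not the CDEMapper implementation: the function names, the whitespace tokenizer, and the toy candidate names are all hypothetical, and the embedding and GPT re-ranking stages are omitted. The scoring uses the Lucene-style BM25 formula with the commonly used defaults k1 = 1.2 and b = 0.75.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; real analyzers do more).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with Lucene-style BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def rank_candidates(query, docs, top_k=3):
    """Return up to top_k (document, score) pairs with a nonzero score, best first."""
    scores = bm25_scores(query, docs)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(docs[i], scores[i]) for i in order[:top_k] if scores[i] > 0]


# Hypothetical candidate names standing in for indexed CDEs.
candidates = [
    "Birth date",
    "Sex assigned at birth",
    "Body mass index value",
    "Date of enrollment",
]
results = rank_candidates("patient birth date", candidates)
for name, score in results:
    print(f"{score:.3f}  {name}")
```

In a pipeline like the one described, a shortlist of this kind would then be re-ranked (for example with embedding similarity or an LLM ranker) before a human reviewer selects the final match.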
Keywords
National Institutes of Health (U.S.), United States, Common Data Elements, Humans, Software, Semantics, Large Language Models, common data element, interoperability, data collection, data sharing, large language model
Published Open-Access
yes
Recommended Citation
Wang, Yan; Huang, Jimin; He, Huan; et al., "CDEMapper: Enhancing National Institutes of Health Common Data Element Use With Large Language Models" (2025). Faculty, Staff and Student Publications. 658.
https://digitalcommons.library.tmc.edu/uthshis_docs/658