Faculty, Staff and Student Publications

Representation of EHR Data for Predictive Modeling: A Comparison Between UMLS and Other Terminologies

Publication Date

10-1-2020

Journal

J Am Med Inform Assoc

DOI

10.1093/jamia/ocaa180

PMID

32930711

PMCID

PMC7647355

PubMedCentral® Posted Date

September 2020

PubMedCentral® Full Text Version

Post-print

Abstract

OBJECTIVE: Predictive disease modeling using electronic health record data is a growing field. Although clinical data in their raw form can be used directly for predictive modeling, it is a common practice to map data to standard terminologies to facilitate data aggregation and reuse. There is, however, a lack of systematic investigation of how different representations could affect the performance of predictive models, especially in the context of machine learning and deep learning.

MATERIALS AND METHODS: We projected the input diagnosis data in the Cerner HealthFacts database to the Unified Medical Language System (UMLS) and 5 other terminologies (CCS, CCSR, ICD-9, ICD-10, and PheWAS) and evaluated the prediction performance of each terminology on 2 tasks: risk prediction of heart failure in diabetes patients and risk prediction of pancreatic cancer. Two popular models were evaluated: logistic regression and a recurrent neural network.
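To make the idea of "projecting" diagnoses onto a terminology concrete, the following is a minimal sketch, not the authors' pipeline. The codes (A1, B1, ...), the mapping table FINE_TO_COARSE, and the helper names project and multi_hot are all hypothetical and chosen only for illustration. It shows how the same patient records yield feature vectors of different width depending on the target vocabulary.

```python
# Hypothetical mapping from fine-grained source codes to a coarser grouping.
# These are toy identifiers, not entries from any real terminology.
FINE_TO_COARSE = {"A1": "A", "A2": "A", "B1": "B", "B2": "B", "C1": "C"}


def project(codes, mapping):
    """Map each source code to the target terminology, dropping unmapped codes."""
    return [mapping[c] for c in codes if c in mapping]


def multi_hot(patients, vocab):
    """Encode each patient's code list as a 0/1 vector over the vocabulary."""
    index = {code: i for i, code in enumerate(vocab)}
    rows = []
    for codes in patients:
        row = [0] * len(vocab)
        for c in codes:
            row[index[c]] = 1
        rows.append(row)
    return rows


# Three toy patients, each with a list of fine-grained diagnosis codes.
patients_fine = [["A1", "B1"], ["A2"], ["B2", "C1"]]
patients_coarse = [project(p, FINE_TO_COARSE) for p in patients_fine]

vocab_fine = sorted({c for p in patients_fine for c in p})
vocab_coarse = sorted({c for p in patients_coarse for c in p})

X_fine = multi_hot(patients_fine, vocab_fine)
X_coarse = multi_hot(patients_coarse, vocab_coarse)
```

In this toy setting the fine-grained vocabulary has 5 columns and the coarse one has 3; patients 1 and 2 share a feature only after projection, which is the kind of trade-off between granularity and aggregation that the comparison examines.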

RESULTS: For logistic regression, UMLS delivered the best area under the receiver operating characteristic curve (AUROC) in both the heart failure in diabetes patients (81.15%) and pancreatic cancer (80.53%) tasks. For the recurrent neural network, UMLS worked best for pancreatic cancer prediction (AUROC 82.24%) and, for prediction of heart failure in diabetes patients, ranked second (AUROC 85.55%) behind only PheWAS (AUROC 85.87%).
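For readers less familiar with the metric reported here, AUROC can be read as the probability that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case, with ties counted as half. The sketch below computes it by direct pairwise comparison; the function name auroc and the example labels and scores are illustrative assumptions and are unrelated to the study's data.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 2 positives, 2 negatives; 3 of the 4 pairs are ordered correctly.
example = auroc([1, 0, 1, 0], [0.9, 0.4, 0.35, 0.1])
```

The pairwise loop is quadratic in the number of cases, so it is suitable only for small illustrations; library implementations typically use a sort-based computation instead.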

DISCUSSION/CONCLUSION: In our experiments, terminologies with larger vocabularies and finer-grained representations were associated with better prediction performance. In particular, UMLS was consistently one of the best-performing terminologies. We believe that our work may help to inform better designs of predictive models, although further investigation is warranted.

Keywords

Aged, Databases, Factual, Electronic Health Records, Female, Humans, Male, Middle Aged, ROC Curve, Unified Medical Language System, Vocabulary, Controlled

Published Open-Access

yes

Recommended Citation

Rasmy, Laila; Tiryaki, Firat; Zhou, Yujia; et al., "Representation of EHR Data for Predictive Modeling: A Comparison Between UMLS and Other Terminologies" (2020). Faculty, Staff and Student Publications. 50.
https://digitalcommons.library.tmc.edu/uthgsbs_docs/50

Included in

Medicine and Health Sciences Commons
