Journal Articles

Representation of EHR data for predictive modeling: a comparison between UMLS and other terminologies.

Authors

Laila Rasmy
Firat Tiryaki
Yujia Zhou
Yang Xiang
Cui Tao
Hua Xu
Degui Zhi

Publication Date

10-1-2020

Journal

J Am Med Inform Assoc

Abstract

OBJECTIVE: Predictive disease modeling using electronic health record data is a growing field. Although clinical data in their raw form can be used directly for predictive modeling, it is a common practice to map data to standard terminologies to facilitate data aggregation and reuse. There is, however, a lack of systematic investigation of how different representations could affect the performance of predictive models, especially in the context of machine learning and deep learning.

MATERIALS AND METHODS: We projected the input diagnosis data in the Cerner HealthFacts database to the Unified Medical Language System (UMLS) and 5 other terminologies (CCS, CCSR, ICD-9, ICD-10, and PheWAS) and evaluated the prediction performance of each terminology on 2 tasks: risk prediction of heart failure in diabetes patients and risk prediction of pancreatic cancer. Two popular models were evaluated: logistic regression and a recurrent neural network.
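As a rough illustration of what projecting diagnosis codes to another terminology can look like, the sketch below maps a patient's source codes through a lookup table and builds a count vector over the target vocabulary. The mapping table, the group labels, and the function names are hypothetical and chosen for illustration only; the abstract does not describe the authors' actual mapping pipeline.

```python
from collections import Counter

# Hypothetical toy mapping from source diagnosis codes to a coarser target
# terminology. The group labels are invented for this sketch; real projects
# would load a published crosswalk instead.
TOY_MAP = {
    "E11.9": "GROUP_DIABETES",
    "E11.65": "GROUP_DIABETES",
    "I50.9": "GROUP_HEART_FAILURE",
}


def project_codes(codes, mapping, unknown="UNMAPPED"):
    """Replace each source code by its target term, or `unknown` if absent."""
    return [mapping.get(code, unknown) for code in codes]


def bag_of_codes(codes, vocabulary):
    """Count how often each vocabulary term occurs in a patient's code list."""
    counts = Counter(codes)
    return [counts.get(term, 0) for term in vocabulary]


# One patient's diagnosis history in the source coding.
patient = ["E11.9", "E11.65", "I50.9", "Z00.00"]
projected = project_codes(patient, TOY_MAP)

# Fixed, sorted vocabulary so every patient gets the same feature layout.
vocab = sorted(set(TOY_MAP.values()) | {"UNMAPPED"})
features = bag_of_codes(projected, vocab)
```

A coarser target terminology yields a shorter vector in which several source codes share one feature, which is the trade-off between vocabulary size and granularity that the study compares.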

RESULTS: For logistic regression, UMLS delivered the best area under the receiver operating characteristic curve (AUROC) in both the heart failure (81.15%) and pancreatic cancer (80.53%) tasks. For the recurrent neural network, UMLS worked best for pancreatic cancer prediction (AUROC 82.24%) and was second (AUROC 85.55%) only to PheWAS (AUROC 85.87%) for heart failure prediction.
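For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case, with ties counted as one half. The following is a minimal sketch of that definition on made-up labels and scores; it is not the evaluation code used in the study, and the function name is an assumption.

```python
def auroc(labels, scores):
    """Pairwise AUROC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: 2 positive and 2 negative patients with predicted risks.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]
result = auroc(labels, scores)  # 3 of 4 pairs are ordered correctly -> 0.75
```

The quadratic loop is fine for small examples; on large cohorts a rank-based implementation from an established library would be the more practical choice.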

DISCUSSION/CONCLUSION: In our experiments, terminologies with larger vocabularies and finer-grained representations were associated with better prediction performance. In particular, UMLS was consistently one of the best-performing terminologies. We believe that our work may help to inform better designs of predictive models, although further investigation is warranted.

Keywords

Aged; Databases, Factual; Electronic Health Records; Female; Humans; Male; Middle Aged; ROC Curve; Unified Medical Language System; Vocabulary, Controlled

Comments

PMCID: PMC7647355

Included in

Medicine and Health Sciences Commons