Faculty, Staff and Student Publications

Publication Date

3-1-2023

Journal

Journal of Biomedical Informatics

Abstract

Electronic health records (EHR) are collected as a routine part of healthcare delivery and have great potential to improve patient health outcomes. They contain multiple years of health information that can be leveraged for risk prediction, disease detection, and treatment evaluation. However, they do not have a consistent, standardized format across institutions, particularly in the United States, and can present significant analytical challenges: they contain multi-scale data from heterogeneous domains and include both structured and unstructured data. Data for individual patients are collected at irregular time intervals and with varying frequencies. In addition to the analytical challenges, EHR can reflect inequity: patients belonging to different groups will have differing amounts of data in their health records. Many of these issues can contribute to biased data collection. The consequence is that the data for under-served groups may be less informative, partly due to more fragmented care, which can be viewed as a type of missing data problem. For EHR data in this complex form, there is currently no framework for introducing realistic missing values. There has also been little to no work in assessing the impact of missing data in EHR. In this work, we first introduce a terminology to define three levels of EHR data and then propose a novel framework for simulating realistic missing data scenarios in EHR to adequately assess their impact on predictive modeling. We incorporate a medical knowledge graph to capture dependencies between medical events and so create a more realistic missing data framework. In an intensive care unit setting, we found that missing data have a greater negative impact on the performance of disease prediction models in groups that tend to have less access to healthcare or seek less healthcare. We also found that the impact of missing data on disease prediction models is stronger when using the knowledge graph framework to introduce realistic missing values as opposed to random event removal.

Keywords

Humans, United States, Electronic Health Records, Delivery of Health Care, Intensive Care Units

DOI

10.1016/j.jbi.2022.104269

PMID

36621750

PMCID

PMC10391553

PubMedCentral® Posted Date

3-1-2024

PubMedCentral® Full Text Version

Post-print

Published Open-Access

yes
