Dissertations & Theses (Open Access)

Graduation Date


Degree Name

Doctor of Philosophy (PhD)

School Name

The University of Texas School of Biomedical Informatics at Houston

Advisory Committee

Dr. Todd Johnson


There is growing interest in the reuse of clinical data for research and clinical healthcare quality improvement. However, direct analysis of clinical data sets can yield misleading results. Data cleaning is often employed to detect and fix data issues during analysis, but this approach lacks systematicity. Data Quality (DQ) assessments are a more thorough way of spotting threats to the validity of analytical results stemming from data repurposing, because they aim to evaluate ‘fitness for purpose’. However, there is currently no systematic method to assess DQ for the secondary analysis of clinical data. In this dissertation, I present DataGauge, a framework to address this gap in the state of the art.

I begin by introducing the problem and its general significance to the field of biomedical and clinical informatics (Chapter 1). I then present a literature review that surveys current methods for the DQ assessment of repurposed clinical data and derive the features required to advance the state of the art (Chapter 2). In Chapter 3, I present DataGauge, a model-driven framework for systematically assessing the quality of repurposed clinical data, which addresses current limitations in the state of the art. Chapter 4 describes the development of a guidance framework to ensure the systematicity of DQ assessment design. I then evaluate DataGauge’s ability to flag potential DQ issues in comparison to a systematic state-of-the-art method. DataGauge increased tenfold the number of potential DQ issues found relative to that method. It not only identified more specific issues that were a direct threat to fitness for purpose, but also provided broader coverage of the clinical data types and knowledge domains involved in secondary analyses.

DataGauge sets the groundwork for systematic and purpose-specific DQ assessments that fully integrate with secondary analysis workflows. It also promotes a team-based approach and the explicit definition of DQ requirements to support communication and transparent reporting of DQ results. Overall, this work provides tools that pave the way to a deeper understanding of the limitations of repurposed clinical datasets before analysis. It is also a first step towards the automation of purpose-specific DQ assessments for the secondary use of clinical data. Future work will consist of further developing these methods and validating them with research teams making secondary use of clinical data.


Data quality, electronic health records, data cleaning