Sina Madani

Graduation Date

Fall 11-5-2013

Degree Name

Doctor of Philosophy (PhD)

School Name

The University of Texas School of Health Information Sciences at Houston

Advisory Committee

Dean Sittig


Natural Language Processing (NLP), ontology, data mining, information systems


The Institute of Medicine reports a growing demand in recent years for quality improvement within the healthcare industry. In response, numerous organizations have been involved in the development and reporting of quality measurement metrics. However, disparate data models from such organizations shift the burden of accurate and reliable metrics extraction and reporting to healthcare providers. Furthermore, manual abstraction of quality metrics and diverse implementations of Electronic Health Record (EHR) systems deepen the complexity of consistent, valid, explicit, and comparable quality measurement reporting within healthcare provider organizations.

The main objective of this research is to evaluate an ontology-based information extraction framework that uses unstructured clinical text to define and report quality of care metrics that are interpretable and comparable across different healthcare institutions.

All transcribed clinical notes (48,835) from 2,085 patients who had undergone surgery in 2011 at MD Anderson Cancer Center were extracted from the institution's EMR system and pre-processed to identify section headers. Subsequently, all notes were analyzed by MetaMap v2012, and one XML file was generated per note. The XML outputs were converted into Resource Description Framework (RDF) format. We also developed three ontologies: a section header ontology built from the extracted section headers using the RDF standard; a concept ontology comprising entities from SNOMED that represent five quality metrics (Diabetes, Hypertension, Cardiac Surgery, Transient Ischemic Attack, CNS tumor); and a clinical note ontology that represents clinical note elements and their relationships. All ontologies (in Web Ontology Language format) and patient notes (as RDF) were imported into a triple store (AllegroGraph) as classes and instances, respectively. SPARQL queries were used to report extracted concepts under four settings: base Natural Language Processing (NLP) output, inclusion of the concept ontology, exclusion of negated concepts, and inclusion of the section header ontology. Existing manual abstraction data from surgical clinical reviewers, covering the same set of patients and documents, served as the gold standard.

Micro-averaged agreement with the gold standard rose from 59%, 81%, and 68% (precision, recall, and F-measure, respectively) for the base NLP output to 74%, 91%, and 82% after incremental addition of the ontology layers.

Our study introduced a framework that may contribute to advances in “complementary” components for existing information extraction systems. The application of an ontology-based approach to natural language processing in our study has provided mechanisms for increasing the performance of such tools. The pivot point for extracting more meaningful quality metrics from clinical narratives is the abstraction of contextual semantics hidden in the notes. We have defined some of these semantics and quantified them in multiple complementary layers in order to demonstrate the importance and applicability of an ontology-based approach to quality metric extraction. The application of such ontology layers introduces powerful new ways of querying context-dependent entities from clinical texts.
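
As an illustration of how the micro-averaged precision, recall, and F-measure reported above are defined, the following is a minimal Python sketch, not the code used in the study. The `Counts` class, the `micro_average` and `f_measure` functions, and the example counts are all hypothetical; only the percentage values passed to `f_measure` in the usage note come from the abstract. Micro-averaging pools true positives, false positives, and false negatives across all quality metrics before computing the scores.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Agreement counts for one quality metric against the gold standard."""
    tp: int  # concepts reported by the system and confirmed by reviewers
    fp: int  # concepts reported by the system but absent from the gold standard
    fn: int  # gold-standard concepts the system missed


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_average(per_metric: dict[str, Counts]) -> tuple[float, float, float]:
    """Pool counts over all metrics, then compute precision, recall, F."""
    tp = sum(c.tp for c in per_metric.values())
    fp = sum(c.fp for c in per_metric.values())
    fn = sum(c.fn for c in per_metric.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical counts, for illustration only (not data from the study).
example = {
    "Diabetes": Counts(tp=8, fp=2, fn=2),
    "Hypertension": Counts(tp=6, fp=4, fn=0),
}
precision, recall, f1 = micro_average(example)
print(f"P={precision:.2f} R={recall:.2f} F={f1:.2f}")
```

The same `f_measure` function can be used to check that the reported F-measures follow from the reported precision and recall: `f_measure(0.59, 0.81)` rounds to 0.68 and `f_measure(0.74, 0.91)` rounds to 0.82, matching the values in the abstract.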

Rigorous evaluation is still necessary to ensure the quality of these “complementary” NLP systems. Moreover, research is needed to create and update evaluation guidelines and criteria for assessing the performance and efficiency of ontology-based information extraction in healthcare, and to provide a consistent baseline for comparing alternative approaches.