Variable Selection and Imputation for High-Dimensional Incomplete Data
Abstract
Missing data are an inevitable problem in data sets with numerous variables. Their presence obstructs the use of existing variable selection methods, especially when the number of observations is limited or there are no complete cases. Obtaining valid results from such data requires applicable and efficient methods that combine selection with imputation. The goal of this study is to propose approaches for selecting important variables from incomplete high-dimensional data. The approaches use a joint model for imputation in high-dimensional settings and employ a clustering strategy for the final subset selection. In addition, we take mixed data with both continuous and binary variables into account through their normal approximations. The approaches are applied to clinical trial data from the National Institute of Neurological Disorders and Stroke (NINDS) Exploratory Trials in Parkinson's Disease Long-term Study 1 (NET-PD LS-1). Results from a simulation study and from the data analysis are presented and compared with other possible approaches. The proposed approaches can be applied to data from diverse types of clinical trials or biomedical data sets.
Subject Area
Biostatistics
Recommended Citation
Zhang, Yunxi, "Variable Selection and Imputation for High-Dimensional Incomplete Data" (2018). Texas Medical Center Dissertations (via ProQuest). AAI10846583.
https://digitalcommons.library.tmc.edu/dissertations/AAI10846583