A comparison of statistical learning methods for multiple imputation of unknown stroke types in the Systolic Hypertension in the Elderly Program (SHEP) clinical trial

Alokananda Ghosh, The University of Texas School of Public Health

Abstract

Many clinical studies contain strokes of unknown type. Two statistical learning methods, lasso logistic regression and random forest, were combined with multiple imputation and applied to the Systolic Hypertension in the Elderly Program (SHEP) dataset to classify strokes of unknown type using patient characteristics. The impact of the newly classified strokes on the ischemic and hemorrhagic stroke risk ratios (RR) in the active versus placebo groups was then determined.
Study design: SHEP was a randomized trial of 4736 participants aged 60 years or older with isolated systolic hypertension, who were randomly assigned to receive antihypertensive treatment or placebo. Mean follow-up was 4.5 years. A total of 262 incident strokes occurred (217 ischemic, 28 hemorrhagic and 17 unknown). The adjusted RR in the active group versus placebo was 0.63, 95% confidence interval (CI) [0.48–0.82], for ischemic stroke and 0.46, 95% CI [0.21–1.02], for hemorrhagic stroke.
Methods: Patient characteristics were compared between known and unknown strokes, and between stroke types. The known strokes were split into training and test sets for model building and for assessing prediction accuracy. Univariate logistic regression, lasso logistic regression and random forest models were fitted on the training set. Prediction accuracy was gauged using ROC curves on the test set. Multiple imputation was employed to account for prediction uncertainty. Univariate and multivariate Cox regressions were fitted on the imputed datasets.
Results: The lasso method performed slightly better [AUC = 0.61, 95% CI (0.36–0.86)] than the random forest method [AUC = 0.603, 95% CI (0.35–0.85)] and was chosen to classify the unknown strokes. The posterior probabilities were used to impute 20 complete data sets. Applying Rubin’s formula to the univariate and multivariate Cox regression models from the imputed data sets gave RRs for hemorrhagic and ischemic stroke similar to the original results, with slightly wider 95% CIs.
Conclusions: Neither lasso nor random forest performed very well in classifying the 17 unknown strokes in SHEP, likely due to small sample size. Statistical learning combined with multiple imputation is a potentially valuable tool in classifying stroke type in clinical studies with large sample size.
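
The imputation and pooling steps described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's code: the function names, the default seed, and the normal-approximation interval (in place of the t-based reference distribution usually used with Rubin's rules) are assumptions made for illustration. It draws each unknown stroke type from its predicted probability to build m imputed label sets, then pools per-imputation log RR estimates using Rubin's combining rules.

```python
import math
import random
from statistics import mean, variance


def impute_stroke_types(p_hemorrhagic, m=20, seed=0):
    """Draw m imputed label vectors (1 = hemorrhagic, 0 = ischemic).

    p_hemorrhagic holds one predicted probability per unknown stroke
    (hypothetical input, e.g. from a fitted classifier).
    """
    rng = random.Random(seed)
    return [[1 if rng.random() < p else 0 for p in p_hemorrhagic]
            for _ in range(m)]


def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    w = mean(variances)            # within-imputation variance
    b = variance(estimates)        # between-imputation variance (m - 1 denominator)
    t = w + (1 + 1 / m) * b        # total variance
    return q_bar, t


def pooled_rr_ci(log_rrs, std_errors, z=1.96):
    """Pooled RR with an approximate 95% CI (normal approximation, an assumption)."""
    q_bar, t = pool_rubin(log_rrs, [se ** 2 for se in std_errors])
    half_width = z * math.sqrt(t)
    return math.exp(q_bar), math.exp(q_bar - half_width), math.exp(q_bar + half_width)
```

In use, each of the m label sets would be merged with the known strokes, a Cox model fitted per imputed data set, and the resulting log RRs and standard errors passed to `pooled_rr_ci`. The between-imputation term is what widens the pooled interval relative to a single completed data set.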

Subject Area

Biostatistics

Recommended Citation

Ghosh, Alokananda, "A comparison of statistical learning methods for multiple imputation of unknown stroke types in the Systolic Hypertension in the Elderly Program (SHEP) clinical trial" (2014). Texas Medical Center Dissertations (via ProQuest). AAI1568457.
https://digitalcommons.library.tmc.edu/dissertations/AAI1568457
