Modeling Approaches for Analyzing Healthcare Cost Data

Noopur Singh, The University of Texas School of Public Health


Background: The rising cost of medical care worldwide has highlighted the importance of estimating the cost of illnesses and the impact of specific treatments and of patient or hospital characteristics on the cost of care. However, statistical analysis of health care costs poses a number of difficulties, since these data typically exhibit skewness, heavy tails and multimodality that must be accounted for when deriving cost estimates. In such cases, the assumptions of standard ordinary least-squares (OLS) linear regression are unlikely to be met.

Objective: We compared the performance and fit of generalized additive models (GAMs), a flexible non-parametric technique, with those of popular approaches such as generalized linear models (GLMs) and log-transformed OLS models in estimating the effect of the primary predictor of interest on the positively skewed dependent variable of mean cost. We hypothesized that the choice of estimator would have major implications for mean cost estimates.

Methods: Data were extracted from the 2014 Texas Health Care Information Collection (THCIC) inpatient public use database for empirical analysis. Patients with epilepsy aged 17 and younger were identified using ICD-9 codes of the form 345.XX. The modelling framework consisted of three classes of models: (i) GLMs and (ii) GAMs, each using a log link with Gamma and inverse Gaussian distributions, and (iii) log OLS models. Model performance in estimating the marginal effect of the presence of non-neuromuscular chronic comorbid conditions on mean hospitalization cost for this patient population was assessed using the Akaike information criterion (AIC) and deviance residual plots.

Results: Of the 3211 pediatric epilepsy patients selected from the THCIC dataset, 27.6% presented with non-neuromuscular comorbidities. These patients had median hospitalization costs of $8558.20, 35% higher than the median costs for patients without these comorbidities. The GAMs provided precise effect estimates but were unable to resolve the non-normality and heteroscedasticity present in the data. Residual plots of the log OLS and GLM-Gamma models indicated that both models fit well, but the log OLS model had the lowest AIC. In the presence of non-neuromuscular comorbidities, the log OLS and GLM-Gamma models estimated a 3% and 7% increase in hospitalization costs, respectively. However, after controlling for other predictors, this effect was not statistically significant (p>0.05). The predicted mean hospitalization cost estimated by the GLM-Gamma model [$13,165.03 (SD $13,989.41)] was closer to the observed sample mean [$13,340.50 (SD $30,504.40)] than the lower predicted mean cost estimated by the log OLS model [$12,203.86 (SD $10,879.13)].

Conclusions: No single alternative outperforms the others under all conditions. There are important trade-offs in the precision, consistency and ease of use of techniques for deriving cost estimates. Log-transformed OLS models, although resilient to cost data problems, require familiarity with the error structure of the data, since ignoring heteroscedasticity can lead to substantially biased estimates upon retransformation. Although GLMs provide direct results, their estimates can be imprecise if the appropriate variance function is not identified. GAMs allow for non-linear, non-parametric functions of covariates; in this study, however, they offered limited flexibility and were unable to resolve the skewness and heteroscedasticity of the cost data. We have shown that the choice of estimator has major implications for mean cost estimates, and that the return on time spent analyzing the performance of such estimators can be very high, since major biases and losses in precision can be avoided.
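
The retransformation issue noted for log OLS models can be illustrated with a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the log-normal cost model, the group parameters, and the sample size are hypothetical and are not taken from the THCIC data. The abstract does not state which retransformation was applied in the dissertation; the sketch uses Duan's smearing estimator, a standard choice, to show how a single smearing factor misestimates group means when the log-scale error variance differs between groups.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical log-scale (mean, SD) per group; NOT estimated from THCIC data.
# The comorbidity group gets a larger log-scale SD, so errors are heteroscedastic.
PARAMS = {"no_comorbidity": (9.0, 0.8), "comorbidity": (9.1, 1.2)}
N_PER_GROUP = 20000

costs = {
    g: [random.lognormvariate(mu, sd) for _ in range(N_PER_GROUP)]
    for g, (mu, sd) in PARAMS.items()
}

# Log OLS with one binary predictor reduces to the group means of log(cost).
fitted_log = {g: statistics.fmean(math.log(c) for c in cs) for g, cs in costs.items()}
residuals = {g: [math.log(c) - fitted_log[g] for c in cs] for g, cs in costs.items()}

# Naive retransformation: exp(fitted log cost) is the geometric mean,
# which never exceeds the arithmetic mean.
naive = {g: math.exp(m) for g, m in fitted_log.items()}

# Duan smearing with ONE factor for all observations (assumes homoscedasticity).
all_residuals = [r for rs in residuals.values() for r in rs]
common_factor = statistics.fmean(math.exp(r) for r in all_residuals)
common_smeared = {g: naive[g] * common_factor for g in costs}

# Group-specific smearing factors, which allow the error variance to differ.
group_factor = {g: statistics.fmean(math.exp(r) for r in rs) for g, rs in residuals.items()}
group_smeared = {g: naive[g] * group_factor[g] for g in costs}

observed = {g: statistics.fmean(cs) for g, cs in costs.items()}

for g in costs:
    print(
        f"{g}: observed={observed[g]:.0f} naive={naive[g]:.0f} "
        f"common_smearing={common_smeared[g]:.0f} group_smearing={group_smeared[g]:.0f}"
    )
```

In this simulated setting the naive retransformation understates the mean cost in both groups, the common smearing factor overstates it for the lower-variance group and understates it for the higher-variance group, and the group-specific factors recover the observed group means. A GLM with a log link sidesteps the issue by modelling the mean cost directly on the original scale.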

Recommended Citation

Singh, Noopur, "Modeling Approaches for Analyzing Healthcare Cost Data" (2018). Texas Medical Center Dissertations (via ProQuest). AAI10930783.