An empirical evaluation of the Random Forests classifier models for variable selection in a large-scale lung cancer case-control study

Qing Zhang, The University of Texas School of Public Health

Abstract

Random Forests™ is reported to be one of the most accurate classification algorithms for complex data analysis. It performs well even when most predictors are noisy and the number of variables is much larger than the number of observations. In this thesis, Random Forests was applied to a large-scale lung cancer case-control study. A novel way of automatically selecting prognostic factors was proposed, and a synthetic positive control was used to validate the Random Forests method. Throughout this study, we showed that Random Forests can deal with a large number of weak input variables without overfitting and can account for non-additive interactions between these input variables. Random Forests can also be used for variable selection without being adversely affected by collinearities, can handle large-scale data sets without rigorous data preprocessing, and has a robust variable importance ranking measure. The proposed variable selection method, set in the context of Random Forests, uses the data noise level as the cut-off value to determine the subset of important predictors. This new approach enhanced the ability of the Random Forests algorithm to automatically identify important predictors in complex data. The cut-off value can also be adjusted based on the results of the synthetic positive control experiments. When the data set had a high variables-to-observations ratio, Random Forests complemented the established logistic regression, and this study therefore recommends Random Forests for such high-dimensional data. One can use Random Forests to select the important variables and then use logistic regression or Random Forests itself to estimate the effect sizes of the predictors and to classify new observations. We also found that the mean decrease in accuracy is a more reliable variable ranking measure than the mean decrease in Gini.
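
The two ideas named in the abstract, ranking variables by mean decrease in accuracy and keeping only those whose importance exceeds a noise-derived cut-off, can be illustrated with a minimal sketch. This is not the thesis's implementation. The abstract does not say how the data noise level is estimated, so the sketch assumes it is taken as the largest importance score among variables known to be pure noise (for example, synthetic random columns added to the data). The function names, the variable names, and the importance numbers are hypothetical, and the permutation importance here is computed on a fixed data set with an arbitrary `predict` function rather than on the out-of-bag samples of a forest.

```python
import random
from typing import Callable, Dict, List, Sequence


def mean_decrease_accuracy(
    predict: Callable[[List[float]], int],
    X: Sequence[List[float]],
    y: Sequence[int],
    n_repeats: int = 5,
    seed: int = 0,
) -> List[float]:
    """Average drop in accuracy when each column of X is shuffled in turn.

    Simplified stand-in for the Random Forests permutation importance:
    it uses the given rows directly instead of out-of-bag samples.
    """
    rng = random.Random(seed)

    def accuracy(rows: Sequence[List[float]]) -> float:
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    scores = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between column j and y
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(shuffled))
        scores.append(sum(drops) / n_repeats)
    return scores


def select_above_noise(
    importance: Dict[str, float], noise_names: Sequence[str]
) -> List[str]:
    """Keep real variables whose importance exceeds every noise variable.

    Assumption: the cut-off is the maximum importance among the
    known-noise variables; the thesis may define the noise level differently.
    """
    cutoff = max(importance[name] for name in noise_names)
    return [
        name
        for name, score in importance.items()
        if name not in noise_names and score > cutoff
    ]


# Made-up importance scores, for illustration only.
example_importance = {
    "smoking_pack_years": 0.12,
    "age": 0.03,
    "snp_7": 0.015,
    "noise_1": 0.01,
    "noise_2": 0.02,
}
selected = select_above_noise(example_importance, ["noise_1", "noise_2"])
```

With these made-up scores the cut-off is 0.02, so `snp_7` is dropped and the other two real variables are kept. Raising or lowering the cut-off, as the abstract suggests can be done using the synthetic positive control results, would amount to replacing the `max` rule with a different threshold.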

Subject Area

Biostatistics

Recommended Citation

Zhang, Qing, "An empirical evaluation of the Random Forests classifier models for variable selection in a large-scale lung cancer case-control study" (2006). Texas Medical Center Dissertations (via ProQuest). AAI3259518.
https://digitalcommons.library.tmc.edu/dissertations/AAI3259518
