BEST SUBSET SELECTION FOR CATEGORICAL DATA

BARBARA CLAIRE TILLEY, The University of Texas School of Public Health

Abstract

When choosing among models to describe categorical data, the need to consider interactions makes selection more difficult. With just four variables, considering all interactions, there are 166 different hierarchical models and many more non-hierarchical models. Two procedures have been developed for categorical data which produce the "best" subset or subsets of each model size, where size refers to the number of effects in the model. Both procedures are patterned after the Leaps and Bounds approach used by Furnival and Wilson for continuous data and do not generally require fitting all models. For hierarchical models, likelihood ratio statistics ($G^2$) are computed using iterative proportional fitting, and "best" is determined by comparing, among models with the same number of effects, the probability $\Pr(\chi^2_k \geq G^2_{ij})$, where $k$ is the degrees of freedom of the $i$th model of size $j$. To fit non-hierarchical as well as hierarchical models, a weighted least squares procedure has been developed. The procedures are applied to published occupational data on the occurrence of byssinosis, and the results are compared with previously published analyses of the same data. The procedures are also applied to published data on symptoms in psychiatric patients and again compared with previously published analyses. These procedures will make categorical data analysis more accessible to researchers who are not statisticians. They should also encourage more complex exploratory analyses of epidemiologic data and contribute to the development of new hypotheses for study.
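The hierarchical-model criterion described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: it fits every candidate exhaustively and omits the Leaps and Bounds pruning entirely. The function names (`ipf`, `g_squared`, `model_df`, `chi2_sf`), the 2 x 2 x 2 table, and its counts are all hypothetical and made up for illustration, and the degrees-of-freedom count assumes no zero marginal totals.

```python
import math
from itertools import combinations, product


def ipf(table, margins, shape, tol=1e-8, max_iter=200):
    """Fit a hierarchical log-linear model by iterative proportional fitting.

    table   -- dict mapping each cell (tuple of level indices) to its count
    margins -- generating class, e.g. [(0, 1), (2,)] for the model [AB][C]
    shape   -- number of levels of each variable
    """
    cells = list(product(*(range(n) for n in shape)))
    fitted = {c: 1.0 for c in cells}
    for _ in range(max_iter):
        max_change = 0.0
        for axes in margins:
            obs, fit = {}, {}
            for c in cells:  # observed and fitted marginal totals
                key = tuple(c[a] for a in axes)
                obs[key] = obs.get(key, 0.0) + table[c]
                fit[key] = fit.get(key, 0.0) + fitted[c]
            for c in cells:  # rescale fitted cells to match the margin
                key = tuple(c[a] for a in axes)
                new = fitted[c] * obs[key] / fit[key] if fit[key] > 0 else 0.0
                max_change = max(max_change, abs(new - fitted[c]))
                fitted[c] = new
        if max_change < tol:
            break
    return fitted


def g_squared(table, fitted):
    """Likelihood ratio statistic G^2 = 2 * sum(obs * log(obs / fitted))."""
    return 2.0 * sum(n * math.log(n / fitted[c]) for c, n in table.items() if n > 0)


def model_df(margins, shape):
    """Cells minus free parameters (assumes no zero marginal totals)."""
    effects = {()}  # the empty tuple stands for the constant term
    for axes in margins:
        for r in range(1, len(axes) + 1):
            effects.update(combinations(sorted(axes), r))
    n_params = sum(math.prod(shape[a] - 1 for a in e) for e in effects)
    return math.prod(shape) - n_params


def chi2_sf(x, df):
    """Upper tail Pr(chi^2_df >= x) for integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        q, a = math.exp(-half), 1.0  # closed form for df = 2
    else:
        q, a = math.erfc(math.sqrt(half)), 0.5  # closed form for df = 1
    while 2 * a < df:  # recurrence that raises df by 2 per step
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Made-up counts for three binary variables A, B, C (illustration only).
shape = (2, 2, 2)
table = dict(zip(product(range(2), repeat=3), [30, 10, 12, 28, 22, 18, 15, 25]))

# Three candidate models of the same size (four effects each).
candidates = {
    "[AB][C]": [(0, 1), (2,)],
    "[AC][B]": [(0, 2), (1,)],
    "[BC][A]": [(1, 2), (0,)],
}

ranked = []
for name, margins in candidates.items():
    g2 = g_squared(table, ipf(table, margins, shape))
    df = model_df(margins, shape)
    ranked.append((chi2_sf(g2, df), name, g2, df))
ranked.sort(reverse=True)  # largest tail probability first

for p, name, g2, df in ranked:
    print(f"{name}: G2={g2:.3f}, df={df}, p={p:.4f}")
```

Ranking by the tail probability rather than by the raw statistic matters once candidates of the same size have different degrees of freedom, which happens when the variables have different numbers of levels. In this all-binary example every candidate has the same degrees of freedom, so the two orderings coincide.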

Subject Area

Biostatistics

Recommended Citation

TILLEY, BARBARA CLAIRE, "BEST SUBSET SELECTION FOR CATEGORICAL DATA" (1981). Texas Medical Center Dissertations (via ProQuest). AAI8212741.
https://digitalcommons.library.tmc.edu/dissertations/AAI8212741
