Electronic Thesis and Dissertation Repository

Thesis Format



Master of Science


Epidemiology and Biostatistics


Bauer, Greta R

Abstract


This simulation study evaluated the predictive accuracy of eight quantitative methods for intersectionally defined subgroups. The methods were two forms of single-level regression with interaction terms, cross-classification, multilevel analysis of individual heterogeneity and discriminatory accuracy (MAIHDA), and four decision tree methods: classification and regression trees (CART), conditional inference trees, chi-square automatic interaction detector, and random forest. The simulated datasets varied by outcome variable type, input variable types, sample size, and the size and direction of the effects. Predictive accuracy improved with increasing sample size for all methods except CART. At small sample sizes, random forest and MAIHDA generally produced the most precise predictions. Although both performed well for prediction, variable selection by random forest was suboptimal, as were the confidence interval coverage and power of MAIHDA main effects coefficients. We identified differences between the methods best suited to intersectional prediction and those best suited to variable identification, highlighting that different objectives and data scenarios require different methods.

Summary for Lay Audience

Intersectionality acknowledges that an individual’s multiple social positions or identities (e.g. gender, ethnicity) can interact to affect health-related outcomes in unique ways. Calculating health outcomes for intersectional groups (defined by a combination of positions), rather than for each position separately, can produce more accurate outcome estimates. Since it is unclear which methods do this best, this study evaluated eight methods in terms of their predictive performance for intersectional groupings, using simulated data with known true values. The methods included single-level and multilevel regression, cross-classification, and four machine learning methods: classification and regression trees (CART), conditional inference trees, chi-square automatic interaction detector, and random forest. The accuracy of the predictions generally improved with increasing sample size for all methods except CART. Random forest and the multilevel method (known as MAIHDA) generally produced the most precise predictions, especially for small sample sizes. However, they did not always correctly identify variables that were significantly associated with the outcome. Random forest sometimes incorrectly suggested that a variable with no true effect on the outcome was important, and MAIHDA produced estimates for the effects of individual variables that did not reflect the expected values. This shows that while some methods are reliable for predicting the outcome for intersectionally defined groups, they are not ideal for identifying the effects or importance of the individual variables that make up those groups (e.g. the specific effect of being in a high income group, or being male). Results from this work will improve the application of quantitative methods for accurately estimating outcomes for population subgroups. Correctly estimating outcomes for these groups is an important step in understanding existing health inequities.
The goal of this work is to produce a guide for researchers who are interested in the applications of quantitative intersectionality approaches.

Included in

Epidemiology Commons