Electronic Thesis and Dissertation Repository

Thesis Format



Doctor of Philosophy


Epidemiology and Biostatistics

Collaborative Specialization



Zou, Guangyong


Regression analyses in epidemiological and medical research typically begin with a model selection process, followed by inference that assumes the selected model generated the data at hand. It is well known that this two-step procedure can yield biased estimates and invalid confidence intervals for model coefficients because it ignores the uncertainty associated with model selection. To account for this uncertainty, multiple models may be selected as a basis for inference. This method, commonly referred to as model-averaging, is increasingly becoming a viable approach in practice.

Previous research has demonstrated the advantage of model-averaging in reducing the bias of parameter estimates. However, there is a lack of methods for constructing confidence intervals around model-averaged parameter estimates. In the context of multiple logistic regression models, we propose and evaluate new confidence interval estimation approaches for regression coefficients. Specifically, we study the properties of confidence intervals constructed by averaging the tail errors of the confidence limits obtained from all models included in the model-averaging. We propose model-averaging confidence intervals based on the score test. For selecting the models to be averaged, we propose the bootstrap inclusion fractions method.
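The tail-error averaging idea can be sketched in a few lines. The example below is a minimal illustration of the Wald-type variant (one of the comparators mentioned in this abstract), not the score-based intervals proposed in the thesis and not the interface of the 'mataci' package. The function name `mata_wald_interval`, its arguments, and the use of a standard normal reference distribution are assumptions made for illustration; the model weights are taken as given (for example, information-criterion weights). Each limit is the value at which the weighted average of the per-model one-sided tail errors equals alpha/2.

```python
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal CDF


def _bisect(f, lo, hi, tol=1e-10, max_iter=200):
    """Find a root of a monotone function f on [lo, hi] by bisection."""
    f_lo = f(lo)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid  # root lies in the upper half
        else:
            hi = mid               # root lies in the lower half
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def mata_wald_interval(estimates, std_errors, weights, alpha=0.05):
    """Hypothetical helper: Wald-type model-averaged tail-area interval.

    estimates, std_errors: per-model coefficient estimates and standard errors.
    weights: non-negative model weights (normalised here to sum to 1).
    """
    total = sum(weights)
    w = [x / total for x in weights]
    models = list(zip(w, estimates, std_errors))

    def lower_tail_error(theta):
        # Weighted average of the lower-limit tail errors across models.
        return sum(wk * (1.0 - _PHI((est - theta) / se)) for wk, est, se in models)

    def upper_tail_error(theta):
        # Weighted average of the upper-limit tail errors across models.
        return sum(wk * _PHI((est - theta) / se) for wk, est, se in models)

    # Search bracket wide enough that both tail errors cross alpha / 2.
    span = 10.0 * max(std_errors)
    lo, hi = min(estimates) - span, max(estimates) + span

    lower = _bisect(lambda t: lower_tail_error(t) - alpha / 2, lo, hi)
    upper = _bisect(lambda t: upper_tail_error(t) - alpha / 2, lo, hi)
    return lower, upper


if __name__ == "__main__":
    # Illustrative inputs only: two candidate models for one log-odds ratio.
    print(mata_wald_interval([0.40, 0.65], [0.20, 0.25], [0.7, 0.3]))
```

With a single model, this construction reduces to the usual Wald interval (estimate plus or minus the normal quantile times the standard error), which is a convenient sanity check. With several models, the limits are not a weighted average of the per-model limits; they are the points at which the averaged tail error reaches the nominal level.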

We evaluate the performance of our proposed methods in simulation studies, comparing them with model-averaging interval procedures based on likelihood ratio and Wald tests, traditional stepwise procedures, the bootstrap approach, penalized regression, and the Bayesian model-averaging approach.

Methods with good performance have been implemented in the 'mataci' R package, and illustrated using data from a low birth weight study.

Summary for Lay Audience

Data analysis in medical research often involves regression analysis that examines the associations between outcome and independent variables. Analysis consists of selection of these variables and estimation of their effects. A point estimate usually varies from sample to sample, meaning that the estimated effect has some distribution. The 95% confidence interval, a range around a point estimate within which the true effect is likely to fall, is used to quantify the uncertainty associated with the estimates. Tail errors on both sides of a valid confidence interval should be similar and close to the specified limit.

Unfortunately, using the same data both to select the variables and to construct confidence intervals usually leads to biased results, especially in small samples. The coverage of confidence intervals obtained by such "double use" of the data is often below the specified limit. To address this problem, it has been proposed to use several regression models whose results are averaged. The selection of candidate models is important for the averaging process. Done well, it speeds up the computations and improves the precision of the results, whereas an insufficient set of models can negatively affect the final conclusions.

Model-averaging makes it possible to obtain more accurate point estimates, but many methods for constructing confidence intervals for such averaged estimates are inaccurate, especially when sample sizes are not large. Such intervals are often too short, so their actual confidence level is much lower than the specified level.

In this work, we proposed an approach for selecting candidate models that reduces the number of models required while retaining the information that can be obtained from the data. We also proposed a method that constructs valid and accurate confidence intervals for regression coefficients even in small samples. The method averages the tail errors over the selected candidate models. The methods we developed are less traditional but more accurate variants of the model-averaged tail error method. We focused on building confidence intervals for logistic regression models, which evaluate the effect of variables on a binary dependent variable. To demonstrate the superiority of the proposed methods, we compared them with frequently used methods.