Electronic Thesis and Dissertation Repository

Thesis Format

Monograph

Degree

Doctor of Philosophy

Program

Statistics and Actuarial Sciences

Supervisor

Yi, Grace Y.

2nd Supervisor

He, Wenqing

Co-Supervisor

Abstract

Incomplete data arise commonly in applications, and the topic has received extensive research attention over the past few decades. Numerous inference methods have been developed to handle different types of missing observations and distinct missing data mechanisms, which are usually classified as missing completely at random, missing at random, or missing not at random. However, research gaps remain.
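
For a response $Y$ that may be missing and covariates $X$ that are fully observed, the three mechanisms can be written in terms of a missingness indicator $R$ (with $R = 1$ when $Y$ is observed). The notation is assumed here for illustration and is not taken from the thesis.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R = 1 \mid Y, X) = P(R = 1), \\
\text{MAR:}\quad  & P(R = 1 \mid Y, X) = P(R = 1 \mid X), \\
\text{MNAR:}\quad & P(R = 1 \mid Y, X) \text{ depends on } Y \text{ even after conditioning on } X.
\end{align*}
```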

Assessing which missing data mechanism is plausible is typically difficult because validation data are lacking, and spurious variables among the covariates complicate the task further. Prediction in the presence of incomplete data is another area worth exploring. Using newly emerging techniques, this thesis explores new avenues in the analysis of incomplete data. It aims to contribute fresh insights into statistical inference with incomplete data and to provide valid methods that address several of these research gaps.

Focusing on missingness in the response variable, the first project proposes a unified framework to address the effects of missing data. Using the generalized linear model to characterize the dependence of the response on the associated covariates, we develop simultaneous estimation and variable selection procedures based on regularized likelihood. We rigorously establish the asymptotic properties of the resulting estimators. The proposed methods are flexible and general: they do not require a specific missing data mechanism to be assumed, which most available methods do. Empirical studies demonstrate satisfactory performance in finite sample settings. The project also outlines extensions to accommodate missingness in both the response and the covariates.
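
As a point of reference, a regularized (penalized) likelihood estimator commonly takes the generic form below. The symbols are assumptions made for illustration: $\ell_n(\beta)$ denotes a log-likelihood based on $n$ observations, $\beta = (\beta_1, \dots, \beta_p)$ the regression coefficients, and $p_\lambda$ a penalty function with tuning parameter $\lambda$. The abstract does not state the likelihood or the penalty actually used.

```latex
\hat{\beta} = \arg\max_{\beta} \Big\{ \ell_n(\beta) - n \sum_{j=1}^{p} p_\lambda(|\beta_j|) \Big\},
\qquad \text{e.g. } p_\lambda(t) = \lambda t \ \text{(lasso)}.
```

Penalties that are singular at zero can set some coefficients exactly to zero, which is how estimation and variable selection can be carried out in a single step.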

The second problem of interest approaches missing data from a different perspective by placing it within the framework of statistical machine learning, with a specific emphasis on boosting techniques; this gives rise to two projects. Although boosting has attracted increasing attention, many advances in this area have focused on numerical implementation, with relatively limited theoretical work. Moreover, existing boosting approaches are predominantly designed for datasets with complete observations, and their validity is compromised when data are missing. In this thesis, we employ semiparametric estimation approaches to develop unbiased boosting estimation methods for data with missing responses. We investigate several strategies to account for the effects of missingness. The proposed methods are implemented using the functional gradient descent algorithm and are justified by established theoretical properties. Numerical studies confirm the satisfactory performance of the proposed methods in finite sample settings.
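
To make the idea concrete, the following is a minimal sketch of one generic way to combine boosting with a missing-response adjustment: componentwise L2 boosting (functional gradient descent for squared-error loss) in which each observed case is weighted by the inverse of its probability of being observed. It is not the thesis's method, and the abstract does not say which strategies are used. The function name `ipw_l2_boost`, its parameters, and the assumption that the observation probabilities are known are all hypothetical choices made for illustration.

```python
import numpy as np


def ipw_l2_boost(X, y, observed, prob_observed, n_steps=200, step_size=0.1):
    """Componentwise L2 boosting with inverse-probability weights (illustrative).

    X: (n, p) covariates; y: (n,) responses (any value, e.g. NaN, where missing);
    observed: (n,) boolean; prob_observed: (n,) P(observed | X), assumed known.
    Returns (intercept, coef) of the fitted linear predictor.
    """
    X = np.asarray(X, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    # Missing responses get weight 0; observed ones get weight 1 / P(observed | X).
    w = np.where(observed, 1.0 / np.asarray(prob_observed, dtype=float), 0.0)
    # Placeholder for missing responses; it contributes nothing because its weight is 0.
    y_filled = np.where(observed, y, 0.0)

    # Weighted centering, so that a constant offset stays optimal during boosting.
    col_mean = (w @ X) / w.sum()
    Xc = X - col_mean
    offset = (w @ y_filled) / w.sum()

    coef = np.zeros(X.shape[1])
    fitted = np.full(X.shape[0], offset)
    denom = w @ Xc**2  # weighted sum of squares of each centered covariate
    for _ in range(n_steps):
        # Residual = negative gradient of the squared-error loss at the current fit.
        residual = y_filled - fitted
        # Weighted least-squares fit of each single covariate to the residual.
        numer = Xc.T @ (w * residual)
        # Pick the covariate that most reduces the weighted squared error.
        j = int(np.argmax(numer**2 / denom))
        # Take a small (shrunken) step along the selected covariate.
        update = step_size * numer[j] / denom[j]
        coef[j] += update
        fitted += update * Xc[:, j]

    # Convert the offset for centered covariates back to an ordinary intercept.
    return offset - col_mean @ coef, coef
```

Calling `ipw_l2_boost(X, y, observed, prob_observed)` returns an intercept and a coefficient vector. Covariates that are never selected keep a coefficient of exactly zero, and in practice the observation probabilities would have to be estimated rather than supplied.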

The third topic further explores boosting procedures in the context of interval-censored data, where the exact value of the response variable is not observed and is known only to fall within an interval. Such data arise commonly in survival analysis and other fields involving time-to-event outcomes, and they present a unique challenge in data analysis. In this project, we develop boosting methods for both regression and classification problems with interval-censored data. We address the censoring effects by adjusting the loss functions or by imputing transformed responses. The proposed methods are implemented using a functional gradient descent algorithm, and we rigorously establish their theoretical properties, including mean squared error tradeoffs and the optimality of the proposed estimators. Numerical studies are conducted to assess the performance of the proposed methods in finite sample settings.

Summary for Lay Audience

Incomplete data have been studied extensively over the past few decades because they occur commonly in applications. Different inference methods have been developed to handle the issues raised by missing observations. Despite this progress, challenges persist, particularly in determining missing data mechanisms, dealing with unimportant variables among the covariates, and making predictions. This thesis introduces new methods to address these gaps and to improve statistical inference in the presence of incomplete data.

The first project proposes a unified framework to handle missingness in the response variable and develops a new method that performs estimation and variable selection simultaneously. Its appeal is that inferential procedures no longer need to assume a specific missing data mechanism.

The second problem concerns the analysis of missing data with boosting techniques within the statistical machine learning framework. Boosting methods, traditionally developed for complete datasets, are adapted here using semiparametric estimation to handle missing responses. Theoretical properties of the adapted methods are rigorously established, and numerical studies confirm their effectiveness.

The third topic further examines the use of boosting techniques to handle interval-censored data, which are common in survival analysis. New approaches are proposed that address censoring effects by adjusting loss functions or by imputing transformed responses. Theoretical properties and numerical studies highlight the effectiveness of these approaches in handling interval-censored data.

Overall, this thesis aims to advance statistical inference and learning methods for incomplete data. It provides new insights and techniques to address challenges presented by incomplete data.

Available for download on Monday, August 31, 2026
