
Statistical Learning of Noisy Data: Classification and Causal Inference with Measurement Error and Missingness
Abstract
Causal inference and statistical learning have advanced considerably in fields such as healthcare, epidemiology, computer vision, information retrieval, and language processing. Despite the wealth of available methods, research gaps remain, particularly concerning noisy data that exhibit missingness, censoring, or measurement error. Addressing the challenges posed by noisy data is crucial for reducing bias and improving statistical learning from such data. This thesis tackles several problems in causal inference and statistical learning related to noisy data.
The first project addresses causal inference in longitudinal studies with bivariate responses, focusing on data subject to missingness and censoring. We decompose the overall treatment effect into two separable effects, each mediated through a different causal pathway, and establish identification conditions under which these separable treatment effects can be estimated from the observed data. We then employ likelihood methods to estimate these effects and derive hypothesis testing procedures for comparing them.
In the second project, we tackle the problem of detecting cause-effect relationships between two sets of variables, each formulated as a vector. Although this problem can be framed as a binary classification task, it is prone to mislabeling of causal relationships for the paired vectors under study, an inherent challenge in causation studies. We quantify the effects of mislabeled outputs on training results and introduce metrics to characterize these effects. Furthermore, we develop valid learning methods that account for mislabeling effects and provide theoretical justification for their validity. Our contributions offer reliable learning methods designed to handle real-world data, which commonly involve label noise.
The third project extends the second by exploring binary classification with noisy data in a general framework. To scrutinize the impact of different types of label noise, we introduce a taxonomy that categorizes noisy labels into three types: instance-dependent, semi-instance-independent, and instance-independent. We theoretically assess the impact of each noise type on learning. In particular, we derive an upper bound on the bias incurred by ignoring instance-dependent label noise and identify conditions under which ignoring semi-instance-independent label noise is acceptable. Moreover, we propose correction methods for each type of noisy label.
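To make the instance-independent case concrete, a standard unbiased loss correction for class-conditional label noise (in the style of Natarajan et al.) can be sketched as follows. This is a generic illustration rather than the thesis's own estimator, and the flip probabilities `rho0` and `rho1` are assumed known:

```python
import numpy as np

def corrected_logistic_loss(w, X, y_noisy, rho0, rho1):
    """Unbiased logistic loss under class-conditional (instance-independent)
    label noise. y_noisy holds labels in {-1, +1};
    rho0 = P(observe +1 | true -1), rho1 = P(observe -1 | true +1),
    with rho0 + rho1 < 1."""
    z = X @ w
    ell_pos = np.logaddexp(0.0, -z)                       # logistic loss for label +1
    ell_neg = np.logaddexp(0.0, z)                        # logistic loss for label -1
    ell_obs = np.where(y_noisy == 1, ell_pos, ell_neg)    # loss at the observed label
    ell_opp = np.where(y_noisy == 1, ell_neg, ell_pos)    # loss at the flipped label
    rho_obs = np.where(y_noisy == 1, rho1, rho0)          # flip rate of observed class
    rho_opp = np.where(y_noisy == 1, rho0, rho1)          # flip rate of opposite class
    # Reweighting so the expectation over the noise equals the clean-label loss.
    return np.mean(((1.0 - rho_opp) * ell_obs - rho_obs * ell_opp)
                   / (1.0 - rho0 - rho1))
```

With `rho0 = rho1 = 0` this reduces to the ordinary logistic loss, and for positive flip rates its expectation under the noise model matches the clean-label loss, which is what makes minimizing it valid despite mislabeling.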
In contrast to the third project, which focuses on classification with label noise, the fourth project examines binary classification with mismeasured inputs. We begin by theoretically analyzing the bias induced by ignoring measurement error effects and identify a scenario in which doing so is acceptable. We then propose three correction methods to address the effects of mismeasured inputs, including methods that leverage validation data and modifications of the loss function based on regression calibration and conditional expectation. Finally, we establish theoretical results for each proposed method.
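Regression calibration, one of the named correction ideas, replaces an error-prone measurement W with an estimate of E[X | W] learned from a validation sample in which the true covariate X is also observed. A minimal sketch, with illustrative variable names rather than the thesis's notation:

```python
import numpy as np

def regression_calibration(W_main, W_val, X_val):
    """Replace mismeasured features W_main by estimates of E[X | W],
    where the regression of X on W is learned from a validation sample
    (W_val, X_val) in which the true features are also observed."""
    n_val = W_val.shape[0]
    A_val = np.hstack([np.ones((n_val, 1)), W_val])       # design matrix with intercept
    coef, *_ = np.linalg.lstsq(A_val, X_val, rcond=None)  # fit E[X | W] per feature
    A_main = np.hstack([np.ones((W_main.shape[0], 1)), W_main])
    return A_main @ coef                                  # calibrated features
```

Feeding the calibrated features, rather than the raw mismeasured ones, into a downstream classifier mitigates the attenuation bias that measurement error typically induces.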
In summary, this thesis explores several problems in causal inference and statistical learning involving noisy data. We contribute new findings and methods that deepen our understanding of the complexities induced by noisy data and provide solutions to address them.