Thesis Format
Monograph
Degree
Doctor of Philosophy
Program
Statistics and Actuarial Sciences
Supervisor
Yi, Grace Y.
Abstract
Causal inference and statistical learning have advanced significantly in various fields, including healthcare, epidemiology, computer vision, information retrieval, and language processing. Despite the many methods available, research gaps remain, particularly regarding noisy data with features such as missing values, censoring, and measurement error. Addressing the challenges presented by noisy data is crucial to reducing bias and improving statistical learning from such data. This thesis tackles several issues in causal inference and statistical learning that are related to noisy data.
The first project addresses causal inference in longitudinal studies with bivariate responses, focusing on data subject to missingness and censoring. We decompose the overall treatment effect into two separable effects, each mediated through a different causal pathway. We then establish identification conditions under which these separable treatment effects can be estimated from observed data. Finally, we use the likelihood method to estimate these effects and derive hypothesis testing procedures for comparing them.
In the second project, we tackle the problem of detecting cause-effect relationships between two sets of variables, each represented as a vector. Although this problem can be framed as a binary classification task, the causal relationships of the paired vectors under study are prone to mislabeling -- an inherent challenge in causation studies. We quantify the effects of mislabeled outputs on training results and introduce metrics to characterize these effects. Furthermore, we develop learning methods that account for mislabeling effects and provide theoretical justification for their validity. Our contributions offer reliable learning methods for real-world data, which commonly involve label noise.
The third project extends the second project by exploring binary classification with noisy labels in a general framework. To examine the impact of different types of label noise, we categorize noisy labels into three types: instance-dependent, semi-instance-independent, and instance-independent. We theoretically assess the impact of each noise type on learning. In particular, we derive an upper bound on the bias incurred by ignoring instance-dependent label noise and identify conditions under which ignoring semi-instance-independent label noise is acceptable. Moreover, we propose correction methods for each type of noisy label.
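As an illustration of the instance-independent case only, the following minimal sketch shows one well-known correction from the label-noise literature: an unbiased loss estimator for class-conditional label flips. It is not taken from the thesis and may differ from the correction methods proposed there. The function names, the choice of logistic loss, the label coding in {-1, +1}, and the assumption that the flip rates rho_pos and rho_neg are known are all choices made for this example.

```python
import numpy as np

def logistic_loss(score, label):
    """Logistic loss for labels coded as -1 / +1."""
    return np.log1p(np.exp(-label * score))

def corrected_loss(score, noisy_label, rho_pos, rho_neg, loss=logistic_loss):
    """Loss evaluated on noisy labels whose expectation equals the clean loss.

    Assumed (hypothetical) noise model:
      rho_pos = P(observed label = -1 | true label = +1)
      rho_neg = P(observed label = +1 | true label = -1)
    with rho_pos + rho_neg < 1.
    """
    noisy_label = np.asarray(noisy_label)
    # Flip rate of the observed class and of the opposite class.
    rho_obs = np.where(noisy_label == 1, rho_pos, rho_neg)
    rho_opp = np.where(noisy_label == 1, rho_neg, rho_pos)
    denom = 1.0 - rho_pos - rho_neg
    return ((1.0 - rho_opp) * loss(score, noisy_label)
            - rho_obs * loss(score, -noisy_label)) / denom

def expected_corrected_loss(score, true_label, rho_pos, rho_neg):
    """Exact expectation of the corrected loss over the label flip."""
    flip = rho_pos if true_label == 1 else rho_neg
    kept = corrected_loss(score, true_label, rho_pos, rho_neg)
    flipped = corrected_loss(score, -true_label, rho_pos, rho_neg)
    return float((1.0 - flip) * kept + flip * flipped)
```

Averaging the corrected loss over the flip distribution recovers the loss that would have been computed with the true label, which is why minimizing it on noisy labels targets the clean-data risk under this noise model. The sketch does not cover instance-dependent or semi-instance-independent noise.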
In contrast to the third project, which focuses on classification with label noise, the fourth project examines binary classification with mismeasured inputs. We begin by theoretically analyzing the bias induced by ignoring measurement error and identify a scenario in which ignoring it is acceptable. We then propose three correction methods to address the effects of mismeasured inputs, including methods that leverage validation data and methods that modify the loss function using regression calibration and conditional expectation. Finally, we establish theoretical results for each proposed method.
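To make the idea of regression calibration concrete, here is a minimal sketch of its textbook linear form for a single input. It is an assumption-laden illustration, not the thesis's procedure: it assumes a classical additive error model W = X + U with U independent of X, and a known error variance sigma_u2 (in practice this might be estimated from validation or replicate data). The function name and arguments are hypothetical.

```python
import numpy as np

def regression_calibration(w, sigma_u2):
    """Replace mismeasured inputs w with an estimate of E[X | W = w].

    Assumes W = X + U, with U independent of X and Var(U) = sigma_u2 known.
    Uses the best linear predictor: mean(w) + lam * (w - mean(w)),
    where lam = Var(X) / Var(W) is the reliability ratio.
    """
    w = np.asarray(w, dtype=float)
    mean_w = w.mean()
    var_w = w.var(ddof=1)  # sample variance of the observed inputs
    if var_w == 0.0:
        return w.copy()    # no variation, so nothing to shrink
    # Var(X) = Var(W) - Var(U); truncate at zero to keep lam in [0, 1].
    lam = max(var_w - sigma_u2, 0.0) / var_w
    return mean_w + lam * (w - mean_w)
```

The calibrated values shrink each observed input toward the sample mean by the reliability ratio; a classifier would then be trained on these values in place of the raw mismeasured inputs. With sigma_u2 set to zero the inputs are returned unchanged.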
In summary, this thesis explores several interesting problems in causal inference and statistical learning concerning noisy data. We contribute new findings and methods to enhance our understanding of the complexities induced by noisy data and provide solutions to address them.
Summary for Lay Audience
Causal inference and statistical learning have achieved significant progress across various fields such as healthcare, computer vision, and information retrieval. However, the presence of noisy data -- including missing values, censoring, and measurement error -- poses ongoing challenges. This thesis investigates several of these issues and presents solutions to improve data analysis in the face of noise.
The first project focuses on causal inference in longitudinal studies with bivariate responses, specifically addressing challenges such as missingness and censoring. We develop new methods to estimate treatment effects on each response while accounting for their correlation. The second project explores predicting cause-effect relationships between two sets of variables, framed as a classification problem. This project addresses the issue of mislabeling, which is common in practice, and proposes correction methods with theoretical support. Building on this, the third project looks at binary classification with noisy labels. We categorize label noise into three types, assess the impact of each, and identify situations where certain types of noise can be safely ignored. Correction methods for each type of noisy label are also proposed. The final project shifts focus to mismeasured inputs. We analyze the bias introduced by ignoring measurement errors and identify scenarios where such errors can be overlooked, particularly with large data sizes. Additionally, we propose three different methods to correct for measurement errors.
In summary, this thesis investigates the complexities of causal inference and statistical learning concerning noisy data and offers novel insights and methods to address the related issues.
Recommended Citation
Hu, Pingbo, "Statistical Learning of Noisy Data: Classification and Causal Inference with Measurement Error and Missingness" (2024). Electronic Thesis and Dissertation Repository. 10419.
https://ir.lib.uwo.ca/etd/10419