Electronic Thesis and Dissertation Repository

On Computing Optimal Repairs for Conditional Independence

Alireza Pirhadi, Western University

Abstract

This thesis studies Conditional Independence (CI) and its testing, which are important in many fields, including economics, the social sciences, and biomedical research. In computer science, CI is integral to building probabilistic and causal models: it enables efficient inference and plays a key role in uncovering causal relationships.

The primary aim of this thesis is to broaden the study of CI beyond testing. We introduce the new problem of data repair: modifying data so that it satisfies given CI constraints. Two contrasting applications illustrate the value and relevance of this problem. The first is debiasing data to develop fair machine learning (ML) models. As fairness becomes a central concern in ML, techniques for debiasing data are crucial for constructing more equitable models, and the proposed repair methodology supports this goal. The second is data cleaning and data quality improvement. Maintaining data quality is a continuing challenge across fields, and our repair methods offer a novel way to address it and make data more reliable.

The proposed repairs use optimal transport (OT) and the Earth Mover’s distance as dissimilarity measures, which preserves the geometry of the underlying probability distribution. For fairness, this improves the accuracy of downstream models; for data cleaning, it provides a robust method for error detection. To generate repairs efficiently, we present novel techniques, including relaxed OT and block coordinate descent methods. Experiments on synthetic and real-world datasets validate the effectiveness of the repair methodologies. Together, these results highlight the potential of data repair for addressing critical issues in ML and data quality, and offer a new perspective on the use of CI in these fields.