Master of Science
This thesis focuses on the concept of Conditional Independence (CI) and its testing, which hold immense significance across various fields, including economics, the social sciences, and biomedical research. Within computer science in particular, CI has become integral to building probabilistic and causal models: it enables efficient inference and plays a key role in uncovering causal relationships.
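To make the notion concrete, here is a minimal sketch of what CI means for discrete distributions (the variables and probability values below are illustrative, not taken from the thesis): X is conditionally independent of Y given Z exactly when P(X, Y | Z) = P(X | Z) P(Y | Z) for every value of Z.

```python
import numpy as np

# Toy joint distribution over binary X, Y, Z, constructed so that
# X and Y are conditionally independent given Z:
#   P(x, y, z) = P(z) * P(x|z) * P(y|z)
p_z = np.array([0.4, 0.6])           # P(Z)
p_x_given_z = np.array([[0.7, 0.3],  # row z: P(X|Z=z)
                        [0.2, 0.8]])
p_y_given_z = np.array([[0.5, 0.5],  # row z: P(Y|Z=z)
                        [0.9, 0.1]])

# Full joint P(x, y, z), shape (2, 2, 2), indexed [x, y, z]
joint = np.einsum('z,zx,zy->xyz', p_z, p_x_given_z, p_y_given_z)

def is_ci(joint, tol=1e-9):
    """Check X ⫫ Y | Z by comparing P(x,y|z) with P(x|z) * P(y|z)."""
    p_z = joint.sum(axis=(0, 1))             # marginal P(z)
    p_xy_given_z = joint / p_z               # P(x,y|z)
    p_x_given_z = joint.sum(axis=1) / p_z    # P(x|z), indexed [x, z]
    p_y_given_z = joint.sum(axis=0) / p_z    # P(y|z), indexed [y, z]
    product = np.einsum('xz,yz->xyz', p_x_given_z, p_y_given_z)
    return bool(np.allclose(p_xy_given_z, product, atol=tol))

print(is_ci(joint))  # True
```

Testing CI in practice works from finite samples rather than a known joint table, which is what makes the problem statistically hard; this sketch only illustrates the defining identity itself.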
The primary aim of this thesis is to broaden the scope of CI beyond testing. We introduce the novel problem of repairing data so that it satisfies given CI constraints, and we highlight the value and relevance of this problem through two contrasting applications. The first is debiasing data to build fair machine learning (ML) models: as fairness becomes a central concern in ML, techniques for debiasing training data are crucial, and the proposed repair methodology assists in constructing more equitable models. The second is improving data quality and data cleaning: maintaining data quality is a continuing challenge across many fields, and our repair methods offer a novel way to enhance the overall quality and reliability of data.
The proposed repairs use optimal transport (OT) and the Earth Mover's distance as dissimilarity measures, which preserves the geometry of the underlying probability distribution. In the fairness setting, this translates into higher downstream model accuracy; in data cleaning, it yields a robust method for error detection. To generate repairs efficiently, we present novel techniques, including relaxed OT and block coordinate descent methods. We validate the effectiveness of the repair methodologies through experiments on synthetic and real-world datasets. This exploration highlights the potential of data repair for addressing critical issues in machine learning and data quality, offering a new perspective on the use of CI in these fields.
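As an illustration of the dissimilarity measure involved (the sample values are toy numbers, and this is not the thesis's repair algorithm), the Earth Mover's distance between the empirical distributions of an original and a hypothetically repaired numeric column can be computed with SciPy:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# An original attribute column and a hypothetical "repaired" version;
# the values are purely illustrative.
original = np.array([1.0, 2.0, 2.0, 3.0, 5.0])
repaired = np.array([1.0, 2.0, 3.0, 3.0, 4.0])

# Earth Mover's distance (1-Wasserstein distance) between the two
# empirical distributions: the minimum total "mass x distance" needed
# to morph one distribution into the other.
cost = wasserstein_distance(original, repaired)
print(cost)  # 0.4
```

A repair that minimizes this cost stays as close as possible to the original distribution, which is why OT-based repairs tend to preserve the data's geometry and, downstream, model accuracy.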
Summary for Lay Audience
This thesis delves into conditional independence (CI), a principle widely used in economics, the social sciences, and healthcare research. In simple terms, CI describes whether two random variables in a probabilistic model remain connected, or still influence each other, once a third variable is taken into account. In computer science, this principle is fundamental to building models that make predictions or uncover cause-and-effect relationships.
The main aim of this research is to extend this concept from testing to repair: fixing issues in data so that it conforms to certain CI rules. This idea is explored through two main applications. The first involves making data more fair and unbiased, which in turn helps create machine learning models that do not favor one group over another because of biased or unequal information. The second focuses on improving the overall quality of data by detecting and correcting errors, a critical step because clean, high-quality data is essential for accurate predictions and decision-making. To repair the data, the study uses methods that measure the difference between how the data is distributed in its original and repaired states. This approach ensures the essence of the original data is preserved while improving the accuracy of the models built from it.
The study also introduces new techniques to make these repairs more efficient and tests the effectiveness of the repair methods using both made-up and real-world datasets. Overall, this research shines a light on how the principle of CI can be used innovatively to address critical issues in machine learning and data quality.
Pirhadi, Alireza, "On Computing Optimal Repairs for Conditional Independence" (2023). Electronic Thesis and Dissertation Repository. 9496.