Electronic Thesis and Dissertation Repository

Data Heterogeneity and Its Implications for Fairness

Ghazaleh Noroozi, Western University

Abstract

Data heterogeneity, meaning differences in the underlying generative processes that produce the data, presents challenges in analyzing and using datasets for decision-making tasks. This thesis examines the impact of data heterogeneity on bias and fairness in predictive models. It investigates the correlation between heterogeneity and protected attributes, such as race and gender, and explores how such heterogeneity can give rise to biases in downstream applications.

The contributions of this thesis are fourfold. First, it provides a comprehensive definition of data heterogeneity based on differences in underlying generative processes, establishing a conceptual framework for understanding and quantifying heterogeneity. Second, it employs two distribution-based clustering techniques, sum-product networks and mixture models, to detect and identify data heterogeneity in real-world datasets; these techniques offer insight into the structures and patterns of heterogeneity within the data. Third, it explores the relationship between data heterogeneity and bias, specifically the impact on fairness in decision-making processes: by studying the correlation between heterogeneity and protected attributes, the thesis sheds light on how biases may arise from heterogeneity in the data. Fourth, it suggests ideas and directions for addressing biases caused by data heterogeneity, paving the way for future research on debiasing techniques that account for the challenges posed by heterogeneous datasets.

Experimental results are presented on several datasets, including the UCI Adult, ACS Income, COMPAS, and German Credit datasets, demonstrating the practical implications of data heterogeneity for bias and fairness. The findings highlight the importance of understanding and addressing heterogeneity-related biases in predictive models, particularly when protected attributes are involved. By addressing these challenges, the thesis aims to contribute to the development of fairer and more robust decision-making systems in the face of heterogeneous data.