Electronic Thesis and Dissertation Repository

Thesis Format

Monograph

Degree

Master of Science

Program

Computer Science

Supervisor

Milani, Mostafa

Abstract

Data heterogeneity, referring to the differences in underlying generative processes that produce the data, presents challenges in analyzing and utilizing datasets for decision-making tasks. This thesis examines the impact of data heterogeneity on biases and fairness in predictive models. The research investigates the correlation between heterogeneity and protected attributes, such as race and gender, and explores the implications of such heterogeneity on biases that may arise in downstream applications.

The contributions of this thesis are fourfold. First, it defines data heterogeneity in terms of differences in the underlying generative processes, establishing a conceptual framework for understanding and quantifying heterogeneity. Second, it applies two distribution-based clustering techniques, sum-product networks and mixture models, to detect data heterogeneity in real-world datasets; these techniques offer insight into the structure and patterns of heterogeneity within the data. Third, it explores the relationship between data heterogeneity and bias, specifically the impact on fairness in decision-making: by studying the correlation between heterogeneity and protected attributes, it shows how biases may arise from heterogeneity in the data. Fourth, it suggests ideas and directions for addressing biases caused by data heterogeneity, paving the way for future research on debiasing techniques that account for the challenges posed by heterogeneous datasets.

Experimental results are presented using various datasets, including the UCI Adult Dataset, ACS Income Dataset, COMPAS Dataset, and German Credit Dataset, showcasing the practical implications of data heterogeneity on bias and fairness. The findings highlight the importance of understanding and addressing heterogeneity-related biases in predictive models, particularly when protected attributes are involved. By addressing these challenges, the thesis aims to contribute to the development of fairer and more robust decision-making systems in the face of heterogeneous data.

Summary for Lay Audience

In today's data-driven world, understanding the complexities of data heterogeneity is crucial for making fair and unbiased decisions. This thesis delves into the concept of data heterogeneity, which refers to differences in how data is generated, and explores its impact on biases and fairness in machine learning models.

The research begins by defining data heterogeneity based on the underlying processes that create the data. Understanding these differences gives insight into the challenges posed by heterogeneous datasets. To detect data heterogeneity in real-world datasets, the thesis uses two clustering techniques, sum-product networks and mixture models. These techniques help uncover hidden patterns and structures within the data that contribute to its heterogeneity.

The thesis also examines the relationship between data heterogeneity and biases, particularly focusing on how heterogeneity can lead to unfairness in decision-making. By studying the correlation between heterogeneity and protected attributes like race and gender, we uncover how biases can emerge due to variations in the data generation process. To address these biases, the thesis proposes ideas and directions for future research in debiasing techniques tailored to heterogeneous datasets.

Through experiments on four datasets (UCI Adult, ACS Income, COMPAS, and German Credit), the thesis demonstrates the practical implications of data heterogeneity for bias and fairness in predictive models. By identifying and addressing these challenges, we aim to develop more equitable and robust decision-making systems in the presence of diverse data.

In summary, this thesis examines data heterogeneity and its impact on bias and fairness. By uncovering the hidden complexities of heterogeneous datasets and proposing directions for addressing the resulting biases, we strive to contribute to a more inclusive and trustworthy data-driven society.
