Electronic Thesis and Dissertation Repository

Thesis Format

Integrated Article

Degree

Master of Science

Program

Computer Science

Supervisor

Mostafa, Milani

Abstract

Entity Matching (EM) is a key task in data integration that identifies records referring to the same real-world entity. While most research focuses on improving accuracy, fairness has received much less attention. This thesis addresses fairness in EM from two main perspectives: (1) blocking, the preprocessing step that filters candidate pairs, and (2) matching, where pairs are classified as matches or non-matches.

The first part of the thesis examines fairness in blocking, a step that is often overlooked in fairness studies on EM. Blocking reduces the number of candidate pairs to improve efficiency while aiming to retain true matches. However, blocking can introduce bias if it disproportionately removes matching records from certain demographic groups. To address this issue, the thesis introduces bias measures for blocking by extending standard quality metrics to compare results across demographic groups. An evaluation of common blocking methods on standard EM benchmarks reveals clear disparities in blocking outcomes. These biases are shown to propagate to the downstream matching step, where they lead to amplified disparities in the final results.
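As a rough illustration of what a group-wise blocking measure could look like, the sketch below computes pair completeness (the fraction of true matches that survive blocking) separately for each demographic group and reports the gap between the best- and worst-served groups. This is an assumption-laden example, not the thesis's definition: the function names, the `group_of` mapping, the toy record identifiers, and the choice of a max-minus-min gap are all hypothetical.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]

def pair_completeness_by_group(
    true_matches: Set[Pair],
    candidates: Set[Pair],
    group_of: Dict[Pair, str],
) -> Dict[str, float]:
    """Fraction of true matches retained by blocking, computed per group."""
    totals: Dict[str, int] = {}
    kept: Dict[str, int] = {}
    for pair in true_matches:
        group = group_of[pair]
        totals[group] = totals.get(group, 0) + 1
        if pair in candidates:
            kept[group] = kept.get(group, 0) + 1
    return {group: kept.get(group, 0) / totals[group] for group in totals}

def disparity(rates: Dict[str, float]) -> float:
    """Gap between the best- and worst-served groups (an assumed bias measure)."""
    return max(rates.values()) - min(rates.values())

# Toy data (hypothetical): two true matches per group; blocking drops one from group B.
true_matches = {("a1", "a2"), ("a3", "a4"), ("b1", "b2"), ("b3", "b4")}
candidates = {("a1", "a2"), ("a3", "a4"), ("b1", "b2"), ("a1", "b1")}
group_of = {
    ("a1", "a2"): "A", ("a3", "a4"): "A",
    ("b1", "b2"): "B", ("b3", "b4"): "B",
}

rates = pair_completeness_by_group(true_matches, candidates, group_of)
gap = disparity(rates)
```

In this toy case every true match from group A survives blocking, while only half of group B's do, so a downstream matcher can never recover the lost group B pair regardless of how accurate it is.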

The second part of the thesis studies fairness in matching. While most existing work focuses on fairness in final match decisions, many EM systems use score-based matchers. This thesis argues that fairness should also be evaluated at the score level. To measure bias in scores, it introduces score bias, which captures disparities by comparing score distributions across demographic groups. To reduce these disparities, score calibration algorithms are proposed that adjust scores for each group while maintaining accuracy. Experiments on EM benchmarks show that matching scores often reflect disparities and that score calibration algorithms reduce these biases with minimal impact on accuracy.
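One way to picture a score-level comparison is sketched below. It is not the thesis's definition of score bias; it is an assumed stand-in that averages, over a grid of decision thresholds in (0, 1], the absolute difference between two groups in the fraction of pairs scoring at or above each threshold. The function names, the `steps` parameter, and the toy scores are hypothetical.

```python
from typing import List, Sequence

def rate_above(scores: Sequence[float], threshold: float) -> float:
    """Fraction of pairs whose matching score is at or above the threshold."""
    return sum(score >= threshold for score in scores) / len(scores)

def score_gap(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    steps: int = 100,
) -> float:
    """Average absolute gap in above-threshold rates across a threshold grid.

    Assumes scores lie in [0, 1]. A value of 0 means the two groups would be
    treated identically at every threshold on the grid.
    """
    thresholds: List[float] = [i / steps for i in range(1, steps + 1)]
    gaps = [
        abs(rate_above(scores_a, t) - rate_above(scores_b, t))
        for t in thresholds
    ]
    return sum(gaps) / len(gaps)

# Toy scores (hypothetical): identical distributions versus fully separated ones.
same = score_gap([0.2, 0.8], [0.2, 0.8])
separated = score_gap([1.0, 1.0], [0.0, 0.0])
```

A measure of this kind does not depend on a single cut-off, which is the motivation for looking at scores rather than only at final match decisions: two groups can look equal at one threshold and diverge at another.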

By addressing fairness in both blocking and matching, this thesis provides a deeper understanding of bias in EM and introduces practical methods to reduce it.

Summary for Lay Audience

When organizations combine information from different sources, for example when merging customer records from various departments, they must figure out which entries refer to the same person or business. This process, called "Entity Matching," is critical for maintaining accurate and consistent databases. Traditionally, most research in this area has focused on improving accuracy. However, it is equally important to ensure that Entity Matching treats all demographic groups fairly, without introducing or amplifying biases.

This thesis explores how bias can enter an Entity Matching pipeline at two stages. The first is a step called "blocking," in which large numbers of record pairs are filtered out to reduce computational cost. If a blocking method excludes more true matches from one group than from another, that group may be unfairly disadvantaged. The second is the actual "matching" stage, in which the remaining pairs are scored and classified as matches or non-matches. If some groups receive systematically lower or higher scores than others, the outcomes can be unfair.

To address these issues, this thesis develops new fairness measures to detect bias in both blocking and matching. It also proposes algorithms that reduce disparities between groups. Experiments on real-world datasets show that these algorithms reduce bias while preserving high accuracy.

Overall, this work underscores the need to treat Entity Matching as more than a purely technical problem. By measuring and correcting bias, we can help ensure that the process of merging data does not disadvantage any group. This thesis lays the groundwork for future research, suggesting ways to make both blocking and matching more equitable, and ultimately helping to build fairer data-driven applications.
