Electronic Thesis and Dissertation Repository

Fairness in Entity Matching and Blocking

Mohammad Hossein Moslemi, The University of Western Ontario

Abstract

Entity Matching (EM) is a key task in data integration that identifies records referring to the same real-world entity. While most research focuses on improving accuracy, fairness has received much less attention. This thesis addresses fairness in EM from two main perspectives: (1) blocking, the preprocessing step that filters candidate pairs, and (2) matching, where pairs are classified as matches or non-matches.

The first part of the thesis examines fairness in blocking, a step that is often overlooked in fairness studies on EM. Blocking reduces the number of candidate pairs to improve efficiency while aiming to retain true matches. However, blocking can introduce bias if it disproportionately removes matching records from certain demographic groups. To address this issue, the thesis introduces bias measures for blocking by extending standard quality metrics to compare results across demographic groups. An evaluation of common blocking methods on standard EM benchmarks reveals clear disparities in blocking outcomes. These biases are shown to propagate to the downstream matching step, where they lead to amplified disparities in the final results.
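As a rough illustration of what a group-wise blocking measure could look like, the sketch below computes pair completeness (the share of true matches that survive blocking) separately for each demographic group and reports the gap between the best- and worst-served groups. This is a minimal sketch under stated assumptions and not the thesis's own definition: the function names, the toy records, the rule for assigning a pair to a group, and the max-minus-min disparity are all hypothetical choices made for this example.

```python
from collections import defaultdict


def pair_completeness_by_group(true_matches, candidates, group_of):
    """Fraction of true matching pairs retained by blocking, per group.

    Assumption (illustrative only): a pair counts toward every group
    that either of its two records belongs to.
    """
    candidate_set = {frozenset(pair) for pair in candidates}
    kept = defaultdict(int)
    total = defaultdict(int)
    for a, b in true_matches:
        for group in {group_of[a], group_of[b]}:
            total[group] += 1
            if frozenset((a, b)) in candidate_set:
                kept[group] += 1
    return {group: kept[group] / total[group] for group in total}


def disparity(metric_by_group):
    """Gap between the best and worst group value (hypothetical measure)."""
    values = list(metric_by_group.values())
    return max(values) - min(values)


# Toy data: four true matches, one of which (a4, b4) is dropped by blocking.
true_matches = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
candidates = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a1", "b2")]
group_of = {
    "a1": "g1", "b1": "g1", "a2": "g1", "b2": "g1",
    "a3": "g2", "b3": "g2", "a4": "g2", "b4": "g2",
}

pc = pair_completeness_by_group(true_matches, candidates, group_of)
gap = disparity(pc)
print(pc, gap)
```

In this toy case every true match for group g1 survives blocking, while group g2 loses half of its matches, so any downstream matcher starts with a recall ceiling that differs by group.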

The second part of the thesis studies fairness in matching. While most existing work focuses on fairness in final match decisions, many EM systems use score-based matchers. This thesis argues that fairness should also be evaluated at the score level. To measure bias in scores, it introduces score bias, which captures disparities by comparing score distributions across demographic groups. To reduce these disparities, score calibration algorithms are proposed that adjust scores for each group while maintaining accuracy. Experiments on EM benchmarks show that matching scores often reflect disparities and that score calibration algorithms reduce these biases with minimal impact on accuracy.
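To make the idea of comparing score distributions concrete, the sketch below measures the largest gap between the empirical cumulative distributions of matching scores for two groups (the Kolmogorov-Smirnov statistic). This is a stand-in chosen for illustration, not the score bias measure defined in the thesis; the function name and the toy scores are assumptions.

```python
def ks_distance(scores_a, scores_b):
    """Largest absolute gap between two empirical CDFs of matching scores.

    Illustrative stand-in for a distribution-level disparity measure;
    the thesis's own score bias definition may differ.
    """
    thresholds = sorted(set(scores_a) | set(scores_b))

    def cdf(scores, t):
        # Share of scores at or below threshold t.
        return sum(s <= t for s in scores) / len(scores)

    return max(abs(cdf(scores_a, t) - cdf(scores_b, t)) for t in thresholds)


# Toy matcher scores for true matches in two hypothetical groups.
scores_g1 = [0.9, 0.8, 0.7, 0.2]
scores_g2 = [0.6, 0.5, 0.3, 0.1]

gap = ks_distance(scores_g1, scores_g2)
print(gap)
```

A value of 0 means the two groups receive identically distributed scores, while values near 1 mean the distributions barely overlap. Unlike a check at one fixed decision threshold, a distribution-level measure of this kind does not depend on where the match threshold is later set.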

By addressing fairness in both blocking and matching, this thesis provides a deeper understanding of bias in EM and introduces practical methods to reduce it.