
Improving Deep Entity Resolution by Constraints
Abstract
Entity resolution (ER) is the problem of finding duplicate data in a dataset and resolving possible differences and inconsistencies. ER is a long-standing data management and information retrieval problem and a core data integration and cleaning task. Diverse solutions for ER apply rule-based techniques, pairwise binary classification, clustering, and probabilistic inference, among other techniques. Deep learning (DL) has been used extensively for ER and has shown competitive performance compared to conventional ER solutions. The state-of-the-art (SOTA) ER solutions using DL are based on pairwise comparison and binary classification. They map pairs of records into a latent space in which the records can be compared effectively and classified as matched or unmatched. However, these techniques ignore possible constraints in record matching, including application-independent constraints (e.g., transitivity, symmetry, and reflexivity for matched records) and application-dependent constraints (e.g., cardinality constraints and fairness constraints).
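To make the application-independent constraints concrete, the following minimal sketch checks a pairwise matcher for reflexivity, symmetry, and transitivity violations. It is an illustration only, not the method of this thesis: the function name constraint_violations, the record identifiers, and the toy set of predictions are hypothetical, and a real deep ER model would take the place of is_match.

```python
from itertools import permutations


def constraint_violations(records, is_match):
    """List reflexivity, symmetry, and transitivity violations of a pairwise matcher.

    `is_match(a, b)` is assumed to return True when the matcher labels the
    ordered pair (a, b) as matched. Both names are hypothetical.
    """
    # Reflexivity: every record should match itself.
    reflexivity = [a for a in records if not is_match(a, a)]
    # Symmetry: if (a, b) is matched, (b, a) should be matched as well.
    symmetry = [
        (a, b)
        for a, b in permutations(records, 2)
        if is_match(a, b) and not is_match(b, a)
    ]
    # Transitivity: if (a, b) and (b, c) are matched, (a, c) should be matched.
    transitivity = [
        (a, b, c)
        for a, b, c in permutations(records, 3)
        if is_match(a, b) and is_match(b, c) and not is_match(a, c)
    ]
    return {
        "reflexivity": reflexivity,
        "symmetry": symmetry,
        "transitivity": transitivity,
    }


# Toy predictions standing in for the output of a pairwise classifier.
predicted = {
    ("r1", "r1"), ("r2", "r2"), ("r3", "r3"),
    ("r1", "r2"), ("r2", "r1"), ("r2", "r3"),
}
violations = constraint_violations(
    ["r1", "r2", "r3"], lambda a, b: (a, b) in predicted
)
```

In this toy example the matcher predicts (r2, r3) but not (r3, r2), and it predicts (r1, r2) and (r2, r3) but not (r1, r3), so one symmetry violation and one transitivity violation are reported. The check enumerates all ordered triples, so it is only practical for small candidate sets.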
In this thesis, I study constraints in SOTA deep ER solutions and integrate application-dependent and application-independent constraints with these solutions. I focus on transitivity, symmetry, and reflexivity as application-independent constraints and fairness constraints as application-dependent constraints. I present a debiasing algorithm that applies these constraints using data augmentation and show this algorithm's effectiveness on real-world data.
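As a rough illustration of constraint-driven data augmentation, the sketch below expands a set of labeled matched pairs with the pairs implied by reflexivity, symmetry, and transitivity. It is an assumption about what such augmentation could look like, not the debiasing algorithm presented in this thesis; the function name augment_matches and the union-find grouping are hypothetical choices, and fairness constraints are not covered.

```python
from itertools import combinations


def augment_matches(matched_pairs):
    """Return the closure of matched pairs under reflexivity, symmetry, and transitivity.

    Hypothetical helper: records linked by any chain of matches are grouped
    with union-find, then every ordered pair within a group is emitted.
    """
    parent = {}

    def find(x):
        # Find the representative of x's group, with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the groups of every labeled matched pair.
    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    # Collect the members of each group.
    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), []).append(record)

    augmented = set()
    for members in clusters.values():
        for a in members:
            augmented.add((a, a))          # reflexivity
        for a, b in combinations(members, 2):
            augmented.add((a, b))          # implied by transitivity
            augmented.add((b, a))          # symmetry
    return augmented


augmented = augment_matches([("a", "b"), ("b", "c")])
```

Here the two labeled pairs (a, b) and (b, c) yield nine training pairs, including the implied match (a, c) and its mirror (c, a). The closure trusts every input label, so a single wrong match can merge two groups; any real use would need to account for label noise.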