Electronic Thesis and Dissertation Repository

Thesis Format

Monograph

Degree

Master of Science

Program

Computer Science

Supervisor

Mostafa Milani

Abstract

Entity resolution (ER) is the problem of finding duplicate data in a dataset and resolving possible differences and inconsistencies. ER is a long-standing data management and information retrieval problem and a core data integration and cleaning task. There are diverse solutions for ER that apply rule-based techniques, pairwise binary classification, clustering, and probabilistic inference, among other techniques. Deep learning (DL) has been extensively used for ER and has shown competitive performance compared to conventional ER solutions. The state-of-the-art (SOTA) ER solutions using DL are based on pairwise comparison and binary classification. They transform pairs of records into a latent space where the records can be effectively compared and classified as matched or unmatched. However, these techniques ignore possible constraints in record matching, including application-independent constraints (e.g., transitivity, symmetry, and reflexivity for matched records) and application-dependent constraints (e.g., cardinality constraints and fairness constraints).

In this thesis, I study constraints in SOTA deep ER solutions and integrate application-dependent and application-independent constraints with these solutions. I focus on transitivity, symmetry, and reflexivity as application-independent constraints and fairness constraints as application-dependent constraints. I present a debiasing algorithm that applies these constraints using data augmentation and show this algorithm’s effectiveness with real-world data.


Summary for Lay Audience

Entity resolution (ER) is the problem of finding duplicate data in a dataset and resolving possible inconsistencies and differences in this duplicate data. ER is one of the core tasks in data integration, where data from overlapping and possibly conflicting sources is integrated. It is also central to data quality assessment and cleaning, where erroneous duplicate records decrease data quality and hinder data usage. ER has been studied in several areas, including data management, information retrieval, machine learning (ML), artificial intelligence, and natural language processing (NLP).

Applications often collect data from heterogeneous sources where records have different features. This data heterogeneity can make finding relevant features for record comparison and resolution a daunting task.

Because of these technical challenges, many ER techniques have been developed. Among the more recent are learning-based solutions, which apply supervised or unsupervised machine learning (ML) to ER. In this thesis, I focus on supervised learning, which treats ER as a binary classification problem over pairs of records: each pair is classified as matched or unmatched.
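Because each pair is classified on its own, a pairwise classifier can predict that record A matches B and B matches C without predicting that A matches C. The following is a minimal sketch of how such transitivity violations could be detected in a list of predicted matches. It is an illustration only, not the algorithm developed in the thesis; the function name `transitivity_violations` and the record identifiers are hypothetical.

```python
from itertools import combinations


def transitivity_violations(matched_pairs):
    """Return triples (a, b, c) where a~b and b~c are predicted but a~c is not.

    Hypothetical helper for illustration; record ids must be sortable.
    """
    # Store unordered pairs so every predicted match is treated as symmetric.
    # Self-pairs are skipped because reflexivity holds trivially.
    matched = {frozenset(p) for p in matched_pairs if p[0] != p[1]}

    # Build an adjacency map: record id -> set of records it is matched with.
    neighbours = {}
    for pair in matched:
        a, b = tuple(pair)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Any two neighbours of the same record should also be matched to each other.
    violations = []
    for b, adjacent in neighbours.items():
        for a, c in combinations(sorted(adjacent), 2):
            if frozenset((a, c)) not in matched:
                violations.append((a, b, c))
    return sorted(violations)


predicted = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
print(transitivity_violations(predicted))
```

Each reported triple names the two records that share a neighbour but were not themselves predicted as a match.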

ER usually comes with semantic constraints that ER solutions must satisfy. For example, consider a dataset of patient records collected from two health institutions. If we assume that each institution has no duplicate patient records of its own, an ER solution must match each record with at most one other record. In this thesis, I focus on two types of constraints: fairness constraints and equivalence constraints.
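The patient-record example describes a one-to-one cardinality constraint. As a minimal sketch, assuming predicted matches are given as (institution A record, institution B record) pairs, the check below lists the records that are matched more than once. The function name `cardinality_violations` and the sample identifiers are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def cardinality_violations(matches):
    """List records matched to more than one record in the other source.

    `matches` is an iterable of (record_in_a, record_in_b) pairs.
    Hypothetical helper for illustration only.
    """
    unique_matches = set(matches)  # ignore repeated predictions of the same pair
    count_a = Counter(a for a, _ in unique_matches)
    count_b = Counter(b for _, b in unique_matches)
    return {
        "source_a": sorted(r for r, n in count_a.items() if n > 1),
        "source_b": sorted(r for r, n in count_b.items() if n > 1),
    }


predicted = [("a1", "b1"), ("a1", "b2"), ("a2", "b3")]
print(cardinality_violations(predicted))
```

An empty list for both sources means the predicted matching satisfies the at-most-one constraint.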
