Electronic Thesis and Dissertation Repository

Thesis Format

Integrated Article

Degree

Doctor of Philosophy

Program

Statistics and Actuarial Sciences

Supervisor

McLeod, A. Ian

2nd Supervisor

Wang, Boyu

Co-Supervisor

Abstract

Over the past decades, deep neural networks have achieved unprecedented success in image classification, a success that relies largely on the availability of large-scale, correctly annotated datasets. However, collecting high-quality labels for large-scale datasets is expensive and time-consuming, or even infeasible in practice. Common workarounds include acquiring labels from non-expert labelers, crowdsourcing platforms, or other unreliable sources, all of which inevitably introduce label noise. It is therefore crucial to develop methods that are robust to label noise.

In this thesis, we study deep learning with noisy labels from two aspects. The first part of the thesis, comprising two chapters, is devoted to learning and understanding representations of data in the presence of label noise. In Chapter 2, we propose a novel regularization function to learn noise-robust representations of data, so that classifiers are more reluctant to memorize the label noise. By theoretically investigating the representations induced by the proposed regularization function, we reveal that the learned representations keep information related to true labels and discard information related to corrupted labels, which indicates the robustness of the learned representations. Unlike Chapter 2, which leverages noisy labels, Chapter 3 studies representations learned without any label information, termed self-supervised representations, and focuses on a more realistic scenario where the label noise is instance-dependent. Through both theoretical analysis and empirical results, we show that self-supervised representations have two benefits: (1) the instance-dependent label noise spreads uniformly over the representations; (2) the representations exhibit an intrinsic cluster structure with respect to true labels. These benefits encourage the learned classifiers to align better with the optimal classifiers.

The second part is devoted to understanding the connection between source-free domain adaptation (SFDA) and learning with noisy labels. In Chapter 4, we study SFDA from the perspective of learning with noisy labels and show that SFDA can be formulated as a noisy-label problem. In particular, we theoretically show that one fundamental challenge in SFDA is that the label noise is unbounded, which violates the basic assumption of conventional label noise scenarios. Consequently, we also show that label noise methods based on noise-robust loss functions cannot address it. On the other hand, we prove that the early-time training phenomenon exists in unbounded label noise scenarios. We conduct extensive experiments to demonstrate that leveraging this phenomenon significantly improves existing SFDA algorithms.

Summary for Lay Audience

Over the past decade, deep supervised learning has demonstrated its success in many areas, such as face detection, medical diagnosis, weather forecasting, and customer discovery. This success is primarily due to large-scale datasets with correct annotations. Existing deep supervised learning algorithms are very sensitive to the reliability of annotations: incorrect annotations significantly degrade their performance, and an algorithm trained on a dataset that contains them may learn false relationships. Collecting reliable annotations is time-consuming and expensive, so unreliable annotations are pervasive in many datasets. Therefore, the purpose of this research is to understand these unreliable annotations and to build algorithms that avoid learning false relationships.

In this thesis, we show that our algorithms can successfully avoid the negative effects of unreliable annotations, and we provide theoretical justifications for them. These algorithms can be applied in various fields, such as medical image diagnosis and autonomous driving, both of which are likely to contain unreliable annotations from humans. Applying our algorithms can yield significant savings in time and cost, since collecting perfectly clean annotations for data is no longer needed. We conduct extensive experiments to show the effectiveness of our algorithms.
