
Attribution Robustness of Neural Networks
Abstract
While deep neural networks have demonstrated excellent learning capabilities, the explainability of their predictions remains a challenge due to their black-box nature. Attributions, or feature significance methods, are tools for explaining model predictions; they facilitate model debugging and human-machine collaborative decision making, and help establish trust and compliance in critical applications. Recent work has shown that the attributions of neural networks can be distorted by imperceptible adversarial input perturbations, which makes attributions unreliable as an explainability method. This thesis addresses the research problem of attribution robustness of neural networks and introduces novel techniques that enable robust training at scale.
First, a novel, generic framework of loss functions for robust neural network training is introduced, addressing the restrictive nature of existing frameworks. Second, the bottleneck of the high computational cost of existing robust objectives is addressed by deriving a new, simple and efficient robust training objective termed “cross entropy of attacks”. It is 2 to 10 times faster than existing regularization-based robust objectives for training neural networks on image data while achieving higher attribution robustness (3.5% to 6.2% higher top-k intersection).
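The top-k intersection figure above measures how many of the most important features of an attribution map survive an adversarial perturbation. As a minimal sketch, assuming flattened per-feature attribution scores and an illustrative function name (the exact formulation used in the thesis may differ), it can be computed as:

```python
import numpy as np

def topk_intersection(attr_clean, attr_perturbed, k=100):
    """Fraction of the k highest-magnitude attributions shared between the
    attribution maps of a clean input and its perturbed counterpart."""
    top_clean = np.argsort(np.abs(attr_clean))[-k:]
    top_pert = np.argsort(np.abs(attr_perturbed))[-k:]
    return len(np.intersect1d(top_clean, top_pert)) / k
```

Higher values indicate that the perturbation changed the explanation less, i.e. greater attribution robustness.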
Third, this thesis presents a comprehensive analysis of three key challenges in attribution-robust neural network training: the high computational cost, the trade-off between robustness and accuracy, and the difficulty of hyperparameter tuning. Empirical evidence and guidelines are provided to help researchers navigate these challenges. Techniques to improve robust training efficiency are proposed, including hybrid standard and robust training, the use of a fast one-step attack, and an optimized computation of integrated gradients, together yielding 2x to 6x speed gains.
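For reference, integrated gradients (Sundararajan et al., 2017), mentioned above, is commonly approximated with a Riemann sum along a straight-line path from a baseline to the input. The PyTorch sketch below shows only this standard approximation, not the optimized computation proposed in the thesis; the function signature, step count, and zero baseline are illustrative assumptions.

```python
import torch

def integrated_gradients(model, x, target, baseline=None, steps=32):
    """Standard Riemann-sum approximation of integrated gradients for one input.

    model: differentiable classifier returning logits; target: class index;
    baseline: reference input (defaults to all zeros, a common choice for images).
    """
    if baseline is None:
        baseline = torch.zeros_like(x)
    # Interpolation coefficients along the straight-line path baseline -> x.
    alphas = torch.linspace(0.0, 1.0, steps, device=x.device).view(-1, *([1] * x.dim()))
    path = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)
    # Sum the target logit over all path points and backpropagate once.
    model(path)[:, target].sum().backward()
    avg_grad = path.grad.mean(dim=0)      # average gradient along the path
    return (x - baseline) * avg_grad      # per-feature attribution scores
```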
Finally, an investigation of two properties of attribution-robust neural networks is conducted. It is shown that attribution-robust neural networks are also robust against image corruptions, achieving accuracy gains of 3.58% to 11.94% over standard models. Empirical results further suggest that robust models do not exhibit resilience against spurious correlations.
This thesis also presents work on applying deep learning classifiers in multiple application domains: an empirical benchmark of deep learning for intrusion detection, an LSTM-based pipeline for detecting damage in physical structures, and a self-supervised learning pipeline for classifying industrial time series in a label-efficient manner.