Algorithmic Improvements In Deep Reinforcement Learning

Norman L. Tasfi, The University of Western Ontario

Abstract

Reinforcement Learning (RL) has seen dramatic performance improvements over the past decade, achieving superhuman performance across many domains. Deep Reinforcement Learning (DRL), the combination of RL methods with deep neural networks (DNNs) as function approximators, has unlocked much of this progress. The path to generalized artificial intelligence (GAI) will depend on deep learning (DL) and RL; however, much work is required before the technology reaches anything resembling GAI. This thesis therefore focuses on a subset of areas within RL that require additional research to advance the field: sample efficiency, planning, and task transfer.

The first area, sample efficiency, refers to the amount of data an algorithm requires before converging to stable performance. RL models require immense numbers of samples from the environment, a far cry from other sub-areas such as supervised learning, which often requires an order of magnitude fewer samples. This research proposes a method that learns to reuse previously seen data, rather than discarding it, to improve sample efficiency by a factor of roughly 2x while training 30% faster than state-of-the-art methods.

The second area is planning within RL, where planning refers to an agent using a model of the environment to predict how possible actions will affect its performance. Improved planning leads to increased performance, as the agent gains context before committing to its next action. This thesis proposes a model that learns to act optimally in an environment through a dynamic planning mechanism that adapts on the fly. This dynamic planning ability gives the resulting RL model considerable flexibility, as it can adapt to the demands of particular states in the environment, and it outperforms related methods by 30-45%.

The final area is task transfer, which deals with how readily a model trained on one task can transfer its knowledge to another related task within the same environment. Typically, RL models must be fully retrained on the new task even when the structure of the environment does not change. Here, we introduce two contributions that improve an existing transfer framework known as Successor Features (SF). The first introduces a more flexible reward model with stronger performance and transfer abilities than baseline models, achieving nearly 2x the reward on highly demanding tasks. The second rephrases the SF framework as a simple pair of supervised learning tasks that can dynamically induce policies, drastically simplifying the learning problem while matching performance.
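For readers unfamiliar with Successor Features, the sketch below summarizes the standard SF decomposition on which the transfer contributions build; the notation is the conventional one from the SF literature and is given only for context, not as the specific models proposed in this thesis. Rewards are assumed to factor into task-agnostic features \phi and a task-specific weight vector \mathbf{w}:

\[
r(s, a, s') = \phi(s, a, s')^{\top}\mathbf{w}, \qquad
\psi^{\pi}(s, a) = \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, \phi(S_t, A_t, S_{t+1}) \;\middle|\; S_0 = s,\; A_0 = a\right], \qquad
Q^{\pi}(s, a) = \psi^{\pi}(s, a)^{\top}\mathbf{w}.
\]

Because the successor features \psi^{\pi} do not depend on \mathbf{w}, transferring to a related task in the same environment reduces to estimating a new weight vector while reusing \psi^{\pi}, which is what makes SF a natural framework for the task-transfer contributions described above.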