Safety Considerations In Complex Environments With Reinforcement Learning

Patrick Adjei, The University of Western Ontario

Abstract

This study contributes to safer decision-making in complex environments, including robotic systems, by developing separate reinforcement learning approaches that address different aspects of rational decision-making. These safety approaches are motivated by the standard reinforcement learning objective, which does not account for safety and therefore requires careful consideration in complex environments. The first contribution focuses on constrained Markov decision processes, introducing an indicated constraint method that modifies the Soft Actor-Critic algorithm. This method mitigates the sampling distribution problem in the replay buffer and uses explicit cost-defined labels to create clearer boundaries between ``safe'' and ``unsafe'' states in dynamic environments under a soft constraint approach. The second contribution examines risk-sensitivity through a Prospect Theory-shaped utility function called PTanh, with an emphasis on analyzing how marginal utility affects agent decision-making, revealing critical insights about diminishing returns in the risk-averse parameters. The third contribution implements Cumulative Prospect Theory principles directly within an actor-critic reinforcement learning architecture, modifying the Twin Delayed Actor-Critic algorithm to include a risk-sensitive critic that models nonlinear probability weighting and asymmetric evaluation of gains and losses.
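
The abstract does not give the exact functional forms, but a minimal sketch, assuming a tanh-shaped Prospect Theory utility and the classical Tversky-Kahneman probability weighting, illustrates the kind of transforms the second and third contributions build on. The function names, parameter names, and values below are illustrative assumptions, not the thesis's calibrated PTanh settings.

    import numpy as np

    def pt_tanh_utility(x, alpha=0.88, lam=2.25, scale=1.0):
        # Illustrative Prospect Theory-shaped utility (not the thesis's exact PTanh):
        # gains saturate (diminishing marginal utility) while losses are amplified
        # by lam, giving the asymmetric evaluation of gains and losses noted above.
        gains = np.tanh(alpha * np.maximum(x, 0.0) / scale)
        losses = lam * np.tanh(alpha * np.maximum(-x, 0.0) / scale)
        return scale * (gains - losses)

    def tk_probability_weighting(p, gamma=0.61):
        # Classical Tversky-Kahneman (1992) weighting: small probabilities are
        # over-weighted and large ones under-weighted (nonlinear probability weighting).
        return p**gamma / (p**gamma + (1.0 - p)**gamma) ** (1.0 / gamma)

    # A rare (p = 0.05) large loss receives more weight than its expected-value share,
    # which is one mechanism that induces risk-averse behaviour in the agent.
    print(tk_probability_weighting(0.05))                # ~0.13 > 0.05
    print(pt_tanh_utility(1.0), pt_tanh_utility(-1.0))   # ~0.71 vs ~-1.59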

The findings from these contributions demonstrate that each method reduces ``unsafe'' state visitations through a different mechanism. For the first contribution, this reduction is achieved through the clearer boundary obtained from the indicated constraint method. The prospect-shaped utility function PTanh also reduces ``unsafe'' state visitations when the margins are properly considered to induce risk-sensitivity in the agent, offering an additional perspective for practitioners who use similarly shaped utilities. The analysis further shows that, even with small stochasticity in the environment, a risk-seeking strategy maintained throughout training is not favoured. The third contribution, which incorporates Cumulative Prospect Theory into the actor-critic architecture and is demonstrated on environments with deterministic transitions, reflects safer decision-making in mean rewards compared to risk-neutral algorithms, and the empirical evaluations demonstrate faster asymptotic stabilization. Calibrating the probability weighting parameters achieves a balanced risk assessment, preventing the excessive emphasis on early failures that leads to overly conservative behavior. Both algorithmic variants of the third contribution, the mean-based and max-based implementations, demonstrate competitive performance, with theoretical analysis establishing convergence guarantees through contraction mapping properties.
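
The contraction-mapping argument referenced above follows the standard Banach fixed-point template; as a hedged sketch of the general form (the thesis's exact risk-sensitive operator may differ), a Bellman-style operator \(\mathcal{T}\) is shown to be a \(\gamma\)-contraction in the sup-norm,
\[
\lVert \mathcal{T} Q_1 - \mathcal{T} Q_2 \rVert_\infty \le \gamma \, \lVert Q_1 - Q_2 \rVert_\infty, \qquad 0 \le \gamma < 1,
\]
which guarantees a unique fixed point \(Q^{*}\) and convergence of repeated application, \(\mathcal{T}^{k} Q \to Q^{*}\), from any initial value function \(Q\).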