Cautious Reinforcement Learning via Distributional Risk in the Dual Domain

Junyu Zhang, Amrit Singh Bedi, Mengdi Wang, Alec Koppel

Research output: Contribution to journalArticlepeer-review

9 Scopus citations

Abstract

We study the estimation of risk-sensitive policies in reinforcement learning (RL) problems defined by a Markov Decision Process (MDPs) whose state and action spaces are countably finite. Prior efforts are predominately afflicted by computational challenges associated with the fact that risk-sensitive MDPs are time-inconsistent, and hence modifications of Bellman's equations are required, which often do not permit efficient fixed point schemes. To ameliorate this issue, we propose a new definition of risk, which we call caution, as a penalty function added to the dual of the linear programming (LP) formulation of tabular RL. Caution quantifies the distributional risk of a policy, and is a function of the policy's long-term state occupancy measure. When the risk is a convex function of the occupancy measure, we propose to solve this problem in an online model-free manner with a stochastic variant of primal-dual method that uses Kullback-Lieber (KL) divergence as its proximal term. We establish that the sample complexity required to attain near-optimal solutions matches tight dependencies on the cardinality of the spaces, but also depends on the infinity norm of the risk measure's gradient. Moreover, when the risk is non-convex, we propose a block-coordinate augmentation of the aforementioned approach, and establish its convergence to KKT points. Experiments demonstrate that this approach improves the reliability of reward accumulation without additional computation as compared to risk-neutral LP solvers, both for convex and non-convex risks, respectively, for the cases of KL divergence to a prior, and the variance.

Original languageEnglish (US)
Article number9432963
Pages (from-to)611-626
Number of pages16
JournalIEEE Journal on Selected Areas in Information Theory
Volume2
Issue number2
DOIs
StatePublished - Jun 2021
Externally publishedYes

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Media Technology
  • Artificial Intelligence
  • Applied Mathematics

Keywords

  • Markov decision process (MDP)
  • non-convex optimization
  • reinforcement learning
  • stochastic optimization

Fingerprint

Dive into the research topics of 'Cautious Reinforcement Learning via Distributional Risk in the Dual Domain'. Together they form a unique fingerprint.

Cite this