Abstract
We consider the problem of learning control policies that optimize a reward function while satisfying constraints due to considerations of safety, fairness, or other costs. We propose a new algorithm, Projection-Based Constrained Policy Optimization (PCPO). This is an iterative method for optimizing policies in a two-step process: the first step performs a local reward improvement update, while the second step reconciles any constraint violation by projecting the policy back onto the constraint set. We theoretically analyze PCPO and provide a lower bound on reward improvement, and an upper bound on constraint violation, for each policy update. We further characterize the convergence of PCPO based on two different metrics: L2 norm and Kullback-Leibler divergence. Our empirical results over several control tasks demonstrate that PCPO achieves superior performance, averaging more than 3.5 times less constraint violation and around 15% higher reward compared to state-of-the-art methods.
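To make the two-step structure concrete, here is a minimal NumPy sketch of one PCPO-style parameter update under a single linearized cost constraint. The function name `pcpo_update` and the plain gradient-ascent reward step are illustrative assumptions, not the paper's exact algorithm: PCPO's reward step is a trust-region (natural-gradient) update, and the paper analyzes the projection step under both the L2 norm and the KL divergence, of which only the L2 case is shown here.

```python
import numpy as np

def pcpo_update(theta, reward_grad, cost_grad, cost_value, cost_limit,
                step_size=0.01):
    """One PCPO-style update (illustrative sketch, not the paper's method).

    Step 1: local reward improvement (plain gradient ascent here).
    Step 2: project back onto the constraint set
            {theta' : J_C(theta) + g_C^T (theta' - theta) <= d},
            i.e. the cost constraint linearized at the current policy.
    """
    # Step 1: reward improvement step.
    theta_mid = theta + step_size * reward_grad

    # Step 2: projection, only needed if the linearized constraint is violated.
    violation = cost_value + cost_grad @ (theta_mid - theta) - cost_limit
    if violation > 0:
        # Closest point (in L2) on the boundary of the constraint halfspace.
        theta_new = theta_mid - (violation / (cost_grad @ cost_grad)) * cost_grad
    else:
        theta_new = theta_mid
    return theta_new

# Example: the reward step pushes the policy past the cost limit,
# and the projection pulls it back onto the constraint boundary.
theta = pcpo_update(theta=np.zeros(3),
                    reward_grad=np.array([1.0, 0.0, 0.0]),
                    cost_grad=np.array([1.0, 1.0, 0.0]),
                    cost_value=0.9, cost_limit=1.0, step_size=0.5)
```

After the projection, the linearized cost equals the limit exactly, which mirrors the paper's decoupling of reward improvement from constraint satisfaction: the first step ignores the constraint entirely, and the second step repairs any resulting violation.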
| Original language | English (US) |
|---|---|
| State | Published - 2020 |
| Event | 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia (Apr 30 2020 → …) |
Conference
| Conference | 8th International Conference on Learning Representations, ICLR 2020 |
|---|---|
| Country/Territory | Ethiopia |
| City | Addis Ababa |
| Period | 4/30/20 → … |
All Science Journal Classification (ASJC) codes
- Education
- Linguistics and Language
- Language and Linguistics
- Computer Science Applications