TY - GEN
T1 - Diffusion Policy Policy Optimization
AU - Ren, Allen Z.
AU - Lidard, Justin
AU - Ankile, Lars L.
AU - Simeonov, Anthony
AU - Agrawal, Pulkit
AU - Majumdar, Anirudha
AU - Burchfiel, Benjamin
AU - Dai, Hongkai
AU - Simchowitz, Max
N1 - Publisher Copyright:
© 2025 13th International Conference on Learning Representations, ICLR 2025. All rights reserved.
PY - 2025
Y1 - 2025
AB - We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g., Diffusion Policy; Chi et al., 2024b) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG methods are ubiquitous in training RL policies with other policy parameterizations; nevertheless, they have been conjectured to be less efficient for diffusion-based policies. Surprisingly, we show that DPPO achieves the strongest overall performance and efficiency for fine-tuning on common benchmarks, compared both to other RL methods for diffusion-based policies and to PG fine-tuning of other policy parameterizations. Through experimental investigation, we find that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization, leading to structured and on-manifold exploration, stable training, and strong policy robustness. We further demonstrate the strengths of DPPO in a range of realistic settings, including simulated robotic tasks with pixel observations, and via zero-shot deployment of simulation-trained policies on robot hardware in a long-horizon, multi-stage manipulation task. Webpage with code: diffusion-ppo.github.io.
UR - https://www.scopus.com/pages/publications/105010244022
M3 - Conference contribution
AN - SCOPUS:105010244022
T3 - 13th International Conference on Learning Representations, ICLR 2025
SP - 58018
EP - 58059
BT - 13th International Conference on Learning Representations, ICLR 2025
PB - International Conference on Learning Representations, ICLR
T2 - 13th International Conference on Learning Representations, ICLR 2025
Y2 - 24 April 2025 through 28 April 2025
ER -