Stochastic policy gradient reinforcement learning on a simple 3D biped

Russ Tedrake, Teresa Weirui Zhang, Hyunjune Sebastian Seung

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

214 Scopus citations

Abstract

We present a learning system that quickly and reliably acquires a robust feedback control policy for 3D dynamic walking from a blank slate, using only trials implemented on our physical robot. The robot begins walking within a minute and learning converges in approximately 20 minutes. This success can be attributed to the mechanics of our robot, which are modeled after a passive dynamic walker, and to a dramatic reduction in the dimensionality of the learning problem. We reduce the dimensionality by designing a robot with only 6 internal degrees of freedom and 4 actuators, by decomposing the control system into the frontal and sagittal planes, and by formulating the learning problem on the discrete return map dynamics. We apply a stochastic policy gradient algorithm to this reduced problem and decrease the variance of the update using a state-based estimate of the expected cost. This optimized learning system works quickly enough that the robot is able to continually adapt to the terrain as it walks.
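The abstract describes the method only at a high level. As a rough illustration of the idea it names, the sketch below shows a generic stochastic policy gradient update on a discrete return map, with a state-dependent baseline subtracted from the observed cost to reduce the variance of the update. Everything here (the linear return-map placeholder `step_return_map`, the quadratic `cost`, the feature and learning-rate choices) is a hypothetical stand-in, not the paper's actual controller or implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_state, n_params = 2, 4          # reduced return-map state, policy parameters
w = np.zeros(n_params)            # policy parameters (feedback gains)
v = np.zeros(n_state + 1)         # baseline weights: state-based cost estimate
sigma = 0.1                       # exploration noise on the parameters
alpha_w, alpha_v = 0.01, 0.1      # learning rates for policy and baseline


def features(x):
    """Simple state features for the baseline (placeholder)."""
    return np.append(x, 1.0)


def step_return_map(x, params):
    """Placeholder for one crossing of the return map under the given
    parameters; the real system would take one walking step on the robot
    and measure the next Poincare-section state."""
    A = np.array([[0.9, 0.1], [0.0, 0.8]])
    return A @ x + 0.05 * params[:2] - 0.05 * params[2:]


def cost(x):
    """Placeholder one-step cost, e.g. deviation from a desired fixed point."""
    return float(x @ x)


x = np.array([0.5, -0.3])
for step in range(2000):
    # Perturb the policy parameters for exploration.
    z = sigma * rng.standard_normal(n_params)
    x_next = step_return_map(x, w + z)
    c = cost(x_next)

    # State-based baseline: estimate of the expected cost from state x.
    b = float(v @ features(x))

    # Stochastic policy gradient estimate; subtracting the baseline
    # lowers the variance of the update without biasing it.
    w -= alpha_w * (c - b) * z / sigma**2

    # Move the baseline toward the observed cost.
    v += alpha_v * (c - b) * features(x)

    x = x_next
```

Because every update uses only the most recent step of the return map, an on-robot version of this scheme can keep running while the robot walks, which is how continual adaptation to the terrain becomes possible.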

Original language: English (US)
Title of host publication: 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Pages: 2849-2854
Number of pages: 6
State: Published - 2004
Event: 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) - Sendai, Japan
Duration: Sep 28, 2004 - Oct 2, 2004

Publication series

Name: 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Volume: 3


All Science Journal Classification (ASJC) codes

  • General Engineering
