On the worst-case analysis of temporal-difference learning algorithms

Research output: Contribution to journalArticle

19 Scopus citations

Abstract

We study the behavior of a family of learning algorithms, based on Sutton's method of temporal differences. In our on line learning framework, learning takes place in a sequence of trials, and the goal of the learning algorithm is to estimate a discounted sum of all the reinforcements that will he received in the future. In this setting, we are able to prove general upper hounds on the performance of a slightly modified version of Sutton's so-called TD(λ) algorithm. These bounds are slated in terms of the performance of the best linear predictor on the given training sequence, and are proved without making any statistical assumptions of any kind about the process producing the learner's observed training sequence. We also prove lower bounds on the performance of any algorithm for this learning problem and give a similar analysis of the closely related problem of learning to predict in a model in which the learner must produce predictions (or a whole batch of observations before receiving reinforcement.

Original languageEnglish (US)
Pages (from-to)95-121
Number of pages27
JournalMachine Learning
Volume22
Issue number1-3
DOIs
StatePublished - Jan 1 1996
Externally publishedYes

All Science Journal Classification (ASJC) codes

  • Software
  • Artificial Intelligence

Keywords

  • Machine learning
  • On-line learning
  • Temporal-difference learning
  • Worst-case analysis

Fingerprint Dive into the research topics of 'On the worst-case analysis of temporal-difference learning algorithms'. Together they form a unique fingerprint.

  • Cite this