Efficient Reinforcement Learning with Impaired Observability: Learning to Act with Delayed and Missing State Observations

Minshuo Chen, Jie Meng, Yu Bai, Yinyu Ye, H. Vincent Poor, Mengdi Wang

Research output: Contribution to journal › Article › peer-review

Abstract

In real-world reinforcement learning (RL) systems, various forms of impaired observability can complicate decision-making. These situations arise when an agent is unable to observe the most recent state of the system due to latency or lossy channels, yet must still make real-time decisions. This paper presents a theoretical investigation of efficient RL in control systems where agents must act with delayed and missing state observations. We present algorithms and establish near-optimal regret upper and lower bounds of the form O(√(poly(H)·SAK)) for RL in the delayed and missing observation settings. Here S and A are the sizes of the state and action spaces, H is the time horizon, and K is the number of episodes. Although impaired observability poses significant challenges to the policy class and to planning, our results demonstrate that learning remains efficient, with the regret bound depending optimally on the state-action size of the original system. Additionally, we characterize the performance of the optimal policy under impaired observability, comparing it to the optimal value obtained with full observability. Numerical results are provided to support our theory.
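The delayed-observation setting described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's algorithm; it is a minimal, self-contained example (with an arbitrary random MDP and a naive policy, both hypothetical) showing the key structural point: at step h the agent only sees the state from step h−d, so its policy must act on the last observed state together with the actions taken since.

```python
import numpy as np

# Minimal sketch (not the paper's algorithm): an episodic tabular MDP in
# which the agent observes the state from d steps ago, so a policy maps
# (delayed state, actions taken since that observation) -> action.
rng = np.random.default_rng(0)
S, A, H, d = 4, 2, 6, 1                      # states, actions, horizon, delay
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
R = rng.uniform(size=(S, A))                 # rewards R[s, a] in [0, 1]

def run_episode(policy):
    """Roll out one episode; the agent sees s_{h-d}, not the current state s_h."""
    s_hist = [int(rng.integers(S))]          # true state trajectory
    a_hist = []                              # actions taken so far
    total = 0.0
    for h in range(H):
        obs = s_hist[max(0, len(s_hist) - 1 - d)]   # delayed observation
        pending = tuple(a_hist[-d:])                 # actions since that observation
        a = policy(obs, pending)
        s = s_hist[-1]                               # true (unobserved) current state
        total += R[s, a]
        s_hist.append(int(rng.choice(S, p=P[s, a])))
        a_hist.append(a)
    return total

# A naive baseline policy that ignores the pending actions entirely.
greedy = lambda obs, pending: int(np.argmax(R[obs]))
ret = run_episode(greedy)
```

Because rewards lie in [0, 1], any episode return falls in [0, H]; the regret analyzed in the paper compares such returns against the optimal policy in the corresponding delayed-observation policy class.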

Original language: English (US)
Pages (from-to): 7251-7272
Number of pages: 22
Journal: IEEE Transactions on Information Theory
Volume: 70
Issue number: 10
DOIs
State: Published - 2024

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Computer Science Applications
  • Library and Information Sciences

Keywords

  • Delayed and missing observations
  • Markov decision process
  • Regret upper and lower bounds
