Abstract
We consider an optimal learning problem where we are trying to learn a function that is nonlinear in unknown parameters in an online setting. We formulate the problem as a dynamic program, provide the optimality condition using Bellman’s equation, and propose a multiperiod lookahead policy to overcome the nonconcavity in the value of information. We adopt a sampled belief model, which we refer to as a discrete prior. For an infinite-horizon problem with discounted cumulative rewards, we prove asymptotic convergence properties under the proposed policy, a rare result for online learning. We then demonstrate the approach in three different settings: a health setting where we make medical decisions to maximize healthcare response over time, a dynamic pricing setting where we make pricing decisions to maximize the cumulative revenue, and a clinical pharmacology setting where we make dosage controls to minimize the deviation between actual and target effects.
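To make the "sampled belief model" concrete, the sketch below shows one way a discrete prior could be updated after a single observation. It is an illustration under stated assumptions, not the paper's implementation: the logistic response function, the Gaussian noise model, and the names `logistic`, `update_discrete_prior`, `thetas`, and `noise_sd` are all hypothetical choices made for this example.

```python
import math

def logistic(x, theta):
    """Hypothetical nonlinear model: response to decision x under theta = (a, b)."""
    a, b = theta
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

def update_discrete_prior(probs, thetas, x, y, noise_sd):
    """Bayes update of the weights on K candidate parameter vectors.

    Assumes y = f(x; theta) + Gaussian noise with standard deviation noise_sd.
    """
    # Unnormalized Gaussian likelihood of the observation under each candidate.
    likes = [
        math.exp(-0.5 * ((y - logistic(x, th)) / noise_sd) ** 2)
        for th in thetas
    ]
    unnorm = [p * l for p, l in zip(probs, likes)]
    total = sum(unnorm)
    if total == 0.0:
        # Every likelihood underflowed; leave the belief unchanged.
        return list(probs)
    return [u / total for u in unnorm]

# Three sampled candidates for the unknown parameters, equally weighted.
thetas = [(-1.0, 0.5), (0.0, 1.0), (1.0, -0.5)]
probs = [1 / 3, 1 / 3, 1 / 3]

# Observe a response of 0.88 after choosing decision x = 2.0.
posterior = update_discrete_prior(probs, thetas, x=2.0, y=0.88, noise_sd=0.1)
```

Because the belief is a finite set of weights, the update stays in closed form even though the model is nonlinear in its parameters; here the weight shifts toward the second candidate, whose predicted response at `x = 2.0` is closest to the observation.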
| Original language | English (US) |
|---|---|
| Pages (from-to) | 1538-1556 |
| Number of pages | 19 |
| Journal | Operations Research |
| Volume | 68 |
| Issue number | 5 |
| State | Published - Sep 2020 |
All Science Journal Classification (ASJC) codes
- Computer Science Applications
- Management Science and Operations Research
Keywords
- Dynamic program
- Knowledge gradient
- Multiarmed bandit
- Multiperiod lookahead
- Online learning
- Optimal learning
- Value of information
Title: Optimal online learning for nonlinear belief models using discrete priors