Abstract
A proper balance between exploitation and exploration underlies good decisions, those that achieve high rewards such as payoff or evolutionary fitness. The Infomax principle postulates that maximization of information directs the function of diverse systems, from living systems to artificial neural networks. While specific applications have proved successful, the validity of information as a proxy for reward remains unclear. Here, we consider the multi-armed bandit decision problem, which features arms (slot machines) with unknown probabilities of success and a player trying to maximize cumulative payoff by choosing the sequence of arms to play. We show that an Infomax strategy (Info-p), which optimally gathers information on the highest probability of success among the arms, saturates known optimal bounds and compares favorably to existing policies. Conversely, gathering information on the identity of the best arm in the bandit leads to a strategy that is vastly suboptimal in terms of payoff. The nature of the quantity selected for Infomax acquisition is thus crucial for effective tradeoffs between exploration and exploitation.
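To make the bandit setting concrete, the following is a minimal sketch of a Bernoulli multi-armed bandit in which a player repeatedly chooses an arm and accumulates payoff. The abstract does not specify the Info-p rule, so the policy used here is Thompson sampling, a standard existing policy swapped in purely for illustration. The function name `run_bandit`, the arm probabilities, the horizon, and the seed are all hypothetical choices, not taken from the paper.

```python
import random


def run_bandit(probs, horizon, seed=0):
    """Play a Bernoulli bandit with Thompson sampling (a stand-in policy, NOT Info-p)."""
    rng = random.Random(seed)
    n_arms = len(probs)
    wins = [0] * n_arms    # observed successes per arm
    losses = [0] * n_arms  # observed failures per arm
    payoff = 0
    for _ in range(horizon):
        # Draw one plausible success probability per arm from its
        # Beta posterior (uniform prior), then play the arm with the largest draw.
        samples = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        # The true probabilities are hidden from the player; only the outcome is seen.
        reward = 1 if rng.random() < probs[arm] else 0
        wins[arm] += reward
        losses[arm] += 1 - reward
        payoff += reward
    pulls = [wins[i] + losses[i] for i in range(n_arms)]
    # Expected regret: shortfall relative to always playing the best arm.
    best = max(probs)
    regret = sum((best - probs[i]) * pulls[i] for i in range(n_arms))
    return payoff, pulls, regret


# Hypothetical two-armed example with success probabilities 0.3 and 0.7.
payoff, pulls, regret = run_bandit([0.3, 0.7], horizon=1000)
print(payoff, pulls, regret)
```

The regret computed at the end is the usual yardstick for such policies: the expected payoff lost by playing suboptimal arms, which a good exploration-exploitation tradeoff keeps small as the horizon grows.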
Original language | English (US) |
---|---|
Pages (from-to) | 1454-1476 |
Number of pages | 23 |
Journal | Journal of Statistical Physics |
Volume | 163 |
Issue number | 6 |
State | Published - Jun 1 2016 |
Externally published | Yes |
All Science Journal Classification (ASJC) codes
- Statistical and Nonlinear Physics
- Mathematical Physics
Keywords
- Decision and information theory
- Exploration and exploitation
- Infomax
- Large deviations
- Multi-armed bandits