Abstract
Optimization of smooth reward functions under bandit feedback is a long-standing problem in online learning. This paper approaches this problem by studying the convergence under smoothness and Kurdyka-Lojasiewicz conditions. We design a search-based algorithm that achieves an improved rate compared to the standard gradient-based method. In conjunction with a matching lower bound, this algorithm shows optimality in the dependence on precision for the low-dimension regime.
Original language | English (US) |
---|---|
Pages (from-to) | 6806-6821 |
Number of pages | 16 |
Journal | Proceedings of Machine Learning Research |
Volume | 206 |
State | Published - 2023 |
Event | 26th International Conference on Artificial Intelligence and Statistics, AISTATS 2023 - Valencia, Spain Duration: Apr 25 2023 → Apr 27 2023 |
All Science Journal Classification (ASJC) codes
- Artificial Intelligence
- Software
- Control and Systems Engineering
- Statistics and Probability