Abstract
This paper develops a general theory of the optimal allocation of multiarmed bandit (MAB) processes subject to arm-switching constraints formulated as a general random time set. A Gittins index is constructed for each single arm, and the optimality of the corresponding Gittins index policy is proved. The constrained MAB model and the Gittins index policy established in this paper subsume those for MAB processes in continuous-time, integer-time, semi-Markovian, and general discrete-time settings. Consequently, the new theory covers the classical MAB models as special cases and also applies to many other situations that have not yet been studied in the literature. While the proof of the optimality of the Gittins index policy benefits from ideas in the existing theory of MAB processes in continuous time, new techniques are introduced that drastically simplify the argument.
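As a rough illustration of the index-policy idea only (not the paper's construction, which handles general switching constraints), the sketch below computes classical Gittins indices for finite-state, discounted Markov arms via Whittle's retirement-reward characterisation and then plays the arm whose current state has the largest index. The function names, discount factor, and example transition matrices are hypothetical choices made for this sketch.

```python
# Minimal sketch, assuming finite-state discounted Markov arms (not the
# constrained model of the paper). Names and parameters are illustrative only.
import numpy as np

def gittins_index(P, r, state, beta=0.9, tol=1e-6):
    """Approximate the Gittins index of `state` for a single arm.

    P    : (n, n) transition matrix of the arm's state process
    r    : (n,)   expected one-step reward in each state
    beta : discount factor in (0, 1)

    Uses the retirement-reward characterisation: the index equals
    (1 - beta) * M*, where M* is the smallest lump-sum retirement reward
    that makes immediate retirement optimal in `state`.
    """
    lo, hi = r.min() / (1 - beta), r.max() / (1 - beta)
    while hi - lo > tol:
        M = 0.5 * (lo + hi)
        # Value iteration for V(s) = max{ M, r(s) + beta * (P V)(s) }.
        V = np.full(len(r), M)
        for _ in range(2000):
            V_new = np.maximum(M, r + beta * (P @ V))
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        if V[state] <= M + tol:   # retiring immediately is optimal: M is large enough
            hi = M
        else:
            lo = M
    return (1 - beta) * hi

def index_policy(arms, beta=0.9):
    """Select the arm whose current state has the largest Gittins index.

    `arms` is a list of (P, r, current_state) triples, one per arm.
    """
    indices = [gittins_index(P, r, s, beta) for (P, r, s) in arms]
    return int(np.argmax(indices))

# Example: two 2-state arms; the policy returns the arm with the higher index.
P1 = np.array([[0.7, 0.3], [0.4, 0.6]]); r1 = np.array([1.0, 0.2])
P2 = np.array([[0.5, 0.5], [0.1, 0.9]]); r2 = np.array([0.6, 0.8])
print(index_policy([(P1, r1, 0), (P2, r2, 1)]))
```

For a fixed retirement reward M, the inner value iteration solves V(s) = max{M, r(s) + β(PV)(s)}; the outer binary search locates the smallest M at which retiring in the queried state is optimal, and (1 − β)M is the index.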
| Field | Value |
|---|---|
| Original language | English |
| Pages (from-to) | 4666-4688 |
| Number of pages | 23 |
| Journal | SIAM Journal on Control and Optimization |
| Volume | 59 |
| Issue number | 6 |
| DOIs | |
| State | Published - 2021 |
Keywords
- Gittins index
- machine learning/reinforcement learning
- multiarmed bandit processes
- restricted stopping time
- stochastic adaptive control