Abstract
This paper investigates the divide-and-conquer (DC) algorithm as a means of processing large datasets in offline policy evaluation under contextual bandit settings. We address the critical issue of choosing the number of machines as the dataset size grows, and establish a theoretical upper bound on the number of machines that controls the information loss incurred by the DC algorithm. Our goal is an estimator whose accuracy matches that of an ideal estimator computed directly from the complete dataset. We show that the DC estimator improves computational efficiency while maintaining statistical efficiency, and that, when the number of machines is chosen appropriately, it attains the minimax-optimal rate. Furthermore, we extend the DC algorithm to offline policy evaluation in reinforcement learning (RL), explore how the number of machines interacts with combinations of distribution shifts and horizons, and demonstrate the enhanced computational efficiency through an extensive set of simulation experiments.
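The DC scheme the abstract describes can be sketched as follows: split the logged data across machines, compute a local off-policy value estimate on each shard, and average the local estimates. The sketch below uses a standard inverse-propensity-scoring (IPS) estimator on synthetic bandit data; the reward model, policies, and all names here are hypothetical illustrations, not the paper's actual estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic logged bandit data (contexts omitted for brevity):
# actions drawn from a known uniform logging policy, rewards observed.
n, n_actions = 100_000, 5
logging_probs = np.full(n_actions, 1.0 / n_actions)       # uniform logging policy
actions = rng.integers(0, n_actions, size=n)
rewards = rng.normal(loc=actions / n_actions, scale=1.0)  # hypothetical reward model

# Target policy to evaluate: deterministically pick the last action.
target_probs = np.zeros(n_actions)
target_probs[-1] = 1.0

def ips_estimate(a, r):
    """Inverse-propensity-scoring value estimate on one data shard."""
    w = target_probs[a] / logging_probs[a]  # importance weights
    return np.mean(w * r)

# Divide-and-conquer: split across K "machines", estimate locally, average.
K = 10
shards = np.array_split(np.arange(n), K)
local = [ips_estimate(actions[idx], rewards[idx]) for idx in shards]
dc_estimate = np.mean(local)

# Ideal full-data estimator, for comparison.
full_estimate = ips_estimate(actions, rewards)
print(dc_estimate, full_estimate)
```

For a linear estimator such as IPS with equal-size shards, the averaged local estimates coincide with the full-data estimate; the paper's contribution concerns the nonlinear case, where averaging incurs an information loss that must be controlled by bounding the number of machines K.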
| Original language | English |
|---|---|
| Article number | 130822 |
| Journal | Physica A: Statistical Mechanics and its Applications |
| Volume | 676 |
| DOIs | |
| State | Published - 15 Oct 2025 |
Keywords
- Contextual bandit
- Divide-and-conquer algorithm
- Minimax-optimal
- Offline policy evaluation
- Reinforcement learning