Divide-and-conquer offline policy evaluation for contextual bandits

Research output: Contribution to journal › Article › peer-review

Abstract

This paper investigates the application of the divide-and-conquer (DC) algorithm to the challenge of processing large datasets in offline policy evaluation within contextual bandit settings. We address the critical issue of determining the optimal number of machines as the dataset size scales, and establish a theoretical upper bound on the number of machines that controls the information loss incurred by the DC algorithm. Our goal is an estimator whose estimation accuracy matches that of an ideal direct estimator computed on the complete dataset. We show that the DC estimator improves computational efficiency while maintaining statistical efficiency; when the number of machines is chosen appropriately, the estimator is minimax-rate optimal. Furthermore, we extend the DC algorithm to offline policy evaluation in reinforcement learning (RL) and explore how the appropriate number of machines depends on combinations of distribution shifts and horizons, demonstrating the enhanced computational efficiency through an extensive set of simulation experiments.
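To illustrate the divide-and-conquer idea described above, the following is a minimal sketch, not the authors' method: logged bandit data are split across hypothetical "machines" (shards), a standard inverse-propensity-weighted (IPW) value estimate is computed on each shard, and the shard estimates are averaged. All names, the synthetic data, and the choice of the IPW estimator are illustrative assumptions.

```python
# Hypothetical sketch of divide-and-conquer (DC) offline policy evaluation
# for a contextual bandit. Assumes logged tuples of (action, reward,
# logging propensity) and the target policy's action probabilities.
import numpy as np

def ipw_estimate(rewards, propensities, target_probs):
    """Inverse-propensity-weighted value estimate on one shard of data."""
    weights = target_probs / propensities
    return float(np.mean(weights * rewards))

def dc_ipw_estimate(rewards, propensities, target_probs, n_machines):
    """Split the dataset into n_machines shards, estimate the target
    policy's value on each shard, then average the shard estimates."""
    shard_estimates = [
        ipw_estimate(r, p, t)
        for r, p, t in zip(
            np.array_split(rewards, n_machines),
            np.array_split(propensities, n_machines),
            np.array_split(target_probs, n_machines),
        )
    ]
    return float(np.mean(shard_estimates))

# Synthetic logged data: uniform logging policy over 2 actions; action 1
# yields mean reward 0.5, action 0 yields mean reward 0.
rng = np.random.default_rng(0)
n = 100_000
actions = rng.integers(0, 2, size=n)
rewards = rng.normal(loc=actions * 0.5, scale=1.0)
propensities = np.full(n, 0.5)                        # logging policy
target_probs = np.where(actions == 1, 0.8, 0.2)       # target policy

full = ipw_estimate(rewards, propensities, target_probs)
dc = dc_ipw_estimate(rewards, propensities, target_probs, n_machines=10)
```

Because the IPW estimator is a sample mean, averaging over equal-sized shards reproduces the full-data estimate exactly here; the interesting trade-offs arise for nonlinear estimators, where the paper's bound on the number of machines limits the information loss from sharding.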

Original language: English
Article number: 130822
Journal: Physica A: Statistical Mechanics and its Applications
Volume: 676
State: Published - 15 Oct 2025

Keywords

  • Contextual bandit
  • Divide-and-conquer algorithm
  • Minimax-optimal
  • Offline policy evaluation
  • Reinforcement learning

