Minimax weight learning for absorbing MDPs

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Reinforcement learning policy evaluation problems are often modeled as finite-horizon or discounted/average-reward infinite-horizon Markov Decision Processes (MDPs). In this paper, we study undiscounted off-policy evaluation for absorbing MDPs. Given a dataset of i.i.d. episodes truncated at a given level, we propose an algorithm (referred to as MWLA in the text) that directly estimates the expected return via the importance ratio of the state-action occupancy measure. A Mean Square Error (MSE) bound for the MWLA method is provided, and the dependence of the statistical error on the data size and the truncation level is analyzed.
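The abstract's core idea, estimating the expected return by reweighting behavior-policy data with the ratio of state-action occupancy measures, can be sketched as follows. This is an illustrative simplification, not the paper's exact MWLA procedure: the function name `weighted_return_estimate` and the assumption that a ratio function `w(s, a) ≈ d_pi(s, a) / d_b(s, a)` has already been learned (e.g. by a minimax weight-learning step) are hypothetical.

```python
import numpy as np

def weighted_return_estimate(states, actions, rewards, w):
    """Estimate the target policy's expected return from behavior-policy data.

    states, actions, rewards: flat sequences of (s, a, r) samples pooled
    from truncated episodes collected under the behavior policy.
    w: learned occupancy-ratio weight function w(s, a) ~ d_pi(s,a)/d_b(s,a).

    The estimator is the importance-weighted average of observed rewards,
    which under the occupancy-ratio interpretation approximates the
    expected per-sample reward under the target policy's occupancy measure.
    """
    ratios = np.array([w(s, a) for s, a in zip(states, actions)])
    return float(np.sum(ratios * np.asarray(rewards, dtype=float)) / len(rewards))
```

When the behavior and target policies coincide, w ≡ 1 and the estimator reduces to the plain sample average of rewards.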

Original language: English
Pages (from-to): 3545-3582
Number of pages: 38
Journal: Statistical Papers
Volume: 65
Issue number: 6
DOIs
State: Published - Aug 2024

Keywords

  • Absorbing MDP
  • Minimax weight learning
  • Occupancy measure
  • Off-policy
  • Policy evaluation
