Abstract
Reinforcement learning policy evaluation problems are often modeled as finite-horizon or discounted/average-reward infinite-horizon Markov Decision Processes (MDPs). In this paper, we study undiscounted off-policy evaluation for absorbing MDPs. Given a dataset of i.i.d. episodes collected under a given truncation level, we propose an algorithm (referred to as MWLA in the text) that directly estimates the expected return via the importance ratio of the state-action occupancy measure. A mean squared error (MSE) bound for the MWLA method is provided, and the dependence of the statistical error on the data size and the truncation level is analyzed.
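As a rough, self-contained illustration of the occupancy-ratio idea behind minimax weight learning, the sketch below estimates the undiscounted return of a target policy in a small synthetic absorbing MDP. It is a minimal sketch under stated assumptions, not the paper's MWLA estimator or its analysis: the tabular MDP, the two policies, the truncation level `T`, and helper names such as `rollout` and `phi_pi` are all illustrative inventions, and one-hot linear function classes are used so that the inner maximization over the discriminator class reduces to a closed least-squares form.

```python
# Sketch: minimax-weight-style off-policy evaluation in an absorbing MDP.
# All quantities below are illustrative assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)

# ---- Hypothetical absorbing MDP: states 0..3 transient, state 4 absorbing ----
nS, nA, ABSORB = 5, 2, 4
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is a distribution over s'
P[:, :, ABSORB] += 0.2                          # boost absorption probability
P /= P.sum(axis=2, keepdims=True)
P[ABSORB] = 0.0
P[ABSORB, :, ABSORB] = 1.0                      # absorbing state is a sink
R = rng.uniform(0, 1, size=(nS, nA))
R[ABSORB] = 0.0                                 # no reward after absorption
d0 = np.array([0.25, 0.25, 0.25, 0.25, 0.0])    # initial state distribution

mu = np.full((nS, nA), 1.0 / nA)                # behavior policy (uniform)
pi = rng.dirichlet(np.ones(nA) * 2.0, size=nS)  # target policy

def rollout(policy, max_len):
    """One episode truncated at max_len: list of (s, a, r, s') transitions."""
    s, traj = rng.choice(nS, p=d0), []
    for _ in range(max_len):
        if s == ABSORB:
            break
        a = rng.choice(nA, p=policy[s])
        s2 = rng.choice(nS, p=P[s, a])
        traj.append((s, a, R[s, a], s2))
        s = s2
    return traj

# ---- Dataset: i.i.d. episodes under mu with truncation level T ----
n_ep, T = 5000, 100
episodes = [rollout(mu, T) for _ in range(n_ep)]

# One-hot features over transient (s, a) pairs; w(s, a) = phi(s, a) . alpha
d = (nS - 1) * nA
def phi(s, a):
    v = np.zeros(d)
    v[s * nA + a] = 1.0
    return v

def phi_pi(s):
    """E_{a ~ pi(.|s)} phi(s, a); identically zero at the absorbing state."""
    return np.zeros(d) if s == ABSORB else sum(pi[s, a] * phi(s, a) for a in range(nA))

# ---- Minimax moment condition with linear classes:
# L(w, f) = E_D[w(s,a) (f_pi(s') - f(s,a))] + E_{d0}[f_pi(s0)],
# so max_{||f|| <= 1} L(w, f)^2 = ||A alpha + b||^2, minimized by least squares.
A, b = np.zeros((d, d)), np.zeros(d)
for ep in episodes:
    b += phi_pi(ep[0][0]) / n_ep
    for (s, a, r, s2) in ep:
        A += np.outer(phi_pi(s2) - phi(s, a), phi(s, a)) / n_ep
alpha = np.linalg.lstsq(A, -b, rcond=None)[0]

# ---- Plug-in estimate of the expected undiscounted return under pi ----
est = sum(phi(s, a) @ alpha * r for ep in episodes for (s, a, r, _) in ep) / n_ep
truth = np.mean([sum(r for (_, _, r, _) in rollout(pi, T)) for _ in range(20000)])
print(f"minimax-weight estimate: {est:.3f}   Monte Carlo truth: {truth:.3f}")
```

With tabular features and full behavior-policy coverage, the least-squares solution recovers the exact occupancy ratio up to sampling error; the paper's undiscounted absorbing setting is reflected here by episodes that end at an absorbing state and by the truncation level `T` applied when collecting data.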
| Original language | English |
|---|---|
| Pages (from-to) | 3545-3582 |
| Number of pages | 38 |
| Journal | Statistical Papers |
| Volume | 65 |
| Issue number | 6 |
| DOIs | |
| State | Published - Aug 2024 |
Keywords
- Absorbing MDP
- Minimax weight learning
- Occupancy measure
- Off-policy
- Policy evaluation