Abstract
This work considers the offline policy evaluation problem for indefinite-horizon Markov Decision Processes. A minimax Q-function learning algorithm is proposed that evaluates the undiscounted expected return from i.i.d. trajectories truncated at a given time step, rather than from i.i.d. transition tuples (s, a, s′, r). Confidence error bounds for the resulting estimator are developed. Experiments in OpenAI's CartPole environment demonstrate the algorithm.
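The abstract describes the method only at a high level. Below is a minimal, hypothetical sketch of a generic minimax Q-function learning loop of the kind the abstract refers to, written in PyTorch. The network architectures, the quadratic regularizer on the test function, the alternating-gradient optimization, the fixed (state-independent) target policy, and all identifiers (`q_net`, `w_net`, `minimax_step`, `value_estimate`) are illustrative assumptions, not the authors' estimator or its error-bound construction.

```python
# Hypothetical sketch of minimax Q-function learning for off-policy evaluation.
# All names, sizes, and the regularizer are illustrative assumptions.
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 2  # CartPole-like dimensions (assumption)

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

q_net = mlp(STATE_DIM, N_ACTIONS)   # Q-function class
w_net = mlp(STATE_DIM, N_ACTIONS)   # test-function (adversarial critic) class
opt_q = torch.optim.Adam(q_net.parameters(), lr=1e-3)
opt_w = torch.optim.Adam(w_net.parameters(), lr=1e-3)

def bellman_residual(s, a, r, s_next, done, target_probs):
    """Undiscounted residual r + E_{a'~pi}[Q(s',a')] - Q(s,a), zeroed at termination."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    q_next = (q_net(s_next) * target_probs).sum(dim=1)
    return r + (1.0 - done) * q_next - q_sa

def minimax_step(batch, target_probs, reg=1.0):
    """One alternating update on the minimax objective max_w E[w * delta] - reg/2 E[w^2]."""
    s, a, r, s_next, done = batch  # tensors from truncated behavior trajectories
    # Inner maximization over the test function w (ascent step, Q held fixed).
    delta = bellman_residual(s, a, r, s_next, done, target_probs).detach()
    w_sa = w_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    w_obj = (w_sa * delta).mean() - 0.5 * reg * (w_sa ** 2).mean()
    opt_w.zero_grad()
    (-w_obj).backward()
    opt_w.step()
    # Outer minimization over Q (descent step, w held fixed).
    delta = bellman_residual(s, a, r, s_next, done, target_probs)
    w_sa = w_net(s).gather(1, a.unsqueeze(1)).squeeze(1).detach()
    q_loss = (w_sa * delta).mean() ** 2
    opt_q.zero_grad()
    q_loss.backward()
    opt_q.step()

def value_estimate(initial_states, target_probs):
    """OPE estimate: expected Q under the target policy at the initial-state distribution."""
    with torch.no_grad():
        return (q_net(initial_states) * target_probs).sum(dim=1).mean().item()
```

In a setup matching the abstract, the batches fed to `minimax_step` would be transitions drawn from i.i.d. behavior-policy trajectories truncated at a fixed time step, and the undiscounted return of the target policy is read off `q_net` at the initial-state distribution via `value_estimate`.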
| Original language | English |
|---|---|
| Pages (from-to) | 535-562 |
| Number of pages | 28 |
| Journal | Annals of the Institute of Statistical Mathematics |
| Volume | 77 |
| Issue number | 4 |
| DOIs | |
| State | Published - Aug 2025 |
Keywords
- Indefinite-horizon MDPs
- Minimax Q-function learning
- Occupancy measure
- Off-policy
- Policy evaluation