Offline minimax Q-function learning for undiscounted indefinite-horizon MDPs

Fengying Li, Yuqiang Li, Xianyi Wu, Wei Bai

Research output: Contribution to journal › Article › peer-review

Abstract

This work considers the offline evaluation problem for indefinite-horizon Markov Decision Processes. A minimax Q-function learning algorithm is proposed which, instead of using i.i.d. tuples (s, a, s′, r), evaluates the undiscounted expected return based on i.i.d. trajectories truncated at a given time step. Confidence error bounds are developed. Experiments in OpenAI's CartPole environment demonstrate the algorithm.
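The abstract gives only a high-level description of the method. As a rough, hypothetical sketch of how a minimax Q-function objective for undiscounted off-policy evaluation on truncated trajectories might be set up (not the authors' algorithm), consider the following fragment. The network architectures, the names `q_net`, `f_net`, and `minimax_step`, and the quadratic stabilising penalty on the adversary are illustrative assumptions.

```python
# Hypothetical sketch: minimax Q-function learning for undiscounted
# off-policy evaluation (illustrative only; not the paper's exact algorithm).
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

state_dim, n_actions = 4, 2            # CartPole-like dimensions (assumed)
q_net = mlp(state_dim, n_actions)      # Q-function being learned
f_net = mlp(state_dim, n_actions)      # adversarial test function

q_opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
f_opt = torch.optim.Adam(f_net.parameters(), lr=1e-3)

def minimax_step(batch, target_policy_probs):
    """One min-max update on a batch of (s, a, r, s', done) transitions
    drawn from trajectories truncated at a fixed time step;
    target_policy_probs gives pi(a'|s') for the policy being evaluated."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # Undiscounted Bellman backup: expected Q under the target policy,
    # zeroed out at terminal / truncated states.
    q_next = (q_net(s_next) * target_policy_probs).sum(dim=1)
    residual = r + (1.0 - done) * q_next - q_sa

    # Adversary maximises correlation with the Bellman residual;
    # the quadratic term keeps the test function bounded.
    f_sa = f_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss_f = -(residual.detach() * f_sa).mean() + 0.5 * (f_sa ** 2).mean()
    f_opt.zero_grad(); loss_f.backward(); f_opt.step()

    # Q-player minimises the same bilinear objective against the updated f.
    f_sa = f_net(s).gather(1, a.unsqueeze(1)).squeeze(1).detach()
    loss_q = (residual * f_sa).mean()
    q_opt.zero_grad(); loss_q.backward(); q_opt.step()
    return loss_q.item()
```

In such a setup the batches would be assembled from the offline, truncated trajectories, with `done` marking both genuine episode ends and the truncation point.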

Original language: English
Pages (from-to): 535-562
Number of pages: 28
Journal: Annals of the Institute of Statistical Mathematics
Volume: 77
Issue number: 4
State: Published - Aug 2025

Keywords

  • Indefinite-horizon MDPs
  • Minimax Q-function learning
  • Occupancy measure
  • Off-policy
  • Policy evaluation
