
Mitigating OOD overoptimism via in-sample value function in offline reinforcement learning

Wenhui Liu, Kangyang Luo, Zhijian Wu, Shanfeng Hao, Dingjiang Huang*

*Corresponding author for this work

  • East China Normal University
  • Fudan University
  • Tsinghua University

Research output: Contribution to journal, Article, peer-review

Abstract

Offline Reinforcement Learning (RL) was proposed to learn from pre-recorded decision data without online interaction. In this setting, accurately evaluating out-of-distribution (OOD) actions is challenging and often leads to overoptimistic value estimates. Some methods mitigate this issue by avoiding out-of-sample actions entirely during training, i.e., in-sample learning. While these methods safely exploit the behavior recorded in the dataset, their generalization capacity is inevitably compromised. To bridge this gap, we observe that the value function derived from a well-designed in-sample learning method can effectively constrain OOD action-values. Building on this insight, we develop a concise and effective In-sample Expectile Value Regularization (IEVR) method, which restricts OOD actions using the in-sample expectile value while preserving standard Bellman updates for in-sample actions. We provide a theoretical analysis of IEVR's convergence and substantiate the effectiveness of in-sample expectile values as a form of regularization through error bounds and experiments. Finally, extensive experimental results demonstrate that IEVR achieves significant performance improvements over existing methods across a diverse array of tasks in the D4RL benchmark.
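
As a rough illustration of the "in-sample expectile value" mentioned in the abstract, the Python sketch below shows the standard asymmetric squared (expectile) loss and a fixed-point computation of a sample expectile. The function names, the sample values, and in particular the hinge-style `ood_penalty` are assumptions made for illustration only. The abstract does not state IEVR's actual loss, so this should not be read as the authors' implementation.

```python
def expectile_loss(diff, tau):
    """Asymmetric squared loss: |tau - 1(diff < 0)| * diff**2."""
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


def expectile(samples, tau, iters=100):
    """tau-expectile of `samples`, found by iterating a weighted mean.

    Samples above the current estimate get weight tau, the rest get
    weight 1 - tau; tau = 0.5 recovers the ordinary mean.
    """
    m = sum(samples) / len(samples)
    for _ in range(iters):
        weights = [tau if x > m else 1.0 - tau for x in samples]
        m = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return m


def ood_penalty(q_ood, v_in_sample):
    """HYPOTHETICAL regularizer (an assumption, not taken from the paper):
    penalize an OOD action-value only when it exceeds the in-sample value."""
    excess = max(0.0, q_ood - v_in_sample)
    return excess * excess


# Made-up action-values observed in the dataset for a single state.
in_sample_q = [0.0, 1.0, 2.0, 10.0]
v = expectile(in_sample_q, tau=0.9)   # upper expectile of in-sample values
penalty = ood_penalty(q_ood=12.0, v_in_sample=v)
```

With `tau` above 0.5 the expectile leans toward the larger in-sample values, which is why it can act as a bound on action-values without ever querying actions outside the dataset.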

Original language: English
Article number: 108763
Journal: Neural Networks
Volume: 200
DOIs
State: Accepted/In press - 2026

Keywords

  • Expectile value regularization
  • In-sample learning
  • Offline reinforcement learning
  • Reinforcement learning
