Mitigating OOD overoptimism via in-sample value function in offline reinforcement learning

  • Wenhui Liu
  • Kangyang Luo
  • Zhijian Wu
  • Shanfeng Hao
  • Dingjiang Huang*
  • *Corresponding author for this work
  • East China Normal University
  • Fudan University
  • Tsinghua University

Research output: Contribution to journal › Article › peer-review

Abstract

Offline Reinforcement Learning (RL) was proposed to learn from pre-recorded decision data without online interactions. In this setting, evaluating out-of-distribution (OOD) actions accurately is challenging, often leading to over-optimistic estimations. Certain efforts mitigate this issue by entirely avoiding out-of-sample actions during training, i.e., in-sample learning. While these methods safely exploit the dataset behavior, their generalization capacity is inevitably compromised. To bridge the gap, we identify a key insight that the value function derived from a well-designed in-sample learning method can effectively constrain OOD action-values. Building on this, we develop a concise and effective In-sample Expectile Value Regularization (IEVR) method, which simply restricts OOD actions using the in-sample expectile value while preserving standard Bellman updates for in-sample actions. We provide a theoretical analysis of IEVR regarding its convergence and substantiate the effectiveness of in-sample expectile values as a form of regularization through error bounds and experiments. Finally, extensive experimental results demonstrate that IEVR achieves significant performance improvements over existing methods across a diverse array of tasks in the D4RL benchmark.
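
The expectile value mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the standard asymmetric squared (expectile regression) loss, |tau - 1(u < 0)| * u^2, and fits a scalar value to a handful of in-sample action-values. The function names, the default `tau`, and in particular `ood_penalty` (a simple hinge-style term that penalizes an OOD action-value only when it exceeds the in-sample value) are hypothetical choices made for illustration; the abstract does not specify the exact IEVR objective, so this should not be read as the authors' implementation.

```python
def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss: |tau - 1(diff < 0)| * diff**2."""
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


def expectile(samples, tau=0.7, steps=2000, lr=0.05):
    """Estimate the tau-expectile of `samples` by plain gradient descent.

    With tau = 0.5 this reduces to the mean; tau > 0.5 leans toward
    the larger in-sample values.
    """
    v = sum(samples) / len(samples)  # start from the mean
    for _ in range(steps):
        grad = 0.0
        for q in samples:
            diff = q - v
            weight = tau if diff >= 0 else 1.0 - tau
            grad += -2.0 * weight * diff  # d/dv of weight * (q - v)**2
        v -= lr * grad / len(samples)
    return v


def ood_penalty(q_ood, v_in_sample):
    """Hypothetical regularizer (an assumption, not the paper's loss):
    penalize an OOD action-value only where it exceeds the in-sample value."""
    excess = max(0.0, q_ood - v_in_sample)
    return excess * excess
```

As a usage example under the same assumptions, `expectile([0.0, 1.0], tau=0.9)` settles near 0.9, and `ood_penalty(1.5, 1.0)` is positive while `ood_penalty(0.5, 1.0)` is zero, so only over-optimistic OOD estimates are pushed down.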

Original language: English
Article number: 108763
Journal: Neural Networks
Volume: 200
DOI
Publication status: Accepted/In press - 2026