TY - JOUR
T1 - Mitigating OOD overoptimism via in-sample value function in offline reinforcement learning
AU - Liu, Wenhui
AU - Luo, Kangyang
AU - Wu, Zhijian
AU - Hao, Shanfeng
AU - Huang, Dingjiang
N1 - Publisher Copyright:
© 2026 Elsevier Ltd. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
PY - 2026
Y1 - 2026
N2 - Offline Reinforcement Learning (RL) was proposed to learn from pre-recorded decision data without online interactions. In this setting, evaluating out-of-distribution (OOD) actions accurately is challenging, often leading to over-optimistic estimations. Certain efforts mitigate this issue by entirely avoiding out-of-sample actions during training, i.e., in-sample learning. While these methods safely exploit the dataset behavior, their generalization capacity is inevitably compromised. To bridge the gap, we identify a key insight that the value function derived from a well-designed in-sample learning method can effectively constrain OOD action-values. Building on this, we develop a concise and effective In-sample Expectile Value Regularization (IEVR) method, which simply restricts OOD actions using the in-sample expectile value while preserving standard Bellman updates for in-sample actions. We provide a theoretical analysis of IEVR regarding its convergence and substantiate the effectiveness of in-sample expectile values as a form of regularization through error bounds and experiments. Finally, extensive experimental results demonstrate that IEVR achieves significant performance improvements over existing methods across a diverse array of tasks in the D4RL benchmark.
AB - Offline Reinforcement Learning (RL) was proposed to learn from pre-recorded decision data without online interactions. In this setting, evaluating out-of-distribution (OOD) actions accurately is challenging, often leading to over-optimistic estimations. Certain efforts mitigate this issue by entirely avoiding out-of-sample actions during training, i.e., in-sample learning. While these methods safely exploit the dataset behavior, their generalization capacity is inevitably compromised. To bridge the gap, we identify a key insight that the value function derived from a well-designed in-sample learning method can effectively constrain OOD action-values. Building on this, we develop a concise and effective In-sample Expectile Value Regularization (IEVR) method, which simply restricts OOD actions using the in-sample expectile value while preserving standard Bellman updates for in-sample actions. We provide a theoretical analysis of IEVR regarding its convergence and substantiate the effectiveness of in-sample expectile values as a form of regularization through error bounds and experiments. Finally, extensive experimental results demonstrate that IEVR achieves significant performance improvements over existing methods across a diverse array of tasks in the D4RL benchmark.
KW - Expectile value regularization
KW - In-sample learning
KW - Offline reinforcement learning
KW - Reinforcement learning
UR - https://www.scopus.com/pages/publications/105034293164
U2 - 10.1016/j.neunet.2026.108763
DO - 10.1016/j.neunet.2026.108763
M3 - Article
C2 - 41780282
AN - SCOPUS:105034293164
SN - 0893-6080
VL - 200
JO - Neural Networks
JF - Neural Networks
M1 - 108763
ER -